analytic tools for higher-order data

47
Analytic tools for higher-order data 1 3 2 P 1 2 3 1, 3 2 2, 3 1, 2 3 Phonecall-Eu CollegeMsg Email-Eu SMS-A FBWall 0.0 0.5 Fraction of counts Blocking motifs 0.0 0.5 Fraction of counts Non-blocking motifs 1, 3 2 1 2, 3 1, 2 3 Austin Benson Stanford University Univ. of Chicago, Jan. 12, 2017 S ¯ S 9 10 3 5 6 7 8 1 2 4 Dec.00 June.01 Dec.01 0 50 100 150 CEO Skilling appointed CEO Skilling Started Calif. bankruptcy Calif. legislature India (general) Number of emails

Upload: austin-benson

Post on 28-Jan-2018

104 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Analytic tools for higher-order data

Analytic tools for higher-order data

1

32

P1

2

3

1,32

12,3

1,23

Phonecall-Eu

CollegeMsg

Email-Eu

SMS-A

FBWall

0.00.5Fraction of counts

Blocking motifs

0.0 0.5Fraction of counts

Non-blocking motifs1,3 2

1 2,3

1,2 3

Austin BensonStanford University

Univ. of Chicago, Jan. 12, 2017

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

S

S̄9

10

3

5

6

7

8

1

2

4

Dec.00 June.01 Dec.010

50

100

150 CEO Skillingappointed

CEO SkillingStarted

Calif. bankruptcyCalif. legislatureIndia (general)

Nu

mb

er

of

em

ails

Page 2: Analytic tools for higher-order data

1. New analytical techniques for data§ Higher-order clustering in networks

Science 16, SDM 15§ Higher-order structures in temporal graphs

WSDM 17§ Models for inherently higher-order data

SIAM Review 17, NIPS 16, more in prep.§ Novel null models for graph data

KDD 14

2

3. Modeling user behavior on the Web§ Discrete choice models

WWW 16, more in prep.§ Repeat behavior

WWW 16

2. Faster algorithms for larger problems§ Distributed matrix algs. for tall-and-skinny data

NIPS 14, BigData 13§ Practical fast matrix multiplication

SIAM J. Matrix Analysis & App. 16, PPoPP 15§ Structured matrix computations

SIAM J. Scientific Computing 14

My researchcomputational tools for bigger and better data analysis

… all for downstream applications§ Ecology, biology and neuroscience

Science 16, NIPS 14§ Transportation systems

SIAM Review 17, Science 16, NIPS 16§ Social networks

WSDM 17, WWW 16, KDD 14

Page 3: Analytic tools for higher-order data

3

Complex systems are modeled by networks my goal analyze the model to gain insights into the system

Brainsnodes are neuronsedges are synapses

Social interactionsnodes are peopleedges are friendships

Electrical gridnodes are power plantsedges are transmission lines

Tim Meko, Washington Post

Currencynodes are accountsedges are exchanges

Page 4: Analytic tools for higher-order data

4

Some common high-level analyses

1. Clustering/partitioning/community detection. Finding groups of nodes that form functional modules

2. Ranking. Finding what things are important (e.g., PageRank and its variants)

3. Spreading and traversing. How viruses, social influences, etc. spread, random walks, routing

4. Evolution. what new links will form?

Graphs are defined by nodes and edges, so we usually think about our analysis, models, and algorithms in terms of nodes and edges.

https://www1.udel.edu

Complex systems are modeled by networks my goal analyze the model to gain insights into the system

Page 5: Analytic tools for higher-order data

5

A fundamental network analysis is finding clusters, modules, communities, groupsReal-world networks have modular organization.We want to automatically find the modules in the system.

Co-author network

Main idea. Find groups of nodes with high internal edge density and low external edge density.

Brain network, de Reus et al., RSTB, 2014.

Page 6: Analytic tools for higher-order data

Background on community detection and conductance

Conductance is one of the most important quality scores [Schaeffer07]used in Markov chain theory, spectral clustering, bioinformatics, vision, etc.

The conductance of a set of vertices is the ratio of edges leaving to total edges

small conductance ó good cluster

�(S) = cut(S)min(vol(S),vol(S̄))

cut(S) = 7

|S| = 15

vol(S) = 85

cut(S̄) = 7

|S̄| = 20

vol(S̄) = 151

φ(S) = 7/85

(edges leaving set)(total edges in the set)

6

Page 7: Analytic tools for higher-order data

7

Thinking beyond nodes and edgesNodes & edges are a bit simplistic.

There is abundant evidence that small subgraph patterns, or network motifs, drive complex systems.

“Contact neighborhoods” drive social influence.

Ugander et al., PNAS, 2012Mangan et al., J. of Mol. Bio., 2003Alon, Nature Reviews Genetics, 2007

Signed “feed-forward loops” are delay units in genetic transcription.A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A B

C

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

Also seeMilo et al., 2002Sporns & Kötter, 2004Milenković & Pržulj, 2008Yaveroğlu et al., 2014

Page 8: Analytic tools for higher-order data

Nodes and edges are not the fundamental units of complex networks.

So why should we look for structure interms of them?

We shouldn’t.

And we gain new insights into data with higher-order analysis techniques.

8

Page 9: Analytic tools for higher-order data

Network

MotifDifferent motifs give different clusters

9We can formalize this by generalizing conductance for motifs!

Higher-order clusteringfinding groups of nodes based on motifsBenson, Gleich, & Leskovec, Science, 2016

Page 10: Analytic tools for higher-order data

In practice, motifs organize real-world networks amazing well and recover aquatic layers in food webs

http://marinebio.org/oceans/marine-zones/

10

A C

B D

Figure 2: Higher-order organization of the Florida Bay food web. A: Motif conductance fordifferent motifs on the Florida Bay ecosystem food web (19). A priori it is not clear whether thenetwork is organized based on a given motif. For example, motifs M5 (green) and M8 (blue)do not reveal any higher-order organization (motif conductance has high values). However,the downward spikes of the red curve show that M6 reveals rich higher-order modular struc-ture (27). The shape of the curve suggests that food webs might be organized based on themotif M6. Ecologically, motif M6 corresponds to two species mutually feeding on each otherand also preying on a common third species. B: Clustering of the food web based on motifM6. (For illustration, edges not participating in at least one instance of the motif are omitted.)The clustering reveals three known aquatic layers: pelagic fishes (yellow), benthic fishes andcrabs (red), and sea-floor macroinvertebrates (blue) as well as a cluster of microfauna and detri-tus (green). Our framework identifies these modules with higher accuracy (84%) than existingmethods (65–69%) (11). The clustering reveals that the energy flow pattern of motif M6 occursfrequently within these modules and infrequently across these modules. For example, it is un-common for two competitors from one aquatic layer to hunt each other and then have commonprey in a different layer. C: A higher-order cluster (yellow nodes in (B)) demonstrates how thepelagic layer is organized based on the motif M6. The needlefish and other pelagic fishes eateach other while several other fishes are prey for these two species. D: Organization of micro-fauna cluster (green nodes in (B)) based on the motif M6. Here, several microfauna decomposeinto Particulate Organic Carbon in the water (water POC) but also consume water POC. Freebacteria serves as an energy source for both the microfauna and water POC.

8

Micronutrient sources

Pelagic fishes and benthic prey

Benthic macroinvertebrates

Benthic Fishes

We don’t know how to find this structure with edge partitioning.

Page 11: Analytic tools for higher-order data

Motif-based conductance

Need new notions of cut and volume

S S

SS̄ S̄

�(S) = cut(S)min(vol(S),vol(S̄))

�M (S) = cutM (S)min(volM (S),volM (S̄))

volM (S) = #(motif end points in S)vol(S) = #(edge end points in S)

Benson, Gleich, & Leskovec, Science, 2016 11

cut(S) = #(edges cut) cutM (S) = #(motifs cut)

Page 12: Analytic tools for higher-order data

Motif-based conductance

Motif

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

S

S̄�M (S) =

motifs cutmotif volume

=18

12Benson, Gleich, & Leskovec, Science, 2016

9

10

3

5

6

7

8

1

2

4

9

10

3

5

6

7

8

1

2

4

Page 13: Analytic tools for higher-order data

Higher-order clustering

Given a motif M and a graph G, we want tofind a set of nodes S that minimizes motif conductance

Unfortunately, this is NP-hard…

Solution: motif-based spectral clustering1. Input graph G and motif M2. Form new weighted graph W3. Apply spectral clustering on W à output clusters

Theorem resulting clusters will obtain near optimal motif conductance(motif Cheeger inequality)

13Benson, Gleich, & Leskovec, Science, 2016

Page 14: Analytic tools for higher-order data

Motif-based spectral clustering

W (M)

ij = counts co-occurrences of motif pattern between i , j

W (M)

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

1

2

34

5

6

7

8

9

10

1

11

1

1

13

11

1

1

1

21

1

14Benson, Gleich, & Leskovec, Science, 2016

9

10

3

5

6

7

8

1

2

4

Page 15: Analytic tools for higher-order data

Motif-based spectral clustering

W (M)

ij = counts co-occurrences of motif pattern between i , j

W (M)

Key insightSpectral clustering on weighted graph W(M) finds clusters of low motif conductance

�M (S) =

motifs cut

motif volume

=

1

10

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

1

2

34

5

6

7

8

9

10

1

11

1

1

13

11

1

1

1

21

1

15Benson, Gleich, & Leskovec, Science, 2016

Page 16: Analytic tools for higher-order data

Motif-based spectral clustering step 1

Step 1. Form weighted graph W(M)

1

2

34

5

6

7

8

9

10

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

16

Page 17: Analytic tools for higher-order data

Motif-based spectral clustering step 2

Step 2. Compute Fiedler vector f(M)

associated with λ2 of the normalized Laplacian of W(M).Roughly O(# edges) time.

D = diag(W (M)e)

L(M) = D�1/2(D � W (M))D�1/2

L(M)z = �2z

f(M) = D�1/2z

D = diag(W (M)e)

L(M) = D�1/2(D � W (M))D�1/2

L(M)z = �2z

f(M) = D�1/2z

Diagonal degree matrix

17

(using ideas from [Mihail89,Chung92])

f(M)

1

2

3 4

5

6

7

8

9

10

-0.25

-0.2

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

Page 18: Analytic tools for higher-order data

Motif-based spectral clustering step 3

Step 3.Sort nodes by values in f(M): f1, f2, …fnLet Sr = {f1, …, fr} and compute the motif conductance of each Sr“motif conductance sweep cut” 18

(using tools from [Mihail89,Chung92])

1

2

3 4

5

6

7

8

9

10

-0.25

-0.2

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

Page 19: Analytic tools for higher-order data

Motif Cheeger inequality

Theorem If the motif has three nodes, then the sweep procedure on the weighted graph finds a set S of nodes for which

Theorem for 4+ nodes, we use a slightly altered definition of conductance.

�M (S) 4q

�⇤M

optimal value over all sets of nodes

19Benson, Gleich, & Leskovec, Science, 2016

M(G) = {instances of M in G}

Key Proof Step

cut

M

(S, G) =

X

{i ,j ,k}2M(G)

Indicator[x

i

, x

j

, x

k

not the same]

=

1

4

(x

2

i

+ x

2

j

+ x

2

k

� x

i

x

j

� x

j

x

k

� x

i

x

k

)

= quadratic in x

Page 20: Analytic tools for higher-order data

Applications

20

1. We do not know the motif of interestfood webs and new applications

2. We know the motif of interesttranscription regulation networks, connectome, social networks

3. We seek richer information from our datatransportation networks and new applications

Page 21: Analytic tools for higher-order data

Application 1 Food web structure

Nodes are speciesEdges are energy flows (who eats whom)i → j means j eats iDo not know which motif, so we try several.By Cheeger inequality, motif conductance better than edge conductance

21Benson, Gleich, & Leskovec, Science, 2016

Page 22: Analytic tools for higher-order data

A C

B D

Figure 2: Higher-order organization of the Florida Bay food web. A: Motif conductance fordifferent motifs on the Florida Bay ecosystem food web (19). A priori it is not clear whether thenetwork is organized based on a given motif. For example, motifs M5 (green) and M8 (blue)do not reveal any higher-order organization (motif conductance has high values). However,the downward spikes of the red curve show that M6 reveals rich higher-order modular struc-ture (27). The shape of the curve suggests that food webs might be organized based on themotif M6. Ecologically, motif M6 corresponds to two species mutually feeding on each otherand also preying on a common third species. B: Clustering of the food web based on motifM6. (For illustration, edges not participating in at least one instance of the motif are omitted.)The clustering reveals three known aquatic layers: pelagic fishes (yellow), benthic fishes andcrabs (red), and sea-floor macroinvertebrates (blue) as well as a cluster of microfauna and detri-tus (green). Our framework identifies these modules with higher accuracy (84%) than existingmethods (65–69%) (11). The clustering reveals that the energy flow pattern of motif M6 occursfrequently within these modules and infrequently across these modules. For example, it is un-common for two competitors from one aquatic layer to hunt each other and then have commonprey in a different layer. C: A higher-order cluster (yellow nodes in (B)) demonstrates how thepelagic layer is organized based on the motif M6. The needlefish and other pelagic fishes eateach other while several other fishes are prey for these two species. D: Organization of micro-fauna cluster (green nodes in (B)) based on the motif M6. Here, several microfauna decomposeinto Particulate Organic Carbon in the water (water POC) but also consume water POC. Freebacteria serves as an energy source for both the microfauna and water POC.

8

Micronutrient sources

Pelagic fishes and benthicprey

Benthic macro-invertebrates

Benthic Fishes

Motif M6 reveals aquatic layers

A B

C

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (11).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (13) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (11).

7

61% accuracy vs.48% with edge-based methods

22Benson, Gleich, & Leskovec, Science, 2016

Application 1 Food web structure

Page 23: Analytic tools for higher-order data

Application 2 Yeast regulatory functionsNodes are groups of genesi → j if i is regulated by a transcriptional factor encoded by j

Partitioning based on signed motifs re-identifies known functions.97% accuracy vs.68–82% for other methods

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

A C

B

D

Figure S8: Higher-order organization of the S. cerevisiae transcriptional regulation network.A: The four higher-order structures used by our higher-order clustering method, which canmodel signed motifs. These are coherent feedfoward loop motifs, which act as sign-sensitivedelay elements in transcriptional regulation networks (46). The edge signs refer to activation(positive) or repression (negative). B: Six higher-order clusters revealed by the motifs in (A).Clusters show functional modules consisting of several motifs (coherent feedforward loops),which were previously studied individually (45). The higher-order clustering framework identi-fies the functional modules with higher accuracy (97%) than existing methods (68–82%). C–D:

Two higher-order clusters from (B). In these clusters, all edges have positive sign. The func-tionality of the motifs in the modules correspond to drug resistance (C) or cell cycle and matingtype match (D). The clustering suggests that coherent feedforward loops function together as asingle processing unit rather than as independent elements.

S48

23Benson, Gleich, & Leskovec, Science, 2016

Page 24: Analytic tools for higher-order data

Application 3 Transportation

North American air transport network

Nodes are cites.Edges reflect reachability, and are directed and unweighted.Data from [Frey&Dueck07].

24

Page 25: Analytic tools for higher-order data

Accepted pending minor revisions

Do not distribute.

D

C

B

A

Figure 4: Higher-order spectral analysis of a network of airports in Canada and the UnitedStates (22). A: The three higher-order structures used in our analysis. Each motif is “an-chored” by the blue nodes i and j, which means our framework only seeks to cluster togetherthe blue nodes. Specifically, the motif adjacency matrix adds weight to the (i, j) edge basedon the number of third intermediary nodes (green squares). The first two motifs correspondto highly-connected cities and the motif on the right connects non-hubs to non-hubs. B: Thetop 50 most populous cities in the United States which correspond to nodes in the network.The edge thickness is proportional to the weight in the motif adjacency matrix WM . The thick,dark lines indicate that large weights correspond to popular mainline routes. C: Embedding ofnodes provided by their corresponding components of the first two non-trivial eigenvectors ofthe normalized Laplacian for WM . The marked cities are eight large U.S. hubs (green), threeWest coast non-hubs (red), and three East coast non-hubs (purple). The primary spectral coor-dinate (left to right) reveals how much of a hub the city is, and the second spectral coordinate(top to bottom) captures West-East geography (11). D: Embedding of nodes provided by theircorresponding components in the first two non-trivial eigenvectors of the standard, edge-based(non-higher-order) normalized Laplacian. This method does not capture the hub and geographyfound by the higher-order method. For example, Atlanta, the largest hub, is in the center of theembedding, next to Salina, a non-hub.

10

Counts length-two paths

W (M)ij = #{bi-directional length-2 paths from i to j}

Weighted adjacency matrix already reveals hub-like structure

25Benson, Gleich, & Leskovec, Science, 2016

Application 3 Transportation

Page 26: Analytic tools for higher-order data

Top 10U.S. hubs

East coast non-hubs

West coast non-hubs

Primary spectral coordinate

Atlanta, the top hub, is next to Salina, a non-hub.

MOTIF SPECTRAL EMBEDDING EDGE SPECTRAL EMBEDDING26Benson, Gleich, & Leskovec, Science, 2016

Application 3 Transportation

Page 27: Analytic tools for higher-order data

Other higher-order analyses for data

27

1. Adapting tools for richer network models.Defining motifs in temporal graphs and then finding them.

2. New methods for inherently higher-order data.Understanding the structure of tensor data.

Page 28: Analytic tools for higher-order data

Technical infrastructurepackets over the internet, messages over supercomputer

Private communicatione-mail, phone calls, text messages, instant messages

Public communicationFacebook wall post, Q&A forums, Wikipedia edits

Paymentscredit card transactions, Bitcoin, Venmo

Biologycell signaling

Timestamped connections between things are everywhere

28

Page 29: Analytic tools for higher-order data

What are temporal networks, formally?

A temporal networks is a sequence of (directed) edges with timestamps

(u, v, 05:31:14)(u, x, 05:32:33)(x, y, 06:07:53)(u, v, 06:45:12) many timestamps between the same pair of nodes!(y, z, 06:45:12)

Richer than other dynamic graph models1. The growth of networks (citation networks, Facebook friends)2. Sequences of coarse-grain snapshots

29

What is a motif in this network representation?

Page 30: Analytic tools for higher-order data

a

bc

d

25s17s, 28s, 30s, 35s

15s, 32s31s

14s

1 23

a

b c

25s 28s32s

a

b c

25s 30s32s

c

d a

31s 32s35s

a

d c

14s 17s15s

a

d c

14s 28s32s

δ = 10s

Temporal network

Motif instance1. Isomorphic subgraph2. Respects ordering3. All edges happen

within δ time units Wrong order(a, c) after (c, a)

Too long32 – 14 = 18 > 10

Formal definition1. Directed multigraph2. Edge ordering3. Time window δ

Temporal network motifs

30Paranjape, Benson, & Leskovec, WSDM, 2017.

Page 31: Analytic tools for higher-order data

The temporal motif problem

a

bc

d

25s17s, 28s, 30s, 35s

15s, 32s31s

14s

1 23

δ = 10s

Given a network, a motif with time duration δ,how many instances of the motif are in the network?

3 ( )a

b c

25s 28s32s

a

b c

25s 30s32s

c

d a

31s 32s35s

How can we do this efficiently?

31

Page 32: Analytic tools for higher-order data

Summary of new algorithmsIn a graph with m temporal edges and T static triangles1. General algorithm for any motif.

faster than naive methods, but may not be optimal

2. 2 nodes, k temporal edges. O(k2 m)(uses general algorithm)linear time in size of temporal graph (for const. k)

3. 3 nodes, 3 temporal edges, stars. O(m)linear time in size of temporal graph

4. 3 nodes, 3 temporal edges, triangles. O(T1/2m)faster than Ω(Tm) with naive method 1

2

3

1, 2 3

1, 32

32Paranjape, Benson, & Leskovec, WSDM, 2017.

Page 33: Analytic tools for higher-order data

New algorithms let us analyze large datasets

33Paranjape, Benson, & Leskovec, WSDM, 2017.

§ Processing temporal graph with 2 billion temporal edges takes just a few hours (single threaded)

§ Triangle counting algorithm provides significant speedups over naive methods.

# temp. edges speedupWikipedia edits 10M 1.92xStack Overflow comments 63M 1.29xBitcoin transactions 123M 56.5xText messages 800M 2.28xPhone calls 2.04B 1.42x

Page 34: Analytic tools for higher-order data

Overview of some empirical results

1,32

12,3

1,23

Phonecall-Eu

CollegeMsg

Email-Eu

SMS-A

FBWall

0.00.5Fraction of counts

Blocking motifs

0.0 0.5Fraction of counts

Non-blocking motifs1,3 2

1 2,3

1,2 3

Paranjape, Benson, & Leskovec, WSDM, 2017.34

Exposes one-to-one vs. one-to-many behavior in communication systems like text messaging and e-mail.

101 102 103

Time to complete motif (seconds)

0

200

400

600

800

1000

1200

Num

ber

of

inst

ance

s

Motif counts over time in CollegeMsg

1 2,3

1,3 2

1,2 3

1,2,3

Show changes in communication patterns at different time scales.

Page 35: Analytic tools for higher-order data

Other higher-order analyses for data

35

1. Adapting tools for richer network models.Defining motifs in temporal graphs and then finding them.

2. New methods for inherently higher-order data.Understanding the structure of tensor* data.

*technically, these are hypermatrices

Page 36: Analytic tools for higher-order data

Higher-order data

36

1

32

A

AWe use matrices to describe pairwise data.Ecology Aij = 1 if j eats i, 0 otherwiseCorrelations Aij = correlation(i, j)

Often our data is inherently higher-order.We can represent this data with tensors.Natural language Aijk is frequency of 3-gram of words i, j, kMulti-modal Aijk is # of flights from city i and city j on airline k

Page 37: Analytic tools for higher-order data

Used in machine learning for optimal recovery of latent variable models (eigenvectors of higher-order moment tensors) [Anandkumar+14]

1

32

A

tensor A : n ⇥ n ⇥ n

tensor eigenvector

X

j ,k

A(i , j , k )xj

x

k

= �x

i

,

Ax

2 = �x

Higher-order Markov chains are often described using tensors

Also give eigenvector equations on a product state-space. O(n2) memory L

P[Xt+1 = i | Xt = j , Xt�1 = k ] = P(i , j , k )

Tensor eigenvectorsand higher-order Markov chains

[Lim05, Qi05]

37

Page 38: Analytic tools for higher-order data

Data type Stochasticprocess

Property Data representation Computation

First-order Random walk on a graph

Relative time spent at each node (stationary dist.)

Matrix representing first-order Markov chain

Pij = Prob(j --> i)

Matrix Eigenvector

Higher-order Spacey random walk on a graph

Relative time spent at each node (stationary distribution)

Tensor representing higher-order Markov chain

Pijk = Prob((k, j) --> (j, i))

Tensor eigenvector

[Li&Ng13]

Benson, Gleich, & Lim. SIAM Review, to appear 2017. 38

The context for spacey random walks

1

32

P

P

Px2 = x

Px = x

Page 39: Analytic tools for higher-order data

The spacey random walk stochastic processConsider a higher-order Markov chain.

If we were perfect, we’d figure out the stationary distribution of that. But we are spacey!§ On arriving at state j, we promptly

“space out” and forget we came from k. § But we still believe we are “higher-order”§ So we invent a state k by drawing

a random state from our history.

P[Xt+i | history] = P[Xt+i | Xt = j , Xt�1

= k ]

Benson, Gleich, & Lim. SIAM Review, to appear 2017. 39

Page 40: Analytic tools for higher-order data

1012497114

Xt-1

Xt

Yt

Theorem limiting distributions of this process are tensor eigenvectors

The spacey random walkP[Xt+1 = i | Xt = j , Xt�1 = k ] = P(i , j , k ) Higher-order

Markov chain

P[Xt+1 = i | Xt = j , Yt = g] = P(i , j , g)

Benson, Gleich, & Lim. SIAM Review, to appear 2017. 40

Page 41: Analytic tools for higher-order data

Spacey random walks:theory and applications

Benson, Gleich, & Lim. SIAM Review, to appear 2017.

§ Stationary distributions can be computed in O(n) memory unlike higher-order random walks.

§ Spacey RWs are asymptotically first-order Markovian§ Generalization of Pólya urn processes§ All 2x2x2x…x2 problems have a stationary

distribution (not necessarily unique, though)§ New method to compute tensor eigenvectors based

on ODE theory (better than power method)§ Spacey RWs are a good model for NYC taxi cab

movement.

41

Page 42: Analytic tools for higher-order data

Spacey random walks for co-clustering

Wu, Benson, & Gleich, NIPS, 2016. 42

Dec.00 June.01 Dec.010

50

100

150 CEO Skillingappointed

CEO SkillingStarted

Calif. bankruptcyCalif. legislatureIndia (general)

Nu

mb

er

of

em

ails

Enron e-mail dataAijkt = # e-mails b/w i and j on week k with topic t

[i1, …, in1] x [j1,…, jn2] x [k1, …, kn3]The data nonnegative brick of numbers (tensor)Co-clustering output subsets of rows, columns, and slicesMain idea use relationship b/w conductance and random walks to generalize conductance for this data

Fun 3-gram cluster{bag, plastic, garbage, grocery, trash, freezer} Fun 4-gram cluster{german, chancellor, angela, merkel,gerhard, schroeder, helmut, kohl}

Natural language corpusAijk = #(consecutive co-occurrences of words i, j, k)Aijkl = #(consecutive co-occurrences of words i, j, k, l)

Page 43: Analytic tools for higher-order data

Analytic tools for higher-order data

§ Finding clusters based on motifsHigher-order organization of complex networks, Science, 2016.Code + data http://snap.stanford.edu/higher-order

§ Higher-order structures in temporal networksMotifs in Temporal Networks, WSDM, 2017.Code + data http://snap.stanford.edu/temporal-motifs

§ New stochastic processes for higher-order dataThe spacey random walk, SIAM Review, to appear.General tensor spectral co-clustering, NIPS, 2016.Code + data https://github.com/arbenson/spacey-random-walks

https://github.com/wutao27/GtensorSC

1

32

P

1

2

3

43

Figure 1: Higher-order network structures and the higher-order network clusteringframework. A: Higher-order structures are captured by network motifs. For example, all13 connected three-node directed motifs are shown here. B: Clustering of a network based onmotif M7. For a given motif M , our framework aims to find a set of nodes S that minimizesmotif conductance, �M(S), which we define as the ratio of the number of motifs cut (filledtriangles cut) to the minimum number of nodes in instances of the motif in either S or S̄ (13).In this case, there is one motif cut. C: The higher-order network clustering framework. Given agraph and a motif of interest (in this case, M7), the framework forms a motif adjacency matrix(WM ) by counting the number of times two nodes co-occur in an instance of the motif. Aneigenvector of a Laplacian transformation of the motif adjacency matrix is then computed. Theordering � of the nodes provided by the components of the eigenvector (15) produces nested setsSr = {�1, . . . , �r} of increasing size r. We prove that the set Sr with the smallest motif-basedconductance, �M(Sr), is a near-optimal higher-order cluster (13).

7

S

S̄9

10

3

5

6

7

8

1

2

4

Page 44: Analytic tools for higher-order data

Ongoing work localized motif-based clusters

44

§ Find clusters containing motifs near a given seed node…and do it fast by generalizing diffusion techniques.

§ Triangle-based clusters are more robust to error than edge-based clusters in in common null models.

Edge-baseddiffusion

seed

error

error

error

Triangle-baseddiffusion

Avg. precision = 0.98(100 samples)

same seed, better localization

Cluster output

Avg. precision = 0.20(100 samples)

Cluster output

Page 45: Analytic tools for higher-order data

Ongoing work generalizing clustering coefficients

45

?

Fundamental network measurementClustering coefficient

= fraction of length-2 paths that are closed

Main idea of the workLook for higher-order closure patterns.

?

Random edges

u

u

Near-random 4-clique closure

Higher 4-clique closure than 3-clique closure

Page 46: Analytic tools for higher-order data

There are many directions in the early stages

46

Algorithms§ Sampling for temporal motifs.§ Data-dependent algorithms.§ Persistent homology for natural time scales in temporal data.

Models§ Good null model for temporal graphs to test statistical significance

of temporal motifs.

Theory§ Lower bounds for temporal motif counting and enumeration.

Applications§ Higher-order temporal analysis in economic trade data.

Page 47: Analytic tools for higher-order data

Thanks!Austin BensonStanford [email protected]://stanford.edu/~arbenson

@austinbenson

Collaborators: Ashwin Paranjape, Hao Yin, Jure Leskovec (Stanford),Tao Wu, David Gleich (Purdue),Lek-Heng Lim (Chicago)

47