Large network analysis: visualization and clustering





Nathalie Villa-Vialaneix

http://www.nathalievilla.org

[email protected]

Groupe de travail “Biopuces”, INRA Toulouse, 16/12/2011

Joint work with Fabrice Rossi (SAMM, Université Paris 1)

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 1 / 27

Plan

1 A brief introduction to network/graph

2 Simple graph mining
  • Numerical characteristics calculation
  • Visualization
  • Clustering

3 Hierarchical graph clustering and visualization
  • Hierarchical graph clustering
  • Examples

A brief introduction to network/graph

What is a network/graph? (French: réseau/graphe)

A network, or graph, is a mathematical object used to model relational data between entities.

The entities are called the nodes or the vertices, also spelled vertexes (French: nœuds/sommets).

A relation between two entities is modeled by an edge (French: arête).

Examples

Social network: nodes: persons; edges: two persons are connected.

[Figure: Natty’s Facebook™ network (https://www.facebook.com), powered by TouchGraph™ (http://www.touchgraph.com)]

Biological network (one example among others: co-expression network): nodes: genes; edges: the partial correlation between a pair of genes, $\mathrm{Cor}(X_i, X_j \mid (X_k)_{k \neq i,j})$, is large (significant...)

[Liaubet et al., 2010, Villa-Vialaneix et al., 2011]
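To see why the edge criterion uses a partial correlation rather than a raw correlation, here is a minimal sketch for the three-variable case, where conditioning on "all the other genes" reduces to the classical first-order formula. The function name and the numerical values are hypothetical and only serve the illustration; real co-expression networks involve many genes and an estimation step that is not shown here.

```python
from math import sqrt


def partial_correlation(r_ij, r_ik, r_jk):
    """First-order partial correlation Cor(X_i, X_j | X_k) from pairwise correlations."""
    return (r_ij - r_ik * r_jk) / sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


# Hypothetical values: genes i and j are both driven by gene k,
# so their raw correlation (0.49 = 0.7 * 0.7) is entirely explained by k.
explained = partial_correlation(0.49, 0.7, 0.7)

# Hypothetical values: same links to k, but a stronger raw correlation,
# part of which cannot be attributed to k.
direct = partial_correlation(0.80, 0.7, 0.7)

print(round(explained, 6), round(direct, 6))
```

In the first case the partial correlation is (numerically) zero, so no edge would be drawn between the two genes even though their raw correlation is not small; in the second case a substantial direct association remains.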

Any other relational data

Modeling a large corpus of medieval documents

Notarial acts (mostly baux à fief or, more precisely, land charters) established in a seigneurie named “Castelnau Montratier”, written between 1250 and 1500, involving tenants and lords (http://graphcomp.univ-tlse2.fr).

[Figure legend: Individual, Transaction]


• nodes: transactions and individuals (3,918 nodes);

• edges: an individual is directly involved in a transaction (6,455 edges).


More complex relational models

Nodes may be labeled by a factor, or by numerical information [Laurent and Villa-Vialaneix, 2011].

Edges may also be labeled (type of the relation) or weighted (strength of the relation).

Simple graph mining


Settings

In the following, we consider a graph $G = (V, E, W)$ with:

• $V$: the set of vertices $\{x_1, \ldots, x_n\}$;

• $E$: the set of edges;

• $W$: the weights on the edges, such that $W_{ij} \geq 0$, $W_{ij} = W_{ji}$ and $W_{ii} = 0$.
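As a concrete reading of these settings, the sketch below stores $W$ as a plain list of lists and checks the three constraints. The helper names `weight_matrix` and `satisfies_settings` and the toy graph are assumptions made for the illustration; they do not come from the talk.

```python
def weight_matrix(n, weighted_edges):
    """Build the symmetric weight matrix W of an undirected graph on n vertices."""
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        if i == j:
            raise ValueError("self-loops are excluded: W_ii must stay 0")
        if w < 0:
            raise ValueError("weights must be non-negative")
        W[i][j] = W[j][i] = float(w)  # enforce W_ij = W_ji
    return W


def satisfies_settings(W):
    """Check W_ij >= 0, W_ij = W_ji and W_ii = 0 for every pair (i, j)."""
    n = len(W)
    return all(
        W[i][j] >= 0 and W[i][j] == W[j][i] and W[i][i] == 0
        for i in range(n)
        for j in range(n)
    )


# Hypothetical toy graph: x1 - x2 (weight 2.0) and x2 - x3 (weight 0.5),
# with vertices indexed from 0.
W = weight_matrix(3, [(0, 1, 2.0), (1, 2, 0.5)])
print(W, satisfies_settings(W))
```

A dense matrix keeps the example short; for a graph with thousands of vertices and few edges, an adjacency list or a sparse matrix would be the more reasonable choice.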

Numerical characteristics calculation

Extracting important nodes

1 Vertex degree (French: degré): the number of edges adjacent to a given vertex or, for a weighted graph, $d_i = \sum_j W_{ij}$.

Vertices with a high degree are called hubs: the degree measures the popularity of a vertex.

[Figure: medieval network (legend: Individual, Transaction) with labeled vertices such as “Ratier (II) Castelnau”, “Ratier (III) Castelnau”, “Jean Laperarede” and “Bernard Audoy”]

The degree distribution is known to fit a power law (French: loi de puissance) in most real networks.

[Figure: degree distributions for Names and Transactions, logarithmic scale]

2 Vertex betweenness (French: centralité): the number of shortest paths between all pairs of vertices that pass through the vertex. Betweenness is a centrality measure: vertices with a high betweenness are likely to disconnect the network if removed.

• in the medieval network: legal entities (“moral persons”) such as the “Chapter of Cahors” or the “Church of Flaugnac” have a high betweenness despite a low degree;

• co-expression network:
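The contrast between degree and betweenness can be reproduced on a toy graph. The sketch below is not the code used for the talk: it uses Brandes' algorithm for unweighted graphs, and it counts the fraction of shortest paths through a vertex when several shortest paths exist (which coincides with the plain count when shortest paths are unique, as in this example). The graph, two triangles linked through a single bridge vertex, is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Betweenness of every vertex of an unweighted, undirected graph (Brandes)."""
    n = len(adj)
    score = [0.0] * n
    for s in range(n):
        sigma = [0] * n          # number of shortest paths from s
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []
        queue = deque([s])
        while queue:             # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return [b / 2 for b in score]  # each unordered pair was counted twice


# Hypothetical graph: triangles {0, 1, 2} and {4, 5, 6} joined through vertex 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = [[] for _ in range(7)]
for i, j in edges:
    adj[i].append(j)
    adj[j].append(i)

degree = [len(neighbors) for neighbors in adj]
between = betweenness(adj)
print(degree)   # [2, 2, 3, 2, 3, 2, 2]
print(between)  # [0.0, 0.0, 8.0, 9.0, 8.0, 0.0, 0.0]
```

Vertex 3 has the lowest degree in the graph but the highest betweenness, and removing it disconnects the two triangles: the same pattern as the low-degree, high-betweenness entities mentioned above.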

Visualization

Visualization tools help understand the graph macro-structure

Standard approach: force-directed placement (FDP) algorithms (e.g., [Fruchterman and Reingold, 1991]):

• attractive forces: similar to springs along the edges;

• repulsive forces: similar to electric forces between all pairs of vertices;

• the algorithm iterates until the vertex positions stabilize.
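The three ingredients above can be sketched in a few lines. This is a simplified variant in the spirit of Fruchterman and Reingold, not a faithful reimplementation: it assumes the classical force shapes (attraction $d^2/k$ along edges, repulsion $k^2/d$ between all pairs), replaces "until stabilization" by a fixed number of iterations with a cooling temperature, and omits the clamping of positions to a drawing frame. All parameter names and defaults are choices made for the illustration.

```python
import math
import random


def force_directed_layout(n, edges, iterations=100, size=1.0, seed=0):
    """Return 2D positions for n vertices using spring/electric-like forces."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, size), rng.uniform(0, size)] for _ in range(n)]
    k = size / math.sqrt(n)       # ideal edge length: sqrt(area / n)
    temperature = size / 10.0     # maximal displacement per iteration
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        # Repulsive forces between all pairs of vertices.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force
        # Attractive forces along the edges.
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[i][0] -= dx / dist * force
            disp[i][1] -= dy / dist * force
            disp[j][0] += dx / dist * force
            disp[j][1] += dy / dist * force
        # Move every vertex along its net force, by at most `temperature`.
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.95       # cooling: movements shrink over time
    return pos


# Hypothetical graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
positions = force_directed_layout(6, edges)
```

The double loop over all pairs of vertices is what makes each iteration quadratic in the number of vertices.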


Visualization software

• package igraph [Csardi and Nepusz, 2006] (http://igraph.sourceforge.net/): static representation with useful tools for graph mining;

• free software Gephi (http://gephi.org): interactive software, supports zooming and panning.

FDP drawbacks

• slow (can hardly handle graphs with more than a few hundred vertices);

• better suited to pretty visualization than to meaningful visualization:
  • short edges with uniform length;
  • as a consequence, vertices with high degrees are all in the center of the representation.

What is a “natural” way to investigate a network?

1 understand the macroscopic structure: find “communities” and their links;

2 focus on details in some areas of interest.

Clustering

Vertex clustering (French: classification)

Vertex clustering: cluster vertices into groups that are densely connected and share (comparatively) few links with the other groups.

Representing a clustering by the projected graph:

• each cluster is displayed by a node whose area is proportional to the number of vertices it contains;

• the link width between two clusters is proportional to the sum of the weights of the edges between them.
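The two quantities that drive the projected graph (cluster sizes and summed weights between clusters) are easy to compute from a weight matrix and a cluster label per vertex. The sketch below is one possible reading of the description above; the function name and the toy data are hypothetical.

```python
def projected_graph(W, labels):
    """Summarize a clustering: vertex count per cluster, total weight per cluster pair.

    W is a symmetric weight matrix; labels[i] is the cluster of vertex i.
    """
    sizes = {}
    links = {}
    n = len(labels)
    for i in range(n):
        sizes[labels[i]] = sizes.get(labels[i], 0) + 1
        for j in range(i + 1, n):      # each unordered pair of vertices once
            if labels[i] != labels[j] and W[i][j] > 0:
                key = (min(labels[i], labels[j]), max(labels[i], labels[j]))
                links[key] = links.get(key, 0) + W[i][j]
    return sizes, links


# Hypothetical graph: two triangles joined by the edge (2, 3), unit weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = [[0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1

sizes, links = projected_graph(W, labels=[0, 0, 0, 1, 1, 1])
print(sizes)  # {0: 3, 1: 3}
print(links)  # {(0, 1): 1}
```

A drawing routine would then use `sizes` for the node areas and `links` for the link widths.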


Hierarchical graph clustering and visualization


General overview

Two steps:

• find a hierarchical clustering;

• combine it with a representation of the various levels of the hierarchy.

References:

• article (in French): [Rossi and Villa-Vialaneix, 2011];

• programs (Java) available at http://apiacoa.org/software/graph/index.en.html

Hierarchical graph clustering

Modularity based clustering

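Modularity [Newman and Girvan, 2004]:

$$Q(V_1, \ldots, V_C) = \frac{1}{2m} \sum_{k=1}^{C} \sum_{x_i, x_j \in V_k} (W_{ij} - P_{ij})$$

with $P_{ij}$ the weight in the “null model”:

$$P_{ij} = \frac{d_i d_j}{2m}$$

with $d_i = \sum_{j \neq i} W_{ij}$ the degree of $x_i$ and $m = \frac{1}{2} \sum_i d_i$ the total weight of the edges.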
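To make the definition concrete, here is a direct (and deliberately naive) computation of the modularity of a given clustering. It assumes the usual conventions $d_i = \sum_j W_{ij}$ and $2m = \sum_i d_i$; the function name and the toy graph are hypothetical.

```python
def modularity(W, clusters):
    """Modularity Q of a clustering, given as a list of lists of vertex indices."""
    degree = [sum(row) for row in W]
    two_m = sum(degree)                 # 2m: every edge weight counted twice
    q = 0.0
    for members in clusters:
        for i in members:
            for j in members:           # ordered pairs inside the same cluster
                q += W[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Hypothetical graph: two triangles joined by the edge (2, 3), unit weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0

q_triangles = modularity(W, [[0, 1, 2], [3, 4, 5]])
q_single = modularity(W, [[0, 1, 2, 3, 4, 5]])
print(q_triangles, q_single)
```

On this graph the "two triangles" clustering reaches $Q = 5/14 \approx 0.357$, while putting every vertex in a single cluster gives $Q = 0$, which matches the intuition that the natural communities score higher.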


Interpretation

A good clustering should have a high modularity:

• $Q$ increases when $(x_i, x_j)$ are in the same cluster and $W_{ij} \gg P_{ij}$;

• $Q$ increases when $(x_i, x_j)$ are in different clusters and $W_{ij} \ll P_{ij}$, because

$$Q(V_1, \ldots, V_C) + \frac{1}{2m} \sum_{k \neq k'} \sum_{x_i \in V_k,\ x_j \in V_{k'}} (W_{ij} - P_{ij}) = 0;$$

• modularity is helpful to separate hubs (unlike spectral clustering or min-cut clustering) but has a resolution limit [Fortunato and Barthélémy, 2007].

Problem: modularity optimization is NP-hard (its decision version is NP-complete), so exhaustive search is unworkable for graphs with more than a few hundred vertices.
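The identity behind the second bullet can be checked in two lines. The derivation below is a sketch that assumes the usual conventions $d_i = \sum_j W_{ij}$ and $2m = \sum_i d_i$, and writes $V_1, \ldots, V_C$ for the clusters.

```latex
\begin{align*}
\sum_{i,j} (W_{ij} - P_{ij})
  &= \sum_i d_i - \frac{1}{2m} \Big(\sum_i d_i\Big) \Big(\sum_j d_j\Big)
   = 2m - \frac{(2m)^2}{2m} = 0, \\
\sum_{i,j} (W_{ij} - P_{ij})
  &= \underbrace{\sum_{k} \sum_{x_i, x_j \in V_k} (W_{ij} - P_{ij})}_{2m \, Q}
   + \sum_{k \neq k'} \sum_{x_i \in V_k,\, x_j \in V_{k'}} (W_{ij} - P_{ij}).
\end{align*}
```

Dividing the second line by $2m$ and using the first gives the identity: whatever is gained inside the clusters is lost between them. A pair with $W_{ij} \ll P_{ij}$ placed in two different clusters makes the between-cluster sum more negative, and therefore makes $Q$ larger.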


Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• Initialization: C(1)1 = {x1}, . . . , C(1)n = {xn}

• Repeat: merge the two clusters that maximize

Sig(Ci ,Cj) =∆QCi ,Cj√

deg(Ci)deg(Cj)

until modularity can not be improved by merging.

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 44: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 45: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 46: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 47: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 48: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering

• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 49: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease

of 25% of the number of clusters

. . .level 1 level 2 ... level L (final)

• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.

Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27

Page 50: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Modularity maximization (inspired by [Noack and Rotta, 2009])

Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease

of 25% of the number of clusters• Multi-level refinement for l = L − 1→ 1

• Find the graph induced by the clustering at level l...

• ... and use the clustering at level l + 1 on the meta-vertexes...• ... in order to move some meta-vertexes from a cluster to another one

(opportunistic moves)• Connectivity checking: at level L , check connectivity connexité and

split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.



Page 55: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Iterating the clustering

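Purpose: avoid the resolution issue of modularity (clusters that are small compared with the whole graph may not be detected by modularity maximization) [Fortunato and Barthélémy, 2007].
Methodology: repeat the clustering inside each cluster, in a hierarchical way.

[Figures: the clustering obtained at levels 1, 2 and 3 of the hierarchy]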
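The recursion described above can be sketched as follows. This is only an illustrative skeleton: `recursive_clustering`, `cluster_fn`, `accept_fn` and `max_depth` are hypothetical names, `cluster_fn` stands for any clustering routine (for instance a modularity maximizer) and `accept_fn` for any stopping test on the proposed split.

```python
from collections import defaultdict


def recursive_clustering(vertices, edges, cluster_fn, accept_fn,
                         depth=1, max_depth=3):
    """Cluster a graph, then cluster each cluster again (hypothetical sketch).

    cluster_fn(vertex_list, edge_list) must return a dict vertex -> label
    covering every vertex of vertex_list; accept_fn(edge_list, cluster_of)
    returns True when the proposed split should be kept.
    Returns a tree of nested dicts {"vertices": [...], "children": [...]}.
    """
    vertices = set(vertices)
    node = {"vertices": sorted(vertices), "children": []}
    # Only the edges inside the current cluster are used for the next split.
    inner = [(u, v) for u, v in edges if u in vertices and v in vertices]
    if depth > max_depth or not inner:
        return node
    cluster_of = cluster_fn(node["vertices"], inner)
    groups = defaultdict(set)
    for v in vertices:
        groups[cluster_of[v]].add(v)
    if len(groups) < 2 or not accept_fn(inner, cluster_of):
        return node  # the split is rejected: this cluster stays a leaf
    for label in sorted(groups):
        node["children"].append(
            recursive_clustering(groups[label], inner, cluster_fn,
                                 accept_fn, depth + 1, max_depth))
    return node
```

Passing the clustering routine and the stopping test as arguments keeps the recursion independent of how each level is clustered and of how a split is judged meaningful.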



Page 60: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Hierarchical graph clustering

Stopping criterion

Common issue with clustering algorithms: they always provide a result, even if it is not meaningful.

Significance of a clustering:

1 Generate random graphs from the set of graphs with the same degree distribution as the observed graph;
  Methodology: MCMC algorithm [Roberts Jr., 2000] by random permutation of the edges of the observed graph; after Q|E| permutations (|E|: number of edges), the graph is random according to the uniform distribution on the graphs with fixed degree distribution;
2 Find the modularity maximum of each random graph;
3 Find the (empirical) p-value of the observed modularity maximum (for the graph of interest);
4 If the modularity maximum is significant (larger than the modularity of 100 random graphs), the clustering is saved; otherwise, the recursive partitioning is stopped.
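The four steps above can be sketched in Python as follows. This is a simplified illustration, not the procedure of [Roberts Jr., 2000] as implemented by the authors: the edge permutation is written here as a degree-preserving swap of the ends of two edges, the value `q=10` is an arbitrary assumption (the slides do not give Q), and `shuffle_edges`, `modularity_test` and the callable `max_modularity` (any routine returning the maximal modularity found for an edge list) are hypothetical names.

```python
import random


def shuffle_edges(edges, q=10, rng=random):
    """Try q * |E| random edge swaps; every swap keeps all vertex degrees."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(q * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Reject swaps that would create a self-loop or a multiple edge.
        if len({a, b, c, d}) < 4 or new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def modularity_test(edges, max_modularity, n_random=100, q=10, seed=0):
    """Empirical p-value of the observed modularity maximum.

    Returns (p_value, significant); the clustering is called significant
    only when the observed value is larger than all the random ones.
    """
    rng = random.Random(seed)
    observed = max_modularity(edges)
    null = [max_modularity(shuffle_edges(edges, q, rng))
            for _ in range(n_random)]
    p_value = sum(x >= observed for x in null) / n_random
    return p_value, p_value == 0.0
```

Rejected swaps still count as iterations, so the chain simply stays on the current graph when a swap is not allowed. How many swaps are needed before the graph is close to uniformly distributed depends on the graph, so `q` should be treated as a tuning choice of this sketch.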



Page 64: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Examples

“Political books”
• vertexes: 105 American political books;
• edges: weighted by the number of co-purchases of two books by the same buyer.

[Figure: graph drawn with FDP (force-directed placement)]

[Figure: graph drawn with “LinLog” [Noack, 2007]]

[Figure: hierarchical representation (level 1/3)]


Page 67: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Examples

“Medieval”

• 10 373 vertexes: transactions or active persons;

• edges modeling the direct involvement of a person in a transaction.

Modularity optimization: 48 clusters containing from 10 to 740 vertexes.
Levels: 3 levels (up to 100 clusters for the sake of readability); 93 clusters at the finest level.


Page 79: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Examples

Conclusion and limits

Summary
• fast
• fully automatic (no parameter tuning)

Main limit: the iterative estimation of the space needed by a cluster overestimates its real area ⇒ edges can be unnecessarily long and induce a visualization bias.

Thank you for your attention


Page 82: Large network analysis : visualization and clustering

Hierarchical graph clustering and visualization Examples

References

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems.

Fortunato, S. and Barthélémy, M. (2007). Resolution limit in community detection. In Proceedings of the National Academy of Sciences, volume 104, pages 36–41. doi:10.1073/pnas.0605965104; URL: http://www.pnas.org/content/104/1/36.abstract.

Fruchterman, T. and Reingold, E. (1991). Graph drawing by force-directed placement. Software: Practice and Experience, 21:1129–1164.

Laurent, T. and Villa-Vialaneix, N. (2011). Using spatial indexes for labeled network analysis. Information, Interaction, Intelligence (i3). Under revision.

Liaubet, L., Villa-Vialaneix, N., Gamot, A., Rossi, F., Chérel, P., and SanCristobal, M. (2010). The structure of a gene network reveals 7 biological functions underlying eQTLs in pig. In Gesellschaft für Tierzuchtwissenschaften e. V., editor, World Congress on Genetics Applied to Livestock Production (WCGALP), number 0147, Leipzig, Germany.

Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69:026113.

Noack, A. (2007). Energy models for graph clustering. Journal of Graph Algorithms and Applications, 11(2):453–480.

Noack, A. and Rotta, R. (2009). Multi-level algorithms for modularity clustering. In SEA ’09: Proceedings of the 8th International Symposium on Experimental Algorithms, pages 257–268, Berlin, Heidelberg. Springer-Verlag.

Roberts Jr., J. M. (2000). Simple methods for simulating sociomatrices with given marginal totals. Social Networks, 22(3):273–283.

Rossi, F. and Villa-Vialaneix, N. (2011). Représentation d’un grand réseau à partir d’une classification hiérarchique de ses sommets. Journal de la Société Française de Statistique, 152(3):34–65.

Villa-Vialaneix, N., Liaubet, L., Laurent, T., Gamot, A., Cherel, P., and Sancristobal, M. (2011). L’analyse d’un réseau de co-expression génique met en valeur des groupes fonctionnels homogènes et des gènes importants relatifs à un phénotype d’intérêt. In Actes des 43èmes Journées de Statistique, Société Française de Statistique, Tunis, Tunisie.