large network analysis : visualization and clustering
DESCRIPTION
December 16th, 2011 Groupe de travail BioPuces, INRA de ToulouseTRANSCRIPT
Large network analysis: visualization andclustering
Nathalie Villa-Vialaneixhttp://www.nathalievilla.org
Groupe de travail “Biopuces”, INRA Toulouse, 16/12/2011
Travail joint avec Fabrice Rossi (SAMM, Université Paris 1)
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 1 / 27
A brief introduction to network/graph
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 2 / 27
A brief introduction to network/graph
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
A relation between two entities is modeled by an edgearête
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 3 / 27
A brief introduction to network/graph
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
The entities are called the nodes or the vertexes (vertices in British)nœuds/sommets
A relationbetween two entities is modeled by an edgearête
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 3 / 27
A brief introduction to network/graph
What is a network/graph? réseau/grapheMathematical object used to model relational data between entities.
A relation between two entities is modeled by an edgearête
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 3 / 27
A brief introduction to network/graph
Examples
Social network: nodes: persons - edges: 2 persons are connected
powered by TouchgraphTM 1 (Natty’s facebook
TM 2 network)
1http://www.touchgraph.com
2https://www.facebook.com
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 4 / 27
A brief introduction to network/graph
ExamplesBiological network (an example, among others: co-expression network): nodes:genes - edges: the partial correlation between a pair of genesCor(Xi ,Xj |(Xk )k,i,j) is large (significant...)
[Liaubet et al., 2010, Villa-Vialaneix et al., 2011]
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 4 / 27
A brief introduction to network/graph
ExamplesAny other relational data
Modeling a large corpus of medieval documents
Notarial acts (mostly baux à fief, moreprecisely, land charters) established in aseigneurie named “Castelnau Montratier”,written between 1250 and 1500, involvingtenants and lords.a
ahttp://graphcomp.univ-tlse2.fr
IndivualTransaction
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 4 / 27
A brief introduction to network/graph
ExamplesAny other relational data
Modeling a large corpus of medieval documents
• nodes: transactions and individuals(3 918 nodes)
• edges: an individual is directlyinvolved in a transaction (6 455 edges)
IndivualTransaction
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 4 / 27
A brief introduction to network/graph
ExamplesAny other relational data
IndivualTransaction
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 4 / 27
A brief introduction to network/graph
More complex relational modelsNodes may be labeled by a factor
... or by a numerical information. [Laurent and Villa-Vialaneix, 2011]Edges may also be labeled (type of the relation) or weighted (strength ofthe relation).
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 5 / 27
A brief introduction to network/graph
More complex relational modelsNodes may be labeled by a factor
... or by a numerical information. [Laurent and Villa-Vialaneix, 2011]
Edges may also be labeled (type of the relation) or weighted (strength ofthe relation).
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 5 / 27
A brief introduction to network/graph
More complex relational modelsNodes may be labeled by a factor
... or by a numerical information. [Laurent and Villa-Vialaneix, 2011]Edges may also be labeled (type of the relation) or weighted (strength ofthe relation).
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 5 / 27
Simple graph mining
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 6 / 27
Simple graph mining
Settings
In the following, a graph G = (V ,E,W) with:
• V : set of vertexes {x1, . . . , xn};
• E: set of edges;
• W : weights on edges s.t. Wij ≥ 0, Wij = Wji and Wii = 0.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 7 / 27
Simple graph mining Numerical characteristics calculation
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 8 / 27
Simple graph mining Numerical characteristics calculation
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex or di =
∑j Wij .
Vertexes with a high degree are called hubs: measure of the vertexpopularity.
IndividualTransaction
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Ratier
Ratier (II) Castelnau
Jean Laperarede
Bertrande Audoy
Gailhard Gourdon
Guy Moynes (de)
Pierre Piret (de)
Bernard Audoy
Hélène Castelnau
Guiral Baro
Bernard Audoy
Arnaud Bernard Laperarede
Guilhem Bernard Prestis
Jean Manas
Jean Laperarede
Jean Laperarede
Jean Roquefeuil
Jean Pojols
Ramond Belpech
Raymond Laperarede
Bertrand Prestis (de)
Ratier
(Monseigneur) Roquefeuil (de)
Guilhem Bernard Prestis
Arnaud Gasbert Castanhier (del)
Ratier (III) Castelnau
Pierre Prestis (de)
P Valeribosc
Guillaume Marsa
Berenguier Roquefeuil
Arnaud Bernard Perarede
Jean Roquefeuil
Arnaud I Audoy
Arnaud Bernard Perarede
2 vertex betweenness centralité: number of shortest paths between all pairs of
vertices that pass through the vertex. Betweenness is a centrality measure(vertexes that are likely to disconnect the network if removed).
• in the medieval network: moral persons such as the “Chapter ofCahors” or the “Church of Flaugnac” have a high betweenness despitea low degree;
• co-expression network:
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 9 / 27
Simple graph mining Numerical characteristics calculation
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex or di =
∑j Wij .
Vertexes with a high degree are called hubs: measure of the vertexpopularity.
2 vertex betweenness centralité: number of shortest paths between all pairs of
vertices that pass through the vertex. Betweenness is a centrality measure(vertexes that are likely to disconnect the network if removed).
• in the medieval network: moral persons such as the “Chapter ofCahors” or the “Church of Flaugnac” have a high betweenness despitea low degree;
• co-expression network:
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 9 / 27
Simple graph mining Numerical characteristics calculation
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex or di =
∑j Wij .
Vertexes with a high degree are called hubs: measure of the vertexpopularity.The degree distribution is known to fit a power law loi de puissancein most real networks:
1 2 5 10 20 50 100 200 500
15
5050
0
Names
Transactions
2 vertex betweenness centralité: number of shortest paths between all pairs of
vertices that pass through the vertex. Betweenness is a centrality measure(vertexes that are likely to disconnect the network if removed).
• in the medieval network: moral persons such as the “Chapter ofCahors” or the “Church of Flaugnac” have a high betweenness despitea low degree;
• co-expression network:
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 9 / 27
Simple graph mining Numerical characteristics calculation
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex or di =
∑j Wij .
2 vertex betweenness centralité: number of shortest paths between all pairs of
vertices that pass through the vertex. Betweenness is a centrality measure(vertexes that are likely to disconnect the network if removed).
• in the medieval network: moral persons such as the “Chapter ofCahors” or the “Church of Flaugnac” have a high betweenness despitea low degree;
• co-expression network:
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 9 / 27
Simple graph mining Numerical characteristics calculation
Extracting important nodes1 vertex degree degré: number of edges adjacent to a given vertex or di =
∑j Wij .
2 vertex betweenness centralité: number of shortest paths between all pairs of
vertices that pass through the vertex. Betweenness is a centrality measure(vertexes that are likely to disconnect the network if removed).• in the medieval network: moral persons such as the “Chapter of
Cahors” or the “Church of Flaugnac” have a high betweenness despitea low degree;
• co-expression network:
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 9 / 27
Simple graph mining Visualization
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 10 / 27
Simple graph mining Visualization
Visualization tools help understand the graphmacro-structure
Standard approach: force directed placement algorithms (FDP) (e.g.,[Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 11 / 27
Simple graph mining Visualization
Visualization tools help understand the graphmacro-structure
Standard approach: force directed placement algorithms (FDP) (e.g.,[Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 11 / 27
Simple graph mining Visualization
Visualization tools help understand the graphmacro-structure
Standard approach: force directed placement algorithms (FDP) (e.g.,[Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 11 / 27
Simple graph mining Visualization
Visualization tools help understand the graphmacro-structure
Standard approach: force directed placement algorithms (FDP) (e.g.,[Fruchterman and Reingold, 1991])
• attractive forces: similar to springs along the edges
• repulsive forces: similar to electric forces between all pairs of vertices
iterative algorithm until stabilization of the vertex positions.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 11 / 27
Simple graph mining Visualization
Visualization software• package igraph1 [Csardi and Nepusz, 2006] (static
representation with useful tools for graph mining)
• free software Gephi2 (interactive software, supports zoomingand panning)
1http://igraph.sourceforge.net/
2http://gephi.org
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 12 / 27
Simple graph mining Visualization
Visualization software• package igraph1 [Csardi and Nepusz, 2006] (static
representation with useful tools for graph mining)
• free software Gephi2 (interactive software, supports zoomingand panning)
1http://igraph.sourceforge.net/
2http://gephi.org
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 12 / 27
Simple graph mining Visualization
FDP drawbacks
• slow (can hardly handle graphs with more than a few hundredsvertexes) ;
• more suited for pretty visualization than for meaningful visualization:
• small edges with uniform length;• as a consequence: vertexes with high degrees are all in the center of
the representation
What is a “natural” way to investigate a network?
1 understand the macroscopic structure: find “communities” and theirlinks;
2 focus on details in some areas of interest.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 13 / 27
Simple graph mining Visualization
FDP drawbacks
• slow (can hardly handle graphs with more than a few hundredsvertexes) ;
• more suited for pretty visualization than for meaningful visualization:
• small edges with uniform length;• as a consequence: vertexes with high degrees are all in the center of
the representation
What is a “natural” way to investigate a network?
1 understand the macroscopic structure: find “communities” and theirlinks;
2 focus on details in some areas of interest.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 13 / 27
Simple graph mining Visualization
FDP drawbacks
• slow (can hardly handle graphs with more than a few hundredsvertexes) ;
• more suited for pretty visualization than for meaningful visualization:
• small edges with uniform length;• as a consequence: vertexes with high degrees are all in the center of
the representation
What is a “natural” way to investigate a network?
1 understand the macroscopic structure: find “communities” and theirlinks;
2 focus on details in some areas of interest.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 13 / 27
Simple graph mining Clustering
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 14 / 27
Simple graph mining Clustering
Vertex clustering classificationVertex clustering: cluster vertexes into groups that are denselyconnected and share a few links (comparatively) with the other groups.
Representing a clustering by the projected graph:• each cluster is displayed by a node whose area is proportional to the number of
vertexes;• the link width between two clusters is proportional to the sum of the weight.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 15 / 27
Simple graph mining Clustering
Vertex clustering classificationVertex clustering: cluster vertexes into groups that are denselyconnected and share a few links (comparatively) with the other groups.Representing a clustering by the projected graph:• each cluster is displayed by a node whose area is proportional to the number of
vertexes;• the link width between two clusters is proportional to the sum of the weight.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 15 / 27
Simple graph mining Clustering
Vertex clustering classificationVertex clustering: cluster vertexes into groups that are denselyconnected and share a few links (comparatively) with the other groups.Representing a clustering by the projected graph:• each cluster is displayed by a node whose area is proportional to the number of
vertexes;• the link width between two clusters is proportional to the sum of the weight.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 15 / 27
Hierarchical graph clustering and visualization
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 16 / 27
Hierarchical graph clustering and visualization
General overview
2 steps:
• Find a hierarchical clustering;
• Combine it with a representation of the various levels of thehierarchie.
References:
• article: (in French) [Rossi and Villa-Vialaneix, 2011]• programs (java) available athttp://apiacoa.org/software/graph/index.en.html
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 17 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 18 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity based clustering
Modularity [Newman and Girvan, 2004]
Q(V1, . . . ,VC) =1
2m
C∑k=1
∑xi ,xj∈Vk
(Wij − Pij)
with Pij weight in the “null model”:
Pij =didj
2m
with di = 12∑
j,i Wij .
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 19 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Interpretation
A good clustering should have a high modularity :
• Q ↗ when (xi , xj) are in the same cluster and Wij >> Pij
• Q ↗ when (xi , xj) are in different clusters and Wij << Pij because
Q(C) +1
2m
∑k,k ′
∑i∈Ck , j∈Ck ′
(Wij − Pij) = 0.
• Modularity: helpful to separate hubs (, spectral clustering or mincut clustering) but has a resolution default[Fortunato and Barthélémy, 2007].
Problem: Modularity optimization is NP-complet (exhaustive search isunworkable for graphs with more than a few hundreds vertexes)
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 20 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Interpretation
A good clustering should have a high modularity :
• Q ↗ when (xi , xj) are in the same cluster and Wij >> Pij
• Q ↗ when (xi , xj) are in different clusters and Wij << Pij because
Q(C) +1
2m
∑k,k ′
∑i∈Ck , j∈Ck ′
(Wij − Pij) = 0.
• Modularity: helpful to separate hubs (, spectral clustering or mincut clustering) but has a resolution default[Fortunato and Barthélémy, 2007].
Problem: Modularity optimization is NP-complet (exhaustive search isunworkable for graphs with more than a few hundreds vertexes)
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 20 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Interpretation
A good clustering should have a high modularity :
• Q ↗ when (xi , xj) are in the same cluster and Wij >> Pij
• Q ↗ when (xi , xj) are in different clusters and Wij << Pij because
Q(C) +1
2m
∑k,k ′
∑i∈Ck , j∈Ck ′
(Wij − Pij) = 0.
• Modularity: helpful to separate hubs (, spectral clustering or mincut clustering) but has a resolution default[Fortunato and Barthélémy, 2007].
Problem: Modularity optimization is NP-complet (exhaustive search isunworkable for graphs with more than a few hundreds vertexes)
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 20 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• Initialization: C(1)1 = {x1}, . . . , C(1)n = {xn}
• Repeat: merge the two clusters that maximize
Sig(Ci ,Cj) =∆QCi ,Cj√
deg(Ci)deg(Cj)
until modularity can not be improved by merging.
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering
• During the algorithm, save the levels corresponding to a decreaseof 25% of the number of clusters
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease
of 25% of the number of clusters
. . .level 1 level 2 ... level L (final)
• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease
of 25% of the number of clusters• Multi-level refinement for l = L − 1→ 1
• Find the graph induced by the clustering at level l...
• ... and use the clustering at level l + 1 on the meta-vertexes...• ... in order to move some meta-vertexes from a cluster to another one
(opportunistic moves)• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connectedStep 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease
of 25% of the number of clusters• Multi-level refinement for l = L − 1→ 1
• Find the graph induced by the clustering at level l...• ... and use the clustering at level l + 1 on the meta-vertexes...
• ... in order to move some meta-vertexes from a cluster to another one(opportunistic moves)
• Connectivity checking: at level L , check connectivity connexité andsplit clusters that are not connected
Step 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease
of 25% of the number of clusters• Multi-level refinement for l = L − 1→ 1
• Find the graph induced by the clustering at level l...• ... and use the clustering at level l + 1 on the meta-vertexes...• ... in order to move some meta-vertexes from a cluster to another one
(opportunistic moves)
• Connectivity checking: at level L , check connectivity connexité andsplit clusters that are not connected
Step 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clustering• Greedy clustering• During the algorithm, save the levels corresponding to a decrease
of 25% of the number of clusters• Multi-level refinement for l = L − 1→ 1• Connectivity checking: at level L , check connectivity connexité and
split clusters that are not connected
Step 2: Iterations until stabilization• Merge clusters at level L to improve modularity;• Multi-level refinement at levels 1, L et L + 1 ;• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Modularity maximization (inspired by [Noack and Rotta, 2009])
Step 1: Find a first clusteringStep 2: Iterations until stabilization
• Merge clusters at level L to improve modularity;
• Multi-level refinement at levels 1, L et L + 1 ;
• Connectivity checking.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 21 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Iterating the clustering
Purpose: Avoid the modularity resolution issue.Methodology: Repeat the clustering in each cluster in a hierarchical way.
level 1
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 22 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Iterating the clustering
Purpose: Avoid the modularity resolution issue.Methodology: Repeat the clustering in each cluster in a hierarchical way.
level 2
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 22 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Iterating the clustering
Purpose: Avoid the modularity resolution issue.Methodology: Repeat the clustering in each cluster in a hierarchical way.
level 3
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 22 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Stopping criterion
Common issue with clustering algorithm: they always provide a result,even if not meaningful.
Significativity of a clustering:
1 Random graph generation in the set of random graph with the samedegree distribution;
2 Find the modularity maximum;
3 Find the (empirical) p-value of the observed modularity maximum(for the graph of interest);
4 If the modularity maximum is significant (larger than themodularity of 100 random graphs), the clustering is saved; otherwise,the recursive partitioning is stopped.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 23 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Stopping criterionCommon issue with clustering algorithm: they always provide a result,even if not meaningful.Significativity of a clustering:
1 Random graph generation in the set of random graph with the samedegree distribution;Methodology: MCMC algorithm [Roberts Jr., 2000] by randompermutation of the edges of the observed graph;
2 Find the modularity maximum;3 Find the (empirical) p-value of the observed modularity maximum
(for the graph of interest);4 If the modularity maximum is significant (larger than the
modularity of 100 random graphs), the clustering is saved; otherwise,the recursive partitioning is stopped.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 23 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Stopping criterionCommon issue with clustering algorithm: they always provide a result,even if not meaningful.Significativity of a clustering:
1 Random graph generation in the set of random graph with the samedegree distribution;Methodology: MCMC algorithm [Roberts Jr., 2000] by randompermutation of the edges of the observed graph;
After Q |E | permutations, the graph is random according to the uniformdistribution on the graphs with fixed degree distribution
2 Find the modularity maximum;3 Find the (empirical) p-value of the observed modularity maximum
(for the graph of interest);4 If the modularity maximum is significant (larger than the
modularity of 100 random graphs), the clustering is saved; otherwise,the recursive partitioning is stopped.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 23 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Stopping criterion
Common issue with clustering algorithm: they always provide a result,even if not meaningful.Significativity of a clustering:
1 Random graph generation in the set of random graph with the samedegree distribution;
2 Find the modularity maximum;
3 Find the (empirical) p-value of the observed modularity maximum(for the graph of interest);
4 If the modularity maximum is significant (larger than themodularity of 100 random graphs), the clustering is saved; otherwise,the recursive partitioning is stopped.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 23 / 27
Hierarchical graph clustering and visualization Hierarchical graph clustering
Stopping criterion
Common issue with clustering algorithm: they always provide a result,even if not meaningful.Significativity of a clustering:
1 Random graph generation in the set of random graph with the samedegree distribution;
2 Find the modularity maximum;
3 Find the (empirical) p-value of the observed modularity maximum(for the graph of interest);
4 If the modularity maximum is significant (larger than themodularity of 100 random graphs), the clustering is saved; otherwise,the recursive partitioning is stopped.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 23 / 27
Hierarchical graph clustering and visualization Examples
Plan
1 A brief introduction to network/graph
2 Simple graph miningNumerical characteristics calculationVisualizationClustering
3 Hierarchical graph clustering and visualizationHierarchical graph clusteringExamples
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 24 / 27
Hierarchical graph clustering and visualization Examples
“Political books”• vertexes: 105 american political books;• edges weighted by the number of co-purchasing of two books by the
same buyer.
FDP
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 25 / 27
Hierarchical graph clustering and visualization Examples
“Political books”• vertexes: 105 american political books;• edges weighted by the number of co-purchasing of two books by the
same buyer.
“LinLog” [Noack, 2007]
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 25 / 27
Hierarchical graph clustering and visualization Examples
“Political books”
• vertexes: 105 american political books;
• edges weighted by the number of co-purchasing of two books by thesame buyer.
Hierarchical representation (level 1/3)
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 25 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
• 10 373 vertexes: transactions or active persons;
• edges modeling the direct involvement of a person in a transaction.
Modularity optimization: 48 clusters containing from 10 to 740 vertexes.Levels: 3 levels (up to 100 clusters for the sake of readability); 93 clustersat the finest level.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
• 10 373 vertexes: transactions or active persons;
• edges modeling the direct involvement of a person in a transaction.
Modularity optimization: 48 clusters containing from 10 to 740 vertexes.
Levels: 3 levels (up to 100 clusters for the sake of readability); 93 clustersat the finest level.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
• 10 373 vertexes: transactions or active persons;
• edges modeling the direct involvement of a person in a transaction.
Modularity optimization: 48 clusters containing from 10 to 740 vertexes.Levels: 3 levels (up to 100 clusters for the sake of readability); 93 clustersat the finest level.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
“Medieval”
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 26 / 27
Hierarchical graph clustering and visualization Examples
Conclusion and limits
Summary• fast
• fully automatic (no parameter tuning)
Main limit: iterative estimation of the space needed by a cluster isoverestimating the real area⇒ edges can be unnecessarily long andinduce a visualization bias
Thank you for your attention
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27
Hierarchical graph clustering and visualization Examples
Conclusion and limits
Summary• fast
• fully automatic (no parameter tuning)
Main limit: iterative estimation of the space needed by a cluster isoverestimating the real area⇒ edges can be unnecessarily long andinduce a visualization bias
Thank you for your attention
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27
Hierarchical graph clustering and visualization Examples
Conclusion and limits
Summary• fast
• fully automatic (no parameter tuning)
Main limit: iterative estimation of the space needed by a cluster isoverestimating the real area⇒ edges can be unnecessarily long andinduce a visualization bias
Thank you for your attention
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27
Hierarchical graph clustering and visualization Examples
ReferencesCsardi, G. and Nepusz, T. (2006).
The igraph software package for complex network research.
InterJournal, Complex Systems.
Fortunato, S. and Barthélémy, M. (2007).
Resolution limit in community detection.
In Proceedings of the National Academy of Sciences, volume 104, pages 36–41.
doi:10.1073/pnas.0605965104; URL:http://www.pnas.org/content/104/1/36.abstract.
Fruchterman, T. and Reingold, B. (1991).
Graph drawing by force-directed placement.
Software-Practice and Experience, 21:1129–1164.
Laurent, T. and Villa-Vialaneix, N. (2011).
Using spatial indexes for labeled network analysis.
Information, Interaction, Intelligence (i3).
Under revision.
Liaubet, L., Villa-Vialaneix, N., Gamot, A., Rossi, F., Chérel, P., and SanCristobal, M.(2010).
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27
Hierarchical graph clustering and visualization Examples
The structure of a gene network reveals 7 biological functions underlying eQTLs inpig.In Gesellschaft für Tierzuchtwissenschaften e. V., editor, World Congress on GeneticsApplied to Livestock Production (WCGALP), number 0147, Leipzig, Germany.
Newman, M. and Girvan, M. (2004).Finding and evaluating community structure in networks.Physical Review, E, 69:026113.
Noack, A. (2007).Energy models for graph clustering.Journal of Graph Algorithms and Applications, 11(2):453–480.
Noack, A. and Rotta, R. (2009).Multi-level algorithms for modularity clustering.In SEA ’09: Proceedings of the 8th International Symposium on ExperimentalAlgorithms, pages 257–268, Berlin, Heidelberg. Springer-Verlag.
Roberts Jr., J. M. (2000).Simple methods for simulating sociomatrices with given marginal totals.Social Networks, 22(3):273 – 283.
Rossi, F. and Villa-Vialaneix, N. (2011).Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27
Hierarchical graph clustering and visualization Examples
Représentation d’un grand réseau à partir d’une classification hiérarchique de sessommets.Journal de la Société Française de Statistique, 152(3):34–65.
Villa-Vialaneix, N., Liaubet, L., Laurent, T., Gamot, A., Cherel, P., and Sancristobal,M. (2011).L’analyse d’un réseau de co-expression génique met en valeur des groupesfonctionnels homogènes et des gènes importants relatifs a un phénotype d’intérêt.In Actes des 43èmes Journées de Statistique, Société Française de Statistique,Tunis, Tunisie.
Large network analysis (Biopuces) Nathalie Villa-Vialaneix & Fabrice Rossi Toulouse, INRA, 16/12/2011 27 / 27