some statistical approaches in random graph...

45
Some statistical approaches in random graph modeling Catherine MATIAS CNRS, Laboratoire Statistique & G´ enome, ´ Evry, FRANCE http://stat.genopole.cnrs.fr/cmatias

Upload: others

Post on 22-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Some statistical approaches in random graphmodeling

Catherine MATIAS

CNRS, Laboratoire Statistique & Genome, Evry, FRANCE

http://stat.genopole.cnrs.fr/∼cmatias

Page 2: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Outline

Molecular interactions networks

Some statistical networks modelsExponential random graphs(Overlapping) Stochastic block modelsLatent space models

Analyzing networks: (probabilistic) node clustering

Page 3: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Outline

Molecular interactions networks

Some statistical networks modelsExponential random graphs(Overlapping) Stochastic block modelsLatent space models

Analyzing networks: (probabilistic) node clustering

Page 4: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (1/3)

Protein interactions networks (PIN)

Figure: Yeast Protein Interaction Network. Source:http://www.bordalierinstitute.com/images/yeastProteinInteractionNetwork.jpg

Page 5: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (1/3)

Protein interactions networks (PIN)

I Describe possible physical interactions between proteins(formation of protein complex, phosphorylation cascade).

I Public databases store interactions known from the literature.

I Many interactions are based on yeast two-hybrid experiments,inducing many false positive.

Page 6: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (2/3)

Metabolic networks

Figure: Glycolysis pathway.

Page 7: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (2/3)

Metabolic networks

I Describe chemical reactions between metabolites (smallmolecules) transforming a substrate to a product.

I Most reactions need to be catalyzed by enzymes and areconsidered to be reversible.

I The metabolic networks are mostly inferred using comparativegenomics techniques, inducing many false negatives.

I Modeled using oriented hypergraphs.

Tous les modeles mentionnes jusqu’ici peuvent etre orientes ou non orientes,selon que les reactions sont reversibles ou pas. Un graphe oriente est ungraphe ou chaque arete a une origine et une destination. Pour un graphesimple, chaque arete est alors une paire orientee. Pour un hypergraphe, unehyperarete orientee correspond a plusieurs sources et plusieurs destinations(voir Fig. 2.3).

Fig. 2.3 – Hypergraphe oriente correspondant au reseau : A+B ! C, C !D.

Dans le cas d’un reseau ou toutes les reactions sont irreversibles (resp.reversibles), on choisit de modeliser le reseau par un graphe oriente (resp.non oriente). Dans le cas intermediaire ou certaines reactions sont reversibleset d’autres ne le sont pas, alors on utilise un graphe mixte (certaines aretessont orientees et d’autres ne le sont pas).

Enfin, on peut noter que l’utilisation d’aretes non orientees dans le casd’une reaction reversible laisse encore une ambiguıte (plusieurs reseaux peuventavoir la meme representation), meme quand on utilise un graphe biparti ouun hypergraphe. Un exemple est donne dans la figure 2.4.

Fig. 2.4 – Graphe biparti correspondant a deux reseaux possibles : A+B "C+D ou A+C " B+D.

On constate que l’orientation ne rend pas compte de la di!erence entresubstrats et produits, ou devrait-on dire entre composes de gauche et com-poses de droite (puisque les termes substrats et produits impliquent une

24

tel-001

95401,

versio

n 1 - 1

0 Dec

2007 Figure: Oriented hypergraph modeling a metabolic network. Source: V. Lacroix.

Page 8: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (3/3)

Gene regulatory networks

16 CHAPITRE 1. INTRODUCTION

code pour un acide amine specifique. Un ARNm definit ainsi une suite d’acides amines qui sontassembles pour former la proteine correspondante.

1.1.2 Un systeme organise en reseaux de regulation.

Malgre la multiplicite des conditions d’expression d’un gene donne, certains groupes de genespresentent des niveaux d’expression proches dans de nombreuses conditions biologiques. En e!et,les proteines n’assurent pas leur role dans la cellule de maniere individuelle : il peut s’agir de laformation, a un moment donne, de complexes proteiques c’est-a-dire d’amas de plusieurs proteinesliees chimiquement. Plus frequemment, diverses proteines interviennent de facon consecutive dansune cascade de reactions chimiques, on parle alors de voies metaboliques. Pour que cette colla-boration entre proteines et/ou ARN puisse avoir lieu, il est necessaire que l’expression des genesd’un organisme soit controlee et coordonnee.

Fig. 1.2 – Activation de la transcription d’ungene.

Il existe en e!et des mecanismes de regulationde l’expression des genes. Certaines proteines sontconnues pour etre des facteurs de transcription.Ces proteines favorisent ou au contraire inhibentl’expression d’autres genes. Elles peuvent se liera des sites specifiques, appeles sites de fixation,dans les regions regulatrices des genes. Ces regionspeuvent se trouver en amont du gene comme sur laFigure 1.2 mais aussi en aval, et pas necessairementa proximite du gene. En se fixant a l’ADN, seul ouen formant des complexes les uns avec les autres, lesfacteurs de transcription peuvent activer ou inhi-ber la transcription de leur(s) gene(s) cible(s). Lesfacteurs de transcription peuvent de plus etre eux-memes regules, dans ce cas ils participent a unevoie de regulation. Ces proteines jouent un role cle dans les mecanismes de regulation d’un or-ganisme et la connaissance de certaines d’entre elles represente une information particulierementinteressante pour la reconstruction d’un reseau de regulation.

Fig. 1.3 – Reseau PDR de S. cerevisiae.

Ainsi, un gene cible peut etre regule par unecombinaison de facteurs de transcription et un fac-teur de transcription peut reguler plusieurs genescibles. Un des principaux defis actuel est de com-prendre ces phenomenes de regulation entre lesgenes. Il s’agit de representer l’ensemble des re-lations de regulation caracteristiques d’un proces-sus sous la forme d’un graphe dont les noeudssont les genes (facteurs de transcription et genescibles) et les aretes orientees representent des ef-fets transcriptionnels activateurs ou inhibiteurs[WLL04, GF05]. Le reseau “PDR” represente Fi-gure 1.3 a ainsi ete identifie par l’equipe Genomiquede la Levure (ENS, Paris) pour la resistance aux

produits toxiques chez la levure S. cerevisiae.Un tel reseau de regulation peut etre reconstruit grace a la mise en oeuvre de di!erentes tech-

niques comprenant entre autres la detection de genes, l’identification de facteurs de transcription,la recherche de sites de fixation dans la sequence d’ADN, la detection de la capacite pour deux

Figure: PDR network in S. cerevisiae. Source: Lab. Genomique de la levure, ENS.

Page 9: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

What kind of networks? (3/3)

Gene regulatory networks

I Describe regulations (inhibitions or activations) of geneexpressions, by other genes.

I Oriented graph, with positive or negative label.

I Either static or dynamic (in time).

I Mostly statistically inferred from transcription data sets.

1.5. RECONSTRUIRE UN RESEAU GENETIQUE 25

1.5 Approches statistiques pour la reconstruction de reseauxgenetiques

1.5.1 Modelisation dynamique des motifs de regulation

Une fois que l’on a determine un ensemble de genes a etudier, il s’agit de determiner lerole de chacun. Les puces a ADN permettent d’observer l’expression de l’ensemble de ces genessimultanement et o!rent ainsi un point de vue global sur le comportement de ce systeme. Ceciconstitue une source de donnees considerable dont il s’agit de tirer le maximum d’information.Pour cela, on introduit le processus X decrivant l’intensite d’expression de p genes sur n instantssuccessifs,

X = {X it ; 1 ! i ! p, 0 ! t ! n},

ou chaque variable aleatoire X it represente l’intensite d’expression du gene i a l’instant t. Ainsi le

vecteur Xt, de dimension p, decrit le niveau d’expression de l’ensemble des genes observes au tempst et le vecteur X i, de dimension n, le profil d’expression du gene i tout au long de l’experience. Ils’agit alors d’apprehender les relations qui existent entre les di!erents genes observes a partir del’observation de ce processus.

Jusqu’ici, un grand nombre d’approches ont ete utilisees pour decrire des systemes de regulationet en particulier dans le but de reconstruire un graphe, oriente ou non-oriente (voir [DJ02, BJ05]pour une synthese). Il s’agit de mettre en evidence des motifs de regulation comme celui representeFigure 1.10 par exemple. Une arete tracee du gene G1 vers le gene G2 traduit le fait que la proteinecodee par le gene 1 (facteur de transcription) regule l’expression du gene 2.

!

++! 1 2 3G G G

Fig. 1.10 – Exemple de motif de regulation.

Une approche assez communement etablie aujourd’hui repose sur les modeles graphiques gaus-siens ou GGM pour Graphical Gaussian Model [SS05a, SS05b, WK00a, TH02a, TH02b, WYS03,WMH03]. On suppose pour cela que l’ensemble des niveaux d’expression des p genes observes estun vecteur gaussien (Y i)i=1..p dont on observe des repetitions. Chaque gene i est represente parun noeud Y i et l’on trace une arete entre deux noeuds des que les niveaux d’expression correspon-dants Y i et Y j sont correles conditionnellement a l’ensemble des niveaux d’expression observes.Le graphe ainsi obtenu est appele graphe de concentration. Si l’on observe un phenomene gouvernepar le motif de la Figure 1.10 par exemple, on s’attend a inferer le graphe represente Figure 1.11.Ce graphe permet bien de retrouver que la dependance entre les variables Y 1 et Y 3 se fait parl’intermediaire de la variable Y 2.

L’approche GGM permet en e!et de detecter uniquement les relations de dependance directesentre les di!erents niveaux d’expression. On evite ainsi de tracer une arete entre deux genes dontles niveaux d’expression sont correles en raison de leur lien respectif avec un troisieme gene parexemple. En revanche, il n’est pas possible de representer certains motifs comme les boucles de

Figure: Example of a regulatory motif. Source: S. Lebre.

Page 10: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Challenges

Main issues

I Analyzing large data sets (thousands of nodes and edges),with many noise.

I Identifying structures (motifs, groups, etc).

I Observed networks are sampled from existing ones.

Page 11: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Outline

Molecular interactions networks

Some statistical networks modelsExponential random graphs(Overlapping) Stochastic block modelsLatent space models

Analyzing networks: (probabilistic) node clustering

Page 12: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Some models

Some famous models

I Erdos Renyi random graph

I Degree distribution (power law, fixed degree sequence, etc)

I Preferential attachment (dynamic model)

I . . .

Here, we are going to focus on (static) ’statistical’ models,

I Exponential random graph model (ERGM) [Frank & Strauss 86].

I Stochastic block model or MixNet [Frank & Harary 82,Holland et al. 83, Snijders & Nowicki 97, Daudin et al. 08].

I Overlapping stochastic block models (OSBM)[Latouche et al. 11a] or mixed membership SBM [Airoldi et al. 08].

I Latent space models [Hoff et al. 02, Handcock et al. 07].

We refer to [Goldenberg et al. 10] for a recent overview.

Page 13: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Some models

Some famous models

I Erdos Renyi random graph

I Degree distribution (power law, fixed degree sequence, etc)

I Preferential attachment (dynamic model)

I . . .

Here, we are going to focus on (static) ’statistical’ models,

I Exponential random graph model (ERGM) [Frank & Strauss 86].

I Stochastic block model or MixNet [Frank & Harary 82,Holland et al. 83, Snijders & Nowicki 97, Daudin et al. 08].

I Overlapping stochastic block models (OSBM)[Latouche et al. 11a] or mixed membership SBM [Airoldi et al. 08].

I Latent space models [Hoff et al. 02, Handcock et al. 07].

We refer to [Goldenberg et al. 10] for a recent overview.

Page 14: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Some models

Some famous models

I Erdos Renyi random graph

I Degree distribution (power law, fixed degree sequence, etc)

I Preferential attachment (dynamic model)

I . . .

Here, we are going to focus on (static) ’statistical’ models,

I Exponential random graph model (ERGM) [Frank & Strauss 86].

I Stochastic block model or MixNet [Frank & Harary 82,Holland et al. 83, Snijders & Nowicki 97, Daudin et al. 08].

I Overlapping stochastic block models (OSBM)[Latouche et al. 11a] or mixed membership SBM [Airoldi et al. 08].

I Latent space models [Hoff et al. 02, Handcock et al. 07].

We refer to [Goldenberg et al. 10] for a recent overview.

Page 15: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (1/2)

Notations

I X = (Xij)1≤i,j≤n the (binary) adjacency matrix of the graph

I S(X) a known vector of graph statistics on X

I θ a vector of parameters

Pθ(X = x) =1

c(θ)exp(θᵀS(x)), c(θ) =

∑graphs y

exp(θᵀS(y))

Examples

I S(X) may contain the number of edges, triangles, k-stars, . . .

I S may also contain covariates.

Remarks

I S(X) becomes a vector of sufficient statistics

I c(θ) is not computable

Page 16: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (1/2)

Notations

I X = (Xij)1≤i,j≤n the (binary) adjacency matrix of the graph

I S(X) a known vector of graph statistics on X

I θ a vector of parameters

Pθ(X = x) =1

c(θ)exp(θᵀS(x)), c(θ) =

∑graphs y

exp(θᵀS(y))

Examples

I S(X) may contain the number of edges, triangles, k-stars, . . .

I S may also contain covariates.

Remarks

I S(X) becomes a vector of sufficient statistics

I c(θ) is not computable

Page 17: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (1/2)

Notations

I X = (Xij)1≤i,j≤n the (binary) adjacency matrix of the graph

I S(X) a known vector of graph statistics on X

I θ a vector of parameters

Pθ(X = x) =1

c(θ)exp(θᵀS(x)), c(θ) =

∑graphs y

exp(θᵀS(y))

Examples

I S(X) may contain the number of edges, triangles, k-stars, . . .

I S may also contain covariates.

Remarks

I S(X) becomes a vector of sufficient statistics

I c(θ) is not computable

Page 18: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (1/2)

Notations

I X = (Xij)1≤i,j≤n the (binary) adjacency matrix of the graph

I S(X) a known vector of graph statistics on X

I θ a vector of parameters

Pθ(X = x) =1

c(θ)exp(θᵀS(x)), c(θ) =

∑graphs y

exp(θᵀS(y))

Examples

I S(X) may contain the number of edges, triangles, k-stars, . . .

I S may also contain covariates.

Remarks

I S(X) becomes a vector of sufficient statistics

I c(θ) is not computable

Page 19: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (2/2)

Issues on parameter estimation

I Maximum likelihood estimation is difficult

I Maximum pseudo-likelihood estimators may be used[Frank & Strauss 86]. Quality of approximation ?

I MCMC approaches [Hunter et al. 11]: may be slow to converge

I Very different values of θ can give rise to essentially the samedistribution

I [Chatterjee & Diaconis 11] established a ’degeneracy’ of thesemodels, which are ’ill-posed’

Other issues

I What about clustering the nodes ?

Page 20: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Exponential random graphs (2/2)

Issues on parameter estimation

I Maximum likelihood estimation is difficult

I Maximum pseudo-likelihood estimators may be used[Frank & Strauss 86]. Quality of approximation ?

I MCMC approaches [Hunter et al. 11]: may be slow to converge

I Very different values of θ can give rise to essentially the samedistribution

I [Chatterjee & Diaconis 11] established a ’degeneracy’ of thesemodels, which are ’ill-posed’

Other issues

I What about clustering the nodes ?

Page 21: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Stochastic block model (binary graphs)

1 2

3

4

5

6

7

84

5

6

7

8

p••

9

10

p••

p••

p••

p••

n = 10, Z5• = 1X12 = 1, X15 = 0

Binary case

I Q groups (=colors •••).I {Zi}1≤i≤n i.i.d. vectors Zi = (Zi1, . . . , ZiQ) ∼M(1,π),

where π = (π1, . . . , πQ) group proportions. Zi is notobserved,

I Observations: edges indicator Xij , 1 ≤ i < j ≤ n,

I Conditional on the {Zi}’s, the random variables Xij areindependent B(pZiZj ).

Page 22: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Stochastic block model (binary graphs)

1 2

3

4

5

6

7

84

5

6

7

8

p••

9

10

p••

p••

p••

p••

n = 10, Z5• = 1X12 = 1, X15 = 0

Binary case

I Q groups (=colors •••).I {Zi}1≤i≤n i.i.d. vectors Zi = (Zi1, . . . , ZiQ) ∼M(1,π),

where π = (π1, . . . , πQ) group proportions. Zi is notobserved,

I Observations: edges indicator Xij , 1 ≤ i < j ≤ n,

I Conditional on the {Zi}’s, the random variables Xij areindependent B(pZiZj ).

Page 23: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Stochastic block model (binary graphs)

1 2

3

4

5

6

7

84

5

6

7

8

p••

9

10

p••

p••

p••

p••

n = 10, Z5• = 1X12 = 1, X15 = 0

Binary case

I Q groups (=colors •••).I {Zi}1≤i≤n i.i.d. vectors Zi = (Zi1, . . . , ZiQ) ∼M(1,π),

where π = (π1, . . . , πQ) group proportions. Zi is notobserved,

I Observations: edges indicator Xij , 1 ≤ i < j ≤ n,

I Conditional on the {Zi}’s, the random variables Xij areindependent B(pZiZj ).

Page 24: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Stochastic block model (binary graphs)

1 2

3

4

5

6

7

84

5

6

7

8

p••

9

10

p••

p••

p••

p••

n = 10, Z5• = 1X12 = 1, X15 = 0

Binary case

I Q groups (=colors •••).I {Zi}1≤i≤n i.i.d. vectors Zi = (Zi1, . . . , ZiQ) ∼M(1,π),

where π = (π1, . . . , πQ) group proportions. Zi is notobserved,

I Observations: edges indicator Xij , 1 ≤ i < j ≤ n,

I Conditional on the {Zi}’s, the random variables Xij areindependent B(pZiZj ).

Page 25: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Stochastic block model (weighted graphs)

1 2

3

4

5

6

7

84

5

6

7

8

θ••

9

10

θ••

θ••

θ••

θ••

n = 10, Z5• = 1X12 ∈ R, X15 = 0

Weighted case

I Observations: weights Xij , where Xij = 0 or Xij ∈ Rs \ {0},I Conditional on the {Zi}’s, the random variables Xij are

independent with distribution

µZiZj (·) = pZiZjf(·, θZiZj ) + (1− pZiZj )δ0(·)

(Assumption: f has continuous cdf at zero).

Page 26: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

SBM properties

Results

I Identifiability of parameters [Allman et al. 09, Allman et al. 11].

I Parameter estimation / node clustering procedures:exact EM approach is not possible

Page 27: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

SBM properties

Results

I Identifiability of parameters [Allman et al. 09, Allman et al. 11].

I Parameter estimation / node clustering procedures:variational EM [Daudin et al. 08, Picard et al. 09], variationalBayes [Latouche et al. 11b], online variational EM[Zanghi et al. 08], other methods [Ambroise & Matias 10] . . .

Page 28: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

SBM properties

Results

I Identifiability of parameters [Allman et al. 09, Allman et al. 11].

I Parameter estimation / node clustering procedures:variational EM [Daudin et al. 08, Picard et al. 09], variationalBayes [Latouche et al. 11b], online variational EM[Zanghi et al. 08], other methods [Ambroise & Matias 10] . . .

I Model selection criteria [Daudin et al. 08, Latouche et al. 11b]

Page 29: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

SBM properties

Results

I Identifiability of parameters [Allman et al. 09, Allman et al. 11].

I Parameter estimation / node clustering procedures:variational EM [Daudin et al. 08, Picard et al. 09], variationalBayes [Latouche et al. 11b], online variational EM[Zanghi et al. 08], other methods [Ambroise & Matias 10] . . .

I Model selection criteria [Daudin et al. 08, Latouche et al. 11b]

Remaining challenges

I Behavior of the nodes posterior dist. / Quality of variationalapprox. ?

I Consistency of the MLE ?(ongoing works of Celisse, Daudin & Pierre and Mariadassou& Matias)

Page 30: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Overlapping SBM / Mixed membership SBM

Figure: Overlapping mixture model. Source: Palla et al., Nature, 2005.

Nodes may belong to many classes[Latouche et al. 11a, Airoldi et al. 08].

Page 31: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

OSBM [Latouche et al. 11a]

ModelI Zi = (Zi1, . . . , ZiQ) ∼

∏Qq=1 B(πq)

I Xij |Zi, Zj ∼ B(g(pZiZj )) where g(x) = (1 + e−x)−1 (logisticfunction) and

pZiZj = Zᵀi WZj + Zᵀ

i U + V ᵀZj + ω

W is a Q×Q real matrix while U and V are Q-dimensionalreal vectors and ω real number.

Results [Latouche et al. 11a]

I Parameter’s identifiability

I Variational Bayes approach + variational logistic Bayes

I Model selection criterion

IssuesI Quality of (double) variational approximation ?

Page 32: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

OSBM [Latouche et al. 11a]

ModelI Zi = (Zi1, . . . , ZiQ) ∼

∏Qq=1 B(πq)

I Xij |Zi, Zj ∼ B(g(pZiZj )) where g(x) = (1 + e−x)−1 (logisticfunction) and

pZiZj = Zᵀi WZj + Zᵀ

i U + V ᵀZj + ω

W is a Q×Q real matrix while U and V are Q-dimensionalreal vectors and ω real number.

Results [Latouche et al. 11a]

I Parameter’s identifiability

I Variational Bayes approach + variational logistic Bayes

I Model selection criterion

IssuesI Quality of (double) variational approximation ?

Page 33: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

OSBM [Latouche et al. 11a]

ModelI Zi = (Zi1, . . . , ZiQ) ∼

∏Qq=1 B(πq)

I Xij |Zi, Zj ∼ B(g(pZiZj )) where g(x) = (1 + e−x)−1 (logisticfunction) and

pZiZj = Zᵀi WZj + Zᵀ

i U + V ᵀZj + ω

W is a Q×Q real matrix while U and V are Q-dimensionalreal vectors and ω real number.

Results [Latouche et al. 11a]

I Parameter’s identifiability

I Variational Bayes approach + variational logistic Bayes

I Model selection criterion

IssuesI Quality of (double) variational approximation ?

Page 34: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Latent space models [Handcock et al. 07]

ModelI Zi i.i.d. vectors in a latent space Rd.

I Conditional on {Zi}, the {Xij} are independent Bernoulli r.v.

log-odds(Xij = 1|Zi, Zj , Uij , θ) = θ0 + θᵀ1Uij − ‖Zi − Zj‖,

where log-odds(A) = log P(A)/(1− P(A)) ; {Uij} set ofcovariate vectors and θ parameters vector.

I This may be extended to weighted networks

Results [Handcock et al. 07]

I Two-stage maximum likelihood or MCMC procedures are usedto infer the model’s parameters

I Assuming Zi sampled from mixture of multivariate normal,one may obtain a clustering of the nodes.

IssuesI No model selection procedure to infer the ’effective’ dimension

d of latent space and the number of groups

Page 35: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Latent space models [Handcock et al. 07]

ModelI Zi i.i.d. vectors in a latent space Rd.

I Conditional on {Zi}, the {Xij} are independent Bernoulli r.v.

log-odds(Xij = 1|Zi, Zj , Uij , θ) = θ0 + θᵀ1Uij − ‖Zi − Zj‖,

where log-odds(A) = log P(A)/(1− P(A)) ; {Uij} set ofcovariate vectors and θ parameters vector.

I This may be extended to weighted networks

Results [Handcock et al. 07]

I Two-stage maximum likelihood or MCMC procedures are usedto infer the model’s parameters

I Assuming Zi sampled from mixture of multivariate normal,one may obtain a clustering of the nodes.

IssuesI No model selection procedure to infer the ’effective’ dimension

d of latent space and the number of groups

Page 36: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Latent space models [Handcock et al. 07]

ModelI Zi i.i.d. vectors in a latent space Rd.

I Conditional on {Zi}, the {Xij} are independent Bernoulli r.v.

log-odds(Xij = 1|Zi, Zj , Uij , θ) = θ0 + θᵀ1Uij − ‖Zi − Zj‖,

where log-odds(A) = log P(A)/(1− P(A)) ; {Uij} set ofcovariate vectors and θ parameters vector.

I This may be extended to weighted networks

Results [Handcock et al. 07]

I Two-stage maximum likelihood or MCMC procedures are usedto infer the model’s parameters

I Assuming Zi sampled from mixture of multivariate normal,one may obtain a clustering of the nodes.

IssuesI No model selection procedure to infer the ’effective’ dimension

d of latent space and the number of groups

Page 37: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Outline

Molecular interactions networks

Some statistical networks modelsExponential random graphs(Overlapping) Stochastic block modelsLatent space models

Analyzing networks: (probabilistic) node clustering

Page 38: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Clustering the nodes of a network

Probabilistic approach

I Using either mixture or overlapping mixture models, one mayrecover nodes groups.

I These groups reflect a common ’connectivity behaviour’.

Non probabilistic approach = community detection

I Many clustering methods try to group the nodes that belongto the same clique.

I Here the nodes in the same groups tend to be connected witheach other.

Page 39: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Clustering the nodes of a network

Probabilistic approach

I Using either mixture or overlapping mixture models, one mayrecover nodes groups.

I These groups reflect a common ’connectivity behaviour’.

Non probabilistic approach = community detection

I Many clustering methods try to group the nodes that belongto the same clique.

I Here the nodes in the same groups tend to be connected witheach other.

Page 40: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Major difference between probabilistic/non probabilisticapproach

Observation of

may lead to either

MixNet model Clustering based on cliques

Page 41: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

[Airoldi et al. 08] E.M. Airoldi, D.M. Blei, S.E. Fienberg andE.P. Xing.Mixed Membership Stochastic Blockmodels.J. Mach. Learn. Res., 9:1981-2014, 2008.

[Allman et al. 09] E.S. Allman, C. Matias and J.A. Rhodes.Identifiability of parameters in latent structure models withmany observed variables.Ann. Statist., 37(6A):3099-3132, 2009.

[Allman et al. 11] E.S. Allman, C. Matias and J.A. Rhodes.Parameter identifiability in a class of random graph mixturemodels.J. Statist. Planning and Inference, 141(5):1719-1736, 2011.

[Ambroise & Matias 10] C. Ambroise and C. Matias.New consistent and asymptotically normal estimators forrandom graph mixture models.arXiv:1003.5165, 2010.

[Chatterjee & Diaconis 11] S. Chatterjee and P. Diaconis

Page 42: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Estimating and Understanding Exponential Random GraphModels.arXiv:1102.2650, 2011.

[Daudin et al. 08] J-J. Daudin, F. Picard and S. Robin.A mixture model for random graphs.Statist. Comput., 18(2):173-183, 2008.

[Frank & Harary 82] O. Frank and F.Harary.Cluster inference by using transitivity indices in empiricalgraphs.J. Amer. Statist. Assoc., 77(380):835-840, 1982.

[Frank & Strauss 86] O. Frank and D. Strauss.Markov graphs.J. Amer. Statist. Assoc., 81(395):832–842, 1986.

[Goldenberg et al. 10] A. Goldenberg, A.X. Zheng, S.E.Fienberg and E.M. Airoldi.A Survey of Statistical Network Models.Found. Trends Mach. Learn., 2(2):129-233, 2010.

Page 43: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

[Handcock et al. 07] M.S. Handcock, A.E. Raftery and J.M.TantrumModel-based clustering for social networks.J. R. Statist. Soc. A., 170(2):301–354, 2007.

[Hoff et al. 02] P.D. Hoff, A.E. Raftery and M. S. HandcockLatent space approaches to social network analysis.J. Amer. Statist. Assoc., 97(460):1090-1098, 2002.

[Hunter et al. 11] D. R. Hunter, S. M. Goodreau and M. S.Handcockergm.userterms: A Template Package for Extending statnet.http://www.stat.psu.edu/∼dhunter/, 2011.

[Holland et al. 83] P. Holland, K.B. Laskey and S. Leinhardt.Stochastic blockmodels: some first steps.Social networks, 5:109-137, 1983.

[Latouche et al. 11a] P. Latouche, E. Birmele and C.Ambroise.

Page 44: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Overlapping Stochastic Block Models With Application to theFrench Political Blogosphere.Annals of Applied Statistics, 5(1):309-336, 2011.

[Latouche et al. 11b] P. Latouche, E. Birmele and C.Ambroise.Variational Bayesian Inference and Complexity Control forStochastic Block Models.Statistical Modelling, (arXiv:0912.2873), to appear.

[Picard et al. 09] F. Picard, V. Miele, J-J. Daudin, L. Cottretand S. Robin.Deciphering the connectivity structure of biological networksusing MixNet.BMC Bioinformatics, 10:1-11, 2009.

[Snijders & Nowicki 97] T.A.B. Snijders and K. Nowicki.Estimation and prediction for stochastic blockmodels forgraphs with latent block structure.J. Classification 14(1):75-100, 1997.

[Zanghi et al. 08] H. Zanghi, C. Ambroise and V. Miele.

Page 45: Some statistical approaches in random graph modelingmath.univ-lille1.fr/~tran/Exposesgraphesaleatoires/Matias.pdf · I Exponential random graph model (ERGM) [Frank & Strauss 86]

Fast Online Graph Clustering via Erdos Renyi Mixture.Pattern Recognition, 41(12):3592-3599, 2008.