bayesian estimation of the latent dimension and communities in stochastic blockmodels · 2020. 5....
TRANSCRIPT
1/16
Bayesian community detection Results Conclusion References
JSM19 – Novel Approaches for Analyzing Dynamic Networks
Bayesian estimation of the latent dimensionand communities in stochastic blockmodels
Francesco Sanna Passino, Nick HeardDepartment of Mathematics, Imperial College London
July 30, 2019
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
2/16
Bayesian community detection Results Conclusion References
Stochastic blockmodels as random dot product graphs
Consider an undirected graph with symmetric adjacency matrix A ∈ {0, 1}n×n.
In random dot product graphs, the probability of a link between two nodes is expressedas the inner product between two latent positions xi,xj ∈ F, 0 ≤ x>y ≤ 1 ∀ x,y ∈ F:
P(Aij = 1) = x>i xj .
The stochastic blockmodel is the classical model for community detection in graphs.Given a matrix B ∈ [0, 1]K×K of within-community probabilities, the probability of alink depends on the community allocations zi and zj ∈ {1, . . . ,K} of the two nodes:
P(Aij = 1) = Bzizj .
The stochastic blockmodel can be interpreted as a special case of a random dot productgraph. IfBkh = µ>k µh with µk,µh ∈ F, and all the nodes in community k are assignedthe latent position µk, then:
P(Aij = 1) = µ>ziµzj , i < j, Aij = Aji.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
3/16
Bayesian community detection Results Conclusion References
Network embeddings
Consider an undirected graph with symmetric adjacency matrix A ∈ {0, 1}n×n, andmodified Laplacian L = D−1/2AD−1/2, D = diag(
∑ni=1Aij).
The adjacency embedding of A in Rd is:
X = [x1, . . . , xn]> = ΓΛ1/2 ∈ Rn×d,
where Λ is a d× d diagonal matrix containing the top d largest eigenvalues of A, andΓ is a n× d matrix containing the corresponding orthonormal eigenvectors.
The Laplacian embedding of A in Rd is:
X = [x1, . . . , xn]> = ΓΛ1/2 ∈ Rn×d,
where Λ is a d× d diagonal matrix containing the top d largest eigenvalues of L, and Γis a n× d matrix containing the corresponding orthonormal eigenvectors.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
4/16
Bayesian community detection Results Conclusion References
Spectral estimation of the stochastic blockmodel
Based on asymptotic properties, Rubin-Delanchy et al., 2017, propose the followingalgorithm for consistent estimation of the latent positions in stochastic blockmodels:
Algorithm 1: Spectral estimation of the stochastic blockmodel (spectral clustering)
Input: adjacency matrix A (or the Laplacian matrix L), dimension d, and number ofcommunities K ≥ d.
1 compute spectral embedding X = [x1, . . . , xn]> or X = [x1, . . . , xn]
> into Rd,2 fit a Gaussian mixture model with K components,
Result: return cluster centres µ1, . . . ,µK ∈ Rd and node memberships z1, . . . , zn.
In practice: d and K are estimated sequentially. Issues:Sequential approach is sub-optimal: the estimate of K depends on choice of d.Theoretical results only hold for d fixed and known.Distributional assumptions when d is misspecified are not available.
This talk discusses a novel framework for joint estimation of d and K.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
5/16
Bayesian community detection Results Conclusion References
A Bayesian model for network embeddings
Choose integer m ≤ n and obtain embedding X ∈ Rn×m→m arbitrarily large.Bayesian model for simultaneous estimation of d andK→ allow for d = rank(B) ≤ K .
xi|d, zi,µzi ,Σzi ,σ2zi
d∼ Nm
([µzi
0m−d
],
[Σzi 00 σ2
ziIm−d
]), i = 1, . . . , n,
(µk,Σk)|d iid∼ NIWd(0, κ0, ν0 + d− 1,∆d), k = 1, . . . ,K,
σ2kjiid∼ Inv-χ2(λ0, σ
20), j = d+ 1, . . . ,m,
d|z d∼ Uniform{1, . . . ,K∅},zi|θ iid∼ Multinoulli(θ), i = 1, . . . , n, θ ∈ SK−1,
θ|K d∼ Dirichlet( αK, . . . ,
α
K
),
Kd∼ Geometric(ω).
where K∅ is the number of non-empty communities.
Alternative: dd∼ Geometric(δ).
Yang et al., 2019, independently proposed a similar frequentist model.Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
6/16
Bayesian community detection Results Conclusion References
Empirical model validation
−1 −0.9 −0.8 −0.7 −0.6 −0.5
−0.5
0
0.5
Figure 1. Scatterplot of the simulated X1 and X2 – i.e. X:d
−0.4 −0.2 0 0.2 0.4 0.6
−0.4
−0.2
0
0.2
0.4
Figure 2. Scatterplot of the simulated X3 and X4
Simulated GRDPG-SBM with n = 2500, d = 2, K = 5.
Nodes allocated to communities with probability θk = P(zi = k) = 1/K .
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
7/16
Bayesian community detection Results Conclusion References
Empirical model validation
2 4 6 8 10 12 14
−1
−0.5
0
0.5
Dimension
Overall meanMax/min within-cluster meanWithin-cluster mean
Figure 3. Within-cluster and overall means of X:15
−0.2 −0.1 0 0.1 0.20
5
10
15
Correlation coe�cient ρ(k)ij
ρ(k)ij forX:d
Histogram of ρ(k)ij forXd:
Figure 4. Within-cluster correlation coefficients of X:30
Means are approximately 0 for columns with index > d.
Reasonable to assume correlation ρ(k)ij = 0 for i, j > d.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
8/16
Bayesian community detection Results Conclusion References
Curse of dimensionality
0 100 200 300 400 500
0
2
4
6
·10−2
Dimension
Varia
nce
Within cluster varianceTotal variance
0 5 10 15 20 25
0
2
4
6
·10−2
Dimension
Varia
nce
Figure 5. Within-block variance and total variance for the adjacency embedding obtained from a simulatedSBM with d = 2, K = 5, n = 500, and well separated means µ1 = [0.7, 0.4],µ2 = [0.1, 0.1],µ3 = [0.4, 0.8],µ4 = [−0.1, 0.5] and µ5 = [0.3, 0.5], and θ = (0.2, 0.2, 0.2, 0.2, 0.2).
For some k and k′: σ2kj ≈ σ2k′j for j � d and k 6= k′.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
9/16
Bayesian community detection Results Conclusion References
Second order clustering
Bayesian model parsimony: K underestimated for d� m.
Possible solution: second order clustering v = (v1, . . . , vK) with vk ∈ {1, . . . ,H}.If vk = vk′ , then σ2kj = σ2k′j for j > d:
xi|d, zi, vzi ,µzi ,Σzi ,σ2vzi
d∼ Nm
([µzi
0m−d
],
[Σzi 00 σ2
vziIm−d
]), i = 1, . . . , n,
vk|K,H d∼ Multinoulli(φ), k = 1, . . . ,K,
φ|H d∼ Dirichlet
(β
H, . . . ,
β
H
),
H|K d∼ Uniform{1, . . . ,K}.
The parameter vk defines clusters of clusters.
Empirical results show that the model is able to handle d� m.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
10/16
Bayesian community detection Results Conclusion References
N3(µ1,Σ1)
N3(µ2,Σ2)
N3(µ3,Σ3)
N3(µ4,Σ4)
N3(µ5,Σ5) N8(08,σ23I5)
N8(08,σ22I5)
N8(08,σ21I5)
3 = latent dimension
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
11/16
Bayesian community detection Results Conclusion References
Extension to directed and bipartite graphs
Consider a directed graph with adjacency matrix A ∈ {0, 1}n×n.
The d-dimensional adjacency embedding of A in R2d, is defined as:
X = UD1/2 ⊕ VD1/2 =[UD1/2 VD1/2
]=[Xs Xr
].
where A = UDV> + U⊥D⊥V>⊥ is the SVD decomposition of A, where U ∈ Rn×d,D ∈ Rd×d
+ diagonal, and V ∈ Rn×d.
Essentially, only three distributions change:
xi|d,K, zi d∼ N2m
µzi
0m−dµ′zi
0m−d
,
Σzi 0 0 00 σ2
ziIm−d 0 00 0 Σ′zi 00 0 0 σ2′
ziIm−d
,
(µk,Σk)|d,K iid∼ NIWd(0, κ0, ν0 + d− 1,∆d), k = 1, . . . ,K,
σ2kj |d,Kiid∼ Inv-χ2(λ0, σ
20), j = d+ 1, . . . ,m.
Co-clustering: different clusters for sources and receivers→ bipartite graphs.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
12/16
Bayesian community detection Results Conclusion References
Enron e-mail network: number of clusters
5 10 150
1
2
3·105
K∅H∅
Figure 6. Posterior histogram of K∅ and H∅, unconstrainedmodel, MAP for d in red.
2 4 6 8 10 12 14 160
0.5
1
1.5
2
2.5
·105
K∅H∅
Figure 7. Posterior histogram of K∅ and H∅, constrainedmodel, MAP for d in red.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
13/16
Bayesian community detection Results Conclusion References
Enron e-mail network: number of clusters
4 6 8 10 12 14 160
0.2
0.4
0.6
0.8
1
1.2
·105
K∅
Figure 8. Posterior histogram of K∅, unconstrained modelwithout second order clustering, MAP for d in red.
6 8 10 12 14 160
0.2
0.4
0.6
0.8
1·105
K∅
Figure 9. Posterior histogram of K∅, constrained model with-out second order clustering, MAP for d in red.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
14/16
Bayesian community detection Results Conclusion References
Enron e-mail network: scree-plot
0 50 100 150
0
5
10
15
20
25
Figure 10. Singular values of the adjacency matrix.
Choice of d is consistent with the elbow of the scree-plot.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
15/16
Bayesian community detection Results Conclusion References
Conclusion
Community detection and stochastic blockmodels:Bayesian model for simultaneous selection of K andd in generalised random dot product graphs,Allow for initial misspecification of the arbitrarily largeparameter m, then refine estimate d,Gaussian mixture model (with constraints) based onspectral embedding,Easy to extend to directed and bipartite graphs.
More details:Sanna Passino and Heard, 2019 – arXiv: 1904.05333.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels
16/16
Bayesian community detection Results Conclusion References
References
Athreya, A. et al. (2018). “Statistical Inference on Random Dot Product Graphs: a Survey”.In: Journal of Machine Learning Research 18.226, pp. 1–92.
Rubin-Delanchy, P. et al. (2017). “A statistical interpretation of spectral embedding: thegeneralised random dot product graph”. In: ArXiv e-prints. arXiv: 1709.05506.
Sanna Passino, F. and N. A. Heard (2019). “Bayesian estimation of the latent dimensionand communities in stochastic blockmodels”. In: arXiv e-prints. arXiv: 1904.05333.
Yang, C. et al. (2019). “Simultaneous dimensionality and complexity model selectionfor spectral graph clustering”. In: arXiv e-prints. arXiv: 1904.02926.
Francesco Sanna Passino Imperial College London
Bayesian estimation of the latent dimension and communities in stochastic blockmodels