Estimating Clique Composition and Size Distributions from Sampled Network Data
Minas Gjoka, Emily Smith, Carter T. Butts
University of California, Irvine
Outline
• Problem statement
• Estimation methodology
• Results with real-life graphs
Cliques
A complete subgraph that contains i vertices is an order-i clique
[Figure: example cliques of order 1, 2, 3, 4, 5, …, i]
A maximal clique is a clique that is not included in a larger clique
Example: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques: {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}.
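The two definitions above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dictionary of neighbor sets; the helper names `is_clique` and `is_maximal_clique` are illustrative and do not come from the slides.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(b in adj[a] for a, b in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """True if the nodes form a clique that no other vertex can extend."""
    members = set(nodes)
    if not is_clique(adj, members):
        return False
    # A vertex w extends the clique if all members are neighbors of w.
    return not any(members <= adj[w] for w in adj if w not in members)


# Hypothetical graph: the complete graph on vertices a, b, c, d.
ADJ = {v: set("abcd") - {v} for v in "abcd"}

# Every 3-subset of an order-4 clique is an order-3 clique, but none is maximal.
TRIANGLES = [t for t in combinations("abcd", 3) if is_clique(ADJ, t)]
```

Here `TRIANGLES` holds the four order-3 cliques, and only the full vertex set passes `is_maximal_clique`.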
Counting of Cliques
[Figure: example graph G with 8 vertices, labeled 1 to 8, and its cliques of order 1 to 4]
Ci is the count of order-i cliques (maximal or non-maximal)
Clique Distribution of G: C = (C1, C2, C3, C4) = (0, 1, 2, 1)
Goal 1: Estimate Ci (for all i) in graph G from sampled network data
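When the whole graph is known and small, the counts Ci can be computed exactly by brute force, which gives a reference point for any estimator. The sketch below counts all cliques of each order, maximal or not. The edge list is a hypothetical 8-vertex graph, not the graph drawn on the slide, and `clique_counts` is an illustrative name.

```python
from itertools import combinations


def build_adj(edges):
    """Adjacency sets for an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clique_counts(adj, max_order):
    """Return [C1, ..., C_max_order], counting every clique of each order."""
    counts = []
    for order in range(1, max_order + 1):
        counts.append(sum(
            1
            for nodes in combinations(sorted(adj), order)
            if all(b in adj[a] for a, b in combinations(nodes, 2))
        ))
    return counts


# Hypothetical graph with three triangles: {1,2,5}, {4,5,8}, {6,7,8}.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
ADJ = build_adj(EDGES)
C = clique_counts(ADJ, 4)
```

Enumerating all vertex subsets is only feasible for tiny graphs, which is part of the motivation for estimating Ci from a sample instead.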
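Counting of Cliques: Vertex Attributes
[Figure: example graph G with 8 vertices, each carrying one of p = 3 attribute values]
Vertex attribute vector Xj, j = 1..p, p <= N
Example composition vectors for p = 3: u = [3 0 0], u = [2 1 0], u = [2 0 1]
Cu is the count of order-u cliques, where entry k of u is the number of clique vertices with attribute value k
Clique Composition Distribution of G: the counts Cu over all u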
Goal 2: Estimate Cu (for all u) in graph G from sampled network data
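A composition count can be sketched the same way as a plain clique count, by grouping each clique under its vector u. This is a minimal illustration under assumptions: attribute values are integers 1..p, and both the graph and the attribute assignment are hypothetical, chosen so that the three triangles have the compositions [3 0 0], [2 1 0] and [2 0 1].

```python
from collections import Counter
from itertools import combinations


def composition_counts(adj, attr, order, p):
    """Count cliques of the given order, keyed by composition vector u."""
    counts = Counter()
    for nodes in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(nodes, 2)):
            u = [0] * p
            for v in nodes:
                u[attr[v] - 1] += 1  # attribute values are 1..p
            counts[tuple(u)] += 1
    return counts


# Hypothetical graph with triangles {1,2,5}, {4,5,8}, {6,7,8}.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

# Hypothetical attribute value (1, 2 or 3) for each vertex.
ATTR = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1, 6: 3, 7: 1, 8: 1}
CU = composition_counts(ADJ, ATTR, order=3, p=3)
```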
Motivation
• Counting of Cliques
– cliques describe local structure (clustering, cohesive subgroups)
– algorithmic implications of cliques in engineering context
– cliques used as input in network models
• Sampled network data
– unknown graphs with access limitations
– massive known graphs
Related Work
• Model-based methods
– Do not scale
– Do not help with counting
• Design-based methods
– Subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
o No support for subgraphs of size larger than 10
o No support for vertex attributes
o Biased estimation
Estimation
Methodology
1. Collect an egocentric network sample H1, .., Hn
a) Collect a probability sample of n nodes from the graph: Vj, X[Vj], j = 1..n
• uniform independence sampling
• weighted independence sampling
• link-trace sampling
• with replacement
• without replacement
b) Fetch the egonet of each sampled node: G[Vj], j = 1..n
2. Calculate the clique count Ci (or Cu) in each egonet Hj
– can use existing exact clique counting algorithms
– clique type is determined by the counting algorithm
3. Apply an estimation method that combines the calculations
– Clique Degree Sums (CDS)
o labeling of neighbors not required, more space efficient
– Distinct Clique Counting (CC)
o higher accuracy
[Figure: graph G(V,E) with 8 vertices and the egonets of n = 2 sampled nodes, with order-3 clique counts C3 of 1 and 0]
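Steps 1 and 2 of the methodology can be sketched as follows, assuming uniform independence sampling without replacement and a brute-force exact counter inside each egonet. The graph is hypothetical and the function names (`egonet`, `count_cliques`, `sample_egonets`) are illustrative, not taken from the authors' code.

```python
import random
from itertools import combinations


def build_adj(edges):
    """Adjacency sets for an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def egonet(adj, ego):
    """Step 1b: induced subgraph on the ego and its neighbors."""
    nodes = adj[ego] | {ego}
    return {v: adj[v] & nodes for v in nodes}


def count_cliques(adj, order):
    """Step 2: exact brute-force count of cliques of the given order."""
    return sum(
        1
        for nodes in combinations(sorted(adj), order)
        if all(b in adj[a] for a, b in combinations(nodes, 2))
    )


def sample_egonets(adj, n, rng):
    """Step 1: draw n distinct egos uniformly at random, then fetch egonets."""
    egos = rng.sample(sorted(adj), n)
    return {ego: egonet(adj, ego) for ego in egos}


# Hypothetical 8-vertex graph with triangles {1,2,5}, {4,5,8}, {6,7,8}.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
G = build_adj(EDGES)
H = sample_egonets(G, 2, random.Random(0))
C3_PER_EGONET = {ego: count_cliques(h, 3) for ego, h in H.items()}
```

Any exact clique counting algorithm could replace `count_cliques`; the brute-force version is used here only to keep the sketch short.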
Labeling of neighbors
• Distinct Clique Counting (CC)
– labeled neighbors
• Clique Degree Sums (CDS)
– unlabeled neighbors
[Figure: graph G with 9 vertices and n = 2 sampled egonets, shown with labeled neighbors and with unlabeled neighbors; the count C3 is calculated in each case]
Clique Degree Sums (unlabeled neighbors)
The order-i clique degree dij is the number of i-cliques that node j belongs to.
[Figure: graph G(V,E) with 8 vertices; the egonet H8 of node 8 gives d38 = 2]
Di is the order-i clique degree sum, taken over all nodes: $D_i = \sum_{j \in V} d_{ij}$
D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
D3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D3 = 9
D3 = 3 C3
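The identity D3 = 3 C3 holds because each order-3 clique is counted once by each of its three vertices. A small sketch can confirm this on a hypothetical graph whose edges were chosen to reproduce the clique degrees listed above (1, 1, 0, 1, 2, 1, 1, 2); the actual edges of the slide's graph are not known, and `clique_degrees` is an illustrative name.

```python
from itertools import combinations


def clique_degrees(adj, order):
    """Map each node j to d_ij, the number of order-i cliques containing j."""
    degrees = {v: 0 for v in adj}
    for nodes in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(nodes, 2)):
            for v in nodes:
                degrees[v] += 1
    return degrees


# Hypothetical graph with triangles {1,2,5}, {4,5,8}, {6,7,8}.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

D3_PER_NODE = clique_degrees(ADJ, 3)
D3 = sum(D3_PER_NODE.values())  # order-3 clique degree sum
C3 = D3 // 3                    # each triangle is counted by its 3 vertices
```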
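Clique Degree Sums (unlabeled neighbors): estimator
Each order-i clique is counted once by each of its i vertices, so $C_i = \frac{1}{i} \sum_{j \in V} d_{ij}$ (sum over all nodes).
$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ is a design-unbiased Horvitz-Thompson estimator of $C_i$, where S is the set of sampled nodes, $d_{ij}$ is the number of i-cliques that node j belongs to, and $\pi_j$ is the inclusion probability of node j.
$\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$ is a design-unbiased Horvitz-Thompson estimator of $C_u$, where $d_{uj}$ is the number of u-cliques that node j belongs to and $|u|$ is the number of vertices in a u-clique.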
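As an illustration of the Clique Degree Sums idea, the sketch below assumes the estimator has the Horvitz-Thompson form (1/i) * sum over sampled nodes of d_ij / pi_j, with pi_j = n/N under uniform independence sampling without replacement. The clique degree of an ego is computed from its neighbors only, so neighbor labels are not compared across egonets. The graph and the function names are hypothetical.

```python
from itertools import combinations


def ego_clique_degree(adj, ego, order):
    """d_ij for j = ego: cliques of the given order that contain the ego.

    Such a clique is the ego plus (order - 1) mutually adjacent neighbors,
    so the ego's egonet is enough to compute it.
    """
    return sum(
        1
        for rest in combinations(sorted(adj[ego]), order - 1)
        if all(b in adj[a] for a, b in combinations(rest, 2))
    )


def cds_estimate(adj, sample, incl_prob, order):
    """Assumed form: (1 / order) * sum over sampled j of d_ij / pi_j."""
    total = sum(ego_clique_degree(adj, j, order) / incl_prob[j] for j in sample)
    return total / order


# Hypothetical graph with triangles {1,2,5}, {4,5,8}, {6,7,8}, so C3 = 3.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

N, n = len(ADJ), 4
PI = {j: n / N for j in ADJ}  # uniform sampling without replacement
ESTIMATE = cds_estimate(ADJ, [1, 3, 5, 8], PI, order=3)
```

A single sample can overshoot or undershoot the true count; design-unbiasedness means that the average of the estimate over all possible samples equals C3.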
Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$. The estimators depend on the node inclusion probability and the joint node inclusion probability, which are determined by the sampling design:
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling
• Without replacement
• With replacement
Example: Uniform Independence Sampling without replacement, with the variance expressed through the node inclusion probability and the joint node inclusion probability (sums over all nodes and over sampled nodes).
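For uniform independence sampling without replacement, the two probabilities have standard closed forms: pi_j = n/N and pi_jk = n(n-1) / (N(N-1)) for j != k. The sketch below plugs them into the textbook Horvitz-Thompson variance estimator for a total and rescales by 1/i^2 to get a variance estimate for the clique count. This is an assumed form, not necessarily the exact expression on the slide. The clique degrees are the values d31..d38 from the D3 example, and the function names are illustrative.

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimate of a total with equal inclusion probabilities."""
    return sum(value / pi for value in y.values())


def ht_variance_estimate(y, n, N):
    """Textbook HT variance estimator of the total; requires n >= 2.

    y maps each sampled node to its value (here, its clique degree).
    """
    pi = n / N
    pi_joint = n * (n - 1) / (N * (N - 1))
    total = 0.0
    for j in y:
        for k in y:
            p_jk = pi if j == k else pi_joint  # pi_jj is defined as pi_j
            total += (p_jk - pi * pi) / p_jk * (y[j] / pi) * (y[k] / pi)
    return total


# Order-3 clique degrees d31..d38 from the D3 example.
D3 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}
N, n, ORDER = 8, 4, 3

SAMPLE = [1, 4, 5, 8]  # one hypothetical sample of n = 4 nodes
Y = {j: D3[j] for j in SAMPLE}
C3_HAT = ht_total(Y, n / N) / ORDER
VAR_C3_HAT = ht_variance_estimate(Y, n, N) / ORDER ** 2
```

The division by `ORDER ** 2` follows from the estimate of C3 being the estimated clique degree sum divided by 3.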
Distinct Clique Counting (labeled neighbors)
$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$ is a design-unbiased Horvitz-Thompson estimator of $C_i$, where $c_i$ is the number of distinct i-cliques in H1, .., Hn and $\pi_k$ is the inclusion probability of i-clique k.
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling
• With replacement
• Without replacement
Example: Uniform Independence Sampling with replacement. An i-clique is observed when at least one of its i vertices is sampled, so $\pi_k = 1 - (1 - i/N)^n$.
Distinct Clique Counting (labeled neighbors): example
[Figure: graph G with N = 8 vertices; n = 4 egonets sampled by UIS with replacement; observed order-3 cliques and distinct order-3 cliques]
$\pi_k = 1 - (1 - 3/8)^4$
$\hat{C}_i = 2 / (1 - (1 - 3/8)^4) = 2.36$ (i = 3, with 2 distinct order-3 cliques observed)
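The worked example can be reproduced with a short sketch. It assumes uniform independence sampling with replacement, so every i-clique has the same inclusion probability 1 - (1 - i/N)^n, and it uses labeled neighbors to recognize the same clique in different egonets. The graph, the sampled egos and the function names are hypothetical; they were chosen so that N = 8, n = 4 and two distinct order-3 cliques are observed, as in the example.

```python
from itertools import combinations


def clique_inclusion_prob(order, N, n):
    """P(at least one of the clique's vertices is drawn in n uniform draws)."""
    return 1.0 - (1.0 - order / N) ** n


def ego_cliques(adj, ego, order):
    """Labeled cliques of the given order that contain the ego."""
    return {
        frozenset((ego,) + rest)
        for rest in combinations(sorted(adj[ego]), order - 1)
        if all(b in adj[a] for a, b in combinations(rest, 2))
    }


def cc_estimate(adj, egos, order):
    """Number of distinct observed cliques divided by their inclusion probability."""
    distinct = set()
    for ego in egos:
        distinct |= ego_cliques(adj, ego, order)
    return len(distinct) / clique_inclusion_prob(order, len(adj), len(egos))


# Hypothetical graph with triangles {1,2,5}, {4,5,8}, {6,7,8}.
EDGES = [(1, 2), (1, 5), (2, 5), (2, 3), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

EGOS = [8, 6, 8, 3]  # n = 4 draws with replacement; node 8 is drawn twice
C3_HAT = cc_estimate(ADJ, EGOS, order=3)  # 2 distinct triangles are observed
```

With these inputs the estimate rounds to 2.36, matching the example.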
Computational complexity
• Space complexity to count Ci or Cu
– O(1) for the Clique Degree Sums method
– O(ci) or O(cu) for the Distinct Clique Counting method
• Time complexity
– from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
– from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet
Benefits of our methodology
• Full knowledge of graph not required
• Fast estimation for massive known graphs
• Estimation or exact computation easily parallelizable for massive known graphs
• Estimation with or without neighbor labels
• Supports vertex attributes
• Supports a variety of sampling designs
Results
Simulation Results: Facebook New Orleans
• Egonet sample size n = 1,000
• Uniform independence sampling, without replacement
• 1000 simulations
• Error metric: Normalized Mean Absolute Error (NMAE)
[Figures: simulation results for Clique Degree Sums and Distinct Clique Counting]
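The error metric can be computed from a list of simulated estimates. The sketch below assumes the common definition of Normalized Mean Absolute Error, the mean absolute deviation of the estimates from the true count divided by the true count. The function name `nmae` and the sample values are hypothetical.

```python
def nmae(estimates, true_value):
    """Mean of |estimate - true_value| over simulations, divided by true_value."""
    if true_value == 0:
        raise ValueError("NMAE is undefined when the true count is zero")
    abs_errors = [abs(e - true_value) for e in estimates]
    return sum(abs_errors) / (len(abs_errors) * true_value)


# Hypothetical estimates of a count whose true value is 3.
ERROR = nmae([2.0, 4.0, 3.0, 3.0], true_value=3.0)
```

An NMAE of 0 means every simulated estimate matched the true count exactly.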
Which estimation method to use? Heuristic
Average Edge Count = (All edges between egos and neighbors) / (Unique edges between egos and neighbors)
[Figure: graph G with N = 8 vertices and n = 3 sampled egonets]
Average Edge Count = 9 / 6 = 1.5
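The ratio can be computed directly from the sampled egos. The sketch below assumes that only edges incident to an ego are counted, and that an edge joining two sampled egos is seen once from each side. The graph is a hypothetical 6-vertex example built to give the same ratio of 9 / 6, not the graph on the slide, and `average_edge_count` is an illustrative name.

```python
def average_edge_count(adj, egos):
    """All ego-neighbor edges divided by the unique ego-neighbor edges."""
    all_edges = [frozenset((ego, nb)) for ego in egos for nb in adj[ego]]
    return len(all_edges) / len(set(all_edges))


# Hypothetical graph: a triangle 1-2-3 with one extra neighbor per vertex.
EDGES = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 5), (3, 6)]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

# Egos 1, 2, 3 see 3 edges each (9 in total); the 3 triangle edges are
# seen twice, leaving 6 unique edges.
RATIO = average_edge_count(ADJ, [1, 2, 3])
```

A ratio of 1 means the sampled egonets share no ego-neighbor edges; larger values indicate more overlap between egonets.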
Estimation Results: Facebook ‘09
• Facebook ‘09 crawled dataset [1]
– 36,628 unique egonets
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.
Estimation Results: vertex attributes, Facebook ‘09
• Complemented dataset with gender attributes
– about 6 million users
References
[1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators
Thank you!
• Unbiased estimation methods of clique distributions
– Clique Degree Sums
– Distinct Clique Counting
• Facebook cliques
• Future work
– support estimation of any subgraphs (beyond cliques)