Estimating Clique Composition and Size Distributions from Sampled Network Data

Minas Gjoka, Emily Smith, Carter T. Butts

University of California, Irvine

Outline

• Problem statement
• Estimation methodology
• Results with real-life graphs

Cliques

A complete subgraph that contains i vertices is an order-i clique

Figure: example cliques of order 1, 2, 3, 4, 5, …, i.

A maximal clique is a clique that is not included in a larger clique

Example: an order-4 clique on vertices a, b, c, d contains 4 non-maximal order-3 cliques (abc, abd, acd, bcd).
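The definitions of clique and maximal clique can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dictionary of adjacency sets; the helper names `is_clique` and `is_maximal` are illustrative and do not come from the slides.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_maximal(adj, nodes):
    """True if no outside vertex is adjacent to every member of the clique."""
    members = set(nodes)
    return not any(members <= adj[w] for w in adj if w not in members)

# Complete graph on a, b, c, d (one order-4 clique), as adjacency sets.
adj = {v: set("abcd") - {v} for v in "abcd"}

triangles = [t for t in combinations("abcd", 3) if is_clique(adj, t)]
non_maximal = [t for t in triangles if not is_maximal(adj, t)]
print(len(triangles), len(non_maximal), is_maximal(adj, "abcd"))
```

Every order-3 clique found here sits inside the order-4 clique, so none of them is maximal, while the order-4 clique itself is.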

Counting of Cliques

Figure: example graph G with vertices 1 to 8.

Ci is the count of order-i cliques (maximal or non-maximal).

Clique distribution of G:

C = (C1, C2, C3, C4) = (0, 1, 2, 1)

Goal 1: Estimate Ci (for all i) in graph G from sampled network data
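For a small, fully known graph, the counts Ci can be obtained by brute force, which gives a ground truth to compare estimates against. The sketch below is an assumption-laden illustration: the 8-vertex edge list is hypothetical (the edge set of the slide figure is not recoverable), and `clique_counts` is an illustrative helper that counts both maximal and non-maximal cliques.

```python
from itertools import combinations

def clique_counts(adj, max_order):
    """Return [C_1, ..., C_max_order], counting maximal and non-maximal cliques."""
    counts = []
    for i in range(1, max_order + 1):
        c = sum(
            all(v in adj[u] for u, v in combinations(group, 2))
            for group in combinations(sorted(adj), i)
        )
        counts.append(c)
    return counts

# Hypothetical 8-vertex graph (not the graph drawn in the slides).
edges = [(1, 2), (1, 5), (2, 5), (3, 4), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {v: set() for v in range(1, 9)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(clique_counts(adj, 4))
```

Exhaustive enumeration like this only works for tiny graphs; it is the cost that sampling-based estimation is meant to avoid.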

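Counting of Cliques: Vertex Attributes

Figure: example graph G with vertices 1 to 8, each carrying a vertex attribute.

Vertex attribute vector Xj, j=1..p, p<=N (here p = 3)

Example composition vectors: u = [ 3 0 0 ], u = [ 2 1 0 ], u = [ 2 0 1 ]

Cu is the count of order-u cliques.

Clique composition distribution of G: the counts Cu over all composition vectors u.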

Goal 2: Estimate Cu (for all u) in graph G from sampled network data
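A small sketch of counting cliques by composition follows. It assumes that entry k of u is the number of clique members carrying attribute value k, which matches the example vectors above but is an interpretation rather than something the slides state outright. The graph, the attribute assignment, and the helper `composition_counts` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def composition_counts(adj, attr, order, p):
    """Count cliques of the given order, keyed by composition vector u (length p)."""
    counts = Counter()
    for group in combinations(sorted(adj), order):
        if all(v in adj[u] for u, v in combinations(group, 2)):
            u_vec = [0] * p
            for v in group:
                u_vec[attr[v]] += 1  # assumed meaning of u: members per attribute value
            counts[tuple(u_vec)] += 1
    return counts

# Hypothetical 8-vertex graph and attribute values in {0, 1, 2} (p = 3).
edges = [(1, 2), (1, 5), (2, 5), (3, 4), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {v: set() for v in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
attr = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 0, 7: 2, 8: 0}

c_u = composition_counts(adj, attr, order=3, p=3)
print(dict(c_u))
```

With this attribute assignment, each of the three order-3 cliques has a different composition, so every Cu that appears equals 1.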

Motivation

• Counting of cliques
– cliques describe local structure (clustering, cohesive subgroups)
– algorithmic implications of cliques in engineering contexts
– cliques are used as input in network models

• Sampled network data
– unknown graphs with access limitations
– massive known graphs

Related Work

• Model-based methods
– do not scale
– do not help with counting

• Design-based methods
– subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
– no support for subgraphs of size larger than 10
– no support for vertex attributes
– biased estimation

Estimation Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of n nodes from the graph: Vj, X[Vj], j=1..n
– design: uniform independence sampling, weighted independence sampling, or link-trace sampling
– with or without replacement


b) Fetch the egonet of each sampled node: G[Vj], j=1..n

Figure: graph G(V,E) with n=2 sampled nodes and their egonets; the running example counts C3.
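Step 1 can be sketched as follows for uniform independence sampling. This is an illustration under stated assumptions: the egonet is taken to be the subgraph induced by the ego and its neighbors, the 8-vertex graph is hypothetical, and `egonet` and `sample_egonets` are illustrative names.

```python
import random

def egonet(adj, ego):
    """Subgraph induced by the ego and its neighbors, as adjacency sets."""
    nodes = adj[ego] | {ego}
    return {v: adj[v] & nodes for v in nodes}

def sample_egonets(adj, n, replace=False, seed=None):
    """Draw n egos uniformly at random and fetch the egonet of each."""
    rng = random.Random(seed)
    population = sorted(adj)
    if replace:
        egos = [rng.choice(population) for _ in range(n)]
    else:
        egos = rng.sample(population, n)
    return [(ego, egonet(adj, ego)) for ego in egos]

# Hypothetical 8-vertex graph (not the graph drawn in the slides).
edges = [(1, 2), (1, 5), (2, 5), (3, 4), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {v: set() for v in range(1, 9)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

samples = sample_egonets(adj, n=2, seed=0)
for ego, net in samples:
    print(ego, sorted(net))
```

Weighted independence sampling and link-trace sampling would change how the egos are drawn and their inclusion probabilities, but not how each egonet is fetched.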





2. Calculate the clique count Ci (or Cu) in each egonet Hj
– can use existing exact clique counting algorithms
– clique type is determined by the counting algorithm








3. Apply an estimation method that combines the per-egonet calculations
– Clique Degree Sums (CDS)
o labeling of neighbors not required, more space efficient
– Distinct Clique Counting (CC)
o higher accuracy

Labeling of neighbors

Figure: example graph G with vertices 1 to 9 and n=2 sampled egonets (Vj, X[Vj], G[Vj]); the running example counts C3.




• Distinct Clique Counting (CC)
– labeled neighbors

• Clique Degree Sums (CDS)
– unlabeled neighbors

Figure: the n=2 sampled egonets of graph G, shown with labeled neighbors and with unlabeled neighbors; the count C3 is calculated in each case.

Clique Degree Sums (unlabeled neighbors)

The order-i clique degree dij is the number of order-i cliques that node j belongs to.

Example (graph G(V,E) with vertices 1 to 8, counting C3): in egonet H8, d38 = 2.


Di is the order-i clique degree sum: the sum, over all nodes j in V, of the number of order-i cliques that node j belongs to.

$$D_i = \sum_{j \in V} d_{ij}$$

Example (graph G(V,E), order 3):

D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
D3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D3 = 9
D3 = 3 C3

Each order-i clique is counted once by each of its i members, so Di = i Ci.
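The clique degree of a node can be computed from its egonet alone, because every clique containing node j consists of j plus mutually adjacent neighbors of j. The sketch below uses a hypothetical 8-vertex edge list chosen so that the order-3 clique degrees reproduce the values in the example above; the edge set of the slide figure itself is not recoverable, and `clique_degree` is an illustrative helper.

```python
from itertools import combinations

def clique_degree(adj, j, i):
    """d_ij: number of order-i cliques containing node j (uses only j's neighbors)."""
    return sum(
        all(v in adj[u] for u, v in combinations(group, 2))
        for group in combinations(sorted(adj[j]), i - 1)
    )

# Hypothetical graph whose order-3 clique degrees match the example above.
edges = [(1, 2), (1, 5), (2, 5), (3, 4), (4, 5),
         (4, 8), (5, 8), (6, 7), (6, 8), (7, 8)]
adj = {v: set() for v in range(1, 9)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

d3 = [clique_degree(adj, j, 3) for j in range(1, 9)]
D3 = sum(d3)
print(d3, D3, D3 // 3)
```

Dividing the degree sum by the clique order recovers the clique count, which is the identity the Clique Degree Sums method relies on.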

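Estimator from the sampled nodes S, where $\pi_j$ is the inclusion probability of node j and $d_{ij}$ is the number of i-cliques that node j belongs to:

$$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$$

is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$).

For clique compositions, where $d_{uj}$ is the number of u-cliques that node j belongs to and $|u|$ is the order of a u-clique:

$$\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$$

is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_u] = C_u$).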
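Design-unbiasedness can be checked numerically on a tiny example by averaging the estimate over every possible sample. The sketch below assumes uniform independence sampling without replacement, so each node has inclusion probability n/N. It reuses the order-3 clique degrees from the D3 example, for which C3 = 3; `cds_estimate` is an illustrative name, not the authors' code.

```python
from itertools import combinations

def cds_estimate(clique_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate of C_i from the clique degrees of sampled nodes."""
    d_hat = sum(d / pi for d, pi in zip(clique_degrees, inclusion_probs))
    return d_hat / order

# Order-3 clique degrees of nodes 1..8 from the D3 example (true C3 = 3).
d3 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}
N, n = len(d3), 2
pi = n / N  # inclusion probability under UIS without replacement

# Average the estimate over all possible samples of n distinct nodes.
estimates = [
    cds_estimate([d3[j] for j in s], [pi] * n, order=3)
    for s in combinations(sorted(d3), n)
]
mean_estimate = sum(estimates) / len(estimates)
print(min(estimates), max(estimates), mean_estimate)
```

Individual estimates vary widely for such a small sample, but their average over all samples equals C3 = 3 up to floating-point rounding.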

Clique Degree Sums: Estimator Variance

We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$. These require:
• the node inclusion probability $\pi_j$
• the joint node inclusion probability $\pi_{jk}$

Both depend on the sampling design:
• Uniform independence sampling, weighted independence sampling, or link-trace sampling
• Without or with replacement

Example: uniform independence sampling without replacement, with N nodes in total and n sampled nodes:

$$\pi_j = \frac{n}{N}, \qquad \pi_{jk} = \frac{n(n-1)}{N(N-1)} \quad (j \neq k)$$
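For reference, the textbook Horvitz-Thompson variance estimator is written out below for the clique degree sum, with $\pi_j$ the node inclusion probability, $\pi_{jk}$ the joint node inclusion probability, and S the set of sampled nodes. The variance formula on the slide did not survive extraction, so this is the standard form and only assumed to be the one intended; it requires $\pi_{jk} > 0$ for all pairs.

```latex
\hat{V}\!\left(\hat{D}_i\right)
  = \sum_{j \in S} \sum_{k \in S}
    \frac{\pi_{jk} - \pi_j \pi_k}{\pi_{jk}}
    \cdot \frac{d_{ij}}{\pi_j} \cdot \frac{d_{ik}}{\pi_k},
  \qquad \pi_{jj} = \pi_j,
\qquad
\hat{V}\!\left(\hat{C}_i\right) = \frac{1}{i^2}\, \hat{V}\!\left(\hat{D}_i\right)
```

The second identity follows from $\hat{C}_i = \hat{D}_i / i$; replacing $d_{ij}$ with $d_{uj}$ and $i$ with the order of a u-clique gives the corresponding expression for $\hat{C}_u$.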

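Distinct Clique Counting (labeled neighbors)

$$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$$

is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$), where
• $c_i$ is the number of distinct i-cliques in H1, .., Hn
• $\pi_k$ is the inclusion probability of the k-th distinct i-clique

The i-clique inclusion probability depends on the sampling design:
• Uniform independence sampling, weighted independence sampling, or link-trace sampling
• With or without replacement

Example: uniform independence sampling with replacement. An i-clique is counted when at least one of its i members is sampled as an ego, so

$$\pi_k = 1 - \left(1 - \frac{i}{N}\right)^n$$

and, because $\pi_k$ is the same for every i-clique, $\hat{C}_i = c_i / \pi_k$.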


Example: graph G with N=8 nodes and order-3 cliques labeled a, b, c; n=4 egos drawn by UIS with replacement.

$$\pi_k = 1 - (1 - 3/8)^4$$

Among the observed order-3 cliques (counting repeats across egonets), 2 are distinct, so

$$\hat{C}_3 = \frac{2}{1 - (1 - 3/8)^4} \approx 2.36$$
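The arithmetic of this example can be reproduced in a few lines. The sketch assumes uniform independence sampling with replacement; the list of observed cliques is hypothetical (only the count of 2 distinct cliques comes from the example), and the function names are illustrative.

```python
def clique_inclusion_prob_uis_wr(order, N, n):
    """P(at least one of the clique's members is drawn in n uniform draws with replacement)."""
    return 1.0 - (1.0 - order / N) ** n

def dcc_estimate(num_distinct, order, N, n):
    """Horvitz-Thompson estimate when all cliques of this order share one inclusion probability."""
    return num_distinct / clique_inclusion_prob_uis_wr(order, N, n)

# Hypothetical observations: labeled neighbors let repeated cliques be merged.
observed = [(6, 7, 8), (7, 6, 8), (8, 6, 7), (1, 2, 5)]
distinct = {frozenset(c) for c in observed}

c3_hat = dcc_estimate(len(distinct), order=3, N=8, n=4)
print(len(observed), len(distinct), round(c3_hat, 2))
```

Storing each clique as a `frozenset` is what makes deduplication possible, and it is also why this method needs space proportional to the number of distinct cliques observed.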

Computational complexity

• Space complexity to count Ci or Cu
– O(1) for the Clique Degree Sums method
– O(ci) or O(cu) for the Distinct Clique Counting method

• Time complexity
– from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
– from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet

Benefits of our methodology

• Full knowledge of graph not required
• Fast estimation for massive known graphs
• Estimation or exact computation easily parallelizable for massive known graphs
• Estimation with or without neighbor labels
• Supports vertex attributes
• Supports a variety of sampling designs

Results

Simulation Results: Facebook New Orleans

• Egonet sample size n=1,000
• Uniform independence sampling, without replacement
• 1000 simulations
• Error metric: Normalized Mean Absolute Error

Figure: simulation results for Clique Degree Sums and Distinct Clique Counting.
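The formula for the error metric did not survive extraction. A minimal sketch follows, assuming the usual definition of Normalized Mean Absolute Error: the mean absolute deviation of the estimates from the true count, divided by the true count. The function name and the sample values are illustrative only.

```python
def nmae(estimates, true_value):
    """Mean absolute error of the estimates, normalized by the true value."""
    mean_abs_error = sum(abs(e - true_value) for e in estimates) / len(estimates)
    return mean_abs_error / true_value

# Illustrative values only: four simulated estimates of a true count of 3.
print(nmae([2.0, 4.0, 3.0, 3.0], true_value=3.0))
```

In a simulation study, `estimates` would hold one estimate per simulation run for a fixed clique order.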

Which estimation method to use? Heuristic

Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors)

Example: graph G with N=8 nodes and n=3 sampled egonets.

Average Edge Count = 9 / 6 = 1.5
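The heuristic is a ratio of two edge counts over the sampled egonets. The sketch below uses hypothetical egonets, each given as a list of edges, chosen so that the totals match the 9 and 6 of the example above; `average_edge_count` is an illustrative name.

```python
def average_edge_count(egonet_edges):
    """Total edges over all sampled egonets divided by the number of unique edges."""
    all_edges = [frozenset(e) for edges in egonet_edges for e in edges]
    return len(all_edges) / len(set(all_edges))

# Hypothetical sample of n=3 egonets; two of them cover the same triangle.
egonets = [
    [(6, 7), (6, 8), (7, 8)],
    [(7, 6), (7, 8), (6, 8)],
    [(1, 2), (2, 5), (1, 5)],
]
print(average_edge_count(egonets))
```

A value of 1 means no edge was observed twice; larger values indicate more overlap between the sampled egonets.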

Estimation Results: Facebook '09

• Facebook '09 crawled dataset [1]
– 36,628 unique egonets

[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, "Walking in Facebook: A Case Study of Unbiased Sampling of OSNs", IEEE INFOCOM 2010.

Estimation Results: vertex attributes, Facebook '09

• Complemented dataset with gender attributes
– about 6 million users

References

[1] M. Gjoka, E. Smith, C. T. Butts, "Estimating Clique Composition and Size Distributions from Sampled Network Data", IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators

Thank you!

• Unbiased estimation methods of clique distributions
– Clique Degree Sums
– Distinct Clique Counting

• Facebook cliques
• Future work
– support estimation of any subgraphs (beyond cliques)
