Post on 23-Feb-2016


Page 1: Estimating Clique Composition and Size Distributions from Sampled Network Data

Estimating Clique Composition and Size Distributions from Sampled Network Data

Minas Gjoka, Emily Smith, Carter T. Butts

University of California, Irvine

Page 2: Estimating Clique Composition and Size Distributions from Sampled Network Data

Outline

• Problem statement
• Estimation methodology
• Results with real-life graphs

Page 3: Estimating Clique Composition and Size Distributions from Sampled Network Data

Cliques

A complete subgraph that contains i vertices is an order-i clique

[Figure: example cliques of order 1, 2, 3, 4, 5, ..., i]

A maximal clique is a clique that is not included in a larger clique

Page 4: Estimating Clique Composition and Size Distributions from Sampled Network Data

Cliques

A complete subgraph that contains i vertices is an order-i clique

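A maximal clique is a clique that is not included in a larger clique

[Figure: an order-4 clique on vertices a, b, c, d, and the order-3 cliques it contains: {a, b, d}, {b, c, d}, {a, b, c}, {a, c, d}]

The order-4 clique contains 4 non-maximal order-3 cliques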
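The definitions on this slide can be checked with a small brute-force sketch. It is an illustration only, not the authors' code: the helper names (`is_clique`, `cliques_of_order`, `is_maximal`) are made up here, and the subset enumeration is only practical for tiny graphs.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of vertices in `nodes` is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def cliques_of_order(adj, i):
    """All order-i cliques, by brute force over vertex subsets."""
    return [frozenset(c) for c in combinations(sorted(adj), i) if is_clique(adj, c)]

def is_maximal(adj, clique):
    """A clique is maximal if no outside vertex is adjacent to all its members."""
    return not any(clique <= adj[v] for v in adj if v not in clique)

# Complete graph on a, b, c, d: a single order-4 clique.
adj = {v: set("abcd") - {v} for v in "abcd"}
triangles = cliques_of_order(adj, 3)
four_cliques = cliques_of_order(adj, 4)
```

Here `triangles` holds the four order-3 cliques, none of which is maximal, while the single order-4 clique is.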

Page 5: Estimating Clique Composition and Size Distributions from Sampled Network Data

Counting of Cliques

[Figure: example graph G with vertices 1-8]

Ci is the count of order-i cliques (maximal or non-maximal)

[Figure: example cliques of order 1 to 4, with counts C1, C2, C3, C4]

Clique Distribution of G

C = (C1, C2, C3, C4) = (0, 1, 2, 1)

Goal 1: Estimate Ci (for all i) in graph G from sampled network data

Page 6: Estimating Clique Composition and Size Distributions from Sampled Network Data

Counting of Cliques: Vertex Attributes

[Figure: example graph G with vertices 1-8 and attributed vertices, p = 3]

Vertex attribute vector Xj, j=1..p, p<=N

Example composition vectors: u = [3 0 0], u = [2 1 0], u = [2 0 1]

Cu is the count of order-u cliques

Clique Composition Distribution of G

Goal 2: Estimate Cu (for all u) in graph G from sampled network data
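A minimal sketch of counting cliques by composition. It assumes that the composition vector u lists, for each of the p attribute values, how many clique vertices carry that value, so that u = [2 1 0] is an order-3 clique with two vertices of the first value and one of the second. The graph, the attribute assignment and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def composition(clique, attr, p):
    """Composition vector u: u[k] = number of clique vertices with attribute value k (0-based)."""
    u = [0] * p
    for v in clique:
        u[attr[v]] += 1
    return tuple(u)

def composition_counts(adj, attr, p, order):
    """Count the cliques of the given order, grouped by composition vector (brute force)."""
    counts = Counter()
    for nodes in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(nodes, 2)):
            counts[composition(nodes, attr, p)] += 1
    return counts

# Hypothetical graph: triangles {1, 2, 3} and {2, 3, 4}; vertex 4 has a different attribute value.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
attr = {1: 0, 2: 0, 3: 0, 4: 1}
cu = composition_counts(adj, attr, p=3, order=3)
```

In this toy graph, `cu` holds one clique of composition (3, 0, 0) and one of composition (2, 1, 0).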

Page 7: Estimating Clique Composition and Size Distributions from Sampled Network Data

Motivation

• Counting of Cliques
– cliques describe local structure (clustering, cohesive subgroups)
– algorithmic implications of cliques in engineering context
– cliques used as input in network models

• Sampled network data
– unknown graphs with access limitations
– massive known graphs

Page 8: Estimating Clique Composition and Size Distributions from Sampled Network Data

Related Work

• Model-based methods
– Do not scale
– Do not help with counting

• Design-based methods
– Subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
– No support for subgraphs of size larger than 10
– No support for vertex attributes
– Biased estimation

Page 9: Estimating Clique Composition and Size Distributions from Sampled Network Data

Estimation

Page 10: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
– uniform independence sampling
– weighted independence sampling
– link-trace sampling
– with replacement or without replacement
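The sampling designs listed above can be sketched with Python's standard `random` module. This is a hedged illustration on a hypothetical 5-node graph. The function names are invented here, the degree-proportional weights are just one possible choice for weighted sampling, and the simple random walk is only one example of a link-trace design.

```python
import random

def uis_without_replacement(nodes, n, rng):
    """Uniform independence sampling, each node drawn at most once."""
    return rng.sample(nodes, n)

def uis_with_replacement(nodes, n, rng):
    """Uniform independence sampling, nodes may repeat."""
    return [rng.choice(nodes) for _ in range(n)]

def wis_with_replacement(nodes, weights, n, rng):
    """Weighted independence sampling: draw probability proportional to weight."""
    return rng.choices(nodes, weights=weights, k=n)

def random_walk(adj, start, n, rng):
    """A simple link-trace design: each step moves to a uniformly chosen neighbor."""
    walk = [start]
    while len(walk) < n:
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk

# Hypothetical 5-node graph.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
nodes = sorted(adj)
rng = random.Random(0)
s_wor = uis_without_replacement(nodes, 3, rng)
s_wr = uis_with_replacement(nodes, 3, rng)
s_wis = wis_with_replacement(nodes, [len(adj[v]) for v in nodes], 3, rng)
s_rw = random_walk(adj, 1, 4, rng)
```

Whatever the design, the estimators later in the deck need the inclusion probability of each sampled node, so a practical sampler would also record those.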

Page 11: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n

[Figure: graph G(V,E) with vertices 1-8; n=2 sampled nodes (7 and 4) highlighted; target count C3]

Page 12: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n

[Figure: graph G(V,E) with vertices 1-8; n=2; the two sampled egonets, on vertices {6, 7, 8} and {2, 3, 4, 5}; target count C3]

Page 13: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n

2. Calculate the clique count Ci (or Cu) in each egonet Hj

[Figure: graph G(V,E) with vertices 1-8; n=2; the two sampled egonets, on vertices {6, 7, 8} and {2, 3, 4, 5}; target count C3]

Page 14: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n

2. Calculate the clique count Ci (or Cu) in each egonet Hj
– can use existing exact clique counting algorithms
– clique type is determined by counting algorithm

[Figure: graph G(V,E) with vertices 1-8; n=2; the two sampled egonets, on vertices {6, 7, 8} and {2, 3, 4, 5}, with C3 counts of 1 and 0]
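Steps 1b and 2 can be sketched as follows, assuming an egonet is the subgraph induced by the ego and its neighbors. The graph is hypothetical, and the brute-force counter stands in for the "existing exact clique counting algorithms" mentioned on the slide.

```python
from itertools import combinations

def egonet(adj, ego):
    """Induced subgraph on the ego and its neighbors."""
    keep = adj[ego] | {ego}
    return {v: adj[v] & keep for v in keep}

def count_cliques(adj, i):
    """Number of order-i cliques in a (small) graph, by brute force."""
    return sum(
        all(b in adj[a] for a, b in combinations(nodes, 2))
        for nodes in combinations(sorted(adj), i)
    )

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
h3 = egonet(adj, 3)   # vertices 1, 2, 3, 4
h5 = egonet(adj, 5)   # vertices 4, 5
c3_per_egonet = [count_cliques(h3, 3), count_cliques(h5, 3)]
```

The per-egonet counts here are 1 and 0, the same pattern as the two egonets pictured on the slide, although the edges of the slide's graph are not reproduced.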

Page 15: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n

2. Calculate the clique count Ci (or Cu) in each egonet Hj

3. Apply estimation method that combines calculations
– Clique Degree Sums (CDS)
– Distinct Clique Counting (CC)

[Figure: graph G(V,E) with vertices 1-8; n=2; the two sampled egonets, on vertices {6, 7, 8} and {2, 3, 4, 5}, with C3 counts of 1 and 0]

Page 16: Estimating Clique Composition and Size Distributions from Sampled Network Data

Methodology

1. Collect an egocentric network sample H1,..,Hn

a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n

2. Calculate the clique count Ci (or Cu) in each egonet Hj

3. Apply estimation method that combines calculations
– Clique Degree Sums (CDS)
  o labeling of neighbors not required, more space efficient
– Distinct Clique Counting (CC)
  o higher accuracy

[Figure: graph G(V,E) with vertices 1-8; n=2; the two sampled egonets, on vertices {6, 7, 8} and {2, 3, 4, 5}, with C3 counts of 1 and 0]

Page 17: Estimating Clique Composition and Size Distributions from Sampled Network Data

Labeling of neighbors

[Figure: graph G with vertices 1-9; target count C3]

Page 18: Estimating Clique Composition and Size Distributions from Sampled Network Data

Labeling of neighbors

[Figure: graph G with vertices 1-9; n=2 sampled nodes Vj with attributes X[Vj] and egonets G[Vj]; target count C3]

Page 19: Estimating Clique Composition and Size Distributions from Sampled Network Data

Labeling of neighbors

• Distinct Clique Counting (CC)
– labeled neighbors

[Figure: graph G with vertices 1-9; n=2 sampled egonets with labeled neighbors, used to calculate the count C3]

Page 20: Estimating Clique Composition and Size Distributions from Sampled Network Data

Labeling of neighbors

• Distinct Clique Counting (CC)
– labeled neighbors

• Clique Degree Sums (CDS)
– unlabeled neighbors

[Figure: graph G with vertices 1-9; n=2 sampled egonets shown twice, once with unlabeled neighbors and once with labeled neighbors, each used to calculate the count C3]

Page 21: Estimating Clique Composition and Size Distributions from Sampled Network Data

Clique Degree Sums (unlabeled neighbors)

The order-i clique degree dij is the number of i-cliques to which node j belongs

Page 22: Estimating Clique Composition and Size Distributions from Sampled Network Data

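Clique Degree Sums (unlabeled neighbors)

The order-i clique degree dij is the number of i-cliques to which node j belongs

[Figure: graph G(V,E) with vertices 1-8 and the egonet H8 of node 8, on vertices 4, 5, 6, 7, 8; target count C3]

Example: d38 = 2, computed from H8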

Page 23: Estimating Clique Composition and Size Distributions from Sampled Network Data

Clique Degree Sums (unlabeled neighbors)

Di is the order-i clique degree sum:

$D_i = \sum_{j \in V} d_{ij}$

where the sum runs over all nodes and $d_{ij}$ is the number of i-cliques to which node j belongs

Page 24: Estimating Clique Composition and Size Distributions from Sampled Network Data

Clique Degree Sums (unlabeled neighbors)

Di is the order-i clique degree sum, taken over all nodes, where dij is the number of i-cliques to which node j belongs

[Figure: graph G(V,E) with vertices 1-8 and the egonet of node 8; target count C3]

D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
D3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D3 = 9
D3 = 3·C3, so C3 = 3
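The arithmetic on this slide can be reproduced in a few lines. The edge list below is hypothetical: the slide's graph cannot be recovered from the transcript, so the edges were chosen only so that the order-3 clique degrees come out as 1, 1, 0, 1, 2, 1, 1, 2. The function name is also an assumption.

```python
from itertools import combinations

def clique_degree(adj, j, i):
    """d_ij: number of order-i cliques containing node j,
    i.e. the number of (i-1)-cliques among j's neighbors."""
    return sum(
        all(b in adj[a] for a, b in combinations(nodes, 2))
        for nodes in combinations(sorted(adj[j]), i - 1)
    )

# Hypothetical 8-node graph with triangles {6,7,8}, {4,5,8}, {1,2,5} and a pendant edge 2-3.
edges = [(6, 7), (6, 8), (7, 8), (4, 5), (4, 8), (5, 8), (1, 2), (1, 5), (2, 5), (2, 3)]
adj = {v: set() for v in range(1, 9)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

d3 = [clique_degree(adj, j, 3) for j in range(1, 9)]
D3 = sum(d3)     # every triangle is counted once per member vertex
C3 = D3 // 3
```

Each order-i clique contributes 1 to the clique degree of each of its i vertices, which is why the sum equals i times the clique count.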

Page 25: Estimating Clique Composition and Size Distributions from Sampled Network Data

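Clique Degree Sums (unlabeled neighbors)

$C_i = \frac{1}{i} \sum_{j \in V} d_{ij}$ (sum over all nodes)

$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ (sum over sampled nodes)

where $d_{ij}$ is the number of i-cliques to which node j belongs, $S$ is the set of sampled nodes, and $\pi_j$ is the inclusion probability of node j

$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$)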

Page 26: Estimating Clique Composition and Size Distributions from Sampled Network Data

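Clique Degree Sums (unlabeled neighbors)

$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$

$\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$

where the sums run over the sampled nodes $S$ (in place of all nodes), $\pi_j$ is the inclusion probability of node j, $d_{ij}$ is the number of i-cliques to which node j belongs, $d_{uj}$ is the number of u-cliques to which node j belongs, and $|u|$ is the order of a u-clique

$\hat{C}_i$ and $\hat{C}_u$ are design-unbiased Horvitz-Thompson estimators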
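A hedged sketch of the Clique Degree Sums estimate, using the clique degrees from the worked example (1, 1, 0, 1, 2, 1, 1, 2 for nodes 1 to 8) and assuming uniform independence sampling without replacement, for which each node has inclusion probability n/N. The function name is made up for illustration. Enumerating every possible sample of size n = 2 shows design-unbiasedness directly: the estimates average to the true C3 = 3.

```python
from fractions import Fraction
from itertools import combinations

def cds_estimate(clique_degrees, incl_probs, i):
    """Horvitz-Thompson style estimate: (1/i) * sum over sampled nodes of d_ij / pi_j."""
    return sum(Fraction(d) / p for d, p in zip(clique_degrees, incl_probs)) / i

# Order-3 clique degrees of nodes 1..8, as in the D3 example.
d3 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}
N, n = 8, 2
pi = Fraction(n, N)   # inclusion probability under uniform sampling without replacement

# Average the estimate over all equally likely samples of size n.
estimates = [
    cds_estimate([d3[j] for j in sample], [pi] * n, 3)
    for sample in combinations(d3, n)
]
mean_estimate = sum(estimates) / len(estimates)
```

Exact fractions are used so the average can be compared with 3 without rounding error. Individual estimates still vary from sample to sample.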

Page 27: Estimating Clique Composition and Size Distributions from Sampled Network Data

Clique Degree Sums: Estimator Variance

We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$

The variance estimators depend on:
• the node inclusion probability $\pi_j$
• the joint node inclusion probability $\pi_{jk}$

Page 28: Estimating Clique Composition and Size Distributions from Sampled Network Data

Clique Degree Sums: Estimator Variance

We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$

Sampling designs:
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling

Each design can be used:
• Without replacement
• With replacement

Page 29: Estimating Clique Composition and Size Distributions from Sampled Network Data

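Clique Degree Sums: Estimator Variance

We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$

• Uniform Independence Sampling
• Without replacement

With $N$ the number of all nodes and $n$ the number of sampled nodes:

Node inclusion probability: $\pi_j = \frac{n}{N}$

Joint node inclusion probability: $\pi_{jk} = \frac{n(n-1)}{N(N-1)}$ for $j \neq k$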
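The slide's variance formula did not survive extraction, so the sketch below uses the standard Horvitz-Thompson variance estimator, $\sum_{j \in S} \sum_{k \in S} \frac{\pi_{jk} - \pi_j \pi_k}{\pi_{jk}} \frac{y_j}{\pi_j} \frac{y_k}{\pi_k}$ with $\pi_{jj} = \pi_j$. That this matches the authors' expression is an assumption. It is applied to uniform sampling without replacement, reusing the clique degrees from the D3 example, with hypothetical function names. Enumerating all samples of size n = 3 confirms that the variance estimates average to the true variance of the estimator.

```python
from fractions import Fraction
from itertools import combinations

def ht_total(y, pi):
    """Horvitz-Thompson estimate of a population total (equal inclusion probabilities)."""
    return sum(Fraction(yj) / pi for yj in y)

def ht_variance_estimate(y, pi, pi_joint):
    """Standard Horvitz-Thompson variance estimate for an equal-probability design."""
    total = Fraction(0)
    for a, ya in enumerate(y):
        for b, yb in enumerate(y):
            pjk = pi if a == b else pi_joint   # pi_jj equals pi_j
            total += (pjk - pi * pi) / pjk * (ya / pi) * (yb / pi)
    return total

d3 = [1, 1, 0, 1, 2, 1, 1, 2]   # order-3 clique degrees of nodes 1..8
N, n, i = 8, 3, 3
pi = Fraction(n, N)
pi_joint = Fraction(n * (n - 1), N * (N - 1))

samples = list(combinations(d3, n))
c_hat = [ht_total(s, pi) / i for s in samples]
v_hat = [ht_variance_estimate(s, pi, pi_joint) / i**2 for s in samples]

mean_c = sum(c_hat) / len(c_hat)
true_var = sum(c * c for c in c_hat) / len(c_hat) - mean_c**2
mean_v = sum(v_hat) / len(v_hat)
```

Dividing by i squared converts the variance of the estimated clique degree sum into the variance of the estimated clique count.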

Page 30: Estimating Clique Composition and Size Distributions from Sampled Network Data

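Distinct Clique Counting (labeled neighbors)

$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$

where $c_i$ is the number of distinct i-cliques in H1, .., Hn and $\pi_k$ is the inclusion probability of i-clique k

$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$)

Sampling designs:
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling

Each design can be used:
• With replacement
• Without replacement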

Page 31: Estimating Clique Composition and Size Distributions from Sampled Network Data

Distinct Clique Counting (labeled neighbors)

• Uniform Independence Sampling
• With replacement

i-clique inclusion probability: $\pi_k = 1 - (1 - i/N)^n$

$\hat{C}_i = \frac{c_i}{1 - (1 - i/N)^n}$

where $c_i$ is the number of distinct i-cliques in H1, .., Hn

$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator

Page 32: Estimating Clique Composition and Size Distributions from Sampled Network Data

Distinct Clique Counting (labeled neighbors)

[Figure: graph G with vertices 1-8 and an order-3 clique on vertices a, b, c; target count C3]

N=8, n=4, UIS with replacement

$\pi_k = 1 - (1 - 3/8)^4$

Page 33: Estimating Clique Composition and Size Distributions from Sampled Network Data

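Distinct Clique Counting (labeled neighbors)

[Figure: graph G with vertices 1-8 and an order-3 clique on vertices a, b, c; the sampled egonets; the observed order-3 cliques and the distinct order-3 cliques among them; target count C3]

N=8, n=4, UIS with replacement

$\pi_k = 1 - (1 - 3/8)^4$

$\hat{C}_i = 2 / (1 - (1 - 3/8)^4) = 2.36$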
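The number on this slide can be reproduced with a short sketch. It assumes that each egonet contributes only the cliques that contain its ego, which is what the inclusion probability 1 - (1 - i/N)^n implies: a clique is seen as soon as any one of its i vertices is drawn. The observed cliques listed below are hypothetical, as are the function names.

```python
def clique_inclusion_prob(i, N, n):
    """Probability that at least one of a clique's i vertices is drawn
    in n uniform draws with replacement from N nodes."""
    return 1.0 - (1.0 - i / N) ** n

def dcc_estimate(observed_per_egonet, i, N, n):
    """Distinct Clique Counting: distinct observed i-cliques over their inclusion probability."""
    distinct = set().union(*observed_per_egonet)
    return len(distinct) / clique_inclusion_prob(i, N, n)

# Hypothetical observations from n=4 egonets: the same triangle seen twice,
# one other triangle, and one egonet with no triangle through its ego.
observed = [
    [frozenset({6, 7, 8})],
    [frozenset({6, 7, 8})],
    [frozenset({1, 2, 5})],
    [],
]
estimate = dcc_estimate(observed, i=3, N=8, n=4)
```

Storing cliques as `frozenset` objects is what makes duplicates collapse. This is also why the method needs labeled neighbors and memory proportional to the number of distinct cliques.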

Page 34: Estimating Clique Composition and Size Distributions from Sampled Network Data

Computational complexity

• Space complexity to count Ci or Cu
– O(1) for the Clique Degree Sums method
– O(ci) or O(cu) for the Distinct Clique Counting method

• Time complexity
– from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
– from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet

Page 35: Estimating Clique Composition and Size Distributions from Sampled Network Data

Benefits of our methodology

• Full knowledge of graph not required
• Fast estimation for massive known graphs
• Estimation or exact computation easily parallelizable for massive known graphs
• Estimation with or without neighbor labels
• Supports vertex attributes
• Supports a variety of sampling designs

Page 36: Estimating Clique Composition and Size Distributions from Sampled Network Data

Results

Page 37: Estimating Clique Composition and Size Distributions from Sampled Network Data

Simulation Results

Page 38: Estimating Clique Composition and Size Distributions from Sampled Network Data

Simulation Results: Facebook New Orleans

Egonet sample size n=1,000
Uniform independence sampling, without replacement
1000 simulations

[Figure: simulation results, with one panel for Clique Degree Sums and one for Distinct Clique Counting]

Page 39: Estimating Clique Composition and Size Distributions from Sampled Network Data

Simulation Results

Error metric: Normalized Mean Absolute Error (NMAE), 1000 simulations

Distinct Clique Counting

Clique Degree Sums
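The formula for the error metric did not survive in the transcript. A minimal sketch, assuming the usual definition of normalized mean absolute error (the mean of |estimate - true value| over the simulations, divided by the true value), could look like this. The function name and the numbers are illustrative only.

```python
def nmae(estimates, true_value):
    """Normalized mean absolute error of repeated estimates of one quantity
    (assumed definition: mean absolute error divided by the true value)."""
    return sum(abs(e - true_value) for e in estimates) / (len(estimates) * true_value)

# Three hypothetical simulation outcomes for a true count of 3.
error = nmae([2.0, 4.0, 3.0], 3.0)
```

Under this definition a value of 0 means every simulation hit the true count exactly, and the metric is comparable across clique orders whose counts differ by orders of magnitude.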

Page 40: Estimating Clique Composition and Size Distributions from Sampled Network Data

Simulation Results

Distinct Clique Counting

Clique Degree Sums

Page 41: Estimating Clique Composition and Size Distributions from Sampled Network Data

Which estimation method to use? Heuristic

Average Edge Count = (All edges between egos and neighbors) / (Unique edges between egos and neighbors)

[Figure: graph G with N=8 vertices and n=3 sampled egonets]

Example: Average Edge Count = 9 / 6 = 1.5
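The ratio can be computed from the edges recorded for each sampled egonet, as in the hedged sketch below. Which edges belong to an egonet is left to the caller. The three edge lists are hypothetical and were chosen only so that the totals match the slide's 9 and 6.

```python
def average_edge_count(edges_per_egonet):
    """Edges recorded across all egonets, divided by the number of distinct edges among them."""
    all_edges = [frozenset(e) for edges in edges_per_egonet for e in edges]
    return len(all_edges) / len(set(all_edges))

# Three hypothetical egonets with 3 edges each; the first two overlap completely.
egonets = [
    [(6, 7), (6, 8), (7, 8)],
    [(7, 6), (7, 8), (6, 8)],
    [(1, 2), (1, 5), (2, 5)],
]
ratio = average_edge_count(egonets)   # 9 recorded edges, 6 distinct
```

Edges are stored as `frozenset` pairs so that (6, 7) and (7, 6) count as the same undirected edge. A ratio of 1 would mean the sampled egonets do not overlap at all.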

Page 42: Estimating Clique Composition and Size Distributions from Sampled Network Data

Estimation Results: Facebook ‘09

• Facebook ‘09 crawled dataset [1]

– 36,628 unique egonets

[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.

Page 43: Estimating Clique Composition and Size Distributions from Sampled Network Data

Estimation Results: vertex attributes, Facebook ‘09

• Complemented dataset with gender attributes
– about 6 million users

Page 44: Estimating Clique Composition and Size Distributions from Sampled Network Data

References

[1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators

Thank you!

• Unbiased estimation methods of clique distributions
– Clique Degree Sums
– Distinct Clique Counting
• Facebook cliques
• Future work
– support estimation of any subgraphs (beyond cliques)