![Page 1: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/1.jpg)
Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Stephen J. Hardiman*
Capital Fund Management, France
Liran Katzir
Advanced Technology Labs, Microsoft Research, Israel
*Research was conducted while the author was unaffiliated
![Page 2: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/2.jpg)
Motivation: Social Networks
Facebook Twitter Qzone Google+
Sina Weibo
Habbo Renren
LinkedIn Vkontakte
Bebo
Tagged Orkut
Netlog
Friendster hi5
Flixster
MyLife Classmates.com
Sonico.com
Plaxo
![Page 3: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/3.jpg)
Motivation: External access
[Figure: example graph on nodes v1–v9]
Social Analytics
The online social network
Disk Space
Communication
Privacy
![Page 4: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/4.jpg)
Task: Estimate parameters
Business development/ advertisement/ market size.
Predicting Social Products’ Potential.
Global Clustering Coefficient
Network Average CC
Number of Registered Users
![Page 5: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/5.jpg)
Global Clustering Coefficient
$$\text{Global CC} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$$
[Figure: example graph on nodes v1–v9, highlighting a triangle and a connected triplet]
![Page 6: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/6.jpg)
Global Clustering Coefficient
Exact: [Alon et al, 1997]
Estimation – input is read at least once:
• Random Access: [Avron, 2010]
• Streaming Model: [Buriol et al, 2006]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: This work.
![Page 7: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/7.jpg)
Local Clustering Coefficient
$$C_i = \frac{\#\,\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$$
$d_i$ – degree of node $i$
[Figure: example graph on nodes v1–v9]
$d_1 = 1$, $d_9 = 2$, $d_2 = 3$
$C_2 = 1/3$
Network Average CC = average local CC
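The definitions of the global CC, the local CC and the network average CC can be checked on a small graph. The sketch below is illustrative only: the edge list `EDGES` is a hypothetical 6-node graph (the edges of the graph drawn on the slides are not recoverable from this transcript), the function names are made up for the example, and nodes of degree below 2 are assumed to have local CC 0.

```python
from itertools import combinations

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]


def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_cc(adj, v):
    """C_v = (# edges among v's neighbors) / (d_v (d_v - 1) / 2).

    Assumption: nodes with degree < 2 get C_v = 0.
    """
    d = len(adj[v])
    if d < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return links / (d * (d - 1) / 2)


def network_average_cc(adj):
    """Average of the local CC over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


def global_cc(adj):
    """3 * (# triangles) / (# connected triplets)."""
    # Every triangle is seen once from each of its 3 corners,
    # so this sum already equals 3 * (# triangles).
    closed = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a])
    triplets = sum(len(adj[v]) * (len(adj[v]) - 1) // 2 for v in adj)
    return closed / triplets


ADJ = build_adjacency(EDGES)
print(local_cc(ADJ, 2), network_average_cc(ADJ), global_cc(ADJ))
```

On this toy graph the two coefficients differ (11/18 for the network average, 1/2 for the global CC), which is why the deck treats them as separate estimation targets.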
![Page 8: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/8.jpg)
Network Average CC
Exact: Naïve.
Estimation – input is read at least once:
• Streaming Model: [Becchetti et al, 2010]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.
![Page 9: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/9.jpg)
Number of Registered Users
Exact: trivial
Estimation – sampling:
• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.
![Page 10: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/10.jpg)
Random Walk
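[Figure: random walk on the example graph (nodes v1–v9). Each node is labeled with its stationary probability; the labels shown are 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.]
Sampled Nodes: v1 v2 v3 v4 v5
Stationary distribution $= \frac{d_i}{\sum_{j} d_j}$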
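A short simulation illustrates why the walk samples node $v_i$ with probability proportional to $d_i$. This is a sketch under assumptions: `EDGES` is a hypothetical 6-node graph (not the one on the slide), the function names are illustrative, and the walk is a simple random walk that moves to a uniformly chosen neighbor at each step.

```python
import random
from collections import Counter

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]


def build_adjacency(edges):
    """Map each node to the list of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Return `start` followed by `steps` nodes, each a uniform random neighbor."""
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def stationary_distribution(adj):
    """pi_i = d_i / sum_j d_j."""
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total_degree for v, nbrs in adj.items()}


ADJ = build_adjacency(EDGES)
PI = stationary_distribution(ADJ)
WALK = random_walk(ADJ, start=1, steps=100_000, rng=random.Random(0))
FREQ = {v: c / len(WALK) for v, c in Counter(WALK).items()}

# Compare the theoretical and the empirical visit frequency of each node.
for v in sorted(ADJ):
    print(v, round(PI[v], 3), round(FREQ[v], 3))
```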
![Page 11: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/11.jpg)
Random Walk - Summary
[Figure: example graph on nodes v1–v9. Legend: visible nodes, invisible nodes, sampled nodes; visible edges, invisible edges.]
![Page 12: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/12.jpg)
Global CC Algorithm
1. $\Psi_g$ – Sampled nodes' average of (degree − 1).
$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.
2. $\Phi_g$ – Sampled nodes' average of $\phi_k d_k$.
The estimated global clustering coefficient:
$$\hat{c}_g = \frac{\Phi_g}{\Psi_g}$$
$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle
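The two averages can be computed in a single pass over the walk. The following is a minimal sketch, not the authors' implementation: the graph `EDGES` is hypothetical, the function names are illustrative, and it assumes that $\Psi_g$ is averaged over all $r$ samples while $\Phi_g$ is averaged over the $r - 2$ interior steps, where both $v_{k-1}$ and $v_{k+1}$ exist.

```python
import random

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]


def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    choices = {v: sorted(nbrs) for v, nbrs in adj.items()}
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(choices[walk[-1]]))
    return walk


def estimate_global_cc(adj, walk):
    """Estimate the global CC as Phi_g / Psi_g from a random-walk sample."""
    r = len(walk)
    # Psi_g: average of (d_k - 1) over all r sampled nodes.
    psi = sum(len(adj[v]) - 1 for v in walk) / r
    # Phi_g: average of phi_k * d_k over the r - 2 interior steps,
    # where phi_k = 1 iff the previous and the next node are adjacent.
    phi_sum = 0
    for k in range(1, r - 1):
        if walk[k + 1] in adj[walk[k - 1]]:
            phi_sum += len(adj[walk[k]])
    phi = phi_sum / (r - 2)
    return phi / psi


ADJ = build_adjacency(EDGES)
WALK = random_walk(ADJ, start=1, steps=200_000, rng=random.Random(1))
ESTIMATE = estimate_global_cc(ADJ, WALK)
print(ESTIMATE)  # the exact global CC of this toy graph is 0.5
```

Only the degrees of the visited nodes and one adjacency test per step are needed, which is what makes the estimator usable under external access.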
![Page 13: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/13.jpg)
Global CC Example
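[Figure: example graph on nodes v1–v7]
$\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$
$$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}$$
$$\Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$$
$$\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.48$$
$$c_g = \frac{9}{23} \approx 0.39$$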
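The arithmetic of this example can be replayed with exact fractions. In the sketch below, the degrees along the walk are inferred from the terms of the $\Psi_g$ sum ($d_k - 1 = 0, 2, 1, 3, 1$), which is an assumption about how the slide was built; the variable names are illustrative.

```python
from fractions import Fraction

# Degrees d_1..d_5 of the sampled nodes, inferred from the Psi_g terms
# 0, 2, 1, 3, 1 (each term is d_k - 1). This mapping is an assumption.
degrees = [1, 3, 2, 4, 2]

# phi_k for the interior steps k = 2, 3, 4 (1-based), as given in the example.
phis = {2: 0, 3: 1, 4: 0}

# Psi_g: average of (d_k - 1) over all 5 samples.
psi = Fraction(sum(d - 1 for d in degrees), len(degrees))

# Phi_g: average of phi_k * d_k over the 3 interior steps.
phi = Fraction(sum(phis[k] * degrees[k - 1] for k in phis), len(phis))

estimate = phi / psi
print(phi, psi, estimate, float(estimate))
```

With only five samples the estimate (10/21, about 0.476) is still noticeably off from the exact value 9/23; the guarantees later in the deck concern longer walks.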
![Page 14: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/14.jpg)
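Expectation of $\phi_k$
By total expectation:
$$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i d_i} \cdot d_i = \sum_{i=1}^{n} \frac{2 l_i}{D}$$
Given $x_k = v_i$, there are $d_i d_i$ combinations of $(x_{k-1}, x_{k+1})$; $2 l_i$ of them yield $\phi_k = 1$.
$l_i$ – the number of triangles containing $v_i$.
$d_i$ – the degree of node $v_i$.
$n$ – the number of nodes.
$D = \sum_{i=1}^{n} d_i$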
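The identity $E[\phi_k d_k] = \frac{2}{D}\sum_i l_i$ can be verified exactly by enumerating every possible triple $(x_{k-1}, x_k, x_{k+1})$ of a stationary walk. This is a sketch on a hypothetical 6-node graph (`EDGES` is not from the slides); it relies on the fact that, for a stationary simple random walk, both the previous and the next node are uniform over the neighbors of $x_k$.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]

adj = {}
for u, v in EDGES:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

D = sum(len(nbrs) for nbrs in adj.values())

# Left-hand side: sum over x_k = v (probability d_v / D) and over the
# d_v * d_v equally likely choices of (x_{k-1}, x_{k+1}) among v's neighbors.
lhs = Fraction(0)
for v, nbrs in adj.items():
    d = len(nbrs)
    for a in nbrs:
        for b in nbrs:
            if b in adj[a]:  # phi_k = 1: the two neighbors are adjacent
                lhs += Fraction(d, D) * Fraction(1, d * d) * d

# Right-hand side: 2 * sum_i l_i / D, with l_i = # triangles containing v_i.
triangles_at = {
    v: sum(1 for a, b in combinations(sorted(adj[v]), 2) if b in adj[a])
    for v in adj
}
rhs = Fraction(2 * sum(triangles_at.values()), D)

print(lhs, rhs)
```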
![Page 15: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/15.jpg)
Global CC Proof
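$l_i$ – the number of triangles containing $v_i$.
$d_i$ – the degree of node $v_i$.
$n$ – the number of nodes.
$D = \sum_{i=1}^{n} d_i$
$$E[\Phi_g] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$$
$$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$$
By concentration bounds, $\Phi_g \cong E[\Phi_g]$ and $\Psi_g \cong E[\Psi_g]$, so
$$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]} = \frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$$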
![Page 16: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/16.jpg)
Guarantees
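For any $\varepsilon \le \frac{1}{8}$ and $\delta \le 1$, we have
$$\text{Prob}\left[(1 - \varepsilon)\, c_g \le \hat{c}_g \le (1 + \varepsilon)\, c_g\right] \ge 1 - \delta$$
when the number of samples, $r$, satisfies
$$r \ge r_g = O(\text{mixing time}(\varepsilon))$$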
![Page 17: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/17.jpg)
Network Average CC Algorithm
1. $\Psi_l$ – Sampled nodes' average of 1/degree.
$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.
2. $\Phi_l$ – Sampled nodes' average of $\phi_k \frac{1}{d_k - 1}$.
The estimated network average CC:
$$\hat{c}_l = \frac{\Phi_l}{\Psi_l}$$
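The network average estimator differs from the global one only in the weights. A minimal sketch follows, with the same caveats: `EDGES` is a hypothetical graph, the function names are illustrative, and steps at degree-1 nodes are assumed to contribute 0 (there $\phi_k = 0$ anyway, since the walk must return to the node it came from, so the division by $d_k - 1$ is skipped).

```python
import random

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]


def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    choices = {v: sorted(nbrs) for v, nbrs in adj.items()}
    walk = [start]
    for _ in range(steps):
        walk.append(rng.choice(choices[walk[-1]]))
    return walk


def estimate_network_average_cc(adj, walk):
    """Estimate the network average CC as Phi_l / Psi_l."""
    r = len(walk)
    # Psi_l: average of 1 / d_k over all r sampled nodes.
    psi = sum(1 / len(adj[v]) for v in walk) / r
    # Phi_l: average of phi_k / (d_k - 1) over the r - 2 interior steps.
    total = 0.0
    for k in range(1, r - 1):
        d_k = len(adj[walk[k]])
        if d_k > 1 and walk[k + 1] in adj[walk[k - 1]]:
            total += 1 / (d_k - 1)
    phi = total / (r - 2)
    return phi / psi


ADJ = build_adjacency(EDGES)
WALK = random_walk(ADJ, start=1, steps=200_000, rng=random.Random(2))
ESTIMATE = estimate_network_average_cc(ADJ, WALK)
print(ESTIMATE)  # the exact network average CC of this toy graph is 11/18
```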
![Page 18: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/18.jpg)
Evaluations
| Network | n (size) | D/n | $c_l$ | $c_g$ |
|---|---|---|---|---|
| DBLP | 977,987 | 8.457 | 0.7231 | 0.1868 |
| Orkut | 3,072,448 | 76.28 | 0.1704 | 0.0413 |
| Flickr | 2,173,370 | 20.92 | 0.3616 | 0.1076 |
| Live Journal | 4,843,953 | 17.69 | 0.3508 | 0.1179 |
DBLP facts: The paper with the most co-authors has 119 listed authors. The most prolific author is Vincent Poor, with 798 entries.
![Page 19: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/19.jpg)
Global CC
Relative improvement ranges between 300% and 500% depending on the network.
[Figure: DBLP Network. x-axis: Percentage of mined nodes (0 to 2); y-axis: Relative estimation value (0 to 3.5). Series: Gjoka et al*, Ribeiro et al*, This work.]
![Page 20: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/20.jpg)
Network Average CC
Relative improvement ranges between 50% and 400% depending on the network.
[Figure: Orkut Network. x-axis: Percentage of mined nodes (0 to 2); y-axis: Relative estimation value (0 to 2.5). Series: Ribeiro et al, Gjoka et al, Random walk.]
![Page 21: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/21.jpg)
Conclusions
1. New external access estimator for the Global Clustering Coefficient.
2. Improved estimator for Network Average Clustering Coefficient.
3. Improved estimator for number of registered users.
![Page 22: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/22.jpg)
Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir
Yahoo! Labs, Haifa, Israel
Edo Liberty
Yahoo! Labs, Haifa, Israel
Oren Somekh
Yahoo! Labs, Haifa, Israel
![Page 23: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/23.jpg)
The Birthday “Paradox”
A collision is a pair of identical samples.
The expected number of collisions in a list of r i.i.d. uniform samples from a set of n elements is $\frac{r(r-1)}{2n}$.
Example: Samples: X = (d, b, b, a, b, e). Total 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
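Counting collisions does not require comparing all pairs: a value that occurs $c$ times contributes $\binom{c}{2}$ collisions. A small sketch, with illustrative function names, reproduces the example and evaluates the expectation formula for uniform samples.

```python
from collections import Counter
from math import comb


def count_collisions(samples):
    """Number of unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    return sum(comb(c, 2) for c in Counter(samples).values())


def expected_collisions(r, n):
    """Expected collisions among r i.i.d. uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))           # the three pairs of b's
print(expected_collisions(23, 365))  # classic birthday setting
```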
![Page 24: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/24.jpg)
Cardinality estimation uniform
Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.
When C collisions are observed
$$n \cong \frac{r(r-1)}{2C}$$
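Inverting the birthday formula gives a size estimate from uniform samples. The sketch below is a simulation under assumptions: the population size `N`, the sample count `R`, and the function name are all made up for illustration, and the estimate is left undefined (`None`) when no collision is observed.

```python
import random
from collections import Counter
from math import comb


def estimate_size_uniform(samples):
    """Estimate n as r (r - 1) / (2 C) from i.i.d. uniform samples."""
    r = len(samples)
    collisions = sum(comb(c, 2) for c in Counter(samples).values())
    if collisions == 0:
        return None  # no collisions observed: the estimate is undefined
    return r * (r - 1) / (2 * collisions)


# Hypothetical population of N elements, sampled uniformly R times.
N = 10_000
R = 3_000
rng = random.Random(0)
SAMPLES = [rng.randrange(N) for _ in range(R)]
ESTIMATE = estimate_size_uniform(SAMPLES)
print(ESTIMATE)
```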
![Page 25: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/25.jpg)
Stationary distribution sampling
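[Figure: the example graph (nodes v1–v9). Each node is labeled with its stationary probability; the labels shown are 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.]
Stationary distribution $= \frac{d_i}{\sum_{j} d_j}$
Sampled Nodes: v5 v2 v5 v4 v2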
![Page 26: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/26.jpg)
Cardinality estimation stationary
Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \text{zipf}(\sqrt{n}, 2)$.
When C collisions are observed
$$n \cong \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C}$$
![Page 27: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/27.jpg)
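Example:
[Figure: example graph on nodes v1–v9]
Samples: v5 v2 v5 v4 v2
$$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$$
$$\sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$$
$$\hat{n} = \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$$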
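The example can be reproduced with exact fractions. In this sketch the function name is illustrative, and only the degrees of the three distinct sampled nodes are needed; they are read off the degree sum in the example.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def estimate_size_stationary(samples, degree):
    """n ~ (sum of d_x) * (sum of 1/d_x) / (2 C) for degree-biased samples."""
    collisions = sum(comb(c, 2) for c in Counter(samples).values())
    sum_d = sum(degree[x] for x in samples)
    sum_inv_d = sum(Fraction(1, degree[x]) for x in samples)
    return sum_d * sum_inv_d / (2 * collisions)


SAMPLES = ["v5", "v2", "v5", "v4", "v2"]
DEGREE = {"v2": 3, "v4": 4, "v5": 2}  # degrees as used in the example sums

N_HAT = estimate_size_stationary(SAMPLES, DEGREE)
print(N_HAT, float(N_HAT))
```

The two collisions are the repeated v5 and the repeated v2, so the denominator is $2 \cdot 2$, and the result is 161/24, about 6.7.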
![Page 28: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/28.jpg)
Size Estimation Proof
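$d_i$ – the degree of node $v_i$.
$n$ – the number of nodes.
$D = \sum_{i=1}^{n} d_i$
$$E[d_x] = \sum_{i=1}^{n} \frac{d_i}{D}\, d_i$$
$$E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{1}{d_i} = \frac{n}{D}$$
$$E[C] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}$$
By concentration bounds:
$$\hat{n} = \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C} \cong \frac{E\left[\sum d_x\right] E\left[\sum \frac{1}{d_x}\right]}{2 E[C]} \cong \frac{\left(\sum_{i=1}^{n} \frac{d_i}{D}\, d_i\right) \frac{n}{D}}{\sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$$
The expectations above are written per sample (for $d_x$ and $1/d_x$) and per pair of samples (for $C$); the sample counts $r^2$ and $r(r-1)$ cancel approximately.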
![Page 29: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/29.jpg)
Improvements
1. Using all samples (Hardiman et al 2009).
2. Using Conditional Monte Carlo (This work).
![Page 30: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/30.jpg)
All Samples
Restrict computation to indexes at least m steps apart: $I = \{(k, l) \mid |k - l| \ge m\}$
A collision is only considered within $I$: $\Phi = \left|\{(k, l) \in I \mid x_k = x_l\}\right|$
Ratio of degrees is similarly defined:
$$\Psi = \sum_{(k,l) \in I} \frac{d_{x_k}}{d_{x_l}}$$
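The restricted statistics can be sketched as a double loop over index pairs. Several points here are assumptions rather than statements from the slide: the index gap is taken in absolute value, so each unordered pair is counted in both orders in $\Phi$ and $\Psi$; the size estimate is taken to be $\Psi / \Phi$; and the five samples with `m=1` are reused from the earlier example purely to show the arithmetic (a real run would use a long walk and a gap `m` related to the mixing time).

```python
from fractions import Fraction


def all_samples_statistics(samples, degree, m):
    """Return (Phi, Psi) over ordered index pairs (k, l) with |k - l| >= m."""
    phi = 0
    psi = Fraction(0)
    r = len(samples)
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                if samples[k] == samples[l]:
                    phi += 1  # collision between steps k and l
                psi += Fraction(degree[samples[k]], degree[samples[l]])
    return phi, psi


SAMPLES = ["v5", "v2", "v5", "v4", "v2"]
DEGREE = {"v2": 3, "v4": 4, "v5": 2}

PHI, PSI = all_samples_statistics(SAMPLES, DEGREE, m=1)
print(PHI, PSI, float(PSI / PHI))  # Psi / Phi as the size estimate (assumption)
```

The quadratic loop is kept for readability; grouping samples by node would avoid it on long walks.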
![Page 31: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/31.jpg)
Conditional Monte Carlo
A collision between $x_k$ and $x_l$ is replaced by the conditional probability of a collision at steps $k+1$ and $l+1$, respectively.
$$E\left[\mathbf{1}_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \frac{\text{Common Neighbors}}{d_{x_k} d_{x_l}}$$
![Page 32: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/32.jpg)
Conditional Monte Carlo
• The pair $(v_4, v_7)$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.
[Figure: example graph on nodes v1–v9]
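The conditional collision weight only needs the neighbor lists of the two sampled nodes. The sketch below uses a hypothetical 6-node graph (`EDGES` is not the graph on the slide) in which nodes 4 and 2 have degrees 4 and 3 and one common neighbor, so the pair gets the same weight 1/12 as in the example above; the function name is illustrative.

```python
from fractions import Fraction

# Hypothetical 6-node undirected graph (NOT the graph drawn on the slides).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]

ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)


def conditional_collision(adj, u, v):
    """P(next steps coincide | current nodes u, v) = |N(u) & N(v)| / (d_u d_v).

    Assumes the next steps from u and from v are independent uniform
    choices among their respective neighbors.
    """
    common = len(adj[u] & adj[v])
    return Fraction(common, len(adj[u]) * len(adj[v]))


print(conditional_collision(ADJ, 4, 2))  # degrees 4 and 3, one common neighbor
print(conditional_collision(ADJ, 5, 6))  # degrees 2 and 2, one common neighbor
```

Replacing the 0/1 collision indicator by this fractional weight means that pairs which did not collide still contribute information, which is the variance-reduction idea behind the conditional Monte Carlo step.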
![Page 33: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/33.jpg)
Size Estimation
[Figure: DBLP Network. x-axis: Percentage of mined nodes (0.5 to 2.5); y-axis: Relative estimation value (0 to 2.5). Series: Prior art, This work.]
![Page 34: Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔 – Sampled nodes average degree - 1. 𝜙 = 1 if there is an edge 𝑣 −1−𝑣](https://reader033.vdocument.in/reader033/viewer/2022042214/5eb9a8efe17fcd22311d71a0/html5/thumbnails/34.jpg)
Thanks