Challenges and Opportunities Posed by Power Laws in Network Analysis

Bruno Ribeiro, UMass Amherst

MURI REVIEW MEETING, Berkeley, 26th Oct 2011

Power Laws in Networks

→ Network topology:
– Power law distribution of node degrees
• AS topology, social networks (Facebook, etc.)

→ Network traffic:
– Flow: subset of packets
– Power law distribution of flow sizes

[Figure: router observing a packet stream; degree distribution of the Flickr dataset, x-axis: vertex degree d]

Characterizing Networks from Incomplete Data

This talk

→ Estimate distributions (of degrees, of flow sizes, …) from incomplete data (sampled edges, sampled packets, …)

→ Uncover central nodes in the network

Outline

→ Challenge: Estimating subset size distributions from incomplete data
– Incomplete data:
• randomly sampled edges, randomly sampled packets, …
– Impact of power laws on estimation accuracy
– Impact of other distributions on estimation accuracy

→ Opportunity: Uncovering central nodes in power law networks

ESTIMATING SUBSET SIZE DISTRIBUTIONS FROM INCOMPLETE DATA

Part 1: Challenge

Subset size distributions

[Figure: a set of fishes grouped into types of fish (subsets); the subset size is the number of fishes of a type. Distribution plot: x-axis: x, subset size (number of fishes); y-axis: fraction of subsets (types of fish) with size x]

Estimating subset size distributions

[Figure: N fishes are randomly sampled (uniformly) from the set of fishes; the sampled fishes yield an unbiased estimate of the distribution. Distribution plot: x-axis: x, subset size (number of fishes); y-axis: fraction of subsets (types of fish) with size x]

Questions

How many fishes do we need to catch to obtain accurate distribution estimates?

What is the impact of the distribution shape on estimation accuracy?

Incomplete Data Estimation

[Figure: set of IP packets → random sampling → sampled packets → estimation → IP flow size distribution]

Network-related subset size distributions (webgraph)

→ Distribution of # incoming links to a webpage
– Q: do we need to crawl most of the web graph?

→ Incoming links observed as outgoing links from other webpages
– set = set of links
– subset = incoming links to a webpage
– sampling: link sampling

[Figure: webpage with unknown in-degree (# of links to webpage), observed through the outgoing links of other webpages]

Network-related subset size distributions (IP traffic)

→ Distribution of the number of packets in a TCP flow
– Set = IP packets
– Subset = an IP flow
– Sampling: packet sampling

[Figure: router observing a packet stream]

Incomplete Data, Edge Sampling Example

[Figure panels: Original graph; Sampled in-degrees; 3x Estimator; Original In-Degree Distribution]
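The edge sampling setting of this example can be mimicked on a toy graph. The following is a minimal sketch, not the simulation used in the talk: the edge list, the function names and the seed are all hypothetical, and each directed edge is simply kept independently with probability p before the in-degrees are tallied.

```python
import random
from collections import Counter

def sample_edges(edges, p, seed=0):
    """Keep each directed edge (u, v) independently with probability p."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]

def in_degree_histogram(edges):
    """Map in-degree -> number of nodes with that in-degree.

    Nodes with no (sampled) incoming edge do not appear at all.
    """
    in_deg = Counter(v for _, v in edges)
    return Counter(in_deg.values())

# Hypothetical toy graph: node 0 has in-degree 3, node 1 has in-degree 1.
edges = [(1, 0), (2, 0), (3, 0), (0, 1)]
original = in_degree_histogram(edges)
sampled = in_degree_histogram(sample_edges(edges, p=0.5, seed=1))
print("original:", dict(original))
print("sampled: ", dict(sampled))
```

Nodes whose incoming edges are all dropped vanish from the sampled histogram, which is why the sampled in-degrees alone do not reveal the original distribution.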

Incomplete data model

→ Set elements sampled with probability p
– without replacement
– independently

→ Model
– $b_{ij}$ : probability that j out of i subset elements are sampled
– $\theta_i$ : fraction of subsets with i elements
• e.g.: fraction of nodes with degree i, fraction of flows with i packets

Model (cont)

→ $b_{ij} = \binom{i}{j} p^j (1-p)^{i-j}$ (binomial)
→ $\theta_i$ : fraction of subsets with i elements
→ W : maximum subset size
→ $d_j = \sum_{i=j}^{W} b_{ij}\,\theta_i$ : fraction of subsets with j sampled elements
– $d_0$ is not observable
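The model can be written out numerically. The sketch below assumes the notation $\theta_i$ for the fraction of subsets with i elements, $b_{ij}$ for the binomial probability that j of i elements are sampled, and $d_j$ for the fraction of subsets with j sampled elements. The function names and the example distribution are hypothetical.

```python
from math import comb

def thinning_matrix(W, p):
    """B[j][i] = b_ij = C(i, j) p^j (1-p)^(i-j) for 0 <= j <= i <= W, else 0."""
    return [[comb(i, j) * p**j * (1 - p)**(i - j) if j <= i else 0.0
             for i in range(W + 1)] for j in range(W + 1)]

def sampled_distribution(theta, p):
    """theta[i] = fraction of subsets with i elements (theta[0] = 0).

    Returns d with d[j] = sum_i b_ij * theta[i], for j = 0..W.
    """
    W = len(theta) - 1
    B = thinning_matrix(W, p)
    return [sum(B[j][i] * theta[i] for i in range(W + 1)) for j in range(W + 1)]

# Hypothetical distribution over subset sizes 1..3.
theta = [0.0, 0.5, 0.3, 0.2]
d = sampled_distribution(theta, p=0.5)
print("d_0 (unobservable fraction):", d[0])
print("d_1..d_W:", d[1:])
```

Here `d[0]` is the fraction of subsets that leave no trace in the sample, so only `d[1:]` can be measured in practice.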

Mean Squared Error Question

→ $\hat{\theta}_i$ : unbiased estimate of $\theta_i$

→ p : sampling probability
→ N : number of sampled subsets (e.g. N sampled flows)

Does an unbiased estimator with small mean squared error MSE($\hat{\theta}_i$) exist?

Try the Maximum Likelihood Estimator (MLE)?

Maximum Likelihood Estimation

→ Simulation: edge sampling
→ Flickr network (photo-sharing), 1.5M nodes

[Figure: x-axis: in-degree]

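Cramer-Rao Lower Bound (CRLB)

→ Let $B = [b_{ij}]$ (row j, column i), $d = [d_j]$, $\theta = [\theta_i]$
– Then $d = B\theta$

→ $D = \mathrm{diag}(d)$ : diagonal matrix with $D_{jj} = d_j$

→ $\hat{\theta}_i$ : unbiased estimate of $\theta_i$

→ J : Fisher information matrix of N subsets
– $J = B^T D^{-1} B$
– Lower bound on the mean squared error of $\hat{\theta}_i$ :
$\mathrm{MSE}(\hat{\theta}_i) \ge (J^{-1})_{ii}/N$

Need to find $J^{-1}$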
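The bound can be evaluated numerically for a small maximum subset size W. The sketch below rests on several assumptions: the Fisher information is taken in the usual multinomial form $J = B^T D^{-1} B$ with $D = \mathrm{diag}(d)$, indices run over 1..W because $d_0$ is not observable, and the normalization constraint on $\theta$ is ignored. It uses the identity $(B^T D^{-1} B)^{-1} = B^{-1} D B^{-T}$, which holds because B is upper triangular with a nonzero diagonal. The function name `crlb` and the truncated power law are hypothetical.

```python
from math import comb

def crlb(theta, p, N):
    """Return the bounds (J^{-1})_ii / N for i = 1..W, with J = B^T D^{-1} B.

    theta[i-1] is the fraction of subsets with i elements, i = 1..W.
    """
    W = len(theta)
    # B[j-1][i-1] = b_ij; upper triangular because b_ij = 0 for j > i.
    B = [[comb(i, j) * p**j * (1 - p)**(i - j) if j <= i else 0.0
          for i in range(1, W + 1)] for j in range(1, W + 1)]
    d = [sum(B[j][i] * theta[i] for i in range(W)) for j in range(W)]
    # Invert B column by column with back substitution.
    Binv = [[0.0] * W for _ in range(W)]
    for c in range(W):
        for r in range(c, -1, -1):
            s = sum(B[r][k] * Binv[k][c] for k in range(r + 1, c + 1))
            Binv[r][c] = ((1.0 if r == c else 0.0) - s) / B[r][r]
    # (J^{-1})_ii = sum_j Binv[i][j]^2 * d_j
    return [sum(Binv[i][j] ** 2 * d[j] for j in range(W)) / N for i in range(W)]

# Hypothetical truncated power law on sizes 1..W.
W = 6
weights = [i ** -2.0 for i in range(1, W + 1)]
theta = [w / sum(weights) for w in weights]
for p in (0.3, 0.7):
    print("p =", p, "bound for theta_W:", crlb(theta, p, N=1000)[-1])
```

Under these assumptions the last diagonal entry has the closed form $\theta_W / p^W$, so the bound for the largest size grows geometrically in W as p shrinks.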

Recap

→ Interested in the inverse of the Fisher information matrix because $\mathrm{MSE}(\hat{\theta}_i) \ge (J^{-1})_{ii}/N$

→ N : # of subsets sampled (# of nodes, # of TCP flows)

→ $\hat{\theta}$ : subset size distribution estimate (what we seek)

→ p : sampling probability (edges, packets)

→ W : maximum subset size

Results

Heavier than exponential subset size distribution tail

→ Theorem 1: Suppose that $\theta_W$ decreases more slowly than exponentially in W. More precisely, assume $-\log(\theta_W) = o(W)$. Then the error grows with subset size W.

Exponential subset size distribution tail

→ Theorem 2: Suppose that $\theta_W$ decreases exponentially in W. More precisely, assume $-\log(\theta_W) = -W \log a + o(W)$ as $W \to \infty$ for some $0 < a < 1$.

Lighter than exponential subset size distribution tail

→ Theorem 3: Suppose that $\theta_W$ decreases faster than exponentially in W. More precisely, assume $-\log(\theta_W) = \omega(W)$. Then it follows that

0 < p ≤ 1

Infinite support & power laws

→ If $\theta$ is a power law with infinite support ($W \to \infty$)
– if p < ½, any unbiased estimator has “infinite” MSE
• might as well output random estimates
– if p > ½, estimates can be accurate if enough samples are collected

Estimating Subset Size Average

→ I : randomly chosen subset size
→ Average subset size E[I]:
– if $E[I] < \infty$ and $E[I^2] = \infty$, then the estimation error is unbounded
• Reason: inspection paradox
• Sampling is biased towards very large subsets
– Average size of sampled subsets: $E[I^2]/2E[I]$
– otherwise, the error is bounded
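The bias towards large subsets can be checked directly. The sketch below assumes a hypothetical truncated power law and the same independent element sampling with probability p. It compares the mean size of a randomly chosen subset with the mean original size of subsets that have at least one sampled element. The function names are illustrative only.

```python
def mean_size(theta):
    """E[I] for theta[i] = fraction of subsets of size i (index 0 unused)."""
    return sum(i * t for i, t in enumerate(theta))

def mean_size_of_observed(theta, p):
    """Mean original size of subsets with at least one sampled element.

    A subset of size i is observed with probability 1 - (1-p)^i.
    """
    seen = [t * (1 - (1 - p) ** i) for i, t in enumerate(theta)]
    return sum(i * s for i, s in enumerate(seen)) / sum(seen)

# Hypothetical power law truncated at W = 1000.
W = 1000
weights = [0.0] + [i ** -2.0 for i in range(1, W + 1)]
total = sum(weights)
theta = [w / total for w in weights]

print("mean subset size:         ", mean_size(theta))
print("mean size, observed only: ", mean_size_of_observed(theta, p=0.01))
```

The gap between the two numbers widens as the tail gets heavier or as W grows, which is the inspection paradox at work.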

IMPACT OF POWER LAWS ON SAMPLING CENTRAL NETWORK NODES

Part 2: Opportunity

Central Nodes

→ Central nodes are important in networks
– Communication bottlenecks, trend setters, information aggregators

→ Notions of centrality
– betweenness, closeness, PageRank, degree

Challenge: identify the top k central nodes while exploring a small fraction of the network

Degree as a proxy for centrality

→ Betweenness centrality: a node is central if it belongs to many shortest paths
→ Closeness centrality: a node is central if it has short paths to all other nodes
→ Rank correlation measures the degree of similarity between two rankings
→ Low rank correlation in planar graphs (e.g. power grid)

Set          Type of Network  # of nodes  # of edges  Description
AS-Snapshot  Computer         22,963      48,436      Snapshot of Internet at level of AS
ca-CondMat   Collaboration    23,133      186,936     ArXiv Condensed Matter
ca-HepPh     Collaboration    12,008      237,010     ArXiv High Energy Physics
email-Enron  Social           36,692      367,662     Email network from Enron

[Figure: Rank correlation with Degree]
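A rank correlation between degree and another centrality score can be computed as follows. The slides do not say which coefficient was used, so Kendall's tau (the tau-a variant) is an assumption made here for illustration, and the scores are made up.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Pairs tied in either ranking count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-node scores: degree and some other centrality measure.
degree = [1, 2, 3, 4]
other_centrality = [1, 3, 2, 4]
print(kendall_tau(degree, other_centrality))
```

A value near 1 means that ranking nodes by degree nearly reproduces the ranking by the other measure, which is the sense in which degree acts as a proxy.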

A random walk (RW) in steady state visits a node with probability proportional to the node's degree

In power law graphs this bias towards high degree nodes is strong

We observe that RWs are more efficient than more sophisticated techniques (AXS, RXS)
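The degree-proportional steady state can be verified on a small example. The sketch below assumes a hypothetical undirected, connected, non-bipartite toy graph. It iterates the random walk's transition rule and compares the result with degree divided by twice the number of edges.

```python
def stationary_by_power_iteration(adj, steps=200):
    """adj: dict node -> list of neighbours (undirected, connected, non-bipartite).

    Repeatedly pushes probability mass from each node to its neighbours
    in equal shares, starting from the uniform distribution.
    """
    nodes = list(adj)
    pi = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            share = pi[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        pi = nxt
    return pi

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
pi = stationary_by_power_iteration(adj)
two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the number of edges
expected = {v: len(adj[v]) / two_m for v in adj}
for v in adj:
    print(v, round(pi[v], 6), expected[v])
```

In a power law graph the few very high degree nodes carry a large share of this probability mass, so a walk reaches them after exploring only a small part of the network.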

Looking for high degree nodes

[Figure: x-axis: % of network sampled]

Thank you
