The Very Small World of the Well-Connected, by Xiaolin Shi, Matt Bonner, Lada Adamic, Anna Gilbert

Post on 06-Jan-2018



The Very Small World of the Well-Connected

Xiaolin Shi, Matt Bonner, Lada Adamic, Anna Gilbert

Outline

• VIGS: Vertex-Importance Graph Synopsis
• Testing VIGS with different datasets and importance measures
• Analytical expectations
• Making guarantees about VIGS
– Connectedness: KeepOne, KeepAll
• Related Work
– Graph Sampling, Rich Club, K-cores, Web Measure

Network or Hairball?

• Huge networks are difficult to study, store, and share
• Can we shrink or summarize a network?
• Starting point: important vertices
• Vertex-Importance Graph Synopsis

Vertex-Importance Graph Synopsis

Create subgraph of important vertices

Study both key nodes and entire graph

Which vertices are important?
– High-traffic routers? The most quoted blog?

Standard, well-defined measures
– Degree, Betweenness, Closeness, PageRank

VIGS In Action

• Starting point: random graph with 100 vertices
• Select an importance measure - Degree
• Pick the 9 highest-degree vertices
• Keep only edges between these 9 vertices

Original graph: average degree = 4; subgraph: average degree = 0.9
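The four steps above can be sketched in a few lines of standard-library Python. This is only a minimal illustration: the helper names (`random_graph`, `vigs`, `average_degree`), the fixed seed, and the choice of exactly 200 edges (to give an average degree of 4) are assumptions made for the example, not details from the slides.

```python
import random

def random_graph(n, m, seed=0):
    """Undirected random graph with n vertices and m distinct edges (assumed model)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    edges = 0
    while edges < m:
        u, v = rng.sample(range(n), 2)   # two distinct endpoints
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj

def vigs(adj, importance, k):
    """Keep the k highest-scoring vertices and only the edges among them."""
    top = set(sorted(adj, key=lambda v: importance[v], reverse=True)[:k])
    return {v: adj[v] & top for v in top}

def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

graph = random_graph(100, 200)                       # 200 edges -> average degree 4
degree = {v: len(nbrs) for v, nbrs in graph.items()} # importance measure: degree
synopsis = vigs(graph, degree, 9)

print("original average degree:", average_degree(graph))
print("subgraph average degree:", average_degree(synopsis))
```

Any other importance measure can be swapped in by passing a different score dictionary to `vigs`.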

Motivating example: citations among ACM papers

Figure panels: 500 random papers; 500 most cited papers

Datasets

• Erdos-Renyi random graph and three real networks
• BuddyZoo - collection of buddy lists
• TREC - links between blogs
• Web - an older web crawl from PARC

            Erdos-Renyi   BuddyZoo   TREC      Web
Vertices    10,000        135,131    29,690    152,171
Edges       49,935        803,200    195,940   1,686,541
ASP         4.26          5.96       3.72      3.48
Directed    false         false      true      true

Importance measures

• degree (number of connections), denoted by size
• betweenness (number of shortest paths a vertex lies on), denoted by color

Importance measures

• degree (number of connections), denoted by size
• closeness (based on the length of the shortest paths to all other vertices), denoted by color
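As a rough illustration of the closeness measure, the sketch below runs a breadth-first search from a vertex and turns the summed distances into a score. The normalization used here, (reachable vertices) / (sum of distances), is one common convention and an assumption; the slides do not say which variant was used. The function names are also hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, v):
    """(number of reachable vertices) / (sum of distances to them); 0 if isolated."""
    dist = bfs_distances(adj, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Toy path graph 0 - 1 - 2 - 3 - 4: the middle vertex is the "closest" to everyone.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
scores = {v: closeness(path, v) for v in path}
print(scores)
```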

• High correlation between different importance measures
• Undirected graphs - higher correlation
• Closeness has the lowest correlation in all datasets

Correlation among measures


Assortativity

• In an assortative graph, high-value nodes tend to connect to other high-value nodes
• Example: degree

Figure panels: assortative; disassortative

Assortativity - Degree

• ER: Neutral

• BZ: Assortative

• TREC and Web: Disassortative
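One common way to quantify degree assortativity is the Pearson correlation between the degrees found at the two ends of each edge: positive values mean assortative, negative values disassortative, values near zero neutral. The sketch below assumes that definition for an undirected graph; whether the slides used exactly this coefficient is not stated.

```python
def degree_assortativity(adj):
    """Pearson correlation of endpoint degrees over all edges of an undirected graph."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:                 # each undirected edge is seen in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:       # all degrees equal: correlation undefined
        return 0.0
    return cov / (var_x * var_y) ** 0.5

# Star: one hub linked to low-degree leaves, the extreme disassortative case.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
# Triangle plus a separate edge: every vertex links only to equals, fully assortative.
clusters = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}

print(degree_assortativity(star), degree_assortativity(clusters))
```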

Assortativity

Degree distributions

Subgraphs

• Apply VIGS! Select Degree, top 100 nodes
• Example: degree
• Substantial difference between datasets!

Subgraphs

The selection of an importance measure may have an impact, even in the same dataset

Connectivity: size of largest component

Proportion of nodes that are connected either directly or indirectly
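The connectivity measure can be sketched as the size of the largest connected component divided by the number of vertices in the subgraph. The version below assumes an undirected adjacency-set representation; for the directed datasets one would first have to decide whether to ignore edge direction, which the slides do not specify.

```python
def largest_component_fraction(adj):
    """Fraction of vertices that lie in the largest connected component."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:                    # depth-first walk of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj) if adj else 0.0

# Toy subgraph: a 3-vertex chain, a separate pair, and one isolated vertex.
toy = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
print(largest_component_fraction(toy))
```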

Subgraph Connectivity - ER

• Highly connected, even with only a few vertices

• Subgraphs for all importance measures are almost completely connected by 2000 nodes

• Better connectivity than a subgraph of random vertices

Subgraph Connectivity

Subgraphs: density

Original graph: average degree = 4; subgraph: average degree = 0.9

What is the proportion of edges to nodes in the original graphs vs. subgraphs?

Subgraph Density - ER

• Black line slope = Edges/Vertices in entire network

• Lower dotted line = subgraph of random vertices

• VIGS subgraphs: lower than total density, higher than random subgraph density
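The baseline slope (edges per vertex in the entire network) follows directly from the vertex and edge counts in the Datasets table. The sketch below computes it for all four datasets and adds a hypothetical helper, `subgraph_density`, that applies the same ratio to an adjacency-set subgraph; the helper and its `directed` flag are assumptions for illustration.

```python
# (vertices, edges) as listed in the Datasets table
DATASETS = {
    "Erdos-Renyi": (10_000, 49_935),
    "BuddyZoo": (135_131, 803_200),
    "TREC": (29_690, 195_940),
    "Web": (152_171, 1_686_541),
}

def edges_per_vertex(num_vertices, num_edges):
    return num_edges / num_vertices

def subgraph_density(adj, directed=False):
    """Edges per vertex for a graph stored as {vertex: set(neighbours)}."""
    endpoints = sum(len(nbrs) for nbrs in adj.values())
    # An undirected edge appears in two neighbour sets, a directed edge in one.
    edges = endpoints if directed else endpoints / 2
    return edges / len(adj)

baseline = {name: edges_per_vertex(v, e) for name, (v, e) in DATASETS.items()}
for name, slope in baseline.items():
    print(f"{name}: {slope:.2f} edges per vertex")
```

A VIGS subgraph can then be compared against its dataset's baseline by calling `subgraph_density` on it.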

Subgraph Density

Average Shortest Path ‘ASP’

• Whole network ASP
• ASP between important vertices (IVs) in subgraph
• ASP between IVs in whole graph

ER: ASP is shorter between IVs, but higher in the subgraph
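The difference between the two ASP variants for important vertices can be made concrete with a small sketch: the same set of vertices is measured once with paths allowed through the whole graph and once with paths restricted to the induced subgraph. The toy graph, the helper names, and the choice to skip unreachable pairs are all assumptions for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def induced_subgraph(adj, keep):
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

def average_shortest_path(adj, vertices):
    """Mean hop distance over reachable pairs of the given vertices (undirected)."""
    vertices = list(vertices)
    total = pairs = 0
    for i, u in enumerate(vertices):
        dist = bfs_distances(adj, u)
        for v in vertices[i + 1:]:
            if v in dist:               # unreachable pairs are skipped
                total += dist[v]
                pairs += 1
    return total / pairs if pairs else float("inf")

# 5-cycle 0-1-2-3-4-0; vertex 4 is "unimportant" but provides a shortcut 0-4-3.
whole = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {0, 3}}
important = [0, 1, 2, 3]

asp_whole = average_shortest_path(whole, important)
asp_sub = average_shortest_path(induced_subgraph(whole, important), important)
print(asp_whole, asp_sub)
```

Dropping vertex 4 removes the shortcut, so the same pairs are farther apart inside the subgraph than in the whole graph.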

Subgraph Average Shortest Path ‘ASP’ for Erdos-Renyi

Subgraph ASPs

Relative Rank of Vertices in Subgraph - ER

• Do important vertices (IVs) maintain their relative rank in subgraphs?

• IVs and edges only

• ER - little correlation, steadily increasing until all vertices are included
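One way to check whether important vertices keep their relative rank is a rank correlation between each vertex's importance score in the whole graph and its score recomputed inside the subgraph. The sketch below assumes Spearman's coefficient (Pearson correlation of ranks, with ties given their average rank); the slides do not name the statistic used, and the score lists are hypothetical toy values.

```python
def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Hypothetical scores for five important vertices (same vertex order in both lists).
score_in_whole_graph = [50, 40, 30, 20, 10]
score_in_subgraph = [4, 5, 3, 1, 2]
rho = spearman(score_in_whole_graph, score_in_subgraph)
print(rho)
```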

Relative Rank in Subgraph

TREC anomaly - closeness

Four Regions

• Four regions, highlighted in density plot

Figure panels: Original; Closeness only, regions highlighted

Cause: Blog Aggregator

• One node has connections to 99% of the nodes between 1 and 7961! (regions 1, 2, 3)
• This same node has only 1 connection to a node beyond 7961 (region 4)
• Nodes between 5828 and 7961 (region 3) have only 1 connection: to the aggregator
• Spam blogs? New blogs? Private blogs?

Examining Density

• The first 3 regions feature nodes connected to the aggregator
• R1: well connected blogs
– Average increase in total edges per node added: 12.93
• R2: far less connected, but not quite barren
– Average increase per node: 3.2
• R3: isolated spam/new blogs
– 1 edge per node increase

Examining Density

R4: well connected, but not linked to aggregator

Average increase even higher than region 1: 17.8

Aggregator inflated the closeness scores of connected nodes (R1, 2, 3) above those in region 4

Examining Avg Shortest Paths (ASP)

• R1: ASP slightly below 2
– Some nodes directly connected, 99%+ within 2 hops via aggregator
• R2 and 3: ASP levels at ~2
– Fewer and fewer direct links, but all accessible via aggregator
• R4: ASPs begin to increase
– ASP doesn’t explode: ~70% of R4 links are to R1 or R2 nodes
– R3 only reachable from R4 via aggregator
– Access to aggregator through connected R1/R2 nodes: adds a hop to path

Examining Relative Ranking Correlation

R1-3: correlation steadily decreases

R4: rapid increase in correlation!

Spam blogs’ importance in the subgraph is initially inflated

Realigns when blogs in region 4 connect with real blogs in regions 1-2

Localized to closeness

• Region 1, 2 and 3 nodes have high closeness thanks to the aggregator
– Recall ASP graph - short distance to many, many nodes via aggregator
• Connection to aggregator doesn’t confer high degree, PageRank or betweenness - nodes must ‘fend for themselves’
– Degree: link to aggregator is just 1 link
– PR: aggregator ‘vote’ diluted by high degree
– Bet: aggregator is gateway to its children, could use any child to reach aggregator
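A small worked calculation shows why a single link to the aggregator is enough for high closeness but not for high degree. It is an idealized sketch under stated assumptions: an aggregator h linked to n nodes, a node v whose only link is to h, the rest of the graph ignored (so N = n + 1 vertices), and closeness normalized as (N - 1) divided by the sum of distances. None of these simplifications come from the slides.

```latex
\begin{aligned}
\sum_{u \neq v} d(v,u)
  &= \underbrace{1}_{\text{to } h}
   + \underbrace{2\,(n-1)}_{\text{to the other neighbours of } h}
   = 2n - 1, \\
C(v) &= \frac{N-1}{\sum_{u \neq v} d(v,u)}
      = \frac{n}{2n-1}
      \;\xrightarrow{\; n \to \infty \;}\; \frac{1}{2},
\qquad \deg(v) = 1 .
\end{aligned}
```

Under these assumptions the node's average distance to everyone else stays just under 2 hops however large n grows, while its degree stays at 1.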

• VIGS results vary by graph and importance measure

• Still, subgraphs tended towards
– High connectivity
– Average or higher density
– Shorter ASPs
– Maintaining relative importance rank of vertices
– “spam” affects closeness primarily

Empirical Analysis Summary

Preserving Properties

• So far, just studying subgraphs
• Applying VIGS - may need guarantees
• Hard to make a guarantee?
• Example property: subgraph is connected

Preserving Properties

Preserving Properties

• Is it difficult to guarantee the connectedness of a VIGS subgraph?
• NP-complete: reducible to the Steiner Minimum Spanning Tree (MST) problem
• Resort to heuristics
– KeepOne, KeepAll from Gilbert and Levchenko (2004)

KeepOne and KeepAll

• KeepOne - build an MST: drop as many vertices/edges as possible while maintaining connectivity
• Problem! ASP/diameter could increase
• Solution: KeepAll - MST, but add all vertices/edges on a shortest path
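To give a feel for the trade-off, here is a simplified sketch in the spirit of KeepAll: it connects the important vertices by adding the vertices of one breadth-first shortest path between every pair. This is not the published Gilbert and Levchenko heuristic, whose details are not given in the slides; the function names and the one-path-per-pair rule are assumptions.

```python
from collections import deque

def shortest_path(adj, source, target):
    """One shortest path from source to target as a vertex list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return None
    path = []
    node = target
    while node is not None:             # walk parents back to the source
        path.append(node)
        node = parent[node]
    return path[::-1]

def connect_by_shortest_paths(adj, important):
    """Induced subgraph on the important vertices plus one shortest path per pair."""
    important = list(important)
    keep = set(important)
    for i, u in enumerate(important):
        for v in important[i + 1:]:
            path = shortest_path(adj, u, v)
            if path:
                keep.update(path)
    return {v: adj[v] & keep for v in keep}

# Star: leaves 1, 2, 3 are "important" but only connected through hub 0.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
result = connect_by_shortest_paths(star, [1, 2, 3])
print(sorted(result))
```

Because a full shortest path is kept for every pair, distances between important vertices are the same as in the whole graph, at the cost of extra vertices.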

Heuristic Performance - ER

• KO - did not have to add many vertices, but shortest path rather large (ER ASP was 4.26)

• KA - good improvement in path length, but huge increase in vertices


Heuristic Performance - BZ

• Similar performance to ER - KO results in significantly longer shortest paths, but KA adds many vertices

• Is 4000 too many vertices to add? Small compared to total graph, but huge compared to number of important vertices


Heuristic Performance - TREC

• Almost completely connected from the start

• KA adds only a few vertices, doesn’t change much

• Results for Web dataset similar


Related Work

• Graph sampling
– Similar objective: synopsis
– Concerned only with original graph
– Random sampling, snowball sampling…
– Lee, Kim, Jeong (2006), Leskovec, Faloutsos (2006), Li, Church, Hastie (2006)
• Rich-club
– Concerned only with high degree nodes
– Zhou, Mondragon (2004), Colizza, Flammini, Serrano, Vespignani (2006)

Related Work

• K-cores
– Subgraphs where each vertex has at least k connections within the subgraph
– Dorogovtsev, Goltsev, Mendes (2006)
• Core connectivity
– Smallest number of important vertices to remove before destroying largest component
– Mislove, Marcon, Gummadi, Druschel, Bhattacharjee (2007)
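For comparison with VIGS, the k-core definition on this slide can be computed with the standard peeling procedure: repeatedly delete any vertex with fewer than k remaining neighbours until none is left. The sketch below assumes an undirected adjacency-set graph; the function name is hypothetical.

```python
def k_core(adj, k):
    """Largest subgraph in which every vertex has at least k neighbours."""
    core = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    low = [v for v, nbrs in core.items() if len(nbrs) < k]
    while low:
        v = low.pop()
        if v not in core:               # already removed via an earlier entry
            continue
        for u in core.pop(v):
            core[u].discard(v)
            if len(core[u]) < k:
                low.append(u)
    return core

# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(sorted(k_core(toy, 2)))
```

Unlike VIGS, the vertices kept here are selected by their connections inside the surviving subgraph rather than by an importance score computed on the whole graph.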

VIGS wrap up

vertex-importance graph synopsis

create a subgraph of important vertices to study both the full graph and these vertices in particular

properties of VIGS depend on entire network and importance measure

real world networks have dense, closely knit VIGS

in some cases easy to meet connectivity & ASP guarantees

Thanks to Xiaolin Shi

Matthew Bonner

Lada Adamic

NSF DMS 0547744
