All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis. Edith Cohen, Microsoft Research. Presented by: Thomas Pajor, Microsoft Research.

Page 1: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

Edith Cohen, Microsoft Research

Presented by: Thomas Pajor, Microsoft Research

Page 2: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Very Large Graphs

Model many types of relations and interactions:
- Call detail data, email exchanges
- Web crawls
- Social networks: Twitter, Facebook, LinkedIn
- Web searches, commercial transactions, …

Need for scalable analytics:
- Centralities/Influence (power/importance/coverage of a node or a set of nodes): viral marketing, …
- Similarities/Communities (how tightly related are 2 or more nodes): recommendations, advertising, marketing

Page 3: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

All-Distances Sketches (ADS) [C ‘94]

Summary structures: for each node $v$, $\mathrm{ADS}(v)$ “samples” the distance relations of $v$ to all other nodes.

Useful for queries involving a single node: neighborhood cardinality and statistics.

Sketches of different nodes are coordinated: related in a way that is useful for queries that involve multiple nodes (similarities, influence, distance).

Page 4: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

All-Distances Sketches (ADS) [C ‘94]: Basic properties

$m$ edges, $n$ nodes, and a parameter $k$ which controls the trade-off between sketch size and information.

- ADSs work for directed or undirected graphs
- Compact size: $E[|\mathrm{ADS}(v)|] \approx k \ln n$
- Scalable computation: $\approx k m \ln n$ edge traversals to compute $\mathrm{ADS}(v)$ for all nodes $v$
- Many applications

Page 5: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

All-Distances Sketches: Definition

$\mathrm{ADS}(v)$ is a list of pairs of the form $(i, d_{vi})$.

Draw a random permutation of the nodes: $r(i) \in [n]$.

$i \in \mathrm{ADS}(v) \iff r(i) <$ the $k$th smallest rank amongst nodes that are closer to $v$ than $i$.

This is a bottom-$k$ ADS: it is the union of the bottom-$k$ MinHash sketches ($k$ smallest ranks) of all “neighborhoods.” There are other ADS “flavors”, which vary by the rank distribution (e.g. can use $r(i) \sim U[0,1]$) or sketch structure.
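The inclusion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the construction algorithm from the talk: it assumes the nodes are already sorted by distance from $v$ (so list position stands in for "closer to $v$", which also breaks distance ties), and the function name `bottom_k_ads` is hypothetical. The rank values are the ones from the ADS example in this deck.

```python
import heapq

def bottom_k_ads(ranks_by_distance, k):
    """Return the ranks kept in a bottom-k ADS.

    ranks_by_distance: rank r(i) of every node, ordered from closest to
    farthest from v (assumption: list order defines "closer than").
    """
    kept = []
    smallest = []  # negated ranks: a max-heap of the k smallest ranks so far
    for r in ranks_by_distance:
        if len(smallest) < k:
            # Fewer than k closer nodes: the node is always included.
            heapq.heappush(smallest, -r)
            kept.append(r)
        elif r < -smallest[0]:
            # r is below the kth smallest rank among closer nodes.
            heapq.heapreplace(smallest, -r)
            kept.append(r)
    return kept

# Ranks of all nodes, sorted by shortest-path distance from v (from the example).
ranks = [0.63, 0.42, 0.56, 0.84, 0.07, 0.35, 0.49,
         0.91, 0.21, 0.28, 0.14, 0.70, 0.77]

print(bottom_k_ads(ranks, 1))  # [0.63, 0.42, 0.07]
print(bottom_k_ads(ranks, 2))  # [0.63, 0.42, 0.56, 0.07, 0.35, 0.21, 0.14]
```

A real ADS entry would also store the node identifier and its distance $d_{vi}$; only the ranks are tracked here to keep the rule itself visible.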

Page 6: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

ADS example

[Figure: example graph with edge lengths, the shortest-path (SP) distances from the source node, and a random permutation of the nodes shown as rank values between 0 and 1.]

Page 7: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

ADS example

All nodes sorted by SP distance from $v$:

0.63 0.42 0.56 0.84 0.07 0.35 0.49 0.91 0.21 0.28 0.14 0.70 0.77

Bottom-1 $\mathrm{ADS}(v)$: 0.63 0.42 0.07

Page 8: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

ADS example

All nodes sorted by SP distance from $v$:

0.63 0.42 0.56 0.84 0.07 0.35 0.49 0.91 0.21 0.28 0.14 0.70 0.77

Bottom-2 $\mathrm{ADS}(v)$: 0.63 0.42 0.56 0.07 0.35 0.21 0.14

Page 9: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

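“Basic” use of ADSs (90’s–2013)

Extract the MinHash sketch of the $d$-neighborhood of $v$, $N_d(v)$, from $\mathrm{ADS}(v)$: the bottom-$k$ ranks among entries with $d_{vi} \le d$.

From MinHash sketches, we can estimate:
- Cardinality $n_d(v) = |N_d(v)|$. The estimate has CV $\le 1/\sqrt{k-2}$ (optimally uses the information in the MinHash sketch).
- Jaccard similarity of $N_d(v)$ and $N_{d'}(u)$
- Other relations of $N_d(v)$ and $N_{d'}(u)$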
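As an illustration of the "basic" use, the following sketch estimates the Jaccard similarity of two sets from their bottom-$k$ MinHash sketches. It uses the standard union-based bottom-$k$ estimator, which may differ from the exact estimator the talk has in mind. The helper names (`rank`, `bottom_k`, `jaccard_estimate`) and the use of SHA-256 to derive ranks are assumptions made for this example; the key point is that both sketches use the same rank function, which is what "coordinated" means here.

```python
import hashlib

def rank(x):
    """Map an item to a pseudo-random rank in [0, 1); the hash is an assumption."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k(items, k):
    """Bottom-k MinHash sketch: the k smallest ranks of the set."""
    return sorted(rank(x) for x in set(items))[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two coordinated bottom-k sketches."""
    # The k smallest ranks of the two sketches form the bottom-k sketch of A ∪ B.
    union_k = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_k:
        return 0.0
    # A rank in the union sketch belongs to A ∩ B iff it is in both sketches.
    both = set(sketch_a) & set(sketch_b)
    return sum(1 for r in union_k if r in both) / len(union_k)

k = 256
a = bottom_k(range(0, 1000), k)
b = bottom_k(range(500, 1500), k)
print(jaccard_estimate(a, b, k))  # true Jaccard similarity is 1/3
```

With neighborhoods $N_d(v)$ and $N_{d'}(u)$ in place of the integer ranges, the two sketches would be read off $\mathrm{ADS}(v)$ and $\mathrm{ADS}(u)$ instead of being built from the full sets.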

Page 10: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Historic Inverse Probability (HIP) inclusion probability & estimator

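For each node $i$, we estimate the “presence” of $i$ with respect to $v$ ($=1$ if $i$ is reachable from $v$, $0$ otherwise).

The estimate is $0$ if $i \notin \mathrm{ADS}(v)$. If $i \in \mathrm{ADS}(v)$, we compute the probability $p_{vi}$ that it is included, conditioned on fixed rank values of all nodes that are closer to $v$ than $i$. We then use the inverse-probability estimate $a_{vi} = 1/p_{vi}$. [HT52]

This is unbiased (when $p_{vi} > 0$): $E[a_{vi}] = (1 - p_{vi}) \cdot 0 + p_{vi} \cdot \frac{1}{p_{vi}} = 1$.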

Page 11: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

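Bottom-$k$ HIP

For bottom-$k$ and $r \sim U[0,1]$: the inclusion probability $p_{vi}$ is the $k$th smallest $r$ value among nodes closer to $v$ than $i$ (and $p_{vi} = 1$ if there are fewer than $k$ such nodes), so $a_{vi} = 1/p_{vi}$.

HIP can be used with all flavors of MinHash sketches, over distance (ADS) or time (streams).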
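The bottom-$k$ HIP weights can be computed in the same scan that builds the sketch. The sketch below is illustrative only: the function name `hip_weights` is hypothetical, ranks are assumed uniform in $(0, 1)$, and the input is assumed to be sorted by distance from $v$. The rank values are those of the deck's ADS example with $k = 2$.

```python
import heapq

def hip_weights(ranks_by_distance, k):
    """Return (rank, p, a) for every node kept in the bottom-k ADS.

    p is the inclusion probability conditioned on the ranks of closer nodes:
    the kth smallest rank so far, or 1 if fewer than k nodes are closer.
    a = 1 / p is the HIP adjusted weight.
    """
    entries = []
    smallest = []  # negated ranks: a max-heap of the k smallest ranks so far
    for r in ranks_by_distance:
        p = 1.0 if len(smallest) < k else -smallest[0]
        if r < p:
            entries.append((r, p, 1.0 / p))
            if len(smallest) < k:
                heapq.heappush(smallest, -r)
            else:
                heapq.heapreplace(smallest, -r)
    return entries

ranks = [0.63, 0.42, 0.56, 0.84, 0.07, 0.35, 0.49,
         0.91, 0.21, 0.28, 0.14, 0.70, 0.77]

for r, p, a in hip_weights(ranks, 2):
    print(f"r={r:.2f}  p={p:.2f}  a={a:.2f}")
```

Rounded to two decimals, the adjusted weights come out as 1, 1, 1.59, 1.79, 2.38, 2.86, 4.76.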

Page 12: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Example: HIP estimates

Bottom-2 ADS of $v$

$r(i)$: 0.63 0.42 0.56 0.07 0.35 0.21 0.14

$p_{vi}$ (2nd smallest $r$ value among closer nodes): 1 1 0.63 0.56 0.42 0.35 0.21

$a_{vi} = 1/p_{vi}$: 1 1 1.59 1.79 2.38 2.86 4.76

Page 13: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HIP cardinality estimate

Bottom-2 ADS of $v$

$a_{vi}$: 1 1 1.59 1.79 2.38 2.86 4.76

distance: 0 5 6 10 10 15 17

Query: $n_6(v) = \sum_{(i, d_{vi}) \in \mathrm{ADS}(v) \,\mid\, d_{vi} \le 6} a_{vi} = 1 + 1 + 1.59 = 3.59$
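The query above is just a filtered sum over the sketch. A minimal sketch, assuming the ADS is stored as `(distance, a)` pairs (a representation chosen for this example) and using the rounded weights and distances from the slide:

```python
def hip_cardinality(ads, d):
    """HIP estimate of the number of nodes within distance d of v."""
    return sum(a for dist, a in ads if dist <= d)

# (distance, adjusted weight a_vi) for the bottom-2 ADS of v in the example.
ads_v = [(0, 1.0), (5, 1.0), (6, 1.59), (10, 1.79),
         (10, 2.38), (15, 2.86), (17, 4.76)]

print(round(hip_cardinality(ads_v, 6), 2))  # 3.59
```

The same list answers the query for every distance threshold $d$, which is the practical appeal of keeping one sketch per node.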

Page 14: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Quality of HIP cardinality Estimate

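Lemma: The HIP neighborhood cardinality estimator

$\hat{n}_d(v) = \sum_{(i, d_{vi}) \in \mathrm{ADS}(v) \,\mid\, d_{vi} \le d} a_{vi}$

has CV $\le 1/\sqrt{2(k-1)}$.

See paper for the proof.

This is a $\sqrt{2}$ improvement over “basic” estimators, whose CV bound is $1/\sqrt{k-2}$.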
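A small Monte Carlo experiment can make the comparison concrete. The sketch below is an assumption-laden illustration rather than the experiment from the talk: it draws uniform ranks for $n$ distinct nodes, computes the HIP estimate and the standard bottom-$k$ estimate $(k-1)/\tau_k$ (with $\tau_k$ the $k$th smallest rank) on the same sketch, and reports the relative root-mean-square error of each. The names `one_trial` and `simulate` and all parameter values are hypothetical, and $n \ge k$ is assumed.

```python
import heapq
import random
import statistics

def one_trial(n, k, rng):
    """Return (HIP estimate, basic bottom-k estimate) of n for one rank draw."""
    smallest = []  # negated ranks: a max-heap of the k smallest ranks so far
    hip = 0.0
    for _ in range(n):
        r = rng.random()
        p = 1.0 if len(smallest) < k else -smallest[0]
        if r < p:
            hip += 1.0 / p  # add the adjusted weight of the newly sampled node
            if len(smallest) < k:
                heapq.heappush(smallest, -r)
            else:
                heapq.heapreplace(smallest, -r)
    basic = (k - 1) / -smallest[0]  # standard bottom-k cardinality estimate
    return hip, basic

def simulate(n=200, k=8, trials=2000, seed=0):
    """Return (mean HIP estimate, HIP relative error, basic relative error)."""
    rng = random.Random(seed)
    results = [one_trial(n, k, rng) for _ in range(trials)]
    hip = [h for h, _ in results]
    basic = [b for _, b in results]
    rel_err = lambda xs: statistics.pstdev(xs, mu=n) / n
    return statistics.mean(hip), rel_err(hip), rel_err(basic)

mean_hip, hip_err, basic_err = simulate()
print(f"mean HIP estimate: {mean_hip:.1f}")
print(f"relative error  HIP: {hip_err:.3f}  basic: {basic_err:.3f}")
```

Under these assumptions the HIP error should land noticeably below the basic one, in line with the bounds stated in the lemma.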

Page 15: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HIP versus Basic estimators

[Figure: plot comparing the HIP and Basic estimators.]

Page 16: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HIP: applications

Querying ADSs:
- Cardinality estimation: $\sqrt{2}$ gain in relative error over “basic” (MinHash based) estimates
- More complex queries: closeness centrality with topic awareness (gain can be polynomial)
- Estimating relations (similarities, coverage) of pairs (sets) of nodes

Streaming: approximate distinct counting on streams.

Page 17: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Topic-aware Distance-decay Closeness Centrality

$C_{\alpha,\beta}(v) = \sum_i \alpha(d_{vi})\,\beta(i)$, with $\alpha$ non-increasing and $\beta$ some filter.

- Centrality with respect to a filter: topic, interests, education level, age, community, geography, language, product type
- Applications for filter: attribute completion, targeted advertisements

Page 18: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

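… Closeness Centrality

$\alpha$ non-increasing; $\beta$ some filter

- Polynomial (harmonic) decay: $\alpha(x) = 1/x$
- Exponential decay: $\alpha(x) = e^{-x}$
- Threshold ($d$): $\alpha(x) = 1$ if $x \le d$, and $0$ otherwise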

Page 19: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HIP estimates of Centrality

$\alpha$ non-increasing; $\beta$ some filter

$C_v = \sum_{i \in \mathrm{ADS}(v)} a_{vi}\,\alpha(d_{vi})\,\beta(i)$
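The centrality estimate is again a weighted sum over the sketch, with the decay and filter supplied as functions. The following is a minimal sketch: the ADS representation as `(node, distance, a)` triples, the node names, and the filter values in `goodness` are all hypothetical; only the distances and adjusted weights are taken from the deck's bottom-2 example.

```python
import math

def hip_centrality(ads, alpha, beta):
    """HIP estimate of sum_i alpha(d_vi) * beta(i) from the ADS of v."""
    return sum(a * alpha(d) * beta(node) for node, d, a in ads)

# (node, distance d_vi, adjusted weight a_vi); node names are placeholders.
ads_v = [("n0", 0, 1.0), ("n1", 5, 1.0), ("n2", 6, 1.59), ("n3", 10, 1.79),
         ("n4", 10, 2.38), ("n5", 15, 2.86), ("n6", 17, 4.76)]

# Hypothetical filter: a per-node score in [0, 1].
goodness = {"n0": 0.0, "n1": 1.0, "n2": 0.5, "n3": 0.2,
            "n4": 0.1, "n5": 1.0, "n6": 0.9}

exp_decay = lambda d: math.exp(-d)          # exponential decay
threshold_6 = lambda d: 1.0 if d <= 6 else 0.0  # threshold decay at distance 6

print(hip_centrality(ads_v, exp_decay, goodness.get))
# With a threshold decay and a constant filter, this reduces to the
# neighborhood cardinality estimate.
print(round(hip_centrality(ads_v, threshold_6, lambda node: 1.0), 2))  # 3.59
```

Because $\alpha$ and $\beta$ are applied at query time, one stored sketch serves any decay function and any filter.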

Page 20: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HIP estimates: closeness to good/evil

Bottom-2 ADS of $v$

Filter: $\beta$ measures “goodness”

Distance-decay: $\alpha(x) = e^{-x}$

$a_{vi}$: 1 1 1.59 1.79 2.38 2.86 4.76

distance: 0 5 6 10 10 15 17

$\beta(i)$: 0 1 1 0.2 0.1 1 0.9

$C_v = \sum_{i \in \mathrm{ADS}(v)} a_{vi}\,\beta(i)\,e^{-d_{vi}} = e^{-5} + e^{-6} + 0.3\,e^{-10} + \cdots$

Page 21: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Counting Distinct Elements on Data Stream

Elements occur multiple times; we want to count the number of distinct elements approximately with “small” storage.

- Best practical and theoretical algorithms maintain a MinHash sketch. Cardinality is estimated by applying an estimator to the sketch [Flajolet Martin 85], …
- Best (in practice) is the HyperLogLog (HLL) algorithm and variations [Flajolet + FGM 2007], …

Example stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4, …

Page 22: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Counting Distinct Elements with HIP

- We maintain a MinHash sketch and an approximate counter (a variation on [Morris77]). The counter explicitly maintains an approximate distinct count.
- Each time the sketch is updated, we increase the counter (add the HIP estimate for the inserted new distinct element).
- The approximate counter can be represented with few bits (e.g., it can be a relative correction to the sketch-based estimate or share its “exponent”).
- This works with any MinHash sketch. In experiments, for comparison, we use the same sketch as HyperLogLog (HLL).
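The update rule can be sketched with a bottom-$k$ MinHash sketch, which is one of the sketch flavors HIP applies to (the talk's experiments use the HLL sketch instead). Everything named here is an assumption for illustration: the class `HipDistinctCounter`, the SHA-256-based `rank` function, and the use of a plain float for the count, whereas the talk suggests a compact approximate counter.

```python
import hashlib
import heapq

def rank(x):
    """Map an element to a pseudo-random rank in [0, 1); the hash is an assumption."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class HipDistinctCounter:
    """Bottom-k MinHash sketch plus a running HIP distinct count."""

    def __init__(self, k):
        self.k = k
        self.heap = []       # negated ranks: max-heap of the k smallest ranks
        self.members = set() # ranks currently in the sketch
        self.count = 0.0     # plain float here, not a few-bit approximate counter

    def add(self, x):
        r = rank(x)
        if r in self.members:
            return  # repeat of an element already in the sketch
        p = 1.0 if len(self.heap) < self.k else -self.heap[0]
        if r >= p:
            return  # sketch unchanged: a repeat seen earlier, or not sampled
        self.count += 1.0 / p  # HIP estimate for the newly inserted element
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -r)
        else:
            evicted = -heapq.heapreplace(self.heap, -r)
            self.members.discard(evicted)
        self.members.add(r)

counter = HipDistinctCounter(k=16)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    counter.add(x)
print(counter.count)  # 6.0: with fewer than k distinct elements the count is exact
```

The counter only changes when the sketch changes, so repeated elements cost one hash and one comparison each.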

Page 23: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

HLL vs. HIP (on HLL sketch)

Page 24: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Conclusion

Further ADS+HIP applications: closeness similarity (using ADS+HIP) [CDFGGW COSN 2013] … Timed-influence oracle

ADS: old but a very versatile and powerful tool for (scalable approximate) analytics on very large graphs: distance/similarity oracles, distance distribution, closeness, coverage, influence, tightness of communities

HIP: simple and practical technique, applicable with ADSs and streams

Page 25: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Thank you!!

Page 26: All-Distances Sketches, Revisited:  HIP  Estimators for Massive  Graphs Analysis

Legends of Chima: Cragger, Laval, Eris, Rascal

Star Wars: Darth Vader, Luke Skywalker, R2-D2, Yoda

Ninjago: Sensei Wu, Lloyd “The Green Ninja”, Kai “Ninja of Fire”, Nya “Samurai X”, Acidicus