Scalable Dynamic Graph Summarization
Ioanna Tsalouchidou 1 Gianmarco De Francisci Morales 2
Francesco Bonchi 3 Ricardo Baeza-Yates 1
1 Web Research Group, DTIC, Pompeu Fabra University, Spain
2 Qatar Computing Research Institute
3 Algorithmic Data Analytics Lab, ISI Foundation, Turin, Italy
IEEE International Conference on Big Data, 2016
Table of Contents
1. Introduction
   - Motivation
   - Related Work
   - Our approach
2. Methodology
   - Baseline algorithm
   - MicroClustering algorithm
3. Experiments
4. Conclusions
Introduction to Big Graphs
Big Data in social, communication, biological networks, etc.
Are represented by Big Graphs
Encode relationship and communication patterns between people, news, trends, proteins, etc.
Characteristics
These graphs have some common characteristics:
Dynamic: structural and interaction evolution
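Massive: with hundreds of millions of vertices and billions of edges
[Figure: animated example graph on vertices A, B and C; the labels 0.3 and 0.7, later 0.5 and 0.9, appear as the graph evolves]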
Problem - Solution
Problem
Store and process big graphs
Their evolution in time in main memory
Applying algorithms is computationally expensive
Solution
Aggregate vertices and edges to reduce the size
Supernode: a set of vertices of the original graph
Superedge: an edge between two supernodes
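To make the supernode and superedge definitions concrete, here is a minimal sketch of the aggregation step. It assumes the vertex-to-supernode assignment is already given; the function name `summarize` and the choice to count the original edges collapsed into each superedge are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def summarize(edges, supernode_of):
    """Collapse an undirected edge list into superedges.

    edges: iterable of (u, v) pairs of original vertices.
    supernode_of: dict mapping each vertex to a supernode id (assumed given).
    Returns {(s, t): count} with s <= t, where count is the number of
    original edges that fall between supernodes s and t.
    """
    superedges = defaultdict(int)
    for u, v in edges:
        s, t = sorted((supernode_of[u], supernode_of[v]))
        superedges[(s, t)] += 1
    return dict(superedges)

# Four vertices grouped into two supernodes (hypothetical toy input).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
supernode_of = {"A": 0, "B": 0, "C": 1, "D": 1}
summary = summarize(edges, supernode_of)
print(summary)
```

The summary has at most k(k + 1)/2 superedges for k supernodes, regardless of how many original edges there were, which is where the size reduction comes from.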
Related Work
Graph Summarization:
GraSS: Graph structure summarization [LeFevre and Terzi, ’10]
Graph summarization with quality guarantees [Riondato et al., ’14]
Data stream clustering:
A framework for clustering evolving data streams [Aggarwal et al., ’03]
Background: Static graph summarization
Represent graphs as adjacency matrices
Minimize the reconstruction error
Quality guarantees: geometric clustering of the nodes
[Figure: a static graph and its adjacency matrix are mapped to a summary graph and its summary adjacency matrix]
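The slide does not spell out how the summary adjacency matrix or the reconstruction error is computed. The sketch below assumes a common choice: each summary entry is the average of the original entries in the corresponding block, and the error is the squared difference between the original matrix and the matrix "lifted" back from the summary. The function names are hypothetical.

```python
def summary_matrix(A, assign, k):
    """Return the k x k matrix of average entries between supernodes.

    A: N x N adjacency matrix (list of lists).
    assign: assign[u] is the supernode index of vertex u.
    """
    n = len(A)
    size = [0] * k
    for u in range(n):
        size[assign[u]] += 1
    total = [[0.0] * k for _ in range(k)]
    for u in range(n):
        for v in range(n):
            total[assign[u]][assign[v]] += A[u][v]
    return [[total[s][t] / (size[s] * size[t]) if size[s] and size[t] else 0.0
             for t in range(k)] for s in range(k)]

def reconstruction_error(A, assign, S):
    """Squared error between A and the matrix lifted back from summary S."""
    n = len(A)
    return sum((A[u][v] - S[assign[u]][assign[v]]) ** 2
               for u in range(n) for v in range(n))

# Two disjoint edges, each collapsed into its own supernode.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
assign = [0, 0, 1, 1]
S = summary_matrix(A, assign, 2)
err = reconstruction_error(A, assign, S)
print(S, err)
```

Under these assumptions, a good clustering groups vertices whose rows of the adjacency matrix are similar, which is why the problem can be treated as geometric clustering of the nodes.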
Problem formulation: Tensor summarization
Time series of w static graphs
The graph time series is represented by an adjacency tensor
Summary represented by an adjacency matrix
[Figure: adjacency tensor of size N x N x w (Node 1 to Node N) summarized into a k x k adjacency matrix over supernodes]
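As a rough illustration of going from an N x N x w tensor to a k x k summary, the sketch below stores the tensor as a list of w adjacency matrices and averages every supernode block over the whole window. Averaging over time is an assumption made here for illustration; the slides only state that the summary is a single adjacency matrix.

```python
def tensor_summary(T, assign, k):
    """Summarize a window of w adjacency matrices into one k x k matrix.

    T: list of w matrices, each N x N (the adjacency tensor).
    assign: assign[u] is the supernode index of vertex u.
    Each output entry averages all tensor entries of that supernode
    block across the w time slices (an assumed aggregation rule).
    """
    w, n = len(T), len(T[0])
    size = [assign.count(s) for s in range(k)]
    total = [[0.0] * k for _ in range(k)]
    for A in T:
        for u in range(n):
            for v in range(n):
                total[assign[u]][assign[v]] += A[u][v]
    return [[total[s][t] / (w * size[s] * size[t]) if size[s] and size[t] else 0.0
             for t in range(k)] for s in range(k)]

# w = 2 slices over N = 3 vertices; vertices 0 and 1 share a supernode.
T = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
]
assign = [0, 0, 1]
summary = tensor_summary(T, assign, 2)
print(summary)
```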
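Problem formulation: Dynamic graph summarization via tensor streaming
[Figure: animation over time-stamps t0 to t5 in which a sliding window of width w moves along the time dimension of the tensor and is summarized into a k x k matrix over supernodes]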
Dynamic graph: infinite stream of static graphs
Tensor with one dimension increasing in time
Define a sliding tensor window
Summarize the tensor within the tensor window
Problem formulation: Dynamic graph summarization via tensor streaming
At each time-stamp:
Input: most recent adjacency matrix
Update of the sliding window
Clustering nodes to supernodes
Output: one summary at every time-stamp
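The per-time-stamp loop above can be sketched with a bounded queue as the sliding window. The class and function names are hypothetical, and the clustering step is replaced by a placeholder (`edge_mass`, which just sums all entries in the window) so that the window mechanics are visible on their own.

```python
from collections import deque

class SlidingTensorWindow:
    """Keeps the w most recent adjacency matrices."""

    def __init__(self, w):
        # deque with maxlen evicts the oldest slice automatically.
        self.slices = deque(maxlen=w)

    def push(self, adjacency):
        self.slices.append(adjacency)
        return list(self.slices)

def stream_summaries(stream, w, summarize):
    """Yield one summary per time-stamp of the input stream."""
    window = SlidingTensorWindow(w)
    for adjacency in stream:
        yield summarize(window.push(adjacency))

def edge_mass(slices):
    """Placeholder for the clustering step: total weight in the window."""
    return sum(x for A in slices for row in A for x in row)

stream = [
    [[0, 1], [1, 0]],
    [[0, 2], [2, 0]],
    [[0, 3], [3, 0]],
]
outputs = list(stream_summaries(stream, w=2, summarize=edge_mass))
print(outputs)
```

In a real pipeline `summarize` would cluster the nodes of the window into supernodes; any function that accepts the list of slices can be plugged in.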
Contributions
Introduce the problem of lossy dynamic graph summarization
Two online algorithms for summarizing dynamic, large-scale graphs
Distributed, scalable algorithms, implemented in Apache Spark
Baseline algorithm: kC
[Figure: N data points (Node 1 to Node N), each with wN values, clustered into super-nodes S_0 to S_{C-1}]
Cluster each node of the tensor to the supernodes
Each node has wN values
Clustering N points at every time-stamp
Problem: (w − 1)N² values remain unchanged
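A small sketch of the baseline idea: flatten each node's rows across the window into one vector of length wN, then cluster the N vectors. The slides do not name the clustering routine, so plain Lloyd's k-means is used here as an assumption, seeded with the first k points purely to keep the example deterministic.

```python
def node_vectors(window):
    """Concatenate each node's rows over the w slices: N vectors of length w*N."""
    n = len(window[0])
    return [[x for A in window for x in A[u]] for u in range(n)]

def kmeans(points, k, iters=10):
    """Lloyd's algorithm seeded with the first k points (assumed init)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# A 4-cycle: nodes 0 and 2 have identical neighbourhoods, as do 1 and 3.
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
window = [A, A]                      # w = 2 slices
points = node_vectors(window)
labels = kmeans(points, k=2)
w, n = len(window), len(A)
unchanged = (w - 1) * n * n          # values carried over to the next window
print(labels, unchanged)
```

The last line of the example is the slide's point: when the window slides by one step, (w − 1)N² of the wN² values are the same as before, yet the baseline re-clusters all of them.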
MicroClustering algorithm: µC
[Figure: data points A^t_0 to A^t_{N-1} are assigned to micro-clusters μC_0 to μC_{mC-1}, which are then grouped into super-nodes S_0 to S_{C-1}]
Two-level clustering
Step 1: adjacency matrix to micro-clusters
Step 2: keep statistics in the micro-clusters
Step 3: run maintenance algorithm
Step 4: micro-clusters to supernodes
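The slides do not list which statistics a micro-cluster keeps. The sketch below assumes additive statistics in the style of the data stream clustering framework cited under Related Work: a point count, a per-dimension linear sum, and a per-dimension squared sum. The class and helper names are hypothetical, and the merge shown is only one possible maintenance operation.

```python
class MicroCluster:
    """Additive summary of the points absorbed so far (assumed statistics)."""

    def __init__(self, dim):
        self.n = 0                 # number of points
        self.ls = [0.0] * dim      # linear sum per dimension
        self.ss = [0.0] * dim      # squared sum per dimension

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        # Statistics are additive, so merging needs no access to raw points.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [s / self.n for s in self.ls]

def nearest(point, clusters):
    """Index of the non-empty micro-cluster with the closest centroid."""
    return min((i for i, c in enumerate(clusters) if c.n),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, clusters[i].centroid())))

clusters = [MicroCluster(2), MicroCluster(2)]
clusters[0].add([0.0, 0.0])
clusters[1].add([10.0, 10.0])
for p in ([1.0, 1.0], [9.0, 11.0]):
    clusters[nearest(p, clusters)].add(p)   # steps 1 and 2

merged = MicroCluster(2)                    # one possible maintenance step
merged.merge(clusters[0])
merged.merge(clusters[1])
print(clusters[0].centroid(), clusters[1].centroid(), merged.centroid())
```

For step 4, the micro-cluster centroids can be fed to any ordinary clustering routine; since there are far fewer micro-clusters than nodes, that second level is cheap compared with clustering all N points from scratch.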
Datasets and Experimental Setup
Datasets:
Twitter hashtag co-occurrences
Yahoo! Network Flow
Synthetic Dataset
Environment:
Cluster of 400 cores distributed in 30 machines.
Each machine: 24 cores Intel(R) Xeon(R) CPU E5-2430 0 @2.20 GHz.
Memory: driver program 12GB, executor process 3GB.
Scalability
Reconstruction Error
Conclusions
Problem: Large, evolving graphs are difficult to store and process
Solution: Graph summarization reduces the size and captures the evolution of the input graph
Evaluation: Scalable, distributed solution with small error