a pattern-aware graph mining system
TRANSCRIPT
A Pattern-Aware Graph Mining SystemKasra Jamshidi Rakesh Mahadasa Keval Vora
Simon Fraser University
https://github.com/pdclab/peregrine
Why should you pay attention?
executes 700x fasterPeregrine
consumes 100x less memoryPeregrine
scales to 100x larger datasetsPeregrine
On 8x fewer machines
With a more expressive API
a
db
a
bd
Data Graph
e
c
gf
Graph Mining
a
db
7
a
db
a
bd
Graph Mining
e
c
gf
Data Graph
a
db
8
a
db
a
bd
Graph Mining
e
c
gf
Data Graph
a
db
Subgraph
9
a
db
a
bd
Graph Mining
e
c
gf
Data Graph Pattern
10
Graph Mining
a
db
a
bd
Edge-Induced
11
Graph Mining
Edge-Induced
a
db
a
bd
12
Graph Mining
a
db
a
bd
Edge-Induced
13
Graph Mining
a
db
a
bd
Vertex-Induced
14
Graph Mining
a dba b d
Vertex-Induced
15
Graph Mining
a
db
a
bd
Vertex-Induced
16
Graph Mining
a
db
e
c
gf
Data Graph
17
Graph Mining
a
db
e
c
gf
Data Graph
a
db
a
bd
Vertex-Induced
18
Graph Mining
a
db
e
c
gf
Data Graph
a
db
a
b d
a
db
a
b d
a
db
a
b d
a
db
a
b d
Edge-Induced
a
db
a
bd
Vertex-Induced
19
Frequent Patterns(Edge-Induced)
Graph Mining
a
db
e
c
gf
Data Graph
20
Unlabeled Pattern Distribution(Vertex-Induced)
2 1 0
1 02
Graph Mining
a
db
e
c
gf
Data Graph
21
Scalability Challenge
22
• 4-motif counting on Orkut graph (|V| = 13M, |E| = 117M)
Scalability Challenge
23
Scalability Challenge
• 4-motif counting on Orkut graph (|V| = 13M, |E| = 117M)
123,503,340,341,270 subgraphs
24
System Requirements
25
System Requirements
Uniqueness
x
zy
x
zy
x
zy
26
System Requirements
Uniqueness
x
zy xz
y
x
z
y
27
System Requirements
Uniqueness
x
zy xz
y
x
z
y
28
System Requirements
Uniqueness
Structurex
zy
29
System Requirements
Uniqueness
Structurex
zy
30
System Requirements
Uniqueness
Structure
Interestingness
31
System Requirements
Uniqueness
Structure
Interestingness
32
System Requirements
Uniqueness
Structure
Interestingness
33
System Requirements
Uniqueness
Structure
Interestingness
34
Existing Work
Uniqueness
Structure
Interestingness
Arabesque (SOSP ‘15)RStream (OSDI ‘18)Fractal (SIGMOD ‘19)AutoMine (SOSP ‘19)
35
Existing Work
Uniqueness
Structure
Interestingness
Arabesque (SOSP ‘15)RStream (OSDI ‘18)Fractal (SIGMOD ‘19)AutoMine (SOSP ‘19)
Overlook user requirements
36
Existing Work
Uniqueness
Structure
Interestingness
Arabesque (SOSP ‘15)RStream (OSDI ‘18)Fractal (SIGMOD ‘19)AutoMine (SOSP ‘19)
Overlook user requirementsPer-subgraph computations
37
38
Pattern Awareness
Pattern Selection
Pattern Matching
39
Pattern Awareness
Pattern Selection
Pattern Matching
40
Pattern Awareness
Pattern Selection
41
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
42
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
43
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
44
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
45
Pattern Programming
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);
auto counts = count(G, patterns);
46
Pattern Programming
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);patterns[0].set_labels({‘a’, ‘b’, ‘c’, ‘d’});
auto counts = count(G, patterns);
47
Pattern Programming
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);patterns[0].set_labels({‘a’, ‘b’, ‘c’, ‘d’});patterns[0].add_edge(1, 5);
auto counts = count(G, patterns);
48
Pattern Programming
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);patterns[0].set_labels({‘a’, ‘b’, ‘c’, ‘d’});patterns[0].add_edge(1, 5);patterns.emplace_back(“path/to/pattern.txt”);auto counts = count(G, patterns);
49
Pattern Programming
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto pattern = Pattern().add_edge(1, 2)
.add_edge(1, 3)
.add_edge(2, 3);auto counts = count(G, {pattern});
50
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
51
Pattern Programming
#include “Peregrine.hh”using namespace Peregrine;
void motifCounting(int size){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(size, VERTEX_INDUCED);auto counts = count(G, patterns);
for (auto &[pattern, n] : counts)std::cout << pattern << “ ” << n << std::endl;
}
52
Pattern Programming
void frequentSubgraphMining(){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(2, EDGE_INDUCED);auto mapDomain = [](auto &&match, auto &&aggregator)
{ aggregator.map(match.pattern, match.mapping); };
auto results = match<Pattern, Domain>(G, patterns, mapDomain);
for (auto &[pattern, frequency] : results)std::cout << pattern << “ ” << frequency << std::endl;
}
53
Pattern Programming
void frequentSubgraphMining(){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(2, EDGE_INDUCED);auto mapDomain = [](auto &&match, auto &&aggregator)
{ aggregator.map(match.pattern, match.mapping); };
auto results = match<Pattern, Domain>(G, patterns, mapDomain);
for (auto &[pattern, frequency] : results)std::cout << pattern << “ ” << frequency << std::endl;
}
54
Pattern Programming
void frequentSubgraphMining(){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(2, EDGE_INDUCED);auto mapDomain = [](auto &&match, auto &&aggregator)
{ aggregator.map(match.pattern, match.mapping); };
auto results = match<Pattern, Domain>(G, patterns, mapDomain);
for (auto &[pattern, frequency] : results)std::cout << pattern << “ ” << frequency << std::endl;
}
55
Pattern Programming
void frequentSubgraphMining(){
DataGraph G(“path/to/graph/”);auto patterns = PatternGenerator::all(2, EDGE_INDUCED);auto mapDomain = [](auto &&match, auto &&aggregator)
{ aggregator.map(match.pattern, match.mapping); };
auto results = match<Pattern, Domain>(G, patterns, mapDomain);
for (auto &[pattern, frequency] : results)std::cout << pattern << “ ” << frequency << std::endl;
}
56
Pattern Awareness
Pattern Selection
Pattern Matching
57
Pattern Awareness
Pattern Selection
u
v
Anti-Edge
58
Pattern Awareness
Pattern Selection
u
v
Anti-Vertex
59
Pattern Awareness
Pattern Selection
Pattern Matching
60
Pattern Awareness
Pattern Selection
Pattern Matching
61
Pattern Awareness
Pattern Matching
• Symmetry breaking (RECOMB ‘07)
• Core pattern reduction (SIGMOD ‘16)
62
Symmetry Breaking
63
Symmetry Breaking
u x
y v
64
Symmetry Breaking
u x
y v
65
Symmetry Breaking
x
y
66
v
u
Symmetry Breaking
u x
y v
67
Symmetry Breaking
u y
x v
68
Symmetry Breaking
u x
y v
Worker 1
a
db
e
c
gf
Data Graph
69
b x
y v
Symmetry Breaking
a
db
e
c
gf
Data Graph Worker 1
70
b x
y d
Symmetry Breaking
Worker 1
a
db
e
c
gf
Data Graph
71
Symmetry Breaking
b x
y d
Worker 1
u x
y v
Worker 2
a
db
e
c
gf
Data Graph
72
Symmetry Breaking
b x
y d
Worker 1
d x
y v
Worker 2
a
db
e
c
gf
Data Graph
73
Symmetry Breaking
b x
y d
Worker 1
d x
y b
Worker 2
a
db
e
c
gf
Data Graph
74
Symmetry Breaking
b a
y d
Worker 1
d a
y b
Worker 2
a
db
e
c
gf
Data Graph
75
Symmetry Breaking
b a
g d
Worker 1
d a
g b
Worker 2
a
db
e
c
gf
Data Graph
76
Symmetry Breaking
b a
g d
Worker 1
d a
g b
Worker 2
a
db
e
c
gf
Data Graph
77
Symmetry Breaking
u < v
u
v
x
y
78
Symmetry Breaking
x < y
u
v
x
y
79
Symmetry Breaking
u < vx < y
u
v
x
y
80
Core Pattern
u
v
x
y
81
Core Pattern
u
v
x
y
82
Core Pattern
u
v
x
y
83
Core Pattern
b
d
Pattern-Unaware
a
db
e
c
gf
Data Graph
84
Core Pattern
b a
d
Pattern-Unaware
a
db
e
c
gf
Data Graph
85
Core Pattern
b a
d
Pattern-Unaware
a
db
e
c
gf
Data Graph
86
Core Pattern
b a
g d
Pattern-Unaware
a
db
e
c
gf
Data Graph
87
Core Pattern
b a
g d
Pattern-Unaware
a
db
e
c
gf
Data Graph
88
Core Pattern
b x
y d
Pattern-Aware
a
db
e
c
gf
Data Graph
89
Core Pattern
b x
y d
Pattern-Aware
a
db
e
c
gf
Data Graph
90
Core Pattern
a
db
e
c
gf
Data Graph
b x
y d
Pattern-Aware
91
Core Pattern
b a
g d
Pattern-Aware
a
db
e
c
gf
Data Graph
92
Core Pattern
b a
g d
Pattern-Aware
a
db
e
c
gf
Data Graph
h
93
Core Pattern
b a
h d
Pattern-Aware
b a
g d
a
db
e
c
gf
Data Graph
h
94
Core Pattern
b a
h d
Pattern-Aware
b a
g d
a
db
e
c
gf
Data Graph
h
95
Pattern Awareness
Pattern Matching
• Symmetry breaking (RECOMB ‘07)
• Core pattern reduction (SIGMOD ‘16)
96
Pattern Awareness
Pattern Matching
• Early termination
97
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
98
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
99
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
100
Triangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
101
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
102
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
103
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
104
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
105
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
106
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
107
TripletTriangle
Early Termination
bool globalClusteringCoefficient(int bound){
DataGraph G(“path/to/graph/”);auto triplet = PatternGenerator::star(3);int numTriplets = count(G, {triplet});auto countAndCheck = [=](auto &&match, auto &&aggregator){int numTriangles = aggregator.readValue(match.pattern);if (3*numTriangles/numTriplets > bound) aggregator.stop();else aggregator.map(match.pattern, 1);
}auto triangle = PatternGenerator::clique(3);auto result = match<Pattern, int>(G, triangle, countAndCheck);return 3*result[triangle]/numTriplets > bound;
}
108
Comparison with Existing Work
• Peregrine with 16 logical cores and 32GB RAM
• Arabesque & Fractal with 8x16 logical cores and 8x32GB RAM
• RStream with 96 logical cores and 192GB RAM
109
Comparison with Existing Work
110
Comparison with Existing Work
111
Effects of Pattern Awareness
112
Scalability
113
Scalability
114
Scalability
115
Scalability
116
Scalability
117
Scalability
118
Scalability
119
Pattern-aware programming and processing models
120
Pattern-aware programming and processing models
• Shift abstraction from subgraph to pattern
121
Pattern-aware programming and processing models
• Shift abstraction from subgraph to pattern
•User program is transparent to the system
122
Pattern-aware programming and processing models
• Shift abstraction from subgraph to pattern
•User program is transparent to the system
•Up to 42x faster than pattern-unaware
•Up to 737x faster than state-of-the-art
123
Pattern-aware programming and processing models
• Shift abstraction from subgraph to pattern
•User program is transparent to the system
•Up to 42x faster than pattern-unaware
•Up to 737x faster than state-of-the-art
https://github.com/pdclab/peregrine
124