unifying data-parallel and graph-parallel analytics · 2017-09-12 · graphx:! unifying...
TRANSCRIPT
GraphX:���Unifying Data-Parallel and Graph-Parallel Analytics
Presented by Joseph Gonzalez ���Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014
*These slides are best viewed in PowerPoint with animation.
Graphs are Central to Analytics
Raw Wikipedia
< / >!< / >!< / >!XML!
Hyperlinks PageRank Top 20 Pages Title PR
Text Table
Title Body Topic Model
(LDA) Word Topics Word Topic
Editor Graph Community Detection
User Community
User Com.
Term-Doc Graph
Discussion Table
User Disc.
Community Topic
Topic Com.
Update ranks in parallel
Iterate until convergence
Rank of user i Weighted sum of
neighbors’ ranks
3
R[i] = 0.15 +X
j2Nbrs(i)
wjiR[j]
PageRank: Identifying Leaders
The Graph-Parallel Pattern
4
Model / Alg. State
Computation depends only on the neighbors
Many Graph-Parallel Algorithms • Collaborative Filtering
– Alternating Least Squares – Stochastic Gradient Descent – Tensor Factorization
• Structured Prediction – Loopy Belief Propagation – Max-Product Linear Programs – Gibbs Sampling
• Semi-supervised ML – Graph SSL – CoEM
• Community Detection – Triangle-Counting – K-core Decomposition – K-Truss
• Graph Analytics – PageRank – Personalized PageRank – Shortest Path – Graph Coloring
• Classification – Neural Networks
5
Graph-Parallel Systems
6
oogle
Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-
magnitude performance gains over more general ���data-parallel systems.
PageRank on the Live-Journal Graph
22
354
1340
0 200 400 600 800 1000 1200 1400 1600
GraphLab
Naïve Spark
Mahout/Hadoop
Runtime (in seconds, PageRank for 10 iterations)
GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark
Graphs are Central to Analytics
Raw Wikipedia
< / >!< / >!< / >!XML!
Hyperlinks PageRank Top 20 Pages Title PR
Text Table
Title Body Topic Model
(LDA) Word Topics Word Topic
Editor Graph Community Detection
User Community
User Com.
Term-Doc Graph
Discussion Table
User Disc.
Community Topic
Topic Com.
Separate Systems to Support Each View Table View Graph View
Dependency Graph
6. Before
8. After
7. After
Table
Result
Row
Row
Row
Row
Having separate systems ���for each view is ���
difficult to use and inefficient
10
Difficult to Program and Use
Users must Learn, Deploy, and Manage multiple systems
Leads to brittle and often ���complex interfaces
11
Inefficient
12
Extensive data movement and duplication across ���the network and file system
< / >!< / >!< / >!XML!
HDFS HDFS HDFS HDFS
Limited reuse internal data-structures ���across stages
Solution: The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline
New API Blurs the distinction between
Tables and Graphs
New System Combines Data-Parallel Graph-Parallel Systems
Tables and Graphs are composable ���views of the same physical data
GraphX Unified Representation
Graph View Table View
Each view has its own operators that ���exploit the semantics of the view
to achieve efficient execution
View a Graph as a Table
Id
Rxin Jegonzal Franklin Istoica
SrcId DstId
rxin jegonzal franklin rxin istoica franklin franklin jegonzal
Property (E)
Friend Advisor
Coworker PI
Property (V)
(Stu., Berk.) (PstDoc, Berk.)
(Prof., Berk) (Prof., Berk)
R
J
F
I
Property Graph Vertex Property Table
Edge Property Table
Table Operators Table (RDD) operators are inherited from Spark:
16
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])
// Table Views ----------------- def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations ------------------------------ def reverse: Graph[V, E] def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E] def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------- def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation ---------------------------------- def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]
}
Graph Operators
17
Triplets Join Vertices and Edges The triplets operator joins vertices and edges:
The mrTriplets operator sums adjacent triplets. SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum FROM triplets AS t GROUPBY t.dstId
Triplets Vertices Edges
B
A
C
D
A B A C B C C D
A B A
B A C B C C D
F
E
Map Reduce Triplets
Map-Reduce for each vertex
D
B
A
C
mapF( ) A B
mapF( ) A C
A1
A2
reduceF( , ) A1 A2 A
19
F
E
Example: Oldest Follower
D
B
A
C What is the age of the oldest follower for each user?
val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) .vertices
23 42
30
19 75
16 20
We express the Pregel and GraphLab ���abstractions using the GraphX operators���
in less than 50 lines of code!
21
By composing these operators we can ���construct entire graph-analytics pipelines.
DIY Demo this Afternoon
GraphX System Design
Part. 2
Part. 1
Vertex Table
(RDD)
B C
A D
F E
A D
Distributed Graphs as Tables (RDDs)
D
Property Graph
B C
D
E
A A
F
Edge Table (RDD)
A B
A C
C D
B C
A E
A F
E F
E D
B
C
D
E
A
F
Routing Table
(RDD)
B
C
D
E
A
F
1
2
1 2
1 2
1
2
2D Vertex Cut Heuristic
Vertex Table
(RDD)
Caching for Iterative mrTriplets Edge Table
(RDD) A B
A C
C D
B C
A E
A F
E F
E D
Mirror Cache
B C D
A
Mirror Cache
D E F
A
B
C
D
E
A
F
B
C
D
E
A
F
A
D
Vertex Table
(RDD)
Edge Table (RDD)
A B
A C
C D
B C
A E
A F
E F
E D
Mirror Cache
B C D
A
Mirror Cache
D E F
A
Incremental Updates for Iterative mrTriplets
B
C
D
E
A
F
Change A A
Change E
Scan
Vertex Table
(RDD)
Edge Table (RDD)
A B
A C
C D
B C
A E
A F
E F
E D
Mirror Cache
B C D
A
Mirror Cache
D E F
A
Aggregation for Iterative mrTriplets
B
C
D
E
A
F
Change
Change
Scan
Change
Change
Change
Change
Local Aggregate
Local Aggregate
B C
D
F
Reduction in Communication Due to Cached Updates
0.1
1
10
100
1000
10000
0 2 4 6 8 10 12 14 16
Net
wor
k Co
mm
. (M
B)
Iteration
Connected Components on Twitter Graph
Most vertices are within 8 hops���of all vertices in their comp.
Benefit of Indexing Active Edges
0
5
10
15
20
25
30
0 2 4 6 8 10 12 14 16
Runt
ime
(Sec
onds
)
Iteration
Connected Components on Twitter Graph
Scan
Indexed
Scan All Edges
Index of “Active” Edges
Join Elimination Identify and bypass joins for unused triplets fields » Example: PageRank only accesses source attribute
30
0 2000 4000 6000 8000
10000 12000 14000
0 5 10 15 20
Com
mun
icatio
n (M
B)
Iteration
PageRank on Twitter Three Way Join
Join Elimination
Factor of 2 reduction in communication
Additional Query Optimizations
Indexing and Bitmaps: » To accelerate joins across graphs » To efficiently construct sub-graphs
Substantial Index and Data Reuse: » Reuse routing tables across graphs and sub-graphs » Reuse edge adjacency information and indices
31
Performance Comparisons
22 68
207 354
1340
0 200 400 600 800 1000 1200 1400 1600
GraphLab GraphX Giraph
Naïve Spark Mahout/Hadoop
Runtime (in seconds, PageRank for 10 iterations)
GraphX is roughly 3x slower than GraphLab
Live-Journal: 69 Million Edges
GraphX scales to larger graphs
203
451
749
0 200 400 600 800
GraphLab
GraphX
Giraph
Runtime (in seconds, PageRank for 10 iterations)
GraphX is roughly 2x slower than GraphLab » Scala + Java overhead: Lambdas, GC time, … » No shared memory parallelism: 2x increase in comm.
Twitter Graph: 1.5 Billion Edges
PageRank is just one stage…. ������
What about a pipeline?
HDFS HDFS
Compute Spark Preprocess Spark Post.
A Small Pipeline in GraphX
Timed end-to-end GraphX is faster than GraphLab
Raw Wikipedia
< / >!< / >!< / >!XML!
Hyperlinks PageRank Top 20 Pages
342
1492
0 200 400 600 800 1000 1200 1400 1600
GraphLab + Spark GraphX
Giraph + Spark Spark
Total Runtime (in Seconds)
605
375
The GraphX Stack���(Lines of Code)
GraphX (3575)
Spark
Pregel (28) + GraphLab (50)
PageRank (5)
Connected Comp. (10)
Shortest Path (10)
ALS (40) LDA
(120)
K-core (51) Triangle
Count (45)
SVD (40)
Status Alpha release as part of Spark 0.9
Seeking collaborators and feedback
Conclusion and Observations Domain specific views: Tables and Graphs » tables and graphs are first-class composable objects » specialized operators which exploit view semantics
Single system that efficiently spans the pipeline » minimize data movement and duplication » eliminates need to learn and manage multiple systems
Graphs through the lens of database systems » Graph-Parallel Pattern à Triplet joins in relational alg. » Graph Systems à Distributed join optimizations
38
Active Research Static Data à Dynamic Data » Apply GraphX unified approach to time evolving data » Model and analyze relationships over time
Serving Graph Structured Data » Allow external systems to interact with GraphX » Unify distributed graph databases with relational
database technology
39
Thanks!
[email protected] [email protected]
[email protected] [email protected]
http://amplab.github.io/graphx/
Graph Property 1���Real-World Graphs
41
100 102 104 106 108100
102
104
106
108
1010
degree
count
Top 1% of vertices are adjacent to
50% of the edges!
Num
ber o
f Ver
tices
AltaVista WebGraph1.4B Vertices, 6.6B Edges
Degree
More than 108 vertices ���have one neighbor.
0 20 40 60 80
100 120 140 160 180 200
2008 2009 2010 2011 2012 Ra
tio o
f Edg
es to
Vert
ices
Year
Power-Law Degree Distribution Edges >> Vertices
Graph Property 2���Active Vertices
1
10
100
1000
10000
100000
1000000
10000000
100000000
0 10 20 30 40 50 60 70
Num
-Ver
tices
Number of Updates
51% updated only once! PageRank on Web Graph
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
Ratings Items
Recommending Products Users
Low-Rank Matrix Factorization:
45
r13
r14
r24
r25
f(1)
f(2)
f(3)
f(4)
f(5) Use
r Fac
tors
(U)
Movie Factors (M
) U
sers Movies
Netflix U
sers
≈ x
Movies
f(i)
f(j)
Iterate:
f [i] = arg minw2Rd
X
j2Nbrs(i)
�rij � wT f [j]
�2+ �||w||22
Recommending Products
Liberal Conservative
Post
Post
Post
Post
Post
Post
Post
Post
Predicting User Behavior
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
Post
? ?
?
?
? ?
?
? ? ?
?
?
? ?
? ?
?
?
?
?
?
?
?
?
?
?
?
? ?
?
46
Conditional Random Field!Belief Propagation!
Count triangles passing through each vertex: "
Measures “cohesiveness” of local community
More Triangles Stronger Community
Fewer Triangles Weaker Community
1 2 3
4
Finding Communities
Preprocessing Compute Post Proc.
Example Graph Analytics Pipeline
48
< / >!< / >!< / >!XML!
Raw Data ETL Slice Compute
Repeat
Subgraph PageRank Initial Graph
Analyze
Top Users