CS535 Big Data 4/7/2020 Week 11-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1
CS535 BIG DATA
PART B. GEAR SESSION 3: BIG GRAPH ANALYSIS
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
CS535 Big Data | Computer Science | Colorado State University
FAQs
• Online GEAR presentation will be available on 4/6
• You will have a 3-day discussion period on Piazza
  • 4/6 ~ 4/8
Topics of Today's Class
• GraphX: Graph Processing in a Distributed Dataflow Framework
  • Part 1: Introduction and Graph Parallelism
  • Part 2: Distributed Graph Representation
  • Part 3: Implementation of Distributed Graph Processing
GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework
This material is built based on
• Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 599-613.
• Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1), pp. 96-129.
• GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-programming-guide.html
Introduction
• GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
• Introduces a Graph abstraction
  • A directed multigraph with properties attached to each vertex and edge
• Provides a set of graph operators
  • E.g. subgraph, joinVertices, and aggregateMessages
• Provides an optimized variant of the Pregel API
• Implements graph algorithms and builders
  • PageRank
  • Connected Components
  • Triangle Counting
Computational Challenges
• Graph processing systems outperform general-purpose distributed dataflow frameworks by using their own specialized optimization schemes
  • E.g. Pregel, PowerGraph, BLAS, Kineograph
• However, graphs are often only a part of a larger analytics process
  • Combining graphs with unstructured and tabular data
  • Analytics pipelines are forced to compose multiple systems
    • Extra data movement and duplication
    • Fault tolerance handled separately in each system
• A design for graph processing systems on top of general-purpose distributed dataflow systems is needed
GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework
Distributed Dataflow Model and Optimization Schemes for Graph Processing
Dataflow Models - Traditional Network Programming
• Message passing between nodes (e.g. MPI)
• Very difficult to do at scale
  • How to split the problem across nodes?
    • Network communication & data locality
  • How to deal with failures? (inevitable at scale)
  • Stragglers?
    • A node that has not failed, but is slow
  • Writing programs for each machine
• Rarely used in commodity datacenters!
Dataflow Models – Modern Distributed Dataflow Models
• Restrict the programming interface
  • So the system can do more automatically
• Express jobs as graphs of high-level operators
  • The system picks how to split each operator into tasks and where to run each task
  • Parts can be re-run for fault recovery
  • Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive…
• Examples of dataflow operators
  • join, map, groupBy, … most of the operators introduced in the Apache Spark discussion
Why did these graph processing systems evolve separately from distributed dataflow frameworks?
• Early emphasis on single-stage computation and on-disk processing
  • Limited capability to handle iterative graph algorithms, which repeatedly and randomly access subsets of the graph
  • E.g. MapReduce
• Early distributed dataflow frameworks did not support fine-grained control over data partitioning
  • Recent frameworks (e.g. Spark and Naiad) support in-memory representations and fine-grained control over data partitioning
Optimizations used in GraphX
• Encoding the graph as collections
  • Vertex-cut partitioning
• Executing graph algorithms as common dataflow operators
• Join optimizations
  • E.g. CSR indexing, join elimination, and join-site specification
• Materialized view maintenance
  • Vertex mirroring and delta updates
• Applying the above techniques provides a new set of Spark dataflow operators for graph processing
  • Reducing memory overhead and improving system performance
• Immutability: GraphX reuses indices across the graph and collection views over multiple iterations
GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework
Property Graphs as Collections and Executing Graph Algorithms
Property Graph
• User-defined properties with each vertex and edge
  • Meta-data
    • E.g. user profiles and time stamps
  • Program state
    • E.g. the PageRank of vertices or inferred affinities
• Applicable to natural phenomena such as social networks and web graphs
  • Often highly skewed
    • Power-law degree distributions
  • Orders of magnitude more edges than vertices
Transforming a Property Graph to a Pair of Collections
• Vertex collection
  • Vertex properties (with a unique key: the vertex identifier)
  • Vertex identifiers are 64-bit integers
  • Derived externally (e.g. using a userID) or by applying a hash function to a vertex property (e.g. a URL)
• Edge collection
  • Edge properties (with source and destination vertex identifiers)
• Having a pair of collections enables the system to compute graph algorithms with existing dataflow operations
  • Join: adding additional vertex properties
  • Creating new collections: creating a new graph
    • E.g. maintaining one graph for PageRanks and another graph for membership information while sharing the same edge collection
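The pair-of-collections idea can be sketched in plain Python (a made-up three-vertex example, not GraphX code): the graph is just a vertex list and an edge list, and a join attaches vertex properties to each edge.

```python
# Vertex collection: (vertexId, vertexProperty); GraphX uses 64-bit integer ids.
vertices = [(1, {"name": "alice"}), (2, {"name": "bob"}), (3, {"name": "carol"})]

# Edge collection: (srcId, dstId, edgeProperty).
edges = [(1, 2, {"rel": "follows"}), (3, 1, {"rel": "follows"})]

# A join of the two collections attaches vertex properties to each edge;
# several logical graphs can share one edge collection this way.
vprops = dict(vertices)
annotated = [(s, d, p, vprops[s], vprops[d]) for (s, d, p) in edges]
```

Because the edge collection is never copied, a second vertex collection (say, membership labels) can be joined against the same edges to form a second logical graph.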
The Graph-Parallel Abstraction (Discussed in W10-A)
• Iterative local transformations
  • E.g. the PageRank algorithm
• Vertex program
  • Launched for each vertex; interacts with adjacent vertex programs through messages (e.g. Pregel) or shared state (e.g. PowerGraph)
• Example with the PageRank algorithm:

def PageRank(v: Id, msgs: List[Double]) {
  // Compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
PageRank in Pregel
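The vertex program above can be simulated in plain Python (a sketch, not GraphX or Pregel code; the three-page link structure is a made-up example):

```python
# Hypothetical tiny web graph: vertex -> list of out-neighbors.
out_nbrs = {1: [2, 3], 2: [3], 3: [1]}
pr = {v: 1.0 for v in out_nbrs}          # initial PageRank values

def pagerank_step(pr, out_nbrs):
    # Gather: sum the incoming messages PR(u) / NumLinks(u).
    msg_sum = {v: 0.0 for v in out_nbrs}
    for u, nbrs in out_nbrs.items():
        for v in nbrs:
            msg_sum[v] += pr[u] / len(nbrs)
    # Apply: PR(v) = 0.15 + 0.85 * msgSum, as in the vertex program.
    return {v: 0.15 + 0.85 * msg_sum[v] for v in out_nbrs}

for _ in range(50):                      # iterate toward convergence
    pr = pagerank_step(pr, out_nbrs)
```

Since every vertex here has out-links, the total rank mass stays constant across iterations, and vertex 3 (two in-links) ends up with the highest rank.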
The Graph-Parallel Abstraction (Discussed in W10-A)
• Advantage
  • Well-suited to iterative graph algorithms, because the neighborhood structure of the graph is static
• Disadvantage
  • It cannot express computation where disconnected vertices interact
  • It cannot process computations that change the graph structure in the course of the computation
The GAS Decomposition
• Gonzalez et al.1 observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative associative sum and then broadcasting new messages in an inherently parallel loop
1 GONZALEZ, J. E., LOW, Y., GU, H., BICKSON, D., AND GUESTRIN, C. “Powergraph: Distributed graph-parallel computation on natural graphs,” OSDI’12, USENIX Association, pp. 17–30.
Types of graph computation [1/3]
• Gather: the computation gathers information from neighboring vertices
  • E.g. the authority value of the HITS algorithm
  • E.g. the current PageRank value
Types of graph computation [2/3]
• Apply: the vertex applies an update to the vertex property
  • E.g. update the authority value with the sum of the new authority values after normalizing the value
  • E.g. add the passed PageRank values, normalize, and update the current PageRank value
Types of graph computation [3/3]
• Scatter: a vertex sends out information to neighboring vertices
The GAS Decomposition
• The GAS decomposition splits vertex programs into three data-parallel stages
  • Gather
  • Apply
  • Scatter

def PageRank(v: Id, msgs: List[Double]) {
  // Gather: compute the message sum
  var msgSum = 0
  for (m <- msgs) { msgSum += m }
  // Apply: update the PageRank
  PR(v) = 0.15 + 0.85 * msgSum
  // Scatter: broadcast messages with new PR
  for (j <- OutNbrs(v)) {
    msg = PR(v) / NumLinks(v)
    send_msg(to=j, msg)
  }
  // Check for termination
  if (converged(PR(v))) voteToHalt(v)
}
The GAS Decomposition
• Pull-based model of message computation
  • The system asks the vertex program for the value of the message between adjacent vertices
  • Rather than the user sending messages directly from the vertex program
  • Vertex-cut partitioning is therefore suitable for this style of computation
• Limited communication pattern
  • Only communication between adjacent vertices is supported
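The three GAS stages for one PageRank iteration can be traced in plain Python (the tiny edge list is a made-up example, and this sketch is sequential rather than data-parallel):

```python
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]             # (src, dst), hypothetical
pr = {1: 1.0, 2: 1.0, 3: 1.0}
out_deg = {v: sum(1 for s, _ in edges if s == v) for v in pr}

# Gather: pull PR(src)/deg(src) along each in-edge and sum; the sum is
# commutative and associative, so partial sums may combine in any order.
gathered = {v: 0.0 for v in pr}
for s, d in edges:
    gathered[d] += pr[s] / out_deg[s]

# Apply: update each vertex property from its gathered sum.
pr = {v: 0.15 + 0.85 * gathered[v] for v in pr}

# Scatter: activate the neighbors whose inputs changed (here: all destinations).
active = {d for _, d in edges}
```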
Graph Computation as Dataflow Ops.
• Graph-parallel computation can be expressed as a sequence of join stages, group-by stages, and map operations
• Join stage
  • Vertex and edge properties are joined to form the triplets view
  • A triplet consists of an edge and its corresponding source and destination vertex properties
• Group-by stage
  • The triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates
  • Gathers messages destined to the same vertex
• Map operation
  • Applies the final aggregated message result for each vertex to update the vertex property
• Join operation
  • Distributes the updated values back to the vertices
Discussions
• Assume that you implement the PageRank algorithm using the three stages in GraphX. What stage will be applied for line 5?
a. Join stage
b. GroupBy stage
c. Map operations
d. All of the above

0:  def PageRank(v: Id, msgs: List[Double]) {
1:    // Compute the message sum
2:    var msgSum = 0
3:    for (m <- msgs) { msgSum += m }
4:    // Update the PageRank
5:    PR(v) = 0.15 + 0.85 * msgSum
6:    // Broadcast messages with new PR
7:    for (j <- OutNbrs(v)) {
8:      msg = PR(v) / NumLinks(v)
9:      send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }
Discussions
• Assume that you implement the PageRank algorithm using the three stages in GraphX. What stage will be applied for line 3?
a. Join stage
b. GroupBy stage
c. Map operations
d. All of the above

0:  def PageRank(v: Id, msgs: List[Double]) {
1:    // Compute the message sum
2:    var msgSum = 0
3:    for (m <- msgs) { msgSum += m }
4:    // Update the PageRank
5:    PR(v) = 0.15 + 0.85 * msgSum
6:    // Broadcast messages with new PR
7:    for (j <- OutNbrs(v)) {
8:      msg = PR(v) / NumLinks(v)
9:      send_msg(to=j, msg)
10:   }
11:   // Check for termination
12:   if (converged(PR(v))) voteToHalt(v)
13: }
The GAS Decomposition with GraphX
• Gather
  • GroupBy stage
• Apply
  • Map operation
• Scatter
  • Join stage
Triplets view
• Each edge and its corresponding source and destination vertex properties

[Figure: the vertex and edge collections joined to form the triplets collection.]

CREATE VIEW triplets AS
SELECT s.Id, d.Id, s.P, e.P, d.P
FROM edges AS e
JOIN vertices AS s JOIN vertices AS d
ON e.srcId = s.Id AND e.dstId = d.Id

Constructing Triplets in SQL
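The view definition above can be exercised directly with Python's built-in sqlite3 module (the two-edge table contents below are made up for illustration, and the joins are written with explicit ON clauses):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vertices (Id INTEGER PRIMARY KEY, P TEXT);
CREATE TABLE edges (srcId INTEGER, dstId INTEGER, P TEXT);
INSERT INTO vertices VALUES (1, 'A'), (2, 'B');
INSERT INTO edges VALUES (1, 2, 'follows'), (2, 1, 'follows');
-- The triplets view: each edge with its source and destination properties.
CREATE VIEW triplets AS
SELECT s.Id AS srcId, d.Id AS dstId, s.P AS srcP, e.P AS edgeP, d.P AS dstP
FROM edges AS e
JOIN vertices AS s ON e.srcId = s.Id
JOIN vertices AS d ON e.dstId = d.Id;
""")
rows = con.execute("SELECT * FROM triplets ORDER BY srcId").fetchall()
```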
GraphX Graph Operators
• Transform vertex and edge collections
• Graph constructor
  • Logically binds a pair of vertex and edge property collections into a property graph
  • Verifies integrity constraints: every vertex occurs only once, and edges do not link to missing vertices
  • def Graph(v: Collection[(Id, V)], e: Collection[(Id, Id, E)])
• Collection views
  • The vertices and edges operators expose the graph's vertex and edge property collections
  • The triplets operator returns the triplets view of the graph
  • def vertices: Collection[(Id, V)]
  • def edges: Collection[(Id, Id, E)]
  • def triplets: Collection[Triplet]
GraphX Graph Operators
• Graph-parallel computation
  • The mrTriplets (MapReduce triplets) operator encodes the two-stage process of graph-parallel computation
  • Composes the map and group-by dataflow operators on the triplets view
  • A user-defined map function is applied to each triplet
  • The generated values are aggregated at the destination vertex using a user-defined binary aggregation function
  • def mrTriplets(f: (Triplet) => M, sum: (M, M) => M): Collection[(Id, M)]
  • In SQL:

SELECT t.dstId, reduce(mapF(t)) AS msgSum
FROM triplets AS t
GROUP BY t.dstId
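A plain-Python analogue of mrTriplets can make the two stages concrete (the function names and sample triplets are illustrative, not the GraphX API):

```python
def mr_triplets(triplets, map_f, sum_f):
    """Apply map_f to every triplet, then reduce the messages per destination."""
    acc = {}
    for t in triplets:
        dst, msg = t[1], map_f(t)
        # sum_f must be commutative and associative so that partial
        # aggregates can be combined in any order across partitions.
        acc[dst] = msg if dst not in acc else sum_f(acc[dst], msg)
    return acc

# (srcId, dstId, srcProp, edgeProp, dstProp)
triplets = [(1, 2, 0.5, "e1", 0.3), (3, 2, 0.2, "e2", 0.3), (2, 1, 0.3, "e3", 0.5)]
# Sum each destination's incoming source properties.
msg_sums = mr_triplets(triplets, lambda t: t[2], lambda a, b: a + b)
```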
GraphX Graph Operators
• Convenience functions
  • def mapV(f: (Id, V) => V): Graph[V, E]
  • def mapE(f: (Id, Id, E) => E): Graph[V, E]
  • def leftJoinV(v: Collection[(Id, V)], f: (Id, V, V) => V): Graph[V, E]
  • def leftJoinE(e: Collection[(Id, Id, E)], f: (Id, Id, E, E) => E): Graph[V, E]
  • def subgraph(vPred: (Id, V) => Boolean, ePred: (Triplet) => Boolean): Graph[V, E]
  • def reverse: Graph[V, E]
Example use of mrTriplets

[Figure: a follower graph on vertices A–F whose vertex properties are ages (A: 42, B: 23, among others). Applying mapF to the triplet A→B, with source property 42 and target property 23, produces the message 1 to vertex B. The resulting vertex table shows A: 0 and B: 2, with C–F still to be computed.]

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) =
  ??? // What will be your computation here?
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] =
  graph.mrTriplets(mapUDF, reduceUDF)
Example use of mrTriplets

[Figure: the same follower graph; the resulting vertex table is A: 0, B: 2, C: 1, D: 1, E: 0, F: 3.]

Compute the number of older followers for each user in a social network:

val graph: Graph[User, Double]
def mapUDF(t: Triplet[User, Double]) =
  if (t.src.age > t.dst.age) 1 else 0
def reduceUDF(a: Int, b: Int): Int = a + b
val seniors: Collection[(Id, Int)] =
  graph.mrTriplets(mapUDF, reduceUDF)
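The completed example can be run in plain Python; the follower edge list below is a made-up four-vertex graph rather than the one in the figure, but the mapUDF/reduceUDF logic is the same:

```python
ages = {"A": 42, "B": 23, "C": 30, "D": 75}                  # hypothetical users
follows = [("A", "B"), ("C", "B"), ("B", "D"), ("D", "C")]   # (follower, followee)

# mapUDF: emit 1 when the follower (edge source) is older than the followee.
# reduceUDF: integer addition, aggregated at the destination vertex.
older = {v: 0 for v in ages}
for src, dst in follows:
    older[dst] += 1 if ages[src] > ages[dst] else 0
```

Here B ends up with two older followers (A and C), C with one (D), and A and D with none.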
Implementation of the Pregel abstraction using GraphX
• Initializes the vertex properties with an additional field to track active vertices
• While any vertices are active, messages are computed using the mrTriplets operator
  • Edge-parallel map operation for message computation
  • Commutative, associative aggregation

def Pregel(g: Graph[V, E],
           vprog: (Id, V, M) => V,
           sendMsg: (Triplet) => M,
           gather: (M, M) => M): Collection[V] = {
  // Set all vertices as active
  g = g.mapV((id, v) => (v, halt=false))
  // Loop until convergence
  while (g.vertices.exists(v => !v.halt)) {
    // Compute the messages
    val msgs: Collection[(Id, M)] =
      // Restrict to edges with active source
      g.subgraph(ePred=(s,d,sP,eP,dP) => !sP.halt)
      // Compute messages
       .mrTriplets(sendMsg, gather)
    // Receive messages and run vertex program
    g = g.leftJoinV(msgs).mapV(vprog)
  }
  return g.vertices
}
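The loop above can be sketched in plain Python, using connected components (minimum-label propagation) as the vertex program; for this algorithm the restriction to edges with an active source is exactly right, because a halted vertex's label has already been delivered to its neighbors (the edge list is a made-up two-component graph):

```python
# Undirected graph given as directed pairs in both directions.
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
label = {v: v for e in edges for v in e}   # vprog state: component label
active = set(label)                        # all vertices start active

while active:                              # loop until every vertex halts
    # sendMsg + gather: messages only along edges with an active source,
    # aggregated with min (commutative and associative).
    msgs = {}
    for s, d in edges:
        if s in active:
            msgs[d] = min(msgs.get(d, label[s]), label[s])
    # vprog: adopt a smaller label; a vertex halts when its label is unchanged.
    nxt = set()
    for v, m in msgs.items():
        if m < label[v]:
            label[v] = m
            nxt.add(v)
    active = nxt
```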
GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework
Distributed Representation of a Graph
Distributed Graph Representation
• GraphX represents graphs internally as a pair of vertex and edge collections built on the Spark RDD abstraction
• Indexing and graph-specific partitioning as a layer on top of RDDs

[Figure: a six-vertex example graph. The edges are split across three edge partitions: A = {1→2, 1→3}, B = {4→1, 4→5}, C = {1→5, 1→6, 5→6}. The vertices are hash-partitioned into vertex partition A = {1, 2, 3} and vertex partition B = {4, 5, 6}, each carrying a bitmask that tracks vertex visibility. A routing table records which edge partitions reference each vertex partition's vertices: for vertex partition A, edge partitions A (1, 2, 3), B (1), and C (1); for vertex partition B, edge partitions B (4, 5) and C (5, 6).]
Vertices and Edges
• The vertex collection is hash-partitioned by vertex id
• Vertices are stored in a local hash index within each partition
• A bitmask stores the visibility of each vertex
  • Soft deletions promote index reuse
  • If vertex 5 and its adjacent edges are restricted from the graph, they are removed from the corresponding collections by updating the bitmasks
  • Subsequent computation can reuse this index
• Edges are divided into edge partitions (three in the example) by applying a partition function
  • E.g. 2D partitioning
• Vertices are partitioned by vertex id
[Figure: the same edge partitions (A, B, C) and hash-partitioned vertex collections as above, each vertex partition carrying a bitmask.]
Routing table
• Encodes the edge partitions for each vertex
• Join site information is stored in the routing table

[Figure: routing table entries. For vertex partition A: edge partition A → vertices 1, 2, 3; B → vertex 1; C → vertex 1. For vertex partition B: edge partition B → vertices 4, 5; C → vertices 5, 6.]
Graph Partitioning: EdgePartition2D
• Inspired by multilevel k-way partitioning1
• 2D graph partitioning
• Upper bound of 2√n − 1 on the vertex replication factor
  • where n is the number of partitions

1 Karypis, G. and Kumar, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.
Graph Partitioning: EdgePartition2D
• Consider a graph G = (V, E)
  • where V is the set of vertices and E is the set of edges
  • Every vertex in V has a vertex identifier and a vertex property
  • Every edge in E has source and destination vertex identifiers and an edge property
• Goal: create n partitions of G such that
  • The partitions incur minimum communication
  • The workload is balanced
Step 1: Creating a partition table

If n is a perfect square:
  rows (# of rows) = √n
  cols (# of columns) = √n
If n is not a perfect square:
  cols = the ceiling of √n
  rows = the floor of (n + cols − 1) / cols

For example, if n = 27, cols = 6 and rows = 5; the first five columns are full, and the last column holds the remaining partitions.

[Figure: a rows × cols partition grid.]
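The row/column computation can be written out in plain Python (math.isqrt and integer division stand in for the floor/ceiling notation above; this is an illustrative sketch, not Spark's exact code):

```python
import math

def partition_grid(n):
    """(rows, cols) of the EdgePartition2D partition table for n partitions."""
    root = math.isqrt(n)
    if root * root == n:                 # n is a perfect square
        return root, root
    cols = math.ceil(math.sqrt(n))       # ceiling of sqrt(n)
    rows = (n + cols - 1) // cols        # floor of (n + cols - 1) / cols
    return rows, cols
```

For n = 27 this gives rows = 5 and cols = 6, matching the example above, and for any n the grid has at least n cells.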
Step 2: Assigning vertices and edges

Vertex assignment
• Using elementary modular hashing, v % n
• Vertices are distributed evenly among the partitions

Edge assignment
• The source vertex (src) is mapped to a column:
  col = (src × mixingPrime) % √n, if n is a perfect square
  col = (src × mixingPrime) % cols, otherwise
  where mixingPrime is a large prime number that improves the balance of the edge distribution
• The destination vertex (dst) is mapped to a row:
  row = (dst × mixingPrime) % √n, if n is a perfect square
  row = (dst × mixingPrime) % rows, if n is not a perfect square and col < cols − 1
  row = (dst × mixingPrime) % lastColRows, otherwise, where lastColRows is the number of rows in the last column
Step 3: Storing edge properties

With col and row computed as in Step 2, the edge and its properties are stored in partition:
  col × √n + row, if n is a perfect square
  col × rows + row, otherwise
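Putting the three steps together, a plain-Python sketch of the full mapping (mixingPrime is passed explicitly; this illustrates the scheme on the slides rather than reproducing Spark's exact code):

```python
import math

def edge_partition_2d(src, dst, n, mixing_prime):
    """Map edge (src, dst) to one of n partitions (Steps 1-3 above)."""
    root = math.isqrt(n)
    if root * root == n:                       # perfect square: sqrt(n) grid
        rows = cols = root
        last_col_rows = rows
    else:
        cols = math.ceil(math.sqrt(n))
        rows = (n + cols - 1) // cols
        last_col_rows = n - rows * (cols - 1)  # the last column is shorter
    col = (src * mixing_prime) % cols
    row = (dst * mixing_prime) % (rows if col < cols - 1 else last_col_rows)
    return col * rows + row                    # partition index
```

With the worked example from the next slide (src = 9, dst = 2, mixingPrime = 3, n = 25), this yields col = 2, row = 1, and partition 11.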
Discussions
• Let's locate a set of edges using EdgePartition2D
• {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
• Where will they be located?
a. a single cell
b. a single row
c. a single column
d. randomly dispersed
Discussions
• Let's locate a set of edges using EdgePartition2D
• {(s, d1), (s, d2), (s, d3), (s, d4), (s, d5)} (sharing the same source vertex)
• Where will they be located?
a. a single cell
b. a single row
c. a single column (answer: the shared source vertex fixes the column, while the destinations spread the edges across its rows)
d. randomly dispersed
Understanding the effect of EdgePartition2D
• Let's locate an edge (vsrc, vdes)
• All the edges where vsrc is the source vertex would be placed in the same column, col
  • Example: if vsrc = 9 and mixingPrime = 3 for n = 25 partitions: col = (9 × 3) % 5 = 2
• The actual cell is determined by the destination vertex
  • If vdes = 2 and mixingPrime = 3: row = (2 × 3) % 5 = 1
• Therefore, the edge (vsrc, vdes) is stored in partition 11 (col × rows + row = 2 × 5 + 1), the cell in the 2nd row and the 3rd column

[Figure: a 5 × 5 partition grid with columns labeled 0–4.]
Understanding the effect of EdgePartition2D
• A vertex with vertex id v can be in any cell of the column (v × mixingPrime) % √n
  • If it is a source vertex
• Similarly, a vertex with vertex id v can be in any cell of the row (v × mixingPrime) % √n
  • If it is a destination vertex
• Can a vertex v be in any other cells except the aforementioned set of cells?
  • No!
Understanding the effect of EdgePartition2D
• Therefore, any edge containing v has to be placed in one of √n + √n − 1 = 2√n − 1 partitions
• The upper bound on the vertex replication factor is 2√n − 1
  • This is directly related to the communication cost of synchronizing the vertex properties

Naman Shah, Matthew Malensek, Harshil Shah, Shrideep Pallickara, and Sangmi Lee Pallickara. "Scalable Network Analytics for Characterization of Outbreak Influence in Voluminous Epidemiology Datasets," Concurrency and Computation: Practice & Experience. John Wiley. 2018.
Naman Shah, Harshil Shah, Matthew Malensek, Sangmi Lee Pallickara, and Shrideep Pallickara. "Network Analysis for Identifying and Characterizing Disease Outbreak Influence from Voluminous Epidemiology Data," Proceedings of the IEEE International Conference on Big Data (IEEE BigData). Washington D.C., USA. 2016.