tddd43 advanced data models and databases topic: graph data · 2017-12-03 · 4 tddd43 – advanced...
TRANSCRIPT
2TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Graphs are Everywhere
● Transportation networks● Bibliographic networks● Computer networks● Social networks● Topic maps● Knowledge bases● Protein interactions● Biological food chains● etc.
4TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Categories of Graph Data Systems
● Triple stores– Typically, pattern matching queries– Data model: RDF
● Graph databases– Typically, navigational queries– Prevalent data model: property graphs
● Graph processing systems– Typically, complex graph analysis tasks– Prevalent data model: generic graphs
Remembermy earlier lecture
on RDF andSPARQL
Graph Data Models
8TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
RDF Data Model
● Data comes as a set of triples (s, p, o)– subject: URI– predicate: URI– object: URI or literal
● Such a set may be understood as a graph– Triples as directed edges– Subjects and objects as vertexes– Edges labeled by predicate
http://dbpedia.org/resource/Mount_Baker
http://dbpedia.org/resource/Washington
http://dbpedia.org/property/location
1880
http://dbpedia.org/property/lastEruption
Remembermy earlier lecture
on RDF andSPARQL
9TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Property Graph
10TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Property Graph (cont'd)
● Directed multigraph– multiple edges between the same pair of nodes
● Any node and any edge may have a label● Additionally, any node and any edge may have
an arbitrary set of key-value pairs (“properties”)
11TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Property Graphs versus RDF Graphs
● Both data models have a lot of similarities:– Directed multigraphs– Labels on edges and on vertexes– Attributes with values on vertexes
● However, there are some subtle differences:– No edge properties in RDF graphs– Edge labels cannot appear as nodes in a PG (in
RDF we may have <s1,p1,o1> and <p1,p2,o2>)– No multivalued (vertex) properties in a PG
(unless we use a collection object as the value)– Node and edge identifiers in a PG are local
to the PG, whereas URIs are globally uniqueidentifiers (important for data integration)
12TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Generic Graphs
● Data model– Directed multigraphs– Arbitrary user-defined data structure can be used
as value of a vertex or an edge (e.g., a Java object)● Example (Flink Gelly API):
● Advantage: give users maximum flexibility● Drawback: systems cannot provide built-in operators
related to vertex data or edge data
// create new vertexes with a Long ID and a String value
Vertex<Long, String> v1 = new Vertex<Long, String>(1L, "foo");
Vertex<Long, String> v2 = new Vertex<Long, String>(2L, "bar");
Edge<Long, Double> e = new Edge<Long, Double>(1L, 2L, 0.5);
Graph Databases
15TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Categories of Graph Data Systems
● Triple stores– Typically, pattern matching queries– Data model: RDF
● Graph databases– Typically, navigational queries– Prevalent data model: property graphs
● Graph processing systems– Typically, complex graph analysis tasks– Prevalent data model: generic graphs
16TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Examples of Graph DB Systems
● Systems that focus on graph databases– Neo4j– Sparksee– Titan– InfiniteGraph
● Multi-model NoSQL storeswith support for graphs:– OrientDB– ArangoDB
● Triple stores with TinkerPop support– Blazegraph– Stardog– IBM System G
17TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Apache TinkerPop
● Graph computing framework– Vendor-agnostic
● Includes a graph structure API– Formerly known as Blueprints API– For creating and modifying Property Graphs– Example:
● Also includes a process API– Graph-parallel engine– Graph traversal, based on a language called Gremlin
Graph graph = …
Vertex marko = graph.addVertex(T.label, "person", T.id, 1, "name", "marko", "age", 29);
Vertex vadas = graph.addVertex(T.label, "person", T.id, 2, "name", "vadas", "age", 27);
marko.addEdge("knows", vadas, T.id, 7, "weight", 0.5f);
18TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Gremlin Graph Traversal Language
● Part of the TinkerPop framework● Powerful domain-specific language (DSL) with
embeddings in various programming languages● Expressions specify a concatenation of traversal steps
19TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Gremlin Example
g.V().has('name','marko').out('knows').values('name')
Result:
==>vadas
==>josh
20TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Gremlin Example
g.V().has('name','marko').out('knows').values('name').path()
Result:
==>[v[1],v[2],vadas]
==>[v[1],v[4],josh]
21TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Gremlin Example
g.V().has('name','marko').repeat(out()).times(2).values('name')
Result:
==>ripple
==>lop
22TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Cypher
● Declarative graph database query language● Proprietary (used by Neo4j)● The OpenCypher project aims
to deliver an open specification● Example
– Recall our initial Gremlin example:
g.V().has('name','marko').out('knows').values('name')– In Cypher we could express this query as follows:
MATCH ( {name:'marko'} )-[:knows]->( x )RETURN x.name
23TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Possible Clauses in Cypher Queries
CREATE - creates nodes and edges
DELETE - removes nodes, edges, properties
SET - sets values of properties
MATCH - specifies a pattern to match in the data graph
WHERE - filters pattern matching results
RETURN - which nodes / edges / properties in the matched data should be returned
UNION - merges results from two or more queries
WITH - chains subsequent query parts (like piping in Unix commands)
24TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Node Patterns in Cypher
● Node patterns may have different forms:
( ) - matches any node
(:person) - matches nodes whose label is person
( {name:'marko'} ) - matches nodes that have a property name='marko'
(:person {name:'marko'} ) - matches nodes that have both the label person and a property name='marko'
● Every node pattern can be assigned a variable – Can be used to refer to the matching node
in another query clause or to express joins– For instance, (x), (x:person)
25TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Relationship Patterns in Cypher
● Relationship pattern must be placed between two node patterns and it may have different forms
--> or <-- - matches any edge (with the given direction)-[:knows]-> - matches edges whose label is knows-[ {weight:0.5} ]-> - matches edges that have a property weight=0.5-[:knows {weight:0.5} ]-> - matches edges that have both the label knows and a property weight=0.5-[:knows*..4]-> - matches paths of knows edges of up to length 4
● Every relationship pattern can be assigned a variable – For instance, <-[x:knows]-
26TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
More Complex Cypher Patterns
● Node patterns and relationship patterns are justbasic building blocks that can be combined intomore complex patterns
● Examples:– MATCH (a)-[:knows]->()-[:knows]->(a)
RETURN a
– MATCH p = shortestPath( (:person {name:'marko'])-[*]->(:person {name:'josh']))RETURN p
27TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Filtering in Cypher
● Pattern matching results can be filtered outby using the WHERE clause (similar to SQL)
● Examples:– MATCH (a:person)-[x:knows]->(b:person)
WHERE x.weight > 0.5 AND x.weight < 0.9RETURN a , b
– MATCH ()-[x:knows]->()WHERE exists(x.weight)RETURN x
– MATCH (a)-[:knows]->(b)-[x:knows]->(c)WHERE NOT (a)-[:knows]->(c)RETURN *
Graph Processing Systems
30TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Categories of Graph Data Systems
● Triple stores– Typically, pattern matching queries– Data model: RDF
● Graph databases– Typically, navigational queries– Prevalent data model: property graphs
● Graph processing systems– Typically, complex graph analysis tasks– Prevalent data model: generic graphs
31TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Complex Graph Analysis Tasks???
● Tasks that require an iterative processing ofthe entire graph or large portions thereof
● Examples:– Centrality analysis (e.g., PageRank)– Clustering, connected components– Graph coloring– Diameter finding– All-pairs shortest path– Graph pattern mining (e.g., frequent
subgraphs, community detection)– Machine learning (e.g., belief propagation,
Gaussian non-negative matrix factorization)
32TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: PageRank
PRk+1(v) = ΣvIN Prk(vIN) / |Out(vIN)|
v1 v3
v2 v4
k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25
PR(v2) 0.25
PR(v3) 0.25
PR(v4) 0.25
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
33TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: PageRank
PRk+1(v) = ΣvIN Prk(vIN) / |Out(vIN)|
v1 v3
v2 v4
k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37
PR(v2) 0.25
PR(v3) 0.25
PR(v4) 0.25
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
PR2(v1) = PR1(v3)/1 + PR1(v4)/2PR2(v1) = 0.25/1 + 0.25/2PR2(v1) = 0.25 + 0.125PR2(v1) = 0.375
34TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: PageRank
PRk+1(v) = ΣvIN Prk(vIN) / |Out(vIN)|
v1 v3
v2 v4
k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37 0.43 0.35 0.39 0.38 0.38
PR(v2) 0.25 0.08 0.12 0.14 0.11 0.13 0.13
PR(v3) 0.25 0.33 0.27 0.29 0.29 0.28 0.28
PR(v4) 0.25 0.20 0.16 0.20 0.19 0.19 0.19
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
PR2(v1) = PR1(v3)/1 + PR1(v4)/2PR2(v1) = 0.25/1 + 0.25/2PR2(v1) = 0.25 + 0.125PR2(v1) = 0.375
Convergence
35TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Observation
● Many such algorithms iteratively propagatedata along the graph structure by transforming intermediate vertex and edge values
36TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Can we use MapReduce for this?
37TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Can we use MapReduce for this?
● M/R does not directly support iterative algorithms● Materializing intermediate results at
each M/R iteration harms performance● Extra M/R job on each iteration for
checking whether a fixed point hasbeen reached
● Additional issue for graph algorithms– Invariant graph-topology data
reloaded and reprocessed ateach iteration
– Wastes I/O, CPU, andnetwork bandwidth
38TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Graph Processing Systems
Pregel Family● Pregel
● Giraph
● Giraph++
● Mizan
● GPS
● Pregelix
● Pregel+
GraphLab Family● GraphLab
● PowerGraph
● GraphChi(centralized)
Other Systems● Trinity
● TurboGraph(centralized)
● Signal/Collect
39TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Vertex-Centric Abstraction
● Many such algorithms iteratively propagatedata along the graph structure by transforming intermediate vertex and edge values– These transformations are defined
in terms of functions on the valuesof adjacent vertexes and edges
– Hence, such algorithms can beexpressed by specifying a functionthat can be applied to any vertexseparately
● “Think like a vertex”
40TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Vertex-Centric Abstraction (cont'd)
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
● Overall execution can be parallelized– Terminates when all vertexes have
halted and no messages in transit
41TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25
PR(v2) 0.25
PR(v3) 0.25
PR(v4) 0.25
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
42TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25
PR(v2) 0.25
PR(v3) 0.25
PR(v4) 0.25
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
0.125
0.083
43TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25
PR(v2) 0.25
PR(v3) 0.25
PR(v4) 0.25 0.20
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
0.125
0.083
44TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37
PR(v2) 0.25 0.08
PR(v3) 0.25 0.33
PR(v4) 0.25 0.20
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
45TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37
PR(v2) 0.25 0.08
PR(v3) 0.25 0.33
PR(v4) 0.25 0.20
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
0.1
0.1
46TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37 0.43 0.35 0.39 0.38
PR(v2) 0.25 0.08 0.12 0.14 0.11 0.13
PR(v3) 0.25 0.33 0.27 0.29 0.29 0.28
PR(v4) 0.25 0.20 0.16 0.20 0.19 0.19
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
localconvergence
47TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Example: Vertex-Centric PageRank
PRk+1(v) = ΣvIN PRk(vIN) / |Out(vIN)|
v1 v3
v2 v4k=0 k=1 k=2 k=3 k=4 k=5 k=6
PR(v1) 0.25 0.37 0.43 0.35 0.39 0.38 0.38
PR(v2) 0.25 0.08 0.12 0.14 0.11 0.13 0.13
PR(v3) 0.25 0.33 0.27 0.29 0.29 0.28 0.28
PR(v4) 0.25 0.20 0.16 0.20 0.19 0.19 0.19
● Vertex compute function consists of three steps:1. Read all incoming messages from neighbors2. Update the value of the vertex3. Send messages to neighbors
● Additionally, function may “vote to halt”if a local convergence criterion is met
48TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Google Pregel
● First system that implementedvertex-centric computation forshared-nothing clusters– Communication through
message passing
● Based on the bulk synchronousparallel (BSP) programming model– Supersteps with
synchronization barriers
● Apache Giraph was a first opensource implementation of Pregel
| ← S
u perstep → |
| ← S
u perstep → |
| ← S
u perstep → |
50TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
MapReduce versus Pregel
● Graph topology is not passed across iterations, vertexes only send theirstate to their neighbors
● Requires passing of entire graph topology from one iteration to the next
● Intermediate result after each iteration is stored on disk and then read again from disk
● Programmer needs to writea driver program to support iterations, and another M/R job to check for fixed point
● Main memory based
● Usage of supersteps and master-client architecture makes programming easy
51TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Limitation of Pregel
● In the BSP model, performance islimited by slowest worker machine– Many real-world graphs have
power-law degree distribution,which may lead to few highly-loaded workers
| ← S
u perstep → |
| ← S
u perstep → |
| ← S
u perstep → |
52TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Limitation of Pregel
● In the BSP model, performance islimited by slowest worker machine– Many real-world graphs have
power-law degree distribution,which may lead to few highly-loaded workers
● Possible optimizations tobalance the workload:– Decompose the vertex program– Sophisticated graph partitioning– Graph-centric abstraction
● Another possibility: asynchronous execution (instead of BSP)
| ← S
u perstep → |
| ← S
u perstep → |
| ← S
u perstep → |
53TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Combiner
● Takes two messages and combines them into one– Associative, commutative function
● Can be used to aggregate messages before sending them to the worker node that has the target vertex
● Example:– In the vertex-centric PageRank, messages are values
mIN = (Prk(vIN) / |Out(vIN)|) of each incoming neighbor vIN
– In the vertex function these values are summed up: (Prk(vIN1) / |Out(vIN1)|) + (Prk(vIN2) / |Out(vIN2)|) + ...
– Parts of this sum may be computed by worker nodes that have some of the incoming neighbor vertexes
54TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Signal/Collect Model
● Signaling (edge function):– Every edge uses the value of its source vertex to
compute a message (“signal”) for the target vertex– Executed on the worker that has the source vertex
● Collecting (vertex function):– Every vertex computes its new value based on
the messages received from its incoming edges– Executed on the worker that has the target vertex
55TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Gather, Apply, Scatter (GAS) Model
● Gather:– Accumulate incoming messages,
i.e., same purpose as a combiner
● Apply:– Update the vertex value based
on the accumulated information– Operates only on the vertex
● Scatter:– Computes outgoing messages– Can be executed in parallel for
each adjacent edge
56TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Limitation of Pregel
● In the BSP model, performance islimited by slowest worker machine– Many real-world graphs have
power-law degree distribution,which may lead to few highly-loaded workers
● Possible optimizations tobalance the workload:– Decompose the vertex program– Sophisticated graph partitioning– Graph-centric abstraction
● Another possibility: asynchronous execution (instead of BSP)
| ← S
u perstep → |
| ← S
u perstep → |
| ← S
u perstep → |
57TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Partitioning
● Goal: distribute the vertexes to achieve a balanced workload while minimizing inter-partitionedges to avoid costly network traffic
– For instance, hash-based (random)partitioning has extremely poor locality
● Unfortunately, the problem is NP-complete– k-way graph partitioning problem
● Various heuristics and approximation algorithms
59TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Vertex-Cut
● PowerGraph introduced a partitioning scheme that “cuts” vertexes such that the edges of high-degree vertexes are handled by multiple workers– improved work balance
● Power-law graphs have good vertex cuts– Communication is linear in the number
of machines each vertex spans– Vertex-cut minimizes this number– Hence, reduced network traffic
64TDDD43 – Advanced Data Models and Databases, HT 2017Topic: Graph Data Olaf Hartig
Limitation of Pregel
● In the BSP model, performance islimited by slowest worker machine– Many real-world graphs have
power-law degree distribution,which may lead to few highly-loaded workers
● Possible optimizations tobalance the workload:– Decompose the vertex program– Sophisticated graph partitioning– Graph-centric abstraction
● Another possibility: asynchronous execution (instead of BSP)
| ← S
u perstep → |
| ← S
u perstep → |
| ← S
u perstep → |
www.liu.se
Acknowledgements:● Some of the slides about graph processing systems are from a slideset of Sherif Sakr. Thanks Sherif!
Image sources:● Example Property Graph http://tinkerpop.apache.org/docs/current/tutorials/getting-started/ ● BSP Illustration https://en.wikipedia.org/wiki/Bulk_synchronous_parallel ● Smiley https://commons.wikimedia.org/wiki/File:Face-smile.svg ● Frowny https://commons.wikimedia.org/wiki/File:Face-sad.svg ● Powerlaw charts http://www9.org/w9cdrom/160/160.html