lighthouse: large-scale graph pattern matching on giraph
Lighthouse: Large-scale graph pattern matching on Giraph
2
Timeline
• Inspired by Google Pregel (2010)
• Donated to ASF by Yahoo! in 2011
• Top-level project in 2012
• 1.0 release in January 2013
• 1.1 release in November 2014
• Used at Facebook, LinkedIn, Yahoo!
3
Vertex-centric API
[Figure: graph with vertex values 5, 2, 3 and unknown values (?), shown at Iteration i and Iteration i+1]
4
[Figure: processing units PU 1 to PU 5, shown at Iteration i and Iteration i+1]
BSP/Pregel implementation
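The superstep model named above can be illustrated with a minimal single-process sketch. It is not Giraph code: `compute`, `run_bsp`, and the maximum-value example are hypothetical names and choices made for illustration. Each vertex reads the messages sent to it in the previous iteration, updates its value, and sends messages that are only delivered after the barrier.

```python
from collections import defaultdict

def compute(value, messages):
    """Vertex program (illustrative): adopt the largest value seen so far."""
    return max([value, *messages])

def run_bsp(values, edges, max_supersteps=50):
    """Run supersteps until no messages are in flight.

    values: dict vertex -> initial value
    edges:  dict vertex -> list of out-neighbours
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = defaultdict(list)
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)
    supersteps = 0
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            new_value = compute(values[v], messages)
            if new_value != values[v]:
                values[v] = new_value
                # Messages sent now are only visible in the next superstep.
                for n in edges.get(v, []):
                    outbox[n].append(new_value)
        inbox = outbox  # barrier: swap outbox and inbox
        supersteps += 1
    return values, supersteps

chain_values = {"a": 5, "b": 2, "c": 3, "d": 1}
chain_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, used_supersteps = run_bsp(chain_values, chain_edges)
print(final_values)
```

A vertex that receives no messages stays idle, which loosely mirrors the way a Pregel-style vertex stays halted until a message wakes it up.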
5
Architecture
[Diagram labels: Worker 1, Worker 2, ..., Worker N, Master; Netty; Hadoop File System (HDFS); Zookeeper; Master Coordinator; compute threads; vertices; message inbox; message outbox]
6
Lighthouse
Giraph execution algebra
Binding table. Matching and potential graph patterns are stored in a table that is distributed across the messages sent around by vertices.
• Scan: starts traversals from certain vertices.
• Select: prunes traversals based on expressions.
• Project: adds data to the binding table.
• Hash Join: joins paths generated from different traversals.
• Step Join: performs a further hop in the traversal.
• Move: continues a traversal from different vertices.
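To make two of these operators concrete, here is a small in-memory sketch of Select and Hash Join over binding-table rows. It assumes each row is a plain dictionary from variable name to vertex id. The function names, the sample rows, and the `browser` lookup are hypothetical, and the sketch ignores the distribution of rows across messages.

```python
def select(rows, predicate):
    """Select: prune the traversals (rows) that fail the predicate."""
    return [row for row in rows if predicate(row)]

def hash_join(left, right, key):
    """Hash Join: merge rows from two traversals that agree on `key`."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]

# Two partial binding tables produced by different traversals (made-up data).
works_at = [
    {"person": "p1", "company": "c1"},
    {"person": "p2", "company": "c2"},
]
located_in = [
    {"company": "c1", "country": "IT"},
    {"company": "c2", "country": "NL"},
]
browser = {"p1": "Chrome", "p2": "Firefox"}

pruned = select(works_at, lambda row: browser[row["person"]] == "Chrome")
joined = hash_join(pruned, located_in, "company")
print(joined)
```

Building the hash index on one input and probing it with the other keeps the join linear in the size of both inputs, which is the usual reason to prefer it over a nested loop.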
8
[Figure: graph with vertex values 5, 2, 3 and unknown values (?), shown at Iteration i and Iteration i+1]
V1 John … VN
… … … …
V4 Paul … VJ
V7 Mark … VL
Distributed Binding Table
9
MATCH (person:Person {firstName:"Antonio"}) -[:WORK_AT]-> (company),
      (company) -[:IS_LOCATED_IN]-> (country)
WHERE person.browser = "Chrome"
RETURN person.id, person.lastName, company.id, country.id
10
MATCH (person:Person) -[:WORK_AT]-> (company)
RETURN person.id, person.birthDate, company.id
11
Scan
Project
12
StepJoin
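One way to read the plan Scan, Project, StepJoin is as a pipeline that evaluates the preceding MATCH ... RETURN query over person, birthDate and company. The sketch below runs that pipeline in a single process on a made-up graph. The data, the function signatures, and the use of vertex ids as `person.id` and `company.id` are all assumptions. In Giraph the step join would presumably be carried out by sending the binding row along the edge as a message.

```python
# Tiny illustrative property graph (hypothetical data).
vertices = {
    1: {"label": "Person", "birthDate": "1980-01-01"},
    2: {"label": "Person", "birthDate": "1990-05-17"},
    10: {"label": "Company"},
}
edges = [(1, "WORK_AT", 10), (2, "KNOWS", 1)]

def scan(var, label):
    """Scan: start one traversal per vertex carrying the given label."""
    return [{var: vid} for vid, props in vertices.items() if props["label"] == label]

def project(rows, var, prop):
    """Project: copy a property of the bound vertex into the binding row."""
    return [{**row, f"{var}.{prop}": vertices[row[var]][prop]} for row in rows]

def step_join(rows, src_var, edge_label, dst_var):
    """Step Join: extend each row by one hop over edges with the given label."""
    out = []
    for row in rows:
        for src, label, dst in edges:
            if src == row[src_var] and label == edge_label:
                out.append({**row, dst_var: dst})
    return out

rows = scan("person", "Person")
rows = project(rows, "person", "birthDate")
rows = step_join(rows, "person", "WORK_AT", "company")
print(rows)
```

Person 2 has no WORK_AT edge, so its traversal ends at the step join and it does not appear in the result.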
13
Cypher path queries
Desired functionality:
• weighted shortest paths
• multiple sources and destinations
• top N shortest paths for each pair
• provide both paths and their costs
• restrict search to a subset of the graph
Restrictions:
• Monotonic cost function
• Path-independent local vertex/edge restrictions
14
Proposal
MATCH p = (a:Start) -[e* | not(endNode(e)).danger ]-> (b:Finish)
CHEAPEST 3 SUM e.distance * e.maxSpeed AS length
RETURN a, b, path, length
Features:
• Selector applied before WHERE condition (optional)
• Number of paths for each pair (e.g. 3) (optional)
• User-defined cost function (required)
• AS keyword to bind distance to variable (optional)
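One possible reading of the proposed query can be written as a brute-force reference implementation. The sketch below rests on several assumptions that the slides do not confirm: paths are simple (no repeated vertex), the selector rejects any edge whose end node has `danger` set, and the cost of a path is the sum of `distance * maxSpeed` over its edges. The function name `cheapest_paths` and the sample graph are hypothetical.

```python
import heapq

def cheapest_paths(nodes, edges, k):
    """Return the k cheapest simple paths for every (Start, Finish) pair.

    nodes: id -> {"labels": set of str, "danger": bool}
    edges: list of (src, dst, {"distance": number, "maxSpeed": number})
    """
    out_edges = {}
    for src, dst, props in edges:
        out_edges.setdefault(src, []).append((dst, props))
    results = {}

    def walk(start, node, path, cost):
        if node != start and "Finish" in nodes[node]["labels"]:
            results.setdefault((start, node), []).append((cost, list(path)))
        for dst, props in out_edges.get(node, []):
            # Selector: skip edges ending in a dangerous node; keep paths simple.
            if nodes[dst]["danger"] or dst in path:
                continue
            path.append(dst)
            walk(start, dst, path, cost + props["distance"] * props["maxSpeed"])
            path.pop()

    for start, props in nodes.items():
        if "Start" in props["labels"]:
            walk(start, start, [start], 0)
    return {pair: heapq.nsmallest(k, found) for pair, found in results.items()}

nodes = {
    "a": {"labels": {"Start"}, "danger": False},
    "b": {"labels": {"Finish"}, "danger": False},
    "x": {"labels": set(), "danger": False},
    "y": {"labels": set(), "danger": True},
    "z": {"labels": set(), "danger": False},
}
edges = [
    ("a", "b", {"distance": 10, "maxSpeed": 1}),
    ("a", "x", {"distance": 2, "maxSpeed": 1}),
    ("x", "b", {"distance": 3, "maxSpeed": 1}),
    ("a", "y", {"distance": 1, "maxSpeed": 1}),
    ("y", "b", {"distance": 1, "maxSpeed": 1}),
    ("a", "z", {"distance": 4, "maxSpeed": 2}),
    ("z", "b", {"distance": 1, "maxSpeed": 1}),
]
top2 = cheapest_paths(nodes, edges, k=2)
print(top2)
```

Exhaustive enumeration is only practical on tiny graphs. It is meant here as a readable statement of what the query would return under these assumptions, not as an execution strategy.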
15
Giraph implementation
Two phases:
• First phase: we compute the routes of each of the top K shortest paths. Each vertex discovers and registers its preceding vertex on the shortest paths (similar to Pregel BFS).
• Second phase: starting from the “leaves”, we traverse the structure backwards, building the paths.
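The two phases can be sketched in a simplified form. The version below is restricted to a single cheapest path per destination (K = 1), assumes positive edge costs, and runs in one process. The names `discover` and `build_path` are hypothetical, and the real implementation may differ substantially. The first phase proceeds in rounds that resemble supersteps, with each vertex registering the predecessor that gave it its best cost. The second phase walks those predecessors backwards.

```python
def discover(edges, source):
    """Phase one: propagate costs in rounds and register predecessors.

    edges: vertex -> list of (neighbour, cost), with positive costs.
    """
    cost = {source: 0}
    pred = {source: None}
    frontier = {source}
    while frontier:
        # "Send" one message per outgoing edge of every updated vertex.
        messages = []
        for v in frontier:
            for n, w in edges.get(v, []):
                messages.append((n, cost[v] + w, v))
        # Next round: receivers keep the cheapest offer and its sender.
        frontier = set()
        for n, c, sender in messages:
            if c < cost.get(n, float("inf")):
                cost[n], pred[n] = c, sender
                frontier.add(n)
    return cost, pred

def build_path(pred, target):
    """Phase two: walk predecessors back from the target, then reverse."""
    path = []
    v = target
    while v is not None:
        path.append(v)
        v = pred[v]
    return path[::-1]

road_edges = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("t", 5)],
    "b": [("t", 1)],
}
cost, pred = discover(road_edges, "s")
path_to_t = build_path(pred, "t")
print(cost["t"], path_to_t)
```

Extending this to the top K paths would require each vertex to keep up to K (cost, predecessor) entries instead of one, which the sketch leaves out.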
16
Preliminary results
17
Thanks.