scalable sparql querying of large rdf graphs
DESCRIPTION
Scalable SPARQL Querying of Large RDF Graphs. Xu Bo 2012.06.11. In PVLDB, 4(21), 2011. Outline. About Presenter Semantic Web Previous Work New Problem SYSTEM ARCHITECTURE EXPERIMENTS CONCLUSIONS AND FUTURE WORK. About Presenter. Daniel Abadi Associate Professor of Computer - PowerPoint PPT PresentationTRANSCRIPT
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Scalable SPARQL Querying of Large RDF Graphs
Xu Bo
2012.06.11
In PVLDB, 4(21), 2011
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
OutlineOutline• About Presenter
• Semantic Web
• Previous Work
• New Problem
• SYSTEM ARCHITECTURE
• EXPERIMENTS
• CONCLUSIONS AND FUTURE WORK
23/4/20 2
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
About Presenter
• Daniel Abadi• Associate Professor of Computer Science in Yale University• Research
– Column-Oriented Database Systems– Petascale Parallel Database Systems (HadoopDB) – Semantic Web Data Management
23/4/20 3
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Semantic WebSemantic Web
• The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web
23/4/20 4
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Google Knowledge Graph
23/4/20 5
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Key Technology
• HTML
• XML
23/4/20 6
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The Disadvantage of XML David Billington is a lecturer of Discrete Mathematics.
• there is no standard way of assigning meaning to tag nesting
23/4/20 7
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The Disadvantage of Xpath
• Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember
• XML is semantically unsatisfactory
23/4/20 8
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF
• Resource Description Framework• 用 Web标识符(称作统一资源标识符, Uniform
Resource Identifiers 或 URIs)来标识事物,用简单的属性( property)及属性值来描述资源
23/4/20 9
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF as Triples and a Graph
23/4/20 10
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
SPARQL
• RDF query language
• A basic graph pattern
• Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern
23/4/20 11
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Example for Star Pattern
• Find the names of the strikers that play for FC Barcelona.
23/4/20 12
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Another Example• Find football players playing for clubs in apopulous region where they were born.
23/4/20 13
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
23/4/20 14
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Previous Work• RDF In RDBMSs
• Property Tables
• Vertically Partitioned Approach
23/4/20 15
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
RDF In RDBMSs• Get the title of the book(s) Joe Fox wrote
in 2001
23/4/20 16
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Property Tables
23/4/20 17
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Vertically Partitioned Approach
23/4/20 18
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
New Problem• Single node RDF management systems are
abundant– Sesame– Jena– RDF3X– 3store
• Research in clustered RDF management is less significantly explored: The focus of the talk
23/4/20 19
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
SYSTEM ARCHITECTURE
23/4/20 20
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Graph Partitioning• Hash vs. Graph partitioning
– Hash: Only efficient for star patterns– Graph: Taking advantage of graph model
23/4/20 21
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Graph Partitioning• Edge vs. Vertex partitioning
– Edge: Natural but inefficient for query execution
– Vertex: Superior for common graph patterns
23/4/20 22
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Vertex Partitioning• Preprocess
– remove triples whose predicate is rdf:type
• METIS partitioner
23/4/20 23
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Triple Placement• Minimizing data shuffling/exchange
– Allowing data overlap
• N-hop guarantee– The extent of data overlap– If a vertex is assigned to a machine, any
vertex that is within n-hop of this vertex is also stored in this machine
23/4/20 24
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
DIRECTED N-HOP GUARANTEE
23/4/20 25
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
A potential problem• triples (s, p, o) and (o, p’, o’)
– 2-hop guarantee
• triples (s, p, o) and (s’, p’, o)– not guaranteed
• “object-connected” is not unusual
• undirected n-hop guarantee
23/4/20 26
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Triple Placement Algorithm
23/4/20 27
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Query Processing• Queries are executed in RDF-stores and/or
Hadoop• Query execution is more efficient in RDF-
stores than in Hadoop– Pushing as much of the processing as possible
into RDF-stores– Minimizing the number of Hadoop jobs– The larger the hop guarantee, the more work is
done in RDF-stores23/4/20 28
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
To Communicate, or not to Communicate
• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?– Choose the “center” of the query graph– Calculate the distance from the “center” to the
furthest edge– If distance > n, communication is needed; not
needed otherwise
23/4/20 29
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Determining whether a Query is PWOC
PWOC Query– parallelizable without communication
• DoFE– distance of farthest edge– the vertex in a graph with the smallest DoFE will be
the most central in a graph
23/4/20 30
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
The algorithm
23/4/20 31
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
the issue of duplicate results• naive approach
– remove duplicates after the query has completed
• owner-computes model– add triples (v, ‘<isOwned>’, ‘Yes’) to a
partition– For each query issued to the RDF-stores, add
an additional pattern (core, ‘<isOwned>’, ‘Yes’)
23/4/20 32
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
A query is not PWOC• decompose the query into PWOC
subqueries
• use Hadoop jobs to join the results of the PWOC subqueries
• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases
23/4/20 33
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
minimal number of subqueries• reduces to the problem of finding minimal
edge partitioning of a graph into subgraphs of bounded diameter
• brute-force
23/4/20 34
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Examlple
23/4/20 35
DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1
the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Decompose Example
23/4/20 36
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
EXPERIMENTS• 20-machine cluster
• Leigh University Benchmark (LUBM): 270 million triples
• Competitors:– Single-node RDF-3X– SHARD: triple-store system in Hadoop– Graph partitioning (the proposed system)– Hash partitioning on subjects
23/4/20 37
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Data Load Time
23/4/20 38
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Performance Comparison
23/4/20 39
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Varying Number of Machines
23/4/20 40
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
Summary
23/4/20 41
Graph Data Management Lab, School of Computer Science
GDM@FUDAGDM@FUDANN
http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn
23/4/20 42