scalable sparql querying of large rdf graphs

42
Graph Data Management Lab, School of Computer Science GDM@FUDAN GDM@FUDAN http://gdm.fudan.edu. http://gdm.fudan.edu. Scalable SPARQL Querying of Large RDF Graphs Xu Bo 2012.06.11 In PVLDB, 4(21), 2011

Upload: abeni

Post on 08-Jan-2016

65 views

Category:

Documents


6 download

DESCRIPTION

Scalable SPARQL Querying of Large RDF Graphs. Xu Bo 2012.06.11. In PVLDB, 4(21), 2011. Outline. About Presenter Semantic Web Previous Work New Problem SYSTEM ARCHITECTURE EXPERIMENTS CONCLUSIONS AND FUTURE WORK. About Presenter. Daniel Abadi Associate Professor of Computer - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Scalable SPARQL Querying of Large RDF Graphs

Xu Bo

2012.06.11

In PVLDB, 4(21), 2011

Page 2: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

OutlineOutline• About Presenter

• Semantic Web

• Previous Work

• New Problem

• SYSTEM ARCHITECTURE

• EXPERIMENTS

• CONCLUSIONS AND FUTURE WORK

23/4/20 2

Page 3: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

About Presenter

• Daniel Abadi• Associate Professor of Computer Science in Yale University• Research

– Column-Oriented Database Systems– Petascale Parallel Database Systems (HadoopDB) – Semantic Web Data Management

23/4/20 3

Page 4: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Semantic WebSemantic Web

• The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web

23/4/20 4

Page 5: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Google Knowledge Graph

23/4/20 5

Page 6: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Key Technology

• HTML

• XML

23/4/20 6

Page 7: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The Disadvantage of XML David Billington is a lecturer of Discrete Mathematics.

• there is no standard way of assigning meaning to tag nesting

23/4/20 7

Page 8: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The Disadvantage of Xpath

• Suppose we want to collect all academic staff members. A path expression in Xpath might be //academicStaffMember

• XML is semantically unsatisfactory

23/4/20 8

Page 9: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF

• Resource Description Framework• 用 Web标识符(称作统一资源标识符, Uniform

Resource Identifiers 或 URIs)来标识事物,用简单的属性( property)及属性值来描述资源

23/4/20 9

Page 10: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF as Triples and a Graph

23/4/20 10

Page 11: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

SPARQL

• RDF query language

• A basic graph pattern

• Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern

23/4/20 11

Page 12: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Example for Star Pattern

• Find the names of the strikers that play for FC Barcelona.

23/4/20 12

Page 13: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Another Example• Find football players playing for clubs in apopulous region where they were born.

23/4/20 13

Page 14: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

23/4/20 14

Page 15: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Previous Work• RDF In RDBMSs

• Property Tables

• Vertically Partitioned Approach

23/4/20 15

Page 16: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

RDF In RDBMSs• Get the title of the book(s) Joe Fox wrote

in 2001

23/4/20 16

Page 17: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Property Tables

23/4/20 17

Page 18: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Vertically Partitioned Approach

23/4/20 18

Page 19: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

New Problem• Single node RDF management systems are

abundant– Sesame– Jena– RDF3X– 3store

• Research in clustered RDF management is less significantly explored: The focus of the talk

23/4/20 19

Page 20: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

SYSTEM ARCHITECTURE

23/4/20 20

Page 21: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Graph Partitioning• Hash vs. Graph partitioning

– Hash: Only efficient for star patterns– Graph: Taking advantage of graph model

23/4/20 21

Page 22: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Graph Partitioning• Edge vs. Vertex partitioning

– Edge: Natural but inefficient for query execution

– Vertex: Superior for common graph patterns

23/4/20 22

Page 23: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Vertex Partitioning• Preprocess

– remove triples whose predicate is rdf:type

• METIS partitioner

23/4/20 23

Page 24: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Triple Placement• Minimizing data shuffling/exchange

– Allowing data overlap

• N-hop guarantee– The extent of data overlap– If a vertex is assigned to a machine, any

vertex that is within n-hop of this vertex is also stored in this machine

23/4/20 24

Page 25: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

DIRECTED N-HOP GUARANTEE

23/4/20 25

Page 26: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

A potential problem• triples (s, p, o) and (o, p’, o’)

– 2-hop guarantee

• triples (s, p, o) and (s’, p’, o)– not guaranteed

• “object-connected” is not unusual

• undirected n-hop guarantee

23/4/20 26

Page 27: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Triple Placement Algorithm

23/4/20 27

Page 28: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Query Processing• Queries are executed in RDF-stores and/or

Hadoop• Query execution is more efficient in RDF-

stores than in Hadoop– Pushing as much of the processing as possible

into RDF-stores– Minimizing the number of Hadoop jobs– The larger the hop guarantee, the more work is

done in RDF-stores23/4/20 28

Page 29: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

To Communicate, or not to Communicate

• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?– Choose the “center” of the query graph– Calculate the distance from the “center” to the

furthest edge– If distance > n, communication is needed; not

needed otherwise

23/4/20 29

Page 30: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Determining whether a Query is PWOC

PWOC Query– parallelizable without communication

• DoFE– distance of farthest edge– the vertex in a graph with the smallest DoFE will be

the most central in a graph

23/4/20 30

Page 31: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

The algorithm

23/4/20 31

Page 32: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

the issue of duplicate results• naive approach

– remove duplicates after the query has completed

• owner-computes model– add triples (v, ‘<isOwned>’, ‘Yes’) to a

partition– For each query issued to the RDF-stores, add

an additional pattern (core, ‘<isOwned>’, ‘Yes’)

23/4/20 32

Page 33: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

A query is not PWOC• decompose the query into PWOC

subqueries

• use Hadoop jobs to join the results of the PWOC subqueries

• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases

23/4/20 33

Page 34: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

minimal number of subqueries• reduces to the problem of finding minimal

edge partitioning of a graph into subgraphs of bounded diameter

• brute-force

23/4/20 34

Page 35: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Examlple

23/4/20 35

DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1

the DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2,

Page 36: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Decompose Example

23/4/20 36

Page 37: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

EXPERIMENTS• 20-machine cluster

• Leigh University Benchmark (LUBM): 270 million triples

• Competitors:– Single-node RDF-3X– SHARD: triple-store system in Hadoop– Graph partitioning (the proposed system)– Hash partitioning on subjects

23/4/20 37

Page 38: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Data Load Time

23/4/20 38

Page 39: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Performance Comparison

23/4/20 39

Page 40: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Varying Number of Machines

23/4/20 40

Page 41: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

Summary

23/4/20 41

Page 42: Scalable SPARQL Querying of Large RDF Graphs

Graph Data Management Lab, School of Computer Science

GDM@FUDAGDM@FUDANN

http://gdm.fudan.edu.cnhttp://gdm.fudan.edu.cn

23/4/20 42