
Graph Data Management Lab, School of Computer Science

GDM@FUDAN

http://gdm.fudan.edu.cn

Scalable SPARQL Querying of Large RDF Graphs

Xu Bo

2012.06.11

In PVLDB, 4(21), 2011


Outline

• About Presenter

• Semantic Web

• Previous Work

• New Problem

• SYSTEM ARCHITECTURE

• EXPERIMENTS

• CONCLUSIONS AND FUTURE WORK


About Presenter

• Daniel Abadi

• Associate Professor of Computer Science at Yale University

• Research

– Column-Oriented Database Systems

– Petascale Parallel Database Systems (HadoopDB)

– Semantic Web Data Management


Semantic Web

• The vision of Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web


Google Knowledge Graph


Key Technology

• HTML

• XML


The Disadvantage of XML

• Example sentence: David Billington is a lecturer of Discrete Mathematics.

• there is no standard way of assigning meaning to tag nesting
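To make the nesting point concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The two encodings are illustrative assumptions (the element names `course`, `lecturer` and `teaches` do not appear on the slide). Both express the example sentence, yet a program needs a different access path for each, because nothing in XML says what the nesting means.

```python
import xml.etree.ElementTree as ET

# Two hypothetical XML encodings of the same sentence (element names are assumed).
DOC_A = '<course name="Discrete Mathematics"><lecturer>David Billington</lecturer></course>'
DOC_B = '<lecturer name="David Billington"><teaches>Discrete Mathematics</teaches></lecturer>'


def lecturer_of_a(xml_text: str) -> str:
    """Read the lecturer when <lecturer> is nested inside <course>."""
    root = ET.fromstring(xml_text)
    return root.find("lecturer").text


def lecturer_of_b(xml_text: str) -> str:
    """Read the lecturer when the course is nested inside <lecturer>."""
    root = ET.fromstring(xml_text)
    return root.get("name")
```

Code written for the first layout finds no `lecturer` child in the second one, even though both documents carry the same fact.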


The Disadvantage of XPath

• Suppose we want to collect all academic staff members. A path expression in XPath might be //academicStaffMember

• XML is semantically unsatisfactory
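A minimal sketch of why the path expression falls short, assuming a hypothetical staff document in which professors and lecturers have their own tags. `ElementTree` supports only a subset of XPath, but `.//tag` is enough here. The tag names other than `academicStaffMember` and the staff names other than David Billington are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical staff list; only <academicStaffMember> comes from the slide.
STAFF = """
<staff>
  <academicStaffMember>A. Smith</academicStaffMember>
  <professor>B. Jones</professor>
  <lecturer>David Billington</lecturer>
</staff>
"""

root = ET.fromstring(STAFF)
# ".//academicStaffMember" matches elements by tag name only.
found = [e.text for e in root.findall(".//academicStaffMember")]
```

The query returns a single name. A human knows that professors and lecturers are academic staff too, but the XML document has no way to state that relationship.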


RDF

• Resource Description Framework

• Identifies things using Web identifiers (Uniform Resource Identifiers, or URIs) and describes resources with simple properties and property values


RDF as Triples and a Graph
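The slide's figure is not preserved in this transcript. As a rough stand-in, the sketch below shows the same idea in Python: a set of (subject, predicate, object) triples is also a directed, edge-labelled graph. The `ex:` identifiers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); identifiers are invented.
TRIPLES = [
    ("ex:player1", "rdf:type", "ex:footballer"),
    ("ex:player1", "ex:playsFor", "ex:FC_Barcelona"),
    ("ex:FC_Barcelona", "ex:region", "ex:Barcelona"),
]


def to_graph(triples):
    """Build an adjacency list: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)


graph = to_graph(TRIPLES)
```

Each triple becomes one edge from its subject to its object, labelled with the predicate.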


SPARQL

• RDF query language

• A basic graph pattern

• Answering a SPARQL query can be seen as finding subgraphs in the RDF data that match the graph pattern
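The subgraph-matching view can be sketched in a few lines of Python. This is a naive matcher written for illustration, not how an RDF store evaluates queries. Terms starting with `?` are treated as variables, and the data and the star-shaped query (strikers who play for FC Barcelona) use invented identifiers.

```python
def is_var(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against its existing value.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def match_bgp(patterns, triples):
    """Return every variable binding under which all patterns match some triple."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


TRIPLES = [
    ("ex:p1", "ex:position", "striker"),
    ("ex:p1", "ex:playsFor", "ex:FC_Barcelona"),
    ("ex:p1", "ex:name", "Player One"),
    ("ex:p2", "ex:position", "goalkeeper"),
    ("ex:p2", "ex:playsFor", "ex:FC_Barcelona"),
    ("ex:p2", "ex:name", "Player Two"),
]
QUERY = [
    ("?player", "ex:position", "striker"),
    ("?player", "ex:playsFor", "ex:FC_Barcelona"),
    ("?player", "ex:name", "?name"),
]
names = [b["?name"] for b in match_bgp(QUERY, TRIPLES)]
```

All three patterns share the variable `?player`, so the query graph is a star centred on it.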


Example for Star Pattern

• Find the names of the strikers that play for FC Barcelona.


Another Example

• Find football players playing for clubs in a populous region where they were born.


Previous Work

• RDF In RDBMSs

• Property Tables

• Vertically Partitioned Approach


RDF In RDBMSs

• Get the title of the book(s) Joe Fox wrote in 2001
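A sketch of how such a query might look when all triples sit in one three-column table, using Python's built-in `sqlite3`. The table name, column names, predicate names and sample rows are assumptions made for this example. The point is that every triple pattern becomes another alias of the same table, so the query is a chain of self-joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("book1", "author", "Joe Fox"),
        ("book1", "title", "Title A"),
        ("book1", "copyright", "2001"),
        ("book2", "author", "Joe Fox"),
        ("book2", "title", "Title B"),
        ("book2", "copyright", "1999"),
    ],
)

# One alias per triple pattern; the shared subject turns into self-joins.
SQL = """
SELECT t.o
FROM triples AS a
JOIN triples AS c ON c.s = a.s
JOIN triples AS t ON t.s = a.s
WHERE a.p = 'author'    AND a.o = 'Joe Fox'
  AND c.p = 'copyright' AND c.o = '2001'
  AND t.p = 'title'
"""
titles = [row[0] for row in conn.execute(SQL)]
```

Three patterns already need two self-joins over the full triple table, which is the cost the later storage schemes try to reduce.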


Property Tables


Vertically Partitioned Approach
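The slide's figure is missing from the transcript. As a hedged illustration of the idea, the sketch below splits a triple list into one two-column (subject, object) table per predicate. The sample data and predicate names are invented, and plain Python lists stand in for database tables.

```python
from collections import defaultdict

# Hypothetical triples; predicate names are assumptions for this example.
TRIPLES = [
    ("book1", "author", "Joe Fox"),
    ("book1", "title", "Title A"),
    ("book1", "copyright", "2001"),
    ("book2", "author", "Joe Fox"),
    ("book2", "title", "Title B"),
    ("book2", "copyright", "1999"),
]


def vertically_partition(triples):
    """Group triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Sorting each table by subject is what would allow merge joins on subjects.
    return {p: sorted(rows) for p, rows in tables.items()}


tables = vertically_partition(TRIPLES)
authors = {s for s, o in tables["author"] if o == "Joe Fox"}
year_2001 = {s for s, o in tables["copyright"] if o == "2001"}
titles = [o for s, o in tables["title"] if s in authors & year_2001]
```

A query now touches only the tables for the predicates it mentions instead of scanning one large triple table.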


New Problem

• Single-node RDF management systems are abundant

– Sesame

– Jena

– RDF-3X

– 3store

• Clustered RDF management is much less explored: the focus of the talk


SYSTEM ARCHITECTURE


Graph Partitioning

• Hash vs. Graph partitioning

– Hash: Only efficient for star patterns

– Graph: Taking advantage of graph model
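A small sketch of subject-based hash partitioning, to show why it suits star patterns. The function name, the use of `zlib.crc32` as the hash, and the sample triples are assumptions for this example. All triples with the same subject land on the same machine, so a star centred on a subject can be answered locally, while a path that continues through an object may cross machines.

```python
import zlib


def hash_partition(triples, num_machines):
    """Assign each triple to a machine by hashing its subject."""
    parts = [[] for _ in range(num_machines)]
    for s, p, o in triples:
        # crc32 rather than hash() so the placement is stable across runs.
        parts[zlib.crc32(s.encode("utf-8")) % num_machines].append((s, p, o))
    return parts


# Hypothetical data: a star around ex:p1 plus one triple that continues the path.
TRIPLES = [
    ("ex:p1", "ex:position", "striker"),
    ("ex:p1", "ex:playsFor", "ex:club1"),
    ("ex:p1", "ex:name", "Player One"),
    ("ex:club1", "ex:region", "ex:region1"),
]
parts = hash_partition(TRIPLES, 4)

# Record which machines hold triples for each subject.
machines_of = {}
for i, part in enumerate(parts):
    for s, _, _ in part:
        machines_of.setdefault(s, set()).add(i)
```

Every subject maps to exactly one machine, but nothing places `ex:club1` on the same machine as `ex:p1`, so following `ex:playsFor` and then `ex:region` may need data exchange.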


Graph Partitioning

• Edge vs. Vertex partitioning

– Edge: Natural but inefficient for query execution

– Vertex: Superior for common graph patterns


Vertex Partitioning

• Preprocess

– remove triples whose predicate is rdf:type

• METIS partitioner


Triple Placement

• Minimizing data shuffling/exchange

– Allowing data overlap

• N-hop guarantee

– The extent of data overlap

– If a vertex is assigned to a machine, any vertex that is within n hops of this vertex is also stored on this machine


DIRECTED N-HOP GUARANTEE


A potential problem

• triples (s, p, o) and (o, p’, o’)

– covered by the directed 2-hop guarantee

• triples (s, p, o) and (s’, p’, o)

– not guaranteed to be on the same machine

• “object-connected” patterns are not unusual

• Solution: undirected n-hop guarantee
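The difference between the two guarantees can be sketched as follows. This is an illustrative reading of the slides, not the placement algorithm from the paper: the function name, the `undirected` flag and the tiny data set are assumptions. Starting from the vertices a partition owns, each round adds the triples that touch an already reached vertex, either through the subject only (directed) or through subject or object (undirected).

```python
def place_triples(triples, core, n, undirected=False):
    """Return the triples stored with a partition that owns the vertices in `core`.

    Directed: each round adds triples whose subject is already reached.
    Undirected: each round also adds triples whose object is already reached.
    """
    reached = set(core)
    placed = set()
    for _ in range(n):
        frontier = set()
        for s, p, o in triples:
            if s in reached or (undirected and o in reached):
                placed.add((s, p, o))
                frontier.update((s, o))
        # Extend the reached set only after the round, so each round is one hop.
        reached |= frontier
    return placed


# (a, p, x) and (x, r, y) form a path; (a, p, x) and (b, q, x) share an object.
T = [("a", "p", "x"), ("x", "r", "y"), ("b", "q", "x")]
directed = place_triples(T, {"a"}, 2)
undirected = place_triples(T, {"a"}, 2, undirected=True)
```

With a 2-hop guarantee around `a`, the directed variant stores the path but misses the object-connected triple `(b, q, x)`, while the undirected variant stores all three.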


Triple Placement Algorithm


Query Processing

• Queries are executed in RDF-stores and/or Hadoop

• Query execution is more efficient in RDF-stores than in Hadoop

– Pushing as much of the processing as possible into RDF-stores

– Minimizing the number of Hadoop jobs

– The larger the hop guarantee, the more work is done in RDF-stores


To Communicate, or not to Communicate

• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?

– Choose the “center” of the query graph

– Calculate the distance from the “center” to the furthest edge

– If distance > n, communication is needed; not needed otherwise


Determining whether a Query is PWOC

• PWOC query

– parallelizable without communication

• DoFE

– distance of farthest edge

– the vertex in a graph with the smallest DoFE will be the most central in a graph
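A minimal sketch of the DoFE computation and the PWOC test, under stated assumptions: the query graph is connected, edges are treated as undirected (matching an undirected n-hop guarantee), and the distance from a vertex to an edge is taken to be one more than the distance to the edge's nearer endpoint. The query shape in `QUERY_EDGES` is an assumed reconstruction, since the slide figures did not survive extraction.

```python
from collections import deque


def distances_from(start, edges):
    """Breadth-first distances from `start`, treating edges as undirected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def dofe(vertex, edges):
    """Distance of farthest edge: an edge is 1 + the distance to its nearer endpoint away."""
    dist = distances_from(vertex, edges)
    return max(min(dist[u], dist[v]) + 1 for u, v in edges)


def is_pwoc(edges, n):
    """PWOC under an n-hop guarantee if the most central vertex has DoFE <= n."""
    vertices = {x for e in edges for x in e}
    return min(dofe(v, edges) for v in vertices) <= n


# Assumed query shape: players, their club, birth region and its population.
QUERY_EDGES = [
    ("?player", "footballer"),
    ("?player", "?club"),
    ("?player", "?region"),
    ("?club", "?region"),
    ("?region", "?pop"),
]
```

For this shape the smallest DoFE is 2, so the query would run without communication under a 2-hop guarantee but not under a 1-hop guarantee.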


The algorithm


The issue of duplicate results

• naive approach

– remove duplicates after the query has completed

• owner-computes model

– add a triple (v, ‘<isOwned>’, ‘Yes’) to a partition for every vertex v that the partition owns

– For each query issued to the RDF-stores, add an additional pattern (core, ‘<isOwned>’, ‘Yes’)
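The owner-computes idea can be sketched as a filter on each partition's local results. The helper name, the variable name `?core` and the two tiny partitions are assumptions for this example. In a real deployment the extra pattern would be evaluated inside the RDF-store rather than in Python.

```python
def run_on_partition(partition_triples, core_var, bindings):
    """Keep only bindings whose core vertex is marked as owned in this partition."""
    owned = {s for s, p, o in partition_triples if p == "<isOwned>" and o == "Yes"}
    return [b for b in bindings if b[core_var] in owned]


# Both partitions store the replicated triple (x, p, y), but only P0 owns x.
P0 = [("x", "p", "y"), ("x", "<isOwned>", "Yes")]
P1 = [("x", "p", "y"), ("y", "<isOwned>", "Yes")]

# Without the ownership pattern, each partition would report this same match.
local_matches = [{"?core": "x", "?o": "y"}]
results = run_on_partition(P0, "?core", local_matches) + run_on_partition(P1, "?core", local_matches)
```

Only the owning partition returns the match, so the union of all partitions' results contains it once and no deduplication pass is needed afterwards.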


A query is not PWOC

• decompose the query into PWOC subqueries

• use Hadoop jobs to join the results of the PWOC subqueries

• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases


Minimal number of subqueries

• reduces to the problem of finding minimal edge partitioning of a graph into subgraphs of bounded diameter

• brute-force
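A brute-force search of this kind might look like the sketch below. It is an illustration under assumptions, not the paper's implementation: every way of splitting the query edges into blocks is enumerated, a block counts as PWOC if it is connected and its smallest DoFE is at most n (undirected distances, edge distance taken as 1 plus the distance to the nearer endpoint), and the split with the fewest blocks wins. The query shape is an assumed example. This is only practical because query graphs have few edges.

```python
import math
from collections import deque


def min_dofe(edges):
    """Smallest DoFE over all vertices of an edge set; infinity if disconnected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best = math.inf
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < len(adj):  # some vertex is unreachable: block is disconnected
            return math.inf
        best = min(best, max(min(dist[u], dist[v]) + 1 for u, v in edges))
    return best


def set_partitions(items):
    """Yield every way to split `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partition


def decompose(edges, n):
    """Return a split with the fewest blocks in which every block is PWOC (n >= 1)."""
    valid = (p for p in set_partitions(list(edges)) if all(min_dofe(b) <= n for b in p))
    return min(valid, key=len)


# Assumed query shape, used only to exercise the search.
QUERY_EDGES = [
    ("?player", "footballer"),
    ("?player", "?club"),
    ("?player", "?region"),
    ("?club", "?region"),
    ("?region", "?pop"),
]
```

Under a 1-hop guarantee this query needs two subqueries (for example a star around `?player` and a star around `?region`); under a 2-hop guarantee one block suffices.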


Example

• The DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1.

• The DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2.


Decompose Example


EXPERIMENTS

• 20-machine cluster

• Lehigh University Benchmark (LUBM): 270 million triples

• Systems compared:

– Single-node RDF-3X

– SHARD: triple-store system in Hadoop

– Graph partitioning (the proposed system)

– Hash partitioning on subjects


Data Load Time


Performance Comparison


Varying Number of Machines


Summary
