Scalable Web Data Management Using RDF

Web Data Management
Advanced Database Presentation

By: Navid Sedighpour
Professor: Dr. Alireza Bagheri
November 2015



Interest

Lack of schema: data is unstructured or at best “semi-structured” (missing data, additional attributes, similar but not identical data)

Volatility: data may conform to one schema now, but not later

Scale: how to capture everything?

Querying difficulty: What is the user language? What are the primitives? Aren’t search engines sufficient?

Introduction Naïve Triple Store Design

Property Tables Binary Tables Graph-Based Conclusion

More Recent Approaches to Web Querying

Fusion Tables: users contribute data in spreadsheets; joins are possible between multiple data sets; extensive visualization

More Recent Approaches to Web Querying: XML

Data exchange language

Tree-based structure

More Recent Approaches to Web Querying: RDF

W3C Recommendation

Simple, self-descriptive model

RDF Data Volumes

90% of the world's data was generated over the last two years

Data are growing fast

Size almost doubling every year


RDF Data Volumes

March 2009: 89 datasets

September 2010: 203 datasets

September 2011: 295 datasets

April 2014: 1091 datasets

RDF Introduction

Everything is a uniquely named resource

Prefixes can be used to shorten names

Properties of resources can be defined

Relationships with other resources can be defined

Resource descriptions can be contributed by different people or groups and can be located anywhere on the web

The result is an integrated web “database”
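As a small illustration of how prefixes shorten resource names, the sketch below expands a prefixed name into a full URI. The `rdf` and `foaf` namespace URIs are the standard ones; the `ex` prefix, the `PREFIXES` table and the `expand` helper are assumptions made up for this example.

```python
# Prefix table: "rdf" and "foaf" are well-known namespaces,
# "ex" is a hypothetical namespace used only for illustration.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/",
}

def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' into a full URI.

    Names whose prefix is not registered are returned unchanged.
    """
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name

full_uri = expand("foaf:name")
```

Here `full_uri` is the long form of the resource name, while `foaf:name` is the shortened form a person would normally write.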


RDF Data Model

Triple: Subject, Predicate (Property), Object

Subject: the entity that is described (URI or blank node)

Predicate: a feature of the entity

Object: the value of the feature

A set of RDF triples is called an “RDF graph”
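To make the data model concrete, here is a minimal sketch that represents triples as tuples and an RDF graph as a set of them. The `Triple` type, the `describe` helper and all sample data (`ex:film1` and so on) are hypothetical and not taken from the slides.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # the entity being described
    predicate: str  # a feature of the entity
    obj: str        # the value of the feature

# An RDF graph is simply a set of triples (made-up sample data).
rdf_graph = {
    Triple("ex:film1", "rdf:type", "ex:Film"),
    Triple("ex:film1", "ex:title", '"The Shining"'),
    Triple("ex:film1", "ex:directedBy", "ex:person1"),
    Triple("ex:person1", "ex:name", '"Stanley Kubrick"'),
}

def describe(graph, subject):
    """Collect {predicate: [objects]} for a single subject."""
    description = {}
    for triple in sorted(graph):
        if triple.subject == subject:
            description.setdefault(triple.predicate, []).append(triple.obj)
    return description

person = describe(rdf_graph, "ex:person1")
```

Because the object of one triple (`ex:person1`) is the subject of another, the set of triples can be read as a directed, labeled graph.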


RDF Example Instance


RDF Graph


SPARQL Queries


Naïve Triple Store Design

Easy to implement, but requires too many self-joins
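The self-join problem can be sketched with a single three-column table in SQLite. The table name `triples`, its column names and the sample rows are assumptions for illustration only; the point is that a query with four triple patterns needs four copies of the same table, that is, three self-joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive design: every triple is one row of a single table.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("film1", "type", "Film"),
        ("film1", "title", "The Shining"),
        ("film1", "directedBy", "person1"),
        ("person1", "name", "Stanley Kubrick"),
        ("film2", "type", "Film"),
        ("film2", "title", "Alien"),
        ("film2", "directedBy", "person2"),
        ("person2", "name", "Ridley Scott"),
    ],
)

# "Title and director name of every film": four triple patterns,
# so the triples table is joined with itself three times.
query = """
SELECT t2.o AS title, t4.o AS director
FROM triples t1
JOIN triples t2 ON t2.s = t1.s
JOIN triples t3 ON t3.s = t1.s
JOIN triples t4 ON t4.s = t3.o
WHERE t1.p = 'type' AND t1.o = 'Film'
  AND t2.p = 'title'
  AND t3.p = 'directedBy'
  AND t4.p = 'name'
ORDER BY title
"""
rows = conn.execute(query).fetchall()
```

Each additional triple pattern in a query adds one more copy of the table to the join, which is where the cost of this design comes from.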


Property Tables

Grouping by entities

Types: Clustered Property Tables, Property Class Tables


Clustered Property Tables

Group together the properties that tend to occur in the same (or similar) subjects


Property Class Tables

Cluster subjects of the same class (those sharing the same value of the type property) into one property table


Property Tables

Advantages: fewer joins

Disadvantages: lots of NULLs; clustering is not trivial; multi-valued properties are complicated
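The trade-off listed above can be sketched with one wide table per kind of entity. The `film` table, its columns, the `film_genre` side table and all rows are hypothetical; the side table is just one possible way to handle a multi-valued property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per subject, one column per clustered property.
conn.execute(
    "CREATE TABLE film (subject TEXT PRIMARY KEY, title TEXT, year INTEGER, director TEXT)"
)
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?, ?)",
    [
        ("film1", "The Shining", 1980, "person1"),
        ("film2", "Alien", None, "person2"),  # missing property -> NULL
        ("film3", "Untitled", None, None),    # sparse subject -> more NULLs
    ],
)
# A multi-valued property does not fit in a single column,
# so this sketch moves it to a separate two-column table.
conn.execute("CREATE TABLE film_genre (subject TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO film_genre VALUES (?, ?)",
    [("film1", "horror"), ("film1", "drama")],
)

# Advantage: several properties of one subject are read without any join.
rows = conn.execute(
    "SELECT title, year FROM film WHERE director = 'person1'"
).fetchall()

# Disadvantage: subjects lacking a property leave NULLs behind.
null_years = conn.execute(
    "SELECT COUNT(*) FROM film WHERE year IS NULL"
).fetchone()[0]
```

The same lookup in the naive triple table would need one self-join per extra property, whereas here it is a single-table scan.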


Binary Tables

Grouping by properties: for each property, build a two-column table containing both subject and object, ordered by subject

Also called the “vertically partitioned approach”

n two-column tables (n is the number of unique properties in the data)
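A minimal in-memory sketch of this partitioning, assuming triples arrive as `(subject, property, object)` tuples. The function name `vertically_partition` and the sample triples are made up for illustration; a real system would store each table in a column store or relational engine rather than in Python lists.

```python
from collections import defaultdict

# Hypothetical example triples (subject, property, object).
triples = [
    ("film2", "title", "Alien"),
    ("film1", "title", "The Shining"),
    ("film1", "genre", "horror"),
    ("film1", "genre", "drama"),
    ("film1", "directedBy", "person1"),
    ("person1", "name", "Stanley Kubrick"),
]

def vertically_partition(triples):
    """Build one (subject, object) table per property, each sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    for rows in tables.values():
        rows.sort()  # order by subject (ties broken by object)
    return dict(tables)

tables = vertically_partition(triples)
```

Note that the multi-valued `genre` property simply becomes two rows in its table, and no table contains a NULL.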


Binary Tables

Advantages: supports multi-valued properties; no NULLs; no clustering; good performance for subject-subject joins

Disadvantages: not useful for subject-object joins; expensive inserts
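Because every binary table is ordered by subject, two of them can be joined on subject with a single linear merge pass. The sketch below is a plain merge join written for illustration; the function name and the sample tables are assumptions. A join that matches one table's object column against another table's subject column cannot rely on this ordering in the same way.

```python
def merge_join_on_subject(left, right):
    """Join two (subject, object) lists that are both sorted by subject.

    Returns (subject, left_object, right_object) for every matching pair.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of rows with this subject on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    result.append((ls, left_obj, right_obj))
            i, j = i_end, j_end
    return result

# Hypothetical binary tables, each already sorted by subject.
title = [("film1", "The Shining"), ("film2", "Alien")]
genre = [("film1", "drama"), ("film1", "horror"), ("film3", "comedy")]

joined = merge_join_on_subject(title, genre)
```

Each input is scanned once, so no sorting or hashing is needed at query time for this kind of join.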


Graph-Based Approach

Answering a SPARQL query = subgraph matching

gStore
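To illustrate the idea that answering a query amounts to matching a small query graph against the data graph, here is a brute-force backtracking sketch. It is not how gStore works internally (gStore relies on signatures and index structures); the `match` function, the `?`-prefixed variable convention and the sample data are assumptions for this example.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings under which every pattern is a triple of the graph.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)

# Hypothetical data graph as a list of (subject, predicate, object) triples.
data = [
    ("film1", "directedBy", "person1"),
    ("person1", "name", "Stanley Kubrick"),
    ("film2", "directedBy", "person2"),
    ("person2", "name", "Ridley Scott"),
]

# Query graph: ?film --directedBy--> ?director --name--> ?name
query = [("?film", "directedBy", "?director"), ("?director", "name", "?name")]
answers = list(match(query, data))
```

The shared variable `?director` is what connects the two patterns into a graph; every answer is one embedding of that query graph in the data graph.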


Graph-Based Approach

Q* is the query graph and G* is the data graph, both encoded as signatures.

Two steps need to be done:
1. For each node of Q*, get the list of nodes in G* whose signatures include that node's signature
2. Do a multi-way join over these lists to get the candidate list

Alternatives:

Sequential scan of G*: both steps are inefficient

S-Tree: a height-balanced tree over signatures. Run an inclusion query for each node of Q* and get the list of nodes in G* that include that node (q & s = q)

VS-Tree: supports both steps efficiently; groups by vertices
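The inclusion test q & s = q and the pruning it enables can be sketched with a tiny hand-built tree. This is only an illustration under stated assumptions: the bit patterns are made up, the tree is assembled manually instead of being kept height-balanced by insertions, and the names `Node`, `build_parent` and `inclusion_query` are hypothetical. The one property relied on is that an inner node's signature is the bitwise OR of its children's signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    signature: int
    children: List["Node"] = field(default_factory=list)
    vertex: Optional[str] = None  # set on leaves only

def includes(s, q):
    """True if signature s includes query signature q (q & s == q)."""
    return q & s == q

def build_parent(children):
    """An inner node's signature is the OR of its children's signatures."""
    signature = 0
    for child in children:
        signature |= child.signature
    return Node(signature, children)

def inclusion_query(node, q):
    """Return the vertices whose signatures include q, pruning subtrees early."""
    if not includes(node.signature, q):
        return []  # no descendant can include q, so skip the whole subtree
    if node.vertex is not None:
        return [node.vertex]
    found = []
    for child in node.children:
        found.extend(inclusion_query(child, q))
    return found

# Made-up vertex signatures for the data graph.
group_a = build_parent([Node(0b1010, vertex="v1"), Node(0b1000, vertex="v2")])
group_b = build_parent([Node(0b0110, vertex="v3"), Node(0b0111, vertex="v4")])
root = build_parent([group_a, group_b])

candidates = inclusion_query(root, 0b0110)
```

For the query signature `0b0110`, the subtree `group_a` is discarded at its root because its combined signature does not include the query, so only `v3` and `v4` are inspected. This covers step 1 only; the multi-way join of step 2 is not shown.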


S-Tree

Pruning

VS-Tree

Conclusion

RDF data seem to hold considerable promise for web data management

We discussed four approaches to web data management: naïve triple store design, property tables, binary tables, and the graph-based approach

The VS-Tree has the best performance among the graph-based approaches

gStore is more efficient than the other approaches


References


[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, "Scalable semantic web data management using vertical partitioning," in Proceedings of the 33rd international conference on Very large data bases, 2007, pp. 411-422.

[2] L. Zou, J. Mo, L. Chen, M. T. Özsu, and D. Zhao, "gStore: answering SPARQL queries via subgraph matching," Proceedings of the VLDB Endowment, vol. 4, pp. 482-493, 2011.

[3] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao, "gStore: a graph-based SPARQL query engine," The VLDB Journal—The International Journal on Very Large Data Bases, vol. 23, pp. 565-590, 2014.

[4] X. Shen, L. Zou, M. T. Özsu, L. Chen, Y. Li, S. Han, et al., "A Graph-based RDF Triple Store."


Thanks

Any Questions???