tripleprov: efficient processing of lineage queries over a native rdf store

29
23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea TripleProv Efficient Processing of Lineage Queries over a Native RDF Store Marcin Wylot 1 , Philippe Cudré-Mauroux 1 , and Paul Groth 2 1) eXascale Infolab, University of Fribourg, Switzerland 2) Web & Madia Group, VU University Amsterdam, Netherlands

Upload: exascale-infolab

Post on 23-Aug-2014

863 views

Category:

Science


7 download

DESCRIPTION

Given the heterogeneity of the data one can find on the Linked Data cloud, being able to trace back the provenance of query results is rapidly becoming a must-have feature of RDF systems. While provenance models have been extensively discussed in recent years, little attention has been given to the efficient implementation of provenance-enabled queries inside data stores. This paper introduces TripleProv: a new system extending a native RDF store to efficiently handle such queries. TripleProv implements two different storage models to physically co-locate lineage and instance data, and for each of them implements algorithms for tracing provenance at two granularity levels. In the following, we present the overall architecture of our system, its different lineage storage models, and the various query execution strategies we have implemented to efficiently answer provenance-enabled queries. In addition, we present the results of a comprehensive empirical evaluation of our system over two different datasets and workloads.

TRANSCRIPT

Page 1: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea

TripleProvEfficient Processing of Lineage

Queries over a Native RDF Store

Marcin Wylot1, Philippe Cudré-Mauroux1, and Paul Groth2

1) eXascale Infolab, University of Fribourg, Switzerland 2) Web & Madia Group, VU University Amsterdam, Netherlands

Page 2: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Outline

➢ Motivation

➢ Provenance Polynomials

➢ System

➢ Results

Page 3: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Data Provenance

“Provenance is information about

entities, activities, and people involved

in producing a piece of data or thing, which can be used to form

assessments about its quality, reliability or trustworthiness.”

How a query answer was derived: what data was

combined to produce the result.

Page 4: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Data Integration

➢ Integrated and summarized data

➢ Trust, transparency, and cost

➢ Capability to pinpoint the exact source from which the result was selected

➢ Capability to trace back the complete list of sources and how they were combined to deliver a result

Page 5: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Querying Distributed Data SourcesHow exactly was the answer derived?

Page 6: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Application: Post-query Calculations

➢ Scores or probabilities for query result

➢ Result ranking

➢ Compute trust

➢ Information quality based on used sources

Page 7: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Application: Query Execution

➢ Modify query strategies on the fly

➢ Restrict results to certain subset of sources

➢ Restrict results w.r.t. queries over provenance

➢ Access control, only certain sources will appear

➢ Detect if result would be valid when removing certain

source

Page 8: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Provenance Polynomials

➢ Ability to characterize ways each source contributed

➢ Pinpoint the exact source to each result

➢ Trace back the list of sources the way they were combined

to deliver a result

Page 9: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Graph-based Query

select ?lat ?long ?g1 ?g2 ?g3 ?g4where {

graph ?g1 {?a [] "Eiffel Tower" . } graph ?g2 {?a inCountry FR . } graph ?g3 {?a lat ?lat . } graph ?g4 {?a long ?long . }

}

lat long l1 l2 l4 l4, lat long l1 l2 l4 l5,lat long l1 l2 l5 l4, lat long l1 l2 l5 l5,lat long l1 l3 l4 l4, lat long l1 l3 l4 l5,lat long l1 l3 l5 l4,lat long l1 l3 l5 l5,

lat long l2 l2 l4 l4, lat long l2 l2 l4 l5,lat long l2 l2 l5 l4, lat long l2 l2 l5 l5,lat long l2 l3 l4 l4, lat long l2 l3 l4 l5,lat long l2 l3 l5 l4, lat long l2 l3 l5 l5,

lat long l3 l2 l4 l4, lat long l3 l2 l4 l5,lat long l3 l2 l5 l4, lat long l3 l2 l5 l5,lat long l3 l3 l4 l4, lat long l3 l3 l4 l5,lat long l3 l3 l5 l4,lat long l3 l3 l5 l5,

Page 10: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

TripleProv Resuls

result: lat, long

provenance polynomial:(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ ( l6 ⊕ l7) ⊗ (l8 ⊕ l9)

Page 11: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Polynomials Operators

➢ Union (⊕)

○ constraint or projection satisfied with multiple sources

l1 ⊕ l2 ⊕ l3

○ multiple entities satisfy a set of constraints or projections

➢ Join (⊗)

○ sources joined to handle a constraint or a projection

○ OS and OO joins between few sets of constraints

(l1 ⊕ l2) ⊗ (l3 ⊕ l4)

Page 12: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Example Polynomial

select ?lat ?long where { ?a [] ``Eiffel Tower''.?a inCountry FR .?a lat ?lat .?a long ?long .

}

(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ ( l6 ⊕ l7) ⊗ (l8 ⊕ l9)

Page 13: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Example Polynomial

select ?l ?long ?lat where {

?p name ``Krebs, Emil'' .

?p deathPlace ?l .

?c [] ?l .

?c featureClass P .

?c inCountry DE .

?c long ?long .

?c lat ?lat .

}

[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5)] ⊗

[( l6 ⊕ l7) ⊗ (l8) ⊗ (l9 ⊕ l10) ⊗ (l11 ⊕ l12) ⊗ (l13)]

Page 14: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Granularity Levels

➢ source-level: sources of a triples

➢ triple-level: all pieces of data used to answer the query

(l1 ⊕ l2) ⊗ (l3 ⊕ l4)

Page 15: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

System Architecture

Page 16: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Native Data Model

➢ Semantically co-located data

➢ Template based molecules

Page 17: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Various Physical Storage Models

Differences:➢ ease of implementation➢ memory consumption➢ query execution➢ interference with the original concept of molecule

1) SPOL 2) LSPO 3) SLPO 4) SPLO

Page 18: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Annotated Triples

➢ Annotated provenance

➢ Quadruples

➢ Easy to implement

➢ Source data repeated

for each triple

Page 19: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Co-located Elements

➢ Data grouped by source

➢ Physically co-located

➢ Avoids duplication of the

same source inside a

molecule

➢ Data about a given subject

co-located in one molecule

➢ More difficult to implement

Page 20: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Experiments

How expensive it is to trace

provenance?

What is the overhead on query

execution time?

Page 21: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Datasets

➢ Two collections of RDF data gathered from the Web

○ Billion Triple Challenge (BTC): Crawled from the linked

open data cloud

○ Web Data Commons (WDC): RDFa, Microdata

extracted from common crawl

➢ Typical collections gathered from multiple sources

➢ sampled subsets of ~110 million triples each; ~25GB each

Page 22: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Workloads

➢ 8 Queries defined for BTC○ T. Neumann and G. Weikum. Scalable join processing on very large rdf

graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 627–640. ACM, 2009.

➢ Two additional queries with UNION and OPTIONAL

clauses

➢ 7 various new queries for WDC

http://exascale.info/tripleprov

Page 23: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Results

Overhead of tracking provenance compared to

vanilla version of the system for BTC dataset

source-level co-located

source-level annotated

triple-level co-located

triple-level annotated

Page 24: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Conclusions

➢ provenance overhead is considerable but acceptable,

on average about 60-70%

➢ most suitable storage model depends upon data and

workloads characteristics

➢ annotated: more appropriate for heterogenous datasets

and workloads retrieving provenance

➢ co-located: more appropriate for homogenous datasets

and workload filtering by source

Page 25: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Future Work

➢ Distributed version

➢ Dynamic storage model

➢ Adaptive query execution strategies

➢ PROV output

➢ Over provenance queries

Page 26: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Summary

➢ TripleProv: an efficient triplestore tracking provenance

➢ Two storage models

➢ Fine-grained multilevel provenance tracing

➢ Formal provenance polynomials

➢ Experimental evaluation

http://exascale.info/tripleprov

Page 27: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Loading & Memory

Billion Triple Challenge

Web Data Commons

Page 28: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Results

Overhead of tracking provenance compared to

vanilla version of the system for WDC dataset

source-level SLPO

source-level SPOL

triple-level SLPO

triple-level SPOL

Page 29: TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Polynomials: multiple records

[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ ( l6 ⊕ l7) ⊗ (l8 ⊕ l9)]

[(l5 ⊕ l7) ⊗ (l4) ⊗ ( l13 ⊕ l17) ⊗ (l28)]

[(l4) ⊗ (l1 ⊕ l2) ⊗ ( l3 ⊕ l7) ⊗ (l8 ⊕ l9⊕ l4)]