Workload Matters: Why RDF Databases Need a New Design
Güneş Aluç, M. Tamer Özsu, Khuzaima Daudjee
Post on 18-Dec-2015
Outline
• Why do RDF data management systems need a new design?
• How do we envision RDF data management systems to be re-designed?
A Running Example
Consider the following SPARQL query:
  Tamer    hasPost  ?post
  ?person  ???      ?post
  ?person  worksAt  UWaterloo
where ??? is one of likes, taggedIn, retweets, favorites, etc.
Single Table Layout
P        S      O
…        …      …
hasPost  …      …
hasPost  Tamer  Post2
hasPost  Tamer  Post23
hasPost  Tamer  Post235
hasPost  Tamer  Post2357
hasPost  Tamer  Post23571
hasPost  …      …
…        …      …

O          S      P
…          …      …
Post1      Gunes  …
Post2      Alice  …
Post23     Alice  …
Post24     Ken    …
Post234    Olaf   …
Post235    Bob    …
Post2357   Bob    …
Post2358   Gunes  …
Post23570  Ken    …
Post23571  Olaf   …
…          …      …

P        O   S
…        …   …
worksAt  …   …
worksAt  UW  Gunes
worksAt  UW  Ken
worksAt  UW  Olaf
worksAt  …   …
…        …   …
Triple pattern: Tamer hasPost ?post
Triple pattern: ?person worksAt UWaterloo
Triple pattern: ?person ??? ?post
(1) Many irrelevant intermediate result tuples
(2) These tuples are fragmented across the OSP table
(3) Indexes are not very useful in locating the relevant tuples
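The three problems above can be illustrated with a small sketch. It uses only the rows visible in the tables on this slide (the elided "…" rows are left out), and the list names PSO, OSP and POS as well as the function evaluate are illustrative choices, not part of any real system. It is a toy walk-through of one possible evaluation order, not how a production engine plans the query.

```python
# Rows visible on the slide; elided ("…") rows are left out.
# PSO fragment: (predicate, subject, object)
PSO = [("hasPost", "Tamer", p)
       for p in ["Post2", "Post23", "Post235", "Post2357", "Post23571"]]
# OSP fragment: (object, subject); the predicate column is elided on the slide.
OSP = [("Post1", "Gunes"), ("Post2", "Alice"), ("Post23", "Alice"),
       ("Post24", "Ken"), ("Post234", "Olaf"), ("Post235", "Bob"),
       ("Post2357", "Bob"), ("Post2358", "Gunes"), ("Post23570", "Ken"),
       ("Post23571", "Olaf")]
# POS fragment: (predicate, object, subject)
POS = [("worksAt", "UW", s) for s in ["Gunes", "Ken", "Olaf"]]


def evaluate():
    # Tamer hasPost ?post
    posts = {o for (p, s, o) in PSO if p == "hasPost" and s == "Tamer"}
    # ?person worksAt UWaterloo
    staff = {s for (p, o, s) in POS if p == "worksAt" and o == "UW"}
    # ?person ??? ?post: the predicate is unbound, so OSP is probed per post.
    positions = [i for i, (post, _) in enumerate(OSP) if post in posts]
    intermediate = [OSP[i] for i in positions]
    # Only now can tuples whose person does not work at UW be discarded.
    results = [(post, person) for (post, person) in intermediate
               if person in staff]
    return positions, intermediate, results


positions, intermediate, results = evaluate()
print("rows touched in OSP:", positions)
print(len(intermediate), "intermediate tuples,", len(results), "result(s)")
```

On these rows, five intermediate tuples are produced, they sit at scattered positions in the OSP table, and only one of them survives the join with worksAt.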
Group-by-Predicates
favorites
S      O
…      …
Bob    Post235
Gunes  Post1
Olaf   Post234
…      …

likes
S      O
…      …
Alice  Post23
Ken    Post24
…      …

retweets
S      O
…      …
Gunes  Post2358
Ken    Post23570
…      …

taggedIn
S      O
…      …
Alice  Post2
Bob    Post2357
Olaf   Post23571
…      …
Triple pattern: ?person ??? ?post
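One plausible reading of this slide is that, with one table per predicate, a triple pattern whose predicate is unbound cannot be routed to a single table. The sketch below illustrates that reading with the rows visible above. The dictionary PREDICATE_TABLES and the function match_unbound_predicate are hypothetical names, and the candidate posts are taken from the hasPost rows of the Single Table Layout slide.

```python
# One (subject, object) table per predicate, with the rows visible on the slide.
PREDICATE_TABLES = {
    "favorites": [("Bob", "Post235"), ("Gunes", "Post1"), ("Olaf", "Post234")],
    "likes": [("Alice", "Post23"), ("Ken", "Post24")],
    "retweets": [("Gunes", "Post2358"), ("Ken", "Post23570")],
    "taggedIn": [("Alice", "Post2"), ("Bob", "Post2357"),
                 ("Olaf", "Post23571")],
}


def match_unbound_predicate(tables, posts):
    """Evaluate ?person ??? ?post for a set of candidate posts."""
    tables_scanned = 0
    matches = []
    for predicate, rows in tables.items():
        tables_scanned += 1  # with ??? unbound, no table can be skipped
        for person, post in rows:
            if post in posts:
                matches.append((person, predicate, post))
    return tables_scanned, matches


# Tamer's posts, as listed under hasPost on the Single Table Layout slide.
tamer_posts = {"Post2", "Post23", "Post235", "Post2357", "Post23571"}
scanned, matches = match_unbound_predicate(PREDICATE_TABLES, tamer_posts)
print(scanned, "tables scanned,", len(matches), "matching triples")
```

Every predicate table is visited, including retweets, which contributes no match for these posts.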
Group-by-Entities
Post2, Post23, Post24, Post2357, Post23571
Post1, Post234, Post235, Post2358, Post23570
…

        likes  taggedIn  retweets  favorites
Alice   X      X
Bob            X                   X
Gunes                    X         X
Ken     X                X
Olaf           X                   X
…

FacebookEntities: likes, taggedIn
TwitterEntities: retweets, favorites
Triple pattern: ?person ??? ?post
Group-by-Vertices
…
Post1 ← favorites Gunes
Post2 ← taggedIn Alice
Post23 ← likes Alice
Post24 ← likes Ken
Post234 ← favorites Olaf
Post235 ← favorites Bob
Post2357 ← taggedIn Bob
Post2358 ← retweets Gunes
Post23570 ← retweets Ken
Post23571 ← taggedIn Olaf
…
Triple pattern: ?person ??? ?post
Does The Winner Take It All?
• With a single query, we were able to conceptually show problems with existing solutions
• SPARQL workloads that RDF data management systems support
  – contain a very diverse selection of queries
  – and this selection of queries changes dynamically
[Figure: mean query execution time (milliseconds, log scale from 1 to 100000) versus percentage of test query templates, for RDF-3x and the fastest system]
G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. International Semantic Web Conference, 2014. Forthcoming.
Group-by-Query
RDF Physical Design
• Fixed
  – Single Table Layout
  – Group-by-Predicates
  – Group-by-Entities
  – Group-by-Vertices
• Workload-Driven
  – Group-by-Query
Outline
• Why do RDF data management systems need a new design?
• How do we envision RDF data management systems to be re-designed?
  – Group-by-Query Representation
  – Partial Tuning
Group-by-Query
Tamer hasPost → Post23571 ← taggedIn Olaf worksAt → UWaterloo

Tamer hasPost → Post2357 ← taggedIn Bob worksAt → UWaterloo
Tamer hasPost → Post235 ← favorites Bob

Tamer hasPost → Post23 ← likes Bob worksAt → UWaterloo
Tamer hasPost → Post2 ← taggedIn Bob
Group-by-Query (Advantages)
• Triples relevant to the evaluation of a query are physically clustered
• Indexes are more efficient in localizing query evaluation to only the relevant triples
• Fewer intermediate result tuples are generated
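A minimal sketch of the clustering idea, assuming the running-example query and the rows visible in the earlier table slides (not the groups drawn on the Group-by-Query slide). The names TRIPLES, find_matches and cluster_by_query are hypothetical, and a Python list stands in for the physical storage order.

```python
# (subject, predicate, object) rows visible in the earlier table slides.
TRIPLES = [
    ("Tamer", "hasPost", "Post2"), ("Tamer", "hasPost", "Post23"),
    ("Tamer", "hasPost", "Post235"), ("Tamer", "hasPost", "Post2357"),
    ("Tamer", "hasPost", "Post23571"),
    ("Bob", "favorites", "Post235"), ("Gunes", "favorites", "Post1"),
    ("Olaf", "favorites", "Post234"),
    ("Alice", "likes", "Post23"), ("Ken", "likes", "Post24"),
    ("Gunes", "retweets", "Post2358"), ("Ken", "retweets", "Post23570"),
    ("Alice", "taggedIn", "Post2"), ("Bob", "taggedIn", "Post2357"),
    ("Olaf", "taggedIn", "Post23571"),
    ("Gunes", "worksAt", "UW"), ("Ken", "worksAt", "UW"),
    ("Olaf", "worksAt", "UW"),
]


def find_matches(triples):
    """Return, per query match, the three triples that produced it."""
    posts = [o for (s, p, o) in triples if s == "Tamer" and p == "hasPost"]
    staff = {s for (s, p, o) in triples if p == "worksAt" and o == "UW"}
    found = []
    for post in posts:
        for (s, p, o) in triples:
            if o == post and s in staff:  # ?person ??? ?post
                found.append([("Tamer", "hasPost", post), (s, p, o),
                              (s, "worksAt", "UW")])
    return found


def cluster_by_query(triples):
    """Place the triples of each match contiguously, ahead of the rest."""
    hot = []
    for match in find_matches(triples):
        for t in match:
            if t not in hot:
                hot.append(t)
    cold = [t for t in triples if t not in hot]
    return hot + cold, len(hot)


layout, hot_count = cluster_by_query(TRIPLES)
print("first", hot_count, "slots:", layout[:hot_count])
```

After clustering, the query can be answered from the first few slots of the layout instead of probing rows scattered across the whole table.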
Proposal #1: Updating Physical Storage Layout
Initially, triples are not clustered in the storage system for any particular workload
As queries are executed (that is, as triples flow through the cache), there is an opportunity to cluster (hot) triples that are co-accessed within the same query or across multiple queries
Assume a hash function (oracle) decides on a good placement of triples and that the hash function is capable of adapting to changing workloads
Then, one of the challenges is to develop this hash function
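To make the role of such a placement function concrete, here is a deliberately naive stand-in. The slides leave the design of the real function open, so everything in this sketch is an assumption: the class name AdaptivePlacement, the split into "hot" and "cold" pages, and the rule that triples touched by the same query share a page. Page capacity and eviction of triples that have gone cold are not handled.

```python
import zlib


class AdaptivePlacement:
    """Toy placement function: static hashing until a triple is observed,
    then co-location with the triples it was accessed together with."""

    def __init__(self, num_cold_pages=4):
        self.num_cold_pages = num_cold_pages
        self.hot_page = {}       # triple -> page id learned from the workload
        self.next_hot_page = 0

    def page_of(self, triple):
        if triple in self.hot_page:
            return ("hot", self.hot_page[triple])
        # Workload-oblivious default: a deterministic hash of the triple.
        key = "|".join(triple).encode("utf-8")
        return ("cold", zlib.crc32(key) % self.num_cold_pages)

    def observe(self, accessed):
        """Record the triples one query touched and co-locate them."""
        known = [self.hot_page[t] for t in accessed if t in self.hot_page]
        if known:
            page = known[0]      # join a page already used by a co-accessed triple
        else:
            page = self.next_hot_page
            self.next_hot_page += 1
        for t in accessed:
            self.hot_page[t] = page


placement = AdaptivePlacement()
q1 = [("Tamer", "hasPost", "Post23571"),
      ("Olaf", "taggedIn", "Post23571"),
      ("Olaf", "worksAt", "UW")]
placement.observe(q1)
# A later query that shares one triple with q1 is drawn onto the same page.
q2 = [("Olaf", "worksAt", "UW"), ("Olaf", "favorites", "Post234")]
placement.observe(q2)
print([placement.page_of(t) for t in q1 + q2])
```

Triples that were never touched by a query keep their static placement, while co-accessed triples converge onto shared pages as the workload is observed.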
Proposal #2: Partial Indexing
On top of the aforementioned scheme, consider an index that, for some queries in the workload, also returns irrelevant triples (shown striped) as false positives
This is not a serious problem, because these false-positive triples can be eliminated in the query evaluation pipeline at a small extra computational cost
On the other hand, this index is much easier to update and maintain
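The trade-off can be sketched with a coarse, bucket-based index. This is only one possible instance of an index that admits false positives; the class PartialIndex, the function lookup and the bucketing by a CRC of the object are assumptions made for illustration, not the design proposed in the talk.

```python
import zlib


class PartialIndex:
    """Coarse index: maps an object to a bucket that holds a superset
    of the triples with that object."""

    def __init__(self, num_buckets=2):
        self.num_buckets = num_buckets
        self.buckets = {}

    def _bucket(self, obj):
        return zlib.crc32(obj.encode("utf-8")) % self.num_buckets

    def insert(self, triple):
        # Maintenance is a single append; no per-key structure to rebalance.
        self.buckets.setdefault(self._bucket(triple[2]), []).append(triple)

    def candidates(self, obj):
        return list(self.buckets.get(self._bucket(obj), []))


def lookup(index, obj):
    """Fetch candidates, then drop false positives in the pipeline."""
    cands = index.candidates(obj)
    hits = [t for t in cands if t[2] == obj]
    return hits, len(cands) - len(hits)


rows = [("Bob", "favorites", "Post235"), ("Gunes", "favorites", "Post1"),
        ("Olaf", "favorites", "Post234"), ("Alice", "likes", "Post23"),
        ("Ken", "likes", "Post24"), ("Gunes", "retweets", "Post2358"),
        ("Ken", "retweets", "Post23570"), ("Alice", "taggedIn", "Post2"),
        ("Bob", "taggedIn", "Post2357"), ("Olaf", "taggedIn", "Post23571")]

index = PartialIndex(num_buckets=2)
for row in rows:
    index.insert(row)

hits, false_positives = lookup(index, "Post23571")
print(hits, "with", false_positives, "false positive(s) filtered out")
```

Fewer buckets make the index cheaper to maintain but return more false positives per lookup; the filter step keeps the answer exact either way.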