DAW: Duplicate-AWare Federated Query Processing over the Web of Data
Presented at the ISWC 2013 research track.
Muhammad Saleem, Axel-Cyrille Ngonga Ngomo1, Josiane Xavier Parreira2, Helena F. Deus3, Manfred Hauswirth2
1Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, [email protected]
2Digital Enterprise Research Institute (DERI), National University of Ireland, [email protected]
International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia
Motivation
[Figure: federated query processing over four RDF sources S1, S2, S3, S4]
• Parser: get individual triple patterns
• Source Selection: identify capable sources for individual triple patterns
• Federator Optimizer: generate optimized sub-query execution plan
• Execute sub-queries
• Integrator: integrate sub-query results
Motivation
SELECT ?v1 ?v2 WHERE {
  ?uri <p1> ?v1 .  # Triple Pattern 1 (TP1)
  ?uri <p2> ?v2 .  # Triple Pattern 2 (TP2)
}
[Figure: source selection algorithm over four RDF sources S1, S2, S3, S4]
Triple pattern-wise source selection
TP1 = S1 S2 S3
TP2 = S1 S2 S4
Total triple pattern-wise selected sources = 6
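Triple pattern-wise source selection can be sketched as a lookup in a predicate index. The index layout, the name SOURCE_PREDICATES, and the function select_sources are assumptions made for illustration; the slides only state which sources are selected for TP1 and TP2.

```python
# Hypothetical predicate index: the predicates each source can answer.
# The contents mirror the slide example (TP1 -> S1 S2 S3, TP2 -> S1 S2 S4).
SOURCE_PREDICATES = {
    "S1": {"p1", "p2"},
    "S2": {"p1", "p2"},
    "S3": {"p1"},
    "S4": {"p2"},
}


def select_sources(triple_patterns, index):
    """Return, per triple pattern, the sorted sources that hold its predicate."""
    selection = {}
    for name, predicate in triple_patterns.items():
        selection[name] = sorted(
            source for source, predicates in index.items() if predicate in predicates
        )
    return selection


patterns = {"TP1": "p1", "TP2": "p2"}
selection = select_sources(patterns, SOURCE_PREDICATES)
total_selected = sum(len(sources) for sources in selection.values())
print(selection)
print("Total triple pattern-wise selected sources =", total_selected)
```

Under these assumptions the total is 6, matching the slide. Nothing in this step notices that some of the selected sources may return the same triples.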
Motivation
[Figure: retrieved results for TP1 (?uri <p1> ?v1) and for TP2 (?uri <p2> ?v2)]
Triple pattern-wise source selection and skipping
TP1 = S1 S2 S3
TP2 = S1 S2 S4
Min. number of new triples (threshold) = 20
Total triple pattern-wise selected sources = 4
Total triple pattern-wise skipped sources = 2
Problem Statement
• Data duplication in LOD datasets
– E.g., DrugBank and Neurocommons are duplicated at the DERI Health Care and Life Sciences Knowledge Base
• Retrieving duplicate results increases query execution time and network traffic
• How can the overlap between data sources be estimated before sub-query federation?
Sketches
• Data structures that provide dataset summaries
– Min-wise Independent Permutations (MIPs)
– Bloom filters
• Estimate overlap among different ID sets
• MIPs provide a good tradeoff between estimation error and space requirements
• MIPs of different lengths can be compared
• Sketches alone cannot be used in SPARQL federation
– SPARQL queries are highly selective when the subject, predicate, or object is bound in a triple pattern
Min-wise Independent Permutations
[Figure: creating a MIP vector from an ID set]
ID set: 20 48 24 36 18 8
Apply permutations to all IDs
h1 = (7x + 3) mod 51
h2 = (5x + 6) mod 51
...
hN = (3x + 9) mod 51
Create MIP vector from minima of permutations: 8 9 ... 9
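The construction can be sketched in a few lines: apply each hash function to every ID and keep the minimum per function. The helper names (make_linear_hash, mip_vector) and the ID set used here are assumptions for illustration. The computed minima follow the arithmetic of the three formulas and need not match the illustrative numbers in the figure.

```python
def make_linear_hash(a, b, m):
    """Build a linear hash x -> (a*x + b) mod m, as in the slide formulas."""
    return lambda x: (a * x + b) % m


def mip_vector(ids, hash_functions):
    """Apply every hash function to all IDs and keep the minimum of each."""
    if not ids:
        raise ValueError("cannot summarize an empty ID set")
    return [min(h(x) for x in ids) for h in hash_functions]


# The three hash functions named on the slide (h1, h2, hN).
HASHES = [
    make_linear_hash(7, 3, 51),
    make_linear_hash(5, 6, 51),
    make_linear_hash(3, 9, 51),
]

# Assumed example ID set.
ID_SET = {20, 48, 24, 36, 18, 8}

vector = mip_vector(ID_SET, HASHES)
print(vector)
```

The vector has one entry per hash function, so its length is fixed by the number of permutations and not by the size of the ID set.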
MIPs estimated operations
[Figure: triples T1(s,p,o) ... T6(s,p,o) are mapped to IDs via h(concat(s,o)) and summarized by the MIP vectors VA and VB]
VA = 8 9 30 24 36 9
VB = 8 24 20 48 36 13
Union (VA , VB) = 8 9 20 24 36 9 (position-wise minimum)
Resemblance (VA , VB) = 2/6 => 0.33
Overlap (VA , VB) = 0.33*(6+6) / (1+0.33) => 3
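The three operations can be reproduced with the vectors from this slide, VA = 8 9 30 24 36 9 and VB = 8 24 20 48 36 13. This is a minimal sketch. The function names are assumptions, and comparing vectors of different lengths over the shorter length is one plausible reading of "MIPs of different lengths can be compared".

```python
def mip_union(va, vb):
    """Position-wise minimum of two MIP vectors."""
    return [min(a, b) for a, b in zip(va, vb)]


def resemblance(va, vb):
    """Fraction of positions at which both vectors hold the same minimum."""
    n = min(len(va), len(vb))  # assumption: compare over the shorter vector
    if n == 0:
        return 0.0
    return sum(1 for a, b in zip(va, vb) if a == b) / n


def overlap(va, size_a, vb, size_b):
    """Estimated number of shared elements: r * (|A| + |B|) / (1 + r)."""
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (1 + r)


VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

print(mip_union(VA, VB))
print(round(resemblance(VA, VB), 2))
print(round(overlap(VA, 6, VB, 6), 2))
```

Two positions match (8 and 36), so the resemblance is 2/6 and the estimated overlap is 3, in line with the slide.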
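\[ \mathrm{Resemblance}(V_A, V_B) \approx \frac{\text{number of positions } i \text{ with } V_A[i] = V_B[i]}{N} \]
\[ \mathrm{Overlap}(V_A, V_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{1 + \mathrm{Resemblance}(V_A, V_B)} \]
\[ \text{Error estimation} = O(1/\sqrt{N}) \]
\[ |S'_i| = \begin{cases} |S_i| & \text{if neither subject nor object is bound} \\ |S_i| \times avgSbjSel_S(p) & \text{if subject is bound} \\ |S_i| \times avgObjSel_S(p) & \text{if object is bound} \end{cases} \]
Here N is the number of permutations (the length of the MIP vector), and |S_A|, |S_B| are the sizes of the summarized ID sets.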
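The size adjustment for bound subjects and objects can be sketched as follows. The function name and signature are assumptions. The case where both subject and object are bound is not covered by the slide, so multiplying both selectivities there is also an assumption.

```python
def estimated_size(total_triples, avg_sbj_sel, avg_obj_sel,
                   subject_bound=False, object_bound=False):
    """Scale a predicate's triple count by the average selectivity of each bound position."""
    size = float(total_triples)
    if subject_bound:
        size *= avg_sbj_sel
    if object_bound:
        # Assumption: if the subject is bound as well, both factors apply.
        size *= avg_obj_sel
    return size


# Example values taken from a predicate entry with 147 triples,
# avgSbjSel 0.0068 and avgObjSel 0.0069.
print(estimated_size(147, 0.0068, 0.0069))
print(estimated_size(147, 0.0068, 0.0069, subject_bound=True))
print(estimated_size(147, 0.0068, 0.0069, object_bound=True))
```

With a bound subject the estimate drops from 147 triples to roughly one, which illustrates why a sketch of the whole predicate overestimates the result size of a selective triple pattern.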
DAW
• A combination of MIPs with compact data summaries
• Uses average selectivity values for bound subjects and objects
• Can be combined with any existing SPARQL endpoint federation system
• Can be used for partial result retrieval
DAW Index
[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
      sd:predicate diseasome:name ;
      sd:totalTriples 147 ;
      sd:avgSbjSel "0.0068" ;
      sd:avgObjSel "0.0069" ;
      sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
   ] ;
   sd:capability [
      sd:predicate diseasome:chromosomalLocation ;
      sd:totalTriples 160 ;
      sd:avgSbjSel "0.0062" ;
      sd:avgObjSel "0.0072" ;
      sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
   ] ;
Triple Pattern-wise source ranking and skipping
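One way to read ranking and skipping is as a greedy loop. It starts with the largest source, then repeatedly adds the source that is estimated to contribute the most new triples, and skips the rest once the best contribution falls below the threshold. The sketch below works under that reading. The Source class, the function names, and the clamping of the overlap estimate are assumptions and not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    size: float   # estimated number of triples matching the triple pattern
    mips: list    # MIP vector of the pattern's predicate at this source


def resemblance(va, vb):
    n = min(len(va), len(vb))
    return sum(1 for a, b in zip(va, vb) if a == b) / n if n else 0.0


def overlap(va, size_a, vb, size_b):
    r = resemblance(va, vb)
    estimate = r * (size_a + size_b) / (1 + r)
    # Assumption: an overlap cannot exceed the smaller of the two sets.
    return min(estimate, size_a, size_b)


def rank_and_skip(sources, threshold):
    """Greedily rank sources by estimated new triples; skip those below threshold."""
    remaining = sorted(sources, key=lambda s: s.size, reverse=True)
    if not remaining:
        return [], []
    first = remaining.pop(0)
    selected, skipped = [first], []
    union_mips, union_size = list(first.mips), first.size
    while remaining:
        gains = [
            (s.size - overlap(union_mips, union_size, s.mips, s.size), s)
            for s in remaining
        ]
        best_gain, best = max(gains, key=lambda g: g[0])
        if best_gain < threshold:
            skipped.extend(remaining)  # no remaining source adds enough new triples
            break
        remaining.remove(best)
        selected.append(best)
        union_mips = [min(a, b) for a, b in zip(union_mips, best.mips)]
        union_size += best_gain
    return selected, skipped


# Made-up example: S2 duplicates S1, S3 holds different data.
sources = [
    Source("S1", 100, [1, 2, 3, 4]),
    Source("S2", 100, [1, 2, 3, 4]),
    Source("S3", 50, [9, 8, 7, 6]),
]
selected, skipped = rank_and_skip(sources, threshold=20)
print([s.name for s in selected], [s.name for s in skipped])
```

In this made-up example S2 is skipped because its MIP vector matches the union of the sources already selected, so its estimated contribution is zero new triples.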
Evaluation Setup
Dataset      Total Size (MB)  Index Size (bytes)  No of Slice  Discrepancy  No of Dup. Slices  Index Gen. Time (sec)
Diseasome    18.62            0.17                10           1500         1                  4
Geo          274.14           1.63                10           50000        2                  133
LinkedMDB    448.93           1.66                10           100000       1                  201
Publication  39.07            0.2                 10           2500         1                  6

Queries Distribution
Dataset      STP  S-1  S-2  P-1  P-2  P-3  Total
Diseasome    5    5    5    4    5    2    26
Geo          5    5    5    -    -    -    15
LinkedMDB    5    -    -    -    -    -    5
Publication  5    5    5    7    7    4    33
Total        20   15   15   11   12   6    79

EndPoint  CPU (GHz)  RAM   Hard Disk
1         2.2, i3    4GB   300GB
2         2.9, i7    16GB  256GB SSD
3         2.6, i5    4GB   150GB
4         2.53, i5   4GB   300GB
5         2.3, i5    4GB   500GB
6         2.53, i5   4GB   300GB
7         2.9, i7    8GB   450GB
8         2.6, i5    8GB   400GB
9         2.6, i5    8GB   400GB
10        2.9, i7    16GB  500GB
• Slice generator tool [1] for random slicing and duplicates
• We have extended FedX, SPLENDID, DARQ with DAW
[1] http://goo.gl/trjGSJ
Triple Pattern-wise sources skipped
DARQ
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    14(35)   30(77)   40(107)  35(65)   65(125)  30(50)   214(459)   100%
Geo          22(40)   23(55)   37(101)  -        -        -        82(196)    99.99%
LinkedMDB    22(38)   -        -        -        -        -        22(38)     100%
Publication  9(30)    10(37)   15(86)   14(60)   21(120)  32(102)  101(435)   100%
Total        67(143)  63(169)  92(294)  49(125)  86(245)  62(152)  419(1128)

FedX and SPLENDID
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    7(28)    30(77)   40(107)  35(65)   65(125)  30(50)   207(452)   100%
Geo          19(37)   23(55)   37(101)  -        -        -        79(193)    99.99%
LinkedMDB    15(31)   -        -        -        -        -        15(31)     100%
Publication  3(24)    10(37)   15(86)   14(60)   21(120)  32(102)  95(429)    100%
Total        44(120)  63(169)  92(294)  49(125)  86(245)  62(152)  396(1105)
FedX Extension with DAW
[Figure: execution time (sec) of FedX vs. DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo Data: STP, S-1, S-2; Movie: STP]
Overall performance evaluation
Dataset      FedX Average  DAW Average  Gain %
Diseasome    2.44          1.98         18.79
Publication  1.48          1.67         -12.38
Geo Data     4.60          3.92         14.71
Movie        1.74          1.61         7.59
Overall      2.44          2.20         9.76
SPLENDID Extension with DAW
[Figure: execution time (sec) of SPLENDID vs. DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo: STP, S-1, S-2; Movie: STP]
Overall performance evaluation
Dataset      SPLENDID Average  DAW Average  Gain %
Diseasome    3.78              3.04         19.48
Publication  2.18              2.37         -8.94
Geo Data     7.27              6.22         14.40
Movie        1.9               1.688        11.16
Overall      3.71              3.30         11.11
DARQ Extension with DAW
[Figure: execution time (sec) of DARQ vs. DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3; Geo: STP, S-1, S-2; Movie: STP]
Overall performance evaluation
Dataset      DARQ Average  DAW Average  Gain %
Diseasome    8.27          6.34         23.34
Publication  5.26          4.94         6.14
Geo Data     23.44         19.62        16.31
Movie        1.96          1.688        13.88
Overall      9.59          8.01         16.46
Source Ranking vs Recall
[Figure: recall (%) against the number of ranked sources (1 to 10), comparing Optimal and DAW; panels: Diseasome, Publication]
Conclusion and Future Work
• A sub-query can retrieve results that have already been retrieved by another query
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• Sketches alone cannot be used due to the expressive nature of SPARQL queries
• We used MIPs applied to RDF predicates along with compact data summaries
• Performance improvement
– FedX: 9.76 %
– SPLENDID: 11.11 %
– DARQ: 16.46 %
• The effect of MIP sizes and threshold values will be explored to find the optimal trade-off between execution time and recall
[email protected] AKSW, University of Leipzig, Germany