DAW: Duplicate-AWare Federated Query Processing over the Web of Data

Muhammad Saleem, Axel-Cyrille Ngonga Ngomo1, Josiane Xavier Parreira2, Helena F. Deus3, Manfred Hauswirth2

1Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, [email protected]

2Digital Enterprise Research Institute (DERI), National University of Ireland, [email protected]

International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia

Motivation

[Figure: federated query processing pipeline over four RDF sources S1, S2, S3, S4]

• Parser: get individual triple patterns
• Source Selection: identify capable sources for individual triple patterns
• Federator Optimizer: generate optimized sub-query execution plan
• Execute sub-queries
• Integrator: integrate sub-query results

Motivation

SELECT ?v1 ?v2 WHERE {
  ?uri <p1> ?v1 .   # Triple Pattern 1 (TP1)
  ?uri <p2> ?v2 .   # Triple Pattern 2 (TP2)
}

[Figure: source selection algorithm over four RDF sources S1, S2, S3, S4]

Triple pattern-wise source selection

TP1 = S1, S2, S3
TP2 = S1, S2, S4

Total triple pattern-wise selected sources = 6
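As a minimal sketch of triple pattern-wise source selection, the snippet below picks, for each triple pattern, every source whose index lists the pattern's predicate. The predicate index is hypothetical and was chosen only so that the outcome mirrors the slide (three sources per pattern, six in total); real engines may instead use ASK queries or richer statistics.

```python
from typing import Dict, List, Set


def select_sources(patterns: Dict[str, str],
                   index: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each triple pattern, return the sources whose index lists its predicate."""
    return {
        name: sorted(src for src, preds in index.items() if predicate in preds)
        for name, predicate in patterns.items()
    }


# Hypothetical predicate index (assumption, not taken from the slides).
index = {
    "S1": {"p1", "p2"},
    "S2": {"p1", "p2"},
    "S3": {"p1"},
    "S4": {"p2"},
}
# Predicates of the two triple patterns in the example query.
patterns = {"TP1": "p1", "TP2": "p2"}

selected = select_sources(patterns, index)
total = sum(len(sources) for sources in selected.values())
print(selected)
print("Total triple pattern-wise selected sources =", total)
```

Selection of this kind looks only at which predicates a source holds, so it cannot tell whether two selected sources would return the same triples.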

Motivation

Retrieved results for TP1 (?uri <p1> ?v1)
Retrieved results for TP2 (?uri <p2> ?v2)

Triple pattern-wise source selection and skipping

TP1 = S1, S2, S3
TP2 = S1, S2, S4

Min. number of new triples (threshold) = 20

Total triple pattern-wise selected sources = 4

Total triple pattern-wise skipped sources = 2

Problem Statement

• Data duplication in LOD datasets
– E.g. DrugBank and Neurocommons are duplicated at DERI Health Care and Life Sciences Knowledge Base
• Retrieving duplicate results increases the query execution time and network traffic
• How to estimate the overlap between data sources before sub-query federation?

Sketches

• Data structures that provide dataset summaries
– Min-wise Independent Permutations (MIPs)
– Bloom filters
• Estimate overlap among different ID sets
• MIPs provide a good trade-off between estimation error and space requirements
• MIPs of different lengths can be compared
• Sketches alone cannot be used in SPARQL federation
– SPARQL queries are highly selective when subject, predicate, or object becomes bound in a triple pattern

Min-wise Independent Permutations

[Figure: the permutations are applied to all IDs of the ID set, and the MIP vector is created from the minima of the permutations (8, 9 and 9 in the figure)]

h1 = (7x + 3) mod 51

h2 = (5x + 6) mod 51

hN = (3x + 9) mod 51
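The construction can be sketched in a few lines: apply every hash function to every ID and keep the minimum per function. The three hash functions are the ones shown on the slide; the ID set `example_ids` and the helper names are made up for illustration and are not the figure's data.

```python
from typing import Callable, Iterable, List

MODULUS = 51  # modulus used by the hash functions on the slide


def make_hash(a: int, b: int) -> Callable[[int], int]:
    """Build a linear hash function x -> (a*x + b) mod MODULUS."""
    return lambda x: (a * x + b) % MODULUS


# h1, h2 and hN from the slide.
HASHES = [make_hash(7, 3), make_hash(5, 6), make_hash(3, 9)]


def mip_vector(ids: Iterable[int],
               hashes: List[Callable[[int], int]] = HASHES) -> List[int]:
    """Return one entry per hash function: the minimum hash value over all IDs."""
    id_list = list(ids)
    return [min(h(x) for x in id_list) for h in hashes]


example_ids = [5, 12, 19, 33]  # hypothetical ID set
print(mip_vector(example_ids))  # [30, 15, 6]
```

The vector length equals the number of hash functions, so the summary size is independent of how many IDs the set contains.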

MIPs estimated operations

[Figure: triples T1(s,p,o) to T6(s,p,o) are hashed with h(concat(s,o)) to build the MIP vectors VA and VB]

VA = 8 9 30 24 36 9

VB = 8 24 20 48 36 13

Union (VA, VB) = 8 9 20 24 36 9

Resemblance (VA, VB) = 2/6 => 0.33

Overlap (VA, VB) = 0.33*(6+6) / (1+0.33) => 3
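The worked example can be reproduced with a short sketch. It assumes that resemblance is the fraction of positions at which the two MIP vectors agree and that the union is the position-wise minimum; the vectors `VA` and `VB` are read off the slide's figure as far as the extracted numbers allow, and the function names are illustrative.

```python
from typing import List, Sequence


def resemblance(va: Sequence[int], vb: Sequence[int]) -> float:
    """Fraction of compared positions where both MIP vectors hold the same minimum."""
    n = min(len(va), len(vb))  # vectors of different lengths: compare the common prefix
    matches = sum(1 for a, b in zip(va, vb) if a == b)
    return matches / n


def union(va: Sequence[int], vb: Sequence[int]) -> List[int]:
    """MIP vector of the union of both sets: position-wise minimum."""
    return [min(a, b) for a, b in zip(va, vb)]


def overlap(va: Sequence[int], vb: Sequence[int],
            size_a: float, size_b: float) -> float:
    """Estimated number of shared elements, following the formula on the slide."""
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (1 + r)


VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

print(union(VA, VB))                     # [8, 9, 20, 24, 36, 9]
print(round(resemblance(VA, VB), 2))     # 0.33
print(round(overlap(VA, VB, 6, 6), 2))   # 3.0
```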

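\[
\mathrm{Resemblance}(V_A, V_B) \approx \frac{|V_A \cap V_B|}{|V_A \cup V_B|},
\qquad
\mathrm{Overlap}(V_A, V_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|V_A| + |V_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}
\]

\[
\text{Error estimation} = O\!\left(1/\sqrt{N}\right)
\]

\[
|S'_i| =
\begin{cases}
|S_i| & \text{if neither subject nor object is bound} \\
|S_i| \times avgSbjSel_S(p) & \text{if subject is bound} \\
|S_i| \times avgObjSel_S(p) & \text{if object is bound}
\end{cases}
\]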
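A small sketch of the size adjustment for bound subjects and objects follows. The function name and parameters are illustrative; the case where both subject and object are bound is not covered by the slide, so applying both selectivities there is an assumption. The sample values are those of `diseasome:name` in the DAW index example.

```python
def estimated_size(total_triples: int,
                   avg_sbj_sel: float,
                   avg_obj_sel: float,
                   subject_bound: bool = False,
                   object_bound: bool = False) -> float:
    """Scale the triple count of a predicate by the average selectivities."""
    size = float(total_triples)
    if subject_bound:
        size *= avg_sbj_sel  # |S_i| x avgSbjSel_S(p)
    if object_bound:
        # Assumption: if the subject is bound as well, both factors are applied.
        size *= avg_obj_sel  # |S_i| x avgObjSel_S(p)
    return size


# diseasome:name -> 147 triples, avgSbjSel 0.0068, avgObjSel 0.0069
print(estimated_size(147, 0.0068, 0.0069))
print(estimated_size(147, 0.0068, 0.0069, subject_bound=True))
print(estimated_size(147, 0.0068, 0.0069, object_bound=True))
```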

DAW

• A combination of MIPs with compact data summaries

• Uses average selectivity values for bound subjects and objects

• Can be combined with any existing SPARQL endpoint federation system

• Can be used for partial result retrieval

DAW Index

[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
      sd:predicate diseasome:name ;
      sd:totalTriples 147 ;
      sd:avgSbjSel "0.0068" ;
      sd:avgObjSel "0.0069" ;
      sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
   ] ;
   sd:capability [
      sd:predicate diseasome:chromosomalLocation ;
      sd:totalTriples 160 ;
      sd:avgSbjSel "0.0062" ;
      sd:avgObjSel "0.0072" ;
      sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
   ] ;

Triple Pattern-wise source ranking and skipping
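One possible reading of ranking and skipping is the greedy sketch below; it is not the authors' algorithm. For one triple pattern it starts with the largest source, then repeatedly adds the source estimated to contribute the most new triples, and skips every source whose estimated contribution falls below the threshold. The data, the function names, and the clamping of the overlap estimate are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Sketch = Tuple[List[int], float]  # (MIP vector, estimated number of triples)


def resemblance(va: Sequence[int], vb: Sequence[int]) -> float:
    n = min(len(va), len(vb))
    return sum(1 for a, b in zip(va, vb) if a == b) / n


def overlap(va: Sequence[int], size_a: float,
            vb: Sequence[int], size_b: float) -> float:
    r = resemblance(va, vb)
    estimate = r * (size_a + size_b) / (1 + r)
    # Assumption: the overlap can never exceed the smaller of the two sets.
    return min(estimate, size_a, size_b)


def rank_and_skip(sources: Dict[str, Sketch],
                  threshold: float) -> Tuple[List[str], List[str]]:
    """Return (ranked sources to query, skipped sources) for one triple pattern."""
    remaining = dict(sources)
    first = max(remaining, key=lambda s: remaining[s][1])  # largest source first
    sel_vec, sel_size = remaining.pop(first)
    ranked, skipped = [first], []
    while remaining:
        # Estimated new triples each candidate adds to what is already selected.
        gains = {s: size - overlap(sel_vec, sel_size, vec, size)
                 for s, (vec, size) in remaining.items()}
        best = max(gains, key=gains.get)
        if gains[best] < threshold:
            skipped.extend(sorted(remaining))  # nothing left is worth querying
            break
        vec, _ = remaining.pop(best)
        sel_vec = [min(a, b) for a, b in zip(sel_vec, vec)]  # union of the sketches
        sel_size += gains[best]
        ranked.append(best)
    return ranked, skipped


# Hypothetical sketches: S2 duplicates S1, S3 is disjoint from both.
sources = {
    "S1": ([1, 2, 3, 4], 100.0),
    "S2": ([1, 2, 3, 4], 100.0),
    "S3": ([5, 6, 7, 8], 50.0),
}
print(rank_and_skip(sources, threshold=20))  # (['S1', 'S3'], ['S2'])
```

With a threshold of 20 new triples, the duplicate source S2 is never queried, which is the saving the skipping step aims for.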

Evaluation Setup

Dataset      Total Size (MB)  Index Size (bytes)  No. of Slices  Discrepancy  No. of Dup. Slices  Index Gen. Time (sec)
Diseasome    18.62            0.17                10             1500         1                   4
Geo          274.14           1.63                10             50000        2                   133
LinkedMDB    448.93           1.66                10             100000       1                   201
Publication  39.07            0.2                 10             2500         1                   6

Queries Distribution

Dataset      STP  S-1  S-2  P-1  P-2  P-3  Total
Diseasome    5    5    5    4    5    2    26
Geo          5    5    5    -    -    -    15
LinkedMDB    5    -    -    -    -    -    5
Publication  5    5    5    7    7    4    33
Total        20   15   15   11   12   6    79

EndPoint  CPU (GHz)  RAM   Hard Disk
1         2.2, i3    4GB   300GB
2         2.9, i7    16GB  256GB SSD
3         2.6, i5    4GB   150GB
4         2.53, i5   4GB   300GB
5         2.3, i5    4GB   500GB
6         2.53, i5   4GB   300GB
7         2.9, i7    8GB   450GB
8         2.6, i5    8GB   400GB
9         2.6, i5    8GB   400GB
10        2.9, i7    16GB  500GB

• Slice generator tool [1] for random slicing and duplicates

• We have extended FedX, SPLENDID, and DARQ with DAW

[1] http://goo.gl/trjGSJ

Triple pattern-wise sources skipped

DARQ

Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    14(35)   30(77)   40(107)  35(65)   65(125)  30(50)   214(459)   100%
Geo          22(40)   23(55)   37(101)  -        -        -        82(196)    99.99%
LinkedMDB    22(38)   -        -        -        -        -        22(38)     100%
Publication  9(30)    10(37)   15(86)   14(60)   21(120)  32(102)  101(435)   100%
Total        67(143)  63(169)  92(294)  49(125)  86(245)  62(152)  419(1128)

FedX and SPLENDID

Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    7(28)    30(77)   40(107)  35(65)   65(125)  30(50)   207(452)   100%
Geo          19(37)   23(55)   37(101)  -        -        -        79(193)    99.99%
LinkedMDB    15(31)   -        -        -        -        -        15(31)     100%
Publication  3(24)    10(37)   15(86)   14(60)   21(120)  32(102)  95(429)    100%
Total        44(120)  63(169)  92(294)  49(125)  86(245)  62(152)  396(1105)

FedX Extension with DAW

[Figure: execution time (sec) of FedX and DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3. Geo Data: STP, S-1, S-2. Movie: STP.]

Overall performance evaluation

       Diseasome        Publication      Geo Data         Movie            Overall
       Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
FedX   2.44     18.79   1.48     -12.38  4.60     14.71   1.74     7.59    2.44     9.76
DAW    1.98             1.67             3.92             1.61             2.20

SPLENDID Extension with DAW

[Figure: execution time (sec) of SPLENDID and DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3. Geo: STP, S-1, S-2. Movie: STP.]

Overall performance evaluation

          Diseasome        Publication      Geo Data         Movie            Overall
          Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
SPLENDID  3.78     19.48   2.18     -8.94   7.27     14.40   1.9      11.16   3.71     11.11
DAW       3.04             2.37             6.22             1.688            3.30

DARQ Extension with DAW

[Figure: execution time (sec) of DARQ and DAW per query type. Diseasome and Publication: STP, S-1, S-2, P-1, P-2, P-3. Geo: STP, S-1, S-2. Movie: STP.]

Overall performance evaluation

       Diseasome        Publication      Geo Data         Movie            Overall
       Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %  Average  Gain %
DARQ   8.27     23.34   5.26     6.14    23.44    16.31   1.96     13.88   9.59     16.46
DAW    6.34             4.94             19.62            1.688            8.01

Source Ranking vs Recall

[Figure: recall in % against the number of ranked sources (1 to 10), comparing Optimal and DAW; panels: Diseasome, Publication]

Conclusion and Future Work

• A sub-query can retrieve results that are already retrieved by another query
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• Sketches alone cannot be used due to the expressive nature of SPARQL queries
• We used MIPs applied to RDF predicates along with compact data summaries
• Performance improvement
– FedX: 9.76 %
– SPLENDID: 11.11 %
– DARQ: 16.46 %
• The effect of MIP sizes and threshold values will be explored to find the optimal trade-off between execution time and recall

[email protected] AKSW, University of Leipzig, Germany
