Opportunistic Linked Data Querying through Approximate Membership Metadata
Miel Vander Sande
“Solve a query for a client, and it will be happy for a day.
Teach a client to SPARQL, and it’ll query happily ever after.”
— Confucius, 431 BC
Linked Data Fragments: a uniform view on publishing Linked Data
Exploring the axis: selector and metadata
Approximate Membership Metadata
Querying through Approximate Membership Metadata
Opportunistic Querying
Interaction between client & server. The hunt for trade-offs: What can we learn?
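[Figure: axis of interfaces offered by the server, with the data dump at one end and the SPARQL endpoint at the other.]
data dump: low server cost, high client cost, high availability, high bandwidth, out-of-date data
SPARQL endpoint: high server cost, low client cost, low availability, low bandwidth, live data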
Linked Data Fragments are a uniform view on Linked Data interfaces.
[Figure: axis of interfaces offered by the server, from data dump to SPARQL endpoint.]
Every Linked Data interface offers specific fragments of a Linked Data set.
Each type of Linked Data Fragment is defined by three characteristics.
data: What triples does it contain?
metadata: What do we know about it?
controls: How to access more data?
data dump
data: all dataset triples
metadata: number of triples, file size
controls: (none)
SPARQL query result
data: triples matching the query
metadata: (none)
controls: (none)
You have to start somewhere: Triple Pattern Fragments.
[Figure: axis of interfaces with data dump, Linked Data documents, triple pattern fragments and SPARQL query results.]
triple pattern fragments: low server cost, high availability, live data, high bandwidth
Verborgh, R., Hartig, O., …: Querying datasets on the Web with high availability. ISWC 2014
Triple pattern fragment servers enable clients to be intelligent.
data (first 100)
metadata (total count)
controls (other fragments)

controls
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
  hydra:template "http://fragments.dbpedia.org/2014/en{?subject,predicate,object}";
  hydra:mapping [ hydra:variable "subject"; hydra:property rdf:subject ],
                [ hydra:variable "predicate"; hydra:property rdf:predicate ],
                [ hydra:variable "object"; hydra:property rdf:object ]
].
The RDF representation explains: “you can query by triple pattern”.
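As a rough illustration of how a client could act on this control, the sketch below fills in the template variables to build the URL of a fragment. It assumes form-style query expansion of the template (`{?subject,predicate,object}` becomes `?subject=…&predicate=…&object=…`). The helper name `fragment_url` and the full DBpedia IRIs are assumptions made for the example, not something taken from the slides.

```python
from urllib.parse import urlencode

# Base URL taken from the hydra:template, without the variable part.
TEMPLATE_BASE = "http://fragments.dbpedia.org/2014/en"

def fragment_url(subject=None, predicate=None, obj=None, base=TEMPLATE_BASE):
    """Build the URL of the fragment for one triple pattern.

    Unbound positions (None) are left out of the query string, so they
    act as variables. Hypothetical helper, for illustration only.
    """
    params = [("subject", subject), ("predicate", predicate), ("object", obj)]
    bound = [(name, value) for name, value in params if value is not None]
    if not bound:
        return base
    return base + "?" + urlencode(bound)

# Pattern: ?artist dbpedia-owl:birthPlace dbpedia:Padua.
# (prefixes expanded to full IRIs; the expansions are assumed here)
url = fragment_url(
    predicate="http://dbpedia.org/ontology/birthPlace",
    obj="http://dbpedia.org/resource/Padua",
)
print(url)
```

Leaving a position unbound is what turns it into a variable of the triple pattern, so the same helper covers all eight pattern shapes.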
Triple pattern fragment servers enable clients to be intelligent.
metadata
<#fragment> void:triples 8141.
The RDF representation explains: “this is the number of matches”.
How can intelligent clients solve SPARQL queries over fragments?
Give them a SPARQL query.
Give them a URL of any dataset fragment.
They look inside the fragment to see how to access the dataset
and use the metadata to decide how to plan the query.
The client splits the query into the available fragments.
SELECT ?artist ?name WHERE {
  ?artist a dbpedia-owl:Artist;
          rdfs:label ?name;
          dbpedia-owl:birthPlace dbpedia:Padua.
  FILTER LANGMATCHES(LANG(?name), "EN")
}
The client gets the fragments and inspects their metadata.
?artist a dbpedia-owl:Artist. (first 100 triples, total count 96,000)
?artist rdfs:label ?name. (first 100 triples, total count 12,000,000)
?artist dbont:birthPlace dbpedia:Padua. (first 100 triples, total count 135)
The metadata enables the client to choose the right starting point.
?artist a dbpedia-owl:Artist. (96,000)
?artist rdfs:label ?name. (12,000,000)
?artist dbont:birthPlace dbpedia:Padua. (135)
  dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
  dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.
dbp:Alberto_Benettin a dbont:Artist.
dbp:Alberto_Benettin rdfs:label ?name.
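The choice of starting point can be sketched in a few lines. This is only a minimal illustration of the idea, assuming the count metadata has already been read from each fragment. The helper names `choose_start_pattern` and `bind` are hypothetical, and patterns are handled as plain strings rather than parsed RDF terms.

```python
# Count metadata (void:triples) of the three fragments, as on the slides.
pattern_counts = {
    "?artist a dbpedia-owl:Artist.": 96_000,
    "?artist rdfs:label ?name.": 12_000_000,
    "?artist dbont:birthPlace dbpedia:Padua.": 135,
}

def choose_start_pattern(counts):
    """Return the pattern whose fragment reports the fewest matches."""
    return min(counts, key=counts.get)

def bind(pattern, variable, value):
    """Substitute one variable in a pattern (naive string replacement)."""
    return pattern.replace(variable, value)

start = choose_start_pattern(pattern_counts)
remaining = [p for p in pattern_counts if p != start]

# Every ?artist found in the start fragment yields bound subqueries;
# one binding from the slides is used as an example.
subqueries = [bind(p, "?artist", "dbpedia:Alberto_Benettin") for p in remaining]
print(start)
print(subqueries)
```

Starting from the 135-triple fragment means at most 135 bindings have to be checked against the other patterns, instead of 96,000 or 12,000,000.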
For some patterns, many requests are of type “is this triple in the dataset?”
[Figure: fraction of membership queries (0% to 100%) for 20 WatDiv queries: linear (L1 to L5), star (S1 to S7), snowflake-shaped (F1 to F5) and complex (C1 to C3).]
Advancing in selector and/or metadata dimensions.
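[Figure: two axes, selector and metadata, with Triple Pattern Fragments as the starting point.]
selector: from simple questions to complex questions
metadata: from no information for the client to extensive useful information for the client
Triple Pattern Fragments: low server cost, high availability, live data, high bandwidth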
Advancing in selector and/or metadata dimensions.
[Figure: selector and metadata axes, showing Triple Pattern Fragments and Substring search.]
J. Van Herwegen et al.: Substring Filtering for Low-Cost Linked Data Interfaces. Last talk of this session!
Advancing in selector and/or metadata dimensions.
[Figure: selector and metadata axes, showing Triple Pattern Fragments, Substring search and Approximate Membership Function (AMF).]
Extend the TPF response with a compact representation of all possible mappings.
[Figure: metadata axis.]
Triple Pattern Fragments + Approximate Membership Function (AMF)
AMF: approximate set membership assessment with a predefined false positive probability.
Bloom filter / Golomb-coded set
“Can we reduce the number of HTTP requests?”
“Can we reduce the total execution time?”
“What is the overhead on server CPU load?”
Bloom Filter
[Figure: members n0 (dbpedia:Alberto_Benettin), n1 (dbpedia:Alberto_Bigon), …, nx are each hashed by hash functions k0, k1, …, kx, which set bits in an initially empty bit array of size m.]
Golomb-coded set (GCS)
[Figure: members n0 (dbpedia:Alberto_Benettin) and n1 (dbpedia:Alberto_Bigon) are each hashed by a single hash function k into a bit array, which is then Golomb coded (geometric distribution).]
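To make the Bloom filter half of this slide concrete, here is a minimal sketch of one, sized from a target false positive probability with the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. It is not the implementation used in the evaluation: the k bit positions are derived from a SHA-256 digest (double hashing) purely so the example runs on the standard library alone, and the class name is an assumption.

```python
import hashlib
import math

class BloomFilter:
    """Illustrative Bloom filter; SHA-256 stands in for the real hash functions."""

    def __init__(self, expected_items, false_positive_probability):
        n, p = expected_items, false_positive_probability
        # Standard sizing: m bits and k hash functions for n items at rate p.
        self.m = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never a zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly a member"; False means "certainly not a member".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter(expected_items=1000, false_positive_probability=1 / 1024)
bloom.add("dbpedia:Alberto_Benettin")
bloom.add("dbpedia:Alberto_Bigon")
print("dbpedia:Alberto_Bigon" in bloom)
```

A member is never reported as absent, which is what lets a client safely skip requests for items the filter rejects.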
Server enables clients to avoid membership requests.
metadata
<#fragment> void:triples 96300. # existing count metadata
_:membershipFunction a ms:BloomFilter; # AMF metadata
  ms:hashSize 524288;
  ms:hashFunction <MyMurmur1>, <MyMurmur2>;
  ms:memberCollection [ ms:sourceCollection <#fragment>; ms:projectedProperty rdf:subject ];
  ms:falsePositiveRate 0.05;
  ms:falseNegativeRate 0.0;
  ms:binaryRepresentation "QmF...ZTY"^^xsd:base64Binary.
“This BloomFilter with false positive probability X and hash function Y represents the presence of all bindings for ?s”.
Client filters non-members locally with one extra (cached) request
GET ?artist dbont:birthPlace dbpedia:Padua. (135)
  dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
  …
GET ?artist a dbont:Artist. (Approx. Membership Filt.)
  GET dbpedia:Alberto_Benettin a dbont:Artist. 0
  GET dbpedia:Alberto_Bigon a dbont:Artist. 1
  GET dbpedia:Alberto_Da_Zara a dbont:Artist. 1
  GET dbpedia:Alberto_Gallo a dbont:Artist. 0
  GET dbpedia:Alberto_Bigon a dbont:Artist. 1
GET …
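The local filtering step might look roughly like the sketch below. It assumes the client has already decoded the filter for `?artist a dbont:Artist.` into something that answers "possibly a member" or "certainly not". An exact Python set stands in for that filter here, and `filter_candidates` is a hypothetical name.

```python
def filter_candidates(candidates, might_contain):
    """Split candidates into those still worth a membership request and
    those the filter rules out (a membership filter has no false negatives)."""
    possible, rejected = [], []
    for candidate in candidates:
        (possible if might_contain(candidate) else rejected).append(candidate)
    return possible, rejected

# Stand-in for the decoded AMF; a real client would test a Bloom filter
# or Golomb-coded set here, which may also accept a few non-members.
artist_filter = {"dbpedia:Alberto_Bigon", "dbpedia:Alberto_Da_Zara"}

candidates = [
    "dbpedia:Alberto_Benettin",
    "dbpedia:Alberto_Bigon",
    "dbpedia:Alberto_Da_Zara",
    "dbpedia:Alberto_Gallo",
]
possible, rejected = filter_candidates(candidates, artist_filter.__contains__)
print(possible)   # candidates that still need a membership request
print(rejected)   # candidates dropped without any request
```

Only the candidates in `possible` still need a membership request; with a real filter, a fraction of about p of the non-members would end up there too.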
We evaluated for request count, server cost and speedup in a Web setting.
250 queries from 125 diverse WatDiv templates on Amazon EC2 machine
WatDiv 100M triples dataset
BloomFilter: MurMurHash3, GCS: FNV-1
p = 1/1024 (0.1%), 1/128 (1%), 1/64 (1.6%)
1 HTTP cache with 1 Mbps
Timeout: 3 min
vs. vanilla TPF server & client
2 client algorithms:
Original “greedy” algorithm
Optimized join-tree algorithm*
* Van Herwegen et al.: Query Execution Optimization for Clients of Triple Pattern Fragments. ESWC 2015
> 50% of the queries have fewer requests, < 20% have more requests.
Percentage of queries (p = 1/1024)
                   Fewer requests   Equal   More requests
Greedy Bloom       59%              35%     6%
Greedy GCS         62%              33%     5%
Optimized Bloom    49%              33%     18%
Optimized GCS      50%              32%     17%
Queries with relatively many HTTP req. (45,000+ / query) benefit greatly
[Figure: difference in #requests (0 to 16,000), split into fewer requests and more requests, for Greedy Bloom, Greedy GCS, Optimized Bloom and Optimized GCS. Annotation: < 35.]
No queries have a reduction in execution time; a third even have an increase.
Percentage of queries (p = 1/1024)
                   Lower execution time   Equal   Higher execution time
Greedy Bloom       0%                     84%     16%
Greedy GCS         0%                     69%     31%
Optimized Bloom    0%                     67%     33%
Optimized GCS      0%                     62%     38%
Server remains low-cost, as impact is very acceptable (< 6%).
[Figure: server CPU (%), axis 0 to 30, for Original, Bloom (1/1024), Bloom (1/128), Bloom (1/64), GCS (1/1024), GCS (1/128) and GCS (1/64). Bar labels as extracted: 11.1, 10.8, 10.2, 14.9, 11.2, 10.8, 9.2.]
During execution, a result candidate is already correct with probability 1 - p.
Can we be opportunistic here, and temporarily allow imprecise results?
“Can we reduce the time to 100% recall?”
[Figure: two execution timelines running from 0% recall, via 100% recall, to 100% recall and 100% precision.]
only allow certain results: start execution -> 1st result computed -> n < r results computed -> r results computed
temporarily allow uncertain results: start execution -> 1st result computed -> n < r results computed -> r + f results computed -> r results computed
Fig. 2. This SPARQL query execution timeline compares regular and opportunistic
query execution, assuming r total query results and f false positives. Note how
both approaches achieve 100% recall and precision at a shared point in the end, but
there exists a period during which only opportunistic execution reaches 100% recall
(shaded).
need to be discarded. The user thus sees the photos faster than if they had only been retrieved after full precision was achieved. This example indicates that opportunistic query answering has direct concrete uses in Web applications.
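The timeline above can be mimicked with a small generator. It emits every candidate the filter accepts right away, and only afterwards revokes the ones that an exact check rejects. This is a sketch under simplifying assumptions: the filter and the exact membership check are modelled as plain Python sets, and the name `opportunistic_results` is hypothetical.

```python
def opportunistic_results(candidates, might_contain, verify):
    """Yield ("add", c) as soon as the filter accepts c, then ("revoke", c)
    for each emitted candidate that exact verification rejects."""
    emitted = []
    for candidate in candidates:
        if might_contain(candidate):   # correct with probability 1 - p
            emitted.append(candidate)
            yield ("add", candidate)
    # All r true results are out now (100% recall), together with up to
    # f false positives, so precision may still be below 100%.
    for candidate in emitted:
        if not verify(candidate):      # exact membership request
            yield ("revoke", candidate)

# Stand-ins: the filter wrongly accepts Alberto_Gallo (a false positive).
filter_members = {"dbpedia:Alberto_Bigon", "dbpedia:Alberto_Da_Zara",
                  "dbpedia:Alberto_Gallo"}
dataset_members = {"dbpedia:Alberto_Bigon", "dbpedia:Alberto_Da_Zara"}

candidates = ["dbpedia:Alberto_Benettin", "dbpedia:Alberto_Bigon",
              "dbpedia:Alberto_Da_Zara", "dbpedia:Alberto_Gallo"]
events = list(opportunistic_results(
    candidates, filter_members.__contains__, dataset_members.__contains__))
for event in events:
    print(event)
```

An application such as the photo example can render each "add" immediately and remove an item again when a "revoke" arrives.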
7 Evaluation
In the following, we discuss our evaluation of executing SPARQL queries against TPF interfaces with an AMF feature. From these experiments, we aim to assess whether AMFs are a valuable asset in the metadata dimension. We first describe the experiments and their setup. Then, we discuss their results to validate the three hypotheses of Section 3.2.
7.1 Experimental setup
We extended the existing implementations of the TPF client³ and server⁴ to support both Bloom filters and Golomb-coded sets. The server is configured by specifying the AMF and the desired false positive probability. We chose the 32-bit MurMurHash3 hash function for GCS and FNV-1 for the Bloom filter. The server calculates a membership function on the fly for each request for a triple pattern with a single variable.
We ran the experiments with different false positive probabilities p: 1/1024 ≈ 0.1%, 1/128 ≈ 1%, and 1/64 ≈ 1.6%. In each experiment, we executed 250 queries generated from 125 diverse WatDiv SPARQL templates on three interfaces: i) regular TPF interface, ii) TPF with Bloom filters, and iii) TPF with GCS. All three cases were tested with both the original and the optimized client; the last two setups were tested with and without opportunistic
³ https://github.com/LinkedDataFragments/Client.js/tree/amq
⁴ https://github.com/LinkedDataFragments/Server.js/tree/amq
Temporarily allowing <100% precision can reduce the time to 100% recall by 1/3.
[Figure: execution time (s), axis 0 to 140, until 100% recall and until 100% precision, for Greedy + Bloom (p = 1/1024).]
Number of revoked results was 0 or 1.
Approximate Membership Metadata is a nuanced debate
For some query types, bandwidth decreases considerably for TPF query execution.
For larger fragments, real-time computation hurts execution time. We expect gains with pre-caching and out-of-band delivery.
Opportunistic querying is a promising direction for further exploration.
[Figure: a triple pattern fragment containing data and an approximate membership filter.]
No one size fits all, explore the axis.
Find metrics that fit your use case.
Client & server load
Request & response size
Protocol (HTTP) impact
…
Try your own trade-off server at our demo (and get a nice cup of coffee).
Start serving Linked Data like a barista