TRANSCRIPT
Compressed RDF: Practical Uses & Hands-on
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School: Keyword Search in Big Linked Data
Session I (09:00-10:30) "Basics of Compression for Big Linked Data Management"
Big (Linked) Semantic Data Compression: motivation & challenges
Compact Data Structures
Session II (13:30-15:00) "RDF Compression"
RDF Compression. HDT
RDF Dictionaries
RDF Triples
Session III (15:30-17:00) "Compressed RDF: Practical Uses & Hands-on"
Practical Uses (LOD-a-lot, RDF Archiving, etc.)
Hands on
General agenda
Practical uses
LOD-a-lot: Web-scale queries in your pocket
RDF archiving
Linked Data markets (Linked Closed Data)
Hands on
HDT-it
Command line tools
HDT and Fuseki
HDT and Linked Data Fragments
HDT and C++/Java
HDT and Jena
Agenda of this session
Use case 1: LOD-a-lot
E.g. retrieve all entities in LOD with the label "Axel Polleres"
Options:
Crawl and index LOD locally (not feasible)
Follow-your-nose (where should I start?)
Federated querying (as good as the endpoints you query)
Use LOD Laundromat as a "good approximation" (still querying 650K datasets)
Still… what about Web-scale queries?
SELECT DISTINCT ?x { ?x rdfs:label "Axel Polleres" }
[Figure: the Linked Open Data cloud is crawled and cleaned by LOD Laundromat into 650K datasets, each republished as zipped N-Triples, plus a SPARQL endpoint for the metadata; LOD-a-lot integrates all of them.]
But what about Web-scale queries?
- flashback -
The real motivation: consume
"Oh man, I'm hungry, and I don't even know if I will like whatever you are cooking."
image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/
LOD-a-lot
But what about Web-scale queries? But one could be really hungry…
image: https://hwy55burgers.wordpress.com/tag/food-challenge/
[Figure: the same pipeline, with LOD-a-lot integrating the 650K LOD Laundromat datasets (zipped N-Triples, plus a SPARQL endpoint for the metadata) into a single file.]
LOD-a-lot
Kudos to Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
28B triples
Disk size:
- HDT: 304 GB
- HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query): 15.7 GB of RAM (3% of the size), 144 seconds loading time
on a 305€ machine: 8 cores (2.6 GHz), 32 GB RAM, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ indexing took 8 h & 250 GB RAM)
LOD-a-lot (some numbers)
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
LOD-a-lot (some use cases):
- Query resolution at Web scale
- Evaluation and Benchmarking: no excuse!
- RDF metrics and analytics [Figure: term distributions over subjects, predicates and objects]
ACKs LOD-a-lot
Use case 2: RDF archiving
Andreas Harth: Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015
So far so good... But RDF is evolving
[Figure (after Andreas Harth): update rate (month, week, day, hour, minute, second) vs. number of sources (10^0 to 10^6). DBpedia, BTC and Dyldo update slowly; the Internet of Things and Virtual/Augmented Reality push towards many sources updated every second. And versions? Where does LOD-a-lot fit?]
Most Semantic Web/Linked Data tools are focused on this "static view" but do not consider versioning/evolution.
Linked Data Archives: the missing link in the RDF evolution
Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!
Web archives: Common Crawl, Internet Memory, Internet Archive, …
Preservation matters
…in the last few years, one of the fundamental problems in the Web of Data:
- Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
- Archives, tools and benchmarking: BEAR (BEnchmark of RDF ARchives), v-RDFCSA, RDF evolution at scale
RDF Archiving. Archiving policies
a) Independent Copies/Snapshots (IC): each version is stored as a full snapshot.
   V1: ex:C1 ex:hasProfessor ex:P1 .  ex:S1 ex:study ex:C1 .  ex:S2 ex:study ex:C1 .
   V2: ex:C1 ex:hasProfessor ex:P1 .  ex:S1 ex:study ex:C1 .  ex:S3 ex:study ex:C1 .
   V3: ex:C1 ex:hasProfessor ex:P2 .  ex:C1 ex:hasProfessor ex:S2 .  ex:S1 ex:study ex:C1 .  ex:S3 ex:study ex:C1 .
b) Change-based approach (CB): store V1 plus the added/deleted triples of each version.
   V1 to V2:  + ex:S3 ex:study ex:C1 .   - ex:S2 ex:study ex:C1 .
   V2 to V3:  + ex:C1 ex:hasProfessor ex:P2 .  + ex:C1 ex:hasProfessor ex:S2 .   - ex:C1 ex:hasProfessor ex:P1 .
c) Timestamp-based approach (TB): each triple is annotated with the versions where it holds.
   ex:C1 ex:hasProfessor ex:P1 [V1,V2] .  ex:C1 ex:hasProfessor ex:P2 [V3] .  ex:C1 ex:hasProfessor ex:S2 [V3] .
   ex:S1 ex:study ex:C1 [V1,V2,V3] .  ex:S2 ex:study ex:C1 [V1] .  ex:S3 ex:study ex:C1 [V2,V3] .
A RETRIEVAL MEDIATOR sits on top of each policy to resolve queries.
Queries and systems
We implemented and evaluated archiving systems on Jena-TDB and HDT, based on the IC, CB and TB policies.
These serve as an initial baseline to compare archiving systems.
More info: https://aic.ai.wu.ac.at/qadlod/bear.html
BEAR: Benchmarking the Efficiency of RDF Archiving
Benchmarking: define the queries
Instantiation of archive queries in AnQL [1]:

Mat(Q,Vi): version materialization
SELECT * WHERE { Q :[v1] }

Diff(Q,V1,V2): delta materialization
SELECT * WHERE {
  { { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V) }
  UNION
  { { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V) }
}

Ver(Q): results of Q annotated with the version
SELECT * WHERE { Q :?V }

join(Q1,vi,Q2,vj): join the results of two queries in two versions
SELECT * WHERE { {Q :[v1]} {Q :[v2]} }

Change(Q): returns consecutive versions in which the Diff of a query is not null
SELECT ?V1 ?V2 WHERE {
  { {Q :?V1} MINUS {Q :?V2} } UNION
  { {Q :?V2} MINUS {Q :?V1} }
  FILTER( abs(?V1-?V2) = 1 )
}

An open question remains: what is the right query syntax for archive queries?

[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72-95, March 2012.
Time-based access. Queries
Materialize (s,?,? ; version)
diff(?,?,o ; version_0 ; version_t)
RDFCSA: Compressed Suffix Array
v-RDFCSA [2] is designed as a lightweight TB approach.
Version information encoding:
- Any triple can be identified by the position of its subject within the suffix array (SA).
- Let N be the number of different versions and n the number of version-oblivious triples.
- Two encoding strategies:
  - tpv: N bitsequences Bv_i[1,n] to encode which triples appear in version i
  - vpt: n bitsequences Bt_k[1,N] to encode the versions where the k-th triple occurs
Self-Indexing RDF Archives: v-RDFCSA
Example with N = 3 versions and n = 5 triples:
tpv (one bitsequence per version, over triples 1..5):
  Bv1 = 0 1 1 0 1
  Bv2 = 0 1 0 1 0
  Bv3 = 1 0 0 0 1
vpt (one bitsequence per triple, over versions 1..3; the transpose of tpv):
  Bt1 = 0 0 1   Bt2 = 1 1 0   Bt3 = 1 0 0   Bt4 = 0 1 0   Bt5 = 1 0 1
[2] Ana Cerdeira-Pena, Antonio Fariña, Javier D. Fernández, and Miguel A. Martínez-Prieto. Self-Indexing RDF Archives. Data Compression Conference (DCC), 2016.
v-RDFCSA performs more than one order of magnitude faster than Jena-TDB for query resolution.
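To make the two layouts concrete, here is a minimal Java sketch, assuming plain java.util.BitSet instead of the compressed rank/select bitsequences that v-RDFCSA actually uses (an illustration, not the real implementation). It builds tpv and vpt for the toy example above and shows which operations each layout favors:

import java.util.BitSet;

public class VersionEncodingSketch {
    public static void main(String[] args) {
        final int N = 3, n = 5; // versions, version-oblivious triples
        // tpv: one bitsequence per version (the rows Bv1..Bv3 above)
        boolean[][] bits = {
            {false, true,  true,  false, true },  // Bv1 = 01101
            {false, true,  false, true,  false},  // Bv2 = 01010
            {true,  false, false, false, true }   // Bv3 = 10001
        };
        BitSet[] Bv = new BitSet[N];
        for (int i = 0; i < N; i++) {
            Bv[i] = new BitSet(n);
            for (int k = 0; k < n; k++) if (bits[i][k]) Bv[i].set(k);
        }
        // vpt is the transpose: one bitsequence per triple (Bt1..Bt5)
        BitSet[] Bt = new BitSet[n];
        for (int k = 0; k < n; k++) {
            Bt[k] = new BitSet(N);
            for (int i = 0; i < N; i++) if (Bv[i].get(k)) Bt[k].set(i);
        }
        // Mat(Q,Vi): tpv checks version membership of candidate triples directly
        System.out.println("Triple 2 in V1? " + Bv[0].get(1));          // true
        // Ver(Q): vpt lists all versions of a triple in one read
        System.out.println("Versions of triple 5 (0-based): " + Bt[4]); // {0, 2} = V1, V3
        // Diff(Vi,Vj): XOR of two version bitsequences under tpv
        BitSet diff = (BitSet) Bv[0].clone();
        diff.xor(Bv[2]);
        System.out.println("Triples changed between V1 and V3 (0-based): " + diff);
    }
}

In short, tpv makes version-centric operations (Mat, Diff) cheap, while vpt favors triple-centric ones (Ver).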
Use case 3: Linked Open/Closed Data (Linked Data markets)
[Figure: next to the Linked Open Data Cloud (e.g. dbpedia), a Linked Closed Data Cloud of private graphs (G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c).]
So far so good, but... Linked Open/Closed Data: the "Deep Semantic Web"
Linked Open/Closed Data
A) Efficient Exchange: Compression + Encryption (hdtcrypt)
B) A secure LD Endpoint
Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal. In ESWC'17.
Future work:
Hands on!
Find these slides in: https://aic.ai.wu.ac.at/qadlod/presentations/keystoneHandsOn2017.pdf
https://aic.ai.wu.ac.at/qadlod/presentations/codeKeystone2017
Consuming HDT
1) Desktop tool HDT-it! (thanks to Mario Arias)
Download the tool for your OS: http://www.rdfhdt.org/downloads/
Get an HDT dataset from the web (http://www.rdfhdt.org/datasets/ or http://lodlaundromat.org/wardrobe/), or convert your RDF dataset with the tool.
Suggested small datasets: SWDF (242K triples) or the bigger DBLP (55M triples).
Consuming HDT
2) Command line Tools (C++ and Java)
Feature comparison (rdfhdt.org):
                              HDT-C++   HDT-Java
  Command line tools          X         X
  TP search                   X         X
  Full SPARQL                 -         with Jena
  Parametrizable compression  X         -
  Full text support           X         -
  Practical uses              LDF       Jena, Fuseki
For simplicity, in this lecture we will use Java.
Download the hdt-java library from https://github.com/rdfhdt/hdt-java/
git clone https://github.com/rdfhdt/hdt-java.git
or download https://github.com/rdfhdt/hdt-java/archive/master.zip
Install the library with maven:
mvn install
Query an HDT file: go to hdt-java-cli and execute:
./bin/hdtSearch.sh /path/to/your/hdt
This will open a simple console where you can query triple patterns (see the sample session below).
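For instance, a session over the SWDF dataset could look like this (illustrative only: the exact prompt and output of hdtSearch may differ between versions; a pattern is three whitespace-separated terms, and ? acts as a wildcard):

$> ./bin/hdtSearch.sh swdf.hdt
>> ? ? ?
(enumerates all triples)
>> ? http://xmlns.com/foaf/0.1/name ?
(all triples using the foaf:name predicate)
>> exit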
Export/Import
$> rdf2hdt file.nt output.hdt
$> hdt2rdf file.hdt output.nt
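The same conversion can also be done programmatically with the library; a minimal sketch (file names and base URI are placeholders):

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class Rdf2HdtSketch {
    public static void main(String[] args) throws Exception {
        // Parse the N-Triples file and build the HDT representation
        HDT hdt = HDTManager.generateHDT("file.nt", "http://example.org/base",
                                         RDFNotation.NTRIPLES, new HDTSpecification(), null);
        // Serialize it to disk (the equivalent of rdf2hdt)
        hdt.saveToHDT("output.hdt", null);
        hdt.close();
    }
}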
Consuming HDT
3) Set up a SPARQL Endpoint with HDT and Fuseki
Go to hdt-fuseki and compile, adding the dependencies:
mvn package dependency:copy-dependencies
Run fuseki:
./bin/hdtEndpoint.sh --hdt path/to/dataset.hdt /mydataset
Open your Web browser and go to: http://localhost:3030
Select Control Panel / Dataset / myDataset and click Select.
Type your SPARQL query and see the results.
Be careful with the number of results: unlike e.g. Virtuoso, there is no built-in limit on the number of returned results, so always bound your queries:
SELECT * WHERE { ?s ?p ?o } LIMIT 400
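You can also query the endpoint programmatically with Jena's ARQ. A sketch; the service URL, in particular the /mydataset/sparql path, is an assumption that depends on how hdt-fuseki registered your dataset:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class FusekiClientSketch {
    public static void main(String[] args) {
        String service = "http://localhost:3030/mydataset/sparql"; // assumed endpoint path
        String sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";
        // Send the query over HTTP and print the result table
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, sparql)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
    }
}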
Consuming HDT
4) Set up a Linked Data Fragments Endpoint with HDT
Download an LDF server (the Node.js one is the best, but we will use Java for simplicity of installation):
git clone https://github.com/LinkedDataFragments/Server.Java.git
or download https://github.com/LinkedDataFragments/Server.Java/archive/master.zip
Install the server, skipping the tests (they fail :):
mvn install -Dmaven.test.skip=true
Open the file config-example.json and modify the settings to point to your HDT, e.g.
"settings": { "file": "/home/user/myfile.hdt" }
Run the server:
java -jar target/ldf-server.jar
Access http://localhost:8080
Consuming HDT
5) Access with the HDT C++/Java libraries (again, we restrict ourselves to Java here)
JAVADOC:
http://purl.org/HDT/javadoc/api
http://purl.org/HDT/javadoc/core
I will refer to Eclipse and Maven, but you can use your preferred environment.
Consuming HDT / HDT-java library
Setting up the environment…
Create a new maven project: select to create a simple project (skip archetype selection) and fill in any metadata.
Include the maven dependency of hdt-java-core in the pom.xml (a sketch follows).
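A sketch of the dependency block (the version number is illustrative, assuming the org.rdfhdt artifacts on Maven Central; check for the latest release):

<dependency>
    <groupId>org.rdfhdt</groupId>
    <artifactId>hdt-java-core</artifactId>
    <version>2.0</version>
</dependency>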
Finally, let's create a new Class and query our HDT (a minimal sketch follows).
Consuming HDT / HDT-java library
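A minimal sketch of such a class, along the lines of the hdt-java README (the file path is a placeholder):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;
import org.rdfhdt.hdt.triples.TripleString;

public class HDTQueryExample {
    public static void main(String[] args) throws Exception {
        // Map the HDT file (plus its HDT-FoQ index) instead of loading it fully into memory
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        try {
            // Triple-pattern search: the empty string acts as a wildcard
            IteratorTripleString it = hdt.search("", "", "");
            while (it.hasNext()) {
                TripleString ts = it.next();
                System.out.println(ts.getSubject() + " " + ts.getPredicate() + " " + ts.getObject());
            }
        } finally {
            hdt.close();
        }
    }
}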
Exercises: test other queries; get the S, P and O of each solution.
Let’s access the dictionary of terms in HDT
Consuming HDT / HDT-java library
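A sketch of dictionary access (the predicate used in stringToId is a placeholder):

import org.rdfhdt.hdt.dictionary.Dictionary;
import org.rdfhdt.hdt.enums.TripleComponentRole;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;

public class HDTDictionaryExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        Dictionary dict = hdt.getDictionary();
        // Basic statistics of the dictionary sections
        System.out.println("Subjects:   " + dict.getNsubjects());
        System.out.println("Predicates: " + dict.getNpredicates());
        System.out.println("Objects:    " + dict.getNobjects());
        System.out.println("Shared S-O: " + dict.getNshared());
        // Map a term to its ID and back again
        long id = dict.stringToId("http://xmlns.com/foaf/0.1/name", TripleComponentRole.PREDICATE);
        if (id > 0) {
            System.out.println(id + " <-> " + dict.idToString(id, TripleComponentRole.PREDICATE));
        }
        hdt.close();
    }
}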
Exercises: open two HDT files; use the dictionaries to get the common predicates used in both.
Let’s access the terms as IDs
Consuming HDT / HDT-java library
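A sketch of ID-level access (0 is the wildcard at the ID level; asking for subject 1 below is an arbitrary choice):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleID;
import org.rdfhdt.hdt.triples.TripleID;

public class HDTIdExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        // ID-level triple-pattern search: (1, 0, 0) = subject 1, any predicate, any object
        IteratorTripleID it = hdt.getTriples().search(new TripleID(1, 0, 0));
        while (it.hasNext()) {
            TripleID t = it.next();
            System.out.println(t.getSubject() + " " + t.getPredicate() + " " + t.getObject());
        }
        hdt.close();
    }
}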
Exercises: use the estimation of results to count the cardinality of all subjects; we can build a histogram and see the distribution (a sketch follows).
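A sketch of this exercise, assuming subject IDs run from 1 to getNsubjects() and using estimatedNumResults() to read each cardinality without iterating:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleID;
import org.rdfhdt.hdt.triples.TripleID;

public class SubjectHistogramSketch {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        long nSubjects = hdt.getDictionary().getNsubjects();
        long[] histogram = new long[64]; // bucket b counts subjects with ~2^b triples
        for (long s = 1; s <= nSubjects; s++) {
            // (s, ?, ?) at the ID level
            IteratorTripleID it = hdt.getTriples().search(new TripleID(s, 0, 0));
            long card = it.estimatedNumResults();
            histogram[(int) (Math.log(card + 1) / Math.log(2))]++;
        }
        for (int b = 0; b < histogram.length; b++)
            if (histogram[b] > 0)
                System.out.println("~2^" + b + " triples: " + histogram[b] + " subjects");
        hdt.close();
    }
}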
Consuming HDT
6) Query full SPARQL with Jena and HDT
First, include the hdt-jena dependency in pom.xml.
Import HDT into a model and query! (A minimal sketch follows.)
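A minimal sketch (the hdt-jena artifact provides HDTGraph; the query and file path are placeholders):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HDTJenaExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        // Wrap the HDT file as a (read-only) Jena Model
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
        String sparql = "SELECT DISTINCT ?x WHERE { ?x "
                + "<http://www.w3.org/2000/01/rdf-schema#label> \"Axel Polleres\" }";
        // Full SPARQL is resolved by Jena ARQ on top of HDT triple-pattern search
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
        hdt.close();
    }
}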
Exercise: test other queries over your data.
+) Query LOD-a-lot
First, get the correct hdt-java branch to deal with really long IDs:
git clone -b long-dict-id https://github.com/rdfhdt/hdt-java/
Install, skipping the tests:
mvn install -Dmaven.test.skip=true
Increase the Java heap space:
export MAVEN_OPTS="-Xmx25G"
In hdt-java-cli:
./bin/hdtSearch.sh /media/javi/data/lod-a-lot/LOD_a_lot_v1.hdt
Consuming HDT
Let the lecture… end
We are currently facing Big Linked Data challenges in generation, publication and consumption.
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow.
Compression is not just about space:
- Fast exchange
- Fast processing/management
- Fast querying
Compression democratizes access to Big Linked Data = cheap, scalable consumers.
Take-home messages
Thank you!
Let the lecture… end