TRANSCRIPT
Compressed RDF: Practical Uses & Hands-on
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School: Keyword Search in Big Linked Data
Session I (09:00-10:30) "Basics of Compression for Big Linked Data Management"
Big (Linked) Semantic Data Compression: motivation & challenges
Compact Data Structures
Session II (13:30-15:00) "RDF Compression"
RDF Compression. HDT
RDF Dictionaries
RDF Triples
Session III (15:30-17:00) "Compressed RDF: Practical Uses & Hands-on"
Practical Uses (LOD-a-lot, RDF Archiving, etc.)
Hands on
General agenda
Practical uses
LOD-a-lot: Web-scale queries in your pocket
RDF archiving
Linked Data markets (Linked Closed Data)
Hands on
HDT-it
Command line tools
HDT and Fuseki
HDT and Linked Data Fragments
HDT and C++/Java
HDT and Jena
Agenda of this session
Use case 1: LOD-a-lot
E.g. retrieve all entities in LOD with the label "Axel Polleres"
Options:
Crawl and index LOD locally (not feasible)
Follow-your-nose (where should I start?)
Federated querying (as good as the endpoints you query)
Use LOD Laundromat as a "good approximation" (still querying 650K datasets)
Still… what about Web-scale queries?
SELECT DISTINCT ?x { ?x rdfs:label "Axel Polleres" }
[Figure: the Linked Open Data cloud is crawled and cleaned by LOD Laundromat into 650K datasets, each republished as zipped N-Triples, plus a SPARQL endpoint for the metadata; LOD-a-lot integrates all of them.]
But what about Web-scale queries?
- flashback -
The real motivation: consume
"Oh man, I'm hungry, and I don't even know if I will like whatever you are cooking."
image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/
LOD-a-lot
But what about Web-scale queries? But one could be really hungry…
image: https://hwy55burgers.wordpress.com/tag/food-challenge/
[Figure: the same pipeline, with LOD-a-lot integrating the 650K LOD Laundromat datasets (zipped N-Triples, plus a SPARQL endpoint for the metadata) into a single file.]
LOD-a-lot
Kudos to Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
28B triples
Disk size:
- HDT: 304 GB
- HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query): 15.7 GB of RAM (3% of the size), 144 seconds loading time
on a 305€ machine: 8 cores (2.6 GHz), 32 GB RAM, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds
(LOD-a-lot creation took 64 h & 170 GB RAM; HDT-FoQ indexing took 8 h & 250 GB RAM)
LOD-a-lot (some numbers)
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
LOD-a-lot (some use cases):
- Query resolution at Web scale
- Evaluation and Benchmarking: no excuse!
- RDF metrics and analytics [Figure: term distributions over subjects, predicates and objects]
ACKs LOD-a-lot
Use case 2: RDF archiving
Andreas Harth: Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015
So far so good... But RDF is evolving
[Figure (after Andreas Harth): update rate (month, week, day, hour, minute, second) vs. number of sources (10^0 to 10^6). DBpedia, BTC and Dyldo update slowly; the Internet of Things and Virtual/Augmented Reality push towards many sources updated every second. And versions? Where does LOD-a-lot fit?]
Most Semantic Web/Linked Data tools are focused on this "static view" but do not consider versioning/evolution.
Linked Data Archives: the missing link in the RDF evolution
Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!
Web archives: Common Crawl, Internet Memory, Internet Archive, …
Preservation matters
…in the last few years, one of the fundamental problems in the Web of Data:
- Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
- Archives, tools and benchmarking: BEAR (BEnchmark of RDF ARchives), v-RDFCSA, RDF evolution at scale
RDF Archiving. Archiving policies
a) Independent Copies/Snapshots (IC): each version is stored as a full snapshot.
   V1: ex:C1 ex:hasProfessor ex:P1 .  ex:S1 ex:study ex:C1 .  ex:S2 ex:study ex:C1 .
   V2: ex:C1 ex:hasProfessor ex:P1 .  ex:S1 ex:study ex:C1 .  ex:S3 ex:study ex:C1 .
   V3: ex:C1 ex:hasProfessor ex:P2 .  ex:C1 ex:hasProfessor ex:S2 .  ex:S1 ex:study ex:C1 .  ex:S3 ex:study ex:C1 .
b) Change-based approach (CB): store V1 plus the added/deleted triples of each version.
   V1 to V2:  + ex:S3 ex:study ex:C1 .   - ex:S2 ex:study ex:C1 .
   V2 to V3:  + ex:C1 ex:hasProfessor ex:P2 .  + ex:C1 ex:hasProfessor ex:S2 .   - ex:C1 ex:hasProfessor ex:P1 .
c) Timestamp-based approach (TB): each triple is annotated with the versions where it holds.
   ex:C1 ex:hasProfessor ex:P1 [V1,V2] .  ex:C1 ex:hasProfessor ex:P2 [V3] .  ex:C1 ex:hasProfessor ex:S2 [V3] .
   ex:S1 ex:study ex:C1 [V1,V2,V3] .  ex:S2 ex:study ex:C1 [V1] .  ex:S3 ex:study ex:C1 [V2,V3] .
A RETRIEVAL MEDIATOR sits on top of each policy to resolve queries.
Queries and systems
We implemented and evaluated archiving systems on Jena-TDB and HDT, based on the IC, CB and TB policies.
These serve as an initial baseline to compare archiving systems.
More info: https://aic.ai.wu.ac.at/qadlod/bear.html
BEAR: Benchmarking the Efficiency of RDF Archiving
Benchmarking: define the queries
Instantiation of archive queries in AnQL [1]:

Mat(Q,Vi): version materialization
SELECT * WHERE { Q :[v1] }

Diff(Q,V1,V2): delta materialization
SELECT * WHERE {
  { { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V) }
  UNION
  { { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V) }
}

Ver(Q): results of Q annotated with the version
SELECT * WHERE { Q :?V }

join(Q1,vi,Q2,vj): join the results of two queries in two versions
SELECT * WHERE { {Q :[v1]} {Q :[v2]} }

Change(Q): returns consecutive versions in which the Diff of a query is not null
SELECT ?V1 ?V2 WHERE {
  { {Q :?V1} MINUS {Q :?V2} } UNION
  { {Q :?V2} MINUS {Q :?V1} }
  FILTER( abs(?V1-?V2) = 1 )
}

An open question remains: what is the right query syntax for archive queries?

[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72-95, March 2012.
Time-based access. Queries
Materialize (s,?,? ; version)
diff(?,?,o ; version_0 ; version_t)
RDFCSA: Compressed Suffix Array
v-RDFCSA [2] is designed as a lightweight TB approach.
Version information encoding:
- Any triple can be identified by the position of its subject within the suffix array (SA).
- Let N be the number of different versions and n the number of version-oblivious triples.
- Two encoding strategies:
  - tpv: N bitsequences Bv_i[1,n] to encode which triples appear in version i
  - vpt: n bitsequences Bt_k[1,N] to encode the versions where the k-th triple occurs
Self-Indexing RDF Archives: v-RDFCSA
Example with N = 3 versions and n = 5 triples:
tpv (one bitsequence per version, over triples 1..5):
  Bv1 = 0 1 1 0 1
  Bv2 = 0 1 0 1 0
  Bv3 = 1 0 0 0 1
vpt (one bitsequence per triple, over versions 1..3; the transpose of tpv):
  Bt1 = 0 0 1   Bt2 = 1 1 0   Bt3 = 1 0 0   Bt4 = 0 1 0   Bt5 = 1 0 1
[2] Ana Cerdeira-Pena, Antonio Fariña, Javier D. Fernández, and Miguel A. Martínez-Prieto. Self-Indexing RDF Archives. Data Compression Conference (DCC), 2016.
v-RDFCSA performs more than one order of magnitude faster than Jena-TDB for query resolution.
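To make the two layouts concrete, here is a minimal Java sketch, assuming plain java.util.BitSet instead of the compressed rank/select bitsequences that v-RDFCSA actually uses (an illustration, not the real implementation). It builds tpv and vpt for the toy example above and shows which operations each layout favors:

import java.util.BitSet;

public class VersionEncodingSketch {
    public static void main(String[] args) {
        final int N = 3, n = 5; // versions, version-oblivious triples
        // tpv: one bitsequence per version (the rows Bv1..Bv3 above)
        boolean[][] bits = {
            {false, true,  true,  false, true },  // Bv1 = 01101
            {false, true,  false, true,  false},  // Bv2 = 01010
            {true,  false, false, false, true }   // Bv3 = 10001
        };
        BitSet[] Bv = new BitSet[N];
        for (int i = 0; i < N; i++) {
            Bv[i] = new BitSet(n);
            for (int k = 0; k < n; k++) if (bits[i][k]) Bv[i].set(k);
        }
        // vpt is the transpose: one bitsequence per triple (Bt1..Bt5)
        BitSet[] Bt = new BitSet[n];
        for (int k = 0; k < n; k++) {
            Bt[k] = new BitSet(N);
            for (int i = 0; i < N; i++) if (Bv[i].get(k)) Bt[k].set(i);
        }
        // Mat(Q,Vi): tpv checks version membership of candidate triples directly
        System.out.println("Triple 2 in V1? " + Bv[0].get(1));          // true
        // Ver(Q): vpt lists all versions of a triple in one read
        System.out.println("Versions of triple 5 (0-based): " + Bt[4]); // {0, 2} = V1, V3
        // Diff(Vi,Vj): XOR of two version bitsequences under tpv
        BitSet diff = (BitSet) Bv[0].clone();
        diff.xor(Bv[2]);
        System.out.println("Triples changed between V1 and V3 (0-based): " + diff);
    }
}

In short, tpv makes version-centric operations (Mat, Diff) cheap, while vpt favors triple-centric ones (Ver).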
Use case 3: Linked Open/Closed Data (Linked Data markets)
[Figure: next to the Linked Open Data Cloud (e.g. dbpedia), a Linked Closed Data Cloud of private graphs (G1a, G2a, G3a, G4a, G1b, G2b, G3b, G1c, G2c).]
So far so good, but... Linked Open/Closed Data: the "Deep Semantic Web"
Linked Open/Closed Data
A) Efficient Exchange: Compression + Encryption (hdtcrypt)
B) A secure LD Endpoint
Self-Enforcing Access Control for Encrypted RDF. Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal. In ESWC'17.
Future work:
Hands on!
Find these slides in: https://aic.ai.wu.ac.at/qadlod/presentations/keystoneHandsOn2017.pdf
https://aic.ai.wu.ac.at/qadlod/presentations/codeKeystone2017
Consuming HDT
1) Desktop tool HDT-it! (thanks to Mario Arias)
Download the tool for your OS: http://www.rdfhdt.org/downloads/
Get an HDT dataset from the web (http://www.rdfhdt.org/datasets/ or http://lodlaundromat.org/wardrobe/), or convert your RDF dataset with the tool.
Suggested small datasets: SWDF (242K triples) or the bigger DBLP (55M triples).
Consuming HDT
2) Command line Tools (C++ and Java)
Feature comparison (rdfhdt.org):
                              HDT-C++   HDT-Java
  Command line tools          X         X
  TP search                   X         X
  Full SPARQL                 -         with Jena
  Parametrizable compression  X         -
  Full text support           X         -
  Practical uses              LDF       Jena, Fuseki
For simplicity, in this lecture we will use Java.
Download the hdt-java library from https://github.com/rdfhdt/hdt-java/
git clone https://github.com/rdfhdt/hdt-java.git
or download https://github.com/rdfhdt/hdt-java/archive/master.zip
Install the library with maven:
mvn install
Query an HDT file: go to hdt-java-cli and execute:
./bin/hdtSearch.sh /path/to/your/hdt
This will open a simple console where you can query triple patterns (see the sample session below).
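For instance, a session over the SWDF dataset could look like this (illustrative only: the exact prompt and output of hdtSearch may differ between versions; a pattern is three whitespace-separated terms, and ? acts as a wildcard):

$> ./bin/hdtSearch.sh swdf.hdt
>> ? ? ?
(enumerates all triples)
>> ? http://xmlns.com/foaf/0.1/name ?
(all triples using the foaf:name predicate)
>> exit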
Export/Import
$> rdf2hdt file.nt output.hdt
$> hdt2rdf file.hdt output.nt
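The same conversion can also be done programmatically with the library; a minimal sketch (file names and base URI are placeholders):

import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class Rdf2HdtSketch {
    public static void main(String[] args) throws Exception {
        // Parse the N-Triples file and build the HDT representation
        HDT hdt = HDTManager.generateHDT("file.nt", "http://example.org/base",
                                         RDFNotation.NTRIPLES, new HDTSpecification(), null);
        // Serialize it to disk (the equivalent of rdf2hdt)
        hdt.saveToHDT("output.hdt", null);
        hdt.close();
    }
}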
Consuming HDT
3) Set up a SPARQL Endpoint with HDT and Fuseki
Go to hdt-fuseki and compile, adding the dependencies:
mvn package dependency:copy-dependencies
Run fuseki:
./bin/hdtEndpoint.sh --hdt path/to/dataset.hdt /mydataset
Open your Web browser and go to: http://localhost:3030
Select Control Panel / Dataset / myDataset and click Select.
Type your SPARQL query and see the results.
Be careful with the number of results: unlike e.g. Virtuoso, there is no built-in limit on the number of returned results, so always bound your queries:
SELECT * WHERE { ?s ?p ?o } LIMIT 400
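You can also query the endpoint programmatically with Jena's ARQ. A sketch; the service URL, in particular the /mydataset/sparql path, is an assumption that depends on how hdt-fuseki registered your dataset:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class FusekiClientSketch {
    public static void main(String[] args) {
        String service = "http://localhost:3030/mydataset/sparql"; // assumed endpoint path
        String sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";
        // Send the query over HTTP and print the result table
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, sparql)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
    }
}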
Consuming HDT
4) Set up a Linked Data Fragments Endpoint with HDT
Download an LDF server (the Node.js one is the best, but we will use Java for simplicity of installation):
git clone https://github.com/LinkedDataFragments/Server.Java.git
or download https://github.com/LinkedDataFragments/Server.Java/archive/master.zip
Install the server, skipping the tests (they fail :):
mvn install -Dmaven.test.skip=true
Open the file config-example.json and modify the settings to point to your HDT, e.g.
"settings": { "file": "/home/user/myfile.hdt" }
Run the server:
java -jar target/ldf-server.jar
Access http://localhost:8080
Consuming HDT
5) Access with the HDT C++/Java libraries (again, we restrict ourselves to Java here)
JAVADOC:
http://purl.org/HDT/javadoc/api
http://purl.org/HDT/javadoc/core
I will refer to Eclipse and Maven, but you can use your preferred environment.
Consuming HDT / HDT-java library
Setting up the environment…
Create a new maven project: select to create a simple project (skip archetype selection) and fill in any metadata.
Include the maven dependency of hdt-java-core in the pom.xml (a sketch follows).
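A sketch of the dependency block (the version number is illustrative, assuming the org.rdfhdt artifacts on Maven Central; check for the latest release):

<dependency>
    <groupId>org.rdfhdt</groupId>
    <artifactId>hdt-java-core</artifactId>
    <version>2.0</version>
</dependency>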
Finally, let's create a new Class and query our HDT (a minimal sketch follows).
Consuming HDT / HDT-java library
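A minimal sketch of such a class, along the lines of the hdt-java README (the file path is a placeholder):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;
import org.rdfhdt.hdt.triples.TripleString;

public class HDTQueryExample {
    public static void main(String[] args) throws Exception {
        // Map the HDT file (plus its HDT-FoQ index) instead of loading it fully into memory
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        try {
            // Triple-pattern search: the empty string acts as a wildcard
            IteratorTripleString it = hdt.search("", "", "");
            while (it.hasNext()) {
                TripleString ts = it.next();
                System.out.println(ts.getSubject() + " " + ts.getPredicate() + " " + ts.getObject());
            }
        } finally {
            hdt.close();
        }
    }
}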
Exercises: test other queries; get the S, P and O of each solution.
Let’s access the dictionary of terms in HDT
Consuming HDT / HDT-java library
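A sketch of dictionary access (the predicate used in stringToId is a placeholder):

import org.rdfhdt.hdt.dictionary.Dictionary;
import org.rdfhdt.hdt.enums.TripleComponentRole;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;

public class HDTDictionaryExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        Dictionary dict = hdt.getDictionary();
        // Basic statistics of the dictionary sections
        System.out.println("Subjects:   " + dict.getNsubjects());
        System.out.println("Predicates: " + dict.getNpredicates());
        System.out.println("Objects:    " + dict.getNobjects());
        System.out.println("Shared S-O: " + dict.getNshared());
        // Map a term to its ID and back again
        long id = dict.stringToId("http://xmlns.com/foaf/0.1/name", TripleComponentRole.PREDICATE);
        if (id > 0) {
            System.out.println(id + " <-> " + dict.idToString(id, TripleComponentRole.PREDICATE));
        }
        hdt.close();
    }
}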
Exercises: open two HDT files; use the dictionaries to get the common predicates used in both.
Let’s access the terms as IDs
Consuming HDT / HDT-java library
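A sketch of ID-level access (0 is the wildcard at the ID level; asking for subject 1 below is an arbitrary choice):

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleID;
import org.rdfhdt.hdt.triples.TripleID;

public class HDTIdExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        // ID-level triple-pattern search: (1, 0, 0) = subject 1, any predicate, any object
        IteratorTripleID it = hdt.getTriples().search(new TripleID(1, 0, 0));
        while (it.hasNext()) {
            TripleID t = it.next();
            System.out.println(t.getSubject() + " " + t.getPredicate() + " " + t.getObject());
        }
        hdt.close();
    }
}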
Exercises: use the estimation of results to count the cardinality of all subjects; we can build a histogram and see the distribution (a sketch follows).
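A sketch of this exercise, assuming subject IDs run from 1 to getNsubjects() and using estimatedNumResults() to read each cardinality without iterating:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleID;
import org.rdfhdt.hdt.triples.TripleID;

public class SubjectHistogramSketch {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        long nSubjects = hdt.getDictionary().getNsubjects();
        long[] histogram = new long[64]; // bucket b counts subjects with ~2^b triples
        for (long s = 1; s <= nSubjects; s++) {
            // (s, ?, ?) at the ID level
            IteratorTripleID it = hdt.getTriples().search(new TripleID(s, 0, 0));
            long card = it.estimatedNumResults();
            histogram[(int) (Math.log(card + 1) / Math.log(2))]++;
        }
        for (int b = 0; b < histogram.length; b++)
            if (histogram[b] > 0)
                System.out.println("~2^" + b + " triples: " + histogram[b] + " subjects");
        hdt.close();
    }
}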
Consuming HDT
6) Query full SPARQL with Jena and HDT
First, include the hdt-jena dependency in pom.xml.
Import HDT into a model and query! (A minimal sketch follows.)
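A minimal sketch (the hdt-jena artifact provides HDTGraph; the query and file path are placeholders):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HDTJenaExample {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/swdf.hdt", null);
        // Wrap the HDT file as a (read-only) Jena Model
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
        String sparql = "SELECT DISTINCT ?x WHERE { ?x "
                + "<http://www.w3.org/2000/01/rdf-schema#label> \"Axel Polleres\" }";
        // Full SPARQL is resolved by Jena ARQ on top of HDT triple-pattern search
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
        hdt.close();
    }
}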
Exercise: test other queries over your data.
+) Query LOD-a-lot
First, get the correct hdt-java branch to deal with really long IDs:
git clone -b long-dict-id https://github.com/rdfhdt/hdt-java/
Install, skipping the tests:
mvn install -Dmaven.test.skip=true
Increase the Java heap space:
export MAVEN_OPTS="-Xmx25G"
In hdt-java-cli:
./bin/hdtSearch.sh /media/javi/data/lod-a-lot/LOD_a_lot_v1.hdt
Consuming HDT
Let the lecture… end
We are currently facing Big Linked Data challenges in generation, publication and consumption.
Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow.
Compression is not just about space:
- Fast exchange
- Fast processing/management
- Fast querying
Compression democratizes access to Big Linked Data = cheap, scalable consumers.
Take-home messages
Thank you!
Let the lecture… end