
Post on 29-Aug-2014

DESCRIPTION

Slides of my presentation at WWW 2012 PhD Symposium

TRANSCRIPT

Page 1: Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data

Javier D. Fernández

Supervised by: Miguel A. Martínez-Prieto and Claudio Gutierrez

University of Valladolid (Spain)

University of Chile (Chile)

Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data

PhD Symposium

Page 2:

Brief RDF Introduction

(1) Resource Description Framework
    Webs, services, protocols
    Persons, Proteins, geography…
(2) A standard model for data exchange on the Web
    Understandable by computers
(3) W3C Recommendation (2004)
(4) Data model
    (subject, predicate, object)

Page 3:

RDF Example

(Figure: an RDF graph. The recoverable labels are listed below.)

URIs: <http://books/book21>, <http://books/author33>, <http://myblog/lectures>

Literals: “Spain in the Heart”, “Pablo Neruda”

Blank node: _collection

Prefixed name: lectures:to_read_list

Subject, Predicate, Object: (U,B), U, (U,B,L)
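
The position rule on this slide, (U,B), U, (U,B,L), can be sketched in Python as a small check. The class names and the helper `is_valid_triple` are illustrative and not part of any RDF library. The predicates (dc:title, dc:author, foaf:name) are borrowed from the dictionary slide later in the deck, and which edges connect which nodes is an assumption, since the figure itself is not in the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Blank:
    label: str


@dataclass(frozen=True)
class Literal:
    text: str


def is_valid_triple(s, p, o):
    """Check term positions against (U,B), U, (U,B,L)."""
    return (
        isinstance(s, (URI, Blank))
        and isinstance(p, URI)
        and isinstance(o, (URI, Blank, Literal))
    )


book = URI("http://books/book21")
author = URI("http://books/author33")

# Assumed edges: the transcript only preserves the node labels.
triples = [
    (book, URI("dc:title"), Literal("Spain in the Heart")),
    (book, URI("dc:author"), author),
    (author, URI("foaf:name"), Literal("Pablo Neruda")),
]

all_valid = all(is_valid_triple(*t) for t in triples)

# A literal is not allowed in the subject position.
literal_subject_ok = is_valid_triple(Literal("Pablo Neruda"), URI("foaf:name"), author)
```

Here `all_valid` is true for the three example triples, while `literal_subject_ok` is false because literals may only appear as objects.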

Page 4:

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
4. Include links to other URIs, so that they can discover more things.

Page 5: (image-only slide)

Page 6:

Scalability problems

DBpedia (en): 233 M triples, ~33 GB

Uniprot: 845 M triples, ~230 GB

Publish?

Exchange?

Process/Consume/Query?

Page 7:

RDF Publication

No Recommendations/methodology to publish at large scale

Related Work: Some metadata for discovery, such as VoID and Semantic Sitemaps.

RDF dump

SPARQL Endpoints/APIs

dereferenceable URIs

sensor

Page 8:

RDF Exchanging issues

RDF/XML, N3, Turtle, JSON.

Document-centric (verbose) vs. data-centric view (machine)

No structure (chunks, universal compression)

Related Work: Universal compression (gzip, bzip2) and the Efficient XML Interchange Format (EXI).

Page 9:

RDF Processing/Consumption (After Exchanging)

Costly Post-processing

Decompression

Indexing (RDF Store)

Finally… consume

Related Work (indexing): Based on Relational Storage (Virtuoso), Multi-indexes (RDF-3X), Distributed Systems (MapReduce) and others (BitMat).

Page 10:

The scalability problems have a major impact on users

Would you download hundreds of GB...

… if you don’t know exactly what they contain,

that need costly exchange and post-processing,

and require a powerful store to query them?

Page 11:

In the following...

1. Proposed approach for scalable publishing, exchanging and consumption of large RDF datasets

2. Preliminary results

3. Methodology

4. On-going work and conclusions

Page 12:

An integrated solution

We call for, and we study in this thesis, a Binary RDF Serialization format:

Machine oriented (binary)

Clean publication

Metadata

Modular

Efficient exchange

Compression

Basic data operations

Easy to parse and consume

Primitive query resolution

Page 13:

HDT Overview

Page 14:

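Dictionary+Triples partition

Dictionary terms, in the order shown (the figure numbers them 1 to 7):

<http://books/author33>
<http://books/book21>
dc:author
dc:title
foaf:name
“Pablo Neruda”
“Spain in the Heart”

Node labels in the triples graph (IDs): 2, 1, 7, 6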
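
The partition on this slide can be sketched in Python as follows. This is a minimal illustration that assumes a single ID space in which the seven terms receive IDs 1 to 7 in the order listed, and that the edges follow the book/author example. The names `term_to_id`, `encode` and `decode` are made up for this sketch and do not come from the HDT tools.

```python
# Terms in the order listed on the slide; IDs are assumed to be 1..7.
terms = [
    "<http://books/author33>",
    "<http://books/book21>",
    "dc:author",
    "dc:title",
    "foaf:name",
    '"Pablo Neruda"',
    '"Spain in the Heart"',
]

# Dictionary component: a two-way mapping between terms and integer IDs.
term_to_id = {term: i for i, term in enumerate(terms, start=1)}
id_to_term = {i: term for term, i in term_to_id.items()}


def encode(string_triples):
    """Replace every term of every triple by its dictionary ID."""
    return [tuple(term_to_id[t] for t in triple) for triple in string_triples]


def decode(id_triples):
    """Map ID triples back to their string terms."""
    return [tuple(id_to_term[i] for i in triple) for triple in id_triples]


# Assumed edges of the example graph.
string_triples = [
    ("<http://books/book21>", "dc:author", "<http://books/author33>"),
    ("<http://books/book21>", "dc:title", '"Spain in the Heart"'),
    ("<http://books/author33>", "foaf:name", '"Pablo Neruda"'),
]

# Triples component: the same graph expressed over IDs only.
id_triples = encode(string_triples)
```

Under these assumptions the Triples component holds only small integers, (2, 3, 1), (2, 4, 7) and (1, 5, 6), while each long string is stored once in the Dictionary.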

Page 15:

Key concepts: The Dictionary

Largest component (up to 74%)

Long URIs, shared prefixes

Lang, datatype tags in literals

Efficient ID↔String operations

We plan to work on a specific organization which

Optimizes space (regularities)

Provides efficient performance in operations

Page 16:

Preliminary results in Rich Functional Dictionaries

We propose to adapt techniques for string dictionaries;

Front-Coding

Making dictionary partitions

[*] Compression of RDF Dictionaries. Miguel A. Martínez-Prieto, Javier D. Fernández, Rodrigo Cánovas. ACM Symposium on Applied Computing (SAC 2012).
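
Front-Coding, named on this slide, exploits the shared prefixes of sorted strings such as URIs: each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The following is a plain, unbucketed sketch in Python and not the implementation from the cited paper; the function names are illustrative and the third URI is a hypothetical addition to show a longer shared prefix.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode each string as (shared prefix length, suffix) relative to the previous one."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        k = common_prefix_len(prev, s)
        encoded.append((k, s[k:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the original strings by replaying the prefixes in order."""
    decoded = []
    prev = ""
    for k, suffix in encoded:
        s = prev[:k] + suffix
        decoded.append(s)
        prev = s
    return decoded


uris = sorted([
    "http://books/author33",
    "http://books/book21",
    "http://books/book22",  # hypothetical extra entry, not on the slides
])

encoded = front_encode(uris)
# encoded == [(0, "http://books/author33"), (13, "book21"), (18, "2")]
```

Decoding must replay the entries from the start of the list, which is why practical dictionaries usually split the list into buckets whose first string is stored in full.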

Page 17:

Key concepts: Triples

Specific compression:

More efficient compression than just gzip.

Data indexing for consumption:

Allows direct pattern resolution without decompression

(s,p,o), (s,?p,?o) and (s,p,?o)

We plan to work on a specific technique which

optimizes space

provides efficient performance in primitive operations

Page 18:

Preliminary results in Triples Encoding

We propose to use Bitmap indexes:

[*] Compact Representation of Large RDF Data Sets for Publishing and Exchange. Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutierrez. International Semantic Web Conference (ISWC 2010).
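
The slide's figure is not in the transcript, so the following Python sketch only illustrates the general idea of encoding ID triples with bitmaps: triples are sorted by subject, predicate and object, and two bit sequences mark where each subject's predicate list and each (subject, predicate) pair's object list end. The layout, the names `build_bitmap_triples`, `group_range` and `match`, and the linear-scan lookup are assumptions of this sketch; the cited paper's exact structure may differ, and a real implementation would use succinct rank/select bitmaps. It also assumes subject IDs are 1..n with no gaps.

```python
from itertools import groupby


def build_bitmap_triples(id_triples):
    """Return (sp, bp, so, bo): predicate and object sequences with end-of-list bitmaps."""
    sp, bp, so, bo = [], [], [], []
    ordered = sorted(set(id_triples))
    for _, by_subject in groupby(ordered, key=lambda t: t[0]):
        for (_, p), by_pair in groupby(by_subject, key=lambda t: (t[0], t[1])):
            sp.append(p)
            bp.append(0)
            for _, _, o in by_pair:
                so.append(o)
                bo.append(0)
            bo[-1] = 1  # last object of this (subject, predicate) pair
        bp[-1] = 1  # last predicate of this subject
    return sp, bp, so, bo


def group_range(bitmap, k):
    """Half-open span of the k-th group (1-based); a linear stand-in for select."""
    start, seen = 0, 0
    for i, bit in enumerate(bitmap):
        if bit:
            seen += 1
            if seen == k:
                return start, i + 1
            start = i + 1
    raise IndexError(k)


def match(index, s, p=None):
    """Resolve (s, ?p, ?o), or (s, p, ?o) when p is given, without rebuilding the triples."""
    sp, bp, so, bo = index
    lo, hi = group_range(bp, s)
    for j in range(lo, hi):
        if p is not None and sp[j] != p:
            continue
        olo, ohi = group_range(bo, j + 1)
        for o in so[olo:ohi]:
            yield (s, sp[j], o)


index = build_bitmap_triples([(2, 3, 1), (2, 4, 7), (1, 5, 6)])
subject_2 = list(match(index, 2))        # pattern (s, ?p, ?o)
subject_2_title = list(match(index, 2, 4))  # pattern (s, p, ?o)
```

Subject IDs never need to be stored: the position of a subject's group in the first bitmap identifies it, which is where part of the space saving in this kind of layout comes from.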

Page 19:

Methodology

RDF structure in theory and practice.

Binary RDF Specification.

Succinct Dictionaries.

Triples Indexes.

Practical deployment.

Page 20:

Some Results… HDT acknowledged as a W3C Member Submission:

http://www.w3.org/Submission/2011/03/

supported by:

Page 21:

Some Results... HDT for exchanging

Page 22:

Some Results... HDT for consumption

Direct Consumption, without decompression after exchanging

Example of use: HDT-it (Thanks to Mario Arias, DERI)

Page 23:

On-going promising work: HDT-FoQ

[*] Exchange and Consumption of Huge RDF Data. Miguel A. Martínez-Prieto, Mario Arias, Javier D. Fernández. Extended Semantic Web Conference (ESWC 2012). To appear.

Page 24:

In conclusion

Binary RDF aims to lighten the Web of Data:

Logical decomposition: Header, Dictionary, and Triples

Clean publication

Compressed RDF format for exchanging

Machine-friendly, direct consumption

Rich Functional Dictionary/Triples representations for querying

Page 25:

Still much work on…

Getting a global understanding of the real structure of RDF networks.

Applying this knowledge in innovative dictionary and triples indexes.

Full SPARQL support at consumption

Supporting dynamic operations

inserting, deleting, and updating binary RDF

Page 26:

Thanks!

HDT: http://www.rdfhdt.org/

Group: http://dataweb.infor.uva.es/

Slides: http://www.slideshare.net/javifer

Javier D. Fernández

Supervised by: Miguel A. Martínez-Prieto, Claudio Gutierrez

University of Valladolid (Spain)

University of Chile (Chile)
