
Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

Thibault Mahieu

Supervisor: Prof. dr. ir. Ruben Verborgh
Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Koen De Bosschere
Faculty of Engineering and Architecture
Academic year 2017-2018


Acknowledgments

I would like to thank my promoter, Prof. dr. ir. Ruben Verborgh, and my supervisors, Ruben Taelman and Dr. ir. Miel Vander Sande, for their help and advice during the course of this Master's dissertation. Their constant guidance and support helped shape this Master's dissertation into what it is today.

I would also like to thank my friends and family, who have supported me throughout the years, especially my brother, Christof Mahieu, to whom I could always turn for support.

Thibault Mahieu May 31, 2018


Usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Thibault Mahieu May 31, 2018


Reducing Storage Requirements of

Multi-Version Graph Databases Using

Forward and Reverse Deltas

by

Thibault Mahieu

Master’s dissertation submitted in order to obtain the academic degree of

Master of Science in Computer Science Engineering

Academic year 2017–2018

Supervisor: Prof. dr. ir. Ruben Verborgh

Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande

Faculty of Engineering and Architecture

Ghent University

Department of Electronics and Information Systems

Chairman: Prof. dr. ir. Koen De Bosschere

Summary

This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH. Finally, the implementation of the presented storage optimization is compared with the original OSTRICH RDF archive.

Keywords

RDF Versioning, RDF Archiving, Semantic Data Versioning, Bidirectional Delta Chain


Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

Thibault Mahieu

Supervisors: prof. dr. ir. Ruben Verborgh, ir. Ruben Taelman, dr. ir. Miel Vander Sande

Abstract—Linked Datasets evolve over time for numerous reasons, such as the addition of new data. Capturing this evolution via versioned data archives can provide new insights. This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH, for which the implementation is called COBRA. Finally, COBRA is compared with the original OSTRICH RDF archive. Our experiments show that COBRA lowers the storage size compared to OSTRICH, but not for every benchmark.

Keywords—RDF Versioning, RDF Archiving, Bidirectional Delta Chain

I. PREFACE

Datasets change over time for numerous reasons, such as the addition of new information. Capturing this evolution allows for historical analyses which can provide new insights. Linked Data is no exception to this. In fact, archiving Linked Open Data has been an area of research for a few years [2].

One particular research focus is enabling offsettable query streams for RDF archives, since query streams are more memory-efficient for large query results and the offset allows for faster queries when only a subset is needed. OSTRICH [3] is state-of-the-art when it comes to offset-enabled RDF archives. OSTRICH stores versions in a delta chain that starts with a fully materialized snapshot followed by a series of changesets relative to that snapshot, referred to as deltas. However, OSTRICH has a large ingestion time for large dataset versions. The ingestion time can be reduced by introducing additional snapshots; however, this, in turn, can increase the storage size.

In this work, we explore whether we can reduce the resulting storage size increase of the multiple-snapshot approach, while maintaining the ingestion time reduction, by restructuring the delta chain.

II. BACKGROUND

A. Linked Data

In 2001, Tim Berners-Lee, the inventor of the World Wide Web, proposed the idea of the Semantic Web [4]. The goal of the Semantic Web is to make data on the Web understandable to machines so that they can perform complex tasks. In order to make this vision a reality, Linked Data (LD) was introduced. As described by Bizer et al. [5], LD refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can, in turn, be linked to from external data sets. The standard for representing LD is RDF, a graph-based data model that uses a <subject, predicate, object> triple structure. RDF data can be queried using SPARQL, a graph-based pattern matching query language, where the graph-based patterns are made up of triple patterns, each consisting of a subject, predicate and object.

B. RDF Stores

RDF stores are storage systems designed for storing RDF data.

HDT [6] is an RDF store focused on compression that consists of three parts:
• Header - contains metadata and serves as an entry point to the data
• Dictionary - a mapping between triple components and unique identifiers, referred to as dictionary encoding
• Triples - the structure of the underlying RDF graph after dictionary encoding
HDT resolves queries on the compressed data, but only has one index (SP-O), making certain triple patterns hard to resolve. In addition, HDT stores are by design immutable after creation, making them unsuitable for volatile datasets.
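To make the dictionary-encoding idea concrete, here is a minimal C++ sketch: an illustrative in-memory mapping with invented names, not HDT's actual compressed, sorted dictionary or its real API.

// Minimal sketch of dictionary encoding: triple components are replaced by
// small integer identifiers, and the identifiers can be decoded back to terms.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Dictionary {
  std::unordered_map<std::string, std::uint32_t> term_to_id;
  std::vector<std::string> id_to_term;

  std::uint32_t encode(const std::string& term) {
    auto it = term_to_id.find(term);
    if (it != term_to_id.end()) return it->second;
    std::uint32_t id = static_cast<std::uint32_t>(id_to_term.size());
    term_to_id.emplace(term, id);
    id_to_term.push_back(term);
    return id;
  }

  const std::string& decode(std::uint32_t id) const { return id_to_term.at(id); }
};

int main() {
  Dictionary dict;
  // After encoding, an RDF triple is reduced to three small integers.
  std::uint32_t s = dict.encode("http://example.org/alice");
  std::uint32_t p = dict.encode("http://xmlns.com/foaf/0.1/knows");
  std::uint32_t o = dict.encode("http://example.org/bob");
  std::cout << s << " " << p << " " << o << "\n";  // e.g. 0 1 2
  std::cout << dict.decode(o) << "\n";             // round-trip back to the term
}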

HDT-FoQ [7] is an extension of HDT [6] that focuses on resolving queries faster. For this reason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more access patterns. The PS-O index makes use of a wavelet tree, while the OP-S index uses adjacency lists, similar to the SP-O index.

C. Non-RDF Archives

Many techniques from non-RDF archives and Version Control Systems (VCS) can be repurposed for versioning RDF archives.

RCS [8] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines. The latest version is stored completely and older revisions are stored as so-called reverse deltas, resulting in quick access to the latest version. To add a new revision, the system stores the new revision completely and replaces the previously latest revision by its delta, keeping the rest of the chain intact.
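The reverse-delta update step can be sketched as follows; this is a simplified, line-based illustration with hypothetical Delta, diff() and apply() placeholders, not RCS's actual file format or code.

// Sketch of RCS-style reverse-delta ingestion: the newest revision is kept in
// full, and the previously newest revision is replaced by the delta that
// reconstructs it from the new head; the rest of the chain stays untouched.
#include <cstddef>
#include <string>
#include <vector>

using Lines = std::vector<std::string>;

struct Delta {
  Lines reconstructed;  // toy delta: simply stores the full older revision
};

Delta diff(const Lines& /*newer*/, const Lines& older) { return Delta{older}; }
Lines apply(const Lines& /*newer*/, const Delta& d) { return d.reconstructed; }

struct ReverseDeltaChain {
  bool has_head = false;
  Lines head;                  // latest revision, stored completely
  std::vector<Delta> reverse;  // reverse[0] rebuilds the revision just before head

  void add_revision(const Lines& next) {
    if (has_head) {
      // Replace the old head by the reverse delta that recreates it from `next`.
      reverse.insert(reverse.begin(), diff(next, head));
    }
    head = next;
    has_head = true;
  }

  Lines materialize(std::size_t steps_back) const {
    Lines v = head;
    for (std::size_t i = 0; i < steps_back; ++i) v = apply(v, reverse[i]);
    return v;
  }
};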

D. RDF Archives

RDF archives are versioned RDF stores. Fernandez et al. [2] distinguish three archiving policies for Linked Open Data (LOD):
• Independent Copies (IC) - every version is stored fully materialized.
• Change-Based (CB) - only changes between versions are stored.
• Timestamp-Based (TB) - triples are annotated with their temporal validity.

D.1 Independent Copies Archive Policy

SemVersion [9] is an IC versioning system for RDF that tries to emulate classical Concurrent Versions System (CVS) systems for version management. Each version is stored separately in RDF stores that conform to a certain API, which manages said versions.

D.2 Change-Based Archive Policy

Cassidy et al. [10] propose a CB RDF archive that is built on Darcs' theory of patches [11] - a mathematical model that describes how patches can be manipulated in order to get the desired version in the context of software. This model describes fundamental operations, such as the commute operation, the revert operation and the merge operation. Cassidy et al. adapt these operations so that they are applicable to RDF stores as well.

Im et al. [12] introduced a CB store on top of a Relational Database Management System (RDBMS). They propose an aggregated deltas approach wherein not only the delta between a parent and child, but all possible deltas are stored. This results in an increased storage overhead, but a decreased version materialization cost compared to the classic sequential delta chain.

Vander Sande et al. [13] introduce R&WBase - a distributed CB RDF archive wherein versions are stored as consecutive deltas. Deltas between versions consist of an addition set and a deletion set, respectively listing which triples have been added and deleted. Since deltas are stored in the same graph, triples are annotated with a context number, indicating which version the triple belongs to and whether it was added or deleted. In particular, an even context number indicates the triple is an addition, and an odd context number indicates the triple is a deletion. Queries can be handled efficiently by looking at the highest context number: if the context number is even, the triple is present for that version; if the context number is odd, the triple is not present for that version. Finally, R&WBase also supports tagging, branching and merging of datasets.
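As an illustration of how such context numbers can be used, the sketch below assumes that version v uses context number 2v for additions and 2v + 1 for deletions; R&WBase's exact numbering and storage layout may differ.

// Sketch of context-number lookup: per triple we keep the context numbers in
// which it was added (even) or deleted (odd). Presence in a version is decided
// by the highest context number that is not newer than that version.
#include <array>
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>

using Triple = std::array<std::string, 3>;
using ContextSet = std::set<std::uint64_t>;  // sorted context numbers per triple

std::uint64_t addition_context(std::uint64_t version) { return 2 * version; }
std::uint64_t deletion_context(std::uint64_t version) { return 2 * version + 1; }

bool present_in_version(const ContextSet& contexts, std::uint64_t version) {
  // Highest context number not newer than this version.
  auto it = contexts.upper_bound(deletion_context(version));
  if (it == contexts.begin()) return false;  // never touched up to this version
  return *std::prev(it) % 2 == 0;            // even: the last change was an addition
}

int main() {
  std::map<Triple, ContextSet> graph;
  Triple t{"alice", "knows", "bob"};
  graph[t].insert(addition_context(3));  // added in version 3
  graph[t].insert(deletion_context(5));  // deleted in version 5
  bool in_v4 = present_in_version(graph[t], 4);  // true
  bool in_v5 = present_in_version(graph[t], 5);  // false
  return (in_v4 && !in_v5) ? 0 : 1;
}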

R43ples [14] is another CB RDF archive, which groups additions and deletions in named graphs. R43ples allows manipulation of revisions with SPARQL by introducing new keywords such as REVISION, TAG and BRANCH. Versions are materialized by starting from the head of the branch and applying all prior additions/deletions.

D.3 Timestamp-Based Archive Policy

Hauptmann et al. [15] propose a delta-based store similar to R43ples, including complete graphs and version control via SPARQL. However, in Hauptmann's approach, each triple is virtually annotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [16] extends RDF-3X [17] with versioning support. Each triple is annotated with a creation timestamp and, when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [18] is a TB archiving extension of RDFCSA [19], a compact self-indexing RDF store that is based on suffix arrays.

Dydra [20] is an RDF archive that stores versions as named graphs in a quad store, which can be queried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. The B+-tree values indicate which revisions a particular quad is visible in, making it a TB system.

Fig. 1: Non-Aggregated unidirectional delta chain, as done in TailR.

Fig. 2: Aggregated unidirectional delta chain where all deltas are relative to thesnapshot at the beginning of the chain, as done in OSTRICH.

D.4 Hybrid Archive Policy

TailR [21] interleaves fully materialized versions (snapshots) in between the delta chain, as seen in Figure 1. The snapshots reset the version materialization cost but can lead to a higher storage requirement.

OSTRICH [3] is another hybrid solution that interleaves fully materialized snapshots in between the delta chain, as seen in Figure 2. However, unlike TailR, OSTRICH uses aggregated deltas [12]: deltas that directly refer to the snapshot instead of the previous version. Moreover, the delta chain is stored by annotating each triple with version information, making it an IC, CB and TB hybrid. Ingestion can be done using an in-memory batch algorithm or a streaming algorithm. OSTRICH supports offsettable query result streams. In addition, OSTRICH also provides query count estimation functionality, which can be used as a basis for query optimization in query engines [22].
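The practical consequence of aggregated deltas is that any version can be rebuilt from the snapshot in a single step. Below is a minimal, set-based sketch of that materialization; OSTRICH's real implementation works on dictionary-encoded triples stored in version-annotated B+ trees, so the types here are illustrative only.

// Sketch of materializing a version from an aggregated delta:
// version = (snapshot minus deletions) plus additions, independent of how many
// versions lie between the snapshot and the requested version.
#include <array>
#include <set>
#include <string>

using Triple = std::array<std::string, 3>;
using TripleSet = std::set<Triple>;

struct AggregatedDelta {
  TripleSet additions;  // everything added since the snapshot, up to this version
  TripleSet deletions;  // everything deleted since the snapshot, up to this version
};

TripleSet materialize(const TripleSet& snapshot, const AggregatedDelta& delta) {
  TripleSet version;
  for (const Triple& t : snapshot)
    if (delta.deletions.count(t) == 0) version.insert(t);
  version.insert(delta.additions.begin(), delta.additions.end());
  return version;
}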

E. Query Atoms

Fernandez et al. [2] also distinguish five types of queries, called query atoms:
• Version Materialization (VM) queries retrieve data from a single version.
• Delta Materialization (DM) queries retrieve the differences between two versions.
• Version Queries (VQ) annotate query results with the version numbers in which the data exists.
• Cross-Version joins (CV) join the results of two queries over two different versions.
• Change Materialization (CM) returns a list of versions in which a given query produces consecutively different results.

Some storage policies are better suited for some query atoms than others. The IC approach is best suited for VM queries since the versions are stored completely and do not need to be reconstructed. The CB approach is particularly effective for DM queries between neighboring versions since these changes are stored. The TB approach is very efficient in resolving VQ queries since triples are naturally annotated with the version numbers wherein they exist.

F. RDF Archiving Benchmarks

BEAR [2] is an RDF archiving benchmark based on real-world data from three different domains:
• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [23].
• BEAR-B - the 100 most volatile resources from DBpedia Live [24] at three different granularities: instant, hourly and daily.
• BEAR-C - 32 weekly snapshots from the Open Data Portal Watch project [25].


Fig. 3: A simplified non-aggregated bidirectional delta chain: a reverse delta chain and a forward delta chain around a central snapshot.

Fig. 4: A simplified aggregated bidirectional delta chain: a reverse delta chain and a forward delta chain around a central snapshot.

BEAR-A provides triple pattern queries and their results for seven triple patterns. BEAR-B provides triple pattern queries and their results for ?PO and ?P? triple patterns, which are based on the most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complex queries that cannot be efficiently resolved with current archiving strategies, but that could help foster the development of new query resolution algorithms.

III. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN

As seen in previous works [21], [3], a delta chain consists of a fully materialized snapshot followed by a series of deltas. The main idea behind our storage optimization is moving the snapshot from the front of the delta chain to the middle of the delta chain, in order to potentially reduce the overall storage size. This transforms the delta chain into a bidirectional delta chain, which divides the original delta chain into two smaller delta chains, i.e. the reverse delta chain and the forward delta chain. Figures 3 and 4 show two example bidirectional delta chains.

A. Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order to materialize a version, all preceding deltas need to be applied until the fully materialized snapshot is reached. As stated above, a bidirectional delta chain divides the original delta chain into two smaller delta chains. Moreover, the size of the deltas remains the same, since the reverse delta chain is just the inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional delta chains is half of that for unidirectional delta chains. On the other hand, bidirectional non-aggregated delta chains could also potentially reduce storage size, while maintaining a similar version materialization time. Indeed, if we compare a series of two unidirectional delta chains with a single bidirectional delta chain, one fewer snapshot would need to be stored.
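A sketch of version materialization in a non-aggregated bidirectional chain, using plain changesets for illustration; the actual stores work on compressed, indexed structures, and the names below are assumptions.

// Sketch: materializing a version in a non-aggregated bidirectional delta chain.
// The snapshot sits in the middle; deltas are applied either backwards through
// the reverse chain or forwards through the forward chain, so at most half of
// the chain ever needs to be traversed.
#include <array>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Triple = std::array<std::string, 3>;
using TripleSet = std::set<Triple>;

struct Changeset {
  TripleSet additions;
  TripleSet deletions;
};

void apply(TripleSet& state, const Changeset& delta) {
  for (const Triple& t : delta.deletions) state.erase(t);
  for (const Triple& t : delta.additions) state.insert(t);
}

struct BidirectionalChain {
  std::vector<Changeset> reverse;  // reverse[0] leads to the version just before the snapshot
  TripleSet snapshot;              // the snapshot version, stored in the middle
  std::vector<Changeset> forward;  // forward[0] leads to the version just after the snapshot

  // steps < 0: versions before the snapshot; steps > 0: versions after it.
  TripleSet materialize(int steps) const {
    TripleSet state = snapshot;
    const std::vector<Changeset>& chain = steps < 0 ? reverse : forward;
    std::size_t n = static_cast<std::size_t>(steps < 0 ? -steps : steps);
    for (std::size_t i = 0; i < n; ++i) apply(state, chain[i]);
    return state;
  }
};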

B. Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference the snapshot, which means that an aggregated delta contains all the changes from all preceding deltas. In this work, we assume that a greater distance between versions results in a larger aggregated delta. This assumption holds for datasets that steadily grow over time by adding more new triples, because later versions will have more and more new triples compared to earlier versions. It follows that reducing the average distance between the snapshot and the versions results in smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chains reduce the average distance between the snapshot and the other versions. Therefore, bidirectional delta chains should have a lower storage size compared to unidirectional delta chains for growing datasets.

Fig. 5: An overview of the storage structure of a bidirectional delta chain (per delta chain: metadata, addition counts, ADD and DEL trees in SPO, POS and OSP order, a shared dictionary, and an HDT snapshot). Figure adapted from OSTRICH [3].
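To make the distance argument above concrete, consider a snapshot followed by n versions in a unidirectional aggregated chain versus the same n versions split evenly around a central snapshot (a small worked example, assuming n is even):

\[
\bar{d}_{\mathrm{uni}} = \frac{1}{n}\sum_{i=1}^{n} i = \frac{n+1}{2}
\qquad\text{versus}\qquad
\bar{d}_{\mathrm{bi}} = \frac{2}{n}\sum_{i=1}^{n/2} i = \frac{n/2+1}{2} = \frac{n+2}{4}.
\]

Under the growth assumption above, aggregated delta size grows with this distance, so roughly halving the average distance should shrink the aggregated deltas and thus the overall storage size.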

C. Bidirectional Delta Chain Disadvantages

In-order ingestion is the biggest drawback of bidirectional delta chains. Indeed, to ingest a version in the reverse delta chain, we would need to calculate the delta between the version and an unknown snapshot. However, a fix-up algorithm could be used to build the bidirectional delta chain. In the fix-up algorithm, all versions are stored in a forward delta chain. Once the future snapshot is inserted, the forward delta chain can be converted into a reverse delta chain. RCS [8] presents an alternative to the fix-up algorithm. For this algorithm, the latest version is always stored fully materialized. To add a new version, the system stores the new version completely and replaces the previous version by its delta, keeping the rest of the chain intact.

IV. BIDIRECTIONAL RDF ARCHIVE

A. Storage Overview

The storage structure for the bidirectional delta chain can be seen in Figure 5. The storage structure is similar to that of OSTRICH [3]; the reverse delta chain has the same storage structure as the forward delta chain.

B. Multiple Snapshots

The fix-up algorithm requires multiple snapshots, but OSTRICH only supports a single snapshot. Therefore, we modify OSTRICH so that multiple snapshots are supported. Supporting multiple snapshots comes down to finding the corresponding snapshot for a given version. We calculate the greatest lower bound and the least upper bound of all the snapshots for the given version. If the upper bound snapshot does not have a reverse delta chain, our version is stored in a forward delta chain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshot does have a reverse delta chain, the corresponding snapshot is the snapshot closest to the version.

Fig. 6: State of the delta chains before the fix-up algorithm is applied.
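The snapshot-selection logic described above can be sketched as follows; the function name, the std::function-based interface and the integer snapshot identifiers are illustrative assumptions, not the actual COBRA code.

// Sketch of locating the corresponding snapshot for a version when multiple
// snapshots exist. Snapshot ids are the version numbers at which snapshots were
// taken; has_reverse_chain reports whether a snapshot owns a reverse delta
// chain. Assumes at least one snapshot exists.
#include <functional>
#include <iterator>
#include <optional>
#include <set>

int corresponding_snapshot(const std::set<int>& snapshots,
                           const std::function<bool(int)>& has_reverse_chain,
                           int version) {
  auto upper_it = snapshots.upper_bound(version);   // least snapshot > version
  std::optional<int> lower, upper;
  if (upper_it != snapshots.end()) upper = *upper_it;
  if (upper_it != snapshots.begin()) lower = *std::prev(upper_it);  // greatest snapshot <= version

  if (!lower) return *upper;  // everything before the first snapshot belongs to its reverse chain
  if (!upper || !has_reverse_chain(*upper))
    return *lower;            // the version lives in the forward chain of the preceding snapshot
  // The next snapshot has a reverse chain: pick whichever snapshot is closest.
  return (version - *lower <= *upper - version) ? *lower : *upper;
}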

C. Ingestion

As mentioned before, in-order ingestion is difficult in a bidirectional delta chain. Therefore, we will first discuss out-of-order ingestion before discussing in-order ingestion.

C.1 Out-of-order Ingestion

Ingesting versions out-of-order in a reverse delta chain is similar to OSTRICH's forward ingestion process; we simply need to transform the input changesets. Firstly, since the forward ingestion algorithm expects the input changeset to reference the snapshot, we reverse the input changeset by swapping the additions and deletions so that the input changeset references the snapshot. Secondly, since the forward ingestion algorithm expects the version closest to the snapshot to be inserted first, we insert the versions in reverse order.
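A sketch of this transformation, with the forward ingestion routine left abstract; the names and types below are illustrative assumptions, not OSTRICH's actual API.

// Sketch of out-of-order ingestion into a reverse delta chain by reusing a
// forward ingestion routine: each input changeset is inverted (additions and
// deletions swapped) so that it points towards the snapshot, and the versions
// are fed to the forward algorithm in reverse order, closest-to-snapshot first.
// `ingest_forward` stands in for the existing forward ingestion algorithm.
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Triple = std::array<std::string, 3>;
using TripleSet = std::set<Triple>;

struct Changeset {
  TripleSet additions;
  TripleSet deletions;
};

Changeset invert(Changeset c) {
  std::swap(c.additions, c.deletions);
  return c;
}

template <typename ForwardIngest>
void ingest_reverse_chain(std::vector<Changeset> changesets,  // oldest version first
                          ForwardIngest ingest_forward) {
  std::reverse(changesets.begin(), changesets.end());  // closest to the snapshot first
  for (std::size_t i = 0; i < changesets.size(); ++i)
    ingest_forward(i + 1, invert(changesets[i]));      // local version index in the reverse chain
}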

C.2 In-order Ingestion

For in-order ingestion, we utilize a fix-up algorithm, which starts ingesting versions in a temporary forward delta chain. Once the system decides a new delta chain needs to be initiated, for example because the delta chain size exceeds a certain threshold, the system will store the next version once in the temporary forward delta chain and store it again as the snapshot for the new permanent delta chain. The reason behind storing the version twice is to simplify the input extraction, which is explained below. Figure 6 shows the resulting delta chains.

Once the system has some idle time, the fix-up process can be performed. The fix-up process starts by extracting the original input changesets from the temporary delta chain. To this end, the algorithm iterates over the version information of every triple in the temporary delta chain. If the previous version is present, the triple was already added in a previous version and therefore was not part of the input changeset. If the previous version is not present in the version information, the triple was first added in the current version and should be part of the input changeset. The temporary delta chain can then be deleted and a new permanent reverse delta chain can be constructed out-of-order with the extracted input changesets.
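The changeset-extraction step can be sketched as follows, modelling the version information of a triple simply as the set of versions in which it appears as an addition; this is a simplification of the actual tree-based storage, and deletions would be extracted analogously.

// Sketch of the changeset-extraction step of the fix-up algorithm: a triple
// belongs to the original input changeset of version v only if version v-1 is
// absent from its version information.
#include <array>
#include <map>
#include <set>
#include <string>

using Triple = std::array<std::string, 3>;
using VersionSet = std::set<int>;

std::map<int, std::set<Triple>> extract_addition_changesets(
    const std::map<Triple, VersionSet>& addition_version_info) {
  std::map<int, std::set<Triple>> changesets;  // version -> triples first added in that version
  for (const auto& [triple, versions] : addition_version_info) {
    for (int v : versions) {
      // If the previous version is also present, the triple was added earlier
      // and is only carried along by the delta chain, so it is skipped here.
      if (versions.count(v - 1) == 0) changesets[v].insert(triple);
    }
  }
  return changesets;
}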

D. Queries

VM queries retrieve data from a single version. VM queries are handled exactly the same as in OSTRICH, even for versions stored in the reverse delta chain, since we store inverse deltas.

DM queries retrieve the differences between two versions and annotate whether each triple is an addition or a deletion. In this work, we focus on DM queries within a single snapshot and its corresponding reverse and forward delta chains. We can discern three cases for DM queries, namely: a DM query between the snapshot and a delta, a DM query between two deltas in the same delta chain (intra-delta) and a DM query between two deltas in opposite delta chains (inter-delta). The first and second case are handled as in OSTRICH. In the third case, we resolve the DM query by splitting up the requested delta into two sequential deltas that are relative to the snapshot and then merging these sequential deltas back together. In other words, if we use Darcs' [11] patch notation, with o being the start version, e being the end version and s being the snapshot: oDe = oD1s sD2e, i.e. the requested delta from o to e is the composition of the delta D1 from o to the snapshot and the delta D2 from the snapshot to e. This strategy is quite efficient, since the deltas relative to the snapshot are the ones that are stored. Furthermore, since the snapshot-relative deltas are sorted, they can be merged in a sort-merge fashion. It is difficult to give an exact count of the results for inter-delta DM queries. However, an estimation of the result count can be calculated by summing up the counts of both deltas relative to the snapshot. This can overestimate the actual count if triples are present in both deltas.
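A sketch of the inter-delta merge, assuming both stored deltas are available as sorted, snapshot-relative streams of signed triples and ignoring OSTRICH's local-change flags and dictionary encoding; the names are illustrative.

// Sketch of an inter-delta DM query: the delta between a version o in the
// reverse chain and a version e in the forward chain is obtained by merging the
// two stored snapshot-relative deltas in a single sort-merge pass.
#include <array>
#include <cstddef>
#include <string>
#include <vector>

using Triple = std::array<std::string, 3>;

struct Change {
  Triple triple;
  bool addition;  // true = '+', false = '-' (relative to the snapshot)
};

// Both inputs must be sorted by triple; the output is the delta o -> e.
std::vector<Change> merge_inter_delta(const std::vector<Change>& snapshot_to_o,
                                      const std::vector<Change>& snapshot_to_e) {
  std::vector<Change> result;
  std::size_t i = 0, j = 0;
  while (i < snapshot_to_o.size() || j < snapshot_to_e.size()) {
    if (j == snapshot_to_e.size() ||
        (i < snapshot_to_o.size() && snapshot_to_o[i].triple < snapshot_to_e[j].triple)) {
      // Only changed on the o-side: inverting snapshot->o flips the sign.
      result.push_back({snapshot_to_o[i].triple, !snapshot_to_o[i].addition});
      ++i;
    } else if (i == snapshot_to_o.size() ||
               snapshot_to_e[j].triple < snapshot_to_o[i].triple) {
      // Only changed on the e-side: emit as stored.
      result.push_back(snapshot_to_e[j]);
      ++j;
    } else {
      // Changed in both deltas: the triple has the same status in o and e
      // relative to the snapshot, so it cancels out of the result.
      ++i;
      ++j;
    }
  }
  return result;
}

The result-count estimation mentioned above corresponds to summing the sizes of the two input streams, which over-counts exactly the triples that cancel out in the final branch.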

VQ queries annotate triples with the version numbers in which they exist. We present a VQ algorithm for a single snapshot and its corresponding reverse and forward delta chains. The algorithm is based on the VQ algorithm of OSTRICH. The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions. If the triple is present in a deletion tree, the corresponding versions are erased from the version annotation. After all the snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition trees in a sort-merge join fashion. As was the case with snapshot triples, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions ranging from the version that introduced the triple to the last version. If the triple is present in a deletion tree, the deleted versions are erased from the annotation. Result streams can be partially offset by offsetting the snapshot iterator of HDT [6].
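A compact sketch of this version-annotation step, with deletion information modelled as a map from triple to the set of versions in which it is absent, and additions as a map from triple to the version that introduced it; COBRA's actual deletion trees and offset handling are more involved.

// Sketch of a VQ (version query) over one snapshot and its delta chains:
// every matching triple is annotated with the versions in which it exists.
#include <array>
#include <map>
#include <set>
#include <string>

using Triple = std::array<std::string, 3>;
using VersionSet = std::set<int>;

std::map<Triple, VersionSet> version_query(
    const std::set<Triple>& snapshot_triples,       // snapshot triples matching the pattern
    const std::map<Triple, VersionSet>& deletions,  // versions in which a triple is deleted
    const std::map<Triple, int>& additions,         // version that first added a triple
    int first_version, int last_version) {
  std::map<Triple, VersionSet> annotations;

  auto all_versions_from = [&](int start) {
    VersionSet vs;
    for (int v = start; v <= last_version; ++v) vs.insert(v);
    return vs;
  };

  // Snapshot triples exist in every version unless a deletion tree says otherwise.
  for (const Triple& t : snapshot_triples) {
    VersionSet vs = all_versions_from(first_version);
    if (auto it = deletions.find(t); it != deletions.end())
      for (int v : it->second) vs.erase(v);
    annotations[t] = vs;
  }

  // Addition triples exist from the version that introduced them onwards,
  // again minus any versions in which they were deleted later.
  for (const auto& [t, added_in] : additions) {
    VersionSet vs = all_versions_from(added_in);
    if (auto it = deletions.find(t); it != deletions.end())
      for (int v : it->second) vs.erase(v);
    annotations[t] = vs;
  }

  return annotations;
}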

V. EVALUATION

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ software implementation of our storage optimization. COBRA uses the same software libraries as OSTRICH [3].

A. Experimental Setup

We will evaluate the ingestion and query resolution capabilities of COBRA. For this we will use the BEAR [2] benchmark, particularly BEAR-A, BEAR-B daily and BEAR-B hourly.

The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A we will only ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly, we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions. We will do the ingestion evaluation for multiple storage layouts and ingestion orders, namely:
• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.
• OSTRICH-2F: OSTRICH with two forward delta chains.
• COBRA-PRE FIX UP: COBRA's pre fix-up state, as seen in Figure 6.
• COBRA-POST FIX UP: COBRA's bidirectional delta chain post fix-up, as seen in Figure 4.
• COBRA-OUT OF ORDER: COBRA's bidirectional delta chain, as seen in Figure 4, but ingested out-of-order (snapshot - reverse delta chain - forward delta chain).

Fig. 7: Cumulative storage size for the first eight versions of BEAR-A.

BEAR also provides query sets, which will be evaluated as VM queries for all versions, DM queries between all versions and a VQ query. Since neither OSTRICH nor COBRA supports multiple snapshots for all query atoms, we limit our experiments to OSTRICH's unidirectional storage layout and COBRA's bidirectional storage layout.

B. Results

As can be seen in Figures 7, 8 and 9, there is no approach that has the lowest storage size for all the benchmarks. Indeed, COBRA has the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-B daily and OSTRICH-2F has the lowest storage size for BEAR-B hourly. We can see that for all benchmarks, COBRA-OUT OF ORDER reduces the storage increase from initializing a second delta chain. However, this does not always result in an overall storage size reduction due to the size difference between the first delta chain and the reverse delta chain.

Table I shows the ingestion times of the different configurations for all three benchmarks. We can see that OSTRICH-1F has the highest ingestion time. We also see that COBRA-PRE FIX UP has a higher ingestion time than OSTRICH-2F due to the additional version.

Figures 10, 13 and 16 display the mean VM query duration for all three benchmarks. We can see that VM queries are resolved faster in COBRA than in OSTRICH, even though the same VM algorithm was used.

As can be seen in Figures 11, 14 and 17, COBRA resolves DM queries faster than OSTRICH. The reason for this is that intra-delta DM queries are faster in smaller delta chains.

Figures 12, 15 and 18 display the mean VQ query duration for all three benchmarks. It can be seen that VQ query durations are roughly similar for COBRA and OSTRICH, which means COBRA's altered VQ algorithm does not cause significant overhead.

Fig. 8: Cumulative storage size for all versions of BEAR-B daily.

Fig. 9: Cumulative storage size for the first 400 versions of BEAR-B hourly.

Fig. 10: Mean BEAR-A VM query duration of all versions for all triple patterns.

Fig. 11: Mean BEAR-A DM query duration between all versions for all triple patterns.

Fig. 12: Mean BEAR-A VQ query duration for all triple patterns.

Fig. 13: Mean BEAR-B daily VM query duration of all versions for all triple patterns.

Fig. 14: Mean BEAR-B daily DM query duration between all versions for all triple patterns.

Fig. 15: Mean BEAR-B daily VQ query duration for all triple patterns.

Fig. 16: Mean BEAR-B hourly VM query duration of all versions for all triple patterns.

Fig. 17: Mean BEAR-B hourly DM query duration between all versions for all triple patterns.

Fig. 18: Mean BEAR-B hourly VQ query duration for all triple patterns.

TABLE I: Ingestion times of the different configurations for all three benchmarks. The ingestion time of COBRA-POST FIX UP is represented as the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time, since COBRA-POST FIX UP uses the fix-up algorithm.

configuration         BEAR-A (min)       BEAR-B daily (min)   BEAR-B hourly (min)
OSTRICH-1F            1419.27            6.53                 34.47
OSTRICH-2F            686.87             3.18                 15.2
COBRA-PRE FIX UP      775.31             3.28                 14.87
COBRA-POST FIX UP     775.31 + 502.75    3.28 + 2.48          14.87 + 11.41
COBRA-OUT OF ORDER    877.52             4.24                 18.30

VI. CONCLUSION

In this work, we presented bidirectional delta chains as a potential storage optimization for CB RDF archives. We applied this storage optimization to an existing RDF archive named OSTRICH [3]. For this purpose, we modified OSTRICH so that multiple snapshots could be supported. Next, we presented an in-order ingestion algorithm. Moreover, we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existing VQ query algorithm so that bidirectional chains are supported.

We evaluated different storage configurations and concluded that no storage configuration has the lowest storage size for all benchmarks. We recommend initializing a new delta chain when the latest delta chain becomes too large. We also recommend merging two forward delta chains into a bidirectional delta chain if the first delta chain is more similar to the second snapshot than to the first snapshot. We also confirmed that initiating a new delta chain is a viable method for reducing the ingestion time. Finally, we evaluated VM, DM and VQ queries for OSTRICH and COBRA and observed that VM and DM queries were faster in COBRA, while VQ query durations were similar.

In conclusion, bidirectional delta chains are not the all-round storage optimization technique we set out to find at the start of this work; however, they are a viable tool for reducing the overall storage size in certain cases.

On this topic, there are many opportunities for future research. First, there needs to be a reliable way of predicting whether a delta chain is more similar to the preceding snapshot or the future snapshot. Second, future work could devise a novel input extraction algorithm for the fix-up algorithm so that the middle version does not need to be stored twice. Finally, additional research is needed to expand the current DM and VQ algorithms for multiple snapshots and allow for more efficient offsets.

REFERENCES

[1] Cyril Schoreels, Brian Logan, and Jonathan M. Garibaldi, "Agent based genetic algorithm employing financial technical analysis for making trading decisions using historical equity market data," in Intelligent Agent Technology, 2004 (IAT 2004). Proceedings. IEEE/WIC/ACM International Conference on. IEEE, 2004, pp. 421–424.

[2] Javier D. Fernández, Jürgen Umbrich, Axel Polleres, and Magnus Knuth, "Evaluating query and storage strategies for RDF archives," in Proceedings of the 12th International Conference on Semantic Systems, New York, NY, USA, 2016, SEMANTiCS 2016, pp. 41–48, ACM.

[3] Ruben Taelman, Ruben Verborgh, and Erik Mannens, "Exposing RDF archives using triple pattern fragments," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017.

[4] Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," Scientific American, vol. 284, no. 5, pp. 34–43, 2001.

[5] Christian Bizer, Tom Heath, and Tim Berners-Lee, "Linked data - the story so far," International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.

[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias, "Binary RDF representation for publication and exchange (HDT)," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 19, pp. 22–41, 2013.

[7] Miguel A. Martínez-Prieto, Mario Arias Gallego, and Javier D. Fernández, "Exchange and consumption of huge RDF data," in The Semantic Web: Research and Applications, Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, Eds., Berlin, Heidelberg, 2012, pp. 437–452, Springer Berlin Heidelberg.

[8] Walter F. Tichy, "RCS - a system for version control," Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985.

[9] Max Völkel and Tudor Groza, "SemVersion: An RDF-based Ontology Versioning System," in Proceedings of the IADIS International Conference on WWW/Internet (IADIS 2006), Miguel Baptista Nunes, Ed., Murcia, Spain, October 2006, pp. 195–202.

[10] Steve Cassidy and James Ballantine, "Version control for RDF triple stores," in ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings, 2007, pp. 5–12.

[11] David Roundy, "Darcs: Distributed version management in Haskell," in Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, New York, NY, USA, 2005, Haskell '05, pp. 1–4, ACM.

[12] Dong-Hyuk Im, Sang-Won Lee, and Hyoung-Joo Kim, "A version management framework for RDF triple stores," International Journal of Software Engineering and Knowledge Engineering, vol. 22, no. 01, pp. 85–106, 2012.

[13] Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Sam Coppens, Erik Mannens, and Rik Van de Walle, "R&WBase: git for triples," in Proceedings of the 6th Workshop on Linked Data on the Web, Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas, and Sören Auer, Eds., May 2013, vol. 996 of CEUR Workshop Proceedings.

[14] Markus Graube, Stephan Hensel, and Leon Urbas, "R43ples: Revisions for triples - an approach for version control in the semantic web," in CEUR Workshop Proceedings, 2014.

[15] Claudius Hauptmann, Michele Brocco, and Wolfgang Wörndl, "Scalable semantic version control for linked data management," in LDQ@ESWC, 2015.

[16] Thomas Neumann and Gerhard Weikum, "x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 256–263, Sept. 2010.

[17] Thomas Neumann and Gerhard Weikum, "RDF-3X: A RISC-style engine for RDF," Proc. VLDB Endow., vol. 1, no. 1, pp. 647–659, Aug. 2008.

[18] A. Cerdeira-Pena, A. Fariña, J. D. Fernández, and M. A. Martínez-Prieto, "Self-indexing RDF archives," in 2016 Data Compression Conference (DCC), March 2016, pp. 526–535.

[19] Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, and Gonzalo Navarro, "A compact RDF store using suffix arrays," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015.

[20] James Anderson and Arto Bendiken, "Transaction-time queries in Dydra," in Joint Proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017) and the 4th Workshop on Linked Data Quality (LDQ 2017), co-located with the 14th European Semantic Web Conference (ESWC 2017), 2017.

[21] Paul Meinhardt, Magnus Knuth, and Harald Sack, "TailR: a platform for preserving history on the web of data," in Proceedings of the 11th International Conference on Semantic Systems. ACM, 2015, pp. 57–64.

[22] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert, "Triple pattern fragments: a low-cost knowledge graph interface for the web," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37, pp. 184–206, 2016.

[23] Tobias Käfer, Ahmed Abdelrahman, Jürgen Umbrich, Patrick O'Byrne, and Aidan Hogan, "Exploring the dynamics of linked data," in The Semantic Web: ESWC 2013 Satellite Events, Philipp Cimiano, Miriam Fernández, Vanessa Lopez, Stefan Schlobach, and Johanna Völker, Eds., Berlin, Heidelberg, 2013, pp. 302–303, Springer Berlin Heidelberg.

[24] Mohamed Morsey, Jens Lehmann, Sören Auer, Claus Stadler, and Sebastian Hellmann, "DBpedia and the live extraction of structured data from Wikipedia," Program, vol. 46, no. 2, pp. 157–181, 2012.

[25] Jürgen Umbrich, Sebastian Neumaier, and Axel Polleres, "Quality assessment and evolution of open data portals," in Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on. IEEE, 2015, pp. 404–411.


Table of Contents

Acknowledgments

Usage

Summary

Extended Abstract

Table of Contents

List of Figures

List of Tables

1 Preface
  1.1 Introduction
  1.2 Research Question
  1.3 Outline

2 Background
  2.1 Semantic Web
  2.2 RDF
  2.3 SPARQL
  2.4 RDF Storage Layout
    2.4.1 RDBMS Storage
      2.4.1.1 Triple Table
      2.4.1.2 Property Table
      2.4.1.3 Vertical Partitioning
    2.4.2 NoSQL Storage
    2.4.3 Native Storage
  2.5 Archiving
    2.5.1 Non-RDF Archives
    2.5.2 RDF Archives
      2.5.2.1 Independent Copies
      2.5.2.2 Change-Based
      2.5.2.3 Timestamp-Based
      2.5.2.4 Hybrid
  2.6 Query Types
    2.6.1 Independent Copies
    2.6.2 Change-Based
    2.6.3 Timestamp-Based
  2.7 RDF Archive Benchmarks
    2.7.1 BEAR
    2.7.2 EvoGen
    2.7.3 SPBv


3 Use Case: Friend Network
  3.1 Use Case
  3.2 Requirements
  3.3 Need

4 Storage Optimization: Bidirectional Delta Chain
  4.1 Advantages Bidirectional Delta Chain
    4.1.1 Advantages Non-Aggregated Bidirectional Delta Chain
    4.1.2 Advantages Aggregated Bidirectional Delta Chain
  4.2 Disadvantages Bidirectional Delta Chain
  4.3 Hypotheses

5 OSTRICH Overview
  5.1 Storage Structure
    5.1.1 Snapshot Storage
    5.1.2 Delta Chain Dictionary
    5.1.3 Delta Storage
      5.1.3.1 Local Change Flags
      5.1.3.2 Deletion Relative Position
      5.1.3.3 Multiple Indexes
      5.1.3.4 Addition Counts
      5.1.3.5 Deletion Counts
      5.1.3.6 Metadata
  5.2 Ingestion
  5.3 Queries
    5.3.1 Version Materialization Query
      5.3.1.1 Query
      5.3.1.2 Result Count
    5.3.2 Delta Materialization Query
      5.3.2.1 Query
      5.3.2.2 Result Count
    5.3.3 Version Query
      5.3.3.1 Query
      5.3.3.2 Result Count

6 Bidirectional RDF Archive
  6.1 Storage Structure
  6.2 Multiple Snapshots
  6.3 Ingestion
    6.3.1 Out-of-order Ingestion
    6.3.2 In-order Ingestion: Fix-Up Algorithm
  6.4 Query
    6.4.1 Version Materialized Query
    6.4.2 Delta Materialized Query
      6.4.2.1 Query
      6.4.2.2 Result Count
    6.4.3 Version Query
      6.4.3.1 Query
      6.4.3.2 Result Count

7 Evaluation
  7.1 COBRA Implementation
  7.2 Experimental Setup
    7.2.1 Ingestion
    7.2.2 Query
  7.3 Results
    7.3.1 Ingestion Results
    7.3.2 Query Results


  7.4 Discussion
    7.4.1 Ingestion Evaluation
      7.4.1.1 Storage Size
      7.4.1.2 Ingestion Time
    7.4.2 Query Evaluation
    7.4.3 Hypotheses Evaluation

8 Conclusion and Future Work
  8.1 Conclusion
  8.2 Future Work

Bibliography

Appendices
  A BEAR-A Query Results
  B BEAR-B daily Query Results
  C BEAR-B hourly Query Results


List of Figures

2.1 An example RDF graph.
2.2 Unidirectional delta chain, as done in TailR.
2.3 Unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH.

3.1 Friend Network Example

4.1 A simplified non-aggregated bidirectional delta chain.
4.2 A simplified aggregated bidirectional delta chain.
4.3 An example to showcase unidirectional and bidirectional non-aggregated delta chains. Triples are represented by numbers.
4.4 An example to showcase unidirectional and bidirectional aggregated delta chains. Triples are represented by numbers.

5.1 An overview of the storage structure used in OSTRICH.

6.1 An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH.
6.2 An illustration of the fix-up algorithm.
6.3 State of the delta chains before the fix-up algorithm is applied.

7.1 Comparison of the cumulative storage sizes (in GB) per version for the first eight versions of the BEAR-A benchmark.
7.2 Comparison of the cumulative ingestion times (in hours) per version for the first eight versions of the BEAR-A benchmark.
7.3 Comparison of the individual ingestion times (in minutes) per version for the first eight versions of the BEAR-A benchmark.
7.4 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B daily benchmark.
7.5 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B daily benchmark.
7.6 Comparison of the individual ingestion time (in min) per version of the BEAR-B daily benchmark.
7.7 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourly benchmark.
7.8 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourly benchmark.
7.9 Comparison of the individual ingestion time (in min) per version of the BEAR-B hourly benchmark.
7.10 Average VM query durations for all triple patterns in BEAR-A.
7.11 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-A.
7.12 Average DM query durations between all versions for all triple patterns in BEAR-A.
7.13 Average VQ query durations for all triple patterns in BEAR-A.
7.14 Average VM query durations for all provided triple patterns in BEAR-B daily.


7.15 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B daily.
7.16 Average DM query durations between all versions for all triple patterns in BEAR-B daily.
7.17 Average VQ query durations for all provided triple patterns in BEAR-B daily.
7.18 Average VM query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.
7.19 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B hourly.
7.20 Average DM query durations between all versions for all triple patterns in BEAR-B hourly.
7.21 Average VQ query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.

1 Average VM query durations for SPO triple patterns in the first eight versions of BEAR-A.
2 Average VM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.
3 Average VM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.
4 Average VM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.
5 Average VM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.
6 Average VM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.
7 Average VM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.
8 Average VM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.
9 Average VM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.
10 Average VM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.
11 Average VM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.
12 Average VM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.
13 Average DM query durations for SPO triple patterns in the first eight versions of BEAR-A.
14 Average DM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.
15 Average DM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.
16 Average DM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.
17 Average DM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.
18 Average DM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.
19 Average DM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.
20 Average DM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.
21 Average DM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.
22 Average DM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.


23 Average DM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.
24 Average DM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.
25 Average VQ query durations for SPO triple patterns in the first eight versions of BEAR-A.
26 Average VQ query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.
27 Average VQ query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.
28 Average VQ query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.
29 Average VQ query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.
30 Average VQ query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.
31 Average VQ query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.
32 Average VQ query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.
33 Average VQ query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.
34 Average VQ query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.
35 Average VQ query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.
36 Average VQ query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.
37 Average VM query durations for ?P? triple patterns in BEAR-B daily.
38 Average DM query durations for ?P? triple patterns in BEAR-B daily.
39 Average VQ query durations for ?P? triple patterns in BEAR-B daily.
40 Average VM query durations for ?PO triple patterns in BEAR-B daily.
41 Average DM query durations for ?PO triple patterns in BEAR-B daily.
42 Average VQ query durations for ?PO triple patterns in BEAR-B daily.
43 Average VM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.
44 Average DM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.
45 Average VQ query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.
46 Average VM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.
47 Average DM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.
48 Average VQ query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.

Page 19: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

List of Tables

2.1 An example triple table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 An example property table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 An example vertical partioning table. . . . . . . . . . . . . . . . . . . . . . . . . . 6

5.1 Overview of which index OSTRICH uses for each triple pattern. . . . . . . . . . 21

7.1 Storage sizes and ingestion times of the different approaches for all three bench-marks. COBRA-POST FIX UP represents the in-order ingestion of the bidirec-tional delta chain using the fix-up algorithm. Therefore, the ingestion time is thesum of the ingestion time of COBRA-PRE FIX UP and the fix-up time. . . . . . 39

xix

Page 20: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

LIST OF TABLES xx

Page 21: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Acronyms

CVS Concurrent Versions System. ix, 8

CB Change-Based. ix, 8, 9, 11, 13, 15

CM Change Materialization. ix, 10, 14

COBRA Change-based Offset-enabled Bidirectional RDF Archive. ix, 17, 20

CSA Compressed Suffix Array. ix

CV Cross-Version join. ix, 10

DBMS Database Management Systems. ix, 4

DM Delta Materialization. ix, 10, 14, 18, 19

FOAF Friend Of A Friend. ix, 13

IC Independant Copies. ix, 8–11

LUBM Lehigh University Benchmark. ix, 11

RDBMS Relational Database Management Systems. ix, 4, 8, 13

RDF Resource Description Framework. vii, ix, 2–9, 11, 13–15

SCM Software Configuration Management. ix, 7

SPARQL SPARQL Protocol And RDF Query Language. ix, 2, 3, 9

TB Timestamp-Based. ix, 9, 11, 18

VCS Version Control System. ix, 7

VM Version Materialization. ix, 10, 11, 14, 15, 17, 18

VQ Version Query. ix, 10, 11, 14, 18, 19

xxi

Page 22: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 1

Preface

1.1 Introduction

Datasets change over time for numerous reasons, such as the addition of new information orthe correction of erroneous information. Capturing this evolution allows for historical analyseswhich can provide new insights. For example, historical market data can be used to predictfuture market trends [2].Linked Data is no exception to this. Linked Data is data that is structured in such a way thatit is understandable for machines. The standard way of modeling Linked Data is RDF, whichuses a <subject, predicate, object> triple structure. This triple structure can be used to formgraphs, where the subject and object are nodes linked together via the predicate.Archiving RDF data has been an area of research for a few years [3]. The archiving strategiescan be categorized into three groups [3]:

• Independant Copies (IC) - Every version is stored fully materialized.

• Change-Based (CB) - Only changes between versions are stored.

• Timestamp-Based (TB) - Triples are annotated with their temporal validity.

One particular research focus is enabling offsettable query streams for RDF archives since querystreams are more memory efficient for large query results and the offset allows for faster querieswhen only a subset is needed. OSTRICH [1] is the state of the art when it comes to offset-enabled RDF archives. OSTRICH is a IC, CB and TB hybrid RDF archive that stores versionsin a delta chain that starts with a fully materialized snapshot followed by a series of deltas thatall reference this snapshot. However, OSTRICH has a large ingestion time for large datasetversions. The ingestion time can be reduced by introducing additional snapshots however this,in turn, can increase the storage size since snapshots are fully materialized.In this work, we will explore if we can reduce the resulting storage size increase of the multiplesnapshot approach, while maintaining the ingestion time reduction, by restructuring the deltachain.

1.2 Research Question

The research question is as follows:“How much can we reduce storage usage of change-based RDF archives by restructuring thedelta chain?”

1

Page 23: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 1. PREFACE 2

1.3 Outline

The remainder of this thesis is structured as follows. Chapter 2 will explain all the necessaryconcepts, terms and techniques needed to talk about RDF archives. Next, Chapter 3 willshowcase a use case in order to highlight the need for a RDF archive for large datasets. Followedby Chapter 4, where we will present our storage optimization and corresponding hypotheses.We will apply this storage optimization on OSTRICH, so Chapter 5 will give a detailed overviewof OSTRICH first. The implementation of this storage optimization will then be discussed inChapter 6. In Chapter 7, we evaluate our implementation and compare it with OSTRICH.Finally, a conclusion is presented in Chapter 8, alongside an overview of possible future work.

Page 24: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 2

Background

In order to fully understand this thesis, it is important to explain certain concepts and listwhich technologies are available. First, the Semantic Web will be explained, followed by itskey technologies RDF and SPARQL. Second, a listing of popular RDF storage techniques willbe given. Third, an overview of query types will be given. Fourth, an outline of RDF archivetechniques will be given. Finally, the most popular versioned RDF benchmarks will be listed.

2.1 Semantic Web

In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of theSemantic Web [4]. The goal of the Semantic Web is to make data on the Web understandableto machines so that they can perform complex tasks.

In order to make this vision a reality, Linked Data was introduced. As described by Bizer et al.[5], Linked Data (LD) refers to data published on the Web in such a way that it is: machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can belinked to from external data sets. RDF and SPARQL are two key technologies of Linked Data,which will be explained in the following sections.

2.2 RDF

RDF [6] is a way of representing data and is a key technology of linked data. RDF uses <subject,predicate (or property), object>triples to organize data. RDF triples are interperted as follows:the subject has a certain property with value object. RDF triples are used to form statements,which can be represented by a directed graph wherein the object and subject are nodes thatare linked by a predicate. The subject and predicate are represented as resource URIs, whilethe object can be either a resource URI, a literal or a blank node. Figure 2.1 is an exampleRDF graph, for the sake of brevity no URI’s are used. Figure 2.1 describes me and contains thefollowing triples:

• <me, type, Person>

• <me, age, 23>

• <me, name, ”Thibault Mahieu”>

3

Page 25: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 4

Figure 2.1: An example RDF graph.

2.3 SPARQL

Although other RDF query languages exist such as RQL [7] and (i)RDQL [8], SPARQL is theW3C’s recommended RDF query language [9]. SPARQL is a graph-based pattern matchingquery language for RDF data. The graph-based patterns we will focus on are made up of triplepatterns, which like RDF triples contain a subject, predicate and object. However, in this case,subject, predicate and object can be either fixed or variable. The query processor will try tomatch these patterns with elements of the domain, by adhering to the fixed components andfilling in the variable components.

SPARQL has four query forms [9]:

• SELECT Returns all, or a subset of, the variables bound in a query pattern match.

• CONSTRUCT Returns a RDF graph constructed by substituting variables in a set oftriple templates.

• ASK Returns a boolean indicating whether a query pattern matches or not.

• DESCRIBE Returns a RDF graph that describes the resources found.

These query forms are typically followed by a ’WHERE’ clause, that limits the result. The’WHERE’ clause uses pattern matching on the triples, as mentioned above.

As an illustration, the following query selects all book titles [9]:

Listing 2.1: An example SPARQL query that fetches all book titles.

PREFIX dc: <http :// purl.org/dc/elements /1.1/>

SELECT ?title

WHERE { <http :// example.org/book/book1 > dc:title ?title }

2.4 RDF Storage Layout

This section gives an overview of general RDF storage layouts. They can be divided into twogroups, namely native and non-native storage techniques[10, 11]. Native storage techniques arededicated storage techniques that have been built from scratch, while non-native make use ofexisting Database Management Systems (DBMSs). We further divide the non-native techniquesinto Relational Database Management Systems (RDBMSs) and NoSQL storage techniques.

2.4.1 RDBMS Storage

RDBMSs have been around for decades and therefore have become extremely optimized. There-fore, a lot of techniques have been proposed to map RDF data to a RDBMS [12–14]. Threetechniques will be discussed, namely triple tables, property tables and vertical partioning.

Page 26: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 5

SUBJECT PROPERTY OBJECT

ID1 Name ”John Smith”

ID1 Salary 1800

ID1 Department ”Human Resources”

ID2 Name ”Jane Tully”

ID2 Salary 2000

ID2 Department ”Management”

ID2 Bonus 100

ID2 Phone Number (251) 546-9442

ID3 Name ”Rob Phelps”

ID3 Salary 1600

ID3 Department ”Human Resources”

Table 2.1: An example triple table.

2.4.1.1 Triple Table

The triple table technique maps triples into a single table with three columns for the subject,property and object. While this technique is very flexible, it has some performance issues. Sinceall triples are stored in a single table, queries require a lot of expensive self-joins for certain triplepatterns [14]. Furthermore, since all triples are stored in a single table, this table can quicklybecome too large to fit in memory, making queries even slower. Table 2.1 is an example tripletable for an employee database.

2.4.1.2 Property Table

Wilkinson et al. [12, 13], proposed property tables as a solution for the scalability problems oftriple tables. Property tables try to group related RDF nodes in order to reduce query time andstorage requirements. They consist of a subject column and several property columns. Triplesthat cannot be grouped are simply stored in a leftover triple table.Table 2.2 is an example property table for the same employee database.Wilkinson et al. [13] also discuss property-class tables which are a special case of property tables.The idea behind property-class tables is to store nodes of the same class together. In essence,this corresponds to storing the value of ’rdf:type’ in a property table.The most important advantage of property tables over triple tables is the faster query time. Thespeed-up is due to the fact that some self-joins can be avoided since related nodes are stored inthe same row. In addition, storage size is generally lower than the triple table approach.The disadvantages are that the tables can become very sparse with NULLs, due to the unknownvalues. In addition, property tables cannot handle multi-valued attributes efficiently. Due tothese disadvantages and the general complexity of property tables, they have not been widelyadopted except in specialized cases [14].

2.4.1.3 Vertical Partitioning

Abadi et al. [14] proposed a new storage solution called SWstore. SWstore utilizes a techniquecalled vertical partitioning which, similar to a property table, groups triples. For every predicate,there is a column table which contains the subject and object in a triple. Unlike property tables,multi-valued attributes are possible by simply storing the subject with all possible objects.Furthermore, NULL values, or unknown values, do not need to be stored. Table 2.3 is anexample vertical partitioning table for the same employee database.

Page 27: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 6

SUBJECT NAME SALARY DEPARTMENT Phone Number

ID1 ”John Smith” 1800 ”Human Resources” NULL

ID2 ”Jane Tully” 2000 ”Management” (251) 546-9442

ID3 ”Rob Phelps” 1600 ”Human Resources” NULL

(a) Property Table

SUBJECT PROPERTY OBJECT

ID2 Bonus 100

(b) Left Over Triple Store

Table 2.2: An example property table.

SUBJECT OBJECT

ID1 ”John Smith”

ID2 ”Jane Tully”

ID3 ”Rob Phelps”

(a) Name Table

SUBJECT OBJECT

ID1 1800

ID2 2000

ID3 1600

(b) Salary Table

SUBJECT OBJECT

ID1 ”Human Resources”

ID2 ”Management”

ID3 ”Human Resources”

(c) Department Table

SUBJECT OBJECT

ID2 (251) 546-9442

(d) Phone Number Table

SUBJECT OBJECT

ID2 100

(e) Bonus Table

Table 2.3: An example vertical partioning table.

Page 28: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 7

2.4.2 NoSQL Storage

With the rising popularity of NoSQL databases, multiple RDF mappings into NoSQL storeshave been proposed. Most of these systems use popular NoSQL stores as a backbone. Threemajor groups of NoSQL stores can be distinguished, namely key-value stores, document storesand column stores [15].Key-value stores are NoSQL stores that manage a dictionary. Since RDF uses triples and key-value stores only deal with pairs, indexing can be difficult. The AWETO [16] system tries tosolve this indexing problem by using four index orders: S-PO, P-SO, P-OS, and O-PS.Document stores are more complex key-value stores as they allow to encapsulate (key, value)-pairs in documents. Typically, RDF document stores rely on popular document stores such asMongoDB and CouchDB.Column stores store and retrieve data by column instead of by row. Unique keys are used toconnect related column data. In addition, data can be indexed both row-wise and column-wise.For instance, Rya [17] is a RDF store that uses Accumulo, a column store similar to GoogleBigtable, as a backbone.

2.4.3 Native Storage

Native storage techniques are dedicated storage techniques that use RDF’s triple structure totheir benefit.

YARS [18] is an optimized multi-index system that is designed for fast queries. The RDF tripleis extended with context information that refers to the provenance of the data and is referred toas a quad. Since each quad component can be set or variable, there are 24, or 16, access patterns.These access patterns can be covered by only six indexes that utilize B+-trees. Furthermore, thestring representation of each quad component is mapped to a short integer ID. These mappingsare stored in a dictionary, which is used to convert from ID to string representation and vice-versa. By working with IDs instead of string representations, the indexes take less storage space.Furthermore, queries are faster since IDs can be compared more efficiently with each other.

YARS2 [19] extends YARS to a distributed system. In particular, distributed indexing methodsand parallel query evaluation methods are presented.

Hexastore [20] uses six indexes for each permutation of the triple, namely SPO, SOP, OPS, PSO,OSP, and POS. Moreover, a dictionary is also used to map the RDF to keys. As an example, inthe SPO index, a subject key is linked to a sorted vector of property keys, which all point to alist of object keys.

RDF-3X [21] stores all triples in the leaf pages of a compressed clustered B+-tree. Six indexesare used for all permutation of the triples, six indexes for the aggregated indexes and three forthe one-valued indexes, totaling 15 indexes. By storing the triples lexicographically, SPARQLqueries can be converted into range scans. In addition, the string literals are replaced with idsusing a dictionary, resulting in faster queries and a lower storage size.

Bitmat [22] is a compact bit matrix structure for representing a large number of RDF triples.The data is represented as a bit cube with subject, predicate and object as dimensions, whereineach cell represents whether the triple exists or not. This binary matrix allows for efficient joinswith the use of binary AND/OR operations.

TripleBit [23] is a compact RDF store that relies on a bit matrix storage structure. The RDFtriples are represented as a two dimensional bit matrix with RDF links (properties) as columnsand RDF nodes (subjects, objects) as rows. Each cell consists of a boolean that indicates ifa triple is present or not, resulting in a sparse matrix that can be compressed efficiently. Inaddition, dictionary encoding is used to reduce storage requirements even further. In order tospeed up queries, two auxiliary indexes are used, namely ID-Chunk matrix and ID-Predicate bitmatrix. The ID-Chunk matrix is used to quickly find chunks matching to a RDF node (subject,object). The ID-Predicate bit matrix is used to find the related predicates for a given RDF node(subject, object).

Page 29: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 8

HDT [24] is a binary representation for exchanging RDF data, so compression is the main focus.HDT consists of three parts:

• header - contains metadata and serves as an entry point to the data

• dictionary - mapping between triple components and unique identifiers, referred to asdictionary encoding

• triples - structure of the underlying RDF graph after dictionary encoding

HDT resolves queries on the compressed data, but only has one index (SP-O), making certaintriple patterns hard to resolve. In addition, by design HDT stores are immutable after creation,making them unsuitable for volatile datasets.

HDT-FoQ [25] is an extension on HDT [24] that focusses on resolving queries faster. For thisreason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more accesspatterns. The PS-O makes use of a wavelet tree, while the OP-S index uses adjacency lists,similar to the SP-O index.

Waterfowl [26] builds on HDT-FoQ [25] by using wavelet trees in the SP-O index, instead ofadjacency lists. To the best of our knowledge, no data can be found on Waterfowl’s performancecompared to HDT-FoQ.

2.5 Archiving

Version control, also called versioning in this document, refers to the management of changesto a collection of information, for example, a dataset or codebase. The latter has been aroundfor over four decades [27], proving how invaluable rolling back to a previous version is. Manyversion control techniques can be reused in archiving techniques. We will consider both non-RDFarchiving techniques and RDF archiving techniques.

2.5.1 Non-RDF Archives

Many techniques from non-RDF archives and Version Control System (VCS) can be repurposedfor versioning RDF archives.

RCS [29] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines.The latest version is stored completely and older revisions are stored in so-called reverse deltas,resulting in quick access to the latest version. To add a new revision the system stores the latestrevision completely and replaces the previous revision by its delta, keeping the rest of the chainintact.

2.5.2 RDF Archives

This section gives an overview of existing RDF archive approaches. Fernandez et al. [30]distinguish three groups of archive storage policies.

2.5.2.1 Independant Copies

In the IC approach the dataset is stored independently for every version. In the IC approachtriples are repeated many times across different versions, so they typically have a higher storagerequirement than other approaches. Due to its straightforward and simple approach, the ICapproach is used in popular systems such as the Dynamic Linked Data Observatory [31] andDBpedia [30].

SemVersion [32] is an IC versioning system for RDF, that tries to emulate classical ConcurrentVersions System (CVS) systems for version management. Each version is stored separately in

Page 30: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 9

RDF stores that conform to a certain API, that manages said versions.

2.5.2.2 Change-Based

The CB approach tries to solve the large storage requirement of the IC approach by only storingthe changes between versions. These changes, sometimes called deltas, typically consist of aset of additions and set of deletions. However, only storing the changes introduces a versionmaterialization cost. This refers to the cost of reconstructing a database version by going overall the changes up to that point. It is clear that this cost will increase with every new versionof the dataset.

Cassidy et al. [33] propose a CB RDF archive that is built on Darcs theory of patches [51] - amathematical model that describes how patches can be manipulated in order to get the desiredversion in the context of software. This model describes fundamental operations, such as thecommute operation, the revert operation, and the merge operation. Cassidy et al. adapt theseoperations so that are applicable to RDF stores as well.

Im et al. [34] introduced a CB store with a RDBMS. They propose an aggregated deltas approachwherein not only the delta between a parent and child, but all possible deltas are stored. Thisresults in an increased storage overhead, but a decreased version materialization cost comparedto the classic sequential delta chain.

Vander Sande et al. [35] introduce R&WBase - a distributed CB RDF archive, wherein versionsare stored as consecutive deltas. Deltas between versions consist of an addition set and a deletionset, respectively listing which triples haven been added and deleted. Since deltas are stored inthe same graph, triples are annotated with a context number, indicating which version the triplebelongs and whether it was added or deleted. In particular, an even context number indicatesthe triple is an addition and an uneven context number indicates the triple is a deletion. Queriescan be handled efficiently by looking at the highest context number. If the context number iseven than the triple is present for that version. If the context number is uneven than the triple isnot present for that version. Finally, R&WBase also supports tagging, branching, and mergingof datasets.

R43ples [36] is another CB RDF archive, since it groups additions and deletions in named graphs.R43ples allows manipulation of revisions with SPARQL, by introducing new keywords such asREVISION, TAG and BRANCH. Versions are materialized by starting from the head of thebranch and applying all prior additions/deletions.

2.5.2.3 Timestamp-Based

In the TB approach triples are annotated with creation and deletion timestamps. These anno-tations ensure that no triples are stored more than once.

Hauptmann et al. [37] propose a similar delta-based store as R43ples, including complete graphsand version control via SPARQL. However, in Hauptmann’s approach, each triple is virtuallyannotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [38] extends RDF-3X [21] with versioning support. Each triple is annotated with acreation timestamp and when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [39] is an TB archiving extension on RDFCSA [40], a compact self-indexing RDFstorage that is based on suffix arrays.

Dydra [41] is a RDF archive that stores versions as named graphs in a quad store, that can bequeried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO,GPOS, GOSP, SPOG, POSG, OSPG. B+-tree values indicate for which revisions a particularquad is visible, making it a TB system.

Page 31: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 10

Figure 2.2: Unidirectional delta chain, as done in TailR.

Figure 2.3: Unidirectional delta chain where all deltas are relative to the snapshot at the beginning ofthe chain, as done in OSTRICH.

2.5.2.4 Hybrid

In the hybrid approach, the three aforementioned archiving strategies are combined.

TailR [42] interleaves fully materialized versions (snapshots) in between the delta chain. Thesnapshots reset the version materialization cost but can lead to a higher storage requirement.

OSTRICH [1] is another hybrid solution that interleaves fully materialized snapshots in betweenthe delta chain, as seen in Figure 2.3. However, unlike TailR, OSTRICH uses aggregated deltas[34] – deltas who directly refer to the snapshot, instead of the previous version. Moreover, thedelta chain is stored by annotating each triple with version information, making it a IC, CBand TB hybrid. OSTRICH focuses on providing memory-efficient query streams which can beoffsetted. In addition, OSTRICH also provides query count estimation functionality, which canbe used as a basis for query optimization in query engines [43].

2.6 Query Types

Fernandez et al. [3] identified five fundamental query groups, referred to as query atoms:

• Version Materialization (VM) queries retrieve data from a single version. For example,“Which paintings are on display today?”.

• Delta Materialization (DM) queries retrieve the differences between two versions. Forexample, “Which paintings were added or removed between yesterday and today?”.

• Version Query (VQ) annotates query result with version numbers wherein data exists. Forexample, ”When was the ’Mona Lisa’ on display?”.

• Cross-Version join (CV) joins results of two queries over two different versions. For exam-ple, “Which paintings were on display yesterday and today?”.

• Change Materialization (CM) returns a list of versions in which a given query producesconsecutively different results. For example, “When was the ’Mona Lisa’ put on displayor removed from display?”.

Although other query classifications exist [44], we will only refer the above-mentioned queryatoms, for the sake of simplicity.

Some storage policies are better suited for some query atoms than others.

Page 32: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 11

2.6.1 Independant Copies

Since all version are fully materialized and indexed in the IC approach, Version Materialization(VM) queries are relatively simple. Delta Materialization (DM) and CV queries are moderatelycomplex because two queries need to be executed. VQ and CM are very complex because allversions need to be queried.

2.6.2 Change-Based

Due to the version materialization cost, as discussed in section 2.5.2.2, VM queries are more com-plex in the CB approach than the IC approach. On the other hand, CB queries for neighboringversions are very efficient in the CB approach since those changesets are stored.

2.6.3 Timestamp-Based

In the TB solution VQ queries are particularly efficient because the triples are naturally anno-tated with version numbers wherein they exist. However, other query atoms are typically slowerthan the IC approach due to the extra checks if a triple is valid for a given version.

2.7 RDF Archive Benchmarks

A benchmark is a set of tests that measure the performance of a system. More importantly,it allows us to easily compare systems. In this section, three RDF archive benchmarks arepresented, namely BEAR [3, 45], EvoGen [46] and SPBv [44].

2.7.1 BEAR

BEAR [3, 45] is a benchmark for RDF archives that utilizes real data from three differentdomains:

• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [31].

• BEAR-B - the 100 most volatile resources from DBPedia Live [47] over the course of threemonths at three different granularities: instant, hour and daily.

• BEAR-C - 32 weekly snapshots of the Open Data Portal Watch project [48].

The data is stored under four different policies. Under the IC policy each version is storedin a separate N-Triples file, while in the CB policy, only additions and deletions of triples arestored in separate N-Triples files. Under the TB policy, a named graph annotates the tripleswith versions, while the CBTB policy only annotates the triples which have changed. TheBEAR benchmark also provides triple patterns and their corresponding query results. BEAR-Aprovides triple pattern queries and their results for the following triple patterns: S??, ?P?, ?P?,SP?, ?PO, S?O and SPO. BEAR-B provides triple pattern queries and their results for ?POand ?P? triple patterns for the hourly and daily granularity. These triple patterns are based onthe most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complexqueries that, although they cannot be efficiently resolved with current archiving strategies, theycould help foster development of new query resolution algorithms.

2.7.2 EvoGen

EvoGen [46] is a highly configurable benchmark suite that generates synthetic and evolving RDFdata. EvoGen is an extension on the Lehigh University Benchmark (LUBM) synthetic dataset,adding additional classes and properties for enabling schema evolution. Parameters can be used

Page 33: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 2. BACKGROUND 12

to configure: instance evolution, schema evolution, query workload generation and archivingstrategy.

2.7.3 SPBv

SPBv [44] is another highly configurable benchmark that generates RDF data based on theBBC’s media organization data, which they refer to as creative works. Creative works consistof properties such as: title, shortTitle, description, dateCreated, audience and format. The datagenerator tries to simulate the natural evolution of these creative works, by storing the creativeworks in different versions according to their creation date. SPBv can also be used to generatequeries. However, unlike Fernandez et al. [3], Papakonstantinou et al. consider eight querytypes:

• Modern version materialization queries fully materialize the latest version.

• Modern single-version structured queries are performed in the latest version.

• Historical version materialization queries full materialize a version in the past.

• Historical single-version structured queries are performed in a version in the past.

• Delta materialization retrieves the delta between two versions.

• Single-delta structured queries are performed on the delta of two consecutive versions.

• Cross-delta structured queries are evaluated on changes of multiple versions.

• Cross-version structured joins the results of queries on several versions, thereby re-trieving information common in several versions.

Page 34: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 3

Use Case: Friend Network

This chapter describes a use case to highlight the need for an RDF archive for large versiondatasets.

3.1 Use Case

The use case is a Friend Of A Friend (FOAF) network in social media, which captures informationsuch as who is friends with whom. As an example, consider the following raw FOAF data:

ex:Trevor foaf:knows ex:John

ex:John foaf:knows ex:Trevor

ex:John foaf:knows ex:Amy

ex:Amy foaf:knows ex:John

RDF’s triple structure is a good choice to model this data [18]. Figure 3.1 depicts the resultingRDF graph.

Since people tend to acquire and lose friends over time, FOAF networks evolve over time.In order to capture this evolution, we can version the data. For example, the system couldperiodically take a snapshot of the network and store it as a new version. Typically, a largepart of the data will remain unchanged between version. Moreover, versions can be perfectlydescribed with additions and deletions of friends with respect to another version, making CBan excellent storage policy.

As for storage size, the friend networks could become enormous when we are dealing with socialmedia such as Facebook, which has 2.2 billion active accounts at the time of writing [49] withan average of 255 friends per account [50]. Therefore, our solution needs to be storage efficient.

Finally, VM, DM and VQ queries could be performed on these friend networks. An exampleVM query could be: ”Who were my friends in sixth grade?”. An example DM query could be:”Which friends did I add after switching schools?”. An example VQ query could be: ”How longhave I been friends with someone?”. Users could interact with the system using a web-basedSPARQL endpoint. Since we are dealing with high volume data, query results could easily

Figure 3.1: Friend Network Example

13

Page 35: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 3. USE CASE: FRIEND NETWORK 14

become too large to be displayed on a single web page. This means that when a page is loaded,only a subset of the query is needed, making offsettable queries very efficient.

3.2 Requirements

From our use case we identify the following requirements for our system:

• an efficient RDF archive storage technique

• an efficient offsettable VM query stream algorithm

• an efficient offsettable DM query stream algorithm

• an efficient offsettable VQ query stream algorithm

• low storage

3.3 Need

As previously mentioned, the use case handles a dataset with very large versions. Therefore,we need a storage-efficient solution with a low ingestion time. On the other hand, we also needto support fast and offsettable queries. In chapter 2, we saw that OSTRICH [1] is the state-of-the-art in terms of offsettable RDF archives. While OSTRICH is storage-efficient comparedto other RDF archives, OSTRICH has a large ingestion time that increases with the size of thedelta chain. Taelman et al. suggest that additional snapshots can be used to limit the ingestiontime, however this, in turn, could increase the storage size since snapshots are fully materialized.Therefore, there is a need for a storage solution that maintains the ingestion time of OSTRICHwith multiple snapshots while also keeping the resulting storage size increase down.

Page 36: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 4

Storage Optimization: Bidirectional

Delta Chain

As outlined in Section 3.3, there is a need for a storage solution that maintains the ingestiontime of OSTRICH with multiple snapshots while also keeping the resulting storage size increasedown. In this work we propose a storage optimization for CB RDF archives, that is based onrestructuring the delta chain.As seen in previous works [1, 42], a delta chain consists of a fully materialized snapshot followedby a series of deltas. The main idea behind our storage optimization is moving the snapshotfrom the front of the delta chain to the middle of the delta chain, in order to potentially reducethe overall storage size. This transforms the delta chain into a bidirectional delta chain, whichdivides the original delta chain into two smaller delta chains, i.e. the reverse delta chain andthe forward delta chain. Figure 4.1 and 4.2 show two example bidirectional delta chains.In this chapter we will discuss the advantages and disadvantages of bidirectional delta chains.

4.1 Advantages Bidirectional Delta Chain

The advantages of the bidirectional delta chain depend on whether it is applied on an aggregatedor a non-aggregated delta chain, which both will be explained hereafter.

4.1.1 Advantages Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order tomaterialize a version, all preceding deltas need to be applied until the fully materialized snapshotis reached. It follows then that the version materialization cost scales with the length of thedelta chain and the size of the deltas.As stated above a bidirectional delta chain divides the original delta chain into two smaller deltachains. Moreover, the size of the deltas remains the same, since the reverse delta chain is justthe inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional

SnapshotΔ Δ Δ Δ

Reverse Delta Chain   Forward Delta Chain

Figure 4.1: A simplified non-aggregated bidirectional delta chain.

15

Page 37: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 4. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN 16

Snapshot Δ ΔΔ Δ

Reverse Delta Chain   Forward Delta ChainFigure 4.2: A simplified aggregated bidirectional delta chain.

delta chains is half of that for unidirectional delta chains.Figure 4.3 gives an example of both a unidirectional and bidirectional non-aggregated deltachain. As you can see the reverse delta chain, is simply the inverse of the original forward deltachain, so the size of the deltas remain equal. Moreover, in the bidirectional delta chain, thelength of the delta chains have been halved, thus reducing the average version materializationcost.Bidirectional non-aggregated delta chains could also potentially reduce storage size, while main-taining a similar version materialization time. Indeed, if we compare a series of two unidirectiondelta chains with a single bidirectional delta chain, one fewer snapshot would need to be stored.

4.1.2 Advantages Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference a single snapshot, which means that an aggre-gated delta contains all the changes from all preceding deltas.In this work, we assume that a higher distance between versions results in a bigger aggregateddelta chain. This assumption holds for datasets that steadily grow over time by adding more newtriples because later versions will have more and more new triples compared to earlier versions.It follows then that reducing the average distance between the snapshot and the versions resultsin smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chainsreduce the average distance between the snapshot and other versions. Therefore bidirectionaldelta chains should have a lower storage size, compared to unidirectional delta chains for growingdatasets.Figure 4.4 gives an example of both a unidirectional and a bidirectional aggregated delta chain.As seen in Subfigure 4.4a, newer versions gradually increase in size, due to the addition of newtriples. The corresponding unidirectional aggregated delta chain is shown in Subfigure 4.4b, asyou can see these newer triples are repeated in every subsequent delta. However, in the bidi-rectional aggregated delta chain, shown in Subfigure 4.4c, the deltas become smaller since thetriples are repeated for fewer versions, e.g. triple 4.Another way of reducing the the average distance between the snapshot and other versions, isintroducing an additional snapshot, as seen in Figure 6.2a. However, bidirectional delta chainshave an advantage in the sense that they only need to store a single snapshot.

4.2 Disadvantages Bidirectional Delta Chain

A bidirectional delta chain contains a reverse delta chain - a delta chain where the deltas precedethe reference snapshot. However, building such a delta chain is difficult when we need to insertversions in-order and do not know the future snapshot. Indeed, we can not calculate the deltabetween the version we need to insert and the future snapshot if the future snapshot is notknown.A fix-up algorithm is a potential way of solving this issue. In the fix-up algorithm, all versionsare stored in a forward delta chain. Once the future snapshot is inserted, the forward deltachain can be converted into a reverse delta chain.As discussed in Subsection 2.5.1, RCS [29] presents an incremental algorithm to build the reversedelta chain without the need for a fix-up algorithm. For this algorithm, the latest version is

Page 38: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 4. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN 17

1 6

2 6

4 5 6 7

3 4 6

1 2 4 6 7

4 5 6 7 8

1 4 5 6 7

(a) A fully materialized example data set.

add 1 remove 2

2 6

add 5 add 7

remove 3

add 3 add 4

remove 1add 2

remove 5add 1add 5 add 8

remove 1 remove 2

(b) An example unidirectional non-aggregated delta chain.

add 1 remove 3 remove 4

add 3 remove 5 remove 7

add 2 remove 1

4 5 6 7

add 2 remove 5add 1

add 5 add 8

remove 1 remove 2

(c) An example bidirectional non-aggregated delta chain.

Figure 4.3: An example to showcase unidirectional and bidirectional non-aggregated delta chains.Triples are represented by numbers.

1 6

2 6

4 5 6 7

3 4 6

1 2 4 6 7

4 5 6 7 8

1 4 5 6 7

(a) A fully materialized example data set.

2 6

add 3 add 4

remove 2add 1

remove 2

add 4 add 5 add 7

remove 2

add 1 add 4 add 7

add 1 add 4 add 5 add 7

remove 2

add 4 add 5 add 8 add 7

(b) An example unidirectional aggregated delta chain.

add 1 remove 4 remove 5 remove 7

add 3 remove 5 remove 7

add 2 remove 4 remove 5 remove 7

add 1 add 2

remove 5add 1 add 8

4 5 6 7

(c) An example bidirectional aggregated delta chain.

Figure 4.4: An example to showcase unidirectional and bidirectional aggregated delta chains. Triplesare represented by numbers.

Page 39: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 4. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN 18

always stored fully materialized. To add a new version, the system stores the new versioncompletely and replaces the previous version by its delta, keeping the rest of the chain intact.

4.3 Hypotheses

In this section we propose five hypotheses regarding aggregated unidirectional delta chains andaggregated bidirectional delta chains.The first hypothesis states that: “Disk space will be significantly lower for a bidirectional deltachain compared to a unidirectional delta chain.”. For the reasoning behind this hypotheses, werefer to Section 4.1.2.The second hypothesis states that: “In-order ingestion time will be lower for a unidrectionaldelta chain compared to a bidirectional delta chain.”. This hypothesis stems from the fact thatthe fix-up algorithm needs to insert the versions in a temporary forward delta chain first beforethey can be inserted in the reverse delta chain and RCS needs to calculate a delta before a newversion can be inserted.The third hypothesis states that: “The mean VM query duration will be equal for both a uni-directional delta chain and a bidirectional delta chain.”. The reasoning behind this hypothesisis that a VM query comes down to applying the stored aggregated delta to the snapshot, sowheter a delta was stored in a reverse delta chain or forward delta chain should not afect theVM query time.The fourth hypothesis states that: “The mean DM query duration will be equal for both aunidirectional delta chain and a bidirectional delta chain.”. We stated this hypothesis becauseboth the unidirectional and bidirectional delta chains store aggregated deltas.The fifth hypothesis states that: “The mean VQ query duration will be equal for both a unidi-rectional delta chain and a bidirectional delta chain.”. We state this hypothesis because a VQquery should iterate over every every version to gather the version information of each triple, sowheter a delta was stored in a reverse delta chain or forward delta chain should not afect theVQ query time.

Page 40: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 5

OSTRICH Overview

We will apply the storage optimization discussed in Chapter 4 to OSTRICH [1]. Therefore wewill give a detailed overview on OSTRICH in this chapter. The chapter is outlined as follows.First, we will give an overview of the storage structure. Next, we will give an overview of theingestion process. Finally, we will give an overview of how VM, DM and VQ are handled.

5.1 Storage Structure

In this section, we will explain the storage structure of OSTRICH [1] in more detail. Figure 5.1gives an overview of all the components in a delta chain, which will be explained in more detailin the following subsections.

5.1.1 Snapshot Storage

The first version of OSTRICH is always stored as a fully materialized snapshot, which is storedusing HDT(-FoQ) [24, 25]. HDT is a good solution for storing snapshots since it has a low storagerequirement. Moreover, HDT enables fast VM queries due to the fact that HDT stores areimmutable and have multiple indexes. Furthermore, in HDT, query results can be representedas triple streams which can be offsetted. Finally, HDT also provides cardinality estimation forthe query results, making it an excellent solution for snapshot storage.

5.1.2 Delta Chain Dictionary

A delta chain consists of two dictionaries that are used to encode the triple components, namelythe snapshot dictionary and the delta dictionary. The snapshot dictionary is the dictionary usedin HDT, and stores all mappings for triple components present in the snapshot. The snapshotdictionary is static, so it can be sorted and compressed efficiently. The delta dictionary is thedictionary that stores the triple components of newly added triples that were not present in thesnapshot. The delta dictionary is volatile since a new version can lead to new mappings. In casea triple component needs to be encoded, the snapshot dictionary is probed first, followed by thedelta dictionary in case there was no match in the snapshot dictionary. If neither dictionarycontains a mapping a new entry is created. In case a triple component needs to be decoded, areserved bit is checked that indicates wether the mapping is stored in the snapshot dictionaryor the delta dictionary.

19

Page 41: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 5. OSTRICH OVERVIEW 20

Snapshot Δ Δ

21 3Metadata

Addition Counts

ADD SPO

... ...

DEL SPO

... ...

ADD POS

... ...

ADD OSP

... ...

DEL POS

... ...

DEL OSP

... ...

HDT

Dictionary

Figure 5.1: An overview of the storage structure used in OSTRICH [1].

5.1.3 Delta Storage

OSTRICH stores subsequent versions in an aggregated delta chain. However, aggregated deltasoften contain duplicate changes across the deltas, since they contain all previous deltas, thereforea TB-like approach is used to compress the deltas. Unlike in the regular TB approach, wheretriples are annotated with timestamps, OSTRICH annotates the triples with the version whereinthe triple exists, meaning triples are stored only once.A delta chain consists of a set of triple additions and triple deletions, which are stored separatelydue to the requirements for certain query algorithms. Both additions and deletions are storedand indexed by B+ trees, with the encoded triple acting as key and the corresponding versioninformation as value. The version information consists of the triple timestamp information, localchange flags and in case of deletions also the relative position of the triple inside the delta chain.The two latter ones will be explained in the following sections.

5.1.3.1 Local Change Flags

The local change flags indicate whether a triple is local change. A local change refers to a seriesof triple instances in the delta chain that negate each other. For example, a triple that is deletedin version 1 and is added again in version 2. Since it is difficult to determine whether a changeis local, OSTRICH stores this information as local change flags for each version triple duringingestion. Local changes improve VM query evaluation since local changes can be filtered out.

5.1.3.2 Deletion Relative Position

The relative position of a deletion is the position of the deletion triple if all deletion triples inthe delta chain were sorted. The benefits of storing the relative position for deletions is two-fold.First, it allows query algorithms to efficiently offset the query results. Second, as we will describein Section 5.1.3.5, it allows us to efficiently find the deletion count for any triple pattern.

Page 42: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 5. OSTRICH OVERVIEW 21

Triple pattern Index

S P O SPOS P ? SPOS ? O OSPS ? ? SPO? P O POS? P ? POS? ? O OSP? ? ? SPO

Table 5.1: Overview of which index OSTRICH uses for each triple pattern.

5.1.3.3 Multiple Indexes

OSTRICH [1] stores a triple in three different component orders, namely SPO, POS and OSP.As seen in Table 5.1, this is sufficient to resolve any triple pattern. Since OSTRICH storesaddition and deletion separately, there are a total of six B+ trees per delta chain.

5.1.3.4 Addition Counts

In order to handle queries more efficiently, OSTRICH [1] stores a mapping from triple patternand version to the number of matching additions. These addition counts are calculated andstored during ingestion. However, since the number of mappings can grow very quickly, OS-TRICH only stores addition counts that exceed a certain threshold. If an addition count isbelow the threshold, thus not stored in the mapping, it is calculated on the fly.

5.1.3.5 Deletion Counts

As discussed in Section 5.1.3.2, every deletion is annotated with its relative position in the deltachain. OSTRICH starts by performing a backward search in the deletion trees in order to findthe largest triple for a given triple pattern. Once OSTRICH has a match, it will look up therelative position for the triple. Since it has the largest triple for the given triple pattern, thetriple will be the last deletion in the list, therefore the relative position corresponds with thedeletion count for the given triple pattern.

5.1.3.6 Metadata

In order to get a view of all versions stored in the delta chain, OSTRICH stores a list of storedversions as metadata. This also enables the system to easily find the version count.

5.2 Ingestion

Ingestion refers to inserting new versions into the storage. OSTRICH [1] focusses on ingestinga unidirectional non-aggregated changeset. In other words, the ingestion transforms the inputas seen in Figure 2.2 into the storage structure seen in Figure 2.3.In short, the algorithm performs a sort-merge join over the addition stream, the deletion streamand the input stream, which are sorted in SPO-order. The algorithm iterates over all threestreams until they are finished. In each iteration, the smallest triple between all three streamsis processed. There are seven cases:

1. deletion < input and deletion < addition

Page 43: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 5. OSTRICH OVERVIEW 22

2. addition < input and addition < deletion

3. input < addition and input < deletion

4. input == deletion and input < addition

5. input == addition and input < deletion

6. addition == deletion and addition < input

7. addition == input == deletion

In the first case, where the deletion is the smallest triple, the deletion information is copied tothe new version and the relative positions are updated in this case and all other cases. In thesecond case, where the addition is the smallest triple, the addition information is copied to thenew version. In the third case the input is the smallest triple, which means the triple was notadded or deleted in previous versions. In this case, OSTRICH adds the triple as either a non--local change addition or a non-local change deletion. In the fourth case, the input triple alreadyexists as a deletion. If the input triple is an addition it is added as a local change. Similarly,in the fifth case, the triple already exists as an addition. If the input triple is a deletion it isadded as a local change. In the sixth case the triple already existed as an addition and deletion.In this case the triple is carried over to the next version. In the seventh case the input triplealready existed as both an addition and deletion. In this case, if the input triple is an additionit becomes a deletion, and vice versa and the local change flag is carried over.

5.3 Queries

OSTRICH [1] supports three query atoms, namely VM, DM and VQ.

5.3.1 Version Materialization Query

As described by Fernandez et al. [3], VM queries retrieve data from a single version.In the following sections, we will describe how VM queries are resolved in OSTRICH [1], as wellas how the cardinality of the result stream can be estimated.

5.3.1.1 Query

Algorithm 5.1 shows how VM queries are resolved. First, the corresponding snapshot is re-trieved. Next, the snapshot is queried for the given triple pattern and the offset is applied. Ifthe requested version is the snapshot itself, the algorithm returns HDT’s snapshot iterator.The algorithm continues by initializing the addition and deletion streams to the start positionfor the given triple pattern and version. The addition and deletion streams will both filter outlocal changes since they do not affect the final result, as explained in Section 5.1.3.1.The algorithm always returns snapshot triples first before returning additions. Therefore, de-termining the offset for the snapshot, addition and deletion stream can be split up in two cases,namely the offset lies within the range of the snapshot count minus the deletion count or withinthe range of addition triples.In the first case, the offset is within the range of the snapshot count minus the deletion count, thealgorithm starts a loop that converges to the actual snapshot offset. The loop starts by lookingat the triple at the current snapshot offset. The algorithm then offsets the deletion stream withthat snapshot triple. This triple offset is done by navigating the deletion tree to the smallesttriple before or equal to the offset triple. The offset within the deletion stream is then stored.The loop continues until the sum of the original offset and deletion offset is different from thesnapshot offset.In the second case, where the offset lies within the addition range, the algorithm terminates thesnapshot iterator. The algorithm then applies an offset to the addition iterator. This is offset isthe original offset minus the snapshot count incremented with the number of deletions.

Page 44: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 5. OSTRICH OVERVIEW 23

Finally, the algorithm returns an offsettable iterator that contains the snapshot iterator, thedeletion iterator and the addition iterator. This iterator performs a sort-merge join operationto delete triples from the snapshot iterator that also appear in the deletion iterator. Once thesnapshot and deletions have been resolved, the iterator will emit all additions.

queryVm(store , tp , version , originalOffset) {

snapshot = store.getSnapshot(version ).query(tp, originalOffset)

if (snapshot.getVersion () = version) {

return snapshot

}

additions = store.getAdditionsStream(tp, version)

deletions = store.getDeletionStream(tp, version)

offset = 0

if (originalOffset < snapshot.count(tp) - deletions.exactCount(tp)) {

do {

snapshot.offset(originalOffset + offset)

offsetTriple = snapshot.peek()

deletions = store.getDeletionsStream(tp, version , offsetTriple)

offset = deletions.getOffset(tp)

} while (snapshot.getCurrentOffset () != originalOffset + offset)

}

else {

snapshot.offset(snapshot.count(tp))

additions.offset(originalOffset - snapshot.count(tp) +

deletions.exactCount(tp))

}

return PatchedSnapshotIterator(snapshot , deletions , additions)

}

Algorithm 5.1: Version Materiliazation Algorithm from OSTRICH [1]

5.3.1.2 Result Count

The result count for VM queries is the number of snapshot triples for a given triple patternsummed up with the addition count and subtracted with the deletion count. As explained inSubsection 5.1.3.4, large addition counts are calculated and stored during ingestion, so the resultcount can be efficiently calculated.

5.3.2 Delta Materialization Query

As described by Fernandez et al. [3], DM queries retrieve the differences between two versionsand annotates whether they are an addition or deletion. OSTRICH supports DM queries for asingle snapshot and forward delta chain. So we can discern two cases namely, DM query betweensnapshot-delta and DM query between two deltas in the same delta chain.In the following sections, we will describe how OSTRICH resolves DM queries, as well as howthe cardinality of the result stream is estimated.

5.3.2.1 Query

The first case, a DM query between snapshot and delta, is trivial since OSTRICH stores aggre-gated deltas, so all deltas are relative to the snapshot.The second case is a DM query between two deltas in the same delta chain. In this case,the algorithm iterates over the triples inside the addition tree and deletion tree for the given

Page 45: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 5. OSTRICH OVERVIEW 24

triple pattern, in a sort-merge join fashion. Triples are only emitted if they have a differentaddition/deletion flag for the two versions.

5.3.2.2 Result Count

In the first case, a DM query between snapshot and delta, the result count is exactly the numberof snapshot triples summed together with the number of deletions and additions for the giventriple pattern.The second case is a DM query between two deltas in the same delta chain. In this case,OSTRICH gives an estimation of the result count by summing up the additions and deletion forthe given triple pattern in both versions. This can overestimate the actual count if triples arechanged such that they negate each other inside the version range.

5.3.3 Version Query

As described by Fernandez et al. [3], VQ annotates triples with version numbers in which theyexist. OSTRICH [1] supports VQ queries for a single snapshot and forward delta chain.In the following sections, we will describe how version queries are resolved in OSTRICH, as wellas how the cardinality of the result stream can be estimated.

5.3.3.1 Query

The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern.Next, the deletion tree is probed for the triple. If the triple is not present in the deletion tree,the triple is present in all versions. If the triple is present in the deletion tree the correspondingversions are erased from the annotations. After all snapshot triples have been processed, thealgorithm iterates over the addition triples stored in the addition tree. As was the case withsnapshot triples, the deletion tree is probed again for the triple. If the triple is present in adeletion tree the versions are erased from the annotations. If the triple is not present, the tripleis present in all versions ranging from the version that introduced the triple to the last version.Result streams can be partially offsetted, by offsetting the snapshot iterator.

5.3.3.2 Result Count

The result count can be calculated by retrieving the count for the requested triple pattern inthe snapshot and adding the addition count for the requested triple pattern.

Page 46: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 6

Bidirectional RDF Archive

As mentioned in Chapter 5 we will apply the potential storage optimization that was presentedin Chapter 4 to OSTRICH [1].The chapter is outlined as follows. First, we will give an overview of the storage structure.Second, we will explain how we expanded OSTRICH to support multiple snapshots. Third, wewill give an overview of the ingestion process. Finally, we will give an overview of how VM, DMand VQ are handled.

6.1 Storage Structure

As seen in Figure 5.1 and Figure 6.1, the storage structure is similar to OSTRICH’s storagestructure which was explained in Section 5.1. Indeed, the storage structure of the bidirectionaldelta chain is just a snapshot and two delta chains, i.e. the reverse delta chain and forward deltachain.

6.2 Multiple Snapshots

In Section 4.2, we presented two in-order ingestion algorithms for bidirectional delta chainswhich utilize multiple snapshots. However, OSTRICH only supports one snapshot, therefore weneed to expand OSTRICH so multiple snapshots are supported.Supporting multiple snapshots comes down to finding the corresponding snapshot for a givenversion. We can discern three cases: the store consists of only forward delta chains, the storeconsists of only bidirectional delta chains and the store consists of a combination of bidirectionaland forward delta chains. For the first case, assuming versions are in ascending order, thecorresponding snapshot of a version is the greatest lower bound of all the snapshots, i.e. thelargest snapshot that is still smaller than the version. For the second case, if we assume the deltachains are of equal length and all versions are in incremental order, the corresponding snapshotfor a version is the snapshot that is closest to the version. In the third case, we calculate thegreatest lower bound and the least upper bound of all the snapshots for the given version. If theupper bound snapshot does not have a reverse delta chain, our version is stored in a forward deltachain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshothas a reverse snapshot, the corresponding snapshot is the snapshot closest to the version.

6.3 Ingestion

Ingestion refers to inserting new versions into the storage. In this work, we focus on ingestingunidirectional non-aggregated changesets for the delta chains and fully materialized versions for

25

Page 47: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 26

Snapshot Δ ΔΔ Δ

0 1 32 4Metadata Metadata

Addition Counts Addition Counts

ADD SPO

... ...

DEL SPO

... ...

ADD POS

... ...

ADD OSP

... ...

DEL POS

... ...

DEL OSP

... ...

HDTADD SPO

... ...

DEL SPO

... ...

ADD POS

... ...

ADD OSP

... ...

DEL POS

... ...

DEL OSP

... ...

Dictionary

Figure 6.1: An overview of the storage structure of a bidirectional delta chain. Figure adapted fromOSTRICH [1].

the snapshot. In other words, the ingestion transforms the input as seen in Figure 2.2 into thestorage structure seen in Figure 6.2b.We will first explain how we ingest versions in a forward delta chain. Next, we briefly explainhow we can insert version out-of-order in a reverse delta chain. Finally, we explain how we caninsert versions in-order in a reverse delta chain.

6.3.1 Out-of-order Ingestion

Out-of-order ingestion refers to ingesting versions in non-chronological order. Out-of-order in-gestion is difficult in a realistic setting since it requires the system to somehow buffer the inputchangeset. However, we will see that if we ignore this impracticality, we can easily ingest versionsin the reverse delta chain.Ingesting versions in a reverse delta chain is similar to ingesting in a forward delta chain, we sim-ply need to transform the input changeset. Firstly, since the forward ingestion algorithm expectsthe input changeset to reference the snapshot, we reverse the input change set by swapping theadditions and deletion so that the input changeset references the snapshot. For example, Figure4.3b shows an example input changeset and Figure 4.3c shows the reversed input changeset.Secondly, since the forward ingestion algorithm expects the version closest to the snapshot tobe inserted first, we insert the versions in reverse order.

6.3.2 In-order Ingestion: Fix-Up Algorithm

As mentioned in Section 6.3.1, out-of-order ingestion requires the system to buffer the inputchangeset, which is not always practical. Therefore we also propose an in-order ingestion algo-rithm. As discussed in Section 4.2, a potential way of inserting versions in-order in a reversedelta chain is the fix-up algorithm. In this section, we will expand upon this idea further.

Page 48: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 27

The in-order ingestion process starts by inserting the first half of the delta chain in a temporaryforward delta chain, as discussed in Section 5.2. Once the system decides a new delta chainneeds to be initiated, for example, the delta chain size exceeds a certain threshold, the systemwill store the next version once in the temporary forward delta chain and store it again as thesnapshot for the new permanent delta chain. The reason behind storing the version twice is tosimplify the input extraction, which will be explained in the following section. Subfigure 6.3shows the resulting delta chains.Once the system has some idle time the fix-up process can be performed. It is important to notethat the fix-up process can be performed at any time since the temporary forward delta chain isfully functional. In summary, the fix-up process extracts the original input changeset from thetemporary delta chain. The temporary delta chain can then be deleted and a new permanentreverse delta chain can be constructed out-of-order with the extracted input changeset. Subfig-ure 6.2b shows the final result.The input changeset is extracted from the temporary delta chain using Algorithm 6.1 to extractthe additions and Algorithm 6.2 to extract the deletions. Algorithm 6.1 starts by iterating overevery addition in the main addition tree of the delta chain. As discussed in Subsection 5.1.3,additions are annotated with version information. Since this version information represents anaggregated delta chain, we need to transform it in order to get the non-aggregated input changeset. The algorithm does this by iterating over the version information. If the previous versionis present, that means that the triple was already added in a previous version and therefore thetriple was not present in the input addition change set. If the previous version is not present inthe version information, the triple was first added in the current version and should be presentin the input changeset. We write the triple to file in order to limit memory usage. However,deserializing the triple brings additional overhead. Algorithm 6.2 extracts the input deletionchange set and is similarly resolved as Algorithm 6.1.Once the input changeset is recovered, the temporary delta chain is deleted. Finally, the ex-tracted changeset is inserted out-of-order into a reverse delta chain, as discussed in Subsection6.3.1.

extract_additions(store) {

main_addition_tree = store ->spo_additions_index ();

addition_iterator = main_addition_tree ->get_cursor ();

while (current_addition = addition_iterator ->next ()) {

versions = current_addition ->get_versions ();

for (i = 0; i < versions.size (); i++) {

if (i == 0 || versions[i-1] != versions[i] - 1) {

versions[i]/ additions.nt << current_addition ->get_triple ();

}

}

}

}

Algorithm 6.1: Algorithm to extract addition n-triple files from a delta chain.

Page 49: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 28

Snapshot Snapshot Δ Δ ΔΔ Δ

(a) State of the delta chains before the fix-up algorithm is applied.

Snapshot Δ ΔΔ ΔΔ Δ

(b) State of the delta chains after the fix-up algorithm is applied.

Figure 6.2: An illustration of the fix-up algorithm.

Snapshot Δ Δ

Snapshot Δ Δ Δ

Δ

Figure 6.3: State of the delta chains before the fix-up algorithm is applied.

extract_deletions(store) {

main_deletion_tree = store ->spo_deletions_index ();

deletion_iterator = main_deletion_tree ->get_cursor ();

while (current_deletion = deletion_iterator ->next ()) {

versions = current_deletion ->get_versions ();

for (i = 0; i < versions.size (); i++) {

if (i == 0 || versions[i-1] != versions[i] - 1) {

versions[i]/ deletions.nt << current_deletion ->get_triple ();

}

}

}

}

Algorithm 6.2: Algorithm to extract deletion n-triple files from a delta chain.

6.4 Query

In this section we will explain how VM, DM and VQ are resolved in bidirectional delta chains.

6.4.1 Version Materialized Query

As described by Fernandez et al. [3], VM queries retrieve data from a single version. VMqueries are handled exactly the same as OSTRICH [1], which was described in Subsection 5.3.1.Indeed, even in the case where the version is stored in the reverse delta chain, the algorithm isthe same since inverse deltas were ingested.

Page 50: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 29

6.4.2 Delta Materialized Query

As described by Fernandez et al. [3], DM queries retrieve the differences between two versionsand annotates whether they are an addition or deletion. In this work, we will focus on DMqueries for a single snapshot and corresponding reverse and forward delta chain. We can discernthree cases namely, a DM query between snapshot-delta, a DM query between two deltas in thesame delta chain (intra-delta DM query) and a DM query between two deltas in different deltachains (inter-delta DM query).The first case and the second case are handled exactly the same as OSTRICH [1], see Subsection5.3.2.The third case is a DM query between two versions where one version is stored in the reversedelta chain and the other version in the forward delta chain. In summary, we resolve this case bysplitting up the delta in two sequential deltas that are relative the snapshot and then mergingthe sequential deltas together. In other words if we use DARCs [51] patch notation, with o beingthe start version, e being the end version and s being the snapshot:

oDe = oD1sD2e

This strategy is quite efficient, since the delta relative to the snapshot are stored. Furthermore,since the snapshot deltas are sorted, they can be merged in a sort-merge fashion.

6.4.2.1 Query

The algorithm starts by calculating the deltas relative to the snapshot, which corresponds withthe first case of the DM query algorithm that was explained in Subsection 5.3.2. We refer tothe delta iterator between the version in the reverse delta chain and the snapshot, as the reversedelta iterator. Similarly, we refer to the delta iterator between the snapshot and the version inthe forward delta chain, as the forward delta iterator. The algorithm continues by iterating overthe two delta iterators in a sort-merge join fashion, as seen in Algorithm 6.3. If the triples at theheads of the streams are equal, we do not emit a triple, since we have an addition and deletionthat cancel each other out. If the triple at the head of the reverse delta iterator is smaller,meaning the triple is not present in the forward delta iterator, the triple and its change flag isemitted. Similarly, if the triple is the head of the forward delta iterator is the smaller, the tripleand its change flag is emitted.

next_delta_triple () {

if(forward_it ->has_next () || reverse_forward_it ->has_next ()) {

if(reverse_it ->peek_head () == forward_it ->peek_head ()) {

reverse_it ->next ();

forward_it ->next ();

}

else if(reverse_it ->peek_head () < forward_it ->peek_head ()) {

return reverse_it ->next ();

}

else {

return forward_it ->next ();

}

}

}

Algorithm 6.3: Sort-merge join algorithm for merging two delta iterators.

Page 51: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 30

6.4.2.2 Result Count

It is difficult to give an exact count of the results, for inter-delta DM queries. However, anestimation of the result count can be calculated by summing up the counts of both deltasrelative to the snapshot. However, this can overestimate the actual count if triples are presentin both deltas.

6.4.3 Version Query

As described by Fernandez et al. [3], VQ annotates triples with version numbers in which theyexist. In this work we will only present an algorithm for a single snapshot and correspondingreverse and forward delta chain. The algorithm is based on the VQ algorithm of OSTRICH,which was explained in Subsection 5.3.3.In the following sections, we will describe how version queries are resolved, as well as how thecardinality of the result stream can be estimated.

6.4.3.1 Query

As seen in Algorithm 6.4, the algorithm starts by iterating over all the triples in the snapshotfor the given triple pattern. Next, the deletion trees are probed for the triple. If the triple isnot present in the deletion tree, the triple is present in all versions. If the triple is present ina deletion tree the corresponding versions are erased from the version annotation, as seen inAlgorithm 6.6. After all the snapshot triples have been processed, the algorithm iterates overthe addition triples stored in the addition tree in a sort-merge join fashion, as seen in Algorithm6.5. As was the case with snapshot triples, the deletion trees are probed for the triple. If thetriple is not present in a deletion trees, the triple is present in all versions ranging from theversion that introduced the triple to the last version. If the triple is present in a deletion treethe versions are erased from the annotations.Result streams can be partially offsetted, by offsetting the snapshot iterator of HDT [24].

next_VQ_triple () {

if (snapshot_it ->has_next ()) {

result_triple = snapshot_it ->next ();

result_versions = erase_deleted_versions(result_triple , null , null);

return TripleVersion(result_triple , result_versions );

}

// iterate over additions in a sort -merge join fashion

// and erase deletions

if (has_next_addition ()) {

result_triple = next_addition ();

result_versions = erase_deleted_versions(null , result_triple );

return TripleVersion(result_triple , result_versions );

}

return false;

}

Algorithm 6.4: Version Query Algorithm

Page 52: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 31

next_addition () {

if(forward_it ->has_next () || reverse_forward_it ->has_next ()) {

if(reverse_it ->peek_head () == forward_it ->peek_head ()) {

return (reverse_it ->next(), reverse_it ->next ());

}

else if(reverse_it ->peek_head () < forward_it ->peek_head ()) {

return (reverse_it ->next(), null);

}

else {

return (null , forward_it ->next ());

}

}

else {

return (null , null);

}

}

Algorithm 6.5: Algorithm to merge forward and reverse addition iterators.

Page 53: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 6. BIDIRECTIONAL RDF ARCHIVE 32

erase_deleted_versions(

snapshot_triple , reverse_addition , forward_addition) {

// initialise version annotation optimistically

smallest_version = reverse_patch_tree ->get_min ();

largest_version = forward_patch_tree ->get_max ();

if(snapshot_triple) {

// vector from smallest_version to largest_version with step size 1

versions = [smallest_version : largest_version ];

}

else if(reverse_addition && forward_addition) {

versions_reverse =

[smallest_version : reverse_addition.get_largest_version ()];

version_forward =

[forward_addition.get_largest_version () : largest_version ];

versions = versions_reverse + version_forward;

}

else if(reverse_addition) {

versions =

[smallest_version : reverse_addition.get_largest_version ()];

}

else if(forward_addition) {

versions =

[forward_addition.get_smallest_version () : largest_version ];

}

// erase deletions from reverse delta chain

deletion_versions = reverse_patchtree ->get_deletion(triple );

versions.erase(deletion_versions );

// erase deletions from forward delta chain

deletion_versions = forward_patchtree ->get_deletion(triple );

versions.erase(deletion_versions );

return versions;

}

Algorithm 6.6: Algorithm to calculate version annotations in VQ queries.

6.4.3.2 Result Count

The result count can be estimated by retrieving the count for the requested triple pattern in thesnapshot and adding the addition counts for the requested triple pattern from the reverse andforward delta chains. This can overestimate the actual result count if a triple is added in bothdelta chains, since the triple will only appear once in the result stream.

Page 54: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 7

Evaluation

In this chapter, we will evaluate COBRA and compare it with OSTRICH[1]. We start bydescribing the software implementation of the bidirectional RDF archive described in Chapter6. Next, we will outline our experimental setup. Next, we will report the results from ourexperiments. Finally, we will discuss and interpret these results.

7.1 COBRA Implementation

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ softwareimplementation of the storage described in Chapter 6 and can be found at https://github.ugent.be/tpmahieu/COBRA. COBRA uses the same technologies as OSTRICH [1]. COBRAuses HDT-(FoQ) [24, 25] for storing the snapshot. Moreover, the extended dictionary is com-pressed with gzip. For our B+ tree indexes, we use the B+ tree implementation from KyotoCabinet (http://fallabs.com/kyotocabinet/), which is memory-mapped and can be easilycompressed. From Kyoto Cabinet, we also use the Hash Database implementation for storingthe addition counts, which is also memory-mapped.

7.2 Experimental Setup

In this work we will evaluate the ingestion capabilities and the query resolution capabilitiesof COBRA. For this we will use the BEAR [45] benchmark which can be found at https://aic.ai.wu.ac.at/qadlod/bear.html. In particular, we will use the BEAR-A, BEAR-Bdaily and BEAR-B hourly benchmarks. All experiments were performed on a 64-bit Ubuntu14.04 machine with a 6-core 2.40 GHz CPU and 48 GB of RAM.

7.2.1 Ingestion

The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A we willonly ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly,we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions.We will do the ingestion evaluation for multiple storage layouts and ingestion orders namely:

• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.3.

• OSTRICH-2F: OSTRICH with two forward delta chains, as seen in Figure 6.2a.

• COBRA-PRE FIX UP, COBRA’s pre fix-up state, as seen in Figure 6.3.

• COBRA-POST FIX UP, COBRA’s bidirectional delta chain post fix-up, as seen in 6.2b.

33

Page 55: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 34

• COBRA-OUT OF ORDER, COBRA’s bidirectional delta chain, as seen in 6.2b, but in-gested out-of-order (snapshot - reverse delta chain - forward delta chain).

7.2.2 Query

The BEAR benchmark also provides query sets that are grouped by triple pattern. BEAR-Aprovides seven query sets containing around 100 triple patterns that are further divided in highresult cardinality and low result cardinality. BEAR-B only provides two query sets that contain?P? and ?PO queries. These queries will be evaluated as VM queries for all version, DM queriesbetween all versions and a VQ query. In order to minimize outliers, we replicate the queries fivetimes and take the mean results. Furthermore, we also perform a warm-up period before thefirst query of each triple pattern.Since neither OSTRICH nor COBRA support multiple snapshots for all query atoms, we limitour experiments to OSTRICH’s unidrectional storage layout and COBRA’s bidirectional storagelayout.

7.3 Results

In this section we will present the results from our experiments. The raw results can be foundat https://github.ugent.be/tpmahieu/COBRA. A discusion of these results will be presentedin the next section.

7.3.1 Ingestion Results

Table 7.1a displays the storage sizes and ingestion times of the different delta chain configurationsfor the first eight versions of the BEAR-A benchmark. Figure 7.1 and Figure 7.2 show thecumulative storage size and cumulative ingestion time per version , while Figure 7.3 shows theindividual ingestion time per version.For BEAR-A, OSTRICH-1F has the highest ingestion time and requires more storage thanCOBRA-OUT OF ORDER. OSTRICH-2F ingests the fastest but also requires more storagethan COBRA-OUT OF ORDER. COBRA-PRE FIX UP requires the most storage space andingests faster than OSTRICH-1F but slower than OSTRICH-2F.

Table 7.1b shows the ingestion times of the different approaches for BEAR-B daily. Figure 7.4and Figure 7.5 display the cumulative storage size and cumulative ingestion time, while Figure7.6 shows the individual ingestion time per version.For BEAR-B daily, OSTRICH-1F has the lowest storage size but also the highest ingestion time.OSTRICH-2F has the lowest ingestion time. COBRA-PRE FIX UP has a similar ingestion timeand storage size as OSTRICH-2F. COBRA-OUT OF ORDER has the higest storage size.

Table 7.1c shows the ingestion times of the different approaches for the first 400 version ofBEAR-B hourly. Figure 7.7 and Figure 7.8 display the cumulative storage size and cumulativeingestion time for every version for BEAR-B. Figure 7.9 shows the ingestion time per version.For BEAR-B hourly, OSTRICH-2F has the lowest storage size and the lowest ingestion time.COBRA-PRE FIX UP has a similar ingestion time and storage size as OSTRICH-2F. COBRA-OUT OF ORDER has a higer ingestion time and storage size compared to OSTRICH-2F.OSTRICH-1F has the highest storage size and ingestion time.

7.3.2 Query Results

Figures 7.10, 7.12 and 7.13 show the mean VM, DM and VQ query duration for all triple patternsprovided by the BEAR-A benchmark. In order to have outlier query durations to influence theresults, we opted for the mean over the median. Appendix A lists the average query durationsper tripple pattern.

Page 56: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 35

Figure 7.1: Comparison of the cumulative storage sizes (in GB) per version for the first eight versionsof the BEAR-A benchmark.

Figure 7.2: Comparison of the cumulative ingestion times (in hours) per version for the first eightversions of the BEAR-A benchmark.

Page 57: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 36

Figure 7.3: Comparison of the individual ingestion times (in minutes) per version for the first eightversions of the BEAR-A benchmark.

Figure 7.4: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B dailybenchmark.

Page 58: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 37

Figure 7.5: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B dailybenchmark.

Figure 7.6: Comparison of the individual ingestion time (in min) per version of the BEAR-B dailybenchmark.

Page 59: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 38

Figure 7.7: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourlybenchmark.

Figure 7.8: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourlybenchmark.

Page 60: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 39

Approach Storage Size (GB) Ingestion Time (hour)

OSTRICH-1F 3.92 23.66OSTRICH-2F 3.83 11.45COBRA-PRE FIX UP 4.31 12.92COBRA-POST FIX UP 3.36 12.92 + 8.38COBRA-OUT OF ORDER 3.40 14.63

(a) Storage sizes and ingestion times of the different approaches for BEAR-A.

Storage Layout Storage Size (MB) Ingestion Time (min)

OSTRICH-1F 19.37 6.53OSTRICH-2F 25.90 3.18COBRA-PRE FIX UP 26.01 3.28COBRA-POST FIX UP 29.15 3.28 + 2.48COBRA-OUT OF ORDER 28.44 4.24

(b) Storage sizes and ingestion times of the different approaches for BEAR-B daily.

Storage Layout Storage Size (MB) Ingestion Time (min)

OSTRICH-1F 61.02 34.47OSTRICH-2F 46.40 14.85COBRA-PRE FIX UP 46.42 14.87COBRA-POST FIX UP 55.42 14.87 + 11.41COBRA-OUT OF ORDER 53.26 18.30

(c) Storage sizes and ingestion times of the different approaches for BEAR-B hourly.

Table 7.1: Storage sizes and ingestion times of the different approaches for all three benchmarks.COBRA-POST FIX UP represents the in-order ingestion of the bidirectional delta chain using the fix-upalgorithm. Therefore, the ingestion time is the sum of the ingestion time of COBRA-PRE FIX UP and

the fix-up time.

For BEAR-A, COBRA resolves VM queries faster than OSTRICH. Similarly, COBRA resolvesDM queries slighlty faster than OSTRICH. Finally, VQ queries are resolved sligthly faster inOSTRICH compared to COBRA.

Figures 7.14, 7.16 and 7.17 show the average VM, DM and VQ query duration for all triplepatterns provided by the BEAR-B benchmark for the all versions of the BEAR-B daily dataset.Appendix B lists the average query durations per tripple pattern.For BEAR-B daily, COBRA resolves VM queries and DM queries faster than OSTRICH. VQquery durations are similar for OSTRICH and COBRA.

Figures 7.18, 7.20 and 7.21 show the average VM, DM and VQ query duration for all triplepatterns provided by the BEAR-B benchmark for the first 400 versions of the BEAR-B hourlydataset. Appendix C lists the average query durations per triple pattern.For BEAR-B hourly, COBRA resolves VM and DM queries faster than OSTRICH. VQ querydurations are similar for OSTRICH and COBRA.

7.4 Discussion

In section, we discuss the results presented in Section 7.3. First, we will discuss the ingestionresults. Second, we will discuss the query results. Finally, we evaluate our hypotheses that werepresented in 4.3.

Page 61: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 40

Figure 7.9: Comparison of the individual ingestion time (in min) per version of the BEAR-B hourlybenchmark.

Figure 7.10: Average VM query durations for all triple patterns in BEAR-A.

Page 62: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 41

Figure 7.11: Average DM query durations between version 3 and all other versions for all triple patternsin BEAR-A.

Figure 7.12: Average DM query durations between all versions for all triple patterns in BEAR-A.

Page 63: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 42

Figure 7.13: Average VQ query durations for all triple patterns in BEAR-A.

Figure 7.14: Average VM query durations for all provided triple patterns in BEAR-B daily.

Page 64: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 43

Figure 7.15: Average DM query durations between version 3 and all other versions for all triple patternsin BEAR-B daily.

Figure 7.16: Average DM query durations between all versions for all triple patterns in BEAR-B daily.

Page 65: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 44

Figure 7.17: Average VQ query durations for all provided triple patterns in BEAR-B daily.

Figure 7.18: Average VM query durations for all provided triple patterns in the first 400 versions ofBEAR-B hourly.

Page 66: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 45

Figure 7.19: Average DM query durations between version 3 and all other versions for all triple patternsin BEAR-B hourly.

Figure 7.20: Average DM query durations between all versions for all triple patterns in BEAR-B hourly.

Page 67: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 46

Figure 7.21: Average VQ query durations for all provided triple patterns in the first 400 versions ofBEAR-B hourly.

7.4.1 Ingestion Evaluation

7.4.1.1 Storage Size

There is no aproach that has the lowest storage size for all the benchmarks. Indeed, COBRAhas the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-Bdaily and OSTRICH-2F has the lowest storage size for BEAR-B hourly.As mentioned above, COBRA has the lowest storage size for BEAR-A. In Figure 7.1, we canidentify two causes for this storage size reduction. First, we see that the reverse delta chain,version 0 up to 4, has a lower storage size than the forward delta chain of OSTRICH-1F,OSTRICH-2F and COBRA-PRE FIX UP. Second, we see that for version 4 OSTRICH-2F andCOBRA-PRE FIX UP have a storage increase due to the additional snapshot, which does notneed to be stored for OSTRICH-1F and COBRA-OUT OF ORDER.For BEAR-B daily, OSTRICH-1F has the lowest storage size, due to a large storage increasefor the other approaches, which can be seen in Figure 7.4. This large storage increase is theresult of the additional delta chain of OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER, which is initialized in version 4. We can also see that the this storageincrease is smaller for COBRA-OUT OF ORDER compared to COBRA-PRE FIX UP andOSTRICH-2F, due to the additional snapshot of the latter. However, in this case, COBRA-OUT OF ORDER does not have a smaller storage size than OSTRICH-2F and COBRA-PRE FIX UP,due to the storage size of the reverse delta chain.For BEAR-B hourly, OSTRICH-2F has the lowest storage size. In Figure 7.7, we can again seethe large storage increase for OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER,which is again smaller for COBRA-OUT OF ORDER. However, for BEAR-B hourly OSTRICH-1F has the largest storage size, which implies that additional snapshots can reduce the totalstorage size.

Page 68: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 47

7.4.1.2 Ingestion Time

In figure 7.3 and Figure 7.6 we observe that for all storage configurations, the ingestion timeincreases with the the number of versions until a new delta chain in initiated. The reasonfor this behiour is that the ingestion algorithm needs to iterate over the addition and deletiontrees, so the ingestion time increases with the size of the addition and deletion trees. Therefore,OSTRICH-1F ingests slower than the other approaches.In Figure 7.2, 7.5 and 7.8 we see that OSTRICH-2F has the lowest ingestion time for all thebenchmarks. COBRA-PRE FIX UP has similar ingestion times for OSTRICH-2F for BEAR-Bdaily and BEAR-B hourly, but ingests much slower for BEAR-A. As explained in Subsection6.3.2, COBRA-PRE FIX UP stores the middle version twice, in order to speed-up and simplifythe fix-up step. For datasets with many small versions such as BEAR-B, the storage size andingestion time increase is neglible. However, for datasets with only a few large versions, such asBEAR-A, the storage size and ingestion time increase is non-neglible.In Table 7.1a, 7.1b and 7.1c, we see that the fix-up time is quite large, but the in-order ingestionduration for COBRA is still lower than the ingestion duration for OSTRICH-1F, for all evaluatedbenchmarks.

7.4.2 Query Evaluation

VM queries are resolved faster by COBRA compared to OSTRICH, eventhough COBRA andOSTRICH have the same VM algorithm. We attribute this discrepency to the smaller deltachains of COBRA.DM queries are resolved faster by COBRA compared to OSTRICH for all three benchmarks. InFigure 7.11, 7.15 and 7.19, we can see that intra-delta DM queries are resolved faster in COBRAcompared to OSTRICH. Indeed, as discussed in Subsection 5.3.2 intra-delta DM queries relyon iterating over the deletion and addition trees, so smaller addition and deletion trees resultin faster inter-delta DM queries. Therefore, since COBRA has halved OSTRICH’s delta chain,COBRA also halved the inter-delta DM query time. In Figure 7.11, 7.19 and 7.15, we also seethat COBRA handles inter-delta DM queries as efficienly as OSTRICH handles intra-delta DMqueries.VQ query durations are roughly equal for both OSTRICH and COBRA, with OSTRICH beingfaster for BEAR-A but slower for BEAR-B daily and BEAR-B hourly. As discussed in Subsection5.3.3 and Subsection 6.4.3, COBRA’s VQ algorithm is similar to OSTRICH’s VQ algorithm. SoCOBRA has only a limited extra overhead compared to OSTRICH, namely the merging of thereverse and forward addition tree, and a potential additonal look-up in the reverse deletion tree.Which, as evidenced by the Figure 7.13, 7.17 and 7.21, only have a limited additional overhead.

7.4.3 Hypotheses Evaluation

In Section 4.3 we proposed five hypotheses, which will be evaluated in this section based on theexperimental results of our software implementation.The first hypotheses states that: “Disk space will be significantly lower for a bidirectional deltachain compared to a unidirectional delta chain.”. In Subsection 7.4.1 we discussed that COBRAhas a lower storage size than OSTRICH for BEAR-A and BEAR-B hourly, but that COBRAhas a higher storage size for BEAR-B daily. Therefore we reject this hypotheses.The second hypotheses stated that: “In-order ingestion time will be lower for a unidrectionaldelta chain compared to a bidirectional delta chain.” we discussed that OSTRICH-1F ingeststhe slowest for all benchmarks. Therefore, we reject this hypotheses.The third hypotheses stated that: “The mean VM query duration will be equal for both aunidirectional delta chain and a bidirectional delta chain.”. As discussed in Subsection 7.4.2,the average VM query duration is faster for COBRA compared to OSTRICH, due to the smallerdelta chains of COBRA. Therefore, we reject this hypotheses.The fourth hypotheses stated that: “The mean DM query duration will be equal for both aunidirectional delta chain and a bidirectional delta chain.”. As discussed in Subsection 7.4.2,the average DM is resolved faster in COBRA compared to OSTRICH, due to the lower intra-

Page 69: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 7. EVALUATION 48

delta DM query duration. Therefore, we reject this hypotheses.The fifth hypotheses stated that: “The mean VQ query duration will be equal for both aunidirectional delta chain and a bidirectional delta chain.”. As discussed in Subsection 7.4.2,the average VQ query duration is roughly equal for both OSTRICH and COBRA. Therefore,we accept this hypotheses.

Page 70: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Chapter 8

Conclusion and Future Work

In this chapter, we will give a conclusion for this thesis and list potential future work.

8.1 Conclusion

In this work, we presented bidirectional delta chains as a potential storage optimization forCB RDF archives. We applied this storage optimization on an existing RDF archive namedOSTRICH [1]. For this purpose, we modified OSTRICH so that multiple snapshots could besupported. Next, we presented an in-order ingestion algorithm using a fix-up strategy. Moreover,we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existingVQ query algorithm so that bidirectional chains are supported.

In our ingestion results presented in Section 7.4.1, we confirmed that multiple snapshots are aviable method of reducing the ingestion time for OSTRICH, as Taelman et al. [1] suggested.Moreover, we discovered that for all three evaluated benchmarks, COBRA has a lower ingestiontime than OSTRICH, even with the additional ingestion cost due to the fix-up time. Finally,we saw that COBRA does not reduce the storage size for all three evaluated benchmarks, onlyfor the first eight versions of BEAR-A. From the results of BEAR-B hourly, we conclude thatOSTRICH with one snapshot, performs poorly for datasets with many versions. So, in thiscase, we recommend breaking up the delta chain by introducing an additional snapshot or usinga bidirectional delta chain. However from the results of BEAR-B daily, we conclude that forsmaller datasets the delta chain should be sufficiently large, before a new delta chain is initiatedso that the initial cost of a new delta chain does not dominate the storage size. Finally, wesaw that for all benchmarks COBRA reduced the storage increase from the second delta chain,because COBRA does not need to store the second snapshot. However this did not always resultin an overall storage size reduction, due to the size of COBRA’s reverse delta chain. Thereforewe recommend transforming two unidirectional delta chains into a bidirectional delta chain, onlyif the first delta chain is more similar to the second snapshot, so that the resulting reverse deltachain will be smaller than the current forward delta chain.

In our query results presented in Section 7.4.2, we saw that COBRA reduces the VM querydurations. We attributed this to the smaller delta chain of COBRA, so we expect similar resultsfor OSTRICH with two snapshots. Similarly, for the DM queries, we observed that COBRA hasa reduced inter-delta DM cost compared to OSTRICH, which we also attributed to the smallerdelta chain, so similar gains can be expected for OSTRICH with two snapshots. Finally, we sawthat VQ queries are roughly equal for both OSTRICH and COBRA.

In conclusion, binary delta chains are not the all-round storage optimization technique we set-out to find at the start of this work, however it is a viable tool for reducing the overall storagesize in certain cases. In particular, for merging two unidirectional delta chains when the firstdelta chain is more similar to the second snapshot.

49

Page 71: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

CHAPTER 8. CONCLUSION AND FUTURE WORK 50

8.2 Future Work

In this work there are a number of potentially interesting research opportunities left for futurework. First, as mentioned in the previous section, there needs to be a reliable way of predictingwhether a delta chain is more similar to the preceding snapshot or the future snapshot, beforetwo unidirectional delta chains can be transformed into a bidirectional delta chain. In Chapter4, we also presented an alternative algorithm for ingesting versions in-order in a bidirectionaldelta chain which could be implemented and evaluated. Moreover, in Section 7.4 we discussedthat storing an additional version for COBRA’s pre fix-up state has a nonnegligible overheadfor large versions, so future work could research how to extract the input changeset from thesnapshot so the additional version would not need to be stored twice. Finally, additional researchis needed to expand the current DM and VQ algorithms for multiple snapshots and allow formore efficient offsets.

Page 72: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Bibliography

[1] R. Taelman, R. Verborgh, and E. Mannens, “Exposing RDF archives using triple patternfragments,” in Lecture Notes in Computer Science (including subseries Lecture Notes inArtificial Intelligence and Lecture Notes in Bioinformatics), 2017.

[2] C. Schoreels, B. Logan, and J. M. Garibaldi, “Agent based genetic algorithm employing fi-nancial technical analysis for making trading decisions using historical equity market data,”in Intelligent Agent Technology, 2004.(IAT 2004). Proceedings. IEEE/WIC/ACM Interna-tional Conference on, pp. 421–424, IEEE, 2004.

[3] J. D. Fernandez, J. Umbrich, A. Polleres, and M. Knuth, “Evaluating query and storagestrategies for rdf archives,” in Proceedings of the 12th International Conference on SemanticSystems, SEMANTiCS 2016, (New York, NY, USA), pp. 41–48, ACM, 2016.

[4] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web,” Scientific American,vol. 284, no. 5, pp. 34–43, 2001.

[5] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data - the story so far,” InternationalJournal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.

[6] F. Manola, E. Miller, B. McBride, et al., “Rdf primer,” W3C recommendation, vol. 10,no. 1-107, p. 6, 2004.

[7] G. Karvounarakis, A. Magganaraki, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl,and T. Tolle, “Querying the Semantic Web with RQL,” Computer Networks, 2003.

[8] A. Bernstein and C. Kiefer, “Imprecise rdql: Towards generic retrieval in ontologies usingsimilarity joins,” in Proceedings of the 2006 ACM Symposium on Applied Computing, SAC’06, (New York, NY, USA), pp. 1684–1689, ACM, 2006.

[9] W. S. W. Group et al., “Sparql 1.1 overview, w3c recommendation 21 march 2013,” 2012.

[10] D. C. Faye, O. Cure, and G. Blin, “A survey of RDF storage approaches,” Revue Africainede la Recherche en Informatique et Mathematiques Appliquees, vol. 15, pp. 11–35, 2012.

[11] O. Cure and G. Blin, RDF database systems: triples storage and SPARQL query processing.Morgan Kaufmann, 2014.

[12] K. J. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds, “Efficient rdf storage and retrievalin jena2,” in SWDB, 2003.

[13] K. Wilkinson, “Jena Property Table Implementation,” in SSWS, (Athens, Georgia, USA),pp. 35–46, 2006.

[14] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, “Sw-store: a vertically parti-tioned dbms for semantic web data management,” The VLDB Journal, vol. 18, pp. 385–406,Apr 2009.

[15] O. Cure and G. Blin, “An update strategy for the waterfowl rdf data store.,” in Inter-national Semantic Web Conference (Posters and Demos) (M. Horridge, M. Rospocher,and J. van Ossenbruggen, eds.), vol. 1272 of CEUR Workshop Proceedings, pp. 377–380,CEUR-WS.org, 2014.

51

Page 73: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 52

[16] X. Pu, J. Wang, Z. Song, P. Luo, and M. Wang, “Efficient incremental update and queryingin aweto rdf storage system,” Data and Knowledge Engineering, vol. 89, pp. 55 – 75, 2014.

[17] R. Punnoose, A. Crainiceanu, and D. Rapp, “Rya: A scalable rdf triple store for the clouds,”in Proceedings of the 1st International Workshop on Cloud Intelligence, Cloud-I ’12, (NewYork, NY, USA), pp. 4:1–4:8, ACM, 2012.

[18] A. Harth and S. Decker, “Optimized index structures for querying rdf from the web,” inProceedings of the Third Latin American Web Congress, LA-WEB ’05, (Washington, DC,USA), pp. 71–, IEEE Computer Society, 2005.

[19] A. Harth, J. Umbrich, A. Hogan, and S. Decker, “YARS2: A federated repository for query-ing graph structured data from the Web,” in Lecture Notes in Computer Science (includingsubseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007.

[20] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: sextuple indexing for semantic webdata management,” Proc. VLDB Endow., vol. 1, pp. 1008–1019, Aug. 2008.

[21] T. Neumann and G. Weikum, “Rdf-3x: A risc-style engine for rdf,” Proc. VLDB Endow.,vol. 1, pp. 647–659, Aug. 2008.

[22] M. Atre, J. Srinivasan, and J. A. Hendler, “Bitmat: A main-memory bit matrix of rdftriples for conjunctive triple pattern queries,” in International Semantic Web Conference,2008.

[23] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu, “Triplebit: A fast and compactsystem for large scale rdf data,” Proc. VLDB Endow., vol. 6, pp. 517–528, May 2013.

[24] J. D. Fernandez, M. A. Martınez-Prieto, C. Gutierrez, A. Polleres, and M. Arias, “Binaryrdf representation for publication and exchange (hdt),” Web Semantics: Science, Servicesand Agents on the World Wide Web, vol. 19, pp. 22 – 41, 2013.

[25] M. A. Martınez-Prieto, M. Arias Gallego, and J. D. Fernandez, “Exchange and consumptionof huge rdf data,” in The Semantic Web: Research and Applications (E. Simperl, P. Cimi-ano, A. Polleres, O. Corcho, and V. Presutti, eds.), (Berlin, Heidelberg), pp. 437–452,Springer Berlin Heidelberg, 2012.

[26] O. Cure, G. Blin, D. Revuz, and D. C. Faye, “Waterfowl: A compact, self-indexed andinference-enabled immutable rdf store,” in The Semantic Web: Trends and Challenges(V. Presutti, C. d’Amato, F. Gandon, M. d’Aquin, S. Staab, and A. Tordai, eds.), (Cham),pp. 302–316, Springer International Publishing, 2014.

[27] M. J. Rochkind, “The source code control system,” IEEE Transactions on Software Engi-neering, vol. SE-1, pp. 364–370, Dec 1975.

[28] C. Schneider, A. Zundorf, and J. Niere, “Coobra - a small step for development tools tocollaborative environments,” in Workshop on Directions in Software Engineering Environ-ments, 2004.

[29] T. W. F., “Rcs — a system for version control,” Software: Practice and Experience, vol. 15,no. 7, pp. 637–654, 1982.

[30] J. D. Fernandez, A. Polleres, and J. Umbrich, “Towards efficient archiving of dynamic linkedopen data,” in CEUR Workshop Proceedings, 2015.

[31] T. Kafer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, “Exploring the dy-namics of linked data,” in The Semantic Web: ESWC 2013 Satellite Events (P. Cimiano,M. Fernandez, V. Lopez, S. Schlobach, and J. Volker, eds.), (Berlin, Heidelberg), pp. 302–303, Springer Berlin Heidelberg, 2013.

[32] M. Volkel and T. Groza, “SemVersion: An RDF-based Ontology Versioning System,” inProceedings of IADIS International Conference on WWW/Internet (IADIS 2006) (M. B.Nunes, ed.), (Murcia, Spain), pp. 195–202, October 2006.

Page 74: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 53

[33] S. Cassidy and J. Ballantine, “Version control for rdf triple stores.,” in ICSOFT 2007 -2nd International Conference on Software and Data Technologies, Proceedings, pp. 5–12,01 2007.

[34] D.-H. Im, S.-W. Lee, and H.-J. Kim, “A version management framework for rdf triplestores,” International Journal of Software Engineering and Knowledge Engineering, vol. 22,no. 01, pp. 85–106, 2012.

[35] M. Vander Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. Van de Walle,“R&Wbase: git for triples,” in Proceedings of the 6th Workshop on Linked Data on the Web(C. Bizer, T. Heath, T. Berners-Lee, M. Hausenblas, and S. Auer, eds.), vol. 996 of CEURWorkshop Proceedings, May 2013.

[36] M. Graube, S. Hensel, and L. Urbas, “R43ples: Revisions for triples an approach for versioncontrol in the semantic web,” in CEUR Workshop Proceedings, 2014.

[37] C. Hauptmann, M. Brocco, and W. Worndl, “Scalable semantic version control for linkeddata management,” in LDQ@ESWC, 2015.

[38] T. Neumann and G. Weikum, “x-rdf-3x: Fast querying, high update rates, and consistencyfor rdf databases,” Proc. VLDB Endow., vol. 3, pp. 256–263, Sept. 2010.

[39] A. Cerdeira-Pena, A. Farina, J. D. Fernandez, and M. A. Martınez-Prieto, “Self-indexingrdf archives,” in 2016 Data Compression Conference (DCC), pp. 526–535, March 2016.

[40] N. R. Brisaboa, A. Cerdeira-Pena, A. Farina, and G. Navarro, “A compact RDF store usingsuffix arrays,” in Lecture Notes in Computer Science (including subseries Lecture Notes inArtificial Intelligence and Lecture Notes in Bioinformatics), 2015.

[41] J. Anderson and A. Bendiken, “Transaction-time queries in Dydra,” in Joint proceedings ofthe 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW2017) and the 4th Workshop on Linked Data Quality (LDQ 2017) co-located with 14thEuropean Semantic Web Conference (ESWC 2017), 2016.

[42] P. Meinhardt, M. Knuth, and H. Sack, “Tailr: a platform for preserving history on theweb of data,” in Proceedings of the 11th International Conference on Semantic Systems,pp. 57–64, ACM, 2015.

[43] R. Verborgh, M. Vander Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester,G. Haesendonck, and P. Colpaert, “Triple pattern fragments: a low-cost knowledge graphinterface for the web,” Web Semantics: Science, Services and Agents on the World WideWeb, vol. 37, pp. 184–206, 2016.

[44] V. Papakonstantinou, G. Flouris, I. Fundulaki, K. Stefanidis, and Y. Roussakis, “Spbv:Benchmarking linked data archiving systems,” in Joint Proceedings of BLINK2017: 2ndInternational Workshop on Benchmarking Linked Data and NLIWoD3: Natural LanguageInterfaces for the Web of Data co-located with 16th International Semantic Web Conference(ISWC 2017), Vienna, Austria, October 21st - to - 22nd, 2017., 2017.

[45] J. D. F. Garcia, J. Umbrich, and A. Polleres, “Bear: Benchmarking the efficiency of rdfarchiving,” Working Papers on Information Systems, Information Business and Operations02/2015, Department fur Informationsverarbeitung und Prozessmanagement, WU ViennaUniversity of Economics and Business, Vienna, 2015.

[46] M. Meimaris and G. Papastefanatos, “The EvoGen benchmark suite for evolving RDFdata,” in CEUR Workshop Proceedings, 2016.

[47] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann, “Dbpedia and the liveextraction of structured data from wikipedia,” Program, vol. 46, no. 2, pp. 157–181, 2012.

[48] J. Umbrich, S. Neumaier, and A. Polleres, “Quality assessment and evolution of opendata portals,” in Future Internet of Things and Cloud (FiCloud), 2015 3rd InternationalConference on, pp. 404–411, IEEE, 2015.

Page 75: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 54

[49] “Number of facebook users worldwide 2008-2018 — statistic.” https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/.Accessed: 2018-05-24.

[50] R. I. M. Dunbar, “Do online social media cut through the constraints that limit the size ofoffline social networks?,” Royal Society Open Science, 2016.

[51] D. Roundy, “Darcs: Distributed version management in haskell,” in Proceedings of the 2005ACM SIGPLAN Workshop on Haskell, Haskell ’05, (New York, NY, USA), pp. 1–4, ACM,2005.

Page 76: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

Appendices

A BEAR-A Query Results

In this section we list the average query dura-tion for all the triple patterns of the BEAR-Abenchmark.

Figure 1: Average VM query durations for SPOtriple patterns in the first eight versions of BEAR-

A.

Figure 2: Average VM query durations for lowcardinality S?O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 3: Average VM query durations for lowcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 4: Average VM query durations for highcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

55

Page 77: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 56

Figure 5: Average VM query durations for lowcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 6: Average VM query durations for highcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 7: Average VM query durations for lowcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 8: Average VM query durations for highcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 9: Average VM query durations for lowcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 10: Average VM query durations for highcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Page 78: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 57

Figure 11: Average VM query durations for lowcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 12: Average VM query durations for highcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 13: Average DM query durations for SPOtriple patterns in the first eight versions of BEAR-

A.

Figure 14: Average DM query durations for lowcardinality S?O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 15: Average DM query durations for lowcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 16: Average DM query durations for highcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

Page 79: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 58

Figure 17: Average DM query durations for lowcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 18: Average DM query durations for highcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 19: Average DM query durations for lowcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 20: Average DM query durations for highcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 21: Average DM query durations for lowcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 22: Average DM query durations for highcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Page 80: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 59

Figure 23: Average DM query durations for lowcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 24: Average DM query durations for highcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 25: Average VQ query durations for SPOtriple patterns in the first eight versions of BEAR-

A.

Figure 26: Average VQ query durations for lowcardinality S?O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 27: Average VQ query durations for lowcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 28: Average VQ query durations for highcardinality SP? triple patterns in the first eight ver-

sions of BEAR-A.

Page 81: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 60

Figure 29: Average VQ query durations for lowcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 30: Average VQ query durations for highcardinality ?PO triple patterns in the first eight ver-

sions of BEAR-A.

Figure 31: Average VQ query durations for lowcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 32: Average VQ query durations for highcardinality ??O triple patterns in the first eight ver-

sions of BEAR-A.

Figure 33: Average VQ query durations for lowcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 34: Average VQ query durations for highcardinality ?P? triple patterns in the first eight ver-

sions of BEAR-A.

Page 82: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 61

Figure 35: Average VQ query durations for lowcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

Figure 36: Average VQ query durations for highcardinality S?? triple patterns in the first eight ver-

sions of BEAR-A.

B BEAR-B daily Query Re-

sults

In this section we list the average query dura-tion for the ?P? and ?PO triple patterns of theBEAR-B daily benchmark.

Figure 37: Average VM query durations for ?P?triple patterns in BEAR-B daily.

Figure 38: Average DM query durations for ?P?triple patterns in BEAR-B daily.

Figure 39: Average VQ query durations for ?P?triple patterns in BEAR-B daily.

Page 83: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 62

Figure 40: Average VM query durations for ?POtriple patterns in BEAR-B daily.

Figure 41: Average DM query durations for ?POtriple patterns in BEAR-B daily.

Figure 42: Average VQ query durations for ?POtriple patterns in BEAR-B daily.

C BEAR-B hourly Query Re-

sults

In this section we list the average query du-ration for the ?P? and ?PO triple patterns inthe first 400 versions of BEAR-B hourly bench-mark.

Figure 43: Average VM query durations for ?P?triple patterns in the first 400 versions of BEAR-B

hourly.

Figure 44: Average DM query durations for ?P?triple patterns in the first 400 versions of BEAR-B

hourly..

Page 84: Reducing storage requirements of multi-version graph databases …€¦ · A. Linked Data In 2001, Tim Berners Lee, the inventor of the World-Wide-Web, proposed the idea of the Semantic

BIBLIOGRAPHY 63

Figure 45: Average VQ query durations for ?P?triple patterns in the first 400 versions of BEAR-B

hourly.

Figure 46: Average VM query durations for ?POtriple patterns in the first 400 versions of BEAR-B

hourly.

Figure 47: Average DM query durations for ?POtriple patterns in the first 400 versions of BEAR-B

hourly.

Figure 48: Average VQ query durations for ?POtriple patterns in the first 400 versions of BEAR-B

hourly.