an overview of graph data management and...
TRANSCRIPT
An Overview of Graph Data Management and Analysis
M. Tamer Ozsu
University of WaterlooDavid R. Cheriton School of Computer Science
© M. Tamer Ozsu ADC PhD School (2015-06-04) 1 / 76
Graph Data are Very Common
Internet
© M. Tamer Ozsu ADC PhD School (2015-06-04) 2 / 76
Graph Data are Very Common
Socialnetworks
© M. Tamer Ozsu ADC PhD School (2015-06-04) 2 / 76
Graph Data are Very Common
Trade volumesand
connections
© M. Tamer Ozsu ADC PhD School (2015-06-04) 2 / 76
Graph Data are Very Common
Biologicalnetworks
© M. Tamer Ozsu ADC PhD School (2015-06-04) 2 / 76
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.http://lod-cloud.net/
Graph Data are Very Common
As of September 2011
MusicBrainz
(zitgist)
P20
Turismo de
Zaragoza
yovisto
Yahoo! Geo
Planet
YAGO
World Fact-book
El ViajeroTourism
WordNet (W3C)
WordNet (VUA)
VIVO UF
VIVO Indiana
VIVO Cornell
VIAF
URIBurner
Sussex Reading
Lists
Plymouth Reading
Lists
UniRef
UniProt
UMBEL
UK Post-codes
legislationdata.gov.uk
Uberblic
UB Mann-heim
TWC LOGD
Twarql
transportdata.gov.
uk
Traffic Scotland
theses.fr
Thesau-rus W
totl.net
Tele-graphis
TCMGeneDIT
TaxonConcept
Open Library (Talis)
tags2con delicious
t4gminfo
Swedish Open
Cultural Heritage
Surge Radio
Sudoc
STW
RAMEAU SH
statisticsdata.gov.
uk
St. Andrews Resource
Lists
ECS South-ampton EPrints
SSW Thesaur
us
SmartLink
Slideshare2RDF
semanticweb.org
SemanticTweet
Semantic XBRL
SWDog Food
Source Code Ecosystem Linked Data
US SEC (rdfabout)
Sears
Scotland Geo-
graphy
ScotlandPupils &Exams
Scholaro-meter
WordNet (RKB
Explorer)
Wiki
UN/LOCODE
Ulm
ECS (RKB
Explorer)
Roma
RISKS
RESEX
RAE2001
Pisa
OS
OAI
NSF
New-castle
LAASKISTI
JISC
IRIT
IEEE
IBM
Eurécom
ERA
ePrints dotAC
DEPLOY
DBLP (RKB
Explorer)
Crime Reports
UK
Course-ware
CORDIS (RKB
Explorer)CiteSeer
Budapest
ACM
riese
Revyu
researchdata.gov.
ukRen. Energy Genera-
tors
referencedata.gov.
uk
Recht-spraak.
nl
RDFohloh
Last.FM (rdfize)
RDF Book
Mashup
Rådata nå!
PSH
Product Types
Ontology
ProductDB
PBAC
Poké-pédia
patentsdata.go
v.uk
OxPoints
Ord-nance Survey
Openly Local
Open Library
OpenCyc
Open Corpo-rates
OpenCalais
OpenEI
Open Election
Data Project
OpenData
Thesau-rus
Ontos News Portal
OGOLOD
JanusAMP
Ocean Drilling Codices
New York
Times
NVD
ntnusc
NTU Resource
Lists
Norwe-gian
MeSH
NDL subjects
ndlna
myExperi-ment
Italian Museums
medu-cator
MARC Codes List
Man-chester Reading
Lists
Lotico
Weather Stations
London Gazette
LOIUS
Linked Open Colors
lobidResources
lobidOrgani-sations
LEM
LinkedMDB
LinkedLCCN
LinkedGeoData
LinkedCT
LinkedUser
FeedbackLOV
Linked Open
Numbers
LODE
Eurostat (OntologyCentral)
Linked EDGAR
(OntologyCentral)
Linked Crunch-
base
lingvoj
Lichfield Spen-ding
LIBRIS
Lexvo
LCSH
DBLP (L3S)
Linked Sensor Data (Kno.e.sis)
Klapp-stuhl-club
Good-win
Family
National Radio-activity
JP
Jamendo (DBtune)
Italian public
schools
ISTAT Immi-gration
iServe
IdRef Sudoc
NSZL Catalog
Hellenic PD
Hellenic FBD
PiedmontAccomo-dations
GovTrack
GovWILD
GoogleArt
wrapper
gnoss
GESIS
GeoWordNet
GeoSpecies
GeoNames
GeoLinkedData
GEMET
GTAA
STITCH
SIDER
Project Guten-berg
MediCare
Euro-stat
(FUB)
EURES
DrugBank
Disea-some
DBLP (FU
Berlin)
DailyMed
CORDIS(FUB)
Freebase
flickr wrappr
Fishes of Texas
Finnish Munici-palities
ChEMBL
FanHubz
EventMedia
EUTC Produc-
tions
Eurostat
Europeana
EUNIS
EU Insti-
tutions
ESD stan-dards
EARTh
Enipedia
Popula-tion (En-AKTing)
NHS(En-
AKTing) Mortality(En-
AKTing)
Energy (En-
AKTing)
Crime(En-
AKTing)
CO2 Emission
(En-AKTing)
EEA
SISVU
education.data.g
ov.uk
ECS South-ampton
ECCO-TCP
GND
Didactalia
DDC Deutsche Bio-
graphie
datadcs
MusicBrainz
(DBTune)
Magna-tune
John Peel
(DBTune)
Classical (DB
Tune)
AudioScrobbler (DBTune)
Last.FM artists
(DBTune)
DBTropes
Portu-guese
DBpedia
dbpedia lite
Greek DBpedia
DBpedia
data-open-ac-uk
SMCJournals
Pokedex
Airports
NASA (Data Incu-bator)
MusicBrainz(Data
Incubator)
Moseley Folk
Metoffice Weather Forecasts
Discogs (Data
Incubator)
Climbing
data.gov.uk intervals
Data Gov.ie
databnf.fr
Cornetto
reegle
Chronic-ling
America
Chem2Bio2RDF
Calames
businessdata.gov.
uk
Bricklink
Brazilian Poli-
ticians
BNB
UniSTS
UniPathway
UniParc
Taxonomy
UniProt(Bio2RDF)
SGD
Reactome
PubMedPub
Chem
PRO-SITE
ProDom
Pfam
PDB
OMIMMGI
KEGG Reaction
KEGG Pathway
KEGG Glycan
KEGG Enzyme
KEGG Drug
KEGG Com-pound
InterPro
HomoloGene
HGNC
Gene Ontology
GeneID
Affy-metrix
bible ontology
BibBase
FTS
BBC Wildlife Finder
BBC Program
mes BBC Music
Alpine Ski
Austria
LOCAH
Amster-dam
Museum
AGROVOC
AEMET
US Census (rdfabout)
Media
Geographic
Publications
Government
Cross-domain
Life sciences
User-generated content
Linked data
© M. Tamer Ozsu ADC PhD School (2015-06-04) 2 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 3 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 4 / 76
Graph Types
Property graph
film 2014(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167(name, “United Kingdom”)
(population, 62348447) actor 29704(actor name, “Jack Nicholson”)
film 3418(label, “The Passenger”)
film 1267(label, “The Last Tycoon”)
director 8476(director name, “Stanley Kubrick”)
film 2685(label, “A Clockwork Orange”)
film 424(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 5 / 76
Graph Types
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer Ozsu ADC PhD School (2015-06-04) 5 / 76
Graph Types
Property graph
film 2014(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167(name, “United Kingdom”)
(population, 62348447) actor 29704(actor name, “Jack Nicholson”)
film 3418(label, “The Passenger”)
film 1267(label, “The Last Tycoon”)
director 8476(director name, “Stanley Kubrick”)
film 2685(label, “A Clockwork Orange”)
film 424(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Workload: Online queries andanalytic workloads
Query execution: Varies
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
Workload: SPARQL queries
Query execution: subgraphmatching by homomorphism
© M. Tamer Ozsu ADC PhD School (2015-06-04) 5 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 6 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 7 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
The types of workloads
that the approaches are
designed to handle.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
Dynamic
graphs with un-
known changes
– requires re-
discovery of
the graph (e.g.,
LOD).
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
Computation accesses
the entire graph and
may require multiple
iterations; e.g., PageR-
ank, clustering, graph
colouring, all pairs
shortest path.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Classification [Ammar and Ozsu, 2015]
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
Similar to dy-
namic, but
computation
happens in
batches of
changes.© M. Tamer Ozsu ADC PhD School (2015-06-04) 8 / 76
Example Design Points
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Compute the query result/perform analytic computation over the graphas it exists.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 9 / 76
Example Design Points
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Compute the query result/perform analytic computation over the graphas it is revealed.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 9 / 76
Example Design Points
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Compute the query result/perform analytic computation on each snap-shot from scratch.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 9 / 76
Example Design Points
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Continuously compute the query result/perform analytic computation asthe input changes.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 9 / 76
Example Design Points
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Compute the query result/perform analytic computation after a batch ofinput changes.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 9 / 76
Example Design Points – Not all alternatives make sense
Graph Dynamism
StaticGraphs
DynamicGraphs
StreamingGraphs
EvolvingGraphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
BatchDynamic
Workload Types
OnlineQueries
AnalyticsWorkloads
Dynamic (or batch-dynamic) algorithms do not make sense for staticgraphs.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 10 / 76
Graph Processing Systems
System Memory/Disk
ArchitectureComputingparadigm
DeclarativeLanguage
Hadoop Disk Parallel/Distributed MapReduce 5
Haloop Disk Parallel/Distributed MapReduce 5
Pegasus Disk Parallel/Distributed MapReduce 5
Pregel/Giraph Memory Parallel/Distributed Vertex-Centric 5
GraphLab Memory Parallel/Distributed Vertex-Centric 5
GraphChi Disk Single machine Vertex-Centric 5
GraphX Disk Single machine Edge-Centric 5
TurboGraph Disk Single machine Vertex-Centric 5
Trinity Memory Parallel/DistributedMapReduce/
Vertex-Centric3 (TSL)
Titan Disk Parallel/Distributed ? 3 (Gremlin)
Neo4J Disk Single machineProcedural/Linked-list
3 (Cypher)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 11 / 76
Graph Workloads
Online graph querying
Reachability
Single source shortest-path
Subgraph matching
SPARQL queries
Offline graph analytics
PageRank
Clustering
Strongly connectedcomponents
Diameter finding
Graph colouring
All pairs shortest path
Graph pattern mining
Machine learning algorithms(Belief propagation, Gaussiannon-negative matrixfactorization)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 12 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 13 / 76
Reachability Queries
film 2014(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167(name, “United Kingdom”)
(population, 62348447) actor 29704(actor name, “Jack Nicholson”)
film 3418(label, “The Passenger”)
film 1267(label, “The Last Tycoon”)
director 8476(director name, “Stanley Kubrick”)
film 2685(label, “A Clockwork Orange”)
film 424(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 14 / 76
Reachability Queries
film 2014(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167(name, “United Kingdom”)
(population, 62348447) actor 29704(actor name, “Jack Nicholson”)
film 3418(label, “The Passenger”)
film 1267(label, “The Last Tycoon”)
director 8476(director name, “Stanley Kubrick”)
film 2685(label, “A Clockwork Orange”)
film 424(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Can you reach film 1267 from film 2014?
© M. Tamer Ozsu ADC PhD School (2015-06-04) 14 / 76
Reachability Queries
film 2014(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167(name, “United Kingdom”)
(population, 62348447) actor 29704(actor name, “Jack Nicholson”)
film 3418(label, “The Passenger”)
film 1267(label, “The Last Tycoon”)
director 8476(director name, “Stanley Kubrick”)
film 2685(label, “A Clockwork Orange”)
film 424(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Is there a book whose rating is > 4.0 associated with a film that wasdirected by Stanley Kubrick?
© M. Tamer Ozsu ADC PhD School (2015-06-04) 14 / 76
Reachability Queries
Think of Facebook graph and finding friends of friends.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 14 / 76
Subgraph Matching
?m ?dmovie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?rrev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
SubgraphM
atching
© M. Tamer Ozsu ADC PhD School (2015-06-04) 15 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 16 / 76
PageRank Computation
A web page is important if it is pointed to by other importantpages.
P1 P2
P3
P5P6
P4
r(Pi ) =∑
Pj∈BPi
r(Pj)
|FPj|
r(P2) =r(P1)
2+
r(P3)
3
rk+1(Pi ) =∑
Pj∈BPi
rk(Pj)
|FPj|
BPi: in-neighbours of Pi
FPi: out-neighbours of Pi
© M. Tamer Ozsu ADC PhD School (2015-06-04) 17 / 76
PageRank Computation
A web page is important if it is pointed to by other importantpages.
P1 P2
P3
P5P6
P4
rk+1(Pi ) =∑
Pj∈BPi
rk(Pj)
|FPj|
Iteration 0 Iteration 1 Iteration 2Rank atIter. 2
r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2
Iterative processing.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 17 / 76
Some Alternative Computational Models for OfflineAnalytics
MapReducemap and reduce functionsNot suitable for iterative processing due to data movement at eachstageNeed to save in storage system intermediate results of each iteration
Vertex-centric paradigmSpecify (a) the computation to be performed at each vertex, and (b)its communication with neighbour verticesDesigned specifically for interactive graph processingSynchronous (e.g., Pregel, Giraph)
Asynchronous (e.g., GraphLab)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 18 / 76
Some Alternative Computational Models for OfflineAnalytics
MapReducemap and reduce functionsNot suitable for iterative processing due to data movement at eachstageNeed to save in storage system intermediate results of each iteration
Vertex-centric paradigmSpecify (a) the computation to be performed at each vertex, and (b)its communication with neighbour verticesDesigned specifically for interactive graph processingSynchronous (e.g., Pregel, Giraph)
Asynchronous (e.g., GraphLab)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 18 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3
Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3
Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3
Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3
Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
CommunicationBarrier
CommunicationBarrier
Superstep 1 Superstep 2 Superstep 3
Computation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
Pregel-like Graph Processing Systems
Pregel-like systems are BSP, vertex-centric programs.
“Think like a vertex”:
?
© M. Tamer Ozsu ADC PhD School (2015-06-04) 19 / 76
GraphLab (Asynchronous)
GraphLab features asynchronous execution:
No communication barriers. 3
Uses the most recent vertex values. 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
© M. Tamer Ozsu ADC PhD School (2015-06-04) 20 / 76
GraphLab (Asynchronous)
Implemented via distributed locking:
v0
v1 v2
v3 v4
© M. Tamer Ozsu ADC PhD School (2015-06-04) 21 / 76
GraphLab (Asynchronous)
Implemented via distributed locking:
v0
v1 v2
v3 v4
© M. Tamer Ozsu ADC PhD School (2015-06-04) 21 / 76
GraphLab (Asynchronous)
Implemented via distributed locking:
v0
v1 v2
v3 v4
© M. Tamer Ozsu ADC PhD School (2015-06-04) 21 / 76
GraphLab (Asynchronous)
Implemented via distributed locking:
v0
v1 v2
v3 v4
© M. Tamer Ozsu ADC PhD School (2015-06-04) 21 / 76
GraphLab (Asynchronous)
Implemented via distributed locking:
v0
v1 v2
v3 v4
© M. Tamer Ozsu ADC PhD School (2015-06-04) 21 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
64 machines TW UK
Giraph (byte array) 5.8GB 7.0GBGraphLab (sync) 4.5GB 14GB
TW 16 machines 128 machines
Giraph (byte array) 8.5GB 5.8GBGraphLab (sync) 11GB 3.3GB
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
No Mutations
Time Memory
Byte array 3 3Hash map 7 7
With Mutations (DMST)
Time Memory
Byte array 77 3Hash map 3 7
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –Performance degrades as more machines are used due to lockcontention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 22 / 76
Workloads Have Different Resource Demands [Han et al., 2014]
Algorithm CPU Memory Network
PageRank Medium Medium HighSSSP Low Low LowWCC Low Medium MediumDMST High High Medium
© M. Tamer Ozsu ADC PhD School (2015-06-04) 23 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 24 / 76
RDF Introduction
Everything is an uniquely namedresource
Prefixes can be used to shorten thenames
Properties of resources can be defined
Relationships with other resources canbe defined
Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web
Integrated web “database”
http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer Ozsu ADC PhD School (2015-06-04) 25 / 76
RDF Introduction
Everything is an uniquely namedresource
Prefixes can be used to shorten thenames
Properties of resources can be defined
Relationships with other resources canbe defined
Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web
Integrated web “database”
http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer Ozsu ADC PhD School (2015-06-04) 25 / 76
RDF Introduction
Everything is an uniquely namedresource
Prefixes can be used to shorten thenames
Properties of resources can be defined
Relationships with other resources canbe defined
Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web
Integrated web “database”
http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer Ozsu ADC PhD School (2015-06-04) 25 / 76
RDF Introduction
Everything is an uniquely namedresource
Prefixes can be used to shorten thenames
Properties of resources can be defined
Relationships with other resources canbe defined
Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web
Integrated web “database”
http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer Ozsu ADC PhD School (2015-06-04) 25 / 76
RDF Introduction
Everything is an uniquely namedresource
Prefixes can be used to shorten thenames
Properties of resources can be defined
Relationships with other resources canbe defined
Resource descriptions can becontributed by different people/groupsand can be located anywhere in the web
Integrated web “database”
http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer Ozsu ADC PhD School (2015-06-04) 25 / 76
RDF Data Model
Triple: Subject, Predicate (Property), Object(s, p, o)
Subject: the entity that is described (URIor blank node)
Predicate: a feature of the entity (URI)Object: value of the feature (URI, blank
node or literal)
(s, p, o) ∈ (U ∪ B)× U × (U ∪ B ∪ L)
Set of RDF triples is called an RDF graph
U
Subject Object
U B U B L
U: set of URIsB: set of blank nodesL: set of literals
Predicate
Subject Predicate Objecthttp://...imdb.../film/2014 rdfs:label “The Shining”http://...imdb.../film/2014 movie:releaseDate “1980-05-23”http://...imdb.../29704 movie:actor name “Jack Nicholson”. . . . . . . . .
© M. Tamer Ozsu ADC PhD School (2015-06-04) 26 / 76
RDF Example InstancePrefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/
bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/
Subject Predicate Object
mdb: film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”’mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
URI Literal
URI
© M. Tamer Ozsu ADC PhD School (2015-06-04) 27 / 76
RDF Graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer Ozsu ADC PhD School (2015-06-04) 28 / 76
RDF Query Model – SPARQL
Query Model - SPARQL Protocol and RDF Query LanguageGiven U (set of URIs), L (set of literals), and V (set of variables), aSPARQL expression is defined recursively:
an atomic triple pattern, which is an element of
(U ∪ V )× (U ∪ V )× (U ∪ V ∪ L)
?x rdfs:label “The Shining”
P FILTER R, where P is a graph pattern expression and R is a built-inSPARQL condition (i.e., analogous to a SQL predicate)
?x rev:rating ?p FILTER(?p > 3.0)
P1 AND/OPT/UNION P2, where P1 and P2 are graph patternexpressions
Example:SELECT ?nameWHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )
}© M. Tamer Ozsu ADC PhD School (2015-06-04) 29 / 76
SPARQL Queries
SELECT ?nameWHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )
}
?m ?dmovie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?rrev:rating
FILTER(?r > 4.0)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 30 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 31 / 76
Naıve Triple Store Design
SELECT ?nameWHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )
}Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
Easy to implementbut
too many self-joins!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 32 / 76
Naıve Triple Store Design
SELECT ?nameWHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )
}Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c tFROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5WHERE T1 . p=” r d f s : l a b e l ”AND T2 . p=” movie : r e l a t e d B o o k ”AND T3 . p=” movie : d i r e c t o r ”AND T4 . p=” r e v : r a t i n g ”AND T5 . p=” movie : d i r e c t o r n a m e ”AND T1 . s=T2 . sAND T1 . s=T3 . sAND T2 . o=T4 . sAND T3 . o=T5 . sAND T4 . o > 4 . 0AND T5 . o=” S t a n l e y K u b r i c k ”
Easy to implementbut
too many self-joins!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 32 / 76
Naıve Triple Store Design
SELECT ?nameWHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .?d movie : d i r e c t o r n a m e ” S t a n l e y K u b r i c k ” .?m movie : r e l a t e d B o o k ?b . ?b r e v : r a t i n g ? r .FILTER(? r > 4 . 0 )
}Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:initial release date “1980-05-23”mdb:film/2014 movie:director mdb:director/8476mdb:film/2014 movie:actor mdb:actor/29704mdb:film/2014 movie:actor mdb: actor/30013mdb:film/2014 movie:music contributor mdb: music contributor/4110mdb:film/2014 foaf:based near geo:2635167mdb:film/2014 movie:relatedBook bm:0743424425mdb:film/2014 movie:language lexvo:iso639-3/engmdb:director/8476 movie:director name “Stanley Kubrick”mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:film/424 movie:director mdb:director/8476mdb:film/424 rdfs:label “Spartacus”mdb:actor/29704 movie:actor name “Jack Nicholson”mdb:film/1267 movie:actor mdb:actor/29704mdb:film/1267 rdfs:label “The Last Tycoon”mdb:film/3418 movie:actor mdb:actor/29704mdb:film/3418 rdfs:label “The Passenger”geo:2635167 gn:name “United Kingdom”geo:2635167 gn:population 62348447geo:2635167 gn:wikipediaArticle wp:United Kingdombm:books/0743424425 dc:creator bm:persons/Stephen+Kingbm:books/0743424425 rev:rating 4.7bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOfferlexvo:iso639-3/eng rdfs:label “English”lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CAlexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c tFROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5WHERE T1 . p=” r d f s : l a b e l ”AND T2 . p=” movie : r e l a t e d B o o k ”AND T3 . p=” movie : d i r e c t o r ”AND T4 . p=” r e v : r a t i n g ”AND T5 . p=” movie : d i r e c t o r n a m e ”AND T1 . s=T2 . sAND T1 . s=T3 . sAND T2 . o=T4 . sAND T3 . o=T5 . sAND T4 . o > 4 . 0AND T5 . o=” S t a n l e y K u b r i c k ”
Easy to implementbut
too many self-joins!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 32 / 76
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Borneaet al., 2013]
Clustered property table: group together the properties that tend tooccur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type ofproperty into one property table
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject refs:label movie:directormob:film/2014 “The Shining” mob:director/8476mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor namemdb:actor “Jack Nicholson”
© M. Tamer Ozsu ADC PhD School (2015-06-04) 33 / 76
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Borneaet al., 2013]
Clustered property table: group together the properties that tend tooccur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type ofproperty into one property table
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject refs:label movie:directormob:film/2014 “The Shining” mob:director/8476mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor namemdb:actor “Jack Nicholson”
Advantages
I Fewer joins
I If the data is structured, we have a relational system – similar tonormalized relations
© M. Tamer Ozsu ADC PhD School (2015-06-04) 33 / 76
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Borneaet al., 2013]
Clustered property table: group together the properties that tend tooccur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type ofproperty into one property table
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject refs:label movie:directormob:film/2014 “The Shining” mob:director/8476mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor namemdb:actor “Jack Nicholson”
Advantages
I Fewer joins
I If the data is structured, we have a relational system – similar tonormalized relations
Disadvantages
I Potentially a lot of NULLs
I Clustering is not trivial
I Multi-valued properties are complicated
© M. Tamer Ozsu ADC PhD School (2015-06-04) 33 / 76
Binary Tables
Grouping by properties: For each property, build a two-column table,containing both subject and object, ordered by subjects [Abadi et al.,2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject Objectmdb:film/2014 mdb:director/8476mdb:film/2685 mdb:director/8476
movie:director
Subject Objectmob:film/2014 “The Shining”mob:film/2685 “The Clockwork Orange”
refs:label
Subject Objectmdb:actor/29704 “Jack Nicholson”
movie:actor name
© M. Tamer Ozsu ADC PhD School (2015-06-04) 34 / 76
Binary Tables
Grouping by properties: For each property, build a two-column table,containing both subject and object, ordered by subjects [Abadi et al.,2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject Objectmdb:film/2014 mdb:director/8476mdb:film/2685 mdb:director/8476
movie:director
Subject Objectmob:film/2014 “The Shining”mob:film/2685 “The Clockwork Orange”
refs:label
Subject Objectmdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
I Supports multi-valued properties
I No NULLs
I No clustering
I Read only needed attributes (i.e. less I/O)
I Good performance for subject-subject joins
© M. Tamer Ozsu ADC PhD School (2015-06-04) 34 / 76
Binary Tables
Grouping by properties: For each property, build a two-column table,containing both subject and object, ordered by subjects [Abadi et al.,2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Objectmdb:film/2014 rdfs:label “The Shining”mdb:film/2014 movie:director mdb:director/8476mdb:film/2685 movie:director mdb:director/8476mdb:film/2685 rdfs:label “A Clockwork Orange”mdb:actor/29704 movie:actor name “Jack Nicholson”. . . . . . . . .
Subject Objectmdb:film/2014 mdb:director/8476mdb:film/2685 mdb:director/8476
movie:director
Subject Objectmob:film/2014 “The Shining”mob:film/2685 “The Clockwork Orange”
refs:label
Subject Objectmdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
I Supports multi-valued properties
I No NULLs
I No clustering
I Read only needed attributes (i.e. less I/O)
I Good performance for subject-subject joins
Disadvantages
I Not useful for subject-object joins
I Expensive inserts
© M. Tamer Ozsu ADC PhD School (2015-06-04) 34 / 76
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Aluc et al., 2013]
?m ?dmovie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?rrev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
SubgraphM
atching
© M. Tamer Ozsu ADC PhD School (2015-06-04) 35 / 76
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Aluc et al., 2013]
?m ?dmovie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?rrev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
SubgraphM
atching
Advantages
I Maintains the graph structure
I Full set of queries can be handled
© M. Tamer Ozsu ADC PhD School (2015-06-04) 35 / 76
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Aluc et al., 2013]
?m ?dmovie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?rrev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based nearmovie:actor
movie:directormovie:actor
movie:actor movie:actor
movie:director movie:director
SubgraphM
atching
Advantages
I Maintains the graph structure
I Full set of queries can be handled
Disadvantages
I Graph pattern matching is expensive
© M. Tamer Ozsu ADC PhD School (2015-06-04) 35 / 76
Two Systems
gStore
mdb:film/2014
bm:books/0743424425
mdb:director/8476
mdb:film/424mdb:film/2685
mdb:actor/29804
mdb:film/3418 mdb:film/1267
mdb:actor/30013
movie:ac
tor
moive:director
“Spartacus”moive:director moive:director
“Jack_Nicholson”
“A Clockwork Orange”
rdfs:label
“1980-05-23”
rdfs:label
moive:actor_name
y:hasBudget
y:has_box_office
“22000000#dollar”
movie:relatedBook4.7
rev:rating
bm:offers/0743424425
scam:hasOffer
y:hasBudget y:hasBudget
“21000000#dollar” “26589355#dollar” “12000000#dollar” “60000000#dollar”
y:has_box_office
movie:actor
movie:actor movie:actor
“The Passager” “The last Tycoon”rdfs:label rdfs:label
“Scatman Crothers”
movie:initial_release_datemoive:actor_name
Fig. 2. An RDF graph G
?x
?y
?z
mdb:movierdf:type
moive:director
“*Jack*”moive:actor_name
y:hasBudget
?budget<30000000Desc, top10
movie:actor
SELECT ?x ?y WHERE{ ?x hasBudget ?budget. ?x rdf:type mdb:movie. ?x movie:director ?y. ?y movie:actor_name ?z. FILTER( regex(str(?z),``Jack'') AND (?budget <30000000) )}ORDER BY ?budgetLIMIT 10
Fig. 3. SPARQL and Query Graph Q
a query signature graph Q⇤, the encoding strategy is analogueto encoding RDF graphs.
The online query evaluation process consists of two steps:filtering and joining. First, we generate the candidates for eachquery node using VS⇤-tree. Then, applying a depth-first searchstrategy, we perform the multi-way join over these candidatelists to find the subgraph matches of SPARQL query Q overRDF graph G.
III. Techniques
In this section, we briefly discuss the techniques used ingStore system; full details are given in elsewhere [5], [6]. Ac-cording to our framework in Section II, we solve the SPARQLquery processing by subgraph matching over the signaturegraph. A key issue is that the proposed encoding and pruningstrategies should support, in a uniform manner, di↵erent kindsof data (such as strings and numeric data), and SPARQLqueries with di↵erent operators . We discuss the encoding andpruning methods in Section III-A. Another technical issue isthe index structure, which is discussed in Section III-B. Wealso present some system-oriented optimization, such as indexcaching strategy and multicore-based query optimization inour system.
A. Encoding Techniques
In gSore, answering SPARQL queries is equivalent tofinding subgraph matches of query graph Q over RDF graphG. If vertex v (in query Q) can match vertex u (in RDF graphG), each neighbor vertex and each adjacent edge of v shouldmatch to some neighbor vertex and some adjacent edge of u.Thus, given a vertex u in G, we encode each of its adjacentedge labels and the corresponding neighbor vertex labels into
System Architecture
Offline Online
Storage
Input Input
RDF Parser
RDF Graph Builder
Encoding Module
VS*-tree builder
RDF data
RDF Triples
RDF Graph
Signature Graph
Key-Value Store
VS*-treeStore
SPARQL Parser
SPARQL Query
Encoding Module
VS*-tree
Query Graph
Filter Module
Join Module
Signature Graph
Node Candidate
Results
Fig. 4. System Architecture
bitstrings, denoted as vS ig(u). We encode query Q with thesame encoding method. Consequently, the match between Qand G can be verified by simply checking the match betweencorresponding encoded bitstrings.
Given a vertex u, we encode each of its adjacent edgese(eLabel, nLabel) into a bitstring, where eLabel is the edgelabel and nLabel is the vertex label. This bitstring is callededge signature (i.e., eS ig(e)). It has two parts: eS ig(e).e,eS ig(e).n. The first part eS ig(e).e (M bits) denotes the edgelabel (i.e., eLabel) and the second part eS ig(e).n (N bits)denotes the neighbor vertex label (i.e., nLabel). The code ofvS ig(u) is formed by performing OR operator over all eS ig(e).Figure 5 illustrates the process.
mdb:film/2014
mdb:director/8476
mdb:actor/29804 moive:director
“1980-05-23”
y:hasBudget
“22000000#dollar”
movie:initial_release_date
movie:actor
e1 rdfs:label "The Shining"e2 movie:initial_release_date "1980-05-23"e3 movie:director mdb:director/8476e4 movie:actor mdb:actor/29704e5 movie:actor mdb:actor/30013e6 y:hasDuration 7140.0$#se7 y:hasBudget 22000000$#$e8 y:hasImdb "0081505"rdfs:label"The Shining"
hasDuration
hasDuration
"0081505"
y:hasImdb
eSig.e eSig.ne1 001000010 000010000101000e2 000110000 000000011100000e3 100100000 000010010000001e4 000010010 001001000000001e5 000010010 001001010000000e6 101000000 000001001100000e7 001010000 000010000001001e8 100010000 001000001001000
nSig 101110010 001011011101001
Fig. 5. Encoding Technique
1) Computing eS ig(e).e: Given an RDF repository, let |P|denote the number of di↵erent properties. If |P| is small, weset |eS ig(e).e| = |P|, where |eS ig(e).e| denotes the length ofthe bitstring, and build a 1-to-1 mapping between the propertyand the bit position. If |P| is large, we resort to the hashingtechnique. Let |eS ig(e).e| = M. Using an appropriate hashfunction, we set m out of M bits in eS ig(e).e to be ‘1’.
chameleon-db
Structural Index
...
Vertex Index
Spill Index
Clu
ster
Ind
ex
Sto
rag
eS
yst
em Sto
rag
eA
dvis
or
QueryEngine Plan Generation Evaluation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 36 / 76
gStore
General Approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex tospeed up matching
Filter-and-evaluate
Use a false positive algorithm to prune nodes and obtain a set ofcandidates; then do more detailed evaluation on those
Use an index (VS∗-tree) over the data signature graph (has lightmaintenance load) for efficient pruning
© M. Tamer Ozsu ADC PhD School (2015-06-04) 37 / 76
1. Encode Q and G to Get Signature Graphs
Query signature graph Q∗
0100 0000 1000 000000010
0000 010010000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 000100010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 000101000
0100 1000
01000
1001 1000
01000
0001 0100
01000
© M. Tamer Ozsu ADC PhD School (2015-06-04) 38 / 76
2. Filter-and-Evaluate
Query signature graph Q∗
0100 0000 1000 000000010
0000 010010000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 000100010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 000101000
0100 1000
01000
1001 1000
01000
0001 0100
01000
Find matches of Q∗ oversignature graph G ∗
Verify each match inRDF graph G
© M. Tamer Ozsu ADC PhD School (2015-06-04) 39 / 76
How to Generate Candidate List
Two step process:1 For each node of Q∗ get lists of nodes in G∗ that include that node.2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists of nodes inG∗ that include that node.
• Given query signature q and a set of data signatures S , find alldata signatures si ∈ S where q&si = q
Does not support second step – expensive
VS-tree (and VS∗-tree)
Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices
© M. Tamer Ozsu ADC PhD School (2015-06-04) 40 / 76
How to Generate Candidate List
Two step process:1 For each node of Q∗ get lists of nodes in G∗ that include that node.2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists of nodes inG∗ that include that node.
• Given query signature q and a set of data signatures S , find alldata signatures si ∈ S where q&si = q
Does not support second step – expensive
VS-tree (and VS∗-tree)
Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices
© M. Tamer Ozsu ADC PhD School (2015-06-04) 40 / 76
How to Generate Candidate List
Two step process:1 For each node of Q∗ get lists of nodes in G∗ that include that node.2 Do a multi-way join to get the candidate list
Alternatives:Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists of nodes inG∗ that include that node.
• Given query signature q and a set of data signatures S , find alldata signatures si ∈ S where q&si = q
Does not support second step – expensive
VS-tree (and VS∗-tree)
Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices
© M. Tamer Ozsu ADC PhD School (2015-06-04) 40 / 76
How to Generate Candidate List
Two step process:1 For each node of Q∗ get lists of nodes in G∗ that include that node.2 Do a multi-way join to get the candidate list
Alternatives:Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists of nodes inG∗ that include that node.
• Given query signature q and a set of data signatures S , find alldata signatures si ∈ S where q&si = q
Does not support second step – expensive
VS-tree (and VS∗-tree)
Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices
© M. Tamer Ozsu ADC PhD School (2015-06-04) 40 / 76
How to Generate Candidate List
Two step process:1 For each node of Q∗ get lists of nodes in G∗ that include that node.2 Do a multi-way join to get the candidate list
Alternatives:Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signaturesRun an inclusion query for each node of Q∗ and get lists of nodes inG∗ that include that node.
• Given query signature q and a set of data signatures S , find alldata signatures si ∈ S where q&si = q
Does not support second step – expensive
VS-tree (and VS∗-tree)
Multi-resolution summary graph based on S-treeSupports both steps efficientlyGrouping by vertices
© M. Tamer Ozsu ADC PhD School (2015-06-04) 40 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000 002
011
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000 002
011
003
008
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000 002
011
003
008
004
009
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000 002
011
003
008
004
009on on
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
1000 00000100 000000010
0000 010010000 002
011
003
008
004
009on on
Possibly large join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 41 / 76
VS-tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
Super edge
© M. Tamer Ozsu ADC PhD School (2015-06-04) 42 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d11
d21 d2
2
d31 d3
2 d33 d3
4
G 3
G 2
G 1
11101
1001010001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
1000000010
00100
01000
01000
01000
01000
1000 00000100 000000010
0000 010010000
d32
d33
d33
d34
d31
d34
G 3
00010 10000
01000
003
008
002
011
004
009onon
Reduced join space!
© M. Tamer Ozsu ADC PhD School (2015-06-04) 43 / 76
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and aremore varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] anddynamic [Kirchberg et al., 2011]
An experiment [Aluc et al., 2014a]
No single system is a sole winner across all queriesNo single system is the sole loser across all queries, eitherThere can be 2–5 orders of magnitude difference in the performance(i.e., query execution time) between the best and the worst system fora given queryThe winner in one query may timeout in anotherPerformance difference widens as dataset size increases
© M. Tamer Ozsu ADC PhD School (2015-06-04) 44 / 76
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and aremore varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] anddynamic [Kirchberg et al., 2011]
An experiment [Aluc et al., 2014a]
No single system is a sole winner across all queriesNo single system is the sole loser across all queries, eitherThere can be 2–5 orders of magnitude difference in the performance(i.e., query execution time) between the best and the worst system fora given queryThe winner in one query may timeout in anotherPerformance difference widens as dataset size increases
Can existing systems cope with these trends – workload diversity &dynamism
No! [Aluc et al., 2014b]
I Fragmented data
I Suboptimal pruning by indexes
I Unnecessarily large sets of intermediate result tuples
© M. Tamer Ozsu ADC PhD School (2015-06-04) 44 / 76
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and aremore varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] anddynamic [Kirchberg et al., 2011]
An experiment [Aluc et al., 2014a]
No single system is a sole winner across all queriesNo single system is the sole loser across all queries, eitherThere can be 2–5 orders of magnitude difference in the performance(i.e., query execution time) between the best and the worst system fora given queryThe winner in one query may timeout in anotherPerformance difference widens as dataset size increases
Our proposal: Idea behind chameleon-db
I When designing and implementing an RDF data management system,assume nothing about the workload upfront
I Organize data dynamically and purely based on the workload
© M. Tamer Ozsu ADC PhD School (2015-06-04) 44 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
Characteristics:
Records are not necessarily of fixed length
Records are not grouped into tables
Records do not necessarily share the same set of RDF predicates
Each record represents a very tiny part of the RDF graph
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?yA
?y ?z?b
Figure: Query 1
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?yA
?y ?z?b
Figure: Query 1
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?yA
?y ?z?b
Figure: Query 1
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Group-by-Query Approach
Advantages
I Data are physically clustered for the workload
I Better pruning by the indexes
I Fewer intermediate result tuples
© M. Tamer Ozsu ADC PhD School (2015-06-04) 45 / 76
Challenges
Physical Data Layout: As the workloads change, the way data aregrouped together may no longer be suitable
Hierarchical Clustering Algorithm [Aluc et al., 2015]Tunable-LSH [Aluc et al., 2015]
Indexing: Indexing upfront is not a choice
Query Evaluation: Can we execute queries efficiently even when thephysical layout is constantly changing? [Aluc et al., 2015]
© M. Tamer Ozsu ADC PhD School (2015-06-04) 46 / 76
chameleon-db
Prototype system [Aluc et al., 2013]35,000 lines of code in C++ under Linux (plus code for SPARQL 1.0parser)
Structural Index
...
Vertex Index
Spill Index
Clu
ster
Inde
xS
tora
geS
yste
m Sto
rage
Adv
isor
QueryEngine Plan Generation Evaluation
© M. Tamer Ozsu ADC PhD School (2015-06-04) 47 / 76
Some Open Problems
Scalability of the solutions to very large datasets
Maintenance of auxiliary data structures in dynamic environments
Adaptive systems to handle varying and time-changing workloads
Uncertain RDF data processing
Keyword search over RDF data
Query processing over incomplete RDF data
© M. Tamer Ozsu ADC PhD School (2015-06-04) 48 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 49 / 76
Remember the Environment
Distributed environment
Some of the data sites canprocess SPARQL queries –SPARQL endpoints
Not all data sites can processqueries
Alternatives
Data re-distribution + querydecompositionSPARQL federation: justprocess at SPARQL endpointsLive querying (see nextsection)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 50 / 76
Remember the Environment
Distributed environment
Some of the data sites canprocess SPARQL queries –SPARQL endpoints
Not all data sites can processqueries
Alternatives
Data re-distribution + querydecompositionSPARQL federation: justprocess at SPARQL endpointsLive querying (see nextsection)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 50 / 76
Remember the Environment
Distributed environment
Some of the data sites canprocess SPARQL queries –SPARQL endpoints
Not all data sites can processqueries
Alternatives
Data re-distribution + querydecomposition
SPARQL federation: justprocess at SPARQL endpointsLive querying (see nextsection)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 50 / 76
Remember the Environment
Distributed environment
Some of the data sites canprocess SPARQL queries –SPARQL endpoints
Not all data sites can processqueries
Alternatives
Data re-distribution + querydecompositionSPARQL federation: justprocess at SPARQL endpoints
Live querying (see nextsection)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 50 / 76
Remember the Environment
Distributed environment
Some of the data sites canprocess SPARQL queries –SPARQL endpoints
Not all data sites can processqueries
Alternatives
Data re-distribution + querydecompositionSPARQL federation: justprocess at SPARQL endpointsLive querying (see nextsection)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 50 / 76
Distributed RDF Processing [Kaoudi and Manolescu, 2015]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data D = {D1, . . . ,Dn}Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} over {D1, . . . ,Dn}
I High performance
I Great for parallelizing centralized RDF data
I May not be possible to re-partition and re-allocate Web data (i.e.,LOD)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 51 / 76
Distributed RDF Processing [Kaoudi and Manolescu, 2015]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data D = {D1, . . . ,Dn}Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} over {D1, . . . ,Dn}
I High performance
I Great for parallelizing centralized RDF data
I May not be possible to re-partition and re-allocate Web data (i.e.,LOD)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 51 / 76
Distributed RDF Processing [Kaoudi and Manolescu, 2015]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data D = {D1, . . . ,Dn}Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} over {D1, . . . ,Dn}
I High performance
I Great for parallelizing centralized RDF data
I May not be possible to re-partition and re-allocate Web data (i.e.,LOD)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 51 / 76
Distributed RDF Processing [Kaoudi and Manolescu, 2015]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
RDF data D = {D1, . . . ,Dn}Allocate each Di to a site
Partitioning alternatives
Table-based (e.g., [Husain et al., 2011])Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} over {D1, . . . ,Dn}
I High performance
I Great for parallelizing centralized RDF data
I May not be possible to re-partition and re-allocate Web data (i.e.,LOD)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 51 / 76
Distributed RDF Processing – 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets (e.g., [Atreet al., 2010; Prasser et al., 2012])
SPARQL query Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} using the data summary
I No data re-partitioning and re-allocation
I Have to scan the data at each site
I Index over distributed data with maintenance concerns
© M. Tamer Ozsu ADC PhD School (2015-06-04) 52 / 76
Distributed RDF Processing – 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets (e.g., [Atreet al., 2010; Prasser et al., 2012])
SPARQL query Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} using the data summary
I No data re-partitioning and re-allocation
I Have to scan the data at each site
I Index over distributed data with maintenance concerns
© M. Tamer Ozsu ADC PhD School (2015-06-04) 52 / 76
Distributed RDF Processing – 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets (e.g., [Atreet al., 2010; Prasser et al., 2012])
SPARQL query Q = {Q1, . . . ,Qk}Distributed execution of {Q1, . . . ,Qk} using the data summary
I No data re-partitioning and re-allocation
I Have to scan the data at each site
I Index over distributed data with maintenance concerns
© M. Tamer Ozsu ADC PhD School (2015-06-04) 52 / 76
SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ . . . ∪ Dn; Di : SPARQL endpoint
Alternatives
SPARQL query decomposed Q = {Q1, . . . ,Qk} and executed over{D1, . . . ,Dn} – DARQ, FedX [Schwarte et al., 2011], SPLENDID[Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]Partial query evaluation – Distributed gStore [Peng et al., 2014]
Partial evaluationI Given function f (s, d) and part of its input s, perform f ’s
computation that only depends on s to get f ′(d)
I Compute f ′(d) when d becomes available
I Applied to, e.g., XML [Buneman et al., 2006]
© M. Tamer Ozsu ADC PhD School (2015-06-04) 53 / 76
SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ . . . ∪ Dn; Di : SPARQL endpoint
Alternatives
SPARQL query decomposed Q = {Q1, . . . ,Qk} and executed over{D1, . . . ,Dn} – DARQ, FedX [Schwarte et al., 2011], SPLENDID[Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]Partial query evaluation – Distributed gStore [Peng et al., 2014]
Partial evaluationI Given function f (s, d) and part of its input s, perform f ’s
computation that only depends on s to get f ′(d)
I Compute f ′(d) when d becomes available
I Applied to, e.g., XML [Buneman et al., 2006]
© M. Tamer Ozsu ADC PhD School (2015-06-04) 53 / 76
SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ . . . ∪ Dn; Di : SPARQL endpoint
Alternatives
SPARQL query decomposed Q = {Q1, . . . ,Qk} and executed over{D1, . . . ,Dn} – DARQ, FedX [Schwarte et al., 2011], SPLENDID[Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]Partial query evaluation – Distributed gStore [Peng et al., 2014]
Partial evaluationI Given function f (s, d) and part of its input s, perform f ’s
computation that only depends on s to get f ′(d)
I Compute f ′(d) when d becomes available
I Applied to, e.g., XML [Buneman et al., 2006]
© M. Tamer Ozsu ADC PhD School (2015-06-04) 53 / 76
Distributed SPARQL Using Partial Query Evaluation
Two steps:1 Evaluate a query at each site to find local matches
Query is the function and each Di is the known inputInner match or local partial match
2 Assemble the partial matches to get final resultCrossing matchCentralized assemblyDistributed assembly
D1
D2
D3
D4
Crossing match
© M. Tamer Ozsu ADC PhD School (2015-06-04) 54 / 76
Distributed SPARQL Using Partial Query Evaluation
Two steps:1 Evaluate a query at each site to find local matches
Query is the function and each Di is the known inputInner match or local partial match
2 Assemble the partial matches to get final resultCrossing matchCentralized assemblyDistributed assembly
D1
D2
D3
D4
Crossing match
© M. Tamer Ozsu ADC PhD School (2015-06-04) 54 / 76
Some Open Problems
Handling data at non-SPARQL endpoint sites
Modification to SPARQL endpoints (for partial query evaluation)
Heterogeneous use of vocabularies (use of ontologies)
© M. Tamer Ozsu ADC PhD School (2015-06-04) 55 / 76
Outline
1 Introduction – Graph Types
2 Property Graph ProcessingClassificationOnline queryingOffline analytics
3 RDF Graph QueryingData WarehousingDistributed SPARQL ExecutionLinked Object Data Querying
© M. Tamer Ozsu ADC PhD School (2015-06-04) 56 / 76
Closer Look
© M. Tamer Ozsu ADC PhD School (2015-06-04) 57 / 76
Globally Distributed Network of Data
© M. Tamer Ozsu ADC PhD School (2015-06-04) 58 / 76
Traditional Hypertext-based Web Access
IMDb WorldBook
Data exposedto the Webvia HTML
© M. Tamer Ozsu ADC PhD School (2015-06-04) 59 / 76
Linked Data Publishing Principles
IMDb WorldBook
(http://...linkedmdb.../Shining,releaseDate, 23 May 1980)(http://...linkedmdb.../Shining, filmLocation, http://cia.../UK)(http://...linkedmdb.../29704,actedIn, http://...linkedmdb.../Shining)
...
(http://cia.../UK, hasPopulation, 63230000)...
Shi
ning
UK
Data model: RDFGlobal identifier: URIAccess mechanism: HTTPConnection: data links
© M. Tamer Ozsu ADC PhD School (2015-06-04) 60 / 76
Live Query Processing
Not all data resides at SPARQLendpoints
Freshness of access to dataimportant
Potentially countably infinitedata sources
Live querying
On-line executionOnly rely on linked dataprinciples
Alternatives
Traversal-based approachesIndex-based approachesHybrid approaches
© M. Tamer Ozsu ADC PhD School (2015-06-04) 61 / 76
Linked Data Model [Hartig, 2012]
Web Document
Given a countably infinite set D (documents), a Web of Linked Data is atuple W = (D, adoc, data) where:
I D ⊆ D,
I adoc is a partial mapping from URIs to D, and
I data is a total mapping from D to finite sets of RDF triples.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 62 / 76
Linked Data Model [Hartig, 2012]
Web Document
Given a countably infinite set D (documents), a Web of Linked Data is atuple W = (D, adoc, data) where:
I D ⊆ D,
I adoc is a partial mapping from URIs to D, and
I data is a total mapping from D to finite sets of RDF triples.
Web of Linked Data
A Web of Linked Data W = (D, adoc, data)contains a data link from document d ∈ D todocument d ′ ∈ D if there exists a URI u suchthat:
I u is mentioned in an RDF triplet ∈ data(d), and
I d ′ = adoc(u).© M. Tamer Ozsu ADC PhD School (2015-06-04) 62 / 76
SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked DataQuery result completeness cannot be guaranteed by any (terminating)execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs S , and areachability condition cScope: all data along paths of data links that satisfy the conditionComputationally feasible
© M. Tamer Ozsu ADC PhD School (2015-06-04) 63 / 76
SPARQL Query Semantics in Live Querying
Full-web semantics
Scope of evaluating a SPARQL expression is all Linked DataQuery result completeness cannot be guaranteed by any (terminating)execution
Reachability-based query semantics
Query consists of a SPARQL expression, a set of seed URIs S , and areachability condition cScope: all data along paths of data links that satisfy the conditionComputationally feasible
© M. Tamer Ozsu ADC PhD School (2015-06-04) 63 / 76
Traversal Approaches
Discover relevant URIs recursively bytraversing (specific) data links at queryexecution runtime [Hartig, 2013;Ladwig and Tran, 2011]
Implements reachability-based querysemantics
Start from a set of seed URIsRecursively follow and discover newURIs
Important issue is selection of seed URIs
Retrieved data serves to discover newURIs and to construct result
© M. Tamer Ozsu ADC PhD School (2015-06-04) 64 / 76
Traversal Approaches
Discover relevant URIs recursively bytraversing (specific) data links at queryexecution runtime [Hartig, 2013;Ladwig and Tran, 2011]
Implements reachability-based querysemantics
Start from a set of seed URIsRecursively follow and discover newURIs
Important issue is selection of seed URIs
Retrieved data serves to discover newURIs and to construct result
Advantages
Easy to implement.No data structure to maintain.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 64 / 76
Traversal Approaches
Discover relevant URIs recursively bytraversing (specific) data links at queryexecution runtime [Hartig, 2013;Ladwig and Tran, 2011]
Implements reachability-based querysemantics
Start from a set of seed URIsRecursively follow and discover newURIs
Important issue is selection of seed URIs
Retrieved data serves to discover newURIs and to construct result
Advantages
Easy to implement.No data structure to maintain.
Disadvantages
Possibilities for parallelized data retrieval are limitedRepeated data retrieval introduces significant query latency.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 64 / 76
Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid asmany irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al.,2011]
Index entries a set of URIsIndexed URIs may appear multiple times (i.e., associated with multipleindex keys)Each URI in such an entry may be paired with a cardinality (utilized forsource ranking)
Key: tp Entry: {uri1, uri2, , urin}
GET urii
© M. Tamer Ozsu ADC PhD School (2015-06-04) 65 / 76
Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid asmany irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al.,2011]
Index entries a set of URIsIndexed URIs may appear multiple times (i.e., associated with multipleindex keys)Each URI in such an entry may be paired with a cardinality (utilized forsource ranking)
Key: tp Entry: {uri1, uri2, , urin}
GET urii
Advantages
Data retrieval can be fully parallelizedReduces the impact of data retrieval on query execution time
© M. Tamer Ozsu ADC PhD School (2015-06-04) 65 / 76
Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid asmany irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al.,2011]
Index entries a set of URIsIndexed URIs may appear multiple times (i.e., associated with multipleindex keys)Each URI in such an entry may be paired with a cardinality (utilized forsource ranking)
Key: tp Entry: {uri1, uri2, , urin}
GET urii
Advantages
Data retrieval can be fully parallelizedReduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index constructionDepends on what has been selected for the indexFreshness may be an issueIndex maintenance
© M. Tamer Ozsu ADC PhD School (2015-06-04) 65 / 76
Hybrid Approach
Perform a traversal-based execution using a prioritized list of URIs tolook up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information in theindex
New discovered URIs that are not in the index are ranked accordingto number of referring documents
© M. Tamer Ozsu ADC PhD School (2015-06-04) 66 / 76
Some Open Problems
Optimize queries by using statistics collected during earlier queryexecutions
Heterogeneous use of vocabularies (use of ontologies)
Combine SPARQL federation to leverage SPARQL endpointfunctionality
© M. Tamer Ozsu ADC PhD School (2015-06-04) 67 / 76
Acknowledgements
This presentation draws upon collaborative research and discussions withthe following colleagues (in alphabetical order)
Gunes Aluc, U. Waterloo
Khaled Ammar, U. Waterloo
Khuzaima Daudjee, U. Waterloo
Young Han, U. Waterloo
Olaf Hartig, U. Waterloo
Lei Chen, Hong Kong UST
Lei Zou, Peking Univ.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 68 / 76
Thank you!
Research supported by
© M. Tamer Ozsu ADC PhD School (2015-06-04) 69 / 76
© M. Tamer Ozsu ADC PhD School (2015-06-04) 70 / 76
References I
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a verticallypartitioned DBMS for semantic web data management. VLDB J., 18(2):385–406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semanticweb data management using vertical partitioning. In Proc. 33rd Int. Conf. on VeryLarge Data Bases, pages 411–422.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011). ANAPSID:an adaptive query processing engine for SPARQL endpoints. In Proc. 10th Int.Semantic Web Conf., pages 18–34.
Aluc, G., Ozsu, M. T., and Daudjee, K. (2015). Clustering RDF databases usingTunable-LSH. CoRR, abs/1504.02523.
Aluc, G., Hartig, O., Ozsu, M. T., and Daudjee, K. (2014a). Diversified stress testing ofRDF data management systems. In Proc. 13th Int. Semantic Web Conf., pages197–212.
Aluc, G., Ozsu, M. T., and Daudjee, K. (2014b). Workload matters: Why RDFdatabases need a new design. Proc. VLDB Endowment, 7(10):837–840.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 71 / 76
References II
Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: aworkload-aware robust RDF data management system. Technical Report CS-2013-10,University of Waterloo. Available at https://cs.uwaterloo.ca/sites/ca.computer-science/files/uploads/files/CS-2013-10.pdf.
Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2015). Executing queries overschemaless RDF databases. In Proc. 31st Int. Conf. on Data Engineering, pages807–818.
Ammar, K. and Ozsu, M. T. (2015). Approaches to graph processing – an overview. Inpreparation.
Arias, M., Fernandez, J. D., Martınez-Prieto, M. A., and de la Fuente, P. (2011). Anempirical study of real-world SPARQL queries. CoRR, abs/1103.5043.
Atre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix “bit” loaded: Ascalable lightweight join query processor for rdf data. In Proc. 19th Int. World WideWeb Conf., pages 41–50.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O.,and Bhattacharjee, B. (2013). Building an efficient RDF store over a relationaldatabase. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages121–132.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 72 / 76
References III
Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partialevaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on Very LargeData Bases, pages 211–222.
Duan, S., Kementsietsidis, A., Srinivas, K., and Udrea, O. (2011). Apples and oranges:a comparison of RDF benchmarks and real RDF datasets. In Proc. ACM SIGMODInt. Conf. on Management of Data, pages 145–156.
Gorlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploitingVOID descriptions. In Proc. 2nd Int. Workshop on Consuming Linked Data.
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A distributedshared-nothing RDF engine based on asynchronous message passing. In Proc. ACMSIGMOD Int. Conf. on Management of Data, pages 289–300.
Han, M., Daudjee, K., Ammar, K., Ozsu, M. T., Wang, X., and Jin, T. (2014). Anexperimental comparison of Pregel-like graph processing systems. Proc. VLDBEndowment, 7(12):1047–1058.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and computability. InProc. 9th Extended Semantic Web Conf., pages 8–23.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 73 / 76
References IV
Hartig, O. (2013). SQUIN: a traversal based query execution system for the web oflinked data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages1081–1084.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDFgraphs. Proc. VLDB Endowment, 4(11):1123–1134.
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B.(2011). Heuristics-based query processing for large RDF graphs using cloudcomputing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.
Kaoudi, Z. and Manolescu, I. (2015). RDF in the clouds: A survey. VLDB J., 24:67–91.
Kirchberg, M., Ko, R. K. L., and Lee, B.-S. (2011). From linked data to relevant data –time is the essence. CoRR, abs/1103.5046.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In Proc. 9thInt. Semantic Web Conf., pages 453–469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked data. InProc. 8th Extended Semantic Web Conf., pages 139–153.
Lee, K. and Liu, L. (2013). Scaling queries over big rdf graphs with semantic hashpartitioning. Proc. VLDB Endowment, 6(14):1894–1905.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 74 / 76
References V
Peng, P., Zou, L., Ozsu, M. T., Chen, L., and Zhao, D. (2014). Processing SPARQLqueries over linked data – a distributed graph-based approach. In submitted forpublication.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query processingfor autonomous rdf databases. In Proc. 15th Int. Conf. on Extending DatabaseTechnology, pages 372–383.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011). Fedx: Afederation layer for distributed query processing on linked open data. In Proc. 8thExtended Semantic Web Conf., pages 481–486.
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011). Comparingdata summaries for processing live queries over linked data. World Wide Web J.,14(5-6):495–544.
Verborgh, R., Hartig, O., Meester, B. D., Haesendonck, G., Vocht, L. D., Sande, M. V.,Cyganiak, R., Colpaert, P., Mannens, E., and de Walle, R. V. (2014). Queryingdatasets on the web with high availability. In Proc. 13th Int. Semantic Web Conf.,pages 180–196.
Wilkinson, K. (2006). Jena property table implementation. Technical ReportHPL-2006-140, HP Laboratories Palo Alto.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 75 / 76
References VI
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards scalable I/Oefficient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on DataEngineering, pages 565–576.
Zou, L., Mo, J., Chen, L., Ozsu, M. T., and Zhao, D. (2011). gStore: answeringSPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482–493.
Zou, L., Ozsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: Agraph-based SPARQL query engine. VLDB J., 23(4):565–590.
© M. Tamer Ozsu ADC PhD School (2015-06-04) 76 / 76