Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]

Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu

DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited.

ONR Case Number 43-279-15 JB.01.2015

Upload: accumulo-summit

Post on 15-Jul-2015




Acknowledgements

This work is the collective effort of:

Parsons’ Rya Team, sponsored by the Department of the Navy, Office of Naval Research

Rya Founders: Roshan Punnoose, Adina Crainiceanu, and David Rapp


Overview

Rya Overview

Query Execution within Rya

Query Optimizations

Results

Summary


Background: Rya and RDF

Rya: Resource Description Framework (RDF) triplestore built on top of Accumulo

RDF: W3C standard for representing linked/graph data

Represents data as statements (assertions) about resources

– Serialized as triples in {subject, predicate, object} form

– Example:

• {Caleb, worksAt, Parsons}

• {Caleb, livesIn, Virginia}

[Figure: graph in which the node Caleb is linked to Parsons by a worksAt edge and to Virginia by a livesIn edge]


Background: SPARQL

RDF queries are described using SPARQL

SPARQL Protocol and RDF Query Language

SQL-like syntax for finding triples matching specific patterns

Look for subgraphs that match triple statement patterns

Joins are performed when there are variables common to two or more statement patterns

SELECT ?people WHERE {

?people <worksAt> <Parsons>.

?people <livesIn> <Virginia>.

}


Rya Architecture

Open RDF interface for interacting with RDF data stored on Accumulo

Open RDF (Sesame): open source Java framework for storing and querying RDF data

Open RDF provides several interfaces/abstractions central for interacting with an RDF datastore

– SAIL interface for interacting with the underlying persisted RDF model

– SAIL: Storage And Inference Layer

[Figure: architecture stack with the layers SPARQL, Rya Open RDF, Rya QueryPlanner, and Accumulo; annotations: "Query processing in SAIL layer" and "Data storage layer"]


Storage: Triple Table Index

3 Tables

SPO : subject, predicate, object

POS : predicate, object, subject

OSP : object, subject, predicate

Store triples in the RowID of the table

Store graph name in the Column Family

Advantages:

Native lexicographical sorting of row keys enables fast range queries

All patterns can be translated into a scan of one of these tables
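The mapping from a triple pattern to one of the three tables can be sketched as follows. This is a minimal illustration, not Rya's code: the function name `choose_index` and the tie-breaking choices (for example, using SPO when only the subject is bound) are assumptions. The idea is simply to pick the table whose key order places the bound terms first, so that the pattern becomes a prefix range scan.

```python
def choose_index(subject=None, predicate=None, obj=None):
    """Pick the table whose row-key order starts with the bound terms.

    None marks a variable. Returns (table_name, row_key_prefix).
    The function name and tie-breaking rules are illustrative assumptions.
    """
    s, p, o = subject is not None, predicate is not None, obj is not None
    if s and p and o:
        return "SPO", (subject, predicate, obj)
    if s and p:
        return "SPO", (subject, predicate)
    if p and o:
        return "POS", (predicate, obj)
    if o and s:
        return "OSP", (obj, subject)
    if s:
        return "SPO", (subject,)
    if p:
        return "POS", (predicate,)
    if o:
        return "OSP", (obj,)
    return "SPO", ()  # nothing bound: a full scan of any one table


# ?x <worksAt> <Parsons> becomes a prefix scan of the POS table.
print(choose_index(predicate="worksAt", obj="Parsons"))
```

Every combination of bound positions is a prefix of one of the three key orders, which is why three tables are enough.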


Rya Query Execution

Implemented OpenRDF Sesame SAIL API

Parse queries, generate initial query plan, execute plan

Triple patterns map to range queries in Accumulo

SELECT ?x WHERE {
?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>.
}

Step 1: POS Table – scan range

worksAt, Netflix, Dan

worksAt, OfficeMax, Zack

worksAt, Parsons, Bob

worksAt, Parsons, Greta

worksAt, Parsons, John

Step 2: for each ?x, SPO – index lookup

Bob, livesIn, Georgia

Greta, livesIn, Virginia

John, livesIn, Virginia


More Complicated Example of Rya Query Execution

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}

Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)

worksAt, Netflix, Dan

worksAt, Parsons, Bob

worksAt, Parsons, Greta

worksAt, Parsons, John

worksAt, PlayStation, Alice

Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)

Bob, livesIn, Georgia

Greta, livesIn, Virginia

John, livesIn, Virginia

Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)

Greta, commuteMethod, bike

John, commuteMethod, Bus
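The three steps can be mimicked on a toy dataset. The sketch below assumes sorted Python lists as stand-ins for the POS and SPO tables and uses `bisect` for range scans and exact lookups. The helper names `prefix_scan` and `has_row` are made up for illustration and are not part of Rya or the Accumulo API.

```python
from bisect import bisect_left

# Toy tables: sorted lists of row keys standing in for Accumulo tables.
POS = sorted([
    ("worksAt", "Netflix", "Dan"),
    ("worksAt", "Parsons", "Bob"),
    ("worksAt", "Parsons", "Greta"),
    ("worksAt", "Parsons", "John"),
    ("worksAt", "PlayStation", "Alice"),
])
SPO = sorted([
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "commuteMethod", "bike"),
    ("Greta", "livesIn", "Virginia"),
    ("John", "commuteMethod", "Bus"),
    ("John", "livesIn", "Virginia"),
])


def prefix_scan(table, prefix):
    """Yield every row whose leading fields equal prefix (a range scan)."""
    i = bisect_left(table, prefix)
    while i < len(table) and table[i][:len(prefix)] == prefix:
        yield table[i]
        i += 1


def has_row(table, row):
    """Exact row-key lookup."""
    i = bisect_left(table, row)
    return i < len(table) and table[i] == row


# Step 1: range scan of POS for (worksAt, Parsons) binds ?x.
candidates = [s for _, _, s in prefix_scan(POS, ("worksAt", "Parsons"))]
# Step 2: one SPO lookup per candidate for ?x livesIn Virginia.
candidates = [x for x in candidates if has_row(SPO, (x, "livesIn", "Virginia"))]
# Step 3: one SPO lookup per remaining candidate for ?x commuteMethod bike.
result = [x for x in candidates if has_row(SPO, (x, "commuteMethod", "bike"))]
print(result)  # ['Greta']
```

Each later step issues one lookup per surviving binding of ?x, so the cost of the plan depends heavily on how many bindings the first pattern returns.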


Challenges in Query Execution

Scalability and Responsiveness

Massive amounts of data

Potentially large numbers of comparisons

Consider the previous example:

Default query execution: comparing each “?x” returned from the first statement pattern query to all subsequent triple patterns

There are 8.3 million Virginia residents, about 15,000 Parsons employees, and 750,000 people who commute via bike.

Only 100 people who work at Parsons commute via bike, while 1,000 people who work at Parsons live in Virginia.

Poor query execution plans can result in simple queries taking minutes as opposed to milliseconds

SELECT ?x WHERE {
?x <livesIn> Virginia.
?x <worksAt> Parsons.
?x <commuteMethod> bike.
}

vs.

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}

vs.

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
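A rough count of index lookups for the three orderings, using only the figures quoted above. The cost model is an assumption made for illustration: one lookup per binding of ?x that reaches a later pattern, ignoring the cost of the initial range scan and any network or batching effects.

```python
# Figures quoted on the slide.
CARD = {"livesIn": 8_300_000, "worksAt": 15_000, "commuteMethod": 750_000}
# Sizes of the intermediate result after joining the first two patterns.
AFTER_TWO = {
    frozenset(["worksAt", "livesIn"]): 1_000,
    frozenset(["worksAt", "commuteMethod"]): 100,
}


def lookups(order):
    """Index lookups issued by the default plan for a three-pattern order."""
    first, second, _third = order
    step2 = CARD[first]                            # one lookup per ?x from the scan
    step3 = AFTER_TWO[frozenset([first, second])]  # one per surviving ?x
    return step2 + step3


for order in (
    ("livesIn", "worksAt", "commuteMethod"),
    ("worksAt", "livesIn", "commuteMethod"),
    ("worksAt", "commuteMethod", "livesIn"),
):
    print(order, lookups(order))
```

Under this model the first ordering needs 8,301,000 lookups, the second 16,000, and the third 15,100, a difference of more than two orders of magnitude from reordering alone.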


Rya Query Optimizations

Goal: Optimize query execution (joins) to better support real time responsiveness

Three Approaches:

Reduce the number of joins: Pattern Based Indices

– Pre-calculate common joins

Limit data in joins: Use more stats to improve query planning

– Cardinality estimation on individual statement patterns

– Join selectivity estimation on pairs of statement patterns

Make joins more efficient: Distribute the Join Processing

– Distribute processing using Spark SQL or MapReduce

– Use Hash Joins and Intersecting Iterators

– Just beginning to look at this


Rya Query Optimizations Using Cardinalities

Goal: Optimize ordering of query execution to reduce the number of comparison operations

Order execution based on the number of triples that match each triple pattern

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}

Matches per triple pattern:

– ?x <worksAt> Parsons: 15k matches

– ?x <commuteMethod> bike: 750k matches

– ?x <livesIn> Virginia: 8.3M matches


Rya Cardinality Usage

Maintain cardinalities on the following triple pattern element combinations:

– Single elements: Subject, Predicate, Object

– Composite elements: Subject-Predicate, Subject-Object, Predicate-Object

Computed periodically using MapReduce

Row ID:

– <CardinalityType><TripleElements>

• OBJECT, Parsons

• PREDICATEOBJECT, worksAt, Parsons

Cardinality stored in the value

Sparse table: Only store cardinalities above a threshold

Only need to recompute cardinalities if the distribution of the data changes significantly
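A small sketch of how such a table could drive pattern ordering. The dictionary below is a hypothetical in-memory stand-in for the cardinality table. The tuple key layout, the helper names `row_id` and `order_by_cardinality`, and the choice to treat a missing entry as "below the threshold" are assumptions. The counts are the ones quoted earlier in the deck.

```python
# Hypothetical stand-in for the cardinality table: row ID -> stored value.
CARDINALITY = {
    ("PREDICATEOBJECT", "worksAt", "Parsons"): 15_000,
    ("PREDICATEOBJECT", "commuteMethod", "bike"): 750_000,
    ("PREDICATEOBJECT", "livesIn", "Virginia"): 8_300_000,
}


def row_id(subject, predicate, obj):
    """Build <CardinalityType><TripleElements> from the bound elements.

    None marks a variable; the tuple layout is an illustrative assumption.
    """
    parts = (("SUBJECT", subject), ("PREDICATE", predicate), ("OBJECT", obj))
    names = [name for name, value in parts if value is not None]
    values = [value for _, value in parts if value is not None]
    return ("".join(names), *values)


def order_by_cardinality(patterns, default=0):
    """Sort patterns so the one matching the fewest triples runs first.

    A missing row is assumed to mean "below the storage threshold",
    so it sorts ahead of every stored cardinality.
    """
    return sorted(patterns, key=lambda p: CARDINALITY.get(row_id(*p), default))


patterns = [
    (None, "livesIn", "Virginia"),
    (None, "worksAt", "Parsons"),
    (None, "commuteMethod", "bike"),
]
for pattern in order_by_cardinality(patterns):
    print(pattern)
```

With these counts the order becomes worksAt, commuteMethod, livesIn, which is the ordering shown on the previous slide.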


Limitations of Cardinality Approach

Consider a more complicated query

Cardinality approach does not take into account the number of results returned by joins

Solution lies in estimating the “join selectivity” for each pair of triple patterns

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}

Matches per triple pattern:

– ?x <worksAt> Parsons: 15k matches

– ?x <commuteMethod> bike: 750k matches

– ?vehicle <vehicleType> SUV: 2.1M matches

– ?x <livesIn> Virginia: 8.3M matches

– ?x <owns> ?vehicle: 254M matches


Rya Query Optimizations Using Join Selectivity

Query optimized using only Cardinality Info:

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}

Query optimized using Cardinality and Join Selectivity Info:

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}

Join Selectivity measures number of results returned by joining two triple patterns

Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008

Due to computational complexity, estimate of join selectivity for triple patterns is pre-computed and stored in Accumulo

Join selectivity estimated by computing the number of results obtained when each triple pattern is joined with the full table


Join Selectivity: General Algorithm

For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a variable and p1, o1, p2, o2 constant, estimate the number of results

Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>) give the number of results returned by joining a statement pattern with the full table along the subject component

Full table join statistics precomputed and stored in index

Join statistics for each triple pattern computed using following equation:

Use analogous definition if variables appear in predicate or object position

Join selectivity statistics used with cardinalities to generate more efficient query plans


Join Selectivity: Integration into Rya

Join Selectivity estimates used to optimize Rya queries through a greedy algorithm approach

Query constructed starting with first triple pattern to be evaluated (the pattern with the smallest cardinality) and then patterns are added based on minimization of a cost function

Cost function

C = leftCard + rightCard + leftCard*rightCard*selectivity

C measures number of entries Accumulo must scan and the number of comparisons required to perform the join

Selectivity set to one if two triple patterns share no common variables, otherwise precomputed estimates used

Ensures that patterns with common variables are grouped together
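The greedy construction can be sketched as below. This is a hedged illustration and not Rya's planner: the function name `greedy_plan`, the rule for combining selectivities when the left side already contains several patterns (take the smallest pairwise estimate), and the use of leftCard*rightCard*selectivity as the new leftCard are all assumptions. The cardinalities come from the earlier slide; the selectivity values are made up.

```python
def greedy_plan(cards, variables, selectivity):
    """Order patterns greedily using C = left + right + left*right*sel.

    cards: pattern -> cardinality
    variables: pattern -> set of variable names
    selectivity: frozenset({a, b}) -> pairwise estimate for patterns
                 that share a variable
    """
    def sel(plan, cand):
        shared = [selectivity[frozenset((p, cand))]
                  for p in plan if variables[p] & variables[cand]]
        return min(shared) if shared else 1.0  # no common variable -> 1

    remaining = set(cards)
    first = min(sorted(remaining), key=lambda p: cards[p])
    remaining.remove(first)
    plan, left_card = [first], cards[first]
    while remaining:
        def cost(cand):
            return (left_card + cards[cand]
                    + left_card * cards[cand] * sel(plan, cand))
        best = min(sorted(remaining), key=cost)
        # Assumed estimate of the size of the new intermediate result.
        left_card = left_card * cards[best] * sel(plan, best)
        plan.append(best)
        remaining.remove(best)
    return plan


W, V, O = "?x worksAt Parsons", "?vehicle vehicleType SUV", "?x owns ?vehicle"
CARDS = {W: 15_000, V: 2_100_000, O: 254_000_000}
VARIABLES = {W: {"?x"}, V: {"?vehicle"}, O: {"?x", "?vehicle"}}
SELECTIVITY = {frozenset((W, O)): 1e-8, frozenset((O, V)): 1e-7}  # made up

plan = greedy_plan(CARDS, VARIABLES, SELECTIVITY)
print(plan)
```

Sorting by cardinality alone would place the vehicleType pattern second, forcing a cross product with the worksAt results. With the cost function, the selectivity of 1 for patterns without a shared variable makes that choice far more expensive, so the owns pattern is joined first.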


Construction of Selectivity Tables

For the pattern <?x, p1, o1>, associate each RDF triple of the form <c, p1, o1> with the cardinality |<c,?y,?z>| and then sum the results

Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits the key-value pair (c, (p1, o1))

Map Job 2 processes the cardinality table and emits the key-value pair (c, |<c,?y,?z>|), which consists of the constant c and its single-component subject cardinality for the table

Map Job 3 merges the results from jobs 1 and 2 by emitting the key-value pair ((p1, o1), |<c,?y,?z>|)

Map Job 4 sums the cardinalities from those key-value pairs containing (p1, o1) as a key, and the result is written to the selectivity table
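The four jobs can be traced on a handful of triples. The sketch below runs them as plain in-memory Python steps rather than MapReduce jobs. The sample triples are made up, and the subject cardinalities are derived from the sample with a `Counter` instead of being read from a cardinality table.

```python
from collections import Counter, defaultdict

# Toy SPO table (illustrative data).
spo = [
    ("Bob", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("Dan", "worksAt", "Netflix"),
]

# Stand-in for the cardinality table: subject c -> |<c, ?y, ?z>|.
subject_card = Counter(s for s, _, _ in spo)

# Job 1: emit (c, (p1, o1)) for every triple <c, p1, o1>.
job1 = [(s, (p, o)) for s, p, o in spo]

# Job 2: emit (c, |<c, ?y, ?z>|) from the cardinality table.
job2 = list(subject_card.items())

# Job 3: merge on c and emit ((p1, o1), |<c, ?y, ?z>|).
card_of = dict(job2)
job3 = [(po, card_of[c]) for c, po in job1 if c in card_of]

# Job 4: sum the cardinalities per (p1, o1) key.
selectivity_table = defaultdict(int)
for po, card in job3:
    selectivity_table[po] += card

print(selectivity_table[("worksAt", "Parsons")])  # 5
```

Bob appears as a subject in 2 triples and Greta in 3, so joining <?x, worksAt, Parsons> with the full table along the subject yields 2 + 3 = 5 results.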


Query Optimizations Using Pre-Computed Joins

Reduce joins by pre-computing common joins

Approach taken from: Heese, Ralf, et al. "Index Support for SPARQL." European Semantic Web Conference, Innsbruck, Austria. 2007.

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}

Pre-compute using batch processing and look up during query execution


Query Optimizations Using Pre-Computed Joins

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}

1. Pre-compute a portion of the query using MapReduce

2. Store SPARQL describing the query along with pre-computed values in Accumulo

3. Normalize query variables to match stored SPARQL variables during query execution

Stored SPARQL

SELECT ?person ?car WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}

Index Result Table

.…

Aaron, ToyotaRav4

Caleb, JeepCherokee

Puja, HondaCRV

.…
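One way to picture the normalization step is a search for a one-to-one renaming of the stored query's variables that makes every stored pattern appear in the incoming query. The backtracking sketch below is an illustration under that assumption, not Rya's implementation. The names `match_index` and `unify` are hypothetical, and patterns are plain (subject, predicate, object) tuples with variables marked by a leading "?".

```python
def is_var(term):
    return term.startswith("?")


def unify(stored_term, query_term, binding):
    """Extend binding so stored_term maps onto query_term, if possible."""
    if not is_var(stored_term):
        return stored_term == query_term
    if not is_var(query_term):
        return False  # assumed: a stored variable never matches a constant
    if stored_term in binding:
        return binding[stored_term] == query_term
    if query_term in binding.values():
        return False  # keep the renaming one-to-one
    binding[stored_term] = query_term
    return True


def match_index(stored, query, binding=None):
    """Return {stored variable: query variable}, or None if no renaming fits."""
    binding = binding or {}
    if not stored:
        return binding
    head, rest = stored[0], stored[1:]
    for candidate in query:
        trial = dict(binding)
        if all(unify(a, b, trial) for a, b in zip(head, candidate)):
            found = match_index(rest, query, trial)
            if found is not None:
                return found
    return None


stored = [
    ("?person", "livesIn", "Virginia"),
    ("?person", "owns", "?car"),
    ("?car", "vehicleType", "SUV"),
]
query = [
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
    ("?x", "livesIn", "Virginia"),
    ("?x", "owns", "?vehicle"),
    ("?vehicle", "vehicleType", "SUV"),
]
print(match_index(stored, query))
```

For the queries on this slide the renaming is ?person to ?x and ?car to ?vehicle, so the last three patterns of the incoming query could be answered from the index result table and joined with the remaining two.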


Query Optimization Results

Ran 14 queries against the Lehigh University Benchmark (LUBM) dataset (33.34 million triples)

– LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity

– Remaining queries were executed 12 times

Cluster Specs:

– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and 48 GB RAM

Results indicate that cardinality and join selectivity optimizations provide improved or comparable performance


Summary

Cardinality estimation and join selectivity can improve query response times for ad hoc queries

Effects of join selectivity are more apparent for complex queries over large datasets

Pre-computed joins are extremely useful for optimizing common queries

Potentially avoid large number of join operations

Maintaining pre-computed join indices is difficult


Questions?


BACK-UP


Useful Links

SPARQL http://www.w3.org/TR/rdf-sparql-query/

http://jena.apache.org/tutorials/sparql.html

RDF http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/

Rya https://github.com/LAS-NCSU/rya

– Source on github: Provides documentation and sample client code

– Email Aaron Mihalik ([email protected]) for access (US Citizens only)

Rya Working Group

– Monthly telecon / update on progress, issues, upcoming features

– Email Puja Valiyil [email protected] to join (US Citizens only)

Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view

Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html

Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the clouds. Proceedings of the 1st International Workshop on Cloud Intelligence. http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf

Punnoose R., Crainiceanu A., Rapp D. 2013. SPARQL in the Clouds Using Rya. Information Systems Journal. http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf


Next Steps

Maintaining pre-computed join indices

Dynamically determining potential pre-computed joins

Distributing query planning and execution

– Spark SQL

Rya backed by other datastores

Fully open sourcing Rya


Sample LUBM Queries (1 of 3)

Query 1

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:GraduateStudent .

?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}

}

Query 3

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:Publication .

?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}

}


Sample LUBM Queries (2 of 3)

Query 7

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X ?Y WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:Student .

?Y rdf:type ub:Course .

?X ub:takesCourse ?Y .

<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}

}

Query 8

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X ?Y ?Z WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:Student .

?Y rdf:type ub:Department .

?X ub:memberOf ?Y .

?Y ub:subOrganizationOf <http://www.University0.edu> .

?X ub:emailAddress ?Z}

}


Sample LUBM Queries (3 of 3)

Query 9

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X ?Y ?Z WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:Student .

?Y rdf:type ub:Faculty .

?Z rdf:type ub:Course .

?X ub:advisor ?Y .

?Y ub:teacherOf ?Z .

?X ub:takesCourse ?Z}

}

Query 11

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:ResearchGroup .

?X ub:subOrganizationOf <http://www.University0.edu>}

}