The Graph Query Language: Towards a Unification of Graph Query Approaches
David Silberberg, 443-778-6231
JTC1 SC32N1634

Slide 2: Outline
- Goals & Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions

Slide 3: Goals of the Graph Query Language (GQL) Project
- To unify disparate graph query approaches into a single, seamless, and declarative language
  - Supports semantic search over graph data structures represented by schemas
  - Supports traditional graph algorithms that systematically follow edges to discover interesting subgraphs (e.g., shortest path, minimal spanning tree)
  - Supports metrics-oriented graph algorithms (e.g., social network analysis)
  - Supports special commands tailored to analysis of graphs
  - Supports ontology-assisted query
- To quantify the scalability of this type of language

Slide 4: Assumptions
- Data model is a typed graph that adheres to a schema
  - Not XML: graphs tend to be more highly connected
  - Not a semantic model: inference cannot, in general, be performed on the schema
- Data graphs can be large
- Query languages are only an abstract representation of questions
  - The object is finding the right abstraction for the way people think about interacting with graphs
  - Other query languages on other data models will work, but do those languages help facilitate or hinder the formulation of those requests or the interpretation of the results?
- Algorithms are external to the graph management system
  - There are too many algorithms
  - New algorithms may be implemented or modified regularly
  - We are not the experts in writing efficient algorithms

Slide 5: Benefits of GQL
- Potential for significant reduction in time to perform analysis
- Provides visual analysis applications with a new paradigm for interacting with graph data
- Reduces the time to find information useful to analysts
- Enables interactive analysis using large data graphs

Slide 6: Graph Interaction Methods
- Graph interactions take many forms
  - Browse: one-step-at-a-time exploration of a graph
  - Semantic schema-based search: several-steps-at-a-time graph query
  - Algorithms: find subgraphs, calculate graph metrics
  - Analysis: hypothesis expressions, etc.
- GQL is a declarative graph query language for integrating all these approaches!

Slide 7: Example Scenario
- Farmer Jones' lettuce crop did well this year, but few other farmers did well. Why?
- First, find Farmer Jones. (Browsing)
[Figure nodes: Jones]

Slide 8: Example Scenario
- Rabbits usually eat lettuce. Let's find the rabbits that ate Farmer Jones' lettuce. (Semantic Schema-Based Search)
[Figure nodes: Jones; Prize, Roman, Icy; Bugs, Harvey]

Slide 9: Example Scenario
- Let's look at all the farmers, and their locations, whose lettuce was eaten by fewer than 5 rabbits. (Semantic Schema-Based Search)
[Figure nodes: Jones, Smith, Harris; Prize, Roman, Icy, Leafy, Soft, Crispy, Green, Tasty; Bugs, Harvey, Peter; Smalltown, USA]

Slide 10: Example Scenario
- What commonalities do the farmers have with each other and with the rabbits? (Semantic and Algorithmic Search)
[Figure nodes: Jones, Smith, Harris; Prize, Roman, Icy, Leafy, Soft, Crispy, Green, Tasty; Bugs, Harvey, Peter; Smalltown, USA; Red, Sly; Acme Rent-a-Fox]

Slide 11: Example Scenario
- If Fred fox ate Prize lettuce, what else would we learn?
  (Analysis-specific Methods, Semantic Search, and Algorithmic Search)
[Figure nodes: Jones, Smith, Harris; Prize, Roman, Icy, Leafy, Soft, Crispy, Green, Tasty; Bugs, Harvey, Peter; Smalltown, USA; Red, Sly; Acme Rent-a-Fox; Fred; Brer Fox Enterprises]

Slide 12: Outline
- Goals & Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions

Slide 13: Related Work
Four categories of graph query languages and examples:
1. Knowledge base (subject-predicate-object) query languages: SPARQL, RQL, RAL, RDF Query Language
2. Graph reasoning query languages: OWL-QL, GraphLog, Query and Inference Service for RDF
3. Query languages with graph operators: GOQL, GRAM
4. Graphical user interface query language: QGraph

Slide 14: Features of GQL that Support Analysis
- Schema-based graph query
- Returns a single graph or a set of graphs (not tables or XML files)
- Aliasing
- Graph exploration through wildcard search
- Embedded queries (helps achieve first order logic expressiveness)
- Creates new graph structures in query results
- Query over defined patterns (of activity or behavior, for example)
- Special commands tailored to analysis
- Hypothesis expressions
- Composite vertices (of vertices and edges)
- External algorithms that return graphs (e.g., shortest path)
- External algorithms that return metrics (e.g., social network analysis)
- Ontology-assisted graph query

Slide 15: Example Graph Model
[Figure: graph schema and instance graph]
- Schema vertex types: Fox (name, age), Rabbit (name, age), Lettuce (name), Carrot (name)
- Schema edge types: Chases (time), Eats (time)
- Instance vertices: Fox: fox1, fox2; Rabbit: rabbit1, rabbit2, rabbit3; Lettuce: lettuce1, lettuce2; Carrot: carrot1, carrot2
- Instance edges: Chases: chases1, chases2; Eats: eats1, eats2, eats3, eats4, eats5, eats6
- Vertex property values shown: age: 3, name: George; age: 2, name: Peter; age: 4, name: Bugs; name: Fred, age: 2; age: 1, name: Jack; name: PrizeLettuce; name: Icy; name: CarrotTop; name: BigCarrot
- Edge time values shown: 2pm, 3pm, 5pm, 8am, 7pm, 9am, 7am, 8am

Slide 16: GQL Operators - Overview
- Basic syntax
  - SUBGRAPH clause: finds a subgraph in the source graph
  - CONSTRAINT clause: filters the subgraph based on property constraints
  - RETURN clause: describes the resulting graph or sets of graphs to return
- Syntax for analysis
  - ASSUME clause: supports hypothesis statements
  - PATTERN clause: defines search patterns

Slide 17: Basic GQL Operators
- Subgraph Template Operators
  - SUBGRAPH clause
  - Conjunctions and disjunctions of path-segment operators
  - Hierarchy operators (for composite vertices)
- Constraint Operators
  - CONSTRAINT clause
  - Standard first-order logic
  - Conjunctions, disjunctions and negations as well as universal and existential quantification of predicates
- Projection Operators
  - RETURN clause
  - Constructs the result graph(s)
    - Path segment operator
    - Hierarchy operator (for composite vertices)
  - Present results as a set of graphs
    - Edge expansion operator
    - Common join operator

Slide 18: Simple Query that Returns a Single Graph
    SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit
    CONSTRAINT Chases.time < Eats.time
    RETURN Fox Chases Rabbit AND Fox Eats Rabbit
[Result graph: Fox: fox1 (name: George, age: 3); Rabbit: rabbit1 (name: Peter, age: 2); Chases: chases1 (time: 2pm); Eats: eats1 (time: 3pm)]
- Type represents variable
  - Motivated by languages like SQL
  - In contrast to (Fox ?f1)

Slide 19: Returning a Set of Graphs
- Can be done with edge expansion or joins in the RETURN clause
- Can be seamlessly integrated with non-graph expansion expressions
- Any query can be returned as a set of graphs if desired
    SUBGRAPH Fox Chases Rabbit
    RETURN Fox Chases# Rabbit
[Result graph 1: Fox: fox1 (name: George, age: 3); Rabbit: rabbit1 (name: Peter, age: 2); Chases: chases1 (time: 2pm)]
[Result graph 2: Fox: fox1 (name: George, age: 3); Rabbit: rabbit2 (name: Bugs, age: 4); Chases: chases2 (time: 5pm)]

Slide 20: Aliasing
    SUBGRAPH Fox ALIAS ChasingFox Chases Rabbit AND Fox ALIAS EatingFox Eats Rabbit
    CONSTRAINT ChasingFox.name ≠ EatingFox.name
    RETURN ChasingFox Chases Rabbit AND EatingFox Eats Rabbit
- If our graph had an additional edge in which George Fox chased Jack Rabbit at 8 a.m., the result would look like:
[Result graph: Fox: fox1 (name: George, age: 3); Fox: fox2 (name: Fred, age: 2); Rabbit: rabbit3 (name: Jack, age: 1); Chases: chases3 (time: 8am); Eats: eats2 (time: 9am)]

Slide 21: Embedded Queries
- Significant component of first order logic expressiveness
- To request the first fox that ate a rabbit, the following existential query is formulated:
    SUBGRAPH Fox Eats ALIAS E1 Rabbit
    CONSTRAINT NOT EXISTS (
        SUBGRAPH Fox Eats ALIAS E2 Rabbit
        CONSTRAINT E1.time > E2.time)
    RETURN Fox Eats Rabbit
[Result graph: Fox: fox2 (name: Fred, age: 2); Rabbit: rabbit3 (name: Jack, age: 1); Eats: eats2 (time: 9am)]
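The simple query on slide 18 and the embedded query on slide 21 can be illustrated with a small Python sketch. This is not the GQL prototype: the `Edge` record, the function names, and the encoding of times as 24-hour integers are assumptions made for illustration. Only the four edges whose endpoints and times appear in the result graphs above are included, and each vertex type in a query is treated as a single variable, as slide 18 states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    id: str
    type: str
    src: str
    dst: str
    time: int  # hour of day on a 24-hour clock (illustrative encoding)


# Edges whose endpoints and times are shown in the slide result graphs.
EDGES = [
    Edge("chases1", "Chases", "fox1", "rabbit1", 14),
    Edge("eats1", "Eats", "fox1", "rabbit1", 15),
    Edge("chases2", "Chases", "fox1", "rabbit2", 17),
    Edge("eats2", "Eats", "fox2", "rabbit3", 9),
]


def edges_of(edge_type):
    """Return all edges of the given type."""
    return [e for e in EDGES if e.type == edge_type]


def chased_then_eaten():
    """SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit
    CONSTRAINT Chases.time < Eats.time

    The same Fox and the same Rabbit must appear in both path segments.
    """
    return [
        (c, e)
        for c in edges_of("Chases")
        for e in edges_of("Eats")
        if c.src == e.src and c.dst == e.dst and c.time < e.time
    ]


def first_eats():
    """NOT EXISTS embedded query: keep E1 only if no E2 is earlier."""
    eats = edges_of("Eats")
    return [e1 for e1 in eats if not any(e1.time > e2.time for e2 in eats)]


if __name__ == "__main__":
    print([(c.id, e.id) for c, e in chased_then_eaten()])
    print([e.id for e in first_eats()])
```

On this data the first function pairs chases1 with eats1 and the second keeps only eats2, which agrees with the result graphs shown on slides 18 and 21. The nested loops are the naive evaluation; the later slides on query optimization discuss how indices and ordering reduce that cost.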
Slide 22: New Result Graph Structure
Query:
    SUBGRAPH Fox Eats Rabbit AND Rabbit Eats Lettuce
    RETURN Fox new(Ingests) Lettuce
[Result graph: Fox: fox1 (name: George, age: 3); Fox: fox2 (name: Fred, age: 2); Lettuce: lettuce1, lettuce2 (names shown: PrizeLettuce, Icy); Ingests: ingests1, ingests3]

Slide 23: Hypothesis Expressions
- Enables queries on hypothetical data
    SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit AND Rabbit Eats Lettuce
    CONSTRAINT Chases.time < 8am
    RETURN Fox new(Ingests) Lettuce
    ASSUME EDGE Chases [NEW time = 7am]
        FROM Fox[CONSTRAINT name = Fred]
        TO Rabbit[CONSTRAINT name = Jack]
- Motivated by OWL-QL

Slide 24: Composite Vertices
- Composite vertices
  - Composed of vertices and edges
  - Contained vertices can be composite as well
[Figure: the example schema (Fox, Rabbit, Lettuce, Carrot with Chases and Eats edges) enclosed in a composite vertex HuntingEvent, with an OccuredAt edge to a Place vertex; additional properties shown: time, name, location]

Slide 25: Composite Vertex Queries - continued
    SUBGRAPH HuntingEvent OccuredAt Place AND HuntingEvent DIRECTLY CONTAINS Rabbit AND Rabbit Eats Lettuce
    CONSTRAINT Place.name = Smith Game Park
    RETURN Rabbit Eats Lettuce
[Result graph: Rabbit (name, age), Lettuce (name), Eats (time)]
- Addresses a subset of Harel's Higraphs
- Multiple hops: CONTAINS or IS-CONTAINED-BY
  - Feasible because of the hierarchy

Slide 26: Wildcard Queries
    SUBGRAPH Fox * ALIAS InterestingEdge Rabbit
    RETURN Fox InterestingEdge Rabbit
[Result graph: Fox: fox1 (name: George, age: 3), fox2 (name: Fred, age: 2); Rabbit: rabbit1 (name: Peter, age: 2), rabbit2 (name: Bugs, age: 4), rabbit3 (name: Jack, age: 1); edges chases1, chases2, eats1, eats2; times shown: 2pm, 3pm, 5pm, 9am]
- One-edge wildcard queries
- Multiple hops
  - May be computationally expensive in a graph
  - Can be handled by an external AllPath() algorithm

Slide 27: Pattern Definition
- Assigns names to interesting graph patterns
- Can be reused in multiple queries
    PATTERN Predator (Fox new(PreysUpon) Rabbit) =
        SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit
        CONSTRAINT Chases.time < Eats.time
        RETURN Fox new(PreysUpon) Rabbit

Slide 28: Pattern Use
Query:
    SUBGRAPH Predator(Fox PreysUpon Rabbit) AND Rabbit Eats Lettuce
    RETURN Fox new(Ingests) Lettuce
Is evaluated as if it were:
    SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit AND Rabbit Eats Lettuce
    CONSTRAINT Chases.time < Eats.time
    RETURN Fox new(Ingests) Lettuce

Slide 29: External Graph Algorithms that Return Subgraphs
- Shortest Path
    SUBGRAPH GameWarden Chases Fox AND ShortestPath(Fox, Rabbit) ALIAS SP_alias AND Rabbit Eats Lettuce
    RETURN GameWarden Chases Fox AND SP_alias AND Rabbit Eats Lettuce
- Adjacent Vertices
    SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias
    CONSTRAINT count_edges(Rabbit) > 10
    RETURN AV_alias

Slide 30: External Graph Algorithms that Return Metrics
- Centrality: find the Foxes that eventually Eat the Rabbits, who play a central role in the garden activities
    SUBGRAPH Fox Eats Rabbit
    CONSTRAINT Centrality(Fox, Rabbit, Lettuce) > .8
    RETURN Fox Eats Rabbit
- Clustering Coefficient: find the Foxes that are likely to work together when Chasing Rabbits
    SUBGRAPH Fox ALIAS Fox1 Chases Rabbit AND Fox ALIAS Fox2 Chases Rabbit
    CONSTRAINT ClusteringCoefficient(Fox1, Fox2) > .6 AND Fox1 ≠ Fox2
    RETURN Fox Eats Rabbit

Slide 31: Some Issues with External Algorithms
- Algorithms do not filter results; they operate directly on the graph and tie into the rest of the results
- Algorithms need to return a set of graphs (or a graph under some circumstances) in a standard format
- Order of query execution
- No current way to refer to the result vertices and edges of algorithms that are not specifically identified in the query
    SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias
    CONSTRAINT ClusteringCoefficient( , ) > .6
    RETURN Rabbit AND Rabbit

Slide 32: Ontology Assisted Query
[Figure: an ontology (Organism, Animal, Vegetable, Carnivore, Herbivore; Wolf, Fox, Hare, Sheep, Lettuce, Carrot; relations isA, Eats, Chases) connected by mappings to the graph schema (Fox, Rabbit, Lettuce, Carrot with Chases and Eats edges)]

Slide 33: Ontology-Assisted Query Result
    SUBGRAPH Carnivore Eats Herbivore AND Herbivore Eats Vegetable
    RETURN Carnivore new(Ingests) Vegetable
[Result graph: Fox: fox1 (name: George, age: 3); Fox: fox2 (name: Fred, age: 2); Lettuce: lettuce1, lettuce2 (names shown: PrizeLettuce, Icy); Ingests: ingests1, ingests3]

Slide 34: Some Issues of Ontology-Assisted Query
- Why not just have an ontology query language?
  - Performance issues?
  - Scaling issues?
- Capitalize on features that semantics bring to bear on a graph query language
  - Semantic abstraction (e.g., subsumption, hierarchy)
  - Use inference to create semantically consistent models
  - Impose semantics on the graph model

Slide 35: Outline
- Goals & Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions

Slide 36: Query Optimization
- Query execution time is the key to success for any query language
  - GQL is no exception
- We apply relational database optimization techniques to graph queries
- Optimization issues
  - Addressed query optimization on a per path-segment basis: yes
  - Address path-segment ordering: initial thoughts
  - Address the management of large amounts of intermediate results of a query: not yet
  - Address incorporating external algorithms: not yet
  - Address ontology elaboration performance: not yet

Slide 37: Query Optimization
- Query plan representations are used to define query execution plans
- Query plan representations are constructed to optimize the query execution time
  - Via graph algebra
  - Via graph statistics to estimate query costs for each operation
- Query optimizer determines
  - The best algorithm to execute each operation
  - The best operation ordering to optimize overall query execution time

Slide 38: Query Planning and Optimization
- Query planning process determines the operators required to solve a query
- Query optimization process determines the most efficient way to:
  - Execute query operators
  - Order the execution of query operators
- Heuristics have been identified to implement query planning and optimization based on statistical analysis

Slide 39: Graph Statistics
- Estimating costs requires statistical knowledge of the graph
- We estimate the cost of the path segment operator
  - One of the most common and costly operations
- Statistics that we initially considered useful:
  - Vertex Cardinality: the number of vertices of type v is count(v), or just V
  - Vertex Edge Set Cardinality: the total number of edges e that emanate from all vertices of type v is count(e_v), or just E_V
  - Edge Cardinality: the number of edges of type e is count(e), or just E
  - Edge Distribution: the number of different vertex type pairs that edges of type e connect, or just E_D
  - Selectivity Factor: the percentage of vertices or edges that match a property constraint is sel(φ), where φ is the property constraint
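The statistics defined above can be computed directly from a typed graph. The following Python sketch is only an illustration under stated assumptions: the `Vertex` and `Edge` tuples and all function names are hypothetical, the sample data reuses vertices and edges from the example graph model, and the edge `eats_x` from rabbit1 to lettuce1 is invented so that Eats connects more than one vertex type pair.

```python
from collections import namedtuple

# Hypothetical in-memory representation of a typed graph.
Vertex = namedtuple("Vertex", "id type props")
Edge = namedtuple("Edge", "id type src dst")


def vertex_cardinality(vertices, vtype):
    """V: number of vertices of type vtype."""
    return sum(1 for v in vertices if v.type == vtype)


def vertex_edge_set_cardinality(vertices, edges, vtype):
    """E_V: number of edges (of any type) leaving vertices of type vtype."""
    ids = {v.id for v in vertices if v.type == vtype}
    return sum(1 for e in edges if e.src in ids)


def edge_cardinality(edges, etype):
    """E: number of edges of type etype."""
    return sum(1 for e in edges if e.type == etype)


def edge_distribution(vertices, edges, etype):
    """E_D: number of distinct (source type, target type) pairs for etype."""
    type_of = {v.id: v.type for v in vertices}
    return len({(type_of[e.src], type_of[e.dst])
                for e in edges if e.type == etype})


def selectivity(items, predicate):
    """sel: fraction of items that satisfy the property constraint."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


VERTICES = [
    Vertex("fox1", "Fox", {"name": "George", "age": 3}),
    Vertex("fox2", "Fox", {"name": "Fred", "age": 2}),
    Vertex("rabbit1", "Rabbit", {"name": "Peter", "age": 2}),
    Vertex("rabbit2", "Rabbit", {"name": "Bugs", "age": 4}),
    Vertex("rabbit3", "Rabbit", {"name": "Jack", "age": 1}),
    Vertex("lettuce1", "Lettuce", {"name": "PrizeLettuce"}),
]
EDGES = [
    Edge("chases1", "Chases", "fox1", "rabbit1"),
    Edge("eats1", "Eats", "fox1", "rabbit1"),
    Edge("chases2", "Chases", "fox1", "rabbit2"),
    Edge("eats2", "Eats", "fox2", "rabbit3"),
    Edge("eats_x", "Eats", "rabbit1", "lettuce1"),  # invented for illustration
]

if __name__ == "__main__":
    rabbits = [v for v in VERTICES if v.type == "Rabbit"]
    print("V(Fox) =", vertex_cardinality(VERTICES, "Fox"))
    print("E_V(Fox) =", vertex_edge_set_cardinality(VERTICES, EDGES, "Fox"))
    print("E(Eats) =", edge_cardinality(EDGES, "Eats"))
    print("E_D(Eats) =", edge_distribution(VERTICES, EDGES, "Eats"))
    print("sel(age < 3) =", selectivity(rabbits, lambda v: v.props["age"] < 3))
```

In practice such statistics would be maintained incrementally rather than recomputed by scanning the graph; the scans here simply make each definition explicit.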
- Uniformity assumption
- Independence assumption

Slide 40: Path Segment Vertex Search, No Indices
- Algorithm
  - Iterate through the set of vertices of type v in O(V) time
  - For each vertex, iterate through its edge list to find edges of type e in O(E_V / V) time
  - Follow the edge to vertex w in constant time
- Execution time is O(V * (E_V / V)) = O(E_V)

Slide 41: Path Segment Indices on Vertex Edge Set
- Requires each edge set to be indexed through a logarithmic-time search tree (e.g., B+ tree)
  - Next values are (virtually) collocated with the matching value
  - Enables a constant time search for the next value(s)
- Algorithm
  - Iterate through vertices of type v in time O(V)
  - Find matching edge(s) in logarithmic time O(log(E_V / V))
  - Iterate through the matching edges in time O(E / (E_D * V))
- Execution time is O(V * (log(E_V / V) + E / (E_D * V))) = O(V * log(E_V / V) + E / E_D)
  - If E_D ≈ E (i.e., one edge of type e emanates from each v), then the algorithm tends to operate in time O(V * log(E_V / V))
    - If E_D ≈ E and E_V ≈ V, the algorithm tends to operate in time O(V)
    - If E_D ≈ E and E_V >> V, the algorithm tends to operate in time O(V * log(E_V))
  - If E_D >> E, then the algorithm tends to operate in time O(E / E_D)

Slide 42: Path Segment Edge Indices, Constraint
- Beneficial when the query includes a constraint φ_v on an indexed property of vertices of type v
  - Vertex edge sets are indexed as well
- Algorithm
  - Logarithmic-time search through the indexed properties for φ_v in time O(log(V))
  - Iterate through vertices (collocated in the index) that satisfy the constraint in time O(sel(φ_v) * V)
  - Perform a logarithmic-time search on the edges of each matching vertex in time O(log(E_V / V))
  - Iterate through the matching edges in time O(E / (E_D * V))
- Execution time is O(log(V) + sel(φ_v) * V * (log(E_V / V) + E / (E_D * V))) = O(log(V) + sel(φ_v) * V * log(E_V / V) + sel(φ_v) * E / E_D)
  - If sel(φ_v) ≈ 0, the dominant factor is the search for vertices, or O(log(V))
  - If the selectivity factor is higher, the execution time approaches the times of the previous slide

Slide 43: Path Segment Edge Search, No Indices
- Algorithm
  - Iterate over edges of type e and select those that connect v to w in time O(E)
  - Find the corresponding vertices in constant time
- Execution time is O(E)

Slide 44: Path Segment Edge Search, Constraint
- Beneficial when the query statement includes a constraint φ_e on an indexed property of edges of type e
- Algorithm
  - Perform a logarithmic-time search through properties to find the first matching edge in time O(log(E))
  - Perform a linear search through all subsequent matching edges in time O(sel(φ_e) * E)
  - Find both vertices attached to each edge in constant time
- Execution time is O(log(E) + sel(φ_e) * E)
  - If sel(φ_e) ≈ 0, the algorithm tends to an execution time of O(log(E))
  - Otherwise, the algorithm tends to an execution time of O(E)

Slide 45: Varying Number of Vertices per Vertex Type

Slide 46: Varying Number of Edges per Vertex

Slide 47: Varying Edge Types with Constraints

Slide 48: Path Segment Ordering
- Assume the following query
    SUBGRAPH Fox Chases Rabbit AND Rabbit Eats Lettuce
    CONSTRAINT Rabbit.age < 3
    RETURN Fox new(Ingests) Lettuce
- Query processing produces the following query execution plan
[Figure: query execution plan with nodes Fox new(Ingests) Lettuce; Rabbit.age < 3; Fox, Rabbit, Lettuce; Chases, Eats]

Slide 49: Path Segment Execution Order Choice
- Which is more efficient?
[Figure: two alternative execution plans built from Fox new(Ingests) Lettuce, Rabbit.age < 3, and the Chases and Eats path segments over Fox, Rabbit, and Lettuce, arranged in different orders]

Slide 50: Execution Order Heuristics
- In simple terms
  - Identify the path segment operation that promises to return the least number of results
  - Then identify the next operation that promises to return the next least number of results
- It is actually more complicated than this
  - Need to search an exponential number of orderings to find the most efficient ordering
  - Heuristics can make this search tractable

Slide 51: Path-Segment Ordering Metric
- Order the path segment operators to return the fewest results
- Rough heuristic:
  - If predicates φ_v, φ_e, and φ_w are applied to V, E, and W respectively
  - Start with V and use selectivity factors to estimate execution time
  - Execution time is: V * sel(φ_v) * (E / (E_D * V)) * sel(φ_e) * (W * E_D / E) * sel(φ_w)
  - Or, sel(φ_v) * sel(φ_e) * sel(φ_w) * W
- Use this formula to determine whether Fox Chases Rabbit should precede or follow Rabbit Eats Lettuce

Slide 52: Outline
- Goals & Example Scenario
- Key Features of GQL
- Computational Complexity of Query Execution
- Future Directions

Slide 53: Prototype Implementation Schedule
- Currently implemented
  - Schema search returning a single graph
  - Pattern matching
  - Aliasing
  - Ontology assisted graph query
- Next to be implemented within approximately 6 months
  - Externally defined functions
  - Wildcard search
  - Hypothesis expressions
- Future
  - Return a set of graphs (instead of a single graph)
  - Embedded queries
  - Return new graph structures in query results
  - Composite vertices (of vertices and edges)
  - Predefined patterns
  - Query optimization

Slide 54: Future Work
- Relate GQL to a graphical interface
  - Enables analysts to express queries through graphical means
  - Can leverage several technologies (QGraph, Conceptual Graphs, etc.)
- Augment GQL to include uncertainty, geospatial, and temporal operators and data structures
- Address query optimization techniques
- Create a generic (as much as possible) back-end API to integrate with data sources
  - Relational
  - Different graph approaches
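The path-segment ordering metric from slide 51 can be sketched in Python as follows. This is an illustration, not the prototype's optimizer: the `PathSegment` class, the function names, and every cardinality and selectivity value below are hypothetical. The sketch applies the "in simple terms" heuristic from slide 50 (run the segment with the smallest estimate first), which the slides themselves note is a simplification.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class PathSegment:
    """Statistics for one path segment v -e-> w (all values hypothetical)."""
    name: str
    V: int        # number of source vertices of type v
    E: int        # number of edges of type e
    E_D: int      # number of vertex type pairs that edges of type e connect
    W: int        # number of target vertices of type w
    sel_v: float = 1.0  # selectivity of the predicate on v
    sel_e: float = 1.0  # selectivity of the predicate on e
    sel_w: float = 1.0  # selectivity of the predicate on w


def estimated_results(seg):
    """V * sel_v * (E / (E_D * V)) * sel_e * (W * E_D / E) * sel_w."""
    edge_factor = seg.E / (seg.E_D * seg.V)
    target_factor = seg.W * seg.E_D / seg.E
    return (seg.V * seg.sel_v * edge_factor
            * seg.sel_e * target_factor * seg.sel_w)


def simplified_estimate(seg):
    """The reduced form from the slide: sel_v * sel_e * sel_w * W."""
    return seg.sel_v * seg.sel_e * seg.sel_w * seg.W


def order_segments(segments):
    """Greedy ordering: fewest estimated results first."""
    return sorted(segments, key=estimated_results)


# Rabbit.age < 3 is assumed to select 25% of rabbits; it constrains the
# target of the first segment and the source of the second.
SEGMENTS = [
    PathSegment("Rabbit Eats Lettuce", V=1000, E=3000, E_D=2, W=5000,
                sel_v=0.25),
    PathSegment("Fox Chases Rabbit", V=100, E=400, E_D=1, W=1000,
                sel_w=0.25),
]

if __name__ == "__main__":
    for seg in order_segments(SEGMENTS):
        print(seg.name, estimated_results(seg))
```

Because V, E, and E_D cancel in the full expression, the two estimate functions agree up to floating-point rounding, so under this rough heuristic only the selectivities and the size of the target vertex set affect the ordering.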