
Post on 21-Jun-2020


GRAPH DATABASE

Ernesto Damiani and Paolo Ceravolo

paolo.ceravolo@unimi.it

Università degli Studi di Milano

Dipartimento di Informatica

WHAT IS A GRAPH?

‣ Formally, a graph is just a collection of vertices and edges

‣ Graphs represent entities as nodes and the ways in which those entities relate as relationships

‣ This general-purpose, expressive structure allows us to model all kinds of scenarios

‣ Graphs are extremely useful in understanding a wide diversity of datasets in fields such as science, government, and business

‣ Represent networks: social structures, topological relationships

‣ Represent a sequence of events

‣ Represent relationships between concepts: hyperonymy, hyponymy, meronymy


THE LABELED GRAPH MODEL

‣ The most popular form of graph model is the Labeled Graph Model

‣ It contains nodes and relationships

‣ Nodes contain properties (key-value pairs)

‣ Nodes can be labeled with one or more labels

‣ Relationships are named and directed, and always have a start and end node

‣ Relationships can also contain properties
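The model described above can be sketched in a few lines of Python. This is only an illustration of the structure (labels, properties, named and directed relationships), not how any particular graph database stores its data; the class names and sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set[str]                                 # one or more labels
    properties: dict = field(default_factory=dict)   # key-value pairs


@dataclass
class Relationship:
    name: str     # relationships are named
    start: Node   # directed: always a start node...
    end: Node     # ...and an end node
    properties: dict = field(default_factory=dict)   # relationships can carry properties too


# Hypothetical data, for illustration only.
jim = Node({"Person"}, {"name": "Jim"})
ian = Node({"Person", "Employee"}, {"name": "Ian"})
knows = Relationship("KNOWS", jim, ian, {"since": 2015})

print(knows.start.properties["name"], knows.name, knows.end.properties["name"])
```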


GRAPH DATABASE MANAGEMENT SYSTEM

‣ A Graph Database Management System is an online database management system

‣ It exposes CRUD (Create, Read, Update, and Delete) operations

‣ OLTP (Online Transaction Processing) transactional systems

‣ OLAP (Online Analytical Processing)

‣ Management systems that address scalability are also available

GRAPH DATABASE MANAGEMENT SYSTEM

‣ There are two properties of graph databases we should consider when investigating graph database technologies:

‣ The underlying storage

‣ Some graph databases use native graph storage that is optimised and designed for storing and managing graphs

‣ The processing engine

‣ Native graph processing requires that a graph database uses index-free adjacency, meaning that connected nodes physically “point” to each other in the database

GRAPH DATABASE MANAGEMENT SYSTEM

‣ Index-free adjacency

‣ A graph processing engine is said to be native if it implements index-free adjacency

‣ Looking up a neighbour through an index table costs O(log n), whereas following an adjacent relationship directly costs O(1)

‣ The cost of queries is not dependent on the size of the graph but on the size of the traversed path

‣ With index-free adjacency, bidirectional joins are effectively precomputed and stored in the database as relationships
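The difference between the two lookup styles can be sketched as follows. This is a toy illustration under stated assumptions, not the storage layout of any real engine: the edge list, the function names, and the `Node` class are all hypothetical.

```python
import bisect

# Hypothetical friendship edges (start, end).
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "dave"), ("carol", "dave")]

# (a) Index-based lookup: a sorted global edge table searched by binary
#     search, so every hop pays O(log n) in the total number of edges.
edge_index = sorted(edges)


def neighbours_via_index(name):
    pos = bisect.bisect_left(edge_index, (name, ""))
    found = []
    while pos < len(edge_index) and edge_index[pos][0] == name:
        found.append(edge_index[pos][1])
        pos += 1
    return found


# (b) Index-free adjacency: each node holds direct references to its
#     neighbours, so a hop is a pointer dereference.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighbouring Node objects


# The dict is only used to locate the starting node of a traversal.
nodes = {n: Node(n) for pair in edges for n in pair}
for a, b in edges:
    nodes[a].out.append(nodes[b])


def friends_of_friends(start):
    # Work done depends on the edges actually traversed, not on graph size.
    return sorted({fof.name for f in start.out for fof in f.out})


print(neighbours_via_index("alice"))
print(friends_of_friends(nodes["alice"]))
```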


GRAPH COMPUTE ENGINES

‣ A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets

‣ The architecture includes a system of record (SOR) database with OLTP properties

‣ Periodically, an Extract, Transform, and Load (ETL) job moves data from the system of record database into the graph compute engine for offline querying and analysis
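As a rough sketch of this architecture, the snippet below loads rows extracted from a system of record into an in-memory graph and then runs a global algorithm over it. The choice of connected components, the row format, and the function names are assumptions for illustration; real compute engines distribute this work across machines.

```python
from collections import defaultdict

# Hypothetical rows extracted from the system of record.
sor_rows = [("a", "b"), ("b", "c"), ("d", "e")]


def etl(rows):
    """Transform extracted rows into an adjacency structure for offline analysis."""
    graph = defaultdict(set)
    for src, dst in rows:
        graph[src].add(dst)
        graph[dst].add(src)  # treated as undirected for this analysis
    return graph


def connected_components(graph):
    """A global algorithm: it has to visit every node of the graph."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


print(connected_components(etl(sor_rows)))
```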

WHY USE GRAPH DATABASES

‣ Performance

‣ In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph

‣ Flexibility

‣ Structure and schema can emerge with our growing understanding of the problem space

‣ Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality

‣ Semantic lifting and expansion are naturally implemented on graphs

‣ Integration with heterogeneous sources is also more natural in graph databases

WHY USE GRAPH DATABASES

‣ Agility

‣ Governance is typically applied in a programmatic fashion, using tests to drive out the data model and queries, as well as assert the business rules that depend upon the graph

RELATIONAL DATABASES LACK RELATIONSHIPS

‣ Join tables add accidental complexity; they mix business data with foreign key metadata

‣ Foreign key constraints add additional development and maintenance overhead

‣ Sparse tables with nullable columns require special checking in code

‣ Several expensive joins are often needed

‣ Reciprocal queries are even more costly

RELATIONAL DATABASES LACK RELATIONSHIPS

‣ Relational databases struggle with highly connected domains

‣ To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain

SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.FriendID = p1.ID
JOIN Person p2
  ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'

RELATIONAL DATABASES LACK RELATIONSHIPS


SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
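The two queries above can be tried against an in-memory SQLite database. The table layout follows the column names used in the queries; the sample rows are hypothetical, and the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
    CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach');
    -- Hypothetical rows: Bob lists Alice and Zach as friends; Alice lists Bob.
    INSERT INTO PersonFriend VALUES (2, 1), (2, 3), (1, 2);
""")

# Who are Bob's friends? (join on FriendID first)
bobs_friends_sql = """
    SELECT p1.Person
    FROM Person p1
    JOIN PersonFriend ON PersonFriend.FriendID = p1.ID
    JOIN Person p2 ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'
    ORDER BY p1.Person
"""

# Who is friends with Bob? (the reciprocal query: join on PersonID first)
friends_with_bob_sql = """
    SELECT p1.Person
    FROM Person p1
    JOIN PersonFriend ON PersonFriend.PersonID = p1.ID
    JOIN Person p2 ON PersonFriend.FriendID = p2.ID
    WHERE p2.Person = 'Bob'
    ORDER BY p1.Person
"""

bobs_friends = [row[0] for row in conn.execute(bobs_friends_sql)]
friends_with_bob = [row[0] for row in conn.execute(friends_with_bob_sql)]
print(bobs_friends, friends_with_bob)
```

Each additional level of friendship (friends of friends, and so on) adds another pair of joins to the statement, which is where the cost grows.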

NOSQL DATABASES ALSO LACK RELATIONSHIPS

‣ Seeing a reference to order: 1234 in the record beginning user: Alice, we infer a connection between user: Alice and order: 1234. This gives us false hope that we can use keys and values to manage graphs

‣ Because there are no identifiers that “point” backward (the foreign aggregate “links” are not reflexive, of course), we lose the ability to run other interesting queries on the database

‣ Aggregate stores do not maintain consistency of connected data, nor do they support what is known as index-free adjacency

‣ Aggregate stores must employ inherently latent methods for creating and querying relationships outside the data model
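A minimal sketch of the problem, using a plain dictionary as a stand-in for an aggregate store. The keys, record layout, and helper names are hypothetical; the point is only that the forward lookup follows embedded keys, while the backward lookup has to scan every record.

```python
# Hypothetical aggregate store: each value is an opaque record keyed by an id.
store = {
    "user:Alice": {"name": "Alice", "orders": ["order:1234", "order:5678"]},
    "user:Bob": {"name": "Bob", "orders": ["order:9999"]},
    "order:1234": {"total": 30},
    "order:5678": {"total": 12},
    "order:9999": {"total": 7},
}


def orders_of(user_key):
    # Forward direction: follow the keys embedded in the user record.
    return store[user_key]["orders"]


def buyers_of(order_key):
    # Backward direction: nothing points from an order to its user,
    # so every record in the store has to be inspected.
    return [key for key, record in store.items()
            if order_key in record.get("orders", [])]


print(orders_of("user:Alice"))
print(buyers_of("order:1234"))
```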

PERFORMANCE

‣ Graph databases are designed to traverse graphs, so they perform well when querying interconnected domains


QUERYING GRAPHS

‣ Cypher is an expressive (yet compact) graph database query language

‣ Other graph databases have other means of querying data. Many, including Neo4j, support the RDF query language SPARQL and the imperative, path-based query language Gremlin

(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

QUERYING GRAPHS

(emil:Person {name:'Emil'})
  <-[:KNOWS]-(jim:Person {name:'Jim'})
  -[:KNOWS]->(ian:Person {name:'Ian'})
  -[:KNOWS]->(emil)

QUERYING GRAPHS

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
RETURN b, c

QUERYING GRAPHS

MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c
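To make the meaning of the pattern concrete, here is a plain-Python sketch that evaluates the same triangle pattern over a small adjacency dictionary. It mirrors the Jim, Ian, Emil example from the earlier slides; the dictionary layout and the function name are assumptions, and a real Cypher engine applies additional matching rules not modelled here.

```python
# KNOWS relationships from the earlier pattern:
# Jim knows Ian and Emil, Ian knows Emil.
knows = {
    "Jim": {"Ian", "Emil"},
    "Ian": {"Emil"},
    "Emil": set(),
}


def match_triangle(a):
    # (a)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
    return sorted(
        (b, c)
        for b in knows.get(a, ())
        for c in knows.get(b, ())
        if c in knows.get(a, ())
    )


print(match_triangle("Jim"))
```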

QUERYING GRAPHS

‣ Cypher Clauses

‣ WHERE: Provides criteria for filtering pattern matching results.

‣ CREATE and CREATE UNIQUE: Create nodes and relationships.

‣ MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.

‣ DELETE: Removes nodes and relationships (properties and labels are removed with REMOVE).

‣ SET: Sets property values.

‣ FOREACH: Performs an updating action for each element in a list.

‣ UNION: Merges results from two or more queries.

‣ WITH: Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.
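The difference between CREATE and MERGE is easy to miss, so here is a toy Python sketch of the two behaviours. It is not Neo4j's API: the list-based store and both function names are hypothetical, and real MERGE matches whole patterns rather than single nodes.

```python
nodes = []  # toy in-memory "graph": a list of node records


def create(label, **props):
    # CREATE-like: always adds a new node, even if an identical one exists.
    node = {"label": label, "props": props}
    nodes.append(node)
    return node


def merge(label, **props):
    # MERGE-like: reuse a node matching the label and all supplied properties...
    for node in nodes:
        if node["label"] == label and all(
            node["props"].get(key) == value for key, value in props.items()
        ):
            return node
    # ...otherwise create it.
    return create(label, **props)


first = merge("Person", name="Jim")    # nothing matches yet, so a node is created
second = merge("Person", name="Jim")   # reuses the node created above
create("Person", name="Jim")           # adds a duplicate regardless
print(len(nodes))
```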

INCREMENTAL MODELING

‣ Graph databases provide for the smooth evolution of a data model

‣ We develop the data model feature by feature, user story by user story


INCREMENTAL MODELING

‣ If we need to find all the events that have occurred over a specific period, we can build a timeline tree
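A timeline tree can be sketched as nested year, month, and day levels with events hanging off the day nodes. The nested-dictionary layout, the function names, and the sample events below are assumptions for illustration; in a graph database each level would be a node connected by relationships.

```python
from datetime import date

timeline = {}  # year -> month -> day -> list of events


def add_event(day, event):
    # Create the year/month/day levels on demand, then attach the event.
    days = timeline.setdefault(day.year, {}).setdefault(day.month, {})
    days.setdefault(day.day, []).append(event)


def events_between(start, end):
    found = []
    for year in sorted(timeline):
        if not start.year <= year <= end.year:
            continue  # prune whole years outside the period
        for month in sorted(timeline[year]):
            for day in sorted(timeline[year][month]):
                if start <= date(year, month, day) <= end:
                    found.extend(timeline[year][month][day])
    return found


# Hypothetical events.
add_event(date(2020, 1, 15), "order placed")
add_event(date(2020, 2, 3), "order shipped")
add_event(date(2021, 6, 1), "order returned")

print(events_between(date(2020, 1, 1), date(2020, 12, 31)))
```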

INCREMENTAL MODELING

‣ The carousel fraud

QUERYING GRAPHS

‣ POLE MODEL

‣ The POLE data model focuses on four basic types of entities and the relationships between them: Persons, Objects, Locations, and Events

Greater Manchester, UK from August 2017

INTEGRATION WITH ONTOLOGIES

‣ An ontology is a formal, explicit specification of a shared conceptualization that is characterized by high semantic expressiveness required for increased complexity (Feilmayr and Wöß, 2016)

‣ Ontologies are typically represented as graphs

‣ Web Ontology Language (OWL) is typically represented using RDF triples

‣ Ontologies contain inference rules that can be applied to a knowledge base

INTEGRATION WITH ONTOLOGIES

‣ Taking an example from the LUBM benchmark (Lehigh University Benchmark), a student is derived to be an attendee if he or she takes some course

‣ That is, when he or she matches the following ontological rule: Student and (takesCourse some) SubClassOf Attendee

‣ An experienced Neo4j programmer may rub his or her hands, since this rule translates straightforwardly into the following Cypher expression:

match (x:Student)-[:takesCourse]->() set x:Attendee

‣ That is perfectly possible but could become cumbersome in case of deeply nested rules that may also depend on each other

‣ For instance, the Cypher expression misses the subclasses of Student, such as UndergraduateStudent. Strictly speaking, the expression above should therefore read:

match (x)-[:takesCourse]->() where x:Student or x:UndergraduateStudent set x:Attendee
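One way to avoid listing every subclass by hand is to walk the class hierarchy when applying the rule. The following Python sketch shows the idea under stated assumptions: the hierarchy, the individuals, and the function names are hypothetical, and it is not how an OWL reasoner or Neo4j implements inference.

```python
# Hypothetical class hierarchy, loosely modelled on LUBM.
subclass_of = {"UndergraduateStudent": "Student", "GraduateStudent": "Student"}

# Hypothetical individuals: their labels and the courses they take.
labels = {"ann": {"UndergraduateStudent"}, "bob": {"Student"}, "cat": {"Professor"}}
takes_course = {"ann": ["Databases"], "bob": [], "cat": ["Pottery"]}


def is_a(label, target):
    # Walk up the hierarchy until the target class is found or the chain ends.
    while label is not None:
        if label == target:
            return True
        label = subclass_of.get(label)
    return False


def infer_attendees():
    # Student and (takesCourse some) SubClassOf Attendee
    for individual, own_labels in labels.items():
        is_student = any(is_a(label, "Student") for label in own_labels)
        if is_student and takes_course.get(individual):
            own_labels.add("Attendee")


infer_attendees()
print(sorted(name for name, own in labels.items() if "Attendee" in own))
```

Here ann is labelled Attendee through her UndergraduateStudent label, bob is not because he takes no course, and cat is not because Professor is not a subclass of Student.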
