introduction to graph database - cse.hcmut.edu.vn

Introduction to Graph Database
Course: Data Engineering
Teacher: Assoc. Prof. Dr. Dang Tran Khanh


Post on 17-Jan-2022


Group 13

● 2070682 - Phạm Nguyễn Nhật Minh
● 2170088 - Đặng Ngô Nhật Trường
● 2170409 - Lê Dương Khoa
● 2070401 - Nguyễn Việt Anh
● 1870387 - Lê Văn Duẫn


Agenda

1. Graph database introduction

2. Graph database case study

3. Research results

4. Conclusion & Summary


Graph database introduction


Why are graphs important?

● Modeling chemical and biological data
● Social networks
● The web
● Hierarchical data


What is a graph database?

● A database built on top of a graph data structure.
● A collection of vertices (nodes) and edges.
● Properties:
    ❏ Each node and edge is uniquely identified.
    ❏ Each node has a set of incoming and outgoing edges.
    ❏ Each node and edge has a collection of properties.
    ❏ Each edge has a label that defines the relationship between its two nodes.
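
The four properties above can be sketched as a small in-memory data structure. This is only an illustration in Python, not how any particular graph database is implemented; the class names (Node, Edge, PropertyGraph) and the sample data are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Edge:
    edge_id: str          # unique identifier of the edge
    label: str            # names the relationship between the two nodes
    source_id: str        # node the edge leaves
    target_id: str        # node the edge enters
    properties: dict = field(default_factory=dict)


@dataclass(eq=False)
class Node:
    node_id: str          # unique identifier of the node
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)   # edges leaving this node
    incoming: list = field(default_factory=list)   # edges entering this node


class PropertyGraph:
    """Hypothetical minimal property graph, for illustration only."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node_id, **properties):
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def add_edge(self, edge_id, label, source_id, target_id, **properties):
        if edge_id in self.edges:
            raise ValueError(f"duplicate edge id: {edge_id}")
        edge = Edge(edge_id, label, source_id, target_id, dict(properties))
        # Register the edge on both endpoints so each node knows
        # its incoming and outgoing edges.
        self.nodes[source_id].outgoing.append(edge)
        self.nodes[target_id].incoming.append(edge)
        self.edges[edge_id] = edge
        return edge


# Sample data (invented for this sketch).
graph = PropertyGraph()
graph.add_node("n1", name="Alice")
graph.add_node("n2", name="Bob")
graph.add_edge("e1", "FRIEND_OF", "n1", "n2", since=2020)
```

Here graph.nodes["n1"].outgoing and graph.nodes["n2"].incoming both hold the same FRIEND_OF edge, which is what lets a traversal move from a node to its neighbors without searching the whole edge set.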


Graph queries

● List nodes/edges that have this property
● List matching subgraphs
● Can these two nodes reach each other?
● How many hops does it take for two nodes to connect?
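
Three of these query types (property lookup, reachability, hop count) can be sketched with a breadth-first search over adjacency lists. The toy graph and the function names below are assumptions for illustration; subgraph matching is left out because it needs considerably more machinery.

```python
from collections import deque

# Hypothetical toy graph: node properties plus directed adjacency lists.
node_props = {
    "alice": {"city": "Hanoi"},
    "bob": {"city": "Hue"},
    "carol": {"city": "Hanoi"},
    "dave": {"city": "Hue"},
}
adjacency = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
    "dave": [],
}


def nodes_with_property(key, value):
    """List nodes whose property `key` equals `value`."""
    return sorted(n for n, props in node_props.items() if props.get(key) == value)


def hop_count(start, goal):
    """Return the minimum number of hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def can_reach(start, goal):
    """True if goal can be reached from start by following edge directions."""
    return hop_count(start, goal) is not None
```

Because the edges in this sketch are directed, hop_count("alice", "carol") finds a two-hop path while hop_count("carol", "alice") finds none.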


Graph database versus Relational database

● Pros:
    ○ Schema flexibility
    ○ More intuitive querying
    ○ Avoids “join bombs”
    ○ The cost of a local hop does not depend on the total number of nodes

● Cons:
    ○ Not always advantageous
    ○ Query languages are not standardized
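
The point about local hops can be illustrated by counting how many entries are inspected to find one node's neighbors. The sketch below compares a full scan of an edge table with a direct adjacency lookup. Both functions and the ring-shaped test graph are assumptions for illustration; an unindexed scan is the relational worst case, and a real relational database with an index on the source column would do much better than this.

```python
def neighbors_by_scan(edge_rows, node):
    """Table-style lookup without an index: inspect every (src, dst) row."""
    inspected = 0
    found = []
    for src, dst in edge_rows:
        inspected += 1
        if src == node:
            found.append(dst)
    return found, inspected


def neighbors_by_adjacency(adjacency, node):
    """Graph-style lookup: read only the edges attached to the node."""
    found = adjacency.get(node, [])
    return list(found), len(found)


def build_ring(n):
    """Hypothetical test graph: node i points to i+1 and i+2 (mod n)."""
    edge_rows = [(i, (i + k) % n) for i in range(n) for k in (1, 2)]
    adjacency = {}
    for src, dst in edge_rows:
        adjacency.setdefault(src, []).append(dst)
    return edge_rows, adjacency


for size in (10, 1000):
    rows, adj = build_ring(size)
    _, scanned = neighbors_by_scan(rows, 0)
    _, hopped = neighbors_by_adjacency(adj, 0)
    print(f"{size} nodes: scan inspected {scanned} rows, adjacency inspected {hopped}")
```

The scan cost grows with the size of the graph, while the adjacency lookup inspects only the two edges attached to the node in both cases.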


When do we need a graph database?

● To solve many-to-many relationship problems (e.g., friends on Facebook).

● When the relationships between data are important.


When do we need a graph database?

● Who are the friends of Alice's friends?

● Who are Alice and Bob’s mutual friends?

● Assuming Alice doesn’t know Bob, who should they go through to get to know each other as quickly as possible?

Graph databases are able to solve these problems directly and quickly.
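
The three questions above can be sketched over a small undirected friendship graph. The names beyond Alice and Bob, the friendships, and the function names are all invented for this example; a graph database would answer the same questions with its own query language rather than hand-written Python.

```python
from collections import deque

# Hypothetical friendship graph (undirected: every friendship appears on both sides).
friends = {
    "Alice": {"Carol", "Dave"},
    "Bob": {"Dave", "Erin"},
    "Carol": {"Alice", "Erin"},
    "Dave": {"Alice", "Bob"},
    "Erin": {"Bob", "Carol"},
}


def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    result = set()
    for friend in friends[person]:
        result |= friends[friend]
    return result - friends[person] - {person}


def mutual_friends(a, b):
    """People who are direct friends of both a and b."""
    return friends[a] & friends[b]


def shortest_path(start, goal):
    """Breadth-first search; returns the chain of introductions, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            path = []
            while person is not None:
                path.append(person)
                person = parents[person]
            return path[::-1]
        for friend in sorted(friends[person]):
            if friend not in parents:
                parents[friend] = person
                queue.append(friend)
    return None
```

In this sample data Alice and Bob are not direct friends, and shortest_path("Alice", "Bob") returns the chain through their one mutual friend, Dave.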


Examples in Neo4j using the Cypher language


Graph database case study

Martin Macak, Matus Stovcik and Barbora Buhnova, “The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020


Problem statement

● It is not clear where the boundary lies between the situations in which each strategy performs better in non-extreme cases.

● Evidence becomes even rarer when the tests run in a cluster with big data.


Purpose

● Find out which strategy performs better across different types of queries

● Discuss the threats to validity of this work


Setup

Database technologies for comparison
● PostgreSQL
● Neo4j

Dataset
● Microsoft Academic Graph [Ref]
● 1.7 billion rows, 14 tables, 13 relationships between tables
● The relationships are spread across a series of distinct queries on the data set


Microsoft Academic Graph (https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)


Queries
● 10 queries, each requiring a specific number of joins or traversals
● 3 types: joins across tables, WHERE conditions, string-containment functions
● Find papers/journals under specific conditions

Cluster
● Three-node cluster
● High availability
● 1 master node and 2 read-only nodes


Research results


Simple joins across multiple tables

- J1: Counts the number of papers presented at a conference.

- J2: Counts the number of papers presented at conference instances.

- J3: Counts the number of journals presented at conference instances.

Target: Determine the threshold of data complexity.
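
A J1-style query can be sketched as a count over a two-table join. The schema, table names, and rows below are assumptions made for illustration and are not the Microsoft Academic Graph schema or the benchmark's actual query; Python's built-in sqlite3 module stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE conferences (
        id INTEGER PRIMARY KEY,
        short_name TEXT
    );
    CREATE TABLE papers (
        id INTEGER PRIMARY KEY,
        title TEXT,
        conference_id INTEGER REFERENCES conferences(id)
    );
    INSERT INTO conferences VALUES (1, 'CONF-A'), (2, 'CONF-B');
    INSERT INTO papers VALUES
        (10, 'Paper one', 1),
        (11, 'Paper two', 1),
        (12, 'Paper three', 2),
        (13, 'Paper without a conference', NULL);
    """
)


def count_papers_at_conferences(connection):
    """Count the papers that join to some conference row."""
    row = connection.execute(
        """
        SELECT COUNT(*)
        FROM papers AS p
        JOIN conferences AS c ON p.conference_id = c.id
        """
    ).fetchone()
    return row[0]


print(count_papers_at_conferences(conn))
```

In a graph model the same count would follow stored paper-to-conference relationships instead of matching key columns at query time.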


[Figure panels: Relational database, Graph database]

Simple joins across multiple tables

- In J1, Neo4j handled counting over the join between the enormous Papers and Conferences tables better, thanks to its pre-made relationships.

- In J2 and J3, PostgreSQL performed better than Neo4j because its join optimizations let it use only a subset of rows rather than every row in both joined tables.

- PostgreSQL achieved better results in the more complex joins.


Join queries with condition across multiple tables

- W1: Counts the number of papers presented at conferences with a specified short name.

- W2: Counts the number of papers, linked through a conference with a specified short name, that were presented at conference instances.

- W3: Counts the number of journals, linked through a conference with a specified short name, that were presented at conference instances.

- W4: Counts the number of papers with a specified original paper title presented at conferences.


[Figure panels: Relational database, Graph database]

Join queries with simple condition across tables

- The W4 query is similar to W1 but differs in the direction of traversal. The direction of a relationship (node A -> node B or vice versa) therefore has a huge impact on performance.

- A graph database can store a relationship in both directions for better performance, at the cost of extra disk space.
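
The trade-off between traversal direction and storage can be sketched with a toy edge index. This is a simplified model written for illustration, not a description of how Neo4j stores relationships; the DirectedIndex class and the paper/conference identifiers are assumptions.

```python
class DirectedIndex:
    """Toy edge store: always indexed by source, optionally also by target."""

    def __init__(self, store_reverse=False):
        self.forward = {}
        self.reverse = {} if store_reverse else None

    def add(self, src, dst):
        self.forward.setdefault(src, []).append(dst)
        if self.reverse is not None:
            self.reverse.setdefault(dst, []).append(src)

    def stored_entries(self):
        """Number of index entries kept, as a rough proxy for disk space."""
        total = sum(len(targets) for targets in self.forward.values())
        if self.reverse is not None:
            total += sum(len(sources) for sources in self.reverse.values())
        return total

    def sources_of(self, dst):
        """Return (sources, entries_inspected) for the edges pointing at dst."""
        if self.reverse is not None:
            hits = self.reverse.get(dst, [])
            return sorted(hits), len(hits)
        # Without a reverse index, every forward entry must be checked.
        inspected = 0
        hits = []
        for src, targets in self.forward.items():
            for target in targets:
                inspected += 1
                if target == dst:
                    hits.append(src)
        return sorted(hits), inspected


# Hypothetical data: each paper points to the conference it was presented at.
edges = [("p1", "confA"), ("p2", "confA"), ("p3", "confB")]
one_way = DirectedIndex(store_reverse=False)
two_way = DirectedIndex(store_reverse=True)
for src, dst in edges:
    one_way.add(src, dst)
    two_way.add(src, dst)
```

Starting from a conference and asking for its papers goes against the stored direction: the one-way index inspects every entry, while the two-way index inspects only the matching ones but keeps twice as many entries.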


Summary & Conclusion


Summary

● Introduced graph databases

● Compared graph databases with relational databases

● Introduced some use cases

● Reviewed a paper benchmarking Neo4j against PostgreSQL


Conclusion

● Graph theory is the foundation

● GraphDB

● Neo4j

● Cloud infra: Microsoft Azure Cosmos DB, Amazon Neptune


References

● Microsoft Academic Graph [Link1] [Link2]

● Martin Macak, Matus Stovcik and Barbora Buhnova, “The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020 [Link]

● Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june (Paul) Hsu, Kuansan Wang, “An Overview of Microsoft Academic Service (MAS) and Applications”


THANK YOU SO MUCH
