Graph Database with Cassandra

Post on 07-Apr-2017

Category: Technology

TRANSCRIPT

Proprietary and Confidential / © The Nerdery, LLC 2

Graph Databases

Brandon Veber, Chad Dvoracek

Agenda

• Introduction to graph databases
  – What they are
  – Why to use them

• Titan technology stack
  – NoSQL distributed scalable data storage
  – Spark in-memory distributed computing

• Graph queries and analytics

Introduction to Graph

What is a Graph Database?

Graph databases use graph structures such as nodes and edges to store data and relationships.

Entities are modelled as nodes and the relationships between them are modelled as edges.

Blue, J. Driving Insights with Network Graphs. Retrieved from www.mapr.com/blog/driving-insights-network-graphs
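The node-and-edge model described above can be sketched in a few lines of Python. This is only an illustration, not how any particular graph database stores data; the `Node` and `Edge` classes are hypothetical, and the sample labels are borrowed from the example model later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity, with a label and arbitrary properties."""
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship between two nodes, also with properties."""
    label: str
    source: int  # node_id the edge leaves
    target: int  # node_id the edge enters
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one customer who made one transaction.
nodes = {
    1: Node(1, "customer", {"name": "Customer 1"}),
    2: Node(2, "transaction", {"tx_id": 1}),
}
edges = [Edge("purchases", source=1, target=2)]

for e in edges:
    print(nodes[e.source].label, f"-[{e.label}]->", nodes[e.target].label)
```

The relationship is stored as data in its own right rather than being implied by matching key columns.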

How is it different from RDBMS?

● Relational databases prioritize the table

● Relationships are expressed ad hoc through foreign-key (FK) constraints

● Querying through complex relationships requires several costly joins

Graph DB vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

How is it different from RDBMS?

● Nodes contain entities and their corresponding properties

● Relationships are given top priority

● Pointers instead of index look-ups

Graph Database. Accessed from https://en.wikipedia.org/wiki/Graph_database
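A rough way to picture "pointers instead of index look-ups" is sketched below. It is a simplification under stated assumptions: the table contents and the `Vertex` class are hypothetical, and real engines differ in how they physically store adjacency.

```python
# Relational style: relationships live in a separate table and are resolved
# by matching foreign keys (a scan here; an index look-up in a real RDBMS).
purchases = [(1, 101), (1, 102), (2, 103)]  # (customer_id, transaction_id)


def transactions_via_join(customer_id):
    return [tx for (cid, tx) in purchases if cid == customer_id]


# Graph style: each vertex keeps direct references to its neighbours,
# so following a relationship does not consult a shared table.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to adjacent vertices


c1 = Vertex("Customer 1")
for tx_name in ("tx101", "tx102"):
    c1.out.append(Vertex(tx_name))


def transactions_via_pointers(vertex):
    return [v.name for v in vertex.out]


print(transactions_via_join(1))
print(transactions_via_pointers(c1))
```

In the first form the cost of each hop depends on the size of the link table or its index; in the second it depends only on the number of edges attached to the starting vertex.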

How is it different from RDBMS?

● Inherently NoSQL

● Scalable

● High availability

● Data model is intuitive and agile.

Graph vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

When to use Graph DB

When not to use Graph DB

● Data warehousing

● Schema-oriented design

● Aggregates on sets

● Robust transactional processing

When to use Graph DB

● Graph databases work well with highly interconnected data with complex relationships

● Some use cases include:
  ○ Social networks
  ○ Route planning
  ○ Master data management
  ○ Recommendation engines

AWS Master Data Management Model. Accessed from http://neo4j.com/graphgist/8526106

Successful Use Cases

Successful Use Cases - HealthUnlocked

Goal: Redesign system to manage performance issues associated with increasing data volume

Methods:

● Graph database to store relationships between symptoms, conditions and treatments

● Language processing to build multilingual ontology into the database

Result:

● Improved query performance

● Easier data model for pattern matching

● Two months to launch

Open Source Graph Framework

● The Apache TinkerPop project provides an open source, vendor-agnostic framework for graph construction, query and analysis.

● Changing between graph engines and back-end storage technologies is possible without significant refactoring

● Supports graph databases (OLTP) and graph analytics (OLAP)

Titan

Technology Stack - Storage

● Supports several distributed NoSQL databases

● Support for ACID transactions or eventual consistency, depending on the storage backend

● Linearly scalable

Technology Stack - Analytics & ETL

Titan offers support for several analytics and batch loading technologies.

Technology Stack - Search + Framework

Titan supports the following search technologies:

• Elasticsearch
• Lucene
• Solr

Titan also integrates natively with Apache TinkerPop.

Apache Cassandra

● Wide-column (partitioned key-value) store

● Exceptional fault tolerance

● Scalable

● Denormalized tables

Apache Spark

● Resilient distributed datasets

● In-memory cluster computing

● Scalable

● Up to 100x faster than Hadoop MapReduce for in-memory workloads, according to the Spark project

● Native Cassandra connector

DataStax Graph

● Designed for cloud applications

● Multi-model capable

● Enterprise support

● Scalable

Queries and Analytics

Example Model

● Edges can contain values and properties as well

● The ‘Includes’ edge will contain a quantity property

Simple Traversal

Traversal Example

Question: What items were purchased in 'Transaction 1'?

g.V().hasLabel('transaction').has('tx_id',1).out('includes').values('name')

Output

● Pop

● Gum

● Bread
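To make the steps of this traversal concrete, here is a plain-Python sketch of the same logic over an in-memory edge list. The vertex ids and the `out` helper are hypothetical, and the toy graph is an assumption that contains only what the slide's output implies.

```python
# Assumed toy graph; only the labels, property names and item names
# come from the slides.
vertices = {
    "t1": {"label": "transaction", "tx_id": 1},
    "i1": {"label": "item", "name": "Pop"},
    "i2": {"label": "item", "name": "Gum"},
    "i3": {"label": "item", "name": "Bread"},
}
# (source vertex, edge label, target vertex)
edges = [
    ("t1", "includes", "i1"),
    ("t1", "includes", "i2"),
    ("t1", "includes", "i3"),
]


def out(vertex_ids, edge_label):
    """Follow outgoing edges with the given label, similar to Gremlin's out()."""
    return [dst for (src, lbl, dst) in edges
            if lbl == edge_label and src in vertex_ids]


# hasLabel('transaction').has('tx_id', 1)
start = [vid for vid, v in vertices.items()
         if v["label"] == "transaction" and v.get("tx_id") == 1]

# out('includes').values('name')
items = [vertices[vid]["name"] for vid in out(start, "includes")]
print(items)
```

Each Gremlin step maps to one filter or one hop: select the starting vertices, follow the labelled edges, then read a property from the vertices reached.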

Traversal Example

Question: What customers have shopped at 'Store 1'?

g.V().hasLabel('store').has('store_id',1).out('processes').in('purchases').values('name')

Output

● Customer 1
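This query takes two hops in opposite directions: out along 'processes' to the transactions, then backwards along 'purchases' to the customers. A minimal Python sketch of that pattern follows; the vertex ids and the `out` and `in_` helpers are hypothetical, and the toy data is assumed.

```python
# Assumed toy graph: Store 1 processed one transaction made by Customer 1.
vertices = {
    "s1": {"label": "store", "store_id": 1},
    "t1": {"label": "transaction", "tx_id": 1},
    "c1": {"label": "customer", "name": "Customer 1"},
}
# (source vertex, edge label, target vertex)
edges = [("s1", "processes", "t1"), ("c1", "purchases", "t1")]


def out(ids, label):
    """Follow edges from source to target (Gremlin's out())."""
    return [dst for (src, lbl, dst) in edges if lbl == label and src in ids]


def in_(ids, label):
    """Follow edges from target back to source (Gremlin's in())."""
    return [src for (src, lbl, dst) in edges if lbl == label and dst in ids]


stores = [vid for vid, v in vertices.items()
          if v["label"] == "store" and v.get("store_id") == 1]
customers = [vertices[c]["name"]
             for c in in_(out(stores, "processes"), "purchases")]
print(customers)
```

With more data, a customer with several transactions at the same store would appear once per transaction; in Gremlin a `dedup()` step removes such repeats.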

Branching Traversal

Question: Across all transactions in which 'Pop' was purchased, what was the average quantity?

g.V().has('name','Pop').inE('includes').values('quantity').mean()

Output

● 1.5
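The traversal reads the 'quantity' property from the edges entering the 'Pop' vertex and averages it. A small Python sketch of that calculation is shown here; the individual quantities are assumed values chosen to reproduce the 1.5 on the slide, since the underlying data is not given.

```python
from statistics import mean

# (transaction, edge label, item name, edge properties); quantities are assumed.
edges = [
    ("t1", "includes", "Pop", {"quantity": 1}),
    ("t2", "includes", "Pop", {"quantity": 2}),
    ("t1", "includes", "Gum", {"quantity": 3}),
]

# inE('includes').values('quantity') for the vertex named 'Pop'
pop_quantities = [props["quantity"]
                  for (_, label, item, props) in edges
                  if label == "includes" and item == "Pop"]

average = mean(pop_quantities)
print(average)  # 1.5
```

The quantity lives on the edge because it describes one item within one transaction, not the item or the transaction alone.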

Branching Traversal

Question: For each item, what is the average quantity purchased per transaction?

g.V().hasLabel('item').local(inE('includes').values('quantity').mean())

Output

● 1.5

● 2.5

● 1
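The `local()` step is what turns one overall average into one average per item. The Python sketch below contrasts the two; the quantities, and the pairing of items with the three output values, are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# (transaction, item name, quantity on the 'includes' edge); values are assumed.
edges = [
    ("t1", "Pop", 1), ("t2", "Pop", 2),
    ("t1", "Gum", 2), ("t2", "Gum", 3),
    ("t1", "Bread", 1),
]

# Group edge quantities by the item they point to, mirroring local():
# the mean is computed separately for each item vertex.
by_item = defaultdict(list)
for _, item, qty in edges:
    by_item[item].append(qty)
per_item_mean = {item: mean(qtys) for item, qtys in by_item.items()}

# Without local(), every quantity would flow into a single mean.
overall_mean = mean(qty for _, _, qty in edges)

print(per_item_mean)
print(overall_mean)
```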

More Traversal Strategies

● Recursive

● Path

● Projecting

● Declarative

Graph Analytics - Network Properties

● Node count - Total number of nodes

● Edge count - Total number of edges

● Diameter - Maximum length of a shortest path between any two nodes

● Min, max and mean degree - Degree is the number of connections for each node

● Degree distribution - Histogram (shown on next page)
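These properties can be computed directly on a small graph. The sketch below assumes a hypothetical connected, undirected graph held as an adjacency list and uses breadth-first search for the diameter; at scale this kind of computation would run on an OLAP engine instead.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

node_count = len(graph)
# Each undirected edge appears in two adjacency sets, hence the division by 2.
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
mean_degree = sum(degrees.values()) / node_count


def bfs_distances(start):
    """Shortest-path length (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


# Diameter: the longest of all shortest paths (assumes a connected graph).
diameter = max(max(bfs_distances(node).values()) for node in graph)

print(node_count, edge_count, min(degrees.values()),
      max(degrees.values()), mean_degree, diameter)
```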

Graph Analytics - Degree Distribution

The degree of a node represents how many connections it has. A degree distribution is the probability distribution of those degrees in the network.

Most real-world graphs exhibit a heavy-tailed distribution like the GitHub degree distribution shown on the right: many nodes with few connections and a few nodes with many.

Big Graph Data on Hortonworks Data Platform. Accessed from http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/
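As a small worked example of a degree distribution, the sketch below counts how many nodes have each degree and normalizes the counts into probabilities. The five-node graph is hypothetical and far too small to show the shape seen in real networks.

```python
from collections import Counter

# Hypothetical undirected graph as an adjacency list.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

degrees = [len(nbrs) for nbrs in graph.values()]
counts = Counter(degrees)

# Probability that a node picked at random has degree k.
distribution = {k: counts[k] / len(degrees) for k in sorted(counts)}
print(distribution)  # {1: 0.2, 2: 0.6, 3: 0.2}
```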

Graph Analytics - Network Properties

● Clustering coefficient - Measures how often a node's neighbors are also connected to each other, which helps distinguish structured graphs from random ones

● Centrality - Identifies the most important nodes (e.g. PageRank)

● Community detection - Identifies groups of nodes that are more densely connected among themselves than to the other nodes in the graph
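Two of these measures are easy to sketch on a toy graph. The code below computes the local clustering coefficient and a basic PageRank by power iteration; the graph, the 0.85 damping factor and the iteration count are assumptions for illustration, and production systems would use a library or OLAP engine instead.

```python
from itertools import combinations

# Hypothetical undirected graph as an adjacency list.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}


def local_clustering(node):
    """Fraction of pairs of a node's neighbours that are themselves connected."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbours
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))


def pagerank(damping=0.85, iterations=50):
    """Power iteration; each node shares its rank equally among its neighbours."""
    n = len(graph)
    rank = {node: 1 / n for node in graph}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            for node in graph
        }
    return rank


clustering = {node: local_clustering(node) for node in graph}
ranks = pagerank()
print(clustering)
print(max(ranks, key=ranks.get))
```

In this toy graph the triangle A, B, C gives A and B a coefficient of 1.0, while C, the node with the most connections, receives the highest PageRank.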

Questions?

Contact

The Nerdery
info@nerdery.com
(877) 664.6373
