graph database with cassandra

Proprietary and Confidential / © The Nerdery, LLC

Graph Databases

Brandon Veber, Chad Dvoracek

Agenda

• Introduction to graph databases
  – What they are
  – Why to use them

• Titan technology stack
  – NoSQL distributed scalable data storage
  – Spark in-memory distributed computing

• Graph queries and analytics

Introduction to Graph

What is a Graph Database?

Graph databases use graph structures such as nodes and edges to store data and relationships.

Entities are modelled as nodes and the relationships between them are modelled as edges.
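As a rough illustration of this model, here is a plain-Python sketch (not the API of any graph database). The `Node` and `Edge` classes and the sample customer and transaction are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    label: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    label: str
    out_v: int  # id of the node the edge leaves
    in_v: int   # id of the node the edge points to
    properties: dict = field(default_factory=dict)

# Two entities and one relationship between them (made-up sample data).
nodes = {
    1: Node(1, "customer", {"name": "Customer 1"}),
    2: Node(2, "transaction", {"tx_id": 1}),
}
edges = [Edge("purchases", out_v=1, in_v=2)]

# Follow each relationship from its source entity to its target entity.
for e in edges:
    print(nodes[e.out_v].properties["name"], f"-[{e.label}]->", nodes[e.in_v].label)
```

Both nodes and edges carry a label and a free-form property map, which is the shape the later example model relies on.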

Blue, J. Driving Insights with Network Graphs. Retrieved from www.mapr.com/blog/driving-insights-network-graphs

How is it different from RDBMS?

● Relational databases prioritize the table

● Relationships are ad-hoc in the form of FK constraints

● Querying through complex relationships requires several costly joins

Graph DB vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

● Nodes contain entities and their corresponding properties

● Relationships are given top priority

● Pointers instead of index look-ups

Graph Database. Accessed from https://en.wikipedia.org/wiki/Graph_database
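The contrast between the two models can be sketched in plain Python. This is only an analogy under simplifying assumptions: the `friend_rows` table, the `adjacency` dict and both helper functions are hypothetical, and real engines use indexes and storage layouts far more elaborate than this.

```python
# Relational style: relationships live in a separate table of foreign-key
# pairs, so every hop means searching that table again (a join).
friend_rows = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

def friends_of_friends_join(person):
    first_hop = [b for a, b in friend_rows if a == person]
    return [c for f in first_hop for a, c in friend_rows if a == f]

# Graph style: every node holds direct references to its neighbours,
# so every hop just follows those references.
adjacency = {"alice": ["bob"], "bob": ["carol"], "carol": ["dave"], "dave": []}

def friends_of_friends_graph(person):
    return [c for f in adjacency[person] for c in adjacency[f]]

print(friends_of_friends_join("alice"))   # ['carol']
print(friends_of_friends_graph("alice"))  # ['carol']
```

Both functions return the same answer; the difference is that the join version rescans the relationship table at each hop, while the graph version only touches the neighbours of the nodes it visits.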

● Inherently NoSQL

● Scalable

● High availability

● Data model is intuitive and agile.

Graph vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/

When to use Graph DB

When not to use Graph DB

● Data warehousing

● Schema-oriented design

● Aggregates on sets

● Robust transactional processing

When to use Graph DB

● Graph databases work well with highly interconnected data with complex relationships

● Some use cases include:
  ○ Social networks
  ○ Route planning
  ○ Master data management
  ○ Recommendation engines

AWS Master Data Management Model. Accessed from http://neo4j.com/graphgist/8526106

Successful Use Cases

Successful Use Cases - HealthUnlocked

Goal: Redesign system to manage performance issues associated with increasing data volume

Methods:

● Graph database to store relationships between symptoms, conditions and treatments

● Language processing to build multilingual ontology into the database

Result:

● Improved query performance

● Easier data model for pattern matching

● Two months to launch

Open Source Graph Framework

● The Apache TinkerPop project provides an open source, vendor-agnostic framework for graph construction, query and analysis.

● Changing between graph engines and back-end storage technologies is possible without significant refactoring

● Supports graph databases (OLTP) and graph analytics (OLAP)

Technology Stack - Storage

● Supports several distributed NoSQL databases

● Support for ACID transactions

● Linearly scalable

Technology Stack - Analytics & ETL

Titan offers support for several analytics and batch loading technologies.

Technology Stack - Search + Framework

Titan supports the following search technologies:

• Elasticsearch

• Lucene

• Solr

Titan also integrates natively with Apache TinkerPop.

Apache Cassandra

● Key-Value Store

● Exceptional fault tolerance

● Scalable

● Denormalized tables

Apache Spark

● Resilient distributed datasets

● In-memory cluster computing

● Scalable

● Up to 100x faster than Hadoop MapReduce for in-memory workloads, according to the Spark project

● Native Cassandra connector

DataStax Graph

● Designed for cloud applications

● Multi-model capable

● Enterprise support

● Scalable

Queries and Analytics

Example Model

● Edges can contain values and properties as well

● The ‘Includes’ edge will contain a quantity property
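The model diagram is not reproduced in this text, so the following plain-Python sketch fills it in with assumed data. The node ids, the `edges` tuple layout and the quantities are hypothetical; only the labels (`customer`, `store`, `transaction`, `item`, `purchases`, `processes`, `includes`) and the `quantity` property come from the slides.

```python
# Nodes keyed by a hypothetical id; each has a label and its properties.
nodes = {
    "c1": {"label": "customer", "name": "Customer 1"},
    "s1": {"label": "store", "store_id": 1},
    "t1": {"label": "transaction", "tx_id": 1},
    "pop": {"label": "item", "name": "Pop"},
    "gum": {"label": "item", "name": "Gum"},
}

# Edges as (from, label, to, properties); 'includes' edges carry a quantity.
edges = [
    ("c1", "purchases", "t1", {}),
    ("s1", "processes", "t1", {}),
    ("t1", "includes", "pop", {"quantity": 1}),
    ("t1", "includes", "gum", {"quantity": 2}),
]

# Read the property stored on each 'includes' edge.
for src, label, dst, props in edges:
    if label == "includes":
        print(nodes[dst]["name"], props["quantity"])
```

Putting `quantity` on the edge rather than on the item keeps it specific to one transaction, since the same item can appear in many transactions with different quantities.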

Simple Traversal

Traversal Example

Question: What items were purchased in ‘Transaction 1’?

g.V().hasLabel('transaction').has('tx_id', 1).out('includes').values('name')

Output

● Pop

● Gum

● Bread

Traversal Example

Question: What customers have shopped at ‘Store 1’?

g.V().hasLabel('store').has('store_id', 1).out('processes').in('purchases').values('name')

Output

● Customer 1
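To make the hop-by-hop logic of the two traversals above concrete, here is a plain-Python imitation over a tiny edge list. It is not Gremlin: the helpers `out` and `in_` only mimic the direction semantics of the steps with the same names, and the node ids are hypothetical.

```python
# (from, label, to) triples; ids such as "t1" are made up for this sketch.
edges = [
    ("c1", "purchases", "t1"),
    ("s1", "processes", "t1"),
    ("t1", "includes", "pop"),
    ("t1", "includes", "gum"),
    ("t1", "includes", "bread"),
]
names = {"c1": "Customer 1", "pop": "Pop", "gum": "Gum", "bread": "Bread"}

def out(ids, label):
    """Follow outgoing edges with the given label from each id."""
    return [dst for v in ids for src, lbl, dst in edges if src == v and lbl == label]

def in_(ids, label):
    """Follow incoming edges with the given label back to their source."""
    return [src for v in ids for src, lbl, dst in edges if dst == v and lbl == label]

# Transaction 1 -> includes -> items
items_in_tx1 = [names[i] for i in out(["t1"], "includes")]
# Store 1 -> processes -> transactions <- purchases <- customers
customers_at_store1 = [names[c] for c in in_(out(["s1"], "processes"), "purchases")]

print(items_in_tx1)         # ['Pop', 'Gum', 'Bread']
print(customers_at_store1)  # ['Customer 1']
```

The second query changes direction halfway: it leaves the store along `processes`, then walks `purchases` edges backwards from the transaction to the customer.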

Branching Traversal

Question: Of all transactions when ‘Pop’ was purchased what was the average quantity?

g.V().has('name','Pop').inE('includes').values('quantity').mean()

Output

● 1.5

Branching Traversal

Question: What is the average quantity of all items sold when purchased in a transaction?

g.V().hasLabel('item').local(inE('includes').values('quantity').mean())

Output

● 1.5

● 2.5

● 1
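The difference between the two branching traversals above is the scope of the aggregate: one mean over the edges of a single item versus one mean per item. A plain-Python sketch of that distinction follows. The quantities are assumed values, chosen only so that they reproduce the outputs shown on the slides; the underlying data set is not given in the deck.

```python
from statistics import mean

# (transaction, item, quantity) for each 'includes' edge; values are assumed.
includes = [
    ("t1", "Pop", 1), ("t1", "Gum", 2), ("t1", "Bread", 1),
    ("t2", "Pop", 2), ("t2", "Gum", 3),
]

# One mean over the edges pointing at a single item (first query).
pop_mean = mean(q for _, item, q in includes if item == "Pop")

# One mean per item, similar to scoping the aggregate with local().
per_item = {}
for _, item, q in includes:
    per_item.setdefault(item, []).append(q)
per_item_mean = {item: mean(qs) for item, qs in per_item.items()}

print(pop_mean)       # 1.5
print(per_item_mean)  # {'Pop': 1.5, 'Gum': 2.5, 'Bread': 1}
```

Without the per-item grouping, all quantities would be pooled into a single average across every item.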

More Traversal Strategies

● Recursive

● Path

● Projecting

● Declarative

Graph Analytics - Network Properties

● Node count - Total number of nodes

● Edge count - Total number of edges

● Diameter - Maximum length of a shortest path between any two nodes

● Min, max and mean degree - Degree is the number of connections for each node

● Degree distribution - Histogram (shown on the next slide)
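These properties can be computed directly on a small graph. The sketch below uses plain Python on a made-up, undirected, connected five-node graph; the `graph` dict and the `bfs_distances` helper are hypothetical names, and a production setup would run such metrics through an analytics engine rather than in a single process.

```python
from collections import deque

# Small undirected example graph as an adjacency dict (made-up data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

node_count = len(graph)
# Each undirected edge appears in two adjacency sets, so halve the total.
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
degrees = [len(nbrs) for nbrs in graph.values()]

def bfs_distances(start):
    """Shortest hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Diameter: the longest of all shortest paths (graph assumed connected).
diameter = max(max(bfs_distances(v).values()) for v in graph)

print(node_count, edge_count, diameter)                         # 5 5 3
print(min(degrees), max(degrees), sum(degrees) / len(degrees))  # 1 3 2.0
```

Running a breadth-first search from every node is fine at this size but grows quickly with graph size, which is one reason these metrics are usually treated as batch analytics.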

Graph Analytics - Degree Distribution

The degree of a node represents how many connections it has. A degree distribution is the probability distribution of those degrees in the network.
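As a minimal sketch of that definition, the following plain Python counts how many nodes have each degree and divides by the number of nodes. The five-node `graph` is made-up data for illustration only.

```python
from collections import Counter

# Small undirected example graph as an adjacency dict (made-up data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

degrees = [len(nbrs) for nbrs in graph.values()]
counts = Counter(degrees)

# Probability that a randomly chosen node has degree k.
distribution = {k: counts[k] / len(degrees) for k in sorted(counts)}

print(distribution)  # {1: 0.2, 2: 0.6, 3: 0.2}
```

Plotting `distribution` as a bar chart gives the histogram form of the degree distribution.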

Most graphs exhibit a degree distribution like that of the GitHub network shown on the right.

BIG GRAPH DATA ON HORTONWORKS DATA PLATFORM, Accessed at http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/

Graph Analytics - Network Properties

● Clustering coefficients - Measure how densely a node's neighbors are connected to each other, which indicates how far the graph is from randomly connected

● Centrality - Identify the most important nodes (e.g. PageRank)

● Community detection - Identify groups of nodes that are more densely connected among themselves than with the other nodes in the graph
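Two of these measures are small enough to sketch in plain Python: the local clustering coefficient and a basic power-iteration PageRank. This is an illustrative sketch on made-up data, not a tuned implementation. The `graph` dict, the function names, the damping factor of 0.85 and the iteration count are assumptions, and community detection is left out.

```python
from itertools import combinations

# Small undirected example graph as an adjacency dict (made-up data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

def clustering(v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = graph[v]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)

def pagerank(damping=0.85, iterations=50):
    """Power iteration; each node splits its rank evenly among its neighbours."""
    n = len(graph)
    rank = {v: 1 / n for v in graph}
    for _ in range(iterations):
        rank = {
            v: (1 - damping) / n
            + damping * sum(rank[w] / len(graph[w]) for w in graph[v])
            for v in graph
        }
    return rank

print({v: round(clustering(v), 2) for v in graph})
# {'a': 1.0, 'b': 1.0, 'c': 0.33, 'd': 0.0, 'e': 0.0}

ranks = pagerank()
print(max(ranks, key=ranks.get))  # c
```

Node `c` ranks highest because it has the most connections and sits between the tight `a`-`b`-`c` triangle and the `d`-`e` chain.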

Questions?

Contact

The Nerdery
info@nerdery.com
(877) 664.6373
