Graph Analytics - Titan and Cassandra @ NJ Data Science Meetup

Upload: irieksts

Post on 28-Jul-2015

TRANSCRIPT

Titan

By Isaac Rieksts @IsaacRieksts

These thoughts are my own and do not represent the company

Outline

Graph Database overview

TinkerPop

Titan

Graph Queries

Our use of Titan

Demo

Graph Databases

Person

Id  Name
1   Bob
2   Tom
3   Joe

Crosswalk

Person  Knows
1       2
1       3

Graph

Bob --Knows--> Tom
Bob --Knows--> Joe
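As a rough illustration of the slide above, the same three people can be held either as relational rows plus a crosswalk table, or as vertices that carry their own edges. This is a plain-Python sketch, not Titan or TinkerPop code; the names `people`, `knows_rows`, `graph`, `friends_via_join`, and `friends_via_graph` are hypothetical.

```python
# Relational view: a Person table and a crosswalk (join) table.
people = {1: "Bob", 2: "Tom", 3: "Joe"}
knows_rows = [(1, 2), (1, 3)]  # (person_id, knows_id)

def friends_via_join(person_id):
    """Scan the crosswalk table, then look each match up in the Person table."""
    return [people[k] for (p, k) in knows_rows if p == person_id]

# Graph view: each vertex carries its own outgoing "Knows" edges.
graph = {"Bob": ["Tom", "Joe"], "Tom": [], "Joe": []}

def friends_via_graph(name):
    """Follow the outgoing edges directly from the vertex."""
    return list(graph[name])

print(friends_via_join(1))
print(friends_via_graph("Bob"))
```

In the relational form every lookup goes through the crosswalk table; in the graph form the relationships are reached directly from the starting vertex.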

TinkerPop

Abstraction layer

Query Language

Computing

Graphs with Spark

GraphX

Pregel

GraphLab

Why Titan

Flexible backend

No added infrastructure cost

Community support

Backends

Cassandra

HBase

Hazelcastcache

Persistit

Berkeley

Database Triangle

http://blog.nahurst.com/visual-guide-to-nosql-systems

HBase

Strong consistency at the record level

Transaction support

Stored procedures

Replication

Cassandra

Tunable consistency

Multiple datacenter support

Built-in replication and fault tolerance

CQL query language

Keyspace passwords
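To make "tunable consistency" a little more concrete, here is a minimal sketch of the usual quorum arithmetic: a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names and the replication factor of 3 are assumptions for illustration, not part of the talk, and no Cassandra driver is involved.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3  # assumed replication factor
print(quorum(rf))                                          # 2
print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))  # True: QUORUM writes + QUORUM reads
print(read_sees_latest_write(rf, 1, 1))                    # False: ONE writes + ONE reads
```

Choosing lower levels on either side trades that overlap guarantee for lower latency and higher availability.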

Indexing

Built-in

Fast for exact matches

Lucene

More advanced queries

Good for single box

Elasticsearch

Advanced queries

Large-scale clusters
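The difference between an exact-match index and a text index can be sketched in a few lines. This is a toy illustration only; it does not reflect how Titan, Lucene, or Elasticsearch are implemented, and the surnames and the names `exact_index` and `token_index` are made up for the example.

```python
from collections import defaultdict

# Hypothetical vertex names (surnames invented for illustration).
names = {1: "Bob Smith", 2: "Tom Smith", 3: "Joe Brown"}

# Exact-match index: the whole value is the key.
exact_index = defaultdict(set)
for vertex_id, name in names.items():
    exact_index[name].add(vertex_id)

# Token index: each lowercased word is a key, so partial text queries work.
token_index = defaultdict(set)
for vertex_id, name in names.items():
    for token in name.lower().split():
        token_index[token].add(vertex_id)

print(sorted(exact_index.get("Bob Smith", set())))  # full value required
print(sorted(exact_index.get("Smith", set())))      # partial value finds nothing
print(sorted(token_index.get("smith", set())))      # any name containing the word
```

The exact-match lookup is a single key probe, which is why it is fast; the token index costs more to build and store but answers the "more advanced" queries.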

Gremlin vs SPARQL

Gremlin

Support for complex queries

http://gremlindocs.com/

SPARQL

Easy query language

http://www.w3.org/TR/rdf-sparql-query/

Gremlin vs SPARQL example 1

Gremlin

g.v('tg:1').out('tg:knows')

SPARQL

SELECT ?x WHERE {
  tg:1 tg:knows ?x
}

Gremlin vs SPARQL example 2

Gremlin

g.v('tg:1').out('tg:knows').out('tg:name')

SPARQL

SELECT ?y WHERE {
  tg:1 tg:knows ?x .
  ?x tg:name ?y
}
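Both examples describe the same two steps: follow the `tg:knows` edges out of `tg:1`, then follow `tg:name` from each result. A minimal plain-Python sketch of that traversal over a list of triples is shown here; it is not Gremlin or SPARQL, the helper `out` is hypothetical, and the sample triples (including the mapping of `tg:2` and `tg:3` to Tom and Joe) are assumed for illustration.

```python
# Edges as (subject, predicate, object) triples, mirroring the tg: examples.
# The data itself is assumed sample data.
triples = [
    ("tg:1", "tg:knows", "tg:2"),
    ("tg:1", "tg:knows", "tg:3"),
    ("tg:2", "tg:name", "Tom"),
    ("tg:3", "tg:name", "Joe"),
]

def out(starts, label):
    """One traversal step: follow edges with `label` out of every start vertex."""
    return [obj for s in starts for (subj, pred, obj) in triples
            if subj == s and pred == label]

# Example 1: who does tg:1 know?
known = out(["tg:1"], "tg:knows")
# Example 2: names of the vertices tg:1 knows.
known_names = out(known, "tg:name")

print(known)
print(known_names)
```

Gremlin chains such steps explicitly, one after another; SPARQL states the same path as a set of triple patterns joined on the shared variable `?x`.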

Our Mission

Deliver the most current information on the U.S. healthcare provider universe using integrated solutions in order for customers to:
› Prevent fraud, waste and abuse across the healthcare system
› Comply with evolving state and federal regulations
› Improve market opportunity for non-retail drugs and devices

Health Market Science, a LexisNexis Company

The Business

Business Solutions

Health Care Provider & Facilities

Variety/Velocity
• >2000 sources
• 6 Million unique HCPs
• 10+ years history

Data Challenges
• Constant change in real world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality

Master Data Solutions

Medical Procedures & Diagnosis

Volume/Velocity
• ~1B claims annually
• +5B records annually
• 5+ years history

Data Challenges
• Sources have incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships

Medical Claims Data

Batch (CompleteView, Expense Manager)

Transactional (PRS/MDM/VerifyRx)

Big Data Relational DB & Analytics (Claims)

Master Data Management

Visualization

Dashboard / Reports

Structured Storage

Relational

Indexing

Flexible Storage

NoSQL Graph(s)

Interfacing

Web Services

Distributed Processing

Standardize

Validate

Match

Consolidate

Analytics

Data Sources

Government

Web

Customer

I’m happy

User Interface

Our use of Titan

Link storage

Analytics of links

Affiliation of business influences

Visualization of relationships

Demo