Neo4j in Depth
Max De Marzi
About Me
• Max De Marzi - Neo4j Field Engineer
• My Blog: http://maxdemarzi.com
• Find me on Twitter: @maxdemarzi
• Email me: [email protected]
• GitHub: http://github.com/maxdemarzi
TLDR:
Property Graph Data Model
What you already know
The Problem
• all JOINs are executed every time you query (traverse) the relationship
• executing a JOIN means searching for a key in another table
• with indexes, executing a JOIN means looking up a key
• B-Tree Index: O(log(n))
• more entries => more lookups => slower JOINs
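The O(log(n)) point can be illustrated with a minimal Python sketch. It assumes a sorted list probed by binary search as a stand-in for a B-Tree index; the function name and table sizes are hypothetical and nothing here is Neo4j or SQL engine code.

```python
def index_lookup(sorted_keys, key):
    """Binary search over sorted keys, standing in for one B-Tree index probe.

    Returns (position or None, number of comparison steps)."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo if lo < len(sorted_keys) and sorted_keys[lo] == key else None
    return found, steps

# Two hypothetical tables: the larger one needs more steps per JOIN lookup.
small_table = list(range(1_000))
large_table = list(range(1_000_000))
pos_small, steps_small = index_lookup(small_table, 999)
pos_large, steps_large = index_lookup(large_table, 999_999)
```

Every hop of a multi-hop query pays this lookup again, which is why the cost grows with both table size and the number of JOINs.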
People (id, name)
143 Max
Conferences (id, name)
326 Big Data Tech Con
725 NoSQL Now
981 Chariot Data IO
Attend (person id, conference id)
143 981
143 725
143 326
A Property Graph
Nodes
• uid: MDM, name: Max
• uid: BDTC, where: Burlingame
• uid: NSN, where: San Francisco
• uid: CDIO, where: Philadelphia
Relationships
• member (three relationships, one between MDM and each of BDTC, NSN and CDIO)
Neo4j Secret Sauce
• Pointers instead of look-ups
• Fixed sized records for fast access
• Do all your “Joining” on creation
• Spin spin spin through this data structure
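As a rough illustration of "pointers instead of look-ups", the Python sketch below stores node and relationship records in plain lists, so a record id is also its position, and links each node's relationships into a chain when they are created. The record layout and every name in it are simplifying assumptions, not Neo4j's actual store format.

```python
from dataclasses import dataclass

NO_REL = -1  # sentinel: end of a relationship chain

@dataclass
class NodeRecord:
    name: str
    first_rel: int = NO_REL       # id of the first relationship in this node's chain

@dataclass
class RelRecord:
    start: int
    end: int
    rel_type: str
    next_for_start: int = NO_REL  # next relationship of the same start node

nodes = []  # record id == list position, mimicking fixed-size records
rels = []

def add_node(name):
    nodes.append(NodeRecord(name))
    return len(nodes) - 1

def add_rel(start, end, rel_type):
    # The "join" happens once, at creation: link the record into the chain.
    rels.append(RelRecord(start, end, rel_type, nodes[start].first_rel))
    nodes[start].first_rel = len(rels) - 1

def neighbours(node_id):
    # Traversal just follows pointers; no index is searched.
    rel_id = nodes[node_id].first_rel
    while rel_id != NO_REL:
        rel = rels[rel_id]
        yield nodes[rel.end].name
        rel_id = rel.next_for_start

max_id = add_node("Max")
for conference in ["Big Data Tech Con", "NoSQL Now", "Chariot Data IO"]:
    add_rel(max_id, add_node(conference), "member")
attended = list(neighbours(max_id))
```

The cost of reading Max's conferences depends only on how many relationships Max has, not on how many nodes the whole graph contains.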
Relational Databases Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with number & levels of relationships, and database size
• Query complexity grows with need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when relationships are valuable in real-time
Slow development, poor performance, low scalability, hard to maintain
NoSQL Databases Don’t Handle Relationships
• No data structures to model or store relationships
• No query constructs to support relationships
• Relating data requires “JOIN logic” in the application
• No ACID support for transactions
… making NoSQL databases inappropriate when relationships are valuable in real-time
Real-Time Query Performance
Performance must hold steady with scale
[Chart: Response Time (y-axis) against Connectedness and Size of Data Set (x-axis), from "0 to 2 hops, 0 to 3 degrees, thousands of connections" to "tens to hundreds of hops, thousands of degrees, billions of connections". Two curves: "Relational and Other NoSQL Databases" and "Neo4j".]
Neo4j is 1000x faster
Reduces minutes to milliseconds
Re-Imagine Your Data as a Graph
Neo4j is an enterprise-grade graph database that enables you to:
• Model and store your data as a graph
• Query relationships with ease and in real-time
• Seamlessly evolve applications to support new requirements by adding new kinds of data and relationships
Agile development, high performance, vertical and horizontal scale, seamless evolution
Neo4j Overview
Product
• Neo4j - World’s leading graph database
• 1M+ downloads, adding 50k+ per month
• 150+ enterprise subscription customers including over 50 of the Global 2000
Company
• Neo Technology, Creator of Neo4j
• 80 employees with HQ in Silicon Valley, London, Munich, Paris and Malmö
• $45M in funding from Fidelity, Sunstone, Conor, Creandum, Dawn Capital
Neo4j: The Graph Database Leader
[Timeline: 2000, 2003, 2007, 2009, 2011, 2013, 2014, 2015]
Technical Leadership
• Invented property graph model
• First native graph DB in 24/7 production
• Contributed first graph DB to open source
• Introduced Cypher, a declarative query language for property graphs
• Extended graph data model to labeled property graph
Funding
• $2.5M Seed Round from Sunstone and Conor
• $11M Series A from Fidelity, Sunstone and Conor
• $11M Series B from Fidelity, Sunstone and Conor
• $20M Series C led by Creandum, with Dawn and existing investors
Commercial Leadership
• First Global 2000 Customer
• GraphConnect, first conference for graph DBs
• Published O’Reilly book on Graph Databases
• 150+ customers
• 50K+ monthly downloads
• 500+ graph DB events worldwide
“Forrester estimates that over 25% of enterprises will be using graph databases by 2017”
Neo4j Leads the Graph Database Revolution
“Neo4j is the current market leader in graph databases.”
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture.”
1. IT Market Clock for Database Management Systems, 2014
2. TechRadar™: Enterprise DBMS, Q1 2014
3. Graph Databases – and Their Potential to Transform How We Capture Interdependencies (Enterprise Management Associates)
Building a Recommendation Engine in 2 Minutes with Neo4j
Developer Experience: Neo4j UI with Cypher Query Language
Two-Minute Video Demo
https://www.youtube.com/watch?v=qbZ_Q-YnHYo
Neo4j – Key Product Features
Native Graph Storage: Ensures data consistency and performance
Native Graph Processing: Millions of hops per second, in real time
“Whiteboard Friendly” Data Modeling: Model data as it naturally occurs
High Data Integrity: Fully ACID transactions
The Graph Query Language, Cypher: Requires 10x to 100x less code than SQL
Scalability and High Availability: Vertical and horizontal scaling optimized for graphs
Built-in ETL: Seamless import from other databases
Integration: Drivers and APIs for popular languages
Property Graph Model Components
Nodes
• The objects in the graph
• Can have properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have properties
[Example diagram: two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan” and name: “Ann”, born: Dec 5, 1975) and a CAR node (brand: “Volvo”, model: “V70”), connected by LOVES, LOVES, LIVES WITH, OWNS and DRIVES relationships; one relationship carries the property since: Jan 10, 2011]
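The components listed above can be sketched as a small Python data structure. This is only an illustration: the class names are hypothetical, the relationship directions are guesses at the diagram, and attaching `since` to DRIVES is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                  # nodes can be labeled
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    rel_type: str                                # relationships have a type...
    start: Node                                  # ...and a direction (start -> end)
    end: Node
    properties: dict = field(default_factory=dict)

dan = Node({"Person"}, {"name": "Dan", "twitter": "@dan"})
ann = Node({"Person"}, {"name": "Ann"})
car = Node({"Car"}, {"brand": "Volvo", "model": "V70"})

relationships = [
    Relationship("LOVES", dan, ann),
    Relationship("LOVES", ann, dan),
    # Assumption: the `since` property belongs to DRIVES.
    Relationship("DRIVES", dan, car, {"since": "2011-01-10"}),
    Relationship("OWNS", ann, car),
]

def outgoing(node, rel_type):
    """Follow relationships of one type that start at `node`."""
    return [r.end for r in relationships if r.start is node and r.rel_type == rel_type]

loved_by_dan = [n.properties["name"] for n in outgoing(dan, "LOVES")]
owned_by_ann = [n.properties["brand"] for n in outgoing(ann, "OWNS")]
```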
Triple Store/RDF Model
• Resource Description Framework
• Subject, Predicate, Object
• Standard Data Model
• Names for subjects, predicates, objects must be URIs
• Names must be Global
• No properties on the Relationships
• Like “3rd Normal Form” for Relational Databases (but really more like 5/6th)
Property Graph Data Model (Movies)
RDF Data Model (Movies)
Property Graph Vs Triple Store
• Property Graph is a more generic case of the Triple Store
• The lack of properties on relationships in Triple Stores reduces (or complicates) their expressive power
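To make the "complicates" part concrete, here is a minimal Python sketch that stores the same fact both ways. The data, the intermediate id `drive1` and the predicate names are hypothetical, and plain strings are used where real RDF would require URIs.

```python
# Property graph: the relationship itself carries the property.
edge = {"start": "Dan", "type": "DRIVES", "end": "Volvo", "properties": {"since": 2011}}

# Triple model: a statement is only (subject, predicate, object), so a property
# on the relationship needs an extra intermediate resource ("drive1").
triples = [
    ("drive1", "driver", "Dan"),
    ("drive1", "vehicle", "Volvo"),
    ("drive1", "since", 2011),
]

def since_from_edge(e):
    return e["properties"].get("since")

def since_from_triples(driver, vehicle):
    """Find the intermediate resource first, then read its `since` statement."""
    for subject, predicate, obj in triples:
        if predicate == "driver" and obj == driver:
            facts = {p: o for s, p, o in triples if s == subject}
            if facts.get("vehicle") == vehicle:
                return facts.get("since")
    return None
```

Both functions return the same value, but the triple version needs three statements and a two-step lookup for what is a single relationship in the property graph.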
Query Languages
• Graph Databases:
  • Cypher - declarative, pattern matching, easy to understand
  • Gremlin - imperative, step driven, math inspired
  • Native APIs (Java, REST)
• Triple Stores:
  • SPARQL (standard)
  • PROLOG (or prolog-like languages)
General Use Cases
• Graph Databases:
  • Local Queries (anchor on a node or set of nodes then traverse)
  • Realtime (<20ms) requirements
  • Complex, deep traversals
  • Flexible graph models
• Triple Stores:
  • Global Queries (find pattern in large volume of information)
  • Browsing Content
  • Inference Discovery
How do you model Flight Data?
How do you model Comic Books?
How do you model a world where anything can happen?
Graph Databases allow Model Flexibility
Watch the presentation at: https://vimeo.com/79399404
Java CORE API
Direct access to Nodes and Relationships
Java Core API
• Step by Step from GraphDatabaseService
• Start a transaction (reads and writes)
• findNode(Label, Property, Value)
• findNodes(Label, Property, Value)
• findNodes(Label)
• getNodeById(Long)
• getRelationships(Direction, Type)
• getProperty(Property, (optional) Default Value)
Example (get the friends of a user)
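The slide's code is not part of this transcript. As a language-neutral illustration of the same step-by-step flow (find a node by label and property, follow relationships of one type and direction, read a property with a default), here is a minimal Python sketch. `TinyGraph`, its methods and the `FRIENDS` type are hypothetical stand-ins, not the Neo4j Java API.

```python
class TinyGraph:
    """Hypothetical in-memory stand-in for a graph database service."""

    def __init__(self):
        self.nodes = []  # each node: {"label": str, "props": dict}
        self.rels = []   # each relationship: (start id, type, end id)

    def create_node(self, label, **props):
        self.nodes.append({"label": label, "props": props})
        return len(self.nodes) - 1

    def create_relationship(self, start, rel_type, end):
        self.rels.append((start, rel_type, end))

    def find_node(self, label, key, value):
        for node_id, node in enumerate(self.nodes):
            if node["label"] == label and node["props"].get(key) == value:
                return node_id
        return None

    def outgoing(self, node_id, rel_type):
        return [end for start, t, end in self.rels if start == node_id and t == rel_type]

    def get_property(self, node_id, key, default=None):
        return self.nodes[node_id]["props"].get(key, default)

graph = TinyGraph()
max_id = graph.create_node("User", username="maxdemarzi", name="Max")
ann_id = graph.create_node("User", username="ann", name="Ann")
dan_id = graph.create_node("User", username="dan")  # deliberately has no name
graph.create_relationship(max_id, "FRIENDS", ann_id)
graph.create_relationship(max_id, "FRIENDS", dan_id)

# Step by step: find the user, walk the outgoing FRIENDS relationships, read names.
user = graph.find_node("User", "username", "maxdemarzi")
friend_names = [graph.get_property(f, "name", "unknown") for f in graph.outgoing(user, "FRIENDS")]
```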
Traversal API
Describe Traversals
Traversal API
• Start with the Simple Defaults (order, relationships, depth, uniqueness, etc)
• Custom Expanders
  • Where should I go next?
• Custom Evaluators
  • I’ve gone there… should I accept this path?
Traversal API Example
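The slide's example is not part of this transcript. The split between an expander ("where should I go next") and an evaluator ("should I accept this path") can be sketched in a few lines of Python. The breadth-first order, the visit-each-node-once uniqueness and every name below are assumptions chosen for illustration, not the Neo4j Traversal API.

```python
from collections import deque

# Hypothetical directed graph as adjacency lists.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

def traverse(start, expander, evaluator, max_depth):
    """Breadth-first traversal where each node is visited at most once."""
    visited = {start}
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if evaluator(path):                 # should this path be returned?
            yield path
        if len(path) - 1 < max_depth:       # depth limit, counted in hops
            for nxt in expander(path):      # where can we go from the path's end?
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + (nxt,))

def expander(path):
    return graph[path[-1]]

def exactly_two_hops(path):
    return len(path) - 1 == 2

paths = list(traverse("a", expander, exactly_two_hops, max_depth=2))
```

Only one path to `d` is returned because node uniqueness stops the traversal from reaching `d` a second time through `c`.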
Cypher Query Language
ASCII Art Pattern Matching
Cypher: Powerful and Expressive Query Language
MATCH (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"})
[Diagram: node Dan, relationship LOVES, node Ann. Each (:Person {name: …}) pattern is annotated as a Node made of a Label (Person) and a Property (name).]
Express Complex Queries Easily with Cypher
Find all direct reports and how many people they manage, up to 3 levels down
Cypher Query:
MATCH (boss)-[:MANAGES*0..3]->(sub),
      (sub)-[:MANAGES*1..3]->(report)
WHERE boss.name = "John Doe"
RETURN sub.name AS Subordinate, count(report) AS Total
SQL Query: [slide image, shown next to the Cypher query for comparison]
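For readers new to variable-length patterns, the following Python sketch mirrors what the MANAGES query computes on a small, hypothetical org chart. It assumes the hierarchy is a tree (one path between any two people) and is only an illustration of the semantics, not how Neo4j evaluates the query.

```python
# Hypothetical MANAGES relationships: manager -> direct reports.
manages = {
    "John Doe": ["Alice", "Bob"],
    "Alice": ["Carol", "Dave"],
    "Bob": [],
    "Carol": ["Eve"],
    "Dave": [],
    "Eve": [],
}

def descendants(person, min_depth, max_depth):
    """People reachable over MANAGES in min_depth..max_depth hops."""
    found, frontier = [], [person]
    if min_depth == 0:
        found.append(person)           # a *0.. pattern also matches the start node
    for depth in range(1, max_depth + 1):
        frontier = [report for p in frontier for report in manages[p]]
        if depth >= min_depth:
            found.extend(frontier)
    return found

def report_counts(boss):
    counts = {}
    for sub in descendants(boss, 0, 3):          # (boss)-[:MANAGES*0..3]->(sub)
        total = len(descendants(sub, 1, 3))      # (sub)-[:MANAGES*1..3]->(report)
        if total:                                # no row unless some report matches
            counts[sub] = total
    return counts

totals = report_counts("John Doe")
```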
Hello World Recommendation
Movie Data Model
Cypher Query: Movie Recommendation
MATCH (watched:Movie {title: "Toy Story"})<-[r1:RATED]-()-[r2:RATED]->(unseen:Movie)
WHERE r1.rating > 7 AND r2.rating > 7
  AND watched.genres = unseen.genres
  AND NOT ((:Person {username: "maxdemarzi"})-[:RATED|WATCHED]->(unseen))
RETURN unseen.title, COUNT(*)
ORDER BY COUNT(*) DESC
LIMIT 25
What are the Top 25 Movies
• that I haven't seen
• with the same genres as Toy Story
• given high ratings
• by people who liked Toy Story
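The logic of this recommendation query can be walked through with a minimal Python sketch over hypothetical in-memory data. The movie names other than Toy Story, the genres and all ratings are made up, and rated movies stand in for both RATED and WATCHED.

```python
from collections import Counter

# Hypothetical data: genres per movie and ratings per person.
genres = {
    "Toy Story": {"Animation", "Comedy"},
    "Movie A": {"Animation", "Comedy"},
    "Movie B": {"Animation", "Comedy"},
    "Movie C": {"Crime"},
}
ratings = {
    "ann": {"Toy Story": 9, "Movie A": 8, "Movie C": 9},
    "dan": {"Toy Story": 8, "Movie A": 9, "Movie B": 8},
    "eve": {"Toy Story": 5, "Movie B": 10},
    "maxdemarzi": {"Toy Story": 10, "Movie B": 9},
}

def recommend(user, watched, limit=25):
    seen = set(ratings[user])
    counts = Counter()
    for rated in ratings.values():
        if rated.get(watched, 0) <= 7:           # r1.rating > 7
            continue
        for title, rating in rated.items():
            if (title != watched
                    and rating > 7                       # r2.rating > 7
                    and genres[title] == genres[watched] # same genres
                    and title not in seen):              # not yet seen by the user
                counts[title] += 1
    return counts.most_common(limit)             # ORDER BY COUNT(*) DESC LIMIT n

top = recommend("maxdemarzi", "Toy Story")
```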
Movie Data Model
Cypher Query: k-NN Recommendation
MATCH (m:Movie)<-[r:RATED]-(b:Person)-[s:SIMILARITY]-(p:Person {name: 'Zoltan Varju'})
WHERE NOT ((p)-[:RATED|WATCHED]->(m))
WITH m, s.similarity AS similarity, r.rating AS rating
ORDER BY m.name, similarity DESC
WITH m.name AS movie, COLLECT(rating)[0..3] AS ratings
WITH movie, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS recommendation
ORDER BY recommendation DESC
RETURN movie, recommendation
LIMIT 25
What are the Top 25 Movies
• that Zoltan Varju has not seen
• using the average rating
• by my top 3 neighbors
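The k-NN query combines three steps: order each movie's ratings by neighbor similarity, keep the first three, and average them. A minimal Python sketch of those steps follows; the similarity scores, ratings and names are hypothetical.

```python
# Hypothetical data: similarity of each neighbor to the target user, and ratings.
similarity = {"ann": 0.9, "dan": 0.8, "eve": 0.4, "fay": 0.2}
ratings = {
    "ann": {"Movie A": 8, "Movie B": 6},
    "dan": {"Movie A": 6},
    "eve": {"Movie A": 10, "Movie B": 9},
    "fay": {"Movie A": 2, "Movie C": 7},
}
seen_by_target = {"Movie C"}

def knn_recommend(k=3, limit=25):
    by_movie = {}
    # Visit neighbors from most to least similar so each list is already ordered.
    for person in sorted(similarity, key=similarity.get, reverse=True):
        for movie, rating in ratings[person].items():
            if movie not in seen_by_target:
                by_movie.setdefault(movie, []).append(rating)
    # Average the ratings of the k most similar neighbors who rated each movie.
    scored = [(movie, sum(r[:k]) / len(r[:k])) for movie, r in by_movie.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

recommendations = knn_recommend()
```

Movie A is rated by four neighbors, but only the three most similar ones count, so the low rating from `fay` is ignored.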
Neo4j Interface
Server, Service, Library
High Speed Fraud - 1000 R/S
http://maxdemarzi.com/2014/02/12/online-payment-risk-management-with-neo4j/
High Speed Fraud - 8000 R/S
http://maxdemarzi.com/2014/02/27/neo4j-at-ludicrous-speed/
High Speed Fraud - 28000 R/S
http://maxdemarzi.com/2014/03/10/its-over-9000-neo4j-on-websockets/
Neo4j
Additional Features
Neo4j Clustering Architecture
Optimized for Speed & Availability at Scale
Performance Benefits:
• No network hops within queries
• Real-time operations with fast and consistent response times
• Cache sharding spreads cache across cluster for very large graphs
Clustering Features:
• Master-slave replication with master re-election and failover
• Each instance has its own local cache
• Horizontal scaling & disaster recovery
[Diagram: a Load Balancer in front of three Neo4j instances]
Getting Data into Neo4j
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
Command-Line Bulk Loader neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second
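As a sketch of preparing input for a bulk load, the Python below writes one node file and one relationship file in memory, using the People/Conferences data from earlier in the deck. The header fields (`id:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follow the neo4j-import CSV header convention as best recalled, so check them against the documentation for the Neo4j version in use; the labels and helper name are hypothetical.

```python
import csv
import io

people = [(143, "Max")]
conferences = [(326, "Big Data Tech Con"), (725, "NoSQL Now"), (981, "Chariot Data IO")]
attends = [(143, 981), (143, 725), (143, 326)]

def write_csv(header, rows):
    """Serialize a header plus rows to CSV text (a file on disk in real use)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Node file: one id column, properties, and a label column (assumed header format).
nodes_csv = write_csv(
    ["id:ID", "name", ":LABEL"],
    [(pid, name, "Person") for pid, name in people]
    + [(cid, name, "Conference") for cid, name in conferences],
)

# Relationship file: start id, end id and relationship type (assumed header format).
rels_csv = write_csv(
    [":START_ID", ":END_ID", ":TYPE"],
    [(person, conference, "member") for person, conference in attends],
)
```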
4.58 million things and their relationships…
Loads in 100 seconds!
Neo4j Fits into Your Enterprise Environment
[Diagram components: End User; Application; Graph Database Cluster (Neo4j, Neo4j, Neo4j); Databases; ETL; Bulk Analytic Infrastructure (Graph Compute Engine, Hadoop, EDW, …); Ad Hoc Analysis; Data Scientist]
[Diagram captions: Data Storage and Business Rules Execution; Data Mining and Aggregation]
Value from Relationships – Common Use Cases
Internal Applications
• Master Data Management
• Network and IT Operations
• Fraud Detection
Customer-Facing Applications
• Real-time Recommendations
• Graph-based Search
• Identity and Access Management
Open Corporates Uses Neo4j
https://skillsmatter.com/skillscasts/4097-case-study-how-opencorporates-uses-neo4j-to-provide-insight
Open Source Examples
http://maxdemarzi.com/2012/10/18/matches-are-the-new-hotness/
What are the Top 10 Jobs for me
• that are in the same location I’m in
• for which I have the necessary qualifications
Partial Subgraph Search
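A partial subgraph search of this kind boils down to finding jobs whose required pattern is contained in the candidate's own subgraph. The Python sketch below uses hypothetical data, and ranking by the number of matched requirements is an assumption, not something stated on the slide.

```python
# Hypothetical candidate and job postings.
me = {"location": "Chicago", "skills": {"Neo4j", "Java", "SQL"}}
jobs = {
    "Graph Developer": {"location": "Chicago", "requires": {"Neo4j", "Java"}},
    "Report Writer": {"location": "Chicago", "requires": {"SQL"}},
    "Data Engineer": {"location": "Chicago", "requires": {"SQL", "Scala"}},
    "DBA": {"location": "Boston", "requires": {"SQL"}},
}

def top_jobs(person, limit=10):
    matches = [
        (title, len(job["requires"]))
        for title, job in jobs.items()
        if job["location"] == person["location"]        # same location
        and job["requires"] <= person["skills"]         # every requirement is met
    ]
    # Assumption: jobs that use more of the person's skills rank higher.
    matches.sort(key=lambda match: match[1], reverse=True)
    return [title for title, _ in matches[:limit]]

best = top_jobs(me)
```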
Recommend Love
Find your soulmate in the graph
• Are they energetic?
• Do they like dogs?
• Have a good sense of humor?
• Neat and tidy, but not crazy about it?
What are the Top 10 Potential Mates for me
• that are in the same location
• are sexually compatible
• have traits I want
• want traits I have
Love Recommendation
Two Party Partial Subgraph Search
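"Two party" means the match has to hold in both directions: they have traits I want, and I have traits they want. A minimal Python sketch with hypothetical people and traits follows. The compatibility filter from the slide is left out for brevity, and scoring by the number of matched traits is an assumption.

```python
# Hypothetical people with the traits they have and the traits they want.
people = {
    "me":  {"location": "Chicago", "has": {"tidy", "funny"}, "wants": {"energetic", "likes dogs"}},
    "pat": {"location": "Chicago", "has": {"energetic", "likes dogs"}, "wants": {"funny"}},
    "sam": {"location": "Chicago", "has": {"energetic"}, "wants": {"rich"}},
    "lee": {"location": "Boston", "has": {"energetic", "likes dogs"}, "wants": {"tidy"}},
}

def top_mates(name, limit=10):
    me = people[name]
    scored = []
    for other_name, other in people.items():
        if other_name == name or other["location"] != me["location"]:
            continue
        they_have = len(me["wants"] & other["has"])   # traits I want that they have
        i_have = len(other["wants"] & me["has"])      # traits they want that I have
        if they_have and i_have:                      # both sides must match
            scored.append((other_name, they_have + i_have))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

mates = top_mates("me")
```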
http://maxdemarzi.com/2013/04/19/match-making-with-neo4j/
Real-Time Recommendations with Neo4j
• Social Recommendations
• Products and Services
• Content
• Routing
Walmart BUSINESS CASE
World’s largest company by revenue
World’s largest retailer and private employer
SF-based global e-commerce division manages several websites
Founded in 1969 in Bentonville, Arkansas
• Needed online customer recommendations to keep pace with competition
• Data connections provided predictive context, but were not in a usable format
• Solution had to serve many millions of customers and products while maintaining superior scalability and performance
Walmart SOLUTION
• Brings customers, preferences, purchases, products and locations into a graph model
• Uses connections to make product recommendations
• Solution deployed across Walmart divisions and websites
Global Courier BUSINESS CASE
World’s largest courier
480,000 employees
€55 billion in revenue
Needed new B2C and B2B parcel routing system for its logistics practice
Legacy system supported neither the full network nor the shift to online demands
Needed to replace aging B2B and B2C parcel routing system whose requirements include:
• 24x7 availability
• Peak loads of 5M parcels per day, 3K per second
• Support for complex and diverse software stack
• Predictable performance with linear scalability
• Daily changes to logistics networks
• Route from any point to any point
• Single point of truth for entire network
Global Courier SOLUTION
Neo4j provides the ideal domain fit since a logistics network is a graph
• High availability and performance via Neo4j clustering
• Greatly simplified Cypher queries for routing versus relational SQL queries
• Flexible data model that reflects the real logistics world far better than relational
• Easy-to-grasp whiteboard-friendly model
eBay BUSINESS CASE
C2C and B2C retail network
Full e-commerce functionality for individuals and businesses
Integrated with logistics vendors for product deliveries
• Needed an offering to compete with Amazon Prime
• Enable customer-selected delivery inside 90 minutes
• Calculate best route option in real-time
• Scale to enable a variety of services
• Offer more predictable delivery times
eBay Now SOLUTION
• Acquired UK-based Shutl, a leader in same-day delivery
• Used Neo4j to create eBay Now
• 1000 times faster than the prior MySQL-based solution
• Faster time-to-market
• Improved code quality with 10 to 100 times less query code
Classmates BUSINESS CASE
Online yearbook connecting friends from school, work and military in US and Canada
Founded as Memory Lane in Seattle
Develop new social networking capabilities to monetize yearbook-related offerings
• Show all the people I know in a yearbook
• Show yearbooks my friends appear in most often
• Show sections of a yearbook that my friends appear most in
• Show me other schools my friends attended
Classmates SOLUTION
Neo4j provides a robust and scalable graph database solution
• 3-instance cluster with cache sharding and disaster-recovery
• 18ms response time for top 4 queries
• 100M nodes and 600M relationships in initial graph, including people, images, schools, yearbooks and pages
• Projected to grow to 1B nodes and 6B relationships
National Geographic BUSINESS CASE
Non-profit scientific and educational institution founded in 1888
Covers geography, archaeology, natural science, environment and historical conservation
Journals, online media, radio, TV, documentaries, live events and consumer content and goods
• Improve poor performance of PostgreSQL app
• Increase user engagement by linking to 100+ years of multimedia content
• Improve targeting by understanding subscribers’ interests better
• Recommend content and services to users based on their interests
National Geographic SOLUTION
• Enabled complex real-time analytics across eight million users and a century of content
• Delivered robust performance by eliminating triple-nested SQL joins
• Cross-refers users among content, live events, travel, goods and causes
• Neo4j solution much less cumbersome and easier to maintain than previous SQL system
Curaspan BUSINESS CASE
Leader in patient management for discharges and referrals
Manages patient referrals for 4600+ health care facilities
Connects providers, payers via web-based patient management platform
Founded in 1999 in Newton, Massachusetts
• Improve poor performance of Oracle solution
• Support more complexity including granular, role-based access control
• Satisfy complex Graph Search queries by discharge nurses and intake coordinators
  Example: find a skilled nursing facility within n miles of a given location, belonging to health care group XYZ, offering speech therapy and cardiac care, and optionally Italian language services
Curaspan SOLUTION
• Met fast, real-time performance demands
• Supported queries that span multiple hierarchies including provider and employee-permissions graphs
• Improved data model to handle adding more dimensions to the data such as insurance networks, service areas and care organizations
• Greatly simplified queries, reducing multi-page SQL statements to one Neo4j function
FiftyThree BUSINESS CASE
Maker of Paper, one of the top apps in Apple’s App Store, with millions of users
Based in New York City
• Add social capabilities to digital-paper app
• Support social collaboration across millions of users in new Mix app
• Enable seamless interaction between social and content-asset networks
• Ensure new apps are robust, scalable and fast
FiftyThree SOLUTION
• Neo4j data model ideal for social network, content management and access control
• Users create, publish and share designs simply
• Easy to develop and evolve Neo4j-based app
• Integrates well with FiftyThree EC2 architecture
See the Neo4j solution in action: Betting the Company (Literally) on a Graph Database
http://aseemk.com/talks/neo4j-lessons-learned#/
App Store Editor’s Choice
2012 iPad App of Year
Apple Best Apps of 2014
Users Love Neo4j
jQuery Inventor Heroku Founder
THANK YOU