optimizing cypher queries in neo4j

Optimizing Cypher Queries in Neo4j Wes Freeman (@wefreema) Mark Needham (@markhneedham)

Upload: neo4j-the-open-source-graph-database

Post on 14-Jan-2015


DESCRIPTION

Mark and Wes will talk about Cypher optimization techniques based on real queries as well as the theoretical underlying processes. They'll start from the basics of "what not to do", and how to take advantage of indexes, and continue to the subtle ways of ordering MATCH/WHERE/WITH clauses for optimal performance as of the 2.0.0 release.

TRANSCRIPT

Page 1: Optimizing Cypher Queries in Neo4j

Optimizing Cypher Queries in Neo4j

Wes Freeman (@wefreema)

Mark Needham (@markhneedham)

Michael Hunger
do you have a slide that explains the combinatorial complexity of paths, i.e. 1000 friends, 2 hops -> 1M paths, 3hops -> 1BN paths and the need to filter down early and reduce the cardinality in between with distinct?
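The fan-out arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: it assumes a uniform graph in which every person has exactly 1000 friends, and the names `path_count`, `rows_with_distinct` and the population of 100,000 are made up for illustration.

```python
def path_count(friends_per_person: int, hops: int) -> int:
    """Paths of length `hops` from one start node, assuming every node
    has the same number of friends (a deliberate simplification)."""
    return friends_per_person ** hops


def rows_with_distinct(friends_per_person: int, hops: int, population: int) -> int:
    """Rows carried forward when duplicates are removed after every hop:
    there can never be more rows than there are distinct people."""
    rows = 1
    for _ in range(hops):
        rows = min(rows * friends_per_person, population)
    return rows


for hops in (1, 2, 3):
    print(hops, "hops:",
          f"{path_count(1000, hops):,} paths,",
          f"{rows_with_distinct(1000, hops, 100_000):,} rows with DISTINCT")
```

With these assumed numbers the path count grows by a factor of 1000 per hop, while the de-duplicated row count stops growing once it reaches the population size.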
Michael Hunger
in general some color coding would be cool: green headline or light-green background for the "good ones" and red headline or light-red background for the "bad ones"
Mark Needham
Let's just have the logo on the first and last page so we don't cut out space that we can use for content
Wes Freeman
is this better?
Mark Needham
slightly slower
Page 2: Optimizing Cypher Queries in Neo4j

Today's schedule

• Brief overview of Cypher syntax

• Graph global vs Graph local queries

• Labels and indexes

• Optimization patterns

• Profiling Cypher queries

• Applying optimization patterns

Michael Hunger
It would be cool if you had a slide on your blue stats: what they mean, what hardware you ran the queries on, and whether it was the first run or a subsequent run (perhaps show the effect of query building / cold cache vs. hot cache and precompiled queries)
Mark Needham
All of them are with hot caches. We can mention that
Michael Hunger
do you also plan to do anti-patterns? things that don't work so well in Cypher, like unbounded var-length paths or cross-path predicates?
Michael Hunger
an intro? what, why, how?
Michael Hunger
a picture?
Wes Freeman
yep, plan to put one in. this was the result of 30 minutes of dumping stuff here :P
Page 3: Optimizing Cypher Queries in Neo4j

Cypher Syntax

• Statement parts
  o Optional: Querying part (MATCH|WHERE)
  o Optional: Updating part (CREATE|MERGE)
  o Optional: Returning part (WITH|RETURN)

• Parts can be chained together

Page 4: Optimizing Cypher Queries in Neo4j

Cypher Syntax - Refresher

MATCH (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
RETURN n, r, m

Michael Hunger
add p= and return p
Page 5: Optimizing Cypher Queries in Neo4j

Starting points

• Graph scan (global; potentially slow)

• Label scan (usually reserved for aggregation queries; not ideal)

• Label property index lookup (local; good!)

Page 6: Optimizing Cypher Queries in Neo4j

Introducing the football dataset

Page 7: Optimizing Cypher Queries in Neo4j

The 1.9 global scan

O(n), n = # of nodes

START pl = node(*)
MATCH (pl)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats

150ms w/ 30k nodes, 120k rels

Michael Hunger
I would probably use pl for player to not confuse it with p for path
Page 8: Optimizing Cypher Queries in Neo4j

The 2.0 global scan

O(n), n = # of nodes

MATCH (pl)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats

130ms w/ 30k nodes, 120k rels

Page 9: Optimizing Cypher Queries in Neo4j

Why is it a global scan?

• Cypher is a pattern matching language

• It doesn't discriminate unless you tell it to
  o It must try to start at all nodes to find this pattern, as specified

Mark Needham
Wonder how we can flesh this out
Page 10: Optimizing Cypher Queries in Neo4j

Introduce a label

Label your starting points

CREATE (player:Player {name: "Wayne Rooney"} )

Michael Hunger
"p" vs "player"? consistency
Page 11: Optimizing Cypher Queries in Neo4j

Label scan

O(k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats

80ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Page 12: Optimizing Cypher Queries in Neo4j

Indexes don't come for free

CREATE INDEX ON :Player(name)

OR

CREATE CONSTRAINT ON (pl:Player) ASSERT pl.name IS UNIQUE

Michael Hunger
what does "don't come for free" mean? that they incur costs on write operations?
Page 13: Optimizing Cypher Queries in Neo4j

Index lookup

O(log k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats

6ms w/ 30k nodes, 120k rels (~900 :Player nodes)
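The three starting points (global scan, label scan, index lookup) can be compared by counting comparisons on a toy data set. This is a sketch, not how Neo4j stores anything: the node lists, the invented node names and the use of binary search over a sorted list as a stand-in index are all assumptions, and a real index may have a different cost profile depending on its implementation.

```python
# Toy stand-in for the graph: 30,000 nodes, 900 of them labelled Player.
# Every name except "Wayne Rooney" is invented for this sketch.
all_nodes = [("Player", f"player-{i:03d}") for i in range(899)]
all_nodes.append(("Player", "Wayne Rooney"))
all_nodes += [("Other", f"node-{i}") for i in range(29_100)]

players = [name for label, name in all_nodes if label == "Player"]  # toy label store
player_index = sorted(players)                                      # toy index


def scan(candidates, name):
    """Compare every candidate with `name`; returns (found, comparisons)."""
    found, checked = False, 0
    for candidate in candidates:
        checked += 1
        if candidate == name:
            found = True
    return found, checked


def index_lookup(sorted_names, name):
    """Binary search over a sorted list; returns (found, comparisons)."""
    lo, hi, checked = 0, len(sorted_names), 0
    while lo < hi:
        mid = (lo + hi) // 2
        checked += 1
        if sorted_names[mid] < name:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_names) and sorted_names[lo] == name
    return found, checked


global_result = scan((name for _, name in all_nodes), "Wayne Rooney")  # O(n)
label_result = scan(players, "Wayne Rooney")                           # O(k)
index_result = index_lookup(player_index, "Wayne Rooney")              # O(log k)
print(global_result, label_result, index_result)
```

The comparison counts fall from the number of nodes, to the number of :Player nodes, to roughly the base-2 logarithm of the number of :Player nodes.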

Michael Hunger
Why log k ? Doesn't that depend on the index impl (b-tree, vs. hash vs. xxx)? And isn't it log (tree-level) ? which is about constant
Page 14: Optimizing Cypher Queries in Neo4j

Optimization Patterns

• Avoid cartesian products

• Avoid patterns in the WHERE clause

• Start MATCH patterns at the lowest cardinality and expand outward

• Separate MATCH patterns with minimal expansion at each stage

Page 15: Optimizing Cypher Queries in Neo4j

Introducing the movie data set

Page 16: Optimizing Cypher Queries in Neo4j

Anti-pattern: Cartesian Products

MATCH (m:Movie), (p:Person)

Michael Hunger
would be cool to have the blue stats here too and also what the cartesian product results to
Page 17: Optimizing Cypher Queries in Neo4j

Subtle Cartesian Products

MATCH (p:Person)-[:KNOWS]->(c)
WHERE p.name="Tom Hanks"
WITH c
MATCH (k:Keyword)
RETURN c, k

Michael Hunger
would be cool to have the blue stats here too and also what the cartesian product results to
Page 18: Optimizing Cypher Queries in Neo4j

Counting Cartesian Products

MATCH (pl:Player),(t:Team),(g:Game)
RETURN COUNT(DISTINCT pl), COUNT(DISTINCT t), COUNT(DISTINCT g)

80000 ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger
would be cool to have what the cartesian product results to i.e. 900x40x1200 = 43M
Page 19: Optimizing Cypher Queries in Neo4j

Better Counting

MATCH (pl:Player)
WITH COUNT(pl) as players
MATCH (t:Team)
WITH COUNT(t) as teams, players
MATCH (g:Game)
RETURN COUNT(g) as games, teams, players

8ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger
would love to see a few slides about the power of WITH first (aggregate, order, limit, change cardinality) and how it can be used to separate query parts or unwanted filter expressions on paths etc.
Michael Hunger
why? b/c you reduce the cardinality to 1 again with WITH!!
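The gap between 80000 ms and 8 ms follows from the row counts, which a short Python calculation makes concrete. The sketch uses the approximate sizes quoted on the slides (~900 players, ~40 teams, ~1200 games); treating the chained version as "one label scan after another" is a simplification.

```python
from itertools import product

players, teams, games = 900, 40, 1200

# One MATCH with three disconnected patterns: every combination is a row.
cartesian_rows = players * teams * games

# MATCH ... WITH COUNT(...) chained three times: each aggregation collapses
# the rows back to one before the next label is scanned, so the work adds up
# instead of multiplying.
chained_rows = players + teams + games

# Sanity check of the multiplication rule on a tiny example.
tiny = list(product(range(3), range(2), range(4)))

print(f"cartesian: {cartesian_rows:,} rows")
print(f"chained:   {chained_rows:,} rows")
print(f"tiny product: {len(tiny)} rows")
```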
Page 20: Optimizing Cypher Queries in Neo4j

Directions on patterns

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = "Tom Hanks"
RETURN m

Michael Hunger
show the stats with and w/o direction (db-hits)
Mark Needham
This one was going to be more in passing - weren't going into depth here so didn't pull out profile stats
Michael Hunger
note the missing direction somehow, perhaps with a gray arrow-tip ?
Page 21: Optimizing Cypher Queries in Neo4j

Parameterize your queries

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = {name}
RETURN m
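The slide does not spell out why parameters help; the usual reason is that a query whose text never changes can be compiled once and its plan reused. The toy cache below illustrates only that idea. The `run` function and the `compiled_plans` dictionary are hypothetical and do not correspond to any Neo4j API.

```python
compiled_plans = {}  # query text -> stand-in for a compiled plan


def run(query_text, params=None):
    """Pretend to execute a query, compiling it only when its text is new."""
    if query_text not in compiled_plans:
        compiled_plans[query_text] = f"plan #{len(compiled_plans) + 1}"
    return compiled_plans[query_text]


names = ["Tom Hanks", "Meg Ryan", "Keanu Reeves"]

# Literal values baked into the text: every name is a different query string.
for name in names:
    run(f'MATCH (p:Person)-[:ACTED_IN]-(m) WHERE p.name = "{name}" RETURN m')
literal_plans = len(compiled_plans)

compiled_plans.clear()

# One parameterized text with the value passed separately: compiled once.
for name in names:
    run("MATCH (p:Person)-[:ACTED_IN]-(m) WHERE p.name = {name} RETURN m",
        {"name": name})
parameterized_plans = len(compiled_plans)

print(literal_plans, "plans with literals,",
      parameterized_plans, "plan with a parameter")
```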

Page 22: Optimizing Cypher Queries in Neo4j

Fast predicates first

Bad:
MATCH (t:Team)-[:played_in]->(g)
WHERE NOT (t)-[:home_team]->(g) AND g.away_goals > g.home_goals
RETURN t, COUNT(g)

Michael Hunger
highlight the fast predicate in green and the slow one in red, make them bold
Mark Needham
This is the bit we're thinking of removing?
Michael Hunger
Why?
Page 23: Optimizing Cypher Queries in Neo4j

Fast predicates first

Better:
MATCH (t:Team)-[:played_in]->(g)
WHERE g.away_goals > g.home_goals AND NOT (t)-[:home_team]->()
RETURN t, COUNT(g)
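Why the order of predicates matters can be shown with Python's own short-circuiting `and`. The sketch assumes predicates are evaluated left to right and that the pattern check is the expensive one, as the slides imply; the four sample games and the `is_home` flag are invented.

```python
calls = {"cheap": 0, "expensive": 0}


def away_win(game):
    """Cheap: compares two properties already on the row."""
    calls["cheap"] += 1
    return game["away_goals"] > game["home_goals"]


def not_home_team(game):
    """Stand-in for the expensive pattern check NOT (t)-[:home_team]->(g)."""
    calls["expensive"] += 1
    return not game["is_home"]


games = [
    {"home_goals": 2, "away_goals": 1, "is_home": False},
    {"home_goals": 0, "away_goals": 3, "is_home": False},
    {"home_goals": 1, "away_goals": 1, "is_home": True},
    {"home_goals": 0, "away_goals": 2, "is_home": True},
]

# Bad order: the expensive check runs for every game.
bad = [g for g in games if not_home_team(g) and away_win(g)]
bad_expensive = calls["expensive"]

calls["cheap"] = calls["expensive"] = 0

# Better order: the expensive check runs only for games that pass the cheap one.
good = [g for g in games if away_win(g) and not_home_team(g)]
good_expensive = calls["expensive"]

print("same result:", bad == good)
print("expensive checks:", bad_expensive, "vs", good_expensive)
```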

Michael Hunger
same as above
Page 24: Optimizing Cypher Queries in Neo4j

Patterns in WHERE clauses

• Keep them in the MATCH

• The only pattern that needs to be in a WHERE clause is a negated one (NOT)

Mark Needham
Should I add an example here? Perhaps I can show the speed of a query with a pattern match in the WHERE
Michael Hunger
yes please
Page 25: Optimizing Cypher Queries in Neo4j

MERGE and CONSTRAINTs

• MERGE is MATCH or CREATE

• MERGE can take advantage of unique constraints and indexes

Michael Hunger
add a bunch of slides that explains merge (the two sides of merge) -> create rels and create subgraphs
Michael Hunger
only with a constraint it is an atomic operation, otherwise it doesn't take an index lock and is just a "best-try"
Mark Needham
I will add an example of this
Page 26: Optimizing Cypher Queries in Neo4j

MERGE (without index)

MERGE (g:Game
  {date: 1290257100,
   time: 1245,
   home_goals: 2,
   away_goals: 3,
   match_id: 292846,
   attendance: 60102})
RETURN g

188 ms w/ ~400 games

Michael Hunger
make clearer that this is w/o index or constraint
Page 27: Optimizing Cypher Queries in Neo4j

Adding an index

CREATE INDEX ON :Game(match_id)

Page 28: Optimizing Cypher Queries in Neo4j

MERGE (with index)

MERGE (g:Game
  {date: 1290257100,
   time: 1245,
   home_goals: 2,
   away_goals: 3,
   match_id: 292846,
   attendance: 60102})
RETURN g

6 ms w/ ~400 games

Page 29: Optimizing Cypher Queries in Neo4j

Alternative MERGE approach

MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.date = 1290257100
SET g.time = 1245
SET g.home_goals = 2
SET g.away_goals = 3
SET g.attendance = 60102
RETURN g
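The difference between the two MERGE styles is what counts as "the same node". A rough Python model of the "MATCH or CREATE" behaviour is sketched below; the list `games` and both function names are hypothetical, and locking, indexes and constraints are ignored entirely.

```python
games = []  # stand-in for the :Game nodes in the graph


def merge_all_properties(props):
    """Like MERGE (g:Game {all props}): only a node carrying every listed
    property with the same value counts as a match."""
    for g in games:
        if all(g.get(k) == v for k, v in props.items()):
            return g, False
    node = dict(props)
    games.append(node)
    return node, True


def merge_on_key(key, value, on_create):
    """Like MERGE (g:Game {key: value}) ON CREATE SET ...: match on the key
    alone and apply the other properties only when the node is created."""
    for g in games:
        if g.get(key) == value:
            return g, False
    node = {key: value, **on_create}
    games.append(node)
    return node, True


_, created_first = merge_on_key("match_id", 292846, {"attendance": 60102})
_, created_again = merge_on_key("match_id", 292846, {"attendance": 60102})
# Same match_id but a different attendance figure: no full match, new node.
_, created_mismatch = merge_all_properties({"match_id": 292846, "attendance": 60000})

print(created_first, created_again, created_mismatch, len(games))
```

Merging on the identifying property alone keeps one node per match_id, whereas merging on the full property map creates a second node as soon as any single value differs.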

Mark Needham
Wes: I've added two versions of MERGE as we discussed. Assuming that you'll explain the difference between the approaches i.e. you'd take over for this slide :D
Page 30: Optimizing Cypher Queries in Neo4j

Profiling queries

• Use the PROFILE keyword in front of the query
  o from webadmin or shell - won't work in browser

• Look for db_hits and rows

• Ignore everything else (for now!)

Michael Hunger
but only in the shell, profile doesn't work in browser !
Mark Needham
Good point!
Page 31: Optimizing Cypher Queries in Neo4j

Reviewing the football dataset

Page 32: Optimizing Cypher Queries in Neo4j

Football Optimization

MATCH (game)<-[:contains_match]-(season:Season),
      (team)<-[:away_team]-(game),
      (stats)-[:in]->(game),
      (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

3137 ms w/ ~900 players, ~20 teams, ~400 games

Page 33: Optimizing Cypher Queries in Neo4j

Football Optimization

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a", " INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED108", "season", " UNNAMED55", "player", "team", " UNNAMED124", " UNNAMED85", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(player)-[' UNNAMED124']-(stats)", _rows=5192, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=5192, _db_hits=15542)

==> TraversalMatcher(trail="(season)-[ UNNAMED12:contains_match WHERE true AND true]->(game)<-[ UNNAMED85:in WHERE true AND true]-(stats)-[ UNNAMED108:for WHERE true AND true]->(team)<-[ UNNAMED55:away_team WHERE true AND true]-(game)", _rows=15542, _db_hits=1620462)

Michael Hunger
make db-hits bold and increase the font size, gray out the rest in a lighter gray
Mark Needham
Wes: why does the traversal matcher overdo it for this example? It covers way too much ground
Wes Freeman
because it's one giant match. it can handle the whole match.
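To see where the db_hits go, the `_rows` and `_db_hits` counters can be pulled out of the profile text and summed. The Python sketch below uses the operator lines from the profile output on this page, abbreviated to the operator name and the two counters; the abbreviation and the regular expression are choices made for this example only.

```python
import re

# Operator lines abbreviated from the profile output on this page: only the
# operator name and the _rows/_db_hits counters are kept.
profile = """\
ColumnFilter(_rows=10, _db_hits=0)
Top(_rows=10, _db_hits=0)
EagerAggregation(_rows=503, _db_hits=10899)
Extract(_rows=5192, _db_hits=5192)
PatternMatch(_rows=5192, _db_hits=0)
Filter(_rows=5192, _db_hits=15542)
TraversalMatcher(_rows=15542, _db_hits=1620462)
"""

pattern = re.compile(r"^(\w+)\(.*_rows=(\d+), _db_hits=(\d+)\)$")

operators = []
for line in profile.splitlines():
    match = pattern.match(line)
    if match:
        name, rows, db_hits = match.groups()
        operators.append((name, int(rows), int(db_hits)))

total_hits = sum(hits for _, _, hits in operators)
worst = max(operators, key=lambda op: op[2])

print(f"total db_hits: {total_hits:,}")
print(f"worst operator: {worst[0]} with {worst[2]:,} db_hits")
```

Nearly all of the db_hits belong to the single TraversalMatcher step, which is the part of the plan that the following pages break up.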
Page 34: Optimizing Cypher Queries in Neo4j

Break out the match statements

MATCH (game)<-[:contains_match]-(season:Season)
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

200 ms w/ ~900 players, ~20 teams, ~400 games

Mark Needham
If these are in a different order then the performance is radically different. I just happen to have it in a logical order from the initial query
Michael Hunger
wanna put that comment in the notes?
Page 35: Optimizing Cypher Queries in Neo4j

Start small

• Smallest cardinality label first

• Smallest intermediate result set first

Page 36: Optimizing Cypher Queries in Neo4j

Exploring cardinalities

MATCH (game)<-[:contains_match]-(season:Season)
RETURN COUNT(DISTINCT game), COUNT(DISTINCT season)

1140 games, 3 seasons

MATCH (team)<-[:away_team]-(game:Game)
RETURN COUNT(DISTINCT team), COUNT(DISTINCT game)

25 teams, 1140 games

Page 37: Optimizing Cypher Queries in Neo4j

Exploring cardinalities

MATCH (stats)-[:in]->(game:Game)
RETURN COUNT(DISTINCT stats), COUNT(DISTINCT game)

31117 stats, 1140 games

MATCH (stats)<-[:played]-(player:Player)
RETURN COUNT(DISTINCT stats), COUNT(DISTINCT player)

31117 stats, 880 players

Michael Hunger
what do we learn from these cardinalities? Perhaps do a table on a separate slide? Where you can highlight (green) the low ones and red the high ones (like stats)
Michael Hunger
perhaps also re-display the model?
Mark Needham
we learn that we shouldn't start with stats because it'll unnecessarily load lots of data
Mark Needham
Starting at game--team or season--game makes more sense.
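The cardinalities from the two preceding pages can be put side by side to rank candidate starting patterns. The Python sketch below does that with a deliberately crude rule (rank a pattern by the larger of its two node counts); the rule is an assumption for illustration, not something the planner does.

```python
# Distinct node counts reported on the two "Exploring cardinalities" pages.
cardinalities = {
    "season": 3,
    "team": 25,
    "player": 880,
    "game": 1140,
    "stats": 31117,
}

# Candidate starting patterns, each described by the two node types it joins.
patterns = [
    ("stats", "game"),
    ("stats", "player"),
    ("season", "game"),
    ("team", "game"),
]


def cost(pattern):
    """Crude ranking key: the larger of the two node counts in the pattern."""
    return max(cardinalities[node] for node in pattern)


ranked = sorted(patterns, key=cost)
for pattern in ranked:
    print(f"{pattern[0]}--{pattern[1]}: up to {cost(pattern):,} nodes")
```

Under this rule the season--game and team--game patterns come out ahead of anything that touches stats.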
Page 38: Optimizing Cypher Queries in Neo4j

Look for teams first

MATCH (team)<-[:away_team]-(game:Game)
MATCH (game)<-[:contains_match]-(season)
WHERE season.name = "2012-2013"
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

162 ms w/ ~900 players, ~20 teams, ~400 games

Page 39: Optimizing Cypher Queries in Neo4j

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969", " INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED168", "season", " UNNAMED125", "player", "team", " UNNAMED152", " UNNAMED51", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(stats)-[' UNNAMED152']-(team),(player)-[' UNNAMED168']-(stats)", _rows=5192, _db_hits=0)

==> PatternMatch(g="(stats)-[' UNNAMED125']-(game)", _rows=10394, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=380, _db_hits=380)

==> PatternMatch(g="(season)-[' UNNAMED51']-(game)", _rows=380, _db_hits=1140)

==> TraversalMatcher(trail="(game)-[ UNNAMED12:away_team WHERE true AND true]->(team)", _rows=1140, _db_hits=1140)

Look for teams first

Page 40: Optimizing Cypher Queries in Neo4j

Filter games a bit earlier

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

148 ms w/ ~900 players, ~20 teams, ~400 games

Michael Hunger
profiling ? at least add db-hits to your blue stats
Page 41: Optimizing Cypher Queries in Neo4j

Filter out stats with no goals

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
WHERE stats.goals > 0
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, COLLECT(DISTINCT team.name), SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

59 ms w/ ~900 players, ~20 teams, ~400 games

Michael Hunger
db-hits / profile
Mark Needham
Wes: should we talk about 'rows' as well or is that too deep? Generally you want to keep the row count as close to your final output as possible. If row count increases dramatically it can suggest the query isn't optimal.
Mark Needham
I want to say something like: db_hits is good for getting you started but after you've fixed that it's all about experimenting and cardinalities
Page 42: Optimizing Cypher Queries in Neo4j

Movie query optimization

MATCH (movie:Movie {title: {title} })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight, count(DISTINCT actormovies) as actormoviesweight, movie, collect(DISTINCT genre.name) as genres, collect(DISTINCT director.name) as directors, actor, collect(DISTINCT writer.name) as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie, collect(DISTINCT {related: {title: related.title}, weight: weight}) as related, genres, directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related, actors, genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers

Michael Hunger
the stats, and profile info?
Mark Needham
We haven't finished with the football optimisation yet! Can we move this down?
Michael Hunger
Yep
Page 43: Optimizing Cypher Queries in Neo4j

Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight, count(DISTINCT actormovies) as actormoviesweight, movie, collect(DISTINCT genre.name) as genres, collect(DISTINCT director.name) as directors, actor, collect(DISTINCT writer.name) as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie, collect(DISTINCT {related: {title: related.title}, weight: weight}) as related, genres, directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related, actors, genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers

Page 54: Optimizing Cypher Queries in Neo4j

Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)
WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight // 1 row per actor
ORDER BY actormoviesweight DESC
WITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, actors, collect(genre) as genres // 1 row
MATCH (director)-[:DIRECTED]->(movie)
WITH movie, actors, genres, collect(director.name) as directors // 1 row
MATCH (writer)-[:WRITER_OF]->(movie)
WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres, directors, actors, writers // 1 row per related movie
ORDER BY keywords DESC
WITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related, movie, actors, genres, directors, writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)
RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers

10x faster

Michael Hunger
the stats, and profile info?
Mark Needham
We need to talk through the differences between the two versions - it's not obvious at first glance
Michael Hunger
Highlight them, make them bigger font and gray the rest out
Page 62: Optimizing Cypher Queries in Neo4j

Design for Queryability

Model

Page 63: Optimizing Cypher Queries in Neo4j

Design for Queryability

Query

Page 64: Optimizing Cypher Queries in Neo4j

Design for Queryability

Model

Page 65: Optimizing Cypher Queries in Neo4j

Making the implicit explicit

• When you have implicit relationships in the graph you can sometimes get better query performance by modeling the relationship explicitly

Michael Hunger
"introduce shortcuts" "materialize virtual paths" model data structures as part of the graph (list, tree, trie, sorting, ...)
Wes Freeman
I added this word
Wes Freeman
(sometimes)
Page 66: Optimizing Cypher Queries in Neo4j

Making the implicit explicit

Michael Hunger
Having worked together as colleagues implies knowing each other. Making that explicit makes path finding easier and faster, and also simplifies the pattern that you have to write. Could that be a general rule: "the more complex the pattern you're writing, the slower your query will be"? Probably not true for MATCH (n)-[*]-(m) :)
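
A minimal sketch of the colleagues example, assuming a hypothetical model with :Person and :Company nodes, WORKED_AT relationships, and a KNOWS relationship introduced as the shortcut. None of these names come from the slides.

```cypher
// Implicit: "knows" has to be derived through the shared company on every query.
MATCH (a:Person {name: 'Alice'})-[:WORKED_AT]->(:Company)<-[:WORKED_AT]-(b:Person)
RETURN DISTINCT b.name;

// One-off refactoring: materialize the implied relationship.
// id(a) < id(b) keeps one direction per pair; MERGE avoids duplicates
// when two people share more than one company.
MATCH (a:Person)-[:WORKED_AT]->(:Company)<-[:WORKED_AT]-(b:Person)
WHERE id(a) < id(b)
MERGE (a)-[:KNOWS]->(b);

// Explicit: a single hop, matched in either direction.
MATCH (a:Person {name: 'Alice'})-[:KNOWS]-(b:Person)
RETURN b.name;
```

The trade-off is that the explicit relationship has to be kept up to date whenever WORKED_AT relationships are added or removed.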
Page 67: Optimizing Cypher Queries in Neo4j

Refactor property to node

Bad:
MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
RETURN g

Michael Hunger
Explain that "date" can be modeled as a season
Mark Needham
I'll fill in times for these two versions
Page 68: Optimizing Cypher Queries in Neo4j

Refactor property to node

Good:
MATCH (s:Season)-[:contains]->(g)
WHERE s.name = "2012-2013"
RETURN g
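
The refactoring itself can be sketched as a one-off migration. This assumes that the timestamp range from the "Bad" query is exactly what defines the "2012-2013" season. It reuses the :Season label and contains relationship type from the "Good" query.

```cypher
// Create the season node once (MERGE makes the statement safe to re-run).
MERGE (s:Season {name: "2012-2013"})
WITH s
// This scan over all games is paid once, at migration time,
// instead of on every read.
MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
MERGE (s)-[:contains]->(g);
```

After this runs, season queries start from one :Season node and follow its relationships rather than filtering every :Game by date.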

Page 69: Optimizing Cypher Queries in Neo4j

Conclusion

• Avoid the global scan

• Add indexes / unique constraints

• Split up MATCH statements

• Measure, measure, measure, tweak, repeat

• Soon Cypher will do a lot of this for you!
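
As a sketch of the index and measurement points, using the :Movie(title) lookup from the earlier slides. The statements use the schema syntax of the 2.0 release that this talk targets. Later Neo4j versions changed the index and constraint syntax, and how profiling is invoked depends on the version and client, so treat the exact keywords as assumptions to check against your installation.

```cypher
// Index so that the lookup by title does not scan every :Movie node.
CREATE INDEX ON :Movie(title);

// Alternative, only if titles are unique in your data (an assumption):
// a uniqueness constraint, which is also backed by an index.
// CREATE CONSTRAINT ON (m:Movie) ASSERT m.title IS UNIQUE;

// Measure before and after each tweak, with hot caches.
PROFILE
MATCH (movie:Movie {title: 'The Matrix'})<-[:ACTED_IN]-(actor)
RETURN actor.name;
```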

Page 70: Optimizing Cypher Queries in Neo4j

Bonus tip

• Use transactions / the transactional Cypher endpoint

Page 71: Optimizing Cypher Queries in Neo4j

Q & A

• If you have questions, send them in