Optimizing Cypher Queries in Neo4j

Post on 14-Jan-2015

DESCRIPTION

Mark and Wes will talk about Cypher optimization techniques based on real queries as well as the theoretical underlying processes. They'll start from the basics of "what not to do", and how to take advantage of indexes, and continue to the subtle ways of ordering MATCH/WHERE/WITH clauses for optimal performance as of the 2.0.0 release.

TRANSCRIPT

Optimizing Cypher Queries in Neo4j

Wes Freeman (@wefreema)

Mark Needham (@markhneedham)

Michael Hunger
do you have a slide that explains the combinatorial complexity of paths, i.e. 1000 friends, 2 hops -> 1M paths, 3 hops -> 1BN paths, and the need to filter down early and reduce the cardinality in between with DISTINCT?
Michael Hunger
in general some color coding would be cool: green headline or light-green background for the "good ones" and red headline or light-red background for the "bad ones"
Mark Needham
Let's just have the logo on the first and last page so we don't cut out space that we can use for content
Wes Freeman
is this better?
Mark Needham
slightly slower
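The path-explosion figures quoted in the review comment above (1000 friends, 2 hops -> 1M paths, 3 hops -> 1BN paths) can be reproduced with a little arithmetic. The sketch below is only an idealised model: it assumes every person has exactly the same number of friends, and the `population` cap used to mimic DISTINCT is a hypothetical figure, not something taken from the deck.

```python
def path_count(fanout, hops):
    """Upper bound on paths of exactly `hops` steps when every node has `fanout` neighbours."""
    return fanout ** hops


def rows_after_distinct(fanout, hops, population):
    """Rows left if the end nodes are de-duplicated: never more than the population size."""
    return min(path_count(fanout, hops), population)


for hops in (1, 2, 3):
    # population=50_000 is an assumed graph size, used only for illustration
    print(hops, "hops:", path_count(1000, hops), "paths,",
          rows_after_distinct(1000, hops, 50_000), "distinct end nodes at most")
```

The number of paths grows exponentially with the number of hops, while the number of distinct nodes cannot exceed the size of the graph, which is why reducing cardinality between hops pays off.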

Today's schedule

• Brief overview of cypher syntax

• Graph global vs Graph local queries

• Labels and indexes

• Optimization patterns

• Profiling cypher queries

• Applying optimization patterns

Michael Hunger
It would be cool if you had a slides on your blue stats, what they mean, and also what hardware you ran the queries on, and if it was the first run or subsequent run (perhaps show the effect of query building / cold-cache vs. hot-cache and precompiled queries)
Mark Needham
All of them are with hot caches. We can mention that
Michael Hunger
do you also plan to do anti-patterns? things that don't work so well in cypher, like unbounded var-length paths or cross-path predicates?
Michael Hunger
an intro? what, why, how?
Michael Hunger
a picture?
Wes Freeman
yep, plan to put one in. this was the result of 30 minutes of dumping stuff here :P

Cypher Syntax

• Statement parts
  o Optional: Querying part (MATCH|WHERE)
  o Optional: Updating part (CREATE|MERGE)
  o Optional: Returning part (WITH|RETURN)

• Parts can be chained together

Cypher Syntax - Refresher

MATCH (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
RETURN n, r, m

Michael Hunger
add p= and return p

Starting points

• Graph scan (global; potentially slow)

• Label scan (usually reserved for aggregation queries; not ideal)

• Label property index lookup (local; good!)

Introducing the football dataset

The 1.9 global scan

O(n), n = # of nodes

START pl = node(*) MATCH (pl)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

150ms w/ 30k nodes, 120k rels

Michael Hunger
I would probably use pl for player to not confuse it with p for path

The 2.0 global scan

MATCH (pl)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

130ms w/ 30k nodes, 120k rels

O(n), n = # of nodes

Why is it a global scan?

• Cypher is a pattern matching language

• It doesn't discriminate unless you tell it to
  o It must try to start at all nodes to find this pattern, as specified

Mark Needham
Wonder how we can flesh this out

Introduce a label

Label your starting points

CREATE (player:Player {name: "Wayne Rooney"} )

Michael Hunger
"p" vs "player"? consistency

Label scan

O(k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

80ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Indexes don't come for free

CREATE INDEX ON :Player(name)

OR

CREATE CONSTRAINT ON (pl:Player)
ASSERT pl.name IS UNIQUE

Michael Hunger
what does "don't come for free" mean? that they incur costs on write operations?

Index lookup

O(log k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

6ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Michael Hunger
Why log k ? Doesn't that depend on the index impl (b-tree, vs. hash vs. xxx)? And isn't it log (tree-level) ? which is about constant
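The three starting points (global scan, label scan, index lookup) can be mimicked in memory to see how much work each one does. This is a toy model rather than how Neo4j stores data: the node counts are chosen to resemble the ~30k nodes and ~900 :Player nodes on the slides, and a Python dict stands in for the index, so the lookup here is a hash probe rather than the O(log k) shown on the slide. As the comment above notes, the real cost depends on the index implementation.

```python
from collections import defaultdict

# Toy graph: each node is (labels, properties). Sizes are illustrative only.
nodes = [({"Player"}, {"name": f"player-{i}"}) for i in range(900)]
nodes += [({"Stats"}, {"goals": i % 3}) for i in range(29_100)]
nodes[0][1]["name"] = "Wayne Rooney"


def global_scan(name):
    """Visit every node in the graph: O(n)."""
    checked, hits = 0, []
    for _labels, props in nodes:
        checked += 1
        if props.get("name") == name:
            hits.append(props)
    return hits, checked


# label -> nodes carrying that label (stand-in for a label scan)
by_label = defaultdict(list)
for labels, props in nodes:
    for label in labels:
        by_label[label].append(props)


def label_scan(label, name):
    """Visit only the nodes with the given label: O(k)."""
    checked, hits = 0, []
    for props in by_label[label]:
        checked += 1
        if props.get("name") == name:
            hits.append(props)
    return hits, checked


# name -> :Player nodes (stand-in for CREATE INDEX ON :Player(name))
name_index = defaultdict(list)
for props in by_label["Player"]:
    name_index[props["name"]].append(props)


def index_lookup(name):
    """Jump straight to the matching nodes with a single probe."""
    return name_index.get(name, []), 1


for lookup in (global_scan("Wayne Rooney"),
               label_scan("Player", "Wayne Rooney"),
               index_lookup("Wayne Rooney")):
    print(len(lookup[0]), "match,", lookup[1], "nodes or index entries touched")
```

All three strategies return the same node; only the amount of data touched on the way differs.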

Optimization Patterns

• Avoid cartesian products

• Avoid patterns in the WHERE clause

• Start MATCH patterns at the lowest cardinality and expand outward

• Separate MATCH patterns with minimal expansion at each stage

Introducing the movie data set

Anti-pattern: Cartesian Products

MATCH (m:Movie), (p:Person)

Michael Hunger
would be cool to have the blue stats here too and also what the cartesian product results to

Subtle Cartesian Products

MATCH (p:Person)-[:KNOWS]->(c)
WHERE p.name="Tom Hanks"
WITH c
MATCH (k:Keyword)
RETURN c, k

Michael Hunger
would be cool to have the blue stats here too and also what the cartesian product results to

Counting Cartesian Products

MATCH (pl:Player),(t:Team),(g:Game)
RETURN COUNT(DISTINCT pl), COUNT(DISTINCT t), COUNT(DISTINCT g)

80000 ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger
would be cool to have what the cartesian product results to i.e. 900x40x1200 = 43M

Better Counting

MATCH (pl:Player)
WITH COUNT(pl) as players
MATCH (t:Team)
WITH COUNT(t) as teams, players
MATCH (g:Game)
RETURN COUNT(g) as games, teams, players

8ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger
would love to see a few slides about the power of with first (aggregate, order, limit, change cardinality) and how it can be used to separate query parts or unwanted filter expressions on paths etc.
Michael Hunger
why? b/c you reduce the cardinality to 1 again with with!!
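The gap between the two counting queries comes down to how many rows each one has to produce. A rough sketch of that arithmetic, using the approximate label sizes from the slides, is shown below; the tiny `itertools.product` example uses made-up sizes purely to show that the distinct counts come out the same either way.

```python
from itertools import product

players, teams, games = 900, 40, 1200  # approximate sizes from the slides

# One MATCH with three disconnected patterns: every combination becomes a row.
cartesian_rows = players * teams * games

# Three MATCH clauses, each collapsed to one row by WITH COUNT(...):
# each label is scanned once and the row count drops back to 1 in between.
separate_rows = players + teams + games

print(cartesian_rows)  # 43200000
print(separate_rows)   # 2140

# Tiny demonstration (made-up sizes) that the product only adds redundant rows.
rows = list(product(range(3), range(2), range(4)))
distinct_counts = tuple(len({row[i] for row in rows}) for i in range(3))
print(len(rows), distinct_counts)  # 24 (3, 2, 4)
```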

Directions on patterns

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = "Tom Hanks"
RETURN m

Michael Hunger
show the stats with and w/o direction (db-hits)
Mark Needham
This one was going to be more in passing - weren't going into depth here so didn't pull out profile stats
Michael Hunger
note the missing direction somehow, perhaps with a gray arrow-tip ?

Parameterize your queries

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = {name}
RETURN m
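One reason parameters help is that a query whose text never changes can reuse a previously compiled plan. The sketch below illustrates that idea with a plain dict keyed by query text; `compile_or_reuse` is a hypothetical helper written for this example and is not part of any Neo4j API.

```python
def compile_or_reuse(cache, query_text):
    """Return True if a plan for this exact text was already cached, else 'compile' it."""
    if query_text in cache:
        return True
    cache[query_text] = object()  # placeholder for a compiled plan
    return False


names = ["Tom Hanks", "Meg Ryan", "Keanu Reeves"]

# Literal values baked into the text: every name yields a different query string.
literal_cache = {}
for name in names:
    compile_or_reuse(
        literal_cache,
        f'MATCH (p:Person)-[:ACTED_IN]-(m) WHERE p.name = "{name}" RETURN m',
    )

# Parameterized text: the string is identical for every name.
param_cache = {}
reused = [
    compile_or_reuse(
        param_cache,
        "MATCH (p:Person)-[:ACTED_IN]-(m) WHERE p.name = {name} RETURN m",
    )
    for _ in names
]

print(len(literal_cache), "plans for literal queries")       # 3
print(len(param_cache), "plan for the parameterized query")  # 1
```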

Fast predicates first

Bad:
MATCH (t:Team)-[:played_in]->(g)
WHERE NOT (t)-[:home_team]->(g) AND g.away_goals > g.home_goals
RETURN t, COUNT(g)

Michael Hunger
highlight the fast predicate in green and the slow one in red, make them bold
Mark Needham
This is the bit we're thinking of removing?
Michael Hunger
Why?

Fast predicates first

Better:
MATCH (t:Team)-[:played_in]->(g)
WHERE g.away_goals > g.home_goals AND NOT (t)-[:home_team]->(g)
RETURN t, COUNT(g)

Michael Hunger
same as above
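The effect of predicate order can be shown with ordinary short-circuit evaluation. In the sketch below the property comparison is the cheap predicate and `played_away` stands in for the pattern check that has to touch the graph; the game data is invented for the example, and the sketch assumes predicates are evaluated left to right.

```python
# 100 made-up games: teams alternate as home side, every fourth game is an away win.
games = [
    {
        "home_team": "A" if i % 2 == 0 else "B",
        "home_goals": 1,
        "away_goals": 2 if i % 4 == 1 else 0,
    }
    for i in range(100)
]


def away_wins(team, games, cheap_first):
    """Count away wins for `team`; also report how often the expensive check ran."""
    expensive_calls = 0

    def is_away_win(game):
        # Cheap: compares two properties already at hand.
        return game["away_goals"] > game["home_goals"]

    def played_away(game):
        # Expensive stand-in for NOT (t)-[:home_team]->(g).
        nonlocal expensive_calls
        expensive_calls += 1
        return game["home_team"] != team

    if cheap_first:
        matched = [g for g in games if is_away_win(g) and played_away(g)]
    else:
        matched = [g for g in games if played_away(g) and is_away_win(g)]
    return len(matched), expensive_calls


print(away_wins("A", games, cheap_first=False))  # (25, 100)
print(away_wins("A", games, cheap_first=True))   # (25, 25)
```

Both orders return the same count; putting the cheap predicate first simply means the expensive one runs only for rows that survived it.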

Patterns in WHERE clauses

• Keep them in the MATCH

• The only pattern that needs to be in a WHERE clause is a NOT

Mark Needham
Should I add an example here? Perhaps I can show the speed of a query with a pattern match in the WHERE
Michael Hunger
yes please

MERGE and CONSTRAINTs

• MERGE is MATCH or CREATE

• MERGE can take advantage of unique constraints and indexes

Michael Hunger
add a bunch of slides that explains merge (the two sides of merge) -> create rels and create subgraphs
Michael Hunger
only with a constraint it is an atomic operation, otherwise it doesn't take an index lock and is just a "best-try"
Mark Needham
I will add an example of this

MERGE (without index)

MERGE (g:Game

{date:1290257100,

time: 1245,

home_goals: 2,

away_goals: 3,

match_id: 292846,

attendance: 60102})

RETURN g

188 ms w/ ~400 games

Michael Hunger
make clearer that this is w/o index or constraint

Adding an index

CREATE INDEX ON :Game(match_id)

MERGE (with index)

MERGE (g:Game

{date:1290257100,

time: 1245,

home_goals: 2,

away_goals: 3,

match_id: 292846,

attendance: 60102})

RETURN g

6 ms w/ ~400 games

Alternative MERGE approach

MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.date = 1290257100,
              g.time = 1245,
              g.home_goals = 2,
              g.away_goals = 3,
              g.attendance = 60102
RETURN g

Mark Needham
Wes: I've added two versions of MERGE as we discussed. Assuming that you'll explain the difference between the approaches i.e. you'd take over for this slide :D
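The alternative approach is essentially "get or create by key, and set the remaining properties only on creation". A minimal in-memory sketch of that behaviour follows; the dict plays the role of the index on :Game(match_id), and `merge_game` is a hypothetical helper for illustration, not a Neo4j API.

```python
# match_id -> game; stands in for the index (or unique constraint) on :Game(match_id)
games_by_match_id = {}


def merge_game(match_id, **on_create):
    """Return (game, created). `on_create` properties are applied only when creating."""
    game = games_by_match_id.get(match_id)
    if game is not None:
        return game, False  # matched: existing properties are left untouched
    game = {"match_id": match_id, **on_create}
    games_by_match_id[match_id] = game
    return game, True


first, created_first = merge_game(
    292846, date=1290257100, time=1245,
    home_goals=2, away_goals=3, attendance=60102,
)
# A second MERGE on the same key finds the existing game even though
# the supplied attendance differs.
second, created_second = merge_game(292846, attendance=0)

print(created_first, created_second)  # True False
print(second["attendance"])           # 60102
```

Merging on the full property map instead would treat any differing property as "no match" and create a second game, which is why keying the MERGE on the identifying property alone is usually the safer choice.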

Profiling queries

• Use the PROFILE keyword in front of the query
  o from webadmin or shell - won't work in browser

• Look for db_hits and rows

• Ignore everything else (for now!)

Michael Hunger
but only in the shell, profile doesn't work in browser !
Mark Needham
Good point!

Reviewing the football dataset

Football Optimization

MATCH (game)<-[:contains_match]-(season:Season),

(team)<-[:away_team]-(game),

(stats)-[:in]->(game),

(team)<-[:for]-(stats)<-[:played]-(player)

WHERE season.name = "2012-2013"

RETURN player.name,

COLLECT(DISTINCT team.name),

SUM(stats.goals) as goals

ORDER BY goals DESC

LIMIT 10

3137 ms w/ ~900 players, ~20 teams, ~400 games

Football Optimization

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a", " INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED108", "season", " UNNAMED55", "player", "team", " UNNAMED124", " UNNAMED85", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(player)-[' UNNAMED124']-(stats)", _rows=5192, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=5192, _db_hits=15542)

==> TraversalMatcher(trail="(season)-[ UNNAMED12:contains_match WHERE true AND true]->(game)<-[ UNNAMED85:in WHERE true AND true]-(stats)-[ UNNAMED108:for WHERE true AND true]->(team)<-[ UNNAMED55:away_team WHERE true AND true]-(game)", _rows=15542, _db_hits=1620462)

Michael Hunger
make db-hits bold and increase the font size, gray out the rest in a lighter gray
Mark Needham
Wes: why does the traversal matcher overdo it for this example? It covers way too much ground
Wes Freeman
because it's one giant match. it can handle the whole match.
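When a profile is this long it can help to pull out just the `_rows` and `_db_hits` counters and see which operator dominates. The sketch below does that with a regular expression; the operator lines are abbreviated from the profile output above (arguments other than the two counters are dropped), and the parsing approach is only one possible way to read such output.

```python
import re

# Operator names and counters abbreviated from the profile output above.
profile = """
ColumnFilter(_rows=10, _db_hits=0)
Top(_rows=10, _db_hits=0)
EagerAggregation(_rows=503, _db_hits=10899)
Extract(_rows=5192, _db_hits=5192)
PatternMatch(_rows=5192, _db_hits=0)
Filter(_rows=5192, _db_hits=15542)
TraversalMatcher(_rows=15542, _db_hits=1620462)
"""

OPERATOR = re.compile(r"^(\w+)\(.*_rows=(\d+), _db_hits=(\d+)\)$")


def parse_profile(text):
    """Return a list of (operator, rows, db_hits) tuples, one per plan line."""
    operators = []
    for line in text.strip().splitlines():
        match = OPERATOR.match(line.strip())
        if match:
            operators.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return operators


operators = parse_profile(profile)
total_db_hits = sum(db_hits for _name, _rows, db_hits in operators)
worst = max(operators, key=lambda op: op[2])

print("total db_hits:", total_db_hits)  # 1652095
print("most expensive operator:", worst[0], "with", worst[2], "db_hits")
```

Here the single TraversalMatcher accounts for almost all of the db_hits, which matches the observation that one giant MATCH is doing too much work.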

Break out the match statements

MATCH (game)<-[:contains_match]-(season:Season)

MATCH (team)<-[:away_team]-(game)

MATCH (stats)-[:in]->(game)

MATCH (team)<-[:for]-(stats)<-[:played]-(player)

WHERE season.name = "2012-2013"

RETURN player.name,

COLLECT(DISTINCT team.name),

SUM(stats.goals) as goals

ORDER BY goals DESC

LIMIT 10

200 ms w/ ~900 players, ~20 teams, ~400 games

Mark Needham
If these are in a different order then the performance is radically different. I just happen to have it in a logical order from the initial query
Michael Hunger
wanna put that comment in the notes?

Start small

• Smallest cardinality label first

• Smallest intermediate result set first

Exploring cardinalities

MATCH (game)<-[:contains_match]-(season:Season)

RETURN COUNT(DISTINCT game), COUNT(DISTINCT season)

1140 games, 3 seasons

MATCH (team)<-[:away_team]-(game:Game)

RETURN COUNT(DISTINCT team), COUNT(DISTINCT game)

25 teams, 1140 games

Exploring cardinalities

MATCH (stats)-[:in]->(game:Game)

RETURN COUNT(DISTINCT stats), COUNT(DISTINCT game)

31117 stats, 1140 games

MATCH (stats)<-[:played]-(player:Player)

RETURN COUNT(DISTINCT stats), COUNT(DISTINCT player)

31117 stats, 880 players

Michael Hunger
what do we learn from these cardinalities? Perhaps do a table on a separate slide? Where you can highlight (green) the low ones and red the high ones (like stats)
Michael Hunger
perhaps also re-display the model?
Mark Needham
we learn that we shouldn't start with stats because it'll unnecessarily load lots of data
Mark Needham
Starting at game--team or season--game makes more sense.
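The cardinalities above can be turned into a rough ranking of where to start. The sketch below uses a deliberately crude heuristic, ranking each pattern by the smaller of its two endpoint counts; that heuristic is an assumption made for illustration and not how Cypher plans queries.

```python
# Distinct counts reported on the "Exploring cardinalities" slides.
cardinality = {"season": 3, "team": 25, "player": 880, "game": 1140, "stats": 31117}

# The four patterns that were measured, as (left, right) endpoint names.
patterns = [("game", "season"), ("team", "game"), ("stats", "game"), ("stats", "player")]


def smallest_endpoint(pattern):
    """Crude cost estimate: the smaller endpoint is the cheapest place to start."""
    left, right = pattern
    return min(cardinality[left], cardinality[right])


ranked = sorted(patterns, key=smallest_endpoint)
for pattern in ranked:
    print(pattern, "-> start from", smallest_endpoint(pattern), "nodes")
```

The ranking puts season--game and team--game first, in line with the comments above, and leaves the stats patterns for later stages once the row count has already been narrowed.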

Look for teams first

MATCH (team)<-[:away_team]-(game:Game)
MATCH (game)<-[:contains_match]-(season)

WHERE season.name = "2012-2013"

MATCH (stats)-[:in]->(game)

MATCH (team)<-[:for]-(stats)<-[:played]-(player)

RETURN player.name,

COLLECT(DISTINCT team.name),

SUM(stats.goals) as goals

ORDER BY goals DESC

LIMIT 10

162 ms w/ ~900 players, ~20 teams, ~400 games

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969", " INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED168", "season", " UNNAMED125", "player", "team", " UNNAMED152", " UNNAMED51", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(stats)-[' UNNAMED152']-(team),(player)-[' UNNAMED168']-(stats)", _rows=5192, _db_hits=0)

==> PatternMatch(g="(stats)-[' UNNAMED125']-(game)", _rows=10394, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=380, _db_hits=380)

==> PatternMatch(g="(season)-[' UNNAMED51']-(game)", _rows=380, _db_hits=1140)

==> TraversalMatcher(trail="(game)-[ UNNAMED12:away_team WHERE true AND true]->(team)", _rows=1140, _db_hits=1140)

Look for teams first

Filter games a bit earlier

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)

MATCH (stats)-[:in]->(game)

MATCH (team)<-[:for]-(stats)<-[:played]-(player)

RETURN player.name,

COLLECT(DISTINCT team.name),

SUM(stats.goals) as goals

ORDER BY goals DESC

LIMIT 10

148 ms w/ ~900 players, ~20 teams, ~400 games

Michael Hunger
profiling ? at least add db-hits to your blue stats

Filter out stats with no goals

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
WHERE stats.goals > 0
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, COLLECT(DISTINCT team.name), SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

59 ms w/ ~900 players, ~20 teams, ~400 games

Michael Hunger
db-hits / profile
Mark Needham
Wes: should we talk about 'rows' as well or is that too deep? Generally you want to keep the row count as close to your final output as possible. If row count increases dramatically it can suggest the query isn't optimal.
Mark Needham
I want to say something like: db_hits is good for getting you started but after you've fixed that it's all about experimenting and cardinalities

Movie query optimization

MATCH (movie:Movie {title: {title} })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight,
     count(DISTINCT actormovies) as actormoviesweight, movie,
     collect(DISTINCT genre.name) as genres,
     collect(DISTINCT director.name) as directors, actor,
     collect(DISTINCT writer.name) as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie,
     collect(DISTINCT {related: {title: related.title}, weight: weight}) as related,
     genres, directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related,
     actors, genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers

Michael Hunger
the stats, and profile info?
Mark Needham
We haven't finished with the football optimisation yet! Can we move this down?
Michael Hunger
Yep

Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight,
     count(DISTINCT actormovies) as actormoviesweight, movie,
     collect(DISTINCT genre.name) as genres,
     collect(DISTINCT director.name) as directors, actor,
     collect(DISTINCT writer.name) as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie,
     collect(DISTINCT {related: {title: related.title}, weight: weight}) as related,
     genres, directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related,
     actors, genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers


Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)
WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight // 1 row per actor
ORDER BY actormoviesweight DESC
WITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, actors, collect(genre) as genres // 1 row
MATCH (director)-[:DIRECTED]->(movie)
WITH movie, actors, genres, collect(director.name) as directors // 1 row
MATCH (writer)-[:WRITER_OF]->(movie)
WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres, directors, actors, writers // 1 row per related movie
ORDER BY keywords DESC
WITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related, movie, actors, genres, directors, writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)
RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers

10x faster

Michael Hunger
the stats, and profile info?
Mark Needham
We need to talk through the differences between the two versions - it's not obvious at first glance
Michael Hunger
Highlight them, make them bigger font and gray the rest out
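The main difference between the two versions is how many rows are alive between clauses. Chained MATCH clauses multiply the row count, whereas a MATCH followed by WITH collect(...) grows the rows and then collapses them back to one. The sketch below models only that row arithmetic; the per-movie counts are hypothetical, not measured from the movie dataset.

```python
# Hypothetical number of matches per pattern for a single movie.
expansions = {"genres": 3, "directors": 2, "actors": 10, "writers": 2}


def peak_rows_chained(counts):
    """Chained MATCH clauses: every existing row is expanded by the next pattern."""
    rows, peak = 1, 1
    for count in counts:
        rows *= count
        peak = max(peak, rows)
    return peak


def peak_rows_collected(counts):
    """MATCH then WITH collect(...): rows grow to `count`, then collapse back to 1."""
    peak = 1
    for count in counts:
        peak = max(peak, 1 * count)  # expansion from a single row
        # collect(...) aggregates the rows back into 1 before the next MATCH
    return peak


counts = list(expansions.values())
print("chained MATCH peak rows:", peak_rows_chained(counts))    # 120
print("collect between MATCHes:", peak_rows_collected(counts))  # 10
```

With more patterns, or larger fan-out such as related movies via keywords, the chained version grows multiplicatively while the collected version stays bounded by the largest single expansion.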

Movie query optimizationMATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)

WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight // 1 row per actorORDER BY actormoviesweight DESCWITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row MATCH (movie)-[:HAS_GENRE]->(genre)WITH movie, actors, collect(genre) as genres // 1 row MATCH (director)-[:DIRECTED]->(movie)WITH movie, actors, genres, collect(director.name) as directors // 1 rowMATCH (writer)-[:WRITER_OF]->(movie)WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres, directors, actors, writers // 1 row per related movieORDER BY keywords DESCWITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related, movie, actors, genres, directors, writers // 1 rowMATCH (movie)-[:HAS_KEYWORD]->(keyword)RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers

10x faster

Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)
WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight
ORDER BY actormoviesweight DESC // 1 row per actor
WITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, actors, collect(genre) as genres // 1 row
MATCH (director)-[:DIRECTED]->(movie)
WITH movie, actors, genres, collect(director.name) as directors // 1 row
MATCH (writer)-[:WRITER_OF]->(movie)
WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres, directors, actors, writers // 1 row per related movie
ORDER BY keywords DESC
WITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related, movie, actors, genres, directors, writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)
RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers // 1 row

10x faster
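
The "1 row" comments in the query above mark the core pattern: aggregate with collect() before the next MATCH, so each new expansion starts from a single row instead of multiplying the rows produced so far. A minimal sketch of the pattern in isolation, reusing only labels and relationship types from the slide (timings are not claimed for this reduced query):

```cypher
// Without the intermediate WITH, a single MATCH over both patterns would
// produce (number of genres) x (number of directors) rows before aggregation.
MATCH (movie:Movie {title: 'The Matrix'})
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, collect(genre) AS genres              // collapsed to 1 row
MATCH (director)-[:DIRECTED]->(movie)             // expands from that 1 row
RETURN movie, genres, collect(director.name) AS directors   // 1 row
```

Note that a plain MATCH drops the row entirely if the movie has no genre or no director; whether that matters depends on the data.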

Design for Queryability

Model

Query

Model

Making the implicit explicit

• When you have implicit relationships in the graph you can sometimes get better query performance by modeling the relationship explicitly

Michael Hunger
"Introduce shortcuts", "materialize virtual paths", and model data structures (list, tree, trie, sorting, ...) as part of the graph.

Making the implicit explicit

Michael Hunger
Having worked together as colleagues implies knowing each other. Modeling that explicitly makes path finding easier and faster, and also simplifies the pattern that you have to write. Could that be a general rule: "the more complex the pattern you're writing, the slower your query will be"? Probably not true for MATCH (n)-[*]-(m) :)
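
A hedged sketch of the colleagues example from the comment above. The Person and Company labels, the WORKED_AT and KNOWS relationship types, and the names are all hypothetical, and the first statement assumes a release in which MERGE accepts relationship patterns:

```cypher
// One-off refactoring: materialize the implicit "colleagues know each other" fact.
// id(a) < id(b) avoids processing each pair twice.
MATCH (a:Person)-[:WORKED_AT]->(:Company)<-[:WORKED_AT]-(b:Person)
WHERE id(a) < id(b)
MERGE (a)-[:KNOWS]->(b);

// Later queries traverse the explicit relationship with a shorter pattern:
// each KNOWS hop replaces two WORKED_AT hops.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
      p = shortestPath((a)-[:KNOWS*..4]-(b))
RETURN p;
```

The trade-off is that the explicit relationships have to be kept up to date whenever the underlying data changes.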

Refactor property to node

Bad:
MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
RETURN g

Michael Hunger
Explain that "date" can be modeled as a season

Good:
MATCH (s:Season)-[:contains]->(g)
WHERE s.name = "2012-2013"
RETURN g
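
For the season query to work, the Season nodes have to exist. The following is one possible one-off refactoring, not necessarily the one used for the talk. It assumes g.date holds epoch seconds (the range runs roughly from August 2012 to May 2013) and that MERGE accepts relationship patterns in the release at hand:

```cypher
// Create the season node once (hypothetical naming scheme: "2012-2013").
MERGE (s:Season {name: "2012-2013"})
WITH s
// Pay for the scan over all games a single time, during the refactoring...
MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
// ...and record the result as an explicit relationship.
MERGE (s)-[:contains]->(g);
```

After this, finding a season's games is a local traversal from one Season node rather than a property comparison on every Game.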

Conclusion

• Avoid the global scan

• Add indexes / unique constraints

• Split up MATCH statements

• Measure, measure, measure, tweak, repeat

• Soon Cypher will do a lot of this for you!
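
A rough sketch of the index and measurement bullets above, written in the schema syntax of the 2.0-era releases this talk targets (later Neo4j versions changed the CREATE INDEX and CREATE CONSTRAINT syntax). The uniqueness constraint on :Genre(name) is an assumption for illustration; the slides do not say which properties were indexed.

```cypher
// Index so the starting Movie node is found by lookup rather than a label scan.
CREATE INDEX ON :Movie(title);

// Assumed for illustration: genre names are unique.
// A unique constraint is also backed by an index.
CREATE CONSTRAINT ON (g:Genre) ASSERT g.name IS UNIQUE;

// Measure: PROFILE runs the query and reports the work done by each step.
PROFILE
MATCH (movie:Movie {title: 'The Matrix'})<-[:ACTED_IN]-(actor)
RETURN count(actor);
```

Comparing the profile before and after a change is the "measure, tweak, repeat" loop in practice.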

Bonus tip

• Use transactions / the transactional Cypher endpoint

Q & A

• If you have questions, send them in
