Optimizing Cypher Queries in Neo4j

Wes Freeman (@wefreema)

Mark Needham (@markhneedham)

Michael Hunger

do you have a slide that explains the combinatorial complexity of paths, i.e. 1000 friends, 2 hops -> 1M paths, 3 hops -> 1BN paths, and the need to filter down early and reduce the cardinality in between with DISTINCT?

Michael Hunger

in general some color coding would be cool: green headline or light-green background for the "good ones" and red headline or light-red background for the "bad ones"

Mark Needham

Let's just have the logo on the first and last page so we don't cut out space that we can use for content

Wes Freeman

is this better?

Mark Needham

slightly slower

Today's schedule

• Brief overview of cypher syntax

• Graph global vs Graph local queries

• Labels and indexes

• Optimization patterns

• Profiling cypher queries

• Applying optimization patterns

Michael Hunger

It would be cool if you had a slide on your blue stats, what they mean, what hardware you ran the queries on, and whether it was the first run or a subsequent run (perhaps show the effect of query building / cold-cache vs. hot-cache and precompiled queries)

Mark Needham

All of them are with hot caches. We can mention that

Michael Hunger

do you also plan to do anti-patterns? things that don't work so well in cypher, like unbounded var-length paths or cross-path predicates?

Michael Hunger

an intro? what, why, how?

Michael Hunger

a picture?

Wes Freeman

yep, plan to put one in. this was the result of 30 minutes of dumping stuff here :P

Cypher Syntax

• Statement parts
  o Optional: Querying part (MATCH|WHERE)
  o Optional: Updating part (CREATE|MERGE)
  o Optional: Returning part (WITH|RETURN)

• Parts can be chained together

Cypher Syntax - Refresher

MATCH (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
RETURN n, r, m
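As a variant of the refresher query, the whole pattern can be bound to a path variable and returned as a single value. This is only a sketch: it reuses the placeholder label, relationship type and property from the slide, and `p` is an illustrative name.

```cypher
// Bind the matched pattern to a path variable p and return the path itself.
// Label, LINKED and prop are the placeholders from the refresher query.
MATCH p = (n:Label)-[r:LINKED]->(m)
WHERE n.prop = "..."
RETURN p
```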

Michael Hunger

add p= and return p

Starting points

• Graph scan (global; potentially slow)

• Label scan (usually reserved for aggregation queries; not ideal)

• Label property index lookup (local; good!)

Introducing the football dataset

The 1.9 global scan

O(n), n = # of nodes

START pl = node(*) MATCH (pl)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

150ms w/ 30k nodes, 120k rels

Michael Hunger

I would probably use pl for player to not confuse it with p for path

The 2.0 global scan

MATCH (pl)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

130ms w/ 30k nodes, 120k rels

O(n), n = # of nodes

Why is it a global scan?

• Cypher is a pattern matching language

• It doesn't discriminate unless you tell it to

• It must try to start at all nodes to find this pattern, as specified

Mark Needham

Wonder how we can flesh this out

Introduce a label

Label your starting points

CREATE (player:Player {name: "Wayne Rooney"} )

Michael Hunger

"p" vs "player"? consistency

Label scan

O(k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

80ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Indexes don't come for free

CREATE INDEX ON :Player(name)

OR

CREATE CONSTRAINT ON (pl:Player) ASSERT pl.name IS UNIQUE

Michael Hunger

what does "don't come for free" mean? that they incur costs on write operations?

Index lookup

O(log k), k = # of nodes with that label

MATCH (pl:Player)-[:played]->(stats) WHERE pl.name = "Wayne Rooney" RETURN stats

6ms w/ 30k nodes, 120k rels (~900 :Player nodes)

Michael Hunger

Why log k? Doesn't that depend on the index impl (b-tree vs. hash vs. xxx)? And isn't it log(tree-level), which is about constant?

Optimization Patterns

• Avoid cartesian products

• Avoid patterns in the WHERE clause

• Start MATCH patterns at the lowest cardinality and expand outward

• Separate MATCH patterns with minimal expansion at each stage
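The last two bullets can be sketched as a two-hop query that starts from a single node and shrinks the intermediate result before expanding again. This assumes a hypothetical Person/KNOWS graph with an index on :Person(name), and it uses the `{name}` parameter style that appears elsewhere in these slides.

```cypher
// Start at the lowest-cardinality point: one person, found via a
// (hypothetical) index on :Person(name).
MATCH (me:Person {name: {name}})-[:KNOWS]->(friend)
// Collapse duplicate rows before expanding another hop,
// so the second MATCH starts from each friend only once.
WITH DISTINCT friend
MATCH (friend)-[:KNOWS]->(fof)
RETURN COUNT(DISTINCT fof) AS friends_of_friends
```

How much the intermediate DISTINCT helps depends on how many duplicate rows the first stage produces, so it is worth measuring on your own data.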

Introducing the movie data set

Anti-pattern: Cartesian Products

MATCH (m:Movie), (p:Person)

Michael Hunger

would be cool to have the blue stats here too and also what the cartesian product results to

Subtle Cartesian Products

MATCH (p:Person)-[:KNOWS]->(c)
WHERE p.name = "Tom Hanks"
WITH c
MATCH (k:Keyword)
RETURN c, k

Michael Hunger

would be cool to have the blue stats here too and also what the cartesian product results to

Counting Cartesian Products

MATCH (pl:Player), (t:Team), (g:Game)
RETURN COUNT(DISTINCT pl), COUNT(DISTINCT t), COUNT(DISTINCT g)

80000 ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger

would be cool to have what the cartesian product results to i.e. 900x40x1200 = 43M

Better Counting

MATCH (pl:Player)
WITH COUNT(pl) as players
MATCH (t:Team)
WITH COUNT(t) as teams, players
MATCH (g:Game)
RETURN COUNT(g) as games, teams, players

8ms w/ ~900 players, ~40 teams, ~1200 games

Michael Hunger

would love to see a few slides about the power of with first (aggregate, order, limit, change cardinality) and how it can be used to separate query parts or unwanted filter expressions on paths etc.

Michael Hunger

why? b/c you reduce the cardinality to 1 again with WITH!!

Directions on patterns

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = "Tom Hanks"
RETURN m

Michael Hunger

show the stats with and w/o direction (db-hits)

Mark Needham

This one was going to be more in passing - weren't going into depth here so didn't pull out profile stats

Michael Hunger

note the missing direction somehow, perhaps with a gray arrow-tip ?

Parameterize your queries

MATCH (p:Person)-[:ACTED_IN]-(m)
WHERE p.name = {name}
RETURN m

Fast predicates first

Bad:

MATCH (t:Team)-[:played_in]->(g)
WHERE NOT (t)-[:home_team]->(g) AND g.away_goals > g.home_goals
RETURN t, COUNT(g)

Michael Hunger

highlight the fast predicate in green and the slow one in red, make them bold

Mark Needham

This is the bit we're thinking of removing?

Michael Hunger

Why?

Fast predicates first

Better:

MATCH (t:Team)-[:played_in]->(g)
WHERE g.away_goals > g.home_goals AND NOT (t)-[:home_team]->(g)
RETURN t, COUNT(g)

Michael Hunger

same as above

Patterns in WHERE clauses

• Keep them in the MATCH

• The only pattern that needs to be in a WHERE clause is a NOT
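A sketch of what the two bullets mean in practice, reusing the relationship names and directions from the "Fast predicates first" slide. The three statements are meant to be run separately. The first two should return the same rows, assuming there is at most one home_team relationship between a team and a game. No timings are claimed here.

```cypher
// Pattern used as a predicate in WHERE: it is checked for every
// (t, g) row that the MATCH produces.
MATCH (t:Team)-[:played_in]->(g)
WHERE (t)-[:home_team]->(g)
RETURN t, COUNT(g)

// The same requirement expressed in the MATCH itself.
MATCH (t:Team)-[:played_in]->(g), (t)-[:home_team]->(g)
RETURN t, COUNT(g)

// A negated pattern has no MATCH equivalent, so it stays in WHERE.
MATCH (t:Team)-[:played_in]->(g)
WHERE NOT (t)-[:home_team]->(g)
RETURN t, COUNT(g)
```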

Mark Needham

Should I add an example here? Perhaps I can show the speed of a query with a pattern match in the WHERE

Michael Hunger

yes please

MERGE and CONSTRAINTs

• MERGE is MATCH or CREATE

• MERGE can take advantage of unique constraints and indexes
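A minimal sketch of MERGE backed by a unique constraint, using the Game node and property values from the football dataset in this deck. The constraint statement uses the Neo4j 2.x syntax of these slides, and the two statements are run separately. Treat it as an illustration rather than a benchmark.

```cypher
// Unique constraint on the identifying property (Neo4j 2.x syntax).
CREATE CONSTRAINT ON (g:Game) ASSERT g.match_id IS UNIQUE

// MERGE on the constrained property only; the remaining properties
// are set only when the node is newly created.
MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.home_goals = 2, g.away_goals = 3
RETURN g
```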

Michael Hunger

add a bunch of slides that explains merge (the two sides of merge) -> create rels and create subgraphs

Michael Hunger

only with a constraint it is an atomic operation, otherwise it doesn't take an index lock and is just a "best-try"

Mark Needham

I will add an example of this

MERGE (without index)

MERGE (g:Game {
  date: 1290257100,
  time: 1245,
  home_goals: 2,
  away_goals: 3,
  match_id: 292846,
  attendance: 60102
})
RETURN g

188 ms w/ ~400 games

Michael Hunger

make clearer that this is w/o index or constraint

Adding an index

CREATE INDEX ON :Game(match_id)

MERGE (with index)

MERGE (g:Game {
  date: 1290257100,
  time: 1245,
  home_goals: 2,
  away_goals: 3,
  match_id: 292846,
  attendance: 60102
})
RETURN g

6 ms w/ ~400 games

Alternative MERGE approach

MERGE (g:Game { match_id: 292846 })
ON CREATE SET g.date = 1290257100
SET g.time = 1245
SET g.home_goals = 2
SET g.away_goals = 3
SET g.attendance = 60102
RETURN g

Mark Needham

Wes: I've added two versions of MERGE as we discussed. Assuming that you'll explain the difference between the approaches i.e. you'd take over for this slide :D

Profiling queries

• Use the PROFILE keyword in front of the query
  o from webadmin or shell - won't work in browser

• Look for db_hits and rows

• Ignore everything else (for now!)
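For example, profiling the player lookup from earlier in the deck looks like this (a sketch; in the shell the statement also needs a terminating semicolon):

```cypher
// Prefix the query with PROFILE; the plan printed after the results
// reports _rows and _db_hits for each step.
PROFILE MATCH (pl:Player)-[:played]->(stats)
WHERE pl.name = "Wayne Rooney"
RETURN stats
```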

Michael Hunger

but only in the shell, profile doesn't work in browser !

Mark Needham

Good point!

Reviewing the football dataset

Football Optimization

MATCH (game)<-[:contains_match]-(season:Season),
      (team)<-[:away_team]-(game),
      (stats)-[:in]->(game),
      (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

3137 ms w/ ~900 players, ~20 teams, ~400 games

Football Optimization

==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a", " INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEe91b055b-a943-4ddd-9fe8-e746407c504a,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE240cfcd2-24d9-48a2-8ca9-fb0286f3d323,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED108", "season", " UNNAMED55", "player", "team", " UNNAMED124", " UNNAMED85", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(player)-[' UNNAMED124']-(stats)", _rows=5192, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=5192, _db_hits=15542)

==> TraversalMatcher(trail="(season)-[ UNNAMED12:contains_match WHERE true AND true]->(game)<-[ UNNAMED85:in WHERE true AND true]-(stats)-[ UNNAMED108:for WHERE true AND true]->(team)<-[ UNNAMED55:away_team WHERE true AND true]-(game)", _rows=15542, _db_hits=1620462)

Michael Hunger

make db-hits bold and increase the font size, gray out the rest in a lighter gray

Mark Needham

Wes: why does the traversal matcher overdo it for this example? It covers way too much ground

Wes Freeman

because it's one giant match. it can handle the whole match.

Break out the match statements

MATCH (game)<-[:contains_match]-(season:Season)
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

200 ms w/ ~900 players, ~20 teams, ~400 games

Mark Needham

If these are in a different order then the performance is radically different. I just happen to have it in a logical order from the initial query

Michael Hunger

wanna put that comment in the notes?

Start small

• Smallest cardinality label first

• Smallest intermediate result set first

Exploring cardinalities

MATCH (game)<-[:contains_match]-(season:Season)

RETURN COUNT(DISTINCT game), COUNT(DISTINCT season)

1140 games, 3 seasons

MATCH (team)<-[:away_team]-(game:Game)

RETURN COUNT(DISTINCT team), COUNT(DISTINCT game)

25 teams, 1140 games

Exploring cardinalities

MATCH (stats)-[:in]->(game:Game)

RETURN COUNT(DISTINCT stats), COUNT(DISTINCT game)

31117 stats, 1140 games

MATCH (stats)<-[:played]-(player:Player)

RETURN COUNT(DISTINCT stats), COUNT(DISTINCT player)

31117 stats, 880 players

Michael Hunger

what do we learn from these cardinalities? Perhaps do a table on a separate slide? Where you can highlight (green) the low ones and red the high ones (like stats)

Michael Hunger

perhaps also re-display the model?

Mark Needham

we learn that we shouldn't start with stats because it'll unnecessarily load lots of data

Mark Needham

Starting at game--team or season--game makes more sense.

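Look for teams first

MATCH (team)<-[:away_team]-(game:Game)
MATCH (game)<-[:contains_match]-(season)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
WHERE season.name = "2012-2013"
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10
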
==> ColumnFilter(symKeys=["player.name", " INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969", " INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5"], returnItemNames=["player.name", "COLLECT(DISTINCT team.name)", "goals"], _rows=10, _db_hits=0)

==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5 of type Number),false)"], limit="Literal(10)", _rows=10, _db_hits=0)

==> EagerAggregation(keys=["Cached(player.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEbb08f36b-a70d-46b3-9297-b0c7ec85c969,Distinct(Collect(Property(team,name(0))),Property(team,name(0))))", "( INTERNAL_AGGREGATE199af213-e3bd-400f-aba9-8ca2a9e153c5,Sum(Property(stats,goals(13))))"], _rows=503, _db_hits=10899)

==> Extract(symKeys=["stats", " UNNAMED12", " UNNAMED168", "season", " UNNAMED125", "player", "team", " UNNAMED152", " UNNAMED51", "game"], exprKeys=["player.name"], _rows=5192, _db_hits=5192)

==> PatternMatch(g="(stats)-[' UNNAMED152']-(team),(player)-[' UNNAMED168']-(stats)", _rows=5192, _db_hits=0)

==> PatternMatch(g="(stats)-[' UNNAMED125']-(game)", _rows=10394, _db_hits=0)

==> Filter(pred="Property(season,name(0)) == Literal(2012-2013)", _rows=380, _db_hits=380)

==> PatternMatch(g="(season)-[' UNNAMED51']-(game)", _rows=380, _db_hits=1140)

==> TraversalMatcher(trail="(game)-[ UNNAMED12:away_team WHERE true AND true]->(team)", _rows=1140, _db_hits=1140)

Look for teams first

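Filter games a bit earlier

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name,
       COLLECT(DISTINCT team.name),
       SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10
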
Michael Hunger

profiling ? at least add db-hits to your blue stats

Filter out stats with no goals

MATCH (game)<-[:contains_match]-(season:Season)
WHERE season.name = "2012-2013"
MATCH (team)<-[:away_team]-(game)
MATCH (stats)-[:in]->(game)
WHERE stats.goals > 0
MATCH (team)<-[:for]-(stats)<-[:played]-(player)
RETURN player.name, COLLECT(DISTINCT team.name), SUM(stats.goals) as goals
ORDER BY goals DESC
LIMIT 10

59 ms w/ ~900 players, ~20 teams, ~400 games

Michael Hunger

db-hits / profile

Mark Needham

Wes: should we talk about 'rows' as well or is that too deep? Generally you want to keep the row count as close to your final output as possible. If row count increases dramatically it can suggest the query isn't optimal.

Mark Needham

I want to say something like: db_hits is good for getting you started but after you've fixed that it's all about experimenting and cardinalities

Movie query optimization

MATCH (movie:Movie {title: {title} })
MATCH (genre)<-[:HAS_GENRE]-(movie)
MATCH (director)-[:DIRECTED]->(movie)
MATCH (actor)-[:ACTED_IN]->(movie)
MATCH (writer)-[:WRITER_OF]->(movie)
MATCH (actor)-[:ACTED_IN]->(actormovies)
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword) as weight,
     count(DISTINCT actormovies) as actormoviesweight, movie,
     collect(DISTINCT genre.name) as genres,
     collect(DISTINCT director.name) as directors, actor,
     collect(DISTINCT writer.name) as writers
ORDER BY weight DESC, actormoviesweight DESC
WITH collect(DISTINCT {name: actor.name, weight: actormoviesweight}) as actors, movie,
     collect(DISTINCT {related: {title: related.title}, weight: weight}) as related,
     genres, directors, writers
MATCH (movie)-[:HAS_KEYWORD]->(keyword:Keyword)<-[:HAS_KEYWORD]-(movies)
WITH keyword.name as keyword, count(movies) as keyword_weight, movie, related, actors, genres, directors, writers
ORDER BY keyword_weight
RETURN collect(DISTINCT keyword), movie, actors, related, genres, directors, writers

Michael Hunger

the stats, and profile info?

Mark Needham

We haven't finished with the football optimisation yet! Can we move this down?

Michael Hunger

Yep


Movie query optimization

MATCH (movie:Movie {title: 'The Matrix' })<-[:ACTED_IN]-(actor)
WITH movie, actor, length((actor)-[:ACTED_IN]->()) as actormoviesweight // 1 row per actor
ORDER BY actormoviesweight DESC
WITH movie, collect({name: actor.name, weight: actormoviesweight}) as actors // 1 row
MATCH (movie)-[:HAS_GENRE]->(genre)
WITH movie, actors, collect(genre) as genres // 1 row
MATCH (director)-[:DIRECTED]->(movie)
WITH movie, actors, genres, collect(director.name) as directors // 1 row
MATCH (writer)-[:WRITER_OF]->(movie)
WITH movie, actors, genres, directors, collect(writer.name) as writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)<-[:HAS_KEYWORD]-(movies:Movie)
WITH DISTINCT movies as related, count(DISTINCT keyword.name) as keywords, movie, genres, directors, actors, writers // 1 row per related movie
ORDER BY keywords DESC
WITH collect(DISTINCT { related: { title: related.title }, weight: keywords }) as related, movie, actors, genres, directors, writers // 1 row
MATCH (movie)-[:HAS_KEYWORD]->(keyword)
RETURN collect(keyword.name) as keywords, related, movie, actors, genres, directors, writers

10x faster

Michael Hunger

the stats, and profile info?

Mark Needham

We need to talk through the differences between the two versions - it's not obvious at first glance

Michael Hunger

Highlight them, make them bigger font and gray the rest out



Design for Queryability

Model


Query


Model

Making the implicit explicit

• When you have implicit relationships in the graph you can sometimes get better query performance by modeling the relationship explicitly
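One way to picture this on the movie dataset: two people who acted in the same movie are implicitly connected, and that connection can be written into the graph once. This is a sketch under assumptions. WORKED_WITH is a hypothetical relationship type that is not part of the original dataset, and the two statements are run separately.

```cypher
// Materialize the implicit "acted in the same movie" link.
// WORKED_WITH is a hypothetical relationship type.
MATCH (a:Person)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(b:Person)
WHERE id(a) < id(b)   // one relationship per pair, not one per direction
MERGE (a)-[:WORKED_WITH]->(b)

// Later queries follow the explicit relationship in either direction
// instead of re-traversing through every shared movie.
MATCH (p:Person)-[:WORKED_WITH]-(c)
WHERE p.name = {name}
RETURN c
```

The trade-off is extra write work and storage to keep the shortcut up to date, which is why the bullet says "sometimes".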

Michael Hunger

"introduce shortcuts" "materialize virtual paths" model data structures as part of the graph (list, tree, trie, sorting, ...)

Wes Freeman

I added this word

Wes Freeman

(sometimes)

Making the implicit explicit

Michael Hunger

Having worked together as colleagues implies knowing each other; it makes path finding easier and faster, and also the pattern that you have to write. Could that be a general rule: "the more complex the pattern is you're writing", the slower your query will be? Probably not true for MATCH (n)-[*]-(m) :)

Refactor property to node

Bad:

MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
RETURN g

Michael Hunger

Explain that "date" can be modeled as a season

Mark Needham

I'll fill in times for these two versions

Good:

MATCH (s:Season)-[:contains]->(g)
WHERE s.name = "2012-2013"
RETURN g
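Getting from the property-based model to the node-based one takes a one-off refactoring step. A minimal sketch, assuming the date range in the "Bad" query is exactly the 2012-2013 season and reusing the :contains relationship type from the "Good" query:

```cypher
// Create (or find) the season node, then attach every game whose
// date falls inside the assumed 2012-2013 range.
MERGE (s:Season { name: "2012-2013" })
WITH s
MATCH (g:Game)
WHERE g.date > 1343779200 AND g.date < 1369094400
MERGE (s)-[:contains]->(g)
```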

Refactor property to node

Conclusion

• Avoid the global scan

• Add indexes / unique constraints

• Split up MATCH statements

• Measure, measure, measure, tweak, repeat

• Soon Cypher will do a lot of this for you!

Bonus tip

• Use transactions/transactional cypher endpoint

Q & A

• If you have them, send them in
