the postgresql query planner_longer

Upload: egon-valdmees

Post on 06-Apr-2018

  • 8/3/2019 The PostgreSQL Query Planner_longer

    1/32

    The PostgreSQL Query Planner

    Robert Haas

    PostgreSQL East 2010


    Why Does My Query Need a Plan?

    SQL is a declarative language.

    In other words, a SQL query is not a program.

    No control flow statements (e.g. for, while) and no way to control order of operations.

    SQL describes results, not process.


    Why Didn't The Planner Do It My Way?

    Maybe your way is actually slower, or

    Maybe you gave the planner bad information, or

    Maybe the query planner really did goof.

    Related question:

    How do I force the planner to use my index?


    Query Planning

    Make queries run fast.

    Minimize disk I/O.

    Prefer sequential I/O to random I/O.

    Minimize CPU processing.

    Don't use too much memory in the process.

    Deliver correct results.


    Query Planner Decisions

    Access strategy for each table.

    Sequential Scan, Index Scan, Bitmap Index Scan.

    Join strategy.

    Join order.

    Join strategy: nested loop, merge join, hash join.

    Inner vs. outer.

    Aggregation strategy.

    Plain, sorted, hashed.


    Table Access Strategies

    Sequential Scan (Seq Scan)

    Read every row in the table.

    Index Scan or Bitmap Index Scan

    Read only part of the table by using the index to skip uninteresting parts.

    Index scan reads index and table in alternation.

    Bitmap index scan reads index first, populating a bitmap, and then reads table in sequential order.


    Sequential Scan

    Always works; no need to create indices in advance.

    Doesn't require reading the index, which has both I/O and CPU cost.

    Best way to access very small tables.

    Usually the best way to access all or nearly all the rows in a table.


    Index Scan

    Potentially huge performance gain when reading only a small fraction of rows in a large table.

    Only table access method that can return rows in sorted order; very useful in combination with LIMIT.

    Random I/O against base table!


    Bitmap Index Scan

    Scans all index rows before examining base table, populating a TID bitmap.

    Table I/O is sequential, with skips; results in physical order.

    Can efficiently combine data from multiple indices; the TID bitmap can handle boolean AND and OR operations.

    Handles LIMIT poorly.
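
The way a TID bitmap combines several indices can be sketched in a few lines of Python. This is only an illustration: the two dictionaries stand in for indices on hypothetical columns a and b, TIDs are written as (page, offset) pairs, and a Python set plays the role of the bitmap. None of these names come from PostgreSQL itself.

```python
# Hypothetical index contents: key -> set of tuple IDs, each a (page, offset) pair.
index_on_a = {1: {(0, 1), (2, 3), (5, 1)}, 2: {(1, 2)}}
index_on_b = {1: {(2, 3), (4, 2), (5, 1)}, 2: {(0, 1)}}


def bitmap_scan(index, key):
    """Read all matching index entries first and return them as a TID 'bitmap'."""
    return set(index.get(key, ()))


def heap_visit_order(bitmap):
    """Visit the table in physical (page, offset) order, skipping unneeded pages."""
    return sorted(bitmap)


# WHERE a = 1 AND b = 1: intersect the two bitmaps before touching the table.
both = bitmap_scan(index_on_a, 1) & bitmap_scan(index_on_b, 1)
# WHERE a = 1 OR b = 1: union the two bitmaps.
either = bitmap_scan(index_on_a, 1) | bitmap_scan(index_on_b, 1)

print(heap_visit_order(both))    # [(2, 3), (5, 1)]
print(heap_visit_order(either))  # [(0, 1), (2, 3), (4, 2), (5, 1)]
```

Because the whole bitmap is built before the first table row is read, the sketch also hints at why LIMIT does not help much here: no row can be returned until every index has been scanned.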


    Join Planning

    Fixing the join order and join strategy is the hard part of query planning.

    Number of possibilities grows exponentially with number of tables.

    When search space is small, planner does a nearly exhaustive search.

    When search space is too large, planner uses heuristics or GEQO to limit planning time and memory usage.
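
To get a feel for how fast the search space grows, the following sketch counts join orders using standard combinatorics. It is an assumption-laden simplification: it ignores the choice of join strategy at each step and any pruning the planner does, and the function names are made up for this example.

```python
from math import factorial


def left_deep_orders(n):
    """Orders in which n tables can be joined one at a time (left-deep trees)."""
    return factorial(n)


def all_join_trees(n):
    """All binary join trees over n tables, counting outer/inner sides separately."""
    return factorial(2 * n - 2) // factorial(n - 1)


for n in (2, 4, 8, 12):
    print(n, left_deep_orders(n), all_join_trees(n))
```

Even without multiplying by the number of join strategies, the counts are already far too large to enumerate for a dozen tables, which is why a fallback to heuristics is needed.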


    Join Strategies

    Nested loop.

    Nested loop with inner index-scan.

    Merge join.

    Hash join.

    Each join strategy takes an outer relation and an inner relation and produces a result relation.


    Nested Loop Pseudocode

    for (each outer tuple)
        for (each inner tuple)
            if (join condition is met)
                emit result row;

    Outer or inner loop could be scanning output of some other join, or a base table. Base table scan could be using an index.

    Cost is roughly proportional to product of table sizes; bad if BOTH are large.
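
A runnable version of the nested loop pseudocode might look like the sketch below. The tables foo and bar are tiny in-memory lists invented for the example, and the comparison counter is there only to make the "product of table sizes" cost visible.

```python
def nested_loop_join(outer, inner, condition):
    """Return (matching pairs, number of join-condition evaluations)."""
    result = []
    comparisons = 0
    for o in outer:            # for (each outer tuple)
        for i in inner:        #   for (each inner tuple)
            comparisons += 1
            if condition(o, i):
                result.append((o, i))
    return result, comparisons


# Hypothetical table contents.
foo = [{"x": 1}, {"x": 2}, {"x": 3}]
bar = [{"x": 2}, {"x": 3}, {"x": 4}]

rows, comparisons = nested_loop_join(foo, bar, lambda f, b: f["x"] == b["x"])
print(rows)         # [({'x': 2}, {'x': 2}), ({'x': 3}, {'x': 3})]
print(comparisons)  # 9, i.e. len(foo) * len(bar)
```

Note that the condition can be any predicate, not just equality, which is one reason the nested loop remains the fallback strategy.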


    Nested Loop Example #1

    SELECT * FROM foo, bar WHERE foo.x = bar.x

    Nested Loop
      Join Filter: (foo.x = bar.x)
      -> Seq Scan on bar
      -> Materialize
            -> Seq Scan on foo

    This might be very slow!


    Nested Loop Example #2

    SELECT * FROM foo, bar WHERE foo.x = bar.x

    Nested Loop
      -> Seq Scan on foo
      -> Index Scan using bar_pkey on bar
            Index Cond: (bar.x = foo.x)

    Nested loop with inner index-scan! Much better... though probably still not the best plan.
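
The difference from the plain nested loop can be sketched as follows. A Python dictionary stands in for the existing index bar_pkey (a real PostgreSQL index is a separate on-disk structure, typically a B-tree), and the helper names are hypothetical.

```python
from collections import defaultdict


def build_index(rows, column):
    """Stand-in for an index that already exists on the inner table."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index


def nested_loop_index_join(outer, inner_index, column):
    """For each outer row, look up only the matching inner rows via the index."""
    result = []
    lookups = 0
    for o in outer:
        lookups += 1
        for i in inner_index.get(o[column], ()):
            result.append((o, i))
    return result, lookups


foo = [{"x": 1}, {"x": 2}, {"x": 3}]
bar = [{"x": 2}, {"x": 3}, {"x": 4}]
bar_pkey = build_index(bar, "x")

rows, lookups = nested_loop_index_join(foo, bar_pkey, "x")
print(rows)     # [({'x': 2}, {'x': 2}), ({'x': 3}, {'x': 3})]
print(lookups)  # 3: one index probe per outer row instead of a full inner scan
```

The work now scales with the size of the outer relation times the cost of one index probe, rather than with the product of both table sizes.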


    Merge Join

    Only handles equality joins; something like a.x = b.x.

    Put both input relations into sorted order (using sort or index scan) and scan through the two in parallel, matching up equal values.

    Normally visits each input tuple only once, but may need to rescan portions of the inner input if there are duplicate values in the outer input.

    Take OUTER={1 2 2 3} and INNER={2 2 3 4}
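
The OUTER and INNER values above can be traced with a small merge join sketch. It assumes both inputs are already sorted lists of plain values and is a simplification for illustration, not PostgreSQL's implementation. The inner position i acts as a mark: it is not advanced while scanning a run of equal values, so the second outer 2 rescans the same inner run.

```python
def merge_join(outer, inner):
    """Equality-join two sorted lists; returns (pairs, number of inner-run rescans)."""
    result = []
    rescans = 0
    i = 0            # mark: start of the inner run that may still match
    previous = None
    for o in outer:
        # Skip inner values that are too small to match this or any later outer value.
        while i < len(inner) and inner[i] < o:
            i += 1
        # A repeated outer value has to walk the same inner run again.
        if o == previous and i < len(inner) and inner[i] == o:
            rescans += 1
        j = i
        while j < len(inner) and inner[j] == o:
            result.append((o, inner[j]))
            j += 1
        previous = o
    return result, rescans


OUTER = [1, 2, 2, 3]
INNER = [2, 2, 3, 4]

pairs, rescans = merge_join(OUTER, INNER)
print(pairs)    # [(2, 2), (2, 2), (2, 2), (2, 2), (3, 3)]
print(rescans)  # 1
```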


    Merge Join Example

    SELECT * FROM foo, bar WHERE foo.x = bar.x

    Merge Join
      Merge Cond: (foo.x = bar.x)
      -> Sort
            Sort Key: foo.x
            -> Seq Scan on foo
      -> Materialize
            -> Sort
                  Sort Key: bar.x
                  -> Seq Scan on bar


    Hash Join

    Like merge join, only handles equality joins.

    Hash each row from the inner relation to create a hash table. Then, hash each row from the outer relation and probe the hash table for matches.

    Very fast but requires enough memory to store inner tuples. Can get around this using multiple batches.

    Not guaranteed to retain input ordering.
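
The build and probe phases can be sketched like this. It is a single-batch, in-memory illustration with invented inputs; spilling to multiple batches, which is what lets the real executor cope with an inner relation larger than memory, is not modeled.

```python
from collections import defaultdict


def hash_join(outer, inner, key):
    """Build a hash table on the inner input, then probe it with each outer row."""
    table = defaultdict(list)
    for i in inner:                       # build phase
        table[key(i)].append(i)
    result = []
    for o in outer:                       # probe phase
        for i in table.get(key(o), ()):
            result.append((o, i))
    return result


# Neither input needs to be sorted.
outer = [3, 1, 2, 2]
inner = [2, 4, 3, 2]

print(hash_join(outer, inner, key=lambda v: v))
# [(3, 3), (2, 2), (2, 2), (2, 2), (2, 2)]
```

Each input is read once, which is why the strategy is fast when the hash table fits in memory.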


    Hash Join Example

    SELECT * FROM foo, bar WHERE foo.x = bar.x

    Hash Join
      Hash Cond: (foo.x = bar.x)
      -> Seq Scan on foo
      -> Hash
            -> Seq Scan on bar


    Join Removal

    Upcoming 9.0 feature.

    Consider this query:

    SELECT p.id, p.name FROM projects p
      LEFT JOIN person pm
      ON p.project_manager_id = pm.id;

    If there is a unique index on person (id), then the join need not be performed at all.

    Common scenario when using views.
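
Why the join is removable can be checked on toy data. The sketch below uses made-up rows for projects and person and emulates the LEFT JOIN by hand: when person.id is unique, every project row appears exactly once whether or not it has a manager, so projecting only p.id and p.name gives the same result as never joining. With a duplicate id the results differ, which is why the unique index matters.

```python
# Hypothetical data; None plays the role of SQL NULL.
projects = [
    {"id": 1, "name": "alpha", "project_manager_id": 10},
    {"id": 2, "name": "beta", "project_manager_id": None},
    {"id": 3, "name": "gamma", "project_manager_id": 99},
]
person = [{"id": 10, "name": "Ann"}, {"id": 11, "name": "Bob"}]


def left_join_then_project(projects, person):
    """Emulate: SELECT p.id, p.name FROM projects p LEFT JOIN person pm ON ..."""
    out = []
    for p in projects:
        matches = [pm for pm in person if pm["id"] == p["project_manager_id"]]
        # LEFT JOIN keeps the project row even when nothing matches.
        for _ in matches or [None]:
            out.append((p["id"], p["name"]))
    return out


def without_join(projects):
    """Emulate: SELECT p.id, p.name FROM projects p"""
    return [(p["id"], p["name"]) for p in projects]


# Unique person.id: the join changes nothing.
print(left_join_then_project(projects, person) == without_join(projects))  # True

# Non-unique person.id: project 1 is duplicated, so the join cannot be dropped.
person_dup = person + [{"id": 10, "name": "Ann (duplicate)"}]
print(len(left_join_then_project(projects, person_dup)))  # 4
```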


    Join Reordering

    SELECT * FROM foo
      JOIN bar ON foo.x = bar.x
      JOIN baz ON foo.y = baz.y

    SELECT * FROM foo
      JOIN baz ON foo.y = baz.y
      JOIN bar ON foo.x = bar.x

    SELECT * FROM foo
      JOIN (bar JOIN baz ON true)
      ON foo.x = bar.x AND foo.y = baz.y


    EXPLAIN Estimates

    Hash Join (cost=8.28..404.52 rows=9000 width=118)
      Hash Cond: (foo.x = bar.x)
      -> Hash Join (cost=3.02..275.52 rows=9000 width=12)
            Hash Cond: (foo.y = baz.y)
            -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=8)
            -> Hash (cost=1.90..1.90 rows=90 width=4)
                  -> Seq Scan on baz (cost=0.00..1.90 rows=90 width=4)
      -> Hash (cost=4.00..4.00 rows=100 width=106)
            -> Seq Scan on bar (cost=0.00..4.00 rows=100 width=106)
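
The sequential scan costs in this plan can be reproduced with the documented formula for an unfiltered scan: pages read times seq_page_cost plus rows scanned times cpu_tuple_cost. In the sketch below, 0.01 is PostgreSQL's default cpu_tuple_cost, and the page counts (45, 1 and 3) are assumptions back-solved from the plan, not values shown on the slide.

```python
SEQ_PAGE_COST = 1.0    # default, as listed under Query Planner Parameters
CPU_TUPLE_COST = 0.01  # PostgreSQL default; not shown on the slides


def seq_scan_cost(pages, rows):
    """Total cost of a sequential scan with no WHERE clause."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


# Page counts are assumed; row counts come from the plan above.
print(round(seq_scan_cost(45, 10000), 2))  # 145.0 -> Seq Scan on foo
print(round(seq_scan_cost(1, 90), 2))      # 1.9   -> Seq Scan on baz
print(round(seq_scan_cost(3, 100), 2))     # 4.0   -> Seq Scan on bar
```

Costs are in arbitrary units anchored to the cost of one sequential page fetch, so they are useful for comparing plans with each other rather than for predicting milliseconds.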


    EXPLAIN ANALYZE

    Hash Join (cost=8.28..404.52 rows=9000 width=118) (actual time=0.743..51.582 rows=9000 loops=1)
      Hash Cond: (foo.x = bar.x)
      -> Hash Join (cost=3.02..275.52 rows=9000 width=12) (actual time=0.368..30.964 rows=9000 loops=1)
            Hash Cond: (foo.y = baz.y)
            -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=8) (actual time=0.021..9.908 rows=10000 loops=1)
            -> Hash (cost=1.90..1.90 rows=90 width=4) (actual time=0.280..0.280 rows=90 loops=1)
                  Buckets: 1024 Batches: 1 Memory Usage: 4kB
                  -> Seq Scan on baz (cost=0.00..1.90 rows=90 width=4) (actual time=0.010..0.138 rows=90 loops=1)
      -> Hash (cost=4.00..4.00 rows=100 width=106) (actual time=0.354..0.354 rows=100 loops=1)
            Buckets: 1024 Batches: 1 Memory Usage: 14kB
            -> Seq Scan on bar (cost=0.00..4.00 rows=100 width=106) (actual time=0.007..0.167 rows=100 loops=1)
    Total runtime: 59.376 ms


    Not The Same Thing!

    SELECT * FROM
      (foo JOIN bar ON foo.x = bar.x)
      LEFT JOIN baz ON foo.y = baz.y

    SELECT * FROM
      (foo LEFT JOIN baz ON foo.y = baz.y)
      JOIN bar ON foo.x = bar.x


    Review of Join Planning

    Join Order

    Join Strategy

    Nested loop

    Nested loop with inner index-scan

    Merge join

    Hash join

    Join removal

    Inner vs. outer


    Aggregates and DISTINCT

    Plain aggregate.

    e.g. SELECT count(*) FROM foo;

    Sorted aggregate.

    Sort the data (or use pre-sorted data); when you see a new value, aggregate the prior group.

    Hashed aggregate.

    Insert each input row into a hash table based on the grouping columns; at the end, aggregate all the groups.
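
The two grouped strategies can be contrasted in a short sketch, here computing a per-group sum over invented (group, value) rows. It follows the descriptions above literally and is not meant to mirror the executor's internals.

```python
from collections import defaultdict
from itertools import groupby


def sorted_aggregate(rows):
    """Sort by group key; each change of key closes the prior group."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [(k, sum(v for _, v in group))
            for k, group in groupby(ordered, key=lambda r: r[0])]


def hashed_aggregate(rows):
    """Bucket rows by group key in a hash table; aggregate every group at the end."""
    groups = defaultdict(list)
    for k, v in rows:
        groups[k].append(v)
    return [(k, sum(vs)) for k, vs in groups.items()]


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

print(sorted_aggregate(rows))  # [('a', 5), ('b', 5)]  (output is in sorted order)
print(hashed_aggregate(rows))  # [('b', 5), ('a', 5)]  (no sort needed, no sorted output)
```

The sorted variant needs little memory once the input is ordered and yields ordered output; the hashed variant avoids the sort but must hold an entry for every group.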


    Statistics

    All of the decisions discussed earlier in this talk are made using statistics.

    Seq scan vs. index scan vs. bitmap index scan

    Nested loop vs. merge join vs. hash join

    ANALYZE (manual or via autovacuum) gathers this information.

    You must have good statistics or you will get bad plans!


    Confusing The Planner

    SELECT * FROM foo WHERE a = 1 AND b = 1

    If 20% of the rows have a = 1 and 10% of the rows have b = 1, the planner will assume that 20% * 10% = 2% of the rows meet both criteria.

    SELECT * FROM foo WHERE (a + 0) = a

    Planner doesn't have a clue, so will assume 0.5% of rows will match.
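
The independence assumption behind the 20% * 10% = 2% estimate can go badly wrong when columns are correlated. The sketch below builds a hypothetical 100-row table in which every row with b = 1 also has a = 1, then compares the multiplied estimate with the true fraction.

```python
# Hypothetical table: a = 1 for the first 20 rows, b = 1 for the first 10 rows,
# so b = 1 implies a = 1 (the columns are strongly correlated).
rows = [{"a": 1 if n < 20 else 0, "b": 1 if n < 10 else 0} for n in range(100)]


def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


sel_a = selectivity(rows, lambda r: r["a"] == 1)   # 0.2
sel_b = selectivity(rows, lambda r: r["b"] == 1)   # 0.1

estimated = sel_a * sel_b                          # about 0.02, assuming independence
actual = selectivity(rows, lambda r: r["a"] == 1 and r["b"] == 1)  # 0.1

print(round(estimated * len(rows)), "rows estimated")  # 2 rows estimated
print(round(actual * len(rows)), "rows actual")        # 10 rows actual
```

Here the estimate is low by a factor of five, and errors of this kind compound as more conditions and joins are stacked on top.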


    What Could Go Wrong?

    If the planner underestimates the row count, it may choose an index scan instead of a sequential scan, or a nested loop instead of a hash or merge join.

    If the planner overestimates the row count, it may choose a sequential scan instead of an index scan, or a merge or hash join instead of a nested loop.

    Small values for LIMIT tilt the planner toward fast-start plans and magnify the effect of bad estimates.


    Query Planner Parameters

    seq_page_cost (1.0), random_page_cost (4.0) - Reduce these costs to account for caching effects. If database is fully cached, try 0.005.

    default_statistics_target (10 or 100) - Level of detail for statistics gathering. Can also be overridden on a per-column basis.

    enable_hashjoin, enable_sort, etc. - Just for testing.

    work_mem - Amount of memory per sort or hash.

    from_collapse_limit, join_collapse_limit, geqo_threshold - Sometimes need to be raised, but be careful!


    Things That Are Slow

    DISTINCT.

    PL/pgsql loops.

    FOR x IN SELECT ... LOOP SELECT ... END LOOP

    Repeated calls to SQL or PL/pgsql functions.

    SELECT id, some_function(id) FROM table;


    Upcoming Features

    Join removal (right now just for LEFT joins).

    Machine-readable EXPLAIN output.

    Hash statistics.

    Better model for Materialize costs.

    Improved use of indices to handle MIN(x), MAX(x), and x IS NOT NULL.


    Questions?

    Any questions?