PostgreSQL Troubleshoot On-Line (ritfest 2015 meetup in Moscow, Russia)
PostgreSQL Troubleshoot On-Line.
Ilya Kosmodemyansky, Alexey Lesovsky
case 1: Bad release. Overview.
- Symptoms:
  - significant load increase,
  - operations slow down.
- Often unpredictable:
  - we don't know where the problem occurs.
- Emergency:
  - the problem must be found and resolved ASAP.
case 1: Bad release. Troubleshoot.
- Outside the database:
  - top, sysstat, etc.
  - nagios/zabbix/...
- Inside the database:
  - pgbadger/loganalyze/etc.
  - pg_stat_statements
case 1: Bad release. Outside the database.
- top:
  - CPU usage, load average, swapping, iowait.
- sysstat:
  - disk utilization (iostat),
  - resource consumption (sar).
case 1: Bad release. Outside the database.
- Okmeter:
  - online monitoring service,
  - rich feature/plugin set,
  - good PostgreSQL support.
case 1: Bad release. Inside the database.
- Log analysis (pgBadger):
  - logs are huge,
  - the whole log must be read before a report can be created,
  - building a report takes a lot of time.
- pg_stat_statements (contrib):
  - small storage footprint,
  - quick and flexible reports.
case 1: Bad release. Inside the database.
- query_stat_total.sql
  - https://github.com/PostgreSQL-Consulting/pg-utils
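As a rough illustration of the kind of report pg_stat_statements makes possible, a minimal "top statements by total time" query might look like the sketch below. It is not the contents of query_stat_total.sql. It assumes pg_stat_statements is listed in shared_preload_libraries and the server has been restarted; the timing column is total_time on the PostgreSQL versions current at the time of the talk and total_exec_time from PostgreSQL 13 on.

```sql
-- Assumption: pg_stat_statements is already in shared_preload_libraries.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by cumulative execution time.
-- On PostgreSQL 13 and later, replace total_time with total_exec_time.
SELECT calls,
       round(total_time::numeric, 2)                      AS total_ms,
       round((total_time / nullif(calls, 0))::numeric, 2) AS avg_ms,
       rows,
       left(query, 80)                                    AS query_head
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```

Calling pg_stat_statements_reset() right after a release and re-running the query a few minutes later narrows the report to the statements the new code actually issues.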
case 1: Bad release. Query #1.
SELECT p.id, p.rating
FROM posts p
LEFT JOIN complaints com ON (com.post_id = p.id AND com.user_id = ?)
WHERE p.is_deleted IS FALSE
  AND com.is_hide IS NOT TRUE
  AND p.type_id != ?
ORDER BY p.rating DESC
LIMIT ?;
case 1: Bad release. Query #1. JOIN -> (NOT) EXISTS
SELECT p.id, p.rating
FROM posts p
WHERE p.is_deleted IS FALSE
  AND p.type_id != ?
  AND NOT EXISTS (
    SELECT 1
    FROM complaints com
    WHERE com.post_id = p.id
      AND user_id = ?
      AND is_hide = true)
ORDER BY p.rating DESC
LIMIT ?;
case 1: Bad release. Query #2.
SELECT * FROM tags WHERE (tags.title ILIKE ?);
Trigram Index.
CREATE INDEX tags_title_trigram_key ON tags USING gin (title gin_trgm_ops);
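The gin_trgm_ops operator class used by this index is provided by the pg_trgm contrib extension, so the extension has to exist in the database before the index can be created. A minimal sketch of the setup and of a plan check follows; the search pattern is a made-up value for illustration only.

```sql
-- gin_trgm_ops comes from the pg_trgm contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- After creating the trigram index on tags.title, check that the planner
-- picks it up. '%postgres%' is a hypothetical pattern.
EXPLAIN
SELECT * FROM tags WHERE (tags.title ILIKE '%postgres%');
```

A bitmap scan on tags_title_trigram_key in the plan suggests the index is being used. Very short patterns give the index little to work with, so they may still end up scanning most of it.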
case 1: Bad release. Query #3.
SELECT post.*
FROM post
JOIN domain ON post.domain_id = domain.id
LEFT OUTER JOIN domain_acl
  ON domain_acl.domain_id = domain.id AND domain_acl.user_id = ?
WHERE post.deleted = ?
  AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?)
  AND ((domain.flags & ?) = ? OR (domain_acl.acl & ?) = ?)
  AND post.id NOT IN (?, ?)
ORDER BY post.last_activity DESC
LIMIT ? OFFSET ?
case 1: Bad release. Query #3. Index Only Scan
SELECT *
FROM post
WHERE id IN (
  SELECT post.id
  FROM post
  JOIN domain ON post.domain_id = domain.id
  LEFT OUTER JOIN domain_acl
    ON domain_acl.domain_id = domain.id AND domain_acl.user_id = ?
  WHERE post.deleted = ?
    AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?)
    AND ((domain.flags & ?) = ? OR (domain_acl.acl & ?) = ?)
    AND post.id NOT IN (?, ?)
  ORDER BY post.last_activity DESC
  LIMIT ? OFFSET ?)
ORDER BY post.last_activity DESC;

CREATE INDEX post_domain_id_last_activity_id_deleted_partial
  ON post USING btree (domain_id, last_activity, id, deleted)
  WHERE deleted = 0;
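Whether the inner query really gets an index-only scan can be checked with EXPLAIN. The sketch below is a simplified version of the subquery, restricted to the post table, with made-up literal values. It spells out deleted = 0 as a literal because the planner can only use a partial index when the query's WHERE clause provably implies the index predicate, which a generic plan cannot establish for a bind parameter. Index-only scans also depend on the visibility map, so the table is vacuumed first.

```sql
-- Refresh the visibility map and planner statistics.
VACUUM (ANALYZE) post;

-- Hypothetical literal values; only columns stored in the index are referenced.
EXPLAIN (ANALYZE, BUFFERS)
SELECT post.id
FROM post
WHERE post.deleted = 0
  AND post.domain_id IN (1, 2, 3)
ORDER BY post.last_activity DESC
LIMIT 20;
```

In the output, look for an "Index Only Scan" node on post_domain_id_last_activity_id_deleted_partial and a low "Heap Fetches" count. The planner is not guaranteed to choose this plan on every data distribution.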
case 1: Bad release. Query #4.
SELECT *
FROM "group"
WHERE ("group".group_vislvl_content >= ?)
  AND (group_main_domain_id IS NULL OR group_main_domain_id IN (?, ?))
  AND ("group".obj_pics_count + "group".group_persons_count +
       "group".group_blog_posts_count + "group".group_wiki_count >= ?)
  AND "group".group_is_demo = ?
  AND "group".obj_status_did = ?
ORDER BY "group".group_persons_count DESC, "group".obj_created ASC
LIMIT ?;
case 1: Bad release. Query #4. Partial Index
CREATE INDEX group_special2_key
  ON "group" USING btree (group_persons_count DESC, obj_created)
  WHERE ("group".obj_pics_count + "group".group_persons_count +
         "group".group_blog_posts_count + "group".group_wiki_count >= 1);
case 2: More app servers... We need more...
- the project grows,
- load increases,
- more app servers are added,
- more apps -> more db connections.
case 2: More app servers... We need more...
- too many db connections are bad:
  - high resource contention,
  - OS overhead (memory, locks, forks).
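One quick way to see how many connections the server is actually holding, and how many of them are doing work, is to group pg_stat_activity by state. This is a minimal sketch; the state column is available from PostgreSQL 9.2 on.

```sql
-- Connections grouped by state (active, idle, idle in transaction, ...).
SELECT state, count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;

-- Configured ceiling, for comparison.
SHOW max_connections;
```

A large number of idle connections compared with active ones is a typical sign that a connection pooler would help.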
case 2: More app servers... We need more...
- pgbouncer:
  - lightweight connection pooler,
  - stable, simple, fast (libevent).
- use pgbouncer between the apps and the database.
case 2: More app servers... We need more...
- simple test: without pgbouncer
- pgbench -C -c 32 -T 300 -U postgres shopdb
  transaction type: TPC-B (sort of)
  scaling factor: 128
  query mode: simple
  number of clients: 32
  number of threads: 1
  duration: 300 s
  number of transactions actually processed: 253628
  latency average: 37.851 ms
  tps = 845.403711 (including connections establishing)
  tps = 15320.442789 (excluding connections establishing)
case 2: More app servers... We need more...
- simple test: with pgbouncer
- pgbench -C -c 32 -T 300 -U postgres shopdb
  transaction type: TPC-B (sort of)
  scaling factor: 128
  query mode: simple
  number of clients: 32
  number of threads: 1
  duration: 300 s
  number of transactions actually processed: 2689931
  latency average: 3.569 ms
  tps = 8966.389025 (including connections establishing)
  tps = 19225.431659 (excluding connections establishing)
case 2: More app servers... We need more...
- total: 300 seconds with 32 clients on an 8-core server
- latency: 37.85 ms vs. 3.57 ms
- total transactions: 253628 vs. 2689931
- tps (including connection establishing): 845 vs. 8966
- tps (excluding connection establishing): 15320 vs. 19225
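The relative gain can be worked out directly from the two pgbench runs. The query below only does arithmetic on the figures reported above and needs no tables.

```sql
-- Ratios between the run with pgbouncer and the run without it.
SELECT round(2689931::numeric / 253628, 1)     AS transactions_ratio,
       round(37.851 / 3.569, 1)                AS latency_ratio,
       round(8966.389025 / 845.403711, 1)      AS tps_incl_connect_ratio,
       round(19225.431659 / 15320.442789, 2)   AS tps_excl_connect_ratio;
```

Because pgbench ran with -C (a new connection for every transaction), the figures that include connection establishing show the pooler's effect most directly: about 10.6 times more transactions and about 10.6 times lower latency. With connection time excluded, tps improves by about 1.25 times.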
Thanks.
Questions?