linux tuning for postgresql at secon 2015
TRANSCRIPT
Оптимизация LinuxдляPostgreSQL
Алексей ЛесовскийPostgreSQL-Consulting.com
Пенза, 24 апреля 20152015.secon.ru
Пенза, 24 апреля 20152015.secon.ru
About:● Алексей Лесовский
● PostgreSQL DBA
● Linux system administrator
PostgreSQL-Consulting.com● Поддержка 24/7
● Аудит и оптимизация производительности
● Консультации и тренинги
● Мониторинг и аварийное реагирование
● Планирование ресурсов
Пенза, 24 апреля 20152015.secon.ru
Agenda
● Linux on PostgreSQL ?
● Resources and Databases
● OS subsystems
● CPU & Process scheduling, Power saving
● Memory & VM, NUMA, Huge Pages
● Storage & Filesystems, Input/Output
● Other misc
Пенза, 24 апреля 20152015.secon.ru
Intro.
● Linux as a platform for PostgreSQL database
● Active development & Community support
● Rich features & Fast implementation
● Stable & Mature & Durable
● Default settings in Linux
Пенза, 24 апреля 20152015.secon.ru
Tuning targets
CPU
Disk
Memory
ConcurrencyQuery Speed
Sort, Group, Hash,...
OS cache (page cache)DB cache (shared_buffers)Operations (work_mem)
DB storageTransactional Log (WAL)
Cold start
Пенза, 24 апреля 20152015.secon.ru
Tuning targets
CPU
Disk
Memory
CPU SchedulerNUMA
Power saving
Virtual MemoryNUMAHuge Pages
FileSystemsStorage I/O
Пенза, 24 апреля 20152015.secon.ru
Tuning targets:
● CPU scheduler
● Virtual memory and NUMA
● Huge pages
● File systems
● Storage IO
● Power saving policy
● Others
Пенза, 24 апреля 20152015.secon.ru
Tuning target: CPU scheduler
● CPU scheduler is ...
● sysctl: kernel.sched_migration_cost_ns
● sysctl: kernel.sched_autogroup_enabled
● http://www.postgresql.org/message-id/[email protected]
● http://kernelnewbies.org/Linux_2_6_38#head-59575a6aeafa38490226a560ee02de89829a5b20
Пенза, 24 апреля 20152015.secon.ru
Tuning target: CPU scheduler
● WARNING: should use update kernel because
● Ubuntu 12.04 Bug #1055222,
● Ubuntu 14.04 Bug #1422016
A kernel panic occurs while disabling the automatic task group creation within /etc/sysctl.conf using "kernel.sched_autogroup_enabled = 0". After adding the "noautogroup" kernel parameter to GRUB, the system will boot without issue.
Пенза, 24 апреля 20152015.secon.ru
Tuning target: CPU scheduler
● pgbench tests: $ pgbench -S -c 8 -T 30 -U postgres pgbench transaction type: SELECT only scaling factor: 30 duration: 30 s number of clients: 8 number of threads: 1
sched_migration_cost_ns = 50000, sched_autogroup_enabled = 1
- tps: 22621, 22692, 22502
sched_migration_cost_ns = 500000, sched_autogroup_enabled = 0
- tps: 23689, 23930, 23657
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Virtual Memory
● Virtual Memory
● Allocator, Caching, Dirty pages and Writeback
● vm.dirty_background_ratio & vm.dirty_ratio
● vm.dirty_background_bytes & vm.dirty_bytes
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Virtual Memory
0% of pagecache RAM dirty
do nothing
dirty_backgrouond_ratio (10% of RAM dirty) – wakeup flushd
dirty_ratio (20% of RAM dirty) – process start sync writes
100% of pagecache RAM dirty
flushd writes dirty buffers in background
flushd and write()'ng process write dirty buffer
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Virtual Memory
● vm.swappiness
● Out-of-memory & OOM-Killer
Пенза, 24 апреля 20152015.secon.ru
Tuning target: NUMA
c1 c2c3 c4
c1 c2c3 c4
c1 c2c3 c4
c1 c2c3 c4
M1
M3 M4
M2
S1 S2
S4S3
S – SocketC – CPU CoreM – Memory Bank
Пенза, 24 апреля 20152015.secon.ru
Tuning target: NUMA
● BIOS: enable memory node interleaving
● kernel boot: numa=of
● numactl utility
● sysctl: vm.zone_reclaim_mode = 0
● sysctl: kernel.numa_balancing = 0
Пенза, 24 апреля 20152015.secon.ru
Tuning target: NUMA
$ numactl --hardware available: 1 nodes (0-0) -- NUMA disabled node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 196578 MB node 0 free: 1122 MB node distances: node 0 0: 10 $ numactl --hardware available: 2 nodes (0-1) -- NUMA enabled node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17 node 0 size: 65501 MB node 0 free: 4186 MB node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23 node 1 size: 65536 MB node 1 free: 10803 MB node distances: node 0 1 0: 10 21 1: 21 10
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Huge Pages
● Huge pages vs. Transparent huge pages
● Transparent huge pages
● echo never > /sys/kernel/mm/transparent_hugepage/enabled
● echo never > /sys/kernel/mm/transparent_hugepage/defrag
Пенза, 24 апреля 20152015.secon.ru
PostgreSQL with Huge Pages
● libhugetlbfs
● sysctl: vm.hugetlb_shm_group
● sysctl: vm.nr_hugepages
● sysctl: vm.nr_overcommit_hugepages
● environment var: HUGETLB_SHM & LD_PRELOAD
● postgresql.conf: huge_pages = try | on
Пенза, 24 апреля 20152015.secon.ru
PostgreSQL with Huge Pages
● $ grep ^Huge /proc/meminfo HugePages_Total: 75532 HugePages_Free: 54 HugePages_Rsvd: 13 HugePages_Surp: 0 Hugepagesize: 2048 kB
● $ pmap -x $(pgrep postgres|head -n 1) Address Kbytes RSS Dirty Mode Mapping 00007f24a1e00000 25802752 0 0 rw-s- anon_hugepage (deleted)
Пенза, 24 апреля 20152015.secon.ru
Huge Pages: Resume
● Disable Transparent Huge Pages
● PostgreSQL <= 9.2: libhugetlbfs
● PostgreSQL 9.3: huge pages not supported
● PostgreSQL >= 9.4: huge pages native support
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Filesystems
● Ext3 vs Ext4 vs XFS
● Filesystem Barriers
● Disable Write Cache● hdparm -W0 /device/● MegaCli64 -LDSetProp -DisDskCache -Lall -aALL
● Hardware RAID + BBU = barrier=0
● Software RAID = barrier=1
● Enterprise SSD with Power Loss Protection = barrier=0
● Consumer SSD w/o Power Loss Protection = barrier=1
● http://www.thessdreview.com/
Пенза, 24 апреля 20152015.secon.ru
Filesystems barriers
● $ pgbench -s 30 -c 8 -T 30 -U postgres pgbench transaction type: TPC-B (sort of) scaling factor: 30 query mode: simple number of clients: 8 number of threads: 1 duration: 30 s
barrier=1 barrier=0 tps = 1005 tps = 1142 tps = 988 tps = 1164 tps = 988 tps = 1129
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Storage IO
● SATA/SAS vs SSD
● IO elevators ● noop: SSD, hi-end storage● deadline: RAID, SATA/SAS● cfq: good default
● # echo 'elevator_name' > /sys/block/<device>/queue/scheduler
● kernel boot: elevator=<name>
● /sys/block/*/queue/: rotational, rq_affinity, read_ahead_kb
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Power saving policy
● acpi_cpufreq vs intel_pstate
● scaling_governor● /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors● /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
● acpi_cpufreq + performance
● intel_pstate + powersave
Пенза, 24 апреля 20152015.secon.ru
Power saving policy: acpi_cpufreq + ondemand
$ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit')
AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=238.506..238.512 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.068..232.756 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.058..206.598 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.78 rows=1 width=8) (actual time=0.042..109.343 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.004 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id)
Total runtime: 238.630 ms
Пенза, 24 апреля 20152015.secon.ru
Power saving policy: acpi_cpufreq + performance
$ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit')
AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=193.589..193.593 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.034..188.948 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.030..166.176 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.78 rows=1 width=8) (actual time=0.022..87.541 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.003 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id)
Total runtime: 193.659 ms
Пенза, 24 апреля 20152015.secon.ru
Tuning target: Misc
● Clocksource
● acpi_pm vs. hpet vs. tsc
● /sys/devices/system/clocksource/clocksource0/available_clocksource
● /sys/devices/system/clocksource/clocksource0/current_clocksource
Пенза, 24 апреля 20152015.secon.ru
Clocksource: TSC
$ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit')
AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=244.300..244.308 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.075..238.265 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.064..211.236 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.79 rows=1 width=8) (actual time=0.046..111.409 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.001..0.004 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.002..0.002 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id)
Total runtime: 244.433 ms
Пенза, 24 апреля 20152015.secon.ru
Clocksource: ACPI_PM
$ select o.payment_method_id,pc.name,count(o.payment_method_id) as cnt from operation o LEFT JOIN payment_method pm ON pm.id=o.payment_method_id LEFT JOIN payment_classify pc ON pc.id = pm.payment_classify_id WHERE (type='deposit')
AND (DATE(o.created)>='2015-04-05' AND DATE(o.created)<='2015-04-05') GROUP BY o.payment_method_id,pc.name QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=44.24..44.25 rows=1 width=26) (actual time=1062.165..1062.188 rows=20 loops=1) -> Nested Loop Left Join (cost=0.01..44.23 rows=1 width=26) (actual time=0.104..1047.211 rows=10660 loops=1) -> Nested Loop Left Join (cost=0.01..43.64 rows=1 width=16) (actual time=0.091..962.252 rows=10660 loops=1) Join Filter: (pm.id = o.payment_method_id) Rows Removed by Join Filter: 394420 -> Index Scan using operation_date_created_type_state_key on operation o (cost=0.01..41.79 rows=1 width=8) (actual time=0.046..104.546 rows=10660 loops=1) Index Cond: ((date(created) >= '2015-04-05'::date) AND (date(created) <= '2015-04-05'::date) AND (type = 'deposit'::op_type)) -> Seq Scan on payment_method pm (cost=0.00..1.38 rows=38 width=12) (actual time=0.002..0.039 rows=38 loops=10660) -> Index Scan using pc_pkey on payment_classify pc (cost=0.00..0.58 rows=1 width=22) (actual time=0.003..0.004 rows=1 loops=10660) Index Cond: (id = pm.payment_classify_id)
Total runtime: 1062.316 ms
Пенза, 24 апреля 20152015.secon.ru
Bonus. Tuning target: Pgbouncer
● open files limit● nofile (/etc/security/limits.conf)● prlimit utility (util-linux)
● sysctl: net.ipv4.ip_local_port_range
Пенза, 24 апреля 20152015.secon.ru
Questions?