monitor some of the things

2013-10-18

MONITOR SOME OF THE THINGS

[Book cover: High Performance MySQL, 3rd Edition. Optimization, Backups, Replication, and more. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko. Covers Version 5.5]

ME

• Cofounder of @VividCortex

• Author of High Performance MySQL

• @xaprb on Twitter

• [email protected]

• http://www.linkedin.com/in/xaprb

RANT, RECAPPED

• The sky is falling

• Tools drive processes, and we need better tools designed for methods

• Pay attention to CAPS (Capacity, Availability, Performance, Scalability)

• Monitoring tools need to be a lot smarter

• Measure and monitor “work getting done”

HARD CAPACITY

• Disk volume

• CPU Cycles

• max_connections

• File descriptors, sockets, TCP port numbers, etc

• %used, absolute quantity available
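A minimal sketch of the two numbers on this slide, %used and absolute quantity available, for any hard limit. The function name and the sample values are illustrative assumptions, not from the talk.

```python
def capacity_report(used, limit):
    """Return (%used, absolute quantity still available) for a hard limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    pct_used = 100.0 * used / limit
    available = limit - used
    return pct_used, available


# Hypothetical sample: connections in use versus a max_connections limit.
pct, free = capacity_report(used=120, limit=151)
print(f"{pct:.1f}% used, {free} available")
```

Reporting both numbers matters: 95% used of a 10 TB disk still leaves far more headroom than 95% used of a 10 GB disk.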

SOFT CAPACITY

• Neil Gunther’s Universal Scalability Law


• Throughput, concurrency, errors
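As a rough sketch, the Universal Scalability Law models throughput X at concurrency N as X(N) = λN / (1 + σ(N−1) + κN(N−1)). The parameter names `lam`, `sigma`, and `kappa` follow Gunther's commonly used notation; the numeric values below are made up for illustration and would normally be fitted from measured throughput and concurrency.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput at concurrency n under the USL.

    lam   -- throughput of a single unit of concurrency
    sigma -- contention (serialization) coefficient
    kappa -- coherency (crosstalk) coefficient
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which predicted throughput is highest."""
    return math.sqrt((1 - sigma) / kappa)


# Hypothetical coefficients, for illustration only.
LAM, SIGMA, KAPPA = 1000.0, 0.05, 0.001
for n in (1, 8, 31, 100):
    print(n, round(usl_throughput(n, LAM, SIGMA, KAPPA)))
print("peak near N =", round(usl_peak_concurrency(SIGMA, KAPPA), 1))
```

Past the peak, adding concurrency lowers throughput, which is the sense in which this capacity limit is "soft".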

AVAILABILITY

• Availability is absence of downtime


• MTBF, MTTR, MTTD, %availability
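One way to derive these four numbers from an outage log is sketched below. The input shape (a list of `(seconds_to_detect, seconds_down)` tuples) and the exact definitions are assumptions; teams define MTBF and MTTR in slightly different ways.

```python
def availability_stats(observed_s, outages):
    """Summarize availability over an observation window.

    observed_s -- length of the window in seconds
    outages    -- list of (seconds_to_detect, seconds_down) tuples (assumed shape)
    """
    if not outages:
        return {"availability_pct": 100.0, "mtbf_s": None,
                "mttr_s": None, "mttd_s": None}
    count = len(outages)
    downtime = sum(down for _, down in outages)
    uptime = observed_s - downtime
    return {
        "availability_pct": 100.0 * uptime / observed_s,
        "mtbf_s": uptime / count,      # mean uptime per failure
        "mttr_s": downtime / count,    # mean time to restore service
        "mttd_s": sum(detect for detect, _ in outages) / count,
    }


# Hypothetical month (30 days) with two outages.
print(availability_stats(30 * 86400, [(60, 600), (300, 1200)]))
```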

TASK PERFORMANCE

• Task performance is consistently fast response time.

• Measure an SLA in percentile response time per task, over observation intervals




• Response time, 95% response time
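A sketch of measuring an SLA as percentile response time per task, per observation interval. The sample format `(timestamp_s, task, response_time_s)`, the nearest-rank percentile method, and the threshold are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sla_report(samples, interval_s, pct, threshold_s):
    """Group samples by (task, interval) and test each group against the SLA.

    Returns {(task, interval_index): (percentile_value, met_sla)}.
    """
    buckets = defaultdict(list)
    for timestamp_s, task, response_time_s in samples:
        buckets[(task, int(timestamp_s // interval_s))].append(response_time_s)
    report = {}
    for key, times in sorted(buckets.items()):
        value = percentile(times, pct)
        report[key] = (value, value <= threshold_s)
    return report


# Hypothetical samples: a "login" task measured over two one-minute intervals.
samples = [(0, "login", 0.1), (10, "login", 0.3), (70, "login", 0.9)]
print(sla_report(samples, interval_s=60, pct=95, threshold_s=0.5))
```

Evaluating each interval separately keeps one bad minute from being averaged away by a quiet hour.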

RESOURCE PERFORMANCE

• Resource performance is ability to run tasks consistently fast.





• Throughput, concurrency, busy time, total response time, backlog/queue
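Assuming a resource exposes counters for completed tasks, busy time, and total response time, the other metrics follow from simple division over the observation interval. This is a sketch; the argument names are hypothetical, and backlog/queue length would need its own counter.

```python
def resource_metrics(completions, busy_time_s, total_response_time_s, interval_s):
    """Derive per-interval resource metrics from counter deltas.

    completions           -- tasks finished during the interval
    busy_time_s           -- seconds during which at least one task was running
    total_response_time_s -- sum of all task response times in the interval
    """
    return {
        "throughput_per_s": completions / interval_s,
        "utilization": busy_time_s / interval_s,
        # Little's law: average concurrency = total response time / interval.
        "avg_concurrency": total_response_time_s / interval_s,
        "mean_response_time_s": (
            total_response_time_s / completions if completions else 0.0
        ),
    }


# Hypothetical one-minute interval.
print(resource_metrics(completions=600, busy_time_s=30,
                       total_response_time_s=120, interval_s=60))
```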

SCALABILITY

• Universal Scalability Law again





STALL DETECTION

• Overloaded or underperforming?





• Utilization, saturation, errors, sources of load/demand

GIT ‘ER DONE

MONITOR WORK AND RESOURCES

WHAT NOT TO DO

• Don’t use top-N lists from Google

• Don’t just do what’s included in some Nagios plugin

№1 TOP 10 LIST

1. MySQL availability
2. Presence of insecure users and databases
3. Aborted connects
4. Error log
5. Deadlocks
6. Change in server configuration
7. Slow query log
8. Slave lag
9. Percentage of maximum allowed connections
10. Percentage of full table scans

№2 TOP 10 LIST

1. Threads_connected
2. Created_tmp_disk_tables
3. Handler_read_first
4. Innodb_buffer_pool_wait_free
5. Key_reads
6. Max_used_connections
7. Open_tables
8. Select_full_join
9. Slow_queries
10. Uptime

№1 PLUGIN

1. threadcache-hitrate (Hit rate of the thread-cache)
2. slave-io-running (Slave io running: Yes)
3. slave-sql-running (Slave sql running: Yes)
4. qcache-hitrate (Query cache hitrate)
5. qcache-lowmem-prunes (Query cache entries pruned because of low memory)
6. keycache-hitrate (MyISAM key cache hitrate)
7. bufferpool-hitrate (InnoDB buffer pool hitrate)
8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available)
9. log-waits (InnoDB log waits because of a too small log buffer)
10. tablecache-hitrate (Table cache hitrate)
11. table-lock-contention (Table lock contention)
12. index-usage (Usage of indices)
13. tmp-disk-tables (Percent of temp tables created on disk)
14. long-running-procs (long running processes)

№2 PLUGIN

1. connection-time
2. uptime
3. threads-connected
4. threadcache-hitrate
5. q[uery]cache-hitrate
6. q[uery]cache-lowmem-prunes
7. [myisam-]keycache-hitrate
8. [innodb-]bufferpool-hitrate
9. [innodb-]bufferpool-wait-free
10. [innodb-]log-waits
11. tablecache-hitrate
12. table-lock-contention
13. index-usage
14. tmp-disk-tables
15. slow-queries
16. long-running-procs
17. slave-lag
18. slave-io-running
19. slave-sql-running
20. sql
21. open-files
22. encode
23. cluster-ndb-running

№3 PLUGIN

SURFACE AREA

http://www.flickr.com/photos/nasamarshall/5926864640/

DUPLICATE SIGNALS

• Queries

• Com_admin_commands

• Com_assign_to_keycache

• Com_alter_db

• Com_alter_db_upgrade

• Com_alter_event

• Com_alter_function

• Com_alter_procedure

• Com_alter_server

• Com_alter_table

• Com_alter_tablespace

• Com_alter_user

• Com_analyze

• Com_begin

• Com_binlog

• Com_ad_nauseum

DESIRABLE METRICS






• Utilization, saturation, errors, sources of load/demand

[Diagram labels: Desirable, Easy]

IRRELEVANT

EXAMPLE PLEASE?

RESOURCE LIMITS

• Threads_connected near max_connections?

• %table cache used?

• Open file handles?

• Long-running queries/transactions?

ERRORS

• Deadlocks?

• Aborted connects?

AVAILABILITY

• Ability to connect and run a query?

• Uptime is small?

• Replication is running?
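The three checks above could be evaluated roughly as follows. This sketch assumes the caller has already tried to connect and fetched `SHOW GLOBAL STATUS` and `SHOW SLAVE STATUS` into plain dicts (passing `None` for `status` when the connection or query failed); the function name and the 600-second uptime threshold are hypothetical.

```python
def availability_problems(status, slave_status=None, min_uptime_s=600):
    """Return a list of availability problems; an empty list means healthy.

    status       -- dict of SHOW GLOBAL STATUS values, or None if the
                    connection or test query failed
    slave_status -- dict of one SHOW SLAVE STATUS row, or None if the
                    server is not a replica
    """
    if status is None:
        return ["cannot connect and run a query"]
    problems = []
    uptime = int(status["Uptime"])
    if uptime < min_uptime_s:
        problems.append(f"Uptime is only {uptime} seconds (recent restart?)")
    if slave_status is not None:
        for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
            value = slave_status.get(thread)
            if value != "Yes":
                problems.append(f"{thread}: {value}")
    return problems


# Hypothetical values for a replica that restarted two minutes ago.
print(availability_problems({"Uptime": "120"},
                            {"Slave_IO_Running": "Yes",
                             "Slave_SQL_Running": "No"}))
```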

PERFORMANCE

• You can get throughput (Queries) and concurrency (Threads_running) from MySQL

• But in a Nagios check, no context to know whether they’re good or bad

• You generally can’t get response time, busy time, utilization, backlog, etc

• You can aggregate thread states, thread times, users, databases, query abstracts...
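As an illustration of the last bullet, thread states can be aggregated from `SHOW PROCESSLIST` output. The sketch assumes the rows have already been fetched as dicts keyed by the standard column names (`State`, `Time`); the sample rows are invented.

```python
from collections import Counter, defaultdict


def aggregate_processlist(rows):
    """Summarize processlist rows by thread state.

    Returns a list of (state, thread_count, total_time_s), busiest state first.
    """
    counts = Counter()
    total_time = defaultdict(int)
    for row in rows:
        state = row.get("State") or "(none)"
        counts[state] += 1
        total_time[state] += int(row.get("Time") or 0)
    return [(state, n, total_time[state]) for state, n in counts.most_common()]


# Hypothetical processlist rows.
rows = [
    {"State": "Waiting for table metadata lock", "Time": 40},
    {"State": "Waiting for table metadata lock", "Time": 35},
    {"State": "Sending data", "Time": 2},
    {"State": None, "Time": 0},
]
for state, count, seconds in aggregate_processlist(rows):
    print(f"{count:4d} threads, {seconds:6d} s total: {state}")
```

The same grouping works for users, databases, or query abstracts by changing the key.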

NAGIOS IS BEST AT

LIVING IN THE MOMENT

THOU SHALT NOT

• Cache hit ratios

• Thread cache hit ratio

• Buffer pool cache hit ratio

• Table cache hit ratio

• Key cache hit ratio

• Query cache hit ratio

• Rates of “bad” queries

• % temp tables on disk

• % full table scans

• % slow queries

• Unfixable things

• Replication delay

WHY NOT?

• Those are properties of the workload and application

• They are not conditions to alert/warn about

• They are not fixable / actionable in the service

ALERTS ARE

BETTER TOGETHER

QUESTION:

WHAT IS BETTER?

№1 ALERT!!!!! Disk CRIT 100% /dev/sda2

№2 ALERT!!!!! Replication CRIT Slave I/O Thread No

№3 ALERT!!!!! Replication CRIT Slave SQL Thread No

№4 ALERT!!!!! Replication CRIT Seconds_Behind_Master NULL

№5 ALERT!!!!! MySQL CRIT oldest transaction: 86400 seconds

- OR -

№1 ALERT!!!!! CRIT
* Disk /dev/sda2 full
* Replication stopped
* Oldest transaction 86400 seconds
* 4999 threads in status “Waiting for table metadata lock”
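A minimal sketch of the second style: collect the individual findings first, then send one alert carrying the worst severity and every non-OK detail. The severity names and the `(severity, message)` input shape are assumptions, not any particular monitoring tool's API.

```python
SEVERITY_ORDER = {"OK": 0, "WARN": 1, "CRIT": 2}


def combined_alert(findings):
    """Merge (severity, message) findings into a single alert text."""
    problems = [(sev, msg) for sev, msg in findings if sev != "OK"]
    if not problems:
        return "OK"
    worst = max(problems, key=lambda item: SEVERITY_ORDER[item[0]])[0]
    lines = [f"ALERT {worst}"]
    lines.extend(f"* {msg}" for _, msg in problems)
    return "\n".join(lines)


# Hypothetical findings gathered by separate checks on one server.
print(combined_alert([
    ("CRIT", "Disk /dev/sda2 full"),
    ("CRIT", "Replication stopped"),
    ("WARN", "Oldest transaction 86400 seconds"),
    ("OK", "Connections below limit"),
]))
```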

HOLLER AT ME

QUESTIONS? @XAPRB / [email protected]



RESOURCES

• Chapter 3 of High Performance MySQL, 3rd Edition

• Percona White Papers

• Causes of Downtime in Production MySQL Servers

• Preventing MySQL Emergencies

• Goal-Driven Performance Optimization

• Forecasting MySQL Scalability with the Universal Scalability Law

• Method R: Optimizing Oracle Performance, Cary Millsap

• The Goal, Eli Goldratt

• The USE Method (Brendan Gregg) & his new book

• Guerrilla Capacity Planning, Neil J. Gunther

• Fundamental Performance & Scalability Instrumentation

http://www.percona.com/about-us/mysql-white-paper/causes-of-downtime-in-production-mysql-servers

http://www.percona.com/about-us/mysql-white-paper/preventing-mysql-emergencies

http://www.percona.com/about-us/mysql-white-paper/goal-driven-performance-optimization

http://www.percona.com/about-us/mysql-white-paper/forecasting-mysql-scalability-with-the-universal-scalability-law

http://www.xaprb.com/blog/2011/10/06/fundamental-performance-and-scalability-instrumentation/
