perconalive feb-2011-share

High Value Transaction Processing

Mark Callaghan


TRANSCRIPT

Page 1: Perconalive feb-2011-share

High Value Transaction Processing

Mark Callaghan

Page 2

What do I mean by value?

▪ Low price?

▪ High price/performance?

▪ Valuable data

Page 3

OLTP in the datacenter

▪ Sharding

▪ Availability

▪ Legacy applications

▪ Used by many applications

Page 4

Sharding

▪ Sharding is easy, resharding is hard

▪ Joins within a shard are still frequent and useful

▪ Some all-shards joins must use Hive

▪ Provides some fault-isolation benefits
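A toy sketch of why resharding is hard (not from the talk; the sharding function and key range are hypothetical): with naive modulo sharding, growing from 4 to 5 shards relocates most rows.

```python
def shard_for(user_id, num_shards):
    # Naive modulo sharding: trivial to implement, painful to grow.
    return user_id % num_shards

keys = range(10000)
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of rows move when growing from 4 to 5 shards")
```

Consistent hashing or a directory-based shard map reduces how much data moves, which is one reason resharding tooling matters more than the initial split.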

Page 5

Availability

▪ Sources of downtime

▪ Schema change (but now we have OSC)

▪ Manual failover

▪ Misbehaving applications

▪ Oops

Page 6

Used by many applications

If your company is successful then

▪ Your database will be accessed by many different applications

▪ Application authors might not be MySQL experts

▪ Application owners might have different priorities than the DB team

Page 7

Legacy applications

If your company is successful then you will have

▪ Applications written many years ago by people who are gone

▪ Design decisions that are not good for your current size

▪ Not enough resources or time to rewrite applications

Page 8

Our busy OLTP deployment

▪ Query response time

▪ 4ms reads, 5ms writes

▪ Network bytes per second

▪ 38GB peak

▪ Queries per second

▪ 13M peak

▪ Rows read per second

▪ 450M peak

▪ Rows changed per second

▪ 3.5M peak

▪ InnoDB page IO per second

▪ 5.2M peak
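Back-of-the-envelope arithmetic from the peaks above (a rough sketch; the peaks need not occur at the same moment, so these ratios are only indicative):

```python
qps = 13e6            # queries per second, peak
rows_read = 450e6     # rows read per second, peak
rows_changed = 3.5e6  # rows changed per second, peak

# If the peaks coincided, each query would read about 35 rows,
# and row reads would outnumber row changes by roughly 130 to 1.
print(round(rows_read / qps, 1))
print(round(rows_read / rows_changed))
```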

Page 9

Recent improvements

▪ Joint work by Facebook, Percona and Oracle/MySQL

▪ Prevent InnoDB stalls

▪ Stalls from caches

▪ Stalls from mutexes

▪ IO efficiency

▪ Improve monitoring

▪ Improve XtraBackup

Page 10

How do you measure performance?

▪ Response time variance leads to bad user experiences

▪ Optimizations that defer work must handle steady-state loads

▪ When designing a server the choices are:

▪ No concurrency (and no mutexes)

▪ One mutex

▪ More than one mutex

Page 11

This has good average performance

Page 12

Which metric matters?

Page 13

Stalls from caches

Caches that defer expensive operations must eventually complete them at the same rate at which they are deferred.

▪ InnoDB purge

▪ InnoDB insert buffer

▪ Async writes are not async

▪ Fuzzy checkpoint constraint enforcement

Page 14

InnoDB purge stalls

▪ InnoDB purge removes delete-marked rows

▪ Done by the main background thread in 5.1 plugin

▪ Optionally done by a separate thread in 5.5

▪ Purge is single-threaded and might be stalled by disk reads

▪ The further it gets behind, the less likely it is to catch up

▪ Need multiple purge threads: handing the work from the main background thread to one dedicated purge thread isn’t enough

do {
    n_pages_purged = trx_purge();  /* purge until no work remains */
} while (n_pages_purged);

Page 15

InnoDB insert buffer stalls

▪ The insert buffer can fill faster than it is drained

▪ The drain rate is capped at 5% of innodb_io_capacity

▪ bugs.mysql.com/59214

▪ Fixed in the Facebook patch and XtraDB

▪ Patch pending for MySQL 5.5
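A tiny sketch of the pre-fix behavior (innodb_io_capacity and the merge rate here are made-up numbers): when the drain cap is 5% of innodb_io_capacity, any sustained merge rate above that cap grows the backlog until the insert buffer is full.

```python
innodb_io_capacity = 1000                 # assumed setting
drain_cap = 0.05 * innodb_io_capacity     # pre-fix cap: 50 merges/second
queued_per_sec = 200                      # hypothetical ibuf entries queued/second

backlog = 0
for _ in range(60):                       # one minute of steady load
    backlog += queued_per_sec - min(drain_cap, queued_per_sec + backlog)
print(backlog)                            # backlog keeps growing until ibuf is full
```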

Page 16

Performance drops when ibuf is full

Page 17

Otherwise, the insert buffer is awesome

Page 18

Fuzzy checkpoint constraint

▪ TotalLogSize = #log_files × innodb_log_file_size

▪ AsyncLimit = 0.70 × TotalLogSize

▪ SyncLimit = 0.75 × TotalLogSize

▪ OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the buffer pool

▪ Age = CurrentLSN – OldestDirtyLSN

Fuzzy Checkpoint Constraint

▪ If Age > SyncLimit then flush_dirty_pages_synch()

▪ Else if Age > AsyncLimit then flush_dirty_pages_async()
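The rules above translate directly into a few lines (a sketch using the slide’s formulas; the LSN values in the example are made up, with sizes in MB):

```python
def checkpoint_action(current_lsn, oldest_dirty_lsn, n_log_files, log_file_size):
    total_log_size = n_log_files * log_file_size
    async_limit = 0.70 * total_log_size
    sync_limit = 0.75 * total_log_size
    age = current_lsn - oldest_dirty_lsn
    if age > sync_limit:
        return "sync"    # flush_dirty_pages_synch(): user threads stall here
    if age > async_limit:
        return "async"   # flush_dirty_pages_async()
    return "none"

# Two log files of 512MB: limits are 716.8MB (async) and 768MB (sync)
print(checkpoint_action(800, 0, 2, 512))   # age 800 > 768
```

This is also why a larger log file (later slides) lowers the dirty page flush rate: it raises both limits.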

Page 19

Async page writes are not async

▪ Async page write requests submitted to enforce the fuzzy checkpoint constraint are not actually async

▪ User transactions may do this via log_preflush_pool_modified_pages

▪ Caller does large write for doublewrite buffer

▪ Caller then submits in-place write requests for background write threads

▪ Caller then waits for background write threads to finish

▪ bugs.mysql.com/55004

▪ Fixed in the Facebook patch

Page 20

Fuzzy checkpoint constraint enforcement

Prior to InnoDB plugin 5.1.38, page writes done to enforce the fuzzy checkpoint constraint were not submitted by the main background thread.

▪ InnoDB plugin added innodb_adaptive_flushing in 5.1.38 plugin

▪ Percona added innodb_adaptive_checkpoint

▪ Facebook patch added innodb_background_checkpoint

Page 21

Sysbench QPS at 20 second intervals with checkpoint stalls

Page 22

Stalls from mutexes

▪ Extending InnoDB files

▪ Opening InnoDB tables

▪ Purge/undo lock conflicts

▪ TRUNCATE table and LOCK_open

▪ DROP table and LOCK_open

▪ Buffer pool invalidate

▪ LOCK_open and kernel_mutex

▪ Excessive calls to fcntl

▪ Deadlock detection overhead

▪ innodb_thread_concurrency

Page 23

Stalls from extending InnoDB files

▪ A global mutex is held while the writes that extend an InnoDB file are done

▪ All reads on the file are blocked until the writes are done

▪ bugs.mysql.com/56433

▪ To be fixed real soon in the Facebook patch

Page 24

Stalls from opening InnoDB tables

▪ Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads

▪ bugs.mysql.com/49463 and bugs.mysql.com/53046

▪ Fixed in the Facebook patch and MySQL 5.5

▪ When stats are recomputed many uses of that table will stall

▪ Fixed in the Facebook patch

▪ Index stats could be recomputed too frequently

▪ bugs.mysql.com/56340

▪ Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5

Page 25

Stalls from purge/undo lock conflicts

▪ Purge and undo are not concurrent on the same InnoDB table

▪ Purge gets a share lock on the table

▪ Undo gets an exclusive lock on the table

▪ REPLACE statements that use insert-then-undo can generate undo

▪ bugs.mysql.com/54538

▪ Fixed in MySQL 5.1.55 and MySQL 5.5

Page 26

TRUNCATE table and LOCK_open

▪ LOCK_open is held when the truncate is done by InnoDB

▪ When file-per-table is used the file must be removed and that can take too long

▪ The InnoDB buffer pool LRU must be scanned

▪ New queries cannot be started

▪ bugs.mysql.com/41158 and bugs.mysql.com/56696

▪ Fixed in MySQL 5.5 courtesy of meta-data locking

Page 27

DROP table and LOCK_open

▪ LOCK_open is held when the drop is done by InnoDB

▪ When file-per-table is used the file must be removed and that can take too long

▪ The InnoDB buffer pool LRU must be scanned

▪ New queries cannot be started

▪ bugs.mysql.com/56655

▪ Fixed in the Facebook patch

▪ Do most InnoDB processing in the background drop queue

▪ Fixed in MySQL 5.5 courtesy of meta-data locking

Page 28

TRUNCATE/DROP table and invalidate

▪ Pages for table removed from buffer pool and adaptive hash

▪ InnoDB buffer pool mutex locked while the LRU is scanned

▪ This is slow with a large buffer pool

▪ Most threads in InnoDB will block waiting for the buffer pool mutex

▪ bugs.mysql.com/51325 and bugs.mysql.com/56332

▪ I hope Yasufumi can fix it

Page 29

LOCK_open and kernel_mutex conflicts

▪ Thread A

▪ Gather table statistics while holding LOCK_open

▪ Block on kernel_mutex while starting a transaction

▪ Thread B

▪ Hold kernel_mutex while doing deadlock detection

▪ All other threads block on LOCK_open or kernel_mutex

▪ bugs.mysql.com/51557

▪ Fixed in MySQL 5.5

Page 30

Stalls from excessive calls to fcntl

▪ fcntl

▪ My Linux kernels get the big kernel lock on fcntl calls

▪ MySQL called fcntl too often

▪ Doubled peak QPS by hacking MySQL to call fcntl less

▪ Almost 200,000 QPS without using HandlerSocket

▪ bugs.mysql.com/54790

▪ Fixed in Facebook patch, then reverted because it broke SSL tests

▪ Not sure where or when this will be fixed

Page 31

Sysbench read-only with fcntl fix

Page 32

Stalls from deadlock detection overhead

▪ InnoDB deadlock detection was very inefficient. Worst case when all threads waited on the same row lock.

▪ Added option to disable it in the Facebook patch and rely on lock wait timeout

▪ MySQL made it more efficient in MySQL 5.1

▪ bugs.mysql.com/49047

Page 33

Stalls from innodb_thread_concurrency

▪ When there are 1000+ sleeping threads it can take too long to wake up a specific thread

▪ Changed innodb_thread_concurrency to support FIFO scheduling in addition to the existing LIFO, plus a hybrid: FIFO + LIFO = FLIFO

▪ Fixed in the Facebook patch
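A toy model of why wake order matters (hypothetical; the real ticket scheduler uses condition variables, not a Python deque): under pure LIFO, a steady stream of new arrivals can starve the oldest waiter, which FLIFO presumably balances against LIFO's throughput benefits.

```python
from collections import deque

def woken_threads(policy, steps):
    q = deque(["t0"])                  # t0 waits from the start
    woken = []
    for i in range(1, steps + 1):
        q.append(f"t{i}")              # one new waiter arrives per step...
        woken.append(q.pop() if policy == "LIFO" else q.popleft())
    return woken                       # ...and one waiter is woken per step

print("t0" in woken_threads("LIFO", 5))   # False: the oldest waiter starves
print("t0" in woken_threads("FIFO", 5))   # True: woken first, wait is bounded
```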

Page 34

Sysbench TPS with FLIFO

Page 35

IO efficiency

High priority problems for me are:

▪ Reducing IOPs used for my workload

▪ Supporting very large databases

Significant improvements:

▪ Switch from mysqldump to XtraBackup

▪ Run innosim to confirm storage performance

▪ Tune InnoDB

▪ Improve schemas and queries

Page 36

mysqldump vs XtraBackup

▪ mysqldump is slower for backup

▪ Clustered index is scanned row-at-a-time in key order (lots of random reads)

▪ Backup accounts for half of the disk reads for servers I watch

▪ Single-table restore is easy with mysqldump

▪ Possible with XtraBackup thanks to work by Vamsi from Facebook

▪ Incremental backup

▪ Not possible with mysqldump

▪ XtraBackup has incremental (scan all data, write only the changed blocks)

▪ Vamsi from Facebook added support for really incremental: scan and write only the changed blocks

Page 37

innosim storage benchmark

▪ InnoDB IO simulator that models

▪ Doublewrite buffer

▪ Dirty page writes

▪ Transaction log and binlog fsync and IO

▪ User transactions that do read, write and commit

▪ Search for “facebook innosim”

▪ Source code on launchpad

Page 38

Tune InnoDB

▪ It is not easy to support many concurrent disk reads

▪ innodb_thread_concurrency tickets are not released when waiting for a read

▪ If innodb_thread_concurrency is too high then writers suffer

▪ If innodb_thread_concurrency is too low then readers suffer

▪ Smaller pages are better for some but not all tables

▪ A large log file can reduce the dirty page flush rate

▪ A large buffer pool can reduce the page read rate

Page 39

IOPs is a function of size and concurrency

Page 40

Smaller pages aren’t always better

Page 41

Checkpoint IO rate by log file size

Page 42

Page read rate by buffer pool size

Page 43

Improve schemas

▪ Make your performance critical queries index only

▪ Primary key columns are included in the secondary index

▪ Understand how the insert buffer makes index maintenance cheaper

▪ Figure out how to do schema changes with minimal downtime

▪ We used the Online Schema Change tool (thanks Vamsi)

▪ You can also do the schema change on a slave first and then promote it

Page 44

Monitoring

▪ Per-table, per-index and per-account statistics via information_schema tables

▪ Efficient and always enabled

▪ Easy to use

▪ Enhanced slow query log

▪ Facebook patch added options to do sampling for the slow query log

▪ Sample from all queries and from all queries that have an error

▪ Error is limited to errno, error text must wait for 5.5 plugin

▪ Aggregate by query text and by URL from the query comment
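The sampling idea can be sketched in a few lines (hypothetical names; the actual patch does this inside mysqld):

```python
import random

def should_log(had_error, sample_rate, rng=random.random):
    # Always log queries that fail; log a random fraction of the rest.
    # Sampling keeps slow-log overhead flat no matter how high QPS gets.
    if had_error:
        return True
    return rng() < sample_rate

print(should_log(True, 0.0))                       # errors are always logged
print(should_log(False, 0.01, rng=lambda: 0.5))    # most queries are skipped
```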

Page 45

Open Problems

▪ Parallel replication apply

▪ Support max concurrent queries

▪ Automate slave failover when a master fails

▪ Use InnoDB compression for OLTP

▪ Multi-master replication with conflict resolution

Page 46

Parallel replication apply

▪ Replication apply is single-threaded. This causes lag on IO-bound slaves even when SQL is simple

▪ mk-slave-prefetch can help but something better is needed

▪ Is a thread running BEGIN; replay-slave-sql; ROLLBACK better?

▪ I want:

▪ N replay queues

▪ Binlog events (SBR or RBR) hashed to queues by database names

▪ Each queue replayed in parallel
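The wish list above is easy to sketch (hypothetical code, not an actual implementation; MySQL 5.6 later shipped a similar per-database scheme as multi-threaded replication):

```python
from hashlib import md5

N_QUEUES = 4

def queue_for(db_name):
    # Hash by database name: events for one database stay on one queue,
    # preserving their binlog order, while other databases replay in parallel.
    return int(md5(db_name.encode()).hexdigest(), 16) % N_QUEUES

queues = [[] for _ in range(N_QUEUES)]
binlog = [("shop", "UPDATE orders ..."), ("users", "INSERT ..."),
          ("shop", "DELETE FROM carts ...")]
for db, event in binlog:
    queues[queue_for(db)].append(event)
# each queue can now be replayed by its own worker thread
```

Cross-database transactions break this partitioning, which is one reason the problem stayed open.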

Page 47

Max concurrent queries

▪ Use large values for max concurrent connections per account

▪ Enforce smaller values for max concurrent queries

▪ We have begun testing an implementation.

▪ Enforce at statement entry

▪ Account for threads that block (row lock, disk IO, network IO)
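A sketch of admission control at statement entry (hypothetical class and names; the real work happens inside the server):

```python
import threading

class QueryGovernor:
    """Allow many connections per account, but cap statements running at once."""
    def __init__(self, max_running):
        self._slots = threading.BoundedSemaphore(max_running)

    def try_enter(self):
        # Checked at statement entry; a full governor rejects (or queues).
        return self._slots.acquire(blocking=False)

    def leave(self):
        # Per the slide, threads blocked on row locks or IO need accounting
        # too; a blocked statement might give up its slot here.
        self._slots.release()

gov = QueryGovernor(max_running=2)
print(gov.try_enter(), gov.try_enter(), gov.try_enter())  # True True False
```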

Page 48

Automate slave failover

▪ Global transaction IDs from the Google patch are awesome

▪ But I don’t have the skills to port or support it

▪ A unique ID per binlog group or event might be sufficient

▪ Add an attribute to binlog event metadata

▪ Preserve on the slave similar to server ID

Page 49

InnoDB compression for OLTP

▪ Change InnoDB to not log page images for compressed pages

▪ Logging them increases the log IO rate

▪ Increasing the log IO rate then increases the checkpoint IO rate

▪ Change InnoDB to use QuickLZ instead of zlib for compression

▪ Add an option to limit compression to the PK index

▪ Add per-table compression statistics

Page 50

MySQL in the datacenter

▪ Previously dominated the market

▪ Now it must learn to share

▪ PostgreSQL continues to improve for OLTP

▪ HBase, Cassandra and MongoDB are getting traction today

Page 51

Why NoSQL

▪ Do less, but do it better

▪ Some offer write-optimized data stores

▪ Some don’t require sharding

▪ Interesting HA models

▪ Cassandra doesn’t have the notion of failover

▪ HBase doesn’t require failover when a server dies

▪ Healthy development communities improve code quickly

Page 52

What comes next

▪ Batch extraction is not the answer for MySQL/NoSQL integration

▪ NoSQL deployments will be reminded that

▪ Some of your problems are independent of technology

▪ You need better monitoring

▪ There is downtime when you need to modify the clustered index

▪ Database ops is hard with legacy apps and multi-user deployments

▪ In a few years someone will document the many stalls in HBase

Page 53

The End

Thank you