TRANSCRIPT
QUEUES YOUR QUERY WAITS IN
JOSH SNYDER
SO MUCH QUEUING SO LITTLE TIME
JOSH SNYDER
ESCALATOR ETIQUETTE
ESCALATING UPHEAVAL
Why would they do such a thing?
          Latency                    Throughput
units     time                       time⁻¹
measures  smallest sliver of work    largest sample of work
limit     lim n → 1                  lim n → ∞
HYPOTHETICAL ESCALATOR ANALYSIS
Person standing requires 1 stair for 24 seconds
So: 24 stair-seconds
Person walking requires 12 seconds
To break even, walkers must be spaced ≤2 stairs apart
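The break-even claim above can be checked with a few lines of arithmetic; a minimal sketch using the slide's figures (1 stair for 24 seconds standing, 12 seconds walking):

```python
# A standing rider occupies 1 stair for 24 seconds: 24 stair-seconds.
stand_stairs, stand_time = 1, 24.0
stand_cost = stand_stairs * stand_time  # stair-seconds per standing rider

# A walking rider finishes in 12 seconds. To consume no more
# stair-seconds than a stander, a walker may occupy at most:
walk_time = 12.0
max_stairs_per_walker = stand_cost / walk_time

print(max_stairs_per_walker)  # 2.0 → walkers must be spaced ≤2 stairs apart
```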
http://www.gizmodo.co.uk/2017/03/the-results-are-in-the-holborn-escalator-trial-proves-that-it-is-better-to-stand-on-the-escalator-well-sometimes/
WHAT'S TO COME
two "favorite" tools: iostat and loadavg
layers and layers of latency
managing multi-tenancy
load (un)balancing
IOSTAT
Presents disk I/O statistics
Reads /proc/diskstats (Linux)
IOSTAT: AN EXAMPLE

        reads      r_sects      r_ms      t_act
t=0     459583     96210660     2693900   599164
t=10    476346     100180364    2741888   608628
Δ       16763      3969704      47988     9464
/ 10    1676.3     396970.4     4798.8    946.4
human   1676.3/s   190.83 MB/s  4.7988 s/s  0.9464 s/s
r_await: average time each I/O waited
4798.8 ms spent / 1676.3 reads → 2.86 ms / op

avgrq-sz: mean sectors per I/O
396970.4 sectors / 1676.3 reads → 236.81 sectors / op

svctm: non-idle time (%util) / #ops
946.4 ms / second (94.6 %util), 1676.3 reads
946.4 / 1676.3 → 0.56 ms / op
Device:  r/s   rMB/s   avgrq-sz  avgqu-sz  r_await  svctm  %util
vda      1676  190.83  236.81    4.80      2.86     0.56   94.64
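The three derivations above are just per-second counters divided by the read rate; a sketch reproducing the slide's numbers from the Δ/10 row:

```python
# Per-second rates taken from the Δ/10 row of the /proc/diskstats deltas.
reads = 1676.3      # reads completed per second
sectors = 396970.4  # sectors read per second
r_ms = 4798.8       # ms spent waiting on reads, per second
t_act = 946.4       # ms the device was non-idle, per second

r_await = r_ms / reads      # mean wait per I/O (ms)
avgrq_sz = sectors / reads  # mean sectors per I/O
svctm = t_act / reads       # non-idle time per op (ms)
util = t_act / 1000 * 100   # %util

print(round(r_await, 2), round(avgrq_sz, 2), round(svctm, 2), round(util, 2))
# 2.86 236.81 0.56 94.64
```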
LOAD AVERAGE
Collected by the scheduler
Based on process states
PROCESS STATES
A process/thread/task is either:
runnable
  on a CPU
  starved of CPU
waiting for something
BEING RUNNABLE

int i = 0;
while (1) { i++; }
CPU STARVATION
Possible reasons:
no CPU is available
task is (temporarily) assigned to a CPU with other work to do
a bug in the scheduler (see "A Decade of Wasted Cores")
MEASURING CPU STARVATION
Formats documented in Documentation/scheduler/sched-stats.txt
$ awk '/^cpu/ { printf "%s %.9fs\n", $1, $9 / 1e9 }' /proc/schedstat
cpu0 508.505281125s
cpu1 186.946423306s
$ awk '{ printf "%.9fs\n", $2 / 1e9 }' /proc/$PID/schedstat
3.181567463s
Resource-by-resource analysis
USE method
WAITING VOLUNTARILY
accept() a new network connection
recv() data on a socket
sleep() a timer
futex() a memory address (lock)
waitpid() a process
etc...
SLEEPING INVOLUNTARILY

pkill -STOP mysqld
PROCESS STATES
RUNNING (R)          misnomer: runnable process
UNINTERRUPTIBLE (D)  waiting for disk
INTERRUPTIBLE (S)    waiting for something else
STOPPED (T)          (forced to) wait for SIGCONT
ZOMBIE (Z)           waiting for parent to waitpid()
See include/linux/sched.h for gory details
LOAD AVERAGE
Instantaneous load:
  TASK_RUNNING (R) + TASK_UNINTERRUPTIBLE (D)
sampled every 5 seconds
into an exponentially weighted moving average

See: include/linux/sched.h
     kernel/sched/loadavg.c
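The sampling described above can be sketched as an exponentially weighted moving average. This is a simplification of the fixed-point math in kernel/sched/loadavg.c; `window` stands in for the kernel's 1/5/15-minute constants:

```python
import math

def update_load(load, n_active, interval=5.0, window=60.0):
    """One EWMA step: every `interval` seconds, fold the instantaneous
    count of R + D tasks into the running average for `window`."""
    decay = math.exp(-interval / window)
    return load * decay + n_active * (1 - decay)

# Feed a constant 4 runnable tasks: the 1-minute average converges to 4.
load = 0.0
for _ in range(100):
    load = update_load(load, 4)
print(round(load, 3))
```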
WHAT'S BETTER THAN A LOAD AVERAGE?
for CPU: runqueue latency
for disk:
  iostat avgqu-sz (per disk)
  delayacct_blkio_ticks (per task)
https://github.com/hashbrowncipher/taskstats_exporter
DELAYACCT_BLKIO_TICKS
How long a process spent in the D state, in hundredths of a second:

$ awk '{ print $42 / 100 }' < /proc/$PID/stat
19.68
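A Python equivalent of the awk one-liner above, with one hedge: this assumes the usual proc(5) layout, where field 42 is delayacct_blkio_ticks measured in clock ticks (USER_HZ, normally 100, hence the slide's division):

```python
import os

def blkio_seconds(pid):
    """Seconds `pid` has spent delayed on block I/O (the D state)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces; split after its closing paren.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 of proc(5), so field 42 is fields[39].
    ticks = int(fields[39])
    return ticks / os.sysconf("SC_CLK_TCK")

print(blkio_seconds(os.getpid()))
```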
WORKLOAD DEPENDENCE (1)
Resources: ≥32 cores; SSD with maximum performance at QD=16-32
Workload:

def random_reader():
    while True:
        do_random_read()

threaded(random_reader, 500).start()
WORKLOAD DEPENDENCE (2)
Compare this workload:

def locked_random_reader(semaphore):
    while True:
        with semaphore:
            do_random_read()

max_ios = 32
semaphore = Semaphore(max_ios)
threaded(lambda: locked_random_reader(semaphore), 500).start()
WORKLOAD DEPENDENCE: LESSONS
Changes in workload will change both bad stats (load) and good ones (delayacct_blkio_ticks)
Locks are a form of queueing!
QUESTIONS?
SIMPLIFIED MYSQL EXAMPLE
1. Query packet arrives at NIC
2. Kernel adds packet to socket queue; wakes recv()'ing MySQL thread (S → R)
3. Buffer pool lookup: MISS! (waited for locks, R → S → R)
4. Read pages from disk (R → D → R)
5. Big result; send result to client (R → S → R)
6. Wait for client: recv() (R → S)
7. GOTO 1
"SIMPLIFIED" IS A KEY WORD
Did packet processing happen due to interrupt, or polling?
How hot are the CPU caches?
Query passed through bunches of MySQL events_stages
etc...
LATENCY ANALYSIS IS FRACTALLY COMPLEX!
CPUs (and everything else) are abstractions that hide complexity:
is it throttling?
how many cycles did I stall due to memory access?
how many cycles did I stall due to lack of resources in the processor?
BUT WE DO IT ANYWAY!
CPU time is still a useful metric, even though we take it with a grain of salt!
COLLECTING LATENCY INFORMATION
Two methods:
1. Timing
2. Sampling (Little's law!)
COLLECTING TIMINGS
T: accumulator
t: current time

T += t_end - t_start
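The accumulator pattern above, sketched as a context manager; `time.perf_counter` is used here because on Linux CPython backs it with `clock_gettime` (the vDSO path discussed on the next slide):

```python
import time

class Accumulator:
    """Accumulate wall-clock time spent inside `with` blocks:
    T += t_end - t_start."""
    def __init__(self):
        self.total = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.total += time.perf_counter() - self._start

query_time = Accumulator()
with query_time:
    time.sleep(0.01)  # stand-in for real work
print(query_time.total)
```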
INSIST ON VDSO TIMEKEEPING
vDSO: virtual dynamic shared object
vDSO timekeeping takes 25-45 ns in a tight loop
non-vDSO timekeeping takes ~4x longer
bad:
$ strace -qq -e clock_gettime date > /dev/null
clock_gettime(CLOCK_REALTIME, {946684800, 0}) = 0

good:
$ strace -qq -e clock_gettime date > /dev/null
(no output: the call stayed in the vDSO and never entered the kernel)
http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
https://www.slideshare.net/AmazonWebServices/cmp402-amazon-ec2-instances-deep-dive
LITTLE'S LAW
In a stable system: L = λW
L (mean dwelling customers)
λ (mean arrival rate)
W (mean dwell time)
LITTLE'S LAW APPLIED TO MYSQL
Over 1 second:
1000 query threads (~Threads_running, sampled)
1e5 Questions (SHOW STATUS LIKE 'Questions')

L / λ = W
So: 1000 / (1e5 / sec) = 10 ms per query
Over the same period the application records 2300 outstanding queries (on average)

2300 / (1e5 / sec) = 23 ms per query

23 - 10 = 13 ms of unaccounted-for time (on average)
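The rearrangement used above (W = L / λ) as a one-liner, with the slide's server-side numbers:

```python
def dwell_time(mean_in_system, arrival_rate):
    """Little's law rearranged: W = L / λ (stable system assumed)."""
    return mean_in_system / arrival_rate

# 1000 threads observed inside the server, 1e5 queries/sec:
server_w = dwell_time(1000, 1e5)
print(server_w)  # 0.01 → 10 ms per query
```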
PROBLEMS ABOUND
We now know an average over our queries in general, and nothing about any query in particular.
INTERLUDE: WHY NOT HISTOGRAMS‽
A useful histogram requires ~16-276 counters.
An average requires 2 (+1 for variance).
We can track far more metrics as averages than we ever could as histograms.
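The "2 counters (+1 for variance)" bookkeeping can be done in a single pass with Welford's algorithm; a minimal sketch:

```python
class RunningStats:
    """Online mean and variance: three accumulators, no stored samples."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for sample_ms in (1.0, 2.0, 3.0, 4.0):
    stats.add(sample_ms)
print(stats.mean, stats.variance)  # 2.5 1.25
```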
QUESTIONS?
A TALE OF TWO TENANTS
Service "fishpics" with N workers and two paths:
cache hit (1 ms)
cache miss (100-10000 ms)
AN "ANSWER" TO BAD AVERAGES
Track cache hit/miss time (mean + variance) separately
Track time-in-queue separately from work time
MULTI-TENANCY
if misses get too slow, hits will wait
two "tenants", one blocking the other
FAIRNESS
Queue time is useful, but not in isolation!
If a 100 ms RPC waits 3 ms: no big deal
If a 100 µs RPC waits 3 ms: alarm bells!
SLOWDOWN
S = (queued time) / (working time)
worthwhile whenever a human is waiting
cf. express lanes in grocery stores
OLTP vs. batch workloads
CONCURRENCY LIMITING
fishpics service:
limit misses to 90% of workers
drop requests above 90%
single pool of servers; single deployment
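A sketch of the limit-and-drop policy above using a non-blocking semaphore; the `MISS_LIMIT` value and the `ConcurrencyLimiter` name are illustrative, not from the talk:

```python
import threading

MISS_LIMIT = 28  # e.g. ~90% of a 32-worker pool (illustrative)

class ConcurrencyLimiter:
    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)

    def try_run(self, fn):
        # Non-blocking acquire: above the limit we shed load
        # instead of letting misses queue behind one another.
        if not self._sem.acquire(blocking=False):
            return None  # dropped
        try:
            return fn()
        finally:
            self._sem.release()

misses = ConcurrencyLimiter(MISS_LIMIT)
result = misses.try_run(lambda: "cache-miss handled")
print(result)
```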
ALL DATASTORES ARE MULTI-TENANT!
query threads compete with each other
batching and coalescing work → background work
background threads compete with query threads
backups are background work
EXAMPLE: MYSQL BACKUPS
Pipeline: read | compress | send
read from disk (ionice(1); CFQ IOPRIO_CLASS_IDLE)
compress (chrt(1); SCHED_IDLE)
send over network (prio qdisc; SOL_PRIORITY)
CLASSFUL SCHEDULING
work is divided into classes
if high-class work exists, low-class work waits
nice(1) is NOT classful (timeslices)
EXAMPLE: CASSANDRA COMPACTION
Goal: maximal compaction; minimal disruption
don't pick a rate a priori!
SCHED_IDLE → possible starvation
solution: cpuset
HERACLES
From Google
Colocated batch and latency-sensitive tasks
Per-resource analysis
SO MUCH MORE!
token buckets
cgroups
qdiscs (codel)
QUESTIONS?
LOAD BALANCING
how are requests allocated to backends?
a central queue minimizes queued time
(under unrealistic assumptions)
See, in general, ch 24 of "Performance Modeling and Design of Computer Systems"
BAD LOAD BALANCING
random
round-robin (slightly better)
BETTER LOAD BALANCING
join-shortest-queue
least-work-left
TAGS (for "practical" workloads)
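Join-shortest-queue is simple to state; a sketch where `queue_depths[i]` is the count of outstanding requests at backend i (names are illustrative):

```python
def join_shortest_queue(queue_depths):
    """Dispatch to the backend with the fewest outstanding requests
    (ties go to the lowest index)."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

print(join_shortest_queue([3, 1, 2]))  # 1
```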
TAGS (PREPARE YOUR MIND)
throws away work!
unbalances load!
fairness over throughput
TAGS (KEY CONCEPTS)
Non-preemptible, idempotent jobs
Large variance in job size
Unwilling (unable) to make predictions
Slowdown metric (covered earlier)
Server expansion requirement
TAGS
Allow jobs to run a limited amount of time
Kill and requeue jobs that run too long
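A toy simulation of the kill-and-requeue policy; the job sizes and cutoffs are made up, but it shows both TAGS properties at once: short jobs finish without waiting behind long ones, and every killed run is thrown-away work:

```python
def tags_dispatch(job_sizes, cutoffs):
    """Each job runs at server i for at most cutoffs[i] seconds; if it
    doesn't finish, it is killed and restarted at server i+1 (which has
    a larger cutoff). Returns (useful work done, work thrown away)."""
    completed, wasted = 0.0, 0.0
    for size in job_sizes:
        for cutoff in cutoffs:
            if size <= cutoff:
                completed += size   # finished within this cutoff
                break
            wasted += cutoff        # ran for `cutoff`, then was killed
    return completed, wasted

print(tags_dispatch([1.0, 5.0], cutoffs=[2.0, 10.0]))  # (6.0, 2.0)
```

The 5-second job burns 2 seconds at the first server before being killed and rerun; that waste is the price TAGS pays so the 1-second job never queues behind it.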
LOAD UNBALANCING
(FINAL) QUESTIONS?