TRANSCRIPT
QUEUES YOUR QUERY WAITS IN
JOSH SNYDER
SO MUCH QUEUING SO LITTLE TIME
JOSH SNYDER
ESCALATOR ETIQUETTE
ESCALATING UPHEAVAL
Why would they do such a thing?
          Latency                    Throughput
units     time                       time⁻¹
measures  smallest sliver of work    largest sample of work
limit     lim n → 1                  lim n → ∞
HYPOTHETICAL ESCALATOR ANALYSIS
Person standing requires 1 stair for 24 seconds
So: 24 stair-seconds
Person walking requires 12 seconds
To break even, walkers must be spaced ≤2 stairs apart
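The break-even claim above can be checked with a few lines of arithmetic; a minimal sketch using the slide's figures (1 stair for 24 seconds standing, 12 seconds walking):

```python
# A standing rider occupies 1 stair for 24 seconds: 24 stair-seconds.
stand_stairs, stand_time = 1, 24.0
stand_cost = stand_stairs * stand_time  # stair-seconds per standing rider

# A walking rider finishes in 12 seconds. To consume no more
# stair-seconds than a stander, a walker may occupy at most:
walk_time = 12.0
max_stairs_per_walker = stand_cost / walk_time

print(max_stairs_per_walker)  # 2.0 → walkers must be spaced ≤2 stairs apart
```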
http://www.gizmodo.co.uk/2017/03/the-results-are-in-the-holborn-escalator-trial-proves-that-it-is-better-to-stand-on-the-escalator-well-sometimes/
WHAT'S TO COME
two "favorite" tools: iostat and loadavg
layers and layers of latency
managing multi-tenancy
load (un)balancing
IOSTAT
Presents disk I/O statistics
Reads /proc/diskstats (Linux)
IOSTAT: AN EXAMPLE

        reads      r_sects      r_ms      t_act
t=0     459583     96210660     2693900   599164
t=10    476346     100180364    2741888   608628
Δ       16763      3969704      47988     9464
/ 10    1676.3     396970.4     4798.8    946.4
human   1676.3/s   190.83 MB/s  4.7988 s/s  0.9464 s/s
r_await: average time each I/O waited
4798.8 ms spent / 1676.3 reads → 2.86 ms / op

avgrq-sz: mean sectors per I/O
396970.4 sectors / 1676.3 reads → 236.81 sectors / op

svctm: non-idle time (%util) / #ops
946.4 ms / second (94.6 %util), 1676.3 reads
946.4 / 1676.3 → 0.56 ms / op
Device:  r/s   rMB/s   avgrq-sz  avgqu-sz  r_await  svctm  %util
vda      1676  190.83  236.81    4.80      2.86     0.56   94.64
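The three derivations above are just per-second counters divided by the read rate; a sketch reproducing the slide's numbers from the Δ/10 row:

```python
# Per-second rates taken from the Δ/10 row of the /proc/diskstats deltas.
reads = 1676.3      # reads completed per second
sectors = 396970.4  # sectors read per second
r_ms = 4798.8       # ms spent waiting on reads, per second
t_act = 946.4       # ms the device was non-idle, per second

r_await = r_ms / reads      # mean wait per I/O (ms)
avgrq_sz = sectors / reads  # mean sectors per I/O
svctm = t_act / reads       # non-idle time per op (ms)
util = t_act / 1000 * 100   # %util

print(round(r_await, 2), round(avgrq_sz, 2), round(svctm, 2), round(util, 2))
# 2.86 236.81 0.56 94.64
```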
LOAD AVERAGE
Collected by the scheduler
Based on process states
PROCESS STATES
A process/thread/task is either:
runnable
  on a CPU
  starved of CPU
waiting for something
BEING RUNNABLE

int i = 0;
while (1) { i++; }
CPU STARVATION
Possible reasons:
no CPU is available
task is (temporarily) assigned to a CPU with other work to do
a bug in the scheduler (see "A Decade of Wasted Cores")
MEASURING CPU STARVATION
Formats documented in Documentation/scheduler/sched-stats.txt
$ awk '/^cpu/ { printf "%s %.9fs\n", $1, $9 / 1e9 }' /proc/schedstat
cpu0 508.505281125s
cpu1 186.946423306s
$ awk '{ printf "%.9fs\n", $2 / 1e9 }' /proc/$PID/schedstat
3.181567463s
Resource-by-resource analysis
USE method
WAITING VOLUNTARILY
accept() a new network connection
recv() data on a socket
sleep() a timer
futex() a memory address (lock)
waitpid() a process
etc...
SLEEPING INVOLUNTARILY

pkill -STOP mysqld
PROCESS STATES
RUNNING (R)          misnomer: runnable process
UNINTERRUPTIBLE (D)  waiting for disk
INTERRUPTIBLE (S)    waiting for something else
STOPPED (T)          (forced to) wait for SIGCONT
ZOMBIE (Z)           waiting for parent to waitpid()
See include/linux/sched.h for gory details
LOAD AVERAGE
Instantaneous load:
  TASK_RUNNING (R) + TASK_UNINTERRUPTIBLE (D)
sampled every 5 seconds
into an exponentially weighted moving average

See: include/linux/sched.h
     kernel/sched/loadavg.c
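The sampling described above can be sketched as an exponentially weighted moving average. This is a simplification of the fixed-point math in kernel/sched/loadavg.c; `window` stands in for the kernel's 1/5/15-minute constants:

```python
import math

def update_load(load, n_active, interval=5.0, window=60.0):
    """One EWMA step: every `interval` seconds, fold the instantaneous
    count of R + D tasks into the running average for `window`."""
    decay = math.exp(-interval / window)
    return load * decay + n_active * (1 - decay)

# Feed a constant 4 runnable tasks: the 1-minute average converges to 4.
load = 0.0
for _ in range(100):
    load = update_load(load, 4)
print(round(load, 3))
```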
WHAT'S BETTER THAN A LOAD AVERAGE?
for CPU: runqueue latency
for disk:
  iostat avgqu-sz (per disk)
  delayacct_blkio_ticks (per task)
https://github.com/hashbrowncipher/taskstats_exporter
DELAYACCT_BLKIO_TICKS
How long a process spent in the D state, in hundredths of a second:

$ awk '{ print $42 / 100 }' < /proc/$PID/stat
19.68
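A Python equivalent of the awk one-liner above, with one hedge: this assumes the usual proc(5) layout, where field 42 is delayacct_blkio_ticks measured in clock ticks (USER_HZ, normally 100, hence the slide's division):

```python
import os

def blkio_seconds(pid):
    """Seconds `pid` has spent delayed on block I/O (the D state)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces; split after its closing paren.
    fields = stat.rsplit(")", 1)[1].split()
    # fields[0] is field 3 of proc(5), so field 42 is fields[39].
    ticks = int(fields[39])
    return ticks / os.sysconf("SC_CLK_TCK")

print(blkio_seconds(os.getpid()))
```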
WORKLOAD DEPENDENCE (1)
Resources: ≥32 cores; SSD with maximum performance at QD=16-32
Workload:

def random_reader():
    while True:
        do_random_read()

threaded(random_reader, 500).start()
WORKLOAD DEPENDENCE (2)
Compare this workload:

def locked_random_reader(semaphore):
    while True:
        with semaphore:
            do_random_read()

max_ios = 32
semaphore = Semaphore(max_ios)
threaded(lambda: locked_random_reader(semaphore), 500).start()
WORKLOAD DEPENDENCE: LESSONS
Changes in workload will change both bad stats (load) and good ones (delayacct_blkio_ticks)
Locks are a form of queueing!
QUESTIONS?
SIMPLIFIED MYSQL EXAMPLE
1. Query packet arrives at NIC
2. Kernel adds packet to socket queue; wakes recv()'ing MySQL thread (S → R)
3. Buffer pool lookup: MISS! (waited for locks, R → S → R)
4. Read pages from disk (R → D → R)
5. Big result; send result to client (R → S → R)
6. Wait for client: recv() (R → S)
7. GOTO 1
"SIMPLIFIED" IS A KEY WORD
Did packet processing happen due to interrupt, or polling?
How hot are the CPU caches?
Query passed through bunches of MySQL events_stages
etc...
LATENCY ANALYSIS IS FRACTALLY COMPLEX!
CPUs (and everything else) are abstractions that hide complexity:
is it throttling?
how many cycles did I stall due to memory access?
how many cycles did I stall due to lack of resources in the processor?
BUT WE DO IT ANYWAY!
CPU time is still a useful metric, even though we take it with a grain of salt!
COLLECTING LATENCY INFORMATION
Two methods:
1. Timing
2. Sampling (Little's law!)
COLLECTING TIMINGS
T: accumulator
t: current time

T += t_end - t_start
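The accumulator pattern above, sketched as a context manager; `time.perf_counter` is used here because on Linux CPython backs it with `clock_gettime` (the vDSO path discussed on the next slide):

```python
import time

class Accumulator:
    """Accumulate wall-clock time spent inside `with` blocks:
    T += t_end - t_start."""
    def __init__(self):
        self.total = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.total += time.perf_counter() - self._start

query_time = Accumulator()
with query_time:
    time.sleep(0.01)  # stand-in for real work
print(query_time.total)
```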
INSIST ON VDSO TIMEKEEPING
vDSO: virtual dynamic shared object
vDSO timekeeping takes 25-45 ns in a tight loop
non-vDSO timekeeping takes ~4x longer
bad:
$ strace -qq -e clock_gettime date > /dev/null
clock_gettime(CLOCK_REALTIME, {946684800, 0}) = 0

good:
$ strace -qq -e clock_gettime date > /dev/null
(no output: the call stayed in the vDSO and never entered the kernel)
http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
https://www.slideshare.net/AmazonWebServices/cmp402-amazon-ec2-instances-deep-dive
LITTLE'S LAW
In a stable system: L = λW
L (mean dwelling customers)
λ (mean arrival rate)
W (mean dwell time)
LITTLE'S LAW APPLIED TO MYSQL
Over 1 second:
1000 query threads (~Threads_running, sampled)
1e5 Questions (SHOW STATUS LIKE 'Questions')

L / λ = W
So: 1000 / (1e5 / sec) = 10 ms per query
Over the same period the application records 2300 outstanding queries (on average)

2300 / (1e5 / sec) = 23 ms per query

23 - 10 = 13 ms of unaccounted-for time (on average)
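The rearrangement used above (W = L / λ) as a one-liner, with the slide's server-side numbers:

```python
def dwell_time(mean_in_system, arrival_rate):
    """Little's law rearranged: W = L / λ (stable system assumed)."""
    return mean_in_system / arrival_rate

# 1000 threads observed inside the server, 1e5 queries/sec:
server_w = dwell_time(1000, 1e5)
print(server_w)  # 0.01 → 10 ms per query
```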
PROBLEMS ABOUND
We now know an average over our queries in general, and nothing about any query in particular.
INTERLUDE: WHY NOT HISTOGRAMS‽
A useful histogram requires ~16-276 counters.
An average requires 2 (+1 for variance).
We can track far more metrics as averages than we ever could as histograms.
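The "2 counters (+1 for variance)" bookkeeping can be done in a single pass with Welford's algorithm; a minimal sketch:

```python
class RunningStats:
    """Online mean and variance: three accumulators, no stored samples."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for sample_ms in (1.0, 2.0, 3.0, 4.0):
    stats.add(sample_ms)
print(stats.mean, stats.variance)  # 2.5 1.25
```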
QUESTIONS?
A TALE OF TWO TENANTS
Service "fishpics" with N workers and two paths:
cache hit (1 ms)
cache miss (100-10000 ms)
AN "ANSWER" TO BAD AVERAGES
Track cache hit/miss time (mean + variance) separately
Track time-in-queue separately from work time
MULTI-TENANCY
if misses get too slow, hits will wait
two "tenants", one blocking the other
FAIRNESS
Queue time is useful, but not in isolation!
If a 100 ms RPC waits 3 ms: no big deal
If a 100 µs RPC waits 3 ms: alarm bells!
SLOWDOWN
S = (queued time) / (working time)
worthwhile whenever a human is waiting
cf. express lanes in grocery stores
OLTP vs. batch workloads
CONCURRENCY LIMITING
fishpics service:
limit misses to 90% of workers
drop requests above 90%
single pool of servers; single deployment
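A sketch of the limit-and-drop policy above using a non-blocking semaphore; the `MISS_LIMIT` value and the `ConcurrencyLimiter` name are illustrative, not from the talk:

```python
import threading

MISS_LIMIT = 28  # e.g. ~90% of a 32-worker pool (illustrative)

class ConcurrencyLimiter:
    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)

    def try_run(self, fn):
        # Non-blocking acquire: above the limit we shed load
        # instead of letting misses queue behind one another.
        if not self._sem.acquire(blocking=False):
            return None  # dropped
        try:
            return fn()
        finally:
            self._sem.release()

misses = ConcurrencyLimiter(MISS_LIMIT)
result = misses.try_run(lambda: "cache-miss handled")
print(result)
```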
ALL DATASTORES ARE MULTI-TENANT!
query threads compete with each other
batching and coalescing work → background work
background threads compete with query threads
backups are background work
EXAMPLE: MYSQL BACKUPS
Pipeline: read | compress | send
read from disk (ionice(1); CFQ IOPRIO_CLASS_IDLE)
compress (chrt(1); SCHED_IDLE)
send over network (prio qdisc; SOL_PRIORITY)
CLASSFUL SCHEDULING
work is divided into classes
if high-class work exists, low-class work waits
nice(1) is NOT classful (timeslices)
EXAMPLE: CASSANDRA COMPACTION
Goal: maximal compaction; minimal disruption
don't pick a rate a priori!
SCHED_IDLE → possible starvation
solution: cpuset
HERACLES
From Google
Colocated batch and latency-sensitive tasks
Per-resource analysis
SO MUCH MORE!
token buckets
cgroups
qdiscs (codel)
QUESTIONS?
LOAD BALANCING
how are requests allocated to backends?
a central queue minimizes queued time
(under unrealistic assumptions)
See, in general, ch 24 of "Performance Modeling and Design of Computer Systems"
BAD LOAD BALANCING
random
round-robin (slightly better)
BETTER LOAD BALANCING
join-shortest-queue
least-work-left
TAGS (for "practical" workloads)
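Join-shortest-queue is simple to state; a sketch where `queue_depths[i]` is the count of outstanding requests at backend i (names are illustrative):

```python
def join_shortest_queue(queue_depths):
    """Dispatch to the backend with the fewest outstanding requests
    (ties go to the lowest index)."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

print(join_shortest_queue([3, 1, 2]))  # 1
```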
TAGS (PREPARE YOUR MIND)
throws away work!
unbalances load!
fairness over throughput
TAGS (KEY CONCEPTS)
Non-preemptible, idempotent jobs
Large variance in job size
Unwilling (unable) to make predictions
Slowdown metric (covered earlier)
Server expansion requirement
TAGS
Allow jobs to run a limited amount of time
Kill and requeue jobs that run too long
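A toy simulation of the kill-and-requeue policy; the job sizes and cutoffs are made up, but it shows both TAGS properties at once: short jobs finish without waiting behind long ones, and every killed run is thrown-away work:

```python
def tags_dispatch(job_sizes, cutoffs):
    """Each job runs at server i for at most cutoffs[i] seconds; if it
    doesn't finish, it is killed and restarted at server i+1 (which has
    a larger cutoff). Returns (useful work done, work thrown away)."""
    completed, wasted = 0.0, 0.0
    for size in job_sizes:
        for cutoff in cutoffs:
            if size <= cutoff:
                completed += size   # finished within this cutoff
                break
            wasted += cutoff        # ran for `cutoff`, then was killed
    return completed, wasted

print(tags_dispatch([1.0, 5.0], cutoffs=[2.0, 10.0]))  # (6.0, 2.0)
```

The 5-second job burns 2 seconds at the first server before being killed and rerun; that waste is the price TAGS pays so the 1-second job never queues behind it.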
LOAD UNBALANCING
(FINAL) QUESTIONS?