monitoring and observability
DESCRIPTION
In this session we’ll treat the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. The complexities lead to evil optimization problems and significant challenges in troubleshooting production issues to a speedy and successful end. Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability. You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.
TRANSCRIPT
![Page 1: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/1.jpg)
Monitoring and Observability
in Complex Architectures
Tuesday, October 2, 12
![Page 2: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/2.jpg)
Hi! I’m @postwait
I founded @OmniTI and @MessageSystems and @Circonus
![Page 3: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/3.jpg)
Hi! I’m @postwait
I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.
![Page 4: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/4.jpg)
Hi! I’m @postwait
I (regrettably) build complex systems.
![Page 5: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/5.jpg)
Why we are here
We’re here to talk about coping with breakage
![Page 6: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/6.jpg)
Rule #1
Direct observation of failure leads to quicker rectification.
![Page 7: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/7.jpg)
Rule #2
You cannot correct what you cannot measure.
![Page 8: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/8.jpg)
Solution Approach #1
Debugging failures requires either visibility into the precipitating state
![Page 9: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/9.jpg)
Precipitating State
Single threaded applications
✓ Easy
![Page 10: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/10.jpg)
Precipitating State
Multi-threaded applications
✓ Challenging
![Page 11: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/11.jpg)
Precipitating State
Distributed applications
here there be dragons
![Page 12: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/12.jpg)
Solution Approach #2
or direct observation of a (and likely very many) failing transaction
![Page 13: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/13.jpg)
Direct Observation
Observing something fail... is priceless.
![Page 14: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/14.jpg)
Direct Observation
Observation leads to intelligent questioning.
![Page 15: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/15.jpg)
Direct Observation
Questioning leads to answers... but only through more observation.
![Page 16: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/16.jpg)
Direct Observation
Questioning leads to answers... but only through more observation.
and herein lies the rub.
![Page 17: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/17.jpg)
Leaning Towards Scientific Process
In production you don’t have
• repeatability
• control groups
• external verification
![Page 18: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/18.jpg)
Leaning Towards Scientific Process
In production you don’t have
• repeatability
• control groups
• external verification
... or do you?
![Page 19: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/19.jpg)
What’s monitoring got to do with it?
Monitoring is all about the passive observation of telemetry data.
![Page 20: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/20.jpg)
Monitoring Telemetry
cannot pinpoint problems
can provide evidence of the existence of a problem
![Page 21: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/21.jpg)
Monitoring
Gives you evidence that there is a problem
![Page 22: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/22.jpg)
Monitoring
Gives you evidence that you have fixed a problem (or at least the symptoms)
![Page 23: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/23.jpg)
Monitoring Tactically
If it could be of interest, measure it and expose the measurement
![Page 24: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/24.jpg)
Monitoring: embedded
statsd https://github.com/etsy/statsd
resmon http://labs.omniti.com/labs/resmon
metrics https://github.com/codahale/metrics
folsom https://github.com/boundary/folsom
metrics.js https://github.com/mikejihbe/metrics
metrics-net https://github.com/danielcrenna/metrics-net
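As a minimal sketch of "measure it and expose the measurement", the snippet below formats a counter and a timer in the plain-text line format that statsd accepts over UDP (`<name>:<value>|<type>`). The metric names, the default host, and the helper function names are hypothetical; port 8125 is assumed here as statsd's customary default.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric as a statsd text line: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send (host and port are assumptions)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, for illustration only.
counter = statsd_line("api.requests", 1, "c")        # a counter increment
timer = statsd_line("api.service_time", 320, "ms")   # a timing in milliseconds

# To emit them from an application: send_metric(counter); send_metric(timer)
```

UDP keeps the instrumentation cheap: the application does not wait on the collector, so measuring does not add a new way to fail.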
![Page 25: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/25.jpg)
Monitoring: collection
reconnoiter http://labs.omniti.com/labs/reconnoiter
graphite http://graphite.wikidot.com/
OpenTSDB http://opentsdb.net/
circonus http://circonus.com/
librato https://metrics.librato.com/
![Page 26: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/26.jpg)
Monitoring: Bling
visualizing an architecture rollout
![Page 27: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/27.jpg)
Monitoring: Bling
visualizing the impact on service times
![Page 28: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/28.jpg)
average API service time latency
![Page 29: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/29.jpg)
actual API service time latency
http://www.slideshare.net/postwait/atldevops
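The gap between "average" and "actual" latency can be illustrated with a small sketch. The sample data below is invented for illustration (90 fast requests and 10 slow ones), and `summarize` is a hypothetical helper using nearest-rank percentiles; it is not taken from the talk.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean alongside nearest-rank percentiles of the samples."""
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank method with integer arithmetic: rank = ceil(p * n / 100).
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Invented, bimodal sample: most requests are fast, a few are very slow.
sample = [10.0] * 90 + [500.0] * 10
summary = summarize(sample)
```

In this sample the mean is 59 ms, yet no request took anywhere near 59 ms: half finished in 10 ms and the slowest tenth took 500 ms. The average describes a request that never happened.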
![Page 30: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/30.jpg)
Monitoring: Bling
![Page 31: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/31.jpg)
Repeatability is a Pipe Dream
Your production problem is a (hopefully pathological) outcome of circumstance.
A circumstance which often cannot be repeated.
![Page 32: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/32.jpg)
Control Groups
Control groups can compensate for the inability to precisely repeat an experiment.
![Page 33: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/33.jpg)
Control Groups
Most architectures have redundancy.
![Page 34: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/34.jpg)
Control Groups
With the right design, you can turn that redundancy into a debugging environment.
[1] http://omniti.com/surge/2012/sessions/xtreme-deployment
![Page 35: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/35.jpg)
Control Groups: Simple Example
I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
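The two verification steps can be sketched as a comparison between the fixed host and the untouched ones. Everything here is an assumption for illustration: the host names, the `(errors, requests)` counters, and the thresholds `min_control_rate` and `ratio` are hypothetical and would need tuning for a real service.

```python
def pooled_error_rate(hosts):
    """hosts maps host name -> (errors, requests); returns errors per request."""
    errors = sum(e for e, _ in hosts.values())
    requests = sum(r for _, r in hosts.values())
    return errors / requests if requests else 0.0


def fix_is_supported(treated, control, min_control_rate=0.01, ratio=0.5):
    """True only if the control group is still broken AND the treated group
    is clearly better. If the control group recovered on its own, the
    experiment says nothing about the fix."""
    treated_rate = pooled_error_rate(treated)
    control_rate = pooled_error_rate(control)
    control_still_broken = control_rate >= min_control_rate
    treated_improved = treated_rate <= control_rate * ratio
    return control_still_broken and treated_improved


# Invented numbers: web01 received the fix, web02..web10 did not.
treated = {"web01": (2, 1000)}
control = {f"web{n:02d}": (50, 1000) for n in range(2, 11)}
verdict = fix_is_supported(treated, control)
```

Checking that the nine untouched servers are still broken matters as much as checking the fixed one: without it, a problem that simply went away would look like a successful fix.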
![Page 36: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/36.jpg)
Control Groups: Seems Easy
Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent
![Page 37: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/37.jpg)
Control Groups: Not So Easy
Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues
![Page 38: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/38.jpg)
Observability
Some might claim that seeing telemetry data is observation...
It is doubly indirect at best.
![Page 39: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/39.jpg)
Observability
I want to directly see the errant behaviour
![Page 40: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/40.jpg)
Observability is forgiving
In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
![Page 41: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/41.jpg)
Observing the network
tcpdump / snoop
wireshark
![Page 42: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/42.jpg)
Observing the network
Looking at just thearrival of new connections
tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
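The filter expression reads byte 13 of the TCP header (the flags byte), masks it with SYN (2) and ACK (16), and keeps the packet only when SYN is set and ACK is clear, which is the first packet of a new connection. A small sketch of the same bit test, with a hypothetical function name:

```python
TCP_SYN = 0x02  # the "2" in the filter
TCP_ACK = 0x10  # the "16" in the filter


def is_new_connection(flags_byte):
    """Mirror of the filter test `tcp[13] & (2|16) == 2`:
    SYN set and ACK clear, so SYN-ACK replies are excluded."""
    return (flags_byte & (TCP_SYN | TCP_ACK)) == TCP_SYN


syn_only = is_new_connection(0x02)   # client opening a connection
syn_ack = is_new_connection(0x12)    # server's reply to a SYN
plain_ack = is_new_connection(0x10)  # ordinary acknowledgement
```

Other flag bits are ignored by the mask, so a SYN carrying additional flags outside SYN and ACK still counts as a new connection.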
![Page 43: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/43.jpg)
Observing the network
Looking at just the data arrival and departure times
tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
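The arithmetic in that filter computes the TCP payload length: the IP total length, minus the IP header length, minus the TCP header length. Packets where the result is zero (bare ACKs, for example) are dropped. The sketch below reproduces the calculation on raw bytes; `build_packet` is a hypothetical helper that fabricates a minimal IPv4 + TCP header pair purely for illustration.

```python
def tcp_payload_length(packet):
    """packet: bytes starting at the IPv4 header."""
    ip_total_length = int.from_bytes(packet[2:4], "big")   # ip[2:2]
    ip_header_length = (packet[0] & 0x0F) * 4               # (ip[0]&0xf)*4
    data_offset_byte = packet[ip_header_length + 12]        # tcp[12]
    tcp_header_length = (data_offset_byte & 0xF0) // 4      # (tcp[12]&0xf0)/4
    return ip_total_length - ip_header_length - tcp_header_length


def build_packet(payload):
    """Fabricate a 20-byte IPv4 header and 20-byte TCP header (no options)."""
    ip_header = bytearray(20)
    ip_header[0] = 0x45  # version 4, header length 5 words
    ip_header[2:4] = (20 + 20 + len(payload)).to_bytes(2, "big")
    tcp_header = bytearray(20)
    tcp_header[12] = 0x50  # data offset 5 words in the high nibble
    return bytes(ip_header) + bytes(tcp_header) + payload


with_data = tcp_payload_length(build_packet(b"GET /"))
bare_ack = tcp_payload_length(build_packet(b""))
```

The TCP data offset lives in the high nibble of byte 12 and counts 32-bit words, so masking with 0xf0 and dividing by 4 is the same as shifting right by 4 and multiplying by 4.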
![Page 44: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/44.jpg)
Observing the network
Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).
{
  gsub(".[0-9]+(: | >)"," \& ");
  gsub("[:=]"," ");
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  S[EP]= ($4==".80")?"S":"C";
  L[EP]= $1;
}
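The awk filter keeps, per client endpoint, who spoke last and when, and prints the elapsed time whenever the server speaks right after the client. The same state machine is sketched below on already-parsed records. The tuple layout `(timestamp, src_ip, src_port, dst_ip, dst_port)` and the sample values are assumptions for illustration; parsing real tcpdump output is left out.

```python
def response_latencies(packets, server_port=80):
    """Yield-style list of (latency_seconds, client_endpoint) pairs.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port),
    restricted to data-carrying packets as in the capture filter.
    """
    last_speaker = {}  # endpoint -> "C" (client) or "S" (server)
    last_seen = {}     # endpoint -> timestamp of the previous packet
    results = []
    for ts, src, sport, dst, dport in packets:
        from_server = sport == server_port
        # The endpoint key is always the client's side of the connection.
        endpoint = (dst, dport) if from_server else (src, sport)
        if from_server and last_speaker.get(endpoint) == "C":
            results.append((ts - last_seen[endpoint], endpoint))
        last_speaker[endpoint] = "S" if from_server else "C"
        last_seen[endpoint] = ts
    return results


# Invented capture: one request, then two response packets.
packets = [
    (100.00, "10.0.0.1", 54321, "10.0.0.2", 80),
    (100.25, "10.0.0.2", 80, "10.0.0.1", 54321),
    (100.50, "10.0.0.2", 80, "10.0.0.1", 54321),
]
latencies = response_latencies(packets)
```

Only the first server packet after a client packet is counted, so the figure measures time to first response byte, not time to complete the response.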
![Page 45: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/45.jpg)
Observing the network
![Page 46: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/46.jpg)
Observing the network
![Page 47: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/47.jpg)
Observing user-space
strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]
[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
![Page 48: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/48.jpg)
System call tracing
Watching sshd is a good way to get familiar.
truss -f -p `pgrep sshd`
![Page 49: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/49.jpg)
System call tracing
An active web server is going to be like a firehose.
truss -f -p `pgrep httpd`
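One way to drink from the firehose is to count system calls by name instead of reading every line. The sketch below is an assumption-heavy illustration: the regular expression guesses at common line prefixes (`[pid N]`, `N:` or `N/T:`) seen in strace and truss output, the sample lines are invented, and real output varies by tool, platform, and flags.

```python
import re
from collections import Counter

# Optional process prefix, then a syscall name followed by "(".
# The prefix shapes are assumptions, not a specification of either tool.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+[:/\d]*\s+)?([A-Za-z_]\w*)\(")


def count_syscalls(lines):
    """Tally syscall names; lines that do not look like a call are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines in the assumed formats.
sample = [
    '[pid  4242] accept(3, 0x7ffd, [16]) = 5',
    '[pid  4242] read(5, "GET / HTTP/1.1", 8192) = 14',
    '4243:\tread(6, 0x08047A10, 8192) = 120',
    '4243/2:\twrite(6, 0x08047A10, 512) = 512',
    '--- SIGCHLD ---',
]
counts = count_syscalls(sample)
```

A tally like this turns thousands of lines into a short table, which is usually enough to suggest the next, narrower question to ask.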
![Page 50: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/50.jpg)
Observing the system
DTrace
Live production demo or GTFO.
![Page 51: Monitoring and observability](https://reader034.vdocument.in/reader034/viewer/2022051818/54b7b67c4a7959c9688b46ed/html5/thumbnails/51.jpg)
Thank You
Questions?