Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems
S. Bhatia, A. Kumar, M. E. Fiuczynski, L. Peterson (Princeton & Google), OSDI’08
Shimin Chen
LBA Reading Group
Introduction
Troubleshooting complex software systems is difficult:
- Undetected software bugs
- Intermittent hardware failures
- Configuration errors
- Unanticipated workloads
Why is it hard to reproduce and diagnose production-only problems?
- Enough data is needed for diagnosis
  - Unexpected faults
  - Data must be collected all the time, for temporally distant faults and failures
  - Whole system: “domino effects”
  - System crashes and non-determinism
- Costs must stay low
  - Monitoring and diagnosis overhead
  - Modifications to applications or OS
  - Should not require taking the system offline
New Tool: Chopstix
- Data collection:
  - Continuously collects summaries of the system’s behavior
  - Low-level OS operations with detailed contextual information
  - Keeps data for weeks
  - <1% CPU utilization, 256 KiB RAM, 16 MiB of log per day
- Aggregation and visualization
Help diagnose a class of problems
For example: “What was the system doing last Wednesday around 5pm when the ssh prompt latency was temporarily high, yet system load appeared to be low?”
Contributions
- How a probabilistic data structure (sketches) enables detailed and lightweight data collection
- How sysadmins can use this data to diagnose problems (e.g., on PlanetLab)
Outline
- Introduction
- System Design (Section 2 & 4)
- Usage Model (Section 3)
- Evaluation (Section 5)
- Discussion (Section 6)
- Summary
System Components
- Data collector
  - Implemented in the kernel
  - Companion user process periodically copies data from the kernel to disk
  - A polling process fetches data from every machine in a networked system to a central location
- Aggregator
- Visualizer
Data Collector
- Goal: low resource usage & high coverage
- Solution: sketch-based sampling
  - Problem of uniform sampling: infrequent events are less likely to be recorded
  - The idea of a sketch: the sampling rate is adjusted according to event frequency
  - Here, “sketch” means event frequency
Data Collector: five steps
1. A trigger fires for a potentially interesting event;
2. The relevant event data structure is retrieved and a sketch is updated;
3. A sampling function is evaluated to determine if this event should be recorded;
4. If so, a sample of relevant information is saved;
5. A user-level process periodically polls the kernel for this data and resets the data structure, thereby defining a data collection epoch.
Event Triggers
- Polling has problems
  - Wasted work when there is no activity
  - Polling frequency may not be high enough under high activity
- Triggers:
  - Instrument OS interfaces in the kernel, e.g., calls to page allocators
  - For HW stats, such as L2 misses, make the processor generate periodic interrupts. Interrupt handlers are the triggers.
1. A trigger fires for a potentially interesting event;
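
As a rough illustration of the interrupt-style trigger described on this slide, the sketch below models a counter that calls a handler once every fixed number of events, instead of being polled on a timer. `OverflowCounter`, its `period`, and the numbers used are hypothetical and only stand in for a hardware performance counter; nothing here is taken from the Chopstix source.

```python
from typing import Callable, List


class OverflowCounter:
    """Toy stand-in for a hardware performance counter (hypothetical class).

    The handler is invoked once every `period` counted events, the way an
    interrupt handler would run when a real counter overflows.
    """

    def __init__(self, period: int, handler: Callable[[int], None]) -> None:
        if period <= 0:
            raise ValueError("period must be positive")
        self.period = period
        self.handler = handler
        self.total = 0

    def count(self, n: int = 1) -> None:
        """Record n events, firing the handler for each period boundary crossed."""
        before = self.total // self.period
        self.total += n
        after = self.total // self.period
        for i in range(before + 1, after + 1):
            self.handler(i * self.period)


fired: List[int] = []
l2_misses = OverflowCounter(period=1000, handler=fired.append)

# A quiet phase produces no handler calls, so no work is wasted.
l2_misses.count(300)
# A busy phase fires the handler in proportion to the activity.
l2_misses.count(2800)
```

The handler runs only when activity occurs and runs more often when activity is high, which addresses both polling problems listed above.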
Sketch Update
“sketch” means event frequency in a given data-collection epoch
- To reduce space overheads, use a hash table for all sketches
  - Hash key: event type, VM addr, executable identifier, uid, etc. (event type specific)
  - Experiments show a low probability of false negatives
- When an event is triggered, compute the hash index and increment the sketch by a weight
  - Weight: jiffies of the event
  - Longer events have larger weights
2. The relevant event data structure is retrieved and a sketch is updated;
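
A minimal sketch of the update step, assuming a fixed-size array of counters indexed by a hash of the event key. `SketchTable`, the key fields, and the use of `zlib.crc32` are illustrative assumptions; the slide does not say which hash function or table layout Chopstix uses.

```python
import zlib
from typing import List, Tuple


class SketchTable:
    """Fixed-size array of counters indexed by a hash of the event key."""

    def __init__(self, size: int = 1024) -> None:
        self.counters: List[int] = [0] * size

    def index(self, key: Tuple) -> int:
        # zlib.crc32 is only a convenient deterministic hash for this sketch;
        # the hash function used by Chopstix is not given on the slide.
        return zlib.crc32(repr(key).encode()) % len(self.counters)

    def update(self, key: Tuple, weight: int = 1) -> int:
        """Add weight to the key's counter and return the new value."""
        i = self.index(key)
        self.counters[i] += weight
        return self.counters[i]


table = SketchTable(size=1024)
# Hypothetical key fields: (event type, virtual address, executable id, uid).
blocked_io = ("resource_blocking", 0x7F001234, "sshd", 1000)
table.update(blocked_io, weight=3)           # an event lasting 3 jiffies
latest = table.update(blocked_io, weight=5)  # a longer event, larger weight
```

Memory use is fixed by the table size no matter how many distinct events occur. The price is that distinct keys can collide and share a counter.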
Logarithm Sampling Function
- Goal: # samples of any one event = log(event frequency)
- Implementation:
  - Choose an integer t = 2^k
  - If the event frequency is a power of t, record a sample
  - In other words, if the event frequency is an integer with k zeros, 2k zeros, 3k zeros, …, in the least significant bits, record a sample
- How to choose t?
  - Low watermark and high watermark (CPU cycles consumed by Chopstix)
  - Double t if > high watermark in the previous epoch
  - Halve t if < low watermark in the previous epoch
3. A sampling function is evaluated to determine if this event should be recorded;
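
The sampling rule can be sketched as follows, assuming unit-weight increments and reading "power of t" as t^j with j >= 1, which matches the "k zeros, 2k zeros, 3k zeros" wording on the slide. The function names and the watermark values in the test calls are hypothetical; with weighted increments a counter could jump past an exact power, so a real implementation would need to test for crossing a power instead.

```python
def should_sample(count: int, k: int) -> bool:
    """True if count == t**j for t = 2**k and some integer j >= 1."""
    if count <= 1 or count & (count - 1):
        return False  # not a power of two greater than 1
    trailing_zeros = count.bit_length() - 1
    return trailing_zeros % k == 0


def samples_taken(n_events: int, k: int) -> int:
    """Count samples recorded while a unit-weight counter runs from 1 to n_events."""
    return sum(should_sample(c, k) for c in range(1, n_events + 1))


def adjust_k(k: int, cycles_used: int, low: int, high: int) -> int:
    """Double t (k + 1) above the high watermark, halve it (k - 1) below the low one."""
    if cycles_used > high:
        return k + 1
    if cycles_used < low and k > 1:
        return k - 1
    return k


rare = samples_taken(20, k=2)           # t = 4: samples at counts 4 and 16
frequent = samples_taken(100_000, k=2)  # 5000x more events, far fewer than 5000x samples
```

Under these assumptions an event seen 20 times yields 2 samples and one seen 100,000 times yields 8, so the number of samples grows with the logarithm of the frequency.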
Event Sample
- Stack trace, uid, program id, other event-specific details
- Kernel stack trace: tracing the frame pointer
- User-level stack trace: similar
  - But stop if the stack is paged out
4. If so, a sample of relevant information is saved;
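
Frame-pointer tracing can be illustrated with a toy memory model. The frame layout (saved frame pointer at `fp`, return address at `fp + WORD`), the word size, and the addresses are assumptions made for this sketch; an address missing from the dictionary stands in for a stack page that is not resident.

```python
from typing import Dict, List

WORD = 8  # assumed word size in bytes


def walk_stack(memory: Dict[int, int], frame_pointer: int, max_depth: int = 64) -> List[int]:
    """Collect return addresses by following the chain of saved frame pointers.

    Assumed frame layout: memory[fp] holds the caller's frame pointer and
    memory[fp + WORD] holds the return address. An address missing from
    `memory` models a page that is not resident, which ends the walk.
    """
    trace: List[int] = []
    fp = frame_pointer
    while fp != 0 and len(trace) < max_depth:
        if fp not in memory or fp + WORD not in memory:
            break  # stack page not resident: stop rather than fault
        trace.append(memory[fp + WORD])
        fp = memory[fp]
    return trace


# Three frames; the outermost frame at 0x3000 lies on a paged-out page.
memory = {
    0x1000: 0x2000, 0x1000 + WORD: 0x400A10,
    0x2000: 0x3000, 0x2000 + WORD: 0x400B20,
}
trace = walk_stack(memory, 0x1000)
```

The walk returns the two return addresses it can reach and stops at the missing page, giving a partial trace instead of triggering a page fault from inside the collector.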
Epochs
- Periodically copy samples and hash table out
- Reinitialize hash table
- Keep two copies of the in-kernel Chopstix data structure for fast swap
5. A user-level process periodically polls the kernel for this data and resets the data structure, thereby defining a data collection epoch.
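
The two-copy arrangement can be sketched as a simple double buffer: recording always writes to the active copy, and ending an epoch swaps the copies before the finished one is copied out and cleared. `EpochBuffers` and its methods are hypothetical names, and the sketch ignores the locking a real kernel implementation would need.

```python
from typing import List, Tuple


class EpochBuffers:
    """Two counter arrays: one active, one spare, swapped at each epoch end."""

    def __init__(self, size: int) -> None:
        self.active: List[int] = [0] * size
        self.spare: List[int] = [0] * size
        self.epoch = 0

    def record(self, index: int, weight: int = 1) -> None:
        self.active[index] += weight

    def end_epoch(self) -> Tuple[int, List[int]]:
        """Swap buffers and return (finished epoch number, snapshot of its counters)."""
        finished = self.active
        self.active, self.spare = self.spare, finished  # the fast part
        snapshot = list(finished)  # slow copy happens off the recording path
        for i in range(len(finished)):
            finished[i] = 0        # reinitialize for reuse at the next swap
        number = self.epoch
        self.epoch += 1
        return number, snapshot


buffers = EpochBuffers(size=4)
buffers.record(1, weight=3)
first = buffers.end_epoch()
buffers.record(2)
second = buffers.end_epoch()
```

Because the swap is a pointer exchange, new events can be recorded into the fresh copy while the old one is still being written out.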
Vital Signs (Event Types)
11 vital signs:
1. CPU utilization
2. scheduling delay
3. resource blocking
4. disk I/O activity
5. user page allocation
6. kernel page allocation
7. L2-cache utilization
8. system call invocation
9. signal delivery
10. socket transmission
11. mutex/semaphore locking
More details
Aggregator & Visualizer
Macromedia Flash web interface
Workflow
- Want to diagnose a misbehavior:
  - Temporal ambiguity: not punctual and seemingly cannot be reproduced
  - Spatial ambiguity: cannot be localized to a component
- Search Chopstix data for symptoms (unusual vital sign behaviors)
  - Given specific times, can zoom into corresponding epochs
  - Search these epochs first, then previous epochs
  - Look for outliers in vital signs with threshold filters
  - Correlate candidate symptoms
- Given a set of symptoms, address the problem:
  - Reproduce the problem by artificially triggering the symptoms
  - Or avoid the symptoms
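
One way to picture the search step is a threshold filter over per-epoch totals, visiting the suspect epoch first and then earlier ones. The data, the standard-deviation threshold, and the function names below are all made up for illustration; the slides do not specify how the Chopstix visualizer defines an outlier.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Hypothetical per-epoch totals for two vital signs (made-up numbers).
EPOCHS: List[Dict[str, float]] = [
    {"cpuu": 40, "sched": 5},
    {"cpuu": 42, "sched": 6},
    {"cpuu": 41, "sched": 5},
    {"cpuu": 43, "sched": 60},  # epoch 3: scheduling delay spikes
    {"cpuu": 39, "sched": 6},
]


def find_outliers(epochs: List[Dict[str, float]], threshold: float = 1.5) -> List[Tuple[int, str]]:
    """Return (epoch index, vital sign) pairs lying more than `threshold`
    population standard deviations from that vital sign's mean."""
    hits: List[Tuple[int, str]] = []
    for sign in sorted(epochs[0]):
        values = [e[sign] for e in epochs]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue
        hits += [(i, sign) for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
    return sorted(hits)


def search_order(suspect: int, lookback: int) -> List[int]:
    """Visit the suspect epoch first, then earlier epochs, newest first."""
    return list(range(suspect, max(suspect - lookback, 0) - 1, -1))


outliers = find_outliers(EPOCHS)
order = search_order(suspect=3, lookback=2)
```

With this data only the scheduling delay in epoch 3 passes the filter, which would then become a candidate symptom to correlate with other vital signs.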
Correlating Symptoms
- Interpreting symptoms is an “art”
- The paper describes a collection of 10 guiding rules for understanding symptoms
Rules:
Rule #1: High CPU utilization (cpuu) with a low (nearly-zero) L2-cache-miss rate is a symptom of a busy loop
Rule #2: An increase in the net (user or kernel) memory allocated is an indicator of a memory leak
Rule #3: Unsatisfied I/O requests (size of data requested > size of data returned) indicate bad blocks on the disk
Rule #4: When the combined value of cpuu for processes is low, scheduling delay (sched) is high, and the total cpuu is high for the system, it is often a sign of a kernel bottleneck
See paper for the others
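
Rules #1 to #4 can be written down as simple predicates over per-epoch aggregates. The sketch below is only an illustration: the `EpochSummary` fields, their units, and every numeric threshold are placeholders chosen for the example, not values from the paper. Rule #2 speaks of an increase in net allocation, which in practice would be tracked across several epochs; a single-epoch check is used here for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpochSummary:
    """Hypothetical per-epoch aggregates; units and field names are assumed."""
    total_cpuu: float        # system-wide CPU utilization, 0..1
    process_cpuu: float      # combined CPU utilization of processes, 0..1
    l2_miss_rate: float      # L2 misses per thousand cycles
    sched_delay_ms: float    # mean scheduling delay
    net_alloc_pages: int     # pages allocated minus pages freed
    io_bytes_requested: int
    io_bytes_returned: int


def match_rules(e: EpochSummary) -> List[str]:
    """Return the symptoms whose rule fires; every threshold is a placeholder."""
    found = []
    if e.total_cpuu > 0.9 and e.l2_miss_rate < 0.01:
        found.append("rule 1: possible busy loop")
    if e.net_alloc_pages > 0:
        found.append("rule 2: possible memory leak")
    if e.io_bytes_requested > e.io_bytes_returned:
        found.append("rule 3: possible bad disk blocks")
    if e.process_cpuu < 0.2 and e.sched_delay_ms > 100 and e.total_cpuu > 0.9:
        found.append("rule 4: possible kernel bottleneck")
    return found


stalled = EpochSummary(total_cpuu=0.97, process_cpuu=0.05, l2_miss_rate=0.5,
                       sched_delay_ms=400.0, net_alloc_pages=0,
                       io_bytes_requested=4096, io_bytes_returned=4096)
symptoms = match_rules(stalled)
```

For the example epoch only Rule #4 fires: the system is busy, processes get little CPU, and scheduling delay is high.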
Case Study: observed behavior
PlanetLab nodes were observed to crash every 1–7 days without leaving any information on the console or in the system logs.
Shortly before such crashes it was observed that ssh sessions to nodes would stall for tens of seconds.
Some nodes that were running an identical software stack did not suffer these crashes, indicating that the problem was load-dependent.
KDB, NMI watchdog not effective
First Attempt
- Resource blocking
- High I/O activity
- Found that request-response latencies stayed low and I/O throughput degraded negligibly during these periods
- Not the problem
Second Attempt
- Rule #4
- High scheduling delays with heavy CPU utilization
- A bug in the scheduling loop
Other Examples
Brief descriptions of five other examples in the paper
Experimental Setup
- Core2Duo, 4GB RAM
- Linux 2.6.20.1
- NMI interrupts for every 10^7 CPU cycles and for every 6*10^5 L2-cache misses
Aggregation/Visualization
Using the data set spanning three days, initialization cost:
- ~80 seconds if not cached
- ~3 seconds if cached
Coverage of sketches
- Compute the false negative probabilities
  - The formula is not explained
- Vary hash table size and threshold value t in practice
  - Hash table size: 128 to 4KiB
  - False negative rate for each vital sign: 10^-3 to 10^-4
Discussions
- Properties of problems that may be diagnosed via Chopstix:
  - Impact system’s behavior
  - System must stay longer than one epoch
- Applicable to other OS
  - General schemes should be portable
  - Implementation details may vary
Summary
- Chopstix:
  - Log succinct summaries
  - Low-level OS events
  - Sketch-based sampling
- Real implementation
  - Experience with PlanetLab
  - Guiding rules for interpreting vital signs