Characterization of Pathological Behavior: Philip Koopman - (412) 268-5225, Dan Siewiorek

Characterization of Pathological Behavior
Philip Koopman - (412), Dan Siewiorek - (412)
(and more than a dozen other contributors)

2 Goals
- Detect pathological patterns for fault prognosis
- Develop fault propagation models
- Develop statistical identification and stochastic characterization of pathological phenomena

3 Outline
- Definitions
- Digital Hardware Prediction
- Digital Software Characterization
- Research Challenges

4 Definitions: Cause-Effect Sequence and Duration
- FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design
- ERROR - manifestation of a fault within a program or data structure
- FAILURE - service deviates from specified service due to an error
- DURATION
    Permanent - continuous and stable, due to hardware failure; repair by replacement
    Intermittent - occasionally present, due to unstable hardware or varying hardware/software state; repair by replacement
    Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement

5 CMU Andrew File Server Study
- Configuration
    13 SUN II Workstations with processor
    4 Fujitsu Eagle Disk Drives
- Observations
    21 Workstation Years
- Frequency of events
    Permanent Failures     29
    Intermittent Faults    610
    Transient Faults       446
    System Crashes         298
- Mean Time To
    Permanent Failures     6552 hours
    Intermittent Faults    58 hours
    Transient Faults       354 hours
    System Crash           689 hours

6 Some Interesting Numbers
- Permanent Outages/Total Crashes = 0.1
- Intermittent Faults/Permanent Failures = 21
    Thus first symptom appears over 1200 hours prior to repair
- (Crashes - Permanent)/Total Faults =
- 14/29 failures had three or fewer error log entries
    8/29 had no error log entries

7 Harbinger Detection of Anomalies

8 Digital Hardware Prediction

9 Measurement and Prediction Module
- History Collection -- calculation and reporting of system availability
- Future Prediction -- failure prediction of system devices
    Figure labels: History Collection, Future Prediction,
    Measurement & Prediction Module, Operating System, User Application Program

10 History Collection
- This module consists of:
    Crash Monitor - monitors system state
    Calculator - average uptime and average uptime fraction
    Files of uptime (fraction) information
- => Availability
    Figure labels: Operating System, History Collection, Crash Monitor, Uptime (fraction) Calculator, Files of system state info, User Application Program

11 Crash Monitor
- Periodically samples system state (interval = 5 min)
    downtime = t3 - t1 = 13 min
    uptime = t2 - t1 = 600 min
    Figure labels: system states changing (up, down), crash, reboot, average uptime, time axis with t1, t2, t3

12 Preliminary Experiment Data (cont.)
    Figure: an NT system's cumulative daily availability report over a 5-month period

13 Future Prediction
- This module generates device failure warning information
    Sys-log Monitor: monitors new entries by checking the system event log periodically
    DFT Engine: applies the DFT heuristic and issues the corresponding device failure warning if its rules are satisfied
    Figure labels: Operating System, Error Log, Sys-log Monitor, Dispersion Frame Technique (DFT) Engine, Files of device failure warning, User Application Program

14 Principle from Observation
- Periods of increasingly unreliable behavior prior to catastrophic failure
    Figure labels: errors over time, filter by event type (disk, mem, CPU), Disk repair, Mem Board repair, CPU repair
    Error entry example (fields: type, time): DISK:9/180445/ /829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0
- Based on this observation, the DFT heuristic was derived to detect the non-monotonically decreasing inter-arrival time.
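The observation behind the heuristic (errors of one type arriving closer and closer together before a failure) can be illustrated with a small sketch. This is not the DFT rule set from the slides; it is a simplified stand-in that only checks whether the most recent gaps between same-type errors keep shrinking. The function name `shrinking_interarrival_warning` and the `run_length` parameter are hypothetical, chosen for this illustration.

```python
from typing import List


def shrinking_interarrival_warning(timestamps: List[float], run_length: int = 3) -> bool:
    """Return True when the last `run_length` inter-arrival gaps strictly shrink.

    `timestamps` are arrival times (any consistent unit, ascending order) of
    error-log entries already filtered to a single event type, e.g. disk.
    This is a simplified illustration, not the Dispersion Frame Technique itself.
    """
    # Gap between each error and the one before it.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < run_length:
        return False  # not enough history to judge a trend
    recent = gaps[-run_length:]
    # Warn only if every recent gap is smaller than the one preceding it.
    return all(prev > nxt for prev, nxt in zip(recent, recent[1:]))


if __name__ == "__main__":
    disk_errors = [0.0, 100.0, 160.0, 190.0, 200.0]  # gaps: 100, 60, 30, 10
    print(shrinking_interarrival_warning(disk_errors))
```

A production heuristic would also have to tolerate gaps that shrink overall but not at every step, which is the case the dispersion-frame rules are meant to handle.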

15 How DFT Works
- Via an example rule: if a sliding window of 1/2 of the current error interval twice in succession covers 3 errors in the future, issue a warning
    Figure labels: time axis t showing the last 5 errors of the same type (disk), marked i-4, i-3, i-2, i-1, i; warning

16 Digital Software Characterization

17 Where We Started: Component Wrapping
- Improve Commercial Off-The-Shelf (COTS) software robustness

18 Exception Handling: The Basis for Error Detection
- Exception handling is an important part of dependable systems
    Responding to unexpected operating conditions
    Tolerating activation of latent design defects
- Robustness testing can help evaluate software dependability
    Reaction to exceptional situations (current results)
    Reaction to overloads and software aging (future results)
- First big objective: measure exception handling robustness
    Apply to operating systems
    Apply to other applications
- It's difficult to improve something you can't measure, so let's figure out how to measure robustness!

19 Measurement Part 1: Software Testing
    SW Testing requires:        Ballista uses:
    Test case                   Bad value combinations
    Module under test           Module under Test
    Oracle (a specification)    Watchdog timer/core dumps

20 Ballista: Scalable Test Generation
- Ballista combines test values to generate test cases

21 Ballista: High Level + Repeatable
- High level testing is done using API to perform fault injection
    Send exceptional values into a system through the API
    Requires no modification to code -- only linkable object files needed
    Can be used with any function that takes a parameter list
    Direct testing instead of middleware injection simplifies usage
- Each test is a specific function call with a specific set of parameters
    System state initialized & cleaned up for each single-call test
    Combinations of valid and invalid parameters tried in turn
    A simplistic model, but it does in fact work...
- Early results were encouraging:
    Found a significant percentage of functions with robustness failures
    Crashed systems from user mode
- The testing object-based approach scales!

22 CRASH Robustness Testing Result Categories
- Catastrophic
    Computer crashes/panics, requiring a reboot
    e.g., Irix 6.2: munmap(malloc((1