first-fault data capture - jhuapl.edu
Post on 20-Jul-2022
FIRST-FAULT DATA CAPTURE
Steven A. Stolper, Software Consultant
info@stolperconsult.com
Agenda
• Logging and its problems
• First-Fault Data Capture (Tracing)
• FFDC Architectures
Logging
• Traditionally used to provide information for debugging.
• Best for low-rate data
• “Can you send me the logs?”
Time-Stamp          Seq#    Thread ID     Severity  Module   Message ID  Text
:
2007-8/16:14:17:43  000003  (0xb7f3c6c0)  DEBUG     STARTUP  (00001)     test_startup: spawning thread 1
2007-8/16:14:17:43  000004  (0xb7f3c6c0)  DEBUG     STARTUP  (00002)     test_startup: spawned thread 1
:
:
2007-8/16:14:17:43  000028  (0xb7f3bbb0)  INFO      STARTUP  (08003)     system_WaitForStart: Waiting for system startup.
2007-8/16:14:17:43  000029  (0xb753abb0)  DEBUG     STARTUP  (00004)     Thread 1: Initializing!
2007-8/16:14:17:43  000030  (0xb753abb0)  DEBUG     STARTUP  (00005)     Thread 1: Waiting for start!
The Problem with Logging
• The “immediate cause” of the failure occurs after the actual problem.
• Error messages are often missing or poorly coded.
• System state information is unavailable
• High-rate data is unavailable
• Many types of failures do not log errors
Relying on error messages to find “the smoking gun”
is a matter of chance!!!
Traditional Debugging
• Turn on “verbose” logging or enable “instrumented” code.
• Put an instrumented build on the spacecraft (if on the testbed or in ATLO)
Both approaches suffer from a key flaw:
The problem must occur again!!!!
Q: What are the 5 words guaranteed to upset your QA Team?
A: “Can you reproduce the problem?”
Trace Data Recorder
• Analogous to the “Black Box” flight recorder on aircraft
• Records information about the system's dynamic behavior
• After a crash, the information in the TDR helps to analyze the failure
• The data can provide valuable insight into
   • complex behavioral problems
   • sporadic changes in system inputs
   • unexpected interactions between subsystems
   • bugs in low-level software.
• High-rate data “always on”
* Software architecture for troubleshooting high-availability systems
[Figure: TDR architecture diagram. Labeled elements: Process 1, Process 2, Process 3 (Thread 1 ... Thread N), each with TDRLib, TDR_BufferInit() and TDR_Store(); LOG Subsystem (“catastrophic” errors); TDR Subsystem with Retrieval Agent, Dump Agent, Billboard and Non-volatile Storage; Uplink commands “Freeze” / “Dump” and “Retrieve”; “Raw” Dump to the Post-processing Tool in the Ground Data System. Legend: data path, command path, library/data structure, execution context.]
Buffer Organization
[Figure: buffer organization options. Option #1 and Option #2 show alternative arrangements of capture buffers (“Lib 1”) within Process 1 and Process 2.]
* Option 3 for Q&D
Example Library Interface
TDR_BufferInit(Application_ID, num_entries, process_str)
    Creates and initializes the buffer for a process
TDR_ThreadAddIdentity(thread_str)
    Identifies a thread of execution using the buffer
TDR_ThreadRemoveIdentity()
    Removes thread identification information from the buffer
TDR_Store(module_ID, event_id, data_p)
    Stores trace information
TDR_BufferDestroy()
    Destroys the trace buffer
The “data_p” argument to TDR_Store() points to:
typedef union {
    unsigned char data_byte[TDR_BYTES_PER_ENTRY];
    unsigned int  data_word[TDR_WORDS_PER_ENTRY];
} TDR_data_t;
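As an illustration of how a caller might fill this union before a store, here is a minimal sketch. The slides do not give the entry sizes, so the value 16 for TDR_BYTES_PER_ENTRY is an assumption, as is the helper name tdr_pack_two_words().

```c
#include <assert.h>
#include <string.h>

/* Assumed sizes: the slides do not state the real values. */
#define TDR_BYTES_PER_ENTRY 16
#define TDR_WORDS_PER_ENTRY (TDR_BYTES_PER_ENTRY / sizeof(unsigned int))

typedef union {
    unsigned char data_byte[TDR_BYTES_PER_ENTRY];
    unsigned int  data_word[TDR_WORDS_PER_ENTRY];
} TDR_data_t;

/* Hypothetical helper: zero the payload, then place two integers in the
 * first two words so that unused bytes are deterministic in a raw dump. */
static TDR_data_t tdr_pack_two_words(unsigned int first, unsigned int second)
{
    TDR_data_t d;

    assert(TDR_WORDS_PER_ENTRY >= 2);   /* payload must hold two words */
    memset(&d, 0, sizeof d);
    d.data_word[0] = first;
    d.data_word[1] = second;
    return d;
}
```

A caller would then pass the address of the returned value as the data_p argument of TDR_Store().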
Trace Data Recorder
• Example assumes one buffer per heavy-weight process
• The implementation of the TDR_Store() function is critical:
   • “Application” software calls the function as part of normal execution, so data is always available if a problem occurs. TDR_Store() is called many hundreds of times each second.
   • Ideally it should execute in fast, constant time and not make any system calls that can block execution of the caller.
   • It should avoid mutual exclusion primitives that can cause unrelated execution contexts to serialize as they contend for access to the capture buffer (e.g., use atomic_inc_return() instead).
   • If buffer sizes are restricted to powers of 2, the buffer can be rolled over very swiftly.
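The points above can be sketched as follows. This is not the presenter's implementation. It uses C11 atomic_fetch_add() as a user-space stand-in for the atomic_inc_return() primitive named on the slide. The entry layout, the buffer size of 8, the name tdr_store_sketch(), and the choice to skip stores while the freeze flag is set are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TDR_BYTES_PER_ENTRY 16            /* assumed payload size */
#define TDR_NUM_ENTRIES     8u            /* assumed; must be a power of 2 */

typedef struct {
    uint32_t      module_id;
    uint32_t      entry_id;
    uint32_t      sequence;               /* stands in for a time stamp */
    unsigned char data[TDR_BYTES_PER_ENTRY];
} tdr_entry_t;

typedef struct {
    atomic_uint next_available;           /* total slots claimed so far */
    atomic_bool frozen;                   /* assumed: set before a dump */
    tdr_entry_t entries[TDR_NUM_ENTRIES];
} tdr_buffer_t;

static tdr_buffer_t tdr_buffer;           /* one buffer per process */

/* Claim a slot with a single atomic increment; no locks, no system calls.
 * data_p must point to at least TDR_BYTES_PER_ENTRY readable bytes. */
static void tdr_store_sketch(uint32_t module_id, uint32_t entry_id,
                             const void *data_p)
{
    /* A power-of-2 size turns "index modulo size" into a single AND. */
    assert((TDR_NUM_ENTRIES & (TDR_NUM_ENTRIES - 1u)) == 0u);

    if (atomic_load(&tdr_buffer.frozen))
        return;                           /* keep contents for the dump */

    unsigned int ticket = atomic_fetch_add(&tdr_buffer.next_available, 1u);
    tdr_entry_t *slot = &tdr_buffer.entries[ticket & (TDR_NUM_ENTRIES - 1u)];

    slot->module_id = module_id;
    slot->entry_id  = entry_id;
    slot->sequence  = ticket;
    memcpy(slot->data, data_p, TDR_BYTES_PER_ENTRY);
}
```

Each caller receives a distinct ticket, so threads never wait on one another. The sketch does not protect a reader from seeing a slot that is still being written, which a real dump agent would have to handle.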
Data structures
[Figure: capture buffer data structures. Blocks: Capture Buffer Header (Lib only) and Capture Buffer Body (shared between Lib and Dump Agent). Labeled fields: PID, Module ID, Num Entries, Buffer Size, Instance String, Next Available, Freeze Flag, Buffer Pointer, Optional Information, Buffer Statistics, Thread Identification Information, Buffer Entries.]
Capture Entry: PID | Module ID | Entry ID | Time Stamp | Data
Buffer and Dump Control
Billboard: Current Dump Size | Max Allowed Dump Size | Global “Freeze” Flag | Buffer Pointer Table
Buffer Pointer Table (one row per buffer): PID | Buffer Pointer | Buffer Size | Orphaned Time
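One way the billboard fields listed above could map onto C structures is sketched below. The field names follow the slide, but the types, the table capacity, both function names, and the reading of “Max Allowed Dump Size” as a cap checked before each dump are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BILLBOARD_MAX_BUFFERS 4u          /* assumed table capacity */

typedef struct {
    int32_t pid;                          /* owning process; 0 = free row */
    void   *buffer_p;                     /* location of the capture buffer */
    size_t  buffer_size;                  /* size in bytes */
    int64_t orphaned_time;                /* assumed: 0 while owner is alive */
} billboard_row_t;

typedef struct {
    size_t          current_dump_size;
    size_t          max_allowed_dump_size;
    int             global_freeze;
    billboard_row_t table[BILLBOARD_MAX_BUFFERS];
} billboard_t;

/* Register a capture buffer; returns its row index, or -1 if the table is full. */
static int billboard_register(billboard_t *bb, int32_t pid,
                              void *buffer_p, size_t buffer_size)
{
    assert(bb != NULL && pid != 0);
    for (unsigned int i = 0; i < BILLBOARD_MAX_BUFFERS; i++) {
        if (bb->table[i].pid == 0) {
            bb->table[i] = (billboard_row_t){ pid, buffer_p, buffer_size, 0 };
            return (int)i;
        }
    }
    return -1;
}

/* Assumed dump-agent check: would dumping this row stay within the cap? */
static int billboard_may_dump(const billboard_t *bb, int row)
{
    return bb->current_dump_size + bb->table[row].buffer_size
           <= bb->max_allowed_dump_size;
}
```

Keeping the pointer table outside any single process would let a dump agent locate a buffer even after its owner has died, which is presumably what the Orphaned Time field records.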
Compile-Time Interface
• Each entry placed into a capture buffer should have a unique ID.
• Ideally, when defining the ID, the developer could also provide information to help generate a “parser” that post-processes the raw dump to produce human-readable output.
• Entry IDs are placed in their own header file for each module.
• A tool can be constructed to scan for duplicate IDs
Compile-Time Interface (cont)
/* Define an example entry ID for a buffer entry containing two integers.
 * Note that there is no ";" following the definition.
 */
TDR_DEFINE_ENTRY(EXAMPLE_ENTRY_ID, 1, "hex value = 0x%x and decimal value = %d")

/* First definition: turns each entry into an enum constant. */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    enum { entry_ID_mnemonic = entry_number};

/* Second definition: turns each entry into a table initializer
 * (number, mnemonic string, format string).
 */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    {entry_number, #entry_ID_mnemonic, format_string},
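The two definitions of TDR_DEFINE_ENTRY can be applied to the same list of entries, once to produce IDs and once to produce a parser table. The sketch below shows this in a single file. The slides keep entries in a per-module header, so the list macro EXAMPLE_MODULE_ENTRIES, the second entry EXAMPLE_FLAGS_ID, the row type, and the three helper functions are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical entry list for one module (stands in for its header file). */
#define EXAMPLE_MODULE_ENTRIES \
    TDR_DEFINE_ENTRY(EXAMPLE_ENTRY_ID, 1, "hex value = 0x%x and decimal value = %d") \
    TDR_DEFINE_ENTRY(EXAMPLE_FLAGS_ID, 2, "flags = 0x%x, count = %d")

/* Pass 1: each entry becomes an enum constant for application code. */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    enum { entry_ID_mnemonic = entry_number };
EXAMPLE_MODULE_ENTRIES
#undef TDR_DEFINE_ENTRY

/* Pass 2: each entry becomes one row of a parser table. */
typedef struct {
    int         number;
    const char *mnemonic;
    const char *format;
} tdr_parse_row_t;

#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    { entry_number, #entry_ID_mnemonic, format_string },
static const tdr_parse_row_t tdr_parse_table[] = { EXAMPLE_MODULE_ENTRIES };
#undef TDR_DEFINE_ENTRY

#define TDR_PARSE_ROWS (sizeof tdr_parse_table / sizeof tdr_parse_table[0])

/* Look up the table row for an entry ID; NULL if the ID is unknown. */
static const tdr_parse_row_t *tdr_find_row(int number)
{
    for (size_t i = 0; i < TDR_PARSE_ROWS; i++)
        if (tdr_parse_table[i].number == number)
            return &tdr_parse_table[i];
    return NULL;
}

/* Duplicate scan: returns 1 if any two rows share an entry number. */
static int tdr_has_duplicate_ids(void)
{
    for (size_t i = 0; i < TDR_PARSE_ROWS; i++)
        for (size_t j = i + 1; j < TDR_PARSE_ROWS; j++)
            if (tdr_parse_table[i].number == tdr_parse_table[j].number)
                return 1;
    return 0;
}

/* Render a raw entry that holds two integers as human-readable text.
 * Returns the snprintf() result, or -1 for an unknown entry ID. */
static int tdr_format_entry(char *out, size_t out_len, int number,
                            unsigned int word0, int word1)
{
    const tdr_parse_row_t *row = tdr_find_row(number);

    assert(out != NULL);
    if (row == NULL)
        return -1;
    return snprintf(out, out_len, row->format, word0, word1);
}
```

Because duplicate enum values compile without complaint, the duplicate check has to run over the generated table, which is one way to build the scanning tool mentioned on the previous slide.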
Recent Example:
[Figure: plot; vertical axis: meters per second]
Summary
• Logging as a debugging tool suffers from inherent problems
• Tracing gives state and high-rate data vital to troubleshooting problems in the field
Additional Reading
“The Software Detective: First-fault Data Capture”
Embedded Systems Design Magazine, CMP Media LLC, August 2007