Building a Better Backtrace: Techniques for Postmortem
Program Analysis
Ben Liblit & Alex Aiken
A Few Grim Realities
• Programs fail post-deployment
  – Ship with known bugs
  – Users discover new bugs
• Users are lousy testers
  – Never do the same thing twice
  – Wild variation in execution environment
  – Poor bug reporting, if any
• Users’ bugs are the ones that really matter
Program Analysis for Pessimists
• Assume & prepare for postmortem analysis
  – Compile-time analysis, stashed away for later
  – Lightweight (deployable) instrumentation
• Analyze failed program instances
  – Mix of automated / interactive tools
  – Not quite static analysis, not quite dynamic
• Help humans find and fix bugs that matter
This Talk: Reconstructing Execution Chronologies
• Control flow decision history captures important properties
• Fundamental questions
  – “How in the world did it get here?”
  – “What happened just before this point?”
  – “How can I make this happen again?”
• Broader interest than just crashes
Striking a Compromise
• Heavyweight approaches
  – Replay debugging
  – Program tracing
• Lightweight approaches
  – Examine stack trace in debugger
  – printf() debugging
• Middleweight (our) approach
  – “How might we have gotten here, given …?”
The Big Idea: “Gotten Here” is Control Flow Reachability
• Interested in paths
  – “How”, not just “yes/no”
• Transitive paths within one function
• Multiple functions?
  – Matched call/return paths
  – This is a form of context-free language reachability
[Figure: flow graph with bracket-labeled edges, ( ) and [ ]]
Global Control Flow Graph
[Figure: global control flow graph; labels: call, return, entry, exit]
Variations in Matching Grammar
• Complete execution
  – All calls & returns must be matched
  – Example: {()(){()}[{}(())]}
• Aborted execution
  – Some calls without returns
  – We use a variant of this
CFL Reachability Algorithm
• Similar to transitive graph search
  – Use a work list to incrementally extend frontier
  – Forward from α or backward from ω
  – Transitively adding flow edges is one case
• Several additional cases for calls/returns
• Complexity
  – O(N³) for arbitrary grammar and graph
  – O(E) for our analyses (and many others)
Reconstruction With Crash Site Only
• Work backward from crash site
• Remember why each path is extended
  – Record justifications in route map
  – route(x, z) = { r1, …, rn }
  – ri = cross from x to y, then see route(y, z)
    • x and y must be “adjacent”: one of four cases
• route(α, ω) defines possible chronologies
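The route map can be illustrated with a minimal sketch. It assumes the simplest of the four adjacency cases only (a plain flow edge within one function) and leaves call/return matching out entirely, so it is ordinary backward graph reachability rather than the full CFL algorithm. The names `build_route_map` and `simple_paths`, and the toy graph, are hypothetical.

```python
from collections import deque

def build_route_map(edges, crash):
    """Backward work-list search from the crash site.

    route[x] is the set of successors y such that "cross from x to y,
    then see route[y]" can still reach the crash site.
    Assumption: only plain flow edges; no call/return cases.
    """
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)

    route = {crash: set()}
    work = deque([crash])          # FIFO work list
    while work:
        y = work.popleft()
        for x in preds.get(y, []):
            if x not in route:     # first time x joins the frontier
                route[x] = set()
                work.append(x)
            route[x].add(y)        # remember why the path was extended
    return route

def simple_paths(route, start, crash, seen=()):
    """Enumerate loop-free chronologies encoded by the route map."""
    if start == crash:
        yield (*seen, start)
        return
    for nxt in sorted(route.get(start, ())):
        if nxt not in seen and nxt != start:
            yield from simple_paths(route, nxt, crash, (*seen, start))

edges = [("entry", "a"), ("entry", "b"), ("a", "crash"),
         ("b", "c"), ("c", "crash"), ("a", "d")]
route = build_route_map(edges, "crash")
paths = list(simple_paths(route, "entry", "crash"))
```

Node `d` never appears in the map because no path from it reaches the crash site; each edge is examined once, which is where the linear cost comes from in this simplified setting.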
Reconstruction With Crash Site Only
• One case, unmatched call, determines stack
  – Unmatched parens: {()(){()}[{}(())]}
  – Stack trace: {[(
• But we probably have a specific stack trace in mind…
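The link between unmatched calls and the stack can be sketched as a bracket scan. This is only an illustration: the function name `unmatched_calls` is hypothetical, and the choice of crash point (a prefix of the slide's example string) is an assumption made so that the result agrees with the stack trace {[( shown above.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}   # return symbol -> matching call symbol
OPENERS = set(PAIRS.values())

def unmatched_calls(trace):
    """Scan a call/return string and return the pending calls, outermost first.

    An empty result means a complete execution (everything matched);
    a non-empty result is an aborted execution, and the pending calls
    are exactly the stack at the point where the string stops.
    """
    stack = []
    for ch in trace:
        if ch in OPENERS:
            stack.append(ch)                       # a call
        elif ch in PAIRS:
            if not stack or stack[-1] != PAIRS[ch]:
                raise ValueError(f"return {ch!r} does not match a pending call")
            stack.pop()                            # a matched return
        else:
            raise ValueError(f"unexpected symbol {ch!r}")
    return "".join(stack)

complete = "{()(){()}[{}(())]}"
aborted = complete[:15]          # assumed crash point: "{()(){()}[{}(()"

complete_stack = unmatched_calls(complete)
aborted_stack = unmatched_calls(aborted)
```

A return that does not match the most recent pending call is rejected in both grammar variants; only missing returns are tolerated in the aborted case.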
Reconstruction With Crash Site + Stack Trace
• S ::= vector of call edges
• Build |S| + 1 clones of global flow graph
• Two types of call edge
  – (i must match )i
    • Stays on same layer
  – ci must be unmatched
    • Only way to next layer
    • Determined by S
[Figure: layered graph clones; edge labels c6, c3, c14]
Reconstruction With Crash Site + Stack Trace
• Possible histories
  – Start at α on top layer
  – End at ω on bottom layer
  – route(α, 0, ω, |S|)
• Backward, not forward
  – More deterministic
• Complexity
  – O(E) work, |S| + 1 times
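The layered construction can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes matched call/return pairs have already been folded into same-layer flow edges, so only the unmatched call edges listed in S move between layers. The function name `layered_backward_reach` and the node names are hypothetical.

```python
from collections import deque

def layered_backward_reach(flow_edges, stack, crash):
    """Backward search over |S| + 1 clones of the flow graph.

    flow_edges: (u, v) pairs that stay on the same layer.
    stack:      list S of (call_site, callee_entry) edges, outermost first;
                S[i] is the only way from layer i to layer i + 1.
    Returns every (node, layer) state that can reach (crash, |S|).
    """
    preds = {}

    def add(src, dst):
        preds.setdefault(dst, []).append(src)

    for layer in range(len(stack) + 1):       # one clone per layer
        for u, v in flow_edges:
            add((u, layer), (v, layer))
    for i, (site, entry) in enumerate(stack):  # unmatched calls cross layers
        add((site, i), (entry, i + 1))

    start = (crash, len(stack))                # crash site on the bottom layer
    seen = {start}
    work = deque([start])
    while work:
        state = work.popleft()
        for p in preds.get(state, []):
            if p not in seen:
                seen.add(p)
                work.append(p)
    return seen

flow_edges = [("m0", "m1"), ("m1", "m2"),      # main
              ("f0", "f1"), ("f0", "f2"),      # f
              ("g0", "g1")]                    # g, crash at g1
S = [("m1", "f0"), ("f1", "g0")]               # main calls f, f calls g
reach = layered_backward_reach(flow_edges, S, "g1")
```

A history is possible when the program entry on the top layer, here `("m0", 0)`, is among the reached states. Each clone is searched at most once, which matches the O(E) work per layer quoted above.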
Reconstruction With Crash Site + Event Trace
• V ::= vector of trace nodes
• Use |V| + 1 layered clones, as before
• Must report event when crossing trace node
  – On each layer, knock out all trace nodes but one
    • On bottommost layer, no trace nodes at all!
  – Further restricts set of possible paths
• Complexity: O(E|V|)
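A rough sketch of the event-trace restriction, under stated assumptions: an event is reported on entering a trace node, the program entry is not itself a trace node, and calls are ignored. The layer number here counts events already reported, which is one possible encoding of the layered clones and may differ in detail from the slides. The name `consistent_with_trace` and the toy graph are hypothetical.

```python
from collections import deque

def consistent_with_trace(flow_edges, trace_nodes, observed, entry, crash):
    """Is there a path entry -> crash whose trace-node crossings equal `observed`?

    State (node, layer): we are at `node` and `layer` events have been reported.
    The search runs backward from (crash, |V|) to (entry, 0).
    """
    preds = {}
    for u, v in flow_edges:
        preds.setdefault(v, []).append(u)

    start = (crash, len(observed))
    seen = {start}
    work = deque([start])
    while work:
        node, layer = work.popleft()
        if node in trace_nodes:
            # Entering this node reported event number `layer`; every other
            # trace node is "knocked out" at this position in the trace.
            if layer == 0 or observed[layer - 1] != node:
                continue
            prev_layer = layer - 1
        else:
            prev_layer = layer
        for p in preds.get(node, []):
            state = (p, prev_layer)
            if state not in seen:
                seen.add(state)
                work.append(state)
    return (entry, 0) in seen

flow_edges = [("entry", "t1"), ("entry", "t2"), ("t1", "join"), ("t2", "join"),
              ("join", "t3"), ("t3", "crash"), ("join", "crash")]
trace_nodes = {"t1", "t2", "t3"}

ok_one = consistent_with_trace(flow_edges, trace_nodes, ["t1"], "entry", "crash")
ok_two = consistent_with_trace(flow_edges, trace_nodes, ["t2", "t3"], "entry", "crash")
bad_order = consistent_with_trace(flow_edges, trace_nodes, ["t3", "t1"], "entry", "crash")
bad_empty = consistent_with_trace(flow_edges, trace_nodes, [], "entry", "crash")
```

There are at most (|V| + 1) copies of each node and each edge is followed once per copy, in line with the O(E|V|) bound on the slide.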
Reconstruction With …
• Stack trace + event trace
• Multiple event traces
• Ambiguous traces
• Incomplete event trace
  – Recent-branch registers
• Program counter sampling
• Finite state machine of your choosing…
Practical Considerations
• Dynamic dispatch / function pointers
  – Usual static techniques (points-to, receiver-class, etc.)
  – Event tracing can help
  – Note: stack trace is never dynamic
• Interactivity
  – Backward analysis is best: most bugs are close to crash
  – FIFO work list, demand-driven search
  – Deterministic versus non-deterministic state machines
Areas For Future Exploration
• Sparsity of trace information
  – Identify state-preserving regions
  – Explore such regions only once
• Summarization / visualization
  – Basis: dominator tree walk-back
  – Opportunity for novel algorithms here
Areas For Future Exploration
• Adaptive Gap Reduction
  – Programmer inquiries guide future annotation
    • “Which way did this branch really go?”
    • “How many times did this loop really execute?”
  – Identification of key inflection points
  – Insert lightweight event tracing nodes
• Related work in efficient path profiling
  – More evidence for future reconstructions
Summary and Conclusions
• Program analysis in an imperfect world
  – Post-crash: unique challenges / leverage points
• CFL path recovery as basis for analysis
  – Efficient, demand-driven, adaptable
• Future work
  – Adaptive annotation to fill in gaps
  – Leveraging multiple runs
  – Data value modeling