StackMine – Performance Debugging in the Large via Mining Millions of Stack Traces
Shi Han¹, Yingnong Dang¹, Song Ge¹, Dongmei Zhang¹, and Tao Xie²
¹Software Analytics Group, Microsoft Research Asia; ²North Carolina State University
June 6th, 2012
Performance Issues in the Real World
• One of top user complaints
• Impacting large number of users every day
• High impact on usability and productivity
High disk I/O; high CPU consumption
Given limited time and resources before software release, development-site testing and debugging are insufficient to ensure satisfactory software performance.
ICSE 2012
Performance Debugging in the Large
Diagram components:
• Trace collection
• Network
• Trace storage
• Trace analysis
• Pattern matching
• Problematic pattern repository
• Bug filing
• Bug database
• Bug update
Questions raised:
• How many issues are still unknown?
• Which trace file should I investigate first?
Callouts:
• Key to issue discovery
• Bottleneck of scalability
Problem Definition
Input: Runtime traces collected from millions of users
Output: Program execution patterns causing the most impactful performance problems, inspected by performance analysts
Goal
Conduct systematic discovery & analysis of program execution patterns
• Efficient handling of large-scale trace sets
• Automatic discovery of new patterns
• Effective prioritization of investigation
Challenges
Combination of expertise
• Generic machine learning tools without domain knowledge guidance do not work well
Highly complex analysis
• Numerous program runtime combinations triggering performance problems
• Multi-layer runtime components from application to kernel being intertwined
Large-scale trace data
• TBs of trace files and increasing
• Millions of events in single trace stream
Intuition
What happens behind a typical UI delay? An example of delayed browser tab creation.

Callstack types: ReadyThread callstacks, Wait callstacks, CPU sampled callstacks

Timeline
UI thread: CPU, Wait, Ready, CPU, Wait, Ready, CPU
Worker thread: Ready, CPU
Underlying disk I/O
Unexpected long execution

Wait callstack
ntdll!UserThreadStart
Browser!Main
…
ntdll!LdrLoadDll
…
nt!AccessFault
nt!PageFault
…

Wait callstack
ntdll!UserThreadStart
Browser!Main
…
Browser!OnBrowserCreatedAsyncCallback
…
BrowserUtil!ProxyMaster::GetOrCreateSlave
BrowserUtil!ProxyMaster::ConnectToObject
…
rpc!ProxySendReceive
…
wow64!RunCpuSimulation
wow64cpu!WaitForMultipleObjects32
wow64cpu!CpupSyscallStub
…

ReadyThread callstack
nt!KiRetireDpcList
nt!ExecuteAllDpcs
…
nt!IopfCompleteRequest
…
nt!SetEvent
…

CPU sampled callstack (appears three times on the slide)
ntdll!UserThreadStart
…
ntdll!WorkerThread
…
ole!CoCreateInstance
…
ole!OutSerializer::UnmarshalAtIndex
ole!CoUnmarshalInterface
…

ReadyThread callstack
ntdll!UserThreadStart
…
rpc!LrpcIoComplete
…
user32!PostMessage
…
win32k!SetWakeBit
nt!SetEvent
…
Approach
Formulated as a callstack mining and clustering problem
• Performance issues are caused by problematic program execution patterns
• Problematic program execution patterns are mainly represented by callstack patterns
• Callstack patterns are discovered by mining & clustering costly patterns
Approach – Workflow
Trace streams → AOI Extraction → Callstacks (AOI) → Sequence Pattern Mining → Callstack patterns → Pattern Clustering → Pattern clusters → Ranking → Ranked clusters

Callstacks
- ntdll!UserThreadStart
|- ……
| |- ……
| | |- user32!InternalCallWinProc
| | | |- stobject!WakeUp_DeviceMonitor
| | | | |- KernelBase!LoadLibrary
| | | | | |- ……
| | | | | | |- ntdll!OpenFile
| | | | | | | |- ……
- ntdll!UserThreadStart
|- ……
| |- ……
| | |- user32!InternalCallWinProc
| | | |- stobject!WakeUp_DeviceMonitor
| | | | |- kernel32!LoadLibrary
| | | | | |- ……
| | | | | | |- nt!PageRead
| | | | | | | |- ……
- ntdll!UserThreadStart
|- ……
| |- ……
| | |- user32!InternalCallWinProc
| | | |- stobject!WakeUp_DeviceMonitor
| | | | |- kernel32!LoadLibrary
| | | | | |- KernelBase!LoadLibrary
| | | | | | |- ……
| | | | | | | |- nt!Trap
| | | | | | | | |- nt!AccessFault
| | | | | | | | | |- ……

Callstack patterns (figure labels: 709, 412, 259)
stobject!WakeUp_DeviceMonitor
…
KernelBase!LoadLibrary
…
ntdll!OpenFile

stobject!WakeUp_DeviceMonitor
…
KernelBase!LoadLibrary
…
nt!AccessFault

stobject!WakeUp_DeviceMonitor
…
KernelBase!LoadLibrary
…
nt!PageRead

Pattern clusters (figure labels: 1392, 709, 259, 412)
ntdll!OpenFile
stobject!WakeUp_DeviceMonitor
KernelBase!LoadLibrary
nt!PageRead
nt!AccessFault

Ranked clusters
Cluster   Hits     Total wait time (ms)
1         94,279   927,824
2         51,107   561,416
3         35,536   3,051,307
…         …
Approach – AOI Extraction
Approach – AOI Extraction: Motivation
Runtime traces capture both
• Executions relevant to the performance issue
– E.g., executions relevant to browser-tab creation
• Executions irrelevant to the performance issue
– E.g., executions of a concurrently running IM client
– Noisy data for mining
– Huge investigation scope induced
Approach – AOI Extraction: Critical Path
Timeline from scenario start to scenario finish:
UI thread: CPU, Wait, Ready, CPU, Wait, Ready, CPU
Worker thread 1: Ready, CPU, Wait, Ready, CPU
Worker thread 2: Ready, CPU
Disk I/O
Critical path: CPU(UI), CPU(WT1), CPU(WT2), CPU(WT1), CPU(UI), Disk I/O, Ready, CPU
Approach – AOI Extraction: Wait Graph
Timeline from scenario start to scenario finish:
UI thread: CPU, Wait, Ready, CPU
Worker thread 1: Ready, CPU, Wait, Ready, CPU, Wait, Ready, CPU
Worker thread 2: CPU (execution irrelevant to Worker thread 1’s waits)
Figure labels: Disk I/O, Critical path, Wait graph
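The critical-path idea on the two slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not StackMine's implementation: the `Segment` record, the `woken_by` field, the `critical_path` function, and all timestamps are hypothetical, and the sketch assumes each wait records which thread signalled it (or `None` when a device such as the disk completed it). Starting at the scenario finish on the UI thread, it walks backwards in time and hops to the waking thread whenever it reaches a wait.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    state: str                      # "CPU", "Ready", or "Wait"
    start: int                      # ms since scenario start (made-up values)
    end: int
    woken_by: Optional[str] = None  # for "Wait": signalling thread, None for a device


def critical_path(timeline: Dict[str, List[Segment]], thread: str,
                  start: int, finish: int) -> List[Tuple[str, str, int, int]]:
    """Walk backwards from the finish, hopping to the waking thread at each wait."""
    path = []
    t = finish
    while t > start:
        # Segment of the current thread covering the instant just before t.
        seg = next(s for s in timeline[thread] if s.start < t <= s.end)
        if seg.state == "Wait" and seg.woken_by is not None:
            # The waker's work, not the wait itself, belongs to the path.
            # Assumes consistent data: the waker is not itself waiting at t.
            thread = seg.woken_by
            continue
        lo = max(seg.start, start)
        path.append((thread, seg.state, lo, t))
        t = lo
    return path[::-1]


# Hypothetical timeline loosely shaped like the slide's example.
TIMELINE = {
    "UI": [Segment("CPU", 0, 10), Segment("Wait", 10, 60, woken_by="WT1"),
           Segment("Ready", 60, 62), Segment("CPU", 62, 70),
           Segment("Wait", 70, 90),            # woken by disk I/O completion
           Segment("Ready", 90, 92), Segment("CPU", 92, 100)],
    "WT1": [Segment("Wait", 0, 10, woken_by="UI"), Segment("Ready", 10, 12),
            Segment("CPU", 12, 20), Segment("Wait", 20, 40, woken_by="WT2"),
            Segment("Ready", 40, 42), Segment("CPU", 42, 60),
            Segment("Wait", 60, 100)],
    "WT2": [Segment("Wait", 0, 20, woken_by="WT1"), Segment("Ready", 20, 22),
            Segment("CPU", 22, 40), Segment("Wait", 40, 100)],
}

path = critical_path(TIMELINE, "UI", start=0, finish=100)
for thread, state, lo, hi in path:
    print(f"{lo:3d}-{hi:3d} ms  {state}({thread})")
```

With this made-up timeline, the CPU segments on the path run on UI, WT1, WT2, WT1, and UI in turn, followed by the device wait, which mirrors the ordering shown on the Critical Path slide. Callstacks from threads that never appear on the path would fall outside the area of interest.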
Approach – Callstack Pattern Mining
Approach – Callstack Pattern Mining: Frequent Sequence Pattern
• Non-consecutive frequent sub-sequence as callstack pattern
• Frequent maximal patterns are compact
• Example: frequent sequence pattern miner with frequency threshold T = 5 (or 50% of total)

Sequence     Occurrence
A B          4
A B C D F    1
A B C E F    3
A B G F      2

All patterns      Support
A, B, AB          10
F, AF, BF, ABF    6

The slide repeats this table under the titles Closed Patterns and Maximal Patterns. By the standard definitions, the closed patterns are AB (support 10) and ABF (support 6), and the only maximal pattern is ABF.
Approach – Callstack Pattern Mining: Costly Sequence Pattern
• Non-consecutive costly sub-sequence as callstack pattern
• Costly maximal patterns are compact
• Example: costly sequence pattern miner with cost threshold T = 5 ms (or 50% of total)

Sequence     Wait time
A B          4 ms
A B C D F    1 ms
A B C E F    3 ms
A B G F      2 ms
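The two toy examples above (occurrence counts and wait times) can be reproduced with one brute-force miner that sums a weight per sequence. This is a minimal sketch for illustration only: the function names `subsequences`, `is_proper_subsequence`, and `mine` are made up here, the exhaustive enumeration is exponential in stack depth, and nothing below claims to be the algorithm StackMine uses at scale. It also assumes that a pattern's cost is the total wait time of the callstacks containing it, which is how the slide's example reads.

```python
from itertools import combinations


def subsequences(stack):
    """All non-empty, order-preserving (not necessarily consecutive) sub-sequences."""
    return {tuple(stack[i] for i in idx)
            for n in range(1, len(stack) + 1)
            for idx in combinations(range(len(stack)), n)}


def is_proper_subsequence(p, q):
    """True if p can be obtained from q by deleting at least one frame."""
    it = iter(q)
    return len(p) < len(q) and all(frame in it for frame in p)


def mine(weighted_stacks, threshold):
    """Return (all, closed, maximal) patterns whose summed weight reaches threshold."""
    total = {}
    for stack, weight in weighted_stacks:
        for pattern in subsequences(stack):   # each stack counts once per pattern
            total[pattern] = total.get(pattern, 0) + weight
    kept = {p: w for p, w in total.items() if w >= threshold}
    # Closed: no super-pattern with the same weight.
    closed = {p: w for p, w in kept.items()
              if not any(is_proper_subsequence(p, q) and kept[q] == w for q in kept)}
    # Maximal: no super-pattern that also reaches the threshold.
    maximal = {p: w for p, w in kept.items()
               if not any(is_proper_subsequence(p, q) for q in kept)}
    return kept, closed, maximal


# Weights are occurrence counts (frequent patterns) ...
occurrences = [(tuple("AB"), 4), (tuple("ABCDF"), 1),
               (tuple("ABCEF"), 3), (tuple("ABGF"), 2)]
frequent, closed, maximal = mine(occurrences, threshold=5)

# ... or wait times in ms (costly patterns); the slide uses the same numbers.
wait_times_ms = [(tuple("AB"), 4), (tuple("ABCDF"), 1),
                 (tuple("ABCEF"), 3), (tuple("ABGF"), 2)]
costly, costly_closed, costly_maximal = mine(wait_times_ms, threshold=5)

print(sorted("".join(p) for p in frequent))
print({"".join(p): w for p, w in closed.items()})
print({"".join(p): w for p, w in maximal.items()})
```

Seven patterns pass the threshold, two of them are closed, and one is maximal, which is why reporting only maximal patterns keeps the result compact.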
Approach – Callstack Pattern Clustering
Approach – Callstack Pattern Clustering
• Motivation
– Same issue often reflected by variant patterns
– Defect often hidden in invariant parts of variant patterns
• Goal
– Precise measurement of issue impact for better prioritization
– Comprehensive issue representation with pattern variations for quick and precise fixing
Approach – Callstack Pattern Clustering: Similarity Model

Callstack 1
App_main
InitComponents
GetHashCode
GetShortPathName
……
……
MmAccessFault
SwapKernelStack
MiIssueHardFault
IoPageRead
……

Callstack 2
App_main
InitComponents
GetHashCode
GetShortPathName
……
……
MmAccessFault
ExpandKernelStack
MiIssueHardFault
IoPageRead
……

Additional frames shown in the figure
wow64Service
wow64System
wow64QueryAttr
……

Alignment labels: Match, Insertion/Deletion, Substitution, Match

0. Alignment based on edit distance model
1. Common-purpose function: weight↓ (Uni() is small)
2. Special-purpose function: weight↑ (Uni() is large)
3. Variant part representing non-essential factors
4. Similar names implying relevant functionalities (SwapKernelStack, ExpandKernelStack)
5. Constant call path: weight↓ (FBi() + BBi() is small)
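A weighted edit distance over callstack frames is one way to picture the alignment idea listed above. The sketch below is an assumption-laden illustration rather than the StackMine similarity model: the `WEIGHTS` table is hand-picked and stands in for the slide's Uni(), FBi(), and BBi() terms (which are not defined in this transcript), `difflib.SequenceMatcher` is swapped in as a generic name-similarity measure, the names `weight`, `substitution_cost`, and `stack_distance` are hypothetical, and placing the wow64 frames in the second stack is a guess about the figure.

```python
from difflib import SequenceMatcher

# Hypothetical weights: low for common-purpose frames, 1.0 by default.
WEIGHTS = {"wow64Service": 0.1, "wow64System": 0.1, "wow64QueryAttr": 0.1}


def weight(frame):
    return WEIGHTS.get(frame, 1.0)


def substitution_cost(x, y):
    """Cheaper when the two function names look alike."""
    if x == y:
        return 0.0
    name_similarity = SequenceMatcher(None, x, y).ratio()  # in [0, 1]
    return (1.0 - name_similarity) * max(weight(x), weight(y))


def stack_distance(a, b):
    """Edit distance where insert/delete cost is the frame's weight."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + weight(a[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + weight(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + weight(a[i - 1]),                           # deletion
                d[i][j - 1] + weight(b[j - 1]),                           # insertion
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),  # match/substitution
            )
    return d[-1][-1]


# Frame names are taken from the slide; elided frames are simply left out.
stack_a = ["App_main", "InitComponents", "GetHashCode", "GetShortPathName",
           "MmAccessFault", "SwapKernelStack", "MiIssueHardFault", "IoPageRead"]
stack_b = ["App_main", "InitComponents", "GetHashCode", "GetShortPathName",
           "wow64Service", "wow64System", "wow64QueryAttr",
           "MmAccessFault", "ExpandKernelStack", "MiIssueHardFault", "IoPageRead"]

print(stack_distance(stack_a, stack_a))
print(stack_distance(stack_a, stack_b))
```

Under these made-up weights the two stacks end up close to each other even though they differ in four frames, because the inserted frames carry little weight and the substituted pair has similar names. A plain unweighted edit distance would count the same difference as four full edits.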
Technical Highlights
• Machine learning for system domain
– Formulate the discovery of problematic execution patterns as callstack mining & clustering
– Systematic mechanism to incorporate domain knowledge
• Interactive performance analysis system
– Parallel mining infrastructure based on HPC + MPI
– Visualization aided interactive exploration
Evaluation – Windows 7 Study
• Task: since Dec 2010, a continued effort to improve Windows performance
– Analysts from one performance analysis team for Microsoft Windows
– Hunt for hidden performance bugs that caused common impact on Windows Explorer UI response
– Based on over 6,000 trace streams
• Data
– 921 qualified out of 1,000 randomly sampled trace streams
– 181 million callstacks in total
  • 140 million wait callstacks
  • 41 million CPU sampled callstacks
Evaluation – Windows 7 Study: Research Questions
• RQ1. How much does StackMine improve practices of performance debugging in the large?
• RQ2. How well do the derived performance signatures capture performance bottlenecks?
• RQ3. How much does StackMine outperform alternative techniques?
Evaluation – Windows 7 Study: RQ1. Overall Improvement of Practices
• Traditional approach – would take 20~60 days
• Using StackMine – 18 hours

Reduction funnel:
140 million callstacks
→ AOI Extraction → 689 thousand callstacks
→ Callstack Pattern Mining → 2,239 costly patterns
→ Pattern Clustering → 1,251 pattern clusters
→ Pattern Ranking → Top 400 pattern clusters
→ Human analyst confirmation → 93 performance signatures
→ Feature team confirmation → 12 highly impactful bugs

Time spent: 10 hours of automatic computation, 8 hours of one human analyst’s review
Evaluation – Windows 7 Study: RQ2. Performance Bottleneck Coverage
• Performance Bottleneck Coverage (PBC) of a set of performance signatures:
$$PBC = \frac{\text{Total time of performance signatures}}{\text{Total time of collected trace streams}}$$
• The higher the PBC achieved, the lower the possibility that high-impact performance bugs remain uncaptured
• 58.26% PBC achieved by the 93 signatures
Evaluation – Windows 7 Study: RQ3. Comparison with Alternative Techniques
StackMine requires only 7.2%, 5.8%, and 6.3% of the trace streams required by the other three techniques
[Chart: number of required trace streams to investigate (y-axis, ticks up to 250) versus performance bottleneck coverage (x-axis, 20% to 60%) for Baseline-Random, Greedy-Total, Greedy-Max, and StackMine; data labels 193, 238, 220, and 14.]
Impact
“We believe that the MSRA tool is highly valuable and much more efficient for mass trace (100+ traces) analysis. For 1000 traces, we believe the tool saves us 4-6 weeks of time to create new signatures, which is quite a significant productivity boost.”
- from Development Manager in Windows
Highly effective new issue discovery on Windows mini-hang
Continuous impact on future Windows versions
Conclusion
• The first formulation and real-world deployment of performance debugging in the large
– as a data mining problem on callstacks
• A mining-clustering mechanism for reducing costly-pattern mining results
– based on domain-specific characteristics of callstacks
• Industrial impact of using StackMine in performance debugging in the large for Microsoft Windows
Acknowledgment
• Our partners in Microsoft product teams
• The researchers from Microsoft Research
Q&A
Thank you!