SFO15-301: Benchmarking Best Practices 101
TRANSCRIPT
![Page 1: SFO15-301: Benchmarking Best Practices 101](https://reader031.vdocument.in/reader031/viewer/2022021919/588394001a28ab2b568b487b/html5/thumbnails/1.jpg)
Presented by: Bernie Ogden, Maxim Kuvyrkov
Date: Wednesday 23 September 2015
Event: SFO15
Overview
● What is benchmarking?
● Design
  ○ Designing a benchmarking experiment
● Repeatability
  ○ Can we repeat the result?
● Reproducibility & Reporting
  ○ Can others repeat the result?
  ○ What does a good report look like?
What Is Benchmarking?
What is benchmarking?
● An experiment, like any other
● Scientific method
  ○ Form a hypothesis
  ○ Test, with control of variables
  ○ Report results…
  ○ …with enough detail for others to replicate
What is benchmarking?
● Slow, hard work
● If we don’t do it right
  ○ We waste effort
  ○ We fail to deliver member value
  ○ We look bad
● But if we over-do it, we also waste effort
  ○ No experiment is perfect
  ○ We must be aware of limitations, and understand and explain their consequences
Design
Goal
Establish goal: what am I trying to do?
● Measure performance improvement due to code change
● Compare performance of 32- and 64-bit builds of libfoo.so
Experiment
In light of goal, design experiment
● Identify question to ask
● Select testing domain
● Identify variables
● Consider how to control variables
Testing Domain (1/2)
Select appropriate testing domain for the effect being measured. For instance:
CPU-specific, CPU-bound effect
● Test on single implementation of that CPU
● Example: FP performance on Cortex-A57
Testing Domain (2/2)
Architecture-specific, CPU-bound effect
● Test on range of CPUs that implement arch
● Example: FP performance on ARMv7-A
Architecture-generic, memory-bound effect
● Test on range of SoCs and implementations
● Example: AArch32 memcpy performance on ARMv7-A & ARMv8-A
Know Your Target
Know major hardware features
● Core, frequency, cache hierarchy...
Have a sense of ‘background activity’
● Determine post-boot ‘settling’ time
● Check background processes, memory use
● What interrupts are there, where do they go?
Know Your Benchmark
Purpose
● Static codegen, JIT
General characteristics
● Code size, mem load
What is it exercising?
● Pointer chasing, FP
What is it sensitive to?
● BP, memory system
Phase behaviour
● Physics, rendering
Run & reporting rules
Know The Intersection
Controlling all variables, study the behaviour of the benchmark on the target.
Run multiple times to determine variability.
You should be able to converge on some average, within a narrow interval, with high confidence.
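The multiple-runs step can be sketched in shell. Here `./benchmark` is a hypothetical binary that prints one timing per run; the awk summary computes the sample mean and standard deviation of the collected runs.

```shell
# summarise: read one number per line; print count, mean, sample std dev
summarise() {
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END {
             mean = sum / n
             sd = sqrt((sumsq - sum * sum / n) / (n - 1))
             printf "n=%d mean=%.4f sd=%.4f\n", n, mean, sd
         }'
}

# Hypothetical usage: collect 30 runs, then summarise
# for i in $(seq 30); do ./benchmark; done | summarise

printf '1.0\n2.0\n3.0\n' | summarise    # prints: n=3 mean=2.0000 sd=1.0000
```

A standard deviation that stays small as the run count grows is the signal that this target/benchmark pair is under control.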
Why Bother?
● Interpretation of results● Identification of significant variables● Identification of benchmark subsets
Repeatability
Repeatability
An experiment is repeatable if one team can repeatedly run the same experiment over short periods of time and get the same results.
Control Variables
Source
Libraries
Toolchain
Build env
OS image
Firmware
Hardware
CPU frequency
Core migration
Interrupts
Thermal throttling
Power management
ASLR
MMU effects
Cache warmup
Cache hierarchy
Code layout
Memory controller
Etc etc etc...
Countering Noise: Mitigation
Improves run-to-run consistency; reduces realism
● Reboot for every run
● Warm-up period
● Fix CPU frequency
● Power management/thermal control
● Bind processes to cores
Countering Noise: Statistics
● Some variables cannot be controlled
● Controlling variables reduces realism
● Multiple runs required to show effect of controlling variables
● Multiple runs required for consistency of results
● Changes may affect variance as well as mean
Combined Approach
Reduce target noise sources
● To the threshold of unacceptable unrealism
● To the point where no further reduction can be achieved
Increase number of runs
● Until the effect is repeatable to some acceptable confidence interval
How Much Noise Is Acceptable?
Roughly: the effect size should be larger than some confidence interval
● 0.95 is popular, but won’t include the true mean 1 time in 20
YMMV, depending on the experiment.
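As a sketch (not from the slides), a large-sample 95% interval around the mean can be computed with the usual normal approximation. The 1.96 multiplier assumes many runs; for a handful of runs a Student-t multiplier would be more appropriate.

```shell
# ci95 MEAN SD N: print "mean +/- halfwidth" using the 1.96 normal multiplier
ci95() {
    awk -v m="$1" -v s="$2" -v n="$3" \
        'BEGIN { printf "%.4f +/- %.4f\n", m, 1.96 * s / sqrt(n) }'
}

ci95 10.2 0.5 30    # half-width = 1.96 * 0.5 / sqrt(30)
```

If the effect you measured is smaller than that half-width, more runs (or more mitigation) are needed before claiming it.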
Reproducibility
Reproducibility
An experiment is reproducible if external teams can run the same experiment over large periods of time and get commensurate (comparable) results.
Achieved if others can repeat what we did and get the same results as us, within the given confidence interval.
Recording
Record everything
● Beware of implicit knowledge
● We don’t know what we don’t know
● Recording is cheap
● Future analysis
Recording
Record everything… but the following points are especially important.
● Full details of target hardware & OS
● Exact toolchain used
● Exact benchmark sources
● Full build logs
Reporting
Reporting
● Clear, concise reporting allows others to utilise benchmark results
● Reports for one audience can slip to others
● Do not assume knowledge
  ○ The reader may not know what your board is...
● Include relevant data
  ○ Make sure all data are available
● Define terms
Reporting: Goal
Explain the goal of the experiment
● What decision will it help you to make?
● What improvement will it allow you to deliver?
Explain the question that the experiment asks
Explain how the answer to that question helps you to achieve the goal
Reporting
● Method: Sufficient high-level detail
  ○ Target, toolchain, build options, source, mitigation
● Limitations: Acknowledge and justify
  ○ What are the consequences for this experiment?
● Results: Discuss in context of goal
  ○ Co-locate data, graphs, discussion
  ○ Include units - numbers without units are useless
  ○ Include statistical data
  ○ Use the benchmark’s metrics
Conclusion
It’s a lot of work...
But we have to do it to get meaningful, shareable benchmarking results.
We can (should) limit the amount of work, as long as we understand the consequences and are explicit about them.
Actions?
END
BACKUP/REFERENCE
Graphs: Strong Suggestions
Speedup Over Baseline (1)
Misleading scale
● A is about 3.5% faster than it was before, not 103.5%
Obfuscated regression
● B is a regression
Speedup Over Baseline (2)
Baseline becomes 0
Title now correct
Regression clear
But, no confidence interval.
Speedup Over Baseline (3)
Error bars tell us more: effect D can be disregarded, A is a real, but noisy, effect.
Watch out for scale change
Labelling (1/2)
What is the unit?
What are we comparing?
Labelling (2/2)
Graphs: Weak Suggestions
Speedup Over Baseline (4)
Can add a mean
Direction of ‘Good’ (1)
Inconsistent
Might be necessary
Direction of ‘Good’ (2)
If you have to change the direction of ‘good’, flag the direction (everywhere)
Can be helpful to flag it anyway
Consistent Order
Presents improvements neatly
But, hard to compare different graphs in the same report
Scale (1/2)
A few high scores make other results hard to see
A couple of alternatives may be more clear...
Scale (2/2)
Truncate
Separate Outliers
Noise Mitigation
Mitigation: Settling and warm-up
Monitor /proc/loadavg to determine how long the system takes to ‘settle down’ after boot
Run one or two iterations of the benchmark to initialize caches, branch predictors, etc, before beginning timing
Or run the benchmark so many times that warm-up effects are insignificant
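The settling check can be sketched as a loadavg poll. The 0.10 threshold, the 10 s interval, and `./benchmark --warmup` are all illustrative assumptions, not values from the talk.

```shell
# settled THRESHOLD LOAD: exit 0 if LOAD < THRESHOLD (floats, via awk)
settled() {
    awk -v t="$1" -v l="$2" 'BEGIN { exit !(l < t) }'
}

# Poll the 1-minute load average until the system has settled
wait_for_settle() {
    until settled 0.10 "$(cut -d ' ' -f 1 /proc/loadavg)"; do
        sleep 10
    done
}

# Hypothetical usage before the timed runs:
# wait_for_settle
# ./benchmark --warmup    # one or two untimed runs to prime caches/predictors
# ./benchmark --warmup
```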
Mitigation: Other Processes/Migration
Use a minimal OS
Shut down non-essential processes
● Tricky to generalize reliably
Set CPU affinity
● One CPU runs the benchmark, another runs ‘everything else’
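A minimal affinity sketch using util-linux `taskset`; the core numbers, PID, and `./benchmark` are all illustrative.

```shell
BENCH_CPU=3    # illustrative: the core reserved for the benchmark

# Pin the benchmark to that core (hypothetical ./benchmark):
# taskset -c "$BENCH_CPU" ./benchmark

# Re-pin an already-running process (PID 1234, illustrative) onto core 0:
# taskset -p -c 0 1234

# For stronger isolation, boot with isolcpus=3 on the kernel command line,
# so the scheduler never places other tasks on the benchmark core.
```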
Mitigation: Interrupts
Disable/monitor/constrain the irqbalance daemon
/proc/irq/*/smp_affinity: where interrupts can go (as far as the kernel knows)
/proc/interrupts: where interrupts are going
Disable the network
● Fiddly, but doable
● At least disable accidental access
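A sketch of those /proc interfaces, assuming the benchmark runs on core 3 of a 4-core target; needs root, and `systemctl` assumes a systemd-based OS.

```shell
# Mask 7 (binary 0111) = cores 0-2, keeping core 3 clear of interrupts
systemctl stop irqbalance    # stop the daemon re-spreading IRQs

for irq in /proc/irq/[0-9]*; do
    echo 7 > "$irq/smp_affinity" 2>/dev/null    # some IRQs cannot be moved
done

cat /proc/interrupts    # verify where interrupts actually land, per CPU
```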
Mitigation: DVFS
cpufreq can set a fixed frequency
Watch out for broken thermal throttling
Don’t try to extrapolate results to different frequencies (you’ve thrown off relative timings to the rest of the system)
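A sketch of fixing frequency through the cpufreq sysfs interface; cpu3 and the 1100000 kHz value are illustrative, and the userspace governor must be available on the target. Needs root.

```shell
SYS=/sys/devices/system/cpu/cpu3/cpufreq    # illustrative core

cat "$SYS/scaling_available_frequencies"    # pick one of these values (kHz)
echo userspace > "$SYS/scaling_governor"
echo 1100000 > "$SYS/scaling_setspeed"

cat "$SYS/scaling_cur_freq"    # re-check during runs: throttling can override
```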
Mitigation: ASLR
ASLR randomizes the base of the heap and stack
Affects alignment and relative position of data
May cause cache thrashing
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
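The sysctl shown disables ASLR system-wide. A narrower alternative is util-linux `setarch`, which disables randomization for a single process; `./benchmark` is hypothetical.

```shell
# -R / --addr-no-randomize: run one program with ASLR off,
# leaving the system-wide setting untouched
setarch "$(uname -m)" -R ./benchmark
```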
Mitigation: MMU
Use largest available page size
● Fewer TLB misses, potentially fewer page faults
● Intuitively: better performance, less noise
AArch64 supports 4KB & 64KB page sizes
AArch32+LPAE and AArch64 support huge pages
● 4KB page -> 2MB huge page
● 64KB page -> 512MB huge page
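A quick sanity check of what the target actually provides (Linux-specific paths):

```shell
getconf PAGESIZE    # base page size in bytes, e.g. 4096 or 65536
grep Huge /proc/meminfo || true    # huge page size and pool state
```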
Mitigation: Huge Page Gotchas (1/2)
Huge pages can have downsides
● Increased chance of cache thrashing
  ○ Large address == less random way selection
● Potentially similar effects elsewhere in system, e.g. channel selection in memory controller
Mitigation: Huge Page Gotchas (2/2)
THP collapses pages into huge pages
● Can happen at any time, potentially a noise source
libhugetlbfs can back the heap with huge pages
● Which will affect alignment, introducing noise
Bias: Code Layout
Code layout effects may dominate the effect we are trying to measure
Can cause cache thrashing, branch mispredicts
Varies statically and dynamically
Easily perturbed:
● Link order
● Environment variables
Mitigation: Layout Bias
● Vary experimental conditions
  ○ PIE, ASLR help in this respect
  ○ Vary link order, environment size
  ○ Tooling, e.g….
● Statistically isolate size of effect, to within some confidence interval
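Varying environment size can be sketched as a loop over padding lengths, so stack-placement effects show up as measured variance rather than hidden bias; `./benchmark` and the particular lengths are illustrative.

```shell
for len in 0 64 128 192 256; do
    # build a padding string of exactly $len bytes
    PAD=$(awk -v n="$len" 'BEGIN { while (length(s) < n) s = s "x"; printf "%s", s }')
    # env PAD="$PAD" ./benchmark    # hypothetical benchmark invocation
    echo "run with ${#PAD}-byte environment pad"
done
```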
Bare Metal vs Rich OS
Bare Metal vs Rich OS
Bare Metal
● High control / low realism
● Configuration trades control for realism
Rich OS
● High realism / low control
● Configuration trades realism for control
Bare Metal vs Rich OS
Pragmatic considerations
● Some benchmarks hard to run BM
● Longer-running benchmarks less perturbable
● Infrastructure and skills
  ○ In any given organisation, may be more oriented towards BM or Rich OS
Regression Tracking
Regression Tracking
You’ll have a few bots running a few point builds
Results will be noisy and incomplete
Look out for (informally) significant, and lasting, changes
Regression Tracking
http://llvm.org/perf/db_default/v4/nts/graph?plot.0=39.1304.3&highlight_run=27797
Regression Tracking
https://chromeperf.appspot.com/report?masters=ChromiumPerf&bots=chromium-rel-win7-gpu-nvidia&tests=angle_perftests%2FDrawCallPerf_d3d11&checked=all