![Page 1: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/1.jpg)
Scalable System Measurement and
Performance Analysis: Recent Progress
Rob Fowler, RENCI
Workshop on Performance & Productivity
of Extreme-Scale Parallel Computing
Oct. 16 2006.
![Page 2: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/2.jpg)
(Scalable System) Measurement and
Performance Analysis: Recent Progress
Rob Fowler
Workshop on Performance & Productivity
of Extreme-Scale Parallel Computing
Oct. 16 2006.
![Page 3: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/3.jpg)
The I/O Issue and Scaling.
• Seymour Cray (1976):
  – I/O has certainly been lagging the last decade.
• D. Kuck (1988):
  – Also, I/O needs lots of work.
• Dave Patterson (1994):
  – Terabytes >> Teraflops or Why Work on Processors When I/O is Where the Action is?
• Seymour Cray:
  – A supercomputer is a device that turns a compute-bound problem into an I/O-bound problem.
![Page 4: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/4.jpg)
A “Real” WRF problem on BG/L
[Figure: LEAD 27 km data, 84 hour simulation, 32 BG/L processors (CO mode). Number of iterations vs. time (sec) for BG/L 32-CO, with an inset covering 10 to 27 sec.]
• WRF run for LEAD: CONUS 27 km grid, 84 simulated hours, hourly simulation output, checkpoint every 12 simulated hours.
• NetCDF used for I/O.
![Page 5: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/5.jpg)
Speedup and efficiency of WRF on BG/L
[Figure: Time (seconds) vs. number of processors for iteration Classes 1 to 4, with an inset whose horizontal axis runs from 440 to 520.]
[Figure: Efficiency vs. number of processors for Classes 1 to 4.]
• Four classes of iterations: 1 & 2 just compute, 3 & 4 do file I/O using NetCDF.
• Computation scales reasonably well. I/O does not scale at all.
• What about weak scaling, i.e. run on a “petascale challenge input”:
  • With a bigger problem, the computation will scale better on more processors.
  • With a bigger problem on more data, I/O will be even more of a bottleneck.
• Procurement benchmarks: Output only once, at the end of the run.
![Page 6: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/6.jpg)
A Simple Model of This WRF Run
• Postulate an Amdahl’s law model for each class of iteration:
  $$t(P) = t_{parallel}/P + t_{sequential}$$
• For each class of iteration, fit the parameters:
  – (least squares, R² > .98)
  – (use asymptote)

| Class | t_parallel (sec) | t_sequential (sec) |
|-------|------------------|--------------------|
| 1     | 457600           | 1400               |
| 2     | 72280            | 200                |
| 3     | 0                | 1600               |
| 4     | 0                | 900                |

• Aggregate Model: t(P) = 529900/P + 4100 (sec)
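The per-class Amdahl model can be checked numerically. The sketch below is illustrative only: it assumes the slide's flattened table reads as (t_parallel, t_sequential) = (457600, 1400), (72280, 200), (0, 1600), (0, 900) seconds for Classes 1 to 4, and the function names are made up for this example. Summing the per-class terms gives 529880/P + 4100, which agrees with the aggregate model on the slide after rounding.

```python
# Per-class parameters (t_parallel, t_sequential) in seconds, as read from
# the slide's flattened table; treat the exact values as a transcription.
CLASS_PARAMS = {
    1: (457600.0, 1400.0),
    2: (72280.0, 200.0),
    3: (0.0, 1600.0),
    4: (0.0, 900.0),
}


def class_time(cls, procs):
    """Amdahl model for one iteration class: t(P) = t_parallel/P + t_sequential."""
    t_parallel, t_sequential = CLASS_PARAMS[cls]
    return t_parallel / procs + t_sequential


def aggregate_params():
    """Sum the per-class terms to get the aggregate (t_parallel, t_sequential)."""
    t_parallel = sum(p for p, _ in CLASS_PARAMS.values())
    t_sequential = sum(s for _, s in CLASS_PARAMS.values())
    return t_parallel, t_sequential


def aggregate_time(procs):
    """Total modeled run time on `procs` processors."""
    return sum(class_time(cls, procs) for cls in CLASS_PARAMS)


def efficiency(procs):
    """Parallel efficiency relative to the modeled single-processor time."""
    return aggregate_time(1) / (procs * aggregate_time(procs))


if __name__ == "__main__":
    print("aggregate (t_parallel, t_sequential):", aggregate_params())
    for procs in (32, 128, 512, 2048):
        print(f"P={procs:5d}  t={aggregate_time(procs):9.1f} s  "
              f"efficiency={efficiency(procs):.3f}")
```

Under this model the run time can never drop below the 4100-second sequential term, most of which comes from the two I/O classes.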
![Page 7: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/7.jpg)
WRF on XT3
[Figure X-2: LEAD 27 km data, 84 hour simulation, Cray XT3. Number of iterations vs. time (sec) for XT3-32, XT3-64, XT3-128, XT3-256, XT3-512, and XT3-1024, with an inset covering 0 to 1 sec.]
![Page 8: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/8.jpg)
Summary of WRF Results.
[Figure X-7: Time to Solution. Time (sec) vs. number of processors for BG/L-CO, BG/L-CO-NOIO, XT3, and XT3-NOIO.]
![Page 9: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/9.jpg)
Takeaway messages
I/O continues to be the elephant in the room.
Use alternatives to NetCDF, but this doesn’t make the problem go away.
“Good” news: An ensemble of 50 WRF runs is a lot more useful than a 50X bigger single run.
Petascale capacity vs. petascale capability?
(SC06 BOF on Petascale Performance Eval. Wed 5PM)
![Page 10: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/10.jpg)
Scalable (System Measurement and
Performance Analysis): Recent Progress
Rob Fowler
Workshop on Performance & Productivity
of Extreme-Scale Parallel Computing
Oct. 16 2006.
![Page 11: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/11.jpg)
Performance/Productivity at the High End: Optimization in multiple dimensions
• On-node issues: memory, ILP, CLP, …
• System-wide parallelism
• Software development, maintenance, reliability
• Power
• Reliability
• Algorithmic issues
![Page 12: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/12.jpg)
Moore's law
Circuit element count doubles every N months. (N ~18)
• Why: Features shrink, semiconductor dies grow.
• Corollaries: Gate delays decrease. Wires are relatively longer.
• In the past the focus has been making "conventional" processors faster.
  – Faster clocks
  – Clever architecture and implementation --> instruction-level parallelism.
  – Clever architecture, HW/SW prefetching, and massive caches ease the “memory wall” problem.
• Problems:
  – Faster clocks --> more power (P ~ V²F)
  – More power goes to overhead: cache, predictors, “Tomasulo”, clock, …
  – Big dies --> fewer dies/wafer, lower yields, higher costs
  – Together --> Power hog processors on which some signals take 6 cycles to cross.
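The power relation on this slide (power proportional to voltage squared times frequency) can be made concrete with a small calculation. This is a minimal sketch under the usual dynamic-power approximation. The function name and the scale factors are illustrative assumptions, not figures from the talk.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to a baseline, using P ~ V^2 * F."""
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Hypothetical scenario: 20% higher clock that also needs 10% more voltage.
    faster = relative_dynamic_power(1.10, 1.20)
    # Hypothetical scenario: 20% lower clock that tolerates 10% less voltage.
    slower = relative_dynamic_power(0.90, 0.80)
    print(f"faster clock: {faster:.2f}x baseline power")
    print(f"slower clock: {slower:.2f}x baseline power")
    print(f"two slower cores: {2 * slower:.2f}x baseline power")
```

With these assumed numbers, two of the slower cores draw about 1.3x the baseline power while providing 1.6x the aggregate clock rate, which is the kind of trade that motivates more on-chip parallelism.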
![Page 13: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/13.jpg)
P4: Competing with charcoal?
Thanks to Bob Colwell
![Page 14: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/14.jpg)
The Sea Change: More On-Chip Parallelism.
• Single thread --> ILP
  – Instruction pipelining constraints, etc.
  – Memory operation scheduling for latency, BW.
    • Memory arch. becoming a point-to-point network.
• Multi-threading --> CLP
  – Resource contention within a core
    • Memory hierarchy
    • Functional units, …
• Multi-core --> CLP
  – Chip-wide resource contention
    • Shared on-chip components of the memory system
    • Shared chip-edge interfaces
Challenge: Programmers (and optimizing compilers) will need to be able to understand all this. Tools needed.
![Page 15: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/15.jpg)
Why is performance not obvious?
Hardware complexity
  – Keeping up with Moore’s law with one thread.
  – Instruction-level parallelism.
    • Deeply pipelined, out-of-order, superscalar, threads.
  – Memory-system parallelism
    • Parallel processor-cache interface, limited resources.
    • Need at least k concurrent memory accesses in flight.
Software complexity
  – Competition/cooperation with other threads
  – Dependence on (dynamic) libraries.
  – Languages: “Object Orientation Considered Harmful”
  – Compilers
    • Aggressive (-O3+) optimization conflicts with manual transformations.
    • Incorrectly conservative analysis and optimization.
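The point about needing at least k concurrent memory accesses in flight can be illustrated with Little's law: the bytes in flight must equal latency times sustained bandwidth. This is a minimal sketch. The slide does not say how k is derived, so the formula, the function name, and the latency, bandwidth, and line-size numbers are all assumptions made for illustration.

```python
import math


def min_accesses_in_flight(latency_ns, bandwidth_bytes_per_ns, bytes_per_access):
    """Little's law: concurrency = latency * throughput.

    Bytes that must be in flight = latency * bandwidth; dividing by the
    size of one access (e.g. a cache line) gives the access count k.
    """
    bytes_in_flight = latency_ns * bandwidth_bytes_per_ns
    return math.ceil(bytes_in_flight / bytes_per_access)


if __name__ == "__main__":
    # Hypothetical node: 100 ns latency, 8 bytes/ns (8 GB/s), 64-byte lines.
    k = min_accesses_in_flight(100.0, 8.0, 64)
    print(f"need at least {k} cache-line accesses in flight")
```

If the processor-cache interface cannot keep that many misses outstanding, the memory system's peak bandwidth is unreachable no matter how the code is tuned.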
![Page 16: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/16.jpg)
“Simple” recipe for performance.
• Simultaneously achieve (or balance)
  – High ILP,
  – Memory locality and parallelism,
  – Chip-level parallelism,
  – Concurrent I/O and communication.
• Address this throughout lifecycle.
  – Algorithm design and selection.
  – Implementation
  – Repeat
    • Translate to machine-specific code.
    • Maintain algorithms, implementation, compilers.
![Page 17: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/17.jpg)
It gets worse: Scalable HEC
All the problems of on-node efficiency, plus
• Scalable parallel algorithm design,
• Load balance,
• Communication performance,
• Competition of communication with applications,
• External perturbations,
• Reliability issues:
  – Recoverable errors --> performance perturbation.
  – Non-recoverable error --> You need a plan B
    • Checkpoint/restart (expensive, poorly scaled I/O)
    • Robust applications
![Page 18: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/18.jpg)
While you’re at it …
• Promote programmer productivity.
• Lower costs.
• Decrease overall time to solution.
• And protect the huge investment in the existing code base.
![Page 19: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/19.jpg)
Performance Tuning in Practice
![Page 20: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/20.jpg)
The Trend in Software
At the limit of usability?
![Page 21: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/21.jpg)
Featuritis in extremis, a.k.a. feeping creaturism?
(Special order: $1200 from Wenger.)
![Page 22: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/22.jpg)
All you need to know about software engineering.
The Hitchhiker's Guide to the Galaxy, in a moment of reasoned lucidity which is almost unique among its current tally of five million, nine hundred and seventy-three thousand, five hundred and nine pages, says of the Sirius Cybernetics Corporation products that “it is very easy to be blinded to the essential uselessness of them by the sense of achievement you get from getting them to work at all. In other words - and this is the rock-solid principle on which the whole of the Corporation's galaxywide success is founded - their fundamental design flaws are completely hidden by their superficial design flaws.”
(Douglas Adams, "So Long, and Thanks for All the Fish")
![Page 23: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/23.jpg)
Dealing with Featuritis
• Use specialized, interoperating tools.
  – Encourage fusion of data from multiple sources
  – E.g., performance measurement on your favorite accelerator, NIC, …
• Drive workflows with scripts.
  – Integrate with cluster runtime management
• Integrate with the development process.
  – Eclipse/PTP
![Page 24: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/24.jpg)
Rice HPCToolkit: A Review
• Use Event Based Sampling (EBS)
  – Low, controllable overhead (<2%)
  – No instrumentation needed/wanted
  – Collect multiple metrics, compute others.
• Hierarchical correlation with source
  – Attribution to source line granularity
• Flat or call stack attribution
• Time-varying behavior - epochs
• Driven from scripts
• Top-down analysis encouraged
![Page 25: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/25.jpg)
Issue: EBS on Petascale Systems
• Hardware on both BG/L and XT3 supports event based sampling.
• Current OS kernels from vendors do not have EBS drivers.
• This needs to be fixed!
  – Linux/ZeptoOS?
• Issue: Does event-based sampling introduce “jitter”?
  – Much less impact than using fine-grain calipers.
  – Not unless you have much worse performance problems.
![Page 26: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/26.jpg)
Problem: Profiling Parallel Programs
• Sampled profiles can be collected for about 1% overhead.
• How can one productively use profiling on large parallel systems?
  – Understand the performance characteristics of the application.
  – Study node-to-node variation.
    • Model and understand systematic variation.
  – Characterize intrinsic, systemic effects in app.
    • Identify anomalies: app. bugs, system effects.
  – Automate everything.
    • Do little “glorified manual labor” in front of a GUI.
    • Find/diagnose unexpected problems, not just the expected ones.
    • Avoid the “10,000 windows” problem.
![Page 27: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/27.jpg)
Statistical Analysis: Bi-clustering
• Data Input: an M by P dense matrix of non-negative values.
  – P columns, one for each process(or).
  – M rows, one for each measure at each source construct.
• Problem: Identify bi-clusters.
  – Identify a group of processors that are different from the others because they are “different” w.r.t. some set of metrics. Identify the set of metrics.
  – Identify multiple bi-clusters until satisfied.
• The “Cancer Gene Expression Problem”
  – The columns represent patients/subjects
    • Some are controls, others have different, but related cancers.
  – The rows represent data from DNA micro-array chips.
  – Which (groups of) genes correlate (+ or -) with which diseases?
  – There’s a lot of published work on this problem.
  – So, use the bio-statisticians’ code as our starting point.
    • “Gene shaving” algorithm by M.D. Anderson and Rice researchers applied to profiles collected using HPCToolkit.
(See Adam Bordelon, Rice student.)
![Page 28: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/28.jpg)
Cluster 1: 62% of variance in Sweep3D

[Figure: Cluster #1 (raw); both axes run from 0 to 5.]

| Weight    | Clone ID          |
|-----------|-------------------|
| -6.39088  | sweep.f,sweep:260 |
| -7.43749  | sweep.f,sweep:432 |
| -7.88323  | sweep.f,sweep:435 |
| -7.97361  | sweep.f,sweep:438 |
| -8.03567  | sweep.f,sweep:437 |
| -8.46543  | sweep.f,sweep:543 |
| -10.08360 | sweep.f,sweep:538 |
| -10.11630 | sweep.f,sweep:242 |
| -12.53010 | sweep.f,sweep:536 |
| -13.15990 | sweep.f,sweep:243 |
| -15.10340 | sweep.f,sweep:537 |
| -17.26090 | sweep.f,sweep:535 |

    if (ew_snd .ne. 0) then
      call snd_real(ew_snd, phiib, nib, ew_tag, info)
    c nmess = nmess + 1
    c mess = mess + nib
    else
      if (i2.lt.0 .and. ibc.ne.0) then
        leak = 0.0
        do mi = 1, mmi
          m = mi + mio
          do lk = 1, nk
            k = k0 + sign(lk-1,k2)
            do j = 1, jt
              phiibc(j,k,m,k3,j3) = phiib(j,lk,mi)
              leak = leak
         &      + wmu(m)*phiib(j,lk,mi)*dj(j)*dk(k)
            end do
          end do
        end do
        leakage(1+i3) = leakage(1+i3) + leak
      else
        leak = 0.0
        do mi = 1, mmi
          m = mi + mio
          do lk = 1, nk
            k = k0 + sign(lk-1,k2)
            do j = 1, jt
              leak = leak + wmu(m)*phiib(j,lk,mi)*dj(j)*dk(k)
            end do
          end do
        end do
        leakage(1+i3) = leakage(1+i3) + leak
      endif
    endif

    if (ew_rcv .ne. 0) then
      call rcv_real(ew_rcv, phiib, nib, ew_tag, info)
    else
      if (i2.lt.0 .or. ibc.eq.0) then
        do mi = 1, mmi
          do lk = 1, nk
            do j = 1, jt
              phiib(j,lk,mi) = 0.0d+0
            end do
          end do
        end do
![Page 29: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/29.jpg)
Cluster 2: 36% of variance

[Figure: Cluster #2 (raw); both axes run from 0 to 5.]

| Weight    | Clone ID          |
|-----------|-------------------|
| -6.31558  | sweep.f,sweep:580 |
| -7.68893  | sweep.f,sweep:447 |
| -7.79114  | sweep.f,sweep:445 |
| -7.91192  | sweep.f,sweep:449 |
| -8.04818  | sweep.f,sweep:573 |
| -10.45910 | sweep.f,sweep:284 |
| -10.74500 | sweep.f,sweep:285 |
| -12.49870 | sweep.f,sweep:572 |
| -13.55950 | sweep.f,sweep:575 |
| -13.66430 | sweep.f,sweep:286 |
| -14.79200 | sweep.f,sweep:574 |

    if (ns_snd .ne. 0) then
      call snd_real(ns_snd, phijb, njb, ns_tag, info)
    c nmess = nmess + 1
    c mess = mess + njb
    else
      if (j2.lt.0 .and. jbc.ne.0) then
        leak = 0.0
        do mi = 1, mmi
          m = mi + mio
          do lk = 1, nk
            k = k0 + sign(lk-1,k2)
            do i = 1, it
              phijbc(i,k,m,k3) = phijb(i,lk,mi)
              leak = leak + weta(m)*phijb(i,lk,mi)*di(i)*dk(k)
            end do
          end do
        end do
        leakage(3+j3) = leakage(3+j3) + leak
      else
        leak = 0.0
        do mi = 1, mmi
          m = mi + mio
          do lk = 1, nk
            k = k0 + sign(lk-1,k2)
            do i = 1, it
              leak = leak + weta(m)*phijb(i,lk,mi)*di(i)*dk(k)
            end do
          end do
        end do
        leakage(3+j3) = leakage(3+j3) + leak
      endif
    endif

    c J-inflows for block (j=j0 boundary)
    c
    if (ns_rcv .ne. 0) then
      call rcv_real(ns_rcv, phijb, njb, ns_tag, info)
    else
      if (j2.lt.0 .or. jbc.eq.0) then
        do mi = 1, mmi
          do lk = 1, nk
            do i = 1, it
              phijb(i,lk,mi) = 0.0d+0
            end do
          end do
        end do
![Page 30: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/30.jpg)
Further Issues with Scalable Performance Monitoring
• Scalable performance monitoring
  – Profiles/summaries: space efficient but lack temporal detail
  – Event traces: temporal detail but space demanding
• Even collecting profiles/summaries is challenging
  – exorbitant data volume (100K nodes)
  – high extraction costs, with perturbation risk
• Tunable detail and data volume
  – application signatures (tasks)
    • Dynamic filtering of time series.
      – 1st try: Polyline fit using least squares
      – In progress: Wavelet-based filtering
  – stratified sampling (system)
    • adaptive node subset
“… a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.” Herbert Simon
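The "polyline fit using least squares" idea for filtering a time series can be sketched as follows. The slide names the technique but gives no algorithm, so the greedy segmentation below, the error tolerance, and all function names are assumptions chosen for illustration. Each segment is an ordinary least-squares line that is extended one point at a time while the worst residual stays within the tolerance.

```python
def fit_line(ts, ys):
    """Ordinary least-squares line through the points; returns
    (slope, intercept, max_abs_residual)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = mean_y - slope * mean_t
    max_err = max(abs(y - (slope * t + intercept)) for t, y in zip(ts, ys))
    return slope, intercept, max_err


def polyline_filter(ts, ys, tol):
    """Greedily split the series into segments, each fit by least squares.

    Returns a list of (t_start, t_end, slope, intercept); a segment grows
    until adding the next point would push the worst residual above tol.
    """
    segments = []
    start, n = 0, len(ts)
    while start < n:
        end = min(start + 2, n)  # exclusive index; begin with up to 2 points
        best = fit_line(ts[start:end], ys[start:end])
        while end < n:
            candidate = fit_line(ts[start:end + 1], ys[start:end + 1])
            if candidate[2] > tol:
                break
            best = candidate
            end += 1
        segments.append((ts[start], ts[end - 1], best[0], best[1]))
        start = end
    return segments


if __name__ == "__main__":
    # Hypothetical metric trace: a ramp followed by a plateau.
    ts = list(range(10))
    ys = [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
    for seg in polyline_filter(ts, ys, tol=0.01):
        print(seg)
```

Storing a few segment endpoints and coefficients instead of every sample is what makes the data volume tunable: a looser tolerance gives fewer segments and less detail.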
![Page 31: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/31.jpg)
Sampling Theory: Exploiting Software
• SPMD models create behavioral equivalence classes
  – domain and functional decomposition
• By construction, …
  – most tasks perform similar functions
  – most tasks have similar performance
• Sampling theory and measurement
  – extract data from “representative” nodes
  – compute metrics across representatives
  – balance volume and statistical accuracy
• Estimate mean with confidence 1-α and error bound d
  – select a random sample of size n from population of size N:
    $$n \ge N\left[1 + N\left(\frac{d}{z_{\alpha} S}\right)^{2}\right]^{-1}$$
  – approaches $\left(\frac{z_{\alpha} S}{d}\right)^{2}$ for large populations
Sampling Must Be Unbiased!
Source: Todd Gamblin
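The minimum sample size for estimating a mean with confidence 1-α and error bound d can be computed directly from the finite-population bound n ≥ N / (1 + N (d / (z S))²). In this sketch S is taken to be the standard deviation of the metric and z the two-sided normal quantile for the chosen α. Both readings, the function names, and the example numbers are assumptions, since the slide does not define the symbols.

```python
import math
from statistics import NormalDist


def large_population_limit(std_dev, error_bound, alpha):
    """Limit of the bound as N grows: (z * S / d) ** 2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # assumed two-sided quantile
    return (z * std_dev / error_bound) ** 2


def min_sample_size(population, std_dev, error_bound, alpha):
    """Smallest integer n with n >= N / (1 + N * (d / (z * S)) ** 2)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ratio = (error_bound / (z * std_dev)) ** 2
    return math.ceil(population / (1.0 + population * ratio))


if __name__ == "__main__":
    # Hypothetical metric with S = 1.0, error bound d = 0.1, 95% confidence.
    for nodes in (100, 1_000, 100_000):
        n = min_sample_size(nodes, 1.0, 0.1, 0.05)
        print(f"N={nodes:7d}  sample n={n}")
    print("limit for large N:", round(large_population_limit(1.0, 0.1, 0.05), 1))
```

The required sample grows very slowly with N, which is why monitoring a small random subset of nodes can stay affordable on very large machines, provided the subset is chosen without bias.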
![Page 32: Scalable System Measurement and Performance Analysis ...estrabd/LACSI2006/workshops/workshop3/Slides/...Scalable System Measurement and Performance Analysis: Recent Progress Rob Fowler](https://reader030.vdocument.in/reader030/viewer/2022040712/5e1769299b8a7664db36124e/html5/thumbnails/32.jpg)
Adaptive Performance Data Sampling
• Simple case
  – select subset n of N nodes
  – collect data from the n
• Stratified sampling
  – identify low variance subpopulations
  – sample subpopulations independently
  – reduced overhead for same confidence
• Metrics vary over time
  – samples must track changing variance
  – number of subpopulations also vary
• Sampling options
  – fixed subpopulations (time series)
  – random subpopulations (independence)
Source: Todd Gamblin
(See Todd Gamblin, UNC student.)
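Stratified sampling of per-node data can be sketched in a few lines. This is an illustrative example and not the tool described in the talk: it assumes the subpopulations are already known and non-empty, uses simple proportional allocation, and all names and numbers are hypothetical.

```python
import random
import statistics


def stratified_mean(strata, fraction, rng):
    """Estimate the population mean by sampling each stratum independently.

    strata   -- list of non-empty lists of per-node metric values
    fraction -- share of each stratum to sample (at least one node each)
    rng      -- random.Random instance, so runs are reproducible
    Returns (estimated_mean, number_of_nodes_sampled).
    """
    total = sum(len(stratum) for stratum in strata)
    estimate = 0.0
    sampled = 0
    for stratum in strata:
        k = max(1, round(fraction * len(stratum)))
        sample = rng.sample(stratum, k)  # uniform, without replacement
        # Weight each stratum's sample mean by its share of the population.
        estimate += (len(stratum) / total) * statistics.fmean(sample)
        sampled += k
    return estimate, sampled


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical machine: 900 compute-bound nodes and 100 I/O-heavy nodes.
    compute = [rng.gauss(1.0, 0.05) for _ in range(900)]
    io_heavy = [rng.gauss(10.0, 0.5) for _ in range(100)]
    estimate, used = stratified_mean([compute, io_heavy], 0.05, rng)
    true_mean = statistics.fmean(compute + io_heavy)
    print(f"sampled {used} of 1000 nodes")
    print(f"estimate={estimate:.3f}  true mean={true_mean:.3f}")
```

Because each subpopulation has low internal variance, a small sample per stratum gives a tighter estimate than the same number of nodes drawn from the whole machine at random.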