System Utilization Benchmark on the Cray T3E and IBM SP
Adrian Wong, Leonid Oliker, William Kramer, Teresa Kaltz, Therese Enright and David Bailey
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Scientific Supercomputer Workload
• Long-running batch jobs (hours)
• Typically 64 nodes per job
• Often a long list of queued jobs
• Job turnaround may be days
[Figure: plot with axes labeled Utilization and Slowdown]
Motivations
– Ability to fully utilize a large computer is almost as important as the speed of the computer.
– Large capability mainframes rarely have idle cycles - need to maximize users’ productivity.
– Need a way to measure potential day-to-day utilization.
– No metric, other than anecdotal evidence, to gauge configuration changes.
– Increased complexity of scheduling with parallel platforms
Effective System Performance (ESP)
A test to assess system capabilities & configuration effects on utilization
Parallel Job Scheduling
[Figure: jobs packed into a Processors vs. Time chart]
• Optimization problem in packing with space (processor) and time constraints
• Dynamic situation
• Tradeoffs in turnaround, utilization & fairness
Scheduling Strategies
[Figure: job queue in order of submission, and a hole in the processor schedule]
• Best-Fit-First (BFF)
– Scan queue for best fit
– Starvation of large jobs
• First-Come-First-Serve (FCFS)
– Wait for right size hole
– May idle system
– Respects submission order
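The two strategies can be read as selection rules over the job queue. The following is a minimal sketch of that idea only, not the T3E or SP scheduler code; the function names `pick_fcfs` and `pick_bff` and the example queue are hypothetical.

```python
from typing import List, Optional


def pick_fcfs(queue: List[int], hole: int) -> Optional[int]:
    """First-Come-First-Serve: only the head of the queue may start.

    Returns the index of the job to launch, or None when the head job
    does not fit, in which case the hole stays idle.
    """
    if queue and queue[0] <= hole:
        return 0
    return None


def pick_bff(queue: List[int], hole: int) -> Optional[int]:
    """Best-Fit-First: scan the whole queue for the job that leaves the
    fewest idle processors; ties go to the earliest submission."""
    best = None
    for i, size in enumerate(queue):
        if size <= hole and (best is None or size > queue[best]):
            best = i
    return best


# Hypothetical queue of requested partition sizes, in submission order.
queue = [256, 32, 64, 8]
hole = 64  # processors currently free

print(pick_fcfs(queue, hole))  # prints None: the 256-processor head job does not fit
print(pick_bff(queue, hole))   # prints 2: the 64-processor job fills the hole exactly
```

In this example FCFS leaves the hole idle while BFF fills it, and the 256-processor job keeps waiting under BFF as long as smaller jobs keep fitting, which is the starvation effect noted above.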
Key OS System Capabilities
• Swapping / Gang-scheduling
• Job migration / compaction
• Priority preemption
• Backfill
• Disjoint partitions
• Checkpoint / restart
• Dynamically adjustable queue structures
ESP Design Goals & Attributes
• Transferable metric(s) / Valid comparisons
• Reproducible
• Easily interpreted results
• Portable
• Platform size and speed independent
• Capture essence of real workload
• Compact and easily distributed
• Easy to run (< 12 hours)
• Automated / no human intervention
• Focus on utilization / factor out CPU speed
• Test responsiveness & adaptability of scheduler
ESP Design
• Start with throughput test
• Profile of jobs determined by historical accounting data
• Find applications with appropriate size and time
• Use two full configuration jobs to encapsulate change of operational mode (e.g. interactive to batch)
• Submit jobs in three blocks in pseudo-random order
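A reproducible pseudo-random submission order can be sketched as below. The slides state only that jobs are submitted in three blocks in pseudo-random order; the even split across blocks, the fixed seed, and the name `make_submission_blocks` are assumptions made for illustration. The job list uses three (application, size, count) entries from the jobmix.

```python
import random


def make_submission_blocks(jobs, n_blocks=3, seed=0):
    """Shuffle jobs with a fixed seed and deal them into n_blocks blocks.

    A fixed seed keeps the order identical from run to run, which is
    one way to satisfy the reproducibility goal. (Assumed scheme.)
    """
    rng = random.Random(seed)
    order = list(jobs)
    rng.shuffle(order)
    # Deal the shuffled jobs round-robin so block sizes differ by at most 1.
    return [order[i::n_blocks] for i in range(n_blocks)]


# (application, partition size) instances expanded from the counts.
jobs = [("md", 8)] * 4 + [("md", 24)] * 3 + [("superlu", 8)] * 15

blocks = make_submission_blocks(jobs)
for n, block in enumerate(blocks, start=1):
    print(f"block {n}: {len(block)} jobs")
```

Calling the function twice with the same seed returns the same blocks, so two runs of the test submit identical sequences.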
ESP Test Schematic
[Figure: test timeline, total time < 12 hours. Blocks of regular jobs are separated by full config job #1 and full config job #2; two intervals are marked ">10%"; a shutdown/reboot is optional. The vanilla (throughput) variant consists of regular jobs.]
Individual Applications in Jobmix
Application   Discipline        Sizes               Count
gfft (FC)     Large FFT         512                 2
md            Biology           8,24                4,3
nqclarge      Chemistry         8,16                2,5
paratec       Material Sci      256                 1
qcd           Nuclear Physics   128,256             1
scf           Chemistry         32,64               7,10
scfdirect     Chemistry         64,81               7,2
superlu       Linear Alg        8                   15
tlbe          Fusion            16,32,49,64,128     2,6,5,8,1
Jobmix Application Elapsed Times
[Figure: two bar charts (T3E, SP) of elapsed time per jobmix application, Time (secs) from 0 to 10800, ordered by increasing partition size]
Platforms Tested
• Cray T3E
– 512 processors
– 450 MHz Alpha EV56
– Microkernel MPP OS
– NQS & Global Resource Mgr
– Oversubscription possible
– BFF strategy w/ dynamic queue configs
• IBM SP
– 512 processors
– 200 MHz Power3
– Semiautonomous monolithic OSes
– LoadLeveler batch queues
– FCFS w/ backfill (backfill disabled in 1st attempt)
T3E Chronology (with swap)
[Figure: Utilization (%) and Partition Size (0 to 512) vs. Normalized Time (0 to 2.2)]
• Insufficient work; tail-end dilemma
• Starvation of large jobs
Normalized = Elapsed / Theoretical Min
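The normalization can be made concrete with a small calculation. The slides do not define the theoretical minimum; the sketch below assumes it is the total work (partition size times run time, summed over jobs) divided by the number of processors, i.e. the elapsed time under perfect packing. The job list and elapsed time are hypothetical.

```python
def theoretical_min(jobs, total_procs):
    """Elapsed time if all jobs packed perfectly onto the machine.

    jobs: iterable of (partition_size, runtime_secs).
    Assumed definition: sum(size * runtime) / total_procs.
    """
    return sum(p * t for p, t in jobs) / total_procs


def normalized_time(elapsed, jobs, total_procs):
    """Elapsed wall-clock time divided by the theoretical minimum."""
    return elapsed / theoretical_min(jobs, total_procs)


# Hypothetical workload on a 512-processor machine.
jobs = [(512, 600), (256, 1200), (256, 1200), (64, 2400)]
total_procs = 512
elapsed = 4200  # hypothetical observed elapsed time in seconds

t_min = theoretical_min(jobs, total_procs)
norm = normalized_time(elapsed, jobs, total_procs)
print(t_min)       # prints 2100.0
print(norm)        # prints 2.0
print(100 / norm)  # prints 50.0, the average utilization in percent
```

Under this assumed definition a normalized time of 1.0 corresponds to 100% average utilization, and the reciprocal of the normalized end time gives the overall efficiency of the run.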
T3E Chronology (without swap)
[Figure: Utilization (%) and Partition Size (0 to 512) vs. Normalized Time (0 to 2.2)]
Slight decrease in utilization w/o swap capability
Higher overall efficiency - significant overhead w/ swap
SP Chronology
[Figure: Utilization (%) and Partition Size (0 to 512) vs. Normalized Time (0 to 2.2)]
• Waiting for machine to idle
Queue Wait Times (normalized)
[Figure: three panels (T3E Swap, T3E NoSwap, SP) of normalized queue wait time (0.0 to 2.0) for jobs sorted by Partition Size & Submit Time]
BFF - larger jobs = longer wait
FCFS - less dependence on size
Swap permits more simultaneous jobs running = shorter wait times
Idling twice causes 3 distinct regimes of wait times
Restoring Backfill on the SP
• Recognized that backfill is the standard mode for LoadLeveler
• Have problems with backfill and ESP stipulations
• However… interesting data from invalid test shot
Backfill Effect I (Chronology)
[Figure: two panels (SP FCFS; SP FCFS w/ backfill) of Utilization (%) and Partition Size (0 to 512) vs. Normalized Time (0 to 2.2)]
Highly efficient, but violates test
Need to selectively backfill
Backfill Effect II (Queue Wait Times)
[Figure: two panels (SP FCFS; SP FCFS w/ backfill) of normalized queue wait time (0.0 to 2.0)]
Backfill and Flaw in ESP test
[Figure: timeline with markers "FC job submitted", "All jobs finish except one", and "Guaranteed FC runtime"]
Backfill is working as expected, but a long-running job negates the effect of the reservation time - need finer granularity jobs
Stipulation for full configuration (FC) jobs?
1. Run immediately (possibly premature termination of running jobs): T3E
2. Run after current jobs finish: SP w/ backfill
3. No further jobs launched until FC finishes: SP
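The interaction between backfill and the FC reservation can be illustrated with a generic backfill test: a queued job may start ahead of a waiting job only if it fits in the free processors and is expected to finish before the waiting job's reserved start. This is a textbook-style sketch, not LoadLeveler's actual algorithm; the function name `can_backfill` and all numbers are hypothetical.

```python
def can_backfill(job_procs, job_runtime, free_procs, now, reserved_start):
    """Generic backfill rule (assumed, not LoadLeveler's implementation).

    The candidate must fit in the idle processors and must end no later
    than the time reserved for the waiting full configuration job.
    """
    fits = job_procs <= free_procs
    ends_in_time = now + job_runtime <= reserved_start
    return fits and ends_in_time


now = 1000        # current time in seconds
free_procs = 448  # processors idle while the FC job waits

# The FC job's reservation starts when the last running job is due to end.
short_reservation = 3600    # running jobs end soon
long_reservation = 10800    # one long-running job is still active

# Candidate: 128 processors for 5400 s, so it would end at t = 6400.
print(can_backfill(128, 5400, free_procs, now, short_reservation))  # prints False
print(can_backfill(128, 5400, free_procs, now, long_reservation))   # prints True
```

With a single long-running job holding the reservation far in the future, nearly every queued job passes the test and starts ahead of the FC job, which is the effect described above.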
Further Design Issues
• How to end the test?
• Possible to use backfill (globally or selectively)?
• Can we formulate a turnaround metric?
• Scalability in size and speed
• Finer granularity of jobs cf. overall test
• Perhaps need additional vanilla throughput test to evaluate purely scheduler performance
Conclusions & Observations
• SP - Can achieve very high utilization with backfill and no topology constraints
• SP - Lack of adaptability with dynamic workload - run ASAP mode
• T3E - Swapping with high overhead degrades utilization
• T3E - Can adapt to dynamic workload requirements