
System Utilization Benchmark on the Cray T3E and IBM SP

Adrian Wong, Leonid Oliker, William Kramer, Teresa Kaltz, Therese Enright and David Bailey

National Energy Research Scientific Computing Center

Lawrence Berkeley National Laboratory

Scientific Supercomputer Workload

• Long-running batch jobs (hours)
• Typically 64 nodes per job
• Often a long list of queued jobs
• Job turnaround may be days

Utilization

Slowdown

Motivations

– Ability to fully utilize a large computer is almost as important as the speed of the computer.

– Large capability mainframes rarely have idle cycles - need to maximize users’ productivity.

– Need a way to measure potential day-to-day utilization.

– No metric to gauge configuration changes other than anecdotal.

– Increased complexity of scheduling with parallel platforms

A test to assess system capabilities & configuration effects on utilization:

Effective System Performance (ESP)

Parallel Job Scheduling

[Figure: job packing diagram; axes: Processors, Time]

• Optimization problem in packing with space (processor) and time constraints
• Dynamic situation
• Tradeoffs in turnaround, utilization & fairness

Scheduling Strategies

[Figure: job queue in order of submission, with a hole in the schedule]

• Best-Fit-First (BFF): scan the queue for the best fit
  – Starvation of large jobs
• First-Come-First-Serve (FCFS): wait for a hole of the right size
  – May idle the system
  – Respects submission order
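The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a job is reduced to its requested processor count, the "hole" is a single number of free processors, and the function names `pick_fcfs` and `pick_bff` are hypothetical, not part of NQS or any batch system.

```python
from typing import List, Optional

def pick_fcfs(queue: List[int], free: int) -> Optional[int]:
    """First-Come-First-Serve: only the job at the head of the queue may start."""
    if queue and queue[0] <= free:
        return 0
    # Head job does not fit: the free processors stay idle until a large
    # enough hole opens, which preserves submission order.
    return None

def pick_bff(queue: List[int], free: int) -> Optional[int]:
    """Best-Fit-First: scan the whole queue for the job that best fills the hole."""
    fitting = [i for i, size in enumerate(queue) if size <= free]
    if not fitting:
        return None
    # Largest job that fits; ties go to the earliest submission.
    return max(fitting, key=lambda i: (queue[i], -i))

# Requested processors in order of submission (made-up values).
queue = [256, 64, 128, 32]
free = 128

fcfs_choice = pick_fcfs(queue, free)  # head needs 256 > 128, so nothing starts
bff_choice = pick_bff(queue, free)    # index 2: the 128-processor job fills the hole
```

With BFF the 256-processor job at the head keeps being passed over as long as smaller jobs fill each hole, which is the starvation effect noted above; with FCFS the 128 free processors sit idle instead.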

Key OS System Capabilities

• Swapping / Gang-scheduling
• Job migration / compaction
• Priority preemption
• Backfill
• Disjoint partitions
• Checkpoint / restart
• Dynamically adjustable queue structures

ESP Design Goals & Attributes

• Transferable metric(s) / Valid comparisons
• Reproducible
• Easily interpreted results
• Portable
• Platform size and speed independent
• Capture essence of real workload
• Compact and easily distributed
• Easy to run (< 12 hours)
• Automated / no human intervention
• Focus on utilization / factor out CPU speed
• Test responsiveness & adaptability of scheduler

ESP Design

• Start with throughput test
• Profile of jobs determined by historical accounting data
• Find applications with appropriate size and time
• Use two full configuration jobs to encapsulate change of operational mode (e.g. interactive to batch)
• Submit jobs in three blocks in pseudo-random order
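One way to get a pseudo-random yet reproducible submission order is a seeded shuffle that is then cut into blocks. The sketch below is an assumption about how this could be done, not the ESP scripts themselves: the function name `submission_blocks`, the seed value, and the placeholder job list are all hypothetical, and the two full configuration jobs are not modeled.

```python
import random
from typing import List, Tuple

Job = Tuple[str, int]  # (application name, partition size)

def submission_blocks(jobs: List[Job], n_blocks: int = 3, seed: int = 0) -> List[List[Job]]:
    """Shuffle jobs with a fixed seed and cut the order into n_blocks blocks."""
    rng = random.Random(seed)  # private generator, so the order is repeatable
    order = list(jobs)
    rng.shuffle(order)
    base, extra = divmod(len(order), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)  # spread any remainder over the first blocks
        blocks.append(order[start:start + size])
        start += size
    return blocks

# Placeholder jobmix (hypothetical names and sizes, not the real ESP jobmix).
regular_jobs = [("app_a", 8)] * 4 + [("app_b", 32)] * 3 + [("app_c", 64)] * 2
blocks = submission_blocks(regular_jobs, n_blocks=3, seed=42)
```

Fixing the seed keeps the test reproducible across sites and reruns while still avoiding a hand-tuned order that would favor one scheduler.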

ESP Test Schematic

[Figure: test timeline, total time < 12 hours. Labels: blocks of regular jobs; full config #1 and full config #2, each marked > 10%; shutdown/reboot (optional); vanilla variant (throughput) made up of regular jobs]

Individual Applications in Jobmix

Application   Discipline        Sizes              Count
gfft (FC)     Large FFT         512                2
md            Biology           8,24               4,3
nqclarge      Chemistry         8,16               2,5
paratec       Material Sci      256                1
qcd           Nuclear Physics   128,256            1
scf           Chemistry         32,64              7,10
scfdirect     Chemistry         64,81              7,2
superlu       Linear Alg        8                  15
tlbe          Fusion            16,32,49,64,128    2,6,5,8,1

Jobmix Application Elapsed Times

[Figure: two panels (T3E, SP); y-axis: Time (secs), 0 to 10800; x-axis: Increasing Partition Size]

Platforms Tested

• Cray T3E
  – 512 processors
  – 450 MHz Alpha EV56
  – Microkernel MPP OS
  – NQS & Global Resource Mgr
  – Oversubscription possible
  – BFF strategy w/ dynamic queue configs

• IBM SP
  – 512 processors
  – 200 MHz Power3
  – Semiautonomous monolithic OSes
  – LoadLeveler batch queues
  – FCFS w/ backfill (backfill disabled in 1st attempt)

T3E Chronology (with swap)

[Figure: x-axis: Normalized Time, 0 to 2.2; y-axes: Utilization (%), 0 to 100, and Partition Size, 0 to 576]

Annotations:
• Insufficient work; tail-end dilemma
• Starvation of large jobs

Normalized = Elapsed / Theoretical Min
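The slide does not spell out how the theoretical minimum is computed. A natural reading, taken here as an assumption, is the elapsed time of a perfectly packed schedule: the total processor-seconds of all jobs divided by the machine size. The sketch below uses made-up job data and hypothetical function names.

```python
from typing import List, Tuple

def theoretical_min(jobs: List[Tuple[int, float]], total_procs: int) -> float:
    """Elapsed time if the jobs packed perfectly (assumed definition):
    total processor-seconds divided by the number of processors."""
    return sum(procs * secs for procs, secs in jobs) / total_procs

def normalized_time(elapsed: float, jobs: List[Tuple[int, float]], total_procs: int) -> float:
    """Normalized = Elapsed / Theoretical Min."""
    return elapsed / theoretical_min(jobs, total_procs)

# (partition size, run time in seconds); illustrative values only.
jobs = [(512, 100.0), (256, 200.0), (256, 200.0), (128, 400.0)]

t_min = theoretical_min(jobs, 512)        # 4 * 51200 / 512 = 400.0 seconds
norm = normalized_time(600.0, jobs, 512)  # 600 / 400 = 1.5
```

Under this reading, a normalized time of 1.0 means every processor was busy for the whole run, and the reciprocal (here 1/1.5, about 67%) is the overall utilization.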

T3E Chronology (without swap)

[Figure: x-axis: Normalized Time, 0 to 2.2; y-axes: Utilization (%), 0 to 100, and Partition Size, 0 to 576]

Slight decrease in utilization w/o swap capability

Higher overall efficiency - significant overhead w/ swap

SP Chronology

[Figure: x-axis: Normalized Time, 0 to 2.2; y-axes: Utilization (%), 0 to 100, and Partition Size, 0 to 576]

Annotation: Waiting for machine to idle

Queue Wait Times (normalized)

[Figure: three panels (T3E Swap, T3E NoSwap, SP); y-axis: normalized queue wait time, 0.0 to 2.0; x-axis: Jobs sorted by Partition Size & Submit Time]

BFF - larger jobs = longer wait

FCFS - less dependence on size

Swap permits more simultaneous jobs running = shorter wait times

Idling twice causes 3 distinct regimes of wait times

Restoring Backfill on the SP

• Recognized that backfill is the standard mode for LoadLeveler

• Backfill conflicts with the ESP stipulations

• However… interesting data from the invalid test shot

Backfill Effect I (Chronology)

[Figure: two panels (SP FCFS, SP FCFS w/ backfill); x-axis: Normalized Time, 0 to 2.2; y-axes: Utilization (%), 0 to 100, and Partition Size, 0 to 576]

Highly efficient, but violates test

Need to selectively backfill

Backfill Effect II (Queue Wait Times)

[Figure: two panels (SP FCFS, SP FCFS w/ backfill); y-axis: normalized queue wait time, 0.0 to 2.0]

Backfill and Flaw in ESP test

[Figure: timeline. Labels: FC job submitted; all jobs finish except one; guaranteed FC runtime]

Backfill is working as expected, but a long-running job negates the effect of the reservation time - need finer granularity jobs

Stipulation for FC jobs?
1. Run immediately (possibly premature termination of running jobs): T3E
2. Run after current jobs finish: SP w/ backfill
3. No further jobs launched until FC finishes: SP
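The interaction between backfill and the full configuration (FC) job can be illustrated with a minimal sketch. It assumes a simple reservation-style backfill rule, in which a waiting job may start only if it fits in the idle processors and is expected to finish before the start time reserved for the job at the head of the queue. The names `Job`, `reservation_start`, and `can_backfill` are hypothetical and do not describe LoadLeveler internals.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    procs: int
    est_runtime: float  # user-supplied wall-clock limit, in seconds

def reservation_start(running_ends: List[float], now: float) -> float:
    """An FC job needs every processor, so its reserved start time is the
    end time of the last job still running (or now, if the machine is idle)."""
    return max(running_ends, default=now)

def can_backfill(job: Job, free_procs: int, now: float, reserved_start: float) -> bool:
    """Start a waiting job early only if it fits in the idle processors and
    its estimated end does not pass the reserved start time."""
    return job.procs <= free_procs and now + job.est_runtime <= reserved_start

# Illustrative numbers: all running jobs end early except one long one.
running_ends = [1000.0, 1200.0, 9000.0]
reserved = reservation_start(running_ends, now=0.0)  # 9000.0

# At t = 1200 only the long job is left, yet a one-hour job still backfills.
short_ok = can_backfill(Job(64, 3600.0), 448, 1200.0, reserved)
long_ok = can_backfill(Job(64, 8000.0), 448, 1200.0, reserved)
```

In this toy setting the single long-running job pushes the reservation far into the future, so jobs keep being launched ahead of the FC job. That matches the observation above that finer granularity jobs are needed for the reservation to take effect sooner.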

Further Design Issues

• How to end the test?
• Possible to use backfill (globally or selectively)?
• Can we formulate a turnaround metric?
• Scalability in size and speed
• Finer granularity of jobs cf. overall test
• Perhaps need additional vanilla throughput test to evaluate purely scheduler performance

Conclusions & Observations

• SP - Can achieve very high utilization with backfill and no topology constraints

• SP - Lack of adaptability with dynamic workload - run ASAP mode

• T3E - Swapping with high overhead degrades utilization

• T3E - Can adapt to dynamic workload requirements

Ongoing and Future Work

• Scheduled test run on 512-way Origin 2K & Compaq SC

• Vanilla throughput runs on T3E and SP

• Redesign for next version of ESP

• Distribute ESP to other interested sites
