rm-replay: a high-fidelity tuning, optimization and ... · §rm-replay: a high-fidelity tuning,...

14
RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management Maxime Martinasso Swiss National Supercomputing Centre (CSCS) ADAC 7 2019

Upload: others

Post on 15-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource ManagementMaxime MartinassoSwiss National Supercomputing Centre (CSCS)ADAC 7 2019

Page 2: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Exploring resource manager parametrization of a production system§ HPC center view

§ Goal is to improve the usage of the resource: higher utilization, higher throughput§ What-if scenario:

§ Changing RM parameters§ Introducing new policies§ Updating RM and using new features

§ User point of view§ How long will my job wait in the queue?

§ If I improve my runtime estimate§ If I use this specific queue§ If I use a special constraint

§ Current approaches§ Identify a better configuration? How do you test it?

§ Apply a change between two maintenances, monitor§ Using a simulator?

§ Complexity of a production system: partitions, priorities, node states, reservation, …§ Accuracy of the simulation§ Translate simulation output into decision making

RM-Replay, SC18 2

Page 3: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

New approach: RM-Replay

§ Use the exact same software stack for the RM§ No modification of the code (or very minimal)

§ Use fewer resources than the production system§ Less physical resources: nodes§ Less time: faster-than-real-time clock

§ “Replay” the exact set of actions for a workload:§ Job submissions§ Node availability§ Reservations

§ Allow historical studies, gather statistics on events (end of allocation period, GB)§ Use a well-known interface (the RM itself) by the system engineers

RM-Replay, SC18 3

Page 4: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

How does RM-Replay work?

§ Use a container to recreate the entire resource manager software stack in an isolated environment

§ Inside the same container recreate the set of users (who submit jobs)§ Create an adjustable clock which will replace the system clock and will be used

by the software stack

§ Develop a set of programs to recreate the interaction originating from a workload§ Provide configuration data outside the container to enable portability to other

HPC systems

RM-Replay, SC18 4

Page 5: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Instance of RM-Replay: Slurm-Replay

§ Slurm is a resource manager used on many large HPC systems§ 6 of top 10 in the top500 (November 2017)§ Many features: High scalability and performance, fair-share, reservation, plugins, …

§ Slurm is a complex software:§ Accurate simulation is very difficult§ Integrated simulator modifies code base and the scheduling behavior

§ Event based simulation, capturing all events? Impact on the scheduling?§ Lack of portability

§ Characteristics of Slurm-Replay:§ Use the original and not modified Slurm software stack§ Use configuration parameters from a large HPC production system (Piz Daint)§ Replay production workloads§ Evaluate scheduling metrics: throughput, utilization, waiting time, …

RM-Replay, SC18 5

Page 6: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

How does Slurm-Replay work?

Slurmd

Slurmctld

Clock

counter

Job RunnerStepd

Submitter

Node Controller

Ticker

Slurm

processes

Slurm-Replay

processes

1

2

3

4

5

6

Docker Container

Slurmdbd

7

SQL

database

Clock rate

Workload

trace

Slurm

configuration

files

Database dump:

users, accounts,...

RM-Replay, SC18 6

Page 7: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Workload:

RM-Replay, SC18 7

Time

Nodes

Workloadstart time

Workloadstop time

Originalschedule

Selectedjobs in trace

Workloadend time

submission time

preset job

job in trace

job not in trace

Legend

Jobstart time

Job end time

Elements in a trace:Job={id, submission time, start time, end time, number of nodes, time limit, partition, dependency, priority, qos, reservation name, user, account, state,exit code, eligible time, list of nodes}Reservation={id, name, account, start time, end time, list of nodes}Node state={start time, end time, node name, state, reason}

• Jobs• Job dependencies• Node states• Reservations

Multitenant:• Extract users/groups

Page 8: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Workflow

submitter

•Start submitter•Start node controller

•Reservations are submitted to Slurm•Wait for Replay-clock to start

ticker

•Make Replay-clock progress until all jobshave terminated (Wend time)

collect output

•Create a replay trace from the database•Derive workload metrics•Display Slurm statistics

•Display Slurm and Slurm-Replay errors

Docker container

Wstart time: Workload start timeWend time: Workload end time

Slurm-Replay processScriptsSlurm process

ticker•Set Replay-clock to Wstart time

(no increment)

configure slurm

•Copy Slurm configuration•Update configuration

Slurm

•Start slurmdbd•Start slurmd

•Start slurmctld•Enable frontend and partitions

create slurm database

•Start a SQL database•Create a new Slurm database

•Update content from original database

RM-Replay, SC18 8

(What-if)

Page 9: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Technical solutions and limitations

§ No root privileges inside the container (to use HPC container runtimes)

§ Wrap functions used to impersonate users to always return true:

§ setuid, setgid, chown§ Wrap common C time functions to use a faster clock:

§ sleep, gettimeofday, …, clock counter is in /dev/shm

§ Create and bind mount /etc/passwd and /etc/group from users in the workload

§ Use Slurm Frontend feature to execute only one slurmd daemon for an arbitrary number of nodes

§ Missing data from the Slurm database taken from system logs

§ job dependency, topology, reservation submission time

RM-Replay, SC18 9

Page 10: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Accuracy: makespan

0.05 0.06 0.07 0.08 0.09 0.1 0.15 0.2Replay-clock rate

2860

2880

2900

Mak

espa

n[m

in]

Accuracy of Slurm-Replay for the GPU constraint.

ReferenceMedian

RM-Replay, SC18 10

§ Data distribution over 50 tests§ 24 hours, 6025 jobs: 2664 jobs GPU constraint, 2409 jobs MC constraint, 169 jobs

in other partitions§ 10 reservations and 51 state changes of nodes§ Peak utilization of 97% of GPU nodes and 54% of MC nodes§ Schedule completed in 47.65 hours for GPU and 42.7 hours for MC

0.05 0.06 0.07 0.08 0.09 0.1 0.15 0.2Replay-clock rate

2550

2560

2570

2580

Mak

espa

n[m

in]

Accuracy of Slurm-Replay for the MC constraint.

ReferenceMedian

Page 11: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Performance vs clock rate

62391 day

144542 days

209353 days

280464 days

351625 days

Number of jobs and days

0.0

0.2

0.4

0.6

0.8

1.0

Failu

rera

tio

Clock rate0.050.10.150.2

RM-Replay, SC18 11

0.04

0.05

0.06

0.07

0.08

0.09 0.1 0.1

5 0.2

Replay-clock rate

1

2

3

4

5

Elap

sed

time

[hou

r]

0.0

0.2

0.4

0.6

0.8

1.0

Failu

rera

tioof

Slur

m-R

epla

y

Elapsed time [s]Failure ratio [%]

§ Failure: Slurm-Replay takes too long to process backlog of jobs§ Performance dependent on the underlying CPU frequency§ A clock rate of 0.06 (16.7x faster) seems to be the best option on a 2.1 GHz CPU§ Use only one slurmd daemon

Page 12: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Example of use case

within 10% ofactual runtime

within 50% ofactual runtime

unmodifiedruntime estimation

101

102

103

Slow

dow

n(lo

gsca

le)

Runtimeestimation:

MeanMedian

500

750

1000

1250

Num

bero

faf

fect

edjo

bs

RM-Replay, SC18 12

User: “Will my job be scheduled earlier if I provide a better runtime estimation?”

Slowdown means: how much a user is willing to wait:

!"#$% + !'(')!'(')

There are fewer jobs waiting but no improvement in the waiting time.

Page 13: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Conclusion and future work

§ RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management

§ Useful for testing what-if scenario: parameter, version, feature, policy, …§ Help to take better decision on a production system configuration

§ Slurm-Replay an instance of RM-Replay for Slurm

§ Use Slurm-Replay at CSCS§ Investigate slurmd performance

RM-Replay, SC18 13

https://github.com/eth-cscs/slurm-replay

Page 14: RM-Replay: A High-Fidelity Tuning, Optimization and ... · §RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management §Useful for testing what-if

Thank you for your attention.