Slurm - Linux Clusters Institute

Ryan Cox, Fulton Supercomputing Lab, Brigham Young University (BYU)


Page 1: Slurm

Ryan Cox
Fulton Supercomputing Lab
Brigham Young University (BYU)

Page 2: Slurm Workload Manager

• What is Slurm?
• Installation
• Slurm Configuration
  • Daemons
  • Configuration Files
• Client Commands
• User and Account Management
• Policies, tuning, and advanced configuration
  • Priorities
  • Fairshare
  • Backfill
  • QOS

Page 3: What is Slurm?

• Simple Linux Utility for Resource Management
  • Anything but simple
• Resource manager and scheduler
• Originally developed at LLNL (Lawrence Livermore)
• GPL v2
• Commercial support/development available
  • Core development by SchedMD
  • Other major contributors exist
• Built for scale and fault tolerance
• Plugin-based: lots of plugins to modify Slurm behavior for your needs

Page 4: BYU's Scheduler History

• BYU has run several scheduling systems through its HPC history
• Moab/Torque was the primary scheduling system for many years
• Slurm replaced Moab/Torque as BYU's sole scheduler in January 2013
• BYU has now contributed Slurm patches, some small and some large. Examples:
  • New fair share algorithm: LEVEL_BASED
  • cgroup out-of-memory notification in job output
  • Script to generate a file equivalent to PBS_NODEFILE
  • Optionally charge for CPU equivalents instead of just CPUs (WIP)

Page 5: Terminology

• Partition – A set of nodes (usually a cluster, using the traditional definition of "cluster")
• Cluster – Multiple Slurm "clusters" can be managed by one slurmdbd; one slurmctld per cluster
• Job step – A suballocation within a job
  • E.g. Job 1234 has been allocated 12 nodes. It launches 4 job steps that each run on 3 of the nodes.
  • Similar to subletting an apartment: sublet the whole place or just a room or two
• User – A user ("Bob" has a Slurm "user": "bob")
• Account – A group of users and subaccounts
• Association – The combination of user, account, partition, and cluster
  • A user can be a member of multiple accounts with different limits for different partitions and clusters, etc.

Page 6: Installation

• Version numbers are Ubuntu-style (e.g. 14.03.6)
  • 14.03 == major version, released in March 2014
  • .6 == minor version
• Download official releases from schedmd.com
  • git repo available at http://github.com/SchedMD/slurm
  • Active development occurs at github.com; releases are tagged (git tag)
• Two main methods of installation (see the sketch below)
  • ./configure && make && make install   # and install missing -dev{,el} packages
  • Build RPMs, etc.
• Some distros have a package, usually "slurm-llnl"
  • Version may be behind by a major release or three
  • If you want to patch something, this is the hardest approach
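
A slightly expanded sketch of the source-build route; the --prefix, --sysconfdir, and -j values are illustrative assumptions, not required values:

  # Build from source (paths and parallelism are example values)
  ./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
  make -j16
  make install
  # If configure complains, install the missing -dev/-devel packages it names
  # (e.g. munge, readline, mysql/mariadb client libraries) and re-run it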

Page 7: Installation: Build RPMs

• Set up ~/.rpmmacros with something like this (see the top of slurm.spec for more options):

  ##slurm macros
  %_with_blcr 1
  %_with_lua 1
  %_with_mysql 1
  %_with_openssl 1
  %_smp_mflags -j16
  ##%_prefix /usr/local/slurm

• Copy missing version info from the META file to slurm.spec (grep for META in slurm.spec)
• Let's assume we add the following lines to slurm.spec:

  Name: slurm
  Version: 14.03.6
  Release: 0%{?dist}-custom1

• Assuming RHEL 6, the RPM version will become: slurm-14.03.6-0.el6-custom1
• If the Slurm code is in ./slurm/, do:
  • ln -s slurm slurm-14.03.6-0.el6-custom1
  • tar hzcvf slurm-14.03.6-0.el6-custom1.tgz slurm-14.03.6-0.el6-custom1
  • rpmbuild -tb slurm-14.03.6-0.el6-custom1.tgz
• The *.rpm files will be in ~/rpmbuild/RPMS

Page 8: Configuration: Daemons

• Daemons
  • slurmctld – controller that handles scheduling, communication with nodes, etc.
  • slurmdbd – (optional) communicates with the MySQL database
  • slurmd – runs on a compute node and launches jobs
  • slurmstepd – run by slurmd to launch a job step
  • munged – authenticates RPC calls (https://code.google.com/p/munge/)
    • Install munged everywhere with the same key
• slurmd
  • Hierarchical communication between slurmd instances (for scalability)
• slurmctld and slurmdbd can have primary and backup instances for HA
  • State synchronized through a shared file system (StateSaveLocation)

Page 9: Configuration: Config Files

• Config files are read directly from the node by commands and daemons
• Config files should be kept in sync everywhere
  • Exception: slurmdbd.conf is only used by slurmdbd and contains database passwords
  • DebugFlags=NO_CONF_HASH tells Slurm to tolerate some differences. Everything should be consistent except maybe backfill parameters, etc., that slurmd doesn't need
• Can use "Include /path/to/file.conf" to separate out portions, e.g. partitions, nodes, licenses
• Can configure generic resources with GresTypes=gpu
• man slurm.conf (a minimal sketch of common parameters follows below)
• Easy: http://slurm.schedmd.com/configurator.easy.html
• Almost as easy: http://slurm.schedmd.com/configurator.html
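
A minimal slurm.conf sketch, assuming a single head node; the cluster name, hostnames, paths, and plugin choices are illustrative assumptions, not recommendations:

  # slurm.conf sketch (illustrative only)
  ClusterName=mycluster
  ControlMachine=head1
  AuthType=auth/munge
  SlurmUser=slurm
  StateSaveLocation=/var/spool/slurmctld   # shared between primary/backup slurmctld
  SlurmdSpoolDir=/var/spool/slurmd
  SchedulerType=sched/backfill
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  AccountingStorageType=accounting_storage/slurmdbd
  GresTypes=gpu
  Include /etc/slurm/nodes.conf            # NodeName=... lines
  Include /etc/slurm/partitions.conf       # PartitionName=... lines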

Page 10: Configuration: Gotchas

• SlurmdTimeout – how long slurmctld waits for slurmd to respond before assuming a node is dead and killing its jobs
  • Set it appropriately so file system disruptions and Slurm updates don't kill everything. Ours is 1800 (30 minutes).
• Slurm queries the hardware and configures nodes appropriately... which may not be what you want if you'd rather see Mem=64GB instead of 63.3489 GB
  • Can set FastSchedule=2
• You probably want this: AccountingStorageEnforce=associations,limits,qos
• ulimit values at the time of sbatch get propagated to the job: set PropagateResourceLimits if you don't like that (see the snippet below)
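
A slurm.conf snippet pulling these gotcha-related settings together; the values are examples, not recommendations:

  SlurmdTimeout=1800                                  # 30 minutes; survive brief FS/daemon hiccups
  FastSchedule=2                                      # use the node definitions from slurm.conf, not detected values
  AccountingStorageEnforce=associations,limits,qos    # enforce associations, limits, and QOS access
  PropagateResourceLimits=NONE                        # don't copy the submitting shell's ulimits to the job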

Page 11: Commands

• squeue – view the queue
• sbatch – submit a batch job
• salloc – launch an interactive job
• srun – two uses:
  • Outside of a job – run a command through the scheduler on compute node(s) and print the output to stdout
  • Inside of a job – launch a job step (i.e. suballocation) and print to the job's stdout
• sacct – view job accounting information
• sacctmgr – manage users and accounts, including limits
• sstat – view job step information (I rarely use)
• sreport – view reports about usage (I rarely use)
• sinfo – information on partitions and nodes
• scancel – cancel jobs or steps, send arbitrary signals (INT, USR1, etc.)
• scontrol – list and update jobs, nodes, partitions, reservations, etc.

Page 12: Commands: Read the Manpages

• Slurm is too configurable to cover everything here
• I will share some examples in the next few slides
• New features are added frequently
• squeue now has more output options than A-z (printf style): a new output formatting method was added in 14.11

Page 13: Host Range Syntax

• Host range syntax is more compact, allows smaller RPC calls, makes config files easier to read, etc.
• Node lists have a range syntax with [] using "," and "-"
  • Usable with commands and config files (see the scontrol example below)
  • n[1-10,40-50] and n[5-20] are valid
  • Up to two ranges are allowed: n[1-100]-[1-16]
    • I haven't tried this out recently so the limit may have increased; the manpage still says two
  • Comma-separated lists are allowed: a-[1-5]-[1-2],b-3-[1-16],b-[4-5]-[1-2,7,9]
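
scontrol can expand and compress these ranges, which is handy for scripting; the node names here are made up:

  $ scontrol show hostnames n[1-3,5]
  n1
  n2
  n3
  n5
  $ scontrol show hostlist n1,n2,n3,n5
  n[1-3,5]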

Page 14: Commands: squeue

Want to see all running jobs on nodes n[4-31], submitted by all users in account acctE, using QOS special, with a certain set of job names, in reservation res8, but only show the job ID and the list of nodes the jobs are assigned to, sorted by time remaining and then descending by job ID?

There's a command for that!

squeue -t running -w n[4-31] -A acctE -q special -n name1,name2 -R res8 -o "%.10i %N" -S +L,-i

Way too many options to list here. Read the manpage.

Page 15: Commands: sbatch (and salloc, srun)

• sbatch parses #SBATCH directives in a job script and accepts parameters on the CLI
  • Also parses most #PBS syntax
• salloc and srun accept most of the same options
• LOTS of options: read the manpage
• Easy way to learn/teach the syntax: BYU's Job Script Generator
  • LGPL v3, Javascript, available on Github
  • Slurm and PBS syntax
  • May need modification by your site
  • https://github.com/BYUHPC/BYUJobScriptGenerator

Page 16: Script Generator (1/2)

Page 17: Script Generator (2/2)

• Demo: https://byuhpc.github.io/BYUJobScriptGenerator/
• Code: https://github.com/BYUHPC/BYUJobScriptGenerator/

Page 18: Commands: sbatch (and salloc, srun)

• Short and long versions exist for most options (see the sample script below)
  • -N 2               # node count
  • -n 8               # task count
    • Default behavior is to pack tasks onto as few nodes as possible rather than spreading them out
  • -t 2-04:30:00      # time limit in d-h:m:s, d-h, h:m:s, h:m, or m
  • -p p1              # partition name(s): can list multiple partitions
  • --qos=standby      # QOS to use
  • --mem=24G          # memory per node
  • --mem-per-cpu=2G   # memory per CPU
  • -a 1-1000          # job array
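
A minimal batch script sketch using the options above; the partition, QOS, job name, and program are placeholders for illustration:

  #!/bin/bash
  #SBATCH -N 2                    # 2 nodes
  #SBATCH -n 8                    # 8 tasks total
  #SBATCH -t 04:30:00             # 4.5 hour time limit
  #SBATCH -p p1                   # partition
  #SBATCH --qos=standby           # QOS
  #SBATCH --mem-per-cpu=2G        # memory per CPU
  #SBATCH -J example_job          # job name

  srun ./my_mpi_program           # launch one job step across the allocation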

Page 19: Job Arrays

• Used to submit homogeneous scripts that differ only by an index number
• $SLURM_ARRAY_TASK_ID stores the job's index number (from -a); sample script below
• An individual job looks like 1234_7, i.e. ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
• "scancel 1234" for the whole array or "scancel 1234_7" for just one job in the array
• Prior to 14.11
  • Job arrays are purely for convenience
    • One sbatch call, scancel can work on the entire array, etc.
  • Internally, one job entry is created for each job array entry at submit time
  • Overhead of a job array with 1000 tasks is about equivalent to 1000 individual jobs
• Starting in 14.11
  • A "meta" job is used internally
  • Scheduling code is aware of the homogeneity of the array
  • Individual job entries are created once a job is started
  • Big performance advantage
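
A minimal job array sketch; the input-file naming scheme and program name are assumptions for illustration:

  #!/bin/bash
  #SBATCH -a 1-1000               # array indices 1..1000
  #SBATCH -n 1
  #SBATCH -t 1:00:00

  # Each array task processes a different input file based on its index
  ./process_data input_${SLURM_ARRAY_TASK_ID}.dat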

Page 20: Commands: scontrol

• scontrol can list, set, and update a lot of different things
  • scontrol show job $jobid                  # checkjob equivalent
  • scontrol show node $node
  • scontrol show reservation
  • scontrol <hold|release> $jobid            # hold/release ("uhold" allows the user to release)
• Update syntax:
  • scontrol update JobID=1234 TimeLimit=2-0  # set job 1234 to a 2 day time limit
  • scontrol update NodeName=n-4-5 State=DOWN Reason="cosmic rays"
• Create a reservation:
  • scontrol create reservation reservationname=testres nodes=n-[4,7-10] flags=maint,ignore_jobs,overlap starttime=now duration=2-0 users=root
• scontrol reconfigure                        # reread slurm.conf
• LOTS of other options: read the manpage

Page 21: Resource Enforcement

• Slurm can enforce resource requests through the OS (see the config sketch below)
• CPU
  • task/cgroup – uses the cpuset cgroup (best)
  • task/affinity – pins a task using sched_setaffinity (good, but a user can escape it)
• Memory
  • memory cgroup (best)
  • polling-based enforcement (huge race conditions exist, but much better than nothing; users can escape it)
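
A hedged configuration sketch for cgroup-based enforcement; parameter values are examples, so check the slurm.conf and cgroup.conf manpages for your version:

  # slurm.conf
  TaskPlugin=task/cgroup              # confine tasks with the cpuset cgroup
  JobAcctGatherType=jobacct_gather/cgroup

  # cgroup.conf
  ConstrainCores=yes                  # restrict jobs to their allocated cores
  ConstrainRAMSpace=yes               # restrict jobs to their allocated memory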

Page 22: QOS

• A QOS can be used to:
  • Modify job priorities based on QOS priority
  • Configure preemption
  • Allow access to dedicated resources
  • Override or impose limits
  • Change the "charge rate" (a.k.a. UsageFactor)
• A QOS can have limits: per QOS and per user per QOS
• List existing QOS with: sacctmgr list qos
• Modify: sacctmgr modify qos long set MaxWall=14-0 UsageFactor=10

Page 23: QOS: Preemption

• Preemption is easy to configure: sacctmgr modify qos normal set preempt=standby
• You can set up the following:
  • sacctmgr modify qos high set preempt=normal,low
  • sacctmgr modify qos normal set preempt=low
• GraceTime (optional) guarantees a minimum runtime for preempted jobs
• Use AllowQOS to specify which QOS are allowed to run in each partition (see the slurm.conf sketch below)
• If userbob owns all the nodes in partition "bobpartition":
  • In slurm.conf, set AllowQOS=bobqos,standby on partition "bobpartition"
  • sacctmgr modify user userbob set qos+=bobqos
  • sacctmgr modify qos bobqos set preempt=standby
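
For QOS-based preemption, slurm.conf also needs the preemption plugin enabled; a sketch, where the node list and preempt mode are assumptions to adapt to site policy:

  PreemptType=preempt/qos             # decide preemption from the QOS "preempt" lists above
  PreemptMode=REQUEUE                 # or CANCEL / SUSPEND,GANG depending on site policy
  PartitionName=bobpartition Nodes=bob-[1-8] AllowQOS=bobqos,standby State=UP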

Page 24: User/Account Management

• sacctmgr load/dump can be used as a poor way to implement user management
• Proper use of sacctmgr in a transactional manner is better and allows more flexibility, though you'll need to integrate it with your account creation process, etc. (example sequence below)
• A user can be a member of multiple accounts
• Default account can be specified with sacctmgr (DefaultAccount)
• Fairshare "Shares" can be set to favor/penalize certain users
• Can grant/revoke access to multiple QOS's
• sacctmgr list assoc user=userbob
• sacctmgr list assoc user=userbob account=prof7
  • Filter by user and account
• sacctmgr create user userbob Accounts=prof2 DefaultAccount=prof2 Fairshare=1024
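
A sketch of a typical transactional sequence; the account name, description, organization, and QOS are made-up examples:

  sacctmgr add account prof2 Description="Prof 2 research group" Organization=college7
  sacctmgr create user userbob Accounts=prof2 DefaultAccount=prof2 Fairshare=1024
  sacctmgr modify user userbob set qos+=standby     # grant access to an additional QOS
  sacctmgr list assoc user=userbob format=cluster,account,user,partition,qos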

Page 25: User Limits

• Limits on CPUs, memory, nodes, time limits, allocated CPUs*time, etc. can be set on an association
  • sacctmgr modify user userbob set GrpCPUs=1024
  • sacctmgr modify account prof7 set GrpCPUs=2048
• Account Grp* limits apply to the entire account (the sum of its children)
• Max* limits are usually per user or per job
• Set a limit to -1 to remove it

Page 26: User Limits: GrpCPURunMins

• GrpCPURunMins is a limit on the sum, across an association's running jobs, of (allocated CPUs * time_remaining); worked example below
  • Similar to MAXPS in Moab/Maui
• Staggers the start times of jobs
• Allows more jobs to start as other jobs near completion
• Simulator available:
  • https://marylou.byu.edu/simulation/grpcpurunmins.php
  • Download for your own site (LGPL v3): https://github.com/BYUHPC/GrpCPURunMins-Visualizer
• More info about why we use this: http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html
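
A back-of-the-envelope illustration, using the limit value that appears in the plots on the next page:

  Assume GrpCPURunMins=172800.
  1-core, 7-day jobs: 7 * 24 * 60 = 10080 CPU-minutes remaining each at start,
    so at most floor(172800 / 10080) = 17 such jobs can be running right after they start.
  1-core, 3-day jobs: 3 * 24 * 60 = 4320 CPU-minutes each,
    so up to 172800 / 4320 = 40 can start immediately; more start as the
    remaining time (and therefore the running sum) decays.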

Page 27: GrpCPURunMins plots

[Plots: GrpCPURunMins with 1-core jobs, 7-day walltime, limit=172800; and 1-core jobs, 3-day walltime, limit=172800]

Page 28: Account Coordinator

• An account coordinator can do the following for users and subaccounts under the account:
  • Set limits (CPUs, nodes, walltime, etc.)
  • Modify fairshare "Shares" to favor/penalize certain users
  • Grant/revoke access to a QOS*
  • Hold and cancel jobs
• We set faculty to be account coordinators for their accounts (see the example below)
• End-user documentation: https://marylou.byu.edu/documentation/slurm/account-coord

*Any QOS: http://bugs.schedmd.com/show_bug.cgi?id=824
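
A hedged sketch of granting and exercising coordinator rights; the account, user, and job names are examples, and the exact sacctmgr spelling should be checked against your version's manpage:

  sacctmgr add coordinator account=prof7 names=profbob
  # profbob can now adjust shares and hold/cancel jobs for users in prof7, e.g.:
  sacctmgr modify user student1 where account=prof7 set Fairshare=2048
  scontrol hold 1234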

Page 29: Allocation Management

• BYU does not use it, therefore I don't know much about it
• GrpCPUMins (different from GrpCPURunMins)
  • GrpCPUMins – the total number of CPU minutes that can possibly be used by past, present, and future jobs running from this association and its children
  • Can be reset manually or periodically. See PriorityUsageResetPeriod
• A QOS can have a UsageFactor so you get billed more or less depending on the QOS: e.g. 5.0 for "immediate", 1.0 for "normal", 0.1 for "standby"

Page 30: Job Priorities

• The priority/multifactor plugin uses weights * values
  • priority = sum(configured_weight_int * actual_value_float)
  • Weights are integers and the values themselves are floats (0.0-1.0)
• Available components
  • Age (queue wait time)
  • Fairshare
  • JobSize
  • Partition
  • QOS

Page 31: Job Priorities: Example

• Let's say the weights are:
  • PriorityWeightAge=0
  • PriorityWeightFairshare=10000 (ten thousand)
  • PriorityWeightJobSize=0
  • PriorityWeightPartition=0
  • PriorityWeightQOS=10000 (ten thousand)
• QOS priorities are:
  • high=5
  • normal=2
  • low=0
• userbob (fairshare=0.23) submits a job in QOS "normal" (qos_priority=2):
  • priority = (PriorityWeightFairshare * 0.23) + (PriorityWeightQOS * (2 / MAX(qos_priority)))
  • priority = (10000 * 0.23) + (10000 * (2/5)) = 2300 + 4000 = 6300

Page 32: Backfill

• Can be tuned with SchedulerParameters in slurm.conf. Example:

  SchedulerParameters=bf_max_job_user=20,bf_interval=60,default_queue_depth=15,max_job_bf=8000,bf_window=14400,bf_continue,max_sched_time=6,bf_resolution=1800,defer

• Goal: only backfill a job if it will not delay the start time of any higher-priority job
• So many nice tuning parameters pop up all the time that I can't keep up. See the slurm.conf manpage for SchedulerParameters options

Page 33: Fairshare Algorithms

• Warning: sites have widely varying use cases, so I don't necessarily understand the reason for some of the algorithms
• The priority/multifactor plugin can use different fair share algorithms
• Default (no algorithm override specified with PriorityFlags)
  • Fairshare factor is affected by you vs. your siblings, your parent vs. its siblings, your grandparent vs. its siblings, etc. FSFactor = 2**(-Usage/Shares). Seems to be the most common, but it doesn't work for us
• PriorityFlags=DEPTH_OBLIVIOUS
  • Improves handling of deep and/or unbalanced trees
• PriorityFlags=TICKET_BASED
  • We used it for a while and it mostly worked, but the algorithm itself is flawed
  • LEVEL_BASED recommended as a replacement
• PriorityFlags=LEVEL_BASED
  • Users in an under-served account will always have a higher fair share factor than users in an over-served account. E.g. account "hogs" has higher usage than account "idle"; all users in idle will have a higher FS factor than all users in hogs
  • Available in 14.03 through github.com/BYUHPC/slurm. Used in production at BYU
  • Available upstream in 14.11 (as of 14.11.0pre3)

Page 34: Job Submit Plugin

• Slurm can run a job submit plugin written in Lua
  • Lua looks like pseudo-code and doesn't take long to learn
• The plugin can modify a job's submission based on whatever business logic you want
• Example uses:
  • Allow access to a partition based on the requested CPU count being a multiple of 3
  • Change the QOS to something different based on different factors
  • Output a custom error message, such as "Error! You requested x, y, and z but ..."
• True business logic is possible with this script. It is worth your time to look (see the sketch below)
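
A minimal job_submit.lua sketch, a hedged illustration rather than BYU's plugin; the partition and QOS names are made up, and the job_desc field names should be checked against the job_submit/lua example shipped with your Slurm version:

  -- job_submit.lua sketch (illustrative only)
  function slurm_job_submit(job_desc, part_list, submit_uid)
      -- Example: only allow the "special" partition if the CPU count is a multiple of 3
      if job_desc.partition == "special" and job_desc.min_cpus ~= nil
         and (job_desc.min_cpus % 3) ~= 0 then
          slurm.log_user("Error! Jobs in 'special' must request a multiple of 3 CPUs")
          return slurm.ERROR
      end
      -- Example: route short jobs (time limit in minutes) to a different QOS
      if job_desc.time_limit ~= nil and job_desc.time_limit <= 60 then
          job_desc.qos = "short"
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end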

Page 35: Other Stuff

• Check out SPANK plugins: they run on a node and can do lots of stuff at job start/end events
• Prolog and epilog scripts are available in lots of different forms (job, step, task)

Page 36: User Education

• Slurm (mostly) speaks #PBS and has many wrapper scripts
  • Maybe this is sufficient?
• BYU switched from Moab/Torque to Slurm before notifying users of the change
  • (Yes, we are that crazy. Yes, it worked great for >95% of use cases, which was our target. The other options/commands were esoteric and silently ignored by Moab/Torque anyway)
• Slurm/PBS Script Generator available: github.com/BYUHPC
  • LGPL v3, demo linked to from github
• http://slurm.schedmd.com/tutorials.html
  • "Introduction to Slurm Tools" video is linked from there
• https://marylou.byu.edu/documentation/slurm/commands
• https://marylou.byu.edu/documentation/slurm/account-coord

Page 37: Diagnostics

• Backtraces from core dumps are typically best for crashes
  • Be sure you don't have any ulimit-type restrictions on core dumps
  • For slurmctld: gdb `which slurmctld` /var/log/slurm/core
    • thread apply all bt
• SchedMD is usually able to diagnose problems from backtraces and maybe a few extra print statements they'll ask for
• Each component has its own logging level you can specify in its .conf file
• There are extra debug flags for slurmctld:
  • scontrol setdebugflags +backfill   # and others, like Priority


Page 39: Support

• SchedMD
  • Excellent support from the original developers
  • Bugfixes typically committed to github within a day
• Other support vendors are listed on Slurm's Wikipedia page
  • Usually tied to a specific hardware vendor or part of a larger software installation
• slurm-dev mailing list (http://slurm.schedmd.com/mail.html)
  • You should subscribe
  • "Hand holding" is extremely rare
  • Don't expect to use slurm-dev for support

Page 40: Recommendations

• Requirements documents
  • Don't have your primary scheduler admin write it
  • ...unless the admin can step back and write what you actually need rather than "must have features A, B, and C exactly [even though Slurm may have a better way of accomplishing the same thing]"
• Think: I want Prof Bob and his designated favorite students to have access to his privately owned hardware, but I also want preemptable jobs to run there when they aren't using it. He shouldn't get charged cputime for using his own resources. How should I do that in Slurm?
  • Set AllowQOS=profbob,standby on his partition in slurm.conf
  • sacctmgr create qos profbob UsageFactor=0
  • Then add each user who should have access to the QOS: sacctmgr modify user $user set qos+=profbob

Page 41: Questions?