TRANSCRIPT
Using Supercomputers – Part 1
Pawsey Webinar Series
20 July 2020
What you can do after the full course
• List and describe the major parts of a supercomputer
• Log into a supercomputer
• Explain how a supercomputer is shared among researchers
• Submit a basic job to the supercomputing queue
• Understand and use key supercomputer systems such as schedulers, partitions, nodes, data movers, etc.
• Define and submit job scripts according to your needs
• Find and use available software on a supercomputer
✓ Prerequisite knowledge: Linux
Supercomputing Overview
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Supercomputer examples
#1 Supercomputer Fugaku (RIKEN centre, Japan): 415.5 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from: https://www.riken.jp/en/news_pubs
#2 Summit (Oak Ridge Nat. Lab., USA) : 148.6 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from https://en.wikipedia.org/wiki/Summit_(supercomputer)
Supercomputer examples
#24 Gadi (NCI, Australia) : 9.2 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from https://nci.org.au
Supercomputer examples
Magnus (Pawsey Supercomputing Centre, Australia) : 1 PFlop/s
Visit: https://pawsey.org.au/ and https://tour.pawsey.org.au/
Building blocks
Major Parts of a Supercomputer
High Performance Compute
• Individual compute nodes are "similar" to a high-end workstation
• Compute performance comes from using many processing resources together with fast communication
• There are fast communication channels among components within the node (memory, cpus, gpus)
• Among compute nodes, there is a fast network to transfer data during calculations
"Dragonfly" interconnect for Cray-XC40,Image from: https://pawsey.org.au/systems/magnus/
High Performance Compute
Compute nodes work together in parallel:
• To perform large calculations,
• Or to obtain a faster execution of your code
• Or to perform many different calculations at the same time
High Performance Storage
• Fast storage “inside” the supercomputer
• Analyse very large data sets, high throughput data processing
• Temporary working area
• Usually there is global storage
✓ All nodes can access global storage
✓ Can hold very large data sets
• There might also be local node storage
Logging In
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Remote Access is via Login Nodes
• Remote access to the supercomputer for administrative work:
• Submit jobs
• Manage workflows
• Check results
• Install software
• Many people (~100) share a login node
• Do not run programs on the login nodes!
Remote Access in Practice
• Have terminal program to execute the connection command:
o For Windows, use MobaXterm (download)
o For Linux, use xterm (preinstalled)
o For OS X, use Terminal (preinstalled) or xterm (download)
• Within a terminal window use ssh command: ssh username@hostname
• For example:
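ssh [email protected]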
Exercise: Log in
• Log in to Zeus via ssh:
myLaptop> ssh [email protected]
(You will be asked for your password. Type it carefully. It will not be displayed)
- What are the "message of the day" announcements?
- What is the default directory where you log in?
- Do you already have any files?
• Change directory to the correct file system to perform your work
zeus-1> cd $MYSCRATCH
zeus-1> pwd
zeus-1> ls
zeus-1> echo $MYSCRATCH
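(For reference, $MYSCRATCH typically expands to a path of the form /scratch/projectname/username; the exact path depends on your project and account.)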
Exercise: Log in
• Use git to download the exercise material:
zeus-1> git clone https://github.com/PawseySC/Introductory-Supercomputing
• Change directory to the downloaded directory:
zeus-1> cd Introductory-Supercomputing
• Explore the content:
zeus-1> ls
- What directories and files are in the directory tree?
Common Log in Problems
• Forgot password
Self service reset https://support.pawsey.org.au/password-reset/
• Scheduled maintenance
Check your email or https://support.pawsey.org.au/documentation/display/US/Maintenance+and+Incidents
• Blacklisted due to too many failed login attempts
This is a security precaution.
Contact the helpdesk with your username and the machine you are attempting to log in to
Remote Access while being Secure
• SSH key authentication can increase the security of your account, compared to using a password: https://support.pawsey.org.au/documentation/display/US/Logging+in+with+SSH+keys
• Use a passphrase with SSH keys, so that if your local computer is compromised then your remote accounts are not compromised (a key-setup sketch follows this list)
• Do not share your account! Account sharing violates the Conditions of Use. The project leader can invite others to the project
• Do not provide your password to anyone! Not even in help desk tickets!
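As a minimal sketch of setting up key-based login with standard OpenSSH tools (see the linked Pawsey documentation for centre-specific guidance):
myLaptop> ssh-keygen -t ed25519                 # generate a key pair; choose a passphrase when prompted
myLaptop> ssh-copy-id [email protected]   # copy the public key to the remote host
myLaptop> ssh [email protected]           # subsequent logins authenticate with the key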
Remote Access with a Graphical Interface
• Web-based remote graphical interface (for visualization applications)
https://remotevis.pawsey.org.au/
• For simple GUI applications, can add -X flag to ssh:
ssh -X [email protected]
Sharing Supercomputers
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Need for a scheduler
• Supercomputers are expensive and need to be fully utilised to get best value for money.
• Supercomputers get replaced every 3-5 years and consume electricity whether used or not
• Thus we want them running at maximum capacity, including nights and weekends.
• Supercomputers are shared among many users
• Different job sizes, usage patterns
• Users must wait their turn
Scheduler, queues and partitions
• A scheduler (SLURM, PBS, etc) is a program that manages jobs.
• Users submit jobs with information for the scheduler, and it feeds them into the compute nodes.
• It has several queues of jobs and constantly optimizes computer usage and job completion.
• A partition is a group of nodes with a specific purpose (coloured in the figure)
• Queues are associated with specific partitions
• As a user you interact with the queues
The scheduler allows almost full utilisation
A 19200-cpu supercomputer over a day
Different colours represent different jobs
Note:
• Jobs of different sizes and lengths
• Minimal wastage of resource
Lesson for users: Adapt your workflow!
• Do not wait at your desk for a job to start. Minimise interactivity and automate where possible.
• Always have jobs queued to maximise your own utilisation.
Some SLURM terminology
• At Pawsey Supercomputing Centre we use the scheduler: SLURM.
https://slurm.schedmd.com/
https://support.pawsey.org.au/documentation/display/US/Submitting+and+Monitoring+Jobs
• A SLURM partition is a queue.
• A SLURM cluster is all the partitions that are managed by a single SLURM daemon.
• In the Pawsey Centre there are multiple SLURM clusters, each with multiple partitions.
o The clusters approximately map to systems (e.g. magnus, galaxy, zeus).
o You can submit a job to a partition in one cluster from another cluster.
(This is useful for pre-processing, post-processing or staging data.)
Querying partitions and their status
• To list the partitions in the current machine, use the SLURM command:
sinfo
• To list the partitions of a remote cluster:
sinfo -M remoteClusterToCheck
• To list all partitions in all local clusters:
sinfo -M all
For example:
username@zeus-1:~> sinfo -M magnus
CLUSTER: magnus
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
workq* up 1-1366 1-00:00:00 24 2:12:1 2 idle* nid00[543,840]
workq* up 1-1366 1-00:00:00 24 2:12:1 1 down* nid00694
workq* up 1-1366 1-00:00:00 24 2:12:1 12 reserved nid000[16-27]
workq* up 1-1366 1-00:00:00 24 2:12:1 1457 allocated nid0[0028-0063,
workq* up 1-1366 1-00:00:00 24 2:12:1 8 idle nid0[0193, …
debugq up 1-6 1:00:00 24 2:12:1 4 allocated nid000[08-11]
debugq up 1-6 1:00:00 24 2:12:1 4 idle nid000[12-15]
Partitions of Pawsey Supercomputers
It is important to use the correct system and partition for each part of a workflow:
System Partition Purpose
Magnus workq Large distributed memory jobs
Magnus debugq Debugging and compiling on Magnus
Zeus workq Smaller jobs
Zeus longq For long runtime jobs
Zeus highmemq Jobs with large memory requirements
Zeus debugq Debugging and development jobs
Zeus copyq Data transfer jobs, deleting large amounts of files
Topaz gpuq, gpuq-dev GPU-accelerated jobs
Querying job queues and their status
• The SLURM command squeue displays the status of jobs in different queues
squeue
squeue -u username
squeue -p queueToQuery
charris@zeus-1:~> squeue
JOBID USER ACCOUNT PARTITION NAME EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY
2358518 jzhao pawsey0149 longq SNP_call_zytho z119 R None Ystday 11:56 Thu 11:56 3-01:37:07 1 1016
2358785 askapops askap copyq tar-5182 hpc-data3 R None 09:20:35 Wed 09:20 1-23:01:09 1 3332
2358782 askapops askap copyq tar-5181 hpc-data2 R None 09:05:13 Wed 09:05 1-22:45:47 1 3343
2355496 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Priority Tomorr 01:53 Wed 01:53 1-00:00:00 2 1349
2355495 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Resources Tomorr 01:52 Wed 01:52 1-00:00:00 4 1356
2358214 yyuan pawsey0149 workq runGet_FQ n/a PD Priority 20:19:00 Tomorr 20:19 1-00:00:00 1 1125
2358033 yyuan pawsey0149 gpuq 4B_2 n/a PD AssocMaxJo N/A N/A 1-00:00:00 1 1140
2358709 pbranson pawsey0106 workq backup_RUN19_P n/a PD Dependency N/A N/A 1-00:00:00 1 1005
Understanding squeue Output
JOBID -> unique jobID. Very important for identifying your job.
NAME -> job name. Set this if you have lots of jobs.
ST -> job state. R=running. PD=pending.
REASON -> the reason the job is not running
• Dependency – the job must wait for another to complete before it can start
• Priority – a higher priority job exists
• Resources – the job is waiting for sufficient resources
• AssocMaxJobs – the user has reached the maximum number of jobs that can run in that queue
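As a quick illustration (using squeue's state filter; check squeue --help for the options available on your system), you can list only your pending jobs with:
zeus-1> squeue -u $USER -t PD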
Exercise: Query the queues
• Execute squeue and explore the output
zeus-1> squeue
zeus-1> squeue -u $USER
zeus-1> squeue -p debugq
- What is the largest job running? In which queue?
- What is the largest job running in the debugq?
- Do you have any jobs running?
- Can you query the jobs of another user?
Sharing the Supercomputer with Project Allocations
• All projects receive an allocation of compute time. Units are “core hours” or “service units”, not a fixed fraction of the supercomputer.
• Allocations are typically for 12 months
• At Pawsey, allocations are divided evenly between the four quarters of the year, to avoid end-of-year congestion. Priorities reset at the start of the quarter for all allocations
• The job priority in the queue is affected by the following
• usage relative to allocation (priority decreases as the allocation is used up)
• length of time in queue (priority increases with time)
• size of request (priority increases with size)
Monitoring your Project Allocation Usage
• Allocation usage can be checked using the pawseyAccountBalance tool:
module load pawseytools
pawseyAccountBalance -p projectname -u
charris@magnus-2:~> pawseyAccountBalance -p pawsey0001 -u
Compute Information
-------------------
Project ID Allocation Usage % used
---------- ---------- ----- ------
pawsey0001 250000 124170 49.7
--mcheeseman 119573 47.8
--mshaikh 2385 1.0
--maali 1109 0.4
--bskjerven 552 0.2
--ddeeptimahanti 292 0.1
Submitting Jobs
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Scheduling and managing your jobs
All Pawsey supercomputers use SLURM to schedule jobs and manage queues
The three essential SLURM commands are:
sbatch jobScriptFileName
squeue
scancel jobID
Every successful submission gets a unique identifier (jobID)
username@zeus-1:~> sbatch jobscript.slurm
Submitted batch job 2315399
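The returned jobID is what you pass to scancel if you need to stop the job, for example:
username@zeus-1:~> scancel 2315399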
• sbatch is a SLURM command that interprets directives in the jobScript
• A jobScript is a bash or csh script.
• It contains important information for the scheduler:
o what resources the job needs
o how long for
o what to do with the resources
• And, of course, it also contains the series of commands you want to execute
• Overestimating the time required means it will take longer to find available resources. Underestimating the time required means the job will get killed before completion.
Scheduling and managing your jobs
• Information for the scheduler is given by Directive lines starting with #SBATCH.
• (Although this information can also be given to sbatch as command-line arguments.)
• Directives are usually more convenient and reproducible than command-line arguments. Put your resource request within the jobScript!
Scheduling and managing your jobs
Common resource request directives
#SBATCH --job-name=myjob -> makes it easier to find in squeue
#SBATCH --account=pawsey0001 -> project for accounting
#SBATCH --nodes=2 -> number of nodes
#SBATCH --tasks-per-node=4 -> processes (or tasks) per node
#SBATCH --cpus-per-task=6 -> cores per process (or task)
#SBATCH --time=00:05:00 -> walltime requested
#SBATCH --partition=debugq -> queue (or partition) for the job
• (From the Linux perspective, #SBATCH directives are treated as comments in the script, so only the subsequent commands are executed.)
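As an illustration of the command-line alternative mentioned above, the same request could be passed directly to sbatch (equivalent to the directives, assuming the jobscript is named jobscript.slurm):
sbatch --job-name=myjob --account=pawsey0001 --nodes=2 --tasks-per-node=4 --cpus-per-task=6 --time=00:05:00 --partition=debugq jobscript.slurm
Note that this example requests 2 nodes × 4 tasks × 6 cores per task, i.e. 24 cores per node and 48 cores in total.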
First example jobScript
#!/bin/bash -l
#SBATCH --job-name=hostname
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
# print compute node host name
for i in $(seq 1 20); do
date
echo "The hostname is:"
hostname
sleep 15s
done
Note on reservations
• A SLURM reservation dedicates nodes for a particular purpose within a specific timeframe, with constraints that may be different from the standard partitions.
• To use a reservation:
sbatch --reservation=reservation-name myscript
• Or in your jobscript:
#SBATCH --reservation=reservation-name
• To check the reservation:
sinfo -T
• Nodes can only be reserved for a certain time by the
system administrators (like for this training).
• Only ask for a reservation if you cannot work via the standard
queues (as for a once-off urgent deadline).
Output given by the scheduler
• Standard output and standard error messages from your jobScript are collected by SLURM and written to a file in the directory from which you submitted the job
• By default, the output file is named: slurm-jobID.out
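For example, the job submitted earlier as jobID 2315399 would produce a file named slurm-2315399.out.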
Exercise: Hostname (1)
• Submit the job with sbatch:
zeus-1> cd hostname
zeus-1> sbatch --reservation=courseq hostname.slurm
• Use squeue to see if it is in the queue:
zeus-1> squeue -u userName
What is the status of the job?
• Examine the slurm-jobID.out file:
zeus-1> cat slurm-jobID.out
Which node did the job run on?
Is there any error message in the output file?
More information about jobs with scontrol
• The scontrol SLURM command provides high-level information on the jobs that are being executed:
scontrol show job jobID
• For example:
charris@magnus-1:~> scontrol show job 2474075
JobId=2474075 JobName=m2BDF2
UserId=tnguyen(24642) GroupId=tnguyen(24642) MCS_label=N/A
Priority=7016 Nice=0 Account=pawsey0199 QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=03:13:09 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=12 Dec 2017 EligibleTime=12 Dec 2017
StartTime=10:41:04 EndTime=Tomorr 10:41 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=workq AllocNode:Sid=magnus-2:53310
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nid0[0041-0047,0080-0082,0132-0133,0208-0219,0224-0226,0251-0253,0278-0279,0284-0289,0310-0312,0319,0324-0332,0344,0349-0350,0377-0379,0385-0387,0484-0503,0517-0520,0525-0526,0554-0573,0620-0628,0673-0686,0689-0693,0732,0894-0895,0900-0907,1036-1037,1048-1051,1134-1138,1202-1203,1295-1296,1379-1380,1443-1446,1530-1534]
BatchHost=mom1
NumNodes=171 NumCPUs=4104 NumTasks=171 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4104,mem=5601960M,node=171
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1365M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/pawsey0199/tnguyen/run_test_periodicwave/stiff_problem/forMagnus/4thOrder/accuracy_check/eta_1/PeriodicBCs/BDF2/m2/gpc.sh
Exercise: Hostname (2)
• Check information given by scontrol
zeus-1> squeue -u userName
zeus-1> scontrol show job jobID
Is the information about the execution node there too?
• Cancel the job if it is still running:
zeus-1> squeue -u userName
zeus-1> scancel jobID
• Check the output file again and see if there is any new message due to the cancellation
zeus-1> cat slurm-jobID.out
Common error messages in slurm-JobID.out
• REMEMBER: the SLURM output file is the first place to check if you feel that something went wrong with your job. Always check your output file!
• Segmentation fault errors are often related to insufficient memory:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
slurmstepd: error: *** STEP 4677820.0 ON nid00846 CANCELLED AT 2020-02-20T11:34:34 ***
• Requested time was not enough:
slurmstepd: error: *** JOB 4677822 ON nid00849 CANCELLED AT 2020-02-20T11:34:34 DUE TO TIME LIMIT***
• Exceeded memory limit:
slurmstepd: error: Job 4645817 exceeded memory limit (69083856 > 65011712), being killed
slurmstepd: error: *** JOB 4645817 ON nid00488 CANCELLED AT 2018-07-30T17:04:02 ***
• In the "knowledge base" of our documentation there is a section of troubleshooting articles that explain how to solve these and other problems:
https://support.pawsey.org.au/documentation/display/US/Knowledge+Base
https://support.pawsey.org.au/documentation/display/US/Troubleshooting+articles
Exercise: Hostname (3)
• Now submit the job to the debugq partition
zeus-1> sbatch --partition=debugq hostname.slurm (failed)
Did the scheduler allow you to submit your job?
What is the problem?
• The problem is that the courseq reservation is in the workq partition, and the script still contains the --reservation directive:
zeus-1> cat hostname.slurm
• Remove the reservation from the script, or define a "null" reservation from the command line:
zeus-1> sbatch --partition=debugq --reservation="" hostname.slurm
zeus-1> squeue -u $USER
More information about jobs with sacct
• The sacct SLURM command provides high-level information on the jobs that have been executed:
sacct
• There are many arguments; for example, you can query the execution node with:
sacct -X --format=JobID,Nodelist
• Other commonly used options are:
-j jobID                 displays information about the specified jobIDs
-u userName              displays jobs for this user
-A projectname           displays jobs from this project account
-S yyyy-mm-ddThh:mm:ss   displays jobs after this start time
-E yyyy-mm-ddThh:mm:ss   displays jobs before this end time
-X                       only show statistics for the whole job and not substeps
• For example:
charris@magnus-1:~> sacct -a -A pawsey0001 -S 2017-12-01 -E 2017-12-02 -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2461157 bash debugq pawsey0001 24 COMPLETED 0:0
2461543 bubble512 debugq pawsey0001 24 FAILED 1:0
2461932 bash workq pawsey0001 24 FAILED 2:0
2462029 bash workq pawsey0001 24 FAILED 127:0
2462472 bash debugq pawsey0001 24 COMPLETED 0:0
2462527 jobscript+ workq pawsey0001 960 COMPLETED 0:0
Further sources of information
• Pawsey Supercomputing documentation:
https://support.pawsey.org.au/documentation/display/US/Supercomputing+Documentation
https://support.pawsey.org.au/documentation/display/US/Using+Supercomputers
https://support.pawsey.org.au/documentation/display/US/Knowledge+Base
• SchedMD (SLURM) documentation:
https://slurm.schedmd.com/