TRANSCRIPT
Using Supercomputers – Part 1
Pawsey Webinar Series
20 July 2020
What you can do after the full course
• List and describe the major parts of a supercomputer
• Log into a supercomputer
• Explain how a supercomputer is shared among researchers
• Submit a basic job to the supercomputing queue
• Understand and use key supercomputer systems such as schedulers, partitions, nodes, data movers, etc.
• Define and submit job scripts according to your needs
• Find and use available software on a supercomputer
✓ Prerequisite knowledge: Linux
Supercomputing Overview
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Supercomputer examples
#1 Supercomputer Fugaku (RIKEN centre, Japan): 415.5 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from: https://www.riken.jp/en/news_pubs
#2 Summit (Oak Ridge Nat. Lab., USA) : 148.6 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from https://en.wikipedia.org/wiki/Summit_(supercomputer)
Supercomputer examples
#24 Gadi (NCI, Australia) : 9.2 PFlop/s
(Ranking from https://www.top500.org/lists/top500/2020/06/)
Picture from https://nci.org.au
Supercomputer examples
Magnus (Pawsey Supercomputing Centre, Australia) : 1 PFlop/s
Visit: https://pawsey.org.au/ and https://tour.pawsey.org.au/
Building blocks
Major Parts of a Supercomputer
High Performance Compute
• Individual compute nodes are "similar" to a high-end workstation
• Compute performance comes from using many processing resources together with fast communication
• There are fast communication channels among components within the node (memory, cpus, gpus)
• Among compute nodes, there is a fast network to transfer data during calculations
"Dragonfly" interconnect for Cray-XC40,Image from: https://pawsey.org.au/systems/magnus/
High Performance Compute
Compute nodes work together in parallel:
• To perform large calculations,
• Or to obtain a faster execution of your code
• Or to perform many different calculations at the same time
High Performance Storage
• Fast storage “inside” the supercomputer
• Analyse very large data sets, high throughput data processing
• Temporary working area
• Usually there is global storage
✓ All nodes can access global storage
✓ Can hold very large data sets
• There might also be local node storage
Logging In
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Remote Access is via Login Nodes
• Remote access to the supercomputer for administrative work:
• Submit jobs
• Manage workflows
• Check results
• Install software
• Many people (~100) share a login node
• Do not run programs on the login nodes!
Remote Access in Practice
• Have terminal program to execute the connection command:
o For Windows, use MobaXterm (download)
o For Linux, use xterm (preinstalled)
o For OS X, use Terminal (preinstalled) or xterm (download)
• Within a terminal window use ssh command: ssh username@hostname
• For example:
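ssh [email protected]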
Exercise: Log in
• Log in to Zeus via ssh:
myLaptop> ssh [email protected]
(You will be asked for your password. Type it carefully. It will not be displayed)
- What are the "message of the day" announcements?
- What is the default directory where you log in?
- Do you already have any files?
• Change directory to the correct file system to perform your work
zeus-1> cd $MYSCRATCH
zeus-1> pwd
zeus-1> ls
zeus-1> echo $MYSCRATCH
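(For reference, $MYSCRATCH typically expands to a path of the form /scratch/projectname/username; the exact path depends on your project and account.)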
Exercise: Log in
• Use git to download the exercise material:
zeus-1> git clone https://github.com/PawseySC/Introductory-Supercomputing
• Change directory to the downloaded directory:
zeus-1> cd Introductory-Supercomputing
• Explore the content:
zeus-1> ls
- What directories and files are in the directory tree?
Common Log in Problems
• Forgot password
Self service reset https://support.pawsey.org.au/password-reset/
• Scheduled maintenance
Check your email or https://support.pawsey.org.au/documentation/display/US/Maintenance+and+Incidents
• Blacklisted due to too many failed login attempts
This is a security precaution.
Contact the helpdesk with your username and the machine you are attempting to log in to
Remote Access while being Secure
• SSH key authentication can increase the security of your account, compared to using a password: https://support.pawsey.org.au/documentation/display/US/Logging+in+with+SSH+keys
• Use a passphrase with SSH keys, so that if your local computer is compromised then your remote accounts are not compromised (a key-setup sketch follows this list)
• Do not share your account! Account sharing violates the Conditions of Use. The project leader can invite others to the project
• Do not provide your password to anyone! Not even in help desk tickets!
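As a minimal sketch of setting up key-based login with standard OpenSSH tools (see the linked Pawsey documentation for centre-specific guidance):
myLaptop> ssh-keygen -t ed25519                 # generate a key pair; choose a passphrase when prompted
myLaptop> ssh-copy-id [email protected]   # copy the public key to the remote host
myLaptop> ssh [email protected]           # subsequent logins authenticate with the key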
Remote Access with a Graphical Interface
• Web-based remote graphical interface (for visualization applications)
https://remotevis.pawsey.org.au/
• For simple GUI applications, can add -X flag to ssh:
ssh -X [email protected]
Sharing Supercomputers
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Need for a scheduler
• Supercomputers are expensive and need to be fully utilised to get best value for money.
• Supercomputers get replaced every 3-5 years and consume electricity whether used or not
• Thus we want them running at maximum capacity, including nights and weekends.
• Supercomputers are shared among many users
• Different job sizes, usage patterns
• Users must wait their turn
Scheduler, queues and partitions
• A scheduler (SLURM, PBS, etc) is a program that manages jobs.
• Users submit jobs with information for the scheduler, and it feeds them into the compute nodes.
• It has several queues of jobs and constantly optimizes computer usage and job completion.
• A partition is a group of nodes with a specific purpose (coloured in the figure)
• Queues are associated with specific partitions
• As a user you interact with the queues
The scheduler allows almost full utilisation
A 19200-cpu supercomputer over a day
Different colours represent different jobs
Note:
• Jobs of different sizes and lengths
• Minimal wastage of resource
Lesson for users: Adapt your workflow!
• Do not wait at your desk for a job to start. Minimise interactivity and automate where possible.
• Always have jobs queued to maximise your own utilisation.
Some SLURM terminology
• At Pawsey Supercomputing Centre we use the scheduler: SLURM.
https://slurm.schedmd.com/
https://support.pawsey.org.au/documentation/display/US/Submitting+and+Monitoring+Jobs
• A SLURM partition is a queue.
• A SLURM cluster is all the partitions that are managed by a single SLURM daemon.
• In the Pawsey Centre there are multiple SLURM clusters, each with multiple partitions.
o The clusters approximately map to systems (e.g. magnus, galaxy, zeus).
o You can submit a job to a partition in one cluster from another cluster.
(This is useful for pre-processing, post-processing or staging data.)
Querying partitions and their status
• To list the partitions in the current machine, use the SLURM command:
sinfo
• To list the partitions of a remote cluster:
sinfo -M remoteClusterToCheck
• To list all partitions in all local clusters:
sinfo -M all
For example:
username@zeus-1:~> sinfo -M magnus
CLUSTER: magnus
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
workq* up 1-1366 1-00:00:00 24 2:12:1 2 idle* nid00[543,840]
workq* up 1-1366 1-00:00:00 24 2:12:1 1 down* nid00694
workq* up 1-1366 1-00:00:00 24 2:12:1 12 reserved nid000[16-27]
workq* up 1-1366 1-00:00:00 24 2:12:1 1457 allocated nid0[0028-0063,
workq* up 1-1366 1-00:00:00 24 2:12:1 8 idle nid0[0193, …
debugq up 1-6 1:00:00 24 2:12:1 4 allocated nid000[08-11]
debugq up 1-6 1:00:00 24 2:12:1 4 idle nid000[12-15]
Partitions of Pawsey Supercomputers
It is important to use the correct system and partition for each part of a workflow:
System Partition Purpose
Magnus workq Large distributed memory jobs
Magnus debugq Debugging and compiling on Magnus
Zeus workq Smaller jobs
Zeus longq For long runtime jobs
Zeus highmemq Jobs with large memory requirements
Zeus debugq Debugging and development jobs
Zeus copyq Data transfer jobs, deleting large amounts of files
Topaz gpuq, gpuq-dev GPU-accelerated jobs
Querying job queues and their status
• The SLURM command squeue displays the status of jobs in different queues
squeue
squeue -u username
squeue -p queueToQuery
charris@zeus-1:~> squeue
JOBID USER ACCOUNT PARTITION NAME EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY
2358518 jzhao pawsey0149 longq SNP_call_zytho z119 R None Ystday 11:56 Thu 11:56 3-01:37:07 1 1016
2358785 askapops askap copyq tar-5182 hpc-data3 R None 09:20:35 Wed 09:20 1-23:01:09 1 3332
2358782 askapops askap copyq tar-5181 hpc-data2 R None 09:05:13 Wed 09:05 1-22:45:47 1 3343
2355496 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Priority Tomorr 01:53 Wed 01:53 1-00:00:00 2 1349
2355495 pbranson pawsey0106 gpuq piv_RUN19_PROD n/a PD Resources Tomorr 01:52 Wed 01:52 1-00:00:00 4 1356
2358214 yyuan pawsey0149 workq runGet_FQ n/a PD Priority 20:19:00 Tomorr 20:19 1-00:00:00 1 1125
2358033 yyuan pawsey0149 gpuq 4B_2 n/a PD AssocMaxJo N/A N/A 1-00:00:00 1 1140
2358709 pbranson pawsey0106 workq backup_RUN19_P n/a PD Dependency N/A N/A 1-00:00:00 1 1005
Understanding squeue Output
JOBID -> unique jobID. Very important for identifying your job.
NAME -> job name. Set this if you have lots of jobs.
ST -> job state. R=running. PD=pending.
REASON -> the reason the job is not running
• Dependency – the job must wait for another to complete before it can start
• Priority – a higher priority job exists
• Resources – the job is waiting for sufficient resources
• AssocMaxJobs – the user has reached the maximum number of jobs that can run in that queue
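As a quick illustration (using squeue's state filter; check squeue --help for the options available on your system), you can list only your pending jobs with:
zeus-1> squeue -u $USER -t PD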
Exercise: Query the queues
• Execute squeue and explore the output
zeus-1> squeue
zeus-1> squeue -u $USER
zeus-1> squeue -p debugq
- What is the largest job running? In which queue?
- What is the largest job running in the debugq?
- Do you have any jobs running?
- Can you query the jobs of another user?
Sharing the Supercomputer with Project Allocations
• All projects receive an allocation of compute time. Units are “core hours” or “service units”, not a fixed fraction of the supercomputer.
• Allocations are typically for 12 months
• At Pawsey, allocations are divided evenly between the four quarters of the year, to avoid end-of-year congestion. Priorities reset at the start of the quarter for all allocations
• The job priority in the queue is affected by the following
• usage relative to allocation (priority decreases as the allocation is used up)
• length of time in queue (priority increases with time)
• size of request (priority increases with size)
Monitoring your Project Allocation Usage
• Allocation usage can be checked using the pawseyAccountBalance tool:
module load pawseytools
pawseyAccountBalance -p projectname -u
charris@magnus-2:~> pawseyAccountBalance -p pawsey0001 -u
Compute Information
-------------------
Project ID Allocation Usage % used
---------- ---------- ----- ------
pawsey0001 250000 124170 49.7
--mcheeseman 119573 47.8
--mshaikh 2385 1.0
--maali 1109 0.4
--bskjerven 552 0.2
--ddeeptimahanti 292 0.1
Submitting Jobs
• Supercomputing Overview
• Logging In
• Sharing Supercomputers
• Submitting Jobs
• Using High Performance Storage
• Job Scripts For Different Applications
• Deciding How Big to Scale
• Getting Help
Scheduling and managing your jobs
All Pawsey supercomputers use SLURM to schedule jobs and manage queues
The three essential SLURM commands are:
sbatch jobScriptFileName
squeue
scancel jobID
Every successful submission gets a unique identifier (jobID)
username@zeus-1:~> sbatch jobscript.slurm
Submitted batch job 2315399
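The returned jobID is what you pass to scancel if you need to stop the job, for example:
username@zeus-1:~> scancel 2315399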
• sbatch is a SLURM command that interprets directives in the jobScript
• A jobScript is a bash or csh script.
• It contains important information for the scheduler:
o what resources the job needs
o how long for
o what to do with the resources
• And, of course, it also contains the series of commands you want to execute
• Overestimating the time required means it will take longer to find available resources. Underestimating the time required means the job will get killed before completion.
Scheduling and managing your jobs
• Information for the scheduler is given by Directive lines starting with #SBATCH.
• (Although this information can also be given to sbatch as command-line arguments.)
• Directives are usually more convenient and reproducible than command-line arguments. Put your resource request within the jobScript!
Scheduling and managing your jobs
Common resource request directives
#SBATCH --job-name=myjob -> makes it easier to find in squeue
#SBATCH --account=pawsey0001 -> project for accounting
#SBATCH --nodes=2 -> number of nodes
#SBATCH --tasks-per-node=4 -> processes (or tasks) per node
#SBATCH --cpus-per-task=6 -> cores per process (or task)
#SBATCH --time=00:05:00 -> walltime requested
#SBATCH --partition=debugq -> queue (or partition) for the job
• (From the Linux perspective, #SBATCH directives are treated as comments in the script, so only the subsequent commands are executed.)
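As an illustration of the command-line alternative mentioned above, the same request could be passed directly to sbatch (equivalent to the directives, assuming the jobscript is named jobscript.slurm):
sbatch --job-name=myjob --account=pawsey0001 --nodes=2 --tasks-per-node=4 --cpus-per-task=6 --time=00:05:00 --partition=debugq jobscript.slurm
Note that this example requests 2 nodes × 4 tasks × 6 cores per task, i.e. 24 cores per node and 48 cores in total.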
First example jobScript
#!/bin/bash -l
#SBATCH --job-name=hostname
#SBATCH --reservation=courseq
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
# print compute node host name
for i in $(seq 1 20); do
date
echo "The hostname is:"
hostname
sleep 15s
done
Note on reservations
• A SLURM reservation dedicates nodes for a particular purpose within a specific timeframe, with constraints that may be different from the standard partitions.
• To use a reservation:
sbatch --reservation=reservation-name myscript
• Or in your jobscript:
#SBATCH --reservation=reservation-name
• To check the reservation:
sinfo -T
• Nodes can only be reserved for a certain time by the
system administrators (like for this training).
• Only ask for a reservation if you cannot work via the standard
queues (as for a once-off urgent deadline).
Output given by the scheduler
• Standard output and standard error messages from your jobScript are collected by SLURM and written to a file in the directory from which you submitted the job
• By default, the output file is named: slurm-jobID.out
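For example, the job submitted earlier as jobID 2315399 would produce a file named slurm-2315399.out.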
Exercise: Hostname (1)
• Submit the job with sbatch:
zeus-1> cd hostname
zeus-1> sbatch --reservation=courseq hostname.slurm
• Use squeue to see if it is in the queue:
zeus-1> squeue -u userName
What is the status of the job?
• Examine the slurm-jobID.out file:
zeus-1> cat slurm-jobID.out
Which node did the job run on?
Is there any error message in the output file?
More information about jobs with scontrol
• The scontrol SLURM command provides high-level information on the jobs that are being executed:
scontrol show job jobID
• For example:
charris@magnus-1:~> scontrol show job 2474075
JobId=2474075 JobName=m2BDF2
UserId=tnguyen(24642) GroupId=tnguyen(24642) MCS_label=N/A
Priority=7016 Nice=0 Account=pawsey0199 QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=03:13:09 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=12 Dec 2017 EligibleTime=12 Dec 2017
StartTime=10:41:04 EndTime=Tomorr 10:41 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=workq AllocNode:Sid=magnus-2:53310
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nid0[0041-0047,0080-0082,0132-0133,0208-0219,0224-0226,0251-0253,0278-0279,0284-0289,0310-0312,0319,0324-0332,0344,0349-0350,0377-0379,0385-0387,0484-0503,0517-0520,0525-0526,0554-0573,0620-0628,0673-0686,0689-0693,0732,0894-0895,0900-0907,1036-1037,1048-1051,1134-1138,1202-1203,1295-1296,1379-1380,1443-1446,1530-1534]
BatchHost=mom1
NumNodes=171 NumCPUs=4104 NumTasks=171 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4104,mem=5601960M,node=171
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1365M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/pawsey0199/tnguyen/run_test_periodicwave/stiff_problem/forMagnus/4thOrder/accuracy_check/eta_1/PeriodicBCs/BDF2/m2/gpc.sh
Exercise: Hostname (2)
• Check information given by scontrol
zeus-1> squeue -u userName
zeus-1> scontrol show job jobID
Is the information about the execution node there too?
• Cancel the job if it is still running:
zeus-1> squeue -u userName
zeus-1> scancel jobID
• Check the output file again and see if there is any new message due to the cancellation
zeus-1> cat slurm-jobID.out
Common error messages in slurm-JobID.out
• REMEMBER: the SLURM output file is the first place to check if you feel that something went wrong with your job. Always check your output file!
• Segmentation fault errors are often related to insufficient memory:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
slurmstepd: error: *** STEP 4677820.0 ON nid00846 CANCELLED AT 2020-02-20T11:34:34 ***
• Requested time was not enough:
slurmstepd: error: *** JOB 4677822 ON nid00849 CANCELLED AT 2020-02-20T11:34:34 DUE TO TIME LIMIT***
• Exceeded memory limit:
slurmstepd: error: Job 4645817 exceeded memory limit (69083856 > 65011712), being killed
slurmstepd: error: *** JOB 4645817 ON nid00488 CANCELLED AT 2018-07-30T17:04:02 ***
• In the "knowledge base" of our documentation there is a section of troubleshooting articles that explain how to solve these and other problems:
https://support.pawsey.org.au/documentation/display/US/Knowledge+Base
https://support.pawsey.org.au/documentation/display/US/Troubleshooting+articles
Exercise: Hostname (3)
• Now submit the job to the debugq partition
zeus-1> sbatch --partition=debugq hostname.slurm (failed)
Did the scheduler allow you to submit your job?
What is the problem?
• The problem is that the courseq reservation is in the workq partition, and the script still contains the --reservation directive:
zeus-1> cat hostname.slurm
• Remove the reservation from the script, or define a "null" reservation from the command line:
zeus-1> sbatch --partition=debugq --reservation="" hostname.slurm
zeus-1> squeue -u $USER
More information about jobs with sacct
• The sacct SLURM command provides high-level information on the jobs that have been executed:
sacct
• There are many arguments; for example, you can query the execution node with:
sacct -X --format=JobID,Nodelist
• Other commonly used options are:
-j jobID                 displays information about the specified jobIDs
-u userName              displays jobs for this user
-A projectname           displays jobs from this project account
-S yyyy-mm-ddThh:mm:ss   displays jobs after this start time
-E yyyy-mm-ddThh:mm:ss   displays jobs before this end time
-X                       only show statistics for the whole job and not substeps
• For example:
charris@magnus-1:~> sacct -a -A pawsey0001 -S 2017-12-01 -E 2017-12-02 -X
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2461157 bash debugq pawsey0001 24 COMPLETED 0:0
2461543 bubble512 debugq pawsey0001 24 FAILED 1:0
2461932 bash workq pawsey0001 24 FAILED 2:0
2462029 bash workq pawsey0001 24 FAILED 127:0
2462472 bash debugq pawsey0001 24 COMPLETED 0:0
2462527 jobscript+ workq pawsey0001 960 COMPLETED 0:0
Further sources of information
• Pawsey Supercomputing documentation:
https://support.pawsey.org.au/documentation/display/US/Supercomputing+Documentation
https://support.pawsey.org.au/documentation/display/US/Using+Supercomputers
https://support.pawsey.org.au/documentation/display/US/Knowledge+Base
• SchedMD (SLURM) documentation:
https://slurm.schedmd.com/