Using Lisa · UvA · 2012-06-18
TRANSCRIPT
Using Lisa
Q: What is Lisa?
A: Lisa is a compute cluster: 512 nodes, 4480 cores
Q: Operating system?
A: Linux
Q: Software?
A: standard Debian (the basis of Ubuntu), plus a few hundred packages
Q: Can I use Lisa?
A: Affiliates of the UvA use Lisa for free
The Lisa cluster
Login nodes and batch nodes
Lisa consists of
- 2 login nodes (lisa.sara.nl)
- 500+ batch nodes for running jobs
- GPU-equipped cluster: 8 nodes, 16 GPUs

All nodes share the same home file system and system file systems (SLOW).
All nodes are equipped with a scratch disk (FAST).
Contents of this tutorial
What is a job
Modules
Software available
Job scripts: create, submit
Status of jobs
Efficient usage of the system
Logging in to the Lisa system
Log in to the system: lisa.sara.nl
For this course: gpu.sara.nl
Login: sdemo001 .. sdemo0050
Password: see printout
Change password: passwd
What is a job
A job consists of two parts:
- 1. the part describing what kind of job this is:
  - amount of wallclock time needed
  - number of nodes needed
  - some extras
- 2. the part describing what this job should do: a shell script

#PBS -lnodes=1:mem24gb
#PBS -lwalltime=200

date
cd $HOME/workdir
echo "3 + 4" | bc
echo "end of job"
How to submit a job
Create this file, called 'job1':

#PBS -lnodes=1          # number of nodes
#PBS -lwalltime=200     # job can take 200 seconds walltime
date
echo "3 + 4" | bc
echo "end of job"

Type to submit this job:

qsub job1
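The same job file can also be created from the command line; a minimal sketch (the qsub call is shown commented out so the snippet also runs off the cluster):

```shell
# Create the job script 'job1' with a here-document.
cat > job1 <<'EOF'
#PBS -lnodes=1
#PBS -lwalltime=200
date
echo "3 + 4" | bc
echo "end of job"
EOF

# On a Lisa login node you would now submit it:
# qsub job1
```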
File systems on Lisa
- Home file system: NFS, 200 GB/user, accessible from all nodes, SLOW
- /scratch file system ($TMPDIR): local disk, 70-240 GB, accessible on the node itself, cleaned after the job
- /archive file system: for storing large amounts of seldom-used data, accessible from the login nodes, slow
Archive file system
Location: (user elvis) /archive/elvis
Consists of disks and tapes.
For storage of seldom-used data. Do not use it for storing many small files; tar them first. Example, where data is a directory:
tar zcvf /archive/data.tgz data
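A full round trip can be rehearsed with a small test directory (using the current directory in place of /archive, which only exists on Lisa; the directory name data and its contents are made up for the example):

```shell
# Make a small example data directory.
mkdir -p data
echo "sample" > data/file1.txt

# Pack it into a single compressed tarball
# (on Lisa: tar zcvf /archive/data.tgz data).
tar zcf data.tgz data

# Later, unpack it again:
tar zxf data.tgz
```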
See https://www.sara.nl/systems/lisa/filesystems
Software available
Complete unix suite:
grep, awk, perl, python, gcc, gfortran, java, ...
If you are missing something: let us know: [email protected]
Many extra packages, see http://sara.nl/systems/lisa/software
These extra packages are made available by the modules mechanism
If you need something extra, let us know: [email protected]
Modules what and why?
What:
- define environment for using or developing software
Why:
- when installing in a standard place (/usr/local/bin), root permission is required. Disaster when using flaky installation scripts
- what to do when more than one version is required?
- different packages with the same name
- and more ...
Modules
Purpose: make software available by defining the environment (PATH, ...)
Example: type
plink
module list
module load plink
module list
plink
module unload plink
module list
plink
Module commands
module load name       # activate module name
module unload name     # deactivate module name
module list            # which modules are active
module avail ['xyz*']  # which modules are available
module display name    # show contents of module

name examples:

plink              # the default, in general the newest version
plink/1.02         # request a specific version
openmpi            # openmpi for Intel compilers
openmpi/gnu        # for GNU compilers
openmpi/intel/1.3  # specific version for Intel compilers
Most used module commands
Developing code, use Intel compilers:
module load c fortran
and the libraries for usage with Intel compilers:
module load fftw3 mkl
In general: to run a program, compiled with Intel compilers,the corresponding module must also be loaded.
module load fortran
./intelfortranprog
Compiler wrappers
On Lisa, when you call for example gcc, a wrapper is called instead. You need only specify the -l flag; the -L and -I flags are generated automatically:

module load fortran fftw3
ifort myprog.f90 -lfftw3

The compiler wrappers are implemented as a module. If undesired:

module unload compilerwrappers
Create job scripts
A job script is a simple text file.
Normally it is created using an editor.
When many jobs are needed:
- create a script (written in bash, R, matlab, python, ...) that creates and submits lots of jobs
- use disparm
- use array jobs
Typical job script
#PBS -lnodes=1:mem24gb:cores8
#PBS -lwalltime=5:00:00

module load fortran
cd $HOME/workdir
echo "start of this job"
some-command some parameters
echo "end of this job"

-lnodes=n          : request n nodes
:mem24gb           : nodes must have 24 GB memory
:cores8            : nodes must contain 8 cores
:cores12           : nodes must contain 12 cores
-lwalltime=5:00:00 : job cannot take more than 5 hours walltime
What happens after qsub
- 1. the #PBS lines are evaluated
- 2. a copy of the job script is made
- 3. the job is put in the job queue
- 4. when the requested nodes are available:
  - a. nodes are allocated exclusively to the job
  - b. the copied job script is started on the first node, as if a login were done
  - c. the job ends after the job script has ended, or when the wallclock limit is exceeded
  - d. stdout and stderr are sent to the directory the job was submitted from
Following your jobs
showq [-u loginname]
qstat [-u loginname]
What happens on the nodes
pbs_jobmonitor 6651183   # job number from showq/qstat
pbs_joblogin 6651183     # logs you in on the node
When will my job start
Maui scheduler
- first in, first out, with backfill
- priority adapted according to fair-share: group and user
- measures to prevent monopolization of the system
- see https://www.sara.nl/systems/shared/usage/maui-explained
How to favor my job
Specify a sane walltime. Example: the program will run for one hour; specify -lwalltime=1:30:00
How to get my work done in minimum of time
- use all cores in your node
- see above
- use an efficient program
Deleting a job
Find the job number using showq or qstat.
qdel jobnumber
Efficient jobs
Efficiency: finish your computations in the minimum amount of wall clock time
- use all computing power in a node (8 or 12 cores)
- try to arrange that your jobs are scheduled early
Monitor your jobs:
pbs_jobmonitor jobnumber
pbs_joblogin jobnumber
Tasks: 201 total, 3 running, 198 sleeping, 0 stopped, 0 zombie
Cpu(s): 56.8%us, 9.2%sy, 6.2%ni, 27.6%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16473116k total, 3117048k used, 13356068k free, 32k buffers
Swap: 3999736k total, 43420k used, 3956316k free, 1954080k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
19689 wva   20  0  560m 472m 107m R  102  2.9 1:18.99 Alpino.bin
19671 wva   20  0  565m 476m 107m R  100  3.0 1:28.93 Alpino.bin
  625 wva   20  0 13380 1484 1244 S    0  0.0 0:00.00 bash
  794 wva   20  0 13392 1120  860 S    0  0.0 0:00.00 bash
  816 wva   20  0 34764 4832 2136 S    0  0.0 0:00.26 ssh
  818 wva   20  0 47036  16m 3988 S    0  0.1 0:00.99 python2.7
  819 wva   20  0 46424  15m 3988 S    0  0.1 0:01.01 python2.7
  820 wva   20  0 46632  16m 3988 S    0  0.1 0:01.06 python2.7
  821 wva   20  0 47220  16m 3988 S    0  0.1 0:01.34 python2.7
  822 wva   20  0 52352  21m 3988 S    0  0.1 0:01.58 python2.7
  823 wva   20  0 49604  19m 3988 S    0  0.1 0:01.56 python2.7
  824 wva   20  0 48432  17m 3988 S    0  0.1 0:01.27 python2.7
  825 wva   20  0 48004  17m 3988 S    0  0.1 0:01.33 python2.7
  826 wva   20  0 36588 7856 3144 S    0  0.0 0:02.66 python2.7
19670 wva   20  0 11104 1420 1192 S    0  0.0 0:00.00 Alpino
19688 wva   20  0 11104 1420 1192 S    0  0.0 0:00.00 Alpino
Inefficient job: 2 active processes
Tasks: 239 total, 2 running, 237 sleeping, 0 stopped, 0 zombie
Cpu(s): 44.1%us, 1.7%sy, 0.0%ni, 53.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24733148k total, 13307084k used, 11426064k free, 40k buffers
Swap: 3999736k total, 24212k used, 3975524k free, 502784k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
20459 qyin  20  0 12.0g  11g  864 R  101 50.7 5353:21 Yin
20307 qyin  20  0 13380 1484 1244 S    0  0.0 0:00.00 bash
20458 qyin  20  0 13384  896  644 S    0  0.0 0:00.00 bash
Efficient job?
Tasks: 195 total, 9 running, 186 sleeping, 0 stopped, 0 zombie
Cpu(s): 61.6%us, 1.4%sy, 0.0%ni, 36.7%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24735448k total, 1188560k used, 23546888k free, 44k buffers
Swap: 3999736k total, 21420k used, 3978316k free, 889296k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
11767 sdebeer  20  0  179m  17m 5720 R  102  0.1 134:45.59 md_mpi
11761 sdebeer  20  0  179m  19m 7088 R  100  0.1 134:50.98 md_mpi
11762 sdebeer  20  0  179m  18m 7168 R  100  0.1 134:49.04 md_mpi
11763 sdebeer  20  0  179m  19m 7284 R  100  0.1 134:51.89 md_mpi
11765 sdebeer  20  0  179m  18m 6736 R  100  0.1 134:49.32 md_mpi
11768 sdebeer  20  0  179m  17m 5940 R  100  0.1 134:39.08 md_mpi
11764 sdebeer  20  0  179m  18m 6576 R   98  0.1 134:45.93 md_mpi
11766 sdebeer  20  0  179m  17m 5692 R   98  0.1 134:49.17 md_mpi
11598 sdebeer  20  0 13380 1476 1236 S    0  0.0   0:00.00 bash
11749 sdebeer  20  0 13404  960  688 S    0  0.0   0:00.00 bash
11755 sdebeer  20  0 52772 2296 1624 S    0  0.0   0:00.59 mpiexec
Efficient MPI job
Tasks: 244 total, 2 running, 242 sleeping, 0 stopped, 0 zombie
Cpu(s): 42.3%us, 4.2%sy, 0.0%ni, 53.2%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24733148k total, 1244256k used, 23488892k free, 32k buffers
Swap: 3999736k total, 21808k used, 3977928k free, 868152k cached

  PID USER      PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9097 kazaryan  20  0  430m 116m 3428 R 1194  0.5 366:23.28 dscf_smp
 9058 kazaryan  20  0  9152 1388 1112 S    0  0.0   0:00.00 dscf
19258 kazaryan  20  0 13448 1556 1248 S    0  0.0   0:00.02 bash
19669 kazaryan  20  0 13456 1076  752 S    0  0.0   0:00.00 bash
27769 kazaryan  20  0  9580 1840 1136 S    0  0.0   0:00.48 jobex
Efficient?
Tasks: 204 total, 9 running, 195 sleeping, 0 stopped, 0 zombie
Cpu(s): 60.8%us, 14.6%sy, 9.8%ni, 14.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24735448k total, 856716k used, 23878732k free, 32k buffers
Swap: 3999736k total, 21516k used, 3978220k free, 520852k cached

  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
11599 msnoek  20  0 33216 6008 1716 R  102  0.0 1254:59 FKMC3d_smart_en
11600 msnoek  20  0 33216 6004 1716 R  100  0.0 1254:44 FKMC3d_smart_en
11602 msnoek  20  0 33216 6008 1716 R  100  0.0 1254:38 FKMC3d_smart_en
11603 msnoek  20  0 33216 6004 1712 R  100  0.0 1254:38 FKMC3d_smart_en
11604 msnoek  20  0 33216 6004 1712 R  100  0.0 1254:23 FKMC3d_smart_en
11605 msnoek  20  0 33216 6000 1712 R  100  0.0 1254:56 FKMC3d_smart_en
11601 msnoek  20  0 33216 6008 1716 R   98  0.0 1254:20 FKMC3d_smart_en
11606 msnoek  20  0 33216 6004 1712 R   98  0.0 1254:37 FKMC3d_smart_en
11412 msnoek  20  0 13396 1500 1240 S    0  0.0 0:00.00 bash
11598 msnoek  20  0 13404  924  648 S    0  0.0 0:00.00 bash
Efficient: a number of processes running in parallel
How to create such a nice job?
principle of multi-process job
#PBS -lnodes=1:cores8 -lwalltime=1:00:00

cd $HOME/workdir
some_program 1 >out1 2>err1 &    # 8 processes in the background
some_program 2 >out2 2>err2 &
some_program 3 >out3 2>err3 &
some_program 4 >out4 2>err4 &
some_program 5 >out5 2>err5 &
some_program 6 >out6 2>err6 &
some_program 7 >out7 2>err7 &
some_program 8 >out8 2>err8 &
wait                             # wait until the background processes have ended
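The pattern can be tried locally with a toy command standing in for some_program; this hypothetical sketch starts four background processes and waits for all of them:

```shell
# Run 4 toy "computations" in the background,
# each with its own output and error file.
for i in 1 2 3 4; do
    echo "result $i" > out$i 2> err$i &
done

# wait blocks until every background process has finished.
wait
echo "all done"
```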
Multi-line commands in background
cd $TMPDIR
( cp $HOME/input1 .
  my_program input1 > output1
  cp output1 $HOME
) &
( cp $HOME/input2 .
  my_program input2 > output2
  cp output2 $HOME
) &
...
wait
Hmmm.. lots of lines to write
Automatic generation of jobs
You can use any language: bash, C, Python, Perl, …to generate job scripts and submit them.
#!/usr/bin/python
import os
for i in range(100):
    f = open("tmpjob", "w")
    print >>f, "#PBS -lnodes=1 -lwalltime=1:00:00"
    print >>f, "#PBS -Jjob" + str(i)
    print >>f, "cd $HOME/workdir"
    for j in range(8):
        k = 8*i + j
        sk = str(k)
        print >>f, "./myprog parm" + sk + " >out." + sk + " 2>err." + sk + " &"
    print >>f, "wait"
    f.close()
    #os.system("qsub tmpjob")
    os.system("cat tmpjob")
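The same generator can be sketched in bash (myprog is a hypothetical program name; cat stands in for qsub so the script can be tried outside the cluster):

```shell
# Generate 3 job scripts, each running 8 background processes.
for i in 0 1 2; do
    {
        echo "#PBS -lnodes=1 -lwalltime=1:00:00"
        echo "cd \$HOME/workdir"
        for j in 0 1 2 3 4 5 6 7; do
            k=$((8*i + j))
            echo "./myprog parm$k >out.$k 2>err.$k &"
        done
        echo "wait"
    } > tmpjob$i
    # On Lisa you would submit with: qsub tmpjob$i
    cat tmpjob$i
done
```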
Array jobs
Submit the same job many times:
qsub -t 4-23 job
The same job will be submitted 20 times; job numbers will look like 6788900-4 .. 6788900-23.
In the job, the environment variable PBS_ARRAYID is available, here ranging from 4 to 23.
Deleting all jobs:
qdel 6788900
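Inside an array job, PBS_ARRAYID typically selects the work item. A minimal sketch (the variable is set by hand here so the snippet runs outside the batch system; the input-file naming is made up for the example):

```shell
# Under 'qsub -t', PBS_ARRAYID is set by the batch system;
# simulate one array member here.
PBS_ARRAYID=4

# Use the array index to pick this task's input file.
infile="input.$PBS_ARRAYID"
echo "this job would process $infile"
```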
Using the scratch disk
Home file system is terribly slow
If a job accesses the home file system frequently,everybody on the system suffers!
Remedy:
cp infile $TMPDIR            # copy input files to scratch
cd $TMPDIR
myprog infile outfile        # let your program read from scratch
cp outfile $HOME/datadir     # copy the output file to home
Note: scratch disk is cleaned after job
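The pattern can be rehearsed locally, with mktemp standing in for the per-job $TMPDIR that the batch system provides and tr standing in for the real program:

```shell
# On Lisa, $TMPDIR points at the node's scratch disk; simulate it locally.
TMPDIR=$(mktemp -d)

# Stage the input to scratch, work there, copy the result back.
echo "input data" > infile          # stand-in for an existing input file
cp infile "$TMPDIR"
cd "$TMPDIR"
tr 'a-z' 'A-Z' < infile > outfile   # stand-in for: myprog infile outfile
cp outfile "$OLDPWD"                # on Lisa: cp outfile $HOME/datadir
cd "$OLDPWD"
```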
DISPARM
What? a string server
Why? to facilitate managing large numbers of jobs
How? https://www.sara.nl/systems/lisa/software/disparm
- create file with strings to be used as parameters (my_parmfile) - module load disparm - disparm -i my_parmfile -p my_pool
Now the file my_pool is filled with the lines from my_parmfile.
disparm -n    # gets one line from the pool, and marks this line
disparm -r    # marks the previously extracted line as ready
Disparm example
Create a parameter file. For example:
seq 1000 > myparms
Create a pool file:
module load disparm
disparm -c -i myparms -p mypool
Try it:
disparm -n -p mypool
echo $DISPARM_VALUE
disparm -r -p mypool
disparm -n -p mypool
echo $DISPARM_VALUE
Disparm example 2
disparm -s -p mypool
disparm -n -p mypool
disparm -r -p mypool
disparm -s -p mypool
Disparm ensures that every line is produced once, and that a line marked as 'ready' will not be produced again.

Advanced: using the -m flag, you can specify that disparm may produce the same line more than once. This is useful in case a process runs into a time limit or some other mishap.
Disparm example 3A job using disparm:
#PBS -lnodes=1 -lwalltime=1:00:00
module load disparm
ncores=`sara-get-num-cores`              # use all cores available
for ((i=1; i<=ncores; i++)) ; do
  (
    for ((j=1; j<=10; j++)) ; do         # run myprogram 10 times after each other in each thread
      disparm -n -p mypool               # get a new line
      if [ "$DISPARM_RC" != "OK" ]; then # check if it looks OK
        break
      fi
      myprogram $DISPARM_VALUE           # call my program
      disparm -r -p mypool               # mark the line as ready
    done
  ) &
done
wait
https://www.sara.nl/systems/lisa/software/disparm
Submit 20 jobs:

qsub -t 1-20 disparmjob
Disparm summary
- use all cores in all kinds of nodes
- only one, relatively simple, job script required
- no job-generating script necessary
- automatic load balancing possible
Summary
- The Lisa system
- jobs: what they are
- software
- module environment
- efficient jobs
  - use as many cores as possible
  - create many efficient jobs
  - disparm