TRANSCRIPT
High Performance Computing with the Cluster „Elwetritsch“ - II
Course instructor: Dr. Josef Schüle, RHRK
RHRK-Seminar
Overview
Course I
Login to cluster SSH
RDP / NX
Desktop Environments GNOME (default)
other desktops
Linux Basics Terminal / Shell
file systems
Tips & Tricks
Course II
Cluster Basics Hardware
Software
Brainware
Batch system submitting jobs
Monitoring / Accounting
AHRP Projects
Working on MOGON /Mainz
Overview
Course III
Batch system SLURM Array jobs
Multi-core Jobs
SMP
MPI
Hybrid
job resources accelerators
file systems
Further questions, ideas for the course?
Send an E-Mail to [email protected]
Cluster Basics
Hardware
• Nodes
• Network
• File systems
Software
• System Software
• Module
• Batch system
Brainware
Hardware
nodes
• login nodes (headx): interactively available
• compute nodes: not interactively available
• special nodes: different GPU versions/numbers, XEON Phi 5110P, 1 TB RAM
networks
• Infiniband QDR
• Ethernet 1/10 Gbit/s
file systems
• /scratch: Fraunhofer BeeGFS (322 TB)
• /home: NetApp-NFS for HOME (10 TB)
• /work: small files, on request
• /tmp: short-time temporary storage
https://elwe.rhrk.uni-kl.de/elwetritsch/ressourcen.shtml
Software (1)
System software: Scientific Linux 7.x, periodical updates
Modules: software installed from source -> /software; available in several versions -> software modules; command: module (a short usage sketch follows below). Is something missing? E-Mail to [email protected]
Batch system: SLURM; jSLURM – graphical user interface
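As a brief, hedged illustration of the module command (the package names and versions below are examples only; the actual list on Elwetritsch comes from module avail):

    # list all software modules installed under /software
    module avail
    # load a specific version of a package (name/version illustrative)
    module load gcc/8.2.0
    # show what is currently loaded, and remove a module again
    module list
    module unload gcc/8.2.0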
Software (2)
Compiler & Tools Intel Cluster Suite
Portland PGI
GNU Compiler Collection
CUDA Toolkit & GPU Programming SDK
Totalview Parallel Debugger
Various developer environments
Libraries Intel Math Kernel Library (MKL)
AMD Core Math Library (ACML)
MPI Intel MPI
Open MPI
IBM Platform MPI
Brainware
Competence in using the cluster at its best is important for HPC.
Don't hesitate to contact us and ask questions (Hotline).
Direct personal questions may be posed to Dr. Josef Schüle, Dr. Markus Hillenbrand, Sven Daxinger
... if you need help with
• installing programs
• selecting the best programs
• compiling your programs
• using the batch system (selecting parameters and resources)
• developing your own software (OpenMP, MPI, CUDA)
• writing scripts to support your work
Batch System
Job Management
• submission
• query
• ending
Monitoring
Accounting
jSLURM
Wording (1)
Cluster: a number of servers to be seen as a unit. Bundling of compute power; resources and load are balanced. Commands: sinfo, sacct, sprio, sshare
Job: work unit executed by the batch system; typically a command or command sequence with options and arguments. The batch system schedules it for execution, controls it and logs it. Commands: squeue, sbatch
Wording (2)
Jobslot: smallest execution unit. On Elwetritsch a jobslot corresponds to a CPU core. The batch system organizes the usage of the servers. Commands: sinfo, squeue, rhrk-showqos
States for jobs: PEND, RUN, DONE, EXIT, WAIT, PSUSP, USUSP, SSUSP, POST_DONE, POST_ERR. Command: sacct
Queue: sorts jobs according to their resource requirements. After submission, jobs are placed into waiting queues and remain waiting until all required resources are available (see the job script sketch below). Commands: squeue, sbatch
Scheduling: algorithm to select and start the next jobs. Fairshare scheduling: each user gets a fair share - according to project priority - of the cluster resources
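To make submission concrete, here is a minimal, hedged job script sketch; the resource values, file names, module and program name are illustrative and not Elwetritsch defaults:

    #!/bin/bash
    #SBATCH --job-name=example         # name shown in squeue
    #SBATCH --ntasks=1                 # request one jobslot (CPU core)
    #SBATCH --mem=2000M                # requested memory
    #SBATCH --time=01:00:00            # maximum walltime
    #SBATCH --output=example-%j.out    # output file, %j is the job id

    # load the required software environment (module name illustrative)
    module load gcc/8.2.0

    # run the program
    ./my_program

Such a script would be submitted with sbatch (e.g. sbatch example.sh) and then waits in a queue until the requested resources become available.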
Command      Purpose
salloc       request interactive job allocations (better: rhrk-launch)
sbatch       submit a batch script
scancel      cancel a job or job step
scontrol     manage jobs and query information
sinfo        retrieve information about partitions, reservations and nodes
sprio        show job priorities
srun         initiate job steps from within a job
sshare       display fair-share information for each user
sstat        status information about a running job
sacct        accounting information (better: rhrk-accounting)
sacctmgr     query accounting information
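A short, hedged usage sketch of the most common commands; the job id 12345 and the script name are made up:

    # submit a batch script; SLURM replies with the assigned job id
    sbatch example.sh
    # list your own pending and running jobs
    squeue -u $USER
    # show detailed information about one job
    scontrol show job 12345
    # cancel the job if it is no longer needed
    scancel 12345
    # accounting information after the job has finished
    sacct -j 12345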
Further information: SLURM - overview, SLURM course
Monitoring
cockpit ELP
jSLURM - GUI for SLURM
Developed at RHRK
report errors
further improvements?
more applications?
AHRP
General
• history
• aims
• tasks
Projects
Interaction with MOGON
founded April 2010 by TU Kaiserslautern and Johannes Gutenberg Universität Mainz
aims
• coordinating HPC activities
• provisioning of state-of-the-art HPC resources for research in RLP
tasks
• install, maintain and purchase resources in common
• coordinated teaching and support for HPC
• provisioning of 15% of the resources for research to all in RLP, load balancing among the sites
• application-based requests for resources
Overview
Projects
three steps: fill in the form, wait for the review, follow the instructions sent to you
project sizes
• XS: < 5 NE / month
• S: < 30 NE / month
• M: < 100 NE / month
• L: < 500 NE / month
• XL: > 500 NE / month
NE = CPU cores * h / 1000
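A hedged example with made-up numbers: a project that keeps 64 cores busy around the clock for a 30-day month uses 64 * 24 * 30 / 1000 ≈ 46 NE and would therefore apply as a size M project (< 100 NE / month).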
Link: http://www.ahrp.info
Interaction with Mogon
Mogon I in Mainz consists of AMD systems with 64 cores each: more cores in total, but each core less powerful; source code has to be recompiled (see the sketch below).
Interaction will again be realized with a common batch system (previously LSF).
Usage: the premise is an AHRP project; E-Mail to [email protected] with a short reasoning.
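As a hedged sketch of what recompiling for the AMD nodes can look like (compiler module and flags are illustrative and depend on the installed toolchain):

    # load a compiler on the Mogon side (module name/version illustrative)
    module load gcc/8.2.0
    # rebuild for the local AMD micro-architecture instead of reusing
    # binaries built for the Intel nodes of Elwetritsch
    gcc -O2 -march=native -o my_program my_program.c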
Thank You
High Performance Computing on Elwetritsch
Part II