Guillimin HPC Users Meeting - May 2016
Guillimin HPC Users Meeting
May 12, 2016
McGill University / Calcul Québec / Compute Canada
Montréal, QC Canada
Outline
• Compute Canada News
• System Status
• Software Updates
• Training News
• Special Topic
  • How to Build Your Own Modules
Compute Canada News
• HPCS 2016 - Edmonton - June 19th to 22nd
  – Registration open: http://canheit-hpcs.ualberta.ca/
System Status
• Completed: Scheduled Downtime for Maintenance
  – From: Friday April 22 at 8:00 AM
  – Target for return to service: evening Saturday April 23
  – Maintenance:
    • On ETS campus electrical network
    • On Guillimin network (network switch firmware updates)
  – System Access Restoration
    • Login Nodes: April 24
    • Storage: April 24 for /sb and /lb, but not /gs (*)
    • Batch System: April 25 with temporary scratch
    • Switch firmware updates have improved stability of network access to various services within our Virtual Machine (VM) hosting environment
Storage Status
• Summary: /gs file system issues (April 14 to May 6)
  – The 3 PB /gs file system is built from 5 individual storage building blocks
  – Storage building blocks provide redundancy at the hardware and software layers so as to handle failures, but it is not possible to protect against 100% of failures
  – Starting April 14:
    • A set of hardware issues triggered an I/O storm that exposed several bugs within the software layer of GPFS
    • This resulted in potential metadata corruption and I/O errors when attempting to access many files on /gs
    • The extent of the damage to metadata and the ability to repair the corruption were initially unknown
    • Although metadata is replicated, there was a risk of data loss
Storage Status
• Timeline of Events (April 14 to May 6)
  – April 14: storm of I/O errors within one building block
  – April 17: hardware I/O errors caused corruption of a portion of /gs metadata; parts of files are unavailable for read and/or write access
  – April 22: Site Maintenance Period begins
  – April 24: /gs kept offline pending root cause analysis and availability of software patches (efixes) that correctly handle the I/O errors and attempt to fix metadata
  – May 2: “surgery” to manually repair metadata and apply the efixes - successful!
  – May 3 - 6: stress testing reveals problematic disk drawers and a module that are subsequently replaced
  – May 6: /gs file system back online
Storage Status
• Current status of /gs (as of May 12)
  – Since May 6, all system elements have been behaving well and as expected
  – Metadata successfully restored with no data loss
  – Monday May 9, 10:00 to 10:06 am
    • New storm of I/O errors observed in 1 of 5 building blocks
    • GPFS hardware and updated firmware and software cleanly handled these errors
    • Root cause analysis traces the problem source to a faulty server HCA (host channel adapter) card
    • The HCA will be replaced in the next few days with no interruption of access to /gs
Storage Status
• Reminder of Storage Policies and Practices
  – All storage is operated with redundancy at the hardware and software layers - but not 100% guaranteed!
  – Guillimin Storage File Systems
    • Home: small files, codes, backed up nightly
    • Scratch: temporary I/O files, no backup
    • Project: any project files, either temporary or longer term; no backup by default without a RAC award
  – To mitigate the loss of important data, we continue to recommend that, whenever possible, copies of important data be kept off-site
Storage Status
• Reminder about temporary scratch space on /sb:
  – The cleanup of /sb/scratch spaces will start on May 18
  – Make sure to move all necessary data to $SCRATCH (/gs/scratch/$USER) or to a project space
  – Also, make sure your job scripts are using $SCRATCH instead of “/sb/scratch/$USER”
• Partition /gs is almost full:
  – 96% used
  – 129TB left (as of May 11)
  – We are currently moving some cold data to tapes
    • Metadata remains on disks
    • Users can still access their files, but with significant latency
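One way to check job scripts for the old scratch location is sketched below. It is only an illustration: the function name find_old_scratch_refs and the two demo scripts are made up, and the search runs on a throwaway directory rather than on real job scripts.

```shell
#!/usr/bin/env bash
# Sketch: list files under a directory that still mention the temporary
# /sb/scratch location. The function name is hypothetical.
find_old_scratch_refs() {
    local search_dir="${1:-.}"
    # -r: recurse, -l: print only file names, -F: fixed string (no regex).
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -rlF "/sb/scratch/" "$search_dir" || true
}

# Demonstration on a throwaway directory with two made-up job scripts.
demo_dir="$(mktemp -d)"
printf '#!/bin/bash\ncd /sb/scratch/$USER\n' > "$demo_dir/old_job.sh"
printf '#!/bin/bash\ncd $SCRATCH\n' > "$demo_dir/new_job.sh"

matches="$(find_old_scratch_refs "$demo_dir")"
echo "$matches"
```

In practice the function would be pointed at the directory holding your own job scripts; any file it lists still needs its path changed to $SCRATCH.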
Software Update
• About the Lmod/EasyBuild based module structure:
  – Default since March 22, 2016
  – New Lua syntax (presented in special topic)
• New Installations
  – Total Academic Headcount (TAH) Matlab license file (for McGill users only): R201{2a, 4a, 6a}
  – MATLAB Distributed Computing Server (MDCS) license file (for all users, 64 licenses): R201{2, 3, 4, 5}{a, b}
  – GraphicsMagick/1.3.21 (intel/2015b, iomkl/2015b)
  – argtable/2.13 (foss/2015b, iomkl/2015b)
  – VTK/6.3.0-Python-2.7.10 (iomkl/2015b)
Training News
• See “Training and Outreach” at www.hpc.mcgill.ca for our calendar of training and workshops for 2016 and for links to registration pages
• Upcoming events: calculquebec.eventbrite.ca
  • May 17 - Introduction au nuage de Calcul Canada [Introduction to the Compute Canada Cloud] (U. de Sherbrooke, online training)
  • May 24 - Assemblée générale de Calcul Québec [Calcul Québec General Assembly]
  • May 26 - Data Intensive Computing
  • June - Suggestions for training? Please let us know!
• All materials from previous workshops are available online: https://wiki.calculquebec.ca/w/Formations/en
• Recently completed:
  • April 28 - Advanced OpenMP
  • April 21 - Introduction to OpenMP
User Feedback and Discussion
• Questions? Comments?
• We value your feedback. Contact us at:
• Guillimin Operational News for Users
  – Status Pages
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  – Follow us on Twitter
    • http://twitter.com/McGillHPC
How to Build Your Own Modules
How Do You Configure Your Environment?
• If you want to run a specific version of a software:
    export PATH=$HOME/soft-1.2.3/bin:$PATH
    which soft
• Most applications have library dependencies:
    export LD_LIBRARY_PATH=$HOME/soft-1.2.3/lib:$LD_LIBRARY_PATH
    ldd $(which soft)
• Some applications have manual pages:
    export MANPATH=$HOME/soft-1.2.3/share/man:$MANPATH
    man soft
• For your own code, you have to set:
  • For interpreters: PYTHONPATH, PERL5LIB
  • For C/C++, Fortran: CPATH, FPATH, LIBRARY_PATH
How Do You Configure Your Environment?
How do you set these environment variables?
• From ~/.bashrc?
  • This file is a script executed when an interactive Bash shell starts, so it is quite convenient
  • When using a different version of a software package, there is a risk of conflicts (e.g., a binary with the wrong libraries)
• From a source file?
  • Very flexible: it can run any low-level Bash command. A source file can load another source file. The name of a source file can describe the environment
  • How to maintain multiple versions? How to load only the minimum needed? How to implement requirements? How to remove values from variables and eliminate conflicts?
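To illustrate why removing values from variables by hand is tedious, here is a minimal sketch of what a source file would need in order to undo a PATH change. The function name path_remove and the demo PATH value are hypothetical and do not come from any Guillimin script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a colon-separated list (such as the value of
# PATH) with every occurrence of one directory removed.
path_remove() {
    local value="$1" dir="$2" entry result=""
    local IFS=':'                 # split the list on colons
    for entry in $value; do
        if [ "$entry" != "$dir" ]; then
            # Re-join the kept entries, adding ":" only between entries.
            result="${result:+$result:}$entry"
        fi
    done
    printf '%s\n' "$result"
}

# Example with a made-up PATH value containing the soft-1.2.3 directory.
demo_path="/home/user/soft-1.2.3/bin:/usr/bin:/bin"
cleaned="$(path_remove "$demo_path" "/home/user/soft-1.2.3/bin")"
echo "$cleaned"
```

The same cleanup would have to be repeated for LD_LIBRARY_PATH, MANPATH and every other variable touched, which is the bookkeeping a module system takes over.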
What Lmod Modules Can Do
Modules modify environment variables. They:
• Set or prepend values to environment variables
  • When unloading a module, the corresponding values are automatically removed from the variable
• Load required modules (or dependencies)
  • Example: a specific OpenMPI module can load the corresponding compiler module
• Prevent conflicts between similar modules
  • Conflicts between themselves and another module, or between two dependencies
• Load new sets of modules
• Provide a description of themselves and a “help” text
Default Sets of Modules
Where Do Modules Come From?
The Content and Syntax of a Lua File
help([[Toolkit ... Intel C/C++ and Fortran compilers, Intel MKL & OpenMPI. - ...]])
whatis([[Name: iomkl]])
whatis([[Version: 2015b]])
whatis([[Description: ... ]])
add_property("type_","recommended")
conflict("iomkl")
load("icc/2015.3.187-GNU-4.9.3-2.25")
load("ifort/2015.3.187-GNU-4.9.3-2.25")
load("OpenMPI/1.8.8")
load("iompi/2015b")
load("imkl/11.2.3.187")
local root = "/software/CentOS-6/eb/software/Core/iomkl/2015b"
prepend_path("MODULEPATH", "/software/CentOS-6/eb/modules/all/Toolchain/iomkl/2015b")
setenv("EBROOTIOMKL", root)
setenv("EBVERSIONIOMKL", "2015b")
setenv("EBDEVELIOMKL", pathJoin(root, "easybuild/Core-iomkl-2015b-easybuild-devel"))
• This could be saved in:
  • $HOME/modulefiles/name/1.0.1.lua
  • /gs/project/abc-123-aa/software/modulefiles/name/1.0.1.lua
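As a rough sketch of how a private modulefile could be put in place, the script below writes a small Lua modulefile for the soft-1.2.3 example used earlier. The package name "soft", its install location under $HOME and the MODULE_ROOT variable are assumptions for illustration. By default the file goes to a throwaway directory so that nothing existing is overwritten.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal Lua modulefile for a hypothetical package "soft",
# version 1.2.3, assumed to be installed under $HOME/soft-1.2.3.
# For real use, set MODULE_ROOT=$HOME/modulefiles before running; the
# default below is a temporary directory used only for demonstration.
module_root="${MODULE_ROOT:-$(mktemp -d)/modulefiles}"
module_file="$module_root/soft/1.2.3.lua"

mkdir -p "$(dirname "$module_file")"

# Quoted heredoc delimiter: the shell copies the Lua text verbatim.
cat > "$module_file" <<'EOF'
help([[Hypothetical example package "soft", version 1.2.3.]])
whatis([[Name: soft]])
whatis([[Version: 1.2.3]])

-- Only one version of "soft" can be loaded at a time.
conflict("soft")

local root = pathJoin(os.getenv("HOME"), "soft-1.2.3")
prepend_path("PATH", pathJoin(root, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(root, "lib"))
prepend_path("MANPATH", pathJoin(root, "share/man"))
EOF

echo "Wrote $module_file"
```

On a system with Lmod, the directory held in module_root would then be added with module use, after which module load soft/1.2.3 should pick up the new file.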
The Content and Syntax of Older Tcl Files
https://wiki.calculquebec.ca/w/Cr%C3%A9er_un_module/en
#%Module1.0
###############################################################
## OPENMPI MPI lib
proc ModulesHelp { } {
puts stderr "\tAdds the OpenMPI library. "
}
module-whatis "(Category_______) mpi"
module-whatis "(Name___________) OpenMPI"
module-whatis "(Version________) 1.6.3"
conflict mpi
prereq compilers/intel/2013
set root /software/MPI/openmpi/1.6.3_intel
prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/lib
setenv OMPI_MCA_plm_rsh_num_concurrent 960
For Multiple Project Spaces
• If you create or have access to multiple module repositories:
  • module use /gs/project/abc-123-aa/software/modulefiles
  • module load ...
• Then, to use another set of modules:
  • module unload ... # Or: module purge
  • module unuse /gs/project/abc-123-aa/software/modulefiles
  • module use /gs/project/def-456-aa/software/modulefiles
About the Lmod Cache
• A module system needs to present modules (with the avail or spider commands) that are spread across multiple directories and subdirectories
• Listing all these modules is an expensive task that performs many small I/O operations
• Typical file systems for clusters are not optimized for this kind of access
• Solution:
  • Cache all the information in a single database file, and use the cache for the avail or spider commands
  • Since each user could have access to different sets of modules, each user has a private cache file
About the Lmod Cache
• Problems of a cached module structure:
  • It has to be renewed periodically. A central cache is updated every time we add a new module to the main tree. Then, each user’s cache is updated at the next Bash session
• Updating a cache takes time, so it must be worth it:
  • If listing the modules takes less than $LMOD_SHORT_TIME seconds (10 seconds by default), the cache is not needed
• If you write your own modulefiles, watch out for caching:
  • Force a new cache via rm -rf ~/.lmod.d/.cache
  • Disable the cache via export LMOD_SHORT_TIME=86400
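The cache reset above can be wrapped in a small helper, sketched here under the assumption that the user cache lives in ~/.lmod.d/.cache as stated on the slide. The function name lmod_clear_user_cache and its optional argument are hypothetical. The demonstration runs on a throwaway directory, not on the real cache.

```shell
#!/usr/bin/env bash
# Sketch: drop the per-user Lmod cache so that newly written modulefiles
# are picked up. The function name and its optional argument are
# hypothetical; the default path is the one quoted on the slide.
lmod_clear_user_cache() {
    local cache_dir="${1:-$HOME/.lmod.d/.cache}"
    if [ -d "$cache_dir" ]; then
        rm -rf -- "$cache_dir"
        echo "Removed $cache_dir"
    else
        echo "No cache found at $cache_dir"
    fi
}

# Demonstration on a throwaway directory instead of the real cache.
demo_cache="$(mktemp -d)/.cache"
mkdir -p "$demo_cache"
lmod_clear_user_cache "$demo_cache"
```

Called without an argument, the function would act on the real user cache, which is rebuilt the next time it is needed.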
Conclusion
• Documentation:
  • https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
    • /system-administrators-guide/tcl-vs-modulefiles
  • http://lmod.readthedocs.io/en/latest/
  • http://www.hpc.mcgill.ca/index.php/starthere/81-doc-pages/88-guillimin-modules
• For any other question: [email protected]