INCOIS – HPC Training
© 2013 IBM Corporation
Agenda
Technical Infrastructure
– Cluster layout
– Compute: Sandy Bridge
– Management: xCAT (provisioning tool)
– Interconnect: FDR Mellanox
– Storage: GPFS
– Software stack
Intel Cluster Studio
– Compiler: optimization methodology
– MPI/OpenMP: features and optimizations
– Math Library (MKL)
– Debugging: parallel application debugging
– Profiling/Tracing: VTune trace analyzer
Job Scheduler / Cluster Manager
– LSF
• Basic architecture
• Current configuration
• Scheduling policies
• Troubleshooting
• Profiling
• Queues and priorities
• Fault tolerance
• Submission and management
– Hands On
© 2013 IBM Corporation
Technical Infrastructure
© 2013 IBM Corporation
Components of the stack (diagram):
– Applications
– Tools, Compilers
– Scientific Libraries, Message Passing Interface
– Job Scheduler, Cluster Administration
– Parallel File System
– Operating System
– Hardware
High performance computing stack
© 2013 IBM Corporation
Cluster Overview
800 TeraFlops High Performance Computing System: an IBM iDataPlex cluster featuring 38,144 Intel Sandy Bridge processor cores and 149 TB of memory.
The login and compute nodes are each populated with two Intel Sandy Bridge 8-core processors.
FDR 10 InfiniBand in a fat-tree configuration is the high-speed interconnect for MPI messages and I/O traffic.
The high-performance parallel file system is GPFS, one of the most stable and highly reliable file systems for HPC clusters.
Each compute node has two 8-core processors (16 cores), runs its own Red Hat Enterprise Linux OS, and shares 64 GB of memory among its cores.
The cluster is intended to be used for batch-scheduled jobs.
All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission through the job scheduler.
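For example, a minimal batch submission through the job scheduler (LSF) might look like the sketch below; the queue name and executable are hypothetical:
bsub -n 32 -q normal -o job.%J.out ./my_mpi_app
(requests 32 cores in the "normal" queue; stdout goes to job.<jobid>.out)
bjobs
(lists the state of your pending and running jobs)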
© 2013 IBM Corporation
IBM System x iDataPlex Compute Building Block
72 x IBM System x iDataPlex dx360 M4 server
– 2x E5-2670 Sandy Bridge-EP, 2.6 GHz, 1600 MHz memory, 20 MB cache, 8-core
– 8 x 8 GB DDR3-1600 DIMMs (4 GB/core), total 64 GB/node
– Dual-port InfiniBand FDR14 mezzanine card
4 x Mellanox 36-port managed FDR14 IB switch
– 4 leaf IB switches
– 18 compute nodes connected to each leaf switch
– 18 uplinks from every leaf switch connect to the IB main switches
Management network
– 2 x BNT RackSwitch G8052F
– 4 x 1 Gb connections from each switch act as uplinks for a flawless flow of management traffic
IBM System x iDataPlex rack with RDHX (water cooling)
Performance
– 2.60 GHz x 8 Flops/cycle (AVX) = 20.8 GFlops/core
– 16 cores x 20.8 GFlops/core = 332.8 GFlops/node
– 72 nodes x 332.8 GFlops/node = 23.96 TFlops/rack
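Scaling the per-node figure to the full machine (a consistency check, not from the slide): 38,144 cores x 20.8 GFlops/core ≈ 793 TFlops, i.e. the ~800 TeraFlops peak quoted for the whole cluster, spread over 38,144 / 16 = 2,384 such nodes.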
© 2013 IBM Corporation
Compute
© 2013 IBM Corporation
IBM System x iDataPlex dx360 M4 Compute Node
iDataPlex rack server
1U node density: 84 nodes per 84U rack
Supports SSI planars (EP & EN)
Shared power – Common Form Factor (CFF) supplies
Shared cooling – 80 mm fans
HPC nodes include 2x 1 GbE onboard plus a 10 GbE or 40G/QSFP IB mezzanine card option
© 2013 IBM Corporation
Intel SandyBridge microprocessor
New architecture ("Tock" cycle) brings new features:
– Up to 8 cores per socket
– AVX vector units (double peak FP performance)
– Larger and faster caches
– Improved TLB (translation lookaside buffer)
– Higher memory bandwidth per core
– Enhanced Turbo Mode
– Enhanced Hyper Threading mode
– …
SandyBridge-EP model (die diagram)
© 2013 IBM Corporation
SandyBridge-EP microprocessor
In addition, Sandy Bridge also introduces support for AVX (Advanced Vector Extensions) within an updated execution stack, enabling 256-bit floating point (FP) operations to be decoded and executed as a single micro-operation (uOp).
The effect of this is a doubling in peak FP capability, sustaining 8 double precision FLOPs/cycle.
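To put the AVX units to work, codes are typically compiled with the Sandy Bridge target enabled; a hedged sketch with the Intel compiler from Intel Cluster Studio (the source file name is hypothetical, flags as of that compiler generation):
# icc -O3 -xAVX -vec-report2 -o solver solver.c
(-xAVX generates 256-bit AVX code; -vec-report2 reports which loops were vectorized)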
© 2013 IBM Corporation
SandyBridge-EP microprocessor
The Sandy Bridge processor integrates a high-performance, bidirectional ring architecture interconnecting
– the CPU cores, Last Level Cache (LLC, or L3), PCIe, QPI, and memory controller
– able to return 32 bytes of data on each cycle
Each physical LLC segment is loosely associated with a corresponding core
– but the cache is also shared among all cores as a logical unit
The ring and LLC are clocked with the CPU core, so compared to the previous generation architecture
– cache and memory latencies have dropped
– bandwidths are significantly improved
© 2013 IBM Corporation
Turbo Boost
Turbo Boost allows dynamically increasing the CPU clock speed on demand ("dynamic overclocking")
– Frequency increases in increments of 100 MHz
• when the processor has not reached its thermal and electrical limits
• when the user's workload demands additional performance
• until a thermal or power limit is reached, or until the maximum speed for the number of active cores is reached
Important note:
– On 4-socket systems (like the x3750 M4), the 2.4 GHz CPU will only achieve a 2.8 GHz Turbo upside on a 4S-EP (this is intentionally limited by Intel)
– This is lower than the Turbo upside for an equivalent 2-socket EP processor, which would achieve 3.1 GHz.
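A quick, hedged way to observe Turbo on a running node (not from the slides; the node name is hypothetical):
# ssh node01 "grep 'cpu MHz' /proc/cpuinfo | sort -u"
(shows the clock currently reported per core; under load it can rise above the 2.60 GHz base in 100 MHz steps)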
© 2013 IBM Corporation
Storage
© 2013 IBM Corporation
GPFS Storage server
IBM System x GPFS Storage Server: Bringing HPC
Technology to the Mainstream
• Better, sustained performance
– Industry-leading throughput using efficient de-clustered RAID techniques
• Better value
– Leverages System x servers and commercial JBODs
• Better data security
– From the disk platter to the client
– Enhanced RAID protection technology
• Affordably scalable
– Start small and affordably
– Scale via incremental additions
– Add capacity AND bandwidth
© 2013 IBM Corporation
• 3-year warranty
– Manage and budget costs
• IT-facility friendly
– Industry-standard 42U 19-inch rack mounts
– No special height requirements
– Client racks are OK!
• And all the data management / life cycle capabilities of GPFS – built in!
© 2013 IBM Corporation
General Parallel file system
© 2013 IBM Corporation
Parallel Filesystem
GPFS: a file system for high-performance computing; a shared-disk, parallel file system for AIX and Linux clusters.
Software features: snapshots, replication and multi-site connectivity are included in the GPFS license. There are no add-on license keys besides client and server; you get all of the features up front.
Number of files:
• 2 Billion per file system
• 256 file systems
• Max File System Size: 2^99 bytes
• Max File Size = File system size
Disk IO:
•AIX 134 GB/sec
•Linux 66 GB/sec
Number of nodes:
• 1 to 8192
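These limits and the actual settings of a deployed file system can be inspected on a live cluster; a minimal sketch (the file system name fs1 is hypothetical):
# mmlsconfig
(shows cluster-wide configuration parameters)
# mmlsfs fs1
(shows the attributes of fs1: block size, replication, inode limits, …)
# mmdf fs1
(shows capacity and free space per disk and per storage pool)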
© 2013 IBM Corporation
• GPFS 2.3, or later, architectural file system size limit
– 2^99 bytes
– Current tested limit ~2 PB
• Total number of files per file system
– 4,000,000,000 (four billion - GPFS 3.4 created file system, two billion on 3.2 or earlier
GPFS versions)
• Total number of nodes: 8,192
– A node is in a cluster if:
• The node shows up in the mmlscluster output, or
• The node is in a remote cluster and is mounting a file system in the local cluster
• Maximum number of mounted file systems
– 256
– Before GPFS 3.2, 64 file systems
• Maximum disk size
– Limited by disk device driver and O/S
Architecture Stats
© 2013 IBM Corporation
GPFS provides a highly scalable file management infrastructure
Optimizes storage utilization by centralizing management
Provides a flexible scalable alternative to a growing number of
NAS appliances
Highly available grid computing infrastructure
Scalable information lifecycle tools to manage growing data volumes
What GPFS provides
© 2013 IBM Corporation
NSD clients access GPFS over the LAN through the NSD servers, which in turn connect to the disks over the SAN.
– Seamless capacity and performance scaling
– Centrally deployed, managed, backed up and grown
– Massive namespace support
Architecture: Diagram
© 2013 IBM Corporation
Internal design
© 2013 IBM Corporation
• The GPFS kernel extension provides:
– Interfaces to the operating system vnode and VFS.
• Flow:
– Application makes file system calls to the O/S.
– O/S presents calls to the GPFS kernel extension.
• GPFS appears to the application as just another file system.
– GPFS kernel extension will either satisfy requests using information already available
or send a message to the GPFS daemon to complete the request.
– The GPFS daemon
• performs all I/O and buffer management, including read-ahead for sequential reads and write-behind operations.
• All I/O is protected by token management to ensure file system consistency.
• Multi-threaded with some threads dedicated to specific functions.
– Examples include space allocation, directory management (insert and removal), and
quotas.
• Disk I/O is initiated on threads of the daemon.
Kernel Extension
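The state of the GPFS daemon described above can be checked cluster-wide; a quick illustration (not from the slides):
# mmgetstate -a
(reports whether the GPFS daemon is active, down or arbitrating on every node)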
© 2013 IBM Corporation
Manager nodes
• Global lock manager
• File system configuration: recovery, adding disks, …
• Disk space allocation manager
• Quota manager
• File metadata manager - maintains file metadata integrity
File system nodes
• Run user programs, read/write data to/from storage nodes
• Implement virtual file system interface
• Cooperate with manager nodes to perform metadata operations
Storage nodes
• Implement block I/O interface
• Shared access to file system and manager nodes
• Interact with manager nodes for recovery
Node Roles
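To see which nodes currently hold these roles, the standard query commands can be used; a brief sketch (the file system name fs1 is hypothetical):
# mmlscluster
(lists all cluster nodes with their designations, e.g. quorum or manager)
# mmlsmgr fs1
(shows the file system manager node for fs1 and the cluster manager)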
© 2013 IBM Corporation
Use mmdelnode to remove a node from a cluster:
mmdelnode {-a | -N {Node[,Node...] | NodeFile | NodeClass}}
– Cannot be the primary or secondary GPFS cluster configuration node (unless removing the entire cluster)
– Cannot be an NSD server (unless removing the entire cluster)
– Can be run from any node remaining in the GPFS cluster
– The GPFS daemon must be stopped on the node being deleted
Deleting some nodes:
– Avoid unexpected consequences due to quorum loss
Deleting a cluster using the mmdelnode command: mmdelnode -a
Administration: Node Deletion
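For example, a minimal sketch of removing a single node (the node name node17 is hypothetical):
# mmshutdown -N node17
(stop the GPFS daemon on the node being deleted)
# mmdelnode -N node17
# mmlscluster
(confirm the node is no longer listed)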
© 2013 IBM Corporation
Disks are added to a file system using the mmadddisk command:
mmadddisk Device {"DiskDesc[;DiskDesc...]" | -F DescFile} [-a] [-r]
[-v {yes|no}] [-N {Node[,Node...] | NodeFile | NodeClass}]
Optionally, rebalance the data (-r) (recommended, but it can cause a performance impact while rebalancing).
The file system can be mounted or unmounted.
The NSD must be created before it can be added using mmadddisk.
– Create a new disk (mmcrnsd)
– Reuse an available disk (mmlsnsd -F)
# mmlsnsd -F
File system   Disk name   Primary node          Backup node
-----------------------------------------------------
(free disk)   gpfs3nsd    (directly attached)
Adding disks
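Putting the two steps together, a hedged sketch of creating an NSD and adding it to a file system (the device, server and file system names are hypothetical; the descriptor format matches the pre-stanza GPFS 3.x style shown above):
# cat disk.desc
sdc:nsdserver1::dataAndMetadata:1:gpfs10nsd
# mmcrnsd -F disk.desc
(defines the NSD; mmcrnsd rewrites disk.desc so it can be reused below)
# mmadddisk fs1 -F disk.desc -r
(adds the new disk to fs1 and rebalances existing data onto it)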
© 2013 IBM Corporation
Managing disks within a file system
– Disk errors
– Performance evaluation
– Planning for migration
Modify the disk state using the mmchdisk command
# mmchdisk
Usage:
mmchdisk Device {resume | start} -a [-N {Node[,Node...] | NodeFile | NodeClass}]
or
mmchdisk Device {suspend | resume | stop | start | change}
{-d "DiskDesc[;DiskDesc...]" | -F DescFile} [-N {Node[,Node...] | NodeFile | NodeClass}]
Example
– Restart a disk after fixing a storage failure
Changing disk attributes
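For the restart example above, a minimal sketch (the file system and disk names are hypothetical):
# mmchdisk fs1 start -d "gpfs3nsd"
(bring the repaired disk back online)
# mmlsdisk fs1
(verify the disk is now "ready" and "up")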
© 2013 IBM Corporation
A disk can be replaced by a new disk.
– Need a free NSD as large as or larger than the original
– Cannot replace a stopped disk
– Cannot replace a disk if it is the only disk in the file system
– Do not need to unmount the file system
– No need to re-stripe
– The file system can be mounted or unmounted
It is replaced using the mmrpldisk command.
Usage: mmrpldisk Device DiskName {DiskDesc | -F DescFile} [-v {yes | no}]
[-N {Node[,Node...] | NodeFile | NodeClass}]
Replacing Disks
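A hedged sketch of a replacement (the file system, disk names and descriptor fields are illustrative only):
# mmrpldisk fs1 gpfs2nsd "gpfs9nsd:::dataAndMetadata:1"
(replaces the failing disk gpfs2nsd with the free NSD gpfs9nsd; data is migrated to the new disk)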
© 2013 IBM Corporation
Disks are removed from a file system using the mmdeldisk command.
– Migrates data to the remaining disks in the file system
– Removes the disk from the file system descriptor
– Can be run from any node in the cluster
The mmdeldisk command:
– Usage:
mmdeldisk Device {"DiskName[;DiskName...]" | -F DiskFile} [-a] [-c] [-r]
[-N {Node[,Node...] | NodeFile | NodeClass}]
Usage scenarios:
– If the disk is not failing and is still readable by GPFS:
• Suspend the disk (mmchdisk Device suspend -d disk_name).
• Re-stripe to rebalance all data onto other disks (mmrestripefs -b).
• Delete the disk (mmdeldisk).
– If the disk is permanently damaged and the file system is replicated:
• Suspend and stop the disk (mmchdisk Device suspend -d disk_name; mmchdisk Device stop -d disk_name)
• Re-stripe and restore replication for the file system, if possible (mmrestripefs -r)
• Delete the disk from the file system (mmdeldisk)
Deleting a Disk
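The first scenario above, end to end, as a hedged sketch (the names are hypothetical):
# mmchdisk fs1 suspend -d "gpfs5nsd"
(stop new allocations on the disk to be removed)
# mmrestripefs fs1 -b
(rebalance all data onto the remaining disks)
# mmdeldisk fs1 "gpfs5nsd"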
© 2013 IBM Corporation
mmchfs command
– Usage: mmchfs Device [-A {yes | no | automount}] [-D {posix | nfs4}] [-E {yes | no}]
[-F MaxNumInodes[:NumInodesToPreallocate]]
[-k {posix | nfs4 | all}] [-K {no | whenpossible | always}]
[-m DefaultMetadataReplicas] [-o MountOptions]
[-Q {yes | no}] [-r DefaultDataReplicas] [-S {yes | no}]
[-T Mountpoint] [-t DriveLetter] [-V {full | compat}] [-z {yes | no}]
or mmchfs Device -W NewDeviceName
Cannot modify
– Blocksize
– Logfile (-L LogFileSize in mmcrfs)
– MaxDataReplicas and MaxMetadataReplicas
– numnodes
File system
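Two typical changes, as a brief sketch (the file system name and mount point are hypothetical):
# mmchfs fs1 -Q yes
(enable quota enforcement on fs1)
# mmchfs fs1 -T /gpfs/fs1
(change the default mount point)
# mmlsfs fs1 -Q -T
(confirm the new settings)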
© 2013 IBM Corporation
Quotas are set using the mmedquota command.
Issue mmedquota to explicitly set quotas for users, groups, or filesets.
mmedquota {-u [-p ProtoUser] User... |
-g [-p ProtoGroup] Group... |
-j [-p ProtoFileset] Fileset... |
-d {-u User... | -g Group... | -j Fileset} |
-t {-u | -g | -j}}
– Confirm using the mmrepquota command.
Example: edit the quota for user user1
# mmedquota -u user1
*** Edit quota limits for USR tests
NOTE: block limits will be rounded up to the next multiple of the block size.
block units may be: K, M, or G.
fs1: blocks in use: 0K, limits (soft = 0K, hard = 0K)
inodes in use: 0, limits (soft = 0, hard = 0)
Setting up user quota
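To confirm the limits that were just set, a quick illustration (the file system name is hypothetical):
# mmrepquota -u fs1
(reports block and inode usage and limits for every user of fs1)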
© 2013 IBM Corporation
Cluster management
© 2013 IBM Corporation
What is xCAT?
Extreme Cluster (Cloud) Administration Toolkit
– Open-source Linux/AIX/Windows scale-out cluster management solution
Design principles
– Build upon the work of others
• Leverage best practices
– Scripts only (no compiled code)
• Portable
• Source
– Vox Populi – voice of the people
• Community requirements driven
• Do not assume anything
© 2013 IBM Corporation
What does xCAT do?
Remote hardware control
– Power, reset, vitals, inventory, event logs, SNMP alert processing
– xCAT can even tell you which light path LEDs are lit up, remotely
Remote console management
– Serial console, SOL, logging / video console (no logging)
Remote destiny control
– Local/SAN boot, network boot, iSCSI boot
Remote automated unattended network installation
– Auto-discovery
• MAC address collection
• Service processor programming
• Remote flashing
– Kickstart, AutoYaST, imaging, stateless/diskless, iSCSI
Scales! Think 100,000 nodes.
xCAT will make you lazy – no need to walk to the datacenter again.
© 2013 IBM Corporation
Functionality
Remote Hardware Control
– Power, reset, vitals, inventory, event logs, SNMP alert processing
Remote Console Management
– Serial console, SOL, logging
Remote Destiny Control
– Local boot, network boot, iSCSI boot
Parallel Cluster control
– parallel shell, parallel rsync, parallel secure copy, parallel ping
Remote Automated Unattended Network Installation
– Auto-discovery
• MAC address collection
• Service processor programming
– Remote flashing
– Kickstart, Autoyast, imaging, stateless/diskless
Easy to Use and it Scales! Think 100000 nodes
– xCAT will make you lazy - no need to walk to datacenter again
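A few everyday xCAT commands as a hedged illustration (the node and group names are hypothetical):
# rpower node01-node72 stat
(query the power state of a range of nodes)
# rcons node01
(open the serial/SOL console of node01)
# psh compute uptime
(run a command in parallel on every node in the "compute" group)
# prsync /etc/hosts compute:/etc/
(parallel rsync a file to the whole group)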
© 2013 IBM Corporation
Architecture
A single xCAT Management Node (MN) for N nodes.
– A single node acting as the DHCP/TFTP/HTTP/NFS server.
– Scales to ~128 nodes.
• If staggered boot is used, this can scale to 1024 nodes (tested).
© 2013 IBM Corporation
Scale Infrastructure
A single xCAT management node with multiple service nodes providing boot services to increase scaling.
Can scale to 1000s and 10000s of nodes.
xCAT already provides this support for large diskful clusters, and it can be applied to stateless clusters as well.
The number of nodes and the network infrastructure determine the number of DHCP/TFTP/HTTP servers required for a parallel reboot with no DHCP/TFTP/HTTP timeouts.
The number of DHCP servers does not need to equal the number of TFTP or HTTP servers. TFTP servers NFS-mount read-only the /tftpboot and image directories from the management node to provide a consistent set of kernel, initrd, and file system images.
Diagram: the management node fronts service node01 ... service nodeNN; each service node runs DHCP/TFTP/HTTP/NFS (hybrid) and serves its own group of compute nodes (node001 ... nodennn, nodennn+1 ... nodennn+m, ...).
© 2013 IBM Corporation
Tables and Database
xCAT stores all information about the nodes and subsystems it manages in a database.
– The xCAT default database is located in /etc/xcat as SQLite tables. xCAT can be instructed to store the tables in MySQL, PostgreSQL or DB2 as well.
For most installations you won't need to fill even half of the tables!
– And for the tables that you do need, in most cases you'll only need to put one line in the table!
There are a lot of tables, but only some are common to Linux and AIX; some are AIX-only, some are just for monitoring, and some are for advanced functions (virtual machines, iSCSI settings), …
xCAT comes with a rich set of functions for manipulating tables.
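A few of those table- and object-manipulation functions, as a hedged illustration (the node name is hypothetical; nodelist and noderes are standard xCAT tables):
# tabdump nodelist
(dump a table in CSV form)
# tabedit noderes
(edit a table in your editor)
# lsdef node01
(show the object-style view of a node)
# chdef node01 groups=compute,all
(change node attributes through the object interface)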
© 2013 IBM Corporation
Provisioning methods
Diagram – xCAT provisioning methods for a node (local HD / memory):
– Stateful – Diskful: the OS installer writes the OS to local disk (HD or flash).
– Stateful – Disk-Elsewhere: the OS installer targets SAN/iSCSI/NAS storage.
– Stateless – Disk Optional: an image is pushed into node memory (RAM, CRAM, NFS); HD, flash, RAM or CRAM may be used optionally.
– Statelite
© 2013 IBM Corporation
Management & Monitoring
© 2013 IBM Corporation
Job Scheduler / Intel Cluster Suite