HPC Management Good Practices Workshop – CARLA 2018 AJ Rubio-Montero
Evolution of the maintainability of HPC
facilities at CIEMAT headquarters
Antonio Juan Rubio Montero [on behalf of the ICT Division]
[Centro de Investigaciones Energéticas Medioambientales y Tecnológicas (CIEMAT)
Madrid, Spain]
History of HPC facilities at CIEMAT
Timeline: 50’s 60’s 70’s 80’s 90’s 00’s 10’s 20’s
Eras: Punched Cards, Mainframe, Vectorial, MPP, NUMA, Cluster
UNIVAC
• 1959 UNIVAC SS • 1971 UNIVAC 1106
• 1977 UNIVAC 1110 • 1978 UNIVAC 1110/81
Unfortunately, Grace Hopper didn’t work on our UNIVAC Solid State, but we had one!
IBM
• 1985 IBM 3090/150
PDC is built
CRAY
• 1991 CRAY XMS • 1991 CRAY YMP-EL • 1995 CRAY J90
• 1995 CRAY T3E
STK9310
• (2000) STK9310 Library [1,500 cartridges]
SGI
• 2001 SGI Origin 3800 • 2003 SGI Altix 3700
Beowulf
• 2005 Lince (x86-32)
• 2008 Euler (23 TFlops) • 2010 Dirac (1.27 TFlops) • 2015 ACME (40.6 + 18.8 TFlops) [Current]
• In 2019, first ¼ of Euler-2 (125.9 TFlops) [Future]
Current HPC infrastructure at CIEMAT headquarters
• Uninterruptible power supply: new batteries and diesel engine (1,000 kVA)
• Efficient cooling, fire protection
• (2008) Euler (23 TFlops) • (2010) Dirac (1.27 TFlops)
- 251 nodes, 2,052 Xeon cores
- 2 PBS/Torque
- InfiniBand
- Unchanged base software
• (2015) ACME
- 24 nodes
- 720 Xeon cores (40.6 TFlops)
- 2 Tesla P100 GPUs (18.8 TFlops)
- Slurm, InfiniBand
- Continuously updated
• 16 RAID NAS servers (NFS)
- 1 intelligent device (NetApp)
- 13 generic SAN Ethernet
- 1 RDMA InfiniBand (ACME)
- > 1.5 PB total
• 350 users in 30 research groups, 100 external
• Whole monitoring through Nagios: temp., humidity, power, batteries, hardware and services
Current HPC infrastructure at CIEMAT headquarters (backup diagram)
• IBM TS3584 Tape Library (18 drives, 1,581 cartridges, 4.42 PB): daily incremental, 3 months
• Secondary storage servers: daily differential rsync copies, 2 months
• NetApp FAS2554: hourly snapshots, 3 weeks
• Diagram labels: x15 Ethernet 1-10 Gbps, Ethernet 4x1 Gbps, Ethernet 1 Gbps, Euler
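As a rough illustration of the backup tiers on this slide, the sketch below counts how many restore points each tier would hold and which tiers could still serve data of a given age. It assumes that "3 months", "2 months" and "3 weeks" are retention windows and that a month is 30 days; neither is stated on the slide. The names TIERS, restore_points and tiers_covering are hypothetical and not part of any CIEMAT tooling.

```python
from datetime import timedelta

# (tier name, copy interval, retention window)
# Intervals come from the slide; reading the durations as retention
# windows, and a month as 30 days, are assumptions of this sketch.
TIERS = [
    ("NetApp hourly snapshots", timedelta(hours=1), timedelta(weeks=3)),
    ("rsync differential copies", timedelta(days=1), timedelta(days=60)),
    ("tape incremental backups", timedelta(days=1), timedelta(days=90)),
]


def restore_points(interval, retention):
    """Number of copies kept if one is taken every `interval` for `retention`."""
    return retention // interval


def tiers_covering(age):
    """Names of the tiers whose retention window still includes data of this age."""
    return [name for name, _, retention in TIERS if age <= retention]


if __name__ == "__main__":
    for name, interval, retention in TIERS:
        print(f"{name}: {restore_points(interval, retention)} restore points")
    print(tiers_covering(timedelta(days=30)))
```

Under these assumptions the snapshot tier holds 504 restore points, the rsync tier 60 and the tape tier 90, and a file state from 30 days ago could only come from the rsync copies or from tape.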
Future acquisitions (2019)
Euler replacement. Practices:
- constellation design
- Slurm based:
  - checkpointing
  - predefined containers
- yearly update cycle:
  - software
  - 25% of hardware
- Daily snapshots MD34xx
- NDMP backup 10 Gbps
2019: first ¼ of Euler-2
- 41 nodes
- 1,640 Xeon 6148 cores
- Rpeak > 125.9 TFlops
- 600 TB based on Lustre
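The Rpeak figure quoted for Euler-2 can be checked with a back-of-the-envelope calculation. The sketch below assumes that "Xeon 6148" means the Intel Xeon Gold 6148, running at its 2.4 GHz nominal clock, and that each core retires 32 double-precision FLOPs per cycle (two AVX-512 FMA units, 8 doubles each, 2 operations per FMA). None of these figures appear on the slide, and the function name is hypothetical.

```python
NODES = 41                # from the slide
CORES = 1640              # from the slide
BASE_CLOCK_HZ = 2.4e9     # assumption: nominal clock of the Xeon Gold 6148
DP_FLOPS_PER_CYCLE = 32   # assumption: 2 AVX-512 FMA units x 8 doubles x 2 ops


def rpeak_tflops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak performance in TFlops: cores x clock x FLOPs per cycle."""
    return cores * clock_hz * flops_per_cycle / 1e12


cores_per_node = CORES // NODES
euler2_rpeak = rpeak_tflops(CORES, BASE_CLOCK_HZ, DP_FLOPS_PER_CYCLE)

if __name__ == "__main__":
    print(f"{cores_per_node} cores per node")
    print(f"Rpeak = {euler2_rpeak:.1f} TFlops")
```

Under these assumptions the result is about 125.95 TFlops from 40 cores per node, which is consistent with the "Rpeak > 125.9 TFlops" stated on the slide.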
antonio.rubio <at> ciemat.es CIEMAT
Avda. Complutense, 40 – 28040 Madrid http://www.ciemat.es
http://rdgroups.ciemat.es/web/sci-track/
THANK YOU!!!