![Page 1: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/1.jpg)
HPC Management Good Practices Workshop – CARLA 2018 AJ Rubio-Montero
Evolution of the maintainability of HPC facilities at CIEMAT headquarters
Antonio Juan Rubio Montero (on behalf of the ICT Division)
Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT), Madrid, Spain
![Page 2: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/2.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
Unfortunately, Grace Hopper didn’t work on our UNIVAC SOLID STATE, but we had one!!!
![Page 3: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/3.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
![Page 4: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/4.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
PDC is built
![Page 5: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/5.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
PDC is built
![Page 6: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/6.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial, MPP]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
PDC is built
![Page 7: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/7.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial, MPP]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
PDC is built
(2000) STK9310 library [1,500 cartridges]
![Page 8: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/8.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial, MPP]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
SGI
• 2001 SGI Origin 3800
• 2003 SGI Altix 3700
PDC is built
STK9310
![Page 9: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/9.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial, MPP, NUMA, Cluster]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
SGI
• 2001 SGI Origin 3800
• 2003 SGI Altix 3700
Beowulf
• 2005 Lince (x86-32)
PDC is built
STK9310
![Page 10: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/10.jpg)
History of HPC facilities at CIEMAT
[Timeline figure, 1950s to 2020s. Eras: Punched Cards, Mainframe, Vectorial, MPP, NUMA, Cluster]
UNIVAC
• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
IBM
• 1985 IBM 3090/150
CRAY
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
SGI
• 2001 SGI Origin 3800
• 2003 SGI Altix 3700
Beowulf
• 2005 Lince (x86-32)
• 2008 Euler (23 TFlops)
• 2010 Dirac (1.27 TFlops)
• 2015 ACME (40.6 + 18.8 TFlops)
Current: Euler, Dirac, ACME
Future: in 2019, the first ¼ of Euler-2 (125.9 TFlops)
PDC is built
STK9310
![Page 11: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/11.jpg)
Current HPC infrastructure at CIEMAT headquarters
• Uninterruptible power supply: new batteries and a 1,000 kVA diesel generator
• Efficient cooling and fire protection
![Page 12: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/12.jpg)
Current HPC infrastructure at CIEMAT headquarters
• Uninterruptible power supply: new batteries and a 1,000 kVA diesel generator
• Efficient cooling and fire protection
• (2008) Euler (23 TFlops) and (2010) Dirac (1.27 TFlops)
  - 251 nodes, 2,052 Xeon cores
  - 2 PBS/Torque instances
  - InfiniBand
  - Unchanged base software
• 350 users (100 external) in 30 research groups
• Full monitoring through Nagios: temperature, humidity, power, batteries, hardware and services
![Page 13: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/13.jpg)
Current HPC infrastructure at CIEMAT headquarters
• Uninterruptible power supply: new batteries and a 1,000 kVA diesel generator
• Efficient cooling and fire protection
• (2008) Euler (23 TFlops) and (2010) Dirac (1.27 TFlops)
  - 251 nodes, 2,052 Xeon cores
  - 2 PBS/Torque instances
  - InfiniBand
  - Unchanged base software
• (2015) ACME
  - 24 nodes
  - 720 Xeon cores (40.6 TFlops)
  - 2 Tesla P100 GPUs (18.8 TFlops)
  - Slurm, InfiniBand
  - Continuously updated
• 16 RAID NAS servers (NFS)
  - 1 intelligent device (NetApp)
  - 13 generic servers (Ethernet SAN)
  - 1 RDMA over InfiniBand (ACME)
  - > 1.5 PB in total
• 350 users (100 external) in 30 research groups
• Full monitoring through Nagios: temperature, humidity, power, batteries, hardware and services
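Nagios checks follow a simple plugin convention: a check prints a one-line status (optionally followed by `|` and performance data) and reports its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of a room-temperature check in that style. The thresholds, the perfdata label `room_temp`, the script name `check_temp.py` and the way the reading is obtained (passed on the command line rather than read from a sensor) are all hypothetical, not CIEMAT's actual configuration.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(celsius: float, warn: float = 27.0, crit: float = 32.0) -> tuple[int, str]:
    """Classify a reading against hypothetical thresholds and build a
    Nagios-style output line: status text, then '|' and perfdata."""
    if celsius >= crit:
        state = CRITICAL
    elif celsius >= warn:
        state = WARNING
    else:
        state = OK
    message = (f"TEMP {STATE_NAMES[state]} - room temperature {celsius:.1f} C"
               f" | room_temp={celsius:.1f};{warn};{crit}")
    return state, message


def main(argv: list[str]) -> int:
    # Hypothetical input: the reading is the first command-line argument.
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        print("TEMP UNKNOWN - usage: check_temp.py <celsius>")
        return UNKNOWN
    state, message = check_temperature(reading)
    print(message)
    return state  # Nagios derives the service state from the exit code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Keeping the classification in a pure function separate from I/O makes the thresholds easy to test; the same pattern extends to humidity, UPS battery or power readings.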
![Page 14: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/14.jpg)
Current HPC infrastructure at CIEMAT headquarters
[Diagram: backup architecture]
• IBM TS3584 tape library (18 drives, 1,581 cartridges, 4.42 PB): daily incremental backups, kept 3 months
• Secondary storage servers: daily differential rsync copies, kept 2 months
• Network links: ×15 Ethernet 1-10 Gbps; Ethernet 1 Gbps (Euler)
![Page 15: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/15.jpg)
Current HPC infrastructure at CIEMAT headquarters
[Diagram: backup architecture]
• IBM TS3584 tape library (18 drives, 1,581 cartridges, 4.42 PB): daily incremental backups, kept 3 months
• Secondary storage servers: daily differential rsync copies, kept 2 months
• NetApp FAS2554: hourly snapshots, kept 3 weeks (Ethernet 4 × 1 Gbps)
• Network links: ×15 Ethernet 1-10 Gbps; Ethernet 1 Gbps (Euler)
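The three backup tiers differ in granularity (hourly vs. daily restore points) and in retention window. A minimal sketch of how one might work out which tiers could still hold a restore point for data lost N days ago, assuming retention of 3 weeks (21 days) for the NetApp snapshots, 2 months (approximated as 60 days) for the rsync copies and 3 months (approximated as 90 days) for tape. The `BackupTier` type, the tier labels and `tiers_covering` are hypothetical illustrations, not CIEMAT tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    name: str
    interval_hours: float  # how often a restore point is taken
    retention_days: int    # how long restore points are kept (approximate)


# Hypothetical encoding of the tiers on the slide; months approximated as 30 days.
TIERS = [
    BackupTier("NetApp FAS2554 hourly snapshot", interval_hours=1, retention_days=21),
    BackupTier("rsync differential copy", interval_hours=24, retention_days=60),
    BackupTier("TS3584 tape incremental", interval_hours=24, retention_days=90),
]


def tiers_covering(age_days: float) -> list[str]:
    """Names of tiers whose retention window still covers data as it was
    `age_days` ago, finest granularity first (stable sort keeps ties in order)."""
    covering = [t for t in TIERS if age_days <= t.retention_days]
    return [t.name for t in sorted(covering, key=lambda t: t.interval_hours)]


if __name__ == "__main__":
    for age in (2, 30, 75, 120):
        print(f"{age:>3} days ago -> {tiers_covering(age) or 'no restore point'}")
```

With these assumed windows, a file lost 30 days ago is only recoverable from the rsync copies or tape, and anything older than about 90 days is beyond every tier.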
![Page 16: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/16.jpg)
Future acquisitions (2019)
Euler replacement. Practices:
- Constellation design
- Slurm-based:
  - checkpointing
  - predefined containers
- Yearly update cycle:
  - software
  - 25% of the hardware
- Daily snapshots (MD34xx)
- NDMP backup over 10 Gbps
2019: first ¼ of Euler-2
- 41 nodes
- 1,640 Xeon Gold 6148 cores
- Rpeak > 125.9 TFlops
- 600 TB on Lustre
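The Rpeak figure for the first quarter of Euler-2 can be cross-checked with the usual theoretical-peak formula. This is a hedged back-of-the-envelope check, not the vendor's calculation: it assumes 2 sockets × 20 cores per node (41 × 40 = 1,640 cores), the Xeon Gold 6148 nominal base clock of 2.4 GHz, and 32 double-precision FLOPs per core per cycle (two AVX-512 FMA units, each performing 8 double-precision multiply-adds per cycle).

$$
R_{\text{peak}} = N_{\text{nodes}} \times N_{\text{cores/node}} \times f \times \frac{\text{FLOPs}}{\text{cycle}}
= 41 \times 40 \times 2.4\times10^{9}\,\text{Hz} \times 32
\approx 1.2595\times10^{14}\ \text{FLOP/s} \approx 125.95\ \text{TFlops}
$$

Sustained AVX-512 clocks are usually below the base frequency, so measured performance (Rmax) will be lower than this nominal figure.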
![Page 17: Evolution of the maintainability of HPC facilities at](https://reader035.vdocument.in/reader035/viewer/2022070408/62c0876737e1b06e8460f6bb/html5/thumbnails/17.jpg)
antonio.rubio <at> ciemat.es
CIEMAT, Avda. Complutense, 40, 28040 Madrid
http://www.ciemat.es
http://rdgroups.ciemat.es/web/sci-track/
THANK YOU!!!