high performance computing - atix ag · gaming and low-power chips → building blocks for modern...

Dr. Sebastian Ohlmann, Max Planck Computing and Data Facility

Linux Stammtisch München, 29.01.2019

High Performance Computing

Sebastian Ohlmann – High Performance Computing

About me

● PhD in Physics from U. Heidelberg
  – Hydrodynamic simulations on supercomputers
  – Stellar interactions and explosions

● I work at MPCDF (Max Planck Computing and Data Facility, formerly RZG)

● HPC application development in collaboration with researchers from Max Planck institutes

● Contact: [email protected]


Outline

● What is high performance computing?
● Development of HPC
● How to build/work on/manage/program a supercomputer
● Application example


What is High Performance Computing?

● High Performance Computing = Supercomputing
● “A supercomputer is a machine built by Seymour Cray” (common sense until the 1990s)
● “A machine that is (was?) listed in the Top500 list of supercomputers”
● … a moving target:
  – today's supercomputers are tomorrow's mobile devices
    ● multicore CPUs ~2005 → smartphone and PC processors ~2013
  – and vice versa
    ● gaming and low-power chips → building blocks for modern supercomputers


Historical perspective

1950 (Charney, Fjørtoft, von Neumann): first numerical weather forecast on ENIAC (Electronic Numerical Integrator and Computer)
● forecast time: 24 hours
● computation: 24 hours
● power usage: 115 kW

2008: reconstruction on a mobile phone (Java application)
● forecast time: 24 hours
● computation: < 1 second (!!!)
● power usage: 1.5 W

Lynch & Lynch, Weather 63, 324-326, 2008. http://mathsci.ucd.ie/~plynch/eniac/phoniac.html

Source: Wikipedia
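The two data points above can be turned into a speedup and an energy ratio. This is a rough sketch that uses only the figures quoted on the slide. It assumes constant power draw during the computation and takes "< 1 second" as exactly 1 s, so the results are lower bounds.

```python
# Figures quoted on the slide. "< 1 second" is taken as 1 s (an assumption),
# so the ratios below are lower bounds.
eniac_seconds = 24 * 3600   # 24 hours of computation on ENIAC
eniac_watts = 115e3         # 115 kW
phone_seconds = 1.0         # upper bound for the mobile phone
phone_watts = 1.5           # 1.5 W

speedup = eniac_seconds / phone_seconds
eniac_joules = eniac_watts * eniac_seconds   # assumes constant power draw
phone_joules = phone_watts * phone_seconds
energy_ratio = eniac_joules / phone_joules

print(f"speedup: at least {speedup:,.0f}x")
print(f"ENIAC energy per forecast: {eniac_joules / 3.6e6:,.0f} kWh")
print(f"energy ratio: at least {energy_ratio:.1e}")
```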


The Top 500 list

● List of “fastest supercomputers”
● Published twice a year (www.top500.org) since June 1993
● Based on High Performance Linpack (HPL) → dense linear algebra (BLAS3)
● HPL traces the evolution of peak floating-point performance (technology and economics), rather than “sustained” application performance (which really matters)
● Flop/s = floating-point operations per second

[Figure: excerpt from the Top500 list (www.top500.org); highlighted values: 1024 and 131.0]



[Figure: excerpt from the Top500 list (www.top500.org); highlighted values: 2,397,824 and 200,794.9; annotation: 10 MW!]


Top500 evolution

● Computing power doubles every 18 months (Moore's law)

[Figure: Top500 performance development on a logarithmic scale, from: Highlights of the 44th TOP500 List (E. Strohmaier, SC'14, New Orleans); annotations: smartphone, laptop/PC, 6-8 years]


MPCDF system in Top500

2018: first phase of the new HPC system, 10 PFlop/s peak


Old supercomputers at MPCDF


Why parallel?

● High-performance computing is power (density) limited

● From 60 devices (1965) to billions of devices (today) per chip

● Background:
  – Miniaturization at constant power density, currently 14 nm
  – P ~ f³ → CPU clock frequency levelled off at ~3 GHz (2005)
    → dawn of multicore era (ca. 2004)
    → dawn of manycore/GPU era (ca. 2010)

“The number of transistors and resistors on a chip doubles every 18 months.” (Intel co-founder Gordon Moore, 1965); add: “… at (roughly) constant manufacturing costs”

http://cpudb.stanford.edu
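The P ~ f³ relation above is the core of the argument for parallelism. A minimal sketch compares doubling the clock of one core with adding a second core. It assumes perfect parallel scaling and that the cubic law holds exactly, and the function names are illustrative only.

```python
def relative_power(freq_factor, cores=1):
    """Power relative to one core at base frequency, assuming P ~ f^3 per core."""
    return cores * freq_factor ** 3


def relative_throughput(freq_factor, cores=1):
    """Ideal throughput relative to one core at base frequency.

    Assumes perfect parallel scaling, which real applications rarely reach.
    """
    return cores * freq_factor


# Option A: double the clock of a single core.
# Option B: keep the clock and use two cores.
for label, f, n in [("1 core at 2f", 2.0, 1), ("2 cores at f", 1.0, 2)]:
    print(f"{label}: throughput x{relative_throughput(f, n):.0f}, "
          f"power x{relative_power(f, n):.0f}")
```

Under these assumptions both options double the throughput, but the higher clock costs four times the power of the two-core option.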


Outlook: future architectures

● Goal: “exascale” → 1 EFlop/s
● Large HPC systems with current CPU technology:
  – 1 PFlop/s ~ 0.3-0.5 MW → ~0.5 M€ per year
  – 1 EFlop/s → 0.5 GW? → 500 M€ per year?
● Accelerators needed!
● 11/2018: > 20% of Top500 systems with GPUs
● Top1: Summit → 143 PFlop/s @ 9.8 MW (2 Power9 + 6 NVIDIA Volta per node)
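The power and cost figures above can be checked with a few lines of arithmetic. This sketch derives the electricity price implied by "0.5 MW → ~0.5 M€ per year", then extrapolates Summit's energy efficiency to 1 EFlop/s. The linear extrapolation is an assumption for illustration, not a prediction.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years


def annual_energy_kwh(power_mw):
    """Energy used in one year of continuous operation at the given power."""
    return power_mw * 1000 * HOURS_PER_YEAR


# Electricity price implied by "0.5 MW -> ~0.5 M EUR per year".
implied_price_eur_per_kwh = 0.5e6 / annual_energy_kwh(0.5)

# Summit: 143 PFlop/s at 9.8 MW.
summit_gflops_per_watt = 143e6 / 9.8e6            # GFlop/s per W
# Power for 1 EFlop/s (= 1e9 GFlop/s) at the same efficiency (assumed to scale linearly).
exaflop_power_mw = 1e9 / summit_gflops_per_watt / 1e6

print(f"implied price: {implied_price_eur_per_kwh:.3f} EUR/kWh")
print(f"Summit efficiency: {summit_gflops_per_watt:.1f} GFlop/s per W")
print(f"1 EFlop/s at Summit efficiency: {exaflop_power_mw:.1f} MW")
```

With accelerators the extrapolated power is tens of MW instead of the 0.5 GW estimated for CPU-only systems.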


Components of a supercomputer

● Login nodes
● Compute nodes
● Fast network
● Large storage – usually a parallel file system

[Figure: several compute nodes and a login node, each with memory and two sockets, connected through a network switch to the storage]


Compute nodes

● Several multi-core CPU sockets (~12-30 cores each)
● Large main memory (~2-4 GB per core)
● Optionally accelerators (e.g. GPUs)


Network

● Simulation codes:
  – Tightly connected parallelization
  – Many messages passed
● Requirements:
  – Low latency (~µs)
  – High bandwidth (→ I/O)
● Different topologies possible
● Examples: Infiniband, Omnipath, Aries, ...
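Why both latency and bandwidth matter can be seen with the simplest possible cost model: transfer time = latency + size / bandwidth. The sketch below uses a latency of 1 µs (the order of magnitude named above) and an assumed bandwidth of 10 GB/s. Both numbers are placeholders, not measurements of any particular interconnect.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple linear cost model for sending one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


LATENCY = 1e-6     # 1 microsecond (order of magnitude from the slide)
BANDWIDTH = 10e9   # 10 GB/s (assumed for illustration)

for size in (8, 1024, 1024 ** 2, 1024 ** 3):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    latency_share = LATENCY / t
    print(f"{size:>12} bytes: {t:.3e} s, latency share {latency_share:.1%}")
```

In this model small messages are dominated by latency and large transfers by bandwidth, which is why tightly coupled simulation codes need both.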


File system

● Must support high bandwidth from multiple nodes (~ TB from thousands of nodes in parallel)
● Usually parallel file system
  – Files are stored on multiple servers (striping)
  – Metadata stored separately
● Examples: GPFS, Lustre, BeeGFS, ...
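Striping can be pictured as a round-robin mapping from file offsets to servers. The sketch below is a simplified illustration only. The function `locate` and the chosen stripe size and server count are hypothetical, and real file systems such as GPFS or Lustre use more elaborate layouts.

```python
def locate(offset, stripe_size, num_servers):
    """Map a byte offset in a file to (server index, offset on that server).

    Round-robin striping: stripe 0 goes to server 0, stripe 1 to server 1, ...
    and after the last server the mapping wraps around to server 0.
    """
    stripe_index = offset // stripe_size
    server = stripe_index % num_servers
    local_stripe = stripe_index // num_servers
    return server, local_stripe * stripe_size + offset % stripe_size


STRIPE = 1024 * 1024   # 1 MiB stripes (assumed)
SERVERS = 4            # number of storage servers (assumed)

for offset in (0, STRIPE, 3 * STRIPE + 10, 4 * STRIPE + 5):
    print(offset, "->", locate(offset, STRIPE, SERVERS))
```

Because consecutive stripes land on different servers, a large sequential read or write is spread over all of them, which is where the aggregate bandwidth comes from.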


At MPCDF

● Current supercomputer: cobra
  – ~3100 compute nodes with two 20-core Skylake sockets → ~130000 cores
  – 64 nodes with 2 NVIDIA Volta GPUs each
  – Peak performance 10 PFlop/s
  – Storage: GPFS with 5 PB
  – OmniPath interconnect (6 islands with fat tree)
● Archive: ~100 PB HPSS installation
● ~30 smaller clusters
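The core count and the peak performance quoted above can be roughly reproduced. The node, socket and core numbers come from the slide. The clock frequency and the number of floating-point operations per cycle are assumptions made for this sketch, and the GPU nodes are left out.

```python
# From the slide (rounded values).
nodes = 3100
sockets_per_node = 2
cores_per_socket = 20
cores = nodes * sockets_per_node * cores_per_socket

# Assumptions, NOT from the slide:
clock_hz = 2.4e9       # assumed nominal clock frequency
flops_per_cycle = 32   # assumed: 2 FMA units x 8 doubles (512-bit vectors) x 2 flops

cpu_peak_pflops = cores * clock_hz * flops_per_cycle / 1e15

print(f"cores: {cores}")
print(f"estimated CPU peak: {cpu_peak_pflops:.1f} PFlop/s")
```

Under these assumptions the CPU part alone comes close to the quoted 10 PFlop/s. The estimate also shows that peak numbers presuppose fully vectorized code on every core.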


How to work on a supercomputer

● All (Top500) supercomputers use Linux
● Use ssh to get into login nodes
● Access to special software: module system
● Compile application
● Use batch managing software (Slurm, SGE, Torque…) to submit jobs → execution on compute nodes
● Write output to parallel file system
● Analyze output from login nodes
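The workflow above usually ends in a batch script handed to the scheduler. As an illustration for Slurm, the sketch below generates such a script. The helper `render_job_script`, the module names and the executable `./my_simulation` are hypothetical placeholders, and the options a real site requires will differ.

```python
def render_job_script(job_name, nodes, tasks_per_node, time_limit, modules, command):
    """Build the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "module purge",                       # start from a clean environment
    ]
    lines += [f"module load {name}" for name in modules]
    lines += ["", f"srun {command}"]          # launch the tasks on the compute nodes
    return "\n".join(lines) + "\n"


# Hypothetical job: 2 nodes with 40 tasks each, one hour wall-clock limit.
script = render_job_script(
    job_name="demo",
    nodes=2,
    tasks_per_node=40,
    time_limit="01:00:00",
    modules=["gcc", "openmpi"],               # placeholder module names
    command="./my_simulation",                # placeholder executable
)
print(script)
```

Saved to a file such as job.sh on a login node, a script like this would typically be submitted with `sbatch job.sh`.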


How to manage a supercomputer

● Manage thousands of nodes:
  – Install/update OS
  – Monitor health status
● Supply special software stack → compiled to support hardware features (network, vectorization, …)
● Supply archiving system
● Contact vendor for replacing defective hardware
● Infrastructure (power & cooling)


At MPCDF

● OS: SLES
● Software stack:
  – Use SUSE Open Build Service (OBS)
  – Manage complexity of custom installations (all combinations of compiler + MPI library)
  – Write spec files for building RPMs


How to program a supercomputer

● Efficient usage of one node
  – Memory hierarchy (caches)
  – Vector instructions (SSE, AVX)
  – Shared-memory parallelism (threads, OpenMP)
  – Accelerators, GPUs (CUDA, OpenACC, OpenMP)
● Distribute workload on many nodes: parallelism
  – Communication necessary
  – Distributed-memory parallelism (MPI)
● Input/Output: as little as possible, few large files
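Distributing a workload over many processes starts with deciding which process owns which part of the data. The sketch below shows a simple block decomposition in plain Python. In an MPI program each rank would evaluate this for its own rank number. The function name `block_range` is illustrative and not part of any library.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open index range [start, stop) owned by `rank`.

    Items are split into contiguous blocks; if n_items is not divisible by
    n_ranks, the first (n_items % n_ranks) ranks get one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


n_items, n_ranks = 10, 4
for rank in range(n_ranks):
    start, stop = block_range(n_items, n_ranks, rank)
    print(f"rank {rank}: items {start}..{stop - 1} ({stop - start} items)")
```

Neighbouring blocks share a boundary, and in a grid-based simulation the data along that boundary is what has to be communicated between ranks.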


Why always larger computations?

● Simulation of physical processes: partial differential equations
● Examples: mechanics, fluid dynamics, electromagnetism, quantum mechanics, …
● Discretization on grid → certain error
● Error decreases with higher resolution
● Larger simulation areas possible with larger grid
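How the discretization error shrinks with resolution can be demonstrated on a toy problem. The sketch below approximates the second derivative of sin(x) with the standard central difference on a grid with n intervals. The test function and grid sizes are chosen for illustration only and are not taken from any simulation code.

```python
import math


def second_derivative_error(n):
    """Maximum error of the central-difference second derivative of sin(x)
    on the interior points of a uniform grid over [0, pi] with n intervals."""
    h = math.pi / n
    worst = 0.0
    for i in range(1, n):
        x = i * h
        approx = (math.sin(x - h) - 2 * math.sin(x) + math.sin(x + h)) / h ** 2
        exact = -math.sin(x)
        worst = max(worst, abs(approx - exact))
    return worst


previous = None
for n in (16, 32, 64):
    err = second_derivative_error(n)
    ratio = "" if previous is None else f", ratio to previous: {previous / err:.2f}"
    print(f"n = {n:3d}: max error = {err:.3e}{ratio}")
    previous = err
```

For this second-order scheme, doubling the resolution reduces the error by roughly a factor of four.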


Computational Effort

● Error scales e.g. with N⁻²
● 3D CFD simulation:
  – number of grid points ~N³
  – number of time steps also ~N
● Going from 4³ to 8³:
  – Error: factor ¼
  – Computational effort: factor 16
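The factors above follow directly from the scaling laws. A minimal sketch, assuming a second-order error and a number of time steps that grows linearly with N:

```python
def scaling_factors(n_old, n_new, error_order=2, dims=3):
    """Return (error factor, effort factor) when refining from n_old to n_new
    grid points per dimension.

    Error ~ N^(-error_order); effort ~ N^dims grid points times ~N time steps.
    """
    ratio = n_new / n_old
    error_factor = ratio ** (-error_order)
    effort_factor = ratio ** (dims + 1)
    return error_factor, effort_factor


print(scaling_factors(4, 8))  # (0.25, 16.0)
```

Each halving of the grid spacing therefore buys a fourfold smaller error at sixteen times the cost, which is why higher accuracy quickly requires larger machines.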


Example: Supernova simulations

● Simulate core-collapse supernovae with the VERTEX code (developed at MPI for Astrophysics, Garching)
● Computing-time requirements for a single model run:
  – 2D: N_r = 600 radial zones, N_θ = 180 angular zones (1° resolution)
    ● 3 · 10¹⁸ Flops, 10⁶ core-hours or 10⁴ core-days
    ● 4-8 weeks on 180 cores
  – 3D: N_r = 600 radial zones, N_θ · N_φ = 43200 angular zones (1.5° resolution)
    ● 3 · 10²⁰ Flops, 10⁸ core-hours or 10⁶ core-days
    ● 4-8 weeks on 64000 cores (→ from weak scaling)
● Scientific goal: numerical experiments
  – Many model runs: parameter studies → strong scaling required
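The run times quoted above can be converted into core-days as a quick order-of-magnitude check. This sketch simply multiplies wall-clock time by the number of cores and assumes the cores are fully used for the whole run.

```python
def core_days(weeks, cores):
    """Core-days consumed by a run of the given length on the given core count."""
    return weeks * 7 * cores


for label, cores in (("2D", 180), ("3D", 64000)):
    low, high = core_days(4, cores), core_days(8, cores)
    print(f"{label}: {low:.1e} to {high:.1e} core-days on {cores} cores")
```

The results are of the order of 10⁴ core-days for the 2D model and 10⁶ core-days for the 3D model, matching the core-day figures on the slide.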


Conclusions

● HPC: incredible increase of computing capability
● Enables research from astrophysics to weather predictions
● How to build/work on/manage/program a supercomputer
● Outlook:
  – HPC becomes more and more important in research
  – Main problem: power consumption → GPUs/accelerators

● Contact: [email protected]


Acknowledgements

● Thanks to my colleagues Markus Rampp and Klaus Reuter for sharing slides
● Thanks to ATIX for organizing the Stammtisch
