Programming the new KNL Cluster at LRZ: Introduction
Dr. Volker Weinberg (LRZ), January 24-25, 2018, LRZ
LRZ in the HPC Environment
HLRS@Stuttgart JSC@Jülich LRZ@Garching
PRACE has 25 members, representing European Union Member States and Associated Countries.
Bavarian Contribution to National Infrastructure
German Contribution to European Infrastructure
PRACE Training: Advanced Training Centre (PATC) Courses
LRZ is part of the Gauss Centre for Supercomputing (GCS), which is one of the six PRACE Advanced Training Centres (PATCs) that started in 2012:
− Barcelona Supercomputing Center (Spain)
− CINECA − Consorzio Interuniversitario (Italy)
− CSC – IT Center for Science Ltd (Finland)
− EPCC at the University of Edinburgh (UK)
− Gauss Centre for Supercomputing (Germany)
− Maison de la Simulation (France)
Mission: Serve as European hubs and key drivers of advanced high-quality training for researchers working in the computational sciences.
http://www.training.prace-ri.eu/
Tentative Agenda: Wednesday
● Wednesday, January 24, 2018, Kursraum 2, H.U.010 (course room)
● 09:00-10:00 Welcome & Introduction (Weinberg)
● 10:00-10:30 Overview of the Intel MIC architecture (Allalen)
● 10:30-11:00 Coffee break
● 11:00-11:30 Overview of Intel MIC programming & Using the LRZ Linux-Cluster (Allalen)
● 11:30-12:00 Hands-on
● 12:00-13:00 Lunch break
● 13:00-14:00 Guided CoolMUC3/SuperMUC Tour (Weinberg/Allalen)
● 14:00-15:00 Vectorisation and basic Intel Xeon Phi performance optimisation (Allalen)
● 15:00-15:30 Coffee break
● 15:30-16:00 Hands-on
Tentative Agenda: Thursday
● Thursday, January 25, 2018, Kursraum 2, H.U.010 (course room)
● 09:00-10:30 Code optimization process for KNL (Iapichino)
● 10:30-11:00 Coffee break
● 11:00-12:00 Intel profiling tools and roofline model (Iapichino)
● 12:00-13:00 Lunch break
● 13:00-14:30 KNL Memory Modes and Cluster Modes, MCDRAM (Weinberg)
● 14:30-15:00 Coffee break
● 15:00-16:00 Hands-on
● 16:00-16:30 Wrap-up
Information
● Lecturers:
− Dr. Momme Allalen (LRZ)
− Dr. Luigi Iapichino (LRZ)
− Dr. Volker Weinberg (LRZ)
● Complete lecture slides & exercise sheets:
− https://www.lrz.de/services/compute/courses/x_lecturenotes/knl_course_2018/
− https://tinyurl.com/y9hmln56
● Examples under:
− /lrz/sys/courses/KNL
Intel Xeon Phi @ LRZ and EU
Evaluating Accelerators at LRZ
Research at LRZ within PRACE & KONWIHR:
● CELL programming
− 2008-2009: Evaluation of CELL programming.
− IBM announced the discontinuation of CELL in Nov. 2009.
● GPGPU programming
− Regular GPGPU computing courses at LRZ since 2009.
− Evaluation of GPGPU programming languages:
− CAPS HMPP
− PGI accelerator compiler
(the two directive-based models above evolved into OpenACC and OpenMP 4.x)
− CUDA, cuBLAS, cuFFT
− PyCUDA/R
− Regular Deep Learning Workshops since 2018.
● Intel Xeon Phi programming
− Larrabee (2009) → Knights Ferry (2010) → Knights Corner → Intel Xeon Phi (2012) → KNL (2016)
IPCC (Intel Parallel Computing Centre)
● New Intel Parallel Computing Centre (IPCC) since July 2014: Extreme Scaling on MIC/x86
● Chair of Scientific Computing at the Department of Informatics of the Technische Universität München (TUM) & LRZ
● https://software.intel.com/de-de/ipcc#centers
● https://software.intel.com/de-de/articles/intel-parallel-computing-center-at-leibniz-supercomputing-centre-and-technische-universit-t
● Codes:
− Simulation of Dynamic Ruptures and Seismic Motion in Complex Domains: SeisSol
− Numerical Simulation of Cosmological Structure Formation: GADGET
− Molecular Dynamics Simulation for Chemical Engineering: ls1 mardyn
− Data Mining in High Dimensional Domains Using Sparse Grids: SG++
IPCC (Intel Parallel Computing Centre)
● Numerical Simulation of Cosmological Structure Formation: GADGET
● Code optimization through:
− better threading parallelism (dynamic scheduling, lockless version),
− data layout optimization (AoS → SoA),
− promoting more efficient vectorization.
● Baruffa, Iapichino, Hammer & Karakasis: Performance optimisation of Smoothed Particle Hydrodynamics algorithms for multi/many-core architectures, Proceedings of HPCS 2017. Also available at https://arxiv.org/abs/1612.06090. Awarded as Outstanding Paper.
IPCC (Intel Parallel Computing Centre)
− Numerical Simulation of Cosmological Structure Formation: GADGET
− Improved speed-up of the isolated kernel on KNL with respect to the original version: 5.7x
− Parallel efficiency: 73%
Intel Xeon Phi and GPU Training @ LRZ
28.-30.4.2014 @ LRZ (PATC): KNC+GPU
27.-29.4.2015 @ LRZ (PATC): KNC+GPU
3.-4.2.2016 @ IT4Innovations: KNC
27.-29.6.2016 @ LRZ (PATC): KNC+KNL
28.9.2016 @ PRACE Seasonal School, Hagenberg: KNC
7.-8.2.2017 @ IT4Innovations (PATC): KNC
26.-28.6.2017 @ LRZ (PATC): KNL
June 2018 @ LRZ (PATC tbc.): KNL
References: http://inside.hlrs.de/ - inSiDE, Vol. 12, No. 2, p. 102, 2014; inSiDE, Vol. 13, No. 2, p. 79, 2015; inSiDE, Vol. 14, No. 1, p. 76f, 2016; inSiDE, Vol. 14, No. 2, p. 25ff, 2016; inSiDE, Vol. 15, No. 1, p. 48ff, 2017; inSiDE, Vol. 15, No. 2, p. 55ff, 2017
CzeBaCCA Project
● Czech-Bavarian Competence Team for Supercomputing Applications (CzeBaCCA)
● BMBF-funded project (Jan. 2016 - Dec. 2017) to:
− Foster Czech-German Collaboration in Simulation Supercomputing
− a series of workshops will initiate and deepen collaboration between Czech and German computational scientists
− Establish Well-Trained Supercomputing Communities
− a joint training program will extend and improve trainings on both sides
− Improve Simulation Software
− establish and disseminate role models and best practices of simulation software in supercomputing
CzeBaCCA Trainings and Workshops
● Intel MIC Programming Workshop, 3 – 4 February 2016, Ostrava, Czech Republic
● Scientific Workshop: SeisMIC - Seismic Simulation on Current and Future Supercomputers, 5 February 2016, Ostrava, Czech Republic
● PRACE PATC Course: Intel MIC Programming Workshop, 27 - 29 June 2016, Garching, Germany
● Scientific Workshop: High Performance Computing for Water Related Hazards, 29 June - 1 July 2016, Garching, Germany
● PRACE PATC Course: Intel MIC Programming Workshop, 7 – 8 February 2017, Ostrava, Czech Republic
● Scientific Workshop: High performance computing in atmosphere modelling and air related environmental hazards, 9 February 2017, Ostrava, Czech Republic
● PRACE PATC Course: Intel MIC Programming Workshop, 26 – 28 June 2017, Garching, Germany
● Scientific Workshop: HPC for natural hazard assessment and disaster mitigation, 28 - 30 June 2017, Garching, Germany
CzeBaCCA Trainings and Workshops
1st workshop series: February 2016 @ IT4I
https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
http://inside.hlrs.de/ - inSiDE, Vol. 14, No. 1, p. 76f, 2016
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p. 27
CzeBaCCA Trainings and Workshops
2nd workshop series: June 2016 @ LRZ
https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
http://inside.hlrs.de/ - inSiDE, Vol. 14, No. 2, p. 25ff, 2016
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p. 27
CzeBaCCA Trainings and Workshops
3rd workshop series: February 2017 @ IT4I
https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
http://inside.hlrs.de/ - inSiDE, Vol. 15, No. 1, p. 48ff, 2017
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p. 27
CzeBaCCA Trainings and Workshops
4th workshop series: June 2017 @ LRZ
https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
http://inside.hlrs.de/ - inSiDE, Vol. 15, No. 2, p. 55ff, 2017
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p. 27
Intel Xeon Phi @ Top500 November 2017
● https://www.top500.org/list/2017/11/
● #2: Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200 GHz, TH Express-2, Intel Xeon Phi 31S1P, National Super Computer Center in Guangzhou, China
● #7: Trinity - Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect, Cray Inc., DOE/NNSA/LANL/SNL, United States
● #8: Cori - Cray XC40, Intel Xeon Phi 7250 68C 1.4 GHz, Aries interconnect, Cray Inc., DOE/SC/LBNL/NERSC, United States
● #9: Oakforest-PACS - PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, Fujitsu, Joint Center for Advanced High Performance Computing, Japan
● #12: Stampede2 - PowerEdge C6320P, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, Dell, Texas Advanced Computing Center/Univ. of Texas, United States
● #14: Marconi, Intel Xeon Phi - CINECA Cluster, Intel Xeon Phi 7250 68C 1.4 GHz, Intel Omni-Path, Lenovo, CINECA, Italy
● #23: Tera-1000-2 Part 1 - Bull Sequana X1000, Intel Xeon Phi 7250 68C 1.4 GHz, Bull BXI 1.2, Bull, Atos Group, Commissariat a l'Energie Atomique (CEA)
● … several non-European systems …
● #87: Salomon - SGI ICE X, Xeon E5-2680v3 12C 2.5 GHz, Infiniband FDR, Intel Xeon Phi 7120P, HPE, IT4Innovations National Supercomputing Center, VSB-Technical University of Ostrava, Czech Republic
PRACE: Best Practice Guides
● http://www.prace-ri.eu/best-practice-guides/
Best Practice Guides - Overview
● The following four Best Practice Guides (BPGs) have been written within PRACE-4IP by 13 authors from 8 institutions and were published in PDF and HTML format on the PRACE website in January 2017:
− Intel® Xeon Phi™ BPG: update of the PRACE-3IP BPG
− Haswell/Broadwell BPG: written from scratch
− Knights Landing BPG: written from scratch
− GPGPU BPG: update of the PRACE-2IP mini-guide
● Online under: http://www.prace-ri.eu/best-practice-guides/
Intel MIC within PRACE: Intel Xeon Phi (KNC) Best Practice Guide
− Created within PRACE-3IP + PRACE-4IP.
− Written in DocBook XML.
− 122 pages, 13 authors.
− Now including information about existing Xeon Phi based systems in Europe: Avitohol @ BAS (NCSA), MareNostrum @ BSC, Salomon @ IT4Innovations, SuperMIC @ LRZ.
− http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-january-2017/
− http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Intel-Xeon-Phi-1.pdf
Intel MIC within PRACE: Knights Landing Best Practice Guide
− Created within PRACE-4IP.
− Written in DocBook XML.
− 85 pages, 3 authors.
− General information about the KNL architecture and programming environment.
− Benchmark & application performance results.
− http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017/
− http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Knights-Landing.pdf
Best Practice Guides - Dissemination
DEEP/ER Project: Towards Exascale
● Design of an architecture leading to Exascale.
● Development of hardware: implementation of a Booster based on MIC processors and the EXTOLL interconnect.
SuperMIC ∈ SuperMUC @ LRZ
SuperMUC System Overview
SuperMUC Phase 2: Moving to Haswell
System diagram (summary of the figure): 6 Haswell islands with 512 nodes per island and warm-water cooling; Haswell-EP nodes with 24 cores/node and 2.67 GB/core; Mellanox FDR14 island switches (non-blocking within an island, pruned tree towards the spine InfiniBand switches); I/O servers providing GPFS for $WORK and $SCRATCH (weak coupling of Phases 1 and 2); connections to the LRZ infrastructure (NAS, archive, visualization) and to Internet/Grid services; the thin and fat islands of SuperMUC Phase 1 attach via Mellanox FDR10 island switches (likewise non-blocking within an island, pruned tree).
SuperMIC: Intel Xeon Phi Cluster
SuperMIC ∈ SuperMUC @ LRZ
● 32 compute nodes (diskless)
− SLES11 SP3
− 2 Ivy Bridge host processors per node with 16 cores in total
− 2 Intel Xeon Phi 5110P coprocessors per node with 60 cores each
− 64 GB (host) + 2 x 8 GB (Xeon Phi) memory
− 2 MLNX CX3 FDR PCIe cards attached to each CPU socket
● Interconnect
− Mellanox InfiniBand FDR14
− Through the bridge interface, all nodes and MICs are directly accessible
● 1 login and 1 management server (batch system, xCAT, …)
● Air-cooled
● Supports both native and offload mode (see the sketch below)
● Batch system: LoadLeveler
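As an illustration of the native mode mentioned above, a KNC binary can be cross-compiled on the host and started directly on the coprocessor. A minimal sketch with hypothetical file and host names (-mmic is the Intel compiler's KNC cross-compilation switch; additional runtime libraries may need to be made available on the card):

icc -mmic -O2 native_app.c -o native_app.mic    # cross-compile for the Xeon Phi (KNC)
scp native_app.mic mic0:                        # copy the binary to the first coprocessor
ssh mic0 ./native_app.mic                       # run it natively on the card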
SuperMIC Network Access
SuperMIC Access
● Description of SuperMIC:● https://www.lrz.de/services/compute/supermuc/supermic/
CoolMUC3 @ Linux-Cluster @ LRZ
First self-assembled Linux cluster (1999-2002)
Cluster components (2012)
SGI UltraViolet with air guides in front to improve cooling efficiency (2012)
CooLMUC2 (2015): The six racks to the left
LRZ Linux Cluster Overview
● The LRZ Linux Cluster consists of several segments with different types of interconnect and different sizes of shared memory. All systems have a (virtual) 64-bit address space:
− CooLMUC2: cluster with 28-way Haswell-based nodes and FDR14 InfiniBand interconnect, used for both serial and parallel processing
− Intel Broadwell-based 6 TByte shared-memory server HP DL580
− CooLMUC3: cluster with 64-way KNL 7210-F many-core processors and Intel Omni-Path OPA1 interconnect, for parallel/vector processing
● Based on the various node types, the LRZ Linux Cluster offers a wide span of capabilities:
− mixed shared and distributed memory
− large software portfolio
− flexible usage due to various available memory sizes
− parallelization by message passing (MPI)
− shared-memory parallelization with OpenMP or pthreads
− mixed (hybrid) programming with MPI and OpenMP (see the sketch below)
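A minimal sketch of how such a hybrid MPI+OpenMP build and run could look on CoolMUC-3; the module names and the source file hybrid_hello.c are assumptions, while the Intel wrappers and flags correspond to the toolchain listed later in this introduction:

module load intel mpi.intel              # assumed module names
mpiicc -qopenmp -O2 -xMIC-AVX512 hybrid_hello.c -o hybrid_hello
export OMP_NUM_THREADS=16                # 16 OpenMP threads per MPI rank
mpiexec -n 4 ./hybrid_hello              # 4 ranks x 16 threads = 64 cores of one KNL node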
Linux Cluster System Overview
Fields per system: name; CPU; cores per node; RAM per node; total nodes/cores; max. job limits (nodes/cores/walltime/aggregated RAM); queue; login node (job submission).

− Linux-Cluster CoolMUC2: Intel Xeon E5-2697 v3 ("Haswell"), 28 cores per node, 64 GB RAM per node; total 384 nodes / 10752 cores; max. job: 60 nodes / 1680 cores, 48 h walltime, 3.8 TB aggregated RAM; queue: mpp2; login: lxlogin5.lrz.de, lxlogin6.lrz.de, lxlogin7.lrz.de
− Linux-Cluster CoolMUC2: Intel Xeon E5-2697 v3 ("Haswell"), 28 cores per node, 64 GB RAM per node; total 1 node / 28 cores; max. job: 1 node / 28 cores, 96 h walltime, 64 GB aggregated RAM; queue: serial; login: lxlogin5.lrz.de, lxlogin6.lrz.de, lxlogin7.lrz.de
− Linux-Cluster Hugemem: Intel Xeon E5-2660 v2 ("Sandy Bridge"), 20 cores per node, 240 GB RAM per node; total 7 nodes / 220 cores; max. job: 1 node / 20 cores, 168 h walltime, 240 GB aggregated RAM; queue: hugemem; login: lxlogin5.lrz.de, lxlogin6.lrz.de, lxlogin7.lrz.de
− Linux-Cluster Teramem: Intel Xeon E7-8890 v4, 96 cores per node, 6144 GB RAM per node; total 1 node / 96 cores; max. job: 1 node / 96 cores, 48 h walltime, 6.1 TB aggregated RAM; queue: inter; login: lxlogin5.lrz.de, lxlogin6.lrz.de, lxlogin7.lrz.de
− Linux-Cluster Many Core (CoolMUC3): Intel Xeon Phi ("Knights Landing"), 64 cores per node, 96 GB RAM + 16 GB HBM per node; total 148 nodes; max. job: 32 nodes, 24 h walltime, 2880 GB RAM / 512 GB HBM aggregated; queue: mpp3; login: lxlogin8.lrz.de
CoolMUC3 System Procurement Vision
● To supply its users with a system that is suited for processing highly vectorizable and thread-parallel applications,
● that provides good scalability across node boundaries for strong scaling, and
● that deploys state-of-the-art high-temperature water-cooling technologies so that system operation avoids heat transfer to the ambient air of the computer room.
CoolMUC3 System Overview
● Consists of 148 many-core Intel "Knights Landing" (KNL) compute nodes (Xeon Phi 7210-F hosts).
● The nodes are connected to each other via an Intel Omni-Path high-performance network in a fat-tree topology.
● The theoretical peak performance is 400 TFlop/s; the LINPACK performance of the complete system is 255 TFlop/s.
● A standard Intel® Xeon® login node is available for development work and job submission.
CoolMUC3 System Overview
● CoolMUC-3 comprises three warm-water-cooled racks, using an inlet temperature of at least 40 °C.
● Liquid-cooled power supplies and Omni-Path switches are deployed.
● The racks are thermally insulated to suppress radiative losses.
● The liquid-cooled racks operate entirely without fans.
● A complementary rack for air-cooled components (e.g. management servers) uses less than 3% of the system's total power budget.
● With 4.96 GFlops/Watt (according to the strict Green500 level-3 measurement methodology), CooLMUC-3 is one of the most energy-efficient x86 systems worldwide.
● The system is the result of an established partnership between LRZ and Megware.
CoolMUC3 (2017): Really Cool
CoolMUC3 System Overview
Hardware:
− Number of nodes: 148
− Cores per node: 64
− Hyperthreads per core: 4
− Core nominal frequency: 1.3 GHz
− Memory (DDR4) per node: 96 GB (bandwidth 80.8 GB/s)
− High-bandwidth memory per node: 16 GB (bandwidth 460 GB/s; see the check below)
− Bandwidth to interconnect per node: 25 GB/s (2 links)
− Number of Omni-Path switches (100SWE48): 10 + 4 (48 ports each)
− Bisection bandwidth of the interconnect: 1.6 TB/s
− Latency of the interconnect: 2.3 µs
− Peak performance of the system: 394 TFlop/s

Infrastructure:
− Electric power of the fully loaded system: 62 kVA
− Percentage of waste heat to warm water: 97%
− Inlet temperature range for water cooling: 30 … 50 °C
− Temperature difference between outlet and inlet: 4 … 6 °C

Software (OS and development environment):
− Operating system: SLES12 SP2 Linux
− MPI: Intel MPI 2017, alternatively OpenMPI
− Compilers: Intel icc, icpc, ifort 2017
− Performance libraries: MKL, TBB, IPP, DAAL
− Tools for performance and correctness analysis: Intel Cluster Tools
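In flat mode the 16 GB of MCDRAM listed above is exposed as a separate, CPU-less NUMA node, which can be checked directly on a compute node. A minimal sketch, assuming numactl is available there:

numactl -H                    # flat mode: NUMA node 1 is the 16 GB MCDRAM, without CPUs
lscpu | grep "Model name"     # reports the Intel Xeon Phi 7210 processor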
Linux Cluster System Access
ssh -Y lxlogin5.lrz.de -l xxyyyzz    Haswell (CooLMUC-2) login node
ssh -Y lxlogin6.lrz.de -l xxyyyzz    Haswell (CooLMUC-2) login node
ssh -Y lxlogin7.lrz.de -l xxyyyzz    Haswell (CooLMUC-2) login node
gsissh -Y lxgt2.lrz.de               login node for GSI-SSH
ssh -Y lxlogin8.lrz.de -l xxyyyzz    KNL cluster (CooLMUC-3) login node
Linux Cluster System Documentation
● Linux-Cluster:
− https://www.lrz.de/services/compute/linux-cluster/
● CoolMUC-3:
− https://www.lrz.de/services/compute/linux-cluster/coolmuc3/initial_operation/
− https://www.lrz.de/services/compute/linux-cluster/coolmuc3/overview/
CoolMUC3 SLURM
● Submit a job: sbatch --reservation=KNL_Course job.sh
● List your own jobs: squeue -u a2c06aa
● Cancel a job: scancel jobid
● Interactive access (a combined example follows below):
salloc --nodes=1 --time=02:00:00 --constraint=flat,quad --reservation=KNL_Course
srun --pty bash --reservation=KNL_Course
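Putting these commands together for the course batch file, a typical hands-on sequence might look as follows; the --clusters/-M flag may be needed on LRZ's multi-cluster setup to address the mpp3 (CoolMUC-3) cluster, and the job ID is a placeholder:

sbatch --reservation=KNL_Course /lrz/sys/courses/KNL/batch-flat-quad.sh
squeue -M mpp3 -u a2c06aa     # check the state of your jobs on CoolMUC-3
scancel -M mpp3 <jobid>       # cancel the job if necessary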
CoolMUC3 SLURM Simplest Batch File
-> /lrz/sys/courses/KNL/batch-flat-quad.sh
#!/bin/bash
#SBATCH -o /home/hpc/a2c06/a2c06aa/test.%j.%N.out   # standard output file (%j = job ID, %N = node name)
#SBATCH -D /home/hpc/a2c06/a2c06aa/                 # working directory of the job
#SBATCH -J jobname                                  # job name
#SBATCH --clusters=mpp3                             # submit to the CoolMUC-3 (KNL) cluster
#SBATCH --get-user-env                              # export the user environment into the job
#SBATCH --time=02:00:00                             # walltime limit
#SBATCH --constraint=flat,quad                      # request nodes in flat/quadrant mode
commands                                            # application start-up goes here (see the sketch below)
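A hedged sketch of what the "commands" placeholder could contain for a node in flat/quadrant mode; the module names and the executable ./my_app are assumptions:

module load intel mpi.intel               # assumed module names
export OMP_NUM_THREADS=64                 # one thread per physical core
numactl --membind=1 ./my_app              # flat mode: allocate in the 16 GB MCDRAM (NUMA node 1)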
KNL Testsystem
First log in to the Linux Cluster (directly reachable from the course PCs; use only the account a2c06aa!):
ssh lxlogin8.lrz.de -l a2c06aa
Then:
ssh mcct03.cos.lrz.de   or
ssh mcct04.cos.lrz.de
Processor: Intel(R) Xeon Phi(TM) CPU 7210, 64 cores, 4 threads per core. Frequency: 1 - 1.5 GHz.
Theoretical peak performance:
KNL: 64 cores x 1.3 GHz x 8 (SIMD) x 2 (VPUs) x 2 (FMA) = 2662.4 GFLOP/s
Compare with:
KNC: 60 cores x 1 GHz x 8 (SIMD) x 2 (FMA) = 960 GFLOP/s
Sandy Bridge: 2 sockets x 8 cores x 2.7 GHz x 4 (SIMD) x 2 (ALUs) = 345.6 GFLOP/s
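The factor products above can be checked quickly on the shell; a small sketch using bc:

echo "64 * 1.3 * 8 * 2 * 2" | bc -l    # KNL:          2662.4 GFLOP/s
echo "60 * 1.0 * 8 * 2" | bc -l        # KNC:           960.0 GFLOP/s
echo "2 * 8 * 2.7 * 4 * 2" | bc -l     # Sandy Bridge:  345.6 GFLOP/s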
Xeon Phi References
● Books:
− James Reinders, James Jeffers: Intel Xeon Phi Coprocessor High Performance Programming, Morgan Kaufmann Publ. Inc., 2013, http://lotsofcores.com; new KNL edition in July 2016
− Rezaur Rahman: Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress, 2013
− Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, Colfax, 2013, http://www.colfaxintl.com/nd/xeonphi/book.aspx
● Training material by CAPS, TACC, EPCC
● Intel training material and webinars
● V. Weinberg (Editor) et al.: Best Practice Guide - Intel Xeon Phi v2, http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-january-2017/ and references therein
● Ole Widar Saastad (Editor) et al.: Best Practice Guide - Knights Landing, http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017/
Acknowledgements
● IT4Innovations, Ostrava.
● Partnership for Advanced Computing in Europe (PRACE)
● Intel
● BMBF (Federal Ministry of Education and Research)
● Dr. Karl Fürlinger (LMU)
● J. Cazes, R. Evans, K. Milfeld, C. Proctor (TACC)
● Adrian Jackson (EPCC)
And now …
Enjoy the course!