Programming the new KNL Cluster at LRZ: Hardware Overview

Dr. Momme Allalen, January 24-25, 2018 @ LRZ


Page 1: Programming the new KNL Cluster at LRZ: Hardware Overview

Programming the new KNL Cluster at LRZ: Hardware Overview

Dr. Momme Allalen, January 24-25, 2018 @ LRZ

Page 2: Programming the new KNL Cluster at LRZ: Hardware Overview

Programming the new KNL Cluster at LRZ

Agenda

● Intro: Multicore systems & accelerators in HPC
● Architecture overview of the Intel Xeon Phi products (MIC)
● KNL vs. KNC
● KNC and KNL programming models
● Hands-on

Page 3: Programming the new KNL Cluster at LRZ: Hardware Overview

Multicore Processors

How many cores is too many?
● Intel/AMD: 2, 4, 8, 16, 40 cores
● IBM: Cell with 8-16 SPUs; POWER8 with 12 cores and 8 threads per core
● NVIDIA Pascal/Volta GPUs: 3584/5120 cores

Multicore is hardware and software together (they challenge and inspire each other).

More transistors, worse reliability: error/fault detection, correction and recovery.

Memory: a memory wall due to bandwidth (scalability?), a memory wall due to power (the interconnect needs power), and memory size grows, but data always grows even more.


“The more cores your CPU contained, the faster it would perform overall!”

Page 4: Programming the new KNL Cluster at LRZ: Hardware Overview

Why do we need “accelerators” in HPC?

• In the past, computers got faster by increasing the clock frequency of the core, but this has now reached its limit, mainly due to power requirements (voltage could no longer be decreased).

• Today, processor cores are not getting any faster; instead, the number of cores per chip increases and registers are getting wider.

• In HPC, we need a chip that can provide:
  • higher computing performance
  • high power efficiency: keep the power per core as low as possible.
• Developing new chips is incredibly expensive, so we must make maximum use of existing technology.


Page 5: Programming the new KNL Cluster at LRZ: Hardware Overview

Why do we need “accelerators” in HPC?

• One solution is a heterogeneous system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support.

• Two types of hardware options: Intel Xeon Phi (KNC) and Nvidia GPUs.

• These can perform many parallel operations every clock cycle while keeping the power per core as low as possible.


Page 6: Programming the new KNL Cluster at LRZ: Hardware Overview

[Figure: Nvidia GPU and Intel Xeon Phi coprocessor cards]

Page 8: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Multi-Core Architecture

● Intel Xeon processors are for general-purpose computing.

● The current architectures are Haswell and Broadwell; Skylake is upcoming.

Page 9: Programming the new KNL Cluster at LRZ: Hardware Overview

Architectures Comparison (CPU vs GPU)

• CPU: a large cache and sophisticated flow control minimise latency for arbitrary memory accesses.

• GPU: simple flow control; more transistors devoted to parallel computation (up to 21 billion on the Nvidia Volta GPU); SIMD execution.

[Diagram: CPU with control logic, cache, a few ALUs and DRAM vs. GPU with many ALUs and DRAM]

                  | Intel Xeon E5-2697v4 “Broadwell” | Nvidia GPU P100    | Nvidia GPU V100
Cores @ clock     | 2 x 18 cores @ ≥ 2.3 GHz         | 56 SMs @ 1.4 GHz   | 80 SMs
SP perf. / core   | ≥ 73.6 GFlop/s                   | up to 166 GFlop/s  |
SP peak           | ≥ 2.6 TFlop/s                    | up to 10.6 TFlop/s | up to 15 TFlop/s
Transistors / TDP | 2 x 7 billion / 2 x 145 W        | 14 billion / 300 W | 21 billion / 300 W
Bandwidth         | 2 x 62.5 GB/s                    | 510 GB/s           | up to 900 GB/s


Page 10: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Xeon Phi Products: Intel Many Integrated Core (MIC) Architecture

• Xeon Phi Coprocessor: the first product, released in 2012, was named Knights Corner (KNC); it was the first architecture supporting 512-bit vectors.

• Xeon Phi Processor: the 2nd generation, announced at ISC16 in June 2016, is named Knights Landing (KNL); it also supports 512-bit vectors, with a new instruction set called Intel Advanced Vector Extensions 512 (Intel AVX-512).

A specialised platform for highly demanding computing applications.


Page 11: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel MIC Programming Workshop @ LRZ

Xeon Phi performance in practice, e.g. WRF (Weather Research and Forecasting) code performance timeline on a single node:

[Chart: single-precision GF/s, 2011-2016. Xeon: Sandy Bridge (16 cores) 31 GF/s initial, 41 GF/s optimised; Haswell 63 GF/s; Broadwell (36 cores) 73 GF/s. Xeon Phi: KNC (60 cores) 16 GF/s initial; KNC (61 cores) 42 GF/s optimised; pre-KNL 89 GF/s; KNL (68 cores) 125 GF/s with OpenMP, 2 threads per core. Source: http://www2.mmm.ucar.edu/wrf/WG2/benchv3/]


Page 12: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Xeon Phi KNC Architecture

In common with Intel multi-core Xeon CPUs:
• x86 architecture
• C, C++ and Fortran
• Standard parallelization libraries
• Similar optimization methods

KNC specifics:
• PCIe bus connection, IP-addressable
• Runs its own Linux OS
• 6, 8 or 16 GB of GDDR5 memory
• 57 to 61 x86 64-bit cores
• 1.1, 1.053 or 1.238 GHz
• 1 to 2 TFlop/s DP performance
• 512-bit SIMD vector registers
• 4-way hyper-threading
• The SSE & AVX SIMD instruction sets are not supported; instead: Intel Initial Many Core Instructions (IMCI)


Page 13: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Xeon Phi KNC Architecture

- Each core has a private L2 cache.
- A bidirectional ring interconnect connects all the cores, L2 caches, PCIe client logic, GDDR5 memory controllers, etc.


Page 14: Programming the new KNL Cluster at LRZ: Hardware Overview

Network Access on KNC

• Network access is possible using TCP/IP tools like ssh.
• NFS mounts on the Xeon Phi are supported.
• Proxy console / file I/O.

First experiences with the Intel MIC architecture at LRZ, Volker Weinberg and Momme Allalen, inSIDE Vol. 11, No. 2, Autumn 2013.


Page 15: Programming the new KNL Cluster at LRZ: Hardware Overview

Architectures Comparison

[Diagram: CPU: general-purpose architecture (control logic, cache, a few ALUs, DRAM). GPU: massively data parallel (many ALUs, DRAM). MIC: power-efficient multiprocessor x86 design architecture.]


Page 16: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Xeon Phi Knights Landing (KNL)

• 2nd-generation Xeon Phi, successor to the 1st-generation Knights Corner
• New cores
• New processor architecture
• New operation modes
• New memory systems


Page 17: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Xeon Phi Knights Landing (KNL) Architecture

• Bootable CPU
• Up to 72 cores based on Intel Atom cores (Silvermont microarchitecture)
• 4 hyper-threads per core, running @ 1.3 to 1.5 GHz
• 3+ TFlop/s in DP (FMA)
• 6+ TFlop/s in SP (FMA)
• ~384 GB DDR4 (> 90 GB/s)
• 16 GB HBM (MCDRAM), > 450 GB/s STREAM performance
• Binary-compatible with Xeon
• Common operating systems (SUSE, Windows, RHEL, …)


Page 18: Programming the new KNL Cluster at LRZ: Hardware Overview

KNC vs KNL

KNC:
● Co-processor (attached via PCIe)
● Binary incompatible with other architectures
● 61 in-order cores
● 1.1 GHz processor
● Up to 16 GB RAM
● 22 nm process
● One 512-bit VPU
● No support for branch prediction or fast unaligned memory access

KNL:
● No PCIe: self-hosted
● Binary compatible with prior Xeon architectures (but not with Phi/KNC)
● Up to 72 out-of-order cores
● 1.4 GHz processor
● Up to 400 GB RAM (with MCDRAM)
● 14 nm process
● Two 512-bit VPUs
● Support for branch prediction and fast unaligned memory access

Despite these hardware improvements, KNL is still not good for non-optimised code.


Page 19: Programming the new KNL Cluster at LRZ: Hardware Overview

KNL core

• 8-way 32 KB instruction cache
• 8-way 32 KB data cache
• 2 VPUs:
  • only one of them can run legacy (non-AVX-512) ops
  • compile with -xMIC-AVX512 to use both


Page 20: Programming the new KNL Cluster at LRZ: Hardware Overview

Hyperthreading on Xeon Phi

● KNC required at least 2 threads per core for sensible compute performance.
● KNL does not: it can run up to 4 threads per core efficiently.
● Several applications don’t need any hyper-threads.


Page 21: Programming the new KNL Cluster at LRZ: Hardware Overview

Invocation of the Intel MPI compiler

Language   MPI Compiler   Compiler
C          mpiicc         icc
C++        mpiicpc        icpc
Fortran    mpiifort       ifort
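As a quick sanity check of these wrappers, here is a minimal MPI hello-world in C (a sketch; the file name hello_mpi.c is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* initialise the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

$ mpiicc hello_mpi.c -o hello_mpi
$ mpirun -n 4 ./hello_mpi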


Page 22: Programming the new KNL Cluster at LRZ: Hardware Overview

Intel Fabric integrated on KNL processor

● Intel released KNL-F (KNL with an integrated fabric) in November 2016:
  ✓ Intel Omni-Path Architecture
  ✓ High bandwidth and low latency
  ✓ The Omni-Path technology allows building clusters like the LRZ KNL cluster

• Other Xeon Phi based systems: servers, workstations, …

dap.xeonphi.com


Page 23: Programming the new KNL Cluster at LRZ: Hardware Overview

KNL: Cores and threads

• Up to 36 tiles connected by a 2-D mesh interconnect, each tile with 2 physical cores (up to 72 cores with out-of-order instruction execution)
• L2 cache distributed across the mesh interconnect
• Sensitive to data locality


Page 24: Programming the new KNL Cluster at LRZ: Hardware Overview

KNL Tile (Cores and threads)

• 2 cores per tile, each with 2 VPUs (2x AVX-512)
• Up to 72 cores with 4-way hyper-threading: up to 288 logical processors
• Up to 36 MB L2 per KNL

[Diagram: a tile with two cores; each core has decode logic, two vector ALUs and a 32 KB L1 D-cache; the two cores share a 1 MB 16-way L2 cache]


Page 25: Programming the new KNL Cluster at LRZ: Hardware Overview

Parallelism on Xeon Phi

[Diagram: cores, each with two logical cores and a vector unit, attached to shared memory; vectorisation happens inside each core]

Shared memory parallelism: OpenMP

C/C++/Fortran, Python/Java, …; porting is easy. Two parallelisation modes are required: shared memory and vectorisation. Run multiple threads/processes, and let each thread issue vector instructions (SIMD), as in the sketch below.
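A minimal sketch in C of the two modes combined with OpenMP (array and file names are made up): the loop is split across threads, and each thread's chunk is vectorised via the simd clause.

#include <stdio.h>
#include <omp.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Shared-memory parallelism across cores/hyper-threads,
       SIMD vector instructions within each thread */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %.1f, max threads = %d\n", c[0], omp_get_max_threads());
    return 0;
}

Compile, e.g., with $ icc -qopenmp -xMIC-AVX512 vec.c -o vec so that the vector instructions target the KNL's AVX-512 units.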


Page 26: Programming the new KNL Cluster at LRZ: Hardware Overview

KNL and Vector Instruction Sets

● A Xeon binary runs on KNL without recompilation.
● A KNC binary requires recompilation.

[Diagram: instruction sets supported per architecture.
 SNB E5-2600: x87/MMX, SSE*, AVX.
 HSW E5-2600: x87/MMX, SSE*, AVX, AVX2.
 KNL: x87/MMX, SSE*, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512PF, AVX-512ER.]

● Some of these AVX-512 subsets are going to be used in future Xeon architectures like SKX.
  - AVX-512CD, Conflict Detection: improves vectorisation
  - AVX-512PF, Prefetch: gather and scatter prefetch
  - AVX-512ER: exponential and reciprocal instructions


Page 27: Programming the new KNL Cluster at LRZ: Hardware Overview

Memory on KNL

● Two levels of memory on KNL:

1. Main memory
   ● KNL has direct access to all of main memory
   ● Similar latency and bandwidth to a standard Xeon processor
   ● 6 DDR channels

2. Multi-Channel DRAM (MCDRAM)
   ● HBM on-chip: 16 GB
   ● Slightly higher latency than main memory
   ● 8 MCDRAM controllers, 16 channels


Page 28: Programming the new KNL Cluster at LRZ: Hardware Overview

Using MCDRAM on KNL

● At boot time you have to choose one memory mode of operation:

Flat mode:
- MCDRAM is treated as a NUMA node, i.e. as separately addressable memory
- users control what goes into MCDRAM

Cache mode:
- MCDRAM is treated as a transparent last-level cache (LLC)
- MCDRAM is used automatically

Hybrid mode:
- a combination of Flat and Cache
- the ratio can be chosen in the BIOS


Page 29: Programming the new KNL Cluster at LRZ: Hardware Overview

Using MCDRAM on KNL

● Flat mode offers the best performance for applications, but requires changes to the code (memory allocation) or to the execution environment (NUMA binding):

numactl --membind 0 ./exec   # DDR4
numactl --membind 1 ./exec   # MCDRAM
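In flat mode an application can also place individual allocations in MCDRAM from the source code. A minimal sketch in C, assuming the memkind library with its hbwmalloc interface is installed (link with -lmemkind):

#include <stdio.h>
#include <hbwmalloc.h>   /* high-bandwidth-memory allocator from memkind */

int main(void) {
    size_t n = 1000000;

    /* Request the bandwidth-critical array from MCDRAM */
    double *a = hbw_malloc(n * sizeof(double));
    if (a == NULL) {
        fprintf(stderr, "hbw_malloc failed (no MCDRAM available?)\n");
        return 1;
    }
    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;
    printf("a[n-1] = %.1f\n", a[n - 1]);

    hbw_free(a);   /* must pair with hbw_malloc, not free() */
    return 0;
}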


Page 30: Programming the new KNL Cluster at LRZ: Hardware Overview

Programming Models on KNC

• Native mode
  • Programs are started on the Xeon Phi (KNC) itself
  • Cross-compilation using -mmic
  • User access to the Xeon Phi is necessary

• Offload to MIC (KNC)
  • Offload using OpenMP extensions
  • Automatically offload some routines using MKL
    • MKL compiler-assisted offload (CAO)
    • MKL automatic offload (AO)

• MPI tasks on host and MIC
  • Treat the coprocessor like another host
  • MPI only, or MPI + X (X may be OpenMP, TBB, Cilk, OpenCL, etc.)

Offload sketch (main() runs on the host; the marked call runs on the MIC):

main() {
    #pragma offload target(mic)
    myFunction();
}
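A slightly fuller sketch of this offload model in C: with the Intel compiler, code executed inside an offload region must also be compiled for the coprocessor, which the target(mic) attribute arranges (the function and array names here are illustrative).

#include <stdio.h>

#define N 1000

/* Compiled for both host and coprocessor */
__attribute__((target(mic)))
void myFunction(double *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;
}

int main(void) {
    double x[N];
    for (int i = 0; i < N; i++) x[i] = (double)i;

    /* Runs on the MIC if present; x is copied to the card and back */
    #pragma offload target(mic) inout(x)
    myFunction(x, N);

    printf("x[1] = %.1f\n", x[1]);
    return 0;
}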


Page 31: Programming the new KNL Cluster at LRZ: Hardware Overview

Native Mode on KNC

• First ensure that the application is suitable for native execution.
• The application runs entirely on the MIC coprocessor, without offload from a host system.
• Compile the application for native execution using the flag -mmic.
• Build the required libraries for native execution as well.
• Copy the executable and any dependencies, such as runtime libraries, to the coprocessor.
• Mount file shares on the coprocessor for accessing input data sets and saving output data sets.
• Log in to the coprocessor via console, set up the environment and run the executable.
• You can debug the native application via a debug server running on the coprocessor.


Page 32: Programming the new KNL Cluster at LRZ: Hardware Overview

Programming on KNL

• C/C++/Fortran/Python/Java, …
• Feels like a standard node
• But:
  • manycore approach
  • cores are relatively slow
  • intra-node parallelisation is required for good performance
• Binary compatible with previous Xeon processors, but not vice versa


Page 33: Programming the new KNL Cluster at LRZ: Hardware Overview


Lab 1: KNL Cluster @ LRZ


Page 34: Programming the new KNL Cluster at LRZ: Hardware Overview

KNL Cluster @ LRZ - CoolMUC3


Page 35: Programming the new KNL Cluster at LRZ: Hardware Overview

CoolMUC3 Documentation

Documentation: https://www.lrz.de/services/compute/linux-cluster/coolmuc3/


Setup:
- The front-end node lxlogin8.lrz.de must be used for development work for the CooLMUC-3 system:

user@host~$ ssh lxlogin8.lrz.de -l a2c06aa

Enter your password. The login node is a Broadwell node.

- Take a look at the different queues:

$ sinfo                            (prints a lot of details)
$ sinfo -o "%20P %5a %.10l %16F"   (prints the queues and permitted job sizes and runtimes)


Page 36: Programming the new KNL Cluster at LRZ: Hardware Overview

CoolMUC3 Documentation


- We will use the reservation KNL_Course.
- Copy the folder: cp -r /lrz/sys/courses/KNL .
- We can compile for the KNL architecture on the Broadwell and Haswell architectures.


user@mpp3-login8~$ salloc --ntasks=1 --reservation=KNL_Course -t 02:00:00
user@mpp3-login8~$ squeue -u $USER

Page 37: Programming the new KNL Cluster at LRZ: Hardware Overview

CoolMUC3 Documentation

There is a hello world program (hello.c); compile and run it:

$ icc hello.c -o hello
$ ./hello

Try to use the “-xmic-avx512” flag:

$ icc -xmic-avx512 hello.c -o hello
$ ./hello

What happens?


Page 38: Programming the new KNL Cluster at LRZ: Hardware Overview

CoolMUC3 Documentation

Try now to do it with:

$ icc -xcore-avx2 -axmic-avx512 hello.c -o hello
$ ./hello

Why does this run?

Now execute:

user@mpp3-login8~$ salloc --ntasks=1 --reservation=KNL_Course -t 02:00:00
user@mpp3-login8~$ squeue -u $USER
user@mpp3-login8~$ srun --pty bash --reservation=KNL_Course
user@mpp3-login8~$ ssh mpp3r03c05s02

Page 39: Programming the new KNL Cluster at LRZ: Hardware Overview

CoolMUC3 Documentation

Execute:

$ numactl -H

From mpp3-login8, execute:

$ srun ./hello

Now log in to mcct03.cos and mcct04.cos, execute "numactl -H", and compare.

Page 40: Programming the new KNL Cluster at LRZ: Hardware Overview

Submitting Jobs on CoolMUC3

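No example script survives on this slide, so here is a minimal sketch of a CoolMUC-3 batch job in the style of the salloc/srun commands used earlier; the cluster name and reservation are taken from this course, everything else should be checked against the LRZ documentation linked above.

#!/bin/bash
#SBATCH -J knl_hello              # job name
#SBATCH --clusters=mpp3           # CoolMUC-3 (assumed cluster name)
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --reservation=KNL_Course  # course reservation, as used above

srun ./hello

Submit with sbatch job.sh and monitor with squeue -u $USER.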

Page 41: Programming the new KNL Cluster at LRZ: Hardware Overview

I_MPI_FABRICS

● This variable tells MPI which fabric to use at runtime.
● The following network fabrics are available for the Intel Xeon Phi processor and coprocessor:

Fabric      | Network hardware and software used
shm         | Shared memory
tcp         | TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa         | OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl        | DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin and XPMEM (through DAPL)
ofi or tmi  | OFI/TMI-capable network fabrics, such as Intel True Scale Fabric, Intel Omni-Path, InfiniBand and Ethernet


Page 42: Programming the new KNL Cluster at LRZ: Hardware Overview

I_MPI_FABRICS

● The default can be changed by setting the I_MPI_FABRICS environment variable to I_MPI_FABRICS=<fabric> or I_MPI_FABRICS=<intra-node fabric>:<inter-node fabric>.

● Intel® OPA MPI parameters (TMI fabric):
  export I_MPI_FABRICS=shm:tmi

● Intra-node: shared memory; inter-node: DAPL (default on SuperMIC/SuperMUC):
  export I_MPI_FABRICS=shm:dapl

● Intra-node: shared memory; inter-node: TCP (can be used in case of InfiniBand problems):
  export I_MPI_FABRICS=shm:tcp
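For example, to use shared memory within a node and the Omni-Path TMI fabric between nodes (the executable name is illustrative):

$ export I_MPI_FABRICS=shm:tmi
$ mpirun -n 4 ./hello_mpi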
