
Page 1

Application Performance on Multi-core Processors
Scaling, Throughput and an Historical Perspective

M.F. Guest†, C.A. Kitchen†, M. Foster‡ and D. Cho§
† Cardiff University, ‡ Atos, § Mellanox Technologies

12 December 2018

Page 2

Outline

I. Performance Benchmarks and Cluster Systems
   a. Synthetic code performance: STREAM and IMB
   b. Application code performance: DL_POLY, GROMACS, AMBER, GAMESS-UK, VASP and Quantum Espresso
   c. Interconnect performance: Intel MPI and Mellanox's HPCX
   d. Processor family and interconnect – "core to core" and "node to node" benchmarks
II. Impact of Environmental Issues in Cluster Acceptance Tests
   a. Security patches, turbo mode and throughput testing
III. Performance Profile of DL_POLY and GAMESS-UK over the Past Two Decades
IV. Acknowledgements and Summary

Page 3

Contents

I. Review of parallel application performance featuring synthetics and end-user applications across a variety of clusters
   ¤ End-user codes – DL_POLY, GROMACS, AMBER, NAMD, LAMMPS, GAMESS-UK, Quantum Espresso, VASP, CP2K, ONETEP & OpenFOAM
   • Ongoing focus on Intel's Xeon Scalable processors ("Skylake") and AMD's Naples EPYC processor, plus NVIDIA GPUs, including
   ¤ Clusters with dual-socket nodes – Intel Xeon Gold 6148 (20c, 27.5 MB cache, 2.40 GHz) & Xeon Gold 6138 (20c, 27.5 MB cache, 2.00 GHz), plus AMD Naples EPYC 7551 (2.00 GHz) & EPYC 7601 (2.20 GHz) CPUs.
   ¤ Updated review of Intel MPI and Mellanox HPCX performance analysis.
II. How these benchmarks have been deployed in the framework of procurement and acceptance testing, dealing with a variety of issues, e.g. (a) security patches, turbo mode etc. and (b) throughput testing.
III. An historical perspective of two of these codes – DL_POLY and GAMESS-UK – briefly reviewing the development and performance profile of both over the past two decades.

Page 4

The Xeon Skylake Architecture

• The architecture of Skylake is very different from that of the prior "Haswell" and "Broadwell" Xeon chips.
• Three basic variants now cover what was formerly the Xeon E5 and Xeon E7 product lines, with Intel converging the Xeon E5 and E7 chips into a single socket.
• Product segmentation – Platinum, Gold, Silver & Bronze – with 51 variants of the SP chip.
• Also custom versions requested by hyperscale and OEM customers.
• All of these chips differ from each other in a number of ways, including number of cores, clock speed, L3 cache capacity, number and speed of UltraPath links between sockets, number of sockets supported, main memory capacity, width of the AVX vector units, etc.

Page 5

Intel Xeon: Westmere to Skylake

                          Xeon 5600          | Xeon E5-2600        | Xeon E5-2600 v4      | Intel Xeon Scalable
                          (Westmere-EP)      | (Sandy Bridge-EP)   | ("Broadwell-EP")     | ("Skylake")
Cores / threads           up to 6 / 12       | up to 8 / 16        | up to 22 / 44        | up to 28 / 56
Last-level cache          12 MB              | up to 20 MB         | up to 55 MB          | up to 38.5 MB (non-inclusive)
Max memory channels,      3 x DDR3, 1333     | 4 x DDR3, 1600      | 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2400 MHz | 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs, 2666 MHz
speed / socket
New instructions          AES-NI             | AVX 1.0, 8 DP FLOPs/clock | AVX 2.0, 16 DP FLOPs/clock | AVX-512, 32 DP FLOPs/clock
QPI / UPI speed (GT/s)    1 QPI channel @ 6.4 | 2 QPI channels @ 8.0 | 2 QPI channels @ 9.6 | up to 3 UPI @ 10.4
PCIe lanes / controllers  36 lanes PCIe 2.0 on chipset | 40 lanes/socket, integrated PCIe 3.0 | 40 / 10 / PCIe 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
/ speed (GT/s)
Server / workstation TDP  server/workstation: 130W | up to 130W server; 150W workstation | 55 - 145W | 70 - 205W

Page 6

AMD® EPYC™ 7000 Series – SKU Map and FLOP/cycle

SKU                        7601   7551   7501     7451   7401     7351     7301
Freq (base, GHz)           2.2    2.0    2.0      2.3    2.0      2.4      2.2
Turbo (all cores active)   2.7    2.6    2.6      2.9    2.8      2.9      2.7
Turbo (one core active)    3.2    3.0    3.0      3.2    3.0      2.9      2.7
Cores / socket             32     32     32       24     24       16       16
L3 cache size              64 MB (all SKUs)
Memory channels            8 (all SKUs)
Memory freq                2667 MT/s (all SKUs)
TDP (W)                    180    180    155/170  180    155/170  155/170  155/170

Architecture               Sandy Bridge       Haswell     Skylake     EPYC
ISA*                       AVX                AVX2        AVX-512     AVX2
Ops/cycle                  2 (1 ADD, 1 MUL)   4 (2 FMA)   4 (2 FMA)   4 (2 ADD, 2 MUL)
Vector size (DP = 64-bit)  4                  4           8           2
FLOP/cycle                 8                  16          32          8

* Instruction Set Architecture

The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap with Intel SKL and its 2 × 512-bit FMAs. Thus the FP peak on AMD is 4 × lower than on Intel SKL.
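As a worked example of what these figures imply for nominal peak (base clocks from the tables above; sustained AVX clocks, especially under AVX-512, are typically lower in practice):

  EPYC 7601:     32 cores x 2.2 GHz x 8 FLOP/cycle  = 563 DP GFLOPS/socket
  SKL Gold 6148: 20 cores x 2.4 GHz x 32 FLOP/cycle = 1,536 DP GFLOPS/socket

So despite carrying 1.6 x the cores, the EPYC socket has roughly 2.7 x lower nominal DP peak than the Gold 6148.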

Page 7

EPYC Architecture – Naples, Zeppelin & CCX

• Zen cores
  ¤ Private L1/L2 cache
• CCX
  ¤ 4 Zen cores (or fewer)
  ¤ 8 MB shared L3 cache
• Zeppelin
  ¤ 2 CCX (or fewer)
  ¤ 2 DDR4 channels
  ¤ 2 × 16 PCIe lanes
• Naples
  ¤ 4 Zeppelin SoC dies fully connected by Infinity Fabric
  ¤ 4 NUMA nodes!

[Diagram: four Zeppelin dies, each with two CCX (4 Zen cores + L2, 8 MB shared L3), 2 DDR4 channels and 2 × 16 PCIe lanes, joined by coherent Infinity Fabric links.]

• Delivers 32 cores / 64 threads, 16 MB L2 cache and 64 MB L3 cache per socket.
• The design also means there are four NUMA nodes per socket, or eight NUMA nodes in a dual-socket system, i.e. different memory latencies depending on whether a die needs data from memory attached to that die or to another die on the fabric.
• The key difference from Intel's Skylake SP architecture is that AMD needs to go off-die within the same socket, where Intel stays on a single piece of silicon.
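Given this topology, process and memory placement matter; a minimal sketch (standard numactl options; the node id and binary name are illustrative):

  # Bind a process to NUMA node 0 (one Zeppelin die) and its local DDR4
  numactl --cpunodebind=0 --membind=0 ./app
  # Show the 8-node layout of a dual-socket Naples system
  numactl --hardware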

Page 8

Intel Skylake and AMD EPYC Cluster Systems

Cluster / Configuration

"Hawk" – Supercomputing Wales cluster at Cardiff comprising 201 nodes, totalling 8,040 cores and 46.08 TB total memory.
• CPU: 2 × Intel Xeon Gold 6148 @ 2.40 GHz with 20 cores each; RAM: 192 GB (384 GB on high-memory and GPU nodes); GPU: 26 × NVIDIA P100 GPUs with 16 GB of RAM on 13 nodes.

"Helios" – 32-node HPC Advisory Council cluster running SLURM: Supermicro SYS-6029U-TR4 / Foxconn Groot 1A42USF00-600-G; dual-socket Intel Xeon Gold 6138 @ 2.00 GHz; Mellanox ConnectX-5 EDR 100 Gb/s InfiniBand/VPI adapters with Socket Direct; Mellanox Switch-IB 2 SB7800 36-port 100 Gb/s EDR InfiniBand switches; memory: 192 GB DDR4 2667 MHz RDIMMs per node.

20-node Bull|Atos AMD EPYC cluster running SLURM: AMD EPYC 7551; 32 CPU cores / 64 threads; base clock 2.0 GHz, max boost clock 3.0 GHz; default TDP 180W; Mellanox EDR 100 Gb/s.

32-node Dell|EMC PowerEdge R7425 AMD EPYC cluster running SLURM: AMD EPYC 7601; 32 CPU cores / 64 threads; base clock 2.2 GHz, max boost clock 3.2 GHz; default TDP 180W; Mellanox EDR 100 Gb/s.

Page 9

Baseline Cluster Systems

Cluster / Configuration

Intel Sandy Bridge clusters:
• "Raven" – 128 × Bull|Atos B510 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.
• Supercomputing Wales – 384 × Fujitsu CX250 EP nodes, each with 2 × Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR InfiniBand.

Intel Broadwell clusters:
• Dell PE R730/R630, Broadwell E5-2697A v4 2.6 GHz 16C – HPC Advisory Council "Thor" cluster, 36 nodes: 2 × Xeon E5-2697A v4 @ 2.6 GHz, 16 core, 145W TDP, 40 MB cache, 256 GB DDR4 2400 MHz; interconnect: ConnectX-4 EDR.
• Atos Broadwell E5-2680 v4 2.4 GHz 16C – 32-node cluster: 2 × Xeon E5-2680 v4 @ 2.4 GHz, 16 core, 145W TDP, 40 MB cache, 128 GB DDR4 2400 MHz; interconnect: Mellanox ConnectX-4 EDR and Intel OPA.

IBM Power8:
• IBM Power8 S822LC with Mellanox EDR – 20 cores, 3.49 GHz with performance CPU governor; 256 GB memory; 1 × IB (EDR) port; 2 × NVIDIA K80 GPUs; IBM PE (Parallel Environment); OS: RHEL 7.2 LE; compilers: xlC 13.1.3, xlf 15.1.3, gcc 4.8.5 (Red Hat), gcc 5.2.1 (from IBM Advance Toolchain 9.0).

Page 10

The Performance Benchmarks

• The test suite comprises both synthetics and end-user applications. Synthetics include the HPCC (http://icl.cs.utk.edu/hpcc/) and IMB (http://software.intel.com/en-us/articles/intel-mpi-benchmarks) benchmarks, IOR and STREAM.
• Variety of "open source" and commercial end-user application codes:
  - GROMACS, LAMMPS, AMBER, NAMD, DL_POLY Classic & DL_POLY 4 (molecular dynamics)
  - Quantum Espresso, SIESTA, CP2K, ONETEP, CASTEP and VASP (ab initio materials properties)
  - NWChem, GAMESS-US and GAMESS-UK (molecular electronic structure)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed, e.g. memory bandwidth and latency, node floating-point performance, interconnect performance (both latency and bandwidth) and sustained I/O performance.

Page 11

EPYC – Compiler and Run-time Options

Compilation: Intel Compilers 2018, Intel MPI 2017 Update 3, FFTW 3.3.5

  Intel SKL:  -O3 -xCORE-AVX512
  AMD EPYC:   -O3 -xAVX2
  AMD EPYC:   -axCORE-AVX-I

Run-time environment (EPYC):

  # Preload the amd-cputype library to navigate the "Genuine Intel" CPU test
  module use /opt/amd/modulefiles
  module load AMD/amd-cputype/1.0
  export LD_PRELOAD=$AMD_CPUTYPE_LIB
  export OMP_PROC_BIND=true
  # export KMP_AFFINITY=granularity=fine
  export I_MPI_DEBUG=5
  export MKL_DEBUG_CPU_TYPE=5

STREAM (Atos clusters):

  module load AMD/amd-cputype/1.0
  icc -o stream.x stream.c -DSTATIC -Ofast -xCORE-AVX2 -qopenmp \
      -DSTREAM_ARRAY_SIZE=800000000 -mcmodel=large -shared-intel
  export OMP_NUM_THREADS=16
  export OMP_PROC_BIND=true
  export OMP_PLACES="{0:4:1}:16:4"   # 1 thread per CCX
  export OMP_DISPLAY_ENV=true

STREAM (Dell|EMC EPYC):

  export OMP_NUM_THREADS=32
  export OMP_PROC_BIND=true
  export OMP_DISPLAY_ENV=true
  export OMP_PLACES="{0},{16},{8},{24},{2},{18},{10},{26},{4},{20},{12},{28},{6},{22},{14},{30},{1},{17},{9},{25},{3},{19},{11},{27},{5},{21},{13},{29},{7},{23},{15},{31}"
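A note on the place syntax above (standard OpenMP interval notation; worth spelling out):

  # "{0:4:1}:16:4" = the place {0,1,2,3} (start 0, length 4, stride 1),
  # replicated 16 times with a stride of 4: {0-3},{4-7},...,{60-63}.
  # That is one place per 4-core CCX across a dual-socket EPYC 7551 node;
  # with OMP_NUM_THREADS=16 and OMP_PROC_BIND=true, one thread is pinned
  # per CCX, giving each thread its own 8 MB L3 slice.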

Page 12

Memory B/W – STREAM Performance: TRIAD [Rate (MB/s)]
(Full node: OMP_NUM_THREADS = cores per node, KMP_AFFINITY=physical)

  Bull B510 "Raven" SNB E5-2670 2.6 GHz        74,309
  ClusterVision IVB E5-2650 v2 2.6 GHz         93,486
  Dell R730 HSW E5-2697 v3 2.6 GHz (T)        118,605
  Dell HSW E5-2660 v3 2.6 GHz (T)             114,367
  Thor BDW E5-2697A v4 2.6 GHz (T)            132,035
  ATOS BDW E5-2680 v4 2.4 GHz (T)             128,083
  Mellanox SKL Gold 6138 2.0 GHz (T)          169,830
  Dell SKL Gold 6142 2.6 GHz (T)              185,863
  "Hawk" Atos SKL Gold 6148 2.4 GHz           196,721
  IBM Power8 S822LC 2.92 GHz                  184,087
  AMD EPYC 7551 2.0 GHz                       303,797
  AMD EPYC 7601 2.2 GHz                       279,640
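For reference, the TRIAD kernel behind these rates is a(j) = b(j) + q * c(j) over double-precision arrays, conventionally counted as 24 bytes of memory traffic per iteration (two 8-byte reads plus one 8-byte write). The EPYC numbers reflect its 8 memory channels per socket against 6 on Skylake and 4 on Broadwell (see the processor tables earlier).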

Page 13

Memory B/W – STREAM Performance per Core: TRIAD [Rate (MB/s) per core]
(Full node: OMP_NUM_THREADS = cores per node, KMP_AFFINITY=physical)

  Bull B510 "Raven" SNB E5-2670 2.6 GHz        4,644
  ClusterVision IVB E5-2650 v2 2.6 GHz         5,843
  Dell R730 HSW E5-2697 v3 2.6 GHz (T)         4,236
  Dell HSW E5-2660 v3 2.6 GHz (T)              5,718
  Thor BDW E5-2697A v4 2.6 GHz (T)             4,126
  ATOS BDW E5-2680 v4 2.4 GHz (T)              4,574
  Mellanox SKL Gold 6138 2.0 GHz (T)           4,246
  Dell SKL Gold 6142 2.6 GHz (T)               5,808
  "Hawk" Atos SKL Gold 6148 2.4 GHz            4,918
  IBM Power8 S822LC 2.92 GHz                   9,204
  AMD EPYC 7551 2.0 GHz                        4,747
  AMD EPYC 7601 2.2 GHz                        4,369

Page 14

MPI Performance – PingPong: IMB Benchmark (Intel), 1 PE / node

[Chart: PingPong bandwidth (Mbytes/sec, higher is better) and latency vs. message length (bytes) for: ATOS AMD EPYC 7601 2.2 GHz (T) EDR; Intel SKL Gold 6148 2.4 GHz (T) OPA; Dell Skylake Gold 6150 2.7 GHz (T) EDR; IBM Power8 S822LC 2.92 GHz IB/EDR; Thor BDW E5-2697A v4 2.6 GHz (T) EDR; Intel BDW E5-2690 v4 2.6 GHz (T) OPA; Dell OPA32 E5-2660 v3 2.6 GHz (T) OPA; Bull HSW E5-2680 v3 2.5 GHz (T) Connect-IB; Dell R720 E5-2680 v2 2.8 GHz (T) Connect-IB; Azure A9 WE (E5-2670 2.6 GHz) IB RDMA; Merlin Xeon E5472 3.0 GHz QC + IB (MVAPICH2 1.4).]

  export I_MPI_DAPL_TRANSLATION_CACHE=1   # memory-resident cache feature in DAPL

Page 15

MPI Collectives – Alltoallv (128 PEs): IMB Benchmark (Intel), latency

[Chart: measured time (usec, lower is better) vs. message length (bytes) on 128 PEs for: Fujitsu CX250 SNB E5-2670 2.6 GHz IB-QDR; ATOS BDW E5-2680 v4 2.4 GHz (T) OPA; Dell|EMC PE R730 BDW E5-2697A v4 2.6 GHz (T) EDR; Dell PE R730 BDW E5-2697A v4 2.6 GHz (T) OPA; Dell|EMC SKL Gold 6130 2.1 GHz (T) OPA; "Helios" Mellanox SKL Gold 6138 2.0 GHz (T); Intel SKL Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR; Dell|EMC SKL Gold 6142 2.6 GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR; Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR.]

EPYC performance with Intel MPI is ~4-6 × worse than that with SKL processors! IPM profiling shows the time-consuming messages are those called by Alltoall & Alltoallv.

Page 16

Application Performance on Multi-core Processors

I.1 The Codes: DL_POLY, GROMACS, NAMD, LAMMPS, GAMESS, NWChem, GAMESS-UK, ONETEP, VASP, SIESTA, CASTEP, Quantum Espresso, CP2K – on a variety of HPC systems.

Page 17

Allinea (ARM) Performance Reports

Allinea Performance Reports provides a mechanism to characterize and understand the performance of HPC application runs through a single-page HTML report.

• Based on Allinea MAP's adaptive sampling technology, which keeps collected data volumes and application overhead low.
• Modest application slowdown (ca. 5%) even with thousands of MPI processes.
• Runs on existing codes: a single command added to execution scripts.
• If submitted through a batch queuing system, the submission script is modified to load the Allinea module and add the 'perf-report' command in front of the required mpiexec command (see the sketch below):
  perf-report mpiexec -n 4 $code
• The report summary characterizes how the application's wallclock time was spent, broken down into CPU, MPI and I/O.
• All examples updated on the Broadwell Mellanox cluster (E5-2697A v4).
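A minimal SLURM sketch of that change (module name, resources and binary are illustrative, not taken from the deck):

  #!/bin/bash
  #SBATCH --nodes=8
  #SBATCH --ntasks-per-node=32
  module load allinea-reports          # assumed module name
  # Prefix the usual launch line with perf-report; nothing else changes
  perf-report mpiexec -n 256 ./dlpoly.x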

Page 18

Molecular Simulation I. DL_POLY

Developed as the CCP5 parallel MD code by W. Smith, T.R. Forester and I. Todorov; UK CCP5 + international user community.

• DL_POLY Classic (replicated data) and DL_POLY 3 & 4 (distributed data – domain decomposition).
• Areas of application: liquids, solutions, spectroscopy, ionic solids, molecular crystals, polymers, glasses, membranes, proteins, metals, solid and liquid interfaces, catalysis, clathrates, liquid crystals, biopolymers, polymer electrolytes.
• Molecular dynamics codes: AMBER, DL_POLY, CHARMM, NAMD, LAMMPS, GROMACS etc.

Page 19

DL_POLY Classic – NaCl Simulation (27,000 atoms; 500 time steps)

[Chart: performance (higher is better) on 32-256 PEs, relative to the Fujitsu HTC X5650 2.67 GHz 6-C (16 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Intel Broadwell E5-2690 v4 2.6 GHz (T) OPA; "Helios" Skylake Gold 6138 2.0 GHz (T) EDR; "Hawk" Atos Skylake Gold 6148 2.4 GHz (T) EDR; Dell Skylake Gold 6150 2.7 GHz (T) EDR; ATOS AMD EPYC 7551 2.0 GHz (T) EDR; Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR.]

Page 20

DL_POLY 4 – Distributed Data

Domain decomposition – distributed data (W. Smith and I. Todorov):

• Distribute atoms and forces across the nodes
  ¤ More memory efficient; can address much larger cases (10^5-10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
  ¤ Communications scale linearly with the number of nodes
• Coulombic energy remains global
  ¤ Adopt the Smooth Particle Mesh Ewald (SPME) scheme, which includes a Fourier-transform-smoothed charge density (reciprocal-space grid typically 64 × 64 × 64 to 128 × 128 × 128)

http://www.scd.stfc.ac.uk//research/app/ccg/software/DL_POLY/44516.aspx

Benchmarks:
1. NaCl simulation: 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water (rigid bonds + SHAKE): 792,960 ions, 50 time steps

Page 21

DL_POLY 4 – Gramicidin Simulation (792,960 atoms; 50 time steps)

[Chart: relative performance (higher is better) on 64-256 PEs, relative to the Fujitsu CX250 E5-2670 2.6 GHz 8-C (32 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS Broadwell E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC E5-2697A v4 2.6 GHz (T) EDR IMPI; "Helios" Skylake Gold 6138 2.0 GHz (T) EDR; Intel Skylake Platinum 8170 2.1 GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos Skylake Gold 6148 2.4 GHz (T) EDR; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7 GHz (T) EDR.]

SKL 6142 2.6 GHz ~ 1.06 × E5-2697 v4 2.6 GHz.

Page 22

DL_POLY 4 – Gramicidin Simulation – EPYC (792,960 atoms; 50 time steps)

[Chart: performance on 64-256 PEs relative to the Fujitsu CX250 E5-2670 2.6 GHz 8-C (32 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS Broadwell E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC E5-2697A v4 2.6 GHz (T) EDR IMPI; "Helios" Skylake Gold 6138 2.0 GHz (T) EDR; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos Skylake Gold 6148 2.4 GHz (T) EDR; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; ATOS AMD EPYC 7551 2.0 GHz (T) EDR; Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR.]

Page 23

DL_POLY 4 – Gramicidin Simulation Performance Report (Smooth Particle Mesh Ewald scheme)

[Charts, 32-256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]

"DL_POLY_4 and Xeon Phi: Lessons Learnt", Alin Marin Elena, Christian Lalanne, Victor Gamayunov, Gilles Civario, Michael Lysaght and Ilian Todorov.

Page 24

Molecular Simulation II. GROMACS

GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package designed for simulations of proteins, lipids and nucleic acids [University of Groningen].

• Single and double precision
• Efficient GPU implementations

Versions under test:
• Version 4.6.1 – 5 March 2013
• Version 5.0.7 – 14 October 2015
• Version 2016.3 – 14 March 2017
• Version 2018.2 – 14 June 2018 (optimised for "Hawk" by Ade Fewings)

Berk Hess et al., "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation", Journal of Chemical Theory and Computation 4 (3): 435-447.

http://manual.gromacs.org/documentation/
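For orientation, a typical MPI launch of the kind benchmarked here (a sketch: gmx_mpi and the .tpr input follow standard GROMACS conventions; names are illustrative, not taken from the deck):

  # 256 ranks, one OpenMP thread each; -resethway restarts the timers
  # mid-run for cleaner benchmark numbers, -noconfout skips the final frame
  mpirun -np 256 gmx_mpi mdrun -s ion_channel.tpr -ntomp 1 -resethway -noconfout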

Page 25

GROMACS Benchmark Cases

Ion channel system
• The 142k-particle ion channel system is the membrane protein GluCl – a pentameric chloride channel embedded in a DOPC membrane and solvated in TIP3P water, using the Amber ff99SB-ILDN force field. This system is a challenging parallelization case due to its small size, but it is one of the most wanted target sizes for biomolecular simulations.

Lignocellulose
• GROMACS Test Case B from the UEA Benchmark Suite: a model of cellulose and lignocellulosic biomass in an aqueous solution. This system of 3.3M atoms is inhomogeneous, and uses reaction-field electrostatics instead of PME; it should therefore scale well.

Page 26

GROMACS – Ion-channel Performance Report

[Charts, 32-256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]

Page 27

GROMACS – Ion Channel Simulation (142k-particle ion channel system; single precision)

"Hawk" Atos cluster – SKL Gold 6148 2.4 GHz (T) nodes with EDR interconnect + dual-P100 GPU nodes.

[Chart: performance (ns/day, higher is better) on 64-256 PEs for GROMACS 4.6.1, 5.0, 2016.3 (single precision, AVX) and 2018.2.]

Page 28

Ion Channel Simulation – Impact of Single Precision (142k-particle system; GROMACS 5.0.7)

[Chart: performance (ns/day, higher is better) on 64-256 PEs for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Thor Dell|EMC PE R730 Broadwell E5-2697A v4 2.6 GHz (T) EDR HPCX; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7 GHz (T) EDR; "Helios" Mellanox Skylake Gold 6138 2.0 GHz (T) EDR {S}; Intel Skylake Gold 6148 2.4 GHz (T) OPA {S}; "Hawk" Atos Skylake Gold 6148 2.4 GHz (T) EDR {S}. ({S} = single precision.)]

Page 29

GROMACS – GPU Performance: Ion Channel Simulation (142k-particle system; GROMACS 2018.2)

"Hawk" Atos cluster – SKL Gold 6148 2.4 GHz (T) nodes with EDR interconnect + dual-P100 GPU nodes.

[Chart: relative performance (higher is better) on 64-320 CPU PEs and on GPU nodes (N=1, 2×GPU; N=2, 4×GPU; N=4, 8×GPU; N=6, 12×GPU).]

Page 30

GROMACS – Lignocellulose Simulation (3,316,463 atoms, reaction-field electrostatics instead of PME; single precision)

"Hawk" Atos cluster – SKL Gold 6148 2.4 GHz (T) nodes with EDR interconnect + dual-P100 GPU nodes.

[Chart: performance (ns/day, higher is better) on 64-256 PEs for GROMACS 4.6.1, 5.0, 2016.3 (single precision, AVX) and 2018.2.]

Page 31

Lignocellulose Simulation – Impact of Single Precision (3,316,463 atoms, reaction-field electrostatics; GROMACS 5.0.7)

[Chart: performance (ns/day, higher is better) on 64-256 PEs for the same set of systems as the ion-channel single-precision comparison on Page 28, with {S} marking single-precision runs.]

Page 32

GROMACS – GPU Performance: Lignocellulose Simulation (3,316,463 atoms, reaction-field electrostatics; GROMACS 2018.2)

"Hawk" Atos cluster – SKL Gold 6148 2.4 GHz (T) nodes with EDR interconnect + dual-P100 GPU nodes.

[Chart: relative performance (higher is better) on 64-320 CPU PEs and on GPU nodes (N=1, 2×GPU; N=2, 4×GPU; N=4, 8×GPU; N=6, 12×GPU).]

Page 33

Molecular Simulation III. The AMBER Benchmarks

• AMBER 16/1 is used, specifically PMEMD and GPU-accelerated PMEMD.
• M01 benchmark: Major Urinary Protein (MUP) + IBM ligand (21,736 atoms)
• M06 benchmark: cluster of six MUPs (134,013 atoms)
• M27 benchmark: cluster of 27 MUPs (657,585 atoms)
• M45 benchmark: cluster of 45 MUPs (932,751 atoms)

All test cases run 30,000 steps × 2 fs = 60 ps of simulation time. Periodic boundary conditions, constant pressure, T = 300 K. Position data written every 500 steps.

R. Salomon-Ferrer, D.A. Case, R.C. Walker, "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci. 3, 198-210 (2013).
D.A. Case, T.E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K.M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang and R. Woods, "The Amber biomolecular simulation programs", J. Comput. Chem. 26, 1668-1688 (2005).
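A representative PMEMD launch for these cases (a sketch using the standard AMBER command-line conventions; file names are illustrative):

  # CPU run of the M45 benchmark on 256 ranks
  mpirun -np 256 pmemd.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45.out
  # GPU-accelerated equivalent on a dual-P100 node
  mpirun -np 2 pmemd.cuda.MPI -O -i mdin -p m45.prmtop -c m45.inpcrd -o m45.out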

Page 34

AMBER – SKL vs. SNB: M06, M27 and M45
(SKL 6148 2.4 GHz // EDR vs. SNB E5-2670 2.6 GHz // QDR; relative performance, higher is better)

  PEs    64    80    96    128   160   240   256   320
  M06   1.31  1.24  1.27  1.27  1.13  1.34  1.45  1.44
  M27   1.34  1.27  1.36  1.57  1.48  1.56  1.65  1.65
  M45   1.36  1.23  1.47  1.39  1.55  1.58  1.70  1.64

Page 35

AMBER – GPU Performance: M45 Simulation
Cluster of 45 Major Urinary Proteins (MUP) + IBM ligand (932,751 atoms). "Hawk" Atos cluster – SKL Gold 6148 2.4 GHz (T) with EDR interconnect + dual-P100 GPU nodes vs. "Raven" 64 SNB E5-2670 PEs.

Relative performance (vs. 64 SNB cores, higher is better):

  CPU PEs:  64 = 1.36, 80 = 1.48, 96 = 1.88, 128 = 2.29, 160 = 2.41, 240 = 3.12, 256 = 3.40, 320 = 3.54
  GPU:      1 node (1 ppn, 2 × GPU) = 2.73;  1 node (2 ppn, 2 × GPU) = 4.21

Page 36

GAMESS-UK – Moving to Distributed Data: the MPI/ScaLAPACK Implementation of the GAMESS-UK SCF/DFT Module

• Pragmatic approach to the replicated-data constraints:
  ¤ MPI-based tools (such as ScaLAPACK) used in place of Global Arrays
  ¤ All data structures except those required for the Fock matrix build (F, P) are fully distributed
• The partially distributed model was chosen because, in the absence of efficient one-sided communications, it is difficult to load balance a distributed Fock matrix build efficiently.
• Obvious drawback: some large replicated data structures are required.
  ¤ These are kept to a minimum. For a closed-shell HF or DFT calculation only 2 replicated matrices are required, 1 × Fock and 1 × density (doubled for UHF).

"The GAMESS-UK electronic structure package: algorithms, developments and applications", M.F. Guest, I.J. Bush, H.J.J. van Dam, P. Sherwood, J.M.H. Thomas, J.H. van Lenthe, R.W.A. Havenith, J. Kendrick, Mol. Phys. 103, No. 6-8, 2005, 719-747.

Page 37

GAMESS-UK Performance – Zeolite Y Cluster
Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3,975 GTOs).

[Chart: performance on 128 and 256 PEs relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull Haswell E5-2695 v3 2.3 GHz Connect-IB; Huawei Fusion CH140 E5-2683 v4 2.1 GHz (T) EDR; Bull|ATOS Broadwell E5-2680 v4 2.4 GHz (T) EDR; Thor Broadwell E5-2697A v4 2.6 GHz (T) EDR; Mellanox SKL Gold 6138 2.0 GHz (T) EDR; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7 GHz (T) EDR; IBM Power8 S822LC 2.92 GHz IB/EDR.]

SKL 6142 2.6 GHz ~ 1.05 × E5-2697 v4 2.6 GHz.

Page 38

GAMESS-UK MPI/ScaLAPACK Code – EPYC Performance
Zeolite Y cluster SioSi7, DZVP (Si,O), DZVP2 (H), B3LYP (3,975 GTOs).

[Chart: performance on 128 and 256 PEs relative to the Fujitsu HTC X5650 2.67 GHz 6-C (128 PEs), for the same systems as the previous slide plus ATOS AMD EPYC 7551 2.0 GHz (T) EDR and Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR.]

Page 39

GAMESS-UK.MPI – DFT Performance Report
Cyclosporin, 6-31G** basis (1,855 GTOs); DFT B3LYP.

[Charts, 32-256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]

Page 40

Computational Materials – Advanced Materials Software

• VASP – performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method, and a plane-wave basis set.
• Quantum Espresso – an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory (DFT), plane waves and pseudopotentials.
• SIESTA – an O(N) DFT code for electronic-structure calculations and ab initio molecular dynamics simulations of molecules and solids. It uses norm-conserving pseudopotentials and a linear combination of numerical atomic orbitals (LCAO) basis set.
• CP2K – a program to perform atomistic and molecular simulations of solid-state, liquid, molecular and biological systems. It provides a framework for different methods, e.g. DFT using a mixed Gaussian and plane-waves approach (GPW), and classical pair and many-body potentials.
• ONETEP (Order-N Electronic Total Energy Package) – a linear-scaling code for quantum-mechanical calculations based on DFT.

Page 41

Quantum Espresso

Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves and pseudopotentials. Transition from v5.2 to v6.1.

Capabilities: ground-state calculations; structural optimization; transition states and minimum energy paths; ab initio molecular dynamics; response properties (DFPT); spectroscopic properties; quantum transport.

Benchmark details:
• DEISA AU112 – Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288)
• PRACE GRIR443 – carbon-iridium complex (C200Ir243), 2,233,063 G-vectors, 8 k-points, FFT dimensions: (180, 180, 192)

Page 42

Quantum Espresso (Version 5.2) – Au112

[Chart: performance on 32-320 PEs relative to the Fujitsu E5-2670 2.6 GHz 8-C (32 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS Broadwell E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC E5-2697A v4 2.6 GHz (T) EDR IMPI DAPL; Thor Dell|EMC E5-2697A v4 2.6 GHz (T) OPA; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; Intel Skylake Gold 6148 2.4 GHz (T) OPA.]

Page 43

Quantum Espresso – Au112 Performance Report
Au complex (Au112), 2,158,381 G-vectors, 2 k-points, FFT dimensions: (180, 90, 288).

[Charts, 32-256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]

Page 44

Parallelism in Quantum Espresso

• Quantum ESPRESSO implements several MPI parallelization levels, with processors organized in a hierarchy of groups identified by different MPI communicator levels. Group hierarchy:
• Images: processors are divided into different "images", each corresponding to a different SCF or linear-response calculation, loosely coupled to the others.
• Pools and bands: each image can be sub-partitioned into "pools", each taking care of a group of k-points. Each pool is sub-partitioned into "band groups", each taking care of a group of Kohn-Sham orbitals.
• PW parallelisation: orbitals in the PW basis set, as well as charges and density in either reciprocal or real space, are distributed across processors. All linear-algebra operations on arrays of PW / real-space grids are automatically and effectively parallelized.
• Tasks: allows good parallelization of the 3D FFT when the number of CPUs exceeds the number of FFT planes; FFTs on Kohn-Sham states are redistributed to "task groups".

Page 45

Parallelism in Quantum Espresso (continued)

• Linear-algebra group: a further level, independent of PW or k-point parallelization, is the parallelization of subspace diagonalization / iterative orthonormalization.
• About communications: images and pools are loosely coupled, and CPUs communicate between different images and pools only once in a while, whereas CPUs within each pool are tightly coupled and communications are significant.
• Choosing parameters: to control the number of CPUs in each group, use the command-line switches -nimage, -npools, -nband, -ntg and -ndiag (or -northo). Thus for Au112, use is of the following command line (a worked example follows below):
  mpirun $code -inp ausurf.in -npool $NPOOL -ntg $NT -ndiag $ND
• This executes an energy calculation on $NP processors, with k-points distributed across $NPOOL pools of $NP/$NPOOL processors each; the 3D FFT is performed using $NT task groups, and the diagonalization of the subspace Hamiltonian is distributed over a square grid of $ND processors.
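Making that concrete (illustrative values; pw.x is the standard plane-wave binary):

  # 256 ranks: the 2 k-points of Au112 go to 2 pools of 128 ranks each,
  # 3D FFTs use 4 task groups, and subspace diagonalisation runs on a
  # 10 x 10 grid (ND = 100, within the 128 ranks available per pool)
  mpirun -np 256 pw.x -inp ausurf.in -npool 2 -ntg 4 -ndiag 100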

Page 46

Impact of npool – Au112 (Version 5.2)

[Chart: relative performance (higher is better) on 32-320 PEs for Hawk (NPOOL=2, ND=nP), Hawk (NPOOL=1), Raven (NPOOL=2, ND=nP) and Raven (NPOOL=1).]

Page 47

Impact of npool – GRIR443

[Chart: relative performance (higher is better) on 32-320 PEs for Hawk (NPOOL=2, ND=nP), Hawk (NPOOL=1), Raven (NPOOL=2, ND=nP) and Raven (NPOOL=1).]

Page 48

Quantum Espresso – GRIR443

[Chart: performance on 96-160 PEs relative to the Fujitsu E5-2670 2.6 GHz 8-C (96 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS Broadwell E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC E5-2697A v4 2.6 GHz (T) EDR IMPI DAPL; Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA; Dell|EMC Skylake Gold 6142 2.6 GHz (T) EDR; Dell|EMC Skylake Gold 6150 2.7 GHz (T) EDR; ATOS AMD EPYC 7601 2.2 GHz (T) EDR.]

Page 49

VASP – Vienna Ab-initio Simulation Package

VASP (5.4.4) performs ab initio QM molecular dynamics (MD) simulations using pseudopotentials or the projector-augmented wave method, and a plane-wave basis set.

Zeolite benchmark
• Zeolite (Si96O192) with the MFI structure unit cell, running a single-point calculation with a plane-wave cutoff of 400 eV using the PBE functional
• 2 k-points; maximum number of plane waves: 96,834
• FFT grid: NGX=65, NGY=65, NGZ=43, giving a total of 181,675 points

Pd-O benchmark
• Pd-O complex (Pd75O12), 5×4 3-layer supercell, running a single-point calculation with a plane-wave cutoff of 400 eV. Uses the RMM-DIIS algorithm for the SCF, calculated in real space.
• 10 k-points; maximum number of plane waves: 34,470
• FFT grid: NGX=31, NGY=49, NGZ=45, giving a total of 68,355 points

Page 50

VASP 5.4.4 – Pd-O Benchmark
Palladium-oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points.

[Chart: performance on 64-256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (32 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS BDW E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC PE R730 BDW E5-2697A v4 2.6 GHz (T) EDR HPCX; Dell|EMC SKL Gold 6130 2.1 GHz (T) OPA; "Helios" Mellanox SKL 6138 2.0 GHz (T); "Helios" Mellanox SKL 6138 2.0 GHz (T) HPCX 2.3.0; Intel SKL Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR; Dell|EMC SKL Gold 6142 2.6 GHz (T) EDR; Dell|EMC SKL Gold 6150 2.7 GHz (T) EDR.]

Page 51

VASP 5.4.4 – Pd-O Benchmark – Parallelisation on k-points
Palladium-oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points.

[Chart: performance on 64-256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (32 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Bull|ATOS BDW E5-2680 v4 2.4 GHz (T) OPA; Thor Dell|EMC PE R730 BDW E5-2697A v4 2.6 GHz (T) EDR HPCX; "Helios" Mellanox SKL 6138 2.0 GHz (T); "Helios" Mellanox SKL 6138 2.0 GHz (T) [KPAR=2]; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR [KPAR=2]; Dell|EMC SKL Gold 6142 2.6 GHz (T) EDR; Dell|EMC SKL Gold 6150 2.7 GHz (T) EDR; Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [KPAR=2].]

KPAR/NPAR settings used (expressed in the INCAR, as shown below):

  NPEs   KPAR   NPAR
  64     2      2
  128    2      4
  256    2      8
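In VASP these settings live in the INCAR file; a minimal fragment matching the 128-PE row (KPAR and NPAR are the standard tags; the remainder of the INCAR is omitted):

  KPAR = 2   ! k-points split over 2 groups of 64 ranks each
  NPAR = 4   ! bands within each k-point group split over 4 sub-groups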

Page 52

VASP – Pd-O Benchmark Performance Report
Palladium-oxygen complex (Pd75O12), 8 k-points, FFT grid: (31, 49, 45), 68,355 points.

[Charts, 32-256 PEs: total wallclock time breakdown (CPU % vs. MPI %) and CPU time breakdown (scalar numeric ops %, vector numeric ops %, memory accesses %).]

Page 53

VASP 5.4.4 – Zeolite Benchmark
Zeolite (Si96O192) with the MFI structure unit cell, single-point calculation, 400 eV plane-wave cutoff, PBE functional; maximum number of plane waves: 96,834; 2 k-points; FFT grid: (65, 65, 43), 181,675 points.

[Chart: performance on 64-256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (64 PEs), for: Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz IB-QDR; Thor Dell|EMC PE R730 BDW E5-2697A v4 2.6 GHz (T) EDR HPCX; Thor Dell|EMC PE R730 BDW E5-2697A v4 2.6 GHz (T) OPA; "Helios" Mellanox SKL 6138 2.0 GHz (T); Dell|EMC SKL Gold 6130 2.1 GHz (T) OPA; Intel SKL Gold 6148 2.4 GHz (T) OPA; "Hawk" Atos SKL Gold 6148 2.4 GHz (T) EDR; Dell|EMC SKL Gold 6142 2.6 GHz (T) EDR; Dell|EMC SKL Gold 6150 2.7 GHz (T) EDR.]

Page 54

VASP 5.4.4 – Zeolite Benchmark – Parallelisation on k-points
Same zeolite system as above (96,834 plane waves, 2 k-points, FFT grid (65, 65, 43), 181,675 points).

[Chart: performance on 64-256 PEs relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (64 PEs), for the systems above with additional [KPAR=2] runs on "Helios" (Mellanox SKL 6138 2.0 GHz (T)) and "Hawk" (Atos SKL Gold 6148 2.4 GHz (T) EDR).]

Page 55

VASP 5.4.1 – Zeolite Benchmark
Zeolite (Si96O192) with the MFI structure unit cell, single-point calculation, 400 eV plane-wave cutoff, PBE functional; 96,834 plane waves, 2 k-points, FFT grid (65, 65, 43), 181,675 points.

Performance relative to the Fujitsu CX250 Sandy Bridge E5-2670 2.6 GHz (64 PEs), higher is better:

  PEs                                               64    96    128
  Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA        1.4   2.0   2.6
  Intel Skylake Gold 6148 2.4 GHz (T) OPA           1.5   2.1   2.7
  ATOS AMD EPYC 7601 2.2 GHz (T) EDR                0.9   1.4   1.6
  ATOS AMD EPYC 7601 2.2 GHz (T) EDR (16c/socket)   1.4   1.8   2.1

Page 56

Application Performance on Multi-core Processors:
I.2 Selecting Fabrics and Optimising Performance: Intel MPI and Mellanox HPCX

Page 57: Application Performance on multi-core Processors€¦ · 384 x Fujitsu CX250 EP-nodes each with 2 Intel Sandy Bridge E5-2670 (2.6 GHz), with Mellanox QDR infiniband. Intel Broadwell

Selecting Fabrics – MPI Optimisation

• Intel MPI Library – a communication fabric can be selected at runtime without having to recompile the application. By default it automatically selects the most appropriate fabric based on both the S/W and H/W configuration, i.e. in most cases you do not have to select a fabric manually.
• Specifying a particular fabric can nevertheless boost performance. Fabrics can be specified for both intra-node and inter-node communications; the available fabrics are listed below.
• For inter-node communication, Intel MPI uses the first available fabric from the default fabric list. The list is defined automatically for each H/W and S/W configuration (see I_MPI_FABRICS_LIST).
• For most configurations, this list is as follows: dapl, ofa, tcp, tmi, ofi

Fabric  Network hardware and software used
shm     Shared memory (for intra-node communication only)
dapl    Direct Access Programming Library (DAPL) fabrics, such as InfiniBand (IB) and iWARP (through DAPL)
ofa     OpenFabrics Alliance (OFA) fabrics, e.g. InfiniBand (through OFED verbs)
tcp     TCP/IP network fabrics, such as Ethernet and InfiniBand (through IPoIB)
tmi     Tag Matching Interface (TMI) fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture and Myrinet (through TMI)
ofi     OpenFabrics Interfaces (OFI)-capable fabrics, such as Intel True Scale Fabric, Intel Omni-Path Architecture, IB and Ethernet (through the OFI API)
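As an illustration, fabric selection is driven entirely by environment variables at launch time; a minimal sketch for an InfiniBand cluster might look as follows (the shm:dapl choice is an example, not a recommendation from these slides):

# Force shared memory intra-node and DAPL (InfiniBand) inter-node
export I_MPI_FABRICS=shm:dapl
# Or adjust the automatic search order instead of forcing one fabric
export I_MPI_FABRICS_LIST=dapl,ofa,tcp,tmi,ofi
mpirun -np 256 ./vasp_std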


Mellanox HPC-X Toolkit

The Mellanox HPC-X Toolkit provides an MPI, SHMEM and UPC software suite for HPC environments. It delivers "enhancements to significantly increase the scalability & performance of message communications in the network". It includes:
¤ A complete MPI, SHMEM and UPC package, including the Mellanox MXM and FCA acceleration engines
¤ Offload of collective communications from the MPI process onto the Mellanox interconnect hardware
¤ Maximised application performance with the underlying hardware architecture; optimised for Mellanox InfiniBand and VPI interconnects
¤ Increased application scalability and resource efficiency
¤ Multiple transport support, including RC, DC and UD
¤ Intra-node shared memory communication
• The performance comparison below was conducted on the Mellanox SKL 6138 / 2.00 GHz EDR-based "Helios" cluster
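For reference, HPC-X ships as a self-contained bundle with its own Open MPI, UCX and HCOLL libraries; a typical (hedged) way to pick it up is sketched below, with the install path and application binary as illustrative assumptions:

# Load the HPC-X environment (install path is illustrative)
source /opt/hpcx/hpcx-init.sh
hpcx_load                    # puts the bundled Open MPI, UCX and HCOLL on PATH
mpirun -np 512 ./DLPOLY.Z    # run the application with the HPC-X stack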


Application Performance & MPI Libraries

A performance comparison exercise was undertaken to capture the impact of the latest releases of Intel MPI and Mellanox's HPCX.
¤ In 2017, on the Mellanox HP ProLiant E5-2697A v4 EDR-based Thor cluster, Intel MPI and Mellanox HPCX were compared for the following applications (and associated data sets):
– DLPOLY4 (NaCl and Gramicidin) & GROMACS (ion channel and lignocellulose)
– VASP (PdO Complex & Zeolite System)
– Quantum ESPRESSO (Au112 and GRIR443)
– OpenFOAM (Cavity 3D-3M)
¤ We simply compared the time to solution for each application, i.e. the ratio T_HPCX / T_Intel-MPI, across multiple core counts.


Application Performance & MPI Libraries

• Optimum performance was found to be a function of both application and core count.
¤ With the materials-based codes & OpenFOAM, and at high core count (> 512 cores), HPCX exhibited a clear performance advantage over Intel MPI.
¤ This was not the case for the classical MD codes, where Intel MPI showed a distinct advantage at all but the highest core counts.
• The exercise was repeated on the Helios partition of the Skylake cluster using the latest releases of HPCX, v2.2.0 and v2.3.0-pre.


http://www.mellanox.com/related-docs/prod_acceleration_software/PB_HPC-X.pdf


DL_POLY 4 – Intel MPI vs. HPCX – December 2017

[Chart: % Intel MPI performance vs. HPCX as a function of processor core count (0–1024) for the DL_POLY4 NaCl and Gramicidin test cases; y-range 85%–120%.]

Intel MPI is seen to outperform HPC-X for the DL_POLY4 NaCl test case at all core counts, and at lower core counts for Gramicidin.


DL_POLY 4 – Intel MPI vs. HPCX – December 2018

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–1024, for the DL_POLY4 NaCl and Gramicidin test cases; y-range 85%–120%.]

The advantage of Intel MPI is now reduced at most core counts for both NaCl and Gramicidin.


GROMACS – Intel MPI vs. HPCX – December 2017

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–1024, for the GROMACS ion channel and lignocellulose test cases; y-range 95%–125%.]

At no point does the HPC-X build of GROMACS outperform that using Intel MPI.


GROMACS – Intel MPI vs. HPCX – December 2018

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–1024, for the GROMACS ion channel and lignocellulose test cases; y-range 95%–125%.]

Similar findings to DL_POLY, with the advantage of Intel MPI over the HPC-X build of GROMACS significantly reduced compared to the 2017 findings.


VASP 5.4.1 – Intel MPI vs. HPCX – December 2017

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–512, for the VASP Palladium Complex and Zeolite Cluster test cases; y-range 60%–120%.]

Significantly different to the classical MD codes – here HPCX is seen to outperform Intel MPI for the Zeolite cluster at all core counts, and at larger core counts for the Palladium complex.


VASP 5.4.4 – Intel MPI vs. HPCX – December 2018

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–512, for the VASP Palladium Complex and Zeolite Cluster test cases; y-range 60%–120%.]

Significantly different to the 2017 findings – little difference between Intel MPI and HPCX at larger core counts, with Intel MPI superior at lower core counts.


Quantum Espresso v5.2 – Intel MPI vs. HPCX – Dec. 2017

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–768, for the Quantum Espresso GRIR443 and Au112 test cases; y-range 65%–125%.]

Significantly different to the classical MD codes – as with VASP, HPCX is seen to outperform Intel MPI at the larger core counts.


Quantum Espresso v6.1 – Intel MPI vs. HPCX – Dec. 2018

[Chart: % Intel MPI performance vs. HPCX, processor core counts 0–512, for the Quantum Espresso GRIR443 and Au112 test cases; y-range 65%–125%.]

Significantly different to the 2017 findings – Intel MPI superior at lower core counts, with HPCX somewhat more effective at higher core counts.


I.3 Relative Performance as a Function of Processor Family and Interconnect – SKL and SNB Clusters


Target Codes and Data Sets – 128 PEs

[Chart: 128-PE relative performance for DLPOLY-4 Gramicidin, DLPOLY-4 NaCl, GROMACS ion-channel, GROMACS lignocellulose, OpenFOAM Cavity3d-3M, QE Au112, QE GRIR443, VASP Pd-O complex, VASP Zeolite complex and BSMBench Balance, across: Bull b510 "Raven" Sandy Bridge e5-2670/2.6 GHz IB-QDR; Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4 GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6 GHz (T) EDR (IMPI and HPCX); Dell Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; Dell Skylake Gold 6142 2.6 GHz (T) EDR; Dell Skylake Gold 6150 2.7 GHz (T) EDR.]


SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR – NPEs = 80

Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR):

OpenFOAM - Cavity3d-3M            1.11
WRF - 4dbasic                     1.29
Gromacs 2016-3 - ion channel      1.33
Gromacs 2016-3 - lignocellulose   1.36
Gromacs 5.0 - lignocellulose      1.37
Gromacs 4.6.1 - lignocellulose    1.38
Gromacs 4.6.1 - ion channel       1.40
CP2K - H2O-512                    1.41
QE 5.2 - Au112                    1.42
CP2K - H2O-256                    1.43
Gromacs 5.0 - ion channel         1.45
WRF - conus 2.5km                 1.49
VASP 5.4.4 - Zeolite              1.53
DLPOLY Classic - Bench7           1.53
GAMESS-UK - SiOSi7                1.54
GAMESS-UK - DFT.cyclo.6-31G-dp    1.58
DLPOLY Classic - Bench5           1.58
DLPOLY Classic - Bench4           1.59
DL_POLY 4.08 - NaCl               1.65
DL_POLY 4.08 - Gramicidin         1.67
QE 5.2 - GRIR443                  1.71
VASP 5.4.4 - PdO Complex          1.95

Average Factor = 1.49


SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR – NPEs = 160

Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR):

OpenFOAM - Cavity3d-3M            1.23
Gromacs 2016-3 - ion channel      1.32
QE 5.2 - Au112                    1.33
Gromacs 2016-3 - lignocellulose   1.36
WRF - 4dbasic                     1.36
Gromacs 5.0 - lignocellulose      1.36
Gromacs 4.6.1 - lignocellulose    1.39
DLPOLY Classic - Bench5           1.45
Gromacs 5.0 - ion channel         1.46
Gromacs 4.6.1 - ion channel       1.48
WRF - conus 2.5km                 1.49
CP2K - H2O-512                    1.49
QE 5.2 - GRIR443                  1.49
DLPOLY Classic - Bench4           1.50
CP2K - H2O-256                    1.53
DLPOLY Classic - Bench7           1.56
GAMESS-UK - SiOSi7                1.56
GAMESS-UK - DFT.cyclo.6-31G-dp    1.59
DL_POLY 4.08 - NaCl               1.71
DL_POLY 4.08 - Gramicidin         1.76
VASP 5.4.4 - Zeolite              2.02
VASP 5.4.4 - PdO Complex          2.23

Average Factor = 1.53


SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR – NPEs = 320

Improved performance of Hawk (Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR) vs. Raven (ATOS b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR):

QE 5.2 - GRIR443                  1.34
Gromacs 2016-3 - ion channel      1.34
DLPOLY Classic - Bench5           1.37
Gromacs 2016-3 - lignocellulose   1.38
Gromacs 5.0 - lignocellulose      1.39
WRF - 4dbasic                     1.39
CP2K - H2O-512                    1.40
Gromacs 5.0 - ion channel         1.40
Gromacs 4.6.1 - lignocellulose    1.41
DLPOLY Classic - Bench4           1.41
OpenFOAM - Cavity3d-3M            1.44
WRF - conus 2.5km                 1.45
Gromacs 4.6.1 - ion channel       1.53
CP2K - H2O-256                    1.56
GAMESS-UK - DFT.cyclo.6-31G-dp    1.58
GAMESS-UK - SiOSi7                1.60
DL_POLY 4.08 - Gramicidin         1.74
DL_POLY 4.08 - NaCl               1.80
VASP 5.4.4 - Zeolite              1.88
QE 5.2 - Au112                    1.97
DLPOLY Classic - Bench7           2.16
VASP 5.4.4 - PdO Complex          2.71

Average Factor = 1.60


Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets
¤ "Core to core" and "node to node" workload comparisons
• The previous charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of increasing core count per socket.
¤ Focus on a 4- and 6-node "node to node" comparison of the following:

1  Raven – Bull b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]  vs.  Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores]
2  Raven – Bull b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]  vs.  Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores]

¤ Benchmarks based on a set of 10 applications & 19 data sets.


SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR – 4-Node Comparison

Improved performance of Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [160 cores] vs. Bull b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]:

CP2K - H2O-256                    2.50
QE 5.2 - Au112                    2.62
CP2K - H2O-512                    2.70
DLPOLYclassic Bench4              2.81
GAMESS-UK (DFT.cyclo.6-31G-dp)    2.94
VASP 5.4.4 Pd-O complex           2.95
VASP 5.4.4 Zeolite complex        2.96
WRF 3.4 - 4dbasic                 2.98
DLPOLY-4 NaCl                     3.09
GROMACS 2016.3 - ion-channel      3.11
QE 5.2 - GRIR443                  3.26
GROMACS 2016.3 - lignocellulose   3.28
DLPOLY-4 Gramicidin               3.31
GAMESS-UK (DFT.siosi7.3975)       3.40
WRF 3.4 - conus 2.5km             3.46
OpenFOAM - Cavity3d-3M            3.50

Average Factor = 3.05


SKL "Gold" 6148 2.4 GHz EDR vs. SNB e5-2670 2.6 GHz QDR – 6-Node Comparison

Improved performance of Hawk – Dell|EMC Skylake Gold 6148 2.4 GHz (T) EDR [240 cores] vs. Bull b510 Sandy Bridge e5-2670/2.6 GHz IB-QDR [96 cores]:

VASP 5.4.4 Zeolite complex        2.59
CP2K - H2O-256                    2.63
GROMACS 2016.3 - ion-channel      2.64
VASP 5.4.4 Pd-O complex           2.67
WRF 3.4 - 4dbasic                 2.78
GAMESS-UK (DFT.cyclo.6-31G-dp)    2.78
CP2K - H2O-512                    2.79
DLPOLY-4 Gramicidin               2.96
DLPOLYclassic Bench4              2.96
DLPOLY-4 NaCl                     3.01
QE 5.2 - GRIR443                  3.14
WRF 3.4 - conus 2.5km             3.18
GAMESS-UK (DFT.siosi7.3975)       3.19
QE 5.2 - Au112                    3.19
GROMACS 2016.3 - lignocellulose   3.27
OpenFOAM - Cavity3d-3M            3.88

Average Factor = 2.98


EPYC – Target Codes and Data Sets – 128 PEs

[Chart: 128-PE relative performance for DLPOLYclassic Bench4, DLPOLY-4 Gramicidin, DLPOLY-4 NaCl, GROMACS ion-channel, GROMACS lignocellulose, GAMESS-UK cyc-sporin, GAMESS-UK SiOSi7, QE Au112, QE GRIR443, VASP Pd-O complex and VASP Zeolite complex, across: Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR; ATOS Broadwell e5-2680v4 2.4 GHz (T) OPA; Thor Dell|EMC e5-2697A v4 2.6 GHz (T) EDR IMPI; Dell Skylake Gold 6130 2.1 GHz (T) OPA; Intel Skylake Gold 6148 2.4 GHz (T) OPA; Dell Skylake Gold 6142 2.6 GHz (T) EDR; Dell Skylake Gold 6150 2.7 GHz (T) EDR; Bull|ATOS Skylake Gold 6150 2.7 GHz (T) EDR; Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR.]


Performance Benchmarks – Node to Node

• Analysis of performance metrics across a variety of data sets
¤ "Core to core" and "node to node" workload comparisons
• The previous EPYC charts were based on a core-to-core comparison, i.e. performance for jobs with a fixed number of cores.
• A node-to-node comparison is typical of the performance when running a workload (real-life production), and is expected to reveal the major benefits of increasing core count per socket.
¤ Focus on a "node to node" comparison of the following:

1  Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]  vs.  Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]
2  Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA [128 cores]        vs.  Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores]

¤ Benchmarks based on a set of 6 applications & 15 data sets.


Dell|EMC EPYC 7601 2.2 GHz (T) EDR vs. SNB e5-2670 2.6 GHz QDR – 4-Node Comparison

Relative performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Fujitsu CX250 Sandy Bridge e5-2670/2.6 GHz IB-QDR [64 cores]:

DLPOLYclassic Bench7       1.55
VASP Pd-O complex          2.09
DLPOLYclassic Bench5       2.13
DLPOLY-4 NaCl              2.30
DLPOLY-4 Gramicidin        2.69
VASP Zeolite complex       2.88
DLPOLYclassic Bench4       2.90
GROMACS ion-channel        3.24
QE Au112                   3.33
GAMESS-UK (cyc-sporin)     3.62
GROMACS lignocellulose     4.15
GAMESS-UK (valino.A2)      4.19

Average Factor = 2.92


SKL "Gold" 6130 2.1 GHz OPA vs. AMD EPYC 7601 2.2 GHz (T) EDR – 4-Node Comparison

Relative performance of Dell|EMC AMD EPYC 7601 2.2 GHz (T) EDR [256 cores] vs. Dell|EMC Skylake Gold 6130 2.1 GHz (T) OPA [128 cores]:

VASP Pd-O complex          0.74
QE Au112                   0.80
QE GRIR443                 0.84
DLPOLY-4 NaCl              0.94
DLPOLYclassic Bench7       1.00
DLPOLY-4 Gramicidin        1.07
VASP Zeolite complex       1.13
DLPOLYclassic Bench5       1.21
GROMACS ion-channel        1.44
DLPOLYclassic Bench4       1.51
GAMESS-UK (cyc-sporin)     1.51
GAMESS-UK (valino.A2)      1.64
GAMESS-UK (Siosi7)         1.78
GROMACS lignocellulose     1.78
GAMESS-UK (hf12z)          1.83

Average Factor = 1.28


Summary

• Ongoing focus on performance benchmarks and clusters featuring Intel's SKL processors, with the addition of the "Gold" 6138 2.0 GHz [20c] and 6148 2.4 GHz [20c], alongside the 6142 2.6 GHz [16c] and 6150 2.7 GHz [18c].
• Performance comparison with current SNB systems and those based on dual Intel BDW processor EP nodes (16-core, 14-core) with Mellanox EDR and Intel's Omni-Path (OPA) interconnects.
• Measurements of parallel application performance based on synthetic and end-user applications – DLPOLY, GROMACS, AMBER, GAMESS-UK, Quantum ESPRESSO and VASP.
¤ Use of Allinea Performance Reports to guide the analysis, and an updated comparison of Mellanox's HPC-X and Intel MPI on EDR-based systems.
• Results augmented through consideration of two AMD Naples EPYC clusters, featuring the 7601 (2.20 GHz) and 7551 (2.00 GHz) processors.


Summary II

• Relative code performance: processor family and interconnect – "core to core" and "node to node" benchmarks.
• A core-to-core comparison focusing on the Skylake "Gold" 6148 cluster (EDR) across 19 data sets (7 applications) suggests average speedups of between 1.49 (80 cores) and 1.60 (320 cores) when comparing to the Sandy Bridge-based "Raven" e5-2670 2.6 GHz cluster with its QDR environment.
¤ Some applications however show much higher factors, e.g. GROMACS and VASP, depending on the level of optimisation undertaken on Hawk.
• A node-to-node comparison, typical of the performance when running a workload, shows increased factors.
¤ A 4-node benchmark (160 cores) based on examples from 9 applications and 16 data sets shows average improvement factors of 3.05 compared to the corresponding 4-node runs (64 cores) on the Raven cluster.
¤ This factor is reduced somewhat, to 2.98, when using 6-node benchmarks, comparing 240 SKL cores to 96 SNB cores.


Summary III

• An updated comparison of Intel MPI and Mellanox's HPCX conducted on the "Helios" cluster suggests that the clear delineation between the MD codes (DLPOLY, GROMACS) and the materials-based codes (VASP, Quantum Espresso) is no longer evident.
• Ongoing studies on the EPYC 7601 show a complex performance dependency on the EPYC architecture.
¤ Codes that make heavy use of vector instructions (GROMACS, VASP and Quantum Espresso) perform, at best, in somewhat modest fashion.
¤ The AMD EPYC only supports 2 × 128-bit AVX natively, so there is a large gap with Intel and its 2 × 512-bit FMAs.
¤ The floating-point peak on AMD is 4× lower than on Intel (Skylake: 2 AVX-512 FMA units × 8 doubles × 2 flops = 32 DP flops/clock; Naples: 2 × 128-bit units × 2 doubles × 2 flops = 8 DP flops/clock) and, given that e.g. GROMACS has a native AVX-512 kernel for Skylake, performance inevitably suffers.


II. Acceptance Test Challenges and the Impact of Environment


Background – Supercomputing Wales, New HPC Systems

• Multi-million £ procurement exercise for new hubs agreed by all partners.
• Tender issued in May 2017 following a 6–9 month review of research community requirements and development of a technical reference design.
• Budgetary challenges due to currency devaluation and an increase in component costs since the budgets were agreed in 2016.
• Contracts awarded to Atos, March 2018. Hubs now installed and operational, based on Intel Skylake Gold 6148, supported by Nvidia GPU accelerators:
Lot 1 – "Hawk" system – Cardiff hub: 7,000 HPC + 1,040 HTC cores
Lot 2 – "Sunbird" system – Swansea hub: 5,000 HPC cores
Lot 3 – "Sparrow" – Cardiff High Performance Data Analytics development system
• Suppliers to provide development opportunities and other activities through a programme of Community Benefits.


Performance Acceptance Tests

1. Consideration of the performance acceptance tests undertaken as part of the Supercomputing Wales procurement, carried out by Atos on the "Hawk" HPC Skylake 6148 cluster at Cardiff University.
2. Performance targets were built on benchmarks specified in the ITT – but subsequent developments impacted the testing, e.g. SPECTRE / Meltdown.
3. Performance was assessed through analyses of results generated under three distinct run-time environment variables, characterised by:
¤ Turbo mode – ON or OFF. The impact is considerably more complicated with Skylake than with previous Intel processor families.
¤ Security patches – DISABLED or ENABLED on the Skylake 6148 compute nodes.
¤ Distribution of processing cores – PACKED or UNPACKED on each node, e.g. 256 cores on either 7 or 8 × 40-core nodes.
4. A total of 8 combinations – what is the impact on performance?
¤ The ITT defined that all application benchmarks should be run in "PACKED" mode, and HPCC in non-turbo mode.


Process Adopted

1. Performance benchmark results were generated by Atos (Martyn Foster) on the Hawk HPC Skylake 6148 cluster at Cardiff University.
2. MF adopted a systematic approach to assessing performance through the analysis of results generated across four distinct environments (a subset of the 8 possible environments):
¤ "base (switch contained)" – Turbo mode off, security patches disabled on the Skylake 6148 compute nodes
¤ "turbo + packed" – Turbo mode activated, with packed nodes – the Slurm default, with 40 cores per Skylake 6148 node
¤ "turbo + spread" – Turbo mode activated, de-populated nodes (32 cores/node)
¤ "base + spectre" – the base configuration above with security patches enabled
3. Identify those applications where the committed performance from the SCW ITT submission ("Target") is not achieved; a 10% shortfall was allowed.


GLOBAL_SETTINGS

# Security patches (KPTI): choose one of the two settings
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/disablekpti"
export SPEC="disable"
##### OR #####
export SPECTRE="clush -b -w $SLURM_NODELIST sudo /apps/slurm/enablekpti"
export SPEC="enable"

# Turbo mode: choose one of the two settings
export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_on"
export TSTR=TURBO
##### OR #####
export TURBO="clush -b -w $SLURM_NODELIST sudo /apps/slurm/turbo_off"
export TSTR=OFF

# Core distribution: packed or spread (de-populated) nodes
export SRUN_PACKING="-m Pack";   export PSTR=Packed
##### OR #####
export SRUN_PACKING="-m NoPack"; export PSTR=Spread

export LAUNCHER="srun ${SRUN_PACKING} --cpu_bind=verbose,cores --export LD_LIBRARY_PATH"
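In use, each benchmark run applies the selected node state and then launches through $LAUNCHER; a minimal sketch of that pattern is given below (the KPTI check via the standard sysfs vulnerability file and the VASP launch line are illustrative, not taken from the slides):

# Apply the selected settings on the allocated nodes, then launch (sketch)
$SPECTRE        # enable or disable KPTI, as chosen above
$TURBO          # set turbo mode, as chosen above
# Optional sanity check of the resulting Meltdown mitigation state
clush -b -w $SLURM_NODELIST cat /sys/devices/system/cpu/vulnerabilities/meltdown
$LAUNCHER -n 256 ./vasp_std > vasp.${SPEC}.${TSTR}.${PSTR}.log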


SCW Application Performance Benchmarks

• The benchmark suite comprises both synthetics & end-user applications. The synthetics include the HPCC (http://icl.cs.utk.edu/hpcc) & IMB benchmarks (http://software.intel.com/en-us/articles/intel-mpi-benchmarks), IOR and STREAM.
• A variety of "open source" & commercial end-user application codes:
GROMACS and DL_POLY-4 (molecular dynamics)
Quantum Espresso and VASP (ab initio materials properties)
BSMBench (particle physics – Lattice Gauge Theory benchmarks)
OpenFOAM (computational engineering)
• These stress various aspects of the architectures under consideration and should provide a level of insight into why particular levels of performance are observed.


"Sunbird" Acceptance Tests – User Applications

Basket of synthetic (HPCC, IOR, STREAM, IMB) and end-user application codes – DL_POLY, GROMACS, VASP, ESPRESSO, OpenFOAM & BSMBench.

[Chart: relative performance (%) for the application runs – DLPOLY-Gramicidin (64/128/256 cores), GROMACS ion channel (64/128) and lignocellulose (128/256), VASP PdO (64/128) and Zeolite (128/256), QE Au112 (64/128) and GRIR443 (256/512), OpenFOAM (128/256), and BSMBench Comms, Balance and Compute (256/512/1024) – with values ranging from 78% to 113%.]


Impact of Turbo Mode on Performance (Security Patches Enabled)

[Chart: relative performance (%), T_Turbo-OFF / T_Turbo-ON, for each computational setup in the basket above, normalised to the corresponding performance with Turbo OFF; security patches enabled. Higher is better.]


Impact of Turbo Mode on Performance (Security Patches Disabled)

[Chart: relative performance (%), T_Turbo-OFF / T_Turbo-ON, for each computational setup in the basket above, normalised to the corresponding performance with Turbo OFF; security patches disabled. Higher is better.]


Impact of Security Patches on Performance (Turbo Mode OFF)

[Chart: relative performance (%), T_DISABLED / T_ENABLED, for each computational setup in the basket above, normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo OFF. Higher is better.]


Impact of Security Patches on Performance (Turbo Mode ON)

[Chart: relative performance (%), T_DISABLED / T_ENABLED, for each computational setup in the basket above, normalised to the corresponding performance with the security patches disabled on the compute nodes; Turbo ON. Higher is better.]


Overall Impact of Environment on Performance

[Chart: relative performance (%), T_CONSTRAIN / T_MIN, for each computational setup in the basket above, normalised with respect to the most constrained environment – Turbo OFF, security patches enabled, "packed" nodes. Higher is better.]


Workload Validation and Throughput Tests

• Aim: the throughput tests are designed to illustrate the stability of the system over an observed period of a week, while hardening the system.
• Benchmarks are based on multiple, concurrent instantiations of a number of data sets associated with five of the end-user application codes and two of the synthetic benchmarks.
• Each data set is run a number of times on a variety of processor (core) counts – typically 40, 80, 160, 320, 640 and 1024. This combination of jobs was designed to run for approximately 6 hours (elapsed time) on a 2,720-core, 68-node cluster partition.
• Note that the metrics for success of these tests are twofold:
1. All jobs comprising a given run complete successfully, and
2. There is consistency of run time across each of the tests. The measured time is simply the time from when the first of the jobs is launched through to the time that the last job finishes.
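A hedged sketch of how that first-launch-to-last-finish window can be read back from Slurm accounting after a run (the job-name filter is an illustrative assumption, not taken from the slides):

# First job start and last job finish for one throughput run (sketch)
sacct -n -X --name=SCW-throughput --format=Start | sort | head -1   # first launch
sacct -n -X --name=SCW-throughput --format=End   | sort | tail -1   # last finish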


Workload Validation and Throughput Tests

• Based around multiple instantiations of a number of data sets associated with the five codes – DLPOLY4, GROMACS (v5.2), Quantum Espresso, OpenFOAM and VASP – and the two synthetic benchmarks, IMB and IOR (the SLURM scripts used are listed below; a hedged example script is sketched after the list):
• DLPOLY4 – NaCl & Gramicidin
• GROMACS – ion_channel & lignocellulose
• QE 6.1 – AUSURF112 & GRIR443
• OpenFOAM – cavity3d-3M
• VASP 5.4.4 – PdO complex and Zeolite

SLURM Scripts

DLPOLY4.test2+test8.SCW.40.q

DLPOLY4.test2+test8.SCW.80.q

DLPOLY4.test2+test8.SCW.160.q

DLPOLY4.test2+test8.SCW.320.q

DLPOLY4.test2+test8.SCW.640.q

GROMACS.All.SCW.80.q

GROMACS.All.SCW.160.q

GROMACS.All.SCW.320.q

GROMACS.All.SCW.640.q

GROMACS.All.SCW.1024.q

IMB3.SCW.160.q

IMB3.SCW.320.q

IOR.SCW.4.q

IOR.SCW.8.q

OpenFOAM_cavity3d-3M.SCW.80.q

OpenFOAM_cavity3d-3M.SCW.160.q

OpenFOAM_cavity3d-3M.SCW.320.q

OpenFOAM_cavity3d-3M.SCW.640.q

QE.AUSURF112.SCW.160.q

QE.AUSURF112.SCW.320.q

QE.GRIR443.SCW.320.q

QE.GRIR443.SCW.640.q

VASP.example3.SCW.80.q

VASP.example3.SCW.160.q

VASP.example3.SCW.320.q

VASP.example4.SCW.160.q

VASP.example4.SCW.320.q
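A minimal sketch of the shape of one of these scripts (module names and the binary path are illustrative assumptions; the packing and binding flags follow the GLOBAL_SETTINGS above):

#!/bin/bash
#SBATCH --job-name=DLPOLY4.test2+test8.SCW.320
#SBATCH --ntasks=320
#SBATCH --ntasks-per-node=40     # packed 40-core Skylake 6148 nodes
#SBATCH --time=02:00:00
module load dlpoly/4.08          # illustrative module name
srun -m Pack --cpu_bind=verbose,cores ./DLPOLY.Z   # binary path illustrative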


Throughput Tests – Hawk System – Two-Partition Approach

The throughput tests were undertaken on two separate partitions of the Hawk cluster – compute64 and compute64b – to enable other testing and an early pilot user service. Each partition comprised 68 nodes.

Partition 1 – compute64 (68 nodes)
• The first set of trial runs was executed between 12–14 May. A number of the runs failed to complete, subsequently attributed to an apparent VASP-related error peculiar to the Lustre file system:

forrtl: severe (121): Cannot access current working directory for unit 18, file "Unknown"
Image        PC               Routine            Line     Source
vasp_std     00000000014F3E09 Unknown            Unknown  Unknown
vasp_std     000000000150E10F Unknown            Unknown  Unknown
vasp_std     000000000134C950 Unknown            Unknown  Unknown
vasp_std     000000000040AF5E Unknown            Unknown  Unknown
libc-2.17.so 00002B450F32EC05 __libc_start_main  Unknown  Unknown
vasp_std     000000000040AE69 Unknown            Unknown  Unknown
forrtl: error (76): Abort trap signal

• This transient error affected perhaps one in twenty identical jobs and, although reported into the appropriate Level 3 service regimes, has still not been formally addressed. A workaround module was developed by Cardiff's Tom Green when it became clear that the formal channels were struggling:

module load lustre_getcwd_fix


Throughput Tests – Hawk System II

Partition 1:
• A second set of trial runs was carried out over the bank holiday weekend and successfully passed the associated tests over the period 30 May – 3 June:

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
6       30 May 21:21   31 May 03:25   6:02
7       31 May 23:33   01 Jun 05:38   6:05
8       02 Jun 00:04   02 Jun 06:06   6:02
9       02 Jun 13:24   02 Jun 19:27   6:04
10      02 Jun 22:57   03 Jun 05:00   6:03
11      03 Jun 05:45   03 Jun 11:47   6:02
12      03 Jun 15:57   03 Jun 22:12   6:15

Partition 2:
• Runs 11–22: initial runs using compute64b, conducted between 7–10 June, revealed a number of issues pointing to the readiness of the nodes. Timings from the first completed run suggested some variability in run times for a given application/core count, with the total run time significantly longer than those on compute64.


Throughput Tests – Hawk System III

• Following a Lustre upgrade, a further set of runs was undertaken between 21 June and 25 June. Runs 8–12 ran without error, so formally compute64b, along with compute64, can be judged to have passed the acceptance-test throughput requirement of five consecutive error-free runs, although the variations in the individual run times are perhaps larger than hoped.
• Testing on Hawk commenced on 12 May 2018 and was finally completed on 25 June 2018.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
8       23 Jun 15:44   23 Jun 20:57   5:13
9       23 Jun 21:35   24 Jun 02:55   5:20
10      24 Jun 11:34   24 Jun 17:06   5:32
11      24 Jun 18:56   25 Jun 00:16   5:20
12      25 Jun 00:31   25 Jun 06:03   5:32


Throughput Tests – Sunbird System – Two-Partition Approach

Partition 1: Runs 1–4:
¤ Run 3 did not complete, with JOBID #11050 hanging, while JOBID #11372 of Run 4 suffered the same fate. Both jobs failed with the all-too-familiar VASP/Lustre error diagnostics. The scripts used were identical to those used on Hawk in June, and did not include the workaround introduced at the time.
• Runs 5–10: completed successfully, with two of the VASP/Lustre failures trapped through the added module:

module load lustre_getcwd_fix

Partition 2: Runs 11–22: three jobs in one of the runs hung on hitting problems on scs0105. That node had been taken out when setting up the user-facing file systems and needed the playbooks running. Several of the runs showed the impact of the Lustre issue with VASP.
• However, there were significant variations in the overall run times:
¤ At least three of the nodes appeared to be either defective or to possess different BIOS settings (scs0064, scs0092 and scs0096). These were subsequently removed from service.
¤ Turbo was in an inconsistent state across the compute nodes. It is usually reset by the Slurm prologue scripts, but these appear to have been commented out.


Throughput Tests – Sunbird System

• Runs 23–30: certainly acceptable from the metric of job completion, as all completed successfully. Note there was no recurrence of the Lustre-related issue during this set of runs.
• Testing on Sunbird commenced on 10 August 2018 and was finally completed on the evening of 19 August 2018.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
23      17 Aug 17:52   17 Aug 23:07   5:15
24      17 Aug 23:54   18 Aug 05:11   5:17
25      18 Aug 05:19   18 Aug 10:36   5:17
26      18 Aug 14:00   18 Aug 19:16   5:16
27      18 Aug 19:51   19 Aug 01:08   5:17
28      19 Aug 02:45   19 Aug 07:57   5:12
29      19 Aug 13:21   19 Aug 18:32   5:11
30      19 Aug 19:06   20 Aug 00:26   5:20 (SLURM CG issue)


Throughput Tests – Nottingham OCF Cluster

• The tests were modified to run on two partitions of the OCF cluster at Nottingham, "martyn" and "colin", each comprising 50 nodes with an EDR interconnect. All component nodes comprised dual Gold 6138 2.0 GHz 20c SKL processors.
• Initial runs of the workload failed to complete successfully, with each of the 8 × 320-core IMB jobs hanging and consuming all of their allocated time. This was traced to an issue with the gatherv collective, which failed to complete across all specified message lengths.
• We navigated around the issue by removing those environment variables deemed likely to trigger the problem, specifically:
¤ export I_MPI_JOB_FAST_STARTUP=enable
¤ export I_MPI_SCALABLE_OPTIMIZATION=enable
¤ export I_MPI_DAPL_UD=enable
¤ export I_MPI_TIMER_KIND=rdtsc
• With these removed, runs proceeded to complete successfully.
• One of the allocated nodes (compute099) was rendered unusable as a result of the tests and was removed from service; the subsequent runs therefore used 49 nodes, rather than the intended 50.
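The failing collective can also be exercised in isolation with the IMB suite; a hedged reproduction sketch at the affected scale (the binary path is illustrative):

# Run only the Gatherv benchmark from IMB-MPI1 at the failing scale
mpirun -np 320 ./IMB-MPI1 -npmin 320 Gatherv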


Throughput Tests – Acceptance Achieved (OCF System)

Results of the "throughput benchmarks" carried out on the new OCF Skylake cluster at Nottingham University between 31 July and 4 August 2018.

Table. Overall run times for the throughput runs on the "martyn" partition.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
2       31 Jul 18:04   01 Aug 00:49   6:45
3       01 Aug 01:22   01 Aug 08:07   6:45
4       01 Aug 08:32   01 Aug 15:18   6:46
5       01 Aug 17:08   01 Aug 23:52   6:44
6       02 Aug 03:20   02 Aug 10:06   6:46

Table. Overall run times for the throughput runs on the "colin" partition.

Run #   Start Time     Finish Time    Total Elapsed Time (hours:mins)
1       02 Aug 12:24   02 Aug 19:05   6:41
2       03 Aug 02:41   03 Aug 09:18   6:37
3       03 Aug 10:34   03 Aug 17:18   6:44
4       03 Aug 17:35   04 Aug 00:26   6:51
5       04 Aug 01:00   04 Aug 07:45   6:45


III. The Performance Evolution of Two Community Codes, DL_POLY and GAMESS-UK


Outline and Contents

1. Introduction – DL_POLY and GAMESS-UK
¤ Background and flagship community codes for the UK's CCP5 & CCP1 – collaboration!
2. HPC Technology – impact of processor & interconnect developments
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK performance
¤ Benchmarks & test cases
¤ Overview of two decades of code performance: from the Cray T3E/900 to Intel Skylake clusters

"DL_POLY – A Performance Overview. Analysing, Understanding and Exploiting available HPC Technology", Martyn F. Guest, Alin M. Elena and Aidan B. G. Chalk, Molecular Simulation, accepted for publication (2019).


The Story of Two Community Codes: DL_POLY and GAMESS-UK – A Performance Overview
HPC Technology – Processors and Networks


Computer Systems

• Benchmark timings span a wide variety of systems, starting with the Cray T3E/1200 in 1999. Access was initially undertaken as part of Daresbury's distributed computing support programme (DiSCO), with the benchmarks presented at the annual Machine Evaluation Workshops (1989–2014) and STFC's successor Computing Insight (CIUK) conferences (2015 onwards).
¤ Access was typically short-lived, as systems were provided by suppliers to enhance their profile at the MEW workshops – limited opportunity for in-depth benchmarking.
• The systems include a wide range of CPU offerings: representatives from over a dozen generations of Intel processors, from the early days of single-processor nodes housing Pentium 3 and Pentium 4 CPUs, through dual-processor nodes featuring dual-core Woodcrest and quad-core Clovertown & Harpertown processors, along with the Itanium and Itanium2 CPUs, through to the extensive range of multi-core offerings from Westmere to Skylake.


Computer Systems

• A variety of processors from AMD (Athlon, Opteron, MagnyCours, Interlagos etc.), along with the POWER processors from the IBM pSeries, have also featured (typically in dual-processor configurations).
• In the same way that a wide variety of processors feature, so too does a range of network interconnects. Fast Ethernet and GBit Ethernet were rapidly superseded by the increasing capabilities of the family of InfiniBand interconnects from Voltaire and Mellanox (SDR, DDR, QDR, FDR, EDR and soon HDR), along with the now-defunct offerings from Myrinet, Quadrics and QLogic. The TrueScale interconnect from Intel, along with its successor, Omni-Path, also features.
• Dating from the appearance of Intel's SNB processors, many of the timings were generated with the Turbo mode feature enabled by the system administrators. Such systems are tagged with the "(T)" notation.
• As for software, most of the commodity clusters featuring Intel CPUs used successive generations of Intel compilers along with Intel MPI, although a range of MPI libraries has been used – OpenMPI, MPICH, MVAPICH and MVAPICH2. Proprietary systems (Cray and IBM) used system-specific compilers and associated MPI libraries.


Intel Xeon: Westmere – Skylake

Xeon 5600 (Westmere-EP): up to 6 cores / 12 threads; 12 MB last-level cache; 3 × DDR3-1333 memory channels per socket; new instructions: AES-NI; 1 QPI channel @ 6.4 GT/s; 36 PCIe 2.0 lanes on the chipset; server/workstation TDP 130 W.

Xeon E5-2600 (Sandy Bridge-EP): up to 8 cores / 16 threads; up to 20 MB last-level cache; 4 × DDR3-1600 channels per socket; AVX 1.0 (8 DP flops/clock); 2 QPI channels @ 8.0 GT/s; 40 integrated PCIe 3.0 lanes per socket; up to 130 W server, 150 W workstation.

Xeon E5-2600 v4 (Broadwell-EP): up to 22 cores / 44 threads; up to 55 MB last-level cache; 4 channels of up to 3 RDIMMs, LRDIMMs or 3DS LRDIMMs @ 2400 MHz; AVX 2.0 (16 DP flops/clock); 2 QPI channels @ 9.6 GT/s; 40 PCIe 3.0 lanes / 10 controllers (2.5, 5, 8 GT/s); 55–145 W.

Intel Xeon Scalable Processor (Skylake): up to 28 cores / 56 threads; up to 38.5 MB last-level cache (non-inclusive); 6 channels of up to 2 RDIMMs, LRDIMMs or 3DS LRDIMMs @ 2666 MHz; AVX-512 (32 DP flops/clock); up to 3 UPI channels @ 10.4 GT/s; 48 PCIe 3.0 lanes / 12 controllers (2.5, 5, 8 GT/s); 70–205 W.


The Story of Two Community Codes: DL_POLY and GAMESS-UK – A Performance Overview
Overview of Two Decades of DL_POLY Performance


DL_POLY 3/4 – Distributed Data

Domain decomposition – distributed data:

[Figure: simulation cell decomposed into domains A–D, one per node]

• Distribute atoms and forces across the nodes
¤ More memory-efficient; can address much larger cases (10^5–10^7 atoms)
• SHAKE and short-range forces require only neighbour communication
¤ Communications scale linearly with the number of nodes
• The Coulombic energy remains global
¤ Adopt the Smooth Particle Mesh Ewald (SPME) scheme
• includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64×64×64 – 128×128×128)

https://www.scd.stfc.ac.uk/Pages/DL_POLY.aspx
W. Smith and I. Todorov

Benchmarks:
1. NaCl simulation; 216,000 ions, 200 time steps, cutoff = 12 Å
2. Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps


The DL_POLY Benchmarks

DL_POLY 4
• Test2 benchmark
¤ NaCl simulation; 216,000 ions, 200 time steps, cutoff = 12 Å
• Test8 benchmark
¤ Gramicidin in water; rigid bonds + SHAKE: 792,960 ions, 50 time steps

DL_POLY Classic
• Bench4
¤ NaCl melt simulation with Ewald sum electrostatics & a multiple-timestep (MTS) algorithm; 27,000 atoms, 500 time steps
• Bench5
¤ Potassium disilicate glass (with 3-body forces); 8,640 atoms, 3,000 time steps
• Bench7
¤ Simulation of the gramicidin A molecule in 4,012 water molecules using neutral-group electrostatics; 12,390 atoms, 5,000 time steps


DL_POLY Classic: Bench 4 (32 PEs)
Performance Relative to the Cray T3E/1200 (32 CPUs)

[Figure: bar chart (series "DLPOLY 2 - Bench 4 (32 PEs)") across roughly two decades of systems, from the Cray T3E/1200E EV56 600 MHz through IBM SP/Regatta/p690, SGI Origin/Altix Ice, Opteron, Woodcrest/Clovertown/Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Power8 clusters, up to Intel Skylake systems (Platinum 8170, Dell Gold 6142, Bull|ATOS Gold 6150); peak relative performance ~112.]
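Throughout these charts, "performance relative to" a baseline is simply the ratio of the baseline's wall time to the system's wall time on the identical job. A one-line sketch; the timings below are hypothetical, chosen only so the ratio echoes the ~112x peak above:

# Relative performance = baseline wall time / system wall time (same job).
def relative_performance(t_baseline_s, t_system_s):
    return t_baseline_s / t_system_s

print(relative_performance(1780.0, 15.9))   # ~112x a 32-CPU Cray T3E/1200E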


DL_POLY V2: Bench 7 (32 PEs)
Performance Relative to the Cray T3E/1200 (32 CPUs)

[Figure: bar chart (series "DLPOLY 2 - Bench 7 (32 PEs)") over the same line-up of systems as the Bench 4 chart, from the Cray T3E/1200E EV56 600 MHz to the Bull|ATOS SKL Gold 6150 2.7 GHz (T) OFA; peak relative performance ~47.]


DL_POLY 3/4 – Gramicidin (128 Cores)
Performance Relative to the IBM e326 Opteron 280/2.4 GHz + GbitE

[Figure: bar chart with two series, DL_POLY 3 and DL_POLY 4, across systems from the IBM e326 Opteron 280 2.4 GHz // GbitE baseline through Woodcrest, Harpertown, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Power8 and Skylake clusters to the Dell SKL Gold 6150 [18c] 2.7 GHz (T) // IB/EDR; peak relative performance ~61.5.]


DL_POLY 4 – Gramicidin (128 cores)
Performance Relative to the IBM e326 Opteron 280/2.4 GHz / GbitE

[Figure: bar chart with bars grouped by processor family (E5-26xx, E5-26xx v2, E5-26xx v3, E5-26xx v4, Intel SKL), running from the Fujitsu BX922 WSM X5650 [6c] 2.67 GHz to Skylake Gold 6148/6150 systems and an Atos AMD EPYC 7601 [32c] 2.2 GHz (T) entry; peak relative performance ~61.5.]


DL_POLY 4 – Gramicidin Perf Report
Smooth Particle Mesh Ewald Scheme; Performance Data (32-256 PEs)

[Figure: two bar charts at 32, 64, 128 and 256 PEs. "Total Wallclock Time Breakdown": CPU (%) vs MPI (%). "CPU Time Breakdown": CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%).]
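The wallclock split above is what ultimately caps strong scaling: as the MPI share grows with PE count, speedup flattens even though the per-PE compute keeps shrinking. A toy model of that effect in Python; every number here is assumed for illustration, not read off the report:

# Toy strong-scaling model driven by an assumed MPI share of wallclock.
pes = [32, 64, 128, 256]
mpi_share = [0.15, 0.25, 0.40, 0.55]   # assumed fraction of wallclock in MPI
wall_32 = 100.0                        # arbitrary baseline wallclock (s) at 32 PEs

# If pure compute scales as 1/p, wallclock is compute / (1 - mpi_share).
compute_32 = wall_32 * (1.0 - mpi_share[0])
for p, f in zip(pes, mpi_share):
    wall = (compute_32 * 32 / p) / (1.0 - f)
    print(f"{p:4d} PEs: wallclock {wall:6.1f} s, "
          f"speedup vs 32 PEs {wall_32 / wall:4.2f}x")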


The Story of Two Community Codes
DL_POLY and GAMESS-UK – A Performance Overview

Overview of two decades of GAMESS-UK Performance


Large-Scale Parallel Ab Initio Calculations

• GAMESS-UK now has two parallelisation schemes:
¤ The traditional version based on the Global Array tools
• retains a lot of replicated data
• limited to about 4,000 atomic basis functions
¤ Subsequent developments by Ian Bush (High Performance Applications Group, Daresbury, now at Oxford University via NAG Ltd.) have extended the system sizes available for treatment by both GAMESS-UK (molecular systems) and CRYSTAL (periodic systems)
• Partial introduction of a "Distributed Data" architecture…
• MPI/ScaLAPACK based (see the memory sketch below)
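Why replicated data caps system size while the MPI/ScaLAPACK route does not: every rank holds full copies of the dense N x N matrices, so per-rank memory is flat in the rank count but quadratic in the basis size. A rough model below, in Python; the assumption of six dense matrices is mine, purely for illustration, and none of this reflects GAMESS-UK internals:

# Per-rank memory for n_mat replicated N x N double-precision matrices,
# vs. the same matrices distributed block-cyclically over n_ranks ranks.
def per_rank_gb(n_basis, n_ranks, n_mat=6, replicated=True):
    gb = n_mat * n_basis ** 2 * 8 / 1e9    # total dense storage in GB
    return gb if replicated else gb / n_ranks

for n in (1000, 4000, 16000):
    rep = per_rank_gb(n, 256)
    dist = per_rank_gb(n, 256, replicated=False)
    print(f"{n:6d} basis functions: replicated {rep:8.3f} GB/rank, "
          f"distributed {dist:8.4f} GB/rank")

At 16,000 basis functions the replicated copies in this model already need ~12 GB on every rank, while the distributed layout needs ~50 MB per rank on 256 ranks.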


The GAMESS-UK Benchmarks

Five representative examples of increasing complexity:
• Cyclosporin, 6-31G basis (1000 GTOs), DFT B3LYP (direct SCF)
• Cyclosporin, 6-31G(d,p) basis (1855 GTOs), DFT B3LYP (direct SCF)
• Valinomycin (dodecadepsipeptide) in water; DZVP2 DFT basis, HCTH functional (1620 GTOs) (direct SCF)
• Mn(CO)5H, TZVP/DZP basis, MP2 geometry optimisation
• (C6H4(CF3))2, 6-31G basis, DFT B3LYP geometry optimisation + analytic 2nd derivatives


GAMESS-UK: DFT B3LYP Performance
Valinomycin, 1620 GTOs; Basis: DZVP2_A2 (Dgauss)
Performance Relative to the Cray T3E/1200 (32 CPUs)

[Figure: bar chart from the Cray T3E/1200 EV56 600 MHz through IBM p690/p575, SGI Origin/Altix, Opteron, Harpertown, Nehalem, Westmere, Sandy Bridge and Broadwell systems to the Atos Skylake Gold 6148 2.4 GHz (T) // IB/EDR (ppn=16); annotated values 28.5 and 65.1, with callouts marking the CS48 Bull R422 Harpertown E5472 3.0 GHz QC // IB and Atos Skylake Gold 6148 2.4 GHz (T) // IB/EDR bars.]


GAMESS-UK: Performance of the MP2 Gradient Module
Mn(CO)5H – MP2 geometry optimisation; BASIS: TZVP + f (217 GTOs)
Performance Relative to the Cray T3E/900 (32 CPUs)

[Figure: bar chart from the Cray T3E/900 EV56 450 MHz through IBM SP/p690/p575, Compaq AlphaServer, SGI Origin/Altix, Pentium-4, Opteron, Harpertown, Nehalem, Westmere, Sandy Bridge and Broadwell systems to the Atos Skylake Gold 6148 2.4 GHz (T) // IB/EDR (ppn=16); annotated values 45.8 and 55.3, with callouts marking the CS48 Bull Xeon E5472 3.0 GHz QC + DDR and Intel SNB e5-2670 [8c] 2.6 GHz // IB/QDR (ppn=8) bars.]


GAMESS-UK – DFT Performance Report
Cyclosporin, 6-31G** basis (1855 GTOs); DFT B3LYP; Performance Data (32-256 PEs)

[Figure: two bar charts at 32, 64, 128 and 256 PEs. "Total Wallclock Time Breakdown": CPU (%) vs MPI (%). "CPU Time Breakdown": CPU scalar numeric ops (%), CPU vector numeric ops (%) and CPU memory accesses (%).]


Summary

1. Introduction – DL_POLY and GAMESS-UK
¤ Background and flagship codes for the UK's CCP5 & CCP1
¤ Critical role of collaborative developments
2. HPC Technology – Processor & Interconnect Technologies
¤ The last 10 years of Intel dominance – Nehalem to Skylake
3. DL_POLY and GAMESS-UK Performance
¤ Benchmarks & test cases
¤ Overview of two decades of code performance: from the T3E/1200E to Intel Skylake clusters
4. Understanding Performance – Useful Tools
5. Acknowledgements and Summary


Acknowledgements

• Ludovic Sauge, Enguerrand Petit, Martyn Foster, Nick Allsopp and John Humphries (Bull/ATOS) for informative discussions and access to the Skylake & EPYC clusters at the Bull HPC Competency Centre.
• David Cho, Gilad Shainer, Colin Bridger & Steve Davey for access to, and considerable assistance with, the "Helios" cluster at the HPC Advisory Council.
• Joshua Weage, Martin Hilgeman, Dave Coughlin, Gilles Civario and Christopher Huggins for access to, and assistance with, the variety of Skylake and EPYC SKUs at the Dell Benchmarking Centre.
• Alin Marin Elena and Ilian Todorov (STFC) for discussions around the DL_POLY software.
• The DisCO programme at Daresbury Laboratory.


Final Thoughts & Summary

I. Performance Benchmarks and Cluster Systems
a. Synthetic Code Performance: STREAM and IMB
b. Application Code Performance: DL_POLY, GROMACS, AMBER, GAMESS-UK, VASP and Quantum Espresso
c. Interconnect Performance: Intel MPI and Mellanox's HPCX
d. Processor Family and Interconnect – "core to core" and "node to node" benchmarks
II. Impact of Environmental Issues in Cluster Acceptance Tests
a. Security patches, turbo mode and Throughput testing
III. Performance profile of DL_POLY and GAMESS-UK over the past two decades
