
Maximizing Application Performance in a Multi-Core, NUMA-Aware Compute Cluster by Multi-Level Tuning

Gilad Shainer1, Pak Lui1, Scot Schultz1

Martin Hilgeman2, Jeffrey Layton2, Cydney Stevens2, Walker Stemple2

Guy Ludden3, Joshua Mora3

Georg Kresse4

1Mellanox Technologies, 2Dell Inc, 3Advanced Micro Devices, 4University of Vienna, Vienna, Austria

2

The HPC Advisory Council

• World-wide HPC organization (360+ members)

• Bridges the gap between HPC usage and its full potential

• Provides best practices and a support/development center

• Explores future technologies and developments

• Leading edge solutions and technology demonstrations

3

HPC Advisory Council Members

4

HPC Advisory Council HPC Center

InfiniBand-based Lustre storage and cluster systems (Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury, Mala) ranging from 64 to 704 cores, including a 16-GPU system

5

Special Interest Subgroups Missions

• HPC|Scale

– To explore the usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services

• HPC|Cloud

– To explore the usage of HPC components as part of the creation of cloud computing environments

• HPC|Works

– To provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines

• HPC|Storage

– To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions and to expose more users to the potential of Lustre over high-speed networks

• HPC|GPU

– To explore usage models of GPU components as part of next-generation compute environments and potential optimizations for GPU-based computing

• HPC|Music

– To enable HPC in music production and to develop HPC solutions that further enable the future of music production

6

HPC Advisory Council

• HPC Advisory Council (HPCAC)

– 360+ members, http://www.hpcadvisorycouncil.com/

– Application best practices and case studies (over 150)

– Benchmarking center with remote access for users

– World-wide workshops

– Value add for your customers to stay up to date and in tune with the HPC market

• 2013 Workshops

– USA (Stanford University) – January 2013

– Switzerland – March 2013

– Germany (ISC’13) – June 2013

– Spain – Sep 2013

– China (HPC China) – Oct 2013

• For more information

– www.hpcadvisorycouncil.com

[email protected]

7

ISC’13 – Student Cluster Challenge

• University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC’13 show floor

• The Student Cluster Challenge is designed to introduce the next generation of students to the high performance computing world and community

8

ISC’13 – Student Cluster Challenge

9

Note

• The following research was performed under the HPC Advisory Council activities

– Participating vendors: AMD, Dell, Mellanox

– Compute resource: HPC Advisory Council Cluster Center

• For more info please refer to

– http://www.amd.com

– http://www.dell.com/hpc

– http://www.mellanox.com

– http://www.vasp.at

10

VASP

• VASP stands for “Vienna Ab-initio Simulation Package”

– Performs ab-initio quantum-mechanical molecular dynamics (MD) using pseudopotentials and a plane-wave basis set

– The code is written in Fortran 90 with MPI support

– Access to the code may be granted upon request via the VASP website

• The approach used in VASP is based on two techniques:

– A finite-temperature local-density approximation, and

– An exact evaluation of the instantaneous electronic ground state at each MD step, using efficient matrix diagonalization schemes and efficient Pulay mixing

• These techniques avoid the problems of the original Car-Parrinello method, which is based on the simultaneous integration of the electronic and ionic equations of motion

11

Objectives

• The following was done to provide best practices

– VASP performance benchmarking

– Understanding VASP communication patterns

– Ways to increase VASP productivity

– Compiler and network interconnect comparisons

• The presented results will demonstrate

– The scalability of the compute environment

– The capability of VASP to achieve scalable productivity

– Considerations for performance optimizations

12

Test Cluster Configuration

• Dell™ PowerEdge™ R815 11-node (704-core) cluster

• AMD™ Opteron™ 6276 (code name “Interlagos”) 16-core @ 2.3 GHz CPUs

• 4 CPU sockets per server node

• Mellanox ConnectX®-3 FDR InfiniBand Adapters

• Mellanox SwitchX™ 6036 36-Port InfiniBand switch

• Memory: 128GB memory per node DDR3 1333MHz

• OS: RHEL 6.2, SLES 11.2 with MLNX-OFED 1.5.3 InfiniBand SW stack

• MPI: Intel MPI 4 Update 3, MVAPICH 1.8.1, Open MPI 1.6.3 (w/ dell_affinity 0.85)

• Math Libraries: ACML 5.2.0, Intel MKL 11.0, SCALAPACK 2.0.2

• Compilers: Intel Compilers 13.0, Open64 4.5.2

• Application: VASP 5.2.7

• Benchmark workload:

– Pure Hydrogen (MD simulation, 10 ionic steps, 60 electronic steps, 264 bands, IALGO=48)
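For readers unfamiliar with how these benchmark parameters map onto a VASP run, the sketch below writes an illustrative INCAR fragment. Only IALGO=48, 264 bands, the 10 ionic / 60 electronic steps and the NPAR=8 setting used in later slides come from this deck; the tag choices (including IBRION) are assumptions, not the actual benchmark input.

    # Hypothetical INCAR fragment matching the stated workload (not the actual input)
    cat > INCAR <<'EOF'
    IALGO  = 48     ! RMM-DIIS electronic minimization, as listed above
    NBANDS = 264    ! number of bands
    NPAR   = 8      ! band parallelization over 8 groups (NPAR=8 in later slides)
    IBRION = 0      ! molecular dynamics (assumed, not stated in the deck)
    NSW    = 10     ! ionic (MD) steps
    NELM   = 60     ! maximum electronic steps per ionic step
    EOF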

13

VASP Profiling – MPI/User Time Ratio

• The time needed for computation reduces as more nodes take part

– MPI communication time increases gradually as the compute time reduces

QDR InfiniBand 32 Cores/Node
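The slides do not name the tool used to collect the MPI profiles shown here and on the following slides. As a minimal sketch of how such data is commonly gathered, a lightweight MPI profiling library (for example mpiP or IPM built as a shared object) can be preloaded at launch; the library path, hostfile and rank count below are placeholders.

    # Preload an MPI profiler so each MPI call's time and message size is recorded
    # (with Open MPI, -x exports the variable to all remote ranks; path is a placeholder)
    mpirun -np 256 -hostfile ./hosts \
        -x LD_PRELOAD=/opt/mpiP/lib/libmpiP.so \
        ./vasp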

14

VASP Profiling – Number of MPI Calls

• The most used MPI functions are MPI collective operations

– MPI_Bcast (34%), MPI_Allreduce (16%), MPI_Recv (11%), MPI_Alltoall (8%) at 8 nodes

– Collective operations cause communication time to grow at larger node counts

QDR InfiniBand 32 Cores/Node

15

VASP Profiling – Time Spent in MPI Calls

• Most of the communication time is spent in the following MPI functions:

– MPI_Alltoallv (40%), MPI_Alltoall (17%), MPI_Bcast (15%) at 8 nodes

QDR InfiniBand 32 Cores/Node

16

VASP Profiling – MPI Message Sizes

• Uneven communication seen for MPI messages

– More MPI collective communication takes place on certain MPI ranks

1 Node – 32 Processes / 4 Nodes – 128 Processes

QDR InfiniBand

17

VASP Profiling – MPI Message Sizes

• The communication pattern changes as more processes are involved

– 4 nodes: The majority of messages are concentrated at 64KB

– 8 nodes: MPI_Alltoallv, the largest MPI time consumer, is largely concentrated at 64KB

– 8 nodes: MPI_Bcast, the most frequently called MPI function, is largely concentrated at 4B

4 Nodes – 128 Processes / 8 Nodes – 256 Processes

QDR InfiniBand 32 Cores/Node

18

VASP Profiling – Point To Point Data Flow

• The point-to-point data flow shows the communication pattern of VASP

– VASP mainly communicates on the lower ranks

– The pattern stays the same as the cluster scales

1 Node – 32 Processes / 4 Nodes – 128 Processes

QDR InfiniBand 32 Cores/Node

19

VASP Performance – Interconnect

32 Cores/Node Higher is better

• QDR InfiniBand delivers the best performance for VASP

– Up to 186% better performance than 40GbE on 8 nodes

– Over 5 times better performance than 10GbE on 8 nodes

– Over 8 times better performance than 1GbE on 8 nodes

• Scalability limitations seen with Ethernet networks

– 1GbE, 10GbE and 40GbE performance starts to decline after 2 nodes

NPAR=8

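The launch options used to switch between interconnects are not shown in the deck. As a sketch, with Open MPI the transport is typically selected per run roughly as below; the hostfile, rank count and the eth2 interface name are placeholders, not values from the slides.

    # Run over InfiniBand (openib BTL) vs. over an Ethernet NIC (tcp BTL)
    mpirun -np 256 -hostfile ./hosts --mca btl openib,sm,self ./vasp
    mpirun -np 256 -hostfile ./hosts --mca btl tcp,sm,self \
        --mca btl_tcp_if_include eth2 ./vasp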

20

VASP Performance – Open MPI SRQ

32 Cores/Node Higher is better

• Using SRQ enables better performance for VASP at high core counts

– 24% higher performance than the untuned Open MPI run at 8 nodes

• Flags used for enabling SRQ in Open MPI:

– -mca btl_openib_receive_queues S,9216,256,128,32:S,65536,256,128,32

– Processor binding is enabled for both cases


NPAR=8
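Putting the SRQ flag into context, a full launch line might look like the sketch below. Only the btl_openib_receive_queues value comes from the slide; the rank count, hostfile, binding option and binary path are illustrative.

    # 8 nodes x 32 cores with shared receive queues (SRQ) enabled in the openib BTL
    mpirun -np 256 -hostfile ./hosts --bind-to-core \
        --mca btl_openib_receive_queues S,9216,256,128,32:S,65536,256,128,32 \
        ./vasp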

21

VASP Performance – Open MPI w/ MXM

32 Cores/Node Higher is better

• Enabling MXM enables better performance for VASP at high core counts

– 47% higher job productivity than the untuned Open MPI run at 8 nodes

– 19% higher job productivity than the SRQ-enabled Open MPI at 8 nodes

• Flags used for enabling MXM in Open MPI:

– -mca mtl mxm -mca btl_openib_free_list_num 8192 -mca btl_openib_free_list_inc 1024 -mca mpi_preconnect_mpi 1 -mca btl_openib_flags 9

– Processor binding using dell_affinity.exe for all 3 cases

NPAR=8
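As with the SRQ case, only the MCA flags above are taken from the slide; the sketch below shows one plausible way to assemble them into a launch line. How dell_affinity.exe is combined with the MPI launch is not detailed in this deck, so the binding step is only noted in a comment.

    # 8 nodes x 32 cores with the MXM messaging accelerator enabled
    mpirun -np 256 -hostfile ./hosts \
        -mca mtl mxm \
        -mca btl_openib_free_list_num 8192 \
        -mca btl_openib_free_list_inc 1024 \
        -mca mpi_preconnect_mpi 1 \
        -mca btl_openib_flags 9 \
        ./vasp
    # Processor binding for these runs used dell_affinity.exe (see slide 25);
    # the exact wrapper invocation is not shown here.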

22

Analyzing AMD “Bulldozer” CPU Architecture

• Maximize resources by analyzing the AMD Interlagos CPU architecture

– Each AMD Opteron 6200 Series “Interlagos” processor is composed of 2 dies

– Each die contains 4 “Bulldozer” modules, for 8 modules (16 cores) per processor

– The two neighboring cores within a module share a 256-bit Floating Point Unit (FPU)

• AMD Turbo Core increases the clock speed when only 1 core is active in a “core pair”

– Enables higher clock speed and a dedicated FPU for HPC applications

AMD Opteron™ 6200 Series Processor (“Interlagos”)
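A quick way to confirm this layout on a running node is to query the topology directly. The sketch below uses the standard numactl and hwloc utilities, which are commonly available on the RHEL/SLES releases listed earlier; nothing beyond a default install is assumed.

    # Show the NUMA nodes (one per die) and the cores/memory attached to each
    numactl --hardware
    # Show sockets, dies and the L2 caches shared by each Bulldozer core pair
    lstopo --no-io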

23

VASP Performance – Processes Per Node

• Running 1 active core per core pair yields higher system utilization

– 116% gain in performance with 64 PPN versus 32 PPN (with 1 active core) for 8 nodes

– 1 floating point unit (FPU) is shared between the 2 CPU cores of a module

• Using 4P servers delivers higher performance than 2P servers

– 30% gain with a 4P server (32 PPN with 1 active core/package) versus 32 PPN in a 2P server

Higher is better


NPAR=8
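The deck does not show the launch commands behind these PPN comparisons. One way to approximate the “1 active core per core pair” configuration with Open MPI 1.6 is sketched below: each rank is given a whole 2-core module, so no two ranks share an FPU. This is an illustration under those assumptions, not necessarily how the presenters ran it (they used dell_affinity.exe for placement elsewhere).

    # 32 ranks per 64-core node, each rank bound to both cores of one module,
    # leaving one core of every FPU-sharing pair idle
    mpirun -np 256 -hostfile ./hosts -npernode 32 \
        --cpus-per-proc 2 --bind-to-core ./vasp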

24

Effects of Processor Placement

• Naïve process placement can result in bottlenecks in MPI communication

– MPI communication may have to cross between NUMA nodes over the HyperTransport bus

– This eventually results in degraded runtime performance

• Optimal process placement can enhance memory bandwidth and MPI messaging

– Places processes in proximity, ordering adjacent ranks close together on a NUMA node

– Reduces the time to communicate between MPI processes, resulting in higher performance

– Supported across major Linux distributions and major MPI libraries, including hybrid MPI/OpenMP runs

Naïve (default) placement vs. optimal placement
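To see what naïve versus optimal placement means in practice, the NUMA layout can be inspected and a single process pinned by hand with numactl, as in the hedged sketch below. The executable name is a placeholder; MPI ranks would normally be bound by the MPI library or by dell_affinity rather than manually.

    # List NUMA nodes, their cores and free memory on this server
    numactl --hardware
    # Force one process and its memory allocations onto NUMA node 0 only
    numactl --cpunodebind=0 --membind=0 ./a.out
    # Display the CPU/memory policy the current shell would pass to children
    numactl --show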

25

VASP Performance – Processor Binding

32 Cores/Node Higher is better

• Processor binding is crucial for achieving the best performance on AMD Interlagos

– Allocating MPI processes on the most optimal cores allows Open MPI to perform well

• Options used for the 2 cases:

– bind-to-core: OMPI param: --bind-to-core (OMPI compiled with hwloc support)

– dell_affinity.exe: dell_affinity.exe -v -n 32 -t 1

– dell_affinity is described at the HPC Advisory Council Spain Conference 2012:

• http://www.hpcadvisorycouncil.com/events/2012/Spain-Workshop/pres/7_Dell.pdf

• Works with all open source and commercial MPI libraries on Dell platforms

NPAR=8

26%
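For reference, the two binding approaches compared above could be invoked roughly as follows. Only the pieces shown on the slide are reproduced; the hostfile, rank count and how dell_affinity.exe wraps the MPI launch are illustrative, since the deck does not detail them.

    # Case 1: Open MPI's built-in binding (requires Open MPI built with hwloc support)
    mpirun -np 32 -hostfile ./hosts --bind-to-core ./vasp

    # Case 2: Dell's affinity tool, invoked as shown on the slide
    # (32 MPI processes, 1 thread each; integration with the MPI launch not shown here)
    dell_affinity.exe -v -n 32 -t 1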

26

VASP Performance – Software Stack

32 Cores/Node Higher is better

• The free software stack performs comparably to the Intel software stack

– Free software stack: MVAPICH2, Open64 compilers, ACML and ScaLAPACK

• Specifications regarding the runs:

– Intel Compiler/MKL: default optimization in the Makefile with minor modification to loc

– MVAPICH2 runs with processor affinity: “dell_affinity.exe -v -n32 -t 1”

– Open64 compiler flags: “-march=bdver1 -mavx -mfma”

– Changes to the interface blocks of certain modules to support Open64 can be obtained from the University of Vienna
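The slide lists only the Open64 flags. A hedged sketch of how they might appear in compile and link lines is below: openf90 is Open64's Fortran front end, but the source file name, library paths and link order are placeholders rather than the actual VASP makefile contents.

    # Illustrative compile and link lines for the free stack (paths are placeholders)
    openf90 -march=bdver1 -mavx -mfma -c example_module.f90
    openf90 -o vasp *.o \
        -L/opt/acml/open64_64/lib -lacml \
        -L/opt/scalapack/lib -lscalapack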

27

Summary

• Low latency in network communication is required to make VASP scalable

– QDR InfiniBand delivers good scalability and provides the lowest latency among the interconnects tested:

• 186% better than 40GbE, over 5 times better than 10GbE, and over 8 times better than 1GbE on 8 nodes

– Ethernet does not scale well and becomes inefficient to run beyond 2 nodes

– Mellanox messaging accelerations (SRQ & MXM) provide benefits for VASP runs at scale

– Heavy MPI collective communication occurs in VASP

• CPU:

– Running a single core per core pair performs 116% faster than running with both cores

– “dell_affinity.exe” ensures proper process allocation in Open MPI and MVAPICH2

• Software stack:

– The free stack (Open64/MVAPICH2/ACML/ScaLAPACK) performs comparably to the Intel stack

– Additional performance is expected with source code optimizations and tuning using the latest development tools (such as Open64 and ACML) that support the AMD “Interlagos” architecture

28

Thank You HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “as is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. Neither the HPC Advisory Council nor Mellanox undertakes any duty or assumes any obligation to update or correct any information presented herein.