
Page 1:

Performance Benefits on HPCx from Power5 chips and SMT

HPCx User Group Meeting, 28 June 2006

Alan Gray, EPCC, University of Edinburgh

Page 2:

Contents

• Introduction and System Overview

• Benchmark Results
  – Synthetic
  – Applications

• Simultaneous Multithreading

• Conclusions

Page 3:

Introduction and System Overview

Page 4:

Introduction

• HPCx underwent an upgrade in November 2005 from Power4 to Power5 technology.

• The new system features Simultaneous Multithreading (SMT).

• We will compare the new and old systems via benchmark results, both synthetic and involving real applications representing typical use of the system.

• The use of SMT is also investigated.

• Also included for comparison are results from EPCC’s Blue Gene/L system.

Page 5:

Systems for comparison

• Previous HPCx (Phase 2): 50 IBM e-Server p690+ nodes
  – SMP cluster, 32 Power4 1.7 GHz processors per node
  – 32 GB of RAM per node
  – Federation interconnect
  – 6.2 TFLOP/s Linpack

• HPCx (Phase 2a): 96 IBM e-Server p575 nodes
  – SMP cluster, 16 Power5 1.5 GHz processors per node
  – Power5 has an improved memory architecture over Power4
  – 32 GB of RAM per node (twice as much per processor as Phase 2)
  – Federation interconnect (same as Phase 2)
  – 7.4 TFLOP/s Linpack, No. 46 on the Top500

• BlueSky: single e-Server Blue Gene frame
  – 1024 dual-core chips, 2048 PowerPC 440 processors, 700 MHz
  – 512 MB of RAM per chip (distributed-memory system), shared between the two cores
  – 4.7 TFLOP/s Linpack, joint No. 73 on the Top500

Page 6:

Benchmark Results: Synthetic

Page 7:

Synthetic benchmarks: Intel MPI suite

• Ping Pong benchmark: 2 processes communicate, either over the switch or within a node via shared memory (a minimal sketch follows below)

• Switch communication: insignificant difference (not surprising – same switch)

• Intra-node communication: Phase 2a has better asymptotic bandwidth but slightly higher latency
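A minimal MPI ping-pong sketch in C, for illustration only (assumption: this is not the Intel MPI Benchmarks source). Rank 0 bounces a buffer off rank 1 and times the round trips to estimate latency and bandwidth; run with exactly 2 MPI tasks.

    /* Minimal ping-pong sketch: run with exactly 2 MPI tasks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREPS 1000

    int main(int argc, char **argv)
    {
        int rank, nbytes = 1 << 20;   /* 1 MiB message */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(nbytes);

        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {          /* send, then wait for the echo */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {   /* echo the message back */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* each repetition moves the buffer across twice */
            printf("round trip %g us, bandwidth %g MB/s\n",
                   (t1 - t0) / NREPS * 1e6,
                   2.0 * nbytes * NREPS / (t1 - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }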

Page 8:

Synthetic benchmarks: Intel MPI suite

• Multi Ping Pong: all available processors utilised

• Modified to ensure that all communications utilise the switch

• No difference between Phase 2 and 2a (not surprising – same switch)

Page 9:

Streams performance (scale)

• The Streams benchmark gives a measure of memory bandwidth (a minimal triad sketch follows below)

• The hardware limit is 2 load+store per cycle

• The cache levels are clearly visible

• Phase 2a is significantly better than Phase 2 at all levels of the memory hierarchy
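A minimal STREAM-style triad sketch in C (assumption: simplified from the real STREAM benchmark, which also measures Copy, Scale and Add). The arrays are sized well beyond cache so the loop is limited by main-memory bandwidth.

    /* Triad: a[i] = b[i] + s*c[i]  (2 loads + 1 store per iteration) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* ~16M doubles per array: far larger than cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double s = 3.0;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* three 8-byte arrays cross the memory bus once each */
        printf("triad bandwidth: %.1f MB/s\n", 3.0 * N * 8 / secs / 1e6);

        free(a); free(b); free(c);
        return 0;
    }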

Page 10:

Benchmark Results: Applications

Page 11:

CASTEP: Al2O3

• Density functional theory application (Payne et al., 2002; Segall et al., 2002)

• Widely used in the UK (the largest user of HPCx)

• Benchmark: Al2O3, a 270-atom slab sampled with 2 k-points

• Phase 2a is around 1.3 times faster than Phase 2
  – Even though the clocks are slower
  – The code is taking advantage of the improved memory bandwidth

Page 12:

H2MOL

• Solves the time-dependent Schrödinger equation for laser-driven dissociation of H2 molecules (a general form is given after this list)

• Refines the grid when increasing the processor count, hence constant work per processor

• Phase 2a is almost a factor of 2 faster than Phase 2

• Writing of intermediate results exposes the poor I/O on Blue Gene
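For reference, the equation being solved has the general form below, written in atomic units. The laser-coupling term shown is the usual length-gauge dipole form and is an assumption about the typical setup, not taken from the H2MOL source.

    i\,\frac{\partial \psi}{\partial t} = \hat{H}(t)\,\psi,
    \qquad
    \hat{H}(t) = -\tfrac{1}{2}\nabla^{2} + V(\mathbf{r}) + \mathbf{E}(t)\cdot\mathbf{r}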

Page 13:

PCHAN

• Finite-difference code for turbulent flow: shock/boundary-layer interaction (SBLI)

• Communications: halo exchanges between adjacent computational sub-domains (see the sketch after this list)

• Phase 2a is around 2 times faster than Phase 2

• Very good scaling on all systems – HPCx superscales
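A minimal 1-D halo-exchange sketch in C with MPI (assumption: illustrative only, not the PCHAN source). Each rank holds a sub-domain with one ghost cell at each end and swaps boundary values with its neighbours.

    /* 1-D halo exchange: u[0] and u[NLOC+1] are ghost cells. */
    #include <mpi.h>

    #define NLOC 1024   /* interior points per rank */

    int main(int argc, char **argv)
    {
        int rank, size;
        double u[NLOC + 2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI_PROC_NULL turns edge-of-domain exchanges into no-ops */
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int i = 1; i <= NLOC; i++) u[i] = rank;   /* dummy field data */

        /* rightmost interior point goes right; left ghost cell is filled */
        MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 0,
                     &u[0],    1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* leftmost interior point goes left; right ghost cell is filled */
        MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  1,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }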

Page 14:

AIMPRO

• Ab Initio Modelling PROgram

• Determines the structure of atoms using the Born-Oppenheimer approximation

• Benchmark: DS-4 (433 atoms, 12,124 basis functions and 4 k-points)

• Phase 2a outperforms Phase 2 by around a factor of 1.2

Page 15:

MDCASK

• MDCASK: a classical molecular dynamics code for studying radiation damage in metals

• Benchmark used: 1,372,000 atoms in a Ti lattice

• Performance is worse on Phase 2a than on Phase 2
  – by a factor larger than the clock-frequency ratio
  – scaling is also worse on Phase 2a

• Classical molecular dynamics codes are characterised by many strided memory accesses (illustrated after this list)
  – The degradation could be due to sensitivity to increased latency in some part of the memory subsystem
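A small C sketch of the access pattern in question (assumption: illustrative, not taken from MDCASK). Gathering coordinates through a neighbour list produces irregular loads that defeat hardware prefetching, so each access tends to pay close to the full memory latency.

    #include <stdio.h>
    #include <stdlib.h>

    /* Irregular gather through an index list: latency-bound accesses. */
    double gather_x(const double *x, const int *neigh, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[neigh[i]];
        return sum;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        double *x = malloc(N * sizeof *x);
        int *neigh = malloc(N * sizeof *neigh);
        for (int i = 0; i < N; i++) {
            x[i] = i;
            neigh[i] = (int)(((long)i * 7919) % N);   /* scattered indices */
        }
        printf("%f\n", gather_x(x, neigh, N));
        free(x); free(neigh);
        return 0;
    }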

Page 16:

LAMMPS

• Classical molecular dynamics code that can simulate a wide range of materials

• Rhodopsin benchmark: 2,048,000 atoms

• Performance degradation again on the new system

• The factor matches the clock ratio at low processor counts, but scaling is worse on Phase 2a

Page 17:

NAMD 2.6b1: ApoA1, 92,224 atoms

• NAMD: a classical molecular dynamics code designed for high-performance simulation of large biomolecular systems

• ApoA1 benchmark: 92,224 atoms

• Similarly to the other classical molecular dynamics codes, it performs worse on Phase 2a

Page 18:

DL_POLY3: Gramicidin, 792,960 atoms

• DL_POLY is a general-purpose molecular dynamics package

• DL_POLY3 uses a distributed domain-decomposition model

• The benchmark: a system of eight Gramicidin-A species (792,960 atoms)

• Performs slightly better on Phase 2a, but not as well as some of the other codes

Page 19:

Simultaneous Multithreading

Page 20:

Simultaneous Multithreading (SMT)

• The theoretical peak floating-point performance of microprocessors has steadily risen in recent years

• The actual performance of applications, relative to theoretical peak, has dropped substantially
  – i.e. the number of cycles for which the floating-point units are idle is rising

• This is due to the latencies involved with processor operations

• The compiler attempts to schedule instructions to minimise wasted cycles
  – but its effectiveness is limited by a lack of independent instructions (see the sketch below)

• SMT: multiple threads can issue instructions to the functional units in each cycle
  – the number of independent instructions increases, so the number of idle cycles decreases
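A small C sketch of the problem SMT addresses (assumption: illustrative only). The loop below carries a serial dependence, so the floating-point units sit idle while each multiply-add waits on the previous one; a second SMT thread can fill those idle issue slots with its own independent instructions.

    #include <stdio.h>

    /* Each iteration depends on the previous value of s, limiting the
       independent instructions available to the scheduler. */
    double chain(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s = s * 0.99 + x[i];
        return s;
    }

    int main(void)
    {
        double x[1000];
        for (int i = 0; i < 1000; i++) x[i] = 1.0;
        printf("%f\n", chain(x, 1000));
        return 0;
    }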

Page 21:

Simultaneous Multithreading (SMT)

• The Power5 processors on HPCx have 2 floating-point units, and support SMT with 2 threads.

• Hence there are 2 virtual processors (MPI tasks or OpenMP threads) running per physical processor.

No SMT:   #@ tasks_per_node = 16

With SMT: #@ tasks_per_node = 32
          #@ requirements = (Feature == "SMT")
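In a full LoadLeveler script the SMT case might look like the sketch below; the node count, wall-clock limit and executable name are illustrative assumptions, and exact keywords are site-specific.

    #!/bin/ksh
    # Illustrative LoadLeveler job sketch for an SMT run on HPCx
    # (assumption: queue names and limits are site-specific).
    #@ job_type         = parallel
    #@ node             = 4
    #@ tasks_per_node   = 32
    #@ requirements     = (Feature == "SMT")
    #@ wall_clock_limit = 01:00:00
    #@ queue
    poe ./my_app   # my_app is a placeholder executable name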

• Disadvantages:
  – More communication
  – The memory limit per task is halved

Page 22:

SMT: Streams

• Compare open squares (no SMT) with open circles (SMT)

• With SMT there are twice as many tasks per node; for a direct comparison, the SMT results have been multiplied by a factor of 2

• No difference in memory bandwidth is observed with SMT

• Of course, the caches are effectively halved in size

• Therefore, any improvements in applications must be due to reduced memory latency (as expected)

Page 23:

SMT: Classical Molecular Dynamics

• Reminder: classical molecular dynamics codes did worse than expected on Phase 2a, likely due to sensitivity to increased memory latency.

• Such codes benefit from SMT: it seems that the latencies are successfully hidden.

• The benefit is limited to lower processor counts; at high counts the large amount of communication takes over.

• For NAMD, there is up to a factor of 1.4 improvement, and the crossover point is around 512 processors.

Page 24:

SMT: Classical Molecular Dynamics

• Reminder and SMT observations as on the previous slide.

• For MDCASK, there is up to a factor of 1.4 improvement, and the crossover point is around 256 processors.

Page 25:

SMT: Classical Molecular Dynamics

• Reminder and SMT observations as on the previous slide.

• For DL_POLY, there is up to a factor of 1.2 improvement, and the crossover point is around 64 processors.

Page 26:

SMT: Castep and H2MOL

• Reminder: the performance of the CASTEP and H2MOL codes improved on the new system.
  – No performance benefit is seen with SMT.
  – SMT degrades performance in certain situations.

Page 27:

Conclusions

• HPCx was recently upgraded from Power4 to Power5 technology.

• Although the new chips have a slightly lower clock frequency, significant improvements are observed in the majority of applications, due to better memory bandwidth.

• Some types of application, in particular classical molecular dynamics, have not performed as well as expected on the new system.
  – These applications are characterised by many strided memory accesses.
  – Sensitivity to an increased latency could be to blame.

• Performance benefits from the use of SMT have been observed in certain situations.
  – In particular for those codes which didn't do as well as expected.
  – Users should benchmark their own codes.