Applications on the A64FX
![Page 1: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/1.jpg)
Adrian Jackson
Senior Research Fellow
EPCC, The University of Edinburgh
@adrianjhpc
Investigating
Applications on the
A64FX
Adrian Jackson
Michèle Weiland
Nick Brown
Andrew Turner
Mark Parsons
EPCC, The University of Edinburgh
![Page 2: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/2.jpg)
Arm-based processors
• Arm (Nvidia?) designed processors have a strong presence in
low-power computing
– Arm-licenced processor designs in a wide range of mobile devices,
and even in Intel products
• Recent work has seen a number of Arm-design processors
being developed for server class applications
– Cavium’s ThunderX2
– Amazon Graviton
– Ampere eMAG
– Huawei Kunpeng 920
– Fujitsu A64FX
![Page 3: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/3.jpg)
A64FX processor
• Fujitsu A64FX
– 48 cores
– Maximum of 2.2GHz
– 12 cores per quadrant
– Separate assistant cores for the O/S
– 64KB L1 per core, 32MB L2 (8MB per quadrant) per chip
– 512-bit wide SVE vectors
– Compared to Skylake’s 512-bit, Broadwell’s 256-bit, IvyBridge and
TX2’s 128-bit vectors
– 4 memory controllers (one per quadrant) with 8 GB HBM2 each
– ~1TB/s memory bandwidth across the chip
– Compared to 6 channels on latest Intel
– AMD’s EPYC also has 8 channels
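The headline figures above imply a few per-core numbers that are useful when reading the later results. The sketch below only does arithmetic on the values quoted on this slide; treating "~1TB/s" as 1000 GB/s is an approximation, and the function and constant names are illustrative.

```python
# Figures quoted on the slide above.
CORES = 48
QUADRANTS = 4
HBM2_PER_QUADRANT_GB = 8
VECTOR_BITS = 512
MEMORY_BANDWIDTH_GBS = 1000  # "~1TB/s", approximated here as 1000 GB/s


def a64fx_derived_figures():
    """Derive per-chip and per-core figures from the quoted specification."""
    total_memory_gb = QUADRANTS * HBM2_PER_QUADRANT_GB
    return {
        "cores_per_quadrant": CORES // QUADRANTS,
        "total_hbm2_gb": total_memory_gb,
        "memory_per_core_gb": total_memory_gb / CORES,
        "bandwidth_per_core_gbs": MEMORY_BANDWIDTH_GBS / CORES,
        # One 512-bit vector holds this many 64-bit doubles.
        "doubles_per_vector": VECTOR_BITS // 64,
    }


if __name__ == "__main__":
    for name, value in a64fx_derived_figures().items():
        print(f"{name}: {value:.2f}")
```

The total of 32 GB per chip, well under 1 GB per core, is the memory limit that the summary slide refers to.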
![Page 4: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/4.jpg)
Previous work
• Existing results on other Arm processors (TX2)
Evaluating the Arm Ecosystem for High Performance
Computing. A. Jackson, A. Turner, M. Weiland, N. Johnson, O. Perks, M.
Parsons, PASC '19, June 2019.
[Figure: Time to solution relative to SGI ICE XA (lower is quicker), by system (HPE Apollo 70, SGI ICE XA, Cray XC30, Dell EMC), for COSA, OpenSBLI, GROMACS and nektar++]
![Page 5: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/5.jpg)
Multi-node performance
• Investigating A64FX performance at scale
• Large scale application runs
• Networking evaluation
• Evaluating porting effort and software ecosystem maturity
– Including relevant programming languages
– Fortran, C, and C++
![Page 6: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/6.jpg)
System comparison
![Page 7: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/7.jpg)
Application benchmarking
![Page 8: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/8.jpg)
HPCG
• High Performance Conjugate Gradient (HPCG) kernel benchmark
aiming to exercise:
– Floating point performance
– Memory bandwidth
– Network bandwidth and latency
• Implemented with C++, MPI, and OpenMP
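To show why a conjugate gradient kernel stresses these three resources, here is a minimal, unpreconditioned CG sketch in pure Python. It is an illustration only, not the HPCG implementation (which is C++ with MPI and OpenMP); the helper names and the dense matrix are assumptions made for brevity.

```python
def dot(x, y):
    """Dot product; in a distributed run this is where a global reduction occurs."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Dense matrix-vector product; in real CG codes this is a sparse, bandwidth-bound kernel."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each iteration performs one matrix-vector product, two dot products and three vector updates, so the arithmetic per byte moved is low and the dot products need a collective operation when the vectors are distributed.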
![Page 9: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/9.jpg)
HPCG multi-node
![Page 10: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/10.jpg)
minikab
• Mini Krylov ASiMoV Benchmark (minikab)
• Parallel CG solver
– Fortran 2008 with MPI and OpenMP
• Can configure
– The type of decomposition;
– The solver algorithm;
– The communication approach;
– The serial sparse-matrix routine, in plain Fortran or implemented
via a numerical library (such as MKL).
• Sparse matrix benchmark
– 9,573,984 degrees of freedom and 696,096,138 non-zero elements
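The benchmark matrix above has about 73 non-zero elements per row on average (696,096,138 / 9,573,984), so the serial sparse matrix-vector product dominates the solver. The sketch below shows such a routine for a matrix in compressed sparse row (CSR) form. CSR is an assumption here, since the slide does not state which storage format minikab uses, and the function name is illustrative.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR form.

    values  : non-zero entries, row by row
    col_idx : column index of each entry in values
    row_ptr : row_ptr[i]..row_ptr[i + 1] delimits the entries of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x: this is what makes the kernel memory-bound.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # The 3x3 matrix [[4, 0, 1], [0, 3, 0], [1, 0, 2]] in CSR form.
    values = [4.0, 1.0, 3.0, 1.0, 2.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

A library routine such as the MKL option mentioned above would replace this loop while keeping the same inputs and result.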
![Page 11: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/11.jpg)
minikab performance
![Page 12: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/12.jpg)
nekbone
• The Nekbone mini-app benchmark captures the basic structure
of the Nek5000 application
– a high order, incompressible NS solver based on the spectral element
method, implemented in Fortran
– Dominated by matrix-vector multiplication operations performed
element by element.
– Nearest-neighbour communication, and MPI Allreduce operations.
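The element-by-element pattern described above can be sketched generically: the global matrix is never assembled, and each element applies its small local matrix to the values it gathers from the global vector. This is a simplified illustration in Python with dense local matrices and hypothetical names; it is not Nekbone's Fortran code and does not reproduce its spectral element operators.

```python
def element_by_element_matvec(elements, x, n_global):
    """Apply y = A x without assembling A.

    elements : list of (local_matrix, dofs) pairs, where dofs maps each local
               row/column to a global index
    """
    y = [0.0] * n_global
    for local_matrix, dofs in elements:
        # Gather the element's values from the global vector.
        x_local = [x[g] for g in dofs]
        # Small dense matrix-vector product on the element.
        y_local = [sum(a * xl for a, xl in zip(row, x_local)) for row in local_matrix]
        # Scatter-add: shared points receive contributions from several elements.
        for g, value in zip(dofs, y_local):
            y[g] += value
    return y


if __name__ == "__main__":
    # Two 1D linear elements sharing global point 1.
    local = [[1.0, -1.0], [-1.0, 1.0]]
    elements = [(local, (0, 1)), (local, (1, 2))]
    print(element_by_element_matvec(elements, [1.0, 2.0, 4.0], 3))
```

In a parallel run, the scatter-add across points shared between processes is where nearest-neighbour communication is needed.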
![Page 13: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/13.jpg)
nekbone
![Page 14: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/14.jpg)
COSA
• Fluid dynamics code
– Harmonic balance (frequency domain approach)
– Unsteady Navier-Stokes solver
– Optimise performance of turbomachinery-like problems
– Multi-grid, multi-level, multi-block code
– Implemented in Fortran (with Cray pointers)
– Parallelised with MPI
![Page 15: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/15.jpg)
COSA
![Page 16: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/16.jpg)
COSA
![Page 17: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/17.jpg)
CASTEP
• CASTEP: DFT code for calculating the properties of materials
from first principles
– Can simulate a wide range of materials properties including
energetics, structure at the atomic level, vibrational properties,
electronic response properties etc.
– Fortran code with MPI and OpenMP parallelisations
– Uses FFT libraries heavily
![Page 18: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/18.jpg)
CASTEP
![Page 19: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/19.jpg)
OpenSBLI
• Programming framework to generate finite difference
approximations
• Implemented in Python which generates C using the
OPS library, with MPI and OpenMP parallel functionality
as well as GPU parallelisations.
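As a reminder of what a finite difference approximation is, the sketch below applies a second-order central difference on a uniform grid. It is hand-written Python for illustration only and does not use the OpenSBLI or OPS interfaces; the function name is an assumption.

```python
def central_difference(values, dx):
    """Approximate df/dx at the interior points of a uniform grid.

    Uses (f[i + 1] - f[i - 1]) / (2 * dx), which is second-order accurate.
    """
    return [
        (values[i + 1] - values[i - 1]) / (2.0 * dx)
        for i in range(1, len(values) - 1)
    ]


if __name__ == "__main__":
    dx = 0.5
    xs = [i * dx for i in range(5)]  # 0.0, 0.5, 1.0, 1.5, 2.0
    f = [x * x for x in xs]          # f(x) = x^2, so df/dx = 2x
    print(central_difference(f, dx))
```

A stencil like this reads neighbouring grid points for every output point, which is why such codes tend to be sensitive to memory bandwidth.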
![Page 20: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/20.jpg)
OpenSBLI
![Page 21: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/21.jpg)
OpenSBLI
![Page 22: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/22.jpg)
Summary
• This was a porting investigation, not an optimisation investigation
– General porting was very straightforward
– GNU compilers and Fujitsu maths libraries help significantly
– Fujitsu compilers can bring performance benefits
• Performance is generally extremely good
– Not all codes are necessarily faster than on top-end Intel processors
• Memory-bandwidth-dominated codes benefit significantly
• The small memory per node poses challenges for some
applications
– Over-decomposition is necessary to fit simulations into memory
• Scope for targeted optimisation to improve performance
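The memory point above can be made concrete with a small calculation of the minimum node count a simulation needs before it fits in memory. This is a sketch under stated assumptions: the 32 GB per node follows from the 4 x 8 GB HBM2 quoted earlier, while the 100 GB problem size, the 192 GB comparison node and the `usable_fraction` parameter are hypothetical values chosen for illustration.

```python
import math


def min_nodes_for_memory(problem_gb, node_memory_gb, usable_fraction=0.9):
    """Smallest number of nodes whose combined usable memory holds the problem.

    usable_fraction is a hypothetical allowance for the OS and library buffers.
    """
    usable_gb = node_memory_gb * usable_fraction
    return math.ceil(problem_gb / usable_gb)


if __name__ == "__main__":
    problem_gb = 100  # hypothetical simulation footprint
    print("A64FX nodes (32 GB):", min_nodes_for_memory(problem_gb, 32))
    print("Hypothetical 192 GB node:", min_nodes_for_memory(problem_gb, 192))
```

A job that would fit on one large-memory node therefore has to be spread over several A64FX nodes, even when it does not need the extra compute.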
![Page 23: Applications on the A64FX](https://reader031.vdocument.in/reader031/viewer/2022020620/61e540c2fb9d2435fa41f36f/html5/thumbnails/23.jpg)
Acknowledgements
• Access to the A64FX was provided through the Fujitsu early
access programme
• The Fulhame HPE Apollo 70 system is supplied to EPCC as part
of the Catalyst UK programme, a collaboration with Hewlett
Packard Enterprise, Arm and SUSE to accelerate the adoption of
Arm-based supercomputer applications in the UK.
• This work used the Cirrus UK National Tier-2 HPC Service at
EPCC (http://www.cirrus.ac.uk) funded by the University of
Edinburgh and EPSRC (EP/P020267/1).
• This work used the ARCHER UK National Supercomputing
Service (http://www.archer.ac.uk).
• The EPCC NGIO system was funded by the European Union's
Horizon 2020 Research and Innovation programme under Grant
Agreement no. 671951.