performance evaluation of sve enabled arm processor a64fx … · 2019-10-02 · title: a memory...
TRANSCRIPT
Copyright 2019 FUJITSU LIMITED
Performance Evaluation of SVE Enabled Arm Processor A64FX using Variable Vector Length
Shinji SumimotoFujitsu Limited
0
Arm Research Summit 2019
Overview
Background
A64FX Overview
Application Characteristics: Compute Intensive vs. Memory Intensive
Preliminary Performance Evaluation
1 Copyright 2019 FUJITSU LIMITED
Background
A64FX is the First SVE enabled Arm Processor in the world. SVE realizes single binary for multiple vector length environment in order to
support application binary portability.
A64FX supports 512, 384, 256, 128 bit vector length.
HPC applications have several characteristics such as compute intensive and/or memory intensive. SVE enabled processor can control execution vector length at runtime.
A64FX has a memory bandwidth controlling feature at runtime.
No re-compilation is needed for the above executions.
Therefore, we have evaluated several application benchmarks in order to clarify application characteristics
2 Copyright 2019 FUJITSU LIMITED
Inheriting Fujitsu HPC CPU technologies with commodity standard ISA
A64FX: High Performance Arm CPU
Copyright 2019 FUJITSU LIMITED3
High Performance Arm CPU “A64FX”
Architecture featuresISA Armv8.2-A (AArch64 only) SVE (Scalable Vector Extension)
SIMD width 512-bit
Precision FP64/32/16, INT64/32/16/8
# of cores 48 computing cores + 4 assistant cores (4 CMGs)
Memory HBM2: Peak B/W 1024 GB/s
Interconnect TofuD: 28 Gbps x 2 lanes x 10 ports
Copyright 2019 FUJITSU LIMITED
4
HBM2
HBM2
HBM2
HBM2
TofuDController
PCIeController
Net
wor
k on
Ch
ip
CMG(Core Memory Group)specification
13 coresL2 Cache 8 MiB
Mem 8 GiB, 256 GB/s
TofuD28 Gbps x 2 lanes x 10 ports
I/OPCIe Gen3 16 lanes
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core
Core Core Core Core
Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core Core Core
Core Core Core Core Core Core Core Core Core Core
Core Core Core Core
L2Cache
L2Cache
L2Cache
L2Cache
HB
M2
Interfa
ceH
BM
2 Interfa
ce
HB
M2
inte
rfa
ceH
BM
2 In
terf
ace
PCIe InterfaceTofuD Interface
RIN
G-B
us
HBM2 8GiBHBM2 8GiBHBM2 8GiB
Extremely high bandwidth• Asynchronous Processing in cores, caches and memory controllers
• Maximizing the capability of each layer’s bandwidth
Performance
>2.7TFLOPS
A64FX Memory System
Copyright 2019 FUJITSU LIMITED
CMG
L1 Cache
>11.0TB/s (BF= 4)
L2 Cache
>3.6TB/s (BF = 1.3)
L1D 64KiB, 4way
512-bit wide SIMD 2x FMAs
Core Core CoreCore
>230GB/s
>115GB/s
12x Computing Cores + 1x Assistant Core
Memory
1024GB/s (BF =~0.37)
>115GB/s
>57GB/s
HBM2 8GiB
L2 Cache 8MiB, 16way
256GB/s
5
A64FX CPU Performance Evaluation
Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
Copyright 2019 FUJITSU LIMITED6
AHUG@ISC19 Workshop
A64FX Performance Comparison(1/2)
Himeno Benchmark (Fortran90)
7 Copyright 2019 FUJITSU LIMITED
† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,SC18, https://dl.acm.org/citation.cfm?id=3291728
AHUG@ISC19 Workshop
A64FX Performance Comparison(2/2)
Copyright 2019 FUJITSU LIMITED
WRF: Weather Research and Forecasting model Vectorizing loops including IF-constructs is key optimization
Source code tuning using directives promotes compiler optimizations
xx
8
AHUG@ISC19 Workshop
Application Characteristics: Compute Intensive vs. Memory Intensive
Copyright 2019 FUJITSU LIMITED9
SVE: Scalable Vector Length
Copyright 2016 FUJITSU LIMITED
ISA does not fix Vector Length
SVE supports VL from 128 to 2048 bit with multiples of 128 bit
VL is set by processor before executing a binary dynamically
Single execution binary can be executed on processors with multiple VLs
Vector-Length Agnostic(VLA) programing enables ABI Compatibility
SVE SVE
512bit SIMD 256bit SIMD
Execution Binary does not depend on processor’s VL
Execution Binary Portability
Execution Binary/a.out
10
Reducing dynamic instruction steps to half
Increasing dynamic instruction steps to double
Application Characteristics: How to investigate?
Application Characteristics
Computing Intensive
Memory Bandwidth Intensive
How to investigate?
Using performance profiling tools: Arm Allinea Studio
Re-compiling with different compiler option and running
A64FX can help to evaluate the characteristics easily
Compute Intensive Analysis: Changing Vector Length
Memory Intensive Analysis: Changing Memory Access Gap
11 Copyright 2019 FUJITSU LIMITED
Preliminary Performance Evaluation
Copyright 2019 FUJITSU LIMITED12
Benchmark Applications, Evaluation Environment and Evaluation Pameters Benchmark Applications STREAM(TRIAD)
DGEMM
Himeno Benchmarks
NAS Parallel Benchmarks OMP Class C (EP, CG, LU, FE, IS, MG, BT, SP)
Evaluation Environment Hardware: Fugaku-Prototype System with A64FX (single node)
Compiler: Fujitsu Compiler (development version)
Evaluation Parameters SVE Vector Length: 128 - 512
•Using a command with prctl(PR_SVE *)
Memory Bandwidth: 100%-20%
•Using an option of submission job
13 Copyright 2019 FUJITSU LIMITED
Application VL 128 VL256 VL512
HBM100%
HBM 80%
HBM 60%
HBM 40%
HBM 20%
* https://lwn.net/Articles/717804/
STREAM TRIAD: 1 Thread vs. 12 Threads(1CMG)
14 Copyright 2019 FUJITSU LIMITED
Relative Performance VL512-HBM100% = 100
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
STREAMS Triadd 1 thread
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
STREAMS Triadd 12 threads
512 256 128Memory
BandwidthMemory
Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
0
20
40
60
80
100
120
140
160
180
200
0% 20% 40% 60% 80% 100% 120%
STREAMS Triad 1 thread
512 256 128
0
20
40
60
80
100
120
140
160
180
200
0% 20% 40% 60% 80% 100% 120%
STREAMS Triad 12 threads
512 256 128
STREAM TRIAD: 1 Thread vs. 12 Threads-SIMD Effects
15 Copyright 2019 FUJITSU LIMITED
Relative Performance VL128-HBM 100% = 100
Memory Bandwidth
Memory Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
DGEMM 12 threads
512 256 128
DGEMM: 12 Threads(1CMG)
16 Copyright 2019 FUJITSU LIMITED
Relative Performance VL512-HBM100% = 100
Memory Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Vector Length
Himeno Benchmark: 1 CMG vs. 4 CMG
17 Copyright 2019 FUJITSU LIMITED
Relative Performance VL512-HBM100% = 100
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
Himeno OpenMP 12 thread
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
Himeno OpenMP 48 thread
512 256 128Memory
BandwidthMemory
Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
Himeno Benchmark: 1 CMG vs. 4 CMG-SIMD Effects
18 Copyright 2019 FUJITSU LIMITED
Relative Performance VL128-HBM100% = 100
0
50
100
150
200
250
300
0% 20% 40% 60% 80% 100% 120%
Himeno OpenMP 12 thread
512 256 128
0
50
100
150
200
250
300
0% 20% 40% 60% 80% 100% 120%
Himeno OpenMP 48 thread
512 256 128
NPB: with No SMID Effect and Compute Intensive
19 Copyright 2019 FUJITSU LIMITED
Relative Performance VL512-HBM100% = 100
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
EP
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
IS
512 256 128Memory Bandwidth
Memory Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
NPB: with SMID Effect and Compute Intensive
Relative Performance VL512-HBM100% = 100
20 Copyright 2019 FUJITSU LIMITED
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
CG
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
BT
512 256 128Memory
BandwidthMemory
Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
SP
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
MG
512 256 128
NPB: with SMID Effect and Compute Intensiveand Memory Intensive(1) Relative Performance VL512-HBM100% = 100
21 Copyright 2019 FUJITSU LIMITED
Memory Bandwidth
Memory Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
FT
512 256 128
0
20
40
60
80
100
120
0% 20% 40% 60% 80% 100% 120%
LU
512 256 128
NPB: with SMID Effect and Compute Intensiveand Memory Intensive(2)
22 Copyright 2019 FUJITSU LIMITED
Relative Performance VL512-HBM100% = 100
Memory Bandwidth
Memory Bandwidth
Rel
ativ
e Pe
rfor
man
ce
Rel
ativ
e Pe
rfor
man
ce
Vector Length Vector Length
Summary
Arm SVE provides Application Binary Portability Variable Vector Length is defined at runtime
SVE feature can be used application performance analysis Reduction of vector length shows application is compute intensive or not
Benchmark Evaluation on A64FX Enables: Compute Intensive Analysis: Changing Vector Length
Memory Intensive Analysis: Changing Memory Access Gap
23 Copyright 2019 FUJITSU LIMITED
24 Copyright 2019 FUJITSU LIMITED