performance evaluation of sve enabled arm processor a64fx … · 2019-10-02 · title: a memory...

25
Copyright 2019 FUJITSU LIMITED Performance Evaluation of SVE Enabled Arm Processor A64FX using Variable Vector Length Shinji Sumimoto Fujitsu Limited 0 Arm Research Summit 2019

Upload: others

Post on 10-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Copyright 2019 FUJITSU LIMITED

Performance Evaluation of SVE Enabled Arm Processor A64FX using Variable Vector Length

Shinji SumimotoFujitsu Limited

0

Arm Research Summit 2019

Page 2: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Overview

Background

A64FX Overview

Application Characteristics: Compute Intensive vs. Memory Intensive

Preliminary Performance Evaluation

1 Copyright 2019 FUJITSU LIMITED

Page 3: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Background

A64FX is the First SVE enabled Arm Processor in the world. SVE realizes single binary for multiple vector length environment in order to

support application binary portability.

A64FX supports 512, 384, 256, 128 bit vector length.

HPC applications have several characteristics such as compute intensive and/or memory intensive. SVE enabled processor can control execution vector length at runtime.

A64FX has a memory bandwidth controlling feature at runtime.

No re-compilation is needed for the above executions.

Therefore, we have evaluated several application benchmarks in order to clarify application characteristics

2 Copyright 2019 FUJITSU LIMITED

Page 4: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Inheriting Fujitsu HPC CPU technologies with commodity standard ISA

A64FX: High Performance Arm CPU

Copyright 2019 FUJITSU LIMITED3

Page 5: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

High Performance Arm CPU “A64FX”

Architecture featuresISA Armv8.2-A (AArch64 only) SVE (Scalable Vector Extension)

SIMD width 512-bit

Precision FP64/32/16, INT64/32/16/8

# of cores 48 computing cores + 4 assistant cores (4 CMGs)

Memory HBM2: Peak B/W 1024 GB/s

Interconnect TofuD: 28 Gbps x 2 lanes x 10 ports

Copyright 2019 FUJITSU LIMITED

4

HBM2

HBM2

HBM2

HBM2

TofuDController

PCIeController

Net

wor

k on

Ch

ip

CMG(Core Memory Group)specification

13 coresL2 Cache 8 MiB

Mem 8 GiB, 256 GB/s

TofuD28 Gbps x 2 lanes x 10 ports

I/OPCIe Gen3 16 lanes

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core

Core Core Core Core

Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core

L2Cache

L2Cache

L2Cache

L2Cache

HB

M2

Interfa

ceH

BM

2 Interfa

ce

HB

M2

inte

rfa

ceH

BM

2 In

terf

ace

PCIe InterfaceTofuD Interface

RIN

G-B

us

Page 6: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

HBM2 8GiBHBM2 8GiBHBM2 8GiB

Extremely high bandwidth• Asynchronous Processing in cores, caches and memory controllers

• Maximizing the capability of each layer’s bandwidth

Performance

>2.7TFLOPS

A64FX Memory System

Copyright 2019 FUJITSU LIMITED

CMG

L1 Cache

>11.0TB/s (BF= 4)

L2 Cache

>3.6TB/s (BF = 1.3)

L1D 64KiB, 4way

512-bit wide SIMD 2x FMAs

Core Core CoreCore

>230GB/s

>115GB/s

12x Computing Cores + 1x Assistant Core

Memory

1024GB/s (BF =~0.37)

>115GB/s

>57GB/s

HBM2 8GiB

L2 Cache 8MiB, 16way

256GB/s

5

Page 7: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

A64FX CPU Performance Evaluation

Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx

Copyright 2019 FUJITSU LIMITED6

AHUG@ISC19 Workshop

Page 8: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

A64FX Performance Comparison(1/2)

Himeno Benchmark (Fortran90)

7 Copyright 2019 FUJITSU LIMITED

† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,SC18, https://dl.acm.org/citation.cfm?id=3291728

AHUG@ISC19 Workshop

Page 9: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

A64FX Performance Comparison(2/2)

Copyright 2019 FUJITSU LIMITED

WRF: Weather Research and Forecasting model Vectorizing loops including IF-constructs is key optimization

Source code tuning using directives promotes compiler optimizations

xx

8

AHUG@ISC19 Workshop

Page 10: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Application Characteristics: Compute Intensive vs. Memory Intensive

Copyright 2019 FUJITSU LIMITED9

Page 11: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

SVE: Scalable Vector Length

Copyright 2016 FUJITSU LIMITED

ISA does not fix Vector Length

SVE supports VL from 128 to 2048 bit with multiples of 128 bit

VL is set by processor before executing a binary dynamically

Single execution binary can be executed on processors with multiple VLs

Vector-Length Agnostic(VLA) programing enables ABI Compatibility

SVE SVE

512bit SIMD 256bit SIMD

Execution Binary does not depend on processor’s VL

Execution Binary Portability

Execution Binary/a.out

10

Reducing dynamic instruction steps to half

Increasing dynamic instruction steps to double

Page 12: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Application Characteristics: How to investigate?

Application Characteristics

Computing Intensive

Memory Bandwidth Intensive

How to investigate?

Using performance profiling tools: Arm Allinea Studio

Re-compiling with different compiler option and running

A64FX can help to evaluate the characteristics easily

Compute Intensive Analysis: Changing Vector Length

Memory Intensive Analysis: Changing Memory Access Gap

11 Copyright 2019 FUJITSU LIMITED

Page 13: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Preliminary Performance Evaluation

Copyright 2019 FUJITSU LIMITED12

Page 14: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Benchmark Applications, Evaluation Environment and Evaluation Pameters Benchmark Applications STREAM(TRIAD)

DGEMM

Himeno Benchmarks

NAS Parallel Benchmarks OMP Class C (EP, CG, LU, FE, IS, MG, BT, SP)

Evaluation Environment Hardware: Fugaku-Prototype System with A64FX (single node)

Compiler: Fujitsu Compiler (development version)

Evaluation Parameters SVE Vector Length: 128 - 512

•Using a command with prctl(PR_SVE *)

Memory Bandwidth: 100%-20%

•Using an option of submission job

13 Copyright 2019 FUJITSU LIMITED

Application VL 128 VL256 VL512

HBM100%

HBM 80%

HBM 60%

HBM 40%

HBM 20%

* https://lwn.net/Articles/717804/

Page 15: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

STREAM TRIAD: 1 Thread vs. 12 Threads(1CMG)

14 Copyright 2019 FUJITSU LIMITED

Relative Performance VL512-HBM100% = 100

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

STREAMS Triadd 1 thread

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

STREAMS Triadd 12 threads

512 256 128Memory

BandwidthMemory

Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 16: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

0

20

40

60

80

100

120

140

160

180

200

0% 20% 40% 60% 80% 100% 120%

STREAMS Triad 1 thread

512 256 128

0

20

40

60

80

100

120

140

160

180

200

0% 20% 40% 60% 80% 100% 120%

STREAMS Triad 12 threads

512 256 128

STREAM TRIAD: 1 Thread vs. 12 Threads-SIMD Effects

15 Copyright 2019 FUJITSU LIMITED

Relative Performance VL128-HBM 100% = 100

Memory Bandwidth

Memory Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 17: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

DGEMM 12 threads

512 256 128

DGEMM: 12 Threads(1CMG)

16 Copyright 2019 FUJITSU LIMITED

Relative Performance VL512-HBM100% = 100

Memory Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Vector Length

Page 18: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Himeno Benchmark: 1 CMG vs. 4 CMG

17 Copyright 2019 FUJITSU LIMITED

Relative Performance VL512-HBM100% = 100

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

Himeno OpenMP 12 thread

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

Himeno OpenMP 48 thread

512 256 128Memory

BandwidthMemory

Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 19: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Himeno Benchmark: 1 CMG vs. 4 CMG-SIMD Effects

18 Copyright 2019 FUJITSU LIMITED

Relative Performance VL128-HBM100% = 100

0

50

100

150

200

250

300

0% 20% 40% 60% 80% 100% 120%

Himeno OpenMP 12 thread

512 256 128

0

50

100

150

200

250

300

0% 20% 40% 60% 80% 100% 120%

Himeno OpenMP 48 thread

512 256 128

Page 20: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

NPB: with No SMID Effect and Compute Intensive

19 Copyright 2019 FUJITSU LIMITED

Relative Performance VL512-HBM100% = 100

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

EP

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

IS

512 256 128Memory Bandwidth

Memory Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 21: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

NPB: with SMID Effect and Compute Intensive

Relative Performance VL512-HBM100% = 100

20 Copyright 2019 FUJITSU LIMITED

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

CG

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

BT

512 256 128Memory

BandwidthMemory

Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 22: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

SP

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

MG

512 256 128

NPB: with SMID Effect and Compute Intensiveand Memory Intensive(1) Relative Performance VL512-HBM100% = 100

21 Copyright 2019 FUJITSU LIMITED

Memory Bandwidth

Memory Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 23: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

FT

512 256 128

0

20

40

60

80

100

120

0% 20% 40% 60% 80% 100% 120%

LU

512 256 128

NPB: with SMID Effect and Compute Intensiveand Memory Intensive(2)

22 Copyright 2019 FUJITSU LIMITED

Relative Performance VL512-HBM100% = 100

Memory Bandwidth

Memory Bandwidth

Rel

ativ

e Pe

rfor

man

ce

Rel

ativ

e Pe

rfor

man

ce

Vector Length Vector Length

Page 24: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

Summary

Arm SVE provides Application Binary Portability Variable Vector Length is defined at runtime

SVE feature can be used application performance analysis Reduction of vector length shows application is compute intensive or not

Benchmark Evaluation on A64FX Enables: Compute Intensive Analysis: Changing Vector Length

Memory Intensive Analysis: Changing Memory Access Gap

23 Copyright 2019 FUJITSU LIMITED

Page 25: Performance Evaluation of SVE Enabled Arm Processor A64FX … · 2019-10-02 · Title: A Memory Saving Communication Method Using Remote Atomic Operations Author: Shinji Sumimoto

24 Copyright 2019 FUJITSU LIMITED