bemfmm: an fmm-accelerated boundary element method …...bemfmm: an fmm-accelerated boundary element...

32
BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan Bagci, and David Keyes April 24, 2018 King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia Tokyo Institute of Technology, Tokyo, Japan

Upload: others

Post on 01-Mar-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

BEMFMM: An FMM-Accelerated Boundary Element

Method-Based Solver for the 3D Helmholtz Equation

Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan

Bagci, and David Keyes

April 24, 2018

King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Tokyo Institute of Technology, Tokyo, Japan

Page 2: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Highlights

- FMM-based solver for BEM discretizations of oscillatory operators.

• Demonstrated for acoustics in Helmholtz formulation.

• Scattering from a sphere (exact solution available).

- Three levels of parallelism: MPI + X + Y.

- Three contemporary Intel architectures: Haswell, Skylake, KNL.

- Up to 2 billion Degrees-of-Freedom and up to 0.2 million cores of Shaheen XC40.

1

Page 3: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Outline

Introduction and Background

Extreme Scale Implementation and Optimizations

Shared-memory Optimizations

Distributed-memory Optimizations

Evaluation and Discussion

Shared-memory Optimizations

Distributed-memory Optimizations

Concluding Remarks and Future Work

2

Page 4: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Motivation

- Validating FMM-based horizontal and vertical parallelism techniques on modern

many/multi-core architectures and 196,608 hardware cores of Shaheen XC40

supercomputer.

- Propose a performance model to estimate a near optimal granularity of recursive task

creation during tree traversal.

- Describing the tradeoffs of using various modes of singularity treatment in the BEM.

3

Page 5: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Problem Statement and Formulation [1/3]

• The time-dependent form of the wave equation is governed by the Helmholtz

equation.

∇2U(r , t)− 1

c2

∂t2U(r , t) = 0 (1)

• The time-harmonic form as a result of plugging U(r , t) = Re[U0(r)e−jwt ] in Eq. 1

is:

∇2U0(r) + k2U0(r) = 0 (2)

4

Page 6: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Problem Statement and Formulation [2/3]

- The surface integral solution as a result of plugging the second form of Green’s

theorem looks like

pinc(r) +

∫S

[∂G (r , r ′)

∂n′p(r ′)− G (r , r ′)q(r ′)]dS ′ =

1

2p(r), r ∈ S (3)

G (r , r ′) =e jkR

4πR(4)

- if we set p = 0, we obtain ∫SG (r , r ′)q(r ′) = pinc(r) (5)

5

Page 7: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Problem Statement and Formulation [3/3]

• Wave scattering and applications.

High-intensity focused ultrasound12

1T. Betcke, E. van ’t Wout and P. Glat. Computationally efficient boundary element methods for high-frequency Helmholtz problems in unbounded

domains, in: Modern Solvers for Helmholtz Problems, Springer (2017).

2E. van ’t Wout, P. Glat, T. Betcke and S. Arridge. A fast boundary element method for the scattering analysis of high-intensity focused

ultrasound, Journal of the Acoustical Society of America 138(5) (2015) pp2726-2737.

6

Page 8: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

System Overview

Normalization

BEMFMM PETSc GMRES FMM Wrapper ExaFMMu,wGmsh α β ym

iteration

Figure 1: Dataflow across libraries (color-coded)

7

Page 9: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Discretization of the Scatterer’s Surface

- Divide the surface into curvilinear

triangular patches.

- Each curvilinear patch has Ni inter-

polation points.

- Using Nystrom method, The unknown

velocity is expanded as an interpolation

given by Equation:

X

Y

Z

q(r ′) =

Np∑n=1

Ni∑i=1

ϑ−1(r ′)L(i ,n)(ζ, η){I}(i ,n) (6)

ZI = V inc (7)

[Z ](j ,m)(i ,n) =

∫∆n

G (r(j ,m), r′)ϑ−1(r ′)L(i ,n)(ζ, η)dr ′ (8)

8

Page 10: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

FMM in Numerical Science

- FMM looks for a solution that can be written as a BIE u(x) =∫

Γ Φ(x , y)q(y)dS(y)

(to some extent)

- Examples of FMM use cases

Application Kernel Single/Double

Gravitational, Potential Laplace Single

Electrostatic Field Laplace Double

Acoustics Scattering Field (Low Frequency) Helmholtz Single

Electromagnetic Scattering Field Helmholtz Double

9

Page 11: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

FMM as an Accelerator for the 3D Helmholtz Solution

- FMM works as a matrix-free accelerator for the mat-vec multiplication (or IFMM as a

pre-conditioner [Takahashi et al., 2017]).

- The resulting BIE results in a structured dense matrix.

- Example: impulse response due to a monopole source.

p(r) =Ns∑j

G (r , r ′)q(r ′) (9)

- Eq. 9 is expanded into a series of spherical harmonics.

p(r) = ik∞∑n=0

n∑m=−n

S−mn (rj)Rmn (r), r ≤ rq (10)

p(r) = ik∞∑n=0

n∑m=−n

Cmn Rm

n (r) (11) Cmn =

∑rq<Rmax

QqS−mn (rq) (12)

10

Page 12: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

An Overview of FMM

Quad-tree partitioning on 4 processes

P2M

P2P

L2P

M2L

M2M

L2L

FMM Hierarchy (Upward, Horizontal, and Downward Sweeps)

11

Page 13: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Traversal Stage

Traditional FMM: Iterate over

Hilbert/Morton orders and probe each cell

for its neighbor [MS Warren et al., 1995].

0000 0001

0010001100 001101

001110 001111

110000

01

10

110001

110010 1100111101

1110 1111

Warren and Salmon original FMM

Dual Tree Traversal (DTT)3:

Simultaneously parse source and targets in

a pre-order depth-first manner.

Dual-tree traversal

3Mustafa Abduljabbar, Mohammed Al Farhan, Rio Yokota and David Keyes. Performance Evaluation of Computation and Communication Kernels

of the Fast Multipole Method on Intel Manycore Architecture, Proceedings of the European Conference on Parallel Processing (2017) 553–564.

12

Page 14: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

P2P/M2L: Data-level Parallelism

Auto-vectorization with #pragma simd

may almost always win in embarrassing-

ly parallel kernels with structured me-

mory access; however,

• Check icc compiler’s output with

-qopt-report. The default was the

inner-most loop (suboptimal here).

• Double-check vector loads for

array-of-structs.

• Rewrite divisions and complex

numbers to their polar form

(experimental 20% reduction in

vector moves).

1 f o r ( ; i<n i ; ++i ) {2 v i r = r e a l ( Bi . SRC) ; v i i = imag ( Bi . SRC) ;

3 f o r ( j =0; j<n j ; ++j ) {4 dX = x i−x j ; R2 = norm (dX) ;

5 \\ r e l a y s e l f−s i g u l a r i t y to PETSc c a l l b a c k

6 i f ( Bi .PATCH!= Bj .PATCH && R2!=0) {7 r e a l t R=s q r t ( R2 ) ;

8 i f (R<=n e a r p a t c h d i s t a n c e ) {9 f o r ( k=0; k<g a u s s q u a d p o i n t s ; ++k ) {

10 \\ n e a r patch s i n g u l a r i t y t r e a t m e n t

11 }12 } e l s e {13 v j r = r e a l ( Bj . SRC) ; v j i = imag ( Bj . SRC) ;

14 s r c 2 r = v i r ∗ v j r−v i i ∗ v j i ;

15 s r c 2 i = v i r ∗ v j i + v i i ∗ v j r ;

16 invR = 1 . 0 / s q r t (R) ;

17 e i k r = 1 . 0 / exp ( w a v e i ∗R) ;

18 e i k r ∗= invR ;

19 e i k r r = cos ( wave r∗R)∗ e i k r ;

20 e i k r i = s i n ( wave r∗R)∗ e i k r ;

21 p o t r += s r c 2 r ∗ e i k r r−s r c 2 i ∗ e i k r i ;

22 p o t i += s r c 2 r ∗ e i k r i +s r c 2 i ∗ e i k r r ;

23 }24 }25 }26 Bi .TRG += complex ( p o t r , p o t i ) ;

27 }

Helmholtz S2T Kernel

13

Page 15: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Traversal: Thread-level Parallelism

- Control cell size and task granularity by minimizing the difference between LLC and

the interacting cell sizes (minimize cache-miss rate).

mins,c

f (s, c) = (M2Lsize + P2Psize)− L2/L3 Cache

= 2× c × task size × nthreads/core

× [(csize × logs

c) + bsize]− L2/L3 Cache

(13)

- Multiplier “2” is inclusive of source and

target.- csize/bsize is the cell/body struct size.- s is the task spawning parameter.- c is the number of bodies per leaf cell.- log s

c is the depth of recursive branch.- L2/L3 Last Level Cache (LLC) size.

14

Page 16: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Large Mesh Partitioning

- Pre-partitioning:

• Create intermediate format with

minimal indexed binary data

(double-precision coordinates).

• Map each rank to its region in the

file.

- Partitioning:

• Separate Global/Local trees using

modified ORB [Abduljabbar et al.,

2017].

• Graft tree in one step when doing

global traversal.

rank 0 rank 2my rank

localtree

processtree

local essential treelocal root nodes

LET Grafting

15

Page 17: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Load-balancing (Repartitioning)

- Load-imbalance harms GMRES performance due to long wait on reduction of each

step.

- Optimal partitioning for random distributions is NP-hard.

- Consider previous work-load and communication.

wi = li + α ∗ ri (14)

- The variables li and ri , and the total runtime are already measured in the present

code, so the information is available with negligible cost.

- Control granularity of partitioning based on load-imbalance.

Coarse Fine

Deferpartitioningfor a few steps

WeightedMorton/Hilbertpartitioningevery time step

Migrationwithin a time step(Node failure)

16

Page 18: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

A Neighborhood-based Communication Protocol (HSDX ) 4

4Mustafa Abduljabbar, George S. Markomanolis, Huda Ibeid, Rio Yokota and David Keyes. Communication Reducing Algorithms for Distributed

Hierarchical N-Body Problems with Boundary Distributions, Proceedings of the International Supercomputing Conference (2017), 79–96. 17

Page 19: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Hardware Specifications

KNL Haswell Skylake

Family x200 E5V3 Scalable

Model 7290 2670 8176

Socket(s) 1 2 2

Cores 72 32 56

GHz 1.50 2.60 2.10

Watts/socket 245 120 165

DDR4 (GB) 192 128 264

Frequency Driver acpi-cpufreq acpi-cpufreq acpi-cpufreq

Max GHz 1.50 2.60 2.10

Governor conservative performance ondemand

Turbo Boost X X X

18

Page 20: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Data-level Parallelism

0 0.5 1 1.5 2 2.5N 106

0

1000

2000

3000

4000

5000

6000

7000

8000

GF

LO

P/s

Skylake - 112 Threads

No-vecIntrinAuto-vecTheoratical Peak

0 0.5 1 1.5 2 2.5N 106

0

1000

2000

3000

4000

5000

6000

7000

8000

GF

LO

P/s

KNL [quadrant/flat mode] - 288 Threads

No-vecIntrinAuto-vecTheoratical Peak

Single Precision Floating Point Performance

- Lower latency and higher throughput for AVX512 in Skylake.

- Increased cache miss penalty in KNL.

19

Page 21: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Thread-level Parallelism [1/2]

16 32 64 128 256 512Task Size

2048

1024

512

256

128

64

Gra

in S

ize

Skylake - 112 Threads

5.3

5.26

5.33

5.62

5.96

7.4

4.71

4.24

4.25

4.56

4.86

5.77

4.76

3.8

3.72

4.04

4.24

4.24

6.6

4.74

4.74

4.83

4.9

4.88

11.66

8.98

9.27

9.15

8.97

8.96

14.89

15.97

15.95

16.94

17.15

17.254

6

8

10

12

14

16

16 32 64 128 256 512Task Size

2048

1024

512

256

128

64

Gra

in S

ize

Skylake - 112 Threads

0.62

0.66

0.71

0.75

0.79

0.83

0.33

0.41

0.5

0.58

0.66

0.75

0.18

0.01

0.16

0.33

0.5

0.5

1.02

0.68

0.34

0.01

0.01

0.01

2.36

1.69

1.02

1.02

1.02

1.02

4.38

3.03

3.03

3.03

3.03

3.030.5

1

1.5

2

2.5

3

3.5

4

Experimental results vs. performance model on Skylake

20

Page 22: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Thread-level Parallelism [2/2]

16 32 64 128 256 512Task Size

2048

1024

512

256

128

64

Gra

in S

ize

KNL [quadrant/flat mode] - 288 Threads

14.76

13.82

13.96

13.93

14.48

17.58

11.57

10.94

10.8

11.15

11.63

13.49

9.08

8.35

8.24

8.71

9.03

9.04

11.07

10.09

9.7

10.07

10.1

10.1

19.81

18.57

19.5

18.81

18.49

18.46

32.51

35.35

37.05

39.93

41.17

41.1910

15

20

25

30

35

40

16 32 64 128 256 512Task Size

2048

1024

512

256

128

64

Gra

in S

ize

KNL [quadrant/flat mode] - 288 Threads

0.59

0.64

0.68

0.73

0.77

0.82

0.27

0.36

0.45

0.54

0.64

0.73

0.27

0.09

0.09

0.27

0.45

0.45

1.18

0.82

0.46

0.09

0.09

0.09

2.64

1.91

1.18

1.18

1.18

1.18

4.83

3.37

3.37

3.37

3.37

3.37 0.5

1

1.5

2

2.5

3

3.5

4

4.5

Experimental results vs. performance model on KNL

21

Page 23: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Cray XC40 Characteristics

- Distributed experiments are run on Shaheen XC40, which hosts 196,608 Haswell

cores, and has a linpack performance of 5.5 PFlop/s.

- Strong scaling is challenging in traditional FMM codes.

XC40 Network Hierarchy

Level Hardware/Network Unit Nodes Cores Overhead Hops

1 Socket 1 16 32 N/A

2 NUMA Node 1 32 64 N/A

3 Blade 4 128 256 1

4 Chassis 64 2,048 4,096 1

5 Cabinet 192 6,144 8,192 1

6 Local alltoall G 384 12,288 16,384 1

7 Global alltoall G1 2,304 74,728 131,072 2

8 Global alltoall G2 6,174 197,568 N/A 3 22

Page 24: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Weak Scalability

• A single scattering sphere, to

validate the concept and the

convergence.

• We fix the problem size while

keeping 10 points per wavelength

(1.0e−4 accuracy vs. analytical).32 128 512 2048 8192 32768 131072

Number of Cores

0

5

10

15

20

25

30

Mea

n T

ime

[s]

561,5402,246,160

8,984,640

35,938,560143,754,240

575,016,960

2,300,067,840

1 0.95

1.06 1 0.96

1.03

1.05

Weak ScalabilityIdeal Scaling

Cores Mem. [GB] Freq. [KHz] N (DoF) T [S] Iters

512 2,048 24 8,984,640 3.1e3 185

2,048 8,192 96 35,938,560 3.3e3 190

8,192 32,768 384 143,754,240 3.6e3 200

32,768 131,072 1,536 575,016,960 4.6e3 223

131,072 524,288 6,144 2,300,067,840 5.7e3 25623

Page 25: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Strong Scalability [1/2]

2,300,067,840 DoF

1

0.97

0.85

0.840.63

16384 32768 65536 131072 196608Number of Cores

10 0

10 1

Spee

dup

Strong ScalabilityIdeal Scaling

575,016,960 DoF

1

1.06

0.88

0.87

0.54 0.37

1966088192 16384 32768 65536 131072Number of Cores

10 0

10 1

Spee

dup

Strong ScalabilityIdeal Scaling

143,754,240 DoF

1

0.88

0.91

0.79

0.85

0.65

0.51

0.37

1024 2048 4096 8192 16384 32768 65536 131072Number of Cores

10 0

10 1

10 2

Spee

dup

Strong ScalabilityIdeal Scaling

24

Page 26: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Strong Scalability [2/2]

35,938,560 DoF

1

1.02

0.94

0.79

0.75

0.73

0.86 0.44

0.32 0.17

256 512 1024 2048 4096 8192 16384 32768 65536 131072Number of Cores

10 0

10 1

10 2Sp

eedu

p

Strong ScalabilityIdeal Scaling

8,984,640 DoF

1

0.88

0.91

0.86

0.83

0.76

0.77

0.66

0.55

64 128 256 512 1024 2048 4096 8192 16384Number of Cores

10 0

10 1

10 2

Spee

dup

Strong ScalabilityIdeal Scaling

2,246,160 DoF

1

0.95

0.89

0.82

0.81

0.86

0.78

0.61

0.51

32 64 128 256 512 1024 2048 4096 8192Number of Cores

10 0

10 1

10 2

Spee

dup

Strong ScalabilityIdeal Scaling

561,540 DoF

1

0.94

1.13

1.06

0.79

0.67

0.610.37

32 64 128 256 512 1024 2048 4096Number of Cores

10 0

10 1

10 2

Spee

dup

Strong ScalabilityIdeal Scaling

25

Page 27: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Communication Load Balancing

100 200 300 400 500 600 700 800 900 1000MPI Rank

10-2

10-1

100

Mea

n T

ime

[s]

Comm LET bodiesComm LET cells

100 200 300 400 500 600 700 800 900 1000MPI Rank

10-2

10-1

100

Mea

n T

ime

[s]

Comm LET bodiesComm LET cells

1 2 3Time step

0

0.5

1

1.5

2

2.5

3

3.5

Mea

n T

ime

[s]

Balancing Step

26

Page 28: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Data Scalability

10 5 10 6 10 7

N

10 -1

10 0

10 1

10 2

10 3

10 4

10 5

Mea

n T

ime

[s]

FMM TimeO(N)O(NlogN)

O(N 2 )

64 compute nodes of Shaheen

10 8 10 9

N

10 -1

10 0

10 1

10 2

10 3

10 4

Mea

n T

ime

[s]

FMM TimeO(N)O(NlogN)

O(N 2 )

1,024 compute nodes of Shaheen

27

Page 29: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Convergence Aspects and Numerical Error

200 400 600 800 1000 1200 1400Number of Iterations

10 -4

10 -3

10 -2

10 -1

10 0

Rel

ativ

e R

esid

ual N

orm

Self and Near SingularitySelf SingularityNear SingularityIgnore Self Singularity

Convergence behavior of 1m DoF using different

singularity correction modes

28

Page 30: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Conclusion

• Auto-vectorization of FMM kernels by the Intel C/C++ Compiler 2017 can

achieve significant performance boost with given precautions.

• Heavy tuning parameters for recursive task parallelism of FMM’s tree traversal

can be avoided using our performance model with a small prediction penalty.

• Communication balancing results in 6 times faster global communication time for

100m DoF.

• Solution of a 2 billion DoF systems of high-order curvilinear triangular patches of

a spherical mesh in about 60 minutes time-to-solution and an accuracy of 1.0e−4

compared to the analytical solution.

• Near-optimal parallel efficiency on Shaheen for both weak and strong scalability

studies.

29

Page 31: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Future Direction

• Explore and address performance challenges of more heterogeneous HPCarchitectures.

• GPU-based supercomputers.

• Study different network topologies of various supercomputer architectures, andmake HSDX aware of the underlying network units.

• Intel Omni-Path network architecture.

• Further extend our solver implementation to include different highly nonuniformdomains and more complex geometries.

• Wing-fuselage configuration emulated by two intersecting ellipsoids.

30

Page 32: BEMFMM: An FMM-Accelerated Boundary Element Method …...BEMFMM: An FMM-Accelerated Boundary Element Method-Based Solver for the 3D Helmholtz Equation Mustafa Abduljabbar, Mohammed

Thank you