The EXA-DUNE project


Page 1: The EXA-DUNE project

The EXA-DUNE project
Dune vs. 1 000 000 000 000 000 000

Christian Engwer
joint work with P. Bastian, O. Ippisch, J. Fahlke, D. Göddeke, S. Müthing, D. Ribbrock, S. Turek

2.04.2016, OPM Meeting

Westfälische Wilhelms-Universität Münster

Page 2: The EXA-DUNE project


2006  Discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software.

2010  Discussion with DFG's Executive Committee; suggestion of a flexible, strategically initiated SPP. Initiative out of the German HPC community, referring to increasing activities on HPC software elsewhere (USA: NSF, DOE; Japan; China; G8).

2011  Submission of the proposal, international reviewing, and formal acceptance.

2012  Review of project sketches and full proposals; 13 full proposals accepted for funding in SPPEXA 1.

2013  Official start of SPPEXA 1.

2014  Call for proposals for SPPEXA 2, incl. international partners (France, Japan).

2015  16 full proposals accepted for funding in SPPEXA 2 (7 collaborations with Japan, 3 with France).

2016  Official start of SPPEXA 2.

2020  Expected arrival of exa-scale machines ;-)

6 years of HPC-related project funding



Page 4: The EXA-DUNE project


EXA-DUNE

DUNE's framework approach to software development:
- Integrated toolbox of simulation components
- Existing body of complex applications
- Good performance + scalability for the traditional MPI model


Page 5: The EXA-DUNE project


EXA-DUNE

Challenges:
- Standard low-order algorithms do not scale any more
- Incorporate new algorithms and hardware paradigms
- Integrate changes across simulation stages (Amdahl's law)
- Provide a "reasonable" upgrade path for existing applications


Page 6: The EXA-DUNE project


EXA-DUNE

DUNE + FEAST = Flexibility + Performance
- General software frameworks co-designed to specific hardware platforms are not sufficient on their own
- Hardware-oriented numerics: design and choose algorithms with the hardware in mind


Page 7: The EXA-DUNE project


Hardware Challenges

- Multiple levels of concurrency: MPI-parallel, multi-core, SIMD
- Memory wall

Example: Intel Xeon E5-2698v3 (Haswell)
- Advertised peak performance: 486.4 GFlop/s
- 16 cores → single core: 30.4 GFlop/s
- AVX2 + FMA → without FMA: 15.2 GFlop/s
- 4× SIMD → without AVX: 3.8 GFlop/s

→ classic, non-parallel code is bound by 3.8 GFlop/s
→ you lose 99% of the performance
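A quick back-of-the-envelope check of these numbers; the 1.9 GHz base clock and the two FMA pipes per core are inferred from the quoted figures, so read this as a plausibility sketch rather than a vendor specification:

\[
486.4\ \mathrm{GFlop/s} \;=\; \underbrace{16}_{\text{cores}} \times \underbrace{2}_{\text{FMA units}} \times \underbrace{2}_{\text{flops/FMA}} \times \underbrace{4}_{\text{SIMD lanes}} \times 1.9\ \mathrm{GHz}
\]
\[
\text{per core: } \tfrac{486.4}{16} = 30.4, \qquad
\text{without FMA: } \tfrac{30.4}{2} = 15.2, \qquad
\text{without AVX: } \tfrac{15.2}{4} = 3.8\ \mathrm{GFlop/s},
\]
\[
\tfrac{3.8}{486.4} \approx 0.8\% \;\Rightarrow\; \text{roughly 99\% of the peak is left on the table.}
\]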



Page 12: The EXA-DUNE project


Outline

1 Introduction

2 EXA-DUNE Overview

3 EXA-DUNE Selected Features
  - Linear Algebra
  - Assembly
  - High-level SIMD techniques
  - Multiple RHS

4 Outlook


Page 13: The EXA-DUNE project


Hybrid Parallelism

[Figure: node architecture: CPUs and accelerators with local memory (M) and interconnect (IC), cores (P) with L1/L2 caches and SIMD units, UMA within a node, MPI between nodes]

- Coarse-grained: MPI between heterogeneous nodes
- Medium-grained: multicore CPUs, GPUs, MICs, APUs, ... → TBB
- Fine-grained: vectorization, GPU 'threads', ... → ???



Page 15: The EXA-DUNE project


Algorithm Design: Hardware-Oriented Numerics

Challenge: combine flexibility and hardware efficiency
- Existing codes no longer run faster automatically
- Standard low-order algorithms do not scale any more
- Much more than a pure implementation issue

Solution: locally structured, globally unstructured data
→ increased arithmetic intensity and vectorization

Concepts:
- higher order: Discontinuous Galerkin, Reduced Basis
- virtual local refinement: global unstructured mesh, local tensor-product mesh
- multiple right-hand sides: horizontal SIMD vectorization (see the block system below)


For multiple right-hand sides this becomes one block system A X = B:

\[
A
\begin{pmatrix}
x_{0,0} & x_{1,0} & \cdots & x_{N,0}\\
x_{0,1} & x_{1,1} & \cdots & x_{N,1}\\
x_{0,2} & x_{1,2} & \cdots & x_{N,2}\\
\vdots  &         &        & \vdots\\
x_{0,M} & x_{1,M} & \cdots & x_{N,M}
\end{pmatrix}
=
\begin{pmatrix}
b_{0,0} & b_{1,0} & \cdots & b_{N,0}\\
b_{0,1} & b_{1,1} & \cdots & b_{N,1}\\
b_{0,2} & b_{1,2} & \cdots & b_{N,2}\\
\vdots  &         &        & \vdots\\
b_{0,M} & b_{1,M} & \cdots & b_{N,M}
\end{pmatrix}
\]


Page 19: The EXA-DUNE project


Features in EXA-DUNE

dune-common
- Multi-threading support in DUNE (via TBB)
dune-grid
- Hybrid parallelization of DUNE grids (on top of the grid interface)
- Virtual refinement of unstructured DUNE grids
dune-istl
- Hybrid parallelization
- Performance-portable matrix format
- (vertically) vectorized preconditioners (incl. CUDA versions)
- (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
- Sum-factorization DG
- Special-purpose high-performance assemblers



Page 21: The EXA-DUNE project


SELL-C-σ as Cross-Platform Matrix Format

Kreutzer, Hager, Wellein, Fehske, Bishop ’13

Block-sorted and chunked ELL (SELL-C-σ) with suitable tuning offers competitive performance across modern architectures (CPU, GPGPU, Xeon Phi).

- Adopt SELL-C-σ as the unified matrix format in DUNE (including the assembly phase → dune-pdelab)
- Differentiate by memory domain (host, Xeon Phi, GPU)
- Memory domain and C selected via an extended C++ allocator interface

A minimal SpMV sketch for this layout follows below.
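To make the layout concrete, here is a minimal, self-contained sketch of a sparse matrix-vector product over a SELL-C-σ-like structure (chunk height C, rows padded per chunk, values and column indices stored column-major within each chunk; the sorting of rows within windows of σ is omitted). The field names and the zero-padding convention are illustrative assumptions, not the dune-istl implementation.

#include <cstddef>
#include <vector>

// Illustrative SELL-C-sigma container: rows are grouped into chunks of height C;
// within a chunk, entry j of all C rows is stored contiguously (column-major per chunk).
struct SellCSigma {
  std::size_t C;                         // chunk height (matched to the SIMD width)
  std::size_t rows;                      // number of matrix rows
  std::vector<std::size_t> chunkOffset;  // start of each chunk in val/col
  std::vector<std::size_t> chunkLength;  // padded row length of each chunk
  std::vector<double> val;               // nonzeros, padded entries hold 0.0
  std::vector<std::size_t> col;          // column indices, padded entries point to any valid column
};

// y = A*x for the layout above; the inner loop over the C rows of a chunk is the
// part that maps directly onto SIMD lanes (or GPU threads).
void spmv(const SellCSigma& A, const std::vector<double>& x, std::vector<double>& y)
{
  for (std::size_t c = 0; c < A.chunkOffset.size(); ++c) {
    const std::size_t base = A.chunkOffset[c];
    std::vector<double> acc(A.C, 0.0);                  // one accumulator per lane
    for (std::size_t j = 0; j < A.chunkLength[c]; ++j)
      for (std::size_t lane = 0; lane < A.C; ++lane)    // contiguous in memory -> vectorizable
        acc[lane] += A.val[base + j * A.C + lane] * x[A.col[base + j * A.C + lane]];
    for (std::size_t lane = 0; lane < A.C && c * A.C + lane < A.rows; ++lane)
      y[c * A.C + lane] = acc[lane];
  }
}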


Page 22: The EXA-DUNE project


Blocked SELL Format for DG discretizations

Choi, Singh, Vuduc ’10

Interleaved storage of blocks from C block rows
- Move SIMD from the scalar level to the block level
- Vectorize algorithms by operating on multiple independent blocks simultaneously
- Tradeoff between vectorization and cache usage at larger block sizes

[Figure: memory layout of the compressed data array and the compressed block-column-index array]

[Müthing et al. 2013]


Page 23: The EXA-DUNE project


Hybrid-parallel DUNE Grid

Strategies for multi-threaded assembly of matrices and vectors:
- Thread parallelization on top of the DUNE grid interface

- Evaluation of partitioning strategies: strided, ranged, sliced, tensor
- ... and data access strategies:
  - batched: batched write-back with a global lock
  - elock: one lock per mesh entity
  - coloring: partitions of the same color do not "touch"

Other strategies not considered here:
- global locking → as bad as expected
- race-free schemes → in general not possible

A minimal sketch of the entity-wise locking variant follows below.
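As an illustration of the entity-wise locking strategy, here is a minimal sketch of a TBB-parallel assembly loop that scatters element contributions into a global vector, guarded by one mutex per degree of freedom. The element/DOF data structures are placeholders and not the dune-pdelab assembler interface.

#include <cstddef>
#include <mutex>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Placeholder element: global indices of its DOFs (e.g. vertices for Q1).
struct Element { std::vector<std::size_t> dofs; };

// Assemble a global residual vector with entity-wise locking:
// threads work on ranges of cells, and each write to a global DOF
// is protected by that DOF's own mutex (no global lock, no coloring needed).
void assembleResidual(const std::vector<Element>& cells,
                      std::vector<double>& residual,
                      std::vector<std::mutex>& dofLocks)
{
  tbb::parallel_for(
      tbb::blocked_range<std::size_t>(0, cells.size()),
      [&](const tbb::blocked_range<std::size_t>& range) {
        for (std::size_t e = range.begin(); e != range.end(); ++e) {
          const Element& cell = cells[e];
          // Local kernel (user code): here just a dummy contribution of 1.0 per DOF.
          std::vector<double> local(cell.dofs.size(), 1.0);
          // Scatter with per-entity locks.
          for (std::size_t i = 0; i < cell.dofs.size(); ++i) {
            std::lock_guard<std::mutex> guard(dofLocks[cell.dofs[i]]);
            residual[cell.dofs[i]] += local[i];
          }
        }
      });
}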


Page 24: The EXA-DUNE project


Evaluation

Evaluation of partitioning strategies
- best partitioning strategy: ranges of cells → memory efficient and fast

... and data access strategies
- best strategy: entity-wise locking (no benefit from coloring)

  k    E_CPU(10)   E_CPU(20)   E_PHI(60)   E_PHI(120)   E_PHI(240)
  0    62%         42%         75%         43%          21%
  1    62%         42%         84%         57%          30%
  2    72%         46%         90%         69%          38%
  3    79%         50%         92%         72%          41%
  4    87%         49%
  5    88%         51%

  Efficiency E (from runtimes per dof), polynomial degree k, Jacobian assembly, sliced partitioning, entity-wise locking; the number in parentheses is the thread count.

Evaluation on 60-core Xeon Phis and 10-core CPUs
- full benefit of the Xeon Phi will require vectorization of user code
→ vectorization: non-trivial, work in progress



Page 26: The EXA-DUNE project


Vectorizing over Elements

The user provides a local operator that computes the local stiffness matrix.

Patch grid:
- reduce the cost of unstructured meshes:
  - extract a subset of the mesh
  - store it as a flat grid
  - store data in consecutive arrays, without pointers
- add structured refinement:
  - exploit the local structure
  - improve data locality

[Figure: per-element local coefficient vectors (v0 ... v3) gathered from the patch coefficient vector]

A sketch of such a flat, pointer-free patch storage follows below.
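A minimal sketch of what "store as a flat grid in consecutive arrays, without pointers" can look like: a patch is copied out of an unstructured mesh into plain index/coordinate/coefficient arrays. The Mesh/Patch types here are illustrative placeholders, not the EXA-DUNE patch-grid classes.

#include <array>
#include <cstddef>
#include <vector>

// Placeholder view of an unstructured 2D quad mesh with one dof per vertex (Q1).
struct Mesh {
  std::vector<std::array<double, 2>> vertexCoords;       // global vertex coordinates
  std::vector<std::array<std::size_t, 4>> cellVertices;  // 4 vertex ids per quad cell
  std::vector<double> coefficients;                      // one coefficient per vertex
};

// Flat, struct-of-arrays patch: everything a compute kernel needs, stored contiguously.
struct Patch {
  std::vector<double> coordsX, coordsY;   // per patch-local vertex
  std::vector<std::size_t> cells;         // 4 patch-local vertex ids per cell, flattened
  std::vector<double> coefficients;       // patch-local coefficient vector
  std::vector<std::size_t> globalVertex;  // patch-local -> global map (for scatter-back)
};

// Extract the cells listed in 'cellIds' into a flat patch.
Patch extractPatch(const Mesh& mesh, const std::vector<std::size_t>& cellIds)
{
  Patch patch;
  std::vector<std::ptrdiff_t> globalToLocal(mesh.vertexCoords.size(), -1);
  for (std::size_t c : cellIds) {
    for (std::size_t v : mesh.cellVertices[c]) {
      if (globalToLocal[v] < 0) {   // first time we see this vertex: copy its data
        globalToLocal[v] = static_cast<std::ptrdiff_t>(patch.globalVertex.size());
        patch.globalVertex.push_back(v);
        patch.coordsX.push_back(mesh.vertexCoords[v][0]);
        patch.coordsY.push_back(mesh.vertexCoords[v][1]);
        patch.coefficients.push_back(mesh.coefficients[v]);
      }
      patch.cells.push_back(static_cast<std::size_t>(globalToLocal[v]));
    }
  }
  return patch;
}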



Page 33: The EXA-DUNE project


Vectorizing over Elements (2)

−∆u = 0 on Ω, plus boundary conditions

- Conforming FEM, Lagrange Q1, three levels of virtual refinement
- 16-core Intel Haswell E5-2698v3
  - advertised 486.4 GFlop/s
  - peak w/o FMA: 243.2 GFlop/s (→ %avail)
  - bandwidth: STREAM 2 × 40 GByte/s

  SIMD   lanes   threads   GByte/s   %effBW   GFlop/s   %avail
  -      1       16        26.73     33.4      9.758     4.0
  -      1       32        34.32     42.9     12.53      5.2
  AVX    4       16        62.89     78.8     22.96      9.4
  AVX    4       32        73.01     91.3     26.65     11.0


Page 34: The EXA-DUNE project


Interface Challenges: Vectorization of the local operator (user code)

[Figure: DUNE module stack: core modules (common, geometry, grid, localfunctions, istl), dune-pdelab layers (grid function space, local function space, assembler, grid operator, time integrator), and the user-written local operator at the application level]

- Intrinsics (non-portable)
- Special language (needs a special compiler)
- Autovectorizer (difficult to drive portably)

→ Vectorization library (e.g. Vc, Boost.SIMD, NGSolve)



Page 37: The EXA-DUNE project


Programming Approaches

- Intrinsics (non-portable)
- Special language (needs a special compiler)
- Autovectorizer (difficult to drive portably)
- Vectorization library:
  - hides intrinsics beneath a portable interface
  - implementations: e.g. Vc, Boost.SIMD, NGSolve



Page 40: The EXA-DUNE project


Features of Vc

Storage: Vc::SimdArray<T,N> abstracts a vector of type T with size N.

SIMD operator overloads: +, −, ·, /, etc., for example:

template <typename T, std::size_t N>
Vc::SimdArray<T, N> operator*(Vc::SimdArray<T, N> a, const Vc::SimdArray<T, N>& b)
{
  // element-wise multiplication across all N lanes
  for (std::size_t i = 0; i < N; ++i) a[i] *= b[i];
  return a;
}

Limitations: not all operations are supported on SIMD data types. In particular:
- no implicit cast from float → double
- no branching, e.g. if (c) x = a; else x = b;
  Alternative: x = c ? a : b;  →  x = cond(c, a, b);
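A small usage sketch of that branchless pattern: the cond() helper below is written out by hand over the lanes so the example stays self-contained. Vc itself ships an equivalent blend operation, so treat the helper, its per-lane indexing of the mask, and the fixed width of 4 as illustrative assumptions rather than the Vc API.

#include <cstddef>
#include <Vc/Vc>

using Vec4d = Vc::SimdArray<double, 4>;

// Hand-rolled lane-wise blend: where the mask is true take a, otherwise b.
// This replaces the scalar "if (c) x = a; else x = b;" in SIMD code.
template <typename Mask>
Vec4d cond(const Mask& c, const Vec4d& a, const Vec4d& b)
{
  Vec4d result = b;
  for (std::size_t i = 0; i < Vec4d::size(); ++i)
    if (c[i]) result[i] = a[i];
  return result;
}

Vec4d clampToZero(const Vec4d& x)
{
  // scalar version: if (x < 0) return 0; else return x;
  return cond(x < Vec4d(0.0), Vec4d(0.0), x);  // all 4 lanes decided at once, no branch
}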



Page 42: The EXA-DUNE project


Horizontal Vectorization for multiple RHS

- Many applications require solves for many right-hand sides:
  - multi-scale FEM
  - inverse problems
  - ...
- This corresponds to

  foreach i ∈ [0,N]: solve A x_i = b_i   →   solve A X = B,   with X = (x_0, ..., x_N), B = (b_0, ..., b_N)

- Templated FEM code allows us to implement vectorized assembly and solvers, e.g.

  template<typename T> Dune::BlockVector
  template<class X> class BiCGSTABSolver : public InverseOperator<X,X>
  template<class M, class X, class Y> class AssembledLinearOperator

  (a sketch of the idea follows below)
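To illustrate why templating on the field type is enough, here is a minimal sketch of a damped-Jacobi (Richardson-type) sweep written against an arbitrary field type T. With T = double it advances one right-hand side; with T = Vc::SimdArray<double,4> the identical code advances four right-hand sides per vector entry. The plain std::vector/CSR containers are placeholders, not the dune-istl classes.

#include <cstddef>
#include <vector>
#include <Vc/Vc>  // only needed for the SIMD variant in the usage note below

// One damped-Jacobi sweep, x <- x + omega * Dinv * (b - A x), for a CSR matrix.
// The field type T may be a scalar or a SIMD vector holding several right-hand sides.
template <typename T>
void jacobiSweep(const std::vector<double>& val, const std::vector<std::size_t>& col,
                 const std::vector<std::size_t>& rowPtr, const std::vector<double>& dinv,
                 const std::vector<T>& b, std::vector<T>& x, double omega)
{
  std::vector<T> xOld = x;  // Jacobi: update from the old iterate
  for (std::size_t i = 0; i + 1 < rowPtr.size(); ++i) {
    T residual = b[i];
    for (std::size_t k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
      residual -= val[k] * xOld[col[k]];  // matrix entries stay scalar, unknowns have type T
    x[i] = xOld[i] + omega * dinv[i] * residual;
  }
}

// Usage: the same routine handles 1 RHS (double) or 4 RHS at once (SimdArray);
// a production code would use an aligned allocator (e.g. Vc::Allocator) for the SIMD case.
//   std::vector<double> b1(n), x1(n);                    jacobiSweep(val, col, row, dinv, b1, x1, 0.8);
//   std::vector<Vc::SimdArray<double, 4>> b4(n), x4(n);  jacobiSweep(val, col, row, dinv, b4, x4, 0.8);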



Page 45: The EXA-DUNE project


Modifications to the linear algebra code

Dune::BlockVector<double>

→ Dune::BlockVector<Vc::SimdArray<double,N>>

\[
X =
\begin{pmatrix}
x_{0,0} & x_{1,0} & x_{2,0} & \cdots & x_{N,0}\\
x_{0,1} & x_{1,1} & x_{2,1} & \cdots & x_{N,1}\\
x_{0,2} & x_{1,2} & x_{2,2} & \cdots & x_{N,2}\\
\vdots  &         &         &        & \vdots\\
x_{0,M} & x_{1,M} & x_{2,M} & \cdots & x_{N,M}
\end{pmatrix}
\]

Memory layout: corresponds to an M × N dense matrix with row-major storage.

Plus smaller SIMD-aware modifications to the solvers.



Page 47: The EXA-DUNE project


Application: EEG Source Reconstruction
Cooperation with UKM (Münster) and BESA GmbH (München)

[Figure; image source: Wikipedia]

- EEG measurements:
  - > 200 electrodes
  - measure the surface potential
- governing equation:

  −∇ · K∇u = f on Ω
  ∇u · n = 0 on ∂Ω

Goal: reconstruct brain activity
- inverse modelling approach
- L2 regularization
- fidelity term on the potential at the electrodes
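Written out, the bullets above describe a Tikhonov-type formulation. The notation here (F mapping a source configuration s to the simulated electrode potentials, measured data d, regularization weight α) is introduced for illustration and is not taken from the slides:

\[
\min_{s}\ \lVert F(s) - d \rVert_2^2 \;+\; \alpha \,\lVert s \rVert_2^2 ,
\]

where the first term is the fidelity term on the potentials at the electrodes and the second the L2 regularization. The > 200 electrodes translate into one right-hand side each (cf. the 256-RHS benchmark on the performance slide), which is where the horizontally vectorized solver from the previous slides comes in.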


Page 48: The EXA-DUNE project


Application: EEG Source Reconstruction (cont.)
Cooperation with UKM (Münster) and BESA GmbH (München)

Same setting as on the previous slide; a second build of the slide writes the governing equations with the source entering through the boundary:

  −∇ · K∇u = 0 on Ω
  ∇u · n = j on ∂Ω


Page 49: The EXA-DUNE project


Performance Measurements

- SIMD-vectorized AMG-CG solver
- Solve for 256 RHS, 300K cells, 60K vertices
- Timings on a Haswell-EP (E5-2698v3, 16 cores, AVX2, 4 lanes):
  - 1-16 cores
  - no SIMD, AVX (4 lanes), AVX (4 × 4 lanes)
  - speedup ≈ 50 (max. theoretical speedup 64 = 16 cores × 4 lanes)

[Plots: wall time for 256 electrodes (seconds) and speedup vs. number of cores (1, 2, 4, 8, 16), for 1, 4, and 4×4 SIMD lanes]


Page 50: The EXA-DUNE project


Outlook: Transition from EXA-DUNE to mainline?!

dune-common
X  Multi-threading support in DUNE (via TBB)
dune-grid
X  Hybrid parallelization of DUNE grids (on top of the grid interface)
~  Virtual refinement of unstructured DUNE grids
dune-istl
X  Hybrid parallelization
?  Performance-portable matrix format
?  (vertically) vectorized preconditioners (incl. CUDA versions)
+  (horizontally) vectorized solvers (multiple RHS)
dune-pdelab
+  Sum-factorization DG
~  Special-purpose high-performance assemblers



Page 52: The EXA-DUNE project


Outlook II: second phase

- Push features upstream
- Task-parallel scheduling
- Asynchronicity → latency hiding, communication hiding, ...
- Resilience
- ...

Questions?!

