TRANSCRIPT
WILL MIDDLEWARE AND DSL CHANGE THE PROGRAMMING HABITS WHEN COPING WITH EXASCALE MACHINES?
Salishan 2014 | April 23, 2014 | Guillaume Colin de Verdière, CEA, DAM, DIF, F-91297 Arpajon, France
Motivations: an example
Manual optimization is becoming too difficult
Credit: Jason Sewall, Intel
HydroC miniapp: https://github.com/HydroBench/Hydro/tree/master/HydroC
HydroC optimizations done
• Tiling
  • Another name for blocking
  • Requires intra-halo regions
  • Rolling-buffer strategy to improve data locality
• Vectorization
  • Align data
  • Use intrinsics instead of auto-vectorization
• Arithmetic
  • Optimize equations to reduce the number of divisions and sqrt() calls (see the sketch after this list)
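The arithmetic point can be illustrated with a minimal sketch (hypothetical function and variable names, not taken from HydroC): hoisting a reciprocal out of a loop trades a per-iteration divide, which is expensive and hard to vectorize, for a multiply.

    #include <cstddef>

    // Naive form: one division per iteration.
    void scale_naive(double* v, std::size_t n, double dx) {
        for (std::size_t i = 0; i < n; ++i)
            v[i] = v[i] / dx;
    }

    // Optimized form: precompute the reciprocal once; the loop body becomes
    // a multiply, which pipelines and vectorizes much better than a divide.
    void scale_opt(double* v, std::size_t n, double dx) {
        const double inv_dx = 1.0 / dx;  // hoisted out of the loop
        for (std::size_t i = 0; i < n; ++i)
            v[i] *= inv_dx;
    }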
Lessons learned here
• New architectures force code changes on you
  • Or you'll reach only a tiny fraction of peak performance
• Data layout is important to expose more parallelism
  • It reduces the Amdahl's-law impact of serial sections
• Efficient vectorization is difficult for a large code
• Questions:
  • What if I have an old (legacy) code?
  • What can be done to simplify the development of future codes?
Legacy codes
The heritage
• Legacy codes survive for decades
  • Some are still Fortran based (usually F90)
• Most are MPI only
  • MPI + OpenMP is rare
  • Suboptimal behavior on a CPU with a large number of cores
• Consequence: CS issues have been mostly overlooked
  • NUMA effects
  • Thread safety
  • Reentrance
  • Cache usage
  • False sharing (illustrated in the sketch after this list)
  • Memory usage
  • …
A solution: circumvent some issues with a runtime
• A clever runtime will
  • Provide a seamless view of different models
    • Homogeneous thread-based MPI + OpenMP
  • Hide some of the machine challenges
  • Allow for future extensions and experimentation
  • Provide a smooth transition path to exascale programming models
• The MPC runtime is such a solution
  • Developed at CEA and UVSQ
    • Active collaboration with TACC as well
  • Web site maintained at the University of Oregon
    • Coupled with TAU
  • Used in some major CEA codes
    • Positive feedback from DOE labs
MPC in a nutshell
• Multi-Processor Computing (MPC) framework
  • Freely available at http://mpc.sourceforge.net (version 2.5.0)
  • Credit/contact: [email protected], [email protected] or [email protected]
• Summary
  • Unified parallel runtime for clusters of NUMA machines
  • Unification of several parallel programming models
    • MPI 1.3 compliant, thread-based MPI, MPI_THREAD_MULTIPLE (see the sketch after this list)
    • POSIX threads
    • OpenMP 2.5 compliant
  • Integration with other HPC components
    • Parallel memory allocator for NUMA optimization
    • Patched GCC
    • Patched GDB; supported in Allinea DDT
    • HWLOC, …
  • Extended TLS
    • Hierarchical storage
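As an illustration of the thread-support level mentioned above (standard MPI calls only, nothing MPC-specific): an application requests MPI_THREAD_MULTIPLE at startup, and a thread-based MPI such as MPC can grant it cheaply because ranks are already threads in one process.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided = 0;
        // Request the highest thread-support level and check what is granted.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            std::fprintf(stderr, "warning: full multithreaded MPI not granted\n");
        // ... hybrid MPI + threads work goes here ...
        MPI_Finalize();
        return 0;
    }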
What MPC can do for a legacy code
• NUMA effects → NUMA allocator
• MPI + OpenMP → MPI_THREAD_MULTIPLE + memory optimization
• Global variables in F90 sources → TLS extensions (automatic privatization; see the sketch below)
• Memory footprint → hierarchical storage (one instance of physical databases per node)
A good runtime is a key piece of the future software stack
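MPC's extended TLS privatizes globals automatically (via its patched GCC) when MPI ranks become threads; the effect is what one would get by hand with C++ thread_local, as in this minimal sketch (generic, not MPC's actual mechanism):

    // In a process-based MPI, each rank owns its globals by construction.
    // Once ranks are threads in a single process, a plain global is shared
    // and races; making it thread-local restores one copy per rank.
    long shared_counter = 0;                // shared by all thread-ranks: unsafe
    thread_local long private_counter = 0;  // one copy per thread-rank: safe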
Preparing the future
How to avoid exposing the developers to the gory details?
• Two tracks are investigated at CEA
  • Frameworks such as Arcane: in production and evolving
  • DSLs such as NABLA (∇): an R&D effort
• DSL ∇: Numerical-Analysis-Based Language
  • Goal 1: make codes independent of machine generations
  • Goal 2: simplify developments
  • Separates the coding of the algorithm from the machine representation (see the sketch after this list)
  • Uses the same source on different parallel architectures
  • Credit: [email protected], presented at PP14
The ∇ toolchain
[Toolchain diagram: ∇ source files (DDFV.n, Hydro.n, CoMD.n, Lulesh.n) are fed to the ∇ toolchain]
Verification of the expressiveness of ∇
∇ was able to express the following numerical schemes:

Numerical scheme                     Mini-application     LOC of ∇
Explicit Lagrangian + indirections   LULESH (LLNL)        1030
                                     GLACE (CEA)           266
                                     MicroHydro (CEA)      217
Explicit Cartesian                   Hydro (Sethi) (CEA)   757
Implicit                             Schrödinger (CEA)     346
Diffusion                            DDFV (CEA)            264
                                     DIFF (CEA)             76
Monte-Carlo                          MCTB (CEA)            828
                                     MC1D (CEA)            276
Molecular dynamics                   CoMD (LANL)           293
                                     MiniMD (SANDIA)       474
Performance validation of the ∇ concept: LULESH
• Performance of the generated code (C++, CUDA) ≅ "native" optimized versions
• Strong scaling of LULESH, 32768 cells, E5 @ 2.7 GHz (times in s)
• LULESH CUDA on a Tesla = 9 s vs ∇-LULESH = 11 s
[Plot: LULESH strong scaling, time (s, 0–100) vs number of threads (1–32), comparing native optimized, Arcane + pthread, Arcane + MPI, and Arcane + MPC; data in the table below]
∇ allows for an incremental approach to performance enhancement
Time (s) for 32768 cells:

Threads   Optimized native   ∇ Arcane + pthread   ∇ Arcane + MPI   ∇ Arcane + MPC
1         66                 83                   90               83
2                            47                   50               47
4                            30                   24               32
8                            21                   15               20
16                           9                                     14
32                           6                                     12
Going a step further: mixing DSL and runtime capabilities
What can be done if the DSL automatically produces architecture-independent kernels?
R&D: a heterogeneous scheduler within MPC
Goal: harness CPUs and accelerators at the same time in the context of irregular numerical computations
• Balance the workload between architectures by introducing a two-level work-stealing mechanism (see the sketch after this list)
• Improve locality with a software cache strongly coupled to the scheduler
  • Designed to reduce memory transfers by retaining data in off-chip memory
  • Scheduler guided by cache affinity to avoid unnecessary transfers
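A minimal sketch of the two-level idea (not MPC's actual code; worker counts and types are hypothetical): each worker owns a deque, and a thief first tries workers of its own device type, where data is likely resident, before crossing the CPU/GPU boundary, which usually implies a memory transfer.

    #include <array>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    using Task = std::function<void()>;

    struct Worker {
        std::mutex m;            // protects the deque
        std::deque<Task> tasks;  // owner works at the back; thieves take the front
    };

    constexpr std::size_t kCpuWorkers = 4;  // hypothetical sizes for the sketch
    constexpr std::size_t kGpuWorkers = 1;

    struct Scheduler {
        std::array<Worker, kCpuWorkers> cpu;
        std::array<Worker, kGpuWorkers> gpu;

        template <std::size_t N>
        std::optional<Task> steal_from(std::array<Worker, N>& group) {
            for (auto& w : group) {
                std::lock_guard<std::mutex> lk(w.m);
                if (!w.tasks.empty()) {
                    Task t = std::move(w.tasks.front());
                    w.tasks.pop_front();
                    return t;
                }
            }
            return std::nullopt;
        }

        // Level 1: steal among workers of the same device type.
        // Level 2: cross the CPU/GPU boundary only if level 1 fails.
        std::optional<Task> steal(bool thief_on_gpu) {
            if (thief_on_gpu) {
                if (auto t = steal_from(gpu)) return t;
                return steal_from(cpu);
            }
            if (auto t = steal_from(cpu)) return t;
            return steal_from(gpu);
        }
    };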
Heterogeneous scheduler: hierarchical scheduling scheme
[Plot: GFLOP/s (0–600) vs number of blocks (0–30000), comparing three configurations: CPUs + GPU, CPUs + GPU + cache, and CPUs + GPU + cache with ordered deque]
• Hardware: 2x AMD 6164HE (24 cores @ 1.7 GHz), 1x Nvidia GeForce GTX 470 (448 cores @ 1.215 GHz)
• Benchmark: LU decomposition with dense & sparse blocks; cumulated performance of step 3 (SGEMM → MKL & CUBLAS) vs matrix size
• Paper presented at MULTIPROG (January 2012): Jean-Yves Vet, Patrick Carribault, Albert Cohen, "Multigrain Affinity for Heterogeneous Work Stealing", MULTIPROG'12
• Could be used to exploit several types of many-core processors (Nvidia GPUs, AMD GPUs/APUs, Intel MIC, …)
Conclusions
• Getting to the exascale level will require major code evolutions
  • Is it (it is!) time to consider other tracks?
• Proper runtimes will help the legacy codes
  • MPC is a good starting point
• DSLs are a really promising way forward
  • Our ∇ experiment confirms other results (e.g., Liszt)
  • Not ready for massive production yet
• In the meantime, frameworks are the safer option
  • In conjunction with their tools, for increased software productivity
• Next step: a DSL working in collaboration with the runtime
The future = a multidisciplinary team + a framework and tools for increased productivity
[Diagram: a platform hosting physics modules (hydro, thermal, neutronics, materials, nuclear reaction, …) on top of numerical services (transport scheme, finite element, diffusion scheme, Monte Carlo, linear-system solvers, …), external libraries, MPI and I/O, surrounded by code steering, high-level modelling, a high-level abstract syntax tree, and code testing]
Bibliography
• M. Pérache, P. Carribault, H. Jourdren, "MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption", EuroPVM/MPI'09
• P. Carribault, M. Pérache, H. Jourdren, "Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC", IWOMP'10
• P. Carribault, M. Pérache, H. Jourdren, "Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP Applications", IWOMP'11
• M. Tchiboukdjian, P. Carribault, M. Pérache, "Hierarchical Local Storage: Exploiting Flexible User-Data Sharing Between MPI Tasks", IPDPS'12
• J.-Y. Vet, P. Carribault, A. Cohen, "Multigrain Affinity for Heterogeneous Work Stealing", MULTIPROG'12
• J.-S. Camier, "Nabla: A Numerical Analysis Specific Language for Exascale Scientific Applications", SIAM Conference on Parallel Processing for Scientific Computing (PP14), Portland, OR, USA, 2014
• "Contemporary High Performance Computing: From Petascale toward Exascale", chap. 4 "Tera 100", ed. Jeffrey S. Vetter
• G. Grospellier, B. Lelandais, "The Arcane Development Framework", Proceedings of the 8th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing (POOSC'09), ACM, New York, NY, USA, Article 4, 11 pages, 2009
Commissariat à l’énergie atomique et aux énergies alternatives
Centre DAM Ile-de-France | 91297 Arpajon Cedex, France
T. +33 (0)1 69 26 40 00
Public industrial and commercial establishment | RCS Paris B 775 685 019