-
1
An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles
Jean-Marie Le Gouez, Onera CFD Department
Jean-Matthieu Etancelin, ROMEO HPC Center, Université de Reims Champagne-Ardenne
Thanks to Nikolay Markovskiy, dev-tech at NVIDIA Research Center, GB, and to Carlos Carrascal, research master intern
GTC 2016, April 7th, San José California
-
2
Unsteady CFD for aerodynamic profiles
•Context
•State of the art for unsteady fluid dynamics simulations of aerodynamic profiles
•Prototypes for new-generation flow solvers
•NextFlow GPU prototype: development stages, data models, programming languages, co-processing tools
•Capacity of Tesla GPU clusters for LES simulations
•Performance measurements, tracks for further optimizations
•Outlook
-
3
The expectations of the external users:
•Extended simulation domains: effects of wake on downstream components, blade-vortex interaction on helicopters, thermal loading by the reactor jets on composite structures,
•Models of full systems and not only the individual components: multi-stage turbomachinery internal flows, couplings between the combustion chamber and the turbine aerodynamics, …
•More multi-scale effects: representation of technological effects to improve the overall flow system efficiency: grooves in the walls, local injectors for flow / acoustics control,
•Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as pdfs,
The Cassiopee system for application productivity, modularity and coupling, associated with the elsA solver, partly open source
General context : CFD at Onera
-
4
Expectations from the internal users :
- to develop and validate state-of-the-art physical models: transition to turbulence, wall models, sub-grid closure models, flame stability,
- to propose disruptive novel designs for aeronautics in terms of aerodynamics, propulsion integration, noise mitigation, …
- to tackle the CFD grand challenges
→ New classes of numerical methods, less dependent on the grids, more robust and versatile,
→ Computational efficiency near the hardware design performance, high parallel scalability,
Decision to launch research projects:
→ On deployment of the DG method for complex cases: AGHORA code
→ On a modular multi-solver architecture within the Cassiopee set of tools
CFD at Onera
-
5
Improvement of predictive capabilities in the last 5 years: RANS / zonal LES of the flow around a high-lift wing
2D steady RANS and 3D LES, 7.5 Mpts
2009
Mach 0.18, Rey 1 400 000 / chord
LEISA project Onera FUNK software
2014
•Optimized on a CPU architecture: MPI / OpenMP / vectorization
•CPU resources for 70 ms of simulation: JADE computer (CINES), CPU time allotted by Genci
•Nxyz ~ 2 600 Mpts, 4096 cores / 10688 domains, T_CPU ~ 6 200 000 h, residence time: 63 days
-
NextFlow: Spatially high-order Finite Volume method for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures
-
8
Multi-GPU implementation of a High-Order Finite Volume solver
Main choices: CUDA, Thrust, mvapich. Reasons: resource-aware programming, productivity libraries
The hierarchy of memories corresponds to the algorithm phases:
1/ main memory for field and metrics variables: 40 million cells on a K40 (12 GB), and for the communication buffers (halos of cells for other partitions)
2/ shared memory at the stream-multiprocessor level for stencil operations
3/ careful use of registers for node, cell and face algorithms
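The "40 million cells on a K40 (12 GB)" figure can be sanity-checked with back-of-the-envelope arithmetic; the translation into FP64 words per cell below is illustrative, not taken from the slides:

```c
/* Memory-budget sketch: bytes available per cell when a given cell count
   must fit in a given device memory, and the equivalent FP64 word count. */
double bytes_per_cell(double mem_bytes, double n_cells) {
    return mem_bytes / n_cells;            /* 12e9 / 40e6 = 300 bytes/cell */
}

double fp64_words_per_cell(double mem_bytes, double n_cells) {
    return bytes_per_cell(mem_bytes, n_cells) / 8.0;   /* ~37 doubles/cell */
}
```

At roughly 37 doubles per cell there is room for the conserved fields, metrics, and halo buffers, which is consistent with keeping everything resident on the device.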
Stages of the project
•Initial porting with the same data model organization as on the CPU
•Generic refinement of coarse triangular elements with curved faces : hierarchy of grids
•Multi-GPU implementation of a highly space-parallel model : extruded in the span direction and periodic
•On-going work on a 3D generalization of the preceding phases : embedded grids inside a regular distribution (Octree-type)
-
9
1st Approach: Block Structuring of a Regular Linear Grid
Partition the mesh into small blocks
[Diagram: the mesh blocks are mapped one-to-one onto the GPU Stream Multiprocessors]
SM: Stream Multiprocessor
Map the GPU scalable structure
-
10
Relative advantage of the small block partition
●Bigger blocks provide
• Better occupancy
• Less latency due to kernel launch
• Less transfers between blocks
●Smaller blocks provide
• Much more data caching
[Bar charts: L1 hit rate (%), normalized fluxes time, and normalized overall time, for block sizes of 256, 1024, 4096 and 24097 cells]
● Final speedup wrt. 2 hyperthreaded Westmere CPUs: ~2
-
11
2nd approach: Embedded grids, hierarchical data model NXO-GPU
Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh, refined on the GPU; the whole fine grid as such can remain unknown to the host CPU
Imposing a sub-structuring on the grid and data model (inspired by the 'tessellation' mechanism in surface rendering)
Unique grid connectivity for the inner algorithm: optimal to organize data for coalesced memory access during the algorithm and communication phases. Each coarse element in a block is allocated to an inner thread (threadIdx.x)
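A minimal sketch of the kind of structure-of-arrays indexing that yields coalesced access with one thread per fine cell; the function name and layout are illustrative, not the solver's actual code:

```c
#include <stddef.h>

/* Structure-of-arrays index: all cells of one variable are contiguous, so
   consecutive threads (threadIdx.x over cells) touch consecutive addresses,
   which is the pattern coalesced memory access requires. */
size_t soa_index(size_t cell, size_t var, size_t n_cells) {
    return var * n_cells + cell;
}
```

The alternative array-of-structures layout (`cell * n_vars + var`) would make consecutive threads stride by the variable count, breaking coalescence.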
-
12
Code structure
•Preprocessing (Fortran): mesh generation, block and generic refinement generation
•Postprocessing (Fortran): visualization and data analysis
•Solver:
- Allocation and initialization of data structures from the modified mesh file (C)
- Computational routine, time stepping (C)
- Data fetching binder, computational binders, GPU allocation and initialization binders (C)
- CUDA kernels (CUDA)
-
13
Version 2: Measured efficiency on a Tesla K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)
Initial results on a K20c: max. acceleration = 38 wrt 2 Westmere sockets
Improvement of the Westmere CPU efficiency: OpenMP task-based rather than inner-loop
With the same block data model on the CPU as well, the K20c GPU / CPU acceleration drops to 13 (1 K20c = 150 Westmere cores)
→ In fact this method is memory-bound, and GPU bandwidth is critical.
More CPU optimisation needed (cache blocking, vectorisation?)
Flop count: around 80 Gflops DP / K20c
These are valuable flops, not Ax=b, but highly non-linear Riemann solver flops with high-order (4th, 5th) extrapolated values and characteristic splitting to avoid interference between waves, …: a very high memory traffic is required to feed these flops: wide-stencil method
Thanks to the NVIDIA GB dev-tech group for their support, "my flop is rich"
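The memory-bound claim can be framed as a roofline estimate. The sketch below is generic; the K20c figures in the usage note (~1170 Gflops DP peak, ~200 GB/s bandwidth) are nominal spec-sheet values, not measurements from the slides:

```c
/* Roofline model: the attainable FP64 rate is the smaller of the compute
   peak and (arithmetic intensity x memory bandwidth). */
double roofline_gflops(double flops_per_byte, double peak_gflops, double bw_gbs) {
    double mem_bound = flops_per_byte * bw_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

At an intensity of ~0.4 flop/byte, typical of wide-stencil flux kernels, ~200 GB/s caps throughput near 80 Gflops, far below the chip's DP peak, which is consistent with the measured flop count.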
-
14
Version 3 : 2.5D periodic spanwise (circular shift vectors), MULTI-GPU / MPI
Objective: one billion cells on a cluster with only 64 Tesla K20 or 16 K80
(40 000 cells × 512 spanwise stations per partition: 20 million cells addressed to each Tesla K20)
The CPU (MPI / Fortran, OpenMP inner-loop-based) and GPU (GPUDirect / C / CUDA) versions are in the same executable, for efficiency and accuracy comparisons
High CPU vectorisation (all variables are vectors of length 256 to 512) in the 3rd homogeneous direction
Full data parallel Cuda kernels with coalesced memory access
•Coarse partitioning: number of partitions equal to the number of sockets / accelerators
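The periodic spanwise direction makes neighbor lookup a circular shift. A scalar sketch of the cshift addressing (function name illustrative):

```c
/* Periodic neighbor index in the homogeneous span direction: station j
   shifted by s among n stations, wrapping at both ends. The double modulo
   keeps the result non-negative for negative shifts in C. */
int span_neighbor(int station, int shift, int n_stations) {
    return ((station + shift) % n_stations + n_stations) % n_stations;
}
```

In the vectorized CPU version the same wrap is applied to whole vectors of 256 to 512 stations at once, which is why no halo exchange is needed in that direction.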
-
15
Version 3: 2.5D periodic spanwise (cshift vectors), MULTI-GPU / MPI
Initial performance measurements
-
16
Initial kernel optimization and analysis performed by NVIDIA DevTech
After this first optimization: performance ratio of 14, K40 vs 8-core Ivy-Bridge socket
Strategy for further performance optimization:
Increase occupancy, reduce register use, reduce the number of operations on global memory; use the texture cache for wide read-only arrays in a kernel
Put stencil coefficients in shared memory, use constant memory, __launch_bounds__(128, 4)
-
17
Next stage of optimizations
Work done by Jean-Matthieu:
- Use thread collaboration to transfer stencil data from main memory to shared memory
- Refactor the kernel where face-stencil operations are done: split it in two phases to reduce register pressure
- Use the thrust library to sort the face and cell indices into lists, to template the kernels according to the list number and avoid internal conditional switches
Enable overlapping of:
- computations in the center of the partition,
- transfer of the halo cells at the periphery (mvapich2),
by using multiple streams and further classification of the cell and face indices: center / periphery (thrust)
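A CPU-side sketch of the center/periphery classification that enables the overlap; the solver does this with thrust on the GPU, and the names here are illustrative:

```c
#include <stddef.h>

/* Split cell indices into two lists: periphery cells (halo-adjacent, needed
   for MPI exchange) and interior cells, which a separate stream can process
   while the halo transfer is in flight. Returns the interior count. */
size_t split_cells(const int *is_periphery, size_t n,
                   int *interior, int *periphery, size_t *n_periphery) {
    size_t ni = 0, np = 0;
    for (size_t c = 0; c < n; ++c) {
        if (is_periphery[c]) periphery[np++] = (int)c;
        else                 interior[ni++]  = (int)c;
    }
    *n_periphery = np;
    return ni;
}
```

Launching the interior kernel on one stream and the periphery kernel on another, after the exchange completes, hides most of the communication time when interiors dominate.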
-
18
Version 3: Kernel granularity revised to optimize register use; overlapping of communications with computations at the centers of the partitions; local memory usage and inner-block thread collaboration
TAYLOR-GREEN Vortex
Scalability analysis with up to one billion cells and 4th-degree polynomial reconstruction (5 dof per cell, stencil size 68 cells), with 1 to 128 GPUs (K20Xm)
High performance: 12 ns to compute one set of 5 fluxes on an interface from a wide stencil: 180 GBytes/s, 170 Gflops DP
Scalability drops only for extreme degraded usage: a small 128³-cell grid on more than 32 GPUs, with over 30% of cells to exchange
Strong scalability
Weak scalability
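The quoted bandwidth and timing are mutually consistent, since bytes per nanosecond equals GB/s numerically. The ~2.2 kB of traffic per face update below is inferred from the slide's figures, not quoted directly:

```c
/* Achieved bandwidth from a measured kernel time: bytes moved per face
   update divided by nanoseconds per update (1 byte/ns = 1 GB/s). */
double achieved_gb_per_s(double bytes_per_update, double ns_per_update) {
    return bytes_per_update / ns_per_update;
}
```

At 12 ns per 5-flux interface update, 180 GB/s corresponds to 2160 bytes of memory traffic per update, plausible for a 68-cell stencil read in FP64.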
-
High Order CFD Workshop Case 3.5 Taylor-Green Vortex
-
20
GPU implementation of the NextFlow solver
Performance on each K20Xm GPU:
k3: 1.8e-8 s per cell per RHS evaluation, i.e. 0.36 s for 20 000 000 cells
k4: 2.5e-8 s per cell per RHS evaluation, i.e. 0.50 s for 20 000 000 cells
→ Taylor-Green vortex 256**3: wall-clock = 12 hours on 16 Ivy-Bridge processors (128 cores in total), i.e. ~1600 Intel-core CPU hours; 25 minutes on 16 Tesla K20m GPUs
By comparison, at the 1st HO CFD workshop this case required between 1100 and 33000 Intel-core CPU hours, depending on the numerical method
Taylor-Green vortex 512**3: wall-clock = 4 hours on 16 Tesla K20m GPUs
Taylor-Green Vortex, Rey = 1600. Computations on wedges
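The per-RHS wall-clock figures follow directly from the per-cell cost, which makes them easy to cross-check:

```c
/* Wall-clock seconds for one RHS evaluation from the quoted per-cell cost:
   e.g. the k3 scheme at 1.8e-8 s/cell over 20 million cells gives 0.36 s. */
double rhs_seconds(double s_per_cell_per_rhs, double n_cells) {
    return s_per_cell_per_rhs * n_cells;
}
```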
-
On-going work: Hierarchical grids based on the generic refinement of a coarse grid of type Octree / Tet-tree
→ Grids (structured?): Octree → Tet-tree
All tets are identical, only oriented differently in space
From a grid of very coarse « structured tets » (bottom right): perform a refinement based on a simple criterion (distance to an object): 8, 8², 8³ … coarse tets in each (figure on top right)
→ Tet-tree 'coarse' grid, managed and partitioned on the cluster by CPU thread 0 of each node
Each coarse tet of any size is filled dynamically with small tets: the finite volumes for the solver
The size of the inner grid is adapted dynamically to the solution by refinement fronts crossing the coarse edges
The coarse tets are clustered by identical refinement level: these sets are allotted to the multiprocessors of the accelerators available on the nodes
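The "8, 8², 8³ …" progression is just repeated 1:8 subdivision, sketched below for clarity:

```c
/* Number of sub-tets produced by k levels of uniform 1:8 refinement of one
   coarse tet: 8^k finite volumes per coarse element. */
long tets_after_refinement(int levels) {
    long n = 1;
    for (int i = 0; i < levels; ++i) n *= 8;
    return n;
}
```

Clustering coarse tets by identical refinement level means each multiprocessor processes sets of elements with the same inner grid size, keeping the thread workload uniform.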
-
On-going work Hierarchical grids based on the generic refinement of a coarse grid of type Octree/Tet-tree
→ A generic set of filling grids is generated on this type of simple gmsh model (the « tet-farm »): inner connectivity list, coefficients of the spatial scheme, halos of ghost cells and their correspondence with the inner numbering of neighbors, HO projection coefficients when the filling grid density varies in time
→ This common inner data model is stored only on the GPUs and accessed in a coalesced way by the threads
→ Wall boundary conditions are Immersed Boundary conditions or CAD-cut cells with curved geometry
-
23
Conclusion
A number of preparatory projects allowed us to acquire solid expertise in porting CFD solvers, their compute-intensive kernels and interfaces, and in the best organization of the data models for multi-GPU performance.
High compute throughput was reached by approaching the peak main-memory bandwidth and almost fully overlapping computations and communications for big models: up to 80 million cells on a K80.
The initial choice of CUDA, thrust and mvapich proved correct: good stability of the language, SDK and associated programming productivity tools.
A project for full software deployment of a variety of CFD options for complex 3D geometries and adaptive grid refinement, without the need for a preliminary meshing tool, has been started.