HPC Best Practices for FEA
TRANSCRIPT
-
HPC Best Practices for FEA
John Higgins, PE, Senior Application Engineer
-
Agenda
• Overview
• Parallel Processing Methods
• Solver Types
• Performance Review
• Memory Settings
• GPU Technology
• Software Considerations
• Appendix
-
Overview
Basic information (a model, a machine) → Output data (elapsed time)
Need for speed: implicit structural FEA codes
• Mesh fidelity continues to increase
• More complex physics being analyzed
• Lots of computations!
-
Overview
Basic information → Solver Configuration → Output data
A model: size / number of DOF; analysis type
A machine: number of cores; RAM
Output data: elapsed time
Analyzing the model before launching the run may help you choose the most suitable solver configuration on the first attempt.
-
Overview
Basic information → Solver Configuration → Output data
A model: size / number of DOF; analysis type
A machine: number of cores; RAM
Parallel processing method: shared memory (SMP) or distributed memory (DMP)
Solver type: direct (Sparse) or iterative (PCG)
Memory settings
Output data: elapsed time
-
Overview
Basic information → Solver Configuration → Information during the solve → Output data
A model: size / number of DOF; analysis type
A machine: number of cores; RAM
Parallel processing method: shared memory (SMP) or distributed memory (DMP)
Solver type: direct (Sparse) or iterative (PCG)
Memory settings
Resource Monitor: CPU, memory, disk, network
Output data: elapsed time
-
Overview
Basic information → Solver Configuration → Information during the solve → Output data
A model: size / number of DOF; analysis type
A machine: number of cores; RAM
Parallel processing method: shared memory (SMP) or distributed memory (DMP)
Solver type: direct (Sparse) or iterative (PCG)
Memory settings
Resource Monitor: CPU, memory, disk, network
Output data: elapsed time; equation solver computational rate; equation solver effective I/O rate (bandwidth); total memory used (in-core / out-of-core?)
-
Parallel Processing
(Overview diagram repeated: basic information → solver configuration → information during the solve → output data)
-
Parallel Processing – Hardware
Workstation/Server:
• Shared memory (SMP), single box, or
• Distributed memory (DMP), single box
(figure: workstation)
-
Parallel Processing – Hardware
Cluster (workstation cluster, node cluster):
• Distributed memory (DMP), multiple boxes (cluster)
(figure: cluster)
-
Parallel Processing – Hardware + Software

                    Laptop/Desktop or Workstation/Server   Cluster
ANSYS               YES                                    SMP (per node)
Distributed ANSYS   YES                                    YES
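As a hedged illustration of launching each mode from the command line (the path, release number, and core count are examples, not prescriptions):

  ansys140 -np 4 -i input.dat -o output.out                    (shared-memory SMP, single box)
  ansys140 -dis -mpi pcmpi -np 4 -i input.dat -o output.out    (Distributed ANSYS, DMP)

The -dis flag requests Distributed ANSYS; without it, the run uses shared-memory parallelism.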
-
Distributed ANSYS Design Requirements
• No limitation in simulation capability
• Reproducible and consistent results
• Support for all major platforms
-
Distributed ANSYS Architecture
Domain decomposition approach:
• Break the problem into N pieces (domains)
• "Solve" the global problem independently within each domain
• Communicate information across the boundaries as necessary
(figure: model split into domains across Processors 1–4)
-
Distributed ANSYS Architecture
(figure: process 0 (host) performs the domain decomposition into domains 0 … n-1; each process then runs element formation, assembly, solve, and element output on its own domain, with interprocess communication between the processes; the results are combined at the end)
-
Distributed ANSYS Solvers
Distributed sparse (default)
• Supports all analyses supported with DANSYS (linear, nonlinear, static, transient)
Distributed PCG
• For static and full transient analyses
Distributed LANPCG (eigensolver)
• For modal analyses
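A minimal APDL sketch of selecting each of these solvers (the tolerance and mode count are illustrative values, not recommendations); pick one EQSLV setting per solution, they are listed together only for comparison:

  EQSLV,SPARSE          ! direct sparse solver (the default)
  EQSLV,PCG,1.0E-8      ! iterative PCG solver, with its convergence tolerance
  MODOPT,LANPCG,10      ! LANPCG eigensolver, e.g. the first 10 modes of a modal analysis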
-
Solver Types
(Overview diagram repeated: basic information → solver configuration → information during the solve → output data)
-
Solver Types – Solution Overview
The equation solver (the "solver": solve [K]{x} = {b}) dominates solution CPU time, at roughly 70%, so pay attention to the equation solver. It also consumes the most system resources (memory and I/O).
The remaining CPU time is shared by the other solution procedures: prep data, element formation, global assembly, and element stress recovery (roughly 5–10% each).
-
Solver Types – Solution Overview
(diagram repeated: prep data → element formation → global assembly → solve [K]{x} = {b} → element stress recovery)
-
Solver Architecture
(figure: element formation writes the emat and esav files; symbolic assembly writes the full file; system resolution is performed either by the Sparse solver, which works through files, or by the PCG solver, which keeps its data in-core as objects; element output writes the rst/rth results files; all stages exchange data with the database)
-
Solver Types: SPARSE (Direct)
Filing:
• LN09: sparse solver scratch file
• *.BCS: statistics from the Sparse solver
• *.full: assembled stiffness matrix
-
Solver Types: SPARSE (Direct)
PROS
• More robust with poorly conditioned problems (shells, beams)
• Solution always guaranteed
• Fast for the 2nd solve and beyond (multiple load cases)
CONS
• Factoring the matrix and solving are resource intensive
• Large memory requirements
-
Solver Types: PCG (Iterative)
• Minimization of residuals / potential energy (standard conjugate gradient method): {r} = {f} – [K]{u}
• Iterative process requiring a convergence test (PCGTOL)
• Preconditioned CG is used instead to reduce the number of iterations (preconditioner [Q] ≈ [K]^-1, with [Q] cheaper than [K]^-1)
• The number of iterations depends on how well conditioned the problem is
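For reference, the preconditioned conjugate gradient recurrence behind these bullets can be written in standard textbook form (not ANSYS-specific):

$$r_0 = f - K u_0, \qquad p_0 = z_0 = Q r_0$$
$$\alpha_k = \frac{r_k^{T} z_k}{p_k^{T} K p_k}, \qquad u_{k+1} = u_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k K p_k$$
$$z_{k+1} = Q r_{k+1}, \qquad \beta_k = \frac{r_{k+1}^{T} z_{k+1}}{r_k^{T} z_k}, \qquad p_{k+1} = z_{k+1} + \beta_k p_k$$

iterated until the relative residual drops below PCGTOL. A good preconditioner Q makes each step cheap while keeping the iteration count low.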
-
Solver Types: PCG (Iterative) – PCGTOL
For an ill-conditioned model, PCGTOL needs to be lowered (e.g., to 1e-9 or 1e-10) to let ANSYS follow the same path (equilibrium iterations) as the direct solver.
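A one-line APDL sketch of tightening that tolerance (the value shown is illustrative):

  EQSLV,PCG,1.0E-9      ! PCG solver with a tighter convergence tolerance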
-
Solver Types: PCG (Iterative)
Filing:
• *.PC*
• *.PCS: iterative solver statistics
-
Solver Types: PCG (Iterative)
PROS
• Lower memory requirements
• Better suited to bigger, well-conditioned problems
CONS
• Not useful with near-rigid or rigid body behavior
• Less robust with ill-conditioned models (shells and beams, inadequate boundary conditions (rigid body motions), considerably elongated elements, nearly singular matrices, …): it is more difficult to approximate [K]^-1 with [Q]
-
Solver Types: PCG (Iterative) – Level of Difficulty
The LOD number is available in the solver output (solve.out), but it can also be assessed alongside the number of PCG iterations required to reach a converged solution in the jobname.PCS file.
-
Solver Types: PCG (Iterative) – Other ways to evaluate ill-conditioning
An error message is also an indication. Although the message proposes changing a MULT coefficient, the model should be carefully reviewed first, and the SPARSE solver considered for the resolution instead.
-
Solver Types – Comparative
(figure: solver comparison chart)
-
Performance Review
(Overview diagram repeated: basic information → solver configuration → information during the solve → output data)
-
Performance Review
Process Resource Monitoring (only available on Windows 7)
Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.
-
Performance Review
How to access the Resource Monitor:
• From the OS Task Manager (Ctrl + Shift + Esc)
• Or: click Start, click in the Start Search box, type resmon.exe, and then press ENTER
-
Performance Review
Process Resource Monitoring – CPU
(figures: CPU usage under Shared Memory (SMP) vs Distributed Memory (DMP))
-
Performance Review
Process Resource Monitoring – Memory
(figures: memory before the solve, during the solve, and the matching information in solve.out)
-
Overview
(diagram repeated: basic information → solver configuration → information during the solve → output data)
-
Performance Review – ANSYS End Statistics
Basic information about the analysis solve is directly available at the end of the solver output file (*.out), in the Solution Information, including the total elapsed time.
-
Performance Review
Other main output data to check:

Output Data                  Description
Elapsed Time (sec)           Total time of the simulation
Solver rate (Mflops)         Speed of the solver
Bandwidth (Gbytes/s)         I/O rate
Memory Used (Mbytes)         Memory required
Number of iterations (PCG)   Available for PCG only
-
Performance Review
Elapsed time, solver rate, bandwidth, memory used, and the number of iterations are reported in the solver statistics files: *.PCS for PCG, *.BCS for SPARSE.
-
Memory Settings
(Overview diagram repeated: basic information → solver configuration → information during the solve → output data)
-
Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup
-
Memory Settings – Test Case 1
Test Case 1: "small model" (needs 4 GB of scratch memory, < RAM)
Machine reference: 6 GB RAM; enough memory, but …
Default BCSOPTION,,OPTIMAL: elapsed time = 146 sec. BCSOPTION,,INCORE: elapsed time = 77 sec.
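A minimal APDL sketch of forcing the in-core mode, assuming the scratch space genuinely fits in RAM:

  /SOLU
  BCSOPTION,,INCORE     ! force the sparse solver to factor fully in-core
  SOLVE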
-
Memory Settings – Test Case 1
Test Case 1: "small model" (needs 4 GB of scratch memory, < RAM); machine reference: 6 GB RAM
(figures: solver output for BCSOPTION,,OPTIMAL (default) vs BCSOPTION,,INCORE)
-
Memory Settings – Test Case 2
Test Case 2: "large model" (needs 21.1 GB of scratch memory, > RAM)
Machine reference: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0
Default BCSOPTION,,OPTIMAL: elapsed time = 1249 sec. BCSOPTION,,INCORE: elapsed time = 4767 sec.
Do not systematically set in-core when the available memory is not enough!
-
Memory Settings – Test Case 2
Test Case 2: "large model" (needs 21.1 GB of scratch memory, > RAM); machine reference: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0
(figures: solver output for BCSOPTION,,OPTIMAL (default) vs BCSOPTION,,INCORE)
-
Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup
• 206 MB is available for the Sparse solver at the time of factorization.
• This is sufficient to run in optimal out-of-core mode (which requires 126 MB) and obtain good performance.
• If more than 1547 MB were available, the solve would run fully in-core: best performance.
• Avoid using minimum out-of-core mode (memory less than 126 MB).
In short: in-core above 1547 MB; optimal out-of-core between 126 MB and ~1.5 GB; minimum out-of-core below 126 MB.
-
Memory Settings
SPARSE: three memory modes can be observed, from best to worst performance:
• In-core mode (optional): requires the most memory; performs no I/O
• Optimal out-of-core mode (default): balances memory usage and I/O
• Minimum core mode (not recommended): requires the least memory; performs the most I/O
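A hedged APDL sketch of requesting each mode explicitly; choose one per solve (the default is usually the safest starting point):

  BCSOPTION,,INCORE     ! in-core: fastest, needs the most RAM
  BCSOPTION,,OPTIMAL    ! optimal out-of-core: the default balance
  BCSOPTION,,MINIMUM    ! minimum core: not recommended, heavy I/O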
-
Memory Settings – Test Case 3
Test Case 3, a trap to avoid: launching a run on a network drive (or another slow drive).
Solve on a local disk: elapsed time = 1998 sec. Same solve on a slow disk (networked or USB): elapsed time = 3185 sec.
-
Performance Review
PCG: Solver Statistics – *.PCS file >> Number of Iterations
(annotated statistics: number of cores used (SMP, DMP); level of difficulty from PCGOPT,Lev_Diff; the number of iterations is an important statistic!)
-
Performance Review
PCG: Solver Statistics – *.PCS file >> Number of Iterations
Check the total number of PCG iterations:
• Fewer than 1000 iterations: good performance
• More than 1000 iterations: performance is deteriorated; try increasing Lev_Diff on PCGOPT
• More than 3000 iterations: assuming you have tried increasing Lev_Diff, either abandon PCG and use the Sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions
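A one-line APDL sketch of raising the preconditioner level of difficulty (levels 1–4 are typical; higher levels cost more per iteration but usually cut the iteration count):

  PCGOPT,3              ! raise Lev_Diff above the automatically chosen level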
-
Performance Review
PCG: Solver Statistics – *.PCS file >> Number of Iterations
If there are too many iterations:
• Use parallel processing
• Use PCGOPT,Lev_Diff
• Refine your mesh
• Check for excessively high stiffness
-
GPU Technology
(Overview diagram repeated; the machine side now lists a GPU alongside the number of cores and RAM)
-
GPU Technology – Introduction
CPUs and GPUs used in a collaborative fashion:
• CPU: multi-core processors; typically 4–12 cores; powerful, general purpose
• GPU: many-core processors; typically hundreds of cores; great for highly parallel code
• The two are linked by a PCI Express channel
-
GPU Accelerator Capability
Motivation: the equation solver (e.g., [A]{x} = {b}) dominates solution time, typically 60–90%, which makes it the logical place to add GPU acceleration. The other stages (element formation, global assembly, element stress recovery, and the remaining solution procedures) share the remaining 10–40%.
-
GPU Accelerator Capability
"Accelerate" the sparse direct solver (Boeing/DSP):
• The GPU is only used to factor a dense frontal matrix
• The decision on whether to send data to the GPU is based on the frontal matrix size:
  – Too small: too much overhead, stays on the CPU
  – Too large: exceeds GPU memory, stays on the CPU
-
GPU Accelerator Capability
Supported hardware:
• Currently recommending NVIDIA Tesla 20-series cards
• Recently added support for the Quadro 6000
• Requirements: a larger power supply (one card needs about 225 W) and an open 2x form factor PCIe x16 Gen2 slot
• Supported on Windows/Linux 64-bit

                     NVIDIA Tesla C2050   NVIDIA Tesla C2070   NVIDIA Quadro 6000
Power                225 Watts            225 Watts            225 Watts
CUDA cores           448                  448                  448
Memory               3 GB                 6 GB                 6 GB
Memory Bandwidth     144 GB/s             144 GB/s             144 GB/s
Peak Speed (SP/DP)   1030/515 Gflops      1030/515 Gflops      1030/515 Gflops
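A hedged example of enabling the GPU accelerator at launch; the -acc nvidia option is documented later in this deck, and the path and core count are illustrative:

  "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -dis -mpi pcmpi -np 4 -acc nvidia -i input.dat -o output.out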
-
Distributed ANSYS – GPU Speedup @ 14.0
Vibroacoustic harmonic analysis of an audio speaker:
• Direct sparse solver
• Quarter-symmetry model with 700K DOF: 657,424 nodes; 465,798 elements; higher-order acoustic fluid elements (FLUID220/221)
Distributed ANSYS results (baseline is 1 core):

Cores   GPU   Speedup
2       no    2.25
4       no    4.29
2       yes   11.36
4       yes   11.51

• With a GPU, ~11x speedup on 2 cores!
• 15–25% faster than SMP with the same number of cores
Windows workstation: two Intel Xeon 5530 processors (2.4 GHz, 8 cores total), 48 GB RAM, NVIDIA Quadro 6000
(figure: speedup bars for SMP, DANSYS, SMP+GPU, and DANSYS+GPU at 2 and 4 cores)
-
ANSYS Mechanical 14.0 Performance for Tesla C2075
V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 elements, static nonlinear, one iteration, direct sparse solver.
Results from an HP Z800 workstation, 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17.
(figure: ANSYS Mechanical times in seconds, lower is better. Xeon 5670 2.93 GHz Westmere alone: 1848 (1 core), 1192 (2 cores), 846 (4 cores), 564 (6 cores, 1 socket), 516 (8 cores), 399 (12 cores, 2 sockets). With a Tesla C2075 added: 444, 342, 314, 273, 270, i.e. speedups of roughly 4.2x, 3.5x, 2.7x, 2.1x, and 1.9x.)
Add a Tesla C2075 to use with 6 cores: now 46% faster than 12 cores alone, with 6 cores available for other tasks.
-
GPU Accelerator Capability
V13sp-5 benchmark (turbine model)
(figure: factorization speed (Mflops) versus front size (MB), with front sizes from 0 to 1300 MB and rates up to ~200,000 Mflops)
-
ANSYS Mechanical – Multi-Node GPU
Solder joint benchmark (4 MDOF, creep strain analysis); the model comprises mold, PCB, and solder balls.
R14 Distributed ANSYS with/without GPU, total speedup:
(figure: without GPU, 1.9x at 32 cores and 3.2x at 64 cores; with GPU, 1.7x, 3.4x, and 4.4x at 16, 32, and 64 cores)
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, an NVIDIA Tesla M2070, and InfiniBand.
Results courtesy of MicroConsult Engineering, GmbH
-
Comparative Trends
Trends in performance by solver type: three areas can be defined along the model size (number of DOF):
I. SPARSE is more efficient
II. Either SPARSE or PCG can be used
III. The PCG solver works faster, since it needs fewer I/O exchanges with the hard disk
With multiple cores and GPUs, all of these trends can change due to differences in speedup.
You need to evaluate Sparse and PCG behavior and speedup on your own model!
(figure: elapsed time versus number of DOF for PCG level 1, PCG level 2, sparse, and sparse+GPU, with regions I, II, and III)
-
Other Software Considerations
Tips and tricks on performance gains:
• Some considerations on the scalability of DANSYS
• Working with solution differences
• Working with a case that does not scale (or scales poorly)
• Working with programmable features for parallel runs
-
Scalability Considerations
Load balance, improvements to domain decomposition, and Amdahl's Law (see the formula after this list) all govern scalability:
• Algorithmic enhancements: every part of the code has to run in parallel
User-controllable items:
• Contact pair definitions: big contact pairs hurt load balance (one contact pair is put into one domain in the code)
• CE definitions: many CE terms hurt load balance and Amdahl's law (CEs require communication among the domains in which their terms are defined)
• Use the best and most suitable hardware possible (CPU speed, memory, I/O, and interconnects)
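For reference, Amdahl's law bounds the speedup $S$ achievable on $N$ cores when a fraction $p$ of the work can run in parallel:

$$S(N) = \frac{1}{(1 - p) + p/N}$$

Even with $p = 0.95$, the speedup can never exceed 20x no matter how many cores are added, which is why the serial bottlenecks listed above matter so much.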
-
Scalability Considerations: Contact
• Avoid defining the whole exterior surface as one target piece
• Break pairs into smaller pieces if possible
• Remember: one whole contact pair is processed on one processor (contact work cannot be spread out)
(figures: define a half circle as the target rather than the full circle; avoid overlapping contact surfaces if possible; split the potential contact surface into smaller pieces)
-
Scalability Considerations: Contact
• Avoid defining "un-used" surfaces as contact or target, i.e., reduce the potential contact definition to the minimum (see the sketch after this list):
• In rev. 12.0: use the new control CNCHECK,TRIM
• In rev. 11.0: turn NLGEOM,OFF when defining contact pairs in WB; WB then automatically turns on a facility like CNCHECK,TRIM internally
(figure: a trimmed contact pair)
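A one-line APDL sketch for rev. 12.0 and later, assuming the contact pairs are already defined:

  CNCHECK,TRIM          ! trim contact/target elements far from the initial contact zone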
-
Scalability Considerations: Remote Load/Disp
Point load distribution (remote load): a point moment is distributed to the internal surface of the hole (figure: the deformed shape).
All nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.
-
Example of bonded contact and remote loads: universal joint model
• 14 bonded contact pairs, with internal CEs generated by the bonded contact
• Torque defined by RBE3 on the end surface only: good practice
• This model has small pieces of contacts and RBE3, so it scales well in DANSYS
-
Working With Solution Differences in Parallel Runs
Most solution differences come from contact applications when comparing NP = 1 versus NP = 2, 3, 4, 5, 6, 7, …
• Check the contact pairs to make sure this is not a bifurcation case, and plot the deformations to see it.
• Tighten the CNVTOL convergence tolerance to assess solution accuracy. If the solutions differ by less than, say, 1%, then parallel computing can make some difference in convergence and all the solutions are acceptable.
• If the solution is well defined and all input settings are correct, report the case to ANSYS Inc. for investigation.
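A hedged APDL sketch of tightening the force convergence tolerance for such a check (the label and value are illustrative; pick what matches your analysis):

  CNVTOL,F,,1.0E-3      ! tighten the force convergence tolerance for the comparison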
-
Working With a Case of Poor Scalability
No scalability (speedup) at all, or even slower than NP = 1? Check:
• Is the problem too small (normally the DOF count should be greater than 50K)?
• Is the disk slow, or is the problem so big that the I/O size exceeds the memory I/O buffer?
• Is every node of the machines connected through the public network?
• Look at the scalability of the solver first, not of the entire run (e.g., /PREP7 time is mostly not scalable)
• Resume the data at the /SOLU level rather than reading the input files on every run
• etc.
-
Working With a Case of Poor Scalability
There is scalability, but it is poor (say, speedup < 2X)? Check:
• Is this GigE or another slow interconnect?
• Are all processors sharing one disk (e.g., an NFS mount)?
• Are other people running jobs on the same machine at the same time?
• Are there many big contact pairs, or remote loads/displacements tied to major portions of the model?
• Is this a generation of dual/quad cores where the memory bandwidth is entirely shared among the cores?
• Look at the scalability of the solver first, not of the entire run (e.g., /PREP7 time is mostly not scalable)
• Resume the data at the /SOLU level rather than reading the input files on every run
• etc.
-
APPENDIX
-
Platform MPI Installation for ANSYS 14
Note for ANSYS Mechanical R13 users:
• Do not uninstall HP-MPI; it is required for compatibility with R13.
• Verify that HP-MPI is installed in its default location, "C:\Program Files (x86)\Hewlett-Packard\HP-MPI"; this is required for ANSYS Mechanical R13 to execute properly.
-
Platform MPI Installation for ANSYS 14
• Run "setup.exe" of the ANSYS R14 installation as Administrator
• Install Platform MPI
• Follow the Platform MPI installation instructions
-
Platform MPI Installation for ANSYS 14
To finish the installation:
• Go to %AWP_ROOT140%\commonfiles\MPI\Platform\8.1\Windows\setpcmpipassword.bat (by default: "C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat")
• Run "setpcmpipassword.bat", type your Windows user password, and press Enter.
-
Test MPI Installation for ANSYS 14
The installation is now finished. How do you verify that it works properly?
• Edit the file "test_mpi14.bat" attached in the .zip
• Change the ANSYS path and the number of processors if necessary (-np x)
• Save and run the file "test_mpi14.bat"
• The expected result is shown below:
"c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -mpitest -mpi pcmpi -np 2
-
Test Case – Batch launch (Solver Sparse)
• The file "cube_sparse_hpc.txt" is an input file for a simple analysis (pressure on a cube).
• Edit the file "job_sparse.bat" and change the ANSYS path and/or the number of processors if necessary.
• You can change the number of mesh divisions of the cube to try out the performance of your machine (-ndiv xx).
• Save and run the file "job_sparse.bat".
Information about the flags in "job_sparse.bat":
  -b: batch mode               -np: number of processors
  -j: jobname                  -ndiv: number of divisions (for this example only)
  -i: input file               -acc nvidia: use GPU acceleration
  -o: output file              -mpi pcmpi: Platform MPI
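Putting those flags together, the launch line inside "job_sparse.bat" presumably looks something like the following (the path, jobname, and values are illustrative):

  "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -b -dis -mpi pcmpi -np 4 -j cube -ndiv 50 -i cube_sparse_hpc.txt -o cube_sparse.out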
-
Test Case – Batch launch (Solver Sparse)
• You can check that your processors are running with the Windows Task Manager (Ctrl+Shift+Esc). Example with 6 processes requested: (figure)
• Advice: do not request all the available processors if you want to do something else while the job is running.
-
Test Case – Batch launch (Solver Sparse)
Once the run is finished:
• Read the .out file to collect all the information from the solver output.
• The main pieces of information are:
  – Elapsed time (sec)
  – Latency time from master to core
  – Communication speed from master to core
  – Equation solver computational rate
  – Equation solver effective I/O rate
-
Test Case – Workbench launch
• Open a Workbench project with ANSYS R14
• Open Mechanical
• Go to Tools -> Solve Process Settings… -> Advanced…
• Check "Distributed Solution", specify the number of processors to use, and enter the additional command (-mpi pcmpi) as shown below
(figure with callouts 1–3; a GPU can also be enabled here)
-
Test Case – Workbench launch
In the Analysis Settings:
• You can choose the solver type (Direct = Sparse, Iterative = PCG)
• Solve your model
• Read the solver output from the Solution Information
-
Appendix
• Automated run for a model
• Compare customer results with an ANSYS reference
• First step for an HPC test on a customer machine
-
General view
INPUT DATA → OUTPUT DATA
The goal of this Excel file is twofold:
• On the one hand, it writes the batch launch commands for multiple analyses into a file (job.bat).
• On the other hand, it extracts information from the different solve.out files and writes it into Excel.
-
INPUT DATA
(figure: annotated screenshot; callouts 1–3 mark the input steps detailed on the following slides)
-
INPUT DATA (step 2)

Field               Description                           Choice
Machine             Number of machines used               1, 2 or 3
Solver              Type of solver used                   sparse or pcg
Division            Division of the edge for meshing      Any integer
Release             Select ANSYS release                  140 or 145
GPU                 Use GPU acceleration                  yes or no
np total            Total number of cores                 No choice (value calculated)
np / machine        Number of cores per machine           Any integer
PCG level           Only available for the PCG solver     1, 2, 3 or 4
Simulation method   Shared memory or distributed memory   SMP or DMP
-
INPUT DATA (step 3)
Create a job.bat file with all the input data given in the Excel sheet.
-
OUTPUT DATA
(figure: annotated screenshot; callouts 1–3 mark the output steps detailed on the following slides)
-
OUTPUT DATA (step 1)
Read the information from all the *.out files. NB: all the files must be in the same directory.
If a *.out file is not found, a pop-up appears:
• Continue: skip this file and go to the next one
• STOP: stop reading the remaining *.out files
-
OUTPUT DATA (step 2)

Output Data                  Description
Elapsed Time (sec)           Total time of the simulation
Solver rate (Mflops)         Speed of the solver
Bandwidth (Gbytes/s)         I/O rate
Memory Used (Mbytes)         Memory required
Number of iterations (PCG)   Available for PCG only
-
OUTPUT DATA
All this information (elapsed time, solver rate, bandwidth, memory used, number of iterations) is extracted from the *.out files, for both PCG and SPARSE runs.
-
OUTPUT DATA (step 3)
Hyperlinks are automatically created to open the different *.out files directly from Excel.
NB: if an error occurred during the solve (*** ERROR ***), it is automatically highlighted in the Excel file.
-
And now: we await your feedback and your results.
-
Any suggestions or questions for improving the Excel tool:
-
THANK YOU