HPC Best Practices for FEA


TRANSCRIPT

    1/98

    HPC Best Practices for FEA

    John Higgins, PE, Senior Application Engineer

    2/98

    Agenda

    • Overview

    • Parallel Processing Methods

    • Solver Types

    • Performance Review

    • Memory Settings

    • GPU Technology

    • Software Considerations

    • Appendix

    3/98

    Overview

    Basic information (a model, a machine) → Output data (elapsed time)

    Need for speed: implicit structural FEA codes
    • Mesh fidelity continues to increase
    • More complex physics being analyzed
    • Lots of computations!

    4/98

    Overview

    Basic information (a model: size / number of DOF, analysis type; a machine: number of cores, RAM) → Solver Configuration → Output data (elapsed time)

    Analysing the model before launching the run may help in choosing the most suitable solver configuration at the first attempt.

    5/98

    Overview

    Basic information:
    • A model: size / number of DOF, analysis type
    • A machine: number of cores, RAM

    Solver Configuration:
    • Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
    • Solver type: Direct (Sparse) or Iterative (PCG)
    • Memory settings

    Output data:
    • Elapsed time

    6/98

    Overview

    Basic information:
    • A model: size / number of DOF, analysis type
    • A machine: number of cores, RAM

    Solver Configuration:
    • Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
    • Solver type: Direct (Sparse) or Iterative (PCG)
    • Memory settings

    Information during the solve:
    • Resource Monitor: CPU, memory, disk, network

    Output data:
    • Elapsed time

    7/98

    Overview

    Basic information:
    • A model: size / number of DOF, analysis type
    • A machine: number of cores, RAM

    Solver Configuration:
    • Parallel processing method: Shared Memory (SMP) or Distributed Memory (DMP)
    • Solver type: Direct (Sparse) or Iterative (PCG)
    • Memory settings

    Information during the solve:
    • Resource Monitor: CPU, memory, disk, network

    Output data:
    • Elapsed time
    • Equation solver computational rate
    • Equation solver effective I/O rate (bandwidth)
    • Total memory used (in-core / out-of-core?)

    8/98

    Overview

    Basic information → Solver Configuration → Information during the solve → Output data

    9/98

    Parallel Processing

    Basic information → Solver Configuration → Information during the solve → Output data

    10/98

    Workstation/Server:

    • Shared memory (SMP): single box, or

    • Distributed memory (DMP): single box

    Parallel Processing – Hardware

    Workstation

    11/98

    Cluster (Workstation Cluster, Node Cluster):
    • Distributed memory (DMP): multiple boxes, cluster

    Parallel Processing – Hardware

    Cluster

    12/98

    Parallel Processing – Hardware + Software

                          Laptop/Desktop or Workstation/Server    Cluster
    ANSYS (SMP)           YES                                     SMP (per node)
    Distributed ANSYS     YES                                     YES
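    As a rough illustration of the two modes in the table above, the launch lines below are a minimal sketch based on the ANSYS 14.0 Windows command line used later in this deck; the install path, input/output file names and core count are placeholders to adapt to your machine.

      rem Shared-memory (SMP) run on a single box, 4 cores (placeholder files):
      "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -b -np 4 -i input.dat -o run_smp.out

      rem Distributed-memory (DMP) run with Platform MPI, 4 cores, single box or cluster:
      "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -b -dis -mpi pcmpi -np 4 -i input.dat -o run_dmp.out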

    13/98

    No limitation in simulation capability

    Reproducible and consistent results

    Support all major platforms

    Distributed ANSYS Design Requirements

    14/98

    Domain decomposition approach

    • Break the problem into N pieces (domains)
    • “Solve” the global problem independently within each domain
    • Communicate information across the boundaries as necessary

    Distributed ANSYS Architecture

    [Diagram: the model split into four domains, one per processor (Processor 1-4).]

    15/98

    Distributed ANSYS Architecture

    [Diagram: process 0 (the host) performs the domain decomposition; each process 0 … n-1 runs element formation, assembly and solve on its own domain (domain 0 … n-1), with interprocess communication between the domains; each process writes its element output, and the results are combined at the end.]

    16/98

    Distributed sparse (default)
    • Supports all analyses supported with DANSYS (linear, nonlinear, static, transient)

    Distributed PCG
    • For static and full transient analyses

    Distributed LANPCG (eigensolver)
    • For modal analyses

    Distributed ANSYS Solvers
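    A minimal APDL sketch of selecting these solvers before SOLVE; the mode count shown is illustrative only:

      ! Distributed sparse direct solver (default)
      EQSLV,SPARSE
      ! Iterative PCG solver (static / full transient analyses)
      EQSLV,PCG
      ! Distributed LANPCG eigensolver for a modal analysis, e.g. 10 modes
      MODOPT,LANPCG,10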

    17/98

    18/98

    Solver Types

    Basic information → Solver Configuration → Information during the solve → Output data

    19/98

    Solution Overview

    Solver Types

    Equation solver dominates solution CPU time! Need to pay attention to the equation solver.

    Equation solver also consumes the most system resources (memory and I/O).

    “Solver”: solve [K]{x} = {b}

    Solution procedures: Prep Data, Element Formation, Global Assembly, equation solve (“Solver”), Element Stress Recovery.
    [Diagram: approximate CPU-time split: the equation solve takes about 70% of the CPU time; the other steps take roughly 5-10% each.]

    20/98

    Solution Overview

    Solver Types

    Solve [K]{x} = {b}

    Solution procedures: Prep Data, Element Formation, Global Assembly, equation solve, Element Stress Recovery.

    21/98

    Solver Architecture

    [Diagram: system resolution data flow: element formation (emat file), symbolic assembly (full file), then either the Sparse solver or the PCG solver working on in-core data objects and the database, followed by element output (esav and rst/rth files).]

    22/98

    SPARSE (Direct)

    Filing…
    • LN09
    • *.BCS: stats from the Sparse solver
    • *.full: assembled stiffness matrix

    Solver Types: SPARSE (Direct)

    23/98

    SPARSE (Direct)

    PROS

    - More robust with poorly conditioned problems (Shell-Beams)

    - Solution always guaranteed
    - Fast for the 2nd solve or higher (multiple load cases)

    CONS

    - Factoring matrix & Solving are resource intensive

    - Large memory requirements

    Solver Types: SPARSE (Direct)

    24/98

    PCG (Iterative)

    - Minimization of residuals / potential energy (standard Conjugate Gradient method): {r} = {f} – [K]{u}

    - Iterative process requiring a convergence test (PCGTOL).

    - Preconditioned CG is used instead to reduce the number of iterations (preconditioner [Q] ≈ [K]^-1, with [Q] cheaper than [K]^-1).

    - Number of iterations

    Solver Types: PCG (Iterative)
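    In compact form, the iteration sketched above can be written as follows; the relative-residual stopping test is the usual way a tolerance like PCGTOL is applied and is shown here as an assumption, since the slide only names the tolerance:

      \{r\} = \{f\} - [K]\{u\}, \qquad [Q]\{z\} = \{r\} \ \text{with} \ [Q] \approx [K], \qquad
      \text{stop when } \frac{\lVert \{r\} \rVert}{\lVert \{f\} \rVert} \le \mathrm{PCGTOL}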

    25/98

    PCG (Iterative)

    PCGTOL may need to be tightened to a lower value (1e-9 or 1e-10) for an ill-conditioned model, to let ANSYS follow the same path (equilibrium iterations) as the direct solver.

    Solver Types: PCG (Iterative)

    PCGTOL
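    In Mechanical APDL the PCG tolerance is the second field of EQSLV, so a hedged example of the tightening described above is:

      ! Tighten PCGTOL from its 1.0E-8 default for an ill-conditioned model
      EQSLV,PCG,1.0E-9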

    26/98

    PCG (Iterative)

    Filing…

    *.PC*

    *.PCS: Iterative solver stats

    Solver Types: PCG (Iterative)

    27/98

    PCG (Iterative)

    PROS

    - Lower memory requirements

    - Better suited for well-conditioned, bigger problems

    CONS

    - Not useful with near-rigid or rigid body behavior

    - Less robust with ill-conditioned models (shells & beams, inadequate boundary conditions (rigid body motions), considerably elongated elements, nearly singular matrices…): it is more difficult to approximate [K]^-1 with [Q]

    Solver Types: PCG (Iterative)

    28/98

    Level Of Difficulty

    Solver Types: PCG (Iterative)

    The LOD number is available in the solver output (solve.out)…

    …but it can also be seen, along with the number of PCG iterations required to reach a converged solution, in the jobname.PCS file.

    29/98

    Other ways to evaluate ill-conditioning

    Solver Types: PCG (Iterative)

    An error message is also an indication.

    Although the message proposes changing a MULT coefficient, the model should be carefully reviewed first, and the SPARSE solver considered for the resolution instead.

    30/98

    Comparative

    Solver Types

    31/98

    Performance Review

    Basic information → Solver Configuration → Information during the solve → Output data

    32/98

    Process Resource Monitoring (only available on Windows 7)

    Performance Review

    Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.

    33/98

    How to access the Process Resource Monitor:

    - From the OS Task Manager (Ctrl + Shift + Esc), or:

    Performance Review

    - Click Start, click in the Start Search box, type resmon.exe, and then press ENTER.

    34/98

    Process Resource Monitoring - CPU

    Performance Review

    Shared Memory (SMP) Distributed Memory (DMP)

    35/98

    Process Resource Monitoring - Memory

    Performance Review

    Before the solve :

    During the solve :

    Information from the solve.out :

    36/98

    Overview

    Basic information → Solver Configuration → Information during the solve → Output data

    37/98

    ANSYS End Statistics

    Performance Review

    Basic information about the analysis solve is directly available at the end of the solver output file (*.out), in Solution Information.

    Total Elapsed Time

    38/98

    Performance Review

    Other main output data to check:

        Output Data                  Description
        Elapsed Time (sec)           Total time of the simulation
        Solver rate (Mflops)         Speed of the solver
        Bandwidth (Gbytes/s)         I/O rate
        Memory Used (Mbytes)         Memory required
        Number of iterations (PCG)   Available for PCG only

    39/98

    Performance Review

    [Annotated solver output excerpts showing Elapsed Time, Solver rate, Bandwidth, Memory Used and Number of iterations, for PCG (*.PCS file) and SPARSE (*.BCS file).]

    40/98

    Memory Settings

    Basic information → Solver Configuration → Information during the solve → Output data

    41/98

    SPARSE: Solver Output Statistics >> Memory Checkup

    Memory Settings

    42/98

    Memory Settings – Test Case 1

    Test Case 1: “Small model” (needs 4 GB scratch memory < RAM)

    Default: BCSOPTION,,OPTIMAL (Elapsed Time = 146 sec)     BCSOPTION,,INCORE (Elapsed Time = 77 sec)

    Machine reference: 6 GB RAM, enough memory, but…

    43/98

    Memory Settings – Test Case 1

    Test Case 1: “Small model” (needs ..GB scratch memory < RAM)

    Default: BCSOPTION,,OPTIMAL     BCSOPTION,,INCORE

    Machine reference: 6 GB RAM

    44/98

    Memory Settings – Test Case 2

    Default: BCSOPTION,,OPTIMAL (Elapsed Time = 1249 sec)     BCSOPTION,,INCORE (Elapsed Time = 4767 sec)

    Test Case 2: “Large model” (needs 21.1 GB scratch memory > RAM)

    Machine reference: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0

    Do not always force in-core when the available memory is not enough!!

    45/98

    Memory Settings – Test Case 2

    Default: BCSOPTION,,OPTIMAL     BCSOPTION,,INCORE

    Test Case 2: “Large model” (needs 21.1 GB scratch memory > RAM)

    Machine reference: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0

    46/98

    SPARSE: Solver Output Statistics >> Memory Checkup

    Memory Settings

    206 MB is available for the Sparse solver at the time of factorization.

    This is sufficient to run in Optimal out-of-core mode (which requires 126 MB) and obtain good performance.

    If more than 1547 MB were available, the solve would run fully in-core: best performance.

    Avoid using Minimum out-of-core, i.e. memory less than 126 MB.

    Memory modes for this run: in-core above 1547 MB; Optimal out-of-core between 126 MB and ~1.5 GB; Minimum out-of-core below 126 MB.

    47/98

    SPARSE: 3 Memory Modes can be Observed

    Memory Settings

    In-core mode (optional)
    • Requires the most memory
    • Performs no I/O
    • Best performance

    Optimal out-of-core mode (default)
    • Balances memory usage and I/O

    Minimum core mode (not recommended)
    • Requires the least memory
    • Performs the most I/O
    • Worst performance
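    A minimal APDL sketch of forcing the sparse-solver memory mode discussed above, for shared-memory runs; for Distributed ANSYS the analogous command is DSPOPTION:

      ! Default: optimal out-of-core
      BCSOPTION,,OPTIMAL
      ! Force in-core only when the factorization fits in physical RAM
      BCSOPTION,,INCORE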

    48/98

    Memory Settings – Test Case 3

    Test Case 3, a trap to avoid: launching a run on a network share (or a slow drive).

    Local solve on a local disk (left, Elapsed Time = 1998 sec) vs. a slow disk, networked or USB (right, Elapsed Time = 3185 sec).

    49/98

    PCG: Solver Statistics – *.PCS File >> Number of Iterations

    Performance Review

    [Annotated *.PCS file excerpt: number of cores used (SMP/DMP); level of difficulty from PCGOPT, Lev_Diff; total number of iterations – an important statistic!]

    50/98

    PCG: Solver Statistics - *.PCS File>> Number of Iterations

    Performance Review

    Check the total number of PCG iterations:

    • Less than 1000 iterations: good performance
    • Greater than 1000 iterations: performance is deteriorating. Try increasing Lev_Diff on PCGOPT.
    • Greater than 3000 iterations: assuming you have tried increasing Lev_Diff, either abandon PCG and use the Sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions.
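    A hedged example of the Lev_Diff adjustment suggested above; the value 2 is illustrative (the deck's Excel tool later uses levels 1-4):

      ! Raise the preconditioner level of difficulty to cut the iteration count
      PCGOPT,2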

    51/98

    PCG: Solver Statistics - *.PCS File>> Number of Iterations

    Performance Review

    If there are too many iterations:

    • Use parallel processing
    • Use PCGOPT,Lev_Diff

    • Refine your mesh

    • Check for excessively high stiffness

    52/98

    53/98

    GPU Technology

    Basic information (the machine now also includes a GPU) → Solver Configuration → Information during the solve → Output data

    54/98

    55/98

    CPUs and GPUs used in a collaborative fashion

    GPU Technology – Introduction

    CPU (multi-core processors):
    • Typically 4-12 cores
    • Powerful, general purpose

    GPU (many-core processors):
    • Typically hundreds of cores
    • Great for highly parallel code

    Connected via a PCI Express channel.

    56/98

    GPU Accelerator capability

    Motivation
    • Equation solver dominates solution time
      – Logical place to add GPU acceleration

    “Solver”: equation solver (e.g., [A]{x} = {b})

    Solution procedures: Element Formation, Global Assembly, equation solve, Element Stress Recovery.
    [Diagram: approximate share of solution time: the equation solve takes roughly 60%-90%; the remaining steps take roughly 1%-30% each.]

    57/98

    “Accelerate” sparse direct solver (Boeing/DSP)
    • GPU is only used to factor a dense frontal matrix
    • The decision on whether to send data to the GPU is made based on the frontal matrix size:
      – Too small: too much overhead, stays on the CPU
      – Too large: exceeds GPU memory, stays on the CPU

    GPU Accelerator capability
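    To try the GPU accelerator on such a run, a minimal sketch of the launch line, reusing the -acc nvidia option listed later in the appendix (path, file names and core count are placeholders):

      "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -b -dis -mpi pcmpi -np 2 -acc nvidia -i input.dat -o run_gpu.out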

    58/98

    GPU Accelerator capability

    Supported hardware
    • Currently recommending NVIDIA Tesla 20-series cards
    • Recently added support for Quadro 6000
    • Requires the following items:
      – Larger power supply (1 card needs about 225 W)
      – Open 2x form factor PCIe x16 Gen2 slot
    • Supported on Windows/Linux 64-bit

                             NVIDIA Tesla C2050   NVIDIA Tesla C2070   NVIDIA Quadro 6000
        Power                225 Watts            225 Watts            225 Watts
        CUDA cores           448                  448                  448
        Memory               3 GB                 6 GB                 6 GB
        Memory Bandwidth     144 GB/s             144 GB/s             144 GB/s
        Peak Speed (SP/DP)   1030/515 Gflops      1030/515 Gflops      1030/515 Gflops

    59/98

    60/98

    Distributed ANSYS – GPU Speedup @ 14.0

    Vibroacoustic harmonic analysis of an audio speaker
    • Direct sparse solver
    • Quarter-symmetry model with 700K DOF:
      – 657,424 nodes
      – 465,798 elements
      – higher-order acoustic fluid elements (FLUID220/221)

    Distributed ANSYS results (baseline is 1 core):
    • With GPU, ~11x speedup on 2 cores!
    • 15-25% faster than SMP with the same number of cores

        Cores   GPU   Speedup
        2       no     2.25
        4       no     4.29
        2       yes   11.36
        4       yes   11.51

    [Bar chart: speedup on 2 and 4 cores for SMP, DANSYS, SMP+GPU and DANSYS+GPU.]

    Windows workstation: two Intel Xeon 5530 processors (2.4 GHz, 8 cores total), 48 GB RAM, NVIDIA Quadro 6000

    61/98

    ANSYS Mechanical 14.0 Performance for Tesla C2075

    [Bar chart: ANSYS Mechanical solution times in seconds (lower is better) on 1, 2, 4, 6, 8 and 12 cores of a dual-socket Xeon 5670 2.93 GHz Westmere, with and without a Tesla C2075; the GPU gives speedups of roughly 1.9x to 4.2x.]

    Add a Tesla C2075 to use with 6 cores: now 46% faster than 12 cores, with 6 cores available for other tasks.

    V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 FEs, static nonlinear, one iteration, direct sparse solver.

    Results from HP Z800 Workstation, 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17

    62/98

    GPU Accelerator capability

    V13sp-5 benchmark (turbine model)

    [Plot: sparse solver factorization speed (Mflops) versus front size (MB).]

    63/98

    ANSYS Mechanical – Multi-Node GPU

    Solder Joint Benchmark (4 MDOF, creep strain analysis): mold, PCB, solder balls.
    Results courtesy of MicroConsult Engineering, GmbH.

    [Bar chart: total speedup of R14 Distributed ANSYS with and without GPU at 16, 32 and 64 cores; the GPU-accelerated runs reach up to roughly 4.4x.]

    Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, an NVIDIA Tesla M2070, and InfiniBand.

    64/98

    Comparative Trends

    Trends in performance by solver type

    Three areas can be defined:
    I. SPARSE is more efficient
    II. Either SPARSE or PCG can be used
    III. The PCG solver works faster since it needs fewer I/O exchanges with the hard disk

    With multiple cores & GPUs all trends can change due to differences in speedup.

    Need to evaluate Sparse & PCG behavior & speedup on your own model!

    [Plot: elapsed time versus number of DOF for PCG and Sparse (with and without GPU), marking regions I, II and III.]

    65/98

    Other Software Considerations

    Tips and tricks on performance gains
    • Some considerations on scalability of DANSYS
    • Working with solution differences
    • Working with a case that does not (or hardly) scale
    • Working with programmable features for parallel runs

    66/98

    Scalability Considerations

    Load balance
    Improvements on domain decomposition
    Amdahl’s Law (see the formula after this list)

    • Algorithmic enhancements: every part of the code has to run in parallel

    User-controllable items:
    • Contact pair definitions: big contact pairs hurt load balance (one contact pair is put into one domain in our code)
    • CE definitions: many CE terms hurt load balance and Amdahl’s Law (CEs need communication among the domains in which the CE terms are defined)
    • Use the best and most suitable hardware possible (speed of the CPU, memory, I/O and interconnects)
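    For reference, Amdahl’s Law mentioned above: with a parallelizable fraction p of the work and N cores,

      S(N) = \frac{1}{(1-p) + p/N}

    so even p = 0.9 limits the speedup to about 4.7x on 8 cores (1 / (0.1 + 0.9/8) ≈ 4.7), which is why the serial items listed above matter.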

    67/98

    • Avoid defining the whole exterior surface as one target piece
    • Break pairs into smaller pieces if possible
    • Remember: one whole contact pair is processed on one processor (contact work cannot be spread out)

    Define a half circle as the target, don’t define the full circle.
    Avoid overlapping contact surfaces if possible.
    Split the potential contact surface into smaller pieces.

    Scalability Considerations: Contact

    68/98

    Scalability Considerations: Contact

    • Avoid defining “un-used” surfaces as contact or target, i.e. reduce the potential contact definition to a minimum.

    • In rev. 12.0: use the new control “CNCHECK,TRIM” (see the sketch after this list).

    • In rev. 11.0: turn NLGEOM,OFF when defining contact pairs in WB; WB then automatically turns on a facility like “CNCHECK,TRIM” internally.
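    A minimal APDL sketch of the rev. 12.0 control mentioned above, issued in the solution phase before solving:

      /SOLU
      CNCHECK,TRIM    ! trim contact/target elements far from the initial contact zone
      SOLVE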

    69/98

    Point load distribution (remote load)

    A point moment is distributed to the internal surface of the hole (deformed shape shown).

    All nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.

    Scalability Considerations: Remote Load/Disp

    70/98

    Example of bonded contact and remote loads: universal joint model

    • 14 bonded contact pairs
    • Torque defined by RBE3 on the end surface only – good practice
    • Internal CEs generated by the bonded contact

    This model has small pieces of contact and RBE3, so it scales well in DANSYS.

    71/98

    Working With Solution Differences in Parallel Runs

    Most solution differences come from contact applications when comparing NP = 1 versus NP = 2, 3, 4, 5, 6, 7, …

    • Check the contact pairs to make sure there is no case of bifurcation, and also plot the deformations to see the case.

    • Tighten the CNVTOL convergence tolerance to check solution accuracy. If the solutions differ by less than, say, 1%, parallel computing can make some difference in convergence and all solutions are acceptable.

    • If the solution is well defined and all input settings are correct, report the case to ANSYS Inc. for investigation.

    72/98

    Working With a Case of Poor Scalability

    No scalability (speedup) at all (or even slower than NP = 1)?
    • Is the problem too small (normally the DOF count should be greater than 50K)?
    • Do I have a slow disk, or is the problem so big that the I/O size exceeds the memory I/O buffer?
    • Is every node of my machines connected to the public network?
    • Look at the scalability of the solver first, not the entire run (e.g. /PREP7 time is mostly not scalable)
    • Resume the data at the /SOLU level and don’t read in the input files on every run
    • etc.

    73/98

    Working With a Case of Poor Scalability

    Yes, I have scalability, but it is poor (say, speedup < 2X)?
    • Is this GigE or another slow interconnect?
    • Are all processors sharing one disk (NFS mount)?
    • Are other people running jobs on the same machine at the same time?
    • Do I have many big contact pairs, or a remote load or displacement tied to a major portion of the model?
    • Am I using a generation of dual/quad cores where the memory bandwidth is entirely shared within a socket?
    • Look at the scalability of the solver first, not the entire run (e.g. /PREP7 time is mostly not scalable)
    • Resume the data at the /SOLU level and don’t read in the input files on every run (see the sketch after this list)
    • etc.
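    A minimal sketch of the “resume at /SOLU” advice in both lists above, assuming the model was previously saved to myjob.db (the jobname is a placeholder):

      RESUME,myjob,db    ! reload the prepared database instead of re-reading the input
      /SOLU
      SOLVE
      FINISH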

    74/98

    APPENDIX

    75/98

    Platform MPI Installation for ANSYS 14

    Note for ANSYS Mechanical R13 users:

    - Do not uninstall HP-MPI; it is required for compatibility with R13.

    - Verify that HP-MPI is installed in its default location, “C:\Program Files (x86)\Hewlett-Packard\HP-MPI”; this is required for ANSYS Mechanical R13 to execute properly.

    76/98

    Platform MPI Installation for ANSYS 14

    - Run “setup.exe” of the ANSYS R14 installation as Administrator.

    - Install Platform MPI.

    - Follow the Platform MPI installation instructions.

    77/98

    Platform MPI Installation for ANSYS 14

    78/98

    Platform MPI Installation for ANSYS 14

    To finish the installation:

    - Go to %AWP_ROOT140%\commonfiles\MPI\Platform\8.1\Windows\setpcmpipassword.bat
      (by default: “C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat”)

    - Run "setpcmpipassword.bat", type your Windows user password and press Enter.

    79/98

    Test MPI Installation for ANSYS 14

    The installation is now finished. How can you verify that it works properly?

    - Edit the file "test_mpi14.bat" attached in the .zip

    - Change the ANSYS path and the number of processors if necessary (-np x)

    - Save and run the file "test_mpi14.bat"

    - The expected result is shown below:

    "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -mpitest -mpi pcmpi -np 2

    80/98

    Test Case – Batch launch (Solver Sparse)

    - The file "cube_sparse_hpc.txt" is an input file for a simple analysis

    (pressure on a cube).- Edit the file "job_sparse.bat" and change the Ansys path and/or the

    number of processors is necessary.

    - Possibility to change the number of mesh division of the cube to try outthe performance of your machine. (-ndiv xx)

    - Save and run the file "job_sparse.bat".

    Information about the file "job_sparse.bat":

        -b : batch                  -np : number of processors
        -j : jobname                -ndiv : number of divisions (for this example only)
        -i : input file             -acc nvidia : use GPU acceleration
        -o : output file            -mpi pcmpi : Platform MPI
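    Putting the options above together, a hedged sketch of the launch line inside "job_sparse.bat" (install path, core count and division count are placeholders; -ndiv is specific to this example's input file, as noted above):

      "c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -b -dis -mpi pcmpi -np 4 -j cube_sparse -i cube_sparse_hpc.txt -o cube_sparse.out -ndiv 50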

    81/98

    Test Case – Batch launch (Solver Sparse)

    - You can check that your processors are running with the Windows Task Manager (Ctrl+Shift+Esc). Example with 6 processes requested:

    Advice: do not request all the available processors if you want to do something else while the job is running.

    82/98

    Test Case – Batch launch (Solver Sparse)

    Once the run is finished:

    - Read the .out file to collect all the information about the solver output.

    - The main pieces of information are:

      - Elapsed Time (sec)
      - Latency time from master to core
      - Communication speed from master to core
      - Equation solver computational rate
      - Equation solver effective I/O rate

    83/98

    Test Case – Workbench launch

    - Open a Workbench project with ANSYS R14

    - Open Mechanical

    - Go to: Tools -> Solve Process Settings… -> Advanced…

    - Check "Distributed Solution", specify the number of processors to use, and enter the additional command (-mpi pcmpi) as shown below; the GPU can also be enabled here.

    84/98

    Test Case – Workbench launch

    In the Analysis Settings:

    - You can choose the Solver Type (Direct = Sparse, Iterative = PCG)

    - Solve your model

    - Read the solver output from the Solution Information

    85/98

    Automated run for a model

    Compare customer results with ANSYS reference

    First step for an HPC test on customer machine

    Appendix

    86/98

    General view

    INPUT DATA OUTPUT DATA

    The goal of this Excel file is twofold:
    • On the one hand, it writes the batch launch commands for multiple analyses into a file (job.bat).
    • On the other hand, it extracts information from the different solve.out files and writes it into Excel.

    87/98

    INPUT DATA

    [Screenshot of the INPUT DATA sheet, with numbered callouts 1, 2 and 3 detailed on the following slides.]

    88/98

    INPUT DATA

    89/98

    INPUT DATA

                            Description                            Choice
        Machine             Number of machines used                1, 2 or 3
        Solver              Type of solver used                    sparse or pcg
        Division            Division of the edge for meshing       Any integer
        Release             Select ANSYS release                   140 or 145
        GPU                 Use GPU acceleration                   yes or no
        np total            Total number of cores                  No choice (value calculated)
        np / machine        Number of cores per machine            Any integer
        PCG level           Only available for the PCG solver      1, 2, 3 or 4
        Simulation method   Shared Memory or Distributed Memory    SMP or DMP

    2

    90/98

    INPUT DATA

    3

    Create a job.bat file with all the input data given in the Excel

    91/98

    OUTPUT DATA

    [Screenshot of the OUTPUT DATA sheet, with numbered callouts 1, 2 and 3 detailed on the following slides.]

    92/98

    OUTPUT DATA

    1

    Read the information from all the *.out files.
    NB: all the files must be in the same directory.

    If a *.out file is not found, a pop-up will appear:

    Continue: skip this file and go to the next one
    STOP: stop reading the remaining *.out files

    93/98

    OUTPUT DATA

    2

        Output Data                  Description
        Elapsed Time (sec)           Total time of the simulation
        Solver rate (Mflops)         Speed of the solver
        Bandwidth (Gbytes/s)         I/O rate
        Memory Used (Mbytes)         Memory required
        Number of iterations (PCG)   Available for PCG only

    94/98

    OUTPUT DATA

    All this information is extracted from the *.out files:

    Elapsed Time, Solver rate, Bandwidth, Memory Used and Number of iterations, for both PCG and SPARSE.

    95/98

    OUTPUT DATA

    3

    Hyperlinks are automatically created to open the different *.out files directly from Excel.

    NB: if an error occurred during the solve (*** ERROR ***), it will be automatically highlighted in the Excel file.

    96/98

    And now: we are waiting for your feedback, based on your results.

    97/98

    Any suggestion or question for improving the Excel tool:

    [email protected]

    98/98

    THANK YOU