abagpucom2012.pdf

GPU Computing with SIMULIA Abaqus 6.12

GPUs Accelerate Computing

Abaqus Uses Computing Power of GPUs for Faster Simulation

[Figure: GPU and CPU = Speed Up]

Increasing Performance & Memory Bandwidth

[Figure: Peak Memory Bandwidth, GBytes/s (axis 0 to 300), 2007 to 2012. NVIDIA GPU (ECC off): M1060, Fermi M2070, Fermi+ M2090, Kepler. x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]

[Figure: Peak Double Precision FP, Gflops/s (axis 0 to 1400), 2007 to 2012. Double Precision: NVIDIA GPU: M1060, Fermi M2070, Fermi+ M2090, Kepler. Double Precision: x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]

Abaqus/Standard GPU Computing

SIMULIA announced support for CUDA and GPUs in Abaqus 6.11

NVIDIA GPUs include Tesla 20-series and Quadro 6000

The direct equation solver is accelerated on the GPU

More of Abaqus will be moved to GPUs in progressive stages

Abaqus 6.12 supports multi-GPU, multi-node DMP clusters

Linux and Windows OS

Flexibility to run jobs on specific GPUs

Basic system recommendations

Large system memory (48GB) to avoid scratch I/O of system matrix

Ratio of 1 GPU attached to 1 CPU socket (4-8 cores)

From “Accelerating Commercial Linear Dynamic and Nonlinear Implicit FEA Software Through High Performance Computing”, NAFEMS World Congress, May 2011, Boston, MA, USA, by Vladimir Belsky, Director of Solver Development, SIMULIA

Abaqus 6.11 Single GPU Results

Model    DOF     FLOPs      Solver speed-up     Overall     Contact
                            over 4 core runs    speed-up
1: s4a   0.72M   4.34E+11   1.83                1.14        yes
2        1.07M   9.90E+11   1.83                1.12
3        1.47M   2.18E+12   2.15                1.52        yes
4        1.09M   4.37E+12   2.2                 1.49        yes
5        1.47M   5.76E+12   2.75                2.19
6: s4b   3.17M   1.03E+13   2.9                 2.01        yes
7        2.61M   1.68E+13   3.38                2           yes
8        1.46M   1.70E+13   3.69                2.29
9        4.48M   2.63E+13   3.36                2.66        yes
10       3.37M   1.08E+14   1.96                1.79
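The gap between the solver speed-up and the overall speed-up in these results can be read through Amdahl's law. The sketch below is an illustration, not something taken from the slides: it assumes that only the direct solver is accelerated and that the rest of the run takes the same time with and without the GPU. Under that assumption it estimates what fraction of the 4-core run time was spent in the solver. The function names are hypothetical.

```python
def solver_fraction(solver_speedup: float, overall_speedup: float) -> float:
    """Fraction of baseline run time spent in the solver, inverted from Amdahl's law.

    Assumes the non-solver part of the run is not accelerated at all.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / solver_speedup)


def overall_speedup(fraction: float, solver_speedup: float) -> float:
    """Amdahl's law: overall speed-up when only `fraction` of the run is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / solver_speedup)


# (model, solver speed-up, overall speed-up) copied from the results above
rows = [("1: s4a", 1.83, 1.14), ("6: s4b", 2.9, 2.01), ("9", 3.36, 2.66)]

for name, s, o in rows:
    f = solver_fraction(s, o)
    print(f"model {name}: roughly {f:.0%} of the 4-core run time is solver work")
```

Read this way, models with a low overall speed-up are those where the solver accounts for a small share of the run, which fits the "enough work" guidance for choosing models to accelerate.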

BAKER HUGHES Production Use of GPUs for Abaqus 6.11

“We are in full production with NVIDIA GPU’s for our modeling and simulation activities. We are very happy that SIMULIA started support of GPU computing in Abaqus since May 2011 and we are fully exploiting it with our production models”

Dr. Shyu, Engineering Manager, Baker Hughes R&T

Oilfield Service Company

Plastic deformation, Large Strain

Material & geometric nonlinear

Solid FEs, 3.58M DOF

Plastic deformation, Large Strain & Contact

Material, geometric & boundary nonlinear

Solid FEs, 1.9M DOF

[Figure: O&G Models, 4-core of Westmere CPU (3.33GHz) + Tesla C2070. Bars With GPU and No GPU for Shaft1, Shaft3, LockCone, SideMandrel; vertical axis 0 to 160000.]

SCANIA Production Use of GPUs for Abaqus 6.11

Abaqus Model of Axle Gear

- Axle Gear

- 2,269 K DOF

- Solid FEs

- Static solution

- 37 GB memory

- Intel 5570 CPU

Abaqus Model of Differential

- Differential

- 1,489 K DOF

- Solid FEs

- Static solution

- 17 GB memory

- Intel 5570 CPU

[Figure: speed-up labels ~5x, ~3x, ~2x for each of the two models]

Global Manufacturer of Heavy Vehicles & Engines

NVIDIA MAXIMUS

[Figure: GRAPHICS and COMPUTE panels: Abaqus/Standard 6.11, SolidWorks 2012]

SIMULIA Validated NVIDIA Maximus workstation

A unified and optimized platform for GPU computing & visualization with NVIDIA Quadro & Tesla GPUs to dramatically accelerate Abaqus workflows

• CAD & CAE: CATIA or SolidWorks & Abaqus

• CAE Workflow: Abaqus/CAE -> Abaqus/Standard -> Abaqus/Viewer

Motorcycle Model - Spin, Rotate, Background Image Change, Section Cuts with SolidWorks 2012

Abaqus s4b Engine model on 4 Westmere cores + Tesla C2075 (with and w/o SolidWorks on Quadro 6000): 924 versus 921 secs

Abaqus 6.12 Multi-GPU, DMP clusters

SIMULIA Community News, June 2012 issue

http://www.3ds.com/company/3ds-magazines/simulia-scn/ (pg. 15)

• 4 core + 2 GPU

• Higher throughput

• 0.5x time of single 8c job

• Job completed in 8 to 10 hours

• No need to lock the system for 24 hrs

• Start the job at 6pm; ready for post-processing next morning at 8am

Multi-GPU Maximus workstation w/ Abaqus 6.12 (USA customer model)

Hardware: 2x Xeon X5670, 2.93 GHz CPUs, ~80GB of available memory, 2x Tesla C2070, Linux RHEL5.4

Lower is Better

SIMULIA Abaqus Model:
• 7.33M DOF (equations); ~52 TFLOPs
• Solid FEs
• Nonlinear Static (3 Steps)
• Direct Sparse solver
• ~76GB memory to minimize IO

                 Today      Path Forward      Path Forward
                            (High Perf)       (Low Cost)
Configuration    8 core     8 core + 2 GPU    4 core + 2 GPU
Perf             1x         2.52x             1.9x
$                1x         1.15x             0.97x
Perf/$           1x         2.2x              1.96x

$ = workstation + GPUs + Abaqus tokens
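The Perf/$ row follows from dividing the relative performance by the relative cost. A minimal sketch of that arithmetic, using only the Perf and $ figures quoted on the slide (the dictionary layout and function name are illustrative):

```python
# Relative performance and relative cost versus the 8-core baseline,
# copied from the Perf and $ rows of the slide.
configs = {
    "8 core": (1.0, 1.0),
    "8 core + 2 GPU": (2.52, 1.15),
    "4 core + 2 GPU": (1.9, 0.97),
}


def perf_per_dollar(perf: float, cost: float) -> float:
    """Relative performance per relative dollar."""
    return perf / cost


ratios = {name: perf_per_dollar(p, c) for name, (p, c) in configs.items()}

for name, ratio in ratios.items():
    print(f"{name}: Perf/$ = {ratio:.2f}x")
```

Rounded, the computed ratios agree with the 2.2x and 1.96x shown on the slide.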

[Figure: Abaqus 6.12-PR3 Elapsed time in Hours (axis 0 to 20) for 8c Baseline, 4c2g, 8c1g, 8c2g, 12c, 12c1g, 12c2g.]

Multi-GPU DMP Cluster Scaling w/ Abaqus 6.12 (Rolls-Royce plc)

Hardware: Each node with (2x Xeon X5670, 2.93 GHz CPUs, 48-96GB memory, 2x Tesla M2090), Linux RHEL6.2, QDR IB

Abaqus Model:
• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear Static (6 Steps)
• Direct Sparse solver
• ~100GB memory to minimize IO (1 node)

[Figure: Elapsed Time (secs), axis 0 to 40000; lower is better. Configurations: 6c, 12c, 6c+2g, 24c, 24c+4g, 36c, 36c+6g, 48c, 48c+8g, grouped as single node, 2 nodes, 3 nodes, 4 nodes. Speed-up labels on the slide, in order: 1X, 1.7X, 3.1X, 3.2X, 7.6X, 4.6X, 9.4X, 6X, 11.2X. Token annotations: 12 tokens, 19 tokens, 14 tokens.]

Abaqus 6.12-2 Scaling in a Node (Rolls-Royce plc)

Westmere node with (2x Xeon X5670, 2.93 GHz CPUs, 96GB memory), 2x Tesla M2090, Linux RHEL6.2
Sandy Bridge node with (2x E5-2670, 2.6GHz CPUs, 128GB memory), 2x Tesla K20X, Linux RHEL 6.2

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear Static (6 Steps)
• Direct Sparse solver
• ~100GB memory to minimize IO (1 node)

[Figure: Sandy Bridge + Tesla K20X. Elapsed Time in seconds (axis 0 to 20000) and Speed up relative to 8 core (1x) (axis 1 to 3.5) for 8c, 8c + 1g, 8c + 2g, 16c, 16c + 2g.]

[Figure: Gen N-1 versus Gen N. Elapsed Time in seconds (axis 0 to 40000) for 6c and 6c + 2g on Westmere + Tesla M2090 and on Sandy Bridge + Tesla K20X.]

Speed-up labels on the slide: 3.09X, 3.77X, 1.87X, 2.42X, 2.11X

Abaqus 6.12-2 Scaling across multiple nodes (Rolls-Royce plc)

Each node with (2x Sandy Bridge E5-2667 (6-core), 2.9GHz CPUs, 64GB memory), 2x Tesla K20, Linux RHEL 6.2

• 4.71M DOF (equations); ~77 TFLOPs
• Nonlinear Static (6 Steps)
• Direct Sparse solver
• ~100GB memory to minimize IO (1 node)

[Figure: Sandy Bridge + Tesla K20. Elapsed Time in seconds (axis 0 to 9000) for 24c, 24c+4g, 36c, 36c+6g, 48c, 48c+8g, grouped as 2 nodes, 3 nodes, 4 nodes. Speed-up labels on the slide: 2.04X, 1.8X, 2.17X, 1.92X, 1.81X.]

Abaqus 6.12-2 Scaling in a node (Americas Small Engines manufacturer)

Westmere node with (2x Xeon X5670 (6-core), 2.93 GHz CPUs, 96GB memory), 2x Tesla M2090, Linux RHEL6.2
Sandy Bridge node with (2x E5-2670 (8-core), 2.6 GHz CPUs, 128GB memory), 2x Tesla K20X, Linux RHEL 6.2

Large Model: 52 TFLOPs
• 7.33M DoF (equations)
• Nonlinear Static (3 Steps)
• Direct Sparse solver
• ~76GB of memory to reduce IO (1 node)

[Figure: Sandy Bridge CPU + Tesla K20X GPU. Elapsed Time in seconds (axis 0 to 50000) and Speedup relative to 8 cores (1X) (axis 1 to 3) for 8c, 8c+1g, 8c+2g, 16c, 16c+1g, 16c+2g.]

[Figure: Westmere CPU + Tesla M2090 GPU. Elapsed Time in seconds (axis 0 to 100000) and Speedup relative to 6 cores (1X) (axis 1 to 4) for 6c, 6c+1g, 6c+2g, 12c, 12c+1g, 12c+2g.]

Note the 2:1 reduction in Elapsed Time (Y-axis) from Gen N-1 to Gen N

Recommendations for GPU acceleration

Key factors for model selection for GPU acceleration:

Enough work

FLOPs, Solid models

In-core solution

Sufficient system (host) memory

Large fronts fit in the GPU memory

Super-nodes fit in device memory (6 GB)

Size limits will be largely eliminated in Abaqus 6.13

Uses direct sparse solver

Unsymmetric solve in development for Abaqus 6.13
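To get a feel for the "fit in device memory (6 GB)" criterion, the sketch below estimates the largest dense front that could be held on the GPU. It is a rough illustration under stated assumptions, not a description of how Abaqus stores fronts: it assumes a square dense block stored in full in double precision (8 bytes per entry), treats 6 GB as 6 x 1024^3 bytes, and ignores any workspace the solver needs. The function names are hypothetical.

```python
import math

DEVICE_BYTES = 6 * 1024**3  # assumption: "6 GB" of device memory read as 6 GiB


def front_bytes(order: int, bytes_per_entry: int = 8) -> int:
    """Storage for a dense order x order block kept in full (assumed layout)."""
    return order * order * bytes_per_entry


def max_front_order(device_bytes: int, bytes_per_entry: int = 8) -> int:
    """Largest order whose full dense block fits in the given number of bytes."""
    return math.isqrt(device_bytes // bytes_per_entry)


limit = max_front_order(DEVICE_BYTES)
print(f"largest dense front under these assumptions: order {limit}")
print(f"storage at that order: {front_bytes(limit) / 1024**3:.2f} GiB")
```

Because storage grows with the square of the front order, doubling the device memory raises the admissible order by only about 41 percent, which is one reason large fronts are the limiting factor.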

Recommended Configurations

Workstation user

Westmere or Sandy Bridge CPU

48-64 GB RAM (or more)*

Maximus Q6000/4000 GPU & Tesla C2075

Tesla K20c GPU

* Memory requirements dictated by problem size to minimize disk I/O.

Cluster user

Each node of IB cluster:

2x Sandy Bridge CPU

96-128 GB RAM

2x Tesla K20/K20X GPU

(or) 2x Tesla M2090 GPU

How GPU Licensing Works

Abaqus Portfolio License

Token Scheme

5 tokens:

Unlocks a single CPU core

1 additional token:

Unlocks 1 additional CPU core, or

Unlocks 1 entire GPU (same token scheme for CPU core & GPU)

Abaqus 6.11 & 6.12 Licensees: Contact SIMULIA to enable GPUs
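The token scheme can be sketched in code. The slide states only the small-job rule (5 tokens for the first core, 1 more per additional core or GPU). The formula below, int(5 x N^0.422) with N the number of cores plus GPUs, is the commonly cited Abaqus token formula; it does not appear in the slides and is an assumption here, so check it against the actual license documentation. The function names are hypothetical.

```python
def abaqus_tokens(units: int) -> int:
    """Tokens for `units` cores-plus-GPUs under the assumed formula int(5 * N**0.422)."""
    if units < 1:
        raise ValueError("at least one CPU core is required")
    return int(5 * units ** 0.422)


def tokens_for_job(cores: int, gpus: int = 0) -> int:
    """Each GPU is counted like one extra CPU core, as the slide describes."""
    return abaqus_tokens(cores + gpus)


for cores, gpus in [(1, 0), (4, 2), (8, 0), (8, 2), (12, 0), (24, 0)]:
    print(f"{cores} cores + {gpus} GPUs: {tokens_for_job(cores, gpus)} tokens")
```

For up to eight units the assumed formula reproduces the one-token-per-unit rule on the slide; for larger jobs each extra unit costs less than one token on average.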

SIMULIA V6R2013 – ExSight

ExSight relies on Abaqus for solver engine & exploits GPU Computing in Abaqus

Tools & Resources

GPU Genius

GPU Computing with Abaqus 6.12

GPU Computing with Abaqus 6.11

More info at:

http://www.nvidia.com/abaqus

• Multi-GPU Support in Abaqus 6.12 – SIMULIA Community News, June 2012.

• Acceleration of SIMULIA’s Abaqus Solver on NVIDIA GPUs – Acceleware Inc.

http://www.brainshark.com/nvidia/vu?pi=107928904&sid=230327724&sky=b73b49d20a1644b6a69ede28441faf4a&uid=1286900

http://www.brainshark.com/nvidia/accelerate-abaqus-with-gpus?n=0

The 2012 HPCwire Readers’ Choice Awards

http://www.hpcwire.com/hpcwire/2012-11-12/hpcwire_announces_2012_readers_choice_awards_winners.html

Best use of HPC application in manufacturing

Readers’ Choice: NVIDIA Quadro and Tesla GPUs with Dassault SIMULIA Abaqus Finite Element Analysis

Editor’s Choice: Airbus using HPC-as-a-service provided by Hewlett Packard

Best use of HPC in automotive

Readers’ Choice: NVIDIA Tesla GPUs and Dassault SIMULIA Abaqus Finite Element Analysis solution

Editor’s Choice: Caterham F1 Racing using Dell systems for Formula 1 racing design






Thank You Contact: Srinivas Kodiyalam, [email protected]
