GPU Computing with SIMULIA Abaqus 6.12
GPUs Accelerate Computing
Abaqus Uses Computing Power of GPUs for Faster Simulation
[Figure: GPU CPU = Speed Up]
Increasing Performance & Memory Bandwidth
[Figure: Peak Memory Bandwidth (GBytes/s, axis 0 to 300) by year, 2007 to 2012. NVIDIA GPU (ECC off): M1060, Fermi M2070, Fermi+ M2090, Kepler. x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]
[Figure: Peak Double Precision FP (Gflops/s, axis 0 to 1400) by year, 2007 to 2012. Double Precision, NVIDIA GPU: M1060, Fermi M2070, Fermi+ M2090, Kepler. Double Precision, x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]
Abaqus/Standard GPU Computing
SIMULIA announced support for CUDA and GPUs in Abaqus 6.11
NVIDIA GPUs include Tesla 20-series and Quadro 6000
The direct equation solver is accelerated on the GPU
More of Abaqus will be moved to GPUs in progressive stages
Abaqus 6.12 supports multi-GPU, multi-node DMP clusters
Linux and Windows OS
Flexibility to run jobs on specific GPUs
Basic system recommendations
Large system memory (48GB) to avoid scratch I/O of system matrix
Ratio of 1 GPU attached to 1 CPU socket (4-8 cores)
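The pairing of one GPU with a 4-8 core share of a socket, and the option to run a job on specific GPUs, can be sketched as a small launch helper. This is a minimal sketch, not a procedure taken from the slides: the helper name `build_abaqus_launch` and the job name are hypothetical, the `cpus=` and `gpus=` arguments are assumed to follow the Abaqus 6.12 command-line convention, and restricting the job to chosen devices through the CUDA environment variable `CUDA_VISIBLE_DEVICES` is an assumption about how the pinning would be done.

```python
import os


def build_abaqus_launch(job, cpu_cores, gpu_ids):
    """Return (argv, env) for a hypothetical Abaqus/Standard GPU run.

    Nothing is executed here; the caller decides whether to hand the
    result to subprocess.run().
    """
    if cpu_cores < 1:
        raise ValueError("at least one CPU core is required")
    argv = ["abaqus", f"job={job}", f"cpus={cpu_cores}"]
    env = dict(os.environ)
    if gpu_ids:
        # Assumption: gpus=N requests N GPUs for the direct solver.
        argv.append(f"gpus={len(gpu_ids)}")
        # Assumption: CUDA_VISIBLE_DEVICES limits the job to these devices.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return argv, env


# One GPU paired with a 4-core share of one socket, pinned to device 1.
argv, env = build_abaqus_launch("axle_gear", cpu_cores=4, gpu_ids=[1])
print(" ".join(argv))
```

Building the argument list and environment separately from execution keeps the sketch testable on a machine without Abaqus installed.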
From “Accelerating Commercial Linear Dynamic and Nonlinear Implicit FEA Software Through High Performance Computing”, NAFEMS World Congress, May 2011, Boston, MA, USA, by Vladimir Belsky, Director of Solver Development, SIMULIA
Abaqus 6.11 Single GPU Results

Model    DOF     FLOPs      Solver speed-up    Overall     Contact
                            over 4 core runs   speed-up
1: s4a   0.72M   4.34E+11   1.83               1.14        yes
2        1.07M   9.90E+11   1.83               1.12
3        1.47M   2.18E+12   2.15               1.52        yes
4        1.09M   4.37E+12   2.2                1.49        yes
5        1.47M   5.76E+12   2.75               2.19
6: s4b   3.17M   1.03E+13   2.9                2.01        yes
7        2.61M   1.68E+13   3.38               2           yes
8        1.46M   1.70E+13   3.69               2.29
9        4.48M   2.63E+13   3.36               2.66        yes
10       3.37M   1.08E+14   1.96               1.79
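The gap between solver speed-up and overall speed-up in these results can be read through Amdahl's law. The sketch below assumes that only the direct solver is accelerated and that the rest of the 4-core runtime is unchanged by the GPU, which the slides do not state explicitly. Under that assumption, 1/overall = (1 - f) + f/solver, where f is the share of the 4-core runtime spent in the solver. The function name `solver_fraction` is illustrative.

```python
def solver_fraction(solver_speedup, overall_speedup):
    """Share f of the 4-core runtime spent in the solver.

    Derived from Amdahl's law, 1/overall = (1 - f) + f/solver,
    assuming everything outside the solver runs at unchanged speed.
    """
    if solver_speedup <= 1.0:
        raise ValueError("solver speed-up must exceed 1")
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / solver_speedup)


# (model, solver speed-up, overall speed-up) for models 1, 5 and 9.
for model, s, o in [(1, 1.83, 1.14), (5, 2.75, 2.19), (9, 3.36, 2.66)]:
    print(f"model {model}: solver share ~ {solver_fraction(s, o):.2f}")
```

Read this way, models whose overall speed-up stays close to 1 are those where the solver is a small part of the run, which matches the later recommendation to pick models with enough solver work.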
BAKER HUGHES Production Use of GPUs for Abaqus 6.11
“We are in full production with NVIDIA GPU’s for our modeling and simulation activities. We are very happy that SIMULIA started support of GPU computing in Abaqus since May 2011 and we are fully exploiting it with our production models”
Dr. Shyu, Engineering Manager, Baker Hughes R&T
Oilfield Service Company
Plastic deformation, Large Strain
Material & geometric nonlinear
Solid FEs, 3.58M DOF
Plastic deformation, Large Strain & Contact
Material, geometric & boundary nonlinear
Solid FEs, 1.9M DOF
[Figure: O&G Models, 4 cores of Westmere CPU (3.33GHz) + Tesla C2070. Bars for With GPU and No GPU on Shaft1, Shaft3, LockCone, SideMandrel; axis 0 to 160000, units not shown.]
SCANIA Production Use of GPUs for Abaqus 6.11
Abaqus Model of Axle Gear
- Axle Gear
- 2,269 K DOF
- Solid FEs
- Static solution
- 37 GB memory
- Intel 5570 CPU
Abaqus Model of Differential
- Differential
- 1,489 K DOF
- Solid FEs
- Static solution
- 17 GB memory
- Intel 5570 CPU
[Figure: chart annotations ~5x, ~3x, ~2x, shown once for each model]
Global Manufacturer of Heavy Vehicles & Engines
NVIDIA MAXIMUS
[Figure labels: GRAPHICS, COMPUTE, SolidWorks 2012, Abaqus/Standard 6.11]
SIMULIA Validated NVIDIA Maximus workstation
A unified and optimized platform for GPU computing & visualization with NVIDIA Quadro & Tesla GPUs to dramatically accelerate Abaqus workflows
• CAD & CAE: CATIA or SolidWorks & Abaqus
• CAE Workflow: Abaqus/CAE -> Abaqus/Standard -> Abaqus/Viewer
Motorcycle Model - Spin, Rotate, Background Image Change, Section Cuts with SolidWorks 2012
Abaqus s4b Engine model on 4 Westmere cores + Tesla C2075 (with and w/o SolidWorks on Quadro 6000): 924 versus 921 secs
Abaqus 6.12 Multi-GPU, DMP clusters
SIMULIA Community News, June 2012 issue
http://www.3ds.com/company/3ds-magazines/simulia-scn/ (pg. 15)
• 4 core + 2 gpu
• Higher throughput
• 0.5x time of single 8c job
• Job completed in 8 to 10 hours
• No need to lock the system for 24 hrs
• Start the job at 6pm; ready for post-processing next morning at 8am
Multi-GPU Maximus workstation w/ Abaqus 6.12 (USA customer model)
Hardware: 2x Xeon X5670, 2.93 GHz CPUs, ~80GB of available memory, 2x Tesla C2070, Linux RHEL5.4
Lower is Better
SIMULIA Abaqus Model: • 7.33M DOF (equations); ~52 TFLOPs • Solid FEs • Nonlinear Static (3 Steps) • Direct Sparse solver • ~76GB memory to minimize IO
                 Today     Path Forward      Path Forward
                           (High Perf)       (Low Cost)
Configuration    8 core    8 core + 2 GPU    4 core + 2 GPU
Perf             1x        2.52x             1.9x
$                1x        1.15x             0.97x
Perf/$           1x        2.2x              1.96x
$ = workstation + GPUs + Abaqus tokens
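The Perf/$ row follows from dividing the Perf row by the $ row. A minimal sketch of that arithmetic, using only the relative values from the table (the dictionary layout and the function name `perf_per_cost` are illustrative):

```python
# Relative performance and relative cost, both normalized to the 8-core
# baseline; cost covers workstation + GPUs + Abaqus tokens, as on the slide.
configs = {
    "8 core (today)": (1.00, 1.00),
    "8 core + 2 GPU": (2.52, 1.15),
    "4 core + 2 GPU": (1.90, 0.97),
}


def perf_per_cost(perf, cost):
    """Relative performance per unit of relative cost."""
    return perf / cost


for name, (perf, cost) in configs.items():
    print(f"{name}: Perf/$ = {perf_per_cost(perf, cost):.2f}x")
```

The computed ratios agree with the table's 2.2x and 1.96x after rounding.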
[Figure: Abaqus 6.12-PR3 Elapsed time in Hours (axis 0 to 20) for 8c Baseline, 4c2g, 8c1g, 8c2g, 12c, 12c1g, 12c2g]
Multi-GPU DMP Cluster Scaling w/ Abaqus 6.12 (Rolls-Royce plc)
Hardware: Each node with (2x Xeon X5670, 2.93 GHz CPUs, 48-96GB memory, 2x Tesla M2090), Linux RHEL6.2, QDR IB
Abaqus Model: • 4.71M DOF (equations); ~77 TFLOPs • Nonlinear Static (6 Steps) • Direct Sparse solver • ~100GB memory to minimize IO (1 node)
[Figure: Elapsed Time (secs), axis 0 to 40000, lower is better, for 6c, 12c, 6c+2g, 24c, 24c+4g, 36c, 36c+6g, 48c, 48c+8g, grouped as single node, 2 nodes, 3 nodes, 4 nodes. Speed-up labels, in extraction order: 1X, 1.7X, 3.1X, 3.2X, 7.6X, 4.6X, 9.4X, 6X, 11.2X. Token labels, in extraction order: 12 tokens, 19 tokens, 14 tokens.]
Abaqus 6.12-2 Scaling in a Node (Rolls-Royce plc)
Westmere node with (2x Xeon X5670, 2.93 GHz CPUs, 96GB memory), 2x Tesla M2090, Linux RHEL6.2
Sandy Bridge node with (2x E5-2670, 2.6GHz CPUs, 128GB memory), 2x Tesla K20X, Linux RHEL 6.2
• 4.71M DOF (equations); ~77 TFLOPs • Nonlinear Static (6 Steps) • Direct Sparse solver • ~100GB memory to minimize IO (1 node)
[Figure, panel "Sandy Bridge + Tesla K20X": Elapsed Time in seconds (axis 0 to 20000) and speed-up relative to 8 core (1x) (axis 1 to 3.5) for 8c, 8c + 1g, 8c + 2g, 16c, 16c + 2g. Speed-up labels, in extraction order: 1.87X, 2.42X, 2.11X.]
[Figure, panel "Gen N-1 versus Gen N": Elapsed Time in seconds (axis 0 to 40000) for 6c and 6c + 2g on Westmere + Tesla M2090 and on Sandy Bridge + Tesla K20X. Speed-up labels, in extraction order: 3.09X, 3.77X.]
Abaqus 6.12-2 Scaling across multiple nodes (Rolls-Royce plc)
Each node with (2x Sandy Bridge E5-2667 (6-core), 2.9GHz CPUs, 64GB memory), 2x Tesla K20, Linux RHEL 6.2
• 4.71M DOF (equations); ~77 TFLOPs • Nonlinear Static (6 Steps) • Direct Sparse solver • ~100GB memory to minimize IO (1 node)
[Figure, panel "Sandy Bridge + Tesla K20": Elapsed Time in seconds (axis 0 to 9000) for 24c, 24c+4g (2 nodes), 36c, 36c+6g (3 nodes), 48c, 48c+8g (4 nodes). Speed-up labels, in extraction order: 2.04X, 1.8X, 2.17X, 1.92X, 1.81X.]
Abaqus 6.12-2 Scaling in a node (Americas Small Engines manufacturer)
Large Model: 52 TFLOPs • 7.33M DoF (equations) • Nonlinear Static (3 Steps) • Direct Sparse solver • ~76GB of memory to reduce IO (1 node)
Westmere node with (2x Xeon X5670 (6-core), 2.93 GHz CPUs, 96GB memory), 2x Tesla M2090, Linux RHEL6.2
Sandy Bridge node with (2x E5-2670 (8-core), 2.6 GHz CPUs, 128GB memory), 2x Tesla K20X, Linux RHEL 6.2
[Figure, panel "Sandy Bridge CPU + Tesla K20X GPU": Elapsed Time in seconds (axis 0 to 50000) and speedup relative to 8 cores (1x) (axis 1 to 3) for 8c, 8c+1g, 8c+2g, 16c, 16c+1g, 16c+2g]
[Figure, panel "Westmere CPU + Tesla M2090 GPU": Elapsed Time in seconds (axis 0 to 100000) and speedup relative to 6 cores (1x) (axis 1 to 4) for 6c, 6c+1g, 6c+2g, 12c, 12c+1g, 12c+2g]
Note the 2:1 reduction in Elapsed Time (Y-axis) from Gen N-1 to Gen N
Recommendations for GPU acceleration
Key factors for model selection for GPU acceleration:
Enough work
  FLOPs, Solid models
In-core solution
  Sufficient system (host) memory
Large fronts fit in the GPU memory
  Super-nodes fit in device memory (6 GB)
  Size limits will be largely eliminated in Abaqus 6.13
Uses direct sparse solver
  Unsymmetric solve in development for Abaqus 6.13
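To give the 6 GB device-memory limit a sense of scale, the sketch below estimates the largest dense frontal matrix that would fit. It is a rough bound under stated assumptions, not a description of how Abaqus stores super-nodes: the front is taken as a full square matrix in double precision (8 bytes per entry), symmetric storage and solver workspace are ignored, and 6 GB is read as 6 GiB. The function names are illustrative.

```python
import math

BYTES_PER_DOUBLE = 8


def dense_front_bytes(order):
    """Bytes for a full order x order double-precision matrix."""
    return BYTES_PER_DOUBLE * order * order


def max_front_order(device_bytes):
    """Largest order whose full dense front fits in device_bytes."""
    return math.isqrt(device_bytes // BYTES_PER_DOUBLE)


device_memory = 6 * 1024**3  # the 6 GB figure, read here as GiB (assumption)
limit = max_front_order(device_memory)
print(f"largest dense front order: {limit}")
print(f"fits in device memory: {dense_front_bytes(limit) <= device_memory}")
```

Because memory grows with the square of the front order, doubling the device memory raises this bound by only about 41%.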
Recommended Configurations
Workstation user
  Westmere or Sandy Bridge CPU
  48-64 GB RAM (or more)*
  Maximus Q6000/4000 GPU & Tesla C2075
  Tesla K20c GPU
  * Memory requirements dictated by problem size to minimize disk I/O.
Cluster user
  Each node of IB cluster:
    2x Sandy Bridge CPU
    96-128 GB RAM
    2x Tesla K20/K20X GPU (or) 2x Tesla M2090 GPU
How GPU Licensing Works
Abaqus Portfolio License
Token Scheme
  5 tokens: Unlocks a single CPU core
  1 additional token: Unlocks 1 additional CPU core -or- Unlocks 1 entire GPU (same token scheme for CPU core & GPU)
Contact SIMULIA to enable GPUs
Abaqus 6.11 & 6.12 Licensees
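The token scheme can be sketched with the commonly cited Abaqus analysis-token formula, tokens = floor(5 x N^0.422). That formula is not on the slides and is an assumption here. N is taken as CPU cores plus GPUs, following the slide's statement that a GPU is counted like a CPU core. The function name `abaqus_tokens` is illustrative.

```python
def abaqus_tokens(cores, gpus=0):
    """Estimated analysis tokens for a job on `cores` CPU cores and `gpus` GPUs.

    Assumption: tokens = floor(5 * N**0.422) with N = cores + gpus.
    """
    if cores < 1 or gpus < 0:
        raise ValueError("need at least one core and a non-negative GPU count")
    units = cores + gpus
    return int(5 * units ** 0.422)


for cores, gpus in [(1, 0), (4, 0), (6, 2), (12, 0), (24, 0)]:
    print(f"{cores} cores + {gpus} GPUs: {abaqus_tokens(cores, gpus)} tokens")
```

Under this formula a single core costs 5 tokens and each further core or GPU costs one more token up to 8 units, after which the increments flatten. It gives 12, 14 and 19 tokens for 8, 12 and 24 units, the same token counts that appear as labels on the Rolls-Royce cluster chart.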
SIMULIA V6R2013 – ExSight
ExSight relies on Abaqus for solver engine & exploits GPU Computing in Abaqus
Tools & Resources
GPU Genius
GPU Computing with Abaqus 6.12
GPU Computing with Abaqus 6.11
More info at:
http://www.nvidia.com/abaqus
• Multi-GPU Support in Abaqus 6.12 – SIMULIA Community News, June 2012.
• Acceleration of SIMULIA’s Abaqus Solver on NVIDIA GPUs – Acceleware Inc.
The 2012 HPCwire Readers’ Choice Awards
http://www.hpcwire.com/hpcwire/2012-11-12/hpcwire_announces_2012_readers_choice_awards_winners.html
Best use of HPC application in manufacturing
Readers’ Choice: NVIDIA Quadro and Tesla GPUs with Dassault SIMULIA Abaqus Finite Element Analysis
Editor’s Choice: Airbus using HPC-as-a-service provided by Hewlett Packard
Best use of HPC in automotive
Readers’ Choice: NVIDIA Tesla GPUs and Dassault SIMULIA Abaqus Finite Element Analysis solution
Editor’s Choice: Caterham F1 Racing using Dell systems for Formula 1 racing design
Thank You Contact: Srinivas Kodiyalam, [email protected]