GPU Acceleration of HFSS Transient
Hsueh-Yung (Robert) Chao, Stylianos Dosopoulos, and Rickard Petersson
ANSYS, Inc.
Outline
• ANSYS overview
• HFSS transient solvers
• Why use graphics processing units (GPUs) for acceleration?
• Optimization of CUDA programs for HFSS Transient
• Distributed solve on multiple GPUs
• Conclusions
Our Vision
Our Strategy
HFSS Transient Solvers R13 (2011)
• General-purpose hybrid (implicit-explicit) hp-adaptive finite-element solver for transient electromagnetics
• Superior to finite-difference methods (FDTD, FIT) for solving multiscale problems
• Explicit part: discontinuous Galerkin time domain (DGTD) method with local time stepping
• Implicit part: dual-field Crank-Nicolson
• Locally implicit to alleviate the small-time-step restriction of explicit DG
• OpenMP multithreading
121,764 tets, Intel X5675 8 CPU cores, 893 MB DRAM, 3 hrs 14 mins
HFSS Transient Solvers R15 (2013)
• Target applications with electrically large structures and high-order meshes
• GPU-accelerated DGTD with local time stepping
• One process on one GPU
• Multiple processes on multiple GPUs for parametric sweeps and network analysis with multiple excitations
HFSS Transient Solvers R16 (2015)
• Target applications with electrically small structures and low-frequency signals
• Immune to bad meshes
• Single-field fully implicit with Newmark-beta (β = 1/4); the textbook update is sketched below
• Similar memory scalability to frequency-domain HFSS
• OpenMP multithreading
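For reference, the textbook Newmark-beta update for a semi-discrete system M ü + C u̇ + K u = f; with γ = 1/2, the β = 1/4 "average acceleration" variant is unconditionally stable, which is consistent with the "immune to bad meshes" claim above:

$$u_{n+1} = u_n + \Delta t\,\dot u_n + \tfrac{\Delta t^2}{2}\big[(1-2\beta)\,\ddot u_n + 2\beta\,\ddot u_{n+1}\big]$$
$$\dot u_{n+1} = \dot u_n + \Delta t\big[(1-\gamma)\,\ddot u_n + \gamma\,\ddot u_{n+1}\big]$$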
*Touch screen and helicopter simulations courtesy of Jack Wu and Arien Sligar
Why Use GPUs for Acceleration?

[Figure: the CUDA model mapped onto DGTD — parallelism level 1: the FEM mesh maps to the GPU/device; level 2: mesh elements map to thread blocks (TB1, TB2, ...); level 3: the nodal bases of each element map to threads within a block.]

• The large number of ALUs in a GPU is especially favorable for massively parallel processing.
• DGTD is inherently highly parallel, and its locality makes it map efficiently to the GPU architecture (a minimal kernel sketch follows this list).
• A GPU provides much higher memory bandwidth and FLOPS than a CPU.
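A minimal sketch of that mapping (names are hypothetical, not the HFSS source): one mesh element per thread block and one nodal basis per thread.

__global__ void elementKernel(const float* field, float* rhs, int nodesPerElem) {
    int elem = blockIdx.x;                    // level 2: elements -> thread blocks
    int node = threadIdx.x;                   // level 3: nodal bases -> threads
    int gid  = elem * nodesPerElem + node;    // one degree of freedom per thread
    rhs[gid] = field[gid];                    // placeholder for the DG update
}

// launched as: elementKernel<<<numElements, nodesPerElem>>>(field, rhs, nodesPerElem);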
Why Use GPUs for Acceleration? (Cont.)
Li = level i of local time stepping (LTS); the time step at Li is 2^i·Δt (Δt, 2Δt, 4Δt, ...).

[Figure: element load per LTS level (L0-L4) for the DGTD field update at each time step, shown for CPU cores and for GPU thread blocks.]

• Field updates across LTS levels are interdependent and cannot be parallelized.
• On CPU cores (CPU0-CPU3), the load is not evenly distributed if the mesh is divided by equal element counts: cores are frequently idle at some LTS levels, so scalability can be poor.
• On the GPU, the load is evenly distributed to thread blocks for each LTS level; parallel efficiency is limited by the lowest LTS level, which has few elements.

A host-side sketch of such an LTS schedule follows.
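A sketch under stated assumptions — elements sorted by LTS level with prefix offsets, and a per-level update kernel (updateLevel, levelOffset are hypothetical names); the real solver also interleaves field updates and interface corrections omitted here:

__global__ void updateLevel(float* fields, int firstElem, int numElems, float dt);

// One macro step of duration 2^(numLevels-1) * dt: level i fires every 2^i
// substeps. Launches stay ordered in a single stream because the levels
// are interdependent.
void macroStep(int numLevels, float dt, float* fields, const int* levelOffset) {
    int fineSteps = 1 << (numLevels - 1);    // substeps taken by the finest level L0
    for (int s = 0; s < fineSteps; ++s) {
        for (int lev = 0; lev < numLevels; ++lev) {
            int n = levelOffset[lev + 1] - levelOffset[lev];
            if (s % (1 << lev) == 0 && n > 0) {   // level lev is due at this substep
                updateLevel<<<n, 128>>>(fields, levelOffset[lev], n, dt * (1 << lev));
            }
        }
    }
}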
GPU Code Optimization
Key feature 1: CUDA is SIMT. Peak performance requires that all threads in a warp (32 threads) execute the same instruction, i.e., no divergent execution paths: a branch such as if (id < 4) {} else {} that splits threads within a warp serializes the two paths. NOTE: avoiding divergent paths may not always be possible.

[Figure: warp schedulers issue instructions warp by warp (Ins W1, Ins W2, Ins W3) to hide latency; within a thread block, a divergent branch forces serialization of both paths over time.]
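A toy sketch of the branch above (hypothetical kernels, for illustration only); the second kernel branches on a warp-uniform value, so no warp is split:

__global__ void divergentKernel(float* out) {
    int id = threadIdx.x;
    if (id < 4) out[id] = 1.0f;      // threads 0-3 run while 4-31 idle...
    else        out[id] = 2.0f;      // ...then 4-31 run while 0-3 idle
}

__global__ void uniformKernel(float* out) {
    int warpId = threadIdx.x / 32;   // same value for all 32 threads of a warp
    if (warpId == 0) out[threadIdx.x] = 1.0f;   // whole warp takes one path
    else             out[threadIdx.x] = 2.0f;   // no serialization within a warp
}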
GPU Code Optimization (Cont.)

Key feature 2: In CUDA, instructions are executed per warp (32 threads), so each warp should utilize as many of its 32 threads as possible (good granularity).

[Figure: packing the nodal bases of two elements (E1, E2) into one warp gives optimized granularity, at the cost of more shared memory usage.]
Key feature 3: Carefully choose among the memory spaces (registers, shared, constant, global), which have different bandwidths and latencies; in that order they get slower and larger.
• Reused variables should be stored in registers.
• If there is not enough register space, use shared memory for reused variables.
• Use coalesced access patterns for global memory.

In our implementation:

Memory     Variables
registers  local working variables
shared     flux gather, local storage
constant   reference elements
global     time-stepping vectors, element matrices, maps
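A hedged sketch of how the table above could look in code (names and sizes are assumptions, not the HFSS source): reference-element matrices live in constant memory, element data is staged in shared memory for reuse, reused scalars sit in registers, and global accesses are laid out to coalesce:

#define NP 20                                    // nodal bases per element (assumed)

__constant__ float c_refMat[NP * NP];            // reference-element matrix (constant)

__global__ void dgUpdate(const float* __restrict__ field,
                         float* __restrict__ rhs, float dt) {
    __shared__ float s_elem[NP];                 // staged element data (shared)
    int elem = blockIdx.x;
    int node = threadIdx.x;                      // NP threads per block
    int gid  = elem * NP + node;                 // consecutive ids -> coalesced access

    float u = field[gid];                        // reused value kept in a register
    s_elem[node] = u;
    __syncthreads();

    float acc = 0.0f;                            // register accumulator
    for (int j = 0; j < NP; ++j)
        acc += c_refMat[node * NP + j] * s_elem[j];   // dense local operator

    rhs[gid] = u + dt * acc;                     // coalesced store
}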
GPU Code Optimization (Cont.)

Key feature 4: Avoid the GPU hardware “limiting factors” with regard to registers, shared memory, warps, and thread blocks.

Physical limits for a GPU of compute capability 3.5:

Threads per warp: 32
Warps per multiprocessor: 64
Threads per multiprocessor: 2048
Thread blocks per multiprocessor: 16
Total # of 32-bit registers per multiprocessor: 65536
Register allocation unit size: 256
Register allocation granularity: warp
Registers per thread: 255
Shared memory per multiprocessor (bytes): 49152
Shared memory allocation unit size: 256
Warp allocation granularity: 4
Maximum thread block size: 1024

Example 1, limited by registers: 32 threads per block with 255 registers per thread gives blocks per multiprocessor = 65536 / (32 × 255) ≈ 8 (max = 16).
Example 2, limited by shared memory: 500 double-precision values per thread block gives blocks per multiprocessor = 49152 / (500 × 8) = 12.288, i.e. 12 (max = 16).

This hand arithmetic can also be delegated to the CUDA occupancy API; see the sketch below.
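A minimal sketch of that run-time query (the kernel is a placeholder standing in for a real solver kernel), reproducing Example 2's shared-memory case:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void solverKernel(float* rhs) { rhs[threadIdx.x] = 0.0f; }

int main() {
    int numBlocks = 0;
    // Resident blocks per multiprocessor for 32-thread blocks that each
    // request 500 double-precision values (4000 bytes) of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, solverKernel,
                                                  /*blockSize=*/32,
                                                  /*dynSmemBytes=*/500 * 8);
    printf("Resident thread blocks per multiprocessor: %d\n", numBlocks);
    return 0;
}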
Nsight Performance Analysis Tools
• Use the NVIDIA Nsight tools as a guide when applying all previously mentioned optimizations.
• Through Nsight analysis, our CUDA code was optimized to a 2-3x improvement over the non-optimized version.
GPU Speedup: One Tesla K20c vs. Xeon X5675 8 Cores
Note: HFSS Transient detects cases not suitable for GPU acceleration and falls back to CPUs.
Measured speedups by model (mesh size in tets):

Model                                   GPU speedup
Dipole_v16, 21K                         1.83
PecMine, 32K                            2.22
BenchMark2 (500 ps) 0-10 GHz, 36K       0.95
FlipChip (30 ps), 39K                   0.275
Cauer (1.75 ns), 40K                    2.64
Dipole_v16 (30 ns), 59K                 2.62
Board with Traces (500 ns), 92K         0.65
DiffVia (30 ps), 115K                   4.3
BenchMark2 (200 ps) 0-12 GHz, 108K      2.22
F35Ant (36 ns) 600 MHz, 125K            2.36
F35Ant (36.5 ns) 800 MHz, 330K          3.38
GSMAntenna (p=2), 133K                  5.21
GSMAntenna mixed order (p=1,2), 75K     2.96
ApacheScattering 300 MHz, 387K          3.52
Substation (500 ns), 443K               0.65
UHF Blade Antenna on F-35
262,970 tets, fmax = 800 MHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 2.3 GB GPU RAM, GPU speedup 3.2x.
Transient Analysis of a Smart Phone
1,093,376 tets, fmax = 5 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 6.0 GB GPU RAM, GPU speedup 4.8x.
Transient field analysis of the CPU, memory, GPS, USB, and Bluetooth ports due to a power surge during battery charging.
[Figure labels: GPS antenna, multiband antenna, Bluetooth port, CPU, USB, touch screen panel, SIM card.]
*Phone model courtesy of Sara Louie
Mutual Coupling between Patch Antennas
833,218 tets, fmax = 1.2 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 6.7 GB GPU RAM, GPU speedup 6.8x.
[Plots: S11 and S12, S13.]
*Helicopter model courtesy of Matt Commens
Cosite Interference of LTE Monopoles on A320
2,120,263 tets, fmax = 1.7 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 10.9 GB GPU RAM, GPU speedup 6.3x.
*A320 model courtesy of Matt Commens
Acceleration on Multiple GPUs
• Automatic job assignment for parametric sweeps and for network analysis with multiple excitations
• Speedup scales linearly with the number of GPUs
• GPUs attached to displays are detected automatically and excluded from GPU acceleration (a detection sketch follows below)
GPU monitoring by nvidia-smi
CPU monitoring by Windows Task Manager
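The slides do not show how the display-GPU exclusion is done; one plausible heuristic (an assumption, not necessarily what HFSS uses) is to skip devices whose kernel run-time limit is enabled, which is typical for display-attached GPUs:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // kernelExecTimeoutEnabled is usually set on GPUs driving a display
        bool usable = !prop.kernelExecTimeoutEnabled;
        printf("GPU %d (%s): %s\n", d, prop.name,
               usable ? "eligible for acceleration"
                      : "attached to a display, excluded");
    }
    return 0;
}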
Performance Gains on Multiple GPUs
• Transient network analysis with 64 excitations. The GPU speedup is 2.0x for 1 GPU vs. 8 CPU cores, each workstation can host up to 4 GPUs and 16 CPU cores, and a simulation for one excitation on 8 CPU cores takes 1 hour, so the 64-excitation baseline on a single 8-core machine is 64 hours.
• 1 HPC pack = 1 GPU + 8 CPU cores; 2 HPC packs = 4 GPUs + 32 CPU cores; 3 HPC packs = 16 GPUs + 128 CPU cores; 4 HPC packs = 64 GPUs + 512 CPU cores.

# Workstations   # HPC Licenses   # CPU Cores   # GPUs   Simulation Time (hours)   Speedup*
1                1                16            0        32                        2
1                1                16            1        32                        2
1                2                16            4        8                         8
4                3                64            16       2                         32
16               4                256           64       0.5                       128
*Actual speedup may vary depending on system configurations.
Conclusions
• DGTD is a good candidate for GPU acceleration due to its inherent parallelism.
• The desired goal of a 2x speedup over 8 CPU cores is successfully achieved (explicit solver on GPU vs. hybrid solver on CPU).
• Line-by-line optimization is necessary in order to achieve the performance goals.
• Memory access patterns are critical for GPU acceleration.
• Rethinking algorithms to expose more parallelism can significantly improve performance.