GPU Acceleration of HFSS Transient
Hsueh-Yung (Robert) Chao, Stylianos Dosopoulos, and Rickard Petersson
ANSYS, Inc.
Outline
• ANSYS overview
• HFSS transient solvers
• Why use graphics processing units (GPUs) for acceleration?
• Optimization of CUDA programs for HFSS Transient
• Distributed solve on multiple GPUs
• Conclusions
Our Vision
Our Strategy
HFSS Transient Solvers R13 (2011)
• General-purpose hybrid (implicit-explicit) hp-adaptive finite-element solver for transient electromagnetics
• Superior to finite-difference methods (FDTD, FIT) for solving multiscale problems
• Explicit part: discontinuous Galerkin time domain (DGTD) method with local time stepping
• Implicit part: dual-field Crank-Nicolson
• Locally implicit to alleviate the small-time-step restriction of explicit DG
• OpenMP multithreading
121,764 tets, Intel X5675 8 CPU cores, 893 MB DRAM, 3 hrs 14 mins
HFSS Transient Solvers R15 (2013)
• Target applications with electrically large structures and high-order meshes
• GPU-accelerated DGTD with local time stepping
• One process on one GPU
• Multiple processes on multiple GPUs for parametric sweeps and network analysis with multiple excitations
HFSS Transient Solvers R16 (2015)
• Target applications with electrically small structures and low-frequency signals
• Immune to bad meshes
• Single-field fully implicit with Newmark-beta (β = 1/4); the textbook update is sketched below
• Similar memory scalability to frequency-domain HFSS
• OpenMP multithreading
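For reference, the textbook Newmark-beta update for a semi-discrete system M ü + C u̇ + K u = f; with γ = 1/2, the β = 1/4 "average acceleration" variant is unconditionally stable, which is consistent with the "immune to bad meshes" claim above:

$$u_{n+1} = u_n + \Delta t\,\dot u_n + \tfrac{\Delta t^2}{2}\big[(1-2\beta)\,\ddot u_n + 2\beta\,\ddot u_{n+1}\big]$$
$$\dot u_{n+1} = \dot u_n + \Delta t\big[(1-\gamma)\,\ddot u_n + \gamma\,\ddot u_{n+1}\big]$$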
*Touch screen and helicopter simulations courtesy of Jack Wu and Arien Sligar
Why Use GPUs for Acceleration?

[Figure: the CUDA model mapped onto DGTD — parallelism level 1: the FEM mesh maps to the GPU/device; level 2: mesh elements map to thread blocks (TB1, TB2, ...); level 3: the nodal bases of each element map to threads within a block.]

• The large number of ALUs in a GPU is especially favorable for massively parallel processing.
• DGTD is inherently highly parallel, and its locality makes it map efficiently to the GPU architecture (a minimal kernel sketch follows this list).
• A GPU provides much higher memory bandwidth and FLOPS than a CPU.
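A minimal sketch of that mapping (names are hypothetical, not the HFSS source): one mesh element per thread block and one nodal basis per thread.

__global__ void elementKernel(const float* field, float* rhs, int nodesPerElem) {
    int elem = blockIdx.x;                    // level 2: elements -> thread blocks
    int node = threadIdx.x;                   // level 3: nodal bases -> threads
    int gid  = elem * nodesPerElem + node;    // one degree of freedom per thread
    rhs[gid] = field[gid];                    // placeholder for the DG update
}

// launched as: elementKernel<<<numElements, nodesPerElem>>>(field, rhs, nodesPerElem);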
Why Use GPUs for Acceleration? (Cont.)
Li = level i of local time stepping (LTS); the time step at Li is 2^i·Δt (Δt, 2Δt, 4Δt, ...).

[Figure: element load per LTS level (L0-L4) for the DGTD field update at each time step, shown for CPU cores and for GPU thread blocks.]

• Field updates across LTS levels are interdependent and cannot be parallelized.
• On CPU cores (CPU0-CPU3), the load is not evenly distributed if the mesh is divided by equal element counts: cores are frequently idle at some LTS levels, so scalability can be poor.
• On the GPU, the load is evenly distributed to thread blocks for each LTS level; parallel efficiency is limited by the lowest LTS level, which has few elements.

A host-side sketch of such an LTS schedule follows.
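A sketch under stated assumptions — elements sorted by LTS level with prefix offsets, and a per-level update kernel (updateLevel, levelOffset are hypothetical names); the real solver also interleaves field updates and interface corrections omitted here:

__global__ void updateLevel(float* fields, int firstElem, int numElems, float dt);

// One macro step of duration 2^(numLevels-1) * dt: level i fires every 2^i
// substeps. Launches stay ordered in a single stream because the levels
// are interdependent.
void macroStep(int numLevels, float dt, float* fields, const int* levelOffset) {
    int fineSteps = 1 << (numLevels - 1);    // substeps taken by the finest level L0
    for (int s = 0; s < fineSteps; ++s) {
        for (int lev = 0; lev < numLevels; ++lev) {
            int n = levelOffset[lev + 1] - levelOffset[lev];
            if (s % (1 << lev) == 0 && n > 0) {   // level lev is due at this substep
                updateLevel<<<n, 128>>>(fields, levelOffset[lev], n, dt * (1 << lev));
            }
        }
    }
}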
GPU Code Optimization
Key feature 1: CUDA is SIMT. Peak performance requires that all threads in a warp (32 threads) execute the same instruction, i.e., no divergent execution paths: a branch such as if (id < 4) {} else {} that splits threads within a warp serializes the two paths. NOTE: avoiding divergent paths may not always be possible.

[Figure: warp schedulers issue instructions warp by warp (Ins W1, Ins W2, Ins W3) to hide latency; within a thread block, a divergent branch forces serialization of both paths over time.]
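A toy sketch of the branch above (hypothetical kernels, for illustration only); the second kernel branches on a warp-uniform value, so no warp is split:

__global__ void divergentKernel(float* out) {
    int id = threadIdx.x;
    if (id < 4) out[id] = 1.0f;      // threads 0-3 run while 4-31 idle...
    else        out[id] = 2.0f;      // ...then 4-31 run while 0-3 idle
}

__global__ void uniformKernel(float* out) {
    int warpId = threadIdx.x / 32;   // same value for all 32 threads of a warp
    if (warpId == 0) out[threadIdx.x] = 1.0f;   // whole warp takes one path
    else             out[threadIdx.x] = 2.0f;   // no serialization within a warp
}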
GPU Code Optimization (Cont.)

Key feature 2: In CUDA, instructions are executed per warp (32 threads), so each warp should utilize as many of its 32 threads as possible (good granularity).

[Figure: packing the nodal bases of two elements (E1, E2) into one warp gives optimized granularity, at the cost of more shared memory usage.]
Key feature 3: Carefully choose among the memory spaces (registers, shared, constant, global), which have different bandwidths and latencies; in that order they get slower and larger.
• Reused variables should be stored in registers.
• If there is not enough register space, use shared memory for reused variables.
• Use coalesced access patterns for global memory.

In our implementation:

Memory     Variables
registers  local working variables
shared     flux gather, local storage
constant   reference elements
global     time-stepping vectors, element matrices, maps
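A hedged sketch of how the table above could look in code (names and sizes are assumptions, not the HFSS source): reference-element matrices live in constant memory, element data is staged in shared memory for reuse, reused scalars sit in registers, and global accesses are laid out to coalesce:

#define NP 20                                    // nodal bases per element (assumed)

__constant__ float c_refMat[NP * NP];            // reference-element matrix (constant)

__global__ void dgUpdate(const float* __restrict__ field,
                         float* __restrict__ rhs, float dt) {
    __shared__ float s_elem[NP];                 // staged element data (shared)
    int elem = blockIdx.x;
    int node = threadIdx.x;                      // NP threads per block
    int gid  = elem * NP + node;                 // consecutive ids -> coalesced access

    float u = field[gid];                        // reused value kept in a register
    s_elem[node] = u;
    __syncthreads();

    float acc = 0.0f;                            // register accumulator
    for (int j = 0; j < NP; ++j)
        acc += c_refMat[node * NP + j] * s_elem[j];   // dense local operator

    rhs[gid] = u + dt * acc;                     // coalesced store
}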
GPU Code Optimization (Cont.)

Key feature 4: Avoid the GPU hardware “limiting factors” with regard to registers, shared memory, warps, and thread blocks.

Physical limits for a GPU of compute capability 3.5:

Threads per warp: 32
Warps per multiprocessor: 64
Threads per multiprocessor: 2048
Thread blocks per multiprocessor: 16
Total # of 32-bit registers per multiprocessor: 65536
Register allocation unit size: 256
Register allocation granularity: warp
Registers per thread: 255
Shared memory per multiprocessor (bytes): 49152
Shared memory allocation unit size: 256
Warp allocation granularity: 4
Maximum thread block size: 1024

Example 1, limited by registers: 32 threads per block with 255 registers per thread gives blocks per multiprocessor = 65536 / (32 × 255) ≈ 8 (max = 16).
Example 2, limited by shared memory: 500 double-precision values per thread block gives blocks per multiprocessor = 49152 / (500 × 8) = 12.288, i.e. 12 (max = 16).

This hand arithmetic can also be delegated to the CUDA occupancy API; see the sketch below.
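A minimal sketch of that run-time query (the kernel is a placeholder standing in for a real solver kernel), reproducing Example 2's shared-memory case:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void solverKernel(float* rhs) { rhs[threadIdx.x] = 0.0f; }

int main() {
    int numBlocks = 0;
    // Resident blocks per multiprocessor for 32-thread blocks that each
    // request 500 double-precision values (4000 bytes) of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, solverKernel,
                                                  /*blockSize=*/32,
                                                  /*dynSmemBytes=*/500 * 8);
    printf("Resident thread blocks per multiprocessor: %d\n", numBlocks);
    return 0;
}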
Nsight Performance Analysis Tools
• Use the NVIDIA Nsight tools as a guide when applying all previously mentioned optimizations.
• Through Nsight analysis, our CUDA code was optimized to a 2-3x improvement over the non-optimized version.
GPU Speedup: One Tesla K20c vs. Xeon X5675 8 Cores
Note: HFSS Transient detects cases not suitable for GPU acceleration and falls back to CPUs.
Measured speedups by model (mesh size in tets):

Model                                   GPU speedup
Dipole_v16, 21K                         1.83
PecMine, 32K                            2.22
BenchMark2 (500 ps) 0-10 GHz, 36K       0.95
FlipChip (30 ps), 39K                   0.275
Cauer (1.75 ns), 40K                    2.64
Dipole_v16 (30 ns), 59K                 2.62
Board with Traces (500 ns), 92K         0.65
DiffVia (30 ps), 115K                   4.3
BenchMark2 (200 ps) 0-12 GHz, 108K      2.22
F35Ant (36 ns) 600 MHz, 125K            2.36
F35Ant (36.5 ns) 800 MHz, 330K          3.38
GSMAntenna (p=2), 133K                  5.21
GSMAntenna mixed order (p=1,2), 75K     2.96
ApacheScattering 300 MHz, 387K          3.52
Substation (500 ns), 443K               0.65
UHF Blade Antenna on F-35
262,970 tets, fmax = 800 MHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 2.3 GB GPU RAM, GPU speedup 3.2x.
Transient Analysis of a Smart Phone
1,093,376 tets, fmax = 5 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 6.0 GB GPU RAM, GPU speedup 4.8x.
Transient field analysis of the CPU, memory, GPS, USB, and Bluetooth ports due to a power surge during battery charging.
[Figure labels: GPS antenna, multiband antenna, Bluetooth port, CPU, USB, touch screen panel, SIM card.]
*Phone model courtesy of Sara Louie
Mutual Coupling between Patch Antennas
833,218 tets, fmax = 1.2 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 6.7 GB GPU RAM, GPU speedup 6.8x.
[Plots: S11 and S12, S13.]
*Helicopter model courtesy of Matt Commens
Cosite Interference of LTE Monopoles on A320
2,120,263 tets, fmax = 1.7 GHz. Tesla K40c vs. Xeon E5-2687W 8 CPU cores, 10.9 GB GPU RAM, GPU speedup 6.3x.
*A320 model courtesy of Matt Commens
Acceleration on Multiple GPUs
• Automatic job assignment for parametric sweeps and for network analysis with multiple excitations
• Speedup scales linearly with the number of GPUs
• GPUs attached to displays are detected automatically and excluded from GPU acceleration (a detection sketch follows below)
GPU monitoring by nvidia-smi
CPU monitoring by Windows Task Manager
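The slides do not show how the display-GPU exclusion is done; one plausible heuristic (an assumption, not necessarily what HFSS uses) is to skip devices whose kernel run-time limit is enabled, which is typical for display-attached GPUs:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // kernelExecTimeoutEnabled is usually set on GPUs driving a display
        bool usable = !prop.kernelExecTimeoutEnabled;
        printf("GPU %d (%s): %s\n", d, prop.name,
               usable ? "eligible for acceleration"
                      : "attached to a display, excluded");
    }
    return 0;
}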
Performance Gains on Multiple GPUs
• Transient network analysis with 64 excitations. The GPU speedup is 2.0x for 1 GPU vs. 8 CPU cores, each workstation can host up to 4 GPUs and 16 CPU cores, and a simulation for one excitation on 8 CPU cores takes 1 hour, so the 64-excitation baseline on a single 8-core machine is 64 hours.
• 1 HPC pack = 1 GPU + 8 CPU cores; 2 HPC packs = 4 GPUs + 32 CPU cores; 3 HPC packs = 16 GPUs + 128 CPU cores; 4 HPC packs = 64 GPUs + 512 CPU cores.

# Workstations   # HPC Licenses   # CPU Cores   # GPUs   Simulation Time (hours)   Speedup*
1                1                16            0        32                        2
1                1                16            1        32                        2
1                2                16            4        8                         8
4                3                64            16       2                         32
16               4                256           64       0.5                       128
*Actual speedup may vary depending on system configurations.
Conclusions
• DGTD is a good candidate for GPU acceleration due to its inherent parallelism.
• The desired goal of a 2x speedup over 8 CPU cores is successfully achieved (explicit solver on GPU vs. hybrid solver on CPU).
• Line-by-line optimization is necessary in order to achieve the performance goals.
• Memory access patterns are critical for GPU acceleration.
• Rethinking algorithms to expose more parallelism can significantly improve performance.