8/8/2019 Acceleware Performance Guide
ACCELEWARE FDTD PERFORMANCE GUIDE
Nine easy ways to speed up your simulation
- February 2010
Logan Maxwell, Mike Weldon
Acceleware 2010
Copyright Notice
All material herein is Acceleware copyright and shall not be reproduced, copied, forwarded,
published or shared in any manner without prior written authorization from Acceleware.
All rights reserved. Acceleware, the Acceleware logo and wordmark are registered trademarks
and/or trademarks of Acceleware Corp. in the United States, Canada and other countries. All other
trademarks are the property of their respective owners.
Table of Contents
Copyright Notice
Table of Contents
Overview
The Fundamentals of FDTD Performance on GPUs
1) Perfectly Matched Layers (PML)
2) Reads and Read Regions (Observations, DFT, Convergence, etc.)
3) Screen Savers
4) Simulation Orientation
5) Number of Materials
6) Dispersive Materials
7) Mesh Density
8) Windows Remote Log In
9) Multi-GPU Systems
Overview
Introduction
Achieving the fastest possible FDTD simulation performance on GPU and multi-core hardware is a key
objective for partners and end users alike. The hardware and the Acceleware software libraries that
run on it are the main determinants of ultimate performance, but partners and end users can still
have a large impact on performance in ways that are not always obvious. This document
outlines several key simulation parameters that affect simulation performance. Each case includes
a brief description of the parameter, a plot illustrating the performance impact, and tips on how to
minimize any speed reduction. Improper use of these parameters, whether intended or not, can
reduce simulation speed by 50% or more. Understanding the suggestions in this document will help
you avoid unnecessary reductions in performance and get the most out of all your simulations.
Intended Audience
This guide is intended for Acceleware partners, end users and anyone interested in running FDTD
simulations optimally on GPUs and multi-core hardware. It should be considered essential reading
for engineers and scientists running FDTD simulations, and will help ensure that they are always
getting the most out of their simulation tools.
The Fundamentals of FDTD Performance on GPUs
The chart in this section shows single-GPU performance only. Faster throughput can be achieved by sharing the computations across multiple GPUs; typical scaling when doing so is 80-90 percent.
Ramp Up: In this range the GPU is not using all of its compute resources and memory bandwidth efficiently. In addition, PML takes up a large portion of the total simulation size and slows the total simulation throughput.
Knee: The knee is the point at which the performance levels off and the GPU is running optimally.
Optimal Range: This is the optimal range because the GPU has found a good balance between computation and communication. The goal of any GPU FDTD code is to maximize this region's breadth and magnitude.
GPU Memory Limit: This is the point at which the GPU runs out of memory and the CPU begins to solve the remaining calculations.
Soft Memory: In this area the CPU is solving the remaining calculations that the GPU does not have memory for. As simulation size goes further into soft memory, the performance will approach that of the CPU.
How to calculate throughput performance:
Note: Simulation Size does not include PML cells.
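A minimal sketch of a throughput calculation, assuming the common FDTD definition: the number of non-PML cells multiplied by the number of time steps, divided by the wall-clock run time, expressed in Mcells/s. The function and parameter names are illustrative and are not part of the Acceleware SDK.

```python
def throughput_mcells_per_s(nx, ny, nz, time_steps, elapsed_seconds):
    """Return simulation throughput in Mcells/s.

    nx, ny, nz are the simulation dimensions in cells, excluding PML cells
    (per the note in this section). The formula itself is an assumption.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    cells = nx * ny * nz
    return cells * time_steps / elapsed_seconds / 1e6


# Example: a 300 x 300 x 300 simulation (27 Mcells) run for 1000 time steps in 45 s.
print(throughput_mcells_per_s(300, 300, 300, 1000, 45.0))
```

Because PML cells are left out of the cell count but still cost time to compute, adding PML lowers the reported throughput even when the interior volume is unchanged.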
Figure: Performance (Mcells/s) versus Simulation Size (Mcells) for GPU (10 Series), CPU (Nehalem) and CPU (Non-Nehalem), with the Ramp Up, Knee, Optimal Range, Memory Limit and Soft Memory regions marked.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.
Perfectly Matched Layers (PML)
Overview:
Adding PML (absorbing) boundary layers can reduce simulation performance by as much as
50%, which would double the run time. The maximum simulation size the GPU is capable of running
will also be reduced. PML cells require more memory than non-PML cells, which reduces the
maximum simulation size, and they are more expensive to compute, which reduces performance. More
significantly, PML cells are not included when calculating capacity or speed. Small simulations
are impacted more than larger simulations because PML cells represent a larger share of the
computational load.
Tips:
- Minimize the number of layers of PML.
- Understand how much PML your simulation requires and use no more than that.
- Use maximum PML layers only when absolutely necessary.
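To illustrate why small simulations are hit hardest, the sketch below estimates how many extra cells PML adds to a cubic simulation. It assumes the PML layers are appended outside the simulation volume on all six faces and that each layer is one cell thick; both are assumptions for illustration, and the function name is hypothetical.

```python
def pml_cell_overhead(n, layers):
    """Extra PML cells as a fraction of the n x n x n interior volume.

    Assumes `layers` one-cell-thick PML layers outside each of the six faces.
    """
    interior = n ** 3
    total = (n + 2 * layers) ** 3
    return (total - interior) / interior


for n in (100, 300, 500):
    for layers in (4, 10, 20):
        overhead = pml_cell_overhead(n, layers)
        print(f"{n}^3 cells, {layers:2d} PML layers: +{overhead:.1%} cells")
```

The count understates the real cost, since each PML cell also needs more memory and more computation than an interior cell.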
Figure: PML Performance. Performance (Mcells/s) versus Simulation Size (Mcells) for 0, 10 and 20 PML layers.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics.
Reads and Read Regions
(Observations, DFT, Convergence, etc.)
Overview:
Reading field data during a simulation can dramatically impact performance. Field data is
read when observing simulation output, checking convergence, computing DFTs, etc. Both how much
of the volume is read and how frequently it is read affect simulation performance. The chart below
shows performance for different read volumes, expressed as a percentage of the total volume. All
six fields are read for each cell, and the number of time steps between reads is swept.
Tips:
- Keep the read volume to a minimum; only observe the region (volume) that is of direct
interest.
- Read only as frequently as is necessary to achieve accurate power, DFT, SAR, optical
generation, etc. results.
- For optical generation, far field, etc., only start to read after the simulation has converged.
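The tips above can be made concrete with a rough estimate of how much data each read moves. The sketch assumes six field components per cell (as in the chart) and 4 bytes per value (single precision, which is an assumption); the function name is hypothetical.

```python
def read_traffic_bytes(total_cells, volume_fraction, steps_between_reads,
                       fields=6, bytes_per_value=4):
    """Return (bytes per read, average bytes per time step).

    bytes_per_value=4 assumes single-precision field storage.
    """
    if steps_between_reads < 1:
        raise ValueError("steps_between_reads must be at least 1")
    per_read = total_cells * volume_fraction * fields * bytes_per_value
    return per_read, per_read / steps_between_reads


# 30 Mcell simulation, reading 25% of the volume every 10 time steps.
per_read, per_step = read_traffic_bytes(30_000_000, 0.25, 10)
print(f"{per_read / 1e6:.0f} MB per read, {per_step / 1e6:.0f} MB per time step on average")
```

Halving the read volume or doubling the interval between reads each halves the average traffic, which is why both tips matter.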
Figure: Read Performance. Performance (Mcells/s) versus the number of time steps between reads (All Fields Read Every X Time Steps), for 0%, 25%, 50%, 75% and 100% of the volume read.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Ex, Ey, Ez, Hx, Hy, Hz; Simulation: 30 Mcells, Cubic, 16 Lossy Dielectrics, 4 Layer PML.
Screen Savers
Overview:
Screen savers, especially graphics-intensive 3D types, can decrease simulation performance.
The performance difference between no screen saver and a basic screen saver is negligible.
Smaller simulations experience a greater percentage decrease in performance. Occasionally,
significantly worse performance is observed; this is abnormal.
Tips:
- Use a low-detail or blank screen saver, or no screen saver at all.
- Use the power management settings to turn off the monitor instead of using a screen saver.
- If you must use a screen saver, confirm that your performance is not degraded by more than
10-20%. If it is worse, please contact Acceleware.
Figure: Screen Saver Performance. Performance (Mcells/s) versus Simulation Size (Mcells) for screen saver settings None, Blank and 3D Pipes.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.
Simulation Orientation
Overview:
Single-GPU simulations where Z is the smallest dimension by a significant margin will
experience a decrease in simulation performance and maximum simulation size. This is due to the
way in which memory is allocated. The problem is not unique to GPU FDTD solutions; it is also
present in CPU-only FDTD solvers. The example below shows an extreme case, in which the smallest
dimension is 10% of the other dimensions. For less extreme cases the decrease in performance and
maximum simulation size is smaller. Partitioning across multiple GPUs will change the effective
simulation dimensions on each GPU, and hence the performance.
Tips:
- Rotate the simulation so that Z is not the smallest dimension.
- Avoid extreme differences between dimensions; cubic simulations show the best performance.
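A minimal sketch of the first tip, operating on grid dimensions only. It applies a cyclic axis rotation (old Z becomes X, old X becomes Y, old Y becomes Z) when Z is strictly the smallest dimension. The function name is hypothetical, and a real model would also need its geometry, sources and field components permuted in the same way.

```python
def rotate_if_z_smallest(nx, ny, nz):
    """Return grid dimensions rotated so that Z is not the smallest axis.

    Uses the cyclic rotation (x, y, z) -> (z, x, y). Dimensions are returned
    unchanged if Z is not strictly the smallest.
    """
    if nz < nx and nz < ny:
        return (nz, nx, ny)
    return (nx, ny, nz)


print(rotate_if_z_smallest(400, 400, 40))   # Z smallest, so the axes are rotated
print(rotate_if_z_smallest(40, 400, 400))   # already acceptable, left unchanged
```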
Figure: Smallest Dimension Performance. Performance (Mcells/s) versus Simulation Size (Mcells) for the orientations Cubic, X Smallest, Y Smallest and Z Smallest.
Orientation (x, y, z), where a is the smallest dimension and b is the size of the other two: X smallest (a, b, b); Y smallest (b, a, b); Z smallest (b, b, a).
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 16 Lossy Dielectrics, 4 Layer PML.
Number of Materials
Overview:
The number of materials can have a large impact on performance, up to a 20% decrease.
The type of material can also have an effect on performance. For simulations with a variety of both
E and H materials, the performance drop is more severe.
Tips:
- If possible, keep the number of materials below 1024.
- Make sure that all the materials are necessary; some applications add arbitrary complexity
by continually varying the number of materials.
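As one hypothetical way to act on these tips, the sketch below counts unique materials in a graded profile and shows how snapping relative permittivity values to a coarser grid reduces the count. The quantization step, the function name and the sample profile are all assumptions for illustration; whether such merging is acceptable depends on the accuracy the simulation requires.

```python
def merge_similar_materials(eps_r_values, step=0.05):
    """Snap relative permittivities to multiples of `step` and return the unique set."""
    bins = {round(value / step) for value in eps_r_values}
    return [b * step for b in sorted(bins)]


# A smoothly graded profile with 5000 distinct permittivity values between 1.0 and 4.0.
graded = [1.0 + 3.0 * i / 4999 for i in range(5000)]
merged = merge_similar_materials(graded)
print(len(set(graded)), "unique materials before merging")
print(len(merged), "unique materials after merging")
```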
Figure: Number of Materials Performance. Performance (Mcells/s) versus Unique Materials (#) at 1, 32, 1024 and 32768 materials, for E Materials, H Materials, and E and H Materials.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: 30 Mcells, Cubic, 4 Layer PML.
Dispersive Materials
Overview:
Dispersive materials have a large impact on simulation performance and maximum
simulation size. Both the order (number of poles) of the dispersive materials and the total number
of dispersive materials present affect performance: higher-order materials perform worse, and a
larger number of dispersive materials decreases performance further. This applies to all
dispersive material types (Drude, Debye, Lorentz, Drude-Lorentz, etc.). Dispersive materials also
run slower on the CPU, so the 'speed up factor' when using GPUs is roughly the same as for
non-dispersive simulations.
Case 1: 1600 non-dispersive materials distributed evenly throughout the entire simulation space.
Case 2: 1 single-pole dispersive material occupying 40% of the total volume contiguously.
Case 3: 1 single-pole dispersive material distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive material.
Case 4: 1600 multi-pole dispersive materials distributed contiguously throughout 40% of the total volume.
Case 5: 1600 multi-pole dispersive materials distributed evenly throughout the entire volume; 40% of the total volume is made up of dispersive materials.
Tips:
- Restrict the total volume of dispersive materials in any simulation.
- Use the minimum number and volume of dispersive materials needed to achieve the desired result.
Figure: Dispersive Performance. Performance (Mcells/s) versus Simulation Size (Mcells) for Case 1 through Case 5.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 4 Layer PML.
Mesh Density
Overview:
Increasing the mesh density does not always yield more accurate results, but it will always
increase run time. There are two reasons: first, there are more cells to compute; second, the
time step Δt must also decrease to maintain simulation stability, which increases the number of
time steps required for a given number of periods. The chart below demonstrates the naive linear
and actual effect of increasing mesh density on run time with a 10 Mcell simulation.
Tips:
- Only increase mesh density if your simulation accuracy requires it.
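The two effects described above can be combined into a rough scaling estimate. The sketch assumes a fixed physical domain refined uniformly in all three dimensions, a time step proportional to the cell size (as a Courant-type stability limit implies), and constant throughput; these are simplifying assumptions, and the function names are hypothetical.

```python
def naive_run_time_factor(cell_ratio):
    """Run-time multiplier if run time scaled only with the number of cells."""
    return cell_ratio


def estimated_run_time_factor(cell_ratio):
    """Run-time multiplier including the smaller time step.

    Growing the cell count by `cell_ratio` shrinks the cell size by
    cell_ratio ** (1/3), so the same number of periods needs that many
    times more time steps.
    """
    extra_time_steps = cell_ratio ** (1.0 / 3.0)
    return cell_ratio * extra_time_steps


# Going from 10 Mcells to 80 Mcells (twice the mesh density in each dimension).
ratio = 80 / 10
print(f"naive: {naive_run_time_factor(ratio):.1f}x, "
      f"estimated: {estimated_run_time_factor(ratio):.1f}x")
```

Under these assumptions run time grows with the 4/3 power of the cell count rather than linearly.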
Figure: Time to complete 6 periods. Simulation Time (h:mm:ss) versus Simulation Size (Mcells), comparing Actual and Naive Linear run times.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.
Windows Remote Log In
Overview:
Remote desktop software can have a large impact on simulation performance: it can reduce
speed by more than 50%, cause an error, or prevent the simulation from running at all. This
happens because the desktop is virtualized and in some cases access to the GPU is limited or
nonexistent. The desktop also uses GPU resources that are needed for computation.
Tips:
- Use a KVM switch, which has no impact on performance.
- Avoid remote desktop tools in general.
- If remote access is absolutely necessary, use Ultra VNC, which still causes some performance
decrease, as shown in the Ultra VNC Performance chart.
Figure: Ultra VNC Performance. Performance (Mcells/s) versus Simulation Size (Mcells) with VNC OFF and VNC ON.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); GPU: NVIDIA Tesla C1060 (Driver 6.14.11.9038); Observations: Off; Simulation: Cubic, 16 Lossy Dielectrics, 4 Layer PML.
Multi-GPU Systems
Overview:
Using multiple GPUs can have a dramatic effect on performance and total allowable
simulation size. Each GPU added to a configuration contributes 80-90 percent of single-GPU
performance and approximately 95 percent of single-GPU capacity to the total allowable
simulation size. For example, a simulation running on a single C1060 may achieve a throughput of
650 Mcells/s; if that same simulation were run on a 4xC1060 (S1070) configuration, the
performance would be 650 + 650*0.85*3 = ~2,300 Mcells/s.
Tips:
- Small simulations in the ramp up range will experience a smaller scaling factor than
simulations in the optimal range.
- Multi-GPU systems are able to run multiple simultaneous simulations using GPU targeting.
- Z smallest performance for multi-GPU systems will not be degraded to the same extent as
for single-GPU systems.
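The worked example in the overview can be sketched as a small estimator. It assumes every additional GPU contributes the same fixed fraction of single-GPU throughput (0.85 here, the midpoint of the 80-90 percent range) and of single-GPU capacity (0.95); the function names are hypothetical.

```python
def multi_gpu_throughput(single_gpu_mcells_s, gpu_count, scaling=0.85):
    """Estimated throughput in Mcells/s for `gpu_count` identical GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return single_gpu_mcells_s * (1 + scaling * (gpu_count - 1))


def multi_gpu_capacity(single_gpu_mcells, gpu_count, scaling=0.95):
    """Estimated maximum simulation size in Mcells for `gpu_count` identical GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return single_gpu_mcells * (1 + scaling * (gpu_count - 1))


# Four C1060 GPUs (one S1070), starting from 650 Mcells/s on a single GPU.
print(f"{multi_gpu_throughput(650, 4):.1f} Mcells/s")
```

As the first tip notes, simulations in the ramp up range will scale less well than this estimate suggests.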
Figure: Multi-GPU Performance. Performance (Mcells/s) versus Simulation Size (Mcells) from 0 to 1000 Mcells for Dual NVIDIA Tesla S1070, NVIDIA Tesla S1070, NVIDIA Quadro Plex 2200 D2 and NVIDIA Tesla C1060.
Test configuration: Acceleware SDK: Sanda (9.3.1.11545); Driver: 6.14.11.9038; Observations: Off; Simulation: Cubic, 4 Layer PML, 16 Lossy Dielectrics.
Figure: Multi-GPU Performance (Zoomed). Performance (Mcells/s) versus Simulation Size (Mcells) from 0 to 100 Mcells for the same four configurations.