ANSYS FLUENT Performance Benchmark and Profiling
May 2009
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, ANSYS, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank ANSYS for its support and guidance
• For more information, please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com, www.ansys.com
ANSYS FLUENT
• Computational Fluid Dynamics (CFD) is a computational technology
– Enables the study of the dynamics of things that flow
• By generating numerical solutions to a system of partial differential equations that describe fluid flow (see the equations after this list)
– Enables better understanding of the qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design
• CFD brings together a number of different disciplines
– Fluid dynamics, the mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science
• ANSYS FLUENT is a leading CFD application from ANSYS
– Widely used in almost every industry sector and manufactured product
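For illustration, the incompressible Navier-Stokes equations are the canonical example of such a system (a standard textbook formulation, not specific to any FLUENT model):

∂u/∂t + (u·∇)u = −(1/ρ)∇p + ν∇²u,    ∇·u = 0

where u is the velocity field, p the pressure, ρ the density, and ν the kinematic viscosity. CFD solvers such as FLUENT discretize these equations over a mesh and solve the resulting algebraic system in parallel.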
Objectives
• The presented research was conducted to provide best practices in:
– ANSYS FLUENT performance benchmarking
– Interconnect performance comparisons
– Performance enhancement of the latest FLUENT release
– Ways to increase FLUENT productivity
– Understanding FLUENT communication patterns
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: FLUENT 6.3.37, FLUENT 12.0
• Benchmark Workload
– New FLUENT Benchmark Suite
Mellanox InfiniBand Solutions
• Industry standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable, with congestion management
• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: InfiniBand Delivers the Lowest Latency. The InfiniBand performance gap over Ethernet and Fibre Channel is increasing; link-rate roadmap from 20Gb/s and 40Gb/s to 60Gb/s, 80Gb/s (4X), 120Gb/s, and 240Gb/s (12X)]
Quad-Core AMD Opteron™ Processor
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture: dual-channel registered DDR2 memory, 8 GB/s HyperTransport links to PCI-E® bridges and the I/O hub (USB, PCI)]
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable architectures for high performance and productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements
– Integrated, tested and validated architectures
• Workload Modeling
– Optimized system size, configuration and workloads
– Test-bed benchmarks
– ISV application characterization
– Best practices & usage analysis
FLUENT Benchmark Results
• Input Dataset
– EDDY_417K
• Reacting flow with the Eddy Dissipation Model
• FLUENT 12 provides better performance and scalability
– Utilizing InfiniBand DDR to deliver the highest performance and scalability
[Chart: FLUENT Benchmark Result (Eddy_417K): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better]
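For context, the Rating plotted in these charts is the FLUENT benchmark suite's throughput metric: the number of benchmark jobs a system could complete in a day. Assuming the standard definition (stated here for reference, not spelled out in this deck):

Rating = 86,400 seconds/day ÷ wall-clock seconds per job

so a higher Rating means a faster solver run, and Rating scales directly with jobs-per-day productivity.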
FLUENT Benchmark Results
• Input Dataset
– Aircraft_2M
• External flow over an aircraft wing
• FLUENT 12 provides a performance and scalability increase
– Up to 107% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Aircraft_2M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 107% advantage]
FLUENT Benchmark Results
• Input Dataset
– Truck_14M
• External flow over a truck body
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 80% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_14M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 80% advantage]
FLUENT Benchmark Results
• Input Dataset
– Truck_Poly_14M
• External flow over a truck body with a polyhedral mesh
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 67% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_poly_14M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 67% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– EDDY_417K (417 thousand elements)
• Reacting flow with the Eddy Dissipation Model
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 192% higher performance versus Gigabit Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Eddy_417K): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 192% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– Aircraft_2M (2 million elements)
• External flow over an aircraft wing
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 99% higher performance versus Gigabit Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Aircraft_2M): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 99% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– Truck_14M (14 million elements)
• External flow over a truck body
• InfiniBand DDR delivers higher performance and scalability
– Up to 36% higher performance versus Gigabit Ethernet (GigE)
• For bigger cases (more elements), the CPU is the bottleneck at larger node counts
– More server nodes (or cores) are required before increased parallelism shifts the bottleneck to the interconnect
[Chart: FLUENT 12.0 Benchmark Result (Truck_14M): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 36% advantage]
Enhancing FLUENT Productivity
• Test cases
– A single job over the entire system
– 2 jobs, each running on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity (the gain can be quantified as shown after this slide)
– Up to 90% more jobs per day for Eddy_417K
– Up to 30% more jobs per day for Aircraft_2M
– Up to 3% more jobs per day for Truck_14M
• The larger the number of elements, the higher the node count required for increased productivity
– The CPU is the bottleneck for larger numbers of servers
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs. 1 job): productivity gain (%) vs. number of nodes (4-24) for Eddy_417K, Aircraft_2M and Truck_14M over InfiniBand DDR; higher is better]
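A note on how such a gain can be computed (my reading of the plotted metric, not stated explicitly in the deck): if a single job achieves Rating R1 using all cores, and each of two concurrent jobs on half the cores achieves Rating R2, then total throughput goes from R1 to 2·R2 jobs per day, and

Productivity gain = (2·R2 / R1) − 1

For example, two concurrent jobs each running at 65% of the single-job Rating would yield 2 × 0.65 − 1 = 30% more jobs per day.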
Power Cost Savings with Different Interconnect
• InfiniBand saves up to $8,000 per year in power costs to finish the same number of FLUENT jobs compared to GigE
– Yearly figure for the 24-node cluster
• As cluster size increases, more power can be saved
• Power cost assumption: cost ($) = energy consumed (kWh) × $0.20/kWh
– For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (InfiniBand vs. GigE): yearly power savings ($) for Eddy_417K, Aircraft_2M and Truck_14M]
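The arithmetic behind such an estimate is straightforward; the C sketch below shows one way to compute it. All inputs (per-node power draw, per-job runtimes) are hypothetical placeholders, not measured values from this study:

#include <stdio.h>

int main(void) {
    const double usd_per_kwh    = 0.20;   /* $0.20/kWh, as assumed above      */
    const int    nodes          = 24;     /* 24-node cluster                  */
    const double watts_per_node = 300.0;  /* hypothetical average node draw   */

    /* Hypothetical wall-clock time per job on each interconnect (seconds) */
    const double t_gige = 500.0, t_ib = 370.0;

    /* Both clusters complete the same yearly job count (GigE runs flat out) */
    const double jobs_per_year = 365.0 * 86400.0 / t_gige;

    /* Energy per job in kWh: power (W) * time (s) / 3,600,000 (Ws per kWh) */
    double kwh_gige = nodes * watts_per_node * t_gige / 3.6e6;
    double kwh_ib   = nodes * watts_per_node * t_ib   / 3.6e6;

    double savings = (kwh_gige - kwh_ib) * jobs_per_year * usd_per_kwh;
    printf("Yearly power-cost savings: $%.0f\n", savings);
    return 0;
}

With these placeholder inputs the sketch prints roughly $3,280/year; the faster interconnect saves energy because the same job count finishes in fewer node-hours.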
Power Cost Savings with FLUENT upgrade
• FLUENT 12 saves up to ~$6,000 per year in power costs to finish the same number of FLUENT jobs compared to FLUENT 6.3.37
– Yearly figure for the 24-node cluster
• As cluster size increases, more power can be saved
• Power cost assumption: cost ($) = energy consumed (kWh) × $0.20/kWh
– For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (FLUENT 12 vs. FLUENT 6.3.37): yearly power savings ($) for Eddy_417K, Aircraft_2M and Truck_14M]
FLUENT Productivity Results Summary
• FLUENT 12 delivers a tremendous performance improvement over version 6.3.37
– Optimizations made for higher performance and scalability
– Optimizations included for AMD Opteron™ processor technology
• AMD contributed to the development and the QA stages of FLUENT 12
• Performance results on AMD technology established baseline performance
• InfiniBand enables higher performance and scalability than Ethernet
– Performance advantage extends as cluster size increases
• Efficient job placement can increase FLUENT productivity significantly
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low-latency InfiniBand enables unparalleled scalability
• InfiniBand enables up to $8000/year power savings compared to GigE
• FLUENT 12 reduces yearly power consumption by up to $6000 compared to FLUENT 6.3.37
FLUENT 6.3.37 Profiling – MPI Functions
• Most-used MPI functions
– MPI_Send, MPI_Recv, MPI_Reduce, and MPI_Bcast
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, MPI_Irecv) at 4, 8 and 16 nodes]
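Call counts like these are typically collected through MPI's standard profiling (PMPI) interface, which lets a tool intercept every MPI call. A minimal sketch in C (illustrative only; not the profiler actually used in this study):

#include <mpi.h>
#include <stdio.h>

static long send_calls = 0;

/* Intercept MPI_Send: every MPI function has a PMPI_ twin that a wrapper
 * can forward to after recording whatever statistics it wants. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;  /* record the call, then forward to the real implementation */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Report per-rank totals when the application shuts MPI down */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Send called %ld times\n", rank, send_calls);
    return PMPI_Finalize();
}

Linked ahead of the MPI library, a wrapper like this counts calls per function without modifying the application.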
FLUENT 6.3.37 Profiling – Timing
• MPI_Recv shows the highest communication overhead
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): total overhead (s) per MPI function at 4, 8 and 16 nodes; MPI_Recv dominates]
FLUENT 6.3.37 Profiling – Message Transferred
• Most data-related MPI messages are 256B-1KB in size
• Typical MPI synchronization messages are smaller than 64B
• The number of messages increases with cluster size
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B, 64-256B, 256B-1KB, 1-4KB, 4-16KB, 16-64KB, 64-256KB, 256KB-1M, 1-4M, 4M-2GB) at 4, 8 and 16 nodes]
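A message-size histogram like this one can be gathered by extending the PMPI wrapper sketched earlier to bin each message by its payload size. A sketch, with the bin edges assumed to match the chart above:

#include <mpi.h>

static long bins[10];  /* 0-64B, 64-256B, 256B-1KB, ... , 4M-2GB */

/* Map a byte count to a histogram bin; edges grow by 4x (64B, 256B, 1KB, ...) */
static void record(long bytes)
{
    int b = 0;
    long edge = 64;
    while (b < 9 && bytes > edge) { edge *= 4; b++; }
    bins[b]++;
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(datatype, &size);   /* bytes per element of this datatype */
    record((long)count * size);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}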
FLUENT 12 Profiling – MPI Functions
• Most-used MPI functions
– MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, MPI_Irecv) at 4, 8 and 16 nodes]
FLUENT 12 Profiling – Timing
• MPI_Recv and MPI_Allreduce show the highest communication overhead
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): total overhead (s) per MPI function at 4, 8 and 16 nodes; MPI_Recv and MPI_Allreduce dominate]
FLUENT 12 Profiling – Message Transferred
• Most data-related MPI messages are 256B-1KB in size
• Typical MPI synchronization messages are smaller than 64B
• The number of messages increases with cluster size
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B through 4M-2GB) at 4, 8 and 16 nodes]
FLUENT Profiling - Message Transferred
• FLUENT 12 reduces the total number of messages
• Further optimization can be made to take greater advantage of high-speed, low-latency interconnects
[Chart: FLUENT Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B through 1-4M), FLUENT 6.3.37 vs. FLUENT 12]
FLUENT Profiling Summary
• FLUENT 12 and FLUENT 6.3.37 were profiled to identify their communication patterns
• Frequently used message sizes
– 256B-1KB messages for data-related communications
– <64B messages for synchronization
– The number of messages increases with cluster size
• MPI functions
– FLUENT 12 introduced MPI collective functions
• MPI_Allreduce helps improve communication efficiency (see the sketch after this list)
• Interconnect effects on FLUENT performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Recv) strongly influence FLUENT performance
– Further optimization can be made to take greater advantage of high-speed networks
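To illustrate why the collective matters: a single MPI_Allreduce performs the combined work of the MPI_Reduce-then-MPI_Bcast pattern that dominates the FLUENT 6.3.37 profile, and the MPI library is free to implement it with an optimized algorithm. A minimal, self-contained C sketch (illustrative only, not FLUENT source code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-step pattern: reduce partial sums to rank 0, then broadcast back */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&global, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Single collective: same result on every rank, typically fewer
     * messages and better use of a low-latency interconnect */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %g\n", global);
    MPI_Finalize();
    return 0;
}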
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.