ANSYS FLUENT Performance Benchmark and Profiling
May 2009
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, ANSYS, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank ANSYS for its support and guidance
• For more information, please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com, www.ansys.com
ANSYS FLUENT
• Computational Fluid Dynamics (CFD) is a computational technology
– Enables the study of the dynamics of things that flow
• By generating numerical solutions to a system of partial differential equations that describe fluid flow (see the equations after this list)
– Enables better understanding of the qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design
• CFD brings together a number of different disciplines
– Fluid dynamics, the mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science
• ANSYS FLUENT is a leading CFD application from ANSYS
– Widely used in almost every industry sector and manufactured product
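For illustration, the incompressible Navier-Stokes equations are the canonical example of such a system (a standard textbook formulation, not specific to any FLUENT model):

∂u/∂t + (u·∇)u = −(1/ρ)∇p + ν∇²u,    ∇·u = 0

where u is the velocity field, p the pressure, ρ the density, and ν the kinematic viscosity. CFD solvers such as FLUENT discretize these equations over a mesh and solve the resulting algebraic system in parallel.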
Objectives
• The presented research was conducted to provide best practices in:
– ANSYS FLUENT performance benchmarking
– Interconnect performance comparisons
– Performance enhancement of the latest FLUENT release
– Ways to increase FLUENT productivity
– Understanding FLUENT communication patterns
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: FLUENT 6.3.37, FLUENT 12.0
• Benchmark Workload
– New FLUENT Benchmark Suite
Mellanox InfiniBand Solutions
• Industry standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable, with congestion management
• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: InfiniBand Delivers the Lowest Latency. The InfiniBand performance gap over Ethernet and Fibre Channel is increasing; link-rate roadmap from 20Gb/s and 40Gb/s to 60Gb/s, 80Gb/s (4X), 120Gb/s, and 240Gb/s (12X)]
Quad-Core AMD Opteron™ Processor
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ processor with Direct Connect Architecture: dual-channel registered DDR2 memory, 8 GB/s HyperTransport links to PCI-E® bridges and the I/O hub (USB, PCI)]
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable architectures for high performance and productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements
– Integrated, tested and validated architectures
• Workload Modeling
– Optimized system size, configuration and workloads
– Test-bed benchmarks
– ISV application characterization
– Best practices & usage analysis
FLUENT Benchmark Results
• Input Dataset
– EDDY_417K
• Reacting flow with the Eddy Dissipation Model
• FLUENT 12 provides better performance and scalability
– Utilizing InfiniBand DDR to deliver the highest performance and scalability
[Chart: FLUENT Benchmark Result (Eddy_417K): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better]
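For context, the Rating plotted in these charts is the FLUENT benchmark suite's throughput metric: the number of benchmark jobs a system could complete in a day. Assuming the standard definition (stated here for reference, not spelled out in this deck):

Rating = 86,400 seconds/day ÷ wall-clock seconds per job

so a higher Rating means a faster solver run, and Rating scales directly with jobs-per-day productivity.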
FLUENT Benchmark Results
• Input Dataset
– Aircraft_2M
• External flow over an aircraft wing
• FLUENT 12 provides a performance and scalability increase
– Up to 107% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Aircraft_2M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 107% advantage]
FLUENT Benchmark Results
• Input Dataset
– Truck_14M
• External flow over a truck body
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 80% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_14M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 80% advantage]
FLUENT Benchmark Results
• Input Dataset
– Truck_Poly_14M
• External flow over a truck body with a polyhedral mesh
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 67% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_poly_14M): Rating vs. number of nodes (1-24), FLUENT 6.3.37 vs. FLUENT 12 over InfiniBand DDR; higher is better; up to 67% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– EDDY_417K (417 thousand elements)
• Reacting flow with the Eddy Dissipation Model
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 192% higher performance versus Gigabit Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Eddy_417K): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 192% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– Aircraft_2M (2 million elements)
• External flow over an aircraft wing
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 99% higher performance versus Gigabit Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Aircraft_2M): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 99% advantage]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset
– Truck_14M (14 million elements)
• External flow over a truck body
• InfiniBand DDR delivers higher performance and scalability
– Up to 36% higher performance versus Gigabit Ethernet (GigE)
• For bigger cases (more elements), the CPU is the bottleneck at larger node counts
– More server nodes (or cores) are required before increased parallelism shifts the bottleneck to the interconnect
[Chart: FLUENT 12.0 Benchmark Result (Truck_14M): Rating vs. number of nodes (1-24), InfiniBand DDR vs. Ethernet; higher is better; up to 36% advantage]
Enhancing FLUENT Productivity
• Test cases
– A single job over the entire system
– 2 jobs, each running on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity (the gain can be quantified as shown after this slide)
– Up to 90% more jobs per day for Eddy_417K
– Up to 30% more jobs per day for Aircraft_2M
– Up to 3% more jobs per day for Truck_14M
• The larger the number of elements, the higher the node count required for increased productivity
– The CPU is the bottleneck for larger numbers of servers
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs. 1 job): productivity gain (%) vs. number of nodes (4-24) for Eddy_417K, Aircraft_2M and Truck_14M over InfiniBand DDR; higher is better]
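A note on how such a gain can be computed (my reading of the plotted metric, not stated explicitly in the deck): if a single job achieves Rating R1 using all cores, and each of two concurrent jobs on half the cores achieves Rating R2, then total throughput goes from R1 to 2·R2 jobs per day, and

Productivity gain = (2·R2 / R1) − 1

For example, two concurrent jobs each running at 65% of the single-job Rating would yield 2 × 0.65 − 1 = 30% more jobs per day.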
Power Cost Savings with Different Interconnect
• InfiniBand saves up to $8,000 per year in power costs to finish the same number of FLUENT jobs compared to GigE
– Yearly figure for the 24-node cluster
• As cluster size increases, more power can be saved
• Power cost assumption: cost ($) = energy consumed (kWh) × $0.20/kWh
– For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (InfiniBand vs. GigE): yearly power savings ($) for Eddy_417K, Aircraft_2M and Truck_14M]
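The arithmetic behind such an estimate is straightforward; the C sketch below shows one way to compute it. All inputs (per-node power draw, per-job runtimes) are hypothetical placeholders, not measured values from this study:

#include <stdio.h>

int main(void) {
    const double usd_per_kwh    = 0.20;   /* $0.20/kWh, as assumed above      */
    const int    nodes          = 24;     /* 24-node cluster                  */
    const double watts_per_node = 300.0;  /* hypothetical average node draw   */

    /* Hypothetical wall-clock time per job on each interconnect (seconds) */
    const double t_gige = 500.0, t_ib = 370.0;

    /* Both clusters complete the same yearly job count (GigE runs flat out) */
    const double jobs_per_year = 365.0 * 86400.0 / t_gige;

    /* Energy per job in kWh: power (W) * time (s) / 3,600,000 (Ws per kWh) */
    double kwh_gige = nodes * watts_per_node * t_gige / 3.6e6;
    double kwh_ib   = nodes * watts_per_node * t_ib   / 3.6e6;

    double savings = (kwh_gige - kwh_ib) * jobs_per_year * usd_per_kwh;
    printf("Yearly power-cost savings: $%.0f\n", savings);
    return 0;
}

With these placeholder inputs the sketch prints roughly $3,280/year; the faster interconnect saves energy because the same job count finishes in fewer node-hours.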
Power Cost Savings with FLUENT upgrade
• FLUENT 12 saves up to ~$6,000 per year in power costs to finish the same number of FLUENT jobs compared to FLUENT 6.3.37
– Yearly figure for the 24-node cluster
• As cluster size increases, more power can be saved
• Power cost assumption: cost ($) = energy consumed (kWh) × $0.20/kWh
– For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (FLUENT 12 vs. FLUENT 6.3.37): yearly power savings ($) for Eddy_417K, Aircraft_2M and Truck_14M]
FLUENT Productivity Results Summary
• FLUENT 12 delivers a tremendous performance improvement over version 6.3.37
– Optimizations made for higher performance and scalability
– Optimizations included for AMD Opteron™ processor technology
• AMD contributed to the development and the QA stages of FLUENT 12
• Performance results on AMD technology established baseline performance
• InfiniBand enables higher performance and scalability than Ethernet
– Performance advantage extends as cluster size increases
• Efficient job placement can increase FLUENT productivity significantly
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low-latency InfiniBand enables unparalleled scalability
• InfiniBand enables up to $8000/year power savings compared to GigE
• FLUENT 12 reduces yearly power consumption by up to $6000 compared to FLUENT 6.3.37
FLUENT 6.3.37 Profiling – MPI Functions
• Most-used MPI functions
– MPI_Send, MPI_Recv, MPI_Reduce, and MPI_Bcast
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, MPI_Irecv) at 4, 8 and 16 nodes]
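Call counts like these are typically collected through MPI's standard profiling (PMPI) interface, which lets a tool intercept every MPI call. A minimal sketch in C (illustrative only; not the profiler actually used in this study):

#include <mpi.h>
#include <stdio.h>

static long send_calls = 0;

/* Intercept MPI_Send: every MPI function has a PMPI_ twin that a wrapper
 * can forward to after recording whatever statistics it wants. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;  /* record the call, then forward to the real implementation */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Report per-rank totals when the application shuts MPI down */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Send called %ld times\n", rank, send_calls);
    return PMPI_Finalize();
}

Linked ahead of the MPI library, a wrapper like this counts calls per function without modifying the application.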
FLUENT 6.3.37 Profiling – Timing
• MPI_Recv shows the highest communication overhead
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): total overhead (s) per MPI function at 4, 8 and 16 nodes; MPI_Recv dominates]
FLUENT 6.3.37 Profiling – Message Transferred
• Most data-related MPI messages are 256B-1KB in size
• Typical MPI synchronization messages are smaller than 64B
• The number of messages increases with cluster size
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B, 64-256B, 256B-1KB, 1-4KB, 4-16KB, 16-64KB, 64-256KB, 256KB-1M, 1-4M, 4M-2GB) at 4, 8 and 16 nodes]
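A message-size histogram like this one can be gathered by extending the PMPI wrapper sketched earlier to bin each message by its payload size. A sketch, with the bin edges assumed to match the chart above:

#include <mpi.h>

static long bins[10];  /* 0-64B, 64-256B, 256B-1KB, ... , 4M-2GB */

/* Map a byte count to a histogram bin; edges grow by 4x (64B, 256B, 1KB, ...) */
static void record(long bytes)
{
    int b = 0;
    long edge = 64;
    while (b < 9 && bytes > edge) { edge *= 4; b++; }
    bins[b]++;
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(datatype, &size);   /* bytes per element of this datatype */
    record((long)count * size);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}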
FLUENT 12 Profiling – MPI Functions
• Most-used MPI functions
– MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): number of calls (thousands) per MPI function (MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, MPI_Irecv) at 4, 8 and 16 nodes]
FLUENT 12 Profiling – Timing
• MPI_Recv and MPI_Allreduce show the highest communication overhead
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): total overhead (s) per MPI function at 4, 8 and 16 nodes; MPI_Recv and MPI_Allreduce dominate]
FLUENT 12 Profiling – Message Transferred
• Most data-related MPI messages are 256B-1KB in size
• Typical MPI synchronization messages are smaller than 64B
• The number of messages increases with cluster size
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B through 4M-2GB) at 4, 8 and 16 nodes]
FLUENT Profiling - Message Transferred
• FLUENT 12 reduces the total number of messages
• Further optimization can be made to take greater advantage of high-speed, low-latency interconnects
[Chart: FLUENT Benchmark Profiling Result (Truck_poly_14M): number of messages (thousands) by message size (0-64B through 1-4M), FLUENT 6.3.37 vs. FLUENT 12]
FLUENT Profiling Summary
• FLUENT 12 and FLUENT 6.3.37 were profiled to identify their communication patterns
• Frequently used message sizes
– 256B-1KB messages for data-related communications
– <64B messages for synchronization
– The number of messages increases with cluster size
• MPI functions
– FLUENT 12 introduced MPI collective functions
• MPI_Allreduce helps improve communication efficiency (see the sketch after this list)
• Interconnect effects on FLUENT performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Recv) strongly influence FLUENT performance
– Further optimization can be made to take greater advantage of high-speed networks
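To illustrate why the collective matters: a single MPI_Allreduce performs the combined work of the MPI_Reduce-then-MPI_Bcast pattern that dominates the FLUENT 6.3.37 profile, and the MPI library is free to implement it with an optimized algorithm. A minimal, self-contained C sketch (illustrative only, not FLUENT source code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-step pattern: reduce partial sums to rank 0, then broadcast back */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&global, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Single collective: same result on every rank, typically fewer
     * messages and better use of a low-latency interconnect */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %g\n", global);
    MPI_Finalize();
    return 0;
}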
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.