
CWG13 Meeting

Performance measurement with ZeroMQ and FairMQ

20/02/15

Mohammad Al-Turany


ZeroMQ performance test suite

• ZeroMQ provides tools to measure the bandwidth and latency of the network. The following executables are built by default and located in the perf subdirectory:
– local_lat
– remote_lat
– local_thr
– remote_thr


ØMQ performance test suite
• Latency test
– Consists of local_lat and remote_lat. These are to be placed on the two boxes between which you wish to measure latency.


$ remote_lat tcp://192.168.0.111:5555 1 100000

$ local_lat tcp://eth0:5555 1 100000

message size: 1 [B]
roundtrip count: 100000
average latency: 30.915 [us]

The latency reported is the one-way latency.

We have not performed this test so far.


ØMQ performance test suite
• Throughput test
– Consists of local_thr and remote_thr. These are to be placed on the two boxes between which you wish to measure throughput.


$ remote_thr tcp://192.168.0.111:5555 1 100000

$ local_thr tcp://eth0:5555 1 100000

message size: 1 [B]
message count: 1000000
mean throughput: 5554568 [msg/s]
mean throughput: 44.437 [Mb/s]
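The two throughput figures in this output describe the same measurement in different units. The following minimal sketch shows the conversion, assuming that 1 Mb means 10^6 bits; the function name `throughput_mbit_per_s` is chosen here for illustration and is not part of the ZeroMQ tools.

```python
def throughput_mbit_per_s(msgs_per_s: float, msg_size_bytes: int) -> float:
    """Convert a message rate into megabits per second (assumes 1 Mb = 10**6 bits)."""
    bits_per_s = msgs_per_s * msg_size_bytes * 8
    return bits_per_s / 1e6


# Values taken from the sample output above: 1-byte messages at 5554568 msg/s.
rate = throughput_mbit_per_s(5554568, 1)
print(f"mean throughput: {rate:.3f} [Mb/s]")
```

With 1-byte messages the result stays far below the link capacity, which is why the cluster measurements are quoted for message sizes in the kByte to MByte range.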


Running the Zero MQ performance test on the DAQ test cluster

[Figure: throughput in Mbit/s (0 to 40000) versus message size in kByte (0 to 100), hosts aidrefma02 and aidrefma01, with curves for 1, 2, 3 and 4 processes]


[Figure: throughput in Gbit/s (0 to 40) versus message size in MByte (0 to 60)]


[Figure: throughput in GByte/s (1.00 to 5.00) versus message size in MByte (0 to 60)]

Performance test with FairMQ: FLP to EPN

FLP: aidrefma01
EPN: aidrefma02

• Push-Pull pattern
• Message size = 10 MByte
• Throughput = 2.6 GByte/s

[Figure: FLP on aidrefma01, EPN on aidrefma02]


[Figure: FLP on aidrefma01, EPN on aidrefma03]



Is a node that uses 3 (4) cores to receive data via Ethernet or IP over IB at a rate of more than 4 GByte/s still usable for reconstruction?


STREAM: Sustainable Memory Bandwidth in High Performance Computers

• A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

• Specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector style applications.


http://www.cs.virginia.edu/stream/


STREAM settings

This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 200000000 (elements), Offset = 0 (elements)
Memory per array = 1525.9 MiB (= 1.5 GiB).
Total memory required = 4577.6 MiB (= 4.5 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 12
Number of Threads counted = 12
-------------------------------------------------------------
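The memory figures in these settings follow directly from the array size. A minimal sketch of the arithmetic, assuming the three work arrays (a, b, c) that STREAM allocates; the helper name `stream_memory_mib` is illustrative only:

```python
MIB = 1024 ** 2  # bytes per MiB


def stream_memory_mib(n_elements: int, bytes_per_element: int = 8, n_arrays: int = 3):
    """Return (memory per array, total memory) in MiB for a STREAM run."""
    per_array = n_elements * bytes_per_element / MIB
    return per_array, per_array * n_arrays


# Array size reported in the settings above.
per_array, total = stream_memory_mib(200_000_000)
print(f"Memory per array      = {per_array:.1f} MiB")
print(f"Total memory required = {total:.1f} MiB")
```

At about 4.5 GiB in total, the working set is far larger than the CPU caches, which is the condition STREAM needs in order to measure main-memory bandwidth.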


• STREAM is intended to measure the bandwidth from main memory


Performance and bandwidth test with FairMQ: FLP to EPN

EPN: aidrefma02

CERN DAQ Lab system:
• 40 G Ethernet
• Dual socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:       15258.3         0.017153   0.010486   0.025462
Scale:      15019.2         0.017180   0.010653   0.025397
Add:        16883.6         0.021488   0.014215   0.036001
Triad:      16831.6         0.021190   0.014259   0.035066

--------------------------------------------------------------
name     kernel                  bytes/iter   FLOPS/iter
--------------------------------------------------------------
COPY:    a(i) = b(i)                 16            0
SCALE:   a(i) = q*b(i)               16            1
SUM:     a(i) = b(i) + c(i)          24            1
TRIAD:   a(i) = b(i) + q*c(i)        24            2
--------------------------------------------------------------
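The "Best Rate" column can be reproduced from the bytes moved per iteration and the minimum time of each kernel. The sketch below assumes an array size of 10000000 elements, which is the value printed in the STREAM output on the backup slide and yields the rates listed here; the function name `best_rate_mb_s` is illustrative.

```python
N_ELEMENTS = 10_000_000  # assumption: array size from the backup-slide output

# kernel name -> (bytes per iteration, minimum time in seconds), from the tables above
KERNELS = {
    "Copy": (16, 0.010486),
    "Scale": (16, 0.010653),
    "Add": (24, 0.014215),
    "Triad": (24, 0.014259),
}


def best_rate_mb_s(bytes_per_iter: int, n_elements: int, min_time_s: float) -> float:
    """Bandwidth in MB/s (10**6 bytes per second) from the fastest run of a kernel."""
    return 1e-6 * bytes_per_iter * n_elements / min_time_s


rates = {name: best_rate_mb_s(b, N_ELEMENTS, t) for name, (b, t) in KERNELS.items()}
for name, rate in rates.items():
    print(f"{name + ':':8s}{rate:10.1f} MB/s")
```

Small deviations in the last digit are expected, because the minimum times are printed with only six decimal places.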

FLP: aidrefma02
EPN: aidrefma01

8 MB messages: 4.7 GByte/s

CERN DAQ Lab system:
• 40 G Ethernet
• Dual socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90GHz, 2x8 hw cores - 32 threads, 64GB RAM

Function    Best Rate MB/s   Change
Copy:       12782.6          -16 %
Scale:      12319.0          -18 %
Add:        14210.4          -16 %
Triad:      14317.3          -15 %
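The percentages can be checked against the two sets of STREAM results. This is a minimal sketch that reads the earlier table as the baseline and this one as the measurement taken while 8 MB messages were being transferred; that reading, and the names `baseline` and `during_transfer`, are assumptions made for illustration.

```python
# Best rates in MB/s, copied from the two STREAM result tables.
baseline = {"Copy": 15258.3, "Scale": 15019.2, "Add": 16883.6, "Triad": 16831.6}
during_transfer = {"Copy": 12782.6, "Scale": 12319.0, "Add": 14210.4, "Triad": 14317.3}


def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


change = {
    name: round(relative_change_percent(baseline[name], during_transfer[name]))
    for name in baseline
}
for name, pct in change.items():
    print(f"{name + ':':8s}{pct:+d} %")
```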

FLP: aidrefma02
EPN: aidrefma01

CPU time in seconds needed to simulate 1000 events, 10 protons, in FairRoot example 3 (Geant)

Without MQ   With 4 MB messages   With 8 MB messages
54           61                   68
58           64                   61
54           66                   62
58           56                   57
57           56                   57
55           63                   64
58           63                   57
58           64                   67
60           65                   57
58           65                   57
61           66                   62
57           56                   65

Mean:
57.3         62.1                 61.2
Reported increase:
             5%                   4%

4 MB messages: 4.5 GByte/s
8 MB messages: 4.7 GByte/s

Run 12 processes

FLP: aidrefma02
EPN: aidrefma01

CPU time in seconds needed to simulate 1000 events, 100 protons, in FairRoot example 3 (Geant)

Without MQ   With 8 MB messages
565          605
573          615
570          598
573          603
565          602
563          601
570          619
570          598
576          616
574          606
567          609
577          595

Mean:
570.2        605.6
Reported increase:
             6%

8 MB messages: 4.7 GByte/s
2.8 TByte total data transfer

Run 12 processes
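The mean CPU times and the quoted increase for the 100-proton run can be recomputed from the twelve per-process values. The sketch below also includes a rough cross-check of the total transferred data, under the assumption (not stated on the slide) that the transfer ran at a constant 4.7 GByte/s for about the mean run time; all variable names are illustrative.

```python
from statistics import mean

# CPU time in seconds per process, copied from the table above (12 processes).
without_mq = [565, 573, 570, 573, 565, 563, 570, 570, 576, 574, 567, 577]
with_mq_8mb = [605, 615, 598, 603, 602, 601, 619, 598, 616, 606, 609, 595]

mean_without = mean(without_mq)
mean_with = mean(with_mq_8mb)
increase_percent = 100.0 * (mean_with - mean_without) / mean_without

# Rough estimate only: assumes a constant 4.7 GByte/s for the mean run time.
estimated_tbyte = 4.7 * mean_with / 1000.0

print(f"mean without MQ:      {mean_without:.2f} s")
print(f"mean with 8 MB msgs:  {mean_with:.2f} s")
print(f"increase:             {increase_percent:.1f} %")
print(f"estimated data moved: {estimated_tbyte:.1f} TByte")
```

The estimate lands close to the 2.8 TByte quoted on the slide, which suggests the data transfer ran for roughly the full duration of the simulation.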


Backup and Discussion


Run on STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21173 microseconds.
   (= 21173 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:       15258.3         0.017153   0.010486   0.025462
Scale:      15019.2         0.017180   0.010653   0.025397
Add:        16883.6         0.021488   0.014215   0.036001
Triad:      16831.6         0.021190   0.014259   0.035066
-------------------------------------------------------------
