CWG13 Meeting
Performance measurement with ZeroMQ and FairMQ
20/02/15
Mohammad Al-Turany
ZeroMQ performance test suite
• ZeroMQ ships with tools to measure the bandwidth and latency of the network. The following executables are built by default and located in the perf subdirectory:
– local_lat
– remote_lat
– local_thr
– remote_thr
ØMQ performance test suite
• Latency test
– Consists of local_lat and remote_lat. These are placed on the two boxes between which you wish to measure latency.
$ remote_lat tcp://192.168.0.111:5555 1 100000
$ local_lat tcp://eth0:5555 1 100000
message size: 1 [B]
roundtrip count: 100000
average latency: 30.915 [us]
The latency reported is the one-way latency.
We have not performed this test up to now!
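The one-way figure follows from the round-trip timing. A minimal Python sketch of that arithmetic, assuming the tool halves the mean round-trip time; the function name and the total elapsed time are illustrative and not taken from a measurement:

```python
def one_way_latency_us(elapsed_us: float, roundtrip_count: int) -> float:
    """Average one-way latency in microseconds from a total round-trip time."""
    if roundtrip_count <= 0:
        raise ValueError("roundtrip_count must be positive")
    # Each round trip crosses the network twice, so divide by 2 * count.
    return elapsed_us / (roundtrip_count * 2)


# Hypothetical total: 100000 round trips taking 6.183 s altogether.
latency = one_way_latency_us(6_183_000.0, 100_000)
print(f"average latency: {latency:.3f} [us]")
```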
ØMQ performance test suite
• Throughput test
– Consists of local_thr and remote_thr. These are placed on the two boxes between which you wish to measure throughput.
$ remote_thr tcp://192.168.0.111:5555 1 100000
$ local_thr tcp://eth0:5555 1 100000
message size: 1 [B]
message count: 1000000
mean throughput: 5554568 [msg/s]
mean throughput: 44.437 [Mb/s]
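The two throughput lines are related by the message size. A short Python sketch of the conversion, assuming 1 Mb = 10^6 bits; the function name is illustrative:

```python
def throughput_mbit_per_s(messages_per_s: float, message_size_bytes: int) -> float:
    """Convert a message rate into megabits per second (1 Mb = 10**6 bits)."""
    return messages_per_s * message_size_bytes * 8 / 1e6


# Values from the sample output above: 5554568 msg/s at 1 byte per message.
mbps = throughput_mbit_per_s(5_554_568, 1)
print(f"mean throughput: {mbps:.3f} [Mb/s]")
```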
Running the ZeroMQ performance test on the DAQ test cluster
[Figure: throughput in Mbit/s (0 to 40000) versus message size in kByte (0 to 100) between aidrefma02 and aidrefma01, for 1, 2, 3 and 4 processes]
Running the ZeroMQ performance test on the DAQ test cluster
[Figure: throughput in Gbit/s (0 to 40) versus message size in MByte (0 to 60) between aidrefma02 and aidrefma01, for 1, 2, 3 and 4 processes]
Running the ZeroMQ performance test on the DAQ test cluster
[Figure: throughput in GByte/s (axis up to 5.00) versus message size in MByte (0 to 60) between aidrefma02 and aidrefma01, for 1, 2, 3 and 4 processes]
Performance test with FairMQ FLP 2 EPN
FLP: aidrefma01
EPN: aidrefma02
Push-Pull pattern
Message size = 10 MByte
Throughput = 2.6 GByte/s
Performance test with FairMQ FLP 2 EPN
FLP: aidrefma01
EPN: aidrefma02
Push-Pull pattern
Message size = 10 MByte
Throughput = 3.7 GByte/s
Performance test with FairMQ FLP 2 EPN
FLP: aidrefma01
EPN: aidrefma03
Push-Pull pattern
Message size = 10 MByte
Throughput = 4.8 GByte/s
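To put the three results in context, the throughput can be expressed as a fraction of the link rate. A short Python sketch, assuming the 40 G Ethernet link quoted for the DAQ lab system, decimal units (1 GByte = 10^9 bytes) and no allowance for protocol overhead; the constant and function names are illustrative:

```python
LINK_GBIT_PER_S = 40.0  # assumed: nominal rate of the 40 G Ethernet link


def link_utilisation(throughput_gbyte_per_s: float,
                     link_gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Fraction of the nominal link rate used by a payload throughput."""
    return throughput_gbyte_per_s * 8 / link_gbit_per_s


# The three FairMQ throughput values quoted above, in GByte/s.
for gbyte_per_s in (2.6, 3.7, 4.8):
    gbit_per_s = gbyte_per_s * 8
    share = link_utilisation(gbyte_per_s)
    print(f"{gbyte_per_s} GByte/s = {gbit_per_s:.1f} Gbit/s ({share:.0%} of the link)")
```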
Is a node that uses 3 (4) cores to receive data via Ethernet or IP over InfiniBand (IPoIB) at a rate of more than 4 GByte/s still usable for reconstruction?
STREAM: Sustainable Memory Bandwidth in High Performance Computers
• A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
• Specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector-style applications.
http://www.cs.virginia.edu/stream/
Stream Settings
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 200000000 (elements), Offset = 0 (elements)
Memory per array = 1525.9 MiB (= 1.5 GiB).
Total memory required = 4577.6 MiB (= 4.5 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 12
Number of Threads counted = 12
-------------------------------------------------------------
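The memory figures in these settings follow from the array size. A small Python sketch of that arithmetic, using the three arrays (a, b, c) that STREAM operates on; the constant and function names are illustrative:

```python
BYTES_PER_ELEMENT = 8   # double precision, as reported in the settings
NUM_ARRAYS = 3          # STREAM operates on three arrays: a, b and c
MIB = 2 ** 20


def stream_memory_mib(array_size: int) -> tuple:
    """Return (memory per array, total memory) in MiB for a STREAM run."""
    per_array = array_size * BYTES_PER_ELEMENT / MIB
    return per_array, per_array * NUM_ARRAYS


per_array, total = stream_memory_mib(200_000_000)
print(f"Memory per array = {per_array:.1f} MiB")
print(f"Total memory required = {total:.1f} MiB")
```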
• STREAM is intended to measure the bandwidth from main memory
Performance and bandwidth test with FairMQ FLP 2 EPN
EPN: aidrefma02
CERN DAQ Lab system: 40 G Ethernet; dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores - 32 threads, 64 GB RAM
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:          15258.3        0.017153    0.010486    0.025462
Scale:         15019.2        0.017180    0.010653    0.025397
Add:           16883.6        0.021488    0.014215    0.036001
Triad:         16831.6        0.021190    0.014259    0.035066
--------------------------------------------------------------
name      kernel                  bytes/iter    FLOPS/iter
--------------------------------------------------------------
COPY:     a(i) = b(i)                 16            0
SCALE:    a(i) = q*b(i)               16            1
SUM:      a(i) = b(i) + c(i)          24            1
TRIAD:    a(i) = b(i) + q*c(i)        24            2
--------------------------------------------------------------
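The four kernels in the table are simple element-wise loops. A pure-Python sketch of what each one computes; the real benchmark is compiled code run over very large arrays, and the function names and sample values here are illustrative:

```python
def copy(b):
    return [bi for bi in b]                       # a(i) = b(i)


def scale(b, q):
    return [q * bi for bi in b]                   # a(i) = q*b(i)


def add(b, c):
    return [bi + ci for bi, ci in zip(b, c)]      # a(i) = b(i) + c(i)


def triad(b, c, q):
    return [bi + q * ci for bi, ci in zip(b, c)]  # a(i) = b(i) + q*c(i)


# Bytes moved per iteration with 8-byte elements: one store plus one or two loads.
BYTES_PER_ITER = {"copy": 2 * 8, "scale": 2 * 8, "add": 3 * 8, "triad": 3 * 8}

b = [1.0, 2.0, 3.0]
c = [10.0, 20.0, 30.0]
q = 3.0
print(copy(b), scale(b, q), add(b, c), triad(b, c, q))
print(BYTES_PER_ITER)
```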
Performance and bandwidth test with FairMQ FLP 2 EPN
FLP: aidrefma02
EPN: aidrefma01
8 MB messages: 4.7 GByte/s
CERN DAQ Lab system: 40 G Ethernet; dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores - 32 threads, 64 GB RAM
Function    Best Rate MB/s    Change
Copy:          12782.6        -16 %
Scale:         12319.0        -18 %
Add:           14210.4        -16 %
Triad:         14317.3        -15 %
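The percentages can be reproduced from the two sets of STREAM rates. A small Python sketch, assuming the reference values are the Best Rate figures from the STREAM run without data transfer; the variable and function names are illustrative:

```python
# Best Rate in MB/s without data transfer (assumed reference) and during transfer.
reference = {"Copy": 15258.3, "Scale": 15019.2, "Add": 16883.6, "Triad": 16831.6}
during_transfer = {"Copy": 12782.6, "Scale": 12319.0, "Add": 14210.4, "Triad": 14317.3}


def relative_change_percent(before: float, after: float) -> float:
    """Signed change from before to after, in percent of before."""
    return (after - before) / before * 100


changes = {
    name: relative_change_percent(reference[name], during_transfer[name])
    for name in reference
}
for name, change in changes.items():
    print(f"{name}: {change:.0f} %")
```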
Performance and bandwidth test with FairMQ FLP 2 EPN
FLP: aidrefma02
EPN: aidrefma01
CPU time in seconds needed to simulate 1000 events, 10 protons, in FairRoot example 3
Without MQ    With 4 MB messages    With 8 MB messages
    54                61                    68
    58                64                    61
    54                66                    62
    58                56                    57
    57                56                    57
    55                63                    64
    58                63                    57
    58                64                    67
    60                65                    57
    58                65                    57
    61                66                    62
    57                56                    65
Average:
    57.3              62.1                  61.2
Increase:             5 %                   4 %
4 MB messages: 4.5 GByte/s
8 MB messages: 4.7 GByte/s
CERN DAQ Lab system: 40 G Ethernet; dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores - 32 threads, 64 GB RAM
Run 12 processes
Performance and bandwidth test with FairMQ FLP 2 EPN
FLP: aidrefma02
EPN: aidrefma01
CPU time in seconds needed to simulate 1000 events, 100 protons, in FairRoot example 3
Without MQ    With 8 MB messages
   565               605
   573               615
   570               598
   573               603
   565               602
   563               601
   570               619
   570               598
   576               616
   574               606
   567               609
   577               595
Average:
   570.2             605.6
Increase:            6 %
8 MB messages: 4.7 GByte/s
2.8 TByte total data transfer
CERN DAQ Lab system: 40 G Ethernet; dual-socket Intel Sandy Bridge-EP, dual E5-2690 @ 2.90 GHz, 2x8 hw cores - 32 threads, 64 GB RAM
Run 12 processes
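The averages and the relative increase for the 100-proton runs can be recomputed from the twelve per-process CPU times. A short Python sketch using the values listed on this slide; the variable names are illustrative:

```python
from statistics import mean

# CPU time in seconds per process, 1000 events with 100 protons.
without_mq = [565, 573, 570, 573, 565, 563, 570, 570, 576, 574, 567, 577]
with_8mb = [605, 615, 598, 603, 602, 601, 619, 598, 616, 606, 609, 595]

mean_without = mean(without_mq)
mean_with = mean(with_8mb)
# Extra CPU time while receiving 8 MB messages, relative to the run without MQ.
increase_percent = (mean_with - mean_without) / mean_without * 100

print(f"without MQ: {mean_without:.2f} s, with 8 MB messages: {mean_with:.2f} s")
print(f"increase: about {increase_percent:.0f} %")
```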
Backup and Discussion
Run on STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21173 microseconds.
   (= 21173 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s    Avg time    Min time    Max time
Copy:          15258.3        0.017153    0.010486    0.025462
Scale:         15019.2        0.017180    0.010653    0.025397
Add:           16883.6        0.021488    0.014215    0.036001
Triad:         16831.6        0.021190    0.014259    0.035066
-------------------------------------------------------------
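The Best Rate column can be cross-checked against the Min time column. A Python sketch, assuming the rate is bytes moved per kernel pass divided by the minimum time, with 16 or 24 bytes per iteration as in the kernel table and 1 MB = 10^6 bytes; small differences come from the rounding of the printed times, and the names are illustrative:

```python
ARRAY_SIZE = 10_000_000  # elements, from the run shown above

# kernel: (bytes per iteration, min time in s, reported best rate in MB/s)
KERNELS = {
    "Copy": (16, 0.010486, 15258.3),
    "Scale": (16, 0.010653, 15019.2),
    "Add": (24, 0.014215, 16883.6),
    "Triad": (24, 0.014259, 16831.6),
}


def best_rate_mb_per_s(bytes_per_iter: int, array_size: int, min_time_s: float) -> float:
    """Bandwidth in MB/s for one pass over the arrays in min_time_s seconds."""
    return bytes_per_iter * array_size / min_time_s / 1e6


rates = {
    name: best_rate_mb_per_s(nbytes, ARRAY_SIZE, tmin)
    for name, (nbytes, tmin, _) in KERNELS.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MB/s (reported {KERNELS[name][2]})")
```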