
The Performance of Configurable Collective Communication for LAM-MPI in Clusters and Multi-Clusters

John Markus Bjørndalen†, Otto J. Anshus†, Brian Vinter‡, Tore Larsen†

†Department of Computer Science, University of Tromsø
‡Department of Mathematics and Computer Science, University of Southern Denmark

Abstract

Using a cluster of eight four-way computers, PastSet, an experimental tuple space based shared memory system, has been measured to be 1.83 times faster on global reduction than using the Allreduce operation of LAM-MPI. Our hypothesis is that this is due to PastSet using a better mapping of processes to computers, resulting in fewer messages and more use of local processor cycles to compute partial sums. PastSet achieved this by utilizing the PATHS system for configuring multi-cluster applications. This paper reports on an experiment to verify the hypothesis by using PATHS on LAM-MPI to see if we can get better performance, and to identify the contributing factors.

By adding configurability to Allreduce using PATHS, we achieved performance gains of 1.52, 1.79, and 1.98 on two, four, and eight-way clusters, respectively. We conclude that the LAM-MPI algorithm for mapping processes to computers when using Allreduce was the reason for its poor performance relative to the implementation using PastSet and PATHS.

We then did a set of experiments to examine whether we could improve the performance of the Allreduce operation when using two clusters interconnected by a WAN link with 30-50 ms roundtrip latency. The experiments showed that even a bad mapping of Allreduce, which resulted in multiple messages being sent across the WAN, did not add a significant performance penalty to the Allreduce operation for packet sizes up to 4 KB. We believe this is due to multiple small messages being concurrently in transit on the WAN.

1 Introduction

For efficient support of synchronization and communication in parallel systems, these systems require fast collective communication support from the underlying communication subsystem as, for example, defined by the Message Passing Interface (MPI) Standard [1]. Among the set of collective communication operations, broadcast is fundamental and is used in several other operations such as barrier synchronization and reduction [12]. Thus, it is advantageous to reduce the latency of broadcast operations on these systems.

In our work with the PATHS [5] configuration and orchestration system, we have experimented with micro-benchmarks and applications to study the effects of configurable communication.

In one of the experiments [18], we used the configuration mechanisms to reduce the execution times of collective communication operations in PastSet. To get a baseline, we compared our reduction operation with the equivalent operation in MPI (Allreduce).

By trying a few configurations, we found that we could improve our Tuple Space system to be 1.83 times faster than LAM-MPI (Local Area Multicomputer MPI) [11][14]. Our hypothesis was that this advantage came from a better usage of resources in the cluster rather than a more efficient implementation.

If anything, LAM-MPI should be faster than PastSet, since PastSet stores the results of each global sum computation in a tuple space, inducing more overhead than simply computing and distributing the sum.

This paper reports on an experiment where we have added configurable communication to the Broadcast and Reduce operations in LAM-MPI (both of which are used by Allreduce) to validate or falsify our hypothesis. In [4] we report on complementary experiments where we also varied the number of computers per experiment.

The paper is organized as follows: Section 2 summarizes the main features of the PastSet and PATHS systems. Section 3 describes the Allreduce, Reduce and Broadcast operations in LAM-MPI. Section 4 describes the configuration mechanism that was added to LAM-MPI for the experiments reported on in this paper. Section 5 describes the experiments and results, section 6 presents our multi-cluster experiments, section 7 presents related work, and section 8 concludes the paper.

2 PATHS: Configurable Orchestration and Mapping

Our research platform is PastSet [2][17], a structured distributed shared memory system in the tradition of Linda [6]. PastSet is a set of Elements, where each Element is an ordered collection of tuples. All tuples in an Element follow the same template.

The PATHS [5] system is an extension of PastSet that allows for mappings of processes to hosts at load time, selection of physical communication paths to each element, and distribution of communications along the path. PATHS also implements the X-functions [18], which are PastSet operation modifiers.

A path specification for a single thread needing access to a given element is represented by a list of stages. Each stage is implemented using a wrapper object hiding the rest of the path after that stage. The stage specification includes parameters used for initialisation of the wrapper.
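The stage-and-wrapper structure can be sketched as follows. This is a minimal illustration, not the actual PATHS API: the class names, the `store` operation, and the builder function are all hypothetical.

```python
# Illustrative sketch of a PATHS-style path: a list of stages, each implemented
# as a wrapper object hiding the rest of the path after it. Names are
# hypothetical, not the real PATHS interface.

class Element:
    """Root of a path: stores the value of the last operation."""
    def __init__(self):
        self.value = None

    def store(self, value):
        self.value = value
        return value

class TraceWrapper:
    """A stage that forwards operations to the next stage, logging each one."""
    def __init__(self, next_stage):
        self.next_stage = next_stage
        self.log = []

    def store(self, value):
        self.log.append(("store", value))
        return self.next_stage.store(value)

def build_path(stages, element):
    """Wrap `element` in the given stage classes; return the topmost wrapper."""
    target = element
    for stage in reversed(stages):
        target = stage(target)
    return target

element = Element()
top = build_path([TraceWrapper, TraceWrapper], element)
top.store(42)   # the thread invokes operations through the topmost wrapper
```

Each wrapper only knows the stage directly below it, which is what lets a path be rearranged or extended without the application seeing anything but its topmost reference.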

Figure 1. Four threads on two hosts accessing a shared element on a separate server. Each host computes a partial sum that is forwarded to the global-sum wrapper on the server. The final result is stored in the element.

Paths can be shared whenever path descriptions match and point to the same element (see figure 1). This can be used to implement functionality such as, for instance, caches, reductions and broadcasts.
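The sharing rule described here — reuse an existing path when the description matches and the target element is the same — can be sketched as a memoized builder. The data structures below are illustrative assumptions, not the PATHS implementation.

```python
# Sketch: share a path whenever its specification matches an existing one
# pointing at the same element. The cache key and path representation are
# hypothetical, not taken from the PATHS source.

_path_cache = {}

def get_path(spec, element_name):
    """Return the existing path for (spec, element), or build and cache one."""
    key = (tuple(spec), element_name)
    if key not in _path_cache:
        # In a real system this would build the chain of wrapper stages;
        # here a dict stands in for the built path.
        _path_cache[key] = {"spec": list(spec), "element": element_name}
    return _path_cache[key]

a = get_path(["trace", "global_sum"], "sum_element")
b = get_path(["trace", "global_sum"], "sum_element")  # matching spec: shared
c = get_path(["global_sum"], "sum_element")           # different spec: new path
```

Sharing is what makes the set of paths to one element collapse into a tree: threads with a common path suffix end up going through the same wrapper instances.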

The collection of all paths in a system pointing to a given element forms a tree. The leaf nodes in the tree are the application threads, while the root is the element.

Figure 1 shows a global reduction tree. By modifying the tree and the parameters to the wrappers in the tree, we can specify and experiment directly with factors such as which processes participate in a given partial sum, how many partial-sum wrappers to use, where each sum wrapper is located, protocols and service requirements for remote operations, and where the root element is located. Thus, we can control and experiment with tradeoffs between placement of computation, communication, and data location.

Applications tend to use multiple trees, either because the application uses multiple elements, or because each thread might use multiple paths to the same element.

To get access to an element, the application programmer can either choose to use lower-level functions to specify paths before handing them over to a path builder, or use a higher-level function which retrieves a path specification to a named element and then builds the specified path. The application programmer then gets a reference to the topmost wrapper in the path.

The path specification can either be retrieved from a combined path specification and name server, or be created with a high-level language library loaded at application load time.¹

Since the application program invokes all operations through its reference to the topmost wrapper, the application can be mapped to different cluster topologies simply by doing one of the following:

- Updating a map description used by the high-level library.
- Specifying a different high-level library that generates path specifications. This library may be written by the user.
- Updating the path mappings in the name server.

¹ Currently, a Python module is loaded for this purpose.

Profiling is provided via trace wrappers that log the start and completion times of operations invoked through them. Any number of trace wrappers can be inserted anywhere in the path.

Specialized tools to examine the performance aspects of the application can later read trace data stored with the path specifications from a system. We are currently experimenting with different visualizations and analyses of this data to support optimization of a given application.

The combination of trace data, a specification of communication paths, and computations along the path has been useful in understanding performance aspects and tuning benchmarks and applications that we have run in cluster and multi-cluster environments.

3 LAM-MPI implementation of Allreduce

LAM-MPI is an open source implementation of MPI available from [11]. It was chosen over MPICH [7] for our work in [18] since it had lower latency with less variance than MPICH for the benchmarks we used in our clusters.

The MPI Allreduce operation combines values from all processes and distributes the result back to all processes. LAM-MPI implements Allreduce by first calling Reduce, collecting the result in the root process, then calling Broadcast, distributing the result from the root process. For all our experiments, the root process is the process with rank 0 (hereafter called process 0).
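The Reduce-then-Broadcast structure can be shown with a small simulation over in-memory "ranks". This is a sketch of the structure only, not of LAM-MPI's actual message-passing code.

```python
# Sketch of LAM-MPI's Allreduce structure: a Reduce collecting the result at
# rank 0, followed by a Broadcast distributing it back. Simulated here over a
# plain list of per-rank values rather than real processes.

def allreduce(values, combine):
    """values[i] is rank i's contribution; returns the list each rank ends with."""
    # Reduce phase: combine every rank's contribution at rank 0.
    total = values[0]
    for v in values[1:]:
        total = combine(total, v)
    # Broadcast phase: rank 0 distributes the combined result to all ranks.
    return [total] * len(values)

results = allreduce([1, 2, 3, 4], lambda a, b: a + b)
# every rank ends up with the same combined value, 10
```

Because Allreduce is just these two calls composed, any change to the Reduce and Broadcast trees automatically changes Allreduce as well — which is what the configuration mechanism in section 4 exploits.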

The Reduce and Broadcast algorithms use a linear scheme (every process communicates directly with process 0) up to and including 4 processes. From there on, they use a scheme that organizes the processes into a logarithmic spanning tree. The shape of this tree is fixed, and doesn't change to reflect the topology of the computing system or cluster. Figure 2 shows the reduction trees used in LAM-MPI for 32 processes in a cluster. We observe that the broadcast and reduction trees are different.

By default, LAM-MPI evenly distributes processes onto nodes. When we combine this mapping for 32 processes with the reduction tree, we can see in Figure 3 that a lot of messages are sent across nodes in the system. The broadcast operation has a better mapping for this cluster, though.
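The mismatch between a fixed tree and a given placement can be quantified by counting tree edges that cross node boundaries. The sketch below assumes a binomial-style spanning tree and round-robin placement; both are common conventions and stand in for LAM-MPI's exact tree and scheduler, which the paper does not spell out.

```python
# Sketch: count inter-node messages for one reduction, given a fixed spanning
# tree and an even distribution of ranks onto nodes. The binomial tree and the
# round-robin placement are assumptions, not LAM-MPI's exact algorithm.

def parent(rank):
    """Parent of `rank` in a binomial spanning tree rooted at rank 0."""
    if rank == 0:
        return None
    return rank & (rank - 1)   # clear the lowest set bit

def inter_node_messages(nprocs, nnodes):
    """Tree edges that cross node boundaries under round-robin placement."""
    node_of = lambda rank: rank % nnodes
    return sum(1 for rank in range(1, nprocs)
               if node_of(rank) != node_of(parent(rank)))

# 32 processes on the 8-node (4-way) cluster: most edges cross the network.
crossings = inter_node_messages(32, 8)
```

Under these assumptions, 28 of the 31 tree edges cross node boundaries, which matches the paper's observation that "a lot of messages are sent across nodes" when the fixed tree meets the default placement.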

4 Adding configuration to LAM-MPI

To minimize the necessary changes to LAM-MPI for this experiment, we didn't add a full PATHS system at this point. Instead, a mechanism was added that allowed for scripting the way LAM-MPI communicates during the broadcast and reduce operations.

There were two main reasons for this. Firstly, our hypothesis was that PastSet with PATHS allowed us to map the communication and computation better to the resources and cluster topology. For global reduction and broadcast, LAM-MPI already computes partial sums at internal nodes in the trees. This means that experimenting with different reduction and broadcast trees should give us much of the effect that we observed with PATHS in [18] and [5].

Secondly, we wanted to limit the influence that our system would have on the performance aspects of LAM-MPI, such that any observable changes in performance would come from modifying the reduce and broadcast trees.

Apart from this, the amount of code changed and added was minimal, which reduced the chances of introducing errors into the experiments.

When reading the LAM-MPI source code, we noticed that the reduce operation was, for any process in the reduction tree, essentially a sequence of N receives from the N children directly below it in the tree, and one send to the process above it. For broadcast, the reverse was true: one receive followed by N sends.

Using a different reduction or broadcast tree would then simply be a matter of examining, for each process, which processes are directly above and below it in the tree, and constructing a new sequence of send and receive commands.

To implement this, we added new reduce and broadcast functions which use the rank of the process and the size of the system to look up the sequence of sends and receives to be executed (including which processes to send to and receive from). This is implemented by retrieving and executing a script with send and receive commands.

As an example, when using a scripted reduce operation with a mapping identical to the original LAM-MPI reduction tree, the process with rank 12 (see figure 2) would look up and execute a script with the following commands:

- Receive (and combine result) from rank 13
- Receive (and combine result) from rank 14
- Send result to rank 8
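The execution of such a per-rank script can be sketched as a tiny interpreter. The command format and the `inbox` abstraction below are assumptions made for illustration; the paper does not describe the script syntax in detail.

```python
# Sketch of executing a per-rank reduce script like the rank-12 example above:
# a list of receive-and-combine and send commands. The ('recv', rank) /
# ('send', rank) tuple format is hypothetical.

def run_reduce_script(script, my_value, inbox, combine):
    """Execute the script for one rank.

    `inbox` maps a child rank to the partial sum received from it; returns the
    accumulated result and the list of (destination, value) sends performed."""
    acc = my_value
    sends = []
    for op, rank in script:
        if op == "recv":
            acc = combine(acc, inbox[rank])   # receive and combine result
        elif op == "send":
            sends.append((rank, acc))         # send the accumulated result up
    return acc, sends

# Rank 12's script: receive from ranks 13 and 14, then send to rank 8.
script = [("recv", 13), ("recv", 14), ("send", 8)]
result, sends = run_reduce_script(script, 12, {13: 13, 14: 14},
                                  lambda a, b: a + b)
```

With each rank's contribution equal to its rank number, rank 12 accumulates 12 + 13 + 14 = 39 and forwards that single partial sum to rank 8, exactly the pattern the bullet list describes.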

The new scripted functions are used instead of the original logarithmic Reduce and Broadcast operations in LAM-MPI. No change was necessary to the Allreduce function since it is implemented using the Reduce and Broadcast operations.

The changes to the LAM-MPI code were thus limited to 3 lines of code: replacing the calls to the logarithmic reduction and broadcast functions, as well as adding a call in MPI_Init to load the scripts.


Remapping the application to another cluster configuration, or simply trying new mappings for optimization purposes, now consists of specifying new communication trees and generating the scripts. A Python program generates these scripts as Lisp symbolic expressions.
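Script generation from a tree description can be sketched as below. The s-expression layout is an assumption — the paper says the scripts are Lisp symbolic expressions but does not show their exact form.

```python
# Sketch: generate a per-rank script as a Lisp symbolic expression from a tree
# description. The concrete s-expression syntax here is illustrative only.

def make_script(rank, children, parent):
    """Command list for one rank: receive from each child, then send upward."""
    cmds = [("recv", c) for c in children]
    if parent is not None:
        cmds.append(("send", parent))
    return cmds

def to_sexpr(cmds):
    """Render the command list as an s-expression string."""
    return "(" + " ".join(f"({op} {arg})" for op, arg in cmds) + ")"

# Rank 12 of the example tree: children 13 and 14, parent 8.
sexpr = to_sexpr(make_script(12, children=[13, 14], parent=8))
# e.g. "((recv 13) (recv 14) (send 8))"
```

Trying a new mapping then amounts to editing the tree description and regenerating every rank's script, with no change to the LAM-MPI binary itself.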

5 Experiments

Figure 4 shows the code run in the experiments. The code measures the average execution time of 1000 Allreduce operations. The average of 5 runs is then plotted. To make sure that the correct sum is computed, the code also checks the result on each iteration.
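The measurement loop has this shape (a sketch of the methodology; the actual benchmark in Figure 4 is MPI code, and the operation timed here is a stand-in):

```python
# Sketch of the measurement loop: average time per operation over 1000
# iterations, repeated 5 times, verifying the result on every iteration.
import time

def benchmark(op, expected, iterations=1000, runs=5):
    """Return the average per-operation time in microseconds across `runs` runs."""
    averages = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iterations):
            result = op()
            assert result == expected   # check the correct sum is computed
        elapsed = time.perf_counter() - start
        averages.append(elapsed / iterations * 1e6)
    return sum(averages) / runs

# Stand-in workload: a small local sum instead of a real Allreduce call.
avg_us = benchmark(lambda: sum(range(10)), expected=45)
```

Checking the result inside the timed loop costs a little, but it guards against a misconfigured tree silently producing wrong sums while looking fast.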

For each experiment, the number of processes was varied from 1 to 32. LAM-MPI used the default placement of processes on nodes, which evenly spread the processes over the nodes.

The hardware platform consists of three clusters, each with 32 processors:

- 2W: 16 × 2-way Pentium III 450 MHz, 256 MB RAM
- 4W: 8 × 4-way Pentium Pro 166 MHz, 128 MB RAM
- 8W: 4 × 8-way Pentium Pro 200 MHz, 2 GB RAM

All clusters used 100 Mbit Ethernet for intra-cluster communication.

5.1 4-way cluster

Figure 5 shows experiments run on the 4-way cluster. The following experiments were run:

Original LAM-MPI: The experiment was run using LAM-MPI 6.5.6.

Modified LAM-MPI, original scheme: The experiment was run using a modified LAM-MPI, but using the same broadcast and reduce trees for communication as the original version used.

This graph completely overlaps the graph from the original LAM-MPI code, showing us that the scripting functionality is able to replicate the performance aspects of the original LAM-MPI for this experiment.

Figure 5. Allreduce, 4-way cluster. [Plot: operation latency in microseconds (average of 5 × 1000 operations) vs. number of processes, comparing the original LAM-MPI code with the configurable original, linear, and exp1 schemes.]

Modified LAM-MPI, linear scheme: This experiment was run using a linear scheme where all processes report directly to the process with rank 0, and rank 0 sends the results directly back to each client.

This experiment crashed at 18 processes due to TCP connection failures in LAM-MPI. We haven't examined what caused this, but it is likely to be a limitation in the implementation.

Modified LAM-MPI, hierarchical scheme 1: Rank 0 is always placed on node 0, so all other processes on node 0 contribute directly to rank 0. On the other nodes, a local reduction is done within each node before the node's partial sum is sent to process 0.

The modified LAM-MPI with the hierarchical scheme is 1.79 times faster than the original LAM-MPI. This is close to the factor of 1.83 reported in [18], which was based on measurements on the same cluster. It is possible that further experiments with configurations would have brought us closer to the original speed improvement.

Figure 6. Allreduce, 8-way cluster. [Plot: operation latency in microseconds (average of 5 × 1000 operations) vs. number of processes, comparing the original LAM-MPI code with LAM-MPI + path.]

5.2 8-way cluster

For the 8-way cluster, we expected the best configuration to involve using an internal root process on each node, limiting the inter-node communications to one send (of the local sum) and one receive (of the global result), resembling the best configuration for the 4-way cluster. This turned out not to be true.

Due to the 8 processes on each node, the local root process on each node would get 7 children in the sum tree (with the exception of the root process on the root node). The extra latency added per child of the local root processes was enough to increase the total latency for 32 processes to 1444 microseconds.

Splitting the processes to allow further subgrouping within each node introduced another layer in the sum tree. This extra layer brought the total latency for 32 processes to 1534 microseconds.

The fastest solution found so far involves partitioning the processes on each node into two groups of 4 processes. Each group computes a partial sum for that group and forwards it directly to the root process. This doubled the communication over the network, but the total execution time was shorter (1322 microseconds).
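The two-groups-per-node layout can be sketched as follows. The placement details (block placement of ranks onto nodes, root on node 0, local combination on the root's node) are illustrative assumptions consistent with the description above.

```python
# Sketch of the fastest 8-way configuration found: each node's 8 processes are
# split into two groups of 4; each group computes a partial sum that is sent
# directly to the root. Placement details are illustrative assumptions.

def grouped_tree(nnodes=4, per_node=8, group_size=4):
    """Return (groups, network_messages) for the two-groups-per-node scheme.

    Each group is (node, [ranks]). Groups on the root's own node are assumed
    to combine locally, so only groups on other nodes send over the network."""
    groups = []
    for node in range(nnodes):
        ranks = [node * per_node + i for i in range(per_node)]
        for g in range(0, per_node, group_size):
            groups.append((node, ranks[g:g + group_size]))
    remote_groups = [g for g in groups if g[0] != 0]
    return groups, len(remote_groups)

groups, messages = grouped_tree()
# 4 nodes x 2 groups = 8 groups; the 6 groups off the root node each send one
# partial sum over the network (vs. 3 with one group per node: doubled traffic)
```

This makes the tradeoff concrete: twice the network messages, but each sum wrapper has only 4 children instead of 7, and under these assumptions the shallower fan-in wins.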

Figure 7. Allreduce, 2-way cluster. [Plot: operation latency in microseconds (average of 5 × 1000 operations) vs. number of processes, comparing the original LAM-MPI code with LAM-MPI + PATHS.]

The fastest configuration is 1.98 times faster than the original LAM-MPI Allreduce operation. In figure 6, the fastest configuration and the original LAM-MPI Allreduce are plotted for 1-32 processes.

5.3 2-way cluster

On the 2-way cluster, after trying a few configurations, we ended up with a configuration that was 1.52 times faster than the original LAM-MPI code. This was not as good as for the other clusters. We expected this, since a reduction in this cluster would need more external messages.

Based on the experiments, we observed three main factors that influenced the scaling of the application:

- The number of children of a process in the operation tree. More children add communication load to the node hosting this process.
- The depth of a branch in the operation tree, that is, the number of nodes a value has to be added through before reaching the root node.
- Whether a message is sent over the network or internally on a node.

The depth factor seemed to introduce a higher latency than adding a few more children to a process. On the other hand, adding another layer in the tree would also allow us to use partial sums. We have not arrived at any model that allows us to predict which tree organization would in practice lead to the shortest execution time of the Allreduce operation.
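While no predictive model exists, the three factors themselves are easy to compute for a candidate tree. The sketch below does so for a tree given as a child-to-parent map and a rank-to-node placement; the representation is illustrative.

```python
# Sketch: compute the three observed scaling factors for a candidate reduction
# tree: maximum fan-in, maximum depth, and edges crossing the network.
# The child -> parent map and rank -> node placement formats are illustrative.

def tree_factors(parent, node_of):
    """Return (max_children, max_depth, network_messages) for a tree."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)
    max_children = max(len(v) for v in children.values())

    def depth(r):
        return 0 if r not in parent else 1 + depth(parent[r])

    max_depth = max(depth(r) for r in parent)
    network = sum(1 for c, p in parent.items() if node_of[c] != node_of[p])
    return max_children, max_depth, network

# Toy example: ranks 1-3 on node 0 report straight to the root (rank 0), while
# ranks 5-7 on node 1 reduce locally through rank 4 before crossing the network.
parent = {1: 0, 2: 0, 3: 0, 4: 0, 5: 4, 6: 4, 7: 4}
node_of = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
factors = tree_factors(parent, node_of)   # (4, 2, 1)
```

Candidate trees can then at least be compared on these three axes before measuring, even though the text above notes that the tradeoff between depth and fan-in is not captured by any simple formula.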

One difference between the PastSet/PATHS implementation and LAM-MPI is that PastSet/PATHS processes incoming partial sums in order of arrival, while LAM-MPI processes them in the order they appear in the sequence specified either by the original LAM-MPI code or by the script. This might influence the overhead when receiving messages from a larger number of children. We are investigating this further.

6 Multi-cluster experiments

In [5] we also ran multi-cluster global reductions using PastSet and PATHS, and [10] shows an approach where the number of messages over WAN connections is reduced to lower the latency of collective operations over wide-area links.

To study the effect of choosing various configurations on multi-clusters, we added an experiment where we installed an IP tunnel between the 4W cluster in Tromsø and the 2W cluster in Denmark and ran the Allreduce benchmark using those clusters as one larger cluster.

Unfortunately, during the experiments two things happened: two of the nodes in Denmark went down, and the 2W cluster was allocated to other purposes after we finished the measurements with the unmodified LAM-MPI Allreduce, so we were unable to run the benchmark using configurable Allreduce on the multi-cluster system.

However, the performance measurements using the unmodified Allreduce did produce interesting results. An experiment with 32 processes both in Denmark and in Tromsø documented that the reduce phase of the Allreduce operation had a very good mapping for minimizing messages over the WAN link: only one message was sent from the Tromsø cluster to the root process in the Denmark cluster. However, in the broadcast phase each process in Tromsø was sent a message from a process in Denmark.

We expected this configuration of Allreduce to perform gradually worse as we scaled up the number of processes in Tromsø from 1 to 32. This did not turn out to be the case. With 32 processes in Denmark, the difference between the Allreduce latency of using 1 and 32 processes in Tromsø was small enough to be masked by the noise in latency.

We suspected that the reason for this was that the numerous routers and links down to Denmark allowed multiple small messages to propagate towards Tromsø concurrently.

This suggests that, for the cluster sizes we have available, a configurable mechanism that reduces the number of small messages on a long WAN link does not significantly reduce the latency. However, for larger packet sizes, the bandwidth of the link is important, and a reduction in the number of messages should matter more.

To study this, the benchmark was extended to include not only scaling from 1 to 64 processes, but also calling Allreduce with value vectors from 4 bytes (1 integer) to 4 MB.

Since two computers in Denmark had gone down, we ran the benchmark with 28 processors (14 nodes) in Denmark and 32 processors in Tromsø, reducing the total cluster size to 60 processors. This produced a mapping similar to the original: when running with 60 processes, all Tromsø nodes would receive a message directly from Denmark during broadcast, and only two messages would be sent down to Denmark on reduce.

We also ran the benchmark with 61 processes. This resulted in a configuration where the process with rank 60 (the last process added when scaling the system to 61 processes) is located in Denmark but only communicates directly with nodes in Tromsø for Reduce and Broadcast. This forces the nodes in Tromsø to wait for process 60 before one of the two messages can be sent down to Denmark, and the corresponding broadcast message for process 60 has to propagate through Tromsø before reaching it. Figure 8 shows that this configuration nearly doubles the latency of Allreduce compared to running Allreduce with 60 processes. Clearly, configurations can be encountered where it pays off to control who communicates with whom, even for small messages.

Figure 8. Allreduce performance using two clusters. [Plot: average Allreduce execution time in microseconds vs. number of processes (1-64), for vector sizes of 4 to 2048 bytes.]

Figure 9. Allreduce performance using two clusters; note the logarithmic scale of the y-axis. [Plot: average Allreduce execution time in microseconds vs. number of processes (1-64), for vector sizes from 4 bytes to 4 MB.]

Figure 8 also shows us that sending many messages (there is one message per process in the figure) for vectors up to 2 KB does not add much of a performance penalty to Allreduce compared to sending only a few messages. The figure shows a spike around 38 processes, which comes from a temporary network routing problem somewhere between Denmark and Tromsø when the experiment was performed.

Figure 10. Roundtrip TCP/IP latency for one of N threads, 4W node to 8W node (LAN). [Plot: roundtrip latency in microseconds vs. packet size in bytes, for 1 to 10 concurrent threads.]

Figure 9 shows that the performance of Allreduce rapidly deteriorates when the size and number of packets increase. The graph uses a logarithmic scale for the y-axis.

To examine this effect further, we devised another experiment. We had one thread measuring the latency of roundtrip messages over TCP/IP. To simulate the bad mapping of the Broadcast operation, a number of threads were added sending roundtrip messages concurrently with the first thread.
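The structure of this microbenchmark can be sketched as below, run over localhost instead of the LAN/WAN links used in the paper. Host, port, payload size, and iteration counts are illustrative; the real experiment varied packet sizes and thread counts across the figures that follow.

```python
# Sketch of the roundtrip experiment: one benchmark thread measures TCP
# roundtrip latency while N background threads send concurrent roundtrips.
# Run here against a local echo server; all parameters are illustrative.
import socket
import threading
import time

def echo_server(listener):
    """Accept one connection and echo everything back until it closes."""
    conn, _ = listener.accept()
    with conn:
        while data := conn.recv(4096):
            conn.sendall(data)

def roundtrips(port, payload, count):
    """Measure `count` send/receive roundtrips; return the per-trip times."""
    times = []
    with socket.create_connection(("127.0.0.1", port)) as c:
        for _ in range(count):
            t0 = time.perf_counter()
            c.sendall(payload)
            received = 0
            while received < len(payload):      # handle partial reads
                received += len(c.recv(4096))
            times.append(time.perf_counter() - t0)
    return times

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]

n_background = 3
for _ in range(1 + n_background):               # one echo thread per client
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()
for _ in range(n_background):                   # concurrent background load
    threading.Thread(target=roundtrips, args=(port, b"x" * 1024, 50),
                     daemon=True).start()

latencies = roundtrips(port, b"x" * 1024, 50)   # the benchmark thread
```

Plotting such per-thread latencies against payload size, with and without background threads, reproduces the shape of the comparison the paper makes between the LAN and WAN cases.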

The hypotheses were that adding the extra threads would impact the first thread the least when using the WAN to Denmark, and that the background threads would influence the first thread more for small messages when using the LAN.

Figure 10 shows that communicatingbetween two cluster nodes, using onlya single router between them, the back-ground communication would influencethe roundtrip latency of the benchmarkthread even at the smallest packet sizes.As we add background communication,

Page 9: The Performance of Configurable Collective Communication ... · the set of collective communication opera- ... port optimization of a given application. The combinationof trace data,

(Plot: packet size in bytes on the x-axis, roundtrip latency in microseconds on the y-axis; one curve per thread count, 1 to 10 threads.)

Figure 11. Roundtrip TCP/IP latency for one of N threads. Gateway node in Denmark to 4W node (WAN)

the latency increases for all packet sizes. The cluster in Denmark was busy at the time of the experiment, so we had to use the cluster's gateway node when running the WAN experiment. The gateway was where we originally installed the IP tunnel for the LAM-MPI Allreduce benchmark, so we still used the tunnel for the experiment.

Figure 11 shows that for messages up to 4KB, we can hardly see any difference in the performance of Allreduce as we add background communication. Only with larger packet sizes, where we assume that bandwidth plays a larger role, is the performance influenced by the background communication.

7 Related work

Jacunski et al.[9] show that selecting the best performing algorithm for all-to-all broadcast on clusters of workstations based on commodity switch-based networks is a function of network characteristics as well as the message length and the number of participating nodes in the all-to-all broadcast operation. In contrast, our work used clusters of multiprocessors; we did performance measurements on the reduce operation as well as on the broadcast, and we documented the effect of the actual system at run time, including the workload and the communication performance of the hosts.

Bernaschi et al.[3] study the performance of the MPI broadcast operation on large shared memory systems using a-trees.

Kielmann et al.[10] show how the performance of collective operations like broadcast depends upon cluster topology, latency, and bandwidth. They develop a library of collective communication operations optimized for wide area systems, where a majority of the speedup comes from limiting the number of messages passing over wide-area links.

Husbands et al.[8] optimize the performance of MPI_Bcast in clusters of SMP nodes by using a two-phase scheme: first, messages are sent to each SMP node, then messages are distributed within the SMP nodes. Sistare et al.[13] use a similar scheme, but focus on improving the performance of collective operations on large-scale SMP's.

Tang et al.[15] also use a two-phase scheme in a multithreaded implementation of MPI. They separate the implementation of the point-to-point and collective operations to allow optimizations of the collective operations which would otherwise be difficult.

In contrast to the above works, this paper is not focused on a particular optimization of the spanning trees. Instead, we focus on making the shape of the spanning trees configurable to allow easy experimentation on various cluster topologies and applications.

Vadhiyar et al.[16] show an automatic approach to selecting buffer sizes and algorithms to optimize the collective operation trees by conducting a series of experiments.

8 Conclusions

We have observed that the broadcast and reduction trees in LAM-MPI are different, and do not necessarily take into account the actual topology of the cluster.

By introducing configurable broadcast and reduction trees, we have shown a simple way of mapping the reduction and broadcast trees to the actual clusters in use. This gave us a performance improvement of up to a factor of 1.98.

For the cluster where we observed the performance difference of a factor of 1.83 between PastSet/PATHS and LAM-MPI, we arrived at a reduction and broadcast tree that gave us an improvement of 1.79 for Allreduce over the original LAM-MPI implementation. This supports our hypothesis that the majority of the performance difference between LAM-MPI and PastSet/PATHS was due to a better mapping to the resources and topology of the cluster.

We have also observed that doing a reduction internally in each node before sending a message on the network, contrary to the usual assumption, did not lead to the best performance on our cluster of 8-way nodes. Instead, increasing the number of messages on the network in order to reduce the depth of the reduction and broadcast trees, as well as the number of direct children of each internal node, proved to be a better strategy.

The reason for this may be found by studying three different factors that add to the cost of the global reduction and broadcast trees:

- The number of children directly below an internal node in the spanning tree.

- The depth of the spanning tree.

- Whether an arc between a child and a parent in the spanning tree is a message on the network or internal to the node.

We suspect that these factors do not increase linearly, and that the cost of, for instance, adding another child to an internal node in the spanning tree depends on factors such as the order of the sequence of receive commands as well as contention in the network layer of the host computer.

For the multi-cluster experiments, we observed that configurations which sent more messages than necessary over the WAN link did not perform as badly as we had expected. For message sizes up to 4KB, the extra messages did not add noticeable operation time to the Allreduce operation. We believe that multiple messages concurrently in transit through the routers along the WAN link mask some of the overhead of the extra messages sent by LAM-MPI.

For larger messages, we see that the bandwidth of the WAN link is starting to penalize the bad configurations.

9 Acknowledgements

Thanks to Ole Martin Bjørndalen for reading the paper and suggesting improvements.

References

[1] MPI: A Message-Passing Interface Standard. Message Passing Interface Forum (March 1994).

[2] ANSHUS, O. J., AND LARSEN, T. Macroscope: The abstractions of a distributed operating system. Norsk Informatikk Konferanse (October 1992).

[3] BERNASCHI, M., AND RICHELLI, G. MPI collective communication operations on large shared memory systems. Proceedings of the Ninth Euromicro Workshop on Parallel and Distributed Processing (EUROPDP'01) (2001).

[4] BJØRNDALEN, J. M., ANSHUS, O., VINTER, B., AND LARSEN, T. Configurable collective communication in LAM-MPI. Proceedings of Communicating Process Architectures 2002, Reading, UK (September 2002).

[5] BJØRNDALEN, J. M., ANSHUS, O., LARSEN, T., AND VINTER, B. PATHS - integrating the principles of method-combination and remote procedure calls for run-time configuration and tuning of high-performance distributed application. Norsk Informatikk Konferanse (November 2001), 164-175.

[6] CARRIERO, N., AND GELERNTER, D. Linda in context. Commun. ACM 32, 4 (April 1989), 444-458.

[7] GROPP, W., LUSK, E., DOSS, N., AND SKJELLUM, A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22, 6 (September 1996).

[8] HUSBANDS, P., AND HOE, J. C. MPI-StarT: delivering network performance to numerical applications. Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (1998), San Jose, CA.

[9] JACUNSKI, M., SADAYAPPAN, P., AND PANDA, D. All-to-all broadcast on switch-based clusters of workstations. 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing (April 12-16, 1999), San Juan, Puerto Rico.

[10] KIELMANN, T., HOFMAN, R. F. H., BAL, H. E., PLAAT, A., AND BHOEDJANG, R. A. F. MagPIe: MPI's collective communication operations for clustered wide area systems. Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (1999), Atlanta, Georgia, United States.

[11] http://www.lam-mpi.org/.

[12] PANDA, D. Issues in designing efficient and practical algorithms for collective communication in wormhole-routed systems. Proc. ICPP Workshop on Challenges for Parallel Processing (1995), 8-15.

[13] SISTARE, S., VANDEVAART, R., AND LOH, E. Optimization of MPI collectives on clusters of large-scale SMP's. Proceedings of the 1999 Conference on Supercomputing (1999), Portland, Oregon, United States.

[14] SQUYRES, J. M., LUMSDAINE, A., GEORGE, W. L., HAGEDORN, J. G., AND DEVANEY, J. E. The interoperable message passing interface (IMPI) extensions to LAM/MPI. In Proceedings, MPIDC'2000 (March 2000).

[15] TANG, H., AND YANG, T. Optimizing threaded MPI execution on SMP clusters. Proceedings of the 15th International Conference on Supercomputing (2001), Sorrento, Italy.

[16] VADHIYAR, S. S., FAGG, G. E., AND DONGARRA, J. Automatically tuned collective communications. Proceedings of the 2000 Conference on Supercomputing (2000), Dallas, Texas, United States.

[17] VINTER, B. PastSet: a Structured Distributed Shared Memory System. PhD thesis, Tromsø University, 1999.

[18] VINTER, B., ANSHUS, O. J., LARSEN, T., AND BJØRNDALEN, J. M. Extending the applicability of software DSM by adding user redefinable memory semantics. Parallel Computing (ParCo) 2001, Naples, Italy (September 2001).


(Tree diagram: processes V-0 through V-31; each arc carries a partial sum S toward the root V-0.)

Figure 2. Log-reduce tree for 32 processes. The arcs represent communication between two nodes. Partial sums are computed at a node in the tree before passing the result further up in the tree.

(Tree diagram: the 32 processes V-0 through V-31 grouped onto the eight nodes ps0 through ps7.)

Figure 3. Log-reduce tree for 32 processes mapped onto 8 nodes.

t1 = get_usecs();
MPI_Allreduce(&hit, &ghit, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
t1 = get_usecs();
for (i = 0; i < ITERS; i++) {
    MPI_Allreduce(&i, &ghit, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (ghit != (i * size))
        printf("oops at %d. %d != %d\n", i, ghit, (i * size));
}
t2 = get_usecs();
MPI_Allreduce(&hit, &ghit, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

Figure 4. Global reduction benchmark