
Hierarchical Optimization of MPI Reduce Algorithms

Khalid Hasanov and Alexey Lastovetsky

University College Dublin, Belfield, Dublin 4, Ireland
[email protected], [email protected]

Abstract. Optimization of MPI collective communication operations has been an active research topic since the advent of MPI in the 1990s. Many general and architecture-specific collective algorithms have been proposed and implemented in the state-of-the-art MPI implementations. Hierarchical topology-oblivious transformation of existing communication algorithms has recently been proposed as a new promising approach to the optimization of MPI collective communication algorithms and MPI-based applications. This approach has been successfully applied to the most popular parallel matrix multiplication algorithm, SUMMA, and to the state-of-the-art MPI broadcast algorithms, demonstrating significant multi-fold performance gains, especially for large-scale HPC systems. In this paper, we apply this approach to the optimization of the MPI reduce operation. A theoretical analysis and experimental results on a cluster of the Grid'5000 platform are presented.

Keywords: MPI · Reduce · Grid’5000 · Communication · Hierarchy

1 Introduction

Reduce is an important and commonly used collective operation in the Message Passing Interface (MPI) [1]. A five-year profiling study [2] demonstrates that MPI reduction operations are the most used collective operations. In the reduce operation, each node i owns a vector x_i of n elements. After completion of the operation, all the vectors are reduced element-wise to a single n-element vector, which is owned by a specified root process.
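
For reference, the operation just described corresponds to a single MPI_Reduce call. The following minimal C program is only an illustration; the vector length, data type and the MPI_SUM operation are arbitrary choices, not tied to the experiments reported later in the paper.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { n = 1024 };
    double x[n], result[n];            /* x_i owned by this process   */
    for (int j = 0; j < n; j++)
        x[j] = 1.0;

    /* Element-wise reduction of all x vectors onto the root (rank 0). */
    MPI_Reduce(x, result, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}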

Optimization of MPI collective operations has been an active research topic since the advent of MPI in the 1990s. Many general and architecture-specific collective algorithms have been proposed and implemented in the state-of-the-art MPI implementations. Hierarchical topology-oblivious transformation of existing communication algorithms has recently been proposed as a new promising approach to the optimization of MPI collective communication algorithms and MPI-based applications [3,5]. This approach has been successfully applied to the most popular parallel matrix multiplication algorithm, SUMMA [4], and to the state-of-the-art MPI broadcast algorithms, demonstrating significant multi-fold performance gains, especially on large-scale HPC systems. In this paper, we apply this approach to the optimization of the MPI reduce operation.


1.1 Contributions

We propose a hierarchical optimization of legacy MPI reduce algorithms that does not require redesigning them. The approach is simple and general, allowing the proposed optimization to be applied to any existing reduce algorithm. Since, by design, the original algorithm is a particular case of its hierarchically transformed counterpart, the performance of the algorithm will either improve or, in the worst case, stay the same. A theoretical study of the hierarchical transformation of six reduce algorithms implemented in Open MPI [7] is presented. The theoretical results have been experimentally validated on the widely used Grid'5000 [8] infrastructure.

1.2 Outline

The rest of the paper is structured as follows. Section 2 discusses related work. The hierarchical optimization of MPI reduce algorithms is introduced in Sect. 3. The experimental results are presented in Sect. 4. Finally, Sect. 5 concludes the presented work and discusses future directions.

2 Related Work

In the early 1990s, the CCL library [9] implemented the collective reduce operation as an inverse broadcast operation. Later, collective algorithms for wide-area clusters were proposed [10], and automatic tuning for a given system by conducting a series of experiments on the system was discussed [11]. The design and high-performance implementation of collective communication operations and commonly used algorithms, such as the minimum-spanning-tree reduce algorithm, are discussed in [12]. Five reduction algorithms optimized for different message sizes and numbers of processes are proposed in [13]. Implementations of MPI collectives, including reduce, in MPICH [15] are discussed in [16]. Algorithms for MPI broadcast, reduce and scan, where the communication happens concurrently over two binary trees, are presented in [14]. The Cheetah framework [17] implements MPI reduction operations in a hierarchical way on multicore systems and supports multiple communication mechanisms. Unlike that work, our optimization is topology-oblivious, and the MPI reduce optimizations in this work do not design new algorithms from scratch but employ the existing reduce algorithms underneath. Therefore, our hierarchical design can be built on top of the algorithms from the Cheetah framework as well. This work focuses on the reduce algorithms implemented in Open MPI, namely the flat, linear/chain, pipeline, binary, binomial and in-order binary tree algorithms.

We extend our previous studies on parallel matrix multiplication [3] and topology-oblivious optimization of MPI broadcast algorithms on large-scale distributed-memory platforms [5,6] to MPI reduce algorithms.


2.1 MPI Reduce Algorithms

We assume that the time to send a message of size m between any two MPI processes is modeled with the Hockney model [18] as α + m×β, where α is the latency per message and β is the reciprocal bandwidth per byte. It is also assumed that the computation cost per byte in the reduction operation is γ on any MPI process. Unless otherwise noted, in the rest of the paper we refer to an MPI process simply as a process.

– Flat tree reduce algorithm. In this algorithm, the root process sequentially receives and reduces a message of size m from all the processes participating in the reduce operation in p − 1 steps:

(p − 1) × (α + m×β + m×γ).    (1)

In a segmented flat tree algorithm, a message of size m is split into X segments, in which case the number of steps is X×(p − 1). Thus, the total execution time will be as follows:

X×(p − 1) × (α + (m/X)×β + (m/X)×γ).    (2)

– Linear tree reduce algorithm. Unlike the flat tree, here each process receives or sends at most one message. Theoretically, its cost is the same as that of the flat tree algorithm:

(p − 1) × (α + m×β + m×γ) . (3)

– Pipeline reduce algorithm. It is assumed that a message of size m is split into X segments and that in one step of the algorithm a segment of size m/X is reduced between p processes. If we assume a logically reverse-ordered linear array, in the first step of the algorithm the first segment of the message is sent to the next process in the array. Next, while the second process sends the first segment to the third process, the first process sends the second segment to the second process, and the algorithm continues in this way. The first segment takes p − 1 steps and the remaining segments take X − 1 steps to reach the end of the array. If we also consider the computation cost in each step, the overall execution time of the algorithm will be as follows (a derivation of the segment count X that minimizes this cost is sketched after this list):

(p + X − 2) × (α + (m/X)×β + (m/X)×γ).    (4)

– Binary tree reduce algorithms. If we take a full and complete binary tree of height h, its number of nodes will be 2^(h+1) − 1. In the reduce operation, a node at height h receives two messages from its children at height h + 1. In addition, if we segment a message of size m into X segments, the overall run time will be as follows:

2×(log2(p + 1) + X − 2) × (α + (m/X)×β + (m/X)×γ).    (5)


Open MPI uses the in-order binary tree algorithm for non-commutative operations. It works similarly to the binary tree algorithm but enforces order in the operations.

– Binomial tree reduce algorithm. The binomial tree algorithm takes log2(p) steps, and the message communicated at each step is m. If the message is divided into X segments, then the number of steps and the message size communicated at each step will be X×log2(p) and m/X respectively. Therefore, the overall run time will be as follows:

log2(p) × (α + m×β + m×γ).    (6)

– Rabenseifner's reduce algorithm. Rabenseifner's algorithm [13] is designed for large message sizes. The algorithm consists of reduce-scatter and gather phases. It has been implemented in MPICH [16] and is used for message sizes greater than 2 KB. The reduce-scatter phase is implemented with recursive halving, and the gather phase is implemented with a binomial tree. Therefore, the run time of the algorithm is the sum of these two phases:

2×log2(p)×α + 2×((p − 1)/p)×m×β + ((p − 1)/p)×m×γ.    (7)

The algorithm can be further optimized by recursive vector halving, recursive distance doubling, recursive distance halving, binary blocks, and ring algorithms for non-power-of-two numbers of processes. An interested reader can consult [13] for a more detailed discussion of those algorithms.
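
Two remarks, not spelled out in the paper but following directly from the cost models above, may be useful. First, from (2), X×(p − 1)×(α + (m/X)×β + (m/X)×γ) = (p − 1)×(X×α + m×β + m×γ), so segmentation cannot reduce the cost of the flat tree; it only multiplies the latency term. Second, the pipeline cost (4) can be minimized over the number of segments X; a sketch of that minimization, in LaTeX notation:

\[
T(X) = (p + X - 2)\Bigl(\alpha + \tfrac{m}{X}(\beta+\gamma)\Bigr)
     = (p-2)\alpha + m(\beta+\gamma) + X\alpha + \frac{(p-2)\,m(\beta+\gamma)}{X},
\]
\[
\frac{dT}{dX} = \alpha - \frac{(p-2)\,m(\beta+\gamma)}{X^{2}} = 0
\;\Longrightarrow\;
X^{*} = \sqrt{\frac{(p-2)\,m(\beta+\gamma)}{\alpha}}, \qquad
m^{*}_{\mathrm{seg}} = \frac{m}{X^{*}}.
\]

The resulting optimal segment size depends on the latency, bandwidth and computation parameters, which is consistent with Open MPI's use of fixed, tuned segment sizes for large messages (such as the 32 KB and 64 KB segments appearing in the experiments of Sect. 4).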

3 Hierarchical Optimization of MPI Reduce Algorithms

This section introduces a topology-oblivious optimization of MPI reduce algorithms. The idea is inspired by our previous study on the optimization of the communication cost of parallel matrix multiplication [3] and MPI broadcast [5] on large-scale distributed memory platforms.

The proposed optimization technique is based on the arrangement of the p processes participating in the reduce operation into logical groups. For simplicity, it is assumed that the number of groups divides the number of MPI processes and can change between one and p. Let G be the number of groups. Then there will be p/G MPI processes per group. Figure 1 shows an arrangement of 8 processes in the original MPI reduce operation, and Fig. 2 shows the arrangement in a hierarchical reduce operation with 2 groups of 4 processes. The hierarchical optimization has two phases: in the first phase, a group leader is selected for each group and the leaders start the reduce operation inside their own groups in parallel (in this example between 4 processes). In the next phase, the reduce is performed between the group leaders (in this example between 2 processes). The grouping can be done by taking the topology into account as well. However, in this work the grouping is topology-oblivious and the first process in each


group is selected as the group leader. In general, different algorithms can be used for the reduce operations between the group leaders and within each group. This work focuses on the case where the same algorithm is employed at both levels of the hierarchy. Algorithm 1 shows the pseudocode of the hierarchically transformed MPI reduce operation. Line 4 calculates the root for the reduce between the groups. Then line 5 creates a sub-communicator of G processes between the groups, and line 6 creates a sub-communicator of p/G processes inside the groups. Our implementation uses the MPI_Comm_split routine to create the new sub-communicators.

Fig. 1. Logical arrangement of processes in MPI reduce.

Fig. 2. Logical arrangement of processes in hierarchical MPI reduce.

3.1 Hierarchical Transformation of Flat Tree Reduce Algorithm

After the hierarchical transformation, there will be two steps of the reduce operation: inside the groups and between the groups. The reduce operations inside the groups happen between p/G processes in parallel. Then, the operation continues between the G groups. The costs of the reduce operations inside the groups and between the groups will be (p/G − 1)×(α + m×β + m×γ) and (G − 1)×(α + m×β + m×γ) respectively. Thus, the overall run time can be seen as a function of G:

F(G) = (G + p/G − 2) × (α + m×β + m×γ)    (8)

The derivative of the function is (1 − p/G²)×(α + m×β + m×γ), and it can be shown that G = √p is the minimum point of the function in the interval (1, p). The optimal value of the function will then be as follows:

F(√p) = (2√p − 2) × (α + m×β + m×γ)    (9)
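
As a quick numeric illustration (not from the paper), take p = 512, the configuration used in the experiments of Sect. 4:

\[
F(1) = F(512) = 511\,(\alpha + m\beta + m\gamma), \qquad
F(16) = (16 + 32 - 2)\,(\alpha + m\beta + m\gamma) = 46\,(\alpha + m\beta + m\gamma),
\]

i.e. the number of sequential receive-and-reduce steps drops by roughly a factor of 11; the exact optimum \(G = \sqrt{512} \approx 22.6\) gives \(F(\sqrt{p}) \approx 43.3\,(\alpha + m\beta + m\gamma)\).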


Algorithm 1. Hierarchical optimization of MPI reduce operation.
Data: p - number of processes
Data: G - number of groups
Data: sendbuf - send buffer
Data: recvbuf - receive buffer
Data: count - number of entries in send buffer (integer)
Data: datatype - data type of elements in send buffer
Data: op - MPI reduce operation handle
Data: root - rank of reduce root
Data: comm - MPI communicator handle
Result: The root process has the reduced message
begin
1  MPI_Comm comm_outer    /* communicator between the groups */
2  MPI_Comm comm_inner    /* communicator inside the groups */
3  int root_outer         /* root of reduce between the groups */
4  root_outer = Calculate_Root_Outer(G, p, root, comm)
5  comm_outer = Create_Comm_Between_Groups(G, p, root_outer, comm)
6  comm_inner = Create_Comm_Inside_Groups(G, p, root, comm)
7  MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_inner)
8  MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root_outer, comm_outer)
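
For concreteness, below is a minimal C sketch of Algorithm 1, under the assumptions that G divides p and that the reduce root is process 0, so the leader of the first group coincides with the global root. The function name and the leader-selection rule are illustrative, not taken from the paper, and error handling is omitted. The only deviation from the pseudocode is that the leaders feed their phase-1 result (recvbuf) into the phase-2 reduce, using MPI_IN_PLACE at the root.

#include <mpi.h>

/* Two-phase hierarchical reduce: first inside each group onto its
 * leader, then across the G leaders onto the global root (rank 0). */
int hierarchical_reduce(const void *sendbuf, void *recvbuf, int count,
                        MPI_Datatype datatype, MPI_Op op, int G,
                        MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int group_size = p / G;                 /* processes per group      */
    int group      = rank / group_size;     /* group index of this rank */
    int is_leader  = (rank % group_size == 0);

    /* Line 6: communicator inside the groups (one per group).          */
    MPI_Comm comm_inner;
    MPI_Comm_split(comm, group, rank, &comm_inner);

    /* Line 5: communicator between the groups (leaders only); the
     * other processes pass MPI_UNDEFINED and get MPI_COMM_NULL.        */
    MPI_Comm comm_outer;
    MPI_Comm_split(comm, is_leader ? 0 : MPI_UNDEFINED, rank, &comm_outer);

    /* Line 7: reduce inside each group onto its leader (local rank 0). */
    MPI_Reduce(sendbuf, recvbuf, count, datatype, op, 0, comm_inner);

    /* Line 8: reduce the leaders' partial results onto the global root
     * (rank 0 of comm, which is also rank 0 of comm_outer).            */
    if (comm_outer != MPI_COMM_NULL) {
        int outer_rank;
        MPI_Comm_rank(comm_outer, &outer_rank);
        if (outer_rank == 0)
            MPI_Reduce(MPI_IN_PLACE, recvbuf, count, datatype, op, 0, comm_outer);
        else
            MPI_Reduce(recvbuf, NULL, count, datatype, op, 0, comm_outer);
        MPI_Comm_free(&comm_outer);
    }
    MPI_Comm_free(&comm_inner);
    return MPI_SUCCESS;
}

With root fixed to process 0, line 4 of the pseudocode (the calculation of the outer root) becomes trivial in this sketch; a general implementation would compute it from the root rank and the group size.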

3.2 Hierarchical Transformation of Pipeline Reduce Algorithm

If we sum the costs of the reduce inside the groups and between the groups with the pipeline algorithm, the overall run time will be as follows:

F(G) = (2X + G + p/G − 4) × (α + (m/X)×β + (m/X)×γ)    (10)

In the same way, it can easily be shown that the optimal value of the cost function is as follows:

F(√p) = (2X + 2√p − 4) × (α + (m/X)×β + (m/X)×γ)    (11)

3.3 Hierarchical Transformation of Binary Reduce Algorithm

For simplicity, we take p + 1 ≈ p in formula (5). Then the costs of the reduce operations between the groups and inside the groups will be 2×log2(G)×(α + m×β + m×γ) and 2×log2(p/G)×(α + m×β + m×γ) respectively. If we add these two terms, the overall cost of the hierarchical transformation of the binary tree algorithm will be equal to the cost of the original algorithm.

3.4 Hierarchical Transformation of Binomial Reduce Algorithm

Similarly to the binary tree reduce algorithm, the cost function of the binomial tree will not change after the hierarchical transformation.


3.5 Hierarchical Transformation of Rabenseifner's Reduce Algorithm

By applying formula (7) between the groups with G processes and inside the groups with p/G processes, we can find the run time of the hierarchical transformation of Rabenseifner's algorithm. Unlike the previous algorithms, the theoretical cost now increases in comparison to the original Rabenseifner's algorithm. Therefore, theoretically, the hierarchical reduce implementation should use a number of groups equal to one, in which case the hierarchical algorithm falls back to the original algorithm.

2×log2(p)×α + 2×m×β×(2 − G/p − 1/G) + m×γ×(2 − G/p − 1/G)    (12)
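
A quick check, not given in the paper but following directly from (7) and (12), confirms this claim. The factor multiplying the bandwidth and computation terms in (12) satisfies

\[
f(G) = 2 - \frac{G}{p} - \frac{1}{G}, \qquad
f(1) = f(p) = 1 - \frac{1}{p}, \qquad
f'(G) = -\frac{1}{p} + \frac{1}{G^{2}} = 0 \;\Longrightarrow\; G = \sqrt{p},
\]

so f attains its maximum at G = √p and equals the coefficient 1 − 1/p of the original algorithm (7) only at the trivial groupings G = 1 and G = p; for any intermediate G the hierarchical cost is strictly larger.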

3.6 Possible Overheads in the Hierarchical Design

Our implementation of the hierarchical reduce operation uses the MPI_Comm_split operation to create groups of processes. The obvious question is to what extent the split operation can affect the scalability of the hierarchical algorithms. Recent research works show different approaches to improving the scalability of MPI communicator creation operations in terms of run time and memory footprint. The research in [20] introduces a new MPI_Comm_split algorithm which scales well to millions of cores. The memory usage of the algorithm is O(p/g) and the time is O(g×log2(p) + log2²(p) + (p/g)×log2(g)), where p is the number of MPI processes and g is the number of processes in the group that performs sorting. A more recent research work [21] improves the previous algorithm with two variants. The first one, which uses a bitonic sort, needs O(log2(p)) memory and O(log2²(p)) time. The second one is a hash-based algorithm and requires O(1) memory and O(log2(p)) time. Having these algorithms, we can utilize the MPI_Comm_split operation in our hierarchical design with a negligible overhead for creating MPI sub-communicators. For large messages the relative overhead is negligible, as the cost of the split operation does not depend on the message size.

4 Experiments

The experiments were carried out on the Grid'5000 infrastructure in France. The platform consists of 24 clusters distributed over 9 sites in France and one in Luxembourg, comprising 1006 nodes and 8014 cores in total. Almost all the sites are interconnected by a 10 Gb/s high-speed network. We used the Graphene cluster from the Nancy site of the infrastructure as our main testbed. The cluster is equipped with 144 nodes, and each node has 320 GB of disk storage, 16 GB of memory and a 4-core Intel Xeon X3440 CPU. The nodes in the Graphene cluster are interconnected via 20 Gb/s InfiniBand and Gigabit Ethernet. More comprehensive information about the platform can be found on the Grid'5000 web site (http://www.grid5000.fr).


The experiments have been done with Open MPI 1.4.5, which provides a few reduce implementations. Among those implementations there are several reduce algorithms, such as the linear, chain, pipeline, binary, binomial, and in-order binary algorithms, as well as platform/architecture-specific algorithms, including reduce algorithms for InfiniBand networks and the Cheetah framework for multicore architectures. In this work, we do not consider the platform-specific reduce implementations. We used the same approach as described in MPIBlib [19] to benchmark our experiments. During the experiments, the mentioned reduce algorithms were selected by using the Open MPI MCA (Modular Component Architecture) parameters coll_tuned_use_dynamic_rules and coll_tuned_reduce_algorithm. The MPI_MAX operation has been used in the experiments. We used the Graphene cluster with two experimental settings, one process per core and one process per node, with the InfiniBand-20G network. A power-of-two number of processes has been used in the experiments.
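
As an illustration, a particular tuned reduce algorithm and segment size can be forced on the command line through the MCA parameters named above. The numeric value that selects each algorithm depends on the Open MPI version and can be listed with ompi_info, so the value below is only indicative, and the benchmark binary name is just a placeholder:

mpirun -np 512 --mca coll_tuned_use_dynamic_rules 1 \
               --mca coll_tuned_reduce_algorithm 3 \
               --mca coll_tuned_reduce_algorithm_segmentsize 32768 \
               ./reduce_benchmark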

4.1 Experiments: One Process per Core

The nodes in the Graphene cluster are organized into four groups and connected to four switches. The switches in turn are connected to the main Nancy router. We used 10 patterns of process-to-core mappings, but we show experimental results only for one such mapping, where the processes are grouped by their rank in increasing order. The measurements with the other groupings showed similar performance.

The theoretical and experimental results showed that the hierarchical approach mainly improves the algorithms which assume a flat arrangement of the processes, such as linear, chain and pipeline. On the other hand, the native Open MPI reduce operation selects different algorithms depending on the message size, the count and the number of processes passed to the MPI_Reduce function. This means the hierarchical transformation can improve the native reduce operation as well. The algorithms used in the Open MPI decision function are the linear, chain, binomial, binary/in-order binary and pipeline reduce algorithms, which can be used with different sizes of segmented messages.

Figure 3 shows experiments with the default Open MPI reduce operation with a message of size 16 KB, where the best performance is achieved when the group size is 1 or p, in which case the hierarchical reduce obviously turns into the original non-hierarchical reduce. Here, for different numbers of groups, the Open MPI decision function selected different reduce algorithms. Namely, if the number of groups is 8 or 64, then Open MPI selects the binary tree reduce algorithm between the groups and inside the groups respectively. In all other cases the binomial tree reduce algorithm is used. Figure 4 shows similar measurements with a message of size 16 MB, where one can see a huge performance improvement of up to 30 times. This improvement does not come solely from the hierarchical optimization itself, but also because the number of groups in the hierarchical reduce caused the Open MPI decision function to select the pipeline reduce algorithm with different segment sizes for each group. The selection of the algorithms for different numbers of groups is described in Table 1.


Fig. 3. Hierarchical native reduce. m = 16 KB and p = 512.

Fig. 4. Hierarchical native reduce. m = 16 MB and p = 512.

Table 1. Open MPI algorithm selection in HReduce. m = 16 MB, p = 512.

Groups | Inside groups  | Between groups
1      | -              | Pipeline 32 KB
2      | Pipeline 32 KB | Pipeline 64 KB
4      | Pipeline 32 KB | Pipeline 64 KB
8      | Pipeline 32 KB | Pipeline 64 KB
16     | Pipeline 64 KB | Pipeline 64 KB
32     | Pipeline 64 KB | Pipeline 64 KB
64     | Pipeline 64 KB | Pipeline 32 KB
128    | Pipeline 64 KB | Pipeline 32 KB
256    | Pipeline 64 KB | Pipeline 32 KB

As mentioned in Sect. 3.6, it is expected that the overhead from the MPI_Comm_split operation should affect only reduce operations with smaller message sizes. Figure 5 validates this with experimental results. The hierarchical reduce operation on a 1 KB message with the underlying native reduce achieved its best performance when the number of groups was one, as the overhead from the split operation itself was higher than the cost of the reduce.

It is interesting to study the pipeline algorithm with different segment sizes, as it is used for large message sizes in Open MPI. Figure 6 presents experiments with the hierarchical pipeline reduce with a message size of 16 KB and 1 KB segmentation. We selected the segment sizes using the coll_tuned_reduce_algorithm_segmentsize parameter provided by MCA. Figures 7 and 8 show the performance of the pipeline algorithm with segment sizes of 32 KB and 64 KB respectively. In the first case, we see a 26.5 times improvement, while with 64 KB segments the improvement is 18.5 times.

Figure 9 demonstrates the speedup of the hierarchical transformation of the native Open MPI reduce operation and the linear, chain, pipeline, binary, binomial, and in-order binary reduce algorithms with message sizes from 16 KB up to 16 MB. Except for the binary, binomial and in-order binary reduce algorithms, there is a significant performance improvement. In the figure, NT is the native Open MPI reduce operation, LN is linear, CH is chain, PL is pipeline with 32 KB segmentation, BR is binary, BL is binomial, and IBR denotes the in-order binary tree reduce algorithm. We would like to highlight one important point: Fig. 9 does not compare the performance of different Open MPI reduce algorithms; it rather shows the speedup of their hierarchical transformations. Each of these algorithms can be better than the others in some specific settings depending on the message size, the number of processes, the underlying network and so on. At the same time, the hierarchical transformation of these algorithms will either improve their performance or be equally fast.

Fig. 5. Time spent on MPI_Comm_split and hierarchical native reduce. m = 1 KB, p = 512.

Fig. 6. Hierarchical pipeline reduce. m = 16 KB, segment 1 KB and p = 512.

Fig. 7. Hierarchical pipeline reduce. m = 16 MB, segment 32 KB and p = 512.

Fig. 8. Hierarchical pipeline reduce. m = 16 MB, segment 64 KB and p = 512.


Fig. 9. Speedup on 256 (left) and 512 (right) cores, one process per core.

4.2 Experiments: One Process per Node

The experiments with one process per node showed a similar trend to that of the one-process-per-core setting. The performance of the linear, chain, pipeline and native Open MPI reduce operations can be improved by the hierarchical approach. Figures 10 and 11 show experiments on 128 nodes with message sizes of 16 KB and 16 MB respectively. In the first setting, the Open MPI decision function uses the binary tree algorithm when the number of processes between or inside the groups is 8; in all other cases the binomial tree is used.

Fig. 10. Hierarchical native reduce. m = 16 KB and p = 128.

Fig. 11. Hierarchical native reduce. m = 16 MB and p = 128.

The pipeline algorithm shows a performance improvement similar to that with 512 processes; Fig. 12 shows experiments with a message of size 16 MB segmented with 32 KB and 64 KB segment sizes. The labels on the x axis have the same meaning as in the previous section.

Fig. 12. Hierarchical pipeline reduce. m = 16 MB, segment 32 KB (left) and 64 KB (right). p = 128.

Figure 13 presents the speedup of the hierarchical transformations of all the reduce algorithms from the Open MPI "TUNED" component with message sizes from 16 KB up to 16 MB on 64 (left) and 128 (right) nodes. Again, the reduce algorithms which have a "flat" design and the default Open MPI reduce operation show a multi-fold performance improvement.

Fig. 13. Speedup on 64 (left) and 128 (right) nodes, one process per node.

5 Conclusion

Although there has been a lot of research on MPI collective communication, this work shows that its performance is still far from optimal and that there is room for improvement. Indeed, our simple hierarchical optimization, which transforms existing MPI reduce algorithms into a two-level hierarchy, gives a significant improvement on small- and medium-scale platforms. We believe that the idea can be incorporated into the Open MPI decision function to improve the performance of the reduce algorithms even further. It can also be used as standalone software on top of MPI-based applications.


The key feature of the optimization is that it can never be worse than the underlying reduce operation it transforms: in the worst case, the algorithm can use one group and fall back to the native reduce operation.

As future work, we plan to investigate whether using different reduce algorithms in each phase and different numbers of processes per group can improve the performance. We would also like to generalize our optimization to other MPI collective operations.

Acknowledgments. This work has emanated from research conducted with the financial support of IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant number EPSPG/2011/188, and Science Foundation Ireland, grant number 08/IN.1/I2054.

The experiments presented in this publication were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr).

References

1. Message Passing Interface Forum. http://www.mpi-forum.org/
2. Rabenseifner, R.: Automatic MPI counter profiling of all users: first results on a CRAY T3E 900-512. In: Proceedings of the Message Passing Interface Developers and Users Conference 1999 (MPIDC'99), pp. 77–85 (1999)
3. Hasanov, K., Quintin, J.-N., Lastovetsky, A.: Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms. J. Supercomput., 24 p., March 2014, Springer. doi:10.1007/s11227-014-1133-x
4. van de Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9(4), 255–274 (1997)
5. Hasanov, K., Quintin, J.-N., Lastovetsky, A.: High-level topology-oblivious optimization of MPI broadcast algorithms on extreme-scale platforms. In: Lopes, L., Zilinskas, J., Costan, A., Cascella, R.G., Kecskemeti, G., Jeannot, E., Cannataro, M., Ricci, L., Benkner, S., Petit, S., Scarano, V., Gracia, J., Hunold, S., Scott, S.L., Lankes, S., Lengauer, C., Carretero, J., Breitbart, J., Alexander, M. (eds.) Euro-Par 2014, Part II. LNCS, vol. 8806, pp. 412–424. Springer, Heidelberg (2014)
6. Hasanov, K., Quintin, J.-N., Lastovetsky, A.: Topology-oblivious optimization of MPI broadcast algorithms on extreme-scale platforms. Simulation Modelling Practice and Theory, 10 p., April 2015. doi:10.1016/j.simpat.2015.03.005
7. Gabriel, E., Fagg, G., Bosilca, G., Angskun, T., Dongarra, J., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the 11th European PVM/MPI Users' Group Meeting (2004)
8. Grid'5000. http://www.grid5000.fr
9. Bala, V., Bruck, J., Cypher, R., Elustondo, P., Ho, C.-T., Ho, C.-T., Kipnis, S., Snir, M.: CCL: a portable and tunable collective communication library for scalable parallel computers. IEEE TPDS 6(2), 154–164 (1995)
10. Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A., Bhoedjang, R.A.F.: MagPIe: MPI's collective communication operations for clustered wide area systems. In: Proceedings of PPoPP'99, 34(8), 131–140 (1999)
11. Vadhiyar, S.S., Fagg, G.E., Dongarra, J.: Automatically tuned collective communications. In: Proceedings of the ACM/IEEE Conference on Supercomputing (2000)
12. Chan, E.W., Heimlich, M.F., Purkayastha, A., van de Geijn, R.A.: On optimizing collective communication. In: Proceedings of the IEEE International Conference on Cluster Computing (2004)
13. Rabenseifner, R.: Optimization of collective reduction operations. In: Proceedings of the International Conference on Computational Science, June 2004
14. Sanders, P., Speck, J., Traff, J.L.: Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput. 35(12), 581–594 (2009)
15. MPICH - A Portable Implementation of MPI. http://www.mpich.org/
16. Thakur, R., Gropp, W.D.: Improving the performance of collective operations in MPICH. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840, pp. 257–267. Springer, Heidelberg (2003)
17. Venkata, M.G., Shamis, P., Sampath, R., Graham, R.L., Ladd, J.S.: Optimizing blocking and nonblocking reduction operations for multicore systems: hierarchical design and implementation. In: Proceedings of IEEE Cluster, pp. 1–8 (2013)
18. Hockney, R.W.: The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Comput. 20(3), 389–398 (1994)
19. Lastovetsky, A., Rychkov, V., O'Flynn, M.: MPIBlib: benchmarking MPI communications for parallel computing on homogeneous and heterogeneous clusters. In: Lastovetsky, A., Kechadi, T., Dongarra, J. (eds.) EuroPVM/MPI 2008. LNCS, vol. 5205, pp. 227–238. Springer, Heidelberg (2008)
20. Sack, P., Gropp, W.: A scalable MPI_Comm_split algorithm for exascale computing. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds.) EuroMPI 2010. LNCS, vol. 6305, pp. 1–10. Springer, Heidelberg (2010)
21. Moody, A., Ahn, D.H., de Supinski, B.R.: Exascale algorithms for generalized MPI_Comm_split. In: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface (EuroMPI 2011) (2011)