Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees
April 15, 2005
Chikayama-Taura Laboratory
46411 Hideo Saito
Background
Opportunities to perform message passing in WANs are increasing
[Figure: WAN]
Message Passing in WANs
WAN → more resources
However, systems designed for LANs do not perform well in WANs
Collective operations (broadcast, reduction) designed for LANs perform horribly in WANs
Collective Operations
Operations in which all processors participate (cf. point-to-point send/receive)
Ex. broadcast, reduction
[Figure: sum reduction (∑): each processor's values are combined toward the root]
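The reduction sketched above can be illustrated with a minimal example. The tree layout, node names, and values below are made up for illustration; each node combines its own value with its children's partial sums, so the root ends up with the global sum.

```python
# Hypothetical sketch of a sum reduction over a tree of processors.
# The tree layout, node names, and values are illustrative only.

tree = {                      # parent -> children
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": [],
    "c": [],
    "d": [],
}
values = {"root": 3, "a": 5, "b": 2, "c": 2, "d": 1}

def reduce_sum(node):
    # Each node adds its own value to its children's partial sums.
    return values[node] + sum(reduce_sum(c) for c in tree[node])

print(reduce_sum("root"))  # 13
```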
Collective Operations in WANs
Topology must be considered for high performance
[Figure: several LANs connected by wide-area links]
Manual configuration is undesirable
Processors should be able to join/leave
Objective
To design and implement collective operations with high performance in WANs, without manual configuration, and with support for joining/leaving processors
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Collective Operations of MPICH
MPICH (Thakur et al. 2003)
Latency-aware tree for short messages
Bandwidth-aware tree for long messages
Binomial Tree
[Figure: binomial broadcast tree, rooted at the root processor]
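The binomial tree doubles the number of informed processors each round, which is why MPICH favors it for short messages. Below is a minimal sketch of the scheduling pattern; the rank numbering is illustrative, not MPICH's actual implementation.

```python
# Sketch of binomial-tree broadcast scheduling (recursive doubling).
# In round k, every rank that already has the message sends it to
# rank + 2**k, so the informed set doubles each round.

def binomial_rounds(nprocs):
    """Return, per round, the (sender, receiver) pairs of the broadcast."""
    have = {0}                          # rank 0 is the root
    rounds = []
    k = 0
    while len(have) < nprocs:
        pairs = []
        for src in sorted(have):
            dst = src + 2 ** k
            if dst < nprocs:
                pairs.append((src, dst))
        have.update(dst for _, dst in pairs)
        rounds.append(pairs)
        k += 1
    return rounds

# 8 processors are covered in log2(8) = 3 rounds.
print(binomial_rounds(8))
```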
Collective Operations of MPICH
[Figure: long-message broadcast as a scatter followed by a ring all-gather]
Collective Operations of MPICH
MPICH assumes that latency and bandwidth are uniform
But latency and bandwidth differ by orders of magnitude between local-area and wide-area links
Collective operations designed for LANs do not perform well in WANs
High-Performance Collective Operations for WANs
Manual configuration necessary
Processors cannot join/leave
[Figure: statically configured hierarchy of LANs]
MagPIe (Kielmann et al. 1999)
Bandwidth-Efficient Collective Operations (Kielmann et al. 2000)
MPICH-G2 (Karonis et al. 2003)
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Overview of Our Proposal
Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor
Perform collective operations along those trees
Provide a mechanism to support joining/leaving processors
Implement as an extension to the Phoenix Message Passing Library
Phoenix (Taura et al. 2003)
Message passing library for Grids
Not an implementation of MPI, but has its own API
Messages are sent to virtual nodes, not processors
[Figure: ph_send(3) is delivered to whichever processor currently hosts virtual node 3: initially Processor B, which hosts {3, 4}; after virtual node 3 migrates to Processor C, the same ph_send(3) reaches Processor C]
Latency-Aware Spanning Tree Algorithm
Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime
Not too deep, not too big a fan-out
Minimal number of wide-area relationships
[Figure: latency-aware tree spanning several LANs, rooted in one of them]
Parent Selection
Node n changes its parent from p to candidate c if both
RTT(n, c) < RTT(n, p) AND RTT(c, r) < RTT(n, r)
[Figure: node n, current parent p, candidate c, and root r, labeled with RTT(n, p), RTT(n, c), RTT(n, r), and RTT(c, r)]
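The parent-switch rule above can be written down directly. A minimal sketch, with made-up RTT values: the first condition prefers a closer parent, and the second (candidate closer to the root than n itself) keeps the tree from degenerating.

```python
# Hypothetical sketch of the latency-aware parent-selection rule:
# node n switches from parent p to candidate c only if c is closer
# to n than p is, AND c is closer to the root r than n is.
# RTT values (in ms) are illustrative only.

def should_switch(rtt_n_c, rtt_n_p, rtt_c_r, rtt_n_r):
    return rtt_n_c < rtt_n_p and rtt_c_r < rtt_n_r

# c is in the same LAN as n (1 ms) while p is across the WAN (80 ms),
# and c is closer to the root than n is: n switches to c.
print(should_switch(rtt_n_c=1, rtt_n_p=80, rtt_c_r=40, rtt_n_r=45))  # True
```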
Wide-area relationships are quickly replaced by local-area relationships
[Figure: wide-area parent links replaced by local-area links within each LAN]
Tree Creation within a LAN
Force that makes the tree deeper
Force that makes the tree shallower
→ Tree that is not too deep, not too shallow
Will nodes that are placed too deep move up?
Will nodes that are placed too shallow move down?
[Figure: the two opposing forces balancing tree depth within a LAN]
Bandwidth-Aware Spanning Tree Algorithm
Each processor looks for a suitable parent using bandwidth measured at runtime
Place processors as far away as possible from the root without sacrificing bandwidth
[Figure: a node whose fan-out is too large]
Become a child of a sibling if it does not sacrifice bandwidth, to create long pipes
Parent Selection
Find a parent with high bandwidth to the root
Estimate BW(n→c→r) as min(BW(n, c), BW(c, r))
Compare it against BW(n→p→r), the bandwidth through the current parent
[Figure: node n, current parent p, candidate c, and root r, labeled with BW(n→p→r) and BW(n→c→r)]
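As a minimal sketch of the bandwidth rule above: the bandwidth of a two-hop path is bottlenecked by its slower hop, so it is estimated as the minimum of the two, and n adopts c only when that estimate beats the current path. The bandwidth figures (MB/s) are made up for illustration.

```python
# Sketch of the bandwidth-aware parent-selection rule: the path
# n -> c -> root is estimated as the minimum of its two hops, and
# n switches from p to c only if that beats the path through p.
# Bandwidth figures (MB/s) are illustrative only.

def path_bw(bw_first_hop, bw_rest_of_path):
    return min(bw_first_hop, bw_rest_of_path)

def should_switch(bw_n_c, bw_c_r, bw_n_p, bw_p_r):
    return path_bw(bw_n_c, bw_c_r) > path_bw(bw_n_p, bw_p_r)

# A 100 MB/s hop to a sibling with 90 MB/s to the root beats a
# path through p bottlenecked at 50 MB/s: n becomes c's child,
# lengthening the pipe without sacrificing bandwidth.
print(should_switch(bw_n_c=100, bw_c_r=90, bw_n_p=50, bw_p_r=120))  # True
```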
Broadcast
[Figure: Stable Topology: the broadcast flows down the spanning tree from virtual node {0} through {1}–{5}, each processor delivering it to the virtual nodes it hosts. Changing Topology: while virtual node 5 is migrating, the broadcast is forwarded to it with a point-to-point message]
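The broadcast above walks the spanning tree of processors, with each processor delivering locally to the virtual nodes it currently hosts. A minimal sketch of the stable-topology case; the processor names and virtual-node mapping are made up for illustration.

```python
# Sketch of a broadcast along a spanning tree of processors, where
# each processor delivers the message to the virtual nodes it hosts
# before forwarding it to its children. Names are illustrative only.

children = {"A": ["B", "C"], "B": [], "C": []}   # spanning tree
hosted = {"A": [0], "B": [1, 3, 4], "C": [2, 5]}  # vnodes per processor

def broadcast(proc, msg, delivered):
    for vnode in hosted[proc]:          # deliver to local virtual nodes
        delivered[vnode] = msg
    for child in children[proc]:        # forward down the tree
        broadcast(child, msg, delivered)

got = {}
broadcast("A", "hello", got)
print(sorted(got))  # [0, 1, 2, 3, 4, 5]
```

If a virtual node has migrated away since the tree was built, its message would instead be forwarded with a point-to-point send, as the figure shows.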
Reduction
[Figure: Stable Topology: partial results flow up the spanning tree toward the root {0}. Changing Topology: a processor waiting for virtual node 5's contribution gives up after a timeout]
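The timeout behavior above can be sketched minimally: a processor combines whatever contributions arrived from its children before the timeout, and reports which virtual nodes are still missing. The function and its arguments are hypothetical, for illustration only.

```python
# Sketch of the timeout step in reduction: combine the child
# contributions that arrived in time, and report the virtual nodes
# whose values are still missing. All names are illustrative.

def partial_reduce(own, received, expected):
    """received: {vnode: value} contributions that arrived before timeout."""
    missing = set(expected) - set(received)
    total = own + sum(received.values())
    return total, missing

# Virtual node 5's contribution never arrived before the timeout,
# so the processor forwards an incomplete partial sum upward.
total, missing = partial_reduce(own=2, received={3: 1, 4: 0}, expected=[3, 4, 5])
print(total, missing)  # 3 {5}
```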
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Preliminary Experiments
1. Latency-aware spanning tree creation (Java applet)
2. Stable-state short-message broadcast
3. Stable-state short-message reduction
4. Transient-state short-message broadcast
Broadcast (Stable-State)
1 B broadcast over 201 processors in 3 clusters
[Graphs: completion time (msecs, 0–35) vs. processor number (0–200) for the topology-unaware implementation, the topology-aware implementation, and our implementation]
Reduction (Stable-State)
Reduction using 128 processors in 3 clusters
[Graph: completion time (microsecs) vs. number of integers summed, log-log scale, for the topology-unaware implementation, the topology-aware implementation, and our implementation]
Transient-State Behavior
201 processors in 3 clusters (1 virtual node per processor)
Repeatedly perform broadcasts
100 processors leave after 60 secs: virtual nodes are remapped to the remaining processors
100 processors re-join after 30 more secs: virtual nodes are given back to the original processors
Transient-State Behavior
[Graph: total waiting time (seconds, 0–30) vs. elapsed time (seconds, 0–120), with the Leave and Join events marked]
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Conclusion
Designed and implemented latency-aware broadcast and reduction for wide-area networks
Showed that they perform reasonably well in stable topologies
Showed that they support joining/leaving processors
Future Work
Implement bandwidth-aware spanning tree
Publications
1. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Communication for Wide-Area Message Passing Using Dynamic Spanning Trees. In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear).
2. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004.
3. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.
Broadcast (Stable-State)
Broadcast over 251 processors in 3 clusters
[Graph: bandwidth (MB/sec, 0–160) vs. message length (B, log scale) for the topology-unaware implementation, the topology-aware implementation, and our implementation]