Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees
April 15, 2005
Chikayama-Taura Laboratory
46411 Hideo Saito
Background
Opportunities to perform message passing in WANs are increasing
[Figure: WAN]
Message Passing in WANs
WAN → more resources
However, systems designed for LANs do not perform well in WANs
Collective operations (broadcast, reduction) designed for LANs perform horribly in WANs
Collective Operations
Operations in which all processors participate (cf. point-to-point send/receive)
Ex. broadcast, reduction
[Figure: sum reduction (∑): each processor's values are combined toward the root]
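The reduction sketched above can be illustrated with a minimal example. The tree layout, node names, and values below are made up for illustration; each node combines its own value with its children's partial sums, so the root ends up with the global sum.

```python
# Hypothetical sketch of a sum reduction over a tree of processors.
# The tree layout, node names, and values are illustrative only.

tree = {                      # parent -> children
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": [],
    "c": [],
    "d": [],
}
values = {"root": 3, "a": 5, "b": 2, "c": 2, "d": 1}

def reduce_sum(node):
    # Each node adds its own value to its children's partial sums.
    return values[node] + sum(reduce_sum(c) for c in tree[node])

print(reduce_sum("root"))  # 13
```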
Collective Operations in WANs
Topology must be considered for high performance
[Figure: several LANs connected by wide-area links]
Manual configuration is undesirable
Processors should be able to join/leave
Objective
To design and implement collective operations with high performance in WANs, without manual configuration, and with support for joining/leaving processors
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Collective Operations of MPICH
MPICH (Thakur et al. 2003)
Latency-aware tree for short messages
Bandwidth-aware tree for long messages
Binomial Tree
[Figure: binomial broadcast tree, rooted at the root processor]
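The binomial tree doubles the number of informed processors each round, which is why MPICH favors it for short messages. Below is a minimal sketch of the scheduling pattern; the rank numbering is illustrative, not MPICH's actual implementation.

```python
# Sketch of binomial-tree broadcast scheduling (recursive doubling).
# In round k, every rank that already has the message sends it to
# rank + 2**k, so the informed set doubles each round.

def binomial_rounds(nprocs):
    """Return, per round, the (sender, receiver) pairs of the broadcast."""
    have = {0}                          # rank 0 is the root
    rounds = []
    k = 0
    while len(have) < nprocs:
        pairs = []
        for src in sorted(have):
            dst = src + 2 ** k
            if dst < nprocs:
                pairs.append((src, dst))
        have.update(dst for _, dst in pairs)
        rounds.append(pairs)
        k += 1
    return rounds

# 8 processors are covered in log2(8) = 3 rounds.
print(binomial_rounds(8))
```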
Collective Operations of MPICH
[Figure: long-message broadcast as a scatter followed by a ring all-gather]
Collective Operations of MPICH
MPICH assumes that latency and bandwidth are uniform
But latency and bandwidth differ by orders of magnitude between local-area and wide-area links
Collective operations designed for LANs do not perform well in WANs
High-Performance Collective Operations for WANs
Manual configuration necessary
Processors cannot join/leave
[Figure: statically configured hierarchy of LANs]
MagPIe (Kielmann et al. 1999)
Bandwidth-Efficient Collective Operations (Kielmann et al. 2000)
MPICH-G2 (Karonis et al. 2003)
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Overview of Our Proposal
Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor
Perform collective operations along those trees
Provide a mechanism to support joining/leaving processors
Implement as an extension to the Phoenix Message Passing Library
Phoenix (Taura et al. 2003)
Message passing library for Grids
Not an implementation of MPI, but has its own API
Messages are sent to virtual nodes, not processors
[Figure: ph_send(3) is delivered to whichever processor currently hosts virtual node 3: initially Processor B, which hosts {3, 4}; after virtual node 3 migrates to Processor C, the same ph_send(3) reaches Processor C]
Latency-Aware Spanning Tree Algorithm
Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime
Not too deep, not too big a fan-out
Minimal number of wide-area relationships
[Figure: latency-aware tree spanning several LANs, rooted in one of them]
Parent Selection
Node n changes its parent from p to candidate c if both
RTT(n, c) < RTT(n, p) AND RTT(c, r) < RTT(n, r)
[Figure: node n, current parent p, candidate c, and root r, labeled with RTT(n, p), RTT(n, c), RTT(n, r), and RTT(c, r)]
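The parent-switch rule above can be written down directly. A minimal sketch, with made-up RTT values: the first condition prefers a closer parent, and the second (candidate closer to the root than n itself) keeps the tree from degenerating.

```python
# Hypothetical sketch of the latency-aware parent-selection rule:
# node n switches from parent p to candidate c only if c is closer
# to n than p is, AND c is closer to the root r than n is.
# RTT values (in ms) are illustrative only.

def should_switch(rtt_n_c, rtt_n_p, rtt_c_r, rtt_n_r):
    return rtt_n_c < rtt_n_p and rtt_c_r < rtt_n_r

# c is in the same LAN as n (1 ms) while p is across the WAN (80 ms),
# and c is closer to the root than n is: n switches to c.
print(should_switch(rtt_n_c=1, rtt_n_p=80, rtt_c_r=40, rtt_n_r=45))  # True
```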
Wide-area relationships are quickly replaced by local-area relationships
[Figure: wide-area parent links replaced by local-area links within each LAN]
Tree Creation within a LAN
Force that makes the tree deeper
Force that makes the tree shallower
→ Tree that is not too deep, not too shallow
Will nodes that are placed too deep move up?
Will nodes that are placed too shallow move down?
[Figure: the two opposing forces balancing tree depth within a LAN]
Bandwidth-Aware Spanning Tree Algorithm
Each processor looks for a suitable parent using bandwidth measured at runtime
Place processors as far away as possible from the root without sacrificing bandwidth
[Figure: a node whose fan-out is too large]
Become a child of a sibling if it does not sacrifice bandwidth, to create long pipes
Parent Selection
Find a parent with high bandwidth to the root
Estimate BW(n→c→r) as min(BW(n, c), BW(c, r))
Compare it against BW(n→p→r), the bandwidth through the current parent
[Figure: node n, current parent p, candidate c, and root r, labeled with BW(n→p→r) and BW(n→c→r)]
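As a minimal sketch of the bandwidth rule above: the bandwidth of a two-hop path is bottlenecked by its slower hop, so it is estimated as the minimum of the two, and n adopts c only when that estimate beats the current path. The bandwidth figures (MB/s) are made up for illustration.

```python
# Sketch of the bandwidth-aware parent-selection rule: the path
# n -> c -> root is estimated as the minimum of its two hops, and
# n switches from p to c only if that beats the path through p.
# Bandwidth figures (MB/s) are illustrative only.

def path_bw(bw_first_hop, bw_rest_of_path):
    return min(bw_first_hop, bw_rest_of_path)

def should_switch(bw_n_c, bw_c_r, bw_n_p, bw_p_r):
    return path_bw(bw_n_c, bw_c_r) > path_bw(bw_n_p, bw_p_r)

# A 100 MB/s hop to a sibling with 90 MB/s to the root beats a
# path through p bottlenecked at 50 MB/s: n becomes c's child,
# lengthening the pipe without sacrificing bandwidth.
print(should_switch(bw_n_c=100, bw_c_r=90, bw_n_p=50, bw_p_r=120))  # True
```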
Broadcast
[Figure: Stable Topology: the broadcast flows down the spanning tree from virtual node {0} through {1}–{5}, each processor delivering it to the virtual nodes it hosts. Changing Topology: while virtual node 5 is migrating, the broadcast is forwarded to it with a point-to-point message]
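The broadcast above walks the spanning tree of processors, with each processor delivering locally to the virtual nodes it currently hosts. A minimal sketch of the stable-topology case; the processor names and virtual-node mapping are made up for illustration.

```python
# Sketch of a broadcast along a spanning tree of processors, where
# each processor delivers the message to the virtual nodes it hosts
# before forwarding it to its children. Names are illustrative only.

children = {"A": ["B", "C"], "B": [], "C": []}   # spanning tree
hosted = {"A": [0], "B": [1, 3, 4], "C": [2, 5]}  # vnodes per processor

def broadcast(proc, msg, delivered):
    for vnode in hosted[proc]:          # deliver to local virtual nodes
        delivered[vnode] = msg
    for child in children[proc]:        # forward down the tree
        broadcast(child, msg, delivered)

got = {}
broadcast("A", "hello", got)
print(sorted(got))  # [0, 1, 2, 3, 4, 5]
```

If a virtual node has migrated away since the tree was built, its message would instead be forwarded with a point-to-point send, as the figure shows.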
Reduction
[Figure: Stable Topology: partial results flow up the spanning tree toward the root {0}. Changing Topology: a processor waiting for virtual node 5's contribution gives up after a timeout]
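The timeout behavior above can be sketched minimally: a processor combines whatever contributions arrived from its children before the timeout, and reports which virtual nodes are still missing. The function and its arguments are hypothetical, for illustration only.

```python
# Sketch of the timeout step in reduction: combine the child
# contributions that arrived in time, and report the virtual nodes
# whose values are still missing. All names are illustrative.

def partial_reduce(own, received, expected):
    """received: {vnode: value} contributions that arrived before timeout."""
    missing = set(expected) - set(received)
    total = own + sum(received.values())
    return total, missing

# Virtual node 5's contribution never arrived before the timeout,
# so the processor forwards an incomplete partial sum upward.
total, missing = partial_reduce(own=2, received={3: 1, 4: 0}, expected=[3, 4, 5])
print(total, missing)  # 3 {5}
```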
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Preliminary Experiments
1. Latency-aware spanning tree creation (Java applet)
2. Stable-state short-message broadcast
3. Stable-state short-message reduction
4. Transient-state short-message broadcast
Broadcast (Stable-State)
1 B broadcast over 201 processors in 3 clusters
[Graphs: completion time (msecs, 0–35) vs. processor number (0–200) for the topology-unaware implementation, the topology-aware implementation, and our implementation]
Reduction (Stable-State)
Reduction using 128 processors in 3 clusters
[Graph: completion time (microsecs) vs. number of integers summed, log-log scale, for the topology-unaware implementation, the topology-aware implementation, and our implementation]
Transient-State Behavior
201 processors in 3 clusters (1 virtual node per processor)
Repeatedly perform broadcasts
100 processors leave after 60 secs: virtual nodes are remapped to the remaining processors
100 processors re-join after 30 more secs: virtual nodes are given back to the original processors
Transient-State Behavior
[Graph: total waiting time (seconds, 0–30) vs. elapsed time (seconds, 0–120), with the Leave and Join events marked]
1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work
Conclusion
Designed and implemented latency-aware broadcast and reduction for wide-area networks
Showed that they perform reasonably well in stable topologies
Showed that they support joining/leaving processors
Future Work
Implement bandwidth-aware spanning tree
Publications
1. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Communication for Wide-Area Message Passing Using Dynamic Spanning Trees. In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear).
2. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004.
3. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.
Broadcast (Stable-State)
Broadcast over 251 processors in 3 clusters
[Graph: bandwidth (MB/sec, 0–160) vs. message length (B, log scale) for the topology-unaware implementation, the topology-aware implementation, and our implementation]