Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito


Page 1

Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees

April 15, 2005
Chikayama-Taura Laboratory
46411 Hideo Saito

Page 2

Background

Opportunities to perform message passing in WANs are increasing

[Figure: WAN]

Page 3

Message Passing in WANs

WAN → more resources
However, systems designed for LANs do not perform well in WANs
  Collective operations (broadcast, reduction) designed for LANs perform especially poorly in WANs

Page 4

Collective Operations

Operations in which all processors participate (cf. send/receive)
Ex. broadcast, reduction

[Figure: reduction. The vectors (0 2 1 1), (1 2 1 1), and (2 1 0 0) are combined element by element into (3 5 2 2) at the root]
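The numbers on this slide are consistent with an element-wise sum, so the reduction can be illustrated with a minimal sketch. The function name `reduce_vectors` and the choice of addition as the operator are assumptions made for illustration; the slide does not name the operator.

```python
from functools import reduce
from operator import add


def reduce_vectors(vectors, op=add):
    """Combine equal-length vectors element by element with `op`."""
    # zip(*vectors) groups the i-th element of every vector into one column.
    return [reduce(op, column) for column in zip(*vectors)]


# The three contributions shown on the slide.
contributions = [
    [0, 2, 1, 1],
    [1, 2, 1, 1],
    [2, 1, 0, 0],
]
print(reduce_vectors(contributions))  # prints [3, 5, 2, 2]
```

In a message passing system the same combination is performed step by step along a tree, with each processor merging the partial results of its children before passing them toward the root.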

Page 5

Collective Operations in WANs

Topology must be considered for high performance
Manual configuration is undesirable
Processors should be able to join/leave

[Figure: four LANs]

Page 6

Objective

To design and implement collective operations
  with high performance in WANs
  without manual configuration
  with support for joining/leaving processors

Page 7

1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work

Page 8

Collective Operations of MPICH

MPICH (Thakur et al. 2003)
  Latency-aware tree for short messages
  Bandwidth-aware tree for long messages

[Figure: binomial tree, with the root marked]
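A binomial tree lets the number of processors holding the message double in every round. The sketch below shows one common formulation of the send schedule, with ranks numbered relative to the root; it is an illustration, not MPICH's code, and `binomial_broadcast_schedule` is a hypothetical name.

```python
def binomial_broadcast_schedule(num_procs):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    Ranks are relative to the root (rank 0). In round k, every rank below
    2**k already holds the message and forwards it to rank + 2**k.
    """
    rounds = []
    step = 1
    while step < num_procs:
        sends = [(rank, rank + step)
                 for rank in range(step)
                 if rank + step < num_procs]
        rounds.append(sends)
        step *= 2
    return rounds


for k, sends in enumerate(binomial_broadcast_schedule(8)):
    print(f"round {k}: {sends}")
```

With 8 processors the broadcast finishes in 3 rounds, and in general in ceil(log2(P)) rounds, which is why this shape suits short messages where per-message latency dominates.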

Page 9

Collective Operations of MPICH

[Figure: long-message broadcast as a scatter from the root followed by a ring all-gather]
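The scatter followed by a ring all-gather can be simulated in a few lines. This is a sketch under simplifying assumptions: the root is rank 0, the message is a `bytes` object, and all names are hypothetical; no real network communication takes place.

```python
def scatter_ring_allgather(message, num_procs):
    """Simulate a long-message broadcast from rank 0.

    Returns, for every rank, the message it has reassembled at the end.
    """
    # Split the message into num_procs nearly equal chunks.
    size = -(-len(message) // num_procs)  # ceiling division
    chunks = [message[i * size:(i + 1) * size] for i in range(num_procs)]

    # Scatter: rank i receives chunk i from the root.
    held = [{i: chunks[i]} for i in range(num_procs)]

    # Ring all-gather: in step s, rank i forwards the chunk it obtained
    # s steps ago to its right neighbour (i + 1) mod num_procs.
    for step in range(num_procs - 1):
        transfers = []
        for rank in range(num_procs):
            idx = (rank - step) % num_procs
            transfers.append(((rank + 1) % num_procs, idx, held[rank][idx]))
        for dest, idx, chunk in transfers:
            held[dest][idx] = chunk

    return [b"".join(h[i] for i in range(num_procs)) for h in held]


results = scatter_ring_allgather(b"0123456789", 4)
print(all(r == b"0123456789" for r in results))  # prints True
```

Each link carries only one chunk per step, so no single processor has to push the whole message to many children; this is what makes the scheme attractive for long messages.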

Page 10

Collective Operations of MPICH

MPICH assumes that latency and bandwidth are uniform
But latency and bandwidth differ by orders of magnitude between local-area and wide-area links
Collective operations for LANs do not perform well in WANs

Page 11

High-Performance Collective Operations for WANs

MagPIe (Kielmann et al. 1999)
Bandwidth-Efficient Collective Operations (Kielmann et al. 2000)
MPICH-G2 (Karonis et al. 2003)

Manual configuration necessary
Processors cannot join/leave

[Figure: three LANs]

Page 12

1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work

Page 13

Overview of Our Proposal

Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor
Perform collective operations along those trees
Provide a mechanism to support joining/leaving processors
Implement as an extension to the Phoenix Message Passing Library

Page 14

Phoenix (Taura et al. 2003)

Message passing library for Grids
Not an implementation of MPI, but has its own API
Messages are sent to virtual nodes, not processors

[Figure: two virtual-node mappings. First: Processor A holds {0, 1, 2} and Processor B holds {3, 4}. Second: Processor A holds {0, 1, 2}, Processor B holds {4}, and Processor C holds {3}. In both cases the sender calls ph_send(3)]
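The idea of addressing virtual nodes rather than processors can be illustrated with a toy routing table. The class below is purely hypothetical and is not the Phoenix API; it only shows why a sender that addresses virtual node 3 needs no change when that node moves from Processor B to Processor C.

```python
class VirtualNodeMap:
    """Toy routing table from virtual node numbers to processor names."""

    def __init__(self, assignment):
        # assignment: {processor name: set of virtual node numbers}
        self.owner = {}
        for processor, nodes in assignment.items():
            for node in nodes:
                self.owner[node] = processor

    def migrate(self, node, new_processor):
        """Hand a virtual node over to another processor."""
        self.owner[node] = new_processor

    def route(self, node):
        """Return the processor that currently receives messages for node."""
        return self.owner[node]


vmap = VirtualNodeMap({"A": {0, 1, 2}, "B": {3, 4}})
print(vmap.route(3))   # prints B
vmap.migrate(3, "C")   # Processor C takes over virtual node 3
print(vmap.route(3))   # prints C
```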

Page 15

Latency-Aware Spanning Tree Algorithm

Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime

[Figure: spanning tree over three LANs, with the root in one of them. Annotations: "Not too deep, not too big a fan-out" and "Minimal number of wide-area relationships"]

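Page 16

Parent Selection

Change parents if both
  $\mathrm{RTT}_{n,c} < \mathrm{RTT}_{n,p}$ and
  $\mathrm{RTT}_{c,r} < \mathrm{RTT}_{n,r}$

Wide-area relationships are quickly replaced by local-area relationships

[Figure: node n, its current parent p, a candidate parent c, and the root r across two LANs, with links labeled $\mathrm{RTT}_{n,p}$, $\mathrm{RTT}_{n,c}$, $\mathrm{RTT}_{c,r}$, and $\mathrm{RTT}_{n,r}$]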
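The parent-change rule on this slide translates directly into a predicate. The sketch below assumes that n is the node making the decision, p its current parent, c the candidate, and r the root, and that the four round-trip times have already been measured; the function name and the sample values are hypothetical.

```python
def should_change_parent(rtt_n_c, rtt_n_p, rtt_c_r, rtt_n_r):
    """Return True if node n should switch from parent p to candidate c."""
    # Condition 1: the candidate is closer to n than the current parent.
    closer_than_parent = rtt_n_c < rtt_n_p
    # Condition 2: the candidate is closer to the root than n itself.
    candidate_nearer_root = rtt_c_r < rtt_n_r
    return closer_than_parent and candidate_nearer_root


# Hypothetical RTTs in milliseconds: c is in n's LAN, p is across the WAN.
print(should_change_parent(rtt_n_c=0.2, rtt_n_p=20.0,
                           rtt_c_r=19.9, rtt_n_r=20.0))  # prints True
# A nearby candidate that is farther from the root than n is rejected.
print(should_change_parent(rtt_n_c=0.2, rtt_n_p=20.0,
                           rtt_c_r=20.1, rtt_n_r=20.0))  # prints False
```

In the first call a wide-area parent is replaced by a local one, which matches the slide's remark that wide-area relationships are quickly replaced by local-area relationships.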

Page 17

Tree Creation within a LAN

Force that makes tree deeper
Force that makes tree shallower
Tree that is not too deep, not too shallow

Will nodes that are placed too deep move up?
Will nodes that are placed too shallow move down?

Page 18

Bandwidth-Aware Spanning Tree Algorithm

Each processor looks for a suitable parent using bandwidth measured at runtime
Place processors as far away as possible from the root without sacrificing bandwidth

[Figure: three LANs]

Page 19

Parent Selection

Find a parent with high bandwidth to the root
  Estimate $\mathrm{BW}_{n\text{-}c\text{-}r}$ as $\min(\mathrm{BW}_{n\text{-}c}, \mathrm{BW}_{c\text{-}r})$
Become a child of a sibling if it does not sacrifice bandwidth, to create long pipes

[Figure: nodes p, n, c, and r, with the paths $\mathrm{BW}_{n\text{-}p\text{-}r}$ and $\mathrm{BW}_{n\text{-}c\text{-}r}$ labeled. Annotation: "Fan-out too large!"]
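The bandwidth estimate is the bottleneck of the two hops, which a short sketch makes concrete. The `min` estimate comes from the slide; the comparison in `prefers_candidate` is an assumed reading of "does not sacrifice bandwidth", and both function names and the sample figures are hypothetical.

```python
def estimate_path_bandwidth(bw_n_c, bw_c_r):
    """Estimate BW(n-c-r) as the narrower of the two hops."""
    return min(bw_n_c, bw_c_r)


def prefers_candidate(bw_n_p_r, bw_n_c, bw_c_r):
    """Assumed rule: move under c only if bandwidth to the root is not reduced."""
    return estimate_path_bandwidth(bw_n_c, bw_c_r) >= bw_n_p_r


# Hypothetical measurements in MB/s.
print(estimate_path_bandwidth(80.0, 40.0))   # prints 40.0
print(prefers_candidate(40.0, 80.0, 40.0))   # prints True
print(prefers_candidate(40.0, 80.0, 10.0))   # prints False
```

Moving under a sibling whenever the estimate allows it lengthens the chain and reduces the fan-out of the original parent, which is the "long pipes" effect the slide describes.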

Page 20

Broadcast

Stable Topology

[Figure: processors holding virtual nodes {0}, {1}, {2}, {3}, {4}, and {5}; messages labeled {1, 3, 4} and {2, 5}]

Changing Topology

[Figure: processors holding virtual nodes {0}, {1}, {2}, {3, 5}, and {4}; messages labeled {2, 5}, {3, 4}, {4}, and {5}. Annotation: "point-to-point message to virtual node 5"]
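One possible reading of the message labels on this slide is that each broadcast message carries the set of virtual nodes it still has to reach, and that a processor falls back to a point-to-point message when none of its children covers a destination. The sketch below illustrates that reading only. The tree, the processor names P0 to P4, and the routing tables are made up for the example and are not taken from the figure.

```python
def broadcast(routes, holds, proc, dests, log):
    """Forward a broadcast for the virtual nodes in `dests` from `proc`.

    routes: {processor: {child: virtual nodes believed to be below child}}
    holds:  {processor: virtual nodes currently mapped to it}
    log:    list collecting (sender, receiver, destination list) entries
    """
    # Virtual nodes hosted here are delivered locally.
    remaining = set(dests) - holds[proc]
    for child, believed in routes.get(proc, {}).items():
        part = remaining & believed
        if part:
            log.append((proc, child, sorted(part)))
            broadcast(routes, holds, child, part, log)
            remaining -= part
    # Destinations no child covers get a point-to-point message.
    for node in sorted(remaining):
        log.append((proc, f"virtual node {node} (point-to-point)", [node]))


# Hypothetical state: virtual node 5 has moved to P3, but P0 still
# believes it is reachable through P2.
holds = {"P0": {0}, "P1": {1}, "P2": {2}, "P3": {3, 5}, "P4": {4}}
routes = {"P0": {"P1": {1, 3, 4}, "P2": {2, 5}},
          "P1": {"P3": {3}, "P4": {4}}}
log = []
broadcast(routes, holds, "P0", {0, 1, 2, 3, 4, 5}, log)
for entry in log:
    print(entry)
```

In this run P2 receives the destination set [2, 5], finds that it neither hosts virtual node 5 nor has a child covering it, and sends a point-to-point message addressed to virtual node 5.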

Page 21

Reduction

Stable Topology

[Figure: processors holding virtual nodes {0}, {1}, {2}, {3}, {4}, and {5}]

Changing Topology

[Figure: processors holding virtual nodes {0}, {1}, {2}, {3, 5}, and {4}. Annotations: "waiting for virtual node 5…" and "timeout"]

Page 22

1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work

Page 23

Preliminary Experiments

1. Latency-aware spanning tree creation (Java applet)
2. Stable-state short-message broadcast
3. Stable-state short-message reduction
4. Transient-state short-message broadcast

Page 24

Broadcast (Stable-State)

1-byte broadcast over 201 processors in 3 clusters

[Figure: three plots of Completion Time (msecs, 0 to 35) versus Processor Number (0 to 200), one each for the topology-unaware implementation, the topology-aware implementation, and our implementation]

Page 25

Reduction (Stable-State)

Reduction using 128 processors in 3 clusters

[Figure: Completion Time (microsecs, 1.E+04 to 1.E+07) versus Integers Summed (1.E+01 to 1.E+06) for the topology-unaware implementation, the topology-aware implementation, and our implementation]

Page 26

Transient-State Behavior

201 processors in 3 clusters (1 virtual node per processor)
Repeatedly perform broadcasts
100 processors leave after 60 secs
  Virtual nodes are remapped to remaining processors
100 processors re-join after 30 secs
  Virtual nodes are given back to original processors

Page 27

Transient-State Behavior

[Figure: Total Waiting Time (seconds, 0 to 30) versus Elapsed Time (seconds, 0 to 120), with the Leave and Join events marked]

Page 28

1. Introduction
2. Related Work
3. Our Proposal
4. Preliminary Experiments
5. Conclusion and Future Work

Page 29

Conclusion

Designed and implemented latency-aware broadcast and reduction for wide-area networks
Showed that they perform reasonably well in stable topologies
Showed that they support joining/leaving processors
Future Work
  Implement bandwidth-aware spanning tree

Page 30

Publications

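1. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for Wide-Area Message Passing Using Dynamic Spanning Trees (in Japanese). In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear).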

2. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004.

3. Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.

Page 31

Broadcast (Stable-State)

Broadcast over 251 processors in 3 clusters

[Figure: Bandwidth (MB/sec, 0 to 160) versus Message Length (B, 1.E+02 to 1.E+07) for the topology-unaware implementation, the topology-aware implementation, and our implementation]