An Adaptive Collective Communication Suppressing Contention
Taura Lab. M2
Shota Yoshitomi
Outline
• Introduction
  – Problems in collective communication
  – Contribution
• Problem settings
• Our approach
• Conclusion
Background
• Grid computing has become widely used.
  – More opportunities to perform parallel computation in grid environments
    • InTrigger (Japan), Grid5000 (France)
• Large-scale parallel computation
• Data-intensive applications
• Message passing (e.g. MPI)
  – Point-to-point communication
  – Collective communication
Problems in collective communication
• Heterogeneous network
  – LAN / WAN
  – Differences in latency and bandwidth
• Contention
  – Network congestion
• Connectivity
  – Scalability
  – NAT, firewall

[Figure: nodes and switches (SW) spanning two LANs over a WAN; some connections are OK, some NG, and contention arises where flows share a link]
Contribution
• Designing an efficient collective operation algorithm
  – Suppressing network contention
  – Adaptive and scalable in large networks
• Focusing on Many-to-One and Many-to-Many operations
  – Many-to-One (Gather)
  – Many-to-Many (All-to-all)
• Implementation and evaluation
  – Our algorithm achieved better performance than existing MPI libraries.

[Figure: Many-to-One and Many-to-Many communication patterns]
Outline
• Introduction
• Problem settings
  – Effect of network contention
  – Related work
• Our approach
• Conclusion
Gather operation behavior
• Gather operation
  – The root node receives different data from the other nodes (see the call sketch below).

[Figure: Before and After a gather among N0–N3 behind a switch; afterwards the root N0 holds D0, D1, D2, D3]
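For reference, the operation on this slide corresponds to the standard MPI_Gather call. A minimal usage sketch (the buffer sizes and single-character payload are illustrative, not from the talk):

/* Every rank contributes one block of COUNT chars; rank 0 (the root)
 * ends up with all blocks in rank order, matching the Before/After
 * figure above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char data[COUNT];                       /* this rank's block D_rank */
    for (int i = 0; i < COUNT; i++) data[i] = 'A' + rank;

    char *all = NULL;
    if (rank == 0) all = malloc((size_t)size * COUNT);

    MPI_Gather(data, COUNT, MPI_CHAR,       /* each rank sends one block */
               all, COUNT, MPI_CHAR,        /* root receives them in rank order */
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("root gathered %d blocks\n", size);
        free(all);
    }
    MPI_Finalize();
    return 0;
}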
• Contention
  – Messages from N1 ~ Nk flow into N0's link at the same time.
  – N0 can only receive part of them.
    • N0's receive capacity limit is reached.

[Figure: N1 ~ Nk all sending to N0 through a shared link]
Effect of network contention
[Figure: completion time (msec, 0–2000) vs. message size (1–10000 KB) for "Theoretical" and "Concurrent"; the "Concurrent" curve jumps to a 200 msec plateau]

• The completion time of "Concurrent" is up to 400 times the "Theoretical" value.
• The completion time of "Concurrent" leaps around a 3 KB message size.

Experimental settings: 14 nodes in one LAN behind a single switch (PowerConnect 5324), 500 Mbps network.
Findings
• Caused by TCP behavior
  – Packet loss at a switch
  – The receiver waits for retransmission of the lost packet
    • RTO: retransmission timeout
    • 200 msec or more (Linux kernel 2.6.18)
  – The sender retransmits the packet
• Requirements
  – Prevent packet losses at every switch.
  – Control, at all switches, the number of nodes that communicate with a common destination at a time.
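A rough reading of why the plateau sits at 200 msec (my gloss; the slide only lists the ingredients): once a packet is dropped at the switch, the transfer cannot complete before at least one retransmission timeout fires, so

\[
T_{\text{observed}} \;\gtrsim\; T_{\text{theoretical}} + n_{\text{timeouts}} \cdot RTO_{\min},
\qquad RTO_{\min} \approx 200\ \text{msec},
\]

and a single timeout already dwarfs the sub-millisecond theoretical transfer time of a small message, consistent with the observed gap of up to 400 times.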
Related work (MPI implementations)
• OpenMPI
  – Flat tree
• MPICH
  – Binomial tree
• MagPIe [Kielmann et al. 1999]
  – Binomial tree (LAN)
  – Flat tree (WAN)

[Figure: flat-tree and binomial-tree gathers, with contention at the link into the root]

Network contention may degrade the performance of the gather operation in these MPI implementations.
Outline
• Introduction
• Problem settings
• Our approach
  – Basic idea
    • Pipeline transfer
    • Synchronized transfer
  – Evaluation
• Conclusion
Necessary conditions
• Prerequisite: messages do NOT flow into a link
  – at the same time
  – from two or more different sources

[Figure: OK and NG examples of flows sharing a link at a switch]

• Assumptions
  – All nodes can send and receive different messages concurrently.
  – No node communicates with other nodes in more than two gathers at the same time.
Basic idea
• Immediate goal
  – Suppressing contention at every switch and router
• Our algorithm consists of two approaches:
  – Sequential send with synchronization
  – Pipeline transfer
• Communication graph configuration
  – Combine pipeline transfer and synchronized transfer to improve the performance of the gather operation.
Sequential send with synchronization
1. N1 sends its message to N0.
2. N0 sends a one-byte packet to N2.
3. When N2 has received the one-byte packet from N0, N2 starts to send its message to N0.

[Figure: N0, N1, N2 behind a switch; the one-byte message serializes the two senders]
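A minimal sketch of this protocol with plain MPI point-to-point calls (my illustration, not the authors' code; here the root tokens every sender, including the first, which is equivalent to the slide's sequence):

/* Sequential send with synchronization: the root admits one sender
 * at a time with a 1-byte token, so at most one message flows into
 * the root's link at any moment. */
#include <mpi.h>
#include <string.h>

void sequential_gather(const char *sendbuf, char *recvbuf, int count,
                       int root, MPI_Comm comm)
{
    int rank, size;
    char token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        memcpy(recvbuf + (size_t)root * count, sendbuf, count);
        for (int src = 0; src < size; src++) {
            if (src == root) continue;
            MPI_Send(&token, 1, MPI_CHAR, src, 0, comm);   /* "go ahead" */
            MPI_Recv(recvbuf + (size_t)src * count, count, MPI_CHAR,
                     src, 1, comm, MPI_STATUS_IGNORE);
        }
    } else {
        /* wait for the root's 1-byte token before sending */
        MPI_Recv(&token, 1, MPI_CHAR, root, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send((void *)sendbuf, count, MPI_CHAR, root, 1, comm);
    }
}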
The weakness of "sequential send with synchronization"
• Sequential send does not always achieve the most efficient communication.

[Figure: a root reached over one cost-1000 link plus cost-1 links; sequential send incurs a long chain of Sync. steps and a total cost of 7000: high cost, NOT scalable]
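One plausible reading of the figure's arithmetic (the exact node count is my assumption): every sender's message crosses the cost-1000 link separately, so with k senders the total is

\[
T_{\text{seq}} \;\approx\; k \cdot c_{\text{WAN}} \;=\; 7 \cdot 1000 \;=\; 7000 ,
\]

and each hand-off adds a synchronization round trip on top, which is why the scheme does not scale.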
Pipeline transfer
• N1 sends its message to N0.
• N2 sends its message to N1.
• As soon as N1 has completely received the message from N2, N1 forwards it to N0.

[Figure: N0 ← N1 ← N2 behind a switch]
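A matching sketch of the pipeline (again my illustration; it assumes ranks are already ordered along the chain with rank 0 as the root):

/* Pipeline transfer along the chain N(size-1) -> ... -> N1 -> N0:
 * each interior node receives its child's message and immediately
 * forwards it toward the root, so different messages occupy
 * different links at the same time. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void pipeline_gather(const char *sendbuf, char *recvbuf, int count,
                     MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);          /* rank 0 is the root */

    if (rank > 0)                        /* push own data upstream first */
        MPI_Send((void *)sendbuf, count, MPI_CHAR, rank - 1, rank, comm);

    char *tmp = malloc((size_t)count);
    for (int origin = rank + 1; origin < size; origin++) {
        if (rank == 0) {                 /* root: store each arrival */
            MPI_Recv(recvbuf + (size_t)origin * count, count, MPI_CHAR,
                     rank + 1, origin, comm, MPI_STATUS_IGNORE);
        } else {                         /* relay: receive, then forward */
            MPI_Recv(tmp, count, MPI_CHAR, rank + 1, origin, comm,
                     MPI_STATUS_IGNORE);
            MPI_Send(tmp, count, MPI_CHAR, rank - 1, origin, comm);
        }
    }
    free(tmp);
    if (rank == 0) memcpy(recvbuf, sendbuf, count);
}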
The features of "pipeline transfer"
• No synchronization is needed.
• A low-bandwidth link in the middle of the pipeline often becomes the bottleneck.

[Figure: a chain with one cost-1000 link (the bottleneck) and cost-1 links; total cost 1003]
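Matching the figure's numbers (again my reading): the message from the far end crosses each hop once, so the pipeline pays the expensive link only once:

\[
T_{\text{pipe}} \;\approx\; c_{\text{WAN}} + \sum_{\text{LAN hops}} c_{\text{LAN}}
\;=\; 1000 + 3 \cdot 1 \;=\; 1003 ,
\]

which is far cheaper than the sequential 7000, but the cost-1000 link still dominates: it is the bottleneck.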
Graph configuration
• First, configure a "pipeline" following the layer-2 network topology.
• Meet the conditions for avoiding contention:
  – Messages do not flow into any link in the network
    • in the same direction
    • from more than one source.

[Figure: pipelined transfer threaded through a tree of switches (SW) down to the root]

※ Getting network information:
  – Topology inference [Shirai et al. 2006]
  – Bandwidth estimation [Naganuma et al. 2008]
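The slide does not spell out how the pipeline is threaded through the topology. One natural construction (an assumption on my part, consistent with the condition above) is a depth-first walk of the switch tree: visiting the hosts in DFS order starting from the root's switch makes the chain cross each layer-2 link at most once per direction.

/* Hypothetical sketch: derive a pipeline order from a tree-shaped
 * layer-2 topology by DFS.  The representation (adjacency matrix,
 * is_host flags) is my own invention for illustration. */
#define MAX_V 64

int nv;                    /* switches + hosts                  */
int adj[MAX_V][MAX_V];     /* adjacency matrix of the L2 tree   */
int is_host[MAX_V];        /* 1 if the vertex is a compute node */
int order[MAX_V], norder;  /* resulting pipeline order          */

void dfs(int v, int parent)
{
    if (is_host[v])
        order[norder++] = v;          /* append host to the chain */
    for (int u = 0; u < nv; u++)
        if (adj[v][u] && u != parent)
            dfs(u, v);
}

/* Call dfs(root_switch, -1); order[0] is then the gather root and
 * messages flow order[norder-1] -> ... -> order[0]. */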
Improving the performance
• Reconfigure the communication graph.

[Figure: the pipeline through the switch tree, with some hops replaced by synchronized transfers (Sync.) pointing toward the root]

Each node can send its message either to its pipeline successor (pipeline transfer) or directly to a node nearer the root (synchronized transfer). E.g.:
1. Calculate the arrival time at which the node's message reaches the root node: if it sends to one candidate, it takes X seconds; if it sends to another, it takes Y seconds; and so on.
2. Select the route by which the node's message arrives at the root node as early as possible.
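An illustrative sketch of steps 1-2 (the names and the additive cost model are assumptions, not taken from the talk): for one node, estimate when its message would reach the root via each candidate next hop and pick the earliest.

#include <float.h>

typedef struct {
    double start_time;    /* earliest time this hop can begin (sync wait) */
    double hop_time;      /* time to push the message across this hop     */
    double to_root_time;  /* estimated remaining time on to the root      */
} Candidate;

/* returns the index of the candidate whose route delivers the
 * message to the root soonest */
int best_next_hop(const Candidate *cand, int n)
{
    int best = 0;
    double best_arrival = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double arrival = cand[i].start_time + cand[i].hop_time
                       + cand[i].to_root_time;
        if (arrival < best_arrival) {
            best_arrival = arrival;
            best = i;
        }
    }
    return best;
}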
Experimentation
• OpenMPI (Concurrent)
  – Flat tree
  – All nodes send their messages concurrently.
• MPICH / MagPIe
  – Binomial tree
  – Flat tree (MagPIe over WAN)
• Sequential
  – Only uses the "synchronized transfer" part of our algorithm.
• OURS
  – Pipeline transfer and synchronized transfer.
Experiment results (1)
• 55 nodes send a 10 KB ~ 1 MB message to the root node.
• Our algorithm performs better than the other algorithms in almost all cases.

Experimental settings: one LAN with two switches (FastIron GS), 47 nodes behind one switch and 9 nodes behind the other, with the root on the 9-node side; 1 Gbps network.

[Figure: normalized completion time vs. send message size (10–1000 KB) for OpenMPI, MPICH, Sequential, and OURS]
Experiment results (2)
• Settings
  – Each node sends a 20 KB message to the root node.
  – Scaling from 1 cluster with 50 nodes up to 9 clusters with 190 nodes.
• The result shows
  – Our algorithm can avoid contention and prevent communication performance from degrading.

[Figure: completion time (msec, 0–400) vs. number of nodes (50–200) for Concurrent, Sequential, MagPIe, and OURS; the 200 ms level is marked where contention appears]
Conclusion
• Conclusion
  – Proposed an algorithm that avoids contention in large networks.
  – The algorithm achieved better performance than existing MPI libraries.
• Future work
  – Designing a more adaptive and precise communication-graph configuration algorithm
  – Considering wide-area bandwidth
  – Designing a contention-free algorithm for Many-to-Many operations
Publications
• Shota Yoshitomi, Ken Hironaka, Kenjiro Taura. An Adaptive Gather Operation Algorithm that Prevents Message Collisions (in Japanese). Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2009), May 2009 (to appear).
• Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. MPI Collective Communication Algorithms Based on Automatically Obtained Network Configuration Information (in Japanese). Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP 2008), Aug 2008.
• Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. Improving the MPI Collective Communication Algorithms Based on Automatically Obtained Network Configuration Information (in Japanese). IPSJ National Convention 2008, Mar 2008.