An Adaptive Collective Communication Suppressing Contention
Taura Lab. M2
Shota Yoshitomi
Outline
• Introduction
  – Problems in collective communication
  – Contribution
• Problem settings
• Our approach
• Conclusion
Background
• Grid computing has become widely used.
  – More opportunities to perform parallel computation in grid environments
    • InTrigger (Japan), Grid5000 (France)
• Large-scale parallel computation
• Data-intensive applications
• Message passing (e.g. MPI)
  – Point-to-point communication
  – Collective communication
Problems in collective communication
• Heterogeneous network
  – LAN / WAN
  – Differences in latency and bandwidth
• Contention
  – Network congestion
• Connectivity
  – Scalability
  – NAT, firewall

[Figure: nodes and switches (SW) spanning two LANs over a WAN; some connections are OK, some NG, and contention arises where flows share a link]
Contribution
• Designing an efficient collective operation algorithm
  – Suppressing network contention
  – Adaptive and scalable in large networks
• Focusing on Many-to-One and Many-to-Many operations
  – Many-to-One (Gather)
  – Many-to-Many (All-to-all)
• Implementation and evaluation
  – Our algorithm achieved better performance than existing MPI libraries.

[Figure: Many-to-One and Many-to-Many communication patterns]
Outline
• Introduction
• Problem settings
  – Effect of network contention
  – Related work
• Our approach
• Conclusion
Gather operation behavior
• Gather operation
  – The root node receives different data from the other nodes (see the call sketch below).

[Figure: Before and After a gather among N0–N3 behind a switch; afterwards the root N0 holds D0, D1, D2, D3]
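For reference, the operation on this slide corresponds to the standard MPI_Gather call. A minimal usage sketch (the buffer sizes and single-character payload are illustrative, not from the talk):

/* Every rank contributes one block of COUNT chars; rank 0 (the root)
 * ends up with all blocks in rank order, matching the Before/After
 * figure above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char data[COUNT];                       /* this rank's block D_rank */
    for (int i = 0; i < COUNT; i++) data[i] = 'A' + rank;

    char *all = NULL;
    if (rank == 0) all = malloc((size_t)size * COUNT);

    MPI_Gather(data, COUNT, MPI_CHAR,       /* each rank sends one block */
               all, COUNT, MPI_CHAR,        /* root receives them in rank order */
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("root gathered %d blocks\n", size);
        free(all);
    }
    MPI_Finalize();
    return 0;
}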
• Contention
  – Messages from N1 ~ Nk flow into N0's link at the same time.
  – N0 can only receive part of them.
    • N0's receive capacity limit is reached.

[Figure: N1 ~ Nk all sending to N0 through a shared link]
Effect of network contention
[Figure: completion time (msec, 0–2000) vs. message size (1–10000 KB) for "Theoretical" and "Concurrent"; the "Concurrent" curve jumps to a 200 msec plateau]

• The completion time of "Concurrent" is up to 400 times the "Theoretical" value.
• The completion time of "Concurrent" leaps around a 3 KB message size.

Experimental settings: 14 nodes in one LAN behind a single switch (PowerConnect 5324), 500 Mbps network.
Findings
• Caused by TCP behavior
  – Packet loss at a switch
  – The receiver waits for retransmission of the lost packet
    • RTO: retransmission timeout
    • 200 msec or more (Linux kernel 2.6.18)
  – The sender retransmits the packet
• Requirements
  – Prevent packet losses at every switch.
  – Control, at all switches, the number of nodes that communicate with a common destination at a time.
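A rough reading of why the plateau sits at 200 msec (my gloss; the slide only lists the ingredients): once a packet is dropped at the switch, the transfer cannot complete before at least one retransmission timeout fires, so

\[
T_{\text{observed}} \;\gtrsim\; T_{\text{theoretical}} + n_{\text{timeouts}} \cdot RTO_{\min},
\qquad RTO_{\min} \approx 200\ \text{msec},
\]

and a single timeout already dwarfs the sub-millisecond theoretical transfer time of a small message, consistent with the observed gap of up to 400 times.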
Related work (MPI implementations)
• OpenMPI
  – Flat tree
• MPICH
  – Binomial tree
• MagPIe [Kielmann et al. 1999]
  – Binomial tree (LAN)
  – Flat tree (WAN)

[Figure: flat-tree and binomial-tree gathers, with contention at the link into the root]

Network contention may degrade the performance of the gather operation in these MPI implementations.
Outline
• Introduction
• Problem settings
• Our approach
  – Basic idea
    • Pipeline transfer
    • Synchronized transfer
  – Evaluation
• Conclusion
Necessary conditions
• Prerequisite: messages do NOT flow into a link
  – at the same time
  – from two or more different sources

[Figure: OK and NG examples of flows sharing a link at a switch]

• Assumptions
  – All nodes can send and receive different messages concurrently.
  – No node communicates with other nodes in more than two gathers at the same time.
Basic idea
• Immediate goal
  – Suppressing contention at every switch and router
• Our algorithm consists of two approaches:
  – Sequential send with synchronization
  – Pipeline transfer
• Communication graph configuration
  – Combine pipeline transfer and synchronized transfer to improve the performance of the gather operation.
Sequential send with synchronization
1. N1 sends its message to N0.
2. N0 sends a one-byte packet to N2.
3. When N2 has received the one-byte packet from N0, N2 starts to send its message to N0.

[Figure: N0, N1, N2 behind a switch; the one-byte message serializes the two senders]
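A minimal sketch of this protocol with plain MPI point-to-point calls (my illustration, not the authors' code; here the root tokens every sender, including the first, which is equivalent to the slide's sequence):

/* Sequential send with synchronization: the root admits one sender
 * at a time with a 1-byte token, so at most one message flows into
 * the root's link at any moment. */
#include <mpi.h>
#include <string.h>

void sequential_gather(const char *sendbuf, char *recvbuf, int count,
                       int root, MPI_Comm comm)
{
    int rank, size;
    char token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        memcpy(recvbuf + (size_t)root * count, sendbuf, count);
        for (int src = 0; src < size; src++) {
            if (src == root) continue;
            MPI_Send(&token, 1, MPI_CHAR, src, 0, comm);   /* "go ahead" */
            MPI_Recv(recvbuf + (size_t)src * count, count, MPI_CHAR,
                     src, 1, comm, MPI_STATUS_IGNORE);
        }
    } else {
        /* wait for the root's 1-byte token before sending */
        MPI_Recv(&token, 1, MPI_CHAR, root, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send((void *)sendbuf, count, MPI_CHAR, root, 1, comm);
    }
}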
The weakness of "sequential send with synchronization"
• Sequential send does not always achieve the most efficient communication.

[Figure: a root reached over one cost-1000 link plus cost-1 links; sequential send incurs a long chain of Sync. steps and a total cost of 7000: high cost, NOT scalable]
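One plausible reading of the figure's arithmetic (the exact node count is my assumption): every sender's message crosses the cost-1000 link separately, so with k senders the total is

\[
T_{\text{seq}} \;\approx\; k \cdot c_{\text{WAN}} \;=\; 7 \cdot 1000 \;=\; 7000 ,
\]

and each hand-off adds a synchronization round trip on top, which is why the scheme does not scale.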
Pipeline transfer
• N1 sends its message to N0.
• N2 sends its message to N1.
• As soon as N1 has completely received the message from N2, N1 forwards it to N0.

[Figure: N0 ← N1 ← N2 behind a switch]
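A matching sketch of the pipeline (again my illustration; it assumes ranks are already ordered along the chain with rank 0 as the root):

/* Pipeline transfer along the chain N(size-1) -> ... -> N1 -> N0:
 * each interior node receives its child's message and immediately
 * forwards it toward the root, so different messages occupy
 * different links at the same time. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void pipeline_gather(const char *sendbuf, char *recvbuf, int count,
                     MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);          /* rank 0 is the root */

    if (rank > 0)                        /* push own data upstream first */
        MPI_Send((void *)sendbuf, count, MPI_CHAR, rank - 1, rank, comm);

    char *tmp = malloc((size_t)count);
    for (int origin = rank + 1; origin < size; origin++) {
        if (rank == 0) {                 /* root: store each arrival */
            MPI_Recv(recvbuf + (size_t)origin * count, count, MPI_CHAR,
                     rank + 1, origin, comm, MPI_STATUS_IGNORE);
        } else {                         /* relay: receive, then forward */
            MPI_Recv(tmp, count, MPI_CHAR, rank + 1, origin, comm,
                     MPI_STATUS_IGNORE);
            MPI_Send(tmp, count, MPI_CHAR, rank - 1, origin, comm);
        }
    }
    free(tmp);
    if (rank == 0) memcpy(recvbuf, sendbuf, count);
}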
The features of "pipeline transfer"
• No synchronization is needed.
• A low-bandwidth link in the middle of the pipeline often becomes the bottleneck.

[Figure: a chain with one cost-1000 link (the bottleneck) and cost-1 links; total cost 1003]
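Matching the figure's numbers (again my reading): the message from the far end crosses each hop once, so the pipeline pays the expensive link only once:

\[
T_{\text{pipe}} \;\approx\; c_{\text{WAN}} + \sum_{\text{LAN hops}} c_{\text{LAN}}
\;=\; 1000 + 3 \cdot 1 \;=\; 1003 ,
\]

which is far cheaper than the sequential 7000, but the cost-1000 link still dominates: it is the bottleneck.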
Graph configuration
• First, configure a "pipeline" following the layer-2 network topology.
• Meet the conditions for avoiding contention:
  – Messages do not flow into any link in the network
    • in the same direction
    • from more than one source.

[Figure: pipelined transfer threaded through a tree of switches (SW) down to the root]

※ Getting network information:
  – Topology inference [Shirai et al. 2006]
  – Bandwidth estimation [Naganuma et al. 2008]
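The slide does not spell out how the pipeline is threaded through the topology. One natural construction (an assumption on my part, consistent with the condition above) is a depth-first walk of the switch tree: visiting the hosts in DFS order starting from the root's switch makes the chain cross each layer-2 link at most once per direction.

/* Hypothetical sketch: derive a pipeline order from a tree-shaped
 * layer-2 topology by DFS.  The representation (adjacency matrix,
 * is_host flags) is my own invention for illustration. */
#define MAX_V 64

int nv;                    /* switches + hosts                  */
int adj[MAX_V][MAX_V];     /* adjacency matrix of the L2 tree   */
int is_host[MAX_V];        /* 1 if the vertex is a compute node */
int order[MAX_V], norder;  /* resulting pipeline order          */

void dfs(int v, int parent)
{
    if (is_host[v])
        order[norder++] = v;          /* append host to the chain */
    for (int u = 0; u < nv; u++)
        if (adj[v][u] && u != parent)
            dfs(u, v);
}

/* Call dfs(root_switch, -1); order[0] is then the gather root and
 * messages flow order[norder-1] -> ... -> order[0]. */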
Improving the performance
• Reconfigure the communication graph.

[Figure: the pipeline through the switch tree, with some hops replaced by synchronized transfers (Sync.) pointing toward the root]

Each node can send its message either to its pipeline successor (pipeline transfer) or directly to a node nearer the root (synchronized transfer). E.g.:
1. Calculate the arrival time at which the node's message reaches the root node: if it sends to one candidate, it takes X seconds; if it sends to another, it takes Y seconds; and so on.
2. Select the route by which the node's message arrives at the root node as early as possible.
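An illustrative sketch of steps 1-2 (the names and the additive cost model are assumptions, not taken from the talk): for one node, estimate when its message would reach the root via each candidate next hop and pick the earliest.

#include <float.h>

typedef struct {
    double start_time;    /* earliest time this hop can begin (sync wait) */
    double hop_time;      /* time to push the message across this hop     */
    double to_root_time;  /* estimated remaining time on to the root      */
} Candidate;

/* returns the index of the candidate whose route delivers the
 * message to the root soonest */
int best_next_hop(const Candidate *cand, int n)
{
    int best = 0;
    double best_arrival = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double arrival = cand[i].start_time + cand[i].hop_time
                       + cand[i].to_root_time;
        if (arrival < best_arrival) {
            best_arrival = arrival;
            best = i;
        }
    }
    return best;
}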
Experimentation
• OpenMPI (Concurrent)
  – Flat tree
  – All nodes send their messages concurrently.
• MPICH / MagPIe
  – Binomial tree
  – Flat tree (MagPIe over WAN)
• Sequential
  – Only uses the "synchronized transfer" part of our algorithm.
• OURS
  – Pipeline transfer and synchronized transfer.
Experiment results (1)
• 55 nodes send a 10 KB ~ 1 MB message to the root node.
• Our algorithm performs better than the other algorithms in almost all cases.

Experimental settings: one LAN with two switches (FastIron GS), 47 nodes behind one switch and 9 nodes behind the other, with the root on the 9-node side; 1 Gbps network.

[Figure: normalized completion time vs. send message size (10–1000 KB) for OpenMPI, MPICH, Sequential, and OURS]
Experiment results (2)
• Settings
  – Each node sends a 20 KB message to the root node.
  – Scaling from 1 cluster with 50 nodes up to 9 clusters with 190 nodes.
• The result shows
  – Our algorithm can avoid contention and prevent communication performance from degrading.

[Figure: completion time (msec, 0–400) vs. number of nodes (50–200) for Concurrent, Sequential, MagPIe, and OURS; the 200 ms level is marked where contention appears]
Conclusion
• Conclusion
  – Proposed an algorithm that avoids contention in large networks.
  – The algorithm achieved better performance than existing MPI libraries.
• Future work
  – Designing a more adaptive and precise communication-graph configuration algorithm
  – Considering wide-area bandwidth
  – Designing a contention-free algorithm for Many-to-Many operations
Publications
• Shota Yoshitomi, Ken Hironaka, Kenjiro Taura. An Adaptive Gather Operation Algorithm that Prevents Message Collisions (in Japanese). Symposium on Advanced Computing Systems and Infrastructures (SACSIS 2009), May 2009 (to appear).
• Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. MPI Collective Communication Algorithms Based on Automatically Obtained Network Configuration Information (in Japanese). Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP 2008), Aug 2008.
• Shota Yoshitomi, Hideo Saito, Kenjiro Taura, Takashi Chikayama. Improving the MPI Collective Communication Algorithms Based on Automatically Obtained Network Configuration Information (in Japanese). IPSJ National Convention 2008, Mar 2008.