GridMPI: Grid Enabled MPI
Yutaka Ishikawa, The University of Tokyo and AIST
http://www.gridmpi.org
Motivation
• MPI has been widely used to program parallel applications
• Users want to run such applications over the Grid environment without any modification of the program
• However, the performance of existing MPI implementations does not scale up in the Grid environment
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resources at site A and site B connected by a wide-area network.]
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster:
    • 10 Gbps vs. 1 Gbps
    • 100 Gbps vs. 10 Gbps
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – Using an emulated WAN environment, we have already demonstrated that the performance of the NAS Parallel Benchmark programs scales up if the one-way latency is smaller than 10 ms
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGrid 2003, 2003.
Issues
• High-performance communication facilities for MPI on long and fat networks
  – TCP vs. MPI communication patterns
  – Network topology
    • Latency and bandwidth
• Interoperability
  – Most MPI library implementations use their own network protocol.
• Fault tolerance and migration
  – To survive a site failure
• Security
TCP vs. MPI:
• TCP is designed for streams.
• MPI generates burst traffic: it repeats computation and communication phases, and its traffic changes with the communication pattern.
[Figure: bandwidth (MB/s) over time (ms), observed during one 10 MB data transfer while repeating 10 MB transfers at two-second intervals. A slow-start phase appears because the window size is set back to 1, and the silent periods result from the burst traffic.]
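To make the traffic pattern concrete, the following is a minimal sketch (not the actual measurement code; the message size and interval are taken from the figure caption above, everything else is an assumption) of an MPI program whose compute/communicate cycle produces exactly this kind of burst followed by an idle period:

/* Illustrative sketch only: alternates an idle "computation" phase with a
 * bulk 10 MB transfer, mimicking the bursty MPI traffic discussed above.
 * Compile with an MPI C compiler (e.g. mpicc) and run on at least 2 ranks. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_BYTES (10 * 1024 * 1024)   /* 10 MB per burst */
#define ITERATIONS 5
#define PHASE_SEC 2                    /* two-second "computation" phase */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERATIONS; i++) {
        sleep(PHASE_SEC);              /* computation phase: the link is idle */
        if (rank == 0)                 /* communication phase: one 10 MB burst */
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}

Because the connection sits idle for two seconds between bursts, TCP's congestion window can be reset before each transfer, which is the slow-start behavior visible in the figure.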
[Figure: bandwidth (MB/s) over time (sec) when one-to-one communication is started at time 0, immediately after an all-to-all.]
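Again purely as an illustrative sketch (message sizes and the chosen ranks are assumptions, not the benchmark used above), this situation can be reproduced by timing a point-to-point transfer issued right after an all-to-all:

/* Illustrative sketch: time a point-to-point transfer immediately after an
 * all-to-all, as in the measurement above. Run on at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 * 1024 * 1024)        /* assumed: 1 MB per peer */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sbuf = malloc((size_t)CHUNK * nprocs);
    char *rbuf = malloc((size_t)CHUNK * nprocs);

    /* All-to-all phase: every rank bursts CHUNK bytes to every other rank. */
    MPI_Alltoall(sbuf, CHUNK, MPI_BYTE, rbuf, CHUNK, MPI_BYTE, MPI_COMM_WORLD);

    /* One-to-one phase starting right afterwards ("time 0" in the figure). */
    double t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Send(sbuf, CHUNK, MPI_BYTE, nprocs - 1, 0, MPI_COMM_WORLD);
    else if (rank == nprocs - 1)
        MPI_Recv(rbuf, CHUNK, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-to-one after all-to-all took %.3f s\n", t1 - t0);

    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    return 0;
}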
[Figure: four sites connected through the Internet, each using a different vendor's MPI library (Vendor A, B, C, and D), illustrating the interoperability issue.]
GridMPI Features
[Figure: GridMPI software architecture. The MPI API sits on a Request Layer (Request Interface) and a LAC Layer (Collectives); a P2P Interface and IMPI carry communication over TCP/IP, PMv2, MX, O2G, and vendor MPIs, while an RPIM Interface starts processes via ssh, rsh, SCore, Globus, or a vendor MPI.]
• MPI-2 implementation
• YAMPII, developed at the University of Tokyo, is used as the core implementation
• Intra-cluster communication by YAMPII (TCP/IP, SCore)
• Inter-cluster communication by the IMPI (Interoperable MPI) protocol, with extensions for the Grid
  – MPI-2
  – New collective protocols
• Integration of vendor MPIs: IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
• Incremental checkpointing
• High-performance TCP/IP implementation
• LAC: Latency Aware Collectives
  – bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference; see the sketch after this list)
[Figure: clusters running a vendor's MPI and YAMPII communicate over the Internet via IMPI/TCP.]
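The LAC bcast and allreduce algorithms themselves are described in the Cluster 2006 paper; purely as a hedged sketch of the general latency-aware idea (not GridMPI's actual code; the function name and the site_id parameter are invented for illustration), a broadcast can cross the wide-area link once and then fan out inside each cluster:

/* Hedged sketch of a latency-aware broadcast idea (not GridMPI's LAC code):
 * cross the WAN once per remote site, then broadcast locally in each cluster.
 * Assumes the data originates at world rank 0 and that site_id identifies the
 * cluster a process runs on (e.g. derived from the hostname). */
#include <mpi.h>

void latency_aware_bcast(void *buf, int count, MPI_Datatype type,
                         int site_id, MPI_Comm world)
{
    int wrank, lrank;
    MPI_Comm local;   /* processes in the same cluster  */
    MPI_Comm leaders; /* one representative per cluster */

    MPI_Comm_rank(world, &wrank);

    /* Group processes by site; local rank 0 becomes the site leader. */
    MPI_Comm_split(world, site_id, wrank, &local);
    MPI_Comm_rank(local, &lrank);

    /* Leaders form their own communicator spanning the wide-area link. */
    MPI_Comm_split(world, lrank == 0 ? 0 : MPI_UNDEFINED, wrank, &leaders);

    /* Step 1: one transfer per remote site over the long, fat network. */
    if (leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leaders);

    /* Step 2: low-latency broadcast inside each cluster. */
    MPI_Bcast(buf, count, type, 0, local);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&local);
}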
High-performance Communication Mechanisms in the Long and Fat Network
• Modifications of TCP behavior
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
• Precise software pacing (illustrated in the sketch after this list)
  – R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
• Collective communication algorithms with respect to network latency and bandwidth
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
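As a rough, hedged illustration of the pacing idea only (the PFLDnet 2005 mechanism works at a much finer granularity than this user-space sketch; the function name, chunk size, and target rate are assumptions), a sender can avoid dumping a whole burst onto the network at once by spacing its socket writes to match a target rate:

/* Hedged user-space illustration of pacing: spread a burst over time by
 * sleeping between socket writes so the average rate stays near a target.
 * This is far coarser than the precise mechanism in the PFLDnet 2005 paper. */
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Send `len` bytes on `fd` at roughly `rate_bps` bits per second. */
ssize_t paced_send(int fd, const char *buf, size_t len, double rate_bps)
{
    const size_t chunk = 8192;                           /* bytes per write */
    /* Time budget for one chunk at the target rate, in nanoseconds. */
    long gap_ns = (long)((chunk * 8 / rate_bps) * 1e9);
    struct timespec gap = { gap_ns / 1000000000L, gap_ns % 1000000000L };

    size_t sent = 0;
    while (sent < len) {
        size_t n = len - sent < chunk ? len - sent : chunk;
        ssize_t r = write(fd, buf + sent, n);
        if (r < 0)
            return r;
        sent += (size_t)r;
        nanosleep(&gap, NULL);                           /* space out the writes */
    }
    return (ssize_t)sent;
}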
Evaluation
• It is almost impossible to reproduce the communication behavior of a real wide-area network in repeatable experiments
• A WAN emulator, GtrcNET-1, is therefore used to scientifically examine implementations, protocols, communication algorithms, etc.
GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/):
• injection of delay, jitter, errors, ...
• traffic monitoring, frame capture
• four 1000Base-SX ports
• one USB port for the host PC
• FPGA (XC2V6000)
Experimental Environment
• Two clusters of 8 PCs (Node0-Node7 and Node8-Node15), each connected to a Catalyst 3750 switch, linked through the GtrcNET-1 WAN emulator
  – CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
  – NIC: Intel PRO/1000 (82547EI)
  – OS: Linux 2.6.9-1.6 (Fedora Core 2)
  – Socket buffer size: 20 MB (see the sketch after this list)
• Emulated WAN link: bandwidth 1 Gbps, delay 0 ms to 10 ms
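The large socket buffer is what lets TCP keep the long link full: at 1 Gbps and a 10 ms one-way delay (20 ms RTT) the bandwidth-delay product is about 2.5 MB, so 20 MB leaves ample headroom. A minimal sketch of how such a buffer would be requested on a socket (the MPI library normally does this internally; the function name is invented, and kernel limits such as net.core.rmem_max/wmem_max on Linux must also permit the value):

/* Minimal sketch of enlarging a TCP socket's buffers to the 20 MB used in the
 * experiment; an MPI implementation would normally do this internally. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>

int make_big_buffer_socket(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 20 * 1024 * 1024;    /* 20 MB, as in the setup above */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");

    return sock;
}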
GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for FT (GridMPI) and FT (MPICH-G2).]
GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for IS (GridMPI) and IS (MPICH-G2).]
GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for LU (GridMPI) and LU (MPICH-G2).]
GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2 Class B on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for SP, BT, MG, and CG with GridMPI and MPICH-G2.]
No parameters were tuned in GridMPI.
GridMPI on an Actual Network
• NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara
  – 16 nodes in total
• The performance is compared with
  – the result using 16 nodes at 2.4 GHz
  – the result using 16 nodes at 2.8 GHz
[Figure: the Tsukuba cluster (Pentium 4 2.4 GHz x 8, connected by 1G Ethernet) and the Akihabara cluster (Pentium 4 2.8 GHz x 8, connected by 1G Ethernet), 60 km (40 mi.) apart, linked by the JGN2 network: 10 Gbps bandwidth, 1.5 msec RTT.]
[Figure: relative performance of BT, CG, EP, FT, IS, LU, MG, and SP, compared against the 16-node 2.4 GHz and 2.8 GHz single-cluster results.]
GridMPI Now and Future
• GridMPI version 1.0 has been released
  – Conformance tests
    • MPICH Test Suite: 0/142 (fails/tests)
    • Intel Test Suite: 0/493 (fails/tests)
  – GridMPI is integrated into the NaReGI package
• Extension of the IMPI specification
  – Refine the current extensions
  – Collective communication and checkpoint algorithms cannot be fixed in the specification; the current idea is to specify
    • the mechanism of
      – dynamic algorithm selection
      – dynamic algorithm shipment and loading
        » a virtual machine to implement the algorithms
Dynamic Algorithm Shipment
• A collective communication algorithm is implemented in the virtual machine
• The code is shipped to all MPI processes
• The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication (see the sketch after this list)
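The instruction set of this virtual machine is not yet specified; purely as a hedged sketch of what "interpreting a shipped algorithm" could look like (every opcode, type, and function name here is invented for illustration, not part of IMPI or GridMPI), the runtime might walk a small list of communication steps:

/* Purely illustrative sketch of interpreting a shipped collective algorithm;
 * the opcodes and structures are invented, not part of IMPI or GridMPI. */
#include <mpi.h>

typedef enum { OP_SEND, OP_RECV, OP_LOCAL_BCAST, OP_END } opcode_t;

typedef struct {
    opcode_t op;
    int peer;        /* remote rank for OP_SEND / OP_RECV */
    int root;        /* root rank for OP_LOCAL_BCAST */
} step_t;

/* Execute a shipped step list on behalf of a collective operation. */
void run_shipped_collective(const step_t *prog, void *buf, int count,
                            MPI_Datatype type, MPI_Comm comm)
{
    for (const step_t *s = prog; s->op != OP_END; s++) {
        switch (s->op) {
        case OP_SEND:
            MPI_Send(buf, count, type, s->peer, 0, comm);
            break;
        case OP_RECV:
            MPI_Recv(buf, count, type, s->peer, 0, comm, MPI_STATUS_IGNORE);
            break;
        case OP_LOCAL_BCAST:
            MPI_Bcast(buf, count, type, s->root, comm);
            break;
        default:
            break;
        }
    }
}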
Concluding Remarks
• Our main concern is the metropolitan-area network
  – high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
• Overseas (about 100 milliseconds)
  – Applications must be aware of the communication latency
  – Data movement using MPI-IO?
• Collaborations
  – We would like to ask people who are interested in this work for collaboration