GridMPI: Grid Enabled MPI
Yutaka Ishikawa, The University of Tokyo and AIST
http://www.gridmpi.org
Motivation
• MPI has been widely used to program parallel applications
• Users want to run such applications over the Grid environment without any modification of the program
• However, the performance of existing MPI implementations does not scale up in the Grid environment
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resources at site A and site B connected by a wide-area network.]
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster:
    • 10 Gbps vs. 1 Gbps
    • 100 Gbps vs. 10 Gbps
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – Using an emulated WAN environment, we have already demonstrated that the performance of the NAS Parallel Benchmark programs scales up if the one-way latency is smaller than 10 ms
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGrid 2003, 2003.
Issues
• High-performance communication facilities for MPI on long and fat networks
  – TCP vs. MPI communication patterns
  – Network topology
    • Latency and bandwidth
• Interoperability
  – Most MPI library implementations use their own network protocol.
• Fault tolerance and migration
  – To survive a site failure
• Security
TCP vs. MPI:
• TCP is designed for streams.
• MPI generates burst traffic: it repeats computation and communication phases, and its traffic changes with the communication pattern.
[Figure: bandwidth (MB/s) over time (ms), observed during one 10 MB data transfer while repeating 10 MB transfers at two-second intervals. A slow-start phase appears because the window size is set back to 1, and the silent periods result from the burst traffic.]
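To make the traffic pattern concrete, the following is a minimal sketch (not the actual measurement code; the message size and interval are taken from the figure caption above, everything else is an assumption) of an MPI program whose compute/communicate cycle produces exactly this kind of burst followed by an idle period:

/* Illustrative sketch only: alternates an idle "computation" phase with a
 * bulk 10 MB transfer, mimicking the bursty MPI traffic discussed above.
 * Compile with an MPI C compiler (e.g. mpicc) and run on at least 2 ranks. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_BYTES (10 * 1024 * 1024)   /* 10 MB per burst */
#define ITERATIONS 5
#define PHASE_SEC 2                    /* two-second "computation" phase */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < ITERATIONS; i++) {
        sleep(PHASE_SEC);              /* computation phase: the link is idle */
        if (rank == 0)                 /* communication phase: one 10 MB burst */
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}

Because the connection sits idle for two seconds between bursts, TCP's congestion window can be reset before each transfer, which is the slow-start behavior visible in the figure.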
[Figure: bandwidth (MB/s) over time (sec) when one-to-one communication is started at time 0, immediately after an all-to-all.]
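Again purely as an illustrative sketch (message sizes and the chosen ranks are assumptions, not the benchmark used above), this situation can be reproduced by timing a point-to-point transfer issued right after an all-to-all:

/* Illustrative sketch: time a point-to-point transfer immediately after an
 * all-to-all, as in the measurement above. Run on at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 * 1024 * 1024)        /* assumed: 1 MB per peer */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sbuf = malloc((size_t)CHUNK * nprocs);
    char *rbuf = malloc((size_t)CHUNK * nprocs);

    /* All-to-all phase: every rank bursts CHUNK bytes to every other rank. */
    MPI_Alltoall(sbuf, CHUNK, MPI_BYTE, rbuf, CHUNK, MPI_BYTE, MPI_COMM_WORLD);

    /* One-to-one phase starting right afterwards ("time 0" in the figure). */
    double t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Send(sbuf, CHUNK, MPI_BYTE, nprocs - 1, 0, MPI_COMM_WORLD);
    else if (rank == nprocs - 1)
        MPI_Recv(rbuf, CHUNK, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-to-one after all-to-all took %.3f s\n", t1 - t0);

    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    return 0;
}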
[Figure: four sites connected through the Internet, each using a different vendor's MPI library (Vendor A, B, C, and D), illustrating the interoperability issue.]
GridMPI Features
[Figure: GridMPI software architecture. The MPI API sits on a Request Layer (Request Interface) and a LAC Layer (Collectives); a P2P Interface and IMPI carry communication over TCP/IP, PMv2, MX, O2G, and vendor MPIs, while an RPIM Interface starts processes via ssh, rsh, SCore, Globus, or a vendor MPI.]
• MPI-2 implementation
• YAMPII, developed at the University of Tokyo, is used as the core implementation
• Intra-cluster communication by YAMPII (TCP/IP, SCore)
• Inter-cluster communication by the IMPI (Interoperable MPI) protocol, with extensions for the Grid
  – MPI-2
  – New collective protocols
• Integration of vendor MPIs: IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
• Incremental checkpointing
• High-performance TCP/IP implementation
• LAC: Latency Aware Collectives
  – bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference; see the sketch after this list)
[Figure: clusters running a vendor's MPI and YAMPII communicate over the Internet via IMPI/TCP.]
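The LAC bcast and allreduce algorithms themselves are described in the Cluster 2006 paper; purely as a hedged sketch of the general latency-aware idea (not GridMPI's actual code; the function name and the site_id parameter are invented for illustration), a broadcast can cross the wide-area link once and then fan out inside each cluster:

/* Hedged sketch of a latency-aware broadcast idea (not GridMPI's LAC code):
 * cross the WAN once per remote site, then broadcast locally in each cluster.
 * Assumes the data originates at world rank 0 and that site_id identifies the
 * cluster a process runs on (e.g. derived from the hostname). */
#include <mpi.h>

void latency_aware_bcast(void *buf, int count, MPI_Datatype type,
                         int site_id, MPI_Comm world)
{
    int wrank, lrank;
    MPI_Comm local;   /* processes in the same cluster  */
    MPI_Comm leaders; /* one representative per cluster */

    MPI_Comm_rank(world, &wrank);

    /* Group processes by site; local rank 0 becomes the site leader. */
    MPI_Comm_split(world, site_id, wrank, &local);
    MPI_Comm_rank(local, &lrank);

    /* Leaders form their own communicator spanning the wide-area link. */
    MPI_Comm_split(world, lrank == 0 ? 0 : MPI_UNDEFINED, wrank, &leaders);

    /* Step 1: one transfer per remote site over the long, fat network. */
    if (leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leaders);

    /* Step 2: low-latency broadcast inside each cluster. */
    MPI_Bcast(buf, count, type, 0, local);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&local);
}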
High-performance Communication Mechanisms in the Long and Fat Network
• Modifications of TCP behavior
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
• Precise software pacing (illustrated in the sketch after this list)
  – R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
• Collective communication algorithms with respect to network latency and bandwidth
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
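As a rough, hedged illustration of the pacing idea only (the PFLDnet 2005 mechanism works at a much finer granularity than this user-space sketch; the function name, chunk size, and target rate are assumptions), a sender can avoid dumping a whole burst onto the network at once by spacing its socket writes to match a target rate:

/* Hedged user-space illustration of pacing: spread a burst over time by
 * sleeping between socket writes so the average rate stays near a target.
 * This is far coarser than the precise mechanism in the PFLDnet 2005 paper. */
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Send `len` bytes on `fd` at roughly `rate_bps` bits per second. */
ssize_t paced_send(int fd, const char *buf, size_t len, double rate_bps)
{
    const size_t chunk = 8192;                           /* bytes per write */
    /* Time budget for one chunk at the target rate, in nanoseconds. */
    long gap_ns = (long)((chunk * 8 / rate_bps) * 1e9);
    struct timespec gap = { gap_ns / 1000000000L, gap_ns % 1000000000L };

    size_t sent = 0;
    while (sent < len) {
        size_t n = len - sent < chunk ? len - sent : chunk;
        ssize_t r = write(fd, buf + sent, n);
        if (r < 0)
            return r;
        sent += (size_t)r;
        nanosleep(&gap, NULL);                           /* space out the writes */
    }
    return (ssize_t)sent;
}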
Evaluation
• It is almost impossible to reproduce the communication behavior of a real wide-area network in repeatable experiments
• A WAN emulator, GtrcNET-1, is therefore used to scientifically examine implementations, protocols, communication algorithms, etc.
GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/):
• injection of delay, jitter, errors, ...
• traffic monitoring, frame capture
• four 1000Base-SX ports
• one USB port for the host PC
• FPGA (XC2V6000)
Experimental Environment
• Two clusters of 8 PCs (Node0-Node7 and Node8-Node15), each connected to a Catalyst 3750 switch, linked through the GtrcNET-1 WAN emulator
  – CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
  – NIC: Intel PRO/1000 (82547EI)
  – OS: Linux 2.6.9-1.6 (Fedora Core 2)
  – Socket buffer size: 20 MB (see the sketch after this list)
• Emulated WAN link: bandwidth 1 Gbps, delay 0 ms to 10 ms
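The large socket buffer is what lets TCP keep the long link full: at 1 Gbps and a 10 ms one-way delay (20 ms RTT) the bandwidth-delay product is about 2.5 MB, so 20 MB leaves ample headroom. A minimal sketch of how such a buffer would be requested on a socket (the MPI library normally does this internally; the function name is invented, and kernel limits such as net.core.rmem_max/wmem_max on Linux must also permit the value):

/* Minimal sketch of enlarging a TCP socket's buffers to the 20 MB used in the
 * experiment; an MPI implementation would normally do this internally. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>

int make_big_buffer_socket(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int bufsize = 20 * 1024 * 1024;    /* 20 MB, as in the setup above */

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");

    return sock;
}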
GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for FT (GridMPI) and FT (MPICH-G2).]
GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for IS (GridMPI) and IS (MPICH-G2).]
GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for LU (GridMPI) and LU (MPICH-G2).]
GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2 Class B on 8 x 8 processes
[Figure: relative performance vs. one-way delay (msec) for SP, BT, MG, and CG with GridMPI and MPICH-G2.]
No parameters were tuned in GridMPI.
GridMPI on an Actual Network
• NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara
  – 16 nodes in total
• The performance is compared with
  – the result using 16 nodes at 2.4 GHz
  – the result using 16 nodes at 2.8 GHz
[Figure: the Tsukuba cluster (Pentium 4 2.4 GHz x 8, connected by 1G Ethernet) and the Akihabara cluster (Pentium 4 2.8 GHz x 8, connected by 1G Ethernet), 60 km (40 mi.) apart, linked by the JGN2 network: 10 Gbps bandwidth, 1.5 msec RTT.]
[Figure: relative performance of BT, CG, EP, FT, IS, LU, MG, and SP, compared against the 16-node 2.4 GHz and 2.8 GHz single-cluster results.]
GridMPI Now and Future
• GridMPI version 1.0 has been released
  – Conformance tests
    • MPICH Test Suite: 0/142 (fails/tests)
    • Intel Test Suite: 0/493 (fails/tests)
  – GridMPI is integrated into the NaReGI package
• Extension of the IMPI specification
  – Refine the current extensions
  – Collective communication and checkpoint algorithms cannot be fixed in the specification; the current idea is to specify
    • the mechanism of
      – dynamic algorithm selection
      – dynamic algorithm shipment and loading
        » a virtual machine to implement the algorithms
Dynamic Algorithm Shipment
• A collective communication algorithm is implemented in the virtual machine
• The code is shipped to all MPI processes
• The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication (see the sketch after this list)
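The instruction set of this virtual machine is not yet specified; purely as a hedged sketch of what "interpreting a shipped algorithm" could look like (every opcode, type, and function name here is invented for illustration, not part of IMPI or GridMPI), the runtime might walk a small list of communication steps:

/* Purely illustrative sketch of interpreting a shipped collective algorithm;
 * the opcodes and structures are invented, not part of IMPI or GridMPI. */
#include <mpi.h>

typedef enum { OP_SEND, OP_RECV, OP_LOCAL_BCAST, OP_END } opcode_t;

typedef struct {
    opcode_t op;
    int peer;        /* remote rank for OP_SEND / OP_RECV */
    int root;        /* root rank for OP_LOCAL_BCAST */
} step_t;

/* Execute a shipped step list on behalf of a collective operation. */
void run_shipped_collective(const step_t *prog, void *buf, int count,
                            MPI_Datatype type, MPI_Comm comm)
{
    for (const step_t *s = prog; s->op != OP_END; s++) {
        switch (s->op) {
        case OP_SEND:
            MPI_Send(buf, count, type, s->peer, 0, comm);
            break;
        case OP_RECV:
            MPI_Recv(buf, count, type, s->peer, 0, comm, MPI_STATUS_IGNORE);
            break;
        case OP_LOCAL_BCAST:
            MPI_Bcast(buf, count, type, s->root, comm);
            break;
        default:
            break;
        }
    }
}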
Concluding Remarks
• Our main concern is the metropolitan-area network
  – high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
• Overseas (about 100 milliseconds)
  – Applications must be aware of the communication latency
  – Data movement using MPI-IO?
• Collaborations
  – We would like to ask people who are interested in this work for collaboration