
Page 1:

GridMPI: Grid Enabled MPI

Yutaka Ishikawa
University of Tokyo and AIST

http://www.gridmpi.org

Page 2:

Motivation

• MPI has been widely used to program parallel applications.
• Users want to run such applications over the Grid environment without any modification of the program.
• However, the performance of existing MPI implementations does not scale up in the Grid environment.

[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resources at site A and site B connected by a wide-area network]

Page 3:

Motivation

• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – Internet bandwidth in the Grid vs. interconnect bandwidth in a cluster:
    • 10 Gbps vs. 1 Gbps
    • 100 Gbps vs. 10 Gbps

[Figure: same as Page 2, a single (monolithic) MPI application over the Grid environment across sites A and B]

Page 4:

Motivation

• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
  – We have already demonstrated, using an emulated WAN environment, that the NAS Parallel Benchmark programs scale up if the one-way latency is smaller than 10 ms.

[Figure: same as Page 2, a single (monolithic) MPI application over the Grid environment across sites A and B]

Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID 2003, 2003.

Page 5:

Issues

• High Performance Communication Facilities for MPI on Long and Fat Networks
  – TCP vs. MPI communication patterns
  – Network Topology
    • Latency and Bandwidth
• Interoperability
  – Most MPI library implementations use their own network protocol.
• Fault Tolerance and Migration
  – To survive a site failure
• Security

TCP vs. MPI:
  – TCP: designed for streams.
  – MPI: burst traffic; repeats computation and communication phases; traffic changes with the communication pattern.

[Figure: bandwidth (MB/s) vs. time (0-500 ms) observed during one 10 MB data transfer, while repeating 10 MB transfers at two-second intervals]

• The slow-start phase: the window size is set to 1.
• The silences result from burst traffic.
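The traffic pattern behind this plot can be reproduced with a few lines of MPI. The sketch below is not part of GridMPI; it simply repeats a 10 MB rank-0-to-rank-1 transfer with two-second idle gaps, as in the caption, and reports the bandwidth the sender observes, which makes the slow-start restart after each idle period visible.

/* Sketch only (not from the GridMPI sources): repeat a 10 MB transfer
 * between rank 0 and rank 1 with two-second idle gaps, as in the plot
 * above, and report the bandwidth seen by the sender. Over a TCP path
 * the idle gap lets the congestion window decay, so each transfer
 * starts in slow start again. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MSG_SIZE (10 * 1024 * 1024)   /* 10 MB per transfer */
#define REPEATS  10

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < REPEATS; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("transfer %d: %.1f MB/s\n",
                   i, (MSG_SIZE / 1048576.0) / (t1 - t0));
        sleep(2);                     /* idle interval between transfers */
    }

    MPI_Finalize();
    free(buf);
    return 0;
}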

Page 6:

Issues

• High Performance Communication Facilities for MPI on Long and Fat Networks
  – TCP vs. MPI communication patterns
  – Network Topology
    • Latency and Bandwidth
• Interoperability
  – Most MPI library implementations use their own network protocol.
• Fault Tolerance and Migration
  – To survive a site failure
• Security

TCP vs. MPI:
  – TCP: designed for streams.
  – MPI: burst traffic; repeats computation and communication phases; traffic changes with the communication pattern.

[Figure: bandwidth (MB/s) vs. time (0-5 s) when one-to-one communication starts at time 0, immediately after an all-to-all]
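The scenario in this plot can likewise be written directly in MPI. In the sketch below (again not from GridMPI; sizes are arbitrary) every rank finishes an all-to-all, and ranks 0 and 1 then immediately time a point-to-point transfer, which suffers from the burst the all-to-all has just injected into the shared TCP path. Run it with at least two ranks.

/* Sketch only: point-to-point transfer started right after an all-to-all,
 * as in the plot above. Sizes are arbitrary; needs at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 * 1024 * 1024)    /* bytes sent to each peer in the all-to-all */
#define P2P   (10 * 1024 * 1024)   /* bytes in the following one-to-one transfer */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = malloc((size_t)CHUNK * size);
    char *recvbuf = malloc((size_t)CHUNK * size);
    char *p2pbuf  = malloc(P2P);

    /* Phase 1: the all-to-all puts a burst of traffic on every link. */
    MPI_Alltoall(sendbuf, CHUNK, MPI_BYTE, recvbuf, CHUNK, MPI_BYTE,
                 MPI_COMM_WORLD);

    /* Phase 2: time a one-to-one transfer started immediately afterwards. */
    double t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Send(p2pbuf, P2P, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(p2pbuf, P2P, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-to-one after all-to-all: %.1f MB/s\n",
               (P2P / 1048576.0) / (t1 - t0));

    free(sendbuf); free(recvbuf); free(p2pbuf);
    MPI_Finalize();
    return 0;
}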

Page 7:

Issues

• High Performance Communication Facilities for MPI on Long and Fat Networks
  – TCP vs. MPI communication patterns
  – Network Topology
    • Latency and Bandwidth
• Interoperability
  – Most MPI library implementations use their own network protocol.
• Fault Tolerance and Migration
  – To survive a site failure
• Security

TCP vs. MPI:
  – TCP: designed for streams.
  – MPI: burst traffic; repeats computation and communication phases; traffic changes with the communication pattern.

[Figure: clusters at multiple sites connected through the Internet]

Page 8:

Issues

• High Performance Communication Facilities for MPI on Long and Fat Networks
  – TCP vs. MPI communication patterns
  – Network Topology
    • Latency and Bandwidth
• Interoperability
  – There are many MPI library implementations, and most of them use their own network protocol.
• Fault Tolerance and Migration
  – To survive a site failure
• Security

TCP vs. MPI:
  – TCP: designed for streams.
  – MPI: burst traffic; repeats computation and communication phases; traffic changes with the communication pattern.

[Figure: four sites connected through the Internet, each using a different vendor's MPI library (Vendors A, B, C, and D)]

Page 9:

GridMPI Features

[Figure: GridMPI software stack -- MPI API on top; Request Interface / Request Layer; LAC Layer (Collectives) and IMPI; a P2P Interface with TCP/IP, PMv2, MX, O2G, and Vendor MPI modules; and an RPIM Interface with ssh, rsh, SCore, and Globus modules]

• MPI-2 implementation
• YAMPII, developed at the University of Tokyo, is used as the core implementation
• Intra-cluster communication by YAMPII (TCP/IP, SCore)
• Inter-cluster communication by the IMPI (Interoperable MPI) protocol, extended for the Grid
  – MPI-2
  – New collective protocols
• Integration of vendor MPIs
  – IBM Regatta MPI, MPICH2, Solaris MPI, Fujitsu MPI, (NEC SX MPI)
• Incremental checkpointing
• High-performance TCP/IP implementation

[Figure: clusters running vendor MPIs and YAMPII, connected through the Internet via IMPI/TCP]

LAC: Latency-Aware Collectives
• bcast/allreduce algorithms have been developed (to appear at the Cluster 2006 conference)
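The LAC algorithms themselves are not shown on the slide and are not reproduced here. Purely as an illustration of the latency-aware idea, the sketch below crosses the high-latency inter-cluster link only between per-cluster leaders and then broadcasts inside each cluster over the fast local network; the helper cluster_id() and the assumption that the root is global rank 0 are mine, not GridMPI's. The published algorithms additionally take measured latency and bandwidth into account.

/* Illustrative sketch only -- not the LAC algorithm from the paper.
 * Cross the wide-area link once per cluster (between leaders), then
 * broadcast inside each cluster. Assumes the root is global rank 0,
 * which is then also rank 0 of its cluster and of the leader group. */
#include <mpi.h>

extern int cluster_id(int rank);   /* hypothetical: maps a rank to its site */

void latency_aware_bcast(void *buf, int count, MPI_Datatype type,
                         MPI_Comm comm)
{
    int rank, local_rank;
    MPI_Comm local, leaders;

    MPI_Comm_rank(comm, &rank);

    /* One communicator per cluster ... */
    MPI_Comm_split(comm, cluster_id(rank), rank, &local);
    MPI_Comm_rank(local, &local_rank);

    /* ... and one communicator of the per-cluster leaders (local rank 0). */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Step 1: broadcast among leaders over the wide-area network. */
    if (leaders != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leaders);

    /* Step 2: broadcast inside each cluster over the local network. */
    MPI_Bcast(buf, count, type, 0, local);

    if (leaders != MPI_COMM_NULL)
        MPI_Comm_free(&leaders);
    MPI_Comm_free(&local);
}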

Page 10:

High-performance Communication Mechanisms for Long and Fat Networks

• Modifications of TCP behavior
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
• Precise software pacing (illustrated after this list)
  – R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
• Collective communication algorithms with respect to network latency and bandwidth
  – M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "Efficient MPI Collective Operations for Clusters in Long-and-Fast Networks," to appear at IEEE Cluster 2006.
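The pacing mechanism in the PFLDnet 2005 paper works at the driver level and is far more precise than anything a user-level loop can do; the sketch below is only a rough illustration of the idea, splitting a send into fixed-size chunks and spacing them so the average rate tracks a target instead of bursting at line rate.

/* Rough user-level illustration of pacing (the paper's mechanism works at
 * the driver level). Sends 'len' bytes on socket 'fd' in chunks, inserting
 * gaps so the average rate stays near 'rate_bps'. */
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

#define CHUNK 65536

static void paced_send(int fd, const char *buf, size_t len, double rate_bps)
{
    double gap = (double)CHUNK * 8.0 / rate_bps;   /* seconds per chunk */
    size_t sent = 0;

    while (sent < len) {
        size_t n = (len - sent < CHUNK) ? (len - sent) : CHUNK;
        ssize_t r = send(fd, buf + sent, n, 0);
        if (r <= 0)
            break;                                 /* error handling elided */
        sent += (size_t)r;

        struct timespec ts;
        ts.tv_sec  = (time_t)gap;
        ts.tv_nsec = (long)((gap - (double)(time_t)gap) * 1e9);
        nanosleep(&ts, NULL);                      /* space out the chunks */
    }
}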

Page 11:

Evaluation

• It is almost impossible to reproduce the communication behavior and performance of a real wide-area network.
• A WAN emulator, GtrcNET-1, is therefore used to examine implementations, protocols, communication algorithms, etc. under controlled, reproducible conditions.

GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/):
• Injection of delay, jitter, errors, ...
• Traffic monitoring and frame capture
• Four 1000Base-SX ports
• One USB port for the host PC
• FPGA (XC2V6000)

Page 12:

Experimental Environment

• Two clusters of 8 PCs each
  – CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
  – NIC: Intel PRO/1000 (82547EI)
  – OS: Linux 2.6.9-1.6 (Fedora Core 2)
  – Socket buffer size: 20 MB

[Figure: Node0-Node7 and Node8-Node15, each group behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator; emulated bandwidth 1 Gbps, delay 0 ms -- 10 ms]
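A quick check (my arithmetic, not from the slides) that a 20 MB socket buffer cannot throttle a single stream on this emulated path: at 1 Gbps, taking the 10 ms worst case as one-way delay (20 ms RTT), the bandwidth-delay product is

  BDP = 10^9 bit/s x 0.020 s = 2 x 10^7 bits, i.e. about 2.5 MB,

so the 20 MB buffer leaves roughly an 8x margin over the amount of data the path can hold in flight.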

Page 13:

GridMPI vs. MPICH-G2 (1/4)

FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes

[Figure: relative performance vs. one-way delay (0-12 ms) for FT(GridMPI) and FT(MPICH-G2)]

Page 14:

GridMPI vs. MPICH-G2 (2/4)

IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes

[Figure: relative performance vs. one-way delay (0-12 ms) for IS(GridMPI) and IS(MPICH-G2)]

Page 15:

GridMPI vs. MPICH-G2 (3/4)

LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes

[Figure: relative performance vs. one-way delay (0-12 ms) for LU(GridMPI) and LU(MPICH-G2)]

Page 16:

GridMPI vs. MPICH-G2 (4/4)

NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes

[Figure: relative performance vs. one-way delay (0-12 ms) for SP, BT, MG, and CG, each with GridMPI and MPICH-G2]

No parameters were tuned in GridMPI.

Page 17:

GridMPI on an Actual Network

• NAS Parallel Benchmarks were run on an 8-node (2.4 GHz) cluster at Tsukuba and an 8-node (2.8 GHz) cluster at Akihabara
  – 16 nodes in total
• The performance is compared with
  – the result on 16 nodes at 2.4 GHz
  – the result on 16 nodes at 2.8 GHz

[Figure: Pentium 4 2.4 GHz x 8, connected by 1G Ethernet at Tsukuba, and Pentium 4 2.8 GHz x 8, connected by 1G Ethernet at Akihabara, 60 km (40 mi.) apart and linked by the JGN2 network (10 Gbps bandwidth, 1.5 ms RTT); bar chart of relative performance for BT, CG, EP, FT, IS, LU, MG, and SP against the 2.4 GHz and 2.8 GHz baselines]

Page 18:

GridMPI Now and Future

• GridMPI version 1.0 has been released
  – Conformance tests
    • MPICH Test Suite: 0/142 (fails/tests)
    • Intel Test Suite: 0/493 (fails/tests)
  – GridMPI is integrated into the NaReGI package
• Extension of the IMPI specification
  – Refine the current extensions
  – The collective communication and checkpoint algorithms cannot be fixed in the specification; the current idea is to specify the mechanism of
    • dynamic algorithm selection
    • dynamic algorithm shipment and loading
      – a virtual machine to implement the algorithms

Page 19:

Dynamic Algorithm Shipment

• A collective communication algorithm is implemented in the virtual machine.
• The code is shipped to all MPI processes.
• The MPI runtime library interprets the algorithm to perform the inter-cluster collective communication.

[Figure: clusters connected through the Internet, all receiving the shipped algorithm code]
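The virtual machine is not defined on these slides, so the following is only an illustrative sketch of the idea: a collective is shipped as a small program of send/receive steps, and every process interprets the same program against its own rank. The instruction format and the example broadcast schedule are invented for illustration.

/* Illustrative sketch only -- the GridMPI/IMPI virtual machine is not
 * specified here. A collective is expressed as a list of instructions
 * that every MPI process interprets, executing only its own steps. */
#include <mpi.h>

typedef enum { OP_SEND, OP_RECV, OP_DONE } opcode_t;

typedef struct {
    opcode_t op;
    int      who;    /* rank that executes this step      */
    int      peer;   /* partner rank for the send/receive */
} insn_t;

/* Interpret a shipped instruction sequence on communicator 'comm'. */
static void run_collective(const insn_t *prog, void *buf, int count,
                           MPI_Datatype type, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (const insn_t *p = prog; p->op != OP_DONE; p++) {
        if (p->who != rank)
            continue;                 /* this step belongs to another rank */
        if (p->op == OP_SEND)
            MPI_Send(buf, count, type, p->peer, 0, comm);
        else
            MPI_Recv(buf, count, type, p->peer, 0, comm, MPI_STATUS_IGNORE);
    }
}

/* Example "program": a binomial-tree broadcast from rank 0 on four ranks. */
static const insn_t bcast4[] = {
    { OP_SEND, 0, 2 }, { OP_RECV, 2, 0 },
    { OP_SEND, 0, 1 }, { OP_RECV, 1, 0 },
    { OP_SEND, 2, 3 }, { OP_RECV, 3, 2 },
    { OP_DONE, 0, 0 },
};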

Page 20:

Concluding Remarks

• Our main concern is the metropolitan-area network
  – high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency)
• Overseas (about 100 ms latency)
  – Applications must be aware of the communication latency
  – Data movement using MPI-IO? (see the sketch below)
• Collaborations
  – We would like to invite anyone interested in this work to collaborate with us.
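MPI-IO is only raised as a question on the slide. As a hedged illustration of what such data movement could look like, the sketch below has each rank write its block of a result array to a shared file with a collective MPI-IO call; the file name and block size are made up. Staging bulk data through a file in this way is one option when the direct path has roughly 100 ms of latency.

/* Illustrative MPI-IO sketch (the slide only poses MPI-IO as a question).
 * Each rank writes its own block of a distributed array to a shared file
 * with a collective write; file name and sizes are made up. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK (1 << 20)   /* doubles per rank */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    double *data = malloc(BLOCK * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... fill 'data' with this rank's part of the result ... */

    MPI_File_open(MPI_COMM_WORLD, "staged_data.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: rank i writes at offset i * BLOCK doubles. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    free(data);
    return 0;
}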