Towards MPI progression layer elimination with TCP and SCTP Brad Penoff and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada HIPS 2006 April 25 Distributed Systems Group

Towards MPI progression layer elimination with TCP and SCTP

Brad Penoff and Alan Wagner Department of Computer Science

University of British Columbia, Vancouver, Canada

HIPS 2006 April 25

Distributed Systems Group

Portability

Aspect of parallel processing integration

MPI API provides interface for portable parallel applications, independent of MPI implementation

Will my application run?

MPI API

[Figure: layered stack showing User Code, any MPI Implementation, and Resources.]

Portability

Aspect of parallel processing integration

MPI API provides interface for portable parallel applications, independent of MPI implementation

MPI Middleware provides glue for a variety of underlying components required for a complex parallel runtime environment, independent of component implementation

Will my application perform well?

MPI Middleware

[Figure: layered stack showing User Code, any MPI Implementation, and Resources.]

MPI Middleware

Glues together components

[Figure: User Code sits on the MPI Middleware, which contains the Job Scheduler Component, the Process Manager Component, and the Message Progression Communication Component; beneath it are the Operating System, Transport, and Network.]

Message Progression Communication Component

• Maintains necessary state between MPI calls
  – Calls are not a simple library function
• Manages underlying communication
  – through the OS (e.g. TCP)
  – direct low-level interaction (e.g. InfiniBand)

[Figure: User Code above the MPI Middleware with its Message Progression Communication Component, above the OS, Transport, and Network.]

Communication Requirements

• Common: portability by having support for all potential interconnects
• In this work: portability by eliminating this component by assuming IP!
  – Push MPI functionality down onto IP-based transports
  – Learn about necessary MPI implementation design changes

[Figure: two protocol stacks. Common approach: Application over Middleware, which sits directly on transport/IP/Ethernet, Infiniband, Myrinet, ... This work: Application over Transport over IP, which runs on Ethernet, Infiniband, Myrinet, ...]

Component Elimination

[Figure: User Code above the MPI Middleware/Library (Job Scheduler Component, Process Manager Component, Message Progression Communication Component), above the Operating System Transport and the Network.]

Elimination Motivation

• Common approach: Exploit specific features for all potential interconnects
  – Middleware does transport-layer “things”
    • Sequencing & flow control complicates the middleware
  – Implemented differently, MPI implementations incompatible
• Our approach here: Assume IP
  – Leverage mainstream commodity networking advances
    • Simplify middleware
  – Increase MPI implementation interoperability (perhaps?)

Elimination Approach

• View MPI as a protocol, from a networking point-of-view
• Design MPI with elimination as a goal

MPI                             Networking
Message matching                Demultiplexing
Expected / unexpected queues    Storage/buffering
Short / long protocol           Flow control

MPI Implementation Designs

TCP

SCTP

TCP Socket Per TRC

• General scheme
  – Socket per MPI message stream (tag-rank-context (TRC))
  – Control port
• MPI_Send calls connect (MPI_Recv could wildcard)
• Resulting socket stored in table attached to communicator object
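
As a rough illustration of the table described on this slide, the following Python sketch keeps one connected socket per (tag, rank, context) key on a communicator object. The class and method names are hypothetical, and socket.socketpair() stands in for the connect() to a peer's control port so that the example runs in a single process; it is a sketch under those assumptions, not the authors' implementation.

```python
import socket


class Communicator:
    """Toy communicator holding a TRC -> socket table (illustrative names)."""

    def __init__(self, context):
        self.context = context
        self.table = {}      # (tag, rank, context) -> local end used for sending
        self.peer_ends = {}  # same key -> remote end, standing in for the peer process

    def socket_for(self, tag, rank):
        """Return the socket for this TRC, creating it on first use."""
        key = (tag, rank, self.context)
        if key not in self.table:
            # Assumption: a real design would connect() via the peer's control port.
            local, remote = socket.socketpair()
            self.table[key] = local
            self.peer_ends[key] = remote
        return self.table[key]

    def send(self, payload, dst_rank, tag):
        """MPI_Send-like call: look up (or create) the socket, then write to it."""
        self.socket_for(tag, dst_rank).sendall(payload)

    def close(self):
        """Close both ends of every socket in the table."""
        for sock in list(self.table.values()) + list(self.peer_ends.values()):
            sock.close()


comm = Communicator(context=0)
comm.send(b"a", dst_rank=1, tag=7)  # first use of TRC (7, 1, 0): socket created
comm.send(b"b", dst_rank=1, tag=8)  # different tag: a second socket
comm.send(b"c", dst_rank=1, tag=7)  # same TRC as the first send: socket reused
socket_count = len(comm.table)
```

Every distinct TRC costs one socket in this scheme, which is why the number of open descriptors grows with the number of message streams in use.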

TCP-MPI as a Protocol

• Matching
  – select() fd sets for wildcards
• Queues
  – Unexpected = socket buffer w/ flow control
  – Expected = more local, attached to handles
• Short/long
  – No distinction, rely on TCP flow control
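
The matching and queueing rules on this slide can be sketched as follows. This is a minimal Python illustration rather than the authors' code: the recv_wildcard helper and the (tag, rank, context) key layout are assumptions, and socket.socketpair() stands in for connected TCP sockets. A wildcard receive builds the fd set from every socket whose TRC matches and hands it to select(); data for non-matching TRCs simply stays in its socket buffer, which plays the role of the unexpected queue.

```python
import select
import socket


def recv_wildcard(table, context, src_rank=None, tag=None, timeout=1.0):
    """Receive from the first readable socket whose TRC matches.

    table maps (tag, rank, context) -> socket; None acts as a wildcard.
    Returns ((tag, rank, context), data), or None if nothing arrives in time.
    """
    candidates = {
        sock: trc
        for trc, sock in table.items()
        if trc[2] == context
        and (src_rank is None or trc[1] == src_rank)
        and (tag is None or trc[0] == tag)
    }
    if not candidates:
        return None
    readable, _, _ = select.select(list(candidates), [], [], timeout)
    if not readable:
        return None
    sock = readable[0]
    return candidates[sock], sock.recv(4096)


# Two message streams from rank 0 in context 0, with tags 1 and 2.
send_1, recv_1 = socket.socketpair()
send_2, recv_2 = socket.socketpair()
table = {(1, 0, 0): recv_1, (2, 0, 0): recv_2}

send_2.sendall(b"tag-2 data")
any_tag = recv_wildcard(table, context=0)                     # wildcard on rank and tag
tag_1 = recv_wildcard(table, context=0, tag=1, timeout=0.05)  # nothing sent on tag 1 yet
```

Each wildcard receive rebuilds and scans the whole fd set, so the cost of a call grows with the number of sockets.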

TCP per TRC critique

• Design achieves elimination, but…
  – # sockets – OS user limits
  – Expense of sys calls (context switch, copying)
  – select() – doesn’t scale
  – Flow control
• Mismatch: transport/OS = event-driven vs. MPI application = control-driven

SCTP-based design

What is SCTP?

• Stream Control Transmission Protocol
• General purpose unicast transport protocol for IP network data communications
• Recently standardized by IETF
• Can be used anywhere TCP is used

Available SCTP stacks

• BSD / Mac OS X
• LKSCTP – Linux Kernel 2.4.23 and later
• Solaris 10
• HP OpenCall SS7
• OpenSS7
• Other implementations listed on sctp.org for Windows, AIX, VxWorks, etc.

Relevant SCTP features

Multistreaming

One-to-many socket style

Multihoming

Message-based

Logical View of Multiple Streams in an Association

Flow control per association (not stream)

[Figure: Endpoint X and Endpoint Y joined by one association. At each endpoint, SEND feeds outbound Streams 1 to 3 and RECEIVE is fed by inbound Streams 1 to 3.]

Using SCTP for MPI

• TRC-to-stream map matches MPI semantics

MPI        SCTP
Context    One-to-Many Socket
Rank       Association
Tag        Streams
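
A small Python sketch of this mapping follows. The slide only states which MPI field maps to which SCTP object; the modulo used to fold tags onto a fixed number of streams, the lookup tables, and all names here are assumptions made for illustration.

```python
from typing import NamedTuple


class SctpTarget(NamedTuple):
    socket_id: int       # one-to-many socket chosen by the MPI context
    association_id: int  # association chosen by the peer's rank
    stream: int          # stream chosen by the tag


def map_trc(context, rank, tag, context_sockets, rank_associations, num_streams):
    """Map a (tag, rank, context) triple onto (socket, association, stream)."""
    return SctpTarget(
        socket_id=context_sockets[context],
        association_id=rank_associations[rank],
        # Assumption: tags are folded onto the available streams with a modulo.
        stream=tag % num_streams,
    )


context_sockets = {0: 3}              # context 0 -> socket descriptor 3 (made up)
rank_associations = {0: 101, 1: 102}  # rank -> association id (made up)

first = map_trc(0, 1, 12, context_sockets, rank_associations, num_streams=10)
second = map_trc(0, 1, 12, context_sockets, rank_associations, num_streams=10)
other = map_trc(0, 1, 15, context_sockets, rank_associations, num_streams=10)
```

Messages with the same TRC always land on the same stream, so SCTP's in-order delivery within a stream preserves their order, while messages with other tags may travel on other streams and are free to overtake them.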

SCTP-MPI as a protocol

• Matching – required since cannot receive from a particular stream
  – sctp_recvmsg() = ANY_RANK + ANY_TAG
  – Avoids select() through one-to-many socket
• Queues – globally required for matching
• Short/Long – required; flow control not per stream
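
Because a one-to-many socket delivers whatever message arrives next, the middleware has to do the matching itself. The following Python sketch shows one way expected and unexpected queues could work for a single context; the Matcher class, the ANY sentinel, and the tuple layout are hypothetical, and arrivals are simulated rather than read with sctp_recvmsg().

```python
from collections import deque

ANY = None  # wildcard, standing in for ANY_RANK / ANY_TAG


def matches(want_rank, want_tag, rank, tag):
    """True if a posted (rank, tag), possibly wildcarded, accepts this message."""
    return (want_rank is ANY or want_rank == rank) and (want_tag is ANY or want_tag == tag)


class Matcher:
    def __init__(self):
        self.expected = deque()    # posted receives: (rank, tag, handle)
        self.unexpected = deque()  # arrived but unmatched: (rank, tag, payload)
        self.completed = {}        # handle -> payload

    def post_recv(self, rank, tag, handle):
        """A receive is posted: search the unexpected queue in arrival order."""
        for entry in self.unexpected:
            if matches(rank, tag, entry[0], entry[1]):
                self.unexpected.remove(entry)
                self.completed[handle] = entry[2]
                return
        self.expected.append((rank, tag, handle))

    def on_arrival(self, rank, tag, payload):
        """A message arrives: match the oldest posted receive, else queue it."""
        for entry in self.expected:
            if matches(entry[0], entry[1], rank, tag):
                self.expected.remove(entry)
                self.completed[entry[2]] = payload
                return
        self.unexpected.append((rank, tag, payload))


m = Matcher()
m.on_arrival(rank=1, tag=5, payload=b"early")  # no receive posted yet
m.post_recv(rank=ANY, tag=ANY, handle="r1")    # matches the queued message
m.post_recv(rank=2, tag=9, handle="r2")        # waits in the expected queue
m.on_arrival(rank=2, tag=9, payload=b"late")
```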

SCTP and elimination

• SCTP thins the middleware but the component cannot be eliminated
  – Need flow control per stream
  – Need ability to receive from stream
  – Need ability to query which streams have data ready

Conclusions

• TCP design eliminates but doesn’t scale
• SCTP scales but only thins component
• SCTP one-to-many socket style requires additional features for elimination
  – Flow control per stream
  – Ability to receive from stream
  – Ability to query which streams have data ready

Thank you!

More information about our work is at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/

Or Google “sctp mpi”

Upcoming annual SCTP Interop

July 30 – Aug 4, 2006 to be held at UBC

Vendors and implementers test their stacks
  – Performance
  – Interoperability

Extra slides

MPI Point-to-Point

Message matching is done based on Tag, Rank and Context (TRC).

Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.

Use of wildcards for receive

MPI_Send(msg,cnt,type,dst-rank,tag,context)

MPI_Recv(msg,cnt,type,src-rank,tag,context)

Format of MPI Message

Envelope (Context, Rank, Tag) followed by Payload
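
To make the envelope/payload split concrete, here is a minimal Python sketch that serializes a message as a fixed-size envelope followed by the payload bytes. The slide does not give field widths or byte order, so the three signed 32-bit integers in network byte order, and the function names, are assumptions.

```python
import struct

# Assumption: context, rank, and tag are signed 32-bit integers, network byte order.
ENVELOPE = struct.Struct("!iii")


def pack_message(context, rank, tag, payload):
    """Prefix the payload bytes with the (context, rank, tag) envelope."""
    return ENVELOPE.pack(context, rank, tag) + payload


def unpack_message(data):
    """Split a received buffer back into envelope fields and payload."""
    context, rank, tag = ENVELOPE.unpack_from(data)
    return context, rank, tag, data[ENVELOPE.size:]


wire = pack_message(context=0, rank=3, tag=42, payload=b"hello")
decoded = unpack_message(wire)
```

The receiver only needs the envelope to perform matching; the payload can stay opaque until a matching receive is found.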

MPI Messages Using Same Context, Two Processes

[Figure: two message timelines between Process X and Process Y. In both, Process X issues MPI_Send(Msg_1,Tag_A), MPI_Send(Msg_2,Tag_B), and MPI_Send(Msg_3,Tag_A), and Process Y posts MPI_Irecv(..ANY_TAG..).]

MPI Messages Using Same Context, Two Processes

[Figure: message timeline between Process X and Process Y. Process X issues MPI_Send(Msg_1,Tag_A), MPI_Send(Msg_2,Tag_B), and MPI_Send(Msg_3,Tag_A); Process Y posts MPI_Irecv(..ANY_TAG..).]

Out-of-order messages with the same tags violate MPI semantics

Associations and Multihoming

[Figure: Endpoint X (NIC1, NIC2) and Endpoint Y (NIC3, NIC4) joined by a single association that spans two networks, 207.10.x.x and 168.1.x.x. Interface addresses shown: 207.10.40.1, 207.10.3.20, 168.1.140.10, and 168.1.10.30.]

SCTP Key Similarities

Reliable in-order delivery, flow control, full duplex transfer.

TCP-like congestion control

Selective ACK is built into the protocol

SCTP Key Differences

Message oriented

Added security

Multihoming, use of associations

Multiple streams within an association

MPI over SCTP

LAM and MPICH2 are two popular open source implementations of the MPI library.

We redesigned LAM to use SCTP and take advantage of its additional features.

Future plans include SCTP support within MPICH2.

How can SCTP help MPI?

• A redesign for SCTP thins the MPI middleware’s communication component.
• Use of one-to-many socket-style scales well.
• SCTP adds resilience to MPI programs.
  – Avoids unnecessary head-of-line blocking with streams
  – Increased fault tolerance in presence of multihomed hosts
  – Built-in security features
  – Improved congestion control

Full Results Presented @

Partially Ordered User Messages Sent on Different Streams

[Figure, animated over 19 slides: Endpoint X (SEND) transmits Msg A, Msg B, Msg C, Msg D, and Msg E to Endpoint Y (RECEIVE) over Stream 1, Stream 2, and Stream 3. The frames step through the send order, a receive order, and an alternative receive order.]

• Messages can be received in the same order as they were sent (required in TCP).
• An alternative receive order is also possible.
• Delivery constraints: A must be before C and C must be before D
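
The delivery constraints can be checked mechanically. The Python sketch below assumes that A, C, and D travel on one stream (which is what the stated constraints imply) and places B and E on the other two streams purely for illustration; the function name and the example orders are likewise hypothetical.

```python
def is_valid_delivery(order, stream_of):
    """True if every stream's messages appear in `order` in their send order.

    stream_of maps a stream id to the list of its messages in send order.
    """
    position = {msg: i for i, msg in enumerate(order)}
    for sent in stream_of.values():
        indices = [position[m] for m in sent]
        if indices != sorted(indices):
            return False
    return True


# Assumption: A, C, D share a stream; the placement of B and E is made up.
streams = {1: ["A", "C", "D"], 2: ["B"], 3: ["E"]}

same_as_sent = is_valid_delivery(["A", "B", "C", "D", "E"], streams)
reordered_ok = is_valid_delivery(["E", "A", "B", "C", "D"], streams)
violates = is_valid_delivery(["C", "A", "B", "D", "E"], streams)
```

Only the relative order within a stream is constrained, so E may arrive first, but C may never arrive before A.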

MPI Middleware

[Figure: several Parallel Applications call the MPI Parallel Library API; beneath it, the MPI Middleware components (Job Scheduler, Process Manager, Message Progression Communication Component) sit on top of the Resources.]

Elimination Motivation

• Common approach: Exploit specific features for all potential interconnects
  – Middleware does transport-layer “things”
    • Sequencing & flow control complicates the middleware
  – Implemented differently, MPI implementations incompatible
• Our approach here: Assume IP
  – Leverage mainstream commodity networking advances
    • Simplify middleware
  – Increase MPI implementation interoperability