TRANSCRIPT
Towards MPI progression layer elimination with TCP and SCTP
Brad Penoff and Alan Wagner, Department of Computer Science
University of British Columbia, Vancouver, Canada
HIPS 2006, April 25
Distributed Systems Group
Portability
Aspect of parallel processing integration
MPI API provides an interface for portable parallel applications, independent of MPI implementation
Will my application run?
[Diagram: User Code sits on the MPI API, which runs over any MPI implementation on the available resources.]
Portability
Aspect of parallel processing integration
MPI API provides an interface for portable parallel applications, independent of MPI implementation
MPI Middleware provides glue for a variety of underlying components required for a complex parallel runtime environment, independent of component implementation
Will my application perform well?
[Diagram: User Code sits on the MPI Middleware, which runs over any MPI implementation on the available resources.]
MPI Middleware
[Diagram: User Code over the MPI Middleware, which contains the Job Scheduler component, the Process Manager component, and the Message Progression / Communication component, all sitting on the Operating System, Transport, and Network.]
Glues together components:
Job Scheduler component
Process Manager component
Message Progression / Communication component
Maintains necessary state between MPI calls; calls are not simple library functions
Manages underlying communication: through the OS (e.g. TCP) or by direct low-level interaction (e.g. Infiniband)
Communication Requirements
Common: portability by having support for all potential interconnects
In this work: portability by eliminating this component by assuming IP!
Push MPI functionality down onto IP-based transports
Learn about necessary MPI implementation design changes
[Diagram: two protocol stacks. Common approach: Application / Middleware / transport over Infiniband, Myrinet, ..., and Ethernet + IP. This work: Application / Transport / IP over Ethernet, Infiniband, Myrinet, ...]
Component Elimination
[Diagram: User Code and the MPI middleware/library (Job Scheduler component, Process Manager component) over the Operating System's Transport and the Network; the Message Progression / Communication component is the one targeted for elimination.]
Elimination Motivation
Common approach: exploit specific features for all potential interconnects
Middleware does transport-layer “things”
Sequencing & flow control complicate the middleware
Implemented differently, MPI implementations are incompatible
Our approach here: assume IP
Leverage mainstream commodity networking advances
Simplify middleware
Increase MPI implementation interoperability (perhaps?)
Elimination Approach
View MPI as a protocol, from a networking point of view
Design MPI with elimination as a goal
MPI                           Networking
Message matching              Demultiplexing
Expected / unexpected queues  Storage/buffering
Short / long protocol         Flow control
MPI Implementation Designs
TCP
SCTP
TCP Socket Per TRC
General scheme: a socket per MPI message stream (tag-rank-context, or TRC), plus a control port
MPI_Send calls connect() (MPI_Recv could wildcard)
The resulting socket is stored in a table attached to the communicator object
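To make the scheme concrete, here is a minimal hypothetical sketch in C of the per-TRC connection table; trc_table and trc_socket are illustrative names, not the paper's implementation, and the control-port exchange that yields the peer address is omitted.

```c
/* Hypothetical sketch: one TCP socket per (tag, rank, context) triple,
 * opened lazily on first send and cached in a table attached to the
 * communicator object. */
#include <sys/socket.h>
#include <netinet/in.h>

#define MAX_TRC 1024

struct trc_entry { int tag, rank, context, fd; };
struct trc_table { struct trc_entry e[MAX_TRC]; int n; };

/* Return the socket for a TRC triple, connecting on first use.
 * 'addr' is the peer's per-TRC listening address, learned through the
 * control port (not shown). */
static int trc_socket(struct trc_table *t, int tag, int rank, int context,
                      const struct sockaddr_in *addr)
{
    for (int i = 0; i < t->n; i++)
        if (t->e[i].tag == tag && t->e[i].rank == rank &&
            t->e[i].context == context)
            return t->e[i].fd;

    if (t->n == MAX_TRC)
        return -1;   /* table full: the "# sockets" OS limit bites here */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (const struct sockaddr *)addr,
                          sizeof *addr) < 0)
        return -1;

    t->e[t->n++] = (struct trc_entry){ tag, rank, context, fd };
    return fd;
}
```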
TCP-MPI as a Protocol
Matching: select() fd sets for wildcards
Queues: unexpected = socket buffer w/ flow control; expected = more local, attached to handles
Short/long: no distinction; rely on TCP flow control
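A wildcard receive then has no single socket to read, so the middleware must select() across every candidate socket. A hedged sketch, reusing the hypothetical trc_table above; wildcard_wait is an illustrative name:

```c
/* Sketch: block until any per-TRC socket in the given context has data.
 * The linear scans and the FD_SETSIZE cap are exactly why this design
 * does not scale. */
#include <sys/select.h>

static int wildcard_wait(const struct trc_table *t, int context)
{
    fd_set rfds;
    int maxfd = -1;

    FD_ZERO(&rfds);
    for (int i = 0; i < t->n; i++) {
        if (t->e[i].context != context)
            continue;                 /* wildcard rank and tag within it */
        FD_SET(t->e[i].fd, &rfds);
        if (t->e[i].fd > maxfd)
            maxfd = t->e[i].fd;
    }

    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;

    for (int i = 0; i < t->n; i++)
        if (t->e[i].context == context && FD_ISSET(t->e[i].fd, &rfds))
            return t->e[i].fd;        /* first ready socket wins the match */
    return -1;
}
```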
TCP per TRC critique
Design achieves elimination, but…
# sockets – OS user limits
Expense of sys calls (context switch, copying)
select() – doesn’t scale
Flow control
Mismatch: transport/OS is event-driven vs. MPI application is control-driven
SCTP-based design
What is SCTP?
Stream Control Transmission Protocol
General purpose unicast transport protocol for IP network data communications
Recently standardized by IETF
Can be used anywhere TCP is used
Available SCTP stacks
BSD / Mac OS X LKSCTP – Linux Kernel 2.4.23 and later Solaris 10 HP OpenCall SS7 OpenSS7 Other implementations listed on
sctp.org for Windows, AIX, VxWorks, etc.
Relevant SCTP features
Multistreaming
One-to-many socket style (see the sketch after this list)
Multihoming
Message-based
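As a concrete illustration of the one-to-many style listed above: a single SOCK_SEQPACKET socket is bound and listened on once, and then carries associations with many peers. A minimal sketch (open_one_to_many is an illustrative name; link with -lsctp on Linux):

```c
/* Sketch: an SCTP one-to-many socket. One descriptor serves all peers;
 * new associations are created implicitly as messages arrive or are sent. */
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

static int open_one_to_many(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 8) < 0)            /* enables inbound associations */
        return -1;
    return fd;
}
```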
Logical View of Multiple Streams in an Association
[Diagram: Endpoint X and Endpoint Y, each with SEND and RECEIVE sides, joined by an association carrying inbound and outbound Streams 1–3.]
Flow control is per association (not per stream)
Using SCTP for MPI
TRC-to-stream map matches MPI semantics:
MPI       SCTP
Context   One-to-many socket
Rank      Association
Tag       Streams
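A hedged sketch of how a send could follow this mapping using the standard sctp_sendmsg() call: the one-to-many socket is chosen by context, the peer address selects the association (rank), and the tag picks a stream. mpi_sctp_send and NUM_STREAMS are illustrative names, not the authors' code:

```c
/* Sketch: map the MPI envelope onto SCTP. The tag is folded onto one of
 * the streams negotiated for the association; messages with the same tag
 * therefore stay ordered, while different tags can bypass one another. */
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/sctp.h>

#define NUM_STREAMS 8   /* assumed number of streams per association */

static int mpi_sctp_send(int fd, const struct sockaddr *peer,
                         socklen_t peerlen, const void *buf, size_t len,
                         int tag)
{
    uint16_t stream = (uint16_t)(tag % NUM_STREAMS);
    return sctp_sendmsg(fd, buf, len, (struct sockaddr *)peer, peerlen,
                        0 /* ppid */, 0 /* flags */, stream,
                        0 /* time-to-live */, 0 /* context */);
}
```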
SCTP-MPI as a protocol
Matching: required, since one cannot receive from a particular stream; sctp_recvmsg() = ANY_RANK + ANY_TAG
Avoids select() through the one-to-many socket
Queues: globally required for matching
Short/long: required; flow control is not per stream
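Because one cannot receive from a chosen stream, the receive side becomes a progress loop that takes whatever arrives and matches it afterward. A sketch, assuming the sctp_data_io_event notification has been enabled via the SCTP_EVENTS socket option so the sctp_sndrcvinfo is filled in; mpi_sctp_progress and match_or_queue are hypothetical names:

```c
/* Sketch: every sctp_recvmsg() on the one-to-many socket behaves like
 * ANY_RANK + ANY_TAG; the ancillary sctp_sndrcvinfo tells us which
 * association (rank) and stream (tag) the message actually came from. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* Hypothetical matcher stub: a real middleware would consult the
 * expected queue here and otherwise buffer on the unexpected queue. */
static void match_or_queue(sctp_assoc_t assoc, uint16_t stream,
                           const char *buf, int n)
{
    (void)buf;  /* payload would be matched and copied here */
    printf("assoc %d, stream %u: %d-byte message\n", (int)assoc, stream, n);
}

static int mpi_sctp_progress(int fd, char *buf, size_t len)
{
    struct sockaddr_in from;
    socklen_t fromlen = sizeof from;
    struct sctp_sndrcvinfo sinfo;
    int flags = 0;

    memset(&sinfo, 0, sizeof sinfo);
    int n = sctp_recvmsg(fd, buf, len, (struct sockaddr *)&from,
                         &fromlen, &sinfo, &flags);
    if (n < 0)
        return -1;

    match_or_queue(sinfo.sinfo_assoc_id, sinfo.sinfo_stream, buf, n);
    return n;
}
```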
SCTP and elimination
SCTP thins the middleware, but the component cannot be eliminated:
Need flow control per stream
Need ability to receive from a stream
Need ability to query which streams have data ready
Conclusions
TCP design eliminates but doesn’t scale
SCTP scales but only thins the component
SCTP one-to-many socket style requires additional features for elimination:
Flow control per stream
Ability to receive from a stream
Ability to query which streams have data ready
More information about our work is at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/ (or Google “sctp mpi”)
Thank you!
Upcoming annual SCTP Interop
July 30 – Aug 4, 2006, to be held at UBC
Vendors and implementers test their stacks for performance and interoperability
Extra slides
MPI Point-to-Point
Message matching is done based on Tag, Rank and Context (TRC).
Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
Use of wildcards for receive
MPI_Send(msg,cnt,type,dst-rank,tag,context)
MPI_Recv(msg,cnt,type,src-rank,tag,context)
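A minimal runnable illustration of this matching (standard MPI, run with two processes): the receiver wildcards both source and tag and recovers the envelope from MPI_Status.

```c
/* Two-process example: rank 0 sends, rank 1 receives with wildcards and
 * reads the matched envelope (source rank and tag) out of the status. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, /* tag */ 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        printf("got %d from rank %d with tag %d\n",
               msg, st.MPI_SOURCE, st.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}
```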
Format of MPI Message
[Diagram: an MPI message consists of the envelope (context, rank, tag) followed by the payload.]
MPI Messages Using Same Context, Two Processes
[Animation: Process X issues MPI_Send(Msg_1,Tag_A), MPI_Send(Msg_2,Tag_B), MPI_Send(Msg_3,Tag_A) while Process Y posts MPI_Irecv(..ANY_TAG..); frames show the three messages in flight.]
Out-of-order messages with the same tag violate MPI semantics
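The rule behind this slide is MPI's non-overtaking guarantee: two messages from the same sender on the same communicator that match the same receive must be delivered in send order. A small illustration (standard MPI, two processes):

```c
/* Rank 0 sends 1 then 2 with the same tag; MPI guarantees rank 1's two
 * wildcard-tag receives observe them in send order (non-overtaking). */
#include <mpi.h>
#include <assert.h>

int main(int argc, char **argv)
{
    int rank, a = 1, b = 2, x, y;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(&y, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        assert(x == 1 && y == 2);   /* order is guaranteed by MPI */
    }
    MPI_Finalize();
    return 0;
}
```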
Associations and Multihoming
[Diagram: Endpoint X (NIC1, NIC2) and Endpoint Y (NIC3, NIC4) joined by a single association spanning two networks: 207.10.x.x (IP=207.10.40.1 and IP=207.10.3.20) and 168.1.x.x (IP=168.1.140.10 and IP=168.1.10.30).]
SCTP Key Similarities
Reliable in-order delivery, flow control, full duplex transfer.
TCP-like congestion control
Selective ACK is built into the protocol
SCTP Key Differences
Message oriented
Added security
Multihoming, use of associations
Multiple streams within an association
MPI over SCTP
LAM and MPICH2 are two popular open source implementations of the MPI library.
We redesigned LAM to use SCTP and take advantage of its additional features.
Future plans include SCTP support within MPICH2.
How can SCTP help MPI?
A redesign for SCTP thins the MPI middleware’s communication component
Use of the one-to-many socket style scales well
SCTP adds resilience to MPI programs:
Avoids unnecessary head-of-line blocking with streams
Increased fault tolerance in the presence of multihomed hosts
Built-in security features
Improved congestion control
Full Results Presented @
Partially Ordered User Messages Sent on Different Streams
[Animation: Endpoint X sends Msg A, Msg B, Msg C, Msg D, Msg E across Streams 1–3 of an association to Endpoint Y; successive frames show the messages leaving the send side and arriving at the receive side.]
Send order: A, B, C, D, E
The messages can be received in the same order as they were sent (required in TCP)
An alternative receive order is also possible
Delivery constraints: A must be before C, and C must be before D
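The delivery constraints follow from per-stream ordering: if A, C, and D travel on one stream while B and E use others, SCTP must deliver A before C before D but may interleave B and E anywhere. A hedged sketch of such a send pattern (send_on_stream is an illustrative helper, and the stream assignment is inferred from the figure):

```c
/* Sketch: five messages over three streams of one association. Only
 * messages sharing a stream are ordered relative to each other. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/sctp.h>

static void send_on_stream(int fd, struct sockaddr *to, socklen_t tolen,
                           const char *msg, uint16_t stream)
{
    /* Errors ignored for brevity. */
    (void)sctp_sendmsg(fd, msg, strlen(msg), to, tolen, 0, 0, stream, 0, 0);
}

static void send_example(int fd, struct sockaddr *to, socklen_t tolen)
{
    send_on_stream(fd, to, tolen, "Msg A", 0);
    send_on_stream(fd, to, tolen, "Msg B", 1);
    send_on_stream(fd, to, tolen, "Msg C", 0);  /* ordered after A */
    send_on_stream(fd, to, tolen, "Msg D", 0);  /* ordered after C */
    send_on_stream(fd, to, tolen, "Msg E", 2);
}
```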
MPI Middleware
[Diagram: several Parallel Applications sit on the MPI parallel library API; the MPI Middleware components (Job Scheduler, Process Manager, Message Progression / Communication) map them onto the underlying Resources.]