
A Hybrid MPI Design using SCTP and iWARP

Distributed Systems Group

Mike Tsai, Brad Penoff, and Alan Wagner

Department of Computer Science

University of British Columbia

Vancouver, Canada

April 14, 2008

A Hybrid Message Passing Interface Design using the Stream Control Transmission Protocol and the Internet Wide Area Remote Direct Memory Access Protocol


Research Background

• SCTP – Stream Control Transmission Protocol
– IETF standardized transport protocol for IP
– Can be used anywhere TCP or UDP are used
– Additional features

• SCTP and MPI middleware
– LAM (unreleased)
– MPICH2 (1.0.5 and on) ch3:sctp
– Open MPI SCTP BTL (in v1.3 trunk)

• Hardware acceleration techniques for IP
– Protocol offload
– OS bypass
– Zero copy
– RDMA
– 10 GigE

How would these look for SCTP?

Are there benefits here for using SCTP?

State-of-the-Art Networking

• iWARP - Internet Wide Area RDMA Protocol
– IETF standard for RDMA over IP

• Use RDMA, point-to-point, or a mix?

• “Why Compromise?” (G. Shainer @ HPCWire.com)
– Depending on the application, use whichever is best.

• For MPI middleware, who decides what’s best?

Story/motivation

The programmer!

Contribution

• Hybrid MPI with functional decomposition lets the programmer decide:
– Let RMA use RDMA
– Let other communications use point-to-point

• Explore SCTP’s use within iWARP
– Extended OSC userspace software iWARP, making many internal OSC changes
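The functional decomposition above can be pictured as a routing decision made per MPI call. The following is a minimal sketch, not the ch3:hybrid code: the names `RMA_CALLS` and `select_path` and the path labels are hypothetical, and only the two RMA calls named later in these slides (MPI_Put and MPI_Get) are treated as one-sided.

```python
# Hypothetical illustration of "functional decomposition":
# one-sided RMA calls go to the RDMA (iWARP) path, everything
# else goes to the point-to-point (SCTP) path.

# Assumption: only the RMA calls mentioned in the slides are listed.
RMA_CALLS = {"MPI_Put", "MPI_Get"}


def select_path(mpi_call: str) -> str:
    """Return the (hypothetical) transport label for an MPI call name."""
    if mpi_call in RMA_CALLS:
        return "iwarp-rdma"
    return "sctp-point-to-point"


if __name__ == "__main__":
    for call in ("MPI_Put", "MPI_Send", "MPI_Get", "MPI_Recv"):
        print(call, "->", select_path(call))
```

The point of the sketch is that the programmer picks the path simply by choosing which MPI API to call; the middleware does not guess.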

iWARP: DDP & LLP

[Figure: iWARP protocol stack, top to bottom: Verbs or API, RDMAP, DDP, Lower Layer Protocol (LLP), IP]

Direct Data Placement (DDP)

• Fragments messages
• Reassembles segments
• Segments self-contained
• Data delivery and placement separation
• Out-of-order delivery

Requires LLP to:
• Keep segment boundaries
• Be reliable
• Take a strong checksum
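Why self-contained segments allow out-of-order placement can be shown with a small sketch. This is a simplification under stated assumptions: a segment here carries only a byte offset and a payload, whereas real DDP headers carry more (for example a steering tag in the tagged buffer model). The function names `segment` and `place` are hypothetical.

```python
# Simplified model of DDP-style placement: each segment carries
# enough information (here, just its offset) to be written into
# the destination buffer without waiting for earlier segments.

def segment(message: bytes, max_payload: int):
    """Split a message into (offset, payload) segments."""
    return [
        (offset, message[offset:offset + max_payload])
        for offset in range(0, len(message), max_payload)
    ]


def place(buffer: bytearray, seg) -> None:
    """Write one segment directly at its own offset in the buffer."""
    offset, payload = seg
    buffer[offset:offset + len(payload)] = payload


if __name__ == "__main__":
    message = b"direct data placement"
    segments = segment(message, 4)

    destination = bytearray(len(message))
    # Segments arrive in reverse order; placement still works.
    for seg in reversed(segments):
        place(destination, seg)

    print(bytes(destination) == message)
```

Because placement depends only on the segment itself, the LLP has to preserve segment boundaries; a byte stream that splits or merges segments would break this property.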

iWARP: LLP = MPA over TCP

[Figure: iWARP protocol stack, top to bottom: Verbs or API, RDMAP, DDP, MPA, TCP, IP; MPA over TCP together form the LLP]

Marker PDU Aligned (MPA)

• Message framing
• DDP segment vs. TCP stream
• Markers for out-of-order
• For middlebox fragmentation
• Stronger checksum

… is a complex layer (majority of OSC code)!

… can lead to non-compliant TCP stacks.

SCTP is a better LLP

LLP’s needs built in to SCTP:
• Reliable, message-based
• CRC32c checksum
• Out-of-order support:
– MSG_UNORDERED
– Multistreaming
• Multihoming

Unmodified stack supports:
• Path failover
• Multirail data striping

[Figure: iWARP protocol stack with two LLP options below DDP: MPA over TCP, or SCTP; both run over IP]
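The "strong checksum" that SCTP provides is CRC32c (the Castagnoli polynomial). A bit-at-a-time sketch is shown below purely to make the checksum concrete; real stacks use table-driven or hardware-assisted implementations, and the function name `crc32c` is chosen here for illustration.

```python
# Bitwise CRC32c (Castagnoli), reflected form.
# 0x82F63B78 is the reflected CRC32c polynomial.

def crc32c(data: bytes) -> int:
    """Compute the CRC32c of data, one bit at a time (slow but simple)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; apply the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # Standard check value for the ASCII string "123456789".
    print(hex(crc32c(b"123456789")))  # 0xe3069283
```

With the checksum already in the transport, an SCTP-based LLP does not need the extra checksum layer that MPA adds on top of TCP.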

In the beginning, there was ch3:sctp

[Figure: MPICH2 software stack; labels: MPI-1 API, MPI-2 one-sided RMA API, CH3:SCTP, Socket, SCTP]

OSC iWARP was modified and incorporated in as a thread…

[Figure: MPICH2 software stack; labels: MPI-1 API, MPI-2 one-sided RMA API, CH3:SCTP, iWARP, Socket, SCTP]

RMA done by modified OSC iWARP

[Figure: MPICH2 software stack; labels: MPI-1 API, MPI-2 one-sided RMA API, CH3:HYBRID, CH3:SCTP, iWARP, Shared Data Structure, Socket, SCTP]

OSC iWARP changes to support MPI

• Running in a thread

• Use SCTP

• Making all OSC ops non-blocking

• Locks around shared data


Connection Management Design

Connection establishment:

• Separate one-to-many socket for new QPs
– SCTP “peeloff” feature

• New QP sends request from one-to-many socket

• Request/ACK received, then QP socket peeled-off

• For conflicts, MPI rank resolves who sends ACK
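The conflict case in the last bullet (both sides request a connection at the same time) needs a rule that both ranks can evaluate independently and that yields exactly one ACK. A minimal sketch follows. The exact rule is an assumption: here a rank accepts a conflicting request only when the requesting peer's rank is greater than its own, so the lower rank sends the ACK. The function name and return values are hypothetical.

```python
# Hypothetical tie-break for simultaneous connection requests.
# Assumption: when both sides have requested, a rank accepts the
# incoming request only if the peer's rank is greater than its own.

def handle_connect_request(local_rank: int, peer_rank: int,
                           local_already_requested: bool) -> str:
    """Return "accept" (send ACK, then peel off) or "discard"."""
    if not local_already_requested:
        # No conflict: only the peer asked, so accept.
        return "accept"
    if peer_rank > local_rank:
        return "accept"
    # The peer applies the same rule and will accept our request.
    return "discard"


if __name__ == "__main__":
    # Rank 0 and rank 1 both sent a request before seeing the other's.
    print("rank 0:", handle_connect_request(0, 1, True))
    print("rank 1:", handle_connect_request(1, 0, True))
```

Since ranks are unique, the two sides can never both accept or both discard a conflicting pair of requests.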

Progress Engine

[Figure: progress engine flowchart. Main loop: Start, dequeue head event, no event / valid event, handle event (read logic, write logic), loop or break out early, End. Event queue (entries marked W and R, with a head pointer) fed by: enqueue read event, enqueue write event, iWARP poll / enqueue iWARP event, application-level events.]
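An event-queue progress engine of this kind can be sketched as follows. This is an illustration under assumptions, not the ch3:hybrid implementation: the class name, the `poll_iwarp` callback, the `max_events` parameter (standing in for "break out early"), and the point at which iWARP is polled are all hypothetical.

```python
from collections import deque


class ProgressEngine:
    """Toy progress engine: one FIFO queue of read/write/iWARP events."""

    def __init__(self, poll_iwarp):
        # poll_iwarp: callable returning a list of completed iWARP events
        # (a stand-in for polling the iWARP layer).
        self.poll_iwarp = poll_iwarp
        self.events = deque()
        self.handled = []  # record of handled events, for inspection

    def enqueue(self, kind, payload):
        """Add a "read", "write", or "iwarp" event at the tail."""
        self.events.append((kind, payload))

    def progress(self, max_events=None):
        """Poll iWARP once, then drain the queue from the head.

        Stops when the queue is empty or, if max_events is given,
        after that many events (breaking out early).
        """
        for completion in self.poll_iwarp():
            self.enqueue("iwarp", completion)

        count = 0
        while self.events:
            if max_events is not None and count >= max_events:
                break
            event = self.events.popleft()  # dequeue head event
            self.handled.append(event)     # read/write logic would run here
            count += 1
        return count


if __name__ == "__main__":
    engine = ProgressEngine(poll_iwarp=lambda: [])
    engine.enqueue("write", "w1")
    engine.enqueue("read", "r1")
    print(engine.progress(), engine.handled)
```

Keeping iWARP completions in the same queue as socket read and write events gives a single place where ordering and early exit are decided.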

Performance

What we tested…

• Compared our new ch3:hybrid to the original ch3:sctp

• Two 3.2 GHz Intel boxes (GigE + switch)
– OSU latency tests (MPI_Put & MPI_Get)
– Homemade synthetic benchmark: combination of RMA and MPI-1 calls

OSU One-sided Latency Tests

• ch3:hybrid adds 2-8% overhead

Synthetic Application

• ch3:hybrid was faster than ch3:sctp
– 3.8 seconds vs. 4.5 seconds (about 16% less run time)

• Extra thread helps in some cases

Conclusions

• RDMA versus point-to-point for MPI
– Why choose?

• Functional decomposition lets programmer decide

• SCTP is a good match for iWARP
– Implementation of iWARP using SCTP shown.
– SCTP has its place in the state-of-the-art.
– It’d be more exciting to have SCTP-based devices…

Google “sctp mpi” for more information about our work

Thank you!

Connection Management Design

[Figure: connection setup timeline between Rank 0 and Rank 1 along time axis t. Labels in order: Connect (send connection packet) from each rank; Connect Request Discarded (target rank > 0); Connect Request Accepted (target rank > local rank); Peeloff, register with iWARP; Connection ACK; Peeloff, register with iWARP; App. level connection formed.]
