A Hybrid MPI Design using SCTP and iWARP Distributed Systems Group Mike Tsai, Brad Penoff, and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada April 14, 2008


Page 1

A Hybrid MPI Design using SCTP and iWARP

Distributed Systems Group

Mike Tsai, Brad Penoff, and Alan Wagner
Department of Computer Science
University of British Columbia
Vancouver, Canada

April 14, 2008

Page 2

A Hybrid Message Passing Interface Design using the Stream Control Transmission Protocol and the Internet Wide Area Remote Direct Memory Access Protocol

Distributed Systems Group

Mike Tsai, Brad Penoff, and Alan Wagner
Department of Computer Science
University of British Columbia
Vancouver, Canada

April 14, 2008

Page 3

Research Background

• SCTP – Stream Control Transmission Protocol
– IETF standardized transport protocol for IP
– Can be used anywhere TCP or UDP are used
– Additional features

• SCTP and MPI middleware
– LAM (unreleased)
– MPICH2 (1.0.5 and on) ch3:sctp
– Open MPI SCTP BTL (in v1.3 trunk)

Page 4

State-of-the-Art Networking

• Hardware acceleration techniques for IP
– Protocol offload
– OS bypass
– Zero copy
– RDMA
– 10 GigE

How would these look for SCTP?

Are there benefits here for using SCTP?

Page 5

Story/motivation

• iWARP - Internet Wide Area RDMA Protocol
– IETF standard for RDMA over IP

• Use RDMA, point-to-point, or a mix?

• “Why Compromise?” (G. Shainer @ HPCWire.com)
– Depending on the application, use whichever is best.

• For MPI middleware, who decides what’s best?

The programmer!

Page 6

Contribution

• Hybrid MPI with functional decomposition lets the programmer decide:
– Let RMA use RDMA
– Let other communications use point-to-point

• Explore SCTP’s use within iWARP
– Extended OSC userspace software iWARP, making many internal OSC changes

Page 7

iWARP : DDP & LLP

[Figure: protocol stack, top to bottom: Verbs or API, RDMAP, DDP, Lower Layer Protocol (LLP), IP]

Direct Data Placement
• Fragments messages
• Reassembles segments
• Segments self-contained
• Data delivery and placement separation
• Out-of-order delivery

Requires LLP to:
• Keep segment boundaries
• Be reliable
• Take a strong checksum
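The DDP bullets above can be illustrated with a small Python sketch. It is not the OSC iWARP code: the `Segment` fields, `fragment`, and `PlacementBuffer` are hypothetical names chosen for illustration. It assumes the LLP is reliable and delivers each segment exactly once, so the receiver only has to copy each self-contained segment to its stated offset, in whatever order segments arrive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # All field names are hypothetical; real DDP headers differ.
    msg_id: int     # message this segment belongs to
    offset: int     # byte offset of the payload within the message
    last: bool      # True for the final segment of the message
    payload: bytes


def fragment(msg_id, data, max_payload):
    """Split data into self-contained segments of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    if not data:
        return [Segment(msg_id, 0, True, b"")]
    segments = []
    for offset in range(0, len(data), max_payload):
        chunk = data[offset:offset + max_payload]
        is_last = offset + len(chunk) == len(data)
        segments.append(Segment(msg_id, offset, is_last, chunk))
    return segments


class PlacementBuffer:
    """Places segments directly at their offsets, in any arrival order."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.placed = 0      # bytes placed so far
        self.total = None    # message length, known once the last segment arrives

    def place(self, seg):
        end = seg.offset + len(seg.payload)
        if end > len(self.buffer):
            raise ValueError("segment does not fit in the target buffer")
        self.buffer[seg.offset:end] = seg.payload
        self.placed += len(seg.payload)
        if seg.last:
            self.total = end

    def complete(self):
        # Assumes exactly-once delivery from the LLP (no duplicate segments).
        return self.total is not None and self.placed == self.total
```

Because every segment carries its own offset, placement never waits for earlier segments; only the completion check needs to know that all bytes have arrived.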

Page 8

iWARP : LLP = MPA over TCP

[Figure: protocol stack, top to bottom: Verbs or API, RDMAP, DDP, MPA, TCP, IP; MPA over TCP forms the LLP]

Marker PDU Aligned (MPA)
• Message framing
• DDP segment vs. TCP stream
• Markers for out-of-order
• For middlebox fragmentation
• Stronger checksum

… is a complex layer (majority of OSC code)!

… can lead to non-compliant TCP stacks.
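To show why a byte stream needs a framing layer, here is a simplified Python sketch in the spirit of MPA. It is an illustration under stated assumptions, not the MPA wire format as implemented by OSC iWARP: `frame` and `deframe` are hypothetical names, the periodic markers are omitted entirely, and `zlib.crc32` stands in for the CRC32c checksum because CRC32c is not in the Python standard library (the two use different polynomials).

```python
import struct
import zlib


def _padding(length):
    # Pad the length field plus payload up to a 4-byte boundary.
    return -(2 + length) % 4


def frame(ulpdu):
    """Wrap one upper-layer message: 16-bit length, payload, padding, CRC."""
    if len(ulpdu) > 0xFFFF:
        raise ValueError("payload too large for a 16-bit length field")
    body = struct.pack("!H", len(ulpdu)) + ulpdu + b"\x00" * _padding(len(ulpdu))
    crc = zlib.crc32(body) & 0xFFFFFFFF   # stand-in for CRC32c (assumption)
    return body + struct.pack("!I", crc)


def deframe(stream):
    """Recover whole messages from a byte stream.

    Returns (messages, leftover), where leftover holds the bytes of a
    frame that has not fully arrived yet.
    """
    messages = []
    pos = 0
    while len(stream) - pos >= 2:
        (length,) = struct.unpack_from("!H", stream, pos)
        body_len = 2 + length + _padding(length)
        if len(stream) - pos < body_len + 4:
            break                          # incomplete frame, wait for more bytes
        body = stream[pos:pos + body_len]
        (crc,) = struct.unpack_from("!I", stream, pos + body_len)
        if zlib.crc32(body) & 0xFFFFFFFF != crc:
            raise ValueError("checksum mismatch")
        messages.append(body[2:2 + length])
        pos += body_len + 4
    return messages, stream[pos:]
```

A message-based transport that preserves boundaries and carries its own checksum makes this whole layer unnecessary, which is the point the next slide makes about SCTP.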

Page 9

SCTP is a better LLP

[Figure: protocol stack, top to bottom: Verbs or API, RDMAP, DDP, then either MPA over TCP or SCTP as the LLP, IP]

LLP’s needs built-in to SCTP:
• Reliable, message-based
• CRC32c checksum
• Out-of-order support:
• MSG_UNORDERED
• Multistreaming
• Multihoming

Unmodified stack supports:
• Path failover
• Multirail data striping

Page 10

In the beginning, there was ch3:sctp

[Figure: MPICH2 architecture with components MPI-1 API, MPI-2 one-sided RMA API, CH3:SCTP, Socket, SCTP]

Page 11

OSC iWARP was modified and incorporated as a thread…

[Figure: MPICH2 architecture with components MPI-1 API, MPI-2 one-sided RMA API, CH3:SCTP, iWARP, Socket, SCTP]

Page 12

RMA done by modified OSC iWARP

[Figure: MPICH2 architecture with components MPI-1 API, MPI-2 one-sided RMA API, CH3:HYBRID, CH3:SCTP, iWARP, Shared Data Structure, Socket, SCTP]

Page 13

OSC iWARP changes to support MPI

• Running in a thread
• Use SCTP
• Making all OSC ops non-blocking
• Locks around shared data

[Figure: MPICH2 architecture with components MPI-1 API, MPI-2 one-sided RMA API, CH3:HYBRID, CH3:SCTP, iWARP, Shared Data Structure, Socket, SCTP]

Page 14

Connection Management Design

Connection establishment:

• Separate one-to-many socket for new QPs
– SCTP “peeloff” feature

• New QP sends request from one-to-many socket
• Request/ACK received, then QP socket peeled-off
• For conflicts, MPI rank resolves who sends ACK
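The conflict rule in the last bullet can be sketched in Python. The slides say only that MPI rank decides who sends the ACK when both sides request a connection at once; the specific rule below (the higher-ranked process accepts and ACKs, the lower-ranked one discards the incoming request and waits) is an assumption made for illustration, and `on_connect_request` and `ack_sender` are hypothetical names.

```python
def on_connect_request(local_rank, remote_rank, have_pending_request):
    """Decide what to do with an incoming connection request.

    Returns "accept" (peel off the socket and send the ACK) or "discard"
    (keep waiting for the peer's ACK to our own request).
    """
    if local_rank == remote_rank:
        raise ValueError("a rank does not connect to itself")
    if not have_pending_request:
        return "accept"          # no conflict: only the peer asked
    # Conflict: both sides sent a request. ASSUMED tie-break: the higher
    # rank accepts, so exactly one ACK is sent for the pair.
    return "accept" if local_rank > remote_rank else "discard"


def ack_sender(rank_a, rank_b):
    """Which of two ranks sends the ACK after simultaneous requests."""
    decision_a = on_connect_request(rank_a, rank_b, have_pending_request=True)
    return rank_a if decision_a == "accept" else rank_b
```

Any deterministic comparison of ranks would work equally well; what matters is that both processes evaluate the same rule and reach opposite decisions, so one socket is peeled off per pair.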

Page 15

Progress Engine

[Flowchart. Steps: Start, Dequeue head Event, No Event / Valid event, Handle Event (Read Logic, Write Logic), Loop or Break out early, End. The Event Queue (entries marked W and R, with a head pointer) is fed by: Enqueue Read Event, Enqueue Write Event, iWARP poll followed by Enqueue iWARP Event, and Application Level Events.]
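A minimal Python sketch of a queue-driven progress loop follows. It mirrors the labels in the flowchart (dequeue head event, handle it, loop or break out early, iWARP poll enqueuing events), but the exact ordering of those steps is an assumption, and `ProgressEngine`, `iwarp_poll`, and `max_events` are hypothetical names rather than MPICH2 or OSC iWARP APIs.

```python
from collections import deque


class ProgressEngine:
    """Toy single-queue progress engine (illustrative only)."""

    KINDS = ("read", "write", "iwarp")

    def __init__(self, iwarp_poll=None):
        self.events = deque()
        # iwarp_poll is a callable returning a list of completed iWARP
        # operations; by default it reports nothing.
        self.iwarp_poll = iwarp_poll if iwarp_poll is not None else (lambda: [])
        self.log = []            # record of handled events, for inspection

    def enqueue(self, kind, payload=None):
        if kind not in self.KINDS:
            raise ValueError("unknown event kind: %r" % (kind,))
        self.events.append((kind, payload))

    def handle(self, kind, payload):
        # A real engine would run read or write logic here; the sketch
        # only records what it was asked to do.
        self.log.append((kind, payload))

    def progress(self, max_events=None):
        """Handle queued events; stop early after max_events if given."""
        for completion in self.iwarp_poll():
            self.enqueue("iwarp", completion)
        handled = 0
        while self.events:
            if max_events is not None and handled >= max_events:
                break            # break out early, leaving the rest queued
            kind, payload = self.events.popleft()
            self.handle(kind, payload)
            handled += 1
        return handled
```

Feeding point-to-point and iWARP events through one queue keeps a single place where ordering and early exit are decided, which is one plausible reading of the diagram.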

Page 16

Performance

What we tested…

– Compared our new ch3:hybrid to the original ch3:sctp
– Two 3.2 GHz Intel boxes (GigE + switch)
• OSU latency tests (MPI_Put & MPI_Get)
• Homemade synthetic benchmark
– Combination of RMA and MPI-1 calls

Page 17

OSU One-sided Latency Tests

• ch3:hybrid adds 2-8% overhead

Page 18

Synthetic Application

• ch3:hybrid was faster than ch3:sctp
– 3.8 seconds vs. 4.5 seconds

• Extra thread helps in some cases

Page 19

Conclusions

• RDMA versus point-to-point for MPI
– Why choose?

• Functional decomposition lets programmer decide

• SCTP is a good match for iWARP
– Implementation of iWARP using SCTP shown.
– SCTP has its place in the state-of-the-art.
– It’d be more exciting to have SCTP-based devices…

Page 20

Google “sctp mpi” for more information about our work

Thank you!

Page 21

Connection Management Design

[Sequence diagram between Rank 0 and Rank 1, with time t running downward. Labels in order:]
• Both ranks: Connect (send connection packet)
• Connect Request Discarded (Target rank > 0)
• Connect Request Accepted (Target rank > local rank)
• Peeloff, register with iWARP
• Connection ACK
• Peeloff, register with iWARP
• App. Level Connection formed