
March 2014

InfiniBand Meets OpenSHMEM

© 2014 Mellanox Technologies 2

Leading Supplier of End-to-End Interconnect Solutions

[Portfolio diagram: Virtual Protocol Interconnect spanning storage front/back-end, server/compute, and switch/gateway roles; 56Gb/s InfiniBand and FCoIB alongside 10/40/56GbE and FCoE; host/fabric software, ICs, switches/gateways, adapter cards, and cables/modules; a comprehensive end-to-end InfiniBand and Ethernet portfolio reaching metro/WAN]

© 2014 Mellanox Technologies 3

Virtual Protocol Interconnect (VPI) Technology

[VPI technology diagram: switch configurations of 64 ports of 10GbE, 36 ports of 40/56GbE, 48 ports of 10GbE plus 12 of 40/56GbE, or 36 InfiniBand ports at up to 56Gb/s with 8 VPI subnets; a common switch OS layer; VPI adapters and VPI switches in mezzanine-card, LOM, and adapter-card form factors; Ethernet at 10/40/56 Gb/s and InfiniBand at 10/20/40/56 Gb/s; the Unified Fabric Manager; acceleration engines serving networking, storage, clustering, and management applications; from data center to campus and metro connectivity]

Standard Protocols of InfiniBand and Ethernet on the Same Wire!

© 2014 Mellanox Technologies 4

OpenSHMEM / PGAS

Mellanox ScalableHPC Communication Library to Accelerate Applications

MXM
• Reliable messaging
• Hybrid transport mechanism
• Efficient memory registration
• Receive-side tag matching

FCA
• Topology-aware collective optimization
• Hardware multicast
• Separate virtual fabric for collectives
• CORE-Direct hardware offload

Accelerated programming models: MPI, Berkeley UPC, OpenSHMEM/PGAS

[Charts: Barrier and Reduce collective latency (µs) versus number of processes (PPN = 8), with and without FCA]

© 2014 Mellanox Technologies 5

Overview

Extreme scale programming-model challenges

Challenges for scaling OpenSHMEM to extreme scale

InfiniBand enhancements

• Dynamically Connected Transport (DC)

• Cross-Channel synchronization

• Non-contiguous data transfer

• On Demand Paging

Mellanox ScalableSHMEM

© 2014 Mellanox Technologies 6

Exascale-Class Computer Platforms – Communication Challenges

Very large functional unit count ~10,000,000

• Implication to communication stack: Need scalable communication capabilities

- Point-to-point

- Collective

Large on-”node” functional unit count ~500

• Implication to communication stack: Scalable HCA architecture

Deeper memory hierarchies

• Implication to communication stack: Cache aware network access

Smaller amounts of memory per functional unit

• Implication to communication stack: Low latency, high b/w capabilities

May have functional unit heterogeneity

• Implication to communication stack: Support for data heterogeneity

Component failures part of “normal” operation

• Implication to communication stack: Resilient and redundant stack

Data movement is expensive

• Implication to communication stack: Optimize data movement

© 2014 Mellanox Technologies 7

Challenges of a Scalable OpenSHMEM

Scalable communication interface
• Interface objects must scale sublinearly with system size
• Not a current issue, but one to keep in mind moving forward

Support for asynchronous communication
• Non-blocking communication interfaces (sketched below)
• Sparse and topology-aware interfaces?

System noise issues
• Interfaces that support communication delegation (e.g., offloading)

Minimize data motion
• Semantics that support data “aggregation”
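The non-blocking interfaces called for above can be sketched briefly. OpenSHMEM 1.0/1.1 (current at the time of this talk) is blocking-only; the example below uses the shmem_putmem_nbi/shmem_quiet pair later standardized in OpenSHMEM 1.3, purely to illustrate the interface style being requested.

```c
/* Sketch: overlapping communication with computation using the
 * non-blocking put interface standardized in OpenSHMEM 1.3
 * (the 2014-era OpenSHMEM API offered only blocking transfers). */
#include <stddef.h>
#include <shmem.h>

void exchange(double *remote_buf, const double *local_buf,
              size_t n, int partner)
{
    /* Initiate the transfer without waiting for completion. */
    shmem_putmem_nbi(remote_buf, local_buf, n * sizeof(double), partner);

    /* ... overlap independent computation here ... */

    /* Force completion of all outstanding non-blocking operations. */
    shmem_quiet();
}
```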

© 2014 Mellanox Technologies 8

Scalable Performance Enhancements

© 2014 Mellanox Technologies 9

Dynamically Connected Transport

A Scalable Transport

© 2014 Mellanox Technologies 10

New Transport

Challenges being addressed:

• Scalable communication protocol

• High-performance communication

• Asynchronous communication

Current status: Transports in widest use

• RC

- High Performance: Supports RDMA and Atomic Operations

- Scalability limitations: One connection per destination

• UD

- Scalable: One QP services multiple destinations

- Limited communication support: No support for RDMA and Atomic Operations, unreliable

Need a scalable transport that also supports RDMA and atomic operations. DC offers the best of both worlds:

• High performance: supports RDMA and atomic operations, reliable

• Scalable: one QP services multiple destinations

© 2014 Mellanox Technologies 11

IB Reliable Transports Model

Connection-context requirements (example: n = 4K nodes, p = 16 processes per node):

• RC: n*p contexts per process, n*p*p (~1M) per node
• XRC: n contexts per process, ~n*p (~64K) per node
• RD (+XRC): n contexts per process, shared between processes, ~n (~4K) per node

QoS/multipathing: 2 to 8 times the above. Resource sharing (XRC/RD) causes processes to impact each other.

© 2014 Mellanox Technologies 12


The DC Model

Dynamic connectivity: each DC initiator can be used to reach any remote DC target

No resource sharing between processes
• Each process controls how many objects it allocates (and can adapt to load)
• Each process controls the usage model (e.g., SQ allocation policy)
• No inter-process dependencies

Resource footprint
• A function of HCA capability
• Independent of system size

Fast communication setup time

cs = concurrency of the sender; cr = concurrency of the responder

[Diagram: each of the p processes holds cs DC initiators and cr DC targets, for a per-node total on the order of p*(cs+cr)/2 objects]
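To make the scaling argument of the last two slides concrete, the small calculation below reproduces the connection-context counts for the slide's example of n = 4K nodes and p = 16 processes per node. The DC concurrency values cs and cr are illustrative placeholders chosen by the application, not figures taken from the slides.

```c
/* Back-of-the-envelope context counts for RC, XRC, and DC,
 * following the formulas on the preceding slides. */
#include <stdio.h>

int main(void)
{
    long n = 4096, p = 16;  /* nodes and processes per node (slide example)  */
    long cs = 4, cr = 4;    /* illustrative DC concurrency, chosen by the app */

    long rc_per_proc  = n * p;            /* RC: one QP per destination process */
    long rc_per_node  = rc_per_proc * p;  /* ~1M contexts per node              */

    long xrc_per_proc = n;                /* XRC: one QP per destination node   */
    long xrc_per_node = xrc_per_proc * p; /* ~64K contexts per node             */

    long dc_per_proc  = cs + cr;          /* DC: set by application concurrency,
                                             independent of system size         */

    printf("RC : %ld per process, %ld per node\n", rc_per_proc, rc_per_node);
    printf("XRC: %ld per process, %ld per node\n", xrc_per_proc, xrc_per_node);
    printf("DC : %ld per process, regardless of n\n", dc_per_proc);
    return 0;
}
```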

© 2014 Mellanox Technologies 13

Connect-IB – Exascale Scalability

[Chart: host memory consumption (MB, log scale from 1 to 1,000,000,000) versus cluster size (8, 2K, 10K, and 100K nodes) for InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), and Connect-IB with DCT (2012)]

© 2014 Mellanox Technologies 14

Dynamically Connected Transport

Key objects

• DC Initiator: Initiates data transfer

• DC Target: Handles incoming data

© 2014 Mellanox Technologies 15

Reliable Connection Transport Mode

© 2014 Mellanox Technologies 16

Dynamically Connected Transport Mode

© 2014 Mellanox Technologies 17

Cross-Channel Synchronization

© 2014 Mellanox Technologies 18

Challenges Being Addressed

Scalable Collective communication

Asynchronous communication

Communication resources (not computational resources) are used to manage communication

Avoids some of the effects of system noise

© 2014 Mellanox Technologies 19

High Level Objectives

Provide a synchronization mechanism between QPs

Provide mechanisms for managing communication dependencies without additional host intervention

Support asynchronous progress of multi-staged communication protocols

© 2014 Mellanox Technologies 20

Motivating Example - HPC

Collective communications optimization

Communication pattern involving multiple processes

Optimized collectives involve a communicator-wide data-dependent communication pattern, e.g., communication initiation is dependent on prior completion of other communication operations

Data needs to be manipulated at intermediate stages of a collective operation (reduction operations)

Collective operations limit application scalability

• Performance, scalability, and system noise

© 2014 Mellanox Technologies 21

Scalability of Collective Operations

[Diagrams: the progression of an ideal collective algorithm across four processes versus the same algorithm under the impact of system noise]

© 2014 Mellanox Technologies 22

Scalability of Collective Operations - II

[Diagrams: an offloaded algorithm versus a nonblocking algorithm, highlighting where communication processing takes place]

© 2014 Mellanox Technologies 23

Network Managed Multi-stage Communication

Key Ideas

• Create a local description of the communication pattern

• Pass the description to the communication subsystem

• Manage the communication operations on the network, freeing the CPU to do meaningful computation

• Check for full-operation completion

Current Assumptions

• Data delivery is detected by new Completion Queue Events

• Use Completion Queue to identify the data source

• Completion order is used to associate data with a specific operation

• Use RDMA write with immediate data to generate Completion Queue events
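The last assumption maps directly onto a standard libibverbs work request. Below is a minimal sketch, assuming an already-connected QP, a registered local buffer, and a peer (remote_addr, rkey) exchanged out of band; the responder must have a receive WQE posted so the immediate data surfaces as a completion there.

```c
/* Sketch: posting an RDMA WRITE with immediate so the remote side
 * receives a completion-queue event it can use to track data arrival. */
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

int post_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey, uint32_t imm)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a local CQE too     */
    wr.imm_data            = htonl(imm);         /* delivered in the remote CQE */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```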

© 2014 Mellanox Technologies 24

Key New Features

New QP trait – managed QP: WQEs on such a QP must be enabled by WQEs from other QPs

Synchronization primitives:
• Wait work queue entry: waits until the specified completion queue (QP) reaches a specified producer index value
• Enable task: a WQE on one QP can “enable” a WQE on a second QP

Lists of tasks can be submitted to multiple QPs in a single post – sufficient to describe chained operations (such as collective communication)

A special completion queue can be set up to monitor list completion (request a CQE from the relevant task)

© 2014 Mellanox Technologies 25

Setting up CORE-Direct QP’s

Create QP with the ability to use the CORE-Direct primitives

Decide whether managed QPs will be used; if so, create the QP that will carry the enable tasks. Most likely this is a centralized resource handling both the enable and the wait tasks

Decide on a Completion Queue strategy

Set up all needed QPs

© 2014 Mellanox Technologies 26

Initiating CORE-Direct Communication

A task list is created; each task specifies (illustrated below):

• The target QP for the task

• The operation: send, wait, or enable

• For wait, the number of completions to wait for
- The number of completions is specified relative to the beginning of the task list
- It can be positive, zero, or negative (waiting on previously posted tasks)

• For enable, the number of send tasks on the target QP to enable
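The task descriptions above can be pictured with a small illustrative structure. The names below are hypothetical, chosen only to make the fields concrete; the real CORE-Direct interface is exposed through Mellanox verbs extensions rather than this struct.

```c
/* Illustrative only: a plausible in-memory shape for the task list
 * described above (hypothetical names, not the Mellanox verbs API). */
#include <infiniband/verbs.h>

enum cd_op { CD_SEND, CD_WAIT, CD_ENABLE };

struct cd_task {
    struct ibv_qp *target_qp; /* QP the task is posted to / acts on          */
    enum cd_op     op;        /* send, wait, or enable                       */
    int            count;     /* WAIT:   completions to wait for, relative to
                                 the start of the list (may be negative for
                                 previously posted tasks);
                                 ENABLE: number of send tasks to enable       */
};

/* A whole list is handed to the HCA in a single post, e.g.
 *   struct cd_task step[] = { { qp_to_peer, CD_SEND,   0 },
 *                             { mgmt_qp,    CD_WAIT,   1 },
 *                             { mgmt_qp,    CD_ENABLE, 1 } };
 */
```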

© 2014 Mellanox Technologies 27

Posting of Tasks

© 2014 Mellanox Technologies 28

CORE-Direct Task-List Completion

Can specify which task will generate a completion in the “collective” completion queue

Single CQ signals full list (collective) completion

CPU is not needed for progress

© 2014 Mellanox Technologies 29

Example – Four Process Recursive Doubling

[Diagram: recursive doubling among four processes; in step 1 the processes exchange in adjacent pairs, in step 2 across pairs, so that two steps complete the operation]
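A plain-C sketch of the recursive-doubling schedule shown in the figure. post_send() and post_wait() are hypothetical placeholders for the CORE-Direct send and wait tasks described earlier.

```c
/* Hypothetical helpers standing in for CORE-Direct send/wait tasks. */
extern void post_send(int peer);
extern void post_wait(int peer);

/* Recursive doubling for a power-of-two process count (0-based ranks):
 * with 4 processes, step 1 pairs 0-1 and 2-3, step 2 pairs 0-2 and 1-3. */
void recursive_doubling_barrier(int my_rank, int nprocs)
{
    for (int step = 1; step < nprocs; step <<= 1) {
        int partner = my_rank ^ step;
        post_send(partner);   /* notify the partner for this round    */
        post_wait(partner);   /* wait for the partner's notification  */
    }
    /* log2(nprocs) rounds; with CORE-Direct the whole schedule can be
     * posted up front as one task list and progressed by the HCA.    */
}
```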

© 2014 Mellanox Technologies 30

Four Process Barrier Example – Using Managed Queues – Rank 0

© 2014 Mellanox Technologies 31

Four Process Barrier Example – No Managed Queues – Rank 0

© 2014 Mellanox Technologies 32

User-Mode Memory Registration

© 2014 Mellanox Technologies 33

Key Features

Supports combining contiguous registered memory regions into a single memory region; the hardware treats them as one contiguous region (and handles the underlying non-contiguous regions)

For a given memory region, supports non-contiguous access to memory using a regular structure representation – base pointer, element length, stride, repeat count (see the sketch below)
• These can be combined from multiple different memory keys

Memory descriptors are created by posting WQEs that fill in the memory key

Supports local and remote non-contiguous memory access
• Eliminates the need for some memory copies
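The regular access pattern above (base pointer, element length, stride, repeat count) can be pictured with a small illustrative structure; the names are hypothetical and are not the Mellanox verbs interface.

```c
/* Illustrative only: the regular (strided) pattern a UMR describes. */
#include <stddef.h>
#include <stdint.h>

struct umr_regular_desc {
    void   *base;      /* address of the first element            */
    size_t  elem_len;  /* bytes transferred per element           */
    size_t  stride;    /* bytes between consecutive elements      */
    size_t  repeat;    /* number of elements                      */
};

/* Address of the i-th element the HCA would access for this descriptor. */
static inline void *umr_elem_addr(const struct umr_regular_desc *d, size_t i)
{
    return (uint8_t *)d->base + i * d->stride;
}

/* Example: one column of a row-major double matrix A[m][n] is
 *   base = &A[0][col], elem_len = sizeof(double),
 *   stride = n * sizeof(double), repeat = m. */
```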

© 2014 Mellanox Technologies 34

Combining Contiguous Memory Regions

© 2014 Mellanox Technologies 35

Non-Contiguous Memory Access – Regular Access

© 2014 Mellanox Technologies 36

On-Demand Paging

© 2014 Mellanox Technologies 37

Memory Registration

Applications register Memory Regions (MRs) for I/O
• The referenced memory must be part of the process address space at registration time
• A memory key is returned to identify the MR

Registration operation (see the sketch below)
• Pins down the MR
• Hands off the virtual-to-physical mapping to the hardware
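As a reference point for this pinned model, a minimal libibverbs registration sketch; it assumes a protection domain obtained with ibv_alloc_pd(), and the returned lkey/rkey are the memory keys mentioned above.

```c
/* Sketch: conventional (pinned) memory registration with libibverbs. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* Pins the pages and hands the virtual-to-physical mapping to the HCA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    /* mr->lkey / mr->rkey are the memory keys placed in work requests. */
    return mr;
}
```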

[Diagram: application, user-space driver stack, kernel driver stack, and hardware; the MR and its memory key sit in hardware alongside the SQ, RQ, and CQ]

© 2014 Mellanox Technologies 38

Memory Registration – continued

Fast path
• Applications post I/O operations directly to the HCA
• The HCA accesses memory using the translations referenced by the memory key

[Diagram: the same components as above, with the application driving the SQ, RQ, and CQ in hardware directly through the user-space driver stack, bypassing the kernel driver stack; the memory key selects the MR]

Wow !!! But…

© 2014 Mellanox Technologies 39

Challenges

The registered memory must fit in physical memory

Applications must have memory locking privileges

Continuously synchronizing the translation tables between the address space and the HCA is hard

• Address space changes (malloc, mmap, stack)

• NUMA migration

• fork()

Registration is a costly operation

• No locality of reference

© 2014 Mellanox Technologies 40

Achieving High Performance

Requires careful design

Dynamic registration

• Naïve approach induces significant overheads

• Pin-down cache logic is complex and not complete (a simplified sketch follows below)

Pinned bounce buffers

• Application level memory management

• Copying adds overhead
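To illustrate the pin-down-cache idea, here is a deliberately simplified sketch: it reuses an existing registration on an exact-address hit and evicts on collision. It ignores the hard parts alluded to above, such as invalidating entries when the application frees or remaps the memory.

```c
/* Simplified pin-down (registration) cache: reuse an existing MR when the
 * same buffer is used again, registering only on a miss. Real caches must
 * also invalidate entries on free()/munmap(), which is the hard part. */
#include <infiniband/verbs.h>
#include <stdint.h>

#define CACHE_SLOTS 64

struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
static struct reg_entry cache[CACHE_SLOTS];

struct ibv_mr *lookup_or_register(struct ibv_pd *pd, void *addr, size_t len)
{
    unsigned slot = (unsigned)(((uintptr_t)addr >> 12) % CACHE_SLOTS);
    struct reg_entry *e = &cache[slot];

    if (e->mr && e->addr == addr && e->len >= len)
        return e->mr;                         /* hit: skip ibv_reg_mr()    */

    if (e->mr)
        ibv_dereg_mr(e->mr);                  /* evict the old entry       */
    e->mr   = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
    e->addr = addr;
    e->len  = len;
    return e->mr;
}
```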

© 2014 Mellanox Technologies 41

On Demand Paging

MR pages are never pinned by the OS

• Paged in when HCA needs them

• Paged out when reclaimed by the OS

HCA translation tables may contain non-present pages

• Initially, a new MR is created with non-present pages

• Virtual memory mappings don’t necessarily exist

© 2014 Mellanox Technologies 42

Semantics

ODP memory registration

• Specify the IBV_ACCESS_ON_DEMAND access flag (see the sketch below)

Work request processing

• WQEs in HW ownership must reference mapped memory

- From Post_Send()/Recv() until PollCQ()

• RDMA operations must target mapped memory

• Access attempts to unmapped memory trigger an error

Transport

• RC semantics unchanged

• UD responder drops packets while page faults are resolved

- Standard semantics cannot be achieved unless wire is back-pressured
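A minimal sketch of ODP registration using the IBV_ACCESS_ON_DEMAND flag named above, written against today's upstream libibverbs API (ibv_query_device_ex and its odp_caps); at the time of this deck the equivalent functionality was exposed through Mellanox's experimental verbs, so treat the exact entry points as an assumption.

```c
/* Sketch: register an on-demand-paging MR after checking that the
 * device reports ODP support. Assumes ctx and pd already exist. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_odp(struct ibv_context *ctx, struct ibv_pd *pd,
                            void *buf, size_t len)
{
    struct ibv_device_attr_ex attr;

    if (ibv_query_device_ex(ctx, NULL, &attr) ||
        !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
        return NULL;                     /* device does not support ODP */

    /* No pinning: pages are faulted in when the HCA first touches them. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_ON_DEMAND |
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```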

© 2014 Mellanox Technologies 43

Execution Time Breakdown (Send Requestor)

[Pie charts: execution-time breakdown of a send-requestor page fault across schedule-in, WQE read, get-user-pages, PTE update, TLB flush, QP resume, and miscellaneous; a 4KB page fault takes ~135 µs in total, a 4MB page fault ~1 ms]

© 2014 Mellanox Technologies 44

Mellanox ScalableSHMEM

© 2014 Mellanox Technologies 45

Mellanox’s OpenSHMEM Implementation

Implemented within the context of the Open MPI project

Exploits Open MPI's component architecture, reusing components from the MPI implementation

Adds OpenSHMEM-specific components

Uses InfiniBand-optimized point-to-point and collective communication modules:

• Hardware Multicast

• CORE-Direct

• Dynamically Connected Transport

• Enhanced Hardware scatter/gather (UMR) – coming soon

• On Demand Paging – coming soon

© 2014 Mellanox Technologies 46

Leveraging the work already done for the MPI implementation

Similarities

• Atomics and collective operations

• One-sided operations (put/get)

• Job start and runtime support (mapping/binding/…)

Differences

• SPMD

• No communicators (yet)

• No user-defined data types

• Limited set of collectives, and slightly different semantics (Barrier)

• Applications can put/get data from the pre-allocated symmetric heap or static variables (see the sketch after this list)

• No file I/O support
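A minimal OpenSHMEM program (1.0-era API, matching the period of this deck) illustrating the model contrasted above: SPMD launch, a symmetric-heap allocation, a one-sided put, and a barrier. With Open MPI's OSHMEM it would typically be compiled with oshcc and launched with oshrun, though wrapper names can vary by installation.

```c
/* Each PE puts its rank into a symmetric-heap variable on its right
 * neighbor, then synchronizes; no matching receive is ever posted. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                       /* initialize the OpenSHMEM runtime */
    int me   = _my_pe();
    int npes = _num_pes();

    /* Symmetric heap allocation: the object exists on every PE. */
    int *dst = (int *)shmalloc(sizeof(int));
    *dst = -1;

    int right = (me + 1) % npes;
    shmem_int_put(dst, &me, 1, right);  /* one-sided put                     */
    shmem_barrier_all();                /* completes the puts and syncs PEs  */

    printf("PE %d received %d\n", me, *dst);
    shfree(dst);
    return 0;
}
```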

© 2014 Mellanox Technologies 47

OMPI + OSHMEM

Many OMPI frameworks are reused (runtime, platform support, job start, BTL, BML, MTL, profiling, autotools)

OSHMEM-specific frameworks are added, keeping the MCA plugin architecture (SCOLL, SPML, atomics, synchronization and ordering enforcement)

OSHMEM supports the Mellanox point-to-point and collective accelerators (MXM, FCA) as well as the OMPI-provided transports (TCP, OpenIB, Portals, …)

© 2014 Mellanox Technologies 48

SHMEM Implementation Architecture

© 2014 Mellanox Technologies 49

SHMEM Put Latency

[Chart: SHMEM put latency (µs), roughly 1.4 to 2.0 µs, versus data size from 1 B to 4 KB, comparing ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]

© 2014 Mellanox Technologies 50

SHMEM Put Latency

[Chart: SHMEM put latency (µs), up to ~200 µs, versus data size from 8 KB to 1 MB, comparing ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]

© 2014 Mellanox Technologies 51

SHMEM 8-Byte Message Rate

[Chart: SHMEM message rate (million messages per second), up to ~70, versus data size from 1 B to 8 KB, comparing ConnectX-3 RC, Connect-IB RC, ConnectX-3 UD, and Connect-IB UD]

© 2014 Mellanox Technologies 52

SHMEM Barrier Latency

[Chart: SHMEM barrier latency (µs), up to ~20 µs, versus number of hosts (2 to 63, 16 processes per node), comparing ConnectX-3 and Connect-IB]

© 2014 Mellanox Technologies 53

SHMEM All-to-All Benchmark (Message Size – 4096 Bytes)

[Chart: all-to-all bandwidth (MBps per PE), up to ~400, versus number of hosts (8 to 63, 16 PEs per node), comparing ConnectX-3 and Connect-IB]

© 2014 Mellanox Technologies 54

SHMEM Atomic Updates

[Chart: atomic update rate (million operations per second), up to ~6, comparing Connect-IB UD and ConnectX-3 UD]

For more information: HPC@mellanox.com
