
Page 1: Enabling MPI Interoperability Through Flexible Communication  Endpoints

Enabling MPI Interoperability Through Flexible Communication Endpoints

James Dinan, Pavan Balaji, David Goodell, Douglas Miller, Marc Snir, and Rajeev Thakur

Page 2: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Mapping of Ranks to Processes in MPI

MPI provides a 1-to-1 mapping of ranks to processes. This was good in the past, but usage models have evolved:

– Programmers use a many-to-one mapping of threads to processes
• E.g. hybrid parallel programming with OpenMP/threads

– Other programming models also use a many-to-one mapping
• Interoperability is a key objective, e.g. with Charm++, etc.

[Figure: Conventional communicator, one rank per process, shared by all of that process's threads]

Page 3: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Current Approaches to Hybrid MPI+Threads

MPI message matching space: <communicator, sender, tag>. Two approaches to using THREAD_MULTIPLE:

1. Match a specific thread using the tag (see the sketch after this list):
– Partition the tag space to address individual threads
– Limitations:
• Collectives – multiple threads at a process can't participate concurrently
• Wildcards – receiving with wildcards from multiple threads concurrently requires care

2. Match a specific thread using the communicator:
– Split threads across different communicators (e.g. Dup and assign)
– Can use wildcards and collectives
– However, limits connectivity of threads with each other
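To make approach 1 concrete, here is a minimal sketch of partitioning the tag space per thread; the helper names and the 8-bit split are illustrative assumptions, not part of the slides, and it assumes MPI was initialized with MPI_THREAD_MULTIPLE.

#include <mpi.h>

#define TID_BITS 8                                   /* low bits carry the thread id */
#define MAKE_TAG(user_tag, tid) (((user_tag) << TID_BITS) | (tid))

/* Called by thread `tid`; receives only messages addressed to that thread,
 * because the sender encodes the destination thread id in the tag. */
static void thread_recv(int tid, void *buf, int count, int src, int user_tag)
{
    MPI_Recv(buf, count, MPI_BYTE, src, MAKE_TAG(user_tag, tid),
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Limitations noted above: MPI_ANY_TAG would defeat this scheme, and
     * collectives still see only one rank per process. */
}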

Page 4: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Impact of Light Cores and Threads on Message Rate

Shamelessly stolen from Brian Barrett, et al. [EuroMPI '13]

Threads sharing a rank increase the posted-receive queue depth (x-axis)

Solution: more ranks!
– Adding more MPI processes fragments the node
– Can't do shared memory programming across the whole node

Page 5: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Endpoints: Flexible Mapping of Ranks to Processes

Provide a many-to-one mapping of ranks to processes
– Allows threads to act as first-class participants in MPI operations
– Improves programmability of MPI + node-level and MPI + system-level models
– Potential for improving performance of hybrid MPI + X

A rank represents a communication "endpoint"
– Set of resources that supports the independent execution of MPI communications

Note: Figure demonstrates many usages, some may impact performance

[Figure: Endpoints communicator, processes hold one or more ranks, each rank driven by one or more threads]

Page 6: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Impact on MPI Implementations

Two basic implementation strategies, plus a combination:
1. Each rank is a distinct network endpoint
2. Ranks are multiplexed on endpoints
• Effectively adds the destination rank to the matching criteria
• Currently the rank is not included, because there is one per process
3. Combination of the above

Potential to reduce threading overheads
– Separate resources per thread
• A rank can represent distinct network resources
• Increase HFI/NIC concurrency
– Separate software state per thread
• Per-endpoint message queues/matching
– Split up progress across threads, increase progress-engine concurrency
– Enable per-communicator threading levels
• E.g. COMM_WORLD = THREAD_MULTIPLE, my_comm = THREAD_FUNNELED


Page 7: Enabling MPI Interoperability Through Flexible Communication  Endpoints


The Endpoints Programming Interface

Interface choices impact performance and usability

Key parameter: creation of endpoints

1. Static interface
• Endpoints fixed for the entire execution
• Pro: allows a simpler implementation
• Con: interface is restrictive, not usable with libraries
• Proposed for, but not included in, MPI 3.0

2. Dynamic interface
• Additional endpoints can be added dynamically
• Pro: more expressive interface
• Con: implementation is not as simple
• Proposed for MPI <next>

Association of endpoints with threads
– Explicit attach/detach or implicit
– Goal: avoid dependence on particular threading packages

Page 8: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Static Endpoint Creation

MPI_COMM_ENDPOINTS defined statically
– New MPI_INIT_ENDPOINTS function
– Launch with "mpiexec --num_ep XX"; requires calling Init once per endpoint, with num_ep supplied out of band (OOB)
• E.g. for (ep = 0; ep < my_num_ep; ep++) MPI_Init(…); (expanded in the sketch after this list)

Allows simple resource management
– Creation/freeing/mapping of network endpoints at startup/exit

Interface is inflexible

– Not easy for libraries and apps to both use static endpoints
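A minimal sketch of the static scheme described above, assuming my_num_ep is obtained out of band (e.g. from the hypothetical "mpiexec --num_ep" option); calling Init repeatedly is specific to this proposed interface, not standard MPI.

/* Static endpoints proposal: Init is called once per endpoint, with the
 * endpoint count (my_num_ep) supplied out of band by the launcher. */
for (int ep = 0; ep < my_num_ep; ep++)
    MPI_Init(&argc, &argv);
/* Afterwards, MPI_COMM_ENDPOINTS would hold one rank per endpoint, while
 * MPI_COMM_WORLD still holds one rank per process. */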

[Figure: MPI_COMM_WORLD with ranks 0–1 (one per process) and MPI_COMM_ENDPOINTS with ranks 0–4 spread across the processes' threads]

Page 9: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Dynamic Endpoint Creation

Endpoints communicator is created dynamically
– Through the new MPI_COMM_CREATE_ENDPOINTS operation

More expressive interface
– Allows libraries and apps equal access to endpoints

Dynamic resource management
– Endpoints are added/removed dynamically
– More sophisticated implementation required (Option #2 or #3)

[Figure: MPI_COMM_WORLD with ranks 0–1 (one per process) and my_ep_comm with ranks 0–4 spread across the processes' threads]

Page 10: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Representation of Endpoints (Static/Dynamic)

1. One handle: MPI_COMM_EP / my_ep_comm

– Single communicator handle given to the parent process
– How to identify the desired endpoint in MPI calls?
• Threads/processes must attach/detach prior to making an MPI call
• The endpoint I am using is cached in per-thread state
– Requires MPI to use thread-local storage (TLS)
– Adds a TLS lookup on the critical path for every operation

2. N handles: MPI_COMM_EP[MY_EP] / my_ep_comm[MY_EP]
– Multiple communicator handles, one per endpoint
– Attach/detach is not needed (but could be helpful)
– MPI does not need to use TLS
– Improves interoperability with threading packages

Page 11: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Putting It All Together: Proposed Interface

int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
                              MPI_Info info, MPI_Comm *out_comm_hdls[])

Each rank in parent_comm gets my_num_ep ranks in out_comm
– my_num_ep can be different at each process
– Rank order: process 0's ranks, process 1's ranks, etc.

Output is an array of communicator handles
– The ith handle corresponds to the ith endpoint created by the parent process
– To use that endpoint, use the corresponding handle (see the usage sketch below)
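A hedged usage sketch of the proposed call; NUM_EP, the OpenMP threading, and the commented-out per-endpoint work are illustrative assumptions, and the output argument is passed as a plain array of handles, matching the Fortran binding later in the deck.

#include <mpi.h>
#include <omp.h>

#define NUM_EP 4                        /* endpoints requested by this process */

void use_endpoints(void)
{
    MPI_Comm ep_comm[NUM_EP];

    /* Collective over the parent communicator: this process asks for NUM_EP
     * ranks in the new endpoints communicator. */
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, NUM_EP, MPI_INFO_NULL, ep_comm);

    #pragma omp parallel num_threads(NUM_EP)
    {
        int tid = omp_get_thread_num();
        int rank;

        /* Thread tid drives endpoint tid through its own handle. */
        MPI_Comm_rank(ep_comm[tid], &rank);
        /* ... per-endpoint communication using ep_comm[tid] ... */
    }
}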


Page 12: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Collectives and Endpoints

Endpoints have exactly the same semantics as MPI processes

Collective routines must be called by all ranks in the communicator concurrently
– MPI_THREAD_MULTIPLE required for collectives to be used with endpoints

Exception: freeing the communicator
– Want to avoid requiring MPI_THREAD_MULTIPLE
– Allow usages where endpoints are used with MPI_THREAD_FUNNELED
– The implementation must allow a single thread to free the communicator by calling MPI_COMM_FREE once per endpoint (see the sketch below)
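A minimal sketch of the exception, reusing the NUM_EP and ep_comm[] names from the creation sketch earlier: one thread releases the communicator by calling MPI_COMM_FREE once per endpoint, so MPI_THREAD_MULTIPLE is not required just to clean up.

/* A single thread frees the endpoints communicator serially: each call
 * counts as the MPI_COMM_FREE for the corresponding endpoint rank. */
for (int ep = 0; ep < NUM_EP; ep++)
    MPI_Comm_free(&ep_comm[ep]);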


Page 13: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Usage Models are Many…

Intranode parallel programming with MPI
– Spawn endpoints off MPI_COMM_SELF

Allow true thread multiple, with each thread addressable
– Spawn endpoints off MPI_COMM_WORLD

Obtain better performance
– Partition threads into groups and assign a rank to each group
– Performance benefits without partitioning the shared memory programming model

Interoperability
– Examples: OpenMP and UPC

Page 14: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Enabling OpenMP Threads in MPI Collectives

Hybrid MPI+OpenMP code

Endpoints are used to enable OpenMP threads to fully utilize MPI
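The code shown on the original slide is not reproduced in this transcript; the following is a hedged reconstruction of the kind of hybrid MPI+OpenMP program it describes, where NTHREADS and compute_partial() are placeholders and MPI_THREAD_MULTIPLE is assumed.

#include <mpi.h>
#include <omp.h>

#define NTHREADS 4

/* Placeholder for the per-thread computation on the original slide. */
static double compute_partial(int tid) { return (double)tid; }

int main(int argc, char **argv)
{
    int provided;
    MPI_Comm ep_comm[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* One endpoint rank per OpenMP thread, spawned off MPI_COMM_WORLD. */
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, NTHREADS, MPI_INFO_NULL, ep_comm);

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        double local = compute_partial(tid), global;

        /* Every endpoint rank, i.e. every thread, joins the collective. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, ep_comm[tid]);

        MPI_Comm_free(&ep_comm[tid]);   /* each thread frees its own handle */
    }

    MPI_Finalize();
    return 0;
}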

Page 15: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Enabling UPC+MPI Interoperability: User Code

UPC runtime may be using threads within the node

UPC compiler substitutes its own world communicator for MPI_COMM_WORLD
– Can use the PMPI interface, if needed

Compiler generates MPI calls needed to give a rank to each UPC thread
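The user code on the original slide is not captured in this transcript; the sketch below illustrates the idea under assumed names (user_allreduce, my_partial): the application writes ordinary MPI calls against MPI_COMM_WORLD, and each UPC thread expects to behave as its own MPI rank.

#include <mpi.h>

/* Runs on every UPC thread. The calls look like plain MPI on MPI_COMM_WORLD;
 * the UPC compiler/runtime redirects them to an endpoints-based world
 * communicator so each UPC thread really is a distinct rank. */
void user_allreduce(double my_partial, double *sum)
{
    int me;

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Allreduce(&my_partial, sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}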

Page 16: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Enabling UPC+MPI Interoperability: Generated Code
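The generated code on the original slide is likewise not captured; the sketch below shows, under assumed names (upcr_* helpers, threads_per_proc, tid), the kind of calls a UPC compiler/runtime might emit to give each UPC thread its own endpoint rank.

#include <stdlib.h>
#include <mpi.h>

static MPI_Comm *upcr_world_comm;       /* one handle per UPC thread in this process */

/* Called once per process by the UPC runtime during startup. */
void upcr_mpi_init(int threads_per_proc)
{
    upcr_world_comm = malloc(threads_per_proc * sizeof(MPI_Comm));

    /* Collective over the real MPI_COMM_WORLD: request one endpoint rank
     * for every UPC thread hosted by this process. */
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, threads_per_proc,
                              MPI_INFO_NULL, upcr_world_comm);
}

/* The compiler rewrites the user's MPI_COMM_WORLD references into this,
 * so each thread's MPI operations target its own endpoint. */
MPI_Comm upcr_world_comm_for_thread(int tid)
{
    return upcr_world_comm[tid];
}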

Page 17: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Flexible Computation Mapping

Ranks correspond to work units, e.g., mesh tiles

Data exchange between work units maps to communication between ranks

Periodic load balancing redistributes work (i.e. ranks)
– Communication is preserved, because it follows the ranks
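A hedged sketch of this usage, where num_local_tiles, MAX_TILES, and exchange_halos() are assumptions: each process requests one endpoint rank per mesh tile it currently owns, so halo exchange keeps addressing tiles by rank even after tiles move between processes.

#include <mpi.h>

#define MAX_TILES 64                    /* assumed upper bound on tiles per process */

static MPI_Comm tile_comm[MAX_TILES];

/* Placeholder for the per-tile halo exchange performed over the tile's rank. */
void exchange_halos(MPI_Comm comm);

/* Called after each load-balancing step; my_num_ep may differ per process
 * and per step, so the work communicator is rebuilt to follow the tiles. */
void rebuild_work_comm(int num_local_tiles)
{
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, num_local_tiles,
                              MPI_INFO_NULL, tile_comm);

    for (int t = 0; t < num_local_tiles; t++)
        exchange_halos(tile_comm[t]);
}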

[Figure: work units mapped to ranks in COMM_WORLD, work_comm, and balanced_comm across three MPI processes]

Page 18: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Thank you and Acknowledgements

We thank the many members of the MPI community and MPI forum who contributed to this work!

Review the formal proposal:
– https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380

Send comments to MPI Forum’s hybrid working group or [email protected]

Disclaimer: This presentation represents the views of the authors, and does not necessarily represent the views of Intel.

Page 19: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Endpoints Proposal, Prototype

int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
                              MPI_Info info, MPI_Comm *out_comm_hdls[])

MPI_Comm_create_endpoints(parent_comm, my_num_ep, info, out_comm_hdls, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: parent_comm
    INTEGER, INTENT(IN) :: my_num_ep
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(OUT) :: out_comm_hdls(my_num_ep)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_CREATE_ENDPOINTS(PARENT_COMM, MY_NUM_EP, INFO, OUT_COMM_HDLS, IERROR)
    INTEGER PARENT_COMM, MY_NUM_EP, INFO, OUT_COMM_HDLS(*), IERROR

Page 20: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Endpoints Proposal, Text Part 1

This function creates a new communicator from an existing communicator, parent_comm, where my_num_ep ranks in the output communicator are associated with a single calling rank in parent_comm. This function is collective on parent_comm. Distinct handles for each associated rank in the output communicator are returned in the new_comm_hdls array at the corresponding rank in parent_comm. Ranks associated with a process in parent_comm are numbered contiguously in the output communicator, and the starting rank is defined by the order of the associated rank in the parent communicator.

If parent_comm is an intracommunicator, this function returns a new intracommunicator new_comm with a communication group of size equal to the sum of the values of my_num_ep on all calling processes. No cached information propagates from parent_comm to new_comm. Each process in parent_comm must call MPI_COMM_CREATE_ENDPOINTS with a my_num_ep argument that ranges from 0 to the value of the MPI_COMM_MAX_ENDPOINTS attribute on parent_comm. Each process may specify a different value for the my_num_ep argument. When my_num_ep is 0, no output communicator is returned.

If parent_comm is an intercommunicator, then the output communicator is also an intercommunicator where the local group consists of endpoint ranks associated with ranks in the local group of parent_comm and the remote group consists of endpoint ranks associated with ranks in the remote group of parent_comm. If either the local or remote group is empty, MPI_COMM_NULL is returned in all entries of new_comm_hdls.

Page 21: Enabling MPI Interoperability Through Flexible Communication  Endpoints


Endpoints Proposal, Text Part 2

Ranks in new_comm behave as MPI processes. For example, a collective function on new_comm must be called concurrently on every rank in this communicator. An exception to this rule is made for MPI_COMM_FREE, which must be called for every rank in new_comm, but must permit a single thread to perform these calls serially.

Rationale: The concurrency exception for MPI_COMM_FREE is made to enable MPI_COMM_CREATE_ENDPOINTS to be used when the MPI library has not been initialized with MPI_THREAD_MULTIPLE, or when the threading package cannot satisfy the concurrency requirement for collective operations.

Advice to Users: Although threads can acquire individual ranks through the MPI_COMM_CREATE_ENDPOINTS function, they still share an instance of the MPI library. Users must ensure that the threading level with which MPI was initialized is maintained. Some operations, such as collective operations, cannot be used by multiple threads sharing an instance of the MPI library, unless MPI was initialized with MPI_THREAD_MULTIPLE.

Proposed New Error Classes
– MPI_ERR_ENDPOINTS -- The requested number of endpoints could not be provided.

Proposed New Info Keys
– same_num_ep -- All processes will provide the same my_num_ep argument to MPI_COMM_CREATE_ENDPOINTS.