Performance Oriented MPI

Jeffrey M. Squyres
Andrew Lumsdaine
NERSC/LBNL and U. Notre Dame
Overview

- Overview and history of MPI
- Performance oriented point to point
- Collectives, data types
- Diagnostics and tuning
- Rules of thumb and gotchas
Scope of This Talk

- Beginning to intermediate user
- General principles and rules of thumb
- When and where performance might be available
- Omit (advanced) low-level issues
Overview and History of MPI

- Library (not language) specification
- Goals:
  – Portability
  – Efficiency
  – Functionality (small and large)
- Safety (communicators)
- Conservative (current best practices)
Performance in MPI

- MPI includes many performance-oriented features
- These features are only *potentially* high-performance
- The standard seeks not to preclude performance; it does not mandate it
- Progress might only be made during MPI function calls (see the sketch below)
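A concrete consequence (a hedged sketch, not from the slides): if the library only makes progress inside MPI calls, a long stretch of computation after MPI_Isend() can stall the transfer until the final MPI_Wait(). Polling MPI_Test() between slices of work gives the library a chance to move bytes. Here sendbuf, count, dest, tag, and comm are assumed to be defined by the application, and do_work_chunk() and work_remains() are hypothetical application routines.

MPI_Request request;
MPI_Status status;
int flag = 0;

MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, comm, &request);
while (work_remains()) {
    do_work_chunk();                        /* one slice of real computation */
    if (!flag)
        MPI_Test(&request, &flag, &status); /* let MPI make progress on the send */
}
if (!flag)
    MPI_Wait(&request, &status);            /* complete the send if still pending */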
(Potential) Performance Features

- Non-blocking operations
- Persistent operations
- Collective operations
- MPI datatypes
Basic Point to Point

- "Six function MPI" includes MPI_Send() and MPI_Recv()
- These are useful, but there is more
Basic Point to Point

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
} else {
    MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
}
Non-Blocking Operations

- MPI_Isend()
- MPI_Irecv()
- "I" is for immediate
- Paired with MPI_Test()/MPI_Wait()
Non-Blocking Operations

MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
    /* Do some computation */
    MPI_Wait(&request, &status);
} else {
    MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
    /* Do some computation */
    MPI_Wait(&request, &status);
}
Persistent Operations

- MPI_Send_init()
- MPI_Recv_init()
- Creates a request but does not start it
- MPI_Start() begins the communication
- A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations

MPI_Comm_rank(comm, &rank);
if (rank == 0)
    MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
else
    MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);
/* … */
for (i = 0; i < n; i++) {
    MPI_Start(&request);
    /* Do some work */
    MPI_Wait(&request, &status);
}
Collective Operations

- May be layered on point to point
- May use tree communication patterns for efficiency
- Synchronization! (No non-blocking collectives)
Collective Operations

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

[Figure: a linear reduction takes O(P) steps; a tree-based reduction takes O(log P)]
MPI Datatypes

- May allow MPI to send a message directly from memory
- May avoid copying/packing (see the sketch below)
- (General) high-performance implementations not widely available

[Diagram: sending directly from memory to the network vs. through an intermediate copy]
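As an illustration (a sketch under textbook assumptions, not the slides' code): a strided column of a row-major matrix can be described once with MPI_Type_vector, so a capable implementation can send it without an explicit user-side pack. Here N, a (a double a[N][N]), j, dest, tag, and comm are assumed to be defined elsewhere.

MPI_Datatype column;
MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column); /* N blocks of 1 element, stride N */
MPI_Type_commit(&column);

MPI_Send(&a[0][j], 1, column, dest, tag, comm); /* column j, no user-side copy */
MPI_Type_free(&column);

Whether this beats a manual pack depends on the implementation, as the slide warns.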
Quiz: MPI_Send()

After I call MPI_Send()...
– The recipient has received the message?
– I have sent the message?
– I can write to the message buffer without corrupting the message?

Answer: I can write to the message buffer
Sidenote: MPI_Ssend()

- MPI_Ssend() has the (perhaps) expected semantics
- When MPI_Ssend() returns, the recipient has received the message
- Useful for debugging (replace MPI_Send() with MPI_Ssend(), as sketched below)
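One way to apply this tip across a whole code base (a sketch, not from the slides) is a compile-time substitution, placed after including mpi.h: any program that only runs because MPI_Send() happens to buffer internally will then hang visibly at the offending call.

#ifdef DEBUG_SYNC_SENDS
#define MPI_Send MPI_Ssend  /* force synchronous sends while debugging */
#endif

Remove the definition once the suspect send is found.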
Quiz: MPI_Isend()

After I call MPI_Isend()...
– The recipient has started to receive the message?
– I have started to send the message?
– I can write to the message buffer without corrupting the message?

Answer: None of the above (I must call MPI_Test() or MPI_Wait())
Quiz: MPI_Isend()

True or false: I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait().

Answer: False (in many/most cases)
Communication is Still Computation

A CPU, usually the main one, must do the communication work:
– Part of your process (inside MPI calls)
– Another process on the main CPU
– Another thread on the main CPU
– Another processor
No Free Lunch

- Part of your process (most common)
  – Fast, but no overlap
- Another process (daemons)
  – Overlap, but slow (extra copies)
- Another thread (rare)
  – Overlap and fast, but difficult
- Another processor (emerging)
  – Overlap and fast, but more hardware
  – E.g., Myri/gm, VIA
How Do I Get Performance?

- Minimize time spent communicating
  – Minimize data copies
- Minimize synchronization
  – I.e., time waiting for communication
Minimizing Communication Time

- Bandwidth
- Latency
Minimizing Latency

- Collect small messages together (if you can)
  – One 1024-byte message instead of 1024 one-byte messages (see the sketch below)
- Minimize other overhead (e.g., copying)
- Overlap with computation (if you can)
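A minimal sketch of the aggregation idea (values, dest, tag, and comm are hypothetical): copy the small pieces into one contiguous buffer and pay the per-message latency once.

#define NVALS 1024
char packbuf[NVALS];
int i;

for (i = 0; i < NVALS; i++)
    packbuf[i] = values[i];                          /* gather the small pieces */

MPI_Send(packbuf, NVALS, MPI_CHAR, dest, tag, comm); /* one message, one latency cost */

The local copy is far cheaper than 1023 extra message startups.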
Example: Domain Decomposition
Naïve Approach

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++)
        MPI_Send(…);
    for (i = 0; i < 4; i++)
        MPI_Recv(…);
}
Naïve Approach

- Deadlock! (*Maybe*)
- Can fix with careful coordination of receiving versus sending on alternate processes (see the sketch below)
- But this can still serialize
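One conventional version of that coordination (a sketch, not the slides' code, assuming neighboring ranks have opposite parity, as in a 1D or checkerboard-numbered decomposition; sbuf, rbuf, count, tag, and comm are hypothetical): even ranks send first, odd ranks receive first, so every blocking send meets a ready receive.

if (myrank % 2 == 0) {
    for (i = 0; i < 4; i++)
        MPI_Send(sbuf[i], count, MPI_REAL, neighbors[i], tag, comm);
    for (i = 0; i < 4; i++)
        MPI_Recv(rbuf[i], count, MPI_REAL, neighbors[i], tag, comm, &status);
} else {
    for (i = 0; i < 4; i++)
        MPI_Recv(rbuf[i], count, MPI_REAL, neighbors[i], tag, comm, &status);
    for (i = 0; i < 4; i++)
        MPI_Send(sbuf[i], count, MPI_REAL, neighbors[i], tag, comm);
}

As the slide notes, each exchange now completes pairwise in lock step, which serializes the four directions.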
MPI_Sendrecv()

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++) {
        MPI_Sendrecv(…);
    }
}
Immediate Operations

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++) {
        MPI_Isend(…);
        MPI_Irecv(…);
    }
    MPI_Waitall(…);
}
Receive Before Sending

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++)
        MPI_Irecv(…);
    for (i = 0; i < 4; i++)
        MPI_Isend(…);
    MPI_Waitall(…);
}
Persistent Operations

for (i = 0; i < 4; i++) {
    MPI_Recv_init(…);
    MPI_Send_init(…);
}

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    MPI_Startall(…);
    MPI_Waitall(…);
}
Overlapping

while (!done) {
    MPI_Startall(…);        /* Start exchanges */
    do_inner_red(D);        /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);     /* As information arrives */
        do_received_red(D); /* Process */
    }
    MPI_Startall(…);
    do_inner_black(D);
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);
        do_received_black(D);
    }
}
Advanced Overlap

MPI_Startall(…);                /* Start all receives */
/* … */
while (!done) {
    MPI_Startall(…);            /* Start sends */
    do_inner_red(D);            /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);         /* Wait on receives */
        if (received) {
            do_received_red(D); /* Process */
            MPI_Start(…);       /* Restart receive */
        }
    }
    /* Repeat for black */
}
MPI Data Types

- MPI_Type_vector
- MPI_Type_struct
- Etc.
- MPI_Pack might be better (see the sketch below)

[Diagram: sending directly from memory to the network vs. through an intermediate copy]
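A hedged sketch of the MPI_Pack alternative (BUFSIZE, n, x, dest, tag, and comm are hypothetical): pack a count and a double array into one buffer by hand and send it as MPI_PACKED. On implementations with a slow datatype engine, this explicit copy can be the faster path.

char buffer[BUFSIZE];
int position = 0;

MPI_Pack(&n, 1, MPI_INT, buffer, BUFSIZE, &position, comm);   /* header */
MPI_Pack(x, n, MPI_DOUBLE, buffer, BUFSIZE, &position, comm); /* payload */
MPI_Send(buffer, position, MPI_PACKED, dest, tag, comm);      /* one message */

The receiver unpacks with matching MPI_Unpack() calls.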
Minimizing Synchronization

- At a synchronization point (e.g., a collective communication), all processes must arrive at the collective call
- Can spend lots of time waiting
- This is often an algorithmic issue
  – E.g., check for convergence every 5 iterations instead of every iteration (see the sketch below)
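A sketch of that amortization (do_iteration(), local_residual(), TOL, D, and comm are hypothetical): the synchronizing MPI_Allreduce() runs once every CHECK_EVERY iterations instead of every iteration.

#define CHECK_EVERY 5
double local, global = TOL + 1.0; /* force at least one pass */
int iter;

for (iter = 0; global > TOL; iter++) {
    do_iteration(D);
    if (iter % CHECK_EVERY == CHECK_EVERY - 1) {  /* every 5th iteration */
        local = local_residual(D);
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
    }
}

The trade-off: up to CHECK_EVERY - 1 extra iterations after convergence, in exchange for far fewer global synchronizations.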
Gotchas

- MPI_Probe
  – Guarantees an extra memory copy
- MPI_ANY_SOURCE
  – Can cause additional (internal) looping
- MPI_Alltoall
  – All pairs must communicate
  – Synchronization (avoid in general)
Diagnostic Tools

- Totalview
- Prism
- Upshot
- XMPI
Summary

- Receive before sending
- Collect small messages together
- Overlap (if possible)
- Use immediate operations
- Use persistent operations
- Use diagnostic tools