Lecture 9: MPI continued
David Bindel
27 Sep 2011
Logistics
- Matrix multiply is done! Still have to run.
- Small HW 2 will be up before lecture on Thursday, due next Tuesday.
- Project 2 will be posted next Tuesday.
- Email me if interested in Sandia recruiting.
- Also email me if interested in MEng projects.
Previously on Parallel Programming
Can write a lot of MPI code with the six operations we've seen:
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv

... but there are sometimes better ways. Decide on communication style using simple performance models.
Communication performance
- Basic info: latency and bandwidth
- Simplest model: t_comm = α + βM
- More realistic: distinguish CPU overhead from "gap" (∼ inverse bandwidth)
- Different networks have different parameters
- Can tell a lot via a simple ping-pong experiment (see the sketch below)
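For illustration, here is a minimal ping-pong timing sketch, not from the slides: the message size, trial count, and variable names are my own choices. Ranks 0 and 1 bounce a message back and forth; the average one-way time for several sizes can be fit to the α + βM model.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank;
    char buf[16384];                 /* 16 KB message */
    const int ntrials = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ntrials; ++i) {
        if (rank == 0) {             /* send, then wait for the echo */
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {      /* echo the message back */
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                   /* half the round trip = one-way time */
        printf("Time/msg: %g us\n", (t1-t0)/(2.0*ntrials)*1e6);

    MPI_Finalize();
    return 0;
}
```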
OpenMPI on crocus
- Two quad-core chips per node, five nodes
- Heterogeneous network:
  - Crossbar switch between cores (?)
  - Bus between chips
  - Gigabit Ethernet between nodes
- Default process layout (16-process example):
  - Processes 0-3 on first chip, first node
  - Processes 4-7 on second chip, first node
  - Processes 8-11 on first chip, second node
  - Processes 12-15 on second chip, second node
- Test ping-pong from 0 to 1, 7, and 8.
Approximate α-β parameters (on node)
[Plot: time per message (microseconds) vs. message size (kilobytes); measured data and α-β model fits for ping-pong from rank 0 to ranks 1 and 7.]
α₁ ≈ 1.0 × 10⁻⁶, β₁ ≈ 5.7 × 10⁻¹⁰
α₂ ≈ 8.4 × 10⁻⁷, β₂ ≈ 6.8 × 10⁻¹⁰
Approximate α-β parameters (cross-node)
[Plot: time per message (microseconds) vs. message size (kilobytes); measured data and model fits for ping-pong from rank 0 to ranks 1, 7, and 8.]
α₃ ≈ 7.1 × 10⁻⁵, β₃ ≈ 9.7 × 10⁻⁹
Moral
Not all links are created equal!
- Might handle with a mixed paradigm
  - OpenMP on node, MPI across nodes
  - Have to worry about thread-safety of MPI calls
- Can handle purely within MPI
- Can ignore the issue completely?
For today, we’ll take the last approach.
Reminder: basic send and recv
```c
MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, status);
```

MPI_Send and MPI_Recv are blocking:
- Send does not return until data is in the system
- Recv does not return until data is ready
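A minimal sketch of the blocking pair between two ranks; the message value and tag are illustrative, and MPI is assumed to be initialized as usual.

```c
int rank, msg = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    msg = 42;
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
} else if (rank == 1) {
    MPI_Status status;
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("Rank 1 got %d\n", msg);
}
```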
Blocking and buffering
[Diagram: data flows P0 → OS → Net → OS → P1, copied through intermediate buffers along the way.]
Block until data “in system” — maybe in a buffer?
Blocking and buffering
[Diagram: data flows P0 → OS → Net → OS → P1 with no intermediate copies.]
Alternative: don’t copy, block until done.
Problem 1: Potential deadlock
[Diagram: both processes call Send and block.]

Both processors wait to finish the send before they can receive! May not happen if there is lots of buffering on both sides.
Solution 1: Alternating order
[Diagram: one process does Send then Recv while the other does Recv then Send.]
Could alternate who sends and who receives.
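A sketch of the alternating pattern based on rank parity; the buffer names, count, and `partner` rank are illustrative assumptions, not from the slides.

```c
/* Exchange with a neighbor: even ranks send first, odd ranks receive first.
   Assumes MPI is initialized and `partner` is a valid rank. */
if (rank % 2 == 0) {
    MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
} else {
    MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
}
```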
Solution 2: Combined send/recv
[Diagram: both processes call Sendrecv.]
Common operations deserve explicit support!
Combined sendrecv
```c
MPI_Sendrecv(sendbuf, sendcount, sendtype, dest,   sendtag,
             recvbuf, recvcount, recvtype, source, recvtag,
             comm, status);
```
Blocking operation, combines send and recv to avoid deadlock.
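For instance, a ring shift where every rank sends one value to the right and receives one from the left; a sketch under the assumption that `rank` and `nproc` come from the usual calls.

```c
/* Shift one double around a ring: send to rank+1, receive from rank-1. */
int right = (rank + 1) % nproc;
int left  = (rank + nproc - 1) % nproc;
double sendval = rank, recvval;
MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
             &recvval, 1, MPI_DOUBLE, left,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```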
Problem 2: Communication overhead
[Diagram: successive Sendrecv phases on both processes, with idle time spent waiting between them.]
Partial solution: nonblocking communication
Blocking vs non-blocking communication
- MPI_Send and MPI_Recv are blocking
  - Send does not return until data is in the system
  - Recv does not return until data is ready
  - Cons: possible deadlock, time wasted waiting
- Why blocking?
  - Overwrite the buffer during a send ⇒ evil!
  - Read the buffer before data is ready ⇒ evil!
- Alternative: nonblocking communication
  - Split into distinct initiation/completion phases
  - Initiate send/recv and promise not to touch the buffer
  - Check later for operation completion
Overlap communication and computation
[Diagram: each process starts its recv and send, computes without touching the buffers, then waits for the send and recv to end.]
Nonblocking operations
Initiate message:

```c
MPI_Isend(start, count, datatype, dest,   tag, comm, request);
MPI_Irecv(start, count, datatype, source, tag, comm, request);
```

Wait for message completion:

```c
MPI_Wait(request, status);
```

Test for message completion:

```c
MPI_Test(request, flag, status);
```
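A sketch of the overlap pattern from the previous slide using these calls; the buffer names, count, neighbor ranks, and `do_local_work()` are illustrative assumptions.

```c
/* Post the receive and send, compute on other data, then complete both.
   Assumes MPI is initialized and `left`/`right` are valid neighbor ranks. */
MPI_Request send_req, recv_req;
MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &recv_req);
MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &send_req);

do_local_work();   /* compute, but don't touch sendbuf or recvbuf */

MPI_Wait(&recv_req, MPI_STATUS_IGNORE);  /* recvbuf is now valid */
MPI_Wait(&send_req, MPI_STATUS_IGNORE);  /* sendbuf may be reused */
```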
Multiple outstanding requests
Sometimes useful to have multiple outstanding messages:
```c
MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);
```
Multiple versions of test as well.
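For example, a halo exchange with both neighbors at once, completed with a single MPI_Waitall; the buffer names and neighbor ranks are illustrative assumptions.

```c
/* Exchange with left and right neighbors, then wait on all four requests.
   Assumes MPI is initialized and the buffers are allocated. */
MPI_Request reqs[4];
MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recv_right, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(send_right, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
MPI_Isend(send_left,  n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
```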
Other send/recv variants
Other variants of MPI_Send:
- MPI_Ssend (synchronous): does not complete until the receive has begun
- MPI_Bsend (buffered): user provides a buffer via MPI_Buffer_attach (see the sketch below)
- MPI_Rsend (ready): user guarantees the receive has already been posted
- Can combine modes (e.g. MPI_Issend)

MPI_Recv receives anything.
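A minimal buffered-send sketch, assuming `<stdlib.h>` is included and `data`, `n`, and `dest` exist; the buffer size is an arbitrary illustration.

```c
/* Attach a user buffer; MPI_Bsend copies the message into it and returns. */
int bufsize = 1 << 20;
char* bsend_buf = malloc(bufsize + MPI_BSEND_OVERHEAD);
MPI_Buffer_attach(bsend_buf, bufsize + MPI_BSEND_OVERHEAD);

MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

/* Detach blocks until buffered messages have been delivered. */
MPI_Buffer_detach(&bsend_buf, &bufsize);
free(bsend_buf);
```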
Another approach
- Send/recv is one-to-one communication
- An alternative is one-to-many (and vice versa):
  - Broadcast to distribute data from one process
  - Reduce to combine data from all processors
  - Operations are called by all processes in the communicator
Broadcast and reduce
```c
MPI_Bcast(buffer, count, datatype, root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
```

- buffer is copied from root to the others
- recvbuf receives the result only at root
- op ∈ { MPI_MAX, MPI_SUM, ... }
Example: basic Monte Carlo

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT,
              0, MPI_COMM_WORLD);
    run_trials(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}
```
Example: basic Monte Carlo
Let sum[0] = Σᵢ Xᵢ and sum[1] = Σᵢ Xᵢ².

```c
void run_mc(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments, accumulating my_sums ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX2 - EX*EX)/N));  /* needs <math.h> */
    }
}
```
Collective operations
- Involve all processes in the communicator
- Basic classes:
  - Synchronization (e.g. barrier)
  - Data movement (e.g. broadcast)
  - Computation (e.g. reduce)
Barrier
MPI_Barrier(comm);
Not much more to say. Not needed that often.
Broadcast
[Diagram: before the broadcast only P0 holds A; afterwards P0-P3 all hold A.]
Scatter/gather
[Diagram: scatter distributes A, B, C, D from P0 so that each of P0-P3 gets one piece; gather is the reverse, collecting the pieces back onto P0.]
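The diagram corresponds to calls like the following sketch; the counts, buffer names, and the `process()` helper are illustrative assumptions.

```c
/* Root scatters one double to each rank, everyone computes locally,
   then the root gathers one result back per rank.
   senddata/recvdata have nproc entries and need only exist on root. */
double chunk, result;
MPI_Scatter(senddata, 1, MPI_DOUBLE,
            &chunk,   1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
result = process(chunk);               /* hypothetical local work */
MPI_Gather(&result,  1, MPI_DOUBLE,
           recvdata, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
```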
Allgather
[Diagram: before, each process holds only its own item (A, B, C, or D); after the allgather, every process holds A, B, C, D.]
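In code, each rank contributes one value and every rank ends up with the full array; a sketch assuming `<stdlib.h>` and a hypothetical `compute_local_value()`.

```c
/* Each rank contributes one double; afterwards all[] on every rank
   holds the contributions from ranks 0..nproc-1 in rank order. */
double mine = compute_local_value();
double* all = malloc(nproc * sizeof(double));
MPI_Allgather(&mine, 1, MPI_DOUBLE,
              all,   1, MPI_DOUBLE, MPI_COMM_WORLD);
```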
Alltoall
[Diagram: before, each process holds one row (P0 holds A0-A3, P1 holds B0-B3, ...); after the alltoall, each process holds one column (P0 holds A0, B0, C0, D0, ...), i.e. a transpose of the data across processes.]
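The corresponding call, with one item going from every rank to every other rank; a sketch with illustrative buffer names.

```c
/* sendbuf[j] goes to rank j; recvbuf[i] arrives from rank i.
   Both arrays have nproc entries on every rank. */
MPI_Alltoall(sendbuf, 1, MPI_DOUBLE,
             recvbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);
```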
Reduce
[Diagram: P0-P3 hold A, B, C, D; after the reduce, P0 holds the combined result ABCD.]
Scan
[Diagram: P0-P3 hold A, B, C, D; after the scan, P0 holds A, P1 holds AB, P2 holds ABC, P3 holds ABCD (an inclusive prefix reduction).]
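For example, an inclusive prefix sum across ranks; a sketch assuming `my_value` holds the local contribution.

```c
/* Inclusive prefix sum: on rank r, total = x(0) + x(1) + ... + x(r). */
double x = my_value, total;
MPI_Scan(&x, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
```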
The kitchen sink
- In addition to the above, there are vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, ...
- MPI-3 adds one-sided communication (put/get)
- MPI is not a small library!
- But a small number of calls goes a long way:
  - Init/Finalize
  - Comm_rank, Comm_size
  - Send/Recv variants and Wait
  - Allreduce, Allgather, Bcast