Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan
Authors
Ernie Chan Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
William Gropp Rajeev Thakur
Mathematics and Computer Science Division
Argonne National Laboratory
Testbed Architecture
IBM Blue Gene/L
  3D torus point-to-point interconnect network
  One rack
    1024 dual-processor nodes
    Two 8 x 8 x 8 midplanes
  Special feature to send simultaneously
    Use multiple calls to MPI_Isend
Outline
  Testbed Architecture
  Model of Parallel Computation
  Sending Simultaneously
  Collective Communication
  Generalized Algorithms
  Performance Results
  Conclusion
Model of Parallel Computation
Target Architectures
  Distributed-memory parallel architectures
Indexing
  p computational nodes
  Indexed 0 … p - 1
Logically Fully Connected
  A node can send directly to any other node
Topology
  N-dimensional torus
[Figure: torus of 16 nodes, numbered 0 to 15]
Old Model of Communicating Between Nodes
  Unidirectional sending or receiving
  Simultaneous sending and receiving
  Bidirectional exchange
Communicating Between Nodes
  A node can send to or receive from 2N other nodes simultaneously along its 2N different links
  Cannot perform bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes
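The 2N links per node can be made concrete with a small sketch. This is an illustration only, not code from the talk: the function name `torus_neighbors` and the coordinate-tuple representation are assumptions, and every extent is assumed to be at least 3 so that the two neighbors along each axis are distinct.

```python
def torus_neighbors(coord, dims):
    """Return the 2N neighbors of `coord` on an N-dimensional torus.

    `coord` is a tuple of N coordinates and `dims` a tuple of N extents.
    Both names are hypothetical; they are not part of any MPI or Blue Gene API.
    """
    if len(coord) != len(dims):
        raise ValueError("coord and dims must have the same length")
    neighbors = []
    for axis, extent in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            # Wrap around at the edges: this is what makes the mesh a torus.
            shifted[axis] = (coord[axis] + step) % extent
            neighbors.append(tuple(shifted))
    return neighbors


# One 8 x 8 x 8 midplane: N = 3, so each node has 2N = 6 links.
midplane_links = torus_neighbors((0, 0, 0), (8, 8, 8))
```

For the 8 x 8 x 8 midplane described in the testbed, the call above yields six neighbors, one per link of the corner node.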
Cost of Communication
α + nβ
α: startup time (latency)
n: number of bytes to communicate
β: transmission time per byte (inverse of bandwidth)
Sending Simultaneously
Old Cost of Communication with Sends to Multiple Nodes
  Cost to send to m separate nodes
(α + nβ) m
New Cost of Communication with Simultaneous Sends
(α + nβ) m
can be rewritten as
(α + nβ) + (α + nβ) (m - 1)
and, with simultaneous sends, replaced with
(α + nβ) + (α + nβ) (m - 1) τ
  First term: cost of one send
  Second term: cost of extra sends
  0 ≤ τ ≤ 1
Benchmarking Sending Simultaneously
  Logarithmic-logarithmic timing graphs
  Midplane – 512 nodes
  Sending simultaneously with 1 – 6 neighbors
  8 bytes – 4 MB
Cost of Communication with Simultaneous Sends
(α + nβ) (1 + (m - 1) τ)
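The cost expression above can be evaluated directly, and it can also be inverted to estimate τ from two timings. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the numeric values are made up for illustration, and none of them are Blue Gene/L measurements.

```python
def send_cost(n, alpha, beta):
    """Cost of one point-to-point send of n bytes: alpha + n * beta."""
    return alpha + n * beta


def simultaneous_send_cost(n, m, alpha, beta, tau):
    """Cost of sending n bytes to m nodes: (alpha + n*beta) * (1 + (m - 1)*tau)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return send_cost(n, alpha, beta) * (1 + (m - 1) * tau)


def estimate_tau(t_one, t_m, m):
    """Solve t_m = t_one * (1 + (m - 1) * tau) for tau, given two timings."""
    if m < 2:
        raise ValueError("need at least two simultaneous sends to estimate tau")
    return (t_m / t_one - 1) / (m - 1)


# Illustrative parameters only (arbitrary units), not measured values.
ALPHA, BETA, N_BYTES, M = 2.0, 0.5, 8, 4
one_send = send_cost(N_BYTES, ALPHA, BETA)
four_sends = simultaneous_send_cost(N_BYTES, M, ALPHA, BETA, tau=0.25)
```

With τ = 1 the model reduces to the old cost (α + nβ) m, and with τ = 0 the extra sends are free, which gives two quick sanity checks on any implementation of the formula.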
Collective Communication
Broadcast (Bcast)
  Motivating example
[Figure: Before and After panels]
Generalized Algorithms
Short-Vector Algorithms
  Minimum-Spanning Tree
Long-Vector Algorithms
  Bucket Algorithm
Minimum-Spanning Tree
  Divide p nodes into N+1 partitions
  Disjointed partitions on N-dimensional mesh
  Divide dimensions by a decrementing counter from N+1
  Now divide into 2N+1 partitions
[Figure: mesh of 16 nodes, numbered 0 to 15]
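The effect of dividing into more partitions per step can be sketched with a small simulation. This is not the authors' implementation: it assumes contiguous partitions over the linear index range 0 … p - 1, ignores the mapping of partitions onto torus dimensions and any link contention, and uses the hypothetical name `mst_broadcast_rounds`. With `parts=2` it behaves like the classic minimum-spanning-tree broadcast, and larger values of `parts` model a root sending to several partitions simultaneously.

```python
def mst_broadcast_rounds(p, parts, root=0):
    """Simulate a generalized MST broadcast over nodes 0..p-1.

    In each round, every node that holds the data splits its index range
    into up to `parts` contiguous partitions and sends to one node in each
    partition it does not belong to. Returns (rounds, received_nodes).
    """
    if p < 1 or parts < 2 or not 0 <= root < p:
        raise ValueError("need p >= 1, parts >= 2, and 0 <= root < p")
    received = {root}
    # Each entry is (lo, hi, holder): `holder` has the data for range [lo, hi).
    active = [(0, p, root)]
    rounds = 0
    while any(hi - lo > 1 for lo, hi, _ in active):
        next_active = []
        for lo, hi, holder in active:
            size = hi - lo
            if size == 1:
                continue  # this range is finished
            k = min(parts, size)
            base, extra = divmod(size, k)
            start = lo
            for i in range(k):
                end = start + base + (1 if i < extra else 0)
                if start <= holder < end:
                    new_holder = holder  # holder keeps its own partition
                else:
                    new_holder = start   # holder sends to this partition
                    received.add(new_holder)
                next_active.append((start, end, new_holder))
                start = end
        active = next_active
        rounds += 1
    return rounds, received


# 16 nodes on a 2D torus (N = 2): compare 2, N+1 = 3, and 2N+1 = 5 partitions.
rounds_by_parts = {k: mst_broadcast_rounds(16, k)[0] for k in (2, 3, 5)}
```

Under these assumptions the number of rounds shrinks as the partition count grows, which is the motivation for moving from 2 partitions to N+1 and then 2N+1; the actual time saved depends on τ in the cost model.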
Performance Results
Single point-to-point communication
my-bcast-MST
Conclusion
IBM Blue Gene/L supports functionality of sending simultaneously
  Benchmarking along with model checking verifies this claim
New generalized algorithms show clear performance gains
Future Directions
  Room for optimization to reduce implementation overhead
  What if not using MPI_COMM_WORLD?
  Possible new algorithm for Bucket Algorithm
Questions? [email protected]