Workshop on HPC in India
Programming Models, Languages, and Compilation for Accelerator-Based Architectures
R. Govindarajan
SERC, IISc
govind@serc.iisc.ernet.in
ATIP 1st Workshop on HPC in India @ SC-09
Current Trend in HPC Systems
• Top500 systems have hundreds of thousands (100,000s) of cores
  • Large HPC systems: performance scaling is a major challenge
• The number of cores in a processor/node is increasing!
  • 4-6 cores per processor, 16-24 cores per node!
  • Parallelism even at the node level
• Top systems use accelerators: GPUs and Cell BEs
  • 1000s of cores/processing elements in a single GPU!
HPC Design Using Accelerators
• High level of performance from accelerators
• Variety of general-purpose hardware accelerators
  • GPUs: nVidia, ATI
  • Accelerators: Clearspeed, Cell BE, …
  • Plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
  • Exploit instruction-level parallelism
  • Exploit data-level parallelism on SIMD units
  • Exploit thread-level parallelism on multiple units/multi-cores
• Challenges
  • Portability across different generations and platforms
  • Ability to exploit different types of parallelism
Accelerators – Cell BE
Accelerators - 8800 GPU
The Challenge
[Figure: the many programming interfaces to target: SSE, CUDA, OpenCL, ARM Neon, AltiVec, AMD CAL]
Programming in Accelerator-Based Architectures
• Develop a framework that
  • is programmed in a higher-level language, and is efficient
  • can exploit different types of parallelism on different hardware
  • exploits parallelism across heterogeneous functional units
  • is portable across platforms, not device specific!
Existing Approaches
• C/C++ → Autovectorizer → SSE/AltiVec → CPU
• CUDA/OpenCL → Compiler (nvcc/JIT) → PTX/ATI CAL IL → CPU, GPUs
• Brook → Brook Compiler → ATI CAL IL → CPU, GPUs
Existing Approaches (contd.)
• StreamIt → StreamIt Compiler → Cell BE, RAW
• Accelerator → Std. Compiler + DirectX + Runtime → CPU, GPUs
• OpenMP → Std. Compiler → CPU, GPUs
What is needed?
Synergistic Execution on Multiple Heterogeneous Cores
[Figure: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), and Parallel Lang. feed a Compiler/Runtime System that targets Cell BE, Other Accel., Multicores, GPUs, and SSE]
What is needed?
Synergistic Execution on Multiple Heterogeneous Cores
[Figure: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), and Parallel Lang. are lowered to PLASMA: High-Level IR; a Compiler and Runtime System then target Cell BE, Other Accel., Multicores, GPUs, and SSE]
Stream Programming Model
• Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relation) between them
• Exposes pipelined parallelism and task-level parallelism
• Temporal streaming of data
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and Cell BE
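In an SDF graph, each node produces and consumes a fixed number of items per firing, so a static schedule starts from firing counts that balance every channel. A minimal sketch of that balance computation for a simple pipeline, assuming hypothetical rates (the function name and the example rates are illustrative and do not come from the talk):

```python
from fractions import Fraction
from math import lcm


def steady_state_repetitions(rates):
    """Firing counts that balance every channel of a linear pipeline.

    rates[i] = (push, pop): filter i pushes `push` items per firing onto the
    channel to filter i+1, which pops `pop` items per firing.
    Returns the smallest positive integer firing count for each filter.
    """
    reps = [Fraction(1)]
    for push, pop in rates:
        # Balance equation for one channel: reps[i] * push == reps[i+1] * pop
        reps.append(reps[-1] * push / pop)
    # Scale the rational solution to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]


# Hypothetical three-filter pipeline A -> B -> C:
# A pushes 2 and B pops 3; B pushes 1 and C pops 2.
print(steady_state_repetitions([(2, 3), (1, 2)]))  # [3, 2, 1]
```

With these counts A produces 6 items that B consumes in 2 firings, and B produces 2 items that C consumes in 1 firing, so buffer occupancy returns to its starting point after each steady-state iteration.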
The StreamIt Language
StreamIt programs are a hierarchical composition of three basic constructs:
• Pipeline
• SplitJoin
  • Round-robin or duplicate splitter
• Feedback Loop
Stateful filters
Peek values
[Figure: Pipeline (Filter, Filter, ..., Filter); SplitJoin (Splitter, Stream, Stream, Joiner); Feedback Loop (Joiner, Body, Splitter, Loop)]
Why StreamIt on GPUs
• More “natural” than frameworks like CUDA or CTM
• Easier learning curve than CUDA
  • No need to think of “threads” or blocks
• StreamIt programs are easier to verify
• Schedule can be determined statically
Issues on Mapping StreamIt for GPUs
• Work distribution across multiprocessors
  • GPUs have hundreds of processing pipes!
  • Exploit task-level and data-level parallelism
  • Schedule across the multiprocessors
  • Multiple concurrent threads in SM to exploit DLP
• Execution configuration: task granularity and concurrency
• Lack of synchronization between the processors of the GPU
• Managing CPU-GPU memory bandwidth
Stream Graph Execution
[Figure: a stream graph with filters A, B, C, D, and its software-pipelined execution on SM1 to SM4 with instances A1-A4, B1-B4, C1-C4, D1-D4 over time steps 0-7; annotations: Pipeline Parallelism, Task Parallelism, Data Parallelism]
Our Approach
Our Approach for GPUs
Code for SAXPY:
float->float filter saxpy
{
    float a = 2.5f;
    work pop 2 push 1 {
        float x = pop();
        float y = pop();
        float s = a * x + y;
        push(s);
    }
}
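The filter above declares its rates statically: each firing of `work` pops 2 items and pushes 1. A rough Python sketch of that firing behaviour, assuming the input and output channels can be modelled as simple queues (`run_saxpy_filter` is a hypothetical helper, not part of StreamIt):

```python
from collections import deque


def run_saxpy_filter(input_items, a=2.5):
    """Fire the filter as long as a full pop window (2 items) is available."""
    channel_in = deque(input_items)
    channel_out = []
    while len(channel_in) >= 2:        # a firing needs 2 items (pop rate 2)
        x = channel_in.popleft()       # first pop()
        y = channel_in.popleft()       # second pop()
        channel_out.append(a * x + y)  # push(s): one item out (push rate 1)
    return channel_out


print(run_saxpy_filter([1.0, 2.0, 3.0, 4.0]))  # [4.5, 11.5]
```

Because the filter keeps no state between firings, separate firings are independent of each other, which is what allows many instances to run as concurrent GPU threads.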
Our Approach (contd.)
• Multithreading
  • Identify a good execution configuration to exploit the right amount of data parallelism
• Memory
  • Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Task partition between GPU and CPU cores
  • Work scheduling and processor (SM) assignment problem
  • Takes into account communication bandwidth restrictions
Execution Configuration
[Figure: macro nodes formed from instances A0, A1, ..., A127 and B0, B1, ..., B127 (B shown twice); labels: Exec. Time of Macro Node = 32, Exec. Time of Macro Node = 16]
• Total Exec. Time on 2 SMs = MII = 64/2 = 32
• More threads for exploiting data-level parallelism
Coalesced Memory Accessing
• GPUs have a banked memory architecture with a very wide memory channel
• Accesses by threads in an SM have to be coalesced
[Figure: two buffer layouts over banks B0-B3, accessed by thread0-thread3. First layout: d0 d1 d2 d3 d4 d5 d6 d7. Second layout: d0 d2 d4 d6 d1 d3 d5 d7]
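The two layouts in the figure can be related by a simple index transformation. A minimal sketch, assuming 4 threads that each consume 2 consecutive items (so thread0 owns d0, d1 and thread1 owns d2, d3); the function name and that ownership assumption are illustrative, not taken from the talk:

```python
def to_coalesced_layout(buffer, num_threads, items_per_thread):
    """Reorder a buffer so the j-th item of every thread is stored contiguously.

    Assumed original layout: thread t owns
    buffer[t * items_per_thread : (t + 1) * items_per_thread].
    """
    assert len(buffer) == num_threads * items_per_thread
    coalesced = [None] * len(buffer)
    for t in range(num_threads):
        for j in range(items_per_thread):
            # After the transform, the threads' j-th accesses hit adjacent slots.
            coalesced[j * num_threads + t] = buffer[t * items_per_thread + j]
    return coalesced


data = [f"d{i}" for i in range(8)]
print(to_coalesced_layout(data, num_threads=4, items_per_thread=2))
# ['d0', 'd2', 'd4', 'd6', 'd1', 'd3', 'd5', 'd7']
```

In the transformed buffer, the first access of thread0 to thread3 touches d0, d2, d4, d6 in adjacent locations, and the second access touches d1, d3, d5, d7, which matches the second layout in the figure.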
Execution on CPU and GPU
• Problem: partition work across CPU and GPU
  • Data transfer between GPU and host memory is required based on the partition!
  • Coalesced access is efficient for GPU, but harmful for CPU!
  • Transform data before moving it from/to GPU memory
• Reduce the overall execution time, taking into account memory transfer and transform delays!
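Scheduling and Mapping
[Figure: Initial StreamIt Graph and Partitioned Graph]
• Initial StreamIt Graph: B (CPU:10, GPU:20), A (CPU:20), C (CPU:80, GPU:20), D (CPU:15, GPU:10), E (CPU:10, GPU:25); edge weights 20, 10, 10, 60
• Partitioned Graph: B (GPU:20), A (CPU:20), C (GPU:20), D (CPU:15), E (CPU:10); edge weights 20, 10, 10
• CPU Load: 45, GPU Load: 40, DMA Load: 40, MII: 45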
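The loads on this slide can be reproduced from the partitioned graph. A small sketch under three assumptions that the slide does not state explicitly: the node labels pair with the costs in the order listed, the DMA load is the sum of the edge weights shown in the partitioned graph, and the MII is the largest per-resource load (all three agree with the slide's figures):

```python
# Assumed pairing of filters with (device, cost), read in listing order.
partition = {
    "A": ("CPU", 20),
    "B": ("GPU", 20),
    "C": ("GPU", 20),
    "D": ("CPU", 15),
    "E": ("CPU", 10),
}
# Edge weights listed with the partitioned graph, assumed to cross the DMA channel.
cut_edge_weights = [20, 10, 10]


def resource_loads(partition, cut_edge_weights):
    """Sum the work assigned to each resource for one steady-state iteration."""
    loads = {"CPU": 0, "GPU": 0, "DMA": sum(cut_edge_weights)}
    for device, cost in partition.values():
        loads[device] += cost
    return loads


loads = resource_loads(partition, cut_edge_weights)
mii = max(loads.values())  # the busiest resource bounds the initiation interval
print(loads, mii)  # {'CPU': 45, 'GPU': 40, 'DMA': 40} 45
```

Moving a filter between devices changes both the compute loads and the DMA load, which is why the partitioning has to account for transfer cost and not only execution time.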
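Scheduling and Mapping
[Figure: software-pipelined schedule across CPU, DMA Channel, and GPU for the partitioned graph (B GPU:20, A CPU:20, C GPU:20, D CPU:15, E CPU:10; edge weights 20, 10, 10), overlapping instances An, An-1, Bn-1, Bn-2, Bn-3, Cn-3, Cn-4, Cn-5, Dn-5, Dn-6, En-7]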
Compiler Framework
[Figure: compiler flow: StreamIt Program → Generate Code for Profiling → Execute Profile Runs → Configuration Selection → Task Partitioning (ILP Partitioner or Heuristic Partitioner) → Instance Partitioning → Modulo Scheduling → Code Generation → CUDA Code + C Code]
Experimental Results on Tesla
• Significant speedup for synergistic execution
[Figure: speedup chart with annotations > 52x, > 32x, > 65x]
What is needed?
Synergistic Execution on Multiple Heterogeneous Cores
[Figure: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), and Parallel Lang. are lowered to PLASMA: High-Level IR; a Compiler and Runtime System then target Cell BE, Other Accel., Multicores, GPUs, and SSE]
IR: What should a solution provide?
• Rich abstractions for functionality
• Independence from any single architecture
• Portability without compromises on efficiency
• Scale-up and scale-down
  • Single-core embedded processor to multi-core workstation
• Take advantage of accelerators (GPU, Cell, …)
• Transparent distributed memory
PLASMA: Portable Programming for PLASTIC SIMD Accelerators
PLASMA IR
Matrix-Vector Multiply
[Figure: operator graph with Slice, Par Mul, and Reduce Add over inputs M and V]
par mul, temp, A[i*n : i*n+n : 1], X
reduce add, Y[i : i+1 : 1], temp
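The two IR statements compute one row of the matrix-vector product: an element-wise multiply of a row slice with the vector, followed by a sum-reduction into one output element. A plain-Python sketch of that reading, assuming the slice notation means start : stop : step, that A is stored flat in row-major order, and that an enclosing loop over i is implied (the helper names are illustrative and are not PLASMA APIs):

```python
def par_mul(a, b):
    """Element-wise multiply, standing in for the 'par mul' operation."""
    return [x * y for x, y in zip(a, b)]


def reduce_add(values):
    """Sum-reduction, standing in for the 'reduce add' operation."""
    total = 0
    for v in values:
        total += v
    return total


def matvec(A, X, n):
    """A is a flat row-major list with n columns; returns Y = A * X."""
    rows = len(A) // n
    Y = [0] * rows
    for i in range(rows):
        # par mul, temp, A[i*n : i*n+n : 1], X
        temp = par_mul(A[i * n : i * n + n : 1], X)
        # reduce add, Y[i : i+1 : 1], temp
        Y[i : i + 1 : 1] = [reduce_add(temp)]
    return Y


print(matvec([1, 2, 3, 4, 5, 6], [1, 0, 2], n=3))  # [7, 16]
```

Neither operation fixes a vector width, so the same two statements could in principle be lowered to scalar C, SSE, or CUDA.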
Our Framework
• “CPLASM”, a prototype high-level assembly language
• Prototype PLASMA IR compiler
• Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
• Future targets: Cell, ATI, ARM Neon, ...
• Compiler optimizations for this “Vector” IR
Our Framework (contd.)
Plenty of optimization opportunities!
PLASMA IR Performance
Normalized execution time is comparable to that of a hand-tuned library!
Ongoing Work
Synergistic Execution on Multiple Heterogeneous Cores
[Figure: Streaming Lang., MPI, OpenMP, CUDA/OpenCL, Array Lang. (Matlab), and Parallel Lang. are lowered to PLASMA: High-Level IR; a Compiler and Runtime System then target Cell BE, Other Accel., Multicores, GPUs, and SSE]
• Look at other high-level languages!
• Target other accelerators
Compiling OpenMP/MPI/X10
• Mapping the semantics
• Exploiting data parallelism and task parallelism
• Communication and synchronization across CPU/GPU/multiple nodes
• Accelerator-specific optimization
  • Memory layout, memory transfer, …
• Performance and scaling
Acknowledgements
• My students!
• IISc and SERC
• Microsoft and Nvidia
• ATIP, NSF, all sponsors
• ONR
Thank You !!