Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution February 2012 Sriram Krishnamoorthy Wenjing Ma, Oreste Villa, Karol Kowalski, Gagan Agrawal
Deep Dive into GPU Programming
Experience with Optimizing Tensor Contractions How to think about programming GPUs
CUDA, OpenCL, … Lessons Learned Computer scientist perspective
2
NWChem-TCE
3
Phase Complexity Hartree–Fock N^4 4-index transformation N^5 CCSD - Iterative no^2 * nu^4 CCSD(T) – Not iterative no^3 * nu^4
N=no+nu
Com
puta
tiona
l Com
plex
ity
Tensor Contraction Expressions – CCSD(T)
4
Matrix Multiplication Formulation
5
Direct GPU Implementation (Tesla T10)
6
Steps
Manage GPU memory Transfer data between CPU and GPU Map work to thread blocks and threads Decode mapping in each thread Execute concurrently … Party!
7
Host Code Aspects
8
Avoid frequent allocation/deallocation (pool allocator)
CPU does work common to all threads (dimension offset for C arrays)
Parallelization Across Thread Blocks
9
Tile one dimension of each tensor
16x16 work for each thread block
Decode work to be done on each thread
Baseline Host Implementation
10
Copy inputs, invoke kernel, copy output
Kernel Code (I)
11
Set things up for each thread
Kernel Code (II)
12
Execute the loop
Index Combining
13
Index decoding is expensive Duplicated across threads Division and modulo arithmetic
Dimension Flattening
14
Maximize full thread blocks
Dimension Flattening
15
Baseline approach works well only when index dimensions are multiple of the thread block configuration. We flatten the loops and we recreate them of the “right” size, with some index operations increase of index operations but better utilization
Pipelining
16
Overlap data transfer and computation
Pipelining
We can avoid the O(N6) copy IN and just have the copy OUT and then accumulate on the host We can create “streams” of kernels using one loop dimension and asynchronously copy out partial data
14 ms
3.5 3.5 3.5 3.5
18 ms
4.5 4.5 4.5 4.5
16 ms
Copy Data Out
Execute
Host Accumulates
Copy Data Out
Execute
Host Accumulates 4 4 4 4
16 ms 18 ms Copy Data In
Execute + acc Copy Data Out 18 ms
“Default”
“Host Support”
“Host Support and streams”
Pipeline of GPU / PCI express/ Host
17
CPU-GPU Hybrid Execution
18
Host has more many cores – use them
What did we do?
Optimized memory allocation
Offload redundant calculations to CPUs
Index combining to reduce overhead
Flatten to maximize full thread blocks
Pipeline data transfer with computation
19
Single Tile Performance
20
Tim
e (m
s)
Optimizations help Greater improvements for irregular sizes
CCSD(T) - Results for 60 nodes GPU and CPU – Tesla C1060
21
0
20
40
60
80
100
120 113
5
28
56
17 13 Min
utes
to C
ompl
etio
n
Double precision calculations for the chromophore of green fluorescent protein (GFP) with 284 and 476 basis functions. In all calculations core electrons were not correlated.
1 core 2 cores 4 cores (1 socket)
8 cores (2 sockets)
1 GPU 2 GPUs + 6 cores
Two level Hybrid execution GPUs share the PCI bus
One level Hybrid execution GPU performs contraction and CPU accumulates
Fermi GPU
22
Fermi GPU Features
More FLOPS for the same PCIe bandwidth
Larger register file size
Larger shared memory
Cache
Bi-direction PCIe transfer
23
Performance Estimate
24
Minimize PCIe Data Transfer
25
Keep intermediates in memory
Register Tiling
Just painful, but mechanical (see reference) Improves data reuse
26
Index Calculation to Favor Output
27
Access to input arrays optimized by cache Global memory coalescing for outputs
Reversed Evaluation of Conditionals
28
Minimize count of executed conditionals
What did we do?
Register tiling Reduce overall work
Index calculation order Favor memory access coalescing for output
Reverse execution of conditionals Minimizing thread divergence
29
Evaluation
30
Baseline: single core Not comparing with CPU, just GPU versions Tesla codes run faster on Fermi Fermi specific optimizations further improve performance More results in references
Different versions result in different performance
Complicate compiler models
Iterative Coupled Cluster
Just copy-paste? Not really
Cannot eliminate data movement as in CCSD(T)
Varied tensor contractions Often limited FLOP count
…
Back at the drawing board
31
Observations
GPU programming is hard Bad news: vectorization is hard
Even for compilers (PLDI’04 – memory copy; PACT’06 – matrix transpose)
Non-obvious performance implications Let the compiler do the most for you
They are getting better Knowing how to vectorize the code goes a long way
Get the best out of a compiler Isolate performance critical code
You will rewrite it many times Keeps relationship good between you and computer scientists
32
References
W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution”. Cluster Computing Special Issue, 2011 W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “GPU-based implementations of the non- iterative regularized-CCSD(T) corrections: applications to strongly correlated systems”. Journal of Chemical Theory and Computation vol:7(5) pp:1316-1327, 2011 W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. “Acceleration of Streamed Tensor Contraction Expressions on GPGPU-based Clusters”. IEEE International Conference on Cluster Computing, September 2010
33