
Page 1: Optimizing Tensor Contraction Expressions for Hybrid CPU

Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
February 2012
Sriram Krishnamoorthy, Wenjing Ma, Oreste Villa, Karol Kowalski, Gagan Agrawal

Page 2:

Deep Dive into GPU Programming

Experience with optimizing tensor contractions

How to think about programming GPUs: CUDA, OpenCL, …

Lessons learned, from a computer scientist's perspective

2

Page 3:

NWChem-TCE

3

Phase                      Computational Complexity
Hartree–Fock               N^4
4-index transformation     N^5
CCSD (iterative)           no^2 * nu^4
CCSD(T) (non-iterative)    no^3 * nu^4

N = no + nu (no: occupied orbitals, nu: virtual orbitals)

Page 4:

Tensor Contraction Expressions – CCSD(T)

4

Page 5:

Matrix Multiplication Formulation

5

Page 6:

Direct GPU Implementation (Tesla T10)

6

Page 7:

Steps

Manage GPU memory
Transfer data between CPU and GPU
Map work to thread blocks and threads
Decode mapping in each thread
Execute concurrently
… Party!

7

Page 8:

Host Code Aspects

8

Avoid frequent allocation/deallocation (pool allocator)

CPU does work common to all threads (dimension offset for C arrays)

Page 9:

Parallelization Across Thread Blocks

9

Tile one dimension of each tensor

16x16 work for each thread block

Decode work to be done on each thread

Page 10:

Baseline Host Implementation

10

Copy inputs, invoke kernel, copy output

Page 11:

Kernel Code (I)

11

Set things up for each thread

Page 12:

Kernel Code (II)

12

Execute the loop

Page 13:

Index Combining

13

Index decoding is expensive:
- Duplicated across threads
- Requires division and modulo arithmetic

Page 14:

Dimension Flattening

14

Maximize full thread blocks

Page 15:

Dimension Flattening

15

The baseline approach works well only when the index dimensions are multiples of the thread block configuration. We flatten the loops and recreate them at the "right" size, trading a few extra index operations for better thread utilization.

Page 16:

Pipelining

16

Overlap data transfer and computation

Page 17:

Pipelining

We can avoid the O(N^6) copy in and keep only the copy out, with the accumulation done on the host. We can also create "streams" of kernels over one loop dimension and asynchronously copy out partial data.

[Figure: pipeline timelines across GPU, PCI Express, and host for three variants. "Default": copy data in, execute + accumulate, copy data out (18 ms). "Host Support": execute, copy data out, host accumulates (16 ms). "Host Support and streams": four overlapped chunks of execute / copy data out / host accumulate (14 ms).]

17

Page 18:

CPU-GPU Hybrid Execution

18

The host has many cores: use them

Page 19:

What did we do?

Optimized memory allocation

Offload redundant calculations to CPUs

Index combining to reduce overhead

Flatten to maximize full thread blocks

Pipeline data transfer with computation

19

Page 20:

Single Tile Performance

20

[Figure: single-tile execution time (ms) across code versions]

Optimizations help, with greater improvements for irregular sizes.

Page 21:

CCSD(T): results for 60 nodes, GPU and CPU (Tesla C1060)

21

[Bar chart: minutes to completion. 1 core: 113; 2 cores: 56; 4 cores (1 socket): 28; 8 cores (2 sockets): 17; 1 GPU: 13; 2 GPUs + 6 cores: 5]

Double precision calculations for the chromophore of green fluorescent protein (GFP) with 284 and 476 basis functions. In all calculations core electrons were not correlated.

One-level hybrid execution: the GPU performs the contraction and the CPU accumulates. Two-level hybrid execution: GPUs share the PCIe bus.

Page 22:

Fermi GPU

22

Page 23:

Fermi GPU Features

More FLOPS for the same PCIe bandwidth

Larger register file size

Larger shared memory

Cache

Bi-directional PCIe transfer

23

Page 24:

Performance Estimate

24

Page 25:

Minimize PCIe Data Transfer

25

Keep intermediates in memory

Page 26:

Register Tiling

Just painful, but mechanical (see references). Improves data reuse.

26

Page 27:

Index Calculation to Favor Output

27

Access to the input arrays is optimized by the cache; index calculation is ordered so that global memory accesses to the output coalesce.

Page 28:

Reversed Evaluation of Conditionals

28

Minimize count of executed conditionals

Page 29:

What did we do?

Register tiling: reduce overall work

Index calculation order: favor memory access coalescing for output

Reversed evaluation of conditionals: minimize thread divergence

29

Page 30:

Evaluation

30

Baseline: single core. We are not comparing GPU against CPU, only GPU versions against each other. The Tesla codes run faster on Fermi, and Fermi-specific optimizations improve performance further. More results appear in the references.

Different versions result in different performance, which complicates compiler models.

Page 31:

Iterative Coupled Cluster

Just copy-paste? Not really:

Cannot eliminate data movement as in CCSD(T)

Varied tensor contractions, often with limited FLOP counts

Back to the drawing board

31

Page 32:

Observations

GPU programming is hard. Bad news: vectorization is hard, even for compilers (PLDI'04: memory copy; PACT'06: matrix transpose), and the performance implications are non-obvious.

Let the compiler do the most for you. Compilers are getting better, and knowing how to vectorize the code goes a long way toward getting the best out of one.

Isolate performance-critical code. You will rewrite it many times, and isolating it keeps the relationship good between you and the computer scientists.

32

Page 33:

References

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution". Cluster Computing, Special Issue, 2011.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "GPU-based implementations of the non-iterative regularized-CCSD(T) corrections: applications to strongly correlated systems". Journal of Chemical Theory and Computation, 7(5):1316-1327, 2011.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "Acceleration of Streamed Tensor Contraction Expressions on GPGPU-based Clusters". IEEE International Conference on Cluster Computing, September 2010.

33