
Page 1: Optimizing Tensor Contraction Expressions for Hybrid CPU

Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution
February 2012
Sriram Krishnamoorthy, Wenjing Ma, Oreste Villa, Karol Kowalski, Gagan Agrawal

Page 2:

Deep Dive into GPU Programming

Experience with optimizing tensor contractions

How to think about programming GPUs: CUDA, OpenCL, …

Lessons learned, from a computer scientist's perspective

2

Page 3:

NWChem-TCE

3

Phase                      Computational Complexity
Hartree–Fock               N^4
4-index transformation     N^5
CCSD (iterative)           no^2 * nu^4
CCSD(T) (non-iterative)    no^3 * nu^4

N = no + nu (no: occupied orbitals, nu: virtual orbitals)

Page 4:

Tensor Contraction Expressions – CCSD(T)

4

Page 5:

Matrix Multiplication Formulation

5

Page 6:

Direct GPU Implementation (Tesla T10)

6

Page 7:

Steps

Manage GPU memory
Transfer data between CPU and GPU
Map work to thread blocks and threads
Decode mapping in each thread
Execute concurrently
… Party!

7

Page 8:

Host Code Aspects

8

Avoid frequent allocation/deallocation (pool allocator)

CPU does work common to all threads (dimension offset for C arrays)

Page 9:

Parallelization Across Thread Blocks

9

Tile one dimension of each tensor

16x16 work for each thread block

Decode work to be done on each thread

Page 10:

Baseline Host Implementation

10

Copy inputs, invoke kernel, copy output

Page 11:

Kernel Code (I)

11

Set things up for each thread

Page 12:

Kernel Code (II)

12

Execute the loop

Page 13:

Index Combining

13

Index decoding is expensive:
- Duplicated across threads
- Requires division and modulo arithmetic

Page 14:

Dimension Flattening

14

Maximize full thread blocks

Page 15:

Dimension Flattening

15

The baseline approach works well only when the index dimensions are multiples of the thread block configuration. We flatten the loops and recreate them at the "right" size, trading a few extra index operations for better thread utilization.

Page 16:

Pipelining

16

Overlap data transfer and computation

Page 17:

Pipelining

We can avoid the O(N^6) copy in and keep only the copy out, with the accumulation done on the host. We can also create "streams" of kernels over one loop dimension and asynchronously copy out partial data.

[Figure: pipeline timelines across GPU, PCI Express, and host for three variants. "Default": copy data in, execute + accumulate, copy data out (18 ms). "Host Support": execute, copy data out, host accumulates (16 ms). "Host Support and streams": four overlapped chunks of execute / copy data out / host accumulate (14 ms).]

17

Page 18:

CPU-GPU Hybrid Execution

18

The host has many cores: use them

Page 19:

What did we do?

Optimized memory allocation

Offload redundant calculations to CPUs

Index combining to reduce overhead

Flatten to maximize full thread blocks

Pipeline data transfer with computation

19

Page 20:

Single Tile Performance

20

[Figure: single-tile execution time (ms) across code versions]

Optimizations help, with greater improvements for irregular sizes.

Page 21:

CCSD(T): results for 60 nodes, GPU and CPU (Tesla C1060)

21

[Bar chart: minutes to completion. 1 core: 113; 2 cores: 56; 4 cores (1 socket): 28; 8 cores (2 sockets): 17; 1 GPU: 13; 2 GPUs + 6 cores: 5]

Double precision calculations for the chromophore of green fluorescent protein (GFP) with 284 and 476 basis functions. In all calculations core electrons were not correlated.

One-level hybrid execution: the GPU performs the contraction and the CPU accumulates. Two-level hybrid execution: GPUs share the PCIe bus.

Page 22:

Fermi GPU

22

Page 23:

Fermi GPU Features

More FLOPS for the same PCIe bandwidth

Larger register file size

Larger shared memory

Cache

Bi-directional PCIe transfer

23

Page 24:

Performance Estimate

24

Page 25:

Minimize PCIe Data Transfer

25

Keep intermediates in memory

Page 26:

Register Tiling

Just painful, but mechanical (see references). Improves data reuse.

26

Page 27:

Index Calculation to Favor Output

27

Access to the input arrays is optimized by the cache; index calculation is ordered so that global memory accesses to the output coalesce.

Page 28:

Reversed Evaluation of Conditionals

28

Minimize count of executed conditionals

Page 29:

What did we do?

Register tiling: reduce overall work

Index calculation order: favor memory access coalescing for output

Reversed evaluation of conditionals: minimize thread divergence

29

Page 30:

Evaluation

30

Baseline: single core. We are not comparing GPU against CPU, only GPU versions against each other. The Tesla codes run faster on Fermi, and Fermi-specific optimizations improve performance further. More results appear in the references.

Different versions result in different performance, which complicates compiler models.

Page 31:

Iterative Coupled Cluster

Just copy-paste? Not really:

Cannot eliminate data movement as in CCSD(T)

Varied tensor contractions, often with limited FLOP counts

Back to the drawing board

31

Page 32:

Observations

GPU programming is hard. Bad news: vectorization is hard, even for compilers (PLDI'04: memory copy; PACT'06: matrix transpose), and the performance implications are non-obvious.

Let the compiler do the most for you. Compilers are getting better, and knowing how to vectorize the code goes a long way toward getting the best out of one.

Isolate performance-critical code. You will rewrite it many times, and isolating it keeps the relationship good between you and the computer scientists.

32

Page 33:

References

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution". Cluster Computing, Special Issue, 2011.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "GPU-based implementations of the non-iterative regularized-CCSD(T) corrections: applications to strongly correlated systems". Journal of Chemical Theory and Computation, 7(5):1316-1327, 2011.

W. Ma, S. Krishnamoorthy, O. Villa, and K. Kowalski. "Acceleration of Streamed Tensor Contraction Expressions on GPGPU-based Clusters". IEEE International Conference on Cluster Computing, September 2010.

33