
Page 1:

GPU Programming

Lecture 5: Convolution

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012

Page 2:

Objective

• To learn convolution, an important parallel computation pattern
  – Widely used in signal, image, and video processing
  – Foundational to the stencil computations used in many science and engineering applications
• Important techniques
  – Taking advantage of shared memory


Page 3:

Convolution Computation

• Each output element is a weighted sum of neighboring input elements

• The weights are defined as the convolution kernel
  – The convolution kernel is also called the convolution mask
  – The same convolution mask is typically used for all elements of the array
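In symbols, the 1D case can be written as follows (a sketch using the array names and the index convention of the kernel code later in the lecture, with Mask_Width odd):

P[i] = \sum_{j=0}^{\mathrm{Mask\_Width}-1} N\left[\, i - \left\lfloor \mathrm{Mask\_Width}/2 \right\rfloor + j \,\right] \cdot M[j]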


Page 4:

Gaussian Blur


Simple Integer Gaussian Kernel

Page 5:

1D Convolution Example

• Commonly used for audio processing
  – Mask size is usually an odd number for symmetry (5 in this example)
• Calculation of P[2]

N = [1 2 3 4 5 6 7], M = [3 4 5 4 3]

P[2] = N[0]·M[0] + N[1]·M[1] + N[2]·M[2] + N[3]·M[3] + N[4]·M[4]
     = 1·3 + 2·4 + 3·5 + 4·4 + 5·3
     = 3 + 8 + 15 + 16 + 15
     = 57

Page 6:

1D Convolution Boundary Condition

• Calculation of output elements near the boundaries (beginning and end) of the input array must deal with "ghost" elements
  – Different policies are possible (fill with 0, replicate the boundary values, etc.; a sketch of the replicate policy follows)

Example: calculation of P[1], with the ghost element N[-1] filled in as 0

N = [1 2 3 4 5 6 7], M = [3 4 5 4 3]

P[1] = 0·3 + 1·4 + 2·5 + 3·4 + 4·3
     = 0 + 4 + 10 + 12 + 12
     = 38
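As an illustration of the "replicate boundary values" policy (the fill-with-0 policy is the one used in the rest of the lecture), out-of-range indices can be clamped before indexing N. The helper below is a hypothetical sketch, using the variable names of the kernel on the next slide.

// Clamp an index into [0, Width-1] so that out-of-range neighbors
// replicate the boundary values instead of contributing 0 (hypothetical helper).
static inline int clamp_index(int idx, int Width)
{
    if (idx < 0) return 0;
    if (idx >= Width) return Width - 1;
    return idx;
}

// The weighted sum then becomes:
//   Pvalue += N[clamp_index(N_start_point + j, Width)] * M[j];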

Page 7:

A 1D Convolution Kernel with Boundary Condition Handling

__global__ void basic_1D_conv(float *N, float *M, float *P,
                              int Mask_Width, int Width) {

    // Each thread computes one output element P[i].
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float Pvalue = 0;
    int N_start_point = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        // "Ghost" elements outside the input vector contribute 0.
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j] * M[j];
        }
    }
    P[i] = Pvalue;
}

• All elements outside the input vector are treated as 0
• Are there problems with this kernel function?
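A minimal host-side driver for this kernel is sketched below; the array length, the block size of 256, and the host arrays h_N, h_M, h_P are illustrative assumptions, not part of the slides.

// Hypothetical host code for launching basic_1D_conv (a sketch; the sizes
// and block dimension are assumptions, and initialization of h_N/h_M is omitted).
const int width = 1024, mask_width = 5;
float h_N[width], h_M[mask_width], h_P[width];    // host arrays
float *Nd, *Md, *Pd;                              // device arrays
cudaMalloc((void**)&Nd, width * sizeof(float));
cudaMalloc((void**)&Md, mask_width * sizeof(float));
cudaMalloc((void**)&Pd, width * sizeof(float));
cudaMemcpy(Nd, h_N, width * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(Md, h_M, mask_width * sizeof(float), cudaMemcpyHostToDevice);

// One thread per output element. Note that the kernel above never checks
// i < Width, so the launch must not create extra threads; width here is an
// exact multiple of the block size.
const int block = 256;
basic_1D_conv<<<width / block, block>>>(Nd, Md, Pd, mask_width, width);
cudaMemcpy(h_P, Pd, width * sizeof(float), cudaMemcpyDeviceToHost);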

Page 8:

2D Convolution

Example: calculation of one interior output element, P[2][2], with a 5x5 mask.

M (5x5 mask):
1 2 3 2 1
2 3 4 3 2
3 4 5 4 3
2 3 4 3 2
1 2 3 2 1

N (7x7 input):
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 5 6
5 6 7 8 5 6 7
6 7 8 9 0 1 2
7 8 9 0 1 2 3

Element-wise products of M with the 5x5 neighborhood of N centered at (2,2):
1  4  9  8  5
4  9 16 15 12
9 16 25 24 21
8 15 24 21 16
5 12 21 16  5

P[2][2] = sum of all products = 321

Page 9:

2D Convolution Boundary Condition

Example: calculation of P[1][0], near the left edge of the same 7x7 input N; ghost elements outside N are treated as 0.

M (5x5 mask):
1 2 3 2 1
2 3 4 3 2
3 4 5 4 3
2 3 4 3 2
1 2 3 2 1

Element-wise products of M with the 5x5 neighborhood of N centered at (1,0)
(one ghost row above and two ghost columns to the left contribute 0):
0 0  0  0  0
0 0  4  6  6
0 0 10 12 12
0 0 12 12 10
0 0 12 10  6

P[1][0] = sum of all products = 112
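For reference, a straightforward sequential CPU version of the 2D computation illustrated on the last two slides is sketched below (zero "ghost" values outside N; the function name and the row-major layout are assumptions, not from the slides).

// Sequential 2D convolution reference (a sketch).
// N and P are height x width arrays in row-major order; M is a square
// mask of odd size mask_width; out-of-range neighbors contribute 0.
void conv2d_reference(const float *N, const float *M, float *P,
                      int width, int height, int mask_width)
{
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            float pvalue = 0.0f;
            int row_start = row - mask_width / 2;
            int col_start = col - mask_width / 2;
            for (int i = 0; i < mask_width; i++) {
                for (int j = 0; j < mask_width; j++) {
                    int r = row_start + i;
                    int c = col_start + j;
                    if (r >= 0 && r < height && c >= 0 && c < width)
                        pvalue += N[r * width + c] * M[i * mask_width + j];
                }
            }
            P[row * width + col] = pvalue;
        }
    }
}

Applied to the 7x7 input and 5x5 mask above, this produces P[2][2] = 321 and P[1][0] = 112, matching the two worked examples.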

Page 10:

Two optimizations


• Use Constant Memory for the convolution mask

• Use tiling technique

Page 11:

Access Pattern for M

• M is referred to as the mask (a.k.a. kernel, filter, etc.)
• Calculation of all output elements in P needs M
• Total of O(P*M) reads of M
• M is not changed during the kernel
• Bonus: M elements are accessed in the same order when calculating all P elements
• M is a good candidate for Constant Memory


Page 12:

Programmer View of CUDA Memories (Review)

• Each thread can:
  – Read/write per-thread registers (~1 cycle)
  – Read/write per-block shared memory (~5 cycles)
  – Read/write per-grid global memory (~500 cycles)
  – Read-only per-grid constant memory (~5 cycles with caching)

[Figure: CUDA memory model diagram, with per-thread registers, per-block shared memory/L1 cache, per-grid global memory, and per-grid constant memory accessible from the host]


Page 13:

How to Use Constant Memory

• Host code allocates and initializes variables the same way as any other variables that need to be copied to the device

• Use cudaMemcpyToSymbol(dest, src, size) to copy the variable into the device memory

• This copy function tells the device that the variable will not be modified by the kernel and can be safely cached


Page 14:

More on Constant Caching

• Each SM has its own L1 cache
  – Low-latency, high-bandwidth access by all threads
• However, there is no way for threads in one SM to update the L1 cache in other SMs
  – No L1 cache coherence

[Figure: the same CUDA memory model diagram as on the previous slide]

This is not a problem if a variable is NOT modified by a kernel.

Page 15:

Some Header File Stuff for M

#define MASK_WIDTH 5

// Mask Structure declaration

typedef struct {

unsigned int width;

unsigned int height;

float* elements;

} Mask;


Page 16:

AllocateMask()

// Allocate a mask of dimensions height*width

// If init == 0, initialize to all zeroes.

// If init == 1, perform random initialization.

// If init == 2, initialize mask parameters,

// but do not allocate memory

Mask AllocateMask(int height, int width, int init)

{

Mask M;

M.width = width;

M.height = height;

int size = M.width * M.height;

M.elements = NULL;


Page 17:

AllocateMask() (Cont.)


// don't allocate memory on option 2

if(init == 2) return M;

M.elements = (float*) malloc(size*sizeof(float));

for(unsigned int i = 0; i < M.height * M.width; i++) {

M.elements[i] = (init == 0) ? (0.0f) :

(rand() / (float)RAND_MAX);

}

return M;

}

Page 18:

Host Code

// global variable, outside any function

__constant__ float Mc[MASK_WIDTH][MASK_WIDTH];

// allocate N, P, initialize N elements, copy N to Nd

Mask M;

M = AllocateMask(MASK_WIDTH, MASK_WIDTH, 1);

// initialize M elements

….

cudaMemcpyToSymbol(Mc, M.elements,

MASK_WIDTH * MASK_WIDTH *sizeof(float));

ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd, width, height);


Page 19:

Observation on Convolution

• Each element in N is used in calculating up to MASK_WIDTH * MASK_WIDTH elements in P


[Figure: the same small input grid repeated several times, each copy highlighting a different P element whose calculation reuses one particular N element]

Page 20:

Observation on Convolution


[Figure: an input tile and the smaller output tile it produces]

• Input tiles need to be larger than output tiles

Page 21:

Dealing with Mismatch: Input Tile > Output Tile

• Use a thread block that matches the input tile
  – Why?
• Each thread loads one element of the input tile
  – Special case: halo cells
• Not all threads participate in calculating output
  – There will be if statements and control divergence


Page 22:

Shifting from output coordinates to input coordinates


[Figure: a BLOCK_SIZE x BLOCK_SIZE thread block covering the input tile, with the TILE_SIZE x TILE_SIZE output tile inside it; tx and ty are thread coordinates]

Page 23:

Shifting from output coordinates to input coordinates

int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;   // output element coordinates
int col_o = blockIdx.x * TILE_SIZE + tx;

int row_i = row_o - 2;   // assumes kernel size is 5 (shift by kernel size / 2)
int col_i = col_o - 2;   // assumes kernel size is 5


Page 24:

Threads that load halo cells outside N should store 0.0 instead.


Page 25:

Taking Care of Boundaries

float output = 0.0f;
__shared__ float Ns[TILE_SIZE + KERNEL_SIZE - 1][TILE_SIZE + KERNEL_SIZE - 1];

// Each thread loads one input element; ghost cells outside N become 0.
if ((row_i >= 0) && (row_i < N.height) && (col_i >= 0) && (col_i < N.width))
    Ns[ty][tx] = N.elements[row_i * N.width + col_i];
else
    Ns[ty][tx] = 0.0f;


Page 26:

Some threads do not participate in calculating output.

__syncthreads();   // make sure the whole input tile is loaded first

if (ty < TILE_SIZE && tx < TILE_SIZE) {
    for (int i = 0; i < 5; i++)          // 5 = KERNEL_SIZE
        for (int j = 0; j < 5; j++)
            output += Mc[i][j] * Ns[i + ty][j + tx];
}


Page 27:

Some threads do not write output

if (row_o < P.height && col_o < P.width)
    P.elements[row_o * P.width + col_o] = output;
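Putting the fragments from the last several slides together, a complete tiled 2D convolution kernel might look like the sketch below. The Matrix struct (mirroring the Mask struct earlier), the TILE_SIZE value of 12, the kernel name, and the explicit __syncthreads() between the loading and computing phases are assumptions or additions the slides do not spell out; note also that the host-code slide launches its kernel with raw pointers and dimensions rather than structs.

#define KERNEL_SIZE 5                      // same as MASK_WIDTH on earlier slides
#define TILE_SIZE   12                     // assumed output-tile width
#define BLOCK_SIZE  (TILE_SIZE + KERNEL_SIZE - 1)

// Hypothetical Matrix struct for N and P, mirroring the Mask struct.
typedef struct {
    unsigned int width;
    unsigned int height;
    float* elements;
} Matrix;

__constant__ float Mc[KERNEL_SIZE][KERNEL_SIZE];   // mask in constant memory

__global__ void TiledConvolutionKernel(Matrix N, Matrix P)
{
    __shared__ float Ns[TILE_SIZE + KERNEL_SIZE - 1][TILE_SIZE + KERNEL_SIZE - 1];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row_o = blockIdx.y * TILE_SIZE + ty;      // output coordinates
    int col_o = blockIdx.x * TILE_SIZE + tx;
    int row_i = row_o - KERNEL_SIZE / 2;          // shifted input coordinates
    int col_i = col_o - KERNEL_SIZE / 2;

    // Phase 1: every thread loads one input element; ghost cells become 0.
    if (row_i >= 0 && row_i < N.height && col_i >= 0 && col_i < N.width)
        Ns[ty][tx] = N.elements[row_i * N.width + col_i];
    else
        Ns[ty][tx] = 0.0f;

    __syncthreads();   // the whole input tile must be loaded before computing

    // Phase 2: only threads mapped to the output tile compute and write.
    if (ty < TILE_SIZE && tx < TILE_SIZE) {
        float output = 0.0f;
        for (int i = 0; i < KERNEL_SIZE; i++)
            for (int j = 0; j < KERNEL_SIZE; j++)
                output += Mc[i][j] * Ns[i + ty][j + tx];

        if (row_o < P.height && col_o < P.width)
            P.elements[row_o * P.width + col_o] = output;
    }
}

With TILE_SIZE = 12 and KERNEL_SIZE = 5 this gives a 16 x 16 thread block (256 threads), of which only the 12 x 12 threads of the output tile compute results, as discussed on the "Dealing with Mismatch" slide.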


Page 28:

Setting Block Size

#define BLOCK_SIZE (TILE_SIZE + 4)   // 4 = kernel size - 1 for the 5x5 kernel

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

In general, the block size should be the tile size + (kernel size - 1).
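The matching grid dimensions are not shown on this slide; one plausible host-side computation (assuming a TILE_SIZE x TILE_SIZE output tile per block, with the width and height variables of the host-code slide) is:

// Hypothetical grid size: each block produces one TILE_SIZE x TILE_SIZE
// output tile, so round up to cover the whole width x height output.
dim3 dimGrid((width  + TILE_SIZE - 1) / TILE_SIZE,
             (height + TILE_SIZE - 1) / TILE_SIZE);
ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd, width, height);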


Page 29:

A Simple Analysis for a Small 8x8 Output Tile Example, 5x5 Mask

• 12x12 = 144 N elements need to be loaded into shared memory
• The calculation of each P element needs to access 25 N elements
• 8x8x25 = 1600 global memory accesses are converted into shared memory accesses
• A reduction of 1600/144 ≈ 11x


Page 30:

In General

• (Tile_Width + Mask_Width - 1)² elements of N need to be loaded into shared memory
• The calculation of each element of P needs to access Mask_Width² elements of N
• Tile_Width² * Mask_Width² global memory accesses are converted into shared memory accesses


Page 31:

Bandwidth Reduction for 2D

• The reduction is Mask_Width² * Tile_Width² / (Tile_Width + Mask_Width - 1)²

Tile_Width                   8      16     32     64
Reduction, Mask_Width = 5    11.1   16     19.7   22.1
Reduction, Mask_Width = 9    20.3   36     51.8   64
