GPU Programming
Lecture 5: Convolution
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois, 2007-2012
Objective
• To learn convolution, an important parallel computation pattern
– Widely used in signal, image, and video processing
– Foundational to stencil computation used in many science and engineering applications
• Important techniques
– Taking advantage of shared memories
Convolution Computation
• Each output element is a weighted sum of neighboring input elements
• The weights are defined by the convolution kernel
– The convolution kernel is also called the convolution mask
– The same convolution mask is typically used for all elements of the array
Gaussian Blur
Simple Integer Gaussian Kernel
1D Convolution Example
• Commonly used for audio processing
– Mask size is usually an odd number for symmetry (5 in this example)
• Calculation of P[2]
[Figure: N = {1, 2, 3, 4, 5, 6, 7} and M = {3, 4, 5, 4, 3}. P[2] sums the element-wise products of M with N[0]..N[4], which are 3, 8, 15, 16, 15, giving P[2] = 57.]
1D Convolution Boundary Condition
• Calculation of output elements near the boundaries (beginning and end) of the input array needs to deal with "ghost" elements
– Different policies (0, replication of boundary values, etc.)

[Figure: calculation of P[1]. The ghost element to the left of N[0] is filled in with 0, so the products of M with {0, 1, 2, 3, 4} are 0, 4, 10, 12, 12, giving P[1] = 38.]
__global__ void basic_1D_conv(float *N, float *M, float *P,
                              int Mask_Width, int Width) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float Pvalue = 0;
  int N_start_point = i - (Mask_Width/2);
  for (int j = 0; j < Mask_Width; j++) {
    if (N_start_point + j >= 0 && N_start_point + j < Width) {
      Pvalue += N[N_start_point + j]*M[j];
    }
  }
  P[i] = Pvalue;
}
A 1D Convolution Kernel with Boundary Condition Handling
• All elements outside the input vector are set to 0
• Are there problems with this kernel function?
2D Convolution

[Figure: N is a 7×7 array with rows {1 2 3 4 5 6 7}, {2 3 4 5 6 7 8}, {3 4 5 6 7 8 9}, {4 5 6 7 8 5 6}, {5 6 7 8 5 6 7}, {6 7 8 9 0 1 2}, {7 8 9 0 1 2 3}; M is the 5×5 mask with rows {1 2 3 2 1}, {2 3 4 3 2}, {3 4 5 4 3}, {2 3 4 3 2}, {1 2 3 2 1}. The output element at row 2, column 2 sums the 25 element-wise products of M with the top-left 5×5 subarray of N, giving 321.]
2D Convolution Boundary Condition

[Figure: the same N and M. For the output element at row 1, column 0, the mask extends one row above and two columns to the left of N; those ghost elements contribute 0. The remaining products are {4, 6, 6}, {10, 12, 12}, {12, 12, 10}, {12, 10, 6}, and the output is 112.]
Two optimizations
• Use Constant Memory for the convolution mask
• Use the tiling technique
Access Pattern for M
• M is referred to as the mask (a.k.a. kernel, filter, etc.)
• Calculation of all output elements in P needs M
• Total of O(P*M) reads of M
• M is not changed during the kernel
• Bonus – M elements are accessed in the same order when calculating all P elements
• M is a good candidate for Constant Memory
Programmer View of CUDA Memories (Review)
• Each thread can:
– Read/write per-thread registers (~1 cycle)
– Read/write per-block shared memory (~5 cycles)
– Read/write per-grid global memory (~500 cycles)
– Read-only per-grid constant memory (~5 cycles with caching)

[Figure: the CUDA memory hierarchy. Each block has its own shared memory/L1 cache, each thread has its own registers, and all blocks in the grid share global memory and constant memory, which the host can also access.]
How to Use Constant Memory
• Host code allocates and initializes the variable the same way as any other variable that needs to be copied to the device
• Use cudaMemcpyToSymbol(dest, src, size) to copy the variable into the device memory
• This copy function tells the device that the variable will not be modified by the kernel and can be safely cached
More on Constant Caching
• Each SM has its own L1 cache
– Low latency, high bandwidth access by all threads
• However, there is no way for threads in one SM to update the L1 cache in other SMs
– No L1 cache coherence

This is not a problem if a variable is NOT modified by a kernel.
Some Header File Stuff for M

#define MASK_WIDTH 5
// Mask Structure declaration
typedef struct {
unsigned int width;
unsigned int height;
float* elements;
} Mask;
AllocateMask

// Allocate a mask of dimensions height*width
// If init == 0, initialize to all zeroes.
// If init == 1, perform random initialization.
// If init == 2, initialize mask parameters,
// but do not allocate memory
Mask AllocateMask(int height, int width, int init)
{
Mask M;
M.width = width;
M.height = height;
int size = M.width * M.height;
M.elements = NULL;
AllocateMask() (Cont.)
// don't allocate memory on option 2
if(init == 2) return M;
M.elements = (float*) malloc(size*sizeof(float));
for(unsigned int i = 0; i < M.height * M.width; i++) {
M.elements[i] = (init == 0) ? (0.0f) :
(rand() / (float)RAND_MAX);
}
return M;
}
Host Code

// global variable, outside any function
__constant__ float Mc[MASK_WIDTH][MASK_WIDTH];
…
// allocate N, P, initialize N elements, copy N to Nd
Mask M;
M = AllocateMask(MASK_WIDTH, MASK_WIDTH, 1);
// initialize M elements
….
cudaMemcpyToSymbol(Mc, M.elements,
MASK_WIDTH * MASK_WIDTH *sizeof(float));
ConvolutionKernel<<<dimGrid, dimBlock>>>(Nd, Pd, width, height);
Observation on Convolution
• Each element in N is used in calculating up to MASK_WIDTH * MASK_WIDTH elements in P
[Figure: the same 5×5 input array shown several times, each copy highlighting the mask footprints of the different output elements that reuse a given input element.]
Observation on Convolution
[Figure: an output tile and the larger input tile that covers it; the input tile extends beyond the output tile by the mask radius on each side.]

• Input tiles need to be larger than output tiles
Dealing with Mismatch
Input tile > Output tile
• Use a thread block that matches the input tile
– Why?
• Each thread loads one element of the input tile
– Special case: halo cells
• Not all threads participate in calculating output
– There will be if statements and control divergence
Shifting from output coordinates to input coordinates

[Figure: the output tile (TILE_SIZE wide) inside the thread block (BLOCK_SIZE wide); tx and ty index a thread within the block.]

int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;
int col_o = blockIdx.x * TILE_SIZE + tx;

int row_i = row_o - 2; // Assumes kernel size is 5
int col_i = col_o - 2; // Assumes kernel size is 5
Taking Care of Boundaries

Threads that load halo cells outside N should use 0.0.

float output = 0.0f;
__shared__ float Ns[TILE_SIZE+KERNEL_SIZE-1][TILE_SIZE+KERNEL_SIZE-1];

if ((row_i >= 0) && (row_i < N.height) &&
    (col_i >= 0) && (col_i < N.width))
  Ns[ty][tx] = N.elements[row_i*N.width + col_i];
else
  Ns[ty][tx] = 0.0f;
Some threads do not participate in calculating output.

if (ty < TILE_SIZE && tx < TILE_SIZE) {
  for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
      output += Mc[i][j] * Ns[i+ty][j+tx];
}
Some threads do not write output.

if (row_o < P.height && col_o < P.width)
  P.elements[row_o * P.width + col_o] = output;
Setting Block Size
#define BLOCK_SIZE (TILE_SIZE + 4)
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
In general, block size should be tile size + (kernel size -1)
A Simple Analysis: Small 8×8 Output Tile, 5×5 Mask
• 12×12 = 144 N elements need to be loaded into shared memory
• The calculation of each P element needs to access 25 N elements
• 8×8×25 = 1600 global memory accesses are converted into shared memory accesses
• A reduction of 1600/144 ≈ 11×
In General
• (Tile_Width + Mask_Width - 1)² elements of N need to be loaded into shared memory
• The calculation of each element of P needs to access Mask_Width² elements of N
• Tile_Width² * Mask_Width² global memory accesses are converted into shared memory accesses
Bandwidth Reduction for 2D
• The reduction is Mask_Width² * Tile_Width² / (Tile_Width + Mask_Width - 1)²

Tile_Width                   8     16    32    64
Reduction, Mask_Width = 5    11.1  16    19.7  22.1
Reduction, Mask_Width = 9    20.3  36    51.8  64