Lecture 3: Blocking/Tiling for Locality


Page 1:

©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2011

Berkeley Winter School

Advanced Algorithmic Techniques for GPUs

Lecture 3: Blocking/Tiling for Locality


Page 2:

Objective

• Reuse each data element accessed from global memory multiple times
   – Across threads: shared memory blocking
   – Within a thread: register tiling

• Register tiling is also often used to reuse computation results for increased efficiency (a sketch follows below)
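A minimal illustration of register tiling (not from the lecture; the kernel name is hypothetical, and Width is assumed to be a multiple of the block height and of twice the block width): each thread computes two adjacent elements of P in the same row, so every M element it fetches is held in a register and reused for both results.

__global__ void MatMulRegTile(const float* M, const float* N, float* P, int Width)
{
    int row  = blockIdx.y * blockDim.y + threadIdx.y;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first of two adjacent columns

    float p0 = 0.0f, p1 = 0.0f;
    for (int k = 0; k < Width; ++k) {
        float m = M[row * Width + k];       // loaded from global memory once, kept in a register
        p0 += m * N[k * Width + col0];      // reused for column col0 ...
        p1 += m * N[k * Width + col0 + 1];  // ... and for column col0 + 1
    }
    P[row * Width + col0]     = p0;
    P[row * Width + col0 + 1] = p1;
}

Each M element now serves two output elements per thread, halving the M traffic relative to one output per thread.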


Page 3:

Shared Memory Blocking Basic Idea

[Figure: Without blocking, Thread 1, Thread 2, … each fetch the same input directly from global memory. With blocking, the input is first staged into on-chip memory and the threads read it from there.]
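A minimal sketch of the idea in the figure (illustrative only; the kernel and its arguments are hypothetical): one thread per block fetches a value that every thread in the block needs, places it in on-chip shared memory, and the rest of the block reads the on-chip copy instead of going back to global memory.

__global__ void ScaleByFirst(const float* in, float* out, int n)
{
    __shared__ float s;                  // on-chip copy shared by all threads in the block
    if (threadIdx.x == 0) s = in[0];     // one global-memory read per block ...
    __syncthreads();                     // make the value visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];       // ... reused by every thread
}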

Page 4:

Basic Concept of Blocking/Tiling

• In a congested traffic system, a significant reduction in the number of vehicles can greatly improve the delay seen by all vehicles
   – Carpooling for commuters
   – Blocking/tiling for global memory accesses

• Analogy: drivers = threads, cars = data


Page 5:

Some computations are more challenging to block/tile than others.

• Some carpools may be easier to organize than others
   – More efficient if neighbors are also classmates or co-workers
   – Some vehicles may be more suitable for carpooling

• Similar variations exist in blocking/tiling


Page 6:

Carpools need synchronization.

• Good – when people have similar schedules

• Bad – when people have very different schedules

[Figure: Two daily timelines for Worker A and Worker B. When their schedules align (sleep, work, dinner), carpooling works; when Worker B's evening differs (dinner vs. party), it does not.]

Page 7:

Same with Blocking/Tiling

• Good – when threads have similar access timing

• Bad – when threads have very different access timing

[Figure: Timelines for Thread 1 and Thread 2, analogous to the carpool schedules on the previous slide.]

Page 8:

Outline of Technique

• Identify a block/tile of global memory content that is accessed by multiple threads

• Load the block/tile from global memory into on-chip memory

• Have the multiple threads access their data from the on-chip memory

• Move on to the next block/tile (a sketch of this pattern follows below)
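A minimal sketch of these four steps applied to a computation simpler than matrix multiply (illustrative only, not from the slides): a 3-point moving average, where each input element is needed by up to three neighboring threads, so each block stages its tile of the input, plus a one-element halo on each side, in shared memory. It assumes the kernel is launched with BLOCK threads per block.

#define BLOCK 256

__global__ void Avg3(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK + 2];                     // the tile plus one halo element on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Steps 1-2: identify the tile this block needs and load it into on-chip memory
    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;             // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f; // right halo
    __syncthreads();

    // Step 3: all threads read their inputs from on-chip memory
    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;

    // Step 4: "moving on to the next tile" is handled here by launching one block per tile
}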


Page 9:

Tiled Matrix Multiply

• Each row of Md is accessed by multiple threads
• Problem: some threads can be much further along than others
   – An entire row may need to be in on-chip memory
   – Not enough on-chip memory for large input matrices

[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, with a TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub of Pd indexed by block (bx, by) and thread (tx, ty); Thread 1 and Thread 2 traverse the same row of Md.]

Page 10:

A Small Example

• Can we use two on-chip memory locations to reduce the number of M accesses by the two threads?
   – Not if the two threads can have very different timing!

[Figure: the 4x4 grid of Pd elements (Pd0,0 … Pd3,3) from which the two threads compute their results.]

Page 11:

Every M and N Element is used exactly twice in generating a 2X2 tile of P

Products computed by each thread, listed in access order:

P0,0 (thread0,0): M0,0*N0,0, M1,0*N0,1, M2,0*N0,2, M3,0*N0,3
P1,0 (thread1,0): M0,0*N1,0, M1,0*N1,1, M2,0*N1,2, M3,0*N1,3
P0,1 (thread0,1): M0,1*N0,0, M1,1*N0,1, M2,1*N0,2, M3,1*N0,3
P1,1 (thread1,1): M0,1*N1,0, M1,1*N1,1, M2,1*N1,2, M3,1*N1,3


Page 12:

Breaking Md and Nd into Tiles

[Figure: Phase 1 – the tile of Md (Md0,0, Md1,0, Md0,1, Md1,1) and the tile of Nd (Nd0,0, Nd1,0, Nd0,1, Nd1,1) used in the first phase, shown against the 4x4 Pd.]

Page 13:

Breaking Md and Nd into Tiles (cont.)

[Figure: Phase 2 – the tile of Md (Md2,0, Md3,0, Md2,1, Md3,1) and the tile of Nd (Nd0,2, Nd1,2, Nd0,3, Nd1,3) used in the second phase, shown against the 4x4 Pd.]

Page 14:

Each phase uses one tile from Md and one from Nd. For each thread, a phase has three steps (Steps 4–6): load one Md element into Mds, load one Nd element into Nds, then accumulate into PValue. Time runs from Phase 1 to Phase 2.

Phase 1
T0,0:  Md0,0 ↓ Mds0,0   Nd0,0 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
T1,0:  Md1,0 ↓ Mds1,0   Nd1,0 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
T0,1:  Md0,1 ↓ Mds0,1   Nd0,1 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
T1,1:  Md1,1 ↓ Mds1,1   Nd1,1 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Phase 2
T0,0:  Md2,0 ↓ Mds0,0   Nd0,2 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
T1,0:  Md3,0 ↓ Mds1,0   Nd1,2 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
T0,1:  Md2,1 ↓ Mds0,1   Nd0,3 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
T1,1:  Md3,1 ↓ Mds1,1   Nd1,3 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Page 15:

Summary of Small Example

• Each tile contains four elements
• Two tiles are loaded in each phase
   – 8 memory loads
• Each phase performs 16 operations
   – 8 mul, 8 add
• Each element is used twice
• Two floating-point operations per load
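Generalizing this arithmetic to an arbitrary tile width T (a restatement of the numbers above and on Pages 17 and 25, not additional data): each phase loads two T x T tiles, and each of the T^2 threads performs T multiply-add pairs, so

\[
\frac{\text{floating-point operations per phase}}{\text{floats loaded per phase}}
  = \frac{T^{2}\cdot 2T}{2\,T^{2}} = T ,
\]

which gives 2 for the T = 2 example here and 16 for TILE_WIDTH = 16.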


Page 16:

Tiled Multiply – Large Matrices

• Make sure that tiles are all loaded in vertical patterns from global memory
• Md data can then be accessed from shared memory in the horizontal direction

[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, with the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub computed by block (bx, by) and thread (tx, ty).]

Page 17:

First-order Size Considerations

• Assume
   – TILE_WIDTH of 16 gives 16*16 = 256 threads per block
   – A 1024*1024 Pd gives 64*64 = 4096 thread blocks

• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
   – Memory bandwidth is no longer a limiting factor
   – Could use thread coarsening to further reduce traffic

• Each thread block can have up to 1024 threads
   – Can use 32*32 tiles to further reduce traffic


Page 18:

Memory Access Pattern (Corner Turning)

[Figure: Md and Nd, each WIDTH x WIDTH. Original access pattern: threads read Md and Nd directly from global memory. Tiled access pattern: tiles are first copied into scratchpad (shared) memory, and the multiplication is performed with the scratchpad values.]

Page 19:

Memory Layout of a Matrix in C

[Figure: A 4x4 matrix M laid out in row-major order in memory: M0,0 M1,0 M2,0 M3,0 | M0,1 M1,1 M2,1 M3,1 | M0,2 M1,2 M2,2 M3,2 | M0,3 M1,3 M2,3 M3,3. Threads T1–T4 are shown accessing M during Time Period 1 and Time Period 2, with the access direction in the kernel code running across this layout.]


Page 21:

Loading a Tile

• All threads in a block participate
   – Each thread loads one Md element and one Nd element in the basic tiled code

• Assign the loaded elements to threads such that the accesses within each warp are coalesced


Page 22:

CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
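For completeness, launching the kernel from Page 24 with this configuration might look as follows (a sketch; it assumes Md, Nd, and Pd are already allocated and initialized device pointers and that Width is a multiple of TILE_WIDTH):

// Hypothetical launch using the configuration above; error checking omitted
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaDeviceSynchronize();  // wait for the kernel to finish before using Pd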


Page 23:

Tiled Multiply

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, with Pdsub computed by block (bx, by); the loop index m selects the current pair of tiles along the shared dimension k.]

Page 24:

Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Shared-memory tiles of Md and Nd, reused by all threads in the block
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory;
        // consecutive tx values touch consecutive global addresses (coalesced)
        Mds[tx][ty] = Md[Row*Width + m*TILE_WIDTH + tx];
        Nds[tx][ty] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        // With the [tx][ty] tile layout above, the element of Md for dot-product
        // index k is Mds[k][ty] and the element of Nd is Nds[tx][k]
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[k][ty] * Nds[tx][k];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}


Page 25:

Shared Memory and Threading


• Each SM in Fermi has 64KB of on-chip SRAM, partitioned into 48KB L1 cache and 16KB shared memory, or vice versa
   – SM shared memory size is implementation dependent!
   – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
   – Can potentially have up to 8 thread blocks actively executing
      • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
   – The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 (16KB shared) or 6 (48KB shared) thread blocks to be active at the same time (a problem with earlier GPUs!)

• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
   – A 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
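Spelling out the units in the last bullet (the same arithmetic, with units made explicit):

\[
\frac{150\ \text{GB/s}}{4\ \text{B per float}} = 37.5\ \text{Gfloats/s loaded},
\qquad
37.5\ \text{Gfloats/s} \times 16\ \tfrac{\text{FLOPs}}{\text{float}} = 600\ \text{GFLOPS}.
\]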


Page 26:

ANY MORE QUESTIONS?
