
Page 1

GPU Algorithms III/IV: Computing with CUDA

MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models

Suresh Venkatasubramanian, University of Utah

Page 2

Previously...

1999: Programmable pipeline (sorting, matrices, geometry)

2006: CUDA (a streaming model)

Page 3

Outline

1999: Programmable pipeline (sorting, matrices, geometry)

2006: CUDA (a streaming model): sorting, matrices, graphs

Page 4

CUDA: Compute Unified Device Architecture

• Lightweight threads that run SIMD ("SIMT") in "blocks"

• Blocks run in "SPMD" mode (single program, multiple data)

• Memory at multiple levels (thread, block, global)

• Threads are very lightweight, and there are many of them

• Two views: programmer-centric and hardware-centric

Page 5

CUDA Model: Blocks

• A block is a collection of threads

• A block can have different "shapes"

• All threads run the same instructions and can synchronize

• Threads have local memory (and so do blocks)

• Block memory is low-latency and shared among threads

Page 6

CUDA Model: Grids

• A grid is a collection of blocks

• A grid can have different shapes

• A grid of blocks is initiated by a request from the host:

      kernel<<<4, 3>>>(...)

• A grid has shared (global) memory

• Blocks cannot coordinate with each other and are run independently
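A minimal sketch of such a request in CUDA C (the kernel name, sizes, and buffer are illustrative, not from the slides): the host asks for a grid of 4 blocks of 3 threads each, matching the kernel<<<4, 3>>> request above.

    // illustrative kernel: each thread records its global index
    __global__ void kernel(int *out) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        out[idx] = idx;
    }

    int main() {
        int *d_out;
        cudaMalloc(&d_out, 4 * 3 * sizeof(int));
        kernel<<<4, 3>>>(d_out);     // host request: 4 blocks, 3 threads per block
        cudaDeviceSynchronize();     // wait for the grid to finish
        cudaFree(d_out);
        return 0;
    }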

Page 7

CUDA Model: Overview

[Figure: the host program issues a RUN request, and the grid executes on the device.]

Page 8

CUDA Execution Model

Nickolls, Buck, Garland, Skadron. ACM Queue, Mar 2008 [NBGS08]

Page 9

CUDA Execution Model

CUDA grids

• Each block is assigned to a single SM

• Grid is a software construct

• Block memory is managed by the SM

Page 10

CUDA Execution Model

[Figure: the mapping between blocks, threads, and SPs.]

Warps

• Each block is divided into groups of 32 threads called "warps"

• Warp threads are scheduled SIMD on the processor

• Warps are scheduled concurrently

Page 11

CUDA Design Pitfalls: Branch Divergence

[Animation over ten slides: the eight threads (1 through 8) of one warp reach a branch; the two sides of the branch are executed one after the other, with the threads on the untaken side masked off at each step.]
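A minimal sketch of the pitfall the figure illustrates (illustrative kernels, not from the slides): in the first kernel, even and odd threads of the same warp take different branches, so the warp executes both paths serially with half its threads masked at each step; in the second, whole warps branch together and no path is wasted.

    __global__ void divergent(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] *= 2.0f;     // half the warp is masked off here...
        else
            data[i] += 1.0f;     // ...and the other half is masked off here
    }

    __global__ void uniform(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0)   // all 32 threads of a warp agree on the branch
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }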

Page 21

CUDA Design Pitfalls: Memory Bank Conflicts

[Animation over nine slides: a warp accesses shared memory, which is divided into banks; requests are issued one half-warp at a time. When the threads of a half-warp fall in distinct banks, all memory accesses are processed in parallel; when two threads fall in the same bank (conflict!), the accesses are serialized.]

Memory bank conflicts can result in serialized access.
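A minimal sketch of the two access patterns (illustrative kernel, not from the slides; the slides' hardware resolves conflicts per half-warp over 16 banks, while this sketch assumes the common layout of 32 four-byte banks with consecutive words in consecutive banks):

    __global__ void bank_demo(float *out) {
        __shared__ float tile[32 * 32];
        int tid = threadIdx.x;        // assume blockDim.x == 32: one warp
        tile[tid] = (float)tid;
        __syncthreads();
        float a = tile[tid];          // stride 1: one word per bank, so all
                                      // accesses are processed in parallel
        float b = tile[tid * 32];     // stride 32: every thread hits bank 0,
                                      // a conflict, so the accesses serialize
        out[tid] = a + b;
    }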

Page 30

CUDA Design Pitfalls: Global Memory Coalescing

[Animation over five slides: the threads of a warp read global memory; when the requested words straddle two cache blocks, both cache blocks are returned.]

• Threads should "coalesce" access to a single memory line

• This is akin to block transfer in external memory
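A minimal sketch of the contrast (illustrative kernels, not from the slides): in the first, a warp's 32 loads fall within one or two contiguous memory lines, the block transfer of the analogy; in the second, each load can touch a different line.

    __global__ void coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];           // consecutive threads read consecutive words
    }

    // caller must size `in` so that (n - 1) * stride is a valid index
    __global__ void strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i * stride];  // each thread can touch a different line
    }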

Page 35

Solution: Kernel Decomposition

Avoid global sync by decomposing the computation into multiple kernel invocations.

In the case of reductions, the code for all levels is the same: recursive kernel invocation.

[Figure: each block reduces its chunk with a tree of pairwise sums, e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25. Level 0: 8 blocks compute partial sums; Level 1: 1 block combines them.]

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]
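A sketch of the host-side driver this implies (the loop is an assumption about how the recursive invocation would be coded; reduce0 is the kernel on the next page, and n and threads are assumed to be powers of two so every block is full):

    __global__ void reduce0(int *g_idata, int *g_odata);  // defined on the next page

    void reduce_all(int *d_in, int *d_out, int n, int threads) {
        while (n > 1) {
            int t = (n < threads) ? n : threads;         // threads per block
            int blocks = n / t;                          // one partial sum per block
            reduce0<<<blocks, t, t * sizeof(int)>>>(d_in, d_out);
            int *tmp = d_in; d_in = d_out; d_out = tmp;  // block sums feed the next level
            n = blocks;
        }
        // after the final swap, the total is in d_in[0]
    }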

Page 36

Reduction #1: Interleaved Addressing

    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 37

Parallel Reduction: Interleaved Addressing

Values (shared memory): 10  1  7  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Values (shared memory): 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, stride 1 (thread IDs 0 2 4 6 8 10 12 14):
Values:                 11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2, stride 2 (thread IDs 0 4 8 12):
Values:                 18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3, stride 4 (thread IDs 0 8):
Values:                 24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4, stride 8 (thread ID 0):
Values:                 41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 38

Reduction #1: Interleaved Addressing

(The same interleaved-addressing kernel as on Page 36.)

Problem: highly divergent warps are very inefficient, and the % operator is very slow.

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 39

Performance for 4M element reduction

Kernel                                                      Time (2^22 ints)   Bandwidth
Kernel 1: interleaved addressing with divergent branching   8.054 ms           2.083 GB/s

Note: block size = 128 threads for all tests.

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 40

Reduction #2: Interleaved Addressing

Just replace the divergent branch in the inner loop:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

with a strided index and non-divergent branch:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 41

Parallel Reduction: Interleaved Addressing

Values (shared memory): 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, stride 1 (thread IDs 0 1 2 3 4 5 6 7):
Values:                 11  1  7 -1 -2 -2  8  5 -5 -3  9  7 11 11  2  2

Step 2, stride 2 (thread IDs 0 1 2 3):
Values:                 18  1  7 -1  6 -2  8  5  4 -3  9  7 13 11  2  2

Step 3, stride 4 (thread IDs 0 1):
Values:                 24  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

Step 4, stride 8 (thread ID 0):
Values:                 41  1  7 -1  6 -2  8  5 17 -3  9  7 13 11  2  2

New Problem: Shared Memory Bank Conflicts

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 42

Performance for 4M element reduction

Kernel                                                      Time (2^22 ints)   Bandwidth    Step speedup   Cumulative speedup
Kernel 1: interleaved addressing with divergent branching   8.054 ms           2.083 GB/s
Kernel 2: interleaved addressing with bank conflicts        3.456 ms           4.854 GB/s   2.33x          2.33x

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 43

Parallel Reduction: Sequential Addressing

Values (shared memory): 10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

Step 1, stride 8 (thread IDs 0 1 2 3 4 5 6 7):
Values:                  8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 2, stride 4 (thread IDs 0 1 2 3):
Values:                  8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 3, stride 2 (thread IDs 0 1):
Values:                 21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Step 4, stride 1 (thread ID 0):
Values:                 41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Sequential addressing is conflict free.

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 44

Reduction #3: Sequential Addressing

Just replace the strided indexing in the inner loop:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

with a reversed loop and threadID-based indexing:

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 45

Performance for 4M element reduction

Kernel                                                      Time (2^22 ints)   Bandwidth    Step speedup   Cumulative speedup
Kernel 1: interleaved addressing with divergent branching   8.054 ms           2.083 GB/s
Kernel 2: interleaved addressing with bank conflicts        3.456 ms           4.854 GB/s   2.33x          2.33x
Kernel 3: sequential addressing                             1.722 ms           9.741 GB/s   2.01x          4.68x

Mark Harris. Optimizing Parallel Reduction in CUDA. [HBM+07, SHZO07]

Page 46

Applications

Page 47

Sample Sort [LOS10]

[Animation over five slides: from the input 1 23 7 19 25 42 4, the splitters 19 and 23 are sampled; the remaining elements are distributed into buckets around the splitters (1, 7, and 4 below 19; 25 and 42 above 23); sorting each bucket and concatenating yields 1 4 7 19 23 25 42.]

Page 52

Sample Sort

Given X = {x1, x2, ..., xn} and k:
    if n ≤ M then
        SimpleSort(X)
    end if
    Pick a random sample R = xi1, xi2, ..., xik
    Sort(R) = {r0 = −∞, r1, r2, ..., rk, rk+1 = +∞}
    Place xi in bucket bj if rj ≤ xi < rj+1
    Concatenate SampleSort(b0, k), SampleSort(b1, k), ...

• Parallelize the individual sorts

• Use parallel reductions to partition elements
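A sequential C++ sketch of the recursion above (illustrative; the threshold M and the duplicate handling are assumptions, and the GPU version replaces these loops with the four phases on the next page):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    static const size_t M = 32;   // base-case threshold (illustrative)

    void sampleSort(std::vector<int> &X, int k) {
        if (X.size() <= M) { std::sort(X.begin(), X.end()); return; }
        // pick a random sample of k splitters and sort it
        std::vector<int> R(k);
        for (int j = 0; j < k; j++) R[j] = X[rand() % X.size()];
        std::sort(R.begin(), R.end());
        // place x in bucket j when r_j <= x < r_{j+1}, with r_0 = -inf, r_{k+1} = +inf
        // (duplicate-heavy inputs can defeat the splitters; a real
        // implementation guards against that)
        std::vector<std::vector<int>> b(k + 1);
        for (int x : X)
            b[std::upper_bound(R.begin(), R.end(), x) - R.begin()].push_back(x);
        // recurse on the buckets and concatenate
        X.clear();
        for (auto &bj : b) {
            sampleSort(bj, k);
            X.insert(X.end(), bj.begin(), bj.end());
        }
    }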

Page 53

GPU Algorithm: A single phase

Phase 1: Compute the sample and sort it
Phase 2: Within each block, figure out the bucket indices for each element. Construct a k-element histogram. Copy to global memory
Phase 3: Do a prefix sum (parallel reduction!) to find global offsets
Phase 4: Distribute items using the global offsets

Page 54

Data-parallel Binary Search

[Animation over ten slides: a network of comparators, each taking inputs a and b and producing min(a, b) and max(a, b) under the test a > 0; sample values such as 5, −3, 6, 9, and 7 flow through the network.]

• In the first comparator, a path is taken or not.

• In the second, both paths are used, but by different data.
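A sketch of such a comparator as a device function (illustrative, not from the slides): both outputs are computed unconditionally, so the "choice" compiles to predicated selects rather than a branch, and the warp never diverges on it.

    __device__ void comparator(int &a, int &b) {
        int lo = min(a, b);   // both results are always computed, so the
        int hi = max(a, b);   // selection is predicated rather than branched
        a = lo;
        b = hi;
    }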

Page 64

Predicated Instructions

[Animation over nine slides: the splitters r1 ... r7 drawn as an implicit binary search tree, with r1 at the root, r2 and r3 below it, and r4 r5 r6 r7 at the leaves, laid out flat in memory as r1 r2 r3 r4 r5 r6 r7.]

    j = 1
    for i = 1 to log 4 do
        j ← 2j + (q > rj)
    end for
    j ← j − 4 + 1

Conditionals are replaced by predicated statements.
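A branch-free sketch of this search (illustrative; instantiated for the seven splitters r1 ... r7 of the figure rather than the k = 4 constants in the pseudocode): the comparison q > r[j] is an ordinary 0/1 value, so each step is plain arithmetic and all threads of a warp execute the same instructions.

    __device__ int bucket_of(int q, const int *r) {
        // r[1..7] is the implicit search tree: r[1] is the root,
        // and the children of node j are nodes 2j and 2j + 1
        int j = 1;
        for (int i = 0; i < 3; i++)   // tree depth log2(8) = 3
            j = 2 * j + (q > r[j]);   // descend left (+0) or right (+1)
        return j - 8 + 1;             // leaves 8..15 map to buckets 1..8
    }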

Page 73

Parallel Prefix Sum for Offsets

[Animation over seven slides: each block writes the number of elements in each of its buckets to global memory; a prefix sum over these counts yields the buckets' global offsets.]
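A sequential sketch of what that scan computes (names are illustrative; on the GPU the scan is itself a parallel, reduction-style kernel): an exclusive prefix sum over the per-bucket counts gives each bucket's starting offset in global memory.

    void exclusive_scan(const int *counts, int *offsets, int k) {
        int sum = 0;
        for (int j = 0; j < k; j++) {
            offsets[j] = sum;     // bucket j starts where buckets 0..j-1 end
            sum += counts[j];
        }
    }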

Page 80

Distribution

[Animation over eight slides: two blocks hold the elements (1, 4, 6, 8) and (2, 0, 9, 5); with splitters (3, 7), their per-block bucket counts are (1, 2, 1) and (2, 1, 1). Prefix sums over the counts give the blocks' starting offsets (0, 0, 0) and (1, 2, 1) and the bucket totals (3, 3, 2); each block then scatters its elements, (1, 4, 6, 8) and then (2, 0, 5, 9), to their global bucket positions.]

Repeat in each block.
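A sketch of the final scatter (illustrative kernel, not from the slides; it claims slots with an atomic counter, so, unlike the per-block bookkeeping in the figure, it does not keep elements in their original order within a bucket):

    __global__ void scatter(const int *in, const int *bucket,
                            int *next_free,   // scanned offsets, one per bucket
                            int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int pos = atomicAdd(&next_free[bucket[i]], 1);  // claim a slot
            out[pos] = in[i];
        }
    }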

Page 88

Convex Hulls

[Animation over eleven slides: incrementally constructing the convex hull of a planar point set.]

GPU QuickHull[SKGN09]

Phase 1 Compute the sample and sort itPhase 2 Within each block, figure out the bucket indices for

each element. Construct k-element histogram. Copy toglobal memory

Phase 3 Do prefix sum (parallel reduction!) to find global offsetsPhase 4 Distribute items using global offset

Page 100: GPU Algorithms III/IV Computing with CUDA · GPU Algorithms III/IV Computing with CUDA MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models Suresh Venkatasubramanian

GPU QuickHull[SKGN09]

Phase 1 Compute the pivot pointPhase 2 Within each block, eliminate elements above the pivot

segment. Store the restPhase 3 Do prefix sum (parallel reduction!) to find global offsetsPhase 4 Distribute items using global offset

Page 101

3D Quick Hull Algorithm

• Instead of line segments, the separating objects are planes.

• As before, points are distributed to the planes that they are "outside" of.

• The operation of extending the hull can create "concave" edges that need to be repaired.

• This is a hybrid CPU-GPU algorithm: distribution happens on the GPU, and the rest happens on the CPU.

Page 110: GPU Algorithms III/IV Computing with CUDA · GPU Algorithms III/IV Computing with CUDA MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models Suresh Venkatasubramanian

Lifting map

3D convex hull ⇔ 2D Delaunay ⇔ 2D Voronoi
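The lifting map itself is embarrassingly parallel: each point (x, y) is lifted to (x, y, x² + y²) on the paraboloid, and the lower hull of the lifted points projects back down to the Delaunay triangulation. A one-kernel sketch (kernel name is an assumption):

__global__ void liftPoints(const float2 *in, float3 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_float3(in[i].x, in[i].y,
                             in[i].x * in[i].x + in[i].y * in[i].y);
}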


Jump Flooding algorithm [RT06]
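Jump flooding computes a discrete Voronoi diagram on an n × n grid: each pixel keeps the closest seed found so far, and in passes with step k = n/2, n/4, . . . , 1 it compares against the seeds stored at its eight neighbors at offset ±k, for O(log n) passes in total. A sketch of one pass (the kernel name and the (-1,-1) "no seed" convention are assumptions):

// in[y*n + x] holds the coordinates of the closest seed found so far.
__global__ void jfaPass(const int2 *in, int2 *out, int n, int k)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= n || y >= n) return;

    int2 best = in[y * n + x];
    float bestD = (best.x < 0) ? 1e30f
                : (float)((x - best.x) * (x - best.x)
                        + (y - best.y) * (y - best.y));
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx * k, ny = y + dy * k;
            if (nx < 0 || nx >= n || ny < 0 || ny >= n) continue;
            int2 s = in[ny * n + nx];
            if (s.x < 0) continue;              // neighbor has no seed yet
            float d = (float)((x - s.x) * (x - s.x)
                            + (y - s.y) * (y - s.y));
            if (d < bestD) { bestD = d; best = s; }
        }
    out[y * n + x] = best;
}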


Challenge

Can we get better algorithms for computing 2D Voronoi diagrams and lower envelopes?

k-means clustering [ZG09]

Find k centers C = {c₁, . . . , cₖ} such that

∑_{p∈P} min_{c∈C} ‖p − c‖²

is minimized.

Where are the expensive computations?

Each point finds its nearest center (O(nk), or O(n log k) if clever).
We compute the centroids of the points in each cluster.
Points are fixed across iterations, but the centroids change.

Standard Implementation: CPU-GPU hybrid

For each point, we need to compute min_{c∈C} ‖p − c‖. This is a trivial parallelization:

One thread for each point (or block appropriately).
In each iteration, compute distances from a single center.
k ≪ n, so this is not expensive.

For each labelling, find the new centers.
This could be done entirely on the GPU, with centers computed via a reduce operation.
Most algorithms don't do it: they copy back to the CPU. (A sketch of the assignment kernel follows.)
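A minimal sketch of the assignment step, one thread per point; the kernel name and the row-major layout of points and centers are assumptions:

// Each thread scans all k centers; k << n keeps this cheap.
__global__ void assignLabels(const float *pts, const float *ctr,
                             int n, int k, int d, int *label)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0;
    float bestD = 1e30f;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {          // squared distance to center c
            float diff = pts[i * d + j] - ctr[c * d + j];
            dist += diff * diff;
        }
        if (dist < bestD) { bestD = dist; best = c; }
    }
    label[i] = best;            // centroid update happens elsewhere
}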

Why irregular access is a problem

GPUs are designed for high arithmetic intensity and SIMT behavior.
Irregular data locality and access (such as with graphs) reduce the benefit of these methods.
To handle sparse data, you need to store the data compactly, and process it efficiently based on the format.

Sparse Matrix-Vector multiplication

y = Ax

Easy if A is dense, with one thread per output entry: O(n²).

If the number of nonzeros in A is small, we would prefer O(nnz(A)) (or linear in the input).

Representations

A =
[ 1 7 0 0
  0 2 8 0
  5 0 3 9
  0 6 0 4 ]

Diagonal form:

data =
[ ∗ 1 7
  ∗ 2 8
  5 3 9
  6 4 ∗ ]

offsets = [−2 0 1]

CSR representation:

ptr     = [0 2 4 7 9]
indices = [0 1 1 2 0 2 3 1 3]
data    = [1 7 2 8 5 3 9 6 4]


GPU SpMV I

CSR representation:

ptr     = [0 2 4 7 9]
indices = [0 1 1 2 0 2 3 1 3]
data    = [1 7 2 8 5 3 9 6 4]

One thread per row (thread ID = row index):

start ← ptr[ID]
end ← ptr[ID + 1]
d ← 0
for i = start to end − 1 do
    d ← d + data[i] · x[indices[i]]
end for
y[ID] ← d
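The same scalar CSR kernel in CUDA, as a sketch (array names follow the slides; the kernel name is an assumption):

__global__ void spmvCsrScalar(const int *ptr, const int *indices,
                              const float *data, const float *x,
                              float *y, int nRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float d = 0.0f;
    for (int i = ptr[row]; i < ptr[row + 1]; ++i)
        d += data[i] * x[indices[i]];   // nonzeros of this row only
    y[row] = d;
}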

GPU SpMV I: Access Patterns

Threads in this kernel access elements haphazardly:

ptr     = [0 2 4 7 9]
indices = [0 1 1 2 0 2 3 1 3]
data    = [1 7 2 8 5 3 9 6 4]

[Figure: rounds 1–3 of the loop. In each round the threads of a warp read entries of indices and data that are far apart, so the loads are uncoalesced.]

GPU SpMV II

New idea: instead of assigning one thread per row, assign one warp per row.

Each thread in a warp sums up a piece of the dot product.
At the end, a parallel reduction combines the pieces of the sum.
Unrolling helps speed up the reduction. (A kernel sketch follows.)

[Figure: the threads of a warp sweep a row in contiguous chunks (thread 0, thread 1, . . . ), finishing with a parallel reduce.]
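A sketch of this warp-per-row ("vector") kernel, using a modern warp-shuffle reduction where the original implementations used shared memory; the kernel name and the one-warp-per-row launch layout are assumptions:

__global__ void spmvCsrVector(const int *ptr, const int *indices,
                              const float *data, const float *x,
                              float *y, int nRows)
{
    int lane = threadIdx.x & 31;                 // position within the warp
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) >> 5; // one warp per row
    if (row >= nRows) return;

    float sum = 0.0f;
    for (int i = ptr[row] + lane; i < ptr[row + 1]; i += 32)
        sum += data[i] * x[indices[i]];          // coalesced 32-wide chunks

    for (int off = 16; off > 0; off >>= 1)       // warp-level reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[row] = sum;
}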

Compute a BFS ordering of a graph [MGG12]

[Figure: example graph on vertices 1–7, explored one frontier at a time from the source.]

• BFS is notoriously hard to parallelize well

• Main challenge is managing the nonuniform vertex and edge frontier

Matrix View

If A is the adjacency matrix of a graph, and x is a vector representing the current vertex frontier, then

y = xᵀA

is the new frontier.

This is under the (min,+) algebra, rather than the (+,×) of regular matrix operations.
Methods from sparse matrix multiplication can be used here.
But this is very expensive.

Plan:
Replace the matrix view by a "parallel" frontier expansion.
Carefully manage duplicate neighbors.

Better Implementation I

How do we construct the new frontier from the current frontier?

Every thread manages one node, and counts its neighbors.
Once we have all the neighbor counts, we invoke a prefix sum to get write offsets.
Now threads can write the new frontier into shared memory. (A sketch follows.)
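A minimal sketch of this two-phase expansion, reusing the CSR arrays from the SpMV slides and Thrust (listed later under Software Tools) for the prefix sum; names are assumptions, and real implementations also filter already-visited vertices and work per-block in shared memory:

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Phase 1: one thread per frontier node, count its neighbors.
__global__ void countNeighbors(const int *ptr, const int *frontier,
                               int m, int *cnt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) {
        int v = frontier[t];
        cnt[t] = ptr[v + 1] - ptr[v];     // degree of frontier node v
    }
}

// Phase 2: each thread writes its neighbors at its scanned offset.
__global__ void writeNeighbors(const int *ptr, const int *indices,
                               const int *frontier, const int *offset,
                               int m, int *next)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < m) {
        int v = frontier[t], o = offset[t];
        for (int i = ptr[v]; i < ptr[v + 1]; ++i)
            next[o++] = indices[i];       // may contain duplicates (next slide)
    }
}

// Host side: an exclusive scan over the counts yields the write offsets.
void makeOffsets(thrust::device_vector<int> &cnt,
                 thrust::device_vector<int> &offset)
{
    thrust::exclusive_scan(cnt.begin(), cnt.end(), offset.begin());
}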

Better Implementation II

The new frontier can have many duplicates in it, if individual threads share neighbors.

Idea:
When a thread writes a neighbor, it first hashes it and checks for a collision.
If there is a collision, then the neighbor has already been written.
This is called "warp-culling". (A sketch follows.)

Using a combination of these and other ideas yields a fast BFS with linear work.
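A device-side sketch of our reading of the hash-based culling trick (not the paper's exact code): threads dump candidate neighbors into a small scratch table and drop copies that lose the race. It relies on warp-synchronous execution; post-Volta code would add __syncwarp() between the steps.

#define CULL_SIZE 128   // assumed scratch-table size (power of two)

__device__ bool isDuplicate(int v, volatile int *scratch)
{
    int slot = v & (CULL_SIZE - 1);   // cheap hash: low bits of vertex id
    scratch[slot] = v;                // benign race: one writer wins
    if (scratch[slot] != v)
        return false;                 // slot taken by another vertex: keep v
    scratch[slot] = threadIdx.x;      // tag the slot with our thread id
    return scratch[slot] != (int)threadIdx.x;  // another thread owns v
}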

Graph Coloring

A core problem in graph optimization:
register allocation, spectrum assignment, scheduling, . . .

What do we know about it

NP-hard, and n^(1−ε)-hard to approximate.
Many heuristics, based on greedy ordering (a host-side sketch follows):

Fix an arbitrary ordering of the vertices.
Color each vertex with the smallest feasible color.

[Figure: the greedy heuristic run on an example graph with vertices 1–7, one vertex at a time.]
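A host-side sketch of this greedy heuristic (plain C++, with a CSR-style adjacency list assumed):

#include <vector>

std::vector<int> greedyColor(const std::vector<int> &ptr,
                             const std::vector<int> &adj)
{
    int n = (int)ptr.size() - 1;
    std::vector<int> color(n, -1);
    std::vector<char> used(n, 0);            // scratch: colors seen at v
    for (int v = 0; v < n; ++v) {            // fixed (arbitrary) ordering
        for (int i = ptr[v]; i < ptr[v + 1]; ++i)
            if (color[adj[i]] >= 0) used[color[adj[i]]] = 1;
        int c = 0;
        while (used[c]) ++c;                 // smallest feasible color
        color[v] = c;
        for (int i = ptr[v]; i < ptr[v + 1]; ++i)   // reset scratch
            if (color[adj[i]] >= 0) used[color[adj[i]]] = 0;
    }
    return color;
}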

Ordering Heuristics

First Fit: choose any ordering.
SDO/LDO: color is allocated to the vertex with the highest "saturation" (number of distinct neighboring colors) and then the highest degree.
MAX OUT: choose the vertex that has the maximum number of edges going out of the subgraph.
MIN OUT: choose the vertex that has the fewest edges going out of the subgraph.

Parallel Coloring

1. Partition the graph into roughly equal-sized pieces that have very few connections between them.
2. Color each piece in parallel.
3. Fix conflicts at boundaries of pieces.

The GPU has fine-grained parallelism and faster SIMD processors. Can we do better?

GPU Coloring [GZL+11]

1. Arbitrarily partition vertices into pieces.
2. Color each piece in a thread block, but use a common color pool.
3. Suppose conflicts occur (at the boundary):
   Erase the colors of conflicted nodes.
   Try to color them again.
   Repeat until the number of conflicts is small.
   Then shift to the CPU. (A conflict-check sketch follows.)
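A sketch of the conflict-checking step, which SIMD makes easy (as the summary notes): one thread per vertex, with an id-based tie-break so only one endpoint of a conflicting edge recolors. Names are assumptions:

__global__ void findConflicts(const int *ptr, const int *adj,
                              const int *color, int n, int *conflict)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int bad = 0;
    for (int i = ptr[v]; i < ptr[v + 1]; ++i) {
        int u = adj[i];
        // tie-break on vertex id: only the larger endpoint re-colors
        if (color[u] == color[v] && u < v) bad = 1;
    }
    conflict[v] = bad;    // conflicted vertices get erased and retried
}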

Summary

GPU SIMD makes conflict checking very easy.
Doing careful partitioning (METIS) doesn't really help (the GPU is more tolerant of "bad" partitioning).
The CPU is very slow at resolving conflicts sequentially: it is best used when the number of conflicts is small.
GPU heuristics give good quality colorings (not sure why!).

Software Tools

CUDPP (http://code.google.com/p/cudpp): basic data-parallel tools
CUBLAS (http://developer.nvidia.com/cuda/cublas)
Thrust (https://code.google.com/p/thrust/): template library
Cusp (http://code.google.com/p/cusp-library/): sparse linear algebra

Research Tools

[Slide content not recovered.]

Overview of lectures

The GPU in the BC (before CUDA) era: vertex and fragment shaders. Can do Voronoi diagrams!
The SIMD view is key to designing for, and exploiting the behavior of, the card.
CUDA provides a general-purpose SIMD framework.
Low-level SIMD violations can lose many factors in performance.
Parallel reduction and prefix sum are important primitives.
Many applications: dense systems, sparse systems, geometry, . . .
For efficient code, reduce to known primitives like reduction/prefix sum.

Debate

The GPU represents the realistic future of high-intensity parallel computing. SIMD is the only way to get the throughput needed for many problems, and once memory buses become faster, GPUs will become the primary model.

Versus

While the GPU can demonstrate great performance, the hoops you have to jump through to get this performance are so constraining and so artificial that GPUs will never be more than a boutique processor that is great for games.


Questions?


References

A. V. P. Grosset, P. Zhu, S. Liu, S. Venkatasubramanian, and M. Hall. Evaluating graph coloring on GPUs. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 297–298. ACM, 2011.

M. Harris, G. E. Blelloch, B. M. Maggs, N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, D. Manocha, P. K. Smolarkiewicz, L. G. Margolin, et al. Optimizing parallel reduction in CUDA. Proc. of ACM SIGMOD, 21, 13:104–110, 2007.

N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–10. IEEE, 2010.

Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), pages 117–128, New York, NY, USA, 2012. ACM.

J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008.

G. Rong and T.-S. Tan. Jump flooding in GPU with applications to Voronoi diagram and distance transform. In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, pages 109–116. ACM, 2006.

S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 97–106. Eurographics Association, 2007.

D. P. R. Srikanth, K. Kothapalli, R. Govindarajulu, and P. J. Narayanan. Parallelizing two dimensional convex hull on NVIDIA GPU and Cell BE. In International Conference on High Performance Computing (HiPC), pages 1–5, 2009.

M. Zechner and M. Granitzer. Accelerating k-means on the graphics processor via CUDA. In First International Conference on Intensive Applications and Services (INTENSIVE '09), pages 7–15. IEEE, 2009.