~rupesh - indian institute of technology madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf ·...

62
GPU Programming Rupesh Nasre. http://www.cse.iitm.ac.in/~rupesh IIT Madras July 2017

Upload: others

Post on 12-Aug-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

GPU Programming

Rupesh Nasre.http://www.cse.iitm.ac.in/~rupesh

IIT MadrasJuly 2017

Page 2: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Let's Sync-up!

● Print numbers 1 to 10 in sequence.● Launch kernel with 3 threads.● Each kernel prints threadIdx.x modulo 3.

__global__ void onetoten() { for (int ii = 0; ii < 10; ++ii) if (ii % 3 == threadIdx.x) printf("%d: %d\n", threadIdx.x, ii);}

__global__ void onetoten() { for (int ii = 0; ii < 10; ++ii) if (ii % 3 == threadIdx.x) printf("%d: %d\n", threadIdx.x, ii);}

__global__ void onetoten() { __shared__ int n; n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } __syncthreads(); }}

__global__ void onetoten() { __shared__ int n; n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } __syncthreads(); }}

Page 3: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Let's Sync-up!

● Print numbers 1 to 10 in sequence.● Launch kernel with 3 threads.● Each kernel prints threadIdx.x modulo 3.

__device__ volatile int n;

__global__ void onetoten() { n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } }}

__device__ volatile int n;

__global__ void onetoten() { n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } }}

__global__ void onetoten() { volatile __shared__ int n; n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } }}

__global__ void onetoten() { volatile __shared__ int n; n = 0; __syncthreads(); while (n < 10) { if (n % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, n); ++n; } }}

Page 4: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Let's Sync-up!

● Print numbers 1 to 10 in sequence.● Launch kernel with 3 threads.● Each kernel prints threadIdx.x modulo 3.

__global__ void onetoten() { __shared__ unsigned int n; n = 0; __syncthreads();

while (n < 10) { int oldn = atomicInc(&n, 100); if (oldn % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, oldn); } }}

__global__ void onetoten() { __shared__ unsigned int n; n = 0; __syncthreads();

while (n < 10) { int oldn = atomicInc(&n, 100); if (oldn % 3 == threadIdx.x) { printf("%d: %d\n", threadIdx.x, oldn); } }}

Note that some of these codes are faulty.

Page 5: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Let's Compute the Shortest Paths● You are given an input graph of

India, and you want to compute the shortest path from Nagpur to every other city.

● Assume that you are given a GPU graph library and the associated routines.

● Each thread operates on a node and settles distances of the neighbors (Bellman-Ford style).

aa

cc

bb dd

73

4

gg

ff

ee

__global__ void dsssp(Graph g, unsigned *dist) {unsigned id = …for each n in g.allneighbors(id) { // pseudo-code.

unsigned altdist = dist[id] + weight(id, n);if (altdist < dist[n]) {

dist[n] = altdist;} } }

__global__ void dsssp(Graph g, unsigned *dist) {unsigned id = …for each n in g.allneighbors(id) { // pseudo-code.

unsigned altdist = dist[id] + weight(id, n);if (altdist < dist[n]) {

dist[n] = altdist;} } }

What is the error in this code?

Page 6: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Data Race

● A datarace occurs if all of the following hold:

1. Multiple threads

2. Common memory location

3. At least one write

4. Concurrent execution

● Ways to remove datarace:

1. Execute sequentially

2. Privatization / Data replication

3. Separating reads and writes by a barrier

4. Mutual exclusion

Page 7: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

Classwork

flag = 1;while (flag)

;S1;

flag = 1;while (flag)

;S1;

while (!flag);

S2;flag = 0;

while (!flag);

S2;flag = 0;

T1 T2

● Is there a datarace in this code?● What does the code ensure?● Can you ensure the same with barriers? ● Can you ensure the same with atomics?● Generalize it for N threads.

Page 8: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

8

Synchronization

● Atomics● Barriers● Control + data flow● ...

Page 9: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

9

atomics

● Atomics are primitive operations whose effects are visible either none or fully (never partially).

● Need hardware support.

● Several variants: atomicCAS, atomicMin, atomicAdd, ...

● Work with both global and shared memory.

Page 10: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

10

atomics

__global__ void dkernel(int *x) {++x[0];

}…dkernel<<<1, 2>>>(x);

__global__ void dkernel(int *x) {++x[0];

}…dkernel<<<1, 2>>>(x);

After dkernel completes,what is the value of x[0]?

++x[0] is equivalent to:

Load x[0], R1Increment R1Store R1, x[0]

++x[0] is equivalent to:

Load x[0], R1Increment R1Store R1, x[0]

Load x[0], R1 Load x[0], R2

Increment R1 Increment R2

Store R2, x[0]

Store R1, x[0]

Load x[0], R1 Load x[0], R2

Increment R1 Increment R2

Store R2, x[0]

Store R1, x[0]Tim

e

Final value stored in x[0] could be 1 (rather than 2).What if x[0] is split into multiple instructions? What if there are more threads?

Page 11: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

11

atomics

● Ensure all-or-none behavior.

– e.g., atomicInc(&x[0], ...);

● dkernel<<<K1, K2>>> would ensure x[0] to be incremented by exactly K1*K2 – irrespective of the thread execution order.

– When would this effect be visible?

__global__ void dkernel(int *x) {++x[0];

}…dkernel<<<1, 2>>>(x);

__global__ void dkernel(int *x) {++x[0];

}…dkernel<<<1, 2>>>(x);

Page 12: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

12

Let's Compute the Shortest Paths● You are given an input graph of

India, and you want to compute the shortest path from Nagpur to every other city.

● Assume that you are given a GPU graph library and the associated routines.

● Each thread operates on a node and settles distances of the neighbors (Bellman-Ford style).

aa

cc

bb dd

73

4

gg

ff

ee

__global__ void dsssp(Graph g, unsigned *dist) {unsigned id = …for each n in g.allneighbors(id) { // pseudo-code.

unsigned altdist = dist[id] + weight(id, n);if (altdist < dist[n]) {

dist[n] = altdist; atomicMin(&dist[n], altdist);} } }

__global__ void dsssp(Graph g, unsigned *dist) {unsigned id = …for each n in g.allneighbors(id) { // pseudo-code.

unsigned altdist = dist[id] + weight(id, n);if (altdist < dist[n]) {

dist[n] = altdist; atomicMin(&dist[n], altdist);} } }

Page 13: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

13

Classwork

1. Compute sum of all elements of an array.

2. Find the maximum element in an array.

3. Each thread adds elements to a worklist.● e.g., next set of nodes to be processed in SSSP.

Page 14: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

14

AtomicCAS

● Syntax: oldval = atomicCAS(&var, x, y);

● Typical usecases:● Locks (critical section processing)● Single● Other atomic variants

Classwork: Implement single with atomicCAS.

Page 15: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

15

Barriers

● A barrier is a program point where all threads need to reach before any thread can proceed.

● End of kernel is an implicit barrier for all GPU threads (global barrier).

● There is no explicit global barrier supported in CUDA.

● Threads in a thread-block can synchronize using __syncthreads().

● How about barrier within warp-threads?

Page 16: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

16

Barriers

__global__ void dkernel(unsigned *vector, unsigned vectorsize) { unsigned id = blockIdx.x * blockDim.x + threadIdx.x; vector[id] = id; __syncthreads(); if (id < vectorsize - 1 && vector[id + 1] != id + 1) printf("syncthreads does not work.\n");}

S1

S2

S1 S1 S1 S1

S2 S2 S2 S2 S1 S1 S1 S1

S2 S2 S2 S2

Tim

e

Thread block

Thread block

Page 17: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

17

Barriers● __syncthreads() is not only about control synchronization, it

also has data synchronization mechanism.● It performs a memory fence operation.

– A memory fence ensures that the writes from a thread are made visible to other threads.

– __syncthreads() executes a fence for all the block-threads.● There is a separate __threadfence_block() instruction also.

Then, there is __threadfence().● [In general] A fence does not ensure that other thread will read

the updated value.

– This can happen due to caching.

– The other thread needs to use volatile data.● [In CUDA] a fence applies to both read and write.

Page 18: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

18

Classwork

● Write a CUDA kernel to find maximum over a set of elements, and then let thread 0 print the value in the same kernel.

● Each thread is given work[id] amount of work. Find average work per thread and if a thread's work is above average + K, push extra work to a worklist.– This is useful for load-balancing.

– Also called work-donation.

Page 19: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

19

Synchronization

● Atomics● Barriers● Control + data flow● ...

while (!flag) ;S1;

while (!flag) ;S1;

S2;flag = true;

S2;flag = true;

Initially, flag == false.

Page 20: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

20

Reductions

● What are reductions?● Computation properties required.● Complexity measures

Input: 4 3 9 3 5 7 3 2

7 12 12 5

19 17

36

barrierlog(n) steps

n numbers

Output:

Classwork: Write the reduction code.

Page 21: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

21

Reductions

Input: 4 3 9 3 5 7 3 2

7 12 12 5

19 17

36

barrierlog(n) steps

n numbers

Output:

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

Page 22: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

22

Reductions for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

● Write the reduction such that thread i sums a[i] and a[i + n/2].● Assuming each a[i] is a character, find a concatenated string using reduction.● String concatenation cannot be done using a[i] and a[i + n/2],

but computing sum was possible; why?● What other operations can be cast as reductions?

Page 23: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

23

Prefix Sum

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 34 36OROutput: 0 4 7 16 19 24 31 34

● Imagine threads wanting to push work-items to a central worklist.

● Each thread pushes different number of work-items.

● This can be computed using atomics or prefix sum (also called as scan).

Classwork: Write the prefix-sum code.

Page 24: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

24

Prefix Sum

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 33 35OROutput: 0 4 7 16 19 24 31 33

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

for (int off = n; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

for (int off = n; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + off]; } __syncthreads(); }

Page 25: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

25

Prefix Sum

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 33 35OROutput: 0 4 7 16 19 24 31 33

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + (n - off)]; } __syncthreads(); }

for (int off = n/2; off; off /= 2) { if (threadIdx.x < off) { a[threadIdx.x] += a[threadIdx.x + (n - off)]; } __syncthreads(); }

for (int off = 0; off < n; off *= 2) { if (threadIdx.x > off) { a[threadIdx.x] += a[threadIdx.x - off]; } __syncthreads(); }

for (int off = 0; off < n; off *= 2) { if (threadIdx.x > off) { a[threadIdx.x] += a[threadIdx.x - off]; } __syncthreads(); }

Page 26: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

26

Prefix Sum

x0

x1

x2

x3

x4

x5

x6

x7

(x0..x

0) (x

0..x

1) (x

0..x

2) (x

0..x

3) (x

0..x

4) (x

0..x

5) (x

0..x

6) (x

0..x

7)

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 33 35OROutput: 0 4 7 16 19 24 31 33

Page 27: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

27

Prefix Sum

x0

x1

x2

(x0..x

0) (x

0..x

1) (x

1..x

2)

(x0..x

0) (x

0..x

1) (x

0..x

2)

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 33 35OROutput: 0 4 7 16 19 24 31 33

Page 28: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

28

Prefix Sum

x0

x1

x2

x3

x4

x5

x6

x7

(x0..x

0) (x

0..x

1) (x

0..x

2) (x

0..x

3) (x

0..x

4) (x

0..x

5) (x

0..x

6) (x

0..x

7)

(x0..x

0) (x

0..x

1) (x

1..x

2) (x

2..x

3) (x

3..x

4) (x

4..x

5) (x

5..x

6) (x

6..x

7)

Iter

ations

12

3

(x0..x

0) (x

0..x

1) (x

0..x

2) (x

0..x

3) (x

1..x

4) (x

2..x

5) (x

3..x

6) (x

4..x

7)

Input: 4 3 9 3 5 7 3 2Output: 4 7 16 19 24 31 33 35OROutput: 0 4 7 16 19 24 31 33

Page 29: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

29

Prefix Sum for (int off = 1; off < n; off *= 2) { if (threadIdx.x > off) { a[threadIdx.x] += a[threadIdx.x - off]; } __syncthreads(); }

for (int off = 1; off < n; off *= 2) { if (threadIdx.x > off) { a[threadIdx.x] += a[threadIdx.x - off]; } __syncthreads(); }

for (int off = 0; off < n; off *= 2) { if (threadIdx.x > off) { tmp = a[threadIdx.x – off];

__syncthreads(); a[threadIdx.x] += tmp;

} __syncthreads(); }

for (int off = 0; off < n; off *= 2) { if (threadIdx.x > off) { tmp = a[threadIdx.x – off];

__syncthreads(); a[threadIdx.x] += tmp;

} __syncthreads(); }

Datarace

SeparatingR and Win time

Page 30: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

30

Prefix Sum for (int off = 1; off < n; off *= 2) { if (threadIdx.x >= off) { tmp = a[threadIdx.x - off]; } __syncthreads();

if (threadIdx.x >= off) { a[threadIdx.x] += tmp; } __syncthreads(); }

for (int off = 1; off < n; off *= 2) { if (threadIdx.x >= off) { tmp = a[threadIdx.x - off]; } __syncthreads();

if (threadIdx.x >= off) { a[threadIdx.x] += tmp; } __syncthreads(); }

Homework: Can one of the barriers be avoided?

Page 31: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

31

Application of Prefix Sum

● Assuming that you have the prefix sum kernel, insert elements into the worklist.– Each thread inserts nelem[tid] many elements.

– The order of elements is not important.

– You are forbidden to use atomics.

● Computing cumulative sum– Histogramming

– Area under the curve

Page 32: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

32

Concurrent Data Structures

● Array– atomics for index update

– prefix sum for coarse insertion

● Singly linked list– insertion

– deletion [marking, actual removal]

GG PP PP UU

GG

Page 33: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

33

Concurrent Data Structures

GG PP PP UU

GG

struct node {char item;struct node *next;

};

struct node {char item;struct node *next;

};

G->next = P2;P1->next = G;

G->next = P2;P1->next = G;

Type

def

init

ion

Sequ

enti

al in

sert

How to execute thetwo instructions

atomically?

How to execute thetwo instructions

atomically?

Page 34: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

34

Concurrent Linked List

GG PP PP UU

GG

Solution 1: Keep a lock with the list.– Coarse-grained synchronization

– Low concurrency / sequential access

– Easy to implement

– Easy to argue about correctness

Classwork: Implementlock() and unlock().

Classwork: Implementlock() and unlock().

Page 35: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

35

lock() and unlock()

void lock(List &list) {while (list.sema == 1)

;list.sema = 1;

}void unlock(List &list) {

list.sema = 0;}

void lock(List &list) {while (list.sema == 1)

;list.sema = 1;

}void unlock(List &list) {

list.sema = 0;}

time

T1 T2 T3

sema = 1

-- CS --

sema = 0

sema == 1 sema == 1

sema = 1 sema = 1

-- CS -- -- CS --

What is the problem here? How to fix it?

Page 36: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

36

lock() and unlock()

void lock(List &list) {atomicCAS(&list.sema, 0, 1);

}void unlock(List &list) {

atomicCAS(&list.sema, 1, 0);}

void lock(List &list) {atomicCAS(&list.sema, 0, 1);

}void unlock(List &list) {

atomicCAS(&list.sema, 1, 0);}

void lock(List &list) {do {

old = atomicCAS(&list.sema, 0, 1);} while (old == 1);

}void unlock(List &list) {

list.sema = 0;}

void lock(List &list) {do {

old = atomicCAS(&list.sema, 0, 1);} while (old == 1);

}void unlock(List &list) {

list.sema = 0;}

Page 37: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

37

Concurrent Linked List

GG PP PP UU

GG

Solution 2: Keep a lock with each node.– Fine-grained synchronization

– Better concurrency

– Moderately difficult to implement, need to finalize the supported operations

– Difficult to argue about correctness when multiple nodes are involved

Classwork: Implementinsert().

Classwork: Implementinsert().

Classwork: Check iftwo concurrentinserts work.

Classwork: Check iftwo concurrentinserts work.

Page 38: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

38

Concurrent Linked List

GG PP PP UU

GG

Classwork: Implementinsert().

Classwork: Implementinsert().

Classwork: Check iftwo concurrentinserts work.

Classwork: Check iftwo concurrentinserts work.

void insert(Node &prev, Node &naya) {naya.next = prev.next;prev.lock();prev.next = naya;prev.unlock();

}

void insert(Node &prev, Node &naya) {naya.next = prev.next;prev.lock();prev.next = naya;prev.unlock();

}

HH

T1T2

T1

Page 39: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

39

Concurrent Linked List

GG PP PP UU

GG

Classwork: Implementinsert().

Classwork: Implementinsert().

Classwork: Check iftwo concurrentinserts work.

Classwork: Check iftwo concurrentinserts work.

void insert(Node &prev, Node &naya) {naya.next = prev.next;prev.lock();prev.next = naya;prev.unlock();

}

void insert(Node &prev, Node &naya) {naya.next = prev.next;prev.lock();prev.next = naya;prev.unlock();

}

HH

T1T2

T1

T2

Page 40: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

40

Concurrent Linked List

GG PP PP UU

GG

Classwork: Implementinsert().

Classwork: Implementinsert().

Classwork: Check iftwo concurrentinserts work.

Classwork: Check iftwo concurrentinserts work.

Classwork: Now allowremove().

Classwork: Now allowremove().

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}

Page 41: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

41

GG PP PP UU

GG

Classwork: Implementinsert().

Classwork: Implementinsert().

Classwork: Check iftwo concurrentinserts work.

Classwork: Check iftwo concurrentinserts work.

Classwork: Now allowremove().

Classwork: Now allowremove().

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

prev.lock();tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

prev.lock();tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

If the order of unlocks is reversed?

If the order of locks is reversed?

Page 42: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

42

GG PP PP UUGG

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

tbr.lock();prev.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

tbr.lock();prev.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

If the order of locks is reversed?

T1T2

T1T2T2 T1

Page 43: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

43

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

prev.lock();tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

void insert(Node &prev, Node &naya) {prev.lock();naya.next = prev.next;prev.next = naya;prev.unlock();

}void remove(Node &prev, Node &tbr) {

prev.lock();tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

Isn't a similar issuepossible with our current

implementation?

GG PP PP UUGG

T2T1

GG PP PP UUGG

Expected GPU,received GGPU.

Where is the problem?Where is the problem?

Page 44: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

44

void remove(Node &prev) {prev.lock();tbr = prev.next;tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

void remove(Node &prev) {prev.lock();tbr = prev.next;tbr.lock();prev.next = tbr.next;// process tbr.tbr.unlock();prev.unlock();

}

GG PP PP UUGG

T2T1

GG PP PP UUGG

This would solve the problemof tbr, but what about

remove(P) and remove(P)executing concurrently?

This requires us to checkfor a node's validity.

This requires us to checkfor a node's validity.

Page 45: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

45

int remove(Node &prev) { if (prev.valid == 0) return -1;

prev.lock();tbr = prev.next;tbr.lock();prev.next = tbr.next;// process tbr.tbr.valid = 0;tbr.unlock();prev.unlock();

}

int remove(Node &prev) { if (prev.valid == 0) return -1;

prev.lock();tbr = prev.next;tbr.lock();prev.next = tbr.next;// process tbr.tbr.valid = 0;tbr.unlock();prev.unlock();

}

GG PP PP UUGG

T2T1

GG PP PP UUGG

Checking for validity and

locking needs to be atomic!

● Memory is not reclaimed!● Only insert and remove!● No traversal yet!● Direct pointer is given!● Still so many complications!

● Memory is not reclaimed!● Only insert and remove!● No traversal yet!● Direct pointer is given!● Still so many complications!

Page 46: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

46

int remove(Node &prev) {prev.lock();

if (prev.valid == 0) return -1;tbr = prev.next;tbr.lock();if (tbr.valid == 0) return -2;prev.next = tbr.next;// process tbr.tbr.valid = 0;tbr.unlock();prev.unlock();

}

int remove(Node &prev) {prev.lock();

if (prev.valid == 0) return -1;tbr = prev.next;tbr.lock();if (tbr.valid == 0) return -2;prev.next = tbr.next;// process tbr.tbr.valid = 0;tbr.unlock();prev.unlock();

}

GG PP PP UUGG

T2T1

GG PP PP UUGG

Does not unlock on error.

Page 47: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

47

int remove(Node &prev) {prev.lock();

if (prev.valid) {tbr = prev.next;tbr.lock();if (tbr.valid) {

prev.next = tbr.next;// process tbr.tbr.valid = 0;

}tbr.unlock();

}prev.unlock();

}

int remove(Node &prev) {prev.lock();

if (prev.valid) {tbr = prev.next;tbr.lock();if (tbr.valid) {

prev.next = tbr.next;// process tbr.tbr.valid = 0;

}tbr.unlock();

}prev.unlock();

}

GG PP PP UUGG

T2T1

GG PP PP UUGG

Homework: Find outissues with this code.Homework: Find out

issues with this code.

Page 48: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

48

Concurrent Linked List

GG PP PP UU

GG

Solution 3: Use atomics to insert.– Possible only in a few cases

– Difficult to implement with multiple inserts

– Difficult to prove correctness

Classwork: Implementinsert().

Classwork: Implementinsert().

Page 49: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

49

void remove(Node &prev) {do {

tbr = prev.next;old = atomicCAS(&prev.next, tbr, tbr.next);

} while (old != tbr);}

void remove(Node &prev) {do {

tbr = prev.next;old = atomicCAS(&prev.next, tbr, tbr.next);

} while (old != tbr);}

void insert(Node &prev, Node &naya) {do {

naya.next = prev.next;old = atomicCAS(&prev.next, naya.next, &naya);

} while (old != naya.next);}

void insert(Node &prev, Node &naya) {do {

naya.next = prev.next;old = atomicCAS(&prev.next, naya.next, &naya);

} while (old != naya.next);}

GG PP PP UU

GG

Dare to support remove?

Mostly works.Problem with multipleconcurrent remove(P).

Mostly works.Problem with multipleconcurrent remove(P).

Page 50: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

50

CPU-GPU Synchronization

● While GPU is busy doing work, CPU may perform useful work.

● If CPU-GPU collaborate, they require synchronization.

Classwork: Implementa functionality to print sequence 0..10.

CPU prints even numbers,GPU prints odd.

Classwork: Implementa functionality to print sequence 0..10.

CPU prints even numbers,GPU prints odd.

Page 51: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

51

CPU-GPU Synchronization#include <cuda.h>#include <stdio.h>

__global__ void printk(int *counter) { ++*counter; printf("\t%d\n", *counter);}int main() { int hcounter = 0, *counter;

cudaMalloc(&counter, sizeof(int));

do { printf("%d\n", hcounter); cudaMemcpy(counter, &hcounter, sizeof(int), cudaMemcpyHostToDevice); printk <<<1, 1>>>(counter); cudaMemcpy(&hcounter, counter, sizeof(int), cudaMemcpyDeviceToHost); } while (++hcounter < 10);

return 0;}

#include <cuda.h>#include <stdio.h>

__global__ void printk(int *counter) { ++*counter; printf("\t%d\n", *counter);}int main() { int hcounter = 0, *counter;

cudaMalloc(&counter, sizeof(int));

do { printf("%d\n", hcounter); cudaMemcpy(counter, &hcounter, sizeof(int), cudaMemcpyHostToDevice); printk <<<1, 1>>>(counter); cudaMemcpy(&hcounter, counter, sizeof(int), cudaMemcpyDeviceToHost); } while (++hcounter < 10);

return 0;}

Page 52: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

52

Pinned Memory

● Typically, memories are pageable (swappable).● CUDA allows to make host memory pinned.● CUDA allows direct access to pinned host

memory from device.

● cudaHostAlloc(&pointer, size, 0);

Classwork: Implementthe same functionality to print sequence 0..10.

CPU prints even numbers,GPU prints odd.

Classwork: Implementthe same functionality to print sequence 0..10.

CPU prints even numbers,GPU prints odd.

Page 53: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

53

Pinned Memory#include <cuda.h>#include <stdio.h>

__global__ void printk(int *counter) { ++*counter; printf("\t%d\n", *counter);}int main() { int *counter;

cudaHostAlloc(&counter, sizeof(int), 0);

do { printf("%d\n", *counter); printk <<<1, 1>>>(counter); cudaDeviceSynchronize(); ++*counter; } while (*counter < 10);

cudaFreeHost(counter); return 0;}

#include <cuda.h>#include <stdio.h>

__global__ void printk(int *counter) { ++*counter; printf("\t%d\n", *counter);}int main() { int *counter;

cudaHostAlloc(&counter, sizeof(int), 0);

do { printf("%d\n", *counter); printk <<<1, 1>>>(counter); cudaDeviceSynchronize(); ++*counter; } while (*counter < 10);

cudaFreeHost(counter); return 0;}

No cudaMempcy!

Classwork: Can we avoid repeated kernel calls?

Classwork: Can we avoid repeated kernel calls?

Page 54: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

54

Persistent Kernels__global__ void printk(int *counter) { do { while (*counter % 2) ; ++*counter; printf("\t%d\n", *counter); } while (*counter < 10);}int main() { int *counter;

cudaHostAlloc(&counter, sizeof(int), 0); printk <<<1, 1>>>(counter);

do { printf("%d\n", *counter); while (*counter % 2 == 0) ; ++*counter; } while (*counter < 10);

cudaFreeHost(counter); return 0;}

__global__ void printk(int *counter) { do { while (*counter % 2) ; ++*counter; printf("\t%d\n", *counter); } while (*counter < 10);}int main() { int *counter;

cudaHostAlloc(&counter, sizeof(int), 0); printk <<<1, 1>>>(counter);

do { printf("%d\n", *counter); while (*counter % 2 == 0) ; ++*counter; } while (*counter < 10);

cudaFreeHost(counter); return 0;}

Page 55: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

55

Extra

Page 56: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

56

Barrier-based Synchronization

Disjoint accesses

Overlapping accesses

Benign overlaps

O(e) atomics

O(t) atomics

O(log t) barriers

atomic per element

atomic per thread

prefix-sum

...

Consider threads pushing elements into a worklist

Page 57: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

57

Barrier-based Synchronization

Disjoint accesses

Overlapping accesses

Benign overlaps

...

atomic per element

non-atomic mark

prioritized mark

check

Race and resolve

Race and resolve

AND

ORnon-atomic mark

check

e.g., for owning cavities in Delaunay mesh refinement

e.g., for inserting unique elements into a worklist

Consider threads trying to own a set of elements

Page 58: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

58

Barrier-based Synchronization

Disjoint accesses

Overlapping accesses

Benign overlaps

...

with atomics

without atomics

e.g., level-by-level breadth-first search

Consider threads updating shared variables to the same value

Page 59: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

59

Exploiting Algebraic Properties

Monotonicity Idempotency Associativity

1010

33 44

2 3

tfive tseven

77

33 44

2 3

tfive tseven

55

33 44

2 3

tfive tseven

Atomic-free update Lost-update problem Correction by topology-driven processing, exploiting monotonicity

Consider threads updating distances in shortest paths computation

Page 60: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

60

Exploiting Algebraic Properties

Monotonicity Idempotency Associativity

zz

bb cc

t2 t3

aa dd

t1 t4 zz zz zz zz

worklist zz

pp

qqrr

t5, t6, t7,t8

Consider threads updating distances in shortest paths computation

Update by multiple threads Multiple instances of a node in the worklist

Same node processed by multiple threads

Page 61: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

61

Exploiting Algebraic Properties

Monotonicity Idempotency Associativity

zz

bb cc

t2 t3

aa dd

t1 t4x

y z,v

m,n

x,y,z,v,m,n

Consider threads pushing information to a node

Associativity helps push information using prefix-sum

Page 62: ~rupesh - Indian Institute of Technology Madrasrupesh/teaching/gpu/aug17/4-synchronization.pdf · 12 Let's Compute the Shortest Paths You are given an input graph of India, and you

62

Scatter-Gather

O(e) atomics

O(t) atomics

O(log t) barriers

atomic per element

atomic per thread

prefix-sum

...

scatter

gather

Consider threads pushing elements into a worklist