gpu programming on cpu - using c++amp

GPU Programming on CPUs

Using C++AMP

Miller Lee

Outline

1. Introduction to C++AMP
2. Introduction to Tiling
3. tile_static
4. barrier.wait and solutions
   a. C++11 thread
   b. setjmp/longjmp
   c. ucontext

Computing example

● Simple matrix multiplication (homogeneous coordinates)

[Figure: the 4×4 Matrix A, with elements indexed (0, 0) through (3, 3), multiplied by the vector b to give the vector result]

C++ Version

int A[4][4];
int b[4];
int result[4];
for (int i = 0; i < 4; i++) {
    result[i] = 0;
    for (int j = 0; j < 4; j++)
        result[i] += A[i][j] * b[j];
}

C++AMP Version

array_view<float, 2> A(4, 4);
array_view<float, 1> b(4);
array_view<float, 1> result(4);
extent<1> ext(4);
// array_view objects are captured by value in a restrict(amp) lambda
parallel_for_each(ext, [=](index<1> idx) restrict(amp)
{
    result[idx[0]] = 0;
    for (int i = 0; i < 4; i++)
        result[idx[0]] += A(idx[0], i) * b(i);
});

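memory access

[Figure: processors P0, P1, P2, P3 reading elements 0, 1, 2, 3 from global memory]

global memory: Total access time = 400t

shared memory

[Figure: elements 0, 1, 2, 3 held in shared memory]

shared memory: Total access time = 130t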

array_view<float, 1> b(4);
array_view<float, 2> A(4, 4);
array_view<float, 1> result(4);
extent<1> ext(4);
parallel_for_each(ext.tile<4>(), [=](tiled_index<4> tidx) restrict(amp)
{
    int local = tidx.local[0];
    int global = tidx.global[0];
    tile_static float buf[4];
    buf[local] = b[global];
    tidx.barrier.wait();    // barrier: every thread in the tile has filled its slot of buf
    result[global] = 0;
    for (int i = 0; i < 4; i++)
        result[global] += A(global, i) * buf[i];
});

Architecture

[Figure: GPU architecture diagram; shared memory accessible to all SPs]

source: NVIDIA Tesla: A Unified Graphics and Computing Architecture

● Implement all the C++AMP functions on the CPU instead of the GPU, without any compiler modification.
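One way to picture this goal: on a CPU, parallel_for_each can be an ordinary function template that calls the kernel once per index. The sketch below is an assumption for illustration, not the speaker's implementation. The names cpu_extent, cpu_index, cpu_parallel_for_each and multiply are hypothetical, and restrict(amp) is left out because a standard C++ compiler does not accept it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for extent<1> and index<1>.
struct cpu_extent { int size; };
struct cpu_index {
    int value;
    int operator[](int dim) const { assert(dim == 0); return value; }
};

// Sequential emulation: invoke the kernel once for every index in the extent.
template <typename Kernel>
void cpu_parallel_for_each(cpu_extent ext, Kernel kernel) {
    for (int i = 0; i < ext.size; ++i)
        kernel(cpu_index{i});
}

// The matrix-vector product from the slides, written against the emulation.
// A is a 4x4 matrix stored row by row; b has 4 elements.
std::vector<float> multiply(const std::vector<float>& A,
                            const std::vector<float>& b) {
    std::vector<float> result(4, 0.0f);
    cpu_parallel_for_each(cpu_extent{4}, [&](cpu_index idx) {
        for (int i = 0; i < 4; ++i)
            result[idx[0]] += A[idx[0] * 4 + i] * b[i];
    });
    return result;
}
```

Calling multiply with a diagonal matrix and a vector of ones returns the diagonal, which is a quick way to check the emulation.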

tile_static

● The limitation of C++ syntax leads to the following choices
  ○ const, volatile
  ○ __attribute__(...)
  ○ static
● Choose static
  ○ static memory can be shared among all the threads
  ○ side effect: at most one thread group can be executed at the same time.

#define tile_static static
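The effect of the macro can be seen in a small sketch. The kernel and run_tile functions below are hypothetical, and each call to kernel stands in for one thread of a 4-thread tile; because buf is a static local, every call sees the same array, which is also why only one thread group can use it at a time.

```cpp
#include <cassert>

#define tile_static static

// Hypothetical kernel body: each call plays the role of one thread in a tile of 4.
int kernel(int local, int value, bool read_phase) {
    assert(local >= 0 && local < 4);
    tile_static int buf[4];          // expands to: static int buf[4];
    if (!read_phase) {
        buf[local] = value;          // phase 1: each "thread" stores its element
        return 0;
    }
    int sum = 0;                     // phase 2: any "thread" can read all elements
    for (int i = 0; i < 4; ++i)
        sum += buf[i];
    return sum;
}

// Runs the write phase for four "threads", then reads the shared buffer.
int run_tile() {
    for (int t = 0; t < 4; ++t)
        kernel(t, t + 1, false);
    return kernel(0, 0, true);       // 1 + 2 + 3 + 4
}
```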

barrier.wait

● Threads in the same thread group wait at the point where “wait” is called until every thread in the group has reached it.
● The program can
  a. perform a real barrier action
  b. jump out of the current execution context

Approaches

● True threading
  ○ C++11 thread
● Fake threading (coroutines)
  ○ setjmp/longjmp
  ○ makecontext/getcontext/swapcontext/setcontext

C++11 thread

● Launch hundreds of threads at a time.
● Implement my own barrier by using the C++11 mutex library.
  → extremely slow.
  → The data in static memory will be corrupted.
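A barrier of this kind could look roughly like the following. This is a minimal sketch built on std::mutex and std::condition_variable, not the speaker's code; mutex_barrier and run_group are hypothetical names. Each thread writes one slot, waits at the barrier, then reads every slot.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier built from C++11 primitives.
class mutex_barrier {
public:
    explicit mutex_barrier(unsigned count)
        : count_(count), waiting_(0), generation_(0) { assert(count > 0); }

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        unsigned my_generation = generation_;
        if (++waiting_ == count_) {      // last arrival releases everyone
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {                         // others sleep until the generation changes
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_, waiting_, generation_;
};

// Launches n threads; returns the sum each thread computed after the barrier.
std::vector<int> run_group(unsigned n) {
    std::vector<int> buf(n), sums(n);
    mutex_barrier bar(n);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t)
        threads.emplace_back([&, t] {
            buf[t] = static_cast<int>(t) + 1;   // phase 1: write own slot
            bar.wait();
            int s = 0;                          // phase 2: read every slot
            for (int v : buf) s += v;
            sums[t] = s;
        });
    for (auto& th : threads) th.join();
    return sums;
}
```

With hundreds of threads, every wait goes through the same mutex, which is consistent with the slowdown noted above.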

setjmp/longjmp

● int setjmp(jmp_buf env);
  ○ setjmp() saves the stack context/environment in env for later use by longjmp().
  ○ The stack context will be invalidated if the function which called setjmp() returns.
● void longjmp(jmp_buf env, int val);
  ○ longjmp() restores the environment saved by the last call of setjmp().

#include <stdio.h>
#include <setjmp.h>
jmp_buf buf;
void wait(void) {
    printf("wait\n");       // prints
    longjmp(buf, 1);
}
void first(void) {
    wait();
    printf("first\n");      // does not print
}
int main() {
    if (!setjmp(buf))
        first();            // when executed, setjmp returns 0
    else                    // when longjmp jumps back, setjmp returns 1
        printf("main\n");   // prints
    return 0;
}

Pseudo code (1)

void entry()
{
    while (!finish)
        for (t : tasks)
            run(t)
}

void fun()
{
    …
    wait();
    ...
}

Pseudo code (2)

void entry()
{
    while (!finish)
        for (t : tasks)
            run(t)
}

#include <stdio.h>
#include <setjmp.h>
jmp_buf buf, b;
void wait(void) {
    printf("wait\n");
    if (setjmp(b) == 0)
        longjmp(buf, 1);
}
void first(void) {
    wait();
}
int main() {
    if (!setjmp(buf))
        first();
    else {
        printf("main\n");
        longjmp(b, 10);
    }
    return 0;
}


Cannot return???

[Figure: stack diagram showing the ret address slot and buf, with the remaining stack contents unknown (???)]

Problems

● Cannot return
  ○ the return address on the stack is destroyed
● Cannot use too many static variables
  ○ spilled registers will be lost
  → can be solved by using “alloca”
    http://www.codemud.net/~thinker/GinGin_CGI.py/show_id_doc/489

ucontext.h

● ucontext_t
● getcontext
● makecontext
● swapcontext
● setcontext

ucontext_t

typedef struct ucontext {
    struct ucontext *uc_link;
    sigset_t         uc_sigmask;
    stack_t          uc_stack;
    mcontext_t       uc_mcontext;
    ...
} ucontext_t;

● uc_link
  ○ points to the context that will be resumed when the current context terminates
● uc_stack
  ○ the stack used by this context
● uc_mcontext
  ○ machine-specific representation of the saved context, which includes the calling thread's machine registers

Functions

● int getcontext(ucontext_t *ucp);
  ○ initializes the structure pointed at by ucp to the currently active context.
● int setcontext(const ucontext_t *ucp);
  ○ restores the user context pointed at by ucp.
● int swapcontext(ucontext_t *oucp, const ucontext_t *ucp);
  ○ saves the current context in the structure pointed to by oucp, and then activates the context pointed to by ucp.

makecontext

● void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
  ○ glibc (x86_64) saves the arguments to registers, as the AMD64 ABI specifies, instead of pushing them on the stack.
  ○ The size of each argument passed to makecontext should be no less than sizeof(register).

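#include <stdio.h>
#include <ucontext.h>
static ucontext_t ctx[2];
static void f1 (void) {
    puts("start f1");
    swapcontext(&ctx[1], &ctx[0]);      // save f1 in ctx[1], resume main
    puts("finish f1");
}
int main (void)
{
    char st1[8192];
    getcontext(&ctx[1]);
    ctx[1].uc_stack.ss_sp = st1;
    ctx[1].uc_stack.ss_size = sizeof st1;
    ctx[1].uc_link = &ctx[0];           // resume main when f1 returns
    makecontext(&ctx[1], f1, 0);
    // The two statements at this point are missing from the source;
    // the two switches below are an assumption.
    swapcontext(&ctx[0], &ctx[1]);      // assumed: runs f1 up to its swapcontext
    swapcontext(&ctx[0], &ctx[1]);      // assumed: resumes f1 until it returns
    return 0;
}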

#include <stdio.h>
#include <ucontext.h>
static ucontext_t ctx[3];
static void f1 (void) {
    ...
    swapcontext(&ctx[1], &ctx…
    ...
static void f2 (void)
    ...
    swapcontext(&ctx[2], &ctx…
    ...
int main (void)
    ...
    char st1[8192], st2[8192];
    ...
    ctx[1].uc_stack.ss_size = sizeof…
    ...
    ctx[2].uc_stack.ss_size = sizeof…
    ...
    return 0;
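Since the listing above survives only in fragments, here is a complete two-coroutine sketch in the same spirit. It is a reconstruction under assumptions, not the original slide code: the switch targets, the stack sizes, the trace vector and the run_coroutines function are all choices made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <ucontext.h>
#include <vector>

static ucontext_t ctx[3];               // ctx[0]: caller, ctx[1]: f1, ctx[2]: f2
static std::vector<std::string> trace;  // records the order of execution

static void f1() {
    trace.push_back("start f1");
    swapcontext(&ctx[1], &ctx[2]);      // yield to f2
    trace.push_back("finish f1");
}                                       // returning resumes uc_link (ctx[0])

static void f2() {
    trace.push_back("start f2");
    swapcontext(&ctx[2], &ctx[1]);      // yield to f1
    trace.push_back("finish f2");
}                                       // returning resumes uc_link (ctx[1])

std::vector<std::string> run_coroutines() {
    static char st1[65536], st2[65536]; // assumed stack size
    trace.clear();

    int rc = getcontext(&ctx[1]);
    assert(rc == 0);
    ctx[1].uc_stack.ss_sp = st1;
    ctx[1].uc_stack.ss_size = sizeof st1;
    ctx[1].uc_link = &ctx[0];
    makecontext(&ctx[1], f1, 0);

    rc = getcontext(&ctx[2]);
    assert(rc == 0);
    (void)rc;
    ctx[2].uc_stack.ss_sp = st2;
    ctx[2].uc_stack.ss_size = sizeof st2;
    ctx[2].uc_link = &ctx[1];
    makecontext(&ctx[2], f2, 0);

    swapcontext(&ctx[0], &ctx[2]);      // start with f2; control returns when f1 ends
    return trace;
}
```

The recorded order is start f2, start f1, finish f2, finish f1: each swapcontext resumes the other coroutine exactly where it last stopped.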

Fake threading (yield)

void entry()
{
    setup(fun, 2);
    while (!finish)
        switch_to();
}
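The pseudo code above can be fleshed out with ucontext as a round-robin scheduler in which wait() simply yields back to entry(). This is a sketch under assumptions: the group size of 4, the stack size, the done flags and the buf/result arrays are hypothetical, and the loop inside entry() plays the role of switch_to().

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

namespace fake {

const int kThreads = 4;                 // assumed size of the emulated thread group
static ucontext_t scheduler;            // context of entry()
static ucontext_t workers[kThreads];    // one context per fake thread
static char stacks[kThreads][65536];    // assumed stack size
static bool done[kThreads];
static int current;                     // index of the fake thread being run
static int buf[kThreads];               // plays the role of tile_static storage
static int result[kThreads];

// Barrier as a yield: hand control back to the scheduler.
void wait() { swapcontext(&workers[current], &scheduler); }

void fun() {
    int id = current;
    buf[id] = id + 1;                   // phase 1: write own slot
    wait();                             // the other fake threads run phase 1 meanwhile
    int sum = 0;
    for (int i = 0; i < kThreads; ++i) sum += buf[i];
    result[id] = sum;                   // phase 2: read every slot
    done[id] = true;
}

std::vector<int> entry() {
    for (int t = 0; t < kThreads; ++t) {        // setup(fun, kThreads)
        done[t] = false;
        int rc = getcontext(&workers[t]);
        assert(rc == 0);
        (void)rc;
        workers[t].uc_stack.ss_sp = stacks[t];
        workers[t].uc_stack.ss_size = sizeof stacks[t];
        workers[t].uc_link = &scheduler;        // resume entry() when fun returns
        makecontext(&workers[t], fun, 0);
    }
    bool finish = false;
    while (!finish) {                           // round-robin switch_to()
        finish = true;
        for (current = 0; current < kThreads; ++current) {
            if (done[current]) continue;
            swapcontext(&scheduler, &workers[current]);
            if (!done[current]) finish = false;
        }
    }
    return std::vector<int>(result, result + kThreads);
}

} // namespace fake
```

Because every fake thread reaches wait() before any of them continues, each one sees a fully written buf, which is the barrier guarantee without any real threads.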

Problems

1. How to pass a lambda?
   ○ makecontext(&ctx, (void (*)(void))&Kernel::operator(), …);
2. How to pass non-int arguments?
   ○ What if sizeof(Type) > sizeof(int)?
   ○ How about complex structures and classes?

Pass lambda

1. Use a wrapper function!!

template <typename Ker, typename Arg>
void fun(Ker k, Arg arg)
{
    k(arg);
}

template <typename Ker, typename Arg>
void makectx(Ker k, Arg arg)
{
    makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, k, arg);
}

Pass non-int arguments

2. Pass pointer instead!!

template <typename Ker, typename Arg>
void fun(Ker *k, Arg *arg)
{
    (*k)(*arg);
}

template <typename Ker, typename Arg>
void makectx(Ker &k, Arg &arg)   // by reference, so the pointers stay valid after makectx returns
{
    makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, &k, &arg);
}
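A runnable variant of the pointer idea is sketched below. It goes one step beyond the slide: since makecontext is documented as taking int arguments, the sketch bundles the kernel and argument pointers into one struct and splits that struct's address into two unsigned halves. The split, and the names call_t, run_on_context, big_arg and demo, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <ucontext.h>

static ucontext_t caller_ctx, kernel_ctx;
static char kernel_stack[65536];        // assumed stack size

// Bundles the kernel and its argument so that one pointer describes the call.
template <typename Ker, typename Arg>
struct call_t { Ker* k; Arg* arg; };

// Wrapper with int-sized parameters only: rebuild the pointer, then call the kernel.
template <typename Ker, typename Arg>
void fun(unsigned hi, unsigned lo) {
    std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    auto* call = reinterpret_cast<call_t<Ker, Arg>*>(static_cast<std::uintptr_t>(bits));
    (*call->k)(*call->arg);
}

// Runs k(arg) on its own context and returns once the kernel has finished.
template <typename Ker, typename Arg>
void run_on_context(Ker& k, Arg& arg) {
    call_t<Ker, Arg> call{&k, &arg};
    std::uint64_t bits = reinterpret_cast<std::uintptr_t>(&call);
    void (*entry)(unsigned, unsigned) = &fun<Ker, Arg>;

    int rc = getcontext(&kernel_ctx);
    assert(rc == 0);
    (void)rc;
    kernel_ctx.uc_stack.ss_sp = kernel_stack;
    kernel_ctx.uc_stack.ss_size = sizeof kernel_stack;
    kernel_ctx.uc_link = &caller_ctx;   // come back here when the kernel returns
    makecontext(&kernel_ctx, reinterpret_cast<void (*)()>(entry), 2,
                static_cast<unsigned>(bits >> 32),
                static_cast<unsigned>(bits & 0xffffffffu));
    swapcontext(&caller_ctx, &kernel_ctx);
}

// An argument type that is clearly larger than an int.
struct big_arg { double scale; int values[4]; };

double demo() {
    double total = 0.0;
    auto kernel = [&total](const big_arg& a) {
        for (int v : a.values) total += a.scale * v;
    };
    big_arg arg{0.5, {1, 2, 3, 4}};
    run_on_context(kernel, arg);
    return total;                       // 0.5 * (1 + 2 + 3 + 4)
}
```

The kernel and argument live in the caller's frame, which is safe here only because run_on_context does not return until the kernel has finished.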

Additional

● Use a counter so that we can spawn coroutines dynamically

● Can it be multithreaded? Yes

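true threading

[Figure: a thread group meeting at a barrier]

There are 12 threads in one thread group

one thread

[Figure: the thread group executed on one thread, with a barrier]

multithreading

[Figure: the thread group spread across hardware cores, with a barrier]

Hardware Core = 4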

barrier

struct bar_t {
    unsigned const count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;
    bar_t(unsigned count_) : count(count_), spaces(count_), generation(0) {}
    void wait() noexcept {
        unsigned const my_generation = generation;
        if (!--spaces) {
            spaces = count;
            ++generation;
        } else {
            while (generation == my_generation);
        }
    }
};

source: C++ Concurrency in Action: Practical Multithreading

Summary

● It works fine on AMP right now
● The importance of low level knowledge
