DESCRIPTION

Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.

TRANSCRIPT

Page 1: PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner

synergy.cs.vt.edu

Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next?

Wu Feng and Mark Gardner

Virginia Tech

2013-11-12

2013/11/12 AMD Developer Summit

Why OpenCL?

http://www2.pcmag.com/media/images/375584-nvidia-geforce-gtx-titan.jpg?thumb=y

http://www.amd.com/PublishingImages/Public/Photograph_ProductShots/375WPNG/61979.png

http://www.hardwarezone.com.sg/files/img/2012/06/Xeon_Phi_PCIe_Card_M.jpg

http://www.thinkcomputers.org/articles/ces11_amd/main.jpg

http://www.bjorn3d.com/Material/revimages/cpu/Core_I7_965/New_Core_I7.jpg

Source code lasts longer than platforms

The Goal

http://people.emich.edu/akavetsk/424/scribeatdesk_1.jpg

To take advantage of OpenCL's portability...

Without sacrificing man-years of existing code

CUDA and OpenCL APIs

CUDA Module    OpenCL Module
Thread         Contexts & Command Queues
Device         Platforms & Devices
Stream         Command Queues
Event          Events
Memory         Memory Objects

CUDA and OpenCL Data

CUDA                                                       OpenCL
Vector types (e.g. float4)                                 Host: cl_float4; Kernel: float4
dim3                                                       size_t[3]
cudaStream_t                                               cl_command_queue
cudaEvent_t                                                cl_event
Device pointers (e.g. float* created through cudaMalloc)   cl_mem created through clCreateBuffer
cudaChannelFormat                                          cl_image_format
textureReference                                           cl_mem created through clCreateImage
cudaDeviceProp                                             No direct equivalent

CUDA and OpenCL Execution and Memory Models

The Problem

CUDA Source Code → Manual Translation (weeks, months) → OpenCL Source Code

CUDA Source Code → CU2CL: Automatic Translation (seconds) → OpenCL Source Code

Image: xkcd.com

Forecast

• Observations about Translating

• Examples: CUDA and OpenCL constructs

• CU2CL Architecture

• Current State of CU2CL: Robustness and Performance

• Future Directions

http://www.weather.com/weather/5-day/San+Jose+CA+USCA0993:1:US

Translation Is Easy ...

…when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)

• High-level language → low-level representation, e.g., C → LLVM

x * y + z →

%tmp = mul i32 %x, %y

%tmp2 = add i32 %tmp, %z

• Between languages, e.g., CUDA → OpenCL

__powf(x[threadIdx.x], y[threadIdx.y]) →

native_pow(x[get_local_id(0)], y[get_local_id(1)])

Translation is more difficult

…when there IS ambiguity (or lack of a direct mapping) in the translation between languages

• Idiomatic Expressions

– “Putting all your eggs in one basket” → ?? in Spanish

– CUDA __threadfence() → OpenCL ??

• Dialects

– Latin American Spanish vs. Castilian Spanish → English

– CUDA Runtime API vs. CUDA Driver API → OpenCL

CUDA and OpenCL

http://www.dragon1.com/images/examples.jpg

CUDA Initialization Code

None (Implicit)

Dialect: CUDA runtime API

OpenCL Initialization Code

// get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
__cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
__cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);

// read kernel source from disk
FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
fseek(f, 0, SEEK_END);
size_t progLen = (size_t) ftell(f);
const char *progSrc = (const char *) malloc(sizeof(char) * progLen);
rewind(f);
fread((void *) progSrc, progLen, 1, f);
fclose(f);

// build device program and kernel
__cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
free((void *) progSrc);
clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
__cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);

Explicit

CUDA Kernel Invocation

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiHC / threads.y);

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
}

OpenCL Kernel Invocation

// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
    localWorkSize[0] = threads[0];
    localWorkSize[1] = threads[1];
    localWorkSize[2] = threads[2];
    globalWorkSize[0] = grid[0] * localWorkSize[0];
    globalWorkSize[1] = grid[1] * localWorkSize[1];
    globalWorkSize[2] = grid[2] * localWorkSize[2];
    clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
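
The launch-configuration arithmetic in the translated code above can be isolated in a few lines of plain C. The following is a minimal sketch, not CU2CL output: the function name `sketch_work_sizes` is hypothetical, and the only fact taken from the slide is that OpenCL's global work size is the CUDA grid size multiplied by the block size in each dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of CU2CL or OpenCL): converts a CUDA-style
 * launch configuration (blocks per grid, threads per block) into the two
 * NDRange arrays that clEnqueueNDRangeKernel expects. */
void sketch_work_sizes(const size_t grid[3], const size_t threads[3],
                       size_t globalWorkSize[3], size_t localWorkSize[3])
{
    for (int i = 0; i < 3; i++) {
        assert(threads[i] > 0);
        /* CUDA block size becomes the OpenCL work-group size. */
        localWorkSize[i] = threads[i];
        /* OpenCL counts work-items, CUDA counts blocks, hence the product. */
        globalWorkSize[i] = grid[i] * threads[i];
    }
}
```

For example, a 4 x 2 grid of 16 x 16 blocks gives a global work size of 64 x 32 x 1 with a local work size of 16 x 16 x 1.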

Kernel Code for Vector Add

CUDA

// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

OpenCL

// Device code
__kernel void VecAdd(const __global float* A, const __global float* B, __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}
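
Both kernels compute the same flat index, only spelled differently. As a rough illustration, the CPU loop nest below emulates that indexing in plain C. The function name `vec_add_emulated` and the `num_groups` and `group_size` parameters are assumptions made for this sketch and do not appear in the slides.

```c
#include <assert.h>

/* CPU emulation of the VecAdd kernels. The two loops stand in for the
 * hardware scheduler: `group` plays the role of blockIdx.x or
 * get_group_id(0), and `local` the role of threadIdx.x or get_local_id(0). */
void vec_add_emulated(const float *A, const float *B, float *C, int N,
                      int num_groups, int group_size)
{
    assert(num_groups >= 0 && group_size > 0);
    for (int group = 0; group < num_groups; group++) {
        for (int local = 0; local < group_size; local++) {
            /* CUDA:   blockDim.x * blockIdx.x + threadIdx.x
             * OpenCL: get_local_size(0) * get_group_id(0) + get_local_id(0) */
            int i = group_size * group + local;
            if (i < N)  /* guard: the last group may extend past N */
                C[i] = A[i] + B[i];
        }
    }
}
```

Launching two groups of four work-items on a five-element vector runs eight iterations, three of which are skipped by the `i < N` guard.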

CU2CL Architecture

http://dotsconnectedkat.files.wordpress.com/2011/02/agrigento.jpg

Compilation Process

Martinez, Gardner, and Feng, “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures,” IEEE ICPADS 2011

Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary

Frameworks labeled on the diagram: Clang, LLVM

AST-driven, String-based Rewriting

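CUDA: __powf(x[threadIdx.x], y[threadIdx.y])

OpenCL: native_pow(x[get_local_id(0)], y[get_local_id(1)])

[Diagram: the CUDA expression is parsed into an AST (Func; Arg, Arg; Struct, Struct; Field, Field). The fields are rewritten to the indices 0 and 1, the structs to get_local_id( ), the arguments stay x[ ] and y[ ], the function becomes native_pow, and the result is written out.]

Advantage: formatting remains intact → maintainable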
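
The "string-based" half of this approach can be illustrated with a small C routine. This is only a sketch under loud assumptions: CU2CL locates rewrite sites through the Clang AST, whereas the hypothetical function `rewrite_cuda_to_opencl` below matches raw text against a three-entry table taken from the slide, so it would also fire inside comments or longer identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative rewrite table: the three mappings shown on the slide. */
struct rewrite_rule { const char *cuda; const char *opencl; };

static const struct rewrite_rule RULES[] = {
    { "__powf",      "native_pow"      },
    { "threadIdx.x", "get_local_id(0)" },
    { "threadIdx.y", "get_local_id(1)" },
};

/* Hypothetical helper: copies `src` into `dst`, replacing each CUDA
 * spelling with its OpenCL counterpart and leaving every other character
 * (spacing, line breaks) untouched. Returns 0 on success, -1 if `dst`
 * is too small. */
int rewrite_cuda_to_opencl(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;
    size_t out = 0;
    while (*src != '\0') {
        const char *text = src;   /* default: copy one character verbatim */
        size_t text_len = 1, consumed = 1;
        for (size_t r = 0; r < sizeof RULES / sizeof RULES[0]; r++) {
            size_t n = strlen(RULES[r].cuda);
            if (strncmp(src, RULES[r].cuda, n) == 0) {
                text = RULES[r].opencl;
                text_len = strlen(text);
                consumed = n;
                break;
            }
        }
        if (out + text_len + 1 > dst_size)  /* keep room for the terminator */
            return -1;
        memcpy(dst + out, text, text_len);
        out += text_len;
        src += consumed;
    }
    dst[out] = '\0';
    return 0;
}
```

Because only the matched substrings are replaced, the surrounding formatting survives, which is the maintainability advantage the slide points to. Using the AST to choose the substrings avoids the false matches a purely textual search would produce.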

Complex Semantic Conversions

1. Literal Parameters to Kernels

– CUDA pass-by-value invocations vs. OpenCL pass-by-reference

CUDA Kernel Launch

kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);

Naive OpenCL Translation

clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);

Correct OpenCL Translation

clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
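
Why the naive form fails can be shown without an OpenCL SDK. In the sketch below, `mock_set_kernel_arg` is a hypothetical stand-in that only mimics the size-plus-pointer shape of clSetKernelArg. The struct and function names are assumptions made for illustration. Because the value is copied out of caller-owned memory, every argument needs an address, and literals and temporary expressions have none.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock kernel-argument slot; stands in for state held by the OpenCL runtime. */
struct mock_arg { unsigned char bytes[16]; size_t size; };

/* Mock with the same size-plus-pointer shape as clSetKernelArg:
 * the argument value is read through `value`, never passed by value. */
int mock_set_kernel_arg(struct mock_arg *slot, size_t size, const void *value)
{
    assert(slot != NULL);
    if (value == NULL || size > sizeof slot->bytes)
        return -1;
    memcpy(slot->bytes, value, size);
    slot->size = size;
    return 0;
}

/* Sketch of translating kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256):
 * each argument that is not an lvalue is first stored in a named temporary. */
int set_args_with_temporaries(struct mock_arg slots[3], float foo1, float foo2)
{
    int err = mock_set_kernel_arg(&slots[0], sizeof(float), &foo1);
    float arg_1 = foo2 * 2.0f;  /* taking the address of foo2 * 2.0f does not compile */
    err |= mock_set_kernel_arg(&slots[1], sizeof(float), &arg_1);
    int arg_2 = 256;            /* &256 does not compile either */
    err |= mock_set_kernel_arg(&slots[2], sizeof(int), &arg_2);
    return err;
}
```

The temporaries only need to live until the call returns, since the mock (like the real API) copies the bytes immediately.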

Complex Semantic Conversions

2. Device Identification

– CUDA uses int, OpenCL uses opaque cl_device_id

– To change devices in CUDA, use cudaSetDevice(int id)

– To change devices in OpenCL, use...

//scan all devices
//save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...); //get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
//load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);

– Implement our own handler to emulate and encapsulate
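
One possible shape for such a handler is sketched below in plain C. Everything here is an assumption for illustration: the names `sketch_set_device`, `device_state`, and the integer placeholders are hypothetical, the OpenCL calls are replaced by dummy assignments so the code compiles without an SDK, and caching per-device state is one design choice among several, not a description of CU2CL's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DEVICES 8

/* Per-device state the handler must keep. Real code would store
 * cl_context, cl_command_queue, and cl_program handles; plain ints are
 * used as placeholders here. */
struct device_state {
    int initialized;
    int context;  /* placeholder for cl_context */
    int queue;    /* placeholder for cl_command_queue */
    int program;  /* placeholder for cl_program */
};

static struct device_state all_devices[SKETCH_MAX_DEVICES];
static struct device_state *current_device = NULL;
static int setup_count = 0;  /* how often the expensive setup path ran */

/* Hypothetical stand-in for cudaSetDevice(int id). The first switch to a
 * device runs the setup; later switches reuse the saved state.
 * Returns 0 on success, -1 for an invalid id. */
int sketch_set_device(int id)
{
    if (id < 0 || id >= SKETCH_MAX_DEVICES)
        return -1;
    struct device_state *dev = &all_devices[id];
    if (!dev->initialized) {
        /* real code: clCreateContext, clCreateCommandQueue,
         * clBuildProgram, clCreateKernel */
        dev->context = 100 + id;
        dev->queue = 200 + id;
        dev->program = 300 + id;
        dev->initialized = 1;
        setup_count++;
    }
    current_device = dev;
    return 0;
}

/* Accessors used by the rest of the translated program. */
int sketch_current_queue(void)
{
    assert(current_device != NULL);
    return current_device->queue;
}

int sketch_setup_count(void)
{
    return setup_count;
}
```

Switching from device 1 to device 0 and back to device 1 runs the setup path twice rather than three times, because the state saved for device 1 is reused.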

CU2CL Evaluation

Image: http://learn.cvuhs.org/file.php/1427/scales_of_justice2.jpg

Test Code

• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
– GEM: Molecular Modeling
– IZ PS: Neural Network
– Fen Zi: Molecular Dynamics
• 100k+ SLOC in total

Translator Coverage

Application           CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
SDK Samples
  asyncAPI                   135                      5                               96.3
  bandwidthTest              891                      5                               98.9
  BlackScholes               347                     14                               96.0
  FastWalshTransform         327                     30                               90.8
  matrixMul                  351                      9                               97.4
  scalarProd                 251                     18                               92.8
  vectorAdd                  147                      0                                100
Rodinia
  Back Propagation           313                     24                               92.3
  Breadth-First Search       306                     35                               88.6
  Gaussian                   390                     26                               93.3
  Hotspot                    328                      2                               99.4
  Needleman-Wunsch           430                      3                               99.3
Applications
  Fen Zi                   17768                   1786                               89.9
  GEM                        524                     15                               97.1
  IZ PS                     8402                    166                               98.0
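
The slide does not state how the last column is computed. A plausible reading, offered here as an assumption, is 100 × (CUDA lines − OpenCL lines changed) / CUDA lines. That formula reproduces most rows of the coverage table, for example asyncAPI (96.3), matrixMul (97.4), and vectorAdd (100); for bandwidthTest it gives about 99.4 rather than the printed 98.9. The function name below is hypothetical.

```c
#include <assert.h>

/* Percent of CUDA source lines that needed no manual edits, under the
 * assumption (not stated on the slide) that the column equals
 * 100 * (CUDA lines - changed lines) / CUDA lines. */
double percent_auto_translated(int cuda_lines, int changed_lines)
{
    assert(cuda_lines > 0);
    assert(changed_lines >= 0 && changed_lines <= cuda_lines);
    return 100.0 * (double)(cuda_lines - changed_lines) / (double)cuda_lines;
}
```

For instance, percent_auto_translated(135, 5) evaluates 100 × 130 / 135, which rounds to the 96.3 listed for asyncAPI.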

Translation Challenges

Profiled:

Challenge                   CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                            54.4                    29.4
Literal Parameters                            19.0                    23.5
Separate Compilation                          54.4                    29.4
CUDA Libraries                                10.1                       0
Kernel Templates                              21.5                       0
Texture Memory                                27.8                    23.5
Graphics Interoperability                     24.1                       0
Constant Memory                               17.7                    29.4
Shared Memory                                 46.8                    70.6

Identified:

– Kernel Function Pointer Invocations
– Preprocessor Effects
– Warp-level Synchronization
– Device Intrinsic Functions
– Device Buffer cl_mem Type Propagation
– #defined Function Definitions
– Device Buffers as Struct Members
– Arrays of Device Buffers
– Implicitly-Defined Kernel Functions
– Device-side Classes, Constructors, & Destructors
– Struct Alignment Attributes
– __threadfence()

Sathre, Gardner, Feng: “Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation”. ICPP Workshops 2012: 89-96

Gardner, Feng, Sathre, Martinez: “Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator”. ParCo Special Issue 2013, to appear

Translator Performance

[Figure: two scatter plots of translation time versus source lines (logarithmic axes, 100 to 100000 source lines) for SDK Samples, Rodinia Samples, and Large Applications. One panel shows Total Translation Time (s); the other shows CU2CL Translation Time (microseconds) with linear fits for SDK Samples and Rodinia Samples (R² = 0.61 and R² = 0.95).]

Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04

Translated Application Performance

[Figure: chart comparing CUDA and OpenCL run time (s), lower is better, for the SDK Samples asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, and vectorAdd, the Rodinia Samples backprop, BFS, Gaussian, Hotspot, and Needleman-Wunsch, and GEM.]

Note: All runs used the same NVIDIA GPU for a fair comparison.

CU2CL Reliability

[Figure: stacked bars (0% to 100%) showing the share of CUDA SDK Samples and Rodinia Samples whose translation Failed, was Partial, or was Complete, before upgrades and after upgrades.]

Legend entries:

– Clang 3.2
– main() method handling
– Template handling
– OpenGL #defined function handling
– Separately declared and defined function handling
– Kernel pointer invocation handling

Reliability in translating the samples increased after the latest round of improvements.

CU2CL Roadmap & Future Work

Stages: CU2CL Alpha (2011), CU2CL Beta (2013), CU2CL w/ Functional Portability, CU2CL w/ Performance Portability

• Well-designed scaffold
• Improved robustness, CUDA coverage, and reliability
• Analysis and profiling of difficult-to-translate CUDA structures
• Expand CUDA coverage
– Shared, const, texture memory
– Driver API
– OpenGL
• Handling unmapped CUDA structs / behaviors
– Warp sync
• Automatic de-optimization
• Device-agnostic optimization
• Device-specific optimization

What about CUDA to HSA?

Related Work

• Swan
– High-level abstraction API, links to either OpenCL or CUDA implementation
• Ocelot & Caracal
– Translate NVIDIA PTX IR to other device IRs
• CUDAtoOpenCL
– Source-to-source translator, based on Cetus

CU2CL Conclusions

• Status
– What used to take months by hand takes seconds
    • 90+ successful translations
    • Negligible difference in performance
• Challenges
– CUDA functionality missing in OpenCL
    • __threadfence()
– Equivalent libraries needed in OpenCL
    • cuFFT, MAGMA, cuBLAS
– Implicit semantics
    • Implicit synchronization across warps
• What's Next?
– Improved functional portability
– Support for performance portability

Acknowledgements

Students: Gabriel Martinez, Paul Sathre

This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).