DESCRIPTION
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT
synergy.cs.vt.edu
Automated CUDA-to-OpenCL Translation with CU2CL: What's Next?
Wu Feng and Mark Gardner
Virginia Tech
2013-11-12, AMD Developer Summit
Why OpenCL?
[Images: NVIDIA GeForce GTX Titan, an AMD GPU, Intel Xeon Phi, an AMD APU, Intel Core i7]
Source code lasts longer than platforms
The Goal
To take advantage of OpenCL's portability...
...without sacrificing the man-years invested in existing code
CUDA and OpenCL APIs
CUDA Module   OpenCL Module
Thread        Contexts & Command Queues
Device        Platforms & Devices
Stream        Command Queues
Event         Events
Memory        Memory Objects
CUDA and OpenCL Data
CUDA                                                 OpenCL
Vector types (e.g. float4)                           Host: cl_float4; Kernel: float4
dim3                                                 size_t[3]
cudaStream_t                                         cl_command_queue
cudaEvent_t                                          cl_event
Device pointers (e.g. float* via cudaMalloc)         cl_mem created through clCreateBuffer
cudaChannelFormat                                    cl_image_format
textureReference                                     cl_mem created through clCreateImage
cudaDeviceProp                                       No direct equivalent
CUDA and OpenCL Execution and Memory Models
The Problem
Manual Translation (weeks, months) vs. Automatic Translation (seconds)
CUDA Source Code → CU2CL → OpenCL Source Code
xkcd.com
Forecast
• Observations about Translating
• Examples: CUDA and OpenCL constructs
• CU2CL Architecture
• Current State of CU2CL: Robustness and Performance
• Future Directions
Translation Is Easy...
...when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
• High-level language → low-level representation, e.g., C → LLVM
  x * y + z →
    %tmp = mul i32 %x, %y
    %tmp2 = add i32 %tmp, %z
• Between languages, e.g., CUDA → OpenCL
  __powf(x[threadIdx.x], y[threadIdx.y]) →
    native_pow(x[get_local_id(0)], y[get_local_id(1)])
Translation Is More Difficult...
...when there IS ambiguity (or lack of a direct mapping) in the translation between languages
• Idiomatic Expressions
  – "Putting all your eggs in one basket" → ?? in Spanish
  – CUDA __threadfence() → OpenCL ??
• Dialects
  – Latin American Spanish vs. Castilian Spanish → English
  – CUDA Runtime API vs. CUDA Driver API → OpenCL
CUDA and OpenCL
CUDA Initialization Code
None (implicit)
Dialect: CUDA Runtime API
OpenCL Initialization Code

// get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
__cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
__cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);

// read kernel source from disk
FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
fseek(f, 0, SEEK_END);
size_t progLen = (size_t) ftell(f);
const char *progSrc = (const char *) malloc(sizeof(char) * progLen);
rewind(f);
fread((void *) progSrc, progLen, 1, f);
fclose(f);

// build device program and kernel
__cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
free((void *) progSrc);
clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
__cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);

Explicit
CUDA Kernel Invocation
// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiHC / threads.y);

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
}
OpenCL Kernel Invocation
// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
    localWorkSize[0] = threads[0];
    localWorkSize[1] = threads[1];
    localWorkSize[2] = threads[2];
    globalWorkSize[0] = grid[0] * localWorkSize[0];
    globalWorkSize[1] = grid[1] * localWorkSize[1];
    globalWorkSize[2] = grid[2] * localWorkSize[2];
    clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL,
                           globalWorkSize, localWorkSize, 0, NULL, NULL);
}
Kernel Code for Vector Add
OpenCL:
// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}

CUDA:
// Device code
__global__ void VecAdd(const float* A, const float* B,
                       float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
CU2CL Architecture
Compilation Process
Source Code → Preprocessed Code → Tokenized Code → Parse Tree → Intermediate Representation → Binary
(stages: Preprocessor → Lexer → Parser → Semantic Analyzer → Code Generator; Clang covers the front end, LLVM the back end)
Martinez, Gardner, and Feng, "CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures," IEEE ICPADS 2011
AST-driven, String-based Rewriting
__powf(x[threadIdx.x], y[threadIdx.y])   (CUDA)
native_pow(x[get_local_id(0)], y[get_local_id(1)])   (OpenCL)
[Animated diagram: the AST for the CUDA expression is walked (Func → Arg → Struct → Field); each CUDA fragment is rewritten in place, so threadIdx.x and threadIdx.y become get_local_id(0) and get_local_id(1), __powf becomes native_pow, and the rewritten source is written out]
Advantage: formatting remains intact → maintainable
Complex Semantic Conversions
1. Literal Parameters to Kernels
   – CUDA pass-by-value invocations vs. OpenCL pass-by-reference

CUDA Kernel Launch:
kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);

Naive OpenCL Translation (invalid: takes the address of an rvalue):
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);

Correct OpenCL Translation:
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
Complex Semantic Conversions
2. Device Identification
   – CUDA uses int; OpenCL uses an opaque cl_device_id
   – To change devices in CUDA, use cudaSetDevice(int id)
   – To change devices in OpenCL, use...

// scan all devices
// save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);   // get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
// load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);

   – Implement our own handler to emulate and encapsulate
CU2CL Evaluation
Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
  – GEM – Molecular Modeling
  – IZ PS – Neural Network
  – Fen Zi – Molecular Dynamics
• 100k+ SLOC in total
Translator Coverage

Application              CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
SDK Samples:
  asyncAPI                    135             5                    96.3
  bandwidthTest               891             5                    98.9
  BlackScholes                347            14                    96.0
  FastWalshTransform          327            30                    90.8
  matrixMul                   351             9                    97.4
  scalarProd                  251            18                    92.8
  vectorAdd                   147             0                   100
Rodinia:
  Back Propagation            313            24                    92.3
  Breadth-First Search        306            35                    88.6
  Gaussian                    390            26                    93.3
  Hotspot                     328             2                    99.4
  Needleman-Wunsch            430             3                    99.3
Applications:
  Fen Zi                    17768          1786                    89.9
  GEM                         524            15                    97.1
  IZ PS                      8402           166                    98.0
Translation Challenges

Profiled:
Challenge                   CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                   54.4                   29.4
Literal Parameters                   19.0                   23.5
Separate Compilation                 54.4                   29.4
CUDA Libraries                       10.1                    0
Kernel Templates                     21.5                    0
Texture Memory                       27.8                   23.5
Graphics Interoperability            24.1                    0
Constant Memory                      17.7                   29.4
Shared Memory                        46.8                   70.6

Identified: Kernel Function Pointer Invocations; Preprocessor Effects; Warp-level Synchronization; Device Intrinsic Functions; Device Buffer cl_mem Type Propagation; #defined Function Definitions; Device Buffers as Struct Members; Arrays of Device Buffers; Implicitly-Defined Kernel Functions; Device-side Classes, Constructors, & Destructors; Struct Alignment Attributes; __threadfence()

Sathre, Gardner, and Feng, "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation," ICPP Workshops 2012, pp. 89-96
Gardner, Feng, Sathre, and Martinez, "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator," Parallel Computing, Special Issue, 2013, to appear
Translator Performance
[Two scatter plots of translation time vs. source lines for SDK Samples, Rodinia Samples, and Large Applications: total translation time (s) on the left; CU2CL translation time (microseconds) on the right, with linear fits (R² = 0.61 for SDK Samples, R² = 0.95 for Rodinia Samples)]
Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
Translated Application Performance
[Bar chart: execution time (s), CUDA vs. OpenCL, for the SDK Samples (asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd), the Rodinia Samples (backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch), and GEM; lower is better]
Note: all runs on the same NVIDIA GPU for fair comparison purposes
CU2CL Reliability
[Stacked bar chart: percentage of CUDA SDK and Rodinia samples whose translation failed, was partial, or was complete, before and after upgrades; the complete-translation fraction rises markedly after upgrades]
Upgrades: Clang 3.2; main() method handling; template handling; OpenGL #defined function handling; separately declared and defined function handling; kernel pointer invocation handling
Increased reliability in translating samples after the latest round of improvements
CU2CL Roadmap & Future Work
• CU2CL Alpha (2011): well-designed scaffold
• CU2CL Beta (2013): improved robustness, CUDA coverage, and reliability; analysis and profiling of difficult-to-translate CUDA structures
• CU2CL w/ Functional Portability: expand CUDA coverage (shared, const, texture memory; Driver API; OpenGL); handle unmapped CUDA structs/behaviors (e.g., warp sync)
• CU2CL w/ Performance Portability: automatic de-optimization; device-agnostic optimization; device-specific optimization
What about CUDA to HSA?
Related Work
• Swan – high-level abstraction API, links to either an OpenCL or a CUDA implementation
• Ocelot & Caracal – translate NVIDIA PTX IR to other device IRs
• CUDAtoOpenCL – source-to-source translator, based on Cetus
CU2CL Conclusions
• Status
  – What used to take months by hand takes seconds
  – 90%+ successful translation
  – Negligible difference in performance
• Challenges
  – CUDA functionality missing in OpenCL: __threadfence()
  – Equivalent libraries needed in OpenCL: cuFFT, MAGMA, cuBLAS
  – Implicit semantics: implicit synchronization across warps
• What's Next?
  – Improved functional portability
  – Support for performance portability
Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).