DESCRIPTION
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT
synergy.cs.vt.edu
Automated CUDA-to-OpenCL Translation with CU2CL: What's Next?
Wu Feng and Mark Gardner
Virginia Tech
2013-11-12, AMD Developer Summit
Why OpenCL?
[Images: NVIDIA GeForce GTX Titan, an AMD GPU, Intel Xeon Phi, an AMD APU, Intel Core i7]
Source code lasts longer than platforms
The Goal
To take advantage of OpenCL's portability...
...without sacrificing the man-years invested in existing code
CUDA and OpenCL APIs
CUDA Module   OpenCL Module
Thread        Contexts & Command Queues
Device        Platforms & Devices
Stream        Command Queues
Event         Events
Memory        Memory Objects
CUDA and OpenCL Data
CUDA                                                 OpenCL
Vector types (e.g. float4)                           Host: cl_float4; Kernel: float4
dim3                                                 size_t[3]
cudaStream_t                                         cl_command_queue
cudaEvent_t                                          cl_event
Device pointers (e.g. float* via cudaMalloc)         cl_mem created through clCreateBuffer
cudaChannelFormat                                    cl_image_format
textureReference                                     cl_mem created through clCreateImage
cudaDeviceProp                                       No direct equivalent
CUDA and OpenCL Execution and Memory Models
The Problem
Manual Translation (weeks, months) vs. Automatic Translation (seconds)
CUDA Source Code → CU2CL → OpenCL Source Code
xkcd.com
Forecast
• Observations about Translating
• Examples: CUDA and OpenCL constructs
• CU2CL Architecture
• Current State of CU2CL: Robustness and Performance
• Future Directions
Translation Is Easy...
...when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
• High-level language → low-level representation, e.g., C → LLVM
  x * y + z →
    %tmp = mul i32 %x, %y
    %tmp2 = add i32 %tmp, %z
• Between languages, e.g., CUDA → OpenCL
  __powf(x[threadIdx.x], y[threadIdx.y]) →
    native_pow(x[get_local_id(0)], y[get_local_id(1)])
Translation Is More Difficult...
...when there IS ambiguity (or lack of a direct mapping) in the translation between languages
• Idiomatic Expressions
  – "Putting all your eggs in one basket" → ?? in Spanish
  – CUDA __threadfence() → OpenCL ??
• Dialects
  – Latin American Spanish vs. Castilian Spanish → English
  – CUDA Runtime API vs. CUDA Driver API → OpenCL
CUDA and OpenCL
CUDA Initialization Code
None (implicit)
Dialect: CUDA Runtime API
OpenCL Initialization Code

// get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
__cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
__cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);

// read kernel source from disk
FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
fseek(f, 0, SEEK_END);
size_t progLen = (size_t) ftell(f);
const char *progSrc = (const char *) malloc(sizeof(char) * progLen);
rewind(f);
fread((void *) progSrc, progLen, 1, f);
fclose(f);

// build device program and kernel
__cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
free((void *) progSrc);
clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
__cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);

Explicit
CUDA Kernel Invocation
// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiHC / threads.y);

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
}
OpenCL Kernel Invocation
// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};

// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++) {
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
    localWorkSize[0] = threads[0];
    localWorkSize[1] = threads[1];
    localWorkSize[2] = threads[2];
    globalWorkSize[0] = grid[0] * localWorkSize[0];
    globalWorkSize[1] = grid[1] * localWorkSize[1];
    globalWorkSize[2] = grid[2] * localWorkSize[2];
    clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL,
                           globalWorkSize, localWorkSize, 0, NULL, NULL);
}
Kernel Code for Vector Add
OpenCL:
// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}

CUDA:
// Device code
__global__ void VecAdd(const float* A, const float* B,
                       float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
CU2CL Architecture
Compilation Process
Source Code → Preprocessed Code → Tokenized Code → Parse Tree → Intermediate Representation → Binary
(stages: Preprocessor → Lexer → Parser → Semantic Analyzer → Code Generator; Clang covers the front end, LLVM the back end)
Martinez, Gardner, and Feng, "CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures," IEEE ICPADS 2011
AST-driven, String-based Rewriting
__powf(x[threadIdx.x], y[threadIdx.y])   (CUDA)
native_pow(x[get_local_id(0)], y[get_local_id(1)])   (OpenCL)
[Animated diagram: the AST for the CUDA expression is walked (Func → Arg → Struct → Field); each CUDA fragment is rewritten in place, so threadIdx.x and threadIdx.y become get_local_id(0) and get_local_id(1), __powf becomes native_pow, and the rewritten source is written out]
Advantage: formatting remains intact → maintainable
Complex Semantic Conversions
1. Literal Parameters to Kernels
   – CUDA pass-by-value invocations vs. OpenCL pass-by-reference

CUDA Kernel Launch:
kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);

Naive OpenCL Translation (invalid: takes the address of an rvalue):
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);

Correct OpenCL Translation:
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
Complex Semantic Conversions
2. Device Identification
   – CUDA uses int; OpenCL uses an opaque cl_device_id
   – To change devices in CUDA, use cudaSetDevice(int id)
   – To change devices in OpenCL, use...

// scan all devices
// save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);   // get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
// load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);

   – Implement our own handler to emulate and encapsulate
CU2CL Evaluation
Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
  – GEM – Molecular Modeling
  – IZ PS – Neural Network
  – Fen Zi – Molecular Dynamics
• 100k+ SLOC in total
Translator Coverage

Application              CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
SDK Samples:
  asyncAPI                    135             5                    96.3
  bandwidthTest               891             5                    98.9
  BlackScholes                347            14                    96.0
  FastWalshTransform          327            30                    90.8
  matrixMul                   351             9                    97.4
  scalarProd                  251            18                    92.8
  vectorAdd                   147             0                   100
Rodinia:
  Back Propagation            313            24                    92.3
  Breadth-First Search        306            35                    88.6
  Gaussian                    390            26                    93.3
  Hotspot                     328             2                    99.4
  Needleman-Wunsch            430             3                    99.3
Applications:
  Fen Zi                    17768          1786                    89.9
  GEM                         524            15                    97.1
  IZ PS                      8402           166                    98.0
Translation Challenges

Profiled:
Challenge                   CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                   54.4                   29.4
Literal Parameters                   19.0                   23.5
Separate Compilation                 54.4                   29.4
CUDA Libraries                       10.1                    0
Kernel Templates                     21.5                    0
Texture Memory                       27.8                   23.5
Graphics Interoperability            24.1                    0
Constant Memory                      17.7                   29.4
Shared Memory                        46.8                   70.6

Identified: Kernel Function Pointer Invocations; Preprocessor Effects; Warp-level Synchronization; Device Intrinsic Functions; Device Buffer cl_mem Type Propagation; #defined Function Definitions; Device Buffers as Struct Members; Arrays of Device Buffers; Implicitly-Defined Kernel Functions; Device-side Classes, Constructors, & Destructors; Struct Alignment Attributes; __threadfence()

Sathre, Gardner, and Feng, "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation," ICPP Workshops 2012, pp. 89-96
Gardner, Feng, Sathre, and Martinez, "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator," Parallel Computing, Special Issue, 2013, to appear
Translator Performance
[Two scatter plots of translation time vs. source lines for SDK Samples, Rodinia Samples, and Large Applications: total translation time (s) on the left; CU2CL translation time (microseconds) on the right, with linear fits (R² = 0.61 for SDK Samples, R² = 0.95 for Rodinia Samples)]
Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
Translated Application Performance
[Bar chart: execution time (s), CUDA vs. OpenCL, for the SDK Samples (asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd), the Rodinia Samples (backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch), and GEM; lower is better]
Note: all runs on the same NVIDIA GPU for fair comparison purposes
CU2CL Reliability
[Stacked bar chart: percentage of CUDA SDK and Rodinia samples whose translation failed, was partial, or was complete, before and after upgrades; the complete-translation fraction rises markedly after upgrades]
Upgrades: Clang 3.2; main() method handling; template handling; OpenGL #defined function handling; separately declared and defined function handling; kernel pointer invocation handling
Increased reliability in translating samples after the latest round of improvements
CU2CL Roadmap & Future Work
• CU2CL Alpha (2011): well-designed scaffold
• CU2CL Beta (2013): improved robustness, CUDA coverage, and reliability; analysis and profiling of difficult-to-translate CUDA structures
• CU2CL w/ Functional Portability: expand CUDA coverage (shared, const, texture memory; Driver API; OpenGL); handle unmapped CUDA structs/behaviors (e.g., warp sync)
• CU2CL w/ Performance Portability: automatic de-optimization; device-agnostic optimization; device-specific optimization
What about CUDA to HSA?
Related Work
• Swan – high-level abstraction API, links to either an OpenCL or a CUDA implementation
• Ocelot & Caracal – translate NVIDIA PTX IR to other device IRs
• CUDAtoOpenCL – source-to-source translator, based on Cetus
CU2CL Conclusions
• Status
  – What used to take months by hand takes seconds
  – 90%+ successful translation
  – Negligible difference in performance
• Challenges
  – CUDA functionality missing in OpenCL: __threadfence()
  – Equivalent libraries needed in OpenCL: cuFFT, MAGMA, cuBLAS
  – Implicit semantics: implicit synchronization across warps
• What's Next?
  – Improved functional portability
  – Support for performance portability
Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).