a composable workflow for productive fpga computing via ... · a composable workflow for productive...

16
A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts) Paul Sathre Dept. of CS Virginia Tech [email protected] Ahmed Helal Dept. of ECE Virginia Tech [email protected] Wu Feng Depts. of CS and ECE Virginia Tech [email protected] Abstract We present a composable workflow to enable highly-productive heterogeneous computing on FPGAs. The workflow consists of a trio of static analysis and transformation tools: (1) a whole-program, source-to-source translator to transform existing parallel code to OpenCL, (2) a set of OpenCL kernel linters, which target FPGAs to detect possible semantic errors and performance traps, and (3) a whole-program OpenCL linter to validate the host-to-device inter- face of OpenCL programs. The workflow promotes rapid realization of heterogeneous parallel code across a multitude of heterogeneous computing environments, particularly FPGAs, by providing com- plementary tools for automatic CUDA-to-OpenCL translation and compile-time OpenCL validation in advance of very expensive com- pilation, placement, and routing on FPGAs. The proposed tools perform whole-program analysis and transformation to tackle real- world, large-scale parallel applications. The efficacy of the workflow tools is demonstrated via a representative translation and analysis of a sizable CUDA finite automata processing engine as well as the analysis and validation of an additional 96 OpenCL benchmarks. 1 Introduction Heterogeneous computing has become a frontrunner in the com- putational race for performance and power efficiency. FPGAs have gained increasing attention in the general-purpose computing com- munity due to the emergence of high-level synthesis (HLS) tools based on OpenCL [17, 18]. Ideally, heterogeneous code should be "write-once, run-anywhere" regardless of the target parallel acceler- ator. In practice, however, productivity is limited by the community fragmentation between programming languages and the sunk cost in the existing heterogeneous code implemented in vendor-specific languages such as CUDA [20]. Thus, there is a need for automated tools to assist in the transition to the heterogeneous platforms ac- cessible via the vendor-agnostic OpenCL language, which would allow devices to be procured on the basis of performance, power ef- ficiency, and cost, rather than language compatibility only. Further, the FPGA is a new platform for many software developers, and such productivity tools can reduce the learning curve by providing automated translation to OpenCL and advisories about potential semantic and performance pitfalls. The translation between programming languages should be done statically at the source-code level to support software maintainabil- ity, i.e., code modifications, adaptations, and extensions on the new target platforms. To automate such a daunting task, researchers have created several source-to-source translators with a specific focus on CUDA-to-OpenCL translation [9, 12, 15, 21], due to the Document Revision 1.1 (June 2018). This work was supported in part by NSF I/UCRC IIP-1266245 via NSF CHREC. wealth of existing CUDA code that may benefit from the accessi- bility to heterogeneous platforms such as FPGAs. However, these tools work on a single translation unit (TU) at a time, which makes it impossible to propagate code transformations across TUs and limits their ability to holistically translate large-scale parallel codes. Real-world applications follow a modular design methodology that separates the application’s functionality into multiple software components (modules) to improve software maintainability and to promote code reuse. Thus, there is a compelling need for tools that examine applications holistically, spanning TUs to observe possible inconsistencies and effect wide-reaching global transformations. As such, this paper presents a workflow of automated tools that supports whole-program analysis and transformation both within and across TUs for productive OpenCL programming on hetero- geneous parallel systems with FPGA, GPU, and/or CPU. Whole- program analysis and transformation (e.g., refactoring, translation, etc.) is difficult in the context of separately compiled and linked lan- guages (C/C++) and derivatives (CUDA and OpenCL) because of the traditionally hard delineation between TUs. However, by reducing this general cross-TU problem to the bounded instances of source- to-source translation between semantically similar languages and language- and platform-specific linter analyses, robust cross-TU tools can be realized using a unified framework that focuses on the interface between TUs. In particular, we make the following research contributions: A workflow of automated, composable tools to accelerate the devel- opment process of heterogeneous parallel code. Specifically, CU2CL- MAST is a whole-program source-to-source translator that holis- tically transforms existing heterogeneous code from the vendor- specific CUDA to the vendor-agnostic OpenCL. FLOCL and FLOCL- MAST are intra- and inter-TU static code analyzers (i.e., linters), respectively, that detect possible semantic errors, runtime fail- ures, and performance traps before investing time in expensive compilation and debugging/profiling. (§2 and §3). A case study of transforming a finite automata code (iNFAnt) from CUDA to OpenCL via CU2CL-MAST and validation of the resulting code with the FLOCL and FLOCL-MAST linters, thus enabling iNFAnt to run on other accelerators (e.g., FPGAs and AMD GPUs) in addition to NVIDIA GPUs (§4). A suite of linter tools to identify potential performance issues and semantic faults in OpenCL codes. Specifically, our proposed intra- TU tools identify over 153 potential kernel performance and semantic faults in five common OpenCL benchmark suites. In addition, our inter-TU linter detects inconsistent kernel calls between statically-compiled host code and JIT-compiled device code, which can result in undefined run-time behavior (§4).

Upload: others

Post on 22-Sep-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

A Composable Workflow for Productive FPGA Computing viaWhole-Program Analysis and Transformation (with Code Excerpts)

Paul SathreDept. of CSVirginia Tech

[email protected]

Ahmed HelalDept. of ECEVirginia Tech

[email protected]

Wu FengDepts. of CS and ECE

Virginia [email protected]

AbstractWe present a composable workflow to enable highly-productiveheterogeneous computing on FPGAs. The workflow consists of atrio of static analysis and transformation tools: (1) a whole-program,source-to-source translator to transform existing parallel code toOpenCL, (2) a set of OpenCL kernel linters, which target FPGAs todetect possible semantic errors and performance traps, and (3) awhole-program OpenCL linter to validate the host-to-device inter-face of OpenCL programs. The workflow promotes rapid realizationof heterogeneous parallel code across a multitude of heterogeneouscomputing environments, particularly FPGAs, by providing com-plementary tools for automatic CUDA-to-OpenCL translation andcompile-time OpenCL validation in advance of very expensive com-pilation, placement, and routing on FPGAs. The proposed toolsperform whole-program analysis and transformation to tackle real-world, large-scale parallel applications. The efficacy of the workflowtools is demonstrated via a representative translation and analysisof a sizable CUDA finite automata processing engine as well as theanalysis and validation of an additional 96 OpenCL benchmarks.

1 IntroductionHeterogeneous computing has become a frontrunner in the com-putational race for performance and power efficiency. FPGAs havegained increasing attention in the general-purpose computing com-munity due to the emergence of high-level synthesis (HLS) toolsbased on OpenCL [17, 18]. Ideally, heterogeneous code should be"write-once, run-anywhere" regardless of the target parallel acceler-ator. In practice, however, productivity is limited by the communityfragmentation between programming languages and the sunk costin the existing heterogeneous code implemented in vendor-specificlanguages such as CUDA [20]. Thus, there is a need for automatedtools to assist in the transition to the heterogeneous platforms ac-cessible via the vendor-agnostic OpenCL language, which wouldallow devices to be procured on the basis of performance, power ef-ficiency, and cost, rather than language compatibility only. Further,the FPGA is a new platform for many software developers, andsuch productivity tools can reduce the learning curve by providingautomated translation to OpenCL and advisories about potentialsemantic and performance pitfalls.

The translation between programming languages should be donestatically at the source-code level to support software maintainabil-ity, i.e., code modifications, adaptations, and extensions on the newtarget platforms. To automate such a daunting task, researchershave created several source-to-source translators with a specificfocus on CUDA-to-OpenCL translation [9, 12, 15, 21], due to the

Document Revision 1.1 (June 2018).This work was supported in part by NSF I/UCRC IIP-1266245 via NSF CHREC.

wealth of existing CUDA code that may benefit from the accessi-bility to heterogeneous platforms such as FPGAs. However, thesetools work on a single translation unit (TU) at a time, which makesit impossible to propagate code transformations across TUs andlimits their ability to holistically translate large-scale parallel codes.

Real-world applications follow a modular design methodologythat separates the application’s functionality into multiple softwarecomponents (modules) to improve software maintainability and topromote code reuse. Thus, there is a compelling need for tools thatexamine applications holistically, spanning TUs to observe possibleinconsistencies and effect wide-reaching global transformations.

As such, this paper presents a workflow of automated tools thatsupports whole-program analysis and transformation both withinand across TUs for productive OpenCL programming on hetero-geneous parallel systems with FPGA, GPU, and/or CPU. Whole-program analysis and transformation (e.g., refactoring, translation,etc.) is difficult in the context of separately compiled and linked lan-guages (C/C++) and derivatives (CUDA and OpenCL) because of thetraditionally hard delineation between TUs. However, by reducingthis general cross-TU problem to the bounded instances of source-to-source translation between semantically similar languages andlanguage- and platform-specific linter analyses, robust cross-TUtools can be realized using a unified framework that focuses on theinterface between TUs.

In particular, we make the following research contributions:• A workflow of automated, composable tools to accelerate the devel-opment process of heterogeneous parallel code. Specifically, CU2CL-MAST is a whole-program source-to-source translator that holis-tically transforms existing heterogeneous code from the vendor-specific CUDA to the vendor-agnostic OpenCL. FLOCL and FLOCL-MAST are intra- and inter-TU static code analyzers (i.e., linters),respectively, that detect possible semantic errors, runtime fail-ures, and performance traps before investing time in expensivecompilation and debugging/profiling. (§2 and §3).

• A case study of transforming a finite automata code (iNFAnt) fromCUDA to OpenCL via CU2CL-MAST and validation of the resultingcode with the FLOCL and FLOCL-MAST linters, thus enablingiNFAnt to run on other accelerators (e.g., FPGAs and AMDGPUs)in addition to NVIDIA GPUs (§4).

• A suite of linter tools to identify potential performance issues andsemantic faults in OpenCL codes. Specifically, our proposed intra-TU tools identify over 153 potential kernel performance andsemantic faults in five common OpenCL benchmark suites. Inaddition, our inter-TU linter detects inconsistent kernel callsbetween statically-compiled host code and JIT-compiled devicecode, which can result in undefined run-time behavior (§4).

Page 2: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

OpenCLCompiler

FLOCLFLOCL-MAST

CUDASource

KernelFiles

Kernel ANDHost Files

OpenCLSource

Manual Translation(days to weeks)

RapidRefinement(minutes)

Slow refinement(hours to days)

CU2CL-MAST

AutomaticTranslation(minutes)

HeterogeneousApplication

Workflow Tools

Compile, Place, Route(hours to days)

Compiler Errors

RuntimeErrors

Figure 1. The proposed whole-program transformation and analysis workflow which accepts OpenCL or CUDA code (thanks to automated translation) and assists in rapid iterativerefinement of device kernels and host runtime via linter advisories before ever compiling and running the application on an OpenCL platform.

2 Whole-ProgramWorkflowFigure 1 provides a sketch of the proposed workflow and explainshow the tools interact. As a preliminary step for users with CUDAcode, CU2CL-MAST (§ 3.4) expands an existing CUDA-to-OpenCLtranslator to the whole-program scope to tackle large-scale appli-cations and enables these users to leverage FPGA architectures.These translated OpenCL kernel files (or preexisiting ones) are in-dividually analyzed with FLOCL (§ 3.3.1) to detect OpenCL andFPGA-specific semantic and performance faults. Next, the entireOpenCL application (i.e., all host and device files) is further in-spected with FLOCL-MAST (§ 3.3.2) to ensure the consistency ofthe separately-compiled host-to-device interface. Finally, after theserapid refinement stages at the source-code level, the long kernelcompilation process is performed to execute the OpenCL applica-tion on the FPGA device.

3 Workflow ToolsHere we present our suite of workflow tools, including our MAST(multi-abstract syntax tree) framework, libTooling management,FLOCL (FPGA linters for OpenCL), FLOCL-MAST, and CU2CL-MAST.

3.1 Multi-AST FrameworkFigure 2 shows the typical uses of C, C++, and other parallel C-likelanguages (e.g., CUDA and OpenCL), where the source files (i.e.,translation units or TUs) are separately compiled into individualobject files, and then linked into a single binary/executable file.These TUs are only aware of each other based on explicit interfaces(i.e., declarations) provided by the programmer via shared headerfiles. While the division of a software program into multiple TUsimproves software maintainability, it complicates the design ofany tool that operates on an entire application. Thus, we create aframework for transformation and analysis across multiple TUsthat traverses the abstract syntax tree (ASTs) of each source file.

C.c

B.c

A.c A.o

B.o

C.o

COMPILER

LINKER

BinaryH.h

Figure 2. Typical separate compilation of object files before linking together.

Figure 3 shows the proposed framework that enables the cre-ation of source-level inter-TU tools by providing several criticalcomponents. First, it leverages the Clang [6] compiler frontendto parse individual source files and translate them into navigable

ASTs for use in later stages. Second, the multi-AST framework usesthe Clang libTooling library to control and modify the behavior ofthe compiler frontend, providing fine-grained control over the pro-cessing of individual TUs. Third, the framework relies on customintra-TU logic to perform local analysis and transformation and torecord interesting source features for later inter-TU stages. Finally,the inter-TU stage relies on the minimal shared interface betweenlink units to connect across ASTs. The following subsections detailthe different stages and compiler components that produce themulti-AST transformation and analysis tools.

B.c

A.c

C.c

ASTLexParse Sema

ASTLexParse Sema

ASTLexParse SemaPer-ASTActions

Multi-ASTTool

DeferredActions

Multi-ASTTranslation(CU2CL)

Multi-ASTAnalysis

(FLOCL-MAST)

Per-ASTActions

Per-ASTActions

CLANG Frontend Our Tools

Figure 3. Multi-AST tools coordinate across invocations of the Clang Frontend toprovide whole-program translation or analysis.

3.2 Clang libTooling ManagementThe Clang libTooling library provides the driver that translatessource files into ASTs along with a convenient machinery to man-age compilation tools via the invocation and processing of Front-endActions on source code. Algorithm 1 provides a skeleton ofthe execution phases of our multi-AST tools. The cross-TU toolscustomize the frontend behavior on a per-TU basis (i.e., to providedifferent compilation options for OpenCL device and host files)via pre-frontend callbacks. Custom FrontendActions implementthe tool-specific intra-TU processing and are detailed further insubsequent sections. By having a tool scope above individual invo-cations of the frontend, the tool itself can maintain persistent datastructures in memory across frontend invocations, which is criticalto inter-TU analysis.

3.3 FLOCL and FLOCL-MASTFLOCL provides OpenCL- and FPGA-specific linter passes to im-prove programmer productivity in developing OpenCL applicationsfor FPGAs. The intra-TU portion is implemented as a series of mod-ules for the clang-tidy tool. However, many of the complexitiesof OpenCL development arise because the OpenCL specificationnecessitates separate compilation of the host and device codesto support portability across OpenCL platforms. Therefore, theinter-TU FLOCL-MAST is devised as a companion tool to detect

2

Page 3: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 s t r u c t m i s a l i gn ed {2 char a ;3 doub le b ;4 char c ;5 } ;67 s t r u c t packed_and_a l i gned {8 char a ;9 doub le b ;10 char c ;11 } _ _ a t t r i b u t e ( ( packed ) )12 _ _ a t t r i b u t e ( ( a l i g n e d ( 1 6 ) ) ) ;

(a) Example of a misaligned and unpackedstruct that would use 24 bytes (alignedto 8 bytes) but only needs 10 as well asa packed and aligned struct that uses 16bytes.

1 F inder −>addMatcher (2 r e c o r dDec l (3 i s S t r u c t ( )4 ) . b ind ( " s t r u c t " ) , t h i s ) ;

(b) The ASTMatcher used tofind inefficient structs thatmay be misaligned or poorlypacked.

1 S t r u c t = R e s u l t . Nodes . getNodeAs <RecordDecl > ( " s t r u c t " ) ;2 / / Sum the b i t width o f a l l f i e l d s in the r e co r d3 f o r each f i e l d in the r e co r d { minWidth += f i e l dW id t h }4 / / Compute the minimum e f f i c i e n t member a l i gnment ( l a r g e s t f i e l d member )5 / / Compute the minimum a l l owab l e width ( even mu l t i p l e o f minAlignment <= sumWidth )6 i f ( currWidth > minWidth ) {7 Emit ' ' poo r l y packed ' ' d i a g n o s t i c8 }9 / / Compute the minimum e f f i c i e n t s t r u c t a l i gnment10 / / ( sm a l l e s t power o f two up to the dev i ce ' s l o ad width >= minWidth )11 i f ( minAlign != cu r rA l i gn ) {12 Emit ' ' poo r l y a l i gned ' ' d i a g n o s t i c13 }

(c) Skeleton of the Match Callback used to filter out only problematic structs.Figure 4. Example code for the inefficient struct alignment and packing check.

Algorithm 1 Skeleton of multi-AST Tool execution.STARTOption processingPre-tool actionsfor each source file from command line do

Invoke Clang FrontendPre-Frontend Callback ActionsFrontend Parsing, Lexing, Semantic AnalysisPost-Frontend, Intra-TU actions, Record deferred work

end forMulti-AST Actions (consume deferred work)END

inconsistencies in the separately-compiled interfaces between theOpenCL host and device codes.

3.3.1 FLOCLThe clang-tidy tool provides a robust framework for building lintersthat target C-like languages, and already supports a number ofcommon linting passes (such as refactoring deprecated API usages,identifying dead code, etc.). However, general OpenCL kernels aswell as FPGA-specific kernels require unique linting passes, due tothe special C99 kernel dialect and platform restrictions, respectively.Thus, new linter modules for OpenCL and FPGA (i.e., “FLOCL”) aredevised.

Modules to clang-tidy are written using the Clang ASTMatchersinterface, which provides a way to target specific abstract syntaxsub-trees (Sub-ASTs) that are defined in terms of node types andtheir relationships, and allow easy capture of snipets of structuredcode. Once matches are found, a user-defined callback is triggeredto analyze each match. Hence, each FLOCL check is implementedas a matcher for a specific OpenCL code construct and a callback todiagnose it. These checks typically implement guidelines or restric-tions from the OpenCL standard, OpenCL SDK documentations,and/or literature. A synthetic code example is constructed to exhibitthe behavior, and then the AST representation of the code is man-ually examined for a matchable Sub-AST. A matcher to catch thesymptomatic Sub-ASTs is designed, and then a callback providesfurther refinement and diagnostics.

We currently provide device checks for 1) inefficiently aligned/-packed structs, 2) barriers in single-threaded kernels, 3) possibly-unreachable barriers inside conditionals, and 4) ID-dependent back-ward branching. These checks are derived from restrictions in theOpenCL specification and the Intel/Altera FPGA best practices. Ad-ditionally, a technique was developed to track variable assignmentsto ensure that a given variable never held a thread-dependent value,which is necessary to significantly reduce the false positive rate ofchecks (3) and (4).

1) Inefficient struct alignment and packing Ensuring efficientdevice memory access can improve the performance of an OpenCLkernel. When targeting Intel FPGAs, a minimum memory align-ment of 4 bytes is desired and an optimal 128-bytes alignment issuggested [11]. Users who are unfamiliar with OpenCL may not beaware of the importance of memory alignment and packing, or theexposed attributes used to override the compiler defaults. Therefore,a linter is implemented to identify structs that do not use optimalpacking and alignment, and suggest appropriate attributes for thedeveloper to add. Figure 4 provides an example of a struct thatcould benefit from both explicit packing and alignment attributes,and shows a skeleton of the linter used to identify it.

The linter’s ASTMatcher detects every struct identified in thedevice code. Next, the match callback examines the default pack-ing and alignment inferred by Clang, and computes the minimumefficient size and alignment based on the size of individual structelements. Specifically, it computes the minimum size of the structas if the element are stored contiguously, then computes the min-imum alignment sufficient to fit the contiguous elements. If theClang-generated size is larger than this computed minimum size, adiagnostic is emitted advising the use of a packed attribute. If theminimum computed alignment differs from the Clang-generatedalignment, a similar diagnostic is emitted suggesting the properalignment attribute to use. (Since we are targeting Intel FPGA de-vices, this value is bounded by a maximum 128-byte alignment.)

2) Single-work item barriers The Intel FPGA OpenCL platformdetects if an OpenCL kernel is suited for execution across multiplework items (NDRange), or a single work item (SWI ). This detec-tion relies on calls to runtime APIs and/or compiler attributes, andalso varies with the version of the device compiler. Older versionsrequired calls to either get_global_id or get_local_id to be in-ferred as NDRange, whereas new versions trigger on additionalfunctions [1, 11]. (Fig. 5a provides a trivial example of a SWI kerneland an NDRange kernel according to the original compiler.) If theuser intends the kernel for SWI execution, any barrier synchro-nization can instead be relaxed to a mem_fence. Consequently, alinter was developed to address the presence of barriers in potentialSWI kernels. The matcher shown in Fig. 5b simply detects anyOpenCL kernel function with barriers, but no calls to a thread IDfunction. (Specifically the matcher binds on any function that 1)has the OpenCL kernel attribute, 2) calls either the OpenCL 1.Xor 2.X work-group barrier function, 3) unless — logical NOT — itcalls one of the ID functions. It also binds on the barrier call itself.A separate matcher will be generated for the modern compiler tosupport the additional ID functions.) The callback then provides acustom message, based on the target compiler version, to indicate

3

Page 4: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 vo id __ke rne l s ing l eWI ( _ _ g l o b a l i n t ∗ foo , i n t s i z e ) {2 f o r ( i n t j = 0 ; j < 2 5 6 ; j ++) {3 f o r ( i n t i = 2 5 6 ; i < s i z e ; i += 2 5 6 ) {4 foo [ j ] += foo [ j + i ] ;5 }6 }7 work_g roup_ba r r i e r (CLK_GLOBAL_MEM_FENCE ) ;8 f o r ( i n t i = 1 ; i < 2 5 6 ; i ++) {9 foo [ 0 ] += foo [ i ] ;10 }11 }1213 vo id __ke rne l mult iWI ( _ _ g l o b a l i n t ∗ foo , i n t s i z e ) {14 i n t t i d = g e t _ g l o b a l _ i d ( 0 ) ;15 f o r ( i n t i = g e t _ g l o b a l _ s i z e ( 0 ) ; i < s i z e ; i += g e t _ g l o b a l _ s i z e ( 0 ) ) {16 foo [ t i d ] += foo [ t i d + i ] ;17 }18 work_g roup_ba r r i e r (CLK_GLOBAL_MEM_FENCE ) ;19 i f ( t i d = 0 ) {20 f o r ( i n t i = 1 ; i < g e t _ g l o b a l _ s i z e ( 0 ) ; i ++) {21 foo [ 0 ] += foo [ i ] ;22 }23 }24 }

(a) Example of a single work item kernel with a barrier (invalid) and a multiplework item kernel (that calls an ID function) with a barrier.

1 F inder −>addMatcher (2 / / F ind f u n c t i o n d e c l a r a t i o n s3 f un c t i o nDe c l ( a l l O f (4 / / That a r e OpenCL k e r n e l s5 ha sA t t r ( a t t r : : Kind : : OpenCLKernel ) ,6 / / And c a l l a b a r r i e r f u n c t i o n ( e i t h e r 1 . x or 2 . x v e r s i o n )7 fo rEachDescendan t ( c a l l E x p r ( c a l l e e (8 f un c t i o nDe c l ( anyOf (9 hasName ( " b a r r i e r " ) ,10 hasName ( " work_g roup_ba r r i e r " )11 ) )12 ) ) . b ind ( " b a r r i e r " ) ) ,13 / / But do not c a l l an ID f un c t i o n14 un l e s s ( hasDescendant ( c a l l E x p r ( c a l l e e (15 f un c t i o nDe c l ( anyOf (16 hasName ( " g e t _ g l o b a l _ i d " ) ,17 hasName ( " g e t _ l o c a l _ i d " )18 ) )19 ) ) ) )20 ) ) . b ind ( " f u n c t i o n " ) , t h i s ) ;

(b) The matcher used to find all OpenCL kernel functions that call a barrier, butdo not call an ID function

1 MatchedDecl = R e s u l t . Nodes . getNodeAs <Func t ionDec l > ( " f u n c t i o n " ) ;2 Ma t chedBa r r i e r = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " b a r r i e r " ) ;3 Emit ' ' b a r r i e r in s i n g l e work i tem kerne l ' ' d i a g n o s t i c(c) The simple Match Callback doesn’t have to do any filtering, just emit a diag-nostic since only invalid kernels are matched.

Figure 5. Example detection of a single-work-item kernel with a barrier.

that either the barrier will cause a compilation error or a mem_fencemay be more appropriate for performance.

Thread-dependency tracking Several performance and seman-tic issues result from divergent threads within the OpenCL kernel.When threads take separate paths, performance suffers due to serial-ization and/or generation of less efficient FPGA hardware. Further,such divergence can result in deadlocks when barrier semantics arerequired. Therefore, a helper linter is developed to track variableswhich may contain thread-dependent values such that conditionalsreferencing them are flagged.

The helper linter detects all variable declarations within a kernel.When a variable is assigned the return value from a thread IDfunction, it is marked as thread-dependent by the match callback.Further, when a variable is assigned based on the value of someother variable(s), all members of the right hand side are checkedfor thread-dependency, and if such a dependency is found, theassignee is marked. Appendix Figure 10a shows the three matchersnecessary to catch 1) declarations or assignments that directly callan ID function, 2) variable declarations that initialize to the valueof some other variable or struct member, and 3) all assignmentsthat reference a variable or struct member on the right hand side.Appendix Figure 10b demonstrates the callback portion that records

direct assignments (1), and propagates assignments from other ID-dependent references (2) and (3). Since Matches are performedconsistent with a preorder traversal of the AST (and thus callbackstriggered in the same order), in almost all cases the dependencyof a variable will be resolvable by the time it is matched — sinceuse must occur after definition. 1 This particular tracking design isgreedy and once a variable is flagged as ID-dependent, it cannot beunflagged; as a static analysis the approach is limited to what valuesthe variable may contain, but without runtime information wouldhave limited ability to definitively ascertain that such dependencywas overwritten by all threads. This helper is used by the next twolinters to determine if conditional expressions have a risk of beingthread-dependent.

3) Possibly-unreachable barriers A barrier is a common andoften necessary synchronization mechanism; however, it is also apotential source of deadlock errors that can be hard to debug at run-time. (On FPGA platforms with long kernel hardware generationtime, the cost to resolve such a deadlock is further exacerbated.)A source of barrier deadlocks is the unintentional violation of theOpenCL specification by placing a barrier within a conditional re-gion that is not reliably executed by all threads in a work-group. Forexample, Appendix Figure 11a shows two trivial for loops contain-ing barriers that are executed an ID-dependent number of times,and will thus deadlock.

The helper linter for thread-dependency tracking, discussedabove, is used to flag the “risky” conditional expressions — ones thateither call a barrier directly or reference a variable or struct member.The rest of the matcher is otherwise fairly simple; it looks for any ofthe five primitive branch types (if/else, switch/case, do/while,while and for) that both contain a risky conditional expressionand a call to a barrier function. Appendix Figure 11b provides thecode for the branch matcher. The match callback (App. Fig. 11c)identifies what type of branch is present (for emitting a precise diag-nostic), whether the conditional is in fact thread-dependent, and ifso, whether it is due to a thread ID function call or an ID-dependentvariable. However, barriers can still be used safely (if inefficiently)in a thread-dependent manner, if all paths reliably encounter thebarrier. ID-dependency tracking provides false-postive reductionbut alone does not address such valid in-branch barriers.

To address this case, the linter includes a refinement step toreduce false-positives even further in if/else and switch state-ments, where all possible paths execute the same number of barriers.Appendix Figure 12 provides the code that handles this refinement,which is structured similarly for both types of branches but ac-counts for the difference in semantics. (For example switch casesare terminated by breaks, but may run into one another without abreak, and all possible entry points must be evaluated. If/elseshave no terminator statement, but have well-defined “then” state-ments — usually compound.) Effectively both versions assemble alist of references to every possible entry point to their respectivebranches, and ensure an else or default case is present otherwise itcannot statically guarantee that some threads don’t skip the branchentirely. Once this list is assembled, a simple recursive preorder tra-versal flattens the entire AST sub-tree for the branching structure,

1The only corner case identified thus far is if a backward branch such as a loop or gotoassigns a thread-dependent value to a previously-independent variable after a loopconditional that references the variable, but before the jump back to the start of theloop, i.e. in the loop body itself, but this is seldom performed in practice.

4

Page 5: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

so that it may easily be traversed. Then this flattened tree is iteratedfrom top to bottom to count barriers in each branch, skipping overmost “uninteresting” AST nodes. When the beginning of a branchis encountered a barrier counter is initialized to zero. When theend of a branch is reached (either by encountering a break or byencountering the start of the next branch), the total number of bar-riers is checked. The refinement will abort to diagnostics if 1) thebranch has zero barriers, since the matcher will only have matchedif at least one branch has a barrier, 2) the branch has a differentnumber of barriers than a previously-counted branch, or 3) a switchcase has a non-zero number of barriers and proceeds directly intoanother case without breaking, since it is guaranteed to encountermore barriers than the subsequent case. Each time a function call isencountered, it is counted if and only if it is a barrier, otherwise,the refinement aborts since it cannot easily traverse the CFG anddetermine whether the called function contains a barrier. Finally,any time a jump or nested conditional is found, the refinementaborts.

4) ID-dependent backward branches In the context of FPGAs,OpenCL kernel instructions are generated as a hardware pipeline.For efficiency reasons, the Intel compiler attempts to collapse branchstatements into a single bit indicating if a functional unit is ac-tive [11]. This is relatively easy for “forward” branches (withoutloops), resulting in flat control structures. However, looped (i.e.“backward”) branches are more difficult to implement which can sig-nificantly reduce performance, and thus should be avoided. Whilesome algorithms may require backward branches, many can beremoved with algorithmic refactoring. Therefore, a linter has beendesigned to recognize potential ID-dependent backward branchingto advise users to consider refactoring their kernels.

ID-dependent backward branches, by definition, cannot be re-solved and optimized at compile time, and thus result in a morecomplex hardware logic. Therefore, the linter needs to make useof the thread-dependency tracker we have previously detailed. Asbefore, the linter matches on any loop with a risky conditionalexpression, and then the callback determines if the conditional maybe thread-dependent and emits a diagnostic. Figure 6 outlines theotherwise simple remainder of the matcher and callback.

3.3.2 FLOCL-MASTFLOCL-MAST expands on FLOCL by providing a standalone multi-AST framework for developing new linters that require simultane-ous access to multiple TUs. An important example is linters thatwork across the host and device TUs to ensure semantic consis-tency. In large-scale applications, it can be difficult to ensure thatmodifications to the host-device interface are applied consistentlyin the whole program, which can lead to burdensome runtime bugs.For example, if the type, number, or position of kernel parameterschanged, the application will crash and OpenCL’s runtime can re-port an error (granted that proper error checking is implemented inthe code), wasting valuable compilation and execution time. Thus,the first Multi-AST linter for OpenCL implements a check to ensurethe semantic consistency of each of the arguments provided to adevice kernel invocation.

FLOCL-MAST operates in two distinct phases, an intra-TU phasethat records information from each device or host AST, and a inter-TU phase that validates this information and reports potential prob-lems. Figure 7 provides a conceptual overview of the flow from

1 / / Omitted ID−dependency matchers23 / / Matcher f o r d e t e c t i n g branch s t a t emen t s i n s i d e o f l oop s4 / / which only b ind s i f they c ond i t i o n e xp r e s s i o n e i t h e r c a l l s an ID5 / / f u n c t i o n or r e f e r e n c e s a v a r i a b l e6 / / Sub−matcher to check i f an e xp r e s s i o n c on t a i n s an ID c a l l or7 / / r e f e r e n c e s a v a r i a b l e t h a t may be ID−dependent8 con s t au to COND_EXPR =9 expr ( anyOf (10 hasDescendant ( c a l l E x p r (11 c a l l e e ( f u n c t i o nDe c l ( anyOf (12 hasName ( " g e t _ g l o b a l _ i d " ) ,13 hasName ( " g e t _ l o c a l _ i d " )14 ) ) )15 ) . b ind ( " i d _ c a l l " ) ) ,16 hasDescendant ( s tmt ( anyOf (17 d e c lR e fExp r ( t o ( va rDec l ( ) ) ) ,18 memberExpr ( member ( f i e l d D e c l ( ) ) )19 ) ) )20 ) ) . b ind ( " cond_expr " ) ;21 / / Matcher t h a t g r ab s any branch s t a t emen t s t h a t have a p o t e n t i a l22 / / ID−dependency in the c o n d i t i o n a l e x p r e s s i o n and a re i n s i d e l oop s23 F inder −>addMatcher (24 s tmt ( anyOf (25 f o r S tm t (26 ha sCond i t i on (COND_EXPR ) ) ,27 doStmt (28 ha sCond i t i on (COND_EXPR ) ) ,29 whi l eS tmt (30 ha sCond i t i on (COND_EXPR ) )31 ) ) . b ind ( " backward_branch " ) ,32 t h i s ) ;

(a) The ASTMatcher used to find all loop conditionals that need to be checked forID-dependency.

1 / / Omitted thread−dependency t r a c k i n g code23 / / The second p a r t o f the c a l l b a c k j u s t checks matched branches f o r4 / / an ID−dependent c o n d i t i o n a l e x p r e s s i o n5 con s t au to ∗ CondExpr = R e s u l t . Nodes . getNodeAs <Expr > ( " cond_expr " ) ;6 c on s t au to ∗ IDCa l l = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " i d _ c a l l " ) ;7 c on s t au to ∗ Loop = R e s u l t . Nodes . getNodeAs <Stmt > ( " backward_branch " ) ;8 i n t l oop_ type = −1;9 i f ( Loop ) {10 i f ( i s a <DoStmt >( Loop ) ) { l oop_ type = 0 ; }11 e l s e i f ( i s a <WhileStmt >( Loop ) ) { l oop_ type = 1 ; }12 e l s e i f ( i s a <ForStmt >( Loop ) ) { l oop_ type = 2 ; }13 }14 i f ( CondExpr ) {15 i f ( IDCa l l ) {16 / / I t c a l l s one o f the ID f u n c t i o n s d i r e c t l y17 Emit a d i a g n o s t i c t h a t the c o n d i t i o n a l c a l l s an ID f un c t i o n18 } e l s e {19 / / I t has some Dec lRe fExpr ( s ) , check f o r ID−dependency20 con s t auto ∗ r e t Exp r = has IDDepDeclRef ( CondExpr ) ;21 con s t auto ∗ retMemberExpr = hasIDDepMember ( CondExpr ) ;22 i f ( r e tDe c l Exp r | | retMemberExpr ) {23 / / Warn t h a t an ID−dependent v a r i a b l e i s r e f e r e n c e d24 }25 }26 }

(b) The latter portion of the backward branch callback uses the ID-dependencyinformation computed earlier to check the conditional expressions from branchesthat are inside loops.

Figure 6. Example code for checking whether backward branches (loops) may haveinefficient ID-dependent conditional expressions.

Algorithm 2 Skeleton of FLOCL-MAST execution stages.STARTfor each source file from command line do

if Kernel File thenRecord Kernel Prototypes (Figs. 13e & 13f)

elseRecord cl_kernel Decls (Figs. 13b & 13c)Record clCreateKernel Calls (Figs. 13b & 13c)Record clSetKernelArg Calls (Figs. 13b & 13c)Record Kernel Launches (Figs. 13b & 13c)

end ifRetain AST, SourceManager, etc.

end forBind cl_kernels to Device Prototypes (Fig. 14a)Cluster clSetKernelArg calls to Launches (Fig. 14b)Bind Launch Clusters to Prototypes (Fig. 14c)Check Each Argument Against Prototype (Figs. 15a & 15b)END

source code to diagnostics. Algorithm 2 shows the sequence of5

Page 6: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

A.cl

A.c

B.c

HostAST

DeviceAST

HostAST

A.h

CLANG Frontend

CLANG Frontend

CLANG Frontend

HostOptions

HostOptions

DeviceOptions

Match Kernel CreationMatch Argument SettingMatch Kernel Launches

Match Kernel Prototypes

Match Kernel CreationMatch Argument SettingMatch Kernel Launches

RecordedCreationMatches

RecordedArgumentMatches

RecordedLaunchMatches

RecordedPrototypeMatches

Bind created hostcl_kernels by nameto device prototypes

Bind argument assignmentsto the next matching

cl_kernel launch

Bind created cl_kerneland prototype to launch(es)and argument assignments

Map of all kernelprototypes to theirlaunches and arguments

For ( each argument :in each launch :of each kernel )Check against prototype

Per-AST Actions Cross-AST Actions

1 2 3

4

5

6

Figure 7. Conceptual overview of FLOCL-MAST execution. Host and device files are recognized by file type (1) and each go through a separate instance of the Clang Frontend to beprocessed to ASTs. After the ASTs are generated, they each use AST Matchers and callbacks to record relevant AST nodes for the inter-TU stages (2). After all per-AST actions arecompleted the cross-AST portion of the tool iterates over all the recorded information to pair kernel function prototypes to their host-side representation (3) and uses (4) & (5), theninspects them all for type consistency between arguments and parameters (6).

linter steps, and provides pointers to the Appendix source codelistings for each processing step. Similar to FLOCL, the intra-TUrecording phase is implemented as a set of ASTMatchers and call-backs which identify the code components of each stage of theOpenCL kernel invocation. It records __kernel function declara-tions from device ASTs, and clCreateKernel, clSetKernelArg,and clEnqueueNDRangeKernel calls from host ASTs. Further, eachtime a cl_kernel is declared, the storage space is checked; if it isglobal (and possibly extern) it is stored in a special inter-TU datastructure that maps global externs to their real definition (i.e. theobject file that will actually store the variable).

After the intra-TU analysis is completed on all device and hostASTs, the multi-AST portion of the tool combines information fromeach of the kernel invocation stages. First, each clCreateKernelcall is checked for an exact match of its kernel name argument toa prototype from a device AST. When such a match is found, thedevice declaration is bound to the assigned cl_kernel. Next, allclSetKernelArg calls are “clustered” to the lexically-next kernellaunchwith amatching cl_kernel. Lexically-next refers to the nextlaunch encountered when scanning the source code from beginningto end, since kernel invocations are usually preceded closely bythe argument specification. Following argument clustering, thenext stage iterates over every kernel launch and attaches them to aprototype with a matching cl_kernel object, if one exists.

Finally, each kernel launch is compared to the device prototypeto ensure that 1) all arguments are provided and 2) each providedargument has a compatible type to that of the declared parameter inthe same position. This comparison must account for the differencebetween the pass-by-reference semantics of clSetKernelArg andthe pass-by-value semantics of the kernel. Hence, the void pointer-to-scalar variables in the host code must be resolved to a non-voidqualified type, if possible. (Typically the address of variables is takenand cast to void within the call, which is easy to resolve.) Pointerson the device must be assigned as either cl_mems (for global andconstantmemory) or a NULL pointer with a custom size argumentfor local memory. The former currently uses a wildcard matchof any approved pointer to a cl_mem, and the latter is currentlyunsupported; both will be refined as future work.

This inter-TU linter is useful as a demonstration of the complex-ity of multi-AST analysis in comparison to the intra-AST analysisused by basic FLOCL. While both FLOCL and FLOCL-MAT use thesame underlying ASTMatcher and callback technology, the com-plexity of the analysis dramatically increases once the tool worksacross the ASTs of different translation units.

3.4 CU2CL-MASTCU2CL [9] is an automated CUDA-to-OpenCL translator that per-forms AST-driven, string-based translation, i.e., CU2CL walks theAST generated by Clang to identify CUDA expressions and thentranslates the code by replacing the relevant text directly. This hasthe benefit of preserving preprocessor directives, comments, andformat in the generated OpenCL source files. However, CU2CLis restricted to a single translation unit — it cannot deal with theextra complexity of a multi-file application without tedious manualpre- and post-processing (for example, to merge generated OpenCLboilerplate). Therefore, we provide a whole-program multi-ASTexpansion of CU2CL, dubbed CU2CL-MAST (Multi-AST).

The implementation of CU2CL-MAST consisted of four phases.First, a libTooling interface was built to bootstrap and contain mul-tiple instances of the existing intra-TU translation functionality,as detailed in § 3.2. Second, to ensure completeness of translation,an implicit ban on processing non-.cuh header files was removed.Third, intra-TU boilerplate generation was modified such that com-mon elements are produced by the inter-TU layer and generatedin a shared set of cu2cl_util.c/h/cl utility files. Furthermore,within each TU the inter-TU layer generates extern declarationsfor cl_programs and cl_kernels originating in other TUs. Thefourth and most novel stage tackled the longstanding incongruitybetween OpenCL and CUDA representations of pointers to device-side buffers: OpenCL stores the host-side pointer to a device bufferin a cl_mem object, whereas CUDA allows storage of such pointersin any arbitrary pointer type.

The original intra-TU CU2CL performs immediate depth-firsttranslation of the device buffer used in a cudaMalloc expressionby simply navigating to the buffer’s declaration and changing thetype [9]. However, as each AST is only walked once, when thisvariable is aliased — either by pointer assignment or through afunction call — propagation to all possible aliases cannot be guar-anteed, causing a type inconsistency in other code regions that arenot aware of the translation. (Figure 8a demonstrates a simple caseof aliasing which confounds the original translator.) Thus, eitherthe variable is not translated and instead explicitly casted to a stan-dard pointer variable (as in [12]), or the translation must be fullypropagated to all affected regions. The latter is substantially harderproblem as potentially extends across scopes and TUs. While thecasting solution is expedient, it creates a software maintainabilityissue in a broader codebase as the true type of the casted cl_memvariables is only apparent when used in OpenCL calls. Therefore,to ensure propagation of translated device pointers all translation

6

Page 7: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 ======Conten t s o f a . h======2 vo id c r e a t e ( f l o a t ∗ ∗ , s i z e _ t ) ;3 vo id launch ( f l o a t ∗ , f l o a t ∗ ) ;4 ======Conten t s o f a . cu======5 # i n c l u d e " a . h "6 _ _g l o b a l _ _ vo id fooKern ( f l o a t ∗ inBuf , f l o a t ∗ outBuf ) {7 outBuf [ t h r e a d I d x . x ] = inBu f [ t h r e a d I d x . x ] ;8 }9 vo id c r e a t e ( f l o a t ∗ ∗ bu f f e r , s i z e _ t s i z e ) {10 cudaMal loc ( bu f f e r , s i z e ) ;11 }12 vo id launch ( f l o a t ∗ inBuf , f l o a t ∗ outBuf ) {13 fooKern <<<1 ,256 > >>( inBuf , outBuf ) ;14 }15 ======Conten t s o f b . c======16 # i n c l u d e " a . h "17 f l o a t ∗A, ∗B ;18 i n t main ( i n t argc , c on s t char ∗ argv [ ] ) {19 i n t c o n d i t i o n = a t o i ( argv [ 1 ] ) ;20 c r e a t e (&A , s i z e o f ( f l o a t ) ∗ 2 5 6 ) ; c r e a t e (&B , s i z e o f ( f l o a t ) ∗ 2 5 6 ) ;21 f l o a t ∗ inBuf , ∗ outBuf ;22 i f ( c o n d i t i o n ) { inBu f = A , outBuf = B ; }23 e l s e { inBu f = B , outBuf = A ; }24 launch ( inBuf , outBuf ) ;25 }

(a) CUDA device pointers used as arguments and aliased. The initial translation istriggered by the cudaMalloc call on line 10.

1 ======Conten t s o f a . h−c l . h======2 vo id c r e a t e ( cl_mem ∗ , s i z e _ t ) ;3 vo id launch ( cl_mem , cl_mem ) ;4 ======Conten t s o f a . cu−c l . cpp======5 . . . ( om i t t ed b o i l e r p l a t e )6 # i n c l u d e " a . h−c l . h "7 vo id c r e a t e ( cl_mem ∗ bu f f e r , s i z e _ t s i z e ) {8 ∗ b u f f e r = c l C r e a t e B u f f e r ( . . . , s i z e , . . . ) ;9 }10 vo id launch ( cl_mem inBuf , cl_mem outBuf ) {11 c l S e tK e r n e lA r g ( __cu2c l _Ke rne l _ fooKern , 0 , s i z e o f ( cl_mem ) , &inBu f ) ;12 c l S e tK e r n e lA r g ( __cu2c l _Ke rne l _ fooKern , 1 , s i z e o f ( cl_mem ) , &outBuf ) ;13 l o c a lWorkS i z e [ 0 ] = 2 5 6 ;14 g loba lWorkS i z e [ 0 ] = ( 1 ) ∗ l o c a lWorkS i z e [ 0 ] ;15 clEnqueueNDRangeKernel ( . . . , _ _ cu2c l _Ke rne l _ fooKern , . . . ) ;16 }17 ======Conten t s o f b . c−c l . cpp======18 . . . ( om i t t ed b o i l e r p l a t e )19 # i n c l u d e " a . h−c l . h "20 cl_mem A, B ;21 i n t main ( i n t argc , c on s t char ∗ argv [ ] ) {22 _ _ c u 2 c l _ I n i t ( ) ;23 i n t c o n d i t i o n = a t o i ( argv [ 1 ] ) ;24 c r e a t e (&A , s i z e o f ( f l o a t ) ∗ 2 5 6 ) ; c r e a t e (&B , s i z e o f ( f l o a t ) ∗ 2 5 6 ) ;25 cl_mem inBuf , outBuf ;26 i f ( c o n d i t i o n ) { inBu f = A , outBuf = B ; }27 e l s e { inBu f = B , outBuf = A ; }28 launch ( inBuf , outBuf ) ;29 __cu2c l _C leanup ( ) ;30 }

(b) Example of Multi-AST CU2CL’s fully-propagated cl_mem translationFigure 8. Propagation of the cl_mem type by CU2CL-MAST. Without the Multi-ASTexpansion only the declaration of “buffer” on line 9 in Fig. 8a is translated. With Multi-AST propagation, line 9 propagates horizontally to the forward declaration on line 2.By translating a header shared with b.c’s AST, the calls to create on line 20 propagateupward to the declarations of “A” and “B” on line 18. These propagate horizontallythrough the alias assignments in the if/else region on lines 22 and 23 to the declarationsof “inBuf” and “outBuf” on line 21. Finally, translating inBuf and outBuf propagatedownward through the call to “launch” on line 24 to launch’s forward declaration online 3 and then horizontally to the definition on line 12.

was hosited to the inter-TU layer which has access to the entiretyof the codebase and can track aliasing use/def relationships.

Primarily, inter-TU translation relies on forward declarationsfrom a shared header that resides on all ASTs and thus form acommon anchor between them. Thus, we modified the intra-TUportion of CU2CL to defer all cl_mem translations, by storing themin an inter-TU list, and to generate a map from all declared vari-ables and functions to their references such that the inter-TU layercan track their uses. Actual translation is then implemented as aconsumer of the deferred translation list. While each translationis performed, the reference map is checked for necessary cl_memtranslation propagation.

In particular, we address three cases in which a cl_mem typetranslation necessitates propagation to some other device pointer:

• Downward propagation: when a translated variable is a functionargument, the type change must be extended down the controlflow graph (CFG) to the callee’s parameters.

• Upward propagation: when a function parameter is translated,the type change must be propagated up the CFG to any variablessupplied as arguments to that parameter.

• Horizontal propagation: when a function parameter is translated,it must be applied to any forward declaration in a shared headerto be visible to other ASTs. Pointers aliased through assignmentstatements must ensure that both the left- and right-hand sideshave the cl_mem type.

When a necessary propagation is found, the newly-affected variableis added to the translation list, if it is not already on the list. Oncethe list is consumed, all cl_mem translations are fully propagated.(Effectively, this performs a breadth-first propagation of translationsup and down the control flow graph and across ASTs, starting at theinitial translation site.) Figure 8 walks through a typical propagationcascade from a single initial translation site.

4 Case StudiesWe demonstrate the efficacy of the proposed workflow using com-mon OpenCL mini-apps and benchmarks as well as a CUDA finiteautomata processing engine. The following subsections detail theapplications used to test each tool, and provide a discussion onthe quality of their respective outputs — translated OpenCL forCU2CL-MAST, and linter warnings for FLOCL and FLOCL-MAST.

4.1 Target Workloads4.1.1 iNFAntiNFAnt is a CUDA implementation of a non-deterministic finiteautomata (NFA) regex matching engine [3]. We used an optimizedversion of the algorithm, detailed in [23], as a case study for aCUDA-to-FPGA translation pipeline using both CU2CL-MAST andFLOCL/FLOCL-MAST. An NFA regex matching application wasselected as it represents a common workload for FPGAs. In latersections, we discuss the improvement of CUDA-to-OpenCL trans-lation using CU2CL-MAST, and analyze the feedback produced byFLOCL and FLOCL-MAST on the translated code.

4.1.2 BenchmarksFive OpenCL benchmark suites were selected to cover a wide rangeof use cases in terms of both application domain and OpenCLdevelopment style. OpenDwarfs [8, 13], PolyBench-ACC [10], Ro-dinia [4, 5], and SHOC [7] come from a more traditional HPCand general-purpose GPU background. CHO [19] targets OpenCL-on-FPGA development and captures current limitations of offline-compiled FPGA development. Across the five suites, there are atotal of 96 separate OpenCL applications.

4.2 CU2CL-MAST4.2.1 Translation AnalysisTo show howCU2CL-MAST simplified the translation of large-scalecode with multiple TUs, such as optimized iNFAnt, we compareCU2CL-MAST’s translation to the original plugin-based, single-AST CU2CL (version 0.6.2b). (The original CU2CL had to be modi-fied to ensure output files were generated despite possible errors.)CU2CL-MAST’s translation improvement was evidenced in threecategories:

7

Page 8: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

Improved header files processing As a prerequisite to multi-AST translations, CU2CL-MAST ensures that translations that oc-cur within any file are faithfully reproduced as output. The originalCU2CL translator took an overly conservative approach and limitedtranslation of header files to non-system header files with eithera .cu or .cuh extension, although many developers declare wrap-pers to CUDA kernel and utility functions in .h, .hpp, or otherfile types and may pass device buffers through those functions. Byexpanding translation to all non-system files regardless of file type,CU2CL-MAST performs translation in 14 additional header filesof Optimized iNFAnt, which include portions of the interface thatmust have cl_mems propagated through them.

Multi-AST boilerplate generation CU2CL generates custom boil-erplate for each AST, including many redundant features suchas OpenCL emulation of CUDA functions like cudaMemset. Con-versely, CU2CL-MAST defers the boilerplate generation to the inter-TU portion of the tool. As such, it can intelligently generate sharedutility files that concisely provide all shared functionality. Fur-thermore, CU2CL-MAST coordinates all explicit initialization andfinalization functionality from each TU, eliminating the need forpost-translation manual merging. While Optimized iNFAnt uses asingle kernel file, CU2CL-MAST generates the necessary externdeclarations to make it visible program wide, which is necessarysince creation and invocation happen in separate source files. Inprograms with more kernels from different source files, the value ofthis automatic propagation across source files is further increased.

cl_mem propagation Section 3.4 has already detailed the signifi-cance of ensuring propagation of type translations. The OptimizediNFAnt code uses wrapper functions around all CUDA allocationand deallocation functions. This necessitates cl_mem propagation,which CU2CL-MAST successfully provides. 2

4.2.2 Translation ValidationThe translated OpenCL application was validated by comparing itsoutput against the the original CUDA application, when compiledwith the built-in debugging and validation routines. The built-intest packet generator was used as synthetic network traffic, and thepackaged “big-boy” transition graph was used to specify the finiteautomata. The OpenCL results were consistent with the originalCUDA application; in particular, we successfully reproduced theoriginal results using the translated OpenCL code on all testedOpenCL platforms (Nvidia K20Xm GPU, AMD S9150 GPU, IntelArria10 FPGA).

4.3 FLOCL and FLOCL-MAST4.3.1 FLOCL ResultsThe FLOCL linters produce warnings about possible semantic orperformance faults when examining the 138 kernel files in thebenchmarks. Each may have important performance issues or run-time problems that will only become apparent at runtime after

2A slight incongruence in the CU2CL and CU2CL-MAST implementation of cud-aMallocHost was discovered. Rather than returning just the host buffer’s pointer,and internally managing the cl_mem as required to directly emulate the function,__cu2cl_MallocHost returns both the host and device pointers. A simple map (key:host pointer, value: device cl_mem) implementation is manually utilized to emulatethe desired semantics. Automatic management of such host-allocated-and-mappedbuffers remains for future work.

the developer has invested significant time in compilation, place-ment, and routing. The following is the detailed description of thewarnings found:• Possibly unreachable barrier: four kernel files in the benchmarksuite have barriers which are thread-dependent and may result indeadlocks at runtime. Fig. 9 shows one of the hits in SHOC/SPMV.

• ID-dependent backwards branches: 90 instances of backwardbranches are observed in the benchmark suites, and another fouroccur in the optimized iNFAnt application.

• Inefficient struct alignment: in the benchmark suites, FLOCL iden-tifies 59 opportunities to augment a struct with extra attributesfor improved memory utilization. Specifically, 41 cases could ben-efit from an aligned attribute, and 18 from a packed attribute.An additional 2 structs from the Optimized iNFAnt applicationcould benefit from both explicit packing and alignment.

• SWI-barrier: no instances of a barrier in a single-threaded kernelwere found. (The benchmarks largely consist of applicationsthat are explicit designed for multi-threaded NDRange execution,rather than single-thread execution.)

1 __ke rne l vo id2 spmv_c s r _v e c t o r _k e rn e l ( _ _ g l o b a l c on s t FPTYPE ∗ r e s t r i c t va l ,3 _ _ g l o b a l c on s t FPTYPE ∗ r e s t r i c t vec ,4 _ _ g l o b a l c on s t i n t ∗ r e s t r i c t c o l s ,5 _ _ g l o b a l c on s t i n t ∗ r e s t r i c t rowDe l im i t e r s ,6 c on s t i n t dim , con s t i n t vecWidth ,7 _ _ g l o b a l FPTYPE ∗ r e s t r i c t out )8 {9 / / Thread ID in b l o ck10 i n t t = g e t _ l o c a l _ i d ( 0 ) ;11 / / Thread ID wi th in warp12 i n t i d = t & ( vecWidth − 1 ) ;13 / / One row per warp14 i n t v e c sPe rB l o ck = g e t _ l o c a l _ s i z e ( 0 ) / vecWidth ;15 i n t myRow = ( ge t _g roup_ i d ( 0 ) ∗ ve c sPe rB l o ck ) + ( t / vecWidth ) ;16 _ _ l o c a l v o l a t i l e FPTYPE pa r t i a l S ums [ 1 2 8 ] ;17 p a r t i a l S ums [ t ] = 0 ;18 i f (myRow < dim )19 {20 / / . . . om i t t ed p a r t i a l sum loop21 p a r t i a l S ums [ t ] = mySum ;22 b a r r i e r (CLK_LOCAL_MEM_FENCE ) ;23 / / . . . om i t t ed r e du c t i o n and wr i t e24 }25 }

Figure 9. Example of a possibly-unreachable barrier found in SHOC/SPMV. If a work-group is configured such that some (but not all) work items’ myRow exceeds the dimparameter, the barrier after partial summation (line 22) will not be encountered bythe entire workgroup. myRow is marked as thread-dependent due to the ID calls onlines 10 and 15.

4.3.2 FLOCL-MAST ResultsWe apply FLOCL-MAST to each OpenCL application, across thefive benchmark suites, to detect any semantic inconsistency be-tween the host and device code. As the benchmarks are largelyproduction-grade and well-developed, 50 of the 97 applicationsfound no inconsistencies between device and host argument types.The host-device validation linter is designed for in-developmentapplications, where the interface may be in flux and more likely toexhibit inconsistencies. However, despite the overall quality of thebenchmarks, FLOCL-MAST did raise warnings while validating theremaining 47 applications.

True Positives In a typical CPU-only compilation, the compileremits warnings about mismatched types or missing argumentswhen calling user-defined functions it is aware of. Since OpenCLhost and device codes are separately compiled using different com-piler tool chains, neither tool chain is aware of the interface theother is using for invoking user-defined kernel functions. Therefore,

8

Page 9: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

Table 1. Application behavior when linted with FLOCL-MASTBenchmark Report True Report False OtherSuite Positives Positives Issues

CHO 0 2 0OpenDwarfs 3 3 2PolyBench-ACC 1 8 1Rodinia 1 1 10SHOC 5 7 7

FLOCL-MAST provides this same functionality across the host-to-device interface, and several examples are found in the benchmarksuites.• Possible loss of precision: nine applications change the size and/orsignedness of an integer between host and device.

• PolyBench/Jacobi-2d-Imper has an unused wrapper function thatcalls two kernels without arguments. It should either be removedor documented as requiring arguments persisting from the other(correct) launches.

False Positives Generally, a linter should err on the side of cau-tion when producing warnings. Currently FLOCL-MAST producessome false positivewarnings aboutmissing ormis-typed arguments,many of which can be resolved by respecting the equivalence oftypedefs that refer to the same primitive type.• Unrespected equivalent types: FLOCL-MAST currently uses thedeclared type to match arguments to parameters, but shouldbe adapted to use the primitive type referred to by a possibletypedef’d or dependent type. This causes the majority of falsemismatches (17 out of 21 applications with false positives). Specif-ically, SHOC/BFS, Rodinia/Leukocyte, and CHO/{AES_fpga, AES}all mismatch to the cl_int OpenCL host runtime type; OpenD-warfs/TDMmismatches a typedef’d ubyte to an unsigned charon the device; OpenDwarfs/{TDM, CRC} mismatch a host un-signed int and a device-side uint; PolyBench/{Correlation,SYRK, 2mm, Gemver, gesummv, GEMM, SYR2K, Covariance} usea typedef inside a conditional compilation region for DATA_TYPE;and SHOC/{GEMM, MD and, S3d} all use a template-dependenttype on the host which FLOCL-MAST should resolve (as it has allthe compile-time template specialization information necessaryto do so).

• Variable argument position: SHOC/BFS andOpenDwarfs/Kmeans,use an incremented integer to specify the argument position,rather than an integer literal. If the variable value cannot be stati-cally deduced at compile time, the position cannot be determinedfor validation against the device prototype.

• SHOC/MD calls an iterative kernel in which the arguments areonly set on the first launch, outside a loop, and then multiplesubsequent calls invoke the kernel without reassigning the samearguments. In this case, both launches occur within the samefunction and FLOCL-MAST should be expanded to understandOpenCL’s retained arguments; however, if the launches wereto occur in separate functions, a static analysis tool would havelimited ability to deduce what order they would be invoked atruntime, and thus whether any arguments were retained.

• SHOC/{Reduction, Scan, and Sort} use dynamic local memoryallocation, which is reported as a type mismatch between a NULLpointer on the host, and the true local address-space pointertype on the device. FLOCL-MAST should be expanded to matchsuch local memory regions to a host NULL pointer, but further

confirmation of the host-intended type would be limited to whatcan be heuristically deduced from the size argument to clSetK-ernelArg (such as through inspection of a possible sizeof op-erator).

Other Issues The remaining issues represent limitations of thestatic analysis approach when applied to a JIT-compiled (partiallydynamically-compiled) language, or corner cases to improve in thetool’s implementation.

• Dynamic kernel names and program strings: in all but the sim-plest runtime-invariant assignments, using dynamic string fordevice code or kernel name will be outside the scope of a staticanalysis tool or offline FPGA compliation. Eleven applications usesome form of dynamic string value.

• Indirect cl_kernel references: eight applications alias their cl_-kernel objects by passing them through functions or storingthem within other data types. Some degenerate cases may besolvable with propagation techniques a la CU2CL-MAST, butmost are outside the scope of a static analysis tool, as following(possibly multiple) runtime-dependent cl_kernels through aCFG is more difficult than a static type change.

5 Related WorkTraditional whole-program analysis and transformation tools sufferfrom prohibitive run-time and memory cost due to the quadraticcomplexity of the required program representations such as Pro-gram Dependence Graph (PDG) [2, 16]. One approach for designingsuch tools is to construct the expensive program representations ondemand using program slicing [2]. However, this demand-drivenapproach is limited to software debugging and comprehension andis not suitable for performing global source-code transformation ordetecting inter-TU inconsistencies. Another approach for whole-program tools summarizes the critical program representations toreduce the super-linear run-time and memory complexity at thecost of less analysis precision, which hinders the ability of thesetools to realize source-code translation and bug detection [22].

Therefore, this work introduces an open-source framework forwhole-program analysis and transformation that utilizes the ab-stract syntax tree (AST) representation of software programs andleverages the Clang and LLVM infrastructure [6, 14]. Unlike otherprogram representations, the size of the AST is linear in the sizeof the software program [2, 16], which makes it a practical rep-resentation for tackling several important problems. Specifically,we demonstrate the application of a novel multi-abstract syntaxtree (multi-AST) approach to create tools that operate on entireprograms composed of multiple separately compiled and linkedC/C++-like source files.

6 ConclusionIn this work we have demonstrated a workflow for productiveOpenCL programming and whole-program analysis on FPGA. Theworkflow is composed of three tools: a reconstructed Multi-ASTvariation of an existing CUDA-to-OpenCL translator, intra-TU lin-ters for OpenCL and FPGA, and an inter-TU linter for validating theOpenCL host-to-device kernel launch interface. The entire work-flow is demonstrated by translation and analysis of a research finiteautomata code, originally written in CUDA. Further analysis of the

9

Page 10: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

linters is conducted on a suite of 96 OpenCL benchmarks, success-fully identifying over 150 possible errors or performance faults, aswell as opportunities to further refine the utilized heuristics.

References[1] Altera. 2015. Altera FPGA SDK for OpenCL, Best practices guide. Online. (05.04

2015). https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/opencl-sdk/aocl_optimization_guide.pdf

[2] Darren C. Atkinson and William G. Griswold. 1996. The Design of Whole-program Analysis Tools. In Proceedings of the 18th International Conference onSoftware Engineering (ICSE ’96). IEEE Computer Society, Washington, DC, USA,16–27. http://dl.acm.org/citation.cfm?id=227726.227732

[3] Niccolo’ Cascarano, Pierluigi Rolando, Fulvio Risso, and Riccardo Sisto. 2010.iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Comput. Commun.Rev. 40, 5 (Oct. 2010), 20–26. https://doi.org/10.1145/1880153.1880157

[4] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron.2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEEInternational Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797

[5] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and K. Skadron.2010. A characterization of the Rodinia benchmark suite with comparison tocontemporary CMP workloads. In Workload Characterization (IISWC), 2010 IEEEInternational Symposium on. 1–11. https://doi.org/10.1109/IISWC.2010.5650274

[6] Clang Clang. 2013. A C language family frontend for LLVM. (2013).[7] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C.

Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The ScalableHeterogeneous Computing (SHOC) Benchmark Suite. In Proceedings of the 3rdWorkshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3). ACM, New York, NY, USA, 63–74. https://doi.org/10.1145/1735688.1735702

[8] Wu-chun Feng, Heshan Lin, Thomas Scogland, and Jing Zhang. 2012. OpenCLand the 13 Dwarfs: A Work in Progress. In Proceedings of the 3rd ACM/SPECInternational Conference on Performance Engineering (ICPE ’12). ACM, New York,NY, USA, 291–294. https://doi.org/10.1145/2188286.2188341

[9] Mark Gardner, Paul Sathre, Wu chun Feng, and Gabriel Martinez. 2013. Character-izing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator.Parallel Comput. 39, 12 (2013), 769 – 786. https://doi.org/10.1016/j.parco.2013.09.003 Programming models, systems software and tools for High-End Computing.

[10] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative ParallelComputing (InPar). 1–10. https://doi.org/10.1109/InPar.2012.6339595

[11] Intel. 2017. Intel FPGA SDK for OpenCL, Best practices guide. Online. (12.082017). https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf

[12] Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, and Jaejin Lee.2015. Bridging OpenCL and CUDA: A Comparative Analysis and Translation.In Proceedings of the International Conference for High Performance Computing,Networking, Storage and Analysis (SC ’15). ACM, New York, NY, USA, Article 82,12 pages. https://doi.org/10.1145/2807591.2807621

[13] K. Krommydas, Wu-Chun Feng, M. Owaida, C. D. Antonopoulos, and N. Bellas.2014. On the characterization of OpenCL dwarfs on fixed and reconfigurableplatforms. In Application-specific Systems, Architectures and Processors (ASAP),2014 IEEE 25th International Conference on. 153–160. https://doi.org/10.1109/ASAP.2014.6868650

[14] Chris Lattner. 2008. LLVM and Clang: Next generation compiler technology. InThe BSD Conference. 1–2.

[15] Gabriel Martinez, Mark Gardner, and Wu-chun Feng. 2011. CU2CL: A CUDA-to-OpenCL translator for multi-and many-core architectures. In Parallel andDistributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE,300–307.

[16] S.S. Muchnick. 1997. Advanced Compiler Design Implementation. Morgan Kauf-mann Publishers.

[17] AaftabMunshi. [n. d.]. The OpenCL Specification, 2008. Chronos OpenCLWorkingGroup ([n. d.]).

[18] Aaftab Munshi. 2009. The opencl specification. In Hot Chips 21 Symposium (HCS),2009 IEEE. IEEE, 1–314.

[19] Geoffrey Ndu, Javier Navaridas, and Mikel Luján. 2015. CHO: Towards a Bench-mark Suite for OpenCL FPGA Accelerators. In Proceedings of the 3rd InternationalWorkshop on OpenCL (IWOCL ’15). ACM, New York, NY, USA, Article 10, 10 pages.https://doi.org/10.1145/2791321.2791331

[20] CUDA Nvidia. 2007. Compute unified device architecture programming guide.(2007).

[21] Paul Sathre, Mark Gardner, and Wu-chun Feng. 2012. Lost in translation: Chal-lenges in automating cuda-to-opencl translation. In Parallel Processing Workshops(ICPPW), 2012 41st International Conference on. IEEE, 89–96.

[22] M Sharir and A Pnueli. 1978. Two approaches to interprocedural data flow analysis.Technical Report. New York Univ. Comput. Sci. Dept.

[23] Xiaodong Yu and Michela Becchi. 2013. GPU Acceleration of Regular ExpressionMatching for Large Datasets: Exploring the Implementation Space. In Proceedingsof the ACM International Conference on Computing Frontiers (CF ’13). ACM, New

York, NY, USA, Article 18, 10 pages. https://doi.org/10.1145/2482767.2482791

10

Page 11: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

A Additional Code Listings1 / / Sub−matcher to check i f an e xp r e s s i o n c a l l s an ID f un c t i o n2 con s t auto TID_RHS =3 expr (4 hasDescendant ( c a l l E x p r (5 c a l l e e ( f u n c t i o nDe c l ( anyOf (6 hasName ( " g e t _ g l o b a l _ i d " ) ,7 hasName ( " g e t _ l o c a l _ i d " )8 ) ) )9 ) )10 ) ;11 / / Sub−matcher f o r any o f the b ina ry as s ignment o p e r a t o r s12 con s t au to ANY_ASSIGN =13 anyOf (14 hasOperatorName ( " = " ) , hasOperatorName ( " ∗ = " ) ,15 hasOperatorName ( " / = " ) , hasOperatorName ( " % = " ) ,16 hasOperatorName ( " + = " ) , hasOperatorName ( " −= " ) ,17 hasOperatorName ( " < <= " ) , hasOperatorName ( " > >= " ) ,18 hasOperatorName ( "&= " ) , hasOperatorName ( " ^ = " ) ,19 hasOperatorName ( " | = " )20 ) ;21 / / Matcher f o r any s t a t emen t t h a t e i t h e r i n i t i a l i z e s a v a r i a b l e with22 / / an ID c a l l , or a s s i g n s the r e s u l t s o f an ID c a l l t o a v a r i a b l e23 F inder −>addMatcher (24 compoundStmt (25 / / Bind on a c t u a l g e t _ l o c a l / g l o b a l _ i d c a l l s26 fo rEachDescendan t (27 s tmt ( anyOf (28 / / D e c l a r a t i o n s which i n i t i a l i z e with an ID c a l l29 d e c l S tm t ( hasDescendant (30 va rDec l ( h a s I n i t i a l i z e r ( TID_RHS ) ) . b ind ( " t i d _ d ep_v a r " )31 ) ) ,32 / / Ass ignments t h a t s t o r e an ID va lue33 b i na ryOpe r a t o r ( a l l O f (34 ANY_ASSIGN ,35 hasRHS ( TID_RHS ) ,36 hasLHS ( anyOf (37 d e c lR e fExp r ( t o (38 va rDec l ( ) . b ind ( " t i d _ d ep_v a r " )39 ) ) ,40 memberExpr ( member (41 f i e l d D e c l ( ) . b ind ( " t i d _ d e p _ f i e l d " )42 ) )43 ) )44 ) )45 ) ) . b ind ( " s t r a i g h t _ a s s i g nmen t " )46 )47 ) , t h i s ) ;48 / / Matcher f o r a l l v a r i a b l e d e c l a r a t i o n s t h a t r e f e r e n c e ano the r49 / / v a r i a b l e in t h e i r i n i t i a l i z e r ( i n c a s e i t i s ID−dependent )50 F inder −>addMatcher (51 s tmt ( fo rEachDescendan t (52 va rDec l ( h a s I n i t i a l i z e r (53 fo rEachDescendan t ( s tmt ( anyOf (54 d e c lR e fExp r (55 to ( va rDec l ( ) )56 ) . b ind ( " a s s i g n _ r e f _ v a r " ) ,57 memberExpr (58 member ( f i e l d D e c l ( ) )59 ) . b ind ( " a s s i g n _ r e f _ f i e l d " )60 ) ) )61 ) ) . b ind ( " p o t _ t i d " )62 ) ) , t h i s ) ;63 / / Matcher f o r a l l v a r i a b l e s t h a t a t some po i n t a r e a s s i gn ed a64 / / v a l u e from ano the r v a r i a b l e ( i n c a s e i t i s ID−dependent )65 F inder −>addMatcher (66 s tmt ( fo rEachDescendan t (67 b i n a ryOpe r a t o r ( a l l O f (68 ANY_ASSIGN ,69 hasRHS (70 fo rEachDescendan t ( s tmt ( anyOf (71 d e c lR e fExp r (72 to ( va rDec l ( ) )73 ) . b ind ( " a s s i g n _ r e f _ v a r " ) ,74 memberExpr (75 member ( f i e l d D e c l ( ) )76 ) . b ind ( " a s s i g n _ r e f _ f i e l d " )77 ) ) )78 ) ,79 hasLHS ( anyOf (80 d e c lR e fExp r ( t o (81 va rDec l ( ) . b ind ( " p o t _ t i d _ v a r " )82 ) ) ,83 memberExpr ( member (84 f i e l d D e c l ( ) . b ind ( " p o t _ t i d _ f i e l d " )85 ) )86 ) )87 ) )88 ) ) , t h i s ) ;

(a) The complex set of matchers required to track thread-dependent values

1 / /A r e c u r s i v e p r eo r d e r t r a v e r s a l o f an e xp r e s s i o n to check i f any o f the2 / / v a r i a b l e s t h a t might be r e f e r e n c e d in i t a r e marked as ID−dependent3 con s t Dec lRe fExpr ∗ MyCheck : : has IDDepDeclRef ( c on s t Expr ∗ e ) {4 / / I f t h i s e x p r e s s i o n i s a v a r i a b l e r e f e r e n c e5 i f ( c on s t Dec lRe fExpr ∗ expr = dyn_cas t <Dec lRefExpr >( e ) ) {6 / / check i f i t ' s on the ID−dependent v a r i a b l e l i s t7 i f ( s t d : : f i n d ( IDDepVars . beg in ( ) , IDDepVars . end ( ) ,8 dyn_cas t <VarDecl > ( expr−>ge tDec l ( ) ) ) != IDDepVars . end ( ) ) {9 r e t u r n expr ;10 } e l s e {11 r e t u r n NULL ;12 }13 / / E l s e i t i s e i t h e r a non− r e f e r e n c e l e a f node ,14 / / or i t i s an an c e s t o r t h a t needs to be r e c u r s e d i n t o15 } e l s e {16 / / I t e r a t e over i t s c h i l d r e n and see i f any o f them are ID−dependent17 f o r ( au to i = e−>ch i l d _ b e g i n ( ) ; i != e−>ch i l d _ end ( ) ; ++ i ) {18 i f ( au to ∗ newExpr = dyn_cas t <Expr > ( ∗ i ) ) {19 auto ∗ r e t Exp r = has IDDepDeclRef ( newExpr ) ;20 i f ( r e t Exp r ) r e t u r n r e t Exp r ;21 }22 }23 / / I f none o f the c h i l d r e n f o r c e an e a r l y r e t u r n with a match , r e t u r n NULL24 r e t u r n NULL ;25 }26 r e t u r n NULL ; / / Should be un r ea chab l e27 }28 / / Omit ted hasIDDepMember f u n c t i o n i somorph i c to the above , but f o r29 / / r e f e r e n c e s to f i e l d members3031 / / The beg inn ing po r t i o n o f any Match Ca l l b a c k t h a t has to d e a l with th r e ad32 / / dependency must s t a r t by popu l a t i n g the ID dependency l i s t33 / / The f i r s t h a l f o f the c a l l b a c k only d e a l s with i d e n t i f y i n g and p ropaga t i ng34 / / ID−dependency i n f o rma t i on i n t o the IDDepVars v e c t o r35 V a r i a b l e = R e s u l t . Nodes . getNodeAs <VarDecl > ( " t i d _ d ep_v a r " ) ;36 F i e l d = R e s u l t . Nodes . getNodeAs < F i e l dDe c l > ( " t i d _ d e p _ f i e l d " ) ;37 S t a t emen t = R e s u l t . Nodes . getNodeAs <Stmt > ( " s t r a i g h t _ a s s i g nmen t " ) ;38 RefExpr = R e s u l t . Nodes . getNodeAs <Dec lRefExpr > ( " a s s i g n _ r e f _ v a r " ) ;39 MemExpr = R e s u l t . Nodes . getNodeAs <MemberExpr > ( " a s s i g n _ r e f _ f i e l d " ) ;40 P o t e n t i a l V a r = R e s u l t . Nodes . getNodeAs <VarDecl > ( " p o t _ t i d _ v a r " ) ;41 P o t e n t i a l F i e l d = R e s u l t . Nodes . getNodeAs < F i e l dDe c l > ( " p o t _ t i d _ f i e l d " ) ;42 / / I f the re ' s a s t a t emen t we know i s ID−dependent and has a v a r i a b l e on the LHS43 i f ( S t a t emen t && ( V a r i a b l e | | F i e l d ) ) {44 / / Record t h a t t h i s v a r i a b l e i s thread−dependent on ly i f we haven ' t a l r e a dy45 i f ( V a r i a b l e ) {46 i f ( s t d : : f i n d ( IDDepVars . beg in ( ) , IDDepVars . end ( ) ,47 V a r i a b l e ) == IDDepVars . end ( ) ) {48 IDDepVars . push_back ( V a r i a b l e ) ;49 } e l s e i f ( F i e l d ) {50 i f ( s t d : : f i n d ( IDDepF i e l d s . beg in ( ) , IDDepF i e l d s . end ( ) ,51 F i e l d ) == IDDepF i e l d s . end ( ) ) {52 IDDepF i e l d s . push_back ( F i e l d ) ;53 }54 }55 } e l s e i f ( S t a t emen t ) {56 / / Warn t h a t we found an ID−dependent ass ignment ,57 / / but couldn ' t deduce the a s s i g n e e v a r i a b l e58 } e l s e i f ( V a r i a b l e ) {59 / / Warn t h a t an ID−dependent v a r i a b l e was found ,60 / / but we can ' t make sense o f the s t a t emen t i t ' s in61 } e l s e i f ( F i e l d ) {62 / / Warn t h a t an ID−dependent f i e l d was found ,63 / / but we can ' t make sense o f the s t a t emen t i t ' s in64 }65 i f ( Re fExpr && Po t e n t i a l V a r ) {66 / / i f we got a h i t because o f a v a r i a b l e in the RHS67 con s t au to RefVar = dyn_cas t <VarDecl > ( RefExpr−>ge tDec l ( ) ) ;68 / / I f the LHS v a r i a b l e i sn ' t r e co rded as ID−dependent69 / / but the r e f e r e n c e d v a r i a b l e i s , the a s s i g n e e shou ld be r e co rded70 i f ( s t d : : f i n d ( IDDepVars . beg in ( ) , IDDepVars . end ( ) ,71 P o t e n t i a l V a r ) == IDDepVars . end ( ) &&72 s t d : : f i n d ( IDDepVars . beg in ( ) , IDDepVars . end ( ) ,73 RefVar ) != IDDepVars . end ( ) ) {74 / / Record i t75 IDDepVars . push_back ( P o t e n t i a l V a r ) ;76 }77 }78 / / Omit ted t r a c k i n g f o r MemberExprs in LHS /RHS or both , as i t79 / / i s i somorph i c to the above

(b) Code used in callbacks to finish propagating thread-dependency information.Portions of the code related to handling struct members have been omitted for brevitysince they are structurally isomorphic to their variable counterparts, but storeddifferently on the Clang AST.

Figure 10. Tracking thread dependency relies on a complex combination of Matcher and Callback functionality, that exploits the preorder AST traversal property of the ASTMatchers interface.

11

Page 12: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 __ke rne l vo id dependent ( ) {2 i n t t i d = g e t _ l o c a l _ i d ( 0 ) ; i n t g i d = g e t _ g l o b a l _ i d ( 0 ) ;3 i n t t x = t i d + g e t _ l o c a l _ s i z e ( 0 ) ∗ g id ;4 f o r ( i n t i = t x ; i < 256 ; i ++) {5 b a r r i e r (CLK_LOCAL_MEM_FENCE ) ;6 }7 f o r ( i n t i = 0 ; i < g e t _ l o c a l _ i d ( 0 ) ; i ++) {8 b a r r i e r (CLK_LOCAL_MEM_FENCE ) ;9 }10 }

1112 __ke rne l vo id independen t ( ) {13 i n t t i d = g e t _ l o c a l _ i d ( 0 ) ;14 i n t g i d = g e t _ g l o b a l _ i d ( 0 ) ;15 i n t t x = t i d + g e t _ l o c a l _ s i z e ( 0 ) ∗ g id ;16 f o r ( i n t i = 0 ; i < 256 ; i ++) {17 b a r r i e r (CLK_LOCAL_MEM_FENCE ) ;18 }19 }

(a) Example of two for loops that are ID-dependent, one from referencing an ID-dependent variable (“i” via “tx” via “tid” and “gid”) and another due to calling an ID function, as wellas a for loop that is not ID-dependent.

1 / / Omitted ID−dependency matchers23 / / Sub−matcher to check i f a s t a tmen t c on t a i n s4 / / a de s cenden t t h a t i s a b a r r i e r c a l l5 c on s t auto HAS_BAR_DESC =6 hasDescendant (7 c a l l E x p r ( c a l l e e (8 f un c t i o nDe c l ( anyOf (9 hasName ( " b a r r i e r " ) , hasName ( " work_g roup_ba r r i e r " )10 ) )11 ) ) . b ind ( " b a r r i e r " )12 ) ;13 / / Sub−matcher to check i f an e xp r e s s i o n e i t h e r14 / / c a l l s an ID f un c t i o n or r e f e r e n c e s a v a r i a b l e15 con s t au to COND_EXPR =16 expr ( anyOf (17 hasDescendant ( c a l l E x p r (18 c a l l e e ( f u n c t i o nDe c l ( anyOf (19 hasName ( " g e t _ g l o b a l _ i d " ) , hasName ( " g e t _ l o c a l _ i d " )20 ) ) )21 ) . b ind ( " i d _ c a l l " ) ) ,22 hasDescendant ( s tmt ( anyOf (23 d e c lR e fExp r (24 to ( va rDec l ( ) )25 ) ,26 memberExpr (27 member ( f i e l d D e c l ( ) )28 )29 ) ) )30 ) ) . b ind ( " cond_expr " ) ;31 / / Matcher t h a t f i n d s any branch s t a t emen t s t h a t32 / / have a p o t e n t i a l ID−dependent c o n d i t i o n a l33 / / e i t h e r from a c a l l or v a r i a b l e r e f e r e n c e34 F inder −>addMatcher (35 s tmt ( anyOf (36 f o r S tm t ( a l l O f (37 HAS_BAR_DESC , ha sCond i t i on (COND_EXPR ) ) ) . b ind ( " f o r " ) ,38 i f S tm t ( a l l O f (39 HAS_BAR_DESC , ha sCond i t i on (COND_EXPR ) ) ) . b ind ( " i f " ) ,40 doStmt ( a l l O f (41 HAS_BAR_DESC , ha sCond i t i on (COND_EXPR ) ) ) . b ind ( " do " ) ,42 whi l eS tmt ( a l l O f (43 HAS_BAR_DESC , ha sCond i t i on (COND_EXPR ) ) ) . b ind ( " whi l e " ) ,44 sw i t chS tmt ( a l l O f (45 HAS_BAR_DESC , ha sCond i t i on (COND_EXPR ) ) ) . b ind ( " sw i t ch " )46 ) ) , t h i s ) ;

(b) The ASTMatcher used to find all conditionals that contain a barrier and needto be checked for ID-dependency that may make the barrier unreachable by somethreads.

1 / / Omitted thread−dependency t r a c k i n g code23 / /We want the d i a g n o s t i c to emi t s l i g h t l y d i f f e r e n t4 / / t e x t so i t b i nd s on the f i v e p o t e n t i a l c o n d i t i o n a l5 / / t ype s ( o f which only one w i l l be t r u e ) a b a r r i e r6 / / and c o n d i t i o n a l e xp r e s s i on , and p o s s i b l y an i d c a l l7 B a r r i e r = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " b a r r i e r " ) ;8 ForAn = Re s u l t . Nodes . getNodeAs <ForStmt > ( " f o r " ) ;9 I fAn = R e s u l t . Nodes . getNodeAs < I f S tm t > ( " i f " ) ;10 DoAn = Re s u l t . Nodes . getNodeAs <DoStmt > ( " do " ) ;11 WhAn = Re s u l t . Nodes . getNodeAs <WhileStmt > ( " whi l e " ) ;12 SwAn = Re s u l t . Nodes . getNodeAs <SwitchStmt > ( " sw i t ch " ) ;13 Cond = R e s u l t . Nodes . getNodeAs <Expr > ( " cond_expr " ) ;14 IDCa l l = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " i d _ c a l l " ) ;15 / / F i gu r e out which type o f c o n d i t i o n a l16 i n t type = −1;17 i f ( ForAn ) { type = 0 ; }18 e l s e i f ( I fAn ) { type = 1 ; }19 e l s e i f (DoAn ) { type = 2 ; }20 e l s e i f (WhAn) { type = 3 ; }21 e l s e i f ( SwAn ) { type = 4 ; }22 / / I f t h e r e i s a c o n d i t i o n a l e xp r e s s i on , s t a r t the23 / / d i a g n o s t i c s ( t h i s code g e t s h i t by a l l matches t h a t24 / / on ly t r i g g e r ID−dependency t r a ck i ng , so we may25 / / not need to run t h i s code each t ime )26 i f ( Cond ) {27 / / Omit ted Va l id−Ba r r i e r −Ref inement Code ( F i g . 1 1 )28 / / The c o n d i t i o n a l e x p r e s s i o n might c a l l an ID f un c t i o n29 / / d i r e c t l y , hand le t h a t c a s e30 i f ( IDCa l l ) {31 / / Emit a d i a g n o s t i c warning t h a t a l l t h r e a d s may not r each32 / / the b a r r i e r due to the ID c a l l i n the branch c ond i t i o n33 } e l s e {34 / / The c ond i t i o n does not c a l l an ID func t i on , but may35 / / r e f e r e n c e a v a r i a b l e t h a t needs to be checked f o r an ID36 con s t auto ∗ r e t Exp r = has IDDepDeclRef ( Cond ) ;37 i f ( r e t Exp r ) { / / I t has an ID−dependent r e f e r e n c e38 / / Emit a d i a g n o s t i c warning o f which v a r i a b l e in39 / / the c o n d i t i o n a l e x p r e s s i o n may be ID−dependent40 / / and r i s k making the b a r r i e r un r e a chab l e41 }42 }43 }

(c) The latter portion of the unreachable barrier callback uses the ID-dependencyinformation computed earlier to check the conditional expression inside every branchthat contains a call to a barrier function.

Figure 11. Example code for checking whether a barrier is potentially not reachable by all threads due to being inside a thread-dependent branch.

12

Page 13: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 sw i t ch ( type ) {2 c a s e 1 : { / / B a r r i e r i n s i d e i f / e l s e − i f / e l s e3 / / I f t h i s i s an e l s e − i f , we have a l r e a d y checked i t4 / / based on the i n i t i a l I f S tm t5 / / Get the pa r en t to s ee i f we ' r e in an e l s e branch6 con s t auto p a r en t s = R e s u l t . Context−>g e t P a r e n t s ( ∗ I fAns c ) ;7 c on s t Stmt ∗ pa r en t = NULL ;8 i f ( ! p a r en t s . empty ( ) ) { p a r en t = p a r en t s [ 0 ] . get <Stmt > ( ) ; }9 i f ( p a r en t ) {10 con s t I f S tm t ∗ i f P a r e n t = dyn_cas t < I f S tm t >( pa r en t ) ;11 i f ( i f P a r e n t != NULL && i f P a r e n t −>g e t E l s e ( ) == I fAns c ) {12 / / I f we a r e in an e l s e branch , r e t u r n13 r e t u r n ;14 } }15 / / F i r s t c o l l e c t a l l e l s e / i f b ranches to i t e r a t e over16 s t d : : l i s t < con s t Stmt ∗> checks ;17 auto c u r r I f = I fAns c ;18 checks . push_back ( c u r r I f −>getThen ( ) ) ; / / Record i n i t i a l Then19 con s t Stmt ∗ n e x t I f = I fAnsc −>g e t E l s e ( ) ;20 / / I t e r a t e through e l s e − i f Thens21 f o r ( ; n e x t I f != NULL && dyn_cas t < I f S tm t >( n e x t I f ) ;22 n e x t I f = c u r r I f −>g e t E l s e ( ) ) {23 c u r r I f = dyn_cas t < I f S tm t >( n e x t I f ) ;24 checks . push_back ( c u r r I f −>getThen ( ) ) ;25 }26 i f ( n e x t I f == NULL ) { / / There i s no f i n a l e l s e27 / / then we cannot prove a l l t h r e a d s ex e cu t e the b a r r i e r28 break ;29 } e l s e {30 / / n e x t I f ho l d s the a c t u a l f i n a l e l s e s t a t emen t31 checks . push_back ( n e x t I f ) ;32 }33 / / The l i s t o f b ranches to check i s comple te34 / / make su r e they a l l c a l l a b a r r i e r the same number o f t imes35 / / S e r i a l i z e the i f / e l s e − i f / e l s e t r e e to make t r a v e r s a l easy36 s t d : : l i s t < con s t Stmt ∗> f l a t I f E l s e ;37 p r e o r d e r F l a t t e n S tm t ( I fAnsc , & f l a t I f E l s e ) ;38 s t d : : l i s t < con s t Stmt ∗ > : : i t e r a t o r cu r rCase = checks . beg in ( ) ;39 i n t maxBa r r i e r s = −1; i n t c u r r B a r r i e r s = 0 ;40 / / I t e r a t e over every AST node in the i f / e l s e − i f / e l s e t r e e41 f o r ( au to i t r = f l a t I f E l s e . beg in ( ) ; i t r != f l a t I f E l s e . end ( ) &&42 cur rCase != checks . end ( ) ; i t r ++) {43 / / I f s t a r t i n g the nex t case , or a t the l a s t e l ement44 i f ( ∗ i t r == ∗ ( s t d : : nex t ( currCase , 1 ) ) | |45 s t d : : nex t ( i t r , 1 ) == f l a t I f E l s e . end ( ) ) {46 / / No b a r r i e r s , a b o r t t o d i a g n o s t i c s47 i f ( c u r r B a r r i e r s == 0 ) { b reak ; }48 i f ( maxBa r r i e r s != −1 && c u r r B a r r i e r s != maxBa r r i e r s ) {49 / / Th i s branch has a d i f f e r e n t number o f b a r r i e r s than50 / / a p r e v i o u s l y p r o c e s s e d branch , a bo r t to d i a g n o s t i c s51 break ;52 }53 / / F i r s t c a s e eva l ua t ed , s e t the expec t ed b a r r i e r count54 i f ( maxBa r r i e r s == −1) maxBa r r i e r s = c u r r B a r r i e r s ;55 / / Advance to the nex t branch56 cur rCase ++ ;57 / / J u s t p r o c e s s e d the l a s t branch and haven ' t f a i l e d yet ,58 / / then assume the b a r r i e r i s s a f e and r e t u r n !59 i f ( cu r rCase == checks . end ( ) ) { r e t u r n ; }60 }61 / / At the beg inn ing o f each branch , r e s e t the coun t e r62 i f ( ∗ i t r == ∗ cu r rCase ) { c u r r B a r r i e r s = 0 ; }63 / /When we encoun te r a f u n c t i o n c a l l , check i f i t ' s a b a r r i e r64 i f ( au to c a l l = dyn_cas t <Ca l lExpr > ( ∗ i t r ) ) {65 i f ( au to c a l l D e c l = c a l l −> g e t D i r e c t C a l l e e ( ) ) {66 s t d : : s t r i n g name = c a l l D e c l −>getNameAsStr ing ( ) ;67 i f ( name != " " && name != " b a r r i e r " &&68 name != " work_g roup_ba r r i e r " ) {69 / / i f i t ' s a c a l l t o non−b a r r i e r f unc t i on , we don ' t know70 / / whether i t c o n t a i n s a b a r r i e r , c o n s e r v a t i v e l y a bo r t71 break ;72 } e l s e {73 c u r r B a r r i e r s ++ ; / / count the b a r r i e r74 }75 } e l s e {76 / / Can ' t deduce what i s be ing c a l l e d77 break ; / / c o n s e r v a t i v e l y a bo r t78 } }79 / / I f we h i t any ne s t ed branches , b reak80 i f ( i s a < I f S tm t > ( ∗ i t r ) | | i s a <SwitchStmt > ( ∗ i t r ) | |81 i s a <ForStmt > ( ∗ i t r ) | | i s a <DoStmt > ( ∗ i t r ) | |82 i s a <WhileStmt > ( ∗ i t r ) | | i s a <ReturnStmt > ( ∗ i t r ) | |83 i s a <BreakStmt > ( ∗ i t r ) | | i s a <GotoStmt > ( ∗ i t r ) ) {84 / / I f p a r en t o f currCase , then i t ' s the i n i t i a l I f85 / / I f p a r en t o f cu r rCase +1 , then i t ' s an E l se− I f86 boo l i s O u t e r I f = f a l s e ;

87 i f ( c on s t I f S tm t ∗ par = dyn_cas t < I f S tm t > ( ∗ i t r ) ) {88 / / Make su r e i t ' s not a ne s t ed I f89 f o r ( au to c I t r = par−>ch i l d _ b e g i n ( ) ;90 c I t r != par−>ch i l d _ end ( ) ; c I t r ++) {91 i f ( ( c I t r != par−>ch i l d _ end ( ) ) && ( ∗ c I t r == ∗ cu r rCase | |92 ∗ c I t r == ∗ ( s t d : : nex t ( currCase , 1 ) ) ) ) {93 i s O u t e r I f = t r u e ;94 break ; / / Out o f the c h i l d check loop95 } } }96 / / I f i t i s n e i t h e r , then i t must be an i n t e r i o r branch , so a bo r t97 i f ( ! i s O u t e r I f ) { b reak ; }98 } } }99 break ; / / End o f i f / e l s e − i f / e l s e p r o c e s s i n g100 c a s e 4 : { / / B a r r i e r i n s i d e sw i t ch / c a s e101 boo l h a sDe f au l t = f a l s e ;102 con s t Swi tchCase ∗ c a s e s = SwitchAnsc−>g e t Sw i t c hCa s e L i s t ( ) ;103 s t d : : l i s t < con s t Stmt ∗> checks ;104 / / Th i s i t e r a t e s from the bottom of the c a s e l i s t , upwards105 f o r ( ; c a s e s != NULL ; c a s e s = case s −>ge tNex tSwi t chCase ( ) ) {106 i f ( dyn_cas t <De fau l t S tmt >( c a s e s ) ) {107 h a sDe f au l t = t r u e ;108 }109 / / Sk ip over c a s e s with no body ( immediate subsequen t c a s e )110 i f ( ! dyn_cas t <CaseStmt >( ca se s −>getSubStmt ( ) ) ) checks . push_back ( c a s e s ) ;111 }112 / / I f no d e f a u l t case , can ' t gua r an t e e a l l t h r e a d s exe cu t e b a r r i e r113 i f ( h a sDe f au l t == f a l s e ) break ;114 / / The l i s t o f c a s e s with bod i e s i s comple te115 / / make su r e they a l l c a l l a b a r r i e r the same number o f t imes116 / / S e r i a l i z e the sw i t ch / c a s e t r e e to make t r a v e r s a l easy117 s t d : : l i s t < con s t Stmt ∗> f l a t Sw i t c h ;118 p r e o r d e r F l a t t e n S tm t ( SwitchAnsc−>getBody ( ) , &f l a t Sw i t c h ) ;119 checks . r e v e r s e ( ) ; / / Check c a s e s from top to bottom120 s t d : : l i s t < con s t Stmt ∗ > : : i t e r a t o r cu r rCase = checks . beg in ( ) ;121 i n t maxBa r r i e r s = −1; i n t c u r r B a r r i e r s = 0 ;122 f o r ( au to i t r = f l a t Sw i t c h . beg in ( ) ; i t r != f l a t Sw i t c h . end ( ) &&123 cur rCase != checks . end ( ) ; i t r ++) {124 / / Re s e t c oun t e r s when a new ca s e i s r eached125 i f ( ∗ i t r == ∗ cu r rCase ) { c u r r B a r r i e r s = 0 ; }126 / / I f we con t i nue i n t o a subsequen t c a s e wi thout a break127 i f ( ∗ i t r == ∗ ( s t d : : nex t ( currCase , 1 ) ) ) {128 / / Abort i f we have any accumula ted b a r r i e r s129 / / s i n c e we a re gua ran t eed to have more than subsequen t c a s e130 i f ( c u r r B a r r i e r s > 0 ) { b reak ; }131 cur rCase ++ ; / / o t h e rw i s e j u s t advance to nex t c a s e132 }133 / / Cases a r e f o rma l l y ended by b r eak s134 i f ( dyn_cas t <BreakStmt > ( ∗ i t r ) ) {135 / / No b a r r i e r s , a b o r t t o d i a g n o s t i c s136 i f ( c u r r B a r r i e r s == 0 ) { b reak ; }137 i f ( maxBa r r i e r s != −1 && c u r r B a r r i e r s != maxBa r r i e r s ) {138 break ; / / Some o the r c a s e has a d i f f e r e n t number o f b a r r i e r s139 }140 / / Th i s i s the f i r s t c a s e eva l ua t ed , s e t the expec t ed count141 i f ( maxBa r r i e r s == −1) maxBa r r i e r s = c u r r B a r r i e r s ;142 cur rCase ++ ; / / Advance to the nex t c a s e143 / / I f we j u s t p r o c e s s e d our l a s t c a s e and haven ' t f a i l e d yet ,144 / / then we can assume the b a r r i e r i s s a f e and r e t u r n !145 i f ( cu r rCase == checks . end ( ) ) { r e t u r n ; }146 }147 / /When we encoun te r a f u n c t i o n c a l l , check i f i t ' s a b a r r i e r148 i f ( au to c a l l = dyn_cas t <Ca l lExpr > ( ∗ i t r ) ) {149 i f ( au to c a l l D e c l = c a l l −> g e t D i r e c t C a l l e e ( ) ) {150 s t d : : s t r i n g name = c a l l D e c l −>getNameAsStr ing ( ) ;151 i f ( name != " " && name != " b a r r i e r " &&152 name != " work_g roup_ba r r i e r " ) {153 / / i f i t ' s a c a l l t o non−b a r r i e r f unc t i on , we don ' t know154 / / whether i t c o n t a i n s a b a r r i e r , c o n s e r v a t i v e l y a bo r t155 break ;156 } e l s e {157 c u r r B a r r i e r s ++ ; / / count the b a r r i e r158 }159 } e l s e {160 / / Can ' t deduce what i s be ing c a l l e d161 break ; / / c o n s e r v a t i v e l y a bo r t162 } }163 / / I f we h i t any ne s t ed branches or jumps , b reak164 i f ( i s a < I f S tm t > ( ∗ i t r ) | | i s a <SwitchStmt > ( ∗ i t r ) | |165 i s a <ForStmt > ( ∗ i t r ) | | i s a <DoStmt > ( ∗ i t r ) | |166 i s a <WhileStmt > ( ∗ i t r ) | | i s a <ReturnStmt > ( ∗ i t r ) | |167 i s a <GotoStmt > ( ∗ i t r ) ) {168 break ;169 } } }170 break ; / / End o f sw i t ch / c a s e p r o c e s s i n g171 }

Figure 12. Refinement step to search through if/else and switch/case statements to attempt to prove all branches execute the same number of barriers.

13

Page 14: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 fooKern = c l C r e a t eK e r n e l ( . . . , " f oo " , . . . ) ;2 . . .3 c l S e tK e r n e lA r g ( fooKern , 0 , s i z e o f ( i n t ) , &myAInt ) ;4 c l S e tK e r n e lA r g ( fooKern , 1 , s i z e o f ( cl_mem ) , &myBclMem ) ;5 c l S e tK e r n e lA r g ( fooKern , 2 , s i z e o f ( f l o a t ) , &myCPtr ) ;6 clEnqueueNDRangeKernel ( . . . , fooKern , . . . ) ;7 . . .8 c l S e tK e r n e lA r g ( fooKern , 0 , s i z e o f ( uns igned i n t ) , &myAUInt ) ;9 c l S e tK e r n e lA r g ( fooKern , 1 , s i z e o f ( cl_mem ) , &myBclMem ) ;10 c l S e tK e r n e lA r g ( fooKern , 2 , s i z e o f ( f l o a t ) , &myCPtr ) ;11 clEnqueueNDRangeKernel ( . . . , fooKern , . . . ) ;

(a) Example of OpenCL host code to create and launch the kernel from Figure 13d.Note that the first launch sets arguments of the correct host types, and the secondmisuses an unsigned integer for the zeroth argument.

1 / / Matcher to c a t ch a l l d e c l a r a t i o n s o f c l _ k e r n e l s2 Matcher . addMatcher (3 va rDec l (4 hasType ( a s S t r i n g ( " c l _ k e r n e l " ) )5 ) . b ind ( " k e r n e l _ d e c l " ) , new Hos tKerne lDec lHand l e r ( ) ) ;6 / / Matcher to c a t ch a l l r e f e r e n c e s to c l _ k e r n e l s7 Matcher . addMatcher (8 d e c lR e fExp r ( t o ( va rDec l (9 hasType ( a s S t r i n g ( " c l _ k e r n e l " ) )10 ) ) ) . b ind ( " k e r n e l _ r e f " ) , new Hos tKe rne lDec lRe fHand l e r ( ) ) ;11 / / Sub−matcher to check whether a RHS c a l l s c l C r e a t eK e r n e l12 con s t au to c rea t eKerne lRHS =13 c a l l E x p r ( c a l l e e (14 f un c t i o nDe c l ( hasName ( " c l C r e a t eK e r n e l " ) )15 ) ) . b ind ( " c r e a t e " ) ;16 / / Matcher f o r any ass ignment o f the r e t u r n va lue o f c l C r e a t eK e r n e l17 Matcher . addMatcher (18 s tmt ( anyOf (19 / / c a t ch d e c l a r a t i o n s t h a t a r e a l s o i n i t i a l i z a t i o n s20 de c l S tm t ( hasDescendant (21 va rDec l ( h a s I n i t i a l i z e r ( c r ea t eKerne lRHS ) ) . b ind ( " a s s i gnee ' ' )22 ) ) ,23 / / c a t ch r e g u l a r v a r i a b l e a s s i gnmen t s24 b i na ryOpe r a t o r (25 hasOperatorName ( " = " ) ,26 hasRHS ( c r ea t eKerne lRHS ) ,27 hasLHS ( d e c lR e fExp r ( t o (28 va rDec l ( ) . b ind ( " a s s i g n e d _ k e r n e l " )29 ) ) )30 ) ,31 / / c a t ch a s s i gnmen t s to s t r u c t members32 b i na ryOpe r a t o r (33 hasOperatorName ( " = " ) ,34 hasRHS ( c r ea t eKerne lRHS ) ,35 hasLHS ( memberExpr ( member (36 va l u eDec l ( ) . b ind ( " a s s i g n e d _ k e r n e l " )37 ) ) )38 )39 ) ) . b ind ( " a s s i g n _ s tm t " ) , new Hos tKe rne lCrea t eHand l e r ( ) ) ;40 / / Matcher to c a t ch a l l argument a s s i gnmen t s41 Matcher . addMatcher (42 c a l l E x p r ( c a l l e e (43 f un c t i o nDe c l ( hasName ( " c l S e tK e r n e lA r g " ) )44 ) ) . b ind ( " s e t a r g " ) , new HostKerne lArgHand ler ( ) ) ;45 / / Matcher to c a t ch a l l k e r n e l l aunche s46 Matcher . addMatcher (47 c a l l E x p r ( c a l l e e (48 f un c t i o nDe c l ( anyOf (49 hasName ( " clEnqueueNDRangeKernel " ) ,50 hasName ( " c lEnqueueTask " )51 ) )52 ) ) . b ind ( " l aunch " ) , new Hos tKe rne lCa l lHand l e r ( ) ) ;

(b) Examples of matchers for the host AST(s)

1 / / S t o r e s d e c l a r a t i o n s o f c l _ k e r n e l s2 / / I f g l o b a l or e x t e r n g l o b a l , adds to c ro s s −AST3 / / e x t e r n management s t r u c t u r e4 vo id Hos tKerne lDec lHand l e r : : run ( con s t MatchResu l t &R e s u l t ) {5 c on s t au to ∗ Kerne lDec l =6 R e s u l t . Nodes . getNodeAs <VarDecl > ( " k e r n e l _ d e c l " ) ;7 i f ( Ke rne lDec l ) {8 i f ( Kerne lDec l −>getType ( ) . g e tA s S t r i n g ( ) == " c l _ k e r n e l " ) {9 / / S t o r e the d e c l a r a t i o n f o r d e f e r r e d s t a g e s10 i f ( Kerne lDec l −> i s G l o b a l ( ) && ! Kerne lDec l −> i s E x t e r n ( ) ) {11 / / S t o r e as c a n on i c a l g l o b a l c l _ k e r n e l d e c l by name12 } e l s e i f ( Kerne lDec l −> i s G l o b a l ( ) && Kerne lDec l −> i s E x t e r n ( ) ) {13 / / S t o r e as e x t e r n r e f to g l o b a l c l _ k e r n e l d e c l by name1415 } } } }16 / / A t t a che s every r e f e r e n c e o f a c l _ k e r n e l o b j e c t to the17 / / d e c l a r a t i o n i t i s r e f e r e n c i n g18 / / Unused f o r now but w i l l become pa r t o f a c l _ k e r n e l19 / / p ropaga t o r f o r CFG c a l l s and o th e r r e f e r e n c e s20 vo id Hos tKe rne lDec lRe fHand l e r : : run ( con s t MatchResu l t &R e s u l t ) {21 con s t auto ∗ Kerne lDec lRe f =22 R e s u l t . Nodes . getNodeAs <Dec lRefExpr > ( " k e r n e l _ r e f " ) ;23 i f ( Ke rne lDec lRe f ) {24 i f ( Ke rne lDec lRe f −>getType ( ) . g e tA s S t r i n g ( ) == " c l _ k e r n e l " ) {25 / / D i r e c t l y a t t a c h to a p r e v i ou s l y −d e c l a r e d and s t o r e d26 / / c l _ k e r n e l , s i n c e i t must be d e f i n e d b e f o r e r e f e r e n c e27 }28 }29 }30 / / Record i n f o rma t i on about c a l l s t o c l C r e a t eK e r n e l31 / / and the a s s i g n e e c l _ k e r n e l t h a t s t o r e s the hos t o b j e c t32 vo id Hos tKe rne lCrea t eHand l e r : : run ( con s t MatchResu l t &R e s u l t ) {33 con s t auto ∗ Crea t eKe rne l =34 R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " c r e a t e " ) ;35 con s t auto ∗ KernVar =36 R e s u l t . Nodes . getNodeAs <ValueDecl > ( " a s s i g n e e " ) ;37 / / I f the f u n c t i o n i s c a l l e d and i t has an a s s i g n e e38 i f ( C r e a t eKe rne l && KernVar ) {39 / / F i r s t g e t the name o f the k e r n e l i t i s c r e a t i n g40 con s t S t r i n g L i t e r a l ∗ k e rnS t r =41 dyn_cas t < S t r i n g L i t e r a l > (42 Crea t eKe rne l −>getArg (1)− > I g n o r e Im p l i c i t ( ) ) ;43 i f ( k e r n S t r ) { / / easy , i t ' s a s t r i n g l i t e r a l44 / / Fo r ce the SourceManager and Contex t to p e r s i s t45 / / C rea t e and popu l a t e a s t o r a g e v a r i a b l e46 / / Record the c r e a t e d k e r n e l i n a l i s t f o r l a t e r p r o c e s s i n g47 } e l s e {48 / /TODO: I f i t ' s a r e f e r e n c e to a s t r i n g we w i l l have to a t t empt49 / / t o p ropaga t e i t t o f i n d the S t r i n g L i t e r a l ( s ) i t ' s made o f50 }51 } e l s e {52 / / Omit ted code to emi t d i a g n o s t i c s f o r p a r t i a l matches53 }54 }55 / / Record i n f o rma t i on about c a l l s t o c l S e tK e r n e lA r g56 vo id HostKerne lArgHand ler : : run ( con s t MatchResu l t &R e s u l t ) {57 con s t auto ∗ Se tArgs = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " s e t a r g " ) ;58 i f ( Se tArgs ) {59 / / C rea t e and popu l a t e a s t o r a g e v a r i a b l e60 / Record the a s s i gn ed argument in a l i s t f o r l a t e r p r o c e s s i n g61 S e tA r g sC a l l s . push_back ( da t a ) ;62 }63 }64 / / Record i n f o rma t i on about k e r n e l l aunche s65 vo id Hos tKe rne lCa l lHand l e r : : run ( con s t MatchResu l t &R e s u l t ) {66 con s t auto ∗ LaunchKerne l = R e s u l t . Nodes . getNodeAs <Ca l lExpr > ( " l aunch " ) ;67 i f ( LaunchKerne l ) {68 / / C rea t e and popu l a t e a s t o r a g e v a r i a b l e69 / / Record the launched k e r n e l i n a s e t s o r t e d by sou r c e70 / / manager and l o c a t i o n f o r l a t e r p r o c e s s i n g71 Kerne lLaunches . i n s e r t ( d a t a ) ;72 }73 }

(c) Examples host AST callbacks to record information for deferred processing

1 __ke rne l vo id foo (2 i n t a ,3 _ _ g l o b a l f l o a t ∗ b ,4 f l o a t c ) { . . . }

(d) Example of a kernel prototype to match.1 / / Matcher f o r any f un c t i o n with the OpenCL ke rn e l a t t r i b u t e2 Matcher . addMatcher (3 f un c t i o nDe c l (4 h a sA t t r ( a t t r : : Kind : : OpenCLKernel )5 ) . b ind ( " k e r n e lDe c l " ) , new DevKerne lHandler ( ) ) ;

(e) Kernel Matcher

1 / / Record i n f o rma t i on about k e r n e l p r o t o t yp e s2 vo id DevKerne lHandler : : run ( con s t MatchResu l t &R e s u l t ) {3 c on s t au to ∗ Kerne lDec l =4 R e s u l t . Nodes . getNodeAs <Func t ionDec l > ( " k e r n e lDe c l " ) ;5 i f ( Ke rne lDec l ) {6 / / Fo r ce the SourceManager and ASTcontext to p e r s i s t7 / / C rea t e and popu l a t e a s t o r a g e v a r i a b l e8 / / Record the k e r n e l p r o t o t ype by name in a map f o r l a t e r p r o c e s s i n g9 Ke rn e l P r o t o t yp e s [ Kerne lDec l −>getName ( ) ] = da t a ;10 }11 }

(f) Example kernel AST callback

Figure 13. The per-AST phase of FLOCL-MAST has separate matchers and callbacks for host and device code. Both record information that is later used by the cross-AST phasedetailed in Figures 14 and 15.

14

Page 15: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 / / The d e f e r r e d s t e p to a t t a c h c l _ k e r n e l s from a c l C r e a t eK e r n e l2 / / c a l l t o a k e r n e l p r o t o t ype from a dev i c e AST3 vo id b i n dKe rn e l P r o t o s ( ) {4 / / I t e r a t e over every c l C r e a t eK e r n e l c a l l from every hos t AST5 f o r ( c r e a t eDa t a ∗ c r e a t e : C r e a t eKe rn e l s ) {6 i f ( c r e a t e != NULL ) {7 / / Look up the p ro t o t ype to a s s o c i a t e to based on the name used8 / / i n the c l C r e a t eK e r n e l c a l l ( f o r now a S t r i n g L i t e r a l )9 k e rn e l P r o t oDa t a ∗ pro to = Ke rn e l P r o t o t yp e s [ c r e a t e −>protoName ] ;10 i f ( p ro to != NULL ) {11 / / I f a matching p ro t o t ype e x i s t s , p opu l a t e and r e co rd a12 / / s t o r a g e s t r u c t u r e on a map , u s ing the a s s i g n ed13 / / c l _ k e r n e l ' s d e c l a r a t i o n as the key14 i f ( ! a s s i gnee −> i s G l o b a l ( ) ) {15 BoundKernels [ c r e a t e −>a s s i g n e e ] = da t a ;16 } e l s e {17 / / l ook up c a non i c a l g l o b a l d e c l a r a t i o n18 BoundKernels [ c a non i c a lD e c l ] = da t a ;19 }20 } e l s e {21 / / Warn t h a t a matching name p ro t o t ype can ' t be found22 } } } }

(a) The cross-AST phase must bind cl_kernel objects from the host AST(s) to kernelprototype declarations from the device AST(s), using information from clCreateKernelcalls on the host AST(s).

1 / / The d e f e r r e d s t e p to a t t a c h each launch and i t s newly−a s s o c i a t e d2 / / arguments to a k e r n e l p ro to type , v i a the ones bound to c l _ k e r n e l s3 / /TODO: implement c ro s s − f u n c t i o n and c ro s s − a l i a s c l _ k e r n e l4 / / t r a c k i n g to p ropaga t e l aunche s c o r r e c t l y to c l _ k e r n e l s t h a t5 / / a r e not g l o b a l l y d e c l a r e d ( or in d e c l a r e d and c r e a t e d in6 / / the same scope as the launch )7 vo id propKerne l sToLaunches ( ) {8 / / I t e r a t e over every launched k e r n e l i n every AST9 f o r ( every launch : Kerne lLaunches ) {10 con s t Va lueDec l ∗ c a l l e dV a l u e = NULL ;11 con s t Expr ∗ argExpr ;12 / / Try to r e s o l v e the launch ' s c l _ k e r n e l argument13 / / t o a v a r i a b l e or member d e c l a r a t i o n14 / / I f g l o b a l , use the c a n on i c a l d e c l a r a t i o n not an ex t e r n15 i f ( c a l l e dV a l u e == NULL ) {16 / / I f the c l _ k e r n e l can ' t be r e s o l v ed , warn t h a t we can ' t17 / / deduce what c l _ k e r n e l i s be ing launched18 }19 / / Sea rch f o r a bound k e rn e l p r o t o t ype with a matching hos t c l _ k e r n e l20 auto boundKernel = BoundKernels . f i n d ( c a l l e dV a l u e ) ;21 i f ( boundKernel != BoundKernels . end ( ) ) {22 / / I f a p r o t o t ype i s found , r e c o r d t h i s l aunch and i t s arguments23 / / f o r l a t e r p r o c e s s i n g24 ( ∗ boundKernel ) . second−>launches −>push_back ( ∗ l aunche s ) ;25 } } }

(c) The cross-AST phase must attach launched kernels (with newly-attached ar-guments) to a prototype. This phase relies on the launched cl_kernel declarationmatching one successfully bound to a prototype by a clCreateKernel call.

1 / / The d e f e r r e d s t e p to a t t a c h c l S e tK e r n e lA r g s to the l e x i c a l l y −next2 / / k e r n e l l aunch they w i l l app ly to3 / / Warning : L e x i c a l l y −next i s a h e u r i s t i c t h a t w i l l not t ake i n t o4 / / account i n t e r v e n i n g a s s i gnmen t s t h a t happen due to a c a l l5 / / Warning : does not c u r r e n t l y c a r e about scope so an as s ignment6 / / i n one f u n c t i o n may app ly to a launch in ano the r f u n c t i o n i f7 / / t h e ry sha r e the same c l _ k e r n e l o b j e c t , r e g a r d l e s s o f the o rde r8 / / the f u n c t i o n s a r e invoked a t runt ime9 vo id c l u s t e rArg sToLaunche s ( ) {10 / / I t e r a t e over a l l c l S e tK e r n e lA r g s c a l l s from a l l ho s t ASTs11 f o r ( s e tArg sDa ta ∗ s e tA rg s : S e tA r g sC a l l s ) {12 / / Try to r e s o l v e the ass ignment ' s c l _ k e r n e l argument13 / / t o a v a r i a b l e or member d e c l a r a t i o n14 i f ( s e tVa l u e == NULL ) {15 / / I f the c l _ k e r n e l can ' t be r e s o l v ed , warn t h a t we can ' t16 / / deduce what c l _ k e r n e l the argument i s be ing a s s i gn ed to17 }18 / / I t e r a t e over a l l k e r n e l l aunche s t h i s a s s ignment may app ly to19 / / The l aunche s s e t i s s o r t e d by the AST f i r s t , and the l e x i c a l20 / / p o s i t i o n second and we assume the as s ignment w i l l not app ly21 / / t o a launch t h a t i s l e x i c a l l y −b e f o r e i t a s a s im p l i f y i n g22 / / h e u r i s t i c , so we s t a r t i t e r a t i n g a t the f i r s t l aunch a f t e r23 / / the ass ignment , and end i f we eve r l e a v e the AST or da t a24 / / s t r u c t u r e , or s t a r t l o ok i ng a t l aunche s t h a t a r e somehow25 / / b e f o r e the as s ignment .26 f o r ( a l l such l aunche s : Kerne lLaunches ) {27 con s t Va lueDec l ∗ c a l l e dV a l u e = NULL ;28 con s t Expr ∗ l a r gExp r ;29 / / Try to r e s o l v e the launch ' s c l _ k e r n e l argument30 / / t o a v a r i a b l e or member d e c l a r a t i o n31 i f ( c a l l e dV a l u e == NULL ) {32 / / I f the c l _ k e r n e l can ' t be r e s o l v ed , warn t h a t we can ' t33 / / deduce what c l _ k e r n e l i s be ing launched34 }35 / / I f the two c l _ k e r n e l s match ( a r e the same d e c l a r a t i o n )36 i f ( s e tVa l u e == c a l l e dV a l u e ) {37 / / F i gu r e out which k e r n e l argument p o s i t i o n i s a s s i g n ed38 / / f o r s i m p l i c i t y assume index i s I n t e g e r L i t e r a l ( f o r now )39 i f ( c on s t I n t e g e r L i t e r a l ∗ i n t V a l =40 dyn_cas t < I n t e g e r L i t e r a l > (41 se tArgs −> c a l l −>getArg (1)− > I g n o r e Im p l i c i t ( ) )42 ) {43 / / Bind the argument as s ignment c a l l t o the launch , u s ing44 / / the p o s i t i o n va lue as the map key45 ( LaunchesWithArgs [ ∗ l aunche s ] )46 [ i n tVa l −>ge tVa lue ( ) . g e t L im i t e dVa l u e ( ) ] = s e tArg s ;47 / / Warn i f we a r e o v e rw r i t i n g a p r e v i ou s l y −a s s i gn ed va lue48 } e l s e {49 / / Emit a d i a g n o s t i c i f the p o s i t i o n a l argument i s not50 / / an I n t e g e r L i t e r a l t h a t can be s t a t i c a l l y de te rmined51 }52 / /A matching k e r n e l has been found , don ' t a t t empt to app ly53 / / i t t o o th e r k e r n e l s . TODO: Re l ax to suppor t OpenCL ' s54 / / argument r e t e n t i o n i f they a r e not re−a s s i gn ed55 break ;56 }57 / /TODO: Ea r l y t e rm in a t e i f we have l e f t a s s ignment scope58 } } }

(b) The cross-AST phase must heuristically determine what clSetKernelArg assign-ments on the host AST(s) apply to which kernel launches. Currently, the argumentsmust be lexically-before the launch in the same AST, and only apply to the firstfollowing launch with the same cl_kernel.

Figure 14. The first stages of the deferred cross-AST check for type consistency between kernel and device marry information from separate ASTs.

15

Page 16: A Composable Workflow for Productive FPGA Computing via ... · A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts)

1 / / Check whether a hos t and d ev i c e type a r e compa t i b l e2 / / Warning : For now dynamica l ly−a l l o c a t e d l o c a l memory3 / / r e g i on s w i l l emi t a type− i n c omp a t i b l i t y because they use4 / / a NULL po i n t e r f o r the hos t v a r i a b l e to copy5 / / Warning : cl_mem ' s a r e c u r r e n t l y t r e a t e d as a wi ld ca rd which6 / / matches with any _ _ g l o b a l or _ _ con s t an t p o i n t e r on the d ev i c e7 / /TODO: Attempt to r e s o l v e the r e a l type s t o r e d in a cl_mem8 boo l checkH2DTypeCompat ( QualType hostType , QualType devType ) {9 / / Wi ldcard match cl_mem type s on the hos t to any _ _ g l o b a l or10 / / _ _ con s t an t p o i n t e r on the d ev i c e11 i f ( hostType . g e tA s S t r i n g ( ) == " cl_mem ∗ " | |12 hostType . g e tA s S t r i n g ( ) == " con s t cl_mem ∗ " ) {13 i f ( c on s t Type ∗ type = devType . ge tTypeP t rOrNu l l ( ) ) {14 i f ( ! type−>i sAnyPo in te rType ( ) ) {15 / / I f the d ev i c e type i s not a po in t e r , warn o f type mismatch16 r e t u r n f a l s e ;17 } e l s e {18 / / I f the d ev i c e type i s a po in t e r , ensure i t p o i n t s19 / / t o an a p p r o p r i a t e a dd r e s s space20 uns igned addrSpace = type−>ge tPo in t e eType ( ) . g e tAddre s sSpace ( ) ;21 i f ( addrSpace != LangAS : : o p e n c l _ g l o b a l &&22 addrSpace != LangAS : : o p en c l _ c on s t a n t ) {23 / / Warn i f a dd r e s s space i s n e i t h e r _ _ g l o b a l nor __ con s t an t24 r e t u r n f a l s e ;25 } e l s e {26 / /TODO: improve t h i s w i l d c a r d match with any r e s o l v a b l e i n f o27 / / about the r e a l type s t o r e d in the cl_mem on the hos t28 r e t u r n t r u e ;29 } } } e l s e {30 / / E r r o r i f f o r some rea son the d ev i c e type cannot be r e s o l v e d31 r e t u r n f a l s e ;32 } } / / No need f o r e l s e , a l l cl_mem branches r e t u r n e a r l y33 / / Handle non−p o i n t e r t ype s34 i f ( c on s t Type ∗ hType = hostType . ge tTypeP t rOrNu l l ( ) ) {35 i f ( ! hType−>i sAnyPo in te rType ( ) ) {36 / / Warn i f ho s t type i s not a p o i n t e r to memory to be cop i ed37 / /TODO Handle NULL p o i n t e r s f o r dynamic _ _ l o c a l memory38 r e t u r n f a l s e ;39 } e l s e {40 / / D i s c a rd any e x t r a q u a l i f i e r s and check i f the t ype s match41 i f ( hType−>ge tPo in t e eType ( )42 . w i t h o u t L o c a l F a s t Q u a l i f i e r s ( ) . g e tA s S t r i n g ( )43 == devType . w i t h o u t L o c a l F a s t Q u a l i f i e r s ( ) . g e tA s S t r i n g ( ) ) {44 r e t u r n t r u e ;45 } e l s e {46 r e t u r n f a l s e ;47 } } } e l s e {48 / / Emit an e r r o r t h a t the hos t type cannot be deduced49 r e t u r n f a l s e ;50 }51 / / F a l l b a c k on r e t u r n i n g f a l s e52 r e t u r n f a l s e ;53 }

(a) Host and device ASTs may use different types and memory spaces to refer to thesemantically-same type of variable. A helper function is used to concisely check if ahost argument’s type is compatible with a device parameter’s.

1 / / The d e f e r r e d s t e p to check every launch f o r conformance2 / / t o the p ro t o t ype s p e c i f i e d in d ev i c e code . For now t h i s3 / / i s l i m i t e d to check ing the pa rame te r s s p e c i f i e d on the d ev i c e4 / / a g a i n s t the arguments p rov ided to them on the hos t .5 / /TODO: Check f o r e x c e s s arguments s p e c i f i e d on the hos t6 / /TODO: Check workgroup s i z e s a g a i n s t p o t e n t i a l k e r n e l a t t r i b u t e s7 vo id checkLaunchesToPro tos ( ) {8 / / I t e r a t e over a l l p r o t o t y p e s t h a t have been bound to c l _ k e r n e l s9 f o r ( au to boundPa i r : BoundKernels ) {10 BoundKernelData ∗ boundKernel = boundPa i r . second ;11 / / I t e r a t e over a l l the l aunche s o f t h i s p r o t o t ype12 f o r ( l aunchData ∗ l aunch : ∗ ( boundKernel−> l aunche s ) ) {13 / / F i r s t , ensure t h a t i f the p ro t o t ype has ze ro14 / / parameters , t h a t none have been s e t15 i f ( boundKernel−>protoData −>dec l −>getNumParams ( ) == 016 && LaunchesWithArgs . f i n d ( l aunch ) !=17 LaunchesWithArgs . end ( ) ) {18 / / Warn o f argument to a p a r ame t e r l e s s k e r n e l19 break ;20 }21 / / I t e r a t e over a l l pa r ame te r s in the p ro t o t ype22 f o r ( i n t i = 0 ;23 i < boundKernel−>protoData −>dec l −>getNumParams ( ) ;24 i ++) {25 / / Sea rch f o r an argument in t h i s p o s i t i o n f o r t h i s l aunch26 auto se tArg = LaunchesWithArgs [ l aunch ] . f i n d ( i ) ;27 i f ( s e tArg == LaunchesWithArgs [ l aunch ] . end ( ) ) {28 / / No argument s p e c i f i e d , warn o f a mi s s ing argument29 } e l s e {30 / / I f i t has an argument in t h i s p o s i t i o n ,31 / / g e t i t and check i t a g a i n s t the p ro t o t ype32 se tArgsDa ta ∗ da t a = ( ( ∗ s e tArg ) . second ) ;33 / / Get the type o f the argument from the hos t34 QualType argType =35 data −> c a l l −>getArg (3)− > I g n o r e Im p l i c i t ()−> getType ( ) ;36 / / Try to s t r i p away any e x p l i c t c a s t s to ( vo id ∗ )37 con s t Cas tExpr ∗ e x p l i c i t C a s t =38 dyn_cas t <CastExpr >( data −> c a l l −>39 getArg (3)− > I g n o r e Im p l i c i t ( ) ) ;40 i f ( e x p l i c i t C a s t && ( e x p l i c i t C a s t −>getSubExprAsWri t t en ( )41 −>getType ( ) . g e tA s S t r i n g ( ) != " vo id ∗ " ) ) {42 argType = e x p l i c i t C a s t43 ge tSubExprAsWri t t en ()−> getType ( ) ;44 }45 / / Get the d ev i c e paramete r type46 QualType paramType = boundKernel−>pro toDa ta47 −>dec l −>getParamDecl ( i )−>getType ( ) ;48 / / Check hos t and d ev i c e t ype s f o r c omp a t i b i l i t y49 i f ( checkH2DTypeCompat ( argType , paramType ) ) {50 / / The type s match , e i t h e r emi t a s u c c e s s or do noth ing51 } e l s e {52 / / Warn t h a t the d ev i c e and hos t t ype s a r e i n c ompa t i b l e53 }54 }55 }56 / /TODO check f o r arguments with out o f bounds pa rame te r s57 / /TODO hand le dynamic _ _ l o c a l memory arguments58 }59 }60 }

(b) The final cross-AST phase for checking host-side launches for existence andtype-consistency of argument/parameters must iterate over all arguments assignedto every launch on every host AST and make sure they match the kernel prototypethat is being invoked.

Figure 15. The final stages of the cross-AST type consistency check use the previously-married information to check every assigned launch parameter for consistency with theassociated device kernel’s prototype.

16