
Holland #215 MAPLD 2005

Survey of C-based Application Mapping Tools for Reconfigurable Computing

Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George

High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering

University of Florida


Outline

Introduction
General Survey
  Ten C-based Application Mappers
Benchmarking & Results
  Finite-Impulse Response (FIR)
  N-Queens
  Radix Sort
Lessons Learned
Conclusions
Acknowledgements
References


Motivation for Application Mappers

HDL programming has shortcomings
  Limited applicability to application developers
  More involved development process (vs. software)
  Requires training beyond application level

Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
  Can we bring RC performance benefits to application developers?
  Would this be practical/possible in traditional HDL?

HDL is well below the level of traditional application programming
  Consequently, we need to move to a higher level of abstraction


Introduction

Selecting a Higher Level of Abstraction
  CAD tools: Visually appealing, but tedious for large projects
  New language: Optimal, but requires complete retraining
  Traditional or Object-Oriented languages: Which? How?

Ideally, use pure ANSI-C, “The Universal Language”
  Requires no additional knowledge or special training
  Port existing C programs into hardware implementations (HDL)

Translation can be handled by a hardware compiler
  Programmer concentrates on algorithmic functionality

[Figure: compilation flow from C Code through the COMPILER to HDL, Netlist, and Configuration File]


Commonalities

General characteristics of C-based application mappers:
  Companies create proprietary ANSI C-based language
  Languages do not have all ANSI C features
  Extra pragmas are included for corresponding compilers
  Additional libraries of functions/macros for further extensions
  Must adhere to specific programming “style” for maximum optimization
  Emphasis on both hardware generation and I/O interfaces

ANSI-C:

    void FIR(int INPUTA, int OUTPUTB)
    {
        /* user source */
    }

COMPILER

VHDL:

    Entity FIR is
      Port( rst, clock:   in  std_logic;
            INPUTA_en:    in  std_logic;
            INPUTA_data:  in  std_logic_vector(31 downto 0);
            OUTPUTB_en:   in  std_logic;
            OUTPUTB_data: out std_logic_vector(31 downto 0));
    end;

    /* user source in VHDL */


Spectrum of C-based Application Mappers (survey portion)

[Figure: the surveyed tools arranged by how general their target is. Category labels: Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only. Tools shown: SystemC, Impulse C, Catapult C, Mitrion C, DIME-C, Handel C, Streams C, SA-C, Napa C, Carte.]

THE LAW OF CONSERVATION OF PAIN

[Figure: spectrum running from ANSI-C through DIME-C, Impulse C, and Handel C to VHDL. Axis labels: Control, Effort. Region labels: Software; Some HW Pragmas; Many HW Pragmas; HDL. Annotations: Not Cycle Accurate, Limited Predictability, Deterministic, Cycle Accurate.]


Carte    SRC Computers [1]

C/Fortran FPGA environment
  Direct mapping of C/Fortran code to configuration level
  Software emulation and simulation of compiled code for debugging
  Capable of multiprocessor and multi-FPGA computational definitions
  Allows explicit data flow control within memory hierarchy

Targets SRC’s MAP processor
  Produces “Unified Executables” for HW or SW processor execution
  Runtime libraries handle required interfacing and management


Catapult C    Mentor Graphics [2-3]

Algorithmic synthesis tool for RTL generation
  RTL from “pure” untimed C++
  No extensions, pragmas, etc.

Compiler uses “wrappers” around algorithmic code
  External: manages I/O interface
  Internal: constrains synthesis to optimize for chosen interface

Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC


DIME-C    Nallatech [4]

FPGA prototyping tool
  Designs are not cycle-accurate
  Allows application synthesis for a higher clock speed

Compilation/Optimization
  Pipeline/parallelize where possible
  Included IEEE-754 FP cores
  Dedicated (integer) multipliers

Currently in beta, expected release: 4Q05
Output: synthesizable VHDL and DIMEtalk components


Handel C    Celoxica [5]

Environment for cycle-accurate application development
  All operations occur in one deterministic clock cycle
  Makes it cycle-accurate, but clock freq reduced to slowest operation
  Decisions/Loops are “penalty-free” but can significantly impact timing

Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs


Impulse C    Impulse Accelerated Technologies [6]

Language/compiler for modeling sequential apps.
  Processes: independent, potentially concurrent, computing blocks
  Streams: communicate and synchronize processes

Uses Streams-C methodology
  However, focuses on compatibility with C development environments

Compilation
  Each process implemented as separate state machine

Output: Generic or FPGA-specific VHDL


Mitrion C    Mitrion [7]

“Softcore” processor tactic
  “Processor” creates abstraction layer between C code and FPGA

Compilation
  C code is mapped to a generic “API” of possible functions
  Processor instantiated on FPGA, tailored to specific application
  Custom instruction bit-widths, specific cache and buffer sizes

Currently in beta, expected release: 4Q05
Output: a VHDL IP core for target architectures


Napa C    National Semiconductor [8]

Language/compiler for RISC/FPGA hybrid processor
  Capitalize on single-cycle interconnect instead of I/O bus

Datapath Synthesis Technique
  Hand-optimized pre-placed, pre-routed module generators
  Compiler generates hardware pipelines from C loops

Targets NS NAPA1000 hybrid processor
  Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
  ALP also compiles to RTL VHDL, structural VHDL, structural Verilog


SA-C    Colorado State University [9-12]

High-level, expression-oriented, machine-independent, single-assignment language
  Designed to implicitly express data-parallel operations
  Image and signal processing

Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
  Loop optimizations
  Structural transforms
  Execution block placement

Target Platforms
  UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire


Streams C    Los Alamos National Laboratory [12-14]

Stream-oriented sequential process modeling
  Essentially, data elements moving through discrete functional blocks

Compiler
  Generates multi-threaded processor executables and multiple FPGA bitstreams
  Allows parallel C program translation into a parallel arch.

Includes functional-level simulation environment
Output: synthesizable RTL


SystemC    Open SystemC Initiative (OSCI) [15-16]

Open-source extension of C++ for HW/SW modeling
  Core language, modules & ports for defining structure, and interfaces & channels

Supports functional modeling
  Hierarchical decomposition of a system into modules
  Structural connectivity between modules using ports/exports
  Scheduling and synchronization of concurrent processes using events

Event-driven simulator
  Events are basic dynamic/static process synchronization objects


About the Benchmarks

Three classic algorithms used for benchmarking

Finite-Impulse Response (FIR)
  Simple 51-tap FIR filter for standard DSP applications
  Compare compiler solutions and analyze their usage metrics

N-Queens
  Classic embarrassingly parallel HPC backtracking search problem
  Showcases the potential of optimized implementations

Radix Sort
  Sorts using ‘binary bins’, minimizing resources
  Illustrates resource metrics in RAM-intensive applications

Implementation Details
  DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
  Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA
  Resource utilization based on post place-and-route data
  Runtime represents communication time (setup and verification I/O is negated)
  Handel C and Impulse C require VHDL wrappers which can increase resource usage



Finite-Impulse Response

FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6])

Various application-mapper languages do not have a consistent I/O interface
  Could not create a consistent streaming channel with requisite blocking in every tool
  Instead, FIR algorithm operates on values stored in a block RAM

Obtains speedup through parallel multiplication, efficient memory accesses
  The 51 coefficients and variables are stored in local variables

Additional performance boosts are possible in multi-channel DSP processing
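For reference, the block-RAM formulation described on this slide can be sketched in plain ANSI C roughly as follows. This is a minimal software sketch under stated assumptions, not the benchmark source: the names `fir_block` and `NUM_TAPS`, the zero padding before the first sample, and the 64-bit accumulator are all choices made here for illustration. Only the 51-tap count and the 16-bit width come from the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51 /* tap count taken from the slide */

/* Hypothetical helper: filters n samples held in a buffer that stands in
 * for the block RAM. Samples before the start of the buffer count as 0. */
void fir_block(const int16_t coeff[NUM_TAPS],
               const int16_t *in, int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int64_t acc = 0; /* wide accumulator so 51 products cannot overflow */
        /* The multiply-accumulates for one output are independent of each
         * other; this is the loop a mapper would try to unroll or pipeline. */
        for (size_t k = 0; k < NUM_TAPS && k <= i; k++) {
            acc += (int64_t)coeff[k] * (int64_t)in[i - k];
        }
        out[i] = acc;
    }
}
```

With coefficients {1, 2, 0, ...} and input {1, 2, 3}, the outputs are 1, 4, and 7. In a hardware mapping, the inner loop is where the "parallel multiplication" mentioned above would apply.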

[Chart: FIR Resource Utilization Statistics. Vertical axis: % Usage (0 to 100). Categories: Slices, Multipliers, Block RAMs, Clock Freq. Series: DIME-C, Handel C, Impulse C, VHDL.]

[Chart: Speedup over 2.4GHz Xeon. Vertical axis: 0 to 4. Bars: DIME-C, Handel C, Impulse C, VHDL, gcc -O3, gcc -O0.]


N-Queens

Represents a purely computational algorithm; virtually no communication overhead
Algorithm contains several parallelizable code segments, exploitable for speedup
Implementations are based upon same baseline C code
  Every available technique and compiler optimization is employed to boost performance

Notes:
  Handel C N-Queens is a benchmark from our MAPLD’04 paper [18] with additional refinements
  VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism
  DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
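To make the backtracking structure concrete, here is a minimal ANSI C sketch that counts N-Queens solutions with bit masks. It is not the baseline code used in the benchmarks; the function names `place` and `nqueens` and the bit-mask formulation are assumptions chosen for illustration. Each first-row placement roots an independent subtree of the search, which is the property that makes the problem embarrassingly parallel.

```c
#include <stdint.h>

/* Hypothetical helper: counts the ways to finish the board, given the
 * columns (cols) and the two diagonal directions (d1, d2) already attacked
 * in the current row. `all` has one bit set per column of the board. */
static uint64_t place(uint32_t all, uint32_t cols, uint32_t d1, uint32_t d2)
{
    if (cols == all) {
        return 1; /* every column holds a queen, so every row does too */
    }
    uint64_t count = 0;
    uint32_t free_sq = all & ~(cols | d1 | d2); /* safe squares in this row */
    while (free_sq != 0) {
        uint32_t bit = free_sq & (0u - free_sq); /* lowest safe square */
        free_sq ^= bit;
        count += place(all, cols | bit,
                       ((d1 | bit) << 1) & all, /* diagonal moves left */
                       (d2 | bit) >> 1);        /* diagonal moves right */
    }
    return count;
}

/* Returns the number of solutions for an n x n board.
 * Sizes outside 1..31 are not supported by this sketch and return 0. */
uint64_t nqueens(unsigned n)
{
    if (n == 0 || n > 31) {
        return 0;
    }
    return place((1u << n) - 1u, 0, 0, 0);
}
```

For example, `nqueens(8)` returns 92. A hardware version could evaluate several first-row placements at once, since the subtrees share no state.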

[Chart: N-Queens Resource Utilization Statistics. Vertical axis: % Usage (0 to 100). Categories: Slices, Clock Freq.]

[Chart: vertical axis 0 to 6; horizontal axis N = 13 to 17.]



Radix Sort

Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)

Represents a “worst-case” legacy algorithm, containing no functional-level parallelism
  Every element in every iteration depends on every previous element in every iteration
  Ideal for software processor with fast cache, challenging in FPGA hardware

Speedup comes through efficient RAM usage and compiler optimizations/pipelining
  Reduce quantity and addressing complexity of RAM accesses whenever possible

Metrics are based on sorting 600 32-bit integers contained within a block RAM
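The "one bit at a time" approach can be sketched in ANSI C as below. This is an illustrative sketch rather than the benchmark code: the name `radix_sort_binary`, the `MAX_KEYS` limit, the scratch buffer, and the choice of unsigned keys are assumptions. Each of the 32 passes splits the keys into a zero bin and a one bin while preserving their order, so every pass depends on the complete result of the previous one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 600 /* matches the 600-element data set on the slide */

/* Sorts n unsigned 32-bit keys in ascending order, one bit per pass,
 * starting from the least significant bit. Requests with more than
 * MAX_KEYS keys are ignored in this sketch. */
void radix_sort_binary(uint32_t *keys, size_t n)
{
    uint32_t scratch[MAX_KEYS];

    if (n > MAX_KEYS) {
        return;
    }
    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count the keys whose current bit is 0 to find where bin 1 starts. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((keys[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        /* Stable split: bin 0 fills from the front, bin 1 after it. */
        size_t z = 0;
        size_t o = zeros;
        for (size_t i = 0; i < n; i++) {
            if ((keys[i] >> bit) & 1u) {
                scratch[o++] = keys[i];
            } else {
                scratch[z++] = keys[i];
            }
        }
        memcpy(keys, scratch, n * sizeof keys[0]);
    }
}
```

Only two bins are needed per pass, which keeps storage small, at the cost of 32 sequential passes over the data.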

[Chart: Radix Sort Resource Utilization Statistics. Vertical axis: % Usage (0 to 100). Categories: Slices, Block RAMs, Clock Freq.]

[Chart: vertical axis 0.0 to 1.0.]



Some Optimization Techniques

Keep expensive computational operations to a minimum
  Multiplication, division, modulo, greater/less than, and floating point are *slow*
Minimize reliance on arrays
Watch for combinable statements
Exploit functional level parallelism
Reduce bit-widths to minimal size

Example: a loop nest over a two-dimensional array

    for(i=0;i<3;i++){
        for(j=0;j<20;j++){
            a[i][j] = i+j;
        }
    }

can be split into three independent loops over separate arrays:

    for(j=0;j<20;j++){ a[j] = j; }

    for(j=0;j<20;j++){ b[j] = 1+j; }

    for(j=0;j<20;j++){ c[j] = 2+j; }
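One common way to keep expensive operators to a minimum is strength reduction, combined with narrow data types. The sketch below is an illustration only: the function names and the `BUF_LEN` constant are hypothetical, and the slides do not say whether any particular mapper performs these rewrites automatically. It shows a modulo by a power of two replaced with a bit mask, and a multiplication by 10 replaced with two shifts and an add, both on 8-bit inputs.

```c
#include <stdint.h>

#define BUF_LEN 64u /* assumed to be a power of two so the mask is valid */

/* Wrap-around index written with modulo; may be synthesized as a divider. */
uint8_t next_index_mod(uint8_t i)
{
    return (uint8_t)((i + 1u) % BUF_LEN);
}

/* Same result using a bit mask, which needs only AND gates. */
uint8_t next_index_mask(uint8_t i)
{
    return (uint8_t)((i + 1u) & (BUF_LEN - 1u));
}

/* Multiply by 10 written with a multiplier. */
uint16_t times10_mul(uint8_t x)
{
    return (uint16_t)(x * 10u);
}

/* Same result as 8x + 2x, using two shifts and one adder. */
uint16_t times10_shift(uint8_t x)
{
    return (uint16_t)(((uint16_t)x << 3) + ((uint16_t)x << 1));
}
```

Both pairs return identical values for every 8-bit input, so the cheaper form can be substituted without changing behavior.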


Case Study: Dot Product

DIME-C

    void Kernel(int a[50], int b[50], int answer)
    {
        int i, temp = 0;
        for(i=0;i<50;i++){
            temp += a[i] * b[i];
        }
        answer = temp;
    }

    void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int answer)
    {
        int answer1, answer2;

        #pragma genusc instance Kernel1
        Kernel(a1,b1,answer1);

        #pragma genusc instance Kernel2
        Kernel(a2,b2,answer2);

        answer = answer1 + answer2;
    }

IMPULSE C

    void Kernel1(co_stream a1, co_stream b1, co_stream z1)
    {
        int i, a[50], b[50], answer=0;
        co_stream_open(a1,O_RDONLY,INT_TYPE(32)); /*etc*/
        for(i=0;i<50;i++){
            co_stream_read(a1, &a[i], sizeof(int32));
            co_stream_read(b1, &b[i], sizeof(int32));
        }
        for(i=0;i<50;i++){
            #pragma CO UNROLL
            answer += a[i] * b[i];
        }
        co_stream_write(z1, &answer, sizeof(int32));
        co_stream_close(a1); /*etc*/
    }

    void Kernel2(co_stream a2, co_stream b2, co_stream z2)
    {
        /* SAME AS IN Kernel1 */
    }

    void dot_product(co_stream z1, co_stream z2, co_stream ans)
    {
        int i, answer1, answer2, answer;
        co_stream_open(z1,O_RDONLY,INT_TYPE(32)); /*etc*/
        co_stream_read(z1, &answer1, sizeof(int32));
        co_stream_read(z2, &answer2, sizeof(int32));
        answer = answer1 + answer2;
        co_stream_write(ans, &answer, sizeof(int32));
        co_stream_close(z1); /*etc*/
    }

HANDEL C

    int 32 Kernel1(int 32 a[50], int 32 b[50])
    {
        static int 32 i, temp[50], answer;
        par(i=0;i<50;i++){
            temp[i] = a[i] * b[i];
        }
        for(i=0;i<50;i++){
            answer += temp[i];
        }
        return answer;
    }

    int 32 Kernel2(int 32 a[50], int 32 b[50])
    {
        /* SAME AS IN Kernel1 */
    }

    void main() //dot_product
    {
        int 32 a1[50]; int 32 b1[50];
        int 32 a2[50]; int 32 b2[50];
        int 32 ans1, ans2;
        int 32 answer;
        interface bus_out() OutputResult(answer);
        par{
            ans1 = Kernel1(a1, b1);
            ans2 = Kernel2(a2, b2);
        }
        answer = ans1 + ans2;
    }

*Not all implementations are perfectly optimized. Your mileage will vary.*

Color key for the code above (in the original slides):
  Green: Computation
  Blue: Communication
  Orange: Pragmas
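For comparison with the three tool-specific versions above, the same computation in plain ANSI C might look like the sketch below. It is an assumed software baseline written for illustration, not code from the study; the names `kernel`, `dot_product`, and `VEC_LEN` are chosen here to mirror the slide. The two kernel calls share no data, which is what each mapper version exploits by instantiating two kernels.

```c
#define VEC_LEN 50 /* vector length used in the case study */

/* Multiply-accumulate over one pair of vectors. */
int kernel(const int a[VEC_LEN], const int b[VEC_LEN])
{
    int temp = 0;
    for (int i = 0; i < VEC_LEN; i++) {
        temp += a[i] * b[i];
    }
    return temp;
}

/* Sum of two independent dot products. On a CPU the two calls run one
 * after the other; in hardware they could run side by side. */
int dot_product(const int a1[VEC_LEN], const int b1[VEC_LEN],
                const int a2[VEC_LEN], const int b2[VEC_LEN])
{
    int answer1 = kernel(a1, b1);
    int answer2 = kernel(a2, b2);
    return answer1 + answer2;
}
```

Relative to this baseline, the DIME-C version adds only instance pragmas, the Impulse C version adds stream communication, and the Handel C version adds explicit `par` blocks.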


Lessons Learned

Tools are not near point of automatic translation
  Programs still require some tweaking for hardware compilation [17]
  Optimized Software C ≠ Optimized Hardware C

However, generating VHDL is significantly easier
  Learning basics of a C-based mapper is straightforward

At least two major challenges remain:

Input/output interfaces become a limiting factor
  Moving generic VHDL to unsupported platforms requires VHDL knowledge
  However, once a generic I/O wrapper is generated, it should be reusable

True hardware debugging remains a challenge
  Another level of abstraction means another layer for mistranslation
  With no knowledge of internal VHDL signals, tracing becomes difficult


Conclusions

Advantages of C-based application mappers
  Far broader audience of potential RC users with high-level languages
  Required HDL knowledge is significantly reduced or eliminated
  Time to preliminary results is much less than manual HDL
  Software-to-hardware porting is considerably easier
  Visualization of C hardware is far easier for scientific community

Disadvantages
  Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower
  Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units
  Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to algorithm’s structure
  Reduced development time can come at cost of performance


Acknowledgements

We thank the following vendors for application mapping tools, information, and technical support:
  Celoxica (Handel C)
  Impulse Accelerated Technologies (Impulse C)
  Nallatech (DIME-C)
  Mitrion (Mitrion C)

We thank the following vendors for providing tools and/or hardware that made this study possible:
  Aldec (Active-HDL & Riviera EDA tools)
  Intel (Xeon servers)
  Nallatech (FUSE & DIMEtalk tools, RC boards)
  Xilinx (ISE, RC boards, FPGAs)


References

[1] http://www.srccomp.com
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/
[3] K. Morris, “Catapult C: Mentor Announces Architectural Synthesis,” fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., “DIME-C User Guide,” Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd., “Using Handel-C with DK,” Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, “Practical FPGA Programming in C,” Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc., “The Mitrion Processor,” Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone and E. Gomersall, “Co-Synthesis to a Hybrid RISC/FPGA Architecture,” Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Böhm, “The SA-C Language,” Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe and W. Böhm, “The SA-C Compiler,” Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ., “Cameron Poster for ACS PI Meeting,” Arlington, VA, March 7, 2002.
[12] I. Troxel, “CARMA: An Infrastructure for Reconfigurable High-Performance Computing,” Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
[13] R. Goering, “Open-source C compiler targets FPGAs,” Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale and D. Lavenier, “Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective,” Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org
[16] OSCI, “SystemC 2.0.1 Language Reference Manual,” Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, “The DARPA boolean equation benchmark on a reconfigurable computer,” Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, “Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI,” Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, “The future of programmable SoC design is C-based,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.
