Holland #215 MAPLD 2005
Survey of C-based Application Mapping Tools for Reconfigurable Computing
Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida

Outline
    Introduction
    General Survey
        Ten C-based Application Mappers
    Benchmarking & Results
        Finite-Impulse Response (FIR)
        N-Queens
        Radix Sort
    Lessons Learned
    Conclusions
    Acknowledgements
    References

Motivation for Application Mappers
HDL programming has shortcomings
    Limited applicability to application developers
    More involved development process (vs. software)
    Requires training beyond application level
Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
    Can we bring RC performance benefits to application developers?
    Would this be practical/possible in traditional HDL?
HDL is well below the level of traditional application programming
    Consequently, we need to move to a higher level of abstraction

Introduction
Selecting a Higher Level of Abstraction
    CAD tools: Visually appealing, but tedious for large projects
    New language: Optimal, but requires complete retraining
    Traditional or Object-Oriented languages: Which? How?
Ideally, use pure ANSI-C, “The Universal Language”
    Requires no additional knowledge or special training
    Port existing C programs into hardware implementations (HDL)
Translation can be handled by a hardware compiler
    Programmer concentrates on algorithmic functionality
[Figure: C Code -> COMPILER -> HDL -> Netlist -> Configuration File]

Commonalities
General characteristics of C-based application mappers:
    Companies create proprietary ANSI C-based languages
    Languages do not have all ANSI C features
    Extra pragmas are included for corresponding compilers
    Additional libraries of functions/macros for further extensions
    Must adhere to specific programming “style” for maximum optimization
    Emphasis on both hardware generation and I/O interfaces
ANSI-C:
    void FIR(int INPUTA, int OUTPUTB){
        /*user source*/
    }
COMPILER
VHDL:
    Entity FIR is
        Port( rst, clock: in std_logic;
              INPUTA_en: in std_logic;
              INPUTA_data: in std_logic_vector(31 downto 0);
              OUTPUTB_en: in std_logic;
              OUTPUTB_data: out std_logic_vector(31 downto 0));
    end;
    /*user source in VHDL*/

Spectrum of C-based Application Mappers
SURVEY PORTION
[Figure: spectrum of C-based application mappers. Category labels: Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only. Tools shown: SystemC, Impulse C, Catapult C, Mitrion C, DIME-C, Handel C, Streams C, SA-C, Napa C, Carte.]
THE LAW OF CONSERVATION OF PAIN
[Figure: Control and Effort plotted for ANSI-C, DIME-C, Impulse C, Handel C, and VHDL. Labels along the spectrum: Software, Some HW Pragmas, Many HW Pragmas, HDL; Not Cycle Accurate, Limited Predictability, Deterministic, Cycle Accurate.]

Carte, SRC Computers [1]
C/Fortran FPGA environment
    Direct mapping of C/Fortran code to configuration level
    Software emulation and simulation of compiled code for debugging
    Capable of multiprocessor and multi-FPGA computational definitions
    Allows explicit data flow control within memory hierarchy
Targets SRC’s MAP processor
    Produces “Unified Executables” for HW or SW processor execution
    Runtime libraries handle required interfacing and management

Catapult C, Mentor Graphics [2-3]
Algorithmic synthesis tool for RTL generation
    RTL from “pure” untimed C++
    No extensions, pragmas, etc.
Compiler uses “wrappers” around algorithmic code
    External: manages I/O interface
    Internal: constrains synthesis to optimize for chosen interface
Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC

DIME-C, Nallatech [4]
FPGA prototyping tool
    Designs are not cycle-accurate
    Allows application synthesis for a higher clock speed
Compilation/Optimization
    Pipeline/parallelize where possible
    Included IEEE-754 FP cores
    Dedicated (integer) multipliers
Currently in beta, expected release: 4Q05
Output: synthesizable VHDL and DIMEtalk components

Handel C, Celoxica [5]
Environment for cycle-accurate application development
    All operations occur in one deterministic clock cycle
    Makes it cycle-accurate, but clock freq reduced to slowest operation
    Decisions/Loops are “penalty-free” but can significantly impact timing
Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs

Impulse C, Impulse Accelerated Technologies [6]
Language/compiler for modeling sequential apps.
    Processes: independent, potentially concurrent, computing blocks
    Streams: communicate and synchronize processes
Uses Streams-C methodology
    However, focuses on compatibility with C development environments
Compilation
    Each process implemented as separate state machine
Output: Generic or FPGA-specific VHDL

Mitrion C, Mitrion [7]
“Softcore” processor tactic
    “Processor” creates abstraction layer between C code and FPGA
Compilation
    C code is mapped to a generic “API” of possible functions
    Processor instantiated on FPGA, tailored to specific application
    Custom instruction bit-widths, specific cache and buffer sizes
Currently in beta, expected release: 4Q05
Output: a VHDL IP core for target architectures

Napa C, National Semiconductor [8]
Language/compiler for RISC/FPGA hybrid processor
    Capitalize on single-cycle interconnect instead of I/O bus
Datapath Synthesis Technique
    Hand-optimized pre-placed, pre-routed module generators
    Compiler generates hardware pipelines from C loops
Targets NS NAPA1000 hybrid processor
    Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
    ALP also compiles to RTL VHDL, structural VHDL, structural Verilog

SA-C, Colorado State University [9-12]
High-level, expression-oriented, machine-independent, single-assignment language
    Designed to implicitly express data-parallel operations
    Image and signal processing
Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
    Loop optimizations
    Structural transforms
    Execution block placement
Target Platforms
    UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire

Streams C, Los Alamos National Laboratory [12-14]
Stream-oriented sequential process modeling
    Essentially, data elements moving through discrete functional blocks
Compiler
    Generates multi-threaded processor executables and multiple FPGA bitstreams
    Allows parallel C program translation into a parallel arch.
Includes functional-level simulation environment
Output: synthesizable RTL

SystemC, Open SystemC Initiative (OSCI) [15-16]
Open-source extension of C++ for HW/SW modeling
    Core language, modules & ports for defining structure, and interfaces & channels
Supports functional modeling
    Hierarchical decomposition of a system into modules
    Structural connectivity between modules using ports/exports
    Scheduling and synchronization of concurrent processes using events
Event-driven simulator
    Events are basic dynamic/static process synchronization objects

About the Benchmarks
Three classic algorithms used for benchmarking
Finite-Impulse Response (FIR)
    Simple 51-tap FIR filter for standard DSP applications
    Compare compiler solutions and analyze their usage metrics
N-Queens
    Classic embarrassingly parallel HPC backtracking search problem
    Showcases the potential of optimized implementations
Radix Sort
    Sorts using ‘binary bins’, minimizing resources
    Illustrates resource metrics in RAM-intensive applications
Implementation Details
    DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
    Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA
    Resource utilization based on post place-and-route data
    Runtime represents communication time (setup and verification I/O are excluded)
    Handel C and Impulse C require VHDL wrappers, which can increase resource usage

Finite-Impulse Response
FIR filter containing 51 taps, each 16-bits wide (based on algorithms in [4,6])
Various application-mapper languages do not have a consistent I/O interface
    Could not create a consistent streaming channel with requisite blocking in every tool
    Instead, FIR algorithm operates on values stored in a block RAM
Obtains speedup through parallel multiplication, efficient memory accesses
    The 51 coefficients and variables are stored in local variables
Additional performance boosts are possible in multi-channel DSP processing
[Figure: FIR Resource Utilization Statistics. % usage of Slices, Multipliers, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL.]
[Figure: FIR speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0.]

N-Queens
Represents a purely computational algorithm; virtually no communication overhead
Algorithm contains several parallelizable code segments, exploitable for speedup
Implementations are based upon same baseline C code
    Every available technique and compiler optimization is employed to boost performance
Notes:
    Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements
    VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism
    DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
[Figure: N-Queens Resource Utilization Statistics. % usage of Slices and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL.]
[Figure: N-Queens speedup over 2.4GHz Xeon for N = 13 to 17, for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0.]

Radix Sort
Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
Represents a “worst-case” legacy algorithm, containing no functional-level parallelism
    Every element in every iteration depends on every previous element in every iteration
    Ideal for software processor with fast cache, challenging in FPGA hardware
Speedup comes through efficient RAM usage and compiler optimizations/pipelining
    Reduce quantity and addressing complexity of RAM accesses whenever possible
Metrics are based on sorting 600 32-bit integers contained within a block RAM
[Figure: Radix Sort Resource Utilization Statistics. % usage of Slices, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL.]
[Figure: Radix Sort speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0.]

Some Optimization Techniques
Keep expensive computational operations to a minimum
    Multiplication, division, modulo, greater/less than, and floating point are *slow*
Minimize reliance on arrays
Watch for combinable statements
Exploit functional level parallelism
Reduce bit-widths to minimal size
Example (one 2-D array split into separate arrays):
    for(i=0;i<3;i++){
        for(j=0;j<20;j++){
            a[i][j] = i+j;
        }
    }
becomes
    for(j=0;j<20;j++){ a[j] = j;}
    for(j=0;j<20;j++){ b[j] = 1+j;}
    for(j=0;j<20;j++){ c[j] = 2+j;}
Case Study: Dot Product

DIME-C
void Kernel(int a[50], int b[50], int answer){
    int i, temp = 0;
    for(i=0;i<50;i++){
        temp += a[i] * b[i];
    }
    answer = temp;
}

void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int answer)
{
    int answer1, answer2;
    #pragma genusc instance Kernel1
    Kernel(a1,b1,answer1);
    #pragma genusc instance Kernel2
    Kernel(a2,b2,answer2);
    answer = answer1 + answer2;
}

IMPULSE C
void Kernel1(co_stream a1, co_stream b1, co_stream z1){
    int i, a[50], b[50], answer=0;
    co_stream_open(a1,O_RDONLY,INT_TYPE(32)); /*etc*/
    for(i=0;i<50;i++){
        co_stream_read(a1, &a[i], sizeof(int32));
        co_stream_read(b1, &b[i], sizeof(int32));
    }
    for(i=0;i<50;i++){
        #pragma CO UNROLL
        answer += a[i] * b[i];
    }
    co_stream_write(z1, &answer, sizeof(int32));
    co_stream_close(a1); /*etc*/
}

void Kernel2(co_stream a2, co_stream b2, co_stream z2){
    /* SAME AS IN Kernel1 */
}

void dot_product(co_stream z1, co_stream z2, co_stream ans){
    int i, answer1, answer2, answer;
    co_stream_open(z1,O_RDONLY,INT_TYPE(32)); /*etc*/
    co_stream_read(z1, &answer1, sizeof(int32));
    co_stream_read(z2, &answer2, sizeof(int32));
    answer = answer1 + answer2;
    co_stream_write(ans, &answer, sizeof(int32));
    co_stream_close(z1); /*etc*/
}

HANDEL C
int 32 Kernel1(int 32 a[50], int 32 b[50]){
    static int 32 i, temp[50], answer;
    par(i=0;i<50;i++){
        temp[i] = a[i] * b[i];
    }
    for(i=0;i<50;i++){
        answer += temp[i];
    }
    return answer;
}

int 32 Kernel2(int 32 a[50], int 32 b[50]){
    /* SAME AS IN Kernel1 */
}

void main() //dot_product
{
    int 32 a1[50]; int 32 b1[50];
    int 32 a2[50]; int 32 b2[50];
    int 32 ans1, ans2;
    int 32 answer;
    interface bus_out() OutputResult(answer);
    par{
        ans1 = Kernel1(a1, b1);
        ans2 = Kernel2(a2, b2);
    }
    answer = ans1 + ans2;
}

*Not all implementations are perfectly optimized. Your mileage will vary.*
Color key (original slides): Green: Computation; Blue: Communication; Orange: Pragmas
Lessons Learned
Tools are not near point of automatic translation
    Programs still require some tweaking for hardware compilation [17]
    Optimized Software C ≠ Optimized Hardware C
However, generating VHDL is significantly easier
    Learning basics of a C-based mapper is straightforward
At least two major challenges remain:
    Input/output interfaces become a limiting factor
        Moving generic VHDL to unsupported platforms requires VHDL knowledge
        However, once a generic I/O wrapper is generated, it should be reusable
    True hardware debugging remains a challenge
        Another level of abstraction means another layer for mistranslation
        With no knowledge of internal VHDL signals, tracing becomes difficult

Conclusions
Advantages of C-based application mappers
    Far broader audience of potential RC users with high-level languages
    Required HDL knowledge is significantly reduced or eliminated
    Time to preliminary results is much less than manual HDL
    Software-to-hardware porting is considerably easier
    Visualization of C hardware is far easier for scientific community
Disadvantages
    Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower
    Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units
    Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to algorithm’s structure
    Reduced development time can come at cost of performance

Acknowledgements
We thank the following vendors for application mapping tools, information, and technical support:
    Celoxica (Handel C)
    Impulse Accelerated Technologies (Impulse C)
    Nallatech (DIME-C)
    Mitrion (Mitrion C)
We thank the following vendors for providing tools and/or hardware that made this study possible:
    Aldec (Active-HDL & Riviera EDA tools)
    Intel (Xeon servers)
    Nallatech (FUSE & DIMEtalk tools, RC boards)
    Xilinx (ISE, RC boards, FPGAs)

References
[1] http://www.srccomp.com
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/
[3] K. Morris, “Catapult C: Mentor Announces Architectural Synthesis,” fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., “DIME-C User Guide,” Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd., “Using Handel-C with DK,” Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, “Practical FPGA Programming in C,” Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc, “The Mitrion Processor,” Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone and E. Gomersall, “Co-Synthesis to a Hybrid RISC/FPGA Architecture,” Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Böhm, “The SA-C Language,” Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe and W. Böhm, “The SA-C Compiler,” Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ., “Cameron Poster for ACS PI Meeting,” Arlington, VA, March 7, 2002.
[12] I. Troxel, “CARMA: An Infrastructure for Reconfigurable High-Performance Computing,” Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
[13] R. Goering, “Open-source C compiler targets FPGAs,” Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale and D. Lavenier, “Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective,” Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org
[16] OSCI, “SystemC 2.0.1 Language Reference Manual,” Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, “The DARPA boolean equation benchmark on a reconfigurable computer,” Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, “Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI,” Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, “The future of programmable SoC design is C-based,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.