compiling high-level descriptions on a heterogeneous system josé gabriel de figueiredo coutinho...

1 Compiling High-Level Descriptions on a Heterogeneous System
José Gabriel de Figueiredo Coutinho
Department of Computing, Imperial College London
The Programming Challenge of Heterogeneous Architectures Workshop
University of Birmingham, July 2-3, 2009

2 Overview
1. hArtes Project
2. Research
   a) Task Transformation
   b) Mapping Selection
   c) High-Level Synthesis
3. Harmonic toolchain
4. Challenges

3 Why Heterogeneous Systems?
Because...
- they can be orders of magnitude faster than conventional single-core processors
- they target computation-hungry applications:
  - financial modeling
  - pharmaceutical applications
  - simulation of real-life complex systems
- strategy: mix conventional processors with specialised processors
However...
- how to develop applications?
- portability... new system, new application?
- design exploration... how to decide the partitioning and mapping?
- optimisation... how to exploit specialised processors (FPGAs, DSPs)?
- control vs automation... how do developers interact with the compilation process?

4 1. hArtes Project - Consortium
- Atmel Roma (Italy)
- Faital (Italy)
- Fraunhofer IGD (Germany)
- Imperial College (U.K.)
- INRIA (France)
- Leaff (Italy)
- Politecnico di Bari (Italy)
- Politecnico di Milano (Italy)
- Scaleo Chip (France)
- Thales Communications (France)
- Thomson (France)
- TU Delft (Netherlands)
- UP delle Marche (Italy)
- Università di Ferrara (Italy)
- Université d'Avignon (France)
15 partners in 5 countries

5 Scope
hArtes: Holistic Approach to Reconfigurable real Time Embedded Systems (www.hartes.org)
[Figure labels: Algorithm Exploration Tools; .c source code; hArtes Tool-Chain; GPP; DSP; FPGA]

6 Applications
Enhanced in-car audio and video:
- Multichannel audio system
- Automatic Echo Cancellation (AEC)
- Automatic Speech and Speaker Recognition (ASR)
- Adaptive filtering
- Video transcoding
- Intra-cabin communication
[Figure labels: Hardware Platforms (multi-purpose hardware); Audio and Video Applications]

7 Hardware Platforms
- Atmel Diopsis 940H Evaluation Board (ARM+DSP)
- hArtes Hardware Platform (ARM+DSP+FPGA)

8 Toolchain
The hArtes toolchain is composed of three toolboxes:
1) Algorithm Exploration Toolbox
2) Design Space Exploration Toolbox
3) System Synthesis Toolbox
[Figure label: Mapping Selection]

9 Algorithm Exploration Toolbox: SciLab
[Figure labels: Physical Model; Algorithm; SCILAB; to SCILAB2C and the hArtes Design Exploration Toolbox]

10 Algorithm Exploration Toolbox: NU-Tech
The NU-Tech Graphical Algorithm Exploration (GAE) is the hArtes platform for validating complex algorithms.
Thanks to the plug-in architecture, developers can write their own NUTs (NU-Tech satellites) and immediately plug them into the graphical design environment.
[Figure label: hArtes Design Exploration Toolbox]

11 Design Space Exploration Toolbox
[Figure labels:
- Inputs: Input Source, Profiling
- Stages: Task Partitioning, Task Transformation, Data Representation Optimisation
- Output: Annotated C
- Partners: Politecnico di Milano (Italy), Imperial College (U.K.), TU Delft (Netherlands)]

12 System Synthesis Toolbox
[Figure labels:
- Input: Annotated C
- Stages: Mapping Selection, Code Generation
- Generated code: Generic GPP (C + macros), GPP Molen code, DSP C code, FPGA
- Back ends: GPP comp, Molen, DSP comp, C2VHDL
- Products: ELF obj, Bitstream; Linker and Loader; Executable code (ELF)
- Partners: Imperial College (U.K.), Atmel Roma (Italy), TU Delft (Netherlands)]

13 Accelerating an application

14 Design Exploration and Synthesis
[Figure labels:
- Inputs: C Description, System Description
- Stages: Partitioning, Task Transformation, Cost Estimation, Mapping Selection
- Tasks and candidate implementations: T1_gpp1, T1_gpp2, T1_gpp3, T1_dsp1, T2_dsp2, T3_dsp3, T3_fpga2
- Example mappings: T1 on DSP2, T3 on GPP, T4 on DSP5]

15 2. a) Task transformation
What are task transformations?
- Source-to-source transformations
- Pattern matching on syntax or dataflow
Why use them?
- Compilers cannot include all optimisations
- Use knowledge of domain or platform experts
- Use to influence task mapping
How to use them?
- Write in C++ using the ROSE framework: hard
- Write in our domain-specific language, CML: easier
Who writes them?
- Domain or platform experts
- Developers needing design-space exploration

16 CML for task transformations
Basic CML: 3 parts to a transform
- Pattern: syntax to match, label elements
- Conditions based on dataflow
- Resulting pattern to substitute
Proposed novel aspects of extended CML
- Systematic description of dataflow conditions
- Parameterised transforms
- Features for labelling subpatterns
- Probabilities for machine learning
Extend: CML code matching DFGs
- s1 -> s2 matches a true dependence arc from s1 to s2
- s1 -/> s2 matches an antidependence arc from s1 to s2
- s1 -@-> s2 matches an output dependence arc from s1 to s2
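The three kinds of dependence arc can be illustrated with a minimal C++ sketch. It assumes each statement is summarised only by the sets of variables it reads and writes, and that s1 precedes s2 in program order; the `Stmt` type and the function names are hypothetical and are not part of CML or ROSE. Intervening writes between the two statements are ignored, so this is a simplification of a real dataflow analysis.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical statement summary: the variables a statement reads and writes.
struct Stmt {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two variable sets share at least one name.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const auto& v : a)
        if (b.count(v)) return true;
    return false;
}

// In all three checks s1 is assumed to execute before s2.

// True (flow) dependence: s1 writes a variable that s2 reads.
bool trueDependence(const Stmt& s1, const Stmt& s2) {
    return intersects(s1.writes, s2.reads);
}

// Antidependence: s1 reads a variable that s2 overwrites.
bool antiDependence(const Stmt& s1, const Stmt& s2) {
    return intersects(s1.reads, s2.writes);
}

// Output dependence: both statements write the same variable.
bool outputDependence(const Stmt& s1, const Stmt& s2) {
    return intersects(s1.writes, s2.writes);
}

// Usage sketch:  s1: a = b + 1;   s2: c = a * 2;
void demo() {
    Stmt s1{{"b"}, {"a"}};
    Stmt s2{{"a"}, {"c"}};
    assert(trueDependence(s1, s2));
    assert(!antiDependence(s1, s2));
    assert(!outputDependence(s1, s2));
}
```

A transform whose conditions require, say, no true dependence between two loop bodies could be guarded by a check of this shape before the result pattern is substituted.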

17 Requirements: CML language
Aim: compact transformation description
- Describe transformations on
  - Abstract Syntax Tree (AST)
  - Data Flow Graph (DFG)
- Support transformations specific to
  - Application domain: embedded media
  - Target technology: CPU + DSP + FPGA
- Allow parameterisable transforms, e.g. unrolling factor
- Interpretation
  - Can change a transform without recompilation
  - Saves time, eases the learning curve
  - Can rapidly explore the transform design space
- Customise existing transforms
- Facilitate cost estimates, e.g. number of registers

18 CML example: replace multiply-by-n with shift
Replacing a multiply by a shift is usually an optimisation in hardware: lower area, greater speed.
    transform times2ToShift {
      pattern { expr(1) * n }
      conditions { n & (n-1) == 0 }
      result { expr(1) ...
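The condition in the transform is the usual power-of-two test. A minimal C++ sketch of the same idea follows; the function names are hypothetical, and the shift amount log2(n) is an assumption, since the result pattern is cut off in the source. Two details differ from the CML text: in C++ the test needs parentheses because `==` binds tighter than `&`, and n = 0 is excluded explicitly because it also satisfies `(n & (n-1)) == 0`.

```cpp
#include <cassert>

// True when n is a positive power of two.
bool isPowerOfTwo(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// log2(n) for a power of two, found by counting how far the single set bit
// must be shifted right to reach bit 0.
unsigned log2Exact(unsigned n) {
    assert(isPowerOfTwo(n));
    unsigned shift = 0;
    while ((n >> shift) != 1u) ++shift;
    return shift;
}

// Multiply x by n: use a shift when n is a power of two, otherwise fall back
// to an ordinary multiply (the transform would simply not apply).
unsigned timesN(unsigned x, unsigned n) {
    return isPowerOfTwo(n) ? (x << log2Exact(n)) : (x * n);
}
```

For example, `timesN(5, 8)` takes the shift path and computes `5 << 3`, while `timesN(5, 6)` keeps the multiply.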

19 Simple CML example
Eliminate addition with zero: Expr + 0 => Expr
Not always applicable in floating point: (-0.0) + 0 = +0.0, so x + 0 is not always identical to x.
CML:
    transform addZero {
      pattern { expr(1) + 0 }
      result { expr(1) }
    }
C++ / visitor pattern:
    class AddZero : public Avisitor {
      Expr * result;
    public:
      void visit(Add * a) {
        // recurse to left-hand side
        a->getLhs()->accept(this);
        Expr * x = result;
        if (IntLiteral * il = dynamic_cast<IntLiteral *>(a->getRhs())) {
          if (il->getValue() == 0) {
            // pattern matched
            result = x;
          } else {
            // non-zero literal: keep the addition, with the literal as right-hand side
            result = new Add(x, il);
          }
        } else {
          // recurse to right-hand side
          a->getRhs()->accept(this);
          result = new Add(x, result);
        }
      }
    };
Notes:
- Match any addition to zero; label the left-hand side (expr(1) in CML, x in C++)
- In C++ the pattern is matched in several stages
- If the pattern matched, replace with expr(1) / x
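One concrete floating-point case where the add-zero rewrite changes a result is negative zero. The sketch below assumes IEEE 754 arithmetic in the default rounding mode and a compiler that is not run with value-changing options such as fast-math; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// The expression that the addZero transform would rewrite to plain x.
double addZero(double x) {
    return x + 0.0;
}

// True when x + 0.0 and x differ in sign, i.e. when replacing the sum by x
// would not preserve the result exactly.
bool rewriteChangesSign(double x) {
    return std::signbit(addZero(x)) != std::signbit(x);
}

// Usage sketch: (-0.0) + 0.0 yields +0.0, so the sign bit is lost.
void demo() {
    assert(rewriteChangesSign(-0.0));
    assert(!rewriteChangesSign(1.5));
}
```

For integer expressions the rewrite is always safe, which is one reason a transform author may want a condition on the operand type.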

20 CML Interpreter
CML:
    transform addZero {
      pattern { expr(a) + 0 }
      result { expr(a) }
    }
The CML parser turns the transform into a CML AST: Add with children CMLExpr a and IntLiteral 0.
Source AST nodes shown: SgAddOp, SgIntVal 1, SgAddOp, SgIntVal 2, SgIntVal 0.
Interpret:
- Depth-first visit of the source AST
- At each node, if the node matches the root of the CML pattern:
  - Match the pattern in depth-first postorder
  - Save labelled nodes (a in the example)
  - Exit at the first mismatch
- If the patterns match and the conditions apply, visit the result pattern to apply the result
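The interpretation steps can be sketched on a miniature AST. This is an illustrative C++ sketch, not the CML interpreter or the ROSE API: the `Node` type, the helper constructors and the choice to rewrite children before trying the pattern at the parent are all assumptions made for the example, and the `addZero` pattern is hard-coded rather than parsed from CML.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature AST: integer literals, variables and additions.
struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    enum Kind { Int, Var, Add };
    Kind kind = Int;
    int value = 0;        // used when kind == Int
    std::string name;     // used when kind == Var
    NodePtr lhs, rhs;     // used when kind == Add
};

NodePtr mkInt(int v) {
    auto n = std::make_shared<Node>();
    n->kind = Node::Int; n->value = v;
    return n;
}
NodePtr mkVar(const std::string& name) {
    auto n = std::make_shared<Node>();
    n->kind = Node::Var; n->name = name;
    return n;
}
NodePtr mkAdd(NodePtr l, NodePtr r) {
    auto n = std::make_shared<Node>();
    n->kind = Node::Add; n->lhs = std::move(l); n->rhs = std::move(r);
    return n;
}

// Try to match `expr(a) + 0` at node n. On success return the subtree
// labelled a; exit with nullptr at the first mismatch.
NodePtr matchAddZero(const NodePtr& n) {
    assert(n);
    if (n->kind != Node::Add) return nullptr;                   // root mismatch
    if (n->rhs->kind != Node::Int || n->rhs->value != 0) return nullptr;
    return n->lhs;                                              // label a
}

// Depth-first visit: rewrite the children, then try the pattern at this node
// and substitute the result pattern `expr(a)` when it matches.
NodePtr applyAddZero(const NodePtr& n) {
    if (n->kind != Node::Add) return n;
    NodePtr rebuilt = mkAdd(applyAddZero(n->lhs), applyAddZero(n->rhs));
    if (NodePtr a = matchAddZero(rebuilt)) return a;
    return rebuilt;
}

// Render a tree as text, fully parenthesised.
std::string show(const NodePtr& n) {
    if (n->kind == Node::Int) return std::to_string(n->value);
    if (n->kind == Node::Var) return n->name;
    return "(" + show(n->lhs) + " + " + show(n->rhs) + ")";
}
```

Applied to a tree for 1 + (2 + 0), the inner addition matches and is replaced by its labelled left-hand side, leaving 1 + 2; the outer addition does not match because its right-hand side is not the literal 0.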

21 Ray tracing: Design Space Exploration
- Start: simple, sequential loop
- Add transforms to aid parallelisation
- Best result from pixel-cyclic parallel
[Figure: tree of transform sequences. Key: each node shows the last transform applied and the time in seconds. Start: 46.0. Simple parallel, including variants reached after loop interchange and loop coalesce: 23.3, 23.0, 22.6, 22.2. Pixel-cyclic parallel: 20.1.]

22 Loop coalescing
    transform loopCoalesce { pattern { for(var(0)=0;var(0) ...
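Since the CML text of the transform is cut off in the source, the following C++ sketch shows the general loop coalescing technique rather than the author's transform. It assumes a rectangular N x M iteration space with both loops starting at 0 and unit stride; the function names and the stand-in loop body are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Original form: a doubly nested loop over an N x M iteration space.
std::vector<int> nestedOrder(int N, int M) {
    std::vector<int> out;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            out.push_back(i * 100 + j);   // stand-in for the loop body
    return out;
}

// Coalesced form: a single loop of N*M iterations. The original indices are
// recovered from the combined index by division and remainder.
std::vector<int> coalescedOrder(int N, int M) {
    assert(N >= 0 && M >= 0);
    std::vector<int> out;
    for (int k = 0; k < N * M; ++k) {
        int i = k / M;
        int j = k % M;
        out.push_back(i * 100 + j);       // same body, same order
    }
    return out;
}
```

The coalesced loop visits the iterations in the same order as the nest, and a single loop of N*M iterations is easier to divide evenly among parallel workers than a nest whose outer trip count is small.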

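33 Haydn interpretation rules
Source code:
    {
      delta = b*b - ((a*c) ... 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }
[Figure, Structural Interpretation (Handel-C): datapath over a, b, c with two multipliers, comparisons against 0 and a MUX selecting num_sol from 2, 1, 0; operations are annotated "executed at cycle 1" and "executed at cycle 2".]
[Figure, Behavioural Interpretation: dataflow graph of the same code with inputs a, b, c, multiply nodes, comparison nodes with true/false edges, and assignments of 2, 1, 0 to num_sol.]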
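The code fragment on this slide lost some text during extraction. As a reading aid, here is a runnable C++ sketch of what it appears to compute: the number of real solutions of a quadratic, decided by the sign of the discriminant. The formula b*b - 4*a*c and the `delta > 0` test are assumptions based on the surviving names (`delta`, `num_sol`) and the surviving branches; they are not recovered from the source, and the function name is hypothetical.

```cpp
#include <cassert>

// Assumed reconstruction: delta is the discriminant b*b - 4*a*c.
// In hardware the factor 4 would typically be a left shift by 2.
int numSolutions(int a, int b, int c) {
    int delta = b * b - 4 * (a * c);
    int num_sol;
    if (delta > 0) num_sol = 2;        // two distinct real roots
    else if (delta == 0) num_sol = 1;  // one repeated root
    else num_sol = 0;                  // no real roots
    return num_sol;
}

// Usage sketch: x*x - 3*x + 2 has the two roots 1 and 2.
void demo() {
    assert(numSolutions(1, -3, 2) == 2);
}
```

Under the behavioural interpretation only the dataflow of this code matters; under the structural (Handel-C) interpretation each statement is also tied to a clock cycle.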

34 Rapid development
Behavioural description:
    {
      delta = b*b - ((a*c) ... 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }
Scheduling constraints + synthesis (structural interpretation) give:
    par {
      // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[1].in(a,c);
      // ================== [stage 8]
      tmp0 = pipe_mult[0].q;
      tmp1 = pipe_mult[1].q ... 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }
Unscheduling (behavioural interpretation) leads back from the structural description to the dataflow graph.
[Figure: dataflow graph with inputs a, b, c, multiply nodes, comparisons with true/false edges, and assignments of 2, 1, 0 to num_sol.]

35 Design exploration
Variant using only pipe_mult[0]:
    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      ....
    }
Variant using pipe_mult[0] and pipe_mult[1]:
    par {
      // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[1].in(a,c);
      // ================== [stage 8]
      tmp0 = pipe_mult[0].q;
      tmp1 = pipe_mult[1].q ... 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }
The variants are related by scheduling constraints and unscheduling (behavioural interpretation).
[Figure: dataflow graph of the computation, and a datapath in which pmult[0] is used in cycle 1 and again in cycle 2, producing tmp0 and tmp1; tmp2 = tmp0 - tmp1 is compared against 0 to select num_sol from 2, 1, 0.]

36 Abstraction
Scheduled description:
    par {
      { // ================= [stage 1]
        pipe_mult[0].in(b,b);
        pipe_mult[0].in(a,c);
      }
      { // ================== [stage 4]
        delay;
        tmp0 = pipe_mult[0].q;
      }
      { // ================== [stage 5]
        tmp1 = pipe_mult[0].q ... 0) num_sol = 2;
        else if (tmp2 == 0) num_sol = 1;
        else num_sol = 0;
        delta = tmp2;
      }
Unscheduling (behavioural interpretation) recovers the dataflow graph; abstraction then gives:
    {
      delta = b*b - ((a*c) ... 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }
[Figure: dataflow graph with inputs a, b, c, multiply nodes, comparisons with true/false edges, and assignments of 2, 1, 0 to num_sol.]

37 Design quality
Behavioural description:
    {
      delta = b*b - ((a*c) ... 0) num_sol = 2;
      else if (delta == 0) num_sol = 1;
      else num_sol = 0;
    }
Scheduled description (related to the behavioural one by scheduling constraints and unscheduling):
    par {
      // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[1].in(a,c);
      // ================== [stage 8]
      tmp0 = pipe_mult[0].q;
      tmp1 = pipe_mult[1].q ... 0) num_sol = 2;
      else if (tmp2 == 0) num_sol = 1;
      else num_sol = 0;
      delta = tmp2;
    }
User intervention: manual scheduling
[Figure: dataflow graph with inputs a, b, c, multiply, subtract and compare nodes, and assignments of 2, 1, 0 to num_sol.]

41 Design exploration: batch mode constraints
5 multiplications:
- 1 cycle per result => 5 multipliers
- 2 cycles per result => 3 multipliers
- 5 cycles per result => 1 multiplier
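The three figures above are consistent with a simple resource bound: with a given number of cycles per result, each multiplier can start at most that many multiplications per result, so the number of multipliers is the ceiling of the ratio. A minimal C++ sketch, assuming fully pipelined multipliers that accept one new operation per cycle (the function name is hypothetical, and this is not claimed to be how Haydn computes its allocation):

```cpp
#include <cassert>

// Minimum number of units needed to start `ops` operations of one kind when
// a new result is produced every `cyclesPerResult` cycles.
int unitsNeeded(int ops, int cyclesPerResult) {
    assert(ops >= 0 && cyclesPerResult > 0);
    return (ops + cyclesPerResult - 1) / cyclesPerResult;  // ceiling division
}

// Usage sketch with the slide's 5 multiplications.
void demo() {
    assert(unitsNeeded(5, 1) == 5);
    assert(unitsNeeded(5, 2) == 3);
    assert(unitsNeeded(5, 5) == 1);
}
```

This is the trade-off explored in batch mode: relaxing the cycles-per-result constraint reduces area at the cost of throughput.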

42 Evaluation: speed vs area

43 Initiation interval vs area

44 3. Harmonic Toolchain: Design Flow
[Figure: design flow of the Harmonic toolchain. Labels recovered from the diagram:
- Input: C source files, hardware description
- Task Partitioning: tasks such as task A and task B, with results such as task A1 (FPGA), task A2 (FPGA), task A3 (DSP), task B1 (GPP), task B2 (DSP), ...
- Task Transformation Engine: CML transforms and ROSE C++ transforms, drawn from Generic Transform Libraries (GPP transforms, DSP transforms, FPGA transforms) and from application- and domain-specific transformation descriptions
- CML description: pattern to match, matching conditions, result pattern; input task parameters
- Mapping Selection: selects among implementations; can request a new partition
- Back ends: GPP compiler and DSP compiler (C code specific to each PE); Haydn (HLS) producing Handel-C (cycle-accurate description); FPGA Synthesis
- Outputs: binaries, bitstream, runtime support]

45 Tools and Annotations
Haydn (High-Level Synthesis), annotated source:
    {
      #pragma haydn pipeline II(1)
      s = SQRT(a);
      y = (s + b) * (c + d);
    }
Haydn (High-Level Synthesis), generated description:
    par {
      sqrt_v4.in(a);
      adder_v4[0].in(sqrt_v4.res, b);
      adder_v4[1].in(c, d);
      mult_v4.in(adder_v4[0].res, adder_v4[1].res);
      y = mult_v4.res;
    }
Task Partitioning:
    #pragma map cluster
    void d0d2Sci2CMixRealChTmpd2(...) {
      ...
      ssOpStarsa1(a,x,t1);
      ...
      ssOpStarsa2(b,y,t2);
      ...
      ssOpPlusaa1(t1,t2,z);
    }
Source-Files:
    void filter(...) {
      {
        #pragma omp section
        {
          #pragma map call_hw \
            impl(MAGIC, 14) \
            param(x,1000,r) \
            param(h,100, rw)
          filter(x, h);
        }
        #pragma omp section
        {
          #pragma map call_hw \
            impl(ARM, 15) \
            param(y,2000,r) \
            param(i,50, rw)
          filter2(y, i);
        }
      }
    }
Mapping Selection:
    void foo(...) {
      ...
      #pragma omp parallel sections num_threads(2)
      {
        #pragma omp section
        {
          #pragma map call_hw \
            impl(MAGIC, 14) \
            param(x,1000,r) \
            param(h,100, rw)
          filter(x, h);
        }
        #pragma omp section
        {
          #pragma map call_hw \
            impl(ARM, 15) \
            param(y,2000,r) \
            param(i,50, rw)
          filter2(y, i);
        }
      }
    }
[Figure label: tasks/implementations]

46 ROSE source infrastructure
- Software analysis and optimization for scientific applications
- Tool for building source-to-source translators
- Support for C, C++, Fortran, binary
- Loop optimizations
- Lab and academic use
- Software engineering
- Performance analysis
- Domain-specific analysis and optimizations
- Development of new optimization approaches
- http://rosecompiler.org

47 4. Challenges
Theoretical:
- define and meet global constraints (application/platform)
- correctness: verify transformation results
- effective combination of static and dynamic analysis
Practical:
- reuse legacy code
- incremental approach for using the toolchain
- create a modular toolchain that can evolve with new applications and platforms

48 5. Summary
1. hArtes Project
   - complete toolchain targeting heterogeneous systems
2. Research
   - Task Transformations: CML language for describing transformations
   - Mapping Selection: integrated approach with multiple neighbourhood functions
   - High-Level Synthesis (Haydn): combined behavioural and structural approach
3. Harmonic toolchain
   - modular: enables customisation and technology evolution
