Center for Embedded Computer Systems, University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
SPARK: A Parallelizing High-Level Synthesis Framework
Supported by Semiconductor Research Corporation & Intel Inc
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Copyright Sumit Gupta 2003
System-Level Synthesis

[Figure: system-level design flow. A system-level model undergoes task analysis and HW/SW partitioning; the software behavioral description goes to a software compiler, while the hardware behavioral description goes through high-level synthesis, targeting ASIC, processor core, memory, FPGA, and I/O blocks.]
High-Level Synthesis
[Figure: target architecture with memory, control, ALU, and datapath, alongside the CDFG for the fragment below.]

    x = a + b;
    c = a < b;
    if (c)
        d = e - f;
    else
        g = h + i;
    j = d * g;
    l = e + x;
Transform behavioral descriptions to RTL/gate level
From C to CDFG to Architecture
Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions.
Problem #2: Poor or no controllability of the HLS results.
Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Code Transformation Techniques for PHLS
  - Parallelizing Transformations
  - Dynamic Transformations
- The PHLS Framework and Experimental Results
  - Multimedia and Image Processing Applications
  - Case Study: Intel Instruction Length Decoder
- Conclusions and Future Work
High-Level Synthesis
- Well-researched area, dating from the early 1980s
- Renewed interest due to new system-level design methodologies
- A large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - Or logic level: don't-care based control optimizations
- In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain)
- Parallelizing compiler transformations have different optimization objectives and cost models than HLS
- Our aim: develop synthesis and parallelizing compiler transformations that are "useful" for HLS
  - Beyond scheduling results: in circuit area and delay
  - For large designs with complex control flow (nested conditionals/loops)
Our Approach: Parallelizing HLS (PHLS)

[Flow: C input -> original CDFG -> source-level compiler transformations -> optimized CDFG -> scheduling and binding (with scheduling-time compiler and dynamic transformations) -> VHDL output]

- Optimizing compiler and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling
- Source-level code refinement using pre-synthesis transformations
- Code restructuring by speculative code motions
- Operation replication to improve concurrency
- Dynamic transformations: exploit new opportunities during scheduling
PHLS Transformations Organized into Four Groups
1. Pre-synthesis: loop-invariant code motions, loop unrolling, CSE
2. Scheduling: speculative code motions, multi-cycling, operation chaining, loop pipelining
3. Dynamic: transformations applied dynamically during scheduling (dynamic CSE, dynamic copy propagation, dynamic branch balancing)
4. Basic compiler transformations: copy propagation, dead code elimination
Speculative Code Motions

[Figure: code motions around an if node: speculation (moving an operation above the condition), reverse speculation (moving it down into the branches), conditional speculation (duplicating it into both branches), and motion across hierarchical blocks.]

- Operation movement reduces the impact of programming style on the quality of HLS results
- Early condition execution: evaluates conditions as soon as possible
Dynamic Transformations
- Called "dynamic" since they are applied during scheduling (versus in a pass before or after scheduling)
- Dynamic branch balancing
  - Increases the scope of code motions
  - Reduces the impact of programming style on HLS results
- Dynamic CSE and dynamic copy propagation
  - Exploit the operation movement and duplication caused by speculative code motions
  - Create new opportunities to apply these transformations
  - Reduce the number of operations
Dynamic Branch Balancing

[Figure: original design with scheduling steps S0-S3, a resource allocation of two adders, and operations a, b (additions) and c, d, e (subtractions) around an if node spanning basic blocks BB0-BB4. The scheduled design has an unbalanced conditional: one branch is longer than the other, and the subtraction e lies on the longest path after the join.]
Insert New Scheduling Step in Shorter Branch

[Figure: the same design; dynamic branch balancing inserts a new scheduling step in the shorter branch of the conditional.]
[Figure: with the new step inserted, the subtraction e is conditionally speculated as a duplicate into both branches, removing it from after the join.]

- Dynamic branch balancing inserts new scheduling steps
- This enables conditional speculation
- It leads to further code compaction
Dynamic CSE: Going beyond Traditional CSE

C description:

    a = b + c;
    cd = b < c;
    if (cd)
        d = b + c;
    else
        e = g + h;

[Figure: HTG representation. BB1 computes a = b + c and the condition cd; the if node's true branch BB2 computes d = b + c, the false branch BB3 computes e = g + h, and the branches join in BB4. After traditional CSE, the operation in BB2 becomes d = a.]
- We use the notion of dominance of basic blocks
- Basic block BBi dominates BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi
- We can eliminate an operation opj in BBj using the common expression in opi if BBi dominates BBj
New Opportunities for "Dynamic" CSE Due to Code Motions

[Figure: a = b + c sits in BB2, a branch of the first conditional; d = b + c sits in BB6, a branch of a later conditional. The scheduler decides to speculate the operation as dcse = b + c into BB0, and BB2's operation becomes a = dcse.]

- CSE was not possible before, since BB2 does not dominate BB6
- CSE is possible now, since BB0 dominates BB6
[Figure: after the speculation, dynamic CSE also rewrites d = b + c in BB6 as d = dcse.]

If the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op.
Conditional Speculation and Dynamic CSE

[Figure: a = b + c sits in BB4, between two conditionals; d = b + c sits in BB8, after the second. The scheduler decides to conditionally speculate: a' = b + c is duplicated into both branches BB1 and BB2 of the first conditional, and BB4's operation becomes a = a'.]
[Figure: dynamic CSE then rewrites d = b + c in BB8 as d = a'.]
- Use the notion of dominance by groups of basic blocks
- All control paths leading up to BB8 pass through either BB1 or BB2, so BB1 and BB2 together dominate BB8
Loop Shifting: An Incremental Loop Pipelining Technique

[Figure: a loop node with basic blocks BB1-BB3 and a loop exit to BB4. Loop shifting moves the operation a + c from the beginning of the loop body to its end, and places a copy in the loop head BB0 for the first iteration.]
[Figure: after shifting, the shifted operation and the remaining body operations are independent, so compaction schedules them concurrently, shortening the loop body.]
SPARK High-Level Synthesis Framework
SPARK Parallelizing HLS Framework
- C input and synthesizable RTL VHDL output
- Toolbox of transformations and heuristics
  - Each can be developed independently of the others
  - Script-based control over transformations and heuristics
- Hierarchical intermediate representation (HTGs)
  - Retains structural information about the design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: does binding, control synthesis, and backend VHDL generation
  - Interconnect-minimizing resource binding
- Enables graphical visualization of the design description and intermediate results
- 100,000+ lines of C++ code
Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis:
  - Pointers (however, arrays and passing by reference are supported)
  - Recursive function calls
  - Gotos
- Features for which support has not been implemented:
  - Multi-dimensional arrays
  - Structs
  - Continue and break statements
- A hardware component is generated for each function
- A called function is instantiated as a hardware component in the calling function
[Screenshots: graph visualization of the HTG and DFG; resource utilization graph during scheduling.]
Example of a Complex HTG
- A real design: the MPEG-1 pred2 function
- Shown for demonstration only; the text is not meant to be read
- Multiple nested loops and conditionals
Experiments
- Results presented here for:
  - Pre-synthesis transformations
  - Speculative code motions
  - Dynamic CSE
- We used SPARK to synthesize designs derived from several industrial applications: MPEG-1, MPEG-2, and the GIMP image processing software
- Case study of the Intel Instruction Length Decoder
- Scheduling results: number of states in the FSM, cycles on the longest path through the design
- VHDL logic synthesis results: critical path length (ns), unit area
Target Applications

Design            # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1          4          2                  17                     123
MPEG-1 pred2         11          6                  45                     287
MPEG-2 dp_frame      18          4                  61                     260
GIMP tiler           11          2                  35                     150
Scheduling & Logic Synthesis Results

[Bar charts: MPEG-1 pred1 and pred2 functions. Longest path (cycles), critical path (ns), total delay (cycles x ns), and unit area, normalized to a baseline of non-speculative code motions (within basic blocks and across hierarchical blocks), then successively adding speculative code motions, pre-synthesis transforms, and dynamic CSE. Annotated improvements: 42%, 10%, and 36% for pred1; 36%, 8%, and 39% for pred2.]
Overall: 63-66% improvement in delay, with almost constant area.
Scheduling & Logic Synthesis Results

[Bar charts: MPEG-2 dp_frame and GIMP tiler functions, same metrics and configurations. Annotated improvements: 14%, 20%, and 1% for dp_frame; 33%, 41%, and 52% for tiler.]
Overall: 48-76% improvement in delay, with almost constant area.
Case Study: Intel Instruction Length Decoder

[Figure: a stream of instructions from the instruction buffer feeds the instruction length decoder, which marks where the first, second, and third instructions begin.]
Example Design: ILD Block from Intel
- Case study: a design derived from the Instruction Length Decoder of the Intel Pentium class of processors
- Decodes the length of instructions streaming from memory
  - Has to look at up to 4 bytes at a time
- Has to execute in one cycle and decode about 64 bytes of instructions
- Characteristics of microprocessor functional blocks:
  - Low latency: single- or dual-cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
Basic Instruction Length Decoder: Initial Description

[Figure: bytes 1-4 are examined in sequence; each stage checks "need byte 2/3/4?" and adds that byte's length contribution to the total length of the instruction.]
Instruction Length Decoder: Decoding the 2nd Instruction

[Figure: the same sequential chain applied after the first instruction, now examining bytes 3-6.]

After decoding the length of an instruction:
- Start looking from the next byte
- Again examine up to 4 bytes to determine the length of the next instruction
Instruction Length Decoder: Parallelized Description

[Figure: the length contributions of all 4 bytes are computed concurrently and then combined into the total instruction length.]

- Speculatively calculate the length contribution of all 4 bytes at a time
- Determine the actual total length of the instruction based on this data
ILD: Extracting Further Parallelism

[Figure: an instruction-length calculation block for each byte, all operating in parallel.]

- Speculatively calculate the length of instructions assuming a new instruction starts at each byte
- Do this calculation for all bytes in parallel
- Traverse from the first byte to the last, determining the length of each instruction in turn
- Discard the unused calculations
Initial: Multi-Cycle Sequential Architecture

[Figure: the sequential chain of length-contribution stages from the initial description, scheduled over multiple cycles.]
ILD Synthesis: Resulting Architecture

Speculate operations, fully unroll the loop, and eliminate the loop index variable.
- These transformations turn the initial multi-cycle sequential architecture into a single-cycle parallel architecture
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- The final design looks close to the actual implementation done by Intel
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - They provide the needed improvement in the quality of HLS results
  - They make it possible to be competitive with manually designed circuits
  - They can enable productivity improvements in microelectronic design
- We built a synthesis system with a range of code transformations
  - A platform for applying coarse-grain and fine-grain optimizations
  - A toolbox approach in which transformations and heuristics can be developed independently
  - Enables the designer to find the right synthesis script for different application domains
- Performance improvements of 60-70% across a number of designs
- We have shown its effectiveness on an Intel design
Acknowledgements
- Advisors: Professors Rajesh Gupta, Nikil Dutt, Alex Nicolau
- Contributors to the SPARK framework: Nick Savoiu, Mehrdad Reshadi, Sunwoo Kim
- Intel Strategic CAD Labs (SCL): Timothy Kam, Mike Kishinevsky
- Supported by Semiconductor Research Corporation and Intel SCL

Thank You