modeling, synthesis, verification - cecs · 2009. 7. 8. · modeling, synthesis, verification...
TRANSCRIPT
Embedded System DesignEmbedded System Design Modeling, Synthesis, Verification
Daniel D. Gajski, Samar Abdi, Andreas Gerstlauer, Gunar Schirner
7/8/2009
Chapter 6: Hardware Synthesis
7/8/2009 2Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
• Design flow• RTL architecture• Input specification• Specification profiling• RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 3Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
HW Synthesis Design Flow
Tool Model
RTLComponent
Library
Specification
RTL Model
Model Generation
RTL Tools
Compilation
Estimation
HLS
Allocation Binding Scheduling
...
• Compilation
• Estimation
• HLS
• Model generation
• RTL synthesis
• Logic synthesis
• Layout
7/8/2009 4Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow• RTL architecture• Input specification• Specification profiling• RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 5Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
RTL Architecture
•Controller•FSM controller•Programmable controller
•Datapath components•Storage components•Functional units•Connection components
•Pipelining•Functional unit •Datapath•Control
•Structure•Chaining•Multicycling•Forwarding•Branch prediction•Caching
7/8/2009 6Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
RTL Architecture with FSM Controller
Output Logic
B1B2
ALU Memory
RF
MUL
B3FSM Controller
Input Logic
Datapath
ControlInputs
ControlOutputs
ControlSignals
StatusSignals
DataInputs
DataOutputs
•Simple architecture
•Small number of states
7/8/2009 7Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Status
Address
IR or C
WR
Cmemor
PMem
AG
SR
PC
B1B2
ALU Memory
RF
MUL
B3Programmable ControllerDatapath
Offset
ControlInputs
ControlOutputs
ControlSignals
DataInputs
DataInputs
RTL Architecture with Programmable Controller
•Complex architecture•Control and datapath pipelining•Advanced structural features
•Large number of states (CW or IS)
7/8/2009 8Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture• Input specification• Specification profiling• RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 9Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Input Specification
• Programming language (C/C++, …)• Programming semantics requires pre-synthesis optimization
• System description language (SystemC, …)• Simulation semantics requires pre-synthesis optimization
• Control/Data flow graph (CDFG)• CDFG generation requires dependence analysis
• Finite state machine with data (FSMD)• State interpretation requires some kind of scheduling
• RTL netlist• RTL design that requires only input and output logic synthesis
• Hardware description language (Verilog / VHDL)• HDL description requires RTL library and logic synthesis
7/8/2009 10Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
C Code for Ones Counter
Function-based C code RTL-based C code
01: int OnesCounter(int Data){02: int Ocount = 0;03: int Temp, Mask = 1;04: while (Data > 0) {05: Temp = Data & Mask;06 Ocount = Data + Temp;07: Data >>= 1;08: }09: return Ocount;10: }
01: while(1) {02: while (Start == 0);03: Done = 0;04: Data = Input;05: Ocount = 0;06: Mask = 1; 07: while (Data>0) {08: Temp = Data & Mask;09: Ocount = Ocount + Temp;10: Data >>= 1;11: }12: Output = Ocount;13: Done = 1;14: }
•Programming language semantics
• Sequential execution,
• Coding style to minimize coding
•HW design
• Parallel execution,
• Communication through signals
7/8/2009 11Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
CDFG for Ones Counter
0
1
>0
0
DoneOutput
Input
0
Done
0
Ocount
1
MaskData
&
>>1 +
DoneData
DoneOcountData
Start
Data
1
Mask Ocount
•Control/Data flow graph
•Resembles programming language
•Loops, ifs, basic blocks (BBs)
•Explicit dependencies
•Control dependences between BBs
•Data dependences inside BBs
•Missing dependencies between BBs
7/8/2009 12Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
FSMD for Ones Counter
•FSMD more detailed then CDFG
•States may represent clock cycles
•Conditionals and statements executed concurrently
• All statement in each state executed concurrently
•Control signal and variable assignments executed concurrently
•FSMD includes scheduling
•FSMD doesn't specify binding or connectivity
7/8/2009 13Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
CDFG and FSMD for Ones Counter
7/8/2009 14Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
RTL Specification for Ones Counter
S0
S7
S4
S6
S5
S4
S3
S2
S1
S0
StateNext
1XXS7
01XS6
X
X
X
X
X
X
1
0
StartInputs: Output:Present
0XS5
00S6
0XS4
0XS3
0XS2
0XS1
XXS0
XXS0
DoneData = 0State
RF[0] = Data, RF[1] = Mask, RF[2] = Ocount, RF[3] = Temp
RF[2]
RF[0]
RF[2]
RF[0]
RF[2]
RF[2]
X
X
RF Read Port A
X
pass
add
AND
increment
subtract
X
X
ALU
X
X
RF[3]
RF[1]
X
RF[2]
X
X
RF Read Port B
X
B3
B3
B3
B3
B3
Inport
X
RF selector
X
shift right
pass
pass
pass
pass
X
X
Shifter
disable
RF[0]
RF[2]
RF[3]
RF[1]
RF[2]
RF[0]
X
RF Write
enableS7
ZS6
ZS5
ZS4
ZS3
ZS2
ZS1
ZS0
OutportState
Output logic table
status
Output Logic
B1
ALU
Shifter
B3FSM Controller
Input Logic
Datapath
Done
Start
B2
Selector
Outport
ControlSignals
Inport
RF
Input logic table
•RTL Specification
•Controller and datapath netlist
•Input and output tables for logic synthesis
•RTL library needed for netlist
7/8/2009 15Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
HDL description of Ones Counter
01: // …02: always@(posedge clk) 03: begin : output_logic04: case (state)05: // …06: S4: begin 07: B1 = RF[0];08: B2 = RF[1]; 09: B3 = alu(B1, B2, l_and);10: RF[3] = B3;11: next_state = S5;12: end13: // …14: S7: begin 15: B1 = RF[2];16: Outport <= B1;17: done <= 1; 18: next_state = S0;19: end20: endcase21: end 22: endmodule
•HDL description
•Same as RTL description
•Several levels of abstraction
•Variable binding to storage
•Operation binding to FUs
•Transfer binding to connections
•Netlist must be synthesized
•Partial HLS may be needed
7/8/2009 16Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification• Specification profiling• RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 17Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Profiling and Estimation
• Pre-synthesis optimization
• Preliminary scheduling• Simple scheduling algorithm
• Profiling• Operation usage
• Variable life-times
• Connection usage
• Estimation• Performance
• Cost
• Power
7/8/2009 18Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Square-Root Algorithm (SRA)
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
S7
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )y = min ( t1 , t2 )
t3 = x >> 3t4 = y >>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
• SQR = max ((0.875x + 0.5y), x)
• x = max (|a|, |b|)
• y= min (|a|, |b|)
7/8/2009 19Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Variable and Operation Usage
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
S7
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )y = min ( t1 , t2 )
t3 = x >> 3t4 = y >>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
Variable usage
1233222No. of live variables
Xt7
Xt6
Xt5
X
X
S1
X
X
S3
XX
S2
X
X
S5
X
X
X
S4
X
S6
t4
t3
y
x
t2t1
b
a
S7
S7
111212No. of
operations
2
S1
2
S3
11
S2
1
S5
1
S4
1
S6
1+1-2>>1max1min2abs
Max. no. of units
Operation usage
7/8/2009 20Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Connectivity usage
I
O
t5
O
I
t6
O
t7
I
O
t3
I
O
t4
I
a
I
I
O
t1
I
b
I
I
I
O
x
I
I
O
t2
I
O
y
+
-
>>1
>>3
max
min
abs2
abs1
S7
111212No. of
operations
2
S1
2
S3
1
1
S2
1
S5
1
S4
1
S6
1+
1-
2>>
1max
1min
2abs
Max. no. of units
Operation usage
Connectivity usage
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
S7
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )y = min ( t1 , t2 )
t3 = x >> 3t4 = y >>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
7/8/2009 21Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification Specification profiling• RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 22Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath Synthesis
• Variable Merging (Storage Sharing)
• Operation Merging (FU Sharing)
• Connection Merging (Bus Sharing)
• Register merging (RF sharing)
• Chaining and Multi-Cycling
• Data and Control Pipelining
7/8/2009 23Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Gain in register sharing
a
Selector Selector
Selector Selector
c b d
x y
+
Selector Selector
a , c b , d
x , y
+
Selector
Partial FSMD Datapath without register sharing Datapath with register sharing
•Register sharing
•Grouping variables with non-overlapping lifetimes
•Sharing reduces connectivity cost
7/8/2009 24Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
General partitioning algorithm
Stop
no yes
Create compatibility graph
Merge highest priority nodes
Upgrade compatibility graph
All nodes incompatible
Start•Compatibility graph
•Compatibility:
•Non-overlapping in time
•Not using the same resource
•Non-compatible:
•Overlapping in time
•Using the same resource
• Priority
•Critical path
•Same source, same destination
7/8/2009 25Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Variable Merging for SRA
1/0 0/1
1/0
0/1
R1 = [ a, t1, x, t7 ]R2 = [ b, t2, y, t3, t5, t6 ]R3 = [ t4 ]
(a) Initial compatibility graph (b) Compatibility graph after merging t3, t5, and t6
(c) Compatibility graph after merging t1, x, and t7 (d) Compatibility graph after merging t2 and y
(e) Final compatibility graph (f) Final register assignments
7/8/2009 26Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath with Shared Registers
Selector Selector
R1 R2 R3
| a | | b | min max + - >>1 >>3
•Variables combined into registers
•One functional unit for each operation
7/8/2009 27Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Gain in Functional Unit Sharing
x = a + b
y = c + d
Sj
Si
a
Selector Selector
c b d
x y
+/-
Partial FSMD Non-shared design Shared design
•Functional unit sharing
•Smaller number of FUs
•Larger connectivity cost
7/8/2009 28Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Operation Merging for SRA
1/0
|a| |b|
min max
+ -
1/0
1/01/0
1/1 1/1
0/0 1/1
1/1
1/1
1/1 2/0
2/1
|a| |b|
min max
+ -
1/0
1/1 1/1
1/1
2/1 2/0
1/1
1/0
Initial compatibility graph Compatibility graph after merging of + and -
Compatibility graph after merging of min, +, and - Final graph partitions
7/8/2009 29Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Selector Selector
R1 R2 R3
abs/max>>1Selector
abs/min/+/-
>>3
Datapath with Shared Registers and FUs
•Variables combined into registers
•Operations combined into functional units
7/8/2009 30Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Connection usage for SRA
XN
XM
XL
XK
XXXXJ
XXXI
X
X
S6
X
S7S0
X
X
X
X
S2
X
X
S1
X
X
S4
X
X
S3
X
X
S5
H
G
F
E
D
C
B
A
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
S7
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )y = min ( t1 , t2 )
t3 = x >> 3t4 = y >>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
Connection usage table
•Find compatible connections for merging into buses
7/8/2009 31Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Connection Merging for SRA
I
J
K
L
M
N
Compatibility graph for input buses Compatibility graph for output buses
Bus assignmentXN
XM
XL
XK
XXXXJ
XXXI
X
X
S6
X
S7S0
X
X
X
X
S2
X
X
S1
X
X
S4
X
X
S3
X
X
S5
H
G
F
E
D
C
B
A
•Combine connection not used at the same time
•Priority to same source, same destination
•Priority to maximum groups
7/8/2009 32Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath with Shared Registers, FUs and Buses
•Minimal SRA architecture
•3 registers
•4 (2) functional units
• 4 buses
7/8/2009 33Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Register Merging into RFs
R1 R2
R3
S6 S7S0 S2S1 S4S3 S5
R3
R2
R1
Register assignment
Register access tableCompatibility graph
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
S7
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )y = min ( t1 , t2 )
t3 = x >> 3t4 = y >>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
•Register merging: Port sharing
•Merge registers with non-overlapping access times
•No of ports is equal to simultaneous read/write accesses
7/8/2009 34Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath with Shared RF
H
Out
In1 In2
R1
R2
>>3 >>1
Bus1
abs/max abs/min/+/-
Bus2
Bus3
Bus4
R3
•RF minimize connectivity cost by sharing ports
7/8/2009 35Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification Specification profiling RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
• Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 36Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath with Chaining
In 2
S0
a = In1b = In2
0
1
Start = 1
S1
S2
S3
S4
S5
S6
t1 = |a|t2 = |b|
t5 = x – t3
x = max( t1 , t2 )t3 = max( t1 , t2 )>>3t4 = min ( t1 , t2 )>>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
•Chaining connects two or more FUs
•Allows execution of two or more operation in a single clock cycle
•Improves performance at no cost
7/8/2009 37Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
In 1
R1 R2 R3
Bus 1
abs/+/-
Bus 2
Bus 3
Bus 4
In 2
Out
abs/max min
>>3 >>1
In 2
S0
a = In1b = In2
0
1
Start
S1
S2
S3
S4
S5
S6
t1 = |a|t2 = |b|
t5 = x – t3t4 = [min ( t1, t2 ) >>1]
x = max( t1 , t2 )t3 = max( t1 , t2 )>>3[t4]= min ( t1 , t2 )>>1
t6 = t4 + t5
t7 = max ( t6 , x )
Done = 1Out = t7
Datapath with Chained and Multi-Cycled FUs
•Multi-cycling allows use of slower FUs
•Multi-cycling allows faster clock-cycle
7/8/2009 38Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification Specification profiling RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
Chaining and multi-cycling• Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 39Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Pipelining
• Functional Unit pipelining• Two or more operation executing at the same time
• Datapath pipelining • Two or more register transfers executing at the same time
• Control Pipelining• Two or more instructions generaqted at the same time
7/8/2009 40Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Functional Unit Pipelining (1)
In 2Done = 1Out = t7
t1 = |a|
x = max( t1 , t2 )t3 = max( t1 , t2 )>>3
t4 = min( t1 , t2 )>>1
t6 = t4 + t5
t7 = max ( t6 , x )
Start
a = In1b = In2
S0
0
1S1
S2
S4
S6
S7
S8
t5 = x – t3
S5
t2 = |b|
S3
•Operation delay cut in ”half”
•Shorter clock cycle
•Dependencies may delay some states
•Extra NO states reduce performance gain
7/8/2009 41Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Functional Unit Pipelining (2)
In 2Done = 1Out = t7
t1 = |a|
x = max( t1 , t2 )t3 = max( t1 , t2 )>>3
t4 = min( t1 , t2 )>>1
t6 = t4 + t5
t7 = max ( t6 , x )
Start
a = In1b = In2
S0
0
1S1
S2
S4
S6
S7
S8
t5 = x – t3
S5
t2 = |b|
S3
t7
max
NO
t7
t7
S8
t6
+
NO
max
t6
X
S7
t5
>>1
-
NO
+
t4
t5
S6
Write Out
t4Write R3
>>3
min
-
t3
X
S5
b
a
S0
t1
|a|
|b|
b
S2
|a|
a
S1
max
t2
t1
S3
t2
|b|
NO
t3
X
max
min
t2
t1
S4
Write R2
Write R1
Shifters
ALU stage 2
ALU stage 1
Read R3
Read R2
Read R1
Timing diagram with 4 additional NO states
7/8/2009 42Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath Pipelining (1)
In1
R1 R2 R3
>>1
Bus1
Bus2
Bus3
Bus4
>>3
In2
Out
ALU
•Register-to-register delay cut in “equal” parts•Much shorter clock cycle•Dependencies may delay some states•Extra NO states reduce performance gain
7/8/2009 43Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath pipelining (2)
In 2Done = 1Out = t7
t1 = |a|
x = max( t1 , t2 )t3 = max( t1 , t2 )>>3
t4 = min( t1 , t2 )>>1
t6 = t4 + t5
t7 = max ( t6 , x )
Start
a = In1b = In2
S0
0
1S1
S2
S4
S6
S7
S8
t5 = x – t3
S5
t2 = |b|
S3 Timing diagram with additional NO clock cycles
In1
R1 R2 R3
>>1
Bus1
Bus2
Bus3
Bus4
>>3
In2
Out
ALU
t6t5t3t2t2bALUIn(R)
t7
17
t7
t7
18
x
t6
x
15
max
16
t1
|b|
4
t2
5
t6
14
t4
t4
t5
12
+
13
-
10
t5
11
Write Out
t4Write R3
>>1
x
t3
x
9
b
a
1
|a|
b
3
a
a
2
max
t1
t2
t1
7
t1
t2
t1
6
t3
x
>>3
min
8
Write R2
Write R1
Shifters
ALUOut
ALUIn(L)
Read R3
Read R2
Read R1
Cycles
7/8/2009 44Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Datapath and Control Pipelining (1)
S1
1 0a>b
x = c + d
y = x - 1
S2
S3
ALU
Register
Selector
RF Mem
Bus2
Bus1
StatusSignals
Controlsignals
/
Register
DataInputs
Datapath
DataOutputs
CW
R
PC CMem
AG
SR
Offset
Controller
ControlInputs
ControlOutputs
Bus3
•Fetch delay cut into several parts•Shorter clock cycle•Conditionals may delay some states•Extra NO states reduce performance gain
7/8/2009 45Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Data and Control Pipelining (2)
S1
1 0a>b
x = c + d
y = x - 1
S2
S3
ALU
Register
Selector
RF Mem
Bus2
Bus1
StatusSignals
Controlsignals
/
Register
DataInputs
Datapath
DataOutputs
CW
R
PC CMem
AG
SR
Offset
Controller
ControlInputs
ControlOutputs
Bus3
14/17
NO
13
3
15
NO
14
4
20
x-1
19
9
y
20
10
19181716131211Write PC
a>bWrite SR
1
x
1
x
S3
18
8
10
0
NO
12
2
b
a
b
a
S1
11
1
c+d
NO
16
6
d
c
d
c
S2
15
5
x
NO
17
7
Write RF
Write ALUOut
Write ALUIn(R)
Write ALUIn(L)
Read RF(R)
Read RF(L)
Read CWR
Read PC
CycleOperation
Timing diagram with additional NO clock cycles
•3 NO cycles for the branch
•2 NO cycles for data dependence
7/8/2009 46Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification Specification profiling RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
Chaining and multi-cycling Data and control pipelining• Scheduling• Component interfacing• Conclusions
7/8/2009 47Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Scheduling
• Scheduling assigns clock cycles to register transfers
• Non-constrained scheduling• ASAP scheduling
• ALAP scheduling
• Constrained scheduling• Resource constrained (RC) scheduling
– Given resources , minimize metrics (time, power, …)
• Time constrained (TC) scheduling– Given time, minimize resources (FUs, storage, connections)
7/8/2009 48Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
C and CDFG for SRA Algorithm
C flowchart
0Start
t1=|a|t2=|b|
x=max(t1,t2)y=min(t1,t2)
t3=x>>3t4=y>>1t5=x-t3
t6=t4+t5t7=max(t6,x)
Done=1Out=t7
a=In1b=In2
1
0
1
Start
In1 In 2
a b
a b
min
|a| |b|
max
>>1 >>3
-
+
max
Out Done
1
CDFG
7/8/2009 49Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
RC Scheduling
ASAPschedule
Ready list with mobilities(ALAP – ASAP)
ALAPschedule
RC schedule(for single FUand 2 shifters)
7/8/2009 50Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Scheduling Algorithms
RC algorithm TC algorithm
7/8/2009 51Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
TC Scheduling
ASAP ALAP TC schedule
Out
min
|a| |b|
max
>>1 >>3
-
+
max
Out
min
|a| |b|
max
>>1
>>3
+
max
min
|a|
|b|
max
>>1
>>3
-
+
max
S5
S6
S7
S1
S3
S4
S8
S2
a b a b a b
Out
-
7/8/2009 52Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Initial probability distribution graph Graph after max, +, and – were scheduled
1.83
S5
S6
S7
S1
S2
S3
S4
|a|
|b|
max
min
-
+
max
>>
1
>>
3
1.0
.83
.83
1.0
1.0
.5
.83
.83
.33
AU units
Probability sum/state
Shift units
1.33
S5
S6
S7
S1
S2
S3
S4
|a| |b|
max
min
-
max
>>
1
>>3
1.0
1.33
.33
1.0
1.0
1.0
AU units
Probability sum/state
Shift units
+
Distribution Graphs for TC scheduling
7/8/2009 53Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
1.0
S5
S6
S7
S1
S2
S3
S4
|a| |b|
max
min
-
max
>>1
>>3
1.0
1.0
1.0
1.0
1.0
AU units
Probability sum/state
Shift units
+
1.0
1.0
1.0
1.0
S5
S6
S7
S1
S2
S3
S4
|a|
|b|
max
min
-
max
>>1
>>3
1.0
1.0
1.0
1.0
1.0
AU units
Probability sum/state
Shift units
+
1.0
1.0
1.0
Graph after max, +, -, min, >>3, and >>1were scheduled
Distribution graph for final schedule
Distribution Graphs for TC scheduling
7/8/2009 54Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Hardware Synthesis
Design flow RTL architecture Input specification Specification profiling RTL synthesis
• Variable merging (Storage sharing)• Operation Merging (FU sharing)• Connection Merging (Bus sharing)
Chaining and multi-cycling Data and control pipelining Scheduling• Component interfacing• Conclusions
7/8/2009 55Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Interface Synthesis
•Combine process and channel codes
•HW and protocol clock cycles may differ
•Insert a bus-interface component
•Communication in three parts:
•Freely schedulable code
•Scheduled with process code
•Schedule constrained code
•MAC driver from library for selected bus interface
•Bus interface
•Implemented by bus interface component from library
7/8/2009 56Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Bus Interface Controller (1)
ALU
Reg RF Mem
Bus 2
Bus 1
Statussignals
Controlsignals
/
CMem
AG
offset
Controller
Control
ack
Address
Write Queue
Read Queue
Datapath
Output logic
Input logic
Bus 3 Bus 4
Selector
Selector
DATA
ADDRESS
CONTROL
GRANT
REQUEST
Controlsignals
INC
ready OutCntrl OutAddr OutData InData
MAC driver
7/8/2009 57Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Bus Interface Controller (2)
S5
S4
S3
S1
S2
S6
OutAddr = BusAddr
OutData = BusData
OutCntrl = WRITE_WORDack = 1
ready = 1
ack = 0
ready = 0
ready = 1
ready = 0
ALU
Reg RF Mem
Bus 2
Bus 1
Statussignals
Controlsignals
/
CMem
AG
offset
Controller
Control
ack
Address
Write Queue
Read Queue
Datapath
Output logic
Input logic
Bus 3 Bus 4
Selector
Selector
DATA
ADDRESS
CONTROL
GRANT
REQUEST
Controlsignals
INC
ready OutCntrl OutAddr OutData InData
MAC driver
Bus protocol
7/8/2009 58Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Transducer/ Bridge
Processor1<clk1>
Processor2<clk2>
Controller1<clk1>
Controller2<clk2>
Transducer
Ready1Ack1
Ready2Ack2
Memory1 Memory2
PE1 PE2Bus 1 Bus 2
Interrupt2Interrupt1
Data1 Data2
Queue<clk3>
•Translates one protocol into another
•Controller1 receives data with protocol1 and writes into queue
•Controller2 reads from queue and sends data with protocol2
7/8/2009 59Embedded System Design © 2009: Gajski, Abdi, Gerstlauer, Schirner
Chapter 6: Hardware Synthesis
Conclusion
• Synthesis techniques• Variable Merging (Storage Sharing)
• Operation Merging (FU Sharing)
• Connection Merging (Bus Sharing)
• Architecture techniques• Chaining and Multi-Cycling
• Data and Control Pipelining
• Forwarding and Caching
• Scheduling• Metric constrained scheduling
• Interfacing• Part of HW component
• Bus interface unit
• If too complex, use partial order