slack analysis in the system design loop
DESCRIPTION
Slack Analysis in the System Design Loop. Girish VenkataramaniCarnegie Mellon University, The MathWorks Seth C. Goldstein Carnegie Mellon University. Typical System Design Flow. Spec. Scalability Issues: Simulation takes minutes to hours Synthesis takes many hours. - PowerPoint PPT PresentationTRANSCRIPT
Slack Analysis in the System Design Loop
Girish Venkataramani Carnegie Mellon University,The MathWorks
Seth C. Goldstein Carnegie Mellon University
2
Typical System Design Flow
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
SystemDesign Loop
Too slow!
Scalability Issues:Simulation takes minutes to hoursSynthesis takes many hours
IR
3
Proposed Design Flow
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IRTraditionalSystemDesign Loop
Timing Analysis
Optimize Design
Update timing in IR
New Optimization
Loopreplace with
4
Key Contributions
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IR
Timing Analysis
Optimize Design
Update timing in IR
1. Slack is a distributed representation of system timing2. Linear-time update algorithm based on slack3. > 100x reduction in total design time:
• Optimization Loop runs in seconds/minutes while • System Design Loop runs in hours/days
TraditionalSystemDesign Loop
replace withNew
OptimizationLoop
5
Outline
• Motivation
• Timing Metrics– Cycle time– Slack: An alternative view of cycle time
• Slack Update Algorithm• Experimental Evaluation• Conclusions
6
The Intermediate Representation (IR)
• Models a dynamic system
• Concurrent sub-systems– PE, FSM, S/W, Memory
• Communication between sub-systems based on pre-defined protocols– FIFOs, NoC, shared bus
Sub-System
Sub-System
Sub-System
Sub-System
Network
… …
… …
Transaction-Level Modeling (TLM), [Cai, ISSS 03]Adopted by System-C, Bluespec, Balsa, Tangram
7
Marked Graphs
• Model dynamic system interactions
• Events and transitions– Event: An edge acquires a
token– Transition: Node consumes
inputs and generates outputs
• Encode the communication protocols
S1 S2
S3
token
8
Timing Analysis of IR
• Time Separation between Event (TSEs)– TSE between consecutive firings of same event in steady state is
the mean cycle time
• Mean Cycle Time, CT
• Computing CT is about O(|E|3) complexity [Dasdan 04]
cycleon tokensofnumber is N
cycle,graph a oflatency isC :
)(
i
iwhere
CTi
i
i
NC
CMax
9
Slack as a Timing Metric
• Distributed representation of cycle time– Different type of TSE– Defined on each (input edge, node) pair– How early this input arrives
Slack(S1, S3) = 3Slack(S2, S3) = 0Slack(S3, S1) = 0Slack(S3, S2) = 0
• Zero-slack input is locally critical
• Longest chain of zero-slack events yields the critical cycle or the Global Critical Path (GCP)– Cycle time = Latency of the GCP– Given slack values, computing cycle time has linear complexity
*2 5
8
S1 S2
S3
locallycritical
10
Slack is an Annotation on the IR
Sub-System
Sub-System
Sub-System
Sub-System
Network
… …
… …
SlackSlack
Slack
Slack
Helps in discovering hotspots and applying optimizations
11
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
12
Optimizations Change the IR
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IR+
slack
Optimize Design
Update timing in IR
New Optimization
Loop
Causes changes in component delays,in turn, changing in slack values
Need to update slack on each change
13
Problem Description
• Given a graph model and its current slack values, compute new values of slack when latency of a node changes by a given Δ
2 5
8
S1 S2
S3 6, Δ = -2
14
Insight behind Update Algorithm
• Slack is also latency difference of two branches of a re-convergent fork-join
• If delay of node, sc, increases by Δ– Update slack in surrounding
re-convergent fork-joins– Propagate change globally
sfork
sjoin
e1 e2
P1
P2
Assume δ(P1) > δ(P2), Di = δ(P1) - δ(P2)Slack(e1, sjoin) = 0Slack(e2, sjoin) = Di
Update (let Δ ≤ Di):Slack(e2, sjoin) = Di + Δ
+Δ sc
15
Insight behind Update Algorithm
1. If there is a path from change point to every input of sjoin, then no change in slack
2. If not, then slack changes occur
sjoin
e1 e2
+Δ sc
sfork+Δ
Easy for acyclic graphs, but what about scc graphs?
16
Insight for Cyclic Graphs
• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values
2 5
8
S1 S2
S3
2, Δ = -3
# toks = 1# toks = 2
Change in slack exists
17
Insight for Cyclic Graphs
• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values
2 5
8
S1 S2
S3 6, Δ = -2
# toks = 0 # toks = 0
No change in slack
18
Algorithm Summary
• Initially, find # tokens between every pair of nodes in the graph– Problem formulated as a flow lattice
– Invoked once, complexity is O(|M0| |V|)
• After inducing every change in graph– Compute slack change at each node– Propagate new change to neighboring outputs– Overall complexity is O(|V|)
19
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
20
Experimental Setup
• Slack Update loop incorporated into CASH compiler [ASPLOS 04, DAC 07]
– Synthesizes asynchronous circuits from ANSI-C programs
• Applied three different optimizations– SM: Slack Matching [ICCAD 06]
– OC: Operation Chaining [ICCAD 07]– ASU: Heterogeneous Pipeline Synthesis [Async 08]
• Benchmarks: Fifteen frequently executed kernels from Mediabench suite, [Lee 97]
• All results are post-synthesis mapped to ST Micro 180nm library
21
Absolute Accuracy
• After SM, compare computed values of slack against actual values of slack for adpcm_d
0
10
20
30
40
50
60
70
80
90
100
-100 -50 0 50 100
% Slack Difference = [(Actual - Estimate)*100 / Actual]
% o
f T
ota
l Eve
nts
• Close to 100 changes applied (algo invoked for each change)
• 1.2x performance speedup
• Update inaccuracy due to unknown latency values during circuit transformation
22
Design Loop Experiments
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IRSystemDesign Loop
Timing Analysis
Optimize Design
Update timing in IR
OptimizationLoop
Run the same N optimizations in both loops and compare:1. Overall Performance change2. Overall design time
23
Design Loop Experiments
• Three optimization sequences1. SM-ASU: Slack matching followed by
Heterogenous latch insertion
2. ASU-SM
3. SM-ASU-OC
• Compare final circuit timing and loop traversal (design) times between the two methodologies
24
Design Loop Experiments
• About ~500 circuit changes, on average• Design Loop time: 0.5 – 4 hours• Optimization loop: 10 - 100 seconds
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
d-u
p o
ve
r B
as
e
Design Loop
Optimization Loop
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
du
p o
ve
r B
as
e
Design Loop
Optimization Loop
SM-ASU ASU-SM
25
Design Loop Experiments: SM-ASU-OC
• About ~1000-3000 circuit changes, on average• Design Loop time: 1 – 10 hours• Optimization loop: 20 - 200 seconds
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
du
p o
ve
r B
as
e
Design Loop
Optimization Loop
d
300x Speedup
26
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
27
Conclusions
• Slack update algorithm to speed up design loop in TLM-based workflows
• Use slack to track system-level timing changes
• Orders of magnitude reduction in design time at negligible loss in accuracy
• New optimizations loop enables scalability in iterative system design flows