ELEC692 VLSI Signal Processing Architecture
Lecture 4: Unfolding
Unfolding Transformation
• Create a new program describing more than one iteration of the original program
• Unfolding factor J: the unfolded program describes J consecutive iterations of the original program
• Also called loop unrolling; used with software pipelining in compiler design
• Unfolding can reveal hidden concurrencies so that the program can be scheduled to a smaller iteration period, thus increasing throughput, e.g. with pipelining
[Figure: a loop of three operations (1 2 3). Unfolding lays out successive iterations as 1 2 3 1 2 3 1 2 3 ...; pipelining overlaps the iterations to increase throughput.]
Unfolding
• Unfolding can also work with parallel processing to increase throughput
• E.g. a 3-tap FIR filter
– $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
• Unrolling three times, we get
$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$
$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$
$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$
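The equivalence between the serial filter and its 3-unfolded form can be checked numerically. The following is a minimal sketch, assuming the filter $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$ with zero initial conditions; the function names `fir_serial` and `fir_3_parallel` are hypothetical and chosen only for illustration.

```python
def fir_serial(x, a, b, c):
    """Serial 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def xs(n):
        return x[n] if n >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """3-unfolded form: iteration k consumes x(3k..3k+2) and produces y(3k..3k+2)."""
    # Assumption of this sketch: the input length is a multiple of 3.
    assert len(x) % 3 == 0
    def xs(n):
        return x[n] if n >= 0 else 0
    y = []
    for k in range(len(x) // 3):
        y0 = a * xs(3 * k) + b * xs(3 * k - 1) + c * xs(3 * k - 2)
        y1 = a * xs(3 * k + 1) + b * xs(3 * k) + c * xs(3 * k - 1)
        y2 = a * xs(3 * k + 2) + b * xs(3 * k + 1) + c * xs(3 * k)
        y.extend([y0, y1, y2])
    return y
```

Each pass of the loop in `fir_3_parallel` corresponds to one clock cycle of the parallel system, in which three samples are processed at once.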
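A Parallel FIR System
($T_M$ and $T_A$ are the multiplier and adder computation times; L = 3 is the level of parallelism.)
$T_{clk} \ge T_M + 2T_A$
$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$
[Figure: 3-parallel FIR filter. Inputs x(3k), x(3k+1), x(3k+2) and two delays D feed three multiply-add rows with coefficient orderings (a b c), (c a b), (b c a), producing the outputs y(3k+2), y(3k+1), y(3k).]
Parallel system: $T_{clk} \ne T_{sample}$
Pipelined system: $T_{clk} = T_{sample}$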
Unfolding Example
• Example
– y(n) = a y(n-9) + x(n); after 2-unfolding we have
– y(2k) = a y(2k-9) + x(2k), y(2k+1) = a y(2k-8) + x(2k+1)
• The above can be rewritten as
– y(2k) = a y(2(k-5)+1) + x(2k), y(2k+1) = a y(2(k-4)) + x(2k+1)
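[Figure: original DFG with input x(n), multiplier a, and a 9D feedback delay producing y(n); 2-unfolded DFG with inputs x(2k) and x(2k+1), outputs y(2k) and y(2k+1), two multipliers a, and feedback delays 5D and 4D.]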
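The 2-unfolded recursion can be checked against the original one. This is a minimal sketch assuming zero initial conditions; the function names are hypothetical. In the unfolded version, y(2k) reads the odd output from 5 iterations earlier (the 5D edge) and y(2k+1) reads the even output from 4 iterations earlier (the 4D edge).

```python
def iir_serial(x, a, delay=9):
    """Original program: y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        past = y[n - delay] if n >= delay else 0
        y.append(a * past + x[n])
    return y


def iir_2_unfolded(x, a):
    """2-unfolded program: one loop pass computes y(2k) and y(2k+1)."""
    # Assumption of this sketch: the input length is even.
    assert len(x) % 2 == 0
    y = [0] * len(x)

    def ys(n):
        return y[n] if n >= 0 else 0

    for k in range(len(x) // 2):
        y[2 * k] = a * ys(2 * (k - 5) + 1) + x[2 * k]      # 5D edge
        y[2 * k + 1] = a * ys(2 * (k - 4)) + x[2 * k + 1]  # 4D edge
    return y
```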
Timing after unfolding
After J-unfolding, the clock period T_clock = J * data sampling period.
Ex. Timing diagram of the previous example:
Before unfolding, T_clock = T_sample:
x(n): y(0) y(1) y(2) y(3) y(4) y(5) y(6) y(7) y(8) y(9) y(10) y(11)
After 2-unfolding, T_clock = 2*T_sample:
x(2k): y(0) y(2) y(4) y(6) y(8) y(10)
x(2k+1): y(1) y(3) y(5) y(7) y(9) y(11)
Unfolding Algorithm
• How to construct the J-unfolded DFG
– The resulting DFG contains J times as many nodes and edges as the original DFG.
• Algorithm
– For each node U in the original DFG, draw the J nodes U_0, U_1, …, U_{J-1}.
– For each edge U→V with w delays in the original DFG, draw the J edges $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays for i = 0,1,…,J-1.
* Here $\lfloor x \rfloor$ is the floor of x and a%b is the remainder after dividing a by b.
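The two steps of the algorithm can be sketched in a few lines. The edge representation (a list of `(U, V, w)` tuples) and the function name `unfold` are assumptions made for illustration; unfolded nodes are represented as `(name, index)` pairs.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges.

    Returns a list of ((U, i), (V, j), delays) edges, where
    j = (i + w) % J and delays = floor((i + w) / J).
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded
```

For example, `unfold([("U", "V", 37)], 4)` yields the four edges U0→V1, U1→V2, U2→V3 with 9 delays each and U3→V0 with 10 delays.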
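Example
[Figure: a DFG with nodes A, B, C, D and an edge carrying 9D; its 2-unfolded DFG with nodes A0, A1, B0, B1, C0, C1, D0, D1 and edges carrying 5D and 4D.]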
More example
[Figure: edge U→V with 37D and its 4-unfolded version.]
U0→V1: 9D
U1→V2: 9D
U2→V3: 9D
U3→V0: 10D
Unfolding of an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in J-unfolded DFG when w<J
Unfolding preserves precedence constraints of a DSP program
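Another Example
[Figure: a DFG with nodes U, V, T and edges carrying D, 6D and 5D; its 3-unfolded DFG with nodes U0..U2, V0..V2, T0..T2, in which five edges carry 2D and two edges carry D.]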
Unfolding preserves precedence constraints
• The e edges in the original DFG explicitly show the precedence constraints for one iteration of the original program
• The J*e edges in the J-unfolded DFG explicitly show the precedence constraints for J iterations of the program
• The edge $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays in the unfolded DFG corresponds to the edge U→V with w delays in the original DFG.
Precedence preservation (I)
• For unfolding, the k-th iteration of the node $U_i$ in the J-unfolded DFG executes the (Jk+i)-th iteration of the node U in the original DFG.
• Due to the $\lfloor (i+w)/J \rfloor$ delays on the edge $U_i \to V_{(i+w)\%J}$, the output of the k-th iteration of the node $U_i$ is consumed by the $(k + \lfloor (i+w)/J \rfloor)$-th iteration of the node $V_{(i+w)\%J}$ in the unfolded DFG.
• The k-th iteration of $U_i$ corresponds to the (Jk+i)-th iteration of the node U, and the $(k + \lfloor (i+w)/J \rfloor)$-th iteration of $V_{(i+w)\%J}$ corresponds to the $\left( J\left(k + \lfloor (i+w)/J \rfloor\right) + (i+w)\%J \right)$-th iteration of the node V.
Precedence preservation (II)
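• Therefore, in the original DFG, the output of the (Jk+i)-th iteration of the node U is consumed by the $\left( J\left(k + \lfloor (i+w)/J \rfloor\right) + (i+w)\%J \right)$-th iteration of the node V. So the number of delays on the edge U→V is
$\left[ J\left(k + \lfloor (i+w)/J \rfloor\right) + (i+w)\%J \right] - (Jk + i) = J \lfloor (i+w)/J \rfloor + (i+w)\%J - i$
The term $J \lfloor (i+w)/J \rfloor + (i+w)\%J$ is simply i+w. Hence the number of delays on the edge U→V is (i+w)-i = w.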
Properties of Unfolding
• Unfolding preserves the number of delays in a DFG
– The sum of the delays on the J unfolded edges $U_i \to V_{(i+w)\%J}$, i = 0,1,…,J-1, is the same as the number of delays on the edge U→V in the original DFG, i.e.
$\lfloor w/J \rfloor + \lfloor (w+1)/J \rfloor + \cdots + \lfloor (w+J-1)/J \rfloor = w$
• When a loop is unfolded: let l be a loop with $w_l$ delays in the original DFG, and let A be a node in l (we denote this as $A \xrightarrow{l} A$ with $w_l$ delays). If the loop l is traversed p times (p ≥ 1), this results in the path $A \xrightarrow{l} A \xrightarrow{l} \cdots \xrightarrow{l} A$ with $p\,w_l$ delays.
• The corresponding unfolded path starting at the node $A_i$, 0 ≤ i ≤ J-1, in the J-unfolded DFG is $A_i \to A_{(i+pw_l)\%J}$, and this path forms a loop in the unfolded DFG if $i = (i+pw_l)\%J$.
• Find the minimum p such that $A_i \to A_{(i+pw_l)\%J}$ is a loop in the unfolded DFG.
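The delay-preservation property is easy to confirm by brute force. A minimal sketch, with the hypothetical helper `unfolded_edge_delays` listing the delays on the J unfolded copies of an edge with w delays:

```python
def unfolded_edge_delays(w, J):
    """Delays floor((i + w) / J) on the J unfolded edges, for i = 0..J-1."""
    return [(i + w) // J for i in range(J)]


# The unfolded delays always sum back to w (checked here for small w and J).
for J in range(1, 9):
    for w in range(0, 40):
        assert sum(unfolded_edge_delays(w, J)) == w
```

For instance, an edge with 5 delays unfolded by J = 3 gives the delays 1, 2, 2, which sum to 5.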
Example
[Figure: a single loop through the nodes A, B, C whose edges carry 2D, 3D and D; its 3-unfolded DFG (nodes A0..A2, B0..B2, C0..C2) and its 4-unfolded DFG (nodes A0..A3, B0..B3, C0..C3).]
A single loop with w_l = 6.
3-unfolded: i = (i+p w_l)%J ⇒ i = (i+6p)%3, which holds for i = 0,1,2 when p = 1, so we have 3 loops in the unfolded DFG.
4-unfolded: i = (i+p w_l)%J ⇒ i = (i+6p)%4, which holds for i = 0,1,2,3 when p = 2, so traversing the loop p = 2 times results in a loop in the 4-unfolded DFG, and we have 2 loops in the unfolded DFG.
Properties of Unfolding
• When a loop is unfolded, the number of resulting unfolded loops can be determined by finding the minimum p such that the path $A_i \to A_{(i+pw_l)\%J}$ forms a loop, i.e. $i = (i+pw_l)\%J$.
• Two lemmas
– Lemma 1: $i = (i+pw_l)\%J \iff p\,w_l = qJ$ for an integer q.
– Lemma 2: The smallest positive integer p that satisfies $p\,w_l = qJ$ is $J/\gcd(w_l,J)$, where gcd is the greatest common divisor.
• Based on Lemma 1, $A_i \to A_{(i+pw_l)\%J}$ is a loop iff $p\,w_l = qJ$ holds.
• The minimum p is equal to the number of times the loop l is traversed to form an unfolded loop, and from Lemma 2 it is equal to $J/\gcd(w_l,J)$.
• Thus an unfolded loop contains $J/\gcd(w_l,J)$ copies of each node in l, and the unfolded DFG contains a total of J copies of each node, so there must be $J/(J/\gcd(w_l,J)) = \gcd(w_l,J)$ unfolded loops.
• Since unfolding preserves the number of delays, the unfolded loops contain $w_l$ delays in total, and each of the $\gcd(w_l,J)$ unfolded loops contains $w_l/\gcd(w_l,J)$ delays.
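The loop count can be confirmed by tracing where each copy $A_i$ leads after one traversal of the loop. This is a minimal sketch; the function name `unfolded_loops` is hypothetical, and each returned list holds the copy indices of A that share one unfolded loop.

```python
from math import gcd


def unfolded_loops(w_l, J):
    """Group A_0..A_{J-1} into unfolded loops by following i -> (i + w_l) % J."""
    seen = set()
    loops = []
    for start in range(J):
        if start in seen:
            continue
        loop, i = [], start
        while i not in seen:
            seen.add(i)
            loop.append(i)
            i = (i + w_l) % J
        loops.append(loop)
    return loops


# Number of loops is gcd(w_l, J); each holds J / gcd(w_l, J) copies of A.
for J in range(1, 9):
    for w_l in range(1, 20):
        loops = unfolded_loops(w_l, J)
        assert len(loops) == gcd(w_l, J)
        assert all(len(loop) == J // gcd(w_l, J) for loop in loops)
```

With w_l = 6, `unfolded_loops(6, 3)` gives three loops of one copy each, and `unfolded_loops(6, 4)` gives two loops of two copies each, matching the example with 6 delays.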
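Properties of Unfolding
• Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.
– The original iteration bound is given by $T = \max_l \{ t_l / w_l \}$
– From the previous property, each unfolded loop contains $J/\gcd(w_l,J)$ copies of each node and $w_l/\gcd(w_l,J)$ delays, so the iteration bound of the unfolded DFG is
$T' = \max_l \left\{ \frac{(J/\gcd(w_l,J))\, t_l}{w_l/\gcd(w_l,J)} \right\} = J \max_l \left\{ \frac{t_l}{w_l} \right\} = JT$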
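Example
[Figure: the DFG with nodes A, B, C, D and a 9D edge, and its 2-unfolded DFG with nodes A0, A1, B0, B1, C0, C1, D0, D1 and edges carrying 5D and 4D.]
Original DFG: iteration bound = 9/9 = 1
2-unfolded DFG: iteration bound = 18/9 = 2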
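The JT property can be checked from loop data alone. A minimal sketch, assuming each loop is described by a hypothetical `(t_l, w_l)` pair (loop computation time, loop delay count) and using exact fractions to avoid rounding:

```python
from fractions import Fraction
from math import gcd


def iteration_bound(loops):
    """Iteration bound max over loops of t_l / w_l, for loops given as (t_l, w_l)."""
    return max(Fraction(t, w) for t, w in loops)


def unfolded_loop_params(loops, J):
    """(t, w) pairs of the unfolded loops: gcd(w_l, J) loops per original loop,
    each with J/gcd copies of every node and w_l/gcd delays."""
    out = []
    for t, w in loops:
        g = gcd(w, J)
        out.extend([(t * J // g, w // g)] * g)
    return out
```

For the example with one loop of computation time 9 and 9 delays, `iteration_bound([(9, 9)])` is 1 and `iteration_bound(unfolded_loop_params([(9, 9)], 2))` is 2.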
Properties regarding critical path, unfolding and retiming
• Consider a path with w delays in the original DFG. J-unfolding of this path leads to (J-w) paths with no delays and w paths with 1 delay each, when w<J.
– Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore a path in the original DFG with J or more delays cannot create a critical path in the J-unfolded DFG.
– From these, we can retime the original DFG such that the J-unfolded version of the retimed DFG will meet a specified critical path delay c. A critical path of computation time c exists in the J-unfolded DFG if there exists a path in the original DFG with computation time c and fewer than J delays.
– Assume that the critical path of the J-unfolded DFG is c; if $D(U,V) \ge c$, then $W_r(U,V) = W(U,V) + r(V) - r(U) \ge J$.
Properties regarding critical path, unfolding and retiming
• Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG, $G_J$, can be achieved by retiming the original DFG, G, directly and then unfolding it by unfolding factor J.
– Let r' be a legal retiming for $G_J$ which leads to critical path c. Let r be a retiming for G defined as
$r(U) = \sum_{i=0}^{J-1} r'(U_i)$
Consider an edge U→V with w delays in G. Since r' is a legal retiming on the unfolded DFG, we have
$r'(U_i) - r'(V_{(i+w)\%J}) \le \lfloor (i+w)/J \rfloor$
Summing these inequalities for i = 0 to J-1, we have $r(U) - r(V) \le w$, which shows that r is a legal retiming.
For the J-unfolded DFG, r' satisfies the critical path (c) constraints, i.e.
$r'(U_i) - r'(V_{(i+w)\%J}) \le \lfloor (i+w)/J \rfloor - 1$ if $D(U_i, V_{(i+w)\%J}) \ge c$
Summing these inequalities for i = 0 to J-1, we have $r(U) - r(V) \le W(U,V) - J$, which is the desired inequality.
Applications of Unfolding
• Sample Period Reduction
• Parallel Processing
– Word-Level Parallel Processing
– Bit-Level Parallel Processing
Sample Period Reduction
• Unfolding can allow a DSP program to be implemented with an iteration period equal to the iteration bound T
• 2 cases
– Case 1: A node in the DFG has computation time greater than T.
– Case 2: The iteration bound T is not an integer.
Case 1
• Even retiming cannot be used to reduce the computation time of the critical path of the DFG to T.
• Example
[Figure: DFG with nodes P, Q, R, S, T, U and edges carrying 2D and D, forming two loops (loop 1 and loop 2).]
Loop bound for loop 1 = (4+1+1)/2=3
Loop bound for loop 2 = (4+1+1)/3=2
Iteration bound = max(3,2)=3
Maximum computation delay time of a component = 4 > iteration bound
Case 1 Example
• The previous DFG can be unfolded with a factor of 2, and from this the sample period can be reduced to the iteration bound.
[Figure: 2-unfolded DFG with nodes P0, Q0, R0, S0, T0, U0, P1, Q1, R1, S1, T1, U1 and edges carrying 2D, D and D, forming three loops (loop 1, loop 2, loop 3).]
Loop bound for loop 1 = (4+1+1)/1 = 6
Loop bound for loop 2 = (4+1+1)/1 = 6
Loop bound for loop 3 = 12/3 = 4
So for this unfolded DFG, T = 6. However, it performs 2 iterations of the original DSP program in 6 time units, and hence the sampling period is 6/2 = 3, which is the iteration bound of the original DFG.
** In general, $\lceil t_U / T \rceil$-unfolding is used, where $t_U$ is the max. computation time of a node. In this example, $t_U = 4$ and T = 3, and so $\lceil 4/3 \rceil = 2$.
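The rule for Case 1 is a one-line computation. A minimal sketch, with the hypothetical function name `unfolding_factor_for_long_node` and exact fractions so that a non-integer T is handled without rounding error:

```python
from fractions import Fraction
from math import ceil


def unfolding_factor_for_long_node(t_u, T):
    """Unfolding factor ceil(t_U / T) for a node whose computation time t_U exceeds T."""
    return ceil(Fraction(t_u) / Fraction(T))


J = unfolding_factor_for_long_node(4, 3)   # node time 4, iteration bound 3
unfolded_bound = J * 3                     # iteration bound of the unfolded DFG is J*T
sample_period = Fraction(unfolded_bound, J)
```

Here J is 2, the unfolded iteration bound is 6 (no smaller than the node time 4), and the sample period 6/2 equals the original iteration bound 3.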
Case 2: T is not an integer
• The iteration period cannot achieve the iteration bound
$T = 4/3$
[Figure: a loop through the nodes S, T, U, V containing three delays D.]
Even retiming cannot achieve a critical path of less than 2.
3-Unfolded
[Figure: 3-unfolded DFG made of the copies S0..S2, T0..T2, U0..U2, V0..V2, forming three loops with one delay D each.]
Loop bound = 4/1 = 4 for each loop
Number of samples per iteration = 3, and hence the minimum sampling period of the unfolded DFG is 4/3 = T of the original DFG.
** In general, if a critical loop bound is $t_l/w_l$, where $t_l$ and $w_l$ are mutually prime, then $w_l$-unfolding should be used.
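The rule for Case 2 can be sketched with exact fractions. The function name is hypothetical, and taking the reduced denominator when $t_l$ and $w_l$ share a common factor is an assumption of this sketch rather than something the slides state; for mutually prime $t_l$ and $w_l$ it returns $w_l$ as above.

```python
from fractions import Fraction


def unfolding_factor_for_fractional_bound(t_l, w_l):
    """Unfolding factor J that makes J * (t_l / w_l) an integer.

    For mutually prime t_l and w_l this is w_l. For other inputs the reduced
    denominator is used, which is an assumption of this sketch.
    """
    return Fraction(t_l, w_l).denominator


J = unfolding_factor_for_fractional_bound(4, 3)
unfolded_bound = J * Fraction(4, 3)   # iteration bound of the J-unfolded DFG
```

For the loop with computation time 4 and 3 delays, J is 3 and the unfolded iteration bound is the integer 4, so 3 samples are processed every 4 time units.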
Parallel Processing
• The unfolding transformation is used to derive parallel processing architectures from serial processing architectures.
• Word-level parallel processing
– Processing multiple (J) samples at the same time by J-unfolding
[Figure: a filter with input x(n), output y(n), coefficients a, b, c and delays 2D and 4D; its DFG with nodes X, A, B, C, D, E and edges carrying 2D and 4D; the 3-unfolded DFG with nodes X0..X2, A0..A2, B0..B2, C0..C2, D0..D2, E0..E2 and edges carrying 2D and D; and the resulting 3-parallel filter architecture.]
Bit-level Parallel Processing
• Let W be the wordlength of the data
• Bit-serial processing – one bit is processed per clock cycle and a complete word is processed in W clock cycles
• Bit-parallel processing – one word of W bits is processed every clock cycle
• Digit-serial processing – N bits are processed per clock cycle and a word is processed in W/N clock cycles. N is the digit size.
Bit-level Parallel Processing
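[Figure: the same words a and b in three formats. Bit-parallel: all bits a0..an and b0..bn are presented in one clock cycle. Bit-serial: one bit per clock cycle, a0 and b0 first. Digit-serial (digit size = 2): two bits per clock cycle, (a0, a1) then (a2, a3), and likewise for b.]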
Bit serial addition
To obtain a digit-serial or bit-parallel adder architecture, the bit-serial adder must be unfolded. The issue is how to unfold the edge with a switch.
For J-unfolding of the switch, we assume
1. The wordlength W is a multiple of the unfolding factor J, i.e. W=W’J.
2. All edges into and out of the switch have no delays.
Unfolding of Bit serial addition
The edge is unfolded as follows:
1. Write the switching instance as
$Wl + u = J\left(W'l + \lfloor u/J \rfloor\right) + (u\%J)$
2. Draw an edge with no delays in the unfolded graph from the node $U_{u\%J}$ to the node $V_{u\%J}$, which is switched at time instance $W'l + \lfloor u/J \rfloor$.
E.g. for the above switch, let J = 3, assuming W = 12 and u = 7.
We have W = W'J ⇒ 12 = (4)(3), the edge has no delays, and
$12l + 7 = 3\left(4l + \lfloor 7/3 \rfloor\right) + (7\%3) = 3(4l+2) + 1$
In the unfolded DFG, the unfolded edge is from the node U1 to the node V1 and is switched at 4l+2.
Example of unfolding of switch
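[Figure: edge U→V with a switch at instances 12l + 1, 7, 9, 11, and its 3-unfolded version.]
After 3-unfolding:
U0→V0: switched at 4l + 3
U1→V1: switched at 4l + 0, 2
U2→V2: switched at 4l + 3
To unfold the DFG by 3 (J = 3), the switching instances are written as follows:
$12l + 1 = 3(4l + 0) + 1$
$12l + 7 = 3(4l + 2) + 1$
$12l + 9 = 3(4l + 3) + 0$
$12l + 11 = 3(4l + 3) + 2$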
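The decomposition of a switching instance can be sketched as a small helper. The function name `unfold_switch` and its return format are assumptions for illustration; the helper covers only the case in which W is a multiple of J and the edge has no delays.

```python
def unfold_switch(W, J, u):
    """Unfold the switching instance W*l + u by J (W must be a multiple of J).

    Returns (idx, W_prime, offset): the unfolded edge goes from U_idx to V_idx
    and is switched at time instance W_prime*l + offset.
    """
    if W % J != 0:
        raise ValueError("this sketch assumes W is a multiple of J")
    return u % J, W // J, u // J


# Check the identity W*l + u = J*(W'*l + floor(u/J)) + u % J for a few l.
for u in (1, 7, 9, 11):
    idx, w_prime, offset = unfold_switch(12, 3, u)
    for l in range(4):
        assert 12 * l + u == 3 * (w_prime * l + offset) + idx
```

For the instances 12l + 1, 7, 9, 11 the helper gives edges U1→V1 at 4l+0 and 4l+2, U0→V0 at 4l+3, and U2→V2 at 4l+3.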
Example of unfolding of switch
• If an edge contains a switch and a positive number of delays, a dummy node can be used to reduce the problem to the case where the edge contains zero delays.
Example of the bit serial adder
How about if wordlength W is not a multiple of the unfolding factor J?
• Determine L = LCM(W,J) and replace the switching instance Wl+u with the L/W switching instances Ll+u+wW for w = 0 to L/W-1.
• The switch periodicity changes from W to L, and each switching instance is expanded into L/W instances.
• E.g. let W = 4 and J = 3, so LCM(4,3) = 12. Then 4l is equivalent to 12l, 12l+4, 12l+8; 4l+1 is equivalent to 12l+1, 12l+5, 12l+9; 4l+2 is equivalent to 12l+2, 12l+6, 12l+10; 4l+3 is equivalent to 12l+3, 12l+7, 12l+11. The new switching period 12 is a multiple of J = 3, so the new switching instances can be unfolded using the previous approach.
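The expansion step can be sketched as follows. The function name `expand_switch` is hypothetical; it returns the new period L and the offsets u + wW of the expanded switching instances.

```python
from math import gcd


def expand_switch(W, J, u):
    """Rewrite switching instance W*l + u with period L = LCM(W, J).

    Returns (L, offsets), where the new instances are L*l + offset
    for each offset in u + w*W, w = 0 .. L/W - 1.
    """
    L = W * J // gcd(W, J)
    return L, [u + w * W for w in range(L // W)]
```

For example, `expand_switch(4, 3, 1)` gives the period 12 with the offsets 1, 5, 9, matching the case 4l+1 in the example above.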
Example