elec692 vlsi signal processing architecture lecture 4 unfolding

ELEC692 VLSI Signal Processing Architecture

Lecture 4: Unfolding

Unfolding Transformation
• Create a new program describing more than one iteration of the original program.
• Unfolding factor J: the unfolded program describes J consecutive iterations of the original program.
• Also known as loop unrolling; used with software pipelining in compiler design.
• Unfolding can reveal hidden concurrency so that the program can be scheduled to a smaller iteration period, thus increasing throughput, e.g. with pipelining.

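[Figure: a loop body with steps 1, 2, 3. Looping repeats the body; unfolding lays out consecutive iterations as 1 2 3 1 2 3 1 2 3 ...; pipelining overlaps successive iterations of 1 2 3 to increase throughput.]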

Unfolding
• Unfolding can also work with parallel processing to increase throughput.
• E.g. a 3-tap FIR filter
– $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$
• Unfolding by a factor of three, we get

$$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$$
$$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$$
$$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$$

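A Parallel FIR System

[Figure: 3-parallel FIR filter. Inputs x(3k), x(3k+1), x(3k+2) and two delay elements D feed three multiply-add rows with coefficients a, b, c that produce y(3k), y(3k+1), y(3k+2).]

• Clock period: $T_{clk} \ge T_M + 2T_A$, where $T_M$ and $T_A$ are the multiplication and addition times.
• Iteration period: $T_{iter} = T_{sample} = \frac{1}{3} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$
• Parallel system: $T_{clk} = L\,T_{sample}$ (here L = 3), so $T_{clk} \ne T_{sample}$
• Pipelined system: $T_{clk} = T_{sample}$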

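Unfolding Example
• Example
– $y(n) = a\,y(n-9) + x(n)$; after 2-unfolding we have
– $y(2k) = a\,y(2k-9) + x(2k)$, $y(2k+1) = a\,y(2k-8) + x(2k+1)$
• The above can be rewritten as
– $y(2k) = a\,y(2(k-5)+1) + x(2k)$, $y(2k+1) = a\,y(2(k-4)) + x(2k+1)$

[Figure: the original DFG computes y(n) from x(n) with a multiplier a and a 9D feedback delay. The 2-unfolded DFG takes x(2k) and x(2k+1), produces y(2k) and y(2k+1), and has two multipliers a with feedback delays 5D and 4D.]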
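The equivalence of the original and 2-unfolded recursions can be checked numerically. The following is a minimal sketch, assuming zero initial conditions (y(n) = 0 for n < 0); the function names and test signal are illustrative and not part of the lecture.

```python
def serial(a, x):
    """Original program: y(n) = a*y(n-9) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        prev = y[n - 9] if n >= 9 else 0.0
        y.append(a * prev + x[n])
    return y


def unfolded_by_2(a, x):
    """2-unfolded program: iteration k produces both y(2k) and y(2k+1)."""
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    y = [0.0] * len(x)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2(k-5)+1) + x(2k): fed by the odd output through 5 delays
        i_even = 2 * (k - 5) + 1
        y[2 * k] = a * (y[i_even] if i_even >= 0 else 0.0) + x[2 * k]
        # y(2k+1) = a*y(2(k-4)) + x(2k+1): fed by the even output through 4 delays
        i_odd = 2 * (k - 4)
        y[2 * k + 1] = a * (y[i_odd] if i_odd >= 0 else 0.0) + x[2 * k + 1]
    return y


x = [float(n + 1) for n in range(24)]
assert serial(0.5, x) == unfolded_by_2(0.5, x)
```

Both versions perform the same arithmetic on the same operands, so the outputs match exactly; the unfolded version simply produces two samples per loop iteration.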

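Timing after unfolding
• After J-unfolding, the clock period is $T_{clock} = J \cdot T_{sample}$, i.e. J times the data sampling period.
• Ex. timing diagram of the previous example:
– Before unfolding ($T_{clock} = T_{sample}$): input x(n); outputs y(0), y(1), y(2), ..., y(11), one per sample period.
– After 2-unfolding ($T_{clock} = 2T_{sample}$): input x(2k); outputs y(0), y(2), y(4), y(6), y(8), y(10).
– After 2-unfolding ($T_{clock} = 2T_{sample}$): input x(2k+1); outputs y(1), y(3), y(5), y(7), y(9), y(11).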

Unfolding Algorithm
• How to construct the J-unfolded DFG
– The resulting DFG contains J times as many nodes and edges as the original DFG.
• Algorithm
– For each node U in the original DFG, draw the J nodes $U_0, U_1, \ldots, U_{J-1}$.
– For each edge U→V with w delays in the original DFG, draw the J edges $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays for i = 0, 1, …, J−1.

* Here $\lfloor x \rfloor$ is the floor of x and a%b is the remainder after dividing a by b.
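The two steps of the algorithm translate almost directly into code. The sketch below assumes a DFG represented as a list of `(U, V, w)` edge tuples; that representation and the function name `unfold` are illustrative choices, not part of the lecture.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges, w = number of delays.

    Returns the unfolded edges as ((U, i), (V, j), delays) tuples.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            # Edge U_i -> V_{(i+w)%J} carries floor((i+w)/J) delays.
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# A single edge with 9 delays, unfolded by J = 2.
for edge in unfold([("U", "V", 9)], 2):
    print(edge)
# (('U', 0), ('V', 1), 4)
# (('U', 1), ('V', 0), 5)
```

The 9 delays are split into 4 and 5, and the two unfolded edges cross between the even and odd copies because 9 is odd.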

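Example

[Figure: an original DFG with nodes A, B, C, D and one edge carrying 9D, and its 2-unfolded DFG with nodes A0, A1, B0, B1, C0, C1, D0, D1, in which the unfolded copies of that edge carry 5D and 4D.]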

More example

• Original DFG: an edge U→V with 37D. After 4-unfolding:
– $U_0 \to V_1$ with 9D
– $U_1 \to V_2$ with 9D
– $U_2 \to V_3$ with 9D
– $U_3 \to V_0$ with 10D

• Unfolding an edge with w delays in the original DFG produces J−w edges with no delays and w edges with 1 delay in the J-unfolded DFG when w < J.
• Unfolding preserves the precedence constraints of a DSP program.

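Another Example

[Figure: a DFG with nodes U, V, T and edges carrying D, 6D and 5D, and its 3-unfolded DFG with nodes U0, U1, U2, V0, V1, V2, T0, T1, T2. Of the nine unfolded edges, five carry 2D, two carry D and two carry no delay.]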

Unfolding preserves precedence constraints

• The e edges in the original DFG explicitly show the precedence constraints for one iteration of the original program.
• The J·e edges in the J-unfolded DFG explicitly show the precedence constraints for J iterations of the program.
• The edge $U_i \to V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delays in the unfolded DFG corresponds to the edge U→V with w delays in the original DFG:

$$U_i \xrightarrow{\;\lfloor (i+w)/J \rfloor\;} V_{(i+w)\%J} \quad\Longleftrightarrow\quad U \xrightarrow{\;w\;} V$$

Precedence preservation (I)

• For unfolding, the k-th iteration of the node $U_i$ in the J-unfolded DFG executes the (Jk+i)-th iteration of the node U in the original DFG.
• Due to the $\lfloor (i+w)/J \rfloor$ delays on the edge $U_i \to V_{(i+w)\%J}$, the output of the k-th iteration of the node $U_i$ is consumed by the $(k + \lfloor (i+w)/J \rfloor)$-th iteration of the node $V_{(i+w)\%J}$ in the unfolded DFG.
• The k-th iteration of $U_i$ corresponds to the (Jk+i)-th iteration of the node U, and the $(k + \lfloor (i+w)/J \rfloor)$-th iteration of $V_{(i+w)\%J}$ corresponds to the $\left( J\left(k + \lfloor (i+w)/J \rfloor\right) + (i+w)\%J \right)$-th iteration of the node V.

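Precedence preservation (II)

• Therefore, in the original DFG, the output of the (Jk+i)-th iteration of the node U is consumed by the $\left( J\left(k + \lfloor (i+w)/J \rfloor\right) + (i+w)\%J \right)$-th iteration of the node V. So the number of delays on the edge U→V is

$$J\left(k + \left\lfloor \frac{i+w}{J} \right\rfloor\right) + (i+w)\%J - (Jk + i) = J\left\lfloor \frac{i+w}{J} \right\rfloor + (i+w)\%J - i$$

• The term $J\lfloor (i+w)/J \rfloor + (i+w)\%J$ is simply i+w. Hence the number of delays on the edge U→V is (i+w) − i = w.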
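The delay count derived above can be spot-checked by brute force. This is a small sketch that assumes nothing beyond the index mapping stated on the slides; the helper name `original_delay` is illustrative.

```python
def original_delay(i, w, J, k=0):
    """Delay seen in the original DFG for the unfolded edge U_i -> V_{(i+w)%J}."""
    producer = J * k + i                               # iteration of U
    consumer = J * (k + (i + w) // J) + (i + w) % J    # iteration of V
    return consumer - producer


# The difference is always w, independent of i, J and k.
assert all(
    original_delay(i, w, J, k) == w
    for J in range(1, 8)
    for w in range(40)
    for i in range(J)
    for k in range(3)
)
```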

Properties of Unfolding
• Unfolding preserves the number of delays in a DFG.
– The sum of the delays on the J unfolded edges $U_i \to V_{(i+w)\%J}$, i = 0, 1, …, J−1, is the same as the number of delays on the edge U→V in the original DFG, i.e.

$$\left\lfloor \frac{w}{J} \right\rfloor + \left\lfloor \frac{w+1}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor = w$$

• Unfolding a loop: let l be a loop with $w_l$ delays in the original DFG and let A be a node in l (we denote this as $A \overset{l}{\rightsquigarrow} A$ with $w_l$ delays). If the loop l is traversed p times (p ≥ 1), this results in the path $A \overset{l}{\rightsquigarrow} A \overset{l}{\rightsquigarrow} \cdots \overset{l}{\rightsquigarrow} A$ with $p\,w_l$ delays.
• The corresponding unfolded path starting at the node $A_i$, 0 ≤ i ≤ J−1, in the J-unfolded DFG is $A_i \rightsquigarrow A_{(i+p w_l)\%J}$, and this path forms a loop in the unfolded DFG if $i = (i + p\,w_l)\%J$.
• Find the minimum p such that $A_i \rightsquigarrow A_{(i+p w_l)\%J}$ is a loop in the unfolded DFG.

Example

[Figure: a loop through nodes A, B, C whose edges carry 2D, 3D and D; its 3-unfolded DFG (nodes A0..A2, B0..B2, C0..C2) and its 4-unfolded DFG (nodes A0..A3, B0..B3, C0..C3).]

• A single loop with $w_l = 6$.
• 3-unfolding: $i = (i + p\,w_l)\%J \Rightarrow i = (i + 6p)\%3$, which holds for i = 0, 1, 2 when p = 1, so we have 3 loops in the unfolded DFG.
• 4-unfolding: $i = (i + p\,w_l)\%J \Rightarrow i = (i + 6p)\%4$, which first holds for i = 0, 1, 2, 3 when p = 2. Traversing p = 2 instances of the original loop therefore gives one loop in the 4-unfolded DFG, and we have 2 loops in the unfolded DFG.

Properties of Unfolding
• When a loop is unfolded, the number of resulting unfolded loops can be determined by finding the minimum p such that the path $A_i \rightsquigarrow A_{(i+p w_l)\%J}$ forms a loop, i.e. $i = (i + p\,w_l)\%J$.
• Two lemmas
– Lemma 1: $i = (i + p\,w_l)\%J \iff p\,w_l = qJ$ for an integer q.
– Lemma 2: The smallest positive integer p that satisfies $p\,w_l = qJ$ is $J/\gcd(w_l, J)$, where gcd is the greatest common divisor.
• Based on Lemma 1, $A_i \rightsquigarrow A_{(i+p w_l)\%J}$ is a loop iff $p\,w_l = qJ$ holds.
• The minimum p is the number of times the loop l is traversed to form an unfolded loop, and from Lemma 2 it is equal to $J/\gcd(w_l, J)$.
• Thus an unfolded loop contains $J/\gcd(w_l, J)$ copies of each node in l. The unfolded DFG contains a total of J copies of each node, so there must be $J/(J/\gcd(w_l, J)) = \gcd(w_l, J)$ unfolded loops.
• Together the unfolded loops contain $w_l$ delays, and each of the $\gcd(w_l, J)$ unfolded loops contains $w_l/\gcd(w_l, J)$ delays.
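Lemma 2 and the loop count can be checked against a direct search for the minimum p. The sketch below uses the $w_l = 6$ loop from the earlier example; the function names are illustrative.

```python
from math import gcd


def min_traversals(w_l, J):
    """Smallest p >= 1 with (i + p*w_l) % J == i, found by direct search."""
    p = 1
    while (p * w_l) % J != 0:
        p += 1
    return p


def unfolded_loops(w_l, J):
    """Return (number of unfolded loops, delays in each unfolded loop)."""
    g = gcd(w_l, J)
    return g, w_l // g


for J in (3, 4):
    p = min_traversals(6, J)
    assert p == J // gcd(6, J)      # agrees with Lemma 2
    print(J, p, unfolded_loops(6, J))
# 3 1 (3, 2)
# 4 2 (2, 3)
```

For J = 3 the loop splits into three loops of 2 delays each; for J = 4 it becomes two loops of 3 delays each, so the total of 6 delays is preserved in both cases.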

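Properties of Unfolding

• Unfolding a DFG with iteration bound $T_\infty$ results in a J-unfolded DFG with iteration bound $J\,T_\infty$.
– The original iteration bound is given by $T_\infty = \max_l \{ t_l / w_l \}$.
– From the previous property, each unfolded loop has computation time $(J/\gcd(w_l, J))\,t_l$ and $w_l/\gcd(w_l, J)$ delays, so the iteration bound of the unfolded DFG is

$$T'_\infty = \max_l \left\{ \frac{(J/\gcd(w_l, J))\,t_l}{w_l/\gcd(w_l, J)} \right\} = J \max_l \left\{ \frac{t_l}{w_l} \right\} = J\,T_\infty$$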

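Example

[Figure: the original DFG (nodes A, B, C, D, one edge carrying 9D) and its 2-unfolded DFG (nodes A0, A1, B0, B1, C0, C1, D0, D1, edges carrying 5D and 4D).]

• Original DFG: iteration bound = 9/9 = 1
• 2-unfolded DFG: iteration bound = 18/9 = 2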
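The scaling of the iteration bound can be reproduced from the loop properties alone. This sketch describes each loop by a `(t_l, w_l)` pair and assumes the example's single loop has total computation time 9 and 9 delays, which is the reading suggested by "9/9"; the function names are illustrative.

```python
from fractions import Fraction
from math import gcd


def iteration_bound(loops):
    """loops: list of (t_l, w_l) pairs; returns max over loops of t_l / w_l."""
    return max(Fraction(t, w) for t, w in loops)


def unfold_loops(loops, J):
    """(t_l, w_l) pairs for the loops of the J-unfolded DFG."""
    result = []
    for t, w in loops:
        g = gcd(w, J)
        # g unfolded loops, each with J/g copies of every node and w/g delays
        result.extend([(t * (J // g), w // g)] * g)
    return result


loops = [(9, 9)]                                  # assumed: t_l = 9, w_l = 9
print(iteration_bound(loops))                     # 1
print(iteration_bound(unfold_loops(loops, 2)))    # 2
```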

Properties regarding critical path, unfolding and retiming

• Consider a path with w delays in the original DFG. J-unfolding of this path leads to (J−w) paths with no delays and w paths with 1 delay each, when w < J.
– Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore a path in the original DFG with J or more delays cannot create a critical path in the J-unfolded DFG.
– From these, we can retime the original DFG such that the J-unfolded version of the retimed DFG will meet a specified critical path delay c. The critical path of the J-unfolded DFG can be c if there exists a path in the original DFG with computation time c and fewer than J delays.
– Assume that the critical path of the J-unfolded DFG is c. If $D(U,V) \ge c$, then $W_r(U,V) = W(U,V) + r(V) - r(U) \ge J$.

Properties regarding critical path, unfolding and retiming

• Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG, $G_J$, can be achieved by retiming the original DFG, G, directly and then unfolding it by unfolding factor J.
– Let r' be a legal retiming for $G_J$ which leads to critical path c. Let r be a retiming for G defined as

$$r(U) = \sum_{i=0}^{J-1} r'(U_i)$$

– Consider an edge U→V with w delays in G. Since r' is a legal retiming of the unfolded DFG, we have

$$r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor$$

Summing these inequalities for i = 0 to J−1, we have $r(U) - r(V) \le w$, which shows that r is a legal retiming of G.

– For the J-unfolded DFG, since r' satisfies the critical path (c) constraints, i.e.

$$r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor - 1 \quad \text{if } D(U_i, V_{(i+w)\%J}) \ge c$$

Summing these inequalities for i = 0 to J−1, we have $r(U) - r(V) \le W(U,V) - J$, which is the desired inequality.

Applications of Unfolding

• Sample Period Reduction
• Parallel Processing
– Word-Level Parallel Processing
– Bit-Level Parallel Processing

Sample Period Reduction

• Unfolding can allow a DSP program to be implemented with an iteration period equal to the iteration bound $T_\infty$.
• 2 cases
– Case 1: A node in the DFG has computation time greater than $T_\infty$.
– Case 2: The iteration bound $T_\infty$ is not an integer.

Case 1
• Even retiming cannot be used to reduce the computation time of the critical path of the DFG to $T_\infty$.
• Example

[Figure: a DFG with nodes P, Q, R, S, T, U, two loops (loop 1 and loop 2), and edges carrying 2D and D.]

• Loop bound for loop 1 = (4+1+1)/2 = 3
• Loop bound for loop 2 = 2
• Iteration bound = max(3, 2) = 3
• Maximum computation time of a node = 4 > iteration bound

Case 1 Example
• The previous DFG can be unfolded by a factor of 2, which allows the sample period to reach the iteration bound.

[Figure: the 2-unfolded DFG with nodes P0, Q0, R0, S0, T0, U0, P1, Q1, R1, S1, T1, U1, its delay elements, and three loops (loop 1, loop 2, loop 3).]

• Loop bound for loop 1 = 6/1 = 6
• Loop bound for loop 2 = 6/1 = 6
• Loop bound for loop 3 = 12/3 = 4
• So for this unfolded DFG, $T_\infty = 6$. However, it performs 2 iterations of the original DSP program in 6 time units, and hence the sampling period is 6/2 = 3, which is the iteration bound of the original DFG.

** In general, $\lceil t_U / T_\infty \rceil$-unfolding is used, where $t_U$ is the maximum computation time of a node. In this example it is $\lceil 4/3 \rceil$, and so J = 2.
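The choice of unfolding factor for Case 1 is a one-line computation. The sketch below assumes the rule is J = ⌈t_U / T∞⌉, which matches the J = 2 used in this example; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil


def case1_unfolding_factor(t_u, t_inf):
    """Smallest J with J * t_inf >= t_u, i.e. ceil(t_u / t_inf)."""
    return ceil(Fraction(t_u) / Fraction(t_inf))


# Largest node computation time 4, iteration bound 3.
print(case1_unfolding_factor(4, 3))   # 2
```

With J = 2 the unfolded iteration bound is 2 × 3 = 6, which is no longer smaller than the largest node computation time of 4.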

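Case 2: $T_\infty$ is not an integer

• The iteration period cannot achieve the iteration bound.

[Figure: a loop through nodes S, T, U, V with three delays D; $T_\infty = 4/3$.]

• Even retiming cannot achieve a critical path of less than 2.

3-Unfolded

[Figure: the 3-unfolded DFG consists of three loops, each containing one copy of each of S, T, U, V and one delay D.]

• Loop bound = 4 for each loop, so $T_\infty = 4$ for the unfolded DFG.
• The unfolded DFG processes 3 samples per iteration, and hence the minimum sampling period of the unfolded DFG is 4/3 = $T_\infty$ of the original DFG.

** In general, if the critical loop bound is $t_l / w_l$, where $t_l$ and $w_l$ are mutually prime, then $w_l$-unfolding should be used.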
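For Case 2 the unfolding factor is the denominator of the critical loop bound. The sketch below reduces t_l / w_l to lowest terms with `fractions.Fraction`; handling a non-coprime pair this way is an assumption that goes slightly beyond the slide, which only states the coprime case.

```python
from fractions import Fraction


def case2_unfolding_factor(t_l, w_l):
    """Denominator of the critical loop bound t_l / w_l in lowest terms."""
    return Fraction(t_l, w_l).denominator


J = case2_unfolding_factor(4, 3)
unfolded_bound = J * Fraction(4, 3)   # iteration bound of the J-unfolded DFG
print(J, unfolded_bound, unfolded_bound / J)
# 3 4 4/3
```

The unfolded bound of 4 is an integer, and dividing by the 3 samples processed per iteration recovers the original bound of 4/3 per sample.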

Parallel Processing
• The unfolding transformation is used to derive parallel processing architectures from serial processing architectures.
• Word-level parallel processing
– Processing multiple (J) samples at the same time by J-unfolding

[Figure: a serial filter with input x(n), output y(n), coefficient multipliers a, b, c and delay elements 2D and 4D; its DFG with nodes X, A, B, C, D, E; the 3-unfolded DFG with nodes X0..X2, A0..A2, B0..B2, C0..C2, D0..D2, E0..E2; and the resulting 3-parallel architecture built from three copies of the multipliers a, b, c.]

Bit-level Parallel Processing

• Let W be the wordlength of the data.
• Bit-serial processing: one bit is processed per clock cycle and a complete word is processed in W clock cycles.
• Bit-parallel processing: one word of W bits is processed every clock cycle.
• Digit-serial processing: N bits are processed per clock cycle and a word is processed in W/N clock cycles. N is the digit size.

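Bit-level Parallel Processing

[Figure: data formats for two words a and b. Bit-parallel: all bits a0, a1, a2, ..., an (and b0, b1, b2, ..., bn) are presented in the same clock cycle. Bit-serial: the bits are presented one per clock cycle. Digit-serial (digit size = 2): two bits per clock cycle, e.g. (a0, a1), then (a2, a3), and likewise for b.]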

Bit serial addition

To obtain a digit-serial or bit-parallel adder architecture, the bit-serial adder must be unfolded. The issue is how to unfold the edge with a switch.

For J-unfolding of the switch, we assume

1. The wordlength W is a multiple of the unfolding factor J, i.e. W=W’J.

2. All edges into and out of the switch have no delays.

Unfolding of Bit serial addition

The edge is unfolded as follows:

1. Write the switching instance as

$$Wl + u = J\left( W'l + \left\lfloor \frac{u}{J} \right\rfloor \right) + (u \% J)$$

2. Draw an edge with no delays in the unfolded graph from the node $U_{u\%J}$ to the node $V_{u\%J}$, which is switched at time instance $W'l + \lfloor u/J \rfloor$.

E.g. let J = 3, W = 12 and u = 7. We have W = W'J ⇒ 12 = (4)(3), the edge has no delays, and

$$12l + 7 = 3\left( 4l + \left\lfloor \frac{7}{3} \right\rfloor \right) + (7 \% 3) = 3(4l + 2) + 1$$

In the unfolded DFG, the unfolded edge is from the node $U_1$ to the node $V_1$ and is switched at 4l+2.

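Example of unfolding of switch

[Figure: an edge U→V switched at time instances 12l + 1, 7, 9, 11. After 3-unfolding: U0→V0 is switched at 4l + 3; U1→V1 is switched at 4l + 0, 2; U2→V2 is switched at 4l + 3.]

To unfold the DFG by 3 (J = 3), the switching instances are written as follows:

$$12l + 1 = 3(4l + 0) + 1$$
$$12l + 7 = 3(4l + 2) + 1$$
$$12l + 9 = 3(4l + 3) + 0$$
$$12l + 11 = 3(4l + 3) + 2$$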
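The decomposition of each switching instance is a quotient and remainder with respect to J. The sketch below assumes, as on the slides, that W is a multiple of J and that the switched edge has no delays; the function name and the dictionary return format are illustrative.

```python
def unfold_switch(W, instances, J):
    """Unfold switching instances W*l + u (u in instances) by a factor J.

    Returns {i: sorted offsets v}, meaning the unfolded edge U_i -> V_i is
    switched at time instances (W // J) * l + v.
    """
    if W % J != 0:
        raise ValueError("W must be a multiple of J")
    result = {i: [] for i in range(J)}
    for u in instances:
        # W*l + u = J*((W // J)*l + u // J) + (u % J)
        result[u % J].append(u // J)
    return {i: sorted(v) for i, v in result.items()}


print(unfold_switch(12, [1, 7, 9, 11], 3))
# {0: [3], 1: [0, 2], 2: [3]}
```

The result reads as: U0→V0 at 4l+3, U1→V1 at 4l+0 and 4l+2, U2→V2 at 4l+3.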

Example of unfolding of switch
• If an edge contains a switch and a positive number of delays, a dummy node can be used to reduce the problem to the case where the edge contains zero delays.

Example of the bit serial adder

How about if wordlength W is not a multiple of the unfolding factor J?

• Determine L = LCM(W, J) and replace the switching instance Wl+u with the L/W switching instances Ll + u + wW for w = 0 to L/W − 1.
• The switch periodicity changes from W to L, and each switching instance is expanded into L/W instances.
• E.g. let W = 4 and J = 3, so LCM(4,3) = 12. Then 4l is equivalent to 12l, 12l+4, 12l+8; 4l+1 is equivalent to 12l+1, 12l+5, 12l+9; 4l+2 is equivalent to 12l+2, 12l+6, 12l+10; 4l+3 is equivalent to 12l+3, 12l+7, 12l+11. The new switching period L = 12 is a multiple of J = 3, so the instances can now be unfolded using the previous approach.
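The expansion to the common period can be sketched as follows. The function name is illustrative, and the LCM is computed from `math.gcd` so the sketch does not depend on a recent Python version.

```python
from math import gcd


def expand_switch(W, instances, J):
    """Rewrite switching instances W*l + u with period L = lcm(W, J).

    Returns (L, sorted offsets u + w*W for w = 0 .. L//W - 1).
    """
    L = W * J // gcd(W, J)
    expanded = sorted(u + w * W for u in instances for w in range(L // W))
    return L, expanded


print(expand_switch(4, [1], 3))   # (12, [1, 5, 9])
```

The expanded instances have period 12, which is a multiple of J = 3, so they can be passed to the unfolding procedure for the case where W is a multiple of J.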

Example
