Sequential design and pipelining - doc.ic.ac.ukwl/teachlocal/cuscomp/notes/cc07.pdf
wl 2019 7.1
Sequential design and pipelining
• sequential design
– example: systolic array
– stream representation
– delays and anti-delays
• pipelining
– pros and cons
– graphical method
– Horner’s Rule
Systolic array: data-oriented parallelism
• introduced by Kung and Leiserson in 1978
• pump data through processors
– like blood pumped through the body
• efficient, scalable, suitable for regular control
– particularly suited for FPGA technology
• challenges
– describe them systematically
– verify that they work
[Figure: a memory M connected to a chain of four processors P. M: Memory; P: Pipelined Processor]
[Figure: systolic array for matrix multiplication, C' = A * B (Kung and Leiserson, 1978)]
Simple systolic matrix multiplier
(source: J Break)
[Figure: 3 x 3 array of processors P11 to P33; row i of A (a_i1, a_i2, a_i3) enters from the left of processor row i, column j of B (b_1j, b_2j, b_3j) enters from the top of processor column j]
each processor Pij at each time step:
- computes c_{t+1} = a_t * b_t + c_t
- passes a rightwards
- passes b downwards
- c_t remains stationary
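The cell behaviour listed above can be sketched as a small step-by-step simulation. This is only an illustrative model, not code from the lecture: the function name `systolic_matmul`, the use of zeros for "no data yet", and the one-step skew between successive rows and columns of input are assumptions.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array in which c stays in each processor."""
    n = len(A)
    c = [[0] * n for _ in range(n)]      # stationary accumulator in each Pij
    a_reg = [[0] * n for _ in range(n)]  # a value currently held by each Pij
    b_reg = [[0] * n for _ in range(n)]  # b value currently held by each Pij
    history = []                         # accumulators after every time step
    for t in range(3 * n - 2):           # the last product reaches Pnn at step 3n-2
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:               # left edge: row i of A, skewed by i steps
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:                    # otherwise a arrives from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # top edge: column j of B, skewed by j steps
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:                    # otherwise b arrives from the neighbour above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                c[i][j] += a_reg[i][j] * b_reg[i][j]   # c_{t+1} = a_t * b_t + c_t
        history.append([c[i][j] for i in range(n) for j in range(n)])
    return c, history


M = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
C, steps = systolic_matmul(M, M)
for row in C:
    print(row)
```

Each entry of `steps` lists the accumulators in the order P11, P12, P13, P21, ..., P33, so it can be compared with a hand trace of the array.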
Systolic matrix multiplier: example
(source: J Break)
| 3 4 2 |     | 3 4 2 |     | 23 36 28 |
| 2 5 3 |  X  | 2 5 3 |  =  | 25 39 34 |
| 3 2 5 |     | 3 2 5 |     | 28 32 37 |
[Figure: processors P11 to P33 with the rows of the first matrix queued at the left and the columns of the second matrix queued at the top]
Accumulated value in each processor after each time step (source: J Break):
Time step  P11 P12 P13 P21 P22 P23 P31 P32 P33
1            9   0   0   0   0   0   0   0   0
2           17  12   0   6   0   0   0   0   0
3           23  32   6  16   8   0   9   0   0
4           23  36  18  25  33   4  13  12   0
5           23  36  28  25  39  19  28  22   6
6           23  36  28  25  39  34  28  32  12
7           23  36  28  25  39  34  28  32  37
[Figures for time steps 1 to 7: the operands moving through the array and the product formed in each active processor, e.g. 3*3 in P11 at time step 1]
Done – now look into each cell…
Parallel matrix multiplier
(Woods, McCanny and McWhirter, 2008)
Bit-level systolic correlator
y = a * x + c
(a is an M-bit vector)
Sequential designs
• components relate streams; the same laws hold if delay is not involved
– <..., x_{t-1}, x_t, x_{t+1}, ...> inc <..., x_{t-1}+1, x_t+1, x_{t+1}+1, ...>
or x inc y ⇔ ∀t . x_t + 1 = y_t
– <..., <x_{t-1}, y_{t-1}>, <x_t, y_t>, ...> add <..., x_{t-1}+y_{t-1}, x_t+y_t, ...> (single time sequence of pairs)
• delay: provides state, range is one cycle behind domain
– <..., x_{t-1}, x_t, x_{t+1}, ...> D <..., x_{t-2}, x_{t-1}, x_t, ...> models a register
or x D y ⇔ ∀t . x_{t-1} = y_t
• initialised delay DI c: at t = 0, y_t = c (DI c in Rebecca)
• D^{-1} is anti-delay: if its input is in the domain then it predicts its input, hence not implementable (can simulate AD in Rebecca)
– can implement D (D in Rebecca) when input in domain, or D^{-1} (D^~1 in Rebecca) when input in range
– cannot implement D (AD^~1) with input in range, or D^{-1} (AD) with input in domain
[Figure: inc block with input x_t and output y_t]
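The stream view above can be tried out on finite lists. The sketch below is an illustration only: the Python names `inc`, `add`, `delay` and `anti_delay` are assumptions and are not the Rebecca primitives, and the anti-delay is merely simulated on an already recorded stream, which is why it can "predict".

```python
def inc(xs):
    """Stateless component: y_t = x_t + 1 for every t."""
    return [x + 1 for x in xs]


def add(pairs):
    """Stateless component on a single stream of pairs: z_t = x_t + y_t."""
    return [x + y for (x, y) in pairs]


def delay(xs, init=0):
    """Initialised delay: y_0 = init and y_t = x_{t-1} afterwards (a register)."""
    return [init] + xs[:-1]


def anti_delay(xs):
    """Simulated anti-delay on a recorded stream: y_t = x_{t+1}.
    The last sample has no successor, so the result is one element shorter."""
    return xs[1:]


xs = [3, 1, 4, 1, 5]
print(inc(xs))                    # every element increased by one
print(add(list(zip(xs, xs))))     # pairwise sums
print(delay(xs))                  # one cycle behind, starting from init
print(anti_delay(delay(xs)))      # the delay is cancelled on the overlapping part
```

On a live signal the anti-delay would need the next input before it arrives, which is the sense in which it is not implementable.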
Pipelining
• insert latches between circuits to increase throughput
• also reduce power consumption, especially for FPGAs
• but may also increase
– area
– clock power consumption
– latency
[Figure: pipelined circuit with data input, result output and clock]
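The throughput and latency trade-off can be illustrated with a small calculation. This is a simplified model under stated assumptions: the function name, the stage delays and the per-register overhead are hypothetical, and power and area are not modelled.

```python
def pipeline_metrics(stage_delays_ns, register_overhead_ns=0.0):
    """Compare one combinational block with the same logic split into stages."""
    stages = len(stage_delays_ns)
    unpipelined_period = sum(stage_delays_ns)                    # one long path
    pipelined_period = max(stage_delays_ns) + register_overhead_ns
    return {
        "unpipelined_period_ns": unpipelined_period,
        "pipelined_period_ns": pipelined_period,
        "unpipelined_latency_ns": unpipelined_period,
        "pipelined_latency_ns": pipelined_period * stages,       # one period per stage
        "speedup": unpipelined_period / pipelined_period,        # throughput ratio
    }


# Hypothetical numbers: three stages and 0.5 ns lost per latch.
for name, value in pipeline_metrics([2.0, 3.0, 2.5], register_overhead_ns=0.5).items():
    print(name, value)
```

In this model the clock period is set by the slowest stage, so throughput improves while the time for one result to pass through all stages grows.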
Graphical method
• idea: introduce an anti-delay which cancels the effect of a delay; OK to have non-implementables at inputs or outputs
• graphical contours linking introduction of delay/anti-delay: draw contours around blocks; when a contour cuts
– a domain connection, put a D (or D^{-1})
– a range connection, put a D^{-1} (or D)
• make sure
– R is timeless: D ; R = R ; D or R = R \ D (D is timeless but not stateless!)
– the internal Ds are implementable
[Figure: contours drawn around three R blocks, with D and D^{-1} placed where the contours cut connections; annotation: design is combinational]
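The timeless condition D ; R = R ; D can be checked on a sample stream. The sketch below is a spot check, not a proof; the three example components and the helper `commutes_with_delay` are assumptions made for illustration, and sample 0 is skipped because it only shows the initial content of the register.

```python
def delay(xs, init=0):
    """Register: output is one cycle behind the input."""
    return [init] + xs[:-1]


def inc(xs):
    """Combinational (stateless) component."""
    return [x + 1 for x in xs]


def running_sum(xs):
    """Component with state whose behaviour does not depend on absolute time."""
    total, out = 0, []
    for x in xs:
        total += x
        out.append(total)
    return out


def even_only(xs):
    """Component whose behaviour depends on absolute time t."""
    return [x if t % 2 == 0 else 0 for t, x in enumerate(xs)]


def commutes_with_delay(r, xs):
    """Compare D ; R with R ; D from t = 1 onwards on one sample stream."""
    return r(delay(xs))[1:] == delay(r(xs))[1:]


xs = [3, 1, 4, 1, 5]
for r in (inc, running_sum, even_only):
    print(r.__name__, commutes_with_delay(r, xs))
```

The running sum passes the check although it holds state, in the same way that D itself is timeless but not stateless.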
Pipelining a chain
• given the timeless pre-condition: R = R \ D^{-1}
• then, by Horner's Rule: R^n = (R ; D)^n ; D^{-n}, where D^{-n} is the boundary condition
[Figure: R ; R ; R = R ; D ; R ; D ; R ; D ; D^{-1} ; D^{-1} ; D^{-1}]
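The chain law R^n = (R ; D)^n ; D^{-n} can be checked on a finite stream. In this sketch R is assumed to be a pointwise function, and the anti-delays at the output are simulated by dropping the first n samples of a recorded stream; all names are illustrative.

```python
def chain(r, n, xs):
    """R^n: n combinational copies of r in series."""
    for _ in range(n):
        xs = [r(x) for x in xs]
    return xs


def pipelined_chain(r, n, xs, init=0):
    """(R ; D)^n: every copy of r is followed by a register holding init at t = 0."""
    for _ in range(n):
        xs = [init] + [r(x) for x in xs][:-1]
    return xs


def anti_delays(xs, n):
    """D^{-n} simulated on a recorded stream by discarding the first n samples."""
    return xs[n:]


def r(x):
    return 2 * x + 1


xs = list(range(8))
n = 3
plain = chain(r, n, xs)
piped = anti_delays(pipelined_chain(r, n, xs), n)
print(plain[:len(piped)] == piped)
```

The pipelined chain produces the same values n cycles later, which is what the D^{-n} boundary term accounts for.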
Pipelining a row
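• given the timeless pre-condition: R = R \ D^{-1} (or R \ [D^{-1}, D^{-1}])
• then, by Horner's Rule:
– R^n = (R ; D)^n ; D^{-n}
– row_n R = snd nD ; row_n (R ; snd D) ; [nD^{-1}, D^{-n}]
where snd nD and [nD^{-1}, D^{-n}] are the boundary conditions
[Figure: a row of three R blocks equals the same row with a D on each connection between blocks and D / D^{-1} on the boundary connections]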
Example: polynomial evaluation and optimisation
• given: a x + b x = (a + b) x
• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))
[Figure: multiplier and adder circuits for both sides of each equation]
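The saving from the factored form can be seen by counting multiplications. The two functions below are an illustrative sketch with assumed names; the naive version forms each power of x by repeated multiplication, which is one of several possible baselines.

```python
def eval_naive(coeffs, x):
    """Evaluate a0 + a1*x + a2*x^2 + ... term by term.
    Returns (value, number of multiplications)."""
    total, mults = 0, 0
    for i, a in enumerate(coeffs):
        term = a
        for _ in range(i):        # build a_i * x^i with i multiplications
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_horner(coeffs, x):
    """Evaluate a0 + x*(a1 + x*(a2 + ...)) from the innermost bracket outwards.
    Returns (value, number of multiplications)."""
    acc, mults = coeffs[-1], 0
    for a in reversed(coeffs[:-1]):
        acc = acc * x + a
        mults += 1
    return acc, mults


coeffs = [1, 2, 3, 4]             # a0, a1, a2, a3
print(eval_naive(coeffs, 2))
print(eval_horner(coeffs, 2))
```

Both forms give the same value; for a cubic the factored form needs three multipliers where this naive baseline needs six.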
Horner’s Rule
• given: [P, Q] ; R = R ; Q
• then: [nP, Q^n] ; rdr_n R = rdr_n (2Q ; R)
[Figure: block diagrams of both sides, built from P, Q and R blocks]
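One way to read the rule is sketched below on ordinary functions. Everything in it is an interpretation rather than the lecture's definition: rdr_n R is taken to be a right reduction, the left-hand side is assumed to apply P i times to the i-th input (counting from 0) and Q n times to the initial value, and the right-hand side is assumed to apply Q to the accumulated value before each R.

```python
def rdr(r, xs, b):
    """Right reduction: r(x0, r(x1, ... r(x_{n-1}, b)))."""
    for x in reversed(xs):
        b = r(x, b)
    return b


def power(f, k):
    """Return the function that applies f k times."""
    def repeated(v):
        for _ in range(k):
            v = f(v)
        return v
    return repeated


def lhs(p, q, r, xs, b):
    """Assumed reading of the left-hand side: skew the inputs, then reduce with r."""
    skewed = [power(p, i)(x) for i, x in enumerate(xs)]
    return rdr(r, skewed, power(q, len(xs))(b))


def rhs(q, r, xs, b):
    """Assumed reading of the right-hand side: apply q to the accumulator in every cell."""
    return rdr(lambda x, acc: r(x, q(acc)), xs, b)


def times3(v):
    return v * 3


def plus(a, b):
    return a + b


# Pre-condition [P, Q] ; R = R ; Q holds here because a*3 + b*3 == (a + b)*3.
print(lhs(times3, times3, plus, [1, 2, 3, 4], 0))
print(rhs(times3, plus, [1, 2, 3, 4], 0))
```

With P and Q both multiplying by x and R adding, the left-hand side is a sum of a_i * x^i and the right-hand side is the nested form used for polynomial evaluation.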
A grid
R R R R
R R R R
R R R R
grid_{m,n} R = ?
Pipelining a grid: (a) put D between columns
[Figure: 3 x 4 grid in which each row is R ; D ; R ; D ; R ; D ; R ; D followed by four D^{-1}; D and D^{-1} also appear on the boundary connections]
Pipelining a grid: (b) put D between rows
[Figure: 3 x 4 grid in which each R is followed by a D on its horizontal connection and a D on its vertical connection, with compensating D and D^{-1} on the boundary connections]
Pipelining a grid: (c) place D diagonally
[Figure: 3 x 4 grid of R blocks with the D placed diagonally and compensating D and D^{-1} on the boundary connections]
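The effect of steps (a) and (b) on the clock period can be estimated by counting cells on the longest register-free path. This is a rough model under assumptions that the slides do not state: signals are taken to flow rightwards and downwards, and each output of a cell is taken to depend combinationally on both of its inputs. The function and parameter names are hypothetical.

```python
def longest_path_cells(rows, cols, d_between_columns=False, d_between_rows=False):
    """Number of R cells on the longest path that crosses no register."""
    horizontal = 1 if d_between_columns else cols   # cells crossed moving right
    vertical = 1 if d_between_rows else rows        # cells crossed moving down
    return horizontal + vertical - 1                # the corner cell is counted once


rows, cols = 3, 4
print(longest_path_cells(rows, cols))                       # no pipelining
print(longest_path_cells(rows, cols, True))                 # (a) D between columns
print(longest_path_cells(rows, cols, True, True))           # (a) and (b) together
```

Under these assumptions the clock period of the fully pipelined grid is set by a single cell, at the cost of the skewing delays on the boundary.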