sequential design and pipelining - doc.ic.ac.ukwl/teachlocal/cuscomp/notes/cc07.pdf · •...

wl 2019 7.1

Sequential design and pipelining

• sequential design

– example: systolic array

– stream representation

– delays and anti-delays

• pipelining

– pros and cons

– graphical method

– Horner’s Rule


Systolic array: data-oriented parallelism

• introduced by Kung and Leiserson in 1978

• pump data through processors

– like blood pumped through the body

• efficient, scalable, suitable for regular control

– particularly suited for FPGA technology

• challenges

– describe them systematically

– verify that they work

[Figure: memory M connected to a row of four processors P. M: memory; P: pipelined processor.]

[Figure: matrix product computed by a systolic array, C’ = A * B + C (Kung and Leiserson, 1978)]

Simple systolic matrix multiplier

[Figure: 3 × 3 array of processors P11 to P33. Row i of A (a_i1, a_i2, a_i3) is fed into processor row i, a_i1 first; column j of B (b_1j, b_2j, b_3j) is fed into processor column j, b_1j first.]

Each processor P_ij, at each time step:

– computes c_{t+1} = a_t * b_t + c_t

– passes a rightwards

– passes b downwards

– keeps c stationary

(source: J Break)
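The cell behaviour can be checked with a small simulation. The following is a minimal sketch, not the implementation used in the course: the function name `systolic_matmul`, the register arrays and the input skew (row i and column j are fed i and j steps late) are assumptions chosen so that matching operands meet in each cell. It uses the 3 × 3 matrix from the example in these notes.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array; return (C, accumulators after each step)."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # stationary c in each cell
    a_reg = [[0] * n for _ in range(n)]  # a value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]  # b value currently held by each cell
    trace = []
    for t in range(3 * n - 2):           # the last product reaches P_nn at step 3n - 2
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i            # assumed skew: row i is fed i steps late
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # a moves rightwards
                if i == 0:
                    k = t - j            # assumed skew: column j is fed j steps late
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # b moves downwards
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                # c_{t+1} = a_t * b_t + c_t
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
        trace.append([acc[i][j] for i in range(n) for j in range(n)])
    return acc, trace


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
C, trace = systolic_matmul(A, A)
for step, row in enumerate(trace, start=1):
    print(step, row)
```

Each printed row lists the accumulators in the order P11, P12, P13, P21, P22, P23, P31, P32, P33, so it can be compared directly with the time-step slides.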


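Systolic matrix multiplier: example

| 3 4 2 |   | 3 4 2 |   | 23 36 28 |
| 2 5 3 | × | 2 5 3 | = | 25 39 34 |
| 3 2 5 |   | 3 2 5 |   | 28 32 37 |

[Figure: the rows of the first matrix and the columns of the second matrix, skewed in time, queued at the edges of the processor array P11 to P33.]

(source: J Break)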

Accumulated value c in each processor after each time step (source: J Break):

Time step   P11  P12  P13  P21  P22  P23  P31  P32  P33
    1         9    0    0    0    0    0    0    0    0
    2        17   12    0    6    0    0    0    0    0
    3        23   32    6   16    8    0    9    0    0
    4        23   36   18   25   33    4   13   12    0
    5        23   36   28   25   39   19   28   22    6
    6        23   36   28   25   39   34   28   32   12
    7        23   36   28   25   39   34   28   32   37

Products formed in the array at each time step:

    1: 3*3
    2: 2*3, 4*2, 3*4
    3: 3*3, 2*4, 5*2, 2*3, 4*5, 3*2
    4: 3*4, 2*2, 2*2, 5*5, 3*3, 2*2, 4*3
    5: 3*2, 5*2, 5*3, 5*3, 3*2, 2*5
    6: 2*3, 5*2, 3*5
    7: 5*5

Done – now look into each cell…


Parallel matrix multiplier

(Woods, McCanny and McWhirter, 2008)

Bit-level systolic correlator

y = a * x + c (a is an M-bit vector)


Sequential designs

• components relate streams; the same laws hold if no delay is involved

– <..., x_{t-1}, x_t, x_{t+1}, ...> inc <..., x_{t-1}+1, x_t+1, x_{t+1}+1, ...>

  or x inc y ⇔ ∀t. x_t + 1 = y_t

– <..., <x_{t-1}, y_{t-1}>, <x_t, y_t>, ...> add <..., x_{t-1}+y_{t-1}, x_t+y_t, ...>

  (the domain of add is a single time sequence of pairs)

• delay D: provides state; the range is one cycle behind the domain

– <..., x_{t-1}, x_t, x_{t+1}, ...> D <..., x_{t-2}, x_{t-1}, x_t, ...> (models a register)

  or x D y ⇔ ∀t. x_{t-1} = y_t

• initialised delay DI c: at t = 0, y_t = c (DI c in Rebecca)

• D^{-1} is the anti-delay: if its input is in the domain it predicts the input, hence it is not implementable (AD can be simulated in Rebecca)

– can implement D (D in Rebecca) when the input is in the domain, or D^{-1} (D^~1 in Rebecca) when the input is in the range

– cannot implement D (AD^~1) with the input in the range, or D^{-1} (AD) with the input in the domain

[Figure: inc block with input x_t and output y_t]
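The stream view can be made concrete with a small model. This is a sketch under simplifying assumptions: a stream is represented as a Python function from a time index t to a value, components are functions on streams rather than relations, and `add` takes two streams instead of one stream of pairs. The names `Stream`, `delay`, `anti_delay` and `window` are illustrative and are not Rebecca primitives.

```python
from typing import Callable, List

Stream = Callable[[int], int]   # a stream maps a time index t to a value


def inc(x: Stream) -> Stream:
    return lambda t: x(t) + 1          # y_t = x_t + 1


def add(x: Stream, y: Stream) -> Stream:
    return lambda t: x(t) + y(t)       # pointwise sum at each time t


def delay(x: Stream) -> Stream:
    return lambda t: x(t - 1)          # D: y_t = x_{t-1}, one cycle behind


def anti_delay(x: Stream) -> Stream:
    return lambda t: x(t + 1)          # D^{-1}: y_t = x_{t+1}, looks one cycle ahead


def window(x: Stream, lo: int = -3, hi: int = 4) -> List[int]:
    """Sample a stream on the time indices lo .. hi - 1."""
    return [x(t) for t in range(lo, hi)]


x: Stream = lambda t: t * t            # an arbitrary example stream

print(window(x))
print(window(delay(x)))
print(window(anti_delay(delay(x))))    # same samples as x: D^{-1} cancels D
```

In this model the anti-delay is easy to write only because the whole stream is known in advance; a circuit that receives x_t at time t cannot produce x_{t+1}, which is why D^{-1} is not implementable with its input in the domain.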


Pipelining

• insert latches between circuits to increase throughput

• can also reduce power consumption, especially for FPGAs

• but may also increase

– area

– clock power consumption

– latency

[Figure: pipelined circuit with data input, result output and clock]
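The throughput and latency trade-off can be illustrated with a first-order timing model. This is a sketch under stated assumptions, not a model taken from the notes: the clock period of a pipelined design is taken to be the slowest stage delay plus a fixed register overhead, and the function name, parameter names and delay figures are hypothetical.

```python
def pipeline_metrics(stage_delays_ns, register_overhead_ns=0.0):
    """First-order timing estimate for a circuit split into pipeline stages."""
    unpipelined_period = sum(stage_delays_ns)                  # one long combinational path
    pipelined_period = max(stage_delays_ns) + register_overhead_ns
    pipelined_latency = pipelined_period * len(stage_delays_ns)
    return {
        "unpipelined_period_ns": unpipelined_period,
        "pipelined_period_ns": pipelined_period,
        "pipelined_latency_ns": pipelined_latency,
        "throughput_gain": unpipelined_period / pipelined_period,
    }


# Hypothetical stage delays in nanoseconds.
metrics = pipeline_metrics([2.0, 3.0, 1.0], register_overhead_ns=0.5)
for name, value in metrics.items():
    print(name, value)
```

With these figures the clock period falls from 6.0 ns to 3.5 ns, so throughput rises, while the latency of a single result grows from 6.0 ns to 10.5 ns. Unbalanced stages and register overhead are what keep the gain below the number of stages.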


Graphical method

• idea: introduce an anti-delay to cancel the effect of each delay; it is acceptable to have non-implementable components at the inputs or outputs

• graphical contours link the introduction of delays and anti-delays: draw contours around blocks; when a contour cuts

– a domain connection, put a D (or D^{-1})

– a range connection, put a D^{-1} (or D)

• make sure

– R is timeless: D ; R = R ; D, or R = R \ D (D is timeless but not stateless!)

– the internal Ds are implementable

[Figure: contours drawn around R blocks, with D and D^{-1} placed where the contours cut connections. Annotation: design is combinational.]


Pipelining a chain

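• timeless pre-condition: given R = R \ D^{-1}

[Figure: R shown equal to R composed with D and D^{-1}]

• then, by Horner’s Rule: R^n = (R ; D)^n ; D^{-n}

– D^{-n} is the boundary condition

[Figure: a chain of R blocks shown equal to a chain of R ; D blocks followed by D^{-1} blocks]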
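The chain identity R^n = (R ; D)^n ; D^{-n} can be checked numerically on a simple model. This is a sketch under assumptions: streams are functions from time to value, blocks are functions on streams, and R is a hypothetical timeless block (it acts on each sample independently). The helper names `seq` and `power` are illustrative.

```python
from functools import reduce


def delay(x):
    return lambda t: x(t - 1)          # D


def anti_delay(x):
    return lambda t: x(t + 1)          # D^{-1}


def R(x):
    return lambda t: 2 * x(t) + 1      # hypothetical timeless block


def seq(*blocks):
    """Compose blocks left to right, like R ; S in the notes."""
    return lambda x: reduce(lambda stream, block: block(stream), blocks, x)


def power(block, n):
    """n copies of a block in sequence."""
    return seq(*([block] * n))


n = 3
plain = power(R, n)                                             # R^n
pipelined = seq(power(seq(R, delay), n), power(anti_delay, n))  # (R ; D)^n ; D^{-n}
registers_only = power(seq(R, delay), n)                        # implementable part

x = lambda t: t * t
sample = range(-2, 6)
print([plain(x)(t) for t in sample])
print([pipelined(x)(t) for t in sample])
print([registers_only(x)(t) for t in sample])   # same values, n cycles later
```

The anti-delays end up at the output, where the notes allow non-implementable components; dropping them leaves a design that produces the same values n cycles later.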


Pipelining a row

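• timeless pre-condition: given R = R \ D^{-1} (or R \ [D^{-1}, D^{-1}])

[Figure: R with D on two connections shown equal to R with D^{-1} on the other two connections]

• then, by Horner’s Rule:

  R^n = (R ; D)^n ; D^{-n}

  row_n R = snd nD ; row_n (R ; snd D) ; [nD^{-1}, D^{-n}]

– the terms before and after row_n (R ; snd D) are the boundary conditions

[Figure: a row of R blocks shown equal to a row of R ; snd D blocks with D and D^{-1} at the boundary]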


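Example: polynomial evaluation and optimisation

• given: a x + b x = (a + b) x

[Figure: two multipliers feeding an adder, shown equal to one adder feeding a multiplier]

• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))

[Figure: direct circuit with six multipliers and three adders, and the equivalent circuit with three multipliers and three adders, both with coefficients a3, a2, a1, a0]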
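The saving from the factored form can be seen by counting multiplications. The following is a minimal sketch with illustrative function names; it models the arithmetic only, not the circuit timing.

```python
def eval_direct(coeffs, x):
    """Evaluate a0 + a1 x + a2 x^2 + ... term by term, counting multiplications."""
    total, mults = 0, 0
    for i, a in enumerate(coeffs):
        term = a
        for _ in range(i):          # a_i * x * x * ... (i multiplications)
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_horner(coeffs, x):
    """Evaluate a0 + x (a1 + x (a2 + ...)), counting multiplications."""
    acc, mults = coeffs[-1], 0
    for a in reversed(coeffs[:-1]):
        acc = a + x * acc           # one multiply-add stage per coefficient
        mults += 1
    return acc, mults


coeffs = [1, 2, 3, 4]               # a0, a1, a2, a3
print(eval_direct(coeffs, 2))       # (49, 6)
print(eval_horner(coeffs, 2))       # (49, 3)
```

For a cubic the direct form uses six multiplications and the factored form three. Each step of the factoring applies a x + b x = (a + b) x once.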


Horner’s Rule

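• given: [P, Q] ; R = R ; Q

[Figure: R with P and Q on its two inputs, shown equal to R followed by Q]

• then: [nP, Q^n] ; rdr_n R = rdr_n (2Q ; R)

[Figure: a chain of R blocks with P and Q blocks on the inputs, shown equal to a chain in which each R is paired with a single Q]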


A grid

R R R R

R R R R

R R R R

grid_{m,n} R = ?


Pipelining a grid: (a) put D between columns

[Figure: 3 × 4 grid of R cells with a D after each R on the horizontal connections, and compensating D and D^{-1} elements at the grid boundary]

Pipelining a grid: (b) put D between rows

[Figure: 3 × 4 grid of R cells with D on the horizontal and vertical connections of each R, and compensating D and D^{-1} elements at the grid boundary]

Pipelining a grid: (c) place D diagonally

[Figure: 3 × 4 grid of R cells with D placed along the diagonals, and compensating D and D^{-1} elements at the grid boundary]
