examples of one-dimensional systolic arrays
![Page 1: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/1.jpg)
Examples of One-Dimensional Systolic Arrays
![Page 2: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/2.jpg)
Motivation & Introduction
• We need high-performance, special-purpose computer systems to meet specific application requirements.
• I/O and computation imbalance is a notable problem.
• The concept of a systolic architecture can map high-level computation into hardware structures.
• A systolic system works like an automobile assembly line.
• A systolic system is easy to implement because of its regularity, and it is easy to reconfigure.
• Systolic architectures can result in cost-effective, high-performance special-purpose systems for a wide range of problems.
![Page 3: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/3.jpg)
Pipelined Computations
• A pipelined program is divided into a series of tasks that have to be completed one after the other.
• Each task is executed by a separate pipeline stage.
• Data are streamed from stage to stage to form the computation.
[Figure: pipeline stages P1, P2, P3, P4, P5 with the data items a, b, c, d, e, f streaming through them.]
![Page 4: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/4.jpg)
Pipelined Computations
• The computation consists of data streaming through pipeline stages.
• Execution time = time to fill the pipeline (P-1) + time to run in steady state (N-P+1) + time to empty the pipeline (P-1), which adds up to
$T = (P-1) + (N-P+1) + (P-1) = N + P - 1$
[Figure: space-time diagram of processors P1 to P5 against time; the data items a to f pass through the stages one time step apart.]
P = number of processors, N = number of data items (assume P < N).
This slide must be explained in full detail; it is very important.
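The fill/steady-state/drain accounting can be checked with a minimal cycle-level sketch in Python. It is our own illustration, not part of the slides. `run_pipeline` is a hypothetical helper. The sketch assumes every stage takes exactly one cycle and one new item enters P1 per cycle.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple


def run_pipeline(items: List[int], stages: List[Callable[[int], int]]) -> Tuple[List[int], int]:
    """Stream items through a linear pipeline; return (outputs, cycles used)."""
    regs: List[Optional[int]] = [None] * len(stages)  # value held by each stage
    pending = deque(items)
    outputs: List[int] = []
    cycles = 0
    while pending or any(r is not None for r in regs):
        # Shift: stage s receives what stage s-1 held; a new item enters stage 0.
        incoming = pending.popleft() if pending else None
        regs = [incoming] + regs[:-1]
        # Every occupied stage applies its own function in the same cycle.
        regs = [None if v is None else f(v) for f, v in zip(stages, regs)]
        cycles += 1
        if regs[-1] is not None:  # the last stage hands its result out
            outputs.append(regs[-1])
            regs[-1] = None
    return outputs, cycles


if __name__ == "__main__":
    P, N = 5, 6
    out, t = run_pipeline(list(range(N)), [lambda v: v + 1 for _ in range(P)])
    print(out, t)  # [5, 6, 7, 8, 9, 10] 10
```

With N = 6 items and P = 5 stages it reports 10 cycles, that is, N + P - 1.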
![Page 5: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/5.jpg)
Pipelined Example: Sieve of Eratosthenes
• The goal is to take a list of integers greater than 1 and produce a list of primes.
– E.g., for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7.
• A pipelined approach:
– Processor P_i divides each input by the i-th prime
– If the input is divisible (and not equal to the divisor), it is marked (with a negative sign) and forwarded
– If the input is not divisible, it is forwarded
– Last processor only forwards unmarked (positive) data [primes]
![Page 6: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/6.jpg)
Sieve of Eratosthenes Pseudo-Code
• Code for processor P_i (and prime p_i):
– x = recv(data, P_(i-1))
– If (x > 0) then
  • If (p_i divides x and p_i ≠ x) then send(-x, P_(i+1))
  • If (p_i does not divide x or p_i = x) then send(x, P_(i+1))
– Else send(x, P_(i+1))
• Code for the last processor:
– x = recv(data, P_(i-1))
– If (x > 0) then send(x, OUTPUT)
[Figure: pipeline P2 → P3 → P5 → P7 → out; processor P_i divides each input by the i-th prime.]
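The pseudo-code can be exercised with a small cycle-level Python sketch. It rests on several assumptions of our own:
- `pipelined_sieve` is a hypothetical name.
- The stage primes are known in advance, as in the P2 → P3 → P5 → P7 → out figure.
- One item enters per cycle.
- Each processor does one recv/test/send per cycle.

```python
from collections import deque
from typing import List, Optional, Tuple


def pipelined_sieve(data: List[int], primes: List[int]) -> Tuple[List[int], int]:
    """One processor per prime, then a last processor that emits unmarked values."""
    regs: List[Optional[int]] = [None] * (len(primes) + 1)  # value held by each processor
    pending = deque(data)
    found: List[int] = []
    cycles = 0
    while pending or any(r is not None for r in regs):
        # recv: each processor takes what its left neighbour sent in the previous cycle
        regs = [pending.popleft() if pending else None] + regs[:-1]
        cycles += 1
        for i, x in enumerate(regs[:-1]):
            p = primes[i]
            if x is not None and x > 0 and x % p == 0 and x != p:
                regs[i] = -x  # composite: mark it with a negative sign
        last = regs[-1]
        if last is not None:
            if last > 0:
                found.append(last)  # last processor outputs only positive data
            regs[-1] = None  # the value leaves the pipeline
    return found, cycles


print(pipelined_sieve(list(range(2, 11)), [2, 3, 5, 7]))  # ([2, 3, 5, 7], 13)
```

The input is 2 to 10, with stages for 2, 3, 5, and 7. The sketch returns 2 3 5 7 after 9 + 5 - 1 = 13 cycles. This matches the N + P - 1 count, with the output processor counted in P.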
![Page 7: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/7.jpg)
Programming Issues
• The algorithm takes N+P-1 steps to run, where N is the number of data items and P is the number of processors.
– Can also consider just the odd numbers, or do some initial part separately.
• In the given implementation, the processors must store all primes that will appear in the sequence, one prime per processor.
– Not a scalable approach.
– Can fix this by having each processor do the job of multiple primes, i.e., mapping several logical "processors" in the pipeline onto each physical processor.
– What is the impact of this on performance?
[Figure: logical stages P2 P3 P5 P7 P11 P13 P17 mapped onto physical processors; each processor does the job of three primes.]
![Page 8: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/8.jpg)
Processors for such operations
• In a pipelined algorithm, the data flow through the processors in lockstep.
• The design attempts to balance the work so that there is no bottleneck at any processor.
• In the mid-1980s, processors were developed to support this kind of parallel pipelined computation in hardware.
• Two commercial products developed at CMU with industrial partners:
– Warp (1D array), produced by General Electric
– iWarp (components for 2D arrays), produced by Intel
• Warp and iWarp were meant to operate synchronously; the Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously,
– i.e., the arrival of data signals that it is time to execute.
![Page 9: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/9.jpg)
Systolic Arrays: Warp and iWarp
• Warp and iWarp were examples of systolic arrays.
– Systolic means regular and rhythmic:
– data were supposed to move through pipelined computational units in a regular and rhythmic fashion.
• Systolic arrays were meant to be special-purpose processors or co-processors.
• They were very fine-grained.
• The processors, usually called cells, implement a limited and very simple computation.
• Communication is very fast; the granularity is meant to be around one operation per communication!
![Page 10: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/10.jpg)
Systolic Algorithms
• Systolic arrays were built to support systolic algorithms, a hot area of research in the early 1980s.
• Systolic algorithms used pipelining through various kinds of arrays to accomplish computational goals:
– Some of the data streaming and applications were very creative and quite complex.
– CMU was a hotbed of systolic algorithm and array research (especially H.T. Kung and his group).
![Page 11: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/11.jpg)
Example 1: “pipelined” polynomial evaluation
• Polynomial evaluation is done using a linear array of 2n processing elements (PEs).
• Expression: $Y = ((((a_n x + a_{n-1})x + a_{n-2})x + a_{n-3})x + \cdots + a_1)x + a_0$
• Function of the PEs, in pairs:
– 1. Multiply the input by x.
– 2. Pass the result to the right.
– 3. Add $a_j$ to the result from the left.
– 4. Pass the result to the right.
![Page 12: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/12.jpg)
• Using a systolic array for polynomial evaluation.
• This pipelined array can evaluate the polynomial on a new X value every cycle, once its 2n stages have filled.
• Another variant: you can also calculate various polynomials on the same X.
• This is an example of a deeply pipelined computation: the pipeline has 2n stages.
[Figure: a row of alternating multiplying (X) and adding (+) processors. $a_n$ enters the first multiplying processor, and the adding processors receive $a_{n-1}, a_{n-2}, \ldots, a_0$. x is broadcast to every multiplying processor.]
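A cycle-level Python sketch of the 2n-stage pipeline, which applies Horner's rule. Its assumptions go beyond the slide:
- `horner_pipeline` is a hypothetical name.
- The coefficients are given as [a_n, ..., a_0] with n ≥ 1.
- Each x value travels through the array together with its partial result, so that a new x can enter every cycle. The figure instead broadcasts x to the multiplying processors.

```python
from collections import deque
from typing import List, Optional, Tuple


def horner_pipeline(coeffs: List[int], xs: List[int]) -> Tuple[List[int], int]:
    """coeffs = [a_n, ..., a_1, a_0] (n >= 1); returns ([P(x) for x in xs], cycles)."""
    # 2n stages: for j = n-1 down to 0, a multiplying PE (times x) then an adding PE (+ a_j)
    stages: List[Tuple[str, int]] = []
    for a in coeffs[1:]:
        stages += [("mul", 0), ("add", a)]
    regs: List[Optional[Tuple[int, int]]] = [None] * len(stages)  # (x, partial result)
    pending = deque(xs)
    values: List[int] = []
    cycles = 0
    while pending or any(r is not None for r in regs):
        # a new x enters the first PE with partial result a_n; the rest move right
        entering = (pending.popleft(), coeffs[0]) if pending else None
        regs = [entering] + regs[:-1]
        for s, item in enumerate(regs):
            if item is None:
                continue
            x, acc = item
            kind, a = stages[s]
            regs[s] = (x, acc * x) if kind == "mul" else (x, acc + a)
        cycles += 1
        if regs[-1] is not None:  # the last adding PE has finished P(x)
            values.append(regs[-1][1])
            regs[-1] = None
    return values, cycles


print(horner_pipeline([2, -3, 0, 5], [0, 1, 2, 3]))  # ([5, 4, 9, 32], 9)
```

Take $P(x) = 2x^3 - 3x^2 + 5$ at x = 0, 1, 2, 3. The sketch gives 5, 4, 9, 32 after 4 + 6 - 1 = 9 cycles: four x values pass through 2n = 6 stages.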
![Page 13: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/13.jpg)
Example 2: Matrix-Vector Multiplication
• There are many ways to solve matrix problems using systolic arrays; some of the methods are:
– Triangular array performing Gaussian elimination with neighbor pivoting.
– Triangular array performing orthogonal triangularization.
• Simple matrix multiplication methods are shown in the next slides.
![Page 14: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/14.jpg)
• Matrix-vector multiplication: each cell’s function is:
– 1. Multiply the top and bottom inputs.
– 2. Add the left input to the product just obtained.
– 3. Output the final result to the right.
• Each cell consists of a multiplier, an adder, and a few registers.
• At time t0 the array receives l, a, p, q, and r (the other inputs are all zero).
• At time t1 the array receives m, d, b, p, q, and r, and so on.
• The results emerge after 5 steps.
Example 2: Matrix-Vector Multiplication
![Page 15: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/15.jpg)
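Matrix Multiplication
[Figure: cells PE1, PE2, PE3 in a row. l, m, n enter PE1 from the left; p, q, r are applied to the bottoms of PE1, PE2, PE3; the results x, y, z leave PE3 on the right. The matrix elements enter from the top, skewed by one step per cell:
t0: a - -
t1: d b -
t2: g e c
t3: - h f
t4: - - i]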
![Page 16: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/16.jpg)
• Each cell (P1, P2, P3) does just one instruction: multiply the top and bottom inputs, add the left input to the product just obtained, and output the final result to the right.
• The cells are simple: just a multiplier, an adder, and a few registers.
• The cleverness comes in the order in which you feed input into the systolic array:
– At time t0, the array receives l, a, p, q, and r (the other inputs are all zero).
– At time t1, the array receives m, d, b, p, q, and r.
– And so on.
• Results emerge after 5 steps.
To visualize how it works, it is good to do a snapshot animation.
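A snapshot simulation in Python, based on our reading of the figure:
- `systolic_matvec` is a hypothetical name.
- The left inputs l, m, n enter PE1 one per step and are added into the result.
- p, q, r are the bottom inputs of PE1, PE2, PE3.
- The matrix rows are (a, b, c), (d, e, f), (g, h, i), fed from the top with the skew shown in the figure.

Under these assumptions the outputs are x = l + ap + bq + cr, y = m + dp + eq + fr, and z = n + gp + hq + ir.

```python
from typing import List, Optional, Tuple


def systolic_matvec(A: List[List[int]], v: List[int], left_in: List[int]) -> Tuple[List[int], int]:
    """Linear array of n cells: result[i] = left_in[i] + sum_j A[i][j] * v[j]."""
    n = len(v)
    regs: List[Optional[Tuple[int, int]]] = [None] * n  # (row, partial sum) leaving each cell
    result = [0] * n
    steps = 0
    for t in range(2 * n - 1):
        # Each cell's left input is its left neighbour's previous output;
        # left_in[t] (l, m, n in the slide) enters the first cell at time t.
        inputs = [(t, left_in[t]) if t < n else None] + regs[:-1]
        for j, item in enumerate(inputs):
            if item is None:
                regs[j] = None
                continue
            row, acc = item
            top = A[row][j]  # skewed feed: A[row][j] reaches cell j at time row + j
            regs[j] = (row, acc + top * v[j])  # multiply top by bottom, add left
        if regs[-1] is not None:  # a finished row leaves the last cell
            row, acc = regs[-1]
            result[row] = acc
        steps += 1
    return result, steps


print(systolic_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 0, -1], [10, 20, 30]))  # ([8, 18, 28], 5)
```

For a 3 × 3 matrix the loop runs 2n - 1 = 5 steps. This agrees with "results emerge after 5 steps".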
![Page 17: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/17.jpg)
Systolic Processors versus Cellular Automata versus Regular Networks of Automata
[Figure: a systolic processor drawn as a row of four data path blocks, and a cellular automaton drawn as a row of four control blocks.]
These slides are for the one-dimensional case only.
![Page 18: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/18.jpg)
Systolic Processors versus Cellular Automata versus Regular Networks of Automata
[Figure: a cellular automaton as a row of control blocks (examples: General and Soldiers, Symmetric Function Evaluator), and a regular network of automata as a row of cells that each pair a control block with a data path block.]
![Page 19: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/19.jpg)
Introduction to Convolution circuits synthesis
Perkowski
![Page 20: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/20.jpg)
FIR-filter like structure
[Snapshot 1: four taps b4 b3 b2 b1 feed a chain of three adders. The delay line under the taps holds a4 0 0 0; the output is a4*b4.]
![Page 21: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/21.jpg)
[Snapshot 2: a3 enters, so the taps b4 b3 b2 b1 see a3 a4 0 0; the outputs so far are a4*b4 and a3*b4 + a4*b3.]
![Page 22: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/22.jpg)
[Snapshot 3: a2 enters, so the taps see a2 a3 a4 0; the new output is a4*b2 + a3*b3 + a2*b4.]
![Page 23: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/23.jpg)
[Snapshot 4: a1 enters, so the taps see a1 a2 a3 a4; the new output is a1*b4 + a2*b3 + a3*b2 + a4*b1.]
![Page 24: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/24.jpg)
[Snapshot 5: a 0 enters, so the taps see 0 a1 a2 a3; the new output is a1*b3 + a2*b2 + a3*b1.]
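The five snapshots can be reproduced with a short Python sketch of this direct-form structure. `direct_form_fir` is a hypothetical name. The values a1..a4 = 1..4 and b1..b4 = 5..8 are arbitrary test data. Each output is a full sum of products formed within one cycle, so the adder chain is the long combinational path. Inputs moving past stationary weights, with the products combined by adders, is close to what the later slides call design F.

```python
from typing import List


def direct_form_fir(stream: List[int], taps: List[int]) -> List[int]:
    """taps[d] multiplies the sample that entered d cycles ago (taps[0] sees the newest)."""
    delay = [0] * len(taps)  # delay line under the taps, initially all 0
    out: List[int] = []
    for sample in stream:
        delay = [sample] + delay[:-1]  # the new sample enters, older ones shift along
        # the adder chain combines all taps in the same cycle (no pipeline registers)
        out.append(sum(t * d for t, d in zip(taps, delay)))
    return out


# Slide order: a4, a3, a2, a1 enter, then zeros; the taps are b4 b3 b2 b1.
a1, a2, a3, a4 = 1, 2, 3, 4
b1, b2, b3, b4 = 5, 6, 7, 8
print(direct_form_fir([a4, a3, a2, a1, 0, 0, 0], [b4, b3, b2, b1]))  # [32, 52, 61, 60, 34, 16, 5]
```

The first two outputs are a4*b4 = 32 and a3*b4 + a4*b3 = 52, as in snapshots 1 and 2.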
![Page 25: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/25.jpg)
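We insert DFFs to avoid many levels of logic
[Snapshot 1: a4 is broadcast to all four multipliers (a3 and a2 wait at the input), forming a4*b4, a4*b3, a4*b2, a4*b1; a4*b4 is output and the other products are held in D flip-flops between the adders.]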
![Page 26: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/26.jpg)
[Snapshot 2: a3 is broadcast; the output is a4*b3 + a3*b4, and the flip-flops hold a4*b2 + a3*b3, a4*b1 + a3*b2, and a3*b1.]
![Page 27: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/27.jpg)
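[Snapshot 3: a2 is broadcast; the output is a4*b2 + a3*b3 + a2*b4, and the flip-flops hold a4*b1 + a3*b2 + a2*b3, a3*b1 + a2*b2, and a2*b1.]
The disadvantage of this circuit is broadcasting: the current input must drive every multiplier at once.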
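The retimed version with D flip-flops can be sketched the same way in Python. `transposed_fir` is a hypothetical name, k ≥ 2 is assumed, and the list `regs` stands for the D flip-flops between the adders. The current input is broadcast to every multiplier, so each output needs only one addition after a register, yet the output sequence is unchanged. Broadcast inputs, stationary weights, and partial results moving from register to register match what the later slides call design B1.

```python
from typing import List


def transposed_fir(stream: List[int], taps: List[int]) -> List[int]:
    """Broadcast each sample to all multipliers; D flip-flops sit between the adders."""
    k = len(taps)  # assumes k >= 2
    regs = [0] * (k - 1)  # partial sums held in the D flip-flops
    out: List[int] = []
    for x in stream:  # x is broadcast to every multiplier
        products = [t * x for t in taps]
        out.append(products[0] + regs[0])  # one adder between the last register and the output
        # each register takes its product plus the partial sum from the next register
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < k - 1 else 0) for i in range(k - 1)]
    return out


# a4, a3, a2, a1 = 4, 3, 2, 1 enter, then zeros; taps b4 b3 b2 b1 = 8 7 6 5
print(transposed_fir([4, 3, 2, 1, 0, 0, 0], [8, 7, 6, 5]))  # [32, 52, 61, 60, 34, 16, 5]
```

After the first input the registers hold a4*b3, a4*b2, a4*b1, as in snapshot 1. After the second, the output is a4*b3 + a3*b4.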
![Page 28: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/28.jpg)
We insert more DFFs to avoid broadcasting
[Snapshot 1: a4 reaches only the b4 multiplier (a3 and a2 wait at the input); the output is a4*b4 and all other registers hold 0.]
![Page 29: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/29.jpg)
[Snapshot 2: a3 enters and a4 moves one register along; the products a3*b4 and a4*b3 are formed, while the other registers still hold 0.]
This does not work correctly; we try something new.
![Page 30: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/30.jpg)
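[Figure: schedule of the products, one column per weight, each column delayed by one more step with leading zeros: a4b4, a3b4, a2b4, a1b4; a4b3, a3b3, a2b3, a1b3; a4b2, a3b2, a2b2, a1b2; a4b1, a3b1, a2b1. The first sum and the second sum are marked.]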
![Page 31: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/31.jpg)
FIR-filter like structure, assume two delays
[Figure: taps b4 b3 b2 b1 feeding a chain of three adders.]
![Page 32: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/32.jpg)
![Page 33: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/33.jpg)
![Page 34: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/34.jpg)
![Page 35: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/35.jpg)
![Page 36: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/36.jpg)
![Page 37: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/37.jpg)
![Page 38: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/38.jpg)
![Page 39: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/39.jpg)
![Page 40: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/40.jpg)
![Page 41: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/41.jpg)
![Page 42: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/42.jpg)
![Page 43: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/43.jpg)
![Page 44: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/44.jpg)
![Page 45: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/45.jpg)
Example 3: FIR Filter or Convolution
![Page 46: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/46.jpg)
Example 3: Convolution
• There are many ways to implement convolution using systolic arrays; one of them is shown:
– u(n): the input sequence, entering from the left.
– w(n): the weights, preloaded in n PEs.
– y(n): the result sequence, entering from the right (initial value 0) and moving at the same speed as u(n).
• In this operation each cell’s function is:
– 1. Multiply the input coming from the left by the weight, and pass that input on to the next cell.
– 2. Add the product to the input coming from the right, and pass the sum on.
[Figure: cells holding W0 W1 W2 W3; u_i ... u_0 enter from the left and y_i ... y_0 (initially 0) enter from the right. Each cell $W_i$ has inputs $a_{in}$ (left) and $b_{in}$ (right) and outputs $a_{out}$ (right) and $b_{out}$ (left).]
Cell operation:
$a_{out} = a_{in}$
$b_{out} = b_{in} + a_{in} \cdot w_i$
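A cycle-level Python sketch of this arrangement, written in the notation of the later slides (inputs x instead of u). The assumptions are ours:
- `w1_convolution` is a hypothetical name.
- It computes $y_i = w_1 x_i + \cdots + w_k x_{i+k-1}$, 0-based in the code.
- Inputs and partial results are spaced two cycles apart, so that every result meets every input it needs.

With this indexing the leftmost cell holds $w_k$, and in any cycle only half of the cells do a multiply-add. Weights stay while inputs and results move in opposite directions, which is what the later slides call design W1.

```python
from typing import List, Optional, Tuple


def w1_convolution(w: List[int], x: List[int]) -> List[int]:
    """y[i] = sum_j w[j] * x[i + j] for i = 0 .. n-k, on k cells with stationary weights."""
    k, n = len(w), len(x)
    n_out = n - k + 1
    cell_w = w[::-1]  # leftmost cell holds w[k-1]
    xs: List[Optional[int]] = [None] * k  # x register of each cell (moves right)
    ys: List[Optional[Tuple[int, int]]] = [None] * k  # (index, partial sum) (moves left)
    results = [0] * n_out
    done, t = 0, 0
    while done < n_out:
        if ys[0] is not None:  # a finished y leaves at the left end
            i, acc = ys[0]
            results[i] = acc
            done += 1
        # x_m enters on the left at time 2m; y_i (initially 0) enters on the right at 2i + k - 1
        x_in = x[t // 2] if t % 2 == 0 and t // 2 < n else None
        r = t - (k - 1)
        y_in = (r // 2, 0) if r >= 0 and r % 2 == 0 and r // 2 < n_out else None
        xs = [x_in] + xs[:-1]  # a_out = a_in: inputs shift one cell right
        ys = ys[1:] + [y_in]  # partial results shift one cell left
        for c in range(k):  # b_out = b_in + a_in * w, only where an x meets a y
            if xs[c] is not None and ys[c] is not None:
                i, acc = ys[c]
                ys[c] = (i, acc + cell_w[c] * xs[c])
        t += 1
    return results


print(w1_convolution([1, 2, 3], [1, 2, 3, 4, 5]))  # [14, 20, 26]
```

Because x's and y's meet only in every other cell, half the cells are idle in each cycle.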
![Page 47: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/47.jpg)
Convolution (cont.)
• Each cell’s operation: $a_{out} = a_{in}$, $b_{out} = b_{in} + a_{in} \cdot w_i$.
• Systolic array: the input sequence enters from the left.
This is just one solution to this problem.
![Page 48: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/48.jpg)
Various Possible Implementations
Convolution is very important; we use it in several applications. So let us think about all the possible ways to implement it.
• Convolution algorithm: two loops
![Page 49: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/49.jpg)
Bag of Tricks that can be used
• Preload-repeated-value
• Replace-feedback-with-register
• Internalize-data-flow
• Broadcast-common-input
• Propagate-common-input
• Retime-to-eliminate-broadcasting
![Page 50: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/50.jpg)
Bogus Attempt at Systolic FIR
for i=1 to n in parallel: for j=1 to k in place: y_i += w_j * x_(i+j-1)
Stage 1: directly from the equation; the inner loop is realized in place, with feedback from the sequential implementation.
Stage 2: feedback: y_i = y_i.
Stage 3: replace the feedback with a register.
![Page 51: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/51.jpg)
Bogus Attempt continued: Outer Loop
Factorize $w_j$.
![Page 52: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/52.jpg)
Bogus Attempt continued: Outer Loop - 2
Because we do not want broadcasting, we retime the signal w; this also requires retiming $x_j$.
![Page 53: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/53.jpg)
Bogus Attempt continued: Outer Loop - 2a
• Another possibility of retiming
![Page 54: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/54.jpg)
Bogus Attempt continued: Outer Loop - 3
• Yet another approach is to broadcast the common input $x_{i-1}$.
![Page 55: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/55.jpg)
Attempt at Systolic FIR: now the internal loop is in parallel
![Page 56: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/56.jpg)
Outer Loop continuation for FIR filter
![Page 57: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/57.jpg)
Continue: Optimize Outer Loop: Preload-repeated Value
Based on the previous slide, we can preload the weights $w_i$.
![Page 58: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/58.jpg)
Continue: Optimize Outer Loop: Broadcast Common Value
This design has broadcasting. Some purists say this is not systolic, because a systolic design should have only short wires.
![Page 59: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/59.jpg)
Continue: Optimize Outer Loop: Retime to Eliminate Broadcast
We delay the signals $y_i$.
![Page 60: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/60.jpg)
The design is no longer intuitive. Therefore, we have to explain in detail “how it works”.
[Figure: snapshots with x1 and x2 entering the array; $y_1 = x_1 w_1$.]
![Page 61: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/61.jpg)
Types of systolic structure
• Convolution problem
weights: $\{w_1, w_2, \ldots, w_k\}$
inputs: $\{x_1, x_2, \ldots, x_n\}$
results: $\{y_1, y_2, \ldots, y_{n+1-k}\}$
$y_i = w_1 x_i + w_2 x_{i+1} + \cdots + w_k x_{i+k-1}$
(combining two data streams), after H. T. Kung’s grouping work
Assume k = 3.
Polynomial multiplication as a 1-D convolution problem
![Page 62: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/62.jpg)
A family of systolic designs for convolution computation
• Given the sequence of weights $\{w_1, w_2, \ldots, w_k\}$
• and the input sequence $\{x_1, x_2, \ldots, x_n\}$,
• compute the result sequence $\{y_1, y_2, \ldots, y_{n+1-k}\}$
• defined by $y_i = w_1 x_i + w_2 x_{i+1} + \cdots + w_k x_{i+k-1}$.
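A sequential reference model of this definition is handy for checking a simulation of any of the designs below. This is a plain two-loop Python sketch. `convolve_reference` is a hypothetical name, and indices are 0-based in the code.

```python
from typing import List


def convolve_reference(w: List[int], x: List[int]) -> List[int]:
    """Sequential golden model: y_i = w_1 x_i + ... + w_k x_{i+k-1} (0-based here)."""
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k))  # inner loop over the k weights
            for i in range(n - k + 1)]              # outer loop over the n+1-k results


print(convolve_reference([1, 2, 3], [1, 2, 3, 4, 5]))  # [14, 20, 26]
```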
![Page 63: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/63.jpg)
Design B1
- Broadcast inputs, move results systolically, weights stay (semi-systolic convolution arrays with global data communication)
• Previously proposed for circuits implementing a pattern-matching processor and for circuits implementing polynomial multiplication.
![Page 64: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/64.jpg)
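Types of systolic structure: Design B1
• Wider systolic path (the partial results $y_i$ move)
[Figure: cells holding W1 W2 W3. The inputs x3 x2 x1 are broadcast to every cell, while the partial results y3 y2 y1 move systolically from cell to cell and out. Each cell has inputs $x_{in}$ and $y_{in}$, output $y_{out}$, and weight W.]
Cell operation: $y_{out} = y_{in} + W x_{in}$
Please analyze this circuit by drawing snapshots, like an animated movie of the data at subsequent moments in time.
Discuss the disadvantages of broadcasting.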
![Page 65: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/65.jpg)
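Types of systolic structure: Design B2
Inputs broadcast, weights move, results stay
• The $w_i$ circulate
• Uses multiplier-accumulator hardware
• Each $w_i$ has a tag bit (signals the accumulator to output its result)
• Needs a separate bus (or other global network) for collecting output
[Figure: cells accumulating y1 y2 y3. The inputs x3 x2 x1 are broadcast, and the weights W1 W2 W3 circulate through the cells. Each cell has input $W_{in}$ and broadcast input $x_{in}$, output $W_{out}$, and accumulator y.]
Cell operation: $y = y + W_{in} x_{in}$, $W_{out} = W_{in}$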
![Page 66: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/66.jpg)
Design B2
- Broadcast inputs, move weights, results stay ((semi-)systolic convolution arrays with global data communication)
• The path for moving the $y_i$’s would be wider than the path for the $w_i$’s, because the $y_i$’s carry more bits than the $w_i$’s for numerical accuracy. Moving the weights instead keeps the systolic path narrow.
• The use of multiplier-accumulators may also help increase the precision of the result, since extra bits can be kept in these accumulators at modest cost.
Semi-systolic because of broadcasting.
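A Python sketch of design B2, following our own reading of the slide:
- `b2_convolution` is a hypothetical name.
- There are k cells, and each keeps a stationary accumulator.
- The weights circulate around the array one cell per cycle, with the last cell feeding the first.
- $x_t$ is broadcast to all cells in cycle t.
- The tag bit travels with $w_k$ and is modeled by the test `j == k - 1`.
- Writing into `out` stands in for the separate output bus.

```python
from typing import List


def b2_convolution(w: List[int], x: List[int]) -> List[int]:
    """Design B2 sketch: x broadcast, weights circulate, each cell's result stays put."""
    k, n = len(w), len(x)
    n_out = n - k + 1
    slot = [(-c) % k for c in range(k)]  # index j of the weight in cell c at t = 0
    acc = [0] * k  # one stationary accumulator per cell
    out = [0] * n_out  # stands in for the separate output bus
    for t, xt in enumerate(x):  # x_t is broadcast to every cell
        for c in range(k):
            j = slot[c]
            acc[c] += w[j] * xt  # y = y + W_in * x_in
            if j == k - 1:  # tag bit travels with w_k: the result is complete
                i = t - (k - 1)  # index of the result this cell just finished
                if i >= 0:  # results with i < 0 would need inputs before x_0
                    out[i] = acc[c]
                acc[c] = 0  # the cell starts a new result
        slot = [slot[-1]] + slot[:-1]  # weights move one cell to the right, circulating
    return out


print(b2_convolution([1, 2, 3], [1, 2, 3, 4, 5]))  # [14, 20, 26]
```

Each result stays in one accumulator for k consecutive cycles. That is why multiplier-accumulator hardware fits this design.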
![Page 67: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/67.jpg)
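Types of systolic structure: Design F
Inputs move, weights stay, partial results fan in
• Needs an adder
• Applications: signal processing, pattern matching
[Figure: cells holding W3 W2 W1. The inputs x3 x2 x1 move from cell to cell, and every cell's product $Z_{out}$ fans in to a common ADDER that produces the $y_i$'s. Each cell has input $x_{in}$, outputs $x_{out}$ and $Z_{out}$, and weight W.]
Cell operation: $Z_{out} = W x_{in}$, $x_{out} = x_{in}$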
![Page 68: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/68.jpg)
Design F
- Fan-in results, move inputs, weights stay (semi-systolic convolution arrays with global data communication)
• When the number of cells is large, the adder can be implemented as a pipelined adder tree to avoid a large delay.
• Designs of this type use unbounded fan-in.
![Page 69: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/69.jpg)
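Types of systolic structure: Design R1
Inputs and weights move in opposite directions, results stay
• Can use a tag bit
• No bus (the systolic output path is sufficient)
• One-half of the cells work at any time
• Applications: pattern matching
[Figure: cells accumulating y3 y2 y1. The inputs x1 x2 x3 and the weights W1 W2 move through the array in opposite directions. Each cell has inputs $x_{in}$ and $W_{in}$, outputs $x_{out}$ and $W_{out}$, and accumulator y.]
Cell operation: $y = y + W_{in} x_{in}$, $x_{out} = x_{in}$, $W_{out} = W_{in}$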
![Page 70: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/70.jpg)
Design R1
- Results stay, inputs and weights move in opposite directions (pure-systolic convolution arrays without global data communication)
• Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from the cells.
• The basic idea of this design has been used to implement a pattern-matching chip.
![Page 71: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/71.jpg)
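Types of systolic structure: Design R2
Inputs and weights move in the same direction at different speeds, results stay
• The $x_j$’s move twice as fast as the $w_j$’s
• All cells work at any time
• Needs additional registers (to hold a w value)
• Applications: pipeline multiplier
[Figure: cells accumulating y1 y2 y3. The inputs x3 x2 x1 and the weights (W1 to W5 in the drawing) move in the same direction, with the weights passing through an extra register W in each cell. Each cell has inputs $x_{in}$ and $W_{in}$, outputs $x_{out}$ and $W_{out}$, and registers W and y.]
Cell operation: $y = y + W_{in} x_{in}$, $W = W_{in}$, $W_{out} = W$, $x_{out} = x_{in}$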
![Page 72: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/72.jpg)
Design R2
- Results stay, inputs and weights move in the same direction but at different speeds (pure-systolic convolution arrays without global data communication)
• Multiplier-accumulators can be used effectively, and so can the tag-bit method to signal the output of each cell.
• Compared with R1, all cells work all the time, at the cost of an additional register in each cell to hold a w value.
![Page 73: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/73.jpg)
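Types of systolic structure: Design W1
Inputs and results move in opposite directions, weights stay
• One-half of the cells work at any time
• Constant response time
• Applications: polynomial division
[Figure: cells holding W1 W2 W3. The inputs x1 x2 x3 and the results y move in opposite directions. Each cell has inputs $x_{in}$ and $y_{in}$, outputs $x_{out}$ and $y_{out}$, and weight W.]
Cell operation: $y_{out} = y_{in} + W x_{in}$, $x_{out} = x_{in}$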
![Page 74: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/74.jpg)
Design W1
- Weights stay, inputs and results move in opposite directions (pure-systolic convolution arrays without global data communication)
• This design is fundamental in the sense that it can be naturally extended to perform recursive filtering.
• This design suffers from the same drawback as R1: only approximately 1/2 of the cells work at any given time, unless two independent computations are interleaved in the same array.
![Page 75: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/75.jpg)
Overlapping the executions of multiply-and-add in design W1
![Page 76: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/76.jpg)
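Types of systolic structure: Design W2
Inputs and results move in the same direction at different speeds, weights stay
• All cells work (high throughput rather than fast response)
[Figure: cells holding W1 W2 W3. The inputs x1 ... x7 and the results y1 y2 y3 move in the same direction, with the inputs passing through an extra register x in each cell. Each cell has inputs $x_{in}$ and $y_{in}$, outputs $x_{out}$ and $y_{out}$, weight W, and register x.]
Cell operation: $y_{out} = y_{in} + W x_{in}$, $x = x_{in}$, $x_{out} = x$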
![Page 77: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/77.jpg)
Design W2
- Weights stay, inputs and results move in the same direction but at different speeds (pure-systolic convolution arrays without global data communication)
• This design loses one advantage of W1, the constant response time.
• This design has been extended to implement 2-D convolution, where high throughput rather than fast response is of concern.
![Page 78: Examples of One-Dimensional Systolic Arrays](https://reader034.vdocument.in/reader034/viewer/2022042505/56815e62550346895dcce1ab/html5/thumbnails/78.jpg)
Remarks on Linear Arrays
• The designs above are all possible systolic designs for the convolution problem (some of them are semi-systolic).
• Using a systolic control path, weights can be selected on the fly to implement interpolation or adaptive filtering.
• We need to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment.
• To improve throughput, it may be worthwhile to implement the multiplier and adder separately so that their executions can overlap (as the next page shows).
• When chip pins are considered:
– pure-systolic designs require four I/O ports;
– semi-systolic designs require three I/O ports.
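The remark about implementing the multiplier and adder separately can be quantified with a simple timing model. The symbols are assumptions for illustration, not figures from the slides: $t_{\text{mul}}$ and $t_{\text{add}}$ are the multiplier and adder delays, and $t_{\text{reg}}$ is the overhead of the extra register between them.

$$
T_{\text{combined}} = t_{\text{mul}} + t_{\text{add}}, \qquad
T_{\text{split}} = \max(t_{\text{mul}},\, t_{\text{add}}) + t_{\text{reg}}
$$

For instance, with $t_{\text{mul}} = 2\,t_{\text{add}}$ and negligible $t_{\text{reg}}$, the clock period drops from $3\,t_{\text{add}}$ to $2\,t_{\text{add}}$, i.e. 1.5 times the throughput, while each individual multiply-add now takes two cycles ($4\,t_{\text{add}}$) instead of one.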
FIR circuit: initial design
[Figure: initial FIR circuit; delay elements pipeline the inputs x_i]
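The circuit itself survives only as a figure. As a hedged sketch of a conventional direct-form FIR of the kind it appears to show, assuming a tapped delay line (shift register) for the x_i, one multiplier per tap and a single adder chain, the filter computes y_n = w_0 x_n + w_1 x_{n-1} + ... + w_{k-1} x_{n-k+1}. Note that this standard FIR indexing runs the weights in the opposite order from the convolution y_i = w_1 x_i + ... + w_k x_{i+k-1} used for designs W1 and W2. The name `fir_delay_line` is hypothetical.

```python
def fir_delay_line(w, samples):
    """Direct-form FIR sketch: y_n = w[0]*x_n + w[1]*x_(n-1) + ... + w[k-1]*x_(n-k+1)."""
    taps = [0] * len(w)              # taps[j] holds x_(n-j); delay registers start at zero
    out = []
    for s in samples:
        taps = [s] + taps[:-1]       # clock edge: each delay element passes its value along
        # One multiplier per tap feeding a single adder chain (combinational in this sketch).
        out.append(sum(wj * xj for wj, xj in zip(w, taps)))
    return out


print(fir_delay_line([1, 2, 3], [1, 0, 0, 0]))  # impulse response: [1, 2, 3, 0]
```

In this form the critical path runs through one multiplier and the whole adder chain, which is what the following pipelining steps attack.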
FIR circuit: registers added below weight multipliers
Notice changed timing here
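A sketch of the same direct-form filter with a register latched under each weight multiplier (an assumption about the figure, not a transcription of it) shows why the timing changes: the adder chain now sums the products captured on the previous clock, so every output appears one cycle later, while the multiplier delay is moved off the register-to-register path through the adders. The name `fir_registered_products` is hypothetical.

```python
def fir_registered_products(w, samples):
    """Direct-form FIR sketch with a (hypothetical) register under each weight multiplier."""
    taps = [0] * len(w)              # delay line: taps[j] holds x_(n-j)
    prod = [0] * len(w)              # product registers, one per multiplier
    out = []
    for s in samples:
        # The adder chain sums the products latched on the previous clock edge...
        out.append(sum(prod))
        # ...then the clock edge shifts the delay line and latches fresh products.
        taps = [s] + taps[:-1]
        prod = [wj * xj for wj, xj in zip(w, taps)]
    return out


print(fir_registered_products([1, 2, 3], [1, 0, 0, 0]))  # [0, 1, 2, 3]: one cycle later
```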
FIR Summary: comparison of sequential and systolic
Conclusions on 1D and 1.5D Systolic Arrays
Systolic arrays are more than processor arrays which execute systolic algorithms.
– A systolic cell takes on one of the following forms:
1. A special purpose cell with hardwired functions,
2. A vector-computer-like cell with instruction decoding and a processing element,
3. A systolic processor complete with a control unit and a processing unit.
Smarter processors for SAT, Petrick's method, etc.
Large Systolic Arrays as general-purpose computers
• Originally, systolic architectures were motivated by the need for high-performance special-purpose computational systems that meet the constraints of VLSI.
• However, it is possible to design systolic systems which:
– have high throughputs,
– yet are not constrained to a single VLSI chip.
Problems with systolic array design
1. Hard to design: hard to understand; the low-level realization may be hard to achieve.
2. Hard to explain: remote from the algorithm; the function can't readily be deduced from the structure.
3. Hard to verify
Key architectural issues in designing special-purpose systems
• Simple and regular design: a simple, regular design yields cost-effective special-purpose systems.
• Concurrency and communication: design algorithms that support high concurrency while employing only simple blocks.
• Balancing computation with I/O: a special-purpose system should be matched to a variety of I/O bandwidths.
Two-Dimensional Systolic Arrays
• In 1978, the first systolic arrays were introduced as a feasible design for special-purpose devices that meet VLSI constraints.
• These special-purpose devices were able to perform four types of matrix operations at high processing speeds:
– matrix-vector multiplication,
– matrix-matrix multiplication,
– LU-decomposition of a matrix,
– solution of triangular linear systems.
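To illustrate the first operation in that list, here is one simple linear-array scheme for y = Ax. It is an assumption for this sketch and not necessarily the arrangement used in the 1978 designs: cell i keeps y_i stationary, x_j enters cell 0 at cycle j and moves one cell right per cycle, and the matrix element a_ij is fed into cell i at cycle i + j, so each row of A enters one cycle later than the row above. The name `systolic_matvec` is hypothetical.

```python
def systolic_matvec(A, x):
    """Linear-array sketch of y = A x: cell i keeps y_i, the x_j move right."""
    rows, cols = len(A), len(x)
    assert all(len(r) == cols for r in A), "A must have len(x) columns"
    y = [0] * rows                   # stationary accumulators, one per cell
    pipe = [None] * rows             # (j, x_j) currently inside each cell
    for t in range(rows + cols - 1):
        # Clock edge: every x_j moves one cell right; x_t enters cell 0.
        pipe = [(t, x[t]) if t < cols else None] + pipe[:-1]
        for i, item in enumerate(pipe):
            if item is not None:
                j, xj = item
                # a_ij arrives at cell i from outside at cycle i + j (skewed feed).
                y[i] += A[i][j] * xj
    return y


print(systolic_matvec([[1, 2], [3, 4], [5, 6]], [1, 1]))  # [3, 7, 11]
```

With r rows and c columns the array finishes in r + c - 1 cycles on r cells, compared with rc multiply-adds on a single sequential unit.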
General Systolic Organization
[Figure: a two-dimensional grid of systolic elements]
Example 2: Matrix-Matrix Multiplication
All previously shown tricks can be applied.
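Matrix-matrix multiplication maps naturally onto a two-dimensional grid of processing elements. The simulation below uses one common output-stationary scheme, chosen here as an assumption because the slide's figure is not reproduced: PE(i, j) accumulates c_ij, rows of A flow rightwards, columns of B flow downwards, and row i of A and column j of B enter skewed by i and j cycles so that a_ik and b_kj meet in PE(i, j) at cycle i + j + k. The name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Output-stationary 2-D systolic sketch of C = A B (A is n x m, B is m x p)."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must agree"
    C = [[0] * p for _ in range(n)]
    a_reg = [[None] * p for _ in range(n)]   # A value held in PE(i, j), moving right
    b_reg = [[None] * p for _ in range(n)]   # B value held in PE(i, j), moving down
    for t in range(n + m + p - 2):
        # Clock edge: read the old grids, build the shifted ones.
        new_a = [[None] * p for _ in range(n)]
        new_b = [[None] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:
                    k = t - i                # row i of A enters skewed by i cycles
                    new_a[i][j] = A[i][k] if 0 <= k < m else None
                else:
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    k = t - j                # column j of B enters skewed by j cycles
                    new_b[i][j] = B[k][j] if 0 <= k < m else None
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        # Every PE holding both operands performs one multiply-accumulate.
        for i in range(n):
            for j in range(p):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

For an n×m by m×p product, the grid of n·p cells finishes in n + m + p - 2 cycles; the skewed inputs are the two-dimensional counterpart of the staggered x and y streams in the linear designs.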
Sources
1. Seth Copen Goldstein, CMU; A.R. Hurson
2. David E. Culler, UC Berkeley
3. [email protected]
4. Syeda Mohsina Afroze and other students of Advanced Logic Synthesis, ECE 572, 1999 and 2000.