Distributed Arithmetic: Implementations and Applications A Tutorial

Post on 21-Dec-2015


Page 1: Distributed Arithmetic: Implementations and Applications A Tutorial

Distributed Arithmetic: Implementations and Applications

A Tutorial

Page 2: Distributed Arithmetic: Implementations and Applications A Tutorial

Distributed Arithmetic (DA) [Peled and Liu, 1974]

An efficient technique for calculating a sum of products, also called a vector dot product, inner product, or multiply and accumulate (MAC).

The MAC operation is very common in Digital Signal Processing algorithms.

Page 3: Distributed Arithmetic: Implementations and Applications A Tutorial

So Why Use DA?

The advantages of DA are best exploited in data-path circuit design.

Area savings from using DA can be up to 80% and are seldom less than 50% in digital signal processing hardware designs.

DA is an old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP).

DA efficiently implements the MAC using the basic building blocks (Look-Up Tables) in FPGAs.

Page 4: Distributed Arithmetic: Implementations and Applications A Tutorial

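An Illustration of MAC Operation

The following expression represents a multiply and accumulate operation:

$$y = A_1 x_1 + A_2 x_2 + \cdots + A_K x_K$$

i.e.

$$y = \sum_{k=1}^{K} A_k x_k$$

A numerical example:

$$A = [32, 45, 78, 23], \quad x = [42, 20, -22, 67] \quad (K = 4)$$

$$y = 32 \times 42 + 45 \times 20 + 78 \times (-22) + 23 \times 67$$

$$y = 1344 + 900 + (-1716) + 1541 = 2069$$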
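The numerical example can be reproduced with a few lines of Python. This is only a minimal sketch of a sequential MAC loop; the function name `mac` is a hypothetical helper, and the coefficient and input values are the ones read off the products in the worked example (32 × 42, 45 × 20, 78 × (-22), 23 × 67).

```python
def mac(coeffs, inputs):
    """Multiply-accumulate: return the sum of coeffs[k] * inputs[k]."""
    if len(coeffs) != len(inputs):
        raise ValueError("coeffs and inputs must both have K elements")
    acc = 0
    for a_k, x_k in zip(coeffs, inputs):
        acc += a_k * x_k  # one multiplication and one accumulation per term
    return acc


# Values taken from the worked example (K = 4).
A = [32, 45, 78, 23]
x = [42, 20, -22, 67]
y = mac(A, x)
print(y)
```

Each loop iteration corresponds to one multiplier use, which is the cost that DA later replaces with table look-ups.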

Page 5: Distributed Arithmetic: Implementations and Applications A Tutorial

A Few Points about the MAC

Consider this:

$$y = \sum_{k=1}^{K} A_k x_k$$

Note a few points:

A = [A1, A2, …, AK] is a matrix of “constant” values.

x = [x1, x2, …, xK] is a matrix of input “variables”.

Each Ak is M bits wide.

Each xk is N bits wide.

y should be large enough to accommodate the result.

Page 6: Distributed Arithmetic: Implementations and Applications A Tutorial

A Possible Hardware (NOT DA Yet!!!)

Let $A = [C_1, C_2, C_3, C_4]$, $x = [A, B, C, D]$ $(K = 4)$.

[Figure: hardware block diagram. Labels: multi-bit AND gate; shift registers; registers to hold the sum of partial products; adder/subtractor; shift right.]

Each scaling accumulator calculates $A_i \times x_i$.

Page 7: Distributed Arithmetic: Implementations and Applications A Tutorial

How does DA work?

The “basic” DA technique is bit-serial in nature.

DA is basically a bit-level rearrangement of the multiply and accumulate operation.

DA hides the explicit multiplications behind ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs).

Page 8: Distributed Arithmetic: Implementations and Applications A Tutorial

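Moving Closer to Distributed Arithmetic

Consider once again

$$y = \sum_{k=1}^{K} A_k x_k \qquad \ldots(1)$$

a. Let $x_k$ be an N-bit scaled two’s complement number, i.e. $|x_k| < 1$:

$x_k : \{b_{k0}, b_{k1}, b_{k2}, \ldots, b_{k(N-1)}\}$, where $b_{k0}$ is the sign bit.

b. We can express $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad \ldots(2)$$

c. Substituting (2) in (1),

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$$

$$y = -\sum_{k=1}^{K} (b_{k0} \cdot A_k) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (A_k \cdot b_{kn}) 2^{-n} \qquad \ldots(3)$$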
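The scaled two's complement representation in (2), where the sign bit carries weight -1 and bit n carries weight 2^-n, can be checked numerically. The sketch below is illustrative only; `to_bits` and `twos_complement_fraction` are hypothetical helper names, and the bit list is assumed to be ordered sign bit first.

```python
def to_bits(x, n_bits):
    """Quantize x in [-1, 1) to N-bit two's complement, sign bit first."""
    scale = 1 << (n_bits - 1)
    code = int(round(x * scale))
    if not -scale <= code < scale:
        raise ValueError("x does not fit in the requested word length")
    code &= (1 << n_bits) - 1  # wrap negative codes into N bits
    return [(code >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def twos_complement_fraction(bits):
    """Evaluate x = -b0 + sum_{n=1}^{N-1} b_n * 2^-n."""
    value = -bits[0]  # the sign bit has weight -1
    for n in range(1, len(bits)):
        value += bits[n] * 2.0 ** (-n)
    return value


bits = to_bits(-0.75, 4)
print(bits, twos_complement_fraction(bits))
```

For N = 4 the value -0.75 is stored as 1010: the sign bit contributes -1 and bit 2 contributes +0.25.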

Page 9: Distributed Arithmetic: Implementations and Applications A Tutorial

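Moving Even Closer to DA

$$y = -\sum_{k=1}^{K} (b_{k0} \cdot A_k) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (b_{kn} \cdot A_k) 2^{-n} \qquad \ldots(3)$$

Expanding the second part:

$$y = -\sum_{k=1}^{K} (b_{k0} \cdot A_k) + \sum_{k=1}^{K} \left[ (b_{k1} A_k) 2^{-1} + (b_{k2} A_k) 2^{-2} + \cdots + (b_{k(N-1)} A_k) 2^{-(N-1)} \right]$$

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + (b_{11} A_1) 2^{-1} + (b_{12} A_1) 2^{-2} + \cdots + (b_{1(N-1)} A_1) 2^{-(N-1)} \\
& + (b_{21} A_2) 2^{-1} + (b_{22} A_2) 2^{-2} + \cdots + (b_{2(N-1)} A_2) 2^{-(N-1)} \\
& \;\; \vdots \\
& + (b_{K1} A_K) 2^{-1} + (b_{K2} A_K) 2^{-2} + \cdots + (b_{K(N-1)} A_K) 2^{-(N-1)}
\end{aligned}$$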

Page 10: Distributed Arithmetic: Implementations and Applications A Tutorial

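Moving Still Closer to DA

Regrouping the expanded terms of (3) by bit position $n$ instead of by input $k$:

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 + b_{21} A_2 + \cdots + b_{K1} A_K \right] 2^{-1} \\
& + \left[ b_{12} A_1 + b_{22} A_2 + \cdots + b_{K2} A_K \right] 2^{-2} \\
& \;\; \vdots \\
& + \left[ b_{1(N-1)} A_1 + b_{2(N-1)} A_2 + \cdots + b_{K(N-1)} A_K \right] 2^{-(N-1)}
\end{aligned}$$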

Page 11: Distributed Arithmetic: Implementations and Applications A Tutorial

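Almost there!

Writing the regrouped terms as a sum over the bit position $n$:

$$y = -\sum_{k=1}^{K} (b_{k0} \cdot A_k) + \sum_{n=1}^{N-1} \left[ b_{1n} A_1 + b_{2n} A_2 + \cdots + b_{Kn} A_K \right] 2^{-n}$$

The Final Reformulation

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \qquad \ldots(4)$$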
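As a sanity check, the reformulation (4), which sums over bit positions n with an inner sum over k, should give exactly the same result as the direct sum of products in (1). The sketch below verifies this with exact rational arithmetic. The function names are hypothetical, and each input is assumed to be supplied as an integer two's complement code so that x_k = code_k / 2^(N-1).

```python
from fractions import Fraction


def bits_of(code, n_bits):
    """Two's complement bits of an integer code, sign bit first."""
    code &= (1 << n_bits) - 1
    return [(code >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def direct_mac(A, codes, n_bits):
    """Equation (1): y = sum_k A_k * x_k with x_k = code_k / 2^(N-1)."""
    scale = 1 << (n_bits - 1)
    return sum(a * Fraction(c, scale) for a, c in zip(A, codes))


def rearranged_mac(A, codes, n_bits):
    """Equation (4): outer sum over bit position n, inner sum over k."""
    b = [bits_of(c, n_bits) for c in codes]  # b[k][n]
    y = -sum(a * b_k[0] for a, b_k in zip(A, b))  # sign-bit term
    for n in range(1, n_bits):
        inner = sum(a * b_k[n] for a, b_k in zip(A, b))
        y += inner * Fraction(1, 2 ** n)
    return y


A = [Fraction(72, 100), Fraction(-30, 100), Fraction(95, 100), Fraction(11, 100)]
codes = [3, -6, 7, -8]  # arbitrary 4-bit test inputs
print(direct_mac(A, codes, 4), rearranged_mac(A, codes, 4))
```

Note that the inner sum depends only on the K bits at position n, which is what makes it a candidate for a look-up table.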

Page 12: Distributed Arithmetic: Implementations and Applications A Tutorial

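Let's See the Change of Hardware

Our original equation:

$$y = -\sum_{k=1}^{K} (b_{k0} \cdot A_k) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (b_{kn} \cdot A_k) 2^{-n}$$

Bit-level rearrangement:

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$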

Page 13: Distributed Arithmetic: Implementations and Applications A Tutorial

So where does the ROM come in?

Note this portion. It can be treated as a function of the serial input bits of {A, B, C, D}.

Page 14: Distributed Arithmetic: Implementations and Applications A Tutorial

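The ROM Construction

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \qquad \ldots(4)$$

The bracketed term $\sum_{k=1}^{K} A_k b_{kn}$ has only $2^K$ possible values, i.e.

$$\sum_{k=1}^{K} A_k b_{kn} = f(b_{1n}, b_{2n}, \ldots, b_{Kn}) \qquad \ldots(5)$$

(5) can be pre-calculated for all possible values of $b_{1n}, b_{2n}, \ldots, b_{Kn}$.

We can store these in a look-up table of $2^K$ words addressed by K bits, i.e. $b_{1n} b_{2n} \ldots b_{Kn}$.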
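The pre-calculation of (5) can be sketched as a loop over all 2^K addresses. The function name `build_da_rom` is hypothetical, and the choice of b_1n as the most significant address bit is an assumption made here for illustration.

```python
def build_da_rom(A):
    """Return 2^K words; the word at address (b_1n ... b_Kn) is sum_k A_k * b_kn.

    Assumption: b_1n is the most significant bit of the address.
    """
    K = len(A)
    rom = []
    for address in range(2 ** K):
        word = 0
        for k in range(K):
            bit = (address >> (K - 1 - k)) & 1  # this is b_{(k+1)n}
            if bit:
                word += A[k]
        rom.append(word)
    return rom


rom = build_da_rom([1, 2, 4])  # small integer coefficients, K = 3
print(len(rom), rom)
```

The table holds every possible partial sum of the coefficients, so a single read replaces K one-bit multiplications and K - 1 additions.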

Page 15: Distributed Arithmetic: Implementations and Applications A Tutorial

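Let's See an Example

Let the number of taps be K = 4.

The fixed coefficients are A1 = 0.72, A2 = -0.3, A3 = 0.95, A4 = 0.11.

We need a $2^K = 2^4 = 16$-word ROM.

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \qquad \ldots(4)$$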

Page 16: Distributed Arithmetic: Implementations and Applications A Tutorial

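ROM: Address and Contents

| b1n | b2n | b3n | b4n | Contents |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | A4 = 0.11 |
| 0 | 0 | 1 | 0 | A3 = 0.95 |
| 0 | 0 | 1 | 1 | A3 + A4 = 1.06 |
| 0 | 1 | 0 | 0 | A2 = -0.30 |
| 0 | 1 | 0 | 1 | A2 + A4 = -0.19 |
| 0 | 1 | 1 | 0 | A2 + A3 = 0.65 |
| 0 | 1 | 1 | 1 | A2 + A3 + A4 = 0.76 |
| 1 | 0 | 0 | 0 | A1 = 0.72 |
| 1 | 0 | 0 | 1 | A1 + A4 = 0.83 |
| 1 | 0 | 1 | 0 | A1 + A3 = 1.67 |
| 1 | 0 | 1 | 1 | A1 + A3 + A4 = 1.78 |
| 1 | 1 | 0 | 0 | A1 + A2 = 0.42 |
| 1 | 1 | 0 | 1 | A1 + A2 + A4 = 0.53 |
| 1 | 1 | 1 | 0 | A1 + A2 + A3 = 1.37 |
| 1 | 1 | 1 | 1 | A1 + A2 + A3 + A4 = 1.48 |

$$\sum_{k=1}^{4} A_k b_{kn} = A_1 b_{1n} + A_2 b_{2n} + A_3 b_{3n} + A_4 b_{4n}$$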
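To see the ROM in use, the following sketch runs a bit-serial DA computation with the four example coefficients, stored as integers in units of 0.01 so the arithmetic stays exact. It is an illustration under stated assumptions, not the hardware itself: the function name `da_dot_product` is hypothetical, inputs are given as integer two's complement codes, and the accumulator weights each look-up by 2^n (least significant bit first) instead of shifting right as the hardware accumulator would.

```python
def da_dot_product(rom, codes, n_bits):
    """Bit-serial DA: one ROM look-up per bit position, no multiplier.

    rom   : 2^K words, address bit order b_1n (MSB) ... b_Kn (LSB)
    codes : K integer two's complement inputs, each n_bits wide
    Returns sum_k A_k * code_k, i.e. y scaled by 2^(N-1).
    """
    K = len(codes)
    mask = (1 << n_bits) - 1
    acc = 0
    for n in range(n_bits):  # n = 0 is the least significant bit here
        addr = 0
        for k in range(K):  # gather bit n of every input
            addr = (addr << 1) | (((codes[k] & mask) >> n) & 1)
        term = rom[addr] * (1 << n)
        if n == n_bits - 1:
            acc -= term  # the sign bit has negative weight
        else:
            acc += term
    return acc


A = [72, -30, 95, 11]  # 0.72, -0.30, 0.95, 0.11 in units of 0.01
K = len(A)
rom = [sum(A[k] for k in range(K) if (addr >> (K - 1 - k)) & 1)
       for addr in range(2 ** K)]

codes = [3, -6, 7, -8]  # arbitrary 4-bit test inputs
print(da_dot_product(rom, codes, 4), sum(a * c for a, c in zip(A, codes)))
```

The loop runs N times regardless of K, which is the bit-serial behaviour described earlier: the cost of more taps moves into the ROM size, not the cycle count.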

Page 17: Distributed Arithmetic: Implementations and Applications A Tutorial

Key Issue: ROM Size

The size of the ROM is very important for high-speed implementation as well as area efficiency.

ROM size grows exponentially with each added input address line.

The number of address lines is equal to the number of elements in the vector, i.e. K.

Vectors of 16 or more elements are common, which means $2^{16}$ = 64K words of ROM!

We have to reduce the size of the ROM.

Page 18: Distributed Arithmetic: Implementations and Applications A Tutorial

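A Very Neat Trick:

$$x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \qquad \ldots(6)$$

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

Taking the 2's complement:

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}$$

$$x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(7)$$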

Page 19: Distributed Arithmetic: Implementations and Applications A Tutorial

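Re-Writing $x_k$ in a Different Code

$$x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(7)$$

Define the offset code:

$$c_{kn} = \begin{cases} b_{kn} - \bar{b}_{kn}, & n \neq 0 \\ -(b_{kn} - \bar{b}_{kn}), & n = 0 \end{cases} \qquad \text{where } c_{kn} \in \{-1, 1\}$$

Finally

$$x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad \ldots(8)$$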
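The offset-code expression (8) can be checked against the ordinary two's complement value for every bit pattern of a given width. The sketch below does this with exact fractions. The helper names are hypothetical, and the bit list is assumed to be ordered sign bit first.

```python
from fractions import Fraction


def offset_digits(bits):
    """Map two's complement bits to offset-code digits c_n in {-1, +1}."""
    c = [2 * b - 1 for b in bits]  # b - (1 - b) = 2b - 1
    c[0] = -c[0]  # sign position: c_0 = -(b_0 - not b_0)
    return c


def value_from_offset(c):
    """Equation (8): x = 1/2 * [sum_{n=0}^{N-1} c_n 2^-n - 2^-(N-1)]."""
    n_bits = len(c)
    total = sum(Fraction(c_n, 2 ** n) for n, c_n in enumerate(c))
    return (total - Fraction(1, 2 ** (n_bits - 1))) / 2


def value_from_bits(bits):
    """Equation (2): x = -b_0 + sum_{n=1}^{N-1} b_n 2^-n."""
    return -bits[0] + sum(Fraction(b, 2 ** n) for n, b in enumerate(bits) if n > 0)


bits = [1, 0, 1, 0]  # -0.75 in 4-bit scaled two's complement
print(offset_digits(bits), value_from_offset(offset_digits(bits)))
```

For the pattern 1010 the digits are [-1, -1, 1, -1], and (8) gives (-1.375 - 0.125) / 2 = -0.75, matching (2).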

Page 20: Distributed Arithmetic: Implementations and Applications A Tutorial

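Using the New $x_k$

Substitute the new $x_k$,

$$x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]$$

in here:

$$y = \sum_{k=1}^{K} A_k x_k$$

$$y = \sum_{k=1}^{K} A_k \cdot \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]$$

$$y = \frac{1}{2} \sum_{k=1}^{K} \sum_{n=0}^{N-1} A_k c_{kn} 2^{-n} - \frac{1}{2} \sum_{k=1}^{K} A_k 2^{-(N-1)}$$

$$y = \sum_{n=0}^{N-1} \left[ \frac{1}{2} \sum_{k=1}^{K} A_k c_{kn} \right] 2^{-n} - \left[ \frac{1}{2} \sum_{k=1}^{K} A_k \right] 2^{-(N-1)} \qquad \ldots(9)$$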

Page 21: Distributed Arithmetic: Implementations and Applications A Tutorial

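The New Formulation in Offset Code

$$y = \sum_{n=0}^{N-1} \left[ \frac{1}{2} \sum_{k=1}^{K} A_k c_{kn} \right] 2^{-n} - \left[ \frac{1}{2} \sum_{k=1}^{K} A_k \right] 2^{-(N-1)}$$

Let

$$Q(c_{1n}, c_{2n}, \ldots, c_{Kn}) = \frac{1}{2} \sum_{k=1}^{K} A_k c_{kn}$$

and

$$Q(0) = -\frac{1}{2} \sum_{k=1}^{K} A_k \quad \text{(constant)}$$

Then

$$y = \sum_{n=0}^{N-1} Q(c_{1n}, c_{2n}, \ldots, c_{Kn}) 2^{-n} + 2^{-(N-1)} Q(0)$$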

Page 22: Distributed Arithmetic: Implementations and Applications A Tutorial

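The Benefit: Only Half Values to Store

| b1n | b2n | b3n | b4n | c1n | c2n | c3n | c4n | Contents |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -1 | -1 | -1 | -1 | -1/2 (A1 + A2 + A3 + A4) = -0.74 |
| 0 | 0 | 0 | 1 | -1 | -1 | -1 | 1 | -1/2 (A1 + A2 + A3 - A4) = -0.63 |
| 0 | 0 | 1 | 0 | -1 | -1 | 1 | -1 | -1/2 (A1 + A2 - A3 + A4) = 0.21 |
| 0 | 0 | 1 | 1 | -1 | -1 | 1 | 1 | -1/2 (A1 + A2 - A3 - A4) = 0.32 |
| 0 | 1 | 0 | 0 | -1 | 1 | -1 | -1 | -1/2 (A1 - A2 + A3 + A4) = -1.04 |
| 0 | 1 | 0 | 1 | -1 | 1 | -1 | 1 | -1/2 (A1 - A2 + A3 - A4) = -0.93 |
| 0 | 1 | 1 | 0 | -1 | 1 | 1 | -1 | -1/2 (A1 - A2 - A3 + A4) = -0.09 |
| 0 | 1 | 1 | 1 | -1 | 1 | 1 | 1 | -1/2 (A1 - A2 - A3 - A4) = 0.02 |
| 1 | 0 | 0 | 0 | 1 | -1 | -1 | -1 | -1/2 (-A1 + A2 + A3 + A4) = -0.02 |
| 1 | 0 | 0 | 1 | 1 | -1 | -1 | 1 | -1/2 (-A1 + A2 + A3 - A4) = 0.09 |
| 1 | 0 | 1 | 0 | 1 | -1 | 1 | -1 | -1/2 (-A1 + A2 - A3 + A4) = 0.93 |
| 1 | 0 | 1 | 1 | 1 | -1 | 1 | 1 | -1/2 (-A1 + A2 - A3 - A4) = 1.04 |
| 1 | 1 | 0 | 0 | 1 | 1 | -1 | -1 | -1/2 (-A1 - A2 + A3 + A4) = -0.32 |
| 1 | 1 | 0 | 1 | 1 | 1 | -1 | 1 | -1/2 (-A1 - A2 + A3 - A4) = -0.21 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | -1 | -1/2 (-A1 - A2 - A3 + A4) = 0.63 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1/2 (-A1 - A2 - A3 - A4) = 0.74 |

Inverse symmetry: each entry is the negative of the entry at the bit-complemented address (for example, 0000 holds -0.74 and 1111 holds 0.74), so only half of the table needs to be stored.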
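The inverse symmetry of the offset-code table can be checked programmatically, along with one possible way of serving all 16 addresses from only 8 stored words. This is a sketch under assumptions: the helper names are hypothetical, exact fractions stand in for the fixed-point ROM words, and the mapping c = 2b - 1 is the one used in the table (the sign-bit position, where the digit is negated, is left to the adder/subtractor control).

```python
from fractions import Fraction


def build_offset_rom(A):
    """Word at address (b_1n ... b_Kn) is Q = 1/2 * sum_k A_k * c_kn, c_kn = 2*b_kn - 1."""
    K = len(A)
    rom = []
    for addr in range(2 ** K):
        q = 0
        for k in range(K):
            bit = (addr >> (K - 1 - k)) & 1
            q += A[k] if bit else -A[k]
        rom.append(Fraction(q) / 2)
    return rom


def lookup_with_half_rom(half_rom, addr, K):
    """Serve any address from the 2^(K-1) words that have b_1n = 0."""
    full_mask = (1 << K) - 1
    if addr >> (K - 1):  # b_1n = 1: read the mirrored word and negate it
        return -half_rom[~addr & full_mask]
    return half_rom[addr]


A = [Fraction(72, 100), Fraction(-30, 100), Fraction(95, 100), Fraction(11, 100)]
rom = build_offset_rom(A)
half_rom = rom[:8]  # only the entries with b_1n = 0
print([float(w) for w in rom])
```

Complementing the address and negating the output is one way to exploit the symmetry; the slide's hardware may organise the selection differently.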

Page 23: Distributed Arithmetic: Implementations and Applications A Tutorial

Hardware Using Offset Coding

x1 selects between the two symmetric halves

Ts indicates when the sign bit arrives

Page 24: Distributed Arithmetic: Implementations and Applications A Tutorial

Alternate Technique: Decomposing the ROM

Requires an additional adder to sum the partial outputs
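A minimal sketch of the decomposition idea, assuming the K address bits are split into consecutive groups that each address their own small ROM. The names `build_rom` and `partitioned_lookup` are hypothetical and the eight integer coefficients are arbitrary illustration values. With K = 8, two 16-word ROMs (32 words in total) replace one 256-word ROM at the cost of one extra addition per look-up.

```python
def build_rom(A):
    """ROM of all partial sums of A; the first coefficient maps to the MSB of the address."""
    K = len(A)
    return [sum(A[k] for k in range(K) if (addr >> (K - 1 - k)) & 1)
            for addr in range(2 ** K)]


def partitioned_lookup(roms, group_sizes, addr_bits):
    """Combine the outputs of several small ROMs for one K-bit address.

    addr_bits is the list [b_1n, ..., b_Kn]; each group of bits addresses its own ROM.
    """
    total = 0
    pos = 0
    for rom, size in zip(roms, group_sizes):
        sub_addr = 0
        for bit in addr_bits[pos:pos + size]:
            sub_addr = (sub_addr << 1) | bit
        total += rom[sub_addr]  # the additional adder sums the partial outputs
        pos += size
    return total


A = [3, -1, 4, 1, -5, 9, 2, -6]  # arbitrary coefficients, K = 8
full_rom = build_rom(A)  # 256 words
roms = [build_rom(A[:4]), build_rom(A[4:])]  # 16 + 16 words
print(len(full_rom), sum(len(r) for r in roms))
```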

Page 25: Distributed Arithmetic: Implementations and Applications A Tutorial

Speed Concerns

We considered One Bit At A Time (1 BAAT).

No. of clock cycles required = N.

If K = N, a K-term dot product takes K cycles, i.e. effectively one cycle per product term. Not bad!

Opportunity for parallelism exists, but at the cost of more hardware.

We could have 2 BAAT, or up to N BAAT in the extreme case.

N BAAT: one complete result per cycle.
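The trade-off between bits processed per cycle and cycle count can be illustrated with a small software model. It is only a sketch: `da_baat` is a hypothetical name, the look-ups that hardware would perform in parallel on replicated ROMs are modelled as an inner loop, and the coefficients (in units of 0.01) and inputs are illustrative integers.

```python
def da_baat(rom, codes, n_bits, bits_per_cycle):
    """Model DA processing `bits_per_cycle` input bits per clock cycle.

    Returns (sum_k A_k * code_k, number of clock cycles used).
    """
    K = len(codes)
    mask = (1 << n_bits) - 1
    acc = 0
    cycles = 0
    for start in range(0, n_bits, bits_per_cycle):
        # In hardware, these look-ups run in parallel within one cycle.
        for n in range(start, min(start + bits_per_cycle, n_bits)):
            addr = 0
            for k in range(K):
                addr = (addr << 1) | (((codes[k] & mask) >> n) & 1)
            term = rom[addr] * (1 << n)
            acc += -term if n == n_bits - 1 else term  # sign bit subtracts
        cycles += 1
    return acc, cycles


A = [72, -30, 95, 11]
K = len(A)
rom = [sum(A[k] for k in range(K) if (addr >> (K - 1 - k)) & 1)
       for addr in range(2 ** K)]
codes = [3, -6, 7, -8]  # 4-bit inputs, so N = 4

for bits_per_cycle in (1, 2, 4):
    print(bits_per_cycle, da_baat(rom, codes, 4, bits_per_cycle))
```

The result is the same in every case; only the cycle count changes, from N cycles at 1 BAAT down to a single cycle at N BAAT.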

Page 26: Distributed Arithmetic: Implementations and Applications A Tutorial

Illustration of 2 BAAT

Page 27: Distributed Arithmetic: Implementations and Applications A Tutorial

Illustration of N BAAT

Page 28: Distributed Arithmetic: Implementations and Applications A Tutorial

The Speed Limit: Carry Propagation

The speed in the critical path is limited by the width of the carry propagation.

Speed can be improved by using techniques that limit the carry propagation.

Page 29: Distributed Arithmetic: Implementations and Applications A Tutorial

Speeding Up Further: Using RNS+DA

By using a Residue Number System (RNS), the computations can be broken down into smaller elements which can be executed in parallel.

Since we are operating on smaller arguments, the carry propagation is naturally limited.

So by using RNS+DA, greater speed benefits can be attained, especially for higher-precision calculations.
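A rough sketch of the RNS idea applied to a dot product: each residue channel works modulo a small number, independently of the other channels, and the full result is recovered with the Chinese Remainder Theorem. The moduli, the function names, and the signed-range mapping below are illustrative assumptions, and the per-channel sum is written as a plain MAC where an RNS+DA design would use a small look-up table per channel.

```python
from math import prod

# Pairwise coprime moduli chosen for illustration; dynamic range is 15015.
MODULI = (7, 11, 13, 15)


def rns_mac(A, x, moduli=MODULI):
    """Run one small, independent MAC per residue channel."""
    return tuple(sum((a % m) * (v % m) for a, v in zip(A, x)) % m
                 for m in moduli)


def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction, mapped to a signed range."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_i = M // m
        total += r * M_i * pow(M_i, -1, m)  # modular inverse (Python 3.8+)
    total %= M
    return total - M if total > M // 2 else total


A = [32, 45, 78, 23]
x = [42, 20, -22, 67]
residues = rns_mac(A, x)
print(residues, from_rns(residues))
```

Each channel only ever handles numbers smaller than its modulus, which is why the carry chains stay short. The reconstruction step is only valid while the true result fits inside the dynamic range.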

Page 30: Distributed Arithmetic: Implementations and Applications A Tutorial

Conclusion

Ref: Stanley A. White, “Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review,” IEEE ASSP Magazine, July 1989.

Ref: Xilinx App Note, “The Role of Distributed Arithmetic in FPGA-Based Signal Processing.”