Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions
IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005
Anup Hosangadi
Ryan Kastner
ECE Department, UCSB
Farzan Fallah
Advanced CAD Research
Fujitsu Labs of America
Outline
  Introduction
  Related Work
  Polynomial transformation
  Common subexpression elimination
  Results
  Conclusions
Introduction
  Multiplications by constants encountered in many application areas
    DSP transforms in audio, video, and image processing (DFT, DCT, IDCT, etc.)
    Filtering operations in communication (FIR, IIR filters)
    Multiple Input Multiple Output (MIMO) systems
    Polynomials in computer graphics
Introduction
  Multiplication is expensive in hardware
  Decompose constant multiplications into shifts and additions
    13*X = (1101)2*X = X + X<<2 + X<<3
  Signed digits can reduce the number of additions/subtractions
    Canonical Signed Digits (CSD) (Knuth’74)
    (57)10 = (0111001)2 = (100-1001)CSD
  Further reduction possible by common subexpression elimination
    Up to 50% reduction (R. Hartley, TCS’96)
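The shift-and-add and CSD decompositions can be sketched in Python. This is a minimal illustration; the helper names `to_csd` and `shift_add` are assumptions made here and do not come from the slides.

```python
def to_csd(n):
    """Canonical signed digits of a non-negative integer, most significant first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits[::-1] or [0]


def shift_add(x, digits):
    """Multiply x by the constant encoded in digits (each digit is -1, 0 or +1)."""
    width = len(digits)
    return sum(d * (x << (width - 1 - i)) for i, d in enumerate(digits))


binary_57 = [int(b) for b in format(57, "b")]  # 111001: four nonzero digits
csd_57 = to_csd(57)                            # 100-1001: three nonzero digits
print(csd_57, shift_add(5, csd_57))
```

With k nonzero digits a constant multiplication needs k - 1 additions/subtractions, so the CSD form of 57 saves one operation over the binary form.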
Introduction
  Common subexpressions = common digit patterns
    F1 = 7*X = (0111)*X = X + X<<1 + X<<2
    F2 = 13*X = (1101)*X = X + X<<2 + X<<3
    Common pattern “0101” => D1 = X + X<<2
    F1 = D1 + X<<1
    F2 = D1 + X<<3
    Cost: 4+, 4<< before extraction; 3+, 3<< after
  Good for single variable: FIR filters (transposed form)
  Multiple variables? (DFT, DCT, etc.)
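As a quick check of the digit-pattern example, the sketch below computes F1 and F2 both directly and through the shared term D1. The function names are illustrative only.

```python
def direct(x):
    # 4 additions and 4 shifts in total
    f1 = x + (x << 1) + (x << 2)   # 7 * x
    f2 = x + (x << 2) + (x << 3)   # 13 * x
    return f1, f2


def shared(x):
    # 3 additions and 3 shifts in total
    d1 = x + (x << 2)              # common pattern "0101" = 5 * x
    f1 = d1 + (x << 1)
    f2 = d1 + (x << 3)
    return f1, f2


print(direct(3), shared(3))
```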
Related Work
  Simple bipartite matching (Potkonjak et al., TCAD’95)
    (10101) and (01101) => common pattern = “101”
    (10010) and (010010) => cannot detect pattern “1001”
  Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
    (10010) and (010010) => common pattern “1001”
  Exhaustive enumeration of all digit patterns (Pasko et al., TCAD’99)
    (1011) => “0011”, “1001”, “1010”, “0101”, “1011”
Related Work
  Extending techniques for multiple variables (Potkonjak et al., TCAD’95)

    [Y1]   [a11 a12 a13]   [X1]
    [Y2] = [a21 a22 a23] x [X2]
    [Y3]   [a31 a32 a33]   [X3]

    Y_i = \sum_j S_{ij} X_j + \sum_k C_{ik} D_k
    All distinct S_{ij} X_j and C_{ik} D_k

    Y1: 1 0 1 1 0 0
    Y2: 0 1 1 1 0 1
    Y3: 1 0 0 1 0 1
Related Work
  Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP’04)
    Polynomial transformation of linear systems
    Uses rectangular covering methods
  Limitations
    Cannot find subexpressions with reversed signs, e.g. (X1 - X2<<1) ≠ (X2<<1 - X1)
      Common occurrence when signed digits are used
    Rectangle covering has exponential complexity
  Method to overcome these limitations?
Related Work
  Algebraic methods in multi-level logic synthesis (MLLS)
    Reducing literal count in a set of Boolean expressions
    Factoring, decomposition: established algebraic techniques
    Typically used for thousands of variables and literals
  Apply these methods to optimize linear systems?
    D1 = X1 + X2<<2
    Y1 = D1 + D1<<3 + X1<<3
    Y2 = D1 + X2<<2
Linear systems and polynomial transformation
  View linear systems as a set of arithmetic expressions
    Expressions consisting of +, -, << operators
    Develop methodology for extracting common subexpressions
  Polynomial formulation: C × X = Σ_i (±X × L^i)
    (14)10 × X = (1110)2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
               = (100-10)CSD × X = XL^4 - XL^1
Linear Systems and polynomial transformation
  H.264 Integer Transform

    [Y0]   [1  1  1  1]   [X0]
    [Y1] = [2  1 -1 -2] x [X1]
    [Y2]   [1 -1 -1  1]   [X2]
    [Y3]   [1 -2  2 -1]   [X3]

  Decomposing constant multiplications
    Y0 = X0 + X1 + X2 + X3
    Y1 = X0<<1 + X1 - X2 - X3<<1
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1<<1 + X2<<1 - X3
  Cost: 12+, 4<<
Linear Systems and polynomial transformation
  H.264 Integer Transform

    [Y0]   [1  1  1  1]   [X0]
    [Y1] = [2  1 -1 -2] x [X1]
    [Y2]   [1 -1 -1  1]   [X2]
    [Y3]   [1 -2  2 -1]   [X3]

  Polynomial transformation
    Y0 = X0 + X1 + X2 + X3
    Y1 = X0L + X1 - X2 - X3L
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1L + X2L - X3
  Cost: 12+, 4<<
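The polynomial transformation of the H.264 matrix can be sketched as follows. The term representation `(sign, variable, exponent of L)` and the helper names are assumptions made for illustration. The sketch expands each constant in plain binary, which coincides with CSD for the constants ±1 and ±2 used here.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def row_to_terms(row):
    """Expand one matrix row into (sign, variable, exponent of L) terms."""
    terms = []
    for j, c in enumerate(row):
        sign, mag, exp = (1 if c > 0 else -1), abs(c), 0
        while mag:
            if mag & 1:
                terms.append((sign, f"X{j}", exp))
            mag >>= 1
            exp += 1
    return terms


def show(terms):
    """Render terms in the slide notation, e.g. 'X0L + X1 - X2 - X3L'."""
    parts = []
    for k, (sign, var, exp) in enumerate(terms):
        power = "" if exp == 0 else "L" if exp == 1 else f"L^{exp}"
        if k == 0:
            parts.append(("-" if sign < 0 else "") + var + power)
        else:
            parts.append(("- " if sign < 0 else "+ ") + var + power)
    return " ".join(parts)


exprs = [row_to_terms(row) for row in H264]
adds = sum(len(t) - 1 for t in exprs)                # one operation between adjacent terms
shifts = sum(1 for t in exprs for _, _, e in t if e > 0)
for i, t in enumerate(exprs):
    print(f"Y{i} = {show(t)}")
print(f"{adds}+, {shifts}<<")
```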
Fx algorithm
  Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD’92)
    Popular as the Fast-Extract (Fx) algorithm
  Expression f = gh + r
    g = (ab + c) => double-cube divisor
    g = ab => single-cube divisor
  Fx algorithm for linear systems?
Two-term divisors
  Obtained from every pair of terms in each expression
    Divide by the minimum exponent of L
    e.g. F = X1 + X2L + X3L^3
      {+X2L, +X3L^3}: divide by L => (X2 + X3L^2)
      Divisors = (X1 + X2L), (X1 + X3L^3), (X2 + X3L^2)
  Two divisors intersect if the terms involved are distinct
    (X1 - X2L) ∩ (X1 - X2L) = φ
    (X1 - X2L) ∩ (-X1 + X2L) = φ (reversed signs allowed!)
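Divisor generation for the example F = X1 + X2L + X3L^3 can be sketched as below. Terms are written as `(sign, variable, exponent of L)`; this representation, and the `canonical` helper used to treat a divisor and its negation as the same pattern, are assumptions for illustration.

```python
from itertools import combinations


def two_term_divisors(terms):
    """Return every pair of terms divided by the smaller power of L, with that power."""
    result = []
    for (s1, v1, e1), (s2, v2, e2) in combinations(terms, 2):
        m = min(e1, e2)
        result.append((((s1, v1, e1 - m), (s2, v2, e2 - m)), m))
    return result


def canonical(divisor):
    """Make the leading sign positive so that d and -d map to the same pattern."""
    (s1, v1, e1), (s2, v2, e2) = sorted(divisor, key=lambda t: (t[1], t[2]))
    return ((1, v1, e1), (s1 * s2, v2, e2)), s1


F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]    # F = X1 + X2L + X3L^3
for divisor, shift in two_term_divisors(F):
    print(divisor, "after dividing by L^%d" % shift)

a = ((1, "X1", 0), (-1, "X2", 1))                 # X1 - X2L
b = ((-1, "X1", 0), (1, "X2", 1))                 # -X1 + X2L
print(canonical(a)[0] == canonical(b)[0])         # same pattern, opposite overall sign
```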
Two-term divisors
  Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
  Many divisors with intersections: which one to choose?
    Use greedy selection of the divisor with the most intersections
  Selecting divisors changes the expressions
    Perform concurrent decomposition of expressions
Algorithm (Step 1): Creating the set of divisors
  {Divisors} = φ;
  for each expression Pi {
    {Dnew} = Divisors for Pi;
    {Divisors} = {Divisors} ∪ {Dnew};
    Update frequency statistics of {Divisors};
  }
Algorithm (Step 2): Common Subexpression Elimination
  {Divisors} = Set of all 2-term divisors;
  while (intersections present) {
    Find Best_Divisor in {Divisors};
    {T} = Set of terms involved in intersection;
    {D} = Set of divisors involving any term in {T};
    {Divisors} = {Divisors} - {D};
    Rewrite Expressions;
    {Dnew} = New divisors involving new terms;
    {Divisors} = {Divisors} ∪ {Dnew};
  }
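The two steps can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: it recomputes the full divisor table on every iteration instead of updating it incrementally, breaks ties by first occurrence, and assumes input names do not start with `D`. All function names and the `(sign, variable, exponent of L)` term format are hypothetical.

```python
from itertools import combinations


def find_divisors(exprs):
    """Map each sign-normalized two-term divisor to the places where it occurs."""
    table = {}
    for name, terms in exprs.items():
        for i, j in combinations(range(len(terms)), 2):
            (s1, v1, e1), (s2, v2, e2) = terms[i], terms[j]
            shift = min(e1, e2)                      # divide by minimum power of L
            a, b = sorted([(v1, e1 - shift, s1), (v2, e2 - shift, s2)])
            key = (a[0], a[1], b[0], b[1], a[2] * b[2])
            table.setdefault(key, []).append((name, i, j, shift, a[2]))
    return table


def non_overlapping(occurrences):
    """Keep occurrences that do not reuse a term (greedy, in listed order)."""
    used, kept = set(), []
    for occ in occurrences:
        name, i, j = occ[:3]
        if (name, i) not in used and (name, j) not in used:
            used |= {(name, i), (name, j)}
            kept.append(occ)
    return kept


def eliminate(exprs):
    """Repeatedly extract the two-term divisor with the most intersections."""
    exprs = {name: list(terms) for name, terms in exprs.items()}
    count = 0
    while True:
        best_key, best = None, []
        for key, occs in find_divisors(exprs).items():
            kept = non_overlapping(occs)
            if len(kept) > len(best):
                best_key, best = key, kept
        if len(best) < 2:                            # no intersections left
            return exprs
        new = f"D{count}"
        count += 1
        dropped = {}
        for name, i, j, shift, sign in best:
            dropped.setdefault(name, set()).update((i, j))
            exprs[name].append((sign, new, shift))   # lands after the original indices
        for name, idx in dropped.items():
            exprs[name] = [t for k, t in enumerate(exprs[name]) if k not in idx]
        v1, e1, v2, e2, rel = best_key
        exprs[new] = [(1, v1, e1), (rel, v2, e2)]    # definition of the new divisor


def cost(exprs):
    """Count additions/subtractions and shifts."""
    adds = sum(len(t) - 1 for t in exprs.values())
    shifts = sum(1 for t in exprs.values() for _, _, e in t if e > 0)
    return adds, shifts


def evaluate(exprs, name, inputs):
    """Evaluate an output or divisor for concrete integer inputs."""
    if name in inputs:
        return inputs[name]
    return sum(s * (evaluate(exprs, v, inputs) << e) for s, v, e in exprs[name])


h264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}
reduced = eliminate(h264)
print(cost(h264), cost(reduced))   # (12, 4) (8, 2)
```

On the H.264 transform the sketch extracts the same four divisors as the slides, although it may number them in a different order because of its tie-breaking rule.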
Algorithm complexity
  M x M constant matrix; N digits of precision
  Example rows of N-digit constants (M constants per row, MN digits in total):
    Y0:    1111 1111 1011 1001
    ...
    YM-1:  1111 1110 0011 1010
  Y0 = X0 + X0L + ... + XM-1L^3 + XM-1
  O(MN) terms per expression
    => O(M^2 N^2) divisors
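The divisor bound follows from counting pairs of terms. As a short worked step, with T (a symbol introduced here, not on the slide) denoting the number of terms in one expression:

```latex
T \le MN, \qquad
\binom{T}{2} = \frac{T(T-1)}{2} \le \frac{MN(MN-1)}{2} = O(M^2 N^2)
```

Summing this per-expression bound over the M expressions gives at most O(M^3 N^2) divisor instances, which appears consistent with the O(M^3 N^2) annotation on the Step 1 complexity slide.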
Algorithm (Step 1): Creating the set of divisors, complexity
  Annotations on the Step 1 pseudocode: O(M^2 N^2) distinct divisors; O(M^2 N^2); O(M^3 N^2)
Algorithm (Step 2): Common Subexpression Elimination, complexity
  Annotations on the Step 2 pseudocode: O(M^2 N^2); O(M^2 N^2)
Algorithm
  H.264 example
  >> Select D0 = (X0 + X3)
    Y0 = X0 + X1 + X2 + X3
    Y1 = X0L + X1 - X2 - X3L
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1L + X2L - X3
Algorithm
  H.264 example
  >> Select D1 = (X1 - X2)
    Y0 = D0 + X1 + X2
    Y1 = X0L + X1 - X2 - X3L
    Y2 = D0 - X1 - X2
    Y3 = X0 - X1L + X2L - X3
Algorithm
  H.264 example
  >> Select D2 = (X1 + X2)
    Y0 = D0 + X1 + X2
    Y1 = X0L + D1 - X3L
    Y2 = D0 - X1 - X2
    Y3 = X0 - D1L - X3
Algorithm
  H.264 example
  >> Select D3 = (X0 - X3)
    Y0 = D0 + D2
    Y1 = X0L + D1 - X3L
    Y2 = D0 - D2
    Y3 = X0 - D1L - X3
Final Implementation
  Extracting 4 divisors
    D0 = X0 + X3    Y0 = D0 + D2
    D1 = X1 - X2    Y1 = D1 + D3L
    D2 = X1 + X2    Y2 = D0 - D2
    D3 = X0 - X3    Y3 = D3 - D1L
  Cost
    Original: 12+, 4<<
    Rectangle covering: 10+, 3<<
    Two-term CSE: 8+, 2<<
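The final implementation can be checked numerically against the original matrix form. The sketch below is only a sanity check; the function names are illustrative.

```python
from itertools import product


def direct(x0, x1, x2, x3):
    """Matrix form of the H.264 integer transform from the slides."""
    return (
        x0 + x1 + x2 + x3,
        2 * x0 + x1 - x2 - 2 * x3,
        x0 - x1 - x2 + x3,
        x0 - 2 * x1 + 2 * x2 - x3,
    )


def factored(x0, x1, x2, x3):
    """Final implementation: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return (d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1))


# Exhaustive comparison over a small range of integer inputs
same = all(direct(*x) == factored(*x) for x in product(range(-4, 5), repeat=4))
print(same)
```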
Experimental Setup
  Goal
    Reduction in number of additions/subtractions
    Effect on area/latency on synthesis
    Simulate designs to estimate power consumption
  Transforms
    DCT, IDCT, DFT, DST, DHT
    8x8 constant matrices
    16 digits of precision (CSD representation)
  Compare with
    Potkonjak (TCAD’95)
    RESANDS (Nguyen et al., TVLSI 2000)
    Rectangle Covering (A. Hosangadi et al., ASAP’04)
Experimental Results
  Number of additions/subtractions

  Example    Original (I)   Potkonjak (II)   RESANDS (III)   Rectangle Covering (IV)   Two-term CSE (V)
  DCT        274            202              227             174                       153
  IDCT       242            183              222             162                       143
  RealDFT    253            193              208             165                       144
  ImagDFT    207            178              198             134                       124
  DST        320            238              252             200                       187
  DHT        284            209              211             175                       158
  Average    263.3          200.5            219.7           168.3                     151.5
  Run time                                                   0.81s                     0.08s
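The averages and the relative savings of two-term CSE can be recomputed from the table. The sketch below uses only the operation counts listed above; the dictionary layout is an illustrative choice.

```python
adds = {
    "Original (I)":            [274, 242, 253, 207, 320, 284],
    "Potkonjak (II)":          [202, 183, 193, 178, 238, 209],
    "RESANDS (III)":           [227, 222, 208, 198, 252, 211],
    "Rectangle Covering (IV)": [174, 162, 165, 134, 200, 175],
    "Two-term CSE (V)":        [153, 143, 144, 124, 187, 158],
}


def mean(values):
    return sum(values) / len(values)


ours = mean(adds["Two-term CSE (V)"])
savings = {name: 1 - ours / mean(v) for name, v in adds.items()}
for name, v in adds.items():
    print(f"{name}: average {mean(v):.1f}, two-term CSE saves {savings[name]:.1%}")
```

On these figures, two-term CSE uses about 10% fewer additions/subtractions on average than rectangle covering.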
Experimental results
  Synthesis results (minimum latency constraints)

  Example    Area (library units)           Latency (clock cycles)
             (III)     (IV)      (V)        (III)   (IV)    (V)
  DCT        90667     73311     66759      10      11      10
  IDCT       81868     66864     62883      10      11      10
  R-DFT      90496     69827     64026      10      11      10
  I-DFT      75140     55940     54606      10      10      10
  DST        108101    84715     81214      11      11      11
  DHT        93939     71272     67775      11      11      10
  Average    90110     70322     66211      10.3    10.8    10.2

  (III) RESANDS
  (IV) Rect. Covering
  (V) 2-term CSE
Experimental results
  Power consumption (µW)

  Example    (III)    (IV)     (V)
  DCT        729      504      531
  IDCT       662      547      569
  R-DFT      707      544      554
  I-DFT      644      575      490
  DST        607      718      595
  DHT        598      545      527
  Average    657.8    572.2    544.3

  (III) RESANDS
  (IV) Rect. Covering
  (V) 2-term CSE
Conclusions
  A new technique for eliminating common subexpressions in linear systems
    Fewer operations than known methods
    Much faster than rectangle covering
  Combine with scheduling on given resources
Thank you Questions??