The Design of Application Specific Integrated Circuits with High Level Synthesis Approaches

NSYSU CSE, 2002/9

Shiann-Rong Kuang (鄺獻榮 )

Assistant Professor, Dept. of Computer Science and Engineering

National Sun Yat-Sen University


Outline

Introduction

Novel High Level Synthesis Approaches

Integrated Data Path Synthesis Approach

Pipelined Control Path Synthesis Approach

Dynamic Pipelining Approach

ASIC Design

Binary Arithmetic Coder

– Low-Error Fixed-Width Multipliers

Fuzzy Color Corrector

Future Work


Introduction

High level synthesis

Behavioral description → register transfer level (RTL) description

Data path synthesis and control path synthesis

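Example behavioral description:

t1 = a - b;
t2 = c + t1;
t3 = e - f;
x = d - t2;
y = t1 + t3;

[Figure: synthesized data path for the example, with an adder (+) and a subtractor (-), registers holding the variables, and a controlling FSM.]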
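The scheduling half of data path synthesis can be illustrated on the five statements above. The following is a minimal sketch of as-soon-as-possible (ASAP) scheduling, assuming every operation takes exactly one control step and that primary inputs are available before step 0; the slides do not state which scheduler is used, and the function name `asap_schedule` is hypothetical.

```python
# Behavioral statements from the slide, as (result, operator, left, right).
STATEMENTS = [
    ("t1", "-", "a", "b"),
    ("t2", "+", "c", "t1"),
    ("t3", "-", "e", "f"),
    ("x", "-", "d", "t2"),
    ("y", "+", "t1", "t3"),
]


def asap_schedule(statements):
    """Return {result: control step}, placing each operation as early as
    its operands allow. Statements must be listed in dependency order."""
    step = {}
    for result, _operator, left, right in statements:
        # Primary inputs (a..f) are not in `step`; treat them as ready at -1.
        ready = max(step.get(left, -1), step.get(right, -1))
        step[result] = ready + 1
    return step


SCHEDULE = asap_schedule(STATEMENTS)
```

Under the unit-delay assumption, `t1` and `t3` land in step 0, `t2` and `y` in step 1, and `x` in step 2, so at least two subtractors would be needed in step 0 unless one subtraction is delayed. That trade-off is what couples scheduling to allocation.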



Data Path Synthesis

Module selection, scheduling, and allocation are highly interdependent

If they are solved separately, the best designs may not be explored

Proposed Data Path Synthesis Approach

combine module selection, scheduling, and allocation

general module selection model: module types with different attributes (delay, area, …)

a mixed-vertex compatibility graph (MCG) model, solved globally using partial clique partitioning
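To give a feel for clique partitioning as a resource-sharing tool, here is a minimal sketch of ordinary greedy clique partitioning on a plain compatibility graph. It is a simpler, swapped-in technique, not the partial clique partitioning on a mixed-vertex graph that the slides propose. The operation table `OPS` and the compatibility rule (same operator kind, different control steps) are assumptions made for illustration.

```python
# Hypothetical scheduled operations: name -> (operator kind, control step).
OPS = {
    "op1": ("-", 0),
    "op2": ("+", 1),
    "op3": ("-", 0),
    "op4": ("-", 2),
    "op5": ("+", 1),
}


def compatible(a, b):
    """Two operations may share one functional unit if they are of the
    same kind and execute in different control steps."""
    return OPS[a][0] == OPS[b][0] and OPS[a][1] != OPS[b][1]


def greedy_clique_partition(names):
    """Put each operation into the first clique whose members are all
    compatible with it; open a new clique (a new unit) otherwise."""
    cliques = []
    for name in names:
        for clique in cliques:
            if all(compatible(name, member) for member in clique):
                clique.append(name)
                break
        else:
            cliques.append([name])
    return cliques


CLIQUES = greedy_clique_partition(sorted(OPS))
```

Each resulting clique corresponds to one functional unit instance. Because this sketch fixes the schedule first, it cannot trade a later control step for a shared unit, which is the kind of interdependence the integrated approach is meant to exploit.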


Clock cycle = 100 ns, latency = 5, performance constraint = 500 ns

[Figure: data flow graph of the example with operations -1, +2, -3, -4, +5 on inputs a to f, producing t1, t2, t3, x, y.]

Module library:

m-type     index   cost (area)   delay (ns)
ADD_1      1       200           70
ADD_2      2       100           160
SUB_1      3       240           70
SUB_2      4       120           160
SUB_3      5       60            380
Register           150           2/2
Mux_2              80            3
Mux_3              120           3
wire               100           5

[Figure: two alternative schedules over c-steps 0 to 4 with their data paths, one built from ADD_1, SUB_2, and SUB_3 and the other from ADD_2 and SUB_1.]

Cost comparison:

                circuit 1   circuit 2
module cost     340         380
MUX cost        200         80
wire cost       1200        1100
register cost   900         900
total cost      2640        2460
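The cost comparison can be reproduced from the module library. In the sketch below the area values come from the slide; the component counts (numbers of multiplexers, wires, and registers per circuit) are not given in the slides and are inferred here from the cost rows, so treat them as assumptions.

```python
# Area cost per module type, from the module library on the slide.
AREA = {
    "ADD_1": 200, "ADD_2": 100,
    "SUB_1": 240, "SUB_2": 120, "SUB_3": 60,
    "Register": 150, "Mux_2": 80, "Mux_3": 120, "wire": 100,
}

# Component counts are inferred from the cost rows (assumption):
# e.g. a register cost of 900 at 150 per register implies 6 registers.
CIRCUIT_1 = {"ADD_2": 1, "SUB_1": 1, "Mux_2": 1, "Mux_3": 1,
             "wire": 12, "Register": 6}
CIRCUIT_2 = {"ADD_1": 1, "SUB_2": 1, "SUB_3": 1, "Mux_2": 1,
             "wire": 11, "Register": 6}


def total_area(circuit):
    """Sum area over all components of a circuit."""
    return sum(AREA[module] * count for module, count in circuit.items())


TOTAL_1 = total_area(CIRCUIT_1)
TOTAL_2 = total_area(CIRCUIT_2)
```

The circuit with the more expensive functional units ends up cheaper overall because it needs less multiplexing and wiring, which is why module selection cannot be judged on module cost alone.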


Find all feasible assignments

MCG transformations

[Figure: the MCG for the example at each step. Recoverable annotations:
- Initial MCG: |V1| = 30, |V2| = 0 (nodes A130 through A523).
- MCG after iteration 1: best decision A140, which opens a new instance (a subtractor).
- MCG after iteration 2: best decision (A343, 1), which uses the old subtractor instance.
- MCG after iteration 3 and final MCG: instance 1 = {A140, A343}, instance 2 = {A450}, instance 3 = {A212, A514}; |V1| = 0, |V2| = 3.
The resulting data path uses ADD_1, SUB_2, and SUB_3 over c-steps 0 to 4.]



Experiments and Results

Module libraries Lib 1 to Lib 3:

m-type     Lib 1 area   Lib 2 area   Lib 3 area   delay (ns)
LT         1020         102          102          50
ADD        1920         192          192          50
SUB        2240         224          224          50
MUL        80460        8046         4000         50
Register   150          1500         150          2/2
Mux_2      64           64           6400         2
Mux_3      96           96           9600         2
Mux_4      128          128          12800        2
Mux_5      160          160          16000        2
Mux_6      192          192          19200        2
wire       100          100          1000         2

Module library Lib 4:

m-type     area    delay (ns)
ADD_1      1500    55
ADD_2      500     170
ADD_3      300     220
MUL_1      17000   80
MUL_2      8000    180
MUL_3      3000    260
Register   1200    2/2
Mux_2      0       0
Mux_3      0       0
Mux_4      0       0
wire       0       0



Main Idea of Pipelining Control Path

[Figure: six panels (a) to (f) showing the step-by-step transformation of a controller with input i(t), state s(t), logic blocks SL and OL, and state registers (SRs) into "pipelined circuit 2", which works on i(t+1) and adds the registers CRs and PRs.]



Proposed Control Path Synthesis Approach

A problem: pipelining the control path may violate control dependencies

Solution: modify the original BSTG by inserting no-operation (NOOP) states

Theorem

A BSTG satisfies all control dependencies if the distance Dij between the states of each produce-consume state pair <Si, Sj>c satisfies one of the following conditions:

Condition 1: if Sj is not a branch state, then Dij ≥ k.

Condition 2: if Sj is a branch state, then Dij ≥ 2k-1.

Nij: the minimal number of NOOPs that must be inserted between <Si, Sj>c

Nij = 2k-Dij-1, if Sj is a branch state;

Nij = k-Dij, otherwise.

Minimize the total number of NOOPs using an ILP formulation
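The per-pair NOOP count follows directly from the two conditions of the theorem. The sketch below computes it for a single produce-consume pair; clamping the result at zero (no NOOPs when the distance already satisfies the condition) is an assumption added here, and the function name is hypothetical. It does not attempt the ILP that minimizes the total over all pairs.

```python
def noops_needed(distance, k, consumer_is_branch):
    """Minimal NOOPs for one produce-consume pair <Si, Sj>c.

    distance           -- Dij, the current distance between Si and Sj
    k                  -- the pipelining parameter k from the theorem
    consumer_is_branch -- True if Sj is a branch state
    """
    # Condition 2 requires Dij >= 2k-1; Condition 1 requires Dij >= k.
    required = 2 * k - 1 if consumer_is_branch else k
    # Clamp at zero: a pair that already satisfies its condition needs nothing.
    return max(0, required - distance)
```

For example, with k = 2 a pair at distance 1 needs one NOOP when Sj is an ordinary state and two NOOPs when Sj is a branch state.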


[Figure: example. The scheduled CDFG (SCDFG) contains the comparisons >1, >2, >3 and the operations +1, -2, +3, +4, +5, -6, +7, -8, -9, +10, +11. Its BSTG has states S1 to S9 with branch conditions c1, c2, c3 produced by >1, >2, >3. The modified BSTG inserts the NOOP states N1 to N4. The pipelined control path ("pipelined circuit k") consists of blocks CL1 to CLk, state registers, control registers, and registers PR1 to PRk-1.]


[Figure: four plots for the 5_EWF and Cond2 examples versus k (0 to 12); the recoverable axis labels are PCPk, lits, and PRs.]



Pipelining

In most existing pipelining techniques, the latency is fixed or takes one of a few fixed values

In some ASIC loops, variable loop execution length and time-relative data dependencies between iterations make pipelining inefficient or impossible

Dynamic pipelining
– A new loop scheduling approach that pipelines the loop using variable latencies

– The controller consists of two interacting finite state machines

while(c1) { while(c2) { } }


[Figure: iterations i to i+4 plotted against time in three phases with latency 5, 6, and 4; marked stages are those in which no operation is performed.]



An Example of Dynamic Pipelining

j = 1;
while (N > j) { /* N is the number of data items to be sorted */
    i = j - 1;
    temp = a[j];
    /* test i >= 0 first so that a[i] is never read with i == -1 */
    while (i >= 0 && temp < a[i]) {
        a[i+1] = a[i];
        i = i - 1;
    }
    a[i+1] = temp;
    j++;
}


State   Statement                                   Operation
S1:     j=1;                                        o1
S2:     O_loop: if (N <= j) goto End_O;             o2
        i=j-1;                                      o3
        r_add=j;                                    o4
S3:     j++;                                        o5
        temp=a[r_add];                              o6
S4:     I_loop: r_add=i;                            o7
S5:     data=a[r_add];                              o8
S6:     w_add=i+1;                                  o9
        if (!(temp<data && i >= 0)) goto End_I;     o10
S7:     a[w_add]=data;                              o11
        i=i-1; goto I_loop;                         o12
S8:     End_I: a[w_add]=temp; goto O_loop;          o13
        End_O:

[Figure: BSTG of the scheduled code. Init leads to S1 [o1], followed by S2 [o2, o3, o4], S3 [o5, o6], S4 [o7], S5 [o8], S6 [o9, o10], S7 [o11, o12], and S8 [o13]; S4 to S7 form the inner loop; transitions are labeled with the conditions ci and co.]
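The cost of executing this schedule sequentially can be estimated by counting state visits. The following is a minimal sketch, assuming one clock cycle per state S1 to S8 and one cycle for the final exit test in S2; these timing assumptions and the function name `sequential_cycles` are not stated in the slides.

```python
def sequential_cycles(data):
    """Run the scheduled insertion sort on a copy of `data` and return
    (state visits, sorted list), charging one cycle per visited state."""
    a = list(data)
    n = len(a)
    cycles = 1                  # S1: j = 1
    j = 1
    while True:
        cycles += 1             # S2: outer loop test, i = j-1, r_add = j
        if n <= j:
            break
        i = j - 1
        temp = a[j]
        j += 1
        cycles += 1             # S3: j++, temp = a[r_add]
        while True:
            cycles += 3         # S4, S5, S6: read a[i] and test the condition
            if not (i >= 0 and temp < a[i]):
                break
            a[i + 1] = a[i]
            i -= 1
            cycles += 1         # S7: shift one element
        a[i + 1] = temp
        cycles += 1             # S8: write temp back
    return cycles, a
```

Under these assumptions each outer iteration costs 6 cycles plus 4 per shifted element. An already sorted 10-element input gives 56 cycles and a reversed one gives 236, the same values as the sequential column for Data1 and Data2 in the experimental results; the slides do not say what those data sets contain, so the match is only suggestive.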


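BSTG Partitioning

[Figure: the BSTG is split into an outer-loop graph BSTGo (Init, S1, S2, S3, Noopi, S8, with condition co) and an inner-loop graph BSTGi (S4, S5, S6, S7, Noopo, with done = ci). The two graphs interact through the signals start and done.]

Inner Loop Pipelining

[Figure: the original PBSTGi overlaps S6 and S7 of iteration i with S4 and S5 of iteration i+1; the new PBSTGi folds them into two pipeline states, PS1 and PS2, that pair S6 of iteration i with S4 of iteration i+1 and S7 of iteration i with S5 of iteration i+1.]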


Outer Loop Pipelining

[Figure: the outer loop body S2, S3, S4, S5, N1, N2, N3, S8 is unwound four times and overlapped across iterations i-1, i, i+1, i+2 (panels labeled L = 1, L = 2, L = 3) to obtain the repeating pipeline body and the new BSTGo.]

[Figure: final PBSTGo, in which Init and S1 lead through start-up states into a repeating pipeline body of three states PS1, PS2, PS3 that combine S8 and N3 of iteration i, S5, N1, and N2 of iteration i+1, and S2, S3, and S4 of iteration i+2, together with Noop states; and final PBSTGi with pipeline states PS1 and PS2 and Noopo. Transitions are labeled with co, done, and start.]


Controller Architecture

[Figure: the controller consists of an inner controller and an outer controller, each built from combinational logic and state registers, with a Mux on the control signals. Signals shown: run, start, done, co, ci, and the control signals from and to the datapath. Annotations refer to Eq. (3.3), Eq. (3.4), and Eq. (3.5).]

Datapath Allocation


An execution example

[Figure: timing diagram of the inner and outer controllers over several outer iterations with latencies 7, 3, 5, and 3, showing state(i), state(o), and the signals done, start, and run. The sequential state sequences of the iterations shown are:

S2 S3 S4 S5 S6 S7 S4 S5 S6 S8
S2 S3 S4 S5 S6 S7 S4 S5 S6 S7 S4 S5 S6 S8
S2 S3 S4 S5 S6 S7 S4 S5 S6 S7 S4 S5 S6 S7 S4 S5 S6 S8
S2 S3 S4 S5 S6 S8]


Experimental Results

Comparing results of insertion sorter

Other examples

example   data size   sequential   dynamic pipelining   speedup
Data1     10          56           31                   1.81
Data2     10          236          121                  1.95
Data3     10          108          57                   1.89
Data4     10          112          59                   1.90
Data5     100         596          301                  1.98
Data6     100         20396        10201                2.00
Data7     100         9980         4993                 2.00
Data8     100         10176        5091                 2.00
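The table can be checked against a simple cost model. The sketch below assumes the sequential count is 6(n-1) + 4s + 2 and the pipelined count is 3(n-1) + 2s + 4, where n is the data size and s the total number of element shifts. Both formulas are inferred here, not given in the slides: 3 + 2m per outer iteration agrees with the latencies 3, 5, and 7 in the execution example, and the constant 4 is a fitted start-up overhead.

```python
# (example, data size, sequential count, dynamic pipelining count) from the table.
RESULTS = [
    ("Data1", 10, 56, 31),
    ("Data2", 10, 236, 121),
    ("Data3", 10, 108, 57),
    ("Data4", 10, 112, 59),
    ("Data5", 100, 596, 301),
    ("Data6", 100, 20396, 10201),
    ("Data7", 100, 9980, 4993),
    ("Data8", 100, 10176, 5091),
]


def inferred_shifts(n, sequential):
    """Invert the assumed model sequential = 6*(n-1) + 4*shifts + 2.
    Returns (shifts, remainder); a nonzero remainder means the model fails."""
    return divmod(sequential - 6 * (n - 1) - 2, 4)


def predicted_pipelined(n, shifts):
    """Assumed model: 3 cycles per outer iteration, 2 per shift, 4 overhead."""
    return 3 * (n - 1) + 2 * shifts + 4


def check(results):
    """For each row: (name, inferred shifts, remainder, prediction matches)."""
    rows = []
    for name, n, sequential, pipelined in results:
        shifts, remainder = inferred_shifts(n, sequential)
        matches = predicted_pipelined(n, shifts) == pipelined
        rows.append((name, shifts, remainder, matches))
    return rows


ROWS = check(RESULTS)
```

Under this model every shift costs 4 cycles sequentially and 2 cycles pipelined, so the speedup approaches 2 as the shifts dominate, which is consistent with the 2.00 entries for the larger data sets.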


Binary Arithmetic Coder

Adaptive Binary Arithmetic Coder

Q-coder: compresses mainly bilevel image data

Goal: a compression chip universal enough to compress any type of data quickly while still achieving a good compression ratio

Proposed modified hardware algorithm
– a new probability estimation modeler using a table-look-up approach

– a technique that solves the carry-over and source termination problems

– a fixed-width parallel multiplier

VLSI chip


Encoding Algorithm

Encoding() {
    C=0x00; A=0xff; R=0x0000; S=0000000000;
    for (each input binary symbol) {
        phase1: Generate P('0'|S) by Eq. (4.5);
        phase2: AP=A*P('0'|S);
                if (input symbol=='0') A=AP;
                else {
                    A=A-AP; C=C+AP;
                    if (carry occurs) R++;
                }
                Update the adaptive modeler by Eq. (4.6);
                Shift the input symbol into S;
        phase3: while (MSB of A==0) normalization_of_encoding();
    }
    Encode LPS and then output 17 consecutive '1's;
}
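The interval update and normalization phases can be tried out in software. The sketch below is a simplified stand-in, not the chip's algorithm: it uses a fixed probability `p0` (scaled to 2..255) in place of the adaptive modeler of Eq. (4.5) and Eq. (4.6), keeps C as an unbounded integer instead of an 8-bit register with the carry counter R, and omits the termination step. The decoder is included only to show that the encoder's output is decodable.

```python
def encode(bits, p0):
    """Encode a list of 0/1 symbols; returns (code value, total shifts).

    p0 is a fixed probability of '0' scaled to the range 2..255 (assumption).
    """
    a, c, shifts = 0xFF, 0, 0
    for bit in bits:
        ap = (a * p0) >> 8          # phase 2: AP = A * P('0'|S), 8 bits kept
        if bit == 0:
            a = ap                  # '0' takes the lower sub-interval
        else:
            c += ap                 # '1' takes the upper sub-interval
            a -= ap
        while a < 0x80:             # phase 3: shift until the MSB of A is 1
            a <<= 1
            c <<= 1
            shifts += 1
    return c, shifts


def decode(code, total_shifts, count, p0):
    """Recover `count` symbols by replaying the encoder's interval updates."""
    a, c, shifts = 0xFF, 0, 0
    out = []
    for _ in range(count):
        ap = (a * p0) >> 8
        # Compare at the final scale so no bits of the code value are lost.
        if code < ((c + ap) << (total_shifts - shifts)):
            out.append(0)
            a = ap
        else:
            out.append(1)
            c += ap
            a -= ap
        while a < 0x80:
            a <<= 1
            c <<= 1
            shifts += 1
    return out
```

With `p0` in 2..255 both sub-intervals stay non-empty, so the normalization loop always terminates. Skewed inputs need fewer shifts than balanced ones, which is where the compression comes from.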


System Architecture

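[Figure: block diagram of the adaptive coder. An Adaptive Modeler (context S, output P('0'|S)), an Arithmetic Operation Unit, and a Normalization Unit operate on the registers A and C (updated values A' and C'); an asynchronous Input/Output path with handshaking signals carries In_Data and Out_Data; the control path contains En_CL, De_CL, and state registers, with inputs Init and En/De. Other signals shown: En_Input, En_Output, De_Input, De_Output, Shift_In, input symbol, output symbol.]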


characteristic        proposed chip
technology            TSMC 0.8 µm SPDM
supply voltage        5 V
package               40 LD S/B
operation frequency   25 MHz
chip area             4.2 × 4.5 mm²
chip complexity       54k gates
scan-mode             yes


File compression ratio of different coding methods

type     name            ST_32    MF1      MF2      MDF      Huffman   LZW
Text     t07             62.60%   58.02%   61.27%   64.55%   47.13%    63.32%
Text     t18             63.00%   58.05%   60.57%   63.83%   44.70%    63.59%
Text     t19             65.31%   60.89%   63.86%   67.65%   48.39%    65.78%
Text     t20             60.99%   57.32%   59.37%   62.70%   43.52%    61.04%
Text     t21             66.97%   61.61%   63.46%   67.35%   46.57%    67.11%
Text     t22             57.69%   55.21%   58.15%   61.67%   45.25%    59.13%
Text     average         63.57%   59.03%   61.51%   65.03%   45.91%    64.00%
Binary   t03             26.83%   18.59%   20.87%   21.42%   14.74%    14.41%
Binary   t09             40.90%   32.74%   35.50%   37.72%   27.43%    38.55%
Binary   t10             43.73%   29.68%   31.49%   32.38%   20.21%    42.06%
Binary   t13             49.87%   35.16%   37.34%   38.87%   22.82%    47.99%
Binary   t14             27.44%   20.89%   22.65%   23.06%   16.44%    18.62%
Binary   t23             46.57%   36.82%   40.03%   43.39%   29.61%    44.71%
Binary   average         44.04%   30.65%   32.65%   33.79%   21.04%    41.58%
Image    t15             49.37%   45.01%   46.55%   48.12%   10.31%    38.82%
Image    t05             19.69%   20.15%   21.26%   21.46%   4.22%     6.40%
Image    t12             49.77%   47.14%   48.62%   49.82%   12.43%    39.84%
Image    t24             9.41%    8.44%    9.15%    8.69%    10.56%    0.00%
Image    t16             24.86%   22.66%   23.55%   23.46%   3.09%     10.23%
Image    t17             36.64%   35.58%   36.79%   37.48%   12.20%    26.18%
Image    average         36.19%   34.12%   35.35%   36.04%   9.80%     25.22%
         total average   42.32%   33.61%   35.36%   36.47%   18.39%    36.91%


Dynamic Pipelining Design

files     file size (bytes)   C-rate   C-speed (S) (Mbit/sec)   C-speed (P) (Mbit/sec)   speedup
Text1     73622               49.1%    2.99                     6.10                     2.04
Text2     51740               52.8%    3.00                     6.12                     2.04
Text3     108501              51.5%    3.00                     6.12                     2.04
Text4     23680               49.6%    2.97                     6.13                     2.06
Binary1   163840              23.2%    2.86                     6.01                     2.10
Binary2   147456              34.8%    2.91                     6.03                     2.07
Binary3   1064960             36.3%    2.94                     6.05                     2.06
Binary4   98304               40.9%    2.93                     6.06                     2.07
Image1    345600              42.0%    2.99                     6.01                     2.01
Image2    245760              13.9%    2.85                     5.98                     2.10
Image3    921856              41.8%    2.94                     6.01                     2.04
Image4    345600              28.1%    2.96                     5.98                     2.02


Low-Error Fixed-Width Multipliers

Fixed-Width Multiplier

The multiplication operations used in many ASICs have the special fixed-width property

Directly omitting about half the adder cells of the conventional parallel multiplier introduces a significant error into the product

Low-Error Fixed-Width Multiplier

low-error fixed-width sign-magnitude multipliers

low-error fixed-width two’s complement multipliers

reduced width multiplier (n < m < 2n)
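The size of the error caused by direct truncation can be measured by brute force. The sketch below assumes unsigned n-bit operands and models "omitting about half the adder cells" as dropping every partial product x_i*y_j with i + j < n; it does not implement any of the low-error compensation schemes from the slides, and the function names are hypothetical.

```python
def truncated_product(x, y, n):
    """Sum only the partial products x_i * y_j with i + j >= n,
    i.e. the ones that feed the upper n bits of the 2n-bit product."""
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n and (x >> i) & 1 and (y >> j) & 1:
                total += 1 << (i + j)
    return total


def max_truncation_error(n):
    """Largest value of x*y - truncated_product(x, y, n) over all
    unsigned n-bit operand pairs (exhaustive search)."""
    return max(
        x * y - truncated_product(x, y, n)
        for x in range(1 << n)
        for y in range(1 << n)
    )
```

For n = 4 the worst case is x = y = 15, where the dropped partial products sum to 49, about three times the weight 2^4 of the least significant retained bit. In general the dropped terms can sum to (n-1)·2^n + 1, so the worst-case error grows with n unless a compensation term is added.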



Fixed-width sign-magnitude multipliers

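[Equations, partially recoverable. The partial products of the least-significant half are grouped by column weight:

\[
\frac{1}{2}\left(x_{n-1}y_0 + x_{n-2}y_1 + \cdots + x_1y_{n-2} + x_0y_{n-1}\right)
+ \frac{1}{4}\left(x_{n-2}y_0 + x_{n-3}y_1 + \cdots + x_1y_{n-3} + x_0y_{n-2}\right)
+ \frac{1}{8}\left(x_{n-3}y_0 + \cdots + x_0y_{n-3}\right)
+ \cdots + \frac{1}{2^{n}}\,x_0y_0
\]

where the first group is the sum over all partial products with i + j = n-1:

\[
\sum_{i+j=n-1} x_i y_j = x_{n-1}y_0 + x_{n-2}y_1 + \cdots + x_1y_{n-2} + x_0y_{n-1}
\]

and the remaining groups carry the weights 1/2, 1/4, ..., 1/2^{n-1} relative to it.

Theorem: given this sum, the slide states a result that distinguishes the case in which the sum equals 0 from the case in which it is greater than 0; the symbols and bounds of the statement did not survive extraction.]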


Sign-magnitude multiplier

X = x5 x4 x3 x2 x1 x0, Y = y5 y4 y3 y2 y1 y0

[Figure: 6 × 6 array multiplier built from half adders (Ha) and full adders (Fa) producing P0 to P11, next to the fixed-width version that keeps only the cells producing P6 to P11 and replaces the omitted half with a circuit labeled Cg. Cg is built from AO cells (AO1 to AO4) and an AG cell fed by x5y0, x4y1, x3y2, x2y3, x1y4, and x0y5, with internal signals C1 to C5 and O1 to O4.]


Two's complement multiplier

[Figure: 6 × 6 array multiplier built from half adders (Ha) and full adders (Fa) producing P0 to P11 with a constant 1 input, and the fixed-width version with output labeled P'M that keeps only the cells producing P6 to P11. Its Cg circuit is built from OR cells fed by x5y0, x4y1, x3y2, x2y3, x1y4, and x0y5.]


Reduced width multiplier

[Figure: the full 6 × 6 array multiplier producing P0 to P11 (output labeled P'M) and the reduced-width version that keeps the cells producing P5 to P11, with a Cg circuit built from AO cells feeding the retained columns.]



Error comparison

Two's complement multipliers; three groups of error figures, in the order shown on the slide:

multiplier   n=4      n=8      n=12       n=16
MP'          48       1792     45056      983040
M1           32       1280     32768      720896
M2           16       768      20480      458752
MF           16       512      8192       196608
MR           16       256      4096       131072

MP'          13.750   450.75   11267.75   245764.7
M1           5.375    187.89   3927.92    74497.4
M2           1.500    130.50   4099.50    98308.5
MF           0.938    65.04    1570.32    30403.7
MR           0.125    26.76    731.39     14629.6

(normalized so that MF = 1)
MP'          12.9     3.7      5.1        7.1
M1           5.6      2.5      2.3        2.8
M2           1.6      1.9      2.5        3.2
MF           1        1        1          1
MR           0.1      0.4      0.5        0.5      (0.01)


Application

[Figure: six image panels: (a) original, (b) M1, (c) MF, (d) MR1, (e) MR2, (f) MS.]


Fuzzy Color Corrector

Fuzzy Color Correction

In the previous literature, the color correction process was modeled as a three-level fuzzy tree inference process

– that algorithm is inefficient, so its hardware implementation is costly and slow

A new, efficient fuzzy tree inference algorithm suited to the center-of-gravity defuzzification method is proposed

[Figure: Original Color Image, Color Scanner, RGB, Fuzzy Correction System, CMY, Color Printer, Printed Color Image.]


Modified fuzzy color correction algorithm

Init:      L=1;
S1:        while (input pattern Xi != NULL) {
S1:            Calculate the address of rule memory (ROM);
S2, S3:        s1=ROM[address++]; D=s1;
S4:            k=0; PathL=0; d=ROM[address];
S5:            while (k<8 && D>0) {
S6:                D=d; PathL=k; k++;
               }
S5:            if (1 <= k <= 7 && |D| d/2) PathL=k;   /* the relational operator between |D| and d/2 is missing in the source */
S7~S13:        Calculate Xo using Eq. (6.6);
S7:            if (++L==4) L=1;
           }


Proposed Sequential Architecture

[Figure: datapath with ROM and ROM2, an ALU, an adder, a two's complement unit, shifters (<<4, <<1, >>1), and the registers and signals s1, d, L, k, D, |D|, Path1, Path2, address, temp1, temp2, wk, Xi, and Xo.]


Dynamic pipelined Design

pictures   file size (bytes)   sequential   dynamic pipelining   L       speedup
Pic1       148416              2704900      1370164              9.25    1.97
Pic2       230604              4345076      2193924              10.39   1.98
Pic3       974916              16898438     8139846              8.35    2.08
Pic4       1137198             20268056     10356836             9.11    1.96


Future Work

System-on-a-Chip (SoC) Platform

NNI: NoC Network Interface (ISO-OSI 7-Layer RM)

[Figure: a CPU core, video encoder IP, video decoder IP, rate control IP, memory, R-FPGA, video camera input, display output, and other components attach through VCI and NNI blocks to an on-chip interconnection network, which also connects to external networks.]


References

[1] Jer-Min Jou, Shiann-Rong Kuang, Yeu-Horng Shiau, and Ren-Der Chen, “Design of A Dynamic Pipelined Architecture for Fuzzy Color Correction”, to be published in IEEE Transactions on VLSI Systems, 2002.

[2] Jer-Min Jou, Yeu-Horng Shiau, Pei-Yin Chen, and Shiann-Rong Kuang, “A Low Cost Gray Prediction Search Chip for Motion Estimation”, Vol. 49, No. 7, pp. 928-938, July 2002.

[3] Shiann-Rong Kuang, Jer-Min Jou, Ren-Der Chen, and Yeu-Horng Shiau, “Dynamic Pipeline Design of an Adaptive Binary Arithmetic Coder,” IEEE Transactions on Circuits & Systems Part II, Vol. 48, No. 9, pp. 813-825, September 2001.

[4] Jer-Min Jou, Shiann-Rong Kuang, and Ren-Der Chen, “Design of Low-Error Fixed-Width Multipliers for DSP Applications,” IEEE Transactions on Circuits & Systems Part II, Vol. 46, No. 6, pp. 836-842, June 1999.



[5] Jer-Min Jou, Shiann-Rong Kuang, and Ren-Der Chen, “A New Efficient Fuzzy Algorithm for Color Correction,” IEEE Transactions on Circuits & Systems Part I, Vol. 46, No. 6, pp. 773-775, June 1999.

[6] Shiann-Rong Kuang, Jer-Min Jou, and Yuh-Lin Chen, “The Design of an Adaptive On-Line Binary Arithmetic Coding Chip,” IEEE Transactions on Circuits & Systems Part I, Vol. 45, No. 7, pp. 693-706, July 1998.

[7] Jer-Min Jou and Shiann-Rong Kuang, “Design of a low-error fixed-width multiplier for DSP applications,” Electronics Letters, Vol. 33, No. 19, pp. 1597-1598, 1997.

[8] Jer-Min Jou and Shiann-Rong Kuang, “A Library-Adaptively Integrated High Level Synthesis System,” Proceedings of NSC – Part A: Physical Science and Engineering, Vol. 19, No. 3, pp. 220-234, May 1995.
