Embedded Computer Architecture

ASIP: Application Specific Instruction-set Processor

5KK73

Bart Mesman and Henk Corporaal

04/19/23 Embedded Computer Architecture H. Corporaal and B. Mesman

Application domain specific processors (ADSP or ASIP)

[Figure: spectrum of processor types plotted against flexibility and efficiency. Labels: programmable CPU, programmable DSP, application domain specific processor, application specific processor.]

Application domain specific processors (ADSP or ASIP)

• takes a well-defined application domain as a starting point
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain

e.g. MPEG2 coding uses an 8*8 DCT transform; other domains: DECT, GSM, etc.

• performance: clock speed + ILP, DLP, tuning to domain
• flexible development (new applications)
• cost effective (high volume)

[Figure: mapping between application domain and implementation, shown for an ADSP and for a general-purpose (GP) processor.]

problems:
• specification
• design time and effort
manual design: large effort => synthesized cores


Part | Description | Clock (MHz) | Size (gates) | ROM (Kbyte) | RAM (Kbyte)

Speech Components
ADPCM | Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder. | 4 | 5,100 | 1.3 | 0.128
ADPCM-16 | Full duplex 16 Channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder. | 32 | 10,200 | 1.3 | 2.048
IW-ASR Speech Recognition | Template-based speaker-dependent, isolated-word automatic speech recognition | 1.3 | 9,000 | 6 | approx. 1 kbyte/word
G.723.1 | Low bit-rate ITU-T G.723.1 compliant speech-compression at 6.3 kbit/s; can be combined with G.723.1A. | 20 | 24,000 | 22 | 2.3
G.723.1A | Extended version of G.723.1 to reduce bit rate by a silence compression scheme. Uses voice activity detection and comfort-noise generation. Fully compliant with Annex A of speech-compression standard CODEC G.723.1. Yields no additional hardware cost. | 20 | 24,000 | 22 | 2.3
Speech Synthesis | Phrase-concatenated speech synthesis | Depends on compression requirements

Telecommunications
Echo Cancellation | High-performance echo-cancellation and suppression processor. | 4 | 6,000 | 2.80 | 0.15
DTMF | Full-duplex DTMF transceiver. | 2 | 4,000 | 1.00 | 0.15
Caller-ID | On-hook and off-hook caller line identification. Includes DTMF and V.23. | 3 | 6,000 | 2.10 | 0.15
Reed-Solomon | Full-duplex Reed-Solomon codec | | 7,000 | 3.75 | 0.15
Viterbi Decoder | Configurable rate, code and constraint-length. Configurable traceback depth. Supports soft & hard decision making. Supports code puncturing. | (depending on throughput) | 5,000 to 9,000 | --- | ---
V.23 modem | ITU-T V.23 compliant 1200 baud FSK modem | | 6,000 | 0.80 | 0.15

Other
Pink Noise Generator | Low-ripple pink noise filter with filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20 Hz to 20 kHz | | 4,000 | 0.10 | 0.10
CCIR 656/601 | Digital video converter: CCIR to raw-video data and vice versa. | | 1,500 | none | none

Source: www.adelantetech.com


Design process

3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw

Fast, accurate and early feedback

[Figure: exploration loop. Inputs: application(s) and a processor model (parameters, instance, e.g. VLIW with shared RFs). SW (code generation) yields estimations of cycles/alg and occupation; HW design yields estimations of nsec/cycle, area and power/instr. Decision points: OK? and more appl.? When both are settled: go to phase 2.]


ASIP/VLIW architectures: list scheduling

[Figure: list scheduling of a data-flow graph with operations *1, +2, *3, *4, *5, +6, +7, *8, *9, +10 on a datapath with IPB, OPB, ALU and MULT units. For each cycle 0 to 5 the figure shows the candidate list, the conflict & priority computation, and the scheduled operation.]
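The list-scheduling loop named on this slide (candidate list, conflict and priority computation, scheduled operation) can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the scheduler used in the course tools: the six-operation graph, the priorities, the single-cycle latencies and the names `struct op`, `example_graph` and `list_schedule` are all hypothetical, and the machine is reduced to one ALU and one MULT.

```c
#include <assert.h>

#define N_OPS 6
enum unit { ALU, MULT, N_UNITS };

struct op {
    enum unit unit;   /* resource class the operation needs        */
    int n_preds;      /* number of predecessors in the data flow   */
    int preds[2];     /* indices of the predecessor operations     */
    int priority;     /* here: longest path to a sink (assumption) */
};

/* Hypothetical graph, NOT the one in the figure. */
static const struct op example_graph[N_OPS] = {
    { MULT, 0, {0, 0}, 3 },   /* op 0                 */
    { MULT, 0, {0, 0}, 3 },   /* op 1                 */
    { ALU,  2, {0, 1}, 2 },   /* op 2 = op0 + op1     */
    { MULT, 0, {0, 0}, 2 },   /* op 3                 */
    { ALU,  2, {2, 3}, 1 },   /* op 4 = op2 + op3     */
    { ALU,  0, {0, 0}, 1 },   /* op 5, independent    */
};

/* Fills cycle[i] with the issue cycle of operation i and returns the
 * schedule length. The graph must be acyclic. */
int list_schedule(const struct op *ops, int n, int *cycle)
{
    int scheduled = 0;
    int t;

    assert(n > 0);
    for (int i = 0; i < n; i++)
        cycle[i] = -1;                     /* -1 = not scheduled yet */

    for (t = 0; scheduled < n; t++) {
        int pick[N_UNITS];
        for (int u = 0; u < N_UNITS; u++)
            pick[u] = -1;

        /* Candidate list: unscheduled operations whose predecessors
         * were all issued in an earlier cycle. */
        for (int i = 0; i < n; i++) {
            int ready = 1;
            if (cycle[i] >= 0)
                continue;
            for (int p = 0; p < ops[i].n_preds; p++) {
                int c = cycle[ops[i].preds[p]];
                if (c < 0 || c >= t)
                    ready = 0;
            }
            if (!ready)
                continue;
            /* Conflict and priority: one operation per unit per cycle,
             * highest priority wins, ties go to the lowest index. */
            int u = ops[i].unit;
            if (pick[u] < 0 || ops[i].priority > ops[pick[u]].priority)
                pick[u] = i;
        }

        /* Commit the scheduled operations of this cycle. */
        for (int u = 0; u < N_UNITS; u++) {
            if (pick[u] >= 0) {
                cycle[pick[u]] = t;
                scheduled++;
            }
        }
    }
    return t;
}
```

Calling `list_schedule(example_graph, N_OPS, cycle)` on the hypothetical graph issues the three multiplications in consecutive cycles on the single MULT unit and overlaps the additions on the ALU where the dependencies allow.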


Problem statement

A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.

A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP, e.g.
a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.

Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
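One way to read "GRTPs contain all inter-RT-conflict information" is that two register transfers can share an instruction word unless they demand opposite values for the same instruction bit. The sketch below encodes a pattern such as xxxx0101 as a mask/value pair and tests that condition. The type `struct grtp_bits` and the functions `grtp_parse` and `grtp_conflict` are hypothetical names, and the conflict rule itself is an assumption about the encoding, not something the slide spells out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct grtp_bits {
    uint32_t mask;    /* 1 where the pattern fixes an instruction bit */
    uint32_t value;   /* required bit values under the mask           */
};

/* Parse a pattern such as "xxxx0101"; the leftmost character is the
 * most significant instruction bit, 'x' means don't care. */
struct grtp_bits grtp_parse(const char *pattern)
{
    struct grtp_bits g = { 0, 0 };
    size_t n = strlen(pattern);

    assert(n <= 32);
    for (size_t i = 0; i < n; i++) {
        g.mask <<= 1;
        g.value <<= 1;
        if (pattern[i] == '0' || pattern[i] == '1') {
            g.mask |= 1u;
            g.value |= (uint32_t)(pattern[i] - '0');
        }
    }
    return g;
}

/* Two RTs conflict when some bit is fixed by both patterns
 * and the required values differ. */
int grtp_conflict(struct grtp_bits a, struct grtp_bits b)
{
    return ((a.value ^ b.value) & a.mask & b.mask) != 0;
}
```

With this encoding, xxxx0101 and xxxx0100 conflict (they disagree on bit 0), while xxxx0101 and 1xxxxxxx can be issued together because they constrain disjoint bits.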


Problem statement

[Figure: compilation flow. Algorithm spec -> FE -> CDFG -> Code Generation -> Machine code. Processor spec (instance) -> ISE -> GRTP -> Code Generation. Annotation: in ch 4 this is part of the code generator.]


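Example: Simple processor [Leupers]

[Figure: datapath of the simple processor. Components: PC with +1 incrementer, instruction memory IM, RAM, input Inp, register REG and output outp. The instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]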


Example: Simple processor [Leupers]

Instruction                              Instruction bits (20 .. 0)
PC := PC + 1                             xxxxxxxxxxxxxxxxxxxxx
REG := Inp                               xxxxxxxxxxxxxxxxx011x
REG := IM[PC].(20..13)                   xxxxxxxxxxxxxxxxx001x
REG := RAM[IM[PC].(12..5)]               xxxxxxxxxxxxxxxxx1x1x
REG := REG - Inp                         xxxxxxxxxxxxxxxxx0101
REG := REG - IM[PC].(20..13)             xxxxxxxxxxxxxxxxx0001
REG := REG - RAM[IM[PC].(12..5)]         xxxxxxxxxxxxxxxxx1x01
REG := REG + Inp                         xxxxxxxxxxxxxxxxx0100
REG := REG + IM[PC].(20..13)             xxxxxxxxxxxxxxxxx0000
REG := REG + RAM[IM[PC].(12..5)]         xxxxxxxxxxxxxxxxx1x00
RAM[IM[PC].(12..5)] := REG               xxxxxxxxxxxxxxxx1xxxx
outp := REG                              xxxxxxxxxxxxxxxxxxxxx
RAM_NOP                                  xxxxxxxxxxxxxxxx0xxxx
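The table for the simple processor can be used directly as a decoder: a register transfer is enabled by an instruction word when every non-x bit of its pattern matches. The following C sketch does that for the 21-bit word. The names `rt_table`, `rt_enabled` and `enabled_rts` are hypothetical, and the bracket notation for the memory accesses is one reading of the slide's subscripts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORD_BITS 21
#define X16 "xxxx" "xxxx" "xxxx" "xxxx"   /* bits 20..5: don't care */

struct rt_entry {
    const char *rt;     /* register transfer                       */
    const char *bits;   /* pattern, bit 20 first and bit 0 last    */
};

static const struct rt_entry rt_table[] = {
    { "PC := PC + 1",                     X16 "xxxxx" },
    { "REG := Inp",                       X16 "x011x" },
    { "REG := IM[PC].(20..13)",           X16 "x001x" },
    { "REG := RAM[IM[PC].(12..5)]",       X16 "x1x1x" },
    { "REG := REG - Inp",                 X16 "x0101" },
    { "REG := REG - IM[PC].(20..13)",     X16 "x0001" },
    { "REG := REG - RAM[IM[PC].(12..5)]", X16 "x1x01" },
    { "REG := REG + Inp",                 X16 "x0100" },
    { "REG := REG + IM[PC].(20..13)",     X16 "x0000" },
    { "REG := REG + RAM[IM[PC].(12..5)]", X16 "x1x00" },
    { "RAM[IM[PC].(12..5)] := REG",       X16 "1xxxx" },
    { "outp := REG",                      X16 "xxxxx" },
    { "RAM_NOP",                          X16 "0xxxx" },
};
#define N_RTS (sizeof rt_table / sizeof rt_table[0])

/* Returns 1 if the instruction word satisfies every fixed bit of the pattern. */
int rt_enabled(uint32_t word, const char *bits)
{
    assert(strlen(bits) == WORD_BITS);
    for (int i = 0; i < WORD_BITS; i++) {
        int bit = (int)((word >> (WORD_BITS - 1 - i)) & 1u);
        if (bits[i] != 'x' && bits[i] - '0' != bit)
            return 0;
    }
    return 1;
}

/* Collects the names of all register transfers enabled by the word,
 * in table order, and returns how many were found (at most max). */
size_t enabled_rts(uint32_t word, const char *names[], size_t max)
{
    size_t n = 0;
    for (size_t k = 0; k < N_RTS && n < max; k++)
        if (rt_enabled(word, rt_table[k].bits))
            names[n++] = rt_table[k].rt;
    return n;
}
```

For a word whose bits 4..0 are 10101, the decoder reports four transfers: the PC increment, REG := REG - Inp, the RAM write and outp := REG.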


ASIP/VLIW architectures

A|RT designer template as an example (= set of rules, a model)

Differences with GP VLIW processors
1. Parallel FUs
   • ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT etc.
   • larger grain size, more heterogeneous, more pipelines
2. Rfiles
   • many Rfiles (>5 vs. 1 or 2)
   • limited # ports (3 vs. 15)
   • limited size (<16 vs. 128)
3. Issue slots
   • all in parallel vs. 5


[Figure: distributed VLIW template with functional units FU1 to FU4, register files RF1 to RF8, instruction registers IR1 to IR4, instruction memory, control and flags.]


ASIP/VLIW architectures

Additional characteristics of the A|RT designer template
• interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
• memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
• each FU can generate one or more flags
• instruction format (per issue slot):
  read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | mux 1 | mux 2 | control FU | output drivers


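ASIP/VLIW architectures: example

[Figure: datapath with an ALU and a MAC, busses bus1 and bus2, and register files RF1 to RF4. The instruction word is marked at bit positions 0, 9, 10 and 19 and carries the fields mux 2, read RF1, write RF1, read RF2, write RF2 and ALU instr. for the ALU, and mux 3, read RF4, write RF4, read RF3, write RF3 and MAC instr. for the MAC.]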


ASIP/VLIW architectures: example

GRTP                    Instruction bits (19 .. 0)
RF1 = ALU (RF1, RF2)    x c c c c x x c c c x x x x x x x x x x
RF2 = ALU (RF1, RF2)    x c x c c c c c c c x x x x x x x x x x
RF3 = ALU (RF1, RF2)    x c x c c x x c c c c x x c c x x x x x
RF3 = MAC (RF3, RF4)    x x x x x x x x x x c c c c c c x c c c
RF4 = MAC (RF3, RF4)    x x x x x x x x x x x c c x x c c c c c
RF2 = MAC (RF3, RF4)    c x x x x c c x x x x c c x x c x c c c


ASIP/VLIW architectures: design flow

[Figure: design flow. Algorithm spec -> RTs -> datapath synthesis and controller synthesis -> estimations (area, power, timing) -> OK? If no: change pragmas and iterate; if yes: done.]

Example RT:
RF1 : x = RF2 : y, RF3 : z | ALU = ADD
                             Inmux = bus2

Example pragmas:
assign ( a+b, ALU, fu_alu1)
assign ( a+_, ALU, fu_alu2)
assign ( _+_, ALU, fu_alu3)

VLIW makes relatively simple code selection possible

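Application examples (1)

#define NTAPS 4

int fir(int in)
{
    int i;
    static int state[NTAPS + 1];   /* delay line x0..x4, newest sample last */
    static int coeff[NTAPS + 1];   /* coefficients c0..c4, set by the application */
    int out[NTAPS + 1];            /* running partial sums */

    state[NTAPS] = in;
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];     /* shift the delay line */
    }
    return out[NTAPS];
}

[Figure: FIR filter structure. Samples x0..x4 pass through Z-1 delay elements, are multiplied by coefficients c0..c4 and summed into the output y.]

Processor Architectures and Program Mapping, H. Corporaal, J. van Meerbergen, and B. Mesman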

.L1000006
sll   $3, $2, 2         R3=R2<<2            R3=i-1
addu  $14, $15, $3      R14=R15+R3
lw    $24, 0($14)       R24=load(*R14)      R24=coeff[i-1]
addiu $12, $6, -4       R12=R6-4
addu  $11, $12, $3      R11=R12+R3
lw    $13, 0($11)       R13=load(*R11)      R13=state[i-1]
nop
mult  $24, $13          R24=R24*R13
addu  $25, $sp, $3      R25=sp+R3
lw    $9, -4($25)       R9=load(R25-4)      R9=out[i-1]
addiu $2, $2, 1         R2=R2+1             i=i+1
mflo  $13               R13=move from low mpy reg
addu  $10, $9, $13      R10=R9+R13          R10=out[i]
sw    $10, 0($25)       mem(*R25)=R10
addu  $25, $7, $3       R25=R7+R3
sw    $24, 0($25)       mem(*R25)=R24
slti  $24, $2, 10
bne   $24, $0, .L100006
addiu $15, $7, -4

19 instructions per tap!!

Bit level operations: finite field arithmetic

temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2

         r1 = LB input          Load byte
         r2 = SLL r1            Shift left logical
         r3 = ANDI r1, mask     AND immediate
         r4 = ADDI r3, -1       ADD immediate
         BNE (r4 != r0)         Branch on != to nonzero
         nop
         r5 = XORI(r1, 29)      Exclusive or immediate
         J common               Jump
         nop
nonzero  r5 = XOR(r1,r0)        Exclusive OR
common   …

[Figure: hardware implementation mapping in[0] .. in[7] to out[0] .. out[7] with three exor gates.]

10 instructions!! Very simple in hardware
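The three-line specification on the finite field slide (shift left, conditionally pick 29, exor) can be written in C as below. This is a minimal sketch: the function names are hypothetical, and reading the constant 29 (0x1D) as the reduction term of the polynomial x^8 + x^4 + x^3 + x^2 + 1 is an assumption, since the slide does not name the field polynomial.

```c
#include <stdint.h>

#define GF_REDUCTION 29u   /* constant from the slide (0x1D) */

/* Direct transcription of temp1 / temp2 / out from the slide. */
uint8_t gf256_times_x(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);
    uint8_t temp2 = (uint8_t)((input & 0x80u) ? GF_REDUCTION : 0u);
    return (uint8_t)(temp1 ^ temp2);
}

/* Same result without a branch: bit 7 is expanded into an all-ones
 * or all-zeros mask that gates the reduction constant. */
uint8_t gf256_times_x_branchless(uint8_t input)
{
    uint8_t mask = (uint8_t)(0u - (unsigned)(input >> 7));  /* 0xFF or 0x00 */
    return (uint8_t)((unsigned)(input << 1) ^ (mask & GF_REDUCTION));
}
```

Even the branch-free form costs a handful of shift, mask and exor instructions on a general-purpose core, whereas the hardware version is a wire shift plus three exor gates.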

Bit level operations: DES example

srl  $13, $2, 20
andi $25, $13, 1
srl  $14, $2, 21
andi $24, $14, 6
or   $15, $25, $24
srl  $13, $2, 22
andi $14, $13, 56
or   $25, $15, $14
sll  $24, $25, 2

[Figure: bits 20, 22, 23, 25, 26 and 27 of the source register ($2) are moved to bits 2, 3, 4, 5, 6 and 7 of the destination register ($24).]
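For readers who prefer C to MIPS assembly, the nine instructions of the DES example correspond to the following expression. The function name `des_gather` is hypothetical; the shifts and masks are taken one to one from the assembly.

```c
#include <stdint.h>

/* Gathers source bits 20, 22, 23, 25, 26, 27 into destination bits 2..7. */
uint32_t des_gather(uint32_t src)
{
    uint32_t r = (src >> 20) & 1u;   /* bit 20       -> bit 0      */
    r |= (src >> 21) & 6u;           /* bits 22, 23  -> bits 1, 2  */
    r |= (src >> 22) & 56u;          /* bits 25..27  -> bits 3..5  */
    return r << 2;                   /* final place: bits 2..7     */
}
```

The cost comes from the gaps in the source bit positions: each contiguous run of bits needs its own shift and mask, while in hardware the whole permutation is only wiring.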


Bit level operations: A5 example (GSM encryption)

srl  $24, $5, 18
srl  $25, $5, 17
xor  $8, $24, $25
srl  $9, $5, 16
xor  $10, $8, $9
srl  $11, $5, 13
xor  $12, $10, $11
andi $13, $12, 1

[Figure: bits 18, 17, 16 and 13 of $5 are xor-ed into bit 0 of $13; all other bits of $13 are 0.]
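The A5 instruction sequence computes a single feedback bit as the exor of four register taps. A C equivalent, with the hypothetical name `a5_feedback`, is sketched below; it mirrors the assembly and makes no claim about the rest of the cipher.

```c
#include <stdint.h>

/* Exor of bits 18, 17, 16 and 13 of the shift register, returned in bit 0. */
uint32_t a5_feedback(uint32_t reg)
{
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1u;
}
```

Eight instructions are spent on one output bit; a hardware implementation needs three exor gates.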



ASIP/VLIW architectures: feedback

[Figure: tool feedback views: architecture view, life-time analysis, resource load, bus load, cycle-count.]


Low power aspects

• Estimation

EXU          ACTIVITY   AREA   POWER
alu_1        20%         261     105
acs_asu_1    83%        2382    3816
or_asu_1     10%         611     122
romctrl_1    16%          65      21
acu_1        36%         294     205
ipb_1        20%         107      43
opb_1        11%         163      35
ctrl                    1864    3597
total                   5747    7944

[Figure: Mistral2 flow with an implementation-independent design database, the architecture, and an estimation database reporting area, speed and power.]


GSM viterbi decoder : default solution

Cycle count: 13750

EXU          ACTIV    AREA    POWER
alu_1        96%      3469    46196
romctrl_1    48%        39      259
acu_1        26%       327     1209
ipb_1         5%       131      105
opb_1        23%      1804     5801
ctrl                  9821   135035
total                15591   188605

• controller responsible for 70% of power consumption
  – maximum resource-sharing
  – heavy decision-making : "main" loop with 16 metrics-computations per iteration
• EXU-numbers include registers for local storage


GSM viterbi decoder : no loop-folding

Cycle count: 14247

• area down by 33%
• power down by 35%
• next step: reduce # of program-steps with second ALU




GSM viterbi decoder : 2 ALU’s

Cycle count: 9739

EXU          ACTIV    AREA    POWER
alu_1        69%      1797    12248
alu_2        65%      1393     8916
romctrl_1    67%        39      255
acu_1        37%       294     1087
ipb_1         8%       149      119
opb_1        33%      2136     6871
ctrl                  8957    87235
total                14766   116731

• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce ASU to reduce ALU-load


GSM viterbi decoder : 1 x ACS-ASU

Cycle count: 1930

EXU          ACTIV    AREA    POWER
alu_1        20%       261      105
acs_asu_1    83%      2382     3816
or_asu_1     10%       611      122
romctrl_1    16%        65       21
acu_1        36%       294      205
ipb_1        20%       107       43
opb_1        11%       163       35
ctrl                  1864     3597
total                 5747     7944

• cycle count down 5X
• power down 20X !

func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS  = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d) || ( M2+d) fi;
end;
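The ACS (add-compare-select) function that the ASU implements can be restated in C as follows. This is a sketch for readers unfamiliar with the guarded-expression notation of the slide; the names `struct acs_result`, `max_int` and `acs` are hypothetical, and plain `int` path metrics are an assumption about the data type.

```c
struct acs_result {
    int ms;    /* survivor metric MS  */
    int ms8;   /* survivor metric MS8 */
};

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Add the branch metric d to both path metrics in both polarities,
 * compare, and select the larger candidate for each output. */
struct acs_result acs(int m1, int m2, int d)
{
    struct acs_result r;
    r.ms  = max_int(m1 + d, m2 - d);
    r.ms8 = max_int(m1 - d, m2 + d);
    return r;
}
```

On an ALU this butterfly takes four additions, two comparisons and two selections per call, which is the load that a single-cycle ASU removes from the ALU and the controller.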


GSM viterbi decoder : 4 x ACS-ASU

Cycle count: 425

EXU           ACTIV    AREA    POWER
alu_1         94%       243       97
acs_asu_1     95%      1041      420
acs_asu_2     95%      1041      420
acs_asu_3     95%      1041      420
acs_asu_4     95%      1041      420
split_asu_1   47%        90       18
or_asu_1      47%       592      118
romctrl_1     28%        48        6
acu_1         98%       212       85
ipb_1         23%        60        6
opb_1         50%       369       80
ctrl                   1306      555
total                  7084     2645

• cycle count down another 5X
• area up 23%
• power down another 3X !


GSM viterbi example : summary

[Figure: bar chart (scale 0 to 20000) of power, area and cycles for the solutions default, loop, 2 ALU, 1 ACS and 4 ACS, produced with Mistral2 from the design database. Annotation: 72x !]


Discussion: phase 3

[Figure: overall flow. Exploration phase: application(s) and processor model are iterated (OK? more appl.?) until the processor model is frozen and HW design starts. Application software development: constraint driven compilation of application(s) on the frozen processor model, repeated until OK.]


[Figure: VLIW template with functional units FU1 to FU4, register files RF1 to RF4, instruction registers IR1 to IR4, instruction memory, control and flags.]


Discussion: problems with VLIWs

code size and instruction bandwidth

• code compaction = reduce code size after scheduling
  possible compaction ratio?
  e.g. p0 = 0.9 and p1 = 0.1
  information content (entropy) = - Σ pi log2 pi = 0.47
  maximum compression factor 2
• control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
• architecture: reduce number of control bits for operand addresses
  e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
  => use stacks and fifos
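The two numbers on this slide can be checked with a short calculation. The derivation below assumes a binary source in which each symbol costs 1 bit before compaction, so that the bound on the compression factor is 1/H. The split of the 28 address bits into four 7-bit register fields is an assumption as well; the slide only gives the total.

```latex
\begin{align*}
H &= -\sum_{i} p_i \log_2 p_i
   = -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \\
  &\approx 0.9 \cdot 0.152 + 0.1 \cdot 3.322
   \approx 0.47 \ \text{bits per symbol} \\[4pt]
\text{compression factor} &\le \frac{1}{H} \approx 2.1 \\[4pt]
\lceil \log_2 128 \rceil &= 7 \ \text{bits per register address},
\qquad 4 \cdot 7 = 28 \ \text{bits per issue slot}
\end{align*}
```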

Velocity encoding

Classical encoding: fetching many nops

Fully serial
n n A n n n n n
n B n n n n n n
n n n n n C n n
n n n n n D n n
n n n E n n n n
F n n n n n n n
n n n n n n G n
n n n n n n n H

A B C D E F G H
0 0 0 0 0 0 0 0

Mixed serial/parallel
n B A n n C n n
n n n E n D n n
F n n n n n n n
n n n n n n G H

A B C D E F G H
1 1 0 1 0 0 1 0

Fully parallel
A B C D E F G H

A B C D E F G H
1 1 1 1 1 1 1 0
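Comparing the bit rows with the instruction matrices in the velocity encoding figure suggests that each operation carries one extra bit, where 1 means "the next operation issues in the same cycle" and 0 closes the instruction. Under that reading, which is an interpretation of the figure and not stated on the slide, the stored operation stream can be split back into instructions as sketched below. The function name `split_into_instructions` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* bits[i] == 1: operation i+1 belongs to the same instruction as operation i.
 * bits[i] == 0: operation i is the last one of its instruction.
 * Writes the instruction index of every operation to instr_of[] and
 * returns the number of instructions. */
size_t split_into_instructions(const int *bits, size_t n_ops, size_t *instr_of)
{
    size_t instr = 0;

    assert(bits != NULL && instr_of != NULL);
    for (size_t i = 0; i < n_ops; i++) {
        instr_of[i] = instr;
        if (bits[i] == 0)
            instr++;              /* operation i closes its instruction */
    }
    /* A stream that ends on a 1 still has one open instruction. */
    if (n_ops > 0 && bits[n_ops - 1] != 0)
        instr++;
    return instr;
}
```

For the mixed serial/parallel row 1 1 0 1 0 0 1 0 this yields four instructions (A B C, D E, F, G H), matching the four rows of the mixed matrix, while only the eight real operations are stored instead of 32 slots filled mostly with nops.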



Conclusions

• ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).

• The methodology is interesting for IP creation.

• The key problem is retargetable compilation.

• A (distributed) VLIW model is a good compromise between HW and SW.

• Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.
