Programmable processors for wireless base-stations

Sridhar Rajagopal ([email protected]), December 9, 2003

Page 1: Programmable processors for wireless base-stations

Programmable processors for wireless base-stations

Sridhar Rajagopal([email protected])

December 9, 2003

Page 2: Programmable processors for wireless base-stations

Fact #1: Wireless data rates are growing faster than clock rates

Need to process 100X more bits per clock cycle today than in 1996

[Figure: clock frequency (MHz), W-LAN data rate (Mbps), and cellular data rate (Mbps) versus year, 1996 to 2006, on a log scale from 10^-3 to 10^4. Starting values: 200 MHz clock, 1 Mbps W-LAN, 9.6 Kbps cellular. Ending values: 4 GHz clock, 54-100 Mbps W-LAN, 2-10 Mbps cellular. Source: Intel, IEEE 802.11x, 3GPP]

Page 3: Programmable processors for wireless base-stations

Fact #2: Base-stations need horsepower

[Figure: base-station receiver (RF RX) block diagram in three sections. RF: LNA, ADC, DDC. Baseband processing: frequency offset compensation, channel estimation, chip-level demodulation and despreading, symbol detection, symbol decoding, power measurement and gain control (AGC). Network interface: packet/circuit switch control, BSC/RNC interface to E1/T1 or packet network. A power supply and control unit is shown alongside. Source: Texas Instruments]

• Sophisticated signal processing for multiple users

• Need 100-1000s of arithmetic operations to process 1 bit

Page 4: Programmable processors for wireless base-stations

Need 100 ALUs in base-stations

• Example: 1000 arithmetic operations/bit with 1 bit per 10 cycles
– 100 arithmetic operations/clock cycle

• Base-stations need 100 ALUs
– irrespective of the type of (clocked) architecture
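The arithmetic behind the 100-ALU figure can be sketched in a few lines. This is only an illustration: the function name and the assumption that each ALU completes one operation per cycle are mine, not the slides'.

```python
import math


def alus_needed(ops_per_bit: float, cycles_per_bit: float,
                ops_per_alu_per_cycle: float = 1.0) -> int:
    """Return the number of ALUs needed to sustain the workload.

    ops_per_bit / cycles_per_bit gives the operations that must complete
    every clock cycle; dividing by the per-ALU rate (assumed to be one
    operation per cycle) gives the ALU count, rounded up.
    """
    ops_per_cycle = ops_per_bit / cycles_per_bit
    return math.ceil(ops_per_cycle / ops_per_alu_per_cycle)


# Slide example: 1000 operations per bit, one bit every 10 cycles.
print(alus_needed(1000, 10))
```

The result does not depend on the clock frequency itself, which matches the slide's point that the requirement holds for any clocked architecture.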

Page 5: Programmable processors for wireless base-stations

Fact #3: Base-stations need power-efficiency*

• Wireless systems getting denser
– More base-stations per unit area
– Operational and maintenance costs

• Architectures first tested on base-stations

*Power-efficiency means the design does not waste power; it does not imply low power

News clipping: "Wireless gets blacked out too. Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" (Paul R. La Monica, CNN/Money Senior Writer, August 16, 2003, 8:58 AM EDT)

Page 6: Programmable processors for wireless base-stations

Fact #4: Base-stations need flexibility*

• Wireless systems are continuously evolving
– New algorithms designed and evaluated
– Allow upgrading, co-existing, minimize design time, reuse

• Flexibility needed for power-efficiency
– Base-stations rarely operate at full capacity
– Varying users, data rates, spreading, modulation, coding
– Adapt resources to needs

*How much flexibility? As flexible as possible

Page 7: Programmable processors for wireless base-stations

Fact #5: Current base-stations not flexible / not power-efficient

[Figure: partitioning of a current base-station, mapping each function to its hardware]

• RF: analog
• ‘Chip rate’ processing: ASIC(s) and/or ASSP(s) and/or FPGA(s)
• ‘Symbol rate’ processing: DSP(s)
• Decoding: co-processor(s) and/or ASIC(s)
• Control and protocol: DSP or RISC processor

Change implies re-partitioning algorithms and designing new hardware

Design done for the worst case – no adaptation with workload

Source: [Baines2003]

Page 8: Programmable processors for wireless base-stations

Thesis addresses the following problem

Design a base-station that
(a) supports 100’s of ALUs
(b) is power-efficient (adapts resources to needs)
(c) is as flexible as possible

How many ALUs at what clock frequency?

HYPOTHESIS: Programmable* processors for wireless base-stations

*How programmable? As programmable as possible

Page 9: Programmable processors for wireless base-stations

Programmable processors

• No processor optimization for specific algorithm
– As programmable as possible
– Example: no instruction for Viterbi decoding
– FPGAs, ASICs, ASIPs etc. not considered

• Use characteristics of wireless systems
– Precision, parallelism, operations, ...
– MMX extensions for multimedia

Page 10: Programmable processors for wireless base-stations

Single processors won’t do

(1) Find ways for increasing clock frequency
– C64x DSP: 600 MHz, 720 MHz, 1 GHz ... 100 GHz?
– Easiest solution, but physical limits to scaling f
– Not good for power, given cubic dependence on f

(2) Increasing ALUs
– Limited instruction level parallelism (ILP, MMX)
– Register file area, ports explosion
– Compiler issues in extracting more ILP

(3) Multiprocessors

Page 11: Programmable processors for wireless base-stations

Related work - Multiprocessors

[Figure: taxonomy of multiprocessors with representative machines per category]

• Multi-chip: TI TMS320C40 DSP, Sundance, Cm*
• Clustered VLIW: TI TMS320C6x DSP, Multiflow TRACE, Alpha 21264
• Vector: CODE, Vector IRAM, Cray 1
• Array: ClearSpeed™, MasPar, Illiac-IV, BSP
• Stream: Imagine, Motorola RSVP™
• Multi-threading (MT): Sandbridge SandBlaster DSP, Cray MTA, Sun MAJC, PowerPC RS64IV, Alpha 21464
• Reconfigurable* processors: RAW, Chameleon, picoChip
• Chip multiprocessor (CMP): TI TMS320C8x DSP, Hydra, IBM Power4

Figure labels: SIMD (Single Instruction, Multiple Data), MIMD (Multiple Instructions, Multiple Data), single chip, control parallel, data parallel

Figure annotation: Cannot scale to support 100’s of arithmetic units

*Reconfigurable processor uses reconfiguration for execution time benefits

Page 12: Programmable processors for wireless base-stations

Challenges in proving hypothesis

• Architecture choice for design exploration
– SIMD generally more programmable* than reconfigurable
– Compiler, simulators, tools and support play a major role

• Benchmark workloads need to be designed
– Previously done as ASICs, so none available
– Not easy: finite precision, algorithms changing

• Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools

*Programmable here refers to ease of use and ease of writing code

Page 13: Programmable processors for wireless base-stations

Architecture choice: Stream processors

• State-of-the-art programmable media processors
– Can scale to 1000’s of arithmetic units [Khailany 2003]
– Wireless algorithms have similar characteristics

• Cycle-accurate simulator with open-source code

• Parameters such as ALUs, register files can be varied

• Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …

• Almost anything can be changed, some changes easier than others!

Page 14: Programmable processors for wireless base-stations

Thesis contributions

• Mapping algorithms on stream processors
– Designing data-parallel algorithm versions
– Tradeoffs between packing, ALU utilization and memory
– Reduced inter-cluster communication network

• Improve power efficiency in stream processors
– Adapting compute resources to workload variations
– Varying voltage and frequency to real-time requirements

• Design exploration between #ALUs and clock frequency to minimize power consumption
– Fast real-time performance prediction

Page 15: Programmable processors for wireless base-stations

Outline

• Background
– Wireless systems
– Stream processors

• Contribution #1: Mapping
• Contribution #2: Power-efficiency
• Contribution #3: Design exploration

• Broader impact and limitations

Page 16: Programmable processors for wireless base-stations

Wireless workloads : 2G (Basic)

[Figure: 2G physical layer signal processing. The received signal after DDC feeds one chain per user (User 1 to User K), each with a sliding correlator, a code matched filter, and a Viterbi decoder. The decoded outputs go to the MAC and network layers.]

• 32 users, 16 Kbps/user
• Single-user algorithms (other users treated as noise)
• > 2 GOPs

Page 17: Programmable processors for wireless base-stations

3G Multiuser system

[Figure: 3G physical layer signal processing. The received signal after DDC feeds multiuser channel estimation and one code matched filter per user (User 1 to User K). Multiuser detection with parallel interference cancellation stages follows, then one Viterbi decoder per user, then the MAC and network layers.]

• 32 users, 128 Kbps/user
• Multi-user algorithms (cancel interference)
• > 20 GOPs

Page 18: Programmable processors for wireless base-stations

4G MIMO system

[Figure: 4G physical layer signal processing with M antennas. For each user (User 1 to User K) and each antenna (Antenna 1 to Antenna T), the received signal after DDC passes through channel estimation, chip level equalization, and a code matched filter. Each user then has an LDPC decoder, followed by the MAC and network layers.]

• 32 users, 1 Mbps/user
• Multiple antennas (higher spectral efficiency, higher data rates)
• > 200 GOPs

Page 19: Programmable processors for wireless base-stations

Programmable processors

#define N 1024

int i, a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N];    // 16 bits, packed

for (i = 0; i < N; ++i) {
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}

• Instruction Level Parallelism (ILP) – DSP
• Subword Parallelism (MMX) – DSP
• Data Parallelism (DP) – Vector Processor

• DP can decrease by increasing ILP and MMX
– Example: loop unrolling

[Figure: the loop annotated with its three forms of parallelism: ILP, DP, MMX]
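How packing and loop unrolling eat into data parallelism can be shown with a short calculation. This is a rough sketch: the function name and the idea of counting DP as "independent loop passes left over" are illustrative assumptions, chosen to match the deck's later definition of DP as the parallelism available after subword packing and loop unrolling.

```python
def remaining_dp(iterations: int, subwords_per_word: int = 1,
                 unroll_factor: int = 1) -> int:
    """Independent loop passes left after packing and unrolling.

    Each pass handles subwords_per_word * unroll_factor original
    iterations, so the remaining data parallelism is the iteration
    count divided by that product, rounded up.
    """
    per_pass = subwords_per_word * unroll_factor
    return -(-iterations // per_pass)  # ceiling division on integers


# 1024 iterations of the 16-bit subtraction from the loop above:
print(remaining_dp(1024))        # no packing, no unrolling
print(remaining_dp(1024, 2))     # two 16-bit values per 32-bit word
print(remaining_dp(1024, 2, 4))  # packed and unrolled four times
```

Whatever is left after this step is what the clusters of a data-parallel machine can exploit.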

Page 20: Programmable processors for wireless base-stations

Stream Processors : multi-cluster DSPs

[Figure: a VLIW DSP (1 cluster), with adders (+) and multipliers (*) over internal memory, exploits ILP and MMX. A stream processor replicates that cluster under one microcontroller, uses a Stream Register File (SRF) as memory, and exploits ILP, MMX, and DP.]

• Adapt clusters to DP
• Identical clusters, same operations
• Power down unused FUs, clusters

Page 21: Programmable processors for wireless base-stations

Outline

Contribution #1
– Mapping algorithms to stream processors (parallel, fixed point)
– Tradeoffs between packing, ALU utilization and memory
– Reduced inter-cluster communication network

Page 22: Programmable processors for wireless base-stations

Packing

• Packing introduced around 1996 for exploiting subword parallelism
– Intel MMX
– Subword parallelism never looked back
– Integrated into all current microprocessors and DSPs

• SIMD + MMX: stream processor, vector IRAM (2000 onward)
– Relatively new concept

• Not necessarily useful in SIMD processors
– May add to inter-cluster communication

Page 23: Programmable processors for wireless base-stations

Packing may not be useful

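Algorithm:

short a[8];
int y[8];

for (i = 0; i < 8; ++i)
{
    y[i] = a[i]*a[i];
}

[Figure: packed multiplication of a = (1 2 3 4 5 6 7 8). Multiplying the packed 16-bit inputs yields 32-bit results in odd-even order, p = (1 3 5 7) and q = (2 4 6 8), whereas the algorithm needs p = (1 2 3 4) and q = (5 6 7 8). Re-ordering the data builds the intermediates p = (1 3 x x), m = (5 7 x x), n = (x x 2 4), q = (x x 6 8), adds them to form p = (1 3 2 4) and q = (5 7 6 8), and then re-orders the data once more.]

Packing uses odd-even grouping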

Page 24: Programmable processors for wireless base-stations

Data re-ordering in memory

• Matrix transpose
– Common in wireless communication systems
– Column access to data expensive

• Re-ordering data inside the ALUs
– Faster
– Lower power

Page 25: Programmable processors for wireless base-stations

Trade-offs during memory re-ordering

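[Figure: three timelines for a transpose between ALU kernels t1 and t2]

(a) Transpose in memory, only partly overlapped with ALU work: $t = t_2 + t_{stalls}$, with $0 < t_{stalls} < t_{mem}$
(b) Transpose in memory, fully overlapped with ALU work: $t = t_2$
(c) Transpose in the ALUs, as an extra kernel t3: $t = t_2 + t_{alu}$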

Page 26: Programmable processors for wireless base-stations

Transpose uses odd-even grouping

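[Figure: transpose of an N x M matrix by odd-even grouping. IN holds two halves, (1 2 3 4) and (A B C D), split at M/2. OUT interleaves them as (A 1 B 2) and (C 3 D 4).]

Repeat LOG(M) times
{
    IN = OUT;
}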
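A minimal sketch of a transpose built only from odd-even grouping follows. It assumes a row-major matrix whose column count is a power of two, and it uses the even/odd split direction (evens first, then odds); the slide's figure appears to draw the interleaving direction, so treat the direction and the function names as assumptions of this example rather than the deck's exact procedure.

```python
def odd_even_group(data):
    """Group even-indexed elements first, then odd-indexed elements."""
    return data[0::2] + data[1::2]


def transpose_by_grouping(flat, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list.

    Applying odd-even grouping log2(cols) times moves the column index
    bits in front of the row index, which is exactly the transposed
    (row-major, cols x rows) layout.
    """
    if cols < 1 or cols & (cols - 1):
        raise ValueError("cols must be a power of two")
    if len(flat) != rows * cols:
        raise ValueError("flat must contain rows * cols elements")
    out = list(flat)
    for _ in range(cols.bit_length() - 1):  # log2(cols) passes
        out = odd_even_group(out)
    return out


# 2 x 4 matrix [[0, 1, 2, 3], [4, 5, 6, 7]] becomes a 4 x 2 matrix.
print(transpose_by_grouping(list(range(8)), rows=2, cols=4))
```

Each pass only needs the same fixed exchange pattern, which is why the operation suits a simple inter-cluster network.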

Page 27: Programmable processors for wireless base-stations

ALU Bandwidth > Memory Bandwidth

[Figure: execution time (cycles, log scale from 10^3 to 10^5) versus matrix size (32x32, 64x64, 128x128) for three cases: transpose in memory (t_mem) with DRAM at 8 cycles, transpose in memory (t_mem) with DRAM at 3 cycles, and transpose in ALU (t_alu).]

Page 28: Programmable processors for wireless base-stations

Viterbi needs odd-even grouping

• Exploiting Viterbi DP in SWAPs:
– Use Register Exchange (RE) instead of regular traceback
– Re-order ACS, RE

[Figure: regular ACS versus ACS in SWAPs for a 16-state trellis, X(0) to X(15). In the regular ACS the DP vector keeps the natural order X(0), X(1), ..., X(15) on both sides of the trellis. In the SWAPs version the vector is re-ordered by odd-even grouping: X(0), X(2), ..., X(14), followed by X(1), X(3), ..., X(15).]

Page 29: Programmable processors for wireless base-stations

Performance of Viterbi decoding

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

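[Figure: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) versus number of clusters (log scale from 1 to 100), with curves for K = 9, K = 7, K = 5, and the DSP. The point of maximum DP is marked on the plot.]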

Page 30: Programmable processors for wireless base-stations

Patterns in inter-cluster communication

• Broadcasting
– Matrix-vector multiplication, matrix-matrix multiplication, outer product updates

• Odd-even grouping
– Transpose, packing, Viterbi decoding

Page 31: Programmable processors for wireless base-stations

Odd-even grouping

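Inter-cluster communication

• $O(C^2)$ wires, $O(C^2)$ interconnections, 8 cycles
• Wires run the entire chip length
– Limits clock frequency
– Limits scaling

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6, 3/7, connected by a full inter-cluster network. Odd-even grouping turns the sequence 0 1 2 3 4 5 6 7 into 0 2 4 6 1 3 5 7.]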

Page 32: Programmable processors for wireless base-stations

A reduced inter-cluster comm network

• Only nearest-neighbor interconnections
• $O(C \log C)$ wires, $O(C)$ interconnections, 8 cycles

[Figure: 4 clusters holding data elements 0/4, 1/5, 2/6, 3/7, linked through multiplexers, demultiplexers, and pipelining registers. The network provides broadcasting support and odd-even grouping.]
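To get a feel for the difference between the two wiring orders, the growth terms can be tabulated for a few cluster counts. The sketch below only evaluates the order-of-growth expressions from the slides (C squared versus C log C); constant factors and real wire lengths are ignored, and the function names are illustrative.

```python
import math


def full_network_wires(clusters: int) -> int:
    """Growth term for the fully connected network: C^2."""
    return clusters ** 2


def reduced_network_wires(clusters: int) -> int:
    """Growth term for the reduced network: C * log2(C)."""
    return clusters * int(math.log2(clusters))


# Compare the two growth terms for power-of-two cluster counts.
for c in (4, 8, 16, 32, 64, 128):
    print(c, full_network_wires(c), reduced_network_wires(c))
```

The gap widens quickly with the cluster count, which is the scaling argument the reduced network is built on.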

Page 33: Programmable processors for wireless base-stations

Outline

Contribution #2 : Power-efficiency

High performance is low power - Mark Horowitz

Page 34: Programmable processors for wireless base-stations

Flexibility needed in workloads

[Figure: operation count (in GOPs, 0 to 25) for each workload (users, constraint length): (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9). Two series: 2G base-station (16 Kbps/user) and 3G base-station (128 Kbps/user).]

• Billions of computations per second needed
• Workload varies from ~1 GOPs for 4 users with constraint length 7 Viterbi to ~23 GOPs for 32 users with constraint length 9 Viterbi

Note: GOPs refer only to arithmetic computations

Page 35: Programmable processors for wireless base-stations

Flexibility affects Data Parallelism*

Workload (U,K)   Estimation f(U,N)   Detection f(U,N)   Decoding f(U,K,R)
(4,7)            32                  4                  16
(4,9)            32                  4                  64
(8,7)            32                  8                  16
(8,9)            32                  8                  64
(16,7)           32                  16                 16
(16,9)           32                  16                 64
(32,7)           32                  32                 16
(32,9)           32                  32                 64

U: users, K: constraint length, N: spreading gain, R: decoding rate

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

Page 36: Programmable processors for wireless base-stations

Adapting #clusters to Data Parallelism

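[Figure: clusters (C) connected through an adaptive multiplexer network, shown for a 4-cluster example in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off.]

Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation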
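One simple way to pick how many clusters stay powered is to match them to the available data parallelism. The sketch below assumes reconfiguration happens in power-of-two steps (suggested by the 4:2 and 4:1 cases in the figure); that restriction and the function name are assumptions of this example, not a documented policy.

```python
def active_clusters(data_parallelism: int, total_clusters: int) -> int:
    """Number of clusters to keep on for a kernel.

    Uses the largest power of two that does not exceed either the
    kernel's data parallelism or the number of clusters on the chip.
    The remaining clusters are candidates for voltage gating.
    """
    if data_parallelism < 1 or total_clusters < 1:
        return 0
    usable = min(data_parallelism, total_clusters)
    return 1 << (usable.bit_length() - 1)


# On a 32-cluster processor:
print(active_clusters(16, 32))  # decoding with DP 16
print(active_clusters(64, 32))  # decoding with DP 64
print(active_clusters(4, 32))   # detection for 4 users
```

For example, a kernel with a data parallelism of 4 would leave 28 of 32 clusters idle, so those are the ones worth gating.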

Page 37: Programmable processors for wireless base-stations

Cluster utilization variation

[Figure: cluster utilization (0 to 100) versus cluster index (0 to 30) in four panels, pairing workloads (4,9) and (4,7), (8,9) and (8,7), (16,9) and (16,7), (32,9) and (32,7).]

Cluster utilization variation on a 32-cluster processor

(32,9) = 32 users, constraint length 9 Viterbi

Page 38: Programmable processors for wireless base-stations

Frequency variation

[Figure: real-time frequency (in MHz, 0 to 1200) for each workload (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), with each bar split into busy time, microcontroller (uC) stalls, and memory stalls.]

Page 39: Programmable processors for wireless base-stations

Operation

• Dynamic voltage-frequency scaling when the system changes significantly
– Users, data rates, ...
– Coarse time scale (every few seconds)

• Turn off clusters
– When parallelism changes significantly
– Memory operations
– Exceed real-time requirements
– Finer time scales (100’s of microseconds)
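The coarse-grained voltage-frequency step can be pictured as a table lookup: pick the slowest supported operating point that still meets the real-time frequency. The frequency and voltage pairs below are the "used" values from the power table in this deck; the selection rule and the function name are an illustrative guess at the policy, not a description of actual hardware control.

```python
# (frequency in MHz, voltage in V), taken from the deck's power table.
OPERATING_POINTS = [
    (433, 0.875),
    (533, 0.95),
    (667, 1.05),
    (1000, 1.3),
    (1200, 1.4),
]


def choose_operating_point(needed_mhz: float):
    """Return the lowest (frequency, voltage) pair meeting the need."""
    for freq_mhz, volts in OPERATING_POINTS:
        if freq_mhz >= needed_mhz:
            return freq_mhz, volts
    raise ValueError("workload exceeds the fastest operating point")


print(choose_operating_point(345.09))  # workload (4,7)
print(choose_operating_point(637.21))  # workload (16,9)
print(choose_operating_point(1118.3))  # workload (32,9)
```

Running at the lowest sufficient point is what lets the voltage, and therefore the power, drop when fewer users are active.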

Page 40: Programmable processors for wireless base-stations

Power : Voltage Gating & Scaling

Workload   Freq needed   Freq used   Voltage   Power savings (W)               Power (W)        Savings
           (MHz)         (MHz)       (V)       Clocking   Memory   Clusters   New     Base
(4,7)      345.09        433         0.875     0.325      1.05     0.366      0.3     2.05     85.14 %
(4,9)      380.69        433         0.875     0.193      0.56     0.604      0.69    2.05     66.41 %
(8,7)      408.89        433         0.875     0.089      0.54     0.649      0.77    2.05     62.44 %
(8,9)      463.29        533         0.95      0.304      0.71     0.643      1.33    2.98     55.46 %
(16,7)     528.41        533         0.95      0.02       0.44     0.808      1.71    2.98     42.54 %
(16,9)     637.21        667         1.05      0.156      0.58     0.603      3.21    4.55     29.46 %
(32,7)     902.89        1000        1.3       0.792      1.18     1.375      7.11    10.46    32.03 %
(32,9)     1118.3        1200        1.4       0.774      1.41     0          12.38   14.56    14.98 %

Estimated cluster power consumption: 78 %
Estimated SRF power consumption: 11.5 %
Estimated instruction decoder power consumption: 10.5 %
Estimated chip area (0.13 micron process): 45.7 mm^2

Power can change from 12.38 W to 300 mW depending on workload changes
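The columns of the power table are related by simple arithmetic: the new power is the base power minus the three savings, and the percentage is the total saving over the base. The sketch below re-derives two rows as a consistency check. The helper names are mine, and small differences from the printed percentages are expected because the table entries are rounded.

```python
def power_after_savings(base_w: float, *savings_w: float) -> float:
    """Base power minus clocking, memory, and cluster savings (watts)."""
    return base_w - sum(savings_w)


def percent_saved(base_w: float, new_w: float) -> float:
    """Total saving as a percentage of the base power."""
    return 100.0 * (base_w - new_w) / base_w


# Workload (32,9): base 14.56 W, savings 0.774 + 1.41 + 0 W.
new_32_9 = power_after_savings(14.56, 0.774, 1.41, 0.0)
print(round(new_32_9, 2), round(percent_saved(14.56, new_32_9), 2))

# Workload (4,7): base 2.05 W, savings 0.325 + 1.05 + 0.366 W.
new_4_7 = power_after_savings(2.05, 0.325, 1.05, 0.366)
print(round(new_4_7, 2), round(percent_saved(2.05, new_4_7), 2))
```

The two rows reproduce the table's extremes of roughly 12.38 W and 0.3 W.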

Page 41: Programmable processors for wireless base-stations

Outline

Contribution #3: Design exploration
– How many adders, multipliers, clusters, clock frequency
– Quickly predict real-time performance

Page 42: Programmable processors for wireless base-stations

Deciding ALUs vs. clock frequency

• No independent variables
– Clusters, ALUs, frequency, voltage (c, a, m, f)
– Trade-offs exist

• How to find the right combination for lowest power?

$P \propto C V^2 f$ and $V \propto f$, so $P \propto f^3$

[Figure: three design points. (A) 1 cluster with 1 adder and 1 multiplier at 100 GHz. (B) c clusters, each with a adders and m multipliers, at f MHz. (C) 100 clusters, each with 10 adders and 10 multipliers, at 10 MHz.]
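The trade-off between more ALUs and a lower clock can be sketched with the power relation on this slide. The example assumes, purely for illustration, that capacitance grows linearly with the number of ALUs and that the workload parallelizes perfectly, so multiplying the ALU count by some factor divides the required frequency by the same factor. Neither assumption comes from the deck's capacitance model.

```python
def relative_power(alu_factor: float, p: float) -> float:
    """Power relative to a baseline when ALUs are scaled by alu_factor.

    Capacitance is assumed proportional to the ALU count, and the
    frequency needed for the same throughput is assumed to fall as
    1 / alu_factor, so power scales as alu_factor * (1/alu_factor)**p.
    """
    frequency_factor = 1.0 / alu_factor
    return alu_factor * frequency_factor ** p


# Ten times the ALUs at one tenth of the clock, for P ~ f, f^2, f^3.
for p in (1, 2, 3):
    print(p, relative_power(10, p))
```

With a cubic dependence the wider, slower design wins by two orders of magnitude under these assumptions, while with a linear dependence nothing is gained; the real capacitance model decides where between those extremes a design lands.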

Page 43: Programmable processors for wireless base-stations

Static design exploration

[Figure: execution time split into a static part (computations) and a dynamic part (memory stalls, microcontroller stalls).]

Also helps in quickly predicting real-time performance

Page 44: Programmable processors for wireless base-stations

Sensitivity analysis important

• We have a capacitance model [Khailany2003]

• The equations are not exact
– Need to see how variations affect solutions

$P \propto f^p$, with $1 \le p \le 3$

adder power $= \alpha \times$ multiplier power, with $0.01 \le \alpha \le 1$

Page 45: Programmable processors for wireless base-stations

Design exploration methodology

• 3 types of parallelism: ILP, MMX, DP
• For best performance (power)
– Maximize the use of all

• Maximize ILP and MMX at expense of DP
– Loop unrolling, packing
– Schedule on sufficient number of adders/multipliers

• If DP remains, use clusters = DP
– No other way to exploit that parallelism

Page 46: Programmable processors for wireless base-stations

Setting clusters, adders, multipliers

• If sufficient DP, linear decrease in frequency with clusters
– Set clusters depending on DP and execution time estimate

• To find adders and multipliers
– Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find the execution time

• Put all numbers in the power equation
– Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time

• Choose the solution that minimizes the power

Page 47: Programmable processors for wireless base-stations

Design exploration

For sufficiently large #adders, #multipliers per cluster:
• Explore Algorithm 1: 32 clusters ($t_1$)
• Explore Algorithm 2: 64 clusters ($t_2$)
• Explore Algorithm 3: 64 clusters ($t_3$)
• Explore Algorithm 4: 16 clusters ($t_4$)

[Figure: axes labeled ILP and DP]

$f(c) = \text{real-time target} \times \sum_{i=1}^{L} t_i \, \frac{dp_i}{c}$
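A rough sketch of how such a frequency estimate could be evaluated follows. Every number in it is hypothetical: the cycle counts are made up, and the real-time target is interpreted as the number of times per second the whole workload must complete. The sketch also rounds dp_i / c up to whole passes, so an algorithm gains nothing from clusters beyond its own data parallelism; that rounding is an assumption of this example.

```python
import math

# Hypothetical (data parallelism, cycles at that many clusters) pairs
# for four algorithms explored at 32, 64, 64, and 16 clusters.
ALGORITHMS = [(32, 1000), (64, 2000), (64, 1500), (16, 500)]


def real_time_frequency_hz(clusters: int, algorithms,
                           runs_per_second: int) -> int:
    """Clock frequency needed to finish all algorithms in real time.

    Each algorithm i takes t_i cycles when clusters == dp_i and
    ceil(dp_i / clusters) times as long with fewer clusters.
    """
    cycles = 0
    for dp, t in algorithms:
        cycles += t * math.ceil(dp / clusters)
    return cycles * runs_per_second


for c in (16, 32, 64):
    print(c, real_time_frequency_hz(c, ALGORITHMS, 100_000) / 1e6, "MHz")
```

Because only the static cycle counts are needed, an estimate like this can be swept over many cluster counts without rerunning a simulator.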

Page 48: Programmable processors for wireless base-stations

Clusters: frequency and power

[Figure, left: real-time frequency (MHz, log scale from 10^2 to 10^4) versus number of clusters (log scale from 10^0 to 10^2). Right: normalized power (0 to 1) versus number of clusters (0 to 70) for Power ∝ f, Power ∝ f^2, and Power ∝ f^3. 3G workload.]

Minimum-power design points:
• 32 clusters at frequency = 836.692 MHz (p = 1)
• 64 clusters at frequency = 543.444 MHz (p = 2)
• 64 clusters at frequency = 543.444 MHz (p = 3)

$\min_c P(c)$, where $P(c) \propto C(c) \, f(c)^p$

Page 49: Programmable processors for wireless base-stations

ALU utilization with frequency

[Figure: real-time frequency (in MHz, 500 to 1100) versus #adders (1 to 5) and #multipliers (1 to 3), 3G workload. Each point is labeled with its FU utilization (+,*): (32,28), (38,28), (33,34), (50,31), (42,37), (64,31), (36,53), (51,42), (78,18), (43,56), (65,46), (55,62), (78,27), (67,62), (78,45).]

Page 50: Programmable processors for wireless base-stations

Power variations with f and α

[Figure: nine panels of normalized power (0 to 1) versus #adders (0 to 5) and #multipliers (1 to 3), one panel for each combination of α = 0.01, 0.1, 1 (the ratio of adder power to multiplier power) and P ∝ f, P ∝ f^2, P ∝ f^3.]

Page 51: Programmable processors for wireless base-stations

Choice of adders and multipliers

(α, p)      Optimal adders   Optimal multipliers   ALU/cluster power   Cluster/total power
(0.01,1)    2                1                     30                  61
(0.01,2)    2                1                     30                  61
(0.01,3)    3                1                     25                  58
(0.1,1)     2                1                     52                  69
(0.1,2)     2                1                     52                  69
(0.1,3)     3                1                     51                  68
(1,1)       1                1                     86                  89
(1,2)       2                2                     84                  87
(1,3)       2                2                     84                  87

α: ratio of adder power to multiplier power; p: exponent in P ∝ f^p

Page 52: Programmable processors for wireless base-stations

Exploration results

************************* Final Design Conclusion *************************
Clusters : 64
Multipliers/cluster : 1    Utilization: 62%
Adders/cluster : 3    Utilization: 55%
Real-time frequency : 568.68 MHz
*************************

Exploration done with plots generated in seconds….

Page 53: Programmable processors for wireless base-stations

Outline

Broader impact and limitations

Page 54: Programmable processors for wireless base-stations

Broader impact

• Results not specific to base-stations
– High performance, low power system designs

• Concepts can be extended to handsets

• Mux network applicable to all SIMD processors
– Power efficiency in scientific computing

• Results #2, #3 applicable to all stream applications
– Design and power efficiency
– Multimedia, MPEG, …

Page 55: Programmable processors for wireless base-stations

Limitations

Don’t believe the model is the reality (the proof is in the pudding)

• Fabrication needed to verify concepts
– Cycle-accurate simulator
– Extrapolating models for power

• LDPC decoding (in progress)
– Sparse matrix requires permutations over large data
– Indexed SRF may help

• 3G requires 1 GHz at 128 Kbps/user
– 4G equalization at 1 Mbps breaks down (expected)

Page 56: Programmable processors for wireless base-stations

Conclusions

• Road ends for conventional architectures [Agarwal2000]

• Wide range of architectures
– DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +
– Difficult to compare and contrast
– Need new definitions that allow comparisons

• Wireless workloads
– SPECwireless standard needed

• Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms is not easy in programmable architectures
– My thesis lays the initial foundations

Page 57: Programmable processors for wireless base-stations

Time-scales [met schedules on time – year not mentioned in the proposal]

Work                                                    Time Frame   Status
Fixed point estimation, detection in Imagine            August
Integration of Imagine code                             August
Comparison points: C64x DSP comparisons                 August
Comparison points: Single cluster implementation        August
Comparison points: VLSI comparisons                     August
Innovations: Memory stalls                              September
Innovations: Viterbi FU                                 October
Scaling: Scaling algorithm                              November
Extensions: W-LAN base-station                          December
Extensions: Handset issues                              January

Page 58: Programmable processors for wireless base-stations

Alternate view of the CMP DSP

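[Figure: the stream processor viewed as a chip multiprocessor (CMP) DSP. A streaming memory system with L2 internal memory banks (Bank 1, Bank 2, ..., Bank C) and prefetch buffers feeds clusters of C64x (cluster 0, cluster 1, ..., cluster C), which share an instruction decoder and an inter-cluster communication network.]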

Page 59: Programmable processors for wireless base-stations

Adapting clusters using (1) memory transfers

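[Figure: Stream A, elements 0 to 15, is held in the SRF and spread over 8 clusters, two elements each (0/8, 1/9, ..., 7/15). Step 1 writes the stream to memory. Step 2 reads it back as a re-ordered Stream A', so that 4 clusters hold four elements each (0/4/8/12, 1/5/9/13, 2/6/10/14, 3/7/11/15) and the other 4 clusters are unused (X).]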

Page 60: Programmable processors for wireless base-stations

(2) Using conditional streams

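[Figure: 4 clusters reconfiguring to 2 using conditional streams. A condition switch with values 1, 1, 0, 0 for cluster indices 0 to 3 controls which clusters receive data from the conditional buffer holding A, B, C, D. In access 0, clusters 0 and 1 receive A and B while clusters 2 and 3 receive nothing. In access 1, clusters 0 and 1 receive C and D.]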

Page 61: Programmable processors for wireless base-stations

Arithmetic clusters in stream processors

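[Figure: an arithmetic cluster. Adders (+), multipliers (*), and dividers (/) with distributed register files (supports more ALUs) connect through a cross point to a scratchpad (indexed accesses), a communication unit on the intercluster network, and the SRF (from/to SRF).]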

Page 62: Programmable processors for wireless base-stations

Programming model

stream<int> a(1024);
stream<int> b(1024);
stream<int> sum(1024);
stream<half2> c(512);
stream<half2> d(512);
stream<half2> diff(512);

add(a, b, sum);
sub(c, d, diff);

kernel add(istream<int> a, istream<int> b, ostream<int> sum)
{
    int inputA, inputB, output;

    loop_stream(a)
    {
        a >> inputA;                 // read the next element of each input stream
        b >> inputB;
        output = inputA + inputB;    // operate on the values read, not on the streams
        sum << output;               // append the result to the output stream
    }
}

kernel sub(istream<half2> c, istream<half2> d, ostream<half2> diff)
{
    half2 inputC, inputD, output;    // packed 16-bit pairs, matching the stream type

    loop_stream(c)
    {
        c >> inputC;
        d >> inputD;
        output = inputC - inputD;
        diff << output;
    }
}

Your new hardware won’t run your old software – Balch’s law

Page 63: Programmable processors for wireless base-stations

Stream processor programming

[Figure: stream program for the receiver. Kernels (correlator channel estimation, matched filter, interference cancellation, Viterbi decoding) are connected by streams, from the input data (received signal) to the output data (decoded bits).]

• Kernels (computation) and streams (communication)

• Use local data in clusters providing GOPs support

• Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.

Page 64: Programmable processors for wireless base-stations

Parallel Viterbi Decoding

• Add-Compare-Select (ACS): trellis interconnect, computations
– Parallelism depends on constraint length (#states)

• Traceback: searching
– Conventional traceback
  • Sequential (no DP) with dynamic branching
  • Difficult to implement in a parallel architecture
– Use Register Exchange (RE)
  • Parallel solution

[Figure: detected bits enter the ACS unit, which feeds the traceback unit, which outputs the decoded bits.]
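The register-exchange idea can be illustrated with a small hard-decision decoder. This is a toy sketch, not the thesis implementation: it uses a constraint length 3, rate 1/2 code with generator polynomials 7 and 5 (octal), chosen only to keep the example short, and all function names are illustrative. Each state carries its own survivor register, which is copied from the winning predecessor during ACS, so no sequential traceback is needed.

```python
def parity(x: int) -> int:
    """Return the parity (XOR of all bits) of x."""
    return bin(x).count("1") & 1


def conv_encode(bits, K=3, polys=(0b111, 0b101)):
    """Encode bits with a rate 1/len(polys) convolutional code."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit in the MSB
        out.append(tuple(parity(reg & p) for p in polys))
        state = reg >> 1
    return out


def viterbi_register_exchange(symbols, K=3, polys=(0b111, 0b101)):
    """Hard-decision Viterbi decoding with register exchange."""
    n_states = 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)    # encoder starts in state 0
    registers = [[] for _ in range(n_states)]
    for sym in symbols:
        new_metrics = [inf] * n_states
        new_registers = [None] * n_states
        for s in range(n_states):             # independent per state: the DP
            if metrics[s] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | s
                expected = tuple(parity(reg & p) for p in polys)
                nxt = reg >> 1
                # Add: path metric plus Hamming distance of this branch
                m = metrics[s] + sum(e != r for e, r in zip(expected, sym))
                # Compare-select, then exchange the survivor register
                if m < new_metrics[nxt]:
                    new_metrics[nxt] = m
                    new_registers[nxt] = registers[s] + [b]
        metrics, registers = new_metrics, new_registers
    best = min(range(n_states), key=lambda s: metrics[s])
    return registers[best]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
received = conv_encode(message)
received[0] = (1 - received[0][0], received[0][1])   # inject one bit error
print(viterbi_register_exchange(received) == message)
```

The inner loop over states has no dependence between states within a step, which is the parallelism that grows with the constraint length.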