flexible wireless communication architectures

RICE UNIVERSITY

Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering, Rice University, Houston, TX

Faculty Candidate Seminar – Southern Methodist University, April 23, 2003

This work has been supported in part by NSF, Nokia and Texas Instruments

Future wireless devices demand flexibility

Multiple algorithms and environments supported in same device

High data rate mobile devices with multimedia

Flexible algorithms: Multiple antennas, complex signal processing

Flexible architectures: High performance (Mbps), low power (mW)

Fast design with structured exploration

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN

Flexibility needed in different layers

Physical Layer

MAC Layer

Network Layer

Application Layer: Puppeteer project at Rice, http://www.cs.rice.edu/CS/Systems/Puppeteer/

Analog RF

Flexible Algorithms

Mapping

Flexible Architectures

Research vision: Attain flexibility

Algorithms: Flexibility: support a variety of sophisticated algorithms

Architectures: Flexibility: adapt hardware to algorithms

Fast, structured design exploration

Design me

Contributions: Algorithms

Multi-user channel estimation [Jnl. of VLSI Sig. Proc. ’02, ASAP ’00]: matrix inversions replaced by numerical techniques

Conjugate-gradient descent for complexity reduction

Multi-user detection: [ISCAS’01] Block-based computation to streaming computations

Pipelining, lower memory requirements

Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]
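The complexity reduction listed under multi-user channel estimation replaces an explicit matrix inversion with an iterative solve. A minimal sketch in C, assuming a small symmetric positive-definite system A x = b. The system size, the tolerance handling and the function name `cg_solve` are illustrative assumptions, not taken from the cited papers:

```c
#include <assert.h>

#define DIM 2   /* illustrative system size (assumption) */

/* Solve A x = b for a symmetric positive-definite A by conjugate gradient.
   Stops when the squared residual norm drops to tol_sq or below.
   Returns the number of iterations used. */
static int cg_solve(double A[DIM][DIM], const double b[DIM],
                    double x[DIM], int max_iter, double tol_sq)
{
    double r[DIM], p[DIM], Ap[DIM];
    double rs_old = 0.0;
    int i, j, k;

    assert(max_iter > 0);
    for (i = 0; i < DIM; ++i) {          /* start from x = 0, so r = b */
        x[i] = 0.0;
        r[i] = b[i];
        p[i] = r[i];
        rs_old += r[i] * r[i];
    }
    for (k = 0; k < max_iter && rs_old > tol_sq; ++k) {
        double pAp = 0.0, rs_new = 0.0, alpha;
        for (i = 0; i < DIM; ++i) {      /* Ap = A * p */
            Ap[i] = 0.0;
            for (j = 0; j < DIM; ++j)
                Ap[i] += A[i][j] * p[j];
            pAp += p[i] * Ap[i];
        }
        alpha = rs_old / pAp;            /* step length along p */
        for (i = 0; i < DIM; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rs_new += r[i] * r[i];
        }
        for (i = 0; i < DIM; ++i)        /* next search direction */
            p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
    }
    return k;
}
```

Each iteration needs only one matrix-vector product and a few dot products, all regular multiply-accumulate loops, which is why this form suits parallel fixed-point hardware better than a direct inverse.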

Contributions: Architectures

Heterogeneous DSP-FPGA system designs: [ICSPAT’00]

Computer arithmetic [Symp. on Comp. Arith. ’01]: dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation

[Ph.D. Thesis]

Scalable Wireless Application-specific Processors (SWAPs)

Rapid, structured architectures with flexibility-performance tradeoffs

Scalable Wireless Application-specific Processors

Family of flexible programmable processors

Clusters of ALUs

High performance by supporting 100’s of ALUs

Can provide customization for various algorithms

Adapts (“swaps”) architecture dynamically for power

[Figure: clusters of ALUs (adders “+” and multipliers “*”). Scale clusters; scale ALUs.]

Rapid, structured design for SWAPs

Low “complexity”, parallel, fixed-point algorithms

[Figure: the algorithms feed architecture exploration, which applies ASIC design and DSP design to arrive at SWAPs]

Research vision summary

Provide a structured framework to rapidly explore flexible, high-performance, low-power architectures (SWAPs)

Efficient algorithm design for mapping to SWAPs

Understanding of algorithms, DSPs and ASICs used

Flexibility-performance trade-offs

Inter-disciplinary research: Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, Circuits, CAD, Compilers

Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals

SWAPs borrow from DSPs

DSPs use: Instruction Level Parallelism (ILP), Subword Parallelism (MMX)

Not enough ALUs for GOPs of computation: need 100’s; TI C6x has 8 ALUs

Why not more ALUs?

Cannot support more registers (area, ports)

Difficult to find ILP as ALUs increase

[Figure: register file (RF) scaling from 1 ALU to 4, 16 and 32 ALUs]

SWAPs borrow from ASICs

Exploit data parallelism (DP)

Available in many wireless algorithms

This is what ASICs do!

#define N 1024
int i, a[N], b[N], sum[N];        // 32-bit words
short int c[N], d[N], diff[N];    // 16-bit values, packed two per word
for (i = 0; i < N; ++i)           // iterations are independent of each other
{
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}

ILP

DP

Subword
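The subword label can be made concrete for the `diff` line: two 16-bit values packed in one 32-bit word can be subtracted with a single word-wide operation. A sketch in portable C, assuming wraparound 16-bit arithmetic. The function names are hypothetical, and a real DSP would use a dedicated packed-subtract instruction instead of this bit manipulation:

```c
#include <assert.h>
#include <stdint.h>

/* Subtract two pairs of 16-bit lanes packed in 32-bit words.
   Each lane wraps around independently; no borrow crosses lanes. */
static uint32_t sub16x2(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80008000u;        /* top bit of each lane */
    uint32_t low = (a | H) - (b & ~H);     /* a borrow cannot leave its lane */
    return low ^ ((a ^ ~b) & H);           /* repair the top bit of each lane */
}

/* Pack two 16-bit values into one word (high lane, low lane). */
static uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* diff[i] = c[i] - d[i] over n packed words, two lanes per iteration. */
static void diff_packed(const uint32_t *c, const uint32_t *d,
                        uint32_t *diff, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; ++i)
        diff[i] = sub16x2(c[i], d[i]);
}
```

ILP would issue the `sum` and `diff` operations of one iteration together, subword parallelism handles two `diff` lanes per operation as above, and DP runs different iterations `i` on different clusters.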

SWAPs borrow from stream processors

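[Figure: stream program for the receiver. Kernels: correlator (channel estimation), matched filter, interference cancellation, Viterbi decoding. Streams carry input data and output data between kernels, from the received signal to the decoded bits.]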

Kernels (computation) and streams (communication)

Use local data in clusters providing GOPs support

Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.

SWAPs are multi-cluster DSPs

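[Figure: a DSP is one cluster of adders and multipliers with internal memory, exploiting ILP. A SWAP has many such clusters fed from memory, the Stream Register File (SRF), exploiting ILP within a cluster and DP across clusters.]

SWAPs adapt clusters to DP

Identical clusters, same operations

Power down unused FUs and clusters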

Arithmetic clusters in SWAPs

[Figure: one arithmetic cluster. Adders, multipliers and dividers with distributed register files (supports more ALUs) are connected through a cross point to a scratchpad (indexed accesses), a comm. unit on the intercluster network, and the SRF (from/to SRF).]

Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals


SWAPs: Physical layer algorithms

[Figure: receiver chain. Antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/Network/OS) layers.]

Complex signal processing algorithms with GOPs of computation

SWAP mapping example: Viterbi decoding

Multiple antenna systems (MIMO systems)

Complexity exponential with transmit x receive antennas

Estimation: Linear MMSE, blind, conjugate gradient, ...

Detection: FFT, (blind) interference cancellation, ...

Decoding: Viterbi, Turbo, LDPC, ... and joint schemes

SWAP flexibility lets you use the best algorithms for the situation

Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding for SWAPs

Add-Compare-Select (ACS): trellis interconnect, computations

Parallelism depends on constraint length (#states)

Traceback: searching

Conventional traceback:

• Sequential (no DP) with dynamic branching

• Difficult to implement in a parallel architecture

Use Register Exchange (RE):

• Parallel solution

[Figure: detected bits enter the ACS unit, which feeds the traceback unit, which outputs decoded bits]
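Register exchange can be sketched in a few lines: every state carries the decoded bits of its own survivor path, and each ACS step copies the winner's register and appends one bit, so no sequential traceback is needed. The following is a minimal hard-decision sketch, not the implementation from the talk. The constraint length 3, the generators 7 and 5 (octal), the 32-bit survivor registers and all function names are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define K 3                        /* constraint length (illustrative) */
#define NSTATES (1 << (K - 1))
#define G0 7u                      /* generator polynomials, octal 7 and 5 */
#define G1 5u

static unsigned parity(unsigned v)
{
    unsigned p = 0;
    while (v) { p ^= v & 1u; v >>= 1; }
    return p;
}

/* Hamming weight of a 2-bit value. */
static unsigned hamming2(unsigned x) { return (x & 1u) + ((x >> 1) & 1u); }

/* Two coded bits produced when input bit u is shifted into state s. */
static unsigned branch_output(unsigned s, unsigned u)
{
    unsigned r = (u << (K - 1)) | s;          /* newest bit on top */
    return (parity(r & G0) << 1) | parity(r & G1);
}

/* Encode n message bits (MSB first) from state 0; out[t] holds 2 coded bits. */
static void encode(uint32_t msg, int n, unsigned *out)
{
    unsigned s = 0;
    int t;
    for (t = 0; t < n; ++t) {
        unsigned u = (msg >> (n - 1 - t)) & 1u;
        out[t] = branch_output(s, u);
        s = (s >> 1) | (u << (K - 2));
    }
}

/* Viterbi decoding with register exchange instead of traceback. */
static uint32_t decode_re(const unsigned *rx, int n)
{
    unsigned metric[NSTATES], new_metric[NSTATES];
    uint32_t reg[NSTATES], new_reg[NSTATES];
    unsigned ns, best = 0;
    int t;

    assert(n > 0 && n <= 32);
    for (ns = 0; ns < NSTATES; ++ns) {
        metric[ns] = ns ? 1000u : 0u;         /* encoder starts in state 0 */
        reg[ns] = 0;
    }
    for (t = 0; t < n; ++t) {
        for (ns = 0; ns < NSTATES; ++ns) {    /* states are independent: DP */
            unsigned u  = ns >> (K - 2);      /* input bit that leads to ns */
            unsigned p0 = (ns << 1) & (NSTATES - 1);   /* two predecessors */
            unsigned p1 = p0 | 1u;
            unsigned m0 = metric[p0] + hamming2(rx[t] ^ branch_output(p0, u));
            unsigned m1 = metric[p1] + hamming2(rx[t] ^ branch_output(p1, u));
            unsigned win = (m1 < m0) ? p1 : p0;
            new_metric[ns] = (m1 < m0) ? m1 : m0;
            new_reg[ns] = (reg[win] << 1) | u;         /* register exchange */
        }
        memcpy(metric, new_metric, sizeof metric);
        memcpy(reg, new_reg, sizeof reg);
    }
    for (ns = 1; ns < NSTATES; ++ns)
        if (metric[ns] < metric[best]) best = ns;
    return reg[best];
}
```

The inner loop has no data-dependent branching and touches every state the same way, which is the property that lets it map onto identical clusters. The price is that every survivor register is rewritten each step.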

Parallel Viterbi needs re-ordering for SWAPs

Exploiting Viterbi DP in SWAPs:

Use RE instead of regular traceback

Re-order ACS, RE

[Figure: one 16-state trellis stage. Regular ACS reads the states X(0), X(1), ..., X(15) in natural order. ACS in SWAPs re-orders them into an even group X(0), X(2), ..., X(14) and an odd group X(1), X(3), ..., X(15), so that the ACS becomes a DP vector operation.]
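The re-ordering on this slide can be sketched as follows, assuming the shift-register state numbering in which new states j and j + N/2 both read old states 2j and 2j + 1. Splitting the old path metrics into an even vector and an odd vector lets every cluster apply the same add-compare-select to element j. Branch metrics are passed in as a plain array, and the function and array names are hypothetical:

```c
#include <assert.h>

#define N 16                   /* number of trellis states, as on the slide */

/* One ACS stage on re-ordered path metrics.
   x[]      : old path metrics X(0)..X(N-1) in natural order
   bm[p][u] : branch metric for leaving old state p with input bit u
   y[]      : new path metrics; input bit 0 leads to y[j], bit 1 to y[j + N/2] */
static void acs_reordered(const unsigned x[N], unsigned bm[N][2], unsigned y[N])
{
    unsigned even[N / 2], odd[N / 2];
    int j;

    assert(N % 2 == 0);
    for (j = 0; j < N / 2; ++j) {      /* re-order: X(0),X(2),.. and X(1),X(3),.. */
        even[j] = x[2 * j];
        odd[j]  = x[2 * j + 1];
    }
    for (j = 0; j < N / 2; ++j) {      /* identical work for every j: DP */
        unsigned a, b;
        a = even[j] + bm[2 * j][0];
        b = odd[j]  + bm[2 * j + 1][0];
        y[j] = a < b ? a : b;
        a = even[j] + bm[2 * j][1];
        b = odd[j]  + bm[2 * j + 1][1];
        y[j + N / 2] = a < b ? a : b;
    }
}
```

After the split, element j of the even vector and element j of the odd vector are the only inputs that position j needs, so no cluster has to fetch metrics from a neighbour during the ACS itself.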

Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals


SWAP architecture design

More clusters are better than more ALUs per cluster (if #clusters > 2)

1. Decide how many clusters: exploit DP

2. Decide what to put within each cluster: maximize ILP with high functional unit efficiency; search the design space with the “explore” tool

Time-power-area characterization

[Figure: clusters of adders and multipliers; ILP within a cluster, DP across clusters]

Design a SWAP cluster: “Explore”

Auto-exploration of adders and multipliers for “ACS”

[Figure: 3-D plot of instruction count (about 40 to 160) versus #adders (1 to 5) and #multipliers (1 to 5). Each point is labeled (adder util%, multiplier util%).]

“Explore” tool benefits

Instruction count vs. ALU efficiency: what goes inside each cluster

Design customized application-specific units: better performance with increased ALU utilization

Explore multiple algorithms: turn off functional units not in use for a given kernel (Vdd-gating, clock-gating techniques)

Example for SWAP architecture design

Explore Algorithm 1 : 3 adders, 3 multipliers, 32 clusters

Explore Algorithm 2 : 4 adders, 1 multiplier, 64 clusters



Chosen Architecture: 4 adders, 3 multipliers, 64 clusters

ILP

DP
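The chosen architecture in this example is the element-wise maximum of what each algorithm needs, so every kernel fits and the surplus can be turned off per algorithm. A sketch under that reading of the numbers; the struct and function names are hypothetical:

```c
#include <assert.h>

/* Resources one algorithm needs, as reported by an exploration step. */
struct swap_config {
    int adders;        /* adders per cluster */
    int multipliers;   /* multipliers per cluster */
    int clusters;      /* number of clusters */
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest single architecture that covers every algorithm in req[]. */
static struct swap_config choose_architecture(const struct swap_config *req, int n)
{
    struct swap_config c = {0, 0, 0};
    int i;

    assert(n > 0);
    for (i = 0; i < n; ++i) {
        c.adders      = max_int(c.adders, req[i].adders);
        c.multipliers = max_int(c.multipliers, req[i].multipliers);
        c.clusters    = max_int(c.clusters, req[i].clusters);
    }
    return c;
}
```

Feeding it the two requirement sets above, {3, 3, 32} and {4, 1, 64}, gives 4 adders, 3 multipliers and 64 clusters.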

SWAP flexibility provides power savings

Multiple algorithms: different ALU, cluster requirements

Turning off ALUs (-add -mul compiler options): use the right #ALUs from the “explore” tool

Turning off clusters:

Data is spread across the SRF of all clusters

A cluster only has access to its own SRF

The next kernel may need data from the SRF of other clusters

Reconfiguration support needs to be provided

SWAPs provide cluster reconfiguration

[Figure: the SRF connects to the clusters through a mux-demux network with stream buffers (MDX and latch stages)]

Additional latency (few cycles) due to microcontroller stalls

- Minimal loss in performance

Cluster reconfiguration for Viterbi

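Packet 1: constraint length 7 (16 clusters)

Packet 2: constraint length 9 (64 clusters)

Packet 3: constraint length 5 (4 clusters)

Clusters not needed for the available DP can be turned OFF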
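The cluster counts quoted in this talk (16, 64 and 4 clusters for K = 7, 9 and 5, as paired in the power table later on) track the number of trellis states, 2^(K-1), at four states per cluster. That ratio is inferred from those numbers rather than stated in the slides. A sketch with hypothetical names:

```c
#include <assert.h>

#define STATES_PER_CLUSTER 4   /* inferred: 64/16, 256/64 and 16/4 */
#define MAX_CLUSTERS 64        /* clusters in the SWAP considered here */

/* Clusters kept active for a Viterbi kernel of constraint length k. */
static int active_clusters(int k)
{
    int states, clusters;

    assert(k >= 3 && k <= 16);
    states = 1 << (k - 1);
    clusters = states / STATES_PER_CLUSTER;
    return clusters < MAX_CLUSTERS ? clusters : MAX_CLUSTERS;
}

/* Clusters that can be powered down while that kernel runs. */
static int idle_clusters(int k)
{
    return MAX_CLUSTERS - active_clusters(k);
}
```

Keeping the states-per-cluster ratio fixed is also what allows the same kernel code to serve every constraint length; only the number of active clusters changes between packets.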

SWAPs provide flexibility at negligible overhead

[Figure: execution time (cycles), split into clusters and memory, for 64-bit rate ½ packets. Packet 1: K = 7; Packet 2: K = 9; Packet 3: K = 5. Kernels (computation); no data memory accesses.]

SWAP exploration for Viterbi decoding

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100) for K = 9, K = 7 and K = 5. Marked: different SWAPs (without reconfiguration), same SWAP (with reconfiguration), DSP, max DP.]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

SWAPs : Salient features

1-2 orders of magnitude better than a DSP

Any constraint length: 10 MHz at 128 Kbps

Same code for all constraint lengths: no need to re-compile or load other code, as long as the parallelism/cluster ratio is constant

Power savings due to dynamic cluster scaling

Expected SWAP power consumption

Power model based on [Khailany’03]

64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V

Peak active power: ~9 mW at 1 MHz (DSP ~1 mW)

Area: ~53.7 mm²

10 MHz, 128 Kbps with reconfiguration

Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

[Figure: power (in mW, 0 to 90) versus active clusters (0 to 70, max 64)]

Viterbi       Clusters used   Peak power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW

Multiuser Estimation-Detection+Decoding

Real-time target : 128 Kbps per user

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100) for FAST, MEDIUM and SLOW fading scenarios. Marked: 32-user base-station, mobile, DSP.]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power : base-station

32-user base-station with 3 X’s per cluster and 64 clusters: 0.13 micron, 1.2 V

Peak active power: ~18.19 mW for 1 MHz (increased X)

Area: ~93.4 mm²

Total peak base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
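The total follows from scaling the per-MHz figure linearly with clock frequency. As a worked check, assuming peak active power is proportional to frequency at fixed voltage:

```latex
\[
P_{\text{base-station}} \approx 18.19\ \tfrac{\text{mW}}{\text{MHz}} \times 1000\ \text{MHz}
  = 18190\ \text{mW} = 18.19\ \text{W}
\]
\[
P_{\text{Viterbi},\,K=9} \approx 9\ \tfrac{\text{mW}}{\text{MHz}} \times 10\ \text{MHz} = 90\ \text{mW}
\]
```

The second line is the same scaling applied to the Viterbi figures quoted earlier in the talk (~9 mW at 1 MHz, run at 10 MHz, giving ~90 mW).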

Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals


Current research: Flexibility vs. performance

SWAPs: 128 Kbps at ~10-100 mW for Viterbi

Borrow DP from ASICs!

Suitable for base-stations: flexibility more important than power

Suitable for mobile devices: power constraints tighter; can be customized for further power savings

Handset SWAPs (H-SWAPs)

Borrow task pipelining from ASICs!

Application-specific units and specialized comm. network

Handset SWAPs: H-SWAPs

Trade Data Parallelism for Task Pipelining

[Figure: SWAPs (max. clusters and reconfigure) use full DP across many clusters fed from the SRF. A SWAPlet limits the number of clusters (limited DP). H-SWAPs are a collection of customized SWAPlets, each with limited DP and its own mix of adders and multipliers.]

Sample points in architecture exploration

Architecture                        Parallelism exploited
DSPs (1 cluster)                    ILP, Subword
SWAPs (multiple clusters)           ILP, Subword, DP
H-SWAPs (optimized for handsets)    ILP, Subword, DP, Task Pipelining, Custom ALUs

Programmable solutions with increased customization

Performance, power benefits (with decreasing flexibility)

Future: Efficient algorithms and mapping

[Figure: multiple-antenna receiver blocks: multipath channel, channel estimator, equalizer, turbo equalizer, MRC, beam-forming, coherent STC, non-coherent STC, detector, demodulator, decoder]

Multiple antenna systems with 1-2 orders-of-magnitude higher complexity

Future research: Architectures

Generalized and structured framework and tools

Joint algorithm-architecture exploration

Area-time-power-flexibility tradeoffs

Potential applications: embedded systems

Image and video processing: cameras, with a variety of compression algorithms

Biomedical applications: hearing aids, DSP running on body heat*

Sensor networks: compression of data before transmission

*Quote: Gene Frantz, TI Fellow

SWAPs: Flexibility, Performance, Power

Need flexibility in future wireless devices: algorithms and architectures

Rapid exploration for Scalable Wireless Application-specific Processors: structured approach with flexibility-performance trade-offs

SWAPs: flexibility, high performance and low power

Exploit data parallelism like ASICs

1-2 orders better performance than DSPs

Turn off unused clusters and unused ALUs for low power
