flexible wireless communication architectures
Flexible wireless communication architectures. Sridhar Rajagopal, Department of Electrical and Computer Engineering, Rice University, Houston TX. Faculty Candidate Seminar – Southern Methodist University, April 23, 2003. This work has been supported in part by NSF, Nokia and Texas Instruments.
RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston TX
Faculty Candidate Seminar – Southern Methodist University, April 23, 2003
This work has been supported in part by NSF, Nokia and Texas Instruments
Future wireless devices demand flexibility
Multiple algorithms and environments supported in same device
High data rate mobile devices with multimedia
Flexible algorithms: Multiple antennas, complex signal processing
Flexible architectures: High performance (Mbps), low power (mW)
Fast design with structured exploration
Bluetooth/Home Networks
Wireless Cellular
Wireless LAN
Flexibility needed in different layers
Physical Layer
MAC Layer
Network Layer
Application Layer: Puppeteer project at Rice, http://www.cs.rice.edu/CS/Systems/Puppeteer/
Analog RF
Flexible Algorithms
Mapping
Flexible Architectures
Research vision: Attain flexibility
Algorithms: Flexibility: support a variety of sophisticated algorithms
Architectures: Flexibility: adapt hardware to algorithms
Fast, structured design exploration
Design me
Contributions: Algorithms
Multi-user channel estimation [Jnl. of VLSI Sig. Proc. ’02, ASAP ’00]:
Matrix inversions replaced by numerical techniques
Conjugate-gradient descent for complexity reduction
Multi-user detection [ISCAS ’01]:
Block-based computation converted to streaming computations
Pipelining, lower memory requirements
Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm. ’02]
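The idea of avoiding an explicit matrix inversion can be sketched with a plain conjugate-gradient solver for A x = b. This is only an illustration under stated assumptions: a small, real-valued, symmetric positive-definite matrix in double precision, whereas the cited work targets fixed-point multi-user estimation. The names `cg_solve`, `dot` and `CG_MAX_N` are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 16  /* illustrative upper bound on the system size */

/* Dot product of two length-n vectors. */
static double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/*
 * Solve A x = b for a symmetric positive-definite n x n matrix A (row-major)
 * using conjugate-gradient iterations instead of forming the inverse of A.
 * x holds the initial guess on entry and the estimate on return.
 * Returns the number of iterations performed.
 */
size_t cg_solve(const double *A, const double *b, double *x,
                size_t n, size_t max_iter, double tol)
{
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    assert(n > 0 && n <= CG_MAX_N);

    /* Initial residual r = b - A x and first search direction p = r. */
    for (size_t i = 0; i < n; ++i) {
        double ax = 0.0;
        for (size_t j = 0; j < n; ++j)
            ax += A[i * n + j] * x[j];
        r[i] = b[i] - ax;
        p[i] = r[i];
    }
    double rr = dot(r, r, n);

    size_t k = 0;
    while (k < max_iter && rr > tol * tol) {
        /* Only matrix-vector products are needed: Ap = A p. */
        for (size_t i = 0; i < n; ++i) {
            Ap[i] = 0.0;
            for (size_t j = 0; j < n; ++j)
                Ap[i] += A[i * n + j] * p[j];
        }
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++k;
    }
    return k;
}
```

Each iteration costs one matrix-vector product, and the iteration can be stopped early when an approximate estimate is good enough, which is where the complexity reduction would come from.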
Contributions: Architectures
Heterogeneous DSP-FPGA system designs: [ICSPAT’00]
Computer arithmetic [Symp. on Comp. Arith. ’01]:
Dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation
[Ph.D. Thesis]
Scalable Wireless Application-specific Processors (SWAPs)
Rapid, structured architectures with flexibility-performance tradeoffs
Scalable Wireless Application-specific Processors
Family of flexible programmable processors
Clusters of ALUs
High performance by supporting 100’s of ALUs
Can provide customization for various algorithms
Adapts (“swaps”) architecture dynamically for power
[Figure: clusters of adders (+), multipliers (**) and undetermined units (?). The design scales the number of clusters and the number of ALUs per cluster.]
Rapid, structured design for SWAPs
Low “complexity”, parallel, fixed-point algorithms
[Figure: the algorithms feed architecture exploration, which applies ASIC design and DSP design to arrive at SWAPs (clusters of +, ** and undetermined ? units).]
Research vision summary
Provide a structured framework to rapidly explore flexible, high-performance, low-power architectures (SWAPs)
Efficient algorithm design for mapping to SWAPs
Understanding of algorithms, DSPs and ASICs used
Flexibility-performance trade-offs
Inter-disciplinary research: Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, Circuits, CAD, Compilers
Talk Outline
Research vision
SWAPs - Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
SWAPs borrow from DSPs
DSPs use:
Instruction Level Parallelism (ILP)
Subword Parallelism (MMX)
Not enough ALUs for GOPs of computation: need 100’s
TI C6x has 8 ALUs
Why not more ALUs?
Cannot support more registers (area, ports)
Difficult to find ILP as ALUs increase
[Figure: register file (RF) shown for 1, 4, 16 and 32 ALUs.]
SWAPs borrow from ASICs
Exploit data parallelism (DP)
Available in many wireless algorithms
This is what ASICs do!
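#define N 1024
int i, a[N], b[N], sum[N];       // 32 bits
short int c[N], d[N], diff[N];   // 16 bits, packed (subword parallelism)

for (i = 0; i < N; ++i)          // independent iterations: DP
{
    sum[i] = a[i] + b[i];        // the add and the subtract can issue together: ILP
    diff[i] = c[i] - d[i];
}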
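As a rough illustration of how such a loop exposes data parallelism, the sketch below organises the element-wise add as SIMD steps over a number of identical lanes. It is a sequential C model under assumed names (`dp_vector_add`, `clusters`), not code for any real SWAP or DSP toolchain.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Element-wise add of two n-element vectors, organised as SIMD steps over
 * `clusters` identical lanes. Returns the number of steps taken, which is
 * ceil(n / clusters).
 */
size_t dp_vector_add(const int *a, const int *b, int *sum,
                     size_t n, size_t clusters)
{
    size_t steps = 0;
    assert(clusters > 0);

    for (size_t base = 0; base < n; base += clusters) {
        /* In hardware every cluster would execute this body in the same
           cycle; the inner loop only models that in sequential C. */
        for (size_t c = 0; c < clusters && base + c < n; ++c)
            sum[base + c] = a[base + c] + b[base + c];
        ++steps;
    }
    return steps;
}
```

With 10 elements and 4 lanes the model takes 3 steps instead of 10, which is the kind of gain the slide attributes to DP.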
SWAPs borrow from stream processors
[Figure: kernels connected by streams. The received signal (input data) passes through the correlator (channel estimation), matched filter, interference cancellation and Viterbi decoding kernels to produce the decoded bits (output data).]
Kernels (computation) and streams (communication)
Use local data in clusters providing GOPs support
Imagine stream processor at Stanford [Rixner’01]
Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.
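The kernel/stream split can be sketched in C as a chain of functions that each consume one stream and produce the next. Everything here is a stand-in chosen for illustration: `stream_t`, `kernel_fn`, `run_pipeline` and the two toy kernels are hypothetical names, and the kernels are far simpler than a real correlator or Viterbi decoder.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_MAX 64  /* illustrative stream capacity */

/* A stream: a sequence of records passed between kernels. */
typedef struct {
    int    data[STREAM_MAX];
    size_t len;
} stream_t;

/* A kernel: computation that reads one stream and writes another. */
typedef void (*kernel_fn)(const stream_t *in, stream_t *out);

/* Toy matched filter: correlate each pair of chips with the code {+1, -1}. */
void kernel_despread(const stream_t *in, stream_t *out)
{
    out->len = in->len / 2;
    for (size_t i = 0; i < out->len; ++i)
        out->data[i] = in->data[2 * i] - in->data[2 * i + 1];
}

/* Toy detector: hard decision on the sign of each soft value. */
void kernel_slice(const stream_t *in, stream_t *out)
{
    out->len = in->len;
    for (size_t i = 0; i < in->len; ++i)
        out->data[i] = in->data[i] >= 0 ? 1 : 0;
}

/* Run the kernels in order, handing each output stream to the next kernel. */
void run_pipeline(const kernel_fn *kernels, size_t nk,
                  const stream_t *in, stream_t *out)
{
    stream_t buf[2];
    size_t cur = 0;
    assert(in->len <= STREAM_MAX);

    buf[0] = *in;
    for (size_t k = 0; k < nk; ++k) {
        kernels[k](&buf[cur], &buf[1 - cur]);
        cur = 1 - cur;
    }
    *out = buf[cur];
}
```

Because each kernel touches only its input and output streams, intermediate data can stay in local storage between kernels, which is the property the slide relies on.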
SWAPs are multi-cluster DSPs
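[Figure: a DSP is a single cluster of adders and multipliers (+++***) with internal memory, exploiting ILP. SWAPs replicate that cluster behind a Stream Register File (SRF), exploiting ILP within a cluster and DP across clusters.]
SWAPs adapt clusters to DP
Identical clusters, same operations.
Power-down unused FUs, clusters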
Arithmetic clusters in SWAPs
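[Figure: one arithmetic cluster. Adders (+), multipliers (*) and dividers (/) with distributed register files (supports more ALUs) connect through a cross point to the SRF (from/to SRF), a scratchpad (indexed accesses) and a comm. unit on the intercluster network.]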
Talk Outline
Research vision
SWAPs Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
SWAPs: Physical layer algorithms
[Figure: receiver chain. Antenna, RF front-end, baseband processing (channel estimation, detection, decoding), then higher (MAC/Network/OS) layers.]
Complex signal processing algorithms with GOPs of computation
SWAP mapping example: Viterbi decoding
Multiple antenna systems (MIMO systems)
Complexity exponential with transmit x receive antennas
Estimation: Linear MMSE, blind, conjugate gradient….
Detection: FFT, (blind) interference cancellation….
Decoding: Viterbi, Turbo, LDPC…. & joint schemes
SWAP flexibility lets you use the best algorithms for the situation
Example for concept demonstration: Viterbi decoding
Parallel Viterbi Decoding for SWAPs
Add-Compare-Select (ACS): trellis interconnect, computations
Parallelism depends on constraint length (#states)
Traceback: searching
Conventional:
• Sequential (no DP) with dynamic branching
• Difficult to implement in parallel architecture
Use Register Exchange (RE):
• Parallel solution
[Figure: detected bits enter the ACS unit, followed by the traceback unit, which outputs the decoded bits.]
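A minimal sketch of ACS combined with register exchange is given below, so that no sequential traceback is needed. The setup is an assumption made to keep the example short: a constraint-length-3, rate-½ code with generators (7,5) octal, hard decisions, and survivors of at most 32 bits held in one register per state. The function names are hypothetical and are not taken from the talk.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define NSTATES 4  /* constraint length K = 3 gives 2^(K-1) states */

/* Encoder output (2 bits) for input bit u leaving state s; generators (7,5). */
static unsigned branch_output(unsigned s, unsigned u)
{
    unsigned s1 = (s >> 1) & 1u, s0 = s & 1u;
    unsigned o0 = u ^ s1 ^ s0;   /* generator 111 */
    unsigned o1 = u ^ s0;        /* generator 101 */
    return (o0 << 1) | o1;
}

/* Encode nbits message bits (bit t of msg is sent at time t). */
void conv_encode(uint32_t msg, size_t nbits, unsigned char *symbols)
{
    unsigned s = 0;
    assert(nbits <= 32);
    for (size_t t = 0; t < nbits; ++t) {
        unsigned u = (unsigned)(msg >> t) & 1u;
        symbols[t] = (unsigned char)branch_output(s, u);
        s = (u << 1) | (s >> 1);     /* new bit enters at the MSB */
    }
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    return (x & 1u) + (x >> 1);
}

/* Viterbi decoding with register exchange; returns the decoded message. */
uint32_t viterbi_re_decode(const unsigned char *symbols, size_t nbits)
{
    unsigned metric[NSTATES] = {0, UINT_MAX / 2, UINT_MAX / 2, UINT_MAX / 2};
    uint32_t reg[NSTATES] = {0};     /* one survivor register per state */
    assert(nbits <= 32);

    for (size_t t = 0; t < nbits; ++t) {
        unsigned new_metric[NSTATES];
        uint32_t new_reg[NSTATES];
        /* Each state update is independent of the others: this is the DP. */
        for (unsigned n = 0; n < NSTATES; ++n) {
            unsigned u  = n >> 1;            /* input bit that leads to n */
            unsigned p0 = (n & 1u) << 1;     /* even predecessor */
            unsigned p1 = p0 | 1u;           /* odd predecessor */
            unsigned m0 = metric[p0] + hamming2(branch_output(p0, u), symbols[t]);
            unsigned m1 = metric[p1] + hamming2(branch_output(p1, u), symbols[t]);
            unsigned best = (m1 < m0) ? p1 : p0;           /* compare, select */
            new_metric[n] = (m1 < m0) ? m1 : m0;
            /* Register exchange: copy the winner's survivor, append bit u. */
            new_reg[n] = reg[best] | ((uint32_t)u << t);
        }
        for (unsigned n = 0; n < NSTATES; ++n) {
            metric[n] = new_metric[n];
            reg[n] = new_reg[n];
        }
    }

    unsigned best = 0;
    for (unsigned n = 1; n < NSTATES; ++n)
        if (metric[n] < metric[best])
            best = n;
    return reg[best];
}
```

The survivor copy in the inner loop has no data-dependent branching across time steps, which is the reason the slide prefers RE over conventional traceback on a parallel architecture.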
Parallel Viterbi needs re-ordering for SWAPs
Exploiting Viterbi DP in SWAPs:
Use RE instead of regular traceback
Re-order ACS, RE
[Figure: regular ACS maps states X(0)..X(15) to X(0)..X(15) in natural order. ACS in SWAPs re-orders the inputs into an even vector X(0), X(2), ..., X(14) and an odd vector X(1), X(3), ..., X(15), so that the update to X(0)..X(15) operates on DP vectors.]
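One reading of the re-ordering is that gathering the even-indexed and odd-indexed path metrics into two vectors turns the ACS step into element-wise vector operations. The sketch below follows that reading and assumes a trellis in which the new bit enters at the MSB of the state, so that new states j and j + N/2 share the predecessors 2j and 2j + 1. The function name, the branch-metric layout and `MAX_STATES` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STATES 256  /* illustrative bound (K = 9 has 256 states) */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/*
 * One ACS step over nstates path metrics.
 * bm_even[n] and bm_odd[n] are the branch metrics of the edges entering
 * new state n from its even and odd predecessor respectively.
 */
void acs_reordered(const unsigned *old_pm, const unsigned *bm_even,
                   const unsigned *bm_odd, unsigned *new_pm, size_t nstates)
{
    unsigned even[MAX_STATES / 2], odd[MAX_STATES / 2];
    size_t half = nstates / 2;
    assert(nstates >= 2 && nstates <= MAX_STATES && nstates % 2 == 0);

    /* Re-order: gather X(0), X(2), ... and X(1), X(3), ... into two vectors. */
    for (size_t j = 0; j < half; ++j) {
        even[j] = old_pm[2 * j];
        odd[j]  = old_pm[2 * j + 1];
    }

    /* Vector ACS: element j of both vectors feeds new states j and j + half,
       so every lane performs the same add, compare and select. */
    for (size_t j = 0; j < half; ++j) {
        new_pm[j]        = min_u(even[j] + bm_even[j],
                                 odd[j]  + bm_odd[j]);
        new_pm[j + half] = min_u(even[j] + bm_even[j + half],
                                 odd[j]  + bm_odd[j + half]);
    }
}
```

After the gather, no lane needs a neighbour's data during the ACS itself, which keeps the computation local to each cluster.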
Talk Outline
Research vision
SWAP Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
SWAP architecture design
More clusters better than more ALUs per cluster (if #clusters > 2)
1. Decide how many clusters
Exploit DP
2. Decide what to put within each cluster
Maximize ILP with high functional unit efficiency
Search design space with “explore” tool
Time-power-area characterization
[Figure: clusters of adders (+), multipliers (**) and undetermined units (?); ILP within a cluster, DP across clusters.]
Design a SWAP cluster: “Explore”
Auto-exploration of adders and multipliers for “ACS”
[Figure: 3-D plot of instruction count (axis 40 to 160) against #Adders (1 to 5) and #Multipliers (1 to 5). Each design point is labelled (adder util%, multiplier util%). Labels in extraction order, grid positions not recoverable: (43,58), (54,59), (39,41), (62,62), (47,43), (40,32), (70,59), (65,45), (49,33), (39,27), (80,34), (73,41), (61,33), (48,26), (39,22), (50,22), (85,24), (76,33), (60,26), (61,22), (85,17), (72,22), (72,19), (85,13), (85,11).]
“Explore” tool benefits
Instruction count vs. ALU efficiency
What goes inside each cluster
Design customized application-specific units
Better performance with increased ALU utilization
Explore multiple algorithms
Turn off functional units not in use for given kernel
Vdd-gating, clock gating techniques
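The trade-off between instruction count and ALU utilization can be illustrated with a toy search. This is not the “explore” tool itself: the cost model below is an assumption that treats a kernel as a bag of independent add and multiply operations that schedule perfectly, and `design_point`, `evaluate` and `explore` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    unsigned adders, multipliers;  /* functional units per cluster */
    unsigned instr_count;          /* schedule length under the toy model */
    double   add_util, mul_util;   /* fraction of issue slots doing work */
} design_point;

static unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

/* Toy model: schedule length is set by the busier class of unit. */
design_point evaluate(unsigned add_ops, unsigned mul_ops,
                      unsigned adders, unsigned multipliers)
{
    design_point d;
    assert(adders > 0 && multipliers > 0 && add_ops + mul_ops > 0);

    unsigned ca = ceil_div(add_ops, adders);
    unsigned cm = ceil_div(mul_ops, multipliers);
    d.adders = adders;
    d.multipliers = multipliers;
    d.instr_count = ca > cm ? ca : cm;
    d.add_util = (double)add_ops / ((double)adders * d.instr_count);
    d.mul_util = (double)mul_ops / ((double)multipliers * d.instr_count);
    return d;
}

/* Exhaustive search over 1..max_units adders and multipliers: prefer the
   shortest schedule, then the fewest units (highest utilization). */
design_point explore(unsigned add_ops, unsigned mul_ops, unsigned max_units)
{
    design_point best = evaluate(add_ops, mul_ops, 1, 1);
    for (unsigned a = 1; a <= max_units; ++a) {
        for (unsigned m = 1; m <= max_units; ++m) {
            design_point d = evaluate(add_ops, mul_ops, a, m);
            unsigned units = d.adders + d.multipliers;
            unsigned best_units = best.adders + best.multipliers;
            if (d.instr_count < best.instr_count ||
                (d.instr_count == best.instr_count && units < best_units))
                best = d;
        }
    }
    return best;
}
```

For a kernel with 12 adds and 4 multiplies and up to 5 units of each type, this model settles on 4 adders and 2 multipliers: extra units beyond that no longer shorten the schedule and only lower utilization.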
Example for SWAP architecture design
Explore Algorithm 1 : 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2 : 4 adders, 1 multiplier, 64 clusters
Explore Algorithm 3 : 2 adders, 2 multipliers, 64 clusters
Explore Algorithm 4 : 2 adders, 2 multipliers, 16 clusters
Chosen Architecture: 4 adders, 3 multipliers, 64 clusters
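[Figure labels: adders and multipliers per cluster follow from ILP; the number of clusters follows from DP.]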
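The chosen architecture in this example is the element-wise maximum of the per-algorithm requirements. A small sketch of that rule follows, with `swap_config` and `choose_architecture` as hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned adders;       /* per cluster (ILP) */
    unsigned multipliers;  /* per cluster (ILP) */
    unsigned clusters;     /* across clusters (DP) */
} swap_config;

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Pick the smallest configuration that covers every algorithm's needs. */
swap_config choose_architecture(const swap_config *req, size_t n)
{
    swap_config chosen = {0, 0, 0};
    assert(req != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        chosen.adders      = max_u(chosen.adders, req[i].adders);
        chosen.multipliers = max_u(chosen.multipliers, req[i].multipliers);
        chosen.clusters    = max_u(chosen.clusters, req[i].clusters);
    }
    return chosen;
}
```

Fed with the four requirement sets from the slide, the function yields 4 adders, 3 multipliers and 64 clusters. Any algorithm that needs less leaves units idle, and those are the units that can be turned off.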
SWAP flexibility provides power savings
Multiple algorithms
Different ALU, cluster requirements
Turning off ALUs ( –add –mul compiler options)
Use the right #ALUs from “explore” tool
Turning off clusters
Data across SRF of all clusters
Cluster only has access to its own SRF
Next kernel may need data from SRF of other clusters
Reconfiguration support needs to be provided
SWAPs provide cluster reconfiguration
[Figure: a mux-demux network with stream buffers (MDX 1 and MDX 2 stages with latches) sits between the SRF and the clusters.]
Additional latency (few cycles) due to microcontroller stalls
- Minimal loss in performance
Cluster reconfiguration for Viterbi
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
[Figure labels: DP axis; clusters not needed by a packet can be turned OFF.]
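The cluster counts quoted for the three packets are consistent with a fixed number of trellis states per cluster. A Viterbi decoder with constraint length K has 2^(K-1) states, and dividing by 4 reproduces 16, 64 and 4 clusters. The factor 4 is inferred from those numbers rather than stated in the talk, and the function names below are hypothetical.

```c
#include <assert.h>

/* Number of trellis states for constraint length K. */
unsigned viterbi_states(unsigned K)
{
    assert(K >= 2 && K <= 16);
    return 1u << (K - 1);
}

/* Active clusters for a packet, at a fixed states-per-cluster ratio. */
unsigned clusters_for_packet(unsigned K, unsigned states_per_cluster,
                             unsigned max_clusters)
{
    assert(states_per_cluster > 0 && max_clusters > 0);

    unsigned wanted = viterbi_states(K) / states_per_cluster;
    if (wanted == 0)
        wanted = 1;                  /* always keep at least one cluster on */
    return wanted < max_clusters ? wanted : max_clusters;
}
```

Under this assumption, `clusters_for_packet(7, 4, 64)` gives 16, `clusters_for_packet(9, 4, 64)` gives 64 and `clusters_for_packet(5, 4, 64)` gives 4. The remaining clusters are the ones available to power down.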
[Figure: execution time (cycles) for three 64-bit, rate-½ packets: Packet 1 (K = 7), Packet 2 (K = 9) and Packet 3 (K = 5). The cluster timeline shows kernels (computation); the memory timeline shows no data memory accesses.]
SWAPs provide flexibility at negligible overhead
SWAP exploration for Viterbi decoding
[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 1 to 1000) versus the number of clusters (1 to 100) for K = 9, K = 7 and K = 5. It compares different SWAPs (without reconfiguration) against the same SWAP (with reconfiguration); the DSP point and the maximum-DP point are marked.]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time
SWAPs : Salient features
1-2 orders of magnitude better than a DSP
Any constraint length: 10 MHz at 128 Kbps
Same code for all constraint lengths
No need to re-compile or load another code as long as the parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling
Expected SWAP power consumption
Power model based on [Khailany’03]
64 clusters and 1 multiplier per cluster:
0.13 micron, 1.2 V
Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW)
Area: ~53.7 mm²
10 MHz, 128 Kbps with reconfiguration
Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.
[Figure: power (in mW, 0 to 90) versus active clusters (0 to 70, max 64).]
Viterbi       Clusters used   Peak power
K = 9         64              ~90 mW
K = 7         16              ~28.57 mW
K = 5         4               ~13.8 mW
overhead      0               ~8.1 mW
DSP, K = 9    1               ~200 mW
Multiuser Estimation-Detection+Decoding
Real-time target : 128 Kbps per user
[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 10 to 100000) versus the number of clusters (1 to 100) for FAST, MEDIUM and SLOW fading scenarios, with the 32-user base-station, mobile and DSP points marked.]
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time
Expected SWAP power : base-station
32-user base-station with 3 X’s per cluster and 64 clusters:
0.13 micron, 1.2 V
Peak Active Power: ~18.19 mW for 1 MHz (increased X)
Area: ~93.4 mm²
Total peak base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
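The totals quoted on the power slides follow from scaling the per-MHz peak figure linearly with clock frequency. A worked version of that arithmetic is sketched below, assuming the linear scaling holds over the whole range. The function names are hypothetical.

```c
#include <assert.h>

/* Peak active power in mW, given a per-MHz figure and a clock in MHz. */
double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    assert(mw_per_mhz >= 0.0 && freq_mhz >= 0.0);
    return mw_per_mhz * freq_mhz;
}

/* Convenience conversion from mW to W. */
double mw_to_w(double mw)
{
    return mw / 1000.0;
}
```

With ~18.19 mW per MHz at 1 GHz (1000 MHz) this gives ~18190 mW, or ~18.19 W, for the base-station. With ~9 mW per MHz at 10 MHz it gives ~90 mW, matching the K = 9 Viterbi entry with all 64 clusters active.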
Talk Outline
Research vision
SWAP Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
Current research: Flexibility vs. performance
SWAPs: 128 Kbps at ~10-100 mW for Viterbi
Borrow DP from ASICs!
Suitable for base-stations
Flexibility more important than power
Suitable for mobile devices
Power constraints tighter
Can be customized for further power savings
Handset SWAPs (H-SWAPs)
Borrow task pipelining from ASICs!
Application-specific units and specialized comm. network
Handset SWAPs: H-SWAPs
Trade Data Parallelism for Task Pipelining
[Figure: SWAPs use the maximum number of clusters (+++***) behind the SRF and reconfigure, exploiting DP. A SWAPlet limits the number of clusters (limited DP). H-SWAPs are a collection of customized SWAPlets, for example clusters of +++*, ++* or ++++ units, each with limited DP.]
Sample points in architecture exploration
DSPs (1 cluster): ILP, subword
SWAPs (multiple clusters): ILP, subword, DP
H-SWAPs (optimized for handsets): ILP, subword, DP, task pipelining, custom ALUs
Programmable solutions with increased customization
Performance, power benefits (with decreasing flexibility)
Future: Efficient algorithms and mapping
[Figure: receiver block diagram with the blocks multipath channel, channel estimator, equalizer, turbo equalizer, MRC, beam-forming, coherent STC, non-coherent STC, detector, demodulator, channel and decoder.]
Multiple antenna systems with 1-2 orders-of-magnitude higher complexity
Future research: Architectures
Generalized and structured framework and tools
Joint algorithm-architecture exploration
Area-time-power-flexibility tradeoffs
Potential applications: embedded systems
Image and video processing: cameras, with a variety of compression algorithms
Biomedical applications: hearing aids, with a DSP running on body heat*
Sensor networks: compression of data before transmission
*Quote: Gene Frantz, TI Fellow
SWAPs: Flexibility, Performance, Power
Need flexibility in future wireless devices
Algorithms and architectures
Rapid exploration for Scalable Wireless Application-specific Processors
Structured approach with flexibility-performance trade-offs
SWAPs: flexibility, high performance and low power
Exploit data parallelism like ASICs
1-2 orders better performance than DSPs
Turn off unused clusters and unused ALUs for low power