Programmable processors for wireless base-stations
Sridhar Rajagopal([email protected])
December 9, 2003
Fact #1: Wireless data rates are growing faster than clock rates
Need to process 100X more bits per clock cycle today than in 1996
[Figure: clock frequency (MHz), W-LAN data rate (Mbps) and cellular data rate (Mbps) versus year, 1996-2006, on a log scale. 1996: 200 MHz clocks, 1 Mbps W-LAN, 9.6 Kbps cellular. Today: 4 GHz clocks, 54-100 Mbps W-LAN, 2-10 Mbps cellular.]

Source: Intel, IEEE 802.11x, 3GPP
Fact #2: Base-stations need horsepower

[Figure: block diagram of a base-station receiver. RF section: RF RX, LNA, ADC, power measurement and gain control (AGC), power supply and control unit. Baseband processing: DDC, frequency offset compensation, channel estimation, chip-level demodulation/despreading, symbol detection, symbol decoding. Network interface: packet/circuit switch control, BSC/RNC interface, E1/T1 or packet network.]
Sophisticated signal processing for multiple users
Need 100-1000s of arithmetic operations to process 1 bit
Source: Texas Instruments
Need 100 ALUs in base-stations

Example: 1000 arithmetic operations/bit at 1 bit every 10 clock cycles implies 100 arithmetic operations per clock cycle.
Base-stations need ~100 ALUs – irrespective of the type of (clocked) architecture.
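As a sanity check on the arithmetic above, a few lines of Python (illustrative only; the numbers are the slide's example figures):

```python
def alus_needed(ops_per_bit, cycles_per_bit):
    """Arithmetic operations that must complete every clock cycle,
    i.e. the minimum number of busy ALUs in any clocked design."""
    return ops_per_bit / cycles_per_bit

# the slide's example: 1000 operations/bit, one bit every 10 cycles
print(alus_needed(1000, 10))  # -> 100.0
```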
Fact #3: Base-stations need power-efficiency*

• Wireless systems are getting denser
– more base-stations per unit area
– operational and maintenance costs rise with power
• Architectures are typically first tested on base-stations

*power-efficient means it does not waste power – it does not imply low power
Wireless gets blacked out too

"Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" – August 16, 2003, 8:58 AM EDT, Paul R. La Monica, CNN/Money Senior Writer
Fact #4: Base-stations need flexibility*

• Wireless systems are continuously evolving
– new algorithms are designed and evaluated
– allow upgrading and co-existence, minimize design time, enable reuse
• Flexibility is needed for power-efficiency
– base-stations rarely operate at full capacity
– users, data rates, spreading, modulation and coding all vary
– adapt resources to needs

*how much flexibility? – as flexible as possible
Fact #5: Current base-stations are neither flexible nor power-efficient

[Figure: current base-station partitioning. 'Chip rate' processing: ASIC(s) and/or ASSP(s) and/or FPGA(s). 'Symbol rate' processing: DSP(s). Decoding: co-processor(s) and/or ASIC(s). Control and protocol: DSP or RISC processor. RF: analog.]

Change implies re-partitioning algorithms and designing new hardware. Design is done for the worst case – no adaptation with workload.

Source: [Baines2003]
Thesis addresses the following problem

Design a base-station that
(a) supports 100's of ALUs
(b) is power-efficient (adapts resources to needs)
(c) is as flexible as possible

How many ALUs at what clock frequency?

HYPOTHESIS: Programmable* processors for wireless base-stations
*how much programmability? – as programmable as possible
Programmable processors

• No processor optimization for a specific algorithm
– as programmable as possible
– example: no dedicated instruction for Viterbi decoding
– FPGAs, ASICs, ASIPs etc. not considered
• Use characteristics of wireless systems
– precision, parallelism, operations, …
– analogous to MMX extensions for multimedia
Single processors won't do

(1) Increasing clock frequency
– C64x DSP: 600 MHz – 720 MHz – 1 GHz – 100 GHz?
– easiest solution, but physical limits to scaling f
– bad for power, given the roughly cubic dependence of power on f
(2) Increasing ALUs
– limited instruction-level parallelism (ILP, MMX)
– register file area and port explosion
– compiler issues in extracting more ILP
(3) Multiprocessors
Related work – Multiprocessors

Single-chip multiprocessors:
• SIMD (Single Instruction, Multiple Data) – data parallel
– Vector: CODE, Vector IRAM, Cray 1
– Array: ClearSpeed, MasPar, Illiac-IV, BSP
– Stream: Imagine, Motorola RSVP
• MIMD (Multiple Instructions, Multiple Data) – control parallel
– Multi-threading (MT): Sandbridge SandBlaster DSP, Cray MTA, Sun MAJC, PowerPC RS64IV, Alpha 21464
– Reconfigurable* processors: RAW, Chameleon, picoChip
– Chip multiprocessor (CMP): TI TMS320C8x DSP, Hydra, IBM Power4

Other multiprocessors:
• Multi-chip: TI TMS320C40 DSP, Sundance, Cm*
• Clustered VLIW: TI TMS320C6x DSP, Multiflow TRACE, Alpha 21264

Cannot scale to support 100's of arithmetic units.

*A reconfigurable processor uses reconfiguration for execution-time benefits
Challenges in proving the hypothesis

• Architecture choice for design exploration
– SIMD generally more programmable* than reconfigurable
– compiler, simulators, tools and support play a major role
• Benchmark workloads need to be designed
– previously done as ASICs, so none available
– not easy: finite precision, algorithms keep changing
• Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools

*Programmable here refers to ease of use and of writing code
Architecture choice: Stream processors

• State-of-the-art programmable media processors
– can scale to 1000's of arithmetic units [Khailany 2003]
– wireless algorithms have similar characteristics
• Cycle-accurate simulator with open-source code
• Parameters such as ALUs and register files can be varied
• Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead, …
• Almost anything can be changed – some changes easier than others!
Thesis contributions

• Mapping algorithms on stream processors
– designing data-parallel algorithm versions
– trade-offs between packing, ALU utilization and memory
– reduced inter-cluster communication network
• Improving power efficiency in stream processors
– adapting compute resources to workload variations
– varying voltage and frequency to meet real-time requirements
• Design exploration between #ALUs and clock frequency to minimize power consumption
– fast real-time performance prediction
Outline

• Background
– wireless systems
– stream processors
• Contribution #1: Mapping
• Contribution #2: Power-efficiency
• Contribution #3: Design exploration
• Broader impact and limitations
Wireless workloads: 2G (basic)

[Figure: 2G physical layer signal processing. The received signal after DDC feeds, per user (user 1 … user K): sliding correlator → code matched filter → Viterbi decoder → MAC and network layers.]

32 users, 16 Kbps/user
Single-user algorithms (other users treated as noise)
> 2 GOPs
3G multiuser system

[Figure: 3G physical layer signal processing. The received signal after DDC feeds multiuser channel estimation and multiuser detection (code matched filter followed by parallel interference cancellation stages), then per-user (user 1 … user K) Viterbi decoders → MAC and network layers.]

32 users, 128 Kbps/user
Multi-user algorithms (cancel interference)
> 20 GOPs
4G MIMO system

[Figure: 4G physical layer signal processing with M receive antennas. For each user (user 1 … user K) and each antenna (antenna 1 … antenna T): channel estimation, chip-level equalization and code matched filtering, followed by an LDPC decoder → MAC and network layers.]

32 users, 1 Mbps/user
Multiple antennas (higher spectral efficiency, higher data rates)
> 200 GOPs
Programmable processors

int i, a[N], b[N], sum[N];      // 32 bits
short int c[N], d[N], diff[N];  // 16 bits, packed
for (i = 0; i < N; ++i) {       // N = 1024
    sum[i]  = a[i] + b[i];
    diff[i] = c[i] - d[i];
}

• Instruction Level Parallelism (ILP) – DSP
• Subword Parallelism (MMX) – DSP
• Data Parallelism (DP) – vector processor

DP can decrease by increasing ILP and MMX
– example: loop unrolling
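Subword (MMX-style) parallelism can be mimicked in software. The sketch below is a hypothetical illustration, not the stream-processor ISA: it packs two 16-bit lanes into one 32-bit word and adds both lanes together, masking so that carries cannot cross lane boundaries.

```python
MASK16 = 0xFFFF

def pack2(hi, lo):
    # pack two 16-bit values into one 32-bit word
    return ((hi & MASK16) << 16) | (lo & MASK16)

def add_packed(x, y):
    # lane-wise 16-bit addition; the masks stop carries crossing lanes
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    hi = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

assert add_packed(pack2(3, 5), pack2(10, 20)) == pack2(13, 25)
```

Real MMX hardware performs both lane additions in one instruction; the masking here only emulates the lane isolation.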
Stream Processors: multi-cluster DSPs

[Figure: a VLIW DSP (1 cluster) with adders, multipliers and internal memory exploits ILP and MMX. A stream processor replicates identical clusters under a single microcontroller, fed from a Stream Register File (SRF), adding DP across the clusters.]

Adapt clusters to DP: identical clusters execute the same operations; power down unused FUs and clusters.
Outline

Contribution #1 – Mapping algorithms to stream processors (parallel, fixed point)
– Trade-offs between packing, ALU utilization and memory
– Reduced inter-cluster communication network
Packing

• Packing introduced around 1996 for exploiting subword parallelism
– Intel MMX
– subword parallelism never looked back
– integrated into all current microprocessors and DSPs
• SIMD + MMX: stream processor / Vector IRAM, 2000 onwards
– a relatively new combination
• Not necessarily useful in SIMD processors
– may add to inter-cluster communication
Packing may not be useful

Algorithm:
short a[8]; int y[8];
for (i = 0; i < 8; ++i) {
    y[i] = a[i] * a[i];
}

[Figure: to square a = (1 2 3 4 5 6 7 8) with packed 16-bit multiplies, the natural packing p = (1 2 3 4), q = (5 6 7 8) must first be re-ordered into the odd-even grouping p = (1 3 5 7), q = (2 4 6 8); a packed add similarly requires re-ordering the data first.]

Packing uses odd-even grouping
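The odd-even grouping itself is a simple de-interleave. A Python sketch of the re-ordering described above (illustrative only; on the real hardware this costs inter-cluster communication):

```python
def odd_even_group(a):
    """Split a sequence into its odd- and even-position elements
    (1-based, as in the figure): (1 2 3 4 5 6 7 8) -> (1 3 5 7), (2 4 6 8)."""
    return a[0::2], a[1::2]

p, q = odd_even_group([1, 2, 3, 4, 5, 6, 7, 8])
print(p, q)  # -> [1, 3, 5, 7] [2, 4, 6, 8]
```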
Data re-ordering in memory

• Matrix transpose
– common in wireless communication systems
– column access to data is expensive
• Re-ordering data inside the ALUs instead
– faster
– lower power
Trade-offs during memory re-ordering

[Figure: three ways to schedule a transpose between two ALU kernels taking t1 and t2 cycles, where tmem is the memory transpose time and talu the ALU transpose time: (a) transpose in memory between the kernels, t = t2 + tstalls with 0 < tstalls < tmem; (b) transpose in memory fully overlapped with computation, t = t2; (c) transpose inside the ALUs, t = t2 + talu.]
Transpose uses odd-even grouping

[Figure: an M x N matrix is transposed by repeatedly applying odd-even regrouping to the flattened data: repeat log(M) times { OUT = odd_even_regroup(IN); IN = OUT; }. Example step: (A B C D | 1 2 3 4) → (A 1 B 2 | C 3 D 4).]
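One concrete reading of the log(M) claim: interleaving the two halves of the flattened array (the inverse of the odd-even split, and exactly the step in the figure) rotates the index bits by one position, so repeating it log2(M) times turns row-major M x N storage into the transposed N x M layout. A hedged Python sketch, assuming power-of-two dimensions:

```python
import math

def interleave_halves(a):
    # one odd-even regrouping pass: merge the two halves element by element
    h = len(a) // 2
    out = []
    for x, y in zip(a[:h], a[h:]):
        out.extend([x, y])
    return out

def transpose_flat(a, m, n):
    """Transpose an m x n matrix stored row-major in a flat list using
    only log2(m) regrouping passes (m and n must be powers of two)."""
    for _ in range(int(math.log2(m))):
        a = interleave_halves(a)
    return a

# 2 x 4 matrix [[0,1,2,3],[4,5,6,7]] -> 4 x 2 transpose, row-major
print(transpose_flat([0, 1, 2, 3, 4, 5, 6, 7], 2, 4))  # -> [0, 4, 1, 5, 2, 6, 3, 7]
```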
ALU bandwidth > memory bandwidth

[Figure: execution time (cycles) versus matrix size (32x32, 64x64, 128x128), log-log. Transpose in memory (tmem) with 8-cycle DRAM and with 3-cycle DRAM is compared against transpose in the ALUs (talu); the ALU transpose wins.]
Viterbi needs odd-even grouping

Exploiting Viterbi DP in SWAPs: use register exchange (RE) instead of regular traceback, and re-order the ACS and RE computations.

[Figure: trellises over states X(0) … X(15). The regular ACS ordering produces long-distance butterfly interconnections; re-ordering the states into the odd-even grouping (X(0), X(2), …, X(14), X(1), X(3), …, X(15)) makes the interconnect local and exposes data parallelism across the DP vector.]
Performance of Viterbi decoding

[Figure: clock frequency needed to attain real-time (MHz, log scale) versus number of clusters (1-100), for constraint lengths K = 5, 7, 9, up to the maximum DP of each. An ideal C64x DSP (w/o co-processor) needs ~200 MHz for real-time.]
Patterns in inter-cluster communication

• Broadcasting
– matrix-vector multiplication, matrix-matrix multiplication, outer product updates
• Odd-even grouping
– transpose, packing, Viterbi decoding
Odd-even grouping

[Figure: a full inter-cluster communication network for 4 clusters (data 0/4, 1/5, 2/6, 3/7) performing the odd-even grouping (0 1 2 3 4 5 6 7) → (0 2 4 6 1 3 5 7): O(C^2) wires, O(C^2) interconnections, 8 cycles. The wires span the entire chip length, limiting clock frequency and scaling.]
A reduced inter-cluster communication network

[Figure: the same 4-cluster odd-even grouping implemented with only nearest-neighbor interconnections: O(C log(C)) wires, O(C) interconnections, 8 cycles. Each cluster has a multiplexer, a demultiplexer and pipelining registers, with support for broadcasting and odd-even grouping.]
Outline

Contribution #2: Power-efficiency

"High performance is low power" – Mark Horowitz
Flexibility needed in workloads

[Figure: operation count (GOPs) for 2G (16 Kbps/user) and 3G (128 Kbps/user) base-stations, over (users, constraint length) = (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9). Billions of computations per second are needed.]

The workload varies from ~1 GOPs for 4 users with constraint-length-7 Viterbi to ~23 GOPs for 32 users with constraint-length-9 Viterbi.

Note: GOPs refer only to arithmetic computations.
Flexibility affects Data Parallelism*

Workload   Estimation   Detection   Decoding
(U,K)      f(U,N)       f(U,N)      f(U,K,R)
(4,7)      32           4           16
(4,9)      32           4           64
(8,7)      32           8           16
(8,9)      32           8           64
(16,7)     32           16          16
(16,9)     32           16          64
(32,7)     32           32          16
(32,9)     32           32          64

U – users, K – constraint length, N – spreading gain, R – decoding rate
*Data parallelism is defined as the parallelism available after subword packing and loop unrolling
Adapting #clusters to Data Parallelism

[Figure: an adaptive multiplexer network feeding the clusters, shown in four modes: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off.]

Unused clusters are turned off using voltage gating to eliminate both static and dynamic power dissipation.
Cluster utilization variation

[Figure: cluster utilization (%) versus cluster index (0-31) on a 32-cluster processor, for workloads (4,7)/(4,9), (8,7)/(8,9), (16,7)/(16,9), (32,7)/(32,9), where (32,9) = 32 users, constraint length 9 Viterbi.]
Frequency variation

[Figure: real-time frequency (MHz, 0-1200) for workloads (4,7) through (32,9), broken down into busy time, memory stalls and microcontroller stalls.]
Operation

• Dynamic voltage-frequency scaling when the system changes significantly
– users, data rates, …
– coarse time scale (every few seconds)
• Turn off clusters when parallelism changes significantly
– memory operations
– exceeding real-time requirements
– finer time scales (100's of microseconds)
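The payoff of scaling voltage and frequency together can be sketched with the standard CMOS dynamic-power model P = C·V^2·f. The capacitance C is left as a normalized constant here; the voltage/frequency pairs are taken from the table on the voltage gating and scaling slide:

```python
def dynamic_power(v, f, c=1.0):
    # classic CMOS dynamic power model: P = C * V^2 * f
    return c * v * v * f

full  = dynamic_power(1.4, 1200e6)   # (32,9) workload: 1.4 V at 1200 MHz
light = dynamic_power(0.875, 433e6)  # (4,7) workload: 0.875 V at 433 MHz
print(round(full / light, 1))  # -> 7.1  (dynamic power ratio between the extremes)
```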
Power: Voltage Gating & Scaling

Workload  Freq needed  Freq used  Voltage  Savings (W)                  Power (W)       Savings
(U,K)     (MHz)        (MHz)      (V)      clocking  memory  clusters   new     base
(4,7)     345.09       433        0.875    0.325     1.05    0.366      0.30    2.05    85.14 %
(4,9)     380.69       433        0.875    0.193     0.56    0.604      0.69    2.05    66.41 %
(8,7)     408.89       433        0.875    0.089     0.54    0.649      0.77    2.05    62.44 %
(8,9)     463.29       533        0.95     0.304     0.71    0.643      1.33    2.98    55.46 %
(16,7)    528.41       533        0.95     0.02      0.44    0.808      1.71    2.98    42.54 %
(16,9)    637.21       667        1.05     0.156     0.58    0.603      3.21    4.55    29.46 %
(32,7)    902.89       1000       1.3      0.792     1.18    1.375      7.11    10.46   32.03 %
(32,9)    1118.3       1200       1.4      0.774     1.41    0          12.38   14.56   14.98 %

Estimated cluster power consumption: 78 %
Estimated SRF power consumption: 11.5 %
Estimated instruction decoder power consumption: 10.5 %
Estimated chip area (0.13 micron process): 45.7 mm^2

Power can change from 12.38 W to 300 mW depending on workload changes.
Outline

Contribution #3: Design exploration
– How many adders, multipliers, clusters, and what clock frequency
– Quickly predict real-time performance

Deciding ALUs vs. clock frequency

• No independent variables
– clusters, ALUs, frequency, voltage (c, a, m, f)
– trade-offs exist
• How to find the right combination for lowest power?

P = C V^2 f, and V scales with f, so P is proportional to f^3
[Figure: extreme design points with equal throughput. (A) '1' cluster at 100 GHz. (B) 'c' clusters, each with 'a' adders and 'm' multipliers, at 'f' MHz. (C) '100' clusters, each with '10' adders and '10' multipliers, at 10 MHz.]
Static design exploration

[Figure: execution time split into a static part (computations) and a dynamic part (memory stalls, microcontroller stalls).]

The static part also helps in quickly predicting real-time performance.
Sensitivity analysis important

• We have a capacitance model [Khailany2003]
• Not all equations are exact
– need to see how variations affect the solutions

P proportional to f^p, with p varied from 1 to 3
beta = adder power / multiplier power, varied from 0.01 to 1
Design exploration methodology

• 3 types of parallelism: ILP, MMX, DP
• For best performance (power), maximize the use of all three
• Maximize ILP and MMX at the expense of DP
– loop unrolling, packing
– schedule on a sufficient number of adders/multipliers
• If DP remains, use clusters = DP
– there is no other way to exploit that parallelism

Setting clusters, adders, multipliers

• With sufficient DP, frequency decreases linearly with clusters
– set clusters from the DP and an execution time estimate
• To find adders and multipliers, let the compiler schedule the algorithm workloads across different numbers of adders and multipliers and report the execution time
• Put all the numbers into the power equation
– compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time
• Choose the solution that minimizes power
Design exploration

For a sufficiently large number of adders and multipliers per cluster:
Explore algorithm 1: 32 clusters (t1)
Explore algorithm 2: 64 clusters (t2)
Explore algorithm 3: 64 clusters (t3)
Explore algorithm 4: 16 clusters (t4)

The real-time frequency needed with c clusters follows from scaling each algorithm's time by its remaining DP:

f(c) = (1 / real-time target) * sum over i = 1..L of t_i * (dp_i / c)
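A toy version of this exploration loop, using the model f(c) = (1/target) * sum of t_i * dp_i / c. All numbers below are hypothetical placeholders; in the thesis, the per-algorithm cycle counts t_i and parallelism dp_i come from compiler schedules:

```python
# hypothetical (cycles at full DP, data parallelism) pairs per algorithm
WORKLOAD = [(1000, 32), (2000, 64), (1500, 64), (500, 16)]
REAL_TIME_TARGET = 1e-3  # seconds available for one processing pass

def freq_needed(c, workload=WORKLOAD, target=REAL_TIME_TARGET):
    # f(c) = (1 / target) * sum_i t_i * dp_i / c
    return sum(t * dp / c for t, dp in workload) / target

def best_clusters(candidates, p, cap=lambda c: c):
    # minimize P(c) = C(c) * f(c)^p over candidate cluster counts,
    # with a simple linear capacitance model C(c) = c by default
    return min(candidates, key=lambda c: cap(c) * freq_needed(c) ** p)

print(best_clusters([16, 32, 64], p=2))  # -> 64
```

With p > 1, power favors more clusters at a lower clock, matching the trend on the following slide.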
Clusters: frequency and power

[Figure: left, real-time frequency (MHz, log scale) versus number of clusters for the 3G workload; right, normalized power versus clusters (0-70) for P proportional to f, f^2 and f^3.]

Minimize P(c) = C(c) * f(c)^p over c.

Minimum-power design points (3G workload):
32 clusters at 836.692 MHz (p = 1)
64 clusters at 543.444 MHz (p = 2)
64 clusters at 543.444 MHz (p = 3)
ALU utilization with frequency
1
1.5
2
2.5
3
3.5
4
4.5
5 1
1.2
1.4
1.6
1.8
2
2.2
2.4
2.6
2.8
3500
600
700
800
900
1000
1100
(32,28)
(38,28)
#Multipliers
(33,34)
(50,31)
(42,37)
(64,31)
(36,53)
(51,42)
(78,18)
(43,56)
(65,46)
#Adders
(55,62)
(78,27)
(67,62)
(78,45)
Rea
l-T
ime
Fre
qu
ency
(in
MH
z) w
ith
FU
uti
liza
tio
n(+
,*)
3G workload
Power variations with f and
0
5
1
2
30
0.5
1
Adders
= 0.01, P f
Multipliers
0
5
1
2
30
0.5
1
Adders
= 0.1, P f
Multipliers
0
5
1
2
30
0.5
1
Adders
= 1, P f
Multipliers
0
5
12
30
0.5
1
Adders
= 0.01, P f2
MultipliersP
ow
er
0
5
12
30
0.5
1
Adders
= 0.1, P f2
Multipliers
Po
we
r
0
5
12
30.5
1
Adders
= 1, P f2
Multipliers
Po
we
r0
5
1
2
30
0.5
1
Adders
= 0.01, P f3
Multipliers
Po
we
r
0
5
1
2
30
0.5
1
Adders
= 0.1, P f3
Multipliers
Po
we
r
0
5
1
2
30
0.5
1
Adders
= 1, P f3
Multipliers
Po
we
r
Choice of adders and multipliers

(beta, p)   Optimal adders  Optimal multipliers  ALU/cluster power (%)  Cluster/total power (%)
(0.01, 1)   2               1                    30                     61
(0.01, 2)   2               1                    30                     61
(0.01, 3)   3               1                    25                     58
(0.1, 1)    2               1                    52                     69
(0.1, 2)    2               1                    52                     69
(0.1, 3)    3               1                    51                     68
(1, 1)      1               1                    86                     89
(1, 2)      2               2                    84                     87
(1, 3)      2               2                    84                     87
Exploration results

************************* Final Design Conclusion *************************
Clusters            : 64
Multipliers/cluster : 1    Utilization: 62%
Adders/cluster      : 3    Utilization: 55%
Real-time frequency : 568.68 MHz
***************************************************************************

Exploration done with plots generated in seconds…
Outline

Broader impact and limitations

Broader impact

• Results not specific to base-stations
– high performance, low power system designs
• Concepts can be extended to handsets
• The multiplexer network is applicable to all SIMD processors
– power efficiency in scientific computing
• Results #2 and #3 apply to all stream applications
– design and power efficiency
– multimedia, MPEG, …
Limitations

Don't believe the model is the reality (the proof is in the pudding)

• Fabrication needed to verify the concepts
– cycle-accurate simulator
– extrapolated power models
• LDPC decoding (in progress)
– sparse matrices require permutations over large data
– an indexed SRF may help
• 3G requires 1 GHz at 128 Kbps/user
– 4G equalization at 1 Mbps breaks down (expected)
Conclusions

• The road ends for conventional architectures [Agarwal2000]
• Wide range of architectures – DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +
– difficult to compare and contrast
– need new definitions that allow comparisons
• Wireless workloads – a SPECwireless standard is needed
• Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms is not easy in programmable architectures
– my thesis lays the initial foundations
Time-scales [met schedules on time – year not mentioned in the proposal]

Comparison points:
• Fixed-point estimation/detection in Imagine – August
• Integration of Imagine code – August
• C64x DSP comparisons – August
• Single cluster implementation – August
• VLSI comparisons – August
Innovations:
• Memory stalls – September
• Viterbi FU – October
Scaling:
• Scaling algorithm – November
Extensions:
• W-LAN base-station – December
• Handset issues – January
Alternate view of the CMP DSP

[Figure: clusters of C64x-style units (cluster 0, cluster 1, … cluster C) with an instruction decoder and an inter-cluster communication network, backed by a streaming memory system: L2 internal memory organized as banks 1 … C with prefetch buffers.]
Adapting clusters using (1) memory transfers

[Figure: a 16-element stream A (elements 0-15) in the SRF is re-laid-out in two steps through memory into stream A', so that its elements map onto the reduced set of active clusters.]
(2) Using conditional streams

[Figure: 4 clusters reconfiguring to 2 with conditional streams. Clusters hold data A, B, C, D with condition bits; a condition switch and conditional buffer route only the selected elements, so access 0 delivers A, B and access 1 delivers C, D to the two active clusters.]
Arithmetic clusters in stream processors

[Figure: one arithmetic cluster: adders (+), multipliers (*), and a divide/square-root unit (/), fed by distributed register files (which support more ALUs) through a cross-point switch, with a scratchpad for indexed accesses, a communication unit on the inter-cluster network, and connections from/to the SRF.]
Programming model

stream<int>   a(1024);
stream<int>   b(1024);
stream<int>   sum(1024);
stream<half2> c(512);
stream<half2> d(512);
stream<half2> diff(512);

add(a, b, sum);
sub(c, d, diff);

kernel add(istream<int> a, istream<int> b, ostream<int> sum) {
    int inputA, inputB, output;
    loop_stream(a) {
        a >> inputA;
        b >> inputB;
        output = inputA + inputB;
        sum << output;
    }
}

kernel sub(istream<half2> c, istream<half2> d, ostream<half2> diff) {
    half2 inputC, inputD, output;
    loop_stream(c) {
        c >> inputC;
        d >> inputD;
        output = inputC - inputD;
        diff << output;
    }
}
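For readers without the Imagine toolchain, the kernel/stream split above can be mimicked in plain Python. This is a loose analogue, not the real StreamC/KernelC semantics: streams are lists, and each kernel consumes its input streams record by record.

```python
def kernel_add(a, b):
    # element-wise kernel: reads one record from each input stream per loop
    return [x + y for x, y in zip(a, b)]

def kernel_sub(c, d):
    return [x - y for x, y in zip(c, d)]

a = list(range(1024))
b = [2 * x for x in range(1024)]
sum_stream = kernel_add(a, b)   # corresponds to add(a, b, sum)
print(sum_stream[:4])  # -> [0, 3, 6, 9]
```

On the real hardware each loop iteration runs on all clusters at once, with the SRF feeding the records.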
Your new hardware won’t run your old software – Balch’s law
Stream processor programming

[Figure: a chain of kernels – correlator channel estimation, matched filter, interference cancellation, Viterbi decoding – connected by streams, from received signal to decoded bits.]

• Kernels (computation) and streams (communication)
• Local data in the clusters provides the GOPs support
• Imagine stream processor at Stanford [Rixner'01]

Scott Rixner, Stream Processor Architecture, Kluwer Academic Publishers, Boston, MA, 2001.
Parallel Viterbi decoding

• Add-Compare-Select (ACS): trellis interconnect, computations
– parallelism depends on the constraint length (#states)
• Traceback: searching
– conventional traceback is sequential (no DP) with dynamic branching
– difficult to implement on a parallel architecture
• Use register exchange (RE) instead
– a parallel solution

[Figure: ACS unit producing detected bits, followed by a traceback unit producing decoded bits.]
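To make the ACS/register-exchange idea concrete, here is a small hard-decision Viterbi decoder for a rate-1/2, K=3 convolutional code (generators 7, 5 octal). This is an illustrative sketch, not the thesis implementation: each state carries its survivor bits in a register that is exchanged at every ACS step, so no sequential traceback is needed.

```python
def encode(bits, tail=2):
    bits = bits + [0] * tail          # flush the encoder back to state 0
    s, out = 0, []                    # s holds the last two input bits
    for u in bits:
        out.append((u ^ (s >> 1) ^ (s & 1), u ^ (s & 1)))  # g0 = 111, g1 = 101
        s = ((u << 1) | (s >> 1)) & 0b11
    return out

def viterbi_re(received, nbits):
    INF = 10 ** 9
    metric, reg = [0, INF, INF, INF], [[], [], [], []]
    for r0, r1 in received:
        nm, nr = [INF] * 4, [None] * 4
        for ns in range(4):           # add-compare-select for each next state
            u = ns >> 1               # the input bit that leads into ns
            for ps in (2 * (ns & 1), 2 * (ns & 1) + 1):
                e0 = u ^ (ps >> 1) ^ (ps & 1)   # expected channel pair
                e1 = u ^ (ps & 1)
                m = metric[ps] + (e0 ^ r0) + (e1 ^ r1)
                if m < nm[ns]:        # select: exchange the survivor register
                    nm[ns], nr[ns] = m, reg[ps] + [u]
        metric, reg = nm, nr
    return reg[0][:nbits]             # tail bits drive the path back to state 0

msg = [1, 0, 1, 1, 0, 0, 1, 1]
assert viterbi_re(encode(msg), len(msg)) == msg
```

All four ACS updates per step are independent, which is exactly the data parallelism the odd-even re-ordering exposes on the clusters.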