Post on 19-Dec-2015
Software Defined Radio – A High Performance Embedded Challenge
Hyunseok Lee, Yuan Lin, Yoav Harel, Mark Woh, Scott Mahlke, Trevor Mudge, and Krisztian Flautner¹
University of Michigan, ¹ARM Ltd
Advanced Computer Architecture Laboratory, University of Michigan
Contents
- Software defined radio
- Categories of wireless networks
- Core technologies for future networks
- Case study: W-CDMA network
  - Major algorithms
  - Workload characterization
  - Architectural implications
Software Defined Radio
Wireless Communication System
[Figure: layers of a wireless communication system. Application bits enter the upper protocol layers (Transport: TCP/UDP; Network: IP; Link: PPP; MAC) and are passed as packets to the physical layer (PHY), where baseband processing and the analog front-end put them on the "air".]
Anatomy of Cellular Phone
[Figure: block diagram of a cellular phone: baseband processor, analog front-end, application processor, power manager, Bluetooth, GPS, camera, keyboard, display, and speaker.]
Protocol on Wireless Platform
[Figure: mapping of the protocol stack onto a wireless platform. The physical layer (PHY) is implemented in ASIC hardware; the upper layers (MAC, link, network, transport) run as software on a GPP; source coding (audio: AMR/QCELP; video: MPEG) runs on a DSP/accelerator. The functions are split between a baseband processor and an application processor.]
Software Defined Radio (SDR)
- Use software routines instead of ASICs for the physical layer operations of a wireless communication system.
[Figure: the ASICs (PHY) are replaced by software routines running on programmable hardware.]
- Both the analog front-end and the digital baseband are within the scope of SDR.
Levels of SDR
Tier | Name | Description
Tier 0 | Hardware Radio (HR) | Implemented using hardware components. Cannot be modified.
Tier 1 | Software Controlled Radio (SCR) | Only control functions are implemented in software: inter-connects, power levels, etc.
Tier 2 | Software Defined Radio (SDR) | Software control of a variety of modulation techniques, wide-band or narrow-band operation, security functions, etc.
Tier 3 | Ideal Software Radio (ISR) | Programmability extends to the entire system with analog conversion only at the antenna.
Tier 4 | Ultimate Software Radio (USR) | Defined for comparison purposes only.
<source: http://www.sdrforum.org>
Why do we need SDR?
- Seamless wireless connection (end user)
  - Widely different wireless protocols
    - TDMA: GSM, IS-136 (D-AMPS)
    - CDMA: IS-95, cdma2000, W-CDMA, IEEE 802.11b
    - OFDM: IEEE 802.11a/g/n, WiMAX
  - Needs a terminal that can support multiple wireless protocols
- Easy infrastructure upgrade (service provider)
  - Wireless protocols evolve continuously
  - Ex) W-CDMA to W-CDMA + HSDPA
- Time to market (manufacturer)
  - Reduce hardware development time and cost
Where can we use SDR?
- Basestations
  - Weak constraints on power and area
  - Support several hundred subscribers
  - Will be commercialized first
- Wireless terminals
  - Tight constraints on power and area
  - Will be commercialized next
Why is SDR challenging?
- Analog front-end
  - Must be tunable across a range of carrier frequencies and bandwidths.
- Digital baseband
  - Supercomputer-level computation power: > 50 Gops per subscriber
  - Tight power budget: 200 ~ 300 mW (at the terminal)
  - High level of programmability: combination of heterogeneous signal processing algorithms
Our Strategy
- Performance
  - Exploit the parallelism in signal processing and forward error correction (FEC) algorithms.
- Power
  - Limit the programmability to minimize power consumption.
  - Minimize both active and idle mode power consumption.
- There is a trade-off between power efficiency and programmability.
Categories of Wireless Networks
<source : Wireless communication technology landscape, DELL >
Category | Connectivity | Range | Examples
WPAN | Personal area | 10 meters | Bluetooth, UWB
WLAN | Local area | 100 meters | WiFi, HiperLan
WMAN | Metro area (city or suburb) | Beyond 100 meters | WiMax
WWAN | Wide area (broad geographic coverage) | Beyond 100 meters | AMPS, GSM, IS-95, cdma2000, W-CDMA
WWAN (Wireless Wide Area Network)
[Figure: evolution of WWAN standards by generation.]
Generation | Standards (multiple access)
1G (analog) | AMPS, NMT, TACS (FDMA)
2G (digital) | GSM, IS-136/PDC (TDMA); IS-95 (CDMA)
2.5G | GPRS, EDGE (TDMA); IS-95B (CDMA)
3G | W-CDMA, cdma2000 (CDMA)
3.5G | W-CDMA/HSDPA, cdma2000 EV-DO/EV-DV (CDMA)
4G | ? (OFDM)
- Services grow from voice, through 64 ~ 384 Kbps packet data, to ~2 Mbps, ~10 Mbps, and ~100 Mbps multimedia.
- The earlier generations can be implemented by a programmable DSP; there are no fully programmable H/W solutions for the later generations.
WLAN / WMAN
- WLAN: Wireless Local Area Network
  - High data rate
  - Poor mobility support
  - 802.11b: 11 Mbps, CDMA
  - 802.11g: 54 Mbps, OFDM
  - 802.11a: 54 Mbps, OFDM
  - 802.11n: 100+ Mbps, OFDM
- WMAN: Wireless Metro Area Network
  - For the last mile problem
  - 802.16d (Fixed WiMax): 70 Mbps, OFDM
  - 802.16e (Mobile WiMax): 10 Mbps, OFDM
WPAN (Wireless Personal Area Network)
- Interconnects personal devices
- Bluetooth 1.1, Bluetooth 1.2: 1 Mbps
- Bluetooth 2.0: 3 Mbps
- 802.15.3a UWB: 100 ~ 480 Mbps
- 802.15.3a UWB-NG: ~1 Gbps
Core technologies of future networks
OFDM (Orthogonal Frequency Division Multiplexing)
- Transmits the signal over several sub-carriers.
- The frequency spectra of the sub-carriers overlap (high spectral efficiency).
- Highly susceptible to frequency error in the receiver.
[Figure: OFDM signal path in the frequency domain. X(f), with sub-carriers spaced fsc apart from -N·fsc to N·fsc, passes through the IFFT and is modulated by cos(fc·t) onto the carrier fc, giving Xmod(f). After the wireless channel, the receiver demodulates with cos(fc·t) to obtain Xdemod(f) and applies the FFT to recover X(f).]
Major Computation in OFDM system
- FFT / IFFT
$$X(k) = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, \ldots, N-1$$
  - N = 64: IEEE 802.11a
  - N = 256 ~ 2048: IEEE 802.16 WiMax
  - Data precision: 12 ~ 16 bits
- Amount of computation for OFDM operation: ~10^8 complex multiplications / sec
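The FFT / IFFT pair above can be sketched in a few lines. This is a minimal illustration, not the benchmark code: it uses a naive O(N^2) DFT instead of a fast transform, an ideal channel, and a hypothetical set of eight QPSK symbols (the function name `dft` and the symbol values are assumptions made for the example).

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT (or inverse DFT) of a list of complex samples."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * cmath.pi * k * i / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

# Hypothetical QPSK symbols, one per sub-carrier (N = 8 here; 802.11a uses N = 64).
symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]

tx_samples = dft(symbols, inverse=True)  # transmitter: IFFT
rx_symbols = dft(tx_samples)             # receiver: FFT over an ideal channel

# Round away floating-point error to compare with the transmitted symbols.
recovered = [complex(round(s.real), round(s.imag)) for s in rx_symbols]
```

Each output sample costs N complex multiplications here, which is why real receivers use an FFT.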
MIMO (Multiple Input Multiple Output)
- Use multiple antennas for signal transmission and reception.
- In the ideal case, channel capacity increases linearly.
- Can effectively compensate for the multipath fading effect.
- Significantly increases receiver complexity.
- Single Input Single Output (SISO) channel capacity: C = W log2(1+SNR)
- Multiple Input Multiple Output (MIMO) channel capacity, with n transmit antennas (Tx,1 ... Tx,n) and m receive antennas (Rx,1 ... Rx,m): C = min(n, m) * W log2(1+SNR)
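As a quick numeric check of the two capacity formulas, the sketch below evaluates them for assumed example values (a 20 MHz channel and a linear SNR of 15, neither of which comes from the slides). The function names are hypothetical.

```python
import math

def siso_capacity(bandwidth_hz, snr_linear):
    """C = W log2(1 + SNR), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

def mimo_capacity(n_tx, n_rx, bandwidth_hz, snr_linear):
    """Ideal-case C = min(n, m) * W log2(1 + SNR), in bits per second."""
    return min(n_tx, n_rx) * siso_capacity(bandwidth_hz, snr_linear)

# Assumed example values: 20 MHz bandwidth, SNR = 15 (linear), 4 Tx and 4 Rx antennas.
siso = siso_capacity(20e6, 15)
mimo = mimo_capacity(4, 4, 20e6, 15)
```

With these values the ideal 4x4 MIMO capacity is four times the SISO capacity.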
Computation in MIMO receiver
- Amount of computation in MIMO receiver
[Equation: computation count as a function of M, LT, and LP, from the cited source.]
  - M: number of Tx/Rx antennas
  - LT: length of preamble
  - LP: length of payload
- Example: 4 Tx/Rx antennas, 100 Mbps, 64 QAM, 1/2 coding rate: ~6 x 10^8 computations / sec
<source: B. Hassibi, An Efficient Square-Root Algorithm for BLAST>
LDPC code
- Low Density Parity Check (LDPC) code
  - Coding gain comparable to turbo codes, with lower implementation cost.
- Encoding
  - Matrix multiplication, c = xG
  - G (generator matrix) is a large matrix (e.g. a 4K x 4K matrix).
- Decoding
  - Equivalent to finding the most probable vector x such that Hx mod 2 = 0.
  - H (parity check matrix) is a large sparse matrix.
- Implementation
  - There is a trade-off between coding gain and implementation complexity.
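The encoding and parity-check relations can be illustrated with a tiny code. The sketch below uses the (7,4) Hamming code as a stand-in: it is a small, dense code rather than an LDPC code, chosen only because its matrices fit on a slide. It shows c = xG and the parity check, not an LDPC decoder. The matrices and function names are assumptions made for the example.

```python
# Generator matrix G = [I | P] and parity check matrix H = [P^T | I]
# of the (7,4) Hamming code (illustrative stand-in, not an LDPC code).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(x, generator):
    """c = xG over GF(2)."""
    n_cols = len(generator[0])
    return [sum(x[i] * generator[i][j] for i in range(len(x))) % 2
            for j in range(n_cols)]

def syndrome(c, parity_check):
    """Hc mod 2; all zeros for a valid codeword."""
    return [sum(h * b for h, b in zip(row, c)) % 2 for row in parity_check]

codeword = encode([1, 0, 1, 1], G)
corrupted = codeword[:]
corrupted[2] ^= 1  # flip one bit to emulate a channel error
```

A valid codeword gives an all-zero syndrome, and the single flipped bit makes the syndrome non-zero.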
Hybrid ARQ
- Reuses error frames for the decoding of the retransmitted frame.
- Requires huge buffer space.
[Figure: timeline of frames 1 to 5 between the physical layer and the link layer. Frame #2 arrives with errors and is stored in the hybrid ARQ buffer. When frame #3 arrives, the link layer detects that frame #2 is missing and requests the retransmission of frame #2 from the sender. The retransmitted frame #2 is then combined with the stored error frame.]
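One way to combine an error frame with its retransmission is to add the soft values of the two copies before the hard decision (often called chase combining). The slides do not say which combining scheme is meant, so the sketch below is an assumption, with hypothetical soft values where a positive value means bit 1.

```python
def hard_decision(soft_values):
    """Map soft values to bits: positive means 1, otherwise 0."""
    return [1 if v > 0 else 0 for v in soft_values]

def chase_combine(stored, retransmitted):
    """Add the soft values of the stored error frame and the retransmission."""
    return [a + b for a, b in zip(stored, retransmitted)]

sent_bits = [1, 0, 1, 0]
first_rx = [0.9, 0.3, 0.8, -0.7]    # hypothetical: second bit received in error
retx_rx = [0.7, -0.9, -0.2, -0.4]   # hypothetical: third bit received in error

combined = chase_combine(first_rx, retx_rx)
decoded = hard_decision(combined)
```

In this example neither copy decodes correctly on its own, but the combined frame does, which is why the error frame is kept in the buffer.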
Case Study : W-CDMA system
Major Algorithms
Physical layer of W-CDMA
[Figure: block diagram of the W-CDMA physical layer between the upper layers and the front-end.
Transmitter: channel encoder, interleaver, modulator, spreader, scrambler, LPF-Tx, D/A.
Receiver: A/D, LPF-Rx, searcher, descrambler/despreader pairs feeding a combiner, demodulator, deinterleaver, channel decoder (turbo/Viterbi).]
- Channel encoder/decoder: error correction
- Interleaver/deinterleaver: overcome severe errors within a short time interval
- Modulator/demodulator: assign the signal waveform optimal for data transmission
- LPF: suppress the signal components outside the pass band
Channel Encoder/Decoder
- Encoder: adds systematic redundancy to the source data.
- Decoder: fixes errors in the received data using the systematic redundancy information generated by the encoder.
- The W-CDMA system uses:
  - Convolutional code (for short voice and control messages)
  - Turbo code (for video streams and high speed packet data)
Channel Encoder
- Consists of flip-flops and exclusive OR gates
- Has negligible impact on workload
[Figure: convolutional encoder of the W-CDMA system. The input feeds a shift register of eight delay elements (D); Output 0 uses G0 = 561 (octal) and Output 1 uses G1 = 753 (octal).]
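The shift-register structure above can be sketched as follows. The generator polynomials G0 = 561 and G1 = 753 (octal) and the eight delay elements come from the slide; the bit ordering (newest input bit aligned with the most significant bit of each polynomial) and the function name are assumptions of this sketch.

```python
G0, G1 = 0o561, 0o753   # generator polynomials from the slide (octal)
K = 9                   # constraint length: current input plus eight delay elements

def parity(value):
    """XOR of all bits in value."""
    return bin(value).count("1") % 2

def conv_encode(bits):
    """Rate-1/2 convolutional encoder: two output bits per input bit."""
    state = 0  # contents of the eight delay elements
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state   # newest bit in the MSB (assumed ordering)
        out.append(parity(reg & G0))   # Output 0
        out.append(parity(reg & G1))   # Output 1
        state = reg >> 1               # shift the register by one position
    return out

# A single 1 followed by zeros reads out the two polynomials bit by bit.
impulse_response = conv_encode([1] + [0] * 8)
```

The encoder needs only shifts and XORs, which matches the note that it has negligible impact on workload.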
Channel Decoder
- Determine the most probable code sequence from the received sequence.
  - Select the code C having the minimum distance to the received sequence r.
- One of the dominant workloads
[Figure: received signal r and its distances d1, d2, ..., dN to the candidate codes C1, C2, ..., CN, where {ci} is the code set.]
Channel Decoder – Viterbi Algorithm
- Most popular decoding algorithm for convolutional codes
- Consists of three steps:
  - Branch metric calculation (BMC): abs(a-b), parallelizable
  - Add compare select (ACS): min(a+b, c+d), parallelizable
  - Trace back (TB): recursive pointer tracing, sequential
- Amount of operation in W-CDMA: ~2 Gops for 16 Kbps voice
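The three steps can be seen together in a small hard-decision decoder. To keep the trellis at four states, the sketch below assumes a K = 3 code with generators 7 and 5 (octal) rather than the K = 9 W-CDMA code. All names are hypothetical.

```python
K = 3
GENERATORS = (0o7, 0o5)       # assumed small code, not the W-CDMA polynomials
N_STATES = 1 << (K - 1)

def parity(value):
    return bin(value).count("1") % 2

def branch_output(state, bit):
    reg = (bit << (K - 1)) | state
    return [parity(reg & g) for g in GENERATORS]

def next_state(state, bit):
    return ((bit << (K - 1)) | state) >> 1

def encode(bits):
    state, out = 0, []
    for b in bits:
        out += branch_output(state, b)
        state = next_state(state, b)
    return out

def viterbi_decode(received):
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)    # encoder starts in state 0
    history = []                              # survivor (previous state, bit) per step
    n_out = len(GENERATORS)
    for t in range(0, len(received), n_out):
        r = received[t:t + n_out]
        new_metrics = [inf] * N_STATES
        survivors = [None] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == inf:
                continue
            for bit in (0, 1):
                # BMC: Hamming distance between received and expected bits
                bm = sum(abs(a - b) for a, b in zip(r, branch_output(s, bit)))
                # ACS: add, compare with the best so far, select the smaller
                ns = next_state(s, bit)
                if metrics[s] + bm < new_metrics[ns]:
                    new_metrics[ns] = metrics[s] + bm
                    survivors[ns] = (s, bit)
        metrics = new_metrics
        history.append(survivors)
    # TB: start from the state with the smallest metric and follow the pointers
    state = min(range(N_STATES), key=lambda s: metrics[s])
    decoded = []
    for survivors in reversed(history):
        state, bit = survivors[state]
        decoded.append(bit)
    return decoded[::-1]

message = [1, 0, 1, 1, 0, 0]
received = encode(message)
received[3] ^= 1                              # inject one bit error
decoded = viterbi_decode(received)
```

The BMC and ACS loops are independent across states, which is the parallelism noted above; only the trace back is inherently sequential.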
Channel Decoder – Turbo decoder
- Two algorithms are widely used:
  - SOVA (Soft Output Viterbi Algorithm)
    - Less computation intensive
    - Lower error correction performance
  - Max-LogMap algorithm
    - More computation required
    - Higher error correction performance
- Amount of operation in W-CDMA: ~18 Gops for 128 Kbps streaming data
Turbo Decoder
<High level block diagram of turbo decoder>
[Figure: the input passes through a demux to two SOVA/Max-LogMap blocks connected through an interleaver and a deinterleaver; one pass through both blocks is one iteration, after which the output is produced.]
- Based on multiple iterations of the SOVA / Max-LogMap blocks.
- More iterations give better performance.
Block Interleaver/Deinterleaver
- Overcomes the severe signal attenuation within a short time interval that frequently appears in a wireless channel.
- Interleaver (at the transmitter): randomizes the sequence of source data.
- Deinterleaver (at the receiver): recovers the original sequence by reordering.
- Amount of operation: < 10 Mops
[Figure: example of signal strength variation. Interleaving turns 123456789 into 147258369; deinterleaving turns 147258369 back into 123456789.]
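The 123456789 example can be reproduced with a simple block interleaver that writes row by row and reads column by column. This sketch assumes the data length is a multiple of the number of columns and leaves out any additional permutation pattern. The function names are hypothetical.

```python
def interleave(data, n_cols):
    """Write row by row into n_cols columns, then read column by column."""
    rows = [data[i:i + n_cols] for i in range(0, len(data), n_cols)]
    return [row[c] for c in range(n_cols) for row in rows]

def deinterleave(data, n_cols):
    """Inverse operation: write column by column, then read row by row."""
    n_rows = len(data) // n_cols
    return [data[c * n_rows + r] for r in range(n_rows) for c in range(n_cols)]

source = [1, 2, 3, 4, 5, 6, 7, 8, 9]
interleaved = interleave(source, 3)
restored = deinterleave(interleaved, 3)
```

After interleaving, a burst that wipes out three adjacent symbols on the channel hits symbols that are three positions apart in the original sequence.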
Spreader/Despreader
- Allows several signals to be transmitted at the same time (x[n] and y[n] in the diagram).
- Based on the orthogonality between spreading codes:
$$\frac{1}{N}\sum_{n=0}^{N-1} c_i[n]\, c_j[n] = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$
[Figure: the spreader multiplies x[n] by ci[n] and y[n] by cj[n]; the despreader multiplies the received signal by ci[n] and cj[n] again to recover x[n] and y[n].]
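The orthogonality property is what lets two signals share the channel. The sketch below assumes two hypothetical 4-chip orthogonal codes and a noiseless channel that simply adds the two spread signals. The code values and function names are illustrative.

```python
C_I = [1, -1, 1, -1]    # hypothetical spreading code for x[n]
C_J = [1, 1, -1, -1]    # hypothetical spreading code for y[n], orthogonal to C_I

def spread(symbols, code):
    """Replace every symbol by the symbol multiplied with each chip of the code."""
    return [s * c for s in symbols for c in code]

def despread(chips, code):
    """Correlate each block of len(code) chips with the code and normalize by N."""
    n = len(code)
    return [sum(chips[k + i] * code[i] for i in range(n)) / n
            for k in range(0, len(chips), n)]

x = [1, -1, 1]
y = [-1, -1, 1]

# Both spread signals are transmitted at the same time and add up on the channel.
channel = [a + b for a, b in zip(spread(x, C_I), spread(y, C_J))]

x_hat = despread(channel, C_I)
y_hat = despread(channel, C_J)
```

Despreading with C_I removes the y[n] term because the normalized correlation between C_I and C_J is zero.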
Spreader/Despreader
[Figure: spectra along the spreader/despreader path. The spreader multiplies x[n], with spectrum X(f), by Ci[n] to give xsp[n], with the wider spectrum Xsp(f). The wireless channel adds noise N[n], giving r[n]. The despreader multiplies r[n] by Ci[n] to give rdesp[n], in which the noise signal is spread, and y[n] is recovered.]
- The spreader / despreader also suppresses noise.
- Amount of operation: ~4 Gops
Scrambler/Descrambler
- Randomizes the output signal by multiplying it by a pseudo-random sequence, the so-called scrambling code.
- Allows multiple terminals to communicate at the same time.
- Amount of operation: ~3 Gops
[Figure: each terminal has its own scrambling code. Terminal 1 scrambles x[n] by complex multiplication with csc,i[n], and terminal 2 scrambles y[n] with csc,j[n]. The descrambler recovers x[n] and y[n] by complex multiplication with the conjugates c*sc,i[n] and c*sc,j[n].]
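Scrambling and descrambling are element-wise complex multiplications. The sketch below assumes a hypothetical unit-magnitude code drawn from {1, j, -1, -j}, so that multiplying by the conjugate exactly undoes the scrambling. Real W-CDMA scrambling codes are generated differently.

```python
def scramble(signal, code):
    """Multiply each complex sample by the matching scrambling code chip."""
    return [s * c for s, c in zip(signal, code)]

def descramble(signal, code):
    """Multiply by the complex conjugate of the code (unit-magnitude chips assumed)."""
    return [r * c.conjugate() for r, c in zip(signal, code)]

code = [1j, -1 + 0j, -1j, 1 + 0j]        # hypothetical scrambling code chips
x = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # hypothetical complex samples

scrambled = scramble(x, code)
recovered = descramble(scrambled, code)
```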
Low Pass Filter
- Suppresses the signal components outside the pass band.
[Figure: filtering of the input signal into the output signal. In the time domain, an impulse signal becomes a sinc function; in the frequency domain, a band-unlimited signal becomes a band-limited signal.]
Low Pass Filter
- Uses a conventional FIR filter:
$$y[n] = \sum_{i=0}^{N-1} h_i\, x[n-i]$$
[Figure: tapped delay line. The input x[n] passes through z^-1 delay elements to give x[n-1], x[n-2], ..., x[n-N+1]; the taps are weighted by h0, h1, h2, ..., hN-1 and summed into y[n].]
- Number of filter taps (N) = 32 ~ 64
- Amount of operation: ~12 Gops
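The FIR equation translates directly into a double loop. This is a minimal sketch that treats samples before the start of the input as zero; the two-tap and three-tap coefficient sets are hypothetical and far shorter than the 32 ~ 64 taps mentioned above.

```python
def fir_filter(x, h):
    """y[n] = sum over i of h[i] * x[n - i], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(h):
            if n - i >= 0:
                acc += coeff * x[n - i]
        y.append(acc)
    return y

# An impulse at the input reads out the filter coefficients.
impulse_response = fir_filter([1, 0, 0, 0], [3, 2, 1])

# A two-tap filter with unit coefficients sums each sample with its predecessor.
running_sum = fir_filter([1, 2, 3, 4], [1, 1])
```

Every output sample needs N multiply-accumulate operations, which is where the Gops figure for the filter comes from.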
Rake Receiver – Multipath fading
- The rake receiver mitigates the multipath fading effect.
- Multipath fading is a major cause of the unreliable characteristics of wireless channels.
[Figure: a transmitted signal x(t) reaches the receiver over one, two, or three paths:
y(t) = a0x(t)
y(t) = a0x(t) + a1x(t-d1)
y(t) = a0x(t) + a1x(t-d1) + a2x(t-d2)]
Rake Receiver - Functions
- Ideally, the function of the rake receiver is to aggregate the signal terms with proper delay compensation.
  y(t) = a0x(t) + a1x(t-d1) + a2x(t-d2)
  r(t) = a0x(t-tdelay) + a1x(t-d1-dest1) + a2x(t-d2-dest2)
       = (a0+a1+a2) * x(t-tdelay)
[Figure: the rake receiver turns y(t) into r(t), delayed by tdelay.]
- We need to know the delay spread of the received signal, which varies randomly.
Rake Receiver – Detect Delay Spread
- Scan the received signal in the frame buffer while computing its correlation with the scrambling code sequence.
$$y[n] = a_0 x[n] + a_1 x[n-d_1] + a_2 x[n-d_2]$$
[Figure: a correlation window slides over the received signal; the correlation result shows peaks a0, a1, and a2 at delays 0, d1, and d2.]
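The search for the delay spread can be sketched as a correlation at every candidate delay, keeping the strongest peaks. To keep the numbers exact, this sketch assumes a hypothetical 7-chip code with good autocorrelation, circular (wrap-around) delays, and integer path amplitudes. A real searcher scans a frame buffer with a much longer scrambling code.

```python
CODE = [1, 1, 1, -1, -1, 1, -1]   # hypothetical 7-chip +/-1 code
N = len(CODE)

def multipath(paths):
    """Sum of delayed, scaled copies of CODE; paths is a list of (amplitude, delay)."""
    return [sum(a * CODE[(n - d) % N] for a, d in paths) for n in range(N)]

def correlate(received):
    """Correlation between the received signal and CODE at every delay."""
    return [sum(received[(n + d) % N] * CODE[n] for n in range(N))
            for d in range(N)]

def find_paths(received, n_fingers=3):
    """Return the delays of the n_fingers strongest correlation peaks."""
    corr = correlate(received)
    strongest = sorted(range(N), key=lambda d: corr[d], reverse=True)[:n_fingers]
    return sorted(strongest)

# Three paths with amplitudes 4, 3, 2 at delays 0, 2, 5 (assumed values).
received = multipath([(4, 0), (3, 2), (2, 5)])
correlation = correlate(received)
delays = find_paths(received)
```

The correlation peaks appear at the three path delays, which are then handed to the descrambler/despreader fingers.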
Computation of Rake Receiver
- Correlation computation: LW x LB x F
  - LW: correlation window = 320
  - LB: frame buffer size = 5120
  - F: operation frequency = 50 ~ 80 mega multiplications / sec
  - Multiplications can be converted into subtractions.
- Amount of operation in W-CDMA: ~25 Gops
- Most dominant workload
Rake Receiver – Overall Architecture
[Figure: overall architecture of the rake receiver. The searcher detects the delay spread of r(t) and supplies the delays d1, d2, d3 and amplitudes a1, a2, a3. Each of the three branches delays r(t) to compensate the propagation delay and then applies a descrambler/despreader. The combiner recombines the signal terms without delay.]
Power Control
- The receiver controls the transmission power of the transmitter in order to minimize the interference to other users.
- Required computation is negligible.
[Figure: power control between terminal and basestation, driven by the pilot signal and a reference level. When the strength of the pilot signal is below the reference level, the terminal sends an UP command (u); when it is above the reference level, the terminal sends a DOWN command (d). Example sequence of power control commands: u d u u d d u.]
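The command rule in the figure amounts to one comparison per pilot measurement, which is why the computation is negligible. The sketch below uses hypothetical pilot strengths and a reference level of 1.0. Sending a DOWN command when the strength equals the reference is an assumption of this sketch.

```python
def power_control_commands(pilot_strengths, reference_level):
    """Return 'u' (UP) when the pilot is below the reference, otherwise 'd' (DOWN)."""
    return ["u" if s < reference_level else "d" for s in pilot_strengths]

# Hypothetical pilot strength measurements around a reference level of 1.0.
measurements = [0.8, 1.2, 0.9, 0.7, 1.3, 1.1, 0.95]
commands = power_control_commands(measurements, 1.0)
```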
H/W operation states
[Figure: the radio resource control states defined in the W-CDMA specification are mapped to operation states defined according to H/W activity.]
- Idle
  - For the long idle period between sessions
  - Periodic wake-up for control message reception
  - Minimum workload, but dominates terminal standby time
- Control Hold
  - For the short idle period between packet bursts
  - Holds a narrow control channel for fast transition to Active
  - Intermediate workload
- Active
  - For the packet burst transmission period
  - Uses high speed packet channels up to 2 Mbps
  - Most heavily loaded state
Workload Characterization
Workload Profile
- One operation is equivalent to one RISC instruction.
[Figure: bar chart of workload in GOPS (0 to 30) for each algorithm in the idle, control hold, and active states. Algorithms: searcher, interleaver, deinterleaver, Viterbi encoder, Viterbi decoder, turbo encoder, turbo decoder, scrambler, descrambler, scrambling-code (Tx), scrambling-code (Rx), spreader, despreader, combiner, LPF (Tx), LPF (Rx), power control.]
- The searcher, turbo decoder, and LPF are the dominant workloads.
- The workload profile varies according to the operation state.
Processing Time Requirement
- Mixture of algorithms with various processing time requirements
- Classified into two categories:
  - Heavy workload with long processing time (turbo decoder, searcher)
  - Light workload with short processing time (scrambler, spreader, LPF, power control)
Parallelism
- Most heavy-workload algorithms have significant vector parallelism.
- The data width of most operations is 8 bits.
Memory Access Pattern
- A huge memory is not required.
- Traffic between algorithms is not dominant.
- The access rate of scratch pad memory is very high.
Instruction Breakdown
[Figure: stacked bar chart of instruction type percentage (0 to 1.2) for each algorithm and for the average. Instruction types: ADD/SUB, MUL/DIV, LOGIC, LD, ST, BRANCH (IB), BRANCH (IF), Branch (Others), Others. Algorithms: searcher, interleaver, deinterleaver, Viterbi encoder, Viterbi decoder, turbo encoder, turbo decoder, scrambler, descrambler, scrambling code (Tx), scrambling code (Rx), spreader, despreader, combiner, LPF (Rx), LPF (Tx), power control.]
- ADD/SUB are the dominant instructions.
- Multiplication is not dominant in heavy workloads.
Frequent Computations
- Most multiplications are simplified into cheaper operations.
- Multiplication in LPF-Rx cannot be simplified because both operands are 16-bit integers.
Architectural Implications
- SIMD, because
  - We can exploit vector parallelism in W-CDMA algorithms.
  - High power efficiency can be achieved by sharing control logic between datapath elements.
- Chip multiprocessor, because
  - There is substantial algorithm-level parallelism.
  - There are many tiny sequential algorithms.
- Multiple SIMD + scalar
[Figure: several SIMD units and a scalar unit connected by an interconnection network.]
Architectural Implications
- Memory structure
  - Cache free: the memory access pattern exhibits very dense spatial locality.
  - Small data memory (< 64K)
  - Small instruction memory (< 4K)
- Simple interconnection network
  - Low inter-processor communication is possible by algorithm-level task mapping on each PE.
Architectural Implications
- Power management
  - Large workload variation according to operation state and radio channel condition changes.
  - Various power management schemes can be applied: DVS, DFS, clock gating.
  - Idle mode power must be minimized because it dominates terminal standby time.
W-CDMA benchmark suite
- C-based implementation of W-CDMA physical layer operations.
- Used for the workload characterization done in this paper.
- Available at www.eecs.umich.edu/~sdrg
Conclusion
- We discussed:
  - what SDR is and why it is a challenging topic for embedded systems.
  - the evolution history of wireless protocols and the core technologies of emerging protocols.
- We analyzed:
  - the workload characteristics of the W-CDMA protocol and their architectural implications.
Backup Slides
Viterbi Algorithm – Trellis Diagram
- The Viterbi algorithm is based on the trellis diagram.
- The trellis diagram represents all possible state transitions of the encoder.
[Figure: example of a trellis diagram and the corresponding convolutional encoder with input x[n] and output y[n]. The four states 00, 01, 10, and 11 are repeated at each time step; each edge is a state transition caused by input 0 or input 1, labeled with the corresponding output.]
Viterbi Algorithm - BMC
- The BMC (branch metric calculation) operation computes the difference between the received sequence r and the outputs of the trellis diagram.
  BMCi,j = distance(rij, oij) = abs(rij - oij)
  - oij: output of the state transition from i to j
  - rij: corresponding received sequence
- All BMC operations in a trellis diagram can be done in parallel.
[Figure: trellis with states 00, 01, 10, 11 and received sequence r = 0 1 ...; each edge is labeled with its output and BMC value. Example: distance between r(01) and Cn(10) = 1 + 1 = 2.]
Viterbi Algorithm - ACS
- The ACS (add compare select) operation is:
  ACSk = min(ACSi + BMCi,k, ACSj + BMCj,k)
- This procedure is equivalent to finding a locally optimal code sequence.
  - If C1 has the smallest ACS value at node state i, then the ACS values of C2 and C3 are always greater than that of C1.
[Figure: states i and j, with accumulated metrics ACSi and ACSj, connect to state k through branches with metrics BMCi,k and BMCj,k (add, then compare and select); three candidate code sequences C1, C2, and C3 meet at state i.]
Viterbi Algorithm - TB
- Trace back the code sequence that is closest to the received sequence.
- Sequential algorithm
  1) Find the node that has the smallest ACS value (00 in this example).
  2) Trace back from node 00.
[Figure: trellis with states 00, 01, 10, 11 annotated with ACS values for the received sequence r = 0 1 1 0 0; the decoded result is 0 1 0 0 0.]
Block Interleaver/Deinterleaver
- Interleaver
  - Write row by row sequentially.
  - Read column by column according to the predefined permutation pattern.
- Deinterleaver
  - Write column by column according to the predefined permutation pattern.
  - Read row by row sequentially.
[Figure: interleaving procedure. The input b0 b1 b2 ... b(L-1) is written row by row into a matrix with M columns:
b0 b1 b2 ... b(M-1)
bM b(M+1) b(M+2) ... b(2M-1)
...
b((L/M-1)*M) b(1+(L/M-1)*M) b(2+(L/M-1)*M) ... b(L-1)
and read column by column as b0 bM ... b1 b(M+1) ... b(M-1) b(2M-1) ... b(L-1).]