![Page 1: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/1.jpg)
Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation
Sridhar Rajagopal
![Page 2: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/2.jpg)
Digital Signal Processors (DSPs)
Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications
A $5 billion (and growing) market today
![Page 3: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/3.jpg)
We always want something faster!
New high performance applications drive need for faster DSPs
• Physical-layer signal processing in high speed wireless communications to support multimedia
• Application-layer signal processing for video and imaging
![Page 4: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/4.jpg)
Example: wireless systems (32-user system)

| Generation | Time | Data rate | Estimation | Detection | Decoding | Theoretical min ALUs @ 1 GHz |
|---|---|---|---|---|---|---|
| 2G | 1996 | 16 Kbps/user | Single-user correlator | Matched filter | Viterbi | > 2 |
| 3G | 2003 | 128 Kbps/user | Multi-user max. likelihood | Interference cancellation | Viterbi | > 20 |
| 4G | ? | 1 Mbps/user | MIMO chip equalizer | Matched filter | LDPC | > 200 |
![Page 5: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/5.jpg)
Data-Parallel DSPs: state-of-the-art
Clusters of ALUs provide billions of computations per second
Exploit data parallelism in signal processing applications
Imagine stream processor – Stanford (1998 - 2004)
[Figure: internal memory connected to a row of ALU clusters, each cluster holding adders (+) and multipliers (*)]
![Page 6: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/6.jpg)
Proposal: Research questions for DP-DSPs
• Will DP-DSPs work well for wireless systems?
• How do I design DP-DSPs to meet real-time at lowest power?
• Can I improve power efficiency further by adapting DSPs to the application?
![Page 7: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/7.jpg)
Contributions: Algorithm mapping
• Efficient mapping of (wireless) algorithms
– parallelization, structure, memory access patterns
– tradeoffs between ALU utilization, inter-cluster communication, memory stalls, packing
• A reduced inter-cluster network proposed
– exploits inter-cluster communication patterns
– allows greater scalability of the architecture by reducing wires
![Page 8: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/8.jpg)
Contributions: Architecture scaling
• Design methodology and tool to explore architectures for low power
• Provides candidate architectures for low power
• Provides insights into ALU utilization and performance
• Compile-time exploration is orders-of-magnitude faster than run-time exploration
![Page 9: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/9.jpg)
Contributions: Workload adaptation
• Adapt the number of clusters and ALUs to changes in workload during run-time
• Multiplexer network designed
– adapts clusters to DP at run-time
– turns off unused clusters using power gating
• Significant power savings at run-time (up to 60%)
![Page 10: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/10.jpg)
Thesis contributions
Data-Parallel DSPs
[Figure: clusters of adders (+) and multipliers (*)]
Algorithm mapping: Design of algorithms for efficient mapping and performance
Architecture scaling: Having designed the algorithms, find a low power processor
Workload adaptation: Having designed the processor, improve power at run-time
![Page 11: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/11.jpg)
Outline
• DP-DSPs : Parallelism and architecture
• Power-aware design exploration
• Power-aware resource utilization at run-time
• Conclusions
![Page 12: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/12.jpg)
Parallelism levels in DP-DSPs
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (SubP) - DSP
Data Parallelism (DP) – vector processor
Not independent: DP can decrease by increasing ILP and SubP
– e.g., through loop unrolling
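To make the trade-off concrete, here is a minimal sketch in C. The function names are illustrative and not taken from the thesis. Unrolling by four leaves a quarter as many independent loop iterations (less DP) but exposes four independent additions per iteration to the scheduler (more ILP).

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: n independent iterations (DP = n), one add each. */
void add_rolled(const int *a, const int *b, int *sum, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        sum[i] = a[i] + b[i];
}

/* Unrolled by 4: n/4 iterations (less DP), but four independent adds
 * per iteration that can be scheduled together (more ILP).
 * Assumes n is a multiple of 4 to keep the sketch short. */
void add_unrolled4(const int *a, const int *b, int *sum, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Both functions compute the same result; only the shape of the available parallelism changes.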
![Page 13: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/13.jpg)
Code snippet for ILP, SubP, DP
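    #define N 64

    int i, a[N], b[N], sum[N];
    short int c[N], d[N], diff[N];

    for (i = 0; i < N; ++i)
    {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }

ILP: the addition and the subtraction in the loop body are independent and can issue together
SubP: the short int operands (c, d, diff) can be packed into subwords
DP: the loop iterations are independent of one another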
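As a rough illustration of SubP, the sketch below emulates in portable C what a packed 16-bit subtract does: two short values share one 32-bit word and a single word-wide operation updates both lanes. The helper names (`pack2`, `lane`, `sub2x16`) are hypothetical; a real DSP would provide this as a hardware instruction or intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of each 16-bit lane inside a 32-bit word. */
static const uint32_t LANE_MSB = 0x80008000u;

/* Pack two 16-bit values into one 32-bit word (lo in bits 0-15). */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Extract lane 0 or lane 1 as a signed 16-bit value.
 * The unsigned-to-signed conversion assumes the usual
 * two's-complement wraparound. */
int16_t lane(uint32_t w, int index)
{
    assert(index == 0 || index == 1);
    return (int16_t)(uint16_t)(w >> (16 * index));
}

/* Lane-wise x - y on two packed 16-bit values.
 * Forcing each lane's top bit to 1 in x and to 0 in y keeps a borrow
 * from crossing into the neighboring lane; the final XOR restores the
 * correct top bit of each lane. */
uint32_t sub2x16(uint32_t x, uint32_t y)
{
    uint32_t low = (x | LANE_MSB) - (y & ~LANE_MSB);
    return low ^ ((x ^ ~y) & LANE_MSB);
}
```

With this, `diff[0]` and `diff[1]` from the snippet could be produced by one `sub2x16` call on packed `c` and `d` values, which is the sense in which SubP halves the number of operations for 16-bit data.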
![Page 14: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/14.jpg)
Data-Parallel DSPs
• ILP, SubP within cluster, DP across clusters
• Communication between clusters using the inter-cluster communication network
• Microcontroller issues the same instruction to all clusters
[Figure: microcontroller driving a row of ALU clusters fed by internal memory; ILP and SubP are exploited within a cluster, DP across clusters]
![Page 15: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/15.jpg)
ILP is resource-bound
• ILP dependent on resources such as ALUs, read/write ports, inter-cluster communication, registers
• Any one resource bottleneck can affect ILP
[Figure: Schedule for matrix-matrix multiplication as ALUs increase; columns: adders, multipliers, inter-cluster communication; vertical axis: time]
![Page 16: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/16.jpg)
Signal processing algorithms have DP in plenty
Observations:
1. More DP available after exploiting ILP and SubP to the point of diminishing returns
2. Used to set number of clusters
3. As clusters are added and exploit this ‘extra’ DP, ILP and SubP are not affected significantly
This ‘extra’ DP is defined as Cluster DP (CDP)
![Page 17: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/17.jpg)
Observing CDP in Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz) vs. number of clusters, log-log axes; curves for K = 9, K = 7, K = 5; DSP and max CDP marked]
![Page 18: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/18.jpg)
Designing low power DP-DSPs
[Figure: candidate design points ranging from ‘1’ cluster at 100 GHz to ‘100’ clusters at 10 MHz; the general case is ‘c’ clusters at ‘f’ MHz, each cluster with ‘a’ adders (+) and ‘m’ multipliers (*)]
Find the right (a,m,c,f) to minimize power
a – #adders/cluster, m – #multipliers/cluster, c – #clusters
![Page 19: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/19.jpg)
Detailed simulation using the Imagine processor simulator
• Cycle accurate, parameterized simulator
– Insights into operations every cycle
• High-level C++-based programming
• GUI interface shows dependencies and schedule
• Power and VLSI scaling model available
• Open source allows modifications in architecture, tools
![Page 20: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/20.jpg)
Need for design exploration tool
• Random choice may be way off
– 100x power variation possible
• Exhaustive simulation not possible
– large parameter space (hours for each simulation)
– DSP compilers need hand optimizations for performance
– evolving algorithms -- architecture exploration needed
![Page 21: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/21.jpg)
Design exploration framework
Design phase:
• Base data-parallel DSP + design workload (worst-case)
• Explore (a,m,c,f) combination that minimizes power
• Hardware implementation
Utilization phase:
• Application workload
• Dynamic adaptation to turn down (a,m,c,f) to save power
![Page 22: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/22.jpg)
DSPs are compute-bound with predictable performance
[Figure: breakdown of total execution time (cycles) into computations with hidden memory stalls, exposed memory stalls, and microcontroller stalls; labeled t_compute and t_stall]
![Page 23: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/23.jpg)
Minimization for power
C(a,m,c) – capacitance from simulator model
f(a,m,c) – real-time clock frequency, obtained by running the application on the (a,m,c) architecture

$$\min_{a,m,c,f} P \;\propto\; \min_{a,m,c,f} C(a,m,c)\, V^2 f(a,m,c)$$

With $V \propto f$:

$$\min_{a,m,c,f} P \;\propto\; \min_{a,m,c,f} C(a,m,c)\, f(a,m,c)^3$$
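As a hedged worked example (the factor-of-two scaling is an assumption chosen for illustration, not a measurement from the slides): if doubling the number of clusters doubles the capacitance C and halves the real-time frequency f, then with V proportional to f the power falls to a quarter.

```latex
\frac{P_{2c}}{P_{c}}
  = \frac{2C}{C}\left(\frac{f/2}{f}\right)^{3}
  = 2 \cdot \frac{1}{8}
  = \frac{1}{4}
```

This is why adding clusters can lower power for as long as the extra clusters still reduce the required frequency.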
![Page 24: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/24.jpg)
Sensitivity to technology and modeling
• Sensitivity to technology ‘p’

$$\min_{a,m,c,f} P \;\propto\; \min_{a,m,c,f} C(a,m,c)\, f(a,m,c)^p, \quad \text{where } 2 \le p \le 3$$

• Sensitivity to adder-multiplier power ratio ‘α’
– 0.01 ≤ α ≤ 0.1 for 32-bit adders and 32x32 multipliers
• Sensitivity to memory stalls ‘β’
– difficult to predict at compile time (5-20%)
– assume q = 25% of execution time as worst case
– $f_{stall} = q\,(1-\beta)\,f_{min}$, $0 \le \beta \le 1$
![Page 25: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/25.jpg)
Design exploration: big picture
1. (a,m,c) = (∞, ∞, ∞)
2. Find (a,m,c) where ILP, SubP, DP are fully exploited
3. Find c that minimizes P for (max(a), max(m))
4. Find (a,m) that minimizes P using c
5. Explore sensitivity to α, β, p

$$\min_{a,m,c,f} P \;\propto\; \min_{a,m,c,f} C(a,m,c)\, f(a,m,c)^p$$
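Steps 3 and 4 can be sketched as a two-stage search. Everything in the block below is an assumption made for illustration: `toy_capacitance` and `toy_frequency` are made-up stand-ins for the simulator-derived C(a,m,c) and f(a,m,c), the constants are invented, and p is fixed at 3. The toy models are contrived, so the design this sketch lands on says nothing about the real tool's results.

```c
#include <assert.h>

#define A_MAX 5     /* largest adder count explored per cluster */
#define M_MAX 3     /* largest multiplier count explored per cluster */
#define C_MAX 512   /* largest cluster count explored */

typedef struct { int a, m, c; } design_t;

static int min_int(int x, int y) { return x < y ? x : y; }

/* Toy capacitance: grows with cluster count and ALUs per cluster. */
double toy_capacitance(int a, int m, int c)
{
    const double alpha = 0.1;   /* assumed adder/multiplier power ratio */
    return c * (alpha * a + m + 1.0);
}

/* Toy real-time frequency: work divided by usable parallelism.
 * Clusters beyond the (made-up) CDP of 64 do not help, and ALUs
 * beyond 3 adders per cluster do not help either. */
double toy_frequency(int a, int m, int c)
{
    const double work = 1.0e5;  /* made-up workload constant */
    const int cdp = 64;         /* made-up cluster data parallelism */
    int per_cluster = min_int(min_int(a, 3 * m), 3);
    return work / (min_int(c, cdp) * per_cluster);
}

/* P proportional to C * f^p, with p = 3 assumed here. */
double toy_power(int a, int m, int c)
{
    double f = toy_frequency(a, m, c);
    return toy_capacitance(a, m, c) * f * f * f;
}

design_t explore(void)
{
    design_t best = { A_MAX, M_MAX, 1 };

    /* Step 3: with maximum ALUs, pick the cluster count (powers of
     * two) that minimizes power. */
    for (int c = 2; c <= C_MAX; c *= 2)
        if (toy_power(A_MAX, M_MAX, c) < toy_power(A_MAX, M_MAX, best.c))
            best.c = c;

    /* Step 4: with that cluster count fixed, pick the ALU mix that
     * minimizes power. */
    for (int a = 1; a <= A_MAX; ++a)
        for (int m = 1; m <= M_MAX; ++m)
            if (toy_power(a, m, best.c) < toy_power(best.a, best.m, best.c)) {
                best.a = a;
                best.m = m;
            }

    return best;
}
```

Splitting the search this way visits a few dozen points instead of the full (a,m,c) product, which is the kind of saving that makes compile-time exploration practical.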
![Page 26: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/26.jpg)
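Running algorithms at (a_max, m_max, c_CDP)

| Algorithm | Kernel | CDP | MHz |
|---|---|---|---|
| Estimation | Correlation | 32 | 1 |
| Estimation | Matrix mul | 32 | 43 |
| Estimation | Iteration | 32 | 1 |
| Estimation | Transpose | 512 | < 1 |
| Estimation | Matrix mul L | 32 | 22 |
| Estimation | Matrix mul C | 32 | 22 |
| Detection | Matched filter | 32 | 71 |
| Detection | Interference cancellation | 32 | 83 |
| Decoding | Packing | 256 | < 1 |
| Decoding | Re-packing | 64 | < 1 |
| Decoding | Initialization | 64 | 17 |
| Decoding | Add-Compare-Select (ACS) | 64 | 254 |
| Decoding | Decoding output | 64 | 23 |
| Min. real-time frequency, (a,m,c) = (5,3,512) | | | 538 |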
![Page 27: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/27.jpg)
Real-time frequency with clusters for (a,m) = (5,3)
[Figure: real-time frequency (MHz) vs. number of clusters, log-log axes; curves for β = 0, β = 0.5, β = 1; annotated points at 538 MHz and 541 MHz]

$$f(c) = f(cdp)\,\frac{cdp}{c}$$
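The scaling relation f(c) = f(cdp) · cdp / c can be sketched per kernel as below. The function name is hypothetical, and the clamp for c ≥ cdp (extra clusters give no further speedup) is an interpretation that fits the kernel table, not something the slide states.

```c
#include <assert.h>

/* Real-time frequency (MHz) of one kernel on c clusters, given the
 * frequency f_cdp it needs when run on cdp clusters.
 * Assumes ideal scaling below cdp and no benefit from clusters
 * beyond cdp. */
double kernel_frequency(double f_cdp, int cdp, int c)
{
    assert(cdp > 0 && c > 0);
    if (c >= cdp)
        return f_cdp;
    return f_cdp * (double)cdp / (double)c;
}
```

For the ACS kernel from the table (CDP 64, 254 MHz), this gives 508 MHz on 32 clusters and stays at 254 MHz on 64 or more clusters.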
![Page 28: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/28.jpg)
Choosing clusters c = 64, 541 MHz
[Figure: normalized power vs. number of clusters, log-log axes; curves for Power ∝ f², Power ∝ f^2.5, Power ∝ f³]
![Page 29: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/29.jpg)
ALU utilization (+,*)
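[Figure: real-time frequency (in MHz) over #Adders (ticks 1, 3, 5) and a second ALU-count axis (ticks 1, 3); points labeled with ALU utilization (+,*): (51,42), (55,62), (65,46), (67,62), (78,45)]
Initial (5,3,64) (541 MHz)
Final (3,1,64) (567 MHz)
c = 64, α = 0.01 (adder-multiplier power ratio), β = 1 (memory stalls), p = 3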
![Page 30: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/30.jpg)
Choosing ALUs (a,m) for c = 64
| | p = 2 | p = 2.5 | p = 3 |
|---|---|---|---|
| β = 0, α = 0.01 | (2,1,64) | (2,1,64) | (3,1,64) |
| β = 0.5, α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64) |
| β = 1, α = 0.01 | (2,1,64) | (3,1,64) | (3,1,64) |
| β = 1, α = 0.1 | (2,1,64) | (3,1,64) | (3,1,64) |
![Page 31: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/31.jpg)
Insights from analysis
• Sensitivity importance: p, β, α
• Design gives candidates for low power solutions
Design I: (a,m,c): (∞, ∞, ∞) → (5,3,512) → (5,3,64) → (2,1,64)
Design II: (a,m,c): (∞, ∞, ∞) → (5,3,512) → (5,3,64) → (3,1,64)
• Power minimization related to ALU efficiency
– same as maximizing a scaled version of ALU utilization
![Page 32: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/32.jpg)
Advantages of design exploration tool
• Simulator (S)
– cycle-accurate (execution time at run-time)
– explore 100 machine configurations in 100 hours (conservative)
– modification of parameters and code for different runs
• Tool (T)
– cycle-approximate (execution time at compile time)
– explore millions of configurations in 100 hours
– automated process all the way
– generate plots for defense the day before
• Rapid evaluation of candidate algorithms for future systems
![Page 33: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/33.jpg)
Verification of design tool
Human choice: (3,3,32) @ 1.2 V, 0.13 µm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64) at 887 MHz
Estimated base power @ 1.2 V, 0.13 µm = 13.2 W
[Figure: real-time clock frequency (MHz), i.e., execution time, split into computations and stalls, for Design I, Design II, and Human; each shown for T (tool) and S (simulator)]
![Page 34: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/34.jpg)
Cluster utilization
• 64 clusters are inefficient in terms of cluster utilization (54% for clusters 33:64)
• But still lower power than 32 clusters due to the difference in f
– can see the difference reduces as p → 2
[Figure: cluster utilization (%) vs. cluster index number, for 32 clusters and 64 clusters]
![Page 35: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/35.jpg)
Improving power efficiency
• Clusters significant source of power consumption (50-75%)
• When CDP < c, unutilized clusters waste power
• Dynamically turn off clusters using power gating to improve power efficiency
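A minimal sketch of the bookkeeping behind this idea, assuming a kernel can use at most CDP clusters; the function names are hypothetical and the actual gating control is not described at this level in the slides.

```c
#include <assert.h>

/* Clusters that can do useful work for a kernel with the given CDP
 * on a machine with c clusters. */
int active_clusters(int cdp, int c)
{
    assert(cdp > 0 && c > 0);
    return cdp < c ? cdp : c;
}

/* Idle clusters, i.e., candidates for power gating. */
int gated_clusters(int cdp, int c)
{
    return c - active_clusters(cdp, c);
}
```

On a 64-cluster machine, a kernel with CDP = 32 leaves 32 clusters that could be gated, while kernels with CDP of 64 or more keep every cluster busy.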
![Page 36: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/36.jpg)
Data access difficult after adaptation
Clusters off – then how to get data from other banks?
4 → 2 clusters
• Data not in the correct memory banks
• Overhead in bringing data: external memory, inter-cluster network
[Figure: four clusters with their memory banks, adapted from 4 → 2 clusters]
![Page 37: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/37.jpg)
Multiplexer network design
Multiplexer network adapts clusters to DP
[Figure: four cases: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off]
Turned off using power gating to eliminate static and dynamic power dissipation
![Page 38: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/38.jpg)
Run-time variations in workload
[Figure: cluster utilization (%) vs. cluster index number, for K = 9, K = 7, K = 5]
![Page 39: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/39.jpg)
Benefits of multiplexer network
Power efficiency at design time:
Human choice: (3,3,32), base power @ 1.2 V, 0.13 µm, 1 GHz = 18.2 W
Exploration tool choice: (2,1,64), base power @ 1.2 V, 0.13 µm, 887 MHz = 13.2 W
Power efficiency at run-time, with mux network:
(K = 9) = 9.9 W
(K = 7) = 7.4 W
(K = 5) = 6.8 W
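The relative savings implied by these wattages can be checked with a small helper; the function name is hypothetical and the arithmetic uses only the figures quoted on this slide.

```c
#include <assert.h>

/* Percentage of power saved when going from base_watts to new_watts. */
double percent_saving(double base_watts, double new_watts)
{
    assert(base_watts > 0.0);
    return 100.0 * (base_watts - new_watts) / base_watts;
}
```

Relative to the 13.2 W base of the tool's design, the run-time figures for K = 9, 7, and 5 work out to roughly 25%, 44%, and 48% savings; the tool's design itself is about 27% below the 18.2 W human choice.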
![Page 40: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/40.jpg)
Design exploration for 2G-3G-4G systems
A “power”ful tool for algorithm-architecture exploration
[Figure: real-time clock frequency (MHz) vs. data rates, log-log axes, for 2G, 3G, 4G*; annotated architectures: (1,1,32) and (2,1,32); (2,1,64) and (3,1,64)]
![Page 41: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/41.jpg)
Broader impact
• Power-aware design exploration with improved run-time power efficiency
• Techniques can be applied to all high performance, power efficient DSP designs
– Handsets, cameras, video
![Page 42: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/42.jpg)
Future extensions
• Fabrication needed to verify concepts
• Higher performance
– Multi-threading (ILP, SubP, DP, MT)
– Pipelining (ILP, SubP, DP, MT, PP)
• LDPC decoding
– Sparse matrix requires permutations over large data
– Indexed SRF in stream processors [Jayasena, HPCA 2004]
![Page 43: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/43.jpg)
Conclusions
• Providing high performance with 100-1000’s of ALUs and providing low power designs – a challenge for DSP designers
• Algorithm design for efficient mapping on DP-DSPs
• Design exploration tool for low power DP-DSPs
– Provides candidate DSPs for low power
– Allows algorithm-architecture evaluation for new systems
• Power efficiency provided during both design and use of DP-DSPs
![Page 44: Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal](https://reader036.vdocument.in/reader036/viewer/2022062515/56649d1e5503460f949f232d/html5/thumbnails/44.jpg)
Acknowledgements
• Dr. Joseph R. Cavallaro, Dr. Scott Rixner
• Imagine stream processor group at Stanford
– Abhishek, Ujval, Brucek, Dr. Dally
• Marjan, Predrag, Alex
– 4G MIMO + LDPC
• Thesis committee
• Nokia, Texas Instruments, TATP, NSF