university of michigan eecs pepsc : a power efficient computer for scientific computing. ganesh...
TRANSCRIPT
![Page 1: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/1.jpg)
University of MichiganEECS
PEPSC : A Power Efficient Computer for Scientific Computing.
Ganesh Dasika1 Ankit Sethia2 Trevor Mudge2 Scott Mahlke2
1 – ARM R&D, Austin Tx 2 – ACAL, University of Michigan
![Page 2: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/2.jpg)
University of MichiganEECS
2
The Efficiency ofHigh-Performance Compute
1
10
100
1,000
10,000
1 10 100 1,000
Pe
rfo
rma
nc
e (
GF
LO
Ps
)
Power (Watts)Ultra-
PortablePortable with
frequent chargesWall Power
DedicatedPower Network
Pentium M
Core 2
CortexA8
Core i7
GTX 280
GTX 295S1070
IBM Cell
AMD 6850
Target EfficiencyTo reach 1 Petaflop 200 KiloWatts will be required.
![Page 3: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/3.jpg)
University of MichiganEECS
3
General-PurposeScientific Computing
• Currently, best performance is by GPGPUs– Generalized shader pipelines– Graphics 1st priority, generality 2nd
– Power inefficient graphics-specific hardware
• Can we improve efficiency by building a processor ground-up?
![Page 4: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/4.jpg)
University of MichiganEECS
4
Important Domains• We mean scientific computing to be:– Dense matrix.– Large datasets.– Floating point computation intensive.
• We specifically look at:– Communications, signal processing.– Mathematics.– Financial applications.
![Page 5: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/5.jpg)
University of MichiganEECS
5
binOpt black fft fwt lps lu mc nw sde srad0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Mem StallControl DivDatapath stallRunning
Inefficiency Sources
• No single common source, so no panacea.• GPUs unutilized even with thousands of threads.• Need a multi-faceted approach to mitigate inefficiency.
%GPU unutilized
37% 35% 55% 60% 54% 90% 42% 95% 43% 37%
GPU
Util
iztio
n
![Page 6: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/6.jpg)
University of MichiganEECS
6
• Prefetching rather than threading for energy efficiency.
PEPSC
• Wide SIMD architecture as a baseline.
• Novel datapath to eliminate datapath and divergence stalls efficiently.
![Page 7: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/7.jpg)
University of MichiganEECS
7
Outline
• Data-path stalls.
• Memory stalls.
• Control stalls.
• Experiments and Results.
• Conclusion.
![Page 8: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/8.jpg)
University of MichiganEECS
8
*-
-
+
LL
x
S
Traditional SIMD
C[i] = A[i]*B[i]-B[i]-A[i]+x;
Lane 0 Lane 1 Lane 2 Lane 3
** * * *
Register File
WB WB WB WB
-- - - -
-
- - - -
+
+ + + +
![Page 9: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/9.jpg)
University of MichiganEECS
9
2D SIMD using FPU Chaining
Subgraphdepth?
ALU0
Register File
ALU1
ALU2
ALU3
MUX
*-
-
+
LL
x
S
**-
--
-+
+
No intermediate values are written back to Register File.
ALU0
Register File
ALU1
ALU2
ALU3
MUX
ALU0
Register File
ALU1
ALU2
ALU3
MUX
ALU0
Register File
ALU1
ALU2
ALU3
MUX
![Page 10: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/10.jpg)
University of MichiganEECS
10
Preprocessor
Arithmetic
Postprocessor
Arithmetic
Postprocessor
Pre-Proc
Arithmetic
Postprocessor
Pre-Proc
Pre-Proc
Normalizer
FPU0
MUXSubgraph
depth?
Arithmetic
Postprocessor
FPU1
FPU2
FPU3
Register File
2D SIMD using FPU Chaining• Increased performance– Fewer cycles/operation– Fewer instructions
• Reduced power– Fewer intermediate values
=> Fewer RF R/W• Trade-offs:– Increased # RF ports– Increased area
Preprocessor
Arithmetic
Postprocessor
Normalizer
Register File
Use intermediate,non-standard values
Normalize at end
1-Deep FPU(3 cycle latency)
4-Deep FPU(9 cycle latency)
![Page 11: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/11.jpg)
University of MichiganEECS
11
SIMD Width Efficiency Trade-offs
• Efficiency improves as chain-length and SIMD width increases.• Control flow divergence limits this efficiency.
123451.01.52.02.53.03.5
8 16 32 64
Nor
mal
ized
Perf
/Pow
er E
ffici
ency
Chain LengthSIMD Width
![Page 12: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/12.jpg)
University of MichiganEECS
12
Dual FPU Chains• Targets datapath
stalls• Exploit ILP among
chains• Requires:– Extra RF-W port– Extra RF-R port– Extra normalizer
stage
Pre.
Arith.
Post.
Arith.
Post.
Pre.
Arith.
Post.
Pre.
Pre.
Normalizer
Arith.
Post.
Register File
Normalizer
Preprocessor
Arithmetic
Postprocessor
Arithmetic
Postprocessor
Pre-Proc
Arithmetic
Postprocessor
Pre-Proc
Pre-Proc
Normalizer
FPU0
MUXSubgraph
depth?
Arithmetic
Postprocessor
FPU1
FPU2
FPU3
Register File
![Page 13: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/13.jpg)
University of MichiganEECS
13
Dual-FPU Speedup
binOpt black fft fwt lps lu mc nw sde srad mean0%
5%
10%
15%
20%
25%
Dual Single
% S
peed
up
binopt black fft fwt lps lu mc nw sde srad mean
2-deep 4-deep 5-deep
By exploiting ILP among chains, on an average 18% speedup could be found.
FPU Chains FPU chain
![Page 14: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/14.jpg)
University of MichiganEECS
14
Memory Stalls
• GPUs use massive multithreading to hide memory stalls:– Inefficient in terms of static power.– Already several MBs of register file on GPUs
• Caches + Prefetching– Reduce # of thread contexts– Works across different memory latencies
![Page 15: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/15.jpg)
University of MichiganEECS
15
Weighted Average Degree
• Significant variation between benchmarks• Single-stride is insufficient• Significant variation in degree between different loops
binOpt black fft fwt lps lu mc nw sde srad0
1
2
3
4
5
6
7
Wei
ghte
d Av
erag
e D
egre
e.
![Page 16: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/16.jpg)
University of MichiganEECS
16
Dynamic Degree PrefetcherLoad PC Address Stride Confidence Degree0x253ad 0xfcface 8 2 2
- =
+
+>
Current PC
Miss address
Prefetch Queue
Prefetch Table
Prefetch Address
Prefetch Enable
Enab
le
1
1
+
1
On every subsequentmiss, re-check strideand improve confidence
Create entry in PF tablebased on last two requests
Increase degree if missin the cache is already queued in the PFQ
![Page 17: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/17.jpg)
University of MichiganEECS
17
DDP Results
• ~2.5X speedup from Degree-1• ~3.5X from DDP• 80% reduction in D-cache wait latency.
binOpt
black fft fwt lps lu mc nw sde srad mean1.0
2.0
3.0
4.0
5.0
6.0
7.0
Degree-1 DDP
Spee
dup
over
no-
pref
etch
![Page 18: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/18.jpg)
University of MichiganEECS
18
Pre.
Arith.
Post.
Arith.
Post.
Pre.
Arith.
Post.
Pre.
Pre.
Normalizer
Arith.
Post.
Register File
Normalizer
Divergence-Folding• Reduce control-flow overhead.• Use dual-FPU chaining to execute
opposite sides of branches concurrently.
• Support simultaneous dual path execution.
Register File
b + c
+ d
e + f
+ g
“then”
“else”
if (cond2) a = b + c + d;else a = e + f + g;
Vt1 = Vb + Vc [0xFF00]Va = Vt1 + Vd [0xFF00]Vt2 = Ve + Vf [0x00FF]Va = Vt2 + Vg [0x00FF]
![Page 19: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/19.jpg)
University of MichiganEECS
19
Other Serializing Constructs
• Several instances of inner loops aggregating single value
• FP adder tree could help mitigate this– log(SIMD width) depth– Minimal area overhead
with FPU chaining– Potential for 5-6X
speedup of aggregation
FPU0
FPU1
FPU2
Lane0 Lane1 Lane2 Lane3
![Page 20: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/20.jpg)
University of MichiganEECS
20
Evaluation Methodology
• Trimaran compiler used to find chains of dependent instructions.
• Major components synthesized using Design Compiler. Power measured using Prime Time.
• For memory system simulation, we used M5.– L1 : 512B per SIMD Lane, 1 cycle latency.– L2 : 4kB per SIMD Lane, 10 cycle latency.– Memory : 200 cycle latency.
![Page 21: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/21.jpg)
University of MichiganEECS
21
PEPSC Architecture
• 750MHz @ 45nm• 32-way SIMD• 120 GFLOPs• ~2.1 W/core• 1 TFLOP
@ 9 cores, 19 W
• Power-Efficient Processor for Scientific Computing
![Page 22: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/22.jpg)
University of MichiganEECS
22
Utilization Improvement on GPU
binOpt
black
fft fwt
lps lu mc
nw
sde
srad
0.0%
10.0%
20.0%
30.0%
40.0%
50.0%
60.0%
70.0%
80.0%
90.0%
100.0%
Mem StallControl DivDatapath stallRunningG
PU U
tiliza
tion
%Improvement in utilization.
5% 21% 47% 85% 41% 370% 29% 1340% 33%67%
On an average 2x increase in utilization for GPUs with our techniques.
![Page 23: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/23.jpg)
University of MichiganEECS
23
Comparisons
1
10
100
1,000
10,000
1 10 100 1,000
Pe
rfo
rma
nc
e (
GF
LO
Ps
)
Power (Watts)Ultra-
PortablePortable with
frequent chargesWall Power
DedicatedPower Network
GTX 280 peak
GTX 280 realized(with enhancements)
GTX 280 realized(without enhancements)
PEPSC peak
PEPSC realized
![Page 24: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/24.jpg)
University of MichiganEECS
24
Conclusion
• GPUs generally energy-inefficient for scientific computing.
• PEPSC addresses various reasons of inefficiencies by GPUs.– Chained FPU datapath design.– Dynamic Prefetching reduces memory stalls.– Hardware for handling control divergence.
• PEPSC provides a 10x efficiency over modern GPUs.
![Page 25: University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ec95503460f94bd6d36/html5/thumbnails/25.jpg)
University of MichiganEECS
PEPSC : A Power Efficient Computer for Scientific Computing.
Questions?