near-threshold computing: reclaiming moore’s la · university of michigan ena-hpc -- september 7,...
Post on 12-Nov-2020
1 Views
Preview:
TRANSCRIPT
1 1
1
1 University of Michigan EnA-HPC -- September 7, 2011
Near-Threshold Computing: Reclaiming Moore’s Law
Dr. Ronald G. Dreslinski Research Fellow
University of Michigan – Ann Arbor
2 University of Michigan EnA-HPC -- September 7, 2011
0.001
0.01
0.1
1
10
100
1000
10000
100000
1000000
1985 1990 1995 2000 2005 2010 2015 2020
Transistors (100,000's)
Power (W)
Performance (GOPS)
Efficiency (GOPS/W)
2
Limits on heat extrac6on
Limits on energy-‐efficiency of opera6ons
Stagnates performance growth
Motivation
3 University of Michigan EnA-HPC -- September 7, 2011
0.001
0.01
0.1
1
10
100
1000
10000
100000
1000000
1985 1990 1995 2000 2005 2010 2015 2020
Transistors (100,000's)
Power (W)
Performance (GOPS)
Efficiency (GOPS/W)
3
Era of High Performance Compu6ng Era of Energy-‐Efficient Compu6ng c. 2000
With the help of some beBer thermal management…
Goal: To increase energy-‐efficiency of operaGons
Result: Con6nue scaling trends that fueled the compu6ng
revolu6on
Motivation
4 4
4
4 University of Michigan EnA-HPC -- September 7, 2011
Outline Define a new region of operation, Near-Threshold
Computing
Explore new architectures enabled by key insights of computing in the NTC region
Present an initial design of a 3D stacked NTC system, Centip3De
5 5
5
5 University of Michigan EnA-HPC -- September 7, 2011
Dark Silicon—The emerging dilemma: More and more gates can fit on a die,
but not all can be turned on at the same time
Environmental Concerns
Form factor vs. Battery Life
Power Density Limitations Circuit supply
voltages are no longer scaling…
Power does not decrease at the same rate that transistor count
increases
A = gate area scaling 1/s2
C = capacitance scaling < 1/s Dynamic dominates
Stagnant
Shrinking
6 6
6
6 University of Michigan EnA-HPC -- September 7, 2011
Today: Super-Vth, High Performance, Power Constrained
Super-Vth
Ene
rgy
/ Ope
ratio
n Lo
g (D
elay
)
Supply Voltage 0 Vth Vnom
Core i7
3+ GHz 0.5 mW/MHz
Normalized Power, Energy, & Performance Energy per operation is the key metric for
efficiency. Goal: same performance, low energy per operation
7 7
7
7 University of Michigan EnA-HPC -- September 7, 2011
Subthreshold Design
Super-Vth Sub-Vth
Ene
rgy
/ Ope
ratio
n Lo
g (D
elay
)
Supply Voltage 0 Vth Vnom
500 – 1000X
12-16X
Operating in the sub-threshold gives us huge power gains at the expense of performance OK for sensors!
8 8
8
8 University of Michigan EnA-HPC -- September 7, 2011
Evolution of Subthreshold Designs
Phoenix 2 Design (2010) - 0.18 µm CMOS -Commercial ARM M3 Core -Used to investigate:
• Energy harvesting • Power management
-37.4 µW/MHz
Subliminal 2 Design (2007) -0.13 µm CMOS -Used to investigate process variation -3.5 µW/MHz
Subliminal 1 Design (2006) -0.13 µm CMOS -Used to investigate existence of Vmin -2.60 µW/MHz
Phoneix 1 Design (2008) - 0.18 µm CMOS -Used to investigate sleep current -2.8 µW/MHz / 30pW sleep power
9 9
9
9 University of Michigan EnA-HPC -- September 7, 2011
Near-Threshold Computing (NTC)
Super-Vth Sub-Vth
Ene
rgy
/ Ope
ratio
n Lo
g (D
elay
)
Supply Voltage 0 Vth Vnom
~10X ~50-100X
~2X
~6-8X
Near-Threshold Computing (NTC): • >60X power reduction • 6-8X energy reduction
• Invest portion of extra transistors from scaling to overcome barriers
10 10
10
10 University of Michigan EnA-HPC -- September 7, 2011
Silicon Verification of Trends
Phoenix 2 Design [Seok’11] 180nm Design
1.8V -> 700mV ~10x NTC Performance Loss ~7x NTC Energy Reduction
Seok ISSCC 2011
Phoenix 2 Processor
11 11
11
11 University of Michigan EnA-HPC -- September 7, 2011
NTC – Opportunities and Challenges
Challenges: Low Voltage Memory
New SRAM designs Robustness analysis at near-threshold
Variation Razor [Ernst’03] and other in-situ delay monitoring Adaptive body biasing
Performance Loss Many-core designs to improve parallelism Core boosting to improve single thread performance
Opportunities: New architectures Optimized Processes 3D Integration – less thermal restrictions
12 12
12
12 University of Michigan EnA-HPC -- September 7, 2011
Outline Define a new region of operation, Near-Threshold
Computing
Explore new architectures enabled by key insights of computing in the NTC region
Present an initial design of a 3D stacked NTC system, Centip3De
13 13
13
13 University of Michigan EnA-HPC -- September 7, 2011
Minimum Energy SRAM
SRAM has a lower activity rate than logic VDD for minimum energy operation (VMIN) is higher Running logic at VMIN for SRAM has a small energy penalty
with increased performance
Leakage
Dynamic
Total
—
14 14
14
14 University of Michigan EnA-HPC -- September 7, 2011
Cluster
L1
Key Insight: • SRAM is run at a higher VDD than cores with little energy
penalty, allowing caches to operate faster than the core
Cluster Cluster Cluster
Core Core Core Core
New NTC Architectures
L1
BUS / Switched Network
Next Level Memory
Core
L1
Core
L1
Core
L1
Core
L1
Core
BUS / Switched Network
Next Level Memory
L1 L1 L1 L1
Design Levers: • Operating Voltage • L1 Size • Number of Cores per Cluster • Number of Clusters
15 15
15
15 University of Michigan EnA-HPC -- September 7, 2011
Core
L1
L2
L1 Cache Size Tradeoff
Core
L1
L2
Decreased Miss Rate
Higher Energy/Access
16 16
16
16 University of Michigan EnA-HPC -- September 7, 2011
Results – Energy Optimal L1 Size (Single Core)
Energy dependency on L1 size Trade-off between L1 and L2 access
17 17
17
17 University of Michigan EnA-HPC -- September 7, 2011
Clustering Tradeoffs
CPU CPU CPU CPU
L1 L1 L1 L1
L2
CPU CPU CPU CPU
L1 L1
L2
O X X
Tradeoffs ----------------------- + Clustered Sharing - Cluster Conflict - New Bus - L1 Speed
18 18
18
18 University of Michigan EnA-HPC -- September 7, 2011
Energy Optimal Cluster-based CMP (Fixed Die Size)
19 19
19
19 University of Michigan EnA-HPC -- September 7, 2011
Full Space Analysis
20 20
20
20 University of Michigan EnA-HPC -- September 7, 2011
0
0.2
0.4
0.6
0.8
1
Uniprocessor CMP w/ DVFS
NTC
Nor
mal
ized
Ene
rgy/
Ope
ratio
n
L2 L1 Core
Various Scaling Methods
Baseline Single CPU @
233MHz
Simple CMP One core per L1 Vdd scaling
Proposed cluster-based CMP Multiple cores per L1 Vdd scaling
38%
53%
71% 4 Cores 4 L1’s
2 Cores/Cluster 3 Clusters
21 21
21
21 University of Michigan EnA-HPC -- September 7, 2011 -21-
Energy Optima for SPLASH2 Cluster based architecture with Vdd and Vth scaling
Optimal cluster size is 2 for most of the apps
Rad choose non-clustered CMP Average: 74% over baseline, 55% over simple CMP
nc k L1 size/kB energy savings over baseline
energy savings over simple CMP
Cho 3 2 64 70.8% 52.8%
Fft 2 2 32 72.6% 68.5%
fmm � 8� 2� 128� 79.7%� 41.6%
luc � 3� 2� 32� 77.8%� 64.4%
lun� 2� 2� 64� 69.2%� 58.0%
rad� 16� 1� 128� 84.2%� 35.1%
ray� 3� 2� 128� 65.1%� 54.9%
22 22
22
22 University of Michigan EnA-HPC -- September 7, 2011
Energy Optima w/ Performance Requirements
Cluster based approach provides best savings Traditional approach only saves energy at high end
53%
32%
20%
23 23
23
23 University of Michigan EnA-HPC -- September 7, 2011
Outline Define a new region of operation, Near-Threshold
Computing
Explore new architectures enabled by key insights of computing in the NTC region
Present an initial design of a 3D stacked NTC system, Centip3De
24 24
24
24 University of Michigan EnA-HPC -- September 7, 2011
A Closer Look at Wafer-Level Stacking
Dielectric(SiO2/SiN) Gate Poly STI (Shallow Trench Isolation)
Oxide
Silicon
W (Tungsten contact & via) Al (M1 – M5) Cu (M6, Top Metal)
“Super-Contact”
Illustration from Bob Patti, Tezzaron
25 25
25
25 University of Michigan EnA-HPC -- September 7, 2011
Next, Stack a Second Wafer & Thin:
26 26
26
26 University of Michigan EnA-HPC -- September 7, 2011
3rd wafer
2nd wafer
1st wafer: controller
Then, Stack a Third Wafer:
27 27
27
27 University of Michigan EnA-HPC -- September 7, 2011
Centip3De – 3D NTC Prototype
Logic - B Logic - B Logic - A
DRAM Sense/Logic – Bond Routing DRAM DRAM
F2F Bond
F2F Bond
Logic - A
Centip3De Design • 130nm, 7-Layer 3D-Stacked Chip • 128 - ARM M3 Cores • 150mm2
28 28
28
28 University of Michigan EnA-HPC -- September 7, 2011
1.9 GOPS (3.8 GOPS in Boost) Max 1 IPC per core 128 Cores 15 MHz
130 mW (691mW in Boost) 14.6 GOPS/W (5.5 in Boost)
Design Scaling and Power Breakdowns NTC Centip3De System
42
2.9 7.0
39
NTC Mode Power (mW)
Cores I-Caches D-Caches DRAM
336
28
67
45
Boosted Mode Power (mW)
Raytracing Benchmark
Naïve Scaling to 22nm yields ~200GOPS/W
29 29
29
29 University of Michigan EnA-HPC -- September 7, 2011
Observed Voltage Scaling and Thermal Limits reducing the gains of Moore’s Law
Defined a new computational operating region: Near Threshold Computing
Leveraged key insights of NTC for new clustered architectures
Initial ideas of a 3D integrated NTC system, Centip3De
Conclusions
30 30
30
30 University of Michigan EnA-HPC -- September 7, 2011
Related References • Ronald G. Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, Trevor Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, Special Issue on Ultra-Low Power Circuit Technology, Vol. 98, No. 2, February 2010, pg. 253 – 266.
• Bo Zhai, Ronald G. Dreslinski, Trevor Mudge, David Blaauw, Dennis Sylvester, “Energy Efficent Near-threshold Chip Multi-processing,” ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), August 2007, Best Paper Nomination.
• Dan Ernst, Shidhartha Das, Seokwoo Lee, David Blaauw, Todd Austin, Trevor Mudge, Nam Sung Kim, Krisztian Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation”, IEEE, Vol. 24, No. 6, November-December 2004, pg. 10-20.
• Mingoo Seok, Dongsuk Jeon, Chaitali Chakrabarti, David Blaauw, Dennis Sylvester, “A 0.27V, 30MHz, 17.7nJ/transform 1024-pt complex FFT core with super-pipelining,” IEEE International Solid-State Circuits Conference (ISSCC), February 2011, to appear
31 31
31
31 University of Michigan EnA-HPC -- September 7, 2011
Backup
32 32
32
32 University of Michigan EnA-HPC -- September 7, 2011 -32-
Logic vs. Memory To maintain same robustness at low voltages SRAM cell sizes needs
to be increased to compensate effects of process variation Increased size leads to higher energy consumption, and longer
interconnects
33 33
33
33 University of Michigan EnA-HPC -- September 7, 2011 -33-
Proposed Parallel Architecture
34 34
34
34 University of Michigan EnA-HPC -- September 7, 2011 -34-
Energy Optimal Vth Selection
Vth is very high Energy optimal Vdd is
independent of Vth Free performance gain
without consuming more energy
As Vth reduces Circuit operates faster More leakage, more energy
consumption per switching
Choose Vth Body bias Dopant implant
top related