Centip3De: A 64-Core, 3D Stacked, Near-Threshold System
TRANSCRIPT
Centip3De: A 64-Core, 3D Stacked,
Near-Threshold System
Ronald G. Dreslinski
David Fick, Bharan Giridhar,
Gyouho Kim, Sangwon Seo, Matthew Fojtik,
Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim,
Nurrachman Liu, Michael Wieckowski, Gregory Chen,
Trevor Mudge, Dennis Sylvester, David Blaauw
University of Michigan
The Problem of Power

The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use. Circuit supply voltages are no longer scaling, so power does not decrease at the same rate that transistor count increases, resulting in increased power density:

P = A·U·C·V_dd²·f + A·I_leak·V_dd

where A = gate count (area scaling as 1/s²), U = activity factor, and C = capacitance (scaling as < 1/s); the dynamic term dominates.
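The scaling argument behind this slide can be made concrete; a minimal worked version follows, assuming classical Dennard factors (dimensions, V_dd, and C scale by 1/s; f by s) and, post-Dennard, that only dimensions and frequency continue to scale while V_dd is fixed:

```latex
% Dennard era: V_dd scales, so power density stays constant.
\frac{P'}{A'} = \frac{(C/s)\,(V/s)^2\,(s f)}{A/s^2} = \frac{C V^2 f}{A}
% Post-Dennard: V_dd is fixed, so power density grows as s^2.
\frac{P'}{A'} = \frac{(C/s)\,V^2\,(s f)}{A/s^2} = s^2 \cdot \frac{C V^2 f}{A}
```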
Today: Super-Vth, High Performance, Power Constrained

[Figure: Energy/Operation and Log(Delay) vs. supply voltage, from 0 through Vth to Vnom, with today's designs in the super-Vth region; companion plot of normalized CPU metrics.]

Large gate overdrive favors performance, but with unsustainable power density
Must design within a fixed TDP
Goal: maintain performance with improved energy/operation
Subthreshold Design

[Figure: Energy/Operation and Log(Delay) vs. supply voltage; sub-Vth operation reduces energy/operation by 12-16X while increasing delay by 500-1000X relative to Vnom.]

Operating in sub-threshold yields large power gains at the expense of performance.
Applications: sensors, medical devices.
Example: the Phoenix 2 processor (ISSCC 2010).
Near-Threshold Computing (NTC)

[Figure: Energy/Operation and Log(Delay) vs. supply voltage, with the NTC region between sub-Vth and super-Vth. Relative to Vnom, NTC trades ~10X delay for a ~6-8X energy/operation reduction; pushing on into sub-Vth saves only ~2X more energy while adding another ~50-100X delay.]

Near-Threshold Computing (NTC):
>60X power reduction
6-8X energy reduction
Enables 3D integration
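The >60X power claim follows directly from the energy and delay numbers, since power is energy per operation times operation rate; a minimal sketch of the arithmetic using only the slide's figures:

```python
# Power reduction = energy/operation reduction x frequency reduction.
delay_increase = 10                 # NTC runs ~10X slower than nominal
for energy_reduction in (6, 8):     # ~6-8X energy/operation savings
    power_reduction = energy_reduction * delay_increase
    print(f"{energy_reduction}X energy x {delay_increase}X slower "
          f"-> {power_reduction}X lower power")
# 6X * 10X -> 60X, matching the slide's ">60X power reduction"
```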
Architectural Impact of NTC

Caches have a higher V_opt and operating frequency
Caches have a smaller activity rate than core logic
Leakage is a larger proportion of total power in caches
New architectures become possible

[Figure: energy-optimal supply voltage relative to Vt for the core, L1, and L2.]
Proposed NTC Architecture

SRAM is run at a higher VDD, so caches operate faster than the cores. This makes a clustered architecture possible: multiple cores share an L1, yet each core sees what behaves like a private L1, still with single-cycle latency.
Advantages: less coherence/snoop traffic (quantified in the sketch after this slide), and a larger cache for processes that need it
Drawbacks: cores can conflict and evict each other's L1 data (not dominant in simulation), and the interconnect is longer (addressable with 3D integration)

[Figure: baseline organization, with each core and private L1 on a bus/switched network to next-level memory, vs. the proposed clusters of four cores sharing an L1 cache on the same network.]
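The snoop-traffic advantage is easy to quantify: sharing an L1 shrinks the pool of caches that must be kept coherent. A minimal sketch, using this system's 64 cores and 4-core clusters:

```python
# Fewer L1s to snoop: coherence agents drop from one per core
# to one per cluster.
cores, cores_per_cluster = 64, 4
private_l1s = cores                       # baseline: 64 snooping caches
shared_l1s = cores // cores_per_cluster   # clustered: 16 snooping caches
print(private_l1s, "->", shared_l1s)
```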
Proposed Boosting Approach

Measured results are for the 130nm LP design; a 10 MHz core here corresponds to ~110 MHz in 32nm simulation (the core is 140 FO4 delays deep).

Baseline: 4 cores @ 10 MHz (650 mV) with the cache @ 40 MHz (800 mV); the cache runs at 4x the core frequency and is pipelined.

For better single-thread performance, turn some cores off and speed up the rest. The cache is de-pipelined, giving faster response time at the same throughput, and the remaining core sees a larger cache (faster cores need larger caches):
4x boost: 1 core @ 40 MHz (850 mV), cache @ 80 MHz (1.15 V)
8x boost: 1 core @ 80 MHz (1.15 V), cache @ 160 MHz (1.65 V)

(The mode arithmetic is sketched below.)
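A compact restatement of the three cluster modes above, with the speedup computed from the core clocks; a sketch using only the slide's numbers:

```python
# Cluster boost modes: (active cores, core MHz, core V, cache MHz, cache V).
modes = [
    (4, 10, 0.650,  40, 0.800),   # baseline NTC mode
    (1, 40, 0.850,  80, 1.15),    # 4x single-thread boost
    (1, 80, 1.15,  160, 1.65),    # 8x single-thread boost
]
base_mhz = modes[0][1]
for cores, core_mhz, core_v, cache_mhz, cache_v in modes:
    print(f"{cores} core(s) @ {core_mhz} MHz ({core_v} V), "
          f"cache @ {cache_mhz} MHz ({cache_v} V): "
          f"{core_mhz // base_mhz}x single-thread")
```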
Cache Timing

[Figure: cache organization, with the tag-array compare (=) gating the data arrays.]

NTC mode (3/4 cores active): low power. Tag arrays are read first, so only 0-1 data arrays are accessed per lookup.
Boost mode (1/2 cores active): low latency. Data and tags are read in parallel, so all 4 data arrays are accessed.
Cache Timing: NTC Mode (3/4 cores active)

Low power: tag arrays are read first, so only 0-1 data arrays are accessed.

[Figure: NTC-mode timing. Within the core's EX-stage cache access (between the IF/DE and MEM stages), the cache steps through Tag Read, Tag Compare, and Data Read on successive cache clock edges A-D, with other cores' accesses interleaved between them and an idle slot when no data read is needed.]
Cache Timing: Boost Mode (1/2 cores active)

Low latency: data and tags are read in parallel, so all 4 data arrays are accessed.

[Figure: boost-mode timing. The cache performs a combined Tag & Data Read followed by Tag Compare & Mem Access across cache clock edges A-B, with other accesses interleaved, again within the core's EX-stage cache access.]
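The two access policies trade energy against latency in a way that is easy to model; a minimal sketch with hypothetical unit energies (the slides give array counts, not energy numbers):

```python
WAYS = 4                      # data arrays per lookup, from the slides
E_TAG, E_DATA = 1.0, 4.0      # hypothetical energy units per array read

def ntc_access(hit: bool):
    """Serial: tags first, then at most one data array (on a hit)."""
    energy = E_TAG + (E_DATA if hit else 0.0)
    return energy, 3          # three cache edges: tag read, compare, data read

def boost_access(hit: bool):
    """Parallel: tags and all four data arrays read together."""
    energy = E_TAG + WAYS * E_DATA
    return energy, 2          # two cache edges: read, compare/select

print(ntc_access(True))       # (5.0, 3): cheap but slower
print(boost_access(True))     # (17.0, 2): fast but reads 4 arrays
```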
Centip3De System Overview
The 2-layer system has completed fabrication, with measured results; the full 7-layer NTC system is expected at the end of 2012.
Cluster architecture:
4 cores per cluster, with 1kB I$ and 8kB D$
A local clock controller operates the cores 90° out of phase (see the sketch after this slide)
1591 F2F connections per cluster
Clusters are organized into layer pairs (cache layer + core layer) to minimize routing; up to two pairs, with 16 clusters per pair
Cores have only vertical interconnections

[Figure: cluster block diagram, x32.]
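Running the four cores 90° out of phase against a cache at 4x the core clock gives each core its own cache slot; a minimal sketch of that schedule (the rotation order is an assumption):

```python
# Cache at 4x the core clock, cores offset by 90 degrees: each cache
# cycle serves the one core whose phase lines up with it.
CORES = 4
for cache_cycle in range(8):
    owner = cache_cycle % CORES
    print(f"cache cycle {cache_cycle}: serves core {owner}")
# Each core sees single-cycle L1 latency in its own clock domain,
# with no arbitration against the other three cores.
```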
Bus interconnect architecture:
Up to 500 MHz, with 9-11 bus-cycle latency (1-3 core cycles)
8 lanes of 128b each (1024b total), one lane per DRAM interface; each cluster connects to all eight
Vertically connected through all four layers
A flipping interface enables a 128-core system

[Figure: bus system diagram.]
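The lane arithmetic composes into an aggregate width and a peak transfer rate; a sketch, assuming one beat per lane per bus cycle (the slides state width and frequency, not beat rate):

```python
lanes, lane_bits, bus_mhz = 8, 128, 500
total_bits = lanes * lane_bits        # 1024b total, as on the slide
peak_gbit_s = total_bits * bus_mhz / 1000.0
print(total_bits, "bits wide,", peak_gbit_s, "Gb/s peak")  # 1024, 512.0
```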
3D-stacked DRAM: Tezzaron Octopus
One control layer in 130nm CMOS
Up to two 1Gb bitcell layers in a DRAM process
8x 128b DDR2 interfaces, operated at the bus frequency (up to 500 MHz)
[Figure: stack cross-section annotated with 3D connection counts: 28,485 F2F connections per face-to-face interface, with 3,024 and 3,624 B2B connections.]
130nm process; 12.66 x 5 mm per layer; 28.4M devices on the core layer and 18.0M on the cache layer.
2-Layer Stacking Process Evaluated

[Figure: cross-section of the core layer and cache layer joined face-to-face (F2F).]

For the measured 2-layer system, aluminum wirebond pads were used instead: the pads connect at the perimeter, where the TSVs would in the 7-layer system.
Cache 3D Connections

[Figure: cache-layer floorplan: SRAM macros surrounding the cache logic, bus interface, and a sea of gates.]
Core 3D Connections

[Figure: core-layer floorplan: Cores 0-3 in a sea of gates.]
Cluster 3D Connections

1591 F2F connections per cluster, each saving ~600-1000µm of routing and preventing wiring congestion around the SRAMs.
Silicon Results

[Figure: exploded view of the full stack. On top, the Tezzaron Octopus DRAM: two DRAM bitcell layers over a DRAM control layer. Below, the top core layer, top cache layer, bottom cache layer, and bottom core layer. Each cache layer carries I$/D$ blocks (clusters 00-31 split across the two layers), a Cache Bus Hub (8x8 crossbar), a DRAM Bus Hub (8x4 crossbar), and a Flipping Interface; the core layers carry the numbered Cortex-M3 cores and the DRAM interfaces, with one core marked "Disabled Due To Redundancy".]
Die Shot

[Die photo, looking through the back of the core layer, annotated with the aluminum wirebond pads, a DRAM interface/bus hub, and a 4-core cluster.]

130nm process; 12.66 x 5 mm per layer; 28.4M devices on the core layer and 18.0M on the cache layer.
System Configurations

Mode     Cores boosted / gated   Cache Bus Hub     Cache clock               Core clock
4-core   0 boosted, 0 gated      160 MHz, 1.15 V   div 4: 40 MHz, 0.80 V     div 4: 10 MHz, 0.65 V
3-core   3 boosted, 1 gated      160 MHz, 1.15 V   div 2: 80 MHz, 1.15 V     div 4: 20 MHz, 0.75 V
2-core   2 boosted, 2 gated      160 MHz, 1.15 V   div 2: 80 MHz, 1.15 V     div 2: 40 MHz, 0.85 V
1-core   1 boosted, 3 gated      320 MHz, 1.6 V    div 2: 160 MHz, 1.65 V    div 2: 80 MHz, 1.15 V

(Each divider is relative to the previous clock in the chain, bus -> cache -> core; see the sketch below.)
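The divider chain in the table can be restated directly; a minimal sketch reproducing the cache and core clocks from the bus clock:

```python
# Clock divider chain per the configuration table: the cache clock is a
# divided bus clock, and the core clock a divided cache clock.
configs = {
    # mode: (bus MHz, cache divider, core divider)
    "4-core": (160, 4, 4),   # cache 40 MHz, cores 10 MHz
    "3-core": (160, 2, 4),   # cache 80 MHz, cores 20 MHz
    "2-core": (160, 2, 2),   # cache 80 MHz, cores 40 MHz
    "1-core": (320, 2, 2),   # cache 160 MHz, core 80 MHz
}
for mode, (bus, cache_div, core_div) in configs.items():
    cache = bus // cache_div
    core = cache // core_div
    print(f"{mode}: bus {bus} MHz -> cache {cache} MHz -> core {core} MHz")
```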
Measured Results

Measured power breakdown (mW) by system configuration:

Config   Core    Cache   Memory system   Total
4-core    25.9    65.0   113              203
3-core    51.2   175     113              339
2-core    84.0   266     113              463
1-core   471    1225     155             1851

Boosting a single cluster to 1-core mode requires disabling, or down-boosting, the other clusters: one 1-core cluster draws as much power as 15 4-core clusters, 6 3-core clusters, or 4.5 2-core clusters. The baseline configuration therefore depends on the TDP and processing needs.
Measured Results

Single-threaded performance (DMIPS) by system configuration:

Config   DMIPS
4-core    12.5
3-core    25
2-core    50
1-core   100

[The power breakdown chart from the previous slide is repeated alongside.]
Measured Results

Energy efficiency (DMIPS/Watt) by system configuration:

Config   DMIPS/Watt
4-core   3930
3-core   3540
2-core   3460
1-core    860

Measured: Centip3De, 3,930 DMIPS/W (130nm)
Industry comparison: ARM Cortex-A9, 8,000 DMIPS/W (40nm) [1]
Estimated: Centip3De, 18,500 DMIPS/W (45nm)
[1] ARM Ltd., Cortex-A9 product page, http://arm.com/products/processors/cortex-a/cortex-a9.php, 2011.
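The chart values can be cross-checked from the measured power and the core clocks; a sketch, assuming the Cortex-M3's ~1.25 DMIPS/MHz rating (not stated on the slide):

```python
# DMIPS/W = (active cores x core MHz x DMIPS-per-MHz) / measured watts.
DMIPS_PER_MHZ = 1.25   # assumed Cortex-M3 rating
configs = {
    # mode: (active cores, core MHz, measured system power in W)
    "4-core": (64, 10, 0.203),
    "3-core": (48, 20, 0.339),
    "2-core": (32, 40, 0.463),
    "1-core": (16, 80, 1.851),
}
for mode, (cores, mhz, watts) in configs.items():
    print(f"{mode}: {cores * mhz * DMIPS_PER_MHZ / watts:.0f} DMIPS/W")
# -> 3941, 3540, 3456, 864: matching the chart within rounding
```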
Conclusion

Near-threshold computing (NTC):
Low-power solutions are needed to stay within TDP
NTC achieves ~10X energy efficiency, allowing ~10X more computation at a given TDP
It offers an optimal balance between performance and energy
Boosting recovers single-threaded performance (Amdahl's law)

A large-scale 3D CMP was demonstrated:
64 cores currently; 128 cores plus DRAM in the future
3D design shown to be feasible
This work was funded and organized with the help of DARPA,
Tezzaron, ARM, and the National Science Foundation