Energy Efficient Designs with Wide Dynamic Range
Vivek De, Intel Fellow
Director of Circuit Technology Research, Circuits Research Lab
Intel Labs
2
Platform segment characteristics
Few designs must serve all segments
Segments: Server, Desktop, Mobile, MID
Characteristics: Power, Workload, Thermal, V-F, Cores, Ambient, Load
Values below are listed in segment order (Server; Desktop; Mobile; MID).
Power: High; Med; Low; Very low
Workload: Throughput, Parallel, HPC-RAS (Server); Burst, Serial-Parallel, Stream, Graphics (Desktop, Mobile, MID)
Thermal: Active; Fan (-less); Fan (-less); Passive, Fan-less
Ambient: Controlled, Cool; Controlled, Varying; Uncontrolled, Extreme; Uncontrolled, Extreme
V-F, Cores, Load:
Med/High, Wide; Med/High, Wide; Med/High, Wide; Low/Med, Wide
High, Wide; Med-High, Wide; Med-High, Wide; Low-Med, Wide
SOC
3
Dynamic platform control
Deliver best user experience under constraints
[Figure: processor block diagram with caches, last-level caches, scalable on-die interconnect fabric, graphics, video, special-purpose engines, integrated memory controllers and off-die interconnect]
• Dynamic V/F control
• Independent V/F control regions
• Workload-based core activation & shutdown
• Scenario-based power allocation
• Maximize performance & efficiency
4
Dynamic adaptation & reconfiguration
Adapt & reconfigure for best power-performance
[Figure: processor with a Dynamic Control Unit]
Sensors: Aging Sensor, Thermal Sensor, Voltage Sensor, Current Sensor
Controls: Voltage Control (Change V), Frequency Control (Change F), Configuration Control (Reconfigure)
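The sensor-to-control loop on this slide can be sketched in a few lines. This is only an illustration, not the control unit described in the talk: the limits, step sizes, the `SensorReadings` and `OperatingPoint` types, and the policy in `control_step` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical limits and step sizes, chosen only for illustration.
T_LIMIT_C = 90.0
POWER_BUDGET_W = 50.0
V_MIN, V_MAX, V_STEP = 0.7, 1.2, 0.05
F_MIN_MHZ, F_MAX_MHZ, F_STEP_MHZ = 500.0, 3000.0, 100.0


@dataclass
class SensorReadings:
    temperature_c: float   # thermal sensor
    voltage_v: float       # voltage sensor
    current_a: float       # current sensor
    aging_slowdown: float  # aging sensor, as fractional speed loss (0.0 = fresh)


@dataclass
class OperatingPoint:
    voltage_v: float
    frequency_mhz: float
    active_cores: int


def control_step(op: OperatingPoint, s: SensorReadings) -> OperatingPoint:
    """One decision of the control unit: change V, change F, or reconfigure."""
    power_w = s.voltage_v * s.current_a
    over_limit = s.temperature_c > T_LIMIT_C or power_w > POWER_BUDGET_W

    if over_limit:
        if op.frequency_mhz > F_MIN_MHZ:
            # Change F and V: step both down, clamped at their floors.
            return OperatingPoint(
                max(V_MIN, op.voltage_v - V_STEP),
                max(F_MIN_MHZ, op.frequency_mhz - F_STEP_MHZ),
                op.active_cores,
            )
        # Already at the lowest V/F point: reconfigure by shutting down a core.
        return OperatingPoint(op.voltage_v, op.frequency_mhz,
                              max(1, op.active_cores - 1))

    # Headroom available: step up, capping F by what aged silicon can sustain.
    f_cap = F_MAX_MHZ * (1.0 - s.aging_slowdown)
    return OperatingPoint(
        min(V_MAX, op.voltage_v + V_STEP),
        min(f_cap, op.frequency_mhz + F_STEP_MHZ),
        op.active_cores,
    )
```

Calling `control_step` once per sensor sampling interval gives a simple closed loop; a real controller would also need hysteresis and transition-latency handling, which this sketch leaves out.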
5
Resilient platforms
Resiliency for performance, efficiency & reliability
Platform layers: Circuit & Design, Microarchitecture, Microcode, Firmware, VM, OS, Programming System, Applications
Trade-offs across layers: Lower error rate; Less recovery overhead; Less silicon overhead
Resiliency framework (resilient platform features):
• Error detection
• Fault diagnosis
• Fault confinement
• Error correction
• System recovery
• System adaptation
• System reconfiguration
6
Fine-grain power management: Temporal domain
TFLOPS at 62 watts
TODAY: Coarse-grain management, slow voltage & frequency change
FUTURE: Fine-grain management, fast voltage & frequency change
[Figure: energy vs. time (0.0 to 1.5 msec), user demand vs. delivered, for coarse-grain and fine-grain management]
• V/F change latency
• Active-sleep transition latency
• Wake-up latency
• Wake-up prediction
• State transition energy
• Supply noise
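To make the gap between user demand and delivered energy concrete, here is a minimal sketch. It assumes a toy model that is not from the talk: the power manager can only lower the delivered level after `latency_steps` time steps, so delivered power is the peak demand over that recent window. The function name and the demand trace are hypothetical.

```python
def wasted_energy(demand_w, latency_steps, dt_ms):
    """Energy (mJ) delivered above demand when power tracks demand with a lag.

    Assumed model: at each step the delivered power equals the highest demand
    seen in the last `latency_steps` steps plus the current one, i.e. the
    supply level cannot be lowered faster than the V/F change latency.
    """
    excess_w = 0.0
    for i, need in enumerate(demand_w):
        window = demand_w[max(0, i - latency_steps): i + 1]
        delivered = max(window)
        excess_w += delivered - need
    return excess_w * dt_ms  # W * ms = mJ


# Hypothetical bursty demand trace, one sample every 0.1 ms.
demand = [10, 2, 2, 2, 10, 2, 2, 2]
coarse = wasted_energy(demand, latency_steps=2, dt_ms=0.1)  # slow V/F change
fine = wasted_energy(demand, latency_steps=0, dt_ms=0.1)    # instant V/F change
```

In this toy model the wasted energy grows with the transition latency, which is why V/F change latency and wake-up latency head the list above.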
7
Fine-grain power management: Spatial domain
TODAY: Coarse-grain management
• Same voltage to all cores
• Same frequency for all cores
FUTURE: Fine-grain management
• Each core/cluster at optimum voltage
• Each core/cluster at optimum frequency

• V/F domain interfaces
• Synchronization overhead
• Clock generation/distribution
• Power grid routing
• Optimum V/F for non-cores
• Sub-core clock/leakage gating
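A rough sketch of why per-core V/F domains can save power compared with one shared domain. The model is an assumption for illustration only: dynamic power is taken as C·V²·f, the voltage needed for a frequency follows a made-up linear relation in `voltage_for`, and cores in a shared domain are assumed to run continuously at the domain frequency with no gating.

```python
def voltage_for(freq_ghz):
    """Hypothetical linear V-F relation: minimum voltage (V) for a frequency."""
    return 0.6 + 0.2 * freq_ghz


def dynamic_power(freq_ghz, c_eff=1.0):
    """Dynamic power in arbitrary units, using P = C * V^2 * f."""
    v = voltage_for(freq_ghz)
    return c_eff * v * v * freq_ghz


def shared_domain_power(required_ghz):
    """All cores share one V/F point, set by the most demanding core."""
    f = max(required_ghz)
    return len(required_ghz) * dynamic_power(f)


def per_core_power(required_ghz):
    """Each core runs at the V/F point its own workload needs."""
    return sum(dynamic_power(f) for f in required_ghz)


# One busy core and three lightly loaded cores (hypothetical demands in GHz).
required = [3.0, 1.0, 1.0, 1.0]
saving = 1.0 - per_core_power(required) / shared_domain_power(required)
```

The sketch ignores the costs listed above (domain interfaces, synchronization, extra clock and power-grid resources), which reduce the achievable saving in practice.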
8
Advanced voltage regulators
VR innovations for fine-grain power management
Efficient conversion:
• Area efficient
• Scalable
• Persistent rail
Low-loss distribution:
• Lower loss
• Higher fidelity
• Simpler
Fine-grain control:
• Fast & efficient
• Load adaptive
• Independent rails
[Figure: power source efficiency (ηPS) vs. load current (ILOAD), conventional power source vs. load-adaptive power source]
[Figure: converter chip and load chip, top and bottom views, with inductors, input capacitors, output capacitors and RF launch; 5mm scale]
[Figure: efficiency (70 to 90%) vs. load current (0 to 20 A); curves labeled L = 1.9nH, L = 0.8nH, 2.4V to 1.5V, 2.4V to 1.2V, 60MHz, 80MHz, 100MHz]
9
Research testchip
[Figure: tile block diagram. Processing Engine (PE): 2KB data memory (DMEM), 3KB instruction memory (IMEM), 6-read, 4-write 32-entry RF, FPMAC0 and FPMAC1 (multiply, add, normalize), RIB; datapath widths labeled 32, 64 and 96. Crossbar Router with Mesochronous Interfaces (MSINT); links labeled 39; 20 GB/s]
[Figure: die photo, 21.72mm x 12.64mm, with I/O areas, PLL and TAP; single tile 1.5mm x 2.0mm]
10
Many-core power management
Dynamic sleep
STANDBY:
• Memory retains data
• 50% less power/tile
FULL SLEEP:
• Memories fully off
• 80% less power/tile
21 sleep regions per tile (not all shown)
FP Engine 1 sleeping: 90% less power
FP Engine 2 sleeping: 90% less power
Router sleeping: 10% less power (stays on to pass traffic)
Data Memory sleeping: 57% less power
Instruction Memory sleeping: 56% less power
Energy efficiency of 19.4 Gflops/Watt
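As a small worked example of per-region sleep, the sketch below combines the per-block savings quoted on the slide with an active-power breakdown per block. The breakdown in `ACTIVE_POWER_MW` is purely hypothetical (the slide does not give one), so the resulting totals are illustrative and are not expected to match the 50% and 80% tile-level figures.

```python
# Per-block power reduction when sleeping, in percent (from the slide).
SLEEP_SAVING_PCT = {
    "fp_engine_1": 90,
    "fp_engine_2": 90,
    "router": 10,            # stays on to pass traffic
    "data_memory": 57,
    "instruction_memory": 56,
}

# Hypothetical active power per block in mW; not taken from the talk.
ACTIVE_POWER_MW = {
    "fp_engine_1": 300,
    "fp_engine_2": 300,
    "router": 200,
    "data_memory": 100,
    "instruction_memory": 100,
}


def tile_power_mw(sleeping):
    """Tile power with the given set of blocks asleep and the rest active."""
    total = 0.0
    for block, active_mw in ACTIVE_POWER_MW.items():
        if block in sleeping:
            total += active_mw * (100 - SLEEP_SAVING_PCT[block]) / 100
        else:
            total += active_mw
    return total


all_awake = tile_power_mw(set())
fp_asleep = tile_power_mw({"fp_engine_1", "fp_engine_2"})
```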
11
Memory integration
3D stacking with thru-silicon vias
• 256 KB SRAM per core
• 4X C4 bump density
• 8490 thru-silicon vias
[Figure: package cross-sections. Polaris with DRAM; processor with Cu bumps at denser than C4 pitch, memory with thru-silicon vias, C4 pitch to package]
12
Voltage-frequency range limiters
Reliability & functional failures limit range
[Figure: voltage vs. frequency operating range bounded by Vmax, Vmin and Fmax]
Vmax/Fmax limiters:
• Reliability
• Thermals
• Power delivery
Vmin limiters:
• Circuit functional failures
• Soft errors
• Steep frequency roll-off
• Aging
13
Voltage-frequency margins
[Figure: four voltage vs. frequency panels, each showing a V margin and an F margin]
• V variation: IR drop, inductive droops, load line variations
• T variation: nominal T vs. worst T
• Aging: BOL vs. EOL
• Path activity: nominal vs. worst; MIS, signal coupling, critical path
14
Multi-voltage cache
[Figure: cumulative fail rate vs. array voltage; nominal array Vmin, worst die Vmin; SER, erratic bits]
[Figure: array Vmin vs. density for 8T+ cell, multi-V 6T SRAM and 6T SRAM; LLC density]
Embedded level shifters for wordline & write drivers minimize area & power overhead
Push active Vmin limit to Vmax
15
Dynamic multi-voltage cache
6T SRAM cell

Device  Read    Write   Retention
NX      Weak    Strong  -
NPD     Strong  Weak    Balanced
PPU     Strong  Weak    Balanced

[Figure: 6T SRAM cell schematic with WL, BL, BL#, VCC and devices NX, NPD, PPU]
[Figure: wordline underdrive circuit with R1, R2, op amp and VWL; dynamic voltage collapse pulse on VCC/VSS with magnitude (M) and duration labels TM and TD; $M = V_{CC}(1 - e^{-T_M/RC})$]
Wordline underdrive for read
Dynamic voltage collapse for write
Array to WL differential supply noise tracking
Pulse width control
45nm dynamic multi-V testchip
[Figure: testchip floorplan with 128Kb min-cell arrays, 128Kb and 107Kb p-cell arrays, PBIST block, IO pad drivers, WLUD & dummy WLs]
Transistor count: 6.2M
Chip area: 0.91mm2
Vcc: 1.1V-0.7V
Testing interface: membrane probe card
[Figure: relative single bit fails (1.E+00 to 1.E+10) vs. VWL (0.7 to 1.1 V) for MIN cell; curves: DVC off, weakest DVC, nominal DVC, strongest DVC; 26X less fails]
Source: Intel
16
Cache reconfiguration
Reduce cache size @ low V/F by eliminating failing words/bits
Word disable: bitmap of failing words (e.g. 00101000)
Bit fix
[Figure: failure probability (1.0E-11 to 1.0E+00) vs. supply voltage (0.4 to 0.7); curves: 1-bit ECC, word disable, bit fix, 10-bit ECC]

Scheme                 Vmin (mV)  Density  Capacity      Latency (cycles)  IPC   EPI
Conventional           660        1*       1* (8-way)    L1: 3, L2: 20     1*    1*
32KB L1, word disable  500        0.92     0.5 (4-way)   4                 0.95  0.5
2MB L2, bit fix        500        1        0.75 (6-way)  23                0.95  0.5
* Normalized reference value

Source: Intel
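The word-disable idea, a bitmap marking failing words so that they are skipped at low voltage, can be sketched as follows. This is a simplified illustration and not the scheme evaluated in the table: it only masks failing words and reports how much of the array stays usable. The function names and the second bitmap are hypothetical; `00101000` is the example bitmap from the slide, read here as one bit per word with `1` meaning failing.

```python
def usable_words(fail_bitmap):
    """Return the indices of words that are still usable.

    `fail_bitmap` is a string with one character per word in a cache line:
    '1' marks a word that fails at low voltage, '0' a good word.
    """
    return [i for i, bit in enumerate(fail_bitmap) if bit == "0"]


def usable_fraction(fail_bitmaps):
    """Fraction of all words across the given lines that remain usable."""
    total = sum(len(bitmap) for bitmap in fail_bitmaps)
    good = sum(bitmap.count("0") for bitmap in fail_bitmaps)
    return good / total


# Words 2 and 4 of this 8-word line are disabled.
good_words = usable_words("00101000")

# Two lines: one with two failing words, one fully functional.
fraction = usable_fraction(["00101000", "00000000"])
```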
17
Low-voltage logic design
Design & technology optimizations to balance range, performance & efficiency
• Narrow muxes; no stack height > 2
• Robust level converters
• Robust flip-flops
[Figure: 4:1 mux with "1" and "0" input patterns]
[Figure: level converter schematic with input, output, vcch and vccl supplies]
[Figure: efficiency (MIPS/Watt) vs. performance: improve range, impact max performance]
[Figure: efficiency (MIPS/Watt) vs. performance: improve range & efficiency]
18
Low-voltage motion estimation engine
65nm CMOS, 70K transistors
Die area ~1mm2
412 Gops/Watt @ 320 mV!
Wide V-F range
[Figure: die layout with I/O control, input FIFO, output FIFO, clock generator and motion estimation engine]
[Figure: GOPS/Watt (0 to 450) vs. supply voltage (0.2 to 1.4 V); P1264, 50°C; 9.6X]
[Figure: Fmax (MHz, log scale 1 to 10000) vs. supply voltage (0.2 to 1.4 V), with a second log axis from 0.01 to 100]
Source: Intel
19
Dynamic V & F adaptation
Environment-aware dynamic adaptation
• Adapt F/V to V/T change to reduce V/T margin
• Adapt F/V to aging to reduce aging margin
Prototype chip in 90nm: TCP/IP processor
[Figure: chip floorplan with TCP/IP processor core, clocking, input buffer, sensors & analog, DAB, JTAG and noise generators]
[Figure: adaptation loop. DAB control with thermal sensor and droop sensor; PLL0, PLL1, PLL2 (F0, F1, F2) with divider, clocking control and core clk gate; PMOS and NMOS body bias (CBG); I/O clk; noise injector; PLL command; VRM; input buffer and output port]
[Figure: temperature vs. time; Vcc vs. time showing 1st, 2nd and 3rd droop]
Source: Intel
20
Resilient circuits
Error Detection Sequential (EDS)
• Detect errors in critical path FFs
• Propagate error signals
• Correct errors by re-execution
• Feedback to adaptive V/F
[Figure: MSFF-to-MSFF path with clk; a voltage droop causes a timing error that is flagged by error detection]
65nm resilient circuits testchip
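The trade-off behind detect-and-re-execute can be sketched with a toy throughput model. Everything numeric here is an assumption for illustration (the error-rate curve, its onset frequency and slope, and the replay penalty); it is not data from the testchip. The idea shown is that raising the clock past the guardbanded frequency keeps increasing throughput until the cost of re-executing erroneous cycles outweighs the faster clock.

```python
import math


def error_rate(freq_mhz, f_onset_mhz=3000.0, slope_mhz=30.0):
    """Hypothetical per-cycle error probability.

    Negligible at the onset frequency, then growing exponentially with
    clock frequency, capped at 1.0.
    """
    return min(1.0, 1e-9 * math.exp((freq_mhz - f_onset_mhz) / slope_mhz))


def throughput(freq_mhz, replay_penalty_cycles=10):
    """Effective throughput (million useful cycles per second).

    Each detected error is corrected by re-execution, modeled as a fixed
    number of extra cycles.
    """
    p_err = error_rate(freq_mhz)
    return freq_mhz / (1.0 + p_err * replay_penalty_cycles)


# Sweep the clock in 50 MHz steps and pick the best operating point.
candidates = range(2100, 3601, 50)
best_freq = max(candidates, key=throughput)

# A conventional design is assumed to stay at the error-free onset frequency.
gain = throughput(best_freq) / throughput(3000.0) - 1.0
```

With error detection in place, the best operating point sits above the error-free frequency, at a small but non-zero error rate; the detected error rate can also serve as the feedback signal for adaptive V/F.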
21
Resiliency experiments: Response to voltage droops
21% throughput gain OR 37% power reduction
Nominal: VCC=1.2V & Temp=60C
Worst-case: 10% VCC droop & Temp=110C
[Figure: power (0 to 200 mW) vs. throughput (1.0 to 3.0 BIPS), conventional design vs. resilient design]
[Figure: throughput (0.0 to 5.0 BIPS) and error rate (1.0E-13 to 1.0E+02 %) vs. clock frequency (2100 to 3600 MHz); max TP of conventional design and of resilient design; VCC & temperature FCLK guardband]
Source: Intel
22
Summary
• Energy efficiency and wide dynamic operating range are critical for all platforms
• Integration, fine-grain power management, advanced voltage regulators & 3D memory stacking are key for energy efficiency
• Reliability, functionality, margins & efficiency limit dynamic operating range
• Multi-voltage design, dynamic adaptation, reconfiguration & resiliency are key enablers
23
Acknowledgement
CRL prototype Bangalore Design LTD Design
Murli Tirumala
Tomm Aldridge, Mark Anders, Paolo Aseron, Ravi Mahajan
Jerry Bautista, Nitin Borkar, Shekhar Borkar, Ravi Prasher
Keith Bowman, Saurabh Dighe, Zeshan Chishti, Venkat Natarajan
Lev Finkelstein, Shih-Lien Lu, Varghese George, Anand Deshpande
Marci Glenn, George Goodman, Steve Gunther, Ali Farhang
Matthew Haycock, Chris Wilkerson, Ming Zhang, Amit Agarwal
Andrew Henroid, Yatin Hoskote, Jason Howard, Robert Chau
Steven Hsu, Tanay Karnik, Himanshu Kaul, Rajesh Kumar
Alaa Alameldeen, M. Khellah, V. Erraguntla, Ketan Paranjape
Sean Koehl, R. Krishnamurthy, Partha Kundu, Greg Taylor
Sanu Mathew, Randy Mooney, A. Raychowdhury, P. Vishakantaiah
Alon Naveh, Trang Nguyen, Fabrice Paillet, R. Kuppuswamy
Clark Roberts, Ronny Ronen, Erfaim Rotem, Rohit Vidwans
Mark Rowland, Greg Ruhl, Gerhard Schrom, Gautam Doshi
Joe Schutz, D. Somasekhar, James Tschanz, Sunit Tyagi
Sriram Vangal, Manny Vara, Howard Wilson, B. Chatterjee
Bibiche Geuskens, Steven Hsu, Kevin Zhang, Sati Banerjee
Clair Webb, Mark Bohr, Gunjan Pandya
24
Q&A