TRANSCRIPT
Exploiting Dark Silicon in Server Design
Nikos Hardavellas
Northwestern University, EECS
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) compared with the Swine Flu A/H1N1 virus (CDC); process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)]
Device scaling continues for at least another 10 years
Good Days Ended Nov. 2002 [Yelick09]
“New” Moore’s Law: 2x cores with every generation
Exponential Growth of Core Counts
So, are 1000-core chips a viable architecture?
[Figure: number of cores (1-512, log scale) vs. year (1975-2015), plotting the 8086, 80286, 80386, 80486, Pentium, Pentium Pro, P-II, P-III, P-4, Itanium, Power4, MIT-RAW, UltraSPARC IV, Pentium-D, Xeon, Brisbane, Niagara, Core 2 Duo/Quad, Turion X2, Denmark, Teraflops, Larrabee, TILE64, Sun Rock, Cell, Agena, Barcelona, Toliman, Core i7, Istanbul, Beckton, and Magny-Cours]
Performance Expectations vs. Reality
[Figure: relative performance (0-25x) vs. year of technology introduction (2004-2019); speedup under Moore's Law diverges from speedup under physical constraints]
Physical constraints limit speedup
Area vs. Power Envelope
Good news: we can fit hundreds of cores. Bad news: we cannot power them all
[Figure: feasible number of cores (1-256) vs. cache size (1-512 MB) under the area envelope (310 mm²) and the power envelope (130 W)]
Pack More Slower Cores, Cheaper Cache
The reality of the Power Wall: a power-performance trade-off
[Figure: number of cores vs. cache size under the area (310 mm²) and power (130 W) envelopes at voltage-frequency scaling (VFS) operating points: 1 GHz/0.27 V, 2.7 GHz/0.36 V, 4.4 GHz/0.45 V, 5.7 GHz/0.54 V, 6.9 GHz/0.63 V, 8 GHz/0.72 V, 9 GHz/0.81 V]
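To see the power-performance trade-off in numbers: dynamic power grows roughly as f·V², so the VFS operating points above span a much wider power range than frequency range. A minimal sketch using only the slide's numbers and the standard f·V² rule:

```python
# Relative dynamic power (~ f * V^2) at the VFS operating points above.
points = [(1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
          (6.9, 0.63), (8.0, 0.72), (9.0, 0.81)]
base_f, base_v = points[0]
for f, v in points:
    rel = (f * v**2) / (base_f * base_v**2)
    print(f"{f:3.1f} GHz @ {v:.2f} V: {rel:5.1f}x power for {f:3.1f}x speed")
# 9 GHz at 0.81 V draws 81x the power of 1 GHz at 0.27 V for only 9x the
# frequency: under a fixed 130 W budget, many slow cores beat few fast ones.
```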
Pin Bandwidth Constraint
Bandwidth constraint favors fewer, slower cores and more cache
[Figure: the same design space with pin-bandwidth feasibility curves at 1 GHz and 2.7 GHz overlaid on the area, power, and VFS curves]
Example of Optimization Results
[Figure: performance (GIPS, 0-600) vs. cache size (1-512 MB) for area-only and power-only designs at max frequency, and for bandwidth-, area+power-, and area+power+bandwidth-constrained designs with VFS. The bandwidth constraint costs ~2x performance; power + bandwidth together cost ~5x]
Jointly optimize all design parameters, subject to physical constraints and SW trends
Design is first bandwidth-constrained, then power-constrained
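The methodology behind these curves can be summarized as a constrained search. The sketch below is a toy reconstruction, not the talk's actual model: the per-core area, power constant, miss-rate rule, and pin-bandwidth budget are invented placeholders; only the 310 mm² and 130 W envelopes and the VFS points come from the slides.

```python
# Toy reconstruction of the peak-performance search (NOT the talk's model):
# sweep core count, cache size, and VFS point; keep the best design that
# fits the area, power, and pin-bandwidth envelopes.
import itertools, math

AREA_MM2, POWER_W, PIN_BW_GBS = 310, 130, 200      # pin BW budget is assumed
VFS = [(1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
       (6.9, 0.63), (8.0, 0.72), (9.0, 0.81)]      # (GHz, Vdd) from the slides

CORE_MM2, MB_MM2 = 2.5, 1.0      # assumed area per core / per MB of cache
BW_PER_GIPS_AT_1MB = 0.5         # assumed off-chip GB/s per GIPS at 1 MB cache

best = None
for cores, mb, (ghz, vdd) in itertools.product(
        [2**i for i in range(9)], [2**i for i in range(10)], VFS):
    area  = cores * CORE_MM2 + mb * MB_MM2
    power = cores * ghz * vdd**2 * 4.0       # dynamic power ~ f * V^2, scaled
    gips  = cores * ghz                      # assume 1 instruction/cycle/core
    bw    = gips * BW_PER_GIPS_AT_1MB / math.sqrt(mb)  # bigger cache, less traffic
    if area <= AREA_MM2 and power <= POWER_W and bw <= PIN_BW_GBS:
        if best is None or gips > best[0]:
            best = (gips, cores, mb, ghz)

print("best feasible (GIPS, cores, cache MB, GHz):", best)
```

Tightening or relaxing PIN_BW_GBS in this toy model reproduces the qualitative behavior above: the bandwidth envelope binds first, then power.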
Core Counts for Peak-Performance Designs
[Figure: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) for embedded (EMB) and general-purpose (GPP) peak-performance designs, against the maximum number of EMB cores that fit on the die]
Designs > 120 cores impractical for general-purpose server apps
B/W and power envelopes + dataset scaling limit core counts
Physical characteristics modeled after:
• UltraSPARC T2 (GPP)
• ARM11 (EMB)
Application Dataset Scaling
Application datasets scale faster than Moore’s Law! → Big caches
[Figure: scaling factor (0-20x) vs. year of technology introduction (2004-2019): OS dataset scaling (Myhrvold's Law) and historic TPC dataset growth outpace transistor scaling (Moore's Law)]
Pin Bandwidth Scaling
Off-chip bandwidth scales slowly (pin count, off-chip clock) → Big caches
[Figure: scaling factor (1-16x, log scale) vs. year of technology introduction (2004-2019): transistor scaling (Moore's Law) vs. pin bandwidth]
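A compound-growth comparison makes the "big caches" implication concrete; a minimal sketch, assuming transistors double per 3-year generation while pin bandwidth grows only ~1.3x per generation (an illustrative rate, not the slide's data):

```python
# Demand (transistors) vs. supply (pin bandwidth) over 2004-2019.
generations = 5                     # five 3-year steps
transistor_x = 2.0 ** generations   # Moore's Law: 32x
pin_bw_x = 1.3 ** generations       # assumed slow pin-bandwidth growth: ~3.7x
print(f"transistors: {transistor_x:.0f}x, pin bandwidth: {pin_bw_x:.1f}x")
print(f"gap that on-chip caches must absorb: {transistor_x / pin_bw_x:.1f}x")
```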
Supply Voltage Scaling
Supply voltage scaling is SLOW! → Dark silicon
[Figure: scaling factor (0.5-16x, log scale) vs. year of technology introduction (2004-2019): transistor scaling (Moore's Law) vs. supply voltage (ITRS)]
Chip Power Scaling
Chip power does not scale!
[Figure: watts per chip (0-250) vs. year of technology introduction (2004-2019): max power under air cooling + heatsink, and chip power (ITRS)]
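Combining the last two slides: transistor counts double each generation, supply voltage barely drops, and the chip power budget is flat. A back-of-the-envelope sketch of the powered fraction of the die (the 2x, 0.7x, and 0.97x per-generation factors are illustrative assumptions, not ITRS data):

```python
# Fraction of the die that a flat power budget can keep switching.
n, c, v = 1.0, 1.0, 1.0              # transistors, per-device capacitance, Vdd
for year in (2004, 2007, 2010, 2013, 2016, 2019):
    full_chip_power = n * c * v**2   # ~ N * C * V^2 * f, normalized, f fixed
    powered = min(1.0, 1.0 / full_chip_power)
    print(f"{year}: full-chip power {full_chip_power:4.2f}x, "
          f"powered fraction {powered:6.1%}")
    n *= 2.0     # Moore's Law: 2x transistors per generation
    c *= 0.7     # capacitance per device shrinks with feature size
    v *= 0.97    # supply voltage scales slowly (previous slide)
print("the rest of the die stays dark")
```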
Range of Operational Voltage
[Watanabe et al., ISCA’10]
The shrinking range of operational voltage hampers voltage-frequency scaling
Mitigating Bandwidth Limitations: 3D-stacking
[Loh et al., ISCA’08]
Delivers TB/sec of bandwidth; use as large “in-package” cache
[Images: Amcor Tech, Philips]
Performance Analysis of 3D-Stacked Multicores
[Figure: performance (GIPS, 0-800) vs. cache size (1-512 MB) with the same constraint curves (area and power at max frequency; bandwidth, area+power, and area+power+bandwidth with VFS), now with 3D-stacked memory]
Chip becomes power-constrained
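In terms of the earlier design-space sketch, 3D-stacked memory amounts to raising the bandwidth envelope (PIN_BW_GBS) by orders of magnitude; with that single change, the search's binding constraint shifts from bandwidth to power, consistent with the curve above.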
Exponentially Large Die Area Left Unutilized
[Figure: die size (mm², 64-512, log scale) vs. year of technology introduction (2004-2019): max die size vs. DB2-TPCC, DB2-TPCH, and Apache designs, with an exponential trendline]
Dark Silicon!!! Should we waste it?
Example of a Specialized Multicore Chip
[Figure: a specialized multicore floorplan mixing ILP OoO cores, SIMD engines, DSPs, many-threaded cores, reconfigurable logic, and crypto and TCP accelerators]
Many custom cores on chip; power only the most useful ones
Core Specialization
• Existing general designs
  OoO for ILP, in-order MT for memory-latency-bound, SIMD for data-parallel, systolic arrays
• Customizable cores
  Tensilica Xtensa (custom ISA and datapath, operation fusion)
• Reconfigurable logic
• Generality of implemented operations
  Target a specific application / common macro-operations / general ISA
• Trade-offs in performance, power, programmability, generality
“Heterogeneity” and “specialization” span a wide range of meanings
First-Order Core Specialization Model
• 720p HD H.264 encoder (high-definition video encoder)
• Several optimized implementations exist
Commercial ASICs, FPGAs, CMP software
• Wide range of computational motifs
             Frames/sec   Energy/frame (mJ)   Perf. gap with ASIC   Energy gap with ASIC
ASIC            30               4                    -                     -
CMP: IME         0.06         1179                  525x                  707x
     FME         0.08          921                  342x                  468x
     Intra       0.48          137                   63x                  157x
     CABAC       1.82           39                   17x                  261x
[Hameed et al., ASPLOS’10]
Performance of Specialized Multicores
[Figure: speedup (1-64x, log scale) vs. year of technology introduction (2004-2019) for GPP, EMB, and ideal-power (Ideal-P) designs, each with and without 3D-stacked memory]
Specialized multicores deliver 2x-12x higher performance
Core Counts for Specialized Multicores
[Figure: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) for EMB and GPP specialized designs, with and without 3D memory, against the maximum EMB core count]
Only a few cores need to run at a time; the large die area allows many cores
Power constraints? Yield?
Taming Power and Bandwidth: Nanophotonics
[Figure: chiplets spread out in space, connected by an optical interconnect]
Split the chip into chiplets, spread them out in space
Ease cooling and power delivery, high yield; photonics for bandwidth
Nanophotonic Components
[Figure: a nanophotonic link: rings (one Ge-doped) couple the bit stream 110100101 onto and off of a shared waveguide]
Rings selectively couple optical energy of a specific wavelength
64-wavelength DWDM, 3-5 μm waveguide pitch, 10 Gbps per link
~100 Gbps/μm bandwidth density!!! [Batten et al., HOTI’08]
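A quick sanity check of that bandwidth-density figure from the slide's own parameters (taking the 5 μm end of the quoted pitch range):

```python
# Bandwidth density = wavelengths x per-link rate / waveguide pitch.
wavelengths = 64         # DWDM channels per waveguide
gbps_per_link = 10       # data rate per wavelength
pitch_um = 5             # upper end of the 3-5 um pitch range
print(wavelengths * gbps_per_link / pitch_um, "Gbps/um")   # 128.0
# ...the same order of magnitude as the ~100 Gbps/um quoted above.
```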
Technology: Off-chip Channel Material
Material            Optical loss   Propagation speed   Pitch (density)
Silicon waveguide   0.3 dB/cm*     0.286c              20 μm
Optic fiber         0.2 dB/km      0.676c              250 μm
• Optical fiber is low-loss and high-speed
  Enables spreading chiplets farther apart
  BW density was a challenge (fiber pitch is large)
* J. Cardenas et al., Optics Express 2009
Fiber: low optical loss, high speed, flexibility eases assembly
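The table's loss figures are what make fiber attractive for spreading chiplets out. A minimal sketch comparing attenuation over a 16 cm run (16 cm is the fiber length quoted for the Galaxy design two slides later):

```python
# Attenuation over a 16 cm channel, using the loss figures from the table.
length_cm = 16
si_waveguide_db = 0.3 * length_cm           # silicon waveguide: 0.3 dB/cm
fiber_db = 0.2 * (length_cm / 100 / 1000)   # optic fiber: 0.2 dB/km
print(f"silicon waveguide: {si_waveguide_db:.1f} dB")   # 4.8 dB
print(f"optic fiber:       {fiber_db:.5f} dB")          # 0.00003 dB
# Fiber loss is negligible at board scale, so chiplets can spread out freely.
```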
Technology: Dense Off-Chip Coupling
• Dense optical fiber array. [Lee et al., OSA / OFC/NFOEC 2010]
• <1dB loss, 8 Tbps/mm demonstrated.
Tapered couplers solved the bandwidth-density problem; Tbps/mm demonstrated
Galaxy Overall Architecture
[Figure: five chiplets (0-4) connected by optical fibers through couplers, fed by a laser source; a source (src) electrical cluster on one chiplet communicates with a destination (dst) cluster on another]
Cross-chiplet assemblies share an optical bus,
forming optical crossbars (FlexiShare)
Large-Scale Interconnects
200 mm² die, 64 routers per chiplet, 9 chiplets, 16 cm fiber
Supports > 1K cores!
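The aggregate arithmetic behind the >1K-core claim (the cores-per-router figure below is my assumption; the slide states only the totals):

```python
# Aggregate resources of the Galaxy configuration on this slide.
chiplets, die_mm2, routers = 9, 200, 64
print("total silicon:", chiplets * die_mm2, "mm^2")   # 1800 mm^2 of die area
print("total routers:", chiplets * routers)           # 576 routers
cores_per_router = 2      # assumption: a small core cluster per router
print("total cores:  ", chiplets * routers * cores_per_router)   # 1152 > 1K
```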
Conclusions
• Physical constraints and software pragmatics limit core counts
  …and performance
• Emerging/exotic technologies may solve some problems
  3D-memory for bandwidth
  Nanophotonics for bandwidth, power, yield
• Need to reduce wasted energy per unit of work
  Heterogeneity; power only the few cores needed
• Need to innovate across the software/hardware stack
  Programmability and tools are a great challenge
• Scaling forces caches to grow exponentially
  Address data management both in the cache and in software
Thank You!
Acknowledgements:
Y. Pan, J. Kim, G. Memik, M. Ferdman, B. Falsafi