TRANSCRIPT
Exploiting Dark Silicon in Server Design
Nikos Hardavellas
Northwestern University, EECS
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) compared with the Swine Flu A/H1N1 virus (CDC); process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)]
Device scaling continues for at least another 10 years
Good Days Ended Nov. 2002 [Yelick09]
“New” Moore’s Law: 2x cores with every generation
Exponential Growth of Core Counts
So, are 1000-core chips a viable architecture?
[Figure: number of cores (1-512, log scale) vs. year (1975-2015), plotting the 8086, 80286, 80386, 80486, Pentium, Pentium Pro, P-II, P-III, P-4, Itanium, Power4, MIT-RAW, UltraSPARC IV, Pentium-D, Xeon, Brisbane, Niagara, Core 2 Duo/Quad, Turion X2, Denmark, Teraflops, Larrabee, TILE64, Sun Rock, Cell, Agena, Barcelona, Toliman, Core i7, Istanbul, Beckton, and Magny-Cours]
Performance Expectations vs. Reality
[Figure: relative performance (0-25x) vs. year of technology introduction (2004-2019); speedup under Moore's Law diverges from speedup under physical constraints]
Physical constraints limit speedup
Area vs. Power Envelope
Good news: we can fit hundreds of cores. Bad news: we cannot power them all
[Figure: feasible number of cores (1-256) vs. cache size (1-512 MB) under the area envelope (310 mm²) and the power envelope (130 W)]
Pack More Slower Cores, Cheaper Cache
The reality of the Power Wall: a power-performance trade-off
[Figure: number of cores vs. cache size under the area (310 mm²) and power (130 W) envelopes at voltage-frequency scaling (VFS) operating points: 1 GHz/0.27 V, 2.7 GHz/0.36 V, 4.4 GHz/0.45 V, 5.7 GHz/0.54 V, 6.9 GHz/0.63 V, 8 GHz/0.72 V, 9 GHz/0.81 V]
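To see the power-performance trade-off in numbers: dynamic power grows roughly as f·V², so the VFS operating points above span a much wider power range than frequency range. A minimal sketch using only the slide's numbers and the standard f·V² rule:

```python
# Relative dynamic power (~ f * V^2) at the VFS operating points above.
points = [(1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
          (6.9, 0.63), (8.0, 0.72), (9.0, 0.81)]
base_f, base_v = points[0]
for f, v in points:
    rel = (f * v**2) / (base_f * base_v**2)
    print(f"{f:3.1f} GHz @ {v:.2f} V: {rel:5.1f}x power for {f:3.1f}x speed")
# 9 GHz at 0.81 V draws 81x the power of 1 GHz at 0.27 V for only 9x the
# frequency: under a fixed 130 W budget, many slow cores beat few fast ones.
```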
Pin Bandwidth Constraint
Bandwidth constraint favors fewer, slower cores and more cache
[Figure: the same design space with pin-bandwidth feasibility curves at 1 GHz and 2.7 GHz overlaid on the area, power, and VFS curves]
Example of Optimization Results
[Figure: performance (GIPS, 0-600) vs. cache size (1-512 MB) for area-only and power-only designs at max frequency, and for bandwidth-, area+power-, and area+power+bandwidth-constrained designs with VFS. The bandwidth constraint costs ~2x performance; power + bandwidth together cost ~5x]
Jointly optimize all design parameters, subject to physical constraints and SW trends
Design is first bandwidth-constrained, then power-constrained
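The methodology behind these curves can be summarized as a constrained search. The sketch below is a toy reconstruction, not the talk's actual model: the per-core area, power constant, miss-rate rule, and pin-bandwidth budget are invented placeholders; only the 310 mm² and 130 W envelopes and the VFS points come from the slides.

```python
# Toy reconstruction of the peak-performance search (NOT the talk's model):
# sweep core count, cache size, and VFS point; keep the best design that
# fits the area, power, and pin-bandwidth envelopes.
import itertools, math

AREA_MM2, POWER_W, PIN_BW_GBS = 310, 130, 200      # pin BW budget is assumed
VFS = [(1.0, 0.27), (2.7, 0.36), (4.4, 0.45), (5.7, 0.54),
       (6.9, 0.63), (8.0, 0.72), (9.0, 0.81)]      # (GHz, Vdd) from the slides

CORE_MM2, MB_MM2 = 2.5, 1.0      # assumed area per core / per MB of cache
BW_PER_GIPS_AT_1MB = 0.5         # assumed off-chip GB/s per GIPS at 1 MB cache

best = None
for cores, mb, (ghz, vdd) in itertools.product(
        [2**i for i in range(9)], [2**i for i in range(10)], VFS):
    area  = cores * CORE_MM2 + mb * MB_MM2
    power = cores * ghz * vdd**2 * 4.0       # dynamic power ~ f * V^2, scaled
    gips  = cores * ghz                      # assume 1 instruction/cycle/core
    bw    = gips * BW_PER_GIPS_AT_1MB / math.sqrt(mb)  # bigger cache, less traffic
    if area <= AREA_MM2 and power <= POWER_W and bw <= PIN_BW_GBS:
        if best is None or gips > best[0]:
            best = (gips, cores, mb, ghz)

print("best feasible (GIPS, cores, cache MB, GHz):", best)
```

Tightening or relaxing PIN_BW_GBS in this toy model reproduces the qualitative behavior above: the bandwidth envelope binds first, then power.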
Core Counts for Peak-Performance Designs
[Figure: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) for embedded (EMB) and general-purpose (GPP) peak-performance designs, against the maximum number of EMB cores that fit on the die]
Designs > 120 cores impractical for general-purpose server apps
B/W and power envelopes + dataset scaling limit core counts
Physical characteristics modeled after:
• UltraSPARC T2 (GPP)
• ARM11 (EMB)
Application Dataset Scaling
Application datasets scale faster than Moore’s Law! → Big caches
[Figure: scaling factor (0-20x) vs. year of technology introduction (2004-2019): OS dataset scaling (Myhrvold's Law) and historic TPC dataset growth outpace transistor scaling (Moore's Law)]
Pin Bandwidth Scaling
Off-chip bandwidth scales slowly (pin count, off-chip clock) → Big caches
[Figure: scaling factor (1-16x, log scale) vs. year of technology introduction (2004-2019): transistor scaling (Moore's Law) vs. pin bandwidth]
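A compound-growth comparison makes the "big caches" implication concrete; a minimal sketch, assuming transistors double per 3-year generation while pin bandwidth grows only ~1.3x per generation (an illustrative rate, not the slide's data):

```python
# Demand (transistors) vs. supply (pin bandwidth) over 2004-2019.
generations = 5                     # five 3-year steps
transistor_x = 2.0 ** generations   # Moore's Law: 32x
pin_bw_x = 1.3 ** generations       # assumed slow pin-bandwidth growth: ~3.7x
print(f"transistors: {transistor_x:.0f}x, pin bandwidth: {pin_bw_x:.1f}x")
print(f"gap that on-chip caches must absorb: {transistor_x / pin_bw_x:.1f}x")
```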
Supply Voltage Scaling
Supply voltage scaling is SLOW! → Dark silicon
[Figure: scaling factor (0.5-16x, log scale) vs. year of technology introduction (2004-2019): transistor scaling (Moore's Law) vs. supply voltage (ITRS)]
Chip Power Scaling
Chip power does not scale!
[Figure: watts per chip (0-250) vs. year of technology introduction (2004-2019): max power under air cooling + heatsink, and chip power (ITRS)]
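Combining the last two slides: transistor counts double each generation, supply voltage barely drops, and the chip power budget is flat. A back-of-the-envelope sketch of the powered fraction of the die (the 2x, 0.7x, and 0.97x per-generation factors are illustrative assumptions, not ITRS data):

```python
# Fraction of the die that a flat power budget can keep switching.
n, c, v = 1.0, 1.0, 1.0              # transistors, per-device capacitance, Vdd
for year in (2004, 2007, 2010, 2013, 2016, 2019):
    full_chip_power = n * c * v**2   # ~ N * C * V^2 * f, normalized, f fixed
    powered = min(1.0, 1.0 / full_chip_power)
    print(f"{year}: full-chip power {full_chip_power:4.2f}x, "
          f"powered fraction {powered:6.1%}")
    n *= 2.0     # Moore's Law: 2x transistors per generation
    c *= 0.7     # capacitance per device shrinks with feature size
    v *= 0.97    # supply voltage scales slowly (previous slide)
print("the rest of the die stays dark")
```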
Range of Operational Voltage
[Watanabe et al., ISCA’10]
The shrinking range of operational voltage hampers voltage-frequency scaling
Mitigating Bandwidth Limitations: 3D-stacking
[Loh et al., ISCA’08]
Delivers TB/sec of bandwidth; use as large “in-package” cache
[Images: Amcor Tech, Philips]
Performance Analysis of 3D-Stacked Multicores
[Figure: performance (GIPS, 0-800) vs. cache size (1-512 MB) with the same constraint curves (area and power at max frequency; bandwidth, area+power, and area+power+bandwidth with VFS), now with 3D-stacked memory]
Chip becomes power-constrained
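In terms of the earlier design-space sketch, 3D-stacked memory amounts to raising the bandwidth envelope (PIN_BW_GBS) by orders of magnitude; with that single change, the search's binding constraint shifts from bandwidth to power, consistent with the curve above.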
Exponentially Large Die Area Left Unutilized
[Figure: die size (mm², 64-512, log scale) vs. year of technology introduction (2004-2019): max die size vs. DB2-TPCC, DB2-TPCH, and Apache designs, with an exponential trendline]
Dark Silicon!!! Should we waste it?
Example of a Specialized Multicore Chip
[Figure: a specialized multicore floorplan mixing ILP OoO cores, SIMD engines, DSPs, many-threaded cores, reconfigurable logic, and crypto and TCP accelerators]
Many custom cores on chip; power only the most useful ones
Core Specialization
• Existing general designs
  OoO for ILP, in-order MT for memory-latency-bound, SIMD for data-parallel, systolic arrays
• Customizable cores
  Tensilica Xtensa (custom ISA and datapath, operation fusion)
• Reconfigurable logic
• Generality of implemented operations
  Target a specific application / common macro-operations / general ISA
• Trade-offs in performance, power, programmability, generality
“Heterogeneity” and “specialization” span a wide range of meanings
First-Order Core Specialization Model
• 720p HD H.264 encoder (high-definition video encoder)
• Several optimized implementations exist
Commercial ASICs, FPGAs, CMP software
• Wide range of computational motifs
             Frames/sec   Energy/frame (mJ)   Perf. gap with ASIC   Energy gap with ASIC
ASIC            30               4                    -                     -
CMP: IME         0.06         1179                  525x                  707x
     FME         0.08          921                  342x                  468x
     Intra       0.48          137                   63x                  157x
     CABAC       1.82           39                   17x                  261x
[Hameed et al., ASPLOS’10]
Performance of Specialized Multicores
[Figure: speedup (1-64x, log scale) vs. year of technology introduction (2004-2019) for GPP, EMB, and ideal-power (Ideal-P) designs, each with and without 3D-stacked memory]
Specialized multicores deliver 2x-12x higher performance
Core Counts for Specialized Multicores
[Figure: number of cores (1-1024, log scale) vs. year of technology introduction (2004-2019) for EMB and GPP specialized designs, with and without 3D memory, against the maximum EMB core count]
Only a few cores need to run at a time; the large die area allows many cores
Power constraints? Yield?
Taming Power and Bandwidth: Nanophotonics
[Figure: chiplets spread out in space, connected by an optical interconnect]
Split the chip into chiplets, spread them out in space
Ease cooling and power delivery, high yield; photonics for bandwidth
Nanophotonic Components
[Figure: a nanophotonic link: rings (one Ge-doped) couple the bit stream 110100101 onto and off of a shared waveguide]
Rings selectively couple optical energy of a specific wavelength
64-wavelength DWDM, 3-5 μm waveguide pitch, 10 Gbps per link
~100 Gbps/μm bandwidth density!!! [Batten et al., HOTI’08]
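A quick sanity check of that bandwidth-density figure from the slide's own parameters (taking the 5 μm end of the quoted pitch range):

```python
# Bandwidth density = wavelengths x per-link rate / waveguide pitch.
wavelengths = 64         # DWDM channels per waveguide
gbps_per_link = 10       # data rate per wavelength
pitch_um = 5             # upper end of the 3-5 um pitch range
print(wavelengths * gbps_per_link / pitch_um, "Gbps/um")   # 128.0
# ...the same order of magnitude as the ~100 Gbps/um quoted above.
```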
Technology: Off-chip Channel Material
Material            Optical loss   Propagation speed   Pitch (density)
Silicon waveguide   0.3 dB/cm*     0.286c              20 μm
Optic fiber         0.2 dB/km      0.676c              250 μm
• Optical fiber is low-loss and high-speed
  Enables spreading chiplets farther apart
  BW density was a challenge (fiber pitch is large)
* J. Cardenas et al., Optics Express 2009
Fiber: low optical loss, high speed, flexibility eases assembly
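The table's loss figures are what make fiber attractive for spreading chiplets out. A minimal sketch comparing attenuation over a 16 cm run (16 cm is the fiber length quoted for the Galaxy design two slides later):

```python
# Attenuation over a 16 cm channel, using the loss figures from the table.
length_cm = 16
si_waveguide_db = 0.3 * length_cm           # silicon waveguide: 0.3 dB/cm
fiber_db = 0.2 * (length_cm / 100 / 1000)   # optic fiber: 0.2 dB/km
print(f"silicon waveguide: {si_waveguide_db:.1f} dB")   # 4.8 dB
print(f"optic fiber:       {fiber_db:.5f} dB")          # 0.00003 dB
# Fiber loss is negligible at board scale, so chiplets can spread out freely.
```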
Technology: Dense Off-Chip Coupling
• Dense optical fiber array. [Lee et al., OSA / OFC/NFOEC 2010]
• <1dB loss, 8 Tbps/mm demonstrated.
Tapered couplers solved the bandwidth-density problem; Tbps/mm demonstrated
Galaxy Overall Architecture
[Figure: five chiplets (0-4) connected by optical fibers through couplers, fed by a laser source; a source (src) electrical cluster on one chiplet communicates with a destination (dst) cluster on another]
Cross-chiplet assemblies share an optical bus,
forming optical crossbars (FlexiShare)
Large-Scale Interconnects
200 mm² die, 64 routers per chiplet, 9 chiplets, 16 cm fiber
Supports > 1K cores!
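The aggregate arithmetic behind the >1K-core claim (the cores-per-router figure below is my assumption; the slide states only the totals):

```python
# Aggregate resources of the Galaxy configuration on this slide.
chiplets, die_mm2, routers = 9, 200, 64
print("total silicon:", chiplets * die_mm2, "mm^2")   # 1800 mm^2 of die area
print("total routers:", chiplets * routers)           # 576 routers
cores_per_router = 2      # assumption: a small core cluster per router
print("total cores:  ", chiplets * routers * cores_per_router)   # 1152 > 1K
```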
Conclusions
• Physical constraints and software pragmatics limit core counts
  …and performance
• Emerging/exotic technologies may solve some problems
  3D-memory for bandwidth
  Nanophotonics for bandwidth, power, yield
• Need to reduce wasted energy per unit of work
  Heterogeneity; power only the few cores needed
• Need to innovate across the software/hardware stack
  Programmability and tools are a great challenge
• Scaling forces caches to grow exponentially
  Address data management both in the cache and in software
Thank You!
Acknowledgements:
Y. Pan, J. Kim, G. Memik, M. Ferdman, B. Falsafi