
DARK SILICON AND THE END OF MULTICORE SCALING

Hadi Esmaeilzadeh, University of Washington
Emily Blem, University of Wisconsin-Madison
Renée St. Amant, University of Texas at Austin
Karthikeyan Sankaralingam, University of Wisconsin-Madison
Doug Burger, Microsoft Research

A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.

Moore's law (the doubling of transistors on chip every 18 months) has been a fundamental driver of computing.1 For the past three decades, through device, circuit, microarchitecture, architecture, and compiler advances, Moore's law, coupled with Dennard scaling, has resulted in commensurate exponential performance increases.2 The recent shift to multicore designs aims to increase the number of cores, using the increasing transistor count to continue the proportional scaling of performance.

With the end of Dennard scaling, future technology generations can sustain the doubling of devices every generation, but with significantly less improvement in energy efficiency at the device level. This device scaling trend presages a divergence between energy-efficiency gains and transistor-density increases. For the architecture community, it is crucial to understand how effectively multicore scaling will use increased device integration capacity to deliver performance speedups in the long term. While everyone understands that power and energy are critical problems, no detailed, quantitative study has addressed how severe (or not) the power problem will be for multicore scaling, especially given the large multicore design space (CPU-like, GPU-like, symmetric, asymmetric, dynamic, composed/fused, and so forth).

To explore the speedup potential of future multicores, we conducted a decade-long performance scaling projection for multicore designs assuming fixed power and area budgets. It considers devices, core microarchitectures, chip organizations, and benchmark characteristics, applying area and power constraints at future technology nodes. Through our models we also estimate the effects of nonideal device scaling on integration capacity utilization and estimate the percentage of dark silicon (transistor integration capacity underutilization) on future multicore chips. For more information on related research, see the "Related Work in Modeling Multicore Speedup and Dark Silicon" sidebar.

Modeling multicore scaling

To project the upper bound performance achievable through multicore scaling (under current scaling assumptions), we considered technology scaling projections, single-core design scaling, multicore design choices, actual application behavior, and microarchitectural features. We considered fixed-size and fixed-power-budget chips.


We built and combined three models to project performance, as Figure 1 shows. The three models are the device scaling model (DevM), the core scaling model (CorM), and the multicore scaling model (CmpM). The models predict performance speedup and show a gap between our projected speedup and the speedup we have come to expect with each technology generation; we refer to this gap as the dark silicon gap. The models also project the percentage of dark silicon as the process technology scales.

We built a device scaling model that provides the area, power, and frequency scaling factors at technology nodes from 45 nm through 8 nm. We considered aggressive International Technology Roadmap for Semiconductors (ITRS; http://www.itrs.net) projections and conservative projections from Borkar's recent study.3

We modeled the power/performance and area/performance tradeoffs of single-core designs using Pareto frontiers derived from real measurements. Through Pareto-optimal curves, the core-level model provides the maximum performance that a single core can sustain for any given area. Further, it provides the minimum power that must be consumed to sustain this level of performance.


Related Work in Modeling Multicore Speedup and Dark Silicon

Hill and Marty extend Amdahl's law to model multicore speedup with symmetric, asymmetric, and dynamic topologies and conclude that dynamic multicores are superior.1 Their model uses area as the primary constraint and models the single-core area/performance tradeoff using Pollack's rule (Performance ∝ √Area) without considering technology trends.2 Azizi et al. derive single-core energy/performance Pareto frontiers using architecture-level statistical models combined with circuit-level energy/performance tradeoff functions.3 For modeling single-core power/performance and area/performance tradeoffs, our core model derives two separate Pareto frontiers from real measurements. Furthermore, we project these tradeoff functions to future technology nodes using our device model.

Chakraborty considers device scaling and estimates a simultaneous activity factor for technology nodes down to 32 nm.4 Hempstead et al. introduce a variant of Amdahl's law to estimate the amount of specialization required to maintain 1.5× performance growth per year, assuming completely parallelizable code.5 Chung et al. study unconventional cores, including custom logic, field-programmable gate arrays (FPGAs), and GPUs, in heterogeneous single-chip designs.6 They rely on Pollack's rule for the area/performance and power/performance tradeoffs. Using International Technology Roadmap for Semiconductors (ITRS) projections, they report on the potential for unconventional cores considering parallel kernels. Hardavellas et al. forecast the limits of multicore scaling and the emergence of dark silicon in servers with workloads that have an inherent abundance of parallelism.7 Using ITRS projections, Venkatesh et al. estimate technology-imposed utilization limits and motivate energy-efficient and application-specific core designs.8

Previous work largely abstracts away processor organization and application details. Our study provides a comprehensive model that considers the implications of process technology scaling; decouples power/area constraints; uses real measurements to model single-core design tradeoffs; and exhaustively considers multicore organizations, microarchitectural features, and the behavior of real applications.

References

1. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
2. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 32), IEEE CS, 1999, p. 2.
3. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM, 2010, pp. 26-36.
4. K. Chakraborty, "Over-Provisioned Multicore Systems," doctoral thesis, Dept. of Computer Sciences, Univ. of Wisconsin-Madison, 2008.
5. M. Hempstead, G.-Y. Wei, and D. Brooks, "Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization," Workshop on Modeling, Benchmarking, and Simulations (MoBS), 2009.
6. E.S. Chung et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs?" Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 43), IEEE CS, 2010, pp. 225-236.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 10), ACM, 2010, pp. 205-218.

We developed an analytical model that provides the per-benchmark speedup of a multicore design compared to a baseline design. The model projects performance for each hybrid configuration based on high-level application properties and microarchitectural features. We modeled the two mainstream classes of multicore organizations, multicore CPUs and many-thread GPUs, which represent two extreme points in the threads-per-core spectrum. The CPU multicore organization represents Intel Nehalem-like, heavyweight multicore designs with fast caches and high single-thread performance. The GPU multicore organization represents Nvidia Tesla-like lightweight cores with heavy multithreading support and poor single-thread performance. For each multicore organization, we considered four topologies: symmetric, asymmetric, dynamic, and composed (fused).

Table 1 outlines the four topologies in the design space and the cores' roles during the serial and parallel portions of applications. Single-thread (ST) cores are uniprocessor-style cores with large caches, and many-thread (MT) cores are GPU-style cores with smaller caches.

Combining the device model with the core model provided power/performance and area/performance Pareto frontiers at future technology nodes. Any performance improvements for future cores will come only at the cost of area or power, as defined by these curves. Finally, combining all three models and performing an exhaustive design-space search produced the optimal multicore configuration and the maximum multicore speedups for each benchmark at future technology nodes while enforcing area, power, and benchmark constraints.

Figure 1. Overview of the methodology and models. By combining the device scaling model (DevM), core scaling model (CorM), and multicore scaling model (CmpM), we project performance speedup and reveal a gap between the projected speedup and the speedup expected with each technology generation, indicated as the dark silicon gap. The three-tier model also projects the percentage of dark silicon as technology scales.

Future directions


As the rest of the article will elaborate, we model an upper bound on parallel application performance available from multicore and CMOS scaling, assuming no major disruptions in process scaling or core efficiency. Using a constant area and power budget, this study shows that the space of known multicore designs (CPUs, GPUs, and their hybrids) or novel heterogeneous topologies (for example, dynamic or composable) falls far short of the historical performance gains our industry is accustomed to. Even with aggressive ITRS scaling projections, scaling cores achieves a geometric mean 7.9× speedup through 2024 at 8 nm. With conservative scaling, only a 3.7× geometric mean speedup is achievable at 8 nm. Furthermore, with ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

The article's findings and methodology are both significant and indicate that without process breakthroughs, directions beyond multicore are needed to provide performance scaling. For decades, Dennard scaling permitted more transistors, faster transistors, and more energy-efficient transistors with each new process node, which justified the enormous costs required to develop each new process node. Dennard scaling's failure led industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold. A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. Are we in a long-term multicore "era," or will industry need to move in different, perhaps radical, directions to justify the cost of scaling?

The glass is half-empty

A pessimistic interpretation of this study is that the performance improvements to which we have grown accustomed over the past 30 years are unlikely to continue with multicore scaling as the primary driver. The transition from multicore to a new approach is likely to be more disruptive than the transition to multicore and, to sustain the current cadence of Moore's law, must occur in only a few years. This period is much shorter than the traditional academic time frame required for research and technology transfer. Major architecture breakthroughs in "alternative" directions such as neuromorphic computing, quantum computing, or biointegration will require even more time to enter industry product cycles. Furthermore, while a slowing of Moore's law will obviously not be fatal, it has significant economic implications for the semiconductor industry.

The glass is half-full

If energy-efficiency breakthroughs are made on supply voltage and process scaling, the performance improvement potential is high for applications with very high degrees of parallelism.

Table 1. The four multicore topologies for CPU-like and GPU-like organizations. (ST core: single-thread core; MT core: many-thread core.)

| Multicore organization | Portion of code | Symmetric topology | Asymmetric topology | Dynamic topology | Composed topology |
|---|---|---|---|---|---|
| CPU multicore | Serial | 1 ST core | 1 large ST core | 1 large ST core | 1 large ST core |
| CPU multicore | Parallel | N ST cores | 1 large ST core + N small ST cores | N small ST cores | N small ST cores |
| GPU multicore | Serial | 1 MT core (1 thread) | 1 large ST core (1 thread) | 1 large ST core (1 thread) | 1 large ST core (1 thread) |
| GPU multicore | Parallel | N MT cores (multiple threads) | 1 large ST core (1 thread) + N small MT cores (multiple threads) | N small MT cores (multiple threads) | N small MT cores (multiple threads) |

Rethinking multicore's long-term potential

We hope that our quantitative findings trigger some analyses in both academia and industry on the long-term potential of the multicore strategy. Academia is now making a major investment in research focusing on multicore and its related problems of expressing and managing parallelism. Research projects assuming hundreds or thousands of capable cores should consider this model and the power requirements under various scaling projections before assuming that the cores will inevitably arrive. The paradigm shift toward multicores that started in the high-performance, general-purpose market has already percolated to mobile and embedded markets. The qualitative trends we predict and our modeling methodology hold true for all markets even though our study considers the high-end desktop market. This study's results could help break industry's current widespread consensus that multicore scaling is the viable forward path.

Model points to opportunities

Our study is based on a model that takes into account properties of devices, processor cores, multicore organization, and topology. Thus, the model inherently indicates where to focus innovation. To surpass the dark silicon performance barrier highlighted by our work, designers must develop systems that use significantly more energy-efficient techniques. Some examples include device abstractions beyond digital logic (error-prone devices); processing paradigms beyond superscalar, single instruction, multiple data (SIMD), and single instruction, multiple threads (SIMT); and program semantic abstractions allowing probabilistic and approximate computation. The results show that radical departures are needed, and the model shows quantitative ways to measure the impact of such techniques.

A case for microarchitecture innovation

Our study also shows that fundamental processing limitations emanate from the processor core. Clearly, architectures that move well past the power/performance Pareto-optimal frontier of today's designs are necessary to bridge the dark silicon gap and use transistor integration capacity. Thus, improvements to the core's efficiency will impact performance improvement and will enable technology scaling even though the core consumes only 20 percent of the power budget for an entire laptop, smartphone, or tablet. We believe this study will revitalize and trigger microarchitecture innovations, making the case for their urgency and potential impact.

A case for specialization

There is emerging consensus that specialization is a promising alternative for using transistors efficiently to improve performance. Our study serves as a quantitative motivation for such work's urgency and potential impact. Furthermore, our study shows quantitatively the levels of energy improvement that specialization techniques must deliver.

A case for complementing the core

Our study also shows that when performance becomes limited, techniques that occasionally use parts of the chip to deliver outcomes orthogonal to performance are ways to sustain the industry's economics. However, techniques that use the device integration capacity to improve security, programmer productivity, software maintainability, and so forth must consider energy efficiency as a primary factor.

Device scaling model (DevM)

The device model (DevM) provides transistor-area, power, and frequency-scaling factors from a base technology node (for example, 45 nm) to future technologies. The area-scaling factor corresponds to the shrinkage in transistor dimensions. The DevM model calculates the frequency-scaling factor based on the fanout-of-four (FO4) delay reduction. The model computes the power-scaling factor using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the equation $P = \alpha C V_{DD}^2 f$.

We generated two device scaling models: ITRS scaling and conservative scaling. The ITRS model uses projections from the 2010 ITRS. The conservative model is based on predictions presented by Borkar3 and represents a less optimistic view. Table 2 summarizes the parameters used for calculating the power- and performance-scaling factors. We allocated 20 percent of the chip power budget to leakage power and assumed chip designers can maintain this ratio.

Table 2. Scaling factors with International Technology Roadmap for Semiconductors (ITRS) and conservative projections. ITRS projections show an average 31 percent frequency increase and 35 percent power reduction per node, compared to an average 6 percent frequency increase and 23 percent power reduction per node for conservative projections.

| Device scaling model | Year | Technology node (nm) | Frequency scaling factor (vs. 45 nm) | VDD scaling factor (vs. 45 nm) | Capacitance scaling factor (vs. 45 nm) | Power scaling factor (vs. 45 nm) |
|---|---|---|---|---|---|---|
| ITRS scaling | 2010 | 45* | 1.00 | 1.00 | 1.00 | 1.00 |
| | 2012 | 32* | 1.09 | 0.93 | 0.70 | 0.66 |
| | 2015 | 22† | 2.38 | 0.84 | 0.33 | 0.54 |
| | 2018 | 16† | 3.21 | 0.75 | 0.21 | 0.38 |
| | 2021 | 11† | 4.17 | 0.68 | 0.13 | 0.25 |
| | 2024 | 8† | 3.85 | 0.62 | 0.08 | 0.12 |
| Conservative scaling | 2008 | 45 | 1.00 | 1.00 | 1.00 | 1.00 |
| | 2010 | 32 | 1.10 | 0.93 | 0.75 | 0.71 |
| | 2012 | 22 | 1.19 | 0.88 | 0.56 | 0.52 |
| | 2014 | 16 | 1.25 | 0.86 | 0.42 | 0.39 |
| | 2016 | 11 | 1.30 | 0.84 | 0.32 | 0.29 |
| | 2018 | 8 | 1.34 | 0.84 | 0.24 | 0.22 |

(* Extended planar bulk transistors; † multi-gate transistors.)
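To make the power-scaling arithmetic concrete, the short sketch below (our illustration, not the authors' released model; the name `power_scaling_factor` is ours) applies the $P = \alpha C V_{DD}^2 f$ relation to the ITRS frequency, voltage, and capacitance factors from Table 2:

```python
# A minimal sketch (not the paper's released model): derive DevM power-scaling
# factors from the frequency, V_DD, and capacitance factors via P = a*C*V^2*f.
# The factors below are the ITRS rows of Table 2, normalized to 45 nm.

ITRS_FACTORS = {
    # node (nm): (frequency factor, VDD factor, capacitance factor)
    45: (1.00, 1.00, 1.00),
    32: (1.09, 0.93, 0.70),
    22: (2.38, 0.84, 0.33),
    16: (3.21, 0.75, 0.21),
    11: (4.17, 0.68, 0.13),
    8:  (3.85, 0.62, 0.08),
}

def power_scaling_factor(freq_f, vdd_f, cap_f):
    """With the activity factor alpha held constant, P = alpha*C*V_DD^2*f
    scales by the product of the three normalized factors."""
    return cap_f * vdd_f ** 2 * freq_f

for node, (freq_f, vdd_f, cap_f) in ITRS_FACTORS.items():
    print(f"{node:2d} nm: power factor {power_scaling_factor(freq_f, vdd_f, cap_f):.2f}")
```

Running this reproduces the 0.66, 0.54, 0.38, 0.25, and 0.12 power factors in Table 2's ITRS rows, which doubles as a consistency check on the table.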


Core scaling model (CorM)

We built the technology-scalable core model (CorM) by populating the area/performance and power/performance design spaces with data collected for a set of processors, all fabricated in the same technology node. The core model is the combination of the area/performance Pareto frontier, A(q), and the power/performance Pareto frontier, P(q), for these two design spaces, where q is a core's single-threaded performance. These frontiers capture the optimal area/performance and power/performance tradeoffs for a core while abstracting away specific details of the core.

As Figure 2 shows, we populated the two design spaces at 45 nm using 20 representative Intel and Advanced Micro Devices (AMD) processors and derived the Pareto frontiers. The Pareto frontier is the curve that bounds all power/performance (area/performance) points in the design space and indicates the minimum amount of power (area) required for a given performance level. The P(q) and A(q) pair, which are polynomial equations, constitute the core model. The core performance (q) is the processor's SPECmark, collected from the SPEC website (http://www.spec.org). We estimated the core power budget using the thermal design power (TDP) reported in processor datasheets. The TDP is the chip power budget, the amount of power the chip can dissipate without exceeding the transistor junction temperature. After excluding the uncore components' share from the power budget, we divided the power budget allocated to the cores by the number of cores to estimate the core power budget. We used die photos of the four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) to estimate the core areas, excluding Level-2 (L2) and Level-3 (L3) caches. Because this work's focus is to study the impact of technology constraints on logic scaling rather than cache scaling, we derived the Pareto frontiers using only the portion of the power budget and area allocated to the core in each processor, excluding the uncore components' share.

As Figure 2 illustrates, we fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space, and a quadratic polynomial (Pollack's rule4), A(q), to the points along the edge of the area/performance design space.


The Intel Atom Z520, with an estimated 1.89 W core TDP, represents the lowest-power design (lower-left frontier point), and the Nehalem-based Intel Core i7-965 Extreme Edition, with an estimated 31.25 W core TDP, represents the highest-performing design (upper-right frontier point). We used the points along the scaled Pareto frontier as the search space for determining the best core configuration in the multicore scaling model.

Figure 2. Design space and the derived Pareto frontiers at 45 nm, populated with Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem processors. Power/performance frontier (a): P(q) = 0.0002q³ + 0.0009q² + 0.3859q − 0.0301. Area/performance frontier (b): A(q) = 0.0152q² + 0.0265q + 7.4393.
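The frontier-derivation step can be sketched in a few lines. This is an illustration under simplifying assumptions, not the paper's actual pipeline: the `(q, power)` pairs are hypothetical stand-ins for the 20 measured processors, and `pareto_frontier` is our name for the dominance filter.

```python
import numpy as np

# Hypothetical (SPECmark q, core power W) pairs standing in for the 20
# measured processors; the real study uses datasheet TDPs and die photos.
points = [(5.0, 1.9), (8.0, 4.5), (12.0, 6.0), (14.0, 9.0),
          (20.0, 10.5), (24.0, 17.0), (28.0, 16.5), (35.0, 31.3)]

def pareto_frontier(pts):
    """Keep the non-dominated points: those for which no other design reaches
    at least the same performance with strictly less power."""
    return sorted((q, p) for q, p in pts
                  if not any(q2 >= q and p2 < p
                             for q2, p2 in pts if (q2, p2) != (q, p)))

q, p = np.array(pareto_frontier(points)).T
P = np.poly1d(np.polyfit(q, p, deg=3))  # cubic P(q), as in Figure 2a
print(P)  # minimum power needed to sustain single-thread performance q
# The quadratic area frontier A(q) is fit the same way from (q, area) pairs.
```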

Multicore scaling model (CmpM)

We developed a detailed chip-level model (CmpM) that integrates the area and power frontiers, microarchitectural features, and application behavior, while accounting for the chip organization (CPU-like or GPU-like) and its topology (symmetric, asymmetric, dynamic, or composed). Guz et al. proposed a model for studying the first-order impacts of microarchitectural features (cache organization, memory bandwidth, threads per core, and so forth) and workload behavior (memory access patterns).5 Their model considers stalls due to memory dependences and resource constraints (bandwidth or functional units). We extended their approach to build our multicore model. Our extensions incorporate additional application behaviors, microarchitectural features, and physical constraints, and cover both homogeneous and heterogeneous multicore topologies.

Using this model, we consider single-threaded cores with large caches to cover the CPU multicore design space and massively threaded cores with minimal caches to cover the GPU multicore design space across all four topologies, as described in Table 1. Table 3 lists the input parameters to the model and how the multicore design choices impact them, if at all.

Microarchitectural features

Equation 1 calculates the multithreaded performance (Perf) of either a CPU-like or GPU-like multicore organization running a fully parallel (f = 1), multithreaded application, in terms of instructions per second, by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPIexe:

$$\mathit{Perf} = \min\!\left(N\,\frac{\mathit{freq}}{\mathit{CPI}_{exe}}\,\eta,\;\; \frac{BW_{max}}{r_m\, m_{L1}\, m_{L2}\, b}\right) \qquad (1)$$

The CPIexe parameter does not include stalls due to cache accesses, which are considered separately in the core utilization (η). The core utilization (η) is the fraction of time that a thread running on the core can keep it busy.


It is modeled as a function of the average time spent waiting for each memory access (t), the fraction of instructions that access memory (r_m), and CPIexe:

$$\eta = \min\!\left(1,\; \frac{T}{1 + t\,r_m/\mathit{CPI}_{exe}}\right) \qquad (2)$$

The average time spent waiting for memory accesses (t) is a function of the time to access the caches (t_L1 and t_L2), the time to visit memory (t_mem), and the predicted cache miss rates (m_L1 and m_L2):

$$t = (1 - m_{L1})\,t_{L1} + m_{L1}(1 - m_{L2})\,t_{L2} + m_{L1}\,m_{L2}\,t_{mem} \qquad (3)$$

$$m_{L1} = \left(\frac{C_{L1}}{T\,\beta_{L1}}\right)^{1-\alpha_{L1}} \quad\text{and}\quad m_{L2} = \left(\frac{C_{L2}}{T\,\beta_{L2}}\right)^{1-\alpha_{L2}} \qquad (4)$$
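For readers who want to experiment, Equations 1 through 4 transcribe directly into code. The sketch below is our transcription, with parameter names following Table 3; the clamping of the miss rates to [0, 1] is an assumption we add for numerical safety.

```python
# Direct transcription of Equations 1-4 (our sketch; names follow Table 3).

def miss_rate(cache_size, T, alpha, beta):
    """Eq. 4: m = (C / (T * beta)) ** (1 - alpha), clamped to [0, 1]
    (the clamp is our addition, not part of the paper's equations)."""
    return min(1.0, (cache_size / (T * beta)) ** (1.0 - alpha))

def avg_memory_wait(m_l1, m_l2, t_l1=3, t_l2=20, t_mem=426):
    """Eq. 3: expected cycles per memory access through the cache hierarchy."""
    return (1 - m_l1) * t_l1 + m_l1 * (1 - m_l2) * t_l2 + m_l1 * m_l2 * t_mem

def core_utilization(T, t, r_m, cpi_exe):
    """Eq. 2: fraction of time T threads can keep one core busy."""
    return min(1.0, T / (1.0 + t * r_m / cpi_exe))

def multicore_perf(N, freq, cpi_exe, eta, r_m, m_l1, m_l2,
                   bw_max=200e9, b=64):
    """Eq. 1: the lesser of the compute-limited and bandwidth-limited
    instruction rates (instructions per second)."""
    compute_limited = N * (freq / cpi_exe) * eta
    bandwidth_limited = bw_max / (r_m * m_l1 * m_l2 * b)
    return min(compute_limited, bandwidth_limited)
```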

Multicore topologies

The multicore model is an extended Amdahl's law6 equation that incorporates the multicore performance (Perf) calculated from Equations 1 through 4:

$$\mathit{Speedup} = \frac{1}{\dfrac{f}{S_{parallel}} + \dfrac{1-f}{S_{serial}}} \qquad (5)$$

The CmpM model (Equation 5) measures the multicore speedup with respect to a baseline multicore (Perf_B). That is, the parallel portion of code (f) is sped up by S_parallel = Perf_P/Perf_B, and the serial portion of code (1 − f) is sped up by S_serial = Perf_S/Perf_B.

We calculated the number of cores that can fit on the chip based on the multicore's topology, area budget (AREA), power budget (TDP), and each core's area A(q) and power P(q):

$$N_{Symm}(q) = \min\!\left(\frac{\mathit{AREA}}{A(q)},\; \frac{\mathit{TDP}}{P(q)}\right)$$

$$N_{Asym}(q_L, q_S) = \min\!\left(\frac{\mathit{AREA} - A(q_L)}{A(q_S)},\; \frac{\mathit{TDP} - P(q_L)}{P(q_S)}\right)$$

$$N_{Dynm}(q_L, q_S) = \min\!\left(\frac{\mathit{AREA} - A(q_L)}{A(q_S)},\; \frac{\mathit{TDP}}{P(q_S)}\right)$$

$$N_{Comp}(q_L, q_S) = \min\!\left(\frac{\mathit{AREA}}{(1+\tau)\,A(q_S)},\; \frac{\mathit{TDP}}{P(q_S)}\right)$$

For heterogeneous multicores, q_S is the single-threaded performance of the small cores and q_L is the large core's single-threaded performance. The area overhead of supporting composability is τ, while no power overhead is assumed for composability support.

Table 3. CmpM parameters with default values from the 45-nm Nehalem.

| Parameter | Description | Default | Impacted by |
|---|---|---|---|
| N | Number of cores | 4 | Multicore topology |
| T | Number of threads per core | 1 | Core style |
| freq | Core frequency (MHz) | 3,200 | Core performance |
| CPIexe | Cycles per instruction (zero-latency cache accesses) | 1 | Core performance, application |
| C_L1 | L1 cache size per core (Kbytes) | 64 | Core style |
| C_L2 | L2 cache size per chip (Mbytes) | 2 | Core style, multicore topology |
| t_L1 | L1 access time (cycles) | 3 | N/A |
| t_L2 | L2 access time (cycles) | 20 | N/A |
| t_mem | Memory access time (cycles) | 426 | Core performance |
| BW_max | Maximum memory bandwidth (Gbytes/s) | 200 | Technology node |
| b | Bytes per memory access | 64 | N/A |
| f | Fraction of code that can be parallel | Varies | Application |
| r_m | Fraction of instructions that are memory accesses | Varies | Application |
| α_L1, β_L1 | L1 cache miss rate function constants | Varies | Application |
| α_L2, β_L2 | L2 cache miss rate function constants | Varies | Application |

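As a sketch of how the core-count formulas and Equation 5 combine, the snippet below encodes the four topology budgets and the extended Amdahl's law. Here `A` and `P` stand for the fitted Pareto polynomials, the 111 mm² and 125 W budgets come from the Model implementation section below, and taking the floor of each count (cores are integral) is our assumption.

```python
import math

AREA, TDP = 111.0, 125.0   # core-only area (mm^2) and power (W) budgets

def n_symmetric(A, P, q):
    return math.floor(min(AREA / A(q), TDP / P(q)))

def n_asymmetric(A, P, qL, qS):
    return math.floor(min((AREA - A(qL)) / A(qS), (TDP - P(qL)) / P(qS)))

def n_dynamic(A, P, qL, qS):
    return math.floor(min((AREA - A(qL)) / A(qS), TDP / P(qS)))

def n_composed(A, P, qL, qS, tau):
    return math.floor(min(AREA / ((1 + tau) * A(qS)), TDP / P(qS)))

def speedup(f, perf_parallel, perf_serial, perf_baseline):
    """Eq. 5: extended Amdahl's law relative to the baseline multicore."""
    s_par = perf_parallel / perf_baseline
    s_ser = perf_serial / perf_baseline
    return 1.0 / (f / s_par + (1.0 - f) / s_ser)
```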

Model implementation

One of the contributions of this work is the incorporation of Pareto frontiers, physical constraints, real application behavior, and realistic microarchitectural features into the multicore speedup projections.

The input parameters that characterize an application are its cache behavior, its fraction of instructions that are loads or stores, and its fraction of parallel code. For the PARSEC benchmarks, we obtained this data from two previous studies.7,8 To obtain the fraction of parallel code (f) for each benchmark, we fit an Amdahl's law-based curve to the reported speedups across different numbers of cores from both studies. This fit shows values of f between 0.75 and 0.9999 for the individual benchmarks.
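One way to realize this fit (a sketch, not the authors' script; the speedup numbers below are made up) is a bounded least-squares fit of the Amdahl curve to published speedup-versus-core-count data:

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl_speedup(n, f):
    """Amdahl's law: speedup on n cores with parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / n)

cores = np.array([1, 2, 4, 8, 16])
reported = np.array([1.0, 1.9, 3.4, 5.8, 8.9])   # hypothetical measurements

(f_hat,), _ = curve_fit(amdahl_speedup, cores, reported,
                        p0=[0.9], bounds=(0.0, 1.0))
print(f"estimated parallel fraction f = {f_hat:.4f}")   # ~0.95 for this data
```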

To incorporate the Pareto-optimal curves into the CmpM model, we converted the SPECmark scores (q) into an estimated CPIexe and core frequency. We assumed that core frequency scales linearly with performance, from 1.5 GHz for an Atom core to 3.2 GHz for a Nehalem core. Each application's CPIexe depends on its instruction mix and its use of hardware optimizations such as functional units and out-of-order processing. Because the measured CPIexe for each benchmark at each technology node is not available, we used the CmpM model to generate per-benchmark CPIexe estimates for each design point along the Pareto frontier. With all other model inputs kept constant, we iteratively searched for the CPIexe at each processor design point. We started by assuming that the Nehalem core has a CPIexe of 1. Then, the smallest core, an Atom processor, should have a CPIexe such that the ratio of its CmpM performance to the Nehalem core's CmpM performance is the same as the ratio of their SPECmark scores (q). We assumed that the CPIexe does not change with technology node, while frequency scales.
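The iterative search lends itself to simple bisection, since modeled performance falls as CPIexe grows. The sketch below is our reading of the procedure, not the paper's code; `single_core_perf` stands for an evaluation of the CmpM equations for one core with every other input held fixed, and the bracket [0.1, 32] is an arbitrary choice.

```python
def calibrate_cpi_exe(single_core_perf, q_ratio, perf_nehalem,
                      lo=0.1, hi=32.0, iters=60):
    """Bisect on CPIexe until the modeled performance ratio of this core to
    the Nehalem core matches the SPECmark ratio q/q_nehalem. Assumes
    single_core_perf(cpi_exe) decreases monotonically in cpi_exe."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if single_core_perf(mid) / perf_nehalem > q_ratio:
            lo = mid   # modeled core still too fast; raise CPIexe
        else:
            hi = mid
    return 0.5 * (lo + hi)
```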

A key component of the detailed model is the set of input parameters modeling the cores' microarchitecture. For single-thread cores, we assumed that each core has a 64-Kbyte L1 cache, and chips with only single-thread cores have an L2 cache occupying 30 percent of the chip area. MT cores have small L1 caches (32 Kbytes for every eight cores), support multiple hardware contexts (1,024 threads per eight cores), a thread register file, and no L2 cache. From Atom and Tesla die photos, we estimated that eight small many-thread cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor. We assumed that off-chip bandwidth (BWmax) increases linearly as process technology scales down, while the memory access time is constant.

We assumed that τ increases from 10 percent up to 400 percent, depending on the composed core's total area. The composed core's performance cannot exceed the performance of a single Nehalem core at 45 nm.

We derived the area and power budgets from the same quad-core Nehalem multicore at 45 nm, excluding the L2 and L3 caches. They are 111 mm² and 125 W, respectively. The reported dark silicon projections are for the area budget that is solely allocated to the cores, not caches and other uncore components. The CmpM speedup baseline is a quad-Nehalem multicore.

Combining models

Our three-tier modeling approach allows us to exhaustively explore the design space of future multicores, project their upper-bound performance, and estimate the amount of integration capacity underutilization, or dark silicon.

Device × core model

To study core scaling at future technology nodes, we scaled the 45-nm Pareto frontiers down to 8 nm by scaling each processor data point's power and performance using the DevM model and then refitting the Pareto-optimal curves at each technology node. We assumed that performance, which we measured in SPECmark, would scale linearly with frequency. This assumption ignores the effects of memory latency and bandwidth on core performance, so actual performance gains through scaling could be lower. Based on the optimistic ITRS model, scaling a microarchitecture (core) from 45 nm to 8 nm will result in a 3.9× performance improvement and an 88 percent reduction in power consumption. Conservative scaling, however, suggests that performance will increase by only 34 percent and that power will decrease by 74 percent.
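This scale-and-refit step is easy to mimic. The sketch below is an illustration under stated assumptions, not the released model: the input points are taken to be already on the 45-nm frontier (uniform scaling preserves dominance, so the frontier set does not change), performance is assumed linear in frequency as in the text, and the factors come from Table 2.

```python
import numpy as np

def scale_and_refit(frontier_45nm, freq_factor, power_factor, deg=3):
    """Scale each (q, power) frontier point by the node's DevM factors and
    refit the power/performance polynomial at the new technology node."""
    scaled = [(q * freq_factor, p * power_factor) for q, p in frontier_45nm]
    q, p = np.array(scaled).T
    return np.poly1d(np.polyfit(q, p, deg))

# Example: ITRS 45 nm -> 8 nm uses factors 3.85 (frequency) and 0.12 (power)
# from Table 2: P_8nm = scale_and_refit(frontier_45nm, 3.85, 0.12).
```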

Device × core × multicore model

We combined all three models to produce final projections for the optimal multicore speedup, number of cores, and amount of dark silicon. To determine the best multicore configuration at each technology node, we swept the design points along the scaled area/performance and power/performance Pareto frontiers (DevM × CorM), because these points represent the most efficient designs. For each core design, we constructed a multicore consisting of one such core at each technology node. For a symmetric multicore, we iteratively added identical cores one by one until we hit the area or power budget, or until the performance improvement saturated. We swept the frontier and constructed a symmetric multicore for each processor design point; from this set, we picked the multicore with the best speedup as the optimal symmetric multicore for that technology node. The procedure is similar for the other topologies, and we performed it separately for CPU-like and GPU-like organizations. The amount of dark silicon is the difference between the area occupied by the optimal multicore's cores and the area budget allocated to cores.
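The symmetric-topology sweep reduces to a small loop. In this sketch (our reconstruction, not the released code), `frontier` holds (q, area, power) design points from DevM × CorM and `cmpm_speedup` stands for an evaluation of Equations 1 through 5 for one benchmark:

```python
def best_symmetric_multicore(frontier, cmpm_speedup,
                             area_budget=111.0, power_budget=125.0):
    """Sweep the scaled Pareto frontier; for each core design, build the
    largest symmetric multicore the budgets allow, and keep the best."""
    best_speedup, best_config = 0.0, None
    for q, area, power in frontier:
        n = int(min(area_budget / area, power_budget / power))
        if n < 1:
            continue  # this core design does not fit the budgets
        s = cmpm_speedup(q, n)
        if s > best_speedup:
            best_speedup, best_config = s, (q, n)
    return best_speedup, best_config

# Dark silicon for the winner is area_budget minus n times the chosen core's area.
```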

Scaling and future multicores

We used the combined models to study the future of multicore designs and their performance-limiting factors. The results provide a detailed analysis of multicore behavior for 12 real applications from the PARSEC suite.

Speedup projections

Figure 3 summarizes all of the speedup projections in a single scatter plot. For every benchmark at each technology node, we plot the speedup of eight possible multicore configurations: (CPU-like or GPU-like) × (symmetric, asymmetric, dynamic, or composed). The exponential performance curve matches transistor count growth as process technology scales.

Figure 3. Speedup across process technology nodes across all organizations and topologies with PARSEC benchmarks. The exponential performance curve matches transistor count growth. Conservative scaling (a); ITRS scaling (b).

Finding: With optimal multicore configurations for each individual application, at 8 nm, only a 3.7× (conservative scaling) or 7.9× (ITRS scaling) geometric mean speedup is possible, as shown by the dashed line in Figure 3.

Finding: Highly parallel workloads with a degree of parallelism higher than 99 percent will continue to benefit from multicore scaling.

Finding: At 8 nm, the geometric mean speedup for dynamic and composed topologies is only 10 percent higher than the geometric mean speedup for symmetric topologies.

Dark silicon projections

To understand whether parallelism or the power budget is the primary source of the dark silicon speedup gap, we varied each of these factors in two experiments at 8 nm. First, we kept the power budget constant (our default budget is 125 W) and varied the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming that programmer effort can realize this improvement. Performance improved slowly as the parallelism level increased, with most benchmarks reaching a speedup of only about 15× at 99 percent parallelism. Provided that the power budget is the only limiting factor, typical upper-bound ITRS-scaling speedups will still be limited to 15×. With conservative scaling, this best-case speedup is limited to 6.3×.

For the second experiment, we kept each application's parallelism at its real level and varied the power budget from 50 W to 500 W. Eight of the 12 benchmarks showed no more than 10× speedup even with a practically unlimited power budget. In other words, increasing core counts beyond a certain point did not improve performance because of the limited parallelism in the applications and Amdahl's law. Only four benchmarks have sufficient parallelism to even hypothetically sustain speedup levels that match the exponential transistor count growth of Moore's law.


Finding: With ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.

Finding: The level of parallelism in the PARSEC applications is the primary contributor to the dark silicon speedup gap. However, in realistic settings, the dark silicon resulting from power constraints limits the achievable speedup.

Core count projections

Different applications saturate performance improvements at different core counts. We considered the chip configuration that provided the best speedups for all applications to be the ideal configuration. Figure 4 shows the number of cores (solid line) for the ideal CPU-like dynamic multicore configuration across technology generations, because dynamic configurations performed best. The dashed line illustrates the number of cores required to achieve 90 percent of the ideal configuration's geometric mean speedup across the PARSEC benchmarks. As depicted, with ITRS scaling, the ideal configuration integrates 442 cores at 8 nm; however, 35 cores reach 90 percent of the speedup achievable by 442 cores. With conservative scaling, the 90 percent speedup core count is 20 at 8 nm.

Figure 4. Number of cores for the ideal CPU-like dynamic multicore configurations and the number of cores delivering 90 percent of the speedup achievable by the ideal configurations across the PARSEC benchmarks. Conservative scaling (a); ITRS scaling (b).

Finding: Due to limited parallelism in the PARSEC benchmark suite, even with novel heterogeneous topologies and optimistic ITRS scaling, integrating more than 35 cores improves performance only slightly for CPU-like topologies.

Sensitivity studies

We performed sensitivity studies on the impact of various features, including L2 cache sizes, memory bandwidth, simultaneous multithreading (SMT) support, and the percentage of total power allocated to leakage. Quantitatively, these studies show that these features have limited impact on multicore performance.

Limitations

Our device and core models do not explicitly consider dynamic voltage and frequency scaling (DVFS). Instead, we take an optimistic approach to account for its best-case impact.


When deriving the Pareto frontiers, we assume that each processor data point operates at its optimal voltage and frequency setting ($V_{DD_{min}}$, $\mathit{Freq}_{max}$). At a fixed $V_{DD}$ setting, scaling down the frequency from $\mathit{Freq}_{max}$ results in a power/performance point inside the optimal Pareto curve, which is a suboptimal design point. However, scaling voltage up and operating at a new ($V'_{DD_{min}}$, $\mathit{Freq}'_{max}$) setting results in a different power/performance point that is still on the optimal frontier. Because we investigate all of the points along the frontier to find the optimal multicore configuration, our study covers multicore designs that introduce heterogeneity to symmetric topologies through DVFS. The multicore model considers the first-order impact of caching, parallelism, and threading under assumptions that result only in optimistic projections. Comparing the CmpM model's output against published empirical results confirms that our model always overpredicts multicore performance. The model optimistically assumes that the workload is homogeneous; that work is infinitely parallel during parallel sections of code; that memory accesses never stall due to a previous access; and that no thread synchronization, operating system serialization, or swapping occurs.

This work makes two key contributions: it projects multicore speedup limits and quantifies the dark silicon effect, and it provides a novel, extensible model that integrates device scaling trends, core design tradeoffs, and multicore configurations. While abstracting away many details, the model can find optimal configurations and project performance for CPU- and GPU-style multicores while considering microarchitectural features and high-level application properties. We have made the model publicly available at http://research.cs.wisc.edu/vertical/DarkSilicon. We believe this study makes the case for innovation's urgency and its potential for high impact, while providing a model that researchers and engineers can adopt as a tool to study the limits of their solutions.

Acknowledgments

We thank Shekhar Borkar for sharing his personal views on how CMOS devices are likely to scale. Support for this research was provided by the NSF under grants CCF-0845751, CCF-0917238, and CNS-0917213.


References

1. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 56-59.
2. R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
3. S. Borkar, "The Exascale Challenge," Proc. Int'l Symp. VLSI Design, Automation and Test (VLSI-DAT 10), IEEE CS, 2010, pp. 2-3.
4. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 32), IEEE CS, 1999, p. 2.
5. Z. Guz et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25-28.
6. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. Joint Computer Conf. American Federation of Information Processing Societies (AFIPS 67), ACM, 1967, doi:10.1145/1465482.1465560.
7. M. Bhadauria, V. Weaver, and S. McKee, "Understanding PARSEC Performance on Contemporary CMPs," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 09), IEEE CS, 2009, pp. 98-107.
8. C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM, 2008, pp. 72-81.

Hadi Esmaeilzadeh is a PhD student in the Department of Computer Science and Engineering at the University of Washington. His research interests include power-efficient architectures, approximate general-purpose computing, mixed-signal architectures, machine learning, and compilers. Esmaeilzadeh has an MS in computer science from the University of Texas at Austin and an MS in electrical and computer engineering from the University of Tehran.

Emily Blem is a PhD student in the Department of Computer Sciences at the University of Wisconsin-Madison. Her research interests include energy and performance tradeoffs in computer architecture and quantifying them using analytic performance modeling. Blem has an MS in computer science from the University of Wisconsin-Madison.

Renée St. Amant is a PhD student in the Department of Computer Science at the University of Texas at Austin. Her research interests include computer architecture, low-power microarchitectures, mixed-signal approximate computation, new computing technologies, and storage design for approximate computing. St. Amant has an MS in computer science from the University of Texas at Austin.

Karthikeyan Sankaralingam is an assistant professor in the Department of Computer Sciences at the University of Wisconsin-Madison, where he leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very-large-scale integration (VLSI). Sankaralingam has a PhD in computer science from the University of Texas at Austin.

Doug Burger is the director of client and cloud applications at Microsoft Research, where he manages multiple strategic research projects covering new user interfaces, datacenter specialization, cloud architectures, and platforms that support personalized online services. Burger has a PhD in computer science from the University of Wisconsin. He is a fellow of IEEE and the ACM.

Direct questions and comments about this article to Hadi Esmaeilzadeh, University of Washington, Computer Science & Engineering, Box 352350, AC 101, 185 Stevens Way, Seattle, WA 98195; [email protected].
