Exploiting Interposer Technologies to Disintegrate … enright/TopPicks2016.pdf · 2017-06-08


EXPLOITING INTERPOSER TECHNOLOGIES TO DISINTEGRATE AND REINTEGRATE MULTICORE PROCESSORS

THE AUTHORS EXPLOIT A SILICON INTERPOSER TO DISINTEGRATE A MULTICORE CPU CHIP INTO SMALLER CHIPS THAT COLLECTIVELY COST LESS TO MANUFACTURE THAN A SINGLE LARGE CHIP. THEY STUDY THE PERFORMANCE-COST TRADEOFFS OF INTERPOSER-BASED, MULTICHIP, MULTICORE SYSTEMS AND PROPOSE INTERPOSER NETWORK-ON-CHIP ORGANIZATIONS TO MITIGATE THE PERFORMANCE CHALLENGES WHILE PRESERVING THE COST BENEFITS. THIS ARTICLE ALSO PAVES THE WAY FOR VARIOUS RESEARCH PROBLEMS IN INTERPOSER-BASED DISINTEGRATED SYSTEMS.

Moore's law has conventionally enabled increasing integration; however, fundamental physical limitations have slowed the rate of transition from one technology node to the next, and the costs of new fabrication facilities are skyrocketing. The maturation of die stacking enables the continued integration of system components in traditionally incompatible processes. A key application of die stacking is silicon-interposer-based integration of multiple 3D stacks of DRAM, shown in Figure 1a,1–4 potentially providing several gigabytes of in-package memory with bandwidths already starting at 128 Gbytes per second per stack.

The use of an interposer presents new opportunities; in particular, if one has already paid for the interposer for the purposes of memory integration, any additional benefits from exploiting the interposer could come at a relatively small incremental cost. We study how the interposer can be used to address (at least in part) the increasing costs of manufacturing chips in a leading-edge process technology. We propose to use the interposer to "disintegrate" a large multicore chip into several small chips, such as in Figure 1b. These chips are less expensive to manufacture because of a combination of higher yield and better packing of the rectangular die on a round wafer. Unfortunately, this approach fragments the network on chip (NoC) such that each chip contains only a piece of the overall network, and communications between chips must take additional hops through the interposer. We explore how to disintegrate a multicore processor on an interposer while addressing the problem of a fragmented NoC.

Ajaykumar Kannan
Natalie Enright Jerger
University of Toronto

Gabriel H. Loh
Advanced Micro Devices

84 Published by the IEEE Computer Society 0272-1732/16/$31.00 © 2016 IEEE

Background and Motivation

To mitigate the rising costs of large multicores, systems can comprise multiple smaller chips. Various options exist for integrating multiple chips, including multisocket symmetric multiprocessors (SMPs), multichip modules (MCMs), 3D die stacking, and silicon interposers. The SMP and MCM approaches are less desirable because they do not provide adequate bandwidth for arbitrary core-to-core cache coherence without exposing significant nonuniform memory access effects. 3D stacking by itself is a less-attractive solution at this time because it is more expensive and complicated, introduces potential thermal issues, and can be overkill in terms of how much bandwidth it can provide. This leaves us with silicon interposers. Having settled on silicon interposers as our integration technology, the key contributions of this work include a cost and yield analysis to show the benefits of disintegration, the potential for minimally active interposers, novel NoC topologies, and the concept of misaligned NoCs.

Cost and Yield

To reduce the cost of manufacturing a system on a chip (SoC), we can reduce the chip's size. A larger chip's cost comes from two main sources. The first is geometry: fewer larger chips fit on a wafer, and the smaller chips can be packed more tightly (that is, utilizing more of the area around the periphery of the wafer). The second cost of a larger chip is due to manufacturing defects. A defect that renders a large die inoperable wastes more silicon than one that kills a smaller die. Although smaller chips result in lower costs, the downside is that they also provide less functionality (for example, half the area yields half the cores). If we could manufacture several smaller chips and combine them together into a single system, we would be able to have the functionality of a larger chip while maintaining the economic advantages of the smaller chips. Ignoring for the time being exactly how multiple chips could be combined back together, Table 1 summarizes the impact of implementing a 64-core system ranging from a conventional 64-core monolithic chip all the way down to building it from 16 separate quad-core chips. The last column shows that using a collection of quad-core chips to assemble a 64-core SoC yields 29 percent more working parts than the monolithic-die approach.

Additional performance benefits can be had by speed-binning the individual chips to ensure that fast chips are stacked together on the same interposer to maximize the number of high-performance, high-margin parts produced. We used Monte Carlo simulations to consider three scenarios:

• A 300-mm wafer is used to implement 162 monolithic good dies per wafer (as per Table 1).

• The wafer is used to implement 3,353 quad-core chips, which are then assembled without speed-binning into 209 64-core systems.

• The individual dies from the same wafer are sorted so that the 16 fastest chips are assembled together, the next fastest 16 are combined, and so on.

Figure 2 shows the number of 64-core systems per wafer in 100 MHz speed bins, averaged across 100 wafer samples per scenario. The monolithic 64-core chip and 16 quad-core approaches have similar speed distributions. However, with speed-binning, we avoid the situation wherein the overall system speed is slowed down by the presence of a single slow chip, resulting in significantly faster average system speeds (the mean shifts by approximately 400 MHz) and more systems in the highest speed bins (which usually carry the highest profit margins).

Figure 1. Potential organizations for interposer-based systems. (a) An example interposer-based system integrating a 64-core processor chip with four 3D stacks of DRAM. (b) A 64-core system composed of four 16-core processor chips.

MAY/JUNE 2016 85
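The binning experiment described above can be sketched in a few lines of Python. The distribution parameters (mean 2,400 MHz, standard deviation 250 MHz) and the count of 3,353 good quad-core dies per wafer come from this article's methodology and Table 1; everything else (function names, the fixed seed) is illustrative, and the printed means will vary with the seed rather than reproduce the article's exact figures.

```python
import random
import statistics

def sample_speeds(n, mean=2400.0, sd=250.0, rng=random):
    """Per-chip clock speeds (MHz) drawn from the article's assumed
    normal distribution (mean 2,400 MHz, std. dev. 250 MHz)."""
    return [rng.gauss(mean, sd) for _ in range(n)]

def assemble(speeds, chips_per_system=16, bin_first=False):
    """Group quad-core chips into 64-core systems; each system runs at
    the speed of its slowest chip. bin_first models speed binning
    (sorting the wafer's chips so similar speeds are packaged together)."""
    s = sorted(speeds, reverse=True) if bin_first else list(speeds)
    return [min(s[i:i + chips_per_system])
            for i in range(0, len(s) - chips_per_system + 1, chips_per_system)]

rng = random.Random(42)
chips = sample_speeds(3353, rng=rng)   # good quad-core dies per wafer (Table 1)
unsorted_sys = assemble(chips)
sorted_sys = assemble(chips, bin_first=True)
print(f"unbinned mean: {statistics.fmean(unsorted_sys):.0f} MHz, "
      f"binned mean: {statistics.fmean(sorted_sys):.0f} MHz")
```

With binning, the slow outliers are concentrated into a few slow systems instead of dragging down one system each, which is why the mean system speed rises by several hundred megahertz.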

Minimally Active Interposers

The interposer is effectively a very large chip. Current approaches use passive interposers2,5 in which the interposer has no devices, only routing. This greatly reduces the interposer's critical area (Acrit); that is, unless a defect impacts a metal route, there are no transistors that can be affected, resulting in high yields. Based on our previous chip cost analysis, it would seem prohibitively expensive to consider an active interposer, but the flexibility of placing routers on the interposer enables many more interesting NoC organizations. Although the interposer could be implemented in an older process technology6 that is less expensive and more mature (that is, one with lower defect rates), such a large chip, perhaps near the reticle limit, still would not be expected to be less expensive if the yield rates remained low.

For regular chips, it typically is desirable to maximize functionality by cramming as many transistors as possible into the chip-area budget. However, making use of every last mm² of the interposer would lead to a very high fraction of the area being critical (Fraccrit) multiplied over a very large area (Acrit = chip area × Fraccrit), thereby leading to low yields and high costs. However, there is no need to use the entire interposer: its size is determined by the geometry of the chips and memory stacked on it, and using more or fewer devices has no impact on its final size. As such, we advocate using a minimally active interposer, which implements

Figure 2. Average number of 64-core SoCs per wafer per 100 MHz bin from Monte Carlo simulations of 100 wafers. Results are shown for a monolithic 64-core chip and systems with 16 quad-core chips. Using multiple chips allows speed binning, leading to higher clock rates as shown in the sorted case.

Table 1. Example yield analysis for different-sized multicore chips. A system on a chip (SoC) here is a 64-core system, which might require combining multiple chips for the rows in the table corresponding to chips with fewer than 64 cores each.

Cores per chip | Chips per wafer | Chips per package | Area per chip (mm²) | Chip yield (%) | Good dies per wafer | Good SoCs per wafer
64 | 192   | 1  | 297.0 | 84.5 | 162   | 162
32 | 395   | 2  | 148.5 | 91.7 | 362   | 181
16 | 818   | 4  | 74.3  | 95.7 | 782   | 195
8  | 1,664 | 8  | 37.1  | 97.8 | 1,627 | 203
4  | 3,391 | 16 | 18.6  | 98.9 | 3,353 | 209
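Table 1's yield column can be approximated with a simple Poisson defect model, Y = exp(-Acrit × D). The defect rate of 2,000 defects/m² is the article's example; the critical-area fraction of 0.28 is our back-fitted assumption (not a figure from the article) chosen so the sketch lands within a few tenths of a percent of Table 1's yields, so treat it as illustrative only.

```python
import math

DEFECT_DENSITY = 2000.0   # defects per m^2 (example rate from the article)
FRAC_CRIT = 0.28          # assumed critical-area fraction (back-fitted, illustrative)

def chip_yield(area_mm2, d_per_m2=DEFECT_DENSITY, frac_crit=FRAC_CRIT):
    """Poisson defect-limited yield: Y = exp(-A_crit * D)."""
    a_crit_m2 = area_mm2 * 1e-6 * frac_crit
    return math.exp(-a_crit_m2 * d_per_m2)

def good_socs_per_wafer(good_dies, chips_per_package):
    """A 64-core SoC needs chips_per_package good chips."""
    return good_dies // chips_per_package

# (cores/chip, area mm^2, good dies/wafer from Table 1, chips/package)
rows = [(64, 297.0, 162, 1), (32, 148.5, 362, 2), (16, 74.3, 782, 4),
        (8, 37.1, 1627, 8), (4, 18.6, 3353, 16)]
for cores, area, good, per_pkg in rows:
    print(f"{cores:2d}-core chip: yield ~{chip_yield(area):5.1%}, "
          f"good 64-core SoCs/wafer = {good_socs_per_wafer(good, per_pkg)}")
```

The model makes the exponential penalty of die area explicit: halving the area roughly squares-roots the failure exposure, which is why the quad-core row yields 29 percent more working 64-core SoCs than the monolithic row.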

TOP PICKS

86 IEEE MICRO


the devices required for the system's functionality (in our case, these would primarily be routers and repeaters), but no more. This results in a sparsely populated interposer with a lower Fraccrit and therefore a lower cost.

We used the same yield model from our earlier cost analysis to estimate the yields of different interposer options: a passive interposer, a minimally active interposer, and a fully active interposer. The interposer size assumed throughout this work is 24 mm × 36 mm (864 mm²), and we assume six metal layers in the interposer. For a passive interposer, Fraccrit for the logic is zero (it remains nonzero for the metal layers). For a fully active interposer (that is, if one fills the entire interposer with transistors), we use a Fraccrit for logic of 0.75 and a Fraccrit for wires of 0.2625. For a minimally active interposer, we estimate the total interposer area needed to implement our routers (logic) and links (metal) to be only 1 percent of the total interposer area. To be conservative, we also consider a minimally active interposer in which we pessimistically assume the router logic consumes 10 times more area, although the metal utilization is unchanged. Minimizing utilization of the interposer for active devices also minimizes the potential for undesirable thermal interactions resulting from stacking highly active CPU chips on top of the interposer.

Considering an example defect rate of 2,000 defects per m² (from Table 1), we find that the passive interposer has a nonperfect yield rate of 94.1 percent because it still uses metal layers that can be rendered faulty by manufacturing defects. At the other extreme is a fully active interposer with very low yields of 61.5 percent. This is not surprising given that a defect almost anywhere on the interposer could render it a loss. This is the primary reason why one would likely be skeptical of active interposers. However, when using only the minimum amount of active area necessary on the interposer, the yield rates are not very different from the passive interposer: 93.9 percent for 1 percent active and 92.4 percent for 10 percent active. The vast majority of the interposer is not being used for devices; defects that occur in these white-space regions do not impact the interposer's functionality. As a result, we believe that augmenting an otherwise passive interposer with just enough logic to do what is needed has the potential to be economically viable, and it should be sufficient for NoC-on-interposer applications.
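The same Poisson-style model illustrates the interposer comparison. The area (24 mm × 36 mm) and the critical-area fractions (0.2625 for metal, 0.75 for fully active logic) are from the text above; the two separate defect densities for metal and logic are our back-fitted assumptions chosen to land near the quoted passive and fully active yields, so the absolute numbers are illustrative. The qualitative ordering, however, is the point: a minimally active interposer yields almost as well as a passive one, while a fully active one does not.

```python
import math

AREA_M2 = 24e-3 * 36e-3            # 24 mm x 36 mm interposer = 864 mm^2
F_METAL, F_LOGIC = 0.2625, 0.75    # critical-area fractions (from the article)
D_METAL, D_LOGIC = 268.0, 656.0    # defects/m^2 (back-fitted assumptions)

def interposer_yield(logic_fill):
    """Poisson yield with separate metal and logic critical areas.
    logic_fill scales the fully-active logic area: 0.0 = passive,
    0.01 = minimally active (1%), 1.0 = fully active."""
    lam = AREA_M2 * (F_METAL * D_METAL + F_LOGIC * logic_fill * D_LOGIC)
    return math.exp(-lam)

for name, fill in [("passive", 0.0), ("minimal (1%)", 0.01),
                   ("minimal (10%)", 0.10), ("fully active", 1.0)]:
    print(f"{name:14s} yield ~ {interposer_yield(fill):.1%}")
```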

Taking the cost argument alone to its logical limit would lead one to falsely conclude that a large chip should be disintegrated into an infinite number of infinitesimally small dies. The countervailing force is performance: although breaking a large system into smaller pieces could improve overall yield, going to a larger number of smaller chips increases the amount of chip-to-chip communication that must be routed through the interposer. In an interposer-based multicore system with a NoC distributed across chips and the interposer, smaller chips create a more fragmented NoC, resulting in more core-to-core traffic routing across the interposer, which eventually becomes a performance bottleneck. Figure 3 shows the cost reduction for three example defect rates, all showing the relative cost benefit of disintegration. The figure also shows the relative impact on performance. So, although more aggressive levels of disintegration provide better cost savings, they are directly offset by a reduction in performance.

Figure 3. Normalized cost and execution time (lower is better for both) for different multichip configurations. Sixty-four cores per chip corresponds to a single monolithic 64-core die, and four cores per chip corresponds to 16 chips, each with four cores. Cost is shown for different defect densities (in defects/m²), and the average message latency is normalized to the 16 quad-core configuration.

In the rest of this article, we explore the problem of how one can get the cost benefits of a disintegrated chip organization while providing a NoC architecture that, while physically fragmented across multiple chips, still behaves (performance-wise) at least as well as one implemented on a single monolithic chip.

Architecting the NoC for Multichip Interposers

The analysis in the preceding section shows that there are economic incentives for disintegrating a large multicore chip into smaller dies, but that doing so induces performance challenges with respect to the interconnect. We now discuss how to address these issues.

We propose the concept of misalignment. With conventional topologies such as a concentrated Folded Torus, links that cross the bisection between the two halves of the interposer still carry a higher amount of traffic and continue to be a bottleneck for the system. For concentrated topologies in which one router connects to four cores, every four CPU cores in a 2 × 2 grid share an interposer router that was placed in between them, as shown in both the perspective and side/cross-sectional views in Figure 4a.

Our misaligned interposer network offsets the location of the interposer routers. Cores on the edge of one chip now share a router with cores on the edge of the adjacent chip (see Figure 4b). The change is subtle but important: with an "aligned" interposer NoC, the key resources shared between chip-to-chip coherence and memory traffic are the links crossing the bisection. If both a memory-bound message (M) and a core-to-core coherence message (C) wish to traverse the link, one must wait as it serializes behind the other. With misaligned topologies, the shared resource is now the router. As the bottom of Figure 4b shows, this simple shift allows chip-to-chip and memory traffic to flow through a router simultaneously, thereby reducing queuing delays for messages to traverse the network bisection.

Figure 4. Perspective and side/cross-sectional views of (a) 4-to-1 concentration from cores to interposer routers aligned beneath the CPU chips, and (b) 4-to-1 concentration misaligned such that some interposer routers are placed in between neighboring CPU chips. The cross-sectional view also illustrates the flow of example coherence (C) and memory (M) messages.

For interposer-based NoCs, providing sufficient bisection bandwidth is critical. One straightforward way to provide more bisection bandwidth is to add more links. However, if this is not done carefully, it can cause the routers to need more ports (higher degree), which increases area and power, and can decrease the router's maximum clock speed. By combining different topological aspects of both the Butterfly and Folded Torus topologies into our novel ButterDonut topology, we can further increase the interposer NoC bisection bandwidth without impacting the router complexity. We maintain the advantages of long express links of the butterfly and exploit folded ring networks in the X-dimension (from the Folded Torus). This increases the bisection bandwidth and leads to low hop counts for both coherence and memory traffic, as shown in Figure 5a. The ButterDonut can also be misaligned to provide even higher throughput across the bisection, as shown in Figure 5b.

We can compare topologies among several different metrics. Table 2 shows all of the concentrated topologies considered in this article, along with several key network and graph properties. The metrics listed correspond only to the interposer's portion of the NoC (for example, nodes on the CPU chips are not included), and the link counts exclude connections both to the CPU cores as well as to the memory channels (this is constant across all configurations, with 64 links for the CPUs and 16 for the memory channels). Misaligned topologies are annotated with their misalignment dimension in parentheses; for example, the Folded Torus misaligned in the X-dimension is shown as "Folded Torus (X)." Misalignment can change the number of nodes (routers) in the network (see, for example, Figure 5). From the perspective of building minimally active interposers, we favor topologies that minimize the number of nodes and links to keep the interposer's Acrit as low as possible. At the same time, we want to keep the network diameter and average hop count low (to minimize expected latencies of requests) while maintaining high bisection bandwidth (for network throughput). Overall, the X-misaligned ButterDonut topology has the best properties out of all of the topologies except for the link count, for which it is a close second behind Double Butterfly (X). ButterDonut (X) combines the best of all of the other non-ButterDonut topologies, while providing 50 percent more east-west bisection bandwidth.

Methodology

For the yield and relative cost figures, we use analytical yield models, a fixed cost-per-wafer assumption, and automated tools for computing die-per-wafer,7 and we consider a range of defect densities. All analyses assume a 300-mm wafer. Our baseline monolithic 64-core die size is 16.5 mm × 18 mm (the same assumption as used in a recent interposer-NoC paper3). Smaller-sized chips are derived by halving the longer of the two dimensions (for example, a 32-core chip is 16.5 mm × 9 mm). The yield rate for individual chips is estimated using a simple classic model.8
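The die-per-wafer geometry argument can be sketched with a simple grid-packing count. This is not the article's automated tool (which reported 192 monolithic dies and 3,391 quad-core dies per wafer); it ignores scribe lines, reticle constraints, and edge exclusion, so its counts are somewhat optimistic, but it shows the same effect: small dies waste far less of the wafer's round edge.

```python
import math

def dies_per_wafer(die_w_mm, die_h_mm, wafer_d_mm=300.0):
    """Count dies in a grid-aligned layout, keeping only dies that lie
    entirely inside the wafer circle. Simplified: no scribe lines or
    edge exclusion, so counts run a bit higher than real tools."""
    r = wafer_d_mm / 2.0
    n_rows = int(wafer_d_mm // die_h_mm)
    y0 = -n_rows * die_h_mm / 2.0          # center the rows vertically
    total = 0
    for i in range(n_rows):
        y_top, y_bot = y0 + i * die_h_mm, y0 + (i + 1) * die_h_mm
        y_extreme = max(abs(y_top), abs(y_bot))
        if y_extreme >= r:
            continue
        half_w = math.sqrt(r * r - y_extreme * y_extreme)
        total += int(2 * half_w // die_w_mm)  # full dies across this row's chord
    return total

def utilization(die_w, die_h, wafer_d=300.0):
    """Fraction of wafer area covered by whole dies."""
    return (dies_per_wafer(die_w, die_h, wafer_d) * die_w * die_h
            / (math.pi * (wafer_d / 2) ** 2))

big = dies_per_wafer(16.5, 18.0)      # monolithic 64-core die
small = dies_per_wafer(4.125, 4.5)    # quad-core die (longer side halved twice)
print(big, small, f"{utilization(16.5, 18.0):.1%}", f"{utilization(4.125, 4.5):.1%}")
```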

For the speed binning results given earlier, we simulate a wafer's yield by starting with the good-die-per-wafer based on the desired chip's geometry (see Table 1). For each quad-core chip, we randomly select its speed using a normal distribution (mean: 2,400 MHz; standard deviation: 250 MHz). Our simplified model treats a 64-core chip as the composition of 16 adjacent (4 × 4) quad-core clusters, with the speed of each cluster chosen from the same distribution as the individual quad-core chips. Therefore, the 64-core chip's clock speed is the minimum from among its constituent 16 clusters. For each configuration, we simulate 100 different wafers' worth of parts and take the average over the 100 wafers. Similar to the yield results, the exact distribution of per-chip clock speeds is not so critical: so long as there exists a spread in chip speeds, binning and reintegration via an interposer can be beneficial for the final product speed distribution.

Figure 5. Our ButterDonut topology. The (a) aligned and (b) misaligned ButterDonut topologies combine topological elements from both the Double Butterfly and Folded Torus.

To evaluate the performance of various interposer NoC topologies for our disintegrated systems, we use a cycle-level network simulator.9 To evaluate the baseline and proposed NoC designs, we use both synthetic traffic patterns and SynFull traffic models10; these two evaluation approaches cover a wide range of network utilization scenarios and exercise both cache-coherence and memory traffic. For the SynFull workloads, we run multiprogrammed combinations composed of four 16-way multithreaded applications from Parsec.11 The application's threads are distributed across the 64 cores, and they share all 16 memory channels. We construct workload combinations based on their memory traffic intensity.

Evaluation

We evaluated the performance, power, and cost of our disintegrated multicore architecture considering different chip sizes and NoC topologies. We compared our proposed misaligned and ButterDonut topologies against more conventional topologies, such as a Mesh, Concentrated Mesh (CMesh), and Folded Torus. To evaluate the baseline and proposed NoC designs, we considered several traffic workloads to cover a wide range of network utilization scenarios. We used synthetic traffic patterns to stress our proposed network and realistic application workloads to evaluate NoC performance. Figure 6 shows average packet latency results when the system executes multiprogrammed SynFull workloads. The results are for a system consisting of four 16-core CPU chips. These workloads exercise a diverse set of routes highlighting differences among the topologies. Across the workloads, the misaligned ButterDonut (X) and misaligned Folded Torus (XY) consistently perform the best.

Our evaluations show that the proposed NoC architecture outperforms the baseline Mesh and CMesh. Misaligned topologies provide the best performance by reducing bisection pressure through isolating chip-to-chip coherence traffic from interposer memory traffic. Having this traffic share a router but not a link provides substantial throughput improvements. The ButterDonut topologies generally provide the best performance by optimizing both east-west memory traffic through long links and chip-to-chip coherence traffic through reduced diameter from the folded rings along the X dimension.

Higher levels of disintegration (smaller chips) can result in lower overall cost based on our analysis, due to better yields and more chips per wafer. The smaller chips could decrease interconnect performance because the NoC is fragmented, but the finer-grained binning also increases average CPU clock speeds. To put it all together, we consider a figure of merit (FOM) based on cost and performance, or "delay" for brevity. Our FOM is delay × cost, which provides a greater emphasis on performance (see Figure 7).

Table 2. Comparison of the different interposer NoC topologies studied in this article. In the nodes column, (n × m) indicates the organization of router nodes. Bisection links are the number of links crossing the vertical bisection cut.

Topology | Nodes | Links | Diameter | Average hop | Bisection links
Concentrated Mesh    | 24 (6 × 4) | 38 | 8 | 3.33 | 4
Double Butterfly     | 24 (6 × 4) | 40 | 5 | 2.70 | 8
Folded Torus         | 24 (6 × 4) | 48 | 5 | 2.61 | 8
ButterDonut          | 24 (6 × 4) | 44 | 4 | 2.51 | 12
Folded Torus (X)*    | 20 (5 × 4) | 40 | 4 | 2.32 | 8
Double Butterfly (X)*| 20 (5 × 4) | 32 | 4 | 2.59 | 8
Folded Torus (XY)*   | 25 (5 × 5) | 50 | 4 | 2.50 | 10
ButterDonut (X)*     | 20 (5 × 4) | 36 | 4 | 2.32 | 12

*Misaligned.
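As a sanity check on Table 2, the Concentrated Mesh row can be recomputed with a short breadth-first search over the 6 × 4 interposer router grid; swapping in a different edge list would produce the corresponding metrics for the other topologies. The grid construction here is a generic mesh, not the article's simulator.

```python
from collections import deque

def mesh_edges(cols, rows):
    """Undirected links of a cols x rows mesh of routers."""
    edges = []
    for x in range(cols):
        for y in range(rows):
            if x + 1 < cols:
                edges.append(((x, y), (x + 1, y)))
            if y + 1 < rows:
                edges.append(((x, y), (x, y + 1)))
    return edges

def hop_stats(edges):
    """Diameter and average hop count over all distinct router pairs (BFS)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dists = []
    for src in adj:
        seen = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        dists += [d for node, d in seen.items() if node != src]
    return max(dists), sum(dists) / len(dists)

edges = mesh_edges(6, 4)
diameter, avg_hops = hop_stats(edges)
# Links crossing the vertical bisection (between columns 2 and 3).
bisection = sum(1 for (a, b) in edges if (a[0] < 3) != (b[0] < 3))
print(len(edges), diameter, round(avg_hops, 2), bisection)
```

Running this reproduces the table's Concentrated Mesh entries: 38 links, diameter 8, average hop 3.33, and 4 bisection links.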



The rationale for the performance-heavy FOM is that for high-end servers, even relatively small performance differentiations at the high end can translate into substantially higher selling prices and margins.

Figure 7 shows our FOM. For a basic Mesh on the interposer, the performance loss due to disintegration actually hurts more than the cost reductions help until one gets down to using eight-core chips or smaller. CMesh provides an initial benefit with two 32-core chips, but its lower bisection bandwidth is too much of a drag on performance, and further disintegration is unhelpful. The FOM results show that the remaining topologies can ward off the performance degradations of the fragmented NoC sufficiently well such that the combination of binning-based improvements and continued cost reductions allow more aggressive levels of disintegration.

With different FOMs, the exact tradeoff points will shift, but our FOM illustrates that simple disintegration (using Mesh or CMesh) alone might not be sufficient to provide a compelling solution for both cost and performance. However, interposer-based disintegration appears promising when coupled with an appropriate redesign of the interposer NoC topology.

The combination of 2.5D and 3D stacking technologies enabled through silicon-interposer-based integration will significantly change the hardware organization of many future systems. We believe that the general framework of silicon-interposer-based integration for the physical organization and assembly of future SoCs will become common, and one of the key goals of this article is to raise awareness of and evangelize for new research efforts in this broad area considering impacts on cost, performance, power, and functionality.

We focused on a homogeneous disintegrated multicore system using 2.5D stacking to connect processors to 3D stacks of memory. Although we focused on interconnect issues to reintegrate homogeneous multicore chips, we anticipate broader interest and new research around silicon-interposer-based systems in general. The interposer simply provides a mechanical and electrical substrate for the integration of multiple disparate chips. Multiple computing chips (such as the CPU and GPU) can be 2.5D stacked on the interposer with the DRAM. Separate chips for logic and analog devices, possibly in different process technology nodes, can be independently manufactured and 2.5D stacked. Having a solid framework for integrating these disparate components enables a wide range of systems and prompts new research questions.

We focused on enabling cost-effective, large multicore designs. However, this is a rich platform for many exciting new systems that could become more economically viable. Consider, for example, new accelerators that might not be applicable across all market segments. Without interposer-based integration, a new accelerator would likely need to be directly incorporated into the same monolithic SoC silicon as the other commodity components (for example, the CPUs and memory controllers). This saddles the overall SoC with the cost of the accelerator, even for products that do not benefit from acceleration, or forces the manufacturer to pay for additional engineering to develop two separate SoC designs (that is, with and without the accelerator). With interposer-based integration, the manufacturer can separately design a stand-alone CPU and accelerator chips, and then individual SoC designs can choose which chips to stack. Although generic, our proposed interconnect designs offer sufficiently rich connectivity to enable efficient routing of chip-to-chip and chip-to-memory traffic regardless of the different SoC components.

Figure 6. Average packet latency results for different multiprogrammed SynFull workloads with varying network load intensities. Aligned and misaligned topologies are compared against the Mesh and CMesh, with the misaligned ButterDonut having the lowest average latency in most cases.

Our approach could dramatically accelerate the adoption of accelerator-based technologies. This in turn motivates new research in general system architectures to support a wide range of accelerators in a way that enables plug-and-play at the system level with respect to shared virtual memory, cache coherency between conventional computing and accelerators, work/task scheduling, and more.

We proposed the use of a minimally active interposer. We estimate that only 1 percent of the interposer area must be devoted to interconnect logic. Devoting this minimal amount of area to active logic has a very small impact on interposer yield. Although cramming the interposer full of transistors would be prohibitively expensive, we believe that some additional area can be spent on new interposer-based logic. Indeed, our results suggest that even devoting 10 percent of interposer area to active logic could still be practical. The cost-effective nature of such minimally active interposers opens many interesting research opportunities as we decide what to put on the interposers and how best to use the devices and routing there. Beyond the interconnect, the interposer could house all manner of system monitoring and online profiling ("introspection") mechanisms, auxiliary computing devices, and security features. The opportunities to reimagine computer architectures are wide open.
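The yield argument can be sketched with a standard negative-binomial defect model in the style of Stapper's analysis (reference 8), where only the active fraction of the interposer is treated as defect-critical logic. The interposer area, defect density, and clustering parameter below are illustrative assumptions, not our measured values; the point is only that because active logic occupies a small fraction of the interposer, even a 10 percent active fraction changes yield modestly.

```python
# Negative-binomial (Stapper-style) yield model:
#   Y = (1 + A_critical * D0 / alpha) ** (-alpha)
# Only the active fraction of the interposer area is assumed defect-critical.
# All parameter values here are illustrative assumptions.

def yield_nb(area_cm2, d0_per_cm2, alpha):
    """Yield under a negative-binomial defect distribution."""
    return (1.0 + area_cm2 * d0_per_cm2 / alpha) ** (-alpha)

interposer_area = 8.0  # cm^2, assumed interposer size
d0 = 0.2               # defects/cm^2, assumed for a mature, older process node
alpha = 2.0            # assumed defect-clustering parameter

for active_fraction in (0.01, 0.10):
    y = yield_nb(interposer_area * active_fraction, d0, alpha)
    print(f"{active_fraction:4.0%} active logic: yield of active portion = {y:.3f}")
```

Under these assumed parameters, 1 percent active logic costs well under 2 percent in yield, and even 10 percent remains in the mid-80-percent range, consistent with the claim that a minimally active interposer is practical.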

Given the value of interposer-based systems, a key open question is exactly how to interconnect the individual chips, memory stacks, and interposer. We take a first step in exploring possible topologies to effectively route both coherence and memory traffic across a disintegrated CPU, but we expect much follow-on work from the community. A key contribution is the advocacy of general approaches to interposer-based interconnect design. Future interposer-based systems will likely span a large range of sizes, functionalities, and physical arrangements of chips and memory stacks. Our approach to interposer-based NoCs, while illustrated with a specific example system, extends across a broad range of heterogeneous designs. Common ground must be found across this wide range of systems to design a generic interconnect that can effectively reintegrate a limitless range of future systems. Our focus on low diameter and misalignment to avoid bisection-link bottlenecks is a step in this direction, but new research is needed to understand the impact of traffic patterns in these different systems and re-envision the NoC accordingly. Our work provides a stepping-stone for these designs.
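The low-diameter goal can be made concrete with a small, generic calculation: the sketch below uses breadth-first search to compute the hop-count diameter of a k × k mesh versus a same-radix torus with wraparound links. This is only an illustration of why wraparound and long-link topologies shrink diameter; it does not model our specific interposer NoC topologies (ButterDonut, Double Butterfly, and so on).

```python
# Hop-count diameter of a k x k mesh vs. a k x k torus (wraparound links),
# computed by breadth-first search from every node. A generic illustration
# of the diameter benefit of wraparound links, not our exact interposer NoC.
from collections import deque

def neighbors(x, y, k, torus):
    """Yield the grid neighbors of (x, y), wrapping around if torus=True."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if torus:
            yield nx % k, ny % k
        elif 0 <= nx < k and 0 <= ny < k:
            yield nx, ny

def diameter(k, torus=False):
    """Maximum shortest-path hop count over all node pairs in the k x k network."""
    worst = 0
    for sx in range(k):
        for sy in range(k):
            dist = {(sx, sy): 0}
            queue = deque([(sx, sy)])
            while queue:
                x, y = queue.popleft()
                for n in neighbors(x, y, k, torus):
                    if n not in dist:
                        dist[n] = dist[(x, y)] + 1
                        queue.append(n)
            worst = max(worst, max(dist.values()))
    return worst

print("8x8 mesh diameter :", diameter(8))              # 14 hops
print("8x8 torus diameter:", diameter(8, torus=True))  # 8 hops
```

Halving the worst-case hop count in this way is the kind of structural advantage the folded-torus and ButterDonut families exploit, while misalignment additionally keeps wide bisection links from landing on the chip-boundary microbump crossings.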

This work assumes that all core dies are identical. This creates new physical design and layout challenges: each die must implement a symmetric interface so that it interfaces correctly with the interposer regardless of its mounting position. Conventional SoCs have no such requirements for their layouts. These challenges extend beyond functional interfaces to issues such as power delivery and thermal management. Many of them fall more in the domains of physical design, CAD, and electronic design automation. However, our work demonstrates the cost and performance potential for these systems, which in turn motivates new research in the physical design realm to support such disintegrated systems. Both the architecture and the larger ecosystem require additional research to make these compelling systems a reality. MICRO

[Figure 7: plot. Y-axis: FOM improvement (delay × cost), ranging from -1% to 10%; x-axis: 4, 8, 16, 32, 64; series: Mesh, CMesh, Folded Torus (XY), ButterDonut (X), ButterDonut, Folded Torus, Double Butterfly.]

Figure 7. Figure of merit based on delay × cost. The x-axis specifies cores per chip (64 = monolithic). Proper topology design improves cost and performance at high levels of disintegration (for example, four cores per chip).

TOP PICKS
92 IEEE MICRO

References

1. "AMD Ushers in a New Era of PC Gaming with Radeon R9 and R7 300 Series Graphics Line-Up Including World's First Graphics Family with Revolutionary HBM Technology," Advanced Micro Devices, June 2015.
2. B. Black, "Die Stacking Is Happening," Proc. Int'l Symp. Microarchitecture, 2013; www.microarch.org/micro46/files/keynote1.pdf.
3. N. Enright Jerger et al., "NoC Architectures for Silicon Interposer Systems," Proc. 47th Int'l Symp. Microarchitecture, 2014, pp. 458–470.
4. M. O'Connor, "Highlights of the High-Bandwidth Memory (HBM) Standard," Memory Forum Workshop, 2014; www.cs.utah.edu/thememoryforum/mike.pdf.
5. Y. Deng and W. Maly, "Interconnect Characteristics of 2.5-D System Integration Scheme," Proc. Int'l Symp. Physical Design, 2001, pp. 171–175.
6. N. Madan and R. Balasubramonian, "Leveraging 3D Technology for Improved Reliability," Proc. 40th Int'l Symp. Microarchitecture, 2007, pp. 223–235.
7. M. Hackerott, "Die Per Wafer Calculator," Informatic Solutions, 2011.
8. C.H. Stapper, "The Effects of Wafer to Wafer Defect Density Variations on Integrated Circuit Defect and Fault Distributions," IBM J. Research and Development, Jan. 1985, pp. 87–97.
9. N. Jiang et al., "A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator," Proc. Int'l Symp. Performance Analysis of Systems and Software, 2013, pp. 86–96.
10. M. Badr and N. Enright Jerger, "SynFull: Synthetic Traffic Models Capturing a Full Range of Cache Coherence Behaviour," Proc. ACM/IEEE 41st Int'l Symp. Computer Architecture, 2014, pp. 109–120.
11. C. Bienia, "Benchmarking Modern Processors," PhD dissertation, Dept. Computer Science, Princeton Univ., 2011.

Ajaykumar Kannan is an advanced design engineer in the Intel Programmable Solutions Group. His research interests include silicon-interposer-based networks on chip. Kannan received an MASc in computer engineering from the University of Toronto, where he completed the work for this article. Contact him at [email protected].

Natalie Enright Jerger is an associate professor in the Department of Electrical and Computer Engineering at the University of Toronto. Her research interests include multi- and many-core architectures, on-chip networks, cache coherence protocols, memory systems, and approximate computing. Enright Jerger received a PhD in computer architecture from the University of Wisconsin–Madison. She is the recipient of an Alfred P. Sloan Research Fellowship, the Borg Early Career Award, the 2014 Ontario Professional Engineers Young Engineer Medal, and the 2012 Ontario Ministry of Research and Innovation Early Researcher Award. She served as program co-chair of the 7th Network-on-Chip Symposium (NOCS) and as program chair of the 20th International Symposium on High-Performance Computer Architecture (HPCA). Contact her at [email protected].

Gabriel H. Loh is a Fellow at Advanced Micro Devices. His research interests include computer architecture, processor microarchitecture, emerging technologies, and 3D die stacking. Loh received a PhD in computer science from Yale University. He's a senior member of IEEE and an ACM distinguished scientist. Contact him at [email protected].
