
Impact of Small Process Geometries on Microarchitectures in Systems on a Chip

DENNIS SYLVESTER, MEMBER, IEEE AND KURT KEUTZER, FELLOW, IEEE

Invited Paper

Process effects in deep-submicrometer geometries are expected to change the physical organization, or microarchitecture, of integrated circuits. The factor that is expected to primarily impact integrated circuit microarchitectures is increasing delays in interconnect. We believe that, to properly microarchitect integrated circuits in small process geometries, it is necessary to get as detailed a picture as possible of the effects and then to draw conclusions about changes in microarchitecture. To this end, in this paper, we describe a comprehensive approach to accurately characterizing the device and interconnect characteristics of present and future process generations. This approach uses a detailed extrapolation of future process technologies to obtain a realistic view of the future of circuit design. We then proceed to quantify the precise impact of interconnect, including dynamic delay due to noise, on the performance of high-end integrated circuit designs. Having determined this, we then reconsider the impact of future processes on integrated-circuit design methodology. We determine that local interconnect effects can be managed through a deep-submicrometer design hierarchy that uses 50K-100K gate modules as primitive building blocks.

In light of this new system-on-a-chip microarchitecture, we then examine global interconnect issues. Our results indicate that, while global communication speeds will necessarily be lower than local clock speeds, International Technology Roadmap for Semiconductors expectations should be attainable to the 0.05-µm technology generation. Achieving these high clock speeds (10 GHz local clock) will be aided by the use of a newly proposed routing hierarchy that limits interconnect effects at each level of a design (local, isochronous, and global). In addition, key components of the interconnect architecture of the future include fat (or unscaled) global wires, intelligent repeater and shield wire insertion, and efficient packaging technologies.

Keywords: Integrated circuit interconnections, integrated circuit modeling, routing, ultralarge-scale integration.

Manuscript received May 31, 2000; revised January 12, 2001.
D. Sylvester is with the University of Michigan, Ann Arbor, MI 48109-2122 USA.
K. Keutzer is with the University of California at Berkeley, Berkeley, CA 94720-1770 USA.
Publisher Item Identifier S 0018-9219(01)03199-1.

I. INTRODUCTION

Process effects in deep-submicrometer geometries are expected to change the physical organization, or microarchitecture, of integrated circuits. The factor that is expected to primarily impact integrated circuit microarchitectures is increasing delays in interconnect. A number of independent sources have made forecasts that in deep-submicrometer (DSM) process geometries 80% or more of the delay of critical paths will be attributable to interconnect [1]-[3]. This forecast has been further supported by the broad industrial experience of significant problems in timing closure for current high-performance integrated circuit (IC) designs. Based on the present difficulties in IC design methodologies and predictions of increasing migration of delay to interconnect, there is a sense in both industry and academia that current synthesis and physical design methodologies require a significant overhaul.

Deep-submicrometer effects, particularly interconnect, have thus been billed as potential showstoppers to the continuation of Moore's law. Among the effects that are commonly mentioned are rising RC delay of on-chip wiring, noise considerations such as crosstalk and dynamic delay, reliability concerns due to rising current densities and oxide electric fields, and increasing power dissipation. Each of these issues has underlying physical explanations that shed insight into its potential impact as CMOS processes continue to scale. We address these issues and others in Section III.

We believe that to properly microarchitect integrated circuits in small process geometries, it is necessary to get as detailed a picture as possible of the effects and then to draw conclusions about changes in microarchitecture. Our goal in this paper is to propose a specific microarchitecture that limits the impact of interconnect on performance and allows device scaling to continue to drive IC performance. The design methodology we suggest is based on realistic projections of future technologies, simulation tools, analytical models, and actual design data. Furthermore, we describe several concepts that may help interconnect scale more readily at the

0018-9219/01$10.00 © 2001 IEEE

PROCEEDINGS OF THE IEEE, VOL. 89, NO. 4, APRIL 2001 467

known as wire-load models, attempt to predict the amount of capacitance in a wire by reducing it to a function of fanout and block size. In this approach, shown in Fig. 3, a single capacitance value is used for all nets in a block with the same fanout. Clearly the capacitance values for nets in a block will vary and so the wire-load model is necessarily only an approximation of the actual future capacitance. When wire delay constitutes a small percentage of the total critical path delay, errors in capacitance estimation have little impact. However, if the capacitance values in wires increase, then wire-load models lead to increasing inaccuracy.

D. Problems with Physical Design

Placement and routing tools tend to use primitive models of circuit delay. Often the timing constraints inherent in the synthesized netlist are translated into a single static net prioritization. Place and route tools attempt to honor this prioritization scheme, but because they have only a primitive internal model of critical path delay, they cannot be sure that they have in fact obeyed the initial timing constraints. As delay migrates to interconnect, the possibility that excess capacitance in routing violates a timing constraint increases. If capacitance is increased due to cross-coupled capacitances in routing, this is only likely to be discovered after the final layout is extracted.

To summarize, the core problem with current design flows that is causing them to fail is the poor back-end prediction capabilities of logic synthesis (i.e., the wire-load models). To attack this problem at the microarchitectural level, we propose to use a block-based hierarchical design style that uses block sizes of 50K-100K gates. Within these blocks (or modules), we will demonstrate that interconnect effects can be effectively minimized via proper device sizing and intelligent scaling so that even relatively inaccurate wire-load models can be used at this level of hierarchy. We note now, and emphasize later, that this module size allows for a significant amount of flexibility and computing power.

III. DEVICE AND INTERCONNECT IN DSM

We propose that the size of a module that can still be reliably designed in the methodology of Figs. 1 and 2 without significant attention to DSM effects is approximately 50K-100K gates. Fig. 4 shows the approach we employ to arrive at this conclusion. To begin, we develop a strawman technology file that completely describes future process generations for both device and interconnect characteristics.

To facilitate our analysis, we develop models of future designs, at varying degrees of accuracy, that allow us to determine the performance of future ASICs. These models emphasize the important issues of delay, noise, and power. Before introducing our strawman technology, we will first give a high-level overview of device and interconnect issues in deep submicrometer.

A. MOSFETs

1) Voltage Scaling: While 5-V power supplies were prevalent at feature sizes ≥ 0.5 µm, DSM processes are seeing a continual drop in $V_{dd}$. One key reason for reducing voltages in scaled CMOS devices is that of reliability. For instance, gate oxides tend to break down if exposed to electric fields in excess of 5 to 6 MV/cm [9]. Since oxide thickness ($t_{ox}$) is being scaled to increase current drive, the maximum voltage on the gate must also be scaled to keep electric fields within reason. Another form of device reliability that must be considered is the effect of hot electrons. When the electric field in the channel is too high near the drain, electrons may gain enough energy to inject themselves into the gate oxide. This accumulation of charge in the oxide results in shifts in $V_{th}$ and changes in the device's current-voltage (I-V) characteristics. By reducing the voltage supply, both the channel and gate electric fields will be proportionately reduced for improved reliability.

Fig. 4. Overview of our analysis methodology.

The impact of reduced supply voltages on power is another important reason why $V_{dd}$ values are decreasing. Dynamic power consumption is defined by $P = \alpha C V_{dd}^2 f$, where $\alpha$ is a switching activity ratio, $C$ is the switched capacitance, and $f$ is the operating frequency. The quadratic dependency on supply voltage makes the reduction of $V_{dd}$ a primary goal in power minimization. This issue will be discussed more in Section VI-C.
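As a rough illustration of the quadratic supply-voltage dependence, the following Python sketch evaluates the standard dynamic power expression (activity ratio times switched capacitance times supply voltage squared times frequency). The function name and every operating-point number are hypothetical, chosen only for illustration; none of them come from the strawman technology tables.

```python
def dynamic_power(alpha, c_switched, v_dd, freq):
    """Dynamic power alpha * C * Vdd^2 * f, in watts for SI inputs."""
    return alpha * c_switched * v_dd ** 2 * freq


# Hypothetical operating point (illustrative values, not data from the paper).
p_high = dynamic_power(alpha=0.1, c_switched=10e-9, v_dd=2.5, freq=500e6)
p_low = dynamic_power(alpha=0.1, c_switched=10e-9, v_dd=1.8, freq=500e6)

# With everything else held fixed, power scales with the square of the
# supply ratio: (1.8 / 2.5) ** 2, roughly half.
ratio = p_low / p_high
print(f"P(2.5 V) = {p_high:.3f} W, P(1.8 V) = {p_low:.3f} W, ratio = {ratio:.2f}")
```

Under these assumed numbers, lowering the supply from 2.5 V to 1.8 V roughly halves dynamic power without touching capacitance or frequency.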

2) Drive Current: Classical long-channel MOS theory states that device current in the saturation mode of operation is proportional to the square of gate drive ($V_{gs} - V_{th}$) and inversely proportional to channel length. At ultrashort channel lengths, most carriers travel at a maximum saturated velocity $v_{sat}$ throughout the channel, which nearly eliminates the impact of channel length on current. A new and more accurate expression for this limiting case is [9] $I_{dsat} = W C_{ox} v_{sat} (V_{gs} - V_{th})$. From this expression, we can see the channel length independence as well as the move from quadratic to linear dependence on gate drive. Since $v_{sat}$ is a material constant (approximately … cm/s for electrons, … cm/s for holes), we find that $I_{dsat}$ (normalized to device width) will vary with $C_{ox}(V_{gs} - V_{th})$. With $V_{th}$ fixed at … (necessary to maintain sufficient gate drive), we obtain …. Since both of these parameters are decreasing, DSM MOSFETs are not expected to provide enhanced drive current per unit width. This marks a new regime for scaled CMOS devices, since up to the 0.35-µm process generation additional current


Table 1. Projected MOSFET Characteristics in DSM

highly sensitive to the ratio of coupling capacitance to total capacitance, implying that it will become a larger issue as interconnect dimensions continue to scale.

IV. FUTURE PROCESS TECHNOLOGIES

Aiming to quantify the concerns raised in the previous section, we now present substantiated projections of DSM technology parameters at both the device and interconnect level. As conclusions are based on models and data, we take pains to detail our models and data. Care is taken to explain and justify our data and models, and comparisons are drawn to predictions in [1], [13]. Finally, where choices are made, we aim to be consistently conservative and err on the side of pessimism.

A. Device Roadmap

Table 1 presents the most important features of our strawman technology for DSM CMOS devices. Our gate oxide thickness projections are generally higher than those found in the 1999 ITRS [1]. Due to oxide reliability requirements, one should keep the electric field in the oxide below 5-6 MV/cm [7]. In addition, oxides thinner than 15 Å raise serious concerns about leakage due to direct tunneling through such an extremely thin layer [19] (see footnote 2). These concerns are not likely to be resolved by the 0.13/0.1-µm generations as anticipated in [1]. Therefore, we predict oxides that are free from tunneling problems until the 0.07-µm technology node. Finally, $t_{ox}$ values below 10 Å are extremely optimistic; this thickness represents only a few atomic layers of SiO2. We predict oxides to bottom out around 10-12 Å due to leakage and fabrication issues.
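The oxide-field constraint can be checked with a one-line calculation: the field is simply the gate voltage divided by the oxide thickness. The Python sketch below is a minimal illustration under that parallel-plate assumption; the function names and the sample voltage and thickness pairs are hypothetical and are not entries from Table 1.

```python
ANGSTROM_IN_CM = 1e-8  # 1 angstrom expressed in centimeters


def oxide_field_mv_per_cm(v_gate, t_ox_angstrom):
    """Gate-oxide field in MV/cm for a gate voltage (V) and thickness (angstrom)."""
    return v_gate / (t_ox_angstrom * ANGSTROM_IN_CM) / 1e6


def max_gate_voltage(t_ox_angstrom, field_limit_mv_per_cm=5.0):
    """Largest gate voltage (V) that keeps the oxide field at or below the limit."""
    return field_limit_mv_per_cm * 1e6 * t_ox_angstrom * ANGSTROM_IN_CM


# Hypothetical example: 1.5 V across a 30-angstrom oxide sits right at the
# lower end of the 5-6 MV/cm reliability range.
print(oxide_field_mv_per_cm(1.5, 30.0))

# A 15-angstrom oxide held to 5 MV/cm tolerates well under 1 V on the gate.
print(max_gate_voltage(15.0))
```

This makes the coupling between oxide scaling and supply scaling concrete: halving the oxide thickness halves the tolerable gate voltage at a fixed field limit.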

Regarding voltage scaling, we select the high-performance values from ITRS '99 [1], rather than the low-power scenario. Also, $V_{th}$ is set at … to provide sufficient current drive to keep performance climbing [7]. Beyond 0.1 µm, such small $V_{th}$s may yield high amounts of leakage current. In this case, several approaches may be taken. A dual-$V_{th}$ process uses two different threshold voltages within the same design. Low-$V_{th}$ devices are used where speed is essential while the bulk of devices (e.g., 90%) have a higher $V_{th}$ to keep overall leakage current small [20]. A second approach would be to use circuit techniques to raise the $V_{th}$ in idle circuit blocks. This approach is similar in concept to standby or sleep modes in current microprocessors where the clock is disabled in low-activity regions in order to reduce dynamic power consumption [21]. However, sleep modes do not eliminate static power and further work will be needed in this area to minimize latency and ensure reliability.

Footnote 2: Stacked gate dielectrics using high-k materials on top of a very thin SiO2 layer will help in reducing tunneling since a thicker high-k material can be used while maintaining the same electrical characteristics as a thinner SiO2 region. Difficulties are in manufacturability and minimizing the amount of SiO2 needed at the interface.

Table 2. Interconnect Characteristics for Metals 1 and 2

Table 3. Top Metal Layer Parameters for All Generations

Combined with the characteristics described in Table 1, BSIM3 models have been developed to model DSM devices [22]. Default parameters are used except for …, …, mobility, and capacitances. Resulting curves yield good fit (within 20% in linear region, 10% in saturation) with measured results from 0.15-µm devices (…). Simulated values of … for 0.25 to 0.1-µm processes show excellent correlation with published data.

B. Interconnect Roadmap

Tables 2 and 3 highlight key parameters from our interconnect roadmap. Table 2 presents dimensions for the lower two levels of metal for each process. Our interconnect hierarchy consists of several pairs of identical metal layers that fall under the categories of local, intermediate, and global wiring. The number of metal layers increases from 6 at 0.25 µm to 9 at 0.05 µm for enhanced connectivity. The first two levels of metal are used exclusively for routing local signals between gates within a larger block of gates (e.g., 50K). In these instances, the wirelength is typically short and the first concern is that of wiring density. Therefore, we predict a continuing drop in lower-level wiring pitch. This will not only provide additional local routing capability but will also allow for smaller standard cell sizes as [23] and [24] have shown cell size to be set by contacted wiring pitch.


The second concern at the local level is noise. Due to the shrinking pitches, larger coupling capacitances lead to enhanced noise. In order to limit noise, we recommend the use of flat wiring where the aspect ratio (AR) is capped at about 2 [16]. Compared to predictions in [13], an AR of 2 will yield 30% smaller coupling capacitance at 0.07 µm than AR = …. The use of thinner wires in the flat approach can be seen as a tradeoff between noise and resistance, where the lower resistivity of copper is taken advantage of in order to limit capacitances. This approach has been used in early copper designs to limit capacitance and is especially beneficial at lower levels where device resistance tends to be much larger than wire resistance (due to short wirelengths) [20]. An additional point in noise reduction is the scaling of insulator thickness. In order to prevent coupling capacitance from becoming an even larger portion of total wiring capacitance, the insulator thickness needs to be scaled appropriately. Also, by reducing it, vias with reasonable ARs can be used, allowing for easier fabrication and lower resistance.

Copper is used for all sheet resistance calculations with a resistivity of 2.2 µΩ·cm. The reduction of dielectric constant with scaling reflects the significant work being done on low-k materials to replace SiO2 as the insulator of choice in ULSI [25]. The implementation of low-k dielectrics into processes represents a significant step in realizing very high performance designs. Smaller interconnect capacitances result in lower power dissipation as well as smaller delay times, making low-k dielectrics a more beneficial process advance than copper wiring. Our projections for low-k dielectrics are fairly conservative in comparison to [1], [13], especially beyond 0.13 µm.

Table 3 shows projections for top-level interconnect throughout scaling. In contrast to lower level metals, where average wirelengths scale down due to shrinking gate sizes, global wires actually become longer. For this reason, we select a large cross-section global wire that maintains a constant resistance of 44 Ω/cm. This low resistance value will allow for unattenuated distribution of power grids, clocks, global buses, and other important signals on the top layers of metal. The approach follows the concept of fat wires suggested in [5]. Transmission line characteristics will need to be well-modeled and controlled in this scheme [26].
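To give a feel for how large such a fat wire is, the Python sketch below converts a per-length resistance target into a conductor cross-section using R per unit length = resistivity / area. It assumes that the 2.2 figure quoted earlier for copper is a resistivity in µΩ·cm and that the 44 figure is in Ω/cm; the function name is hypothetical, and barrier layers, skin effect, and temperature are ignored.

```python
def wire_cross_section_um2(resistivity_uohm_cm, resistance_ohm_per_cm):
    """Cross-sectional area (square micrometers) needed to hit a target
    resistance per unit length, using R' = rho / A for a uniform conductor."""
    area_cm2 = resistivity_uohm_cm * 1e-6 / resistance_ohm_per_cm
    return area_cm2 * 1e8  # 1 cm^2 = 1e8 um^2


# Assumed inputs: copper at 2.2 uOhm*cm and a 44 Ohm/cm global wire target.
area = wire_cross_section_um2(2.2, 44.0)
print(f"required cross-section: {area:.1f} um^2")
```

Under these assumptions the required area comes out to about 5 µm² (for instance a wire roughly 2.5 µm wide and 2 µm thick), far larger than a minimum-pitch local wire, which is what makes the top-level metal "unscaled".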

Intermediate metal layers not included in Tables 2 and 3 provide inter-modular routing and offer pitches and thicknesses between those presented. For instance, metals 3 and 4 have a minimum pitch that is 50%-100% larger than that of metals 1 and 2.

V. DESIGN DATA

To supplement the analytical models and simulation tools to be used in the analysis section, we obtained detailed design data from modern ASICs. Empirical design data was compiled for thirteen 0.35-µm ASIC designs from Symbios Corporation. Average wirelengths in the designs vary somewhat but tend to be in the range of 200 to 300 µm. Average fan-out is consistently between 2.2 and 3. Designs regularly incorporated large macro blocks in addition to standard cells, effectively making them early SoC designs. The percentage of standard-cell logic area to total chip area ranged from 30 to 95% but logic area typically encompassed about 70% of the chip. Data was also obtained showing average wirelength for each fan-out in each design. This data is used in Section VI to create a critical path model for future ASICs.

Fig. 5. Determination of gate/interconnect delay involves 2-input NAND ring oscillators and average wirelengths.

VI. ANALYSIS

In this section, we develop models for various performance metrics (delay, noise, power) and apply them to DSM ASICs to confirm our position that 50K-100K gate modules can be designed using the flow in Figs. 1 and 2.

A. Delay

1) Defining Gate/Interconnect Delay: We begin our analysis by proposing a well-defined method of assessing gate delay versus interconnect delay. Typically, for a given process, gate delay is determined using an unloaded (FO = 1, no significant wiring load) ring oscillator made up of inverters. This is an elegant way to determine gate delay as it is independent of device sizing, since increasing device width contributes equally to larger drive current and load capacitance.

We propose a modified version of the ring oscillator concept to determine gate and interconnect delay. First, 2-input NAND gates replace inverters since they better represent on-chip logic gates. One of the inputs in these gates is tied to $V_{dd}$, resulting in worst case low-to-high delays. Next, the fan-out of the gates is varied from 1 to 4 (fan-outs greater than 4 are not of practical interest). At this point, gate delay is found for each fan-out individually. Finally, minimum-pitch interconnect of the average wirelength is added between each stage. Average wirelength is a function of fan-out and is varied accordingly, using empirical design data as a reference. At this point, a stage delay is found for a given technology with fan-out ranging from 1 to 4. Interconnect delay is defined as the difference between the stage delay and intrinsic gate delay. The basic approach is shown schematically in Fig. 5.

Sizing of gates with interconnect loading becomes nontrivial, as extremely large devices could be used to make interconnect delay negligible. Likewise, the use of minimum-sized gates would yield a misleading depiction of interconnect delay. Since this is not practical due to area and power considerations, we define an optimal driver size. The optimal size is defined by a W/L ratio such that an increase in W/L of 1 does not yield a 2% drop in stage delay. This criterion was chosen to most closely approximate the knee of the delay versus device sizing plot obtained when sweeping W/L.
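The knee criterion can be sketched as a simple sweep: increase W/L in unit steps and stop at the first size where one more step buys less than a 2% drop in stage delay. The Python sketch below uses a toy delay model (a size-independent gate delay plus a wire-loading term that falls as 1/(W/L)); that model, its constants, and the function names are assumptions made purely for illustration, whereas the actual analysis obtains stage delay from circuit simulation.

```python
def stage_delay(w_over_l, gate_delay=50.0, wire_term=1000.0):
    """Toy stage delay in arbitrary units (assumed model, not simulation data):
    intrinsic gate delay is independent of size, while the extra delay from
    driving the wire shrinks in proportion to 1 / (W/L)."""
    return gate_delay + wire_term / w_over_l


def optimal_size(delay_fn, start=1, stop=200, threshold=0.02):
    """Smallest integer W/L at which a unit increase in W/L reduces the
    stage delay by less than `threshold` (2% by default)."""
    for w in range(start, stop):
        drop = (delay_fn(w) - delay_fn(w + 1)) / delay_fn(w)
        if drop < threshold:
            return w
    return stop


w_opt = optimal_size(stage_delay)
interconnect_share = 1.0 - 50.0 / stage_delay(w_opt)
print(w_opt, round(interconnect_share, 2))
```

With a different delay model the knee moves, but the stopping rule stays the same, which is the point of defining the optimal size by a relative improvement threshold rather than by an absolute delay target.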

2) Analytical Approach: In this section, we discuss the first of two different approaches to the delay analysis. A first-order analytical model is presented to observe general delay trends in future ASICs. A full-scale simulation approach that employs accurate device models and distributed interconnect effects is described in the following section.

Table 4. First-Order Analysis Results for Delay Scaling. The 0.05-µm value in parentheses is adjusted to match [27].

To the first order, delay of a MOSFET is governed by a simple relationship between capacitance, voltage, and current: delay = CV/2I. To study the scaling of delay through process generations, we employ this expression by approximating the trends of each of the variables involved. Several approximations are made and will be discussed in turn.

We begin by discussing a block-based design approach with 50K gate building blocks. We later discuss other block sizes. In a block of this size, wirelengths are typically small and line resistance is much less than the effective resistance of the MOSFET drivers. Therefore, we ignore the impact of wire resistance in these calculations. Consider a single gate with a fixed fan-out in two processes related by a linear scaling factor. By scaling both channel length and width by this factor, we get a quadratic reduction in module area, assuming that wiring pitch is also reduced by the same factor [23], [24]. We assume that average wirelength is a fixed fraction of the module side length so that it also scales by the same factor. Since drive current per unit width is relatively constant in DSM processes, a reduction in channel width results in a proportional reduction in drive current.

Voltage swing, as discussed above, is also shrinking due to power and reliability concerns. From Table 1, we estimate the scaling factor for $V_{dd}$ as 0.75. Finally, we assume that the load capacitance is made up of two components, interconnect and fan-out gate capacitance. Interconnect capacitance is found by multiplying the average wirelength by the capacitance per unit length. From 2-D simulations that incorporate low-k dielectrics (Table 2), we find the capacitance per unit length to be dropping by about 15% per generation, or a scaling factor of 0.85. Gate capacitance can be evaluated by estimating that … is shrinking by 0.8 per generation (from Table 1). We set interconnect capacitance to be twice as large as the fan-out capacitance based on 0.25-µm calculations (FO = …). Note that we are neglecting device junction and overlap capacitances in this analysis.

Normalizing CV/2I to 1 for the original process, the shrunken process yields a CV/2I value of 0.65. This represents a stage delay improvement of 35% from one generation to the next despite the smaller current drive. Translating to clock frequencies, these results predict a 54% increase for a generic process shrink. This analysis brings up several important points. First, it seems possible to maintain rising performance levels while scaling channel widths even in the era of velocity saturation. Second, the voltage and current components of CV/2I cancel each other to some extent, leaving the bulk of the delay improvement to reduction of load capacitance. This is important because it highlights the most vital issues in DSM. In order to keep on pace with performance projections, wirelengths need to be reduced, low-k dielectrics need to be introduced, and device capacitances need to drop. These are among the most important factors contributing to faster designs.
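The first-order arithmetic can be retraced in a few lines. The Python sketch below is only a reconstruction under stated assumptions: a linear shrink factor of 0.7 per generation (a conventional value that does not survive in the text above), oxide thickness as the quantity shrinking by 0.8 (so gate capacitance per unit area rises by 1/0.8), a supply scaling of 0.75, a wire capacitance per unit length scaling of 0.85, and wire capacitance initially twice the fan-out gate capacitance. The function and parameter names are hypothetical.

```python
def scaled_cv_over_2i(shrink=0.7, v_scale=0.75, c_wire_per_len_scale=0.85,
                      t_ox_scale=0.8, wire_to_gate_cap_ratio=2.0):
    """Return CV/2I of the shrunken process, normalized to 1 for the original.

    All default values are assumptions for illustration (see lead-in).
    """
    # Wire capacitance: average length scales with the shrink factor and
    # capacitance per unit length scales separately (low-k dielectrics).
    c_wire = wire_to_gate_cap_ratio * c_wire_per_len_scale * shrink
    # Gate capacitance: area scales as shrink^2, thinner oxide raises C_ox.
    c_gate = 1.0 * shrink ** 2 / t_ox_scale
    c_scale = (c_wire + c_gate) / (wire_to_gate_cap_ratio + 1.0)
    # Drive current per unit width is constant, so current tracks width.
    i_scale = shrink
    return c_scale * v_scale / i_scale


delay = scaled_cv_over_2i()
print(f"normalized delay: {delay:.3f}, frequency gain: {1.0 / delay - 1.0:.0%}")
print(f"gain implied by a delay of 0.65: {1.0 / 0.65 - 1.0:.0%}")
```

Under these assumptions the sketch lands at roughly 0.64, close to the 0.65 quoted in the text, and the reciprocal of 0.65 reproduces the stated 54% clock frequency increase. The voltage and current factors (0.75 and 0.7) nearly cancel, so almost all of the improvement comes from the load capacitance term.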

Table 4 shows detailed results from this first-order analysis using actual scaling numbers taken from our strawman technology. Gate capacitance is determined using … and a fan-out of 2. As mentioned in the previous discussion, the sharp decrease in load capacitance (…) compensates for a slow rise in … to yield overall speed improvements. A comparison is made to gate delay predictions in [27], which uses 3-input NANDs and several empirical factors to account for device resistance and the use of dynamic logic. Our simple analysis gives very similar results regarding the scaling trends of gate delay in DSM.

3) Simulation Approach: The previous approach is useful in determining trends and important parameters in delay scaling. However, it makes several assumptions that we will now remove. For example, the analysis ignored the impact of global wires in clock cycle determination. Since designs are getting larger, global wires become longer, which means that they will necessarily take up an increasing portion of the clock cycle. In addition, wiring resistance was ignored due to the assumption that device resistance dominated. Also, only the gate oxide component of device capacitance was treated. In this section, we will look more closely at the factors involved in determining ASIC performance using simulation tools.

We begin by characterizing a small ASIC library for each process (0.25 through 0.1 µm) consisting of 2-input NANDs with varying fan-outs. As mentioned earlier, BSIM3 device models are used in simulations. According to the procedure outlined in Section VI-A1, we determine …, optimal device sizing (… in 2-input NANDs), and …. After finding optimal device sizes at the 0.25-µm generation, we use this … ratio for all subsequent processes to allow for easier comparisons.

Fig. 6 shows plots resulting from device sizing optimization for 0.25 and 0.1 µm with a fan-out of 2. While the gate delay is constant throughout, the total stage delay decreases appreciably when increasing device sizes. In the limit, infinitely large devices will be unaffected by the presence of a relatively short wire, yielding a stage delay essentially equal to the gate delay. We find that the optimal device size is around …. At this size, interconnect represents 28% and 22% of the total delay at 0.25 and 0.1 µm, respectively. The fact that interconnect delay is actually decreasing is somewhat surprising. The primary reasons behind this conclusion are the presence of shorter average wires and new materials.


Fig. 6. (a) Device size determination for 0.25 µm, FO = 2. (b) Same plot for 0.1 µm illustrates drop in interconnect delay.

Previous studies reporting rises in interconnect delay have tended to focus on a fixed line length. Due to shrinking gate pitches, local wirelengths are expected to shrink with process scaling. As demonstrated in Table 4, we forecast a decrease in average wiring capacitance by a factor of 12 from 0.25 to 0.05 µm due to shorter lines and low-k dielectrics. This point underlies our conclusion that interconnect delay will not dominate within 50K gate modules of future designs, and that a block-based microarchitecture is well suited to DSM design methodologies.

However, the scaling down of wirelength does not hold at the global level. In contrast, these signals must necessarily get longer in order to ensure connectivity in larger chips. Due to the use of functional clustering (keeping modules that need to communicate often near one another), extremely long global wires (… chip edge length) can be minimized and excluded from critical paths. Recent study has suggested that typical global wires might be closer to half the chip edge length [28]. We use this as an approximate starting point for a global wire in 0.25 µm (… cm) and scale it up by 15% for each process shrink. Simulation results show that at a constant buffer width (fixed …), delay decreases from process to process mainly due to the drop in voltage swing. Low-k dielectrics cancel out the 15% expected increase in wirelength so that delay varies roughly with … according to CV/2I. Of course, this requires the use of larger drivers in each process (in terms of …), which can lead to power penalties. The available power and area budget needs to be weighed against the buffer requirements to meet timing constraints.

A comparison between the approaches in Sections VI-A2 and VI-A3 is deferred until the impact of noise on delay can be considered. We will also extend our model from the delay of a single stage to the delay of an entire critical path.

B. Noise

The impact of noise on system performance was described qualitatively in Section III-B2. In this part of the analysis, we modify the results of Section VI-A3 to include dynamic delay effects and also discuss the impact of crosstalk noise on DSM designs. The general result of dynamic delay is increased stage delays due to higher effective capacitance and larger power dissipation due to bigger drivers. Also, the portion of total delay attributable to interconnect is increased. Signal integrity is viewed as a reliability problem. It places limitations on the routability of signals on lower level, minimum-pitch wiring. Increasing wiring pitch to accommodate crosstalk reduces layout densities, which is a problem in wire-limited designs.

1) Critical Path: We begin by creating a generic critical path model that will allow us to track ASIC clock frequencies through scaling. From empirical design data, we have determined that in 0.35-µm ASIC designs a critical path typically consists of about 14 stages. While we use a 2-input NAND as the gate type for each stage, we allow for different fan-out conditions, which are determined from fan-out distributions of the design data. Our model is broken down into blocks of 8, 3, 2, and 1 stages with fan-outs of 1, 2, 3, and 4, respectively. In addition, we include a global wire routed on the top metal layer, which is buffered to reduce delay. Dynamic delay effects are not considered on the global wire and a fixed buffer size is assumed throughout scaling (see footnote 3). Finally, very conservatively, a 10% timing overhead is allotted for the critical path due to clock skew, process variation (both device and interconnect), and other unmodeled phenomena. This modeled critical path closely reflects the characteristics of a typical path in a design.
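The critical path model just described reduces to a weighted sum of stage delays plus a global wire delay and an overhead factor. The Python sketch below is one possible reading of it; the per-fan-out stage delays and the global wire delay are hypothetical placeholders rather than simulated values, and treating the 10% overhead as a multiplier on the path delay is an assumption about how the allotment is applied.

```python
# Fan-out -> number of stages on the modeled critical path (14 stages total).
STAGE_MIX = {1: 8, 2: 3, 3: 2, 4: 1}


def clock_frequency_hz(stage_delay_s, global_wire_delay_s, overhead=0.10):
    """Estimate clock frequency from per-fan-out stage delays (seconds).

    Assumption: the timing overhead is applied as (1 + overhead) times the
    sum of logic delay and buffered global wire delay.
    """
    logic = sum(count * stage_delay_s[fo] for fo, count in STAGE_MIX.items())
    cycle = (logic + global_wire_delay_s) * (1.0 + overhead)
    return 1.0 / cycle


# Hypothetical delays, chosen only to exercise the model.
example_stage_delays = {1: 100e-12, 2: 130e-12, 3: 160e-12, 4: 190e-12}
f_clk = clock_frequency_hz(example_stage_delays, global_wire_delay_s=500e-12)
print(f"{f_clk / 1e6:.0f} MHz")
```

Replacing the placeholder delays with per-process simulation results would give one frequency per technology generation, which is how a table of clock frequency versus process could be assembled from this model.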

2) Results: The noise analysis uses worst case neighboring wire switching activity to determine new delay values. We model the worst case by having two adjacent wires simultaneously switch in the opposite direction as the victim line. Since the intrinsic gate delay has no wiring component, it will remain the same. However, as anticipated, the larger effective coupling capacitance due to the Miller effect results in larger stage delays. Results are shown in Fig. 7, which illustrates the relationship between gate delay and stage delay for various processes. Clearly, the presence of dynamic delay increases the portion of total delay attributable to interconnect. However, we still foresee a drop in the ratio of interconnect delay to total delay even considering noise effects. For example, considering noise effects, interconnect delay comprises 35% of the total delay at 0.25 µm but only 27% at 0.1 µm. These values are far from the 80% forecasts commonly reported and reflect the impact of shrinking average wirelengths on a chip.

Footnote 3: Scaling of the global buffer sizes does not change the results as significantly as one may think. This is due to the fact that the delay versus size plot becomes relatively flat for large drivers (as in Fig. 6).

Fig. 7. Evolution of stage delay relative to a fixed gate delay with and without noise considerations.

In general, worst case dynamic delay yields approximately an 80% increase in . This number corresponds very closely with the contribution of coupling capacitance to total line capacitance, which is about 75% throughout scaling. To compensate for the larger effective capacitance, the optimal driver size is increased. These larger devices contribute to enhanced power dissipation and may also reduce layout density if drivers are made large enough. For a fixed optimal device size of , we incorporate the new delay numbers into the critical path model to determine expected ASIC clock frequencies in DSM. Global wire delay is taken from simulations with a two-stage buffering system and a fixed stage width of 100 µm. Results are presented in Table 5 and demonstrate similar trends as [1], [13] with a 100–150-MHz performance increase. Isolating the logic delay component of the critical path, we compare its evolution normalized to 0.25 µm with results from Table 4. We find that the inverse of logic delay scales as 1.5, 1.94, and 2.63 here compared to the analytical results of 1.61, 2.18, and 3.07. The discrepancies are due mainly to the inclusion of junction and overlap capacitances as well as dynamic delay effects.
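The agreement between the roughly 80% worst-case increase and the 75% coupling share can be checked with a short calculation. The sketch below assumes the common Miller-factor model, in which a neighbor switching in the opposite direction doubles the effective coupling capacitance; the function name and the normalized capacitance are illustrative and do not come from the paper.

```python
def effective_capacitance(c_total, coupling_fraction, miller_factor=2.0):
    """Effective line capacitance when both neighbors switch against the victim.

    c_total           -- nominal line capacitance (any unit)
    coupling_fraction -- share of c_total that couples to adjacent wires
    miller_factor     -- assumed multiplier on the coupling capacitance
                         (2.0 models opposite-direction switching)
    """
    c_coupling = coupling_fraction * c_total
    c_ground = c_total - c_coupling
    return c_ground + miller_factor * c_coupling


nominal = 1.0  # normalized line capacitance (assumption)
worst = effective_capacitance(nominal, coupling_fraction=0.75)
increase = (worst - nominal) / nominal
print(f"worst-case capacitance increase: {increase:.0%}")
```

Under this model the added capacitance equals the coupling share itself, which is why the two percentages track each other so closely.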

3) Crosstalk: Crosstalk noise, or signal integrity, can be considered a reliability issue. Problems caused by crosstalk include functional errors due to false switching and enhanced device stress when bootstrapping occurs ( ). Crosstalk is of the utmost concern when it can result in functional errors, as in dynamic logic families such as domino [29] or in analog circuits. As a reliability problem, the most straightforward way to deal with signal integrity is the generation of accurate design rules. For instance, bounds may be set on the amount of tolerable crosstalk noise (e.g., 20% of $V_{dd}$). From this constraint, analytical models [30] can be used to define a critical line length for different metal layers and driver scenarios. This parameter, , is then compared to , which is defined as the maximum line length that can be used on a given metal layer before buffering becomes beneficial [31]. We have found

Table 5
ASIC Performance Predictions Including Dynamic Delay Effects

that for typical driver conditions, crosstalk is a more severe restriction on routing in lower level metals than delay (i.e., ). In addition, is also typically smaller than the dimensions of a 50K gate module within which it is desirable to connect the majority of gates with metals 1 and 2. These findings indicate that noise can be a limiting factor in routing at the lower metal layers, which may lead to a loss in routing density due to possible increases in pitch, shielding wires, or the need to route in higher layers.

C. Power

1) Importance of Power: Low-power designs, especially microprocessors, have received a large amount of attention recently as portable and wireless applications gain market share. Also, even in the highest performance designs, power has become an issue since the extremely high frequencies being attained (now exceeding 1 GHz) can easily lead to power dissipation in the many tens of watts. Dissipation of this amount of power requires heat sinks, resulting in higher costs and potential reliability problems. In this section, we discuss the reasons why power has become a significant issue and describe the three types of power consumption and how they can be expected to scale with CMOS processes.

In high-performance ASICs, there are three main reasons why power dissipation is rising. First, the presence of larger numbers of devices and wires integrated on a larger chip results in an overall increase in the total capacitance found on a design. Second, the drive for higher performance leads to increasing clock frequencies, and dynamic power is directly proportional to the rate of charging capacitances (in other words, the clock frequency). Finally, the use of scaled voltages to improve reliability, decrease delay, and, ironically, drop power consumption leads to an increase in leakage current. While decreasing supply voltages does result in significant power savings overall, the static, or standby, current increases, which can be detrimental to low-activity designs that require very little standby power consumption. An excellent overview of power issues in large-scale CMOS circuits is given in [32] along with a discussion of power modeling techniques in [33].

2) Dynamic Power: Dynamic power consumption occurs as a result of charging capacitive loads at the output of gates. These capacitive loads are in the form of wiring capacitance, junction capacitance, and the input (gate) capacitance of fan-out gates. The expression for dynamic


Fig. 8. Evolution of dynamic power density with scaling.

power was given previously and trends for several of the variables have already been discussed. Switching activity is a difficult parameter to estimate although it can frequently be approximated in the 0.1–0.2 range with reasonable accuracy.
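To illustrate how the switching activity range feeds into a power estimate, the sketch below assumes the standard CMOS switching-power expression, P = a · C · V_dd² · f. The capacitance, supply voltage, and frequency are placeholder values chosen for illustration and are not taken from the paper.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Standard CMOS switching power: P = a * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Placeholder inputs, not values from the paper.
C_TOTAL = 1e-9  # total switched capacitance, farads
VDD = 1.0       # supply voltage, volts
F_CLK = 1e9     # clock frequency, hertz

low = dynamic_power(0.1, C_TOTAL, VDD, F_CLK)
high = dynamic_power(0.2, C_TOTAL, VDD, F_CLK)
print(f"dynamic power range: {low:.2f} W to {high:.2f} W")
```

Because power is linear in the activity factor, the uncertainty between 0.1 and 0.2 translates directly into a factor-of-two spread in the estimate.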

In order to determine the impact of CMOS scaling on dynamic power consumption, we develop a simplified model of a 50K gate module that may exist in future SoCs. Given such information as packing density (devices/cm²), wiring pitches, average device size (taken from Section VI-B2), and routing density (metal occupancy), we calculate the dynamic power density through process scaling [1], [13]. We do this by first estimating the size of a module, then calculating the interconnect and device components of the total capacitance. Fig. 8 summarizes the results and shows that dynamic power density is not increasing appreciably despite the rise in clock frequency. This analysis implies that dynamic power dissipation will increase approximately proportionally to the chip area. It should be noted that power dissipation in the clock network, off-chip drivers, and memory blocks is excluded from this analysis of a simple standard cell module. For larger block sizes (up to 200K gates), our analysis indicates that larger drivers are required, contributing to higher power densities. For instance, we expect up to a 50% rise in power density for a block-based design methodology using 200K gate modules for equivalent performance to 50K gates.

3) Static Power: The dominance of CMOS in modern circuit design is due in large part to its lack of static power consumption. This perceived benefit is becoming more suspect as voltages are scaled in order to limit dynamic power. Ideally, when the gate voltage of a MOSFET is below the threshold voltage $V_{th}$, there is negligible conduction. However, a small amount of leakage current flows at these conditions due to the inability of the gate to completely turn off the conducting channel. A good approximation of the amount of static current is given by ( C) [9]:

Am (1)

It is seen that leakage current is an exponential function of threshold voltage. The need for threshold voltage scaling to maintain current drive has been discussed and results in a marked rise in leakage current as we move into DSM. Leakage currents

Table 6
Scaling of Static Power Consumption Within a 50K Gate Module

in the range of nA/µm become serious when considering the large integration levels in ULSI. Static power consumption is given by the product of the leakage current and the supply voltage. Table 6 calculates the leakage power density for a 50K gate module. There is a 2500× increase from 0.25 to 0.1 µm, demonstrating that leakage power is becoming a larger component of total power consumption that should not be ignored. The rapid rise calls for the use of multiple-$V_{th}$ processes or novel circuit techniques to limit standby power consumption. Finally, the exponential relationship between static power and $V_{th}$ is important since variation in $V_{th}$ can be significant. Devices exhibiting $V_{th}$ values at the lower tolerance limit of a process will exhibit considerably more leakage current than a nominal device. In the same manner, static current is strongly dependent on temperature, with high operating temperatures resulting in significantly worsened MOSFET subthreshold characteristics. Poor control of either $V_{th}$ or operating temperature may lead to wild fluctuations in static power consumption.
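The sensitivity of leakage to threshold voltage variation can be illustrated with a small sketch. It assumes the usual subthreshold behavior, in which leakage changes by one decade for every S millivolts of threshold voltage; the swing of 85 mV/decade and the threshold values are illustrative assumptions and are not the coefficients of (1).

```python
def leakage_ratio(vth_nominal_mv, vth_actual_mv, swing_mv_per_decade=85.0):
    """Leakage relative to a nominal device, assuming I_off ~ 10^(-Vth / S)."""
    return 10 ** ((vth_nominal_mv - vth_actual_mv) / swing_mv_per_decade)


# A device whose Vth sits 50 mV below nominal (illustrative numbers).
ratio = leakage_ratio(vth_nominal_mv=300.0, vth_actual_mv=250.0)
print(f"leakage relative to nominal: {ratio:.1f}x")
```

Even a modest downward shift in threshold voltage multiplies the leakage several times over, which is the reason tight process control matters for standby power.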

4) Short-Circuit Power: The final component of power dissipation is short-circuit power. Finite rise and fall times at the input of gates mean that both the pull-up and pull-down networks of a CMOS gate are conducting simultaneously for a short period of time. During this time, current is flowing between $V_{dd}$ and ground, resulting in short-circuit power dissipation. Research in this area has demonstrated that well-designed circuits exhibit short-circuit power that is less than 20% of the dynamic component, with 5%–10% a more typical value [34]. Well-designed circuits strive to maintain reasonable input and output rise times so that short-circuit current cannot flow for an appreciable amount of time. In summary, short-circuit power dissipation is a manageable portion of the total power budget and can be approximated as 10% of the dynamic power consumption.

VII. MICROARCHITECTURAL IMPLICATIONS

Logical hierarchy in an integrated circuit design is the hierarchy associated with the logical function of the circuit. Physical hierarchy is the hierarchy associated with the physical floorplan and layout of the circuit. In high-performance integrated circuit design, it is common for the logical hierarchy to be nearly identical with the physical hierarchy. This allows for large elements of a design to be designed independently both functionally and physically.

In response to our analysis above, we have proposed the use of a block-based hierarchical design style that limits physical block sizes to 50K–100K gates. The natural implication of this for IC microarchitecture is that the microarchitecture of the circuit be decomposed into blocks of 50K–100K gates. It is natural to ask: is this realistic?




A review of the physical implementation of a recent high-end microprocessor (Alpha 21264) indicates that the floorplan naturally divides into approximately 25 blocks. Aside from memory elements, the modules, such as branch-prediction units, integer execution units, and reorder units, were each under 100K gates.

The largest ASICs manufactured today are entire systems-on-a-chip. These SoCs typically consist of a number of intellectual-property (IP) blocks connected by means of standard on-chip buses. The IP blocks can largely be broken into two categories: high-complexity blocks such as microprocessor cores or MPEG decompression units and lower complexity blocks such as peripheral units. We contend that most of even the high-complexity IP blocks will fit within 100K gates. For example, a 32-bit RISC microprocessor core contains approximately 50K gates (not including memory arrays), while a 24-bit DSP has 45K–50K gates and MPEG audio/video cores range from 60K–135K gates [35]. In other words, a design reuse methodology based on IP blocks is very suitable to our modular design style.

In summary, it appears that a hierarchical design approach in which designs are decomposed into 50K–100K gate blocks is suitable for the microarchitectures of both high-end microprocessors and high-end SoCs. In fact, it appears that such a methodology is already in use, at least for leading-edge designs. The next natural question is: how will these blocks be interconnected in the microarchitectures of future SoCs in small process geometries? To address this issue, we must look beyond the module boundaries and consider global interconnect issues.

VIII. GLOBAL WIRING ISSUES

To this point, we have presented an alternative analysis of DSM effects and analyzed how they are likely to impact future design methodologies. We proposed a new DSM design methodology based on the use of 50K–100K gate modules as primitive building blocks. Global wiring was included in the critical path delay analysis of Section VI by approximating a typical global wirelength (1 cm in 0.25 µm, scaled by 15%/generation) and optimally buffering it with respect to delay. However, a more detailed analysis is necessary since there are a host of global interconnect problems that were not sufficiently addressed there. To motivate this analysis, we begin by noting that semiconductor processing is advancing at such a rate as to enable the integration of hundreds and even thousands of 50K–100K gate blocks at sub-0.1-µm process geometries. Block-level placement and routing of thousands of such modules under global timing, power, noise, and area constraints will be the major design challenge of DSM (see Fig. 9).

The remainder of this work seeks to analyze and quantify the global impact of interconnect on future high-performance designs using the microarchitecture discussed above. Specifically, we shift our focus to microprocessors as these designs achieve the highest performance and will meet the limitations of global interconnect first. The main question to

Fig. 9. The design challenge in DSM is the assembly of thousands of 50K–100K gate modules considering chip-level interconnect effects.

be answered is whether or not the new design methodology of this paper will create insurmountable problems in global routing. Other key questions concerning global interconnect to be examined include:

• Over what sized region will we be able to propagate a high-speed signal ( 1 GHz) across chip in a single clock cycle? How will this change in die sizes up to 750 mm²?
• Can a high-speed clock be distributed reliably across a chip in light of increasing die sizes and process variation?
• Does noise at the global level pose a significant signal reliability concern?
• Will inductance result in severely degraded signal integrity?

Each of these topics will be discussed in detail, beginning with primary delay issues such as signal time-of-flight and scaling of global wires. Providing a comprehensive solution involves orchestration of a number of existing methodologies and technologies: unscaled global wires, flip-chip packaging, shield wires, etc. To demonstrate, a representative back-end process for 0.05-µm microprocessors is suggested.

IX. PRIMARY DELAY ISSUES

The foremost problem posed by long interconnect in DSM is that of the reverse scaling properties exhibited by wiring. This well-documented phenomenon implies that continual scaling (i.e., shrinking) of global interconnect, in conjunction with rising die sizes, will soon limit the attainable clock frequencies in a microprocessor. For instance, beginning with the 0.18-µm technology generation, [1] predicts a divergence of global and local clock frequencies due to the impact of global interconnect. In this section, we look at the concepts of signal time-of-flight and global conductor dimensions and the constraints they place on global communication.

A. Time-of-Flight

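Larger die sizes and higher clock frequencies predicted in [1] imply that time-of-flight may become a limiting upper bound on speed. The time-of-flight (TOF) for a signal traveling a distance $L$ in a homogeneous medium is given by

$$\mathrm{TOF} = \frac{L\sqrt{\varepsilon_r}}{c} \qquad (2)$$

where $c$ is the speed of light in vacuum.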


Here, $\varepsilon_r$ is the dielectric constant of the medium (approximately 4.0 for SiO₂). For example, a 750-mm² die, as predicted for the 0.05-µm technology generation [13], cannot support a global clock frequency greater than 5.48 GHz using Manhattan routing techniques.4 This value is an upper bound since it uses air as a dielectric ($\varepsilon_r = 1$) and the entire clock cycle to traverse the longest potential path. A more realistic value would allot 80% of the clock cycle to this path and use a dielectric constant of 1.5, resulting in a maximum global clock speed of 3.58 GHz. This speed is near the projected global frequency of 3 GHz, but it is more interesting to look at the impact of TOF on locally clocked (isochronous) regions.
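The two frequency bounds quoted above can be reproduced from the time-of-flight relation. The sketch below assumes a square die, so that the corner-to-corner Manhattan path is twice the side length, and uses a rounded speed of light; the function name and parameters are illustrative.

```python
import math

C_LIGHT = 3.0e8  # speed of light in vacuum, m/s (rounded)


def max_global_clock_hz(die_area_mm2, eps_r, cycle_fraction=1.0):
    """Upper bound on clock frequency for a corner-to-corner Manhattan route.

    Assumes a square die, so the path length is twice the side length.
    cycle_fraction is the share of the clock period allotted to the flight.
    """
    side_m = math.sqrt(die_area_mm2) * 1e-3
    path_m = 2.0 * side_m
    tof_s = path_m * math.sqrt(eps_r) / C_LIGHT
    return cycle_fraction / tof_s


ideal = max_global_clock_hz(750.0, eps_r=1.0)
realistic = max_global_clock_hz(750.0, eps_r=1.5, cycle_fraction=0.8)
print(f"{ideal / 1e9:.2f} GHz, {realistic / 1e9:.2f} GHz")
```

With these assumptions the ideal case gives about 5.48 GHz and the 80%-of-cycle case with a dielectric constant of 1.5 gives about 3.58 GHz, matching the figures in the text.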

Due to TOF restrictions alone, an entire 750 mm² die is not reachable in a single clock period for frequencies in the range of 10 GHz (Fig. 10). Due to advances in processing, this clock speed will be realizable as approximately the delay through ten loaded stages in a 0.05-µm process. Assuming this divergence of clock speeds, TOF limitations on signal propagation do not play the major role in CMOS designs down to 0.05 µm; other effects such as RC delay, inductance, and reliable clock distribution will be more limiting factors. Fig. 10 reinforces this point by illustrating the relationship between TOF delays and die size. It is also interesting to note that the implementation of new low-k dielectric materials tends to offset the larger die size to provide a fairly constant maximum signal TOF for each technology generation. For instance, we expect a pathological corner-to-corner signal to have a time-of-flight of 220 ps regardless of technology node. Although this value does not increase due to the use of low-k materials, even a constant value will eventually render TOF issues important for global signaling. However, at 0.05 µm, this effect is still not dominant given a microarchitecture with multiple clock speeds.

B. Scaled Global Wiring

The current wiring paradigm calls for shrinking metal pitches at each generation in order to maintain sufficient routing densities. This is appropriate at the local level where wirelengths also decrease with scaling due to smaller gate sizes. In addition, this paradigm has worked at the global level before the DSM regime since RC delays of even global wires were insignificant compared to gate delays and clock cycle times. However, shrinking clock periods and transistor delays put the current wiring scheme in danger. To examine more closely, a nominal global wiring pitch of 2 µm is used as a point of reference at 0.25 µm. For each successive technology shrink, the global wiring pitch is reduced by a factor of 0.75. This leads to a final wiring pitch at 0.05 µm of 0.48 µm. Aspect ratio is held at 1.5 for all technologies, which is in keeping with published reports by leading manufacturers [36].5
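The scaled global pitch at each generation follows from repeated multiplication by 0.75. The sketch below assumes the generation sequence 0.25, 0.18, 0.13, 0.10, 0.07, and 0.05 µm; the 0.07-µm entry in particular is an assumption, since the text does not list the intermediate nodes.

```python
NODES_UM = [0.25, 0.18, 0.13, 0.10, 0.07, 0.05]  # assumed generation sequence


def scaled_pitches(start_pitch_um, factor, nodes):
    """Return {node: pitch}, shrinking the pitch by `factor` per generation."""
    return {node: start_pitch_um * factor ** i for i, node in enumerate(nodes)}


pitches = scaled_pitches(2.0, 0.75, NODES_UM)
for node, pitch in pitches.items():
    print(f"{node:.2f} um node: global pitch {pitch:.2f} um")
```

Five shrinks from 2 µm give a pitch of roughly 0.47 µm, in line with the 0.48 µm quoted for the 0.05-µm generation.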

4 A later version of the roadmap [1] projects even larger microprocessor die sizes (by about 10%). However, upcoming roadmap revisions point toward more compact dies ( 500 mm² at 0.03–0.05 µm) that further limit the impact of time-of-flight effects.

5 Global aspect ratio may be slowly increasing, which would yield better RC characteristics and somewhat worsened noise than our analysis suggests.

Fig. 10. Relationship between TOF delays and die size. The figure focuses on 50-nm microprocessors with .

Table 7
Optimized Number of Repeaters for a Single Corner-to-Corner Wire in Each Process Technology

With these wiring pitches, we examine critical line length and optimal buffer sizes. Critical line length is a concept based on the fact that there is a maximum wirelength for each metal level; wirelengths longer than this should be broken up using repeaters [31]. Since the pathological corner-to-corner wirelength will always be much longer than this critical line length ( ), we can use these two values to determine the number of repeaters needed to drive such a wire. Results are included in Table 7. Given this information, we now calculate the minimum delay for a pathological wire using repeaters and scaled global wire dimensions. Fig. 11 compares this minimum delay for two wire widths to the global cycle time supplied in [1]. We see that beyond a certain point the delay actually increases since the wire RC product is rising at the same time that the line length is increasing. Clearly, the scaling of global conductors is not compatible with the expected rise in global clock speeds. This trend has been identified by industry; IBM's 0.18-µm generation uses a larger global metal pitch than the previous generation. Fig. 12 explores the power ramifications of using scaled global wires. The critical line length drops quickly due to rising line resistance, resulting in huge amounts of repeaters. At 0.05 µm, this wiring paradigm will consume 40% of the projected total power just for global interconnect distribution (repeaters + wires) while yielding severely degraded performance.
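The repeater count for a long wire follows from dividing its length by the critical line length. A minimal sketch, assuming one driver per segment and a square die; the 3-mm critical length is a placeholder rather than a value from Table 7, and `segments_needed` is a hypothetical helper name.

```python
import math


def segments_needed(wire_length_mm, critical_length_mm):
    """Smallest number of driven segments, none longer than the critical length."""
    return math.ceil(wire_length_mm / critical_length_mm)


# Corner-to-corner Manhattan route on a square 750 mm^2 die (assumed shape).
corner_to_corner_mm = 2.0 * math.sqrt(750.0)

# Placeholder critical line length; real values depend on layer and process.
print(segments_needed(corner_to_corner_mm, critical_length_mm=3.0))
```

As the critical line length shrinks with each generation while the die grows, the segment count, and with it the repeater power, rises on both counts.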

To help integrate power considerations, a modification is made to Bakoglu's optimal buffer sizing expression [4] to provide an area-optimal buffer size. By multiplying a weighted area function ( , where is the device


Fig. 11. Delays for corner-to-corner wires compared to global clock cycle from [2]. Scaled global wiring is used.

Fig. 12. Power for all repeaters and global interconnect (sized 2 ) where 50% of all devices are logic.

width) by the delay ( ), we obtain a new objective function. Since optimization of delay alone usually results in overly large buffers with high power requirements, we have used the weighted area function to balance power and delay concerns. The product of and is then differentiated and the minimum value is found at:

(3)

This expression gives a smaller value of than the original formulation in [4]; the delay is consequently higher but the area and power savings are considerable. At line lengths that approach , this formula gives areas that are typically 50% to 65% smaller than [4]. The delay penalty remains under 20% until the line length is about 1/4 of , at which point the delay is not substantial (Fig. 13). At , the typical delay penalty is 11%. As a simpler approximation to (3), we have found that at the is 50.8% of from [4] or simply 1/2.

Fig. 13. Comparison between new area-optimal repeater size and delay-optimal repeater expression [4].

Fig. 14. becomes less than twice the chip-side length past 130 nm for minimum-pitch fat wiring. Wiresizing allows to exceed at all technology nodes.

C. Fat Wiring

1) Performance Analysis: The use of fat, or unscaled, wires at the global metal levels was first suggested in [5]. Unscaled wiring has the benefit of a fixed low RC delay at the expense of fewer available routing tracks. In this work, we present a more comprehensive analysis of the performance and routing impact of using such fat wires. Let us first examine the performance aspect of fat wires. Fat wires in this work have a pitch of 2 µm and a thickness of 1.5 µm. The resistance of a cladded copper wire at these dimensions is 147 Ω/cm. Capacitance varies from approximately 2 pF/cm to 0.75 pF/cm in 0.05-µm technology. Fig. 14 explores the maximum reachable distance, , for a minimum-pitch fat wire at each generation of interest. is defined as the distance that can be traveled in 80% of the global clock cycle predicted in [1] and is found using analytical delay expressions [37]. This is compared to the pathological corner-to-corner wirelength, . ITRS values for die size are used in the expectation that they will present an upper bound on chip area according to current design trends. We see that, even with fat wires, corner-to-corner wirelengths cannot be accommodated past 0.18 µm.


Table 8
Wiresizing Leads to Increased Values (0.05-µm Values Shown)

Fig. 15. Fat wiring yields lower power and better performance. Area-optimal drivers cut power by 30% and the number of repeaters is reduced from the scaled wire scenario.

However, the situation is not as bad as it appears. Based on empirical data and global wirelength models [28], [38], we project that the percentage of global wires with lengths is very small, under 2% throughout. Proper floorplanning may be able to further limit these wires. In addition, when using larger than minimum linewidths, increases such that, even at 0.05 µm, it exceeds . Thus, for very long wires (a very small portion of the total global picture), wiresizing techniques can be used to maintain acceptable performance at the penalty of increased power. Table 8 demonstrates the effectiveness of wiresizing at 0.05 µm with calculated as in [31]. By doubling the linewidth from the minimum, rises by 26%, which translates directly to since repeaters reduce delay dependency to linear. Two scenarios are shown: 1) spacing is kept at minimum and 2) spacing is equal to linewidth. The second case has severe routing resource penalties with little performance gains. Fig. 15 revisits the power issue using fat wires and repeaters with calculations based on [28], [39]. We find that power is actually slightly decreased from the alternative scenario using scaled wires. Furthermore, ITRS global clock frequencies can be met throughout the roadmap using fat wires. Also, the use of area-optimal repeaters rather than delay-optimal (80% area-optimal, corresponding to less critical paths) reduces global interconnect power consumption significantly, by 30% at 0.05-µm technology.

2) Routing Resources: This section assesses the impact of fat wiring on routing resources by examining several factors that limit IC routing capacity including power distribution, via blockage, and clock routing. In the discussion, we will focus on the logic portion of a design as the wiring requirements for memory are generally much lower than those for logic.

The power distribution network uses significant routing resources, especially at the top layers. In this work, power grid dimensions are found by limiting the peak IR voltage drop to under 4% of $V_{dd}$. Analytical expressions are derived [40] to describe the IR drop of an arbitrary layer as a function of metal linewidth. When reliability constraints are met, the percentage of routing resources used is calculated for each metal layer.

It has been estimated empirically that for metal layers with equivalent pitches, an upper layer blocks 12%–15% of an underlying layer due to its need to connect to the substrate using vias [5].6 However, when larger metal pitches are used on higher levels, the amount of blockage is reduced by using fixed-size (or nearly fixed-size) vias. The relationship between via blockage and metal pitch is linear in this case. With a multilevel interconnect system, vias connecting the top layer to the substrate necessarily block all underlying levels. Thus, metal one is blocked by all subsequent layers, resulting in a sizable loss in its routing capacity. Fortunately, upper layers have significantly larger pitches than bottom levels, reducing the via penalties associated with multilevel interconnect.

Clock distribution also serves to reduce available signal routing resources. Due to the regularity of H-tree structures, the total wiring required for such a network can be found fairly accurately. Given the number of clusters in the H-tree (see Section XII), the total wirelength is given by a simple analytical expression in terms of the chip-side length. Additional within-cluster routing is approximated using a heuristic that allocates wires from the central driver to the perimeter of the cluster in all directions [39].

Routing tools cannot fully utilize all the available routing resources for a given design. This is mainly due to the algorithms used within the routing tools. This effect is modeled by first calculating the available routing resources after clock routing, power/ground routing, and via blockages. At this point, the routing area is multiplied by a routing efficiency factor to give the estimated available routing resources. The routing efficiency factor is set at 0.5, which is based on discussions with industry CAD engineers.

Based on the above models, the wirability of a generic 0.05-µm microprocessor has been studied to determine the feasibility of the fat wiring scheme. Global wires are defined as any wires leaving a 50K gate module. We further classify these global wires into semiglobal and global. Wires that leave an isochronous (locally clocked) region are termed global wires as they will run at the slower global clock frequency and are routed on fat wiring levels. Wires that leave modules but not their isochronous regions are termed semiglobal and are routed on semiglobal wiring, which are

6 Recent work has shown this approximation to be pessimistic [41]. Thus, we contend that our results on routing resource availability will also be slightly pessimistic.


not considered to be fat wires (i.e., they scale from process to process). This routing hierarchy is discussed further in the next section. Routing resources for both semiglobal and global wiring are considered.

There are found to be 100 isochronous regions, each of which contains 35 modules of 50K gates each. This design corresponds to a 50% logic (device count) microprocessor where 67% of the chip area is used for logic. Designs with extensive memory (75% or more) are more likely and will be more easily wirable.

We assume that semiglobal routing will use dimensions that are 3× the minimum allowable. For 0.05 µm, this corresponds to a contacted pitch of 0.45 µm. Aspect ratio is set at two to compromise between resistance and noise effects [16]. Using Rent's rule-based models [42], an average intra-isochronous wirelength of 1.15 mm is found for fan-out of two. While this value may seem small, it is actually greater than 50% of the isochronous region side-length. With an expected average wiring pitch of (conservatively allowing for wiresizing), the total required semiglobal routing resources for each isochronous region is 3.32 mm².

Available routing is now determined, starting with two levels of semiglobal wires. If the requirements exceed capacity, a third semiglobal level will be required. Via blockage depends on upper layers that have not yet been assigned; we use final results of fat wiring optimization to yield accurate results. Via blockage for the uppermost layer is 16% and for the underlying layer, 29%. Minimum-pitch wiring is found to be acceptable for power distribution with wiring usage of less than 2%. Clock distribution is performed on fat wiring layers so its impact on semiglobal resources is neglected. With a routing efficiency of 0.5 we find these two layers to provide 3.84 mm² of routing, which is greater than the requirements. Note, however, that significant wiresizing such that the average pitch exceeds would yield an unwirable design.
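The supply-versus-demand comparison for one isochronous region can be sketched as follows. Several inputs are assumptions rather than values stated in this passage: the region area is derived as 750 mm² × 67% / 100 regions, blockage and power usage are treated as simple subtractive fractions, and power usage is set to 1% because the text only bounds it at under 2%.

```python
def available_routing_mm2(region_area_mm2, layers, efficiency=0.5):
    """Usable routing area summed over layers.

    layers     -- list of (via_blockage, power_fraction, clock_fraction)
    efficiency -- routing efficiency factor applied to what remains
    """
    usable = 0.0
    for via, power, clock in layers:
        usable += region_area_mm2 * (1.0 - via - power - clock)
    return efficiency * usable


# Assumed region area: 750 mm^2 die, 67% logic area, 100 isochronous regions.
region_area = 750.0 * 0.67 / 100.0

# Via blockage from the text; 1% power usage is an assumption (text: under 2%).
# Clock routing is neglected on the semiglobal layers.
layers = [(0.16, 0.01, 0.0), (0.29, 0.01, 0.0)]

supply = available_routing_mm2(region_area, layers)
demand = 3.32  # required semiglobal routing per region, mm^2
print(f"supply {supply:.2f} mm^2, demand {demand:.2f} mm^2, wirable: {supply > demand}")
```

Under these assumptions the two semiglobal layers supply close to the 3.84 mm² reported, leaving only a modest margin over the 3.32 mm² required.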

Let us now examine global routing. Using the same approach as before (where the global Rent's exponent is slightly smaller due to floorplanning emphasis and potential delay penalties associated with global routing), the average wirelength is calculated to be 7.84 mm (with fan-out of 2). There will be nets with much longer lengths than this; however, many signals will only need to travel from one isochronous region to its nearest neighbor. The length of such a net will be in the range of 2–3 mm. Therefore, this wirelength is a conservative intermediate value between relatively rare pathological lines and more typical shorter global wires. Routing requirements are determined with an average wiring pitch of . We use a wider average pitch here than in the semiglobal case because we anticipate the use of shielding wires in order to deal with inductance. This will limit routing density and serve to increase the effective wiring pitch somewhat. Recall that very few (<2%) fat wires will require wiresizing to achieve delay targets.

At this point, we forecast the need for two classes of fat wires. The extreme wires ( ) require very fat wiring tracks. However, the shorter global wires also discussed will only waste routing resources if they are routed at these fat levels. So, we allow for the classification of global wires between shorter global wires and extreme global wires. The former will have greater numbers but shorter average lengths whereas the latter will be very few in absolute numbers but their length and performance impact is significant. Given ITRS expectations for nine to ten metal layers at 0.05 µm, four to five layers remain for global usage (three for local and two for semiglobal). Using the more aggressive number, we further break this into two fat layers and two layers that will have 1/2 the pitch and thickness of the fat layers. Using µm, we find the routing requirements to be 656.5 mm² by approximating that 2/3 of the wires will be routed on the lower global layers since their capacity is doubled. Our analysis indicates that vias block 15% of metal 8, 14% of metal 7, and 27% of metal 6 routing tracks in this wiring scheme. Power and ground distribution accounts for 5% on the top level (flip-chip is used with 10-mV voltage drop) and 2% on subsequent layers. Clock distribution (both local and global) is found to consume 5% of routing area on the top two levels (we assume that, in order to limit process variation, a shielding plate is used underneath the top level distribution) and negligible amounts of the other global layers. Summing these contributions, we find a total available routing area of 806 mm². This system is therefore wirable, even in the presence of anticipated shield wires.

This analysis indicates that by adding global wiring layers during each generation (while remaining within the bounds of ITRS expected metal layers), large-scale microprocessors remain wirable at 0.05 µm ( 1 billion transistors). For instance, six of the nine metallization levels at this process technology should be used solely for global routing, where global routing is defined as routing among 50K gate modules.

X. ROUTING HIERARCHY

Our results indicate that due to global RC delays as well as time-of-flight considerations, the global clock will necessarily be slower than the achievable local clock frequency. Early examples of such a microarchitecture are beginning to appear in the most advanced microprocessors being developed today (1.5 GHz); speed sensitive functions (e.g., integer execution units) may use a higher frequency clock than other modules. We expect this 2-clock architecture to fully evolve around the 0.13- or 0.1-µm generation, which is one to two generations later than predicted in [1]. The local clock speed will be set roughly by the delay time through ten loaded gates (approximately 10 GHz at 0.05 µm). This will continue to rise as long as faster devices can be made. However, the global clock speed will be determined by the propagation delay of the longest global interconnect. Alternatively, global signals can be divided into a number of clock periods by the insertion of latches along the repeater path. In any case, it is clearly in the best interests of the designer to keep the global clock as fast as possible (or as close to the local clock as possible) in order to reduce latency that will be associated with global accesses. The problem of increasing the global clock speed becomes equivalent to reducing the length


Fig. 16. Application of a new wiring hierarchy to a 50-nm microprocessor.

of the longest global interconnect. Thus, timing-driven floorplanning will be key in that reducing the pathological wirelength from to will effectively double the global clock speed (since repeaters reduce the delay problem to a linear one).
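The two clock relationships in this section can be sketched numerically. The snippet below is a minimal illustration, not the authors' model: the 10-ps loaded gate delay is the value implied by a 10-GHz clock spanning ten gates, while the per-millimetre delay of a repeated global wire (`DELAY_PS_PER_MM`) and the wire lengths are hypothetical placeholders chosen only to show the linear scaling.

```python
def local_clock_ghz(gate_delay_ps: float, gates_per_cycle: int = 10) -> float:
    """Local clock frequency when one cycle spans `gates_per_cycle` loaded gate delays."""
    return 1000.0 / (gates_per_cycle * gate_delay_ps)


def global_clock_ghz(longest_wire_mm: float, delay_ps_per_mm: float) -> float:
    """Global clock frequency if one cycle must cover the longest repeated wire.

    With repeaters, wire delay grows linearly with length, so the period is
    simply length * (delay per mm).
    """
    return 1000.0 / (longest_wire_mm * delay_ps_per_mm)


DELAY_PS_PER_MM = 20.0  # hypothetical repeated-wire delay, for illustration only

print(local_clock_ghz(10.0))                    # ten 10-ps gates per cycle
print(global_clock_ghz(20.0, DELAY_PS_PER_MM))  # hypothetical 20-mm longest wire
print(global_clock_ghz(10.0, DELAY_PS_PER_MM))  # same wire after halving its length
```

Because the repeated-wire delay is linear in length, halving the longest wire doubles the attainable global clock frequency regardless of the per-millimetre delay chosen.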

Furthermore, with the use of fat wiring on global metallization layers, we demonstrate that cross-chip global wires will not require ten or 20 clock cycles for propagation at 0.1 µm as described elsewhere. In fact, we expect that global clock speeds will be kept within a factor of 4 of the local clocks to at least 0.05 µm. This implies that fewer than four local (fast) clock cycles will be needed to cross the chip using dedicated global wiring. As mentioned above, this relatively low latency can be achieved by dividing long global signal paths using latches or by using a separate global clock. The former may reduce some overhead incurred when introducing a separate clock but the power implications of each approach are not clear. In either case, fat or slowly scaled global wiring is the key to attaining low latency global signaling.

What is the size of a locally clocked, or isochronous, region at 0.05 µm? This question can be answered by determining how far we can transmit a signal within a single local clock cycle. Using the top-level fat wiring ( µm) we find that, in 80% of the 10-GHz local clock cycle, is about 14 mm. This value corresponds to 16 isochronous regions with an area of 47 mm² each. However, a different approach could be used to determine isochronous region size. Using lower levels of metal (not fat wiring), we will obtain a smaller value of , leading to more isochronous regions. The advantage here is the exclusion of fat wiring from being used inside of these locally clocked zones, freeing up more routing tracks for longer global wiring.

This point forms the basis for a new wiring hierarchy that complements the envisioned DSM microarchitecture. The microarchitecture at 0.1 µm and beyond consists of three levels: the global level, the isochronous level, and the module level. The optimal wiring hierarchy used to interconnect these designs also consists of three levels: global routing (connecting elements at global clock frequency), semiglobal routing (connecting modules within isochronous regions), and local routing (connecting gates within modules). In this manner, each level of the wiring hierarchy has a dedicated purpose: to provide connectivity for its corresponding level of the microarchitecture. Fig. 16 and Table 9 further explain this new wiring hierarchy. Having outlined our general approach to managing global interconnect, we now consider a number of particular factors that might inhibit global wiring performance in deep submicrometer

Table 9. Back-End Process Parameters for a 0.05-µm Microprocessor Using the Proposed Wiring Hierarchy
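One way to see how a 14-mm single-cycle reach leads to 16 isochronous regions of about 47 mm² is sketched below. This is an interpretation rather than the authors' stated procedure: it assumes square regions, Manhattan routing (so the corner-to-corner distance is twice the side), and a die area of 750 mm², a value the paper quotes elsewhere only as an example.

```python
import math


def isochronous_regions(die_area_mm2: float, reach_mm: float) -> tuple[int, float]:
    """Return (number of regions, area per region in mm^2).

    Assumes square regions and Manhattan routing: a signal must cross from one
    corner to the opposite corner (2 * side) within the single-cycle reach.
    """
    side_mm = reach_mm / 2.0
    max_region_area = side_mm ** 2
    count = math.ceil(die_area_mm2 / max_region_area)
    return count, die_area_mm2 / count


count, area = isochronous_regions(die_area_mm2=750.0, reach_mm=14.0)
print(count, round(area, 1))
```

A shorter reach, such as one obtained on lower metal levels, shrinks the allowed side and therefore raises the region count quadratically.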

XI. INDUCTANCE

Inductive effects are expected to become more significant in future DSM designs since signal bandwidth is increasing as on-chip rise times decrease. A good definition for signal bandwidth is given by 0.35/t_r, where t_r is the rise time, which corresponds to a cutoff frequency at the 3-dB points [43]. For instance, a 1-GHz signal with 100-ps rise/fall times has a bandwidth of 3.5 GHz, which is significantly greater than the operating frequency.
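The bandwidth example can be checked with a one-line calculation. The sketch below assumes the common rule of thumb that bandwidth is 0.35 divided by the rise time, which reproduces the 3.5-GHz figure quoted for a 100-ps edge; the function name is illustrative.

```python
def signal_bandwidth_ghz(rise_time_ps: float) -> float:
    """Approximate 3-dB signal bandwidth from rise time, using BW = 0.35 / t_r.

    0.35 / (t_r in ps) gives a result in THz, so multiply by 1000 for GHz.
    """
    return 350.0 / rise_time_ps


print(signal_bandwidth_ghz(100.0))  # 100-ps edges
print(signal_bandwidth_ghz(50.0))   # faster edges raise the bandwidth
```

The bandwidth depends only on edge rate, not on clock frequency, which is why a 1-GHz signal can carry significant energy well above 1 GHz.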

If the product of inductance and angular frequency (ωL) is comparable to the line resistance, a first-order statement that inductive effects are important can be made. More accurately, expressions from [44] can be used to define a range of line lengths at which inductance should be considered.7 We have examined each technology generation to determine the relevant wirelengths with regard to inductance. Fig. 17 examines the relationship between the intervals found and the critical line lengths for 0.05 µm. wiring is used at . Wider wires have also been studied; their smaller resistance and larger capacitance make the range of significant wirelengths much greater. In Fig. 17,

7 These expressions assume that the line is being properly driven; overdriving RLC lines results in ringing effects and additional delay.




Fig. 17. Minimum pitch global wiring at 50 nm requires shield wires within a few pitches to limit inductive effects.

Fig. 18. Suggested global routing practice for sub-100-nm technologies. Vdd/GND lines supply close current return paths to reduce inductance.

spacing to ground is varied; inductance is a weak function of conductor geometry but a strong function of the distance to the current return path. Large inductive loops are created when the return path is far away whereas inductance can be limited by routing nearby ground lines to provide a stable current return path. This figure shows the importance of having a nearby current return path that can be accomplished by using shield wires as demonstrated in Fig. 18. It was shown in [45] that the use of interdigitated shield wires is normally a more effective approach to limiting inductance than the sandwiched ground plane approach taken in some commercial microprocessors [46]. Our results indicate that the range of inductive effects expands for scaled processes to include more and more of the useful wirelength spectrum for a design. Extensive use of shield wires will drastically reduce routing density, further emphasizing the need for added global metallization levels.
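The first-order test described earlier in this section (comparing the product of inductance and angular frequency with the line resistance) can be sketched as a simple ratio. The per-unit-length inductance and resistance values below are hypothetical placeholders, not data from this paper; only the 3.5-GHz frequency comes from the bandwidth example in the text.

```python
import math


def inductive_ratio(l_nh_per_mm: float, r_ohm_per_mm: float, freq_ghz: float) -> float:
    """Ratio of inductive reactance (omega * L) to resistance, per unit length.

    GHz * nH = ohm, so no further unit conversion is needed. Values near or
    above 1 suggest that inductance should not be ignored.
    """
    omega_l = 2.0 * math.pi * freq_ghz * l_nh_per_mm
    return omega_l / r_ohm_per_mm


# Hypothetical wire parameters, for illustration only.
print(round(inductive_ratio(l_nh_per_mm=0.5, r_ohm_per_mm=20.0, freq_ghz=3.5), 2))
# A nearby return path lowers loop inductance and hence the ratio.
print(round(inductive_ratio(l_nh_per_mm=0.2, r_ohm_per_mm=20.0, freq_ghz=3.5), 2))
```

In this picture, a wider (lower resistance) wire raises the ratio while a closer shield wire lowers it, which matches the qualitative trends described above.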

The most significant inductive effect on signal wiring may be in the form of mutual inductive coupling. The impact of shield wires on reducing inductive coupling noise is not clear. The nonlocality of inductive coupling makes analysis a much more difficult problem than capacitive coupling since the number of potential aggressors is substantially larger.

XII. CLOCK DISTRIBUTION

Another major issue concerning chip-level interconnect is that of clock distribution. As the clock cycle shrinks, we see a corresponding drop in allowable clock skew. However, rising die sizes mean that a larger overall clock distribution network must be provided. These two points lead to a fine-grain clock network in which the growing network is made up of increasing numbers of shrinking components. In this section, we apply the well-established buffered H-tree to future designs to determine if it can continue to provide low-skew, high-speed clock distribution. Modified H-tree designs that take into account nonuniform clock sinks by modifying line lengths and driver strengths are directly extendable from this analysis.

Concentrating on local skew, we use BACPAC to find the required size of an H-tree for a 0.05-µm design. For a local clock cycle of 100 ps, we allot 5% of this to local skew and 5% to global skew. Modeling of global clock skew is complex since it requires an estimation of process variation at all intermediate levels of the H-tree. On the other hand, local skew is determined mainly by the size of a cluster (smallest component of an H-tree) and localized process variation at the last level of buffering.

Using this approach at 0.05 µm, an appropriate H-tree contains 4096 clusters and yields local skew of 1 ps, or 10% of a loaded gate delay. We assume 10% variation in relevant performance parameters such as interconnect and gate capacitances and drive current. Each cluster has a size of 0.183 mm² and contains one or two 50K gate modules. This clock tree corresponds to the local clock distribution; the global clock must also be distributed over the entire die. While a clock tree containing 4096 clusters seems complex, the buffered H-tree structure has significant advantages that may allow it to continue as the clock network of choice in DSM SoCs. It does not use large amounts of wiring, has relatively low power consumption (compared to grid-based networks), and CAD tools exist to exploit the regularity of its structure in the design phase.
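The H-tree figures quoted here can be cross-checked with simple arithmetic. The sketch below assumes a regular H-tree in which every H-shaped level feeds four sub-trees, so the cluster count is a power of 4; the helper names are illustrative and not part of BACPAC.

```python
def h_tree_levels(clusters: int) -> int:
    """Number of H-shaped branching levels needed to reach `clusters` leaves.

    Assumes a regular H-tree in which each level fans out to four sub-trees.
    """
    levels = 0
    while clusters > 1:
        if clusters % 4:
            raise ValueError("cluster count must be a power of 4")
        clusters //= 4
        levels += 1
    return levels


def covered_area_mm2(clusters: int, cluster_area_mm2: float) -> float:
    """Total die area served by the clock tree."""
    return clusters * cluster_area_mm2


def skew_budget_ps(cycle_ps: float, fraction: float) -> float:
    """Skew allowance as a fraction of the clock cycle."""
    return cycle_ps * fraction


print(h_tree_levels(4096))
print(round(covered_area_mm2(4096, 0.183), 1))
print(skew_budget_ps(100.0, 0.05))
```

The covered area comes out close to 750 mm², and the 1-ps local skew reported above sits well inside the 5% allowance on a 100-ps cycle.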

Global skew may be the most important component of skew due to the large die sizes expected in the future. If global clock skew becomes a bottleneck in chip design, new design styles must be looked at. Currently a potentially exciting new area of research lies in moving the clock distribution network from on-chip to the package level (see Fig. 19) [47]. Specifically, the use of flip-chip packaging allows for easy signal distribution since connection can be made anywhere on the die as opposed to only the periphery [48]. Also, flip-chip packaging has low parasitics (inductance, capacitance) so that the package-to-chip connection is relatively clean. In general, package-level RC parameters are three to four orders of magnitude smaller than those found in on-chip applications, allowing for easy transmission of digital waveforms over a large area (e.g., 750 mm²) with little attenuation. Using this technique, global clock skew can be minimized while local clock skew constraints can be relaxed, allowing for larger clusters within a clock tree.

XIII. NOISE

The issues of noise and crosstalk noise are significant in that, given current design techniques, both issues are expected to become more problematic. Flip-chip packaging presents itself as a partial solution to noise since flip-chip has very low inductive parasitics compared to conventional wirebonding.




Fig. 19. Diagram showing package-level global clock distribution using flip-chip packaging [46].

Crosstalk noise and dynamic delay are also important due to the dominance of coupling capacitance over ground capacitance. Based on the interconnect scaling scenario presented here (aspect ratio of 1.5 for global lines, unscaled minimum pitch) and the analytical crosstalk model of [30], we have found that crosstalk at the global level will not be as significant as the local level due to the relatively large spacings and use of large repeaters; their capacitance will dampen the effects of coupling capacitance. For instance, the maximum line length of has a crosstalk voltage of 80 mV at 0.05 µm using area-optimal repeaters. This corresponds to 13% of the supply voltage, which is not overly high. Values over 20% are usually considered problematic [16]. However, the use of scaled global wires will lead to much larger values of crosstalk since will be decreased.

Dynamic delay on critical timing paths will need to be limited by the use of shield wires, which are also helpful in reducing inductive effects. It is very unlikely, however, that a long net will have two neighbors for its entire run and that these signals will switch simultaneously. Efficient screening tools and intelligent routers will be useful in assessing the noise susceptibility of global nets. Even in the worst case scenario, we expect only a 30% rise in delay for an optimally driven line of length . Nonetheless, 30% delay variation is unacceptable on critical timing paths, hence the rising need for shield wires.
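The crosstalk check described above amounts to comparing peak noise against a fraction of the supply voltage. In the sketch below, the 0.6-V supply is an assumption (it is not stated in this section, but it is consistent with 80 mV being about 13% of the supply); the 20% threshold is the one quoted from [16].

```python
PROBLEM_THRESHOLD = 0.20  # crosstalk above 20% of the supply is considered problematic


def crosstalk_fraction(noise_mv: float, vdd_mv: float) -> float:
    """Peak crosstalk noise as a fraction of the supply voltage."""
    return noise_mv / vdd_mv


def is_problematic(noise_mv: float, vdd_mv: float) -> bool:
    """True when the noise exceeds the 20% rule-of-thumb limit."""
    return crosstalk_fraction(noise_mv, vdd_mv) > PROBLEM_THRESHOLD


VDD_MV = 600.0  # assumed supply voltage at 0.05 um, not given in this section

print(round(crosstalk_fraction(80.0, VDD_MV), 2))
print(is_problematic(80.0, VDD_MV))
```

Under this assumed supply, the noise would have to exceed 120 mV before crossing the 20% limit.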

XIV. POWER DISTRIBUTION

Voltage drop in the power distribution networks of large-scale designs is a function of the instantaneous current being drawn from the supply as well as the distribution network resistance. Significant IR drops (e.g., 5% of Vdd) lead to delay variation and reduced noise margins. With rising power consumption yet dropping Vdd values, the supply current in future microprocessors will increase quickly. Larger wires are needed in order to limit IR drops, which negatively impacts global routing. In this section, we compare two packaging technologies to help determine if dc voltage loss problems can be managed in DSM.

Power and ground on a chip are normally distributed using a grid of metal lines on each layer of metal with connections when equipotential lines cross each other. The power grid model we use is adapted from [49] and described in more detail in [40]. The maximum IR drop is calculated for each metal layer and then summed to obtain the total drop from the pads to the silicon level. Our hierarchy consists of a pair of lines (Vdd and ground) running parallel at minimum spacing. The distance between lines of the same potential on a given layer is called the grid pitch. For each grid pitch, there are two lines running the full chip length. We concentrate on the top layer of metal since the majority of the total voltage drop, , occurs there.

Conventional wirebonding constrains power pads to be located at the chip periphery, creating very long lines from the supplies to the middle of the die. Thus, wirebonding results in very wide power distribution lines on the top metal layer, hampering the routing of other signals such as clock or global buses. Imagine a gate in the center of the die. The resistive path from the pad to this gate will be at least in length. Flip-chip technology allows for Vdd/GND to be distributed anywhere on the die using solder bumps. If Vdd and ground bumps alternate, then the effective distance between power connections to the grid becomes , where is the bump pad pitch. The effective length of the worst case resistive path from pad to underlying level has therefore been decreased from to . This reduction corresponds to at least a 50× change, which can be directly translated to narrower wires and lower voltage drops.
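To illustrate why a shorter worst-case resistive path translates into narrower wires or lower voltage drops, the sketch below uses the basic relation for a uniform metal line, V = I × R_sheet × (L / W). All numeric values (current, sheet resistance, path length, width) are hypothetical; only the 50× path reduction is taken from the text.

```python
def ir_drop_mv(current_a: float, sheet_res_ohm_per_sq: float,
               length_mm: float, width_mm: float) -> float:
    """IR drop in mV along a uniform line: V = I * R_sheet * (length / width)."""
    squares = length_mm / width_mm
    return current_a * sheet_res_ohm_per_sq * squares * 1000.0


# Hypothetical values, for illustration only.
CURRENT_A = 0.5
R_SHEET = 0.02   # ohms per square
LENGTH_MM = 13.5
WIDTH_MM = 0.5
REDUCTION = 50   # path-length reduction attributed to flip-chip in the text

wirebond = ir_drop_mv(CURRENT_A, R_SHEET, LENGTH_MM, WIDTH_MM)
# Same wire width, 50x shorter path: the drop falls by 50x.
flip_same_width = ir_drop_mv(CURRENT_A, R_SHEET, LENGTH_MM / REDUCTION, WIDTH_MM)
# Same drop as wirebond, but with a 50x narrower wire.
flip_narrow = ir_drop_mv(CURRENT_A, R_SHEET, LENGTH_MM / REDUCTION, WIDTH_MM / REDUCTION)

print(round(wirebond, 1), round(flip_same_width, 1), round(flip_narrow, 1))
```

A designer can spend the shorter path on either axis, or split it between a lower drop and a narrower line, which is what frees top-level routing tracks for clock and global buses.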

XV. SoC GLOBAL ISSUES

To this point, we have focused on global issues in high-speed microprocessors as they have the largest die sizes and highest performance of all large-scale IC applications. However, the growing ASIC market is pushing toward more advanced technologies and system-on-a-chip architectures. For this reason, it is interesting to look at the global wiring paradigm we have described in the context of SoCs. The major differences between SoCs and microprocessors in terms of global interconnect are that SoCs typically have smaller die sizes and lower performance. These points make the communication requirements much less stringent for SoCs. For instance, time-of-flight is not a concern in SoCs as traversing the die sizes of interest (on the order of several hundred mm²) will not consume an appreciable portion of a relatively long ( 2 GHz) clock cycle. Fat wiring is required in microprocessors due to very long global wires and quickly shrinking cycle times. However, with reduced die sizes in SoCs (and consequently reduced global wirelengths), global wiring may be scalable to the 0.1-µm generation or even beyond. As we have seen, the use of fat wiring also reduces the impact of noise since line spacing is unscaled. SoCs may have larger signal integrity problems if scaled global wiring is used, in which case sophisticated wire sizing techniques or shield wires may be required. The inductance problem in DSM SoCs is significantly less than that in microprocessors for two reasons. First of all, the need for wide and fast buses is somewhat reduced in SoC designs and it is these structures that are most susceptible to inductive phenomena. Secondly, if scaled global conductors are used due to relaxed timing requirements, the resistive component of the wires will serve to dampen the inductive effects.

Regarding packaging, SoC vendors may prefer to keep the packaging of their parts very simple, and as a result may not choose the flip-chip global clock distribution. However, the lower clock speeds will, in general, result in little or


Fig. 20. Various global connection schemes in an SoC architecture.

no divergence of global and local clocks. In this case, an on-chip H-tree clock distribution network will be sufficient to distribute over the entire die at 1–2 GHz. The lower power constraints placed on SoCs by plastic packages also lead to smaller IR drop with a well-designed power supply grid. Eventually, however, we expect power consumption and related issues such as power distribution to become the limiting factor in DSM design.

XVI. MICROARCHITECTURAL IMPLICATIONS

A. Choice of Channel Implementation

We begin by outlining a few alternatives for interconnecting multiple modules. The apparent choices are ad hoc or point-to-point connections, buses (e.g., AMBA from ARM, Core-Connect from IBM, or Silicon Backplane from Sonics), or switching networks. Among switching networks, one can use either circuit-switched or packet-switched networks. These choices are illustrated in Fig. 20.

We do not attempt to consider the many high-level architectural issues, but restrict ourselves to looking at physical implications on microarchitectural choices. Based on our analysis of global interconnect resources, our conclusion is that there are sufficient routing resources for any of the interconnect choices named above, including point-to-point connections. As a result, we believe that choice of channel implementation will be principally determined by higher level architectural issues.

B. Signaling Approaches

The physical implementation of our proposed microarchitecture will have a large role in limiting delay/noise/power. For example, the current paradigm of using large repeater structures at regular intervals seems to work well for RC lines with moderate resistance levels. In future processes, global wires look more like RLC lines requiring attention to impedance matching in order to reduce ringing and overshoot effects. Also, with longer and more resistive wires, the sheer number of repeaters grows to staggering proportions as discussed above. This leads to substantial power dissipation as well as placing stringent requirements on power distribution networks and contributing to simultaneous switching noise problems.

A new global signaling convention would seek to minimize power dissipation (helping the power distribution and switching noise problems as well) while ensuring clean signal propagation (no ringing, noise, etc.). One possibility is the use of low-swing driver/receiver architectures. Depending on the actual architecture, dynamic power can be linearly to quadratically reduced with these methods, typically at the expense of noise immunity and standby current. To attack the noise problem, differential signaling can be used that gives each signal line a complement, helping to reduce noise through common mode rejection. It can be shown that the use of differential low-swing signaling is actually more area-effective than using shield wires and also leads to better immunity against inductive coupling noise [50]. This is a promising area of future research.
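The "linearly to quadratically reduced" statement can be illustrated with the standard switching-power expression, in which the charge moved per transition is C × V_swing and the energy is that charge times the supply it is drawn from. This is a generic textbook model offered as a sketch, not the specific driver/receiver architectures of [50]; the capacitance, frequency, and voltages below are hypothetical.

```python
def dynamic_power_w(c_farads: float, freq_hz: float, v_supply: float,
                    v_swing: float, activity: float = 1.0) -> float:
    """Switching power P = a * C * V_supply * V_swing * f.

    The wire is charged by C * V_swing per transition, and that charge is
    drawn from a supply at V_supply.
    """
    return activity * c_farads * v_supply * v_swing * freq_hz


# Hypothetical values, for illustration only.
C_WIRE = 1e-12   # 1 pF of global wire capacitance
FREQ = 1e9       # 1 GHz
VDD = 0.6        # full supply
V_LOW = 0.15     # reduced swing (Vdd / 4)

full_swing = dynamic_power_w(C_WIRE, FREQ, VDD, VDD)
# Reduced swing, charge still drawn from Vdd: linear saving.
linear_saving = dynamic_power_w(C_WIRE, FREQ, VDD, V_LOW)
# Reduced swing with a dedicated low-voltage supply: quadratic saving.
quadratic_saving = dynamic_power_w(C_WIRE, FREQ, V_LOW, V_LOW)

print(full_swing / linear_saving, full_swing / quadratic_saving)
```

Which end of the range a given design lands on therefore depends on whether the low swing is generated from the main supply or from a separate reduced supply.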

C. Reliability

Another area where physical issues may influence communication approaches is reliability. To date, integrated circuits have been designed with the presumption that a signal launched at one latch in the integrated circuit will arrive reliably at its destination. Ensuring this quality is the problem of maintaining signal integrity. For some time, boards and racks have been designed with attention to signal errors.

There are three levels of defense. The first is designing to maximize signal integrity. This is already in use in ICs. The second is designing to detect signal errors. The third is to allow for limited error correction. These latter two have not typically been required in IC designs, but there is some debate as to whether they will be required in the future. Our position is that this question reduces to one that we have already addressed: the ability to maintain signal strength and speed across long ( 1 cm) distances on chip. On a single integrated circuit, the problem of signal reliability (in terms of bad data, the need for error correction hardware, etc.) is dramatically less severe than on a board since the distance between active devices is limited. For instance, although a global net in a very large die may approach several centimeters in total length, the longest practical distance between repeaters will not exceed 5 mm, which helps minimize the impact of noise sources such as capacitive and inductive coupling. Thus, the problem reduces to ensuring that there are adequate resources to provide proper wire sizes and sufficient repeaters. Our forecasts indicate that there will be such resources available.

D. Interconnect Libraries for Designing DSM ICs

In this section, we consider some constructive approachesto minimizing the impact of DSM effects. Recently, a new


Fig. 21. Spectrum of routing complexity, bounded by current ad hoc methods and the dense wiring fabric.

layout fabric called the dense wiring fabric (DWF) was introduced to
