
864 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 4, APRIL 2008

Resonant-Clock Latch-Based Design

Visvesh S. Sathe, Member, IEEE, Jerry C. Kao, Member, IEEE, and Marios C. Papaefthymiou, Senior Member, IEEE

Abstract—This paper describes RF1 and RF2, two level-clocked test-chips that deploy resonant clocking to reduce power consumption in their clock distribution networks. It also highlights RCL, a novel resonant-clock latch-based methodology that was used to design the two test-chips. RF1 and RF2 are 8-bit 14-tap finite-impulse response (FIR) filters with identical architectures. Designed using a fully automated ASIC design flow, they have been fabricated in a commercial 0.13 µm bulk silicon process. RF1 operates at clock frequencies in the 0.8–1.2 GHz range and uses a single-phase clocking scheme with a driven clock generator. Resonating its 42 pF clock load at 1.03 GHz with a supply voltage of 1.13 V, RF1 dissipates 132 mW, achieving a clock power reduction of 76% over conventional switching. RF2 achieves higher clock power efficiency than RF1 by relying on a two-phase clocking scheme with a distributed self-resonant clock generator. Resonating 38 pF of clock load per phase at 1.01 GHz with a supply voltage of 1.08 V, RF2 dissipates 124 mW and achieves 84% reduction in clock power over conventional switching. At 133 nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for FIR filters published to date.

Index Terms—Digital signal processing, low-power VLSI.

I. INTRODUCTION

CLOCK POWER remains a major contributor to dynamic power dissipation in high-performance VLSI designs.

Relying on inductance to efficiently resonate the capacitance of the entire clock distribution network, resonant clocking is a promising approach to the design of clock networks with substantially reduced power dissipation [1]–[3].

A simple example of a resonant-clock design is given in Fig. 1. In this example, the logic gates are implemented using conventionally switching CMOS. The clock signal is a sinusoidal waveform that is generated by setting up a resonant oscillation between an inductive element and the parasitic capacitance of the clock distribution network. A clock generator (shown as a single transistor) is used to sustain the oscillation in this tank by periodically replenishing resistive energy losses in the clock distribution network.

Previous work in resonant-clock digital design [4], [1], [5]–[8] has focused on the implementation of resonant clock networks driving flip-flops, adiabatic or otherwise. While the use of flip-flops in sequential designs greatly simplifies design and verification, this practice sacrifices performance and efficiency. Specifically, the relatively slow rise times of sinusoidal clocks degrade the clock-to-output times of the flip-flops.

Manuscript received August 23, 2007; revised November 10, 2007. This work was supported in part by the U.S. Army Research Office under Grant DAADA19-03-1-0122. Fabrication was provided through MOSIS.

V. S. Sathe is with Advanced Micro Devices, Fort Collins, CO 80528 USA.

J. C. Kao and M. C. Papaefthymiou are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA.

Digital Object Identifier 10.1109/JSSC.2008.917501

Fig. 1. Resonant-clocked pipeline example.

Moreover, although the deployment of adiabatic flip-flops with a buffer-less clock network enables resonating the capacitance of the entire clock distribution network [4], [1], these flip-flops require devices in series with the load capacitance, thus degrading overall system efficiency.

To achieve energy-efficient clocking at relatively high frequencies, recent resonant-clock implementations avoided the use of adiabatic techniques, relying on conventional master–slave flip-flops instead. In one such system, where capacitance is provided by the distributed clock wiring capacitance and the input capacitance of the flip-flop clock inputs, the increased clock slew of the sinusoidal clocks significantly degrades the timing parameters of the flip-flops [6].

To prevent performance degradation due to poorly slewing sinusoidal clocks, buffers may be introduced to generate clock waveforms with improved slew characteristics. Chan et al. have demonstrated efficient resonance on a global resonant-clock network, achieving significant jitter and skew reduction [5], [7]. Hansson et al. have implemented a resonant-clock network driving flip-flops using clock buffers [8]. In this case, the poor slew of the sinusoidal clocks causes significant short-circuit power dissipation. Furthermore, clock buffers isolate the local clock distribution from the resonant system, thus limiting the amount of capacitance being resonated. Since the dominant portion of clock-related power dissipation lies in the local clock distribution [9], such use of clock buffers limits clock efficiency.

In this paper, we investigate the deployment of resonant clocking in conjunction with level-sensitive latches. Unlike the performance of flip-flops, which rely on sharp clock edges for effective operation, latch performance is determined primarily by the voltage level of the clock waveform. Moreover, latch-based designs have the potential to achieve higher performance than flip-flop-based designs, because the transparent phase in the operation of level-sensitive latches allows data to ripple through latch boundaries and enables time-borrowing across logic stages [10], [11]. The proposed resonant-clocked latch-based (RCL) design methodology relies on metal-only networks to distribute the clock signal all the way to the clock inputs of the level-sensitive latches without the use of any clock buffers. Substantial clock power reductions can thus be achieved by resonating the capacitance of the entire clock distribution network. To support automated design, RCL relies on a static timing analysis framework that has been developed for latch-based designs with nonideal clocks such as sinusoidal resonant clocks [12].

0018-9200/$25.00 © 2008 IEEE

Fig. 2. Microphotograph of RF1 and RF2.

To demonstrate the energy efficiencies attainable by RCL design, we used the RCL methodology in a fully automated ASIC flow to design two test-chips, called RF1 and RF2. The two designs were fabricated in a 0.13 µm CMOS process with Cu interconnect [13]. Fig. 2 shows a microphotograph of the die containing RF1 and RF2. Both designs are 8-bit 14-tap transpose finite-impulse response (FIR) filters with identical architectures. The differences in the implementation lie in latch design, clocking scheme, and clock generator design. Specifically, RF1 is a single-phase 0.8–1.2 GHz, frequency-tunable latch-based design. Its clock network dissipates 10.7% of total power while driving 42 pF of clock load, corresponding to a 76% reduction in switching power compared with an identical conventionally switched clock load.

RF2 uses a two-phase distributed self-resonant clock generator that relies on a blip generator topology to achieve efficient resonance [14]. Together with appropriate latch design, the two phases achieve robust operation through nonoverlapping durations of latch transparency. The resonant clock drivers in RF2 are part of the resonant domain, resulting in increased efficiencies. At its resonant frequency of 1.01 GHz and while driving 38 pF of clock load per clock phase, RF2 achieves 84% clock power reduction over an identical, conventionally switching clock load. RF2 achieves the lowest energy-per-computation figure of merit (normalized for filter size) reported for FIR filters to date.

To compare the resonant-clocked implementations of the FIR filter architecture that was used to obtain RF1 and RF2 with their conventional-clock alternatives, we performed separate synthesis runs to obtain a conventional latch-based and a conventional flip-flop-based implementation with the same target clock frequency. In comparison with their latch-based counterparts, RF1 and RF2 had identical latency and latch counts. Furthermore, the clock load driven by RF1 and RF2 was lower than that of either of their conventionally clocked counterparts, latch- or flip-flop-based. Consequently, the clock power reductions achieved by RF1 and RF2 are even greater than 76% and 84%, respectively, in comparison with their conventionally clocked implementations.

The remainder of this paper has six sections. In Section II, we discuss the RCL methodology in the context of RF1 and RF2 design. In Section III, we present the FIR filter architecture and the test setup used in RF1 and RF2. The implementation of RF1 is discussed in Section IV. In Section V, we present the design of RF2. Experimental results for RF1 and RF2 are given in Section VI. Conclusions are provided in Section VII.

II. DESIGN METHODOLOGY

In this section, we discuss the RCL methodology in the context of the fully automated design of RF1 and RF2.

Fig. 3(a) gives an example of a CMOS pipeline implemented using the RCL methodology. This pipeline uses conventional CMOS combinational logic gates, level-sensitive latches, and a two-phase resonant clocking scheme. Robustness to race conditions is achieved by ensuring nonoverlapping transparency windows of consecutive latches. Due to the sinusoidal nature of the clock waveform, which directly drives transistors in the latch, the delay between the data arrival time at the input of a latch and the data departure time from the output of the latch, denoted by $T_{DQ}$, varies with data arrival time, as shown in Fig. 3(a). Pipeline design ensures that critical signals arrive at the latches when the clock signal is at its peak, providing clocked transistors with maximum gate overdrive and yielding a region of low $T_{DQ}$ delay. Pipeline performance is therefore similar to that obtained with conventional square clocks.

The $T_{DQ}$ delays of the level-sensitive latches used in RF1 and RF2 exhibit timing monotonicity, i.e., early arriving signals at a latch input depart from the output of that latch earlier than later arriving signals. Timing monotonicity ensures that the latch output departure times of noncritical signals do not exceed those of critical ones, which is a key property for performing timing verification in systems with level-sensitive latches [11], [12].
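The monotonicity property can be checked mechanically on a characterized latch. The following is a minimal sketch, assuming the latch has been characterized as a table of arrival times and corresponding data-to-output delays; the function names and the sample numbers are hypothetical and are not taken from the RF1 or RF2 characterization data.

```python
from typing import Sequence


def departure_times(arrivals: Sequence[float], t_dq: Sequence[float]) -> list[float]:
    """Departure time = arrival time + data-to-output delay at that arrival."""
    if len(arrivals) != len(t_dq):
        raise ValueError("arrivals and t_dq must have the same length")
    return [a + d for a, d in zip(arrivals, t_dq)]


def is_timing_monotonic(arrivals: Sequence[float], t_dq: Sequence[float]) -> bool:
    """True if earlier arrivals never depart later than later arrivals."""
    pairs = sorted(zip(arrivals, departure_times(arrivals, t_dq)))
    departures = [dep for _, dep in pairs]
    return all(earlier <= later for earlier, later in zip(departures, departures[1:]))


# Hypothetical characterization table (picoseconds), not measured data.
ARRIVALS_PS = [0.0, 50.0, 100.0, 150.0, 200.0]
T_DQ_PS = [240.0, 200.0, 160.0, 120.0, 100.0]  # delay shrinks as the clock rises

monotonic = is_timing_monotonic(ARRIVALS_PS, T_DQ_PS)
```

In this sketch the delay may shrink as the clock rises, but never faster than the arrival time advances, which is exactly the condition that keeps departure times ordered.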

Another characteristic of the sinusoidally clocked latches used in RF1 and RF2 is that their transparency windows are wider than the region of low $T_{DQ}$, as shown by the annotated clock waveform in Fig. 3(a). Therefore, it is possible for data arriving before the region of low $T_{DQ}$ delay to ripple through the latch ahead of time, possibly causing race violations in the next pipeline stage. Such race possibilities are avoided through appropriate clock and latch design, as discussed in Sections IV and V.

A. ASIC Design Flow

Design entry for RF1 and RF2 was performed using synthesizable Verilog. Simulation-based verification of the architecture was carried out using Cadence NC-Verilog. Synthesis was performed using Synopsys Design Compiler. Physical design of the synthesized netlist was carried out using Cadence SOC-Encounter.

Fig. 3. RCL methodology. (a) Latch timing and pipeline design. (b) Derived clock waveform for synthesis.

The sinusoidal shape of resonant-clock waveforms poses a challenge to the use of commercial synthesis tools, since such tools use square clock descriptions to perform timing analysis. We have thus extended the original timing analysis framework for level-sensitive latches described in [11] to encompass sinusoidal clock waveforms [12]. In our extended framework, a square clock waveform is derived from the original sinusoidal clock waveform so that it captures the operation of the resonant-clocked latches. Fig. 3(b) illustrates the derivation of this conventional clock waveform from a sinusoidal clock waveform driving a level-sensitive latch. The amplitude of the derived clock is set equal to the lowest voltage of the sinusoidal clock within the transparency window of the latch. Characterized latch performance can be improved by deriving a clock with a higher amplitude, at the cost of more limited time-borrowing due to the resulting narrower pulse width. Therefore, this derivation methodology results in a tradeoff between time-borrowing and characterized latch performance. For timing verification purposes, the derived clock waveform is conservative, i.e., if the circuit meets timing with the derived clock waveform, then it is guaranteed to do so with the sinusoidal clock waveform, even in the presence of feedback loops. In designing RF1 and RF2, the amplitude of the derived clock was chosen to be 10% lower than in the sinusoidal clock to allow for some time-borrowing without substantially penalizing characterized latch performance.
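To make the tradeoff between derived amplitude and pulse width concrete, the sketch below computes how long an idealized sinusoidal clock stays at or above a chosen level. It assumes a full-swing sinusoid between 0 and its peak A, and it interprets "10% lower" as a level of 0.9 A; both are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def derived_pulse_width(period_s: float, level_fraction: float) -> float:
    """Time per cycle that a 0-to-A sinusoid stays at or above level_fraction * A.

    Assumes v(t) = (A / 2) * (1 - cos(2 * pi * t / period_s)).
    """
    if not 0.0 <= level_fraction <= 1.0:
        raise ValueError("level_fraction must lie in [0, 1]")
    # v >= k*A  <=>  cos(theta) <= 1 - 2k, which holds on an arc centered on the peak.
    theta_cross = math.acos(1.0 - 2.0 * level_fraction)
    return period_s * (1.0 - theta_cross / math.pi)


width_90 = derived_pulse_width(1e-9, 0.90)  # derived amplitude 10% below the peak
width_50 = derived_pulse_width(1e-9, 0.50)  # mid-swing reference: half the period
```

Under these assumptions, a 1 GHz clock spends roughly a fifth of its period within 10% of its peak, which illustrates why a higher derived amplitude leaves a narrower window for time-borrowing.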

B. Clock Network Design

The clock distribution networks of RCL-based designs do not include any buffers. Consequently, the energy dissipation of clock signal distribution is due to the resistive losses in the clock wires. Wire sizing optimization therefore plays a significant role in clock power reduction.

Clock distribution network design in RF1 and RF2 was driven by three main objectives: 1) minimize clock power dissipation; 2) minimize clock skew; and 3) minimize voltage attenuation. For the sinusoidal shape of the resonant clock waveforms, all three objectives were served through the judicious minimization of resistance in the clock distribution network. The choice of a CMOS process with thick top-level metals was key for reducing clock network resistance. Specifically, we used a process with a 4-µm-thick top-level metal layer, providing a low-resistance interconnect for the clock distribution network. Using a process with such a thick top-level metal also enabled the design of high-Q inductors. To reduce network resistance further, wide clock wires were used in the distribution network. Widening the wires increases clock capacitance, however, causing higher current flow and, therefore, increased resistive losses in the network. The optimal choice of wire widths in the distribution network was determined empirically by the tradeoff between the resistance and the capacitance of the clock wires [12].
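The resistance-versus-capacitance tradeoff can be illustrated with a deliberately simple toy model. The sketch below assumes that wire resistance scales as 1/width, that wire capacitance grows linearly with width on top of a fixed latch load, and that resistive loss is proportional to (total capacitance)² × resistance. None of these modeling choices or numbers come from the RF1 or RF2 design, where the widths were chosen empirically.

```python
def relative_loss(width_um: float, c_load_pf: float, c_per_um_pf: float) -> float:
    """Resistive loss up to a constant factor: (total capacitance)^2 * (1 / width)."""
    if width_um <= 0.0:
        raise ValueError("width_um must be positive")
    total_c = c_load_pf + c_per_um_pf * width_um
    return total_c ** 2 / width_um


def best_width(widths_um, c_load_pf, c_per_um_pf):
    """Return the candidate width with the lowest modeled loss."""
    return min(widths_um, key=lambda w: relative_loss(w, c_load_pf, c_per_um_pf))


# Hypothetical numbers chosen only to exercise the model.
C_LOAD_PF = 20.0     # latch clock-pin load that does not depend on wire width
C_PER_UM_PF = 2.0    # wire capacitance added per micrometer of width
CANDIDATES_UM = [float(w) for w in range(1, 41)]

chosen = best_width(CANDIDATES_UM, C_LOAD_PF, C_PER_UM_PF)
analytic = C_LOAD_PF / C_PER_UM_PF  # stationary point of (C0 + c*w)^2 / w
```

In this toy model the loss is minimized when the wire capacitance equals the fixed load capacitance, so widening the wires beyond that point costs more in added current than it saves in resistance.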

To generate the resonant clock tree in an automated fashion, we developed a framework which interacts with the place and route tool to generate programmable levels of an H-tree driving programmable levels of a clock grid. Both RF1 and RF2 utilize a two-level H-tree [15], driving a two-metal clock grid shielded by supply and ground rails on each side. Such a clock grid topology results in clock networks with decreased resistance and increased capacitance. Relying on a heuristic that determines optimal wiring capacitance for a given design, we generated and evaluated several possible alternative networks with different wire widths.
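As an illustration of what generating "programmable levels of an H-tree" can look like, the following sketch recursively produces the wire segments and leaf tap points of an H-tree. It is a geometric toy only: the function name, coordinate units, and dimensions are hypothetical, and it does not model the place-and-route interaction, the clock grid, or the shielding described above.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def h_tree(cx: float, cy: float, half_w: float, half_h: float,
           levels: int) -> Tuple[List[Segment], List[Point]]:
    """Return (wire segments, leaf tap points) of an H-tree centered at (cx, cy)."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    left, right = cx - half_w, cx + half_w
    bottom, top = cy - half_h, cy + half_h
    segments: List[Segment] = [
        ((left, cy), (right, cy)),        # horizontal bar of the H
        ((left, bottom), (left, top)),    # left vertical bar
        ((right, bottom), (right, top)),  # right vertical bar
    ]
    tips: List[Point] = [(left, bottom), (left, top), (right, bottom), (right, top)]
    if levels == 1:
        return segments, tips
    leaves: List[Point] = []
    for tx, ty in tips:
        # Each tip of this H becomes the center of a half-size H.
        sub_segments, sub_leaves = h_tree(tx, ty, half_w / 2, half_h / 2, levels - 1)
        segments.extend(sub_segments)
        leaves.extend(sub_leaves)
    return segments, leaves


segments, leaves = h_tree(0.0, 0.0, 400.0, 400.0, levels=2)
```

A two-level tree built this way has 16 leaf tap points, each reached through the same wire length from the root, which is the property that makes H-trees attractive for low-skew distribution.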

III. FIR FILTER ARCHITECTURE

Fig. 4. FIR filter architecture.

This section describes the FIR filter architecture of RF1 and RF2. Fig. 4 gives a block diagram of the 14-tap, 8-bit transpose-type FIR filter with built-in self-test (BIST). To efficiently balance logic delays between different stages in the sequential circuit, a transpose filter implementation was used, where data is premultiplied in each of the taps with their corresponding coefficients. The transmission of data from the source to the filter taps distributed across the filter core is carried out by the data-broadcast block. Long latencies involved in broadcasting the data to all filter taps were addressed by pipelining the interconnect in the data-broadcast block so as to achieve high throughput. In each tap, pipelined multipliers scale the input data according to the programmable coefficients of the filter. The data-coefficient product obtained from each tap is merged with cycle-delayed products from the previous taps using 4:2 compressors [16]. The final vector merge addition is performed using a carry-save adder. The BIST block generates a pseudo-random input sequence for the multiplier. Filter output data are also compressed in the BIST block using a signature analyzer. A state machine enables the BIST block to capture the state of the signature analyzer at the end of a user-defined number of cycles.
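The transpose structure can be summarized behaviorally as follows. This is a minimal software sketch that ignores the pipelined multipliers, the data-broadcast pipelining, the 4:2 compressors, and the BIST logic; the function names are hypothetical, and the direct-form routine is included only as a reference for comparison.

```python
from typing import List, Sequence


def fir_transpose(samples: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Transpose-form FIR: each input is multiplied by every coefficient at once,
    and the products are accumulated through a chain of delay registers."""
    n_taps = len(coeffs)
    if n_taps == 0:
        raise ValueError("coeffs must not be empty")
    regs = [0] * (n_taps - 1)  # delay registers between tap adders
    out: List[int] = []
    for x in samples:
        products = [c * x for c in coeffs]  # input broadcast to all taps
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register takes its tap product plus the next register's old value.
        for k in range(n_taps - 2):
            regs[k] = products[k + 1] + regs[k + 1]
        if regs:
            regs[-1] = products[-1]
    return out


def fir_direct(samples: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Reference direct-form FIR: y[n] = sum_k coeffs[k] * x[n - k]."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


impulse_response = fir_transpose([1, 0, 0, 2], [1, 2, 3])
```

Because every tap sees the same input in the same cycle, the only logic between consecutive registers is one multiply and one addition, which is what makes the stage delays easy to balance.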

IV. RF1 DESIGN

Important aspects pertaining to the implementation of RF1 such as latch design, pipeline timing and design, and clock system design are discussed in this section.

A. Latch Design

RF1 is a single-phase level-sensitive latch design. Since the entire clock network is buffer-less, latch design plays a central role in the robust and high-performance operation of RF1. Latch requirements include: 1) the design of true single-phase latches; 2) low $T_{DQ}$ despite sinusoidal clock waveforms with poor slew; and 3) avoiding crowbar current in latches due to gradually transitioning sinusoidal clocks. Although the use of latches in conventionally clocked designs is well known, the benefits and challenges from their use in resonant-clocked datapaths have yet to be explored.

To operate with a single clock phase, RF1 pipelines consist of interleaved latches that become transparent during opposite polarities of the resonant clock. Fig. 5(a) shows circuit schematics for the level-sensitive high (H-LAT) and low (L-LAT) latches in RF1. H-LAT and L-LAT are Svensson latch implementations [17] that have been optimized for low $T_{DQ}$ by placing the data pins closer to the output. From post-layout simulations, H-LAT (L-LAT) achieves $T_{DQ}$ = 99 ps (106 ps) with a total dissipation of 24 fJ (22 fJ) per toggle while driving 15 fF of load. Since the RCL methodology does not rely on precharging or clock buffers within the latch, clock-related power dissipation in RF1 occurs only in the clock generator and clock distribution network. For the same reason, the dynamic dissipation within the latch for a clock cycle with no input data toggle is zero. Connections in H-LAT (L-LAT) are made so that the pull-down (pull-up) clocked transistor is in series with a complementary logic stack. Thus, data input slews limit the otherwise significant crowbar current due to large rise and fall times of the clock.

B. Timing

Fig. 5(b) shows waveforms obtained from simulations of interleaved latches in possible race and critical path scenarios at 1 GHz. In the race scenario, data latched by H-LAT during the rising transition of the clock does not race through L-LAT during the same transition. Instead, L-LAT latches the data only in the subsequent transparency window. From 1 GHz post-layout simulations of RF1 at the FF process corner (fast NMOS, fast PMOS), the race setup in Fig. 5(b) with zero logic delay is immune to race for clock skews up to 125 ps. Through Hspice simulation of the extracted clock distribution network with parasitics, the insertion delay of the network was estimated to be 10 ps. Given that clock skew is bounded by the insertion delay of the clock network, RF1 comfortably satisfies the clock skew requirement.

Despite the apparent “overlap” between the transparency windows of the two latches, RF1 comfortably maintains race immunity due to the increased difference between the arrival time of the clock at the latch and the departure time of data from the output of the latch (denoted by $T_{CQ}$). This increased $T_{CQ}$ results from the low clock amplitude in the overlapping regions of transparency. Fig. 5(b) also shows that, consistent with the RCL methodology, pipeline signals on the critical path are designed to arrive at H-LAT (L-LAT) while the resonant clock is nearly at its peak (trough). This arrival timing provides maximum gate overdrive to clocked latch transistors, enabling low $T_{DQ}$ delays and high operating frequencies.

C. Clock Design

Fig. 6 shows the clock generator used in RF1. This clock generator consists of a ring oscillator, a pulse generator, and a resonant clock driver similar to that used in [1]. The clock driver periodically replenishes the energy losses in the resonant system through current injection in the inductor at the natural frequency of the design.

Fig. 5. H-LAT and L-LAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Simulation waveforms for critical path and race conditions.

Fig. 6. RF1 clock generator schematics.

Unlike previous implementations of the clock drivers, the driver in RF1 uses on-chip decoupling capacitance (decap) to mitigate the effects of bondwire, package, and board-trace inductance and resistance. The clock rail driver operates by using a 0.6 nH integrated inductor to achieve efficient resonance with the distributed parasitic capacitance of the clock network. Energy dissipation in the resistance of the network is replenished by the rail driver. As the clock approaches its minimum, the pull-down pulse causes the pull-down switch to conduct, discharging the output clock voltage to 0 V and causing a current buildup in the inductor. At the falling edge of the pull-down pulse, the system continues oscillating freely, with the clock voltage and the built-up inductor current as its initial condition, at its natural frequency defined as

(1)

Here, the damping factor is that of the network, and the initial inductor current is the current flowing in the inductor at the falling edge of the pull-down pulse. As the clock reaches its peak, the pull-up pulse causes the pull-up switch to conduct, resulting in a similar current buildup in the inductor. At the rising edge of the pull-up pulse, the system once again resumes a free oscillation at its natural frequency, with the initial condition given by the clock voltage and the current flowing in the inductor at that time. The current buildup in the inductor at the crest and trough of the clock enables the supply to periodically provide energy to the system, which is stored in the magnetic field of the inductor. The amount of current required to maintain stable oscillations is governed by the equation

(2)

Here, the energy term is the energy dissipation in the resonant network during the last cycle with the desired clock amplitude. This equation is obtained by setting the per-cycle energy dissipation equal to the energy stored in the magnetic field that results from the current buildup in the inductor. Notice that, in this driven resonant clock oscillator, the frequency of the pull-down and pull-up pulses determines the oscillating frequency of the resonant clock in the neighborhood of its natural frequency given by (1).

Fig. 7. BLAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Timing example for critical path and race conditions.
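As a hedged illustration of the free oscillation and energy balance described in this subsection, the standard textbook forms for a lossy series RLC tank can be written as follows. These are not claimed to be the exact expressions of (1) and (2). The symbols $R$, $L$, $C$, $\alpha$, $I_{inj}$, and $E_{diss}$ are assumptions introduced here, with $\alpha = R/(2L)$ as one common convention for the damping factor and $I_{inj}$ standing for the inductor current built up by a single replenishing pulse.

```latex
% Damped natural frequency of a series RLC tank (textbook form, assumed convention)
f_n = \frac{1}{2\pi}\sqrt{\frac{1}{LC} - \alpha^{2}}, \qquad \alpha = \frac{R}{2L}

% Energy balance: magnetic energy built up by one pulse equals the loss it must replenish
\frac{1}{2} L I_{inj}^{2} = E_{diss}
\quad\Longrightarrow\quad
I_{inj} = \sqrt{\frac{2 E_{diss}}{L}}
```

When the energy is replenished twice per cycle, at the crest and at the trough, each pulse would only need to cover its share of the per-cycle loss under this simplified balance.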

Replenishing energy in the inductor every cycle itself incurs energy dissipation, since the clock driver is driven by a cascade of conventionally switching buffers. The current buildup in the inductor also leads to resistive losses. Consequently, the optimal choice for switch widths is governed by the tradeoff between resistive losses in the driver switches (reduced by wider switches) and dynamic power dissipation incurred in driving the driver switches (reduced by smaller switches).

To explore regions of efficient clock generation, the driver switches were implemented with programmable widths in the ranges (0–950 µm) and (0–630 µm). Pulse duty cycles were also programmable in the range (0%–50%). The two pulses were derived from a programmable ring oscillator, which enabled frequency tuning around the resonant frequency.

The use of pull-up switches in the clock generator also provides the flexibility of operating the rail driver without an additional power supply. In this configuration, the switch delivering the additional power supply is opened, and the on-chip decap is used to hold the dc voltage of the oscillating clock. Energy is supplied to the resonant system using only the power supply.

V. RF2 DESIGN AND IMPLEMENTATION

This section presents the salient aspects of RF2 design, including latch, clock network, and pipeline design. RF2 is functionally identical to RF1, with both designs synthesized from the same Verilog description. Key differences between RF1 and RF2 lie in the latching scheme employed and the implementation of the resonant clocks. Whereas RF1 is a single-phase latch-based design, RF2 uses a more robust two-phase nonoverlapping clocking scheme. In contrast to RF1, which uses a driven oscillator to generate the single-phase resonant clock, RF2 deploys a self-resonating “blip” generator [14]. The different clocking schemes used in the two designs lead to differences in latch design, robustness to race conditions, and energy efficiency in the clock generators.

A. Latch Design

As in RF1, latch design plays a critical role in enabling RF2 to achieve robust, energy-efficient operation at GHz-class operating frequencies. Fig. 7(a) shows circuit schematics of BLAT, the resonant-clocked level-sensitive latch used in RF2. Like H-LAT and L-LAT, BLAT is a Svensson latch implementation [17], modified for low $T_{DQ}$. Post-layout simulations of B-LAT-X2 with 15 fF load capacitance result in $T_{DQ}$ = 97 ps. Similar to H-LAT and L-LAT, the absence of clock buffers or precharging techniques results in zero dynamic power dissipation in the latch if input data does not toggle. When the latch output toggles, the energy dissipation in the latch is 23 fJ. The cross-coupled nMOS devices shown in the B-LAT schematics are part of the distributed clock generator and are discussed in Section V-B.

Fig. 7(b) shows waveforms obtained from simulations of an example pipeline stage in RF2 at 1 GHz. Similar to RF1, the pipeline is designed so that data on the critical path arrives at the latch when the latching clock is near its peak voltage, resulting in a low $T_{DQ}$. The two-phase nonoverlapping clocks deployed in RF2 yield race-immunity. Simulations of a B-LAT pipeline stage with zero logic delay at the FF process corner (fast NMOS, fast PMOS) indicate that the design is immune to race conditions with clock skews of up to 350 ps. Since the estimated clock skew in RF2 is less than 10 ps, as obtained from post-layout simulations, hold constraints are comfortably satisfied.

B. Clock Design

The clock generator used in RF2 is a self-resonating oscillator similar to that implemented by [14]. The motivation behind the use of self-resonating clock generators is improved energy efficiency, which is afforded at the expense of tunability in the operating frequency.

Fig. 8. RF2 clock generator. (a) Schematic. (b) Clock grid.

Fig. 8(a) shows circuit schematics for the self-resonating “blip” generator used to derive two-phase nonoverlapping clocks in RF2. A 1.32 nH symmetric center-tapped inductor is used to achieve resonance with series-connected capacitance loads from each phase. Energy losses in the system are replenished by the clock power supply. To obtain a 1.2 V clock amplitude using the blip generator, the clock supply voltage required in RF2 is approximately 0.5 V. Unlike driven oscillators, such as the one used in RF1, the switches in the blip generator are driven by a resonant clock, enabling charge recovery from the switch capacitance. The resulting energy efficiency in driving the switches enables the use of wider switches with reduced resistive losses. For a given clock load, the blip generator is capable of achieving better energy efficiency than the clock generator used in RF1. Note the use of the on-chip decap in Fig. 8. This decap is necessary in a fully integrated blip generator to prevent package parasitics from affecting the resonant frequency of the design.
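As a sanity check on the reported component values, the ideal lossless LC resonance can be evaluated for both designs. The sketch below assumes that the quoted inductances and clock loads are the only reactive elements, that the two 38 pF phase loads of RF2 appear in series across the full 1.32 nH inductor, and that damping and additional parasitics are ignored. The function names are hypothetical.

```python
import math


def resonant_frequency_hz(inductance_h: float, capacitance_f: float) -> float:
    """Ideal (lossless) LC resonance: f = 1 / (2 * pi * sqrt(L * C))."""
    if inductance_h <= 0.0 or capacitance_f <= 0.0:
        raise ValueError("inductance and capacitance must be positive")
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def series_capacitance_f(c1_f: float, c2_f: float) -> float:
    """Equivalent capacitance of two capacitors connected in series."""
    return c1_f * c2_f / (c1_f + c2_f)


# RF1: 0.6 nH inductor resonating the 42 pF clock load.
f_rf1 = resonant_frequency_hz(0.6e-9, 42e-12)

# RF2: 1.32 nH center-tapped inductor with the two 38 pF phase loads in series.
f_rf2 = resonant_frequency_hz(1.32e-9, series_capacitance_f(38e-12, 38e-12))
```

Under these idealizations both estimates come out at roughly 1.0 GHz, in the neighborhood of the reported resonant frequencies of 1.03 GHz for RF1 and 1.01 GHz for RF2.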

Unlike previous work, RF2 does not contain a separate clock generator block. Fig. 8(b) shows the distributed blip generator used in RF2 along with the clock network. The cross-coupled devices shown in the figure provide the required negative transconductance in the circuit and are embedded within each latch. Embedding the clock generator switches into the latches provides better local clock slew control and has the added advantage of simplifying design.

VI. MEASUREMENT RESULTS

Here, we present measurement results obtained from RF1 and RF2. We first give a summary of the results obtained for the two designs. We then discuss measurement results specific to each of the two designs. For all data points reported, correct functionality of the two designs has been verified through BIST.

Fig. 9. Statistics and performance for RF1 and RF2.

Fig. 9 summarizes measurement results obtained from RF1 and RF2. At the natural frequency of 1.03 GHz, clock power dissipation in RF1 is 14.2 mW, accounting for only 10.7% of overall power. The resonant clock network in RF1 achieves a 76% power reduction over conventional switching of the same capacitance at the same rate. Driving 38 pF per clock phase, which amounts to 76 pF of total clock load, RF2 achieves 84% clock-power reduction over conventional switching. Based on their relative power efficiency, RF1 and RF2 had a system quality factor of approximately 3.3 and 4.9, respectively [12]. At the minimum overall energy point, clock power in RF2 is 19.9 mW, accounting for only 16% of the overall 124 mW chip power.
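One way to relate the reported power reductions to the quoted quality factors is sketched below. The relation used here is an assumption rather than a formula given in the text: it models the clock as a full-swing sinusoid and applies the usual definition of Q as 2π times stored energy over energy lost per cycle. The function name is hypothetical.

```python
import math


def quality_factor_from_reduction(reduction: float) -> float:
    """Estimate tank Q from the fractional clock-power reduction.

    Assumes a full-swing sinusoid of peak-to-peak V on capacitance C:
      conventional energy per cycle = C * V**2
      peak oscillation energy       = C * V**2 / 8
      resonant loss per cycle       = 2 * pi * (C * V**2 / 8) / Q
    so loss / conventional = pi / (4 * Q).
    """
    if not 0.0 < reduction < 1.0:
        raise ValueError("reduction must lie strictly between 0 and 1")
    return math.pi / (4.0 * (1.0 - reduction))


q_rf1 = quality_factor_from_reduction(0.76)
q_rf2 = quality_factor_from_reduction(0.84)
```

Under this assumed relation, reductions of 76% and 84% correspond to quality factors of about 3.3 and 4.9, consistent with the values reported for RF1 and RF2.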

In both RF1 and RF2, the load of the entire clock network hierarchy (all the way to the clock inputs of the latches) was lower than in conventional (i.e., nonresonant) synthesized implementations of the same FIR architecture that we obtained with the same target clock frequency and supply voltage. The increased clock load in the conventional clock networks was mainly due to the use of clock buffers. Consequently, with respect to their nonresonant counterparts, RF1 and RF2 attained clock power reduction levels that were even greater than 76% and 84%, respectively.

An often cited figure of merit for FIR filters is the energy dissipation of the filter normalized for filter size, measured in nW/MHz/Tap/InBit/CoeffBit [18]–[20]. For RF1 and RF2, this figure of merit compares favorably to previously published conventional FIR implementations. In particular, RF2 features the lowest figure of merit for FIR filters published to date, dissipating 133 nW/MHz/Tap/InBit/CoeffBit. In the case of RF1 and RF2, this energy metric has been obtained for a relatively high switching activity of 0.5.

A. RF1 Measurement Results

Fig. 10. Energy dissipation of (a) RF1 versus operating frequency and (b) RF2 versus clock supply voltage.

Fig. 10(a) shows measured clock and total energy dissipation versus operating frequency in RF1. The clock energy dissipation curve shown in the figure has been obtained at 1.2 V with fixed driver switch widths and a pulse duty cycle of 20%. The clock energy minimum at 1.03 GHz corresponds to the resonant frequency of the design. The lowest achievable per-cycle energy dissipation of the clock for a 1.2 V amplitude (clock network + clock generator) at resonance is 14.3 pJ. This energy dissipation corresponds to a 76% clock power reduction over the power dissipation incurred in conventionally clocking an identical clock load.
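The 76% figure can be reproduced from the numbers quoted for RF1. The sketch below assumes that conventional clocking of a capacitance C through a full swing V draws C·V² per cycle, and it uses the 42 pF load, the 1.2 V amplitude, and the 14.3 pJ per-cycle resonant dissipation given in the text. The function names are hypothetical.

```python
def conventional_energy_per_cycle_j(capacitance_f: float, swing_v: float) -> float:
    """Energy drawn per cycle when a capacitance is switched rail to rail: C * V^2."""
    return capacitance_f * swing_v ** 2


def clock_power_reduction(resonant_energy_j: float, capacitance_f: float,
                          swing_v: float) -> float:
    """Fractional reduction of the resonant clock relative to C * V^2 switching."""
    return 1.0 - resonant_energy_j / conventional_energy_per_cycle_j(capacitance_f, swing_v)


conventional_pj = conventional_energy_per_cycle_j(42e-12, 1.2) * 1e12
reduction = clock_power_reduction(14.3e-12, 42e-12, 1.2)
```

With these inputs, conventional switching would cost about 60.5 pJ per cycle, so 14.3 pJ corresponds to a reduction of roughly 76%.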

The increase in the energy dissipation of the clock at frequencies away from resonance does not necessarily imply increased total energy dissipation. Frequency reduction provides additional timing slack to the datapath, allowing operation at a reduced voltage supply level (voltage scaling) while still meeting timing constraints, and resulting in a reduction of overall power dissipation. As can be deduced from Fig. 10(a), the reduction in the logic energy dissipation of RF1 dominates the increase in clock dissipation due to off-resonance operation, resulting in an overall reduction in energy dissipation. At higher frequencies, both clock and logic power increase, leading to higher power dissipation.

RF1 can also be operated without the use of the additional power supply. The additional supply is removed by opening the switch shown in Fig. 6 and using pMOS switches to periodically deliver power to the resonant system. The effectiveness of this technique was diminished, however, since the addition of decoupling capacitance between the two supplies was overlooked. Consequently, the injection current through the pull-up switches flows through the additional inductance of the package and bondwire. The increased inductance in the current buildup path requires wider pull-up switches for the same current injection. The resulting increase in pull-up switch width increases conventional power dissipation in the clock generator. Therefore, although correct operation was verified at the resonant frequency of 1.03 GHz, the energy dissipation per cycle in the power clock was higher at 26.2 pJ.

To explore possible energy savings by reducing dissipation in the clock generator, we experimented with “half pumping,” i.e., driving the clock generator pulse at half the natural frequency. In half-pumping mode, the clock generator circuitry switches at half of the frequency, resulting in decreased switching power dissipation. The energy replenished by the clock drivers accounts for energy losses over two cycles, however, resulting in increased current injection and higher resistive losses on the replenishing switches. For resonant systems with sufficiently high energy efficiency, the energy savings obtained by the reduced switching activity in the clock drivers are expected to dominate the increased resistive losses, leading to lower overall power consumption. In our half-pumping experiments, however, the lowest achievable per-cycle energy dissipation in the clock was approximately 16.8 pJ, which exceeded the clock energy dissipation during normal operation.
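For a side-by-side view, the three per-cycle clock energies reported for RF1 can be compared against the normal driven mode. This is only a bookkeeping sketch: the mode labels are ours, and the figures are the 14.3 pJ, 26.2 pJ, and 16.8 pJ values quoted in the text.

```python
# Per-cycle clock energy of RF1 in the three reported operating modes (pJ).
# Mode labels are illustrative; the values are those quoted in the text.
clock_energy_pj = {
    "driven, additional supply": 14.3,
    "pMOS pull-up switches only": 26.2,
    "half pumping": 16.8,
}

baseline = clock_energy_pj["driven, additional supply"]

# Fractional increase of each mode over the normal driven mode.
overhead = {mode: energy / baseline - 1.0
            for mode, energy in clock_energy_pj.items()}

for mode, energy in clock_energy_pj.items():
    print(f"{mode:28s} {energy:5.1f} pJ  {overhead[mode] * 100:+6.1f} %")
```

Both alternative modes cost more clock energy per cycle than normal operation, with the supply-free mode showing the larger penalty.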

B. RF2 Measurement Results

Fig. 10(b) shows the clock, logic, and total energy-per-cycle dissipation of RF2 versus the clock supply voltage. The operating frequency of the design is 1.01 GHz. In contrast to conventional designs, in which insertion delay depends directly on supply voltage due to the use of clock buffers, insertion delays in RCL-derived designs depend primarily on interconnect delays. Consequently, clock skews in RF2 remain essentially unaffected by supply voltage scaling, as confirmed independently by simulations, providing additional timing margin for reducing supply voltage. Moreover, efficient clocking in RF2 allows for further supply voltage scaling in the logic, since the increased logic delay can be compensated by better latch performance obtained from higher clock amplitude. The improved energy efficiency that can be achieved by driving the clock at a higher amplitude than the voltage-scaled logic is demonstrated in Fig. 10(b). At each value of the clock supply voltage, the logic supply voltage was scaled to achieve minimum total power dissipation. As shown in the figure, the optimal energy point occurs for 0.59 V and 1.08 V. For lower values of the clock supply voltage, a higher logic supply voltage is required due to lower clock amplitude. Increasing the clock supply voltage increases clock amplitude, allowing for lower total energy dissipation through logic supply voltage scaling. Beyond the optimal value of the clock supply voltage, the supply-voltage scaling afforded by improved latch performance cannot compensate for the increasing clock power dissipation, resulting in increased total power dissipation.
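The search for the optimal energy point can be thought of as a nested sweep: for every clock supply setting, pick the logic supply that meets timing with the least total energy, then keep the best pair. The sketch below illustrates that procedure only. The function name and the two callables `meets_timing` and `total_energy` are hypothetical stand-ins for bench measurements or simulation and are not part of the original work.

```python
from typing import Callable, Iterable, Tuple


def optimal_operating_point(
    clock_supplies: Iterable[float],
    logic_supplies: Iterable[float],
    meets_timing: Callable[[float, float], bool],
    total_energy: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Return (clock_supply, logic_supply, energy) with the lowest total energy.

    meets_timing(vc, vdd) and total_energy(vc, vdd) are hypothetical callables
    supplied by the caller, e.g. wrappers around measurements or simulation.
    """
    logic_grid = list(logic_supplies)
    best = None
    for vc in clock_supplies:
        # Logic supplies that still meet timing at this clock supply.
        feasible = [vdd for vdd in logic_grid if meets_timing(vc, vdd)]
        if not feasible:
            continue
        # Scale the logic supply to minimize total energy for this clock supply.
        vdd = min(feasible, key=lambda v: total_energy(vc, v))
        energy = total_energy(vc, vdd)
        if best is None or energy < best[2]:
            best = (vc, vdd, energy)
    if best is None:
        raise ValueError("no operating point meets timing")
    return best
```

An exhaustive grid search is used here for clarity; it makes no assumption about how energy varies with either supply, at the cost of evaluating every feasible pair.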


An essential practical requirement of any synchronous design is the ability to clock the design at a predetermined frequency, regardless of variations in the fabrication process. Since the frequency of a fully self-resonating clock generator such as the one used in RF2 cannot be tuned, it is important that frequency variation arising from the fabrication process be kept to a minimum. To determine the variation of the resonant frequency of RF2 across multiple chips, the resonant frequencies of ten randomly selected chips were measured. All ten chips were operational, with an average resonant frequency of 1.012 GHz and a relative standard deviation of 0.012. For such a relatively low variation in resonant frequency, injection-locking may be an attractive option for setting the oscillation frequency [7].
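To put the reported spread in absolute terms, the relative standard deviation can be converted back to a frequency. The sketch below also includes a small helper showing how such a statistic could be computed from per-chip measurements. The helper name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used. The individual chip frequencies are not given in the paper, so none are listed here.

```python
import statistics


def relative_std(frequencies_hz):
    """Relative standard deviation (sigma / mean) of measured frequencies.

    Assumption: sample standard deviation (n - 1 denominator); the source
    does not state which estimator was used.
    """
    return statistics.stdev(frequencies_hz) / statistics.mean(frequencies_hz)


# Reported summary statistics for the ten measured RF2 chips.
MEAN_FREQ_HZ = 1.012e9
REL_STD = 0.012

# Absolute spread implied by the reported relative standard deviation.
sigma_hz = REL_STD * MEAN_FREQ_HZ
print(f"implied standard deviation: {sigma_hz / 1e6:.1f} MHz")
```

The implied chip-to-chip standard deviation is on the order of 12 MHz around the 1.012 GHz mean.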

VII. CONCLUSION

This paper describes RF1 and RF2, two level-clocked ASIC designs that deploy resonant clocking to drive the latches in their clock distribution networks with substantially reduced power dissipation over conventional switching. Fabricated in a commercial 0.13 µm bulk silicon process, the two test-chips have been designed using RCL, a general latch-based design methodology for energy-efficient high-performance resonant-clocked chips. The RCL methodology “resonates” the entire clock network, recovering energy from all clock-related switching capacitance and maximizing clock efficiency. Clock distribution is performed through a metal-only network without clock buffers, resulting in a sinusoidal clock waveform that directly drives the latches at the leaves of the clock network. High performance is achieved through the use of appropriately designed level-sensitive latches whose timing characteristics depend on the voltage level, rather than the slew rate, of the clock waveform.

RF1 is a single-phase 8-bit 14-tap FIR filter with an on-chip inductor that operates at frequencies of up to 1.2 GHz. Frequency-scaled operation in RF1 is achieved by forcing the clock network to oscillate off resonance, without altering its circuit parameters. At its natural frequency of 1.03 GHz, RF1 dissipates 143 nW/MHz/Tap/InBit/CoeffBit, with clock power accounting for only 10.8% of the 131 mW total power dissipation. In comparison with conventional switching, RF1 dissipates 76% less power for driving its 42 pF clock load.

RF2 has been designed using the same Verilog description as the one used for synthesizing RF1. Relying on a two-phase nonoverlapping resonant-clocked latching scheme, RF2 achieves robust and energy-efficient operation at its resonant frequency of 1.01 GHz. The distributed clock generator used in RF2 simplifies design and provides improved local clock skew control. By including the clock drivers into the resonant system, RF2 achieves greater clock power efficiency than RF1. Resonating 38 pF of clock per phase at 1.01 GHz, RF2 dissipates 84% less power than conventional switching. At 133 nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for digital FIR filters published to date.

Both RF1 and RF2 have the same latency, throughput, and latch counts as their conventional counterparts obtained from the same Verilog code. In general, resonant-clocked pipelines designed using the RCL methodology do not place any additional requirements on their combinational logic over conventional design. They are thus amenable to all non-clock-related power reduction techniques that can be applied to conventional designs, including gate sizing, multithreshold voltage assignment, power gating, and dynamic voltage scaling. Therefore, resonant clocking can be used to achieve further power savings, in addition to the power reductions already achievable by conventional logic optimization approaches.

ACKNOWLEDGMENT

The authors would like to thank C. Tokunaga for his invaluable assistance with design. They would also like to thank the anonymous reviewers for their constructive comments.

REFERENCES

[1] C. H. Ziesler, J. Kim, V. S. Sathe, and M. C. Papaefthymiou, “A 225 MHz resonant clocked ASIC chip,” in Proc. ISLPED, Aug. 2003, pp. 48–53.

[2] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “RF2: A 1GHz FIR filter with distributed resonant clock generator,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 44–45.

[3] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “A 0.8–1.2 GHz single-phase resonant-clocked FIR filter with level-sensitive latches,” in Proc. CICC, Sep. 2007, pp. 583–586.

[4] W. Athas, N. Tzartzanis, L. Svensson, and L. Peterson, “A low-power microprocessor based on resonant energy,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1693–1701, Nov. 1997.

[5] S. Chan, P. Restle, and K. Shepard, “A 4.6 GHz resonant global clock distribution network,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, pp. 42–43.

[6] A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, “Resonant clocking using distributed parasitic capacitance,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1520–1528, Sep. 2004.

[7] S. Chan, P. Restle, and K. Shepard, “Distributed differential oscillators for global clock networks,” IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2083–2094, Sep. 2006.

[8] M. Hansson, B. Mesgarzadeh, and A. Alvandpour, “1.56 GHz on-chip resonant clocking in 130 nm CMOS,” in Proc. CICC, Sep. 2006, pp. 241–244.

[9] P. Restle, “A clock distribution network for microprocessors,” IEEE J. Solid-State Circuits, vol. 36, no. 5, pp. 792–799, May 2001.

[10] T. G. Szymanski, “LEADOUT: A static timing analyzer for MOS circuits,” in Proc. ICCAD, Nov. 1986, pp. 130–133.

[11] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “checkTc and minTc: Timing verification and optimal clocking of synchronous digital circuits,” in Proc. ICCAD, Nov. 1990, pp. 552–555.

[12] V. S. Sathe, “Hybrid resonant-clocked digital design,” Ph.D. dissertation, Univ. of Michigan, Ann Arbor, May 2007.

[13] MOSIS. [Online]. Available: http://www.mosis.com/products/fab/vendors/ibm/8rf dm/

[14] W. C. Athas, N. Tzartzanis, and L. J. Svensson, “A resonant signal driver for two-phase, almost-non-overlapping clocks,” in Proc. ISCAS, May 1996.

[15] P. J. Restle and A. Deutsch, “Designing the best clock distribution network,” in VLSI Circuits Symp. Dig., Jun. 1998, pp. 2–5.

[16] N. Weste and D. Harris, CMOS VLSI Design, 3rd ed. Reading, MA: Addison-Wesley.

[17] J. Yuan and C. Svensson, “High speed CMOS technique,” IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62–70, Feb. 1989.

[18] R. B. Staszewski, K. Muhammad, and P. Balsara, “A 500-Msample/s 8-tap FIR digital filter for magnetic recording read channels,” IEEE J. Solid-State Circuits, vol. 35, no. 8, pp. 1205–1210, Aug. 2000.

[19] S. Rylov, “A 2.3Gsample/s 10-tap digital FIR filter for magnetic recording read channels,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2001, pp. 190–191.

[20] J. Park, “Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid-State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.


Visvesh S. Sathe (M’02) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, in 2001, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2004 and 2007, respectively.

While at the University of Michigan, his research focused on low-energy circuit design with a particular emphasis on resonant-clocked digital design. He has held internship positions at the IBM T. J. Watson Research Center and Cyclos Semiconductor. In 2007, he joined the Advanced Power Technology Group, Advanced Micro Devices, Fort Collins, CO, as a Senior Design Engineer. His current work focuses on the exploration and implementation of power reduction techniques for microprocessors.

Jerry C. Kao (M’04) received the B.S. degree in electrical engineering from Columbia University, New York, NY, in 2000, and the M.S. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2002. He is currently working toward the Ph.D. degree at the University of Michigan.

From 2002 to 2005, he was with IBM, Rochester, MN, where he was involved in the design of the CELL processor and the XBOX 360 processor. His research interests include high-performance and low-power circuit technologies and design methodologies.

Marios C. Papaefthymiou (M’93–SM’02) received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, in 1988 and the S.M. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1990 and 1993, respectively.

After a three-year term as an Assistant Professor at Yale University, he joined the University of Michigan, Ann Arbor, where he currently is a Professor of Electrical Engineering and Computer Science and Director of the Advanced Computer Architecture Laboratory. He is also cofounder and Chief Scientist of Cyclos Semiconductor, a start-up company commercializing low-power devices. His research interests encompass algorithms, architectures, and circuits for energy-efficient high-performance VLSI systems. He is also active in the field of parallel and distributed computing.

Dr. Papaefthymiou was the recipient of the ARO Young Investigator Award, the National Science Foundation CAREER Award, and a number of IBM Partnership Awards. Furthermore, together with his students, he has received a Best Paper Award at the 32nd ACM/IEEE Design Automation Conference and the First Prize (Operational Category) in the VLSI Design Contest of the 38th ACM/IEEE Design Automation Conference. He has served multiple terms as an Associate Editor for the IEEE TRANSACTIONS ON THE COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS, IEEE TRANSACTIONS ON COMPUTERS, and IEEE TRANSACTIONS ON VLSI SYSTEMS. He has served as the General Chair and as the Technical Program Chair for the ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. He has also participated several times in the Technical Program Committee of the IEEE/ACM International Conference on Computer-Aided Design.
