Power Driven Microarchitecture Workshop
in Conjunction with ISCA 98, Barcelona, Spain
Sunday, June 28, 1998

Organizers: Dirk Grunwald, University of Colorado
Srilatha (Bobbie) Manne, University of Colorado
Trevor Mudge, University of Michigan





Table of Contents

Introduction....................................................................1

Bus Transition Activity Session.................................................2

Reducing the Energy of Address and Data Buses with the Working-Zone Encoding Technique and its Effect on Multimedia Applications.........................................3

Enric Musoll, Tomas Lang, and Jordi Cortadella

Reduced Address Bus Switching with Gray PC...................................................................9

Forrest Jensen and Akhilesh Tyagi

Instruction Scheduling for Low Power Dissipation in High Performance Microprocessors.................................................................................................................14

Mark C. Toburen, Thomas M. Conte, and Matt Reilly

Code Transformations for Embedded Multimedia Applications: Impact on Power and Performance......................................................................................................20

N. Zervas, K. Masselos and C.E. Goutis

Modeling Inter-Instruction Energy Effects in a Digital Signal Processor...........................25

Ben Klass, Don Thomas, Herman Schmit, and David Nagle

Power Issues in the Memory Subsystem...........................................31

Split Register File Architectures for Inherently Low Power Microprocessors...................32

V. Zyuban and P. Kogge

Energy Efficient Cache Organizations for Superscalar Processors....................................38

Kanad Ghose and Milind Kamble

Power and Performance Tradeoffs Using Various Cache Configurations..........................44

Gianluca Albera and Iris Bahar

A New Scheme for I-cache Energy Reduction in High-Performance Processors..............50

Nikos Bellas, Ibrahim Hajj and Constantine Polychronopoulos

Low-Power Design of Page-Based Intelligent Memory.....................................................55

Mark Oskin, Frederic T. Chong, Aamir Farooqui, Timothy Sherwood, and Justin Hensley

Power Reduction by Low-Activity Data Path Design and SRAM Energy Models............61

Mir Azam, Robert Evans, and Paul Franzon

Cache-in-Memory: A Low Power Alternative?..................................................................67

Jason Zawodny, Eric W. Johnson, Jay Brockman, and Peter Kogge

Innovative VLSI Techniques for Power Reduction.................................73

Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System.........74

Trevor Pering, Tom Burd, and Robert Broderson

Transmission Line Clock Driver........................................................................................80

Matthew Becker and Thomas Knight, Jr.


Architectural Power Reduction and Power Analysis...............................86

An Architectural Level Power Estimator............................................................................87

Rita Yu Chen, Mary Jane Irwin, and Raminder S. Bajwa

Multivariate Power/Performance Analysis For High Performance Mobile Microprocessor Design......................................................................................................92

George Z.N. Cai, Kingsum Chow, Tosaku Nakanishi, Jonathan Hall and Micah Barany

Power-Aware Architecture Studies: Ongoing Work at Princeton......................................98

Christina Leung, David Brooks, Margaret Martonosi, and Douglas Clark

Architectural Tradeoffs for Low Power............................................................................102

Vojin G. Oklobdzija

The Inherent Energy Efficiency of Complexity-Adaptive Processors..............................107

David Albonesi

Function Unit Power Reduction.................................................113

Minimizing Floating-Point Power Dissipation via Bit-Width Reduction........................114

Ying-Fai Tong, Rob Rutenbar, and David Nagle

A Power Management for Self-Timed CMOS Circuits (FLAGMAN) and Investigations on the Impact of Technology Scaling........................................................119

Thomas Schumann, Ulrich Jagdhold, and Heinrich Klar

Hybrid Signed Digit Representation for Low Power Arithmetic Circuits........................124

D. Phatak, Steffen Kahle, Hansoo Kim, and Jason Lue

Alternative Architectures.....................................................130

Power-Saving Features in AMULET2e............................................................................131

S.B. Furber, J.D. Garside, and S. Temple

A Fully Reversible Asymptotically Zero Energy Microprocessor...................................135

Carlin Vieri, M. Josephine Ammer, Michael Frank, Norman Margolus and Tom Knight

Low-Power VLIW Processors: A High-Level Evaluation................................................140

Jean-Michel Puiatti, Christian Piguet, Josep Llosa, and Eduardo Sanchez

Designing the Low-Power M*CORE(TM) Architecture...............................145

Jeff Scott, Lea Hwang Lee, John Arends, and Bill Moyer

Author Index..................................................................151


Power-Driven Microarchitecture Workshop

Introduction

In recent years, reducing power dissipation has become a critical design goal for many microprocessors due to portability and reliability requirements. Most of the power reduction was achieved through supply voltage reduction and process shrinks. However, there is a limit to how far supply voltages may be reduced, and the power dissipated on-chip is increasing even as process technology improves. Further advances will require not only circuit and technology improvements but new ideas in microarchitecture. This will be true not only for the obvious situation of portable computers but also for high-performance systems. It was the goal of the Power-Driven Microarchitecture Workshop to provide a forum for examining innovative architectural solutions to the power problem for processors at all levels of performance.

The Power-Driven Microarchitecture Workshop was held in conjunction with the ISCA98 conference. The response to the call for papers was outstanding, enabling us to assemble a strong program of over two dozen papers. Three invited speakers, Mark Horowitz of Stanford, and Vivek Tiwari and Doug Carmean from Intel, provided a brief tutorial introduction to power issues in VLSI design and covered existing problems and solutions in the microprocessor market. The purpose of these tutorials was to establish common ground for discussing power issues in processor design.

It is our hope that, as a result of this workshop, the level of awareness will have been raised in the architecture community about issues related to power dissipation and energy consumption. We further hope that this heightened awareness will lead to exciting new research in the area.

Dirk Grunwald
Bobbie Manne
Trevor Mudge


Bus Transition Activity


Reducing the Energy of Address and Data Buses with the

Working-Zone Encoding Technique and its Effect on Multimedia

Applications

Enric Musoll

Cyrix

(National Semiconductor Corp.)

Santa Clara, CA 95052

[email protected]

Tomas Lang

Dept. of Electrical and Computer Eng.

University of California at Irvine

Irvine, CA 92697

[email protected]

Jordi Cortadella

Dept. of Software

Universitat Politecnica de Catalunya

08071 Barcelona, Spain

[email protected]

Abstract

The energy consumption due to I/O pins is a substantial part of the overall chip consumption. This paper gives an overview of the Working Zone Encoding (WZE) method for encoding the external address and data buses for low power, based on the conjecture that programs favor a few working zones of their address space at each instant. In such cases, the method identifies these zones and sends through the address (data) bus only the offset of this reference (data value) with respect to the previous reference (data value) to that zone, along with an identifier of the current working zone. This is combined with a one-hot encoding for the offset.

The paper then focuses on preliminary work on the following two topics:

reduction of the effect of the WZE delay on the bus access time by overlapping this delay with the virtual to physical address translation. Although the modification to allow this overlapping might increase the bus energy, simulations of the SPEC benchmarks indicate that for a page size of 1 KB or larger the effect is negligible.

extension of the technique for the data bus to some multimedia applications which are characterized by having packed bytes in a word. For two typical applications, the data-only data bus and data-only address bus I/O activity is reduced by 74% and 51% with respect to the unencoded case, and by 68% and 33% with respect to the best of the rest of the encoding techniques.

1 Introduction

The I/O energy is a substantial fraction of the total energy consumption of a microprocessor [4], because the capacitance associated with an external pin is between one hundred and one thousand times larger than that corresponding to an internal node. Consequently, the total energy consumption decreases by reducing the number of transitions on the high-capacitance, off-chip side, although this may

This work has been partially funded by CICYT TIC 95-0419.

[Figure 1: Bus types in a general-purpose microprocessor. The diagram shows the instruction and data address buses and the instruction and data data buses (possibly multiplexed) connecting the execution core, address generator and load/store units, I- and D-caches, ROM, and external memory.]

come at the expense of some additional transitions on the low-capacitance, on-chip side.

For a microprocessor chip, the main I/O pins correspond to the address and data buses. In this work, we consider an encoding to reduce the activity in both of these buses. If the value carried by n bits has to be transmitted over a bus, a reduction in the switching activity of this bus may be obtained at the cost of extra hardware in the form of an encoder on the sender device, a decoder on the receiver device, and potentially a larger number of wires m.

In [10] and [11] we have presented the Working-Zone Encoding (WZE) method, which is based on the conjecture that applications favor a few working zones of their address space. Moreover, consecutive addresses to each of these working zones frequently differ by a small amount. In such cases, an offset with respect to the previous address for that zone is sent, together with a zone identifier. The offset is encoded so as to reduce the activity of the address bus. This scheme is extended to the data bus [8] by noticing that the data for successive accesses to a working zone frequently differ also by a small amount, so that it is effective also to send the offset.

To evaluate the effectiveness of the technique in general applications, several SPEC95 streams of references to memory along with the corresponding data values were used. Among the possible bus organizations (see Figure 1), we have considered a multiplexed address bus (for instruction and data addresses) and a multiplexed instruction/data bus, with and without a unified 8K-byte direct-mapped cache. We also have compared with previously proposed encodings: Gray, bus-invert, T0, combined T0/bus-invert, inc-xor and dbm-vbm. Table 1 summarizes the results obtained. The table shows the energy reduction ratios of the WZE technique with respect to the unencoded case and to the best of the rest of the techniques, for each of the buses and for both together. We conclude that the WZE encoding significantly reduces the activity in both buses. Moreover, for the case without cache, the technique presented here outperforms the other previous bus encoding proposals for low power. On the other hand, for the case with cache the best scheme for the address bus is either the WZE presented here or bus-invert with four groups, depending on the overhead of these two techniques. In any case, the WZE method outperforms the rest of the techniques when both buses are encoded, and requires fewer additional wires than the bus-invert with four groups.

                    Address Bus Ratio         Ins/Data Bus Ratio        Both Buses Ratio
                    vs. non      vs. best     vs. non      vs. best     vs. non      vs. best
                    encoded      of rest      encoded      of rest      encoded      of rest
Avg. (no cache)     (0.39) 0.47  (0.54) 0.66  (0.67) 0.73  (0.77) 0.84  (0.56) 0.63  (0.69) 0.78
Avg. (with cache)   (0.71) 0.87  (0.87) 1.06  (0.53) 0.63  (0.61) 0.72  (0.60) 0.72  (0.70) 0.85

Table 1: Results summary for some SPEC95 streams. Ratios are calculated as Energy WZE / Energy other, where "other" is the unencoded case or the best of the rest of the techniques evaluated. Energy overhead of the encoder/decoder logic is only included for the WZE technique. In parentheses, without overhead.

In this paper we give an overview of the WZE technique and summarize previous work on the topic; this material is similar to that of [8] and should give the reader a reasonable understanding of the method. For more details consult [10, 11]. We then focus on preliminary work on the following two topics:

Reduction of the effect of the WZE delay on the bus access time by overlapping this delay with the virtual to physical address translation.

Use of the WZE technique in multimedia applications, which are characterized by having packed bytes in a word. Because of the particular features of these applications, we explore the possibility of special modifications to the encoding technique for the data bus.

1.1 Previous work

Several encoding techniques for reduced bus activity have been reported, such as one-hot [6], Gray [7], bus-invert [13], T0 and combined bus-invert/T0 [5], and inc-xor and dbm-vbm [12].

One-hot encoding results in reduced activity because only two bits toggle when the value changes. However, it requires a number of wires equal to the number of values encoded, so it is not practical for typical buses.

Gray and T0 encoding are targeted at situations in which consecutive accesses differ by one (or by a fixed stride). The Gray encoding is useful because the encodings of these values differ by one bit. In the T0 encoding an additional wire is used to indicate the consecutive-access mode, and no activity is required on the bus.
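To illustrate why Gray coding suits sequential access, a minimal sketch (the function name is ours, not the paper's):

```python
def gray(b: int) -> int:
    # Binary-reflected Gray code: consecutive integers map to
    # codewords that differ in exactly one bit position.
    return b ^ (b >> 1)

# A run of sequential addresses toggles a single bus wire per access:
for addr in range(8, 16):
    assert bin(gray(addr) ^ gray(addr + 1)).count("1") == 1
```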

The bus-invert method [13] consists of sending either the value itself or its bit-wise complement, depending on which would result in fewer transitions. An extra wire is used to carry this polarity information. For uniform and independent distributions, this encoding technique works better when the bit-width of the value to be sent is divided into smaller groups and each one is encoded independently. The bus-invert technique has been combined with T0 in [5], thus obtaining more activity reduction than each of the techniques by itself.
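The bus-invert decision fits in a few lines; the sketch below uses our own names and an 8-bit bus, and is not the paper's implementation:

```python
def bus_invert(value: int, prev_bus: int, width: int = 8):
    """Return (word to drive on the bus, polarity bit).

    Sends the value or its bit-wise complement, whichever differs
    from the previous bus state in fewer bit positions."""
    mask = (1 << width) - 1
    plain = bin((value ^ prev_bus) & mask).count("1")
    inverted = bin((~value ^ prev_bus) & mask).count("1")
    if inverted < plain:
        return (~value) & mask, 1  # polarity wire signals inversion
    return value & mask, 0
```

For example, driving 0b11111101 after an all-zero bus would cost seven transitions unencoded, but only one in the inverted form (plus the polarity wire).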

[Figure 2: Address space (of size 2^n) with three vectors A, B, and C; each active working zone has a Pref register pointing at the previous reference and a Pdat register holding the last data value for that zone.]

A source-coding framework is proposed in [12], as well as some specific codes. The scheme is based on obtaining a prediction function and a prediction error. This prediction error is XORed with the previous value sent to the bus so that the number of transitions is reduced in the likely case when the prediction error has a small number of ones. For addresses, the only new code proposed is the inc-xor code, in which the prediction is the previous address plus one (or any fixed stride) and the prediction error is obtained by the bit-wise XOR operation. This code is most beneficial for instruction address buses, where sequential addressing is prevalent. Also presented are codes which relate to the 1-hot encoding used in this paper, such as the dbm-vbm, that are applied to the data bus. In the dbm-vbm technique, the prediction is the previous address and the prediction error is obtained by a function that increases as the absolute difference between the current input and the prediction increases. Afterwards, code-words with fewer 1's are assigned to smaller error values. Finally, the result is XORed with the previous value sent to the bus.
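The inc-xor code described above can be sketched as follows; the names and the 16-bit bus width are our assumptions for illustration:

```python
def inc_xor_encode(addr: int, prev_addr: int, prev_bus: int,
                   stride: int = 1, width: int = 16) -> int:
    mask = (1 << width) - 1
    prediction = (prev_addr + stride) & mask  # predict sequential access
    error = (addr ^ prediction) & mask        # all zeros when correct
    return (prev_bus ^ error) & mask          # transition signaling

# Sequential instruction fetch: a correct prediction leaves the bus
# word unchanged, i.e. zero transitions.
assert inc_xor_encode(0x1001, 0x1000, 0xABCD) == 0xABCD
```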

2 Overview of the WZE technique

In this Section an overview of the WZE technique for the address bus is given, along with the implementation decisions made and the rationale behind them. Afterwards, an extension of the WZE technique [8] is reviewed which allows the data bus to be encoded by reusing a large portion of the hardware already used to encode the address bus.

The basis of the WZE technique is as follows:

1. It takes into account the locality of the memory references: applications favor a few working zones of their address space at each instant. In such cases, a reference can be described by an identifier of the working zone and by an offset. This encoding is sent through the bus.


2. The offset can be specified with respect to the base address of the zone or to the previous reference to that zone. Since we want small offsets encoded in a one-hot code, the latter approach is the most convenient.

As a simple example, consider an application that works with three vectors (A, B and C) as shown in Figure 2. Memory references are often interleaved among the three vectors and frequently close to the previous reference to the vector. Thus, if both the sender and the receiver had three registers (henceforth named Prefs) holding a pointer to each active working zone, the sender would only need to send:

the offset of the current memory reference with respect to the Pref associated to the current working zone

an identifier of the current Pref.

3. To reduce the number of transitions, the offset is encoded in a one-hot code. Since the one-hot code produces two transitions if the previous reference was also in the one-hot code, and an average of n/2 transitions when the previous reference is arbitrary, the number of transitions is reduced by using a transition-signaling code [14]. In this case, before sending the reference through the bus an XOR operation is performed with the previous value sent, always resulting in one transition.

4. One value can be sent using a 0-hot code, which with transition signaling produces zero transitions. This code should be used for the most frequent event, which we have determined to be a repetition of the same offset for the current working zone.

5. When there is a reference that does not correspond to a working zone pointed to by any Pref, it is not possible to send an offset; in such a case, the entire current memory reference is sent over the bus. Moreover, it is necessary to signal this situation.

6. In general, the total number of working zones of a program can be larger than the number supported by the hardware. Consequently, these have to be replaced dynamically. The most direct possibility is to replace an active working zone as soon as there is a miss. However, in this case any arbitrary reference would disturb an active working zone. To reduce this effect, we incorporate additional registers (henceforth named potential working zones) that store the references that cause a miss. Various heuristics are possible to determine when a potential working zone becomes an active one.
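Points 3 and 4 above can be sketched as follows; the names, the 16-bit width, and the non-negative offsets are our simplifying assumptions (the paper also supports negative offsets):

```python
def encode_offset(offset: int, last_offset: int, prev_bus: int,
                  width: int = 16) -> int:
    """Encode a working-zone offset with transition signaling.

    A repeated offset is sent as the 0-hot codeword (zero toggles);
    any other in-range offset is sent 1-hot (exactly one toggle)."""
    mask = (1 << width) - 1
    if offset == last_offset:
        codeword = 0             # 0-hot: the most frequent event
    else:
        codeword = 1 << offset   # 1-hot, position = offset
    return (prev_bus ^ codeword) & mask

bus = 0b0000_0000_0000_1010
assert encode_offset(5, 5, bus) == bus                       # 0 transitions
assert bin(encode_offset(3, 5, bus) ^ bus).count("1") == 1   # 1 transition
```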

2.1 Implementation decisions

In the general scheme presented above, there are many aspects that have to be decided to obtain a suitable implementation. These decisions affect both the complexity of the implementation and the energy reduction achieved. Since there are many interdependent parameters, it is not practical to explore the whole space. Below we indicate the decisions made and the rationale for them.

The number of active and potential working zones affects the number of registers and associated logic (and therefore the encoder/decoder energy consumption) and the number of values of the identifier. In the evaluation of the scheme, we have explored a range of values and determined the one that produces the largest reduction. It was determined that a small number of working zones is sufficient.

When there is a hit to a working zone, an offset and an identifier are sent. There are choices for the set of values of the offset and the code of the identifier. Since the offset is sent in a one-hot code (with transition signaling), the set of values is directly related to the number of bits required. We have decided to use all bits of the original bus to send the offset. Moreover, we have seen that the number of hits is maximized if positive and negative offsets are used. Since all bits of the original bus are used for the offset, it is necessary to have additional wires for the identifier and, to minimize these additional wires, we use a binary code. We have considered using bits of the original bus for the identifier (thus reducing the offset bits) and have observed a significant increase in I/O activity with respect to the use of separate bits.

When there is a miss, this situation has to be signaled to the receiver. Since in that case all bits of the original bus are used to send the address, this hit/miss condition has to use some additional wire. As we have already decided to use additional wires for the identifier, one value on these wires might be used to signal the miss. However, this would produce a few transitions when changing from a hit to a miss. To ensure only one transition, we have assigned an additional bit to signal a miss.

The search for a hit in a working zone requires subtracting the previous address from the current one and detecting whether the offset is in the acceptable range. For the selection of which zones to check, it is possible to use any of the schemes used for caches. Because of the small number of working zones, we have chosen a fully-associative search.

There are two replacement procedures required: for the active working zones and for the potential working zones. As indicated before, when there is a miss the address is placed in a potential working zone. Since there are few of these, we use the LRU algorithm for this placement. Moreover, it is necessary to determine when a new active working zone appears and, in this case, which active working zone to replace. Among the possible alternatives, we have chosen to initiate a new active working zone when there is a hit in a potential working zone. Again, here we use the LRU replacement algorithm.

2.2 Extension to the data bus

The technique for the address bus can be extended to include also the data bus. This extension is based on the fact that in many instances the data values of consecutive accesses to a working zone differ by a small amount. If that is the case, the data can also be sent as an offset, coded in the one-hot encoding. In this case, the zero-hot encoding is used when the offset is zero.

To implement this extension, as illustrated in Figure 2, we include an additional register, called Pdat, per working zone. On the other hand, if the access is not to an active working zone or if the offset is larger than possible for the one-hot encoding, the whole value is sent through the bus. An additional wire is required to distinguish these cases.

Value sent over      1. XORing   2. One-hot     Receiver action        Receiver action
the bus (either                  retrieval      (address bus)          (data bus)
address or data)
(-) 010011           -           -              -                      -
(2) 011011           001000      3              offset 3 (Pref #2)     offset 3 (Pdat #2)
(1) 011011           000000      none (0-hot)   same offset (Pref #1)  same data value (Pdat #1)
(1) 011001           000010      1              offset 1 (Pref #1)     offset 1 (Pdat #1)

Table 2: Example of the decoding process. A hit is assumed throughout; the number in parentheses in the first column indicates the working zone number.

In addition, to further reduce the bus transitions, when the value on the data bus is not encoded by the WZE method, we use the bus-invert technique; for the address bus we saw that the benefits of using bus-invert in this case were very small.

In summary, for the address bus, to send the offset it is necessary to compare it with the previous offset to the same working zone. The following two situations occur:

the offsets are the same: send again the previous value sent over the bus (zero transitions)

they are different: send the one-hot encoded value of the offset using transition signaling (one transition).

For the data bus, to send the offset it is necessary to compare the current data value with the Pdat associated to the current working zone, and the following situations occur:

the values are the same: send again the previous value sent over the bus (zero transitions)

they are different: send the one-hot encoded value of the offset (one transition).

The decoding of an offset in the receiver is also done in two steps: XORing the value that it receives with the previous one, and retrieving the one-hot of the result. When the XORing produces a 0 vector, the two values were the same and this is interpreted (see Table 2):

for the address bus, as a repetition of the previous offset to that same working zone,

for the data bus, as a repetition of the previous data value when that same working zone was last accessed.
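The two-step receiver might look like the sketch below (assumed names; only the hit path of Table 2 is modeled, and the toggled-wire position stands in for the offset):

```python
def decode_step(bus: int, prev_bus: int):
    """XOR against the previous bus word, then retrieve the one-hot."""
    diff = bus ^ prev_bus
    if diff == 0:
        # Zero-hot: repeat the previous offset (address bus) or the
        # previous data value (data bus) for this working zone.
        return None
    assert diff & (diff - 1) == 0, "expected exactly one toggled wire"
    return diff.bit_length() - 1  # position of the toggled wire

# The three hit rows of Table 2:
assert decode_step(0b011011, 0b010011) == 3
assert decode_step(0b011011, 0b011011) is None
assert decode_step(0b011001, 0b011011) == 1
```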

2.3 Address and data bus elds

As shown in Table 3 (next page), the encoded address and data bus consists of six fields:

the na wires of the original address bus (word address)

the nd wires of the original data bus (word data)

ceil(log2(H+M)) wires to specify one of H working zones or M potential zones (ident)

one wire to indicate whether there has been a hit or a miss in any of the zones (WZ miss)

one wire to indicate if the data bus has been able to be encoded using the offset (dbus WZ encoded)

one wire to indicate, in the case of a miss in the working zones, whether the data bus is coded with the bus-invert technique (dbus BI encoded).

Therefore, m = na + nd + ceil(log2(H + M)) + 3 wires are required.
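As a quick check of the wire-count formula, for instance with a 32-bit address bus, a 32-bit data bus, and four active plus four potential working zones (illustrative parameters, not the paper's evaluated configuration):

```python
from math import ceil, log2

def wze_wires(na: int, nd: int, H: int, M: int) -> int:
    # m = na + nd + ceil(log2(H + M)) + 3 control wires
    # (WZ miss, dbus WZ encoded, dbus BI encoded)
    return na + nd + ceil(log2(H + M)) + 3

assert wze_wires(32, 32, 4, 4) == 32 + 32 + 3 + 3  # 70 wires total
```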

3 Reducing the delay

The decoder introduces some delay in the bus access. Since this might be unacceptable, we now describe a method to overlap this delay with the virtual to physical address translation.

When an address translation is required, the most direct approach would be to perform the translation and then apply the encoding to the resulting physical address. However, this would produce an increased delay for the bus access. To reduce this delay we propose the following modification of the WZE technique:

Use the virtual address to determine whether there is a hit in a working zone. To do this, each Pref contains the virtual address of the previous access to the corresponding zone.

Since, as is well known, the translation modifies only the most-significant bits of the address (the page number) but keeps unaltered the least-significant bits (the page offset), this procedure is correct as long as the offset does not cross page boundaries. Consequently, it is necessary to detect when a change of page occurs and, in that case, the access is not treated as an offset.
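The page-boundary condition can be sketched as follows (assumed names; 4 KB pages chosen for illustration, whereas the paper sweeps page sizes):

```python
PAGE_BITS = 12  # 4 KB pages (illustrative)

def offset_usable(vaddr: int, prev_vaddr: int) -> bool:
    # Translation rewrites the page number but never the page offset,
    # so a virtual-address offset is only valid within a single page.
    return (vaddr >> PAGE_BITS) == (prev_vaddr >> PAGE_BITS)

assert offset_usable(0x1010, 0x1FF0)      # same 4 KB page
assert not offset_usable(0x1FF0, 0x2010)  # crosses a page boundary
```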

Since the data value is not translated, it is encoded as in the original method.

Because of this modification in the WZE technique, the reduction in energy might be affected. This is because now we do not use offsets which cross a page boundary. On the other hand, the internal energy overhead might be reduced because now the detection of the offset uses only the page-offset bits. The simulations of the SPEC benchmarks indicate that for a page size of 1 KB or larger the effect is negligible. We observe this in Table 4, where the I/O transitions per reference on the multiplexed address bus for the gcc benchmark (with unified cache) and for different page sizes are shown.

Page size     I/O transitions/reference (address bus)
(no pages)    3.26
4K bytes      3.26
1K bytes      3.29
256 bytes     3.36
64 bytes      3.72
1 byte        5.32

Table 4: Effect of the pages on the activity reduction if the WZE delay is overlapped with the virtual to physical address translation. For pages larger than 1K bytes the effect is negligible. Some unrealistic page sizes are also shown for comparison purposes. The data is for the gcc benchmark with unified cache.


m-wire encoded address and data bus

Field (wires):  WZ miss (1)  ident (ceil(log2(H+M)))  word address (na)  dbus WZ encoded (1)  dbus BI encoded (1)  word data (nd)

WZ format:      0            WZ index                 offset or last     1                    don't care           offset or last data value
                                                      address value      0                    1                    BI(data value)
                                                                         0                    0                    complete data

Non WZ format:  1            don't care               complete address   don't care           1                    BI(data value)
                                                                         don't care           0                    complete data

Table 3: Information assigned to each of the several fields of the encoded address and data bus when there is a hit (WZ format) and a miss (Non WZ format) in the H working zones and in the M potential working zones. The number of wires for each field is also shown.

4 Use in Multimedia Applications

In this Section we modify the WZE technique for the data bus for those multimedia applications that use data workloads composed of packets of data, several of which may be brought from memory in the same word. Examples of these are image processing applications, and we will evaluate the WZE technique for two particular examples. We show the results for the data-only data bus and the data-only address bus (see Figure 1). The code of the applications is assumed to be stored in an internal ROM (therefore no references to instructions are sent through the address bus and no instructions are fetched through the data bus).

In multimedia applications images are composed of pixels. These pixels consist of one or a few components, each with a relatively small number of values (for instance, in the examples we illustrate, each pixel consists of three colors and each color can have 256 values). In such a case, these pixels can be stored in one word which is subdivided into subwords for each component. In many applications, these components are accessed and processed simultaneously. We now describe how we modify the WZE technique for this situation.

The address bus encoding part of the technique is not modified, since the locality of reference is even more apparent in these applications with images. For the data part we do the following:

- Instead of using an offset for the whole word, we encode each of the bytes using the byte offset. In this way, the one-hot encoding allows eight possible offsets. We have determined that this number of offsets (which would correspond to offsets from -4 to +4) is insufficient to capture a significant portion of the data. Consequently, it is convenient to extend the encoding to include also k-hot encodings for k > 1.

- Moreover, if we allow two possibilities for each byte, namely that the data satisfies the offset range (a hit) or not (a miss), we would need individual wires per byte to indicate this hit/miss. Since this would be a significant wire overhead, we decided to code every data byte as an offset, no matter how large this offset is. Moreover, since a k-hot encoding with transition signaling generates k transitions on the bus, to reduce the average number of transitions we encode the offsets into a k-hot code, with a smaller value of k for the more frequent offsets (that is, the value of k increases as the absolute value of the offset increases).

- When there is an address miss (that is, the address does not correspond to an active WZ), the most direct solution is to send the unencoded data value. However, we have found that the variation in the values of images is small enough that it is better to send the offsets with respect to the last data value.
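The per-byte offset scheme above can be approximated as follows. This is a simplified stand-in, not the authors' encoder: the actual code tables are not given in this text, so the codebook below simply assigns codewords of growing Hamming weight k to offsets of growing magnitude; under transition signaling a weight-k codeword is XORed onto the bus and costs k transitions.

```python
# Simplified sketch of the per-byte offset encoding (assumed codebook, not
# the paper's exact tables): each data byte is replaced by a codeword for its
# offset from the corresponding byte of the previous bus value; k-hot
# codewords with small k go to the frequent small offsets.

def build_codebook(width=8):
    """Map offsets of increasing magnitude to codewords of increasing weight."""
    words = sorted(range(1 << width), key=lambda w: (bin(w).count('1'), w))
    offsets = sorted(range(-128, 128), key=lambda o: (abs(o), o < 0))
    return dict(zip(offsets, words))

CODEBOOK = build_codebook()

def byte_offset(new, old):
    """Signed byte difference wrapped to [-128, 127]."""
    d = (new - old) & 0xFF
    return d - 256 if d > 127 else d

def transitions(new_word, old_word, nbytes=4):
    """Bus transitions needed to send new_word after old_word."""
    total = 0
    for i in range(nbytes):
        off = byte_offset((new_word >> 8 * i) & 0xFF, (old_word >> 8 * i) & 0xFF)
        total += bin(CODEBOOK[off]).count('1')  # k-hot codeword => k transitions
    return total

# Small per-byte offsets cost fewer transitions than large ones:
assert transitions(0x01010101, 0x01010101) == 0
assert transitions(0x01010102, 0x01010101) < transitions(0x010101F0, 0x01010101)
```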

Note that in this modification of the WZE technique, the two extra wires (dbus WZ encoded and dbus BI encoded in Table 3) are not needed, since the data bus is always encoded and no bus-invert is done.

We have done some preliminary evaluations for two applications: image alpha blending and motion estimation. Both use color images where a pixel is composed of three bytes that specify the three main colors; one pixel is read or written per memory reference.

4.1 Image Alpha Blending

Alpha blending [1] is used for imaging effects to merge two images together, weighting one image more than the other. Thus, alpha blending may be used for fading from one image to another, and this is the case shown here: given two images of two different human faces, six images are generated so that the first image is transformed into the second one. This is done for three different sets of human faces [2].

The results are reported in Table 5, where the WZE technique is compared to the unencoded case and to the best of the rest of the techniques (bus-invert with three groups, one per byte). The WZE technique uses four active and two potential working zones.

4.2 Motion Estimation

The motion-estimation algorithm [9] is used in video transmission to lower the bandwidth of the network over which the video is being transmitted. The frame to be transmitted is divided into blocks which are compared to several blocks in the previous frame, and the best match is selected.

The basic motion-estimation algorithm is applied to two different sets of images in motion [3, 2] (weather satellite, human face, and football images). Results are shown in Table 6; the best of the rest of the techniques is the dbm-vbm with three groups. The WZE technique uses three active working zones and no potential working zones.

Data-only Data Bus

  Images            non-encoded   WZE    BI(G=3)
  claire -> missa   10.07         3.00   8.68
  claire -> susie   10.68         3.23   8.58
  missa -> susie    10.43         3.83   8.97

Table 5: Data-only data bus I/O transitions per reference for the alpha-blending application. The best of the rest of the techniques is the bus-invert with three groups.


                     Data-only Data Bus Ratio    Data-only Address Bus Ratio   Both Buses Ratio
  Application        vs. non-enc.  vs. best      vs. non-enc.  vs. best        vs. non-enc.  vs. best
  Image Blending     0.32          0.38          0.17          0.23            0.26          0.32
  Motion Estimation  0.58          0.81          0.36          0.41            0.49          0.67

Table 8: Results summary for the two multimedia applications. Smaller ratio means fewer I/O transitions.

Data-only Data Bus

  Image Sequences    non-encoded   WZE    dbm-vbm(G=3)
  weather satellite  10.05         5.93   7.58
  human face         9.13          5.14   5.89
  football           10.79         6.22   7.79

Table 6: Data-only data bus I/O transitions per reference for the motion-estimation application. The best of the rest of the techniques is the dbm-vbm with three groups.

Data-only Address Bus

  Application        non-encoded   WZE    best of rest
  Image Blending     7.78          1.33   5.78
  Motion Estimation  6.71          2.40   7.21

Table 7: Data-only address bus I/O transitions per reference for the two example multimedia applications.

Table 7 shows the data-only address bus results for both applications. For the image blending, the best of the rest of the techniques is the dbm-vbm, whereas for the motion estimation it is the bus-invert. In either case, the WZE technique clearly outperforms any previously proposed technique.

The averaged results for both applications are shown in Table 8. We show the ratio of the I/O transitions with respect to the unencoded case and to the best of the rest of the techniques for the data-only data bus, the data-only address bus, and both buses. The I/O activity when coding both buses is reduced by 74% and 51% with respect to the unencoded case, and by 68% and 33% with respect to the best of the rest of the encoding techniques.
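The percentages quoted above follow directly from the Table 8 ratios: a transition ratio r against a baseline is a reduction of (1 - r) x 100%. A small arithmetic check, with the both-buses values transcribed from Table 8:

```python
# Both-buses transition ratios from Table 8; a ratio r corresponds to a
# (1 - r) * 100% reduction in I/O activity.
table8_both_buses = {
    "Image Blending":    {"vs_unencoded": 0.26, "vs_best_of_rest": 0.32},
    "Motion Estimation": {"vs_unencoded": 0.49, "vs_best_of_rest": 0.67},
}

def reduction_pct(ratio):
    return round((1 - ratio) * 100)

assert reduction_pct(table8_both_buses["Image Blending"]["vs_unencoded"]) == 74
assert reduction_pct(table8_both_buses["Motion Estimation"]["vs_unencoded"]) == 51
assert reduction_pct(table8_both_buses["Image Blending"]["vs_best_of_rest"]) == 68
assert reduction_pct(table8_both_buses["Motion Estimation"]["vs_best_of_rest"]) == 33
```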

5 Conclusions

This paper gives an overview of the Working-Zone Encoding (WZE) method for encoding an external address and data bus, based on the conjecture that programs favor a few working zones of their address space at each instant.

For the general-purpose microprocessor multiplexed address bus and multiplexed instruction/data bus (with unified cache), the WZE significantly reduces the I/O activity: around 30% with respect to the unencoded case and 15% with respect to the best of the rest of the encoding techniques. When no caches are present, the reductions are even larger.

For two multimedia applications using images as the workload, the activity on the data-only data bus and the data-only address bus is reduced by 74% and 51% with respect to the unencoded case, and by 68% and 33% with respect to the best of the rest of the encoding techniques. These results are promising and make it worthwhile to pursue the development of this approach.

References

[1] http://developer.intel.com/drg/mmx/appnotes/ap554.htm.

[2] http://ipl.rpi.edu/sequences/sequences.html.

[3] http://meteosat.e-technik.uni-ulm.de/meteosat.

[4] H. Bakoglu. Circuits, Interconnections and Packaging for VLSI. Addison-Wesley, Menlo Park, CA, 1990.

[5] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Address bus encoding techniques for system-level power optimization. In Design, Automation and Test in Europe, pages 861-866, February 1998.

[6] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic Publishers, 1995.

[7] H. Mehta, R.M. Owens, and M.J. Irwin. Some issues in gray code addressing. In Great Lakes Symposium on VLSI, pages 178-180, March 1996.

[8] T. Lang, J. Cortadella, and E. Musoll. Extension of the Working-Zone Encoding method to reduce the energy on the microprocessor data bus. Technical Report http://www.eng.uci.edu/numlab/archive/pub/nl98b/02.ps.Z, University of California, Irvine, May 1998. To be published in the next Int. Conference on Computer Design (ICCD'98).

[9] C. Lin and S. Kwatra. An adaptive algorithm for mo-tion compensated colour image coding. IEEE Globe-com, 1984.

[10] E. Musoll, T. Lang, and J. Cortadella. Exploiting the locality of memory references to reduce the address bus energy. In Int. Symp. on Low Power Electronics and Design, pages 202-207, August 1997.

[11] E. Musoll, T. Lang, and J. Cortadella. Working-Zone Encoding for reducing the energy in microprocessor address buses. Technical Report http://www.eng.uci.edu/numlab/archive/pub/nl98b/01.ps.Z, University of California, Irvine, March 1998. To be published in the next special issue of Transactions on VLSI on low-power electronics and design.

[12] S. Ramprasad, N.R. Shanbhag, and I.N. Hajj. Coding for low-power address and data busses: a source-coding framework and applications. In Proc. of the Int. Conf. on VLSI Design, pages 18-23, January 1998.

[13] M.R. Stan and W.P. Burleson. Bus-invert coding for low power I/O. IEEE Trans. on VLSI Syst., pages 49-58, 1995.

[14] M.R. Stan and W.P. Burleson. Low-power encodings for global communications in CMOS VLSI. IEEE Trans. on VLSI Syst., pages 444-455, 1997.


Reduced Address Bus Switching with Gray PC

FORREST JENSEN

AKHILESH TYAGI

Iowa State University
Ames, IA 50011

Abstract

Reduced switching on the address bus saves on the energy incurred in PC increment, the propagation of the PC value across the pipeline, instruction cache access, and memory access. We estimate the savings in bit switchings in the PC by adopting the Gray sequence for program sequencing. There is a 40% reduction in address bit switchings with a Gray PC over the traditional lexicographic PC over a collection of SPEC '95 integer and FP benchmarks. We also assess the adverse impact of Gray PC on the other processor units, particularly I-cache locality. The cache miss rates are indistinguishable between the Gray and lexicographic sequencing. These experiments were conducted with the SimpleScalar tool set. We also propose a design for a Gray counter for the PC. We have developed optimization algorithms for the loader so that the expected address bus switching is minimized.

1 Introduction

The need for reduced energy in processors has been emphasized elsewhere [Sin94]. Portable computing platforms (general-purpose computing engines such as laptops or special-purpose embedded processors) strive to reduce the energy of computation in order to prolong battery life. The packaging and heat dissipation limitations are another driving force behind the low-energy processor architecture and implementation trend.

In this paper, we assess the effectiveness of non-lexicographic instruction ordering, in particular Gray ordering, in reducing address bus switching and system energy. There are several architecture-level side effects of Gray address ordering that need to be evaluated. Specifically, an adverse impact on cache locality can negate any energy gains derived from the lower address bus switching.

This change in the implementation is transparent to the software. The only system component that needs to be modified is the loader. A simplistic loader can load an instruction originally at address A at the address gray(A). However, this poses a new optimization problem for the loader: to minimize the expected switching over the program graph. We provide some algorithms for this optimization problem as well, which have not been implemented yet (Section 4).
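The simplistic loader's mapping is easy to state concretely: the binary reflected Gray code of an index i is i XOR (i >> 1), so consecutive static instructions land at addresses that differ in exactly one bit. A minimal sketch (function names here are illustrative, not from the paper):

```python
def gray(i):
    """Binary reflected Gray code of i: i XOR (i >> 1)."""
    return i ^ (i >> 1)

def load_address(base, i):
    """Place the i-th static instruction at base + gray(i) (word-indexed sketch)."""
    return base + gray(i)

# Consecutive instructions are exactly one bit apart in the Gray sequence:
for i in range(255):
    assert bin(gray(i) ^ gray(i + 1)).count('1') == 1
```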

Note that we will use the terms energy and power interchangeably since, for a processor, power is just the energy per cycle times the clock frequency. Hence any technique not explicitly changing the clock frequency affects both energy and power identically. What fraction of processor and system energy is affected by reduced address bus switching? Burd and Peters [BP94] and Gonzalez and Horowitz [GH96] have profiled the energy distribution of the MIPS R3000 processor. Both studies are based on simulations of a VLSI design for the MIPS R3000 architecture. Burd and Peters [BP94] compute the switched capacitance per cycle as an estimate of energy. The switching frequency of capacitances depends on the control signal, instruction, and data correlations. They simulate the MIPS R3000 architecture with several real benchmark programs (from the SPECint '92 benchmark suite) to derive a switching probability for all the internal capacitive nodes. The total expected switched capacitance per clock cycle is 317 pF. The breakdown is as follows: instruction cache (I-Cache) memory: 30%; I-Cache control: 15%; datapath: 28%, with the register file accounting for 10% and the ALU & shifter for another 12%; global buses: 3-4%; controller: 10%; D-Cache: 8-9%. The program counter logic (next-PC) takes up another 4% of energy. Note that the proposed Gray PC (a program counter counting in Gray sequence) reduces energy for the I-Cache address decoder and PC logic (and some for the global buses), which account for 21% of processor energy.

(This work was supported in part by NSF grant #MIP9703702.)

These numbers did not include the switched capacitance for off-chip memory (the address bus on the backplane). Assuming reasonable cache hit rates, and given a 25 pF external load, the external switched capacitance per cycle is 272 pF! This is almost as high as the processor average switched capacitance of 317 pF. Hence, the proposed Gray PC can potentially have a bigger impact on the input/output energy than on the intra-processor energy. We discuss in Section 5 the potential impact of Gray PC on the L1 instruction cache and L2 cache interface.

The literature on logic design for a Gray counter and/or adder is sparse. We propose a design for a Gray counter in Section 3. This design is being implemented so that the counting energy of a Gray PC and a lexicographic PC can be compared. Doran [Dor93] is the most comprehensive reference on Gray adder design. Altera Corporation [Cor94] has an application brief dealing with a Gray counter design. Note that in the rest of the paper, we refer to the binary reflected Gray code sequence by the term Gray sequence. We also discuss a design for the instruction cache decoder, in order to better exploit the low Hamming distances of Gray sequences, in Section 5. Section 2 describes the experimental setup


and results.

Su et al. [STD94] also use a Gray PC. They report on the average address bus switching on an internal set of programs, but do not consider other aspects of Gray PC.

2 Experiments and Results

Burd & Peters [BP94] note that the average switching per program counter (PC) bit is a little over 7% over a variety of SPECint '92 benchmarks. Hence, the expected number of bit switchings over a 32-bit address is 2.5. This is consistent with the observation that an average basic block's size is 4-5 instructions, which accounts for about 2.3 least-significant bit switchings. The remaining switching comes with 20-25% frequency, when a non-sequential instruction such as a branch or procedure call is encountered. Gray PC ensures that within a basic block (and between basic blocks as well, as often as possible), exactly one address bit switches through the use of Gray sequencing in the PC. Assuming we can ensure Gray sequencing within all basic blocks, the expected address bus switching based on back-of-the-envelope calculations would be approximately 1.2 (the 0.2 accounts for switching due to non-sequential instructions).

The cache placement of instructions is based on the lexicographic addresses. The locality of program execution is exploited to achieve high hit rates. The cache blocks and sets are organized on the basis of contiguous bits. For instance, the least significant k0 bits give the block offset, the next k1 bits give the set address, and the remaining bits give the tag. This type of mapping reflects locality based on lexicographic ordering. It is possible that two instructions residing in conflicting blocks in lexicographic ordering may not conflict in Gray ordering, or vice versa. Hence we need to calibrate the hit rates of the instruction cache under lexicographic and Gray PCs. If the hit rate goes down significantly with a Gray PC, the increased energy cost of memory accesses might offset any gains derived from reduced address bus switching.
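The contiguous-bit mapping just described can be made concrete. A generic sketch, with field widths chosen to match an 8 KB direct-mapped cache with 32-byte blocks (k0 = 5) and 256 sets (k1 = 8); the function name is an assumption for illustration:

```python
def split_address(addr, k0=5, k1=8):
    """Split an address into (tag, set index, block offset) on contiguous bits."""
    offset = addr & ((1 << k0) - 1)           # least significant k0 bits
    index = (addr >> k0) & ((1 << k1) - 1)    # next k1 bits select the set
    tag = addr >> (k0 + k1)                   # remaining bits form the tag
    return tag, index, offset

assert split_address(0x12345678) == (0x91A2, 0xB3, 0x18)
```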

To determine the number of PC bits that switch in a Gray code PC versus a lexicographic PC, and to determine any difference in cache performance between the two encoding methods, we conducted our experimental work with a modified version of sim-cache, one of the simulation tools available in the SimpleScalar suite of simulators [BAB96]. In addition to determining the total number of switched bits with each encoding method, we also determined the number of times each of the eleven least significant bits of the PC switches. Sim-cache uses a 32-bit address to access 8-byte instructions. As a result, the three least significant bits are always zero. For the remainder of this discussion we ignore bits 0-2 and think of bits 3-10 as the eight least significant bits. To measure differences in cache performance, we calculated the miss rate for the level-1 instruction cache. For purposes of simulation, we maintained the PC as a lexicographic binary number. We incremented the counter and added offsets for jumps and branches in the usual fashion. Before each instruction memory access, including program load and instruction fetch, we converted the PC value to Gray. With SPEC95 integer and FP benchmarks, a Gray code PC produced approximately 40% fewer instruction address bit transitions. Note that the

address bus transition frequency is similar over both the integer and FP benchmarks. In Table 1 we present the address bit switching results. For both lexicographic and Gray encoding we show the average number of switched bits per instruction for the entire address and also for the eight least significant bits. The final column shows the percentage change in the total number of switched bits when going from lexicographic encoding to Gray encoding.

The reduction in instruction address bit switches confirms the observations made by Su et al. In addition, we observe that the eight least significant bits account for 90-93% of all switched bits. This suggests the possibility of implementing a PC that is partially lexicographic and partially Gray. The level-1 instruction cache miss rates produced by the two encoding methods were identical. We present these values in Table 2. We used separate level-1 instruction and data caches. Both were 8 KB, with 256 direct-mapped 32-byte blocks. The 1 MB unified level-2 instruction and data cache had 4096 4-way associative sets. We used 64-byte blocks and the LRU replacement scheme.
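The measurement loop described above can be approximated in a few lines. This is a stand-in for the modified sim-cache, assuming a straight-line trace of word addresses:

```python
def gray(i):
    return i ^ (i >> 1)

def switched_bits(trace, encode=lambda a: a):
    """Total bit transitions on the address bus over a trace of addresses."""
    prev = None
    total = 0
    for addr in trace:
        enc = encode(addr)
        if prev is not None:
            total += bin(prev ^ enc).count('1')  # Hamming distance = transitions
        prev = enc
    return total

trace = list(range(64))                   # 64 sequential word addresses
lex = switched_bits(trace)
gry = switched_bits(trace, encode=gray)
assert gry == 63                          # exactly one bit flips per Gray step
assert lex > gry                          # lexicographic sequencing flips more
```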

3 Gray Counter Design

There are two general operations performed on the contents of the PC: increment, and addition of an offset for a jump or branch instruction. We begin by presenting a Gray adder and follow with a Gray counter, which is a specialized adder. Conceptually, addition of Gray numbers is performed by first converting from Gray to lexicographic, adding the numbers, and then converting back to Gray. The disadvantage of this approach is that the Gray-to-lexicographic conversion is performed beginning with the MSB and working towards the LSB, while addition is performed in the opposite direction. The result is that two passes must be made through the bits instead of the usual single pass. To reduce the time necessary for the Gray-to-lexicographic conversion, we limit the Gray implementation to the eight least significant bits of the PC and use lexicographic for the remaining high-order bits. The interface between the Gray and lexicographic sections is trivial. In addition, we include a parity bit that indicates the parity of the Gray portion of the PC and allows us to convert from Gray to lexicographic beginning with the LSB. The Gray-to-lexicographic conversion, rather than the addition and conversion back to Gray, is still the limiting factor of this design. We have therefore divided the Gray portion into two nibbles. We convert the high-order nibble to lexicographic beginning with its MSB, progressing towards the LSB. For the low-order nibble, we work in the opposite direction, beginning with its LSB. The conversion to lexicographic of the two nibbles is therefore performed in parallel. Unlike Su et al., we do not modify the offsets before loading the program into memory. As a result, we need to convert only the PC to lexicographic before adding. Although this does not affect the speed of the conversion, it does reduce the complexity of the adder. For lexicographic addition, a full adder generates sum and carry bits. These are calculated with the following equations:

  s_i     = a_i XOR b_i XOR c_i
  c_(i+1) = a_i b_i + c_i (a_i + b_i)

The symbols are understood as follows: a_i and b_i are the two bits being added; s_i is the sum


                                Lexicographic                Gray
  Benchmark    Instr. count     Switches  Bits 3-10          Switches  Bits 3-10    % change
                                per inst  per inst           per inst  per inst
  int:
  cc1          264,897,677      2.3034    2.1832             1.4222    1.3126       -38.25
  li           173,965,506      2.3358    2.1718             1.4246    1.3217       -39.01
  go           132,917,038      2.2685    2.1504             1.3195    1.2338       -41.83
  compress95   35,684,602       2.2812    2.1939             1.3860    1.3044       -39.24
  m88ksim      494,917,870      2.3333    2.2338             1.3347    1.2681       -42.80
  vortex       404,996          2.2661    2.1858             1.3242    1.2519       -41.56
  FP:
  swim         796,527,564      2.2343    2.1244             1.2327    1.1241       -44.83
  wave5        4,515,144,715    2.1081    2.0595             1.1763    1.1538       -44.20

Table 1: Average number of switched bits per instruction for lexicographic and Gray program counters

  Benchmark    Lexicographic   Gray
  int:
  cc1          0.0986          0.0986
  li           0.0260          0.0260
  go           0.0854          0.0854
  compress95   0.0005          0.0005
  m88ksim      0.0867          0.0867
  vortex       0.0715          0.0715
  FP:
  swim         0.0132          0.0132
  wave5        0.0179          0.0179

Table 2: Level 1 Instruction Cache Miss Rates

produced at position i; c_i and c_(i+1) are the carry in and carry out of position i.

c_0 = 0 for addition and c_0 = 1 for subtraction.

Figure 1 illustrates the dataflow within a Gray adder.

In the top sequence (Figure 1(a)) the conversion to lexicographic is completed before the addition can begin. In the middle sequence (Figure 1(b)), the conversion to lexicographic runs right to left; the addition runs in parallel but slightly behind. In the bottom sequence (Figure 1(c)), the conversion to lexicographic of the two nibbles is performed in parallel. The addition follows behind the low-order nibble conversion but still finishes ahead because it is not delayed by the conversion of the high-order nibble.

The equations for a Gray adder are more complicated. This reflects the need for code conversions before and after the addition. For the Gray adder, we are adding a lexicographic number to a Gray number. In the following equations we represent the bits of the lexicographic value with a, the Gray value with g, and the lexicographic equivalent of the Gray value with b. We use p to represent the parity bit. In addition to sum and carry bits for each position, we must also determine the new value of p. These are calculated with the following equations:

  s_i     = a_i XOR a_(i+1) XOR c_i XOR c_(i+1) XOR g_i
  s_7     = a_7 XOR a_8 XOR c_7 XOR c_8 XOR b_7 XOR b_8
  c_(i+1) = a_i b_i + c_i (a_i + b_i)
  p_new   = a_0 XOR c_0 XOR p_old

c_0 = 0 for addition and c_0 = 1 for subtraction. We use a different equation for s_7 because b_7 is not derived from b_8.

For the low-order nibble:

  b_i = b_(i-1) XOR g_(i-1)
  b_0 = p

For the high-order nibble:

  b_i = b_(i+1) XOR g_i
  b_7 = g_7

Note that we have assumed the bits are labelled 0-7, with 0 being the LSB, so g_7 is the MSB of the Gray number and c_8 is the carry out of the most significant position. c_8 serves as the carry in for the lexicographic adder used to sum the higher-order bits. It is possible with a Gray adder, as with a lexicographic adder, to implement carry lookahead across a nibble. The only difference is that we must convert to lexicographic before computing the carry generate and carry propagate signals.

With a lexicographic counter, increment is generally performed by a half-adder. This is possible because, for increment, the number being added to the PC consists of a 1 for the LSB and 0's for all higher bits. Because of this, the equations for both the lexicographic and Gray adders are a simplification of those given above. For a lexicographic counter, the equations are as follows:

  s_i     = b_i XOR c_i
  c_(i+1) = c_i b_i
  c_0     = 1

For the Gray counter, they are as follows (b_i' denotes the complement of b_i):

  s_i     = (c_i b_i') XOR g_i
  c_(i+1) = c_i b_i
  c_0     = 1
  p_new   = p_old'

For the low-order nibble:

  b_i = b_(i-1) XOR g_(i-1)
  b_0 = p

For the high-order nibble:

  b_i = b_(i+1) XOR g_i
  b_7 = g_7

4 Loader

The loader's task is to assign instructions from the binary executable to the memory address space. Traditionally, this is a straightforward task, since the mapping from the original program sequence to the address sequence is an identity (with a linear shift). However, with Gray sequencing, the loader must perform extra work.

A simple loader can take an instruction from address i in the program (the ith instruction in the static program order) and load it at address base + gray(i), where base is the base address of the text segment, and gray(i) is


Figure 1: Illustration of Dataflow in Gray Counter. (Panels (a)-(c); stages shown over time: Gray-to-lexicographic conversion, add and convert to Gray, low/high nibble.)

Figure 2: Program Flow Graph with Basic Block Nodes (nodes BB1, BB2, BB3 with directed edges).

the binary string in the ith position in a Gray sequence. This is what we have currently implemented.

However, there is a bigger opportunity for address bit switching reduction at the loader stage. The simplistic approach outlined above results in a single bit switch for the program flow within a basic block. The number of switched bits is not controlled (it is comparable to the lexicographic PC scheme) for control flow out of basic blocks (inter-basic-block transitions). An example of a program flow graph is shown in Figure 2, where nodes represent basic blocks. Each of the edges can be assigned a probability based on static program profiling. Let p(BBi, BBj) be the probability of the transition from the basic block BBi to the basic block BBj. Let the address assigned to a basic block BB be denoted by A(BB). Let h(x, y) be the Hamming distance of two binary strings x and y. Then the following optimization problem models the loader's task.

Input: A directed graph.
Objective: Find a memory address (for relocatability, a relative address) A(BB) for each basic block BB such that

  sum over all BBi, BBj of  h(A(BBi), A(BBj)) * p(BBi, BBj)

is minimized. If profiling is not feasible, we can minimize

  sum over all BBi, BBj of  h(A(BBi), A(BBj)).
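The objective above is easy to state as a cost function. A toy sketch (block names, addresses, and edge probabilities below are made up for illustration):

```python
def hamming(x, y):
    """Hamming distance h(x, y) of two addresses."""
    return bin(x ^ y).count('1')

def switching_cost(assignment, edge_prob):
    """Expected inter-block address switching for an address assignment.

    assignment: dict basic block -> address A(BB)
    edge_prob:  dict (BBi, BBj) -> transition probability p(BBi, BBj)
    """
    return sum(p * hamming(assignment[i], assignment[j])
               for (i, j), p in edge_prob.items())

# Toy flow graph: a hot loop BB1 <-> BB2 and a cold edge BB1 -> BB3.
edges = {("BB1", "BB2"): 0.45, ("BB2", "BB1"): 0.45, ("BB1", "BB3"): 0.10}
naive = {"BB1": 0b000, "BB2": 0b011, "BB3": 0b001}
tuned = {"BB1": 0b000, "BB2": 0b001, "BB3": 0b011}
# Giving the hot pair addresses at Hamming distance 1 lowers the cost:
assert switching_cost(tuned, edges) < switching_cost(naive, edges)
```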

We have represented the Gray chain of addresses assigned to the instructions in a basic block BB by a single address A(BB) in order to bring out the similarities between this optimization problem and low-energy/power state machine synthesis. A common formulation of the state assignment problem for low energy is to minimize

  sum over s_i, s_j in S of  h(s_i, s_j) * p(s_i, s_j),

where p(s_i, s_j) is the steady-state probability of the transition between states s_i and s_j, and h(s_i, s_j) is the Hamming distance between the codes for the two states. We have developed several algorithms [Tya96], [SCT97] for low-power state assignment. Our intent is to modify one of them [Tya96] to handle the address assignment problem for the loader. The main problem with this approach is the size of the input problem. The state assignment methods typically take time close to quadratic (or close to N^2 log N) in the number of states N. A typical state machine has about 50-100 states. These approaches can turn out to be impractically expensive for large programs with hundreds of thousands of basic blocks. Hence, we


are also looking into heuristics linear in the number of basic blocks. Dichotomy-based state assignment methods such as [TPCD94] may be adaptable for an efficient loader algorithm.

5 Future Work and Conclusions

The work reported here is still preliminary. There are several interesting threads that we plan to pursue in the future. The reduced address bus switching derived from Gray sequencing is directly visible at the I-cache. However, it is the L1 and L2 cache/memory interface that provides high capacitances for the address bus. But the addresses that appear at the L1/L2 interface are already filtered based on the cache locality characteristics. Are the Gray address sequences any better than the lexicographic address sequences at the L1/L2 interface? Are there any additional constraints on the loader optimization that can reduce the Hamming distance of the upper address bits for the sets of addresses likely to collide in the cache? An additional complexity is that at the L1/L2 interface the instruction and data addresses are unified. This effect will further increase the average Hamming distance. Is a Harvard architecture a better choice, in order to separate the instruction and data address sequences and retain their low Hamming distances?

The decoder in the instruction cache accounts for a significant fraction of the energy consumed in the instruction cache [BP94]. However, a typical decoder design is precharged. We plan to experiment with a hybrid static and dynamic decoder design that leverages the low Hamming distance in the addresses to reduce energy.

References

[BAB96] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Technical Report CS-TR-96-1308, University of Wisconsin, Madison, 1996.

[BP94] T. D. Burd and B. Peters. A Power Analysis of a Microprocessor: A Study of an Implementation of the MIPS R3000 Architecture. Technical report, ERL, University of California, Berkeley, 1994. URL: http://infopad.EECS.Berkeley.edu/burd/gpp/gpp.html#publishedpapers.

[Cor94] Altera Corporation. Ripple-Carry Gray Code Counters in FLEX 8000 Devices. Technical Report Application Brief 135, Altera Corporation, May 1994.

[Dor93] R. W. Doran. The Gray Code. Technical report, Dept. of Computer Science, University of Auckland, 1993. URL: http://www.cs.aucland.ac.nz/techrep/TR131.

[GH96] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid State Circuits, 31:1277–1284, 1996.

[SCT97] P. Surthi, L. Chao, and A. Tyagi. Low-Power FSM Design using Huffman-Style Encoding. In Proc. of European Design and Test Conference, IEEE Computer Society Press, pages 521–525, 1997.

[Sin94] D. Singh. Prospects for Low-Power Microprocessor Design. Talk delivered at ACM/IEEE International Workshop on Low Power Design, Napa Valley, CA, 1994.

[STD94] Ching-Long Su, Chi-Ying Tsui, and Alvin Despain. Low power architectural design and compilation techniques for high-performance processor. In Proceedings of COMPCON 94, pages 489–498, February 1994.

[TPCD94] C. Y. Tsui, M. Pedram, C. Chen, and A. M. Despain. Low Power State Assignment Targeting Two- and Multi-level Logic Implementations. In Proc. of ICCAD, pages 82–87. ACM/IEEE, 1994.

[Tya96] A. Tyagi. Integrated Area-Power Optimal State Assignment. In Proceedings of the Synthesis and System Integration of Mixed Technologies, SASIMI '96, pages 24–31. SASIMI, Seiei Insatsu, Osaka, Japan, November 1996.


Instruction Scheduling for Low Power Dissipation in High Performance Microprocessors

Mark C. Toburen
Thomas M. Conte
Department of Electrical and Computer Engineering
North Carolina State University
Raleigh, North Carolina
{mctobure,conte}@eos.ncsu.edu

Matt Reilly
Digital Equipment Corporation
Shrewsbury, Massachusetts
reilly@rock.enet.dec.com

Abstract

Power dissipation is rapidly becoming a major design concern for companies in the high-end microprocessor market. The problem now is that designers are reaching the limits of circuit and mechanical techniques for reducing power dissipation. Hence, we must turn our attention to architectural approaches to solving this problem. In this work we propose a method of instruction scheduling which limits the number of instructions which can be scheduled in a given cycle based on some predefined per-cycle energy dissipation threshold. Through the use of a machine description [][] we are able to define a specific processor architecture and, along with that, an energy dissipation value associated with each functional unit defined therein. Through careful inspection we can define the cycle threshold such that the maximal amount of energy dissipation can be saved for a given program while incurring little to no performance impact.

1. Introduction

Power dissipation is becoming a vital design issue in today's high-end microprocessor industry. Two examples of high-end processors that suffer from high levels of power dissipation are the DEC Alpha 21164a and 21264. The 21164a runs at a clock speed of 500 MHz while dissipating tens of Watts of power, and the 21264 suffers even worse, with internal clock speeds which can reach 600 MHz while dissipating over 70 Watts. The problem now is that, as power dissipation continues to rise, we are rapidly approaching a point where we will be forced to use cooling techniques, such as liquid immersion or jet impingement, which are not suitable for today's personal computers, workstations, and low-end to mid-range servers.

Circuit designers and thermal engineers have produced some excellent techniques for keeping power dissipation to a minimum in recent years. Clock gating, power supply reduction, smaller process technology, and state-of-the-art packaging are all examples of the approaches that have been used to date to eliminate the power dissipation problem. However, these approaches are reaching their limitations as the demand for processors with higher clock speeds and denser transistor counts continues to rise.

In this work, we propose an architectural/compiler approach to solving this problem. One way we can reduce peak power dissipation is by preventing the occurrence of current spikes during program execution. If a region of code has a heavy profile weight and contains one or more instructions which require significantly more energy than others, then the execution of this region will lead to repeated current spikes in the processor, which results in increased power dissipation. Our goal is to prevent this from occurring by limiting the amount of energy that can be dissipated in any given cycle. Because of schedule slack, this often results in little or no performance impact. In our scheduling model, we schedule as many instructions as possible in a given cycle until the cycle threshold is violated. Once that point is reached, we move on to the next cycle and resume scheduling with the instruction that caused the violation in the previous cycle.

2. Previous Work

There have been previous attempts at using scheduling techniques to reduce total power consumption. Su, Tsui, and Despain proposed a technique which combined Gray code addressing and a method called cold scheduling to reduce the amount of switching activity in the control path of high-performance processors [1]. Used in conjunction with the traditional list scheduling algorithm, cold scheduling schedules instructions in the ready list based on highest priority. The priority of an instruction is determined by the power cost incurred when the instruction in question is scheduled following the last instruction. The power cost is taken from a power cost table which holds entries S(I,J) corresponding to the switching activity caused by the execution of instruction I followed by instruction J. Instructions in the ready list with lower power costs have higher priority. After each instruction is scheduled, the power cost of the remaining instructions in the ready list has to be recalculated before scheduling the next instruction. The drawbacks to the cold scheduling approach are obvious. First, a large table is required to hold power costs for all possible instruction combinations; for a high-performance processor with a complex instruction set, this table can be extremely large. Second, this table must be accessed for all instructions in the ready list after each new instruction is scheduled, which makes the scheduling process itself slower. However, Su, Tsui, and Despain show that the combination of Gray code addressing and cold scheduling results in a significant reduction in switching activity for the control path.
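As a rough sketch of the priority computation just described — the instruction names and S(I,J) values below are invented placeholders, and dependence updates to the ready list are omitted for brevity:

```python
# Sketch of cold scheduling (Su, Tsui, Despain): repeatedly pick the ready
# instruction whose pairwise power cost S(last, candidate) is lowest.
# The table entries here are made-up placeholders, not measured values.
POWER_COST = {            # S(I, J): switching cost of issuing J right after I
    ("add", "add"): 1, ("add", "mul"): 4, ("mul", "add"): 4,
    ("mul", "mul"): 2, ("start", "add"): 1, ("start", "mul"): 3,
}

def cold_schedule(ready):
    """Order `ready` instructions greedily by lowest pairwise power cost."""
    schedule, last = [], "start"
    remaining = list(ready)
    while remaining:
        # Re-evaluate every remaining instruction against the last one issued,
        # mirroring the per-step recalculation the text describes.
        nxt = min(remaining, key=lambda op: POWER_COST[(last, op)])
        remaining.remove(nxt)
        schedule.append(nxt)
        last = nxt
    return schedule

print(cold_schedule(["mul", "add", "mul", "add"]))  # -> ['add', 'add', 'mul', 'mul']
```

The per-step `min` over all remaining instructions is exactly the repeated table lookup the text identifies as the approach's second drawback.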

Another scheduling technique for reducing power consumption was presented by Tiwari, Malik, and Wolfe [2][3]. The goal in these works is to schedule code such that instructions are more judiciously chosen, as opposed to instruction reordering. In this approach, actual current measurements were taken on general-purpose processors and DSP processors. Current was measured for each instruction, and a power table built for single-instruction values as well as values for common paired instructions. Then, based on these measurements, test programs were rescheduled to use instructions which result in less power consumption. This selection is based on a number of issues, such as register accesses as opposed to memory accesses, and lower-latency instructions. Tiwari et al. also take into consideration what they term circuit state overhead, which is the switching activity between a pair of specific instructions. In [2] and [3] they argue that, for the processors tested, circuit state overhead is insignificant. However, a detailed analysis of another DSP processor [4] found that circuit state overhead was much more significant in determining the energy consumption of a pair of instructions. Through this approach of physical measurement and code rescheduling, substantial energy savings were achieved on the benchmarks used. Again, the problems with this approach are glaring. The process of hand-measuring current for all instructions and instruction pairs, while extremely valuable, is extremely time consuming, especially in light of the enormous instruction sets used in some of today's high-performance processors. Also, as with Su et al., a large table is needed to store all power values, which can lead to costly accesses during the scheduling process.

The scheduling approach proposed here is focused on reducing power dissipation, as opposed to power consumption. One advantage that our approach provides is that we can explicitly control the amount of power dissipation allowed for any given schedule; the two prior works simply limit power consumption as best they can. In our approach, we determine how much power is dissipated, which gives us tremendous flexibility in terms of being able to control dissipation for different architectures.

The remainder of this work presents our method of low-power scheduling. Section 3 presents the algorithm itself, along with a discussion of the machine description mechanism. Section 4 presents the results of our preliminary investigation into this approach, and in Section 5 we conclude the paper and present plans for future work. All studies presented in this paper were performed using the experimental LEGO compiler designed by the TINKER Research Group at North Carolina State University.

3. Low-Power Scheduling

The goal of traditional scheduling algorithms is to improve performance in terms of execution time. This can be done in a number of ways: such modern approaches as superblock scheduling [5], hyperblock scheduling [6], and treegion scheduling [7] focus mainly on increasing performance by increasing the amount of instruction-level parallelism in program code. In order to schedule for reduced power dissipation, we are forced to sacrifice some of the performance gains provided by these scheduling algorithms. However, the scheduling approach presented here has shown that significant energy savings can be obtained with minimal reduction in program performance.

In this section, the method behind our approach is presented. First, we discuss the machine description mechanism and how energy values are defined therein. A discussion of the scheduling algorithm itself follows.

3.1 MDES Machine Description

We use the MDES machine description language developed at the University of Illinois [8][9] as the basis for defining the architecture for which we are scheduling. We have built the MDES environment into the LEGO compiler, which allows us great flexibility in defining new experimental architectures. In addition, it provides a convenient mechanism for defining new machine-specific parameters, such as energy dissipation values. In the MDES description, we are able to define hardware resources such as registers, register files, different types of functional units, etc. In addition, we can define each operation and the functional units that it can be executed on. It is in this description that we define the energy dissipation values associated with each of the machine's functional units. Once these numbers are defined, the MDES environment builds a data structure that contains the energy information. Then, for each instruction, the scheduler queries the MDES to determine which FU the instruction is to be scheduled on. Once the scheduler knows which FU to schedule on, it queries the MDES again to get the energy dissipation value associated with the specified FU.

For this particular study, the energy values used in the MDES descriptions were obtained from actual power simulations run on different functional-unit types designed at Digital Equipment Corporation. The numbers provided were abstracted a bit in order not to reveal proprietary information, but are accurate enough to provide reliable results.
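As a rough illustration of the two-step query described above — the FU names and nJ values below are invented, not the abstracted DEC figures used in the study:

```python
# Hypothetical machine-description fragment: each functional-unit type is
# annotated with an energy-per-operation value (nJ). A real MDES description
# also declares registers, register files, and operation-to-FU bindings.
MDES_ENERGY_NJ = {"ialu": 4.0, "fpu": 10.0, "ldst": 7.0, "branch": 3.0}

FU_FOR_OP = {"add": "ialu", "fadd": "fpu", "load": "ldst", "beq": "branch"}

def energy_of(op):
    """Mimic the scheduler's two MDES queries: op -> FU, then FU -> energy."""
    fu = FU_FOR_OP[op]          # first query: which FU executes this op
    return MDES_ENERGY_NJ[fu]   # second query: that FU's energy value

print(energy_of("fadd"))  # -> 10.0
```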

3.2 Scheduling Algorithm

The scheduling algorithm presented here is based on the traditional list scheduling algorithm. Once the DAG has been built for a specific region, the list scheduler builds the ready list and begins scheduling instructions based on dependence height. Currently, we only schedule instructions at the basic block level. Once an instruction has been cleared to be scheduled and assigned to the proper FU, the list scheduler queries the FU's energy dissipation value from the MDES and adds the value provided to the energy total for the current cycle. If the total exceeds the defined threshold, the scheduler quits scheduling for the current cycle and begins scheduling for the next cycle with the instruction that caused the violation in the previous one. If the list scheduler does not detect a violation, it proceeds normally.
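A minimal sketch of the threshold check, assuming per-instruction energy values are already known; real list scheduling also honors dependence height and FU availability, which are omitted here:

```python
def threshold_schedule(ops, energy, threshold):
    """Pack ops into cycles greedily; start a new cycle when adding the next
    op would push the current cycle's energy total past the threshold.
    The violating op becomes the first op of the next cycle."""
    cycles, current, total = [], [], 0.0
    for op in ops:
        e = energy[op]
        if current and total + e > threshold:
            cycles.append(current)        # close the violated cycle
            current, total = [], 0.0      # violating op opens the next one
        current.append(op)
        total += e
    if current:
        cycles.append(current)
    return cycles

# Placeholder energy values (nJ) and a placeholder 16 nJ/cycle threshold.
energy = {"add": 4.0, "mul": 12.0, "load": 7.0}
print(threshold_schedule(["add", "mul", "load", "add", "add"], energy, 16.0))
# -> [['add', 'mul'], ['load', 'add', 'add']]
```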

The results presented in the following section show that this is a powerful technique for reducing power dissipation. We are currently investigating further enhancements to the current implementation.

4. Experimental Results

The studies performed so far with our scheduler have been run entirely on basic block code, on a wide-issue VLIW architecture, defined as Tinker, which contains three integer ALU units, general-purpose floating-point units, two load/store units, and one branch unit. All results given in this section are for the SPEC CINT95 suite of benchmarks. So far, we have found that careful determination of the per-cycle energy threshold can result in significant savings in terms of overall energy violations and total energy saved over a given benchmark. The threshold can take on any value equal to or greater than the largest energy value associated with any defined FU. At the low end of the threshold spectrum we see the optimum amount of energy savings; however, choosing extremely low power thresholds will result in a large impact on program performance in terms of total cycle count. In contrast, selecting extremely high threshold values can result in little to no energy savings. The ideal is to find the point at which energy savings are significant while program performance is maintained. Figures 1 and 2 show results for two test cases in which we chose two different per-cycle energy thresholds (in nJ). Figure 1 shows the total amount of energy violations saved by using the proposed low-power scheduling technique. In general, as the threshold decreases, the number of total violations saved increases; the cost, however, is increased execution time. We found that by increasing the threshold from the lower value to the higher one, we could significantly reduce the performance impact while maintaining significant overall energy savings. The overall improvement in execution time ranged from the smallest gain, for ijpeg, to the largest, for li.

The reduction in savings, in total energy violations and total energy, is inversely proportional to the amount of performance improvement achieved by increasing the energy threshold. Figures 1 and 2 clearly demonstrate this relationship. For the higher threshold, Figures 1 and 2 show that the amount of violations and total energy saved falls off a bit from the lower threshold. However, we feel that this reduction is acceptable given the performance gain we achieve by increasing the threshold value.

In addition to the results presented above for total energy violations and energy savings, we measured the extra energy cost in each benchmark program over fixed-length segments of scheduled cycles. For every such segment, we calculate the total amount of excess power dissipated. Table 1 shows that a measurable amount of excess energy (in nJ) is typically dissipated in each segment. Over large programs, this results in a large amount of excessive power dissipation which can be saved using the low-power scheduling technique proposed here.

In Table 1, the data represents the number of times that the total power difference fell in the specified range during the measured cycle spans.

5. Conclusions

In this paper, we have proposed a new method for scheduling instructions which helps reduce power dissipation in high-performance processors. The approach is based on a per-cycle energy threshold which may not be violated in any given cycle. Instructions are scheduled based on the list scheduling algorithm until the threshold for the current cycle is reached. Once the threshold has been exceeded, the scheduler begins scheduling for the next cycle. We have shown that this method can result in significant energy savings over a given program with little to no performance impact.

Future plans for this research are to extend the energy model to contain a more concise representation of the desired architecture, and to investigate further enhancements to the scheduling algorithm to allow further increased savings.

Figure 1: Energy violations saved with low-power scheduling, for each SPEC CINT95 benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) at the two per-cycle energy thresholds (nJ).


Table 1: Excessive power distribution per cycle span. For each benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex), the table records the number of cycle spans whose excess energy fell into each of the specified ranges.

References

[1] Su, C. L., Tsui, C. Y., and Despain, A. M., "Low Power Architecture and Compilation Techniques for High-Performance Processors," in Proc. of the IEEE COMPCON, pp. 489-498, 1994.

[2] Tiwari, V., Malik, S., and Wolfe, A., "Compilation Techniques for Low Energy: An Overview," presented at the Symposium on Low Power Electronics, 1994.

[3] Tiwari, V., Malik, S., and Wolfe, A., "Power Analysis of Embedded Software: A First Step Towards Software Power Minimization," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 1994.

Figure 2: Total energy savings with low-power scheduling, for each SPEC CINT95 benchmark at the two per-cycle energy thresholds (nJ).


[4] Lee, M. T. C., Tiwari, V., Malik, S., and Fujita, M., "Power Analysis and Low-Power Scheduling Techniques for Embedded DSP Software," presented at the Int. Symposium on System Synthesis, 1995.

[5] Hwu, W. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringman, R. A., Ouelette, R. G., Hank, R. E., Kiyohara, T., Haab, G. E., Holm, J. G., and Lavery, D. M., "The Superblock: An effective structure for VLIW and superscalar compilation," The Journal of Supercomputing, Jan. 1993.

[6] Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., and Bringman, R. A., "Effective compiler support for predicated execution using the Hyperblock," in Proc. of the 25th Ann. Int'l Symposium on Microarchitecture, 1992.

[7] Havanki, W. A., "Treegion scheduling for VLIW processors," M.S. Thesis, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC.

[8] Gyllenhaal, J. C., "A machine description language for compilation," M.S. Thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL.

[9] Gyllenhaal, J. C., Hwu, W. W., and Rau, B. R., "HMDES Version 2.0 Specification," Technical Report IMPACT-96-3, University of Illinois at Urbana-Champaign, Urbana, IL, 1996.


Code Transformations for Embedded Multimedia Applications: Impact on Power and Performance

Nikos D. Zervas
University of Patras
Dep. of Electrical & Computer Engineering, Rio 26500, Greece
Tel.: (+) 30 61 997324
E-mail: [email protected]

Kostas Masselos
University of Patras
Dep. of Electrical & Computer Engineering, Rio 26500, Greece
Tel.: (+) 30 61 997324
E-mail: [email protected]

C. E. Goutis
University of Patras
Dep. of Electrical & Computer Engineering, Rio 26500, Greece
Tel.: (+) 30 61 997324
E-mail: [email protected]

ABSTRACT
A number of code transformations for embedded multimedia applications are presented in this paper, and their impact on both system power and performance is evaluated. In terms of power, the transformations move accesses from the large background memories to small buffers that can be kept in the foreground. This leads to a reduction of the memory-related power consumption, which forms the dominant part of the total power budget of such systems. The transformations also affect the code size and the system's performance, which is usually the overriding issue in embedded systems. The impact of the transformations on performance is analyzed in detail. The code parameters related to the performance of the system, and the way they are affected by the transformations, are identified. This allows for the development of a systematic methodology for the application of code transformations that achieves an optimal balance between power and performance.

Keywords
Embedded, Multimedia, Code Transformations, Power, Performance.

1. Introduction
Image and video coding have rapidly become an integral part of information exchange. The number of computer systems incorporating multimedia capabilities for displaying and manipulating image and video data continuously increases. The rapid advances in multimedia and wireless technologies have made possible the realization of sophisticated portable multimedia applications such as portable video phones, portable multimedia terminals, and portable video cameras. Real-time image and video processing is required in such applications. Low power consumption is of great importance in such systems to allow for extended battery life. Low-power portable multimedia systems are described in [1-2]. Portability is by no means the only reason for low power consumption [3]; it is of utmost importance in non-portable applications as well. For this reason there is great need for power optimization strategies, especially at the high levels of the design flow, where the most significant savings can be achieved [3]. Power exploration and optimization strategies for image and video processing applications are described in [4-10].

There are two general approaches to the implementation of multimedia systems. The first is to use custom, dedicated hardware processors. This solution leads to smaller area and power consumption; however, it lacks flexibility, since only a specific algorithm can be executed by the system. The second is to use a number of instruction-set processors. This solution requires increased area and power in comparison to the first; however, it offers increased flexibility and allows implementation of multiple algorithms on the same hardware. Mixed hardware/software architectures can also be used.

In multimedia applications, memory-related power consumption forms the major part of the total power budget of a system [7-8]. A systematic methodology for the reduction of memory power consumption has been proposed in [7-8]. This methodology includes the application of loop and data-flow transformations. However, it mainly targets custom hardware architectures, and the impact of the transformations on the performance of an implementation based on instruction-set processors is not addressed.

In this paper a number of code transformations are presented. The effect of the transformations on three basic parameters of embedded multimedia systems, namely power, performance, and code size, is illustrated. The way in which these transformations affect power and performance is analyzed. As test vehicles, four well-known motion estimation algorithms are used. The aim of this research is to develop a methodology for the effective application of code transformations in order to achieve the optimal balance between power consumption and performance.

The rest of the paper is organized as follows: In section 2 a brief description of the motion estimation algorithms is given. In section 3 the applied transformations are described in detail. In section 4 the way in which the transformations affect the power consumption and the performance is described. Finally, in section 5 some conclusions are offered.

2. Motion estimation algorithms
Four typical [11] motion estimation algorithms were used as test vehicles, namely full search, three-level hierarchical search (hierarchical), parallel hierarchical one-dimensional search (phods), and two-dimensional three-step logarithmic search (log). Simulations using the luminance component of QCIF frames were carried out. The dimension of the luminance component of a QCIF frame is 144x176 (N×M). The reference window was selected to include 15×15 ((2p+1)×(2p+1), with p=7) candidate matching blocks. Blocks of 16x16 (B×B) pixels were considered. The general structure of the above algorithms is described in figure 1.


Figure 1: General structure of motion estimation algorithms.

The algorithms consist of 3 double-nested loops. Each block of the current frame (outer loop) is compared to a number of candidate blocks included in the reference window (middle loop). The comparison is performed by computing a distortion criterion (inner loop) using all the pixels of both the current and the candidate blocks. A conditional statement inside the nested loops checks whether the pixels of the candidate blocks are inside the previous frame or not.
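To make the loop structure concrete, here is a tiny runnable full-search sketch; the frame and window sizes are shrunk so it runs quickly, SAD is used as the distortion criterion, and the code is an illustration rather than the exact code used in the paper:

```python
# Tiny full-search motion estimation: for one current block, scan all
# candidate blocks in the reference window of the previous frame and
# return the displacement with minimum SAD (sum of absolute differences).
def full_search(cur, prev, bx, by, B, p, N, M):
    best, best_mv = None, (0, 0)
    for i in range(-p, p + 1):            # candidate blocks in the
        for j in range(-p, p + 1):        # reference window
            sad = 0
            for k in range(B):            # all pixels in the block
                for l in range(B):
                    r, c = B * bx + i + k, B * by + j + l
                    if r < 0 or r > N - 1 or c < 0 or c > M - 1:
                        sad += 255        # pixel outside the previous frame
                    else:
                        sad += abs(cur[B * bx + k][B * by + l] - prev[r][c])
            if best is None or sad < best:
                best, best_mv = sad, (i, j)
    return best_mv

# 8x8 frames, 4x4 block, p = 2: the previous frame is the current frame
# shifted by (1, 1), so the best motion vector should be (1, 1).
N = M = 8; B = 4; p = 2
cur = [[(r * 8 + c) % 97 for c in range(M)] for r in range(N)]
prev = [[cur[max(0, r - 1)][max(0, c - 1)] for c in range(M)] for r in range(N)]
print(full_search(cur, prev, 0, 0, B, p, N, M))  # -> (1, 1)
```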

3. Description of the applied transformations

a) Transformation 1

The first transformation applied was a loop interchange [12] between the loop related to the candidate blocks and the loop related to the pixels of each block. In this way, each pixel of the current block was accessed only once (instead of (2p+1)x(2p+1) times in the original description) and its contribution to the distortions of all the candidate blocks was computed. An array signal of size (2p+1)x(2p+1) was introduced to store the intermediate values of the candidate block distortions.
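A hypothetical sketch of the interchanged loop nest (boundary checks are omitted and an interior block is assumed; this is a reconstruction, not the authors' code):

```python
# After loop interchange: visit each pixel of the current block once and
# accumulate its contribution into every candidate's distortion entry.
def distortions_after_t1(cur, prev, bx, by, B, p):
    dist = [[0] * (2 * p + 1) for _ in range(2 * p + 1)]   # (2p+1) x (2p+1)
    for k in range(B):                  # pixels of the current block:
        for l in range(B):              # each one is read only once
            c = cur[B * bx + k][B * by + l]
            for i in range(-p, p + 1):  # all candidate blocks in the window
                for j in range(-p, p + 1):
                    dist[i + p][j + p] += abs(c - prev[B * bx + i + k][B * by + j + l])
    return dist

# With identical frames, the zero-displacement candidate has zero distortion.
cur = [[r * 6 + c for c in range(6)] for r in range(6)]
print(distortions_after_t1(cur, cur, 1, 1, 2, 1)[1][1])  # -> 0
```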

b) Transformation 2

The second transformation introduced a new array signal for the storage of the current block. This array signal was initialized from the current frame array signal before the loop related to the candidate blocks. This transformation is a loop distribution of the middle loop, i.e. an insertion of an extra loop with the same limits and step, combined with a node splitting of the outer loop. As a consequence, the inner loop accessed the current block array signal, instead of the current frame array signal, for the distortion computation.
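A sketch of the buffer copy introduced by transformation 2 (a hypothetical helper, not the authors' code): the current block is copied once into a small foreground buffer before the candidate-block loop, so the inner loop reads the buffer instead of the full current-frame array.

```python
def copy_current_block(cur_frame, bx, by, B):
    # Foreground buffer introduced by transformation 2: these B*B copy
    # accesses are the price paid for moving the (2p+1)^2 * B * B
    # inner-loop reads off the background current-frame memory.
    return [[cur_frame[B * bx + k][B * by + l] for l in range(B)]
            for k in range(B)]

frame = [[r * 10 + c for c in range(8)] for r in range(8)]
blk = copy_current_block(frame, 1, 1, 4)   # block at (bx, by) = (1, 1)
print(blk[0][0])  # -> frame[4][4] = 44
```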

c) Transformation 3

The third transformation introduced an array signal for the reference window. This array signal was initialized from the previous frame array signal before the loop related to the candidate blocks. This transformation is similar to transformation 2; the only difference is that the introduced array signal contains the reference window instead of the current block. So the inner loop accessed the reference window array signal, instead of the previous frame array signal, for the distortion computation.

d) Transformation 4

The fourth transformation was based on the same idea as the previous transformation, but it also exploited data reuse [13]. Data reuse is feasible because of the overlapping reference windows of neighboring blocks (Figure 2); specifically, an overlap of 2px(2p+B) pixels exists between the reference windows of two neighboring blocks. For every current block, the reference window array signal was initialized by shifting its last 2p columns to the left and transferring only the new non-overlapping B columns from the previous frame array signal.
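The sliding-window update of transformation 4 can be sketched as follows (a hypothetical helper; the window is assumed to be square with side 2p+B and to move horizontally by one block per step):

```python
def slide_reference_window(win, prev, top, new_col0, B, p):
    """Update the window for the next block: keep the last 2p columns (the
    overlap with the previous window) and fetch only B new columns from
    the background previous-frame memory."""
    W = 2 * p + B                       # window is W rows x W columns
    for r in range(W):
        row = win[r]
        row[:2 * p] = row[B:]           # shift the 2p overlap columns left
        for c in range(B):              # load the B non-overlapping columns
            row[2 * p + c] = prev[top + r][new_col0 + c]
    return win

prev = [[r * 10 + c for c in range(10)] for r in range(10)]
p, B = 1, 2
W = 2 * p + B
win = [[prev[r][c] for c in range(W)] for r in range(W)]   # window at column 0
slide_reference_window(win, prev, 0, W, B, p)              # advance by one block
print(win[0])  # -> columns 2..5 of row 0: [2, 3, 4, 5]
```

Only B columns per step are fetched from background memory, instead of refilling all 2p+B columns as in transformation 3.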

Figure 2: Overlapping of reference windows of neighboring blocks.

e) Transformation 5

The fifth transformation introduced two new array signals: one for the reference window, and one for the candidate block. The first array signal, storing the reference window, was initialized from the previous frame array signal before the loop related to the candidate blocks, without exploiting data reuse as in transformation 4.

Figure 3: Overlapping of neighboring candidate blocks.

The second array signal, storing the candidate block, was initialized from itself for the overlapping part ((B-1)xB pixels) and from the reference window array signal for the non-overlapping part (Bx1 pixels), before the inner loop. In this way the data reuse for the previous block array signal was optimally exploited (Figures 3, 4).

Figure 4: Optimal data reuse for the candidate blocks in a reference window.

This transformation is a node splitting of the outer loop and a loop distribution of the loop related to the candidate blocks.

f) Transformation 6

This is a data flow transformation and was applied only to the hierarchical motion estimation algorithm. The transformation produces the sub-sampled-by-four versions of both the current and previous frames from the sub-sampled-by-two version of each frame, instead of producing them from the original frames as defined by the initial algorithm.

4. Effect of transformations on power and performance

As already stated, memory-related power consumption forms the major part of the total power budget of an image or video

for (x=0; x<N/B; x++)            /* for all blocks in the       */
  for (y=0; y<M/B; y++)          /* current frame               */
    for (i=-p; i<p+1; i++)       /* for all candidate blocks    */
      for (j=-p; j<p+1; j++)     /* in the reference window     */
        for (k=0; k<B; k++)      /* for all pixels in the block */
          for (l=0; l<B; l++)
            if ((B*x+i+k)<0 || (B*x+i+k)>N-1 ||
                (B*y+j+l)<0 || (B*y+j+l)>M-1)
              /* conditional statement for the pixels
                 of the candidate blocks */



processing system. Thus, for the evaluation of the transformations' effect on power consumption, only the memory power consumption was considered. The transformations described in the previous section reduce the number of accesses to the large background memories storing the current and the previous frame, which are the most power costly, and introduce accesses to small arrays (also introduced by the transformations) that can be stored in the foreground, requiring smaller power per access. In this way the power consumption is heavily reduced. To better illustrate the effect of the transformations on memory accesses and on power consumption, let us assume that the number of accesses required by an algorithm in its original form is given by equation (1).

Small buffers, caches, register files, and registers are considered foreground memories; large buffers are considered background memories. The application of the transformations results in an algorithm description whose number of memory accesses is given by equation (2).

Obviously, the real effect of the transformations is the transfer of the larger part of the background memory accesses to foreground memories. The total number of accesses to memory elements is increased after the application of the transformations, since extra accesses are required for transferring the data from the background memories to the foreground memories. A first-order metric of a transformation's effectiveness (in terms of power consumption reduction) is the relative decrease of the background memory accesses, i.e. the ratio (A-C)/A. This first-order metric of the power consumption reduction for the different transformations is presented in figure 5. In figure 6, the real power savings achieved by the application of the transformations are illustrated.
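The first-order metric is straightforward to compute from the access counts; the counts below are illustrative only:

```python
def background_access_reduction(a_bkg_org, c_bkg_trnsf):
    """First-order power metric (A - C) / A: relative decrease of
    background-memory accesses after a transformation."""
    return (a_bkg_org - c_bkg_trnsf) / a_bkg_org

# Illustrative counts: 1,000,000 background accesses originally,
# 400,000 after a transformation -> 60% reduction.
print(background_access_reduction(1_000_000, 400_000))  # -> 0.6
```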

Figure 5: First-order metric of power consumption reduction.

The power savings introduced by the application of the transformations described in the previous section ranged from 30% to 50% for transformations 1-5, while transformation 6 achieved almost 8% power reduction but affected only a small part of the complete motion estimation algorithm. For the evaluation of the power consumption of the data memory, the model of Landman [14] was used. For the estimation, it was assumed that the current and previous frames are stored in two separate background on-chip buffers, while all the small arrays introduced by the transformations were assumed to be stored in an intermediate small buffer (data cache). All buffers were assumed to be single-port read/write and 8 bits wide.

Figure 6: Power savings after the application of code transformations.

The code transformations described in the previous section also affect the system performance, i.e. the number of cycles required for the execution of the code. At an abstract level, performance can be considered a function of four code parameters, namely the number of memory accesses, the number of computing operations, the number of control operations, and the number of address operations, as described by equation (3).

Here a, b, c, and d are parameters determined by the instruction-set processor on which the code is executed, reflecting the number of cycles required for each of these operations. A more refined model would distinguish the accesses with respect to the memory to which they are performed, and the computing operations with respect to their complexity. In general, evaluation of the above factors before and after the application of code transformations gives an estimate of the transformation's effect on performance. The effect of the transformations on performance (number of cycles) is illustrated in figure 7.
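Equation (3) amounts to a weighted sum of the four operation counts; the weights below are invented for illustration and would in practice be determined by the target processor:

```python
def estimated_cycles(counts, weights):
    """Abstract performance model of equation (3): a weighted sum of
    memory accesses and computing / control / address operations."""
    return sum(weights[k] * counts[k] for k in counts)

# Illustrative operation counts for a code fragment, and per-operation
# cycle weights (a, b, c, d) for a hypothetical processor.
counts = {"mem": 500, "compute": 800, "control": 100, "address": 300}
weights = {"mem": 2, "compute": 1, "control": 1, "address": 1}
print(estimated_cycles(counts, weights))  # -> 2200
```

Comparing this estimate before and after a transformation shows why a transformation can reduce cycles even while total memory accesses grow, as the results below illustrate.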

To estimate the transformations' effect on performance, the motion estimation codes were simulated on an ARM7 processor used as an embedded core. ARM is considered a state-of-the-art core for embedded telecommunication applications. However, it is still a general-purpose processor, and thus not the most suitable for multimedia applications like motion estimation. Since only the relative effect of the transformations on performance (in number of cycles) is of interest, the use of ARM suffices. The effect of code transformations on the number of cycles required for program execution is not the same in all cases; that is, the number of cycles either increases or decreases depending on the effect of the transformations on the parameters of the model described in equation (3).

Transformation 1 increased the number of cycles for all motion estimation algorithms. The increase ranged from 1.6% (full search, the simplest structure) to 60% (parallel hierarchical one-dimensional search). This is due to the increase of the total number of memory accesses; the number of computing and control operations remained stable, and the addressing operations were reduced.

Transformation 2 decreased the accesses to the background previous buffer, which has more complex address equations, while it increased the number of accesses to the foreground candidate block buffer, which has simple address equations.

Total number of accesses (org) = A (bkg) + B (fg)        (1)

Total number of accesses (trnsf) = C (bkg) + D (fg)      (2)

    where C < A, D >> B, and
    Total number of accesses (org) < Total number of accesses (trnsf)

performance = a × (# memory accesses) + b × (# computing operations)
            + c × (# control operations) + d × (# address operations)    (3)



That way the reduction of address operations led to decreaseof the number of cycles for all motion estimation algorithmsof 0,2% (phods)-10%(full search) although the total numberof memory accesses increased and the computing and controloperations remained stable.

In the same way transformation 3 reduced the addressoperations. Performing check for each reference window,instead of each candidate block, lead to an 81%-98%reduction of control operations. So although the total numberof memory accesses was increased, the number of cycles wasreduced from 27% (full search) to 37% (log).

For transformations 4 and 5, multiple control actions for the same pixel were eliminated because of the data reuse, resulting in a reduction of the number of control operations from 85% up to 99%. Transformation 4 also reduced the address operations, for the same reasons as transformations 2 and 3. So although the total number of memory accesses increased, the number of cycles was reduced by 27% (full search) up to 37% (log). Transformation 5 achieves optimal data reuse at the expense of complex addressing. This caused an 8%-63% increase of the address operations and, combined with the increase of total memory accesses, led to an increase in cycles from 30% (hierarchical) to 101% (full search).

It must be noted that no data caching was taken into account during simulation. This means that the performance of the codes (in number of cycles) is better than described above, especially for the cases where many accesses to the faster data cache occurred. Another system parameter affected by the code transformations is the code size. Transformations usually make the code more complex and thus increase the code size. This implies an indirect effect on the system's power consumption, since an increase of the code size leads to an increase of the program memory. Increased size of the program memory leads to increased effective capacitance per access, i.e., capacitance per instruction fetch. Since the motion estimation codes were mapped on ARM, which is a general-purpose processor, the final code size and the number of instructions required for program execution (effectively the number of times the program memory is accessed) are not very realistic. For this reason the power related to program memory was not evaluated. Furthermore, no instruction caching was taken into account. The presence of an instruction cache significantly reduces the program-memory-related power consumption, especially in data-dominated applications where cache misses do not occur frequently. The effect of the transformations on the code size is illustrated in figure 8. Code size increased for all the transformations and for all algorithms, from 7% up to 141%, due to the extra array signals and the complexity introduced by all the transformations.

5. Conclusions

In this paper, code transformations for power consumption reduction of embedded multimedia applications were presented. The transformations achieve power consumption reduction by moving the main part of the background memory accesses to small foreground memories. The code transformations also affect the system's performance. The effect of transformations on the performance is described analytically. Abstract models based on high-level code parameters for both power and performance were also described. These models can be used to evaluate the power and performance effects independently of the instruction set processor on which the code will be executed. Another system parameter affected by the code transformations is the code size. It was demonstrated that the code size increases after the application of code transformations, affecting the size of the code memory and indirectly the system's power consumption. This is because an increase in the size of the program memory corresponds to an increase of the effective capacitance per instruction fetch. A model for evaluating the power consumption of the program memory was also included.

The research described in this paper is part of on-going research. The aim of our research is the derivation of models for fast and accurate evaluation of the effect of code transformations on power and performance, and the development of a methodology for efficient application of transformations in order to achieve the optimal power-performance balance.

Figure 7: Transformation effect on performance (# cycles).

Figure 8: Effect of transformations on code size.


Page 28: Power Driven Microarchitecture WorkshopEncoding Technique and its Effect on Multimedia Applications.....3 Enric Musoll, Tomas Lang, and Jordi Cortadella ... Thomas M. Conte, and Matt

Abstract

This paper explores techniques for creating accurate instruction-level energy models for digital signal processors (DSPs). Our initial results confirm previous work showing that inter-instruction effects can become a significant component of power consumption for many programs. To overcome limitations of previous models, we develop a straightforward method (the NOP model) that models transitions between any two instructions. Measurements show that our method accurately models inter-instruction effects without a quadratic increase in the size of energy tables. Complex instructions are handled by treating functional units within the processor separately.

1 Introduction

Instruction-level energy models can be an effective tool for high-level software-based optimizations [LEE97][TIWA94]. The basic technique constructs a table that records each instruction's average energy. High-level power estimators use this table to quickly determine each software instruction's energy consumption, avoiding costly circuit-level simulation (e.g., Spice). Because instructions are the atomic units used by code generators, instruction-level energy models can be integrated with power-optimizing compilers more easily than simulation-based estimators. Further, instruction-level energy models allow chip manufacturers to provide fine-grained power information without having to disclose confidential design layout and implementation details, allowing software designers to quickly and accurately estimate a program's power consumption without understanding the underlying implementation details.

Unfortunately, accurate instruction-level energy models require more than simple per-instruction power estimates. Inter-instruction effects can significantly alter the power consumed by a given instruction, making it difficult to derive a single power number for each architectural instruction [LEE97]. Power tables could be expanded to include every pair of instructions. Unfortunately, building such tables can be very time consuming and requires O(N²) space (where N is at least the size of the instruction set). Grouping instructions into common classes [LEE97] can reduce the table size, but does not scale well for DSP-type architectures with their rich addressing modes and parallel instruction issue capabilities.

To overcome the problems of classification, we have developed a straightforward method that requires only O(N) space while accurately estimating program energy. Our results, simulated with an implementation of a subset of the Motorola DSP56000 (56K), produce instruction-level power tables that predict program power within 8% of simulation-based estimates. Further, by attributing each instruction's power consumption to the various functional units, we preserve accuracy while overcoming the difficulty associated with modeling the 56K's rich addressing modes and parallel functions.

Section 2 describes our subset of the 56K DSP and our design methodology. Section 3 presents our approach to generating instruction-level power tables and compares our results with previous techniques. Section 4 further evaluates these models and describes potential limitations. Finally, in Section 5 we present our conclusions and outline future work.

2 Tools and Methodology

2.1 CMU 56000 DSP

To build accurate models and to compare our results with a real design, we designed and implemented a standard-cell based subset of the Motorola DSP56000 instruction set [MOTO90]. Synopsys and Cascade's Epoch synthesized our 56K Verilog model into a standard-cell layout (see Figure 1).

We chose the 56K because its instruction-level power analysis is more complex than that of simple RISC cores and because the 56K's functionality is representative of many power-conscious architectures. The 56K is a 24-bit, fixed-point DSP that can encode and issue one arithmetic operation and up to two "parallel" data moves in one packed instruction. Our 56K core implements most of the arithmetic and basic data movement instructions and accounts for most of the logic that affects power consumption.

2.2 Mynoch Power Estimator

A variety of approaches have been used to characterize the power consumption of digital systems. For physical devices, direct measurement of current gives the most accurate measurements [TIWA94]. However, the granularity of results is limited to device-level, multi-clock-cycle measurements. Accurate, fine-grained results can be obtained with Spice-based simulators, such as Star-Sim [KRIS97], but long run-times severely limit the number of cycles/events that can be simulated. Gate-level power estimators improve simulation speed [KOJI95][PURS95][XANT97] by sacrificing accuracy to achieve faster run-times.

The initial analysis of our 56K's power consumption was done using CMU's gate-level analysis tool, Mynoch [PURS95]. Mynoch estimates power by counting transitions from a Verilog

This work was supported by the Defense Advanced Research Projects Agency under Order No. A564 and the National Science Foundation under Grant No. MIP90408457.

Modeling Inter-Instruction Energy Effects in a Digital Signal Processor

Ben Klass, Donald E. Thomas, Herman Schmit, David F. Nagle
Department of ECE, Carnegie Mellon University, Pittsburgh, PA
[email protected], [email protected]


simulation and calculating the dynamic energy consumed for each transition using:

Mynoch runs 450 times faster than Spice simulation, allowing us to simulate thousands of cycles for each test program. Memory is not modeled with Mynoch since Epoch uses behavioral models for memory. Spice-based simulations have shown memory to be approximately 20% of the total energy.
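The per-transition accounting can be sketched as follows; the net capacitances, toggle counts, and supply voltage are invented for illustration (real values come from the synthesized standard-cell layout):

```python
# Sketch of gate-level dynamic-energy accounting: total energy is
# the sum over observed net toggles of (1/2)*C*V^2.
# All capacitances, toggle counts, and VDD are hypothetical.

VDD = 3.3  # assumed supply voltage, volts

def dynamic_energy(toggles, net_capacitance):
    """toggles: net name -> number of 0<->1 transitions observed."""
    return sum(0.5 * net_capacitance[net] * VDD ** 2 * count
               for net, count in toggles.items())

caps = {"bus0": 1.2e-12, "alu_out": 0.4e-12}   # farads (invented)
toggle_counts = {"bus0": 10, "alu_out": 25}     # from simulation (invented)

energy_j = dynamic_energy(toggle_counts, caps)  # joules
```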

We verified the accuracy of Mynoch by comparing Mynoch's power estimates against Avanti's Star-Sim, which uses a modified Spice engine. Eighteen iterations of a 4-tap FIR filter were simulated with both Mynoch and Star-Sim. Initially, Mynoch's power estimates showed significant error in contrast to Star-Sim. To locate the source of Mynoch's error, we compared Star-Sim's per-module power estimates with Mynoch's estimates. The analysis showed that Mynoch's inability to account for intra-gate capacitance within registers (i.e., D flip-flops) was the primary source of error. To correct for this error, we used Star-Sim's power estimates to build a simple linear-regression model that included the number of registers. The resulting model had a very high degree of accuracy (r² factor of 0.98). This model was further verified by comparing results of Mynoch augmented by the regression model vs. Star-Sim for 120 cycles of an FFT program. The error between the two methods was less than 1% for the processor, although error on functional units was higher. All of the Mynoch results reported in this study are augmented by the regression model.
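The register-count correction can be sketched as a one-variable least-squares fit; the data points below are fabricated for illustration (the paper fit per-module Star-Sim measurements, and its exact model form may differ):

```python
# Hypothetical reconstruction of the register-count correction:
# fit residual = (Star-Sim power - Mynoch power) against a module's
# register count with ordinary least squares, then add the fitted
# residual back to Mynoch's estimate. All data here is made up.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

registers = [8, 16, 32, 64]            # registers per module (invented)
residual = [0.10, 0.22, 0.41, 0.79]    # Star-Sim minus Mynoch, nJ (invented)
a, b = fit_linear(registers, residual)

def corrected(mynoch_nj, num_registers):
    """Mynoch estimate augmented by the regression model."""
    return mynoch_nj + a + b * num_registers
```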

2.3 Test Programs and Reference Energy

In contrast to general processors, a DSP is frequently used to compute fairly simple, data-intensive programs. For the workloads of our power analysis we chose five program kernels that represent those found in typical signal processing applications (see Table 1). Three of the kernels are finite-impulse response (FIR) filters, one is a Fast Fourier Transform (FFT), and one is an adaptive filter (LMS). Gaussian white noise was used as input data.

Power consumption for the benchmark programs was measured using Mynoch (see Figure 2). Average energy consumed per cycle, given in nanojoules, provides the basic unit of measure which will be used throughout the paper. This is obtained by dividing the total energy over the program execution by the number of cycles. Power consumed by the pads and memory is not included.

3 Instruction-level Models

Although circuit-level simulations provide insight into power consumption for a given program, an instruction-level model is more appropriate for code generators. Instruction-level models typically use an energy table that describes the energy cost for each instruction in an instruction-set architecture (ISA). The challenge in building such a model is balancing accuracy and energy table size. This section presents four models with different accuracy and table size trade-offs. The first model, the base model, produces the smallest table size, but yields poor energy-prediction accuracy across a program run. The second model, the pair model, has greater accuracy, but at the cost of much larger tables. The third model, the NOP model, provides nearly the accuracy of the pair model with much smaller tables. The fourth model, the general model, is similar to the NOP model but the energy table generated is independent of the program being evaluated. The general model provides reasonable accuracy and is appropriate for use with code generators.

3.1 Base Model and Estimation

Building the Base Model’s Energy Table

Creating a complete and accurate energy table for the 56K DSP requires one to account for each instruction in the ISA, each instruction's different register and immediate values, and every possible packed instruction.¹ This can require a significant amount of time and space. To make the base model more

Figure 1: DSP56000 architecture and layout. Major components: 1) 3 KB of SRAM; 2) data ALU, which contains a 24x24-bit multiplier and 56-bit accumulator; 3) address generation unit (AGU), which contains three sets of eight 16-bit registers and two ALUs capable of arbitrary modulo and bit-reversed arithmetic; and 4) program control unit (PCU).

E = (1/2) · C · V²   (dynamic energy per transition)

Pgm    Description                                 Instr   %Instr arith   Sim cycles
fir4   4-tap FIR filter (direct form)              5       57%            1760
fir64  64-tap FIR filter (direct form)             5       96%            5,000
fir4u  4-tap FIR filter, unrolled once             12      66%            1,500
FFT    256-point FFT                               24      79%            8,062
LMS    64-tap least mean squares adaptive filter   13      63%            30,000

Table 1: Description of workloads. The Instr column lists the number of unique instructions executed in each workload's main program loop. %Instr arith lists the percentage of instructions executed that are arithmetic instructions. Sim cycles lists the number of cycles simulated for power analysis.

Figure 2: Energy consumption for workloads. This figure shows power consumption for the workloads (average energy per cycle, in nJ, broken down by component: Clock, Busses, Multiplier, Adder, AGU, PCU, ALU, and Other). The AGU and ALU are the two largest consumers of power, and almost all of the variation between programs is caused by variation in the multiplier and AGU power. The multiplier uses guard latches on the input and consumes less power with programs that have proportionally fewer multiply operations.


tractable, we only computed the energy cost for every unique instruction² found in our five workloads, not for every possible tuple of (instruction, reg/immed, instr pair).

Each instruction's energy was estimated by constructing a tight-loop test program that included the target instruction and a zero-overhead branch instruction so that only the target instruction was executed in the core of the loop. Figure 3 shows the tight-loop test program used to characterize the MAC instruction. Each tight-loop test program was run through Mynoch to generate the base energy cost, B_inst. For measuring the REP instruction, we were forced to use a NOP inside of the loop because a loop of REP instructions is illegal.

Whenever possible, loops were made with the actual instructions used in the programs. Some instructions were modified slightly to ensure that different operands were used on each cycle. Data from the workload being characterized was used. Because the test programs are based on the actual instructions and data from our five workloads (Table 1), the results of this approach are optimistic in their accuracy. When generalizing this approach, parameters such as the destinations of parallel moves would be abstracted to reduce the energy table's size.

Base Model Energy Estimates

The instruction measurements described above were used to construct a base model energy table that was then used to estimate the average energy per cycle for each of our workload programs. For example, the energy per cycle for the fir4 program was calculated by:

E_fir4 = (B_CLR + B_REP + 3·B_MAC + B_MACR + B_MOVE) / 7

Figure 5 shows the power estimates gained from the instruction energy table. The results show a very accurate power estimate for the fir64. However, the estimated energy for other programs is underestimated by 17% to 25%. This error is 50% higher than the error reported in [LEE97].
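A minimal sketch of the base-model bookkeeping, applied to the fir4 loop; the B values (nJ) are placeholders, not the paper's measured energy table:

```python
# Base-model sketch: average energy per cycle is the sum of
# per-instruction base costs over the loop, divided by the loop
# length. Inter-instruction effects are ignored, which is exactly
# where this model loses accuracy. Table values are hypothetical.

B = {"CLR": 4.0, "REP": 3.0, "MAC": 5.0, "MACR": 5.2, "MOVE": 4.5}

def base_model_energy(trace):
    """Average energy per cycle, ignoring inter-instruction effects."""
    return sum(B[i] for i in trace) / len(trace)

# The seven-cycle fir4 main loop: CLR, REP, 3 x MAC, MACR, MOVE.
fir4_loop = ["CLR", "REP", "MAC", "MAC", "MAC", "MACR", "MOVE"]
e_fir4 = base_model_energy(fir4_loop)  # (B_CLR + B_REP + 3*B_MAC + B_MACR + B_MOVE) / 7
```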

This error can be understood by considering the fir64 and fir4 programs (see Figure 4). The accuracy in the fir64 workload is due to fir64's limited number of inter-instruction effects. The fir64 repeats the MAC instruction 63 times, with no intervening instructions, in a loop of 67 instructions. This behavior is very similar to the tight-loop test programs used to derive the instruction energy table. In the fir4, however, energy is underestimated by 25% because inter-instruction effects are significant. The fir4 repeats the MAC instruction 3 times in a loop of 7 instructions, while the remaining 4 instructions are all different. Inter-instruction effects in the remaining instructions are not represented by the base model, leading to the observed error.

For each of the workloads, most of the inaccuracy for the base model occurs in the DSP's AGU (address-generation unit) and PCU (program-control unit) functional units. For the fir4, the energy for the AGU and PCU is underestimated by more than 30%. The error in these units can be understood by considering the microarchitecture (Figure 1). The AGU must generate an address by the end of pipeline stage 2, so no registers exist between the control logic and the data path (our implementation does this to increase performance). The PCU does not latch its control points for similar reasons. The lack of registers allows glitches in the control logic to propagate into the data path, causing many false transitions as an instruction word changes. In contrast, the ALU (data ALU functional unit) latches all of its control points, so glitches are confined to the control logic, making the base model more accurate.

1. Like many DSPs, the 56K allows for packed instructions, where an arithmetic instruction and a data-movement instruction are grouped together in one instruction.

2. Instructions are considered different if there is any difference in their opcode, immediate values, registers, or pairings. For example, two versions of a MAC instruction (MAC x0,y0,a vs. MAC x1,y0,b) are considered different and will have different entries in the base model's energy table.

DO #<50

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ;target instruction

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ;target instruction

Figure 3: Loop used to characterize MAC from FIR filters

This loop was used to calculate the base energy cost of the target instruction, in this case the MAC instruction as it occurs in the FIR filters. Two data values are read in from memory each cycle, so both multiplier inputs change. Two instances of the target instruction are needed to match the semantics of the DO instruction.

CLR A X0,X:(r0)+ Y:(r4)+,Y0 ; A = 0;; X0 <- X(n-1); Y0 <- B(0)

REP #<3 ; for j = 0 to 2

MAC Y0,X0,A X:(r0)+,X0 Y:(r4)+,Y2 ; A+= X(n-j)*B(j); X0 <- X(n-j-1); Y0<- B(j+1)

MACR Y0,X0,A (r0)- ; A += X(n-3)*B(3); update pointer

MOVE X:(r1)+,X0 A,Y:(r5)+ ; X0 <- X(n+1); Y:(out) <- A

Figure 4: Main loop of fir4

This figure shows the code used in the fir4 main loop. The fir64 is the same, except that the MAC operation is repeated 63 times by changing the repeat instruction to "rep #<63". The CLR, MAC, MACR, and MOVE instructions employ parallel moves to move data into registers immediately before the data is needed.

Figure 5: Mynoch vs. Base Model. Simulated energy from Mynoch and estimated energy using the base model (base) are given for the major units (ALU, AGU, PCU, Other) for fir4, fir64, fir4u, and fft. Both clock and bus power are contained in "Other." (Due to length and complexity, LMS has not been done.)


3.2 The Pair Model

The impact of inter-instruction effects on power estimation has been noted before and can be compensated for by assigning a per-instruction overhead that accounts for inter-instruction effects [LEE97]. This overhead, O_instr, is added to the base energy cost if an instruction is not the same as the previous instruction. For example, when a MAC follows an identical MAC (the first code sequence below), the energy model would only use B_MAC for the second MAC operation; when a MAC follows a MOVE (the second code sequence below), the model would use B_MAC + O_MOVE,MAC because the MAC instruction is preceded by a different instruction.

Similar to the base model, we measured O_instr using tight-loop test programs. Each loop consisted of the target instruction and the instruction that preceded it in the execution trace, giving the average energy per cycle for the loop, E_loop (see Figure 6). Overhead was calculated by:

O_previous,target = E_loop − (B_previous + B_target) / 2
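A sketch of the pair-model bookkeeping described in this subsection; all energy numbers (nJ) are placeholders, not measured values:

```python
# Pair-model sketch: the overhead of a (previous, target) pair is
# derived from an alternating two-instruction loop,
#   O[prev, tgt] = E_loop - (B[prev] + B[tgt]) / 2,
# and charged whenever the target follows a different instruction.
# All table values are hypothetical illustration numbers.

B = {"MAC": 5.0, "MOVE": 4.5}            # base costs
E_loop = {("MOVE", "MAC"): 5.6}          # alternating-loop measurements

O = {pair: e - (B[pair[0]] + B[pair[1]]) / 2 for pair, e in E_loop.items()}

def pair_model_energy(trace):
    """Average energy per cycle over an instruction trace."""
    total, prev = 0.0, None
    for instr in trace:
        total += B[instr]
        if prev is not None and prev != instr:
            # unmeasured pairs fall back to zero overhead in this sketch
            total += O.get((prev, instr), 0.0)
        prev = instr
    return total / len(trace)
```

The dictionary of pairs is what grows as O(N²) when every pair must be characterized, which is the scaling problem the NOP model addresses.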

Using B_instr and O_instr where appropriate, the pair model estimated the energy for each of the workloads (Figure 7). The results, labeled as pair, are much more accurate than the base model, with error between 1% and 10% for all programs. However, generalizing this technique would require characterizing every possible pair of instructions, requiring a table of size O(N²), where N is the number of instructions and addressing modes. For the 49 different instructions and addressing modes implemented in our 56K chip, a complete instruction energy table would contain 1176 entries. To reduce the table size, [LEE97] grouped instructions into classes and derived overhead costs between classes. This technique works well for simple machines, but is much more difficult to apply when dealing with the many complex addressing modes and instruction types found in a DSP such as the 56K.

3.3 The NOP Model

To avoid the difficulties of instruction grouping, we have developed a new approach that requires only one overhead cost for each instruction. This model is based on the assumption that the overhead cost for an instruction is not strongly dependent on the neighboring instruction, but does depend on whether the neighboring instructions are the same or different. This observation leads to the NOP model, which allows us to account for instruction changes without enumerating each pair of instructions.

Like the pair model, the NOP model calculates the energy for a particular operation with either B_instr or B_instr + O_instr, depending on the previous instruction. The NOP model differs in that we calculated one overhead cost for each instruction, O_instr, using loops which alternate the target instruction with NOP instructions (Figure 8). This technique allowed us to capture the energy effects of changing instructions while keeping the size of the table to O(N). Power estimates using the NOP model are shown in Figure 7, labeled as NOP. The results show

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ; target instruction

MOVE X:(r1)+,X0 Y:(r4)+,Y0

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ; target instruction

DO #<50

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0

MACR Y1,X1,a (r0)- ;target instruction

Figure 6: Loop used to find overhead: pair model. This loop was used to calculate the overhead cost of the target instruction, in this case the MACR instruction as it occurs in the FIR filters, under the pair model. The pair of instructions that appear in the workload programs was used, and the overhead for this pair was assigned to the trailing instruction. Different source registers were used for the MAC and MACR instructions to ensure that both multiplier operands change, as in the FIR programs.

Figure 7: Different approaches to estimating inter-instruction overhead. This figure compares energy from Mynoch simulation with estimates from the base model (base), pair model (pair), and NOP model (NOP), broken into ALU, AGU, PCU, and Other, for fir4, fir64, fir4u, and fft. The pair model measures overhead for each pair of instructions that appears in the program trace. The NOP model measures overhead for each instruction using NOP instructions.

DO #<50

NOP

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ;target instruction

Figure 8: Loop used to find overhead: NOP model. This loop was used to calculate the overhead cost of the target instruction, in this case the MAC instruction as it occurs in the FIR filters, under the NOP model. A target instruction is paired with a NOP to calculate its overhead cost.


error between 1% and 8% on programs that previously had much larger errors with the base model. Considering that grouping instructions also decreases accuracy, this should compare favorably with any model based on instruction classes.
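A sketch of the NOP-model estimator described in Section 3.3; the single O(N) overhead table below holds invented values:

```python
# NOP-model sketch: one overhead entry per instruction (an O(N)
# table), measured by alternating the instruction with NOPs, and
# charged whenever the previous instruction differs from the
# current one. All table values (nJ) are hypothetical.

B = {"MAC": 5.0, "MOVE": 4.5, "MACR": 5.2}  # base costs
O = {"MAC": 0.8, "MOVE": 0.6, "MACR": 0.9}  # overhead vs. NOP

def nop_model_energy(trace):
    """Average energy per cycle: B[i], plus O[i] on instruction change."""
    total, prev = 0.0, None
    for instr in trace:
        total += B[instr]
        if prev is not None and prev != instr:
            total += O[instr]
        prev = instr
    return total / len(trace)
```

Compared with the pair model, only one overhead value per instruction is stored, so the table grows linearly with the instruction set rather than quadratically.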

3.4 General Instruction Model

Having established the effectiveness of the NOP model, we generalized this approach to build tables that could be used for any program: a general instruction model. Unlike our previous models, where the instruction energy tables were built to match instructions as found in the workloads as closely as possible, the general instruction model creates a single instruction energy table that can be used across all programs. Such a table could be created by processor manufacturers and then used by a code generator to optimize power.

The general instruction model is similar to the NOP model, but extends the power analysis by accounting for packed instructions. Packed instructions present a problem when building general tables because any combination of arithmetic and parallel move is allowed. Our previous models used the actual packed instructions from each workload. When generalizing, the 23 arithmetic instructions and 24 types of parallel moves lead to 552 possible combinations, making a complete table fairly large. Fortunately, the two parts of a packed instruction are largely executed by different units within the 56K. Using this architectural knowledge, we separated the two parts of a packed instruction, building tables for the energy consumed by each of the functional units rather than the entire DSP.

The general instruction model consists of four tables, corresponding to the four significant functional units: ALU, AGU, PCU, and "Other." The first three units have been described above. "Other" refers to all remaining parts of the chip, primarily the clock and bus power. The ALU and PCU were characterized by the arithmetic portion of packed instructions only, ignoring the parallel move unless no arithmetic instruction was present. The AGU and "Other" were characterized by the parallel move portion, ignoring arithmetic instructions. Data was coarsely modeled in ways that would be visible to a code generator. Table entries for arithmetic operations were separated based on which operands changed value; move operations contained separate entries for the number of moves and type of update performed on address registers.

Energy costs were generated with loops similar to those described above (Figure 9). From each loop, the relevant energy costs were calculated for each unit. Uniform random data was used as input to the arithmetic unit. Under the general instruction model, the base energy cost of the instruction:

MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0

was calculated by:

B_ALU,MACxy + B_AGU,MOVEx+y+ + B_PCU,MACxy + B_Other,MOVEx+y+

Overhead energy costs, and whether an instruction has changed, were calculated for each unit in the same way.
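A sketch of the per-unit table lookup the general model performs for a packed instruction; the table values (nJ) and key names are illustrative, not the paper's data:

```python
# General-model sketch: a packed instruction's base cost is the sum
# of per-functional-unit table entries. The ALU and PCU are keyed
# by the arithmetic half, the AGU and "Other" by the parallel-move
# half. All values are hypothetical placeholders.

UNIT_TABLES = {
    "ALU":   {"MACxy": 1.8},
    "AGU":   {"MOVEx+y+": 1.5},
    "PCU":   {"MACxy": 0.6},
    "Other": {"MOVEx+y+": 1.1},
}

def packed_base_cost(arith_key, move_key):
    """B_ALU,arith + B_PCU,arith + B_AGU,move + B_Other,move."""
    return (UNIT_TABLES["ALU"][arith_key] + UNIT_TABLES["PCU"][arith_key]
            + UNIT_TABLES["AGU"][move_key] + UNIT_TABLES["Other"][move_key])

# e.g. MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 -> arithmetic MACxy, move MOVEx+y+
cost = packed_base_cost("MACxy", "MOVEx+y+")
```

Splitting the tables by unit is what keeps their combined size proportional to 23 + 24 entries rather than the 552 possible arithmetic/move combinations.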

Estimates based on these general instruction tables are shown in Figure 10. By making estimation automatic, we were able to provide estimates for lms as well. Considering that program-dependent information from Figure 7 has been removed, the results are remarkably similar. Accuracy on all programs is within 10%.

4 Applications and Limitations

Section 3 developed a general instruction model that provides reasonable accuracy while limiting the table size. In this section we analyze this model from two perspectives. The first examines a possible use of such an energy model: evaluating the energy of code transformations within a code generator. The second looks into the importance of program data, which is not considered by the instruction-level model.

4.1 Code Transformations to Save Power

Comparing different implementations of a 4-tap FIR filter allows us to see if the energy models can recognize power savings due to code transformations (see Figure 11). While [TIWA94] looked at instruction reordering, more aggressive code transformations are used here. We implemented four versions of a 4-tap FIR filter which used the same coefficients and

DO #<50
MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ; MACxy
MAC Y0,X0,a X:(r0)+,X0 Y:(r4)+,Y0 ; MACxy

DO #<50
MOVE X:(r0)+,X0 Y:(r4)+,Y0 ; MOVEx+y+
MOVE X:(r0)+,X0 Y:(r4)+,Y0 ; MOVEx+y+

Figure 9: Loops used for general model. These loops were used to calculate the base cost of the MACxy and MOVEx+y+ instructions under the general model. MACxy refers to a MAC instruction where both multiplier inputs change, while MOVEx+y+ refers to two parallel moves with increment.


Figure 10: General instruction model. This figure compares Mynoch simulation energy (Mynoch) and the general instruction-level model (gen). Each component (AGU, ALU, PCU, and Other) was estimated using separate tables.

[Figure 10 plot: energy per cycle (nJ) for fir4, fir64, fir4u, fft, and lms; Mynoch versus gen bars broken down into ALU, AGU, PCU, and Other.]

Figure 11: Implementations of a 4-tap FIR filter. This figure compares the energy to produce one datum of output for four different software implementations of the same 4-tap FIR filter. The fir4 and fir4u implementations are described in Table 1. The no-rep implementation does not use the repeat instruction or store past inputs. The no-par implementation only uses one parallel move and does not use packed instructions. The number of cycles per datum for the fir4, fir4u, no-rep, and no-par programs are 7, 6, 6, and 15, respectively.

[Figure 11 plot: energy per datum (nJ) for fir4, fir4u, no-rep, and no-par; Mynoch versus gen bars broken down into ALU, AGU, PCU, and Other.]


input data. Energy per datum processed is used as the metric to compare these programs, to account for the different number of cycles required by different programs. Energy per datum is given by per-cycle energy multiplied by the number of cycles per datum.
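The metric is a simple product; a sketch with the cycle counts from Figure 11 and hypothetical per-cycle energies (the real per-cycle values are in the figure, not reproduced here; the numbers below were chosen only to echo the relative savings discussed next):

```python
# Energy per datum = energy per cycle x cycles per datum.
# Cycles per datum come from Figure 11 (fir4: 7, fir4u: 6, no-rep: 6,
# no-par: 15); the per-cycle energies (nJ) are hypothetical values.

cycles_per_datum = {"fir4": 7, "fir4u": 6, "no-rep": 6, "no-par": 15}
energy_per_cycle_nJ = {"fir4": 4.0, "fir4u": 3.7, "no-rep": 3.9, "no-par": 3.3}

def energy_per_datum(prog):
    return energy_per_cycle_nJ[prog] * cycles_per_datum[prog]

for prog in cycles_per_datum:
    print(f"{prog}: {energy_per_datum(prog):.1f} nJ per datum")
```

Note how no-par can have the lowest per-cycle energy yet the highest energy per datum, because it needs more than twice the cycles.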

Figure 11 shows that Mynoch power estimation predicts that loop unrolling, fir4u, consumes 20% less energy per datum than fir4, while the version without packed instructions, no-par, consumes 75% more energy per datum. The difference in energy is due to both the energy per cycle and the number of cycles required to process one datum. The general model is able to recognize the difference in power, but does not show as dramatic an improvement for fir4u and no-rep. The lost accuracy comes from the general model underestimating the AGU's per-cycle energy for fir4 while overestimating the AGU's per-cycle energy for fir4u and no-rep. While improved accuracy in AGU power estimation is needed, the results show that a code generator using the general model would choose the implementation using the least power.

4.2 Data Dependent Variation

None of the models presented consider energy effects of program data. However, the power consumption of many units, such as the multiplier, can be highly data-dependent. Other research on DSP power consumption has noted the data-dependent variation and analyzed the energy of individual functional units, such as the multiplier [KOJI95][LEE97]. Code transformations that keep one operand constant or reduce the number of “1” bits in a Booth-encoded multiplier are ways that the compiler can change the data to reduce the multiplier's energy.

To gauge the importance of data, we used cycle-accurate simulation to measure the power in the ALU when each MAC operation was active for the workloads. The average power consumed by each instruction, with standard deviation error bars, is shown in Figure 12 along with the energy costs for these instructions from the general model. Most instructions deviate between 8% and 12% from the mean, while two instructions deviate by 16% and 28%. The costs from the base model are generally within one standard deviation of the workloads.

The fft has the largest standard deviation and has the least accurate ALU estimates under all models (cf. Figure 7). The fft showed a bimodal distribution of energy in one of its MAC instructions, probably due to a large number of multiplications by zero. Improving accuracy for this problem would require moving to a model based on execution traces with sample program data. The increased accuracy of such a model would come at a cost of significantly longer simulation time, although techniques such as those proposed in [MARC96] could be used to reduce the trace length.

5 Conclusion and Future Work

We have presented several approaches for dealing with inter-instruction effects when building instruction-level energy models for a specific DSP design. The results show that using NOP instructions to model transitions between any two instructions gives accuracy within 8% while reducing the table size from almost 1200 entries to less than 100, and eliminates possible human error from other simplification methods. Using separate models for major units within the DSP avoided multiple table entries for different combinations of arithmetic and parallel-move instructions and allowed us to build a general model. Such a model could allow code generators to recognize code transformations that reduce the energy consumed by programs.

Future work will attempt to recognize when data-dependent variation is likely to be important and to include such variation within the model. Building models for other, non-DSP architectures would further validate the applicability of the ideas presented here.

6 Acknowledgments

Andrew Ryan, Chris Inacio, Jonathan Ying-Fai Tong and Bill Dougherty Jr. provided critical components for this study.

7 References

[BAJW97] R. S. Bajwa, N. Schumann, H. Kojima. Power Analysis of a 32-bit RISC Microcontroller Integrated with a 16-bit DSP. Proceedings 1997 International Symposium on Low Power Electronics and Design, pp. 137-142, 1997.

[KOJI95] H. Kojima, D. Gorny, K. Nitta, and K. Sasaki. Power Analysis of a Programmable DSP for Architecture/Program Optimization. IEEE Symposium on Low Power Electronics, Digest of Tech. Papers, pp. 26-27, Oct. 1995.

[KRIS97] Ram K. Krishnamurthy. Mixed Swing Techniques for Low Energy/Operation Datapath Circuits. Ph.D. Thesis, Carnegie Mellon University, December 1997.

[LEE97] M. T.-C. Lee, V. Tiwari, S. Malik, M. Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. IEEE Trans. on VLSI Systems, pp. 1-14, March 1997.

[MARC96] Diana Marculescu, Radu Marculescu, Massoud Pedram. Stochastic Sequential Machine Synthesis Targeting Constrained Sequence Generation. Proceedings of Design Automation Conference, pp. 696-701, 1996.

[MOTO90] Motorola, Inc. DSP56000/56001 Digital Signal Processor User's Manual. 1990.

[PURS96] D. J. Pursley. A Gate Level Simulator for Power Consumption Analysis. M.S. Thesis, Carnegie Mellon University, May 1996.

[TIWA94] V. Tiwari, S. Malik, A. Wolfe. Power Analysis of Embedded Software: A First Step towards Software Power Minimization. 1994 ICCAD, Digest of Technical Papers, pp. 384-390, 1994.

[XANT97] T. Xanthopoulos, Y. Yaoi, A. Chandrakasan. Architectural Exploration Using Verilog-Based Power Estimation: A Case Study of the IDCT. DAC 97, pp. 415-420, 1997.


Figure 12: Energy per instruction of data ALU. The average energy for each instruction, based on Mynoch simulation of the programs, is given with standard deviation error bars to the left. The three instructions with the highest standard deviation showed a bimodal distribution. Base energy cost (B) and base plus overhead (B+O) from the general model are given to the right.

[Figure 12 plot: energy per cycle (nJ) for MACxy, MACxa, and MACya operations from fir4, fir64, fir4u, and fft (left), and the corresponding B and B+O costs from the general model (right).]


Power Issues in the Memory Subsystem


Split Register File Architectures for Inherently Lower Power Microprocessors

V. Zyuban and P. Kogge

Computer Science & Eng. Department, University of Notre Dame, IN 46556, USA

Abstract

Register files represent a substantial portion of the energy budget in modern processors, and are growing rapidly with the trend towards wider instruction issue. The actual access energy costs depend greatly on the register file circuitry and technology used. However, according to a recent study, it appears that neither technology scaling nor circuitry techniques will prevent centralized register files from becoming the dominant power component of next-generation superscalar processors. This paper studies alternative methods for inter-instruction communication as to their energy efficiency and begins to lay out approaches at the architectural level that would allow inherently more energy-efficient computations.

1 Introduction

Current microprocessor design has a tendency towards wider instruction issue and increasingly complex out-of-order execution. This leads to growth in the on-chip hardware and in dissipated power. The energy-delay product, (delay/operation) × (energy/operation), or its inverse SPEC²/W, seems to be a reasonable metric for the power efficiency of a design [8]. Smaller energy-delay values imply a lower energy solution at the same level of performance – a more energy-efficient design.

Reference [10] described and analyzed those portions of a microarchitecture where complexity, and consequently power, grow with increasing instruction issue width. Among them are: register rename logic, wakeup and selection logic, data bypass logic, register files, caches and instruction fetch logic. The power growth of each of these structures can be described as Power ∝ (IPC)^γ, where IPC is the average number of instructions completed per cycle and γ is some constant. Every one of the above structures has its own power growth constant γ. By substituting the assumed expression for power dependence upon IPC into the energy-delay formula we see that those structures that have γ > 2 may begin to swamp the power budget as the processor issue width and IPC grow, and eventually lead to a deterioration of the energy-delay metric and thus the energy efficiency of a microprocessor.

In this paper, which is a part of a bigger project, we concentrate on one of the structures whose γ is greater than 2, namely the centralized multiported Register File (RF), and consider alternative methods for inter-instruction communication that are more energy efficient. Our goal is to find one that has γ ≤ 2.

This work was supported in part by the National Science Foundation under Grant No. MIP-95-03682.

2 Power Complexity of Centralized Register File

Register files in modern superscalar CPUs are usually centered on multiported memory macros whose storage size and number of ports grow with increasing issue width. The silicon area of a multiported memory, built using conventional approaches, grows quadratically in the number of ports [13]. Therefore, taking into account growth both in storage needs and in the number of ports, we should expect that the power portion of such multiported on-chip memories will grow rapidly in the future.

In our recent work [17] we studied the dependence of the access energy of a multiported register file upon the number of read and write ports, and the number of registers in the register file. We did a study for various circuit organizations of the RF, and tried to find the lower bound on the access energy. The study showed that even when the Port Priority Selection (PPS) technique [11] is applied, combined with double-ended reads and low-swing writes (which was found to be the most energy efficient), the access energy still grows significantly as the number of ports and the number of registers increase (Fig. 1). Here we assumed 0.95 RF read accesses and 0.6 RF write accesses per instruction (measured for the SPARC-V8), and for the number of ports, Nread = 2·Nwrite.

Figure 1: Average access energy per instruction of a RF using the PPS technique combined with double-ended reads and low-swing writes (1st bar) and a conventional RF architecture (2nd bar), plotted against the number of registers (32-256) and the number of ports (8-32); 0.5 um technology, Vdd = 3.3 V.

To express the RF access energy and dissipation power in terms of the issue width (IW) of a processor, we assumed for the number of read and write ports: Nread = 2·IW, and Nwrite = IW. To estimate the number of registers we used the results in [4], where it was found that for four-issue and eight-issue machines the performance saturates around 80 and 128 registers, respectively. Based on this data, and assuming that 40 registers is sufficient for a single-issue machine, we extrapolate the dependence linearly to two-issue and 16-issue machines.
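The extrapolation can be reproduced from the two saturation points reported in [4]; this is a sketch assuming the dependence is exactly linear:

```python
# Linear fit of register-file size versus issue width through the two
# saturation points reported in [4]: 80 registers at 4-issue and 128
# registers at 8-issue.

def n_registers(issue_width):
    slope = (128 - 80) / (8 - 4)   # 12 registers per issue slot
    return 80 + slope * (issue_width - 4)

for iw in (1, 2, 4, 8, 16):
    print(iw, n_registers(iw))     # 1-issue comes out near the assumed 40
```

The fit gives 44 registers at single issue, consistent with the 40-register assumption, and 224 registers at 16-issue.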

The average RF access energy per instruction E_average versus the microprocessor issue width is plotted in Fig. 2, both for the conventional RF and the RF using the most energy-efficient PPS technique. The equations listed for each case are approximate curve fits for the region IW = 4 to IW = 16.

We see that there is a significant energy penalty per instruction in supporting large instruction level parallelism with a centralized register file. In order to compute the overall RF power we multiply the average energy dissipation per instruction E_average by the number of instructions issued per cycle (IPC), and by the clocking rate f: Power = f · IPC · E_average. According to Fig. 2, E_average ∝ IW^β, where β is between 0.96 and 1.8. Assuming that IPC = IW^α (or inversely, IW = IPC^(1/α)), this yields Power ∝ f · IPC · IPC^(β/α) = f · IPC^(1+β/α), or γ = 1 + β/α.

If the IPC grew linearly with the issue width (α = 1) then the use of the PPS register file architecture would result in a power dependence parameter γ close to a value of 2, while the use of the conventional register file architecture would result in γ = 2.8. However, since in real machines IPC increases less than linearly with the increase in the issue width (α < 1), even the use of the most energy-efficient register file architecture does not solve the problem, and leaves the power dependence parameter well above 2. For an Amdahl's-law-like α of 0.5 [14] the γ is between 3 and 4.6!
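The exponent arithmetic can be checked directly from γ = 1 + β/α; for α = 0.5 the two endpoints come out as roughly 2.9 and 4.6, matching the "between 3 and 4.6" figure:

```python
# Power growth exponent for a centralized RF: Power = f * IPC * E_average,
# with E_average ~ IW^beta and IPC ~ IW^alpha, gives gamma = 1 + beta/alpha.

def gamma(beta, alpha):
    return 1 + beta / alpha

print(gamma(0.96, 1.0))  # PPS RF, linear IPC scaling: ~2
print(gamma(1.8, 1.0))   # conventional RF, linear IPC scaling: 2.8
print(gamma(0.96, 0.5), gamma(1.8, 0.5))  # Amdahl-like alpha: ~2.9 and 4.6
```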

Figure 2: Average access energy per instruction for a PPS (1st bar) and a conventional RF architecture (2nd bar) versus issue width, 0.5 um feature size, Vdd = 3.3 V. The approximate curve fits are E ∝ (IW)^0.96 for the PPS RF and E ∝ (IW)^1.8 for the conventional RF.

It must also be stressed that all the results for the register file power described above assumed very aggressive energy management techniques, including a pulsed word-line activation technique for reducing the bit-line swing to the minimum, pulsed activation of the sensing circuitry, fully cutting off precharge during reading and writing, taking advantage of the statistics of the data stored in RF memory cells, minimum transistor sizing wherever possible, and use of equalizing transistors to save bit-line energy during precharge. In real-world CPUs the desire for speed will often not permit some of these techniques, meaning that real register file powers are even higher. The conclusion is that none of the known circuit techniques solves the problem of rapid RF power growth for machines with increasing ILP. This leads to the inescapable conclusion that the use of a centralized register file as an inter-instruction communication mechanism is going to become prohibitively expensive. Alternative techniques involving more than just circuit tricks are going to be necessary, and are a target of this work.

3 Register File Partitioning – the Basic Idea

In this section we analyze the potential energy savings of replacing a centralized RF with a collection of decentralized register files.

The primary advantage of accessing a collection of local register files rather than a single centralized register file is that local register files do not need to have many ports, and the number of entries in each register file is much smaller, which makes them very simple, fast and low power.

Suppose we have a superscalar processor capable of issuing IW instructions in parallel. Using the above assumptions, a conventional centralized register file would have Nread,centr = 2·IW read ports and Nwrite,centr = IW write ports, and Nreg,centr ≈ N0 + 12·IW physical registers [4]. Ideally, we would like to partition the centralized register file into m local register files (m < IW) in such a way that every local register file has Nports,local = (1/m)·Nports,centr ports, and Nreg,local = (1/m)·Nreg,centr entries.

Under such ideal partitioning of the register file we would expect the access energy of every local register file to be Elocal = (1/m^1.8)·Ecentr, based on the results for the conventional register file architecture presented in the previous section. Such a partitioning would result in a total register file power of Power = f · IPC · Elocal = f · IPC · (1/m^1.8)·Ecentr. Choosing the degree of RF partitioning to be proportional to the processor issue width, m ∝ IW, we get for the total register file power Power ∝ f · IPC, with the power growth parameter γ = 1. This result is achieved without using complicated circuit techniques for the register file.
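The γ = 1 claim can be sanity-checked numerically: with Ecentr growing as IW^1.8 and m = IW local files, the per-instruction RF energy stays flat across issue widths. A small sketch in arbitrary energy units:

```python
# Ideal partitioning check: Ecentr ~ IW^1.8 (conventional-RF scaling),
# one local register file per issue slot (m = IW), so
# Elocal = Ecentr / m^1.8 is constant and total power grows only as f*IPC.

def per_instruction_rf_energy(iw):
    e_centr = iw ** 1.8          # centralized access energy (arbitrary units)
    m = iw                       # degree of partitioning proportional to IW
    return e_centr / m ** 1.8    # ideal partitioned access energy

for iw in (2, 4, 8, 16):
    print(iw, per_instruction_rf_energy(iw))  # constant 1.0 at every width
```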

However, there are certain hurdles that make the described ideal partitioning impossible. First, additional paths need to be provided to pass data between local register files, resulting in extra ports to local register files. Second, the ideal 1/m partitioning of registers among local register files is not possible; rather, local register files will need more registers than Nreg,centr/m. These effects potentially reduce the energy savings.

4 Existing Approaches to Register File Decentralization

The problem of the growth of centralized register files and the associated increase in access time has been considered by some researchers, and a few architectures with a decentralized register file have been proposed. However, thus far researchers have been primarily concerned with access times, not energy costs.

4.1 Limited Connectivity VLIWs

The ideal VLIW architecture would be a machine with many functional units connected to a single central register file [3]. This organization would enable any operation to be carried out on any available functional unit, simplifying code generation. To avoid the degradation in performance caused by multi-porting a single register file, researchers have suggested partitioning the register file into banks, so that each functional unit is connected to a specific bank [3, 15]. Some kind of crossover provides limited access to the registers from other banks. Every operation specifies its destination bank at compile time.

This arrangement alleviates the problem with the number of ports, but brings about the problem of inter-bank data traffic. If a value stored in one bank is required for an operation scheduled to a functional unit connected to another bank, this value must be copied between the register file banks. To minimize the additional workload on the compiler and minimize data traffic between the different register file banks, the partitioning of the register file must be done in a very careful way.

Since the assignment of an operation to a functional unit is done at compile time, the VLIW compiler must carefully choose a cluster for every operation and a register file bank where the result will be written.

4.2 Multiscalar Architecture

The Multiscalar architecture [1, 6, 7] executes code segments identified at compile time in parallel on multiple processing elements organized as a circular chain. A decentralized realization of a single register file, called a Multi-Version Register File, is proposed to provide the high-bandwidth inter-instruction communication mechanism needed for simultaneously active multiple execution units. Each execution unit is provided with a different version of the architectural register file, called a local register file. All register reads and writes occurring in a unit are directed only to the unit's local register file. Communication between the multiple register files is done through unidirectional serial links by forwarding the last possible register instances in a task to the subsequent tasks as soon as the instances are determined to be the last possible ones. The identification of the last possible register instance is done using compiler support.

Finding appropriate static program chunks and load balancing across the processing elements, as well as identification of the last possible register instances, are some of the issues that arise in this approach.

4.3 Trace Window Architecture

The partitioning of the physical register file in the Trace Window architecture [16] is based on the fact that not all physical registers are live beyond the trace line that produces them. Typically, around half the registers written by a small trace line (16 or 24 instructions) are not live beyond the trace line. The proportion of such local registers increases for larger trace lines [5].

In the Trace Window architecture, locally live destination registers are renamed to a physical register file local to the trace line; live-on-exit destination registers are renamed to a global register file. To identify live-on-exit registers, output-dependence checking among instructions in one trace line is done in parallel with the true-dependence checking. The local RF is flushed when the corresponding trace line is removed from the window.

One disadvantage of this approach is that in order to rename registers into local and global register files, all chunks of a trace line need to be buffered at the rename stage's output until the last chunk is renamed. As a result, register renaming of a trace line restricts dispatch bandwidth, and identification of live-on-exit registers makes instruction dispatch bursty. A mechanism for alleviating this problem is proposed which avoids full renaming on trace reuse by capturing in the trace cache renamed trace lines, obtained from the output of the register rename stage, rather than raw instructions.

5 Split Register File Architecture

The first of the above techniques makes the register file partitioning visible to the programmer and requires that all assignments of operands to register file banks be done explicitly at compile time. This approach potentially has binary compatibility problems. The two other techniques are based on the time localities of inter-instruction communication: each bank of the register file stores those register instances that are produced and consumed by instructions that are executed close to each other in time in the dynamic instruction sequence.

In our work we study an approach to register file decentralization based on another kind of inter-instruction communication locality. This approach is based on a hypothesis that there exist certain groups of instructions in the dynamic instruction sequence such that the inter-instruction communication is likely to be mostly local within each group. These groups do not necessarily consist of consecutive instructions in the dynamic instruction sequence.

If this is the case, then we can implement our CPU as a collection of processing unit clusters and provide each cluster with a local physical register file. At run time, instruction dispatch logic tries to steer every instruction to the cluster where the instructions producing its register source operands were executed. The desired result of such partitioning would be that instructions access local register files most of the time. Additional paths are provided for inter-cluster traffic.

There are additional energy benefits we expect to get from the described organization. First, the energy efficiency of the bypass logic can be improved if we implement full bypassing within each cluster only. Second, we expect that the proposed idea should allow us to more fully take advantage of correlation between operand values of instructions in the same group. The average energy dissipation of most processing units, like the ALU, shifter, etc., is known to be statistically proportional to the number of data bit transitions at the inputs in a clock cycle: Eaverage = Econst + Nchange · Echange, where Econst is a constant energy independent of the data activity at the inputs, Nchange is the average number of bit transitions at the inputs in one clock cycle, and Echange is a coefficient specific to every processing unit [9, 12]. By steering instructions that access the same registers to the same processing unit cluster we increase the probability that operand values of these instructions are correlated, and thereby reduce the average number of bit transitions at the inputs of functional units.
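The switching-energy model above can be illustrated by counting input bit flips between consecutive operands. The coefficients and operand streams below are hypothetical, chosen only to show that correlated operands cost less under this model:

```python
# E = Econst + Nchange * Echange: functional-unit energy grows with the
# number of input bit transitions per cycle. Coefficients are hypothetical.

def transitions(a, b, width=24):
    """Hamming distance between consecutive inputs on a 24-bit datapath."""
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def stream_energy(operands, e_const=1.0, e_change=0.05):
    """Total energy for a stream of operands fed to one functional unit."""
    total, prev = 0.0, 0
    for op in operands:
        total += e_const + transitions(prev, op) * e_change
        prev = op
    return total

correlated   = [0x1000, 0x1001, 0x1002, 0x1003]  # nearby values, few flips
uncorrelated = [0x1000, 0x8FA3, 0x0311, 0xC55E]  # random-looking values
print(stream_energy(correlated) < stream_energy(uncorrelated))  # True
```

This is the mechanism behind the steering argument: clustering instructions that share registers raises the chance of correlated inputs and so lowers Nchange.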

The suggested split register file architecture is similar to the dependence-based architecture that was proposed in [10] to simplify wakeup and select logic (which was found to be the most critical in terms of delay complexity), and thus allow faster clocking. In the dependence-based architecture the instruction issue window is replaced with a number of FIFO buffers, each queue holding a chain of dependent instructions. Instructions from multiple queues are issued in parallel. The key difference between our approach and the dependence-based architecture is that the register file in the dependence-based architecture is not partitioned into register files local to clusters. Also, the dependence-based architecture was not studied for energy efficiency in [10].

Before going any further, the efficiency of the idea must be evaluated. The primary questions that determine the efficiency of splitting the register file are: (i) how often we need to pass data between different clusters (which is equivalent to how often accesses to remote register files will occur, or how many additional write ports might be needed for inter-register file communication), (ii) how many registers are needed in the local register files, and (iii) what effect such partitioning has on the correlation between operand values of instructions issued to the same functional unit.

6 Early Experiments

In this section we check the hypothesis that “there exist certain groups of instructions in the dynamic instruction sequence such that the inter-instruction communication is mostly local within each group.” For this we need to find the best possible partitioning of instructions in the dynamic code sequence into groups such that inter-instruction communication across group boundaries is minimized. This would give us a theoretical lower bound on the inter-register file traffic when instructions from different groups are scheduled to access different register files. We are going to use these data to evaluate the efficiency of the various register file decentralization techniques that we are going to study later.

Our approach to identifying the best possible partitioning of instructions is as follows. First, we construct a graph for a dynamic instruction sequence. In this graph nodes represent instructions in the dynamic sequence, and edges represent inter-instruction dependencies. We take into account dependencies through registers, through memory locations, and through the condition code and Y registers. To take into account control dependencies, we make instructions in the graph depend on all mispredicted branches executed earlier in the dynamic code sequence. Correctly predicted branches do not introduce any control dependencies. Special care is taken to account for delayed instructions.

To find the best possible partitioning of instructions into groups such that inter-instruction communication across group boundaries is minimized, we need to solve the problem of optimal partitioning of the program graph into subgraphs in such a way that the total number of crossed edges representing dependencies through registers is minimized. Unfortunately we cannot solve this problem for the whole graph because it is known to be NP-complete. We overcome this difficulty by dividing the problem into subproblems of fixed size, finding the optimal solutions for these subproblems, and thus constructing a sub-optimal solution for the whole problem. To argue that our suboptimal solution is close to the optimal one, we observe the dependence of the result upon the size of the subproblems.

The subproblems of optimal graph partitioning are posed as follows. Assume we have M execution units and N instructions in the analysis window (the subproblem size). We need to schedule these N instructions to the M execution units in such a way that the total number of crossed edges representing dependencies through registers in the whole graph, from the beginning of the program to the last instruction in this analysis window, is minimized. This optimization target is referred to as traffic cost minimization.

However, traffic cost minimization contradicts the goal of exploiting the maximal amount of instruction-level parallelism, because the desire to minimize the number of crossed edges would result in a tendency to schedule most of the instructions to the same execution unit, leaving the rest of the hardware idle. The optimum would be achieved by scheduling all instructions to the same unit, thereby crossing no edges at all. Therefore, we need to take into account the performance requirement, namely that the total execution time from the beginning of the program to the last instruction in the window must be minimized. Because of this requirement, independent instructions will tend to be scheduled to different execution units. The performance optimization target is further referred to as delay cost minimization.

There is one more requirement that we need to take into account when searching for the optimal partitioning of instructions: the number of physical registers needed for execution in every register file should be minimized. This requirement is equivalent to reducing the number of live registers in every group of instructions, and it somewhat conflicts with the performance requirement by reducing the degree of out-of-order issue. It also results in a more even partitioning of instructions among execution units. The target of minimizing the number of physical registers is further referred to as storage size cost minimization.

To take all these requirements into account we take the following approach. First we fill the analysis window with incoming instructions. Then, we consider all possible assignments of the instructions in the analysis window to execution units, and calculate three costs for every assignment: the traffic cost, delay cost, and storage size cost. We then calculate the total cost of every assignment as some function of the three costs above, choose the assignment with the minimum total cost, and refill the analysis window with new instructions. We can place emphasis on any of the above requirements (communication locality, performance, or storage size) by constructing appropriate cost functions. The easiest way is to sum the three costs above with appropriate weights. At this point we assume that all execution units are equivalent and every instruction can be scheduled to any of the execution units. Within every execution unit instructions are issued in order, thus there are M^N possible combinations.
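As a sketch, the weighted-sum search over all M^N assignments could be written as follows. The three cost functions are caller-supplied stand-ins for the traffic, delay, and storage size costs defined above; the interface and names are our assumptions, not the authors' code:

```python
from itertools import product

def best_assignment(n_instr, n_units, traffic_cost, delay_cost, storage_cost,
                    w_traffic=1.0, w_delay=1.0, w_storage=1.0):
    """Exhaustively score all M^N ways of assigning N window instructions
    to M execution units and return the cheapest assignment."""
    best, best_total = None, float("inf")
    for assign in product(range(n_units), repeat=n_instr):   # M^N tuples
        total = (w_traffic * traffic_cost(assign) +
                 w_delay * delay_cost(assign) +
                 w_storage * storage_cost(assign))
        if total < best_total:
            best, best_total = assign, total
    return best, best_total
```

With the traffic weight dominant and zero delay and storage costs, the search degenerates to placing every instruction on one unit, which is exactly the pathology the delay cost is meant to counteract.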

Results of applying the described optimization algorithm to a few short integer programs are presented in Figures 3 and 4 for processors with four and eight execution units, respectively. In this and the other experiments in this section we used the basic two-bit branch predictor, which has about a 10% misprediction rate on the integer benchmarks. The branch misprediction penalty was assumed to be 5 cycles, and the instruction and data cache hit rates were assumed to be 100%.

[Figure 3 plot: X axis CPI, Y axis average number of remote RF accesses per instruction. Curves for weights (traffic : cpi) of 2:1, 5:1, 10:1, and 50:1, for the product traffic × cpi, and for maximum performance scheduling; points are labeled with analysis window sizes 1, 2, 4, 8, and 10.]

Figure 3: Inter-register file traffic on a per-instruction basis versus CPI, for different optimization targets and different sizes of the analysis window. The number of execution units is four.

Results are presented as dots, where each dot corresponds to a certain optimization target and analysis window size. The X axis shows CPI, and the Y axis shows the average number of remote register file accesses per instruction. Thus, points closer to the origin represent better scheduling. Points corresponding to the same optimization target are connected with dotted lines. We see that bigger analysis windows result in a better scheduling policy. Unfortunately we could not run the simulation for larger analysis windows because of execution time constraints.

We did not observe significant saturation of the improvement in scheduling with increasing analysis window size within the range that we could simulate, although we can see that the benefit of increasing the size of the analysis window diminishes as the window grows. We expect to see more significant saturation of the scheduling improvement for analysis window sizes around 32 instructions, because most register instances are consumed within 32 instructions [5].

The results show that different optimization targets result in quite different traffic-delay tradeoffs. By increasing the weight of the traffic cost we place more emphasis on minimization of the traffic cost; however, this also results in a performance decrease (higher CPI). The corresponding points tend to concentrate in the lower-right corners of the figures. On the other hand, by increasing the weight of the delay cost we place more emphasis on performance optimization, but at the expense of a higher traffic cost. The corresponding points tend to concentrate in the upper-left corners of the charts. The vertical dotted lines on the charts correspond to maximum performance scheduling, when the traffic cost is ignored. We also observed significant room for traffic-delay tradeoffs in configurations with 2 and 16 functional units.

[Figure 4 plot: X axis CPI, Y axis average number of remote RF accesses per instruction. Curves for weights (traffic : cpi) of 5:1, 10:1, and 50:1 and for maximum performance scheduling; points are labeled with analysis window sizes 1, 2, 4, and 6.]

Figure 4: Inter-register file traffic on a per-instruction basis versus CPI, for different optimization targets and different sizes of the analysis window. The number of execution units is eight.

As a result of running these experiments we see that even with the existing compiler, which performs optimizations for performance only, it is possible to partition instructions into groups such that inter-instruction communication tends to be mostly local within each group. By scheduling instructions in different groups to different execution units it is possible to keep the inter-unit communication low, while exploiting almost all the ILP available in the code.

We also checked the hypothesis that “steering instructions that access the same registers to the same processing unit cluster increases the probability that operand values of these instructions are correlated”. Indeed, we observed a 9% to 10% reduction in the data switching activity at the inputs of the functional units for the suboptimal scheduling algorithm described above. The corresponding reduction in the data switching activity at the outputs of the functional units was from 12% to 14%. Runs with a higher weight on the traffic cost resulted in more significant reductions in data switching activity. The observed reduction in switching activity could result in up to a 10% reduction in the dissipated power of the corresponding functional units [9, 12]. The reduction in the switching activities at the inputs and outputs of the local register files was even more significant, up to 17%. This could be used to further reduce the access energy of local register files [17].

In the experiments described above we analyzed properties of code, avoiding implementation details as much as possible. As a next step we plan to develop a concrete implementation that would allow us to take advantage of the discovered properties. We are going to analyze the inter-instruction traffic in more detail and study how the inter-group traffic could be handled. We will study the use of a global register file for storing register instances that are alive for long periods of time and extensively accessed by many instructions.

7 Opcode-Based Steering

In practice various heuristics can be used for steering instructions to clusters, and the simplest of them is opcode-based steering, where the cluster for every instruction is chosen based on the instruction opcode. To evaluate the efficiency of opcode-based steering we set up an experiment based on the SPARC instruction set simulator Shade [2]. The methodology of the experiment was as follows. Each entry in the simulated physical register array has a tag that shows in which register file the data is being stored. All read accesses to register files are recorded in a matrix: whenever unit i reads data from register file j, the entry M[i][j] in the matrix is incremented. The values in the matrix are divided by the total number of instructions executed, so that diagonal and non-diagonal elements in the matrix represent local register file hit and miss rates, respectively, on a per-instruction basis. The sum of all elements in the matrix is the average number of non-g0 register file accesses per instruction. Accesses to the zero register g0 are ignored.
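The bookkeeping just described can be sketched as follows (the class and method names are ours, not Shade's):

```python
class AccessMatrix:
    """Records register file read accesses: m[i][j] counts reads by unit i
    of data tagged as living in register file j."""
    def __init__(self, n_files):
        self.n_files = n_files
        self.m = [[0] * n_files for _ in range(n_files)]
        self.instr_count = 0

    def record_read(self, unit, home_file):
        self.m[unit][home_file] += 1

    def rates(self):
        # Normalize per instruction: diagonal entries become local hit
        # rates, off-diagonal entries become miss (transfer) rates.
        n = max(self.instr_count, 1)
        return [[count / n for count in row] for row in self.m]

    def accesses_per_instruction(self):
        n = max(self.instr_count, 1)
        return sum(map(sum, self.m)) / n
```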

In our first experiment we divided all instructions of the SPARC architecture into four groups: load/store instructions, branches, ALU instructions, and processor control instructions, including call, jmpl, and return. We assumed that there is a separate cluster for every group, with a local register file. Note that early CDC and Cray machines often had multiple architecturally dedicated register files similar to this partitioning.

In this experiment we assumed that if a unit needs to access data from a remote RF, the data is moved from that remote register file to the RF of the unit requesting the data (move-on-miss policy). Therefore, non-diagonal entries in the matrix show how often such inter-register file transfers are needed (in terms of accesses per instruction). For example, M[LOAD/STORE][ALU] shows how often data needed to be transferred from the ALU register file to the LOAD/STORE register file, on a per-instruction basis.

The first approach that we simulated is a “lazy” placement policy: every unit writes data to its local register file, and if another unit needs this data, the data is moved to the corresponding register file. The result of using this policy is that too many inter-register file data transfers are needed: we observed local register file misses on up to 30% of instructions executed. Thus, this placement policy is not suitable for the split register file architecture.

The second approach we simulated is a “greedy” placement policy, where data is always written to the register file local to the unit that will need the data first (in the dynamic instruction sequence). If another unit needs this data later, the data is moved to the register file local to that unit. The results for the greedy placement policy, averaged over a set of integer benchmark programs, are given in Table 1.

Number of accesses to the local register files (per instruction):

Instruction Group   LOAD/STORE   BRANCH   PROC. CONTR.   ALU
LD/ST               0.223        0.000    0.000          0.027
BRANCH              0.000        0.000    0.000          0.000
CONTR.              0.000        0.000    0.025          0.000
ALU                 0.046        0.000    0.002          0.535

Table 1: Inter-register file traffic matrix for the greedy placement policy.
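The lazy and greedy policies can be contrasted with a toy move-on-miss transfer counter over a value's dynamic use list (a sketch under our own trace format, not the simulator's):

```python
def count_transfers(trace, policy):
    """trace: list of (producer_unit, [consumer_units in dynamic order]).
    Count inter-register-file moves under 'lazy' (write to the producer's
    local file, move on a remote access) or 'greedy' (write to the first
    consumer's file) placement. Illustrative sketch only."""
    moves = 0
    for producer, consumers in trace:
        if policy == "lazy" or not consumers:
            home = producer
        else:
            home = consumers[0]
        for c in consumers:
            if c != home:        # remote access: move the value on the miss
                moves += 1
                home = c
    return moves
```

For a value produced on unit 0 and consumed twice on unit 1, lazy placement pays one transfer while greedy placement pays none, which is why the greedy policy keeps the off-diagonal traffic in Table 1 low.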

The third approach we simulated is an “optimal” placement policy, where data is always written to the register file local to the unit that will access this data the most times. Unlike the two previous placement policies, if another unit needs to access this data, the data is not moved from the register file. The results for the optimal placement policy, averaged over the same set of integer benchmark programs, are given in Table 2.

Number of accesses to the local register files (per instruction):

Instruction Group   LOAD/STORE   BRANCH   PROC. CONTR.   ALU
LD/ST               0.213        0.000    0.000          0.037
BRANCH              0.000        0.000    0.000          0.000
CONTR.              0.000        0.000    0.023          0.002
ALU                 0.026        0.000    0.000          0.557

Table 2: Inter-register file traffic matrix for the optimal placement policy.

Of course, it is not always possible for either hardware or compiler to predict exactly which unit will need the data first or which unit will access the data most extensively; however, these results give us a lower bound on the register file miss rates for the opcode-based steering scheme. The main result of this experiment is that most register file accesses hit local register files. The total register file miss rate is around 7.5 percent on a per-instruction basis, which means that about 7.5% of all instructions miss their local register files. An interesting observation is that the greedy and optimal placement policies result in approximately the same total miss rate. Note that this result was obtained for code compiled with a regular optimizing compiler. With a compiler that performs the proper optimizations, the results would be better.
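The 7.5% figure can be checked directly against Table 1: the total miss rate is the sum of the off-diagonal entries of the traffic matrix.

```python
# Table 1 (greedy placement); row = accessing group, column = home file.
# Group order: LD/ST, BRANCH, CONTR., ALU
greedy = [
    [0.223, 0.000, 0.000, 0.027],
    [0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.025, 0.000],
    [0.046, 0.000, 0.002, 0.535],
]
# Off-diagonal entries are remote (missing) accesses per instruction.
miss_rate = sum(greedy[i][j] for i in range(4) for j in range(4) if i != j)
print(round(miss_rate, 3))  # 0.075, i.e. about 7.5% of instructions
```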

Most inter-register-file traffic occurs between the ALU and Load/Store units; these register file misses are responsible for about 90% of the total miss rate. The percentage of these misses is higher for optimized code than for non-optimized code, because optimized code uses registers more efficiently for inter-instruction communication. A trivial way to eliminate this kind of miss would be to broadcast the results of instructions in these groups to both register files; however, this might not be energy efficient. In future experiments we will study the energy efficiency of such broadcasting.

8 Conclusions and Future Work

Future growth in performance for modern CPUs is predicated on higher and higher levels of instruction issue. However, even with the best circuit techniques known, the centralized register files needed to support such machines will rapidly become a bottleneck in the power equation. This paper has suggested one path towards reducing this factor by architecturally partitioning the register file into logically and physically separate units. Early simulation results indicate that the amount of inter-unit traffic that needs to be handled to allow such partitioning may be quite manageable, validating that the basic concept may hold substantial promise, and opening up an opportunity to trade performance for power by sharing write ports instead of adding ports dedicated to inter-register file traffic.

Near term work will involve firming up the simulation results with more extensive benchmarks and more detailed simulations, along with more complete power models that incorporate more of the CPU. Longer term work will focus on alternative register partitioning schemes, more general than those suggested in the multiscalar and trace window work, but again with an emphasis on inherently lower power organizations. In addition, second order effects which may be significant, such as increased correlation between bits processed by functional units, are also part of the simulation plans and will be investigated extensively.

The long term goal of this work is thus the determination of architectural techniques, perhaps all the way down to the instruction set level, that permit inherently low energy organizations which in turn are not significantly performance constrained.

References

[1] S. E. Breach, T. N. Vijaykumar, and G. S. Sohi, “The Anatomy of the Register File in a Multiscalar Processor,” Proc. 27th Annual Int'l Symp. on Microarchitecture, pp. 181-190, December 1994.

[2] B. Cmelik, “The SHADE Simulator,” Sun-Labs Technical Report, 1993.

[3] R. Colwell, et al., “A VLIW Architecture for a Trace Scheduling Compiler,” IEEE Transactions on Computers, No. 8, pp. 967-979, August 1988.

[4] K. Farkas, N. Jouppi, P. Chow, “Register File Design Considerations in Dynamically Scheduled Processors,” Technical Report 95/10, Digital Equipment Corporation Western Research Lab, November 1995.

[5] M. Franklin and G. S. Sohi, “Register Traffic Analysis for Streamlining Inter-Operation Communication in Fine-Grain Parallel Processors,” Proc. 25th Annual Int'l Symp. on Microarchitecture, pp. 236-245, 1992.

[6] M. Franklin and G. S. Sohi, “The Expandable Split Window Architecture for Exploiting Fine-Grain Parallelism,” Proc. 19th Annual Int'l Symposium on Computer Architecture, May 1992.

[7] M. Franklin, “The Multiscalar Architecture,” Ph.D. Thesis, University of Wisconsin-Madison, Tech. Report 1196, November 1993.

[8] R. Gonzalez and M. Horowitz, “Energy Dissipation in General Purpose Microprocessors,” IEEE Journal of Solid-State Circuits, Vol. 31, No. 9, September 1996.

[9] H. Kojima, et al., “Power Analysis of a Programmable DSP for Architecture/Program Optimization,” IEEE Symposium on Low Power Electronics, San Jose, pp. 26-27, October 1995.

[10] S. Palacharla, N. Jouppi, J. Smith, “Complexity-Effective Superscalar Processors,” Proc. 24th Annual Int'l Symposium on Computer Architecture, pp. 206-218, June 1997.

[11] U.S. Patent No. 5,657,291, issued Aug. 12, 1997 to A. Podlesny, G. Kristovsky, A. Malshin, “Multiport Register File Memory Cell Configuration for Read Operation.”

[12] T. Sato, et al., “Evaluation of Architecture-level Power Estimation for CMOS RISC Processors,” IEEE Symposium on Low Power Electronics, San Jose, pp. 44-45, October 1995.

[13] M. Tremblay, B. Joy and K. Shin, “A Three Dimensional Register File for Superscalar Processors,” Proc. 28th Annual Hawaii International Conference on System Sciences, pp. 191-201, January 1995.

[14] M. Tremblay, D. Greenley and K. Normoyle, “The Design of the Microarchitecture of UltraSPARC-I,” Proceedings of the IEEE, pp. 1653-1663, December 1995.

[15] J. Turley and H. Hakkarainen, “TI's New 'C6x DSP Screams at 1,600 MIPS,” Microprocessor Report, pp. 14-17, February 17, 1997.

[16] S. Vajapeyam and T. Mitra, “Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences,” Proc. 24th Annual Int'l Symposium on Computer Architecture, June 1997.

[17] V. Zyuban, P. Kogge, “The Energy Complexity of Register Files,” IEEE Symposium on Low Power Electronics and Design, Monterey, August 1998.


ENERGY EFFICIENT CACHE ORGANIZATIONS FOR SUPERSCALAR PROCESSORS

Kanad Ghose and Milind B. Kamble
Department of Computer Science
State University of New York
Binghamton, NY 13902–6000
email: ghose, [email protected]

Abstract

Organizational techniques for reducing energy dissipation inon–chip processor caches as well as off–chip caches have beenobserved to provide substantial energy savings in a technologyindependent manner. We propose and evaluate the use of blockbuffering using multiple block buffers, subbanking and bit lineisolation to reduce the power dissipation within on–chip cachesfor superscalar CPUs. We use a detailed register–levelsuperscalar simulator to glean transition counts that occur withinvarious cache components during the execution of SPEC 95benchmarks. These transition counts are fed into an energydissipation model for a 0.8 micron cache to allow powerdissipation within various cache components to be estimatedaccurately. We show that the use of 4 block buffers, withsubbanking and bit line isolation can reduce the energydissipation of conventional caches very significantly, often by asmuch as 60–70%.

1. Introduction

Most high–performance microprocessors have one or two levels of on–chip caches to accommodate the large memory bandwidth requirements of the superscalar pipeline. A significant amount of power is dissipated by these on–chip caches, as exemplified in the following:

(a) The on–chip L1 and L2 caches in the DEC 21164 microprocessor dissipate about 25% of the total power dissipated by the entire chip [ERB+ 95].

(b) In the bipolar, multi–chip implementation of a 300–MHz CPU reported in [JBD+ 93], 50% of the total dissipated power is due to the primary caches.

(c) A recently announced microprocessor targeted for the low–power, medium–performance market, the DEC SA–110 (which also leads all microprocessors available today in terms of SPECmarks/watt), dissipates 27% and 16% of the total power in the on–chip I–cache and D–cache, respectively [Mon 96].

As yet another example, the HP PA 8500 is significantly cache–rich compared to other similar high–end microprocessors: reportedly more than 70% of its die area is occupied by the on–chip L1 caches, which are likely to be a major source of power dissipation. Several factors result in significant power dissipation in on–chip caches. First, their tag and data arrays are implemented as SRAMs, to allow the cache access rate to match the pipeline clock rate. Second, the cache area is more densely packed than other areas on the die, so the number of transistors devoted to the cache is a significant percentage of the total number of transistors on the die. The above examples suggest that it is imperative to seek techniques that reduce the power dissipated in on–chip caches without compromising the performance of the caches.

The bulk of the energy dissipation in CMOS circuits comes from transitions in the logic levels at gate outputs. The power dissipated by a CMOS gate due to output transitions is given by P = 0.5 · CL · V^2 · f, where CL is the capacitive load driven by the gate, V is the voltage swing per transition and f is the transition frequency. In addition to these capacitive dissipations, circuits in real caches (such as the sense amps) also have non–capacitive dissipations, which also depend on the number of transitions. Several techniques have been proposed (and actually used in some cases) for reducing the power dissipated by caches in general. These techniques can be divided into the following categories:
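For a feel of the magnitudes involved, the switching-power formula can be evaluated directly. The load, swing, and transition-rate values below are illustrative choices of ours, not measurements from the paper:

```python
def switching_power(c_load, v_swing, f_transitions):
    """P = 0.5 * C_L * V^2 * f, the capacitive switching dissipation."""
    return 0.5 * c_load * v_swing ** 2 * f_transitions

# A 1 pF bit line swinging 3.3 V with 100 million transitions per second:
p = switching_power(1e-12, 3.3, 100e6)
print(p)  # about 5.4e-4 W, i.e. roughly half a milliwatt per such line
```

Multiplied across hundreds of bit lines per array, this is why aborting unnecessary array readouts pays off.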

1) Techniques that use alternative cache organizations for reducing cache power dissipation. These include the use of subbanked caches [SuDe 95, KaGh 97a], block buffers [SuDe 95, KaGh 97a], their combination, and multimodular external caches [KBN 95]. Many techniques for organizing SRAMs for lower power dissipation, as presented in [Itoh 96] and [EvFr 95], can also be used to reduce dissipation in the tag and data arrays within a cache.

2) Circuit design techniques that focus on reducing the powerdissipation within the static RAM bit cells, particularlydissipations due to bit line transitions. These include bit lineisolation techniques that isolate the sense amp from the bit linesonce the sense amp has started making a transition, as used in theHitachi SH–3 embedded microprocessor [HKY+ 95], clamped bitlines that limit bit line swings and other similar techniques.

3) Instruction scheduling techniques that schedule instructions in a manner that reduces power dissipation in instruction caches [SuDe 95].

Note that techniques in these three categories are generallyindependent from each other and can thus be used in conjunction.The focus of this paper is on techniques that fall into the firstcategory.

In this paper we propose and evaluate the use of multiple block buffers, subbanking and bit line isolation to reduce the power dissipation within on–chip caches for superscalar CPUs. We show that a suitable combination of these techniques can be extremely effective in reducing the energy dissipation in all caches across the memory hierarchy, both on–chip and off–chip. (Power savings for off–chip caches and other techniques for reducing off–chip cache power are discussed in a separate paper.) Our approach is to use a detailed register–level simulator for a superscalar processor and cache hierarchy to glean transition counts within key cache components as the simulator executes SPEC 95 benchmarks. These transition counts are then fed into an energy dissipation model for the major cache components [KaGh 97b, KaGh 98], with capacitive coefficients obtained from [WiJo 94] for a real cache implementation, to determine the energy dissipations as realistically as possible.

The rest of this paper is organized as follows. In Section 2, we describe the energy efficient architectural techniques: block buffering, subbanking, multiple block buffers and bit–line isolation. These are elaborated in earlier papers [KaGh 97a, KaGh 97b, GhKa 98]. In Section 3, we elaborate on the experimental setup used for our measurements. The energy dissipations for the proposed cache configurations, as obtained from the simulated execution of SPEC 95 benchmarks, are discussed in Section 4. The main conclusions are given in Section 5. The Appendix summarizes the energy dissipation model of the set associative caches.

2. Organizing Caches for Energy Efficiency

The most common cache organization employed in modern microprocessors today is the set–associative cache [Smith 82]; the direct–mapped cache and fully associative cache organizations are the two extremes of the set–associative organization. In a normal m–way set associative cache, there are m tag and data array pairs, each consisting of S rows, where S = 2^s. Each data array location is known as a cache line, which is a group of W consecutive words, W being a power of 2 (W = 2^w). A cache line is thus capable of holding the contents of what is called a memory block, consisting of W consecutive memory words. Tag and data array locations at the same offset within the tag and data arrays make up what is called a set. The placement rule for such a cache dictates that a word at an address A in the memory (RAM), if present in the cache, can be found in some cache line within the set at offset (A div W) mod S. To uniquely identify the memory block that resides within a cache line, the tag part of the address (obtained by stripping the lower order (s+w) bits of A) is kept in the associated tag array location. The access steps for the m–way set–associative cache are thus as follows:

Step 1: Use the middle order s bits in the address of the word to be accessed to read out the set that potentially contains the word into the output latches of the tag and data arrays. These latches together make up what is called a block buffer (aka line buffer).

Step 2: Compare the tag part of the address being accessed in parallel with the m outputs from the tag arrays (using m independent comparators). A match with the output of a tag array indicates that the required data is within the output latch of the corresponding data array. If this is the case, a situation called a cache hit, the lower order w bits of the address are used to multiplex out the desired word from the data array output latch. If no match occurs, we have a cache miss, implying that the desired data is not within the cache.
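Because S and W are powers of two, the address arithmetic in the placement rule reduces to bit slicing, as sketched below (the function name is ours):

```python
def split_address(addr, s_bits, w_bits):
    """Split a word address A into (tag, set index, word offset):
    offset = A mod W, index = (A div W) mod S, tag = A >> (s_bits + w_bits),
    with S = 2**s_bits and W = 2**w_bits."""
    offset = addr & ((1 << w_bits) - 1)
    index = (addr >> w_bits) & ((1 << s_bits) - 1)
    tag = addr >> (s_bits + w_bits)
    return tag, index, offset

# e.g. a cache with S = 16 sets and W = 4-word lines:
print(split_address(0x2B9, 4, 2))  # (10, 14, 1)
```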

In most modern caches, the two steps outlined above areimplemented in 2 (or more) clock cycles. In the pipelined cachesthat we have designed in our simulator, the two steps mentionedabove are naturally divided into 2 stages, thereby providing atheoretical throughput of one request per cache cycle (per port).The tag and data arrays need to be updated when a missing line isfetched or when a STORE request updates the contents of anexisting line. This update of the arrays is performed by a thirdstage of the cache pipeline. On a cache miss, some replacementalgorithm – implemented in hardware – is used to select a victimline from the set read out in step 1 and appropriate steps arefollowed to install the memory block into the victim’s line frame.

For a superscalar CPU, to maximize the number of instructions that can be examined for dispatch in each cycle, a facility is needed to transparently step across the line boundary to the physically next line (if that line is cached). One way of doing this is to use a deck buffer mechanism (as used in the MIPS 10K) or to use odd and even cache banks (with an automatic line address incrementation facility, as originally used in the IBM RISC/6000). Modern superscalar CPUs also tend to support multiple pipelined load/store units that can request access to the D–cache simultaneously. This requirement is met by using a multiported cache or an interleaved cache. In the power studies reported here, we assume that the L1 I–cache uses odd–even cache banks, while the L1 D–cache is multiported.

2.1 Set–associative cache with a single block buffer

Figure 1 depicts a two–way set–associative cache augmented with a single block buffer (i.e., one set of tag and data latches for the various cache ways) that does not prolong the cache cycle time. The components added to the two–way set associative cache to incorporate block buffering are shown highlighted in a grey background. The cache shown in Figure 1 is accessed with a 2–cycle latency but at the rate of one access per cycle, exactly like a conventional set–associative cache. These block buffer augmented caches exploit the spatial locality of reference exhibited by all programs in general. If the current data being accessed is within the last cache line that was accessed, there is no need to fetch that line again from the data array of the cache. This not only saves accesses of the data arrays but, at the same time, also saves the access necessary to the tag arrays, since both the tag and data arrays have to be read to determine if the line containing the required data is within the cache.

The steps for accessing the set–associative cache with a singleblock buffer, using a 2–phase clock (phases φ1, φ2) are:

Cycle 1:

φ1: – Precharge the tag and data arrays for a read access.
– Start decoding the set address applied to the arrays.

(This is exactly what happens in a normal set–associative cache.)

– Simultaneously, compare the set number field in the address being accessed with the set number of the previous access (to determine whether the current access is to the same set as the previous one).

φ2: – If the comparison succeeds (called a “block buffer hit”), abort the readout of the tag and data arrays. Otherwise, latch the selected set into the array output latches.

– Move the set number for the current access into the latch that holds the set number of the last access made.

Cycle 2:

φ1: – Perform the normal tag comparison, as in a normal set–associative cache.

φ2: – Perform the word multiplexing on a cache hit, as in a normal set–associative cache.

Note also that the cache cycle time with block buffering does not go up from that of a normal set–associative cache; all performance characteristics (access rate, pipelining, number of pipeline stages) are maintained.
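A minimal model of the savings mechanism — comparing the incoming set number against the last one and aborting the array readout on a match — might look like this (an abstraction of ours; the cycle-2 tag comparison and hit/miss handling are unchanged and not modeled):

```python
class SingleBlockBuffer:
    """Counts tag/data array readouts avoided by a single block buffer."""
    def __init__(self):
        self.last_set = None
        self.array_reads = 0   # readouts that actually hit the arrays
        self.buffer_hits = 0   # readouts aborted by a block buffer hit

    def access(self, set_index):
        if set_index == self.last_set:
            self.buffer_hits += 1        # same set as last time: abort
        else:
            self.array_reads += 1        # latch the selected set
            self.last_set = set_index
```

Spatially local access streams, such as sequential fetches within a line, turn mostly into buffer hits, which is where the array-energy savings come from.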

Block buffered caches were introduced by Su and Despain in a slightly different form in [SuDe 94]. In their design, the block address of the current access is first compared against the block address of the set that is currently resident in the block buffer. The normal access of the cache – including bit line precharging and row address decoding – is started only when a mismatch occurs. Consequently, this arrangement prolongs the cache access latency, a solution unattractive in practice. Second, if the cache access is pipelined in two stages, a completely new set of tag comparators is needed (along with a comparator for the set number) to allow a block buffer hit to be determined in parallel with the tag comparison step of the previous access. Neither of these problems is present in our block buffering scheme.

2.2 Set–associative cache with multiple block buffers

The benefit of a block buffer can be further extended by usingmultiple block buffers, in effect using the block buffers as aminiature level 0 cache. Figure 2 depicts a set–associative cache


[Figure 1: Two–way set–associative cache with a single block buffer; the components added for block buffering are highlighted.]

with multiple block buffers, in this case 4. Here, some additional considerations are needed due to the use of multiple block buffers. First, we need four latches to hold the numbers of the four most recently accessed sets. Second, four more comparators are needed to compare the set number field of the current access against the numbers of the sets sitting in the four block buffers. Third, if no block buffer hit occurs, the data from the set selected by the current address applied to the cache must be retrieved into one of the four block buffers. Consequently, on a block buffer miss, a victim must be selected from one of the four block buffers. The need to write back an updated block buffer's content into the tag and data arrays, when the block buffer gets selected as a victim, is avoided by writing updates through to the corresponding data array on write accesses that result in a block buffer hit. Along with these, a multiplexing facility is needed to direct the outputs of the tag and data arrays into the victim block buffer. In this case, the four block buffers effectively make up a 4–entry fully–associative write–through cache, which is accessed in parallel with the normal cache without any prolongation of the cycle time or any impact on the normal cache access pipeline.

The access steps for a cache with multiple block buffers are as follows:

Cycle 1:

φ1: – Precharge the tag and data arrays for a read access.

– Start the decoding of the set address applied to the arrays (this is exactly identical to what happens in a normal set–associative cache).

– Simultaneously, compare the set selector field in the address being accessed with the set numbers of the four sets stored in the four block buffers.

– Concurrently, identify a victim block buffer in advance to handle misses on the block buffers, and start the setup of the multiplexor at the output of the tag and data arrays.

φ2: – If the set number of the current access matches the set number associated with any of the block buffers (as found in the set number latches), abort the readout of the tag and data arrays and steer the tag and data values from the matching buffer into the tag comparators and the word multiplexer, respectively.

Otherwise, if a hit did not occur on the block buffers, latch the contents of the selected set from the arrays into the block buffer chosen as a victim in the previous clock phase, and move the set selector bits in the currently accessed address to the set number latch associated with this block buffer.

Cycle 2:

φ1: – Perform the normal tag comparison, as in a normal set–associative cache.

φ2: – Perform the word multiplexing on a cache hit, as in a normal set–associative cache.

The energy savings in a set–associative cache with multiple block buffers arise for reasons similar to those for a set–associative cache with a single block buffer. Multiple line buffers simply increase the probability of aborting the tag and data array accesses, so that bit line dissipations during the array readout are avoided.

In [KGM 97], Kin et al. describe the use of a small “filter cache” that sits in front of a conventional L1 cache for reducing the power dissipation of the cache memory system. If a hit occurs in the smaller filter cache, data is accessed from the filter cache. The normal L1 cache access is started only after a miss has been detected in this filter cache. Consequently, the cache access latency is increased – this can have an adverse impact on performance. Notice that although the multiple block buffers in our proposal behave as a fully associative cache and serve the same purpose as a filter cache, they do not impact the cache latency at all. This is because we probe the block buffers in parallel with the normal cache access. Also, unlike a fully associative cache, we need only one set of tag comparators and four set number comparators to detect a block buffer hit. (In a four–entry fully associative cache, four tag comparators would be needed.)

[Figure: Set-associative cache with four block buffers – tag and data arrays, set-number latches and comparators, block buffers with replacement logic and multiplexing control, tag comparators, word multiplexer, and interstage latch.]

2.3 Bit–line isolation and subbanking

One of the major sources of power dissipation in a cache has been shown to be transitions on the bit lines of the data and tag arrays. Bit line dissipations occur when the bit lines are precharged or discharged. Bit line isolation represents one way of reducing the energy dissipations due to bit line transitions. Here, the sense amps sensing the bit lines are disconnected from the bit lines as soon as they start the transition on sensing. Consequently, when the sense amps recover the cleaner logic levels, these logic levels are not driven onto the bit lines. With this arrangement, a small differential voltage on the bit lines is sufficient to trigger the full transition on the sense amp outputs, but because the sense amp output remains disconnected from the bit lines, the bit lines


themselves do not experience the full transition made by the sense amp. Considerable energy savings thus result. Bit line isolation has been used to reduce the power dissipation in the cache memory of the Hitachi SH–3 embedded microprocessor [HKY+ 95].

To achieve further savings on the bit line energy, the data arrays can be subdivided into subbanks, so that only those subbanks that contain the desired data are read out [SuDe 95, KaGh 97a]. A subbank consists of a number of consecutive bit columns of the data array. A data line is thus spread across a number of subbanks. The size of a subbank refers to the physical word width of each subbank. Each subbank within a data array can be activated independently. By using an array of bit flags to indicate the presence/absence of subbanks in the block buffer, the array access stage can determine if a subbank needs to be read out for the current request. This again affects neither the cycle time nor the pipeline of the cache, while at the same time preventing readout of those subbanks from the data array which might never be needed.
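As an illustration of the selection logic (the 32-byte line and 8-byte subbank width here are assumed values, not the paper's actual design points), the word offset alone determines which single subbank must be activated, and the per-subbank presence flags decide whether the data array needs to be touched at all:

```python
# Illustrative sketch of subbank selection; the 32-byte line and the
# 8-byte subbank width are assumptions made for this example.
LINE_BYTES = 32
SUBBANK_BYTES = 8          # "size of a subbank" = physical word width

def subbank_for(offset_in_line):
    # Only this subbank needs to be precharged and read out.
    return offset_in_line // SUBBANK_BYTES

def needs_array_readout(offset_in_line, present_flags):
    # present_flags: per-subbank bit flags tracking which subbanks of the
    # currently buffered line are already held in the block buffer.
    return not present_flags[subbank_for(offset_in_line)]

print(subbank_for(20))                                     # byte 20 -> subbank 2
print(needs_array_readout(20, [True, True, False, True]))  # subbank 2 absent -> read
```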

3. Experimental Setup

We now describe the experimental setup used in our study of the energy efficiency achieved through the use of set associative caches with multiple block buffers, subbanking and bit line isolation as described in Section 2. We use a detailed register–level simulator, SCAPE, which accurately simulates at the cycle level a superscalar pipelined processor based on the MIPS instruction set. SCAPE can be configured (using a configuration file) at run time to simulate a variety of cache hierarchies, such as 2 or 3 levels of caches, split or unified caches, etc. The primary cache organizational features, such as size, associativity, block size and cache operating frequency, can be configured individually for each cache used in the system. Apart from these, a variety of other cache organizational features that are important in the context of a superscalar processor can be individually tuned for each cache, such as the number of ports, multiple outstanding requests, writeback policy and sub–blocking. For the results presented here, we used a 3 level cache hierarchy, with split L1 caches, a unified L2 cache and a unified L3 cache. The L1 and L2 caches were assumed to be on–chip and the L3 cache was assumed to be off–chip. Since superscalar CPUs need to fetch multiple instructions at a time to feed the instruction dispatch unit, the L1 I–cache was designed to supply multiple instructions per Ifetch request. To improve the I–cache bandwidth, the I–cache was designed to have even–odd directories (and banks) so that requests that spanned consecutive rows could still be completed in the same cycle.

The presence of multiple LOAD/STORE functional units in the superscalar pipeline demanded the use of multiple ports for the L1 D–cache. All caches used a 3–stage pipelined design, where stage 1 performed the bit–line precharging and array access activities, stage 2 performed the tag compare, steer and miss handling activities, and stage 3 performed the write activity into the array. We assume the layout and technology parameters for the 0.8µ cache described in detail in [WiJo 94].

The superscalar pipeline implements the MIPS 3000 ISA. The out–of–order execution engine uses a register update unit (RUU), which maintains the program order of the instructions in order to retire completed instructions. Data forwarding from completed but not retired instructions in the RUU, as well as from the functional units, is employed to maximize instruction throughput. Multiple functional units are also used to achieve dispatch of multiple instructions in the same cycle. Many parameters of the superscalar pipeline itself can be configured at run time, such as the size of the RUU, the number of functional units, the degree of superscalar issue and the number of physical registers used to implement register renaming. For our studies we used a 64–entry RUU and 4–instruction fetch and issue. We assumed that the following

functional units are present: 2 LOAD units, 1 STORE unit, 6 integer units, 2 integer multiply/divide (pipelined) units and 2 floating point (pipelined) units. SCAPE accepts any statically compiled executable runnable on a MIPS R3000 processor. We simulated the execution of the SPECInt95 and a few SPECFp95 (su2cor, mgrid, applu) benchmarks to get a good mix of CPU–intensive and memory–intensive loads.

For our base case, we assume a 32 Kbyte, direct–mapped L1 I–cache and a 32 Kbyte, 4–way set–associative L1 D–cache. The line sizes for both of these caches are set to 32 bytes, a typical number, with a 16 byte subblock size. The L2 cache is assumed to be a 128 Kbyte, 4–way set–associative unified (i.e., shared by instructions and data) on–chip cache with a line size of 64 bytes and a 32 byte subblock size. The 64 byte line size was chosen since L2 caches must have a line size longer than that of the L1 caches to be effective. The off–chip L3 cache is assumed to be a 1 Mbyte, 8–way set–associative unified cache with a line size of 128 bytes. The interconnection bus width was 32 bytes between L2 and L3, 16 bytes between L1 and L2, and 64 bytes between L3 and the main memory. All the caches had a write–back policy except L3, which was write–through with a 16–deep write back buffer. A buddy replacement algorithm, approximating LRU, was used for all the set–associative caches, as well as for choosing the victim block buffer on a block buffer miss.

We studied a variety of cache configurations, keeping the individual capacities of the L1 I, L1 D, L2 and L3 caches constant, to study the effects of multiple block buffers and other energy efficient enhancements. These configurations are as follows (see Table I):

• L1 I–caches with associativities of 1 and 4. For each of these configurations we compared the energy saving through the simultaneous use of 4 block buffers, subbanking and bit–line isolation against a conventional cache. For these configurations of the L1 I–cache, the parameters of the L1 D–cache and L2 cache were kept the same as in the base case. The L3 cache is always maintained as a conventional cache with the base case configuration.

• L1 D–caches with associativities of 2 and 4. For each of these configurations we compared the energy saving through the simultaneous use of 4 block buffers, subbanking and bit–line isolation against a conventional cache. For these configurations of the L1 D–cache, the parameters of the L1 I–cache and L2 cache were kept the same as in the base case. The L3 cache is always maintained as a conventional cache with the base case configuration.

For the purposes of this paper, we used the capacitances for the 0.8 micron CMOS cache described in [WiJo 94]. A supply voltage of Vdd = 5 Volts was assumed. The voltage swing on the bit lines was limited to 500 mV on each side of Vprecharge, which was assumed to be Vdd/2. For the bit line isolation design, however, the voltage swing was assumed to be 200 mV on each side of Vprecharge. The CPU pipeline (and the caches) were assumed to be clocked at 120 MHz. The 0.8µ pipelined cache used for the case study can sustain this clock rate for a wide range of cache sizes, based on the timings reported in [WiJo 94], where the cache was not pipelined. The L2 cache is operated at half this frequency.
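As a back-of-the-envelope check on the voltage swings quoted above (assuming, as a first-order approximation, that the bit line energy drawn from the supply per charging event scales as C_bl · ΔV · V_dd; the bit line capacitance below is an arbitrary illustrative value, not a figure from [WiJo 94]):

```python
# Back-of-the-envelope only: assuming bit-line energy per charging event
# scales as C_bl * dV * Vdd, moving from the 500 mV swing of the baseline
# design to the 200 mV swing used with bit-line isolation cuts this energy
# component in proportion to the swing.
C_BL = 1.0e-12   # assumed bit-line capacitance (F); purely illustrative
VDD = 5.0        # supply voltage from Section 3

def bitline_energy(swing_v, c_bl=C_BL, vdd=VDD):
    return c_bl * swing_v * vdd  # Joules per charging event

e_conv = bitline_energy(0.5)   # conventional: 500 mV swing
e_iso = bitline_energy(0.2)    # with bit-line isolation: 200 mV swing
print(e_iso / e_conv)          # 0.4, i.e. a 60% cut in this component
```

This first-order estimate is consistent with bit line isolation contributing a large share of the overall 60–70% savings reported in Section 4.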

4. Results and Discussions

Figures 3 and 4 depict how the power dissipation varies across the SPEC 95 benchmarks for a conventionally organized L1 I–cache and a conventionally organized L1 D–cache. Figure 5 shows how the power dissipation in the conventional, unified L2 cache varies over the same benchmarks.


[Table I: Cache configurations used in the evaluation – for each configuration, the cache sizes, line sizes, associativities, subbank sizes and block buffer arrangements of the L1 I–cache, L1 D–cache and L2 cache.]

Figure 6 shows how the average power dissipation for the L1 I–cache (over the SPEC 95 benchmarks studied) decreases as we move from a conventional organization (labelled a) to an organization with 4 block buffers and subbanking (labelled b), and to an organization with 4 block buffers and subbanking that additionally uses bit line isolation (labelled c). The power reductions that result are in excess of 66%. Figure 7 shows that even better power improvements occur for the L1 D–cache when identical enhancements are made – in this case the power savings approach 70%.

In Figures 8 and 9, we depict the power dissipations across the benchmarks for the two most power–efficient organizations for the L1 I–cache and the L1 D–cache, respectively. A comparison of Figures 3 and 8 shows that the power savings for the L1 I–cache are quite consistent across the benchmarks when 4 block buffers, subbanking and bit line isolation are used. Similar conclusions are to be drawn for the L1 D–cache when Figures 4 and 9 are compared.

Figure 10 depicts the extent of overall power savings when organizational enhancements are made to all of the on–chip caches. Since the L1 I–cache has a relatively higher dissipation, the specific organization chosen for the L1 I–cache was the one that had the least power dissipation. The L1 D–cache and L2 cache configurations were then chosen to minimize the overall power dissipation in all of the on–chip caches.

Figures 11 and 12 are useful in understanding where the power savings come from. For the conventional organizations, the energy spent in precharging and discharging the bit lines dominates. With the use of block buffers, the number of accesses to the actual cache arrays is reduced, reducing in turn the bit line dissipation component. When subbanking is additionally deployed, the number of bit lines that have to be driven, precharged or sensed in the data array is further reduced. Additional savings come when the capacitive loading of the bit lines, as seen by the sense amp, is reduced through bit line isolation.

5. Conclusions

On–chip caches are a major source of power dissipation in contemporary superscalar microprocessors. The bulk of the energy dissipated in conventionally organized caches is in precharging, sensing and discharging the bit lines of the tag and data arrays. We proposed the use of alternative organizations, such as multiple block buffering, subbanking and bit line isolation, to reduce the power dissipation in on–chip caches without compromising the cache cycle time (and other aspects of performance, such as the cache hit ratio and cache access rate). The power savings achieved ranged from 60% to 70%.

We are currently investigating other organizational and circuit enhancements to caches and on–chip memory systems for reducing power without sacrificing performance.

References

[EvFr 95] Evans, R. J. and Franzon, P. D., “Energy Consumption Modeling and Optimization for SRAM's”, IEEE Journal of Solid–State Circuits, Vol. 30, No. 5, May 1995, pp. 571–579.

[HKY+ 95] Hasegawa, A. et al., “SH3: High Code Density, Low Power”, IEEE Micro, Dec. 1995, pp. 11–19.

[Itoh 96] Itoh, K., “Low Power Memory Design”, in LowPower Design Methodologies, ed. by Rabaey, J. and Pedram, M.,Kluwer Academic Pub., pp. 201–251.

[KBN 95] Ko, U., Balsara, P. T. and Nanda, A.K., “EnergyOptimization of Multi–Level Processor Cache Architectures”, inProc. of the Int’l. Sym. on Low Power Design, 1995, pp. 45–49.

[KaGh 97a] Kamble, M. B. and Ghose, K., “Energy–Efficiency of VLSI Caches: A Comparative Study”, in Proc. IEEE 10th Int'l. Conf. on VLSI Design, Jan. 1997, pp. 261–267.

[KaGh 97b] Kamble, M. B. and Ghose, K., “Analytical Energy Dissipation Models for Low Power Caches”, in Proc. 1997 Int'l. Symposium on Low Power Electronics and Design, Aug. 1997, pp. 143–148.

[KaGh 98] Kamble, M. B. and Ghose, K., “Modeling Energy Dissipation in Low Power Caches”, Technical Report CS–TR–98–02, Dept. of Computer Science, SUNY–Binghamton, 1998.

[KGM 97] Kin, J., Gupta, M. and Mangione–Smith, W.H., “TheFilter Cache: An Energy–Efficient Memory Structure”, in Proc.MICRO 30, 1997, pp. 184–193.

[Larus 96] Larus, J., “SPIM: A MIPS 2000 Simulator”, available from the Univ. of Wisconsin CS ftp site.

[Mon 96] Montanaro, J. et al., “A 160 MHz, 32b 0.5 W CMOSRISC Microprocessor”, in IEEE ISSCC 1996 Digest of Papers,1996.

[Smith 82] Smith, A. J., “Cache Memories”, ACM ComputingSurveys, Sept. 1982, pp. 473–530.

[SuDe 95] Su, C. and Despain, A., “Cache Design Tradeoffs forPower and Performance Optimization: A Case Study”, in Proc. ofthe Int’l. Sym. on Low Power Design, 1995, pp. 63–68.

[WiJo 94] Wilton, S. E., and Jouppi, N., “An Enhanced Accessand Cycle Time Model for On–Chip Caches”, DEC WRLResearch Report 93/5, July 1994

[WaRa+ 92] Wada, T., Rajan, S. and Przybylski, S. A., “An Analytical Access Time Model for On–Chip Cache Memories”, IEEE Journal of Solid–State Circuits, Vol. 27, No. 8, Aug. 1992, pp. 1147–1156.


[Figure 3: Power dissipations in the conventional L1 I–cache across the benchmarks for the base configuration.]

[Figure 4: Power dissipations in the conventional L1 D–cache across the benchmarks for the base configuration.]

[Figure 5: Power dissipations in the conventional L2 cache across the benchmarks for the base configuration.]

[Figure 6: L1 I–cache power dissipation across the different organizations: (a) conventional, (b) block buffers and subbanking, (c) block buffers, subbanking and bit line isolation.]

[Figure 7: L1 D–cache power dissipation across the different organizations: (a) conventional, (b) block buffers and subbanking, (c) block buffers, subbanking and bit line isolation.]

[Figure 8: Power dissipations across the benchmarks for the most energy–efficient L1 I–cache configuration.]

[Figure 9: Power dissipations across the benchmarks for the most energy–efficient L1 D–cache configuration.]

[Figure 10: Total power dissipations in the on–chip caches for the conventional configuration and the most energy–efficient configuration.]

[Figure 11: Components of the power dissipation (output drivers, address input comparators, latches, word lines, bit lines, sense amps) in the conventional configuration.]

[Figure 12: Components of the power dissipation in the 4–way block buffered, subbanked configuration with bit line isolation.]


Power and Performance Tradeoffs using Various Cache Configurations

Gianluca Albera and R. Iris Bahar

Politecnico di Torino
Dip. di Automatica e Informatica
Torino, ITALY 10129

Brown University
Division of Engineering
Providence, RI 02912

Abstract

In this paper, we will propose several different data and instruction cache configurations and analyze their power as well as performance implications on the processor. Unlike most existing work in low power microprocessor design, we explore a high performance processor with the latest innovations for performance. Using a detailed, architectural–level simulator, we evaluate full system performance using several different power/performance sensitive cache configurations. We then use the information obtained from the simulator to calculate the energy consumption of the memory hierarchy of the system. Based on the results obtained from these simulations, we will determine the general characteristics of each cache configuration. We will also make recommendations on how best to balance power and performance tradeoffs in memory hierarchy design.

1 Introduction

In this paper we will concentrate on reducing the energy demands of an ultra high–performance processor, such as the Pentium Pro or the Alpha 21264, which uses superscalar, speculative, out–of–order execution. In particular, we will investigate architectural–level solutions that achieve a power reduction in the memory subsystem of the processor without compromising performance.

Prior research has been aimed at measuring and recommending optimal cache configurations for power. For instance, in [10], the authors determined that high performance caches were also the lowest power consuming caches, since they reduce the traffic to the lower levels of the memory system. The work by Kin [7] proposed accessing a small filter cache before accessing the first level cache to reduce the accesses (and energy consumption) from DL1. The idea led to a large reduction in memory hierarchy energy consumption, but also resulted in a substantial reduction in processor performance. While this reduction in performance may be tolerable for some applications, the high–end market will not make such a sacrifice. This paper will propose memory hierarchy configurations that reduce power while retaining performance.

Reducing cache misses due to line conflicts has been shown to be effective in improving overall system performance in high–performance processors. Techniques to reduce conflicts include increasing cache associativity, use of victim caches [5], or cache bypassing with and without the aid of a buffer [4, 9, 11]. Figure 1 shows the design of the memory hierarchy when using buffers alongside the first level caches.

[Figure 1: Memory hierarchy design using buffers alongside the L1 caches – a dL1 buffer beside the L1 data cache and an iL1 buffer beside the L1 instruction cache, both backed by the unified L2 cache. These buffers may be used as victim caches, non–temporal buffers, or speculative buffers.]

The buffer is a small cache, of between 8 and 16 entries, located between the first level and second level caches. The buffer may be used to hold specific data (e.g. non–temporal or speculative data), or may be used for general data (e.g. “victim” data). In this paper we will analyze various uses of this buffer in terms of both power and performance. In addition, we will compare the power and performance impact of using this buffer to more traditional techniques for reducing cache conflicts, such as increasing cache size and/or associativity.
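The victim-style use of the buffer can be sketched as follows. This is a hypothetical Python model of our own devising (the `access` function, the FIFO stand-in for the replacement policy, and the cache parameters are illustrative, not the paper's implementation): on an L1 miss that hits in the buffer, the buffered line and the conflicting L1 line trade places ("swapping"); on a full miss, the evicted L1 line is saved in the buffer.

```python
# Hypothetical sketch of a victim buffer with "swapping" beside a
# direct-mapped L1; names and parameters are illustrative only.

def access(addr, l1, victim, line_bytes=32, l1_lines=256):
    tag = addr // line_bytes // l1_lines
    index = (addr // line_bytes) % l1_lines
    if l1.get(index) == tag:
        return "L1 hit"
    if (tag, index) in victim:                 # buffer hit: swap the lines
        victim.remove((tag, index))
        if l1.get(index) is not None:
            victim.append((l1[index], index))
        l1[index] = tag
        return "victim-buffer hit"
    if l1.get(index) is not None:              # full miss: evictee -> buffer
        victim.append((l1[index], index))
        if len(victim) > 16:                   # 16-entry buffer (Section 2)
            victim.pop(0)                      # FIFO stand-in for LRU
    l1[index] = tag
    return "miss"

l1, vb = {}, []
print(access(0x0000, l1, vb))   # miss
print(access(0x2000, l1, vb))   # conflicts with 0x0000: miss, 0x0000 saved
print(access(0x0000, l1, vb))   # victim-buffer hit, lines swapped
```

The non-swapping variant evaluated later in the paper would simply promote the buffered line without demoting the conflicting L1 line into the buffer.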

2 Experimental setup

This section presents our experimental environment. First, the CPU simulator will be briefly introduced, and then we will describe how we obtained data about the energy consumption of the caches. Finally, we will describe each architectural design we considered in our analysis.

2.1 Full model simulator

We use an extension of the SimpleScalar [2] tool suite. SimpleScalar is an execution–driven simulator that uses binaries compiled to a MIPS–like target. SimpleScalar can accurately model a high–performance, dynamically–scheduled, multi–issue processor. We use an extended version of the simulator that more accurately models


all of the memory hierarchy, implementing non–blocking caches and complete bus bandwidth and contention modeling [3]. Other modifications were added to handle precise modeling of cache fills.

Tables 1, 2, and 3 show the configuration of the processor modeled. Note that the first level caches are on–chip, while the unified second level cache is off–chip. In addition, we have a 16–entry buffer associated with each first level cache; this buffer was implemented either as a fully associative cache with LRU replacement, or as a direct mapped cache. Note that we chose an 8K first level cache configuration in order to obtain a reasonable hit/miss rate from our benchmarks [13]. In Tables 2 and 3, note that some types of resource units (e.g., the FP Mult/Div/Sqrt unit) may have different latency and occupancy values depending on the type of operation being performed by the unit.

Table 1: Machine configuration parameters.

L1 Icache:          8KB direct; 32B line; 1 cycle lat.
L1 Dcache:          8KB direct; 32B line; 1 cycle lat.
L2 Unified Cache:   256KB 4-way; 64B line; 12 cycles
Memory:             64 bit-wide; 20 cycles on page hit, 40 cycles on page miss
Branch Pred.:       2k gshare + 2k bimodal + 2k meta
BTB:                1024 entry 4-way set assoc.
Return Addr. Stack: 32 entry queue
ITLB:               32 entry fully assoc.
DTLB:               64 entry fully assoc.

Table 2: Processor resources.

Fetch/Issue/Commit Width:       4
Integer ALU:                    3
Integer Mult/Div:               1
FP ALU:                         2
FP Mult/Div/Sqrt:               1
DL1 Read Ports:                 2
DL1 Write Ports:                1
Instruction Window Entries:     64
Load/Store Queue Entries:       16
Fetch Queue:                    16
Minimum Misprediction Latency:  6

Table 3: Latency and occupancy of each resource.

Resource       Latency  Occupancy
Integer ALU       1         1
Integer Mult      3         1
Integer Div      20        19
FP ALU            2         1
FP Mult           4         1
FP Div           12        12
FP Sqrt          24        24
Memory Ports      1         1

Our simulations are executed on the SPECint95 benchmarks; they were compiled using a re–targeted version of the GNU gcc compiler, with full optimization. This compiler generates 64 bit–wide instructions, but only 32 bits are used, leaving the others for future implementations; in order to model a typical actual machine, we convert these instructions to 32 bits before executing the code. Since one of our architectures was intended by its authors for floating point applications [9], we also ran a subset of SPECfp95 in this case. Since we are executing a full model on a very detailed simulator, the benchmarks take several hours to complete; due to time constraints we feed the simulator with a small set of inputs. However, we execute all programs entirely (from 80M instructions in compress to 550M instructions in go).

2.2 Power model

Energy dissipation in CMOS technology circuits is mainly due to charging and discharging gate capacitances; on every transition we dissipate E_t = (1/2) C_eq V_dd^2 Joules. To obtain the values for the equivalent capacitances, C_eq, for the components in the memory subsystem, we follow the model given by Wilton and Jouppi [12]. Their model assumes a 0.8µm process; if a different process is used, only the transistor capacitances need to be recomputed. To obtain the number of transitions that occur on each transistor, we refer to Kamble and Ghose [6], adapting their work to our overall architecture.
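A worked instance of the per-transition energy formula (both input values are assumptions chosen for illustration; C_eq is not a figure from [12]):

```python
# Worked instance of E_t = (1/2) * C_eq * Vdd**2.
C_eq = 1.0e-12   # 1 pF equivalent capacitance (assumed, illustrative)
Vdd = 5.0        # 5 V supply, typical for a 0.8um process (assumed)

E_t = 0.5 * C_eq * Vdd ** 2
print(E_t)       # 1.25e-11 J, i.e. 12.5 pJ per transition
```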

An m–way set associative cache consists of three main parts: a data array, a tag array and the necessary control logic. The data array consists of S rows containing m lines. Each line contains L bytes of data and a tag T which is used to uniquely identify the data. Upon receiving a data request, the address is divided into three parts. The first part indexes one row in the cache, the second selects the bytes or words desired, and the last is compared to the entry in the tag array to detect a hit or a miss. On a hit, the processor accesses the data from the first level cache. On a miss, we use a write–back, write–allocate policy. The latency of the access is directly proportional to the capacitance driven and to the length of the bit and word–lines. In order to maximize speed we kept the arrays as square as possible by splitting the data and tag arrays vertically and/or horizontally. Sub–arrays can also be folded. We used the tool CACTI [12] to compute sub–arraying parameters for all our caches.
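The three-way address split described above can be written out directly (the row count S and line size L below are illustrative values, not the paper's parameters):

```python
# Sketch of the three-way address split: row index, byte select, tag.
# S (rows) and L (bytes per line) are assumed example values.
def split_address(addr, S=256, L=32):
    offset = addr % L         # selects the bytes/words within the line
    row = (addr // L) % S     # indexes one row of the cache
    tag = addr // (L * S)     # compared against the stored tag(s) for hit/miss
    return tag, row, offset

print(split_address(0x12345))  # (9, 26, 5)
```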

According to [6], we consider the main sources of power to be the following three components: E_bit, E_word and E_output. We are not considering the energy dissipated in the address decoders, since we found this value to be negligible compared to the other components. Similar to Kin [7], we found that the energy consumption of the decoders is about three orders of magnitude smaller than that of the other components. The energy consumption is computed as:

E_cache = E_bit + E_word + E_output    (1)

A brief description of each of these components follows. Further details can be found in [1].

Energy dissipated in bit-lines

E_bit is the energy consumption in the bit–lines; it is due to precharging the lines (including driving the precharge logic) and reading or writing data. Since we assume a sub–arrayed cache, we need to precharge and discharge only the portion directly related to the address we need to read/write.


Table 4: Baseline results: number of cycles, accesses and energy consumption in the base case. Energy is given in Joules.

Test       Cycles         DL1 Accesses  DL1 Eng. (J)  IL1 Accesses  IL1 Eng. (J)  UL2 Accesses  UL2 Eng. (J)
compress    70 538 278     27 736 186    0.092         66 095 704    0.207          3 134 116    0.402
go         826 111 517    169 860 620    0.612        430 587 086    1.499         60 055 380    6.819
vortex     250 942 324     93 470 624    0.284        108 663 648    0.375         15 336 216    1.796
gcc        392 296 667    103 457 950    0.317        176 609 209    0.618         22 959 880    2.652
li         134 701 535     74 445 274    0.232        140 347 169    0.446          5 856 655    0.652
ijpeg      139 376 153     73 358 576    0.214        135 804 661    0.437          3 906 431    0.451
m88ksim    659 451 084    130 319 219    0.369        241 836 872    0.882         34 953 990    3.980
perl       372 034 266     90 388 059    0.293        148 817 944    0.531         24 532 245    2.794

Table 5: Instruction cache sizes: percent improvement in performance and power compared to the base case. Latency = 1 cycle.

           8K-2way                       16K-direct                    16K-2way
Test       %Cyc   %IL1 Eng.  %Tot. Eng.  %Cyc   %IL1 Eng.  %Tot. Eng.  %Cyc   %IL1 Eng.  %Tot. Eng.
compress    0.24   -18.39     -5.26       0.34   -20.31     -5.67       0.32   -58.63    -17.05
go         13.15   -20.59      8.98      21.17   -22.61     15.90      26.51   -62.84     14.04
vortex      8.30   -22.12      6.14      21.14   -23.22     18.07      27.61   -66.16     18.61
gcc         5.08   -18.95      2.08      19.29   -20.04     15.44      25.02   -58.33     14.60
li          7.20   -18.78      1.58       2.47   -20.11     -5.06       8.69   -59.13    -11.40
ijpeg       4.76   -18.02     -2.28       4.41   -19.88     -3.29       5.43   -57.51    -18.60
m88ksim    16.10   -17.24     12.89      27.33   -15.53     24.73      50.50   -55.10     42.75
perl       11.13   -18.83      7.85      19.10   -20.27     14.98      24.71   -59.82     15.13

Note that in order to minimize the power overhead introduced by the buffers, in the fully associative configuration we first perform a tag look–up and access the data array only on a hit. If timing constraints make this approach infeasible, direct mapped buffers should be considered.

Energy dissipated in word-lines

E_word is the energy consumption due to the assertion of word–lines; once the bit–lines are all precharged, we select one row, performing the read/write to the desired data.

Energy dissipated driving external buses

E_output is the energy used to drive the external buses; this component includes both the data sent/returned and the address sent to the lower level memory on a miss request.

3 Experimental results

3.1 Base case

In this section we will describe the base case we used;all other experiments will compare to this one.

As stated before, our base case uses 8K direct mapped on–chip first level caches (i.e. DL1 for data and IL1 for instructions), with a unified 256K 4–way off–chip second level cache (UL2). Table 4 shows the execution time measured in cycles and, for each cache, the number of accesses and the energy consumption (measured in Joules). The results in the following sections will be shown as percentage improvements over the base case; thus positive numbers mean an improvement in power or performance, and negative numbers mean a worsening.

First-level caches are on-chip, so their energy consumption counts at the CPU level, while the off-chip second-level cache energy counts at the board level. In the following sections we show what happens to the energy of the overall cache architecture (L1 plus L2), and also clarify whether variations in power belong to the CPU or to the board. As shown in Table 4, the dominant portion of the energy consumption is due to the UL2, because of its bigger size and high-capacitance board buses. Furthermore, the instruction cache demands more power than the data cache, due to a higher number of accesses.

3.2 Traditional techniques

Traditional approaches for reducing cache misses use bigger caches and/or increased associativity. These techniques may offer good performance but present several drawbacks: first, an area increase is required; second, a larger latency is likely whenever bigger caches and/or higher associativity are used.

Table 5 presents results obtained by changing the instruction cache size and/or associativity. Specifically, we present 8K 2-way, 16K direct-mapped and 16K 2-way configurations, all maintaining a 1-cycle latency. The reduction in cycles, IL1 energy and total energy is shown. The overall energy consumption (i.e., total energy) is generally reduced, owing to reduced activity in the second-level cache (since we decrease the IL1 miss rate). However, we also see cases with an increase in power, since benchmarks like compress, li and ijpeg already present a high hit rate in the base case.

Table 6: Victim cache, fully associative: percent improvement in performance and power compared to the base case.

                       Swapping                                 Non-Swapping
           Data Only   Inst. Only   Data & Inst.   Data Only   Inst. Only   Data & Inst.
Test      %Cyc  %Eng   %Cyc  %Eng   %Cyc   %Eng   %Cyc  %Eng   %Cyc  %Eng   %Cyc   %Eng
compress  2.30  6.79  -0.44 -1.51   2.83   5.97   2.91  6.98  -0.44 -1.50   2.77   5.65
go        3.33 10.98   4.40  3.48   7.84  14.49   4.22 12.89   6.50  4.60  10.92  18.00
vortex    3.04 12.01   5.20  4.80   7.43  12.99   3.43 12.34   5.63  4.86   9.75  17.74
gcc       1.72  6.98   3.01  2.54   4.61   8.96   2.11  7.41   4.30  3.37   6.69  11.27
li        2.88  6.62   3.06  2.04   6.26   9.11   3.82  8.96   6.11  4.90  10.69  15.00
ijpeg     4.41 11.04   1.61 -0.72   5.92  11.44   4.86 10.58   2.66 -0.24   7.54  12.16
m88ksim   0.36  3.06   9.80  9.39  10.22  12.51   0.61  4.41  21.07 18.14  21.85  23.12
perl      0.90  4.49   5.60  5.06   6.75   9.78   1.69  5.42   7.58  5.64   8.50  11.36

Even if the total energy improves, we show that the on-chip power increases significantly (up to 66%) as the cache becomes bigger. As will be shown in the coming sections, we observed that using buffers associated with the first-level caches permits an overall energy reduction with an on-chip increase of at most 3%. We also observed that if the increase in size/associativity requires a 2-cycle latency, performance improvements are no longer so dramatic, and in some cases we even saw a decrease (especially for the instruction cache).

In the following sections we present the various architectures we have analyzed. We start with examples from the literature, such as the victim cache [5] and the non-temporal buffer [9], and then present new schemes based on the use of buffers and ways to combine them.

3.3 Victim cache

We consider the idea of a victim cache, originally presented by Jouppi in [5], with small changes. The author presented the following algorithm. On a main-cache miss, the victim cache is accessed; if the address hits in the victim cache, the data is returned to the CPU and at the same time promoted to the main cache; the replaced line in the main cache is moved to the victim cache, thereby performing a "swap". If the victim cache also misses, an L2 access is performed; the incoming data fills the main cache, and the replaced line is moved to the victim cache. The replaced entry in the victim cache is discarded and, if dirty, written back to the second-level cache.

We first made a change in the algorithm, performing a parallel look-up in the main and victim caches, as we saw that this helps performance without significant drawbacks on power. We refer to this algorithm as victim cache swapping, since swapping is performed on a victim-cache hit.

We found that the time required by the processor to perform the swapping on a victim hit was detrimental to performance, so we also tried a variation of the algorithm that does not require swapping (i.e., on a victim-cache hit, the line is not promoted to the main cache). We refer to it as victim cache non-swapping.

Table 6 shows the effects of using a fully associative victim cache, for both the swapping and non-swapping mechanisms. We present cycle and overall energy reductions for three different schemes, which use a buffer associated with the data cache, the instruction cache, or both. We observed that the combined use of buffers for both caches offers a roughly additive improvement over the single-cache case; this result generally applies to all uses of buffers we tried. As stated before, the non-swapping mechanism generally outperforms the original algorithm presented by the author. In [5], the data cache of a single-issue processor was considered, where a memory access occurs approximately one out of every four cycles; thus the victim cache had ample time to perform the necessary swapping. The same cannot be said of a 4-way issue processor that has, on average, one data memory access per cycle. In this case the advantages obtained by swapping are often outweighed by the extra latency introduced.
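The swapping and non-swapping policies can be sketched with a toy model. This is a minimal sketch under simplifying assumptions (tags stand in for whole lines, the victim cache is a FIFO, and all names are hypothetical), not the authors' simulator:

```python
class VictimCache:
    """Toy model of a direct-mapped L1 backed by a small fully
    associative victim cache, with parallel look-up in both."""

    def __init__(self, l1_sets, victim_entries, swapping=True):
        self.l1 = [None] * l1_sets      # one tag per direct-mapped set
        self.victim = []                # FIFO list of victim tags
        self.victim_entries = victim_entries
        self.swapping = swapping

    def access(self, addr):
        """Return 'l1', 'victim', or 'miss' (an L2 access)."""
        idx = addr % len(self.l1)
        if self.l1[idx] == addr:
            return 'l1'
        if addr in self.victim:
            if self.swapping:           # promote the line, demote the L1 line
                self.victim.remove(addr)
                if self.l1[idx] is not None:
                    self._fill_victim(self.l1[idx])
                self.l1[idx] = addr
            return 'victim'
        # Miss: incoming line fills L1, replaced line moves to the victim cache.
        if self.l1[idx] is not None:
            self._fill_victim(self.l1[idx])
        self.l1[idx] = addr
        return 'miss'

    def _fill_victim(self, tag):
        if len(self.victim) == self.victim_entries:
            self.victim.pop(0)          # discarded (written back if dirty)
        self.victim.append(tag)
```

With swapping disabled, a line that hits in the victim cache stays there, which models why the non-swapping variant avoids the swap latency on every victim hit.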

It is interesting to note that increased performance and energy reduction usually go hand in hand, since reduced L2 activity helps both.

3.4 Non-temporal buffer

This idea was formulated by Rivers and Davidson in [9] and refers to a data buffer only. They observed that in numerical applications data accesses may be divided into two categories: scalar accesses that present temporal behavior, i.e., are accessed more than once during their lifetime in the cache, and vector accesses that present non-temporal behavior, i.e., once accessed they are no longer referenced. This holds because vectors are generally accessed sequentially and their working set is often bigger than the cache itself, so that when they need to be referenced again, they no longer reside in the cache. The idea is to use a special buffer to contain the non-temporal data and reduce conflicts in the main cache. They presented a history-based algorithm that tags each line with a temporal/non-temporal bit; this information is also saved in the second-level cache, requiring additional writebacks for replaced clean blocks.

Since this algorithm was intended for numerical applications, we added a subset of SPECfp95 to this set of runs. Experimental results show this technique is not effective for integer programs: performance is generally worse by 3% to 20%, while power increases by 7% to 58%. We also observed that only specific numerical applications benefit from this algorithm; for example, swim improves 6.8% in performance and 46% in overall energy consumption, but others like apsi or hydro2d worsen by up to 7% in power and 1% in performance.

Table 7: Speculative buffer: percent improvement in performance and power compared to the base case.

                   Fully Associative                              Direct Mapped
           Data Only   Inst. Only   Data & Inst.   Data Only   Inst. Only   Data & Inst.
Test      %Cyc  %Eng   %Cyc  %Eng   %Cyc   %Eng   %Cyc  %Eng   %Cyc  %Eng   %Cyc   %Eng
compress  1.90  5.37  -0.20 -1.27   2.40   4.79   0.88  1.79  -0.12 -1.83   0.65  -0.01
go        3.41 10.76   6.28  4.49   9.84  15.41   2.20  8.07   5.44  3.27   7.68  11.09
vortex    2.39  7.75   5.78  5.09   8.00  12.84   1.76  6.00   5.55  4.30   7.66  10.47
gcc       1.51  5.27   4.52  3.59   6.15   9.13   1.07  4.12   4.40  2.94   5.46   7.14
li        2.44  4.99   5.94  5.66   8.75  10.84   1.88  3.48   6.32  5.46   8.28   8.94
ijpeg     3.33  8.01   1.40 -0.72   4.58   7.38   2.61  5.50   1.75 -0.71   4.26   5.12
m88ksim   0.29  2.20  22.88 21.65  23.13  24.47   0.25  1.99  22.05 20.45  22.49  22.96
perl      0.66  3.25   6.60  5.65   7.38   9.15   0.79  2.99   6.26  4.79   7.10   7.98

The main reason for these negative results is the increased writeback activity to the second-level cache required by the algorithm, which saves the temporal/non-temporal information in the L2. Given a shorter-latency L2 cache (e.g., an on-chip L2 design), this algorithm might present better overall results.

3.5 Speculative buffer

The idea of a speculative buffer is based on previous work by the authors [1] and on the use of the confidence predictors presented by Manne et al. [8]. In this case we mark every cache access with a confidence level obtained by examining the processor speculation state and the current branch-prediction estimate. We use the main cache to accommodate misses that are most likely on the correct path (high confidence) and the speculative buffer for misses that have a high probability of being from a mis-speculated path (low confidence). This idea originates from a study in which the authors found that line fills coming from mis-speculated-path misses have a lower probability of being accessed again. Putting them in the buffer reduces contention misses in the main cache by reducing cache pollution.
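The fill-routing decision, and the random variant discussed later in this section, can be sketched as follows. The 0-to-1 confidence value and the threshold are hypothetical stand-ins; the paper derives confidence from the speculation state and branch-prediction estimates rather than from a single number.

```python
import random

def route_fill(confidence, threshold=0.5):
    """Confidence-directed fill placement: misses tagged with high
    confidence (likely on the correct path) fill the main cache;
    low-confidence misses (likely from a mis-speculated path) go to
    the speculative buffer, so wrong-path lines do not pollute the
    main cache. Both the scalar confidence and the threshold are
    illustrative assumptions."""
    return 'main cache' if confidence >= threshold else 'speculative buffer'

def route_fill_random(fraction, rng):
    """Random variant: a fixed fraction (e.g. 10% or 20%) of incoming
    fills is sent to the buffer regardless of confidence."""
    return 'speculative buffer' if rng.random() < fraction else 'main cache'
```

The random variant matters as a control: if it performs nearly as well, the benefit comes mostly from the added "associativity" of the buffer rather than from the confidence-based sorting.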

Table 7 presents results using either a fully associative or a direct-mapped speculative buffer. Although the victim cache scheme with swapping gives better results when used with the data cache only, overall the results show that the speculative buffer scheme is better at reducing energy consumption and improving performance.

Using these buffers not only reduces energy consumption in the overall cache memory system, but does so without significantly increasing on-chip energy consumption. For the results listed in Table 7, we found that the on-chip energy consumption in the data portion increases on average by only 0.7%, and in the instruction portion by 2.8% (these numbers are not shown in the table). Moreover, for some programs like go we reduce the on-chip data-cache portion by up to 8%. This is in contrast to the results shown in Table 5, where on-chip power increased by as much as 66%.

We also tried a variation of this algorithm that decides randomly whether to put an incoming fill in the main cache or in the buffer. We tried, among other cases, randomly putting 10% or 20% of the fills in the buffer. It is interesting to note that the random case presents, on average, good results. This demonstrates that simply adding "associativity" to the main cache suffices to eliminate most of the contention misses (and may also indicate that our benchmarks are not ideal candidates for speculative sorting).

3.6 Penalty buffer

This model applies to the instruction-cache buffer only. We observed that, as opposed to data-cache behavior, there is no fixed correlation between variations in instruction-cache miss rate and performance gain. This is because some misses are more critical than others: fetch latency can often be hidden by the Fetch/Dispatch queue as well as by the instruction window (the Register Update Unit, or RUU, in SimpleScalar terminology), making some instruction misses non-critical. However, since misses sometimes exhibit bursty behavior, these hardware structures can become empty, and then all of the latency is detrimental to performance.

Our idea is to divide misses on a penalty basis: we monitor the state (number of valid entries) of the dispatch queue and RUU upon a cache miss and mark the miss as critical (dispatch queue or RUU with few valid entries) or non-critical (dispatch queue or RUU with many valid entries). We place non-critical misses in the penalty buffer and critical ones in the main cache. In this way we preserve the instructions contained in the main cache, which are presumed to be more critical.

We tried two different schemes for deciding when to bypass the main cache and place fills in the penalty buffer instead. In the first scheme, we compared the number of valid dispatch-queue or RUU entries to a fixed threshold. In the second scheme, we used a history-based method that saves in the main cache the state of the hardware resources at the moment of the fill (e.g., how many valid instructions are in the RUU). This number is compared to the current one at the moment of a future replacement: if the current miss is more critical than the past one, it fills the main cache; otherwise it fills the penalty buffer.
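The two decision schemes above can be sketched as follows. The threshold values are illustrative assumptions, not the paper's settings, and the function names are hypothetical:

```python
def classify_miss(dispatch_valid, ruu_valid,
                  dispatch_thresh=4, ruu_thresh=8):
    """First scheme (fixed thresholds): an I-cache miss is critical
    when the dispatch queue or the RUU holds few valid entries,
    because the fetch latency can no longer be hidden. Critical
    misses fill the main cache; non-critical ones go to the
    penalty buffer. Threshold values here are placeholders."""
    if dispatch_valid < dispatch_thresh or ruu_valid < ruu_thresh:
        return 'critical'        # fill the main cache
    return 'non-critical'        # fill the penalty buffer

def history_fill(stored_valid, current_valid):
    """Second scheme (history based): the occupancy recorded when the
    resident line was filled is compared with the occupancy at the new
    miss; the more critical of the two misses keeps the main cache."""
    return 'main cache' if current_valid < stored_valid else 'penalty buffer'
```

Fewer valid entries means less work available to overlap with the fetch latency, which is why a low occupancy marks a miss as critical in both schemes.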

Table 8 presents results using either a fully associative or a direct-mapped penalty buffer. We observed some good results, but when we tried different threshold values or history schemes, we found great variability in the behavior. This variability is due to the fact that in some cases looking separately at the dispatch queue or the RUU does not give a precise model of the state of the pipeline, thus creating interference in the penalty-buffer mechanism. Nevertheless, this scheme produces better results than the speculative buffer scheme. We are currently investigating a combined use of these two schemes that may prove even better.

Table 8: Penalty buffer: percent improvement in performance and power compared to the base case.

               Fully Associative                 Direct Mapped
            Dispatch         RUU             Dispatch         RUU
Test      %Cycles %Energy  %Cycles %Energy  %Cycles %Energy  %Cycles %Energy
compress    0.02   -1.04     0.08   -1.22     0.10   -1.49     0.10   -1.65
go          6.02    4.26     6.37    4.60     4.35    2.65     5.27    3.28
vortex      6.52    5.40     6.29    4.28     6.25    4.22     4.91    2.78
gcc         4.76    3.94     4.69    3.88     4.11    3.10     4.30    3.15
li          6.25    5.70     6.35    5.88     6.11    5.04     6.12    5.17
ijpeg       2.95    1.07     2.21   -0.94     2.51   -0.48     2.62   -0.91
m88ksim    18.40   17.17    24.55   23.14    20.62   18.97    23.83   22.07
perl        7.05    5.95     6.56    5.29     6.32    4.83     5.69    4.00

3.7 Combining use of buffers

So far we have presented two main categories of buffer implementations. The first is based on the victim cache, which offers a "second chance" to data that would otherwise be moved out of L1. In the second category (all other cases), the algorithm decides what to put in the buffer before the data has the opportunity to be written into the L1 cache. These are not mutually exclusive, so we tried to combine them, i.e., the buffer is used at the same time to store specific kinds of data and to accommodate L1 cache victims. Due to space limitations we do not show the results, but it is important to point out that this mechanism is generally beneficial and gives improvements over the non-combined case. However, most of the gain is still due to the original algorithm and not to the victim scheme.

4 Conclusions and future work

In this paper we presented tradeoffs between power and performance in cache architectures, using conventional as well as innovative techniques. We showed that adding small buffers can offer an overall energy reduction without decreasing performance, and in most cases improving it. These small buffers suffice to reduce the overall cache energy consumption without the notable increase in on-chip power that traditional techniques present. Furthermore, the use of a buffer is not mutually exclusive with increasing the cache size/associativity, offering the possibility of a combined effect. We are currently investigating further improvements in combined techniques as well as replacement policies. In the results shown, an LRU policy was adopted; we are now studying policies that favor the replacement of clean blocks, thereby reducing writeback activity.

Acknowledgments

We wish to thank Bobbie Manne for her invaluable help and support throughout this work. We also wish to thank Doug Burger for providing us with the framework for the SimpleScalar memory model.

References

[1] I. Bahar, G. Albera and S. Manne, "Using Confidence to Reduce Energy Consumption in High-Performance Microprocessors," to appear in International Symposium on Low Power Electronics and Design (ISLPED), 1998.

[2] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report TR#1342, University of Wisconsin, June 1997.

[3] D. Burger and T. M. Austin, "SimpleScalar Tutorial," presented at the 30th International Symposium on Microarchitecture, Research Triangle Park, NC, December 1997.

[4] T. L. Johnson and W. W. Hwu, "Run-time Adaptive Cache Hierarchy Management via Reference Analysis," ISCA-24: ACM/IEEE International Symposium on Computer Architecture, pp. 315–326, Denver, CO, June 1997.

[5] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA-17: ACM/IEEE International Symposium on Computer Architecture, pp. 364–373, May 1990.

[6] M. B. Kamble and K. Ghose, "Analytical Energy Dissipation Models for Low Power Caches," ACM/IEEE International Symposium on Low-Power Electronics and Design, August 1997.

[7] J. Kin, M. Gupta and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," MICRO-30: ACM/IEEE International Symposium on Microarchitecture, pp. 184–193, Research Triangle Park, NC, December 1997.

[8] S. Manne, D. Grunwald and A. Klauser, "Pipeline Gating: Speculation Control for Energy Reduction," to appear in ISCA-25: ACM/IEEE International Symposium on Computer Architecture, June 1998.

[9] J. A. Rivers and E. S. Davidson, "Reducing Conflicts in Direct-Mapped Caches with a Temporality-Based Design," International Conference on Parallel Processing, pp. 154–163, August 1996.

[10] C. Su and A. Despain, "Cache Design Tradeoffs for Power and Performance Optimization: A Case Study," ACM/IEEE International Symposium on Low-Power Design, pp. 63–68, Dana Point, CA, 1995.

[11] G. Tyson, M. Farrens, J. Matthews and A. R. Pleszkun, "Managing Data Caches using Selective Cache Line Replacement," International Journal of Parallel Programming, Vol. 25, No. 3, pp. 213–242, June 1997.

[12] S. J. E. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," Digital WRL Research Report 93/5, July 1994.

[13] J. Gee, M. Hill, D. Pnevmatikatos and A. J. Smith, "Cache Performance of the SPEC Benchmark Suite," IEEE Micro, Vol. 13, No. 4, pp. 17–27, August 1993.


A New Scheme for I-Cache Energy Reduction in High-Performance Processors

Nikolaos Bellas, Ibrahim Hajj and Constantine Polychronopoulos
Coordinated Sciences Laboratory and Electrical and Computer Engineering Department
University of Illinois at Urbana-Champaign, Urbana, IL 61801

Abstract

Prompted by demands for portability and low-cost packaging, the microprocessor industry has started viewing power, along with area and performance, as a decisive design factor in today's microprocessors. To that effect, a number of research efforts have been devoted to architectural modifications that target power or energy minimization. Most of these efforts, however, involve a degradation in processor performance, and are thus deemed applicable only to the embedded, low-end market. In this paper, we propose the addition of an extra, small cache between the I-Cache and the CPU that serves to reduce the effective energy dissipated per memory access. In our scheme, the compiler generates code that exploits the new memory hierarchy and reduces the likelihood of a miss in the extra cache. We show that this is an attractive solution for the high-end processor market, since the performance degradation is minimal. We describe the hardware and compiler modifications needed to efficiently implement the new memory hierarchy, and we give performance and energy results for most of the SPEC95 benchmarks. The extra cache, dubbed L-Cache from now on, is placed between the CPU and the I-Cache. The D-Cache subsystem remains as is.

1 Introduction

In the latest generations of microprocessors, an increasing number of architectural features have been exposed to the compiler to enhance performance. The advantage of this cooperation is that the compiler can generate code that exploits the characteristics of the machine and avoids expensive stalls. We believe that such schemes can also be applied to power/energy optimization by exposing the memory hierarchy features to the compiler.

(This work was supported by Intel Corp., Santa Clara, CA.)

The filter cache [1] tackles the problem of the large energy consumption of the L1 caches by adding a small, and thus more energy-efficient, cache between the CPU and the L1 caches. Provided that the working set of the program is relatively small and the data reuse large, this "mini" cache can provide the data or instructions of the program and effectively shut down the L1 caches for long periods during program execution. The penalty to be paid is the increased miss rate and, hence, longer average memory access time. Although this might be acceptable for embedded systems for multimedia or mobile applications, it is out of the question for high-performance processors. The filter cache delivers an impressive energy reduction of 58% for a 256-byte, direct-mapped filter cache, while reducing performance by 21% for a set of multimedia benchmarks. Our approach has a very small performance degradation with respect to the original scheme without the filter cache, and smaller, but still very large, energy gains.

We can alleviate the performance degradation by having the compiler select statically, i.e., during compile time, the parts of the code that are to be placed in the extra cache (the L-Cache), and restructure the code so that it fully exploits the new hierarchy. The CPU will then access the L-Cache only when instructions from the selected part of the program are to be fetched, and will bypass it otherwise. Naturally, we want the most frequently executed code to be selected by the compiler for placement in the L-Cache, since this is where most of the energy gains will come from.

The approach advocated in our scheme relies on the use of profile data from previous runs to select the best instructions to be cached. The unit of allocation is the basic block, i.e., an instruction is placed in the L-Cache only if it belongs to a selected basic block. After selection, the compiler lays out the target program so that the selected blocks are placed contiguously before the non-placed ones. The main effort of the compiler focuses on placing the selected basic blocks in positions such that two blocks that need to be in the L-Cache at the same time do not map to the same L-Cache location.

The organization of the paper is as follows: Section 2 briefly describes how the compiler selects instructions to be placed in the extra cache and what code modifications are necessary for this scheme to be carried out effectively. Section 3 covers the hardware implementation details and the energy estimation phase. Section 4 presents and discusses experimental results on SPEC95 benchmarks, and the paper is concluded with Section 5.

2 Compiler Enhancements

In this work we describe a possible static implementation, which relies on the capability of the compiler to detect "good" candidate basic blocks for caching, i.e., blocks whose instructions can be stored in the L-Cache. The main stage in our compiler enhancement is block placement, in which the compiler selects, and then places, the selected basic blocks so that the number of blocks resident at the same time in the extra cache is maximized. To that effect, it avoids placing two blocks that have been selected to reside at the same time in the L-Cache in the same L-Cache location.

2.1 Block Placement for Maximum L-Cache Utilization

The main step of our algorithm is to position the code in such a way as to maximize the number of basic blocks that are cached in the extra cache. In order to do that, we need to place these blocks contiguously in memory so that they do not overlap. Consider the following code:

      do 100 i=1, n
         B1;                 # basic block
         if (error) then
            error handling;
         B2;                 # basic block
 100  continue

When the code is compiled and mapped into the address space, the basic blocks B1 and B2 will be separated by the code for the if-statement in the final layout. If the L-Cache size is smaller than the sum of the sizes of B1, B2 and the if-statement, but larger than the sum of the sizes of B1 and B2, the blocks B1 and B2 will overlap when stored in the L-Cache. Therefore, we need to place B1 and B2 one after the other and leave the if-statement at the end.
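The overlap condition behind this layout decision can be checked mechanically. A minimal sketch, assuming a direct-mapped L-Cache with the paper's 4-byte line size; the 64-byte cache size and block sizes in the usage below are hypothetical:

```python
def lcache_lines(start, size, cache_size, line_size=4):
    """Set of L-Cache line indices occupied by a basic block placed
    at byte address `start` with `size` bytes (direct-mapped)."""
    lines_in_cache = cache_size // line_size
    first = start // line_size
    last = (start + size - 1) // line_size
    return {line % lines_in_cache for line in range(first, last + 1)}

def overlap(block_a, block_b, cache_size):
    """True if two (start, size) blocks share at least one L-Cache line."""
    return bool(lcache_lines(*block_a, cache_size)
                & lcache_lines(*block_b, cache_size))
```

With a 64-byte cache, a 32-byte B1 at address 0 and a 24-byte B2: placed after a 24-byte if-statement (B2 at 56), B2 wraps around onto B1's lines; placed contiguously after B1 (B2 at 32), the two blocks occupy disjoint lines.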

This is usually the case in loops. Blocks that are executed in every iteration are intermingled with blocks that are seldom executed. We identify such cases and move the infrequently executed code away so that the normal flow of control is a straight-line sequence. This entails the insertion of extra branch and jump instructions to retain the original semantics of the code.

Figure 1: Nesting example (a control-flow graph with basic blocks B1-B8 nested in loops L1-L8)

We delineate our approach in the following paragraphs without giving many details, for lack of space. Our compiler modification takes as input the generated object code and the profile data of the program, and outputs a new, equivalent object code with a different address mapping for some of the basic blocks.

Our strategy uses the Control Flow Graph (CFG) of each function of the program and then computes the nesting of each basic block [2]. For example, in Fig. 1, the nesting of basic block B3 consists of the loops L1 and L2.

The basic ideas behind the compiler modification are the following. First, the most frequently executed basic blocks of a function/procedure should be placed in the global memory space one after the other, so that their locality is enhanced; these blocks are detected during profiling. Second, the structure of the CFG is also important in minimizing the miss rate of the L-Cache. For example, B3 should not overlap with B1 and B2 in the L-Cache; if it did, B3 would miss in every iteration of the loop L2, and the possible gains of its placement in the L-Cache would be nullified. The same goes for B6 with respect to B4 and B5. On the other hand, B1 and B2 can overlap, since B2 is only executed when B1 finishes. Our algorithm uses the CFG and the execution profile of each basic block in the program to create the new object code.

There are six reasons that hinder a basic block from being selected for placement in the L-Cache:

- It belongs to a library and not to a user function. We follow the convention that only user functions are candidates for placement, since they have the tightest loops.

- The algorithm finds that the basic block is too large to fit in the L-Cache. This can be either because the size of the block is larger than the L-Cache size, or because it cannot fit at the same time as other, more important, basic blocks.

- Its execution frequency is smaller than a threshold, and it is thus deemed unimportant.

- It is not nested in a loop. There is no point in placing such a basic block in the L-Cache, since it will be executed only once per invocation of its function.

- Even if its execution frequency is large, its execution density might be small. For example, a basic block located in a function that is invoked many times might have a large execution frequency, but it might only be executed a few times per function invocation. We define the execution density of a basic block as the ratio of the number of times it is executed to the number of times the function to which it belongs is invoked.

- Finally, a very small basic block is not placed in the L-Cache even if it satisfies all the above requirements. The extra branch instructions that might be needed to link it to its successor basic blocks would be an important overhead in this case.
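The six rejection criteria above can be collected into a single selection predicate. This is a sketch under stated assumptions: the `bb` record fields and the way conflicts are represented are hypothetical, and the thresholds correspond to the user-given values discussed in this section.

```python
def select_for_lcache(bb, freq_thresh, size_thresh, density_thresh, lcache_size):
    """Apply the six rejection criteria to a basic-block profile record.
    `bb` is a hypothetical dict; field names are assumptions."""
    if bb['in_library']:
        return False            # only user functions are candidates
    if bb['size'] > lcache_size or bb['conflicts_with_selected']:
        return False            # too large, alone or with other blocks
    if bb['exec_freq'] < freq_thresh:
        return False            # executed too rarely to matter
    if bb['loop_nesting'] == 0:
        return False            # not in a loop: one execution per call
    density = bb['exec_count'] / bb['func_invocations']
    if density < density_thresh:
        return False            # low execution density
    if bb['size'] < size_thresh:
        return False            # too small: linking branches dominate
    return True
```

Raising the frequency, size, or density thresholds shrinks the selected set, which is exactly the energy/delay knob described in the following paragraph.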

The basic blocks are laid out in the memory address space so that all the selected basic blocks are placed contiguously before the non-selected ones. This arrangement greatly simplifies the hardware of the L-Cache, as we will see in the next section. Branches are placed at the ends of blocks, if needed, to sustain the functionality of the code.

The user can trade off energy savings against delay increase by adjusting the thresholds discussed in this section. For example, a smaller size threshold will probably lead to larger energy savings, and larger delay as well. At one extreme, the user can completely disable the L-Cache by forcing the compiler (through the user-given thresholds) to generate the original code; at the other extreme, the user can emulate a filter-cache scheme by selecting every basic block for placement in the L-Cache. The quality of the generated code can be determined by the user at compile time. Therefore, individual applications can choose from a range of caching policies.

3 Hardware Modifications and Power Estimation

In addition to the compiler enhancement, our scheme needs extra hardware to implement the L-Cache. The extra hardware is shown in Fig. 2.

Figure 2: L-Cache organization (the PC(31:0) is presented to the L-Cache tag array; a 32-bit comparator generates LCache_Hit; a 32-bit multiplexer, controlled by LCache_Hit and the blocked_part signal, drives either LCache_Bus(31:0) or ICache_Bus(31:0) onto the data path)

The organization of the L-Cache itself is depicted in Fig. 2. It needs the data and tag parts, an extra comparator for the tag comparison, and a 32-bit multiplexer which drives the data from the L-Cache or the I-Cache to the data-path pipeline.

The functionality of the L-Cache is as follows. The PC is presented to the L-Cache tag at the beginning of the clock cycle. The L-Cache is only activated if the "blocked_part" signal is on; this signal is generated by the IF unit, and its meaning is explained in the following paragraphs. In that case, the comparator checks for a match and, if it finds one, instructs the multiplexer to drive the contents of the L-Cache into the data path. The I-Cache is disabled for this clock cycle, since the signal "blocked_part" is on.

In case of an L-Cache miss ("LCache_Hit" is off), the I-Cache is activated in the next clock cycle and provides the data. The I-Cache is accessed if the L-Cache misses, whereas the L-Cache is accessed only when "blocked_part" is on. If "blocked_part" is off, the I-Cache controller activates the I-Cache without waiting for the "LCache_Hit" signal. The two caches are always accessed sequentially, never in parallel.

We extend the ISA with an instruction called "alloc", which has a J-type MIPS format. It is inserted by our tool and contains the address of the first non-placed block of the function. There is one such instruction for every function of the program, and the address carried by "alloc" is stored when a function is entered. During the execution of the code in the function, if the PC has a value less than that address, the "blocked_part" signal is set; otherwise it is set to off. In this way, the machine can figure out which portion of the code is executing with only one extra comparison.
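The single-comparison mechanism above can be sketched directly; function names here are illustrative, not from the paper:

```python
def blocked_part(pc, alloc_addr):
    """The selected (L-Cache-resident) blocks of a function are laid
    out contiguously before the non-placed ones, and the function's
    `alloc` instruction records the address of the first non-placed
    block; a single comparison therefore tells the fetch unit whether
    the current PC lies in the L-Cache-mapped region."""
    return pc < alloc_addr

def fetch_source(pc, alloc_addr, lcache_hit):
    """Sequential access policy described in the text: probe the
    L-Cache only inside the blocked part; on an L-Cache miss the
    I-Cache supplies the word in the next cycle."""
    if blocked_part(pc, alloc_addr):
        return 'L-Cache' if lcache_hit else 'I-Cache (next cycle)'
    return 'I-Cache'
```

This is why the contiguous layout matters to the hardware: one stored address and one comparator replace any per-line bookkeeping about which blocks were selected.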

We have developed our cache energy models based on [3]. This is a transistor-level model which uses the run-time characteristics of the cache to estimate the energy dissipation of its main components [4]. A 0.8um technology with a 3.3 Volt power supply is assumed. The cache energy is a


Experiment            Frequency Thres.    Size Thres.    Exec. Density Thres.
                      FP       INT        FP     INT     FP     INT
Aggressive (a)        0.01%    0.01%      5      5       5      5
Less Aggressive (b)   0.5%     0.5%       10     5       10     5
Moderate (c)          1%       1%         20     5       20     5

Table 1: User-given thresholds in the L-Cache experiments

function of the cache complexity (cache size, block size, and associativity), its internal organization (banking), and the run-time statistics (number of accesses, hits, misses, average number of bits read, input switching probabilities). The model is used for both the I-Cache and the L-Cache/filter cache.

4 Experimental Evaluation

We evaluated the effectiveness of our software and hardware enhancements by examining the energy savings on a set of SPEC95 benchmarks. The base case is a 16KB, direct-mapped I-Cache with a block size of 32 bytes. We compared our scheme against the filter cache as well as the base case. We considered two sizes for the filter cache and the L-Cache: 256 and 512 bytes. The line size was always 4 bytes for the L-Cache and varied from 8 to 32 bytes for the filter cache. Both caches are direct-mapped.

The L-Cache does not need a larger line size since its hit rate is almost always near 100%. A larger line size does not significantly alter the hit rate of the L-Cache, but negatively affects its energy profile. There is a one-cycle penalty in case of a miss in the L-Cache or the filter cache, and a 4-cycle penalty in case of a miss in the I-Cache, since the data must then be taken from the L2 cache. The L2 cache is off-chip and its energy dissipation is not modelled. The I-Cache is banked to optimize its cycle time [3].

We have developed an in-house cache simulator to gather the statistics of the new memory hierarchy. We use the SpeedShop [5] set of tools from SGI to gather the profile data and generate the dynamic instruction traces that were fed into our simulator. The MIPS compiler is used for compilation and code optimization. All the necessary run-time data were gathered for the computation of the energy dissipation of the cache subsystem under the different scenarios.

We experimented with three different scenarios for the user-given compiler thresholds (Table 1). The more aggressive scenario delivers larger energy gains at the expense of larger performance overhead. A frequency threshold of 0.01%, for example, will force the tool to mark for placement only basic blocks that account for at least 0.01% of the total execution time of the program. A size threshold of 10 will force the tool to mark only basic blocks that have at least 10 instructions, and so on.

Tables 2 and 3 show the normalized energy dissipation of all the simulated cache subsystems with respect to the original I-Cache scheme. For the SPECfp95 benchmarks, the 512-byte L-Cache is almost as successful as the filter cache in providing instructions to the data path and, thus, in saving energy by disabling the larger I-Cache. The most energy-efficient filter cache is the one with an 8B or 16B line size. A filter cache with a 32B line size has a larger hit rate, but larger energy dissipation per access as well. An L-Cache of 256 bytes is usually too small for any substantial energy savings.

Tables 4 and 5 show the normalized performance overhead for the various schemes with respect to the original organization. Since we are only interested in the impact of the new memory organization on the performance overhead, we assume that there are no stalls in the machine other than the ones created when an instruction is accessed.

The most important advantage of the L-Cache is its small performance overhead, which is vital for high-performance machines. Performance overhead is due to the (very small) miss rate in the L-Cache and the extra jump and branch instructions inserted into the code by the compiler. The performance of the 512-byte L-Cache (under an aggressive scenario) is comparable to the performance of a 512-byte filter cache with a block size of 32 bytes. Such a filter cache, however, has smaller energy gains than the L-Cache and, in addition, entails a large 32-byte-wide bus to connect it to the I-Cache.

Most integer benchmarks do not have a large number of basic blocks that can be cached in the L-Cache. They are also insensitive to cache size variation, which is to be expected since the basic blocks of integer programs are generally small. Most of the basic blocks of the SPECint95 benchmarks are not nested, or they have a small execution density. Hence, they cannot be included in the L-Cache. From a performance point of view, however, a filter cache scores poorly in a processor that runs integer code. An L-Cache is preferable here if performance degradation cannot be tolerated.


5 Conclusions

We believe that, since performance is the most important objective of today's high-end microprocessors, no energy reduction technique will be acceptable unless it only marginally affects the execution time, or its overhead can be hidden by other compiler/architectural techniques. If this is the case, even a moderate energy reduction will be welcome.

This paper presents a new, modified version of the filter cache in which the compiler and the extra hardware cooperate to decrease energy consumption. The compiler selects only the most important parts of the code to be placed in the extra cache, and directs the hardware to probe the extra cache only when this code is to be fetched into the data path. The method is adaptive, since the user can aggressively pursue energy reductions at the expense of performance, or vice versa, by providing different compilation options.

References

[1] J. Kin, M. Gupta, and W. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," in IEEE International Symposium on Microarchitecture, pp. 184-193, Dec. 1997.

[2] A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

[3] S. Wilson and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches," DEC WRL Technical Report 93/5, July 1994.

[4] N. Bellas, PhD thesis in progress, 1998.

[5] SpeedShop User's Guide. Silicon Graphics Inc., 1996.

                 256-byte extra cache
Benchmark        L-Cache                  Filter Cache
                 (a)     (b)     (c)     8B      16B     32B
tomcatv          0.686   0.705   0.739   0.551   0.497   0.610
swim             0.242   0.244   0.262   0.243   0.329   0.516
su2cor           0.701   0.734   0.827   0.537   0.485   0.610
hydro2d          0.418   0.438   0.493   0.397   0.413   0.566
mgrid            0.943   0.980   0.986   0.600   0.530   0.638
applu            0.736   0.755   0.840   0.541   0.490   0.610
turb3d           0.618   0.622   0.839   0.426   0.430   0.578
apsi             0.758   0.867   0.957   0.554   0.498   0.613
wave5            0.768   0.822   0.822   0.576   0.517   0.557
FP aver.         0.652   0.685   0.752   0.492   0.465   0.589
go               0.948   0.955   0.955   0.612   0.545   0.660
m88ksim          0.818   0.822   0.856   0.617   0.566   0.663
compress95       0.890   0.897   0.897   0.569   0.523   0.630
li               0.975   0.978   0.976   0.651   0.586   0.708
INT aver.        0.908   0.913   0.921   0.612   0.555   0.665

Table 2: Normalized energy relative to the base machine for 256-byte extra cache

                 512-byte extra cache
Benchmark        L-Cache                  Filter Cache
                 (a)     (b)     (c)     8B      16B     32B
tomcatv          0.692   0.692   0.745   0.306   0.369   0.539
swim             0.259   0.261   0.279   0.236   0.325   0.518
su2cor           0.295   0.308   0.367   0.272   0.343   0.527
hydro2d          0.263   0.259   0.299   0.241   0.326   0.517
mgrid            0.225   0.245   0.279   0.249   0.330   0.519
applu            0.666   0.689   0.787   0.333   0.446   0.585
turb3d           0.431   0.432   0.696   0.350   0.387   0.554
apsi             0.594   0.716   0.856   0.442   0.436   0.579
wave5            0.669   0.723   0.725   0.448   0.447   0.493
FP aver.         0.455   0.481   0.559   0.320   0.379   0.537
go               0.945   0.955   0.955   0.562   0.511   0.632
m88ksim          0.816   0.822   0.856   0.595   0.547   0.658
compress95       0.893   0.900   0.900   0.400   0.425   0.583
li               0.975   0.978   0.976   0.510   0.491   0.642
INT aver.        0.907   0.914   0.922   0.517   0.494   0.629

Table 3: Normalized energy relative to the base machine for 512-byte extra cache

                 256-byte extra cache
Benchmark        L-Cache                  Filter Cache
                 (a)     (b)     (c)     8B      16B     32B
tomcatv          1.002   1.001   1.000   1.322   1.168   1.087
swim             1.000   1.000   1.000   1.027   1.016   1.010
su2cor           1.001   1.001   1.000   1.308   1.157   1.087
hydro2d          1.038   1.032   1.023   1.174   1.091   1.051
mgrid            1.009   1.004   1.003   1.367   1.198   1.110
applu            1.080   1.052   1.032   1.314   1.165   1.089
turb3d           1.014   1.014   1.003   1.203   1.108   1.062
apsi             1.022   0.998   0.993   1.318   1.168   1.091
wave5            1.040   1.038   1.038   1.346   1.187   1.205
FP aver.         1.023   1.016   1.010   1.264   1.140   1.088
go               1.016   1.011   1.011   1.349   1.199   1.128
m88ksim          1.023   1.024   1.023   1.375   1.227   1.131
compress95       1.023   1.021   1.021   1.340   1.193   1.105
li               0.997   0.998   0.998   1.407   1.244   1.167
INT aver.        1.015   1.014   1.013   1.368   1.216   1.133

Table 4: Normalized delay relative to the base machine for 256-byte extra cache

                 512-byte extra cache
Benchmark        L-Cache                  Filter Cache
                 (a)     (b)     (c)     8B      16B     32B
tomcatv          1.000   1.000   1.000   1.083   1.049   1.025
swim             1.002   1.000   1.000   1.017   1.010   1.007
su2cor           1.010   1.003   1.001   1.050   1.026   1.015
hydro2d          1.052   1.030   1.026   1.021   1.011   1.007
mgrid            1.009   1.004   1.004   1.028   1.014   1.008
applu            1.240   1.137   1.100   1.236   1.122   1.064
turb3d           1.037   1.033   1.018   1.126   1.067   1.038
apsi             1.040   1.012   0.987   1.211   1.111   1.059
wave5            1.027   1.038   1.038   1.220   1.122   1.149
FP aver.         1.046   1.029   1.019   1.110   1.059   1.041
go               1.019   1.011   1.011   1.301   1.170   1.103
m88ksim          1.023   1.024   1.023   1.350   1.207   1.122
compress95       1.023   1.021   1.021   1.174   1.101   1.062
li               0.997   0.998   0.998   1.272   1.159   1.110
INT aver.        1.015   1.014   1.013   1.274   1.159   1.099

Table 5: Normalized delay relative to the base machine for 512-byte extra cache


Low-Power Design of Page-Based Intelligent Memory

Mark Oskin, Frederic T. Chong, Aamir Farooqui, Timothy Sherwood, and Justin Hensley

University of California at Davis

Abstract

Advances in DRAM technology have led many researchers to integrate computational logic on DRAM chips to improve performance and reduce power dissipated across chip boundaries. The density, packaging, and storage characteristics of these intelligent memory chips, however, present new challenges in power management.

We introduce Active Pages, a promising architecture for intelligent memory based upon pages of data and simple functions associated with that data [OCS98]. We evaluate the power consumption of three design alternatives for supporting Active-Page functions in DRAM: reconfigurable logic, a simple processing element, and a hybrid combination of reconfigurable logic and a processing element. Additionally, we discuss operating system techniques to manage power consumption by limiting the number of Active Pages computing simultaneously on a chip.

1 Introduction

As processor performance grows, both memory bandwidth and power consumption become serious issues. At the same time, commodity DRAM densities are increasing dramatically. Many researchers have proposed integrating computation with DRAM to increase memory bandwidth and decrease power consumption. The operating and packaging characteristics of DRAM chips, however, present new constraints on the power consumption of these systems.

We introduce Active Pages, a promising architecture for intelligent memory with substantial performance benefits for data-intensive applications [OCS98]. An Active Page consists of a page of data and a set of associated functions that operate on that data.

To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active-Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. The processor then, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory. Simulation results show up to a 1000X speedup on such applications running on systems that use Active-Page memory over conventional memory.
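A toy Python model of this processor/memory partitioning may make the division of labor concrete. The sparse-row layout (parallel value and column-index lists) and the function names are invented for illustration; the real system uses memory-mapped writes and in-page hardware, not Python calls.

```python
# Illustrative model of Active-Page partitioning for a sparse dot product:
# the "page" side gathers operand pairs, the "processor" side multiplies.

def page_gather(values, cols, x):
    """Active-Page-side function: gather (matrix value, vector element)
    pairs into a user-defined output area within the page."""
    return [(v, x[c]) for v, c in zip(values, cols)]

def processor_multiply(pairs):
    """Processor-side work: read the gathered operands, multiply,
    and accumulate the result."""
    return sum(a * b for a, b in pairs)

# Sparse row [0, 3, 0, 5] (nonzeros 3 and 5 in columns 1 and 3)
# times the dense vector x = [1, 2, 3, 4]:
pairs = page_gather([3, 5], [1, 3], [1, 2, 3, 4])
result = processor_multiply(pairs)   # 3*2 + 5*4 = 26
```

The point of the split is that the memory-intensive gather never crosses the chip boundary; only the compacted operand pairs do.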

This paper focuses upon ongoing work which evaluates the power consumption of Active-Page implementations. First, we

Acknowledgments: Thanks to Venkatesh Akella for his helpful comments. This work is supported in part by an NSF CAREER award to Fred Chong, by Altera, and by grants from the UC Davis Academic Senate. More info at http://arch.cs.ucdavis.edu/RAD

[Figure 1 (diagram): a reconfigurable bit-array (4096 kbit) DRAM sub-array with row-select and column-select decoders and attached logic.]

Figure 1: The RADram System

describe three alternative designs based upon reconfigurable logic, small processors, and a combination of the two. Then we present some power models and estimates for these designs. We also discuss software techniques for reducing power consumption. Finally, we conclude with some future directions for this work.

2 Design Alternatives

In this section we briefly present three architectures for Active Page memory systems. These architectures consist of RADram, or reconfigurable logic in DRAM [OCS98]; an architecture consisting of only a small RISC- or MISC-like [Jon98] processing element near each page of data; and a hybrid architecture which provides a small processing element with reconfigurable logic processing capability.

Currently, within all of our designs we adopt a processor-mediated approach to inter-page communication, which assumes infrequent communication. When an Active-Page function reaches a memory reference that cannot be satisfied by its local page, it blocks and raises a processor interrupt. The processor satisfies the request by reading and writing to the appropriate pages. The processor-mediated approach reduces power by not requiring extensive inter-page communication networks and global clocking.

2.1 Reconfigurable Logic

Our initial implementation of Active Pages focused upon the integration of DRAM and reconfigurable logic. For gigabit DRAMs, a reasonable sub-array size to minimize power and latency is 512 Kbytes [I+97]. The RADram (Reconfigurable Architecture DRAM) system, shown in Figure 1, associates 256 LEs (Logic Elements, a standard block of logic in FPGAs which is based upon a 4-element Look-Up Table, or 4-LUT) with each of these sub-arrays. This allows efficient support for Active-Page sizes which are multiples of 512 Kbytes.

Each LE requires approximately 1k transistors of area on a logic chip. The Semiconductor Industry Association (SIA) roadmap [Sem97] projects mass production of 1-gigabit DRAM


chips by the year 2001. If we devote half of the area of such a chip to logic, we expect the DRAM process to support approximately 32M transistors, which is enough to provide 256 LEs to each 512-Kbyte sub-array of the remaining 0.5 gigabits of memory on the chip. DeHon [DeH96] gives several estimates of FPGA area, and existing prototype merged DRAM/reconfigurable logic designs demonstrate this to be a realistic design partitioning [M+97].

2.2 Processing Elements

An alternative architecture for Active Page implementations is a small RISC- or MISC-type [Jon98] processing element used to execute user-level process functions in memory. Several low-power processing cores currently exist, and Active Page designs can leverage the existing and future work in this area. However, a processing engine designed for Active Pages must satisfy two constraints: power and size. While high-performance, low-power designs exist, such as the StrongARM [San96], their size makes them prohibitive for use in Active Page memory. For this study we assume a slightly modified Toshiba TLCS-R3900 series [NTM+95] processor. The current processor contains a 4k direct-mapped instruction cache. Such a cache is not needed for the limited number of instructions an Active Page memory function contains. The calculations in this paper assume a 1k instruction cache. The remainder of the TLCS-R3900 processing engine is relatively standard, with a 1k data cache, 32-bit data path, 32 general-purpose registers, and a MIPS-like instruction set.

2.3 Hybrids

An alternative to both RADram and a small processing element is a hybrid, which mixes a RISC-like processor with reconfigurable logic. Such an architecture is presented in [HW97]. Here we extend this notion by simplifying the processing element core and shrinking the reconfigurable logic array. The hybrid architecture consists of the simplified processing element described above in Section 2.2 and a smaller reconfigurable logic array than that presented for RADram in Section 2.1. The configurable logic array would be on the order of 128 LEs in size.

3 Power Consumption

In order to analyze the power consumption of the three alternative Active Page implementations described in Section 2, an analysis of each was performed based on analytical models and previously published results. In this section we present analytical models for the power consumption of DRAM sub-arrays and reconfigurable logic arrays. Furthermore, we present estimates of the power consumed by an Active Page memory which utilizes off-the-shelf MIPS-like processing cores as page-based functional units. Finally, we estimate the power of a combined processor / reconfigurable logic array design. Throughout these analyses the target has been a 1-gigabit DRAM array. Each design trades a certain percentage of DRAM storage for functional logic. Each design has its own size, and here we also estimate the relative percentage of logic to DRAM and thus the overall size of the resulting memory chip. While these designs are based on a 1-gigabit DRAM, it should be kept in mind that the actual number of pages will most likely be a power of two, and thus either an increased or decreased chip size will ultimately be necessary for non-power-of-two size designs.

3.1 DRAM Model for Power Analysis

[Figure 2 (diagram): an M x N bit DRAM block with row and column decoders and periphery circuitry, annotated with the Idd(active) and Idd(retention) current paths.]

Figure 2: DRAM Model used in the analysis.

In this section we present a DRAM model for power estimation. Using this model we estimate the power consumption of a sub-array of the size used in gigabit DRAMs. A DRAM block of m columns and n rows can be represented as shown in Figure 2 [I+95]. This figure shows the current flowing in the DRAM block. There are two major sources of power consumption in DRAMs:

1. Active current, which flows during the read operation.

2. Data retention current, which flows during re-charging of the DRAM cells.

The active current is expressed by:

Idd = (m * Cd * Vw + Cpt * Vint) * f + Idcp    (1)

where Cd is the data-line capacitance, Vw is the voltage swing at a particular data line, Cpt is the total capacitance of the CMOS logic and driving circuitry in the periphery, Vint is the internal supply voltage, f is the operating frequency, and Idcp is the DC static current.

The data retention current is given by:

Idd = (m * Cd * Vw + Cpt * Vint) * (n / tref) + Idcp    (2)

where tref is the refresh time of cells in retention mode.

Cd      0.2pF
Vw      0.5V
m       4096
n       1024
f       20MHz
tref    20ms
Vdd     1.65V

Table 1: DRAM parameters for 512Kbyte block sub-arrays.

Based on Equations 1 and 2, and the technological parameters summarized in Table 1, it is possible to estimate the power consumption of a single 512-kilobyte page of memory. Using


this base per-page power value we will extrapolate the power consumption of various Active Page implementations.

With mixed logic and memory designs, the effects of the Cpt term are negligible compared to the power consumption of the attached logic, and thus the term is assumed to be zero. Furthermore, at 20MHz operation the DC static current Idcp is negligible. Thus, the active current required to access a word in memory is given by:

Iactive = (4096 * 0.2pF * 0.5V) * (20 * 10^6) = 8.1mA    (3)

The current due to data retention is given by:

Iretention = (4096 * 0.2pF * 0.5V) * (1024 / 20ms) = 20uA    (4)

Thus, for a 1.65V power supply, the total power dissipated bya 512Kbyte DRAM sub-array is:

Ptotal = Pactive + Pretention    (5)

Ptotal = (Iactive + Iretention) * Vdd    (6)

Ptotal = (8.1mA + 20uA) * 1.65V = 13.4mW    (7)
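The arithmetic of equations (3) through (7) can be replayed directly from the Table 1 parameters. The sketch below is just that replay in plain floating-point Python, with Cpt and Idcp taken as negligible, as in the text (small rounding differences against the quoted 8.1mA and 13.4mW are expected):

```python
# Re-deriving equations (3)-(7) from the Table 1 DRAM parameters.
m, Cd, Vw = 4096, 0.2e-12, 0.5    # columns, data-line capacitance (F), swing (V)
n, f, tref = 1024, 20e6, 20e-3    # rows, frequency (Hz), refresh time (s)
Vdd = 1.65                        # supply voltage (V)

I_active = m * Cd * Vw * f                    # eq. (3): ~8.1 mA
I_retention = m * Cd * Vw * (n / tref)        # eq. (4): ~20 uA
P_total = (I_active + I_retention) * Vdd      # eq. (7): ~13.4 mW per sub-array
P_chip = P_total * 256                        # storage-only gigabit chip: ~3.4 W
```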

The estimated power per 512-Kbyte sub-array is slightly below existing DRAM power consumption [Tos98]. This is due to the reduced supply and swing voltages used in the calculation, and is expected in future DRAM memory designs. Using this calculation we can deduce that a storage-only gigabit-size DRAM chip will consume 13.4mW * 256 = 3.4W of power. The 1-gigabit DRAM power consumption, while greater than the power consumed by existing DRAM designs, is well below the maximum power ratings predicted by the SIA Technology Roadmap [Sem97] for high-performance package technology.

3.2 Power for Reconfigurable Logic

In order to estimate the power consumption of the RADram architecture, we use the power estimation guide provided by Altera [Alt96] for the FLEX architecture of devices. We model power for all devices in an operational state. It is assumed that appropriate software control by the operating system can be used to shut down portions of the Active Page memory device during reconfiguration. The Altera power estimation guide predicts power for operational devices as:

Power = Pint + Pio    (8)

Pint = (ICCStandby + ICCActive) * Vdd    (9)

ICCStandby = ICC0    (10)

ICCActive = (K * fmax * N * toggleavg) uA    (11)

where Pint is the power consumed internally by the FPGA, Pio is the power consumed driving external loads, ICCStandby is the current drain while idle, ICCActive is the current drain while active, K is a technology parameter which is dependent on the FPGA design, fmax is the operating frequency in megahertz, N is the number of logic elements, and toggleavg is the average percentage of cells toggling at each clock.

For estimation purposes, we assume Pio to be negligible, and the standby current is equal to a constant ICC0 = 0.5mA. Furthermore, current technology has K values between 32 and 95. We assume a conservative view of future technology, with a K of 30. Using the RADram configuration presented in Section 2.1, we let N equal 256 and the frequency of operation equal 100MHz. Furthermore, we assume an average toggling percentage of 12.5 percent. Thus the active current is given by:

ICCActive = 30 * 100MHz * 256 LEs * 0.125 = 96mA    (12)

Therefore, the power consumption of the reconfigurable logic for a single page in the RADram architecture is:

P = (0.5mA + 96mA) * 1.65V = 159mW    (13)
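Equations (11) through (13) can likewise be replayed numerically, using the K, N, and toggle-rate assumptions stated above (the model's ICCActive formula yields microamps, so the sketch converts to amps):

```python
# Re-deriving equations (12)-(13): Altera-style FPGA power estimate.
K, fmax_mhz, N, toggle = 30, 100, 256, 0.125   # assumptions from the text
ICC0 = 0.5e-3                                  # standby current ICCStandby (A)
Vdd = 1.65                                     # supply voltage (V)

Icc_active = K * fmax_mhz * N * toggle * 1e-6  # eq. (12): 96000 uA -> 0.096 A
P_page_logic = (ICC0 + Icc_active) * Vdd       # eq. (13): ~159 mW per page
```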

The RADram architecture trades DRAM memory storage for reconfigurable logic. This dedicates fifty percent of the chip space to DRAM and fifty percent to logic. Thus, a gigabit-size DRAM actually stores only 0.5 gigabits of data (128 512-Kbyte pages). Using the power estimation presented in Section 3.1 we can estimate the power consumed by a 0.5-gigabit RADram memory chip to be:

Pradram = 128 * (159mW + 13.4mW) = 17.1W    (14)

3.3 Power for Processing Elements

The current TLCS-R3900 series processors operate at 74MHz and 3.3V and have a typical operating current of 110mA. We project the power consumption of the modified TLCS-R3900 series processor running at 100MHz and 1.65V. Currently, roughly 20% of the power consumption is used by the instruction cache. By reducing the size of this cache by a factor of four, it is expected that the power and area of the processing engine will be reduced. We expect the smaller instruction cache to consume one-third the power of the existing cache. Thus, the expected operating current will be 95mA. However, the faster clock will increase the current used to roughly 126mA. Finally, operating at 1.65V, we expect the processor core to consume 208mW of power.

Despite this power consumption, the TLCS-R3900 series processing engine is quite large. Nearly 440k transistors are required to implement the base TLCS-R3900 core. It thus becomes impossible to continue with the 50% DRAM / 50% logic ratio when a processing element is used. In order to maintain the 512-Kbyte sub-array size, a roughly 63% logic / 37% DRAM ratio is required. This means that only 92 512-Kbyte pages will be available on our hypothetical Active Page memory chip. Therefore, total power consumption will be:

Ppe = 92 * (208mW + 13.4mW) = 20.3W    (15)

3.4 Power for Hybrid PE/LE Design

Although the true cost of power consumption of a mixed design is not available without an actual design, we can estimate the power by mixing a TLCS-R3900 series processing element with a limited number of reconfigurable logic LEs. Here we have chosen a hybrid design with a limited number (128) of LEs alongside a fully functional processing element. Additional power will be required to interconnect these two processing engines, but here we assume this interconnection power to be negligible. Thus, each hybrid processing element consumes roughly one-half the power of the reconfigurable-logic-only design added to the power consumed by the processing element. Combined, these consume 288mW of power at 1.65V.


[Figure 3 (log-scale plot): relative bus traffic reduction (y-axis, 1 to 10000) versus problem size in 512K pages (x-axis, 1 to 100) for the applications array-delete, array-find, array-insert, database, dyn-prog, matrix-boeing, matrix-simplex, median-kernel, median-total, and mmx.]

Figure 3: Experimental results of RADram relative reduction in processor / memory bus traffic

The DRAM-to-logic ratio of this design is roughly 70% to 30%, thus yielding an Active Page device with only 78 pages of memory. Therefore, total power consumption will be:

Phybrid = 78 * (288mW + 13.4mW) = 23.5W    (16)
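Equations (15) and (16) follow the same pages x (logic power + sub-array power) pattern, which the following sketch reproduces for the processing-element and hybrid designs (the text rounds the processing-element total down to 20.3W):

```python
# Chip totals from equations (15)-(16): pages * (logic + DRAM sub-array power).
P_dram_page = 13.4e-3            # W per 512-Kbyte sub-array, from Section 3.1

P_pe_logic, pages_pe = 208e-3, 92    # processing-element design (63% logic)
P_hy_logic, pages_hy = 288e-3, 78    # hybrid PE + 128-LE design (30% DRAM... 70% DRAM)

P_total_pe = pages_pe * (P_pe_logic + P_dram_page)   # eq. (15): ~20.3 W
P_total_hy = pages_hy * (P_hy_logic + P_dram_page)   # eq. (16): ~23.5 W
```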

3.5 Clock Distribution

In high-frequency, large-area designs the power consumed by the clock distribution and generation network cannot be neglected. The Active Page memory model permits an efficient solution to this problem. Active Page designs that do not include hardware support for inter-page communication do not require a globally synchronized clock. An external clock can be used, and a large amount of skew can be tolerated. Hence, the internal clock generation network need only consist of regeneration buffers, without an extensive and difficult-to-design network of clock distribution nodes.

Furthermore, reconfigurable logic designs are usually operated at the highest possible frequency permitted by the design. This suggests that the clock of each Active Page will be distinct, or else operate at some nominal baseline. Furthermore, the software techniques discussed in Section 4.1 require variable clock speeds for all Active Page designs. For this reason, we propose that all clocks be generated locally for each page, using internal oscillator techniques. A global clock may be necessary to provide a baseline reference with which to self-time the locally generated clocks.

Given these design constraints, hardware inter-page communication facilities can still be constructed, if desired, using asynchronous logic techniques. Relevant techniques can be found in asynchronous processor cores [FGT+97], caches, and reconfigurable logic array designs [BEHB] [MA98].

3.6 Reduced Memory I/O

As is pointed out by Stan and Burleson [SB97], typically 10-15% of the power used by desktop systems, and 50% of the power in low-power systems, comes directly from external I/O. This is due largely to the fact that the busses used for off-chip communication use higher voltages and have two to three orders of magnitude more capacitance. Substantial amounts of power are consumed every time the voltage on a bus line must be changed.

The number of transitions on the bus can be reduced in two ways: by encoding the bus to minimize the number of transitions per transaction [SB97] [SB95] [WCF+94], or by reducing the total number of transactions on the bus. The use of Active Pages reduces I/O power by addressing the latter.

In a traditional system, memory-intensive sections of code produce poor cache hit rates, which in turn results in large amounts of memory traffic. However, in an Active Page system, memory-intensive sections of an application are executed inside the memory chip, thereby reducing the total number of external bus transactions.

Figure 3 shows the relative reduction in bus traffic for several applications executing on the RADram Active Page memory system. Each application is described in [OCS98], but here we note that a significant reduction in the total number of memory bus accesses can be achieved by the use of Active Pages. The actual reduction in bus traffic varies from application to application; however, the general trend of larger problem sizes yielding more efficient bus use can be seen. The tailing off of power efficiency for very large problem sizes is due to the fact that at such sizes the system becomes saturated, and the processor remains active throughout the computation.

3.7 Power Summary

Our estimates show that our initial RADram design using reconfigurable logic has the lowest power consumption of our three alternatives. A further advantage for RADram is increased chip yield due to fault-tolerant designs. The complexity and number of Active-Page functions, however, is limited by the amount of reconfigurable logic available for each page. The processing element and hybrid designs have more computational power. Furthermore, we expect to refine both of the latter designs to use less power. Refinements such as clock-gating and asynchronous processor designs [FGT+97] will help reduce the power consumed by the processor and hybrid processor / reconfigurable logic element designs.

4 Software Power Management

Active Pages present new design challenges under low-power constraints. One potential solution is to use software mechanisms which trade off performance for decreased power consumption. Most of these mechanisms can be implemented directly in the operating system and can be made to operate across all Active-Page applications; however, specific compiler techniques can be employed which reduce overall power consumption on a per-application basis. For example, Active Page memory devices could enable an operating system to decrease the clock frequency of inactive Active Pages with minimal effect on performance. Furthermore, for extremely low-power environments, an operating system can schedule fewer computations per clock cycle to stay under power budgets. Finally, for reduced power, application partitioning techniques can take power consumption into account and thus balance power consumption against performance.

4.1 Variable Clock Frequencies

There are two situations where a slower clock frequency can be advantageous for low-power Active Page memory designs. The first is when an Active Page is allocated, but not computing. In this case the clock frequency of the Active Page processing unit can be decreased with minimal impact on application performance. The second is when an Active Page is computing but the application is limited by the processor.


Active Pages which are not computing generally remain in an idle loop checking a synchronization variable. A decrease in the clock frequency of the Active Page functional unit can be viewed as an increase in the startup latency. Generally, the computation time is on the order of several thousand clock cycles. Thus, increasing the startup latency from one clock cycle to ten or even a hundred would not seriously affect performance. There would be considerable power savings from such a ten- or hundred-fold decrease in clock frequency.
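The trade-off above can be sketched as a first-order model. This is not from the paper; the divisor and cycle counts are hypothetical, and it only illustrates why a divided idle clock costs so little latency.

```python
# Sketch (not the paper's model): an idle Active Page polls a synchronization
# variable. If its clock is divided by N while idle, startup latency grows
# from ~1 fast cycle to ~N fast cycles, which is negligible against a
# computation lasting thousands of cycles.

def startup_overhead(divisor, compute_cycles):
    """Fractional runtime increase from waking up on a divided clock."""
    return divisor / (divisor + compute_cycles)

# Idle dynamic power scales roughly linearly with clock frequency, so a
# divide-by-100 idle clock cuts idle dynamic power ~100x while adding well
# under 5% latency to a 5000-cycle computation.
overhead = startup_overhead(divisor=100, compute_cycles=5000)
assert overhead < 0.05
```

The same function shows why a divide-by-ten clock is an even easier win: the overhead shrinks proportionally while most of the idle power saving remains.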

Active-Page applications often require a balance between processor and intelligent memory. If Active Pages are producing results too quickly for the processor to consume, then their clock frequency may be reduced without affecting application performance. Application characteristics vary widely, however, and can make developing general operating system policies difficult.

4.2 Replacement and Activation Control

At a coarser grain, the operating system can further decrease current draw by limiting the number of Active Pages computing. The operating system must already manage page replacement and schedule Active-Page computation under the constraints of the amount of physical memory available. If the number of Active Pages that fit in a single chip exceeds the number whose power can be dissipated, then the number that are allowed to compute simultaneously must be limited. Once again, if the application is processor-limited, then performance may not be affected. Furthermore, deactivation of the associated logic does not restrict ordinary memory I/O operations on the memory super-page.
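A minimal sketch of such an activation-control policy follows. The policy, the per-page wattage, and the budget are hypothetical, not values from the paper.

```python
# Sketch (hypothetical policy, not from the paper): cap the number of
# simultaneously computing Active Pages so total draw stays within a
# per-chip power budget. Pages beyond the cap stay allocated (ordinary
# memory I/O still works) but their logic is left deactivated.

def schedule_active_pages(ready_pages, watts_per_page, budget_watts):
    """Return the subset of ready pages allowed to compute this interval."""
    limit = int(budget_watts // watts_per_page)
    return ready_pages[:limit]

running = schedule_active_pages(ready_pages=list(range(12)),
                                watts_per_page=0.25, budget_watts=2.0)
assert len(running) == 8  # 2.0 W budget / 0.25 W per page
```

A real scheduler would rotate which pages run so deferred pages make progress, but the budget check is the essential mechanism.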

4.3 Compiler Techniques

Future work will address partitioning entire applications. Technology exists for such system-level partitioning in the form of exact solutions obtained by Integer Linear Programming (ILP), and approximate solutions obtained by genetic and simulated-annealing algorithms. Such algorithms balance communication and synchronization costs against potential parallel execution in an attempt to find an efficient overall system partitioning. The same algorithms can also be employed to consider power consumption. Power thus becomes another variable in the minimization goal, and can be balanced by the programmer at compilation time against performance.
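The idea of power as one more term in the minimization goal can be sketched with a weighted cost function. This is a hypothetical illustration, not the paper's ILP formulation; the candidate numbers are invented.

```python
# Sketch (hypothetical cost function, not the paper's formulation): power
# becomes one more weighted term alongside execution time and communication
# cost, so the partitioner can trade performance for power at compile time.

def partition_cost(exec_time, comm_cost, power, power_weight):
    return exec_time + comm_cost + power_weight * power

# Candidate partitions as (exec_time, comm_cost, avg_power) tuples.
candidates = [(100.0, 10.0, 4.0),   # faster but power-hungry
              (120.0, 5.0, 1.5)]    # slower but frugal
best_perf = min(candidates, key=lambda c: partition_cost(*c, power_weight=0.0))
best_lowp = min(candidates, key=lambda c: partition_cost(*c, power_weight=20.0))
assert best_perf == (100.0, 10.0, 4.0)
assert best_lowp == (120.0, 5.0, 1.5)
```

Sweeping `power_weight` is how a programmer would dial in the desired balance at compilation time.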

5 Future Work

Our previous work [OCS98] has shown that Active Pages are a promising architecture for intelligent memory with substantial performance benefits. Our current work focuses upon the power consumption of Active Page implementations. Estimates show that several design alternatives exist, but Active Page DRAMs will benefit from advances in low-cost packaging technologies. Future work will pursue both architectural and software approaches to reducing power consumption. This work will include detailed chip-level synthesized VHDL models. Furthermore, we will explore the software techniques discussed in this paper using cycle-by-cycle simulation of the processor and memory subsystem.

An increasing amount of research is being done in the area of asynchronous logic, because clock skew and distribution are serious issues that a logic designer must face.

Clock speed has an effect on both the area and power consumed by a design. As the clock speed increases, more complex structures are needed to distribute the clock accurately throughout the chip. The increased clock speed has two disadvantages. First, the larger clock drivers consume more power. Second, overall power dissipation increases linearly with clock frequency.
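The linear dependence on frequency follows from the standard CMOS dynamic-power relation P = a·C·V²·f. The values below are illustrative only, not measurements from the paper.

```python
# Standard dynamic-power relation: P = a * C * V^2 * f, where a is the
# switching activity factor and C the switched capacitance. Doubling the
# clock frequency doubles dynamic power, all else equal.

def dynamic_power(activity, cap_farads, vdd, freq_hz):
    return activity * cap_farads * vdd ** 2 * freq_hz

p_100 = dynamic_power(0.2, 1e-9, 3.3, 100e6)   # hypothetical 100 MHz design
p_200 = dynamic_power(0.2, 1e-9, 3.3, 200e6)   # same design at 200 MHz
assert abs(p_200 / p_100 - 2.0) < 1e-9         # linear in f
```

This is why removing the clock entirely, as asynchronous designs do, attacks both the clock-driver term and the frequency term at once.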

Recent developments have made it possible to use asynchronous logic in more complex logic designs. The AMULET2e processor [FGT+97] is an asynchronous ARM [Fur96] 32-bit processor that consumes 30% of the power and takes 54% of the area of the ARM810 [Fur96]. An asynchronous RISC core provides a low-power means of implementing an Active Page processing element.

Another possibility is the use of an asynchronous FPGA. Commercial FPGAs are currently targeted to synchronous logic and do not lend themselves to the implementation of asynchronous logic [HBBE94][HBBE92]. Various new FPGA designs have been developed solely for the purpose of implementing self-timed logic [HBBE94][MA98]. Work done at the University of Washington has shown that it is possible to implement real circuits in an asynchronous FPGA [BEHB]. Asynchronous FPGAs may prove to be a viable architecture for Active Page memory system processing elements.

6 Conclusion

This paper evaluates three different Active Page processing-element architectures, and explores the power requirements of each. Due to the many projections required to estimate power consumption in future technologies, we suspect up to a 30–50% error in our power models. However, these rough calculations suggest that Active Pages will be in line with the power budgets of the commodity packaging technologies of the future [Sem97]. Furthermore, software mechanisms show strong potential for reducing peak and overall power consumption.

References

[Alt96] Altera Corporation. The Altera Data Book, 1996.

[BEHB] G. Borriello, C. Ebeling, S. Hauck, and S. Burns. The Triptych FPGA architecture. Univ. of Washington, Seattle, WA 98195.

[DeH96] Andre DeHon. Reconfigurable Architectures for General-Purpose Computing. PhD thesis, MIT, 1996.

[FGT+97] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and N.C. Paver. AMULET2e: An asynchronous embedded controller. In Async '97, 1997.

[Fur96] S.B. Furber. ARM System Architecture. Addison Wesley Longman, 1996.

[HBBE92] S. Hauck, G. Borriello, S. Burns, and C. Ebeling. Montage: An FPGA for synchronous and asynchronous circuits. In 2nd International Workshop on Field-Programmable Gate Arrays. FPL, August 1992.

[HBBE94] S. Hauck, S. Burns, G. Borriello, and C. Ebeling. An FPGA for implementing asynchronous circuits. IEEE Design and Test of Computers, 11(3), Fall 1994.

[HW97] J. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In Symp. on FPGAs for Custom Computing Machines, Napa Valley, CA, April 1997.

[I+95] K. Itoh et al. Trends in low-power RAM circuit technologies.Proceedings of the IEEE, 83(4), April 1995.

[I+97] K. Itoh et al. Limitations and challenges of multigigabit DRAM chip design. IEEE Journal of Solid-State Circuits, 32(5):624–634, 1997.


[Jon98] D. Jones. The ultimate RISC. ACM Computer Architecture News, 16(3):48–55, 1988.

[M+97] Masato Motomura et al. An embedded DRAM-FPGA chip with instantaneous logic reconfiguration. In 1997 Symposium on VLSI Circuits, 1997.

[MA98] K. Maheswaran and V. Akella. PGA-STC: Programmable gate array for implementing self-timed circuits. International Journal of Electronics, 84(3), 1998.

[NTM+95] Masato Nagamatsu, H. Tago, T. Miyamori, et al. TA 6.6: A 150 MIPS/W CMOS RISC processor for PDA applications. In Proc. of the International Solid-State Circuits Conference. IEEE, 1995.

[OCS98] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA '98), Barcelona, Spain, 1998. To appear.

[San96] S. Santhanam. StrongARM SA110: A 160 MHz 32b 0.5 W CMOS ARM processor. HotChips, VIII:119–130, 1996.

[SB95] M. R. Stan and W. P. Burleson. Bus-invert coding for low-power I/O. Transactions on VLSI Systems, 3(1), March 1995.

[SB97] M. R. Stan and W. P. Burleson. Low-power encodings for global communication in CMOS VLSI. Transactions on VLSI Systems, 5(4), December 1997.

[Sem97] Semiconductor Industry Association. The national technology roadmap for semiconductors. http://www.sematech.org/public/roadmap/, 1997.

[Tos98] Toshiba. TOSHIBA TC59S6416 Datasheet, January 1998.

[WCF+94] S. Wuytack, F. Catthoor, F. Franssen, L. Nachtergaele, and H. D. Man. Global communication and memory optimizing transformations for low power systems. In Proc. Int. Workshop on Low Power Design, Napa Valley, California, April 1994.


[Paper title unrecoverable in the source encoding]

Mir Azam, Robert Evans, and Paul Franzon
Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695
e-mail: msazam@eos.ncsu.edu, paul@eos.ncsu.edu

This paper is focused on different levels of circuit design for efficient power management and speedup. The material is presented in two major categories: circuit modifications to save energy in the datapath and the memory, and architectural modifications in the datapath for power and performance optimizations. More specifically, we present a low-power adder design technique, optimal SRAM design issues, and an execution bypass technique.

1 Introduction

Low power and high performance are pursued at all levels of circuit design. The circuit and architectural levels are emphasized here. The low-activity adder reduces energy consumption by 20%–50% for different frequencies and for process corners. The measured consumption of the fabricated circuit shows power savings between 13% and 30%. The computation bypass technique shows multiplier power savings of up to 60% and load unit savings of up to 40%. The SRAM models delineate different levels of modeling versus their accuracies. The simulation- and analytical-based mixed model of the SRAMs proves to be the most accurate. The global decoder and the precharge circuitry in the SRAMs consume the biggest share of energy (each with 30% of total SRAM consumption), and the address/data latches also consume a significant portion (20%) of the total energy budget.

2 Low-Activity Adder

Unaligned signals in the middle of a combinational logic block cause unwanted transitions or glitches. These signals are aligned with self-timed latches called Fresh Start Latches (FSLs). Since glitching activity increases exponentially with logic depth, a significant portion of the glitching is eliminated by having aligned signals after a certain number of logic stages.
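The effect of re-aligning signals every few stages can be sketched numerically. The growth factor and depths below are invented for illustration; only the qualitative behavior (exponential growth reset at each alignment point) reflects the text.

```python
# Sketch (illustrative numbers, not measured data): if glitching activity
# grows roughly exponentially with logic depth, inserting an alignment
# point (an FSL) every k stages restarts the growth from a clean set of
# aligned signals and bounds the worst-case glitching.

def glitch_activity(depth, growth=1.3, align_every=None):
    total = 0.0
    run = 0  # stages since the last aligned boundary
    for _ in range(depth):
        run += 1
        total += growth ** run          # activity grows with unaligned depth
        if align_every and run == align_every:
            run = 0                     # FSL re-aligns the signals
    return total

unaligned = glitch_activity(12)
aligned = glitch_activity(12, align_every=4)
assert aligned < unaligned              # alignment caps the exponential growth
```

The deeper the block, the larger the saving, which is why the FSL sits between the two halves of the combinational logic rather than at its ends.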

2.1 [Timing Management]

Figure 1 graphically demonstrates the timing management of the FSLs. The combinational logic is split into two parts, C/L-1 and C/L-2. The outputs of C/L-1 are unaligned to a certain degree because they already went through several logic stages. But they are not as badly misaligned as the later stages in C/L-2. FSLs align these outputs of C/L-1 and thus provide a fresh set of aligned signals as inputs to C/L-2; hence the signal alignment latches are named Fresh Start Latches.

[Figure 1: Timing management of the Fresh Start Latches. The schematic shows conventional latches around combinational logic split into C/L-1 and C/L-2 with an FSL between them; the waveforms mark clk, the skewed clk for the FSL, the skew, the evaluation time, the unaligned signals (output of C/L-1), and the aligned signals (input of C/L-2).]

FSLs have to operate with a clock which is skewed enough with respect to the system clock going to the conventional latches sandwiching the combinational logic. Therefore, the constraint for the skew is such that the skew is large enough to allow C/L-1 to settle down, but not so large that C/L-2 does not have enough time to evaluate. Other than these two constraints, there are no other major requirements for the amount of skew. The FSLs in different C/L blocks need not be aligned. Signal alignment in different logic blocks can be done with their own self-timed Fresh Start Latches. Figure 2 shows that the signals into logic blocks are aligned at different time instances. But both skews are large enough to have their respective first sub-blocks of logic evaluate, and both skews are small enough to have their respective second sub-blocks of logic finish evaluating before the next edge of the system clock, so that the correct outputs can be conventionally latched. Thus the differences in skewed clocks for the different sets of FSLs do not have any effect on the correct functionality of the circuit. There is no absolute global skew required to operate the self-timed FSLs. Thus the local skews can be derived from the local portion of the clock distribution network through custom delay elements which match the required delay at that portion of the circuit.

2.2 [Implementation]

The concept of Fresh Start Latches was implemented on a 32-b ripple carry adder in 0.6µ technology. A ripple carry adder was chosen because it is structured and easy to implement. An RC adder has a pattern of a chain of dependent logic through its propagated carry. This kind of


[Figure 2: Different skews for the Fresh Start Latches. The schematic shows two logic blocks with their own self-timed latches FSL-1 and FSL-2, clocked from clk through delay1 and delay2 (the skewed clks for FSL-1 and FSL-2).]

construction is easy to partition with fresh start latches. Furthermore, an RC adder has the least activity power compared to any other adder architecture because it has the least amount of gates in its design.

Figure 3: Microphotograph of the low-power chip.

Figure 3 shows the photomicrograph of a 32-b ripple carry adder with FSLs incorporated, a conventional RC adder, some random number generators and signature analyzers. The outputs of the random number generators feed to the inputs of both the adders. The outputs of the adders are fed to the inputs of the signature analyzers. The signature analyzers simply take the 32-b sums and serialize their outputs through shift registers. This way, instead of analyzing a pair of 32-b sums, we can now look at two signatures and verify that the adder outputs are the same.

3 Computation Bypass

Conventional datapaths show sensitivity to only the control and data dependencies. After the control dependencies are predicted and data dependencies are solved, instructions are executed in the functional units. However, these executions do not consider the values processed by instructions in standard pipelined DSP and/or superscalar architectures. We proved microarchitectural features detecting the repetitiveness of the computations (not the instructions) can significantly reduce processor power requirements [1] or increase performance [2].

3.1 [Execution Caches]

The run-time hardware monitors we proposed are called Execution Caches (ECs) [1, 2]. An execution cache simply holds on to old computations: the two source operand values and the computed result. The ECs are distributed with the functional units, and therefore, there is no need for an

[Figure 4: Bit fields belonging to each entry in an execution cache: valid, rs1-value, rs2-value, result, rd-tag (optional fields); the index+tag is taken from both sources.]

opcode except for the precision bits and subtle operational differences specified by only a portion of the opcode. The ECs are also separated within a functional unit, if the opcode produces different results for the same operand values even though they may be processed by the same functional unit. In other words, the accesses to the execution caches are dependent only on the operand values.

Figure 4 shows a line of the execution cache. Previously computed results are stored in the execution caches. When a new computation is attempted, the appropriate execution cache is checked for a redundant computation: a computation whose result is already available in the EC. There are three cases for computation lookup. They are discussed below.

Single-Cycle Computation Lookup Simultaneous with Execution

The execution caches are single-cycle lookup modules.

Therefore, it does not seem beneficial to bypass single-cycle instructions. One can argue that execution of the instruction can be performed instead of a lookup. But there may not be enough execution units available to process the instruction. Execution units are expensive, and an EC provides a cheap way to provide the result if it is available. The only catch is that the EC does not guarantee a result like a functional unit can. The trade-off, therefore, is the hit rate versus the extra cycle penalty and the power cost paid for having an EC in the functional unit.
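A minimal software sketch of an execution cache follows. The structure, line count, and eviction policy are hypothetical stand-ins for the RTL; only the keying on operand values plus the result-distinguishing opcode bits comes from the text.

```python
# Sketch (hypothetical structure, not the paper's hardware): an execution
# cache maps source-operand values, plus the few opcode bits that change
# the result, to a previously computed result. A hit supplies the result
# in one lookup cycle; a miss falls through to the functional unit.

class ExecutionCache:
    def __init__(self, lines=64):
        self.lines = lines
        self.store = {}                              # (op_bits, rs1, rs2) -> result

    def lookup(self, op_bits, rs1, rs2):
        return self.store.get((op_bits, rs1, rs2))   # None on miss

    def replace(self, op_bits, rs1, rs2, result):
        if len(self.store) >= self.lines:            # crude eviction stand-in
            self.store.pop(next(iter(self.store)))
        self.store[(op_bits, rs1, rs2)] = result

ec = ExecutionCache()
ec.replace(op_bits=0b01, rs1=6, rs2=7, result=42)    # filled at completion
assert ec.lookup(0b01, 6, 7) == 42                   # redundant computation: bypass
assert ec.lookup(0b01, 6, 8) is None                 # new operands: must execute
```

Note the key deliberately omits the full opcode: two instructions that compute the same function on the same operands share an entry, which is the source of the redundancy being exploited.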

[Multiple-Cycle Computation Lookup Simultaneous with Execution]

Multi-cycle instructions can be looked up for repeated computations either in the case of a resource conflict, or they can always be looked up in parallel with the first cycle of their execution. If the computation is found in the EC, the execution can be suspended. This not only provides performance enhancement by early completion of instructions, collapsing followed flow dependencies faster and resolving branches earlier, but this execution method also reduces switching activity for bypassed computations.

Multiple-Cycle Computation Lookup with One Lookup Cycle

Multiple-cycle instructions can also be looked up by adding a lookup cycle before execution of long-latency instructions. In this way, switching activity is completely eliminated for redundant computations. In the case of a resource conflict, this does not even pose the extra cycle penalty for unavailable computations. But for long-latency instructions which have no resource conflicts, the extra cycle is needed when an instruction is executed instead of bypassed. This additional cycle penalty is also incurred for branch mispredictions in the stream and for flow dependency collapsing.
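The break-even point of the dedicated lookup cycle can be sketched as an expected-latency model. The hit rates and latencies are hypothetical; only the cost structure (hit = lookup cycle, miss = lookup cycle plus full latency) comes from the text.

```python
# Sketch (hypothetical numbers): with a dedicated lookup cycle, a hit
# costs only the lookup, while a miss costs the lookup plus the full
# functional-unit latency, so the scheme pays off only when the hit rate
# is high enough for the unit's latency.

def avg_cycles(hit_rate, fu_latency, lookup_cycles=1):
    return hit_rate * lookup_cycles + (1 - hit_rate) * (lookup_cycles + fu_latency)

base = 10.0                              # long-latency unit without an EC
assert avg_cycles(0.5, base) < base      # 50% hits: 6.0 cycles, a win
assert avg_cycles(0.05, base) > base     # 5% hits: the extra cycle dominates
```

The same comparison for a single-cycle unit (base = 1) can never win, which restates why bypassing single-cycle instructions was rejected earlier.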


3.2 [Integration of the Execution Caches]

[Figure 5: Generic integration of execution caches in a superscalar datapath. The diagram shows Fetch, Decode, and Dispatch stages with a Branch Predictor and I-Cache, a Reorder Buffer (complete/retire), Reservation Stations, FUs with per-unit execution caches EC0–EC3 (EC-access, access and write ports), the Databus, the D-Cache, further stages, and an execution bypass path.]

The generic integration of the ECs for a superscalar datapath and the added cycle integration are shown in Figures 5 and 6. In Figure 5, the dispatched instructions are executed out of order, but EC lookups can be made for selected instructions. If enough performance advantage can be gained, then the power cost to perform the lookup can be less than the power cost of having a longer program execution time. Also, for each individual functional unit (especially for long-latency units), the energy per computation can be analyzed locally. Figure 6 shows the EC integration with an added lookup cycle preceding the execution.

[Figure 6: Simple bypass through operand-tagged caching with added lookup cycle. The pipeline sketch (ID, Cache, EX, MEM/WB) shows operands A and B read from the RF, the AopB computation, a tag/result comparison, and the Execution Cache supplying the result on a hit.]

The computation replacement in the EC can be initiated either at instruction completion or at the reservation stations. The simplest way to replace and later bypass instructions is to replace the entire EC line at the completion of an instruction. At a later time, when an instruction with the same computation could be bypassed, it is simply looked up in the EC, the result grabbed, and the execution of the instruction is avoided. This bypass technique is shown in Figure 7. However, computation source operand values can be replaced at the reservation stations. Then the entry needs to be invalidated, and validated again when the result is available after execution. Due to register renaming and multiple tags associated with each register, the register tag of the first computation to be executed needs to

[Figure 7: (a) Replace at Completion and (b) Replace at Resv. Stat. Both diagrams connect the EC, the EC Reservation Stations, the FU Array, and the Reorder Buffer; the access paths carry rs1, rs2, rd-tag, the result, the current rd-tag, and the EC-rd-tag.]

be stored in the EC as well. When this computation bypasses an instruction still marked invalid, meaning that the correct result is pending, both the tag in the EC and the new instruction's destination register tag need to be passed along so both results can be associated with their respective register IDs. This replacement and bypass strategy is not recommended for its complications. In addition, we will see later that replacement of the operands at reservation stations actually has much less effect than replacement at completion.

4 [Power Models for the SRAM Arrays]

In this section, we will switch gears to discuss several power models for the SRAM array architecture. The key is to know how accurate each model is, and to make a decision about SRAM architectures as early in the design cycle as possible.

[Relational Sizing Model]

In the relational sizing model, only the interconnect lengths are considered. The SRAM is organized in rows and columns. The word line and the bit line lengths are considered to get a relative estimate of power/delay trade-offs. This model does not produce an absolute number for the energy consumption, but provides insights into how the SRAM arrays should be constructed given the capacity of the memory. This is a simple model whose row-enable energy can be shown as

[equation unrecoverable in the source encoding: the row-enable energy as a function of the quantities defined below]

where R = # of row address lines, C = # of column address lines, X = horizontal size of a single memory cell, and Y = vertical size of a single memory cell.

[Analytical-Based Model]

The analytical-based model provides information on the relative contributions of each model element. The advantage of this model is that this capacitance-based model allows for energy estimates of the whole SRAM architecture, or even down to the fine-grain detailed estimation of each model component, or any combination of these models. The capacitances of these elements are listed in Table 1.
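A capacitance-based estimate of this kind can be sketched in a few lines. The component names and capacitance values below are illustrative placeholders, not Table 1's entries.

```python
# Sketch (illustrative capacitances, not Table 1's values): the analytical
# model sums switched capacitance per access, so energy can be reported
# for the whole SRAM or per component from the same table.

def access_energy(caps_pf, vdd, activity):
    """E = sum(a_i * C_i) * Vdd^2; with C in pF this returns picojoules."""
    return sum(activity[k] * c for k, c in caps_pf.items()) * vdd ** 2

caps = {"wordline": 1.2, "bitlines": 3.5, "decoder": 0.8}   # hypothetical, pF
act = {"wordline": 1.0, "bitlines": 0.5, "decoder": 1.0}    # switching per access
total = access_energy(caps, vdd=2.5, activity=act)
per_part = {k: act[k] * caps[k] * 2.5 ** 2 for k in caps}
assert abs(total - sum(per_part.values())) < 1e-9           # whole = sum of parts
```

The per-component breakdown is exactly what lets this model attribute the large decoder and precharge shares mentioned in the introduction.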

[Simulation-Based Characterization Model]

The simulation-based characterization model uses detailed transistor descriptions to estimate the energy consumption from simulating the extracted circuit. The power


[Table 1: Element descriptions in the analytical model. The table lists each model element with a description and its capacitance; the numeric entries are unrecoverable in the source encoding.]

[Figure 8: Power meter for simulation estimator. The diagram shows the test circuit with the monitored supply node Vx and the scaled source Vxα.]

meter technique is a popular one which operates a voltage-controlled current source to dump charge on a capacitor. The VCCS is fed off a resistive drop caused by the supply current to the simulated test chip. Figure 8 shows a diagram illustrating this technique. The SRAM circuit specs designed and simulated are shown in Table 2.
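The measurement idea behind the power meter reduces to integrating supply current into charge. The current trace below is hypothetical; only the relation P_avg = Vdd·Q/T reflects the technique described.

```python
# Sketch of the power-meter idea (hypothetical trace): the VCCS dumps a
# charge proportional to the supply current onto a capacitor, so average
# power falls out of the accumulated charge: P_avg = Vdd * Q / T.

def avg_power(current_trace, dt, vdd):
    charge = sum(i * dt for i in current_trace)      # Q = integral of i dt
    return vdd * charge / (len(current_trace) * dt)  # P = Vdd * Q / T

trace = [0.010, 0.014, 0.006, 0.010]                 # amps per timestep, hypothetical
assert abs(avg_power(trace, dt=1e-9, vdd=2.5) - 0.025) < 1e-12
```

In the circuit version, the capacitor's final voltage plays the role of the accumulated sum, so one DC readout at the end of the simulation yields the average power.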

[Table 2: Subsections and their specs. The table lists the SRAM circuit subsections (row and column decoders, word lines, global decoder, precharge, memory subarray, global mux drivers, long interconnect lines, sense amps, write drivers, chip drivers) with their sizes and counts; the numeric entries are unrecoverable in the source encoding.]

MIXED CHARACTERIZATION MODEL

The mixed characterization model is a combination of the analytical and the simulation-based models. Mixed characterization models should be applied to the subsections of the circuit displaying fully digital behavior with full voltage swings (e.g. latches, decoders, buffers, off-chip drivers, etc.). The characterization equations are determined in the individual analytical and simulation-based models; therefore, there are no independent characterization equations for this technique.

MEASUREMENT-BASED MODEL

The measurement-based technique is the most accurate model for CMOS circuits. The ICs are fabricated, the power rails are isolated for the different subcomponents, and then the power drawn by each of the components is simply measured at different frequencies. This is the most accurate model. However, it is also the most difficult model to build, since there is a significant amount of design effort involved, process corners to be considered, etc.

TRANSISTOR COUNT MODEL

Table 3: Transistor counts for the SRAM array and datapath components.

The transistor count involved in each SRAM array access and data drive-out was derived from the circuit of an SRAM array access. Likewise, the transistor count involved for a multiplier was derived from the circuit of the Wallace-tree multiplier construction. The major components considered to determine the energy rating of an SRAM array were the tag comparators and the bit-line and word-line drivers. The driver sizes were determined from the different SRAM array sizes, since the drive strength changes with varying cache sizes. For the multiplier, for example, an m-bit multiplier is composed of partial-product generators, 4:2 compressors and a 16b adder. With these approaches, the transistor counts for the components involved are as listed in Table 3 for direct-mapped caches and Wallace-tree multipliers.

RESULTS

Figure 9 shows the percentage of average power (not RMS power) saved with the low-power adder. The savings were between 13% and 30% for the fabricated circuit. Table 4 lists the RMS power consumption for all process-parameter combinations, and the RMS savings ranged between 20% and 50%. Figure 10 illustrates the measured versus simulated average consumptions.

Figure 9: Percentage of power saved with the low-power adder compared to the conventional adder, simulated and measured.

Figure 11 shows up to 60% energy savings per operation for small and simple execution caches.

Page 68: Power Driven Microarchitecture WorkshopEncoding Technique and its Effect on Multimedia Applications.....3 Enric Musoll, Tomas Lang, and Jordi Cortadella ... Thomas M. Conte, and Matt

Table 4: Power consumption of the low-power adder compared to the conventional adder, over process variations.

Table 5: Absolute energy predictions by different energy models for an SRAM array (row decoders, column decoders, global decoders, precharge, sense amps, and drivers; analytical, simulation, and measurement).

Figure 10: Total power consumption of the low-power and the conventional adders, simulated and measured.

Figure 12 plots the load bypass savings. The load savings were up to 40% per load. The loads bypassed the data cache and the traffic from the main memory. The performance advantages are shown in Figure 13. SPECint95 improvement in IPC was up to 15%, SPECfp95 enhancement was as much as 30%, and multimedia algorithms showed as high as 45% improvement in IPC. The SRAM energy predictions are illustrated in Tables 5, 6 and 7. Table 7 shows that most of the energy consumption is in the global decoders, precharge circuitry and the address and data latches. The row-enable and bit-line precharge energy consumptions are plotted in Figures 14 and 15, which show the measurement model to be the most pessimistic and the analytical model to be the most optimistic.

Figure 11: Average energy saved per computation with the EC, for go, compress, segm, and mpeg2.
CONCLUSION

Circuit and microarchitecture design can play an important role in low-energy microprocessor design. Some examples are given here. Low-power components in the datapath can be designed with retimed signals and also by trading off result lookup against computation activity. In addition, energy consumption for memory circuits can be estimated early in the design cycle with reliable models. Signal-aligned adder design can improve power consumption by as much as 50%. The fabricated circuit showed up to 30% improvement.


Figure 12: Power savings per load with DM-EC versus EC size, for go, compress, li, gcc, perl, ijpeg, tomcatv, swim, mgrid, applu, and apsi.

Figure 13: Percent improvement in IPC for memory-driven datapaths, with 2n to 256n entries, for go, compress, li, gcc, perl, ijpeg, tomcatv, swim, mgrid, applu, su2cor, apsi, mpeg2, and segm.

Computation bypass can cut down power consumption by half with small execution caches. The performance of processors can be elevated with such mechanisms as well. The memory consumption estimations are verified with fabricated circuits, which show that the simulation-based estimation is within 10% of the measured consumption. Memory energy is mostly dissipated in the decoders (31%), precharge circuits (30%) and data latches (20%). Such a multilevel design approach is helpful to optimize power consumption to the fullest.


Figure 14: Row-enable energy comparison, predicted and measured by different models (analytical, simulation, measurement; energy in nJ versus cell loads per word line).

Figure 15: Bit-line precharge energy comparison, predicted and measured by different models (analytical, simulation, measurement; energy in nJ versus cell loads per column).

Table 8: Comparison among the energy models.


Table 9: Distribution of energy in the active SRAM (address/data latches, row decoders, word lines, column decoders, global decoders, precharge, sense amps, and output drivers).


Cache-In-Memory: A Lower Power Alternative?

Jason T. Zawodny, Eric W. Johnson†, Jay B. Brockman, Peter M. Kogge

University of Notre Dame, †Valparaiso University

{jzawodny, ejohnson, jbb, [email protected]}

Abstract

The concept of Processing-In-Memory (PIM) achieves many benefits by placing memory and processing logic on the same chip, utilizing more of the available bandwidth and reducing both the power and time associated with memory accesses. Typical architectures, even current single-chip microprocessors with on-chip memory, rely on traditional cache hierarchies. This paper serves to explore the possible performance and power improvement of "migrating" the cache into the memory itself. This idea, which we call Cache-In-Memory (CIM), utilizes the sense amplifiers inside the memory array to perform this function. Early results indicate that the most trivial implementations of this technique provide comparable power efficiency with reasonable performance. However, the increase in resident DRAM has yet to be explored. Ongoing work strives to further solidify the tools needed for energy and performance simulations and to explore the potential capacity benefits.

1 Introduction

As part of an attempt to explore alternative memory configurations that provide equal performance while using less energy than traditional cache hierarchies, CIM moves the "cache" into the memory structure itself. The key to this arrangement is the utilization of latching sense amplifiers in the memory structures to act as "free" cache lines. With the appropriate supplementary memory arrays, equivalent to the tag arrays found in typical caches, the architecture itself can remember what was last accessed.

Others have explored the advantages of putting a processor core on a chip with large amounts of DRAM [1][3][4]. That work considered the memory as a "black box," accepting current DRAM structures without modification. Newer studies have investigated the effects of retaining entire lines of DRAM within the memory array using the sense amplifiers; those projects looked at performance [7]. This study approaches the idea from a power efficiency standpoint.

In this paper, we discuss the ongoing simulation of the CIM architecture and the results derived from preliminary simulations. By closely coupling memory energy and time simulations with microprocessor pipeline emulators, it is possible to characterize both the time and energy performance of various configurations. The data presented in this study comes from a comparison of 16KB of row buffer in the CIM architecture against split 8KB data and instruction caches in the traditional cache arrangement. The latest results in this continuing work indicate that CIM is capable of providing comparable memory energy consumption at reasonable performance levels. Further work will help to validate our current results as well as explore the advantages that CIM provides in DRAM capacity given constant die size.

2 Memory Structure

Traditional memory hierarchies employ caches to hide the latency associated with an access to off-chip memory devices. These caches are stand-alone entities, often clearly distinguishable on photo-micrographs of processors. While caches have been mandatory in the past, decreases in feature size and increases in die size have spawned the capability to put conventional microprocessor cores and large amounts of DRAM on a single chip [8]. The first generation of these devices has simply integrated the traditional hierarchy on the single device [11][12]. Studies have shown how much performance and energy gain can be obtained from creating systems on a single chip while leaving the memory arrays virtually unchanged [3][4]. And there have been some performance estimates of more interesting ways to utilize the proximity [7].

Memory devices are constructed of smaller arrays. The reasons involve the ability to drive and sense signal changes across limited distances due to capacitance in the wiring, as well as providing convenient redundancy. If the whole system is on a single die, could there be an arrangement of memory, processor, and cache functions which will enable equal performance with significant reduction in power compared to traditional, separate caches?

Figure 1 shows two obvious options for systems on single chips. The most basic approach would be to throw away the caches: simply access the existing memory arrays. While this would provide the highest DRAM capacity on the chip, the enormous time and energy penalties for fully decoding and sensing each memory reference would be inefficient. The traditional memory hierarchy implements unified or split first-level SRAM caches,


Figure 1: Single Chip Memory Options

which take up vast amounts of die area that could be used for more DRAM. This yields good overall performance, but at the expense of DRAM capacity. Figure 1 also begins to explain how CIM differs from the previous two ideas. In the CIM depiction, we see that the traditional cache has vanished, and the area not covered by the processor is allocated for DRAM occupancy. However, this DRAM has some special characteristics that allow it to take the role of cache structures.

A more detailed look at the DRAM can be found in Figure 2. The primary features of this array are the latching sense amplifiers distributed throughout the array. Memory is normally broken down into subarrays because of capacitive effects, etc. The subarray is the first of three primary entities that are dealt with in the CIM structure. The subarray is the block of memory where the rows reside, as in regular memory devices. An entire row is then sensed and latched into the sense amplifiers. After the row has been latched, it is multiplexed down to usable words. The result of this multiplexing, some subset of the columns that were read into the row, is called the subrow.

Each of these subarrays has its own row of sense amplifiers, and its own subarray control for decode and timing. This control interfaces with a tag array that keeps track of the row that was last accessed for each of the subarrays. Based on whether the requested address matches the current subarray and subrow, and whether the row is already cached in the appropriate subarray, certain events are required to satisfy the read or write. Different decoding is required based upon how much of the address matched, and sensing directly from the memory cells need not repeat each access. Section 3 will discuss what causes these different events.

Figure 2: CIM Memory Architecture

3 CIM Hit/Miss Taxonomy

To measure the effectiveness of the CIM structure, a taxonomy was developed that describes what occurs when an access to the memory array is made. The taxonomy is similar to one outlined for a typical memory hierarchy, with various "hit" and "miss" definitions. These definitions are used to determine the performance of the CIM structure both in terms of time and energy consumption.

In CIM, the characteristic that has the most impact on performance and energy is whether the requested data is available in one of the sense amp rows. If the data is available in the sense amplifiers of the appropriate subarray, a hit is said to have occurred. If it is not available, then a miss has occurred, and another access is required to load the sense amps with the new row of data. Since a row is stored in the sense amps, it is not required to re-write the data into the row of DRAM cells after each read. Because the data is stored, it can be written back only when displaced, without adverse effects.

To more effectively measure the performance characteristics of the memory structure, both the hit and miss definitions can be further expanded. There are two other factors that can impact memory performance: whether the current subrow multiplexing and active subarray match that which has been requested. Our taxonomy was expanded to include these conditions. The three factors, each with two options, lead to eight possible outcomes. These factors are shown in binary tree format in Figure 3.

Figure 3: CIM Access Taxonomy

The names associated with each leaf node in the tree are largely arbitrary, but are picked to indicate to what degree a given address matches cached data and current active elements. The subarray is decoded from the most significant bits of the address. The row is then decoded from the middle bits, followed by the subrow. The word and byte bits are, of course, the least significant bits, but these do not matter to the memory array.
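The decode order above, together with the three match factors of the taxonomy, can be sketched in a few lines. The 7 subarray bits (128 subarrays) and 9 row bits match the configuration quoted later in Section 9; the subrow, word, and byte widths, and the mapping of factor combinations onto leaf names, are illustrative assumptions rather than values from the paper.

```python
# Sketch of CIM address decoding (MSB: subarray | row | subrow | word | byte)
# and the eight-leaf hit/miss taxonomy. Widths other than the 7 subarray
# bits (128 subarrays) and 9 row bits are assumptions for this example.
SUBARRAY_BITS, ROW_BITS, SUBROW_BITS, WORD_BITS, BYTE_BITS = 7, 9, 4, 3, 2

def decode(addr):
    """Split a byte address into (subarray, row, subrow, word, byte)."""
    fields = []
    for width in (BYTE_BITS, WORD_BITS, SUBROW_BITS, ROW_BITS, SUBARRAY_BITS):
        fields.append(addr & ((1 << width) - 1))
        addr >>= width
    byte, word, subrow, row, subarray = fields
    return subarray, row, subrow, word, byte

def classify(row_latched, subarray_active, subrow_muxed):
    """Three binary factors -> one of eight outcomes. The names echo
    Tables 5 and 7, but the exact mapping here is our guess."""
    kind = "Hit" if row_latched else "Miss"
    if subarray_active and subrow_muxed:
        return "Full " + kind
    if subarray_active:
        return "Subrow " + kind      # right subarray, wrong subrow mux
    if subrow_muxed:
        return "Subarray " + kind    # wrong active subarray
    return "Row " + kind
```

Enumerating `classify` over all factor combinations yields the eight leaf nodes of Figure 3.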

4 Description of the Model

Some initial experiments with CIM were conducted using the Shade [2] tool from Sun Microsystems, Inc. Shade is a library of functions that allows the programmer to trace and analyze the instructions that are executed in most any program.


It returns the information for each instruction, and allows the user to extract and utilize any data that he/she sees fit.

In our models, we utilize most of the instruction information at one point or another. Basic simulations look only at the memory accesses that were represented in the instructions. The addresses accessed were utilized as though they came from a single-cycle architecture: no interleaving of addresses was performed to account for pipeline behavior. The resolved addresses are then fed into a representation of the tag arrays required to implement the CIM idea to determine hit and miss rates, as well as estimations of power consumed and time required for each application. These addresses can also be utilized in a similar representation of traditional cache hierarchies to compare the two.

More advanced simulations utilize the details of the instructions themselves. In this case, we chose to simulate the effects and relationships between the IBM PowerPC 401 microprocessor core and the memory. The 401 core was chosen because it is the processor that has been chosen for one or more of the projects which sponsored this study. It is a relatively simple, 32-bit RISC processor available as a synthesized core. The 401 has a wide variety of cache options, including omission of the caches. To translate between the Sun architecture and that of the PowerPC, the largest hurdle was converting the branch behavior. While the SPARC V9 architecture uses delayed and annulling branches, the 401 does not support these features.

Coupling the pipeline simulation with the CIM representation discussed above, we will achieve a complete-chip functional simulation. Again, a traditional memory hierarchy can be integrated in place of the CIM model to compare the two architectures. This simulator is aimed at comparing the time and energy performance of traditional structures to CIM structures. We are integrating energy and timing numbers from separate software to achieve a reasonable degree of realism in these simulations [5].

5 Test Suite

A reasonable effort was expended in collecting typical yet diverse applications to analyze using the Shade-based simulators. These applications are listed and described in Table 1. The applications include pointer-chasing applications and tight unrolled loops, which should explore the gamut of most cache-performance envelopes.

6 CIM Characterization

Preliminary experiments were performed to look at hit and miss rates for various configurations of the CIM system. These arrangements varied between 128 and 512 rows per subarray and 128 to 4096 columns per subarray, using the simple memory access simulator discussed in previous sections. The number of subarrays was not fixed, enabling the simulator to cover the entire address space. With virtual memory, page mapping, etc., the overall application size must be limited. The results of

Program     Description

ShellSort   A C++ application that implements the shell sort algorithm using the LEDA data structure library on 10,000 floating-point numbers.

cc          The Sun performance 'C' language compiler performing source-to-executable compilation of a simple program (find the maximum of 5 numbers).

FpMatrMult  Multiplies 256x512 and 512x256 element floating-point matrices in a very tight hand-optimized loop.

JpegTran    Translates a 550KB JPEG image into a GIF format image.

Table 1: CIM Test Suite

these simulations showed that for several diverse programs and most configurations, CIM produced instruction hit rates in excess of 90% and data hit rates in excess of 60%. While these hit rates are not stellar, they show promise for the performance and energy efficiency of CIM.
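The row-buffer behavior behind these hit rates can be sketched with a minimal open-row model, in which each subarray remembers the last row latched in its sense amplifiers. The geometry, the row-to-subarray mapping, and the synthetic trace below are toy assumptions for illustration, not the configurations measured above.

```python
def openrow_hit_rate(trace, row_bytes=4096, num_subarrays=128):
    """Toy open-row model: one latched row per subarray.
    `trace` is a sequence of byte addresses; returns (hits, misses)."""
    latched = {}                       # subarray index -> latched row
    hits = misses = 0
    for addr in trace:
        row = addr // row_bytes
        sub = row % num_subarrays      # assumed mapping of rows to subarrays
        if latched.get(sub) == row:
            hits += 1
        else:
            misses += 1                # real CIM writes back only if displaced
            latched[sub] = row
    return hits, misses

# Sequential word accesses mostly hit; the first touch of each row misses.
hits, misses = openrow_hit_rate(range(0, 8192, 4))
```

With this trace the model touches two rows, so only the first access to each row misses.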

7 Energy Estimation Parameters

As mentioned before, energy values in this study are derived from separate software which computes SRAM size, time, and energy [5]. The basis for this simulator was the IBM 7LD .25 DRAM and logic process [10]. Using these parameters and requesting moderate energy and time constraints on the SRAM simulator, we were able to generate the tag and data array numbers directly. By specifying a large SRAM array with one sense amplifier per bit line, it is possible to get an approximation to DRAM power figures. However, one must modify three things to obtain a good estimate. First, the cell energy must be updated, followed by appropriate scaling of the power used on the bit and word lines. Finally, one must account for another set of sense amplifiers: the primaries lie in each subarray, while the secondary sense amplifiers reside at the base of the stack of subarrays. These factors have been included in these power values. The DRAM numbers are approximations, but we believe them to be adequate for these early studies.

Category             Joules (W-sec)

Level 1 Tag Access   2.52e-9
Level 1 Data Access  2.52e-9
Main Memory Access   2.36e-8

Table 2: Traditional Energy Specification

All the basic power numbers that were used in these simulations can be found in Tables 2 and 3. If we look at how these numbers combine to produce per-access energy specifications, we can begin to see more interesting details. Tables 4 and 5 quantify the amount of energy expended per access in each of the different categories.
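The Table 4 numbers follow from the Table 2 base costs by simple addition. The particular per-event access counts below are our reading, chosen because the sums reproduce the published values; the text itself does not spell them out.

```python
# Reconstructing Table 4 from the Table 2 base costs (joules).
# The access counts per event are an assumption on our part; they are
# the combination that reproduces the published per-access numbers.
TAG = DATA = 2.52e-9        # level-1 tag / data array access
MAIN = 2.36e-8              # main memory access

cache_hit = TAG + DATA                  # tag probe + data read
cache_miss = 3 * TAG + MAIN             # three array accesses + memory fetch
writeback = MAIN + TAG                  # memory write + one array access

assert abs(cache_hit - 5.04e-9) < 1e-12
assert abs(cache_miss - 3.116e-8) < 1e-12
assert abs(writeback - 2.612e-8) < 1e-12
```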


Category          Joules (W-sec)

Tag Access        7.65e-10
Subarray Decode   7.0e-9
Row Decode        4.13e-9
Wordline          2.78e-11
Bitline           7.0e-9
Mux Energy        2.29e-9
Output Driver     2.35e-9
Sense Amps        7.65e-10
2ndry Sense Amps  5.0e-11

Table 3: CIM Energy Specification

Access Result      Joules (W-sec)

Cache Hit          5.04e-9
Cache Miss         3.116e-8
Writeback Penalty  2.612e-8

Table 4: Traditional Energy Per Access

8 Time Estimation Parameters

Tables 6 and 7 show the time required for each of the access results in the given architectures. These cycle times are based on a 150MHz clock, which is one possible clock frequency for the 401 to be used in the projects funding this study. These time values are ambitious, but reasonable. The processor must be able to obtain a nominal rate of two memory accesses per clock cycle: an instruction fetch and potentially a memory operation. Because the sense amplifiers themselves latch data, the reads and writes to CIM should be as fast as SRAM. The 33ns access time for the cache configuration is also reasonable because all the memory references and CIM misses in the simulations are assumed to be on the chip. This means two things. First, there are no high-capacitance busses to drive off the chip, which reduces the time dramatically. Secondly, the DRAM interface to the standard cache can be full width, eliminating the need to do multiple word transfers to satisfy a 32-byte request.
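At the 150MHz clock assumed above, the cycle counts in Tables 6 and 7 translate directly into access times; the 5-cycle miss is where the 33ns figure comes from. A short check of that arithmetic:

```python
# Convert Table 6/7 cycle counts into access times at the assumed 150MHz.
CLOCK_HZ = 150e6
NS_PER_CYCLE = 1e9 / CLOCK_HZ          # ~6.67 ns per cycle

def access_time_ns(cycles):
    """Time in nanoseconds for a given cycle count from Tables 6 and 7."""
    return cycles * NS_PER_CYCLE

miss_ns = access_time_ns(5)            # cache miss / CIM miss: ~33.3 ns
hit_ns = access_time_ns(0.5)           # traditional cache hit: ~3.3 ns
```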

Taking a closer look at Tables 6 and 7 shows the number of cycles required for each access result. Although CIM enjoyed a comfortable advantage in the per-access power cost for most hit and miss instances in the energy domain, in the time domain it has lower performance than the standard cache hit for two of the four CIM hit

Access Result      Joules (W-sec)

Full Hit           8.15e-10
Subarray Hit       5.46e-9
Row Hit            8.69e-9
Subrow Hit         6.40e-9
Full Miss          2.22e-8
Subarray Miss      1.99e-8
Row Miss           1.67e-8
Subrow Miss        1.90e-8
Writeback Penalty  8.63e-9

Table 5: CIM Energy Per Access

Access Result        # Cycles
Cache Hit            .5
Cache Miss           5
Writeback Penalty    5

Table 6: Traditional Cycles Per Access

categories. Even if CIM managed to match hit rates, it would fall behind in raw cycle comparisons.

Access Result        # Cycles
Full Hit             .4
Subarray Hit         .5
Subrow Hit           2
Row Hit              2
Full Miss            5
Subarray Miss        5
Subrow Miss          5
Row Miss             5
Writeback Penalty    5

Table 7: CIM Cycles Per Access

9 Analysis and Simulation Results

Two approaches are used to examine the behavior of CIM against traditional cache structures. First, we look at the power model and timing model to find inherent characteristics that should lead to recognizable trends in the numerical data. Second, those models are used in the Shade simulation to generate data over the test suite applications.

At a very fundamental level, Equations 1 and 2 on Page 5 relate energy to the frequency and energy associated with the hits and misses presented above. These equations are simply adapted versions of standard energy calculations for any system. In this instance, however, the adaptation of these equations, coupled with the power and timing data outlined above, reveals several trends.
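A minimal Python sketch of how Equations 1 and 2 combine per-access energies into an energy-per-request figure. The hit, miss, and write-back frequencies below are illustrative placeholders (and the tag-access value is taken from the simplified crossover expressions later in this section), not simulation results:

```python
# Sketch of Equations 1 and 2: total energy per memory request as a
# frequency-weighted sum of per-access energies (all energies in joules).

def e_cim(e_tag, hits, misses, e_wb, f_wb):
    """hits/misses: lists of (energy, frequency) pairs for the CIM access types."""
    return (e_tag
            + sum(e * f for e, f in hits)
            + sum(e * f for e, f in misses)
            + e_wb * f_wb)

def e_trad(e_tag, e_cache, f_hit, e_mem, f_miss, e_wb, f_wb):
    return e_tag + e_cache * f_hit + e_mem * f_miss + e_wb * f_wb

# Traditional model with the per-access energies of Table 4 and an assumed
# 95% hit rate plus a 5% write-back rate:
total = e_trad(e_tag=2.52e-9, e_cache=5.04e-9, f_hit=0.95,
               e_mem=3.116e-8, f_miss=0.05, e_wb=2.612e-8, f_wb=0.05)
print(round(total * 1e9, 2))  # → 10.17 (nJ per access)
```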

The tag array is accessed at least once on every memory request. This frequency of 1 means that substantial savings on tag access could prove to be an important factor in overall energy savings. For CIM, the tag array contains the same number of entries as there are subarrays in the memory device: in our initial testing, this value is 128. The number of bits in each CIM tag entry is 10: nine to store the row that the subarray has latched in its sense amplifiers and a single valid bit. For the traditional cache structure, we have separate 8KB instruction and data caches. Each of the tag arrays for the caches has 2048 entries, each with 24 bits of tag and valid bits. The energy used to access the CIM tag array should therefore be substantially less than that of the normal cache architecture. This is indeed the case: as seen in Tables 2 and 3, CIM tag accesses consume less than one third the energy of those in the standard caches.

If we look at the energy consumed on cache hits, we notice that CIM is actually less efficient


E_CIM = E_TagAccess + Σ (E_CIMHit · F_CIMHit) + Σ (E_CIMMiss · F_CIMMiss) + E_WriteBack · F_WriteBack   (1)

E_Trad = E_TagAccess + E_CacheAccess · F_CacheHit + E_MemoryAccess · F_CacheMiss + E_WriteBack · F_WriteBack   (2)

in all but one rare case. In Tables 4 and 5, one finds that the cache hit is slightly less than all but the CIM full hit. While the full hit consumes half the power of the cache hit, we expect that full hits are extremely rare. A full hit indicates that the exact same word has been requested as in the last access, meaning that no multiplexing, decoding, etc. is required. Obviously this has little bearing on the comparison. These numbers show that the normal cache is more efficient for hits. We have presented approximate hit rates for the CIM architecture of 90% instruction and 60% data. If we look at typical hit rates for split 8KB direct-mapped caches, we find rates of approximately 99% instruction and 90% data [6]. This illustrates that the normal cache architecture should be more efficient in terms of power.

Next we examine the energy associated with cache misses. Again looking at Tables 4 and 5, we find that CIM holds an edge in per-access energy for misses. Without considering the frequency-weighted average of the different CIM misses, the simple average has only 63% the energy cost of the normal cache miss. Let us also examine the difference between miss penalties: CIM write-backs are one third the energy of their traditional cache counterparts.
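These ratios can be checked directly against Tables 4 and 5; a quick Python sketch:

```python
# Check the miss-energy comparison above using per-access values from
# Tables 4 and 5 (joules).

cim_misses = [2.22e-8, 1.99e-8, 1.67e-8, 1.90e-8]  # full/subarray/row/subrow miss
trad_miss = 3.116e-8

simple_avg = sum(cim_misses) / len(cim_misses)
print(round(simple_avg / trad_miss, 2))   # → 0.62, i.e. roughly 63% of a normal miss

# Write-back penalties: CIM vs. traditional
print(round(8.63e-9 / 2.612e-8, 2))       # → 0.33, about one third
```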

Based on these values, we believe that there is a point where CIM becomes more power efficient than traditional arrangements. We can write the mathematical equation that describes where the cross-over occurs; however, it requires simplifying Equations 1 and 2 and setting them equal. If we insist on keeping all four types of hits and all four types of misses in the CIM architecture in this equation, the problem becomes multi-dimensional. We can, however, perform some simplifications, or fix some values, in order to obtain a reasonably solvable problem. The series of equations below shows the transformation to a simple solution. From this point, one can simply plug in the hit rate for either architecture to obtain the hit rate that should produce equal energy in the other. For instance, a hit rate of 85% in CIM would be the energy equivalent of 89% in the traditional cache model. This equation will serve as a useful check for simulator results.

E_CIM ≤ E_Trad

E_CIM = 7.65×10⁻¹⁰ + 6.0×10⁻⁹ · F_CIMHit + 2.05×10⁻⁸ · (1 − F_CIMHit) + 0.05 · 8.63×10⁻⁹

E_Trad = 2.52×10⁻⁹ + 2.52×10⁻⁹ · F_TradHit + 3.12×10⁻⁸ · (1 − F_TradHit) + 0.05 · 2.61×10⁻⁸

F_CIMHit ≥ 1.97 · F_TradHit − 0.916
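The crossover can also be computed numerically; the following Python sketch solves the linear equality for the CIM hit rate, using the simplified constants quoted in the text:

```python
# Equal-energy crossover between the simplified CIM and traditional models.

def e_cim(f_hit):
    # tag access + hit term + miss term + 5% write-back penalty (joules)
    return (7.65e-10 + 6.0e-9 * f_hit
            + 2.05e-8 * (1 - f_hit) + 0.05 * 8.63e-9)

def e_trad(f_hit):
    return (2.52e-9 + 2.52e-9 * f_hit
            + 3.12e-8 * (1 - f_hit) + 0.05 * 2.61e-8)

def cim_hit_equivalent(f_trad_hit):
    """CIM hit rate giving the same energy as a given traditional hit rate."""
    # Both models are linear in the hit rate, so solve e_cim(x) = e_trad(f) for x.
    slope = 6.0e-9 - 2.05e-8          # d(e_cim)/d(f_hit)
    return (e_trad(f_trad_hit) - e_cim(0.0)) / slope

print(round(cim_hit_equivalent(0.89), 2))  # → 0.84, close to the 85% quoted above
```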

Now that we have examined how we expect the trends to appear in the simulation results, we can look at the end-to-end Shade simulations. Although this simulator has been in development for quite a while, recent attempts to put larger programs through it have revealed precision problems. Using the same energy figures as the simulator has at its disposal, hand-calculated values for the energy per access have been substituted into Table 9. The modified table entries are indicated with an asterisk. We will discuss the results with these corrections.

Table 8 shows the hit rates that the simulator achieved using each processor and memory combination. For the most part, the results for the traditional cache structure fall close to the expected figures. Except for the FpMatrMult data result, which we expect to thrash the cache [6], all other programs approach or exceed the expected 99% instruction and 90% data rates. The results for the CIM simulations also show good correlation between the previous study and these simulations, meeting or exceeding the 90% instruction and 60% data hit rates.

Program      SC+Trad    401+Trad   SC+CIM     401+CIM
             Hit%       Hit%       Hit%       Hit%
ShellSort    1.0/.92    1.0/.92    .94/.64    .94/.69
cc           .97/.95    .98/.95    .92/.61    .92/.60
FpMatrMult   1.0/.83    1.0/.83    .98/.57    .98/.57
JpegTran     1.0/.94    1.0/.94    .90/.71    .92/.72

Table 8: Instruction/Data Hit Rates

We reduced the complex and multi-dimensional equations above to derive an equal-power rule between the two memory architectures. A quick look at Table 9 indicates that the CIM hit rate never gets quite high enough to break past the traditional cache configuration. A look at the detailed simulator output shows that the highest combined hit rate achieved by CIM is 87.5%, which is equivalent to a 91% cache hit rate. Unfortunately, none of the combined traditional results show an overall hit rate this low.

Program      SC+Trad    401+Trad   SC+CIM     401+CIM
             nJ/A       nJ/A       nJ/A       nJ/A
ShellSort    6.55       6.48       8.94       8.39
cc           6.11       5.89       7.10       6.38
FpMatrMult   7.14*      7.11*      7.83       7.77
JpegTran     5.50*      5.40*      6.46       5.67

Table 9: Energy Per Access Comparison

As expected, the CIM architecture shows lower performance than the standard cache architecture. The difference seen in Table 10 is between .75 and 2 cycles per instruction, a relatively large difference of up to a 50% increase in execution time.

Note that FpMatrMult is very floating-point intensive. The 401 uses 2 clock cycles to process a floating-point operation, while the single-cycle simulation assumes 1 clock per operation. This


Program      SC+Trad   401+Trad   SC+CIM    401+CIM
ShellSort    1.76      1.66       3.64      3.56
cc           1.35      1.32       2.35      2.25
FpMatrMult   1.93      2.21       3.24      3.21
JpegTran     1.23      1.16       2.08      2.11

Table 10: Cycles Per Instruction Comparison

is why the values for FpMatrMult using the single-cycle simulation with both memory structures in Table 10 are close to or below the values using the 401 and perfect memory.

10 Conclusion and Future Work

This paper starts with the supposition that combining main memory, CPU, and cache on the same die is a clear win from a power and performance perspective. The objective is then to explore how much cache is appropriate and how we might take advantage of the structure internal to the memory to reduce it. In the process of performing this study, we have characterized a broader range of events between a classical hit and a classical miss, each with a different power and performance penalty. Equations have been built that relate these events to overall energy per access, and an early simulator has been constructed to explore the energy and performance ramifications.

For the single design point explored here, the results are interesting but somewhat inconclusive in terms of power. CIM shows comparable energy efficiency to traditional cache structures, especially for applications that make poor use of those caches, such as matrix multiply. The time performance is around 50% slower; however, in looking at these results we should remember that this was the first design point simulated. A range of row and column sizes needs to be explored. Looking at the effect of increases in associativity [7] and of small L1 caches are possible next steps.

We will continue to look at how to choose bits from the address to build each field for the subarray and row mapping. The bits that are picked out of the address for each field are largely arbitrary: the data does not care where it actually resides. However, mapping these bits appropriately, and adapting compilers to make use of the CIM structure, should positively affect hit rates for CIM-based systems.

Further, we will continue to refine the DRAM power and performance values, as well as add size figures. While there should not be a huge change in the energy results obtained so far, developing a more robust DRAM energy model is essential in order to claim conclusive findings. Additionally, we would like to generate a DRAM model that will allow us to accurately compute area, to look at trade-offs that may exist between more DRAM capacity filling the size difference between CIM and the normal cache architecture.

We believe that the simulator used for this study can be easily modified to provide some confirmation of other studies [3][7]. By simply changing parameters, and adding provision for higher-associativity sense-amplifiers, it should be possible to confirm or rebut findings that have been published in past papers. Duplicating those results would help to validate that our simulations are indeed accurate, and lend more credibility to those previous studies that have largely stood without confirmation.

11 Acknowledgements

This work was performed for the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored in part by the JPL REE program, the Defense Advanced Research Projects Agency (DARPA), and the National Security Agency (NSA) through an agreement with the National Aeronautics and Space Administration.

12 References

[1] Brockman, Jay B., and Peter M. Kogge, The Case for PIM, Notre Dame CSE TR-9707, Jan. 10, 1997.

[2] Cmelik, B., The SHADE Simulator, Sun Labs Technical Report, 1993.

[3] Fromm, Richard, et al., The Energy Efficiency of IRAM Architectures, in Conference Proceedings, 1997 International Symposium on Computer Architecture (Denver, Colorado, June 1997), pp. 327-337.

[4] Kogge, Peter M., EXECUBE: A New Architecture for Scalable MPPs, 1994 International Conference on Parallel Processing.

[5] Nanda, Kartik, SRAM Energy Modeling, Notre Dame CSE Technical Report, June 1998.

[6] Hennessy, John, and David Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers, 1996.

[7] Saulsbury, Ashley, et al., Missing the Memory Wall: The Case for Processor/Memory Integration, in Proceedings of the 23rd Annual International Symposium on Computer Architecture (Philadelphia, Pennsylvania, May 1996), pp. 90-101.

[8] http://notes.sematech.org/97melec.htm

[9] http://www.chips.ibm.com/products/embedded/chips/401core.html

[10] http://www.chips.ibm.com/services/foundry/technology/cmos7ld.html

[11] http://www.mitsubishichips.com/products/mcu/m32rd/m32rd.htm

[12] http://www.usa.samsungsemi.com/new/dram-asic.htm


Innovative VLSI Techniques for Power Reduction


Dynamic Voltage Scaling and the Design of a Low-Power Microprocessor System

Trevor Pering, Tom Burd, and Robert Brodersen
University of California Berkeley, Electronics Research Laboratory

pering, burd, [email protected]
http://infopad.eecs.berkeley.edu/~pering/lpsw

Abstract

This paper describes the design of a low-power microprocessor system that can run between 8MHz at 1.1V and 100MHz at 3.3V. The ramifications of Dynamic Voltage Scaling, which allows the processor to dynamically alter its operating voltage at run-time, will be presented along with a description of the system design and an approach to benchmarking. In addition, a more in-depth discussion of the cache memory system will be given.

1. Introduction

Our design goal is the implementation of a low-power microprocessor for embedded systems. It is estimated that the processor will consume 1.8mW at 1.1V/8MHz and 220mW at 3.3V/100MHz using a 0.6µm CMOS process. This paper discusses the system design, cache optimization, and the processor's Dynamic Voltage Scaling (DVS) ability.

In CMOS design, the energy-per-operation is given by the equation

E_op ∝ C·V²

where C is the switched capacitance and V is the operating voltage [2]. To minimize E_op, we use aggressive low-power design techniques to reduce C and DVS to optimize V.

Our system design, which addresses the complete microprocessor system and not just the processor core, is presented in Section 2. Our benchmark suite, which is designed for a DVS embedded system, is presented in Section 3. Section 4 discusses the issues involved with the implementation of DVS, while Section 5 presents an in-depth discussion of our cache design.

The basic goal of DVS is to quickly (~10µs) adjust the processor's operating voltage at run-time to the minimum level of performance required by the application. By continually adapting to the varying performance demands of the application, energy efficiency is maximized.

The main difference between our design and that of the StrongARM is the power/performance target: our system targets ultra-low power consumption with moderate performance, while the StrongARM targets moderate power consumption with high performance. Our processor core is based on the ARM8 architecture [1], which is virtually identical to that of the StrongARM. The similarities and differences between the two designs are highlighted throughout this paper.

2. System Overview

To effectively optimize system energy, it is necessary to consider all of the critical components: there is little benefit in optimizing the microprocessor core if other required elements dominate the energy consumption. For this reason, we have included the microprocessor core, data cache, processor bus, and external SRAM in our design, as seen in Figure 1. The energy consumed by the I/O system (not shown) is completely application and device dependent and is therefore beyond the scope of our work. The expected power distribution of our system is given in Figure 2.

To reduce the energy consumption of the memory system, we use a highly optimized SRAM design [3] which is 32 data-bits wide, requiring only one device be

[Figure 1: System Block Diagram — the µProc. core, unified cache, and I/O bridge form the DVS components; the four 32-bit SRAMs are fixed-voltage components.]

[Figure 2: System Energy Breakdown — Core 58%, Cache 33%, Processor Bus 7%, SRAM 2%.]


activated for each access. Schemes that use multiple narrow-width SRAMs require multiple devices to be activated for each access, resulting in a significant increase in energy consumption. To alleviate the high pin-count problem of 32-bit memory devices, we multiplex the data address onto the same bit-lines as the data words.

We use a custom-designed high-efficiency non-linear switching voltage regulator [14] to generate dynamic supply voltages between 1.1V and 3.3V. An efficient regulator is crucial to an efficient system because all energy consumed is channeled through the regulator. When switching from 3.3V to 1.1V, a linear regulator would only realize a 3x energy savings, instead of the 12x reduction afforded by our design.

The threshold voltage (Vt) significantly affects the energy and performance of a CMOS circuit. Our design uses a Vt of 0.8V to achieve a balance between performance and energy consumption. The StrongARM [13], for comparison, uses a Vt of 0.35V, which increases performance at the expense of increased static power consumption. When idle, the StrongARM is reported to consume 20mW, which is the predicted power consumption of our processor when running at 20MHz. When idle, we estimate our processor will consume 200µW, an order of magnitude improvement.

3. Benchmarks

Our benchmark suite targets PDAs and embedded applications. Benchmark suites such as SPEC95 are not appropriate for our purposes because they are batch-oriented and target high-performance workstations. DVS evaluation requires the benchmarking of workload idle characteristics, which is not possible with batch-oriented benchmarks. Additionally, our target device has on the order of 1MB of memory and lacks much of the system support required by heavy-weight benchmarks; running SPEC95 on our target device would simply be impractical.

We feel the following six benchmarks are needed to adequately represent the range of workloads found in embedded systems:

• AUDIO Decryption
• MPEG Decoding
• User Interfaces
• Java Interpreter
• Web Browser
• Graphics Primitive Rendering

As of this writing, we have implemented the first three of these, and their characteristics are summarized in Table 3. “Idle Time” represents the portion of system idle time, used by DVS algorithms. The “Bus Activity” column reports the fraction of active cycles on the external processor bus, an important metric when optimizing the cache system. The cache architecture used to generate Table 3 is discussed in Section 5.

As an example, Figure 4 shows an event impulse graph [13], which is used to help characterize programs for DVS analysis. Each impulse represents one MPEG frame and indicates the amount of work necessary to process that frame. For this example, there is a fixed frame-rate which can be used to calculate the optimal processor speed for each frame, assuming only one outstanding frame at any given time.

4. Dynamic Voltage Scaling

Our processor has the ability, termed Dynamic Voltage Scaling (DVS), to alter its execution voltage while in operation. This ability allows the processor to operate at the optimal energy/efficiency point and realize significant energy savings, which can be as much as 80% for some applications [13]. This section discusses DVS design considerations and explains how it affects architectural performance evaluations.

DVS combines two equations of sub-micron CMOS design [2]:

E_op ∝ V²    and    f_max ∝ (V − Vt) / V

where E_op is the energy-per-operation, f_max is the maximum clock frequency, and V is the operating voltage. To minimize the energy consumed by a given task, we can reduce V, effecting a reduction in E_op. A reduction in V, as shown in the second equation, results in a corresponding decrease in f_max. A simple example of these effects is given below.

Reducing f_clk, the actual processor clock used, without reducing V does not reduce the energy consumed by a processor for a given task. The

Benchmark   Miss Rate   Idle Time   Bus Activity
AUDIO       0.23%       67%         0.35%
MPEG        1.7%        22%         14%
UI          0.62%       95%         0.52%

Table 3: Benchmark Characterization

[Figure 4: MPEG Event Impulse Graph — event delay (ms) vs. event number, for events 1 to 385.]



StrongARM 1100, for example, allows f_clk to be dynamically altered during operation [16], effecting a linear reduction in the power consumed. However, the change in f_clk also causes a linear increase in task run-time, causing the energy-per-task to remain constant. Our system always runs with f_clk = f_max, which minimizes the energy consumed by a task.
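The effect can be illustrated with a first-order model; this is a sketch in relative units using the Vt of 0.8V from Section 2, not the measured behavior of the actual chip:

```python
# First-order CMOS scaling: frequency scaling alone leaves energy-per-task
# unchanged, while voltage scaling reduces it quadratically.

VT = 0.8  # threshold voltage from the text (volts)

def f_max(v):
    # f_max ∝ (V - Vt) / V, in relative units
    return (v - VT) / v

def energy_per_task(v, n_ops=1e6, c=1.0):
    # E_task = N * C * V^2 (relative units): no dependence on the clock rate,
    # so slowing f_clk at a fixed voltage saves no energy per task.
    return n_ops * c * v ** 2

# Dropping the supply from 3.3V to 1.1V cuts energy per task by (3.3/1.1)^2 = 9x
print(round(energy_per_task(3.3) / energy_per_task(1.1), 6))  # → 9.0
```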

From a software perspective, we have abstracted away the voltage parameter and specify the operating point in terms of f_max. The actual voltage used is determined by a feedback loop driven by a simple ring oscillator. The primary reason for this design was ease of the hardware implementation; fortunately, it also presents the most useful software interface.

Our system applies one dynamic voltage to the entire system to realize savings from all components. It would be possible, however, to use multiple independent supply voltages to independently meet subsystem performance requirements. This was not attempted in our design. To interface with DVS-incompatible external components, we use custom-designed level-converting circuits.

The implementation of DVS requires the application of voltage scheduling algorithms. These algorithms, discussed in Section 4.2, monitor the current and expected state of the system to determine the optimal operating voltage (frequency).

4.1 Energy/Performance Evaluation Under DVS

DVS can affect the way we analyze architectural trade-offs. As an example, we explore the interaction between DVS and the ARM Thumb [4] instruction set. We apply Thumb to the MPEG benchmark from Section 3 and analyze the energy consumed. This example assumes a 32-bit memory system, which is a valid assumption for high-performance systems but not necessarily for all embedded designs.

The MPEG benchmark is 22% idle when running at 100MHz using the 32-bit ARM instruction set. DVS allows us to minimize the operating voltage to fill unnecessary idle-time. Using a first-order approximation, this would reduce the energy consumed by 40% and slow the processor clock to the point at which idle time is zero. From this starting point, we consider the application of the Thumb instruction set to this benchmark.

For typical programs, the 16-bit Thumb instruction set is 30% more dense than its 32-bit counterpart, reducing the energy consumed in the cache and memory hierarchy. However, due to reduced functionality, the number of instructions executed increases by roughly 18%, increasing the energy dissipated in the processor core as well as the task execution time.

This example teaches two important lessons. First, an increase in task delay directly relates to an increase in energy: DVS exposes the trade-off between energy and performance. Second, an increase in delay affects the entire system (core and cache), not just one fragment: it is vital that the associated increase in the energy-per-operation is applied to the entire system.

Figure 5 presents six metrics crossed with three configurations running the MPEG benchmark. The three configurations are:

• Base: 78 MHz using 32-bit instructions.
• Thumb: 78 MHz using Thumb instructions.
• Adjusted: 92 MHz using Thumb instructions.

The ‘Base’ configuration represents the MPEG benchmark running as 32-bit code, as discussed above. ‘Thumb’ illustrates the intermediate effects of the 16-bit Thumb architecture without increasing the clock speed. The energy consumed in the cache (see Figure 5) decreases due to the decreased memory bandwidth caused by the smaller code size. The energy of the core, however, rises slightly due to the increased number of instructions processed. Overall, the energy decreases by approximately 10%.

The delay increase caused by the expanded instruction stream pushes the processor utilization over 100%. Because of this, the MPEG application will not be able to process its video frames fast enough. The ‘Adjusted’ configuration represents the increase in processor speed required to maintain performance. This change in clock frequency necessitates an increase in voltage, which raises the energy-per-operation of the entire system. As can be seen from the ‘Total Energy’ columns, the energy savings are no longer realized: the 16-bit architecture increases overall energy consumption.

Although not energy-efficient in all situations, the Thumb instruction set may be efficient for some tasks due to the non-linearity of voltage scaling. If the base system were initially running at a very low voltage, for example, the increase in processor speed necessary would not dramatically increase the energy-per-operation. The savings due to the reduced code size, therefore, would effect an overall decrease in system energy.

4.2 Voltage Scheduling

To effectively control DVS, a voltage scheduler is used to dynamically adjust the processor speed and voltage at run-time. Voltage scheduling significantly complicates the scheduling task since it allows optimization of the processor clock rate. Voltage schedulers analyze the current and past state of the system in order to predict the future workload of the processor.

[Figure 5: DVS Example — Processor Utilization, Processor Speed, Energy Per Operation, Cache Energy, Core Energy, and Total Energy for the Base, Thumb, and Adjusted configurations.]

Interval-based voltage schedulers are simple techniques that periodically analyze system utilization at a global level: no direct knowledge of individual threads or programs is needed. If the preceding time interval was greater than 50% active, for example, the algorithm might increase the processor's speed and voltage for the next time interval. [5][13][17] analyze the effectiveness of this scheduling technique across a variety of workloads. Interval-based scheduling has the advantage of being easy to implement, but it often suffers from incorrectly predicting future workloads.
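A toy interval-based scheduler might look like the following Python sketch; the utilization thresholds and discrete speed steps are assumptions for illustration, not values from the cited work:

```python
# Minimal interval-based voltage scheduler: pick the next interval's speed
# (and hence voltage) from the previous interval's utilization alone.

SPEEDS_MHZ = [8, 25, 50, 78, 100]  # assumed discrete operating points

def next_speed(current_idx, utilization):
    if utilization > 0.7 and current_idx < len(SPEEDS_MHZ) - 1:
        return current_idx + 1     # falling behind: raise speed and voltage
    if utilization < 0.5 and current_idx > 0:
        return current_idx - 1     # mostly idle: lower speed and voltage
    return current_idx             # utilization acceptable: hold steady

idx = 3  # start at 78 MHz
for util in [0.9, 0.8, 0.4, 0.3, 0.6]:
    idx = next_speed(idx, util)
print(SPEEDS_MHZ[idx])  # → 50
```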

More recently, investigation has begun into thread-based voltage schedulers, which require knowledge of individual thread deadlines and the computation required [7][12]. Given such information, thread-based schedulers can calculate the optimal speed and voltage setting, resulting in minimized energy consumption. A sample deadline-based voltage scheduling graph is given in Figure 6; Sx and Dx represent task start-time and deadline, respectively, while the graph area, Cx, represents the computational resources required.

4.3 Circuit Level Considerations

At the circuit level, there are two types of components in our design adversely affected by DVS: complex logic gates and memory sense-amps. Complex logic gates, such as 8-input NAND gates, are implemented by a CMOS transistor chain which will have a different relative delay if the voltage is varied. Additionally, memory sense-amps are sensitive to voltage variations because of their analog nature, which is necessary to detect the small voltage fluctuations of the memory cells.

To the largest extent possible, these voltage-sensitive circuits are avoided; however, in some situations, such as in the cache CAM design described below, it is better to redesign the required components with increased tolerance. Redesigns of these components will often be less efficient or slower than the original versions when running at a fixed voltage. We estimate an increase in the average energy/instruction of the microprocessor on the order of 10%, which is justified by the overall savings afforded by DVS.

5. Cache Design

This section describes the design of our cache system, which is a 16kB unified 32-way set-associative read-allocate write-back cache with a 32-byte line size. The cache is an important component to optimize since it consumes roughly 33% of the system power and is central to system performance. Our primary design goal was to optimize for low power while maintaining performance; our cache analysis is based on layout capacitance estimates and aggregated benchmark statistics.

Our 16kB cache is divided into 16 individual 1kB blocks. The 1kB block size was chosen to achieve a balance between block access energy and global routing energy. Increasing the block size would decrease the capacitance of the global routing, but it would also increase the energy-per-access of the individual blocks.

Our cache geometry is very similar to that of the StrongARM, which has a split 16kB/16kB instruction/data cache. Other features, namely the 32-way associative CAM array, are similar. In the StrongARM design, the caches consume approximately 47% of the system power [13].

5.1 Basic Cache Structure

We have discovered that a CAM-based cache design (our implementation is given in Figure 7) is more efficient than a traditional set-associative organization (Figure 8) in terms of both power and performance. The fundamental drawback of the traditional design is that the energy per access scales linearly with the associativity: multiple tags and data must be fetched simultaneously to maintain cycle time. A direct-mapped cache, therefore, would be extremely energy efficient; its performance, however, would be unacceptable. We estimate that the energy of our 32-way set-associative design is comparable to that of a 2-way set-associative traditional design.
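The scaling argument can be made concrete with a toy energy model; all numbers below are illustrative relative energies, not layout estimates from the paper:

```python
# Toy model of access energy vs. associativity: an n-way traditional cache
# reads n tags and n data words in parallel, while a CAM-tag cache performs
# one match across n entries and then a single data fetch.

E_TAG, E_DATA, E_CAM_ENTRY = 1.0, 4.0, 0.2  # illustrative relative energies

def e_traditional(ways):
    return ways * (E_TAG + E_DATA)      # energy grows linearly with ways

def e_cam(ways):
    return ways * E_CAM_ENTRY + E_DATA  # match line plus one data access

print(e_traditional(2), round(e_cam(32), 1))  # comparable in this toy model
```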

[Figure 6: The Voltage Scheduling Graph — processor speed vs. time; Sx and Dx mark task start-times and deadlines, and the areas Cx the computation required.]

[Figure 7: Implemented CAM Cache Design — a CAM tag array drives the word-lines of a 2x128 data block; address in, read data out.]


Our design has been modified from a vanilla CAM in two major ways:

• Narrow memory bank: The fundamental SRAM data block for our design is organized as a 2-word x 128-row block, instead of an 8-word x 32-row block.

• Inhibited tag checks: Back-to-back accesses mapping to the same cache line do not trigger multiple tag checks.

The 2-word by 128-row block organization for our cache data was chosen primarily because a large block width would increase the energy-per-access of the data block. A block width of 8 words, for example, would effectively entail fetching 8 words per access, which is wasteful since only one or two of these words would be used. The narrow block width unfortunately causes an irregular physical layout, increasing total cache area; however, we chose this design as energy was our primary concern.

There are two natural lower bounds on the block width. First, the physical implementation of the SRAM block has an inherent minimum width of 2 words [3]. Second, the ARM8 architecture has the capability for double-bandwidth instruction fetches and data reads [1], which lends itself to a 2-word-per-access implementation.

Unnecessary tag checks, which would waste energy, are inhibited for temporally sequential accesses that map to the same cache line. Using the sequential-access signal provided by the processor core and a small number of access bits, this condition can be detected without a full address comparison. Our simulations indicate that about 46% of tag checks are avoided with an 8-word cache line size, aggregated across both instruction and data accesses. For the individual instruction and data streams, 61% and 8% of tag checks are prevented, respectively.
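The inhibition idea can be sketched as follows; this is a hedged Python model in which the addresses and the detection mechanism are illustrative, since the real hardware uses the core's sequential-access signal and a few access bits rather than full address comparison:

```python
# Tag checks are skipped when a sequential access falls in the same
# 32-byte (8-word) cache line as the previous access.

LINE_BYTES = 32

def count_tag_checks(accesses):
    """accesses: list of (address, is_sequential); returns tag checks performed."""
    checks, last_line = 0, None
    for addr, sequential in accesses:
        line = addr // LINE_BYTES
        if not (sequential and line == last_line):
            checks += 1            # new line (or non-sequential): full tag check
        last_line = line
    return checks

stream = [(0x100, False), (0x104, True), (0x108, True), (0x120, True), (0x124, True)]
print(count_tag_checks(stream))  # → 2 (0x120 starts a new line)
```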

5.2 Cache Policies and Geometry

Cache energy has a roughly logarithmic relationship with respect to its overall size, due to selective block enabling: a 16kB cache consumes little more energy than an 8kB cache. Our fundamental cache size constraint was die cost, which is determined primarily by cache area. Benchmark simulations indicate that a 16kB unified cache is sufficient; we felt the increased cost of a 32kB cache was not justified. We chose a unified cache because it is most compatible with the ARM8 architecture.

The cache line size has a wide-ranging impact on energy efficiency; our analysis (Figure 9) indicates that an 8-word line size is optimal for our workload. Given the 1kB block size, our associativity is inversely proportional to the line size: an 8-word line yields 32-way associativity (1kB / 8 words = 32-way). The energy of a CAM tag access is roughly linear with associativity. Also, smaller cache line sizes generate less external bus traffic, consuming less energy. The energy of the data memory is practically constant, although there are slight variations caused by updates due to cache misses.
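The associativity arithmetic above generalizes to the other line sizes swept in Figure 9 (assuming 4-byte words and the 1 kB block stated in the text):

```python
# Worked numbers for the geometry trade-off: associativity is the
# number of lines that fit in one 1 kB block (assumed 4-byte words).
BLOCK_BYTES = 1024
WORD_BYTES = 4

for line_words in (4, 8, 16):
    line_bytes = line_words * WORD_BYTES
    associativity = BLOCK_BYTES // line_bytes  # lines per 1 kB block
    print(line_words, associativity)
# -> 4 64
#    8 32
#    16 16
```

Since CAM tag energy grows roughly linearly with associativity, halving the line size doubles the tag-search cost, which is the tension Figure 9 captures.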

We implement a write-back cache to minimize external bus traffic. Our simulations indicate that a write-through cache would increase the external bus traffic by approximately 4x, increasing the energy of the entire system by 27%. We found no observable performance difference between the two policies.
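One reading of these figures, assuming external bus energy scales linearly with traffic, lets us back out the bus's share of system energy; the arithmetic below is ours, not the paper's:

```python
# Back-of-envelope check of the write-back vs. write-through numbers:
# if write-through makes bus traffic ~4x and total system energy rises
# ~27%, the 3 extra units of traffic account for that 27%, so the bus
# is roughly 27%/3 = 9% of system energy under write-back.
traffic_ratio = 4.0       # write-through / write-back bus traffic
system_increase = 0.27    # fractional rise in total system energy
bus_share_writeback = system_increase / (traffic_ratio - 1.0)
print(round(bus_share_writeback, 2))  # -> 0.09
```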

Our simulations find no significant evidence either for or against read-allocate in terms of energy or performance; we implement read-allocate to simplify the internal implementation. Similarly, we find that round-robin replacement performance is comparable to that of both LRU and random replacement, due to the large associativity.

5.3 Related Work

Most low-power cache literature [6][9][15][8] suggests improvements to the standard set-associative cache model of Figure 8. The architectural improvements proposed center around the concepts of sub-banking and row-buffering. Sub-banking retrieves only the required portion of a cache line, saving energy by not extraneously fetching data. Row-buffering fetches and saves an entire cache line to avoid future unnecessary

Figure 8: Traditional Set-Associative Cache Design (per-way tag memories and data blocks with block select; input: Address; output: RData)

Figure 9: Line-Size Energy Breakdown (percent of system energy, 0%-60%, for the data block, CAM access, and external memory at line sizes of 4, 8, and 16 words)


tag comparisons.

Our CAM-based cache design indirectly implements the concepts of sub-banking and row-buffering. The 2-word block size of our memory bank is similar to 2-word sub-banking. Tag check inhibition is similar to row-buffering: only one tag check is required for each cache-line access.

[8] presents a technique for reducing the energy of CAM-based TLBs by restricting the effective associativity of the parallel tag compare and modifying the internal CAM block. Due to time constraints, these modifications were not considered for our design.

6. Conclusion

This paper describes the implementation of a low-power Dynamic Voltage Scaling (DVS) microprocessor. Our analysis encompasses the entire microprocessor system, including the memory hierarchy and processor core. We use a custom benchmark suite appropriate for our target application: a portable embedded system.

Dynamic Voltage Scaling allows our processor to operate at maximal efficiency without limiting peak performance. Understanding the fluid relationship between energy and performance is crucial when making architectural design decisions. A new class of algorithms, termed voltage schedulers, is required to effectively control DVS.

We have described our cache design, presenting the architectural and circuit trade-offs between energy and performance for our application domain. For minimized energy consumption, we found that a CAM-based cache design is more energy efficient than a traditional set-associative configuration.

Acknowledgments

This work was funded by DARPA and made possible by cooperation with Advanced RISC Machines Ltd (ARM). The authors would like to thank Eric Anderson, Kim Keeton, and Tom Truman for their proofreading help.

7. References

[1] ARM 8 Data-Sheet, Document Number ARM DDI 0080C, Advanced RISC Machines Ltd, July 1996.

[2] T. Burd and R. Brodersen, "Energy Efficient CMOS Microprocessor Design," Proc. 28th Hawaii Int'l Conf. on System Sciences, 1995.

[3] A. Chandrakasan, A. Burstein, R. Brodersen, "A Low-Power Chipset for a Portable Multimedia I/O Terminal," IEEE Journal of Solid-State Circuits, Vol. 29, No. 12, December 1994.

[4] L. Goudge, S. Segars, "Thumb: Reducing the Cost of 32-bit RISC Performance in Portable and Consumer Applications," Forty-First IEEE Computer Society International Conference, 1996.

[5] K. Govil, E. Chan, H. Wasserman, "Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU," Proc. First Annual Int'l Conf. on Mobile Computing and Networking, 1995.

[6] P. Hicks, M. Walnock, and R. Owens, "Analysis of Power Consumption in Memory Hierarchies," Proc. 1997 Int'l Symp. on Low Power Electronics and Design.

[7] T. Ishihara, H. Yasuura, "Voltage Scheduling Problem for Dynamically Variable Voltage Processors," Proc. 1998 Int'l Symp. on Low Power Electronics and Design.

[8] T. Juan, T. Lang, J. Navarro, "Reducing TLB Power Requirements," Proc. 1997 Int'l Symp. on Low Power Electronics and Design.

[9] M. Kamble and K. Ghose, "Analytical Energy Dissipation Models for Low Power Caches," Proc. 1997 Int'l Symp. on Low Power Electronics and Design.

[10] T. Kuroda et al., "Variable Supply-Voltage Scheme for Low-Power High-Speed CMOS Digital Design," IEEE Journal of Solid-State Circuits, Vol. 33, No. 3, March 1998.

[11] J. Montanaro et al., "A 160MHz 32b 0.5W CMOS RISC Microprocessor," 1996 IEEE Int'l Solid-State Circuits Conf.

[12] T. Pering and R. Brodersen, "Energy Efficient Voltage Scheduling for Real-Time Operating Systems," 4th IEEE Real-Time Technology and Applications Symposium, 1998, Works in Progress Session.

[13] T. Pering, T. Burd, and R. W. Brodersen, "The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms," Proc. 1998 Int'l Symp. on Low Power Electronics and Design.

[14] A. Stratakos, S. Sanders, R. Brodersen, "A Low-Voltage CMOS DC-DC Converter for a Portable Battery-Operated System," 25th Annual IEEE Power Electronics Specialists Conference, 1994.

[15] C. Su and A. Despain, "Cache Designs for Energy Efficiency," Proc. 28th Hawaii Int'l Conf. on System Sciences, 1995.

[16] M. Viredaz, "Itsy: An Open Platform for Pocket Computing," presentation from Digital Equipment Corporation's Western Research Laboratory.

[17] M. Weiser, "Some Computer Science Issues in Ubiquitous Computing," Communications of the ACM, Vol. 36, July 1993.


Transmission Line Clock Driver

Matthew E. Becker
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA, USA
[email protected]

Thomas F. Knight, Jr.
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA, USA

[email protected]

Abstract

Clock distribution is one of the major issues in digital design. Engineers want to distribute a square-wave with low skew and fast transition times across very wide chips, and they want to do so wasting as little power as possible. Common micro-processor chips, like the Alpha, throw 40% of their power into distributing a square-wave across a 1.5 cm die. This paper describes a new clock distribution technique utilizing resonant transmission lines that not only reduces clock skew and transition times, but also reduces power consumption by up to an order of magnitude over standard clock drivers.

1. Motivation

Over the years a number of trends have increased the power consumption of clock drivers. First, the clock capacitance has steadily increased as the die size and the gate capacitances have become larger. Second, the clock frequency is also increasing. Clock drivers on current micro-processor chips must drive a 4 nF load at 500 MHz. This burns almost 20 W, or 40% of the overall chip power [i].

Researchers have explored many ways of reducing power consumption. Arguably the simplest way has involved reducing the power supply voltage. This provides a square reduction in power, but adversely affects circuit speeds and increases sub-threshold leakage. Other methods temporarily stop driving unused sections of the chip. Unfortunately, this complicates system design and increases the amount of clock skew between chip sections.

On a different topic, a relatively new approach to coordinating different chips at a system level involves using a resonating cavity. Vernon Chi at the University of North Carolina resonates a uniform transmission line with a single sinusoid [ii]. Because the entire transmission line crosses zero simultaneously, this structure can synchronize individual chips throughout the system. Unfortunately, a sinusoidal waveform cannot be used to drive individual clock loads directly. Its transition times are just too long, and the large clock capacitance would interfere with transmission line operation.

This work takes the resonant approach the next step. I will describe a resonating transmission line which can drive the large clock capacitance directly. This requires resonating a square-wave instead of a single sinusoid. Unfortunately, a uniform transmission line driving a large capacitive load will not support the odd harmonics required to produce a square-wave. Instead the resonant line must be tuned to support each of the harmonics.

There are two major benefits of using a tuned transmission line to drive the clock load directly. First, the half crossing still occurs simultaneously across the central section of the transmission line. This translates to virtually no skew across a single die. Second, the power to drive the on-chip capacitance comes from the transmission line instead of the power supplies. Therefore, the power consumption falls to the amount burnt in the parasitic resistance of the transmission line. This results in a 10x reduction in power consumption.

The next section describes a standard clock driver used in the micro-processor industry. Section 3 delves into our proposed driver technique using resonant transmission lines. Then, the paper reviews some experimental work verifying this technique's viability.

2. Standard Clock Driver

A standard clock distribution structure appears in Figure 1. It is relatively simple. It includes a clock generator, a buffer and a distribution network. We have drawn the clock lines as transmission lines. Since the series resistance dominates over the inductance, the clock loads and lines must be modeled as distributed RC lines.

Figure 1. Standard Clock System For Large Chips (clock driver feeding distributed clock lines and clock loads)

Power to fill up the clock capacitance and pre-driver capacitance comes from the power supply. This power is dumped into the gnd supply when discharging the capacitances, explaining why the clock uses 40% of chip power.

3. Transmission Line Clock Driver

Our technique, shown in Figure 2, simply adds an external transmission line. We recover power by charging and discharging the final clock load not through the clock driver but with the transmission line. This also means that the clock buffers can be smaller, resulting in less pre-driver power.

Figure 2. New Transmission Line Clock Driver (resonating transmission line strapped to the chip and its clock driver through flip-chip connections)

Let me walk through a brief description of how the technique works. The central driver forces a square wave into the chip and the external transmission line. Because the driver is smaller, it cannot initially drive the system to Vdd. However, a reduced-height pulse flows down the length of the transmission line. The pulse travels to the open termination of the transmission line and reflects back towards the chip. The line length is such that the pulse in the transmission line will reach the driver exactly when it drives again. The result is an increase in the pulse height. Eventually the transmission line will be resonating a full clock pulse at a given clock frequency.
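The pulse build-up described above can be illustrated with a toy recurrence. This is our simplification, not the authors' model: each round trip the driver adds a fixed kick, while line loss attenuates the circulating pulse, so the height converges geometrically.

```python
# Toy model of resonant pulse build-up: each round trip the driver
# adds a kick of height `drive`, while loss scales the circulating
# pulse by `loss` (< 1). Steady state approaches drive / (1 - loss).
def pulse_height(drive, loss, trips):
    h = 0.0
    for _ in range(trips):
        h = h * loss + drive
    return h


# With a 0.1 kick and 10% round-trip loss, the pulse settles near 1.0:
print(round(pulse_height(0.1, 0.9, 100), 3))  # -> 1.0
```

The same geometric argument explains why a driver much smaller than a conventional buffer can still sustain a full-swing clock once the line is resonating.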

Taps using flip-chip technology strap the internal clock line to the external transmission line. A pin structure like flip-chip is useful because of its low inductance, short pin lengths and the even distribution of pins across the chip.

3.1 Tuning a Line Using Traps

Unfortunately, the simple structure presented in the previous section works only with an ideal transmission line. When an actual square-wave travels along a lossy transmission line, the different components of the waveform travel at different velocities based on their frequency. This results from the parasitic resistance, which varies linearly with the frequency of the waveform.

So in the actual transmission line one must introduce impedance variations along its length to reinforce the desired waveform. For example, one could include a version of the traps that are used in RF design [iii].

When a waveform reaches a trap, the low frequency components pass through, while the high frequency part reflects. The length of the trap affects what will pass and what won't. By changing the location of the traps, our transmission line can be tuned for each of the desired frequencies.

Figure 3 shows a simple model of a transmission line with a single trap. High frequencies are trapped on the left. By varying LHIGH one can change the resonant frequency of the left section. The lower frequency component of the signal can pass through the trap.

Therefore, the combined length of LHIGH and LLOW sets the lower resonant frequency.

Figure 3. Traps Used to Tune Resonant Frequencies (a section of length Lhigh carrying high and low frequencies, a trap, and a low-frequency section of length Llow)

3.2 Clock Skew and Rise Times

One of the very exciting prospects of this design regards clock skew. Ideally the entire resonating transmission line crosses through 1/2 Vdd at the same time. In real cases, the clock load, interconnect delay and dispersion within the transmission line introduce a minimal amount of skew, less than 10 ps. This performance figure is almost an order of magnitude better than what is achieved in the Alpha!

Unfortunately, though the center crossing is relatively well controlled, the differences in rise and fall times across the chip can be serious. Consider the situation in the frequency domain as shown in Figure 4. In the frequency domain, our clock waveform decomposes into a sum of sinusoids. The primary components are the 1st, 3rd and 5th harmonics. As we move farther away from the center of the chip, the amplitudes of the third and the fifth harmonics fall off quickly. This degrades the rise and fall times at the edge of the chip.
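The frequency-domain picture can be sketched numerically. The toy Python below is ours; the 4/(n*pi) amplitudes are the standard square-wave Fourier coefficients. It shows that attenuating the 3rd and 5th harmonics lengthens the 10%-90% rise, which is exactly the edge-of-chip degradation described above.

```python
import math

# Reconstruct the clock edge from the 1st, 3rd and 5th harmonics of a
# square wave; `att` scales the higher harmonics (1.0 keeps them,
# 0.0 removes them, mimicking the amplitude fall-off at the chip edge).
def wave(t, att):
    return sum((att if n > 1 else 1.0) * (4 / (math.pi * n)) * math.sin(n * t)
               for n in (1, 3, 5))


def rise_samples(att, steps=1000):
    """Crude 10%-90% rise, in phase samples over the first half cycle."""
    ts = [math.pi * i / steps for i in range(steps)]
    vals = [wave(t, att) for t in ts]
    peak = max(vals)
    lo = next(i for i, v in enumerate(vals) if v >= 0.1 * peak)
    hi = next(i for i, v in enumerate(vals) if v >= 0.9 * peak)
    return hi - lo


# Fewer samples means a faster edge; losing harmonics slows the edge.
print(rise_samples(1.0) < rise_samples(0.0))
```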

Figure 4. Frequency Domain View of Rise Time Problem (harmonic amplitude, from -1/2 Vdd to +1/2 Vdd, versus distance along the transmission line and across the chip)

For a 2 cm chip this technique provides reasonable rise times up to 1 GHz. Beyond this frequency, the reduced amplitude of the higher harmonics does not provide adequate rise times at the edge of the chip. One correction would be to shift the tap points (flip-chip connections) towards the center of the chip. Unfortunately, this introduces extra skew from the new tap points to the edge of the chip. Another correction would be to drive each tap point with independent transmission lines, but this would involve a more complicated structure.


4. Large Scale Mockup

In order to test this driver technique quickly and cheaply, we have built a low frequency version. This allows standard off-the-shelf parts to be used while still testing all the basic principles.

The large scale mockup has proven the viability of the transmission line clock driver technique. At 20 MHz the driver with uniform impedance reduces power consumption by a factor of 5.7 relative to a standard clock driver. Using a more complex transmission line saves a factor of 10 over the standard driver. Further, the quality of the waveforms is significantly better for the complex transmission line.

The next two sections describe the overall structure of our mockup and the testing method. Then the paper presents the results from each test case: the standard clock driver, the driver with a uniform transmission line, and the driver with a complex transmission line.

4.1 Structure

Figure 5 shows the basic schematic of the large scale mockup. It can be broken into two distinct parts: the external transmission line and the on-chip clock structure.

Figure 5. Standard Clock Driver (on-chip clock structure: driver, 5 cm wires and lumped CLOAD; external transmission line: five central 10 cm segments)

All driver configurations share the on-chip clock structure. As shown at the bottom, it contains a driver involving a pulse generator with variable frequency driving a pair of inverters. In the transmission line drivers, the frequency is set to the resonant frequency. For the standard driver, the impedance of each inverter is approximately 2 Ω. In order to test a configuration with a different inverter width, W, series resistance is added between the inverters and the load. Lumped elements, CLOAD, model the on-chip clock capacitance. The total load is 2.2 nF. The small, 5 cm wires represent the on-chip clock wires.

All configurations, including the standard driver, share the central section of the external transmission line. There are two reasons for this. First, the standard driver performed better with the external transmission line and so made a harder standard to beat. Second, for a real chip the clock impedance would probably be lower than the 5 cm segments by themselves, but definitely not lower than the impedance with the external transmission lines. The central section of the external transmission line includes the five 10 cm segments that match the length of the entire chip, 50 cm. This length is derived from the frequency scale-up factor of 500 MHz : 20 MHz, and the width of a standard chip, 2 cm. The sides of the external transmission line change extensively between configurations. As seen in Figure 5, the standard driver has no sides at all. The next configuration, in Figure 7, uses a uniform transmission line that is 4.6 m long. Figure 9 shows the case with a complex transmission line.
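The 50 cm figure follows directly from the scaling argument: the mockup runs 25x slower than a real 500 MHz part, so every electrical length stretches by the same factor.

```python
# Length scaling for the low-frequency mockup: wavelengths scale
# inversely with frequency, so physical dimensions scale by the
# frequency ratio.
real_freq_mhz = 500
mock_freq_mhz = 20
chip_width_cm = 2

scale = real_freq_mhz / mock_freq_mhz     # 25x slower
mock_chip_cm = chip_width_cm * scale
print(mock_chip_cm)  # -> 50.0 (the five 10 cm central segments)
```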

4.2 Testing Method

This section lists the steps involved in testing and evaluating each transmission line driver. It then describes how to normalize power relative to frequency, self-loading power and pre-driver power.

The first three steps are performed on the standard driver:

1) Measure the rise time, TSTD, for a given inverter width, WSTD

2) Measure the power, PSTD, at WSTD

3) Normalize PSTD

Then the following steps are performed on each transmission line driver:

4) Measure the rise time, T, for a given inverter width, W

5) Change W until T = TSTD

6) Measure the power, P, at W

7) Normalize P

8) Compare the normalized powers

The first three steps involve the standard clock driver, and provide a baseline against which to compare the transmission line drivers. In the final five steps, the driver width acts as the independent variable. As the width increases, the signal rise times improve while the power increases. Please recall that changing the width is simulated by adding extra series resistance. Step 5 sets the driver width such that the transmission line driver provides the same quality signal, based on rise time, as the standard driver. At this size one can compare the normalized powers.

Power must be normalized for three different effects. First, in order to compare powers obtained at different frequencies, the power is normalized relative to a frequency of 20 MHz. As seen in Equation 1, the three frequency terms, (20 MHz/F), (20 MHz/FSELF) and (20 MHz/FSTD), normalize each power. F, FSELF and FSTD represent the frequency for the current configuration, for the self-loading test case and for the standard driver, respectively.

Second, in order to compare configurations, one needs to include the self-loading power. Self-loading power is the power required to fill the source and drain capacitance of the driver. Because all the tests use the same inverter part, they all consume the same amount of self-loading power, even in the cases with a smaller, simulated inverter. Therefore, one must normalize the measured power to correct for the reduced self-loading when simulating a smaller inverter. This correction appears in the second term of Equation 1. (1 - W/WSTD) represents the ratio of the extra width to the total width. Multiplying this by the total self-loading leaves the power related to just the unused width. This can then be subtracted from the total power.

Finally, one needs to include pre-driver power. Pre-driver power represents the amount of power consumed in driving the input to the final inverter stage. Please note that in all of the configurations, this power comes from the pulse generator, not the power supply. Therefore, the pre-driver power is estimated by dividing the measured power in the standard case by an inverter scale-up factor of 4. The third term of Equation 1 provides this correction. Again, in order to compare pre-driver powers with different inverter widths, the pre-driver power has to be normalized relative to the inverter width. Therefore, we multiply the standard pre-driver power by the new width, W/WSTD.

PNORM = PTEST (20 MHz / FTEST) - (1 - W/WSTD) PSELF (20 MHz / FSELF) + (1/4) PSTD (W/WSTD) (20 MHz / FSTD)

Equation 1
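Equation 1 can be transcribed directly into code. The sketch below is our Python, using the Section 4.3 measurements (PSELF = 0.12 W at 16.4 MHz, PSTD = 0.19 W at 2 MHz) as constants; it reproduces the PNORM column of Table 1.

```python
# Normalization of Equation 1: frequency scaling, self-loading
# correction, and pre-driver estimate (1/4 of the standard power).
P_SELF, F_SELF = 0.12, 16.4   # self-loading test: W, MHz
P_STD, F_STD = 0.19, 2.0      # standard-driver pre-driver test: W, MHz


def p_norm(p_test, f_test, w_ratio):
    """Normalized power for a measured (p_test, f_test, W/WSTD)."""
    freq_scaled = p_test * (20.0 / f_test)
    self_loading = (1 - w_ratio) * P_SELF * (20.0 / F_SELF)
    pre_driver = 0.25 * P_STD * w_ratio * (20.0 / F_STD)
    return freq_scaled - self_loading + pre_driver


# Standard driver at 16.4 MHz (W/WSTD = 1) from Table 1:
print(round(p_norm(1.5, 16.4, 1.0), 1))  # -> 2.3
```

The same function with the 14 MHz standard-driver measurement (1.48 W) gives 2.6 W, matching the last row of Table 1.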

4.3 Standard Clock Driver

This section describes a few test cases used to establish a baseline for evaluating later designs. These measured results are substituted into Equation 1. Table 1 shows the power consumption for a few important cases, as well as the rise times for the edge and center of the chip. The self-loading test simply involves disconnecting the clock load before power is measured.

                 Center T   Edge T    P        PNORM
Self @16.4 MHz   -          -         0.12 W   -
Std  @2 MHz      15.6 ns    11.6 ns   0.19 W   2.4 W
Std  @16.4 MHz   13.3 ns    10 ns     1.5 W    2.3 W
Std  @14 MHz     14.7 ns    11.4 ns   1.48 W   2.6 W

Table 1. Mock-up of Standard Clock Driver

The first test provides the self-loading power and frequency: PSELF = 0.12 W and FSELF = 16.4 MHz. The second test provides the pre-driver power and frequency: PSTD = 0.19 W and FSTD = 2 MHz. Once these values are substituted into Equation 1, things can be simplified down to Equation 2. As a check, note that for two different frequencies, namely the third and fourth tests shown in Table 1, the normalized powers calculated by Equation 2 are reasonably close.

PNORM = PTEST (20 MHz / FTEST) - 0.15 (1 - W/WSTD) + 0.48 (W/WSTD)

Equation 2

Figure 6 shows a plot of the last test case of the standard driver. This was run at 14 MHz, a period of 71.5 ns. Signal C1 corresponds to the center of the chip and signal C2 corresponds to the edge. Despite the seemingly reasonable rise times shown in Table 1 and appearing on the right of Figure 6, note the poor quality of the waveform for this given width. The next section will show how the transmission line drivers have cleaned up the quality appreciably.

Figure 6. Clock Signals of the Standard Driver

4.4 Driver with Uniform Transmission Line

This section describes the transmission line clock driver with uniform impedance. Figure 7 shows its associated external transmission line. The uniform line resonated a 14 MHz square wave. The impedance of the entire transmission line was fixed at approximately 10 Ω.

Figure 7. External Transmission Line with Uniform Impedance (five central 10 cm segments with 4.6 m sides, attached to the on-chip clock structure)

Table 2 shows the power for many different driver widths, W. In order to calculate the normalized power, W/WSTD and P were substituted into Equation 2. Based on the rise time and overall quality of the waveform, the point where the transmission line driver matches the standard clock driver appears when W/WSTD = 0.07. At this width, the transmission line clock driver consumes 0.42 W, saving a factor of 5.7 over the standard clock driver.

W/WSTD   Center T   Edge T    P        PNORM
0.04     14.1 ns    20.5 ns   0.32 W   0.33 W
0.06     14 ns      19 ns     0.35 W   0.39 W
0.07     10 ns      19 ns     0.37 W   0.42 W
0.15     6.6 ns     18.8 ns   0.38 W   0.49 W
0.33     4.6 ns     19.3 ns   0.40 W   0.62 W

Table 2. Mock-up of a Driver with a Uniform Transmission Line

Figure 8. Clock Signal of a Driver with a Uniform Transmission Line

Figure 8 shows a plot of the uniform transmission line driver at 14 MHz, a period of 71.5 ns. The width of the driver, W, is 0.15 times the width of the standard driver, WSTD. Signal C1 corresponds to the center of the chip and signal C2 corresponds to the edge. The quality of this waveform is quite superior to that found with the standard driver, even though the driver width and power consumed are significantly smaller.

4.5 Driver with Complex Transmission Line

This section describes the transmission line driver with a complex impedance. Figure 9 shows the external transmission line. The complex line reinforces the first and third harmonics of a 16.4 MHz square wave. In order to tune for the third harmonic, a pair of 20 cm sections are added to the central section. Beyond this, a 40 cm trap of higher impedance, 35 Ω, blocks the third harmonic. The outside sections, 1.4 m, tune the first harmonic.

Figure 9. External Transmission Line with Complex Impedance (central 10 cm segments flanked on each side by a 20 cm section, a 40 cm trap, and a 1.4 m end section; on-chip clock structure at center)

Table 3 shows the power consumption and rise times for many different driver widths. Based on the rise time and overall quality of the waveform, the point where the complex transmission line driver matches the standard clock driver appears when W/WSTD = 0.06. At this width the transmission line clock driver consumes 0.24 W, saving a factor of 10.0 over the standard driver.

W/WSTD   Center TR   Edge TR   PMEAS    PNORM
0.04     19.9 ns     21.0 ns   0.27 W   0.20 W
0.06     5.6 ns      14.5 ns   0.29 W   0.24 W
0.07     5.0 ns      14.2 ns   0.3 W    0.26 W
0.15     3.5 ns      12.1 ns   0.33 W   0.35 W
0.33     3.2 ns      12.1 ns   0.33 W   0.46 W

Table 3. Mock-up of a Driver with a Complex Transmission Line

Figure 10 shows a plot of the complex transmission line driver at 16.4 MHz, a period of 61.1 ns. The width of the driver, W, is set to 0.15 times the width of the standard driver, WSTD. Signal C1 corresponds to the center of the chip and signal C2 corresponds to the edge. The quality of this waveform is quite superior to that found with the previous drivers, even though the driver width and power consumed are significantly smaller.

Figure 10. Clock Signal of a Driver with a Complex Transmission Line


5. Conclusions

Clock distribution represents an important aspect of digital system design. Current techniques have stalled in their effort to reduce clock skew, transition times and power consumption. A new clock distribution technique utilizing resonant transmission lines with a complex impedance solves this problem. It can match the waveform quality of a standard clock driver for a tenth of the driver width and a tenth of the power.

[i] Bowhill and Gronowski. Practical Implementation Methods and Circuit Examples used on the ALPHA 21164. VLSI Circuits Workshop. Digital Semiconductor.

[ii] Chi, Vernon. Salphasic Distribution of Timing Signals for the Synchronization of Physically Separated Entities. US Patent #5,387,885. University of North Carolina. Chapel Hill, NC, 1993.

[iii] Hall, G.L. Trap Antennas. Technical Correspondence, QST, Nov. 1981, pp. 49-50.


Architectural Power Reduction and Power Analysis


Abstract

Architecture trade-off experiments require the availability of an accurate, efficient, high level power estimation tool. We have developed such a tool that provides cycle-by-cycle power consumption statistics based on the instruction/data flow stream. While the approach only assumes a high level definition of the architecture, it does require that the energy consumption of the functional units has been previously characterized. The accuracy of our estimation approach has been validated by comparing the power values our tool produces against measurements made by a gate level power simulator of a commercial processor for the same benchmark set. Our estimation approach has been shown to provide very efficient, accurate power analysis. Using this same power estimation technique, several architecture level trade-off experiments for various architectures have been performed.

1.0 Introduction

Power dissipation has become a critical issue in processor design. When designing high performance, low power processors, designers need to experiment with architectural level trade-offs and evaluate various power optimization techniques. Thus, a tool for doing efficient and accurate architectural level power estimation becomes indispensable. Most of the research in this area falls in the category of empirical methods which "measure" the power consumption of existing implementations and produce models based on those measurements. This macromodeling technique can be subdivided into three sub-categories. The first approach [1] is a fixed-activity macromodeling strategy called the Power Factor Approximation (PFA) method. The energy models are parameterized in terms of complexity parameters and a PFA proportionality constant. Thus, the intrinsic internal activity is captured through this PFA constant. This approach implicitly assumes that the inputs do not affect the switching activity of the hardware block.

___________________
This work is supported in part by a grant from the National Science Foundation (MIP-9705128) and by Hitachi America Ltd. The authors can be contacted at [email protected], [email protected], or [email protected]

To remedy this weakness of the fixed-activity approach, activity-sensitive empirical energy models have been developed. They are based on predictable input signal statistics, such as those used in the SPA method [2][3][4]. Although the individual models built in this way are relatively accurate (the error rate is 10%-15%), overall accuracy may be sacrificed when correct input statistics are unavailable or the interactions cannot be modeled correctly.

The third empirical approach, transition-sensitive energy models, is based on input transitions rather than input statistics. The method presented in [5] assumes an energy model is provided for each functional unit: a table containing the power consumed for each input transition. The authors give a scheme for collapsing closely related input transition vectors and energy patterns into clusters, thereby reducing the size of the tables. A significant reduction in the number of clusters, and also the size of the tables and the effort necessary to generate them, can be obtained while keeping the maximum error within 30% and the root mean square error within 10%-15%. After the energy models are built, it is not necessary to use any knowledge of the unit's functionality or to have prior knowledge of any input statistics during the analysis.
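As an illustration of the transition-sensitive idea, the sketch below is ours, not the model of [5]: the clustering key (input Hamming distance) and the per-cluster energies are invented for the example, but the lookup structure mirrors the table-of-transitions approach described above.

```python
# Toy transition-sensitive energy model: map an input transition
# (prev, cur) to a cluster via Hamming distance, then look up the
# cluster's characterized energy instead of storing every transition.
def hamming(a, b, bits=8):
    """Number of differing bits between two `bits`-wide inputs."""
    return bin((a ^ b) & ((1 << bits) - 1)).count("1")


# Hypothetical per-cluster energies (pJ) indexed by Hamming distance.
CLUSTER_ENERGY = {0: 0.0, 1: 1.2, 2: 2.1, 3: 2.9, 4: 3.5,
                  5: 4.0, 6: 4.4, 7: 4.7, 8: 5.0}


def unit_energy(prev_in, cur_in):
    return CLUSTER_ENERGY[hamming(prev_in, cur_in)]


print(unit_energy(0b00001111, 0b00001110))  # -> 1.2 (one bit flips)
```

Clustering keeps the table at 9 entries here instead of 2^16 raw transition pairs, which is the table-size reduction the authors quantify.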

Our work follows the third approach in the development of a power estimator for a commercial processor. The estimator imitates the behavior of the processor in each clock cycle as it executes a set of benchmark programs. It also collects power consumption data on the functional units (ALU, MAC, etc.) exercised by the instruction and its data. The results of our power estimator are compared with the power consumption data provided by the manufacturer [6]. This comparison confirms the accuracy of our power estimation technique [7]. To show the application of the estimator, several architecture trade-off experiments have been performed.

The rest of this paper consists of four sections. Section 2 presents the power estimation approach. Section 3 overviews the processor architecture and discusses the validation results. Section 4 shows several architecture trade-off experiments. Finally, Section 5 draws the conclusions.

2.0 Power Estimation

An overview of the power estimator is shown in Figure 1. The inputs of the estimator are a benchmark program and energy models for each functional unit. The architectural simulator consists of several parts:

An Architectural Level Power Estimator
Rita Yu Chen, Mary Jane Irwin, Raminder S. Bajwa
Department of Computer Science and Engineering, The Pennsylvania State University
Semiconductor Research Laboratory, Hitachi America Ltd.
June 1998


the assembler, the controller, and the datapath, implemented as many functional units (the small blocks labeled m).

The controller generates control signals for each functional unit as required by the instructions. For example, when a load instruction moves data from the memory to the IDB bus, the "MEMtoIDB" control bit will be set. As the control signals are set, the appropriate functional units are activated. For each unit, there is a function routine which gathers all event activities. Figure 2 shows an example. In this example, when a control signal is set, the IDB bus will receive data from one of its sources. This input data to the functional unit will be used to determine its power consumption.

The energy consumed by each functional unit per access can be computed as shown in the equation below, where C_m is the switch capacitance per access of the functional unit and V_dd is the supply voltage. For each active functional unit, C_m is calculated from the energy model of the unit based on the previous and present input vectors. The functional units can be grouped into two classes: bit-dependent and bit-independent functional units. Here, the terms energy and power are used interchangeably.

In the bit-dependent functional units, the switching of one bit affects other bit slices' operations. Typical bit-dependent functional units are adders, multipliers, decoders, multiplexers, etc. Their energy characterization is based on a lookup table consisting of a full energy transition matrix where the row address is the previous input vector, the column address is the present input vector, and the matrix value is the switch capacitance. All combinations of previous and present input vectors are contained in one table. A major problem is that the size of this table grows exponentially in the size of the inputs. A clustering algorithm [11] solves this problem, and the sub-problems associated with it, by compressing similar energy patterns. Table 1 shows an example of an uncompressed/compressed energy table for a 2:1 multiplexer. The capacitance data in the energy characterization table is obtained from a switch level/circuit level simulation of a circuit level/layout level implementation of the functional unit.

In the bit-independent functional units, the switching of one bit does not affect other bit slices' operations; examples are registers, logic operations in the ALU, memories, etc. The total energy consumption of the functional unit can be calculated by summing the energy dissipation of the individual bits. To model the energy dissipation of a bus, the load capacitance of each bit is used as an independent bit characterization. Furthermore, each bit is assumed to have the same load capacitance. The product of the load capacitance and the number of bits is the C_m in the energy equation.
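A minimal sketch of the bit-independent bus model, under one reading of the text in which C_m is the per-bit load capacitance times the number of bits that actually switch; the capacitance and voltage values are illustrative, not taken from the paper.

```python
# Bit-independent energy model for a bus: each line has the same
# (assumed) load capacitance, and total energy is the sum of the
# per-bit energies, i.e. Cm = c_bit * (number of switching bits).

VDD = 3.3        # assumed supply voltage (volts)
C_PER_BIT = 0.5  # assumed load capacitance per bus line (pF)

def bus_energy(prev_word, curr_word, width=32,
               c_bit=C_PER_BIT, vdd=VDD):
    """Energy for one bus transfer, E = 1/2 * Cm * Vdd^2."""
    # XOR exposes the bits that toggle between consecutive transfers.
    switching = bin((prev_word ^ curr_word) & ((1 << width) - 1)).count("1")
    cm = c_bit * switching
    return 0.5 * cm * vdd * vdd
```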

Some energy models are built not from the complete functional unit but from smaller subcells of these units. For example, because a 4:1 multiplexer can be made from three identical 2:1 multiplexers, its energy dissipation can be calculated by using the energy model of a 2:1 multiplexer. This reduces (2^6)^2 = 4096 table entries to (2^3)^2 = 64 entries before compression. After building the subcell energy model, energy routines implement the operations needed to calculate the functional unit power dissipation from subcells.
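The subcell composition can be sketched as follows. The `mux2_energy` function is a hypothetical stand-in for the table-driven 2:1 model described above, and the input ordering (a, b, c, d, s1, s0) is an assumption made for illustration.

```python
# Composing a 4:1 multiplexer energy model from three 2:1 subcells:
# two first-level muxes (selected by s0) feed one second-level mux
# (selected by s1). Each subcell is charged via the 2:1 model.

def mux2_energy(prev_vec, curr_vec):
    # Hypothetical stand-in for the (2^3)^2-entry 2:1 mux table;
    # here we simply charge one unit per differing input bit.
    return sum(p != c for p, c in zip(prev_vec, curr_vec))

def mux4_energy(prev, curr):
    """prev/curr are (a, b, c, d, s1, s0) bit tuples for the 4:1 mux."""
    def split(v):
        a, b, c, d, s1, s0 = v
        lo = (a, b, s0)              # first-level mux on a/b
        hi = (c, d, s0)              # first-level mux on c/d
        out_lo = b if s0 else a      # propagate subcell outputs
        out_hi = d if s0 else c
        top = (out_lo, out_hi, s1)   # second-level mux
        return lo, hi, top
    # Total energy is the sum over the three 2:1 subcell transitions.
    return sum(mux2_energy(p, c) for p, c in zip(split(prev), split(curr)))
```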

3.0 Validation

To validate this power estimation technique, we have implemented it for a commercial processor which integrates a 32-bit RISC processor and a 16-bit DSP on a single chip. Figure 3 shows its main block diagram. The processor core includes the X-memory, the Y-memory, the buses, the CPU engine and the DSP engine. The CPU

FIGURE 1. Power estimator overview (a benchmark program and per-unit energy models feed the architectural simulator, comprising an assembler, a controller, and datapath functional units, which produces the power consumption output)

FIGURE 2. Event activities in the IDB

    toIDB()
        if ctr_bit[MEMtoIDB] is set:
            read data from memory; put data on IDB;
        if ctr_bit[MWBtoIDB] is set:
            MWB buffer puts data on IDB;
        if ctr_bit[BIFtoIDB] is set:
            BIF (Bus InterFace) puts data on IDB;

E_m = (1/2) * C_m * V_dd^2

TABLE 1. Energy table for a 2:1 multiplexer

    Uncompressed          Compressed
    000 000  0.00         000 0xx  0.00
    000 001  0.00         000 100  0.04
    000 010  0.00         000 101  0.05
    000 011  0.00         000 110  0.04
    000 100  0.04         000 111  0.05
    000 101  0.05         . . .
    000 110  0.04
    000 111  0.05
    001 000  0.00
    . . .


engine includes the instruction fetch/decode unit, a 32-bit ALU, a 16-bit Pointer Arithmetic Unit (PAU), a 32-bit Addition Unit (AU) for PC increment, and 16 general purpose 32-bit registers. While the ALU calculates the X-memory address, the PAU can calculate the Y-memory address. Additional registers in the CPU support hardware looping and modulo addressing. The DSP engine contains a MAC, an ALU, a shifter, and 8 registers (six 32 bits wide and two 40 bits wide).

A set of simple synthetic programs, listed in Table 2, is used as the benchmarks to validate our power analyzer. These are the same benchmarks used to test the processor in [6]. Power consumption of the instruction loop buffer (ILB), memories, buses, ALU, the multiplier and other functional units is collected as the benchmarks are run. The data values used were those which maximized the switching in the datapath. In Table 2, "padd pmuls movx+ movy" contains four separate operations executed simultaneously: an addition, a multiplication, two loads and two address increments. "Padd pmuls movx movy" is the same as the previous instruction except that it contains no address increments.

Figure 4 compares the results generated by our power estimator with the data presented in [6]. The power consumption data of each program is normalized to that of the pow034 benchmark. For most of the benchmarks our simulator produces results very close to those reported in [6]. However, for both the pow035d and pow035c benchmarks, the power consumption is underestimated by our simulator. Since we did not have access to the design of the control unit of the processor, average data is used to characterize its power consumption. Also, a statistical power consumption value was used in estimating clock power, capturing the average value of the clock drive circuitry and clock distribution tree; power consumed by the gated clock logic is not captured. These two benchmarks use the least power overall, so clock and control unit power account for a higher percentage of their total power consumption. Overall, the average error rate of power analysis by our simulator is 8.98%.

4.0 Architectural Level Trade-offs

Using our power estimator, we have experimented with various architectural level trade-offs. For example, the DSP multiplication operation can consume 67.8% of the DSP engine power and 34.4% of the total power, as shown in Figure 5 for the pow035 benchmark. Thus, not surprisingly, reducing the power consumption of the MAC unit would lead to a significant power reduction for programs with a number of multiplication operations.

More architectural level experiments [9] have been performed using a power estimator based on a 5-stage pipelined DLX processor [10]. The power estimator

TABLE 2. The Power Benchmarks

    Program   Use ILB   Main Loop Operation
    pow032    no        padd pmuls movx+ movy+
    pow033    yes       padd pmuls movx+ movy+
    pow035a   yes       padd pmuls movx movy
    pow034    no        padd pmuls
    pow035    yes       padd pmuls
    pow035b   yes       padd
    pow035d   no        nop
    pow035c   yes       nop

FIGURE 3. Block diagram of the commercial processor (X and Y memories, each with 24KB ROM and 8KB RAM; buses IDB[31:0], IAB[31:0], XDB[15:0], XAB[15:1], YDB[15:0], YAB[15:1]; CPU with fetch & decode, ALU, PAU, AU, and register file; DSP with decode, MAC, ALU, and register file)

FIGURE 4. Core (DSP+CPU) estimated and reported power (normalized power, 0.0 to 2.0, for benchmarks pow032, pow033, pow035a, pow034, pow035, pow035b, pow035d, pow035c; Reported Core vs. Estimated Core)

FIGURE 5. Power comparison for MAC unit (pow035: DSP 51.2%, BUS 12.1%, Memory 20.1%, CPU 16.6%; pow035b: DSP 16.5%, BUS 12.2%, Memory 20.3%, CPU 16.6%; a 34.4% reduction)


identifies the most power hungry modules: the instruction and data caches, the pipeline registers and the register file. To reduce register file accesses, an architecture optimization approach, early bypass detection, places the register forwarding logic in the ID (Instruction Decode) stage. As soon as an instruction is decoded, bypass detection is performed before the register access. If the machine is going to forward a source register from a pipeline register, then the register file access will not occur, eliminating unnecessary register reads. Another modified register forwarding approach, source to source forwarding, avoids fetching a source register if it is "alive" in the pipeline register. However, for both approaches, while register file power is reduced, the extra forwarding control hardware will consume energy. Figure 6 shows the reduction of register file accesses when using early bypass detection, source to source forwarding, or both. The total power consumed in the register file, pipeline latches and forwarding hardware is compared in Figure 7.
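A rough sketch of what early bypass detection looks like in a cycle-level simulator: decode the sources, check them against the destinations already in flight, and suppress the register file read when a forward is guaranteed. The field names and the two-producer window (EX and MEM) are illustrative assumptions, not the paper's DLX implementation.

```python
# Early bypass detection in the ID stage: before reading the register
# file, check whether each source operand will be forwarded from a
# pipeline register; if so, skip (and count) the avoided read.

def id_stage_reads(src_regs, ex_dest, mem_dest, stats):
    """Return the source registers that actually need a register file read.

    src_regs: source register numbers of the decoded instruction
    ex_dest / mem_dest: destination registers of the instructions
        currently in EX and MEM (None if they write no register)
    stats: dict accumulating the count of saved accesses
    """
    reads = []
    for r in src_regs:
        if r in (ex_dest, mem_dest):   # value will come via forwarding
            stats["saved_reads"] = stats.get("saved_reads", 0) + 1
        else:
            reads.append(r)            # genuine register file access
    return reads
```

The energy accounting then charges the register file only for the returned reads, while a separate (small) cost models the added comparator logic, mirroring the trade-off the paper describes.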

In other architectural level experiments, we have modified the normal load/store architecture to support one operand from memory by adding a memory addressing mode for several ALU instructions. The approach reduces not only the number of register file accesses but also the instruction count. As a result, it lowers register file switching activity and instruction fetch power. However, the register memory addressing mode may increase the clock cycle time or the number of clocks per instruction (CPI). This scheme also requires a more complex control unit, which may consume more power. Two implementation options have been considered: a 5-stage pipeline and a 6-stage pipeline. In the 5-stage implementation the MEM stage precedes the EX stage and must include address calculation hardware; this may adversely affect clock cycle time. In the 6-stage implementation the MEM stage and the EX stage are again swapped, and a separate address calculation stage precedes the MEM stage; this will affect the CPI. The energy saving comparison in Figure 8 shows that the 5-stage pipeline reduces switch capacitance by 5% to 13%. But the 6-stage pipeline consumes as much or more energy than the normal architecture due to the additional pipeline register.

Unlike the above approaches, which change the architecture, a register relabeling technique [11] modifies only the compiler. This approach encodes the register labels such that the switching costs of all the register label transitions are minimized. It can reduce the energy of the pipeline registers, register file and the instruction bus. To decrease the energy consumption of the switching from 0 to 0 or from 1 to 1, flip-flops which can guard clock signals are used to implement registers. However, because of the additional logic, the 1 to 0 or 0 to 1 transitions take more switch capacitance than in unguarded registers. Overall, Figure 9 shows that energy reduction is possible by using the register relabeling technique.

FIGURE 6. Decrease in register file accesses (accesses x10^5 for Histogram and QuickSort; Original, Early Bypass, Src to Src, Both)

FIGURE 7. Decrease in switch capacitance (x10^6 pF for Histogram and QuickSort; Original, Early Bypass, Src to Src, Both)

FIGURE 8. Energy saving of register memory addressing mode (switch capacitance, x10^5 pF, for I$, RegFile, Idecode, PipeReg; Original, 5-stage RegMem, 6-stage RegMem)

FIGURE 9. Energy improvement of register relabeling (% improvement for PipeReg and Decoder; GAPXY, MatrixMul, Heapsort)


5.0 Conclusions

In this paper, we described an accurate, high level power estimation technique. Our technique needs to do energy characterization for each functional unit (ALU, MAC, etc.) only once. When the operations of the functional units are specified by the instruction stream, the power consumed in each functional unit is calculated from its energy model. The simulation results for a commercial processor clearly verified the correctness of our power analysis methodology. Without loss of accuracy, the running time of power estimation for each synthetic benchmark program was less than 90 CPU seconds. This time saving feature of our power estimator is beneficial for reducing the design cycle time. The simulator is also applicable for power optimization research at the architecture, system software, and application software levels.

Research continues on several fronts. We have promising results on improving the estimation of control power using a hierarchical technique [12]. Since clock power consumption is typically the largest portion of the total chip power, it needs to be accurately modeled. To characterize clock power, a statistical energy model of the ungated clock drive logic/distribution tree will be combined with transition-sensitive models of different clock gating schemes. Finding a more accurate high level interconnect power model is also important. More complex memory structures (e.g., caches) can be modeled using methods like the one presented in [13].

Bibliography

1 S.R. Powell and P.M. Chau, "Estimating power dissipation of VLSI signal processing chips: The PFA technique," in VLSI Signal Processing IV, 1990.

2 P. Landman and J. Rabaey, "Power estimation for high level synthesis," in Proc. of EDAC-EUROASIC, 1993.

3 P. Landman and J. Rabaey, "Black-box capacitance models for architectural power analysis," in Proc. of Inter. Symp. on Low Power Electronics and Design (SLPED), April 1994.

4 P. Landman and J. Rabaey, "Activity-sensitive architectural power analysis for the control path," in Proc. of SLPED, April 1995.

5 H. Mehta, R.M. Owens, and M.J. Irwin, "Energy characterization using clustering," in Proc. of Design Automation Conf., June 1996.

6 R. Bajwa, N. Schumann, and H. Kojima, "Power analysis of a 32-bit RISC microcontroller integrated with a 16-bit DSP," in Proc. of SLPED, Aug. 1997.

7 R.Y. Chen, R.M. Owens, M.J. Irwin, and R.S. Bajwa, "Validation of an architectural level power analysis technique," to appear in Proc. of Design Automation Conf., June 1998.

8 H. Mehta, R.M. Owens, and M.J. Irwin, "Instruction level power profiling," in Proc. of Inter. Conf. on Acoustics, Speech and Signal Processing, Aug. 1996.

9 Atul Kalambur, Micro-architectural techniques for low power processors, M.S. Thesis, Penn State University, 1997.

10 J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1996.

11 Huzefa Mehta, System Level Power Analysis, Ph.D. Thesis, Penn State University, 1996.

12 R.Y. Chen, M.J. Irwin and R.S. Bajwa, "Architectural level hierarchical power estimation of control units," submitted to ASIC'98.

13 J. Kin, M. Gupta and W. Mangione-Smith, "The filter cache: An energy efficient memory structure," in Proc. of MICRO'97, Nov. 1997.


Multivariate Power/Performance Analysis For High Performance Mobile Microprocessor Design

George Z.N. Cai+ Kingsum Chow++ Tosaku Nakanishi+ Jonathan Hall+++ Micah Barany+

+ Intel Corp. MS RA2-401, MHPG Low Power Design Lab, Hillsboro, Oregon 97124
++ Intel Corp. MS EY2-09, MicroComputer Research Labs, Hillsboro, Oregon 97124
+++ Intel Corp. MS RA2-401, MD6 Division, Hillsboro, Oregon 97124

1 Abstract

Power consumption has become one of the important architectural and design issues, besides the primary goal of performance, in high performance microprocessors. With increasing clock frequency and design complexity in high performance mobile microprocessors, power consumption directly impacts the quality and reliability of future generations of mobile microprocessors. Traditional performance and power analysis cannot provide enough information to support decisions on power/performance tradeoffs in the architecture and design of high performance mobile microprocessors. Multivariate power/performance analysis provides a systematic method to identify and analyze power consumption and the corresponding performance impact. It can be a useful method to support power/performance tradeoffs in microprocessor architecture and design.

2 Challenge to power/performance analysis

With dramatically increasing microprocessor frequency and architectural and design complexity, microprocessor power analysis and reduction has become one of the most important issues in high performance microprocessor design [1,3,4,6,7]. The power limitation of high performance mobile microprocessors is already critical to design. The high power consumption of a high frequency microprocessor not only increases the cost of mobile computing system integration, it also reduces the battery life time and generates additional heat in compact mobile computing systems. In other words, a mobile computing system's quality and reliability can be affected by the high power consumption of a high frequency microprocessor. To keep the microprocessor from overheating, especially on application programs with high power consumption characteristics, we must explore new architectures and techniques, beyond the traditional clock gating method, to further reduce microprocessor power consumption. To achieve minimal power consumption in a high performance mobile microprocessor, we have used multivariate analysis between power consumption and performance in the microprocessor architecture and design. In this paper, we present a method which employs multivariate analysis to aid the design of microprocessors with less power consumption.

There are two fundamental questions in a high performance low power microprocessor design: (1) How do we identify targets for power reduction within the microprocessor architecture and design? This basically asks where the power is heavily consumed and why. It challenges the entire microprocessor architecture and micro-architecture from the point of view of power consumption. (2) How do we reduce power consumption at the identified architecture and design points with either minimal or no performance impact? This essentially asks for a power/performance optimization in a complex mobile microprocessor. It involves determining new tradeoffs in the design of the micro-architecture between performance and power

This work was supported by Intel Corporation. For information on obtaining reprints of this article or on commercial use of this method, please contact Tosaku Nakanishi at [email protected].


consumption. The traditional microprocessor architecture tradeoff decisions are based on cost/performance analysis. These decisions are good for microprocessor performance; they may or may not be good for power saving. In high performance mobile microprocessors, the power consumption requirement is critical. The architecture and design tradeoff criteria and priorities of mobile microprocessors are, in this sense, very different from those of desktop computers and servers. Conventional microprocessor architecture and design analysis cannot provide enough information and analysis data to make optimal design decisions for high performance mobile microprocessors.

3 Multivariate power/performance analysis

Multivariate power/performance analysis is a multivariate statistical analysis method. It is concerned with the collection and interpretation of simulation data from multiple variables of microprocessor power consumption, architecture and design simultaneously [2,8]. Multivariate power/performance analysis helps microprocessor architects and designers to identify possible power reduction targets within complex microprocessor architectures and micro-architectures. It also provides an easy and quick way to predict the potential power consumption and performance impact of architecture and design changes. This power consumption and performance prediction and estimation is used to keep the architectural modifications on track before complex, detailed microprocessor power/performance simulations are carried out.

There are 8 steps in setting up and running multivariate power/performance analysis:
1. Create the power/performance model.
2. Qualify the power/performance model.
3. Select representative variables.
4. Select target software and workload.
5. Collect the simulation results and build the analysis database.
6. Identify the power reduction target and logic.
7. Modify the micro-architecture, logic, and circuit implementation.
8. Verify the power reduction and performance impacts.

One method of multivariate power/performance analysis is multivariate correlation analysis. In a traditional correlation analysis, the correlation between two variables is studied. In a multivariate correlation analysis, the correlations of all pairs of variables are studied.

Figure 1. The flow chart of implementing multivariate power/performance analysis for high performance microprocessor design (model creation and qualification by the architecture, logic design and circuit teams; workload, benchmark and variable selection; analysis database construction; power reduction target identification; design modification; and verification and trend prediction)


Multivariate power/performance correlation analysis is a specialization of multivariate correlation analysis. In this specialization, the correlations between architectural parameters, different levels of power consumption, and performance measurements are studied.

When we measure a group of microprocessor architectural parameters and their corresponding power consumption, we are interested not only in the central tendency and variation of each individual architectural parameter, but also in an assessment of the association among performance and different levels of power consumption, such as full chip power consumption and logic cluster power consumption. Multivariate power/performance correlation analysis focuses on the relationships between variations of performance and power consumption for each architectural parameter. Mathematically, the relationships are measured by correlation coefficients, which can have values between -1 and 1, inclusive. A strong relationship is indicated by a large positive or negative correlation coefficient, while a weak relationship is indicated by a correlation coefficient that is close to 0. The correlation coefficient by itself cannot determine whether an architectural parameter causes high power consumption or performance; it only indicates that the architectural parameter has a tight grip on power consumption or performance. The correlation coefficients of those parameters are not systematically controlled by architects or designers but by complex factors that need not be precisely determined. In other words, the multivariate coefficient values reflect the power consumption and performance resulting from the compounded effects of many microprocessor architecture and design factors. This is one of the reasons that multivariate power/performance analysis is different from other power and performance analysis methods.
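The correlation step itself can be sketched in plain Python: given per-run samples of architectural event counts, performance, and power, compute the Pearson correlation coefficient for every pair of variables. The variable names here are illustrative.

```python
# Pairwise Pearson correlation over simulation samples, the core
# computation behind multivariate power/performance correlation analysis.
import math

def pearson(xs, ys):
    """Correlation coefficient in [-1, 1] for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(samples):
    """samples: dict of variable name -> list of per-run values."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Example: correlate a parameter against chip power and performance.
runs = {"LoadBufFull": [10, 20, 30], "chip_power_mW": [100, 140, 180],
        "perf": [1.0, 0.9, 0.8]}
```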

Multivariate power/performance analysis does not assume that any particular architectural parameter has a causal relation with power consumption at the full chip level or logic cluster level. By observing how the architectural parameters, power consumption, and performance vary with each other in a set of experiments, the analysis summarizes complex relationships among variables without deriving all possible equations. However, it does not establish a causal relation between an architectural parameter and power consumption or performance. In multivariate power/performance analysis, we make no causal assumption during the correlation analysis. This reflects the fact that the power consumption and performance of a high performance mobile CPU are related to many architectural and design issues; a simple causal relation may not always fit a real microprocessor design. Correlation studies can be common, inexpensive, and valuable power/performance studies in general. They do, however, give an indication of the effect of altering an architecture and design on power consumption and performance.

Table 1 summarizes the classification of architectural parameters based on their correlations with performance and power consumption.

Class 1 correlation for a group of architectural parameters means that the architectural parameters are strongly correlated to both performance and power consumption. If the correlation coefficients are both positive or both negative, reducing power consumption is likely to reduce performance. On the other hand, if the correlation coefficients are of opposite signs, reducing power consumption and increasing

Table 1. Classification of architectural parameters based on their correlations with microprocessor performance and power consumption

    Class of architectural   Correlation with           Correlation with
    parameters               microprocessor             microprocessor
                             performance                power consumption
    Class 1                  Strong                     Strong
    Class 2                  Strong                     Weak
    Class 3                  Weak                       Strong
    Class 4                  Weak                       Weak


performance are possible. Not surprisingly, the latter case is rare.

One example of class 1 correlation is that the bus unit power/performance correlation coefficients (the coefficient with full chip power consumption and the coefficient with overall performance) may have opposite signs. This means that higher power consumption in the bus unit goes with lower overall microprocessor performance. One reason could be too many inaccurate speculative instruction fetches and memory data accesses failing to deliver correct instructions and data to the internal core for execution: even though the bus unit is extremely busy, the internal core is still idle. When the incorrect speculations are reduced, the bus unit power consumption is reduced and the instructions and data can be delivered to the execution units effectively.

Class 2 correlation for a group of architectural parameters means that the architectural parameters are strongly correlated to performance but weakly correlated to power consumption. While such architectural parameters are good candidates for performance, they are not good candidates as power reduction targets.

Class 3 correlation for a group of architectural parameters means that the architectural parameters are strongly correlated to power consumption but weakly correlated to performance. Since such architectural parameters have minimal impact on performance, they are excellent candidates for power reduction without severe performance degradation. We may use a new micro-architecture or circuit with aggressive power reduction to replace the existing micro-architecture or circuit represented by this group of architectural parameters. The class 3 correlation provides opportunities for architects and designers to further reduce the power consumption of a very complex microprocessor. The best part of the class 3 correlation is that it tries to isolate power consumption problems from complex performance impacts.

Class 4 correlation for an architectural parameter means that the architectural parameter is weakly correlated to both performance and power consumption. Obviously, such architectural parameters should not be selected for performance optimization or as power reduction targets.
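The Table 1 classification can be mechanized as follows. The numeric threshold separating "strong" from "weak" is a hypothetical choice, since the paper gives no cutoff.

```python
# Classifying an architectural parameter into the four Table 1 classes
# from its correlation with performance and with power consumption.

STRONG = 0.5  # hypothetical |correlation| threshold; not from the paper

def classify(perf_corr, power_corr, strong=STRONG):
    p = abs(perf_corr) >= strong   # strongly tied to performance?
    w = abs(power_corr) >= strong  # strongly tied to power?
    if p and w:
        return 1  # coupled: power cuts likely cost performance
    if p:
        return 2  # performance knob, poor power-reduction target
    if w:
        return 3  # best power-reduction target, low performance risk
    return 4      # not worth optimizing for either
```

Scanning a table of (parameter, perf_corr, power_corr) rows with this function is exactly the "look for class 3" search the text describes.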

These four classes of power/performance correlations represent overall effects, such as micro-architecture, clock gating, etc. We have to mention that the power characteristics of the selected benchmarks and target software may affect the correlation in some cases. Therefore, the variation of a logic block/unit's correlation class between different benchmarks provides additional information about its power/performance behavior under different software executions. The average power/performance correlation of a logic block characterizes its power/performance behavior in general.

In a microprocessor design, the correlation analysis must reach further details to uncover additional power reduction possibilities. For example, we may use the following table to survey the correlations among all related micro-architecture of the load buffer and store buffer.

LoadBufFull and StoreBufFull are two architectural parameters that have varying correlations with microprocessor performance and power consumption. The resident logic blocks and units are displayed along with their

Table 2. Two examples of architectural parameters and their correlations with performance and power consumption

    Architectural  Resident      Resident    Microprocessor   Block Power   Unit Power    Microprocessor
    Parameter      Logic Block   Logic Unit  Performance      Consumption   Consumption   Power Consumption
    LoadBufFull    Load buffer   Load/Store  ISPEC95 number,  mW,           mW,           mW,
                                             Corr. Value      Corr. Value   Corr. Value   Corr. Value
    StoreBufFull   Store buffer  Load/Store  ISPEC95 number,  mW,           mW,           mW,
                                             Corr. Value      Corr. Value   Corr. Value   Corr. Value


Microprocessor architects may use such a table to scan for class 3 architectural parameters and their corresponding resident logic blocks and units for power reduction opportunities. In our analysis, we have found that principal component analysis [2] is a good data reduction technique for multivariate power/performance analysis. Principal component analysis maximizes the variance of information captured in the first few major components. By examining the principal components instead of all the data, architects and designers can easily survey the simulation results with respect to performance and power consumption at different depths. Principal component analysis also enables us to build a power/performance prediction model. This prediction model captures the complex correlation among multiple design variables and power/performance within a high-performance microprocessor. The prediction model also provides a quick estimate of the power consumption and performance changes due to multiple architectural variable modifications. This prediction model is useful when microarchitecture changes are made experimentally: the prediction indicates whether power/performance is on the right track while the microarchitecture is being modified. By using such a prediction model, we can reduce the number of expensive and time-consuming simulations. In other words, the multivariate power/performance prediction model is a complementary method to microprocessor simulations.
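As a sketch of the data-reduction step, the following computes principal components of a runs-by-variables matrix with plain NumPy. The matrix shape, variable count, and choice of two components are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def pca(data, k=2):
    """Project standardized data onto its first k principal components."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize columns
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # largest variance first
    scores = z @ eigvecs[:, order[:k]]
    explained = eigvals[order[:k]] / eigvals.sum()     # fraction of variance
    return scores, explained

# Synthetic stand-in: 40 simulation runs, 6 power/performance variables.
runs = np.random.default_rng(1).normal(size=(40, 6))
scores, explained = pca(runs, k=2)
print(scores.shape)   # (40, 2): each run summarized by two components
```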

The collection of qualified multivariate data is extremely important. One limitation of this analysis is that unqualified data can mislead the multivariate power/performance analysis in a wrong direction. Unqualified data is a complex issue which can originate in the architectural simulator, the corresponding power modeling, the data collection method, or incorrect use of the multivariate power/performance method. This limitation leads us to another important topic: qualification and verification of the multivariate power/performance analysis. It is very useful to qualify the multivariate power/performance analysis results against the corresponding architecture performance and circuit simulation results. Theoretically, all the correlations in the multivariate power/performance analysis should be preserved across different design modeling stages, such as a high-level architecture model, an RTL model, and a layout model. Because all measurements may contain errors, this qualification and verification of correlations and predictions is a very complex issue.

From our experience, multivariate power/performance analysis helps us to identify power reduction targets in a very complex high-performance microprocessor design. For example, when multivariate power/performance analysis is applied to evaluate how speculative branch prediction impacts microprocessor power and performance, useful information can be obtained to make a correct design decision.

4 Acknowledgment

We would like to thank the many people involved in the multivariate power/performance analysis at Intel, especially Doug Carmean and Steve Gunther for their comments and help on research directions. We would also like to thank Wayne Scott, Pierre Brasseur, and Konrad Lai for their support, and Shih-Lien Lu for useful comments on earlier drafts of this paper.

5 References

[1] Abdellatif Bellaouar and Mohamed Elmasry, "Low-Power Digital VLSI Design: Circuits and Systems", Kluwer Academic Publishers, 1997.

[2] Kingsum Chow and Jason Ding, "Multivariate Analysis of Pentium® Pro Processor", Intel Software Developers Conference, Portland, Oregon, October 27-29, 1997, pp. 84-91.

[3] R. Mehra and J. Rabaey, "Behavioral Level Power Estimation and Exploration", Proceedings of the International Workshop on Low Power Design, Napa, California, pp. 197-202, April 1994.

[4] Gary Yeap, "Practical Low Power Digital VLSI Design", Kluwer Academic Publishers, 1998.

[5] C. Huang, B. Zhang, A. Deng, and B. Swirski, "The Design and Implementation of PowerMill", Proceedings of the International Symposium on Low Power Design, pp. 105-109, 1995.

[6] P. Landman and J. Rabaey, "Activity-sensitive Architectural Power Analysis", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 6, pp. 571-587, June 1996.

[7] Jan M. Rabaey, "Digital Integrated Circuits", Prentice Hall Electronics and VLSI Series, pp. 234-255, 1996.

[8] Sam K. Kachigan, "Multivariate Statistical Analysis", 2nd Edition, Radius Press, New York, 1991.

[9] J. Winer, "Statistical Principles in Experimental Design", 2nd Edition, McGraw-Hill, New York, 1971.


Power-Aware Architecture Studies: Ongoing Work at Princeton

Christina Leung, David Brooks, Margaret Martonosi, and Douglas W. Clark*

Depts. of Electrical Engineering and Computer Science*

Princeton University

Contact: [email protected]

Abstract

The effects of power consumption increasingly shape processor design choices not only in battery-operated devices, but also in high-performance processors. Unfortunately, when compared to the huge body of performance-oriented architecture research, much less research attention has been placed either on modeling power consumption at the architectural level, or on devising organizational mechanisms for reducing power consumption in high-end processors.

Our work seeks to address these issues in two main ways. First, we are developing a suite of high-level modeling tools that allow architects to manipulate designs and compare power tradeoffs in a manner similar to current performance-oriented architecture studies. In this paper we discuss the status of the work and our observations on how the needs of power modeling affect simulator speed and structure. Second, we are harnessing parts of this infrastructure to evaluate a technique for aggressively clock-gating arithmetic units in response to observed operand values. We find that many operand values in current programs are much smaller than the actual word width of the machines. As a result, dynamic techniques for adjusting functional unit width to match operand width are promising for reducing power consumption in these units.

1 Introduction

Current-generation high-end microprocessors can consume 30 watts or more of power in normal operation. System design becomes increasingly challenging as chips and systems get smaller and power consumption increases. While power consumption is clearly a significant worry in battery-operated portable devices, power is a pressing concern in desktop systems as well. Most notably, processors in the 30W power range require elaborate cooling mechanisms to keep them from overheating. With each new generation of chips, their implementation is increasingly shaped by attempts to keep power consumption and thermal load at manageable levels. High power is also a system cost issue: fancy heat sinks can be expensive, and increased power consumption also raises the cost of the power supply and capacitors required.

For all these reasons, the power consumption of high-end processors has been a major industry concern for several chip generations. Initial attempts to limit power have focused on relatively straightforward applications of voltage scaling and clock gating. With each new generation, however, more strenuous efforts must be made, and these efforts are beginning to include organizational attempts to limit power consumption. The academic architecture community, however, has been slow to respond to this situation. While the RISC revolution spurred an increased focus on quantitative evaluation of architectural tradeoffs, the quantities most often studied have been performance-oriented. The time has come for academics to consider both power and performance as important architecture metrics to be optimized together.

The work described here consists of two main thrusts. First, we are developing architectural-level tools to model and evaluate power consumption in high-performance processors. We are beginning to use these tools to evaluate power-aware organizational tradeoffs in processor design. Second, we are also researching mechanisms for dynamic operand analysis that allow processors to use arithmetic functional units more efficiently.

2 Modeling Power Tradeoffs in High-Performance Processors

The traditional academic approach to processor organization tradeoffs has been to consider their effect on performance. In cache studies, for example, one might evaluate a fast direct-mapped cache against a slower set-associative cache with a higher hit ratio. The design with superior performance would be the superior design, modulo worries about hardware complexity and area.

Academics have rarely considered power in these analyses, although they have used very informal estimates of chip area as a proxy for power. Thus, in the cache analysis above, caches requiring equal amounts of SRAM bits might be regarded as equivalent from the power standpoint. To first order, such estimates are useful, but as power issues become more important, more careful area analyses and resulting power tradeoffs are warranted.

With this in mind, we are building modeling infrastructure that supports power analyses at the architectural level. Using the SimpleScalar toolset as a starting point [2, 1], we have built expanded modules that include power models and rudimentary power accounting.


2.1 Cache Power Models

We have begun this effort by modifying SimpleScalar's cache model to account for power dissipation. In particular, we started by incorporating a cache power model first described in the literature by Su and Despain [5]. This model considers the cache in terms of three primary modules: the address decoding path, the cell array in which the data is stored, and the data I/O path by which data enters or leaves the cache.

Our simulator tracks current and previous data and address values on the lines entering and exiting each module. Since CMOS power dissipation is related to the frequency at which bits switch from 0 to 1 or vice versa, tracking current and previous values allows us to track the frequency of such bit changes.

While power modeling at this level is very implementation-dependent, our primary goal in building these models is not to provide absolute wattage estimates, but rather to provide higher-level intuition on which organizations use more or less power. Such relative comparisons can be obtained more easily than precise absolutes.
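The transition-counting idea can be sketched as follows. The bus width, class names, and address trace are made up for illustration; this is not the simulator's actual code.

```python
def bit_switches(prev, curr, width=32):
    """Number of bit positions that toggled between two bus values."""
    return bin((prev ^ curr) & ((1 << width) - 1)).count("1")

class SwitchCounter:
    """Accumulates toggles on one set of lines (e.g. the address bus)."""
    def __init__(self, width=32):
        self.prev, self.width, self.total = 0, width, 0

    def observe(self, value):
        self.total += bit_switches(self.prev, value, self.width)
        self.prev = value

addr_bus = SwitchCounter()
for addr in (0x1000, 0x1004, 0x1008, 0x2000):  # sequential strides, then a jump
    addr_bus.observe(addr)
print(addr_bus.total)  # → 7: the jump toggles more bits than the strides
```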

2.2 Implications of Power Modeling on Simulation Speed and Structure

Adding power modeling to the simulator has significant implications for the speed and structure of the simulator. Tracking bit changes within the decoding path requires that we store the last address referenced so we can compute bit switches between address decoding bits based on the current and previous cache address. More significantly, tracking bit switches in the cell array and the I/O path requires storing the cached data itself. SimpleScalar is usually used in a mode in which the functionality and timing of the cache are simulated without actually storing the data in the cache. The memory overhead related to data storage and bit-change calculations varies depending on the cache organizations and sizes, but can result in 25% or more increases in SimpleScalar process size. Similarly, runtime overheads can also be quite significant. Our unoptimized prototype version of the cache power simulator currently runs about 50% more slowly than the original sim-cache shipped with SimpleScalar.

2.3 Summary

Our work on power modeling has focused thus far on the caches, but we intend to broaden this modeling infrastructure considerably. As we do, we note that relative power analysis and tradeoffs between like structures (e.g., instruction cache and data cache) are fairly straightforward, but tradeoff analysis among different structures needs more information. What is required is either an absolutely accurate power model of each structure in the tradeoff, or a kind of accurate "exchange rate" that correctly estimates how two (or more) structures can be compared. For most academics, the latter is more easily obtained. An example might be the claim that some number of kilobytes of instruction cache was equivalent in power to some other number of entries in the reorder buffer. Our goal is not simply to maximize performance for a fixed power consumption, but also to consider regions of higher or lower power efficiency, that is, to evaluate tradeoffs from the perspective of the energy-delay product metric.

Another possible approach in developing architectural-level power models is to follow Liu and Svensson [4], who refined a coarse complexity-based model (power follows the number of gate-equivalents) into a set of specialized and more accurate models. They built special parameterized models for logic, memory, interconnect, and clock distribution, and then validated these against published information on the Digital Alpha 21064 and the Intel 80386. We will be evaluating this technique within SimpleScalar in upcoming work. Overall, our goal is focused more on incorporating power abstractions into high-level architectural simulators, with lesser emphasis on developing extensive new models.

3 Operand Value Analysis for Aggressive Clock Gating

In parallel with our efforts to develop power simulation and modeling techniques for architectural decisions, we are also investigating particular power reduction ideas whose quantitative evaluation will benefit from the modeling infrastructure we are developing. In particular, this section focuses on dynamic techniques that work to save power by recognizing and exploiting particular attributes when present in operand values.

3.1 Background

Current processors exploit a variety of clock gating techniques to reduce power dissipation in CMOS circuits. That is, information learned about the instructions during the decode phase is used to determine which CPU units will not be needed for this instruction. Such information enables or disables the clock signal at particular chip modules. By disabling the clock, one prevents power dissipation at unused units.

Clock gating has traditionally been based on static instruction attributes embodied in the opcode. Our approach extends this to exploit dynamic properties of the operands as they appear at runtime. Our work begins from the observation that in many codes, the dynamic values stored and manipulated in registers are often much smaller than the full word width provided by the machine.

3.2 Our Approach

In order to exploit the difference between average operand value width and machine word width, we are exploring the potential of gating off the top end of functional units to save power in instances when small operands are being manipulated. Figure 1 illustrates the basic structure we propose.

In our approach, the last step of any arithmetic computation is for the functional unit to perform a zero-detect on the high-order 48 bits of the result. (In some CPUs, some sort of zero-detect might be performed at this stage anyway, in order to update condition codes.) If the result of the zero-detect is that none of the high-order bits were on, then the result is said to be a low-bit value. If any of the high-order 48 bits are on, then the result is a high-bit value. This "high-bit" flag value (0 or 1) is associated with values as they travel back to the register file, to reservation stations, and to the bypassing paths.

When operands arrive at an arithmetic functional unit, their flag value is used to gate the clock that controls the input latch. In all cases, the low-order 16 bits are clocked into the latch. In small-operand cases, the latch for the high-order bits is never clocked. This prevents bit changes from occurring inside the functional unit, thus diminishing the power requirements.
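A behavioral sketch of the tagging and gating just described. This is a simplification made for illustration: real hardware does this in latches and gates, and the helper names are ours. The 64-bit word and 16-bit low part follow the text.

```python
HIGH_MASK = ((1 << 64) - 1) ^ 0xFFFF   # bits 63..16, the gated-off upper part

def tag(result):
    """Zero-detect stage: return (value, high_bit flag)."""
    value = result & ((1 << 64) - 1)
    return value, int((value & HIGH_MASK) != 0)

def gated_add(a, b):
    """Add two tagged operands; also report whether the upper input
    latches had to be clocked (i.e. whether gating was impossible)."""
    (va, ha), (vb, hb) = a, b
    upper_clocked = bool(ha or hb)     # gate off only if both are low-bit
    return tag(va + vb), upper_clocked

small = tag(1234)       # upper 48 bits zero -> high_bit == 0
large = tag(1 << 20)    # bit 20 set         -> high_bit == 1
print(gated_add(small, tag(42))[1])   # False: upper latches stay idle
print(gated_add(small, large)[1])     # True: full-width operation
```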


[Figure 1 (block diagram): operands A and B arrive from the registers split into a high half (bits 63-16) and a low half (bits 15-0). The low-half latches are clocked by clk; each high-half latch is clocked by (Clk AND high_bit). The integer functional unit's 64-bit result feeds its high-order 48 bits to a zero-detect whose output becomes the high_bit/low_bit flag that travels back to the registers with the result.]

Figure 1: Block diagram of value-based clock gating at ALU input.

3.3 Experimental Results

Whether or not this approach is beneficial depends on: (1) the frequency of small operands in the code, (2) the power dissipation saved by gating off most of the arithmetic unit, and (3) the power dissipation increased by computing and maintaining the high-bit indicator flags. Our initial results focus on the first of these three factors. We are working now to generate arithmetic unit power models that help us quantify the tradeoffs between the second and third of the factors listed.

The potential of this approach is illustrated in Figure 2. In this graph, the total height of each bar corresponds to the observed frequency of small-operand operations performed in eight of the SPECint95 benchmarks, as a fraction of the total operations that were eligible for consideration. The operations eligible for this optimization include arithmetic, logical, and some shift operations. In addition, there is a fourth category that includes other miscellaneous operations, such as some comparison instructions.

Overall, the results show that a large number of SPECint95 instructions operate on small operands and are therefore eligible for the optimization. Fractions range from 56% for go, up to over 70% for m88ksim and compress. We note, however, that in many cases the small operands are for logic instructions. The benefits of clock gating for logic instructions are implementation-dependent. If the implementation would be performing a zero-detect anyway, for condition codes, then power savings may result from storing this result. On the other hand, if the zero-detect and clock gating logic is all new, then the power savings resulting from saving part of the logic operation (much simpler than an add or multiply) is less clear.

Although we present results here only for integer codes, we are currently exploring mechanisms for applying similar techniques to floating-point operands as well. For floating-point operands, the approaches used will depend on the type of operation being performed. For example, in a floating-point addition, determining which bits are needed and relevant cannot be done until the two operands have been normalized (i.e., their exponents adjusted). In contrast, floating-point multiplications can have mantissa checks that are more similar to their integer counterparts. Finally, we are also exploring options for using compile-time analysis and profiling to determine narrow bit-width operations without the overhead of dynamic on-line checks. We are also considering approaches for speculatively assuming narrow bit widths in some operations, and then cleaning up computation via overflow detection in cases when the full bit-width indeed should have been used.

4 Conclusions

Overall, the philosophy embodied by this work is that appropriate monitoring infrastructure is imperative to gaining insights on power tradeoffs at design time. In addition, detailed operand analysis techniques, applied at compile-time or run-time, are a promising method for using operand values to drive clock gating techniques, and offer sizable


[Figure 2 (bar chart): for each of eight SPECint95 benchmarks (ijpeg, m88ksim, go, xlisp, compress, cc1, vortex, perl), the percentage of low-bit (16 bits or fewer) operations, broken down into Arith, Logic, Shift, and Power_Only categories; totals range from roughly 56% to just over 70%.]

Figure 2: Operand width distribution in the SPECint benchmarks.

improvements over purely opcode-based clock gating techniques.

References

[1] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Computer Architecture News, pages 13-25, June 1997.

[2] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: the SimpleScalar Tool Set. Technical Report TR-1308, University of Wisconsin-Madison Computer Sciences Department, July 1996.

[3] P. Landman. High-level power estimation. In International Symposium on Low-Power Electronics and Design, pages 29-35, August 1996.

[4] D. Liu and C. Svensson. Power consumption estimation in CMOS VLSI chips. IEEE Journal of Solid-State Circuits, pages 663-670, June 1994.

[5] Ching-Long Su and Alvin M. Despain. Cache design trade-offs for power and performance optimization: A case study. In International Symposium on Low-Power Design, pages 63-68, April 1995.

[6] V. Tiwari, S. Malik, and A. Wolfe. Compilation techniques for low energy: An overview. In Proc. Symp. on Low Power Electronics, October 1994.

[7] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: A first step towards software power minimization. IEEE Transactions on VLSI Systems, 2(4):437-445, December 1994.



Architectural Tradeoffs for Low Power

Vojin G. Oklobdzija*

Integration, Berkeley, CA 94708

http://www.integr.com

*Electrical Engineering Department, University of California, Davis, CA 95616

http://www.ece.ucdavis.edu/acsel

Abstract

A study of several RISC, DSP, and embedded processors was conducted. It shows that transistor utilization drops substantially as the complexity of the architecture increases. Simple architectures that enable full utilization of the technology are favorable as far as energy-efficient design styles are concerned. The results favor simple architectures that leverage performance improvements through technology improvements.

1. Introduction

The demand for reducing power in digital systems is not limited to systems required to operate under conditions where battery life is an issue. The growth of high-performance microprocessors has also been constrained by the power-dissipation capabilities of packages using inexpensive air-cooling techniques. That limit is currently in the neighborhood of fifty watts. However, the increasing demand for performance (which has been roughly doubling every two years) is leaving an imbalance in the power dissipation increase, which is growing at approximately 10 Watts per year.

This growth is threatening to slow the performance growth of future microprocessors. "CMOS ULSIs are facing a power dissipation crisis" in the words of Kuroda and Sakurai [1]. The increase in power consumption for three generations of Digital Equipment Corporation "Alpha" architecture high-performance processors is given in Fig. 1.

2. Comparative Analysis

Most of the improvement in power savings is gained from technology. Scaling of the device and process features and lowering of the threshold and supply voltages result in an order of magnitude of savings in power. Indeed, this resulting power reduction has been a salient achievement during the entire course of processor and digital systems development. Had this not been the case, the increase in power from one generation to another would have been much larger, limiting the performance growth of microprocessors much earlier.

Technology accounts for approximately 30% improvement in gate delay per generation. The resulting switching energy CV² has been improving at a rate of 0.5 times per generation. Given that the frequency of operation has been doubling with each new generation, the power factor P = CV²f remained constant (0.5 × 2 = 1.0). It is the increase in complexity of the VLSI circuits that goes largely uncompensated as far as power is concerned. It is estimated that the number of transistors has been tripling with every generation. Therefore, the expected processor performance increase is 6 times per generation (two times due to the doubling of processor frequency, multiplied by the three-times increase in the number of transistors).
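The per-generation arithmetic in this paragraph can be checked directly. The factors are the text's rules of thumb, not measured data:

```python
# Rough per-generation scaling factors as stated in the text.
switching_energy = 0.5   # CV^2 halves per generation
frequency        = 2.0   # clock rate doubles per generation
transistors      = 3.0   # transistor count triples per generation

power_factor = switching_energy * frequency   # P = CV^2 * f, per unit complexity
expected_perf_gain = frequency * transistors  # if transistors were fully used

print(power_factor)        # 1.0: power per unit of complexity stays flat
print(expected_perf_gain)  # 6.0: versus the ~4x actually observed
```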

The fact that performance has been increasing four times per generation instead of six is a strong indication that the transistors are not efficiently used. What that means is that the added architectural features are at the point of diminishing returns.

This diminishing trend is illustrated in Table 1, which compares a transition from a dual-issue machine to a 4-way super-scalar for the IBM PowerPC architecture.

All three implementations of the PowerPC architecture are compared at the same frequency of 100MHz. The performance of the PowerPC 620, as well as its power consumption, has been normalized to what it would have been at 100MHz. We can observe that the power has more than doubled and quadrupled, respectively, in the transition from a relatively simple implementation (601+) into a true super-scalar 620. The respective performance has also improved, by 50 to 60% (integer) and 30 to 80% (floating-point). However, the number of Specs/Watt has gone down dramatically compared to the 601+. Given that the data for all three implementations has been compared at 100MHz, we

Fig. 1. Power increase for three generations of the DEC "Alpha" processor: DEC 21064, 30W @ 200MHz (1992); DEC 21164, 50W @ 300MHz (1996); DEC 21264, 72W @ 600MHz (1998).


Table 1. Comparison of PowerPC performance/power transition [8,9]

  Feature            601+          604           620            Diff.
  Frequency (MHz)    100           100           133 (100)      same
  CMOS process       0.5u 5-metal  0.5u 4-metal  0.5u 4-metal   ~same
  Cache total        32KB          16K+16K       64K            ~same
  Load/store unit    No            Yes           Yes
  Dual integer unit  No            Yes           Yes
  Register renaming  No            Yes           Yes
  Peak issue         2 + Br        4 insts       4 insts        ~double
  Transistors        2.8 million   3.6 million   6.9 million    +30%/+146%
  SPECint92          105           160           225 (169)      +50%/+61%
  SPECfp92           125           165           300 (225)      +30%/+80%
  Power              4W            13W           30W (22.5W)    +225%/+463%
  Spec/Watt          26.5/31.2     12.3/12.7     7.5/10         -115%/-252%

are indeed comparing the inverse of the Energy-Delay product, which is a true measure of the power efficiency of an implementation, as shown in [7].
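To make the metric concrete: taking delay as the reciprocal of the SPEC rating at a fixed frequency, energy is power × delay, so 1/(energy × delay) reduces to SPEC²/W. A sketch using the normalized 100MHz figures from Table 1 (the variable names and layout are ours):

```python
# (SPECint92, watts) at 100MHz; normalized values taken from Table 1.
designs = {"601+": (105, 4.0), "604": (160, 13.0), "620": (169, 22.5)}

for name, (spec, watts) in designs.items():
    delay = 1.0 / spec                 # time per unit of work
    energy = watts * delay             # joules per unit of work
    inv_ed = 1.0 / (energy * delay)    # equals spec**2 / watts
    print(f"{name}: {spec / watts:.1f} SPEC/W, 1/(E*D) = {inv_ed:.0f}")
```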

The comparable inefficiency in the power-performance factor in the transition from single-issue to super-scalar for the MIPS processor architecture is shown in Table 2. The comparison shows a 31% decrease in power efficiency for the integer code but a 23% improvement for the floating-point.

Table 3 shows that the best trade-off between performance and power has been achieved in the DEC Alpha 21164 implementation of the "Alpha" architecture. The table shows comparable efficiency for the MIPS, PowerPC, and HP processor implementations, slightly better efficiency for the Sun UltraSPARC, and substantially better power efficiency for the Digital 21164.

The power efficiency of the DEC 21164 was achieved through very careful circuit design, thus eliminating much of the inefficiency at the logic level. This was necessary in order to be able to operate at a frequency twice as high as other RISC implementations. However, no architectural features, other than the very careful implementation, contribute to the power efficiency of the DEC 21164.

It is interesting to compare what a particular improvement means in terms of power. In Table 4 we compare the effect of increasing the cache size for the IBM 401 and 403 processors. The measurement is normalized to 50MHz. The power efficiency has dropped by a factor of close to two as a result of increasing the caches. Similar findings are confirmed in the case of the PowerPC architecture, where the decrease in power efficiency is 60%, as shown in Table 5.

Table 2. Transition from single-issue MIPS R5000 to MIPS R10000 implementation of the MIPS architecture [8,9]

  Feature            MIPS R10000  MIPS R5000  Diff.
  Frequency          200MHz       180MHz      ~same
  CMOS process       0.35u/4M     0.35u/3M
  Cache total        32K/32K      32K/32K     ~same
  Load/store unit    Yes          No
  Register renaming  Yes
  Peak issue         4 issue      1+FP
  Transistors        5.9 million  3.6 million +64%
  SPECint95          10.7         4.7         +128%
  SPECfp95           17.4         4.7         +270%
  Power              30W          10W         +200%
  SPEC/Watt          0.36/0.58    0.47/0.47   -31%/+23%

Metrics: Horowitz et al. [7] introduce the Energy-Delay product as a metric for evaluating the power efficiency of a design.

An appropriate scaling of the supply voltage results in lower power, however at the expense of circuit speed. The energy-delay curve shows an optimal operating point in terms of the energy efficiency of a design. This point is reached by various techniques, which are discussed in this paper.

The fabrication technology seems more important for the energy-delay than the architectural features of the machine. This finding is consistent with the fact that processor performance has been increasing four-fold per generation, though we would expect a six-fold increase: the frequency has been doubling per generation and the number of transistors tripling. This shows that the transistors have not been used efficiently, and that the architectural features consuming this transistor increase have not been bringing the desired effect in terms of the energy efficiency of the processors.

Power Tradeoffs in DSP and Embedded Systems:

A detailed power analysis of a programmable DSP processor and an integrated RISC and DSP processor was described in the papers by Bajwa and Kojima et al. [2,3]. The authors show a module-wise breakdown of the power used in the different blocks. Contrary to many opinions, it was found that the bus power is significantly smaller compared to the datapath. It was also shown how a simple swap of the multiplier inputs (applicable to Booth-encoded multipliers only) can reduce multiplier power by 4-8 times. Instruction fetch and decode contribute a significant portion of the power in these designs, and since signal processing applications tend to spend a very large portion of their dynamic execution time executing loops, simple buffering schemes (buffers/caches) help reduce power by up to 25% [10].


In Fig. 2, the power breakdown is shown for the integrated RISC+DSP processor. For the benchmarks considered, which are kernels for typical DSP applications, the CPU functions as an instruction fetch, instruction decode, and address generation unit for the DSP. Hence the variability in its power is less.

Similarly, the power consumed in the memories is quite high (in spite of their being low-power, segmented bit-line memories) and shows little variation. In the case of the DSP, the power variation is larger and is data-dependent. The interconnect power (INTR) represents the top-level interconnect power, which includes the main busses (three data and three address) and the clock distribution network at the top level. Clock power alone contributes between 30 and 50% of the total power consumption, depending on system load.

Table 4. A difference in power-performance factor resulting from increasing the size of cache [8,9]

Feature        401            403              Difference
Frequency      50MHz          66MHz (50MHz)    close
CMOS Process   0.5u 3-metal   0.5u 3-metal     same
Cache Total    2K-I / 1K-D    16K-I / 8K-D     8x
FPU            No             No               same
MMU            No             Yes
Bus Width      32             32               same
Transistors    0.3 Million    1.82 Million     600%
MIPS           52             81 (61)          +56% (+17%)
Power          140mW          400mW (303mW)    +186% (+116%)
MIPS/Watt      371            194              -91%

Table 5. The effect of increasing the cache size of the PowerPC architecture [8,9]

Feature          604              620               Difference
Frequency        100MHz           133MHz (100MHz)   same
CMOS Process     0.5u 4-metal     0.5u 4-metal      same
Cache Total      16K+16K          64K               ~double
Load/Store Unit  Yes              Yes               same
Dual Intgr Unit  Yes              Yes               same
Reg-Renaming     Yes              Yes               same
Peak Issue       4 Instructions   4 Instructions    same
Transistors      3.6 Million      6.9 Million       +92%
SPECint92        160              225 (169)         +6%
SPECfp92         165              300 (225)         +36%
Power            13W              30W (22.5W)       +73%
Spec/Watt        12.3 / 12.7      7.5 / 10          -64%
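The "Difference" column of Table 5 can be checked arithmetically (a quick sketch; the 100 MHz values the table quotes in parentheses are used for the frequency-normalized comparisons):

```python
# Reproduce the "Difference" column of Table 5 from the raw figures,
# using the frequency-normalized (100 MHz) 620 values in parentheses.

def pct_diff(new, old):
    """Percent change from old to new, rounded to whole percent."""
    return round(100 * (new / old - 1))

assert pct_diff(6.9, 3.6) == 92     # transistors: +92%
assert pct_diff(169, 160) == 6      # SPECint92 at 100 MHz: +6%
assert pct_diff(225, 165) == 36     # SPECfp92 at 100 MHz: +36%
assert pct_diff(22.5, 13) == 73     # power at 100 MHz: +73%

# The Spec/Watt entries follow directly as performance / power:
assert round(160 / 13, 1) == 12.3 and round(225 / 30, 1) == 7.5
```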

The characteristics of embedded systems are quite different from those of desktop systems. For one, cost is a much more acute issue. Secondly, the computational load is a smaller, well-defined set of tasks. For real-time signal processing applications, throughput behavior is typically more critical than minimum response time. These constraints dominate the design decisions. In many instances the cost of packaging is comparable to the cost of the die, and using a more expensive package, albeit with better heat dissipation capabilities, is not an option. In the mobile arena, battery life and heat dissipation in compact designs (constricted space reduces airflow and hence the capacity to disperse heat) put downward pressure on the power consumption of these processors. Depending on the application domain, there are two broad approaches.

Table 3. Comparison of Performance/Power and 1/Energy*Delay for representative RISC microprocessors [8,9]

Feature                    Digital 21164  MIPS 10000  PwrPC 620  HP 8000     Sun UltraSparc
Freq                       500 MHz        200 MHz     200 MHz    180 MHz     250 MHz
Pipeline Stages            7              5-7         5          7-9         6-9
Issue Rate                 4              4           4          4           4
Out-of-Order Exec.         6 lds          32          16         56          none
Register Renam. (int/FP)   none/8         32/32       8/8        56          none
Transistors/Logic trans.   9.3M/1.8M      5.9M/2.3M   6.9M/2.2M  3.9M*/3.9M  3.8M/2.0M
SPEC95 (Intg/FlPt)         12.6/18.3      8.9/17.2    9/9        10.8/18.3   8.5/15
Power                      25W            30W         30W        40W         20W
SpecInt/Watt               0.5            0.3         0.3        0.27        0.43
1/Energy*Delay             6.4            2.6         2.7        2.9         3.6
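The last two rows of Table 3 follow directly from the SPECint95 and power figures: SpecInt/Watt is performance per unit power, and 1/Energy*Delay is proportional to performance squared per unit power. A short check reproduces the table's values:

```python
# Reproduce the figure-of-merit rows of Table 3 from the raw entries.
# SpecInt/Watt = SPECint95 / power; 1/(energy*delay) is taken as
# SPECint95^2 / power, which matches the table to the precision shown
# (the table rounds UltraSparc's 0.425 SpecInt/Watt up to 0.43).

chips = {
    # name: (SPECint95, power in watts)
    "Digital 21164":  (12.6, 25),
    "MIPS 10000":     (8.9, 30),
    "PwrPC 620":      (9.0, 30),
    "HP 8000":        (10.8, 40),
    "Sun UltraSparc": (8.5, 20),
}

expected_inv_ed = [6.4, 2.6, 2.7, 2.9, 3.6]
for (name, (spec, watts)), exp in zip(chips.items(), expected_inv_ed):
    assert round(spec ** 2 / watts, 1) == exp
```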



Fig. 2. Module-wise breakdown of the chip power consumption for the kernel benchmarks on the integrated RISC+DSP processor: (a) as a percentage of the total, (b) normalized.

Throughput and real-time constraints typically lead to more balanced systems, as in the case of DSPs (Harvard architecture, processor speed equal to bus speed). Balance here refers to a balance between processing speed, bandwidth, and memory. Portable computing devices such as PDAs and handheld PCs form the other application domain, one in which the processors see a load similar to that of desktop systems in many respects. The StrongARM drops the processor core's clock frequency to that of its bus clock when it makes off-chip accesses, thereby curtailing its power and allowing its MIPS/Watt rating to scale.

Benchmarks are fraught with controversy, and in the case of embedded systems, where MIPS numbers are based on Dhrystones, they are especially meaningless. The Dhrystone suite can fit in roughly 4KB of memory. This makes the disparity, or lack thereof, between processor speeds and bus speeds noteworthy.

The biggest impact on performance/power comes from process technology. The StrongARM, which is at the high end of embedded and low-power processors, benefits from DEC's process technology (the same as the one used for the Alpha chips) and a full-custom design. This is atypical of embedded processor design. As recently as 1.5 years ago, the SA-110 was available in 0.35 micron technology at 2/3.3V (core/IO), while all of its competitors were available in technologies ranging from 0.5 to 1 micron and voltages between 3.3V and 5V. This is changing, but the SA-110 and the SA-1100 have been able to maintain their leading position as low-power processors by aggressively reducing the core's voltage (1.35V for the SA-1100), and through circuit techniques and edge-triggered flip-flops. A threshold voltage of 0.35V has allowed a much lower operating voltage. Most other embedded processors have had higher threshold voltages and hence correspondingly higher operating voltages. Over the next year or two, embedded processors with lower threshold voltages and dual-threshold designs will become more standard.

The ARM9TDMI, which has adopted an SA-110-like five-stage pipeline (as opposed to ARM's traditional three-stage design) and a Harvard architecture, illustrates the advantages of a more balanced design and can now be clocked at 150 MHz at sub-watt power levels. Better task partitioning is possible in embedded systems, because the applications require a small set of predictable tasks to be performed, allowing unused hardware to be shut down. In DSPs, control overheads are minimized and the data-path power and activity dominate. In desktop processors, by contrast, the control power almost drowns out the variations in the data-path power [4]. Power analysis of DSPs and simple RISC processors shows two main sources of power: the data-path units (multiply-accumulate units) and memory or cache (Fig. 2).

Conclusion

The conclusion from the studies presented is that the best power-performance is obtained if the architecture is kept simple, thus allowing improvements to be achieved by technology. In other words, the architecture should not stand in the way of technology; whenever it does, we will experience a decrease in power efficiency.

The second finding, which goes contrary to common knowledge, is that we should seek improvements via simple designs running at increased clock frequency, rather than keeping the frequency of operation low and increasing the complexity of the design.

Current processors have reached their limit in terms of power. The Digital 21264 is an example of a processor which had the potential for a higher operating frequency but had to lower it (to 600MHz) in order to keep the power contained. This situation was first reached by Digital's Alpha processors, but it will soon be reached by all of the others.

[Charts for Fig. 2: power as a percentage of the total (0-30%) and normalized power for the modules INTR, DSP, CPU, SYSC, BSC, INTC, and MEM, across the benchmarks fir, lms_A, lms_B, g711, fir_blk, iir, and fft.]


More specialized systems used in signal processing applications can benefit from re-configurable data-path designs. The main advantage is to reduce the clock and control overhead by mapping loops directly onto the re-configurable data-path.

In signal processing applications where stream- or block-data processing dominates, it makes sense to configure the data-path to compute algorithm-specific operations. The cost of configuration can be amortized over the data block or stream. Aggressive use of chaining (as in vector processing) can reduce memory accesses, resulting in designs that may be called re-configurable vector pipelines. Embedded architectures can, in the future, be expected to employ all or some of these techniques.

Acknowledgment

The contribution of Dr. Raminder Bajwa of Hitachi Semiconductor Research Laboratories, as well as the support received from Hitachi Ltd., is greatly appreciated.

References

1. T. Kuroda and T. Sakurai, "Overview of Low-Power ULSI Circuit Techniques", IEICE Trans. Electronics, E78-C, No. 4, April 1995, pp. 334-344. Invited paper, Special Issue on Low-Voltage Low-Power Integrated Circuits.

2. R. S. Bajwa, N. Schumann and H. Kojima, "Power analysis of a 32-bit RISC integrated with a 16-bit DSP", ISLPED'97.

3. H. Kojima, et al., "Power Analysis of a Programmable DSP for Architecture/Program Optimization", Proceedings of the 1995 Low-Power Symposium.

4. V. Tiwari, S. Malik and A. Wolfe, "Power Analysis of Embedded Software: A First Step Towards Power Minimization", IEEE Trans. on VLSI Systems, 2(4):437-445, Dec. 1994.

5. Y. Sasaki, et al., "Multi-Level Pass-Transistor Logic for Low-Power ULSIs", Proceedings of the 1995 Low-Power Symposium.

6. C. Tan, et al., "Minimization of Power in VLSI Circuits Using Transistor Sizing, Input Ordering, and Statistical Power Estimation", Proceedings of the International Workshop on Low-Power Design, 1994.

7. M. Horowitz, et al., "Low-Power Digital Design", Proceedings of the 1994 IEEE Symposium on Low-Power Electronics.

8. L. Gwennap, "RISC on the Desktop: A Comprehensive Analysis of RISC Microprocessors for PCs, Workstations, and Servers", MicroDesign Resources, 1994.

9. Microprocessor Report, several issues, MicroDesign Resources, Sebastopol, California.

10. R. S. Bajwa, et al., "Instruction Buffering to Reduce Power in Processors for Signal Processing", IEEE Trans. VLSI Systems, 5(4):417-424, December 1997.


The Inherent Energy Efficiency of Complexity-Adaptive Processors

David H. Albonesi
Dept. of Electrical and Computer Engineering
University of Rochester
Rochester, NY
[email protected]

Abstract

Conventional microprocessor designs that statically set the functionality of the chip at design time may waste considerable energy when running applications whose requirements are poorly matched to the particular hardware organization chosen. This paper describes how Complexity-Adaptive Processors, which employ dynamic structures whose functionality can be modified at runtime, expend less energy as a byproduct of the way in which they optimize performance. Because CAPs attempt to efficiently utilize hardware resources to maximize performance, this improved resource usage results in energy efficiency as well. CAPs exploit repeater methodologies used increasingly in deep submicron designs to achieve these benefits with little or no speed degradation relative to a conventional static design. By tracking hardware activity via performance simulation, we demonstrate that CAPs reduce total expended energy by 23% and 50% for cache hierarchies and instruction queues, respectively, while outperforming conventional designs. The additive effect observed for several applications indicates that a much greater impact can be realized by applying the CAPs approach in concert to a number of hardware structures.

1 Introduction

As power dissipation continues to grow in importance, the hardware resources of high performance microprocessors must be judiciously deployed so as not to needlessly waste energy for little or no performance gain. The major hardware structures of conventional designs, which are fixed at design time, may be inefficiently used at runtime by applications whose requirements are not well-matched to the hardware implementation. For example, an application whose working set is much smaller than the L1 Dcache may waste considerable energy precharging and driving highly capacitive wordlines and bitlines. Similarly, an application whose working set far exceeds the L1 Dcache size may waste energy performing look-ups and fills at multiple levels due to high L1 Dcache miss rates. The most energy-efficient (but not necessarily the best performing) cache organization is that which is well-matched to the application's working set size and access patterns. However, because of the disparity in the cache requirements of various applications, conventional caches often expend much more energy than required for the performance obtained. Other major hardware structures, such as instruction issue queues, similarly waste energy while operating on a diverse workload.

(This research is supported in part by NSF CAREER Award MIP-9701915.)

Complexity-Adaptive Processors (CAPs) make more efficient use of chip resources than conventional approaches by tailoring the complexity and clock speed of the chip to the requirements of each individual application. In [1], we show how CAPs can achieve this flexibility without clock speed degradation compared to a conventional approach, and thus achieve significantly greater performance. In this paper, we describe how CAPs can achieve this performance gain while expending considerably less energy than a conventional microprocessor.

The rest of this paper is organized as follows. In the next section, we discuss how the increasing use of repeaters in long interconnects creates the opportunity for new flexible hardware structures. Complexity-Adaptive Processors are then described in Section 3, followed by a discussion in Section 4 of their inherent energy efficiency. In Section 5, our experimental methodology is described. Energy efficiency results are discussed in Section 6, and finally we conclude and discuss future work in Section 7.

2 Dynamic Hardware Structures

As semiconductor feature sizes continue to decrease, to a first order, transistor delays scale linearly with feature size while wire delays remain constant. Thus, wire delays increasingly dominate overall delay paths. For this reason, repeater methodologies, in which buffers are placed at regular intervals within a long wire to reduce propagation delay, are becoming more commonplace in deep submicron designs. For example, the Sun UltraSPARC-IIi microprocessor, implemented in a 0.25 micron CMOS process, contains over 1,200 buffers to improve wire delay [5]. Note that wire buffers are used not only in busses between major functional blocks, but within self-contained hardware structures as well. The forthcoming HP PA-8500 microprocessor, which is also implemented in 0.25 micron CMOS, uses wire buffers for the global address and data busses of its on-chip caches [3]. As feature sizes decrease to 0.18 micron and below, other smaller structures will require the use of


Figure 1: A dynamic hardware structure which can be configured with four to eight elements.

wire buffers in order to meet timing requirements [4]. For these reasons, we expect that many of the major hardware structures of future high performance microprocessors, such as caches, TLBs, branch predictor tables, and instruction queues, will be of the form shown in Figure 1. The hardware structure in this figure consists of replicated elements interconnected by global address/control and data busses driven using repeaters placed at regular intervals to reduce propagation delay. The isolation of individual element capacitances provided by the repeaters creates a distinct hierarchy of element delays, unlike unbuffered structures in which the entire wire capacitance is seen by every element on the bus. By employing address decoder disabling control signals as shown in Figure 1, we can make this structure dynamic in the sense that the complexity and delay of the structure can be varied as required by the application. For example, this structure can be configured with between four and eight elements, with the overall delay increasing as a function of the number of elements. Assuming this structure is on the critical timing path with four or more elements, if the clock frequency of the chip is varied according to the number of enabled elements(1), then the IPC/clock rate tradeoff of this structure can be varied at runtime to meet the dynamic requirements of the application. Due to their exploitation of repeater usage, such dynamic hardware structures can be designed with little or no delay penalty relative to a fixed structure.
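The IPC/clock-rate tradeoff can be sketched as a toy model (all delay and IPC numbers below are hypothetical): enabling more elements raises the structure's bus delay and hence the cycle time, so the best element count is the one that maximizes IPC times clock frequency for the running application.

```python
# Toy model of the IPC/clock-rate tradeoff in a dynamic structure.
# Delay and IPC figures are illustrative assumptions, not measurements.

def clock_mhz(elements, base_delay_ns=2.0, per_element_ns=0.25):
    """Cycle time set by the repeater-buffered bus delay."""
    return 1000.0 / (base_delay_ns + per_element_ns * elements)

def best_configuration(ipc_of, choices=range(4, 9)):
    """Pick the element count maximizing IPC * clock rate."""
    return max(choices, key=lambda n: ipc_of(n) * clock_mhz(n))

# An application whose IPC saturates early prefers a smaller, faster
# configuration; one whose IPC keeps growing prefers more elements.
saturating = lambda n: 0.4 * min(n, 6)
growing = lambda n: 0.4 * n
assert best_configuration(saturating) < best_configuration(growing)
```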

An alternative to disabling elements is to use them as slower "backup" elements to the faster "primary" elements, as shown in Figure 2. Here, the difference between the primary and backup elements is the access latency in clock cycles. For example, the four primary elements in Figure 2 may be accessed in fewer cycles than the four backup elements, due to the latter's longer address and data bus delays. Such an approach may be appropriate, for example, for an on-chip Dcache hierarchy, in which the primary and backup elements correlate to L1 and L2

(1) An alternative is to vary the latency in clock cycles of the structure.

Figure 2: A dynamic hardware structure with configurable "primary" and "backup" sets of elements.

caches, and the division between them is determined as a function of the current working set size and the cycle time and backup element latency of each configuration.

3 Complexity-Adaptive Processors

Having discussed how repeater methodologies will lead naturally to the development of dynamic hardware structures, we now describe the overall organization of a Complexity-Adaptive Processor.

The overall elements of a CAP hardware/software system, shown in Figure 3, are as follows:

- Dynamic hardware structures, as just described, which can vary in complexity and timing (latency and/or cycle time);

- Conventional static hardware structures, used when implementing adaptivity is unwieldy, will strongly impact cycle time, or is ineffective due to a lack of diversity in target application requirements for the particular structure;

- Performance counters which track the performance of each dynamic structure and which are readable via special instructions and accessible to the control hardware;

- Configuration Registers (CRs), loadable by the hardware and by special instructions, which set the configuration of each dynamic structure as well as the clock speed of the chip; different CR values represent different complexity-adaptive configurations, not all of which may be practical;

- A dynamic clocking system whose frequency is controlled via particular CR bits; a change in these bits causes a sequence in which the current clock is disabled and the new one started after an appropriate settling period;


Figure 3: Overall elements of a CAP hardware/software system.

- The instruction set architecture (ISA), consisting of conventional instructions augmented with special instructions for loading CRs and reading the performance counters;

- Configuration control, implemented in the compiler, hardware (dynamic reconfiguration control logic), and runtime system, that acquires information about the application and uses predetermined knowledge about each configuration's IPC and clock speed characteristics to create a configuration schedule that matches the hardware implementation to the application dynamically during its execution.

The process of compiling and running an application on a CAP machine is as follows. The CAP compiler analyzes the application's hardware requirements for different phases of its execution. For example, it may analyze data cache requirements based on working set analysis, or determine the ILP based on the control flow graph. With this information, and knowledge about the hardware's available configurations, the compiler determines whether it can with good confidence create an effective configuration schedule, specifying at what points within the application the hardware should be reconfigured, and to which organizations. The schedule is created by inserting special instructions at particular points within the application that load the CRs with the desired configuration. In cases where dynamic runtime information is necessary to determine the schedule, this task is performed by the runtime system or the dynamic reconfiguration control logic. For example, TLB configuration

scheduling may be best handled in conjunction with the TLB miss handler, based on runtime TLB performance monitoring, while the optimal branch predictor size may in some cases be best determined by runtime hardware. A major CAP design challenge is determining the optimal configuration schedule for several dynamic structures that interact with each other as well as with static structures, and which may be controlled by up to three different sources (compiler, runtime system, and hardware).

Through these various mechanisms, the CRs are loaded at various points in an application's execution, resulting in reconfiguration of dynamic structures and changes in clock frequency. For runtime control, the performance counters are queried at regular intervals of operation, and using history information about past decisions, a prediction is made about the configuration that will perform best over the next interval. Assuming tens of cycles are required for each reconfiguration operation (based on the time to reliably switch clock frequencies), an interval on the order of thousands of instructions is necessary to maintain a reasonable level of reconfiguration overhead.
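The interval-based runtime control loop can be sketched as follows (interface names are assumptions, and a simple greedy hill-climbing policy stands in for the history-based predictor described above):

```python
# Hypothetical sketch of interval-based reconfiguration control:
# every interval the performance counter is read, and a greedy
# controller keeps moving through the ordered configuration list
# while performance improves, reversing course when the last
# reconfiguration hurt. Not the paper's actual predictor.

def runtime_control(read_perf_counter, apply_config, configs, intervals):
    """Pick a configuration by greedy hill climbing over `configs`."""
    current, last_perf, direction = 0, 0.0, 1
    for _ in range(intervals):
        perf = read_perf_counter()        # performance over last interval
        if perf < last_perf:
            direction = -direction        # last move hurt: reverse course
        nxt = min(max(current + direction, 0), len(configs) - 1)
        if nxt != current:
            current = nxt
            apply_config(configs[current])  # load the CR, switch clocks
        last_perf = perf
    return configs[current]
```

Because each reconfiguration costs tens of cycles for the clock switch, an interval corresponding to thousands of instructions keeps the switching overhead negligible, as the paper notes.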

4 Improving Energy Efficiency Via CAPs

There are two main aspects of CAPs that allow for reduced power consumption: the ability to disable or optimally allocate (between primary and backup sections) hardware elements, and the ability to control the frequency of the clock. One option is to explicitly manage these features in order to reduce power consumption.


For example, in a portable environment, when battery life falls below a certain threshold, a low power mode in which the processor is still fully functional can be enabled by setting all dynamic hardware structures to their minimum size and selecting the slowest clock.

In addition, CAP designs have an inherent energy efficiency that is a byproduct of the way in which they optimize performance. Because the CAP hardware is configured to match the characteristics of each application, hardware structures are generally used more efficiently, and thus expend less energy, for a given task. It is this inherent energy efficiency of CAPs, which follows naturally from optimizing performance, that we explore in the rest of this paper.

5 Experimental Methodology

We examined the results of our preliminary performance analysis [1] of two-level on-chip Dcache hierarchies and instruction issue queues to estimate the relative energy expended by CAP and conventional approaches for these structures. For the Dcache hierarchy, we assumed a total of 128KB of cache, and for the CAP design, allowed the division between the L1 and L2 caches to be varied in steps of 8KB up to a total L1 Dcache size of 64KB on an application-by-application basis. We compared this approach with the best overall-performing conventional design: one with a 16KB 4-way set-associative L1 Dcache, with the rest of the 128KB allocated to the L2 cache. We ran each benchmark on our cache simulator for 100 million memory references. The CAP instruction queue varied in size from 16 to 128 entries in steps of 16 entries. Unused entries were disabled. The best-performing conventional design contained 64 entries. We used the SimpleScalar simulator [2] and ran each benchmark for 100 million instructions. More details on the evaluation methodology and benchmarks used can be found in [1].

To estimate relative expended energy, we calculate the activity ratio for each approach by tracking the number and types of operations, and estimating the activity generated by each. For the two-level cache design, we track the number of L1 and L2 operations, and calculate the total number of cache banks activated, considering the number activated for each operation and the L1/L2 configuration. We then take the ratio of the total activity for the CAP approach and for the conventional design. Because we use an exclusive caching policy, misses in the L1 Dcache to a valid location that hit in the L2 Dcache result in a swap between the two caches. In addition, the global miss ratio of the hierarchy remains constant due to the use of a random replacement policy. Thus, a CAP design that is optimized for performance in general also promotes energy efficiency by optimizing the amount of L1 Dcache allocated for a given application, and reducing L2 Dcache activity.

For the instruction queue, we did not have the ability to track the number of instruction queue accesses. However, because the rest of the design was almost ideal (large caches and queues, perfect branch prediction, plentiful functional units), we used the total number of cycles executed to approximate the relative number of instruction queue accesses for each benchmark and each approach (CAP and conventional). We then multiplied this number by the number of queue entries to get the total activity factor.
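This first-order model can be checked against the m88ksim row of Table 2 (a sketch; cycle counts in millions are taken from the table, entry counts from the configurations):

```python
# Activity model for the instruction queue, as described above:
# activity ~= executed cycles * queue entries.

def iq_activity(cycles_millions, entries):
    return cycles_millions * entries

conv = iq_activity(29.1, 64)   # conventional 64-entry queue
cap = iq_activity(40.6, 16)    # CAP: 16 entries, but more cycles

# Table 2 lists 1862.6 and 649.1 (computed from unrounded cycle
# counts); the rounded inputs land within a fraction of a percent,
# and the activity ratio matches the table's 0.35.
assert abs(conv - 1862.6) < 2 and abs(cap - 649.1) < 2
assert round(cap / conv, 2) == 0.35
```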

Although this approach is not exact, we believe that by tracking activity in this manner we obtain a reasonable first-order approximation of relative expended energy.

6 Results

Table 1 shows event and activity counts, as well as the activity ratio (CAP/conventional), for the ten benchmarks in which the CAP configuration differs from the best conventional approach. The eleven benchmarks which use identical CAP and conventional configurations are not shown, as the expended energy is the same. The "L1 size (CAP)" column denotes the CAP configuration which performed best for each benchmark. The last column indicates the ratio of CAP total activity to conventional total activity.

For five of the benchmarks, a CAP configuration with an 8KB L1 Dcache outperforms the conventional approach using a 16KB L1 Dcache due to the former's faster cycle time, despite the increase in L1 misses and L1-L2 swaps incurred, as indicated in Table 1. Interestingly, the total activity count for the CAP configuration is lower than for the conventional approach. This is because for the conventional approach, twice as much L1 Dcache must be precharged and probed for each load operation (we make the simplifying assumption that only the selected way is activated on a store). This offsets the additional L2 probe and L1-L2 swap activity incurred with the CAP configuration, and the fact that more L2 cache must be probed in the CAP case on an L1 miss. This more efficient cache allocation reduces expended energy by 33% in the case of mgrid.

The tradeoff is different for the five remaining benchmarks, where the CAP L1 Dcache is larger than the conventional 16KB cache. Here, the reduction in L1 miss and L1-L2 swap activity must offset the additional L1 Dcache load activity for the CAP configuration to expend less energy. This is true in all cases except for wave5, whose conventional cache experiences fewer total L1 misses than the other four benchmarks in this category. For benchmarks such as stereo and appcg, whose requirements are not well-matched to the conventional organization, the energy savings with the CAP approach are significant: 44% for stereo and 62% for appcg. Because these benchmarks run significantly faster on the CAP configuration as well [1], the reduction in the energy-delay product, an indicator of the efficiency with which a particular performance level is obtained, is even more striking: 70% for both benchmarks. Overall, 23% less energy is expended by the CAP configurations for these benchmarks as a byproduct of performance optimization.

Table 2 shows the relevant data for the best-performing CAP and conventional instruction queues for those benchmarks in which the CAP and conventional configurations differ. Here, unused entries are disabled for the CAP approach. In the cases in which the CAP configuration performs best with fewer entries than the 64-entry conventional approach (because the cycle time improvement overrides the IPC penalty incurred), fewer elements are activated on each instruction queue access. However, more issue operations are required, as the smaller window results in fewer instructions issued on average per issue operation. In all cases, the energy savings from acti-


benchmark  L1 size  loads  stores  L1 misses     L2      L1-L2 swaps   total activity   activity ratio
           (CAP)                   conv   CAP    misses  conv   CAP    conv     CAP     (CAP/conv)
m88ksim    8KB      64.6   35.4    2.48   2.97   2.41    2.43   2.48   370.5    261.6   0.71
compress   24KB     80.0   20.0    12.3   6.32   0.23    0.22   0.22   697.2    671.1   0.96
airshed    8KB      81.2   18.8    32.0   32.5   0.03    3.24   3.73   1275.0   1193.6  0.94
stereo     48KB     74.1   25.9    54.1   7.39   6.06    11.6   1.91   1910.2   1078.0  0.56
radar      8KB      61.2   38.8    1.51   4.20   0.50    0.87   3.54   328.8    295.5   0.90
appcg      64KB     4.84   95.2    12.0   0.49   0.47    11.9   0.49   474.6    181.9   0.38
swim       24KB     75.9   24.1    13.8   5.27   5.27    3.25   2.66   737.1    629.9   0.85
mgrid      8KB      95.4   4.60    4.55   4.64   4.12    1.45   1.45   511.7    344.9   0.67
applu      8KB      72.2   27.8    8.53   9.20   8.22    5.03   5.29   577.2    470.9   0.82
wave5      24KB     72.9   27.1    7.05   3.83   0.64    5.17   2.27   529.1    571.0   1.08
total                                                                  7411.3   5698.4  0.77

Table 1: Cache hierarchy event counts, total activity counts, and CAP/conventional activity ratio for each benchmark. All counts are in millions.

benchmark  IQ entries  executed cycles   total activity     activity ratio
           (CAP)       conv     CAP      conv      CAP      (CAP/conv)
m88ksim    16          29.1     40.6     1862.6    649.1    0.35
compress   128         32.6     22.2     2088.9    2847.2   1.36
ijpeg      32          22.8     23.1     1461.9    738.0    0.50
airshed    32          23.7     25.0     1517.2    799.8    0.53
radar      16          110.1    141.8    7046.1    2268.9   0.32
appcg      16          48.4     49.8     3099.7    796.9    0.26
applu      128         27.6     19.8     1765.7    2535.0   1.44
fpppp      16          90.1     101.5    5766.3    1624.4   0.28
total                                    24608.5   12259.1  0.50

Table 2: Instruction queue executed cycles, total activity counts, and CAP/conventional activity ratio for each benchmark. All counts are in millions.

vating fewer elements overrides the energy cost of more queue accesses. This translates into as much as a 74% reduction in queue activity (in the case of appcg). Again, this benefit is achieved not through explicit power management, but simply as a byproduct of optimizing performance.

The opposite effect is observed for compress and applu, which perform best with a larger 128-entry queue. The result is a significant increase in expended energy with the CAP approach, despite the reduction in queue accesses. However, as these are the only two benchmarks using more entries than the conventional approach, the overall result is a 50% reduction in expended energy with the CAP approach. Clearly, if more benchmarks performed best with more entries than the best average-performing configuration, then the energy savings would be less, perhaps significantly so. In addition, if configurations with more than 128 entries were available and some applications performed best with these configurations, then an even greater increase in expended energy would be incurred for these applications with the CAP approach. However, it is likely that the inclusion of these applications into the benchmark suite would change the best-performing conventional approach to one with more entries. Thus, we expect that even with a wide range of benchmark behavior, the CAP approach will expend less energy overall due to its better use of hardware resources. However, the benefit is less clear with a structure in which elements are disabled than with one in which the resources are allocated between primary and backup elements.

Examining Tables 1 and 2, we see that some applications, such as m88ksim, experience significant reductions in expended energy for both the CAP cache hierarchy and instruction queue. Thus, by applying the CAPs approach to other structures such as TLBs and branch predictors, we expect that a one to two order of magnitude reduction in expended energy is possible for applications whose hardware requirements are particularly poorly matched to those in a conventional microprocessor design. A key point is that CAPs can achieve this improvement without adversely impacting the performance or expended energy of well-matched applications.

7 Conclusions and Future Work

The energy efficiency of a microprocessor is highly dependent on how well the hardware design matches the requirements of a particular application. In this paper, we have described how Complexity-Adaptive Processors inherently achieve energy efficiency as a byproduct of performance optimization by dynamically configuring their hardware organization to the problem at hand. By exploiting repeater methodologies used increasingly in deep submicron designs, CAPs achieve this benefit with little or no cycle time degradation over a conventional approach. By examining our performance results for a number of benchmarks, we found that the CAPs design reduced overall expended energy by 23% for cache hierarchies and 50% for instruction queues. We also discovered an additive effect for some benchmarks, indicating that a much greater impact can be realized by applying the CAPs approach in concert to a number of hardware structures. In the future, we plan on obtaining results using more precise energy models for these and other hardware structures.


Function Unit Power Reduction

Abstract

For many low-power systems, the power cost of floating-point hardware has been prohibitively expensive. This paper explores ways of reducing floating-point power consumption by minimizing the bit-width representation of floating-point data. Analysis of several floating-point programs that utilize low-resolution sensory data shows that the programs suffer almost no loss of accuracy even with a significant reduction in bit-width. This floating-point bit-width reduction can deliver a significant power saving through the use of a variable bit-width floating-point unit.

1 Introduction

Floating point numbers provide a wide dynamic range of representable real numbers, freeing programmers from the manual scaling code necessary to support fixed-point operations. Floating-point (FP) hardware, however, is also very power hungry. For example, FP multipliers are some of the most expensive components in a processor's power budget. This has limited the use of FP in embedded systems, with many low-power processors not including any floating-point hardware.

For an increasing number of embedded applications such as voice recognition, vision/image processing, and other signal-processing tasks, FP's simplified programming model (vs. fixed-point systems) and large dynamic range make FP hardware a useful feature for many types of embedded systems. Further, many applications achieve a high degree of accuracy with fairly low-resolution sensory data. By leveraging these characteristics and allowing software to use the minimal number of mantissa and exponent bits, standard floating-point hardware can be modified to significantly reduce its power consumption while maintaining a program's overall accuracy.

For example, Figure 1 graphs the accuracy of CMU's Sphinx Speech Recognition System [Sphinx98] vs. the number of mantissa bits used in floating-point computation. The leftmost point, 23 bits of mantissa, is the standard for a 32-bit IEEE FP unit. With 23 bits, the recognition accuracy is over 90%; but even with just 5 mantissa bits (labeled A), Sphinx still maintains over 90% word recognition accuracy. For Sphinx, there is almost no difference between a 23-bit mantissa and a 5-bit mantissa. In terms of power, however, an FP multiplier that uses only 5 mantissa bits consumes significantly less power than a 23-bit mantissa FP multiplier. This property, that programs can maintain accuracy while utilizing only a few bits for FP number representation, creates a significant opportunity for enabling low-power FP units.

The goal of this work is to understand the floating-point bit-width requirements of several common floating-point applications and to quantify the amount of power saved by using variable bit-width floating-point units. Section 2 begins our discussion by examining different aspects of the IEEE floating point standard that could lead to additional power savings. Section 3 presents the analysis of several floating-point programs that require fewer bits than specified in the IEEE standard. In Section 4, we describe the use of a digit-serial multiplier to design variable bit-width hardware and discuss the possible power savings. Finally, Section 6 outlines our conclusions and future work.

2 Background

2.1 IEEE 754 Floating Point Standard

One of the main concerns of the IEEE 754 Floating Point Standard is the accuracy of arithmetic operations. IEEE-754 specifies that any single-precision floating point number be represented using 1 sign bit, 8 exponent bits, and 23 mantissa bits. With double precision, the bit-width requirements of the exponent and mantissa go up to 11 bits and 53 bits, respectively.
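The single-precision layout above can be made concrete with a short sketch that unpacks a value into the three IEEE-754 fields (the helper name fp32_fields is ours, not from the paper):

```python
import struct

def fp32_fields(x: float):
    """Decompose a binary32 (single-precision) value into its IEEE-754 fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31               # 1 sign bit
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 mantissa (fraction) bits
    return sign, exponent, mantissa

# -1.5 = (-1)^1 * 1.5 * 2^0 -> sign 1, biased exponent 127, mantissa 0x400000
sign, exp, man = fp32_fields(-1.5)
```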

In addition to specifying the bit-width requirement for floating point numbers, IEEE-754 incorporates several additional features, including delicate rounding modes and support for gradual underflow to preserve the maximal accuracy of programs.

This work was supported by the Defense Advanced Research Projects Agency under Order No. A564 and the National Science Foundation under Grant No. MIP90408457.

[Figure 1 : Accuracy of Sphinx Speech Recognition vs. Mantissa Bit-width. X-axis: mantissa bit-width from 23 down to 1; Y-axis: recognition accuracy from 0% to 100%. The point labeled A marks the 5-bit mantissa.]

Minimizing Floating-Point Power Dissipation Via Bit-Width Reduction

Ying Fai Tong, Rob A. Rutenbar, and David F. Nagle
Department of ECE
Carnegie Mellon University
Pittsburgh, PA 15213

yftong,[email protected]

Nevertheless, the implementation of an IEEE-compliant floating point unit is not always easy. In addition to the design complexity and the large area it occupies, a floating point unit is also a major consumer of power in microprocessors. Many embedded microprocessors such as the StrongARM [Dobberpuhl96] and MCore [MPR97] do not include a floating point unit due to its heavy implementation cost.

2.2 Accuracy Requirements and Workloads

For floating point applications that rely on sensory inputs, power savings can be obtained by modifying the floating point hardware's mantissa and exponent widths while maintaining sufficient accuracy for overall program execution. There are four conceivable dimensions that we can explore (see Table 1). Each of these dimensions allows us to trade off program accuracy against the power consumption of the floating-point unit.

3 Experiments and Results

3.1 Methodology

To validate the usefulness and accuracy of reducing FP bit-widths, we analyzed four single-precision floating point programs (see Table 2). To determine the impact of different mantissa and exponent bit-widths, we emulated different bit-width FP units in software by replacing each floating-point operation with a corresponding function call to our floating-point software emulation package, which initially implements the IEEE-754 standard. Careful modifications to the floating-point emulation package allowed us to simulate different mantissa and exponent bit-widths. For each bit-width, the emulation package was modified to use a smaller number of bits. Then, each program was run using the modified floating-point package and the results were compared to determine application accuracy.
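The kind of reduced-bit-width emulation described above can be approximated in a few lines. This is our own illustrative sketch, not the authors' package: it emulates a narrower mantissa by truncating the low-order fraction bits of a float32 value (truncation rather than a configurable rounding mode is our simplifying assumption):

```python
import struct

def truncate_mantissa(x: float, keep_bits: int) -> float:
    """Round-trip x through float32, keeping only the top `keep_bits`
    of the 23-bit mantissa (truncation, no rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    drop = 23 - keep_bits
    bits &= ~((1 << drop) - 1)  # clear the discarded low-order fraction bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# A reduced-precision operation would wrap each FP primitive this way:
def fmul(a: float, b: float, keep_bits: int = 5) -> float:
    return truncate_mantissa(a * b, keep_bits)
```

With keep_bits=5, for example, 3.14159 collapses to 3.125, the nearest value expressible with a 5-bit fraction.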

3.2 Results

Figure 2 graphs the accuracy for each of the four programs across a range of mantissa bit-widths. None of the workloads displays a noticeable degradation in accuracy when the mantissa bit-width is reduced from 23 bits to 11 bits. For ALVINN and Sphinx III the results are even more promising; the accuracy does not change significantly with a mantissa bit-width of 5 or more bits.

Table 1 : Design Dimensions for Floating Point Representation

Reduction in mantissa bit-width: Reduce the number of mantissa bits at the expense of precision.

Reduction in exponent bit-width: Reduce the number of exponent bits at the expense of a smaller dynamic range.

Change of the implied radix: Increase the implied radix from 2 to 4 (or 16). This provides greater dynamic range but lower density of floating point numbers, potentially leading to power savings since fewer normalizing shifts are necessary.

Simplification of rounding modes: Full support of all the rounding modes is very expensive in terms of power. Some programs may achieve an acceptable accuracy with a modified low-power rounding algorithm.

Sphinx III: CMU's speech recognition program based on fully continuous hidden Markov models; the input set is taken from the DARPA evaluation test set, which consists of spoken sentences from the Wall Street Journal [Hwang94]. Accuracy is estimated by dividing the number of words recognized correctly by the total number of words in the input set.

ALVINN: Taken from SPECfp92. A neural network trainer using backpropagation, designed to take as input sensory data from a video camera and a laser range finder and guide a vehicle on the road. The input set consists of 50 road scenes, and accuracy is measured as the number of correct travel directions made by the network.

PCASYS: A pattern-level fingerprint classification program developed at NIST. The program classifies images of fingerprints into six pattern-level classes using a probabilistic neural network. The input set consists of 50 different fingerprint images, and the classification result is measured as the percentage error in putting an image in the wrong class. The accuracy of the recognition is simply (1 - percentage error).

Bench22: An image processing benchmark which warps a random image and then compares the warped image with the original one. Percentage deviation from the original outputs is used as a measure of accuracy.

Table 2 : Description of Workloads

[Figure 2 : Program Accuracy across Various Mantissa Bit-widths. X-axis: mantissa bit-width from 23 down to 1; Y-axis: accuracy from 0% to 100%; curves: Sphinx, ALVINN, PCASYS, Bench22.]

Figure 3 shows that each program's accuracy follows a similar trend when the exponent bit-width is varied. With 7 or more exponent bits, the error rates remain quite stable. Once the exponent bit-width drops below 6, the error rates increase dramatically and in some cases the programs could not finish properly.

Many programs dealing with human interfaces process sensory data with intrinsically low resolutions. The arithmetic operations on these data may generate intermediate results that require more dynamic range, but not vastly more precision. This is different from many scientific programs such as wind tunnel simulation or weather prediction, which not only require a huge amount of precision and dynamic range but also delicate rounding modes to preserve the accuracy of the results.

For programs that need neither the dynamic range nor the precision of floating point arithmetic, the use of fixed-point arithmetic might well be a better choice in terms of chip space, operation latency, and power consumption. But of the programs we have analyzed, three require 6 or more exponent bits to preserve a reasonable degree of accuracy, which means they need more than the 32 bits of precision that fixed-point arithmetic can offer. Simply using a fixed-point representation without additional scaling will not resolve the problem.

It should be noted that these complex applications were aggressively tuned by various software designers to achieve good performance using the full IEEE representation. However, Figure 2 and Figure 3 suggest that significantly smaller bit-width FP units could be used by these applications without compromising the necessary accuracy. For instance, certain floating point constants in the Sphinx III code require more than 10 bits of mantissa to represent, but we modified those numbers so they could be represented using fewer bits during our experiment, and this had little impact on the overall recognition accuracy. We believe that if the numerical behavior of these applications were adjusted to a smaller bit-width unit, we could get even better performance.

4 Power Savings by Exploiting Variable Bit-width Requirement

4.1 Multiplication with a Digit-Serial Multiplier

Since different floating point programs have different requirements on both the mantissa and exponent bit-width, we propose the use of a variable bit-width floating point unit¹ to reduce power consumption. To create hardware capable of variable bit-width multiplications (up to 24x24 bit), we used a 24x8 bit digit-serial architecture similar to the one described in Hartley and Parhi [Hartley95]. The 24x8 bit architecture allows us to perform 8-, 16-, and 24-bit multiplication by passing the data once, twice, or three times through the serial multiplier. A finite state machine is used to control the number of iterations through the CSA array.
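Functionally, the pass-per-digit iteration can be sketched as follows; this is our own illustration of the control scheme, not the Verilog design:

```python
DIGIT = 8  # digit width processed per pass

def digit_serial_mul(a: int, b: int, width: int = 24) -> int:
    """Multiply two unsigned `width`-bit integers by feeding `b` through
    a width x 8 array one digit per cycle: 8/16/24-bit operands take
    1/2/3 passes, as selected by the finite state machine."""
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    passes = max(1, (b.bit_length() + DIGIT - 1) // DIGIT)
    result = 0
    for i in range(passes):
        digit = (b >> (i * DIGIT)) & ((1 << DIGIT) - 1)  # next 8-bit digit
        result += (a * digit) << (i * DIGIT)             # width x 8 partial product
    return result
```

Each pass accumulates one width-by-8 partial product, which is why narrow operands finish in fewer cycles.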

To perform accurate power and timing measurements, the multiplier was described in Verilog and then taken to layout using our ASIC design flow (for a standard 0.5u process). Synopsys' Design Compiler was used to synthesize the multiplier's control logic. Next, the entire structural model was fed into Cascade Design Automation's physical design tool, Epoch. A description of digit-serial arithmetic and the block diagram of the digit-serial multiplier can be found in the appendix.

We compare our variable bit-width multiplier with a baseline fixed-width 24x24 bit Wallace Tree multiplier. The layout of this Wallace Tree multiplier was generated by Epoch's cell generator in the same 0.5u process as used in the design of the digit-serial multiplier. The two multipliers are described in Table 3. Cycle time is estimated using Epoch's static timing analysis tool, Tactic, and is rounded to the nearest 5 ns interval for the convenience of power simulation.

4.2 Power Analysis

For each design, a SPICE netlist was generated from layout and used to estimate power consumption with Avanti's Star-Sim. Determining the complete power dissipated in a multiplier requires the sensitization of all possible combinations of inputs, which means we would need 2^(2N) input combinations, where N is the number of input bits. Fortunately, it is possible to obtain a mean estimate of the power consumption using statistical techniques [Burch92]. In our approach, we ran 50 batches of vectors, with each batch containing 30 vectors, to ensure a 95% confidence interval. The energy dissipation is computed using the cycle time in Table 3. The test vectors are taken from some of the actual multiplication operands in Sphinx III.
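The batch-means estimate can be illustrated with a small sketch. The function name and the stand-in random "measurement" are ours; the actual per-vector energies came from Star-Sim runs:

```python
import random
import statistics

def batch_mean_energy(measure, n_batches=50, batch_size=30, z=1.96):
    """Batch-means estimate of energy/op: average `batch_size` measurements
    per batch, then form a normal-approximation 95% confidence interval
    over the `n_batches` batch means."""
    means = [statistics.fmean(measure() for _ in range(batch_size))
             for _ in range(n_batches)]
    grand_mean = statistics.fmean(means)
    sem = statistics.stdev(means) / len(means) ** 0.5  # std. error of the mean
    return grand_mean, (grand_mean - z * sem, grand_mean + z * sem)

# Stand-in measurement: energies in pJ with some spread.
mean_pj, (lo, hi) = batch_mean_energy(lambda: random.gauss(2000.0, 150.0))
```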

Figure 4 graphs the energy/operation and latency/operation for the digit-serial multiplier. Both the energy/operation and latency/operation decrease linearly with the operand bit-width. This is different from [Callaway97], where a multiplier's power

1. Our current research focuses particularly on the floating point multiplier, since multipliers are usually the major consumer of power [Tiwari94].

Figure 3 : Program Accuracy across Various Exponent Bit-widths

Figure 2 and Figure 3 show that we can reduce both the mantissa and exponent bit-width without affecting the accuracy of the programs. This effect is especially prominent in the mantissa. This reduction of bit-width can be turned into a reduction in power dissipation with the use of appropriate arithmetic circuits.

[Plot for Figure 3: X-axis: exponent bit-width from 8 down to 3; Y-axis: accuracy from 0% to 100%; curves: Sphinx, ALVINN, PCASYS, Bench22.]

Multiplier            Area           Cycle Time   Latency/op
Wallace (24x24)       0.777 sq. mm   40 ns        40 ns
Digit-Serial (24x8)   0.804 sq. mm   15 ns        15 ns

Table 3 : Timing and Area of the Two Multipliers

To perform a 16-bit multiplication using the digit-serial multiplier, 2 cycles are needed, which increases the total delay/op to 30 ns. Similarly, a 24-bit multiplication takes 3 cycles (45 ns).
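The cycle counts follow directly from one 15 ns pass per 8-bit digit; a trivial sketch (the helper name is ours):

```python
import math

CYCLE_NS = 15  # digit-serial cycle time from Table 3

def digit_serial_latency_ns(operand_bits: int, digit_bits: int = 8) -> int:
    """One 15 ns pass is needed per 8-bit digit of the operand."""
    return math.ceil(operand_bits / digit_bits) * CYCLE_NS

# 8 bits -> 15 ns, 16 bits -> 30 ns, 24 bits -> 45 ns
```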

consumption decreases exponentially with the operand bit-width. The difference between the two results is due to the fixed structure (the 24x8 bit CSA array) of the digit-serial multiplier and the control circuitry needed to do iterative carry and sum reduction. This additional power dissipation is the penalty we pay for the flexibility of doing variable bit-width multiplication.

Figure 5 shows the potential power reduction for our four programs if we use the digit-serial multiplier as the mantissa multiplier. For 8-bit multiplication, the digit-serial multiplier consumes less than 1/3 of the power of the Wallace Tree multiplier (in the case of Sphinx and ALVINN). When 9 to 16 bits of mantissa are required (in the case of PCASYS and Bench22), the digit-serial multiplier still consumes 20% less power than the Wallace Tree multiplier. The digit-serial multiplier does consume 40% more power when performing 24-bit multiplication, due to the power consumption of the overhead circuitry.

Table 3 shows another benefit: improved speed. When performing 8-bit or 16-bit multiplications, the operation's delay can be greatly reduced if the critical path of the circuit lies in the multiplier. This increases throughput in addition to the energy savings.

4.3 Summary of Results

These power comparison results show the potential power savings achievable by using variable bit-width arithmetic units. It should be noted that the digit-serial multiplier was designed using an ASIC approach and is not as heavily optimized physically; the Wallace Tree multiplier was optimized for the process. This explains why the 24x8 bit digit-serial multiplier is actually slightly larger than the Wallace Tree multiplier. We believe that with a more careful implementation, both the power and area of the digit-serial multiplier can be reduced. In addition, [Chang97] presents several low-power digit-serial architectures that can further reduce the power consumption of the digit-serial multiplier.

For software designers who know a program's precision requirements, code annotation could be used to allow the underlying arithmetic circuits to reconfigure themselves for variable bit-width operations. As Figure 4 and Figure 5 show, as long as the bit-width requirement is less than 16 bits, the 24x8 bit digit-serial multiplier consumes less energy than the Wallace Tree multiplier. Even for programs which require a large bit-width to maintain precision, there may be sections within the program that have a smaller bit-width requirement and thus can benefit from a variable bit-width unit.

5 Previous Work

The idea of reducing bit-width to save power has been employed in other areas of low-power research. In [He97], it is shown that an average of more than 4 bits of pixel resolution can be dropped during motion estimation to obtain a power reduction of 70%. To reduce the energy consumption of the I/O pins, [Musoll97] proposes sending only those address bits that have changed. For integer applications that do not need 32 bits of precision, the Intel MMX instructions allow arithmetic operations on multiple data simultaneously. Overall, the basic idea of bit-width reduction is to avoid waste.

6 Conclusions and Future Work

It is clear that floating point programs that use human sensory inputs may not need the entire 23 bits of mantissa and 8 bits of exponent specified in the IEEE standard. Our software analysis shows no loss of accuracy across each of our programs when the mantissa bit-width is reduced to less than 50% of its original value. This large reduction in mantissa bit-width enables significant power reduction without sacrificing program accuracy. Further, the digit-serial multiplier results show that it is possible to obtain a substantial power saving by using a variable bit-width arithmetic unit.

Currently, we are looking at various digit-serial architectures and other means of exploiting the variable bit-width requirements of programs. Our future work will be directed towards investigating other characteristics of floating point programs

[Plot for Figure 4: energy/op (pJ, 0 to 2500, left axis) and latency/op (ns, 0 to 45, right axis) vs. operand bit-width (8, 16, and 24 bits).]

Figure 4 : Performance of the Digit-Serial Multiplier

Both energy/op and latency/op of the digit-serial multiplier increase linearly with the operand bit-width. The digit-serial architecture allows us to perform variable bit-width arithmetic and save power when the bit-width requirement is less than that specified in the IEEE standard.

Figure 5 : Power Reduction using Digit-Serial Multiplier

A reduction of 70% in energy/op is attainable for Sphinx and ALVINN with the use of the digit-serial multiplier. Both Sphinx and ALVINN need only 5 bits of mantissa to be 90% accurate, and thus an 8-bit operand bit-width is used for the digit-serial multiplier. PCASYS requires 11 bits of mantissa while Bench22 requires 9 bits, and thus a 16-bit operand bit-width is used, which results in a 20% energy/op reduction.

[Bar chart for Figure 5: energy/op (pJ, 0 to 1400) for Sphinx, ALVINN, PCASYS, and Bench22, comparing the digit-serial and Wallace Tree multipliers.]

that may provide additional power savings. This includes using a simplified rounding algorithm and changing the implied radix.

7 Appendix

Arithmetic operations can be performed in different styles such as bit-serial, bit-parallel, and digit-serial. Bit-serial architectures process data one bit at a time, saving hardware at the expense of processing speed. Bit-parallel circuits process all bits of the data operands at the same time with more hardware. Digit-serial arithmetic falls in between these two extremes by processing a fixed number of bits at one time. Figure 6 shows the block diagram of the digit-serial multiplier used in our experiment. For applications that require a moderate amount of throughput while having serious constraints on design space, digit-serial systems have become a viable alternative for many designers. Most digit-serial systems are used because of space limitations. Even though there is research on low-power digit-serial multipliers [Chang97], it has focused on comparing power consumption among different digit-serial architectures.

8 References

[Burch92] Burch, R., Najm, F., Yang, P., Trick, T. McPOWER: A Monte Carlo Approach to Power Estimation, IEEE/ACM International Conference on CAD, 1992.

[Callaway97] Callaway, Thomas K., Swartzlander, Earl E. Power-Delay Characteristics of CMOS Multipliers, IEEE 13th Symposium on Computer Arithmetic, 1997.

[Chang97] Chang, Y.N., Satyanarayana, Janardhan H., Parhi, Keshab K. Design and Implementation of Low-Power Digit-Serial Multipliers, IEEE International Conference on Computer Design, 1997.

[Dobberpuhl96] Dobberpuhl, Dan. The Design of a High Performance Low Power Microprocessor. International Symposium on Low Power Electronics and Design, 1996.

[Hartley95] Hartley, Richard I., Parhi, Keshab K. Digit-Serial Computation. Norwell, Kluwer Academic Publishers, 1995.

[He97] He, Z.L., Chan, K.K., Tsui, C.Y., Liou, M.L. Low Power Motion Estimation Design Using Adaptive Pixel Truncation, International Symposium on Low Power Electronics and Design, 1997.

[Hwang94] Hwang, M., Rosenfeld, R., Theyer, E., Mosur, R., Chase, L., Weide, R., Huang, X., Alleva, F. Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II. International Conference on Acoustics, Speech and Signal Processing, 1994.

[MPR97] MCore Shrinks Code, Power Budgets, Microprocessor Report, 11(14), 27 October 1997.

[Musoll97] Musoll, E., Lang, T., Cortadella, J. Exploiting the Locality of Memory References to Reduce the Address Bus Energy. International Symposium on Low Power Electronics, 1997.

[Sphinx98] SPHINX Group, Carnegie Mellon University, Pittsburgh, PA. http://www.speech.cs.cmu.edu.speech/sphinx.html

[Tiwari94] Tiwari, V., Malik, S., Wolfe, A. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization, IEEE Transactions on VLSI Systems, vol. 2, pp. 437-445, December 1994.

[Figure 6 : Block Diagram of a 24x8 Digit-Serial Multiplier. The 24-bit operands feed flip-flops and the 24x8 CSA arrays; parallel/serial converters, control logic, and an 8-bit adder produce the low-order and high-order 8-bit digits of the result.]

[Fig. 1. A pipelined self-timed circuit. Each stage consists of a register (REG) and combinatorial logic, with a handshake circuit exchanging Request (Req) and Acknowledge (Ack) signals between neighboring stages.]

T. Schumann and H. Klar are with the Institute of Microelectronics, Technical University of Berlin, D-10623 Berlin, Germany. U. Jagdhold is with the Institute for Semiconductor Physics, D-15230 Frankfurt (Oder), Germany.

A Power Management for Self-Timed CMOS Circuits (FLAGMAN) and Investigations on the Impact of Technology Scaling

Thomas Schumann, Ulrich Jagdhold and Heinrich Klar

Abstract

One advantage of self-timed circuits is the fact that the performance of the chip depends on actual circuit delays, rather than on worst-case delays. This makes it possible to exploit variations in operating temperature by reducing the power supply voltage to achieve an average throughput with a minimum of power consumption. Usually this average throughput is given by the clock frequency of the synchronous environment, and a FIFO is used as a buffer interfacing the self-timed circuit to the synchronous environment.

Rather than adaptively adjusting the power supply voltage to the smallest possible value by using a state-detecting circuit and a dc/dc converter in addition to the FIFO [1], we propose to switch the supply voltage between two levels by using the Half-Full flag of the FIFO (FLAGMAN) [7]. Because two power supply voltages are already required on modern circuit boards for standard low-power ICs, e.g. µPs [2] and DSPs [3] for mobile systems, taking advantage of two supply voltages means zero overhead for the board design.

By investigating real-life data of portable systems as far as temperature variation is concerned, we give an analysis of the power savings achieved by this technique. As a practical example we use a self-timed IC which has already been designed and fabricated: a 4-bit carry-save multiplier [4]. It turns out that applying the proposed power management results in potential power savings of up to 40% for that IC, depending on the corresponding operating temperature and the choice of the supply voltage levels.

1 Introduction

A self-timed circuit is defined as one where the transfer of data is not performed in synchrony with a global clock signal, but rather at times determined by the latencies of the pipeline stages themselves. Therefore, communication signals (Request, Acknowledge) between the pipeline stages are necessary. A handshake circuit is responsible for enforcing a protocol on these communication signals, and the protocol ensures that transfers of data only occur at correct times. Fig. 1 shows the principle of a pipelined self-timed circuit, each stage consisting of combinatorial logic in combination with storage elements and the handshake circuit. Because the transfer of data is performed at times determined by the latencies of the pipeline stages, the throughput rate of the pipelined circuit depends on actual gate

delays. This means that the throughput rate of an existing self-timed circuit depends on both the supply voltage and the actual operating temperature (case temperature) of the IC. This fact can be used to manage the power consumption in a very positive way: assuming the demanded average throughput rate is given by the clock frequency of the synchronous environment, variations in the operating temperature can be exploited by reducing the supply voltage rather than having longer idle times of the gates. Consider the formula for the dynamic power consumption of one pipeline stage in a CMOS process:

P_stage = f_a · C_L · V_DD^2        (1)

where P_stage is the power consumption of one pipeline stage, f_a is the average operating frequency of the gates, C_L is the total switching capacitance of the stage, and V_DD is the supply voltage. It turns out that the supply voltage is the only parameter available for optimization, assuming an existing circuit and a demand on the average operating frequency.
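As a worked instance of Eq. (1) (the voltage, frequency, and capacitance values here are illustrative, not measurements from the paper):

```python
def stage_power_watts(f_a_hz: float, c_l_farads: float, v_dd_volts: float) -> float:
    """Dynamic power of one pipeline stage, Eq. (1): P = f_a * C_L * V_DD^2."""
    return f_a_hz * c_l_farads * v_dd_volts ** 2

# Quadratic dependence on V_DD: at the same f_a and C_L, dropping a
# 3.3 V rail to 2.0 V cuts dynamic power by 1 - (2.0 / 3.3)^2, about 63%.
p_high = stage_power_watts(100e6, 10e-12, 3.3)  # 100 MHz, 10 pF stage
p_low = stage_power_watts(100e6, 10e-12, 2.0)
savings = 1.0 - p_low / p_high
```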

The proposed power management takes advantage of two supply voltages which are already required on modern circuit boards for portable systems. For example, synchronously clocked, low-power µPs [2] run on two supply voltages: the processor core operates at a reduced voltage internally while the I/O interface operates at the standard voltage of 3.3 volts, which allows the processor to communicate with existing 3.3-volt components in the system. This trend holds not only for µPs, but also for high-performance DSPs [3] developed for multi-function applications, e.g. wireless PDAs.

Section 2 illustrates the principle of the power management and gives a formula for the power reduction factor achievable using this technique. Section 3 focuses on an existing self-timed circuit, a 4-bit array multiplier [4], and presents the power savings which can be obtained by applying the power management. Measured data are compared to data calculated using the formula for the power reduction factor. Section 4 shows how technology scaling affects the achievable power savings.

2 Power Management

2.1 Principle

The proposed power management for self-timed circuits is shown in Fig. 2.

[Fig. 2. Principle of the power management (FLAGMAN): the self-timed circuit, supplied through a MUX that selects between the two supply voltages VDD and vdd, writes its results into an asynchronous FIFO; the FIFO's Half-Full and Full (FF) flags steer the supply-voltage MUX and the acknowledge handshake, and a register (REG) clocked at fsyn delivers the data to the synchronous side.]

The circuit operates in a synchronous environment: the data stream input and output are controlled by the clock frequency fsyn.

Fig. 2 shows the output FIFO, working as a buffer interfacing the self-timed circuit to the synchronous environment.

The basic idea of the power management is to use the Half-Full flag of that FIFO as a signal which monitors the speed of the self-timed circuit: if the flag is set, the FIFO is running full and the supply voltage can be reduced. The output of the MUX then switches to the reduced supply voltage.

If, although the reduced supply voltage is applied, the buffer still runs full, the Full flag (FF) of the FIFO holds the acknowledge signal in the same logical state. This results in an idle time for the self-timed circuit. Even under worst-case operating conditions the buffer must not become empty; this must be guaranteed when choosing the asynchronous circuit.

2.2 Power Reduction Factor

The basic idea, slowing down the operating frequency by switching to a second, reduced supply voltage which already exists on modern circuit boards, affects the power consumption in a very positive way. Interpreting (1), this is because the supply voltage is the most important parameter controlling the power consumption in CMOS. The power reduction factor R can be described by:

R(T) = P_two-voltages(T) / P_fixed-voltage(T)    (2)

where T is the case temperature of the IC. In the following we refer to the case temperature T as the operating temperature; the corresponding ambient and junction temperatures can be calculated if the thermal resistances are known. Using the basic equation (1) it follows:

R(T) = [ x(T) · f_vdd(T) · (b·V_DD)^2 + (1 − x(T)) · f_VDD(T) · V_DD^2 ] / ( f_syn · V_DD^2 )    (3)

where b·V_DD is the reduced supply voltage, f_vdd(T) is the throughput rate when the reduced supply voltage is applied at the actual operating temperature T, f_VDD(T) is the throughput rate when the standard supply voltage V_DD is applied at the actual case temperature T, f_syn is the clock frequency of the synchronous environment, and x(T) is the portion of the operating time during which the reduced supply voltage is applied.

To meet the requirement of the synchronous environment it must hold:

f_syn = x(T) · f_vdd(T) + (1 − x(T)) · f_VDD(T)    (4)
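A minimal numerical sketch of equations (3) and (4): solve (4) for the duty cycle x, then evaluate the power reduction factor R. The throughput numbers below are hypothetical illustration values, not the paper's measured data; C_L cancels, so only the normalized reduced supply b = vdd/VDD enters.

```python
def duty_cycle_x(f_syn, f_vdd, f_VDD):
    """Eq. (4): f_syn = x*f_vdd + (1 - x)*f_VDD
    =>  x = (f_VDD - f_syn) / (f_VDD - f_vdd)."""
    return (f_VDD - f_syn) / (f_VDD - f_vdd)

def power_reduction_factor(f_syn, f_vdd, f_VDD, b):
    """Eq. (3), normalized by V_DD^2: ratio of two-voltage power to
    fixed-voltage power; b is the reduced supply as a fraction of V_DD."""
    x = duty_cycle_x(f_syn, f_vdd, f_VDD)
    return (x * f_vdd * b**2 + (1 - x) * f_VDD) / f_syn

# Hypothetical example: 50 MHz environment, circuit runs at 52.5 MHz on the
# full 3.3 V supply and at 40 MHz on the reduced 2.5 V supply.
x = duty_cycle_x(50.0, 40.0, 52.5)
R = power_reduction_factor(50.0, 40.0, 52.5, 2.5 / 3.3)
print(round(x, 3), round(R, 3))  # -> 0.2 0.932
```

With these (made-up) numbers the circuit spends 20 % of the time on the reduced supply, cutting power by about 7 %.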

The power management is based on the fact that the gate delay in CMOS depends on both supply voltage and operating temperature. The gate delay can be estimated [5] as

f_VDD(T) = I_drain / (C_L · V_DD) = β(T) · (V_DD − V_t(T))^2 / (C_L · V_DD)    (5)

where β(T) is the MOS transistor gain factor and V_t(T) is the threshold voltage of the device. The operating temperature influences circuit delays as follows:

β(T) = β_0 · (T / T_0)^(−1.5)    (6)

V_t(T) = V_t0 − (4 mV/°C) · (T − T_0)    (7)

where the exponent −1.5 and the factor 4 mV/°C are empirical values reported in [6]; T_0 is the temperature at which the MOS device parameters β_0 and V_t0 are defined.
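The temperature dependence in eqs. (5)–(7) can be sketched as follows. Only V_t0 = 820 mV at 25 °C comes from the paper; β_0 and C_L are made-up placeholder values, and absolute (Kelvin) temperature is assumed in the (T/T_0)^(−1.5) term.

```python
T0 = 25.0        # deg C, where beta_0 and Vt0 are defined
Vt0 = 0.82       # V (the paper's typical process: 820 mV at 25 deg C)
beta0 = 1.0e-3   # A/V^2, hypothetical gain factor
CL = 1.0e-12     # F, hypothetical switching capacitance

def beta(T):
    # eq. (6): gain factor falls with absolute temperature, exponent -1.5
    return beta0 * ((T + 273.15) / (T0 + 273.15)) ** -1.5

def Vt(T):
    # eq. (7): threshold voltage drops by 4 mV per deg C above T0
    return Vt0 - 0.004 * (T - T0)

def throughput(T, VDD):
    # eq. (5): f(T) = beta(T) * (VDD - Vt(T))^2 / (CL * VDD)
    return beta(T) * (VDD - Vt(T)) ** 2 / (CL * VDD)

# The mobility (beta) loss dominates the Vt drop, so the gates slow down
# as the die heats up -- the headroom the power management exploits:
print(throughput(28.0, 3.3) > throughput(120.0, 3.3))  # -> True
```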

For a given case temperature of the IC, the power reduction factor depends on the chosen level of the second, reduced supply voltage. Fig. 3 presents calculated data for a case temperature of 100 °C, assuming a self-timed circuit with a throughput rate 5 % higher than the clock frequency of the synchronous environment at 120 °C worst-case temperature, operating at the standard 3.3 volts in a typical CMOS process with V_t0 = 820 mV at 25 °C.

3 Application Chip

For applying the power management we use a self-timed IC which has already been designed and fabricated: a pipelined 4-bit multiplier, using a modified micropipeline design style [4]. The asynchronous circuit has been implemented in a 1.2 µm CMOS process.


[Fig. 3. Power reduction factor R versus the normalized second, reduced supply voltage Vdd/VDD (0.55 to 0.90): calculated data and measured data for the 4-bit multiplier with VDD = 3.3 V, case temperature 100 °C, fVDD(120 °C) = 105 % · fsyn.]

[Fig. 4. Throughput rate [MHz] of the self-timed multiplier versus case temperature [°C], for supply voltages from 1.8 V to 3.3 V.]

[Fig. 5. Achievable power savings for the self-timed multiplier: power reduction factor versus case temperature [°C] for second supply voltages of 1.8 V, 2.0 V, 2.3 V, 2.5 V and 2.8 V (each paired with 3.3 V), compared to a fixed 3.3 V supply.]

Fig. 6. Power savings dependence on the probability of running the IC at temperature T

Fig. 4 shows data of the throughput rate for that specific design over a wide range of temperature variations. The measurement was done by attaching a thermocouple bead to the center of the package top surface using highly thermally conductive cement.

Assuming a demand of 50 MHz on the average throughput, given by the clock frequency of the synchronous environment, Fig. 5 presents the power savings achievable for this self-timed circuit by applying the proposed power management. It turns out that applying a very low second supply voltage is not the optimum power-saving solution, because the operating time during which the reduced supply voltage can be applied is then smaller than for higher levels of the second supply.

For example, applying 1.8 V results in a power saving of 20 %, whereas applying 2.5 V results in 40 % power savings over a wide temperature range.

We also investigated the potential power savings from applying the power management in a portable system which runs at different temperatures during the lifetime of its battery. In the following we deal with the impact of real-life data on the power reduction factor of the proposed power management. That a mobile system runs at different temperatures over the lifetime of its battery is a well-known fact. We found that the probability of running the system at temperature T can be described, as a first-order approximation for calculating a practical value of the expected power savings, by a Gaussian normal distribution:

f(T; µ, σ) = 1 / (σ · √(2π)) · exp( −(T − µ)^2 / (2·σ^2) )    (8)

The average temperature µ is set to 35 °C, which for our IC means an ambient temperature of about 25 °C, assuming no airflow. The empirical standard deviation σ is set to 15 °C, which corresponds to a wide range of operating temperatures.

Fig. 6 shows both the probability density distribution of running the IC at case temperature T and the power reduction factor achievable by applying the power management on a circuit board with two supply voltages, 2.5 V and 3.3 V. The expected total power savings for this IC, over the lifetime of its battery, is about 36 %. Tests of the circuit board confirm that the Half-Full flag of a commercially available asynchronous FIFO monitors the speed of the self-timed multiplier.
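The expected savings follow from weighting the power reduction factor R(T) by the Gaussian profile of eq. (8). The sketch below does this numerically; the R(T) curve is a hypothetical placeholder (the real 2.5 V/3.3 V curve comes from the measurements), so only the method, not the 36 % figure, is reproduced.

```python
import math

mu, sigma = 35.0, 15.0  # deg C, the profile assumed in the text

def pdf(T):
    # eq. (8): probability density of running the IC at temperature T
    return math.exp(-((T - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def R(T):
    # placeholder curve: a cooler die leaves more voltage headroom,
    # hence a smaller power reduction factor (more savings)
    return min(1.0, 0.55 + 0.003 * T)

# expected power reduction factor, numerically integrated over -5..114.5 C
step = 0.5
E_R = sum(pdf(T) * R(T) * step for T in [-5 + i * step for i in range(240)])
print(0.0 < E_R < 1.0)  # -> True
```

The expected total savings would then be 1 − E_R.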

Fig. 7 shows that the supply voltage switches between two levels as a result of the switching Half-Full flag. This holds if the on/off delay of the MUX is less than 5.1 µs, assuming a 512-word FIFO and a 50 MHz clock frequency of the synchronous environment.
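The 5.1 µs bound can be checked with simple arithmetic: when the Half-Full flag toggles, the FIFO still holds 256 of its 512 words, so the synchronous side can drain it for 256 cycles at 50 MHz before the buffer could run empty.

```python
fifo_words = 512
f_syn_mhz = 50.0
# slack (in microseconds) available for the MUX to switch supply levels
margin_us = (fifo_words // 2) / f_syn_mhz
print(margin_us)  # -> 5.12
```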


[Fig. 7. Oscilloscope snapshot of the Half-Full flag and the supply voltage of the self-timed circuit: the supply switches between the low and the high level as the number of words in the FIFO crosses 256.]

Fig. 8. Inverter delay versus junction temperature simulated in a 0.25 µm CMOS process

Fig. 9. Inverter delay versus junction temperature simulated in a 1.2 µm CMOS process

Fig. 10. Expected throughput rate of the self-timed multiplier implemented in a 0.25 µm process

Fig. 11. Impact of technology scaling on the achievable power savings for the self-timed multiplier

4 Impact of Technology Scaling on Power Savings

The impact of technology scaling on the possible power savings is investigated by SPICE simulations. A standard 0.25 µm CMOS process (n-well, 5.5 nm gate oxide thickness, 0.5 V NMOS threshold voltage and −0.26 V PMOS threshold voltage) is considered. The technology is optimized for a supply voltage of 2.5 V.

Fig. 8 shows data from SPICE simulations: the inverter delay (fanout = 2) for this 0.25 µm process is plotted versus the junction temperature of the device. For comparison, Fig. 9 shows simulation results for the 1.2 µm CMOS process which has been used for the implementation of the self-timed multiplier. It turns out that the variation of the gate delay between normal-case (28 °C) and worst-case (120 °C) temperature is 51 % smaller for the 0.25 µm process than for the 1.2 µm process.

So the question arises how much this affects the power savings obtained by applying the proposed power management. Considering the simulated inverter delays, we calculated the throughput rate the self-timed multiplier would achieve if it were implemented in the high-performance 0.25 µm process. Fig. 10 shows data of the throughput rate applying 2.5 V and 1.8 V, respectively. It turns out that even under worst-case conditions a clock frequency of 200 MHz for the synchronous environment could be sustained by applying the standard 2.5 V supply to the self-timed IC. Under normal operating conditions the power management switches the supply voltage to 1.8 V for a time interval so as to achieve 200 MHz on average; this means 35 % power savings at a case temperature of 28 °C. Fig. 11 shows the calculated power savings from applying the proposed power management to the self-timed multiplier implemented in the quarter-micron CMOS process. The same probability density distribution of running the IC at case temperature T as for the 1.2 µm process is assumed.

Implementing the multiplier in a 0.25 µm CMOS process and applying the proposed power management results in expected power savings of about 28 %.

5 Conclusion

A power management for self-timed CMOS circuits is proposed, which takes advantage of the existence of two supply voltages on modern circuit boards.


The effectiveness of the power management is investigated by applying the technique to a self-timed multiplier IC implemented in a 1.2 µm CMOS technology. It turns out that the power management results in significant power savings of up to 40 % by making use of the fact that the performance of a self-timed circuit depends on the actual gate delay rather than on the worst-case delay. Investigations also show that the power savings for a deep sub-micron technology are reduced in comparison to a 1.2 µm technology, caused by the reduced influence of the junction temperature on the gate delay.

Acknowledgement

The authors would like to thank IDT, Inc. and MAXIM, Inc. for providing free samples of standard ICs for the board design. They would also like to thank Ayman Bobo for his help in designing and testing the circuit board.

References

[1] L. S. Nielsen, C. Niessen, J. Sparso, and K. van Berkel, "Low-power operation using self-timed circuits and adaptive scaling of the supply voltage", IEEE Trans. on VLSI Systems, vol. 2, no. 4, pp. 391-397, Dec. 1994.

[2] Intel, "The mobile Pentium processor fact sheet", http://www.intel.com/design/mobile, Aug. 1997.

[3] Texas Instruments, "TMS320C6x Device Features and Applications", http://www.ti.com/sc/docs/dsps, Sept. 1997.

[4] T. Schumann, H. Klar, and M. Horowitz, "A fully asynchronous, high-throughput multiplier using wave-pipelining for DSP custom applications", Mikroelektronik für die Informationstechnik, ITG-Fachtagung, Germany, pp. 283-286, 1996.

[5] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design", IEEE J. Solid-State Circ., vol. 27, no. 4, pp. 473-484, Apr. 1992.

[6] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 2nd ed., 1992, pp. 61-69.

[7] T. Schumann and H. Klar, "Power management for self-timed CMOS circuits using the half-full flag of an asynchronous FIFO", Mikroelektronik für die Informationstechnik, ITG-Fachtagung, Germany, 1998.


Hybrid Signed Digit Representation for Low Power Arithmetic Circuits

Dhananjay S. Phatak

Steffen Kahle, Hansoo Kim and Jason Lue

Electrical Engineering Department, State University of New York, Binghamton, NY 13902-6000

email: [email protected]

Abstract

The Hybrid Signed Digit (HSD) representation introduced in [9] employs both signed and unsigned digits and renders the maximum length of carry propagation equal to the maximum distance between two consecutive signed digits. This representation offers a continuum of choices, from two's complement representation on one extreme (where there are no signed digits) all the way to the conventional fully signed digit (SD) representation on the other extreme, wherein every digit is signed. The area-delay tradeoffs associated with each of the HSD formats have been analyzed in [9]; the possibility of reducing power consumption by utilizing the HSD representation was also mentioned therein.

This paper investigates the impact of the HSD representation on power consumption. An adder based on a well-known fully SD implementation [2] (which is adopted in a microprocessor [4]) is compared with an adder based on the HSD representation. Layouts of basic adder cells based on SD and HSD representations are exhaustively simulated to obtain the average number of transitions and power dissipation per addition. The results are then used to estimate power dissipation in adders of various word lengths and indicate that adopting the HSD representation can lead to a reduction in power consumption over a fully SD representation.

The power (and area) reduction is achieved as a result of an algorithm-level change. This leaves room for additional power reduction by employing standard techniques at lower levels of the design process (circuit level, layout level, etc.). Furthermore, the HSD representation offers a continuum of choices to the designer: by increasing the distance between the signed digits, lower power dissipation can be achieved at the expense of slower speed. Thus, if the worst-case delay is predetermined, the designer can select a hybrid representation that minimizes power dissipation under the delay constraint.

1 Introduction

The well-known signed-digit (SD) number representation [1, 9, 10] makes it possible to perform addition with carry propagation chains that are limited to a single digit position, and has been used to speed up arithmetic operations. The advantages of SD representation can be described by drawing an analogy with transforms (Fourier or Laplace transforms, for instance): for problems of sufficient complexity it is worthwhile to pay the overhead of the forward and reverse transforms, because the gain in the transform domain more than offsets the overhead. SD representation is analogous to a transformed domain wherein the addition is carry-free. Moreover, the forward transform (i.e., converting normal numbers into signed digits) is trivial; only the reverse transform requires a delay equivalent to a full addition (which depends on the word length). If multiple operands are to be added, then going to an intermediate signed digit representation is advantageous because the additions are carry-free, and only one full carry propagation is required to convert the final sum back to two's complement format. SD representation has been exploited to speed up most types of arithmetic circuits (multipliers, dividers, CORDIC processors, etc.).

A carry save adder tree based on (3,2) counters achieves a reduction from 3 to 2 operands at each level. Hence the number of levels required for adding n operands is ⌈log_{3/2}(n/2)⌉ = ⌈(log n − log 2) / log(3/2)⌉, where ⌈x⌉ denotes the ceiling of x (i.e., the smallest integer greater than or equal to x) and log z without any base indicates log to base 10. A signed digit tree, on the other hand, needs ⌈log_2 n⌉ levels. If the delay associated with a carry-save stage is Δ_FA and that associated with an SD stage is Δ_SD, then the SD tree is (asymptotically) faster if Δ_SD/Δ_FA < log 2 / log(3/2) ≈ 1.71. Signed-digit adder trees are also easier to lay out and route than Wallace trees [2]. In [2] a 64 × 64 bit multiplier based on a redundant signed-digit binary adder tree was shown to yield a smaller critical path delay than the corresponding Wallace tree multiplier with Booth recoding. A similar design for a fast 54 × 54 bit multiplier was published in [3]. The cost for the speedup is higher area, since every (binary) signed digit needs two bits for encoding.
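The two level-count formulas above can be compared directly; this small sketch tabulates tree depths for a few operand counts (the operand counts are chosen for illustration only).

```python
import math

def csa_levels(n):
    # carry-save (3,2)-counter tree: ceil(log_{3/2}(n/2)) levels to reach 2 operands
    return math.ceil(math.log(n / 2) / math.log(3 / 2))

def sd_levels(n):
    # signed-digit binary adder tree: ceil(log2 n) levels
    return math.ceil(math.log2(n))

for n in (8, 16, 64):
    print(n, csa_levels(n), sd_levels(n))
# -> 8 4 3 / 16 6 4 / 64 9 6
```

For 64 operands the SD tree needs 6 levels against 9 carry-save levels, which is why the per-stage delay ratio 1.71 is the break-even point.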

The advent of mobile computing favors circuits with low power dissipation. Hence, combined analysis of speed and power tradeoffs is attracting increasing research effort [5, 6]. All the number representations considered in [5, 6] were special cases of the GSD representation [8].

In [9] a Hybrid Signed Digit (HSD) representation was introduced. It employs both signed and unsigned digits and reveals a continuum of choices of number representations between the extremes of fully SD on one hand and two's complement on the other. It is outside the GSD framework (but overlaps it), as shown in Figure 1. The area-delay tradeoffs associated with each representation were analyzed in [9] and show that a multiplier based on the HSD-1 representation (i.e., alternate digit positions are signed) has a lower AT product than multipliers that employ fully SD trees. The possibility of reducing power consumption by utilizing the HSD representation was mentioned in [9].

This paper investigates the impact of the HSD representation on power consumption. An adder based on a well-known fully SD implementation [2] (which is adopted in a microprocessor [4]) is compared with an adder based on the HSD representation. Layouts of basic adder cells based on SD and HSD representations are exhaustively simulated to obtain the average number of transitions and power dissipation per addition. The results are then used to estimate power dissipation in adders of various word lengths and indicate that adopting the HSD representation can lead to a reduction in power consumption (11 % or more, depending on the distance between the signed digits) over a fully SD representation.

The power (and area) reduction is achieved as a result of an algorithm-level change. This leaves room for additional power reduction by employing standard techniques at lower levels of the design process (circuit level, layout level, etc.). Furthermore, the HSD representation offers a continuum of choices to the designer: by increasing the distance between the signed digits, lower power dissipation can be achieved at the expense of slower speed. Thus, if the worst-case delay is predetermined, the designer can select a hybrid representation that minimizes power dissipation under the delay constraint.

The next section summarizes the preliminaries, describing the HSD representation. Section 3 derives the average number of transitions and power dissipation per addition in the basic adder cells based on SD and HSD representations; these averages are determined via exhaustive simulation of all possible cases and are used to estimate the power dissipation in adders of different lengths utilizing different HSD formats. The last section discusses the implications and presents conclusions.

2 Preliminaries: The HSD representation

Without loss of generality, we consider the radix-2 HSD representation for the sake of illustration. HSD formats are illustrated in Figure 1. For radix r = 2 the signed digits can take any value in the set {−1, 0, +1}. The unsigned digits are like normal bits and can assume either of the two values {0, 1}. The addition procedure presented in [9] enables a signed digit to stop an incoming carry from propagating further. Consequently, the carries propagate between the signed digits and the maximum length of carry propagation equals the distance between the signed digits. It can be verified that addition in such a representation requires the carry between all digit positions (signed or unsigned) to assume any value in the set {−1, 0, 1}, as in the SD system. The operations in a signed-digit position are exactly the same as those in the SD case. For instance, let x_i and y_i be the radix-2 signed digits to be added at the i-th digit position, and c_{i−1} the carry into the i-th digit position. Each of these variables can assume any of the three values {−1, 0, 1}. Hence −3 ≤ x_i + y_i + c_{i−1} ≤ +3. This sum can be represented in terms of a signed output z_i and a signed carry c_i as follows:

x_i + y_i + c_{i−1} = 2·c_i + z_i    (1)

where c_i, z_i ∈ {−1, 0, 1}. In practice, the signed-digit output z_i is not produced directly. Instead, the carry c_i and an intermediate sum s_i are produced in the first step, and the summation z_i = s_i + c_{i−1} is carried out in the second. Following the procedure in [9] guarantees that the second step generates no carry.

The operations in an unsigned digit position are as follows. Let a_{i−1} and b_{i−1} be the bits to be added at the (i−1)-th digit position; a_{i−1}, b_{i−1} ∈ {0, 1}. The carry into the (i−1)-th position is signed and can be −1, 0 or 1. The output digit e_{i−1} is restricted to be unsigned, i.e., e_{i−1} ∈ {0, 1}. Hence the carry out of the (i−1)-th position must be allowed to assume the value −1 as well. In particular:

    if (a_{i−1} = b_{i−1} = 0 and c_{i−2} = −1) then
        c_{i−1} = −1 and e_{i−1} = 1
    else
        a_{i−1} + b_{i−1} + c_{i−2} = 2·c_{i−1} + e_{i−1},  where c_{i−1}, e_{i−1} ≥ 0
    endif    (2)

The signed-digit positions generate a carry-out and an intermediate sum based only on the two input signed digits and the two bits at the neighboring lower-order unsigned digit position. In the second step, the carries generated out of the signed digit positions ripple through the unsigned digits all the way up to the next higher-order signed digit position, where the propagation stops. All the (limited) carry propagation chains between the signed digit positions are executed simultaneously.
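The per-digit rules (1) and (2) can be sketched as a single-pass ripple addition that checks numerical correctness of the decompositions. This is an illustrative sketch only: the digit-selection policy at signed positions is one valid choice, not the specific two-step, limited-carry encoding tables of [9].

```python
def hsd_add(x, y, signed):
    """Digit-serial addition per rules (1) and (2).
    x, y: little-endian digit lists (signed digits in {-1,0,1}, bits in {0,1});
    signed[i] marks position i as a signed-digit position."""
    out, c = [], 0
    for i in range(len(x)):
        s = x[i] + y[i] + c
        if signed[i]:
            # rule (1): s = 2*c_i + z_i with c_i, z_i in {-1, 0, 1}
            c = max(-1, min(1, (s + 1) // 2))
            out.append(s - 2 * c)
        else:
            # rule (2): unsigned sum bit; carry may be -1 (the a=b=0, c=-1 case)
            e = s % 2            # Python: (-1) % 2 == 1, matching that case
            c = (s - e) // 2
            out.append(e)
    return out, c

def value(digits, carry=0):
    # numerical value of a little-endian radix-2 digit string plus carry-out
    return sum(d << i for i, d in enumerate(digits)) + (carry << len(digits))

# HSD-1 format: alternate positions (1 and 3 here) are signed
x = [1, -1, 0, 1]   # value 7
y = [0, 1, 1, -1]   # value -2
z, cout = hsd_add(x, y, signed=[False, True, False, True])
print(value(z, cout) == value(x) + value(y))  # -> True
```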

The most significant digit in any HSD representation must be a signed digit in order to incorporate enough negative numbers. All the other digits can be unsigned. For example, if the word length is 32 digits, then the 32nd (i.e., the most significant) digit is a signed digit. The remaining digits are at the designer's disposal. If regularity is not necessary, one can make the 1st, 2nd, 4th, 8th and 16th (and 32nd) digits signed and let all the remaining digits be unsigned digits (bits). The addition time for such a representation is determined by the longest possible carry-propagation chain between consecutive signed digit positions (16 digit positions, from the 16th to the 32nd digit in this example).

The HSD representation has another interesting property: there is no need to be restricted to a particular HSD format (with a certain value of d). The representation can be modified (i.e., the value of d can be changed) while performing addition (and consequently, other arithmetic operations), and in certain cases this can be done without any additional time delay. For instance, let x and y be two HSD operands with uniform distances d_x and d_y, respectively, between their signed digits. Also assume that (d_y + 1) is an integral multiple of (d_x + 1), so that the signed digit positions of y are aligned with the signed digit positions of x (note that x has more signed digits than y under the stated assumption). Let z = x + y be their sum, having a uniform distance d_z between its signed digits. The possible values of d_z which are interesting from a practical point of view are 0, d_x and d_y. If we set d_z = d_y, then the above addition will take exactly the same time as an addition of two HSD operands with uniform distance d_y producing an HSD result with distance d_y. Setting d_z = d_x (and clearly, d_z = 0) will reduce the addition time even further, since the introduction of extra signed digits results in shorter carry propagation chains. For example, suppose that d_x = 0 (all digits are signed) and d_y = 1 (alternate digits are signed). If d_z equals 1, then the delay required to perform the addition is the same as that required to add two HSD numbers with the same distance d = 1 to generate an output with d_z = 1. This format-conversion flexibility (without any additional time delay penalty) can be useful, as illustrated by the multiplier design in [9].

3 Power consumption of basic adder cells

The adders considered are static CMOS circuits, since these typically consume lower power than dynamic CMOS designs. The basic building block for the signed digit adders in [2, 4] is the efficient signed digit adder cell illustrated in [2]. The cell takes 2 signed digit operands and a signed carry as inputs and generates a signed digit sum and a signed carry output. Since each variable takes two bits to encode, the cell has 6 inputs and 4 outputs and requires 42 transistors.

The building blocks of the HSD adder are the cells in signed and unsigned digit positions presented in [9]. The signed cell takes 2 signed digits and one signed carry and generates a signed digit output and carry; it has 6 inputs and 4 outputs and also requires 42 transistors. The cell in the unsigned digit position takes 2 unsigned digits (bits) and a signed carry as inputs and generates a signed carry and an unsigned sum bit as the outputs. It has 4 inputs and 3 outputs and requires 34 transistors.

All cells were laid out using the MAGIC editor with the default scalable CMOS technology files. The average number of transitions as well as the average power consumption (per addition) of these cells was estimated both analytically and through exhaustive simulation with IRSIM, as described next.

Transition count via exhaustive testing of combinational circuits

Assume the circuit has n inputs. Then there are 2^n possible input patterns, labeled P_0 through P_N, where N = 2^n − 1. For exhaustive testing: set the current input to P_0 and set the next input to each of P_1, P_2, …, P_N, counting transitions for each case. Do this for each P_i, i = 0, …, N, accumulating transitions and power dissipation for each iteration. Dividing the accumulated sums by N(N + 1) (which is the total number of iterations) yields the (exhaustive) average number of transitions per input pattern applied to the circuit.

To theoretically estimate the number of transitions in the circuit, one needs the total number of transitions at the inputs. This number is determined as follows. Let the current input be P_i. We exhaust all possible next inputs P_j, j ≠ i, and repeat this process for all possible i values. It turns out that the number of transitions on the inputs during exhaustive testing of P_i is the same for all i.

Let P_i = (x_1 … x_n). The number of patterns P_j that differ from P_i in exactly one position is nC1 = n, and each of these contributes 1 transition. Similarly, the number of patterns differing from P_i in exactly k positions is nCk = n(n−1)···(n−k+1) / (1·2···k), and each contributes k transitions. Hence the total number of transitions (on all inputs) in exhaustive testing of P_i is

    Σ_{k=1}^{n} k · nCk = n · 2^(n−1)    (3)

The last equality can be derived by differentiating the expression for (1 + x)^n and setting x = 1 on both sides of the resulting identity.
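The identity in (3) is easy to confirm by brute force for a small n; the sketch below counts input transitions from one fixed current pattern P_i to every other pattern and compares against both the binomial sum and the closed form n·2^(n−1).

```python
from itertools import product
from math import comb

def transitions_from(pi, n):
    # total Hamming distance from pattern pi to every other n-bit pattern
    return sum(sum(a != b for a, b in zip(pi, pj))
               for pj in product((0, 1), repeat=n) if pj != pi)

n = 5
pi = (0,) * n
print(transitions_from(pi, n),
      sum(k * comb(n, k) for k in range(n + 1)),
      n * 2 ** (n - 1))  # all three are 80 for n = 5
```

By symmetry the count is the same regardless of which P_i is chosen, as the text states.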

Summing over all 2^n current patterns P_i, the total number of input transitions is

    2^n · (n · 2^(n−1)) = n · 2^(2n−1)    (4)

so the number of transitions for any single input is 2^(2n−1), since all inputs are symmetric.

Using the zero-delay model in [7], the expected number of transitions E[n_y(T)] at the output y of a logic module in time interval T is

E[n_y(T)] = Σ_{i=1}^{n} P(∂y/∂x_i) · E[n_{x_i}(T)]    (5)

where x_i, i = 1 … n, are the inputs to the module and P(∂y/∂x_i) is the probability that the Boolean difference ∂y/∂x_i = (y|x_i=1) ⊕ (y|x_i=0) is 1.

Using the total number of input transitions in exhaustive testing from equation (4) in the above equation, and assuming that the primary inputs of the cells are independent, leads to the analytical estimates of the average number of transitions per addition for each of the cells shown in Table 1.

                              Theoretical     Simulation outputs
                              estimate
                              -----------     ------------------------------
                              Average         Average        Average Power
                              Transitions     Transitions    Dissipation
                                                             (10^-5 Watts)

Signed digit adder cell
based on SD rep. [2]              7.94            11              8.171

Signed digit adder cell
based on HSD rep. [9]             7.94            12              8.291

Unsigned digit adder cell
based on HSD rep. [9]             7.46             9              5.126

Table 1: Comparison of average number of transitions and power dissipation in the basic cells. The simulation average was determined by exhaustively testing all possible input cases, excluding the don't-care input combinations.

The simulator used was IRSIM 9.3. It counts transitions at ALL internal nodes and takes into account any glitches and/or extraneous transitions. Hence the transition counts from the simulator are higher than the analytical estimates. The signed digit cells are designed assuming that 0 is encoded as "00", 1 as "01" and −1 as "11". Hence it was assumed that the input combination "10" would never occur as a signed digit, and this was exploited as a "don't care" to simplify the logical expressions and arrive at the final cell design. These input combinations are therefore excluded from the simulations.

4 Power dissipation of SD and HSD adders

Exhaustive determination of the average number of transitions and power dissipation per addition is impossible for adders of practical word lengths (32, 53 or 64). In fact, even a 3-digit SD adder has 14 inputs (3 × 2 = 6 signed digits requiring 12 bits, plus the two bits representing the carry-in into the least significant position). The number of cases to be considered (for exhaustive determination of averages) is (2^14)^2 = 2^28, which is prohibitively large (even if don't-care cases are excluded, the exact count is (3^6 · 4) · (3^6 · 4 − 1) = 8,500,140, or over 8.5 million). For 2-digit adders, however, the number of cases (excluding the "don't care" input combinations) is 104,653, which is manageable.

In the HSD-1 format, alternate digits are signed (the rest are unsigned). Hence the total power consumed by the cascade of one unsigned and one signed cell should be compared with the total power consumed by the cascade of two SD cells (each covers 2 digit positions in its respective representation). Thus, the difference between the HSD-1 and SD formats can be understood by considering two digit positions of each, which is another reason for exhaustively simulating adders of word length 2. The results are shown in Table 2.

                                   Number of    Average        Average Power
                                   cases        Transitions    Dissipation
                                                               (10^-5 Watts)

Cascade of two SD cells [2]        104,653      23 (22)        14.36 (16.34)

Cascade of Signed (higher
significant) and Unsigned
HSD cells [9]                       20,591      21 (21)        12.81 (13.42)

Table 2: Comparison of average number of transitions and power dissipation in two-digit adders (word length of two digits). The number of cases excludes the patterns in which any of the signed digits gets the impossible input combination "10". The values in parentheses are obtained by adding the corresponding values for individual cells from Table 1.

Comparison of Table 1 and Table 2 allows a test of the “independence hypothesis”. The carry signals between adjacent cells are not independent, and hence the average power dissipated by a group of cells may not be estimated simply as a sum of the average powers dissipated by the individual cells. This is seen in Table 2, where the values in parentheses (which are the sums of average transitions and power dissipated by individual cells) differ from the values generated by actual simulation. The difference, however, is moderate (about 4.5% in the number of transitions and 13% in the power dissipation values).
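A quick check of the quoted percentages, using the Table 2 values verbatim (a sketch; the choice of denominator for the power discrepancy is our assumption):

```python
# Simulated two-digit averages vs. sums of single-cell averages (Table 2).
sim_transitions, sum_transitions = 23, 22
sim_power, sum_power = 14.36, 16.34        # x 10^-5 Watts

trans_err = abs(sim_transitions - sum_transitions) / sum_transitions
power_err = abs(sim_power - sum_power) / sim_power

assert round(trans_err, 3) == 0.045        # about 4.5%
assert round(power_err, 2) == 0.14         # roughly the 13% quoted
```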

For word lengths beyond 2 digits, the power dissipation estimates must be obtained by extrapolating the results for two-digit cascades. This is more accurate than extrapolating the results from individual cells. The estimates for word lengths of 24, 32, 53 (the number of digits in the double-precision floating-point representation prescribed in the IEEE 754 standard) and 64 bits are summarized in Table 3.

Word     Estimate of Average          Estimate of Average Power
Length   Transitions per addition     Dissipation per addition (10^-5 Watts)
         SD       HSD-1               SD       HSD-1
24       276      252                 172.3    153.72
32       368      336                 229.8    204.96
53       609.5    556.5               380.5    339.465
64       736      672                 459.5    409.92

Table 3: Comparison of average number of transitions and power dissipation in SD and HSD adders. The estimates are obtained by extrapolating the values in Table 2.
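The extrapolation is linear in the word length: each two-digit average is scaled by W/2. A short sketch reproducing the Table 3 entries from the Table 2 values (the `extrapolate` helper is ours, not from the paper):

```python
# Two-digit simulated averages from Table 2 (per two digit positions).
SD = {"transitions": 23.0, "power": 14.36}     # power in 10^-5 Watts
HSD1 = {"transitions": 21.0, "power": 12.81}

def extrapolate(two_digit_value, word_length):
    """Scale a two-digit average linearly to the full word length."""
    return two_digit_value * word_length / 2

# Spot checks against the Table 3 entries:
assert extrapolate(SD["transitions"], 24) == 276
assert round(extrapolate(SD["power"], 32), 2) == 229.76    # quoted as 229.8
assert extrapolate(HSD1["transitions"], 53) == 556.5
assert round(extrapolate(HSD1["power"], 64), 2) == 409.92
```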

It is seen that the power dissipated by the HSD-1 adder is about 12% less than that of the SD counterpart. The critical path delay of the SD adder is equivalent to about 5 gates (two-input NAND/NOR gates), whereas the delay of the HSD-1 adder is about 6 gates [9], irrespective of the word length.

Note that HSD-1 is just one of several feasible formats within the HSD representation. The distance between signed digits can be increased to obtain further power reduction at the expense of speed. This is illustrated by the plots in Figure 2, for a word length of 24. The number 24 was chosen merely for the sake of illustration, because it is divisible by several integers (yielding many possible HSD formats with uniform distance between the signed digits) and also happens to be the number of significand bits in the IEEE standard for single-precision floating-point numbers. In these figures, the measures of interest (transitions, power dissipation, delay, power × delay) are plotted as a function of d, the distance between adjacent signed digits. The point d = 0 corresponds to the SD representation, where every digit is signed.

In Figure 2, note that at every point between d = 12 and d = 23 (which corresponds to all digits except the most significant digit being unsigned), the number of signed digits is 2, since the most significant or 24th digit is signed and there is one additional signed digit in the word. As d is increased from 12 to 22, the position of the signed digit shifts from the 13th to the 23rd place. Thus the distance between signed digits is non-uniform when 12 < d ≤ 22. However, the total number of signed


digits, and hence the total number of unsigned digits, remain 2 and 22, respectively, for d in this range. Hence, the power dissipation is nearly constant for all d values in the range from 13 to 22. The critical path, on the other hand, increases linearly with the longest distance between signed digits, as illustrated in Figure 2–c. Also, beyond d > 20, the carry chain between signed digits is long enough to render the total delay (i.e., the propagation delay through the chain plus the complex gates at the ends) of the HSD adder higher than that of the ordinary ripple-carry adder. Finally, from Figure 2–d, it is seen that the HSD-1 adder has almost the same power × delay product as the fully SD adder.

5 Conclusion

The Hybrid Signed Digit (HSD) representation is shown to lead to smaller power dissipation compared with the fully Signed Digit representation. The power reduction is about 12% (or higher, depending on the distance between the signed digits). The number of transitions, power dissipated, delay and the power × delay product were evaluated as functions of the distance d between signed digits. The plots indicate that the HSD-1 adder has almost the same power × delay product as the fully SD adder.

The power (and area) reduction is achieved as a result of an algorithm-level change. This leaves room for additional power reduction by employing standard techniques at lower levels of the design process (such as the circuit level, layout level, etc.). Possible future work includes investigation of optimal encodings for signed digits to reduce power dissipation.

References

[1] Avizienis, A. “Signed-digit number representations for fast parallel arithmetic”. IRE Transactions on Electronic Computers, vol. EC-10, Sep. 1961, pp. 389–400.

[2] Kuninobu, S., Nishiyama, T., Edamatsu, H., Taniguchi, T., and Takagi, N. “Design of high speed MOS multiplier and divider using redundant binary representation”. Proc. of the 8th Symposium on Computer Arithmetic, 1987, pp. 80–86.

[3] Makino, H., Nakase, Y., and Shinohara, H. “An 8.8-ns 54 × 54-bit Multiplier Using New Redundant Binary Architecture”. In Proceedings of the International Conference on Computer Design (ICCD), Cambridge, Massachusetts, Oct. 1993, pp. 202–205.

[4] Miyake, J., et al. “A Highly Integrated 40-MIPS (Peak) 64-b RISC Microprocessor”. IEEE Journal of Solid State Circuits, vol. 25, no. 5, Oct. 1990, pp. 1199–1206.

[5] Nagendra, C., Owens, R. M., and Irwin, M. J. “Power Delay Characteristics of CMOS Adders”. IEEE Transactions on VLSI, Sept. 1994, pp. 377–381.

[6] Nagendra, C., Owens, R. M., and Irwin, M. J. “Unifying Carry–Sum and Signed–Digit Number Representations for Low Power”. In Proceedings of International Symposium on Low Power Design, Dana Point, California, Apr. 1995.

[7] Najm, F. “Transition Density, A Stochastic Measure of Activity in Digital Circuits”. In Proceedings of the 28th ACM/IEEE Design Automation Conference (DAC), 1991, pp. 644–649.

[8] Parhami, B. “Generalized signed-digit number systems: a unifying framework for redundant number representations”. IEEE Transactions on Computers, vol. C-39, Jan. 1990, pp. 89–98.

[9] Phatak, D. S., and Koren, I. “Hybrid Signed–Digit Number Systems: A Unified Framework for Redundant Number Representations with Bounded Carry Propagation Chains”. IEEE Trans. on Computers, Special issue on Computer Arithmetic, vol. TC–43, no. 8, Aug. 1994, pp. 880–891. (An unabridged version is available on the web via the URL http://www.ee.binghamton.edu/faculty/phatak).

[10] Srinivas, H. R., and Parhi, K. K. “A fast VLSI adder architecture”. IEEE Journal of Solid-State Circuits, vol. SC-27, May 1992, pp. 761–767.

Figure 2–a: Number of transitions vs. the maximum distance d between two consecutive signed digits (wherever possible, the distance between signed digits is uniform). Word length is 24. d = 0 corresponds to the fully SD adder using the SD cell in [2], and d = 23 corresponds to a ripple-carry adder. d = 1 corresponds to the HSD-1 representation using the cells presented in [9]; similarly, d = 2 corresponds to HSD-2, and so on.

Figure 2–b: Power dissipation (× 10^-5 Watts) as a function of the distance d between two consecutive signed digits.


Figure 1: HSD formats and the relationship between HSD and GSD number representations.
(a) Different HSD formats: HSD-k denotes a representation with k unsigned digits between every pair of consecutive signed digits; carries propagate in parallel between consecutive signed digits. The last drawing illustrates a format with non-uniform distance between signed digits.
(b) Relationship between HSD and GSD number representations: HSD with uniform distance between signed digits corresponds to a GSD representation.

Figure 2–c: Critical path delay vs. the maximum distance d between two consecutive signed digits.

Figure 2–d: Power × delay (× 10^-5) as a function of the distance d between two consecutive signed digits.


Alternative Architectures


Power-Saving Features in AMULET2e

S.B. Furber, J.D. Garside and S. Temple.
Department of Computer Science, The University of Manchester,

Oxford Road, Manchester M13 9PL, UK.

Abstract

AMULET2e is a self-timed embedded system controller which is software compatible with the ARM6 microprocessor. Its design incorporates a number of power-saving features which can be disabled under software control, thereby allowing direct measurement of their efficacy. In this paper we review three of these features: two exploit the high frequency of sequential memory accesses to bypass unnecessary circuit functions in the main cache and the branch target cache; the third exploits the ability of clockless circuits to become completely inert when idle. These features are described and measurements of their effectiveness are presented.

1 Introduction

Designing for power-efficiency at the architectural level is a matter of minimising unnecessary circuit activity. Self-timed design, with its inherent data-driven characteristic, would appear to have an intrinsic advantage in this respect. However, when a processor is running under maximum load there is little redundant activity which can be avoided in order to reduce dissipation, whether the control is clocked or self-timed; but when the load is reduced or varies, a self-timed design can adapt automatically, whereas a clocked design must employ explicit power management techniques which themselves cost power.

The AMULET microprocessors, developed at the University of Manchester, are self-timed implementations of the ARM architecture [1]. They were designed to demonstrate the feasibility and desirability of self-timed design. The advantages of self-timed design are not restricted to power-efficiency - they extend to electromagnetic compatibility and modularity, and an argument can be made on performance grounds - but in this paper we focus on the power-efficiency benefits as manifested in AMULET2e.

2 AMULET2e overview

AMULET2e is a self-timed controller designed for power-sensitive embedded applications. Throughout the design reasonable care was taken to minimise power wastage, and many techniques which are used on

clocked ARM designs were re-employed in the self-timed context.

The key features of AMULET2e are a self-timed ARM-compatible microprocessor core (AMULET2), a 4 Kbyte on-chip memory, a flexible memory interface and a number of control functions. Its organisation is illustrated in figure 1. Off-chip accesses use a reference delay to ensure that timing requirements are met, with different memory regions being configurable to different bus widths and access timings (all expressed in terms of the reference delay).

3 Memory subsystem

The memory system comprises 4 Kbytes of RAM which can be memory mapped or used as a cache [2]. It is a composition of four identical 1 Kbyte blocks, each having an associated 64-entry tag CAM. The CAM and RAM stages are pipelined so that up to two memory accesses may be in progress in the cache at any one time.

The cache is 64-way associative with four 32-bit words per line. It is write-through (for simplicity) and does not allocate on a write miss. Line fetch is performed with the addressed word first and is non-blocking, allowing hit-under-miss, the line fetch being a separate, asynchronous task [3].

3.1 Cache power-efficiency

The organisation of the cache, based upon a highly associative cache with a segmented CAM tag store, follows the practice of many of the clocked ARM CPUs and is justified from a power-efficiency perspective as follows:
• the high associativity minimises misses, thereby minimising the high energy cost of off-chip accesses;
• the CAM tag store avoids the high power cost of the large number of active sense amplifiers in a set-associative cache;
• segmenting the CAM avoids the high energy cost of activating a single monolithic CAM;
• self-timing is used to delay activation of the data RAM sense amplifiers and to turn them off as soon as the data has been sensed.

As a final power-saving feature, the cache can be configured to cause the CAM lookup to be bypassed for sequential accesses to the same cache line. As, typically,


Figure 1: AMULET2e internal organization

75% of all ARM memory accesses are sequential to the preceding access, this saves a significant proportion of all CAM lookups.

Bypassing the CAM effectively removes a pipeline stage from the cache access, so it not only saves power but also results in a slightly faster cache access time and therefore increased performance. It is a simple matter to disable this optimization by presenting all addresses to the cache as if they were non-sequential, so the power and performance benefits of the CAM bypass mechanism can readily be measured.
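The bypass logic can be sketched as a toy model: track the last line hit and skip the tag search when a sequential access stays within it. All names here (`SequentialBypassCache`, etc.) are illustrative assumptions, not the actual AMULET2e hardware:

```python
class SequentialBypassCache:
    """Toy model of the CAM-bypass idea: a sequential access to the same
    cache line skips the associative tag (CAM) lookup entirely."""

    LINE_WORDS = 4  # four 32-bit words per line, as in AMULET2e

    def __init__(self):
        self.last_line = None
        self.cam_lookups = 0

    def access(self, addr, sequential):
        line = addr // self.LINE_WORDS
        if sequential and line == self.last_line:
            return "bypass"          # no CAM energy spent
        self.cam_lookups += 1        # full associative tag search
        self.last_line = line
        return "cam"

cache = SequentialBypassCache()
# A straight-line run of 8 word accesses touches 2 lines, so only 2 of
# the 8 accesses pay for a CAM lookup:
results = [cache.access(a, sequential=(a > 0)) for a in range(8)]
assert cache.cam_lookups == 2
```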

4 Branch prediction

Branch prediction is used in the AMULET2 core to reduce the number of unwanted instructions prefetched, thereby increasing performance and reducing power consumption. The approach is based upon a branch target cache (BTC, see figure 2) which associates a particular instruction address (that of a previously discovered branch) with a target address. Subsequent references to this instruction then cause an automatic flow change.

The BTC needs no reference to the instruction itself and so is totally contained within the address interface. It acts in parallel with the PC incrementer (the normal next address ‘predictor’), subverting the flow when it recognises an address. For simplicity, in AMULET2

the BTC always predicts a known branch as taken; despite this it more than halves the number of erroneously prefetched instructions in typical code.
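The BTC behaviour described above, a small associative map from branch address to target with the incrementer as the default predictor, can be sketched as follows (the class and its eviction policy are our illustration, not the hardware design):

```python
class BranchTargetCache:
    """Toy BTC that always predicts a known branch as taken."""

    def __init__(self, entries=20):        # AMULET2 uses 20 entries
        self.entries = entries
        self.table = {}                    # branch PC -> target PC

    def predict(self, pc):
        """Cached target if this PC is a known branch, else PC + 1
        (the incrementer's 'prediction')."""
        return self.table.get(pc, pc + 1)

    def learn(self, pc, target):
        """Record a newly discovered branch, evicting if full."""
        if pc not in self.table and len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))   # drop oldest entry
        self.table[pc] = target

btc = BranchTargetCache()
btc.learn(pc=8, target=2)        # a previously discovered backward branch
assert btc.predict(8) == 2       # subsequent references are redirected
assert btc.predict(3) == 4       # non-branches fall through to PC + 1
```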

Interestingly, this unit is similar to the ‘jump trace buffer’ used on the MU5 mainframe computer [4] developed at the University of Manchester between 1969 and 1974; this too was an asynchronous machine.

4.1 BTC power-efficiency

In many embedded controller applications the processing power needs only to be ‘adequate’, whereas the electrical power consumption must be minimal. Whilst predicting branches saves power by reducing the number of unnecessary memory cycles, the branch prediction unit itself must dissipate power. An important consideration in designing the BTC for AMULET2 was that its own consumption should be low. Its simple architecture contributes to this, but the detailed design has also exploited features of the local environment for power saving.

The BTC caches the addresses of branches. The working set of branches is relatively small and sparse, thus a small, fully associative cache is appropriate, with one tag per branch. In AMULET2 a twenty-entry CAM/RAM structure is employed, with the ‘RAM’ (which is rarely invoked) implemented as registers for speed. Addresses circulating in the address interface are compared with the CAM values to see if a branch is to be predicted. In contrast to an instruction cache, where hits are far more common than misses, most instructions are not branches so the expected result in the BTC is a cache miss.

If an address is sequential to the previous one it is unusual for its high order bits to change. Thus if, for example, the four least significant bits are ignored, the tag comparison changes only once in every sixteen addresses. AMULET2 exploits this by splitting the CAM into two (see figure 3), so that the more significant bits are compared only when such a change occurs, and most comparisons are reduced to a few bits. The full precharge/discharge cycle is only necessary in three cases:
• when a sequential cycle affects the more significant bits;

Figure 2: Branch target cache organisation


• when a branch is predicted and the next cycle is non-sequential;

• when an unpredicted branch arrives.
The first two cases are always known about in the preceding cycle, whilst the last causes a BTC write operation which does not require a preceding precharge, so in none of these cases is performance impacted. In fact the low-order lookup is somewhat faster than a full CAM access, so there is also a small performance benefit from this power-saving feature.
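The split-CAM saving can be modelled by counting how often the high-order compare is actually needed; the field width and function name below are illustrative assumptions:

```python
# Sketch of the split-CAM comparison: low-order tag bits are compared on
# every access; the high-order compare (and its precharge cost) happens
# only when the high bits change.
LOW_BITS = 4                    # ignore the four least significant bits

def lookup_costs(addresses):
    """Count how many accesses need a full (high + low) CAM compare."""
    full, prev_high = 0, None
    for addr in addresses:
        high = addr >> LOW_BITS
        if high != prev_high:
            full += 1           # high-order CAM activated
            prev_high = high
        # the low-order compare happens every cycle (cheap, few bits)
    return full

# A sequential stream of 32 addresses changes the high bits only twice,
# so only 2 of the 32 lookups pay for a full comparison:
assert lookup_costs(range(32)) == 2
```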

The BTC can be completely disabled and its performance and power contributions thereby measured. When the BTC is enabled the split-CAM power-saving feature can be enabled or disabled, so its contribution can also be measured.

5 ‘Halt’

An interesting property of a self-timed system is that it processes either at full speed or it halts, and it makes an instant transition between these two states. In its halted state the device effectively uses no power because the power consumption of digital CMOS circuits is usually negligible when they are not switching.

This has been exploited in AMULET2 by retrofitting a halt function to the existing instruction set. Most ARM programs enter an idle loop where an instruction continuously loops back to itself until an interrupt occurs. AMULET2 detects this instruction and a mechanism stalls a control loop in the execution pipeline. The stall propagates throughout the system, halting all activity. An interrupt causes immediate, full speed resumption. This feature is completely code-compatible with existing software.

The halt mechanism can be disabled and its effectiveness measured. It is different in nature from the previously mentioned optimizations as it only has an effect when the processor is idle, so it has no effect on benchmark programs that operate at peak throughput.

6 AMULET2e test results

AMULET2e has been fabricated (see figure 4) and successfully runs standard ARM code. It was fabricated on a 0.5 µm, 3-layer metal process. It uses 454,000 transistors (93,000 of which are in the processor core) on a die which is 6.4 mm square. The device in the fastest available configuration delivers 42 Dhrystone 2.1 MIPS with a power consumption of about 150 mW in the core logic (which includes the processor core and cache memory, but excludes the I/O pads) running at 3.3 V. This is faster than the older ARM710 but slower than the ARM810, which was fabricated at about the same time. It represents a power-efficiency of 280 MIPS/W, which is as good as either of the clocked CPUs.
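The quoted power-efficiency follows directly from the measurements (42 MIPS at roughly 150 mW of core power):

```python
# Power-efficiency arithmetic for the measurements quoted above.
mips = 42
core_power_mw = 150                          # about 150 mW at 3.3 V
mips_per_watt = mips * 1000 / core_power_mw
assert mips_per_watt == 280                  # the quoted 280 MIPS/W
```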

6.1 BTC and cache power-save measurements

Two benchmark programs were used to measure the performance and power-efficiency benefits of the previously-mentioned architectural features:
• Dhrystone 2.1 is widely used to evaluate embedded cores, but we found that it has very unusual branching characteristics compared with any ‘real’ programs we used during the architectural evaluation of the BTC (and we resisted the temptation to optimise the BTC for Dhrystone!);
• a set of sorting routines, combining insertion sort, shell sort and quick sort algorithms, running on a randomised data set, was used to generate alternative results. This program has very tight looping behaviour.

The results of these tests are summarised in Table 1. The base case is the processor running from cache with the BTC (and its power-saving feature) and the CAM bypass all disabled. Table 1 shows the improvement in performance (MIPS) and power-efficiency (MIPS/W) delivered by enabling each feature in turn (noting that the BTC power-save feature requires the BTC to be enabled), and their total effect when they are all enabled.

The cache CAM bypass mechanism delivers an improvement of around 5% in performance and 15% in power-efficiency on both benchmark programs. This is a

Figure 3: BTC structure

Figure 4: AMULET2e die picture


useful contribution from a simple circuit addition.

The BTC gives a modest 5% performance improvement on Dhrystone and a dramatic 22% on the ‘Sorts’ benchmark. ‘Real’ programs are expected to fall somewhere between these two. At the lower end it has little effect on power-efficiency (indeed, in some configurations we have observed it reduce power-efficiency). At the higher end it aids power-efficiency, but less than it enhances performance. The BTC power-save feature makes little difference to the Dhrystone performance but improves its power-efficiency by 3%. It improves the ‘Sorts’ performance by a further 3% and its power-efficiency by 7%. In all cases where we have observed the BTC have a negative impact on power-efficiency, the BTC power-save feature has cancelled the loss and converted it into a power-efficiency gain.

6.2 ‘Halt’ measurements

The operation of the ‘Halt’ feature is different in principle from the cache CAM bypass and the BTC power-save as it has no impact on power consumption when the processor is running flat-out, as it is when executing a benchmark program. It only comes into play when the processor is idle.

To judge its effectiveness we can observe the current consumed by the processor core logic when idling with ‘Halt’ disabled, which is around 50 mA if the idle loop is in the cache. With ‘Halt’ enabled this reduces to around 1 mA on the test card when centisecond timer interrupts are being handled, and to around 1 µA (leakage current) when interrupts are disabled. These results are summarized in Table 2.
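Expressed in consistent units, these measurements give the ratios quoted in the conclusions:

```python
# Idle currents from Table 2, in microamps.
idle_disabled_ua = 50_000   # 50 mA spinning in the idle loop, 'Halt' off
idle_halted_ua = 1          # ~1 uA leakage, 'Halt' on, interrupts disabled

assert idle_disabled_ua // idle_halted_ua == 50_000   # factor of 50,000

# At 3.3 V the halted core dissipates about 3.3 uW ("a miserly 3 uW"):
halted_power_uw = 3.3 * idle_halted_ua
assert round(halted_power_uw, 1) == 3.3
```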

7 Conclusions

AMULET2e is a self-timed embedded controller designed for power-sensitive applications. It incorporates a number of features which have been added to improve its power-efficiency, and which can be enabled and disabled under software control so that their contribution can be assessed.

We have described specific power-saving features in the branch prediction mechanism and cache which together reduce the core power consumption by around 20%, and also increase performance by around 10%. The branch prediction mechanism itself improves performance by between 5% and 25% and power-efficiency by between 0% and 22%, depending on the application.

We have also described the AMULET2e ‘Halt’ mechanism, which exploits the self-timed control of the processor to reduce idle power to leakage current only, a factor of 50,000 times lower than the idle state with ‘Halt’ disabled. The power consumption of the halted processor core is therefore a miserly 3 µW. As many embedded controllers spend a lot of their time halted waiting for input, interspersed with bursts of intense activity, it can be observed that an asynchronous implementation of a low power architecture can yield significantly lowered average power consumption, resulting in (for example) increased battery life in portable equipment.

Whilst a clocked chip can make up some of this ground by using power management techniques to reduce its clock frequency, this requires software control, and that software itself consumes power. On AMULET2e this minimal idle power comes easily and automatically, and can be exploited in the shortest of idle periods between bursts of activity.

8 Acknowledgements

The AMULET2e design work described in this paper was carried out as part of ESPRIT project 6909, OMI/DE-ARM (the Open Microprocessor systems Initiative - Deeply Embedded ARM Applications project). The authors are grateful for this support from the EU.

The authors are also grateful for support in various forms from Advanced RISC Machines Limited and for the fabrication service supplied by VLSI Technology, Inc. The chip was designed using EDA tools from Compass Design Automation, with accurate simulation using TimeMill from EPIC Design Technology, Inc., being vital to the final verification of the design.

Nigel Paver and Paul Day both made very substantial contributions to the design of AMULET2e and were jointly responsible for the ‘Halt’ mechanism described in this paper.

9 References

[1] Furber, S.B., ARM System Architecture, AddisonWesley Longman, 1996. ISBN 0-201-40352-8.

[2] Garside, J.D., Temple, S. and Mehra, R., “The AMULET2e Cache System”, Proc. Async’96, Aizu-Wakamatsu, pp. 208–217.

[3] Mehra, R. and Garside, J.D., “A Cache Line Fill Circuit for a Micropipelined Asynchronous Microprocessor”, TCCA Newsletter, IEEE Computer Society, October 1995.

[4] Morris, D. and Ibbett, R.N., The MU5 ComputerSystem, Macmillan, New York, 1979.

Feature              Dhrystone 2.1          Sort
                     MIPS     MIPS/W        MIPS     MIPS/W
BTC on               +5%      +0%           +22%     +15%
BTC power-save       +5%      +3%           +25%     +22%
CAM bypass           +6%      +15%          +5%      +14%
All on               +11%     +18%          +32%     +35%

Table 1: Experimental results

Idle condition                                   Idd (mA)
‘Halt’ disabled                                  50
‘Halt’ enabled; centisecond timer interrupts     1
‘Halt’ enabled; interrupts disabled              0.001

Table 2: AMULET2e core idle currents


A Fully Reversible Asymptotically Zero Energy Microprocessor

Carlin Vieri, M. Josephine Ammer, Michael Frank, Norman Margolus, Tom Knight

MIT Artificial Intelligence Laboratory, 545 Technology Square, Cambridge MA 02139, USA

Abstract

The only way to compute with asymptotically zero power is reversibly. Recent implementations of reversible and adiabatic [17, 8] logic in standard CMOS silicon processes have motivated further research into reversible computing. The application of reversible computing techniques to reduce energy dissipation of current generation CMOS circuits has so far been found to be limited, but the techniques used to design reversible computers are interesting in and of themselves, and other technologies, such as Josephson Junctions and quantum computers, as well as future CMOS technologies, may require fully reversible logic. This paper discusses the design of a fully reversible microprocessor architecture.

Computing with reversible logic is the only way to avoid dissipating the energy associated with bit erasure, since no bits are erased in a reversible computation. Low energy techniques such as voltage scaling lower the cost of erasing information. Techniques such as clock gating effectively reduce the number of bits erased. Reversible techniques have already been used to lower the cost of bit erasure for nodes that have a high cost of erasure, but the present work is directed at saving every bit, computing fully reversibly. The goal is to convert a conventional RISC processor to completely reversible operation. This investigation indicates where bit erasure happens in a conventional machine and the varying difficulty across datapath modules of computing without erasing bits.

The initial motivation for reversible computing research came from an investigation of fundamental limits of energy dissipation during computation [9]. The link between entropy in the information science sense and entropy in the thermodynamics sense, exhibited by Maxwell’s demon [10], requires a minimum energy dissipation of k_B T ln 2, where k_B is Boltzmann’s constant, when a bit is erased. Erasing a bit is a logically irreversible operation with a physically irreversible effect. A reversible computer avoids bit erasure.
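For a sense of scale, the Landauer bound at room temperature works out to a few zeptojoules per erased bit; a quick evaluation (the choice of T = 300 K is ours, for illustration):

```python
from math import log

# Landauer bound: erasing one bit dissipates at least k_B * T * ln 2.
k_B = 1.380649e-23          # Boltzmann's constant, J/K
T = 300.0                   # room temperature, K (our choice)
e_min = k_B * T * log(2)
assert 2.8e-21 < e_min < 3.0e-21   # about 2.9e-21 joules per erased bit
```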

Judicious application of reversibility in adiabatic circuits has already proven its usefulness in reducing energy dissipation [2]. This paper examines the complexity and difficulty in avoiding bit erasure entirely and discusses a set of techniques for designing reversible systems.

This work is supported by DARPA contract DABT63-95-C-0130

1 Introduction

Power and energy dissipation in modern microprocessors is obviously a concern in a great number of applications. Riding the technology curve to deep submicron devices, multi-gigahertz operating frequencies, and low supply voltages provides high performance and some reduction in dissipation if appropriate design styles are used, but for applications with a more strict dissipation budget or technology constraints, more unusual techniques may be necessary in the future.

Adiabatic or energy recovery circuit styles have begun to show promise in this regard. Motivated by the result from thermodynamics that bit erasure is the only computing operation that necessarily requires energy dissipation, various techniques that either try to avoid bit erasure or try to bring the cost of bit erasure closer to the theoretical minimum have been implemented. To truly avoid bit erasure, and therefore perform computation that has no theoretical minimum dissipation, the computing engine must be reversible. Losses not associated with bit erasure are essentially frictional, such as the non-zero resistance of “on” transistors, and may be reduced through circuit design techniques such as optimally sized transistors, silicon processing techniques such as silicided conductors, and by moving charge through the circuit quasistatically, such as with constant current ramps in adiabatic circuits.

This paper discusses the engineering requirements of building a fully reversible processor. Fully reversible means that both the instruction set and the underlying circuit implementation will be reversible. Such a processor theoretically requires asymptotically zero dissipation, with dissipation per operation falling to zero as the clock period is increased to infinity. This assumes that an appropriate energy-recycling, constant-current power supply could be developed. The ISI “blip circuit” [1], stepwise capacitor charging [12], and MIT’s transmission line clock drivers [3] are steps toward this end. This paper assumes that the clock drivers exist, and the datapath currently being constructed uses Younis and Knight’s [16] three-phase SCRL logic family in all circuits. Any complete analysis of power dissipation in an adiabatic circuit must include the dissipation in the power supply and control logic. This paper focuses on the architectural and circuit level engineering of a reversible system rather than the actual dissipation of such a system.

Following a more detailed discussion of the motivation behind this work and a few words about the terminology of the field, Section 4 covers instruction set

Page 139: Power Driven Microarchitecture WorkshopEncoding Technique and its Effect on Multimedia Applications.....3 Enric Musoll, Tomas Lang, and Jordi Cortadella ... Thomas M. Conte, and Matt

design, since the ISA and assembly language code mustalsomaintain reversibility, Section5 concernsthedetailsof instruction fetch and decode, and Section 6 toucheson some related work in the field and offers conclusionsabout reversible computing.

2 Why Build a Reversible Processor

A fully reversible processor must implement a reversible instruction set in a reversible circuit implementation. A reversible instruction set is one in which both the previous and next instructions are known for each instruction in a correctly written assembly language program. The dynamic instruction stream may be executed both forwards and backwards. The instruction set described here is a modification of the one designed in Vieri's master's thesis [15].

A reversible circuit implementation is one in which, in the asymptotic limit, charge flows through the circuit in a thermodynamically reversible way at all times. This is only possible if the circuit is performing a logically reversible operation in which no information is lost. A logically reversible operation is one in which the values produced as output uniquely determine the inputs used to generate that output. Performing exclusively logically reversible operations is a necessary but insufficient condition for thermodynamically reversible operation. When performed using an adiabatic circuit topology, the operation is thermodynamically reversible.
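As an illustrative aside (not from the paper), the distinction can be seen in a few lines of Python: an XOR-based update is logically reversible because its outputs determine its inputs, while a plain AND collapses distinct inputs onto the same output.

```python
def xor_update(a, b):
    """Reversible: the pair (a ^ b, b) uniquely determines (a, b)."""
    return a ^ b, b

def xor_undo(r, b):
    """Inverse of xor_update: XORing with b again recovers a."""
    return r ^ b, b

a, b = 0b1100, 0b1010
assert xor_undo(*xor_update(a, b)) == (a, b)  # inputs recovered exactly

# Irreversible by contrast: AND maps distinct inputs to the same output,
# so the inputs cannot be recovered from the result alone.
assert (0b0000 & 0b0000) == (0b0100 & 0b0000)
```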

Performing circuit-level operations in a thermodynamically reversible way allows the energy dissipation per operation to fall asymptotically to zero in the limit of infinitely slow operation. Conventional CMOS circuits have a minimum energy dissipation associated with each compute operation that changes the state of the output node, regardless of operation frequency. So-called "adiabatic" techniques, in which the dissipation per compute operation is proportional to the operation frequency, have shown themselves to be useful in practical applications [13, 2].

As mentioned above, adiabatic operation requires that the circuits perform logically reversible operations. If one attempts to implement a conventional instruction set in a reversible logic family, reversibility will be broken at the circuit level when the instruction set specifies an irreversible operation. This break in reversibility translates to a required energy dissipation.

3 A Few Words About Words

Much of the discussion about reversible computing is hampered by imprecision and confusion about terminology. In the context of this paper, terms will hopefully be used consistently according to the definitions in this section.

The overall field of reversible computing encompasses anything that performs some operation on some number of inputs, producing some outputs that are uniquely determined by and uniquely determine the inputs. While this could include analog computation, this paper is only concerned with digital computation. Reversible computation only implies that the process is deterministic, that the inputs determine the outputs, and that those outputs are sufficient to determine the inputs.

A particular design may be reversible at one or more levels and irreversible at other levels. For example, an instruction set may be reversible and yet be implemented in an irreversible technology. The test chip "Tick" [4] was an eight-bit implementation of an early version of the reversible Pendulum instruction set in an irreversible standard CMOS technology.

The specific implementation of a reversible computing system may vary greatly. Numerous mechanical structures have been proposed, including Drexler's "Rod Logic" and Merkle's clockwork schemes [11]. Younis and Knight's SCRL is a reversible electronic circuit style suitable for use in the implementation of a reversible computer.

Many papers refer to "dissipationless" computing. This is a misnomer and generally may be interpreted as shorthand for "asymptotically dissipationless" computing. This means that if the system could be run arbitrarily slowly, and if some unaccounted-for Nth order effect of the particular implementation does not interfere, in the limit of infinite time the dissipation per computation asymptotically approaches zero. Only reversible computations may be performed asymptotically dissipationlessly, because bit erasure requires a dissipation of k_B T ln 2 regardless of operation speed. Any Nth order effects that prevent a reversible system from dissipating zero energy are essentially frictional in nature, rather than fixed, fundamental barriers.

4 The Pendulum Instruction Set

The particular implementation discussed here is known as the Pendulum processor. The Pendulum processor was originally based on the elegantly simple MIPS RISC architecture [7]. The register-to-register operations, fixed instruction length, and simple memory access instructions make it a good starting point for a radically different approach to instruction set design. For ease of implementation, and of course to maintain reversibility, the instruction set has been substantially modified. It retains the general purpose register structure and fixed length instructions, however.

The Pendulum processor supports a number of traditional instructions with additional restrictions to ensure reversibility. The instruction set includes conventional register to register operations such as add and logical AND, shift and rotate operations, operations on immediate values such as add immediate and OR immediate, conditional branches such as branch on equal to zero and branch on less than zero, and a single memory access operation, exchange. The direction of the processor is changed using conditional branch-and-change-direction instructions.

4.1 Register to Register Operations

Conventional general purpose register processors read two operands, stored in two possibly different registers, and perform some operation to produce a result. The result may be stored either in the location of one of the operands, overwriting that operand, or some other location, overwriting whatever value was previously stored there. This produces two difficulties for a reversible processor. First, writing a result over a previously stored value is irreversible since the information stored there is lost. Second, mapping from the information space of two operands to the space of two operands and a result will quickly fill the available memory. However, if the result is stored in the location originally used for one of the operands, the computation takes two operands as input and outputs one operand and a result. This space optimization is not required for reversibility if the processor can always store the result without overwriting some other value, but it is a useful convention for managing resources.

An astutely designed instruction set will inherently avoid producing garbage information while retaining flexibility and power for the programmer. This leads to the distinction between expanding and non-expanding operations. Both types of instruction are reversible; the distinction is made only in how much memory these instructions consume when executed.

All non-expanding two operand instructions in the Pendulum instruction set take the form:

Rsd ← F(Rsd, Rs) (1)

where Rsd is the source of one operand and the destination for the result, and Rs is the source of the second operand.

By contrast, logical AND is not reversible if only the result and one operand are retained. Except for the special case of the saved operand having every bit position set to one, the second operand cannot be recovered accurately and must be stored separately. Since operations like AND, including logical OR and shift operations, require additional memory space after execution, they are termed expanding operations. The problem then arises of how to store that extra information. If the two operands continue to be stored in their original location, the result must be stored in a new location. It is still not permissible for the result to overwrite a previously stored value, so the result may either be stored in a location that is known to be clear or combined with the previously stored value in a reversible way. The logical XOR operation is reversible, so the Pendulum processor stores the result by combining it in a logical XOR with the value stored in the destination register, forming ANDX, ORX, and so on. Logical ANDX and all other expanding two operand instructions take the form:

Rd ← F(Rs, Rt) ⊕ P (2)

where Rs and Rt are the two operand source registers, Rd is the destination register, and P is the value originally stored in Rd.

Constructing a datapath capable of executing an instruction stream forwards and backwards is simplified if instructions perform the same operation in both directions. Addition, which appears simple, is complicated by the fact that the ALU performs addition when executing in one direction and subtraction when reversing. The expanding operations are their own inverses, since

P = F(Rs, Rt) ⊕ Rd (3)

Non-expanding operations could take the same form as the expanding operations, performing ADDX and so on, simplifying the ALU, but programming then becomes fairly difficult. Only expanding operations, which require additional storage space, are implemented to consume that space.
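The expanding-operation convention can be sketched in a few lines of Python; this is our own illustrative model of Equations 2 and 3 with F taken to be bitwise AND, not the actual Pendulum datapath.

```python
def andx(rs, rt, rd):
    """ANDX: Rd <- F(Rs, Rt) XOR Rd, with F = bitwise AND (Eq. 2)."""
    return (rs & rt) ^ rd

rs, rt, p = 0b1101, 0b1011, 0b0110   # p is the prior contents of Rd
rd = andx(rs, rt, p)                  # forward execution
assert andx(rs, rt, rd) == p          # the operation is its own inverse (Eq. 3)
```

Because the XOR-combine is self-inverse, the same datapath operation serves for both forward and reverse execution, which is exactly the property the paper wants from expanding operations.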

Converting a conventional register to register instruction execution scheme to reversible operation is relatively simple. The restriction on which registers can be operands is minor, and implementation of an SCRL ALU is not particularly difficult. While a conventional processor erases a significant number of bits in these operations, preserving them is not difficult.

4.2 Memory Access

A reversible memory system, named XRAM, has been fabricated in a 0.5 µm CMOS silicon process. The system was intended to be a prototype of the Pendulum register file. Reversible memory system design is discussed in more depth elsewhere [14], and this section draws heavily on previous work by the authors.

From a system point of view, the only additional requirement of a reversible memory, beyond a traditional memory system's function, is that it not erase bits when it is read from and written to. The memory must of course perform as a random access memory, allowing bits to be stored and retrieved. Bit erasure can happen as a fundamental side effect of the operation of the memory or as a function of the particular implementation. For example, one can imagine a memory in which the externally visible values being stored and retrieved are never lost but the implementation of the memory is such that intermediate bits are erased internally.

A traditional SRAM architecture is based on read/write operations. An address is presented to the memory and, based on a read/write and possibly an enable signal, a word of data is read from or written to the memory array. Data may be read from any location an arbitrary number of times, and data written to a location overwrites that location's previously stored data.

Reading a value does not at first seem to be an irreversible operation. Reading from a standard memory creates a copy of the stored value and sends it to another part of the computing system. An arbitrary number of copies may be created this way. If, in a reversible system, the overall system can properly manage these copies, the memory need not be concerned with them. The larger system will, however, probably exhaust its ability to store or recover the bits generated by the production of an arbitrary number of copies. So it is a desirable feature of a reversible memory not to be a limitless source of bits when used in a larger system. It must be emphasized, however, that copy itself is not an irreversible operation.

A conventional memory performs explicit bit erasure during writes because the previously stored value is overwritten and lost. A reversible memory must save those bits somehow. The specific mechanism for this may vary. For example, reads may be performed destructively, as in a DRAM, to avoid producing copies of the data. The information is moved out of the memory rather than being copied from it.

During writes, the value which would be overwritten could be pushed off to a separate location, either automatically or explicitly under programmer control. This only postpones the problem until later, since any finite capacity storage will be filled eventually.

If the separate location is accessible to the programmer, however, that data may either be useful or it may be possible to recover the space by undoing earlier operations. So if a write is preceded by a destructive read, the old information is moved out of the memory and into the rest of the system, and the new information replaces it in the memory. The old value has been exchanged for the new value. This type of eXchange memory architecture, or XRAM, is the memory access technique used in the Pendulum register file and for data and instruction memory access. The instruction set supports a single exchange instruction which specifies a register containing the memory address to be exchanged and a register containing the value to be stored to memory and in which the memory value will be placed.

The essential insight of the XRAM is that performing a read and then a write to the same memory location does not lose any information. One data word is moved out of the memory, leaving an empty slot for a new value to be moved in. In general, moving data rather than copying is a valid technique in reversible computing for avoiding bit erasure on the one hand and avoiding producing large amounts of garbage information on the other.
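A minimal behavioral sketch of the exchange idea (the class and method names are our own, not part of the Pendulum design):

```python
class XRAM:
    """Exchange memory: the only access operation swaps a value with a cell."""
    def __init__(self, size):
        self.cells = [0] * size

    def exchange(self, addr, value):
        """A destructive read followed by a write: data is moved, never
        copied or erased, so no information is created or destroyed."""
        old = self.cells[addr]
        self.cells[addr] = value
        return old

mem = XRAM(8)
assert mem.exchange(3, 42) == 0   # the old contents move out with the caller
assert mem.exchange(3, 7) == 42   # ...and come back on the next exchange
assert mem.cells[3] == 7
```

Note that two exchanges with the same cell undo each other, which is the property that makes this access pattern compatible with reverse execution.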

4.3 Control Flow Operations

If programs consisted solely of register to register and memory access operations, programming and implementation would be relatively simple. Unfortunately, conditional branches are crucial to creating useful programs. The processor must be able to follow arbitrary loops, subroutine calls, and recursion during forward and reverse operation. A great deal of information is lost in conventional processors during branches, and adding structures to retain this information is very difficult.

Any instruction in a conventional machine implicitly or explicitly designates the next instruction in the program. Branch instructions specify if a branch is to be taken, and if so, what the target is. Non-branch instructions implicitly specify the instruction at the next instruction memory address location. To follow a series of sequential instructions backwards is trivial; merely decrement the program counter rather than incrementing it. Following a series of arbitrary jumps and branches backwards in a traditional processor is impossible: the information necessary to follow a jump or branch backwards is lost when the branch is taken. A reversible computer must store enough information, either explicitly in the instruction stream or elsewhere, to retrace program execution backwards.

To reverse an instruction stream, some information must be available at the target location of a branch specifying how to undo the branch. During reverse operation, a target instruction must somehow be identifiable and the processor must be able to determine the "return address" of the branch. A challenge exists as to how to ensure reversibility even if the code is written incorrectly. A simple "illegal instruction" check to determine if the targeted instruction is not a proper target type is a legitimate technique, but data-dependent checks are generally not possible.

Requiring branches to target particular types of instructions to ensure branch reversibility, however, is a convenient technique for reversible processor design. Early designs proposed explicit "come-from" instructions [15] or a branch register and branch bit [5]. Space does not permit a complete discussion of possible techniques for performing jumps and branches reversibly, but the literature contains a number of examples that differ from the scheme presented here [6]. The discussion below refers only to the particular scheme used in the current version of the Pendulum processor.

Pendulum branch instructions specify the condition to be evaluated, either equal to zero or less than zero, the register containing the value to be evaluated, and a register containing the target address. The instruction at the target address must be able to point back to the branch address and know if the branch was taken or if the target location was reached through sequential operation. For proper operation, each branch instruction must target an identical copy of itself.

When a branch condition is true, an internal branch bit is toggled. If the branch bit is false and the branch condition is true, the program counter update (PCU) unit exchanges the value of the program counter and the target address. The target address must hold an identical branch instruction which toggles the branch bit, and sequential operation resumes. The address of the first branch instruction is stored in the register file so that during reverse operation the branch can be executed properly.
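The branch-bit mechanism can be modeled roughly as follows; this is a simplified sketch of the scheme just described, not the actual Pendulum PCU logic.

```python
def branch_step(pc, target, cond_true, branch_bit):
    """One execution of a conditional branch; returns (pc, target, branch_bit)."""
    if cond_true:
        if not branch_bit:
            # condition true, bit clear: toggle the bit and exchange PC with
            # the target; the old PC is left behind for reverse execution
            return target, pc, True
        # condition true, bit set: this is the identical branch instruction
        # at the target; toggle the bit back and resume sequential operation
        return pc + 1, target, False
    return pc + 1, target, branch_bit

pc, tgt, bit = 10, 40, False
pc, tgt, bit = branch_step(pc, tgt, cond_true=True, branch_bit=bit)
assert (pc, bit) == (40, True)            # jumped; tgt now records address 10
pc, tgt, bit = branch_step(pc, tgt, cond_true=True, branch_bit=bit)
assert (pc, tgt, bit) == (41, 10, False)  # sequential resumes; 10 kept for reverse
```

Because the "come-from" address survives the exchange, running the same rules with the program counter decremented retraces the branch during reverse operation.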

5 Instruction Fetch and Decode

Reading from the instruction memory suffers from the same difficulty as reading from the data memory. Each copy created when an instruction is read must be "uncopied." If instruction fetch operations are performed by moving instructions rather than copying them, the instructions must be returned to memory when the instruction has finished executing.

After the instruction is read (or moved) from the instruction memory, the opcode is decoded, and a number of datapath control signals are generated. Just before the instruction is moved back to the instruction memory, these control signals must be "ungenerated" by encoding the instruction.

A certain symmetry is therefore enforced with respect to instruction fetch and decode. An instruction is moved from the instruction memory to the instruction decode unit. The resulting datapath control signals direct operation of the execution and memory access units. After execution, the control signals are used to restore the original instruction encoding, and the instruction may then be returned to the instruction memory.

The processor must be able to return the instruction to its original location, so its address must be passed through the datapath and made available at the end of instruction execution. Since the address of the next instruction must also be available at the end of instruction execution, the Pendulum processor has two instruction address paths. One path contains the address of the instruction being executed, and the program counter update unit uses it to compute the address of the next instruction. The second path contains the address of the previous instruction, and the PCU uses it to compute the address of the current instruction. The output of the PCU is then the next instruction address and the current instruction address, which are the values required to return the current instruction to the instruction memory and read out the next instruction.

Performing these computations during branching instructions is very complex, especially when the processor is changing direction. Traditional processors throw away every instruction executed, and ensuring that the instructions are returned to the instruction memory is challenging.

6 Conclusions

This extreme approach to low power computing is clearly impractical in the near term. All the primary blocks of a traditional RISC processor erase bits, and retaining those bits presents varying levels of difficulty to the designer. These challenges present the opportunity to reexamine conventional RISC architecture in terms of bit erasure during operation. Register to register operations and memory access are relatively easy to convert to reversibility, but control flow and, surprisingly, instruction fetch and decode are decidedly non-trivial. This knowledge may be used in traditional processor design to target datapath blocks for energy dissipation reduction.

References

[1] W. Athas, L. Svensson, and N. Tzartzanis. A resonant signal driver for two-phase, almost-non-overlapping clocks. In International Symposium on Circuits and Systems, 1996.

[2] W. Athas, N. Tzartzanis, L. Svensson, L. Peterson, H. Li, X. Jiang, P. Wang, and W-C. Liu. AC-1: A clock-powered microprocessor. In International Symposium on Low Power Electronics and Design, pages 18–20, 1997.

[3] Matthew E. Becker and Thomas F. Knight, Jr. Transmission line clock driver. In Proceedings of the 1998 ISCA Power Driven Microarchitecture Workshop, 1998.

[4] Michael Frank and Scott Rixner. Tick: A simple reversible processor (6.371 project report). Online term paper, May 1996. http://www.ai.mit.edu/~mpf/rc/tick-report.ps.

[5] Michael P. Frank. Modifications to PISA architecture to support guaranteed reversibility and other features. Online draft memo, July 1997. http://www.ai.mit.edu/~mpf/rc/memos/M07/M07_revarch.html.

[6] J. Storrs Hall. A reversible instruction set architecture and algorithms. In Physics and Computation, pages 128–134, November 1994.

[7] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.

[8] J. G. Koller and W. C. Athas. Adiabatic switching, low energy computing, and the physics of storing and erasing information. In Physics of Computation Workshop, 1992.

[9] R. Landauer. Irreversibility and heat generation in the computing process. IBM J. Research and Development, 5:183–191, 1961.

[10] J. C. Maxwell. Theory of Heat. Longmans, Green & Co., London, 4th edition, 1875.

[11] Ralph C. Merkle. Towards practical reversible logic. In Physics and Computation, pages 227–228, October 1992.

[12] L. "J." Svensson and J. G. Koller. Adiabatic charging without inductors. Technical Report ACMOS-TR-3a, USC Information Sciences Institute, February 1994.

[13] Nestoras Tzartzanis and William C. Athas. Energy recovery for the design of high-speed, low power static RAMs. In International Symposium on Low Power Electronics and Design, pages 55–60, 1996.

[14] Carlin Vieri, M. Josephine Ammer, Amory Wakefield, Lars "Johnny" Svensson, William Athas, and Thomas F. Knight, Jr. Designing reversible memory. In C. S. Calude, J. Casti, and M. J. Dinneen, editors, Unconventional Models of Computation, pages 386–405. Springer-Verlag, 1998.

[15] Carlin J. Vieri. Pendulum: A reversible computer architecture. Master's thesis, MIT Artificial Intelligence Laboratory, 1995.

[16] Saed G. Younis and Thomas F. Knight, Jr. Practical implementation of charge recovering asymptotically zero power CMOS. In Proceedings of the 1993 Symposium on Integrated Systems, pages 234–250. MIT Press, 1993.

[17] Saed G. Younis and Thomas F. Knight, Jr. Asymptotically zero energy split-level charge recovery logic. In International Workshop on Low Power Design, pages 177–182, 1994.


Low-Power VLIW Processors: A High-Level Evaluation

Jean-Michel Puiatti, Christian Piguet*, Eduardo Sanchez, Josep Llosa†

Logic Systems Laboratory, Swiss Federal Institute of Technology

CH–1015 Lausanne, Switzerland
{Jean-Michel.Puiatti, [email protected]}

Abstract

Processors having both low power consumption and high performance are increasingly required in the portable systems market. Although it is easy to find processors with one of these characteristics, it is harder to find a processor having both of them at the same time. In this paper, we evaluate the possibility of designing a high-performance, low-consumption processor and investigate whether instruction-level parallelism architectures can be adapted to low-power processors. We find that adapting a high-performance architecture, such as the VLIW architecture, to low-power 8b or 16b microprocessors yields a significant improvement in the processor's performance while keeping the same energy consumption.

1 Introduction

In recent years, the need for ultra low-power embedded microcontrollers has been steadily increasing. This can be explained by the high demand for portable applications. Currently, a wide range of products, such as embedded microprocessors from 8b to 32b and DSPs, can be found on the market and are well adapted to a wide range of applications. Eight-bit embedded microcontrollers can be found in the form of low-complexity circuits that generally have no pipeline, no cache memories, and a reduced level of performance. The Motorola 68HC11, Intel 80C51, and PIC 16Cxx are examples of such products. They consume between 5 and 50 milliwatts at 3 volts, have slow clock frequencies (usually no more than 20 MHz), and take several clock cycles to execute an instruction [12]. On the other hand, high-end, low-power processors can be found in the form of pipelined 32b processors with small cache memories and a high level of performance. The DEC StrongARM 110, Motorola MPC821, and IBM PowerPC 401 are examples of these products. They consume between 100 and 1000 milliwatts at 3 volts and have a high clock frequency (between 25 and 200 MHz) [12][14]. Between these two

*Ultra Low Power Section, Centre Suisse d'Electronique et de Microtechnique, CH–2000 Neuchatel, Switzerland, [email protected]

†Departament d'Arquitectures de Computadors, Universitat Politecnica de Catalunya, Barcelona, Spain, [email protected]

This paper was submitted to PATMOS'98, © 1998 PATMOS'98. All rights reserved.

categories there is a gap. Indeed, in spite of the need for low-power processors adapted to 8b and 16b applications, there are no 8b or 16b processors that reach a level of performance comparable to the 32b low-power processors.

In this paper, we evaluate the possibility of designing a high-performance, low-consumption 8b or 16b microcontroller. More precisely, we investigate whether the instruction-level parallelism (ILP) architectures (e.g., superscalar or VLIW) used in high-performance, high-consumption processors can be adapted to low-power, high-performance 8b or 16b microcontrollers. In order to quantify the potential improvements that can be obtained by these kinds of parallel architectures, we make a comparison between low-power scalar processors and low-power ILP processors in terms of both performance and energy efficiency. The metric used in our comparison is the Energy-Delay Product [5], EDP, defined as the product of the total energy E_T needed to execute a task and the time T_exec needed to execute this task: EDP = E_T × T_exec.
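The EDP metric follows directly from this definition; the sketch below uses made-up numbers purely for illustration.

```python
def edp(total_energy, exec_time):
    """Energy-Delay Product: E_T * T_exec (e.g., joules times seconds)."""
    return total_energy * exec_time

# Halving execution time at constant total energy halves the EDP, which is
# why the metric rewards speedups that do not cost extra energy:
slow = edp(total_energy=1e-3, exec_time=2e-3)
fast = edp(total_energy=1e-3, exec_time=1e-3)
assert fast == slow / 2
```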

Ricardo Gonzalez et al. [5] showed that superscalar architectures with a compile-time scheduler do not improve the energy-efficiency level. Thomas D. Burd [3] argued that instruction-level parallelism (ILP) architectures do not improve the energy efficiency due to the control overhead and the unissued instructions (speculation). However, we believe that VLIW architectures adapted for low power can help us obtain a better consumption-performance trade-off. These beliefs are motivated by new developments in the VLIW field, such as the new architectural solution HP/Intel IA-64 [6] and the TI 'C6201 processors [15].

The remainder of this paper is organized as follows. Section 2 describes the characteristics of the CoolRISC 816, which is the reference processor of our experiment. Section 3 presents the ILP architecture used in our experiment to increase the performance of the low-power reference processor. Section 4 describes our experiment and shows the results. Finally, Section 5 concludes.

2 CoolRISC 816: A low-power 8b processor

The CoolRISC 816 [13] is an ultra low-power embedded 8b microcontroller developed by the Centre Suisse d'Electronique et de Microtechnique (CSEM). It has the following characteristics (core only): a Harvard architecture (separate code and data memory), sixteen 8-bit registers, 22-bit wide instructions, a clock frequency of up to 18 MHz, and a typical consumption of 105 µW/MHz at 3 volts with a 0.5 µm three-metal-layer CMOS technology.

The CoolRISC 816 has a non-blocking pipeline which allows it to execute an instruction every cycle without adding extra delay due to pipeline stalls. The main limitation from a performance point of view is the working frequency: the maximum clock frequency of the CoolRISC 816 core is 18 MHz, but generally the maximum working frequency is limited by the access time of the code memory. The code memory used in CoolRISC is ultra-low power at the cost of a slow access time.

The energy consumption of the CoolRISC 816 is distributed in three different parts: the core, the data memory, and the code memory. Figure 1 shows the typical distribution of the energy consumption when CoolRISC is executing a program. This data was obtained by executing a set of programs and extracting the relative utilization of the core, data memory, and code memory. The set of programs used consisted of a quicksort, a string sort, an FFT, and a sine/cosine computation. The extracted average resource utilization is: 100% for the processor core, 100% for the code memory, and 40% for the data memory.

[Figure 1: a stacked bar chart of the relative energy consumption of the core, code memory, and data memory, for three memory configurations (256x22 code / 128x8 data, 4kx22 / 2kx8, and 16kx22 / 8kx8).]

Figure 1: Energy consumption distribution in the CoolRISC 816.

Figure 1 shows that the energy consumed by the processor core corresponds to less than 50% of the total energy consumption and that the major sources of energy consumption are the memories.

3 Increasing Performance

Currently, high-performance processors use instruction-level parallelism (ILP) to increase their performance. Indeed, a scalar processor such as the CoolRISC 816 executes a single flow of instructions sequentially. However, not all the executed instructions are interdependent, and therefore several of these instructions can be executed in parallel.

Superscalar and VLIW [7] architectures are the two main types of architectures that exploit instruction-level parallelism to achieve a higher level of performance. Superscalar processors fetch several instructions from the code memory, analyze the dependencies between all of them, and, according to the resource availability and the instruction dependencies, schedule the instructions in the various units. Therefore, there is a considerable increase in the circuit complexity due to the instruction dispatch unit. VLIW architectures eliminate this increase in hardware complexity using a compile-time instruction scheduler. This means that the control of the instruction dependencies is made by the compiler. The VLIW compiler analyzes the code and, according to the processor's resources and the dependencies between instructions, generates very large instructions that contain several independent meta-instructions which will be executed in parallel. The processor has only to fetch and execute these very large instructions without checking the meta-instruction dependencies.

VLIW architectures have a major drawback: the code density of a VLIW processor depends on the available instruction parallelism. If there is insufficient instruction parallelism to generate a VLIW instruction that uses all units, the unused units will execute a NOP meta-instruction, which results in a considerable increase in code size and, therefore, in energy consumption. To solve the problem of NOP insertion, the new generation of VLIW processors, such as the HP/Intel Merced [6], contains a special encoding technique which eliminates the extra NOP insertion. Figure 2 illustrates this kind of technique. Each VLIW instruction contains several meta-instructions (four in our example) which could be dependent or independent instructions. An additional field is added to specify the group of meta-instructions that will be executed in parallel. The unit number field specifies which unit must execute the meta-instruction, and the separator bit between two meta-instructions within a VLIW instruction is set to '0' if the two can be executed in parallel, and to '1' if they must be executed sequentially. The hardware costs of the NOP elimination are the extra bits added to the code memory (3 bits per meta-instruction in our example) and the crossbar needed to send the meta-instructions to their corresponding units. However, this technique prevents the increase in code size (and therefore of consumption) due to the extra NOP insertion.

Figure 2: VLIW architecture: NOP elimination. (The figure contrasts a classical VLIW encoding, in which unused slots hold explicit NOPs, with the new encoding, in which each meta-instruction of the compressed instruction carries a unit number field and a separator bit, and a crossbar expands the compressed instruction toward the four units.)

VLIW architectures can be divided into two main groups: (1) the heterogeneous VLIW architectures, which are the most common among existing VLIW architectures. The term heterogeneous indicates that the units are different, which in turn means that a meta-instruction must be dispatched to a unit capable of executing it; (2) the homogeneous VLIW architectures, which are composed of several units able to execute any kind of meta-instruction. As the units are homogeneous, there is no need for a crossbar to dispatch the meta-instructions to their corresponding units, and therefore there is no need for a unit number field.


4 Experimental part

This experiment aims at comparing 8-bit and 16-bit processors with scalar and VLIW architectures, and at quantifying their differences in terms of both performance and energy consumption. Performance and energy consumption are computed for the execution of a benchmark suite composed of inner loops of several programs. Most of the execution time of a program is spent in inner loops: for example, on an HP-PA 7100 processor, 78% [8] of the execution time of the Perfect Club Benchmark Suite [2] is spent in inner loops. Therefore, the performance achieved and the energy consumed in inner loops are representative of the execution of the entire program.

The next subsections address the following issues: the compilation techniques used to obtain the executable code of inner loops; the benchmark used for our evaluation; the architectures compared in our experiment; the consumption model used to estimate the energy consumption; and finally the results of our experiment.

4.1 Compilation techniques

The executable code for the different architectures is obtained as follows: first, the dependence graph of the instructions of the loop is generated; then, the instructions are scheduled depending on the constraints imposed by the data dependencies and by the architecture of the processor.

To generate the dependence graphs, we principally used the ICTINEO tool [1], which extracts the inner loops of a FORTRAN program and provides an optimized dependence graph for each inner loop.

Software pipelining has been used to schedule operations because it is the most effective compilation technique for loop parallelization. The software pipelining technique used is Swing Modulo Scheduling (SMS) [9]. SMS tries to produce the maximum performance and, in addition, includes heuristics to decrease the high register requirements of software-pipelined loops [10]. When the number of registers required is higher than the number available, spill code [4] (i.e., instructions which temporarily save the contents of some registers to the data memory) has to be introduced, increasing energy consumption. When required, spill code was added to software-pipelined loops using the heuristics described in [11].

Figure 3 shows the principle of software pipelining. In the sequential execution of a loop, each iteration and each instruction are executed sequentially. Software pipelining rearranges the instructions, according to the dependencies and architectural constraints, in order to obtain a loop divided into SC stages (three in our example, from S0 to S2) which can be executed in parallel. Every stage is divided into II (Initiation Interval) cycles in which it is possible to execute one or more instructions.
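The execution-time formulas behind Figure 3 can be checked with a small sketch; the iteration count and cycle counts below are invented for illustration.

```python
# Sequential: each of N iterations takes L cycles -> Texec = N * L.
# Software pipelined: SC stages of II cycles each; a new iteration
# starts every II cycles, plus (SC - 1) * II cycles to drain the
# epilogue -> Texec = N * II + (SC - 1) * II.

def t_sequential(n_iters, l_cycles):
    return n_iters * l_cycles

def t_pipelined(n_iters, ii, sc):
    return n_iters * ii + (sc - 1) * ii

# Example: N = 100 iterations of a 6-cycle loop body, pipelined into
# SC = 3 stages with II = 2 cycles per stage.
print(t_sequential(100, 6))    # 600 cycles
print(t_pipelined(100, 2, 3))  # 204 cycles, a speed-up of about 2.9
```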

4.2 Benchmark

As a benchmark, we used a set of 25 integer loops. These loops are divided into three groups. The first includes five integer loops which operate on 8-bit data: FIR filter, vector-matrix multiplication, vector-vector multiplication (dot), vector-vector addition, and function integration. The second consists of the same five integer loops operating on 16-bit data. Finally, the third group is composed of 15 16-bit integer loops of the Perfect Club Benchmark Suite [2].

Figure 3: Parallel execution of a loop using software pipelining. (Sequential execution of the N iterations takes Texec = N * L cycles; the software-pipelined loop is divided into a prologue, a steady state, and an epilogue, and takes Texec = N * II + (SC - 1) * II cycles.)

4.3 Compared architectures

The evaluated architectures are based on the CoolRISC architecture and use the same low-power memories that have been used with the CoolRISC 816. These memories have the property of having no sense amplifiers, which eliminates static energy consumption. Therefore, there is no additional penalty due to the width of the instruction words.

C8: CoolRISC 816, processor of reference
C16: 16-b version of the C8
V8E1: VLIW heterogeneous, 8-b, 1 Load/Store, 2 ALUs, 1 Branch Unit
V8E2: VLIW heterogeneous, 8-b, 2 Load/Store, 2 ALUs, 1 Branch Unit
V8H1: VLIW homogeneous, 4 units, 1 access to the data memory at the same time
V8H2: VLIW homogeneous, 4 units, 2 accesses to the data memory at the same time
V16E1: 16-b version of the V8E1
V16E2: 16-b version of the V8E2
V16H1: 16-b version of the V8H1
V16H2: 16-b version of the V8H2

4.4 Consumption model

The consumption model is based on the utilization of resources. The energy needed to execute a task is computed by summing the energy consumed by the different resources. As the CoolRISC 816 is our processor of reference, we base the energy consumption estimates on our in-depth knowledge of the energy consumption characteristics of the C8 processor, which is extracted from the real implementation of the processor. Because the compared VLIW architectures use the NOP elimination technique, their instructions contain predecoded bits that indicate which units must work. Therefore, it is possible to halt the signal activity of all unused units. As a consequence, the units which do not execute a meta-instruction do not consume any energy. For the heterogeneous VLIW architectures, the energy needed to execute a meta-instruction is estimated as the energy consumption of the operational part of the C8 or C16 processor (pipeline, decoder, register file accesses, ALU operation). For the homogeneous VLIW architectures, the energy needed to execute a meta-instruction is estimated as the same energy needed to execute a scalar instruction on the C8 or C16. This high-level modeling of the energy consumed during the execution of an instruction implies a certain loss of precision. However, one factor limits the impact of estimate error: as described in Section 2, the energy consumed in the processor core represents only a small part (about 30%) of the total energy consumption.

Figure 4: Speed-up comparison. (Speed-up of each processor relative to the C8; the bars range from 1.0 for the C8 up to roughly 7.3 for the fastest 16-bit VLIW configuration.)

Figure 5: Energy comparison. (Energy consumption of each processor relative to the C8, for three memory configurations (code: 256 instructions / data: 128 words; code: 4k instructions / data: 2k words; code: 16k instructions / data: 8k words), broken down into core, code memory, data memory, and interconnect contributions.)

The energy needed to perform a memory (code or data) access is estimated through a statistical energy consumption model of the memory architecture. This model takes into account the type of memory (RAM or ROM), its size (in words), its geometry (number of rows and columns), the width of the word, and the power supply voltage. The technological parameters are extracted from a 0.5 µm CMOS process. In our experiment we use the typical value of the energy consumption per memory access.

The extra energy consumption due to the interconnections is estimated using a statistical model of the energy consumption of the crossbar and of the circuit overhead due to the additional access ports of the register file.
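The accounting in this consumption model reduces to a sum of per-resource energies. The sketch below shows only that accounting structure; every per-use energy value is an invented placeholder, not a measured CoolRISC figure.

```python
# Utilization-based energy model: total energy = sum over resources of
# (number of uses * energy per use). Halted VLIW units simply never
# appear in the counts, so they contribute nothing.

ENERGY_PER_USE = {         # picojoules per event (illustrative numbers)
    "core_op": 50.0,       # one meta-instruction in an active unit
    "code_fetch": 120.0,   # one code-memory access
    "data_access": 140.0,  # one data-memory access
    "interconnect": 15.0,  # crossbar + extra register-file ports, per op
}

def task_energy(counts):
    """counts maps a resource name to its number of uses in the task."""
    return sum(ENERGY_PER_USE[r] * n for r, n in counts.items())

e = task_energy({"core_op": 400, "code_fetch": 100,
                 "data_access": 80, "interconnect": 400})
print(e)  # total picojoules for the task
```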

4.5 Results

In this subsection we compare the performance and energy consumption of the architectures described in Subsection 4.3. All use the same power supply voltage (Vdd = 3 V) and clock frequency (imposed by the access time of the code memory). The experiment is repeated for several memory configurations.

Figure 4 compares the performance, in terms of speed-up, of the different processors with respect to the C8 processor. Figure 5 shows the ratio between the energy consumption of the different processors and that of the C8 while executing our benchmark, and illustrates the energy consumption distribution. Figure 6 shows the ratio between the energy-delay product achieved by the different processors and by the C8 processor while executing our benchmark.

Figure 6: Energy-Delay Product comparison. (Energy-delay product of each processor relative to the C8, for the three memory configurations; the ratio falls from 1.00 for the C8 down to roughly 0.09 to 0.11 for the best 16-bit VLIW configurations.)

From these three figures we can observe the advantage of two transitions: (1) from an 8-bit to a 16-bit architecture, and (2) from a scalar to a VLIW architecture.

The transition from an 8-bit to a 16-bit architecture yields a major improvement in the energy-delay product (approximately a factor of four). This is a consequence of the smaller number of instructions required to execute the benchmark, which contains a majority of 16-bit data. Performance increases by a factor of about 2.4 while energy consumption decreases by a factor of about 1.7. This result shows the importance of having an architecture able to process the data of the application efficiently.
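The quoted factor of four is consistent with the two individual ratios, since the energy-delay product is simply the product of energy and execution time:

```latex
\frac{ED_{8\text{-bit}}}{ED_{16\text{-bit}}}
  = \frac{E_{8\text{-bit}}}{E_{16\text{-bit}}}
    \times \frac{T_{8\text{-bit}}}{T_{16\text{-bit}}}
  \approx 1.7 \times 2.4 \approx 4.1
```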

The transition from a scalar to a VLIW architecture significantly improves the energy-delay product (by a factor varying between 2.0 and 2.8). Indeed, VLIW architectures achieve better performance while consuming more or less the same energy as scalar architectures. This observation is explained by a redistribution of the energy consumption: the increase in energy consumption of the VLIW core is compensated for by a decrease in the energy consumption of the code memory. The increase is due to the circuit overhead introduced by the interconnections (crossbar and register file). On the other hand, the decreased consumption of the code memory can be explained as follows: first, the employed memories do not have sense amplifiers, and therefore do not consume static energy (i.e., there is no penalty for the larger instruction words); second, a VLIW processor requires less energy to fetch a meta-instruction than a scalar architecture. In fact, as the energy consumed by the line decoder is independent of the width of the word, the energy needed to fetch four instructions simultaneously is less than four times the energy needed to fetch one instruction.
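That fetch-energy argument can be sketched numerically. The constants below are invented placeholders rather than the paper's memory model, and the 22-bit instruction width is an assumption about the CoolRISC encoding; only the structure of the model (a width-independent decoder term plus a per-bit term) comes from the text.

```python
# Why one wide fetch beats four narrow fetches: the line-decoder energy
# is paid once per access regardless of word width, while only the
# bit-line/output energy scales with the number of fetched bits.

E_DECODER = 100.0      # pJ per access, width-independent (assumed)
E_PER_BIT = 2.0        # pJ per fetched bit (assumed)

def fetch_energy(word_bits):
    return E_DECODER + E_PER_BIT * word_bits

narrow = 4 * fetch_energy(22)   # four separate narrow instruction fetches
wide = fetch_energy(4 * 22)     # one fetch of a 4-wide VLIW instruction
print(narrow, wide)             # the single wide fetch is cheaper
```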

The main difference between homogeneous and heterogeneous VLIW architectures is in terms of performance. The former reach a higher level of performance due to their higher machine parallelism; the downside, however, is the higher core complexity, which entails a higher energy consumption. Therefore, if the ILP is insufficient, the homogeneous and heterogeneous VLIW architectures have a similar energy-delay product, since there is no significant difference in terms of speed-up. On the other hand, if sufficient ILP can be extracted, then the homogeneous architecture attains a higher level of performance (as is the case for our 8-bit processors), ultimately superseding the heterogeneous one with respect to the energy-delay product.

5 Conclusion

In this paper we have shown that adapting high-performance architectures, such as the VLIW architecture, to low-power embedded 8-bit or 16-bit microcontrollers using low-power memories yields a significant improvement of the energy-delay product compared to a scalar processor. This improvement, by a factor varying between 2 and 3, is obtained through a redistribution of the energy consumption, which allows a higher level of performance to be obtained while keeping the energy consumption at the same level. We have also shown the importance of using a processor adapted to the size of the data in order to minimize the number of instructions executed, which leads to a decrease in both the energy consumption and the execution time.

Our results are based on loop parallelization and on a high-level energy consumption model, which allows us to validate the use of VLIW architectures for high-performance low-power processors and to identify which VLIW architecture provides the best results. The next step will be to develop a complete VLIW compiler and a prototype of such a low-power VLIW processor in order to compute the energy consumption more precisely.

Acknowledgments

We are grateful to Rosario Martinez, Jean-Marc Masgonty, Moshe Sipper, Gianluca Tempesti, the Logic Systems Laboratory team, and the processor design team of the CSEM (Centre Suisse d'Electronique et de Microtechnique) for our fruitful discussions. Finally, financial support from the National Science Foundation is gratefully acknowledged.

References

[1] Eduard Ayguade, Cristina Barrado, Antonio Gonzalez, Jesus Labarta, David Lopez, Josep Llosa, Susana Moreno, David Padua, Fermin J. Reig, and Mateo Valero. Ictineo: A tool for research on ILP. In Supercomputing'96, November 1996.

[2] M. Berry, D. Chen, P. Koss, and D. Kuck. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. Technical Report 827, Center for Supercomputing Research and Development, November 1988.

[3] Thomas D. Burd and Robert W. Brodersen. Energy efficient CMOS microprocessor design. In Proceedings of the 28th Annual HICSS Conference, volume 1, pages 288-297, January 1995.

[4] G. H. Chaitin. Register allocation and spilling via graph coloring. In Proc. ACM SIGPLAN Symp. on Compiler Construction, pages 98-105, June 1982.

[5] Ricardo Gonzalez and Mark Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277-1283, September 1996.

[6] Linley Gwennap. Intel, HP make EPIC disclosure. Microprocessor Report, 11(14), October 1997.

[7] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, 1991.

[8] J. Llosa. Reducing the Impact of Register Pressure on Software Pipelined Loops. PhD thesis, Universitat Politecnica de Catalunya, January 1996.

[9] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero. Swing modulo scheduling: A lifetime-sensitive approach. In IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT'96), pages 80-86, October 1996.

[10] J. Llosa, M. Valero, and E. Ayguade. Quantitative evaluation of register pressure on software pipelined loops. International Journal of Parallel Programming, 26(2):121-142, 1998.

[11] J. Llosa, M. Valero, and E. Ayguade. Heuristics for register-constrained software pipelining. In Proc. of the 29th Ann. Int. Symp. on Microarchitecture (MICRO-29), pages 250-261, December 1996.

[12] Christian Piguet. Low-power 8-bit microcontrollers. In Architectural and Circuit Design for Portable Electronic Systems, Monterey, CA, USA, March 31-April 5, 1997.

[13] Christian Piguet, Jean-Marc Masgonty, Claude Arm, Serge Durand, Thierry Schneider, Flavio Rampogna, Ciro Scarnera, Christian Iseli, Jean-Paul Bardyn, R. Pache, and Evert Dijkstra. Low-power design of 8-b embedded CoolRISC microcontroller cores. IEEE Journal of Solid-State Circuits, 32(7):1067-1078, July 1997.

[14] Jim Turley. Embedded vendors seek differentiation. Microprocessor Report, 11(1), 1997.

[15] Tim Turley and Harri Hakkarainen. TI's new 'C6x DSP screams at 1600 MIPS. Microprocessor Report, 11(2), February 1997.


Designing the Low-Power M•CORE™ Architecture

Jeff Scott, Lea Hwang Lee, John Arends, Bill Moyer

M•CORE Technology Center, Semiconductor Products Sector
Motorola, Inc. TX77/F51, 7600-C Capital of Texas Highway, Austin, TX 78731

jscott,leahwang,arends,[email protected]

Abstract

The M•CORE microRISC architecture has been developed to address the growing need for long battery life among today's portable applications. In this paper, we will present the low-power design techniques and architectural trade-offs made during the development of this processor. Specifically, we will discuss the initial benchmarking, the development of the Instruction Set Architecture (ISA), the custom datapath design, and the clocking methodology. Finally, we will discuss two system solutions utilizing the M•CORE processor, presenting power, area, and performance metrics.

1 Introduction

There is significant effort directed toward minimizing the power consumption of digital systems. Minimization techniques have been applied to various levels of abstraction, from the circuit to the system level. Industry, coupled with contributions from the research community, has developed new architectures and microarchitectures [22,23], derivatives of existing architectures [1,15], compiler and compression optimizations [16,17], enhanced process technology [10], and new tools [11] and design techniques [5,7,14] to enhance overall system performance, which in this context translates to a decrease in milliwatts per MHz.

Existing architectures are not suitable for low-power applications due to their inefficiency in code density, memory bandwidth requirements, and architectural and implementation complexity. This drove the development of the new general-purpose microRISC M•CORE architecture [18,21], which was designed from the ground up to achieve the lowest milliwatts per MHz. The M•CORE instruction set was optimized using benchmarks common to embedded applications coupled with benchmarks targeted specifically at portable applications. Compilers were developed in conjunction with the instruction set to maximize code density.

The benchmarks used to drive the development of the M•CORE architecture, as well as for making design trade-offs, are presented in Section 2. In Section 3, an overview of the instruction set is presented. Section 4 presents implementation details and various design methodologies. Section 5 presents power consumption statistics. Two system solutions based on the M•CORE microprocessor are presented in Section 6. Section 7 summarizes the paper.

2 Benchmarks

Embedded and portable benchmarks were used to make design trade-offs in the architecture and the compiler. The Powerstone benchmarks, which include paging, automobile control, signal processing, imaging, and fax applications, are detailed in Table 1.

3 Instruction Set Architecture

3.1 Overview

The M•CORE instruction set is designed to be an efficient target for high-level language compilers in terms of code density as well as execution cycle count. Integer data types of 8, 16, and 32 bits are supported to ease application migration from existing 8- and 16-bit microcontrollers. A standard set of arithmetic and logical instructions is provided, as well as instruction support for bit operations, byte extraction, data movement, and control flow modification.

Table 1: Powerstone benchmark suite

Benchmark   Instr. Count   Description
auto        17374          Automobile control applications
bilv        21363          Shift, and, or operations
blit        72416          Graphics application
compress    322101         A Unix utility
crc         22805          Cyclic redundancy check
des         510814         Data Encryption Standard
dhry        612713         Dhrystone
engine      986326         Engine control application
fir_int     629166         Integer FIR filter
g3fax       2918109        Group three fax decode (single-level image decompression)
g721        231706         Adaptive differential PCM for voice compression
jpeg        9973639        JPEG 24-bit image decompression standard
pocsag      131159         POCSAG communication protocol for paging application
servo       41132          Hard disc drive servo control
summin      3463087        Handwriting recognition
ucbqsort    674165         U.C.B. Quick Sort
v42bis      8155159        Modem encoding/decoding
whet        3028736        Whetstone

A small set of conditionally executed instructions is available, which can be useful in eliminating short conditional branches. These instructions utilize the current setting of the processor's condition bit to determine whether they are executed. Conditional move, increment, decrement, and clear operations supplement the traditional conditional branch capabilities.

The M•CORE processor provides hardware support for certain operations which are not commonly available in low-cost microcontrollers. These include single-cycle logical shift (LSL, LSR), arithmetic shift (ASR), and rotate (ROTL) operations, a single-cycle find-first-one instruction (FF1), a hardware loop instruction (LOOPT), and instructions to speed up memory copy and initialization operations (LDQ, STQ). Also provided are instructions which generate an arbitrary power-of-2 constant (BGENI, BGENR), as well as bit mask generation (BMASKI) for generating a bit string of '1's ranging from 1 to 32 bits in length. Absolute value (ABS) is available for assisting in magnitude comparisons.

Because of the importance of maximizing real-time control loop performance, hardware support for multiply and divide is also provided (MULT, DIVS, DIVU). These instructions provide an early-out execution model so that results are delivered in the minimum time possible. Certain algorithms also take advantage of the bit reverse instruction (BREV), which is efficiently implemented in the barrel shifter logic [3].
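As a rough illustration, some of the bit-oriented instructions named above behave like the following 32-bit Python helpers. This is a behavioral sketch only; operand ranges and edge-case semantics (for example, the exact FF1 result convention) should be taken from the M•CORE reference manual.

```python
MASK32 = 0xFFFFFFFF

def bmaski(n):
    """BMASKI-like: a string of n ones in the low-order bits."""
    return ((1 << n) - 1) & MASK32

def bgeni(n):
    """BGENI-like: the power-of-two constant 2**n (0 <= n <= 31)."""
    return (1 << n) & MASK32

def ff1(x):
    """FF1-like: distance of the first set bit from the MSB side;
    32 when no bit is set (one common convention, assumed here)."""
    for i in range(32):
        if x & (1 << (31 - i)):
            return i
    return 32

def brev(x):
    """BREV-like: reverse the 32 bits of x."""
    r = 0
    for _ in range(32):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

print(hex(bmaski(8)))    # 0xff
print(hex(bgeni(4)))     # 0x10
print(ff1(0x00010000))   # 15
print(hex(brev(0x1)))    # 0x80000000
```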

3.2 16-Bit vs. 32-Bit

System cost and power consumption are strongly affected by the memory requirements of the application set. To address this, the M•CORE architecture adopts a compact 16-bit fixed-length instruction format and a 32-bit load/store RISC architecture. The result is high code density, which reduces the total memory footprint as well as minimizing the instruction fetch traffic.

Benchmark results on a variety of application tasks indicate that the code density of the M•CORE microRISC engine is higher than that of many CISC (complex instruction set computer) designs, in spite of the fixed-length nature of the encodings. Fixed-length instructions also serve to reduce the CPU's control unit complexity, since instructions have a small number of well-structured formats, thus simplifying the decoding process.

3.3 Low-Power Instructions

The M•CORE processor minimizes power dissipation by utilizing dynamic power management. DOZE, WAIT, and STOP power conservation modes provide for comprehensive system power management. These modes are invoked by issuing the appropriate M•CORE instruction. System-level design dictates the operation of the various components in each of the low-power modes, allowing a flexible set of operating conditions which can be tailored to the needs of a particular application. For example, a system may disable all unnecessary peripherals in DOZE mode, while in STOP mode all activity may be disabled. For short-term idle conditions, the WAIT mode may disable only the CPU, leaving peripheral functions actively operating [18].

3.4 Instruction Usage

The M•CORE ISA was profiled by running the Powerstone benchmark suite on a cycle-accurate C++ simulator. Table 2 shows the percentage of dynamic instructions utilizing the adder and barrel shifter, as well as the percentage of change-of-flow and load/store instructions.

4 Processor Implementation

The initial implementation of the M•CORE architecture was targeted at sub-2.0 volt operation from 0-40 MHz and achieves both low power consumption and high performance in a 0.36-micron low-leakage static CMOS process. In addition, a die size of just over 2 mm² was achieved by utilizing a synthesizable control section and a full-custom datapath. The first implementation is also being produced in a 0.31-micron process for a cellular application and has been targeted for production in a 0.25-micron process for other embedded applications. Other low-power process technologies are being investigated, including a Silicon On Insulator (SOI) process [2].

Power consumption was minimized in the first implementation of the M•CORE architecture by: (i) gating clocks; (ii) utilizing delayed clocks to minimize glitching; and (iii) optimizing the clock tree. Powermill [12], Spice, and Focus [8], a transistor-sizing power optimization tool, were used to optimize the circuits.

Figure 1: M•CORE datapath

Table 2: Dynamic instruction percentages

Type                           Dyn. Inst. Percentage
Adder usage                    50.23%
Barrel shifter usage           9.68%
Change of flow instructions*   17.04%
Load/store instructions        22.46%

* 83.5% of change of flow instructions are taken.

(Figure 1 shows the datapath blocks: a 16-entry general-purpose register file, a 16-entry alternate register file, and a 13-entry control register file, each entry 32 bits; scale-by-1/2/4 and sign-extend logic feeding the X and Y ports through muxes; the ALU with priority encoder and condition detect; the barrel shifter, multiplier, and divider; the PC incrementer and branch adder; the immediate bus, writeback bus, address mux, and address and data buses; the instruction pipeline and instruction decode; and the hardware accelerator interface with its accelerator bus.)


4.1 Datapath Design

The M•CORE datapath consists of a register file, an instruction register, an adder, a logical unit, a program counter, a branch adder, and a barrel shifter. A block diagram of the datapath is shown in Figure 1.

4.1.1 Synthesizable vs. Custom

Synthesis methodologies have the advantage of decreased time to market, fewer resource requirements, and the ability to quickly migrate to different technologies. Custom designs lack those benefits, but can achieve faster performance, consume less power, and occupy less area.

Research was conducted to determine whether to pursue a synthesizable or custom datapath design. An energy-delay metric [13] was used to develop a custom 32-bit adder. Our studies showed that in going from a custom to a synthesized adder, transistor count increased by 60%, area increased by 175%, and power consumption increased by 40%. This research, coupled with similar block-level comparisons, led to the development of a full-custom datapath.

4.1.2 Speed, Area and Power Trade-off

Focus [8] was used to optimize speed, area, and power in the custom datapath. Focus takes a user-specified circuit and timing constraints and generates a family of solutions which form a performance/area trade-off curve. Given a set of input vectors, Focus allows the user to invoke Powermill interactively to measure power at any given solution. The user is then allowed to select the solution with the desired performance, area, and power characteristics.

Focus uses a dc-connected component-based static timing analysis technique to propagate the arrival times in a forward pass through the circuit, and the required times in a backward pass. At each node, a slack is generated as the difference between the required and the arrival times (a negative slack denotes a violation of a constraint).
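The two-pass slack computation can be sketched on a toy graph. The topology and delays below are invented; in the real tool the nodes would be dc-connected components extracted from the netlist.

```python
import collections

def slacks(edges, delays, t_constraint, order):
    """Slack per node: forward pass propagates arrival times, backward
    pass propagates required times; slack = required - arrival, and a
    negative slack denotes a timing violation. `order` is topological."""
    succ = collections.defaultdict(list)
    pred = collections.defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    arrival = {}
    for n in order:                      # forward pass
        arrival[n] = max((arrival[p] for p in pred[n]), default=0.0) \
                     + delays[n]

    required = {}
    for n in reversed(order):            # backward pass
        required[n] = min((required[s] - delays[s] for s in succ[n]),
                          default=t_constraint)

    return {n: required[n] - arrival[n] for n in order}

order = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
delays = {"a": 1.0, "b": 3.0, "c": 1.0, "d": 2.0}
print(slacks(edges, delays, 6.0, order))
```

The critical path a-b-d ends up with zero slack against the 6.0 ns constraint, while the faster side path through c has 2.0 ns to spare.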

Focus performs a hill-climbing search starting from a minimum-sized circuit (or a given initial circuit). In each step, Focus picks a set of candidate transistors to size based on their sensitivity to the overall path delay, and iteratively and gradually approaches a solution that will meet the timing constraints [8].

Figure 2: Speed and area trade-off

Figure 2 shows an example of how Focus was used to design the logical bit in the ALU. The vertical axis is expressed in a normalized area cost, while the horizontal axis represents the slack in nanoseconds.

The figure shows that Focus gives a family of solutions. It also shows that faster timing can be achieved at the expense of higher area cost. From this family of solutions, a solution was selected to meet the timing constraints with optimal area and power savings.

4.1.3 Floorplanning

Floorplanning of the datapath blocks was based on bus switching activity, timing, and block usage percentages derived from the Powerstone benchmarks. Blocks having output busses with high switching activity, tight timing requirements, and high usage were placed close together to minimize the routing capacitance. When searching through the solution space of block placements, timing constraints were used as hard constraints, while switching activities, weighted by their appropriate routing distances and block usage, were used as a cost function in the iterative search process.
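A minimal sketch of that search objective, with invented blocks, activities, and weights; positions here are abstract slot indices, not real geometry.

```python
# Timing is a hard constraint; the objective to minimize is switching
# activity weighted by routing distance and block usage, summed over
# connected block pairs.

def wiring_cost(placement, nets):
    """placement: block -> position; nets: (src, dst, activity, usage)."""
    return sum(activity * usage * abs(placement[src] - placement[dst])
               for src, dst, activity, usage in nets)

def feasible(placement, timing_limits):
    """Hard constraints: each timing-critical pair must be close enough."""
    return all(abs(placement[a] - placement[b]) <= dmax
               for a, b, dmax in timing_limits)

nets = [("regfile", "alu", 0.8, 0.5),      # hot, frequently used bus
        ("regfile", "shifter", 0.4, 0.1)]  # cooler, rarely used bus
timing = [("regfile", "alu", 2)]           # critical pair, max distance 2

candidate = {"regfile": 0, "alu": 1, "shifter": 3}
if feasible(candidate, timing):
    print(wiring_cost(candidate, nets))
```

An iterative search would enumerate or perturb placements, discard any that violate `timing`, and keep the feasible one with the lowest `wiring_cost`.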

In the final floorplan, shown in Figure 3, each horizontal line represents a multiple-bit (up to 32 bits) data bus route. The datapath can be roughly partitioned into two halves, separated by the register file: the left half includes the address and data interface, the instruction register, the branch adder, and the address generation unit; the right half includes the execution units.

Figure 3: Datapath floorplan

4.1.4 Register File Design

An architectural decision was made to have sixteen general-purpose registers based on the limited opcode space. An alternate set of sixteen 32-bit registers is provided as well, to reduce the overhead associated with context switching and saving/restoring for time-critical tasks. This alternate register file can be selected via a bit (AF) in the processor status register (PSR). Upon detecting an exception, the hardware places bit 0 of the exception vector table entry into the AF bit in the PSR. This allows the exception handler to select either register file for exception processing with no cycle penalty [20]. Hardware does not preclude access to the alternate register file in user mode, resulting in a larger working set of registers. This can result in a reduced number of memory fetches, thereby saving power.

In supervisor mode, an additional 5 supervisor scratch registers can be accessed by means of the move-to-control-register (MTCR) and move-from-control-register (MFCR) instructions.


(Figure 3 floorplan, left to right: address/data interface, instruction register, branch adder, address generation, register file, multiply, shift register, condition bit generation, priority encoder, logical unit, operand y mux, operand x mux, adder, shifter, and hardware accelerator interface, flanked by the datapath clock aligner and regenerator region and the control logic.)


The register file also includes: (i) feed forwarding logic; (ii) immediate muxes to produce various constants; and (iii) scale logic for computing load/store addresses.

4.1.5 Barrel Shifter Design

As noted earlier, approximately 10% of the instructions executed in the Powerstone benchmarks utilize the barrel shifter. In addition, barrel shifters are typically one of the highest power consuming modules in datapath designs. A comparison was made between two competing barrel shifters to determine which one offered the least current drain within a given performance requirement.

“Snaked Data” Barrel Shifter

“Snaked data” barrel shifters contain a 32x32 array of transistors. For precharged shifter outputs, the array uses nmos transistors. The term “snaked data” comes from the fact that input data (IL[2:0], IR[3:0]) is driven diagonally across the array, as shown in Figure 4. Control signals (S[3:0]) to select the amount of the shift are driven vertically, and outputs (O[3:0]) are driven horizontally. Data enters the shifter through multiplexers, with selection based on the type of shift operation: rotate, logical, or arithmetic. The control signals are generated from a 5-to-32 decoder for 32-bit shifters, where 5 bits represent the amount to shift and can be inverted to select shifts in the opposite direction.

Figure 4: 4-bit “snaked data” barrel shifter

“Snaked Control” Barrel Shifter

“Snaked control” barrel shifters contain a 32x32 array of transistors. For precharged shifter outputs, the array uses nmos transistors. The term “snaked control” comes from the fact that control signals (SL[3:1], SR[3:0]) are driven diagonally across the array, as shown in Figure 5. Shifter outputs (O[3:0]) are driven vertically and inputs (I[3:0]) are driven horizontally. Control signal assertion is based on the type of shift operation (rotate, logical, or arithmetic) and the amount of the shift. The control signals require a 5-to-32 decoder, where 5 bits represent the amount of shift, as well as a greater-than decoder for arithmetic operations.

Figure 5: 4-bit “snaked control” barrel shifter
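The one-hot control decoding both shifter styles rely on can be sketched behaviorally (this models the decoder and rotate behavior, not the transistor array; the function names are ours):

```python
# Sketch of the one-hot shift-amount decoding described above: a 5-to-32
# decoder asserts exactly one select line, and inverting the amount selects
# the opposite direction, since a right rotate by n is the same operation as
# a left rotate by (32 - n) mod 32.

def decode_5_to_32(amount):
    """Return a 32-entry one-hot list with entry (amount mod 32) set."""
    return [1 if i == (amount & 0x1F) else 0 for i in range(32)]

def rotate_left(value, selects):
    n = selects.index(1)  # the single asserted select line
    value &= 0xFFFFFFFF
    return ((value << n) | (value >> ((32 - n) % 32))) & 0xFFFFFFFF

# Left rotate by 4
sel = decode_5_to_32(4)
print(sum(sel))                            # 1: the control lines are one-hot
print(hex(rotate_left(0x80000001, sel)))   # 0x18

# Right rotate by 4, expressed by inverting the amount: -4 mod 32 = 28
sel = decode_5_to_32(-4)
print(hex(rotate_left(0x80000001, sel)))   # 0x18000000
```

The one-hot property is why the paper can later note that control signal power is negligible: only one select line is asserted per operation.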

Barrel Shifter Selection

There are three components of routing in either shifter design. Table 3 shows the routing breakdown for the two 32-bit shifters.

Control signal power consumption is negligible in both cases, since the selections are one-hot. Even though the total length of power consuming data routing for the two designs is nearly the same, the “snaked control” barrel shifter actually has significantly less routing capacitance, 5.9 pF vs. 7.4 pF, because of metal layer usage.

It would appear that the “snaked control” shifter would be the ideal choice for a power conscious design. However, further analysis of the Powerstone benchmarks was required. Many benchmarks perform similar types of shifts in succession. In fact, 64% of the shifts are either a logical shift left followed by another logical shift left, or a logical shift right followed by another logical shift right. Under these conditions, the “snaked data” shifter maintains constant data input into half of the array throughout the entire algorithm. In these circumstances, the “snaked data” shifter toggles over 30% less capacitance than the “snaked control” barrel shifter.
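The shift-locality statistic cited above can be measured on any instruction trace by counting how often a shift repeats the previous shift's type. The trace and opcode names below are invented for illustration; in practice the Powerstone traces would be used:

```python
# Measuring shift-type locality: the fraction of successive shift pairs
# where the second shift is of the same type as the first.
from collections import Counter

def same_type_pair_fraction(shift_trace):
    pairs = zip(shift_trace, shift_trace[1:])
    same = Counter(a == b for a, b in pairs)
    return same[True] / max(1, same[True] + same[False])

# Hypothetical trace of shift opcodes from a benchmark run
trace = ["lsl", "lsl", "lsl", "lsr", "lsr", "asr", "lsl", "lsl"]
print(same_type_pair_fraction(trace))  # 4 of 7 successive pairs match
```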

4.2 Clocking System Design

Clock gating saves significant power by eliminating unnecessary transitions of the latching clocks. Furthermore, it also eliminates unnecessary transitioning of downstream logic connected to the output of the latches. In the datapath, delayed clocks are generated for late arriving latch input data in order to further minimize downstream glitching.

Power due to clocking can vary anywhere from 30% to 50% in a power-optimized design [4]. As more and more datapath power optimizations were added to the design, the percentage of power due to clocking increased. Therefore, it was crucial to aggressively attack clock power consumption.
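Why clocking dominates follows from the standard switched-capacitance relation P = α·C·V²·f: clock nets switch every cycle (α near 1), while typical data nodes switch far less often. A quick sketch with invented numbers (not measurements from this design):

```python
# Hypothetical illustration of clock power dominance using P = alpha*C*V^2*f.
# Capacitances and activity factors below are invented, not from the M*CORE.

def dynamic_power_mw(alpha, cap_pf, vdd, freq_mhz):
    # pF * V^2 * MHz yields uW, so divide by 1000 for mW
    return alpha * cap_pf * vdd ** 2 * freq_mhz / 1000.0

clock_mw = dynamic_power_mw(alpha=1.0, cap_pf=50.0, vdd=1.8, freq_mhz=34.0)
data_mw = dynamic_power_mw(alpha=0.15, cap_pf=600.0, vdd=1.8, freq_mhz=34.0)
print(round(clock_mw, 2), round(data_mw, 2))
print(round(clock_mw / (clock_mw + data_mw), 2))  # clock share of the total
```

Even with a fraction of the capacitance, the always-switching clock net takes a large share of total dynamic power, which is why gating it pays off.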

The M•CORE processor uses a single global clock with local generation of dual phase non-overlapping clocks. The global clock proceeds through two levels of buffering/gating: a clock aligner and a regenerator, as shown in Figure 6 and Figure 7 respectively. The aligner generates the dual phase clocks, while the regenerator provides local buffering.

Figure 6: Clock aligner


Table 3: Snaked control vs. snaked data routing

Shifter Type      Input Data Routing   Output Data Routing   Control Routing
Snaked control    12,896               173,056               171,962
Snaked data       171,962              24,576                173,056



Figure 7: Clock regenerator

Clock gating can be performed at the aligners and/or the regenerators. This allows for complete or partial clock tree disabling. Besides shutting down the system clock input completely, gating at the aligners allows for maximum power savings by shutting down a collection of regenerators. In the instruction decoder, for example, the clocking for over 60 latches and flip-flops is disabled during pipeline stalls.
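The hierarchical gating just described can be modeled with a small transition-counting sketch. This is a behavioral illustration, not the actual circuit; the stall window and latch count are taken as a hypothetical example:

```python
# Behavioral sketch of clock gating: gating at an aligner disables a whole
# group of regenerators at once. We count clock transitions to show the saving.

class Regenerator:
    def __init__(self):
        self.transitions = 0

    def tick(self, aligner_enabled):
        if aligner_enabled:
            self.transitions += 2  # one rise and one fall per cycle

# Hypothetical decoder group: 60 latches behind one gated aligner
regens = [Regenerator() for _ in range(60)]
for cycle in range(100):
    stalled = 40 <= cycle < 70          # invented 30-cycle pipeline stall
    for r in regens:
        r.tick(aligner_enabled=not stalled)

total = sum(r.transitions for r in regens)
print(total)  # 60 latches * 70 active cycles * 2 = 8400, vs 12000 ungated
```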

5 Power Measurement

Powermill and Pathmill were used to perform the following tasks: (i) measuring power usage of each block to detect any design anomalies; (ii) detecting nodes with excessive DC leakage current; (iii) locating hot spots in the design; and (iv) identifying nodes with excessive rise and fall times. Various design changes were made based on these findings.

5.1 Processor Power Distribution

M•CORE processor power was measured using Powermill on a back-annotated taped out implementation of the architecture. Clock power is 36% of the total processor power consumption, as shown in Figure 8. It is important to note that due to measurement constraints, the datapath regenerator clocking power is included in the datapath percentage, not the clocking power percentage. This clocking power would push the clock power percentage even higher. This is a good indication that the datapath and control sections are well optimized for power. It is also important to note that through utilization of the low-power mode instructions, the system clock can be completely shut down.

Figure 8: Processor total power distribution

5.2 Datapath Power Consumption

Figure 9 shows the breakdown of the datapath power consumption. The register file consumes 16% of total processor power consumption and 42% of the datapath power consumption. It is the largest consumer of datapath power and occupies 46% of the datapath area as well. The barrel shifter consumes less than 4% of the total processor power consumption and 8% of the datapath power consumption.

The address/data interface contains the local bus drivers. The instruction register drives the instruction decoder in the control section. The operand muxes drive the ALU, priority encoder, and condition detect logic. The address generation unit contains the program counter and dedicated branch adder circuitry (see Figure 1).

Figure 9: Datapath power consumption

6 System Solutions

6.1 MMC2001

The MMC2001 low power platform incorporates an M•CORE integer processor, an on-chip memory system, an OnCE™ debug module, an external interface module, and peripheral devices [19]. It includes a section of 20K gates that are customizable. The MMC2001's production version is implemented in a 0.36 micron, triple-layer metal low-leakage static CMOS process. The on-chip memory system consists of 256 KBytes of ROM and 32 KBytes of SRAM. The device measures 7.23 mm on a side, or 52 sq. mm. The part is packaged in a 144 pin Thin Quad Flat Package (TQFP) and achieves 34 MHz performance at 1.8 V.

The MMC2001, when running out of internal SRAM, consumes on average less than 30 mA at 2.0 V, which translates into less than 60 mW at 34 MHz for the complete system. On average, this implementation of the M•CORE processor consumes 7 mA, which translates to a 0.40 mW/MHz rating. When the part is put into STOP mode, it consumes less than 50 microamps with the clock input shut down. When the part goes into a low voltage detect condition with just the oscillator, SRAM backup, and time-of-day timer running, it consumes less than 5 microamps.
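The power figures quoted for the MMC2001 and the MMC56652 follow directly from P = I·V, with the mW/MHz rating given by P/f. A quick check:

```python
# Verifying the quoted system power figures with P = I * V and rating = P / f.

def power_mw(current_ma, vdd):
    return current_ma * vdd

def mw_per_mhz(current_ma, vdd, freq_mhz):
    return power_mw(current_ma, vdd) / freq_mhz

print(power_mw(30, 2.0))                      # MMC2001 system: 60.0 mW
print(round(mw_per_mhz(7, 2.0, 34), 2))       # MMC2001 M*CORE: 0.41 mW/MHz
print(power_mw(9, 1.8))                       # MMC56652 system: 16.2 mW
print(round(mw_per_mhz(2.8, 1.8, 16.8), 2))   # MMC56652 M*CORE: 0.3 mW/MHz
```

The computed 0.41 mW/MHz for the MMC2001 processor is consistent with the quoted 0.40 mW/MHz rating.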

6.2 MMC56652

[Figure residue. Figure 7 clock regenerator signals (φ1/φ2 raw clocks, clock enables, and phase outputs). Figure 8 data: Clock 36%, Datapath 36%, Control 28%. Figure 9 data: Register File 42%, Address/Data Interface 14%, Instruction Register 9%, Other 9%, Barrel Shifter 8%, Operand X Mux 7%, Operand Y Mux 6%, Address Generation 5%.]

The MMC56652 Dual-Core Baseband Processor incorporates an M•CORE integer processor, an on-chip memory system, an OnCE™ debug module, an external interface module, and peripheral devices [9]. It includes a 56600 Onyx Ultra-Lite 16-bit digital signal processor, program and data memory, and peripheral devices. The MMC56652's production version is implemented in a 0.31 micron, triple-layer metal static CMOS process. This device consists of 8 KBytes of ROM and 2 KBytes of SRAM to support the M•CORE processor. The chip measures 7.4 mm on a side, or 55 sq. mm. The part is packaged in a 196 plastic ball grid array (PBGA) and was designed for 16.8 MHz performance at 1.8 V.

The MMC56652, when running out of internal SRAM, consumes on average less than 9 mA at 1.8 V, which translates into less than 16.2 mW at 16.8 MHz for the complete system. On average, this implementation of the M•CORE processor consumes 2.8 mA, which translates to a 0.30 mW/MHz rating. The part consumes less than 60 microamps in STOP mode.

7 Conclusion

The M•CORE architecture development represents four years of research in the area of low-power embedded microprocessor and system design. This research encompassed all levels of abstraction: application software, system solutions, compiler technologies, architecture, circuits, and process optimizations.

The instruction set architecture was developed in conjunction with benchmark analysis and compiler optimization. A full-custom datapath design with clock gating was utilized to minimize the power consumption. Two system solutions integrating the M•CORE microprocessor were presented along with power, area, and performance metrics.

Further research will involve allocating unutilized opcode space, along with additional compiler enhancements. Benchmarks will continue to be added to the Powerstone suite reflecting future portable low-power applications. Efforts will be made to drive novel design techniques and process technologies into a production environment in order to achieve higher levels of power reduction and performance.

8 References

[1] An Introduction to Thumb, Advanced RISC Machines Ltd., March 1995.
[2] D. Antoniadis, "SOI CMOS as a Mainstream Low-Power Technology," Proc. IEEE Int'l Symp. Low Power Electronics and Design, August 1997, pp. 295-300.
[3] J. Arends, J. Scott, "Low Power Consumption Circuit and Method of Operation for Implementing Shifts and Bit Reversals," U.S. Patent 5,682,340, October 28, 1997.
[4] R.S. Bajwa, N. Schumann, H. Kojima, "Power Analysis of a 32-bit RISC Microcontroller Integrated with a 16-bit DSP," Proc. IEEE Int'l Symp. Low Power Electronics and Design, August 1997, pp. 137-142.
[5] L. Benini, G. Micheli, "Transformation and Synthesis of FSMs for Low-Power Gated-Clock Implementation," Proc. IEEE Int'l Symp. on Low Power Design, 1995, pp. 21-26.
[6] J. Bunda, D. Fussell, R. Jenevein, W.C. Athas, "16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors," University of Texas at Austin, October 1992.
[7] A. Chandrakasan, S. Sheng, R. Brodersen, "Low-Power CMOS Digital Design," IEEE Journal of Solid-State Circuits, April 1992, pp. 473-484.
[8] A. Dharchoudhury, D. Blaauw, J. Norton, S. Pullela, J. Dunning, "Transistor-level Sizing and Timing Verification of Domino Circuits in the PowerPC Microprocessor," Proc. IEEE Int'l Conf. Computer Design, Austin, Texas, October 1997.
[9] DSP 56652 User's Manual, Motorola Inc., 1998.
[10] D. Frank, P. Solomon, S. Reynolds, and J. Shin, "Supply and Threshold Voltage Optimizations for Low Power Design," Proc. IEEE Int'l Symp. Low Power Electronics and Design, August 1997, pp. 317-322.
[11] J. Frenkil, "Issues and Directions in Low Power Design Tools: An Industrial Perspective," Proc. IEEE Int'l Symp. Low Power Electronics and Design, August 1997, pp. 152-157.
[12] C. Huang, B. Zhang, A. Deng, B. Swirski, "The Design and Implementation of Powermill," Proc. IEEE Int'l Symp. Low Power Design, 1995, pp. 105-109.
[13] M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-Power Digital Design," Proc. IEEE Int'l Symp. on Low Power Electronics, October 1994, pp. 8-11.
[14] M. Igarashi, K. Usami, K. Nogami, F. Minami, "A Low-Power Design Method Using Multiple Supply Voltages," Proc. IEEE Int'l Symp. Low Power Electronics and Design, August 1997, pp. 36-41.
[15] K. Kissell, "MIPS16: High Density MIPS for the Embedded Market," Silicon Graphics MIPS Group, 1997.
[16] M. Kozuch and A. Wolfe, "Compression of Embedded System Programs," Proc. IEEE Int'l Conf. on Computer Design, 1994.
[17] C. Lefurgy, P. Bird, I. Chen, and T. Mudge, "Improving Code Density Using Compression Techniques," Proc. IEEE Int'l Symp. on Microarchitecture, December 1997, pp. 194-203.
[18] M•CORE Reference Manual, Motorola Inc., 1997.
[19] MMC2001 Reference Manual, Motorola Inc., 1998.
[20] W. Moyer, J. Arends, "Method and Apparatus for Selecting a Register File in a Data Processing System," U.S. Patent (pending), filed September 1996.
[21] W. Moyer, J. Arends, "RISC Gets Small," Byte Magazine, February 1998.
[22] "NEC Introduces 32-bit RISC V832 Microcontroller," NEC Inc., May 1998.
[23] J. Slager, "SH-4: A Low-Power RISC for Multimedia and Personal Access Systems," Microprocessor Forum, October 1996.

M•CORE is a trademark of Motorola, Inc.
OnCE is a trademark of Motorola, Inc.


Author Index

Albera, G. ........................... 44    Hajj, I. ............................. 50
Albonesi, D. ........................ 107    Hall, J. ............................. 92
Ammer, M.J. ......................... 135    Hensley, J. .......................... 55
Arends, J. .......................... 145    Irwin, M.J. .......................... 87
Azam, M. ............................. 61    Jagdhold, U. ........................ 119
Bahar, I. ............................ 44    Jensen, F. ............................ 9
Bajwa, R.S. .......................... 87    Johnson, E.W. ........................ 67
Barany, M. ........................... 92    Kahle, S. ........................... 124
Becker, M. ........................... 80    Kamble, M. ........................... 38
Bellas, N. ........................... 50    Kim, H. ............................. 124
Brockman, J. ......................... 67    Klar, H. ............................ 119
Brodersen, R. ........................ 74    Klass, B. ............................ 25
Brooks, D. ........................... 98    Knight, T. ....................... 80, 135
Burd, T. ............................. 74    Kogge, P. ........................ 32, 67
Cai, G.Z.N. .......................... 92    Lang, T. .............................. 3
Chen, R.Y. ........................... 87    Lee, L.H. ........................... 145
Chong, F.T. .......................... 55    Leung, C. ............................ 98
Chow, K. ............................. 92    Llosa, J. ........................... 140
Clark, D. ............................ 98    Lue, J. ............................. 124
Conte, T.M. .......................... 14    Margolus, N. ........................ 135
Cortadella, J. ........................ 3    Martonosi, M. ........................ 98
Evans, R. ............................ 61    Masselos, K. ......................... 20
Farooqui, A. ......................... 55    Moyer, B. ........................... 145
Frank, M. ........................... 135    Musoll, E. ............................ 3
Franzon, P. .......................... 61    Nagle, D. ........................ 25, 114
Furber, S.B. ........................ 131    Nakanishi, T. ........................ 92
Garside, J.D. ....................... 131    Oklobdzija, V.G. .................... 102
Ghose, K. ............................ 38    Oskin, M. ............................ 55
Goutis, C.E. ......................... 20    Pering, T. ........................... 74
Phatak, D. .......................... 124    Sherwood, Timothy .................... 55
Piguet, Christian ................... 140    Temple, S. .......................... 131
Polychronopoulos, Constantine ........ 50    Thomas, Don .......................... 25
Puiatti, Jean-Michel ................ 140    Toburen, Mark C. ..................... 14
Reilly, Matt ......................... 14    Tong, Ying-Fai ...................... 114
Rutenbar, Rob ....................... 114    Tyagi, Akhilesh ....................... 9
Sanchez, Eduardo .................... 140    Vieri, Carlin ....................... 135
Schmit, Herman ....................... 25    Zawodny, Jason ....................... 67
Schumann, Thomas .................... 119    Zervas, N. ........................... 20
Scott, Jeff ......................... 145    Zyuban, V. ........................... 32