
Journal of Systems Architecture 52 (2006) 129–142


Optimizing bus energy consumption of on-chip multiprocessors using frequent values

Chun Liu, Anand Sivasubramaniam, Mahmut Kandemir

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA

Received 30 June 2004; accepted 20 October 2004. Available online 22 July 2005. doi:10.1016/j.sysarc.2004.10.009

Abstract

Chip multiprocessors (CMP) are a convenient way of leveraging technological trends to build high-end and embedded systems that are performance and power efficient, while exhibiting attractive properties such as scalability, reliability and ease of design. However, the on-chip interconnect for moving the data between the processors, and between the processors and the memory subsystem, plays a crucial role in CMP design. This paper presents a novel approach to optimizing its power by exploiting the value locality in data transfers between processors. A communicating value cache (CVC) is proposed to reduce the number of bits transferred on the interconnect, and simulation results with several parallel applications show significant energy savings with this mechanism. Results show that the importance of our proposal will become even more significant in the future.
© 2005 Elsevier B.V. All rights reserved.

Keywords: On-chip multiprocessors; Value locality; Power optimization

This research has been supported in part by NSF grants 0103583, 0130143 and Career Award 0093082. Corresponding author: C. Liu, Tel.: +1 814 863 3627; fax: +1 814 865 3176. E-mail addresses: [email protected] (C. Liu), [email protected] (A. Sivasubramaniam), [email protected] (M. Kandemir).

1. Introduction

Two important trends motivate the research to be presented in this paper. With higher levels of integration and the ability to pack millions of transistors on-chip, the important question to ask is how to employ these transistors. Numerous recent studies (e.g., [1-5]) show that it may be easier and more profitable to provision a number of small processing cores rather than a monolithic complex core. Several reasons for this argument include wire delays, clock distribution, verification, and overall system reliability. These multiple processing cores on the same chip can provide closely coupled on-chip parallelism, and there are many applications that can take advantage of such an ability with thread-level parallelism. Using multiple (and perhaps slower) cores on the same die can turn out to be more rewarding than a single powerful core [1]. The other motivating trend for this research stems from the power dissipation of these on-chip transistors, which becomes a serious concern with higher levels of integration and faster clock speeds.

Data communication becomes an important consideration when designing these chip multiprocessors (CMPs). The interconnect needs to provision communication channels between each processing core (termed a processor henceforth) and the memory subsystem, together with communication channels between the processors themselves. The latter are needed to enforce coherence and to support data transfers between the processor caches, which reduce the data access latencies to the memory subsystem. The on-chip interconnect between the processors thus plays a crucial role in determining the performance and scalability of these systems. Data buses that are used to implement these on-chip interconnects can also drain a considerable amount of dynamic power on each of these data transfers because of their high capacitances. For instance, the integrated switch in the Alpha 21364 [6] is estimated to consume 20% of the chip energy budget (125 W), of which 33% is attributed to the data links [7]. These factors make it imperative to optimize the power drained by such on-chip interconnects.

Typically, an all-to-all interconnect is needed to facilitate communication between the processors, and also to the memory subsystem. In addition, coherence between the processor caches necessitates snooping capabilities to monitor accesses by other processors. A bus is typically employed because of its simplicity in design (compared to sophisticated interconnects), serializability (which makes it easier to implement a stricter memory consistency model), broadcast capabilities (which allow caches to snoop on other processors' requests), and the limited number of processing cores in use (8-16 is the range of focus in this paper). This bus carries the requests to the memory system for read/write misses, together with the corresponding responses (either from the memory subsystem or from some other processor cache), and the coherence control messages needed to keep the caches consistent. Although the evaluations of the idea to be presented use a bus architecture, the results are equally applicable to other on-chip interconnects that can be employed.

There are two broad solution strategies for alleviating the power dissipated by the interconnect: (i) reducing the number of messages themselves, and (ii) reducing the bit switching activity by encoding the data sent on the buses [8-12]. The first is a performance optimization technique and has been explored in depth in early research with different coherence protocols and message combining techniques [13]. The second solution strategy tries to encode patterns of ones and zeros in the exchanged messages to reduce the bit switching between successively sent flits on the buses. In this paper we present a different (and orthogonal) technique for addressing the interconnect power problem that can be used in conjunction with these two solution strategies as well.

Our approach uses the principle of value locality [14,15] for optimizing message traffic. We observe that there is considerable locality in the values of the data items that are communicated over the on-chip data bus (similar to what others have observed at the processor cache itself [14,15]). We can exploit this locality by keeping a small content-addressable cache, called the Communicating Value Cache (CVC), at each processing core, that maintains the most commonly exchanged values. When a message is sent/received, the CVC is consulted to check whether frequent values are being communicated, and if so, only the index of the location of these values (in the CVC) is actually communicated instead of the values themselves. We can get savings by keeping the size of the CVC relatively small (so that the index is small) as long as the program exhibits good value locality at such small sizes. This can lead to a reduction in the number of bits exchanged over the bus, and can provide power savings. Buses already provide the ability to shut down individual bit lines, which allows them to benefit from this mechanism. While such techniques have been explored in the context of uniprocessor data exchanges between the cache and main memory [16,8,17], this is the first paper to investigate this idea in the context of multiprocessors (multiple processing cores on-chip). It is to be noted that the power consumption of the CVC itself should be discounted from the overall power savings, and our results include this overhead.

The next section gives an overview of the hardware and implementation of the CVC. Section 3 discusses the experimental setup. The results from our simulations are given in Section 4, Section 5 explores further improvements, and Section 6 summarizes the contributions of this work.

2. Hardware details

2.1. CMP overview

Our evaluations use an 8-way (8 processing cores and their L1 caches) CMP that employs a 36-line wide bus to interconnect these cores with each other and with the rest of the memory subsystem (on-chip L2 and off-chip memory). We have aggressively assumed L2 to also be on-chip (as in [2]), though our ideas would automatically extend to the case where it is off-chip (in which case they can provide even higher savings, because off-chip capacitances are higher). A schematic of this architecture is given in Fig. 1. Four lines of the bus are used for control signals (read/write requests, asserting shared lines, etc.). The remaining thirty-two lines are used to carry address and data signals.

Fig. 1. CMP architecture under study.

We use a 4-state MESI invalidation-based protocol [13] that has been widely employed. In general, a cache block that is present in multiple L1 caches is in the shared (S) state, and moves to the modified (M) state once it is written by one of the processors, requiring a write signal on the bus which is snooped by the other caches so that they invalidate their copies (a clean block held by only one cache can be in the exclusive (E) state). A write-back cache is used, where the contents of a block that is in the modified (M) state need to be propagated to the rest of the memory subsystem upon its replacement.
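As a point of reference, the snooping transitions just described can be summarized in a short state machine. The sketch below is our own minimal rendering of textbook MESI [13], not the authors' implementation; the event names (BusRd, BusRdX, BusUpgr) and the `others_have_copy` flag (modeling the snooped shared line) are standard protocol shorthand rather than identifiers from the paper.

```python
# Minimal sketch of the 4-state MESI transitions described above (textbook
# protocol [13]; names are standard shorthand, not the paper's identifiers).
from enum import Enum, auto

class State(Enum):
    M = auto()  # Modified: dirty, only copy
    E = auto()  # Exclusive: clean, only copy
    S = auto()  # Shared: clean, possibly replicated
    I = auto()  # Invalid

def on_processor(state, is_write, others_have_copy):
    """Local access; returns (new_state, bus_message_or_None)."""
    if is_write:
        if state == State.I:
            return State.M, "BusRdX"   # read-for-ownership invalidates others
        if state == State.S:
            return State.M, "BusUpgr"  # already have data; just invalidate others
        return State.M, None           # E/M: silent transition, no bus traffic
    if state == State.I:
        # The snooped "shared" control line decides between E and S
        return (State.S if others_have_copy else State.E), "BusRd"
    return state, None                 # read hit: no bus traffic

def on_snoop(state, bus_message):
    """Reaction to another core's request observed on the bus."""
    if bus_message == "BusRd":
        # M supplies the dirty data (write-back) and both M and E drop to S
        return State.S if state in (State.M, State.E) else state
    if bus_message in ("BusRdX", "BusUpgr"):
        return State.I                 # our copy is invalidated
    return state
```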

2.2. CVC design

The idea behind optimizing the number of bits transferred between processor caches, or between an L1 cache and the memory, is based on exploiting the frequency of certain specific values that are communicated. The motivation for this idea stems from the results shown in Fig. 2, which illustrate the fraction of data words communicated over the bus whose values are captured by the most frequent n values. Note that we are not performing any optimizations for address words and focus only on data value locality in this paper. The figures give the fractions for values exchanged between processor L1 data caches (L1-L1) and those between the L1 data cache and L2 (L1-L2) separately, in eight SPEC OMP2001 parallel applications (please refer to [18] for further information on these benchmarks). Similar results were observed for the other applications, and with other datasets for the same applications. We observe that even if we select only the most frequent four values, these constitute as much as 85% of the data transfers (for swim), with the average being around 40%, between L1 caches. The L1-L1 value locality does turn out to be much higher than the L1-L2 value locality. Previous studies along similar lines [16] have mainly looked at a single processor's cache and the rest of the memory subsystem, and our results here show that the value locality between the L1s is much higher for a CMP. Further, we also note that the value locality between L1 and L2 is not as high for the SPEC OMP2001 applications as that reported in [16] for the Spec95 applications. The relative proportions of bus transactions in the two categories (L1-L1 and L1-L2) are used in calculating the overall savings in bus energy in the later discussions.

Fig. 2. Coverage of frequent values: fraction of data words captured by the most frequent n values (n = 4, 8, 16) for (a) L1-L1 and (b) L1-L2 transfers.
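The coverage metric of Fig. 2 is straightforward to reproduce from a bus trace. The following is a minimal sketch under the assumption that the trace is simply a list of the 32-bit data words observed on the bus; the toy trace is a stand-in, not data from the paper.

```python
# Sketch of the Fig. 2 "value coverage" metric: the fraction of transferred
# words whose value falls among the n most frequent values in the trace.
from collections import Counter

def coverage(trace, n):
    counts = Counter(trace)                      # value -> occurrence count
    top_n = counts.most_common(n)                # the n most frequent values
    return sum(c for _, c in top_n) / len(trace)

# Toy trace: the top-4 values cover 8 of 10 transfers.
trace = [0, 0, 0, 1, 1, 0xFFFFFFFF, 0xFFFFFFFF, 7, 42, 9]
print(coverage(trace, 4))  # -> 0.8
```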

An easy option for exploiting this value locality is to provision a communicating value cache (CVC) at each processing core, and to make sure that all the cores are in agreement on what goes into their CVCs, together with exactly where those values are placed within their CVCs. Consequently, the sender of a frequent value can send the index of its location in the CVC of the destination rather than sending the value itself. Please note that we are trying to reduce the number of bits transferred on the bus (to reduce switching activity) rather than proposing a narrower bus. The receiver can use this index to reassemble the value. There are several issues involved in the design of this CVC that impact its size. First, we do not want to make the CVC too large, or else we will incur a substantial energy cost per access (remember that this is a content-addressable structure), voiding any savings we may get from the bus. Second, the larger the CVC gets, the more bits are needed to send the index, which in turn eats away at the bus energy savings we are trying to achieve. Third, we also want to keep access times relatively low, since the words need to be encoded/decoded using the CVC before/after they are sent/received on the bus. Since the encoding process can be done in parallel with the bus arbitration, the CVC only incurs a performance penalty for the decoding process; for a 500 MHz bus, HSPICE results show that the decoding takes less than one cycle.

If we look at Fig. 2, we find that we are capturing a considerable portion of the value locality with just 4 or 8 values (later results will show that larger CVCs are not very effective). Consequently, we opted to implement a CVC circuit that uses very few such frequent values.

The purpose of this circuit is to quickly look up the values for the data words that are being sent/received, and we designed this circuit with the philosophy of optimizing lookups (for both performance and energy) rather than writes to the CVC (the rationale will be explained later). This circuit was laid out using MicroMagic's MAX; the design/layout is shown in Fig. 3, from which the energy values were obtained using HSPICE. It uses 70 nm technology and occupies an area of 140 × 50 μm². Later we scale it to 180 nm technology to compare with a normal bus.

In a CMP with n processors, we would have n + 1 CVCs overall: one at each of the cores, and one at the L2 end, as shown in Fig. 1. It is fairly straightforward to integrate the CVC into the architecture of bus-based shared memory multiprocessors. When data needs to be sent out from a source L1 core (or from L2), we need to quickly look up the CVC and send out the data words as a sequence of either actual words (if they are not present in the CVC) or indexes into the CVC. One additional bit line of the bus (one of the four control lines mentioned earlier) indicates which of these two categories a word belongs to, and the energy for this bit is included in our experiments. Rather than perform all such lookups/encoding for a block before sending it out (which can incur a high latency), we pipeline the whole mechanism, i.e., the lookup for the second word of the block in the CVC is performed concurrently while the first word transfer goes on the bus. Similarly, at the receiver end, the lookups/decoding are pipelined. Consequently, the performance ramifications of a block transfer are not significantly affected. This mechanism may add two cycles (one each at the sender and the receiver) overall per block for data communication, and with larger block sizes this overhead becomes even less significant. The control/request messages do not incur any overheads because they do not go through the CVC.

Fig. 3. CVC design. fvEN indicates whether the index into the CVC or the actual data is being transmitted. FVi denotes a 32-bit register holding the ith entry of the CVC.
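To make the encode/decode path concrete, here is a minimal sketch of the per-word protocol implied by Fig. 3. The table contents are hypothetical placeholders (the paper fixes them by profiling and mirrors the same table at every core), and fvEN models the extra control line described above.

```python
# Sketch of the CVC encode/decode path; table contents are hypothetical.
CVC = [0x00000000, 0xFFFFFFFF, 0x00000001, 0x3F800000]  # 4 preloaded entries
INDEX = {value: i for i, value in enumerate(CVC)}

def encode(word):
    """Sender side: a hit drives a 2-bit index instead of 32 data bits."""
    if word in INDEX:
        return 1, INDEX[word]   # fvEN = 1: payload is an index into the CVC
    return 0, word              # fvEN = 0: payload is the raw 32-bit word

def decode(fv_en, payload):
    """Receiver side: reassemble the word using the mirrored CVC."""
    return CVC[payload] if fv_en else payload

assert decode(*encode(0xFFFFFFFF)) == 0xFFFFFFFF  # hit: 2 bits on the bus
assert decode(*encode(0xDEADBEEF)) == 0xDEADBEEF  # miss: full word sent
```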

2.3. Using the CVC

There are different ways of using this CVC structure. We could potentially have different contents (or the same contents at different locations) in the CVCs at the different processor cores. This could optimize for cases where value localities differ across processors (or between pairs of processors). We noticed in our experiments that this was not a very significant issue, and decided to maintain the CVC as a structure that is mirrored across the cores (i.e., there is one CVC at each core, and the CVCs across the cores have the same values in the same locations).

We could opt to fill the values in the CVC dynamically based on execution characteristics, by tracking the frequent values during the run. For instance, one could use it as an associative cache and allow replacements from it. The advantage of this is that the optimizations can exploit dynamic changes in value locality. However, we found that such adaptation does not give better results than fixing the values statically, since the frequent values are more or less the same throughout the execution. Further, fixing the values can make the circuit much more efficient from both timing and power perspectives, and we consequently use the CVC mainly as a lookup mechanism and fix the values a priori.

These design choices make it easy to implement the CVCs and keep them coherent across the processor cores. In a bus-based system, it is possible for the CVCs to snoop the data transfers, figure out what data the remote CVCs may contain, and modify themselves appropriately. However, in a more general on-chip interconnect, the CVCs may not have any idea of what remote CVCs contain, and it may be necessary to explicitly propagate such information (which is quite costly) or maintain per processor-pair CVCs (which is not a very scalable design methodology). On the other hand, if we fix the contents (and their locations) of the CVC a priori (possibly by profiling the code), we can avoid these problems, making this mechanism general enough to work across diverse interconnects.

It should be noted that the CVC is orthogonal to the cache coherence mechanism itself. We are only optimizing the data values transmitted and are not changing the semantics of the messages themselves. One could hypothesize that if we are profiling and fixing the values, then it may be possible to recompile the code itself so that the indexes to the values get communicated in software instead of the values themselves. While this may be possible in some cases, the problem is that it may not be possible to statically identify the references (or all of them) that correspond to these frequent values, while our hardware technique does this at runtime.

3. Experimental setup

We have conducted detailed simulation studies of the bus energy behavior of CMP systems. Since we wanted to conduct complete system simulation studies, for real applications together with the operating system, we have used the Simics [19] simulation infrastructure. This infrastructure provides a memory system event generation mechanism, which has been used in conjunction with L1 and L2 simulators that we have written. The baseline (default) target hardware used in the experiments is given in Table 1. Many of these parameters have also been varied to study their effect on the energy savings. Similar to other related studies [17], the bus energy calculations are based on an average 50% bit switching activity per transaction and have been calculated using the formulations given in [20]; a sketch of this accounting follows Table 2 below.

Table 1. Default simulation parameters.

  Process technology      70 nm
  Clock speed             500 MHz
  No. of CPUs             8
  L1 size                 16 KB per CPU
  L1 associativity        4-way
  L1 and L2 block size    32 bytes
  CVC                     4 entries, fully associative
  CVC energy/access       18.6 pJ
  Bus width               32 + 4 lines
  Bus energy/access       11.6 pJ/line
  Bus clock speed         500 MHz

We have used several real-world applications from the SPEC OMP2001 suite [18] (reference input) that are given in Table 2. These are parallel applications using the OpenMP interface and are intended to evaluate the performance of shared memory multiprocessors. We have adapted these applications to run on the Simics environment. The simulation of these applications is extremely time-consuming, and we have collected statistics after fast-forwarding over the first four billion instructions. The number of instructions and the number of bus transactions simulated are also given in Table 2.

Table 2. Benchmarks (instructions and bus transactions are in millions).

  Benchmark   Description                 Simulated instns. (m)   Bus trans. (m)
  swim        Water modeling              15,720                  2099
  applu       Parabolic/elliptic PDE      16,696                  1802
  galgel      Fluid dynamics              15,080                  1492
  equake      Earthquake modeling         35,312                  2152
  apsi        Pollution modeling          25,600                  1783
  fma3d       Finite-element crash sim.   95,072                  2107
  art         Neural network sim.         24,800                  1998
  ammp        Comp. chemistry             176,344                 5661
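The sketch below is our simplified reading of this accounting, using the Table 1 numbers; the exact formulation is the one in [20], and the 40% hit ratio in the example is an illustrative figure, not a measured result.

```python
# Rough sketch of the bus energy accounting, using Table 1 values.
E_LINE = 11.6e-12   # J per line per bus access (Table 1)
E_CVC = 18.6e-12    # J per CVC access (Table 1)
LINES = 36          # 32 address/data lines + 4 control lines
ALPHA = 0.5         # assumed average switching activity per transaction

def bus_energy(transactions, lines=LINES):
    return transactions * lines * ALPHA * E_LINE

def cvc_bus_energy(transactions, hit_ratio, index_bits=2):
    """CVC hits drive only the index bits plus the fvEN line; every word
    still pays a CVC lookup at the sender and another at the receiver."""
    hits = transactions * hit_ratio
    misses = transactions - hits
    bus = bus_energy(misses) + bus_energy(hits, lines=index_bits + 1)
    return bus + 2 * transactions * E_CVC

# Toy comparison: 1M word transfers at an illustrative 40% CVC hit ratio.
base = bus_energy(1e6)
with_cvc = cvc_bus_energy(1e6, hit_ratio=0.4)
print(f"normalized energy: {with_cvc / base:.2f}")  # ~0.81
```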

In the following results, we are interested in the savings in bus energy, and we also give the additional energy consumed by the CVC explicitly.

4. Results

As mentioned earlier, we are looking to preload the CVC with values instead of dynamically tracking the frequent values. We could profile the code and compute frequent values, either at a global scale over the entire execution or within shorter epochs, and use such information to preload the CVCs. We refer to these as globally frequent and locally frequent values, respectively, and investigate the benefits of these options first in the following discussion.

4.1. Using globally frequent values

Fig. 4 shows the normalized energy obtained for the bus enhanced with the CVC compared to the normal bus. In these graphs, the first two bars for each application give the energy consumption when using the globally frequent values for the CVC throughout the execution (i.e., they are pre-loaded before starting the application and the contents do not change) for CVC sizes of 4 and 8 entries, respectively. These bars are in turn broken down into two parts: the bottom is the energy taken by the CVC itself and the top is that of the bus (typically the latter overwhelms the former, showing that our scheme does not incur significant energy costs). The three graphs in this figure correspond to the bus transactions (a) for L1 to L1 (L1-L1) transfers, (b) between L1 and L2 (L1-L2), and (c) for all transfers (L1 to L1 and between L1 and L2).

Corresponding to the value locality graphs shown earlier in Fig. 2, we find the energy savings better for the L1-L1 transactions than for the L1-L2 transactions. We find savings of as much as 70% (in swim), with average savings of 28% for n = 4 across the applications, in L1-L1 transactions. The average saving for L1-L2 transactions is around 15%, which is still reasonably significant. When we examine the overall savings in energy across all transactions, we observe savings of around 20% on average, showing the benefits of our CVC mechanism for reducing bus switching activity.

When we increase the size of the CVC from 4 to 8, we notice that there are no additional savings; in fact, the energy consumption goes up in most cases (except art), because the number of index bits and the per-access CVC energy overshadow any savings from the additional coverage of frequent values, which we already observed earlier to be relatively small.

4.2. Using locally frequent values

The previous set of experiments fixed the n frequent values throughout the run. It is possible that the value locality changes over the phases of a program, and a CVC that tracks such changes could provide even more energy savings. Consequently, we next explore the potential of dynamic value locality adaptation.

To explore this potential, we assume an oracle that has perfect knowledge of the frequent values during different epochs (defined to be 10,000 bus transactions in this study) of the execution, and that adjusts the CVCs of all processors to these frequent values before each epoch starts. We call this scheme locally frequent values. Note that this scheme captures the limits/potential of tracking locally frequent values, though it may be difficult to implement in practice.

The graphs in Fig. 5 plot the increase in hit ratio of the CVC from employing such a locally frequent value mechanism compared to a globally frequent value mechanism for each epoch in the execution; i.e., for each epoch i on the x-axis, the hit rate of the CVC with globally fixed values is subtracted from the hit rate of the CVC in another simulation which uses the local value mechanism. The higher this difference, the more scope there is for a locally frequent value mechanism to benefit the CVC compared to fixing the values globally. These graphs are shown only for L1-L1 transactions in the interest of space. The actual energy consumption of this oracle-based locally frequent value scheme is shown as the second set of two bars for each application in Fig. 4.

Fig. 4. Energy consumption of the bus with CVC compared to a normal bus. The results are given for (a) transactions between L1s, (b) transactions between L1s and L2, and (c) all transactions. The left two bars for each application are for global frequent values with n = 4 (4,G) and n = 8 (8,G), and the right two bars (4,L) and (8,L) are for local frequent values. The lower part of each bar shows the CVC energy, while the top part gives the bus energy (base configuration).
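The per-epoch quantity plotted in Fig. 5 can be expressed as a few lines of analysis code. This is a minimal sketch: the trace is again assumed to be a list of transferred words, and the oracle is emulated by simply counting within each epoch.

```python
# Sketch of the Fig. 5 metric: per-epoch hit rate of an oracle "locally
# frequent" CVC minus that of a globally fixed CVC (epoch = 10,000 bus
# transactions, as in the paper; the trace is a stand-in input).
from collections import Counter

EPOCH = 10_000

def hit_rate(words, table):
    table = set(table)
    return sum(w in table for w in words) / len(words)

def epoch_differences(trace, n=4):
    global_top = [v for v, _ in Counter(trace).most_common(n)]
    diffs = []
    for start in range(0, len(trace) - EPOCH + 1, EPOCH):
        epoch = trace[start:start + EPOCH]
        local_top = [v for v, _ in Counter(epoch).most_common(n)]  # oracle
        diffs.append(hit_rate(epoch, local_top) - hit_rate(epoch, global_top))
    return diffs  # one value per epoch; larger = more scope for local values
```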

Fig. 5. Difference in hit rate between the local value CVC and the global value CVC for each epoch shown on the x-axis (one panel per benchmark: swim, applu, galgel, equake, apsi, fma3d, art, ammp). A larger value indicates that local values do much better than fixing the values globally for that epoch.

We observe that in many applications (swim, galgel, apsi, ammp), the increases in hit ratio are not significant, and in fact the energy consumption for those applications in Fig. 4 confirms that the local scheme is not very different from choosing the globally frequent values. In fma3d, we find that while the increases in hit ratio are significant at certain epochs (and these epochs display certain patterns) because of phase changes in the program, the consequent impact on energy in Fig. 4 still does not seem significantly different from the globally frequent value case. Finally, applications such as applu, equake and art show significant differences in hit rates in many epochs (and there are no decipherable patterns in these differences either). Even in these applications (particularly applu and equake), we do not see much difference in energy savings. This is because the epochs where these differences occur are themselves not very good exhibitors of value locality, and choosing a locally good set of values does not make a significant difference. Only in art (and that too only for n = 4, where the local mechanism is able to select a better set of values than the global one) do we find significant differences between the local and global mechanisms.


In general, we find that there is a significant fraction of epochs where the value locality is good, and the global scheme works equally well across these epochs (giving the energy savings in Fig. 4). In the remaining epochs, the locality is not very good anyway, and the energy behavior of locally frequent values is not significantly different from that of globally frequent values. This suggests that a simple policy of profiling the code and fixing the CVC to contain the globally most frequent values (across the run) is sufficient.

4.3. Sensitivity analysis

In the interest of space, we focus on globally frequent values and the fma3d application in the rest of this discussion (the results are representative of those observed in the other applications). When changing any one parameter, the others are fixed at their default values given in Table 1.

Number of frequent values (n): We have conducted experiments with varying numbers of entries for the CVC. We have already discussed the pros (better value locality) and cons (larger index bits, higher CVC energy cost per access) of larger CVCs, and Fig. 6 shows these trade-offs for fma3d. We find that even with a one-entry CVC we get good energy savings, and the overheads start becoming more significant with larger n. In fact, the results presented until now (with a four-entry CVC) are thus conservative, and one could get even more savings with a one-entry CVC. Typically, zero turns out to be a frequent value in most applications (6 of the 8), and it is possible to simply hardwire this value to get good savings, though this may defeat the flexibility of adjusting this value for the application at hand. On the other hand, the CVC can be customized for a specific application.

Fig. 6. Sensitivity to the number of CVC entries (n = 1, 2, 4, 8, 16). Energy is normalized with respect to the bus energy without CVCs. The lower part of each bar is the CVC energy and the top part is the bus energy.

L1 size: With improving technology, we expect L1 sizes to get larger. Fig. 7 shows the energy savings of our mechanism across different L1 sizes, demonstrating that this scheme is likely to remain equally effective in the future.

Number of CPUs: As the number of on-chip CPUs increases, which will be another trend in the future, the importance of bus energy optimization will increase. While L1-L2 traffic grows at most linearly with the number of CPUs, the L1-L1 traffic can grow faster (possibly quadratically), making the savings with our mechanism even more pronounced. Fig. 8 validates this observation. Note that our mechanism extends to other on-chip interconnects as well, if one wants to opt for a more scalable interconnect with a larger number of CPUs.


Fig. 7. Sensitivity to the L1 size (8 KB-128 KB). Energy is normalized with respect to the bus energy without CVCs. The lower part of each bar is the CVC energy and the top part is the bus energy.

Fig. 8. Sensitivity to the number of CPUs (1, 2, 4, 8, 16). Energy is normalized with respect to the bus energy without CVCs. The lower part of each bar is the CVC energy and the top part is the bus energy.


5. Further improvement

We also compared our scheme against the bus-invert scheme. In this study, we assume a 128-bit system bus. We collect 1 GB-4 GB of values that were actually transmitted on the bus (32 M transactions) and use them to evaluate our scheme. Since our scheme can also employ bus inverting, we present those results as well.

5.1. Comparison with inverting bus

We investigate the possibility of combining our scheme with bus inverting. Figs. 9 and 10 show that our scheme can reduce the bus switching activity by another 10-18%, which is 35-95% more than bus inverting alone.

Fig. 9. Reduction of bus switching with CVC and CVC + inverting compared to the normal inverting bus. Here 4-, 8-, and 16-entry word-level CVCs are used.

Fig. 10. Increase in bus switching reduction for CVC + invert over inverting alone.

Fig. 11. Reduction of bus switching with byte-level CVCs. 2-, 4-, and 8-entry 1-byte CVCs are used and compared against the 4- and 8-entry 4-byte (1-word) CVCs, as well as a 128-bit inverting bus.

These results clearly show that our approach can also be used with the inverting bus scheme to further reduce switching activity.
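Bus-invert coding [10] conditionally complements the driven word whenever more than half the lines would switch, signalling the inversion on one extra line. A minimal sketch of that base technique follows, assuming an illustrative 32-bit lane; in the combined scheme, words that miss in the CVC (and are therefore sent in full) would pass through exactly this step.

```python
# Sketch of bus-invert coding [10], the technique the CVC is combined with.
WIDTH = 32                 # illustrative lane width
MASK = (1 << WIDTH) - 1

def bus_invert(prev_bus_state, word):
    """Return (invert_line, driven_value) minimizing line transitions."""
    transitions = bin((prev_bus_state ^ word) & MASK).count("1")
    if transitions > WIDTH // 2:
        return 1, ~word & MASK   # assert invert line; receiver re-inverts
    return 0, word

inv, driven = bus_invert(0x00000000, 0xFFFFFF00)  # 24 transitions > 16
assert (inv, driven) == (1, 0x000000FF)           # only 8 lines switch now
```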

5.2. Using byte-based CVC

To further exploit the locality of communicated values, we study the effectiveness of employing a byte-level CVC, where one CVC is used for each byte transmitted. Four CVCs are used to transmit one word (4 bytes), checking each byte concurrently so that the latency is minimized.

Fig. 11 shows the reduction in bus switching for the bus enhanced with the byte-level CVC compared to the word-level CVC. As shown in the figure, for the byte-level CVC, using more entries does not bring proportional savings compared to the word-level CVC. And if we use a 16-entry byte-level CVC, we get no savings at all, since it requires 16 bits to transfer 4 bytes while a 16-entry word-level CVC needs only 4 bits.
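The index-bit arithmetic behind this observation can be checked directly; the sketch below is only a back-of-the-envelope restatement of the argument above, with entry counts taken from Fig. 11.

```python
# Back-of-the-envelope check of the byte-level vs. word-level index cost.
from math import log2

def index_bits_per_word(entries, lane_bytes):
    """Bits driven for one fully-hit 4-byte word at a given CVC granularity."""
    lanes = 4 // lane_bytes          # 4 byte lanes, or 1 word lane
    return lanes * int(log2(entries))

print(index_bits_per_word(16, lane_bytes=1))  # byte-level, 16 entries -> 16 bits
print(index_bits_per_word(16, lane_bytes=4))  # word-level, 16 entries -> 4 bits
```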

6. Concluding remarks

As integration densities increase, on-chip multiprocessors are likely to gain more prominence in both high-end and embedded systems because of their attractiveness from the design, implementation, power, verification and scalability angles. At the same time, the importance of the interconnect that serves as the backbone for on-chip processing will be increasingly felt from both the performance and power angles. This paper has presented a novel approach to optimizing the energy consumption of these on-chip interconnects by exploiting the locality in values exchanged between processors. The communicating value cache (CVC) can be used in conjunction with other bus power optimization techniques, such as low-swing buses, message reduction techniques and bus encoding techniques, to amplify the savings.

We have experimentally demonstrated the benefits of this mechanism using several applications from the SPEC OMP2001 suite. We obtain over 40% energy savings in some applications, with 20% savings of the total bus energy on average across the applications. If we examine the L1 to L1 transactions, the savings are even higher. This effect is likely to benefit from larger numbers of CPUs, suggesting that our approach will be even more effective in the future. We have also shown that a simple strategy of profiling the code to pick the most frequent values and fixing them for the rest of the execution suffices to provide most of the energy savings. Since we need very few entries in the CVC, it would be possible to implement this structure in RAM without significant overheads compared to a CAM.

References

[1] G. de Micheli, L. Benini, Networks on chip: a new paradigm for systems on chip design, in: Proceedings of the Conference on Design, Automation and Test in Europe, IEEE Computer Society, 2002, p. 418.

[2] K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K. Chang, The case for a single-chip multiprocessor, in: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ACM Press, 1996, pp. 2-11.

[3] D. Bertozzi, L. Benini, G. de Micheli, Low power error resilient encoding for on-chip data buses, in: Proceedings of the Conference on Design, Automation and Test in Europe, IEEE Computer Society, 2002, p. 102.

[4] DAC'02: session: design methodologies meet network applications and system on chip design, 2002.

[5] I. Kadayif, M. Kandemir, U. Sezer, An integer linear programming based approach for parallelizing applications in on-chip multiprocessors, in: Proceedings of the 39th Conference on Design Automation, ACM Press, 2002, pp. 703-706.

[6] S.S. Mukherjee, P. Bannon, S. Lang, A. Spink, D. Webb, The Alpha 21364 network architecture, IEEE Micro 22 (1) (2002) 26-35.

[7] W. Dally, P. Carvey, L. Dennison, Architecture of the Avici terabit switch router, in: Proceedings of Hot Interconnects, 1998.

[8] D. Citron, L. Rudolph, Creating a wider bus using caching techniques, in: Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture, IEEE Computer Society, 1995, p. 90.

[9] H. Mehta, R.M. Owens, M.J. Irwin, Some issues in gray code addressing, in: Proceedings of the 6th Great Lakes Symposium on VLSI, IEEE Computer Society, 1996, p. 178.

[10] M.R. Stan, W.P. Burleson, Bus-invert coding for low-power I/O, IEEE Trans. Very Large Scale Integr. Syst. 3 (1) (1995) 49-58.

[11] M.R. Stan, W.P. Burleson, Low-power encodings for global communication in CMOS VLSI, IEEE Trans. Very Large Scale Integr. Syst. 5 (4) (1997) 444-455.

[12] W. Fornaciari, D. Sciuto, C. Silvano, Power estimation for architectural exploration of HW/SW communication on system-level buses, in: Proceedings of the Seventh International Workshop on Hardware/Software Codesign, ACM Press, 1999, pp. 152-156.

[13] D.E. Culler, A. Gupta, J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.

[14] M.H. Lipasti, Value locality and speculative execution, Ph.D. Thesis, Technical Report CMU-CSC-97-4, Department of Electrical and Computer Engineering, Carnegie Mellon University, 1997.

[15] J. Yang, Y. Zhang, R. Gupta, Frequent value compression in data caches, in: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, ACM Press, 2000, pp. 258-265.

[16] J. Yang, R. Gupta, FV encoding for low-power data I/O, in: Proceedings of the 2001 International Symposium on Low Power Electronics and Design, ACM Press, 2001, pp. 84-87.

[17] K. Basu, A. Choudhary, J. Pisharath, M. Kandemir, Power protocol: reducing power dissipation on off-chip data buses, in: Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, IEEE Computer Society Press, 2002, pp. 345-355.

[18] V. Aslot, M.J. Domeika, R. Eigenmann, G. Gaertner, W.B. Jones, B. Parady, SPEComp: a new benchmark suite for measuring parallel computer performance, in: Proceedings of the International Workshop on OpenMP Applications and Tools, 2001, pp. 1-10.

[19] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, B. Werner, Simics: a full system simulation platform, IEEE Comput. 35 (2) (2002) 50-58.

[20] H. Zhang, J. Rabaey, Low-swing interconnect interface circuits, in: Proceedings of the 1998 International Symposium on Low Power Electronics and Design, ACM Press, 1998, pp. 161-166.

Chun Liu is a Ph.D. candidate in the Computer Science and Engineering Department at the Pennsylvania State University. His main research interests are performance evaluation and microarchitecture of CMPs. He received the B.Sc. and M.Sc. degrees in Computer Science and Engineering from Shanghai Jiao Tong University, Shanghai, China, in 1997 and 1999, respectively. He is a student member of the IEEE and the ACM.

Anand Sivasubramaniam received his B.Tech. in Computer Science from the Indian Institute of Technology, Madras, in 1989, and the M.S. and Ph.D. degrees in Computer Science from the Georgia Institute of Technology in 1991 and 1995, respectively. He has been on the faculty at The Pennsylvania State University since Fall 1995, where he is currently a Professor. Anand's research interests are in computer architecture, operating systems, performance evaluation, and applications for both high performance computer systems and embedded systems. Anand's research has been funded by NSF through several grants, including the CAREER award, and by industries including IBM, Microsoft and Unisys Corp. He has several publications in leading journals and conferences, and is on the editorial board of IEEE Transactions on Computers and IEEE Transactions on Parallel and Distributed Systems. He is a recipient of the 2002 and 2004 IBM Faculty Awards. Anand is a member of the IEEE, IEEE Computer Society, and ACM.

Mahmut Kandemir is an associate professor in the Computer Science and Engineering Department at the Pennsylvania State University. His main research interests are optimizing compilers, I/O intensive applications, and power-aware computing. He received the B.Sc. and M.Sc. degrees in control and computer engineering from Istanbul Technical University, Istanbul, Turkey, in 1988 and 1992, respectively. He received the Ph.D. from Syracuse University, Syracuse, New York, in electrical engineering and computer science, in 1999. He is a member of the IEEE and the ACM.