Locality-Aware Data Replication in the Last-Level Cache

George Kurian, Srinivas Devadas
Massachusetts Institute of Technology
Cambridge, MA, USA
{gkurian, devadas}@csail.mit.edu

Omer Khan
University of Connecticut
Storrs, CT, USA
[email protected]
Abstract

Next generation multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly.

The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes.
1 Introduction

Next generation multicore processors and applications will operate on massive data. A major challenge in future multicore processors is the data movement incurred by conventional cache hierarchies, which impacts the off-chip bandwidth, on-chip memory access latency and energy consumption [7].
A large, monolithic on-chip cache does not scale beyond a small number of cores, and the only practical option is to physically distribute memory in pieces so that every core is near some portion of the cache [5]. In theory this provides a large amount of aggregate cache capacity and fast private memory for each core. Unfortunately, it is difficult to manage the distributed cache and network resources effectively since they require architectural support for cache coherence and consistency under the ubiquitous shared memory model.
Popular directory-based protocols enable fast local caching to exploit data locality, but scale poorly with increasing core counts [3, 22]. Many recent proposals have addressed directory scalability in single-chip multicores using sharer compression techniques or limited directories [24, 31, 29, 12]. Yet, fast private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced from them [19]. This leads to increased network traffic and request rate to the last-level cache. Since on-chip wires are not scaling at the same rate as transistors, data movement not only impacts memory access latency, but also consumes extra power due to the energy consumption of network and cache resources [16].
Last-level cache (LLC) organizations offer trade-offs between on-chip data locality and off-chip miss rate. While private LLC organizations (e.g., [10]) have low hit latencies, their off-chip miss rates are high in applications that have uneven distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations (e.g., [1]), on the other hand, lead to non-uniform cache access (NUCA) [17] that hurts on-chip locality, but their off-chip miss rates are low since cache lines are not replicated. Several proposals have explored hybrid LLC organizations that attempt to combine the good characteristics of private and shared LLC organizations [9, 30, 4, 13]. These proposals either do not quickly adapt their policies to dynamic program changes or replicate cache lines without paying attention to their locality. In addition, some of them significantly complicate coherence and do not scale to large core counts (see Section 5 for details).
We propose a data replication mechanism for the LLC that retains the on-chip cache utilization of the shared LLC while intelligently replicating cache lines close to the requesting cores so as to maximize data locality. To achieve this goal, we propose a low-overhead yet highly accurate in-hardware locality classifier at the cache line granularity that only allows the replication of cache lines with high reuse.

1.1 Motivation
The utility of data replication at the LLC can be understood by measuring cache line reuse. Figure 1 plots the distribution of the number of accesses to cache lines in the LLC as a function of run-length.
Figure 1. Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length (binned as [1-2], [3-9], and [≥10] accesses). The classification is done at the cache line granularity.
Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting access by another core or before the line is evicted. Cache line accesses from multiple cores conflict if at least one of them is a write. For example, in BARNES, over 90% of the accesses to the LLC are to shared (read-write) data with a run-length of 10 or more. The greater the number of accesses with high run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Hence, BARNES would benefit from replicating shared (read-write) data. Similarly, FACESIM would benefit from replicating instructions, and PATRICIA would benefit from replicating shared (read-only) data. On the other hand, FLUIDANIMATE and OCEAN-C would not benefit since most cache lines experience just one or two accesses before a conflicting access or an eviction. For such cases, replication would increase LLC pollution without improving data locality.
Hence, the replication decision should not depend on the type of data, but rather on its locality. Instructions and shared data (both read-only and read-write) can be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the reuse of data changes during an application's execution.
1.2 Proposed Idea
We propose a low-overhead yet highly accurate hardware-only predictive mechanism to track and classify the reuse of each cache line in the LLC. Our runtime classifier only allows replicating those cache lines that demonstrate reuse at the LLC while bypassing replication for others. When a cache line replica is evicted or invalidated, our classifier adapts by adjusting its future replication decision accordingly. This reuse tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. Our locality-aware protocol is advantageous because it:
1. Enables lower memory access latency and energy by selectively replicating cache lines that show high reuse in the LLC slice of the requesting core.
2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality using a classifier that adapts to the runtime reuse at the granularity of cache lines.

3. Allows coherence complexity almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be placed at the LLC slice of the requesting core. The additional coherence complexity only arises within a core when the LLC slice is searched on an L1 cache miss, or when a cache line in the core's local cache hierarchy is evicted/invalidated.
2 Locality-Aware LLC Data Replication

2.1 Baseline System
The baseline system is a tiled multicore with an electrical 2-D mesh interconnection network as shown in Figure 2. Each core consists of a compute pipeline, private L1 instruction and data caches, a physically distributed shared LLC slice with an integrated directory, and a network router. Coherence is maintained using a MESI protocol. The coherence directory is integrated with the LLC slices by extending the tag arrays (in-cache directory organization [8, 5]) and tracks the sharing status of the cache lines in the per-core private L1 caches. The private L1 caches are kept coherent using the ACKwise limited directory-based coherence protocol [20]. Some cores have a connection to a memory controller as well.
Even though our protocol can be built on top of both Static-NUCA and Dynamic-NUCA configurations [17, 13], we use Reactive-NUCA's data placement and migration mechanisms to manage the LLC [13]. Private data is placed at the LLC slice of the requesting core and shared data is address interleaved across all LLC slices. Reactive-NUCA also proposed replicating instructions at a cluster level (e.g., 4 cores) using a rotational interleaving mechanism. However, we do not use this mechanism, but instead build a locality-aware LLC data replication scheme for all types of cache lines.

2.2 Protocol Operation
The four essential components of data replication are: (1) choosing which cache lines to replicate, (2) determining where to place a replica, (3) looking up a replica, and (4) maintaining coherence for replicas. We first define a few terms to facilitate describing our protocol.
1. Home Location: The core where all requests for a cache line are serialized for maintaining coherence.

2. Replica Sharer: A core that is granted a replica of a cache line in its LLC slice.

3. Non-Replica Sharer: A core that is NOT granted a replica of a cache line in its LLC slice.

4. Replica Reuse: The number of times an LLC replica is accessed before it is invalidated or evicted.

5. Home Reuse: The number of times a cache line is accessed at the LLC slice in its home location before a conflicting write or an eviction.

6. Replication Threshold (RT): The reuse at or above which a replica is created.
Figure 2. ①-④ are mockup requests showing the locality-aware LLC replication protocol on the baseline tiled multicore (each tile contains a compute pipeline, private L1-I and L1-D caches, an LLC slice with integrated directory, and a router). The black data block has high reuse and a local LLC replica is allowed that services requests ① and ②. The low-reuse red data block is not allowed to be replicated at the LLC, and request ③, which misses in the L1, must access the LLC slice at its home core. The home core for each data block can also service local private cache misses (e.g., ④).
Figure 3. Classification state diagram. Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes: a non-replica core is promoted when Home Reuse ≥ RT, and a replica core is demoted when XReuse < RT. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction.
Note that for a given cache line, one core can be a replica sharer while another can be a non-replica sharer. Our protocol starts out as a conventional directory protocol and initializes all cores as non-replica sharers of all cache lines (as shown by Initial in Figure 3). Let us walk through the handling of read requests, write requests, evictions, invalidations and downgrades, as well as the cache replacement policies under this protocol.
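The classification rules of Figure 3 are compact enough to express directly. The following C++ sketch captures them under our own naming; the paper only fixes the parameter RT and the use of small saturating counters, so the types, widths and function names here are illustrative assumptions:

```cpp
#include <cstdint>

// Replication Threshold; RT = 3 is the value the evaluation settles on.
constexpr uint32_t RT = 3;

enum class Mode : uint8_t { NonReplica, Replica };  // replication mode bit

// Per-core classification state kept in the home directory entry.
struct CoreLocality {
    Mode     mode       = Mode::NonReplica;  // all cores start as non-replica sharers
    uint32_t home_reuse = 0;                 // saturating counter (2 bits in hardware)
};

// Promotion check at the home: a non-replica sharer whose home reuse
// has reached RT is granted a replica on its next request.
bool should_promote(const CoreLocality& c) {
    return c.mode == Mode::NonReplica && c.home_reuse >= RT;
}

// Demotion/retention decision when a core's replica is removed.
// On an invalidation, XReuse = replica reuse + home reuse; on an
// eviction, XReuse = replica reuse alone (Figure 3).
void on_replica_removed(CoreLocality& c, uint32_t replica_reuse, bool invalidation) {
    uint32_t xreuse = invalidation ? replica_reuse + c.home_reuse : replica_reuse;
    c.mode = (xreuse >= RT) ? Mode::Replica : Mode::NonReplica;
    c.home_reuse = 0;  // reset for the next round of classification
}
```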
2.2.1 Read Requests

On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted at the private L1 cache. In addition, a Replica Reuse counter (as shown in Figure 4) at the LLC directory entry is incremented. The replica reuse counter is a saturating counter used to capture reuse information. It is initialized to '1' on replica creation and incremented on every replica hit.
On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes the necessary actions to obtain the most recent copy of the cache line.
Figure 4. ACKwise_p-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter, as well as Replication mode bits and Home reuse counters for every core in the system.
The directory entry is augmented with additional bits as shown in Figure 4. These bits include (a) a Replication Mode bit and (b) a Home Reuse saturating counter for each core in the system. Note that adding several bits to track the locality of each core in the system does not scale with the number of cores; we therefore present a cost-efficient classifier implementation in Section 2.2.5. The replication mode bit identifies whether a replica is allowed to be created for the particular core. The home reuse counter tracks the number of times the cache line is accessed at the home location by the particular core. This counter is initialized to '0' and incremented on every hit at the LLC home location.
If the replication mode bit is set to true, the cache line is inserted in the requester's LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the Replication Threshold (RT), the requesting core is "promoted" (the replication mode bit is set to true) and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created and the cache line is only inserted in the requester's private L1 cache.
If the LLC home location is at the requesting core, the read request is handled directly at the LLC home. Even if the classifier directs to create a replica, the cache line is just inserted at the private L1 cache.
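Putting the read path together, a home-side handler might look like the sketch below. It reuses Mode, CoreLocality and RT from the earlier sketch; DirectoryEntry, ReadOutcome and the fixed 64-core array reflect the Complete classifier and are our illustrative assumptions:

```cpp
// Continues the previous sketch (Mode, CoreLocality, RT assumed in scope).
constexpr int NUM_CORES = 64;

struct DirectoryEntry {
    CoreLocality locality[NUM_CORES];  // Complete classifier: one entry per core
};

enum class ReadOutcome {
    L1Only,          // line goes only to the requester's private L1
    ReplicaCreated   // line also inserted in the requester's LLC slice
};

ReadOutcome handle_home_read(DirectoryEntry& dir, int requester, bool requester_is_home) {
    CoreLocality& c = dir.locality[requester];
    // A core whose home is local never gets a replica; the home slice
    // itself services the request.
    if (requester_is_home) return ReadOutcome::L1Only;

    if (c.mode == Mode::Replica)
        return ReadOutcome::ReplicaCreated;  // replica reuse counter starts at 1

    c.home_reuse++;                          // saturating in hardware
    if (c.home_reuse >= RT) {
        c.mode = Mode::Replica;              // the core is "promoted"
        return ReadOutcome::ReplicaCreated;
    }
    return ReadOutcome::L1Only;
}
```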
2.2.2 Write Requests

On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the Replica Reuse counter is incremented.
If a replica is not found or exists in the Shared (S) state, the request is forwarded to the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby maintaining the single-writer multiple-reader invariant [25]. The acknowledgements received are processed as described in Section 2.2.3. After all such acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are reset to '0'. This has to be done since these sharers have not shown enough reuse to be "promoted".
If the writer is a non-replica sharer, its home reuse counter is modified as follows. If the writer is the only sharer (replica or non-replica), its home reuse counter is incremented; otherwise it is reset to '1'. This enables the replication of migratory shared data at the writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
2.2.3 Evictions and Invalidations

On an invalidation request, both the LLC slice and L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether the core stays a replica sharer. If the (replica + home) reuse is ≥ RT, the core maintains replica status; otherwise it is demoted to non-replica status (as shown in Figure 3). The two reuse counters have to be added since their sum is the total reuse that the core exhibited for the cache line between successive writes.
When an L1 cache line is evicted, the LLC replica location is probed for the same address. If a replica is found, the dirty data in the L1 cache line is merged with it; otherwise an acknowledgement is sent to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. The replica reuse counter is used by the locality classifier as follows. If the replica reuse is ≥ RT, the core maintains replica status; otherwise it is demoted to non-replica status. Only the replica reuse counter is used for this decision since it captures the reuse of the cache line at the LLC replica location.
After the acknowledgement corresponding to an eviction or invalidation of the LLC replica is received at the home, the locality classifier sets the home reuse counter of the corresponding core to '0' for the next round of classification.
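At the home, processing such an acknowledgement therefore reduces to feeding the piggy-backed counter into the transition rule sketched after Figure 3; the message structure below is an assumption for illustration:

```cpp
// Continues the previous sketches.
struct ReplicaAck {
    int      core;           // which core's replica was removed
    uint32_t replica_reuse;  // counter piggy-backed on the acknowledgement
    bool     invalidation;   // false if the replica was evicted
};

void handle_replica_ack(DirectoryEntry& dir, const ReplicaAck& ack) {
    // Keep or demote replica status based on XReuse, then reset the home
    // reuse counter for the next round of classification.
    on_replica_removed(dir.locality[ack.core], ack.replica_reuse, ack.invalidation);
}
```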
The eviction of an LLC replica back-invalidates the L1 cache (as described earlier). A possibly better strategy is to maintain the validity of the L1 cache line. This requires two additional message types as well as two messages: one to communicate back the reuse counter on the LLC replica eviction, and another to communicate the acknowledgement when the L1 cache line is finally invalidated or evicted. We opted for the back-invalidation for two reasons: (1) it maintains the simplicity of the coherence protocol, and (2) the energy and performance improvements of the more elaborate strategy are negligible since (a) the LLC is more than 4× larger than the L1 cache, keeping the probability that an evicted LLC line has an L1 copy extremely low, and (b) our LLC replacement policy prioritizes retaining cache lines that have L1 cache copies.
2.2.4 LLC Replacement Policy

Traditional LLC replacement policies use the least recently used (LRU) policy. One reason why this is sub-optimal is that the LRU information cannot be fully captured at the LLC because the L1 cache filters out a large fraction of accesses that hit within it.
Figure 5. ACKwise_p-Limited_k locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter, as well as the Limited_k classifier. The Limited_k classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas.
To be cognizant of this, the replacement policy should prioritize retaining cache lines that have L1 cache sharers. Some proposals in the literature accomplish this by sending periodic Temporal Locality Hint messages from the L1 cache to the LLC [15]. However, this incurs additional network traffic.
Our replacement policy accomplishes the same goal with a much simpler scheme. It first selects cache lines with the least number of L1 cache copies and then chooses the least recently used among them. The number of L1 cache copies is readily available since the directory is integrated within the LLC tags ("in-cache" directory). This reduces back-invalidations to a negligible amount and outperforms the LRU policy (cf. Section 4.2).
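A minimal sketch of this victim selection, with an assumed per-way view of the tag array (the L1-copy count falls out of the in-cache directory's sharer list):

```cpp
#include <vector>
#include <cstdint>

// One LLC way as seen by the replacement logic; the field layout is an
// illustrative assumption.
struct LLCWay {
    uint8_t l1_copies;  // number of L1 caches holding this line
    uint8_t lru_age;    // larger value = less recently used
};

// Fewest L1 copies first; among those, the least recently used.
int select_victim(const std::vector<LLCWay>& set) {
    int victim = 0;
    for (int way = 1; way < static_cast<int>(set.size()); way++) {
        bool fewer_copies = set[way].l1_copies < set[victim].l1_copies;
        bool same_copies_older =
            set[way].l1_copies == set[victim].l1_copies &&
            set[way].lru_age > set[victim].lru_age;
        if (fewer_copies || same_copies_older) victim = way;
    }
    return victim;
}
```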
2.2.5 Limited Locality Classifier Optimization

The classifier described earlier, which keeps track of locality information for all the cores in the directory entry, is termed the Complete locality classifier. It has a storage overhead of 30% (calculated in Section 2.4) at 64 cores and over 5× at 1024 cores. In order to mitigate this overhead, we develop a classifier that maintains locality information for a limited number of cores and classifies the other cores as replica or non-replica sharers based on this information.
The locality information for each core consists of (1) the core ID, (2) the replication mode bit and (3) the home reuse counter. The classifier that maintains a list of this information for a limited number of cores (k) is termed the Limited_k classifier. Figure 5 shows the information that is tracked by this classifier. The sharer list of the ACKwise limited directory entry cannot be reused for tracking locality information because of its different functionality. While the hardware pointers of ACKwise are used to maintain coherence, the limited locality list serves to classify cores as replica or non-replica sharers. Decoupling in this manner also enables the locality-aware protocol to be implemented efficiently on top of other scalable directory organizations. We now describe the working of the limited locality classifier.
At startup, all entries in the limited locality list are free; this is denoted by marking all core IDs as Invalid. When a core makes a request to the home location, the directory first checks if the core is already being tracked by the limited locality list. If so, the actions described previously are carried out. Otherwise, the directory checks if a free entry exists. If one does, it allocates the entry to the core and the same actions are carried out.
Otherwise, the directory checks if a currently tracked core can be replaced. An ideal candidate for replacement is a core that is currently not using the cache line. Such a core is termed an inactive sharer and should ideally relinquish its entry to a core in need of it. A replica core becomes inactive on an LLC invalidation or an eviction. A non-replica core becomes inactive on a write by another core. If such a replacement candidate exists, its entry is allocated to the requesting core. The initial replication mode of the core is obtained by taking a majority vote of the modes of the tracked cores. This is done so as to start off the requester in its most probable mode.
Finally, if no replacement candidate exists, the mode for the requesting core is obtained by taking a majority vote of the modes of all the tracked cores. The limited locality list is left unchanged.
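The lookup/allocation/voting flow above can be summarized in a sketch (entry layout per Figure 5; the inactive-sharer flag and the nullptr convention for an untracked core are our simplifications):

```cpp
#include <array>

// Continues the earlier sketches (Mode, CoreLocality assumed in scope).
constexpr int K = 3;             // Limited_3 in the final design
constexpr int INVALID_CORE = -1; // marks a free list entry

struct LimitedEntry {
    int          core_id  = INVALID_CORE;
    CoreLocality loc;
    bool         inactive = false;  // replica: set on invalidation/eviction;
                                    // non-replica: set on a conflicting write
};

struct LimitedClassifier {
    std::array<LimitedEntry, K> list;

    // Majority vote over the modes of the tracked cores.
    Mode majority_mode() const {
        int tracked = 0, replicas = 0;
        for (const LimitedEntry& e : list) {
            if (e.core_id == INVALID_CORE) continue;
            tracked++;
            if (e.loc.mode == Mode::Replica) replicas++;
        }
        return (2 * replicas > tracked) ? Mode::Replica : Mode::NonReplica;
    }

    // Returns the tracked state for 'core', allocating or replacing an
    // entry when possible; returns nullptr when the list is left unchanged,
    // in which case the caller classifies the request with majority_mode().
    CoreLocality* lookup(int core) {
        for (LimitedEntry& e : list)           // 1. already tracked
            if (e.core_id == core) return &e.loc;
        for (LimitedEntry& e : list)           // 2. free entry
            if (e.core_id == INVALID_CORE) {
                e.core_id = core;
                return &e.loc;
            }
        for (LimitedEntry& e : list)           // 3. replace an inactive sharer,
            if (e.inactive) {                  //    starting in the majority mode
                e = LimitedEntry{core, {majority_mode(), 0}, false};
                return &e.loc;
            }
        return nullptr;                        // 4. no candidate
    }
};
```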
The storage overhead of the Limited_k classifier is directly proportional to the number of cores (k) for which locality information is tracked. In Section 4.3, we evaluate the storage and accuracy tradeoffs for the Limited_k classifier. Based on our observations, we pick the Limited_3 classifier.
2.3 Discussion
2.3.1 Replica Creation Strategy

In the protocol described earlier, replicas are created in all valid cache states. A simpler strategy is to create an LLC replica only in the Shared cache state. This enables instructions, shared read-only and shared read-write data that exhibit high read run-length to be replicated so as to serve multiple read requests from within the local LLC slice. However, migratory shared data cannot be replicated with this simpler strategy because both read and write requests are made to it in an interleaved manner. Such data patterns can be efficiently handled only if the replica is created in the Exclusive or Modified state. Benchmarks that exhibit both the above access patterns are observed in our evaluation (cf. Section 4.1).
2.3.2 Coherence Complexity

The local LLC slice is always looked up on an L1 cache miss or eviction. Additionally, both the L1 cache and the LLC slice are probed on every asynchronous coherence request (i.e., invalidate, downgrade, flush or write-back). This is needed because the directory only has a single pointer to track the local cache hierarchy of each core. This method also keeps the coherence complexity similar to that of a non-hierarchical (flat) coherence protocol.
To avoid the latency and energy overhead of searching the LLC replica, one may want to optimize the handling of asynchronous requests, or decide intelligently whether to look up the local LLC slice on a cache miss or eviction. In order to enable such optimizations, additional sharer tracking bits are needed at the directory and L1 cache. Moreover, additional network message types are needed to relay coherence information between the LLC home and other actors.
In order to evaluate whether this additional coherence complexity is worthwhile, we compared our protocol to a dynamic oracle that has perfect information about whether a cache line is present in the local LLC slice. The dynamic oracle avoids all unnecessary LLC lookups. The completion time and energy difference when compared to the dynamic oracle was less than 1%. Hence, in the interest of avoiding the additional complexity, the LLC replica is always looked up for the above coherence requests.
2.3.3 Classifier Organization

The classifier for the locality-aware protocol is organized as an in-cache structure, i.e., the replication mode bits and home reuse counters are maintained for all cache lines in the LLC. However, this is not an essential requirement. The classifier is logically decoupled from the directory and could be implemented using a sparse organization.
The storage overhead for the in-cache organization is calculated in Section 2.4. The performance and energy overhead of this organization is small because: (1) the classifier lookup incurs a relatively small energy and latency penalty when compared to the data array lookup of the LLC slice and communication over the network (justified in our results), and (2) only a single tag lookup is needed for accessing the classifier and LLC data. In a sparse organization, a separate lookup is required for the classifier and the LLC data. Even though these lookups could be performed in parallel with no latency overhead, the energy expended to look up two CAM structures needs to be paid.
2.3.4 Cluster-Level Replication

In the locality-aware protocol, the location where a replica is placed is always the LLC slice of the requesting core. An additional method by which one could explore the trade-off between LLC hit latency and LLC miss rate is replicating at a cluster level. A cluster is defined as a group of neighboring cores with at most one replica of a cache line. Increasing the size of a cluster increases LLC hit latency and decreases LLC miss rate; decreasing the cluster size has the opposite effect. The optimal replication algorithm would tune the cluster size so as to maximize the performance and energy benefit.
We explored clustering under our protocol after making the appropriate changes. The changes include (1) blocking at the replica location (the core in the cluster where a replica could be found) before forwarding the request to the home location so that multiple cores in the same cluster do not have outstanding requests to the LLC home location, (2) additional coherence message types for communication between the requester, replica and home cores, and (3) hierarchical invalidation and downgrade of the replica and the L1 caches that it tracks. We do not explain all the details here to save space. Cluster-level replication was not found to be beneficial in the evaluated 64-core system, for the following reasons (see Section 4.4 for details).
1. Clustering increases network serialization delays since multiple locations now need to be searched/invalidated on an L1 cache miss.

2. Cache lines with a low degree of sharing do not benefit because clustering just increases the LLC hit latency without reducing the LLC miss rate.

3. The added coherence complexity of clustering increased our design and verification time significantly.
2.4 Overheads
2.4.1 Storage
The locality-aware protocol requires extra bits at the LLC tag arrays to track locality information. Each LLC directory entry requires 2 bits for the replica reuse counter (assuming an optimal RT of 3). The Limited_3 classifier tracks the locality information for three cores. Tracking one core requires 2 bits for the home reuse counter, 1 bit to store the replication mode and 6 bits to store the core ID (for a 64-core processor). Hence, the Limited_3 classifier requires an additional 27 (= 3 × 9) bits of storage per LLC directory entry. The Complete classifier, on the other hand, requires 192 (= 64 × 3) bits of storage.
All the following calculations are for one core, but they apply to the entire processor since all the cores are identical. The sizes of the per-core L1 and LLC caches used in our system are shown in Table 1. The storage overhead of the replica reuse counter is (2 × 256)/(64 × 8) KB = 1 KB. The storage overhead of the Limited_3 classifier is (27 × 256)/(64 × 8) KB = 13.5 KB. For the Complete classifier, it is (192 × 256)/(64 × 8) KB = 96 KB. Now, the storage overhead of the ACKwise_4 protocol in this processor is 12 KB (assuming 6 bits per ACKwise pointer) and that of a Full Map protocol is 32 KB. Adding up all the storage components, the Limited_3 classifier with the ACKwise_4 protocol uses slightly less storage than the Full Map protocol and 4.5% more storage than the baseline ACKwise_4 protocol. The Complete classifier with the ACKwise_4 protocol uses 30% more storage than the baseline ACKwise_4 protocol.
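For ease of verification, the same arithmetic can be written as compile-time checks (a 256 KB slice of 64-byte lines has 4096 tag/directory entries; the bit widths are those given above):

```cpp
// Storage arithmetic for one 256 KB LLC slice with 64 B cache lines.
constexpr int kEntries      = 256 * 1024 / 64;                   // 4096 lines
constexpr int kReplicaReuse = 2 * kEntries / 8;                  // 2 bits/entry
constexpr int kLimited3     = (3 * (2 + 1 + 6)) * kEntries / 8;  // 27 bits/entry
constexpr int kComplete     = (64 * 3) * kEntries / 8;           // 192 bits/entry
constexpr int kAckwise4     = (4 * 6) * kEntries / 8;            // 4 pointers x 6 bits
constexpr int kFullMap      = 64 * kEntries / 8;                 // 1 bit per core

static_assert(kReplicaReuse == 1 * 1024, "1 KB");
static_assert(kLimited3 == 13824, "13.5 KB");
static_assert(kComplete == 96 * 1024, "96 KB");
static_assert(kAckwise4 == 12 * 1024, "12 KB");
static_assert(kFullMap == 32 * 1024, "32 KB");
```

Note that kReplicaReuse + kLimited3 = 14,848 bytes = 14.5 KB, matching the per-slice overhead quoted in the conclusion.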
2.4.2 LLC Tag & Directory Accesses

Updating the replica reuse counter in the local LLC slice requires a read-modify-write operation on each replica hit. However, since the replica reuse counter (being 2 bits) is stored in the LLC tag array, which must be written on each LLC lookup to update the LRU counters anyway, our protocol does not add any additional tag accesses.
At the home location, the lookup/update of the locality information is performed concurrently with the lookup/update of the sharer list for a cache line. However, the lookup/update of the directory is now more expensive since it includes both the sharer list and the locality information. This additional expense is accounted for in our evaluation.
2.4.3 Network Traffic

The locality-aware protocol communicates the replica reuse counter to the LLC home along with the acknowledgment for an invalidation or an eviction.
Architectural Parameter              Value

Number of Cores                      64 @ 1 GHz
Compute Pipeline per Core            In-Order, Single-Issue
Processor Word Size                  64 bits

Memory Subsystem
L1-I Cache per core                  16 KB, 4-way Assoc., 1 cycle
L1-D Cache per core                  32 KB, 4-way Assoc., 1 cycle
L2 Cache (LLC) per core              256 KB, 8-way Assoc., 2 cycle tag, 4 cycle data; Inclusive, R-NUCA [13]
Directory Protocol                   Invalidation-based MESI, ACKwise_4 [20]
DRAM (number, bandwidth, latency)    8 controllers, 5 GBps/controller, 75 ns latency

Electrical 2-D Mesh with XY Routing
Hop Latency                          2 cycles (1-router, 1-link)
Flit Width                           64 bits
Header (Src, Dest, Addr, MsgType)    1 flit
Cache Line Length                    8 flits (512 bits)

Locality-Aware LLC Data Replication
Replication Threshold                RT = 3
Classifier                           Limited_3

Table 1. Architectural parameters used for evaluation.
This is accomplished without creating additional network flits. For a 48-bit physical address and a 64-bit flit size, an invalidation message requires 42 bits for the physical cache line address, 12 bits for the sender and receiver core IDs and 2 bits for the replica reuse counter. The remaining 8 bits suffice for storing the message type.
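As a sketch, the 64-bit acknowledgement flit could be packed as below; the field order and names are our assumptions, and only the widths come from the text:

```cpp
#include <cstdint>

// One 64-bit invalidation-acknowledgement flit: 42 + 6 + 6 + 2 + 8 = 64 bits.
struct InvAckFlit {
    uint64_t line_addr     : 42;  // 48-bit physical address minus 6 offset bits (64 B lines)
    uint64_t sender        : 6;   // core IDs on a 64-core processor
    uint64_t receiver      : 6;
    uint64_t replica_reuse : 2;   // piggy-backed locality information
    uint64_t msg_type      : 8;   // fits in the remaining bits
};
static_assert(sizeof(InvAckFlit) == 8, "fits in a single 64-bit flit");
```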
3 Evaluation Methodology

We evaluate a 64-core multicore. The default architectural parameters used for evaluation are shown in Table 1.

3.1 Performance Models
All experiments are performed using the core, cache hierarchy, coherence protocol, memory system and on-chip interconnection network models implemented within the Graphite [23] multicore simulator. All the mechanisms and protocol overheads discussed in Section 2 are modeled. The Graphite simulator requires the memory system (including the cache hierarchy) to be functionally correct to complete simulation. This is a good test that all our cache coherence protocols work correctly, given that we have run 21 benchmarks to completion.
The electrical mesh interconnection network uses XY routing. Since modern network-on-chip routers are pipelined [11], and 2- or even 1-cycle per hop router latencies [18] have been demonstrated, we model a 2-cycle per hop delay. In addition to the fixed per-hop latency, network link contention delays are also modeled.

3.2 Energy Models
For dynamic energy evaluations of on-chip electrical network routers and links, we use the DSENT [26] tool. The dynamic energy estimates for the L1-I, L1-D and L2 (with integrated directory) caches as well as DRAM are obtained using McPAT/CACTI [21, 27]. The energy evaluation is performed at the 11 nm technology node to account for future scaling trends. As clock frequencies are relatively slow, high threshold transistors are assumed for lower leakage.
3.3 Baseline LLC Management Schemes

We model four baseline multicore systems that assume private L1 caches managed using the ACKwise_4 protocol.

1. The Static-NUCA baseline address interleaves all cache lines among the LLC slices.

2. The Reactive-NUCA [13] baseline places private data at the requester's LLC slice, replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving, and address interleaves shared data in a single LLC slice.

3. The Victim Replication (VR) [30] baseline uses the requester's local LLC slice as a victim cache for data that is evicted from the L1 cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself or has no sharers in the L1 cache.

4. The Adaptive Selective Replication (ASR) [4] baseline also replicates cache lines in the requester's local LLC slice on an L1 eviction. However, it only allows LLC replication for cache lines that are classified as shared read-only. ASR pays attention to the LLC pressure by basing its replication decision on per-core hardware monitoring circuits that quantify the replication effectiveness based on the benefit (lower LLC hit latency) and cost (higher LLC miss latency) of replication. We do not model the hardware monitoring circuits or the dynamic adaptation of replication levels. Instead, we run ASR at five different replication levels (0, 0.25, 0.5, 0.75, 1) and choose the one with the lowest energy-delay product for each benchmark.
3.4 Evaluation Metrics

Each multithreaded benchmark is run to completion using the input sets from Table 2. We measure the energy consumption of the memory system including the on-chip caches, DRAM and the network. We also measure the completion time, i.e., the time spent in the parallel region of the benchmark. This includes the compute latency, the memory access latency, and the synchronization latency. The memory access latency is further broken down into:

1. L1 to LLC replica latency is the time spent by the L1 cache miss request to the LLC replica location and the corresponding reply from the LLC replica, including the time spent accessing the LLC.

2. L1 to LLC home latency is the time spent by the L1 cache miss request to the LLC home location and the corresponding reply from the LLC home, including the time spent in the network and the first access to the LLC.

3. LLC home waiting time is the queueing delay at the LLC home incurred because requests to the same cache line must be serialized to ensure memory consistency.
Application                  Problem Size

SPLASH-2 [28]
RADIX                        4M integers, radix 1024
FFT                          4M complex data points
LU-C, LU-NC                  1024 × 1024 matrix
OCEAN-C                      2050 × 2050 ocean
OCEAN-NC                     1026 × 1026 ocean
CHOLESKY                     tk29.O
BARNES                       64K particles
WATER-NSQUARED               512 molecules
RAYTRACE                     car
VOLREND                      head

PARSEC [6]
BLACKSCHOLES                 65,536 options
SWAPTIONS                    64 swaptions, 20,000 sims.
STREAMCLUSTER                8192 points per block, 1 block
DEDUP                        31 MB data
FERRET                       256 queries, 34,973 images
BODYTRACK                    4 frames, 4000 particles
FACESIM                      1 frame, 372,126 tetrahedrons
FLUIDANIMATE                 5 frames, 300,000 particles

Others: Parallel MI-Bench [14], UHPC Graph benchmark [2]
PATRICIA                     5000 IP address queries
CONNECTED-COMPONENTS         Graph with 2^18 nodes

Table 2. Problem sizes for our parallel benchmarks.
4. LLC home to sharers latency is the round-trip time needed to invalidate sharers and receive their acknowledgments. This also includes the time spent requesting and receiving synchronous write-backs.

5. LLC home to off-chip memory latency is the time spent accessing memory, including the time spent communicating with the memory controller and the queueing delay incurred due to finite off-chip bandwidth.
An important set of memory system metrics we track to evaluate our protocol is the breakdown of cache miss types:

1. LLC replica hits are L1 cache misses that hit at the LLC replica location.

2. LLC home hits are L1 cache misses that hit at the LLC home location when routed directly to it, or LLC replica misses that hit at the LLC home location.

3. Off-chip misses are L1 cache misses that are sent to DRAM because the cache line is not present on-chip.
4 Results

4.1 Comparison of Replication Schemes

Figures 6 and 7 plot the energy and completion time breakdowns for the replication schemes evaluated. The RT-1, RT-3 and RT-8 bars correspond to the locality-aware scheme with replication thresholds of 1, 3 and 8 respectively.

The energy and completion time trends can be understood based on the following three factors: (1) the type of data accessed at the LLC (instructions, private data, shared read-only data and shared read-write data), (2) the reuse run-length at the LLC, and (3) the working set size of the benchmark.
Figure 6. Energy breakdown for the LLC replication schemes evaluated (S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8 per benchmark), broken into L1-I Cache, L1-D Cache, L2 Cache (LLC), Directory, Network Router, Network Link and DRAM components. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.
Figure 7. Completion time breakdown for the LLC replication schemes evaluated, broken into Compute, L1-To-LLC-Replica, L1-To-LLC-Home, LLC-Home-Waiting, LLC-Home-To-Sharers, LLC-Home-To-OffChip and Synchronization components. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.
Figure 8. L1 cache miss type breakdown (LLC-Replica-Hits, LLC-Home-Hits, OffChip-Misses) for the LLC replication schemes evaluated.
Figure 8, which plots how L1 cache misses are handled by the LLC, is also instrumental in understanding these trends.
Many benchmarks (e.g., BARNES) have a working set that fits within the LLC even if replication is done on every L1 cache miss. Hence, all locality-aware schemes (RT-1, RT-3 and RT-8) perform well both in energy and performance. In our experiments, we observe that BARNES exhibits a high reuse of cache lines at the LLC through accesses directed at shared read-write data. S-NUCA, R-NUCA and ASR do not replicate shared read-write data and hence do not observe any benefits with BARNES.
VR observes some benefits since it locally replicates read-write data. However, it exhibits higher L2 cache (LLC) energy than the other schemes for two reasons. (1) Its (almost) blind process of creating replicas on all evictions pollutes the LLC, leaving less space for useful replicas and LLC home lines. This is evident from the lower replica hit rate for VR when compared to our locality-aware protocol. (2) The exclusive relationship between the L1 cache and the local LLC slice in VR causes a line to always be written back on an eviction even if the line is clean. This is because a replica hit always causes the line in the LLC slice to be invalidated and inserted into the L1 cache. Hence, in the common case where replication is useful, each hit at the LLC location effectively incurs both a read and a write at the LLC, and a write expends 1.2× more energy than a read. The first factor leads to higher network energy and completion time as well. This explains why VR performs worse than the locality-aware protocol. Similar trends in VR performance and energy exist in the WATER-NSQ, PATRICIA, BODYTRACK, FACESIM, STREAMCLUSTER and BLACKSCHOLES benchmarks.
BODYTRACK and FACESIM are similar to BARNES except that their LLC accesses have a greater fraction of instructions and/or shared read-only data. The accesses to shared read-write data are again mostly reads with only a few writes. R-NUCA shows significant benefits since it replicates instructions. ASR shows even higher energy and performance benefits since it replicates both instructions and shared read-only data. The locality-aware protocol shows the same benefits since it replicates all classes of cache lines, provided they exhibit reuse in accordance with the replication threshold. VR shows higher LLC energy for the same reasons as in BARNES. ASR and our locality-aware protocol allow the LLC slice at the replica location to be inclusive of the L1 cache, and hence do not have the same drawback as VR. VR, however, does not have a performance overhead because the evictions are not on the critical path of the processor pipeline.
Note that BODYTRACK, FACESIM and RAYTRACE are the only three among the evaluated benchmarks that have a significant L1-I cache MPKI (misses per thousand instructions). All other benchmarks have an extremely low L1-I MPKI (< 0.5), so R-NUCA's replication mechanism is not effective in most cases. Even in the above three benchmarks, R-NUCA does not place instructions in the local LLC slice but replicates them at a cluster level, hence the serialization delays to transfer the cache lines over the network still need to be paid.
BLACKSCHOLES, on the other hand, exhibits a large number of LLC accesses to private data and a small number to shared read-only data. Since R-NUCA places private data in its local LLC slice, it obtains performance and energy improvements over S-NUCA. However, the improvements obtained are limited since false sharing is exhibited at the page level, i.e., multiple cores privately access non-overlapping cache lines in a page. Since R-NUCA's classification mechanism operates at a page level, it is not able to locally place all truly private lines. The locality-aware protocol obtains improvements over R-NUCA by replicating these cache lines. ASR only replicates shared read-only cache lines and identifies these lines using a per-cache-line sticky Shared bit. Hence, ASR follows the same trends as S-NUCA. DEDUP almost exclusively accesses private data (without any false sharing) and hence performs optimally with R-NUCA.
Benchmarks such as RADIX, FFT, LU-C, OCEAN-C, FLUIDANIMATE and CONCOMP do not benefit from replication, so the baseline R-NUCA performs optimally. R-NUCA does better than S-NUCA because these benchmarks have significant accesses to thread-private data. ASR, being built on top of S-NUCA, shows the same trends as S-NUCA. VR, on the other hand, shows higher LLC energy for the same reasons outlined earlier. VR's replication of private data in its local LLC slice is also not as effective as R-NUCA's policy of placing private data locally, especially in OCEAN-C, FLUIDANIMATE and CONCOMP, whose working sets do not fit in the LLC.
The locality-aware protocol benefits from the optimizations in R-NUCA and tracks its performance and energy consumption. For the locality-aware protocol, an RT of 3 dominates an RT of 1 in FLUIDANIMATE because that benchmark exhibits significant off-chip miss rates (as evident from its energy and completion time breakdowns); hence, it is essential to balance on-chip locality against off-chip miss rate to achieve the best energy consumption and performance. While an RT of 1 replicates on every L1 cache miss, an RT of 3 replicates only if a reuse ≥ 3 is demonstrated. Using an RT of 3 reduces the off-chip miss rate in FLUIDANIMATE and provides the best performance and energy consumption. Using an RT of 3 also provides the maximum benefit in benchmarks such as OCEAN-C and OCEAN-NC.
As RT increases, the off-chip miss rate decreases but the LLC hit latency increases. For example, with an RT of 8, STREAMCLUSTER shows an increased completion time and network energy caused by repeated fetches of the cache line over the network. An RT of 3 would bring the cache line into the local LLC slice sooner, avoiding the unnecessary network traffic and its performance and energy impact. This is evident from the smaller "L1-To-LLC-Home" component of the completion time breakdown graph and the higher number of replica hits when using an RT of 3. We explored all values of RT between 1 and 8 and found that they provide no additional insight beyond the data points discussed here.
LU-NC exhibits migratory shared data. Such data exhibits exclusive use (both read and write accesses) by a unique core over a period of time before being handed to its next accessor. Replication of migratory shared data requires creation of a replica in an Exclusive coherence state. The locality-aware protocol makes LLC replicas for such data when sufficient reuse is detected. Since ASR does not replicate shared read-write data, it cannot show benefit for benchmarks with migratory shared data. VR, on the other hand, (almost) blindly replicates on all L1 evictions and performs on par with the locality-aware protocol for LU-NC.
To summarize, the locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance the on-chip data locality and off-chip miss rate, and overall, an RT of 3 achieves the best trade-off. It is also important to replicate all types of data: the selective replication of only certain types of data by R-NUCA (instructions) and ASR (instructions, shared read-only data) leads to sub-optimal energy and performance. Overall, the locality-aware protocol has 16%, 14%, 13% and 21% lower energy and 4%, 9%, 6% and 13% lower completion time compared to VR, ASR, R-NUCA and S-NUCA respectively.
4.2 LLC Replacement Policy
As discussed earlier in Section 2.2.4, we propose to use a modified-LRU replacement policy for the LLC. It first selects cache lines with the least number of sharers and then chooses the least recently used among them. This replacement policy improves energy consumption over the traditional LRU policy by 15% and 5%, and lowers completion time by 5% and 2%, in the BLACKSCHOLES and FACESIM benchmarks respectively. In all other benchmarks this replacement policy tracks the LRU policy.
4.3 Limited Locality Classifier
Figure 9 plots the energy and completion time of the benchmarks with the Limited_k classifier as k is varied over (1, 3, 5, 7, 64); k = 64 corresponds to the Complete classifier. The results are normalized to those of the Complete classifier. The benchmarks that are not shown behave identically to DEDUP, i.e., their completion time and energy stay constant as k varies. The experiments are run with the best RT value of 3 obtained in Section 4.1. We observe that the completion time and energy of the Limited_3 classifier never exceed those of the Complete classifier by more than 2%, except for STREAMCLUSTER.
With STREAMCLUSTER, the Limited_3 classifier starts off new sharers incorrectly in non-replica mode because of the limited number of cores available for taking the majority vote. This results in increased communication between the L1 cache and the LLC home location, leading to higher completion time and network energy.
Figure 9. Energy and completion time for the Limited_k classifier as a function of the number of tracked sharers (k = 1, 3, 5, 7, 64). The results are normalized to those of the Complete (= Limited_64) classifier.
The Limited_5 classifier, however, performs as well as the Complete classifier, but incurs an additional 9 KB storage overhead per core when compared to the Limited_3 classifier. From the previous section, we observe that the Limited_3 classifier performs better than all the other baselines for STREAMCLUSTER. Hence, to trade off the storage overhead of our classifier against the energy and completion time improvements, we chose k = 3 as the default for the limited classifier.
The Limited_1 classifier is more unstable than the other classifiers. While it performs better than the Complete classifier for LU-NC, it performs worse for the BARNES and STREAMCLUSTER benchmarks. The better energy consumption in LU-NC is due to the fact that the Limited_1 classifier starts off new sharers in replica mode as soon as the first sharer acquires replica status. On the other hand, the Complete classifier has to learn the mode independently for each sharer, leading to a longer training period.

4.4 Cluster Size Sensitivity Analysis
Figure 10 plots the energy and completion time for the locality-aware protocol when run using different cluster sizes. The experiment is run with the optimal RT of 3. Using a cluster size of 1 proved to be optimal, for several reasons.
In benchmarks such as BARNES, STREAMCLUSTER and BODYTRACK, where the working set fits within the LLC even with replication, moving from a cluster size of 1 to 64 reduced data locality without improving the LLC miss rate, thereby hurting energy and performance.

Benchmarks like RAYTRACE that contain a significant amount of read-only data with low degrees of sharing also do not benefit, since employing a cluster-based approach reduces data locality without improving the LLC miss rate. A cluster-based approach can be useful for exploring the trade-off between LLC data locality and miss rate only if the data is shared by most of the cores within a cluster.
Figure 10. Energy and completion time at cluster sizes of 1, 4, 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions.
In benchmarks such as RADIX and FLUIDANIMATE that show no usefulness for replication, the locality-aware protocol bypasses all replication mechanisms; hence, employing larger cluster sizes would not be any more useful than a smaller cluster size. Intelligently deciding which cache lines to replicate using an RT of 3 was enough to prevent any overheads of replication.
The above reasons, along with the added coherence complexity of clustering (as discussed in Section 2.3.4), motivate using a cluster size of 1, at least in the 64-core multicore target that we evaluate.
5 Related Work

CMP-NuRAPID [9] uses the idea of controlled replication to place data so as to optimize the distance to the LLC bank holding the data. The idea is to decouple the tag and data arrays and maintain private per-core tag arrays and a shared data array. The shared data array is divided into multiple banks based on distance from each core. A cache line is replicated in the cache bank closest to the requesting core on its second access, the second access being detected using the entry in the tag array. This scheme uses controlled replication to force just one cache line copy for read-write shared data, and forces write-through to maintain coherence for such lines using an additional Communication (C) coherence state. The decision to replicate read-only shared data on the second access is also limited since it does not take into account the LLC pressure due to replication.

CMP-NuRAPID does not scale with the number of cores since each private per-core tag array potentially has to store pointers to the entire data array. Results reported in [9] indicate that the private tag array size should only be twice the size of the per-cache-bank tag array, but this is because only a 4-core CMP is evaluated. In addition, CMP-NuRAPID requires snooping coherence to broadcast invalidate replicas, and complicates the coherence protocol significantly by introducing many additional race conditions that arise from the tag and data array decoupling.

Reactive-NUCA [13] replicates instructions in one LLC slice per cluster of cores. Shared data is never replicated and is always placed at a single LLC slice. The one-size-fits-all approach to handling instructions does not work for applications with heterogeneous instructions. In addition, our evaluation has shown significant opportunities for improvement through replication of shared read-only and shared read-mostly data.
Victim Replication (VR) [30] starts out with a private-L1, shared-L2 (LLC) organization and uses the requester's local LLC slice as a victim cache for data that is evicted from the L1 cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself, or has no sharers in the L1 cache. VR attempts to combine the low hit latency of a private LLC with the low off-chip miss rate of a shared LLC. The main drawback of this approach is that replicas are created without paying attention to LLC cache pressure, since a replacement candidate with no sharers exists in the LLC with high probability. Furthermore, the static policy of blindly replicating private and shared data in the local LLC slice (without paying attention to reuse) can lead to increased LLC misses.
Adaptive Selective Replication (ASR) [4] also replicates cache lines in the requester's local LLC slice on an L1 eviction. However, it only allows LLC replication for cache lines that are classified as shared read-only using an additional per-cache-line shared bit. ASR pays attention to LLC pressure by basing its replication decision on a probability that is obtained dynamically. The probability value is picked from discrete replication levels on a per-core basis; a higher replication level indicates that L1 eviction victims are replicated with a higher probability. The replication levels are decided using hardware monitoring circuits that quantify replication effectiveness based on the benefit (lower L2 hit latency) and cost (higher L2 miss latency) of replication.
A drawback of ASR is that the replication decision is based on coarse-grain information. The per-core counters do not capture the replication usefulness of cache lines with time-varying degrees of reuse. Furthermore, the restriction to replicating shared read-only data is limiting because it does not exploit reuse for other types of data. ASR assumes an 8-core processor, and its LLC lookup mechanism, which needs to search all LLC slices, does not scale to large core counts (although we model an efficient directory-based protocol in our evaluation).
6 Conclusion
We have proposed an intelligent locality-aware data replication scheme for the last-level cache. The locality is profiled at runtime using a low-overhead yet highly accurate in-hardware cache-line-level classifier. On a set of parallel benchmarks, our locality-aware protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes. The coherence complexity of our protocol is almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be created at the LLC slice of the requesting core. Our classifier is implemented with a 14.5KB storage overhead per 256KB LLC slice.
References
[1] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.
[2] DARPA UHPC Program (DARPA-BAA-10-37), March 2010.
[3] A. Agarwal, R. Simoni, J. L. Hennessy, and M. Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In International Symposium on Computer Architecture, 1988.
[4] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In International Symposium on Microarchitecture, MICRO 39, pages 443–454, 2006.
[5] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In IEEE Solid-State Circuits Conference, Feb. 2008.
[6] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques, 2008.
[7] S. Borkar. Thousand Core Chips: A Technology Perspective. In Design Automation Conference, 2007.
[8] L. M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput., 27(12):1112–1118, Dec. 1978.
[9] Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Optimizing Replication, Communication, and Capacity Allocation in CMPs. In International Symposium on Computer Architecture, ISCA '05, pages 357–368, 2005.
[10] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30(2), Mar. 2010.
[11] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[12] N. Eisley, L.-S. Peh, and L. Shang. In-Network Cache Coherence. In IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 321–332, 2006.
[13] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. In International Symposium on Computer Architecture, 2009.
[14] S. Iqbal, Y. Liang, and H. Grahn. ParMiBench - An Open-Source Benchmark for Embedded Multiprocessor Systems. Computer Architecture Letters, 9(2):45–48, Feb. 2010.
[15] A. Jaleel, E. Borch, M. Bhandaru, S. C. Steely Jr., and J. Emer. Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. In International Symposium on Microarchitecture, 2010.
[16] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-Threshold Voltage (NTV) Design - Opportunities and Challenges. In Design Automation Conference, 2012.
[17] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[18] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh, and N. K. Jha. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS. In International Conference on Computer Design, 2007.
[19] G. Kurian, O. Khan, and S. Devadas. The Locality-Aware Adaptive Cache Coherence Protocol. In International Symposium on Computer Architecture, ISCA '13, pages 523–534, New York, NY, USA, 2013. ACM.
[20] G. Kurian, J. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. Kimerling, and A. Agarwal. ATAC: A 1000-Core Cache-Coherent Processor with On-Chip Optical Network. In International Conference on Parallel Architectures and Compilation Techniques, 2010.
[21] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In International Symposium on Microarchitecture, 2009.
[22] M. M. K. Martin, M. D. Hill, and D. J. Sorin. Why On-Chip Cache Coherence Is Here to Stay. Commun. ACM, 55(7), 2012.
[23] J. E. Miller, H. Kasture, G. Kurian, C. G. III, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A Distributed Parallel Simulator for Multicores. In International Symposium on High-Performance Computer Architecture, 2010.
[24] D. Sanchez and C. Kozyrakis. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. In International Symposium on High-Performance Computer Architecture, 2012.
[25] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures in Computer Architecture, Morgan Claypool Publishers, 2011.
[26] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In International Symposium on Networks-on-Chip, 2012.
[27] S. Thoziyoor, J. H. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi. A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. In International Symposium on Computer Architecture, pages 51–62, 2008.
[28] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In International Symposium on Computer Architecture, 1995.
[29] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos. A Tagless Coherence Directory. In International Symposium on Microarchitecture, 2009.
[30] M. Zhang and K. Asanovic. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In International Symposium on Computer Architecture, 2005.
[31] H. Zhao, A. Shriraman, and S. Dwarkadas. SPACE: Sharing Pattern-Based Directory Coherence for Multicore Scalability. In International Conference on Parallel Architectures and Compilation Techniques, pages 135–146, 2010.
Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache
Zhe Wang†  Daniel A. Jiménez†  Cong Xu‡  Guangyu Sun♯  Yuan Xie‡§
†Texas A&M University  ‡Pennsylvania State University  ♯Peking University  §AMD Research
†{zhewang, djimenez}@cse.tamu.edu  ‡{czx102, yuanxie}@cse.psu.edu  ♯[email protected]
Abstract
Emerging Non-Volatile Memories (NVM) such as Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) have been explored as potential alternatives for traditional SRAM-based Last-Level Caches (LLCs) due to the benefits of higher density and lower leakage power. However, NVM technologies have long latency and high energy overhead associated with write operations. Consequently, a hybrid STT-RAM and SRAM based LLC architecture has been proposed in the hope of exploiting the high density and low leakage power of STT-RAM and the low write overhead of SRAM. Such a hybrid cache design relies on an intelligent block placement policy that makes good use of the characteristics of both STT-RAM and SRAM technology.
In this paper, we propose an adaptive block placement and migration policy (APM) for hybrid caches. LLC write accesses are categorized into three classes: prefetch-write, demand-write, and core-write. Our proposed technique places a block into either STT-RAM lines or SRAM lines by adapting to the access pattern of each class. An access pattern predictor is proposed to direct block placement and migration, which can benefit from the high density and low leakage power of STT-RAM lines as well as the low write overhead of SRAM lines.
Our evaluation shows that the technique can improve performance and reduce LLC power consumption compared to both SRAM-based and STT-RAM-based LLCs with the same area footprint. It outperforms the SRAM-based LLC on average by 8.0% for single-thread workloads and 20.5% for multi-core workloads. The technique reduces power consumption in the LLC by 18.9% and 19.3% for single-thread and multi-core workloads, respectively.
1 Introduction
Emerging non-volatile memory (NVM) technologies, such as Phase-Change RAM (PCRAM), Resistive RAM (ReRAM), and Spin-Torque Transfer RAM (STT-RAM), are potential alternatives for future memory architecture design due to their higher density, better scalability, and lower leakage power. With all the NVM technology advances in recent years, it is anticipated that NVM technologies will move closer to market in the near future.
Among these three memory technologies, PCRAM and ReRAM have slower performance and very limited lifetime (usually in the range of 10^8-10^10 writes) and are therefore suitable only as storage-class memory or at most as main memory. STT-RAM, on the other hand, has better read/write latency and better lifetime reliability, so it can potentially be used as an SRAM replacement in last-level cache (LLC) design.
However, the latency and energy of write operations to STT-RAM are significantly higher than those of read operations. The long write latency can degrade performance by causing large write-induced interference to subsequent read requests, and high write energy can increase the power consumption of the LLC.
As both STT-RAM and SRAM have strengths and weaknesses, hybrid cache designs [25, 9, 3, 2, 26] have been proposed in the hope of benefiting from the high density and low leakage power of STT-RAM as well as the low write overhead of SRAM. In a hybrid cache, each set consists of a large portion of STT-RAM lines and a small portion of SRAM lines. Hybrid cache design depends on an intelligent block placement policy that achieves high performance by making use of the large capacity of STT-RAM while maintaining low write overhead using SRAM.
Recent work [25, 9, 3, 2, 26] proposes block placement and migration policies that map write-intensive blocks to SRAM lines. These prior proposals reduce the average number of write operations to STT-RAM lines. However, they incur frequent block migration [9, 3], require the compiler to provide static hints [3], mitigate write overhead only for a subset of write accesses [25], or trade off performance for power or endurance [9, 2, 26].
This paper describes a low cost adaptive block placement policy for a hybrid LLC called APM (Adaptive Placement and Migration). The technique places a cache block in the proper position according to its access pattern. LLC write
accesses are categorized into three distinct classes: core-write, prefetch-write, and demand-write. A core-write is a write from the core. For a write-through core cache, it is the write directly from the core writing through to the LLC; for a write-back core cache, it is dirty data evicted from the core cache and written back to the LLC. A prefetch-write is a write from an LLC replacement caused by a prefetch miss. A demand-write is a write from an LLC replacement caused by a demand miss. The proposed technique is based on the observation that block placement is often initiated by a write access, and the characteristics of each class of write enable each to be adapted to a different block placement policy. A low overhead and low complexity pattern predictor is used to predict the access pattern of each class of accesses for directing block placement.
Our contributions can be summarized as follows: (1) We analyze the LLC access pattern for the distinct write access types: prefetch-write, demand-write and core-write. We propose that each class should be adapted to a different block placement and migration policy for a hybrid LLC, and we design a low overhead access pattern predictor that predicts write burst blocks and dead blocks in the LLC, which can be used to direct block placement and migration. (2) We propose an intelligent cache placement and migration policy for a hybrid STT-RAM and SRAM based LLC. The technique optimizes block placement by adapting to the access pattern of each class of write accesses. (3) We show the proposed technique can improve system performance and reduce the power consumption of a hybrid LLC compared to both SRAM-based and STT-RAM-based LLCs with the same area. It outperforms an SRAM-based LLC on average by 8.0% for single-thread workloads and by 20.5% for multi-core workloads, and reduces the power consumption of the LLC by 18.9% and 19.3% respectively for single-thread and multi-core workloads compared with the baseline.
2 Background
2.1 Comparison of STT-RAM and SRAM Caches
Compared to SRAM, STT-RAM caches have higher density and lower leakage power, but higher write latency and write power consumption. Table 1 lists the technology features of the various STT-RAM and SRAM cache bank sizes used in our evaluation. The technology parameters are generated by NVSim [5], a performance, energy, and area model based on CACTI [8]. The cell parameters we used in NVSim are based on the projection from Wang et al. [28]. We assume a 22nm × 45nm MTJ built with 22nm CMOS technology. The SRAM cell parameters are estimated using CACTI [8].
The density of STT-RAM is currently 3×-4× higher than that of SRAM. Another benefit of STT-RAM is its low leakage power. Leakage power can dominate the total power consumption of large SRAM-based LLCs [15]; thus, the low leakage power consumption of STT-RAM makes it suitable for a large LLC. The disadvantages of STT-RAM are its long write latency and high write energy.
Figure 1. Distribution of LLC write accesses. Each type of write access accounts for a significant fraction of total write accesses.
Figure 2. An example illustrating read range and depth range:
    Ra, Ra, Rb, Rc, Rd, Ra, Wa, Rc, Ra, Wa, Rf, Rb, Rc, Rd, Re, Rm, Rn, Rs
    RR of block a: 4
    DR of the first Wa: 2
    DR of the second Wa: 0
2.2 Hybrid Cache Structure
The hybrid cache structure is composed of STT-RAM banks and SRAM banks. Each cache set consists of a large portion of STT-RAM cache lines and a small portion of SRAM cache lines distributed among multiple banks. The hybrid cache architecture relies on an intelligent block placement policy to bridge the performance and power gaps between STT-RAM and SRAM.
An intelligent block placement policy for hybrid cache design should be optimized for three requirements. First, the SRAM portion should service as many write requests as possible, thus minimizing the write overhead of the STT-RAM portion. However, sending many write operations to SRAM without considering the access pattern can cause misses due to the small capacity of SRAM, leading to performance degradation. Thus, the second requirement is that reused cache blocks should be placed in the LLC to maintain performance by hiding the memory access latency. Finally, the block placement policy should be a low overhead and low complexity design that does not incur frequent migration between cache lines.
3 Analysis of LLC Write Access Patterns
LLC block placement is often initiated by an LLC write access, which can be categorized into three classes: prefetch-write, core-write and demand-write. Figure 1 shows the breakdown of each class of LLC write accesses for 17 memory intensive SPEC CPU2006 [7] benchmarks. The study is performed using the MARSSx86 [19] simulator with a single-core configuration and a 4MB LLC. We implement a middle-of-the-road stream prefetcher that models the prefetcher of the Intel Core i7 [1]. From Figure 1, we can see that each type of write access accounts for a significant fraction of total write accesses. Prefetch-writes account for 21.9% on average, while core-writes and demand-writes account for 45.6% and 32.5%, respectively. In this section, we analyze the access pattern of each write access type and suggest a block placement policy that adapts to the access pattern of each class.

Memory Type               1M SRAM   2M SRAM   2M STT-RAM   4M STT-RAM
Area (mm2)                0.825     1.650     0.518        1.035
Read Latency (ns)         1.751     2.017     2.681        2.759
Write Latency (ns)        1.530     1.663     10.954       10.993
Read Energy (nJ/access)   0.055     0.072     0.132        0.142
Write Energy (nJ/access)  0.039     0.056     0.608        0.618
Leakage Power (mW)        29.798    59.596    7.108        14.216
Table 1. Characteristics of SRAM and STT-RAM caches (22nm, temperature=350K)
We first define the terminology that will be used later in this section for pattern analysis. To be clear, when we write "block" we mean a block of data apart from its physical realization; when we write "line" we mean the physical frame, whether in STT-RAM or SRAM, where a block may reside.
• Read-range: The read-range is a property of a cache block that fills the LLC by a demand-write or prefetch-write request. It is the largest interval between consecutive reads of the block from the time it is filled into the LLC until the time it is evicted.
• Depth-range: The depth-range is a property of a core-write access. It is the largest interval between accesses to the block from the current core-write access until the next core-write access to the same block. The "depth" refers to how deep the block descends into the LRU recency stack before it is accessed again.
We use an example to illustrate the read range and depth range. Figure 2 shows the behavior of a block a from the time it fills into an 8-way set until it is evicted. In the example, Ra represents "read block a" while Wa represents "write block a". The largest re-read interval of block a during the time it resides in the cache is 4, which is the read interval between the second Ra and the third Ra; thus, the read range (RR) of block a is 4. The depth range (DR) of the first Wa access is 2, which is the access interval between the first Wa access and the fourth Ra access. The depth range of the second Wa access is 0, meaning a is not re-written from the second Wa until it is evicted from the LLC. We further classify the read/depth-range into three types: zero-read/depth-range, immediate-read/depth-range and distant-read/depth-range (a short sketch computing these ranges follows the list below).
• Zero-read/depth-range: Data is filled into the LLC by a prefetch or demand request/core-write request, and it is never read/written again before it is evicted.
• Immediate-read/depth-range: The read/depth-range is smaller than a parameter m. We set m = 2, which is the same as the number of SRAM ways in our hybrid cache configuration in Section 5.
• Distant-read/depth-range: The read/depth-range is larger than m = 2 and at most the associativity of the cache set, which is 16 in our configuration.
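To make the definitions concrete, the following Python sketch (ours, not the paper's; the function names and trace encoding are illustrative) computes the read range and depth range over an access trace and reproduces the Figure 2 values.

    # Hypothetical sketch: compute read-range (RR) and depth-range (DR)
    # over a trace of (op, block) pairs, where op is 'R' or 'W'.

    def read_range(trace, block):
        """Largest interval between consecutive reads of `block`."""
        reads = [i for i, (op, b) in enumerate(trace) if op == 'R' and b == block]
        return max((j - i for i, j in zip(reads, reads[1:])), default=0)

    def depth_range(trace, write_pos):
        """Largest access interval from the core-write at `write_pos` to the
        next core-write of the same block; 0 if the block is never re-written."""
        _, block = trace[write_pos]
        last, dr = write_pos, 0
        for i in range(write_pos + 1, len(trace)):
            op, b = trace[i]
            if b != block:
                continue
            dr = max(dr, i - last)
            last = i
            if op == 'W':
                return dr
        return 0

    def classify(r, m=2):
        """Zero/immediate/distant classification (boundary r == m read as distant)."""
        return 'zero' if r == 0 else ('immediate' if r < m else 'distant')

    trace = [('R','a'), ('R','a'), ('R','b'), ('R','c'), ('R','d'), ('R','a'),
             ('W','a'), ('R','c'), ('R','a'), ('W','a'), ('R','f'), ('R','b'),
             ('R','c'), ('R','d'), ('R','e'), ('R','m'), ('R','n'), ('R','s')]
    assert read_range(trace, 'a') == 4       # RR of block a
    assert depth_range(trace, 6) == 2        # DR of the first Wa
    assert depth_range(trace, 9) == 0        # DR of the second Wa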
3.1 Pattern Analysis of Prefetch-Write Blocks
Prefetching data into the cache before it is accessed can improve performance by hiding memory access latency. However, prefetching can also induce cache pollution through inaccurate prefetch requests.
We analyze the access pattern of LLC prefetch-write blocks using read-range. The first bar in Figure 3 shows each type of access pattern as a fraction of the total number of prefetch-write blocks. Zero-read-range prefetch-write blocks are inaccurately prefetched blocks, accounting for 26% of all prefetch blocks. Placing zero-read-range prefetch-write blocks into STT-RAM lines causes pollution and high write overhead; thus, zero-read-range prefetch-write blocks should be placed in SRAM lines.
The immediate-read-range access pattern is related to cache bursts. After an initial burst of references, a cache block becomes dead, i.e., it is never used again prior to eviction. Of all prefetch-write blocks, 56.9% are immediate-read-range blocks. Immediate-read-range prefetch-write blocks should be placed in SRAM lines so they can be accessed by subsequent demand requests without incurring write operations to STT-RAM lines. Moreover, placing immediate-read-range prefetch-write blocks in SRAM allows them to be evicted when they are dead, reducing pollution.
Distant-read-range prefetch-write blocks should be placed in STT-RAM lines to make use of the large capacity and avoid cache misses. Such blocks account for only 17.5% of all prefetch blocks.
Zero-read-range and immediate-read-range prefetch-write blocks together account for 82.5% of all prefetch-write blocks.
Figure 3. The distribution of access patterns for each type of LLC write access (zero-, immediate- and distant-read/depth-range; 1st bar: prefetch-write, 2nd bar: core-write, 3rd bar: demand-write).
Thus, we suggest the initial placement of prefetch-write blocks be in SRAM. Once a block is evicted from the SRAM, if it is a distant-read-range block, i.e., it is still live, it should be migrated to the STT-RAM lines; otherwise, the block is dead and should be evicted from the LLC. Since only 17.5% of prefetch blocks are distant-read-range blocks, the migration will not cause significantly increased traffic.
3.2 Pattern Analysis of Core-Write Accesses
In our design, if a core-write access misses in the LLC, the data is written back to main memory directly. Thus, our core-write placement policy is only designed for core-write hit accesses.
We analyze the access pattern of LLC core-write accesses using depth-range. The second bar in Figure 3 shows the access pattern for core-write accesses. Zero-depth-range accesses account for 49.1% of all core-write accesses. Though the data written by a zero-depth-range core-write access will not be rewritten before it is evicted, it still has some chance of being read again; thus, we suggest leaving zero-depth-range data in its original cache line to avoid read misses and block migrations.
Immediate-depth-range accesses account for 32.9% of total core-write accesses. These are the write-intensive accesses with write burst behavior, so immediate-depth-range data should be placed in SRAM lines for low write overhead. Distant-depth-range data should remain in its original cache line, thus minimizing migration overhead.
3.3 Pattern Analysis of Demand-Write Blocks
The access pattern of demand-write blocks is analyzed using read-range. Zero-read-range demand-write blocks, also known in the literature as "dead-on-arrival" blocks, are brought to the LLC by a demand request and never referenced again before being evicted. It is unnecessary to place a zero-read-range demand-write block into the LLC, so the block should bypass the cache (assuming a non-inclusive cache). The third bar in Figure 3 shows that dead-on-arrival blocks account for 61.2% of LLC demand-write blocks. Thus, bypassing dead-on-arrival blocks can significantly reduce write operations to the LLC. Moreover, bypassing can improve cache efficiency by allowing the LLC to save space for other useful blocks.
The immediate-read-range and distant-read-range demand-write blocks account for 38.8% of the total demand-write blocks. We suggest placing them in the STT-RAM ways to make use of the large capacity of the STT-RAM portion and reduce pressure on the SRAM portion.
3.4 Pattern Analysis Conclusions
Each class of LLC write access can be given a different placement policy, and the access pattern of each class can be used to guide that policy. From the analysis of the access pattern of each access class, we draw the following conclusions: (1) The initial placement of prefetch-write blocks should be in SRAM lines. (2) Write-burst core-write data should be placed in SRAM lines, while other types of core-write data should remain in their original cache lines. (3) Dead-on-arrival demand-write blocks should bypass the LLC, while the other types of demand-write blocks should be placed in STT-RAM lines. (4) When a block is evicted from SRAM, if it is live, it should be migrated to STT-RAM lines to avoid LLC misses.
4 Adaptive Block Placement and Migration Policy
The design of the block placement and migration policy is guided by the access pattern of each write access type. An access pattern predictor is proposed to predict write-burst blocks and dead blocks. The information provided by the access pattern predictor is used to direct bypassing and migration of blocks between STT-RAM lines and SRAM lines. The policy targets reducing write overhead by allowing the SRAM portion to service as many write requests as possible, and attains high performance by benefiting from the high density of the STT-RAM portion.
4.1 Policy Design
Figure 4 shows a flow-chart of the technique. In our design, a prediction bit is associated with each block in the LLC, representing whether the block is predicted dead. An access to the LLC searches all the STT-RAM lines and SRAM lines in the set. On a prefetch miss, the prefetched data is placed into an SRAM line and the prediction bit is set to 1, meaning we assume the prefetched block is dead on arrival. For every demand hit request in the SRAM lines, the access pattern predictor makes a prediction about whether the block is dead. Once the block is evicted from the SRAM lines, if it is predicted dead, it is evicted from the LLC; otherwise, it is migrated to the STT-RAM lines. In this way, if a prefetched block is never accessed before it is evicted from the SRAM lines, it is taken as an inaccurately prefetched block and evicted, based on the observation that accurately prefetched blocks are usually accessed soon by subsequent demand requests.
Figure 4. Flow-chart of the adaptive block placement and migration mechanism.
On a core-write request to the LLC, if it is an LLC miss, the data is written to main memory directly. For a core-write hit request in the STT-RAM lines, if it is predicted to be a write burst access, the data is written to the SRAM lines with the prediction bit set to 0, indicating the block is predicted live; the prior copy in the STT-RAM line is invalidated. Then, as with prefetched blocks, once the data is evicted from the SRAM portion, a predicted-dead block is evicted from the LLC while a live block is migrated to the STT-RAM portion.
On a demand miss to the LLC, the data is fetched from main memory. If the fetched block is predicted to be a dead-on-arrival block, it bypasses the LLC; otherwise, the block is placed in the STT-RAM lines.
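The flow above can be condensed into a short, runnable Python sketch (our reconstruction under simplifying assumptions: a single set, LRU within each portion, and predictor outcomes passed in as booleans; all names are illustrative, not the paper's):

    from collections import OrderedDict

    class HybridSet:
        """One hybrid cache set: an STT-RAM portion and an SRAM portion, each LRU."""
        def __init__(self, sttram_ways=16, sram_ways=2):
            self.sttram = OrderedDict()            # tag -> True (no per-line state needed)
            self.sram = OrderedDict()              # tag -> dead-prediction bit
            self.sttram_ways, self.sram_ways = sttram_ways, sram_ways

        def fill_sttram(self, tag):
            if len(self.sttram) >= self.sttram_ways:
                self.sttram.popitem(last=False)    # plain LRU eviction from STT-RAM
            self.sttram[tag] = True

        def fill_sram(self, tag, dead):
            if len(self.sram) >= self.sram_ways:
                victim, victim_dead = self.sram.popitem(last=False)
                if not victim_dead:                # live SRAM victim: migrate to STT-RAM
                    self.fill_sttram(victim)
            self.sram[tag] = dead

    def llc_access(cset, kind, tag, pred_dead=False, pred_burst=False):
        """kind is 'prefetch', 'demand' or 'core_write'; pred_* are predictor outputs."""
        if kind == 'prefetch':
            if tag not in cset.sram and tag not in cset.sttram:
                cset.fill_sram(tag, dead=True)     # assume prefetches are dead on arrival
        elif kind == 'demand':
            if tag in cset.sram:
                cset.sram[tag] = pred_dead         # re-predict on every SRAM demand hit
                cset.sram.move_to_end(tag)
            elif tag in cset.sttram:
                cset.sttram.move_to_end(tag)
            elif not pred_dead:
                cset.fill_sttram(tag)              # dead-on-arrival blocks bypass the LLC
        elif kind == 'core_write':
            if tag in cset.sttram and pred_burst:
                del cset.sttram[tag]               # invalidate the STT-RAM copy and
                cset.fill_sram(tag, dead=False)    # service the write burst from SRAM
            elif tag in cset.sram:
                cset.sram.move_to_end(tag)
            # core-write miss: written through to main memory (not modeled here)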
Minimizing Write Overhead  The proposed scheme reduces write operations to the STT-RAM portion in the following ways. First, bypassing reduces write operations to STT-RAM lines caused by dead-on-arrival requests. Second, SRAM lines filter the write operations caused by inaccurate and immediate-read-range prefetch requests. Finally, core-write-intensive blocks are placed in the SRAM lines, reducing the write operations to STT-RAM lines caused by write burst behavior.
Figure 5. System structure (sampled core cache misses, core-write requests and prefetch requests drive a pattern simulator that updates the prediction table, which supplies data access predictions to the LLC).
Attaining High Performance  The block placement policy attains high performance for the hybrid cache by benefiting from the high density of the STT-RAM portion. Specifically, distant-read-range blocks are placed in STT-RAM lines, which reduces cache misses by making use of the large capacity of STT-RAM. Also, bypassing zero-read-range demand blocks saves space in the LLC for other useful data. Moreover, filtering zero-read-range prefetch blocks and immediate-read-range blocks using SRAM lines can improve cache efficiency by reducing inaccurately prefetched blocks and dead blocks in the LLC.
The technique relies on an access pattern predictor to direct block bypassing and migration.
4.2 Access Pattern Predictor
The goal of the access pattern predictor is to predict dead blocks and write burst blocks for guiding block placement. The predicted dead blocks are used to direct bypassing and block migration from SRAM lines to STT-RAM lines; write burst predictions are used to guide block migration from STT-RAM lines to SRAM lines for core-write accesses. The access pattern predictor consists of a prediction table and a pattern simulator, as shown in Figure 5. The prediction table is composed of a dead block prediction table and a write burst prediction table, which have the same structure but make predictions for different types of accesses. The pattern simulator is used to learn the access patterns for updating the prediction table.
Figure 6. An example illustrating the set behavior of the pattern simulator (DEC/INC DCNT PCR*: decrease/increase the counter in the dead block prediction table indexed by PCR*; DEC/INC WCNT PCW*: decrease/increase the counter in the write burst prediction table indexed by PCW*; I: invalid; CW: core-write access hit).
The design of the access pattern predictor is inspired by the sampling dead block predictor (SDBP) [13]. However, the SDBP predicts dead blocks taking into account only demand accesses, while the access pattern predictor predicts both dead blocks and write-intensive blocks by considering all types of LLC accesses. The predictor predicts the access pattern using the Program Counter (PC) of memory access instructions. The intuition is that cache access patterns can be classified based on the instructions making the memory accesses: if a given memory access instruction PC led to some access pattern in previous accesses, then the future access pattern of the same PC will be similar.
4.2.1 Making Predictions
The access pattern predictor makes a prediction under the following three conditions: (1) When a core-write request hits in the STT-RAM lines, the write burst prediction table is accessed to predict whether it is a write burst request. (2) For each read hit request in the SRAM lines, the dead block prediction table is accessed to predict whether the block is dead. (3) On a demand-write request, the dead block prediction table is accessed to predict whether it is a dead-on-arrival request.
The dead block prediction and write burst prediction tables have the same structure. Each entry in a prediction table has a two-bit saturating counter. When making a prediction, bits from the related memory access instruction PC are hashed to index a prediction table entry, and the prediction result is generated by thresholding the counter value in that entry. In our implementation, we use a skewed organization [29, 13, 18] to reduce the impact of conflicts in the hash table.
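The following sketch (ours; the number of skewed banks, the hash functions and the threshold are assumptions, not the paper's exact parameters) shows one way to realize such a skewed, PC-indexed table of two-bit saturating counters.

    class SkewedPredictionTable:
        """PC-indexed prediction table with 2-bit saturating counters,
        replicated across a few differently-hashed (skewed) banks."""
        def __init__(self, num_tables=3, entries=4096, threshold=2):
            self.tables = [[0] * entries for _ in range(num_tables)]
            self.entries = entries
            self.threshold = threshold

        def _indices(self, pc):
            # One skewed hash per bank to decorrelate conflicts (constants are arbitrary).
            return [((pc >> (4 * k)) ^ ((pc * 0x9E3779B1) >> (7 + 4 * k))) % self.entries
                    for k in range(len(self.tables))]

        def predict(self, pc):
            # Sum the per-bank counters and threshold the total.
            votes = sum(t[i] for t, i in zip(self.tables, self._indices(pc)))
            return votes >= self.threshold * len(self.tables)

        def update(self, pc, outcome):
            # Saturating update: count up on a positive outcome, down otherwise.
            for t, i in zip(self.tables, self._indices(pc)):
                t[i] = min(3, t[i] + 1) if outcome else max(0, t[i] - 1)

Two instances of such a table, one trained on dead/live outcomes and one on write-burst outcomes, would stand in for the dead block and write burst prediction tables.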
4.2.2 Updating the Predictor
The pattern simulator samples LLC sets and simulates the access behavior of the LLC using the sampled sets. It updates the prediction table by learning the access patterns of PCs from the simulated LLC access behavior, targeting the dead/live behavior and write burst behavior of LLC blocks.
Each simulated set in the pattern simulator has a tag field, a read PC field, an LRU field, a write PC field, a write LRU field and a valid bit field. The partial tag and partial read/write PC, which are the lower 16 bits of the full tag and full read/write PC, are stored in the tag field and read/write PC fields. The read PC field is for learning the dead/live behavior of a block, while the write PC field is for learning its write-burst behavior.
A write burst occurs within a small number of cache lines, so the write PC field can have a small associativity. The pattern simulator thus consists of two parts: the tag array with its related read PC field, LRU field and valid bit field, which share the same associativity, and the write PC field with its write LRU field, which have a smaller associativity. In our implementation, we found that 4-way associativity for the write PC field and 12-way associativity for the read PC field yield the best performance, while the associativity of the LLC is 18.
The behavior of a pattern simulator set is illustrated in Figure 6 using an example access pattern. In the example, the associativity of the tag and read PC fields is 4 while the associativity of the write PC field is 2. On each demand hit request, the pattern simulator decrements the counter in the dead block prediction table entry indexed by the related read PC, indicating "not dead." When a block is evicted from the simulator set, the pattern simulator increments the counter in the dead block prediction table entry indexed by the related read PC, indicating "dead." The LRU recency is updated on every demand request and prefetch-write request. The write PC field learns the write-burst behavior of core-write requests: on each core-write hit request, the simulator increments the counter in the write burst prediction table entry indexed by the related write PC, indicating "write burst," and when a block is evicted from the write field, the simulator decrements the counter in the write burst prediction table entry indexed by the related write PC, indicating "not write burst."
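A compact sketch of this training loop, reusing the SkewedPredictionTable sketched above (again our reconstruction; the field widths follow the text, the rest is assumed):

    from collections import OrderedDict

    class SimulatorSet:
        """One sampled set: a 12-way read/tag part and a 4-way write part, each LRU."""
        def __init__(self, dead_tbl, burst_tbl, read_ways=12, write_ways=4):
            self.read = OrderedDict()    # partial tag -> partial PC of last read
            self.write = OrderedDict()   # partial tag -> partial PC of last core-write
            self.read_ways, self.write_ways = read_ways, write_ways
            self.dead_tbl, self.burst_tbl = dead_tbl, burst_tbl

        def demand_access(self, tag, pc):
            if tag in self.read:
                # Block reused: train "not dead" on the PC that last touched it.
                self.dead_tbl.update(self.read[tag], outcome=False)
                self.read.move_to_end(tag)
            elif len(self.read) >= self.read_ways:
                # LRU eviction from the simulated set: train "dead" on its last PC.
                _, victim_pc = self.read.popitem(last=False)
                self.dead_tbl.update(victim_pc, outcome=True)
            self.read[tag] = pc & 0xFFFF             # keep only a 16-bit partial PC

        def core_write_hit(self, tag, pc):
            if tag in self.write:
                # Re-written while still in the small write field: a write burst.
                self.burst_tbl.update(self.write[tag], outcome=True)
                self.write.move_to_end(tag)
            elif len(self.write) >= self.write_ways:
                # Aged out of the write field without a re-write: not a burst.
                _, victim_pc = self.write.popitem(last=False)
                self.burst_tbl.update(victim_pc, outcome=False)
            self.write[tag] = pc & 0xFFFF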
5 Evaluation
5.1 Methodology

Execution core   4.0GHz, 1-core/4-core CMP, out of order, 128-entry reorder buffer,
                 48-entry load queue, 44-entry store queue, 4-wide issue/decode
Caches           L1 I-cache: 64KB/2-way, private, 2-cycle, 64-byte block size, LRU;
                 L1 D-cache: 64KB/2-way, private, 2-cycle, 64-byte block size, LRU;
                 L2 cache: shared, 64-byte block size, LRU
DRAM             DDR3-1333, open-page policy, 2 channels, 8 banks/channel,
                 FR-FCFS [21] policy, 32-entry/channel write buffer,
                 drain_when_full write buffer policy
Table 2. System configuration

Name          Technique
SRAM          SRAM-based LLC
STT-RAM       STT-RAM-based LLC
OPT-STT-RAM   STT-RAM-based LLC assuming symmetric read/write overhead
Sun-Hybrid    Hybrid LLC technique as described in [25]
APM           Adaptive placement and migration based hybrid cache as described in Section 4
Table 3. Legend for the various LLC techniques

We use MARSSx86 [19], a cycle-accurate simulator for the 64-bit x86 instruction set. The DRAMSim2 [22] simulator is integrated into MARSSx86 to simulate a DDR3-1333 system. Table 2 lists the system configuration. We model a per-core LLC middle-of-the-road stream prefetcher [24, 1] with 32 streams for each core. The prefetcher looks up the stream table at each LLC request to issue eligible prefetch requests. The LLC is configured with multiple banks; requests to different banks are serviced in parallel, while requests within the same bank are pipelined. The LLC is implemented with single-port memory bitcells. We obtain the STT-RAM and SRAM parameters using NVSim [5] and CACTI [8], as shown in Table 1.
Name Benchmarks
Mix 1 milc gcc xalancbmk tonto
Mix 2 gamess soplex libquantum perlbench
Mix 3 gcc sphinx3 GemsFDTD tonto
Mix 4 lbm mcf cactusADM GemsFDTD
Mix 5 zeusmp bzip2 astar libquantum
Mix 6 mcf soplex zeusmp bwaves
Mix 7 omnetpp lbm cactusADM sphinx3
Mix 8 bwaves libquantum mcf GemsFDTD
Mix 9 omnetpp cactusADM tonto gcc
Mix 10 soplex mcf bzip2 gcc
Mix 11 perlbench sphinx3 libquantum lbm
Table 4. Multi-Core workloads
The SPEC CPU2006 [7] benchmarks are used for the evaluation. We evaluate five LLC techniques with the same area configuration. Table 3 shows the legend for these techniques as referred to in the graphs that follow. The OPT-STT-RAM technique optimistically assumes that write operations have access latency similar to read operations. The Sun-Hybrid [25] technique assumes that a cache block that has been written to the LLC twice consecutively is a write-intensive block, and migrates write-intensive blocks to SRAM lines to reduce the write operations to STT-RAM lines. The APM technique is our proposed adaptive block placement and migration based hybrid cache technique as described in Section 4. We modify MARSSx86 to support all the LLC types listed in Table 3.
Single-Core Workloads and LLC Configuration  Of the 29 SPEC CPU2006 benchmarks, 22 can be compiled and run with our infrastructure. We use all 22 of these benchmarks for evaluation, including both memory-intensive and non-memory-intensive benchmarks. For each workload, we simulate 250 million instructions from a typical phase identified by SimPoint [23].
The various LLC techniques are evaluated at the same area. A 2MB SRAM has area similar to a 6MB STT-RAM. Thus, we evaluate a 16-way SRAM with 2MB capacity and a 24-way STT-RAM/OPT-STT-RAM with 6MB capacity. The hybrid cache design for the APM technique has 16 STT-RAM lines and 2 SRAM lines in each set, and hence we evaluate a 4.5MB APM hybrid cache, which has the same area as a 2MB SRAM. In the Sun-Hybrid technique, each cache set allocates 1 SRAM line; thus we evaluate a hybrid cache with 20 STT-RAM lines and 1 SRAM line per set and a 5.25MB capacity for the Sun-Hybrid technique. We implement a 1MB SRAM cache bank and a 2MB STT-RAM cache bank for the single-core configuration, yielding the best trade-off between access latency and bank-level parallelism. If the capacity of the SRAM is smaller than 1MB, it is configured as one bank.
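As a quick sanity check of these area-matched configurations, the Table 1 bank areas can be combined under a linear-scaling assumption (our back-of-envelope arithmetic, not the paper's methodology):

    # Approximate areas from Table 1, assuming area scales linearly with capacity.
    area_sram_2mb = 1.650                    # mm^2
    area_stt_2mb, area_stt_4mb = 0.518, 1.035

    density_ratio = area_sram_2mb / area_stt_2mb
    print(f"STT-RAM density advantage: {density_ratio:.1f}x")   # ~3.2x, within the 3x-4x claim

    area_stt_6mb = area_stt_2mb + area_stt_4mb                  # 6MB STT-RAM from 2MB+4MB banks
    print(f"6MB STT-RAM: {area_stt_6mb:.2f} mm^2 vs 2MB SRAM: {area_sram_2mb:.2f} mm^2")

    # 4.5MB APM hybrid = 4MB STT-RAM portion + 0.5MB SRAM portion.
    area_apm = area_stt_4mb + area_sram_2mb / 4
    print(f"4.5MB APM hybrid: {area_apm:.2f} mm^2")             # ~1.45 mm^2, close to 2MB SRAM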
Figure 7. The distribution of write accesses to STT-RAM lines in the APM LLC for single-core applications (1st bar: all write operations; 2nd bar: write operations to STT-RAM; broken down into core-write, demand-write and prefetch-write).
Figure 8. The comparison of IPC for single-core applications, normalized to a 2MB SRAM LLC (6M-STT-RAM, 5.25M-Sun-Hybrid, 4.5M-APM and 6M-OPT-STT-RAM).
Multi-Core Workloads and LLC Configuration  We use quad-core workloads for evaluation. Table 4 shows eleven mixes of SPEC CPU2006 benchmarks with a variety of memory behaviors. For each mix, we run the experiment for 1 billion instructions in total across the four cores, starting from the typical phase, with each benchmark running simultaneously with the others. For the multi-core configuration, we evaluate a 16-way SRAM with 8MB capacity, a 24-way STT-RAM/OPT-STT-RAM with 24MB capacity, the hybrid APM technique with 16 STT-RAM ways and 2 SRAM ways and 18MB capacity, and the Sun-Hybrid technique with 20 STT-RAM ways and 1 SRAM way and 21MB capacity. In the multi-core configuration, we use a 2MB SRAM cache bank and a 4MB STT-RAM cache bank, which yield the best performance. If the capacity of the SRAM is smaller than 2MB, it is configured as one bank.
5.2 Evaluation Results for the Single-Core Configuration
5.2.1 Reduced Writes and Endurance Evaluation
The APM technique allows the SRAM lines to service as many write requests as possible, thus reducing write operations to the STT-RAM lines. Figure 7 shows the distribution of write operations to STT-RAM lines in the APM technique, normalized to all write operations to the LLC, for the single-core workloads. We can see that the APM LLC reduces write operations to STT-RAM lines for each type of write access. In the APM LLC, only 32.9% of the total LLC write requests are serviced by the STT-RAM portion, significantly reducing the write overhead of the STT-RAM portion and translating into performance improvement and power reduction in the LLC.
5.2.2 Performance Evaluation
Figure 8 shows the speedup of the various techniques compared with the baseline, a 2MB SRAM LLC. The 6MB STT-RAM LLC has area similar to the 2MB SRAM LLC. It improves performance by 6.2% on average due to the increased capacity. Most of the benchmarks benefit from the increased capacity of STT-RAM; however, several benchmarks such as gcc, milc, libquantum and lbm suffer more from the large write overhead of STT-RAM. The OPT-STT-RAM LLC assumes symmetric read/write latency, meaning the large write-induced interference to read requests is removed; it yields a geometric mean speedup of 9.3%. The performance difference between OPT-STT-RAM and STT-RAM is caused by the long write latency of STT-RAM. The 4.5MB APM LLC reduces the write overhead of the STT-RAM portion, delivering a geometric mean speedup of 8.0%. It yields even higher speedups for the benchmarks 462.libquantum, 482.sphinx3, and 483.xalancbmk than the 6MB OPT-STT-RAM LLC, because those workloads generate a large number of dead blocks or inaccurately prefetched blocks in the LLC, reducing LLC efficiency; the APM technique reduces the LLC pollution caused by dead blocks and inaccurately prefetched blocks, thus improving the LLC efficiency for those workloads. The 5.25MB Sun-Hybrid LLC improves performance by 5.0% on average.
Figure 9. The power breakdown for single-core applications, normalized to 2MB SRAM (leakage, read and write power; 1st bar: 2M-SRAM, 2nd bar: 6M-STT-RAM, 3rd bar: 5.25M-Sun-Hybrid, 4th bar: 4.5M-APM).
Figure 10. The distribution of write accesses to STT-RAM lines in the APM LLC for multi-core applications (1st bar: all write operations; 2nd bar: write operations to STT-RAM).
5.2.3 Power Evaluation
Figure 9 shows the normalized power consumption of the various techniques, broken down into leakage power, dynamic power caused by reads, and dynamic power caused by writes. The baseline is a 2MB SRAM. For the SRAM technique, leakage power dominates the total power consumption. The STT-RAM technique consumes lower leakage power; however, its dynamic power caused by writes increases significantly due to the large write energy of STT-RAM, so the STT-RAM technique increases the overall power consumption by 11.9% on average. The APM technique reduces write operations to the STT-RAM portion, thus reducing the dynamic power caused by writes; it reduces the overall power consumption by 18.9% on average compared with the baseline. The Sun-Hybrid technique does not significantly reduce the write power because it does not reduce the write operations to STT-RAM caused by LLC replacement.
Figure 11. The comparison of IPC for multi-core applications, normalized to 8MB SRAM (24M-STT-RAM, 21M-Sun-Hybrid, 18M-APM and 24M-OPT-STT-RAM).
5.2.4 Extra LLC Traffic Evaluation
The APM technique migrates blocks between SRAM lines and STT-RAM lines, and migrating blocks from SRAM lines to STT-RAM lines causes extra cache traffic. We evaluate the LLC traffic caused by migration: it adds only 3.8% extra LLC traffic. Most of the blocks evicted from the SRAM ways are dead blocks, so only a small number of distant-read-range blocks need to be migrated to STT-RAM; thus migration does not add significant traffic overhead.
5.3 Evaluation Results for the Multi-Core Configuration
5.3.1 Reduced Writes, Endurance, Performance and Power Evaluation
Figure 10 shows the distribution of write operations to STT-RAM lines, normalized to all write operations to the LLC, for the multi-core workloads. The APM technique reduces write operations to the STT-RAM portion to 28.9% of the total number of write operations on average.
Figure 11 shows the speedups of the various techniques for the multi-core workloads, normalized to an 8MB SRAM LLC. The 24MB STT-RAM LLC improves performance by 14.8% on average. Removing the write-induced interference caused by asymmetric writes improves the average performance by 18.7% in the OPT-STT-RAM LLC. The 18MB APM technique reduces the write overhead of the STT-RAM portion and achieves a geometric mean speedup of 20.5%, which is higher than that of the 24MB OPT-STT-RAM LLC. In multi-core workloads, the large number of dead or inaccurately prefetched blocks generated by one workload can also negatively affect the performance of the other workloads, significantly reducing performance. The APM technique reduces the cache pollution caused by dead blocks and inaccurately prefetched blocks. Our evaluation shows the 18MB APM LLC yields better performance and fewer misses for 5 out of the 11 multi-core workloads compared with the 24MB OPT-STT-RAM LLC.
Figure 12 shows the distribution of normalized power consumption for the various techniques. The baseline is the 8MB SRAM cache. For a large LLC, the majority of the power consumption comes from leakage power. The 24MB STT-RAM technique reduces the overall power consumption to 87.8% of the baseline due to its low leakage power. The APM technique additionally reduces the dynamic power caused by STT-RAM write operations, further reducing the power consumption to 80.7% of the baseline on average.
5.3.2 Prediction Evaluation
We evaluate the access pattern predictor using false positive rate and coverage. Mispredictions can be false positives or false negatives. False positives are more harmful for two reasons: mispredicting a live block as a dead block can cause a live block to be bypassed or evicted from the LLC early, generating LLC misses, and mispredicting a non-write-burst request as a write-burst request can cause extra migrations between STT-RAM lines and SRAM lines. The false positive rate is measured as the number of mispredicted positive predictions divided by the total number of predictions. Across the 11 multi-core workloads, the access pattern predictor yields a low false positive rate, ranging from 2.1% to 14.8% with a geometric mean of 8.3%, and achieves an average coverage of 71.7%. Thus, the majority of dead blocks and write burst blocks are predicted by the access pattern predictor. The false positive and coverage rates are not shown in graphs for lack of space.
5.3.3 Memory Energy Evaluation
Figure 13 shows the memory energy evaluation results, normalized to the 8MB SRAM LLC. The 24MB STT-RAM LLC reduces the average memory energy to 72.9% of the baseline; the 18MB APM LLC reduces it to 72.4%. Compared with the 24MB STT-RAM LLC, the 18MB APM LLC increases average memory traffic by 5.6% due to its smaller capacity. However, it does not increase the dynamic energy consumption because it consumes less activation/precharge energy: the APM LLC achieves a higher DRAM row-buffer hit rate for write requests than the STT-RAM LLC. A large LLC filters the locality of dirty blocks, so dirty blocks have low spatial locality by the time they are evicted from the LLC and written back to main memory. In the APM LLC, however, a significant fraction of dirty blocks are written back to main memory when they are evicted from the SRAM portion, and the small capacity of the SRAM means these evicted dirty blocks still have high spatial locality. Our evaluation shows the DRAM row-buffer hit rates for writes are 21.1% and 35.6% for the 24MB STT-RAM LLC and the 18MB APM LLC, respectively.
Figure 12. The LLC power breakdown for multi-core applications, normalized to 8MB SRAM (1st bar: 8M-SRAM, 2nd bar: 24M-STT-RAM, 3rd bar: 21M-Sun-Hybrid, 4th bar: 18M-APM).
Figure 13. The memory energy breakdown for multi-core applications, normalized to 8MB SRAM (background, refresh, read/write and activation/precharge energy; bars as in Figure 12).
5.4 Storage Overhead and Power
The technique uses an access pattern predictor to predict dead blocks and write burst blocks, which incurs extra storage and power overhead.
5.4.1 Storage Overhead
Each cache block in the LLC adds 1 bit indicating whether it is predicted dead, using 9.2KB of storage in total. For the pattern predictor, each prediction table has 4,096 entries with a 2-bit counter per entry; there are 6 such tables in the skewed structures of the dead block prediction table and the write burst prediction table, using a total of 6KB of storage. In the pattern simulator, one simulated set corresponds to 32 LLC sets, and each simulated set has a 12-entry 16-bit partial read PC field, a 12-entry 16-bit partial tag field, a 12-entry 4-bit LRU position field, a 12-entry 1-bit valid flag field, a 4-entry 16-bit partial write PC field and a 4-entry 2-bit write LRU position field. For a 4.5MB hybrid cache, the simulator consumes 8.5KB of storage. Thus, the storage overhead for the single-core configuration is 6KB + 8.5KB + 9.2KB = 23.7KB, which is only 0.53% of the capacity of the 4.5MB hybrid LLC. For the quad-core configuration, the storage overhead of the APM technique is 6KB + 8.5KB x 4 + 9.2KB x 4 = 84.8KB, which is 0.43% of the 18MB hybrid cache capacity.
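These figures can largely be reproduced from first principles; the short calculation below is our own back-of-envelope reconstruction (the simulator term lands slightly under the stated 8.5KB, presumably due to small bookkeeping fields we do not model):

    # Reconstructing the single-core storage overhead (our arithmetic, not the paper's).
    block_bytes, ways = 64, 18
    llc_bytes = int(4.5 * 1024 * 1024)                 # 4.5MB hybrid LLC

    blocks = llc_bytes // block_bytes                  # 73,728 blocks
    dead_bits_kb = blocks / 8 / 1024                   # 1 prediction bit per block -> ~9KB

    tables_kb = 6 * 4096 * 2 / 8 / 1024                # 6 tables x 4096 entries x 2 bits -> 6KB

    sets = blocks // ways                              # 4,096 LLC sets
    sampled_sets = sets // 32                          # 1 simulated set per 32 LLC sets
    bits_per_set = 12 * (16 + 16 + 4 + 1) + 4 * (16 + 2)   # read PC/tag/LRU/valid + write PC/LRU
    simulator_kb = sampled_sets * bits_per_set / 8 / 1024  # ~8.1KB (paper reports 8.5KB)

    print(dead_bits_kb, tables_kb, simulator_kb)       # ~9.0 + 6.0 + ~8.1 = ~23KB total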
5.4.2 Power Overhead
We evaluate the power overhead of our technique using NVSim [5] and CACTI [8]. For the single-core configuration, the extra dynamic and leakage power consumed by the access pattern predictor is 1.6% and 1.9% of the LLC dynamic and leakage power, respectively, inducing a power overhead of 1.8% of the total LLC power consumption. For the quad-core configuration, the extra dynamic and leakage power consumed by the access pattern predictor is 1.1% and 0.60% of the LLC dynamic and leakage power, respectively, for an overall power overhead of 0.76% of the total LLC power consumption.
6 Related Work
6.1 Related Work on Mitigating the Write Overhead of STT-RAM
Many prior papers [12, 27, 17] focus on mitigating the write overhead of an STT-RAM cache. Jog et al. [12] propose to improve the write speed of an STT-RAM-based LLC by relaxing its data retention time; however, that technique requires large-capacity buffers and a line-level refresh mechanism to retain reliability. Mao et al. [17] propose prioritization policies for reducing the waiting time of critical requests in an STT-RAM-based LLC; however, the technique increases the power consumption of the LLC. Recently, researchers have proposed hybrid SRAM and STT-RAM techniques [25, 3, 9, 2, 26] for improving LLC efficiency. Sun et al. [25] take the first step, introducing the hybrid cache structure. That technique uses a counter-based approach to predict write-intensive blocks, which are placed in the SRAM ways to reduce the write overhead of the STT-RAM portion. However, the technique is optimized only for core-write operations; it cannot reduce the prefetch-write and demand-write operations to STT-RAM. Jadidi et al. [9] propose reducing the write variance of STT-RAM lines by migrating frequently written cache blocks to other STT-RAM lines or to SRAM lines; however, frequently migrating data between cache lines incurs significant performance and energy overhead. Chen et al. [3] propose a combined static and dynamic scheme to optimize block placement for a hybrid cache; the downside of that technique is that it requires the compiler to provide static hints to initialize the block placement.
6.2 Related Work on Intelligent Block Placement Policies
Much previous work [6, 10, 20, 11] focuses on improving cache efficiency through intelligent block placement. Hardavellas et al. [6] propose a block placement and migration policy for a non-uniform cache that classifies cache accesses into instructions, private data and shared data, and places data based on its type. Our technique is orthogonal to the LLC replacement policy and can be combined with a replacement policy applied to the STT-RAM portion to further improve cache efficiency.
6.3 Related Work on Dead Block Predic-tion
Several dead block predictors have been proposed [14, 4, 16, 13]. Lai et al. [4] propose a trace-based predictor used to prefetch data into dead blocks in the L1 data cache. The technique hashes the sequence of memory-access PCs that touch a cache block into a signature used to predict when the block can be safely evicted (see the sketch below). Kharbutli and Solihin [14] propose counting-based predictors for the cache replacement policy. Khan et al. [13] propose sampling-based dead block prediction for the LLC. None of these previously proposed dead block predictors takes all data access types into account, whereas our access pattern predictor identifies dead blocks and write-burst blocks in the presence of all types of LLC accesses.
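To make the trace-based idea concrete, the sketch below folds the PCs of accesses to a block into a signature and learns the signatures observed at eviction. The hash function and structure sizes are illustrative assumptions, not the actual predictor of [4].

    # Illustrative sketch of trace-based dead-block prediction in the
    # spirit of Lai et al. [4]: the PCs that touch a block are folded
    # into a signature; signatures seen at eviction are learned, and a
    # live block whose signature matches one is predicted dead.
    SIG_BITS = 16  # assumed signature width

    def fold_pc(signature, pc):
        # Fold the next memory-access PC into the running signature.
        return ((signature << 3) ^ pc) & ((1 << SIG_BITS) - 1)

    class Block:
        def __init__(self):
            self.sig = 0  # running trace signature for this cache block

    class DeadBlockPredictor:
        def __init__(self):
            self.dead_signatures = set()  # signatures that ended a live range

        def on_access(self, block, pc):
            block.sig = fold_pc(block.sig, pc)

        def on_eviction(self, block):
            # The signature at eviction closed the block's live range.
            self.dead_signatures.add(block.sig)
            block.sig = 0

        def predict_dead(self, block):
            return block.sig in self.dead_signatures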
7 Conclusion
A hybrid STT-RAM and SRAM LLC relies on an intelligent block placement policy to leverage the complementary characteristics of the two technologies. In this work, we propose a new block placement and migration policy for a hybrid LLC that places data based on its access patterns. LLC writes are categorized into three classes: core-write, prefetch-write, and demand-write. We analyze the access pattern of each class and design a block placement policy that adapts to each of them, guided by a low-cost access pattern predictor. Experimental results show that our technique improves performance and reduces LLC power consumption compared with both an SRAM LLC and an STT-RAM LLC of the same area.
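For illustration only, the sketch below shows one way such a class-aware placement decision could be wired up. The interpretation of each write class and its routing to SRAM or STT-RAM are assumed simplifications for exposition, not the exact policy evaluated in this paper.

    # Purely illustrative sketch of class-aware block placement for a
    # hybrid LLC. The class definitions and routing below are assumed
    # simplifications, not the exact policy proposed in this work.
    from enum import Enum

    class WriteClass(Enum):
        CORE_WRITE = 0      # assumed: writeback arriving from an upper level
        PREFETCH_WRITE = 1  # assumed: fill triggered by a prefetch
        DEMAND_WRITE = 2    # assumed: fill triggered by a demand miss

    def place_block(write_class, predicted_write_intensive):
        # Return the target array for the incoming block.
        if write_class is WriteClass.CORE_WRITE:
            return "SRAM"       # steer hot writes away from costly STT-RAM
        if predicted_write_intensive:
            return "SRAM"       # predictor expects a write burst
        return "STT-RAM"        # read-mostly data tolerates STT-RAM writes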
8 Acknowledgments
Yuan Xie and Cong Xu were supported in part by NSF grants 1218867 and 1213052, and by DoE under Award Number DE-SC0005026. Guangyu Sun was supported by NSF China grant 61202072 and the National High-tech R&D Program of China (Number 2013AA013201). Daniel A. Jiménez and Zhe Wang were supported in part by NSF CCF grants 1332654, 1332598, and 1332597.
References
[1] Intel Core i7 processor. http://www.intel.com/products/processor/corei7/.
[2] Y.-T. Chen, J. Cong, H. Huang, B. Liu, C. Liu, M. Potkonjak, and G. Reinman. Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design. In DATE '12, pages 45–50, 2012.
[3] Y.-T. Chen, J. Cong, H. Huang, C. Liu, R. Prabhakar, and G. Reinman. Static and dynamic co-optimizations for blocks mapping in hybrid caches. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, pages 237–242, New York, NY, USA, 2012. ACM.
[4] A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, pages 144–154, 2001.
[5] X. Dong, C. Xu, Y. Xie, and N. Jouppi. NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(7):994–1007, 2012.
[6] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 184–195, New York, NY, USA, 2009. ACM.
[7] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, September 2006.
[8] HP Laboratories. CACTI 6.5. http://hpl.hp.com/research/cacti/.
[9] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad. High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement. In Proceedings of the 2011 International Symposium on Low Power Electronics and Design (ISLPED), pages 79–84, 2011.
[10] A. Jaleel, K. Theobald, S. C. Steely Jr., and J. Emer. High performance cache replacement using re-reference interval prediction. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA-37), June 2010.
[11] D. A. Jiménez. Insertion and promotion for tree-based PseudoLRU last-level caches. In MICRO, pages 284–296, 2013.
[12] A. Jog, A. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. Das. Cache Revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs. In Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 243–252, 2012.
[13] S. M. Khan, Y. Tian, and D. A. Jiménez. Sampling dead block prediction for last-level caches. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 175–186, Washington, DC, USA, 2010. IEEE Computer Society.
[14] M. Kharbutli and Y. Solihin. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput., 57:433–447, April 2008.
[15] C. Kim, J.-J. Kim, S. Mukhopadhyay, and K. Roy. A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(3):349–357, 2005.
[16] H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 222–233, Washington, DC, USA, 2008. IEEE Computer Society.
[17] M. Mao, H. H. Li, A. K. Jones, and Y. Chen. Coordinating prefetching and STT-RAM based last-level cache management for multicore systems. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI '13, pages 55–60, New York, NY, USA, 2013. ACM.
[18] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 292–303, June 1997.
[19] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSSx86: A full system simulator for x86 CPUs. In Proceedings of the 2011 Design Automation Conference, June 2011.
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 381–391, New York, NY, USA, 2007. ACM.
[21] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 128–138, New York, NY, USA, 2000. ACM.
[22] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. IEEE Computer Architecture Letters, PP(99):1, 2011.
[23] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[24] S. Srinath, O. Mutlu, H. Kim, and Y. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), pages 63–74, 2007.
[25] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA, pages 239–249, 2009.
[26] S.-M. Syu, Y.-H. Shao, and I.-C. Lin. High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI '13, pages 19–24, New York, NY, USA, 2013. ACM.
[27] J. Wang, X. Dong, and Y. Xie. OAP: An obstruction-aware cache management policy for STT-RAM last-level caches. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 847–852, 2013.
[28] X. Wang, Y. Chen, H. Li, D. Dimitrov, and H. Liu. Spin torque random access memory down to 22 nm technology. IEEE Transactions on Magnetics, 44(11):2479–2482, 2008.
[29] Z. Wang, S. M. Khan, and D. A. Jiménez. Improving writeback efficiency with decoupled last-write prediction. In Proceedings of the 39th International Symposium on Computer Architecture, ISCA '12, pages 309–320, Piscataway, NJ, USA, 2012. IEEE Press.