Locality-Aware Data Replication in the Last-Level Cache

George Kurian, Srinivas Devadas
Massachusetts Institute of Technology
Cambridge, MA, USA
{gkurian, devadas}@csail.mit.edu

Omer Khan
University of Connecticut
Storrs, CT, USA
[email protected]
Abstract

Next generation multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly.

The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes.
1 Introduction

Next generation multicore processors and applications will operate on massive data. A major challenge in future multicore processors is the data movement incurred by conventional cache hierarchies, which impacts the off-chip bandwidth, on-chip memory access latency and energy consumption [7].
A large, monolithic on-chip cache does not scale beyond a small number of cores, and the only practical option is to physically distribute memory in pieces so that every core is near some portion of the cache [5]. In theory this provides a large amount of aggregate cache capacity and fast private memory for each core. Unfortunately, it is difficult to manage the distributed cache and network resources effectively since they require architectural support for cache coherence and consistency under the ubiquitous shared memory model.
Popular directory-based protocols enable fast local caching to exploit data locality, but scale poorly with increasing core counts [3, 22]. Many recent proposals have addressed directory scalability in single-chip multicores using sharer compression techniques or limited directories [24, 31, 29, 12]. Yet, fast private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced from them [19]. This leads to increased network traffic and request rate to the last-level cache. Since on-chip wires are not scaling at the same rate as transistors, data movement not only impacts memory access latency, but also consumes extra power due to the energy consumption of network and cache resources [16].
Last-level cache (LLC) organizations offer trade-offs between on-chip data locality and off-chip miss rate. While private LLC organizations (e.g., [10]) have low hit latencies, their off-chip miss rates are high in applications that have uneven distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations (e.g., [1]), on the other hand, lead to non-uniform cache access (NUCA) [17] that hurts on-chip locality, but their off-chip miss rates are low since cache lines are not replicated. Several proposals have explored hybrid LLC organizations that attempt to combine the good characteristics of private and shared LLC organizations [9, 30, 4, 13]. These proposals either do not quickly adapt their policies to dynamic program changes or replicate cache lines without paying attention to their locality. In addition, some of them significantly complicate coherence and do not scale to large core counts (see Section 5 for details).
We propose a data replication mechanism for the LLC that retains the on-chip cache utilization of the shared LLC while intelligently replicating cache lines close to the requesting cores so as to maximize data locality. To achieve this goal, we propose a low-overhead yet highly accurate in-hardware locality classifier at the cache line granularity that only allows the replication of cache lines with high reuse.

1.1 Motivation
The utility of data replication at the LLC can be understood by measuring cache line reuse. Figure 1 plots the distribution of the number of accesses to cache lines in the LLC as a function of run-length.
Figure 1. Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length (binned as [1-2], [3-9], and [≥10] accesses). The classification is done at the cache line granularity.
Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting access by another core or before the line is evicted. Cache line accesses from multiple cores conflict if at least one of them is a write. For example, in BARNES, over 90% of the accesses to the LLC are to shared (read-write) data with a run-length of 10 or more. The greater the number of accesses with high run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Hence, BARNES would benefit from replicating shared (read-write) data. Similarly, FACESIM would benefit from replicating instructions, and PATRICIA would benefit from replicating shared (read-only) data. On the other hand, FLUIDANIMATE and OCEAN-C would not benefit since most cache lines experience just one or two accesses before a conflicting access or an eviction. For such cases, replication would increase LLC pollution without improving data locality.
Hence, the replication decision should not depend on the type of data, but rather on its locality. Instructions and shared data (both read-only and read-write) can be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the reuse of data changes during an application's execution.
1.2 Proposed Idea
We propose a low-overhead yet highly accurate hardware-only predictive mechanism to track and classify the reuse of each cache line in the LLC. Our runtime classifier only allows replicating those cache lines that demonstrate reuse at the LLC while bypassing replication for others. When a cache line replica is evicted or invalidated, our classifier adapts by adjusting its future replication decision accordingly. This reuse tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. Our locality-aware protocol is advantageous because it:
1. Enables lower memory access latency and energy by selectively replicating cache lines that show high reuse in the LLC slice of the requesting core.
2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality using a classifier that adapts to the runtime reuse at the granularity of cache lines.

3. Allows coherence complexity almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be placed at the LLC slice of the requesting core. The additional coherence complexity only arises within a core when the LLC slice is searched on an L1 cache miss, or when a cache line in the core's local cache hierarchy is evicted/invalidated.
2 Locality-Aware LLC Data Replication

2.1 Baseline System
The baseline system is a tiled multicore with an electrical 2-D mesh interconnection network as shown in Figure 2. Each core consists of a compute pipeline, private L1 instruction and data caches, a physically distributed shared LLC slice with an integrated directory, and a network router. Coherence is maintained using a MESI protocol. The coherence directory is integrated with the LLC slices by extending the tag arrays (in-cache directory organization [8, 5]) and tracks the sharing status of the cache lines in the per-core private L1 caches. The private L1 caches are kept coherent using the ACKwise limited directory-based coherence protocol [20]. Some cores have a connection to a memory controller as well.
Even though our protocol can be built on top of both Static-NUCA and Dynamic-NUCA configurations [17, 13], we use Reactive-NUCA's data placement and migration mechanisms to manage the LLC [13]. Private data is placed at the LLC slice of the requesting core and shared data is address interleaved across all LLC slices. Reactive-NUCA also proposed replicating instructions at a cluster level (e.g., 4 cores) using a rotational interleaving mechanism. However, we do not use this mechanism, but instead build a locality-aware LLC data replication scheme for all types of cache lines.

2.2 Protocol Operation
The four essential components of data replication are: (1) choosing which cache lines to replicate, (2) determining where to place a replica, (3) looking up a replica, and (4) maintaining coherence for replicas. We first define a few terms to facilitate describing our protocol.
1. Home Location: The core where all requests for a cache line are serialized for maintaining coherence.

2. Replica Sharer: A core that is granted a replica of a cache line in its LLC slice.

3. Non-Replica Sharer: A core that is NOT granted a replica of a cache line in its LLC slice.

4. Replica Reuse: The number of times an LLC replica is accessed before it is invalidated or evicted.

5. Home Reuse: The number of times a cache line is accessed at the LLC slice in its home location before a conflicting write or an eviction.

6. Replication Threshold (RT): The reuse at or above which a replica is created.
Figure 2. ①-④ are mockup requests showing the locality-aware LLC replication protocol on the baseline tiled multicore (each tile contains a compute pipeline, private L1-I and L1-D caches, an LLC slice with integrated directory, and a router). The black data block has high reuse and a local LLC replica is allowed that services requests ① and ②. The low-reuse red data block is not allowed to be replicated at the LLC, and request ③, which misses in the L1, must access the LLC slice at its home core. The home core for each data block can also service local private cache misses (e.g., ④).
Figure 3. Classification state diagram. Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes: a non-replica core is promoted when Home Reuse ≥ RT, and a replica core is demoted when XReuse < RT. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction.
Note that for a given cache line, one core can be a replica sharer while another can be a non-replica sharer. Our protocol starts out as a conventional directory protocol and initializes all cores as non-replica sharers of all cache lines (as shown by Initial in Figure 3). Let us walk through the handling of read requests, write requests, evictions, invalidations and downgrades, as well as the cache replacement policies under this protocol.
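The classification rules of Figure 3 are compact enough to express directly. The following C++ sketch captures them under our own naming; the paper only fixes the parameter RT and the use of small saturating counters, so the types, widths and function names here are illustrative assumptions:

```cpp
#include <cstdint>

// Replication Threshold; RT = 3 is the value the evaluation settles on.
constexpr uint32_t RT = 3;

enum class Mode : uint8_t { NonReplica, Replica };  // replication mode bit

// Per-core classification state kept in the home directory entry.
struct CoreLocality {
    Mode     mode       = Mode::NonReplica;  // all cores start as non-replica sharers
    uint32_t home_reuse = 0;                 // saturating counter (2 bits in hardware)
};

// Promotion check at the home: a non-replica sharer whose home reuse
// has reached RT is granted a replica on its next request.
bool should_promote(const CoreLocality& c) {
    return c.mode == Mode::NonReplica && c.home_reuse >= RT;
}

// Demotion/retention decision when a core's replica is removed.
// On an invalidation, XReuse = replica reuse + home reuse; on an
// eviction, XReuse = replica reuse alone (Figure 3).
void on_replica_removed(CoreLocality& c, uint32_t replica_reuse, bool invalidation) {
    uint32_t xreuse = invalidation ? replica_reuse + c.home_reuse : replica_reuse;
    c.mode = (xreuse >= RT) ? Mode::Replica : Mode::NonReplica;
    c.home_reuse = 0;  // reset for the next round of classification
}
```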
2.2.1 Read Requests

On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted at the private L1 cache. In addition, a Replica Reuse counter (as shown in Figure 4) at the LLC directory entry is incremented. The replica reuse counter is a saturating counter used to capture reuse information. It is initialized to '1' on replica creation and incremented on every replica hit.
On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes the necessary actions to obtain the most recent copy of the cache line.
Figure 4. ACKwise_p-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter, as well as Replication mode bits and Home reuse counters for every core in the system.
The directory entry is augmented with additional bits as shown in Figure 4. These bits include (a) a Replication Mode bit and (b) a Home Reuse saturating counter for each core in the system. Note that adding several bits to track the locality of each core in the system does not scale with the number of cores; we therefore present a cost-efficient classifier implementation in Section 2.2.5. The replication mode bit identifies whether a replica is allowed to be created for the particular core. The home reuse counter tracks the number of times the cache line is accessed at the home location by the particular core. This counter is initialized to '0' and incremented on every hit at the LLC home location.
If the replication mode bit is set to true, the cache line is inserted in the requester's LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the Replication Threshold (RT), the requesting core is "promoted" (the replication mode bit is set to true) and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created and the cache line is only inserted in the requester's private L1 cache.
If the LLC home location is at the requesting core, the read request is handled directly at the LLC home. Even if the classifier directs to create a replica, the cache line is just inserted at the private L1 cache.
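Putting the read path together, a home-side handler might look like the sketch below. It reuses Mode, CoreLocality and RT from the earlier sketch; DirectoryEntry, ReadOutcome and the fixed 64-core array reflect the Complete classifier and are our illustrative assumptions:

```cpp
// Continues the previous sketch (Mode, CoreLocality, RT assumed in scope).
constexpr int NUM_CORES = 64;

struct DirectoryEntry {
    CoreLocality locality[NUM_CORES];  // Complete classifier: one entry per core
};

enum class ReadOutcome {
    L1Only,          // line goes only to the requester's private L1
    ReplicaCreated   // line also inserted in the requester's LLC slice
};

ReadOutcome handle_home_read(DirectoryEntry& dir, int requester, bool requester_is_home) {
    CoreLocality& c = dir.locality[requester];
    // A core whose home is local never gets a replica; the home slice
    // itself services the request.
    if (requester_is_home) return ReadOutcome::L1Only;

    if (c.mode == Mode::Replica)
        return ReadOutcome::ReplicaCreated;  // replica reuse counter starts at 1

    c.home_reuse++;                          // saturating in hardware
    if (c.home_reuse >= RT) {
        c.mode = Mode::Replica;              // the core is "promoted"
        return ReadOutcome::ReplicaCreated;
    }
    return ReadOutcome::L1Only;
}
```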
2.2.2 Write Requests

On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the Replica Reuse counter is incremented.
If a replica is not found or exists in the Shared (S) state, the request is forwarded to the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby maintaining the single-writer multiple-reader invariant [25]. The acknowledgements received are processed as described in Section 2.2.3. After all such acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are reset to '0'. This has to be done since these sharers have not shown enough reuse to be "promoted".
If the writer is a non-replica sharer, its home reuse counter is modified as follows. If the writer is the only sharer (replica or non-replica), its home reuse counter is incremented; otherwise it is reset to '1'. This enables the replication of migratory shared data at the writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
2.2.3 Evictions and Invalidations

On an invalidation request, both the LLC slice and L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether the core stays a replica sharer. If the (replica + home) reuse is ≥ RT, the core maintains replica status; otherwise it is demoted to non-replica status (as shown in Figure 3). The two reuse counters have to be added since their sum is the total reuse that the core exhibited for the cache line between successive writes.
When an L1 cache line is evicted, the LLC replica location is probed for the same address. If a replica is found, the dirty data in the L1 cache line is merged with it; otherwise an acknowledgement is sent to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. The replica reuse counter is used by the locality classifier as follows. If the replica reuse is ≥ RT, the core maintains replica status; otherwise it is demoted to non-replica status. Only the replica reuse counter is used for this decision since it captures the reuse of the cache line at the LLC replica location.
After the acknowledgement corresponding to an eviction or invalidation of the LLC replica is received at the home, the locality classifier sets the home reuse counter of the corresponding core to '0' for the next round of classification.
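At the home, processing such an acknowledgement therefore reduces to feeding the piggy-backed counter into the transition rule sketched after Figure 3; the message structure below is an assumption for illustration:

```cpp
// Continues the previous sketches.
struct ReplicaAck {
    int      core;           // which core's replica was removed
    uint32_t replica_reuse;  // counter piggy-backed on the acknowledgement
    bool     invalidation;   // false if the replica was evicted
};

void handle_replica_ack(DirectoryEntry& dir, const ReplicaAck& ack) {
    // Keep or demote replica status based on XReuse, then reset the home
    // reuse counter for the next round of classification.
    on_replica_removed(dir.locality[ack.core], ack.replica_reuse, ack.invalidation);
}
```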
The eviction of an LLC replica back-invalidates the L1 cache (as described earlier). A possibly better strategy is to maintain the validity of the L1 cache line. This requires two additional message types as well as two messages: one to communicate back the reuse counter on the LLC replica eviction, and another to communicate the acknowledgement when the L1 cache line is finally invalidated or evicted. We opted for the back-invalidation for two reasons: (1) it maintains the simplicity of the coherence protocol, and (2) the energy and performance improvements of the more elaborate strategy are negligible since (a) the LLC is more than 4× larger than the L1 cache, keeping the probability that an evicted LLC line has an L1 copy extremely low, and (b) our LLC replacement policy prioritizes retaining cache lines that have L1 cache copies.
2.2.4 LLC Replacement Policy

Traditional LLC replacement policies use the least recently used (LRU) policy. One reason why this is sub-optimal is that the LRU information cannot be fully captured at the LLC because the L1 cache filters out a large fraction of accesses that hit within it.
Figure 5. ACKwise_p-Limited_k locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter, as well as the Limited_k classifier. The Limited_k classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas.
To be cognizant of this, the replacement policy should prioritize retaining cache lines that have L1 cache sharers. Some proposals in the literature accomplish this by sending periodic Temporal Locality Hint messages from the L1 cache to the LLC [15]. However, this incurs additional network traffic.
Our replacement policy accomplishes the same goal with a much simpler scheme. It first selects cache lines with the least number of L1 cache copies and then chooses the least recently used among them. The number of L1 cache copies is readily available since the directory is integrated within the LLC tags ("in-cache" directory). This reduces back-invalidations to a negligible amount and outperforms the LRU policy (cf. Section 4.2).
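A minimal sketch of this victim selection, with an assumed per-way view of the tag array (the L1-copy count falls out of the in-cache directory's sharer list):

```cpp
#include <vector>
#include <cstdint>

// One LLC way as seen by the replacement logic; the field layout is an
// illustrative assumption.
struct LLCWay {
    uint8_t l1_copies;  // number of L1 caches holding this line
    uint8_t lru_age;    // larger value = less recently used
};

// Fewest L1 copies first; among those, the least recently used.
int select_victim(const std::vector<LLCWay>& set) {
    int victim = 0;
    for (int way = 1; way < static_cast<int>(set.size()); way++) {
        bool fewer_copies = set[way].l1_copies < set[victim].l1_copies;
        bool same_copies_older =
            set[way].l1_copies == set[victim].l1_copies &&
            set[way].lru_age > set[victim].lru_age;
        if (fewer_copies || same_copies_older) victim = way;
    }
    return victim;
}
```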
2.2.5 Limited Locality Classifier Optimization

The classifier described earlier, which keeps track of locality information for all the cores in the directory entry, is termed the Complete locality classifier. It has a storage overhead of 30% (calculated in Section 2.4) at 64 cores and over 5× at 1024 cores. In order to mitigate this overhead, we develop a classifier that maintains locality information for a limited number of cores and classifies the other cores as replica or non-replica sharers based on this information.
The locality information for each core consists of (1) the core ID, (2) the replication mode bit and (3) the home reuse counter. The classifier that maintains a list of this information for a limited number of cores (k) is termed the Limited_k classifier. Figure 5 shows the information that is tracked by this classifier. The sharer list of the ACKwise limited directory entry cannot be reused for tracking locality information because of its different functionality. While the hardware pointers of ACKwise are used to maintain coherence, the limited locality list serves to classify cores as replica or non-replica sharers. Decoupling in this manner also enables the locality-aware protocol to be implemented efficiently on top of other scalable directory organizations. We now describe the working of the limited locality classifier.
At startup, all entries in the limited locality list are free; this is denoted by marking all core IDs as Invalid. When a core makes a request to the home location, the directory first checks if the core is already being tracked by the limited locality list. If so, the actions described previously are carried out. Otherwise, the directory checks if a free entry exists. If one does, it allocates the entry to the core and the same actions are carried out.
Otherwise, the directory checks if a currently tracked core can be replaced. An ideal candidate for replacement is a core that is currently not using the cache line. Such a core is termed an inactive sharer and should ideally relinquish its entry to a core in need of it. A replica core becomes inactive on an LLC invalidation or an eviction. A non-replica core becomes inactive on a write by another core. If such a replacement candidate exists, its entry is allocated to the requesting core. The initial replication mode of the core is obtained by taking a majority vote of the modes of the tracked cores. This is done so as to start off the requester in its most probable mode.
Finally, if no replacement candidate exists, the mode for the requesting core is obtained by taking a majority vote of the modes of all the tracked cores. The limited locality list is left unchanged.
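The lookup/allocation/voting flow above can be summarized in a sketch (entry layout per Figure 5; the inactive-sharer flag and the nullptr convention for an untracked core are our simplifications):

```cpp
#include <array>

// Continues the earlier sketches (Mode, CoreLocality assumed in scope).
constexpr int K = 3;             // Limited_3 in the final design
constexpr int INVALID_CORE = -1; // marks a free list entry

struct LimitedEntry {
    int          core_id  = INVALID_CORE;
    CoreLocality loc;
    bool         inactive = false;  // replica: set on invalidation/eviction;
                                    // non-replica: set on a conflicting write
};

struct LimitedClassifier {
    std::array<LimitedEntry, K> list;

    // Majority vote over the modes of the tracked cores.
    Mode majority_mode() const {
        int tracked = 0, replicas = 0;
        for (const LimitedEntry& e : list) {
            if (e.core_id == INVALID_CORE) continue;
            tracked++;
            if (e.loc.mode == Mode::Replica) replicas++;
        }
        return (2 * replicas > tracked) ? Mode::Replica : Mode::NonReplica;
    }

    // Returns the tracked state for 'core', allocating or replacing an
    // entry when possible; returns nullptr when the list is left unchanged,
    // in which case the caller classifies the request with majority_mode().
    CoreLocality* lookup(int core) {
        for (LimitedEntry& e : list)           // 1. already tracked
            if (e.core_id == core) return &e.loc;
        for (LimitedEntry& e : list)           // 2. free entry
            if (e.core_id == INVALID_CORE) {
                e.core_id = core;
                return &e.loc;
            }
        for (LimitedEntry& e : list)           // 3. replace an inactive sharer,
            if (e.inactive) {                  //    starting in the majority mode
                e = LimitedEntry{core, {majority_mode(), 0}, false};
                return &e.loc;
            }
        return nullptr;                        // 4. no candidate
    }
};
```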
The storage overhead of the Limited_k classifier is directly proportional to the number of cores (k) for which locality information is tracked. In Section 4.3, we evaluate the storage and accuracy tradeoffs for the Limited_k classifier. Based on our observations, we pick the Limited_3 classifier.
2.3 Discussion
2.3.1 Replica Creation Strategy

In the protocol described earlier, replicas are created in all valid cache states. A simpler strategy is to create an LLC replica only in the Shared cache state. This enables instructions, shared read-only and shared read-write data that exhibit high read run-length to be replicated so as to serve multiple read requests from within the local LLC slice. However, migratory shared data cannot be replicated with this simpler strategy because both read and write requests are made to it in an interleaved manner. Such data patterns can be efficiently handled only if the replica is created in the Exclusive or Modified state. Benchmarks that exhibit both the above access patterns are observed in our evaluation (cf. Section 4.1).
2.3.2 Coherence Complexity

The local LLC slice is always looked up on an L1 cache miss or eviction. Additionally, both the L1 cache and the LLC slice are probed on every asynchronous coherence request (i.e., invalidate, downgrade, flush or write-back). This is needed because the directory only has a single pointer to track the local cache hierarchy of each core. This method also keeps the coherence complexity similar to that of a non-hierarchical (flat) coherence protocol.
To avoid the latency and energy overhead of searching the LLC replica, one may want to optimize the handling of asynchronous requests, or decide intelligently whether to look up the local LLC slice on a cache miss or eviction. In order to enable such optimizations, additional sharer tracking bits are needed at the directory and L1 cache. Moreover, additional network message types are needed to relay coherence information between the LLC home and other actors.
In order to evaluate whether this additional coherence complexity is worthwhile, we compared our protocol to a dynamic oracle that has perfect information about whether a cache line is present in the local LLC slice. The dynamic oracle avoids all unnecessary LLC lookups. The completion time and energy difference when compared to the dynamic oracle was less than 1%. Hence, in the interest of avoiding the additional complexity, the LLC replica is always looked up for the above coherence requests.
2.3.3 Classifier Organization

The classifier for the locality-aware protocol is organized as an in-cache structure, i.e., the replication mode bits and home reuse counters are maintained for all cache lines in the LLC. However, this is not an essential requirement. The classifier is logically decoupled from the directory and could be implemented using a sparse organization.
The storage overhead for the in-cache organization is calculated in Section 2.4. The performance and energy overhead of this organization is small because: (1) the classifier lookup incurs a relatively small energy and latency penalty when compared to the data array lookup of the LLC slice and communication over the network (justified in our results), and (2) only a single tag lookup is needed for accessing the classifier and LLC data. In a sparse organization, a separate lookup is required for the classifier and the LLC data. Even though these lookups could be performed in parallel with no latency overhead, the energy expended to look up two CAM structures needs to be paid.
2.3.4 Cluster-Level Replication

In the locality-aware protocol, the location where a replica is placed is always the LLC slice of the requesting core. An additional method by which one could explore the trade-off between LLC hit latency and LLC miss rate is replicating at a cluster level. A cluster is defined as a group of neighboring cores with at most one replica of a cache line. Increasing the size of a cluster increases LLC hit latency and decreases LLC miss rate; decreasing the cluster size has the opposite effect. The optimal replication algorithm would tune the cluster size so as to maximize the performance and energy benefit.
We explored clustering under our protocol after making the appropriate changes. The changes include (1) blocking at the replica location (the core in the cluster where a replica could be found) before forwarding the request to the home location so that multiple cores in the same cluster do not have outstanding requests to the LLC home location, (2) additional coherence message types for communication between the requester, replica and home cores, and (3) hierarchical invalidation and downgrade of the replica and the L1 caches that it tracks. We do not explain all the details here to save space. Cluster-level replication was not found to be beneficial in the evaluated 64-core system, for the following reasons (see Section 4.4 for details).
1. Clustering increases network serialization delays since multiple locations now need to be searched/invalidated on an L1 cache miss.

2. Cache lines with a low degree of sharing do not benefit because clustering just increases the LLC hit latency without reducing the LLC miss rate.

3. The added coherence complexity of clustering increased our design and verification time significantly.
2.4 Overheads
2.4.1 Storage
The locality-aware protocol requires extra bits at the LLC tag arrays to track locality information. Each LLC directory entry requires 2 bits for the replica reuse counter (assuming an optimal RT of 3). The Limited_3 classifier tracks the locality information for three cores. Tracking one core requires 2 bits for the home reuse counter, 1 bit to store the replication mode and 6 bits to store the core ID (for a 64-core processor). Hence, the Limited_3 classifier requires an additional 27 (= 3 × 9) bits of storage per LLC directory entry. The Complete classifier, on the other hand, requires 192 (= 64 × 3) bits of storage.
All the following calculations are for one core, but they apply to the entire processor since all the cores are identical. The sizes of the per-core L1 and LLC caches used in our system are shown in Table 1. The storage overhead of the replica reuse counter is (2 × 256)/(64 × 8) KB = 1 KB. The storage overhead of the Limited_3 classifier is (27 × 256)/(64 × 8) KB = 13.5 KB. For the Complete classifier, it is (192 × 256)/(64 × 8) KB = 96 KB. Now, the storage overhead of the ACKwise_4 protocol in this processor is 12 KB (assuming 6 bits per ACKwise pointer) and that of a Full Map protocol is 32 KB. Adding up all the storage components, the Limited_3 classifier with the ACKwise_4 protocol uses slightly less storage than the Full Map protocol and 4.5% more storage than the baseline ACKwise_4 protocol. The Complete classifier with the ACKwise_4 protocol uses 30% more storage than the baseline ACKwise_4 protocol.
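For ease of verification, the same arithmetic can be written as compile-time checks (a 256 KB slice of 64-byte lines has 4096 tag/directory entries; the bit widths are those given above):

```cpp
// Storage arithmetic for one 256 KB LLC slice with 64 B cache lines.
constexpr int kEntries      = 256 * 1024 / 64;                   // 4096 lines
constexpr int kReplicaReuse = 2 * kEntries / 8;                  // 2 bits/entry
constexpr int kLimited3     = (3 * (2 + 1 + 6)) * kEntries / 8;  // 27 bits/entry
constexpr int kComplete     = (64 * 3) * kEntries / 8;           // 192 bits/entry
constexpr int kAckwise4     = (4 * 6) * kEntries / 8;            // 4 pointers x 6 bits
constexpr int kFullMap      = 64 * kEntries / 8;                 // 1 bit per core

static_assert(kReplicaReuse == 1 * 1024, "1 KB");
static_assert(kLimited3 == 13824, "13.5 KB");
static_assert(kComplete == 96 * 1024, "96 KB");
static_assert(kAckwise4 == 12 * 1024, "12 KB");
static_assert(kFullMap == 32 * 1024, "32 KB");
```

Note that kReplicaReuse + kLimited3 = 14,848 bytes = 14.5 KB, matching the per-slice overhead quoted in the conclusion.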
2.4.2 LLC Tag & Directory Accesses

Updating the replica reuse counter in the local LLC slice requires a read-modify-write operation on each replica hit. However, since the replica reuse counter (being 2 bits) is stored in the LLC tag array, which must be written on each LLC lookup to update the LRU counters anyway, our protocol does not add any additional tag accesses.
At the home location, the lookup/update of the locality information is performed concurrently with the lookup/update of the sharer list for a cache line. However, the lookup/update of the directory is now more expensive since it includes both the sharer list and the locality information. This additional expense is accounted for in our evaluation.
2.4.3 Network Traffic

The locality-aware protocol communicates the replica reuse counter to the LLC home along with the acknowledgment for an invalidation or an eviction.
Architectural Parameter              Value

Number of Cores                      64 @ 1 GHz
Compute Pipeline per Core            In-Order, Single-Issue
Processor Word Size                  64 bits

Memory Subsystem
L1-I Cache per core                  16 KB, 4-way Assoc., 1 cycle
L1-D Cache per core                  32 KB, 4-way Assoc., 1 cycle
L2 Cache (LLC) per core              256 KB, 8-way Assoc., 2 cycle tag, 4 cycle data; Inclusive, R-NUCA [13]
Directory Protocol                   Invalidation-based MESI, ACKwise_4 [20]
DRAM (number, bandwidth, latency)    8 controllers, 5 GBps/controller, 75 ns latency

Electrical 2-D Mesh with XY Routing
Hop Latency                          2 cycles (1-router, 1-link)
Flit Width                           64 bits
Header (Src, Dest, Addr, MsgType)    1 flit
Cache Line Length                    8 flits (512 bits)

Locality-Aware LLC Data Replication
Replication Threshold                RT = 3
Classifier                           Limited_3

Table 1. Architectural parameters used for evaluation.
This is accomplished without creating additional network flits. For a 48-bit physical address and a 64-bit flit size, an invalidation message requires 42 bits for the physical cache line address, 12 bits for the sender and receiver core IDs and 2 bits for the replica reuse counter. The remaining 8 bits suffice for storing the message type.
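As a sketch, the 64-bit acknowledgement flit could be packed as below; the field order and names are our assumptions, and only the widths come from the text:

```cpp
#include <cstdint>

// One 64-bit invalidation-acknowledgement flit: 42 + 6 + 6 + 2 + 8 = 64 bits.
struct InvAckFlit {
    uint64_t line_addr     : 42;  // 48-bit physical address minus 6 offset bits (64 B lines)
    uint64_t sender        : 6;   // core IDs on a 64-core processor
    uint64_t receiver      : 6;
    uint64_t replica_reuse : 2;   // piggy-backed locality information
    uint64_t msg_type      : 8;   // fits in the remaining bits
};
static_assert(sizeof(InvAckFlit) == 8, "fits in a single 64-bit flit");
```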
3 Evaluation Methodology

We evaluate a 64-core multicore. The default architectural parameters used for evaluation are shown in Table 1.

3.1 Performance Models
All experiments are performed using the core, cache hierarchy, coherence protocol, memory system and on-chip interconnection network models implemented within the Graphite [23] multicore simulator. All the mechanisms and protocol overheads discussed in Section 2 are modeled. The Graphite simulator requires the memory system (including the cache hierarchy) to be functionally correct to complete simulation. This is a good test that all our cache coherence protocols work correctly, given that we have run 21 benchmarks to completion.
The electrical mesh interconnection network uses XY routing. Since modern network-on-chip routers are pipelined [11], and 2- or even 1-cycle per hop router latencies [18] have been demonstrated, we model a 2-cycle per hop delay. In addition to the fixed per-hop latency, network link contention delays are also modeled.

3.2 Energy Models
For dynamic energy evaluations of on-chip electrical network routers and links, we use the DSENT [26] tool. The dynamic energy estimates for the L1-I, L1-D and L2 (with integrated directory) caches as well as DRAM are obtained using McPAT/CACTI [21, 27]. The energy evaluation is performed at the 11 nm technology node to account for future scaling trends. As clock frequencies are relatively slow, high threshold transistors are assumed for lower leakage.
3.3 Baseline LLC Management Schemes

We model four baseline multicore systems that assume private L1 caches managed using the ACKwise_4 protocol.

1. The Static-NUCA baseline address interleaves all cache lines among the LLC slices.

2. The Reactive-NUCA [13] baseline places private data at the requester's LLC slice, replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving, and address interleaves shared data in a single LLC slice.

3. The Victim Replication (VR) [30] baseline uses the requester's local LLC slice as a victim cache for data that is evicted from the L1 cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself or has no sharers in the L1 cache.

4. The Adaptive Selective Replication (ASR) [4] baseline also replicates cache lines in the requester's local LLC slice on an L1 eviction. However, it only allows LLC replication for cache lines that are classified as shared read-only. ASR pays attention to the LLC pressure by basing its replication decision on per-core hardware monitoring circuits that quantify the replication effectiveness based on the benefit (lower LLC hit latency) and cost (higher LLC miss latency) of replication. We do not model the hardware monitoring circuits or the dynamic adaptation of replication levels. Instead, we run ASR at five different replication levels (0, 0.25, 0.5, 0.75, 1) and choose the one with the lowest energy-delay product for each benchmark.
3.4 Evaluation Metrics

Each multithreaded benchmark is run to completion using the input sets from Table 2. We measure the energy consumption of the memory system including the on-chip caches, DRAM and the network. We also measure the completion time, i.e., the time spent in the parallel region of the benchmark. This includes the compute latency, the memory access latency, and the synchronization latency. The memory access latency is further broken down into:

1. L1 to LLC replica latency is the time spent by the L1 cache miss request to the LLC replica location and the corresponding reply from the LLC replica, including the time spent accessing the LLC.

2. L1 to LLC home latency is the time spent by the L1 cache miss request to the LLC home location and the corresponding reply from the LLC home, including the time spent in the network and the first access to the LLC.

3. LLC home waiting time is the queueing delay at the LLC home incurred because requests to the same cache line must be serialized to ensure memory consistency.
Application                  Problem Size

SPLASH-2 [28]
RADIX                        4M integers, radix 1024
FFT                          4M complex data points
LU-C, LU-NC                  1024 × 1024 matrix
OCEAN-C                      2050 × 2050 ocean
OCEAN-NC                     1026 × 1026 ocean
CHOLESKY                     tk29.O
BARNES                       64K particles
WATER-NSQUARED               512 molecules
RAYTRACE                     car
VOLREND                      head

PARSEC [6]
BLACKSCHOLES                 65,536 options
SWAPTIONS                    64 swaptions, 20,000 sims.
STREAMCLUSTER                8192 points per block, 1 block
DEDUP                        31 MB data
FERRET                       256 queries, 34,973 images
BODYTRACK                    4 frames, 4000 particles
FACESIM                      1 frame, 372,126 tetrahedrons
FLUIDANIMATE                 5 frames, 300,000 particles

Others: Parallel MI-Bench [14], UHPC Graph benchmark [2]
PATRICIA                     5000 IP address queries
CONNECTED-COMPONENTS         Graph with 2^18 nodes

Table 2. Problem sizes for our parallel benchmarks.
4. LLC home to sharers latency is the round-trip time needed to invalidate sharers and receive their acknowledgments. This also includes the time spent requesting and receiving synchronous write-backs.

5. LLC home to off-chip memory latency is the time spent accessing memory, including the time spent communicating with the memory controller and the queueing delay incurred due to finite off-chip bandwidth.
An important set of memory system metrics we track to evaluate our protocol is the breakdown of cache miss types:

1. LLC replica hits are L1 cache misses that hit at the LLC replica location.

2. LLC home hits are L1 cache misses that hit at the LLC home location when routed directly to it, or LLC replica misses that hit at the LLC home location.

3. Off-chip misses are L1 cache misses that are sent to DRAM because the cache line is not present on-chip.
4 Results

4.1 Comparison of Replication Schemes

Figures 6 and 7 plot the energy and completion time breakdowns for the replication schemes evaluated. The RT-1, RT-3 and RT-8 bars correspond to the locality-aware scheme with replication thresholds of 1, 3 and 8 respectively.

The energy and completion time trends can be understood based on the following three factors: (1) the type of data accessed at the LLC (instructions, private data, shared read-only data and shared read-write data), (2) the reuse run-length at the LLC, and (3) the working set size of the benchmark.
Figure 6. Energy breakdown for the LLC replication schemes evaluated (S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3 and RT-8 per benchmark), broken into L1-I Cache, L1-D Cache, L2 Cache (LLC), Directory, Network Router, Network Link and DRAM components. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.
Figure 7. Completion time breakdown for the LLC replication schemes evaluated, broken into Compute, L1-To-LLC-Replica, L1-To-LLC-Home, LLC-Home-Waiting, LLC-Home-To-Sharers, LLC-Home-To-OffChip and Synchronization components. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.
Figure 8. L1 cache miss type breakdown (LLC-Replica-Hits, LLC-Home-Hits, OffChip-Misses) for the LLC replication schemes evaluated.
Figure 8, which plots how L1 cache misses are handled by the LLC, is also instrumental in understanding these trends.
Many benchmarks (e.g., BARNES) have a working set that fits within the LLC even if replication is done on every L1 cache miss. Hence, all locality-aware schemes (RT-1, RT-3 and RT-8) perform well both in energy and performance. In our experiments, we observe that BARNES exhibits a high reuse of cache lines at the LLC through accesses directed at shared read-write data. S-NUCA, R-NUCA and ASR do not replicate shared read-write data and hence do not observe any benefits with BARNES.
VR observes some benefits since it locally replicates read-write data. However, it exhibits higher L2 cache (LLC) energy than the other schemes for two reasons. (1) Its (almost) blind process of creating replicas on all evictions pollutes the LLC, leaving less space for useful replicas and LLC home lines. This is evident from the lower replica hit rate for VR when compared to our locality-aware protocol. (2) The exclusive relationship between the L1 cache and the local LLC slice in VR causes a line to always be written back on an eviction even if the line is clean. This is because a replica hit always causes the line in the LLC slice to be invalidated and inserted into the L1 cache. Hence, in the common case where replication is useful, each hit at the LLC location effectively incurs both a read and a write at the LLC, and a write expends 1.2× more energy than a read. The first factor leads to higher network energy and completion time as well. This explains why VR performs worse than the locality-aware protocol. Similar trends in VR performance and energy exist in the WATER-NSQ, PATRICIA, BODYTRACK, FACESIM, STREAMCLUSTER and BLACKSCHOLES benchmarks.
BODYTRACK and FACESIM are similar to BARNES except that their LLC accesses have a greater fraction of instructions and/or shared read-only data. The accesses to shared read-write data are again mostly reads with only a few writes. R-NUCA shows significant benefits since it replicates instructions. ASR shows even higher energy and performance benefits since it replicates both instructions and shared read-only data. The locality-aware protocol shows the same benefits since it replicates all classes of cache lines, provided they exhibit reuse in accordance with the replication threshold. VR shows higher LLC energy for the same reasons as in BARNES. ASR and our locality-aware protocol allow the LLC slice at the replica location to be inclusive of the L1 cache, and hence do not have the same drawback as VR. VR, however, does not have a performance overhead because the evictions are not on the critical path of the processor pipeline.
Note that BODYTRACK, FACESIM and RAYTRACE are the only three among the evaluated benchmarks that have a significant L1-I cache MPKI (misses per thousand instructions). All other benchmarks have an extremely low L1-I MPKI (< 0.5), so R-NUCA's replication mechanism is not effective in most cases. Even in the above three benchmarks, R-NUCA does not place instructions in the local LLC slice but replicates them at a cluster level, hence the serialization delays to transfer the cache lines over the network still need to be paid.
BLACKSCHOLES, on the other hand, exhibits a large number of LLC accesses to private data and a small number to shared read-only data. Since R-NUCA places private data in its local LLC slice, it obtains performance and energy improvements over S-NUCA. However, the improvements obtained are limited since false sharing is exhibited at the page level, i.e., multiple cores privately access non-overlapping cache lines in a page. Since R-NUCA's classification mechanism operates at a page level, it is not able to locally place all truly private lines. The locality-aware protocol obtains improvements over R-NUCA by replicating these cache lines. ASR only replicates shared read-only cache lines and identifies these lines using a per-cache-line sticky Shared bit. Hence, ASR follows the same trends as S-NUCA. DEDUP almost exclusively accesses private data (without any false sharing) and hence performs optimally with R-NUCA.
Benchmarks such as RADIX, FFT, LU-C, OCEAN-C, FLUIDANIMATE and CONCOMP do not benefit from replication, so the baseline R-NUCA performs optimally. R-NUCA does better than S-NUCA because these benchmarks have significant accesses to thread-private data. ASR, being built on top of S-NUCA, shows the same trends as S-NUCA. VR, on the other hand, shows higher LLC energy for the same reasons outlined earlier. VR's replication of private data in its local LLC slice is also not as effective as R-NUCA's policy of placing private data locally, especially in OCEAN-C, FLUIDANIMATE and CONCOMP, whose working sets do not fit in the LLC.
The locality-aware protocol benefits from the optimizations in R-NUCA and tracks its performance and energy consumption. For the locality-aware protocol, an RT of 3 dominates an RT of 1 in FLUIDANIMATE because that benchmark exhibits significant off-chip miss rates (as evident from its energy and completion time breakdowns); hence, it is essential to balance on-chip locality against off-chip miss rate to achieve the best energy consumption and performance. While an RT of 1 replicates on every L1 cache miss, an RT of 3 replicates only if a reuse ≥ 3 is demonstrated. Using an RT of 3 reduces the off-chip miss rate in FLUIDANIMATE and provides the best performance and energy consumption. Using an RT of 3 also provides the maximum benefit in benchmarks such as OCEAN-C and OCEAN-NC.
As RT increases, the off-chip miss rate decreases but the LLC hit latency increases. For example, with an RT of 8, STREAMCLUSTER shows an increased completion time and network energy caused by repeated fetches of the cache line over the network. An RT of 3 would bring the cache line into the local LLC slice sooner, avoiding the unnecessary network traffic and its performance and energy impact. This is evident from the smaller "L1-To-LLC-Home" component of the completion time breakdown graph and the higher number of replica hits when using an RT of 3. We explored all values of RT between 1 and 8 and found that they provide no additional insight beyond the data points discussed here.
LU-NC exhibits migratory shared data. Such data exhibits exclusive use (both read and write accesses) by a unique core over a period of time before being handed to its next accessor. Replication of migratory shared data requires creation of a replica in an Exclusive coherence state. The locality-aware protocol makes LLC replicas for such data when sufficient reuse is detected. Since ASR does not replicate shared read-write data, it cannot show benefit for benchmarks with migratory shared data. VR, on the other hand, (almost) blindly replicates on all L1 evictions and performs on par with the locality-aware protocol for LU-NC.
To summarize, the locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance the on-chip data locality and off-chip miss rate, and overall, an RT of 3 achieves the best trade-off. It is also important to replicate all types of data: the selective replication of only certain types of data by R-NUCA (instructions) and ASR (instructions, shared read-only data) leads to sub-optimal energy and performance. Overall, the locality-aware protocol has 16%, 14%, 13% and 21% lower energy and 4%, 9%, 6% and 13% lower completion time compared to VR, ASR, R-NUCA and S-NUCA respectively.
4.2 LLC Replacement Policy
As discussed earlier in Section 2.2.4, we propose to use a modified-LRU replacement policy for the LLC. It first selects cache lines with the least number of sharers and then chooses the least recently used among them. This replacement policy improves energy consumption over the traditional LRU policy by 15% and 5%, and lowers completion time by 5% and 2%, in the BLACKSCHOLES and FACESIM benchmarks respectively. In all other benchmarks this replacement policy tracks the LRU policy.
4.3 Limited Locality Classifier
Figure 9 plots the energy and completion time of the benchmarks with the Limited_k classifier as k is varied over (1, 3, 5, 7, 64); k = 64 corresponds to the Complete classifier. The results are normalized to those of the Complete classifier. The benchmarks that are not shown behave identically to DEDUP, i.e., their completion time and energy stay constant as k varies. The experiments are run with the best RT value of 3 obtained in Section 4.1. We observe that the completion time and energy of the Limited_3 classifier never exceed those of the Complete classifier by more than 2%, except for STREAMCLUSTER.
With STREAMCLUSTER, the Limited_3 classifier starts off new sharers incorrectly in non-replica mode because of the limited number of cores available for taking the majority vote. This results in increased communication between the L1 cache and the LLC home location, leading to higher completion time and network energy.
Figure 9. Energy and completion time for the Limited_k classifier as a function of the number of tracked sharers (k = 1, 3, 5, 7, 64). The results are normalized to those of the Complete (= Limited_64) classifier.
The Limited_5 classifier, however, performs as well as the Complete classifier, but incurs an additional 9 KB storage overhead per core when compared to the Limited_3 classifier. From the previous section, we observe that the Limited_3 classifier performs better than all the other baselines for STREAMCLUSTER. Hence, to trade off the storage overhead of our classifier against the energy and completion time improvements, we chose k = 3 as the default for the limited classifier.
The Limited_1 classifier is more unstable than the other classifiers. While it performs better than the Complete classifier for LU-NC, it performs worse for the BARNES and STREAMCLUSTER benchmarks. The better energy consumption in LU-NC is due to the fact that the Limited_1 classifier starts off new sharers in replica mode as soon as the first sharer acquires replica status. On the other hand, the Complete classifier has to learn the mode independently for each sharer, leading to a longer training period.

4.4 Cluster Size Sensitivity Analysis
Figure 10 plots the energy and completion time for the locality-aware protocol when run using different cluster sizes. The experiment is run with the optimal RT of 3. Using a cluster size of 1 proved to be optimal, for several reasons.
In benchmarks such as BARNES, STREAMCLUSTER and BODYTRACK, where the working set fits within the LLC even with replication, moving from a cluster size of 1 to 64 reduced data locality without improving the LLC miss rate, thereby hurting energy and performance.

Benchmarks like RAYTRACE that contain a significant amount of read-only data with low degrees of sharing also do not benefit, since employing a cluster-based approach reduces data locality without improving the LLC miss rate. A cluster-based approach can be useful for exploring the trade-off between LLC data locality and miss rate only if the data is shared by most of the cores within a cluster.
Figure 10. Energy and completion time at cluster sizes of 1, 4, 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions.
In benchmarks such as RADIX and FLUIDANIMATE that show no usefulness for replication, the locality-aware protocol bypasses all replication mechanisms; hence, employing larger cluster sizes would not be any more useful than a smaller cluster size. Intelligently deciding which cache lines to replicate using an RT of 3 was enough to prevent any overheads of replication.
The above reasons, along with the added coherence complexity of clustering (as discussed in Section 2.3.4), motivate using a cluster size of 1, at least in the 64-core multicore target that we evaluate.
5 Related Work

CMP-NuRAPID [9] uses the idea of controlled replication to place data so as to optimize the distance to the LLC bank holding the data. The idea is to decouple the tag and data arrays and maintain private per-core tag arrays and a shared data array. The shared data array is divided into multiple banks based on distance from each core. A cache line is replicated in the cache bank closest to the requesting core on its second access, the second access being detected using the entry in the tag array. This scheme uses controlled replication to force just one cache line copy for read-write shared data, and forces write-through to maintain coherence for such lines using an additional Communication (C) coherence state. The decision to replicate read-only shared data on the second access is also limited since it does not take into account the LLC pressure due to replication.

CMP-NuRAPID does not scale with the number of cores since each private per-core tag array potentially has to store pointers to the entire data array. Results reported in [9] indicate that the private tag array size should only be twice the size of the per-cache-bank tag array, but this is because only a 4-core CMP is evaluated. In addition, CMP-NuRAPID requires snooping coherence to broadcast invalidate replicas, and complicates the coherence protocol significantly by introducing many additional race conditions that arise from the tag and data array decoupling.

Reactive-NUCA [13] replicates instructions in one LLC slice per cluster of cores. Shared data is never replicated and is always placed at a single LLC slice. The one-size-fits-all approach to handling instructions does not work for applications with heterogeneous instructions. In addition, our evaluation has shown significant opportunities for improvement through replication of shared read-only and shared read-mostly data.
Victim Replication (VR) [30] starts out with a private-L1, shared-L2 (LLC) organization and uses the requester's local LLC slice as a victim cache for data that is evicted from the L1 cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself, or has no sharers in the L1 cache. VR attempts to combine the low hit latency of a private LLC with the low off-chip miss rate of a shared LLC. The main drawback of this approach is that replicas are created without paying attention to LLC cache pressure, since a replacement candidate with no sharers exists in the LLC with high probability. Furthermore, the static policy of blindly replicating private and shared data in the local LLC slice (without paying attention to reuse) can lead to increased LLC misses.
Adaptive Selective Replication (ASR) [4] also replicates cache lines in the requester's local LLC slice on an L1 eviction. However, it only allows LLC replication for cache lines that are classified as shared read-only using an additional per-cache-line shared bit. ASR pays attention to LLC pressure by basing its replication decision on a probability that is obtained dynamically. The probability value is picked from discrete replication levels on a per-core basis; a higher replication level indicates that L1 eviction victims are replicated with a higher probability. The replication levels are decided using hardware monitoring circuits that quantify replication effectiveness based on the benefit (lower L2 hit latency) and cost (higher L2 miss latency) of replication.
A drawback of ASR is that the replication decision is based on coarse-grain information. The per-core counters do not capture the replication usefulness of cache lines with time-varying degrees of reuse. Furthermore, the restriction to replicating shared read-only data is limiting because it does not exploit reuse for other types of data. ASR assumes an 8-core processor, and its LLC lookup mechanism, which needs to search all LLC slices, does not scale to large core counts (although we model an efficient directory-based protocol in our evaluation).
6 Conclusion
We have proposed an intelligent locality-aware data replication scheme for the last-level cache. The locality is profiled at runtime using a low-overhead yet highly accurate in-hardware cache-line-level classifier. On a set of parallel benchmarks, our locality-aware protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes. The coherence complexity of our protocol is almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be created at the LLC slice of the requesting core. Our classifier is implemented with a 14.5KB storage overhead per 256KB LLC slice.
References
[1] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.
[2] DARPA UHPC Program (DARPA-BAA-10-37), March 2010.
[3] A. Agarwal, R. Simoni, J. L. Hennessy, and M. Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In International Symposium on Computer Architecture, 1988.
[4] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In International Symposium on Microarchitecture, MICRO 39, pages 443–454, 2006.
[5] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In IEEE Solid-State Circuits Conference, Feb. 2008.
[6] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques, 2008.
[7] S. Borkar. Thousand Core Chips: A Technology Perspective. In Design Automation Conference, 2007.
[8] L. M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput., 27(12):1112–1118, Dec. 1978.
[9] Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Optimizing Replication, Communication, and Capacity Allocation in CMPs. In International Symposium on Computer Architecture, ISCA '05, pages 357–368, 2005.
[10] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30(2), Mar. 2010.
[11] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[12] N. Eisley, L.-S. Peh, and L. Shang. In-Network Cache Coherence. In IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 321–332, 2006.
[13] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. In International Symposium on Computer Architecture, 2009.
[14] S. Iqbal, Y. Liang, and H. Grahn. ParMiBench - An Open-Source Benchmark for Embedded Multiprocessor Systems. Computer Architecture Letters, 9(2):45–48, Feb. 2010.
[15] A. Jaleel, E. Borch, M. Bhandaru, S. C. Steely Jr., and J. Emer. Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. In International Symposium on Microarchitecture, 2010.
[16] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-Threshold Voltage (NTV) Design - Opportunities and Challenges. In Design Automation Conference, 2012.
[17] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[18] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh, and N. K. Jha. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS. In International Conference on Computer Design, 2007.
[19] G. Kurian, O. Khan, and S. Devadas. The Locality-Aware Adaptive Cache Coherence Protocol. In International Symposium on Computer Architecture, ISCA '13, pages 523–534, New York, NY, USA, 2013. ACM.
[20] G. Kurian, J. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. Kimerling, and A. Agarwal. ATAC: A 1000-Core Cache-Coherent Processor with On-Chip Optical Network. In International Conference on Parallel Architectures and Compilation Techniques, 2010.
[21] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In International Symposium on Microarchitecture, 2009.
[22] M. M. K. Martin, M. D. Hill, and D. J. Sorin. Why On-Chip Cache Coherence Is Here to Stay. Commun. ACM, 55(7), 2012.
[23] J. E. Miller, H. Kasture, G. Kurian, C. G. III, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A Distributed Parallel Simulator for Multicores. In International Symposium on High-Performance Computer Architecture, 2010.
[24] D. Sanchez and C. Kozyrakis. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. In International Symposium on High-Performance Computer Architecture, 2012.
[25] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures in Computer Architecture, Morgan Claypool Publishers, 2011.
[26] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In International Symposium on Networks-on-Chip, 2012.
[27] S. Thoziyoor, J. H. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi. A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies. In International Symposium on Computer Architecture, pages 51–62, 2008.
[28] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In International Symposium on Computer Architecture, 1995.
[29] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos. A Tagless Coherence Directory. In International Symposium on Microarchitecture, 2009.
[30] M. Zhang and K. Asanovic. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In International Symposium on Computer Architecture, 2005.
[31] H. Zhao, A. Shriraman, and S. Dwarkadas. SPACE: Sharing Pattern-Based Directory Coherence for Multicore Scalability. In International Conference on Parallel Architectures and Compilation Techniques, pages 135–146, 2010.
Adaptive Placement and Migration Policy for an STT-RAM-Based Hybrid Cache
Zhe Wang†  Daniel A. Jiménez†  Cong Xu‡  Guangyu Sun♯  Yuan Xie‡§
†Texas A&M University  ‡Pennsylvania State University  ♯Peking University  §AMD Research
†{zhewang, djimenez}@cse.tamu.edu  ‡{czx102, yuanxie}@cse.psu.edu  ♯[email protected]
Abstract
Emerging Non-Volatile Memories (NVM) such as Spin-Torque Transfer RAM (STT-RAM) and Resistive RAM (RRAM) have been explored as potential alternatives for traditional SRAM-based Last-Level Caches (LLCs) due to the benefits of higher density and lower leakage power. However, NVM technologies have long latency and high energy overhead associated with write operations. Consequently, a hybrid STT-RAM and SRAM based LLC architecture has been proposed in the hope of exploiting the high density and low leakage power of STT-RAM and the low write overhead of SRAM. Such a hybrid cache design relies on an intelligent block placement policy that makes good use of the characteristics of both STT-RAM and SRAM technology.
In this paper, we propose an adaptive block placement and migration policy (APM) for hybrid caches. LLC write accesses are categorized into three classes: prefetch-write, demand-write, and core-write. Our proposed technique places a block into either STT-RAM lines or SRAM lines by adapting to the access pattern of each class. An access pattern predictor is proposed to direct block placement and migration, which can benefit from the high density and low leakage power of STT-RAM lines as well as the low write overhead of SRAM lines.
Our evaluation shows that the technique can improve performance and reduce LLC power consumption compared to both SRAM-based and STT-RAM-based LLCs with the same area footprint. It outperforms the SRAM-based LLC on average by 8.0% for single-thread workloads and 20.5% for multi-core workloads. The technique reduces power consumption in the LLC by 18.9% and 19.3% for single-thread and multi-core workloads, respectively.
1 Introduction
Emerging non-volatile memory (NVM) technologies, such as Phase-Change RAM (PCRAM), Resistive RAM (ReRAM), and Spin-Torque Transfer RAM (STT-RAM), are potential alternatives for future memory architecture design due to their higher density, better scalability, and lower leakage power. With all the NVM technology advances in recent years, it is anticipated that NVM technologies will move closer to market in the near future.
Among these three memory technologies, PCRAM and ReRAM have slower performance and very limited lifetime (usually in the range of 10^8-10^10 writes) and are therefore suitable only as storage-class memory or at most as main memory. STT-RAM, on the other hand, has better read/write latency and better lifetime reliability, so it can potentially be used as an SRAM replacement in last-level cache (LLC) design.
However, the latency and energy of write operations to STT-RAM are significantly higher than those of read operations. The long write latency can degrade performance by causing large write-induced interference to subsequent read requests, and high write energy can increase the power consumption of the LLC.
As both STT-RAM and SRAM have strengths and weaknesses, hybrid cache designs [25, 9, 3, 2, 26] have been proposed in the hope of benefiting from the high density and low leakage power of STT-RAM as well as the low write overhead of SRAM. In a hybrid cache, each set consists of a large portion of STT-RAM lines and a small portion of SRAM lines. Hybrid cache design depends on an intelligent block placement policy that achieves high performance by making use of the large capacity of STT-RAM while maintaining low write overhead using SRAM.
Recent work [25, 9, 3, 2, 26] proposes block placement and migration policies that map write-intensive blocks to SRAM lines. These prior proposals reduce the average number of write operations to STT-RAM lines. However, they incur frequent block migration [9, 3], require the compiler to provide static hints [3], mitigate write overhead only for a subset of write accesses [25], or trade off performance for power or endurance [9, 2, 26].
This paper describes a low cost adaptive block placement policy for a hybrid LLC called APM (Adaptive Placement and Migration). The technique places a cache block in the proper position according to its access pattern. LLC write
accesses are categorized into three distinct classes: core-write, prefetch-write, and demand-write. A core-write is a write from the core. For a write-through core cache, it is the write directly from the core writing through to the LLC; for a write-back core cache, it is dirty data evicted from the core cache and written back to the LLC. A prefetch-write is a write from an LLC replacement caused by a prefetch miss. A demand-write is a write from an LLC replacement caused by a demand miss. The proposed technique is based on the observation that block placement is often initiated by a write access, and the characteristics of each class of write enable each to be adapted to a different block placement policy. A low overhead and low complexity pattern predictor is used to predict the access pattern of each class of accesses for directing block placement.
Our contributions can be summarized as follows: (1) We analyze the LLC access pattern for the distinct write access types: prefetch-write, demand-write and core-write. We propose that each class should be adapted to a different block placement and migration policy for a hybrid LLC, and we design a low overhead access pattern predictor that predicts write burst blocks and dead blocks in the LLC, which can be used to direct block placement and migration. (2) We propose an intelligent cache placement and migration policy for a hybrid STT-RAM and SRAM based LLC. The technique optimizes block placement by adapting to the access pattern of each class of write accesses. (3) We show the proposed technique can improve system performance and reduce the power consumption of a hybrid LLC compared to both SRAM-based and STT-RAM-based LLCs with the same area. It outperforms an SRAM-based LLC on average by 8.0% for single-thread workloads and by 20.5% for multi-core workloads, and reduces the power consumption of the LLC by 18.9% and 19.3% respectively for single-thread and multi-core workloads compared with the baseline.
2 Background
2.1 Comparison of STT-RAM and SRAM Caches
Compared to SRAM, STT-RAM caches have higher density and lower leakage power, but higher write latency and write power consumption. Table 1 lists the technology features of the various STT-RAM and SRAM cache bank sizes used in our evaluation. The technology parameters are generated by NVSim [5], a performance, energy, and area model based on CACTI [8]. The cell parameters we used in NVSim are based on the projection from Wang et al. [28]. We assume a 22nm × 45nm MTJ built with 22nm CMOS technology. The SRAM cell parameters are estimated using CACTI [8].
The density of STT-RAM is currently 3×-4× higher than that of SRAM. Another benefit of STT-RAM is its low leakage power. Leakage power can dominate the total power consumption of large SRAM-based LLCs [15]; thus, the low leakage power consumption of STT-RAM makes it suitable for a large LLC. The disadvantages of STT-RAM are its long write latency and high write energy.
Figure 1. Distribution of LLC write accesses. Each type of write access accounts for a significant fraction of total write accesses.
Figure 2. An example illustrating read range and depth range:
    Ra, Ra, Rb, Rc, Rd, Ra, Wa, Rc, Ra, Wa, Rf, Rb, Rc, Rd, Re, Rm, Rn, Rs
    RR of block a: 4
    DR of the first Wa: 2
    DR of the second Wa: 0
2.2 Hybrid Cache Structure
The hybrid cache structure is composed of STT-RAM banks and SRAM banks. Each cache set consists of a large portion of STT-RAM cache lines and a small portion of SRAM cache lines distributed among multiple banks. The hybrid cache architecture relies on an intelligent block placement policy to bridge the performance and power gaps between STT-RAM and SRAM.
An intelligent block placement policy for hybrid cache design should be optimized for three requirements. First, the SRAM portion should service as many write requests as possible, thus minimizing the write overhead of the STT-RAM portion. However, sending many write operations to SRAM without considering the access pattern can cause misses due to the small capacity of SRAM, leading to performance degradation. Thus, the second requirement is that reused cache blocks should be placed in the LLC to maintain performance by hiding the memory access latency. Finally, the block placement policy should be a low overhead and low complexity design that does not incur frequent migration between cache lines.
3 Analysis of LLC Write Access Patterns
LLC block placement is often initiated by an LLC write access, which can be categorized into three classes: prefetch-write, core-write and demand-write. Figure 1 shows the breakdown of each class of LLC write accesses for 17 memory intensive SPEC CPU2006 [7] benchmarks. The study is performed using the MARSSx86 [19] simulator with a single-core configuration and a 4MB LLC. We implement a middle-of-the-road stream prefetcher that models the prefetcher of the Intel Core i7 [1]. From Figure 1, we can see that each type of write access accounts for a significant fraction of total write accesses. Prefetch-writes account for 21.9% on average, while core-writes and demand-writes account for 45.6% and 32.5%, respectively. In this section, we analyze the access pattern of each write access type and suggest a block placement policy that adapts to the access pattern of each class.

Memory Type               1M SRAM   2M SRAM   2M STT-RAM   4M STT-RAM
Area (mm2)                0.825     1.650     0.518        1.035
Read Latency (ns)         1.751     2.017     2.681        2.759
Write Latency (ns)        1.530     1.663     10.954       10.993
Read Energy (nJ/access)   0.055     0.072     0.132        0.142
Write Energy (nJ/access)  0.039     0.056     0.608        0.618
Leakage Power (mW)        29.798    59.596    7.108        14.216
Table 1. Characteristics of SRAM and STT-RAM caches (22nm, temperature=350K)
We first define the terminology that will be used later in this section for pattern analysis. To be clear, when we write "block" we mean a block of data apart from its physical realization; when we write "line" we mean the physical frame, whether in STT-RAM or SRAM, where a block may reside.
• Read-range: The read-range is a property of a cache block that fills the LLC by a demand-write or prefetch-write request. It is the largest interval between consecutive reads of the block from the time it is filled into the LLC until the time it is evicted.
• Depth-range: The depth-range is a property of a core-write access. It is the largest interval between accesses to the block from the current core-write access until the next core-write access to the same block. The "depth" refers to how deep the block descends into the LRU recency stack before it is accessed again.
We use an example to illustrate the read range and depth range. Figure 2 shows the behavior of a block a from the time it fills into an 8-way set until it is evicted. In the example, Ra represents "read block a" while Wa represents "write block a". The largest re-read interval of block a during the time it resides in the cache is 4, which is the read interval between the second Ra and the third Ra; thus, the read range (RR) of block a is 4. The depth range (DR) of the first Wa access is 2, which is the access interval between the first Wa access and the fourth Ra access. The depth range of the second Wa access is 0, meaning a is not re-written from the second Wa until it is evicted from the LLC. We further classify the read/depth-range into three types: zero-read/depth-range, immediate-read/depth-range and distant-read/depth-range (a short sketch computing these ranges follows the list below).
• Zero-read/depth-range: Data is filled into the LLC by a prefetch or demand request/core-write request, and it is never read/written again before it is evicted.
• Immediate-read/depth-range: The read/depth-range is smaller than a parameter m. We set m = 2, which is the same as the number of SRAM ways in our hybrid cache configuration in Section 5.
• Distant-read/depth-range: The read/depth-range is larger than m = 2 and at most the associativity of the cache set, which is 16 in our configuration.
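To make the definitions concrete, the following Python sketch (ours, not the paper's; the function names and trace encoding are illustrative) computes the read range and depth range over an access trace and reproduces the Figure 2 values.

    # Hypothetical sketch: compute read-range (RR) and depth-range (DR)
    # over a trace of (op, block) pairs, where op is 'R' or 'W'.

    def read_range(trace, block):
        """Largest interval between consecutive reads of `block`."""
        reads = [i for i, (op, b) in enumerate(trace) if op == 'R' and b == block]
        return max((j - i for i, j in zip(reads, reads[1:])), default=0)

    def depth_range(trace, write_pos):
        """Largest access interval from the core-write at `write_pos` to the
        next core-write of the same block; 0 if the block is never re-written."""
        _, block = trace[write_pos]
        last, dr = write_pos, 0
        for i in range(write_pos + 1, len(trace)):
            op, b = trace[i]
            if b != block:
                continue
            dr = max(dr, i - last)
            last = i
            if op == 'W':
                return dr
        return 0

    def classify(r, m=2):
        """Zero/immediate/distant classification (boundary r == m read as distant)."""
        return 'zero' if r == 0 else ('immediate' if r < m else 'distant')

    trace = [('R','a'), ('R','a'), ('R','b'), ('R','c'), ('R','d'), ('R','a'),
             ('W','a'), ('R','c'), ('R','a'), ('W','a'), ('R','f'), ('R','b'),
             ('R','c'), ('R','d'), ('R','e'), ('R','m'), ('R','n'), ('R','s')]
    assert read_range(trace, 'a') == 4       # RR of block a
    assert depth_range(trace, 6) == 2        # DR of the first Wa
    assert depth_range(trace, 9) == 0        # DR of the second Wa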
3.1 Pattern Analysis of Prefetch-Write Blocks
Prefetching data into the cache before it is accessed can improve performance by hiding memory access latency. However, prefetching can also induce cache pollution through inaccurate prefetch requests.
We analyze the access pattern of LLC prefetch-write blocks using read-range. The first bar in Figure 3 shows each type of access pattern as a fraction of the total number of prefetch-write blocks. Zero-read-range prefetch-write blocks are inaccurately prefetched blocks, accounting for 26% of all prefetch blocks. Placing zero-read-range prefetch-write blocks into STT-RAM lines causes pollution and high write overhead; thus, zero-read-range prefetch-write blocks should be placed in SRAM lines.
The immediate-read-range access pattern is related to cache bursts. After an initial burst of references, a cache block becomes dead, i.e., it is never used again prior to eviction. Of all prefetch-write blocks, 56.9% are immediate-read-range blocks. Immediate-read-range prefetch-write blocks should be placed in SRAM lines so they can be accessed by subsequent demand requests without incurring write operations to STT-RAM lines. Moreover, placing immediate-read-range prefetch-write blocks in SRAM allows them to be evicted when they are dead, reducing pollution.
Distant-read-range prefetch-write blocks should be placed in STT-RAM lines to make use of the large capacity and avoid cache misses. Such blocks account for only 17.5% of all prefetch blocks.
Zero-read-range and immediate-read-range prefetch-write blocks together account for 82.5% of all prefetch-write blocks.
Figure 3. The distribution of access patterns for each type of LLC write access (zero-, immediate- and distant-read/depth-range; 1st bar: prefetch-write, 2nd bar: core-write, 3rd bar: demand-write).
Thus, we suggest the initial placement of prefetch-write blocks be in SRAM. Once a block is evicted from the SRAM, if it is a distant-read-range block, i.e., it is still live, it should be migrated to the STT-RAM lines; otherwise, the block is dead and should be evicted from the LLC. Since only 17.5% of prefetch blocks are distant-read-range blocks, the migration will not cause significantly increased traffic.
3.2 Pattern Analysis of Core-Write Accesses
In our design, if a core-write access misses in the LLC, the data is written back to main memory directly. Thus, our core-write placement policy is only designed for core-write hit accesses.
We analyze the access pattern of LLC core-write accesses using depth-range. The second bar in Figure 3 shows the access pattern for core-write accesses. Zero-depth-range accesses account for 49.1% of all core-write accesses. Though the data written by a zero-depth-range core-write access will not be rewritten before it is evicted, it still has some chance of being read again; thus, we suggest leaving zero-depth-range data in its original cache line to avoid read misses and block migrations.
Immediate-depth-range accesses account for 32.9% of total core-write accesses. These are the write-intensive accesses with write burst behavior, so immediate-depth-range data should be placed in SRAM lines for low write overhead. Distant-depth-range data should remain in its original cache line, thus minimizing migration overhead.
3.3 Pattern Analysis of Demand-Write Blocks
The access pattern of demand-write blocks is analyzed using read-range. Zero-read-range demand-write blocks, also known in the literature as "dead-on-arrival" blocks, are brought to the LLC by a demand request and never referenced again before being evicted. It is unnecessary to place a zero-read-range demand-write block into the LLC, so the block should bypass the cache (assuming a non-inclusive cache). The third bar in Figure 3 shows that dead-on-arrival blocks account for 61.2% of LLC demand-write blocks. Thus, bypassing dead-on-arrival blocks can significantly reduce write operations to the LLC. Moreover, bypassing can improve cache efficiency by allowing the LLC to save space for other useful blocks.
The immediate-read-range and distant-read-range demand-write blocks account for 38.8% of the total demand-write blocks. We suggest placing them in the STT-RAM ways to make use of the large capacity of the STT-RAM portion and reduce pressure on the SRAM portion.
3.4 Pattern Analysis Conclusions
Each class of LLC write access can be given a different placement policy, and the access pattern of each class can be used to guide that policy. From the analysis of the access pattern of each access class, we draw the following conclusions: (1) The initial placement of prefetch-write blocks should be in SRAM lines. (2) Write-burst core-write data should be placed in SRAM lines, while other types of core-write data should remain in their original cache lines. (3) Dead-on-arrival demand-write blocks should bypass the LLC, while the other types of demand-write blocks should be placed in STT-RAM lines. (4) When a block is evicted from SRAM, if it is live, it should be migrated to STT-RAM lines to avoid LLC misses.
4 Adaptive Block Placement and Migration Policy
The design of the block placement and migration policy is guided by the access pattern of each write access type. An access pattern predictor is proposed to predict write-burst blocks and dead blocks. The information provided by the access pattern predictor is used to direct bypassing and migration of blocks between STT-RAM lines and SRAM lines. The policy targets reducing write overhead by allowing the SRAM portion to service as many write requests as possible, and attains high performance by benefiting from the high density of the STT-RAM portion.
4.1 Policy Design
Figure 4 shows a flow-chart of the technique. In our design, a prediction bit is associated with each block in the LLC, representing whether the block is predicted dead. An access to the LLC searches all the STT-RAM lines and SRAM lines in the set. On a prefetch miss, the prefetched data is placed into an SRAM line and the prediction bit is set to 1, meaning we assume the prefetched block is dead on arrival. For every demand hit request in the SRAM lines, the access pattern predictor makes a prediction about whether the block is dead. Once the block is evicted from the SRAM lines, if it is predicted dead, it is evicted from the LLC; otherwise, it is migrated to the STT-RAM lines. In this way, if a prefetched block is never accessed before it is evicted from the SRAM lines, it is taken as an inaccurately prefetched block and evicted, based on the observation that accurately prefetched blocks are usually accessed soon by subsequent demand requests.
Figure 4. Flow-chart of the adaptive block placement and migration mechanism.
On a core-write request to the LLC, if it is an LLC miss, the data is written to main memory directly. For a core-write hit request in the STT-RAM lines, if it is predicted to be a write burst access, the data is written to the SRAM lines with the prediction bit set to 0, indicating the block is predicted live; the prior copy in the STT-RAM line is invalidated. Then, as with prefetched blocks, once the data is evicted from the SRAM portion, a predicted-dead block is evicted from the LLC while a live block is migrated to the STT-RAM portion.
On a demand miss to the LLC, the data is fetched from main memory. If the fetched block is predicted to be a dead-on-arrival block, it bypasses the LLC; otherwise, the block is placed in the STT-RAM lines.
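The flow above can be condensed into a short, runnable Python sketch (our reconstruction under simplifying assumptions: a single set, LRU within each portion, and predictor outcomes passed in as booleans; all names are illustrative, not the paper's):

    from collections import OrderedDict

    class HybridSet:
        """One hybrid cache set: an STT-RAM portion and an SRAM portion, each LRU."""
        def __init__(self, sttram_ways=16, sram_ways=2):
            self.sttram = OrderedDict()            # tag -> True (no per-line state needed)
            self.sram = OrderedDict()              # tag -> dead-prediction bit
            self.sttram_ways, self.sram_ways = sttram_ways, sram_ways

        def fill_sttram(self, tag):
            if len(self.sttram) >= self.sttram_ways:
                self.sttram.popitem(last=False)    # plain LRU eviction from STT-RAM
            self.sttram[tag] = True

        def fill_sram(self, tag, dead):
            if len(self.sram) >= self.sram_ways:
                victim, victim_dead = self.sram.popitem(last=False)
                if not victim_dead:                # live SRAM victim: migrate to STT-RAM
                    self.fill_sttram(victim)
            self.sram[tag] = dead

    def llc_access(cset, kind, tag, pred_dead=False, pred_burst=False):
        """kind is 'prefetch', 'demand' or 'core_write'; pred_* are predictor outputs."""
        if kind == 'prefetch':
            if tag not in cset.sram and tag not in cset.sttram:
                cset.fill_sram(tag, dead=True)     # assume prefetches are dead on arrival
        elif kind == 'demand':
            if tag in cset.sram:
                cset.sram[tag] = pred_dead         # re-predict on every SRAM demand hit
                cset.sram.move_to_end(tag)
            elif tag in cset.sttram:
                cset.sttram.move_to_end(tag)
            elif not pred_dead:
                cset.fill_sttram(tag)              # dead-on-arrival blocks bypass the LLC
        elif kind == 'core_write':
            if tag in cset.sttram and pred_burst:
                del cset.sttram[tag]               # invalidate the STT-RAM copy and
                cset.fill_sram(tag, dead=False)    # service the write burst from SRAM
            elif tag in cset.sram:
                cset.sram.move_to_end(tag)
            # core-write miss: written through to main memory (not modeled here)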
Minimizing Write Overhead  The proposed scheme reduces write operations to the STT-RAM portion in the following ways. First, bypassing reduces write operations to STT-RAM lines caused by dead-on-arrival requests. Second, SRAM lines filter the write operations caused by inaccurate and immediate-read-range prefetch requests. Finally, core-write-intensive blocks are placed in the SRAM lines, reducing the write operations to STT-RAM lines caused by write burst behavior.
Figure 5. System structure (sampled core cache misses, core-write requests and prefetch requests drive a pattern simulator that updates the prediction table, which supplies data access predictions to the LLC).
Attaining High Performance  The block placement policy attains high performance for the hybrid cache by benefiting from the high density of the STT-RAM portion. Specifically, distant-read-range blocks are placed in STT-RAM lines, which reduces cache misses by making use of the large capacity of STT-RAM. Also, bypassing zero-read-range demand blocks saves space in the LLC for other useful data. Moreover, filtering zero-read-range prefetch blocks and immediate-read-range blocks using SRAM lines can improve cache efficiency by reducing inaccurately prefetched blocks and dead blocks in the LLC.
The technique relies on an access pattern predictor to direct block bypassing and migration.
4.2 Access Pattern Predictor
The goal of the access pattern predictor is to predict dead blocks and write burst blocks for guiding block placement. The predicted dead blocks are used to direct bypassing and block migration from SRAM lines to STT-RAM lines; write burst predictions are used to guide block migration from STT-RAM lines to SRAM lines for core-write accesses. The access pattern predictor consists of a prediction table and a pattern simulator, as shown in Figure 5. The prediction table is composed of a dead block prediction table and a write burst prediction table, which have the same structure but make predictions for different types of accesses. The pattern simulator is used to learn the access patterns for updating the prediction table.
Figure 6. An example illustrating the set behavior of the pattern simulator (DEC/INC DCNT PCR*: decrease/increase the counter in the dead block prediction table indexed by PCR*; DEC/INC WCNT PCW*: decrease/increase the counter in the write burst prediction table indexed by PCW*; I: invalid; CW: core-write access hit).
The design of the access pattern predictor is inspired by the sampling dead block predictor (SDBP) [13]. However, the SDBP predicts dead blocks taking into account only demand accesses, while the access pattern predictor predicts both dead blocks and write-intensive blocks by considering all types of LLC accesses. The predictor predicts the access pattern using the Program Counter (PC) of memory access instructions. The intuition is that cache access patterns can be classified based on the instructions making the memory accesses: if a given memory access instruction PC led to some access pattern in previous accesses, then the future access pattern of the same PC will be similar.
4.2.1 Making Predictions
The access pattern predictor makes a prediction under the following three conditions: (1) When a core-write request hits in the STT-RAM lines, the write burst prediction table is accessed to predict whether it is a write burst request. (2) For each read hit request in the SRAM lines, the dead block prediction table is accessed to predict whether the block is dead. (3) On a demand-write request, the dead block prediction table is accessed to predict whether it is a dead-on-arrival request.
The dead block prediction and write burst prediction tables have the same structure. Each entry in a prediction table has a two-bit saturating counter. When making a prediction, bits from the related memory access instruction PC are hashed to index a prediction table entry, and the prediction result is generated by thresholding the counter value in that entry. In our implementation, we use a skewed organization [29, 13, 18] to reduce the impact of conflicts in the hash table.
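The following sketch (ours; the number of skewed banks, the hash functions and the threshold are assumptions, not the paper's exact parameters) shows one way to realize such a skewed, PC-indexed table of two-bit saturating counters.

    class SkewedPredictionTable:
        """PC-indexed prediction table with 2-bit saturating counters,
        replicated across a few differently-hashed (skewed) banks."""
        def __init__(self, num_tables=3, entries=4096, threshold=2):
            self.tables = [[0] * entries for _ in range(num_tables)]
            self.entries = entries
            self.threshold = threshold

        def _indices(self, pc):
            # One skewed hash per bank to decorrelate conflicts (constants are arbitrary).
            return [((pc >> (4 * k)) ^ ((pc * 0x9E3779B1) >> (7 + 4 * k))) % self.entries
                    for k in range(len(self.tables))]

        def predict(self, pc):
            # Sum the per-bank counters and threshold the total.
            votes = sum(t[i] for t, i in zip(self.tables, self._indices(pc)))
            return votes >= self.threshold * len(self.tables)

        def update(self, pc, outcome):
            # Saturating update: count up on a positive outcome, down otherwise.
            for t, i in zip(self.tables, self._indices(pc)):
                t[i] = min(3, t[i] + 1) if outcome else max(0, t[i] - 1)

Two instances of such a table, one trained on dead/live outcomes and one on write-burst outcomes, would stand in for the dead block and write burst prediction tables.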
4.2.2 Updating the Predictor
The pattern simulator samples LLC sets and simulates the access behavior of the LLC using the sampled sets. It updates the prediction table by learning the access patterns of PCs from the simulated LLC access behavior, targeting the dead/live behavior and write burst behavior of LLC blocks.
Each simulated set in the pattern simulator has a tag field, a read PC field, an LRU field, a write PC field, a write LRU field and a valid bit field. The partial tag and partial read/write PC, which are the lower 16 bits of the full tag and full read/write PC, are stored in the tag field and read/write PC fields. The read PC field is for learning the dead/live behavior of a block, while the write PC field is for learning its write-burst behavior.
A write burst occurs within a small number of cache lines, so the write PC field can have a small associativity. The pattern simulator thus consists of two parts: the tag array with its related read PC field, LRU field and valid bit field, which share the same associativity, and the write PC field with its write LRU field, which have a smaller associativity. In our implementation, we found that 4-way associativity for the write PC field and 12-way associativity for the read PC field yield the best performance, while the associativity of the LLC is 18.
The behavior of a pattern simulator set is illustrated in Figure 6 using an example access pattern. In the example, the associativity of the tag and read PC fields is 4 while the associativity of the write PC field is 2. On each demand hit request, the pattern simulator decrements the counter in the dead block prediction table entry indexed by the related read PC, indicating "not dead." When a block is evicted from the simulator set, the pattern simulator increments the counter in the dead block prediction table entry indexed by the related read PC, indicating "dead." The LRU recency is updated on every demand request and prefetch-write request. The write PC field learns the write-burst behavior of core-write requests: on each core-write hit request, the simulator increments the counter in the write burst prediction table entry indexed by the related write PC, indicating "write burst," and when a block is evicted from the write field, the simulator decrements the counter in the write burst prediction table entry indexed by the related write PC, indicating "not write burst."
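A compact sketch of this training loop, reusing the SkewedPredictionTable sketched above (again our reconstruction; the field widths follow the text, the rest is assumed):

    from collections import OrderedDict

    class SimulatorSet:
        """One sampled set: a 12-way read/tag part and a 4-way write part, each LRU."""
        def __init__(self, dead_tbl, burst_tbl, read_ways=12, write_ways=4):
            self.read = OrderedDict()    # partial tag -> partial PC of last read
            self.write = OrderedDict()   # partial tag -> partial PC of last core-write
            self.read_ways, self.write_ways = read_ways, write_ways
            self.dead_tbl, self.burst_tbl = dead_tbl, burst_tbl

        def demand_access(self, tag, pc):
            if tag in self.read:
                # Block reused: train "not dead" on the PC that last touched it.
                self.dead_tbl.update(self.read[tag], outcome=False)
                self.read.move_to_end(tag)
            elif len(self.read) >= self.read_ways:
                # LRU eviction from the simulated set: train "dead" on its last PC.
                _, victim_pc = self.read.popitem(last=False)
                self.dead_tbl.update(victim_pc, outcome=True)
            self.read[tag] = pc & 0xFFFF             # keep only a 16-bit partial PC

        def core_write_hit(self, tag, pc):
            if tag in self.write:
                # Re-written while still in the small write field: a write burst.
                self.burst_tbl.update(self.write[tag], outcome=True)
                self.write.move_to_end(tag)
            elif len(self.write) >= self.write_ways:
                # Aged out of the write field without a re-write: not a burst.
                _, victim_pc = self.write.popitem(last=False)
                self.burst_tbl.update(victim_pc, outcome=False)
            self.write[tag] = pc & 0xFFFF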
5 Evaluation
5.1 Methodology

Execution core   4.0GHz, 1-core/4-core CMP, out of order, 128-entry reorder buffer,
                 48-entry load queue, 44-entry store queue, 4-wide issue/decode
Caches           L1 I-cache: 64KB/2-way, private, 2-cycle, 64-byte block size, LRU;
                 L1 D-cache: 64KB/2-way, private, 2-cycle, 64-byte block size, LRU;
                 L2 cache: shared, 64-byte block size, LRU
DRAM             DDR3-1333, open-page policy, 2 channels, 8 banks/channel,
                 FR-FCFS [21] policy, 32-entry/channel write buffer,
                 drain_when_full write buffer policy
Table 2. System configuration

Name          Technique
SRAM          SRAM-based LLC
STT-RAM       STT-RAM-based LLC
OPT-STT-RAM   STT-RAM-based LLC assuming symmetric read/write overhead
Sun-Hybrid    Hybrid LLC technique as described in [25]
APM           Adaptive placement and migration based hybrid cache as described in Section 4
Table 3. Legend for the various LLC techniques

We use MARSSx86 [19], a cycle-accurate simulator for the 64-bit x86 instruction set. The DRAMSim2 [22] simulator is integrated into MARSSx86 to simulate a DDR3-1333 system. Table 2 lists the system configuration. We model a per-core LLC middle-of-the-road stream prefetcher [24, 1] with 32 streams for each core. The prefetcher looks up the stream table at each LLC request to issue eligible prefetch requests. The LLC is configured with multiple banks; requests to different banks are serviced in parallel, while requests within the same bank are pipelined. The LLC is implemented with single-port memory bitcells. We obtain the STT-RAM and SRAM parameters using NVSim [5] and CACTI [8], as shown in Table 1.
Name Benchmarks
Mix 1 milc gcc xalancbmk tonto
Mix 2 gamess soplex libquantum perlbench
Mix 3 gcc sphinx3 GemsFDTD tonto
Mix 4 lbm mcf cactusADM GemsFDTD
Mix 5 zeusmp bzip2 astar libquantum
Mix 6 mcf soplex zeusmp bwaves
Mix 7 omnetpp lbm cactusADM sphinx3
Mix 8 bwaves libquantum mcf GemsFDTD
Mix 9 omnetpp cactusADM tonto gcc
Mix 10 soplex mcf bzip2 gcc
Mix 11 perlbench sphinx3 libquantum lbm
Table 4. Multi-Core workloads
The SPEC CPU2006 [7] benchmarks are used for the evaluation. We evaluate five LLC techniques with the same area configuration. Table 3 shows the legend for these techniques as referred to in the graphs that follow. The OPT-STT-RAM technique optimistically assumes that write operations have access latency similar to read operations. The Sun-Hybrid [25] technique assumes that a cache block that has been written to the LLC twice consecutively is a write-intensive block, and migrates write-intensive blocks to SRAM lines to reduce the write operations to STT-RAM lines. The APM technique is our proposed adaptive block placement and migration based hybrid cache technique as described in Section 4. We modify MARSSx86 to support all the LLC types listed in Table 3.
Single-Core Workloads and LLC Configuration  Of the 29 SPEC CPU2006 benchmarks, 22 can be compiled and run with our infrastructure. We use all 22 of these benchmarks for evaluation, including both memory-intensive and non-memory-intensive benchmarks. For each workload, we simulate 250 million instructions from a typical phase identified by SimPoint [23].
The various LLC techniques are evaluated at the same area. A 2MB SRAM has area similar to a 6MB STT-RAM. Thus, we evaluate a 16-way SRAM with 2MB capacity and a 24-way STT-RAM/OPT-STT-RAM with 6MB capacity. The hybrid cache design for the APM technique has 16 STT-RAM lines and 2 SRAM lines in each set, and hence we evaluate a 4.5MB APM hybrid cache, which has the same area as a 2MB SRAM. In the Sun-Hybrid technique, each cache set allocates 1 SRAM line; thus we evaluate a hybrid cache with 20 STT-RAM lines and 1 SRAM line per set and a 5.25MB capacity for the Sun-Hybrid technique. We implement a 1MB SRAM cache bank and a 2MB STT-RAM cache bank for the single-core configuration, yielding the best trade-off between access latency and bank-level parallelism. If the capacity of the SRAM is smaller than 1MB, it is configured as one bank.
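As a quick sanity check of these area-matched configurations, the Table 1 bank areas can be combined under a linear-scaling assumption (our back-of-envelope arithmetic, not the paper's methodology):

    # Approximate areas from Table 1, assuming area scales linearly with capacity.
    area_sram_2mb = 1.650                    # mm^2
    area_stt_2mb, area_stt_4mb = 0.518, 1.035

    density_ratio = area_sram_2mb / area_stt_2mb
    print(f"STT-RAM density advantage: {density_ratio:.1f}x")   # ~3.2x, within the 3x-4x claim

    area_stt_6mb = area_stt_2mb + area_stt_4mb                  # 6MB STT-RAM from 2MB+4MB banks
    print(f"6MB STT-RAM: {area_stt_6mb:.2f} mm^2 vs 2MB SRAM: {area_sram_2mb:.2f} mm^2")

    # 4.5MB APM hybrid = 4MB STT-RAM portion + 0.5MB SRAM portion.
    area_apm = area_stt_4mb + area_sram_2mb / 4
    print(f"4.5MB APM hybrid: {area_apm:.2f} mm^2")             # ~1.45 mm^2, close to 2MB SRAM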
Figure 7. The distribution of write accesses to STT-RAM lines in the APM LLC for single-core applications (1st bar: all write operations; 2nd bar: write operations to STT-RAM; broken down into core-write, demand-write and prefetch-write).
Figure 8. The comparison of IPC for single-core applications, normalized to a 2MB SRAM LLC (6M-STT-RAM, 5.25M-Sun-Hybrid, 4.5M-APM and 6M-OPT-STT-RAM).
Multi-Core Workloads and LLC Configuration  We use quad-core workloads for evaluation. Table 4 shows eleven mixes of SPEC CPU2006 benchmarks with a variety of memory behaviors. For each mix, we run the experiment for 1 billion instructions in total across the four cores, starting from the typical phase, with each benchmark running simultaneously with the others. For the multi-core configuration, we evaluate a 16-way SRAM with 8MB capacity, a 24-way STT-RAM/OPT-STT-RAM with 24MB capacity, the hybrid APM technique with 16 STT-RAM ways and 2 SRAM ways and 18MB capacity, and the Sun-Hybrid technique with 20 STT-RAM ways and 1 SRAM way and 21MB capacity. In the multi-core configuration, we use a 2MB SRAM cache bank and a 4MB STT-RAM cache bank, which yield the best performance. If the capacity of the SRAM is smaller than 2MB, it is configured as one bank.
5.2 Evaluation Results for the Single-Core Configuration
5.2.1 Reduced Writes and Endurance Evaluation
The APM technique allows the SRAM lines to service as many write requests as possible, thus reducing write operations to the STT-RAM lines. Figure 7 shows the distribution of write operations to STT-RAM lines in the APM technique, normalized to all write operations to the LLC, for the single-core workloads. We can see that the APM LLC reduces write operations to STT-RAM lines for each type of write access. In the APM LLC, only 32.9% of the total LLC write requests are serviced by the STT-RAM portion, significantly reducing the write overhead of the STT-RAM portion and translating into performance improvement and power reduction in the LLC.
5.2.2 Performance Evaluation
Figure 8 shows the speedup of the various techniques compared with the baseline, a 2MB SRAM LLC. The 6MB STT-RAM LLC has area similar to the 2MB SRAM LLC. It improves performance by 6.2% on average due to the increased capacity. Most of the benchmarks benefit from the increased capacity of STT-RAM; however, several benchmarks such as gcc, milc, libquantum and lbm suffer more from the large write overhead of STT-RAM. The OPT-STT-RAM LLC assumes symmetric read/write latency, meaning the large write-induced interference to read requests is removed; it yields a geometric mean speedup of 9.3%. The performance difference between OPT-STT-RAM and STT-RAM is caused by the long write latency of STT-RAM. The 4.5MB APM LLC reduces the write overhead of the STT-RAM portion, delivering a geometric mean speedup of 8.0%. It yields even higher speedups for the benchmarks 462.libquantum, 482.sphinx3, and 483.xalancbmk than the 6MB OPT-STT-RAM LLC, because those workloads generate a large number of dead blocks or inaccurately prefetched blocks in the LLC, reducing LLC efficiency; the APM technique reduces the LLC pollution caused by dead blocks and inaccurately prefetched blocks, thus improving the LLC efficiency for those workloads. The 5.25MB Sun-Hybrid LLC improves performance by 5.0% on average.
Figure 9. The power breakdown for single-core applications, normalized to 2MB SRAM (leakage, read and write power; 1st bar: 2M-SRAM, 2nd bar: 6M-STT-RAM, 3rd bar: 5.25M-Sun-Hybrid, 4th bar: 4.5M-APM).
Figure 10. The distribution of write accesses to STT-RAM lines in the APM LLC for multi-core applications (1st bar: all write operations; 2nd bar: write operations to STT-RAM).
5.2.3 Power Evaluation
Figure 9 shows the normalized power consumption of the various techniques, broken down into leakage power, dynamic power caused by reads, and dynamic power caused by writes. The baseline is a 2MB SRAM. For the SRAM technique, leakage power dominates the total power consumption. The STT-RAM technique consumes lower leakage power; however, its dynamic power caused by writes increases significantly due to the large write energy of STT-RAM, so the STT-RAM technique increases the overall power consumption by 11.9% on average. The APM technique reduces write operations to the STT-RAM portion, thus reducing the dynamic power caused by writes; it reduces the overall power consumption by 18.9% on average compared with the baseline. The Sun-Hybrid technique does not significantly reduce the write power because it does not reduce the write operations to STT-RAM caused by LLC replacement.
Figure 11. The comparison of IPC for multi-core applications, normalized to 8MB SRAM (24M-STT-RAM, 21M-Sun-Hybrid, 18M-APM and 24M-OPT-STT-RAM).
5.2.4 Extra LLC Traffic Evaluation
The APM technique migrates blocks between SRAM lines and STT-RAM lines, and migrating blocks from SRAM lines to STT-RAM lines causes extra cache traffic. We evaluate the LLC traffic caused by migration: it adds only 3.8% extra LLC traffic. Most of the blocks evicted from the SRAM ways are dead blocks, so only a small number of distant-read-range blocks need to be migrated to STT-RAM; thus migration does not add significant traffic overhead.
5.3 Evaluation Results for the Multi-Core Configuration
5.3.1 Reduced Writes, Endurance, Performance and Power Evaluation
Figure 10 shows the distribution of write operations to STT-RAM lines, normalized to all write operations to the LLC, for the multi-core workloads. The APM technique reduces write operations to the STT-RAM portion to 28.9% of the total number of write operations on average.
Figure 11 shows the speedups of the various techniques for the multi-core workloads, normalized to an 8MB SRAM LLC. The 24MB STT-RAM LLC improves performance by 14.8% on average. Removing the write-induced interference caused by asymmetric writes improves the average performance by 18.7% in the OPT-STT-RAM LLC. The 18MB APM technique reduces the write overhead of the STT-RAM portion and achieves a geometric mean speedup of 20.5%, which is higher than that of the 24MB OPT-STT-RAM LLC. In multi-core workloads, the large number of dead or inaccurately prefetched blocks generated by one workload can also negatively affect the performance of the other workloads, significantly reducing performance. The APM technique reduces the cache pollution caused by dead blocks and inaccurately prefetched blocks. Our evaluation shows the 18MB APM LLC yields better performance and fewer misses for 5 out of the 11 multi-core workloads compared with the 24MB OPT-STT-RAM LLC.
Figure 12 shows the distribution of normalized power consumption for the various techniques. The baseline is the 8MB SRAM cache. For a large LLC, the majority of the power consumption comes from leakage power. The 24MB STT-RAM technique reduces the overall power consumption to 87.8% of the baseline due to its low leakage power. The APM technique additionally reduces the dynamic power caused by STT-RAM write operations, further reducing the power consumption to 80.7% of the baseline on average.
5.3.2 Prediction Evaluation
We evaluate the access pattern predictor using false positive rate and coverage. Mispredictions can be false positives or false negatives. False positives are more harmful for two reasons: mispredicting a live block as a dead block can cause a live block to be bypassed or evicted from the LLC early, generating LLC misses, and mispredicting a non-write-burst request as a write-burst request can cause extra migrations between STT-RAM lines and SRAM lines. The false positive rate is measured as the number of mispredicted positive predictions divided by the total number of predictions. Across the 11 multi-core workloads, the access pattern predictor yields a low false positive rate, ranging from 2.1% to 14.8% with a geometric mean of 8.3%, and achieves an average coverage of 71.7%. Thus, the majority of dead blocks and write burst blocks are predicted by the access pattern predictor. The false positive and coverage rates are not shown in graphs for lack of space.
5.3.3 Memory Energy Evaluation
Figure 13 shows the memory energy evaluation results, normalized to the 8MB SRAM LLC. The 24MB STT-RAM LLC reduces the average memory energy to 72.9% of the baseline; the 18MB APM LLC reduces it to 72.4%. Compared with the 24MB STT-RAM LLC, the 18MB APM LLC increases average memory traffic by 5.6% due to its smaller capacity. However, it does not increase the dynamic energy consumption because it consumes less activation/precharge energy: the APM LLC achieves a higher DRAM row-buffer hit rate for write requests than the STT-RAM LLC. A large LLC filters the locality of dirty blocks, so dirty blocks have low spatial locality by the time they are evicted from the LLC and written back to main memory. In the APM LLC, however, a significant fraction of dirty blocks are written back to main memory when they are evicted from the SRAM portion, and the small capacity of the SRAM means these evicted dirty blocks still have high spatial locality. Our evaluation shows the DRAM row-buffer hit rates for writes are 21.1% and 35.6% for the 24MB STT-RAM LLC and the 18MB APM LLC, respectively.
Figure 12. The LLC power breakdown for multi-core applications, normalized to 8MB SRAM (1st bar: 8M-SRAM, 2nd bar: 24M-STT-RAM, 3rd bar: 21M-Sun-Hybrid, 4th bar: 18M-APM).
Figure 13. The memory energy breakdown for multi-core applications, normalized to 8MB SRAM (background, refresh, read/write and activation/precharge energy; bars as in Figure 12).
5.4 Storage Overhead and Power
The technique uses an access pattern predictor to predict dead blocks and write burst blocks, which incurs extra storage and power overhead.
5.4.1 Storage Overhead
Each cache block in the LLC adds 1 bit indicating whether it is predicted dead, using 9.2KB of storage in total. For the pattern predictor, each prediction table has 4,096 entries with a 2-bit counter per entry; there are 6 such tables in the skewed structures of the dead block prediction table and the write burst prediction table, using a total of 6KB of storage. In the pattern simulator, one simulated set corresponds to 32 LLC sets, and each simulated set has a 12-entry 16-bit partial read PC field, a 12-entry 16-bit partial tag field, a 12-entry 4-bit LRU position field, a 12-entry 1-bit valid flag field, a 4-entry 16-bit partial write PC field and a 4-entry 2-bit write LRU position field. For a 4.5MB hybrid cache, the simulator consumes 8.5KB of storage. Thus, the storage overhead for the single-core configuration is 6KB + 8.5KB + 9.2KB = 23.7KB, which is only 0.53% of the capacity of the 4.5MB hybrid LLC. For the quad-core configuration, the storage overhead of the APM technique is 6KB + 8.5KB x 4 + 9.2KB x 4 = 84.8KB, which is 0.43% of the 18MB hybrid cache capacity.
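These figures can largely be reproduced from first principles; the short calculation below is our own back-of-envelope reconstruction (the simulator term lands slightly under the stated 8.5KB, presumably due to small bookkeeping fields we do not model):

    # Reconstructing the single-core storage overhead (our arithmetic, not the paper's).
    block_bytes, ways = 64, 18
    llc_bytes = int(4.5 * 1024 * 1024)                 # 4.5MB hybrid LLC

    blocks = llc_bytes // block_bytes                  # 73,728 blocks
    dead_bits_kb = blocks / 8 / 1024                   # 1 prediction bit per block -> ~9KB

    tables_kb = 6 * 4096 * 2 / 8 / 1024                # 6 tables x 4096 entries x 2 bits -> 6KB

    sets = blocks // ways                              # 4,096 LLC sets
    sampled_sets = sets // 32                          # 1 simulated set per 32 LLC sets
    bits_per_set = 12 * (16 + 16 + 4 + 1) + 4 * (16 + 2)   # read PC/tag/LRU/valid + write PC/LRU
    simulator_kb = sampled_sets * bits_per_set / 8 / 1024  # ~8.1KB (paper reports 8.5KB)

    print(dead_bits_kb, tables_kb, simulator_kb)       # ~9.0 + 6.0 + ~8.1 = ~23KB total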
5.4.2 Power Overhead
We evaluate the power overhead of our technique using NVSim [5] and CACTI [8]. For the single-core configuration, the extra dynamic and leakage power consumed by the access pattern predictor is 1.6% and 1.9% of the LLC dynamic and leakage power, respectively, inducing a power overhead of 1.8% of the total LLC power consumption. For the quad-core configuration, the extra dynamic and leakage power consumed by the access pattern predictor is 1.1% and 0.60% of the LLC dynamic and leakage power, respectively, for an overall power overhead of 0.76% of the total LLC power consumption.
6 Related Work
6.1 Related Work on Mitigating the Write Overhead of STT-RAM
Many prior papers [12, 27, 17] focus on mitigating the write overhead of an STT-RAM cache. Jog et al. [12] propose to improve the write speed of an STT-RAM-based LLC by relaxing its data retention time; however, that technique requires large-capacity buffers and a line-level refresh mechanism to retain reliability. Mao et al. [17] propose prioritization policies for reducing the waiting time of critical requests in an STT-RAM-based LLC; however, the technique increases the power consumption of the LLC. Recently, researchers have proposed hybrid SRAM and STT-RAM techniques [25, 3, 9, 2, 26] for improving LLC efficiency. Sun et al. [25] take the first step, introducing the hybrid cache structure. That technique uses a counter-based approach to predict write-intensive blocks, which are placed in the SRAM ways to reduce the write overhead of the STT-RAM portion. However, the technique is optimized only for core-write operations; it cannot reduce the prefetch-write and demand-write operations to STT-RAM. Jadidi et al. [9] propose reducing the write variance of STT-RAM lines by migrating frequently written cache blocks to other STT-RAM lines or to SRAM lines; however, frequently migrating data between cache lines incurs significant performance and energy overhead. Chen et al. [3] propose a combined static and dynamic scheme to optimize block placement for a hybrid cache; the downside of that technique is that it requires the compiler to provide static hints to initialize the block placement.
6.2 Related Work on Intelligent Block Placement Policies
Much previous work [6, 10, 20, 11] focuses on improving cache efficiency through intelligent block placement. Hardavellas et al. [6] propose a block placement and migration policy for a non-uniform cache that classifies cache accesses into instructions, private data and shared data, and places data based on its type. Our technique is orthogonal to the LLC replacement policy and can be combined with a replacement policy applied to the STT-RAM portion to further improve cache efficiency.
6.3 Related Work on Dead Block Predic-tion
Several dead block predictors have been proposed [14, 4, 16, 13]. Lai et al. [4] propose a trace-based predictor used to prefetch data into dead blocks in the L1 data cache. The technique hashes the sequence of memory-access PCs that touch a cache block into a signature used to predict when the block can be safely evicted (see the sketch below). Kharbutli and Solihin [14] propose counting-based predictors for the cache replacement policy. Khan et al. [13] propose sampling-based dead block prediction for the LLC. None of these previously proposed dead block predictors takes all data access types into account, whereas our access pattern predictor identifies dead blocks and write-burst blocks in the presence of all types of LLC accesses.
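To make the trace-based idea concrete, the sketch below folds the PCs of accesses to a block into a signature and learns the signatures observed at eviction. The hash function and structure sizes are illustrative assumptions, not the actual predictor of [4].

    # Illustrative sketch of trace-based dead-block prediction in the
    # spirit of Lai et al. [4]: the PCs that touch a block are folded
    # into a signature; signatures seen at eviction are learned, and a
    # live block whose signature matches one is predicted dead.
    SIG_BITS = 16  # assumed signature width

    def fold_pc(signature, pc):
        # Fold the next memory-access PC into the running signature.
        return ((signature << 3) ^ pc) & ((1 << SIG_BITS) - 1)

    class Block:
        def __init__(self):
            self.sig = 0  # running trace signature for this cache block

    class DeadBlockPredictor:
        def __init__(self):
            self.dead_signatures = set()  # signatures that ended a live range

        def on_access(self, block, pc):
            block.sig = fold_pc(block.sig, pc)

        def on_eviction(self, block):
            # The signature at eviction closed the block's live range.
            self.dead_signatures.add(block.sig)
            block.sig = 0

        def predict_dead(self, block):
            return block.sig in self.dead_signatures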
7 Conclusion
A hybrid STT-RAM and SRAM LLC relies on an intelligent block placement policy to leverage the complementary characteristics of the two technologies. In this work, we propose a new block placement and migration policy for a hybrid LLC that places data based on its access patterns. LLC writes are categorized into three classes: core-write, prefetch-write, and demand-write. We analyze the access pattern of each class and design a block placement policy that adapts to each of them, guided by a low-cost access pattern predictor. Experimental results show that our technique improves performance and reduces LLC power consumption compared with both an SRAM LLC and an STT-RAM LLC of the same area.
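For illustration only, the sketch below shows one way such a class-aware placement decision could be wired up. The interpretation of each write class and its routing to SRAM or STT-RAM are assumed simplifications for exposition, not the exact policy evaluated in this paper.

    # Purely illustrative sketch of class-aware block placement for a
    # hybrid LLC. The class definitions and routing below are assumed
    # simplifications, not the exact policy proposed in this work.
    from enum import Enum

    class WriteClass(Enum):
        CORE_WRITE = 0      # assumed: writeback arriving from an upper level
        PREFETCH_WRITE = 1  # assumed: fill triggered by a prefetch
        DEMAND_WRITE = 2    # assumed: fill triggered by a demand miss

    def place_block(write_class, predicted_write_intensive):
        # Return the target array for the incoming block.
        if write_class is WriteClass.CORE_WRITE:
            return "SRAM"       # steer hot writes away from costly STT-RAM
        if predicted_write_intensive:
            return "SRAM"       # predictor expects a write burst
        return "STT-RAM"        # read-mostly data tolerates STT-RAM writes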
8 Acknowledgments
Yuan Xie and Cong Xu were supported in part by NSF grants 1218867 and 1213052, and by DoE under Award Number DE-SC0005026. Guangyu Sun was supported by NSF China grant 61202072 and the National High-tech R&D Program of China (Number 2013AA013201). Daniel A. Jiménez and Zhe Wang were supported in part by NSF CCF grants 1332654, 1332598, and 1332597.
References
[1] Intel Core i7 processor. http://www.intel.com/products/processor/corei7/.
[2] Y.-T. Chen, J. Cong, H. Huang, B. Liu, C. Liu, M. Potkonjak, and G. Reinman. Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design. In DATE '12, pages 45–50, 2012.
[3] Y.-T. Chen, J. Cong, H. Huang, C. Liu, R. Prabhakar, and G. Reinman. Static and dynamic co-optimizations for blocks mapping in hybrid caches. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, pages 237–242, New York, NY, USA, 2012. ACM.
[4] A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th International Symposium on Computer Architecture, pages 144–154, 2001.
[5] X. Dong, C. Xu, Y. Xie, and N. Jouppi. NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(7):994–1007, 2012.
[6] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 184–195, New York, NY, USA, 2009. ACM.
[7] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, September 2006.
[8] HP Laboratories. CACTI 6.5. http://hpl.hp.com/research/cacti/.
[9] A. Jadidi, M. Arjomand, and H. Sarbazi-Azad. High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement. In Proceedings of the 2011 International Symposium on Low Power Electronics and Design (ISLPED), pages 79–84, 2011.
[10] A. Jaleel, K. Theobald, S. C. Steely Jr., and J. Emer. High performance cache replacement using re-reference interval prediction. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA-37), June 2010.
[11] D. A. Jiménez. Insertion and promotion for tree-based PseudoLRU last-level caches. In MICRO, pages 284–296, 2013.
[12] A. Jog, A. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. Das. Cache Revive: Architecting volatile STT-RAM caches for enhanced performance in CMPs. In Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 243–252, 2012.
[13] S. M. Khan, Y. Tian, and D. A. Jiménez. Sampling dead block prediction for last-level caches. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 175–186, Washington, DC, USA, 2010. IEEE Computer Society.
[14] M. Kharbutli and Y. Solihin. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput., 57:433–447, April 2008.
[15] C. Kim, J.-J. Kim, S. Mukhopadhyay, and K. Roy. A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(3):349–357, 2005.
[16] H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 222–233, Washington, DC, USA, 2008. IEEE Computer Society.
[17] M. Mao, H. H. Li, A. K. Jones, and Y. Chen. Coordinating prefetching and STT-RAM based last-level cache management for multicore systems. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI '13, pages 55–60, New York, NY, USA, 2013. ACM.
[18] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 292–303, June 1997.
[19] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSSx86: A full system simulator for x86 CPUs. In Proceedings of the 2011 Design Automation Conference, June 2011.
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 381–391, New York, NY, USA, 2007. ACM.
[21] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 128–138, New York, NY, USA, 2000. ACM.
[22] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. IEEE Computer Architecture Letters, PP(99):1, 2011.
[23] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[24] S. Srinath, O. Mutlu, H. Kim, and Y. Patt. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), pages 63–74, 2007.
[25] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In HPCA, pages 239–249, 2009.
[26] S.-M. Syu, Y.-H. Shao, and I.-C. Lin. High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy. In Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI, GLSVLSI '13, pages 19–24, New York, NY, USA, 2013. ACM.
[27] J. Wang, X. Dong, and Y. Xie. OAP: An obstruction-aware cache management policy for STT-RAM last-level caches. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 847–852, 2013.
[28] X. Wang, Y. Chen, H. Li, D. Dimitrov, and H. Liu. Spin torque random access memory down to 22 nm technology. IEEE Transactions on Magnetics, 44(11):2479–2482, 2008.
[29] Z. Wang, S. M. Khan, and D. A. Jiménez. Improving writeback efficiency with decoupled last-write prediction. In Proceedings of the 39th International Symposium on Computer Architecture, ISCA '12, pages 309–320, Piscataway, NJ, USA, 2012. IEEE Press.