
The Dirty-Block Index

Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons†, Michael A. Kozuch†, Todd C. Mowry

Carnegie Mellon University †Intel Pittsburgh

Abstract

On-chip caches maintain multiple pieces of metadata about each cached block—e.g., dirty bit, coherence information, ECC. Traditionally, such metadata for each block is stored in the corresponding tag entry in the tag store. While this approach is simple to implement and scalable, it necessitates a full tag store lookup for any metadata query—resulting in high latency and energy consumption. We find that this approach is inefficient and inhibits several cache optimizations.

In this work, we propose a new way of organizing the dirty bit information that enables simpler and more efficient implementations of several optimizations. In our proposed approach, we remove the dirty bits from the tag store and organize them differently in a separate structure, which we call the Dirty-Block Index (DBI). The organization of DBI is simple: it consists of multiple entries, each corresponding to some row in DRAM. A bit vector in each entry tracks whether or not each block in the corresponding DRAM row is dirty.

We demonstrate the benefits of DBI by using it to simultaneously and efficiently implement three optimizations proposed by prior work: 1) Aggressive DRAM-aware writeback, 2) Bypassing cache lookups, and 3) Heterogeneous ECC for clean/dirty blocks. DBI, with all three optimizations enabled, improves performance by 31% compared to the baseline (by 6% compared to the best previous mechanism) while reducing overall cache area cost by 8% compared to prior approaches.

1. Introduction

On-chip caches in modern processors maintain multiple pieces of metadata about cached blocks—e.g., dirty bit, coherence state, ECC. Traditionally, caches store such metadata for each block in the corresponding tag entry in the tag store. Although this approach is straightforward and scalable, even simple metadata queries require a full tag store lookup, which incurs high latency and energy [33]. In this work, we focus our attention on the dirty bit information in writeback caches.

We find that the existing approach of organizing the dirty bit information in the tag entry inhibits several cache optimizations. More specifically, we find that several cache optimizations require the cache to quickly and efficiently 1) determine if a block is dirty [33, 44, 49], and 2) identify the list of all spatially co-located dirty blocks—i.e., dirty blocks from the same DRAM row [27, 47, 51]. However, checking the dirty status of even a single cache block with the existing cache organization requires an expensive tag store lookup.

In this work, we propose a new way of organizing the dirty bit information. In our proposed organization, we remove the dirty bits from the cache tag entries and organize them differently in a separate structure, which we call the Dirty-Block Index (DBI).¹ All queries regarding the dirty status are directed to the DBI. The organization of the DBI is simple. It has multiple entries, each tracking the dirty bit information of all the blocks in some DRAM row. Each entry contains a row tag indicating which DRAM row the entry corresponds to, and a bit vector indicating whether each block in the DRAM row is dirty. A cache block is dirty if and only if the DBI has a valid entry corresponding to the DRAM row containing the block and the bit corresponding to the cache block in the associated bit vector is set.

DBI has three nice properties. First, since it is much smaller than the main tag store, it can identify whether a block is dirty much faster than the main tag store. Second, since the dirty bit information of all blocks of a DRAM row are stored together, DBI can list all dirty blocks of a DRAM row with a single query (whereas the main tag store requires one query for each block of the row). Finally, since DBI is organized independently of the main tag store, it can be used to limit the number of dirty blocks to a small fraction of the number of blocks in the cache. These properties allow DBI to efficiently implement and accelerate several cache optimizations. In this work, we quantitatively evaluate the benefits of DBI using three previously proposed optimizations.

First, prior works [27, 51] have shown that proactively writing back all dirty blocks of a DRAM row in a single burst performs significantly better compared to writing them back in the order that they are evicted from the cache. DBI allows the cache to efficiently identify all dirty blocks of a DRAM row, enabling a simpler implementation of this optimization. Second, prior works [33, 44] have proposed mechanisms to bypass the cache lookup for accesses that are likely to miss in the cache. However, since accesses to dirty blocks cannot be bypassed, previously proposed implementations of this optimization incur high complexity. In contrast, DBI, with its ability to quickly identify if a block is dirty, enables a very simple and efficient implementation of this optimization. Third, prior works [30, 58, 59] have proposed mechanisms to reduce the overhead of ECC in caches by storing only a simple error detection code for each clean block and a strong error correction code for each dirty block. However, since any block in the cache can be dirty, prior approaches require complex changes to implement this optimization. In our proposed organization, since DBI decouples the dirty bit information from the tag store, it is sufficient to maintain strong ECC for only blocks tracked by the DBI. Section 3 discusses these optimizations and our proposed implementations using DBI.

¹ Many cache coherence protocols may maintain the dirty status implicitly in the cache coherence states. Section 2.3 discusses how different cache coherence protocols can be seamlessly adapted to work with DBI.

We compare DBI (with different combinations of these optimizations) to two baseline mechanisms (a cache employing the Least Recently Used policy and one employing the Dynamic Insertion Policy [18, 42]), and three previous mechanisms that employ individual optimizations (DRAM-aware writeback (DAWB) [27], Virtual Write Queue (VWQ) [51], and Skip Cache [44]). The results show that DBI, with all optimizations enabled, outperforms all prior approaches (6% compared to the best previous mechanism and 31% compared to the baseline for an 8-core system) while also significantly reducing overall cache area (8% compared to the baseline for a 16MB cache). Section 6 discusses these results in more detail.

While we discuss three optimizations enabled by DBI in detail, DBI can be used to enable several other mechanisms. Section 7 briefly describes some of these mechanisms.

The main contributions of this paper are as follows.

• We propose a new way of tracking dirty blocks that maintains dirty bit information of cache blocks in a separate structure called the Dirty-Block Index (DBI), instead of in the main tag store.

• We show that our new dirty-bit organization using DBI enables efficient implementation of several previously proposed optimizations that involve dirty blocks. DBI simultaneously enables all these optimizations using a single structure whereas prior approaches require separate structures for each optimization.

• We quantitatively compare the performance and area benefits of DBI using three optimizations to prior approaches [18, 27, 44, 51]. Our evaluations show that DBI with all optimizations outperforms all previous mechanisms while reducing cache area cost. The performance improvement is consistent across a wide variety of system configurations and workloads.

2. The Dirty-Block Index (DBI)

We conceived the idea of the Dirty-Block Index based on two key principles motivated by two previously proposed optimizations related to dirty blocks: cache lookup bypass [33, 44] and DRAM-aware writeback [27, 51]. First, prior works have shown that bypassing the cache lookup for an access that is likely to miss in the cache can reduce the average latency and energy consumed by memory accesses [33]. However, the cache must not be bypassed for a dirty block. This motivates the need for a mechanism that can quickly check if a block is dirty (without looking up the entire tag store). Second, prior works have shown that writing back spatially co-located dirty blocks (i.e., those that belong to the same DRAM row [45, 60]) together improves system performance significantly. This motivates the need for a mechanism to quickly and efficiently identify spatially co-located dirty blocks.

Based on these principles, we propose to remove the dirty bits from the main tag store and store them in a separate structure, called the Dirty-Block Index (DBI). DBI reorganizes the dirty bit information such that the dirty bits of blocks of the same DRAM row are stored together in a single entry.

2.1. DBI Structure

Figure 1 compares the conventional tag store organization with a tag store augmented with a DBI. In the conventional organization (shown in Figure 1a), each tag entry contains a dirty bit that indicates whether the corresponding block is dirty or not. For example, to indicate that a block B is dirty, the dirty bit of the corresponding tag entry is set.

In contrast, in a cache augmented with a DBI (Figure 1b), the dirty bits are removed from the main tag store and organized differently in the DBI. The organization of DBI is simple. It consists of multiple entries. Each entry corresponds to some row in DRAM—identified using a row tag present in each entry. Each DBI entry contains a dirty bit vector that indicates if each block in the corresponding DRAM row is dirty or not.

[Figure 1: Comparison between conventional cache organization and a cache augmented with a DBI. The figure shows how each organization represents the dirty status of a block B, assumed to be the second block in DRAM row R. (a) Conventional cache tag store: each tag entry contains a valid bit, a dirty bit, and the cache block tag. (b) Cache tag store augmented with a DBI: each tag entry contains only a valid bit and the cache block tag; each DBI entry contains a valid bit, a row tag (log2 of the number of rows in DRAM), and a dirty bit vector (one bit per block in a DRAM row).]

DBI Semantics. A block in the cache is dirty if and only if the DBI contains a valid entry for the DRAM row that contains the block and the bit corresponding to the block in the bit vector of that DBI entry is set. For example, assuming that block B is the second block of DRAM row R, to indicate that block B is dirty, the DBI contains a valid entry for DRAM row R, with the second bit of the corresponding bit vector set.²

² Note that the key difference between the DBI and the conventional tag store is the logical organization of the dirty bit information. While some processors store the dirty bit information in a separate physical structure, the logical organization of the dirty bit information is the same as the main tag store.
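To make the organization concrete, the sketch below models a DBI entry and the dirty-status query in Python. This is an illustrative software model only, not the authors' hardware design: the class names, the dictionary-based lookup, and the assumption that block addresses are block-aligned indices are ours, and the set-indexing and associativity of the real structure (Section 4) are omitted.

    # Illustrative software model of the DBI (not the paper's hardware implementation).
    BLOCKS_PER_ROW = 128          # assumed DBI granularity: one bit per block of a DRAM row

    class DBIEntry:
        def __init__(self, row_tag):
            self.valid = True
            self.row_tag = row_tag                        # which DRAM row this entry tracks
            self.dirty = [False] * BLOCKS_PER_ROW         # dirty bit vector for that row

    class DBI:
        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.entries = {}                             # row_tag -> DBIEntry

        def is_dirty(self, block_index):
            # A block is dirty iff a valid entry exists for its DRAM row and
            # the block's bit in that entry's bit vector is set.
            row, offset = divmod(block_index, BLOCKS_PER_ROW)
            entry = self.entries.get(row)
            return entry is not None and entry.valid and entry.dirty[offset]

For the running example, marking block B (the second block of row R) dirty corresponds to an entry for R whose bit vector has only index 1 set.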

2.2. DBI Operation

Figure 2 pictorially describes the operation of a cache augmented with a DBI. The focus of this work is on the on-chip last-level cache (LLC). Therefore, for ease of explanation, we assume that the cache does not receive any sub-block writes and any dirty block in the cache is a result of a writeback generated by the previous level of cache.³ There are four possible operations, which we describe in detail below.

³ Sub-block writes typically occur in the primary L1 cache, where writes are at a word granularity, or at a cache which uses a larger block size than the previous level of cache. The DBI operation described in this paper can be easily extended to caches with sub-block writes.

2.2.1. Read Access to the Cache

The addition of DBI does not change the path of a read access in any way. On a read access, the cache simply looks up the block in the tag store and returns the data on a cache hit. Otherwise, it forwards the access to the memory controller.

2.2.2. Writeback Request to the Cache

In a system with multiple levels of on-chip cache, the LLC will receive a writeback request when a dirty block is evicted from the previous level of cache. Upon receiving such a writeback request, the cache performs two actions (as shown in Figure 2). First, it inserts the block into the cache if it is not already present. This may result in a cache block eviction (discussed in Section 2.2.3). If the block is already present in the cache, the cache just updates the data store (not shown in the figure) with the new data. Second, the cache updates the DBI to indicate that the written-back block is dirty. If the DBI already has an entry for the DRAM row that contains the block, the cache simply sets the bit corresponding to the block in that DBI entry. Otherwise, the cache inserts a new entry into the DBI for the DRAM row containing the block, with the bit corresponding to the block set. Inserting a new entry into the DBI may require an existing DBI entry to be evicted. Section 2.2.4 discusses how the cache handles such a DBI eviction.
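Continuing the illustrative model above, the writeback-request path could be sketched as follows. The helper names (insert_or_update, handle_dbi_eviction) are ours; the DBI eviction itself is sketched in Section 2.2.4.

    # Sketch of the writeback-request path (assumed helper names).
    def handle_writeback(cache, dbi, memory_controller, block_index, data):
        # 1. Insert the block into the cache if needed (this may evict another
        #    cache block, which is handled by the cache-eviction path).
        cache.insert_or_update(block_index, data)
        # 2. Record the block as dirty in the DBI.
        row, offset = divmod(block_index, BLOCKS_PER_ROW)
        entry = dbi.entries.get(row)
        if entry is None:
            if len(dbi.entries) >= dbi.capacity:
                handle_dbi_eviction(cache, dbi, memory_controller)   # Section 2.2.4 sketch
            entry = DBIEntry(row)
            dbi.entries[row] = entry
        entry.dirty[offset] = True            # set the block's bit in the row's bit vector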

2.2.3. Cache Eviction

When a block is evicted from the cache, it has to be written back to main memory if it is dirty. Upon a cache block eviction, the cache consults the DBI to determine if the block is dirty. If so, it first generates a writeback request for the block and sends it to the memory controller. It then updates the DBI to indicate that the block is no longer dirty—done by simply resetting the bit corresponding to the block in the bit vector of the DBI entry. If the evicted block is the last dirty block in the corresponding DBI entry, the cache invalidates the DBI entry so that the entry can be used to store the dirty block information of some other DRAM row.
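The cache-eviction path, in the same illustrative model (helper names are ours):

    # Sketch of the cache-eviction path using the DBI (assumed helper names).
    def handle_cache_eviction(dbi, memory_controller, block_index):
        row, offset = divmod(block_index, BLOCKS_PER_ROW)
        entry = dbi.entries.get(row)
        if entry is not None and entry.dirty[offset]:
            memory_controller.enqueue_writeback(block_index)  # dirty: write back first
            entry.dirty[offset] = False                       # block is no longer dirty
            if not any(entry.dirty):
                del dbi.entries[row]                          # last dirty block: free the entry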

[Figure 2: Operation of a cache with DBI. The figure shows the four operations: (1) a read access looks up only the tag store for hit/miss; (2) a writeback request inserts/updates the block in the tag store and inserts/updates the metadata in the DBI; (3) a cache eviction checks the DBI and writes the block back if it is dirty; (4) a DBI eviction generates writebacks for all dirty blocks marked by the evicted entry.]

2.2.4. DBI Eviction

The last operation in a cache augmented with a DBI is a DBI eviction. Similar to the cache, since the DBI has limited space, it can only track the dirty block information for a limited number of DRAM rows. As a result, inserting a new DBI entry (on a writeback request, discussed in Section 2.2.2) may require evicting an existing DBI entry. We call this event a DBI eviction. The DBI entry to be evicted is decided by the DBI replacement policy (discussed in Section 4.3). When an entry is evicted from the DBI, all the blocks indicated as dirty by the entry should be written back to main memory. This is because, once the entry is evicted, the DBI can no longer maintain the dirty status of those blocks. Therefore, not writing them back to memory will likely lead to incorrect execution, as the version of those blocks in memory is stale. Although a DBI eviction may require evicting many dirty blocks, with a small buffer to keep track of the evicted DBI entry (until all of its blocks are written back to memory), the DBI eviction can be interleaved with other demand requests. Note that on a DBI eviction, the corresponding cache blocks need not be evicted from the cache—they only need to be transitioned from the dirty state to the clean state.
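The DBI-eviction path, again in the illustrative model; select_victim stands in for the DBI replacement policy of Section 4.3:

    # Sketch of a DBI eviction (assumed helper names).
    def handle_dbi_eviction(cache, dbi, memory_controller):
        victim_row = dbi.select_victim()                      # chosen by the replacement policy
        entry = dbi.entries.pop(victim_row)
        for offset, is_dirty in enumerate(entry.dirty):
            if is_dirty:
                block_index = victim_row * BLOCKS_PER_ROW + offset
                memory_controller.enqueue_writeback(block_index)
                # The block stays in the cache; it merely becomes clean.
                cache.mark_clean(block_index)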

2.3. Cache Coherence Protocols

Many cache coherence protocols implicitly store the dirty status of cache blocks in the cache coherence states. For example, in the MESI protocol [37], the M (Modified) state indicates that the block is dirty. In the improved MOESI protocol [52], both the M (Modified) and O (Owner) states indicate that the block is dirty. To adapt such protocols to work with DBI, we propose to split the cache coherence states into multiple pairs—each pair containing a state that indicates the block is dirty and the non-dirty version of the same state. For example, we split the MOESI protocol into three parts: (M, E), (O, S) and (I). We can then use a single bit to distinguish between the two states in each pair. This bit will be stored in the DBI.
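One minimal way to express this split in the illustrative model is a table mapping each MOESI state to a pair (state kept with the tag, dirty bit kept in the DBI). The encoding below is our own rendering of the pairing described above, not a complete protocol specification.

    # Assumed encoding: MOESI state -> (state stored in the tag store, dirty bit stored in DBI).
    MOESI_SPLIT = {
        "M": ("E", 1),   # Modified = Exclusive + dirty
        "E": ("E", 0),
        "O": ("S", 1),   # Owner    = Shared    + dirty
        "S": ("S", 0),
        "I": ("I", 0),
    }

    def split_state(moesi_state):
        return MOESI_SPLIT[moesi_state]

    def merge_state(tag_state, dbi_dirty):
        # Recover the full MOESI state from the stored pair.
        for full, pair in MOESI_SPLIT.items():
            if pair == (tag_state, dbi_dirty):
                return full
        return tag_state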


3. Optimizations Enabled by DBI

We demonstrate the effectiveness of DBI by describing efficient implementations of three cache optimizations: 1) DRAM-aware writeback, 2) cache lookup bypass, and 3) ECC overhead reduction. Note that while prior works have proposed these optimizations [23, 27, 30, 33, 44, 51, 58, 59], DBI has two key advantages over prior proposals. First, implementing these optimizations using DBI is simpler and more efficient than in prior works. Second, while combining prior proposals is either impossible or further increases complexity, DBI can simultaneously enable all three optimizations (and many more described in Section 7). Our evaluations (Section 6) show that DBI performs significantly better than any individual optimization, while also reducing the overall area cost. We now describe the three optimizations in detail.

3.1. Efficient DRAM-Aware Aggressive Writeback

The first optimization is referred to as DRAM-Aware Writeback. This optimization is based on two observations. First, each DRAM bank has a structure called the row buffer that caches the last accessed row from that bank [26]. Requests to the row buffer (row buffer hits) are much faster and more efficient than other requests (row buffer misses) [26, 45, 60]. Second, the memory controller buffers writes to DRAM in a write buffer and flushes the writes when the buffer is close to full [27]. Based on these observations, filling the write buffer with blocks from the same DRAM row will lead to a faster and more efficient writeback phase.

Unfortunately, the write sequence to DRAM primarily depends on the order in which dirty blocks are evicted from the LLC. Since blocks of a DRAM row typically map to different cache sets, dirty blocks of the same row are unlikely to be evicted together. As a result, writing back blocks in the order in which they are evicted will cause a majority of writes to result in row misses [27]. To address this problem, a recent work [27] proposed a mechanism to proactively write back all dirty blocks of a DRAM row when any dirty block from that row is evicted from the cache. However, this requires multiple tag store lookups to identify if blocks of the row are dirty. Many of these lookups may be unnecessary as the blocks may actually not be dirty or may not even be in the cache.

Virtual Write Queue (VWQ) [51] proposed to address this problem by using a Set State Vector (SSV) to filter out some of these unnecessary lookups. The SSV indicates if each cache set has any dirty block in the LRU ways. VWQ looks up a set for dirty blocks only if the SSV indicates that the set has dirty blocks in the LRU ways. Since VWQ only checks the LRU ways to generate proactive writebacks, we find that it is not significantly more efficient compared to DRAM-Aware Writeback [27]. Our evaluations show that, as a result of the additional tag lookups, both DAWB and VWQ perform significantly more tag lookups than the baseline (1.95x and 1.88x respectively—Section 6.1).

In contrast to these mechanisms, implementing such a proactive writeback scheme is straightforward using DBI. When a block is evicted from the cache, the cache first consults the DBI to find out if the block is dirty. If so, the bit vector of the corresponding DBI entry also provides the list of all dirty blocks of the same row. In addition to generating a writeback for the evicted block, the cache also generates writebacks for the other dirty blocks of the row. We call this scheme the Aggressive Writeback (AWB) scheme.

Figure 3 pictorially shows the operation of AWB. As shown, AWB looks up the tag store only for blocks that are actually dirty, thereby reducing the contention for the cache port. This reduced contention enables AWB to further improve performance compared to prior approaches [27, 51], especially in multi-core systems, where a tag lookup in the shared cache can delay requests from all applications (Section 6.2).⁴

⁴ We always prioritize a demand lookup over a lookup for generating aggressive writeback (similar to prior work [27]). However, once started, a lookup cannot be preempted and can delay other queued requests.
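In the illustrative model, AWB extends the cache-eviction path from Section 2.2.3: when the evicted block is dirty, the same DBI entry already lists every other dirty block of the row, so only those blocks are looked up and written back. The function below is our own sketch; lookup_and_writeback stands in for the tag store lookup plus writeback generation.

    # Sketch of Aggressive Writeback (AWB) on a block eviction (assumed helpers).
    def aggressive_writeback(cache, dbi, memory_controller, evicted_index):
        row, offset = divmod(evicted_index, BLOCKS_PER_ROW)
        entry = dbi.entries.get(row)
        if entry is None or not entry.dirty[offset]:
            return                                    # evicted block is clean: nothing to do
        # Write back the evicted block and, proactively, every other dirty block
        # of the same DRAM row recorded in this DBI entry.
        for off, is_dirty in enumerate(entry.dirty):
            if is_dirty:
                cache.lookup_and_writeback(row * BLOCKS_PER_ROW + off, memory_controller)
                entry.dirty[off] = False
        del dbi.entries[row]                          # no dirty blocks remain for this row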

[Figure 3: Aggressive writeback using DBI. Block B (the evicted block) is the second block of DRAM row R and is dirty. The cache checks the DBI with the evicted block's address, finds the entry for row R (bit vector 0 1 0 1 0 0 0 1), and looks up and writes back only the blocks whose bits are set.]

3.2. Efficient Cache Lookup Bypass

The second optimization is based on a simple idea: if an access to the cache is likely to miss, we can potentially bypass the cache lookup, thereby reducing the overall latency and energy consumed by the access [33, 44]. This optimization requires a mechanism to predict whether an access will hit or miss in the cache. The key challenge in realizing this optimization is that when the cache contains dirty blocks, the mechanism cannot bypass the lookup for a block that is dirty in the cache.

Prior works address this problem by either using a write-through cache [44] (no dirty blocks to begin with) or by ensuring that the prediction mechanism has no false positives (no access is falsely predicted to miss in the cache). Both of these approaches have shortcomings: using the write-through policy can significantly increase the memory write bandwidth requirement, and ensuring that there are no false positives in the miss prediction requires complex hardware structures [33].

In contrast to the above approaches, using DBI to maintain the dirty block information enables a simpler approach to bypassing cache lookups that works with any prediction mechanism. Our approach is based on the fact that the DBI is much smaller than the main tag store and hence has lower latency and energy per access. Figure 4 pictorially depicts our proposed mechanism. As shown, our mechanism uses a miss predictor to predict if each read access misses in the cache. If an access is predicted to miss in the cache, our mechanism checks the DBI to determine if the block is dirty in the cache. If so, the block is accessed from the cache. Otherwise, the access is forwarded to the next level of the hierarchy. Our mechanism does not modify any of the other cache operations. We refer to this optimization as Cache Lookup Bypass (CLB).

CLB can be used with any miss predictor [33, 44, 49, 57]. In our evaluations (Section 6), we employ CLB with the predictor used by Skip Cache [44] as it is simpler to implement and has lower hardware overhead compared to other miss predictors (e.g., [33, 49]). Skip Cache divides execution into epochs and monitors the miss rate of each application/thread in each epoch (using set sampling [41]). If an application's miss rate exceeds a threshold (0.95, in our experiments), all accesses of the application (except those that map to the sampled sets) are predicted to miss in the cache in the next epoch.
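The sketch below shows how the CLB decision could be wired together in the illustrative model: a read access consults the (Skip-Cache-style) miss predictor first, and a predicted miss bypasses the tag store lookup only if the DBI reports the block clean. The predictor interface and helper names are ours.

    # Sketch of the Cache Lookup Bypass (CLB) read path (assumed helper names).
    def handle_read(cache, dbi, miss_predictor, next_level, block_index, app_id):
        if miss_predictor.predicts_miss(app_id, block_index) and not dbi.is_dirty(block_index):
            # Likely a miss and provably clean: skip the tag store lookup entirely.
            return next_level.read(block_index)
        hit, data = cache.lookup(block_index)         # normal path: tag store lookup
        return data if hit else next_level.read(block_index)

A Skip-Cache-style predictor would simply flag an application for bypass in the next epoch whenever its sampled miss rate exceeds the 0.95 threshold.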

[Figure 4: Cache lookup bypass mechanism using DBI. A read access first goes to the miss predictor; if it is predicted to miss in the cache, the DBI is checked. If the block is dirty, the access proceeds to the tag store as usual; otherwise, it is forwarded directly to the next level of the hierarchy.]

3.3. Reducing ECC Overhead

The third optimization is an approach to reduce the area overhead of Error-Correction Codes (ECC) employed in many processors. With small feature sizes, data reliability is a major concern in modern processors [9]. Therefore, to ensure the integrity of the data in the on-chip caches, processors store ECC (e.g., SECDED, a Single Error Correction Double Error Detection code) along with each cache block. However, storing such ECC information comes with an area cost.

This optimization to reduce ECC overhead is based on a simple observation: only dirty blocks require a strong ECC; clean blocks only require error detection. This is because, if an error is detected in a clean block, the block can be retrieved from the next level of the memory hierarchy. On the other hand, if an error is detected in a dirty block, then the cache should also correct the error as it has the only copy of the block. To exploit this heterogeneous ECC requirement for clean and dirty blocks, prior works proposed to use a simple Error Detection Code (EDC) for all blocks while storing the ECC only for dirty blocks in a separate structure [30] or in main memory [58, 59]. While these mechanisms are successful in reducing the ECC overhead, they require significant changes to the error correction logic (such as two-tiered error protection [58, 59]), even though errors might be rare to begin with. In contrast to these prior works, in our proposed cache organization, since the DBI is the authoritative source for determining if a block is dirty, it is sufficient to keep ECC information for only the blocks tracked by the DBI.

Figure 5 compares the ECC organization of the baseline cache and a cache augmented with DBI. While the baseline stores ECC for all blocks in the tag store, our mechanism stores only EDC for all blocks and stores additional ECC information for only the blocks tracked by the DBI. As our evaluations show (Section 6), with the two previously discussed optimizations (AWB and CLB), DBI significantly improves performance while tracking far fewer blocks than the main tag store. Therefore, maintaining ECC for only blocks tracked by the DBI significantly reduces the area overhead of ECC without significant additional complexity.
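In the illustrative model, the error-handling path this implies is roughly the following: every block carries a cheap detection code, and the stronger correction code is stored and consulted only for blocks the DBI marks dirty. The edc_ok and correct_with_ecc calls are placeholders for whatever codes are used, not a specific code implementation.

    # Sketch of the heterogeneous EDC/ECC check on a cache read (assumed helpers).
    def check_block_integrity(cache, dbi, next_level, block_index, data):
        if edc_ok(data):                              # detection code stored for every block
            return data
        if not dbi.is_dirty(block_index):
            # Clean block: a correct copy exists below, so simply re-fetch it.
            return next_level.read(block_index)
        # Dirty block: this is the only copy, so correct it with the strong ECC
        # kept only for blocks tracked by the DBI.
        return correct_with_ecc(data, cache.ecc_for_dirty_block(block_index))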

[Figure 5: Reducing ECC overhead using DBI. The baseline cache stores ECC alongside the tag store for every block; the cache with DBI stores only EDC for every block and stores ECC only alongside the DBI. In this example, the cumulative number of blocks tracked by the DBI is 1/4th the number of blocks tracked by the tag store.]

4. DBI Design Choices

The DBI design space can be defined using three key parameters: 1) DBI size, 2) DBI granularity, and 3) DBI replacement policy.⁵ These parameters determine the effectiveness of the three optimizations discussed in the previous section. We now discuss these parameters and their trade-offs in detail.

4.1. DBI Size

The DBI size refers to the cumulative number of blocks tracked by all the entries in the DBI. For ease of analysis across systems with different cache sizes, we represent the DBI size as the ratio of the cumulative number of blocks tracked by the DBI and the number of blocks tracked by the cache tag store. We denote this ratio using α. For example, for a 1MB cache with a 64B block size (16k blocks), a DBI of size α = 1/2 enables the DBI to track 8k blocks.
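Concretely, the number of DBI entries follows directly from α and the granularity; the helper below (our own naming) just restates that arithmetic. With the parameters used later in the evaluation (Table 1: α = 1/4, granularity = 64), a 16MB cache with 64B blocks has 262,144 blocks, so the DBI tracks 65,536 of them using 1,024 entries.

    # DBI sizing arithmetic: entries = (cache blocks * alpha) / granularity.
    def dbi_num_entries(cache_size_bytes, block_size_bytes, alpha, granularity):
        cache_blocks = cache_size_bytes // block_size_bytes
        tracked_blocks = int(cache_blocks * alpha)     # blocks the DBI can track
        return tracked_blocks // granularity           # one bit vector of `granularity` bits each

    # Example from the text: 1MB cache, 64B blocks, alpha = 1/2 -> 8k tracked blocks.
    assert dbi_num_entries(1 << 20, 64, 0.5, 64) == 128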

The DBI size presents a trade-off between the size of the write working set (the set of frequently written blocks) that can be captured by the DBI, and the area, latency, and power cost of the DBI. A large DBI has two benefits: 1) it can track a larger write working set, thereby reducing the writeback bandwidth demand, and 2) it gives more time for a DBI entry to accumulate writebacks to a DRAM row, thereby better exploiting the AWB optimization. However, a large DBI comes at a higher area, latency and power cost. On the other hand, a smaller DBI incurs lower area, latency and power cost. This has two benefits: 1) lower latency in the critical path for the CLB optimization and 2) ECC storage for fewer dirty blocks. However, a small DBI limits the number of dirty blocks in the cache and thus results in premature DBI evictions, reducing the potential to generate aggressive writebacks. It can also potentially lead to thrashing if the write working set is significantly larger than the number of blocks tracked by the small DBI.

⁵ Similar to the main tag store, the DBI is also a set-associative structure and has a fixed associativity. However, we do not discuss the DBI associativity in detail as its trade-offs are similar to any other set-associative structure.

4.2. DBI Granularity

The DBI granularity refers to the number of blocks tracked by a single DBI entry. Although our discussion in Section 2.1 suggests that this is the same as the number of blocks in each DRAM row, we can design the DBI to track fewer blocks in each entry. For example, for a system with a DRAM row size of 8KB and a cache block size of 64B, a natural choice for the DBI granularity is 8KB/64B = 128. Instead, we can design a DBI entry to track only 64 blocks, i.e., one half of a DRAM row.

The DBI granularity presents another trade-off, between the amount of locality that can be extracted during the writeback phase (using the AWB optimization) and the size of the write working set that can be captured using the DBI. A large granularity leads to better potential for exploiting the AWB optimization. However, if writes have low spatial locality, a large granularity will result in inefficient use of the DBI space, potentially leading to write working set thrashing.

4.3. DBI Replacement Policy

The DBI replacement policy determines which entry is evicted on a DBI eviction, described in Section 2.2.4. A DBI eviction only writes back the dirty blocks of the corresponding DRAM row to main memory, and does not evict the blocks themselves from the cache. Therefore, a DBI eviction does not affect the latency of future read requests for the corresponding blocks. However, if the previous cache level generates a writeback request for a block written back due to a DBI eviction, the block will have to be written back to memory again, leading to an additional write to main memory. Therefore, the goal of the DBI replacement policy is to ensure that blocks are not prematurely written back to main memory.

The ideal DBI replacement policy is to evict the DBI entry that has a writeback request farthest into the future. However, similar to Belady's optimal replacement policy for caches [10], this ideal policy is impractical to implement in real systems. We evaluated five practical replacement policies for DBI: 1) Least Recently Written (LRW)—similar to the LRU policy for caches, 2) LRW with Bimodal Insertion Policy [42], 3) Rewrite-interval prediction policy—similar to the RRIP policy for caches [19], 4) Max-Dirty—the entry with the maximum number of dirty blocks, and 5) Min-Dirty—the entry with the minimum number of dirty blocks. We find that the LRW policy works comparably to or better than the other policies and use it for all our evaluations in the paper.
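A minimal sketch of the LRW policy in the illustrative model, assuming the DBI keeps its entries ordered by the time of the last write (an OrderedDict stands in for the recency stack; this is our own rendering, not the hardware implementation):

    from collections import OrderedDict

    # Sketch of the Least-Recently-Written (LRW) DBI replacement policy.
    class LRWDBI:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()     # row_tag -> DBIEntry, least recently written first

        def touch_on_write(self, row_tag, entry):
            # Insert the entry or move it to the most-recently-written position.
            self.entries[row_tag] = entry
            self.entries.move_to_end(row_tag)

        def select_victim(self):
            # Evict the entry that was written to least recently.
            return next(iter(self.entries))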

5. Evaluation Methodology

System. We use an in-house event-driven x86 multi-core simulator that models out-of-order cores, coupled with a DDR3 [20] DRAM simulator. The simulator faithfully models all processor stalls due to memory accesses. All the simulated systems use a three-level cache hierarchy. The L1 data cache and the L2 cache are private to each core. The L3 cache is shared across all cores. All caches uniformly use a 64B cache block size. We do not enforce inclusion in any level of the cache hierarchy. We implement DBI and prior approaches [27, 44, 51] for the aggressive writeback and cache lookup bypass optimizations at the shared last-level cache. We use CACTI [1] to model the area, latency, and power for the caches and the DBI. Table 1 lists the main configuration parameters in detail.

Benchmarks and Workloads. Our evaluations use benchmarks from the SPEC CPU2006 suite [3] and STREAM [4]. We use Pinpoints [38] to collect instruction traces from the representative portion of these benchmarks. We run each benchmark for 500 million instructions—the first 200 million instructions for warmup and the remaining 300 million instructions to collect the results. For our multi-core evaluations, we classify the benchmarks into nine categories based on read intensity (low, medium, and high) and write intensity (low, medium, and high), and generate multi-programmed workloads with varying levels of read and write intensity. The read intensity of a workload determines how much a workload can get affected by interference due to writes, and the write intensity determines how much interference a workload is likely to cause to read accesses. In all, we present results for 102 2-core, 259 4-core, and 120 8-core workloads.

Metrics. We use instruction throughput to evaluate single-core performance, and weighted speedup [50] to evaluate multi-core performance. We also present a detailed analysis of our single-core experiments using other statistics (e.g., MPKI, read/write row hit rates). For multi-core workloads, we present other performance and fairness metrics—instruction throughput, harmonic speedup [32], and maximum slowdown [14, 24]. Weighted and harmonic speedup were shown to correlate with system throughput and response time [15].

Processor: 1-8 cores, 2.67 GHz, single issue, out-of-order, 128-entry instruction window
L1 Cache: Private, 32KB, 2-way set-associative, tag store latency = 2 cycles, data store latency = 2 cycles, parallel tag and data lookup, LRU replacement policy, number of MSHRs = 32
L2 Cache: Private, 256KB, 8-way set-associative, tag store latency = 12 cycles, data store latency = 14 cycles, parallel tag and data lookup, LRU replacement policy
L3 Cache: Shared, 2MB/core; for 1/2/4/8-core systems, 16/32/32/32-way set-associative, tag store latency = 10/12/13/14 cycles, data store latency = 24/29/31/33 cycles, serial tag and data lookup, LRU replacement policy
DBI: Size (α) = 1/4, granularity = 64, associativity = 16, latency = 4 cycles, LRW replacement policy (Section 4.3)
DRAM Controller: Open row, row interleaving, FR-FCFS scheduling policy [45, 60], 64-entry write buffer, drain when full policy [27]
DRAM and Bus: DDR3-1066 MHz [20], 1 channel, 1 rank, 8 banks, 8B-wide data bus, burst length = 8, 8KB row buffer

Table 1: Main configuration parameters used for our evaluation

6. Results

We evaluate nine different mechanisms, including the baseline, four prior approaches (Dynamic Insertion Policy [18, 42], Skip Cache [44], DRAM-aware Writeback [27], and Virtual Write Queue [51]), and four variants of DBI with different combinations of the two performance optimizations (AWB and CLB). Table 2 lists all the mechanisms, the labels used for them in our evaluations, and the values for their key design parameters. All mechanisms except the baseline use TA-DIP [42] to determine the insertion policy for the incoming cache blocks. We do not present detailed results for Skip Cache in our figures as Skip Cache performs comparably to or worse than the baseline/TA-DIP (primarily because it employs the write-through policy). We evaluate the third optimization enabled by DBI—reducing ECC overhead—separately in Section 6.3 as it has no first order effect on performance.

6.1. Single-Core Results

Figure 6 presents the IPC results for our single-core experiments. The figure also plots the memory write row hit rate, LLC tag lookups per kilo instructions, memory writes per kilo instructions, and memory read row hit rate for all the benchmarks to illustrate the underlying trends that result in the performance improvement of the various mechanisms. For clarity, the figure does not show results for the Baseline LRU policy and for benchmarks with LLC MPKI < 1 or Baseline IPC > 0.9. For these benchmarks, there is no (negative) impact on performance due to any of the mechanisms. We draw several conclusions from our single-core results.

Mechanism: Description
Baseline: Baseline cache using the Least Recently Used (LRU) replacement policy
TA-DIP: Thread-aware dynamic insertion policy [18], 32 dueling sets, 10-bit policy selector, bimodal insertion probability = 1/64
DAWB: Cache using the DRAM-aware writeback policy [27] and the TA-DIP policy [18] for read accesses
VWQ: Virtual Write Queue [51], cache employs TA-DIP [18]
Skip Cache: Per-application lookup bypass [44], cache employs TA-DIP [18], threshold = 0.95, epoch length = 50 million cycles
DBI: Plain DBI without any optimizations, cache employs TA-DIP [18], DBI parameters listed in Table 1
DBI+AWB: DBI with the aggressive writeback optimization (described in Section 3.1)
DBI+CLB: DBI with the cache lookup bypass optimization (described in Section 3.2), same miss predictor as Skip Cache [44]
DBI+AWB+CLB: DBI with both the aggressive writeback and the cache lookup bypass optimizations

Table 2: List of evaluated mechanisms

[Figure 6: (top to bottom) a) Instructions per cycle (IPC), b) Write Row Hit Rate (Write RHR), c) LLC tag lookups per kilo instructions (Tag lookups PKI), d) Memory Writes per Kilo Instructions (WPKI), e) Read Row Hit Rate (Read RHR), for TA-DIP, DAWB, VWQ, DBI, DBI+AWB, DBI+CLB, and DBI+AWB+CLB on mcf, lbm, GemsFDTD, soplex, omnetpp, cactusADM, stream, leslie3d, milc, sphinx3, libquantum, bzip2, astar, bwaves, and the geometric mean. The benchmarks on the x-axis are sorted based on increasing order of baseline IPC.]

First, DBI+AWB significantly outperforms TA-DIP (13% on average), and performs similarly to DAWB and VWQ for almost all benchmarks (Figure 6a). This performance improvement is due to the significant increase in the memory write row hit rate—DAWB, VWQ, and DBI+AWB write back blocks of the same DRAM row together. These mechanisms improve the write row hit rate from 35% (TA-DIP) to 88%, 82%, and 81%, respectively (Figure 6b).

Second, the key difference between DBI+AWB and the other two mechanisms is brought out by the number of tag lookups they perform. Figure 6c plots the number of tag lookups per kilo instructions for all the mechanisms. DAWB significantly increases the number of tag lookups compared to TA-DIP (by 1.95x on average). This is because, when a dirty block is evicted from the cache, DAWB indiscriminately looks up the tag store for every other block of the corresponding DRAM row to check if each block is dirty, leading to a significant number of unnecessary lookups for blocks that are not dirty. As described in Section 3.1, although VWQ aims to reduce the number of unnecessary tag lookups by checking only the LRU ways for dirty blocks on each dirty eviction, it performs such additional lookups multiple times. Hence, it is not much more effective compared to DAWB (1.85x more tag store lookups compared to TA-DIP). In contrast to both DAWB and VWQ, DBI looks up the tag store for only blocks that are actually dirty. As a result, DBI (with AWB) obtains the benefits of aggressive DRAM-aware writeback without increasing tag store contention. Although this reduction in contention does not translate to single-core performance gains over DAWB and VWQ, DBI+AWB significantly improves multi-core performance compared to DAWB and VWQ (Section 6.2).

Third, Figure 6c also illustrates the benefits of the cache lookup bypass (CLB) optimization. For applications that do not benefit from the LLC, CLB avoids the tag lookup for accesses that will likely miss in the cache. As a result, the CLB optimization significantly reduces the number of tag lookups for several applications compared to TA-DIP (14% on average). Although this reduction in the number of tag lookups does not improve single-core performance, it significantly improves the performance of multi-core systems (Section 6.2).

Fourth, although one may expect DBI (and its variants) to increase the number of writebacks to main memory (due to its proactive writeback mechanism), this is not the case. Even with the aggressive writeback optimization (AWB) enabled, with the exception of mcf and omnetpp, DBI does not have any visible impact on the number of memory writes per kilo instruction (Figure 6d—Memory WPKI). Although not shown in the figure, DBI and its variants have no impact on the LLC MPKI (read misses per kilo instructions) compared to the TA-DIP policy. This is expected, as these mechanisms do not change the cache replacement policy for read requests—they only proactively write back the dirty blocks.

Finally, as a side effect of the increase in the write row hit rate, DBI (and its variants) also increase the read row hit rate (Figure 6e). This is because, with a high write row hit rate, very few rows are deactivated during the writeback phase of the memory controller. As a result, many DRAM rows opened by read requests are likely to remain open when the memory controller moves back to the read phase.

In summary, our final mechanism, DBI+AWB+CLB, which combines both optimizations, provides the best single-core performance compared to all other mechanisms while significantly reducing the number of tag lookups.

6.2. Multi-Core Results

Figure 7 plots the average weighted speedup of the different mechanisms for 2-core, 4-core and 8-core systems. We do not show results for VWQ as DAWB performs better than VWQ. The key takeaway from the figure is that DBI with both the AWB and CLB optimizations provides the best system throughput across all mechanisms for all the systems (31% better than the baseline and 6% better than DAWB for 8-core systems).

[Figure 7: Multi-core system performance. The figure plots weighted speedup for Baseline, TA-DIP, DAWB, DBI, DBI+AWB, DBI+CLB, and DBI+AWB+CLB on the 2-core, 4-core, and 8-core systems.]

We draw four other conclusions. First, across the many multi-core workloads, TA-DIP performs similarly to the baseline LRU on average. This is because TA-DIP makes sub-optimal decisions for some workloads, offsetting the benefit gained in other workloads. Second, DBI by itself performs slightly better than DAWB for all three multi-core systems. This is because, in contrast to DAWB, DBI evictions provide the benefit of DRAM-aware writeback without increasing contention for the tag store. Third, adding the AWB optimization further improves performance by more aggressively exploiting DRAM-aware writebacks. Across the 120 8-core workloads, DBI+AWB improves performance by 3% compared to DAWB. Fourth, the CLB optimization further reduces the contention for tag store lookups. This reduction in contention enables CLB to further improve the performance of DBI (with and without the AWB optimization).

To better understand the benefits of reducing tag store contention, let us consider a case study of the 2-core system running the workload GemsFDTD and libquantum. For this workload, DAWB performs a significant number of unnecessary tag store lookups (a 2.2x increase in tag store lookups compared to the baseline for GemsFDTD—Figure 6c). On the other hand, the CLB optimization significantly reduces the number of tag store lookups (a 3x reduction in tag store lookups for libquantum compared to the baseline). As a result of reducing contention for the tag store, we expect DBI to perform significantly better than DAWB and the CLB optimization to further improve performance. DAWB improves performance compared to the baseline by 40%. DBI (even without any optimization) improves performance by 83% compared to the baseline (30% compared to DAWB). This is because the DBI evictions obtain the benefit of DRAM-aware writebacks, even without AWB. In fact, enabling the AWB optimization does not buy much performance for this workload (< 1%). Enabling the CLB optimization, as expected, further improves performance (92% compared to the baseline and 37% compared to DAWB).

Figure 8 compares the system performance of DAWB with DBI+AWB+CLB for all the 259 4-core workloads. The workloads are sorted based on the weighted speedup improvement of DBI+AWB+CLB. There are two key takeaways from the figure. First, the average performance improvement of DBI over DAWB is not due to a small set of workloads. Rather, except for a few workloads, DBI+AWB+CLB consistently outperforms DAWB. Second, DBI slightly degrades performance compared to the baseline for only 7 workloads (mainly due to the additional writebacks generated by DBI).

[Figure 8: 4-core system performance for all 259 workloads. The figure plots weighted speedup normalized to the Baseline for DAWB and DBI+AWB+CLB.]

Table 3 presents three other multi-core performance and fairness metrics: instruction throughput, harmonic speedup [32], and maximum slowdown [14, 24, 25]. As our results indicate, DBI+AWB+CLB improves both system performance and fairness for all multi-core systems.

Number of Cores: 2 / 4 / 8
Number of workloads: 102 / 259 / 120
Weighted Speedup [50] Improvement: 22% / 32% / 31%
Instruction Throughput Improvement: 23% / 32% / 30%
Harmonic Speedup [32] Improvement: 23% / 36% / 35%
Maximum Slowdown [14, 24] Reduction: 18% / 29% / 28%

Table 3: Multi-core: Performance and fairness of DBI with both AWB and CLB optimizations compared to baseline

Based on our performance results, we conclude that DBI is a simple and effective mechanism to concurrently enable multiple optimizations that significantly improve system performance. In the next section, we show that DBI achieves this while reducing overall cache area cost.

6.3. Area and Power Analysis (with and without ECC)

A cache augmented with DBI can reduce the cache area cost compared to the conventional organization for two reasons. First, DBI tracks far fewer blocks than the main tag store, thereby directly reducing the storage cost for dirty bits. Even when tracking only 1/4th as many blocks as the main tag store, DBI significantly improves performance (as shown in Section 6.2). Second, as we discussed in Section 3.3, DBI can enable a simple mechanism to reduce the ECC overhead, further reducing the cache area cost.

In our evaluations, we assume the ECC organization shown in Figure 5. The baseline cache stores SECDED ECC (12.5% overhead) for all cache blocks, whereas the cache augmented with DBI stores only parity EDC (1.5% overhead) for all cache blocks and SECDED ECC for only the blocks tracked by the DBI.
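As a rough back-of-the-envelope check (our own arithmetic based on the overheads stated above and α = 1/4, ignoring tag and row-tag bits, so it does not reproduce the exact figures of Table 4): a 64B block is 512 bits, so baseline SECDED costs 64 bits per block, while the DBI organization costs about 8 bits of parity per block plus 64 bits of SECDED for a quarter of the blocks, roughly 24 bits of error-code storage per block on average.

    # Back-of-the-envelope error-code storage per 512-bit block (our own estimate).
    block_bits  = 512
    secded_bits = 0.125 * block_bits          # 64 bits of SECDED per block (baseline)
    parity_bits = 0.015 * block_bits          # ~8 bits of parity EDC per block
    alpha       = 1 / 4                       # fraction of blocks tracked by the DBI
    dbi_bits    = parity_bits + alpha * secded_bits
    print(secded_bits, dbi_bits)              # 64.0 vs ~23.7 bits per block on average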


Note that our performance evaluations do not include any overhead of computing EDC or ECC.

Table 4 summarizes the reduction in the tag store and overall cache storage cost due to DBI. Since we scale the DBI size with the cache size, the storage savings of using DBI, in terms of the total number of bits, is roughly independent of the cache size. As highlighted in the table, with α = 1/4 and with ECC enabled, DBI reduces the tag store cost (in terms of number of bits) by 44% and the overall cache cost by 7%.

DBI Size (α) = 1/4: Tag Store 2%, Cache 0.1% (without ECC); Tag Store 44%*, Cache 7% (with ECC)
DBI Size (α) = 1/2: Tag Store 1%, Cache 0.0% (without ECC); Tag Store 26%*, Cache 4% (with ECC)

Table 4: Bit storage cost reduction of cache with DBI compared to conventional cache (with and without ECC). *We assume ECC is stored in the main tag store (or DBI)

We perform a more detailed area and power analysis of a cache with DBI using CACTI [1]. Our results show that the reduction in bit storage cost translates to a commensurate reduction in the overall cache area. For a 16MB cache with ECC protection, our mechanism reduces the overall cache area by 8% and 5% for α = 1/4 and α = 1/2, respectively. We expect the gains to increase with stronger ECC, which will likely be required in future systems [7, 9].

Table 5 shows the percentage increase in the static and dynamic power consumption of the cache due to DBI. DBI leads to a marginal increase in both the static and dynamic power consumption of the cache. However, our analysis using the DDR3 SDRAM System-Power Calculator [2] shows that, as a result of increasing the DRAM row hit rate, our mechanism reduces the overall memory energy consumption of the single-core system by 14% on average compared to the baseline.

Cache size: 2MB / 4MB / 8MB / 16MB
Static: 0.12% / 0.21% / 0.21% / 0.22%
Dynamic: 4% / 1% / 1% / 2%

Table 5: DBI power consumption (fraction of total cache power)

6.4. Sensitivity to DBI Design Parameters

In Section 4, we described three key DBI design parameters that can potentially affect the effectiveness of the different optimizations: 1) DBI granularity, 2) DBI size (α), and 3) DBI replacement policy. As briefly mentioned in that section, the Least Recently Written (LRW) policy performs comparably to or better than the other policies. In this section, we evaluate the sensitivity of DBI performance to the DBI granularity and size, which are more unique to our design. We individually analyze the sensitivity of the two performance optimizations, AWB and CLB, to these parameters.

Table 6 shows the sensitivity of the AWB optimization to DBI size and granularity. The table shows the average IPC improvement of DBI+AWB compared to the baseline for the single-core system. As expected, the performance of AWB increases with increasing granularity and size. However, a smaller DBI size enables a larger reduction in the ECC area overhead (as shown in Section 6.3), presenting a trade-off between performance and area.

Granularity: 16 / 32 / 64 / 128
Size α = 1/4: 10% / 12% / 12% / 13%
Size α = 1/2: 10% / 12% / 13% / 14%

Table 6: Sensitivity of AWB to DBI size and granularity. Values show the average IPC improvement of DBI+AWB compared to baseline for our single-core system.

The effectiveness of the CLB optimization depends on the effectiveness of the miss predictor and the latency of looking up the DBI (Section 3.2). As described in that section, the effectiveness of our miss predictor, Skip Cache, depends on the epoch length and the miss threshold. The access latency of the DBI primarily depends on the size of the DBI. We ran extensive simulations analyzing the sensitivity of the CLB optimization to these parameters. For reasonable values of these parameters (bypass threshold of 0.5 to 0.95, epoch length of 10 million to 250 million cycles, and DBI size of 0.25 to 0.5), we did not find a significant difference in the performance improvement of the CLB optimization.

6.5. Sensitivity to Cache Size and Replacement Policy

Table 7 shows the improvement in weighted speedup of DBI (with both AWB and CLB) compared to the Baseline with two different cache sizes for each of the multi-core systems (2MB/core and 4MB/core). As expected, the performance improvement of DBI decreases with increasing cache size, as memory bandwidth becomes less of an issue with larger cache sizes. However, DBI significantly improves performance compared to the baseline even for large caches (25% for 8-core systems with a 32MB cache).

Cache size    2-Core    4-Core    8-Core
2MB/core      22%       32%       31%
4MB/core      20%       27%       25%

Table 7: Effect of varying the cache size. Values indicate the performance improvement of DBI+AWB+CLB over the baseline.

Since DBI modifies only the writeback sequence of the cache, it does not affect the read hit rate. As a result, we expect the benefits of DBI to complement any benefits from an improved replacement policy. As expected, even when using a better replacement policy, Dynamic Re-Reference Interval Prediction (DRRIP) [19], DBI significantly improves performance (7% over DAWB for the 8-core system).

7. Other Optimizations Enabled by DBI

Although we have quantitatively evaluated only three optimizations enabled by DBI, there are several other applications of DBI. Our approach can also be employed at other cache levels, organizing the dirty bit information to cater to the write access pattern most favorable to each cache level, similar to the DRAM-row-oriented organization in our proposal. In this section, we list a few other potential applications of DBI.

Load Balancing Memory Accesses. A recent prior work [49] proposed a mechanism to load-balance memory requests between an on-chip DRAM cache and off-chip memory to improve bandwidth efficiency. For this purpose, it uses costly special-purpose structures: a counting Bloom filter [16] to keep track of heavily written pages and a small cache to keep track of pages that are likely dirty in the DRAM cache. In our proposed organization, the DBI can seamlessly serve both purposes: it can track heavily written pages using its replacement policy and can track pages that are dirty in the cache.

Fast Lookup for Dirty Status. The DBI can efficiently answer queries of the form "Does DRAM row R have any dirty blocks?" or "Does DRAM bank/rank X have any dirty blocks?". As a result, the DBI can be used to make many opportunistic memory scheduling algorithms more effective, e.g., those that schedule writes based on rank idle time [55] or along with reads to the same row [21], and other eager writeback mechanisms [28, 35, 48].
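As a sketch of how such a query avoids the tag store, the following C fragment models a DBI lookup against the entry layout described earlier (a valid bit, a row tag, and a per-block dirty bit vector). The entry count and field widths are illustrative, and the sequential search stands in for the parallel associative lookup done in hardware.

#include <stdbool.h>
#include <stdint.h>

#define DBI_ENTRIES    1024    /* illustrative DBI size                   */
#define BLOCKS_PER_ROW 64      /* DBI granularity: blocks per DRAM row    */

/* One DBI entry: a valid bit, the DRAM row it covers, and one dirty bit
 * per block in that row (a sketch, not the exact hardware layout). */
struct dbi_entry {
    bool     valid;
    uint64_t row_tag;
    uint64_t dirty_vector;     /* bit i set => block i of the row is dirty */
};

static struct dbi_entry dbi[DBI_ENTRIES];

/* "Does DRAM row R have any dirty blocks?" -- answered with one DBI lookup
 * instead of up to BLOCKS_PER_ROW tag store lookups. */
bool row_has_dirty_blocks(uint64_t row) {
    for (int i = 0; i < DBI_ENTRIES; i++)
        if (dbi[i].valid && dbi[i].row_tag == row)
            return dbi[i].dirty_vector != 0;
    return false;              /* no entry => no block of the row is dirty */
}

/* "Is block b of row R dirty?" -- the per-block variant of the same query. */
bool block_is_dirty(uint64_t row, int blk) {
    for (int i = 0; i < DBI_ENTRIES; i++)
        if (dbi[i].valid && dbi[i].row_tag == row)
            return (dbi[i].dirty_vector >> blk) & 1ULL;
    return false;
}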

Cache Flushing. In many scenarios, large portions of the cache must be flushed, e.g., when powering down banks to save power [6, 11] or when persisting memory updates [13]. In such cases, current systems must write back dirty blocks in a brute-force manner, by looking up the entire tag store. In contrast, the DBI, with its compact representation of the dirty bit information, can improve both the efficiency and the latency of such cache flushes.
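As an illustrative sketch (not our exact hardware implementation), the fragment below walks the DBI's valid entries and issues writebacks only for the blocks whose dirty bits are set, rather than scanning the full tag store; the writeback hook and the parameter values are placeholders.

#include <stdbool.h>
#include <stdint.h>

#define DBI_ENTRIES    1024    /* illustrative DBI size                   */
#define BLOCKS_PER_ROW 64      /* DBI granularity                         */

struct dbi_entry {
    bool     valid;
    uint64_t row_tag;
    uint64_t dirty_vector;
};

static struct dbi_entry dbi[DBI_ENTRIES];

/* Assumed hook into the cache: writes block 'blk' of DRAM row 'row' back
 * to memory (placeholder for illustration). */
static void writeback_block(uint64_t row, int blk) { (void)row; (void)blk; }

/* Flush all dirty blocks by walking the compact DBI instead of scanning
 * every tag store entry. */
void flush_all_dirty_blocks(void) {
    for (int i = 0; i < DBI_ENTRIES; i++) {
        if (!dbi[i].valid)
            continue;
        for (int blk = 0; blk < BLOCKS_PER_ROW; blk++)
            if (dbi[i].dirty_vector & (1ULL << blk))
                writeback_block(dbi[i].row_tag, blk);
        dbi[i].dirty_vector = 0;   /* all blocks of this row are now clean */
        dbi[i].valid = false;
    }
}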

Direct Memory Access (DMA). DMA operations are often performed in bulk to amortize software overhead. A recent work [47] also proposed a memory-to-memory DMA mechanism to accelerate bulk copy operations. When a device reads data from memory, the memory controller must ensure that the data is not dirty in the cache [5]. The DBI can accelerate this coherence operation, especially in the case of a bulk DMA, where a single DBI query can provide the dirty status of several blocks involved in the DMA.

Metadata about Dirty Blocks. The DBI provides a compact, flexible framework that enables the cache to store information about dirty blocks. While we demonstrate the benefit of this framework by reducing the overhead of maintaining ECC, one can use the DBI in other scenarios that require the cache to store information only about dirty blocks (e.g., compression information for main memory compression [39]).
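As a sketch of this flexibility, the entry layout below augments each DBI entry with a per-block metadata field, here an ECC word per tracked block; the field widths are illustrative assumptions, and compression information could occupy the same slot.

#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_ROW 64      /* DBI granularity                          */

/* Sketch of a DBI entry extended with metadata that is needed only for
 * dirty blocks (here: one ECC word per tracked block). Field widths are
 * illustrative assumptions, not the exact hardware layout. */
struct dbi_entry_with_metadata {
    bool     valid;
    uint64_t row_tag;
    uint64_t dirty_vector;                 /* which blocks of the row are dirty */
    uint32_t ecc[BLOCKS_PER_ROW];          /* stronger ECC, kept for dirty blocks only */
};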

8. Related Work

To our knowledge, this is the first work that proposes a different way of organizing the dirty bit information of the cache to enable several dirty-block-related optimizations. We have already provided comparisons to prior works [23, 27, 29, 30, 33, 44, 51, 58, 59] that have explored these optimizations. In this section, we discuss other related work.

Loh and Hill [31] proposed a data structure called MissMap with the aim of reducing the latency of a cache miss by avoiding the need to look up the tag store (which is maintained in a DRAM cache). The DBI can be favorably combined with MissMap to accelerate costly MissMap entry evictions [31].

Wang et al. [54] propose a mechanism to predict last writes to cache blocks to improve writeback efficiency. This mechanism can be combined with the DBI to eliminate premature aggressive writebacks.

Khan et al. propose a mixed-cell architecture [22], in which most of the cache is built with small cells and a small portion is built with larger, more reliable cells. By storing dirty blocks in the more reliable portion of the cache, the amount and strength of ECC required for the cache can be reduced. This approach, although effective, limits the number of dirty blocks in each set, thereby increasing the number of writebacks. The DBI, in contrast, reduces the ECC overhead without requiring any changes to the cache data store design. Having said that, the DBI can be combined with such mixed-cell designs to enable all the other optimizations discussed in Sections 3 and 7.

Several prior works (e.g., [12, 17, 19, 40, 43, 46, 53, 56]) have proposed efficient cache management strategies to improve overall system performance. Our proposed optimizations can be used both to improve writeback efficiency (AWB) and to bypass the cache altogether for read accesses of applications for which these cache management strategies are not effective (CLB). Similarly, AWB can also be combined with different memory read scheduling algorithms (e.g., [8, 24, 25, 34, 36]).

9. Conclusion

We presented the Dirty-Block Index (DBI), a new and simple structure that decouples the dirty bit information from the main cache tag store. This decoupling allows the DBI to organize the dirty bit information independently of the tag store. More specifically, the DBI 1) groups the dirty bit information of blocks in the same DRAM row into the same DBI entry, and 2) tracks far fewer blocks than the main tag store. We presented simple and efficient implementations of three optimizations related to dirty blocks using the DBI. The DBI, with all three optimizations enabled, performs significantly better than individually employing any single optimization (6% better than the best previous mechanism for 8-core systems across 120 workloads), while also reducing the overall cache area by 8%.

Although we have demonstrated the benefits of the DBI using these three optimizations applied to the LLC, this approach is an effective way of enabling several other optimizations at different levels of the cache hierarchy, by organizing the DBI to cater to the write patterns of each cache level. We believe this approach can be extended to more efficiently organize other metadata in caches (e.g., cache coherence states), enabling further optimizations to improve performance and power efficiency.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. We thank the members of the SAFARI and LBA research groups for their feedback and the stimulating research environment they provide. We acknowledge the generous support from our industry partners: IBM, Intel, Qualcomm, and Samsung. This research was partially funded by NSF grants (CAREER Award CCF 0953246, CCF 1212962, and CNS 1065112), the Intel Science and Technology Center for Cloud Computing, and the Semiconductor Research Corporation.

References

[1] CACTI 6.0. http://www.cs.utah.edu/~rajeev/cacti6/.
[2] Micron Technologies, Inc., DDR3 SDRAM system-power calculator. http://www.micron.com/products/support/power-calc/.
[3] SPEC CPU2006 Benchmark Suite. www.spec.org/cpu2006.
[4] STREAM Benchmark. http://www.streambench.org/.
[5] Intel 64 and IA-32 Architectures Software Developer's Manual, volume 3A, chapter 11, page 12. April 2012.
[6] Software techniques for shared-cache multi-core systems. http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems, 2012.
[7] A. R. Alameldeen et al. Energy-efficient cache design using variable-strength error-correcting codes. In ISCA, 2011.
[8] R. Ausavarungnirun et al. Staged Memory Scheduling: Achieving high performance and scalability in heterogeneous systems. In ISCA, 2012.
[9] R. Baumann. Soft errors in advanced computer systems. IEEE Design & Test of Computers, 22(3):258–266, 2005.
[10] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.
[11] K. Chang et al. Enabling efficient dynamic resizing of large DRAM caches via a hardware consistent hashing mechanism. Technical Report SAFARI TR 2013-001, Carnegie Mellon University, 2013.
[12] J. D. Collins and D. M. Tullsen. Hardware identification of cache conflict misses. In MICRO, 1999.
[13] J. Condit et al. Better I/O through byte-addressable, persistent memory. In SOSP, 2009.
[14] R. Das et al. Application-aware prioritization mechanisms for on-chip networks. In MICRO, 2009.
[15] S. Eyerman and L. Eeckhout. System-level performance metrics for multiprogram workloads. IEEE Micro, 2008.
[16] L. Fan et al. Summary Cache: A scalable wide-area web cache sharing protocol. IEEE/ACM ToN, 8(3):281–293, June 2000.
[17] E. G. Hallnor and S. K. Reinhardt. A fully associative software-managed cache design. In ISCA, 2000.
[18] A. Jaleel et al. Adaptive insertion policies for managing shared caches. In PACT, 2008.
[19] A. Jaleel et al. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA, 2010.
[20] JEDEC. DDR3 SDRAM, JESD79-3F, 2012.
[21] M. Jeon. Reducing DRAM row activations with eager writeback. Master's thesis, Rice University, 2012.
[22] S. M. Khan et al. Improving multi-core performance using mixed-cell cache architecture. In HPCA, 2013.
[23] S. Kim. Area-efficient error protection for caches. In DATE, 2006.
[24] Y. Kim et al. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. In HPCA, 2010.
[25] Y. Kim et al. Thread Cluster Memory Scheduling: Exploiting differences in memory access behavior. In MICRO, 2010.
[26] Y. Kim et al. A case for exploiting subarray-level parallelism (SALP) in DRAM. In ISCA, 2012.
[27] C. J. Lee et al. DRAM-aware last-level cache writeback: Reducing write-caused interference in memory systems. Technical Report TR-HPS-2010-2, University of Texas at Austin, 2010.
[28] H. S. Lee et al. Eager writeback - a technique for improving bandwidth utilization. In MICRO, 2000.
[29] L. Li et al. Soft error and energy consumption interactions: A data cache perspective. In ISLPED, 2004.
[30] G. Liu. ECC-Cache: A novel low power scheme to protect large-capacity L2 caches from transient faults. IAS, 2009.
[31] G. H. Loh and M. D. Hill. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In MICRO, 2011.
[32] Kun Luo et al. Balancing throughput and fairness in SMT processors. In ISPASS, 2001.
[33] G. Memik et al. Just Say No: Benefits of early cache miss determination. In HPCA, 2003.
[34] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In ISCA, 2008.
[35] C. Natarajan et al. A study of performance impact of memory controller features in multi-processor server environment. In WMPI, 2004.
[36] K. J. Nesbit et al. Fair queuing memory systems. In MICRO, 2006.
[37] M. S. Papamarcos and J. H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In ISCA, 1984.
[38] H. Patil et al. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In MICRO, 2004.
[39] G. Pekhimenko et al. Linearly Compressed Pages: A low-complexity, low-latency main memory compression framework. In MICRO, 2013.
[40] M. K. Qureshi et al. The V-way cache: Demand based associativity via global replacement. In ISCA, 2005.
[41] M. K. Qureshi et al. A case for MLP-aware cache replacement. In ISCA, 2006.
[42] M. K. Qureshi et al. Adaptive insertion policies for high performance caching. In ISCA, 2007.
[43] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, 2006.
[44] K. Raghavendra et al. SkipCache: Miss-rate aware cache management. In PACT, 2012.
[45] S. Rixner et al. Memory access scheduling. In ISCA, 2000.
[46] V. Seshadri et al. The Evicted-Address Filter: A unified mechanism to address both cache pollution and thrashing. In PACT, 2012.
[47] V. Seshadri et al. RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization. In MICRO, 2013.
[48] J. Shao and B. T. Davis. A burst scheduling access reordering mechanism. In HPCA, 2007.
[49] J. Sim et al. A mostly-clean DRAM cache for effective hit speculation and self-balancing dispatch. In MICRO, 2012.
[50] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS, 2000.
[51] J. Stuecheli et al. The Virtual Write Queue: Coordinating DRAM and last-level cache policies. In ISCA, 2010.
[52] P. Sweazey and A. J. Smith. A class of compatible cache consistency protocols and their support by the IEEE Futurebus. In ISCA, 1986.
[53] G. Tyson et al. A modified approach to data cache management. In MICRO, 1995.
[54] Z. Wang et al. Improving writeback efficiency with decoupled last-write prediction. In ISCA, 2012.
[55] Z. Wang and D. A. Jiménez. Exploiting rank idle time for scheduling last-level cache writeback. In PACT, 2011.
[56] Y. Xie and G. H. Loh. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. In ISCA, 2009.
[57] A. Yoaz et al. Speculation techniques for improving load related instruction scheduling. In ISCA, 1999.
[58] D. H. Yoon and M. Erez. Flexible cache error protection using an ECC FIFO. In SC, 2009.
[59] D. H. Yoon and M. Erez. Memory mapped ECC: Low-cost error protection for last level caches. In ISCA, 2009.
[60] W. K. Zuravleff and T. Robinson. Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order. US Patent 5630096, 1997.
