
EDIC RESEARCH PROPOSAL

Architectural Support for Memory Streaming

Stavros Volos PARSA, I&C, EPFL

ABSTRACT Prior research has shown that irregular memory-intensive applications are dominated by off-chip data stalls, prompting a concerted effort to bring data into on-chip caches. Temporal memory streaming and address-correlated prefetching have been shown to capture irregular access patterns and fetch data before the processor requests them, significantly improving the performance of such applications. Address-correlated prefetching can be initiated in either hardware or software. Software prefetching benefits from knowledge of semantic program information, which lets it identify the parts of an application that exhibit frequently repeated data reference sequences. However, dynamic profiling adds execution-time overhead that must be overcome to yield a net performance improvement. A hardware prefetcher, on the other hand, is off the processor's critical path and hence adds no execution-time overhead, but it lacks semantic program information and therefore requires megabytes of correlation metadata to be effective. Entering the multicore era, the last level cache is the primary bottleneck of commercial systems. Our goal is to exploit software's knowledge of semantic program information to reduce the correlation metadata of a hardware address-correlating prefetcher. Moving the metadata into a small on-chip structure allows the prefetcher to fetch data from the last level cache rather than main memory and to hide the last level cache access latency.

INDEX TERMS caching, prefetching, temporal address-correlation

1. INTRODUCTION Advances in semiconductor fabrication along with microarchitectural innovation have resulted in a tremendous performance gap between processor and memory. Today's microprocessors try to reduce this gap by placing multiple levels of cache memory between the processor and main memory. However, accesses to distant hierarchy levels incur an order of magnitude higher latency than the levels closer to the processor. The long access latency of lower-level caches (those closer to main memory) and the limited instruction-level parallelism in software prevent today's out-of-order superscalar processors from fully overlapping the memory access latency.

Researchers have relied on prefetching as one method to mitigate this performance gap. Prefetching fetches data before the processor requests them, thus effectively hiding the memory access latency. To be effective, prefetching must achieve high coverage, accuracy, and timeliness (formalized below). Coverage is the fraction of memory requests supplied by the prefetcher rather than demand-fetched due to a miss. Accuracy is the fraction of prefetched cache lines that are actually used by the processor. Unless prefetched data reach the processor before they are requested, the memory latency cannot be hidden: timeliness indicates whether the data fetched by the prefetcher arrive before they are demand-fetched, yet not so early that they are discarded before use. An ideal prefetcher delivers to the processor exactly the data it needs, before they are demand-fetched.

While prefetching can be initiated in either hardware [9,15,16,20,27,28,31] or software [6,17,18,25], many researchers and vendors favor hardware implementations. Compared to software prefetching, hardware prefetching has the advantages of transparency and low overhead.
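For concreteness, coverage and accuracy are commonly written as ratios; the following is a standard formalization from the prefetching literature, restated in our own notation rather than quoted from the cited works:

```latex
% Coverage and accuracy as ratios (standard formulation, our notation):
\[
\mathrm{Coverage} \;=\; \frac{\text{prefetched hits}}
                             {\text{prefetched hits} + \text{remaining misses}},
\qquad
\mathrm{Accuracy} \;=\; \frac{\text{prefetched lines actually used}}
                             {\text{prefetched lines issued}}
\]
% Timeliness has no single closed form: a prefetch of line x is timely if it
% completes before the demand access to x, and x is not evicted (or discarded
% from a prefetch buffer) before that access.
```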

Proposal submitted to committee: June 18th, 2010; Candidacy exam date: June 25th, 2010; Candidacy exam committee: Giovanni de Micheli, Babak Falsafi, Anastassia Ailamaki.

This research plan has been approved:

Date: June 25th, 2010

Doctoral candidate: __________________________ (Stavros Volos)

Thesis director: __________________________ (Babak Falsafi)

Doct. prog. director: __________________________ (Rüdiger Urbanke)


Software prefetching and, more generally, compile-time analysis of memory access behavior [17,18] were studied in the past. More recently, software address-correlating prefetchers have been implemented [6,25]. However, software prefetching must overcome the overhead of dynamic profiling to yield net performance gains. In this research proposal we present three different approaches to implementing an address-correlating prefetcher. Chilimbi and Hirzel [6] use semantic program information to reduce the overhead of online profiling and analysis in their implementation. Solihin et al. [25] propose a hardware/software implementation in which the correlation metadata are kept in main memory to avoid building expensive on-chip structures; they use a memory processor to run a User-Level Memory Thread that observes L2 cache misses and prefetches data into the L2 according to the recorded access patterns. While these two approaches are designed for uniprocessor systems, Wenisch et al. [33] propose a practical hardware prefetcher for a chip multiprocessor that keeps the correlation metadata in two tables in main memory.

Our research plan focuses on hardware address-correlated prefetching in chip multiprocessors from the last level cache rather than main memory. Prior work emphasizes prefetching for uniprocessor or shared-memory multiprocessor systems, where a high portion of the execution time is wasted on off-chip accesses. In particular, commercial workloads such as online transaction processing (OLTP) running on shared-memory multiprocessors are dominated by such accesses due to frequent coherence misses in the last level cache [31,32]. However, entering the multicore era, the largest portion of execution time is wasted in the last level cache (LLC) rather than main memory [11]. Although hardware address-correlating prefetchers are practical and efficient when prefetching from main memory [33], they cannot be practical and efficient when prefetching from the LLC (e.g., the L2 in most designs). This is because the correlation tables require megabytes of metadata, which are stored in main memory. Keeping the metadata in a small on-chip structure instead costs coverage, since such a structure cannot capture all the useful access patterns.

Our goal is to use software knowledge to increase the coverage of an address-correlating prefetcher while maintaining the metadata in a small on-chip structure. Our intuition is that there are parts of the software where an address-correlating prefetcher exhibits a high prediction rate and other parts where it exhibits a low prediction rate. Therefore, the prefetcher needs to track only the miss addresses of the high-prediction parts of the application. This increases the coverage of the prefetcher without increasing the size of the on-chip structure.

The rest of this research proposal is organized as follows: Section 2 discusses address-correlated prefetching and presents in depth three different approaches to implementing an address-correlating prefetcher. Section 3 describes our research proposal, the preliminary work of preparing the necessary infrastructure for our research purposes, and the future work.

2. ADDRESS-CORRELATED PREFETCHING This section describes three papers related to address-correlating prefetchers implemented in either software or hardware. The first paper describes a software dynamic prefetching framework for fetching hot data streams [6]. The second paper presents address-correlated prefetching done by a User-Level Memory Thread running on a memory processor [25]. The last paper proposes a practical hardware implementation of a prefetching mechanism [33] for a chip multiprocessor.

2.1 DYNAMIC HOT DATA STREAM PREFETCHING FOR GENERAL-PURPOSE PROGRAMS

Sophisticated automatic prefetching techniques have been developed for scientific codes that access dense arrays in tight nested loops [19]. They rely on static compiler analysis to predict a program's data accesses and insert prefetch instructions at appropriate program points. However, these techniques fail to capture the reference patterns of general-purpose programs, which use pointer-based data structures.

Researchers have shown that programs possess a small number of hot data streams. A hot data stream is a frequently repeated sequence of <pc,addr> pairs, where pc is the address of the instruction that references the data address addr. Hot data streams have been shown to account for 90% of program references and 80% of cache misses [5,23]. Chilimbi and Hirzel [6] implement a dynamic prefetching scheme that profiles the data references of the program, extracts hot data streams from the temporal profile, and injects code at appropriate program points to detect and prefetch these hot data streams. The scheme then deoptimizes by removing the injected code and restarts the profiling phase from the beginning. This is necessary because a program may have different execution phases, with different hot data streams in each phase.

2.1.1 DYNAMIC PROFILING AND ANALYSIS The profiling framework needs to collect a temporal reference profile with low overhead, because the slowdown from profiling must be overcome by effective prefetching to yield net performance gains. The profiler obtains a temporal profile by sampling bursts of data references [1,12]. Bursts are sub-sequences of the sequence of all events during the program execution. To implement sampling, each procedure is duplicated in two versions, checking and instrumented. Both versions contain the original instructions, but only the instrumented code profiles the data references. Both versions transfer control to checks at procedure entries and loop back-edges. A counter counts the number of checks executed while running each version; when it reaches a fixed threshold, control transfers to the other version. Techniques that use semantic program information to eliminate checks at procedure entries and loop back-edges, balancing the overhead/profiling-quality trade-off, are described in [12].
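To make the version-switching scheme concrete, here is a minimal Python sketch (our reconstruction, not the framework's code; the threshold value and all names are assumptions, and the real framework uses a much larger check count for the checking version than for the instrumented one, whereas we use a single threshold for brevity):

```python
CHECK_THRESHOLD = 1000      # checks before switching versions (assumed)

counter = 0
profiling = False           # which duplicated version is executing
profile = []                # temporal reference profile (sampled bursts)

def check():
    """Executed at procedure entries and loop back-edges in both versions;
    bounces control between the checking and instrumented versions."""
    global counter, profiling
    counter += 1
    if counter >= CHECK_THRESHOLD:
        counter = 0
        profiling = not profiling   # transfer control to the other version

def load(pc, addr):
    """A data reference; only the instrumented version records it."""
    if profiling:
        profile.append((pc, addr))  # <pc,addr> pair, as in the paper
```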

Figure 1. Sequitur grammar (left) and parse tree (right) for the string abaabcabcabc.


The online analysis algorithm relies on the Sequitur algorithm [21] to extract hot data streams. Sequitur is used to compress the temporal data reference profile constructed by the profiling framework: it builds a context-free grammar for the language {w} consisting of only one word, the string w. Each non-terminal A of a Sequitur grammar generates a language {wA} with just one word wA. A non-terminal is considered hot if its heat exceeds a fixed threshold, where heat is the product of the length of the non-terminal's expansion and the number of times it occurs in the parse tree of the Sequitur grammar, excluding occurrences in sub-trees that belong to other hot non-terminals. Figure 1 shows the Sequitur grammar (left) for the string abaabcabcabc and the corresponding parse tree (right). The algorithm finds that abcabc is a hot data stream. Non-terminal C is not hot, because it does not occur in any sub-tree other than B's. Non-terminal A is not hot either, because the majority of its occurrences belong to B's sub-trees.
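The following minimal Python sketch (our reconstruction, not the authors' code) illustrates the heat computation on a plausible Sequitur grammar for the example string; the grammar, the threshold value, and all names are assumptions chosen to reproduce the paper's conclusion:

```python
from functools import lru_cache

grammar = {                      # non-terminal -> right-hand side (assumed)
    "S": ["A", "a", "C", "B"],   # S expands to ab.a.abc.abcabc = abaabcabcabc
    "A": ["a", "b"],             # A -> ab
    "B": ["C", "C"],             # B -> abcabc
    "C": ["A", "c"],             # C -> abc
}

@lru_cache(maxsize=None)
def length(sym):
    """Length of the terminal string a symbol expands to."""
    return 1 if sym not in grammar else sum(length(s) for s in grammar[sym])

def occurrences(target, node, hot):
    """Count target in the parse tree rooted at node, skipping sub-trees
    of already-hot non-terminals (the exclusion rule in the text)."""
    count = 1 if node == target else 0
    if node in grammar and node not in hot:
        count += sum(occurrences(target, c, hot) for c in grammar[node])
    return count

THRESHOLD = 6                    # heat threshold (assumed)
hot = set()
# Consider the longest candidates first; the start symbol is excluded.
for nt in sorted((n for n in grammar if n != "S"), key=length, reverse=True):
    if length(nt) * occurrences(nt, "S", hot) >= THRESHOLD:
        hot.add(nt)

print(hot)   # {'B'}: abcabc is the only hot data stream, as in the paper
```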

2.1.2 DYNAMIC PREFETCHING After extracting the hot data streams from the temporal profile, the scheme must match their prefixes and prefetch their suffixes. A single deterministic finite state machine (DFSM) is generated to match the prefixes of all hot data streams, using a fixed prefix size. Picking the appropriate prefix size is important: while a long prefix reduces how much of a detected stream remains to be prefetched, a short prefix may cause inaccurate matches and consequently useless prefetches. Next, detection and prefetching code is generated and injected into the program. The generated code follows the DFSM and consists of if-then-else code that checks whether the current state is final (i.e., a prefix has been matched); the data references of the matched hot data stream's suffix are then prefetched. The detection and prefetching code is injected dynamically using a binary instrumentation tool [29]. For every procedure that contains at least one program counter (PC) at which the optimizer wants to inject code, the optimizer first makes a copy of the procedure and then injects the code into that copy. Finally, an unconditional jump to the copy is placed before the first instruction of the original procedure. To deoptimize, the optimizer simply removes these jumps.
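A minimal Python sketch of the match-prefix-then-prefetch-suffix idea follows (our illustration, not the generated if-then-else code; PREFIX_LEN, the example streams, and issue_prefetch are assumptions, and the sliding window is equivalent to a DFSM over fixed-length prefixes):

```python
PREFIX_LEN = 2                  # fixed prefix size (value assumed)

def build_matcher(hot_streams):
    """Map each length-PREFIX_LEN prefix to the suffix to prefetch."""
    return {tuple(s[:PREFIX_LEN]): s[PREFIX_LEN:] for s in hot_streams}

def observe(matcher, references, issue_prefetch):
    """Slide a PREFIX_LEN window over the reference stream; hitting a
    stored prefix corresponds to reaching a final DFSM state."""
    window = []
    for ref in references:
        window = (window + [ref])[-PREFIX_LEN:]
        for addr in matcher.get(tuple(window), []):
            issue_prefetch(addr)

matcher = build_matcher([["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]])
observe(matcher, ["x", "a1", "a2"], issue_prefetch=print)  # prints a3, a4
```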

2.1.3 EVALUATION AND CONCLUSION The framework is evaluated on five low-performance benchmarks from the SPECint2000 suite and a graphics application called boxsim. These benchmarks experience irregular access patterns whose performance traditional prefetching mechanisms cannot improve. The average overhead (i.e., temporal profiling, hot data stream extraction, and prefix matching) and the net performance improvement are 6.4% and 10.5% respectively, showing that software prefetching can be effective.

In conclusion, the main drawback of this work is that online profiling and analysis of the recorded addresses impose a significant overhead. The authors sacrifice coverage of the hot data streams to reduce this overhead by sampling the program's recorded address sequence. Furthermore, there is no way to distinguish whether a data reference misses in the cache or not; the prefetched suffix therefore contains all addresses that follow the prefix of the matched hot data stream. Although it may contain potential cache misses, useless prefetches will also be issued, wasting memory bandwidth. Nevertheless, this work shows that taking advantage of semantic program information is a promising approach to reducing the cost of profiling data references.

2.2 USING A USER-LEVEL MEMORY THREAD FOR CORRELATION PREFETCHING

In prior research work, address-correlated prefetching has been supported by hardware controllers, which typically require large hardware tables to keep the correlations between a miss address and its successors. Solihin et al. [25] propose address-correlated prefetching using a User-Level Memory Thread (ULMT) running on a simple general-purpose processor in the memory system. The memory processor may reside either in the memory controller chip or in a DRAM chip. This style of prefetching is called memory-side, because the prefetching engine does not reside near the main processor but near or in the memory. The correlation table is a software data structure placed in main memory and cached in the memory processor, allowing inexpensive accesses. The memory processor observes the requests from the main processor that reach main memory and prefetches other data into the cache that may be useful to the main processor in the future.

The ULMT runs two independent steps: prefetching and learning. During the prefetching step, when a miss is observed, the thread looks up the correlation table and generates the addresses of the lines to prefetch. In the learning step that follows, the thread updates the correlation table with the observed miss address. The prefetching step is characterized by the response time (the time from the moment a miss is observed until the corresponding prefetches are generated), which should be as small as possible. The occupancy time (the time the ULMT is busy for a single observed miss) should be less than the time between two consecutive L2 misses; otherwise the second L2 miss will not be serviced in time, hurting the prefetcher's effectiveness.

2.2.1 CORRELATION PREFETCHING ALGORITHMS The authors extend the base pair-based correlation algorithm introduced with Markov predictors [15] with two algorithms: chain and replicated. In the chain algorithm, the correlation table maintains up to NumRows entries, each associated with a miss address and NumSucc successors of that miss. Figure 2.a.i shows the resulting correlation table. Upon an observed miss, the thread looks up the correlation table and, if the miss address is found, prefetches all of its successors.

Figure 2. Correlation prefetching algorithms: (a) Chain, (b) Replicated.


The MRU successor is then taken and the process is repeated NumLevels-1 times. Next, the observed miss address is stored as a successor in the entry of the previous miss address; if the observed address does not yet exist in the table, a new entry is also allocated for it. Figure 2.a.iii illustrates which addresses are prefetched when miss address a is observed: addresses d and b are prefetched as the immediate successors of the observed miss, and address c is also prefetched because it is the immediate successor of the MRU successor of the observed miss. To address the coverage limitation of the original pair-based correlation algorithm, where only one successor is prefetched, the parameter NumLevels has to be large. This leads to a high response time, because each access to the table involves an associative search.

To avoid the high response time of the chain algorithm, the correlation table is expanded and the replicated algorithm is proposed. Each row stores a miss address and NumLevels levels of successors, where each level contains the NumSucc addresses that follow the miss address at that distance. Figure 2.b shows an example: b and c are the successors of a, and hence are stored in the corresponding entry. Upon a miss that exists in the table, all addresses of the corresponding entry are prefetched; in the example of Figure 2.b, a miss on address a prefetches addresses d, b, and c. The work of updating all the levels is done in the learning step rather than in the prefetching step, as in the chain algorithm. Therefore, the response time of the replicated algorithm is smaller than that of the chain algorithm, because only one associative search is needed.

The replicated algorithm not only reduces the response time but also prefetches the true MRU successors at each level, while the chain algorithm does not. Consider the sequence of miss addresses a,b,c,…,b,e,b,f,…,a,b,c. In the chain algorithm, the entry of address a will contain address b as a successor, but the entry of address b will contain miss addresses f and e. Upon a new miss on address a, address b will be prefetched, along with addresses f and e as the immediate successors of address b; address c, the actual second successor of address a, is not prefetched. In the replicated algorithm, by contrast, the entry of address a also contains address c in the second level. Thus, on a new miss on address a, addresses b and c, the actual successors of address a, will be prefetched.
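A minimal Python sketch of the replicated algorithm follows (our reconstruction under the paper's parameter names NumSucc, NumLevels, and NumRows; the parameter values, issue_prefetch, and the plain dictionary standing in for the cached software table are assumptions):

```python
from collections import OrderedDict, deque

NUM_SUCC, NUM_LEVELS, NUM_ROWS = 2, 2, 1024   # table parameters (assumed)

table = OrderedDict()            # miss address -> per-level MRU successor lists
recent = deque(maxlen=NUM_LEVELS)             # recent[0] = most recent miss

def observe_miss(addr, issue_prefetch):
    # Prefetching step: a single associative lookup; prefetch every level.
    for level in table.get(addr, []):
        for succ in level:
            issue_prefetch(succ)
    # Learning step: addr is the level-i successor of the i-th previous miss.
    for i, prev in enumerate(recent):
        levels = table.setdefault(prev, [[] for _ in range(NUM_LEVELS)])
        if addr in levels[i]:
            levels[i].remove(addr)
        levels[i].insert(0, addr)             # move to MRU position
        del levels[i][NUM_SUCC:]              # keep at most NumSucc successors
    table.setdefault(addr, [[] for _ in range(NUM_LEVELS)])
    table.move_to_end(addr)                   # refresh this row
    while len(table) > NUM_ROWS:
        table.popitem(last=False)             # evict the oldest row
    recent.appendleft(addr)

for miss in ["a", "b", "c", "x", "a"]:        # the second miss on a
    observe_miss(miss, issue_prefetch=print)  # prefetches b (level 1), c (level 2)
```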

2.2.2 ARCHITECTURE OF THE SYSTEM Figure 3 illustrates the architecture of a system that integrates the memory processor in the North Bridge chip. Miss requests from the main processor are deposited into queues 1 and 2 at the same time. The ULMT uses the entries of queue 2 to build its correlation table and, based on it, generates the addresses to prefetch; these addresses are deposited into queue 3. Queues 1 and 3 compete for memory access, but queue 1 has higher priority, because it contains the actual requests of the processor. When an address is deposited into queue 3, it is compared against the entries of queue 2. On a match, the address is removed from queue 3, because a higher-priority request for the same address is already in queue 1; it is also removed from queue 2 to save computation in the ULMT, reducing the total occupancy of the scheme. Similarly, when a processor miss is about to be deposited into queues 1 and 2, the hardware compares its address against those in queue 3; if a match exists, the request is only put into queue 1 and the matching entry is removed from queue 3. Finally, Figure 3 shows the Filter module, which drops prefetch requests to any address for which another prefetch request was recently issued, to avoid prefetching the same addresses within a short time period.

2.2.3 EVALUATION AND CONCLUSION The base algorithm correctly predicts on average 82% of the misses at the first level. The replicated algorithm correctly predicts on average 77% and 73% of the misses at the second and third levels respectively. On average, 60% of L2 misses are separated by between 200 and 280 cycles. The round-trip latency to memory is 208-243 cycles, and hence dependent misses are likely to fall within this 60%. Dependent misses cannot be overlapped, so the prefetcher has to cover them; to do so, the occupancy time needs to be less than 200 cycles. The replicated algorithm achieves a 32% speedup on average over the system without prefetching; combined with a processor-side prefetcher, a 46% average speedup is achieved, and when correctly customized for each application, a 53% average speedup. The chain algorithm performs worse than the replicated algorithm.

In conclusion, this implementation has several drawbacks. While the paper innovates by placing the correlation table in main memory and using the cache of the memory processor to allow fast table lookups, the hardware cost of a second general-purpose processor is high. Another drawback is that the parameters NumLevels and NumSucc are predefined, so the implementation cannot support variable stream lengths effectively. Furthermore, multicore processors usually have more than one memory controller, and each memory controller receives only a subset of a core's memory requests. Locating the prediction logic in the memory controller therefore prevents the prefetcher from seeing the entire miss stream of each executing thread, so the scheme will not work for processors with multiple memory controllers.

2.3 MAKING ADDRESS-CORRELATED PREFETCHING PRACTICAL

Despite its efficacy, address-correlated prefetching has never been implemented in a shipping processor. Because the required megabytes of address correlation metadata are far too large to be stored on-chip, recent address-correlating prefetchers store the metadata in main memory [6,8,10,31]. However, shifting the correlation tables into main memory creates two challenges. First, accessing the correlation tables takes at least one memory access latency, which delays the prefetches; prefetchers cope with this long lookup latency by targeting prefetches far ahead of the observed miss. Second, memory bandwidth pressure increases, because table lookups and updates generate extra memory traffic on every observed cache miss. Wenisch et al. [33] propose Sampled Temporal Memory Streaming (STMS) to address these challenges and make address-correlated prefetching practical.

Figure 3. Architecture of a system integrating the memory processor in the North Bridge chip.


2.3.1 TEMPORAL STREAMING A temporal stream is a sequence of two or more cache misses that recurs over program execution. To illustrate how temporal streams recur in database applications, we describe an example of range scans in B+ trees. The B+ tree is a critical data structure in database applications: it maintains a sorted index of records according to a key constructed from one or more fields of the record. Each B+ tree node contains a sorted key list with pointers to children, such that the range of keys within a child's sub-tree is bounded by the adjacent keys in the parent. The leaves of the B+ tree point to the locations of the corresponding database records. A distinguishing feature of B+ trees is the existence of horizontal pointers that connect sibling leaves, enabling fast in-order tree traversals. During a range scan, the database engine first locates the lower key and then traverses horizontally along sibling links until it reaches the upper key. Scans over overlapping ranges result in temporal streams following the sibling links: the first range scan records a miss sequence for the leaves along the bottom of the tree, and the second range scan accesses the same leaves in the exact same order.

Temporal streams extend the notion of address correlation to sequences rather than pairs of misses [15]. Researchers initially proposed correlation tables to store temporal streams. The primary drawback of those initial proposals is that the stream length is fixed by the table entry. Offline analysis has shown that stream lengths vary from 2 to 100 misses [6,31]; fixing the stream length is therefore a storage/coverage trade-off, as tables with small entries lose coverage while tables with large entries use storage inefficiently. To support variable-length temporal streams, researchers separated the address sequence storage from the correlation data [9,20,31]. The application's recent miss address sequence is recorded in a history buffer, and an index table maps a particular miss address to a location in the history buffer. Thus, a single entry in the index table points to a stream of arbitrary length, allowing maximal coverage without extra storage overhead. Figure 4 shows the index table and the history buffers.

2.3.2 DESIGN OVERVIEW As discussed in the introduction of this section, shifting the correlation metadata to main memory increases the lookup latency and the memory bandwidth pressure. In this subsection we present Sampled Temporal Memory Streaming (STMS), a practical design that addresses these challenges. Figure 4 shows a block diagram of a 4-core chip multiprocessor equipped with STMS. Each core keeps its own history buffer in main memory, but all cores share a unified index table that is also kept in main memory. The history buffer must be large enough to hold all intervening miss addresses between a recorded temporal stream and its previous occurrence; otherwise the temporal stream is lost and coverage is reduced. For commercial workloads, 32 megabytes are needed for the history buffers. The index table is implemented as a bucketized probabilistic hash table [7]: miss addresses are spread over buckets, with a least-recently-used replacement policy within each bucket. In this design, 16 megabytes are needed for the index table to achieve maximum coverage. On chip, STMS requires a prefetch buffer and an address queue per core. The address queue keeps the addresses read from the history buffer; STMS prefetches addresses from the queue in order and keeps the prefetched data in the prefetch buffer to avoid cache pollution from erroneous prefetches.

STMS records correct-path cache misses and prefetched hits in the corresponding core's history buffer. To avoid polluting the history buffers with wrong-path accesses, the effective address of a load instruction that incurs a cache miss is appended to the history buffer only when the load retires. To minimize pin-bandwidth overhead, an on-chip cache-block-sized buffer accumulates entries, which are written as a group to main memory [10]. Upon a cache read miss, STMS performs a pointer lookup in the index table: the miss address is hashed to select a bucket, the entire bucket is retrieved in one memory access and searched linearly, and if the miss address is found, the address sequence is read from the history buffer starting at the history pointer location. On an update, if the miss address does not exist, the least recently used entry of the bucket is replaced. Since the index table is unified across cores, a core can locate a temporal stream in another core's history buffer. The hash table design allows fast index table lookups, addressing the first challenge of maintaining the correlation tables in main memory.

To reduce the memory bandwidth pressure of index table updates, the authors propose probabilistic index update: for every potential index table update, a predetermined sampling probability determines whether the update is performed. The index table bandwidth is directly proportional to the sampling probability. For long temporal streams, an index table entry is created near the first miss with high probability, and the coverage lost on the first few blocks is negligible relative to the stream length. For short, frequent temporal streams, the probability of adding the head of the stream to the index table becomes high after several stream repetitions. Therefore, probabilistic update does not reduce coverage significantly.
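A minimal Python sketch of this organization follows (our reconstruction, not the STMS hardware; the buffer size and class names are assumptions, and a plain dictionary stands in for the bucketized probabilistic hash table):

```python
import random

SAMPLING_PROB = 0.125       # 12.5% sampling probability, as evaluated in the paper
HISTORY_SIZE = 1 << 16      # per-core circular history buffer size (assumed)

class STMS:
    def __init__(self, num_cores):
        self.history = [[None] * HISTORY_SIZE for _ in range(num_cores)]
        self.head = [0] * num_cores
        self.index = {}     # shared index table: miss addr -> (core, position)

    def record_miss(self, core, addr):
        """Append a retired miss to the core's history buffer; insert an
        index pointer only with probability SAMPLING_PROB."""
        pos = self.head[core]
        self.history[core][pos] = addr
        self.head[core] = (pos + 1) % HISTORY_SIZE
        if random.random() < SAMPLING_PROB:
            self.index[addr] = (core, pos)    # probabilistic index update

    def lookup_stream(self, addr, max_len=8):
        """On a read miss, follow the index pointer into the (possibly
        remote) history buffer and return the addresses that followed."""
        if addr not in self.index:
            return []
        core, pos = self.index[addr]
        buf = self.history[core]
        stream = []
        for i in range(1, max_len + 1):
            nxt = buf[(pos + i) % HISTORY_SIZE]
            if nxt is None:
                break
            stream.append(nxt)                # candidate prefetch addresses
        return stream
```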

2.3.3 EVALUATION AND CONCLUSION Whereas traffic overhead decreases rapidly with the sampling probability, predictor coverage decreases only logarithmically. With a 12.5% sampling probability, the average traffic overhead reduction is 2.45x for commercial and 6.45x for scientific workloads. The evaluation demonstrates that hash-based lookups and probabilistic updates yield a bandwidth-, latency-, and storage-efficient temporal memory streaming design that keeps the predictor metadata in main memory while achieving 90% of the performance potential of idealized on-chip metadata storage.

Figure 4. STMS block diagram.


Although the mechanism is built for a chip multiprocessor, it targets only off-chip cache misses in order to hide off-chip access latency. Unfortunately, in the multicore era, where multithreaded applications run on a single chip, the time spent on last level cache accesses (e.g., the L2 in most designs) is greater than the time spent going off-chip. Therefore, a mechanism that maintains the correlation table in main memory would not work for prefetching from the LLC.

3. RESEARCH PLAN Prior research work in address-correlated prefetching emphasizes prefetching from main memory, because systems have exhibited high numbers of off-chip accesses. The last level cache of uniprocessor systems fails to capture the working set of applications with large memory footprints, leading to many off-chip accesses. Shared-memory multiprocessor systems (i.e., systems where each processor resides in its own chip) exhibit many off-chip accesses even when running applications with smaller memory footprints but many shared accesses. In commercial applications such as online transaction processing, the majority of accesses are shared among all the processors; with each processor on its own chip, coherence misses (i.e., misses in the LLC because the cache block is not the most recent copy) constitute the majority of off-chip accesses [32]. Prefetching from main memory is thus a promising approach to reduce off-chip accesses and improve processor performance.

However, entering the multicore era, where multiple cores are integrated on one chip, the bottleneck shifts to the LLC [11]. The reasons are twofold. First, in chip multiprocessors the LLC has to be big enough to capture a significant fraction of the working set of each core, which increases its access latency. Second, in most designs the LLC is shared among all the cores on the chip, which eliminates coherence misses in the LLC; eliminating coherence misses dramatically reduces off-chip accesses in applications where data are shared among the cores (e.g., online transaction processing). As a result, the time wasted accessing main memory becomes much smaller than the time wasted accessing the LLC, shifting the bottleneck to the LLC.

In this research proposal we focus on hardware address-correlated prefetching from the LLC as a way to hide the LLC's access latency. This is a challenging task, since it involves conflicting design choices. To be effective, the correlation data must be stored on chip; otherwise an access to off-chip correlation metadata prevents data from arriving in time, limiting the prefetcher's effectiveness. Storing megabytes of correlation data on chip, however, is very expensive and clearly impractical. On the other hand, limiting the size of the on-chip structure may cause coverage loss, since useful sequences will be discarded from the correlation tables before they recur. Unless we find a way to keep the correlation metadata in a cheap on-chip structure, we will not be able to implement a hardware mechanism that prefetches effectively from the LLC.

Our intuition is that there are parts of the application where an address-correlating prefetcher exhibits high accuracy; we refer to these parts as hot regions. We can reduce the storage requirements by storing only the observed miss addresses coming from the hot regions, making it possible to achieve high coverage without large on-chip structures. When a hot region starts or stops executing, the prefetcher should start or stop recording observed miss addresses into the history buffers. Unfortunately, hardware lacks semantic program information and hence cannot know when a hot region is executing. Software, however, does know, and can give hardware a hint to start or stop recording when a hot region starts or stops executing. By combining this software knowledge of semantic program information with a hardware prefetcher, the storage requirements can be reduced, making a hardware address-correlated prefetching mechanism for the LLC feasible.
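A minimal Python sketch of the proposed hint interface follows (a hypothetical interface we use for illustration, not an implemented design; all names are assumptions):

```python
class HintedRecorder:
    """Records miss addresses only while software signals a hot region."""
    def __init__(self):
        self.recording = False
        self.history = []            # stands in for the on-chip history buffers

    def enter_hot_region(self):      # software hint, e.g. at a B+ tree range scan
        self.recording = True

    def exit_hot_region(self):       # software hint when leaving the hot region
        self.recording = False

    def observe_l1_miss(self, addr):
        if self.recording:           # only hot-region misses consume metadata storage
            self.history.append(addr)

rec = HintedRecorder()
rec.observe_l1_miss(0x100)           # ignored: outside any hot region
rec.enter_hot_region()
rec.observe_l1_miss(0x200)           # recorded
rec.exit_hot_region()
```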

3.1 PRELIMINARY WORK We use trace-based and cycle-accurate full-system simulation of a chip multiprocessor using FLEXUS [34]. FLEXUS models the SPARC v9 ISA and can execute unmodified commercial applications and operating systems. It extends the Virtutech Simics functional simulator with models of processing tiles with out-of-order and in-order cores, a cache hierarchy, on-chip protocol controllers, and an on-chip interconnect. We simulate a chip multiprocessor with 16 out-of-order cores, private L1 caches, and a shared L2. In our initial work we focus on online transaction processing (OLTP) workloads, using Shore-MT [14] as the underlying system. Shore-MT is a state-of-the-art storage manager based on Shore [4] and built at CMU/EPFL; it is a multithreaded incarnation of Shore that exploits the high parallelism offered by today's chip multiprocessors. On top of Shore-MT, its maintaining team offers three OLTP benchmarks (TPC-B [35], TPC-C [36], and TATP [37]). We chose Shore-MT because it is open-source and its maintaining team can easily provide support. We have already ported Shore-MT to FLEXUS and shown that Shore-MT exhibits microarchitectural behavior similar to that of commercial systems such as IBM DB2 by comparing the time breakdowns of these systems; the results of our research on Shore-MT can therefore be considered representative. Furthermore, the time breakdowns confirm that the last level cache is the major bottleneck of database systems running OLTP workloads.

3.2 FUTURE WORK Before identifying the hot regions of Shore-MT running an OLTP workload, we need to study its L1 miss sequences and verify that they recur in repetitive sequences. We will apply the Sequitur algorithm to quantify the maximum opportunity for prefetching from the L2: Sequitur analysis shows the fraction of total misses that recurs in repetitive sequences (temporal streams). The stream length is a critical factor affecting the usefulness of temporal streams; calculating it is important because a long stream cannot be buffered completely on chip without displacing other potentially useful data. Reuse distance is another critical factor affecting the coverage of the prefetcher: the on-chip structure that maintains the correlation data has to be large enough to hold a temporal stream until its recurrence. By calculating reuse distances we obtain a rough estimate of the maximum storage required for the correlation metadata (a simple trace analysis of this kind is sketched below).

Next, we plan to identify the hot regions of Shore-MT and measure how much coverage we can achieve by recording only the miss sequences that occur in the hot regions. If our intuition is correct, we will be able to achieve high coverage while keeping the correlation data in a small on-chip structure. The last part of this research proposal is to build a prefetcher that takes hints from the software when entering and exiting hot regions, stores the hot regions' correlation metadata in the index table and history buffers as explained in [33], and prefetches from the L2 upon an L1 miss.
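The reuse distance measurement referenced above can be illustrated with a minimal Python sketch (a hypothetical helper, not part of FLEXUS): for each miss address, it reports the number of misses observed since that address's previous occurrence.

```python
from collections import defaultdict

def reuse_distances(miss_trace):
    """Yield (address, distance) for every repeated miss address."""
    last_seen = {}
    for i, addr in enumerate(miss_trace):
        if addr in last_seen:
            yield addr, i - last_seen[addr]
        last_seen[addr] = i

# A toy trace: the stream a,b,c recurs after one intervening miss.
trace = ["a", "b", "c", "d", "a", "b", "c"]
hist = defaultdict(int)
for addr, dist in reuse_distances(trace):
    hist[dist] += 1
print(dict(hist))    # {4: 3}: each of a, b, c recurs after 4 misses
```

The histogram of such distances over a real L1 miss trace would bound how large the on-chip correlation structure must be to hold a stream until its recurrence.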

4. REFERENCES

[1] M. Arnold and B. Ryder. A Framework for Reducing the Cost of Instrumented Code. In Proc. of PLDI, 2001.
[2] J.-L. Baer and T.-F. Chen. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty. In Proc. of Supercomputing '91, November 1991, pp. 176-186.

[3] L. A. Barroso, K. Gharachorloo, and E. Bugnion. Memory System Characterization of Commercial Workloads. In Proc. of 25th ISCA, June 1998.

[4] M. Carey, D. DeWitt, J. Naughton, M. Solomon, et al. Shoring Up Persistent Applications. In Proc. of SIGMOD, 1994.
[5] T. M. Chilimbi. Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality. In Proc. of PLDI, June 2001.
[6] T. M. Chilimbi and M. Hirzel. Dynamic Hot Data Stream Prefetching for General-Purpose Programs. In Proc. of PLDI, 2002, pp. 199-209.

[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[8] Y. Chou. Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. In Proc. of 40th MICRO, December 2007.
[9] M. Ferdman, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Temporal Instruction Fetch Streaming. In Proc. of 41st MICRO, December 2008.

[10] M. Ferdman and B. Falsafi. Last-Touch Correlated Data Streaming. In Proc. of ISPASS, 2007.
[11] N. Hardavellas, I. Pandis, R. Johnson, N. G. Mancheril, A. Ailamaki, and B. Falsafi. Database Servers on Chip Multiprocessors: Limitations and Opportunities. In Proc. of 3rd Biennial Conf. on Innovative Data Systems Research, 2007.
[12] M. Hirzel and T. Chilimbi. Bursty Tracing: A Framework for Low-Overhead Temporal Profiling. In Workshop on FDDO, 2001.
[13] Z. Hu, M. Martonosi, and S. Kaxiras. TCP: Tag Correlating Prefetchers. In Proc. of 9th HPCA, 2003.
[14] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: A Scalable Storage Manager for the Multicore Era. In Proc. of EDBT, 2009, pp. 24-35.
[15] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. In Proc. of 24th ISCA, June 1997, pp. 252-263.

[16] A.-C. Lai, C. Fide, and B. Falsafi. Dead-Block Prediction & Dead-Block Correlating Prefetchers. In Proc. of 28th ISCA, June 2001, pp. 144-154.

[17] M. H. Lipasti, W. J. Schmidt, S. R. Kunkel, and R. R. Roediger. SPAID: Software Prefetching in Pointer- and Call-Intensive Environments. In Proc. of 28th MICRO, 1995, pp. 231-236.
[18] C.-K. Luk and T. C. Mowry. Compiler-Based Prefetching for Recursive Data Structures. In Proc. of 7th ASPLOS, October 1996, pp. 222-233.
[19] T. Mowry, M. Lam, and A. Gupta. Design and Analysis of a Compiler Algorithm for Prefetching. In Proc. of 5th ASPLOS, 1992.
[20] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. IEEE Micro, 2005, pp. 90-97.

[21] C. G. Nevill-Manning and I. H. Witten. Linear-Time, Incremental Hierarchy Inference for Compression. In Proc. of DCC, 1997.
[22] A. Roth, A. Moshovos, and G. Sohi. Dependence Based Prefetching for Linked Data Structures. In Proc. of 8th ASPLOS, October 1998.

[23] S. Rubin, R. Bodik, and T. Chilimbi. An Efficient Profile-Analysis Framework for Data-Layout Optimizations. In Proc. of POPL, January 2002.
[24] T. Sherwood, S. Sair, and B. Calder. Predictor-Directed Stream Buffers. In Proc. of 33rd MICRO, December 2000.

[25] Y. Solihin, J. Lee, and J. Torrellas. Using a User-Level Memory Thread for Correlation Prefetching. In Proc. of 29th ISCA, June 2002.

[26] A. J. Smith. Sequential Program Prefetching in Memory Hierarchies. IEEE Computer, vol. 11, no. 12, December 1978, pp. 7-21.

[27] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial Memory Streaming. In Proc. of 33rd ISCA, July 2006.

[28] S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi. Spatio-Temporal Memory Streaming. In Proc. of 36th ISCA, June 2009.

[29] A. Srivastava, A. Edwards, and H. Vo. Vulcan: Binary Transformation in a Distributed Environment. Microsoft Research Technical Report MSR-TR-2001-50, 2001.
[30] P. Trancoso, J.-L. Larriba-Pey, Z. Zhang, and J. Torrellas. The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. In Proc. of 3rd HPCA, 1997.
[31] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. Temporal Streaming of Shared Memory. In Proc. of 32nd ISCA, June 2005.

[32] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. Temporal Streams in Commercial Server Applications. In Proc. of IISWC, 2008.
[33] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. Making Address-Correlated Prefetching Practical. IEEE Micro, 2010, pp. 50-59.
[34] T. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe. SimFlex: Statistical Sampling of Computer System Simulation. IEEE Micro, 26(4):18-31, Jul-Aug 2006.
[35] TPC Benchmark B. http://www.tpc.org/tpcb
[36] TPC Benchmark C. http://www.tpc.org/tpcc
[37] TATP Benchmark. http://tatpbenchmark.sourceforge.net