
Call Graph Prefetching for Database Applications

MURALI ANNAVARAM
Intel Corporation
and
JIGNESH M. PATEL and EDWARD S. DAVIDSON
The University of Michigan, Ann Arbor

With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper, we propose Call Graph Prefetching (CGP), an instruction prefetching technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. CGP can be implemented either in software or in hardware. The software-based CGP (CGP S) uses profile information to build a call graph, and uses the predictable call sequences in the call graph to determine which function to prefetch next. The hardware-based CGP (CGP H) uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution, and uses that stored history when choosing which functions to prefetch.

We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of instruction cache (I-cache) misses was very small even without any prefetching, obviating the need for CGP. On the other hand, the database workloads do suffer a significant number of I-cache misses; CGP S improves their performance by 23% and CGP H by 26% over a baseline system that has already been highly tuned for efficient I-cache usage by using the OM tool. CGP, with or without OM, reduces the I-cache miss stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned baseline system toward perfect I-cache performance.

This work was done while M. Annavaram was at the University of Michigan. This material is based upon work supported by the National Science Foundation under Grant IIS-0093059.

Authors' addresses: Murali Annavaram, Intel Corporation, 220 Mission College Blvd., Santa Clara, CA 95052-8119; email: [email protected]; Jignesh M. Patel, University of Michigan, 2239 EECS, Ann Arbor, MI 48109-2122; email: [email protected]; Edward S. Davidson, 1100 Chestnut Rd., Ann Arbor, MI 48104; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2003 ACM 0734-2071/03/1100-0412 $5.00

ACM Transactions on Computer Systems, Vol. 21, No. 4, November 2003, Pages 412–444.


Categories and Subject Descriptors: C.4 [Performance of Systems]: Design studies; C.1.0 [Processor Architectures]: General

General Terms: Performance, Design, Experimentation

Additional Key Words and Phrases: Instruction cache prefetching, call graph, database

1. PERFORMANCE BOTTLENECKS IN DBMS

The increasing need to store and query large volumes of data has made database management systems (DBMS) one of the most prominent applications on today's computer systems. DBMS performance in the past was bottlenecked by disk access latency, which is orders of magnitude slower than processor cycle times. But with the trend toward denser and cheaper memory, database servers in the near future will have large main memory configurations, and many working sets will be resident in main memory [Bernstein et al. 1998]. Moreover, techniques such as concurrent query execution, where a query that is waiting for a disk access is switched with another query that is ready for execution, can successfully mask most of the remaining disk access latencies. Several commercial database systems already implement concurrent query execution along with asynchronous I/O to reduce the I/O bottleneck. Once the disk access latency is tolerated, or disk accesses are sufficiently infrequent, the primary performance bottleneck shifts from the I/O response time to the memory access time.

There is a growing gap between processor and memory speeds, which can be reduced by the effective use of multi-level caches. But recent studies have shown that current database systems, with their large code and data footprints, suffer significantly from poor cache performance [Ailamaki et al. 1999; Boncz et al. 1998; Lo et al. 1998; Nyberg et al. 1994; Shatdal et al. 1994]. Thus the key challenge in improving the performance of memory-resident database systems is to utilize caches effectively and reduce cache miss stalls.

The graph in Figure 1 shows the number of instruction cache (I-cache) misses incurred while concurrently executing a set of Wisconsin [Bitton et al. 1983] and TPC-H [TPC 1999] benchmark queries in a DBMS built on top of SHORE [Carey et al. 1994], using a 32 KB I-cache. The leftmost bar, labeled O5, shows the I-cache misses for the binary compiled using the highest compiler optimization level (C++ with the −O5 optimization flag turned on). The second bar, O5+OM, shows the misses when the O5 binary is further optimized using the OM tool [Srivastava and Wall 1992], which implements a modified version of the Pettis and Hansen [1990] code layout scheme for improving I-cache performance (OM optimizations are described in Section 5.3). Similarly, O5+NL 4 and O5+OM+NL 4 are for the O5 and O5+OM binaries, respectively, with tagged next-N-line (NL) instruction prefetching, where the underlying hardware issues prefetches to the next 4 lines whenever there is a cache miss or a first hit to a cache line after it was prefetched by the tagged NL scheme. O5+CGP S 4 and O5+CGP H 4, respectively, show the cache misses using the software and the hardware based CGP alone, without using the OM optimizations. These two schemes are described in Sections 3 and 4, respectively. O5+OM+CGP S 4 and O5+OM+CGP H 4 are for CGP applied to an OM optimized binary.


Fig. 1. I-Cache misses in DBMS for 32 KB I-cache.

For each bar, the upper component, labeled CallTarget-Miss, shows the I-cache misses that occur immediately following a function call, that is, upon accessing the target address of the function call. The bottom component, IntraFunc-Miss, shows all the remaining I-cache misses (incurred while executing instructions within a function boundary).

Four key observations may be made from the graph:

(1) Although OM does reduce the I-cache misses, there is still room for significant improvement.

(2) Both OM optimizations and NL 4 prefetching significantly reduce the IntraFunc-Misses; however, they are almost totally ineffective in reducing the CallTarget-Misses.

(3) CGP is very effective in reducing the CallTarget-Misses, as intended; moreover, the CGP H optimized binary also achieves a considerable further reduction in IntraFunc-Misses.

(4) Finally, comparing CGP without OM and CGP with OM, it is apparent that CGP makes OM optimizations almost unnecessary.

Previous research [Franklin et al. 1994] has shown that DBMS have a large number of small functions due to the modular software architecture used in their design. Furthermore, after applying existing compiler techniques and simple prefetch schemes (cf. O5+OM+NL 4 in Figure 1), I-cache misses at function call boundaries constitute a significant portion of all I-cache misses. In order to recover the performance lost due to call target misses without sacrificing the advantages of modular design, we have developed Call Graph Prefetching (CGP), an instruction prefetching technique that analyzes the call graph of an application and prefetches instructions from a function that is likely to be called next. Although CGP is a generic instruction prefetching scheme, it is particularly effective for large software systems such as DBMS because of the layered software design approach used by these systems. Section 2 argues intuitively why CGP can be effective in prefetching for database applications.

Section 3 then describes an algorithm (CGP S) that implements CGP in software. This algorithm uses a profile run to build a call graph of a database system, and exploits the predictable call sequences in the call graph to determine which function to prefetch next.

Section 4 reviews the hardware implementation of CGP introduced in Annavaram et al. [2001a]. This implementation (CGP H) uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution, and uses that stored history when choosing which functions to prefetch.

In this paper we show that CGP S performs nearly as well as CGP H, but without the need for additional hardware. However, the hardware approach may well be preferred by many people who have an aversion to or distrust of profiling. In Annavaram et al. [2001a] we used two well known metrics, coverage and accuracy, to determine the effectiveness of CGP H. But the coverage and accuracy metrics ignore the effects of a prefetch of line X on the cache line, Y, that it replaces, for example, in the case that Y will be needed before the next reference to X. This lack of information about the line that is replaced to accommodate the prefetched line makes the coverage and accuracy metrics insufficient to evaluate the effectiveness of the prefetches issued by a prefetch scheme. Hence, to measure the effectiveness of CGP, we now use a more refined prefetch classification, the Prefetch Traffic and Miss Taxonomy (PTMT), developed by Srinivasan et al. [2003].

Section 5 describes the simulation environment and performance analysis tools that we used to assess the effectiveness of CGP S and CGP H.

Section 6 describes previous related work, and Section 7 presents conclusions and suggests future directions.

2. MOTIVATION FOR CALL GRAPH PREFETCHING

Fig. 2. Software layers in a typical DBMS.

DBMS software is commonly built using a layered software architecture where each layer provides a set of well-defined entry points to the layers above it. Figure 2 shows the layers in a typical database system, with the storage manager being the bottom-most. The storage manager provides basic file storage mechanisms (such as tables and indices), concurrency control, and transaction management facilities. Relational operators that implement algorithms for join, aggregation, and so on are typically built on top of the storage manager. The query scheduler, the query optimizer, and the query parser are then built on top of the operator layer. Each layer in this modular architecture provides a set of well-defined entry points and hides its internal implementation details so as to improve the portability and maintainability of the software. The sequence of function calls within each of these entry points is transparent to the layers above. Although such layered code typically exhibits poor spatial and temporal locality, the function call sequences can often be predicted with great accuracy. CGP exploits this predictability to prefetch instructions from the function that is deemed most likely to be executed next.

2.1 A Simple Call Graph Example

We introduce CGP with the following pedagogical example. Figure 3 shows a segment of a call graph for adding a record to a file in SHORE [Carey et al. 1994]. SHORE is a storage manager that provides storage volumes, B+ trees, R* trees, concurrency control, and transaction management. In this example, Create rec makes a call to Find page in buffer pool to check if the relation into which the record is being added is already in the main memory buffer pool. If the page is not already in the pool, the Getpage from disk function is invoked to bring it into the pool from the disk. This page is then locked using the Lock page routine, subsequently updated using Update page, and finally unlocked using Unlock page.

Fig. 3. Call graph for the Create rec function.

The Create rec function is the entry point provided by the storage manager to create a record, and is routinely invoked by a number of relational operators, including insert, bulk load, join (to create temporary partitions or sorted runs), and aggregate. Although it is difficult to predict when calls to Create rec will occur, once Create rec is invoked, Find page in buffer pool is always the next function to be called. When a page is brought into the memory buffer pool from the disk, DBMS typically "pin" the page in the buffer pool to prevent the possibility of it being replaced before it is used. Given a large buffer pool size and repeated calls to Create rec, the page that is being updated will usually be found pinned in the buffer. Hence Getpage from disk will usually not be called, and Lock page, Update page, and Unlock page will be the sequence of functions next invoked. CGP capitalizes on this predictability by prefetching instructions needed for executing Find page in buffer pool upon entering Create rec, then prefetching instructions for Lock page once Find page in buffer pool is entered, and finally prefetching instructions for Update page upon returning from Find page in buffer pool, and for Unlock page upon returning from Update page.

3. IMPLEMENTING CGP IN SOFTWARE

This section presents CGP S, an algorithm that implements CGP in software. CGP S uses a profile run to build a call graph of a database system, and inserts instructions into the original database system to prefetch a function that is likely to be called next. Although CGP S uses a profile workload to build the call graphs, our experiments show it to be highly effective even when the actual query workload differs significantly from the profile workload. Profiling is a popular technique that is used to identify execution bottlenecks and tune applications to achieve better performance. Most compilers (including Compaq, Intel IA32 and IPF compilers, and gcc) support profile-directed feedback optimizations. CGP S can be implemented on any processor whose instruction set architecture supports an instruction for prefetching instructions into the I-cache. Some current architectures, such as HP-PA8000 and SPARC-V9, already provide an instruction for prefetching into the I-cache, and we believe that future architectures are likely to provide such an instruction.

3.1 CGP S Algorithm

This CGP algorithm requires two inputs: a database binary, and a sample workload that is used for collecting profile information about the given database binary. CGP S has three phases: the call graph generation phase, the profile generation phase, and the prefetch insertion phase, as described below:

(1) The Call Graph Generation Phase
(a) The CGP algorithm reads the given database binary and builds a directed call graph. Each function in the binary is represented as a node in the graph, and a directed edge is drawn from each caller function to each function that it calls (callee). Self edges (from a node to itself), which occur when a function invokes itself recursively, are removed.
(b) Each node in the call graph is initially labeled 0, to signify that the first invocation of the function has not yet occurred.
(c) Each edge in the call graph is initially labeled 0, to signify that its caller function has not yet invoked its callee function.

(2) The Profile Generation Phase
(a) To warm up the buffer pool, CGP first runs the given database binary once with the sample database workload as input.
(b) Then, to generate a sequence of function call tokens, which are stored in a profile file, CGP runs an instrumented binary with the same sample workload as input. During the instrumented binary run, the profile generation phase collects information regarding the call sequences of the DBMS. To collect this profile information, CGP uses ATOM [Srivastava and Eustace 1994] to instrument all the function call points in the original (uninstrumented) database binary. When the instrumented binary is executed, it generates a sequence of tokens of the form <caller, callee>. Whenever a function is called during this run, a <caller, callee> token is generated and appended to the sequence, provided that both the following constraints are satisfied:
i. The caller function is "enabled," that is, it is being executed for the first time during the profile run of the instrumented binary. All functions are enabled initially and are permanently disabled when they first return.
ii. The caller function is invoking this callee function for the first time, that is, no token is appended if another instance of the same token has already been stored in the profile file.
Thus the sequence of tokens generated during the execution of the instrumented binary corresponds to the sequence of function calls invoked by each caller during its first invocation, with duplicate tokens deleted. Note that the instrumented binary's execution is functionally identical to the original binary; it simply runs somewhat slower due to the profile generation process.

(3) The Prefetch Insertion Phase
(a) CGP reads the sequence of tokens from the profile file. For each token <A, B>, the value stored in node A of the call graph is incremented. Then the edge from A to B in the call graph is labeled with the value stored in node A. Therefore, in this labeled call graph, the edge labels originating from node A indicate the order of the functions invoked by A during the first invocation of A in the profile run. When all the tokens from the profile file have been processed, a new call graph is produced with these updated edge labels (a sketch of this labeling step appears after this list).
(b) CGP uses the labeled call graph to insert prefetch instructions into the original binary. For each edge of the call graph with a label greater than 0, CGP inserts a prefetch instruction that prefetches the callee function associated with that edge. For each caller, the corresponding callee functions are prefetched in the ascending order of their edge labels, as follows: CGP inserts a prefetch for the caller's first callee function as early as possible in the first basic block of that caller; a prefetch for the second callee function is inserted immediately after the call to the first callee function, that is, the prefetch of the second callee function will occur as soon as the first callee returns; and so on.
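As a concrete illustration of step (3)(a), the following is a minimal C++ sketch of how the token sequence could be turned into edge labels. The container choices and names (CallGraph, labelCallGraph, FuncAddr) are our own illustrative assumptions, not taken from the actual CGP implementation.

```cpp
// Sketch of the call-graph labeling step (Phase 3a). Tokens are assumed to
// have already been filtered as described in Phase 2 (first-invocation
// callers only, duplicate tokens removed, self edges dropped in Phase 1).
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using FuncAddr = std::uint64_t;

struct CallGraph {
    // Value stored in each node: how many callees have been labeled so far
    // for that caller (initially 0).
    std::map<FuncAddr, int> nodeValue;
    // Label on each <caller, callee> edge (initially 0 = never prefetched).
    std::map<std::pair<FuncAddr, FuncAddr>, int> edgeLabel;
};

void labelCallGraph(CallGraph& cg,
                    const std::vector<std::pair<FuncAddr, FuncAddr>>& tokens) {
    for (const auto& token : tokens) {
        const FuncAddr caller = token.first;
        const FuncAddr callee = token.second;
        int order = ++cg.nodeValue[caller];       // increment the value in node A
        cg.edgeLabel[{caller, callee}] = order;   // label edge A->B with that value
    }
}
```

Edges whose label remains 0 after all tokens are processed (such as the Getpage from disk edge in the running example) never receive a prefetch instruction.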

Since CGP S uses profile information, prefetching an entire function may waste processor resources if the prefetched function is not invoked during the actual query execution. Moreover, prefetching an entire large function into the I-cache could pollute the cache by replacing existing cache lines that may be needed sooner than the prefetched lines. Hence CGP S inserts prefetch instructions to prefetch only the first N cache lines of a function. The rest of the callee function is assumed to be prefetched after entering the callee function itself by using a tagged NL prefetching scheme. In particular, CGP assumes that the underlying hardware issues a prefetch to the next N lines whenever a cache miss occurs, as well as upon the next reference to a cache line after it has been prefetched. (Tagged NL prefetching is described further in Section 3.2.) N is a parameter whose value can be selected based on the I-cache capacity, line size, and miss latency. We use the notation CGP S N to represent a CGP S scheme that prefetches only N cache lines, rather than an entire function, on each prefetch request.

Fig. 4. Directed call graph with the edge labels from the profile execution.

Figure 4 shows the labeled directed call graph for the example described in Figure 3. Initially all the outgoing edges are labeled 0. During the profile workload execution, when the Create rec function is first invoked it makes a call to Find page in buffer pool, followed by calls to Update page and Unlock page. The outgoing edges are labeled 1, 2, 3 to reflect the order in which the function calls from Create rec were made. When Find page in buffer pool is first invoked (after the buffer is warmed up), it makes a call to Lock page, and hence the corresponding outgoing edge from Find page in buffer pool is labeled 1. Since Find page in buffer pool did not call Getpage from disk during this first instance of its execution after warm-up, the label for the corresponding outgoing edge retains its initial value of 0.

Fig. 5. Create rec function after applying the CGP algorithm (new prefetch instructions are indicated by *).

Figure 5 shows the code generated after applying the CGP S algorithm to the call graph of Create rec. A prefetch instruction is inserted at the beginning of Create rec to prefetch Find page in buffer pool. Since Update page is the next function called after returning from Find page in buffer pool, a prefetch instruction for Update page is inserted after the call to Find page in buffer pool. Unlock page is prefetched immediately after returning from Update page. An instruction is inserted early in Find page in buffer pool to prefetch Lock page. Since its incoming edge label is 0, Getpage from disk is never prefetched.
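To make the transformation concrete, here is a hand-written C++ sketch of what the code in Figure 5 amounts to. __prefetch_icache is a hypothetical intrinsic standing in for the architecture-specific I-cache prefetch opcode discussed in Section 3.2, and the underscored names are our rendering of the function names in the figure.

```cpp
// Hypothetical prefetch intrinsic and forward declarations (illustrative only).
void __prefetch_icache(void (*f)());   // assumed to prefetch the first N lines at f's entry
void Find_page_in_buffer_pool();
void Lock_page();
void Update_page();
void Unlock_page();

// Create_rec after CGP_S insertion; the inserted prefetches correspond to
// the * lines of Figure 5.
void Create_rec() {
    __prefetch_icache(&Find_page_in_buffer_pool);  // edge label 1: first basic block
    // ... original Create_rec code ...
    Find_page_in_buffer_pool();
    __prefetch_icache(&Update_page);               // edge label 2: right after the call
    Update_page();
    __prefetch_icache(&Unlock_page);               // edge label 3: right after the call
    Unlock_page();
}

void Find_page_in_buffer_pool() {
    __prefetch_icache(&Lock_page);   // its only edge with a nonzero label
    // ... Getpage_from_disk (edge label 0) is never prefetched ...
    Lock_page();
}
```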

3.2 Considerations for CGP S Implementation

CGP S requires that the instruction set architecture include an opcode to prefetch instructions into the instruction cache. In our implementation we assume that the underlying machine has one such opcode. Some current architectures, such as HP-PA8000 and SPARC-V9, already provide an instruction for prefetching into the I-cache. The more recent Itanium ISA supports an even wider range of prefetch instructions. Hence, we believe that future architectures are likely to provide such an instruction.

Since, in our implementation, CGP S uses tagged NL prefetching to prefetch within function boundaries, the tagged NL component requires hardware support. The tagged NL scheme can be implemented by adding one tag bit to each cache line; the tag bit associated with a cache line is set whenever the line is prefetched, and is reset once the line has been accessed or replaced. The tag bit is never set for a cache line that was brought into the cache by a demand reference that missed. On any I-cache miss and on any hit to an I-cache line whose tag bit is set, the next N sequential lines are prefetched into the I-cache; prefetches are squashed for any of these lines that are already in the I-cache or are currently en route to the I-cache.
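A minimal sketch of this tagged NL trigger logic, written as C++ simulator-style code: lookup, demandFill, inFlight, and issuePrefetch are assumed helpers provided by a surrounding cache model, not any real simulator API.

```cpp
#include <cstdint>

struct CacheLine {
    std::uint64_t lineAddr = 0;
    bool tag = false;   // set when prefetched; reset on first access or replacement
};

// Assumed helpers from the surrounding cache model (declarations only).
CacheLine* lookup(std::uint64_t lineAddr);      // nullptr on miss
void demandFill(std::uint64_t lineAddr);        // demand fetch: tag stays clear
bool inFlight(std::uint64_t lineAddr);          // already en route to the I-cache?
void issuePrefetch(std::uint64_t lineAddr);     // prefetched lines arrive with tag = true

// Called on every I-cache reference; N is the prefetch depth (e.g., 4).
void onICacheReference(std::uint64_t lineAddr, int N) {
    CacheLine* line = lookup(lineAddr);
    bool trigger = false;
    if (line == nullptr) {          // a demand miss triggers NL prefetching
        demandFill(lineAddr);
        trigger = true;
    } else if (line->tag) {         // so does the first hit to a prefetched line
        line->tag = false;
        trigger = true;
    }
    if (!trigger) return;
    for (int i = 1; i <= N; ++i) {  // prefetch the next N sequential lines,
        std::uint64_t next = lineAddr + i;
        if (lookup(next) == nullptr && !inFlight(next))   // squashing lines already
            issuePrefetch(next);                          // present or en route
    }
}
```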

For evaluating the effectiveness of CGP S we ran the CGP S algorithm only once to obtain a profile and generate a new database binary with prefetch instructions. But in a production environment, if necessary, a new DBMS binary can be generated periodically by profiling a production run. In such environments the decision to generate a new binary could be made, for example, whenever the I-cache miss rate increases significantly, as measured by a hardware monitor that can non-intrusively count the I-cache misses [www.developer.intel.com/drg/mmx/appnotes/perfmon.htm]. As shown in Section 5.4, CGP is highly effective even when the production run differs from the profile run. Thus we expect that it will rarely be necessary in practice to build a new database binary and swap it with an existing binary.

4. IMPLEMENTING CGP IN HARDWARE

The software implementation of CGP uses a profile run to determine the call sequences in the DBMS. It lacks the ability to adapt dynamically to runtime changes in the function call sequences. This section presents a hardware implementation of CGP (CGP H) that uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution; CGP H uses this stored history when choosing which functions to prefetch. As opposed to the software scheme, this hardware scheme does incur hardware cost beyond what is needed to support tagged NL prefetching, but it has the ability to adapt to dynamic changes, if any, in the function call sequences.


Fig. 6. Call graph history cache (state shown in CGHC occurs as Lock page is being prefetched from Find page in buffer pool).

4.1 Exploiting Call Graph Information Using CGHC

The main hardware component of the CGP H prefetcher is the Call Graph History Cache (CGHC), which comprises a tag array and a data array, as shown in Figure 6. The tag array is similar to the tag array of a direct mapped cache. Each entry in the tag array stores the starting address of a function and an index (I). The tag array is indexed using the lower order bits of the starting address of a function, F; when the full function address of F matches the function address stored in the tag entry, the corresponding data array entry is accessed. The data array entry stores a sequence of starting addresses corresponding to the sequence of functions that were called by F the last time that function F was called. If F has not yet returned from its most recent call, this sequence may be partially updated. For ease of explanation and readability, here and in Figure 6 we use the function name to represent the starting address of the function.

By analyzing the executables using ATOM [Srivastava and Eustace 1994], we discovered that 80% of the functions in our benchmarks make calls to no more than 8 distinct functions. Hence we decided to let each entry in the data array, as implemented in our evaluations, store up to 8 function addresses. If a function in the tag entry invokes more than 8 functions, only the first 8 functions invoked are stored in our evaluations. As shown below in Section 5.5, a small direct mapped CGHC achieves nearly the same performance as an infinite size CGHC; hence we chose to use a direct mapped CGHC instead of a more complex set-associative CGHC.

Each call and each return instruction that is executed makes two accesses to the CGHC. In each case the first of these accesses uses the target address of the call (or the return) to determine which function to prefetch next; the second access uses the starting address of the currently executing function to update the current function's index and call sequence that are stored in the CGHC. To quickly generate the target address of a call or return instruction, the processor's branch predictor is used, instead of waiting for the target address computation, which may not be available from the out-of-order processor pipeline for several more cycles.

On a CGHC access, if there is no hit in the tag array, no prefetches are issued, and a new tag array entry is created with the desired tag and an index value of 1. The corresponding data array entry is marked "invalid," unless the CGHC miss occurs on the second (update) access for a call (say P calls F), in which case the first slot of the data array entry for P is set to F.

In general, the index value in the tag array entry for a function F points to one of the functions in the data array entry for F. An index value of 1 selects the first (leftmost) function in the data array entry. Note that the index value is initialized to 1 whenever a new entry is created for F, and the index value is reset to 1 whenever F returns.

When the branch predictor predicts that P is calling F, the first (call prefetch) access to the direct mapped CGHC tag array is made by using the lower order bits of the predicted target address, F, of the function call. If the address stored in the tag entry matches F, given that the index value of a function being called should be 1, a prefetch is issued to the first function address that is stored in the corresponding data array entry. The second function will be prefetched when the first function returns, the third when the second returns, and so on. The prefetcher thus predicts that the sequence of calls to be invoked by F will be the same as the last time F was executed. We chose to implement this prediction scheme because of the simplicity of the resulting prefetch logic and the accuracy of this predictor for stable call sequences.

For the same call instruction (P calls F), the second (call update) access to the CGHC tag array is made using the lower order bits of the starting address of the current function, P. If the address stored in the tag entry matches P, then the index of that entry is used to select one of the 8 slots of the corresponding data array entry, and the predicted call target, F, is stored in that slot. Finally, the index is incremented by 1 on each call update, up to a maximum value of 8.

On a return instruction, when the function F returns to function P, the lower order bits of the starting address of P are used for the first (return prefetch) access to the CGHC. On a tag hit, the index value in the tag array entry is used to select a slot in the corresponding data array entry, and the function in that slot is prefetched.

Note that on a return instruction, a conventional branch predictor only predicts the return address in P to which F returns; in particular, it does not provide the starting address of P. Since the entries in the tag array store only starting addresses of functions, the target address of a return instruction cannot be directly used for a tag match in the CGHC. To overcome this problem, the processor always keeps track of the starting address of the function currently being executed. When a call instruction is encountered, the starting address of the caller function is pushed onto the branch predictor's return address stack structure along with the return address. On a return instruction, the modified branch predictor retrieves the return address as usual, and also gets the caller function's starting address, which is used to access the CGHC tag array.

On the same return instruction, the second (return update) access to the CGHC is made using the lower order bits of the starting address of the current returning function, F. On a tag hit, the index value in the tag array entry is reset to one.
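Pulling the four access types together, the following C++ sketch models a direct mapped CGHC as we understand it from the description above. The sizes, the indexing arithmetic, the handling of callers with more than 8 callees, and the issuePrefetch helper are simplifying assumptions rather than the paper's exact hardware, and the two-access-per-cycle timing of Section 4.2 is ignored.

```cpp
#include <cstdint>

constexpr int kEntries = 512;   // e.g., the 16 KB configuration
constexpr int kSlots   = 8;     // at most 8 callees remembered per caller

struct CGHCEntry {
    std::uint64_t funcAddr = 0;           // tag: starting address of the function
    int           index    = 1;           // initialized to 1; reset to 1 on return
    bool          valid    = false;
    std::uint64_t callees[kSlots] = {};   // call sequence from the last invocation
};

CGHCEntry cghc[kEntries];
void issuePrefetch(std::uint64_t funcAddr);   // assumed: prefetch first N lines of funcAddr

// Tag check with allocate-on-miss: a miss creates a fresh entry with index 1
// and an "invalid" (zeroed) call sequence.
CGHCEntry& access(std::uint64_t f) {
    CGHCEntry& e = cghc[(f / 4) % kEntries];   // lower order bits of the starting address
    if (!e.valid || e.funcAddr != f) {
        e = CGHCEntry{};
        e.funcAddr = f;
        e.valid = true;
    }
    return e;
}

// P calls F (as predicted by the branch predictor).
void onCall(std::uint64_t P, std::uint64_t F) {
    CGHCEntry& ef = access(F);           // call prefetch access: F's index is 1, so
    if (ef.callees[0] != 0)              // prefetch F's first recorded callee, if any
        issuePrefetch(ef.callees[0]);
    CGHCEntry& ep = access(P);           // call update access: record F in the slot
    if (ep.index <= kSlots) {            // selected by P's index, then advance it
        ep.callees[ep.index - 1] = F;    // (only the first 8 callees are kept)
        ++ep.index;
    }
}

// F returns to P (P's starting address comes from the modified return stack).
void onReturn(std::uint64_t F, std::uint64_t P) {
    CGHCEntry& ep = access(P);           // return prefetch access: P's index now
    if (ep.index <= kSlots && ep.callees[ep.index - 1] != 0)
        issuePrefetch(ep.callees[ep.index - 1]);   // selects P's next expected callee
    CGHCEntry& ef = access(F);           // return update access
    ef.index = 1;                        // F's index is reset to 1
}
```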

Since CGP H predicts that the sequence of function calls made by a caller will be the same as the last time that caller was executed, prefetching an entire function based on this prediction may waste processor resources if the prefetched function is not invoked during the actual execution, and may even lead to cache pollution. Hence, as in the CGP S algorithm, the CGP H algorithm prefetches only N cache lines from the beginning of the callee function, where N is a parameter that can be based on the I-cache capacity, line size, and miss latency. We attempt to prefetch the rest, if any, of the callee function in a gradual fashion from within the callee function itself by using a tagged NL prefetching scheme, as described in Section 3.2. We use the notation CGP H N to represent a CGP H scheme that prefetches only N cache lines, rather than an entire function, on each prefetch request.

4.2 Considerations for CGP H Implementation

Operations that access and update the CGHC are not on the critical path of the processor pipeline and can be executed in the background. In our implementation the prefetch and update accesses to the CGHC are in different cycles, so as to eliminate the need for a dual-ported CGHC. The CGHC is accessed n cycles after the branch predictor predicts the target of a call or return instruction. We chose n based on the size of the CGHC, namely 1 cycle for a 1 KB CGHC, and 2 for a 16 KB CGHC. Since the CGHC is a small direct mapped cache, we assume that the tag match of the target address is completed in this cycle. A prefetch is issued in the next cycle after a hit in the CGHC. To reflect the call sequence history, the CGHC is updated in the following clock cycle.

Our current CGP implementation prefetches instructions from only a single function at a time. In particular, it does not prefetch all callees at once from a caller function. Depending on control flow, only a few callees may be called from a caller, in which case prefetching all callees would unnecessarily aggravate bus congestion.

CGP prefetches instructions directly into the L1 I-cache; no separate prefetch buffer is used to hold the prefetched cache lines. Using a separate prefetch buffer can potentially reduce cache pollution by reducing the chances of evicting useful cache lines. As shown in the experimental evaluations of Section 5, less than 4% of CGP's prefetches are polluting. Hence using a separate prefetch buffer would not provide significant additional benefits.

The traffic generated by the prefetches and the L1 cache misses is serviced by the L2 cache in strict FIFO order, without giving any priority to the demand miss traffic. Although the lack of priority might increase the latency of the demand misses, it simplifies the L2 accessing interface within the L1 cache. As shown in the experimental evaluations of Section 5, this approach works quite well.

5. SIMULATION RESULTS

In this section we first describe how we generated the database workloads that we used to evaluate the effectiveness of CGP. We then present the experimental results.


Table I. Microarchitecture Parameter Values for CGP Evaluations

Fetch, Decode & Issue Width     4
Inst Fetch & L/S Queue Size     32
ROB entries                     64
Functional Units                4 general purpose / 2 mult
Memory system ports to CPU      4
L1 I and D cache                each 32 KB, 2-way, 32 byte line
Unified L2 cache                1 MB, 4-way, 32 byte line
L1 hit latency (cycles)         1
L2 hit latency (cycles)         16
Mem latency (cycles)            80
Branch Predictor                comb (2-lev + 2-bit)

5.1 Methodology and Workloads

To evaluate the effectiveness of CGP we implemented a subset of the common DBMS relational operators on top of the SHORE storage manager [Carey et al. 1994]. SHORE is a fully functional storage manager that has been used extensively by the database research community and is also used in some commercial database systems. SHORE provides storage volumes, files of untyped objects, B+ trees and R* trees, and full concurrency control and recovery with two-phase locking and write-ahead logging. We implemented the following relational operators on top of SHORE: select, indexed select, grace join, nested loops join, indexed nested loop join, and hash-based aggregate. Each SQL query was transformed into a query plan using these operators.

The relational operators and the underlying storage manager were compiled on an Alpha 21264 processor running OSF Version 4.0F. We used the Compaq C++ compiler, version 6.2, with the −O5 -ifo -inline and speed optimization flags turned on. The Compaq compiler is the only compiler that supports the OM tool for doing feedback-directed code layout. Since OM implements one of the best known code layout techniques for improving I-cache performance, we felt that it was important to use a compiler that implemented this optimization, so that we could measure the effectiveness of CGP over a highly optimized baseline case.

We used the SimpleScalar simulator [Burger and Austin 1997] for detailed cycle-level processor simulation. The microarchitecture parameters were set as shown in Table I. We chose a 2-way set-associative L1 cache for two reasons.

—Rupley et al. [2002] compared and contrasted the behavior of an Oracle-based Online Transaction Processing (OLTP) workload, called ODB, with the CPU2000 benchmarks. That analysis showed that increasing the associativity of the I-cache does not reduce the miss rate of ODB. In fact, our measurements showed that more than 99% of the cache misses in ODB are capacity misses. Previous work [Ailamaki et al. 1999] also showed that database applications have large instruction and data footprints, and hence suffer from a significant number of capacity misses. Increasing the set-associativity of a given size cache only reduces conflict misses, not capacity misses.


—L1 cache designs are constrained by fast access time requirements, typically single cycle access latency. It is difficult to design highly associative L1 caches that operate at a high frequency and also meet the single cycle access time requirement.

To evaluate the performance of CGP we used a database workload that consists of eight queries from the Wisconsin benchmark [Bitton et al. 1983] and five queries from the TPC-H benchmark [TPC 1999]. The selected Wisconsin benchmark queries are queries 1 through 7 (1% and 10% range selection queries with and without indices) and query 9 (a two-way join query). The selected TPC-H queries are queries 1, 2, 3, 5 and 6; these comprise a simple nested query (query 2) and four complex queries with aggregations and many joins.

The following experiments evaluate the effectiveness of CGP for four different workload configurations:

(1) Wisc-prof, a set of three queries from the Wisconsin benchmark: query 1 (sequential scan), query 5 (non-clustered index select), and query 9 (two-way join). These queries were chosen since they include operations that are frequently used by the other Wisconsin benchmark queries. These selected queries were run on a dataset of 2,100 tuples (1,000 tuples in each of the first two relations, and 100 tuples in the third relation).

(2) Wisc-large-1 consists of the same three queries used in the Wisc-prof workload, except that the queries were run on a full 21,000 tuple Wisconsin dataset (10,000 tuples in each of the first two relations, and 1,000 tuples in the third relation). The total size of the dataset including the indices is 10 MB. This workload was selected to see how CGP performance differs when running the same queries on a different size dataset.

(3) Wisc-large-2 consists of all eight Wisconsin queries running on a 10 MB dataset.

(4) Wisc+tpch consists of all eight Wisconsin queries and the five TPC-H queries running concurrently on a total dataset of size 40 MB. In this workload the size of the TPC-H dataset is 30 MB.

The queries in each workload were executed concurrently, each query running as a separate thread in the database server. Keeping the dataset sizes relatively small (40 MB or less) allows the SimpleScalar simulation to complete in a reasonable time. Even with this small dataset, the total number of instructions simulated in wisc+tpch was about 3 billion and required about 20 hours per simulation run. Our results on wisc-prof and wisc-large-1 show that increasing the size of the dataset for the same queries increases the number of instructions executed, but does not significantly alter the types and sequences of function calls that are made; CGP performance is in fact fairly independent of the dataset size that is used. We also ran a few CGP simulations on the wisc-large-2 queries with a 100 MB dataset and saw improvements that are quite similar to those for the 10 MB dataset.


5.2 Generating Profile Information for CGP S

For the profile workload that is provided as input to CGP S, we used the wisc-prof workload. Note, for example, that none of these queries includes any aggregations or joins of more than two relations, whereas the TPC-H queries we selected do joins and aggregates on more than two relations. As shown in our experimental evaluations, CGP is effective even when the actual query workload differs significantly from the profile workload. This property of CGP makes it easier to produce a broadly useful profile workload for the database system.

5.3 Feedback-Directed Code Layout with OM

Before presenting the results for CGP, we briefly discuss the feedback-directed code layout optimizations of OM, which reduce I-cache misses by increasing spatial locality. Since CGP also targets I-cache misses, we applied CGP to both the O5 optimized and the OM optimized binaries. Our performance results show that even though OM improves the performance of the O5 optimized binary by 15%, CGP alone, without OM, achieves a 35% (CGP S) or 39% (CGP H) performance improvement over O5. CGP with OM provides only a small additional performance improvement over CGP without OM (4% for either CGP S or CGP H).

The OM [Srivastava and Wall 1992] tool on Alpha processors implements a modified version of the Pettis and Hansen [1990] profile-directed code layout algorithm for reducing I-cache misses. OM performs two levels of code layout optimization at link time. OM also performs traditional compiler optimizations at link time that could not be performed effectively at compile time. OM's ability to analyze object-level code at link time opens up new opportunities for redoing optimizations such as inter-procedural dead code elimination and loop-invariant code motion.

In the first level of code layout optimization, OM uses profile information to determine the most likely outcome of each conditional branch and rearranges the basic blocks within each function so that conditional branches are most likely not taken. This optimization increases the average number of instructions executed between two taken branches. Consequently, the number of instructions used in each cache line increases, which in turn reduces I-cache misses.

The second level of code layout optimization rearranges functions using a closest-is-best strategy. If one function calls another function frequently, the two functions are allocated close to one another in the code segment so as to improve spatial locality. Since OM is a link-time optimizer, it has the ability to rearrange functions that are spread across multiple files, including statically linked library routines.

The profile information needed for OM optimizations was generated by running two workloads, wisc-prof and wisc+tpch, to provide better feedback information than that provided by running just one workload. Providing the feedback information from the largest workload makes the OM optimized binary an even stronger baseline than might be achieved in practice. Each workload was run separately, and the profile information of both runs was merged to generate the feedback file required by OM. The OM optimizations were applied to an O5 optimized binary. OM's ability to perform traditional compiler optimizations reduced the dynamic instruction count of the O5 code by 12%.

5.4 CGP and OM Performance Comparisons

In this section we present the performance improvements due to the OM optimizations. We also present the improvements due to CGP without OM, and those from applying CGP to an OM optimized binary. Unless otherwise stated, all the CGP H results shown in this paper use a 16 KB CGHC (512 entries) with a 2 cycle access time.

Fig. 7. Performance of OM and CGP relative to O5 (execution cycles of the O5 optimized binary, ×10^9 cycles: wisc-prof = 0.38, wisc-large-1 = 2.83, wisc-large-2 = 2.86, wisc+tpch = 5.36).

Figure 7 shows the run time, relative to O5, of the four workloads with the O5+OM optimized binary, with the four binaries generated by using either CGP S N or CGP H N on either the O5 binary or the O5+OM binary, and with the O5+OM binary running on a system with a perfect I-cache, which allows each access to the I-cache to be completed in 1 cycle. N, the number of cache lines prefetched on each prefetch request, is set to 4. The rightmost set of bars in the graph, labeled Avg, shows the arithmetic average of the four workload run times relative to the arithmetic average for O5. Each arithmetic average is computed by summing the execution cycles required by each of the 4 database workloads and dividing by 4.
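For example, using the O5 execution cycle counts given in the caption of Figure 7, the O5 baseline average works out to

```latex
\bar{C}_{O5} = \frac{(0.38 + 2.83 + 2.86 + 5.36) \times 10^{9}}{4} \approx 2.86 \times 10^{9} \ \text{cycles},
```

and each Avg bar plots the corresponding scheme's own four-workload average divided by this quantity.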

Figure 7 shows that on average the OM optimizations result in a 15% speedup over the O5 optimized code. On each benchmark, CGP S 4 alone significantly outperforms OM alone, and CGP H 4 outperforms CGP S 4. On average, CGP S 4 or CGP H 4, without OM, achieves a 35% or 39% speedup, respectively, over O5 alone, corresponding to an 18% or 21% speedup over O5+OM. When CGP S 4 or CGP H 4 is used with OM, they achieve a 41% or 45% speedup, respectively, over O5 alone, corresponding to a 23% or 26% speedup over O5+OM. This shows that CGP alone, without OM, can significantly improve performance, and that using OM with CGP gives some additional benefit. Since CGP can effectively prefetch functions from non-contiguous locations, OM's effort to lay out the code contiguously provides only a 4% additional performance benefit when either CGP S 4 or CGP H 4 is used with OM. CGP, with or without OM, reduces the I-cache miss stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned baseline system toward perfect I-cache performance.

One observation might help explain why CGP improves performance significantly over OM. Namely, the closest-is-best strategy used by OM for code layout is not very effective for functions that are frequently called from many different places in the code. For instance, procedures such as lock record() can be invoked by several functions in the database system, and OM's closest-is-best strategy can place lock record() close to multiple callers only by replicating lock record(). As aggressive function replication can cause significant code bloat, which can adversely affect I-cache performance, OM tries to control this code bloat by placing lock record() close to only a few of its callers. On the other hand, CGP avoids this dilemma entirely because it can simply prefetch lock record() from whatever functions invoke lock record(), and has no need to replicate or reallocate the function.

Although CGP S uses wisc-prof as the profile workload to determine the call sequences it uses for all four workloads, it significantly improves performance not only for the wisc-prof workload, where the workload exactly matches the profile workload, but even for the remaining three workloads, in which the query mix and the datasets differ quite significantly from the profile workload. This remarkable behavior of CGP S shows that the function call sequences in DBMS layers exhibit highly stable behavior across various workloads. Since the wisc-prof workload captures the function call sequences at the bottommost two DBMS layers (the storage manager and relational operator layers), where every database query spends a significant fraction of its total execution time, CGP S is successful in prefetching desired instructions into the I-cache, even when the mix of queries differs significantly from the profile workload.

Unlike CGP S, CGP H is implemented in hardware and does not need any profiling information, except for the profile run of instrumented code required by OM, if OM is used as a base. The ability of CGP H to dynamically adapt to changes in the function call sequences at runtime gives it about a 3% additional performance gain. The fact that the hardware scheme's ability to dynamically adapt to changes in function call sequences results in such a small performance gain over the software scheme is further evidence that the layered software architecture of DBMS results in highly predictable call sequences that do not vary much, either dynamically over the runtime of a workload or from one workload to another.


5.5 Exploring the Design Space for CGHC

The performance of CGP H depends on the ability of the hardware to store enough call graph history so as to effectively issue prefetches for repeated call sequences. Since the CGHC stores this history information, we explored the effect of varying the size of the CGHC on the overall performance of CGP. Figure 8 shows the run time of CGP H 4 relative to the run time with an infinite CGHC for four different CGHC configurations: 1 KB CGHC (CGHC-1K), 16 KB CGHC (CGHC-16K), 1 KB + 8 KB two level CGHC (CGHC-1K+8K), and 2 KB + 16 KB two level CGHC (CGHC-2K+16K). In an infinite CGHC (CGHC-Inf) each function in the program is allocated an entry in the CGHC that stores the entire function call sequence of its most recent invocation.

Fig. 8. Performance of four different CGHC configurations relative to an infinite CGHC.

The access time for the 1 KB one level CGHC is one cycle; two cycles for the 16 KB one level CGHC. The access times for both two level CGHCs are the same as the access times of the two level I-cache hierarchy: 1 cycle to access the first level CGHC and 16 cycles to access the second level. On a miss in the first level CGHC, the second level CGHC is accessed. On a hit in the second level CGHC, an entry from the first level CGHC is written back to the second level, and the hit entry in the second level CGHC is moved to the first level. On a miss in the second level CGHC, a new entry is allocated in the first level CGHC, and the replaced entry from the first level is written back to the second level. Thus the one level CGHC designs assume a simple but aggressive design in order to meet the fast access time requirements; the two level CGHC designs assume a small, fast first level and a much less aggressive design for the second level, but their overall design and control are still more complex than those of a one level design.

As seen from Figure 8, the 1 KB CGHC is 17% slower on average than the infinite CGHC. But the performance gap between the other three finite CGHC configurations and the infinite CGHC is very small. We therefore chose the simpler one level 16 KB CGHC with a two cycle access time over the more complex two level CGHC designs for all the other CGP H evaluations in this paper.



5.6 Comparison with Stream Buffers and Tagged Next-N-Line Prefetching

Since I-cache accesses between two taken branches are sequential, a simple hardware prefetching scheme such as stream buffers [Jouppi 1990] or tagged NL (as described in Section 3.2) might improve performance by prefetching long straight-line sequences of code within a function. Furthermore, these techniques might even prefetch from successive functions when they are allocated contiguously, for example, by OM. For the results presented in this section we implemented an 8-way stream buffer that prefetches from 8 instruction miss streams concurrently, with each stream prefetching the next 4 consecutive I-cache lines after its associated I-cache miss. We applied stream buffer prefetching to the OM optimized binaries.

Figure 9 compares the performance of stream buffers and tagged NL prefetching with CGP H. STR 4 refers to stream buffer prefetching using a 4 deep stream buffer with 8 parallel streams. NL 4 refers to the tagged Next-N-Line scheme with N=4, as described in Section 3.2. The results show that the NL 4 and CGP H 4 schemes (without OM) improve the performance of the O5 binary by 23% and 39%, respectively, and when applied with OM improve the performance of an OM optimized binary by 14% and 26%, respectively. The STR 4 scheme with OM performs almost identically to the NL 4 scheme without OM, and about 8% worse than NL 4 with OM. In our workloads, on average only 43 instructions were executed between two successive function calls. These frequent changes in control flow limit the effectiveness of the stream buffer and the NL scheme. In the rest of this paper we compare CGP with the NL scheme only.

5.7 Prefetch Effectiveness and Bus Traffic Overhead

In addition to the gross metrics of cache misses, bus traffic, and overall runtime performance, prefetching techniques are traditionally evaluated using two metrics that are related to individual prefetches: Coverage, the ratio of the number of times that the next reference to a prefetched cache line is a hit (i.e. the prefetched cache line was not replaced before its next reference) to the total number of misses in a cache without prefetching; and Accuracy, the ratio of the number of times that the next reference to a prefetched cache line is a hit to the total number of prefetches issued. Coverage and accuracy, however, do not tell the whole story, because they ignore the effect of a prefetch on the cache line that the prefetched line replaces. For instance, these two metrics are not sufficient to infer whether a prefetched line (X) has replaced another line (Y) that will be needed before the next reference to X. Hence, to measure the effectiveness of CGP, we use a more refined prefetch classification, the Prefetch Traffic and Miss Taxonomy (PTMT), developed by Srinivasan et al. [2003].
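Stated as counter arithmetic, the two traditional metrics are (our restatement of the definitions above, not code from the paper):

    # Coverage and Accuracy restated over three event counters.
    def coverage(prefetched_line_hits, misses_without_prefetching):
        # Fraction of the no-prefetch misses that prefetching eliminated.
        return prefetched_line_hits / misses_without_prefetching

    def accuracy(prefetched_line_hits, prefetches_issued):
        # Fraction of issued prefetches whose line was referenced (hit)
        # before being replaced.
        return prefetched_line_hits / prefetches_issued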

PTMT requires the simultaneous simulation of a cache with prefetching (the prefetch cache) and a cache without prefetching (the conventional cache). By comparing the next events for X and Y in the conventional cache and in the prefetch cache, PTMT identifies 9 possible outcomes, as shown in Table II. Of all the prefetches issued, only those that fall under cases 5 and 6 are useful prefetches, because only these result in a net reduction in cache misses; furthermore, only cases 5 and 6 generate no extra traffic relative to the conventional cache without prefetching.

Table II. Prefetch Traffic and Miss Taxonomy [Srinivasan et al. 2003]

            prefetch-cache outcomes       conventional-cache outcomes      extra
    case  x (prefetched)  y (replaced)   x (prefetched)  y (replaced)   traffic  misses
      1   hit             miss           hit             hit               2        1
      2   hit             prefetched     hit             hit               1        0
      3   hit             don't care     hit             replaced          1        0
      4   hit             miss           miss            hit               1        0
      5   hit             prefetched     miss            hit               0       -1
      6   hit             don't care     miss            replaced          0       -1
      7   replaced        miss           don't care      hit               2        1
      8   replaced        prefetched     don't care      hit               1        0
      9   replaced        don't care     don't care      replaced          1        0
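For concreteness, Table II can be applied mechanically as a lookup from the four next-event outcomes to the case number and its cost. The sketch below is our transcription (the outcome labels are ours), and only the nine taxonomy outcomes are valid keys:

    # Table II as a lookup: (x and y outcomes in the prefetch cache,
    # x and y outcomes in the conventional cache) -> (case, extra
    # traffic, extra misses).
    PTMT = {
        ("hit",      "miss",       "hit",       "hit"):      (1, 2,  1),
        ("hit",      "prefetched", "hit",       "hit"):      (2, 1,  0),
        ("hit",      "dont_care",  "hit",       "replaced"): (3, 1,  0),
        ("hit",      "miss",       "miss",      "hit"):      (4, 1,  0),
        ("hit",      "prefetched", "miss",      "hit"):      (5, 0, -1),
        ("hit",      "dont_care",  "miss",      "replaced"): (6, 0, -1),
        ("replaced", "miss",       "dont_care", "hit"):      (7, 2,  1),
        ("replaced", "prefetched", "dont_care", "hit"):      (8, 1,  0),
        ("replaced", "dont_care",  "dont_care", "replaced"): (9, 1,  0),
    }

    def classify(x_pref, y_pref, x_conv, y_conv):
        case, extra_traffic, extra_misses = PTMT[(x_pref, y_pref,
                                                  x_conv, y_conv)]
        return {
            "case": case,
            "extra_traffic": extra_traffic,
            "extra_misses": extra_misses,
            "useful": case in (5, 6),     # net miss reduction, no extra traffic
            "polluting": case in (1, 7),  # extra miss and extra traffic
        }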

In case 6, when a prefetched line X replaces line Y, Y is also replaced in the conventional cache sometime before its next reference; hence the replaced line Y does not contribute to extra misses in the prefetch cache relative to the conventional cache. In case 5, on the other hand, the next reference to Y is a hit (i.e. Y was not replaced) in the conventional cache, and case 5 is only useful because it relies on a subsequent prefetch of Y back into the prefetch cache before its next reference. This subsequent prefetch of Y may in turn be useful or useless depending on what happens in the conventional cache to the line that it replaces in the prefetch cache. Hence, although both cases 5 and 6 are useful, case 6 prefetches are always useful, whereas a case 5 prefetch, although it appears to be useful in isolation, begins a chain of related prefetches whose total cost may or may not be beneficial.

Case 1 and case 7 prefetches are polluting prefetches: they generate an extra miss by replacing a useful line, and they also increase the bus traffic. Prefetches in the remaining five cases are called useless; they generate one extra line of traffic for each issued prefetch without reducing the cache misses.


Fig. 10. Distribution of prefetches for NL 4, CGP S 4 and CGP H 4.

Table II does not account for one side effect caused by prefetching into a set-associative cache that uses LRU replacement. In associative caches, a prefetch has the side effect of reordering the LRU stack of the set in which the prefetch occurs, and this reordering may affect subsequent traffic and misses. The following example, from Srinivasan et al. [2003], illustrates an occurrence of this side effect. X is prefetched, replacing the LRU line Y; an existing line W in that set becomes the LRU line. The next cache access to that set results in a miss in both caches; W is replaced in the prefetch cache while Y is replaced in the conventional cache. If the next access to W follows soon enough, it will be a hit in the conventional cache, but a miss in the prefetch cache. Thus, although W is not replaced directly by prefetching X, the W miss in the prefetch cache is a side effect of prefetching. This prefetch side effect is referred to as case 10. The cost of case 10 is 1 line of extra traffic and 1 extra miss.

An occurrence of case 10 can be detected when the following two conditions hold (a detection sketch follows the list):

(1) There is a demand fetch into the L1 cache due to a miss in both the conventional cache and the prefetch cache, and different lines are replaced in the two caches.

(2) The line replaced in the prefetch cache is subsequently referenced, resulting in a hit in the conventional cache and a miss in the prefetch cache.
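A minimal sketch of such a detector, assuming the simulator tracks the victim line of each demand fetch in both caches (the bookkeeping names are ours):

    # Case 10 detector: condition (1) arms a suspect line; condition (2)
    # confirms it. Our illustration, not the paper's simulator code.
    class Case10Detector:
        def __init__(self):
            self.suspects = set()   # prefetch-cache victims from condition (1)
            self.case10_count = 0

        def on_demand_fetch(self, miss_in_conv, miss_in_pref,
                            victim_conv, victim_pref):
            # Condition (1): miss in both caches, different lines replaced.
            if miss_in_conv and miss_in_pref and victim_conv != victim_pref:
                self.suspects.add(victim_pref)

        def on_reference(self, line, hit_in_conv, hit_in_pref):
            # Condition (2): a suspect hits in the conventional cache but
            # misses in the prefetch cache. Cost: 1 extra line of traffic
            # and 1 extra miss.
            if line in self.suspects:
                self.suspects.discard(line)
                if hit_in_conv and not hit_in_pref:
                    self.case10_count += 1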

Srinivasan et al. [2003] showed that these 10 cases of PTMT completely and disjointly account for all the extra traffic (always non-negative) and extra misses (hopefully negative) of a prefetch algorithm.

Figure 10 shows the classification of prefetches issued by the NL 4, CGP S 4 and CGP H 4 schemes applied to O5+OM. With NL 4, 4% of the prefetches generated are polluting prefetches, but less than 2% are polluting in CGP S 4 and CGP H 4. With NL 4, 39% of the prefetches are useful, while in CGP S 4 and CGP H 4, although they issued more prefetches, the useful prefetches increase to 44% and 46%, respectively. As there are very few case 5 prefetches, nearly all the useful prefetches are the more desirable case 6 prefetches, in which the prefetched line replaces a line that the conventional cache also replaces before its next reference.


Fig. 11. CGP H 4 prefetches due to NL (left bar) and CGHC (right bar).

To understand why CGP generates about as many useless prefetches as NL (mostly in case 9, with a substantial number in cases 3 and 8 as well), we split the CGP prefetches into those issued by its NL prefetcher and those issued by its CGHC. Figure 11 shows the results of this split for CGP H 4. While only 34% of the prefetches issued by the NL component are useful (cases 5 and 6), 58% of the prefetches issued by the CGHC component are useful. Hence the prefetches of the CGHC component are much more accurate than those of the NL component.

Since CGP uses the CGHC only to prefetch across function boundaries and uses NL to prefetch within a function, we might expect that the CGHC and NL prefetch disjoint sets of instructions. However, we see that the useful prefetches of the NL portion of Figure 11 (2.6 × 10^7 on average) are fewer than those for NL 4 in Figure 10 (5.6 × 10^7 on average). This decrease implies that some of the useful prefetches issued by the NL 4 scheme when acting alone are issued by the CGHC component, not the NL component, of the CGP 4 scheme. Such a shift from NL to CGHC could occur, for example, if a callee function is laid out close to its caller and NL 4 prefetches past the end of the caller to the beginning lines of the callee function due to the sequentiality of the code layout, whereas under CGP 4 such callee prefetches would tend to occur earlier during caller execution and fall within the CGHC portion of CGP 4.

Thus the CGHC allows the CGP scheme to issue some prefetches earlier (i.e. at a more timely point) than those same prefetches would be issued by NL; the NL prefetch of such a cache line in CGP is squashed, since the prefetch was already issued by the CGHC. The timely nature of CGHC prefetches can be inferred from Figure 12, which shows the timeliness of the prefetches issued by NL 4, CGP S 4 and CGP H 4 by categorizing the total prefetch hits (the sum of cases 1 through 6) into two categories. The bottom component, Pref Hits, shows the number of times that the next reference to a prefetched cache line found the referenced instruction already in the L1 cache. The upper component, Delayed Hits, shows the number of times that the next reference to a prefetched cache line found that the referenced instruction was still en route to the cache from the lower levels of memory. The total delayed hits of CGP 4 are fewer than the delayed hits of NL 4, which is one measure of the increased timeliness of CGP prefetches relative to NL. The total number of delayed hits of NL 4 is 36% of the total prefetch hits, while in CGP S 4 and CGP H 4 they are reduced to 25% of the total prefetch hits, despite the increased total and the use of NL 4 within CGP to prefetch lines from within a function.


Fig. 12. Prefetch timeliness of NL and CGP.
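The split between the two hit categories can be restated compactly. In this sketch (ours), resident holds lines already filled into the L1 I-cache and inflight holds lines whose prefetch has issued but not yet completed:

    # Our restatement of the Pref Hits / Delayed Hits categorization.
    def classify_prefetch_reference(line_addr, resident, inflight):
        if line_addr in resident:
            return "pref_hit"      # instruction already in the L1 cache
        if line_addr in inflight:
            return "delayed_hit"   # still en route from lower memory levels
        return "miss"              # not covered by a prefetch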

5.8 I-cache Performance in Future Processors

The design and verification of future processors may well be compromised by increasingly complex out-of-order processor cores. One way to reduce this complexity is to design simpler in-order processor cores, branch predictors, and smaller caches. To reduce cache misses and branch mispredictions, this simpler design can be supplemented with prefetch and precomputation engines [Annavaram et al. 2001b; Annavaram 2001] that are almost entirely decoupled from the processor core.

The purpose of this section, however, is to explore the effectiveness of CGP under the assumption that today's trends are projected to a future processor design that does use larger caches and wider out-of-order execution cores, and simply suffers the significantly increased design and verification costs. Provided this trend continues into the near term future, it raises the question of whether such future processor designs with larger I-caches and wider out-of-order execution cores would make CGP redundant by eliminating the I-cache access bottleneck. We claim that CGP will continue to be useful for database systems on such future processors.


Table III. Microarchitecture Parameter Values for the future Configuration

    Fetch, Decode & Issue Width    8
    Inst Fetch & L/S Queue Size    64
    ROB entries                    256
    Functional Units               8 add / 4 mult
    Memory system ports to CPU     4
    L1 I and D cache               each 64 KB, 2-way, 32 byte line
    Unified L2 cache               2 MB, 4-way, 32 byte line
    L1 hit latency (cycles)        1
    L2 hit latency (cycles)        24
    Mem latency (cycles)           100
    Branch Predictor               comb (2-lev + 2-bit)


On the more aggressive future processor model defined in Table III, CGP with OM improves the performance of our database workloads by 43% (CGP S) or 45% (CGP H) over O5, and by 23% (CGP S) or 25% (CGP H) over O5+OM. The L2 cache size in the future configuration is slightly smaller than what we expect to see in the future. As stated earlier, to obtain the simulation results within a reasonable time, the size of the dataset was scaled down; hence the size of the L2 was also scaled down in appropriate proportion to provide realistic results.

We simulated this very aggressive out-of-order processor model, future, which can execute up to 8 instructions every cycle. Comparing this configuration with the configuration shown in Table I, the I-cache size is now doubled, which should reduce the number of I-cache misses. Note, however, that in the future configuration the level 2 cache hit latency and the memory access latency are also greater, as might be expected due to the widening gap between processor and memory system speeds. Consequently, even though such a future processor may suffer fewer I-cache misses, the penalty for each miss will be higher.

Figure 13 shows the run time required to complete the four workloads on the future configuration relative to the run time of the O5 optimized binary. CGP still outperforms both OM and NL by about the same margin on the future configuration as on the original 4-wide machine configuration.

CGP maintains its performance advantage despite the fact that in our benchmarks the I-cache miss rates on the future configuration with a 64 KB I-cache are reduced to less than 1% without any prefetching, and less than 0.1% with CGP. Thus the working sets of our benchmarks are well accommodated by the larger caches in the future configuration. These larger caches decrease the number of misses sufficiently to gain in performance despite the increased miss penalty. Consequently the percentage gains in performance of CGP relative to OM and NL are slightly less when calculated on the future configuration rather than on the original 4-wide machine configuration. However, it is important to note that CGP performance remains about half way between O5+OM and perfect cache.


Fig. 13. Performance of OM, NL and CGP relative to O5 on the future configuration (execution cycles of O5 optimized binary, ×10^9 cycles: wisc-prof = 0.28, wisc-large-1 = 2.27, wisc-large-2 = 2.29, wisc+tpch = 4.21).

Furthermore, from current trends we expect that the working sets in future databases will continue to increase and will be much larger than those used in this study. In addition, database systems will continue to use a layered software design approach so as to ensure the maintainability and portability of the software. With larger working sets, cache misses will continue to be a significant performance bottleneck, and consequently CGP will continue to be a useful technique for reducing the number of I-cache misses of database systems.

5.9 Applying CGP to SPEC CPU2000 Benchmarks

In this section we show that although CGP is a general technique that can be applied to applications in other domains, the layered software architecture of database applications makes CGP particularly attractive for DBMS. To quantify the impact of CGP when applied to some other application domain, we used CGP on the CPU-intensive SPEC benchmarks. We selected seven benchmarks from the SPEC CPU2000 integer benchmark suite, namely gzip, gcc, crafty, parser, gap, bzip2 and twolf. These benchmarks were selected because our existing simulation infrastructure allows us to run them without modifying the benchmark source code. The perl benchmark, for example, uses multiple processes to concurrently execute several input scripts, but SimpleScalar cannot simulate concurrent execution of multiple processes. One way to execute the perl benchmark is to modify the source code of the program and change the concurrent execution of input scripts to sequential execution, where only one input script is executed at a time, but that would be a different benchmark. We therefore chose to limit this study to those benchmarks that ran without modification.

The selected benchmarks were compiled with the Compaq C++ compiler with O5 and then OM. The test input set, provided by SPEC, was used to generate the required profile information for OM. The train input set was then run for two billion instructions to generate the results presented in this section.


Fig. 14. Effectiveness of CGP on SPEC CPU2000 applications.

In Figure 14, the rightmost bar for each benchmark shows the execution cycles required with a perfect I-cache, where each access to the I-cache is completed in 1 cycle. Without prefetching (O5+OM), the performance gap due to using the 32 KB I-cache, rather than a perfect I-cache, is 17% in gcc, 9% in crafty, 2% in gap, and less than 1% for each of the other benchmarks. In fact, with a 32 KB I-cache the SPEC CPU2000 I-cache miss ratios are nearly 0%, except for gcc and crafty, which have 0.5% and 0.3% I-cache miss ratios, respectively. The I-cache is thus not a significant performance bottleneck in any of these SPEC CPU2000 applications, in which case it is unnecessary to use prefetching techniques such as CGP and NL. For those applications that do suffer from I-cache misses, namely gcc and crafty, NL prefetching alone achieves performance gains similar to those of CGP: NL 4 and CGP H 4 each speed up the execution of gcc by 7% and crafty by 4% relative to O5+OM alone. These results show that CGP is not needed for workloads with small I-cache footprints and/or infrequent function calls. However, once again CGP performance is about half way between no instruction prefetching and perfect I-cache performance.

6. RELATED WORK

Researchers have proposed several techniques to improve the I/O bottleneck of database systems. Nyberg et al. [1994] suggested that if data intensive applications use software assisted disk striping, the performance bottleneck shifts from I/O response time to memory access time. Boncz et al. [1998] showed that the query execution time of data mining workloads with a large main memory buffer pool is memory bound rather than I/O bound. Shatdal et al. [1994] proposed cache-conscious performance tuning techniques that improve the locality of the data accesses for join and aggregation algorithms. These techniques reduce data cache misses, which is orthogonal to CGP's goal of reducing I-cache misses; CGP may be implemented on top of these cache-conscious algorithms.

It is only recently that researchers have examined the performance impact of architectural features on DBMS [Ailamaki et al. 1999; Lo et al. 1998; Trancoso et al. 1997; Eickemeyer et al. 1996; Cvetanovic and Bhandarkar 1994; Franklin et al. 1994; Maynard et al. 1994]. Their results show that database applications have much larger instruction and data footprints and exhibit more unpredictable branch behavior than benchmarks that are commonly used in architectural studies (e.g. SPEC). Database applications have fewer loops and suffer from frequent context switches, causing significant increases in I-cache miss rates [Franklin et al. 1994]. Lo et al. [1998] showed that in OLTP workloads the I-cache miss rate is nearly three times the data cache miss rate. Ailamaki et al. [1999] analyzed three commercial DBMS on a Xeon processor and showed that TPC-D queries spend about 20% of their execution time on branch misprediction stalls and 20% on L1 I-cache miss stalls (even though the Xeon processor uses special instruction prefetching hardware). Their results also showed that L1 data cache misses that hit in L2 were not a significant bottleneck, but L2 misses reduced performance by 20%.

Researchers have proposed several schemes to improve I-cache performance. Pettis and Hansen [1990] proposed a code layout algorithm that uses profile guided feedback information to contiguously lay out the sequence of basic blocks that lie on the most commonly occurring control flow path. Romer et al. [1997] implemented the Pettis and Hansen code layout algorithm using the Etch tool and showed performance improvements for Win32 binaries. Hashemi et al. [1997] used a cache line coloring scheme to remap procedures so as to reduce conflict misses. Similarly, Kalamatianos and Kaeli [1998] exploited the temporal locality of procedure invocations to remap procedures in a binary. They used a structure called a Conflict Miss Graph (CMG), where every edge weight in the CMG is an approximation of the worst-case number of misses two procedures can inflict upon one another. The ordering implied by the edge weights is used to apply color-based procedure mapping to eliminate conflict misses. Gloy et al. [1997] compared several of these recent code placement techniques for improving I-cache performance. In this paper we used OM [Srivastava and Wall 1992], which implements a modified Pettis and Hansen algorithm to do feedback-directed code layout. Our database workload results showed that OM improves performance by 15% over O5, and CGP with OM achieves a 41% (CGP S) or 45% (CGP H) performance improvement over O5. CGP alone, without OM, does not need recompilation of the source code and still achieves a 35% (CGP S) or 39% (CGP H) performance improvement over O5. Since CGP can effectively prefetch functions from non-contiguous locations, OM's effort to lay out the code contiguously provides only about a 4% additional performance benefit for CGP with OM over CGP without OM.

Tagged Next-N-line prefetching (NL) [Smith 1978] is an often used sequential prefetching technique. In this technique the next N sequential lines are prefetched on a cache miss, as well as on the first hit to a cache line that was prefetched. Tagged NL prefetching works well in programs that execute long sequences of straight line code. CGP uses tagged NL prefetching for prefetching code within a function, and profile-guided prefetching (in CGP S) or the CGHC (in CGP H) for prefetching across function calls. Our results show that CGP takes good advantage of the tagged NL prefetching scheme and that OM+CGP S or OM+CGP H outperforms OM+NL alone by 7% or 10%, respectively.
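A minimal sketch of the tagged NL mechanism (ours; the cache object and its contains/insert interface are assumptions):

    # Tagged next-N-line prefetching: prefetch on a miss and on the first
    # hit to a line that was itself brought in by a prefetch.
    class TaggedNLPrefetcher:
        def __init__(self, cache, n=4, line_size=32):
            self.cache = cache       # assumed interface: contains(), insert()
            self.n = n
            self.line_size = line_size
            self.tagged = set()      # prefetched lines not yet referenced

        def _prefetch_next_n(self, line_addr):
            for i in range(1, self.n + 1):
                addr = line_addr + i * self.line_size
                if not self.cache.contains(addr):
                    self.cache.insert(addr)   # model the prefetch fill
                    self.tagged.add(addr)     # tag as prefetched, unreferenced

        def on_access(self, line_addr):
            if not self.cache.contains(line_addr):
                self._prefetch_next_n(line_addr)   # trigger on a cache miss
            elif line_addr in self.tagged:
                self.tagged.discard(line_addr)     # first hit to a prefetched line
                self._prefetch_next_n(line_addr)   # also triggers prefetching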

ACM Transactions on Computer Systems, Vol. 21, No. 4, November 2003.

Page 28: Call Graph Prefetching for Database Applications

Call Graph Prefetching for Database Applications • 439

Recently researchers have proposed several techniques for non-sequential instruction prefetching [Hsu and Smith 1998; Luk and Mowry 2001; Srinivasan et al. 2001]. Of these, the work of Luk and Mowry is closest to CGP. They proposed cooperative prefetching, where the compiler inserts prefetch instructions to prefetch branch targets. Their approach, however, requires ISA extensions to add four new prefetch instructions: two to prefetch the targets of branches, one for indirect jumps and one for function returns. They use next-N-line prefetching for sequential accesses, and special hardware filters to reduce the prefetch traffic. CGP S uses profiling to identify the common call sequences and inserts prefetches on the more likely paths. Cooperative prefetching does not need profile information to guide prefetch insertion; however, it inserts prefetches on all possible paths (up to a certain distance) and then uses compiler optimizations to reduce the number of prefetches actually needed. CGP S inserts prefetches to the next call immediately after the current call site. This approach may not be effective in reducing very large memory latencies. However, CGP as implemented in this paper is targeted toward masking L1 I-cache miss latencies of a few dozen cycles, and as shown in the results section, our approach works well in masking such latencies.

CGP uses NL prefetching to prefetch within a function boundary, and can benefit from using the OM tool at link time to make NL more effective by reducing the number of taken branches, which increases the sequentiality of the code. Hence using OM with NL can effectively prefetch instructions within a function boundary, and thereby reduces the need for branch target prefetching within a function. By building on NL, CGP can focus on prefetching for function calls.

Hsu and Smith [1998] proposed target line prefetching. In their scheme, a target line prediction table stores the address of the current cache line, C, and the address of a target cache line which the processor has accessed in the recent past due to the execution of a branch instruction in C. Whenever the processor accesses C, a prefetch is issued to the target address, if any, found in the history information available in the target line prediction table. Since database workloads are dominated by short forward branches, many cache lines have multiple branch instructions. Hence multiple target addresses need to be stored per cache line, which can significantly increase the size of the target line prediction table.
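A one-target-per-line sketch of such a table (ours); as the text notes, database code would need multiple targets per line:

    # Target line prediction table sketch: remember, per cache line, the
    # most recently observed branch target line; prefetch it on the next
    # access to that line.
    class TargetLineTable:
        def __init__(self):
            self.target = {}   # line C -> target line seen in the recent past

        def on_control_transfer(self, from_line, to_line):
            self.target[from_line] = to_line   # one target kept per line

        def on_fetch(self, line):
            return self.target.get(line)       # line to prefetch, if any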

Srinivasan et al. [2001] proposed branch history guided instruction prefetching (BHGP). BHGP correlates the execution of branch instructions with I-cache misses and uses branches to trigger prefetches to instructions that occur N−1 branches later, for a given N > 1. In BHGP any branch instruction can potentially trigger a prefetch, while CGP prefetches only at function boundaries and uses next line prefetching within a function.

Reinman et al. [1999] proposed fetch-directed instruction prefetching, and Chen et al. [1997] proposed branch prediction based prefetching. Both of these schemes use a run-ahead branch predictor that predicts the program's control flow several branches ahead of the currently executing branch, and prefetches are issued to instructions along the predicted control flow path. While Chen uses only one additional program counter, called the look-ahead PC, to determine the prefetch address, Reinman uses a fetch target queue to enqueue multiple prefetch addresses. The accuracy of their prefetches is determined by the accuracy of the run-ahead predictor. CGP uses history information rather than a run-ahead engine, and does not employ a branch predictor to determine its prefetch addresses.

Pierce and Mudge [1996] proposed wrong-path prefetching, which combines next-line prefetching with the prefetching of all control instruction targets regardless of the predicted directions of conditional branches. However, they also showed that prefetching all branch targets aggravates bus congestion.

Joseph and Grunwald [1997] proposed Markov prefetching, which capitalizes on the correlations in the cache miss stream to issue a prefetch for the next predicted miss address. They use part of the L2 cache as a history buffer to store a miss address, M, and a sequence of miss addresses that follow M. When address M misses in the cache again, their scheme uses M to index the history buffer and issues prefetches to a subset of the miss addresses that followed M the last time. This scheme focuses primarily on data prefetching; in particular, Joseph and Grunwald [1997] give no results on the effectiveness of the scheme for instruction prefetching. For data prefetching, their results showed that Markov prefetching generates a significant number of extra prefetches and requires a large amount of space to store the miss correlations.
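A minimal sketch of the Markov history buffer (ours; the table sizing and the follower-subset policy are assumptions):

    # Markov prefetching sketch: record the miss addresses that follow each
    # miss address M, and replay a subset of them when M misses again.
    class MarkovPrefetcher:
        def __init__(self, followers_kept=2):
            self.history = {}        # miss address M -> misses seen after M
            self.followers_kept = followers_kept
            self.last_miss = None

        def on_miss(self, addr):
            # Replay a subset of the followers recorded for this address.
            prefetches = self.history.get(addr, [])[:self.followers_kept]
            # Record this miss as a follower of the previous miss address.
            if self.last_miss is not None:
                self.history.setdefault(self.last_miss, []).append(addr)
            self.last_miss = addr
            return prefetches        # addresses to prefetch now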

Although it would be interesting to quantitatively compare the performance of CGP with previous instruction prefetching schemes, due to time and resource constraints we present only a qualitative discussion of the related work.

7. CONCLUSIONS AND FUTURE DIRECTIONS

With the trend toward denser and cheaper memory, a significant number of datasets in database servers will soon reside in main memory. In such configurations, the performance bottleneck is the gap between the processing speed of the CPU and the memory access latency. Database applications, with their large instruction footprints and datasets, suffer significantly from poor cache performance.

We have proposed and evaluated a technique called Call Graph Prefetching (CGP) that increases the performance of database systems by improving their I-cache utilization. CGP can be implemented either in software or in hardware. The software implementation of CGP (CGP S) uses a profile run to build a call graph of a database system and exploits the predictable call sequences in the call graph to insert prefetches into the executable for the function that is deemed likely to be called next. Although CGP S uses a profile workload to build the call graphs, our experiments show that this scheme is insensitive to runtime variations. The hardware CGP scheme (CGP H) uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution, and uses this stored history when choosing which functions to prefetch. The hardware scheme, as opposed to the software scheme, has the ability to adapt to runtime changes in the function call sequences. The disadvantage of the hardware scheme is that it cannot be implemented on existing processors, because it requires adding special purpose hardware to implement the CGHC, as well as a simple modification to the branch predictor. Our results show that CGP H provides only a small additional performance improvement over CGP S in our experimental runs. However, we feel that it is important to consider CGP H, as the hardware approach may well be preferred by many who have an aversion to, or distrust of, profiling. Both the hardware and the software schemes are especially attractive for DBMS, since neither scheme modifies the original DBMS source code, or even requires access to the original source code. Consequently, CGP can also work with user-defined functions and data types in an Object Relational DBMS (ORDBMS), for which rewriting the database code may not be feasible.

Fig. 15. Average performance improvements of OM, NL and CGP relative to O5 on the original 4-wide configuration (average execution cycles of O5 optimized binary = 2.86 × 10^9 cycles).

Our results, summarized in Figure 15, show that CGP S or CGP H with OM outperforms an OM optimized binary by 23% or 26%, respectively. Furthermore, they provide an additional speedup of 7% or 10%, respectively, over NL prefetching alone. Even on an aggressive future configuration with larger caches that reduce the number of cache misses, but with higher miss penalties, our results summarized in Figure 16 show that CGP S or CGP H improves performance by 43% or 45%, respectively, over O5, by 23% or 25% over O5+OM, and by 4% or 6% over O5+OM+NL. The working sets of our benchmarks are well accommodated by the larger caches in this future configuration. These larger caches decrease the number of misses sufficiently to gain in performance despite the increased miss penalty. The performance gains achieved by CGP on these benchmarks relative to O5 and O5+OM are similar to those in Figure 15, but as the gains achieved by NL are somewhat higher in Figure 16, the relative gains of CGP over NL are slightly reduced in this more aggressive system. However, CGP performance remains about half way between O5+OM and perfect cache.

Furthermore, from current trends we expect that the working sets of future databases will continue to increase, and hence will be much larger than those used in this study. Cache misses will no doubt continue to be a significant performance bottleneck, and consequently techniques like CGP that reduce I-cache misses will remain critical to the performance of future database systems on future processors.

Fig. 16. Average performance improvements of OM, NL and CGP relative to O5 on the future configuration (average execution cycles of O5 optimized binary = 2.26 × 10^9 cycles).

As the complexity of software systems continues to grow, instruction footprint sizes are also increasing, thereby putting tremendous pressure on the I-cache. As the complexity of the software grows, the behavior of the system typically becomes more unpredictable. Research in memory system design can gain significantly by analyzing the behavior of specific types of software systems at a higher level of granularity, rather than by trying to capitalize only on low-level generic program behavior. The prevailing programming style for today's large and complex software systems favors modular software, where the flow of control at the function level is exposed while the implementation details within the functions are abstracted away. CGP exploits the regularity of DBMS function call sequences, and avoids dealing with low-level details within functions by simply prefetching the first few cache lines of a function, which often constitutes the entire function, and using tagged next-N-line prefetching to bring in successive lines of longer functions.

Although CGP does eliminate about half the I-cache miss penalty, there is still room for further improvement. The cache misses that remain after applying CGP are mostly either cold start misses or misses to infrequently executed functions. As we have shown in the CGP performance results section, simply using a bigger Call Graph History Cache to store more history information is not the solution. History-based schemes such as CGP typically require a learning period during which they acquire program knowledge before they can exploit that knowledge to improve performance. Thus reducing cold start misses and misses to infrequently executed functions by using history-based schemes is difficult if not impossible. A simpler way to reduce these remaining misses might be to give the DBMS more direct control of cache memory management. Today's DBMS already use application-specific main memory management routines. They control page allocation and replacement policies in a more flexible manner than the rigid "universal" policies provided by the operating system. In a similar way, the cache hierarchy could be placed under some degree of DBMS control. To do more effective prefetching, database developers could provide hints to the cache management hardware regarding calling sequences to infrequently executed functions and other code segments that cannot be captured by CGP.

REFERENCES

AILAMAKI, A., DEWITT, D., HILL, M., AND WOOD, D. 1999. DBMSs on a Modern Processor: Where Does Time Go? In Proceedings of the 25th International Conference on Very Large Data Bases. 266–277.

ANNAVARAM, M. 2001. Prefetch Mechanisms that Acquire and Exploit Application Specific Knowledge. Ph.D. thesis, University of Michigan, EECS Department.

ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001a. Call Graph Prefetching for Database Applications. In Proceedings of the 7th International Symposium on High Performance Computer Architecture. 281–290.

ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001b. Data Prefetching by Dependence Graph Precomputation. In Proceedings of the 28th International Symposium on Computer Architecture. 52–61.

BERNSTEIN, P., BRODIE, M., CERI, S., DEWITT, D., FRANKLIN, M., GARCIA-MOLINA, H., GRAY, J., HELD, G., HELLERSTEIN, J., JAGADISH, H., LESK, M., MAIER, D., NAUGHTON, J., PIRAHESH, H., STONEBRAKER, M., AND ULLMAN, J. 1998. The Asilomar Report on Database Research. SIGMOD Record 27, 4 (December), 74–80.

BITTON, D., DEWITT, D. J., AND TURBYFILL, C. 1983. Benchmarking database systems: a systematic approach. In Proceedings of the 9th International Conference on Very Large Data Bases. 8–19.

BONCZ, P., RUHL, T., AND KWAKKEL, F. 1998. The Drill Down Benchmark. In Proceedings of the 24th International Conference on Very Large Data Bases. 628–632.

BURGER, D. AND AUSTIN, T. 1997. The SimpleScalar Tool Set. Tech. Rep. 1342, University of Wisconsin-Madison, Computer Science Department. June.

CAREY, M., DEWITT, D., FRANKLIN, M., HALL, N., MCAULIFFE, M., NAUGHTON, J., SCHUH, D., SOLOMON, M., TAN, C., TSATALOS, O., WHITE, S., AND ZWILLING, M. 1994. Shoring Up Persistent Applications. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 383–394.

CHEN, I.-C. K., LEE, C.-C., AND MUDGE, T. 1997. Instruction Prefetching Using Branch Prediction Information. In Proceedings of the International Conference on Computer Design. 593–601.

CVETANOVIC, Z. AND BHANDARKAR, D. 1994. Characterization of Alpha AXP Performance Using TP and SPEC Workloads. In Proceedings of the 21st International Symposium on Computer Architecture. 60–70.

EICKEMEYER, R., JOHNSON, R., KUNKEL, S., SQUILLANTE, M., AND LIU, S. 1996. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd International Symposium on Computer Architecture. 203–212.

FRANKLIN, M., ALEXANDER, W., JAUHARI, R., MAYNARD, A., AND OLSZEWSKI, B. 1994. Commercial Workload Performance in the IBM POWER2 RISC System/6000 Processor. IBM J. Res. Dev. 38, 5 (April), 555–561.

GLOY, N., BLACKWELL, T., SMITH, M., AND CALDER, B. 1997. Procedure Placement Using Temporal Ordering Information. In Proceedings of the 30th International Symposium on Microarchitecture. 303–313.

HASHEMI, A., KAELI, D., AND CALDER, B. 1997. Efficient Procedure Mapping Using Cache Line Coloring. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation. 171–182.

HSU, W.-C. AND SMITH, J. 1998. A Performance Study of Instruction Cache Prefetching Methods. IEEE Trans. Comput. 47, 5 (May), 497–508.

INTEL WEB SITE: http://www.developer.intel.com/drg/mmx/appnotes/perfmon.htm Survey of Pentium Processor Performance Monitoring Capabilities & Tools.

JOSEPH, D. AND GRUNWALD, D. 1997. Prefetching Using Markov Predictors. In Proceedings of the 24th International Symposium on Computer Architecture. 252–263.

JOUPPI, N. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th International Symposium on Computer Architecture. 364–373.

KALAMATIANOS, J. AND KAELI, D. 1998. Temporal-Based Procedure Reordering for Improved Instruction Cache Performance. In Proceedings of the 4th International Symposium on High Performance Computer Architecture. 244–253.

LO, J., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO, K., LEVY, H. M., AND PAREKH, S. S. 1998. An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In Proceedings of the 25th International Symposium on Computer Architecture. 39–50.

LUK, C. AND MOWRY, T. 2001. Architectural and compiler support for effective instruction prefetching: a cooperative approach. ACM Trans. Comput. Syst. 19, 1 (Feb.), 71–109.

MAYNARD, A., DONNELLY, C., AND OLSZEWSKI, B. R. 1994. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. 145–156.

NYBERG, C., BARCLAY, T., CVETANOVIC, Z., GRAY, J., AND LOMET, D. 1994. AlphaSort: a RISC machine sort. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 233–242.

PETTIS, K. AND HANSEN, R. 1990. Profile Guided Code Positioning. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation. 16–27.

PIERCE, J. AND MUDGE, T. 1996. Wrong-path Prefetching. In Proceedings of the 29th International Symposium on Microarchitecture. 264–273.

REINMAN, G., CALDER, B., AND AUSTIN, T. 1999. Fetch Directed Instruction Prefetching. In Proceedings of the 32nd International Symposium on Microarchitecture. 16–27.

ROMER, T., VOELKER, G., LEE, D., WOLMAN, A., WONG, W., LEVY, H., BERSHAD, B., AND CHEN, B. 1997. Instrumentation and Optimization of Win32/Intel Executables Using Etch. In USENIX Windows NT Workshop. 1–7.

RUPLEY, J., ANNAVARAM, M., DEVALE, J., DIEP, T., AND BLACK, B. 2002. Comparing and Contrasting a Commercial OLTP Workload with CPU2000 on IPF. In the 5th Workshop on Workload Characterization.

SHATDAL, A., KANT, C., AND NAUGHTON, J. 1994. Cache Conscious Algorithms for Relational Query Processing. In Proceedings of the 20th International Conference on Very Large Data Bases. 510–521.

SMITH, A. 1978. Sequential Program Prefetching in Memory Hierarchies. IEEE Comput. 11, 12 (December), 7–21.

SRINIVASAN, V., DAVIDSON, E., AND TYSON, G. 2003. A Prefetch Taxonomy. IEEE Trans. Comput.

SRINIVASAN, V., DAVIDSON, E., TYSON, G., CHARNEY, M., AND PUZAK, T. 2001. Branch History Guided Instruction Prefetching. In Proceedings of the 7th International Symposium on High Performance Computer Architecture. 291–300.

SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A System for Building Customized Program Analysis Tools. Tech. Rep. 94/2, Digital Western Research Laboratory. March.

SRIVASTAVA, A. AND WALL, D. 1992. A Practical System for Intermodule Code Optimization at Link-Time. Tech. Rep. 92/6, Digital Western Research Laboratory. June.

TRANCOSO, P., LARRIBA-PEY, J., ZHANG, Z., AND TORRELLAS, J. 1997. The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. In Proceedings of the 3rd International Symposium on High Performance Computer Architecture. 211–220.

TPC. 1999. TPC Benchmark H Standard Specification (Decision Support), Revision 1.1.0.

Received June 2001; revised July 2002, February 2003; accepted May 2003
