Buffering Database Operations for Enhanced Instruction Cache Performance
Jingren Zhou, Kenneth A. Ross
SIGMOD International Conference on Management of Data, 2004
Nihan Özman - 2005700452


Page 1: Buffering Database Operations for Enhanced Instruction Cache Performance. Jingren Zhou, Kenneth A. Ross. SIGMOD International Conference on Management of Data.

Buffering Database Operations for Enhanced Instruction Cache Performance

Jingren Zhou, Kenneth A. Ross
SIGMOD International Conference on Management of Data

2004

Nihan Özman - 2005700452

Page 2:

Outline

• Introduction
• Related Work
• Memory Hierarchy
• Pipelined Query Execution
• A New Buffer Operator
• Buffering Strategies
• Instruction Footprint Analysis
• Overall Algorithm
• Experimental Validation
• Conclusion and Future Work

Page 3:

Introduction

• Recent database research has demonstrated that most of the memory stalls are due to data cache misses on the second-level cache and instruction cache misses on the first-level instruction cache.

• With random access memory (RAM) getting cheaper and new 64-bit CPUs joining the PC family, it becomes affordable and feasible to build computers with large main memories.

• This implies that more and more query processing work can be done in main memory.

• Advances in CPU speed have far outpaced advances in memory latency.

• Main memory access is becoming a significant cost component of database operations.

Page 4:

Introduction

• Relatively little research has been done on improving the instruction cache performance.

• In this paper, the focus is on improving the instruction cache performance for
 conventional demand-pull database query engines
 achieving fast query execution throughput
 not making substantial modifications to existing database implementations

Page 5:

Related Work

• Many techniques have been proposed to improve instruction cache performance at the compiler level.

• For database systems, these techniques do not solve the fundamental problem: large instruction footprints during database query execution, leading to a large number of instruction cache misses.

• The challenge: reducing the effective size of the query execution footprints without sacrificing performance.

• This issue should be resolved at the "query compiler" level (during query optimization).

Page 6:

Related Work

• A block-oriented processing technique for aggregation, expression evaluation, and sorting operations is proposed by Padmanabhan, Malkemus, Agarwal, and Jhingran. Each operation is performed on a block of records using a vector-style processing strategy to

• achieve better instruction pipelining
• minimize instruction count and function calls

 The impact on instruction cache performance is not studied.
 The technique requires a complete redesign of database operations (all return blocks of tuples).
 The technique does not consider the operator footprint size when deciding whether to process data in blocks.

Page 7:

Related Work

• Zhou and Ross use a similar buffering strategy to batch accesses to a tree-based index ("Buffering Accesses to Memory-Resident Index Structures", in Proceedings of the VLDB Conference, 2003).

• The performance benefits of such buffering are due to improved data cache hit rates.

• The impact of buffering on instruction-cache behaviour is not described.

Page 8:

Memory Hierarchy

• Modern computer architectures have a hierarchical memory system, where access by the CPU to main memory is accelerated by various levels of cache memories.

• Cache memories are based on the principles of spatial and temporal locality.

• A cache hit happens when the requested data (or instructions) are found in the cache.

• A cache miss occurs when the CPU must load data (or instructions) from a lower cache or memory.

Page 9:

Memory Hierarchy

• There are typically two or three cache levels.
• Most CPUs now have Level 1 and Level 2 caches integrated on die.

• Instruction and data caches are usually separated in the first level and shared in the second level.

• Caches are characterized by:
 capacity
 cache-line size
 associativity

Page 10:

Memory Hierarchy

• Latency is the time span that passes after issuing a data access until the requested data is available in the CPU.

• In hierarchical memory systems, latency increases with distance from the CPU.

• L2 cache sizes are expected to increase (2MB for the Pentium M), but L1 cache sizes will not increase at the same rate. Larger L1 caches are slower than smaller L1 caches, and may slow down the processor clock.

• The aim in this paper is to minimize L1 instruction cache misses.

Page 11:

Memory Hierarchy

• Modern processors implement a Trace Cache instead of a conventional L1 instruction cache.
• A trace cache is an instruction cache that stores instructions either after they have been decoded, or as they are retired. This allows the instruction fetch unit of a processor to fetch several basic blocks without having to worry about branches in the execution flow.

• Upon a trace cache miss, the instruction address is submitted to the instruction translation lookaside buffer (ITLB), which translates the address into a physical memory address before the cache lookup is performed.

• The TLB (Translation Lookaside Buffer) is responsible for translating a virtual address into a physical address. The TLB is a small cache that contains recent translations. If a translation is not in the TLB, it must be loaded from the memory hierarchy, possibly all the way down from main memory.

Page 12:

Memory Hierarchy

• Data memory latency can be hidden by correctly prefetching data into the cache.
• Most modern CPUs are pipelined CPUs.
• The term pipeline represents the concept of splitting a job into sub-processes in which the output of one sub-process feeds into the next (for the Pentium 4, the pipeline is 20 stages deep).

• Conditional branch instructions present another significant problem for modern pipelined CPUs, because the CPUs do not know in advance which of the two possible outcomes of the comparison will happen.

Page 13:

Memory Hierarchy

• CPUs try to predict the outcome of branches, and have special hardware for maintaining the branching history of many branch instructions.

• For Pentium 4 processors, the trace cache integrates branch prediction into the instruction cache by storing traces of instructions that have previously been executed in sequence, including branches.

Page 14:

Pipelined Query Execution

• Each operator supports an open-next-close iterator interface.

• Open(): initializes the state of the iterator by allocating buffers for its inputs and output, and is also used to pass in arguments such as selection predicates that modify the behavior of the operator.

• Next(): calls the next() function on each input node recursively and processes the input tuples until one output tuple is generated. The state of the operator is updated to keep track of how much input has been consumed.

• Close(): deallocates the state information and performs final housekeeping.
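The open-next-close interface above can be sketched as follows. This is a minimal illustration of the demand-pull (Volcano-style) iterator model, not the paper's actual implementation; the class and method names are invented for the example.

```python
class Scan:
    """Leaf operator: iterates over an in-memory table."""
    def __init__(self, rows):
        self.rows = rows

    def open(self):
        self.pos = 0              # initialize iterator state

    def next(self):
        if self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            return row
        return None               # end of input

    def close(self):
        self.pos = None           # release state


class Select:
    """Filter operator: pulls from its child until one tuple qualifies."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        # One call produces at most one output tuple, consuming as much
        # input as needed -- the demand-pull behavior described above.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()
```

A caller drives the whole pipeline from the root: `plan.open()`, then repeated `plan.next()` until it returns None, then `plan.close()`. Note that every `next()` on the root executes code from every operator in the pipeline, which is what makes the instruction footprint large.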

Page 15:

Pipelined Query Execution

• Pipelining requires much less space for intermediate results during execution, and this query execution model can be executed by a single process or thread.

• Demand-driven pipelined query plans exhibit poor instruction locality and bad instruction cache behavior.

• For a query plan with a pipeline of operators, all the operators' instructions on the pipeline must be executed at least once to generate one output tuple.

• The query execution shows large instruction footprints.

Page 16:

Instruction Cache Thrashing

• If the instruction cache is smaller than the total instruction footprint, every operator finds that some of its instructions are not in the cache.

• When an operator loads its own instructions into the instruction cache, capacity misses evict the instructions of other operators from the cache.

• However, the evicted instructions are required by subsequent requests to the operators, and have to be loaded again in the future. This happens for every output tuple and for every operator in the pipeline.

• The resulting instruction cache thrashing can have a severe impact on overall performance.
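The thrashing effect can be made concrete with a toy model (mine, not the paper's): an LRU cache that holds a fixed number of operator footprints, exercised by the per-tuple access order of an unbuffered pipeline versus a batched order. The operator names, batch size, and capacity are all illustrative.

```python
from collections import OrderedDict

def count_misses(access_sequence, capacity):
    """Count misses for an LRU cache holding `capacity` operator footprints."""
    cache, misses = OrderedDict(), 0
    for op in access_sequence:
        if op in cache:
            cache.move_to_end(op)          # refresh recency on a hit
        else:
            misses += 1
            cache[op] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used

    return misses

ops, tuples, capacity = ["scan", "select", "agg"], 1000, 2

# Unbuffered: every tuple runs the whole pipeline, so three operators
# cycle through a two-slot cache -- every access is a miss (3000 misses).
unbuffered = [op for _ in range(tuples) for op in ops]

# Batched (batch size 100): each operator's instructions are reused for
# a whole batch before the next operator runs -- 3 misses per batch (30 total).
buffered = [op for _ in range(tuples // 100) for op in ops for _ in range(100)]
```

The two orders touch exactly the same instructions the same number of times; only the interleaving changes, which is the core observation behind the buffer operator.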

Page 17:

Simple Query

SELECT
    SUM(l_extendedprice *
        (1 - l_discount) *
        (1 + l_tax)) as sum_charge,
    AVG(l_quantity) as avg_qty,
    COUNT(*) as count_order
FROM lineitem
WHERE
    l_shipdate <= date '1998-11-01';

(TPC-H Schema)

Page 18:

A New Buffer Operator

• Given a query plan, a special buffer operator is added in certain places between a parent and a child operator.

• During query execution, each buffer operator stores a large array of pointers to intermediate tuples generated by the child operator.

• Rather than returning control to the parent operator after one tuple, the buffer operator fills its array with intermediate results by repeatedly calling the child operator.

• Control returns to the parent operator once the buffer's array is filled (or when there are no more tuples).

• When the parent operator requests additional tuples, they are returned from the buffer's array without executing any code from the child operator.

Page 19:

An Example for Usage of Buffer Operator

Page 20:

A New Buffer Operator

• To avoid the instruction cache thrashing problem, implementing a new lightweight buffer operator using the conventional iterator interface is proposed.

• A buffer operator simply batches the intermediate results of the operators below it.

• The tree of operators is organized into execution groups. Execution groups are candidate units of buffering.

• The new buffer operator supports the open-next-close interface.

Page 21:

Query Execution Plan

Page 22:

Pseudocode For Buffer Operator
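The original pseudocode figure did not survive in this transcript. The following is a hedged reconstruction based on the behavior described on the surrounding slides: the buffer operator wraps any child behind the same open-next-close interface, fills an array with (pointers to) the child's intermediate tuples, and serves subsequent next() calls from that array. Names and details are illustrative, not the paper's code.

```python
class Buffer:
    """Batches a child operator's output behind the iterator interface."""
    def __init__(self, child, size=8):
        self.child = child
        self.size = size          # capacity of the buffer array

    def open(self):
        self.child.open()
        self.array = []           # buffered intermediate tuples
        self.pos = 0              # read position within the array
        self.done = False         # child exhausted?

    def next(self):
        if self.pos == len(self.array):
            if self.done:
                return None
            # Refill: call the child repeatedly until the array is full
            # (or the child runs out), keeping its instructions hot.
            self.array, self.pos = [], 0
            while len(self.array) < self.size:
                t = self.child.next()
                if t is None:
                    self.done = True
                    break
                self.array.append(t)
            if not self.array:
                return None
        # Serve from the array without executing any child code.
        t = self.array[self.pos]
        self.pos += 1
        return t

    def close(self):
        self.child.close()
```

The operator is transparent to its parent: it returns exactly the child's tuple stream, one tuple per next() call, only in a different execution order underneath.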

Page 23:

Buffer Operator

• Here is an example of a buffer operator with size 8.

• The operator maintains a buffer array which stores pointers to previous intermediate tuples from the child operator.

Page 24:

Point of Buffering

• The point of buffering is that it increases temporal and spatial instruction locality below the buffer operator.

• When the child operator is executed, its instructions are loaded into the L1 instruction cache.

• Instead of being used to generate one tuple, these instructions are used to generate many tuples.

• As a result, the memory overhead of bringing the instructions into the cache and the CPU overhead of decoding the instructions are smaller.

• Buffering also improves query throughput.

Page 25:

Buffering Strategies

• Every buffer operator incurs the cost of maintaining operator state and pointers to tuples during execution.

• If the cost of buffering is significant compared with the performance gain, then there is too much buffering.

• If the total footprint of an execution group is larger than the L1 instruction cache, a large number of instruction cache misses results.

• Operators with small cardinality estimates are unlikely to benefit from new buffers.

• The cardinality threshold at which the benefits outweigh the costs can be determined using a calibration experiment on the target architecture.

Page 26:

Buffering Strategies

• This threshold can be determined once, in advance, by the database system.

• The calibration experiment would consist of running a single query with and without buffering at various cardinalities.

• The cardinality at which the buffered plan begins to beat the unbuffered plan would be the cardinality threshold for buffering.
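The calibration loop described above can be sketched as follows. The two `run_*` callables stand in for hooks that execute the same query with and without buffering at a given cardinality; they are assumed interfaces for this sketch, not a real engine API.

```python
import timeit

def calibrate_threshold(run_unbuffered, run_buffered, cardinalities, repeats=3):
    """Return the smallest tested cardinality at which the buffered
    plan beats the unbuffered plan, or None if it never does.
    Cardinalities should be given in increasing order."""
    for n in cardinalities:
        # Take the minimum over several runs to damp timing noise.
        t_plain = min(timeit.repeat(lambda: run_unbuffered(n), number=1, repeat=repeats))
        t_buf = min(timeit.repeat(lambda: run_buffered(n), number=1, repeat=repeats))
        if t_buf < t_plain:
            return n              # buffering starts to pay off here
    return None
```

Per the slides, this would be run once per target architecture, since the crossover depends on cache sizes and the fixed cost of the buffer operator.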

Page 27:

Instruction Footprint Analysis

• The basic strategy is to break a pipeline of operators into execution groups so that the instruction footprint of each execution group, combined with the footprint of a new buffer operator, is less than the L1 instruction cache size. This eliminates instruction cache thrashing within an execution group.

• An ideal footprint estimate can only be measured when actually running the query, in an environment typical of the query being posed. However, it would be too expensive to run the whole query first just to estimate footprint sizes.

• No matter which data is used, almost the same set of functions is executed for each module. Thus, the core database system can be calibrated once by running a small query set that covers all kinds of operators.

Page 28:

Overall Algorithm

• The database system is first calibrated by running a small set of simple queries which cover all the operator types, and the instruction footprint for each operator is measured.

• The plan refinement algorithm accepts a query plan tree from the optimizer as input. It produces as output an equivalent enhanced plan tree with buffer operators added.

Page 29:

Overall Algorithm

1. Consider only nonblocking operators with output cardinalities exceeding the calibration threshold.

2. A bottom-up pass is made over the query plan. Each leaf operator is initially an execution group. Try to enlarge each execution group by including parent operators or merging adjacent execution groups until any further action leads to a combined footprint larger than the L1 instruction cache. When that happens, finish the current execution group, label the parent operator as a new execution group, and continue the bottom-up traversal.

3. Add a buffer operator above each execution group and return the new plan.
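The grouping step (step 2) can be sketched for the simple case of a single pipeline, i.e. a chain of operators rather than a general plan tree. All footprint and cache sizes below are illustrative defaults, not the paper's measurements.

```python
L1_ICACHE = 16 * 1024      # assumed L1 instruction cache size (bytes)
BUFFER_FOOTPRINT = 1024    # assumed footprint of the buffer operator itself

def group_pipeline(footprints, cache_size=L1_ICACHE, buffer_cost=BUFFER_FOOTPRINT):
    """footprints: per-operator instruction footprints, leaf first.
    Returns a list of execution groups (lists of operator indices);
    a buffer operator would be placed above each group."""
    groups, current, size = [], [], buffer_cost
    for i, fp in enumerate(footprints):
        # Close the group when adding the parent would overflow the cache.
        if current and size + fp > cache_size:
            groups.append(current)
            current, size = [], buffer_cost   # parent starts a new group
        current.append(i)
        size += fp
    if current:
        groups.append(current)
    return groups
```

Each group's combined footprint, including the buffer operator's own footprint, stays under the cache size (except for a single operator that is larger than the cache by itself, which necessarily forms its own group), matching the strategy on the Instruction Footprint Analysis slide.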

Page 30:

Specification of Experimental System

Page 31:

Experimental Validation

• A large buffer pool is allocated so that all tables can be memory-resident.

• Enough memory for sorting and hashing operations is also allocated.

Page 32:

Validating Buffer Strategies

SELECT
    COUNT(*) as count_order
FROM lineitem
WHERE
    l_shipdate <= date '1998-11-01';

• The results confirm that
 the overhead of buffering is small
 buffering within a group of operators that already fit in the trace cache does not improve instruction cache performance.

Page 33:

Validating Buffer Strategies

SELECT
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    AVG(l_quantity) as avg_qty,
    COUNT(*) as count_order
FROM lineitem
WHERE
    l_shipdate <= date '1998-11-01';

• The results show that
 the combined footprint is 23KB
 the number of trace cache misses is reduced by 80%
 the number of branch mispredictions is reduced by 21%

Page 34:

Cardinality Effects

• Query 1 is used as a query template to calibrate the system, in order to determine a cardinality threshold for buffering.

• The threshold is not sensitive to the choice of operator
• The benefits of buffering become more obvious as the predicate becomes less selective

Page 35:

Buffer Size

• The buffering parameter is the size of the array used to buffer tuple pointers.

• The size is set during operator initialization.
• The reduction in trace cache misses is roughly proportional to 1/buffer_size. Once the buffer is of moderate size, there is only a small incentive to make it bigger.

Buffered query performance as a function of the buffer size for Query 1
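The 1/buffer_size behavior can be seen with a back-of-the-envelope model (not from the paper): each buffer refill pays roughly one reload of the operator group's instruction footprint, amortized over buffer_size tuples, so enlarging an already-moderate buffer buys little.

```python
# Model (illustrative, not from the paper): one footprint reload per refill,
# amortized over buffer_size tuples. Returns are diminishing past modest sizes.

def reloads_per_tuple(n_tuples, buffer_size):
    refills = -(-n_tuples // buffer_size)  # ceiling division
    return refills / n_tuples

for size in (1, 8, 64, 512):
    print(f"buffer_size={size:4d}  reloads/tuple={reloads_per_tuple(100_000, size):.5f}")
```

Going from size 1 to 8 cuts reloads per tuple by 8x, while going from 64 to 512 saves almost nothing in absolute terms, matching the measured flattening.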

Page 36:

Buffer Size

Execution Time Breakdown for Varied Buffer Sizes

• As the buffer size increases:
  – The trace cache miss penalty drops
  – The buffered plan shows better performance
  – Buffer operators incur more L2 data cache misses (hardware prefetching hides most of this miss latency, since the data is allocated and accessed sequentially)

Page 37:

Complex Queries

SELECT sum(o_totalprice), count(*), avg(l_discount)
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
AND l_shipdate <= date '1998-11-01';

Page 38:

Nested Loop Join

• Reduced trace cache misses by 53%
• Reduced branch mispredictions by 26%

Page 39:

Hash Join

• Reduced trace cache misses by 70%
• Reduced branch mispredictions by 44%

Page 40:

Merge Join

• Reduced trace cache misses by 79%
• Reduced branch mispredictions by 30%

Page 41:

General Results

• Buffered plans lead to more L2 cache misses
• The overhead is much smaller than the performance gain from the trace cache and branch prediction

Overall Improvement

Page 42:

General Results

• Better instruction cache performance leads to lower CPI (Cycles Per Instruction)

• The original and buffered plans have almost the same number (less than 1% difference) of instructions executed

• Result: Buffer operators are lightweight

CPI Improvement

Page 43:

Conclusion

• Techniques for buffering the execution of query plans to exploit instruction cache spatial and temporal locality are proposed

• A new buffer operator over the existing open-next-close interface is implemented (without changing the implementation of other operators)
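The idea can be sketched in a few lines. Below is a hypothetical Python rendering of such a buffer operator behind the open-next-close iterator interface; the class names and buffer-of-whole-tuples simplification are illustrative, not taken from the paper (which buffers tuple pointers).

```python
# Sketch (assumed names): a Buffer operator that batches its child's output
# so the child's instruction footprint stays hot across many next() calls,
# while parent and child still see the unchanged open-next-close interface.

class Scan:
    """A trivial leaf operator producing tuples from an in-memory list."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None          # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Buffer:
    """Transparent operator inserted between parent and child."""
    def __init__(self, child, size=8):
        self.child = child
        self.size = size         # buffering parameter: array size
    def open(self):
        self.child.open()
        self.batch, self.i, self.done = [], 0, False
    def next(self):
        if self.i >= len(self.batch):
            if self.done:
                return None
            # Refill: drive the child repeatedly while its code is cached.
            self.batch, self.i = [], 0
            while len(self.batch) < self.size:
                row = self.child.next()
                if row is None:
                    self.done = True
                    break
                self.batch.append(row)
            if not self.batch:
                return None
        row = self.batch[self.i]
        self.i += 1
        return row
    def close(self):
        self.child.close()

# The parent consumes tuples exactly as it would from the unbuffered child:
op = Buffer(Scan([1, 2, 3, 4, 5]), size=2)
op.open()
out = []
while (t := op.next()) is not None:
    out.append(t)
op.close()
print(out)  # [1, 2, 3, 4, 5]
```

Because the buffer only batches calls, it can be dropped between any two operators without touching their implementations.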

• Buffer operators are useful for complex queries which have large instruction footprints and large output cardinalities

Page 44:

Conclusion

• The plan refinement algorithm traverses the query plan in a bottom-up fashion and makes buffering decisions based on footprint and output cardinality
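A bottom-up pass of this kind can be sketched as follows. The budget, threshold, and grouping rule here are illustrative assumptions, a simplification of the paper's algorithm rather than its exact logic.

```python
# Sketch (assumed constants): walk the plan bottom-up, accumulate each
# execution group's instruction footprint, and place a buffer above a child
# whose footprint would overflow the budget and whose output cardinality is
# large enough to amortize the buffering overhead.

TRACE_CACHE_BYTES = 12 * 1024       # assumed per-group instruction budget
CARDINALITY_THRESHOLD = 100         # assumed calibrated threshold

class PlanNode:
    def __init__(self, name, footprint, out_card, children=()):
        self.name = name
        self.footprint = footprint   # instruction footprint in bytes
        self.out_card = out_card     # estimated output cardinality
        self.children = list(children)
        self.buffered = False        # buffer placed above this node?

def refine(node):
    """Return the footprint of the execution group rooted at node,
    inserting buffers where a child would overflow the budget."""
    group = node.footprint
    for child in node.children:
        child_group = refine(child)
        if (group + child_group > TRACE_CACHE_BYTES
                and child.out_card > CARDINALITY_THRESHOLD):
            child.buffered = True    # start a new execution group below
        else:
            group += child_group     # child shares this group's footprint
    return group

# Example: a high-cardinality scan feeding an aggregation overflows the
# budget, so a buffer goes above the scan.
scan = PlanNode("scan", footprint=8 * 1024, out_card=1_000_000)
agg = PlanNode("agg", footprint=10 * 1024, out_card=1, children=[scan])
refine(agg)
print(scan.buffered, agg.buffered)  # True False
```

The cardinality check keeps low-output subtrees unbuffered, since a buffer there would add overhead without enough tuples to amortize it.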

• Instruction cache misses are reduced by up to 80%

• The query performance is improved by up to 15%

Page 45:

Future Work

• Integrate buffering within the plan generator of a query optimizer

• Break operators with very large footprints into several sub-operators and execute them in stages

• For complex query plans, reschedule the execution so that operators of the same type are scheduled together

Page 46:

References

• S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In Proceedings of ICDE Conference, 2001.

• J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In Proceedings of VLDB Conference, 2003.

• K. A. Ross, J. Cieslewicz, J. Rao, and J. Zhou. Architecture Sensitive Database Design: Examples from the Columbia Group. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2005.

Page 47:

References

• Transaction Processing Performance Council. TPC Benchmark H. Available via http://www.tpc.com/tpch/.

• A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proceedings of VLDB Conference, 1999.

Page 48:

Questions?

Page 49:

Thank You

Page 50:

Pentium 4 Cache Memory System

Page 51:

Instruction Footprints

Page 52:

TPC-H Schema