
Database Architecture Optimized for the new Bottleneck: Memory Access

Peter Boncz∗ Stefan Manegold Martin L. Kersten

Data Distilleries B.V., Amsterdam · The Netherlands
CWI, Amsterdam · The Netherlands

[email protected] {S.Manegold,M.L.Kersten}@cwi.nl

Abstract

In the past decade, advances in the speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. In this article, we use a simple scan test to show the severe impact of this bottleneck. The insights gained are translated into guidelines for database architecture, in terms of both data structures and algorithms. We discuss how vertically fragmented data structures optimize cache performance on sequential data access. We then focus on equi-join, typically a random-access operation, and introduce radix algorithms for partitioned hash-join. The performance of these algorithms is quantified using a detailed analytical model that incorporates memory access cost. Experiments that validate this model were performed on the Monet database system. We obtained exact statistics on events like TLB misses and L1 and L2 cache misses by using the hardware performance counters found in modern CPUs. Using our cost model, we show how the carefully tuned memory access pattern of our radix algorithms makes them perform well, which is confirmed by experimental results.

* This work was carried out when the author was at the University of Amsterdam, supported by SION grant 612-23-431.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Introduction

Custom hardware – from workstations to PCs – has been experiencing tremendous improvements in the past decades. Unfortunately, this growth has not been equally distributed over all aspects of hardware performance and capacity. Figure 1 shows that the speed of commercial microprocessors has been increasing roughly 70% every year, while the speed of commodity DRAM has improved by little more than 50% over the past decade [Mow94]. Part of the reason for this is that there is a direct tradeoff between capacity and speed in DRAM chips, and the highest priority has been for increasing capacity. The result is that from the perspective of the processor, memory has been getting slower at a dramatic rate. This affects all computer systems, making it increasingly difficult to achieve high processor efficiencies.

Three aspects of memory performance are of interest: bandwidth, latency, and address translation.

Figure 1: Hardware trends in DRAM and CPU speed (log-scale speed in MHz, 1 to 1000, plotted against year, 1979-1997, for processors vs. DRAM)


The only way to reduce effective memory latency for applications has been to incorporate cache memories in the memory subsystem. Fast and more expensive SRAM memory chips found their way to computer boards, to be used as L2 caches. Due to the ever-rising CPU clock-speeds, the time to bridge the physical distance between such chips and the CPU became a problem; so modern CPUs come with an on-chip L1 cache (see Figure 2). This physical distance is actually a major complication for designs trying to reduce main-memory latency. The new DRAM standards Rambus [Ram96] and SLDRAM [SLD97] therefore concentrate on fixing the memory bandwidth bottleneck [McC95], rather than the latency problem.

Cache memories can reduce the memory latency only when the requested data is found in the cache. This mainly depends on the memory access pattern of the application. Thus, unless special care is taken, memory latency becomes an increasing performance bottleneck, preventing applications – including database systems – from fully exploiting the power of modern hardware.

Besides memory latency and memory bandwidth, translation of logical virtual memory addresses to physical page addresses can also have a severe impact on memory access performance. The Memory Management Unit (MMU) of all modern CPUs has a Translation Lookaside Buffer (TLB), a kind of cache that holds the translation for (typically) the 64 most recently used pages. If a logical address is found in the TLB, the translation has no additional cost. Otherwise, a TLB miss occurs. A TLB miss is handled by trapping to a routine in the operating system kernel, which translates the address and places it in the TLB. Depending on the implementation and hardware architecture, TLB misses can be even more costly than a main-memory access.

1.1 Overview

In this article we investigate the effect of memory access cost on database performance, by looking in detail at the main-memory cost of typical database applications. Our research group has studied large main-memory database systems for the past 10 years. This research started in the PRISMA project [AvdBF+92], focusing on massive parallelism, and is now centered around Monet [BQK96, BWK98], a high-performance system targeted at query-intensive application areas like OLAP and Data Mining. For the research presented here, we use Monet as our experimentation platform.

The rest of this paper is organized as follows: In Section 2, we analyze the impact of memory access costs on basic database operations. We show that, unless special care is taken, a database server running even a simple sequential scan on a table will spend 95% of its cycles waiting for memory to be accessed. This memory access bottleneck is even more difficult to avoid in more complex database operations like sorting, aggregation, and join, which exhibit a random access pattern.

Figure 2: Hierarchical Memory System (the CPU registers and on-die L1 cache with its L1 cache-lines; the off-chip L2 cache with its L2 cache-lines; the bus to main memory, organized in memory pages; and virtual memory backed by a swap file on disk)


In Section 3, we discuss the consequences of this bottleneck for the data structures and algorithms to be used in database systems. We identify vertical fragmentation as the solution for database data structures that leads to optimal memory cache usage. Concerning query processing algorithms, we focus on equi-join, and introduce new radix algorithms for partitioned hash-join. We analyze the properties of these algorithms with a detailed analytical model that quantifies query cost in terms of CPU cycles, TLB misses, and cache misses. This model enables us to show how our algorithms achieve better performance by having a carefully tuned memory access pattern.

Finally, we evaluate our findings and conclude that the hard data obtained in our experiments justify the basic architectural choices of the Monet system, which back in 1992 were mostly based on intuition.

2 Initial Experiment

In this section, we demonstrate the severe impact of memory access cost on the performance of elementary database operations. Figure 3 shows the results of a simple scan test on a number of popular workstations of the past decade. In this test, we sequentially scan an in-memory buffer, by iteratively reading one byte with a varying stride, i.e. the offset between two subsequently accessed memory addresses. This experiment mimics what happens if a database server performs a read-only scan of a one-byte column in an in-memory table with a certain record-width (the stride), as would happen in a selection on a column with zero selectivity or in a simple aggregation (e.g. Max or Sum). The Y-axis in Figure 3 shows the cost of 200,000 iterations in elapsed time, and the X-axis shows the stride used. We made sure that the buffer was in memory, but not in any of the memory caches.
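The experiment is straightforward to re-create. The following C sketch mimics the setup under our own assumptions (clock()-based timing and a 256-byte maximum stride are illustrative choices); it is not the code used for Figure 3, and a faithful run would additionally evict the buffer from all caches between measurements.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITERATIONS 200000   /* one read per "tuple", as in Figure 3 */

    /* Scan the buffer reading one byte every 'stride' bytes; return a
       checksum so the compiler cannot optimize the loads away. */
    static unsigned char scan(const unsigned char *buf, size_t stride)
    {
        unsigned char sum = 0;
        for (size_t i = 0; i < ITERATIONS; i++)
            sum += buf[i * stride];
        return sum;
    }

    int main(void)
    {
        const size_t max_stride = 256;
        unsigned char *buf = malloc(ITERATIONS * max_stride);
        if (buf == NULL)
            return 1;
        for (size_t i = 0; i < ITERATIONS * max_stride; i++)
            buf[i] = (unsigned char)i;      /* touch every page once */

        for (size_t stride = 1; stride <= max_stride; stride *= 2) {
            /* NOTE: to reproduce Figure 3, the buffer must be flushed
               from L1/L2 between runs (e.g. by scanning another large
               array); omitted here for brevity. */
            clock_t t0 = clock();
            unsigned char sum = scan(buf, stride);
            clock_t t1 = clock();
            printf("stride %3zu: %6.1f msec (checksum %u)\n", stride,
                   1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
        }
        free(buf);
        return 0;
    }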


Figure 3: Reality Check: simple in-memory scan of 200,000 tuples (elapsed time in msec, 0-70, vs. record-width in bytes (stride), 0-250). Machines: origin2k (R10000 @ 250MHz, L2 line size 128 bytes, L1 line size 32 bytes, 1998); sun450 (UltraSparc2 @ 296MHz, L2 line size 64 bytes, L1 line size 16 bytes, 1997); ultra (UltraSparc1 @ 143MHz, L2 line size 64 bytes, L1 line size 16 bytes, 1995); sunLX (Sparc @ 50MHz, L2 line size 16 bytes, 1992).

When the stride is small, successive iterations in the scan read bytes that are near to each other in memory, hitting the same cache line. The number of L1 and L2 cache misses is therefore low. The L1 miss rate reaches its maximum of one miss per iteration as soon as the stride reaches the size of an L1 cache line (16 to 32 bytes). Only the L2 miss rate increases further, until the stride exceeds the size of an L2 cache line (16 to 128 bytes). Then, it is certain that every memory read is a cache miss. Performance cannot become any worse and stays constant.

The following model describes, depending on the stride s, the execution costs per iteration of our experiment in terms of pure CPU costs (including data accesses in the on-chip L1 cache) and additional costs due to L2 cache accesses and main-memory accesses:

T(s) = T_{CPU} + T_{L2}(s) + T_{Mem}(s)

with

T_{L2}(s) = M_{L1}(s) \cdot l_{L2}, \qquad M_{L1}(s) = \min\left(\frac{s}{LS_{L1}}, 1\right)

T_{Mem}(s) = M_{L2}(s) \cdot l_{Mem}, \qquad M_{L2}(s) = \min\left(\frac{s}{LS_{L2}}, 1\right)

and M_x, LS_x, l_x denoting the number of cache misses, the cache line sizes, and the (cache) memory access latencies for each level, respectively.
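Reading the model at its two extremes makes the shape of the curves in Figure 3 explicit (our own illustration of the formulas above, not an additional claim):

\text{for } s \ll LS_{L1}: \quad M_{L1}(s) \approx M_{L2}(s) \approx 0 \;\Rightarrow\; T(s) \approx T_{CPU},

\text{for } s \ge LS_{L2}: \quad M_{L1}(s) = M_{L2}(s) = 1 \;\Rightarrow\; T(s) = T_{CPU} + l_{L2} + l_{Mem}.

Hence elapsed time rises with the stride until one L2 cache line is fetched per iteration, and stays flat beyond that point.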

While all machines exhibit the same pattern of performance degradation with decreasing data locality, Figure 3 clearly shows that the penalty for poor memory cache usage has dramatically increased in the last six years. The CPU speed has improved by almost an order of magnitude, but the memory access latencies have hardly changed. In fact, we must draw the sad conclusion that if no attention is paid in query processing to data locality, all advances in CPU power are neutralized due to the memory access bottleneck. The considerable growth of memory bandwidth – reflected in the growing cache line sizes [1] – does not solve the problem if data locality is low.

This trend of improvement in bandwidth but standstill in latency [Ram96, SLD97] is expected to continue, with no real solutions in sight. The work in [Mow94] has proposed to hide memory latency behind CPU work by issuing prefetch instructions before data is going to be accessed. The effectiveness of this technique for database applications is, however, limited due to the fact that the amount of CPU work per memory access tends to be small in database operations (e.g., the CPU work in our select-experiment requires only 4 cycles on the Origin2000). Another proposal [MKW+98] has been to make the caching system of a computer configurable, allowing the programmer to give a "cache-hint" by specifying the memory-access stride that is going to be used on a region. Only the specified data would then be fetched, hence optimizing bandwidth usage. Such a proposal has not yet been considered for custom hardware, however, let alone in the OS and compiler tools that would need to provide the possibility to incorporate such hints in user programs.

[1] In one memory fetch, the Origin2000 gets 128 bytes, whereas the Sun LX gets only 16; an improvement of a factor 8.


3 Architectural Consequences

In the previous sections we have shown that it is less and less appropriate to think of the main memory of a computer system as "random access" memory. In this section, we analyze the consequences for both data structures and algorithms used in database systems.

3.1 Data Structures

The default physical tuple representation is a consecutive byte sequence, which must always be accessed by the bottom operators in a query evaluation tree (typically selections or projections). In the case of sequential scan, we have seen that performance is strongly determined by the record-width (the position on the X-axis of Figure 3). This width quickly becomes too large, and hence performance decreases (e.g., an Item tuple, as shown in Figure 4, occupies at least 80 bytes on relational systems). To achieve better performance, a smaller stride is needed, and for this purpose we recommend using vertically decomposed data structures.

Monet uses the Decomposed Storage Model [CK85], storing each column of a relational table in a separate binary table, called a Binary Association Table (BAT). A BAT is represented in memory as an array of fixed-size two-field records [OID,value], or Binary UNits (BUNs). Their width is typically 8 bytes.
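In C terms, a BAT can be pictured as a plain array of fixed-size records, as in the following simplified sketch (type and field names are ours; Monet's actual implementation is more elaborate):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t oid_t;

    /* One BUN: a fixed-size [OID,value] record; with a 4-byte value
       type such as int, a BUN is 8 bytes wide. */
    typedef struct {
        oid_t   head;   /* OID column */
        int32_t tail;   /* value column */
    } bun_t;

    /* A BAT is a contiguous array of BUNs: scanning one column touches
       only 8 bytes per tuple, i.e. a stride of 8 in Figure 3. */
    typedef struct {
        bun_t  *buns;
        size_t  count;
    } bat_t;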

In the case of the Origin2000 machine, we deduce from Figure 3 that a scan-selection on a table with stride 8 takes 10 CPU cycles per iteration, whereas a stride of 1 takes only 4 cycles. In other words, in a simple range-select, there is so little CPU work per tuple (4 cycles) that the memory access cost for a stride of 8 still weighs quite heavily (6 cycles). Therefore we have found it useful in Monet to apply two space optimizations that further reduce the per-tuple memory requirements in BATs:

virtual-OIDs Generally, when decomposing a relational table, we get an identical system-generated column of OIDs in all decomposition BATs, which is dense and ascending (e.g. 1000, 1001, ..., 1007). In such BATs, Monet computes the OID values on-the-fly when they are accessed using positional lookup of the BUN, and avoids allocating the 4-byte OID field. This is called a "virtual-OID" or VOID column. Apart from reducing memory requirements by half, this optimization is also beneficial when joins or semi-joins are performed on OID columns. [2] When one of the join columns is VOID, Monet uses positional lookup instead of e.g. hash-lookup, effectively eliminating all join cost.

[2] The projection phase in query processing typically leads in Monet to additional "tuple-reconstruction" joins on OID columns, caused by the fact that tuples are decomposed into multiple BATs.
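A sketch of why a VOID column eliminates join cost: when the head column is dense and ascending from a known base, the BUN for a given OID is found by array indexing rather than hashing. The structure and names below are our own illustration:

    /* With a virtual-OID (VOID) head column, only the tail values are
       materialized; the tuple with OID o sits at position o - base. */
    typedef struct {
        oid_t    base;    /* first OID of the dense, ascending column */
        int32_t *tail;    /* the stored value column                  */
        size_t   count;
    } void_bat_t;

    /* Positional lookup: replaces a hash-lookup in joins on the VOID
       column, turning the join into simple array indexing. */
    static inline int32_t *void_lookup(const void_bat_t *bat, oid_t o)
    {
        size_t pos = o - bat->base;
        return (pos < bat->count) ? &bat->tail[pos] : NULL;
    }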

Figure 4: Vertically Decomposed Storage in BATs. The logical "Item" table (columns order, supp, part, date1, date2, qty, price, tax, discnt, flag, status, shipmode, comment; relational tuple width ~= 80 bytes) is vertically fragmented in Monet into 8-byte-wide [oid,value] BATs. With a virtual-OID (void) column and a 1-byte encoding BAT for the low-cardinality shipmode column, the optimized BAT storage for that column is 1 byte per tuple.

byte-encodings Database columns often have a low domain cardinality. For such columns, Monet uses fixed-size encodings in 1- or 2-byte integer values. This simple technique was chosen because it does not require decoding effort when the values are used (e.g., a selection on the string "MAIL" can be re-mapped to a selection on a byte with value 3). A more complex scheme (e.g., using bit-compression) might yield even more memory savings, but the decoding step required whenever values are accessed can quickly become counterproductive due to the extra CPU effort. Even if decoding would just cost a handful of cycles per tuple, this would more than double the amount of CPU effort in simple database operations, like the range-select from our experiment.

Figure 4 shows that when applying both techniques, the storage needed for 1 BUN in the "shipmode" column is reduced from 8 bytes to just one.
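The following C sketch illustrates the byte-encoding idea: the selection constant is remapped once before the scan, and the scan itself compares 1-byte codes without any per-tuple decoding. The dictionary contents and their order are illustrative (chosen so that "MAIL" gets code 3, as in the example above):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Encoding dictionary for a low-cardinality column: code i maps
       to the i-th string. */
    static const char *shipmode_dict[] =
        { "REG AIR", "SHIP", "TRUCK", "MAIL", "AIR", "RAIL", "FOB" };

    /* Remap a selection constant once, before the scan starts. */
    static int encode(const char *value, const char **dict, int n)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(dict[i], value) == 0)
                return i;
        return -1;  /* value not in the domain: the selection is empty */
    }

    /* The scan compares 1-byte codes; 'out' must have room for n hits. */
    static size_t select_eq(const uint8_t *col, size_t n, int code,
                            size_t *out)
    {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)
            if (col[i] == (uint8_t)code)
                out[hits++] = i;
        return hits;
    }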

3.2 Query Processing Algorithms

We now briefly discuss the effect of the memory access bottleneck on the design of algorithms for common query processing operators.

selections If the selectivity is low, most data needs to be visited, and this is best done with a scan-select (it has optimal data locality). For higher selectivities, Lehman and Carey [LC86] concluded that the T-tree and the bucket-chained hash-table were the best data structures for accelerating selections in main-memory databases. The work in [Ron98] reports, however, that a B-tree with a block-size equal to the cache line size is optimal. Our findings about the increased impact of cache misses indeed support this claim, since lookup using either a hash-table or a T-tree causes random memory access to the entire relation; a non-cache-friendly access pattern.

grouping and aggregation Two algorithms are often used here: sort/merge and hash-grouping. In sort/merge, the table is first sorted on the GROUP-BY attribute(s), followed by scanning. Hash-grouping scans the relation once, keeping a temporary hash-table where the GROUP-BY values are a key that gives access to the aggregate totals. The number of groups is often limited, such that this hash-table fits the L2 cache, and probably also the L1 cache. This makes hash-grouping superior to sort/merge concerning main-memory access, as the sort step has random access behavior and is done on the entire relation to be grouped, which probably does not fit in any cache.

equi-joins Hash-join has long been the preferred main-memory join algorithm. It first builds a hash table on the smaller relation (the inner relation). The outer relation is then scanned, and for each tuple a hash-lookup is done to find the matching tuples. If the inner relation plus the hash table does not fit in any memory cache, a performance problem occurs, due to the random access pattern. Merge-join is not a viable alternative as it requires sorting both relations first, which would cause random access over an even larger memory region.

Consequently, we identify join as the most problematic operator; we therefore investigate possible alternatives that can get optimal performance out of a hierarchical memory system.

Figure 5: Straightforward clustering algorithm (a 1-pass cluster scatters the scanned tuples into H different cluster buffers)

Figure 6: 2-pass/3-bit Radix Cluster (lower bits indicated between parentheses; pass 1 clusters on 2 bits, pass 2 on 1 bit)

3.3 Clustered Hash-Join

Shatdal et al. [SKN94] showed that a main-memory variant of Grace Join, in which both relations are first partitioned on hash-number into H separate clusters that each fit the memory cache, performs better than normal bucket-chained hash-join. This work employs a straightforward clustering algorithm that simply scans the relation to be clustered once, inserting each scanned tuple into one of the clusters, as depicted in Figure 5. This constitutes a random access pattern that writes into H separate locations. If H exceeds the number of available cache lines (L1 or L2), cache thrashing occurs; if H exceeds the number of TLB entries, the number of TLB misses explodes. Both factors severely degrade overall join performance.

As an improvement over this straightforward algorithm, we propose a clustering algorithm that has a cache-friendly memory access pattern, even for high values of H.

3.3.1 Radix Algorithms

The radix-cluster algorithm splits a relation into H clusters using multiple passes (see Fig. 6). Radix-clustering on the lower B bits of the integer hash-value of a column is done in P sequential passes, in which each pass clusters tuples on B_p bits, starting with the leftmost bits (\sum_{p=1}^{P} B_p = B). The number of clusters created by the radix-cluster is H = \prod_{p=1}^{P} H_p, where each pass subdivides each cluster into H_p = 2^{B_p} new ones. When the algorithm starts, the entire relation is considered as one cluster, and is subdivided into H_1 = 2^{B_1} clusters. The next pass takes these clusters and subdivides each into H_2 = 2^{B_2} new ones, yielding H_1 \cdot H_2 clusters in total, etc. Note that with P = 1, radix-cluster behaves like the straightforward algorithm.


Figure 7: Joining 3-bit Radix-Clustered Inputs L and R (black tuples hit; a merge step on the radix-bits pairs up the clusters to be joined)

The interesting property of the radix-cluster is that the number of randomly accessed regions H_x can be kept low, while still a high overall number of H clusters can be achieved using multiple passes. More specifically, if we keep H_x = 2^{B_x} smaller than the number of cache lines, we avoid cache thrashing altogether.

After radix-clustering a column on B bits, all tuples that have the same B lowest bits in their column hash-value appear consecutively in the relation, typically forming chunks of C/2^B tuples. It is therefore not strictly necessary to store the cluster boundaries in some additional data structure; an algorithm scanning a radix-clustered relation can determine the cluster boundaries by looking at these lower B "radix-bits". This allows very fine clusterings without introducing overhead by large boundary structures. It is interesting to note that a radix-clustered relation is in fact ordered on radix-bits. When using this algorithm in the partitioned hash-join, we exploit this property by performing a merge step on the radix-bits of both radix-clustered relations to get the pairs of clusters that should be hash-joined with each other.

The alternative radix-join algorithm, also proposed here, makes use of the very fine clustering capabilities of radix-cluster.

partitioned-hashjoin(L, R, H):
    radix-cluster(L, H)
    radix-cluster(R, H)
    FOREACH cluster c IN [1..H]
        hash-join(L[c], R[c])

radix-join(L, R, H):
    radix-cluster(L, H)
    radix-cluster(R, H)
    FOREACH cluster c IN [1..H]
        nested-loop(L[c], R[c])

Figure 8: Join Algorithms Employed

If the number of clusters H is high, the radix-clustering has brought the potentially matching tuples near to each other. As chunk sizes are small, a simple nested loop is then sufficient to filter out the matching tuples. Radix-join is similar to hash-join in the sense that the number H should be tuned to be the relation cardinality C divided by a small constant, just like the length of the bucket-chain in a hash-table. If this constant gets down to 1, radix-join degenerates into sort/merge-join, with radix-sort [Knu68] employed in the sorting phase.
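A sketch of the per-cluster join step of radix-join under the same assumptions (bun_t as before; the result type is ours): aligned cluster pairs are joined with a nested loop, emitting [OID,OID] pairs, i.e. a join-index, as in Figure 8.

    typedef struct { oid_t lhead, rhead; } oid_pair_t;

    /* Nested-loop join of two small matching clusters; with clusters
       of around 8 tuples, both fit comfortably in the L1 cache.
       Returns the number of result pairs written to out. */
    static size_t nl_join(const bun_t *l, size_t nl,
                          const bun_t *r, size_t nr, oid_pair_t *out)
    {
        size_t hits = 0;
        for (size_t i = 0; i < nl; i++)
            for (size_t j = 0; j < nr; j++)
                if (l[i].tail == r[j].tail) {
                    out[hits].lhead = l[i].head;
                    out[hits].rhead = r[j].head;
                    hits++;
                }
        return hits;
    }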

3.4 Quantitative Assessment

The radix-cluster algorithm presented in the previous section provides three tuning parameters:

1. the number of bits used for clustering (B), implying the number of clusters H = 2^B,

2. the number of passes used during clustering (P),

3. the number of bits used per clustering pass (B_p).
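For instance (an illustrative setting of our own):

B = 12, \quad P = 2, \quad B_1 = B_2 = 6 \;\Rightarrow\; H = 2^{12} = 4096, \quad H_p = 2^{6} = 64,

so each pass scatters into only 64 regions while 4096 clusters are produced in total; 64 matches the number of TLB entries of the machine used below.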

In the following, we present an exhaustive series of experiments to analyze the performance impact of different settings of these parameters. After establishing which parameter settings are optimal for radix-clustering a relation on B bits, we turn our attention to the performance of the join algorithms with varying values of B. Finally, these two experiments are combined to gain insight into overall join performance.

3.4.1 Experimental Setup

In our experiments, we use binary relations (BATs) of 8-byte-wide tuples and varying cardinalities, consisting of uniformly distributed unique random numbers. In the join experiments, the join hit-rate is one, and the result of a join is a BAT that contains the [OID,OID] combinations of matching tuples (i.e., a join-index [Val87]). Subsequent tuple reconstruction is cheap in Monet, and equal for all algorithms, so just like [SKN94] we do not include it in our comparison.

The experiments were carried out on an Origin2000 machine using one 250MHz MIPS R10000 processor. This system has 32KB of L1 cache, consisting of 1,024 lines of 32 bytes, 4MB of L2 cache, consisting of 32,768 lines of 128 bytes, and sufficient main memory to hold all data structures. Further, this system uses a page size of 16KB and has 64 TLB entries. We used the hardware event counters of the MIPS R10000 CPU [Sil97] to get exact data on the number of cycles, TLB misses, L1 misses, and L2 misses during these experiments. [3] Using the data from the experiments, we formulate an analytical main-memory cost model that quantifies query cost in terms of these hardware events.

[3] The Intel Pentium family, Sun UltraSparc, and DEC Alpha provide similar counters.
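As a quick check (our own arithmetic), the derived sizes that the cost models below rely on follow directly from these hardware parameters:

||L1|| = 1024 \cdot 32\,\mathrm{B} = 32\,\mathrm{KB}, \qquad ||L2|| = 32768 \cdot 128\,\mathrm{B} = 4\,\mathrm{MB}, \qquad ||TLB|| = |TLB| \cdot ||Pg|| = 64 \cdot 16\,\mathrm{KB} = 1\,\mathrm{MB}.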

Page 7: Database Architecture Optimized for the new Bottleneck ...€¦ · bottleneck for data structures and algorithms to be used in database systems. We identify vertical frag-mentation

3.4.2 Radix Cluster

To analyze the impact of all three parameters (B, P, B_p) on radix clustering, we conduct two series of experiments, keeping one parameter fixed and varying the remaining two.

First, we conduct experiments with various numbers of bits and passes, distributing the bits evenly across the passes. The points in Figure 9 depict the results for a BAT of 8M tuples; the remaining cardinalities (<= 64M) behave the same way. Up to 6 bits, using just one pass yields the best performance (cf. "millisecs"). Then, as the number of clusters to be filled concurrently exceeds the number of TLB entries (64), the number of TLB misses increases tremendously (cf. "TLB misses"), decreasing the performance. With more than 6 bits, two passes perform better than one. The costs of an additional pass are more than compensated by having significantly fewer TLB misses in each pass, which uses half the number of bits. Analogously, three passes should be used with more than 12 bits, and four passes with more than 18 bits. Thus, the number of clusters per pass is limited to at most the number of TLB entries. A second, more moderate increase in TLB misses occurs when the number of clusters exceeds the number of L2 cache lines, a behavior which we cannot really explain.

Similarly, the number of L1 cache misses and L2 cache misses increases significantly whenever the number of clusters per pass exceeds the number of L1 cache lines (1,024) and L2 cache lines (32,768), respectively.

Figure 9: Performance and Model of Radix-Cluster (L1 misses, L2 misses, TLB misses, and elapsed time in millisecs vs. number of bits, for 1 to 4 passes; vertical marks where the cluster count crosses the TLB, L1, and L2 capacities)

The impact of the additional L2 misses on the total performance is obvious for one pass (it does not occur with more than one pass, as then at most 13 bits are used per pass). The impact of the additional L1 misses on the total performance nearly completely vanishes due to the heavier penalty of TLB misses and L2 misses.

Finally, we notice that the best-case execution timeincreases with the number of bits used.

The following model calculates the total executioncosts for a radix cluster depending on the number ofpasses, the number of bits, and the cardinality:

T_c(P, B, C) = P \cdot \left( C \cdot w_c + M_{L1,c}\left(\frac{B}{P}, C\right) \cdot l_{L2} + M_{L2,c}\left(\frac{B}{P}, C\right) \cdot l_{Mem} + M_{TLB,c}\left(\frac{B}{P}, C\right) \cdot l_{TLB} \right)

with

M_{Li,c}(B_p, C) =
\begin{cases}
2 \cdot |Re|_{Li} + C \cdot \dfrac{H_p}{|Li|_{Li}}, & \text{if } H_p \le |Li|_{Li} \\[1ex]
C \cdot \left(1 + \log\left(\dfrac{H_p}{|Li|_{Li}}\right)\right), & \text{if } H_p > |Li|_{Li}
\end{cases}

and

M_{TLB,c}(B_p, C) =
\begin{cases}
2 \cdot |Re|_{Pg} + |Re|_{Pg} \cdot \dfrac{H_p}{|TLB|}, & \text{if } H_p \le |TLB| \\[1ex]
C \cdot \left(1 - \dfrac{|TLB|}{H_p}\right), & \text{if } H_p > |TLB|
\end{cases}

|Re|_{Li} and |Cl|_{Li} denote the number of cache lines per relation and cluster, respectively, |Re|_{Pg} the number of pages per relation, |Li|_{Li} the total number of cache lines, both for the L1 (i = 1) and L2 (i = 2) caches, and |TLB| the number of TLB entries.

The first term of M_{Li,c} equals the minimal number of Li misses per pass for fetching the input and storing the output. The second term counts the number of additional Li misses when the number of clusters either approaches the number of available Li lines or even exceeds it. M_{TLB,c} is made up analogously. Due to space limits, we omit the term that models the additional TLB misses when the number of clusters exceeds the number of available L2 lines. A detailed description of these and the following formulae is given in [MBK99].

The lines in Figure 9 represent our model for a BAT of 8M tuples. The model proves to be very accurate. [4]

The question remaining is how to distribute the number of bits over the passes. The experimental results, not presented here due to space limits (cf. [MBK99]), showed that the performance strongly depends on an even distribution of the bits.

[4] On our Origin2000 (250MHz) we calibrated l_TLB = 228ns, l_L2 = 24ns, l_Mem = 412ns, and w_c = 50ns.


3.4.3 Isolated Join Performance

We now analyze the impact of the number of radix-bits on the pure join performance, not including the clustering cost.

The points in Figure 10 depict the experimental results of radix-join (L1 and L2 cache misses, TLB misses, elapsed time) for different cardinalities. The lower graph ("millisecs") shows that the performance of radix-join improves with an increasing number of radix-bits. The upper graph ("L1 misses") confirms that only cluster sizes significantly smaller than the L1 size are reasonable; otherwise, the number of L1 cache misses explodes due to cache thrashing. As we limited the execution time of each single run to 15 minutes, only cluster sizes significantly smaller than the L2 size and the TLB size (i.e. the number of TLB entries times the page size) occur; that is why the number of L2 cache misses stays almost constant. The performance improvement continues until the mean cluster size is 1 tuple. At that point, radix-join has degenerated into sort/merge-join. The high cost of radix-join with large cluster sizes is explained by the fact that it performs a nested-loop join on each pair of matching clusters. Therefore, clusters need to be kept small; our results indicate that a cluster size of 8 tuples is optimal.

The following model calculates the total execution costs for a radix-join, depending on the number of bits and the cardinality: [5]

T_r(B, C) = C \cdot \left\lceil \frac{C}{H} \right\rceil \cdot w_r + C \cdot w'_r + M_{L1,r}(B, C) \cdot l_{L2} + M_{L2,r}(B, C) \cdot l_{Mem} + M_{TLB,r}(B, C) \cdot l_{TLB}

with

M_{Li,r}(B, C) = 3 \cdot |Re|_{Li} + C \cdot
\begin{cases}
\dfrac{|Cl|_{Li}}{|Li|_{Li}}, & \text{if } |Cl|_{Li} \le |Li|_{Li} \\[1ex]
|Cl|_{Li}, & \text{if } |Cl|_{Li} > |Li|_{Li}
\end{cases}

and

M_{TLB,r}(B, C) = 3 \cdot |Re|_{Pg} + C \cdot \frac{||Cl||}{||TLB||}

|Re|_{Pg}, |Re|_{Li}, |Cl|_{Li}, and |Li|_{Li} are as above (i \in \{1, 2\}), ||Cl|| denotes the cluster size (in bytes), and ||TLB|| = |TLB| \cdot ||Pg|| denotes the memory range covered by |TLB| pages.

The first term of T_r calculates the costs for evaluating the join predicate: each tuple of the outer relation has to be checked against each tuple in the respective cluster; the cost per check is w_r. The second term represents the costs for creating the result, with w'_r denoting the costs per tuple.

[5] For simplicity of presentation, we assume the cardinalities of both operands and the result to be the same.

Figure 10: Performance and Model of Radix-Join (L1 misses, L2 misses, TLB misses, and elapsed time in millisecs vs. number of bits, for cardinalities 15,625 to 64,000,000; vertical marks where the cluster size crosses the L1 size, the L2 size, and 8 tuples)


Figure 11: Performance and Model of Partitioned Hash-Join (L1 misses, L2 misses, TLB misses, and elapsed time in millisecs vs. number of bits, for the same cardinalities; vertical marks where the cluster size crosses the L1 size, the L2 size, the TLB size, and 256 tuples)

The left term of M_{Li,r} equals the minimal number of Li misses for fetching both operands and storing the result. The right term counts the number of additional Li misses during the inner loop, when the number of Li lines per cluster either approaches the number of available Li lines or even exceeds it. M_{TLB,r} is made up analogously. The lines in Figure 10 prove the accuracy of our model for different cardinalities (w_r = 24ns, w'_r = 240ns).

The partitioned hash-join also exhibits increased performance with an increasing number of radix-bits. Figure 11 shows that the performance increase flattens after the point where the entire inner cluster (including its hash table) spans fewer pages than there are TLB entries (64); then it also fits the L2 cache comfortably. Thereafter, execution time decreases only slightly, until the inner cluster fits the L1 cache, where it reaches its minimum. The fixed overhead of allocating the hash-table structure causes performance to decrease again when the clusters get too small (fewer than about 200 tuples) and very numerous.

As for the radix-join, we also provide a cost modelfor the partitioned hash-join:

T_h(B, C) = C \cdot w_h + H \cdot w'_h + M_{L1,h}(B, C) \cdot l_{L2} + M_{L2,h}(B, C) \cdot l_{Mem} + M_{TLB,h}(B, C) \cdot l_{TLB}

with

M_{Li,h}(B, C) = 3 \cdot |Re|_{Li} +
\begin{cases}
C \cdot \dfrac{||Cl||}{||Li||}, & \text{if } ||Cl|| \le ||Li|| \\[1ex]
C \cdot 10 \cdot \left(1 - \dfrac{||Li||}{||Cl||}\right), & \text{if } ||Cl|| > ||Li||
\end{cases}

and

M_{TLB,h}(B, C) = 3 \cdot |Re|_{Pg} +
\begin{cases}
C \cdot \dfrac{||Cl||}{||TLB||}, & \text{if } ||Cl|| \le ||TLB|| \\[1ex]
C \cdot 10 \cdot \left(1 - \dfrac{||TLB||}{||Cl||}\right), & \text{if } ||Cl|| > ||TLB||
\end{cases}

||Cl||, ||Li||, and ||TLB|| denote (in bytes) the cluster size, the sizes of both caches (i \in \{1, 2\}), and the memory range covered by |TLB| pages, respectively. w_h represents the pure calculation costs per tuple, i.e. building the hash-table, doing the hash lookup, and creating the result. w'_h represents the additional costs per cluster for creating and destroying the hash-table.

The left term of M_{Li,h} equals the minimal number of Li misses for fetching both operands and storing the result. The right term counts the number of additional Li misses when the cluster size either approaches the Li size or even exceeds it. As soon as the clusters get significantly larger than Li, each memory access yields a cache miss due to cache thrashing: with a bucket-chain length of 4, up to 8 memory accesses per tuple are necessary while building the hash-table and doing the hash lookup, and another two to access the actual tuple.


For simplicity of presentation, we omit the formulae for the additional overhead of allocating the hash-table structure when the cluster sizes get very small; the interested reader is referred to [MBK99]. Again, the number of TLB misses is modeled analogously.

The lines in Figure 11 represent our model for different cardinalities (w_h = 680ns, w'_h = 3600ns). The predictions are very accurate.

3.4.4 Overall Join Performance

After having analyzed the impact of the tuning parameters on the clustering phase and the joining phase separately, we now turn our attention to the combined cluster and join cost for both partitioned hash-join and radix-join. Radix-cluster gets cheaper with fewer radix-bits B, whereas both radix-join and partitioned hash-join get more expensive. Putting together the experimental data we obtained on both cluster and join performance, we determine the optimal number of bits B for each relation cardinality and join algorithm.

It turns out that there are four possible strategies, which correspond to the diagonals in Figures 10 and 11:

phash L2 partitioned hash-join on B = log_2(C \cdot 12 / ||L2||) clustered bits, so the inner relation plus hash-table fits the L2 cache. This strategy was used in the work of Shatdal et al. [SKN94] in their partitioned hash-join experiments.

phash TLB partitioned hash-join on B = log_2(C \cdot 12 / ||TLB||) clustered bits, so the inner relation plus hash-table spans at most |TLB| pages. Our experiments show a significant improvement of the pure join performance between phash L2 and phash TLB.

phash L1 partitioned hash-join on B = log_2(C \cdot 12 / ||L1||) clustered bits, so the inner relation plus hash-table fits the L1 cache. This algorithm uses more clustered bits than the previous ones, hence it really needs the multi-pass radix-cluster algorithm (a straightforward 1-pass cluster would cause cache thrashing on this many clusters).

radix radix-join on B = log_2(C / 8) clustered bits. The radix-join has the most stable performance but higher startup cost, as it needs to radix-cluster on significantly more bits than the other options. It is therefore only a winner on the large cardinalities.
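As a worked illustration of these settings (our own arithmetic, taking C = 8M tuples, 12 bytes per tuple of inner relation plus hash-table as in the formulas above, and rounding B up to whole bits):

B_{phash\,L2} = \lceil \log_2(8 \cdot 10^6 \cdot 12 / 4\,\mathrm{MB}) \rceil = 5, \qquad B_{phash\,TLB} = \lceil \log_2(8 \cdot 10^6 \cdot 12 / 1\,\mathrm{MB}) \rceil = 7,

B_{phash\,L1} = \lceil \log_2(8 \cdot 10^6 \cdot 12 / 32\,\mathrm{KB}) \rceil = 12, \qquad B_{radix} = \lceil \log_2(8 \cdot 10^6 / 8) \rceil = 20.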

Figure 12 compares radix-join (thin lines) and partitioned hash-join (thick lines) throughout the whole bit range, using the corresponding optimal number of passes for the radix-cluster (see Section 3.4.2). The diagonal lines mark the settings of B that belong to the four strategies. The optimal setting for each join algorithm lies even beyond these strategies: partitioned hash-join performs best with a cluster size of approximately 200 tuples ("phash min"), and radix-join with just 4 tuples per cluster ("radix min") is slightly better than radix 8.

Finally, Figure 13 compares our radix-cluster-based strategies to non-partitioned hash-join ("simple hash") and sort/merge-join. This clearly demonstrates that cache-conscious join algorithms perform significantly better than the "random-access" algorithms. Here, "cache-conscious" does not only refer to the L2 cache, but also to the L1 cache and especially the TLB. Further, Figure 12 shows that our radix algorithms improve hash performance, both in the "phash L1" strategy (cardinalities larger than 250,000 require at least two clustering passes) and with the radix-join itself.

4 Evaluation

In this research, we brought to light the severe impact of memory access on the performance of elementary database operations. Hardware trends indicate that this bottleneck will remain for quite some time; hence our expectation that its impact will eventually become even more severe than that of the I/O bottleneck. Database algorithms and data structures should therefore be designed and optimized for memory access from the outset. A sloppy implementation of the key algorithms or 'features' at the innermost level of an operator tree (e.g. pointer swizzling/object table lookup) can become a performance disaster that ever-faster CPUs will not come to rescue.

Conversely, careful design can lead to an order of magnitude performance advancement. In our Monet system, under development since 1992, we have decreased the memory access stride using vertical decomposition; a choice that back in 1992 was mostly based on intuition. The work presented here now provides hard backing that this feature is in fact the basis of good performance. Our simple-scan experiment demonstrates that decreasing the stride is crucial for optimizing the usage of memory bandwidth.

Concerning query processing algorithms, we have formulated radix algorithms and demonstrated through experimentation that these algorithms form both an addition and an improvement to the work in [SKN94]. The modeling work done to show how these algorithms improve cache behavior during join processing represents an important improvement over previous work on main-memory cost models [LN96, WK90]. Rather than characterizing main-memory performance on the coarse level of a procedure call with "magical" cost factors obtained by profiling, our methodology mimics the memory access pattern of the algorithm to be modeled and then quantifies its cost by counting cache miss events and CPU cycles.


Figure 12: Overall Performance of Radix-Join (thin lines) vs. Partitioned Hash-Join (thick lines) (millisecs vs. number of bits, for cardinalities 15,625 to 64,000,000 and 1 to 4 clustering passes; diagonals mark the phash L2, phash TLB, phash L1, and radix 8 strategies, plus "phash min"/"radix min" and "best")

Figure 13: Overall Algorithm Comparison (millisecs vs. cardinality in thousands, 16 to 65,536, for sort-merge, simple hash, phash L2, phash TLB, phash L1, phash 256, phash min, radix 8, and radix min)

We were helped in formulating these models by our usage of the hardware event counters present in modern CPUs.

We think our findings are not only relevant to main-memory database engineers. Vertical fragmentation and memory access cost have a strong impact on the performance of database systems at a macro level, including those that manage disk-resident data. Nyberg et al. [NBC+94] stated that techniques like software-assisted disk-striping have reduced the I/O bottleneck; i.e. queries that analyze large relations (like in OLAP or Data Mining) now read their data faster than it can be processed. We observed this same effect with the Drill Down Benchmark [BRK98], where a commercial database product managing disk-resident data was run with a large buffer pool. While executing almost exclusively memory-bound, this product was measured to be a factor 40 slower on this benchmark than the Monet system. After the inclusion of cache-optimization techniques like those described in this paper, we have since been able to improve our own results on this benchmark by almost an extra order of magnitude. This clearly shows the importance of main-memory access optimization techniques.

In Monet, we use I/O by manipulating virtual memory mappings, hence treating the management of disk-resident data as memory with a large granularity. This is in line with the consideration that disk-resident data is the bottom level of a memory hierarchy that goes up from the virtual memory, to the main memory, through the cache memories, up to the CPU registers (Figure 2). Algorithms that are tuned to run well on one level of the memory also exhibit good performance on the lower levels (e.g., radix-join has purely sequential access and consequently also runs well on virtual memory). As the major performance bottleneck is shifting from I/O to memory access, we therefore think that main-memory optimization of both data structures and algorithms – like described in this paper – will increasingly be decisive in efficiently exploiting the power of custom hardware.

5 Conclusion

It was shown that memory access cost is increasingly a bottleneck for database performance. We subsequently discussed the consequences of this finding for both the data structures and algorithms employed in database systems. We recommend using vertical fragmentation in order to better use scarce memory bandwidth. We introduced new radix algorithms for use in join processing, and formulated detailed analytical cost models that explain why these algorithms make optimal use of the hierarchical memory systems found in modern computer hardware. Finally, we placed our results in a broader context of database architecture, and made recommendations for future systems.


References

[AvdBF+92] P. M. G. Apers, C. A. van den Berg, J. Flokstra, P. W. P. J. Grefen, M. Kersten, and A. N. Wilschut. PRISMA/DB: A Parallel Main Memory Relational DBMS. IEEE Trans. on Knowledge and Data Eng., 4(6):541-554, December 1992.

[BQK96] P. Boncz, W. Quak, and M. Kersten. Monet and its Geographical Extensions: a Novel Approach to High-Performance GIS Processing. In Proc. of the Int'l. Conf. on Extending Database Technology, pages 147-166, Avignon, France, June 1996.

[BRK98] P. Boncz, T. Ruhl, and F. Kwakkel. The Drill Down Benchmark. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages 628-632, New York, NY, USA, June 1998.

[BWK98] P. Boncz, A. N. Wilschut, and M. Kersten. Flattening an Object Algebra to Provide Performance. In Proc. of the IEEE Int'l. Conf. on Data Engineering, pages 568-577, Orlando, FL, USA, February 1998.

[CK85] G. P. Copeland and S. Khoshafian. A Decomposition Storage Model. In Proc. of the ACM SIGMOD Int'l. Conf. on Management of Data, pages 268-279, Austin, TX, USA, May 1985.

[Knu68] D. E. Knuth. The Art of Computer Programming, volume 1. Addison-Wesley, Reading, MA, USA, 1968.

[LC86] T. J. Lehman and M. J. Carey. A Study of Index Structures for Main Memory Database Management Systems. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages 294-303, Kyoto, Japan, August 1986.

[LN96] S. Listgarten and M.-A. Neimat. Modelling Costs for a MM-DBMS. In Proc. of the Int'l. Workshop on Real-Time Databases, Issues and Applications, pages 72-78, Newport Beach, CA, USA, March 1996.

[MBK99] S. Manegold, P. Boncz, and M. Kersten. Optimizing Main-Memory Join On Modern Hardware. Technical Report INS-R9912, CWI, Amsterdam, The Netherlands, October 1999.

[McC95] J. D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Technical Committee on Computer Architecture newsletter, December 1995.

[MKW+98] S. McKee, R. Klenke, K. Wright, W. Wulf, M. Salinas, J. Aylor, and A. Batson. Smarter Memory: Improving Bandwidth for Streamed References. IEEE Computer, 31(7):54-63, July 1998.

[Mow94] T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Computer Science Department, 1994.

[NBC+94] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A RISC Machine Sort. In Proc. of the ACM SIGMOD Int'l. Conf. on Management of Data, pages 233-242, Minneapolis, MN, USA, May 1994.

[Ram96] Rambus Technologies, Inc. Direct Rambus Technology Disclosure, 1996. http://www.rambus.com/docs/drtechov.pdf.

[Ron98] M. Ronstrom. Design and Modeling of a Parallel Data Server for Telecom Applications. PhD thesis, Linkoping University, 1998.

[Sil97] Silicon Graphics, Inc., Mountain View, CA. Performance Tuning and Optimization for Origin2000 and Onyx2, January 1997.

[SKN94] A. Shatdal, C. Kant, and J. Naughton. Cache Conscious Algorithms for Relational Query Processing. In Proc. of the Int'l. Conf. on Very Large Data Bases, pages 510-512, Santiago, Chile, September 1994.

[SLD97] SLDRAM Inc. SyncLink DRAM Whitepaper, 1997. http://www.sldram.com/Documents/SLDRAMwhite970910.pdf.

[Val87] P. Valduriez. Join Indices. ACM Trans. on Database Systems, 12(2):218-246, June 1987.

[WK90] K.-Y. Whang and R. Krishnamurthy. Query Optimization in a Memory-Resident Domain Relational Calculus Database System. ACM Trans. on Database Systems, 15(1):67-95, March 1990.