Carnegie Mellon

Improving Index Performance through Prefetching

Shimin Chen, Phillip B. Gibbons† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Information Sciences Research Center, Bell Laboratories

Databases and the Memory Hierarchy

Traditional Focus: buffer pool management (DRAM as a cache for disk)

Important Focus Today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al., VLDB ’99], etc.

[Figure: memory hierarchy from CPU through L1 cache and L2/L3 cache to main memory; each level is larger, slower, and cheaper.]

Index Structures

Used extensively in databases to accelerate performance: selections, joins, etc.

Common Implementation: B+-Trees

[Figure: a B+-Tree, with non-leaf nodes above the leaf nodes.]

B+-Tree Indices: Common Access Patterns

Search: locate a single tuple

Range Scan: locate a collection of tuples within a range

Cache Performance of B+-Tree Indices

A main-memory B+-Tree containing 10M keys:
• Search: 100K random searches
• Scan: 100 range scans of 1M keys, starting at random keys

Detailed simulations based on Compaq ES40 system

Most of execution time is wasted on data cache misses:
• 65% for searches, 84% for range scans

[Figure: execution time broken down into data cache stalls, other stalls, and busy time.]

B+-Trees: Optimizing Search for Cache vs. Disk

To minimize the number of data transfers (I/O or cache misses):

Optimal Node Width = Natural Data Transfer Size
• for disk: disk page size (~8 Kbytes)
• for cache: cache line size (~64 bytes)

Much narrower nodes and higher trees

Search performance more sensitive to changes in branching factors

[Figure: a tree optimized for disk next to a tree optimized for cache.]

Previous Work: “Cache-Sensitive B+-Trees”, Rao and Ross [SIGMOD 2000]

Key insight:
• nearly all child ptrs can be eliminated by restricting data layout
• double the branching factor of cache-line-sized non-leaf nodes

[Figure: node layouts of B+-Trees vs. CSB+-Trees, with keys K1 through K8.]
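The layout idea can be sketched in C. This is a minimal illustration, assuming 64-byte lines and 4-byte keys and child references (the sizes used in the search example later in the deck). The struct and function names are hypothetical, and child references are modeled as 4-byte indices into a node pool rather than raw pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BPLUS_FANOUT = 8, CSB_FANOUT = 15 };

/* Conventional non-leaf node: one explicit child reference per child. */
struct bplus_node {
    uint32_t nkeys;                    /* keys in use */
    uint32_t keys[BPLUS_FANOUT - 1];   /* 7 separator keys */
    uint32_t child[BPLUS_FANOUT];      /* 8 child references */
};

/* CSB+-style non-leaf node: the children are stored contiguously, so one
 * reference to the first child replaces the whole child array. */
struct csb_node {
    uint32_t nkeys;                    /* keys in use */
    uint32_t first_child;              /* pool index of child 0 (assumed encoding) */
    uint32_t keys[CSB_FANOUT - 1];     /* 14 separator keys */
};

/* Both layouts fill exactly one 64-byte cache line. */
_Static_assert(sizeof(struct bplus_node) == 64, "one cache line");
_Static_assert(sizeof(struct csb_node) == 64, "one cache line");

/* The i-th child is located by offset arithmetic, not by a stored pointer. */
uint32_t csb_child(const struct csb_node *n, uint32_t i)
{
    assert(n != NULL && i <= n->nkeys);
    return n->first_child + i;
}
```

Under these assumptions a line-sized node grows from 8 to 15 children, which is the near-doubling of branching factor. The price is that siblings must stay contiguous in memory.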

Impact of CSB+-Trees on Search Performance

Search is 15% faster due to reduction in height of tree

However:
• update performance is worse [Rao & Ross, SIGMOD ’00]
• range scan performance does not improve

There is still significant room for improvement

[Figure: search performance of B+-Tree vs. CSB+-Tree.]

Latency Tolerance in Modern Memory Hierarchies

[Figure: CPU, L1, L2/L3 cache, and main memory, with several prefetches in flight at once: pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9).]

Modern processors overlap multiple simultaneous cache misses
• e.g., Compaq ES40 supports 8 off-chip misses per processor

Prefetch instructions allow software to fully exploit the parallelism

What dictates performance:
• NOT simply the number of cache misses
• but rather the amount of exposed miss latency

Our Approach

New Proposal: “Prefetching B+-Trees” (pB+-Trees)
• use prefetching to reduce the amount of exposed miss latency

Key Challenge:
• data dependences caused by chasing pointers

Benefits:
• significant performance gains for searches, range scans, and updates (!)
• complementary to CSB+-Trees

Overview

Prefetching Searches

Prefetching Range Scans

Experimental Results

Conclusions

Example: Search where Node Width = 1 Line

1000 keys, 64B lines, 4B keys, ptrs & tupleIDs

4 levels in B+-Tree (cold cache)

[Figure: timeline in cycles showing one cache miss per level.]

We suffer one full cache miss at each level of the tree.

Same Example where Node Width = 2 Lines

[Figure: timelines in cycles of the cache misses for the one-line and two-line cases.]

3 levels in tree

Additional misses per node dominate reduction in # of levels.
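The level counts on these two slides can be reproduced with a small calculation. The sketch below assumes a fully packed tree in which a node of w lines holds 8w child pointers if it is a non-leaf node and 8w - 1 <key, tupleID> pairs if it is a leaf (64B lines and 4B fields, with one field used for a key count and, in leaves, one for a sibling pointer). That layout and the function names are assumptions for illustration, not details taken from the slides.

```c
#include <assert.h>

/* Number of levels in a fully packed tree with n_keys keys and nodes that
 * are w cache lines wide, under the layout assumed above. */
unsigned tree_levels(unsigned long n_keys, unsigned w)
{
    assert(n_keys > 0 && w > 0);
    unsigned long fanout = 8UL * w;         /* child pointers per non-leaf node */
    unsigned long leaf_cap = fanout - 1;    /* <key, tupleID> pairs per leaf */
    unsigned long nodes = (n_keys + leaf_cap - 1) / leaf_cap;  /* leaf count */
    unsigned levels = 1;

    /* Add one level of parents at a time until a single root remains. */
    while (nodes > 1) {
        nodes = (nodes + fanout - 1) / fanout;
        levels++;
    }
    return levels;
}

/* Cache lines touched by one cold-cache search: w lines at every level. */
unsigned cold_search_misses(unsigned long n_keys, unsigned w)
{
    return w * tree_levels(n_keys, w);
}
```

With these assumptions, `tree_levels(1000, 1)` is 4 and `tree_levels(1000, 2)` is 3, so a cold search touches 4 lines in the narrow tree but 6 in the wider one.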

How Things Change with Prefetching

[Figure: timelines in cycles for the same examples; with prefetching, the cache misses for the lines of a node overlap.]

# of misses ≠ exposed miss latency

fetch all lines within a node in parallel
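A rough cost model makes the distinction concrete. The sketch assumes a full miss costs T1 cycles and each additional pipelined miss costs Tnext cycles. The values 150 and 10 used in the note below come from the simulation parameters later in the deck (memory miss latency and one memory access per 10 cycles). The function names are hypothetical.

```c
#include <assert.h>

/* Exposed miss latency (cycles) of a cold search when the w lines of each
 * node are fetched one after another, i.e. without prefetching. */
unsigned long exposed_serial(unsigned levels, unsigned w, unsigned long t1)
{
    assert(levels > 0 && w > 0);
    return (unsigned long)levels * w * t1;
}

/* Same search when all w lines of a node are prefetched together: the first
 * miss is fully exposed and the remaining w - 1 are pipelined behind it. */
unsigned long exposed_prefetched(unsigned levels, unsigned w,
                                 unsigned long t1, unsigned long tnext)
{
    assert(levels > 0 && w > 0);
    return (unsigned long)levels * (t1 + (w - 1) * tnext);
}
```

For the 1000-key example with T1 = 150 and Tnext = 10, one-line nodes (4 levels) expose 600 cycles, two-line nodes (3 levels) expose 900 cycles without prefetching, and two-line nodes with prefetching expose 3 × 160 = 480 cycles.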

pB+-Trees: Using Prefetching to Improve Search

Basic Idea: make nodes wider than the natural data transfer size
• e.g., 8 cache lines wide
• prefetch all lines of a node before searching in the node

Improved Search Performance:
• Larger branching factors, shallower trees
• Cost to access every node only increased slightly

Reduced Space Overhead: primarily due to fewer non-leaf nodes

Update Performance: ???
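The basic idea can be sketched as a search loop that issues a prefetch for every line of a node before inspecting its keys. This is a minimal sketch, not the authors' implementation: the node layout, `MAX_KEYS`, and the function names are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64
#define MAX_KEYS   31   /* hypothetical capacity; makes a node span several lines */

struct pnode {
    uint32_t nkeys;                      /* keys in use */
    int is_leaf;                         /* nonzero for leaf nodes */
    uint32_t keys[MAX_KEYS];             /* sorted separator keys */
    struct pnode *child[MAX_KEYS + 1];   /* unused in leaves */
};

/* Issue one prefetch per cache line of the node. Assumes the node starts on
 * a line boundary; otherwise its last bytes may fall in an unfetched line. */
static void prefetch_node(const void *node, size_t bytes)
{
    const char *p = node;
    for (size_t off = 0; off < bytes; off += LINE_BYTES)
        __builtin_prefetch(p + off, 0, 3);   /* read access, keep in cache */
}

/* Descend from the root to the leaf that would contain `key`. */
const struct pnode *find_leaf(const struct pnode *n, uint32_t key)
{
    assert(n != NULL);
    for (;;) {
        prefetch_node(n, sizeof *n);     /* all lines requested in parallel */
        if (n->is_leaf)
            return n;
        uint32_t i = 0;
        while (i < n->nkeys && key >= n->keys[i])
            i++;                         /* linear search for brevity */
        n = n->child[i];
    }
}
```

The prefetches only overlap the misses within one node; the next level still cannot be requested until the child pointer has been read.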

Range Scan Cache Behavior: Normal B+-Trees

Steps in Range Scan:
• search for the starting leaf node
• traverse the leaves until end is found

[Figure: timeline in cycles showing one cache miss per leaf node.]

We suffer a full cache miss for each leaf node!

If Prefetching Wider Nodes

e.g., node width = 2 lines

[Figure: timelines in cycles of the cache misses, without and with prefetching of two-line nodes.]

• Exposed miss latency is reduced by up to a factor of node width.

A definite improvement, but can we still do better?

The Ideal Case

Overlap misses until
• all latency is hidden, or
• run out of bandwidth

[Figure: three timelines in cycles of cache misses, with progressively more overlap.]

How can we achieve this?

The Pointer Chasing Problem

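[Figure: chain of leaf nodes marking the node currently being visited and the node we want to prefetch, with timelines for pointer chasing vs. the ideal case (directly prefetch).]

If prefetching through pointer chasing, still experience the full latency at each node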

Our Solution: Jump Pointer Arrays

Put leaf addresses in an array

Directly prefetch by using the jump pointers

Back pointers needed to initialize prefetching
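A range scan driven by a jump-pointer array might look like the sketch below. It assumes a flat array of leaf addresses in key order and a back pointer stored in each leaf as an array index. The struct layout, the prefetch distance parameter `k`, and all names are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64
#define LEAF_PAIRS 63   /* hypothetical: makes a leaf 8 lines (512 bytes) wide */

struct leaf {
    uint32_t npairs;             /* <key, tupleID> pairs in use */
    uint32_t jump_pos;           /* back pointer: index in the jump-pointer array */
    uint32_t keys[LEAF_PAIRS];
    uint32_t tids[LEAF_PAIRS];
};

/* Request every cache line of a leaf (assumes line-aligned leaves). */
static void prefetch_leaf(const struct leaf *l)
{
    const char *p = (const char *)l;
    for (size_t off = 0; off < sizeof *l; off += LINE_BYTES)
        __builtin_prefetch(p + off, 0, 3);
}

/* Sum the tupleIDs of up to n_scan leaves starting at `first`, keeping
 * prefetches k leaves ahead of the leaf being processed. */
uint64_t scan_leaves(struct leaf *const *jump, size_t n_leaves,
                     const struct leaf *first, size_t n_scan, size_t k)
{
    assert(jump != NULL && first != NULL && first->jump_pos < n_leaves);
    size_t start = first->jump_pos;      /* back pointer locates the start */
    size_t end = (n_scan < n_leaves - start) ? start + n_scan : n_leaves;
    uint64_t sum = 0;

    /* Startup: request the first k leaves directly from the array. */
    for (size_t i = start; i < end && i < start + k; i++)
        prefetch_leaf(jump[i]);

    for (size_t i = start; i < end; i++) {
        if (k < end - i)
            prefetch_leaf(jump[i + k]);  /* no pointer chasing needed */
        const struct leaf *l = jump[i];
        for (uint32_t j = 0; j < l->npairs; j++)
            sum += l->tids[j];
    }
    return sum;
}
```

Because the address of leaf i + k is read from the array rather than from leaf i + k - 1, the prefetches no longer depend on one another.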

External Jump Pointer Arrays: Efficient Updates

Impact of an insertion is limited to its chunk

Deletions leave empty slots

Actively interleave empty slots during bulkload and chunk splits

Back pointer to position in jump-pointer array is now a hint
• points to correct chunk
• but may require local search within chunk to init prefetching

[Figure: leaf nodes holding hints into a chunked linked list of jump pointers.]
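The chunk-local insertion can be sketched as follows. This is a simplified illustration with hypothetical names and a hypothetical chunk size. It looks for the first empty slot to the right of the insertion point and then to the left, rather than the strictly nearest one, and it leaves the hints of moved entries untouched since a hint only has to identify the right chunk.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SLOTS 8   /* hypothetical chunk capacity */

struct chunk {
    void *slot[CHUNK_SLOTS];   /* leaf addresses; NULL marks an empty slot */
    struct chunk *next;        /* chunks form a linked list */
};

/* Insert `leaf` immediately after the occupied slot `pos`, moving only the
 * entries between `pos` and a hole. Returns the index where the new entry
 * landed, or -1 if the chunk has no hole and must be split by the caller. */
int chunk_insert_after(struct chunk *c, int pos, void *leaf)
{
    assert(c != NULL && leaf != NULL && pos >= 0 && pos < CHUNK_SLOTS);

    /* Hole to the right: shift slot[pos+1 .. h-1] one step right. */
    for (int h = pos + 1; h < CHUNK_SLOTS; h++) {
        if (c->slot[h] == NULL) {
            for (int i = h; i > pos + 1; i--)
                c->slot[i] = c->slot[i - 1];
            c->slot[pos + 1] = leaf;
            return pos + 1;
        }
    }
    /* Hole to the left: shift slot[h+1 .. pos] one step left. */
    for (int h = pos - 1; h >= 0; h--) {
        if (c->slot[h] == NULL) {
            for (int i = h; i < pos; i++)
                c->slot[i] = c->slot[i + 1];
            c->slot[pos] = leaf;
            return pos;
        }
    }
    return -1;   /* chunk is full */
}
```

Interleaving empty slots at bulkload keeps the shifted range short, which is why the cost of an insertion stays confined to one chunk.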

Alternative Design: Internal Jump-Pointer Arrays

B+-Trees already contain structures that point to the leaf nodes
• the parents of the leaf nodes (“bottom non-leaf nodes”)

By linking them together, we can use them as a jump-pointer array

Tradeoff:
• no need for back-pointers, and simpler to maintain
• consumes less space, though external array overhead is <1%
• but less flexible, chunk size is fixed by B+-Tree structure

Overview

Experimental Results
• search performance
• range scan performance
• update performance

Conclusions

Experimental Framework

Results are for a main-memory database environment (we are extending this work to disk-based environments)

Executables:
• we added prefetch instructions to C source code by hand
• used gcc to generate optimized MIPS executables with prefetch instructions

Performance Measurement: detailed, cycle-by-cycle simulations

Machine Model: based on Compaq ES40 system, with slightly updated parameters

Simulation Parameters

Pipeline Parameters
Clock Rate                          1 GHz
Issue Width                         4 insts/cycle
Functional Units                    2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size                 64 insts
Integer Multiply/Divide             12/76 cycles
All Other Integer                   1 cycle
FP Divide/Square Root               15/20 cycles
All Other FP                        2 cycles
Branch Prediction Scheme            gshare

Memory Parameters
Line Size                           64 bytes
Primary Data Cache                  64 KB, 2-way set-assoc.
Primary Instruction Cache           64 KB, 2-way set-assoc.
Miss Handlers                       32 for data, 2 for inst
Unified Secondary Cache             2 MB, direct-mapped
Primary-to-Secondary Miss Latency   15 cycles (plus contention)
Primary-to-Memory Miss Latency      150 cycles (plus contention)
Main Memory Bandwidth               1 access per 10 cycles

Models all the gory details, including memory system contention

Index Search Performance

100K random searches after bulkload; 100% full (except root); warm caches.

[Figure: search performance vs. # of tupleIDs in the trees, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

• pB+-Trees achieve 27-47% speedup vs. B+-Trees, 14-34% vs. CSB+-Trees
• optimal node width is 8 cache lines
• pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are …

Same Search Experiments with Cold Caches

100K random searches after bulkload; 100% full (except root); cold caches (i.e., cleared after each search).

[Figure: search performance vs. # of tupleIDs in trees, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Large discrete steps within each curve

What is happening here?

Analysis of Cold Cache Search Behavior

Height of the tree dominates performance
• effect is blurred in warm cache case

If the same height, the smaller the node size the better

[Figure: cold-cache search performance vs. # of tupleIDs in trees, for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

# of Levels in the Trees

             Number of Keys
Tree Type                      300K    1M    3M   10M
B+             5     6     6     7     7     8     8
CSB+           4     5     5     5     6     6     7
p2B+           4     4     5     5     6     6     6
p4B+           3     3     4     4     4     5     5
p8B+           3     3     3     4     4     4     4
p16B+          2     3     3     3     3     4     4
p8CSB+         3     3     3     3     3     4     4

Index Range Scan Performance

100 scans starting at random locations on index bulkloaded with 3M keys (100% full)

[Figure (log scale): scan performance vs. # of tupleIDs scanned through in a single call, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Scans of 1K-1M keys: 6.5-8.7 speedup over B+-Trees
• factor of 3.5-3.7 from prefetching wider nodes
• additional factor of ~2 from jump-pointer arrays

Index Range Scan Performance

100 scans starting at random locations on index bulkloaded with 3M keys (100% full)

[Figure (log scale): scan performance vs. # of tupleIDs scanned through in a single call, for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

Small scans (<1K keys):
• overshooting cost is noticeable
• exploit only if scan is expected to be large (e.g., search for end)

Update Performance

100K random insertions/deletions on 3M-key bulkloaded index; warm caches

[Figure: insertion and deletion panels vs. percentage of entries used in leaf nodes (50-100), for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

pB+-Trees achieve at least a 1.24 speedup in all cases

Update Performance

Reason #1: faster search times

Reason #2: less frequent node splits with wider nodes

[Figure: insertion and deletion panels; 100K random insertions/deletions on 3M-key bulkloaded index; warm caches.]

pB+-Trees: Other Results

Similar results for:
• varying bulkload factors of trees
• large segmented range scans
• mature trees
• varying jump-pointer array parameters: prefetch distance, chunk size

Optimal node width:
• increases as memory bandwidth increases (matches the width predicted by our model in the paper)

Cache Performance Revisited

Search: eliminated 45% of original data cache stalls, for a 1.47 speedup

Scan: eliminated 97% of original data cache stalls, for an 8-fold speedup

Conclusions

Impact of Prefetching B+-Trees on performance:

Search: 1.27-1.55 speedup over B+-Trees
• wider nodes reduce height of tree, # of expensive misses
• outperform and are complementary to CSB+-Trees

Updates: 1.24-1.52 speedup over B+-Trees
• faster search and less frequent node splits
• in contrast with significant slowdowns for CSB+-Trees

Range Scan: 6.5-8.7 speedup over B+-Trees
• wider nodes: factor of ~3.5 speedup
• jump-pointer arrays: additional factor of ~2 speedup

Prefetching B+-Trees also reduce space overhead.

These benefits are likely to increase with future memory systems.

Applicable to other levels of the memory hierarchy (e.g., disks).

Backup Slides

Revisiting the Optimal Node Width for Searches

Total cache misses for a search is minimized when: w = 1

Total cache misses = (misses per level) × (# of levels in tree):

$$\text{Total cache misses} = w \times \left( \log_{wm} \frac{N}{wm} + 1 \right)$$

w = # of cache lines per node
m = # of child pointers per one-cache-line wide node
N = # of tupleIDs in index

Scheduling Prefetches Early Enough

[Figure: linked nodes n_i, n_{i+1}, n_{i+2}, n_{i+3}, marking the node currently being visited and the node we want to prefetch; the timeline distinguishes loading a node from work().]

    p = &n0;
    while (p) {
        work(p->data);
        p = p->next;
    }

Our Goal: fully hide latency
• thus achieving fastest possible computation rate of 1/W

e.g., if L=3W, we must prefetch 3 nodes ahead to achieve this.
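The arithmetic behind the prefetch distance, and behind the rates quoted on this and the following backup slides, can be written out in a few lines. The sketch takes L as the cycles needed to load a node and W as the cycles of work per node, which is how the slides appear to use the symbols. The function names are hypothetical, and the last function assumes memory bandwidth is not the limit.

```c
#include <assert.h>

/* Nodes to prefetch ahead so that a load of L cycles completes during the
 * work on the nodes before it: ceil(L / W). */
unsigned prefetch_distance(unsigned L, unsigned W)
{
    assert(W > 0);
    return (L + W - 1) / W;
}

/* Steady-state cycles per node with no prefetching: load, then work. */
unsigned cycles_no_prefetch(unsigned L, unsigned W)
{
    return L + W;
}

/* Prefetching along the pointer chain: each load still waits for the
 * previous one, so the per-node time cannot drop below L. */
unsigned cycles_pointer_chasing(unsigned L, unsigned W)
{
    return L > W ? L : W;
}

/* Prefetching from known addresses, far enough ahead: only work remains. */
unsigned cycles_direct_prefetch(unsigned L, unsigned W)
{
    (void)L;
    return W;
}
```

For L = 300 and W = 100, `prefetch_distance` returns 3, and the three schemes take 400, 300, and 100 cycles per node, i.e. rates of 1/(L+W), 1/L, and 1/W.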

Performance without Prefetching

    while (p) {
        work(p->data);
        p = p->next;
    }

[Figure: timeline in which loading n_k (L_{i+1}, L_{i+2}, L_{i+3}) and work(n_k) (W_{i+1}, W_{i+2}, W_{i+3}) strictly alternate.]

Computation rate = 1/(L+W)

Prefetching One Node Ahead

    while (p) {
        pf(p->next);
        work(p->data);
        p = p->next;
    }

[Figure: timeline while visiting n_i and prefetching n_{i+1} with pf(p->next); each load (L_{i+1}, L_{i+2}, L_{i+3}) overlaps the work on the previous node, but a data dependence links successive loads.]

• Computation is overlapped with memory accesses.

computation rate = 1/L

Prefetching Three Nodes Ahead

    while (p) {
        pf(p->next->next->next);
        work(p->data);
        p = p->next;
    }

[Figure: timeline while visiting n_i and prefetching n_{i+3} with pf(p->next->next->next); the data dependence between successive loads remains.]

Computation rate does not improve (still = 1/L)!

Pointer-Chasing Problem: [Luk & Mowry, ASPLOS ’96]
• any scheme which follows the pointer chain is limited to a rate of 1/L

Our Goal: Fully Hide Latency

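    while (p) {
        pf(&n_{i+3});
        work(p->data);
        p = p->next;
    }

[Figure: timeline while visiting n_i and prefetching ahead with pf(&n_{i+3}); the loads (L_{i+1}, L_{i+2}, L_{i+3}) overlap one another and the work (W_{i+1}, W_{i+2}, W_{i+3}), with no data dependence between loads.]

Achieves the fastest possible computation rate of 1/W.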

Challenges in Supporting Efficient Updates

Conceptual view of jump-pointer array:

[Figure: a contiguous jump-pointer array pointing to the leaf nodes, with back pointers from the leaves into the array.]

What if we really implemented it this way?
• Insertion: could incur significant overheads
  • copying data within the array to create a new hole
  • updating back-pointers
• Deletion: ok; just leave a hole

Summary: Why We Expect Updates to Perform Well

Insertions:
• only a small number of jump pointers move: between insertion point and nearest hole in the chunk
• normally only update the hint pointer for the inserted node, which does not require any significant overhead
• significant overheads only occur on chunk splits, which are rare

Deletions:
• no data is moved (just leave an empty hole)
• no need to update any hints

In general, the jump-pointer array requires little concurrency control.

B+-Trees Modeled and their Notations

B+-Trees: regular B+-Trees

CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]

pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays

• we consider w = 2, 4, 8, and 16

p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays

p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays

p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)

(Gory implementation details are in the paper.)

Searches with Varying Bulkload Factors

Similar trends with smaller bulkload factors as when 100% full

Performance of pB+-Trees is somewhat less sensitive to bulkload factor

[Figure: two panels (cold caches, warm caches) comparing B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+.]

Range Scans with Varying Bulkload Factors

Prefetching B+-Trees offer:
• larger speedups with smaller bulkload factors (more nodes to fetch)
• less sensitivity of performance to bulkload factor

[Figure: x-axis is percentage of entries used in leaf nodes (50-100).]

Large Segmented Range Scans

1M keys, scanned in 1000-key segments

Similar performance gains as unsegmented scans

[Figure: x-axis is percentage of entries used in leaf nodes (50-100).]

Insertions with Cold Caches

Deletions with Cold Caches

Analysis of Node Splits upon Insertions

[Figure: two panels (Bulkload Factor = 60-90%, Bulkload Factor = 100%) showing insertions with no splits, one split, and at least 2 splits, vs. percentage of entries used in leaf nodes (55-90), for B+tree, p8B+tree, p8eB+tree, and p8iB+tree.]

• Far fewer node splits
• Fewer node splits
• Fewer non-leaf node splits

Mature Trees: Searches (Warm Caches)

[Figure: x-axis is number of searches (x 1000), 40-200.]

Mature Trees: Insertions (Warm Caches)

[Figure: x-axis is number of insertions (x 1000), 40-200.]

• CSB+-Tree could be 25% worse than B+-Tree under the same mature tree experiments (on a different h/w configuration)

• pB+-Trees are significantly faster than B+-Tree

Mature Trees: Deletions (Warm Caches)

[Figure: x-axis is number of deletions (x 1000), 40-200.]

Mature Trees: Searches (Cold Caches)

[Figure: x-axis is number of searches (x 1000), 40-200.]

Mature Trees: Insertions (Cold Caches)

[Figure: x-axis is number of insertions (x 1000), 40-200.]

Mature Trees: Deletions (Cold Caches)

[Figure: x-axis is number of deletions (x 1000), 40-200.]

Mature Trees: Large Segmented Range Scans

[Figure: bar chart comparing B+tree, p8B+, p8eB+, and p8iB+; recoverable bar labels: 3537, 479, 452.]

Search varying memory bandwidth (warm cache)

[Figure: search performance vs. normalized bandwidth (B), 5-30, for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree.]

Even when pessimistic (B=5), p8B+-Trees still achieve significant speedups: 1.2 for warm cache

Search varying memory bandwidth (cold cache)

[Figure: search performance vs. normalized bandwidth (B), 5-30, for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree.]

• Even when B=5, 1.3 speedup for cold cache

• The optimal value for w increases when B gets larger

Scan varying prefetching distance (P8eB+-Tree)

[Figure: scan performance vs. entries scanned through in a single call, for prefetching distances k = 2, 3, 4, 8, 16, and 32.]

• Not sensitive to moderate increases in the prefetching distance

• Though overshooting cost shows up when # of entries to scan is small

Scan varying chunk size (P8eB+-Tree)

[Figure: scan performance vs. entries scanned through in a single call, for chunk sizes c = 2, 4, 8, 16, and 32.]

Not sensitive to varying chunk size

Table 1 Terminology

Variable       Definition
w              # of cache lines in an index node
m              # of child pointers in a one-line-wide node
N              # of <key, tupleID> pairs in an index
d              # of child pointers in a non-leaf node (= w × m)
T1             Full latency of a cache miss
Tnext          Latency of an additional pipelined cache miss
B              Normalized memory bandwidth (B = T1/Tnext)
K              # of nodes to prefetch ahead
C              # of cache lines in a jump-pointer array chunk
pwB+-Tree      Plain pB+-Tree with w-line-wide nodes
pweB+-Tree     pwB+-Tree with external jump-pointer arrays
pwiB+-Tree     pwB+-Tree with internal jump-pointer arrays

Search w/ & w/o Jump-Pointer Arrays: Cold Cache

[Figure: search time vs. entries in leaf nodes, for p8B+tree, p8eB+tree, and p8iB+tree; annotation: different # of levels in tree.]

Can We Do Even Better on Searches?

Hiding latency across levels is difficult given:
• data dependence through the child pointer
• the relatively large branching factor of tree nodes
• equal likelihood of following any child (assuming uniformly distributed random search keys)

What if we prefetch a node’s children in parallel with accessing it?
• duality between this and creating wider nodes
• BUT, this approach has the following relative disadvantages:
  • storage overhead for the child (or grandchild) pointers
  • size of node can only grow by multiples of the branching factor
