Post on 21-Dec-2015
Carnegie Mellon
Improving Index Performance through Prefetching
Shimin Chen, Phillip B. Gibbons† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
† Information Sciences Research Center, Bell Laboratories
Databases and the Memory Hierarchy
Traditional Focus: buffer pool management (DRAM as a cache for disk)
Important Focus Today: processor cache performance (SRAM as a cache for DRAM), e.g., [Ailamaki et al., VLDB ’99]
[Figure: memory hierarchy from CPU through L1 cache, L2/L3 cache, and main memory to disk; each level is larger, slower, and cheaper]
Index Structures
Used extensively in databases to accelerate performance
• selections, joins, etc.
Common Implementation: B+-Trees
[Figure: B+-Tree with non-leaf nodes above the leaf nodes]
B+-Tree Indices: Common Access Patterns
Search: locate a single tuple
Range Scan: locate a collection of tuples within a range
Cache Performance of B+-Tree Indices
A main-memory B+-Tree containing 10M keys:
• Search: 100K random searches
• Scan: 100 range scans of 1M keys, starting at random keys
Detailed simulations based on the Compaq ES40 system
Most of the execution time is wasted on data cache misses
• 65% for searches, 84% for range scans
[Figure: execution time breakdown into data cache stalls, other stalls, and busy time]
B+-Trees: Optimizing Search for Cache vs. Disk
To minimize the number of data transfers (I/O or cache misses):
Optimal Node Width = Natural Data Transfer Size
• for disk: disk page size (~8 Kbytes)
• for cache: cache line size (~64 bytes)
Much narrower nodes and higher trees
Search performance more sensitive to changes in branching factors
[Figure: a tree optimized for disk next to a tree optimized for cache]
Previous Work: “Cache-Sensitive B+-Trees”, Rao and Ross [SIGMOD 2000]
Key insight:
• nearly all child ptrs can be eliminated by restricting data layout
• double the branching factor of cache-line-sized non-leaf nodes
[Figure: node layout of B+-Trees vs. CSB+-Trees; the CSB+-Tree nodes hold keys K1 to K4 in the space where a B+-Tree node holds fewer keys plus child pointers]
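The layout idea can be sketched in C. This is an illustration only, not the layout from Rao and Ross: it assumes 64-byte lines, 4-byte keys, and 4-byte node references (indices into a node pool rather than machine pointers), and every type and field name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { LINE_BYTES = 64 };

/* Conventional B+-Tree non-leaf node: one explicit reference per child. */
typedef struct {
    uint32_t nkeys;        /* number of keys in use */
    uint32_t keys[7];
    uint32_t child[8];     /* assumed 4-byte child references */
} bplus_node;

/* CSB+-style non-leaf node: all children sit contiguously in a node pool,
 * so a single reference to the first child is enough. */
typedef struct {
    uint32_t nkeys;
    uint32_t first_child;  /* pool index of child 0 (hypothetical field) */
    uint32_t keys[14];     /* twice as many keys fit in the same line */
} csb_node;

_Static_assert(sizeof(bplus_node) == LINE_BYTES, "B+ node fills one line");
_Static_assert(sizeof(csb_node) == LINE_BYTES, "CSB+ node fills one line");

/* Child i is looked up in the node itself. */
uint32_t bplus_child(const bplus_node *n, uint32_t i) {
    assert(i <= n->nkeys);
    return n->child[i];
}

/* Child i is computed from the first child's position. */
uint32_t csb_child(const csb_node *n, uint32_t i) {
    assert(i <= n->nkeys);
    return n->first_child + i;
}
```

Under these assumptions the one-line node goes from 7 keys and 8 children to 14 keys and 15 children, at the price of keeping all children of a node adjacent in memory.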
Impact of CSB+-Trees on Search Performance
Search is 15% faster due to reduction in height of tree
However:
• update performance is worse [Rao & Ross, SIGMOD ’00]
• range scan performance does not improve
There is still significant room for improvement
[Figure: execution time breakdown (data cache stalls, other stalls, busy time) for B+-Tree vs. CSB+-Tree]
Latency Tolerance in Modern Memory Hierarchies
[Figure: CPU, L1 cache, L2/L3 cache, and main memory, with several prefetches in flight at once: pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)]
Modern processors overlap multiple simultaneous cache misses
• e.g., Compaq ES40 supports 8 off-chip misses per processor
Prefetch instructions allow software to fully exploit the parallelism
What dictates performance:
• NOT simply the number of cache misses
• but rather the amount of exposed miss latency
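A minimal sketch of issuing prefetches for independent addresses so that their misses can overlap. It assumes a GCC- or Clang-compatible compiler, since `__builtin_prefetch` is a compiler builtin rather than standard C; the function name and the `dist` parameter are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the values reached through an array of independent pointers.
 * While item i is being consumed, the item `dist` positions ahead is
 * prefetched, so several misses can be outstanding at the same time. */
long sum_with_prefetch(const long *const *items, size_t n, size_t dist) {
    assert(items != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(items[i + dist], 0 /* read */, 3 /* keep in cache */);
        total += *items[i];
    }
    return total;
}
```

The addresses are all known up front here, which is what makes the overlap easy; the pointer-chasing case discussed later in the talk is harder precisely because the next address is not known until the current node arrives.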
Our Approach
New Proposal: “Prefetching B+-Trees” (pB+-Trees)
• use prefetching to reduce the amount of exposed miss latency
Key Challenge: data dependences caused by chasing pointers
Benefits:
• significant performance gains for searches, range scans, and updates (!)
• complementary to CSB+-Trees
Overview
Prefetching Searches
Prefetching Range Scans
Experimental Results
Conclusions
Example: Search where Node Width = 1 Line
1000 keys, 64B lines, 4B keys, ptrs & tupleIDs
4 levels in B+-Tree (cold cache)
[Figure: timeline from 0 to 600 cycles showing four back-to-back cache misses of 150 cycles each]
We suffer one full cache miss at each level of the tree.
Same Example where Node Width = 2 Lines
[Figure: two timelines in cycles. Node width = 1 line: 4 levels, 600 cycles. Node width = 2 lines: 3 levels in tree, two misses per node, 900 cycles]
Additional misses per node dominate reduction in # of levels.
How Things Change with Prefetching
[Figure: three timelines in cycles. Node width = 1 line, no prefetching: 600 cycles. Node width = 2 lines, no prefetching: 900 cycles. Node width = 2 lines, with prefetching: 480 cycles (160 per level)]
fetch all lines within a node in parallel
# of misses ≠ exposed miss latency
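The three timelines can be reproduced with a few lines of arithmetic. The sketch below assumes the figures used in the talk: a full miss of 150 cycles and one additional pipelined miss every 10 cycles. The function name and parameters are illustrative.

```c
#include <assert.h>

enum {
    T1 = 150,    /* full latency of one cache miss, in cycles */
    TNEXT = 10   /* extra latency of each additional pipelined miss */
};

/* Exposed miss latency of one cold-cache search, in cycles.
 * Without prefetching every line of a node is a full miss; with
 * prefetching the lines of a node are fetched in parallel. */
long search_cycles(int levels, int lines_per_node, int prefetch) {
    assert(levels >= 0 && lines_per_node >= 1);
    long per_node = prefetch
        ? T1 + (long)(lines_per_node - 1) * TNEXT
        : (long)lines_per_node * T1;
    return levels * per_node;
}
```

With these inputs, `search_cycles(4, 1, 0)` gives 600, `search_cycles(3, 2, 0)` gives 900, and `search_cycles(3, 2, 1)` gives 480: the two-line tree has more misses than the one-line tree but less exposed latency once its lines are prefetched together.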
pB+-Trees: Using Prefetching to Improve Search
Basic Idea:
• make nodes wider than the natural data transfer size (e.g., 8 cache lines wide)
• prefetch all lines of a node before searching in the node
Improved Search Performance:
• larger branching factors, shallower trees
• cost to access every node only increased slightly
Reduced Space Overhead:
• primarily due to fewer non-leaf nodes
Update Performance: ???
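A minimal sketch of the basic idea: issue a prefetch for every line of a wide node, then search inside it. The node here is a simplified keys-only structure invented for illustration (it is not the node format from the paper), and `__builtin_prefetch` assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_BYTES = 64,
    NODE_LINES = 8,   /* node width w = 8 cache lines */
    NODE_KEYS = NODE_LINES * LINE_BYTES / sizeof(uint32_t) - 1
};

/* Hypothetical 8-line node: a count followed by sorted keys. */
typedef struct {
    uint32_t nkeys;
    uint32_t keys[NODE_KEYS];   /* sorted ascending */
} wide_node;

/* Issue one prefetch per cache line so the misses overlap. */
static void prefetch_node(const wide_node *n) {
    const char *p = (const char *)n;
    for (int i = 0; i < NODE_LINES; i++)
        __builtin_prefetch(p + i * LINE_BYTES, 0, 3);
}

/* Index of the first key >= target, or nkeys if there is none. */
uint32_t node_lower_bound(const wide_node *n, uint32_t target) {
    assert(n->nkeys <= NODE_KEYS);
    prefetch_node(n);
    uint32_t lo = 0, hi = n->nkeys;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (n->keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

The sketch does not force the node onto a cache-line boundary; a real implementation would presumably align nodes so that eight prefetches cover exactly one node.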
Range Scan Cache Behavior: Normal B+-Trees
Steps in Range Scan:
• search for the starting leaf node
• traverse the leaves until end is found
[Figure: timeline from 0 to 900 cycles showing back-to-back cache misses of 150 cycles each, one per leaf node]
We suffer a full cache miss for each leaf node!
If Prefetching Wider Nodes
e.g., node width = 2 lines
[Figure: two timelines in cycles. Without prefetching: 900 cycles. Prefetching 2-line-wide nodes: 480 cycles]
• Exposed miss latency is reduced by up to a factor of node width.
A definite improvement, but can we still do better?
The Ideal Case
Overlap misses until
• all latency is hidden, or
• run out of bandwidth
[Figure: three timelines in cycles. Without prefetching: 900 cycles. Prefetching wider nodes: 480 cycles. Ideal case with all misses overlapped: 200 cycles]
How can we achieve this?
The Pointer Chasing Problem
[Figure: chain of leaf nodes marking the node currently being visited and the node we want to prefetch]
If prefetching through pointer chasing, we still experience the full latency at each node
Ideal case: directly prefetch the node we want
Our Solution: Jump Pointer Arrays
Put leaf addresses in an array
Directly prefetch by using the jump pointers
Back pointers needed to initialize prefetching
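A minimal sketch of a range scan that prefetches through a jump-pointer array. The leaf layout, the field names, and the `scan_sum` interface are all hypothetical, the array is a plain contiguous array (the conceptual view, not the chunked design), and `__builtin_prefetch` assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

enum { LEAF_KEYS = 4 };

typedef struct leaf {
    int nkeys;
    int keys[LEAF_KEYS];
    struct leaf *next;   /* sibling link used by the scan itself */
    size_t jp_index;     /* back pointer: this leaf's slot in the jump array */
} leaf;

/* Sum up to `limit` keys starting at leaf `start`, prefetching k leaves
 * ahead by address instead of by following next pointers. */
long scan_sum(const leaf *start, leaf *const *jump, size_t njump,
              size_t k, size_t limit) {
    assert(start != NULL && k >= 1);
    size_t pos = start->jp_index;   /* back pointer initializes prefetching */

    /* Startup: get the first k-1 following leaves in flight. */
    for (size_t j = 1; j < k && pos + j < njump; j++)
        __builtin_prefetch(jump[pos + j], 0, 3);

    long total = 0;
    size_t seen = 0;
    for (const leaf *p = start; p != NULL && seen < limit; p = p->next, pos++) {
        if (pos + k < njump)        /* steady state: one new prefetch per leaf */
            __builtin_prefetch(jump[pos + k], 0, 3);
        for (int i = 0; i < p->nkeys && seen < limit; i++, seen++)
            total += p->keys[i];
    }
    return total;
}
```

Only the first line of each leaf is prefetched here; with multi-line leaves one prefetch per line would be issued instead.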
[Figure: timeline of leaf-node cache misses when prefetching through the jump-pointer array]
External Jump Pointer Arrays: Efficient Updates
Impact of an insertion is limited to its chunk
Deletions leave empty slots
Actively interleave empty slots during bulkload and chunk splits
Back pointer to position in jump-pointer array is now a hint
• points to correct chunk
• but may require local search within chunk to init prefetching
[Figure: leaf nodes with hints pointing into a chunked linked list of jump pointers]
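The hint mechanism might look roughly like the following. This is a sketch under assumptions: the chunk size, the struct layouts, and the outward search order are choices made for the example and are not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK_SLOTS = 8 };

/* One chunk of the jump-pointer array; chunks form a linked list. */
typedef struct chunk {
    const void *slot[CHUNK_SLOTS];  /* leaf addresses; NULL marks an empty slot */
    struct chunk *next;
} chunk;

/* Hint stored in a leaf: the chunk is always right, the slot may be stale. */
typedef struct {
    chunk *c;
    int slot;
} hint;

/* Find the slot that really holds `leaf`, searching outward from the
 * hinted slot. Returns -1 if the leaf is not in the hinted chunk. */
int resolve_hint(hint h, const void *leaf) {
    assert(h.c != NULL && leaf != NULL);
    assert(h.slot >= 0 && h.slot < CHUNK_SLOTS);
    for (int d = 0; d < CHUNK_SLOTS; d++) {
        int right = h.slot + d;
        int left = h.slot - d;
        if (right < CHUNK_SLOTS && h.c->slot[right] == leaf)
            return right;
        if (left >= 0 && h.c->slot[left] == leaf)
            return left;
    }
    return -1;
}
```

Because a stale hint is normally only a few slots off, searching outward from the hinted position tends to terminate quickly, and the search never leaves the chunk.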
Alternative Design: Internal Jump-Pointer Arrays
B+-Trees already contain structures that point to the leaf nodes
• the parents of the leaf nodes (“bottom non-leaf nodes”)
By linking them together, we can use them as a jump-pointer array
Tradeoff:
• no need for back-pointers, and simpler to maintain
• consumes less space, though external array overhead is <1%
• but less flexible, chunk size is fixed by B+-Tree structure
Overview
Prefetching Searches
Prefetching Range Scans
Experimental Results
• search performance
• range scan performance
• update performance
Conclusions
Experimental Framework
Results are for a main-memory database environment
• (we are extending this work to disk-based environments)
Executables:
• we added prefetch instructions to C source code by hand
• used gcc to generate optimized MIPS executables with prefetch instructions
Performance Measurement:
• detailed, cycle-by-cycle simulations
Machine Model:
• based on Compaq ES40 system, with slightly updated parameters
Simulation Parameters
Pipeline Parameters
Clock Rate                          1 GHz
Issue Width                         4 insts/cycle
Functional Units                    2 Int, 2 FP, 2 Mem, 1 Branch
Reorder Buffer Size                 64 insts
Integer Multiply/Divide             12/76 cycles
All Other Integer                   1 cycle
FP Divide/Square Root               15/20 cycles
All Other FP                        2 cycles
Branch Prediction Scheme            gshare
Memory Parameters
Line Size                           64 bytes
Primary Data Cache                  64 KB, 2-way set-assoc.
Primary Instruction Cache           64 KB, 2-way set-assoc.
Miss Handlers                       32 for data, 2 for inst
Unified Secondary Cache             2 MB, direct-mapped
Primary-to-Secondary Miss Latency   15 cycles (plus contention)
Primary-to-Memory Miss Latency      150 cycles (plus contention)
Main Memory Bandwidth               1 access per 10 cycles
Models all the gory details, including memory system contention
Index Search Performance
100K random searches after bulkload; 100% full (except root); warm caches.
[Figure: time (M cycles, 10 to 80) vs. # of tupleIDs in the trees (10^4 to 10^7) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+]
pB+-Trees achieve 27-47% speedup vs. B+-Trees, 14-34% vs. CSB+-Trees
• optimal node width is 8 cache lines
• pB+-Trees and CSB+-Trees are complementary: p8CSB+-Trees are best
Same Search Experiments with Cold Caches
100K random searches after bulkload; 100% full (except root); cold caches (i.e., cleared after each search).
[Figure: time (M cycles, 60 to 180) vs. # of tupleIDs in trees (10^4 to 10^7) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+]
Large discrete steps within each curve
What is happening here?
Analysis of Cold Cache Search Behavior
Height of the tree dominates performance
• effect is blurred in warm cache case
If the same height, the smaller the node size the better
[Figure: the cold-cache search plot again, time (M cycles, 60 to 180) vs. # of tupleIDs in trees (10^4 to 10^7)]
# of Levels in the Trees (by number of keys)
Tree Type  10K   30K   100K  300K  1M    3M    10M
B+         5     6     6     7     7     8     8
CSB+       4     5     5     5     6     6     7
p2B+       4     4     5     5     6     6     6
p4B+       3     3     4     4     4     5     5
p8B+       3     3     3     4     4     4     4
p16B+      2     3     3     3     3     4     4
p8CSB+     3     3     3     3     3     4     4
Index Range Scan Performance
100 scans starting at random locations on index bulkloaded with 3M keys (100% full)
[Figure: log-scale plot of time (cycles, 10^4 to 10^10) vs. # of tupleIDs scanned through in a single call (10^1 to 10^6) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Scans of 1K-1M keys: 6.5-8.7 speedup over B+-Trees
• factor of 3.5-3.7 from prefetching wider nodes
• additional factor of ~2 from jump-pointer arrays
Small scans (<1K keys): overshooting cost is noticeable
• exploit only if scan is expected to be large (e.g., search for end)
Update Performance
100K random insertions/deletions on 3M-key bulkloaded index; warm caches
[Figure: two panels, Insertions and Deletions, plotting time (M cycles, 50 to 110) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
pB+-Trees achieve at least a 1.24 speedup in all cases
Why?
Reason #1: faster search times
Reason #2: less frequent node splits with wider nodes
pB+-Trees: Other Results
Similar results for:
• varying bulkload factors of trees
• large segmented range scans
• mature trees
• varying jump-pointer array parameters: prefetch distance, chunk size
Optimal node width:
• increases as memory bandwidth increases
• (matches the width predicted by our model in the paper)
Cache Performance Revisited
Search: eliminated 45% of original data cache stalls
• 1.47 speedup
Scan: eliminated 97% of original data cache stalls
• 8-fold speedup
[Figure: execution time breakdown into data cache stalls, other stalls, and busy time]
Conclusions
Impact of Prefetching B+-Trees on performance:
Search: 1.27-1.55 speedup over B+-Trees
• wider nodes reduce height of tree, # of expensive misses
• outperform and are complementary to CSB+-Trees
Updates: 1.24-1.52 speedup over B+-Trees
• faster search and less frequent node splits
• in contrast with significant slowdowns for CSB+-Trees
Range Scan: 6.5-8.7 speedup over B+-Trees
• wider nodes: factor of ~3.5 speedup
• jump-pointer arrays: additional factor of ~2 speedup
Prefetching B+-Trees also reduce space overhead.
These benefits are likely to increase with future memory systems.
Applicable to other levels of the memory hierarchy (e.g., disks).
Backup Slides
Revisiting the Optimal Node Width for Searches
w = # of cache lines per node
m = # of child pointers per one-cache-line-wide node
N = # of tupleIDs in index
Total cache misses = (misses per level) × (# of levels in tree) = $w \times \left( \log_{wm} \frac{N}{wm} + 1 \right)$
Total cache misses for a search is minimized when: w = 1
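The claim that w = 1 minimizes the miss count can be checked with a short calculation. The sketch assumes the total is w misses per level times (log_{wm}(N/wm) + 1) levels, treats the number of levels as continuous (no rounding up), and holds m and N fixed.

```latex
\begin{aligned}
\text{misses}(w) &= w\left(\log_{wm}\frac{N}{wm} + 1\right)
                  = w\,\log_{wm} N
                  = \frac{w \ln N}{\ln(wm)} \\[4pt]
\frac{d}{dw}\left[\frac{w}{\ln(wm)}\right]
                 &= \frac{\ln(wm) - 1}{\ln^{2}(wm)} > 0
                  \quad \text{whenever } wm > e
\end{aligned}
```

Since m is well above e for any realistic node (roughly 8 for 64-byte lines with 4-byte keys and pointers), the miss count rises with w, so one-line nodes are optimal when only the number of misses matters. Prefetching changes the cost of a level rather than the number of misses, which is why wider nodes can still win.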
Scheduling Prefetches Early Enough
[Figure: linked list of nodes n_i, n_{i+1}, n_{i+2}, n_{i+3}; currently visiting n_i, want to prefetch n_{i+3}]
    p = &n0;
    while (p) {
        work(p->data);
        p = p->next;
    }
L = time for loading a node
W = time for work()
Our Goal: fully hide latency
• thus achieving fastest possible computation rate of 1/W
e.g., if L=3W, we must prefetch 3 nodes ahead to achieve this.
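The L = 3W example generalizes to a one-line calculation: the prefetch distance is the smallest number of work periods that covers one load. A small sketch, with a hypothetical function name and L and W measured in cycles:

```c
#include <assert.h>

/* Smallest number of nodes k to prefetch ahead such that k * W >= L,
 * i.e. a node requested k visits ago has arrived when it is needed. */
int prefetch_distance(long L, long W) {
    assert(L >= 0 && W > 0);
    long k = (L + W - 1) / W;   /* ceiling of L / W */
    return k < 1 ? 1 : (int)k;
}
```

For example, `prefetch_distance(150, 50)` is 3, matching the L = 3W case, while a load of 150 cycles with only 40 cycles of work per node would need a distance of 4.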
Performance without Prefetching
    while (p) {
        work(p->data);
        p = p->next;
    }
[Figure: timeline for nodes n_i to n_{i+3}; each load L_k (loading n_k) is followed by its work W_k (work(n_k)), with no overlap]
Computation rate = 1/(L+W)
Prefetching One Node Ahead
    while (p) {
        pf(p->next);
        work(p->data);
        p = p->next;
    }
[Figure: timeline for nodes n_i to n_{i+3}; while visiting n_i we prefetch n_{i+1}, so load L_{i+1} overlaps work W_i, but each load still waits on the previous one (data dependence)]
• Computation is overlapped with memory accesses.
computation rate = 1/L
Prefetching Three Nodes Ahead
    while (p) {
        pf(p->next->next->next);
        work(p->data);
        p = p->next;
    }
[Figure: timeline for nodes n_i to n_{i+3}; while visiting n_i we try to prefetch n_{i+3}, but loads L_{i+1}, L_{i+2}, and L_{i+3} are serialized by the data dependence, each taking L]
Computation rate does not improve (still = 1/L)!
Pointer-Chasing Problem: [Luk & Mowry, ASPLOS ’96]
• any scheme which follows the pointer chain is limited to a rate of 1/L
Our Goal: Fully Hide Latency
    while (p) {
        pf(&n_{i+3});   /* pseudocode: prefetch node n_{i+3} directly by its address */
        work(p->data);
        p = p->next;
    }
[Figure: timeline for nodes n_i to n_{i+3}; while visiting n_i we prefetch n_{i+3} directly, so the loads overlap with one another and with the work]
Achieves the fastest possible computation rate of 1/W.
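The three computation rates can be compared with a toy timeline model. This is a simplification, not the simulator used in the talk: it assumes every load takes exactly L cycles, memory bandwidth is unlimited, and prefetches cost nothing to issue. All function names are made up for the example.

```c
#include <assert.h>

enum { MAX_NODES = 1024 };

static long max_l(long a, long b) { return a > b ? a : b; }

/* No prefetching: each load is fully exposed before its work. */
long cycles_no_prefetch(int n, long L, long W) {
    return n <= 0 ? 0 : (long)n * (L + W);
}

/* pf(p->next): the load of node i+1 starts when the visit to node i
 * starts, because its address is only known once node i has arrived. */
long cycles_chained(int n, long L, long W) {
    if (n <= 0) return 0;
    long start = L;             /* the first node is a plain miss */
    long end = start + W;
    for (int i = 1; i < n; i++) {
        long arrive = start + L;
        start = max_l(arrive, end);
        end = start + W;
    }
    return end;
}

/* Direct prefetching k nodes ahead using addresses known in advance. */
long cycles_direct(int n, int k, long L, long W) {
    assert(n <= MAX_NODES && k >= 1);
    long arrive[MAX_NODES];
    for (int i = 0; i < n && i < k; i++)
        arrive[i] = L;          /* startup: first k loads issued at time 0 */
    long end = 0;
    for (int i = 0; i < n; i++) {
        long start = max_l(arrive[i], end);
        if (i + k < n)
            arrive[i + k] = start + L;   /* issue pf for node i+k */
        end = start + W;
    }
    return end;
}
```

With L = 150, W = 50, and 100 nodes the model gives 20000 cycles without prefetching, 15050 when following the chain, and 5150 when prefetching 3 nodes ahead directly, which corresponds to rates of roughly 1/(L+W), 1/L, and 1/W.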
Challenges in Supporting Efficient Updates
Conceptual view of jump-pointer array:
[Figure: leaf nodes with back pointers into a contiguous jump-pointer array]
What if we really implemented it this way?
• Insertion: could incur significant overheads
  • copying data within the array to create a new hole
  • updating back-pointers
• Deletion: ok; just leave a hole
Summary: Why We Expect Updates to Perform Well
Insertions:
• only a small number of jump pointers move
  • between insertion point and nearest hole in the chunk
• normally only update the hint pointer for the inserted node
  • which does not require any significant overhead
• significant overheads only occur on chunk splits, which are rare
Deletions:
• no data is moved (just leave an empty hole)
• no need to update any hints
In general, the jump-pointer array requires little concurrency control.
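The insertion behavior can be sketched as follows. The chunk layout, the function name, and the tie-breaking rule are assumptions made for illustration; chunk splitting and hint maintenance are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK_SLOTS = 8 };

typedef struct {
    const void *slot[CHUNK_SLOTS];  /* leaf addresses; NULL marks a hole */
} jp_chunk;

/* Insert `leaf` so that it comes just before the entry currently at `pos`.
 * Entries between `pos` and the nearest hole are shifted by one slot.
 * Returns the number of jump pointers moved, or -1 if the chunk has no
 * hole (the caller would then have to split the chunk). */
int chunk_insert(jp_chunk *c, int pos, const void *leaf) {
    assert(c != NULL && leaf != NULL && pos >= 0 && pos <= CHUNK_SLOTS);

    int right = pos;                       /* nearest hole at or after pos */
    while (right < CHUNK_SLOTS && c->slot[right] != NULL) right++;
    int left = pos - 1;                    /* nearest hole before pos */
    while (left >= 0 && c->slot[left] != NULL) left--;

    int right_cost = right < CHUNK_SLOTS ? right - pos : -1;
    int left_cost = left >= 0 ? pos - 1 - left : -1;
    if (right_cost < 0 && left_cost < 0) return -1;

    if (left_cost < 0 || (right_cost >= 0 && right_cost <= left_cost)) {
        for (int i = right; i > pos; i--)  /* shift right into the hole */
            c->slot[i] = c->slot[i - 1];
        c->slot[pos] = leaf;
        return right_cost;
    }
    for (int i = left; i < pos - 1; i++)   /* shift left into the hole */
        c->slot[i] = c->slot[i + 1];
    c->slot[pos - 1] = leaf;
    return left_cost;
}
```

Leaves whose jump pointers were shifted keep their old hints in this sketch; those hints still name the right chunk and are merely off by one slot, which a local search within the chunk can absorb.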
B+-Trees Modeled and their Notations
B+-Trees: regular B+-Trees
CSB+-Trees: cache-sensitive B+-Trees [Rao & Ross, SIGMOD 2000]
pwB+-Trees: prefetching B+-Trees with node size = w cache lines and no jump-pointer arrays
• we consider w = 2, 4, 8, and 16
p8eB+-Trees: prefetching B+-Trees with node size = 8 cache lines and external jump-pointer arrays
p8iB+-Trees: prefetching B+-Trees with node size = 8 cache lines and internal jump-pointer arrays
p8CSB+-Trees: prefetching cache-sensitive B+-Trees with node size = 8 cache lines (and no jump-pointer arrays)
(Gory implementation details are in the paper.)
Searches with Varying Bulkload Factors
Similar trends with smaller bulkload factors as when 100% full
Performance of pB+-Trees is somewhat less sensitive to bulkload factor
[Figure: two panels, warm caches and cold caches, plotting time (M cycles) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, CSB+, p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p8CSB+]
Range Scans with Varying Bulkload Factors
Prefetching B+-Trees offer:
• larger speedups with smaller bulkload factors (more nodes to fetch)
• less sensitivity of performance to bulkload factor
[Figure: time (cycles, 10^5 to 10^7, log scale) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Large Segmented Range Scans
1M keys, scanned in 1000-key segments
Similar performance gains as unsegmented scans
[Figure: time (cycles, 10^8 to 10^10, log scale) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Insertions with Cold Caches
[Figure: time (M cycles, 120 to 260) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Deletions with Cold Caches
[Figure: time (M cycles, 120 to 220) vs. percentage of entries used in leaf nodes (50 to 100) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Analysis of Node Splits upon Insertions
[Figure: two panels, Bulkload Factor = 60-90% and Bulkload Factor = 100%, comparing B+tree, p8B+tree, p8eB+tree, and p8iB+tree. One plots insertions with node splits (0 to 10000) vs. percentage of entries used in leaf nodes (55 to 90); the other breaks insertions down into no splits, one split, and at least 2 splits]
Far fewer node splits
Fewer node splits
Fewer non-leaf node splits
Mature Trees: Searches (Warm Caches)
[Figure: time (M cycles, 0 to 200) vs. number of searches (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Mature Trees: Insertions (Warm Caches)
[Figure: time (M cycles, 0 to 200) vs. number of insertions (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
• CSB+-Trees could be 25% worse than B+-Trees under the same mature tree experiments (on a different hardware configuration)
• pB+-Trees are significantly faster than B+-Trees
Mature Trees: Deletions (Warm Caches)
[Figure: time (M cycles, 0 to 200) vs. number of deletions (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Mature Trees: Searches (Cold Caches)
[Figure: time (M cycles, 0 to 500) vs. number of searches (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Mature Trees: Insertions (Cold Caches)
[Figure: time (M cycles, 0 to 500) vs. number of insertions (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Mature Trees: Deletions (Cold Caches)
[Figure: time (M cycles, 0 to 500) vs. number of deletions (x 1000, 40 to 200) for B+tree, p8B+tree, p8eB+tree, and p8iB+tree]
Mature Trees: Large Segmented Range Scans
[Figure: bar chart (axis 0 to 4000) with values B+tree 3537, p8B+ 825, p8eB+ 479, p8iB+ 452]
Search varying memory bandwidth (warm cache)
[Figure: normalized execution time (60 to 100) vs. normalized bandwidth B (5 to 30) for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree]
Even when pessimistic (B=5), p8B+-Trees still achieve significant speedups: 1.2 for warm cache
Search varying memory bandwidth (cold cache)
[Figure: normalized execution time (50 to 110) vs. normalized bandwidth B (5 to 30) for p2B+tree, p4B+tree, p8B+tree, p16B+tree, and p19B+tree]
• Even when B=5, 1.3 speedup for cold cache
• The optimal value for w increases when B gets larger
Scan varying prefetching distance (p8eB+-Tree)
[Figure: time (cycles, 10^4 to 10^9) vs. entries scanned through in a single call (10^2 to 10^6) for prefetching distances k = 2, 3, 4, 8, 16, and 32]
• Not sensitive to moderate increases in the prefetching distance
• Though overshooting cost shows up when the # of entries to scan is small
Scan varying chunk size (p8eB+-Tree)
[Figure: time (cycles, 10^4 to 10^9) vs. entries scanned through in a single call (10^2 to 10^6) for chunk sizes c = 2, 4, 8, 16, and 32]
Not sensitive to varying chunk size
Table 1 Terminology
Variable    Definition
w           # of cache lines in an index node
m           # of child pointers in a one-line-wide node
N           # of <key, tupleID> pairs in an index
d           # of child pointers in non-leaf node (= w × m)
T1          Full latency of a cache miss
Tnext       Latency of an additional pipelined cache miss
B           Normalized memory bandwidth (B = T1/Tnext)
K           # of nodes to prefetch ahead
C           # of cache lines in jump-pointer array chunk
pwB+-Tree   Plain pB+-Tree with w-line-wide nodes
pweB+-Tree  pwB+-Tree with external jump-pointer arrays
pwiB+-Tree  pwB+-Tree with internal jump-pointer arrays
Search w/ & w/o Jump-Pointer Arrays: Cold Cache
[Figure: time (M cycles, 60 to 180) vs. entries in leaf nodes (10^4 to 10^7) for p8B+tree, p8eB+tree, and p8iB+tree; an annotation marks where the trees have a different # of levels]
Can We Do Even Better on Searches?
Hiding latency across levels is difficult given:
• data dependence through the child pointer
• the relatively large branching factor of tree nodes
• equal likelihood of following any child (assuming uniformly distributed random search keys)
What if we prefetch a node’s children in parallel with accessing it?
• duality between this and creating wider nodes
• BUT, this approach has the following relative disadvantages:
  • storage overhead for the child (or grandchild) pointers
  • size of node can only grow by multiples of the branching factor