FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs
N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, P. Dubey
SIGMOD 2010
Presented by: Andy Hwang
Motivation
• Index trees are not optimized for architecture
• Only one node is accessed per tree level, so cache lines are used ineffectively
• Prefetching cannot be used (the next node depends on comparing the search key to the parent)
• Nodes in different pages, causing TLB misses
• Previous work optimized for page, cache, SIMD separately, not together
• Compression can be used to save memory bandwidth
Motivation: Index Tree Layout
Bad for traversal
Motivation
Hierarchical Blocking
CPU/GPU Implementation
Compression
Throughput/Response Time
Summary/Discussion
Hierarchical Blocking
Optimize for accesses (SIMD/cache/memory)
Tree Construction
• Assuming 4-byte keys (32-bits)
• Block size depends on SIMD instruction width, cache line size, and page size
• Use one SIMD instruction to calculate multiple indices
• Parallelize output amongst CPU cores / GPU streaming multiprocessors
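The blocked layout can be sketched in a few lines. This is a minimal illustration, not the authors' construction code: it assumes a complete binary tree over 2^h - 1 sorted keys, the names `emit_block` and `blocked_layout` are hypothetical, and the breadth-first ordering of blocks at each granularity is an assumption of the sketch. `depths` lists block depths from coarsest to finest, for example `[4, 2]` for cache-line blocks of depth 4 built from SIMD blocks of depth 2.

```python
def emit_block(keys, span, levels, depths, out):
    """Append the top `levels` levels of the subtree over keys[span] to `out`.

    `depths` lists block depths from coarsest to finest; an empty list means
    plain breadth-first order. Returns the child spans below the emitted
    levels, left to right.
    """
    frontier = [span]
    step = depths[0] if depths else 1
    while levels > 0:
        take = min(step, levels)
        next_frontier = []
        for lo, hi in frontier:
            if depths:
                # Lay out one block of depth `take` using the finer blocking.
                next_frontier += emit_block(keys, (lo, hi), take, depths[1:], out)
            else:
                # Finest granularity: emit a single node and descend.
                mid = (lo + hi) // 2
                out.append(keys[mid])
                next_frontier += [(lo, mid), (mid + 1, hi)]
        frontier = next_frontier
        levels -= take
    return frontier


def blocked_layout(keys, depths):
    """Reorder sorted `keys` (length 2^h - 1) into a hierarchically blocked array."""
    height = (len(keys) + 1).bit_length() - 1
    if len(keys) != 2 ** height - 1:
        raise ValueError("this sketch assumes a complete tree of 2^h - 1 keys")
    out = []
    emit_block(keys, (0, len(keys)), height, depths, out)
    return out


# 15 keys, SIMD blocks of depth 2 (3 nodes each):
print(blocked_layout(list(range(15)), [2]))
# [7, 3, 11, 1, 0, 2, 5, 4, 6, 9, 8, 10, 13, 12, 14]
```

Each run of three consecutive entries is one SIMD block (a parent followed by its two children), so a single vector load brings in everything needed to descend two levels.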
Tree Construction: CPU
• 128-bit SIMD = max 4 nodes at once
• SIMD block = 2 tree levels (3 nodes)
• 64-byte cache line = max 16 nodes
• Cache line block = 4 levels (15 nodes)
• 2MB page size
• Page block = 19 levels
• With 4KB pages: page block = 10 levels
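These depths follow from how many 4-byte keys fit in each unit: a block of depth d is a complete subtree with 2^d - 1 nodes, so the depth is the largest d for which 2^d - 1 keys still fit. A small sketch of that arithmetic, where `block_depth` is a hypothetical helper rather than anything named in the paper:

```python
KEY_BYTES = 4  # 32-bit keys, as assumed on the slides


def block_depth(unit_bytes, key_bytes=KEY_BYTES):
    """Largest d such that a complete subtree of 2^d - 1 keys fits in the unit."""
    capacity = unit_bytes // key_bytes
    return (capacity + 1).bit_length() - 1


for name, size in [("128-bit SIMD register", 16),
                   ("64-byte cache line", 64),
                   ("4KB page", 4 * 1024),
                   ("2MB page", 2 * 1024 * 1024)]:
    d = block_depth(size)
    print(f"{name}: depth {d}, {2 ** d - 1} nodes")
```

The result is an upper bound; an implementation may pick a smaller depth when that suits the hardware better.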
Tree Construction: GPU
• 32 data elements (thread warp)
• Various SIMD block sizes possible (up to 32)
• Set depth to 4 to make use of instruction granularity at half-warp
• No cache exposed – cache line block size set equal to SIMD block size
Tree Traversal: CPU
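The figure for this slide is not in the transcript. As a rough stand-in, the sketch below emulates in scalar Python what a 128-bit SIMD traversal step does: compare the query against all three keys of a SIMD block at once, turn the comparison mask into a child index through a lookup table, and jump to that child block. It is a sketch under stated assumptions, not the paper's code: only one level of blocking (depth 2) is used, blocks are stored breadth-first so the child block is found by index arithmetic, the tree holds 4^L - 1 keys, and all names (`CHILD_OF_MASK`, `build_blocks`, `block_child`, `search`) are hypothetical.

```python
from bisect import bisect_left
from collections import deque

# Comparison mask -> child index. Bit 0: q > root, bit 1: q > left, bit 2: q > right.
# Only four masks can occur because left <= root <= right.
CHILD_OF_MASK = {0b000: 0, 0b010: 1, 0b011: 2, 0b111: 3}


def build_blocks(keys):
    """Group a sorted array of 4^L - 1 keys into 3-node blocks, breadth-first."""
    n = len(keys) + 1
    if n & (n - 1) or n.bit_length() % 2 == 0:
        raise ValueError("this sketch assumes 4^L - 1 keys")
    blocks = []
    queue = deque([(0, len(keys))])
    while queue:
        lo, hi = queue.popleft()
        if lo == hi:
            continue  # empty span below a leaf block
        mid = (lo + hi) // 2
        left, right = (lo + mid) // 2, (mid + 1 + hi) // 2
        blocks.append((keys[mid], keys[left], keys[right]))
        queue.extend([(lo, left), (left + 1, mid), (mid + 1, right), (right + 1, hi)])
    return blocks


def block_child(block, q):
    """Emulate one SIMD compare of q against all three keys of a block."""
    root, left, right = block
    mask = int(q > root) | (int(q > left) << 1) | (int(q > right) << 2)
    return CHILD_OF_MASK[mask]


def search(blocks, q):
    """Return the number of keys smaller than q (its rank in the sorted array)."""
    b = 0
    while b < len(blocks):
        b = 4 * b + 1 + block_child(blocks[b], q)
    return b - len(blocks)


keys = list(range(10, 160, 10))  # 15 sorted keys: 10, 20, ..., 150
blocks = build_blocks(keys)
print(search(blocks, 75), bisect_left(keys, 75))  # 7 7
```

Each loop iteration descends two tree levels with a single block access, which is the point of sizing the block to the SIMD width.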
Tree Traversal: GPU
Simultaneous Queries
• Issue queries in parallel on the hardware
• Software pipelining used to hide cache/TLB miss or GPU memory latency
• CPU: 8 concurrent queries per thread, 64 total
• GPU: 2 concurrent queries per thread warp, 960 total
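The interleaving behind software pipelining can be illustrated with a lockstep search: a group of queries advances one tree level at a time, so that in a native implementation the memory access for one query overlaps with work on the others. Python cannot issue prefetches, so the sketch below only shows the order of operations. `lockstep_search` and its `batch` parameter are hypothetical names, and a plain sorted array stands in for the blocked tree.

```python
from bisect import bisect_left


def lockstep_search(keys, queries, batch=8):
    """Binary-search `queries` in sorted `keys`, advancing `batch` queries one level at a time."""
    ranks = []
    for start in range(0, len(queries), batch):
        group = queries[start:start + batch]
        lo = [0] * len(group)
        hi = [len(keys)] * len(group)
        active = True
        while active:
            active = False
            for i, q in enumerate(group):
                if lo[i] >= hi[i]:
                    continue  # this query has already finished
                mid = (lo[i] + hi[i]) // 2
                # A native implementation would prefetch the next node for
                # query i here and switch to query i + 1 while it loads.
                if q > keys[mid]:
                    lo[i] = mid + 1
                else:
                    hi[i] = mid
                active = True
        ranks.extend(lo)
    return ranks


keys = list(range(0, 200, 2))
queries = [5, 0, 199, 42, 77, 100, 13, 198, 64, 1]
print(lockstep_search(keys, queries) == [bisect_left(keys, q) for q in queries])  # True
```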
Optimization Speedup
CPU vs GPU Search Throughput
Tree Traversal: MICA
• Intel Many-Core Architecture Platform
• Intel GPGPU effort
• 32KB L1, 256KB L2 (partitioned)
• 4 threads/core
• Traversal code similar to CPU
• 16-wide SIMD
• SIMD block depth = 4 (15 nodes at once)
Tree Traversal: MICA
Throughput (million queries / sec)

| Platform | Small Tree (64K keys) | Large Tree (16M keys) |
|----------|-----------------------|-----------------------|
| CPU      | 280                   | 60                    |
| GPU      | 150                   | 100                   |
| MICA     | 667                   | 183                   |

Benefits of both CPU and GPU!
Compression
• Key sizes are different in practice
• Impact cache line and page usage
• Non-Contiguous Common Prefix
• Hashing keys based on their difference (partial keys)
• 4-bit blocks as unit of compression
• SIMD instruction to find similarity and compress
Compression
• First page partial key size is larger (128 bits) to reduce false positives
• Subsequent pages have a partial key size of 32 bits
• Construction overhead increased
• +75% for variable size keys, +30% integer keys
• During traversal, the query key is compressed
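The partial-key idea can be sketched as follows: find the 4-bit blocks in which the stored keys disagree, keep only those blocks, and compress the query key the same way before comparing. This is a simplified illustration and not the paper's NCCP algorithm; the function names `differing_blocks` and `partial_key` and the example keys are made up for the sketch, and fixed-width integer keys are assumed.

```python
def differing_blocks(keys, width_bits=32):
    """Shifts (most significant first) of the 4-bit blocks where the keys disagree."""
    diff = 0
    for k in keys:
        diff |= k ^ keys[0]
    return [s for s in range(width_bits - 4, -1, -4) if (diff >> s) & 0xF]


def partial_key(key, shifts):
    """Concatenate only the selected 4-bit blocks of `key`."""
    out = 0
    for s in shifts:
        out = (out << 4) | ((key >> s) & 0xF)
    return out


keys = [0xAB12CD34, 0xAB13CD35, 0xAB17CD30]
shifts = differing_blocks(keys)
print(shifts)                                        # [16, 0]
print([hex(partial_key(k, shifts)) for k in keys])   # ['0x24', '0x35', '0x70']
# A query outside the key set can collide with a stored partial key:
print(hex(partial_key(0xFF12CD34, shifts)))          # 0x24
```

Because the dropped blocks are identical across all stored keys, the partial keys keep the same relative order as the full keys. The last line shows a false positive: a query that differs only in a dropped block maps to the same partial key, which is why a match still has to be checked against the full key.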
Compression: Alphabet Size
Compression: Throughput
Query Batching/Buffering
Summary
• Hierarchical blocking to optimize search tree for page, cache, SIMD instructions
• Architecture-aware block depths
• CPU/GPU/MICA implementations
• Fast construction, search, and parallel queries for varying tree sizes
• Hide memory latency wherever possible
• NCCP compression for integer and variable-length keys
• Throughput/response time for different query batching schemes
Discussion
• Focus on throughput
• Assumes large number of queries
• Not much info on latency
• Updates
• Full reconstruction? Flushed from cache?
• Synthetic workloads
• Deployment