Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM · 2018-09-06
TRANSCRIPT
Ching-Hsiang Chu (The Ohio State University, NVIDIA), Sreeram Potluri (NVIDIA), Anshuman Goswami (NVIDIA), Manjunath Gorentla Venkata (Oak Ridge National Laboratory), Neena Imam (Oak Ridge National Laboratory), and Chris J. Newburn (NVIDIA)
DESIGNING HIGH-PERFORMANCE IN-MEMORY KEY-VALUE OPERATIONS WITH PERSISTENT GPU KERNELS AND OPENSHMEM
2
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
3
HUGE DATA EXPLOSION
“Annual global IP traffic will reach 3.3 ZB (zettabytes; 1 ZB = 1,000 exabytes [EB]) by 2021”
source: Cisco Visual Networking Index: Forecast and Methodology, 2016–2021
All kinds of indexed traffic
Video, Search, Maps…
How to provide efficient access to such a large volume?
Cache!
Predicted to continue
“VNI Forecast Highlights Tool”, https://www.cisco.com/c/m/en_us/solutions/service-provider/vni-forecast-highlights.html
4
IN-MEMORY KEY-VALUE STORES
• In-Memory Key-Value (IMKV) stores: distributed in-memory data caching systems
• Cache data and objects in system memory, i.e., RAM (Random Access Memory)
• Query the cache server(s) before retrieving data from the database
• Speed up web applications by shortening the query/response path
What is it?
[Diagram: Network ↔ Web Servers ↔ IMKV servers ↔ Back-end Database]
1. Cache access requests/responses
2. Cache miss: fetching objects from DB
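The look-aside flow above — check the cache first, fall back to the database on a miss — can be sketched in a few lines. This is a minimal single-node illustration; the class and function names are hypothetical, and a real IMKV deployment distributes the cache across servers over the network.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the cache-aside pattern used by IMKV stores:
// the web server queries the in-memory cache before the back-end database.
class WebServer {
public:
    std::string get(const std::string& key) {
        if (auto hit = cache_get(key)) {      // 1. cache access request
            return *hit;                      //    hit: short response path
        }
        std::string value = db_fetch(key);    // 2. cache miss: fetch from DB
        cache_[key] = value;                  //    populate cache for next time
        return value;
    }

    void set(const std::string& key, const std::string& value) {
        db_[key] = value;                     // write through to the database
        cache_[key] = value;                  // keep the cache coherent
    }

private:
    std::optional<std::string> cache_get(const std::string& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) return std::nullopt;
        return it->second;
    }
    std::string db_fetch(const std::string& key) { return db_.at(key); }

    std::unordered_map<std::string, std::string> cache_;  // stands in for the RAM cache
    std::unordered_map<std::string, std::string> db_;     // stands in for the database
};
```
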
5
IN-MEMORY KEY-VALUE STORES (CONT’)
• Faster for I/O-intensive applications (especially read-heavy workloads)
• System memory access (nano/microseconds) >>> database with persistent storage (milliseconds)
• Resources are available: Large system memory now common on cloud and HPC systems
• Amazon EC2 X1e provides up to 4TB DDR4 Memory (https://aws.amazon.com/ec2/instance-types/)
• Amazon ElastiCache offers Memcached/Redis-ready nodes equipped with up to 400 GB memory (https://aws.amazon.com/elasticache/)
• Summit (#1 supercomputer): 512GB DDR4 per node
Why has it become popular?
6
IN-MEMORY KEY-VALUE STORES (CONT’)
• Fit the hash table in system memory
• Keys: 8 bytes – 250 bytes (for memcached)
• Values: 8 bytes – 1 MB data objects (for memcached)
• Basic Key-Value (KV) operations: GET (the most commonly used) & SET
How does it work?
[Diagram: Network ↔ Web Servers ↔ IMKV servers ↔ Back-end Database]
GET(Key) / SET(Key, Value) requests go to the IMKV servers
Cache miss: fetching objects from DB
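The GET/SET flow above amounts to a hash-table lookup in system memory, with memcached's size limits (keys up to 250 bytes, values up to 1 MB) enforced at the front door. A minimal sketch; the class name and limits-checking style are illustrative:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical single-node sketch of the basic KV operations.
// memcached constrains keys to 250 bytes and values to 1 MB.
class KVStore {
public:
    static constexpr size_t kMaxKey   = 250;
    static constexpr size_t kMaxValue = 1 << 20;  // 1 MB

    bool set(const std::string& key, const std::string& value) {
        if (key.size() > kMaxKey || value.size() > kMaxValue) return false;
        table_[key] = value;
        return true;
    }

    // GET is by far the most common operation in read-heavy workloads.
    std::optional<std::string> get(const std::string& key) const {
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;  // miss: caller goes to the DB
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> table_;  // hash table in RAM
};
```
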
7
LET’S MAKE IMKV SERVERS FASTER
State of the art (NOT a comparison)
[1] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: a holistic approach to fast in-memory key-value storage. NSDI'14. USENIX Association, Berkeley, CA, USA, 429-444.
[2] T. H. Hetherington, M. O'Connor, and T. M. Aamodt. MemcachedGPU: scaling-up scale-out key-value stores. SoCC '15. ACM, New York, NY, USA, 43-57.
[3] K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, X. Zhang, B. He, J. Hu, and B. Hua. A distributed in-memory key-value store system on heterogeneous CPU-GPU cluster. The VLDB Journal 26, 5 (October 2017), 729-750.
[4] H. Fu, M. G. Venkata, A. R. Choudhury, N. Imam and W. Yu, "High-Performance Key-Value Store on OpenSHMEM," 2017 CCGRID, Madrid, 2017, pp. 559-568.
[5] B. Li, Z. Ruan, W. Xiao, Y. Lu, Y. Xiong, A. Putnam, E. Chen, and L. Zhang. 2017. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. SOSP '17. ACM, New York, NY, USA, 137-152.
Reported peak throughput (MOPS):

KVS              | Highlights                                                        | Key HW                                                                                                | GET   | PUT
-----------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|-------|-----
Memcached        | Traditional CPU-based solution, UDP                               | Intel 8-core Xeon CPU, one 40-GbE NIC                                                                 | 1.5   | 1.5
MICA [1]         | Kernel bypass, Intel DPDK                                         | 24-core CPU, 12-port NIC                                                                              | 137   | 135
MemcachedGPU [2] | Offloads packet parsing (UDP) and KV operations to GPU, GPUDirect | One NVIDIA Tesla K20c, 10-GbE NIC                                                                     | 12.92 | N/A
Mega-KV [3]      | Offloads only indexing operations to GPU, Intel DPDK (UDP)        | Single node: dual-socket Intel Xeon 8-core CPUs, one NVIDIA Tesla K40c GPU, one 40-GbE NIC per socket | 166   | 171
                 |                                                                   | 4 nodes: same config as above                                                                         | 623   | N/A
SHMEMCache [4]   | OpenSHMEM, RDMA, one-sided KV operations (CPU bypass)             | 10-core Intel Xeon CPU, IB ConnectX-3 NIC                                                             | ~0.9  | ~0.3
                 |                                                                   | 1,024 nodes on Titan cluster                                                                          | ~50   | ~12
KV-Direct [5]    | CPU bypass, offloads KV operations to programmable NIC (FPGA)     | Single server: one 40-GbE NIC, two PCIe Gen3 x8 links                                                 | 180   | 114
                 |                                                                   | Single server: ten 40-GbE NICs, one PCIe Gen3 x8 link per NIC                                         | 1220  | 610
8
GPU ACCELERATED IMKV SERVER
• KV operations are memory-intensive operations
• GPU with HBM is a perfect fit!
• But, GPU memory is not as BIG as system memory
• Mega-KV: Cache KVs in SysMem; Index KV on GPU
• Introduce a new hash table in GPU memory
• Compressed, fixed-width key
• Value : location (in system memory)
• Offload only indexing operations to GPUs
• CPU threads complete KV operations
How does Mega-KV work?
Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang. 2015. Mega-KV: a case for GPUs to maximize the throughput of in-memory key-value stores. Proc. VLDB Endow. 8, 11 (July 2015), 1226-1237.
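Mega-KV's split design can be illustrated on the host: the index (GPU-resident in Mega-KV) maps a compressed, fixed-width key signature to a location in system memory, where the full key and value live. This is a hedged CPU-side sketch of the idea, not Mega-KV's actual code; the hash and signature choices are illustrative.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of Mega-KV's two-level design:
//  - the index (on the GPU in Mega-KV) stores only a fixed-width 32-bit key
//    signature and a location, so millions of entries fit in GPU memory;
//  - the full keys and values stay in large system memory.
struct Item { std::string key; std::string value; };

class SplitIndexKV {
public:
    void set(const std::string& key, const std::string& value) {
        items_.push_back({key, value});              // full KV stays in SysMem
        index_[signature(key)] = items_.size() - 1;  // index stores a location only
    }

    // GET: the index lookup (offloaded to the GPU in Mega-KV) returns a
    // location; the CPU then reads the item and verifies the full key,
    // since compressed signatures can collide.
    const std::string* get(const std::string& key) const {
        auto it = index_.find(signature(key));
        if (it == index_.end()) return nullptr;
        const Item& item = items_[it->second];
        return item.key == key ? &item.value : nullptr;  // collision check
    }

private:
    static uint32_t signature(const std::string& key) {
        return static_cast<uint32_t>(std::hash<std::string>{}(key));  // compressed key
    }
    std::unordered_map<uint32_t, size_t> index_;  // stands in for the GPU hash table
    std::vector<Item> items_;                     // stands in for system memory
};
```

The payoff of this split is capacity: only ~12 bytes per entry need to live in limited GPU memory, while multi-megabyte values stay in abundant host RAM.
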
9
GPU ACCELERATED IMKV SERVER (CONT’)
• Batching is required to achieve high throughput
• Data copies and CUDA kernel launches dominate the indexing process
→ longer response times for small requests
Can we do better? What is still missing in Mega-KV?
Small batch (1,000 keys): Data copy & kernel launch take longer than actual work
Large batch (50,000 keys): Multiple CUDA streams help overlap data copies and kernels, but underutilization still present…
cudaMemcpyHtoD + kernel + cudaMemcpyDtoH
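The batching trade-off can be captured with a back-of-envelope model: each offload pays a fixed cost (kernel launch plus copy setup) before any keys are processed, so effective throughput is n / (overhead + n/rate). The constants below are illustrative assumptions, not measurements from this work.

```cpp
// Back-of-envelope model of batched GPU offload.
// All constants are assumed, illustrative values.
double effective_mops(double batch_keys) {
    const double overhead_us = 20.0;    // assumed per-batch launch + copy setup cost
    const double rate_mops   = 1000.0;  // assumed steady-state indexing rate
                                        // (1000 MOPS == 1000 keys per microsecond)
    double total_us = overhead_us + batch_keys / rate_mops;
    return batch_keys / total_us;       // keys per microsecond == MOPS
}
```

Under these assumed numbers, a 1,000-key batch spends 20 µs on overhead for 1 µs of work, while a 50,000-key batch amortizes the same 20 µs over 50 µs of work — which is why small batches leave the GPU underutilized.
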
10
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
11
PROBLEM STATEMENT
• Can we further improve the throughput of GPU-based IMKV server?
• How to optimize data movement?
• Can we enable high-throughput indexing operations on GPU even without batching a large number of input requests?
• How to reduce overhead when offloading operations to GPU?
• How to maintain high GPU utilization?
Challenges
12
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
13
GPUDIRECT RDMA
• Introduced in CUDA 5; available on all Tesla GPUs starting from Kepler
• Direct transfers between GPUs and NICs over PCIe
• Eliminates CPU bandwidth and latency bottlenecks
Remote Direct Memory Access
https://developer.nvidia.com/gpudirect
14
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
15
OVERVIEW
• Avoid copy engine overheads
• SM-based copy (fused with processing kernel)
• Avoid copies from System Memory to Device Memory
• Direct access to pinned system memory
• Direct write to device memory using GPUDirect RDMA (GDR) through NVSHMEM
• Avoid kernel launch overheads
• Persistent CUDA kernel(s)
Challenges and Proposed Solutions
16
AVOIDING COPIES
Data and control paths
[Diagram: Client → IMKV server (CPU + GPU with Device Memory)]
1. SET/GET requests from clients; data can be put directly into GPU memory using GDR technology through NVSHMEM
2.a. CPU schedules memory copies: direct access to pinned SysMem or SM-based copy
2.b. CPU schedules GPU kernel(s)
3. GPU performs indexing operations
4. Results are copied back to system memory
5. CPU post-processes results and replies to the client if required
17
PERSISTENT CUDA KERNELS
• CPU thread(s) launch a persistent CUDA kernel when initializing the server
• CPU thread(s) assign and signal threadblocks/CTAs for new requests
• CTAs perform indexing operations and signal back upon completion
• CPU thread(s) check for completion and perform post-processing
• Key design considerations
• Low overhead signal mechanism
• Efficient resource utilization
Synchronization and Resource Use
[Diagram: CPU threads on the IMKV server signal CTAs of the persistent kernel on the GPU; data resides in Device Memory]
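The CPU-to-CTA handshake can be simulated on the host, with a thread standing in for one persistent CTA: the CPU bumps issued_seqnum to assign work, the worker polls it, processes the batch, and bumps completed_seqnum. This is a hedged sketch of the protocol only; the real design runs this loop inside a CUDA kernel and places the counters in pinned or device memory, as discussed on the following slides.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side simulation of the persistent-kernel protocol. The std::thread
// below stands in for one persistent CTA; the atomics stand in for the
// per-CTA issued_seqnum (is) / completed_seqnum (cs) counters.
struct WorkQueue {
    std::atomic<long> issued{0};      // CPU updates, CTA polls
    std::atomic<long> completed{0};   // CTA updates, CPU polls
    std::atomic<bool> shutdown{false};
    std::vector<int> batch;           // input buffer for the assigned work
    long result = 0;                  // output buffer
};

// The "persistent kernel": launched once, then serves many batches.
void persistent_worker(WorkQueue& q) {
    long seen = 0;
    while (!q.shutdown.load(std::memory_order_acquire)) {
        if (q.issued.load(std::memory_order_acquire) == seen) continue;  // poll is
        long sum = 0;                          // stand-in for indexing work
        for (int v : q.batch) sum += v;
        q.result = sum;
        ++seen;
        q.completed.store(seen, std::memory_order_release);  // signal cs
    }
}

// CPU side: assign one batch of work and wait for its completion.
long offload(WorkQueue& q, std::vector<int> keys) {
    q.batch = std::move(keys);
    long target = q.issued.load(std::memory_order_relaxed) + 1;
    q.issued.store(target, std::memory_order_release);               // signal is
    while (q.completed.load(std::memory_order_acquire) != target) {} // poll cs
    return q.result;
}
```

Launching `std::thread worker(persistent_worker, std::ref(q))` once at startup plays the role of launching the persistent kernel at server initialization: no further launch cost is paid per batch.
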
18
SIGNALING METHOD
• What to signal?
• From CPU to a particular CTA, that new work is assigned
• Size of the assigned work
• Pointers of input and output buffers
• How to signal?
• Two counters and a work queue per CTA
• issued_seqnum (is): CPU updates, CTA polls
• completed_seqnum (cs): CTA updates, CPU polls
How does the persistent kernel know when and where to start processing?
[Diagram: GPU running persistent CTAs 1…n; each CTA has an issued_seqnum (is) / completed_seqnum (cs) counter pair shared with the CPU]
19
SIGNALING METHOD (CONT’)
• Where to store the counters?
• Pinned System Memory
• Fast for CPU, slow for GPU to update and poll
• Device Memory
• Fast for GPU to poll and update, but extra copies are required for CPU
• Exploit the low-latency GDRCopy [10] library, which enables fast CPU updates
• issued_seqnum in system memory or GPU memory; completed_seqnum always in system memory
How does the persistent kernel know when and where to start processing?
[10] https://github.com/NVIDIA/gdrcopy
[Diagram: option A — both counters for every CTA in PinnedSysMem; vs. option B — issued_seqnum in DevMem, completed_seqnum in PinnedSysMem]
20
SIGNALING METHOD (CONT’)
Signal overhead on Tesla P100 with 56 SMs
[Plot: signaling latency (µs, 0-35) vs. number of CTAs (1-112) for three variants: Padding (128 B), No padding, GDRCopy-Sync]
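The "Padding (128 B)" series above reflects a classic detail: if per-CTA counters share a cache line, one side's polling and updates contend with its neighbors'. Padding each counter pair out to its own 128-byte line keeps the signals independent. A minimal sketch (the struct name is illustrative):

```cpp
#include <atomic>

// Each CTA's counter pair is padded to 128 bytes so that polling/updating
// one CTA's counters never contends on a line shared with another CTA's.
struct alignas(128) CtaSignal {
    std::atomic<long> issued_seqnum{0};     // CPU updates, CTA polls
    std::atomic<long> completed_seqnum{0};  // CTA updates, CPU polls
};

static_assert(sizeof(CtaSignal) == 128, "one signal pair per 128-byte line");
static_assert(alignof(CtaSignal) == 128, "aligned to the padding boundary");
```

An array of `CtaSignal`, one element per CTA, then places every counter pair on its own line by construction.
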
21
PERSISTENT KERNEL + GDR
Data and control paths
[Diagram: Client → IMKV server (CPU + GPU with Device Memory)]
1.a. Client sends SET/GET requests directly to GPU memory
1.b. Client notifies the server
2. CPU signals available CTAs
3. CTAs perform indexing operations
4. Results are copied back to system memory (SM-based or Copy Engine)
5. CPU post-processes results and replies to the client if required
22
BENEFITS OF PERSISTENT KERNELS + GDR
[Timeline diagrams]
Non-persistent kernel, without GDR (i.e., Mega-KV): for each batch the CPU issues cudaMemcpyH2D, the kernel, and cudaMemcpyD2H; CTAs 0-7 and CTAs 8-15 sit idle between these operations.
Persistent kernel, with GDR: the CPU only signals; each group of CTAs runs a kernel that fuses indexing with the D2H copy → better GPU utilization & performance.
Note: assumes processing two small batches of 1,000 keys, with 8 CTAs required per batch.
23
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
24
EXPERIMENTAL ENVIRONMENT
• Intel Xeon dual-socket 16-core Haswell (E5-2698) CPU 2.3 GHz, 256 GB Host Memory
• InfiniBand Connect-IB (MT27600) FDR (56Gbps)
• NVIDIA Tesla P100 GPUs connected via PCIe Gen 3
• 16 GB HBM2 memory, 56 SMs
• CUDA 9.0.176
• Benchmark
• Adopted from Mega-KV (http://kay21s.github.io/megakv/)
• Metric: Million search Operations Per Second (MOPS)
• i.e., throughput, higher is better
Pascal P100 GPU
25
BASELINE
Mega-KV on P100: tuning # of CUDA streams
[Plot: throughput (MOPS) vs. batch size (1,000-10,000 keys) for Mega-KV with 1/2/4/7/8 CUDA streams]
[Plot: throughput (MOPS) vs. batch size (20,000-100,000 keys) for Mega-KV with 1/2/4/7/8 CUDA streams]
1 CUDA stream performs best for small batches; 2/4 CUDA streams perform best for large batches
26
AVOID COPIES
NVSHMEM with GDR improves throughput!
[Plot: throughput (MOPS) vs. batch size (# of keys), small batches: Mega-KV-1 vs. G-IMKV-SysMem vs. G-IMKV-GDR-CE]
[Plot: throughput (MOPS) vs. batch size (# of keys), large batches: Mega-KV-4 vs. G-IMKV-SysMem vs. G-IMKV-GDR-CE]
Up to 1.2x-1.6x improvement
27
PERSISTENT KERNEL
Choosing the appropriate copy-back mechanism!
[Plot: throughput (MOPS) vs. batch size (# of keys), small batches: G-IMKV-PK-SysMem, G-IMKV-PK-SM, G-IMKV-PK-GDR-SM, G-IMKV-PK-GDR-CE — up to 1.9x]
[Plot: throughput (MOPS) vs. batch size (# of keys), large batches: same variants — up to 1.4x]
Using the Copy Engine helps when copying back large batches
28
GDR, PERSISTENT KERNEL
vs. Mega-KV
[Plot: throughput (MOPS) vs. batch size (# of keys) for Mega-KV-TUNED, G-IMKV-GDR-CE, G-IMKV-PK-GDR-OPT — up to 4.8x over Mega-KV]
PEAK throughput: ~888 MOPS
29
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
30
CONCLUSION & FUTURE WORK
• Use of GPUDirect RDMA eliminates data transfer and copy command overhead
• 1.2x speedup for large batches and 4.8x speedup for small batches compared to Mega-KV
• Persistent CUDA kernels mitigate kernel launch overhead
• 3.7x higher throughput for small batch sizes compared to Mega-KV
• Future Work & Open Issues
• Evaluate scalability: IMKV server(s) with multiple GPUs and NICs
• Extend to other streaming applications
• Investigate CUDA-level guarantees required for concurrent progress of persistent kernels and memory copies
Key takeaways