Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM · 2018-09-06
TRANSCRIPT
Ching-Hsiang Chu (The Ohio State University, NVIDIA), Sreeram Potluri (NVIDIA), Anshuman Goswami (NVIDIA), Manjunath Gorentla Venkata (Oak Ridge National Laboratory), Neena Imam (Oak Ridge National Laboratory), and Chris J. Newburn (NVIDIA)
DESIGNING HIGH-PERFORMANCE IN-MEMORY KEY-VALUE OPERATIONS WITH PERSISTENT GPU KERNELS AND OPENSHMEM
2
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
3
HUGE DATA EXPLOSION
“Annual global IP traffic will reach 3.3 ZB (zettabytes; 1 ZB = 1,000 exabytes [EB]) by 2021”
source: Cisco Visual Networking Index: Forecast and Methodology, 2016–2021
All kinds of indexed traffic
Video, Search, Maps…
How to provide efficient access to such a large volume?
Cache!
Predicted to continue
“VNI Forecast Highlights Tool”, https://www.cisco.com/c/m/en_us/solutions/service-provider/vni-forecast-highlights.html
4
IN-MEMORY KEY-VALUE STORES
• In-Memory Key-Value (IMKV) stores: distributed in-memory data caching systems
• Cache data and objects in system memory, i.e., RAM (Random Access Memory)
• Query the cache server(s) before retrieving data from the database
• Speed up web applications by shortening the query/response path
What is it?
[Diagram: Network ↔ Web Servers ↔ IMKV servers ↔ Back-end Database]
1. Cache access requests/responses
2. Cache miss: fetching objects from DB
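The look-aside flow above — check the cache first, fall back to the database on a miss — can be sketched in a few lines. This is a minimal single-node illustration; the class and function names are hypothetical, and a real IMKV deployment distributes the cache across servers over the network.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the cache-aside pattern used by IMKV stores:
// the web server queries the in-memory cache before the back-end database.
class WebServer {
public:
    std::string get(const std::string& key) {
        if (auto hit = cache_get(key)) {      // 1. cache access request
            return *hit;                      //    hit: short response path
        }
        std::string value = db_fetch(key);    // 2. cache miss: fetch from DB
        cache_[key] = value;                  //    populate cache for next time
        return value;
    }

    void set(const std::string& key, const std::string& value) {
        db_[key] = value;                     // write through to the database
        cache_[key] = value;                  // keep the cache coherent
    }

private:
    std::optional<std::string> cache_get(const std::string& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) return std::nullopt;
        return it->second;
    }
    std::string db_fetch(const std::string& key) { return db_.at(key); }

    std::unordered_map<std::string, std::string> cache_;  // stands in for the RAM cache
    std::unordered_map<std::string, std::string> db_;     // stands in for the database
};
```
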
5
IN-MEMORY KEY-VALUE STORES (CONT’)
• Faster for I/O-intensive applications (especially read-heavy workloads)
• System memory access (nano/microseconds) >>> database with persistent storage (milliseconds)
• Resources are available: Large system memory now common on cloud and HPC systems
• Amazon EC2 X1e provides up to 4TB DDR4 Memory (https://aws.amazon.com/ec2/instance-types/)
• Amazon ElastiCache offers Memcached/Redis-ready nodes equipped with up to 400 GB memory (https://aws.amazon.com/elasticache/)
• Summit (#1 supercomputer): 512GB DDR4 per node
Why has it become popular?
6
IN-MEMORY KEY-VALUE STORES (CONT’)
• Fit the hash table in system memory
• Keys: 8 bytes – 250 bytes (for memcached)
• Values: 8 bytes – 1 MB data objects (for memcached)
• Basic Key-Value (KV) operations: GET (the most commonly used) & SET
How does it work?
[Diagram: Network ↔ Web Servers ↔ IMKV servers ↔ Back-end Database]
GET(Key) / SET(Key, Value) requests go to the IMKV servers
Cache miss: fetching objects from DB
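The GET/SET flow above amounts to a hash-table lookup in system memory, with memcached's size limits (keys up to 250 bytes, values up to 1 MB) enforced at the front door. A minimal sketch; the class name and limits-checking style are illustrative:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical single-node sketch of the basic KV operations.
// memcached constrains keys to 250 bytes and values to 1 MB.
class KVStore {
public:
    static constexpr size_t kMaxKey   = 250;
    static constexpr size_t kMaxValue = 1 << 20;  // 1 MB

    bool set(const std::string& key, const std::string& value) {
        if (key.size() > kMaxKey || value.size() > kMaxValue) return false;
        table_[key] = value;
        return true;
    }

    // GET is by far the most common operation in read-heavy workloads.
    std::optional<std::string> get(const std::string& key) const {
        auto it = table_.find(key);
        if (it == table_.end()) return std::nullopt;  // miss: caller goes to the DB
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> table_;  // hash table in RAM
};
```
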
7
LET’S MAKE IMKV SERVERS FASTER
State of the art (NOT a comparison)
[1] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky. MICA: a holistic approach to fast in-memory key-value storage. NSDI'14. USENIX Association, Berkeley, CA, USA, 429-444.
[2] T. H. Hetherington, M. O'Connor, and T. M. Aamodt. MemcachedGPU: scaling-up scale-out key-value stores. SoCC '15. ACM, New York, NY, USA, 43-57.
[3] K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, X. Zhang, B. He, J. Hu, and B. Hua. A distributed in-memory key-value store system on heterogeneous CPU-GPU cluster. The VLDB Journal 26, 5 (October 2017), 729-750.
[4] H. Fu, M. G. Venkata, A. R. Choudhury, N. Imam and W. Yu, "High-Performance Key-Value Store on OpenSHMEM," 2017 CCGRID, Madrid, 2017, pp. 559-568.
[5] B. Li, Z. Ruan, W. Xiao, Y. Lu, Y. Xiong, A. Putnam, E. Chen, and L. Zhang. 2017. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. SOSP '17. ACM, New York, NY, USA, 137-152.
Reported peak throughput (MOPS):

KVS              | Highlights                                                        | Key HW                                                                                                | GET   | PUT
-----------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|-------|-----
Memcached        | Traditional CPU-based solution, UDP                               | Intel 8-core Xeon CPU, one 40-GbE NIC                                                                 | 1.5   | 1.5
MICA [1]         | Kernel bypass, Intel DPDK                                         | 24-core CPU, 12-port NIC                                                                              | 137   | 135
MemcachedGPU [2] | Offloads packet parsing (UDP) and KV operations to GPU, GPUDirect | One NVIDIA Tesla K20c, 10-GbE NIC                                                                     | 12.92 | N/A
Mega-KV [3]      | Offloads only indexing operations to GPU, Intel DPDK (UDP)        | Single node: dual-socket Intel Xeon 8-core CPUs, one NVIDIA Tesla K40c GPU, one 40-GbE NIC per socket | 166   | 171
                 |                                                                   | 4 nodes: same config as above                                                                         | 623   | N/A
SHMEMCache [4]   | OpenSHMEM, RDMA, one-sided KV operations (CPU bypass)             | 10-core Intel Xeon CPU, IB ConnectX-3 NIC                                                             | ~0.9  | ~0.3
                 |                                                                   | 1,024 nodes on Titan cluster                                                                          | ~50   | ~12
KV-Direct [5]    | CPU bypass, offloads KV operations to programmable NIC (FPGA)     | Single server: one 40-GbE NIC, two PCIe Gen3 x8 links                                                 | 180   | 114
                 |                                                                   | Single server: ten 40-GbE NICs, one PCIe Gen3 x8 link per NIC                                         | 1220  | 610
8
GPU ACCELERATED IMKV SERVER
• KV operations are memory-intensive operations
• GPU with HBM is a perfect fit!
• But, GPU memory is not as BIG as system memory
• Mega-KV: Cache KVs in SysMem; Index KV on GPU
• Introduce a new hash table in GPU memory
• Compressed, fixed-width key
• Value : location (in system memory)
• Offload only indexing operations to GPUs
• CPU threads complete KV operations
How does Mega-KV work?
Kai Zhang, Kaibo Wang, Yuan Yuan, Lei Guo, Rubao Lee, and Xiaodong Zhang. 2015. Mega-KV: a case for GPUs to maximize the throughput of in-memory key-value stores. Proc. VLDB Endow. 8, 11 (July 2015), 1226-1237.
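Mega-KV's split design can be illustrated on the host: the index (GPU-resident in Mega-KV) maps a compressed, fixed-width key signature to a location in system memory, where the full key and value live. This is a hedged CPU-side sketch of the idea, not Mega-KV's actual code; the hash and signature choices are illustrative.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of Mega-KV's two-level design:
//  - the index (on the GPU in Mega-KV) stores only a fixed-width 32-bit key
//    signature and a location, so millions of entries fit in GPU memory;
//  - the full keys and values stay in large system memory.
struct Item { std::string key; std::string value; };

class SplitIndexKV {
public:
    void set(const std::string& key, const std::string& value) {
        items_.push_back({key, value});              // full KV stays in SysMem
        index_[signature(key)] = items_.size() - 1;  // index stores a location only
    }

    // GET: the index lookup (offloaded to the GPU in Mega-KV) returns a
    // location; the CPU then reads the item and verifies the full key,
    // since compressed signatures can collide.
    const std::string* get(const std::string& key) const {
        auto it = index_.find(signature(key));
        if (it == index_.end()) return nullptr;
        const Item& item = items_[it->second];
        return item.key == key ? &item.value : nullptr;  // collision check
    }

private:
    static uint32_t signature(const std::string& key) {
        return static_cast<uint32_t>(std::hash<std::string>{}(key));  // compressed key
    }
    std::unordered_map<uint32_t, size_t> index_;  // stands in for the GPU hash table
    std::vector<Item> items_;                     // stands in for system memory
};
```

The payoff of this split is capacity: only ~12 bytes per entry need to live in limited GPU memory, while multi-megabyte values stay in abundant host RAM.
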
9
GPU ACCELERATED IMKV SERVER (CONT’)
• Batching is required to achieve high throughput
• Data copies and CUDA kernel launches dominate the indexing process
→ longer response times for small requests
Can we do better? What is still missing in Mega-KV?
Small batch (1,000 keys): Data copy & kernel launch take longer than actual work
Large batch (50,000 keys): Multiple CUDA streams help overlap data copies and kernels, but underutilization still present…
cudaMemcpyHtoD + kernel + cudaMemcpyDtoH
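The batching trade-off can be captured with a back-of-envelope model: each offload pays a fixed cost (kernel launch plus copy setup) before any keys are processed, so effective throughput is n / (overhead + n/rate). The constants below are illustrative assumptions, not measurements from this work.

```cpp
// Back-of-envelope model of batched GPU offload.
// All constants are assumed, illustrative values.
double effective_mops(double batch_keys) {
    const double overhead_us = 20.0;    // assumed per-batch launch + copy setup cost
    const double rate_mops   = 1000.0;  // assumed steady-state indexing rate
                                        // (1000 MOPS == 1000 keys per microsecond)
    double total_us = overhead_us + batch_keys / rate_mops;
    return batch_keys / total_us;       // keys per microsecond == MOPS
}
```

Under these assumed numbers, a 1,000-key batch spends 20 µs on overhead for 1 µs of work, while a 50,000-key batch amortizes the same 20 µs over 50 µs of work — which is why small batches leave the GPU underutilized.
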
10
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
11
PROBLEM STATEMENT
• Can we further improve the throughput of GPU-based IMKV server?
• How to optimize data movement?
• Can we enable high-throughput indexing operations on GPU even without batching a large number of input requests?
• How to reduce overhead when offloading operations to GPU?
• How to maintain high GPU utilization?
Challenges
12
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
13
GPUDIRECT RDMA
• Introduced in CUDA 5; available on all Tesla GPUs starting from Kepler
• Direct transfers between GPUs and NICs over PCIe
• Eliminates CPU bandwidth and latency bottlenecks
Remote Direct Memory Access
https://developer.nvidia.com/gpudirect
14
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
15
OVERVIEW
• Avoid copy engine overheads
• SM-based copy (fused with processing kernel)
• Avoid copies from System Memory to Device Memory
• Direct access to pinned system memory
• Direct write to device memory using GPUDirect RDMA (GDR) through NVSHMEM
• Avoid kernel launch overheads
• Persistent CUDA kernel(s)
Challenges and Proposed Solutions
16
AVOIDING COPIES
Data and control paths
[Diagram: Client → IMKV server (CPU + GPU with Device Memory)]
1. SET/GET requests from clients; data can be put directly into GPU memory using GDR technology through NVSHMEM
2.a. CPU schedules memory copies: direct access to pinned SysMem or SM-based copy
2.b. CPU schedules GPU kernel(s)
3. GPU performs indexing operations
4. Results are copied back to system memory
5. CPU post-processes results and replies to the client if required
17
PERSISTENT CUDA KERNELS
• CPU thread(s) launch a persistent CUDA kernel when initializing the server
• CPU thread(s) assign and signal threadblocks/CTAs for new requests
• CTAs perform indexing operations and signal back upon completion
• CPU thread(s) check for completion and perform post-processing
• Key design considerations
• Low overhead signal mechanism
• Efficient resource utilization
Synchronization and Resource Use
[Diagram: CPU threads on the IMKV server signal CTAs of the persistent kernel on the GPU; data resides in Device Memory]
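The CPU-to-CTA handshake can be simulated on the host, with a thread standing in for one persistent CTA: the CPU bumps issued_seqnum to assign work, the worker polls it, processes the batch, and bumps completed_seqnum. This is a hedged sketch of the protocol only; the real design runs this loop inside a CUDA kernel and places the counters in pinned or device memory, as discussed on the following slides.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side simulation of the persistent-kernel protocol. The std::thread
// below stands in for one persistent CTA; the atomics stand in for the
// per-CTA issued_seqnum (is) / completed_seqnum (cs) counters.
struct WorkQueue {
    std::atomic<long> issued{0};      // CPU updates, CTA polls
    std::atomic<long> completed{0};   // CTA updates, CPU polls
    std::atomic<bool> shutdown{false};
    std::vector<int> batch;           // input buffer for the assigned work
    long result = 0;                  // output buffer
};

// The "persistent kernel": launched once, then serves many batches.
void persistent_worker(WorkQueue& q) {
    long seen = 0;
    while (!q.shutdown.load(std::memory_order_acquire)) {
        if (q.issued.load(std::memory_order_acquire) == seen) continue;  // poll is
        long sum = 0;                          // stand-in for indexing work
        for (int v : q.batch) sum += v;
        q.result = sum;
        ++seen;
        q.completed.store(seen, std::memory_order_release);  // signal cs
    }
}

// CPU side: assign one batch of work and wait for its completion.
long offload(WorkQueue& q, std::vector<int> keys) {
    q.batch = std::move(keys);
    long target = q.issued.load(std::memory_order_relaxed) + 1;
    q.issued.store(target, std::memory_order_release);               // signal is
    while (q.completed.load(std::memory_order_acquire) != target) {} // poll cs
    return q.result;
}
```

Launching `std::thread worker(persistent_worker, std::ref(q))` once at startup plays the role of launching the persistent kernel at server initialization: no further launch cost is paid per batch.
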
18
SIGNALING METHOD
• What to signal?
• From CPU to a particular CTA, that new work is assigned
• Size of the assigned work
• Pointers of input and output buffers
• How to signal?
• Two counters and a work queue per CTA
• issued_seqnum (is): CPU updates, CTA polls
• completed_seqnum (cs): CTA updates, CPU polls
How does the persistent kernel know when and where to start processing?
[Diagram: GPU running persistent CTAs 1…n; each CTA has an issued_seqnum (is) / completed_seqnum (cs) counter pair shared with the CPU]
19
SIGNALING METHOD (CONT’)
• Where to store the counters?
• Pinned System Memory
• Fast for CPU, slow for GPU to update and poll
• Device Memory
• Fast for GPU to poll and update, but extra copies are required for CPU
• Exploit the low-latency GDRCopy [10] library, which enables fast CPU updates
• issued_seqnum in system memory or GPU memory; completed_seqnum always in system memory
How does the persistent kernel know when and where to start processing?
[10] https://github.com/NVIDIA/gdrcopy
[Diagram: option A — both counters for every CTA in PinnedSysMem; vs. option B — issued_seqnum in DevMem, completed_seqnum in PinnedSysMem]
20
SIGNALING METHOD (CONT’)
Signal overhead on Tesla P100 with 56 SMs
[Plot: signaling latency (µs, 0-35) vs. number of CTAs (1-112) for three variants: Padding (128 B), No padding, GDRCopy-Sync]
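The "Padding (128 B)" series above reflects a classic detail: if per-CTA counters share a cache line, one side's polling and updates contend with its neighbors'. Padding each counter pair out to its own 128-byte line keeps the signals independent. A minimal sketch (the struct name is illustrative):

```cpp
#include <atomic>

// Each CTA's counter pair is padded to 128 bytes so that polling/updating
// one CTA's counters never contends on a line shared with another CTA's.
struct alignas(128) CtaSignal {
    std::atomic<long> issued_seqnum{0};     // CPU updates, CTA polls
    std::atomic<long> completed_seqnum{0};  // CTA updates, CPU polls
};

static_assert(sizeof(CtaSignal) == 128, "one signal pair per 128-byte line");
static_assert(alignof(CtaSignal) == 128, "aligned to the padding boundary");
```

An array of `CtaSignal`, one element per CTA, then places every counter pair on its own line by construction.
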
21
PERSISTENT KERNEL + GDR
Data and control paths
[Diagram: Client → IMKV server (CPU + GPU with Device Memory)]
1.a. Client sends SET/GET requests directly to GPU memory
1.b. Client notifies the server
2. CPU signals available CTAs
3. CTAs perform indexing operations
4. Results are copied back to system memory (SM-based or Copy Engine)
5. CPU post-processes results and replies to the client if required
22
BENEFITS OF PERSISTENT KERNELS + GDR
[Timeline diagrams]
Non-persistent kernel, without GDR (i.e., Mega-KV): for each batch the CPU issues cudaMemcpyH2D, the kernel, and cudaMemcpyD2H; CTAs 0-7 and CTAs 8-15 sit idle between these operations.
Persistent kernel, with GDR: the CPU only signals; each group of CTAs runs a kernel that fuses indexing with the D2H copy → better GPU utilization & performance.
Note: assumes processing two small batches of 1,000 keys, with 8 CTAs required per batch.
23
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
24
EXPERIMENTAL ENVIRONMENT
• Intel Xeon dual-socket 16-core Haswell (E5-2698) CPU 2.3 GHz, 256 GB Host Memory
• InfiniBand Connect-IB (MT27600) FDR (56Gbps)
• NVIDIA Tesla P100 GPUs connected via PCIe Gen 3
• 16 GB HBM2 memory, 56 SMs
• CUDA 9.0.176
• Benchmark
• Adopted from Mega-KV (http://kay21s.github.io/megakv/)
• Metric: Million search Operations Per Second (MOPS)
• i.e., throughput, higher is better
Pascal P100 GPU
25
BASELINE
Mega-KV on P100: tuning # of CUDA streams
[Plot: throughput (MOPS) vs. batch size (1,000-10,000 keys) for Mega-KV with 1/2/4/7/8 CUDA streams]
[Plot: throughput (MOPS) vs. batch size (20,000-100,000 keys) for Mega-KV with 1/2/4/7/8 CUDA streams]
1 CUDA stream performs best for small batches; 2/4 CUDA streams perform best for large batches
26
AVOID COPIES
NVSHMEM with GDR improves throughput!
[Plot: throughput (MOPS) vs. batch size (# of keys), small batches: Mega-KV-1 vs. G-IMKV-SysMem vs. G-IMKV-GDR-CE]
[Plot: throughput (MOPS) vs. batch size (# of keys), large batches: Mega-KV-4 vs. G-IMKV-SysMem vs. G-IMKV-GDR-CE]
Up to 1.2x-1.6x improvement
27
PERSISTENT KERNEL
Choosing the appropriate copy-back mechanism!
[Plot: throughput (MOPS) vs. batch size (# of keys), small batches: G-IMKV-PK-SysMem, G-IMKV-PK-SM, G-IMKV-PK-GDR-SM, G-IMKV-PK-GDR-CE — up to 1.9x]
[Plot: throughput (MOPS) vs. batch size (# of keys), large batches: same variants — up to 1.4x]
Using the Copy Engine helps when copying back large batches
28
GDR, PERSISTENT KERNEL
vs. Mega-KV
[Plot: throughput (MOPS) vs. batch size (# of keys) for Mega-KV-TUNED, G-IMKV-GDR-CE, G-IMKV-PK-GDR-OPT — up to 4.8x over Mega-KV]
PEAK throughput: ~888 MOPS
29
OUTLINE
Introduction
Problem Statement & Challenges
Background
Proposed Solutions
Evaluation
Conclusion
30
CONCLUSION & FUTURE WORK
• Use of GPUDirect RDMA eliminates data transfer and copy command overhead
• 1.2x speedup for large batches and 4.8x speedup for small batches compared to Mega-KV
• Persistent CUDA kernels mitigate kernel launch overhead
• 3.7x higher throughput for small batch sizes compared to Mega-KV
• Future Work & Open Issues
• Evaluate scalability: IMKV server(s) with multiple GPUs and NICs
• Extend to other streaming applications
• Investigate CUDA-level guarantees required for concurrent progress of persistent kernels and memory copies
Key takeaways