A GPU Accelerated Storage System
NetSysLab, The University of British Columbia
Abdullah Gharaibeh
with: Samer Al-Kiswany
Sathish Gopalakrishnan
Matei Ripeanu
GPUs radically change the cost landscape
$600
$1279
(Source: CUDA Guide)
Harnessing GPU Power is Challenging
more complex programming model
limited memory space
accelerator / co-processor model
Motivating Question: Does the 10x reduction in computation costs that GPUs offer change the way we design/implement distributed systems?
Context: Distributed Storage Systems
Distributed Systems: Computationally Intensive Operations
Operations: Hashing; Erasure coding; Encryption/decryption; Membership testing (Bloom filter); Compression
Techniques: Similarity detection; Content addressability; Security; Integrity checks; Redundancy; Load balancing; Summary cache; Storage efficiency
The operations are computationally intensive and limit performance
Distributed Storage System Architecture
[Figure: architecture diagram. Components: Client (Application, FS API, Access Module), Metadata Manager, Storage Nodes. Files are divided into a stream of blocks (b1, b2, b3, ..., bn). Layers shown: Application Layer; Techniques to Improve Performance/Reliability (Similarity Detection, Security, Integrity Checks, Redundancy); Enabling Operations (Compression, Encoding/Decoding, Encryption/Decryption, Hashing); Offloading Layer; CPU and GPU.]
Contributions:
A GPU accelerated storage system: design and prototype implementation that integrates similarity detection and GPU support
End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload
Challenges
Integration Challenges
Minimizing the integration effort
Transparency
Separation of concerns
Extracting Major Performance Gains
Hiding memory allocation overheads
Hiding data transfer overheads
Efficient utilization of the GPU memory units
Use of multi-GPU systems
[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn) feed Similarity Detection; Hashing runs on the GPU through the Offloading Layer.]
Past Work: Hashing on GPUs
HashGPU [1]: a library that exploits GPUs to support specialized uses of hashing in distributed storage systems
[1] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, "Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems", HPDC '08
However, significant speedup is achieved only for large blocks (>16MB), so HashGPU alone is not suitable for efficient similarity detection
One performance data point: HashGPU accelerates hashing by up to 5x compared to a single CPU core
[Figure: HashGPU on the GPU hashing a stream of blocks (b1, b2, b3, ..., bn).]
Profiling HashGPU
Amortizing memory allocation and overlapping data transfers and computation may bring important benefits
At least 75% overhead
CrystalGPU
CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations
[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn) feed Similarity Detection; within the Offloading Layer, HashGPU runs on top of CrystalGPU, which drives the GPU.]
One performance data point: CrystalGPU improves the speedup of the HashGPU library by more than one order of magnitude
CrystalGPU Opportunities and Enablers
Opportunity: Reusing GPU memory buffers
Enabler: a high-level memory manager
Opportunity: overlap the communication and computation
Enabler: double buffering and asynchronous kernel launch
Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
Enabler: a task queue manager
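Two of the enablers above, the memory manager and the task queue, can be sketched in plain Python. This is only an illustration under stated assumptions: `BufferPool`, `hash_blocks`, the buffer size, and the use of SHA-1 are hypothetical choices, worker threads stand in for GPUs, and a `bytearray` copy stands in for the host-to-device transfer. It is not CrystalGPU's API, and it does not model double buffering or asynchronous kernel launch.

```python
import hashlib
import queue
import threading


class BufferPool:
    """Hypothetical memory manager: allocate buffers once, then reuse them."""

    def __init__(self, count, size):
        self.allocations = count          # buffers are allocated only here
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def acquire(self):
        return self._free.get()           # blocks until a buffer is free

    def release(self, buf):
        self._free.put(buf)


def hash_blocks(blocks, num_devices=2, buffer_size=1024):
    """Hash blocks with num_devices workers fed from one shared task queue."""
    if any(len(b) > buffer_size for b in blocks):
        raise ValueError("block larger than buffer")
    pool = BufferPool(count=num_devices, size=buffer_size)
    tasks = queue.Queue()
    results = [None] * len(blocks)

    def device_worker():
        while True:
            item = tasks.get()
            if item is None:              # sentinel: no more work
                return
            index, block = item
            buf = pool.acquire()
            try:
                buf[:len(block)] = block  # stand-in for the host-to-device copy
                # stand-in for the hashing kernel
                results[index] = hashlib.sha1(buf[:len(block)]).hexdigest()
            finally:
                pool.release(buf)         # buffer goes back for reuse

    workers = [threading.Thread(target=device_worker) for _ in range(num_devices)]
    for w in workers:
        w.start()
    for i, block in enumerate(blocks):
        tasks.put((i, block))
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return results, pool.allocations


blocks = [b"a" * 100, b"b" * 200, b"c" * 50, b"d" * 1024]
digests, allocations = hash_blocks(blocks)
```

However many blocks are hashed, only `num_devices` buffers are ever allocated, which is the point of amortizing allocation; the shared queue lets whichever worker is idle pick up the next block.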
[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn) feed Similarity Detection; within the Offloading Layer, HashGPU runs on top of CrystalGPU (Memory Manager, Task Queue, Double Buffering), which drives the GPU.]
Experimental Evaluation:
CrystalGPU evaluation
End-to-end system evaluation
CrystalGPU Evaluation
Testbed: A machine with
CPU: Intel quad-core 2.66 GHz with PCI Express 2.0 x16 bus
GPU: NVIDIA GeForce dual-GPU 9800GX2
Experiment space:
HashGPU/CrystalGPU vs. original HashGPU
Three optimizations: buffer reuse; overlapping communication and computation; exploiting the two GPUs
[Figure: files divided into a stream of blocks (b1, b2, b3, ..., bn) are hashed by HashGPU running on top of CrystalGPU on the GPU.]
HashGPU Performance on top of CrystalGPU
The gains enabled by the three optimizations can be realized!
Baseline: single CPU core
End-to-End System Evaluation
Testbed
– Four storage nodes and one metadata server
– One client with 9800GX2 GPU
Three implementations
– No similarity detection (without-SD)
– Similarity detection
  • on CPU (4 cores @ 2.6GHz) (SD-CPU)
  • on GPU (9800 GX2) (SD-GPU)
Three workloads
– Real checkpointing workload
– Completely similar files: all possible gains in terms of data saving
– Completely different files: only overheads, no gains
Success metrics:
– System throughput
– Impact on a competing application: compute or I/O intensive
System Throughput (Checkpointing Workload)
The integrated system preserves the throughput gains on a realistic workload!
1.8x improvement
System Throughput (Synthetic Workload of Similar Files)
Offloading to the GPU enables close to optimal performance!
Room for 2x improvement
Impact on Competing (Compute Intensive) Application
Writing Checkpoints back to back
2x improvement
Frees resources (CPU) to competing applications while preserving throughput gains!
7% reduction
Summary
We present the design and implementation of a distributed storage system that integrates GPU power
We present CrystalGPU: a management layer that transparently enables common GPU optimizations across GPGPU applications
We empirically demonstrate that employing the GPU enables close to optimal system performance
We shed light on the impact of GPU offloading on competing applications running on the same node
netsyslab.ece.ubc.ca
[Figure: File A (blocks X, Y, Z) and File B (blocks W, Y, Z) are each hashed block by block; the hashes feed Similarity Detection.]
Only the first block is different, potentially improving write throughput
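A minimal sketch of hash-based similarity detection for the two files above. The block size, the function names, and the choice of SHA-1 are assumptions made for illustration; the slides do not specify them.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical tiny block size so the example is easy to read


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide a file's bytes into a stream of fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hashes(data: bytes) -> list[str]:
    """Hash every block; the digest identifies the block's content."""
    return [hashlib.sha1(block).hexdigest() for block in split_blocks(data)]


def blocks_to_write(new_file: bytes, stored_hashes: set[str]) -> list[int]:
    """Return the indices of blocks whose hash is not already stored."""
    return [i for i, h in enumerate(block_hashes(new_file))
            if h not in stored_hashes]


file_a = b"XXXXYYYYZZZZ"   # blocks X, Y, Z
file_b = b"WWWWYYYYZZZZ"   # blocks W, Y, Z

stored = set(block_hashes(file_a))          # File A is already in the system
new_blocks = blocks_to_write(file_b, stored)
print(new_blocks)  # [0]
```

Only block 0 of File B has to be transferred and stored; the other two blocks are recognized by their hashes, which is where the potential write-throughput gain comes from.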
Execution Path on GPU – Data Processing Application
$$T_{Total} = \underbrace{T_{Preprocessing}}_{1} + \underbrace{T_{DataHtoG}}_{2} + \underbrace{T_{Processing}}_{3} + \underbrace{T_{DataGtoH}}_{4} + \underbrace{T_{PostProc}}_{5}$$
1. Preprocessing (memory allocation)
2. Data transfer in
3. GPU Processing
4. Data transfer out
5. Postprocessing
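The five stages above can be turned into a rough cost model to see why amortizing memory allocation and overlapping transfers with computation matter. The sketch below is a simplification, not a measurement: all timings are hypothetical units, `total_time` is an illustrative name, and the overlapped case assumes an idealized three-stage pipeline in which transfer in, processing, and transfer out of different blocks proceed concurrently.

```python
def total_time(n_blocks, t_alloc, t_in, t_proc, t_out, t_post,
               reuse_buffers=False, overlap=False):
    """Estimated time to process n_blocks (n_blocks >= 1) under a simple stage model."""
    # Stage 1: allocate for every block, or only once if buffers are reused.
    alloc = t_alloc if reuse_buffers else n_blocks * t_alloc
    if overlap:
        # Idealized pipeline over stages 2-4: the first block pays all three
        # stages, then one block completes every max(stage time) units.
        middle = (t_in + t_proc + t_out) + (n_blocks - 1) * max(t_in, t_proc, t_out)
    else:
        # Stages 2-4 run back to back for every block.
        middle = n_blocks * (t_in + t_proc + t_out)
    # Stage 5: postprocessing is charged once per block.
    return alloc + middle + n_blocks * t_post


# Hypothetical per-block stage times, in arbitrary units.
params = dict(n_blocks=10, t_alloc=4, t_in=2, t_proc=3, t_out=1, t_post=1)

baseline = total_time(**params)
with_reuse = total_time(**params, reuse_buffers=True)
with_both = total_time(**params, reuse_buffers=True, overlap=True)
print(baseline, with_reuse, with_both)  # 110 74 47
```

With these made-up numbers, buffer reuse removes most of the allocation cost and overlap hides the transfers behind the slowest stage; real gains depend on the actual stage times and on what the hardware can overlap.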