increasing the efficiency of your gpu-enabled cluster with rcuda · 2020. 1. 14. · hpc advisory...
TRANSCRIPT
![Page 1: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/1.jpg)
Increasing the efficiency of your
GPU-enabled cluster
with rCUDA
Federico Silla Technical University of Valencia
Spain
![Page 2: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/2.jpg)
HPC Advisory Council European Conference 2014, Leipzig 2/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 3: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/3.jpg)
HPC Advisory Council European Conference 2014, Leipzig 3/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 4: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/4.jpg)
HPC Advisory Council European Conference 2014, Leipzig 4/87
Current computing needs
• Many applications require a lot of computing
resources
• Execution time is usually increased
• Applications are accelerated to get their
execution time reduced
• GPU computing has experienced a
remarkable growth in the last years
![Page 5: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/5.jpg)
HPC Advisory Council European Conference 2014, Leipzig 5/87
GPUs reduce energy and time
On GPU On CPU On GPU On CPU On GPU On CPU On GPU On CPU
blastp –db sorted_env_nr –query SequenceLength_00001300.txt -num_threads X -gpu [t|f]
Dual socket E5-2620 v2 Intel Xeon node with NVIDIA K20 GPU
GPU-Blast: Accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool
2.36x 3.56x
![Page 6: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/6.jpg)
HPC Advisory Council European Conference 2014, Leipzig 6/87
Current GPU computing facilities
The basic building block is a node with one or more GPUs
![Page 7: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/7.jpg)
HPC Advisory Council European Conference 2014, Leipzig 7/87
From the programming point of view:
A set of nodes, each one with:
one or more CPUs (with several cores per CPU)
one or more GPUs (typically between 1 and 4)
An interconnection network
Current GPU computing facilities
![Page 8: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/8.jpg)
HPC Advisory Council European Conference 2014, Leipzig 8/87
A GPU computing facility is usually a set of independent self-
contained nodes that leverage the shared-nothing approach
Nothing is directly shared among nodes (MPI required for aggregating
computing resources within the cluster)
GPUs can only be used within the node they are attached to
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU
GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Current GPU computing facilities
![Page 9: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/9.jpg)
HPC Advisory Council European Conference 2014, Leipzig 9/87
For many workloads, GPUs may be idle for long periods of time:
• Initial acquisition costs not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power
Money leakage in current clusters?
Time (s)
Idle
Pow
er
(Watts)
1 GPU node
4 GPUs node
• 1 GPU node: 2 E5-2620V2 sockets and 32GB DDR3 RAM. Tesla K20 GPU
• 4 GPUs node: 2 E5-2620V2 sockets and 128GB DDR3 RAM. 4 Tesla K20 GPUs
25%
![Page 10: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/10.jpg)
HPC Advisory Council European Conference 2014, Leipzig 10/87
• Applications can only use the GPUs located
within their node
• Non-accelerated applications keep GPUs idle
in the nodes when they use all the cores
• Multi-GPU applications running on a subset
of nodes cannot make use of the tremendous
GPU resources available at other cluster
nodes (even if they are idle)
Further concerns in accelerated clusters
![Page 11: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/11.jpg)
HPC Advisory Council European Conference 2014, Leipzig 11/87
What is missing is ...
... some flexibility for
using the GPUs in the
cluster
We need something else in the cluster
![Page 12: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/12.jpg)
HPC Advisory Council European Conference 2014, Leipzig 12/87
How to address current concerns:
Increasing flexibility
A way of addressing the “idle” concern is
by sharing the GPUs present in the cluster
among all the nodes and, once GPUs are
shared, their amount can be reduced
This would increase GPU utilization, also
lowering power consumption, at the same
time that initial acquisition costs are reduced
![Page 13: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/13.jpg)
HPC Advisory Council European Conference 2014, Leipzig 13/87
• This new cluster configuration requires:
• A way of seamlessly sharing GPUs
across nodes in the cluster (remote
GPU virtualization)
• Enhanced job schedulers that take into
account the new virtual GPUs
What is needed for increased flexibility?
![Page 14: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/14.jpg)
HPC Advisory Council European Conference 2014, Leipzig 14/87
Remote GPU virtualization allows a new vision of a GPU
deployment, moving from the usual cluster configuration:
Remote GPU virtualization envision
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e C
PU
GPU
GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
to the following one ….
![Page 15: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/15.jpg)
HPC Advisory Council European Conference 2014, Leipzig 15/87
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Physical configuration
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
Interconnection Network
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Logical connections
Logical configuration
Remote GPU virtualization envision
![Page 16: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/16.jpg)
HPC Advisory Council European Conference 2014, Leipzig 16/87
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Physical configuration
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
Interconnection Network
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Logical connections
Logical configuration
Busy cores are no longer a problem
![Page 17: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/17.jpg)
HPC Advisory Council European Conference 2014, Leipzig 17/87
Without GPU virtualization
GPU virtualization is also useful for multi-GPU applications
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
Interconnection Network
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Logical connections
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e
CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
With GPU virtualization
Many GPUs in the
cluster can be provided
to the application
Only the GPUs in the
node can be provided
to the application
Multi-GPU applications get benefit
![Page 18: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/18.jpg)
HPC Advisory Council European Conference 2014, Leipzig 18/87
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
Interconnection Network
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
PC
I-e CP
U
GPU GPU mem
Ma
in M
em
ory
Network
GPU GPU mem
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
CP
U
Ma
in M
em
ory
Network
PC
I-e
PC
I-e
CP
U
Ma
in M
em
ory
Network
Interconnection Network
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Remote GPU virtualization envision
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
GPU GPU mem
Without GPU
virtualization
With GPU
virtualization
Real local GPUs
Virtualized remote GPUs
GPU virtualization allows all nodes to
access all GPUs
![Page 19: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/19.jpg)
HPC Advisory Council European Conference 2014, Leipzig 19/87
+ GPU
+ GPU
+ GPU
+ GPU
+ GPU
One step further:
enhancing the scheduling process so
that GPU servers are put into low-power
sleeping modes as soon as their
acceleration features are not required
More about reducing energy consumption
![Page 20: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/20.jpg)
HPC Advisory Council European Conference 2014, Leipzig 20/87
Going even beyond:
consolidating GPUs into dedicated
GPU boxes (no CPU power required)
allowing GPU task migration
TRUE GREEN GPU
COMPUTING
Networked GPU boxes
![Page 21: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/21.jpg)
HPC Advisory Council European Conference 2014, Leipzig 21/87
Box A has 4 GPUs and only one is busy
Box B has 8 GPUs but only two are busy
Move jobs from Box B to Box A and switch
off Box B
Migration should be transparent to
applications (decided by the global
scheduler)
TRUE GREEN GPU
COMPUTING
GPU task migration
Box A
Box B
![Page 22: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/22.jpg)
HPC Advisory Council European Conference 2014, Leipzig 22/87
Box A has 4 GPUs and only one is busy
Box B has 8 GPUs but only two are busy
Move jobs from Box B to Box A and switch
off Box B
Migration should be transparent to
applications (decided by the global
scheduler)
TRUE GREEN GPU
COMPUTING
GPU task migration
Box A
Box B
![Page 23: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/23.jpg)
HPC Advisory Council European Conference 2014, Leipzig 23/87
0 2000 4000 6000 8000 10000 12000 14000 16000 180000
10
20
30
40
50
60
70
80
90
100Influence of data transfers for SGEMM
Pinned Memory
Non-Pinned Memory
Matrix Size
Tim
e d
evo
ted
to
da
ta tra
nsfe
rs (
%)
Main GPU virtualization drawback is the increased
latency and reduced bandwidth to the remote GPU
Problem with remote GPU virtualization
Data from a matrix-matrix
multiplication using a local GPU!!!
![Page 24: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/24.jpg)
HPC Advisory Council European Conference 2014, Leipzig 24/87
Several efforts have been made regarding GPU
virtualization during the last years:
rCUDA (CUDA 6.0)
GVirtuS (CUDA 3.2)
DS-CUDA (CUDA 4.1)
vCUDA (CUDA 1.1)
GViM (CUDA 1.1)
GridCUDA (CUDA 2.3)
V-GPU (CUDA 4.0)
Remote GPU virtualization frameworks
Publicly available
NOT publicly available
![Page 25: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/25.jpg)
HPC Advisory Council European Conference 2014, Leipzig 25/87
Performance comparison of GPU virtualization solutions:
Intel Xeon E5-2620 (6 cores) 2.0 GHz (SandyBrigde architecture)
GPU NVIDIA Tesla K20
Mellanox ConnectX-3 single-port InfiniBand Adapter (FDR)
SX6025 Mellanox switch
CentOS 6.3 + Mellanox OFED 1.5.3
Pageable
H2D
Pinned
H2D
Pageable
D2H
Pinned
D2H
CUDA 34,3 4,3 16,2 5,2
rCUDA 94,5 23,1 292,2 6,0
GVirtuS 184,2 200,3 168,4 182,8
DS-CUDA 45,9 - 26,5 -
Latency (measured by transferring 64 bytes)
Remote GPU virtualization frameworks
![Page 26: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/26.jpg)
HPC Advisory Council European Conference 2014, Leipzig 26/87
MB
/sec
Transfer size (MB)
Host-to-Device
pageable memory
Device-to-Host
pageable memory
Host-to-Device
pinned memory
Device-to-Host
pinned memory
MB
/sec
Transfer size (MB)
MB
/sec
Transfer size (MB)
MB
/sec
Transfer size (MB)
Bandwidth of a copy between GPU and CPU memories
Remote GPU virtualization frameworks
![Page 27: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/27.jpg)
HPC Advisory Council European Conference 2014, Leipzig 27/87
Applications tested with rCUDA
rCUDA has been successfully tested with the following
applications:
NVIDIA CUDA SDK Samples
LAMMPS
WideLM
CUDASW++
OpenFOAM
HOOMDBlue
mCUDA-MEME
The list keeps growing
GPU-Blast
Gromacs
GAMESS
DL-POLY
HPL
![Page 28: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/28.jpg)
HPC Advisory Council European Conference 2014, Leipzig 28/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 29: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/29.jpg)
HPC Advisory Council European Conference 2014, Leipzig 29/87
It is useful for:
Applications that do not make use of GPUs all the time
(moderate level of data parallelism)
Applications for multi-GPU computing
A framework enabling a CUDA-based
application running in one (or some) node(s)
to access GPUs in other nodes
Basics of the rCUDA framework
![Page 30: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/30.jpg)
HPC Advisory Council European Conference 2014, Leipzig 30/87
Basics of the rCUDA framework
Basic CUDA behavior
![Page 31: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/31.jpg)
HPC Advisory Council European Conference 2014, Leipzig 31/87
Basics of the rCUDA framework
![Page 32: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/32.jpg)
HPC Advisory Council European Conference 2014, Leipzig 32/87
Basics of the rCUDA framework
![Page 33: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/33.jpg)
HPC Advisory Council European Conference 2014, Leipzig 33/87
Basics of the rCUDA framework
![Page 34: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/34.jpg)
HPC Advisory Council European Conference 2014, Leipzig 34/87
How to declare remote GPUs
Environment variables are properly initialized in the client side and
used by the rCUDA client (transparently to the application)
Server name/IP address : GPU
Amount of GPUs exposed to
applications
![Page 35: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/35.jpg)
HPC Advisory Council European Conference 2014, Leipzig 35/87
rCUDA uses a proprietary
communication protocol
Example:
1) initialization
2) memory allocation on the
remote GPU
3) CPU to GPU memory
transfer of the input data
4) kernel execution
5) GPU to CPU memory
transfer of the results
6) GPU memory release
7) communication channel
closing and server
process finalization
Basics of the rCUDA framework
![Page 36: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/36.jpg)
HPC Advisory Council European Conference 2014, Leipzig 36/87
rCUDA presents a modular architecture
![Page 37: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/37.jpg)
HPC Advisory Council European Conference 2014, Leipzig 37/87
Use of GPUDirect RDMA to move data between GPUs
Pipelined transfers to improve performance
Preallocated pinned memory buffers
Optimal pipeline block size
rCUDA uses optimized transfers
rCUDA features optimized communications:
![Page 38: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/38.jpg)
HPC Advisory Council European Conference 2014, Leipzig 38/87
Pipeline block size for InfiniBand FDR
• NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch
It was
2MB with
IB QDR
Basic performance analysis
![Page 39: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/39.jpg)
HPC Advisory Council European Conference 2014, Leipzig 39/87
Host-to-Device
pinned memory
Bandwidth of a copy between GPU and CPU memories
Host-to-Device
pageable memory
Basic performance analysis
![Page 40: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/40.jpg)
HPC Advisory Council European Conference 2014, Leipzig 40/87
Latency study: copy a small dataset (64 bytes)
Basic performance analysis
![Page 41: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/41.jpg)
HPC Advisory Council European Conference 2014, Leipzig 41/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 42: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/42.jpg)
HPC Advisory Council European Conference 2014, Leipzig 42/87
Test system:
Intel Xeon E5-2620 (6 cores) 2.0 GHz (SandyBrigde architecture)
GPU NVIDIA Tesla K20
Mellanox ConnectX-3 single-port InfiniBand Adapter (FDR)
SX6025 Mellanox switch
Cisco switch SLM2014 (1Gbps Ethernet)
CentOS 6.3 + Mellanox OFED 1.5.3
Performance of the rCUDA framework
![Page 43: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/43.jpg)
HPC Advisory Council European Conference 2014, Leipzig 43/87
BT IE RE AT TA FF SK CB MC LU MB SC FI HE BN IS LC CG TT IN F2 SN FD BF VR RG MI DI0
1
2
3
4
5
6
7
8
9
CUDA
rCUDA FDR
rCUDA Gb
No
rma
lized
tim
e
Test CUDA SDK examples
Single-GPU applications
![Page 44: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/44.jpg)
HPC Advisory Council European Conference 2014, Leipzig 44/87
CUDASW++
Bioinformatics software for Smith-Waterman protein database searches
144 189 222 375 464 567 657 729 850 1000 1500 2005 2504 3005 3564 4061 4548 4743 5147 5478
0
5
10
15
20
25
30
35
40
45
0
2
4
6
8
10
12
14
16
18
FDR Overhead QDR Overhead GbE Overhead CUDA
rCUDA FDR rCUDA QDR rCUDA GbE
Sequence Length
rCU
DA
Ove
rhe
ad
(%
)
Exe
cu
tio
n T
ime
(s)
Single-GPU applications
![Page 45: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/45.jpg)
HPC Advisory Council European Conference 2014, Leipzig 45/87
GPU-Blast
Accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool
2 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400
0
50
100
150
200
250
300
350
0
10
20
30
40
50
60
FDR Overhead QDR Overhead GbE Overhead CUDA
rCUDA FDR rCUDA QDR rCUDA GbE
Sequence Length
rCU
DA
Ove
rhe
ad
(%
)
Exe
cu
tio
n T
ime
(s)
Single-GPU applications
![Page 46: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/46.jpg)
HPC Advisory Council European Conference 2014, Leipzig 46/87
Test system for Multi-GPU applications:
For CUDA tests one node with:
2 Quad-Core Intel Xeon E5440 processors
Tesla S2050 (4 Tesla GPUs)
Each thread (1-4) uses one GPU
For rCUDA tests 8 nodes with:
2 Quad-Core Intel Xeon E5520 processors
1 NVIDIA Tesla C2050
InfiniBand QDR
Test running in one node and using up to all the GPUs from the others
Each thread (1-8) uses one GPU
Multi-GPU applications
![Page 47: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/47.jpg)
HPC Advisory Council European Conference 2014, Leipzig 47/87
MonteCarlo MultiGPU (from NVIDIA SDK)
Multi-GPU applications
![Page 48: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/48.jpg)
HPC Advisory Council European Conference 2014, Leipzig 48/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 49: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/49.jpg)
HPC Advisory Council European Conference 2014, Leipzig 49/87
• SLURM (Simple Linux Utility for Resource Management) job scheduler
• SLURM does not understand about virtualized GPUs
• Add a new GRES (general resource) in order to manage virtualized GPUs
• Where the GPUs are in the system is completely transparent to the user
• In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job. The amount of GPU memory required by the job may also be specified
Integrating rCUDA with SLURM
![Page 50: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/50.jpg)
HPC Advisory Council European Conference 2014, Leipzig 50/87
The basic idea about SLURM
![Page 51: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/51.jpg)
HPC Advisory Council European Conference 2014, Leipzig 51/87
GPUs are
decoupled
from nodes
The basic idea about SLURM + rCUDA
All jobs are
executed in
less time
![Page 52: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/52.jpg)
HPC Advisory Council European Conference 2014, Leipzig 52/87
GPUs are
decoupled
from nodes
All jobs are
executed
even in
less time
Sharing remote GPUs among jobs
GPU 0 is
scheduled to
be shared
among jobs
![Page 53: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/53.jpg)
HPC Advisory Council European Conference 2014, Leipzig 53/87
ClusterName=rcu
…
GresTypes=gpu,rgpu
…
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1
CoresPerSocket=4 ThreadsPerCore=2
RealMemory=7990 Gres=rgpu:4,gpu:4
slurm.conf
SLURM configuration
![Page 54: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/54.jpg)
HPC Advisory Council European Conference 2014, Leipzig 54/87
NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16
OS=Linux RealMemory=7682 Sockets=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35
SlurmdStartTime=2013-04-30T10:02:04
scontrol show node
SLURM configuration
![Page 55: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/55.jpg)
HPC Advisory Council European Conference 2014, Leipzig 55/87
Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
gres.conf
SLURM configuration
![Page 56: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/56.jpg)
HPC Advisory Council European Conference 2014, Leipzig 56/87
$
$ srun -N1 --gres=rgpu:4:512M script.sh...
$
Submit a job
RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
Submitting jobs with SLURM
Environment variables are initialized by SLURM and used by
the rCUDA client (transparently to the user)
Server name/IP address : GPU
![Page 57: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/57.jpg)
HPC Advisory Council European Conference 2014, Leipzig 57/87
Test bench for checking SLURM correctness:
Old “heterogeneous” cluster with 28 nodes:
24 nodes without GPU with 2 AMD Opteron 1.4GHz, 2GB RAM
1 node without GPU, QuadCore i7 3.5GHz, 8GB de RAM
3 nodes with GeForce GTX 670 or 590 GPUs 2GB, QuadCore i7 3GHz
1Gbps Ethernet interconnect
Scientific Linux release 6.1 and also Open Suse 11.2 (dual boot)
Integrating rCUDA with SLURM
![Page 58: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/58.jpg)
HPC Advisory Council European Conference 2014, Leipzig 58/87
Test bench for analyzing SLURM+rCUDA performance:
InfiniBand ConnectX-3 based cluster
CentOS 6.4 Linux
Dual socket E5-2620 v2 Intel Xeon based nodes:
3 nodes without GPU
1 node with NVIDIA K40 GPU
10 nodes with NVIDIA K20 GPU
1 node with 4 NVIDIA K20 GPU
First tests carried out. More complex tests in the way
Integrating rCUDA with SLURM
1 node hosting the main SLURM controller
1 node without GPU
1 node with the K20 GPU
1 node with the K40 GPU
![Page 59: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/59.jpg)
HPC Advisory Council European Conference 2014, Leipzig 59/87
GPUBlast using K40 GPU remotely
Theoretical CUDA
Exclusive CUDA
Shared CUDA
Theoretical rCUDA (Eth)
Exclusive rCUDA (Eth)
Shared rCUDA (Eth)
Theoretical rCUDA (IB)
Exclusive rCUDA (IB)
Shared rCUDA (IB)
![Page 60: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/60.jpg)
HPC Advisory Council European Conference 2014, Leipzig 60/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 61: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/61.jpg)
HPC Advisory Council European Conference 2014, Leipzig 61/87
Why rCUDA with virtual machines?
• Current clusters frequently leverage virtual machines in order to attain energy and cost reductions
• Xen and KVM virtual machines are commonly leveraged
• Applications being executed inside virtual machines need access to GPU computing resources
• Different Xen and KVM virtual machines can share an InfiniBand network card thanks to the virtualization features of the InfiniBand driver
• NVIDIA drivers do not provide virtualization features, thus avoiding the concurrent usage of the GPU from several virtual machines
• rCUDA can be used to provide concurrent access to GPUs
![Page 62: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/62.jpg)
HPC Advisory Council European Conference 2014, Leipzig 62/87
How to connect an IB card to a VM
Real IB card Virtual IB cards (VFs)
• PCI pass-through is used to assign an IB card (either real or virtual) to a given VM
• The IB card manages the several virtual copies
![Page 63: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/63.jpg)
HPC Advisory Council European Conference 2014, Leipzig 63/87
Performance analysis on KVM VMs
RDMA write
Bandw
idth
(M
B/s
)
Transfer size (Bytes)
Computer with
KVM VMs
Remote
computer
RealX3: actual Connect-X3 (1 port) card
VF: virtual copy used from host OS
VM: virtual card copy used from VM
RDMA write
Norm
aliz
ed
Ba
ndw
idth
Transfer size (Bytes)
RDMA write Norm
aliz
ed late
ncy
Transfer size (Bytes)
![Page 64: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/64.jpg)
HPC Advisory Council European Conference 2014, Leipzig 64/87
CUDA BWtest from a KVM VM
Computer with
KVM VMs
Remote computer
with Tesla K20
![Page 65: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/65.jpg)
HPC Advisory Council European Conference 2014, Leipzig 65/87
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
![Page 66: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/66.jpg)
HPC Advisory Council European Conference 2014, Leipzig 66/87
rCUDA and low-power processors
• There is a clear trend to improve the performance-power ratio in HPC facilities:
• By leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, …)
• More recently, by using low-power processors (Intel Atom, ARM, …)
• rCUDA allows to attach a “pool of accelerators” to a computing facility:
• Less accelerators than nodes accelerators shared among nodes
• Performance-power ratio probably improved further
• How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs?
• Which is the best configuration?
• Is energy effectively saved?
![Page 67: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/67.jpg)
HPC Advisory Council European Conference 2014, Leipzig 67/87
rCUDA and low-power processors
• The computing platforms analyzed in this study are:
• KAYLA: NVIDIA Tegra 3 ARM Cortex A9 quad-core CPU (1.4 GHz), 2GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller
• ATOM: Intel Atom quad-core CPU s1260 (2.0 GHz), 8GB DDR3 RAM and Intel I350 Gigabit Ethernet controller. No PCIe connector for a GPU
• XEON: Intel Xeon X3440 quad-core CPU (2.4GHz), 8GB DDR3 RAM and 82574L Gigabit Ethernet controller
• CARMA: NVIDIA Quadro 1000M GPU (96 cores) with 2GB DDR5 RAM. The rest of the system is the same as for KAYLA
• FERMI: NVIDIA GeForce GTX480 “Fermi” GPU (448 cores) with 1,280 MB DDR3/GDDR5 RAM
• The accelerators analyzed are: CARMA platform
![Page 68: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/68.jpg)
HPC Advisory Council European Conference 2014, Leipzig 68/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
![Page 69: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/69.jpg)
HPC Advisory Council European Conference 2014, Leipzig 69/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
• The XEON+FERMI combination achieves the best performance
![Page 70: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/70.jpg)
HPC Advisory Council European Conference 2014, Leipzig 70/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
• The XEON+FERMI combination achieves the best performance
• LAMMPS makes a more intensive use of the GPU than CUDASW++
![Page 71: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/71.jpg)
HPC Advisory Council European Conference 2014, Leipzig 71/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
• The XEON+FERMI combination achieves the best performance
• LAMMPS makes a more intensive use of the GPU than CUDASW++
• The lower performance of the PCIe bus in Kayla reduces performance
![Page 72: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/72.jpg)
HPC Advisory Council European Conference 2014, Leipzig 72/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
• The XEON+FERMI combination achieves the best performance
• LAMMPS makes a more intensive use of the GPU than CUDASW++
• The lower performance of the PCIe bus in Kayla reduces performance
• The 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU
![Page 73: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/73.jpg)
HPC Advisory Council European Conference 2014, Leipzig 73/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
CARMA
requires
less power
![Page 74: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/74.jpg)
HPC Advisory Council European Conference 2014, Leipzig 74/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
When execution time
and required power are
combined into energy,
XEON+FERMI is more
efficient
![Page 75: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/75.jpg)
HPC Advisory Council European Conference 2014, Leipzig 75/87
rCUDA and low-power processors
• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)
Summing up:
- If execution time is a concern, then use XEON+FERMI
- If power is to be minimized, CARMA is a good choice
- The lower performance of KAYLA’s PCIe decreases the interest of
this option for single system local-GPU usage
- If energy must be minimized, XEON+FERMI is the right choice
![Page 76: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/76.jpg)
HPC Advisory Council European Conference 2014, Leipzig 76/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs
• The use of the CUADRO GPU within the CARMA system is discarded as rCUDA server due to its poor performance. Only the FERMI GPU is considered
• Six different combinations of rCUDA clients and servers are feasible:
• Only one client system and one server system are used at each experiment
Combination Client Server
1 KAYLA KAYLA+FERMI
2 ATOM KAYLA+FERMI
3 XEON KAYLA+FERMI
4 KAYLA XEON+FERMI
5 ATOM XEON+FERMI
6 XEON XEON+FERMI
![Page 77: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/77.jpg)
HPC Advisory Council European Conference 2014, Leipzig 77/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs CUDASW++
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 78: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/78.jpg)
HPC Advisory Council European Conference 2014, Leipzig 78/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs CUDASW++
The KAYLA-based
server presents
much higher
execution time
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 79: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/79.jpg)
HPC Advisory Council European Conference 2014, Leipzig 79/87
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs
The KAYLA-based
server presents
much higher
execution time
CUDASW++
The network
interface for
KAYLA delivers
much less than the
expected 1Gbps
bandwidth
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 80: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/80.jpg)
HPC Advisory Council European Conference 2014, Leipzig 80/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs CUDASW++
The lower
bandwidth of
KAYLA (and ATOM)
can also be noticed
when using the
XEON server
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 81: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/81.jpg)
HPC Advisory Council European Conference 2014, Leipzig 81/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs CUDASW++
Most of the power and
energy is required by
the server side, as it
hosts the GPU
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 82: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/82.jpg)
HPC Advisory Council European Conference 2014, Leipzig 82/87
rCUDA and low-power processors
• 2nd analysis: use of remote GPUs CUDASW++
The much shorter
execution time of
XEON-based servers
render more energy-
efficient systems
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA ATOM XEON KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 83: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/83.jpg)
HPC Advisory Council European Conference 2014, Leipzig 83/87
rCUDA and low-power processors
LAMMPS • 2nd analysis: use of remote GPUs
Combination Client Server
1 KAYLA KAYLA+FERMI
2 KAYLA XEON+FERMI
3 ATOM XEON+FERMI
4 XEON XEON+FERMI
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 84: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/84.jpg)
HPC Advisory Council European Conference 2014, Leipzig 84/87
rCUDA and low-power processors
LAMMPS • 2nd analysis: use of remote GPUs
Similar
conclusions to
CUDASW++ apply
to LAMMPS, due
to the network
bottleneck
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
KAYLA KAYLA ATOM XEON
KAYLA + FERMI XEON + FERMI
![Page 85: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/85.jpg)
HPC Advisory Council European Conference 2014, Leipzig 85/87
rCUDA and low-power processors
• Atom-based systems with PCIe 2.0 x8 may hold an InfiniBand Card
• InfiniBand QDR adapters would noticeably increase network bandwidth
![Page 86: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/86.jpg)
HPC Advisory Council European Conference 2014, Leipzig 86/87
Summary about rCUDA
rCUDA is the enabling technology for:
• High Throughput Computing
• Using remote GPUs does not make applications to execute faster
• Sharing remote GPUs makes applications to execute slower
• BUT more throughput (jobs/time) is achieved
• Datacenter administrators can choose between HPC and HTC
• Green Computing
• GPU migration and application migration allow to
devote just the required computing resources to
the current load
• More flexible system upgrades
• GPU and CPU updates become independent
from each other. Adding GPU boxes to non
GPU-enabled clusters is possible
![Page 87: Increasing the efficiency of your GPU-enabled cluster with rCUDA · 2020. 1. 14. · HPC Advisory Council European Conference 2014, Leipzig 5/87 GPUs reduce energy and time On GPU](https://reader034.vdocument.in/reader034/viewer/2022051321/5fffe03bfdceb74ad6497789/html5/thumbnails/87.jpg)
HPC Advisory Council European Conference 2014, Leipzig 87/87
You can get a free copy of rCUDA at
http://www.rcuda.net