Network coding on the GPU
Péter Vingelmann
Supervisor: Frank H.P. Fitzek
What is network coding?
Traditional routing in packet networks:
data is simply forwarded by the intermediate nodes
Network coding breaks with this principle:
nodes may recombine several input packets into one or several output packets
Linear network coding:
form linear combinations of incoming packets
What are the benefits of network coding?
Throughput: the butterfly network (figure below)
Robustness: each encoded packet is “equally important”
Complexity: less complex protocols (e.g. for content distribution)
Security: it is more difficult to “overhear” anything that makes sense
[Figure: the butterfly network. Source S sends b1 on one side and b2 on the other; the shared middle link carries b1 xor b2, so sink T1 receives b1 and b1 xor b2, and sink T2 receives b2 and b1 xor b2 — each sink can recover both packets.]
What is the problem?
The computational overhead introduced by network coding operations is not negligible
There is no dedicated network coding hardware yet
A possible solution:
use the Graphics Processing Unit (GPU) to perform the necessary calculations
Overview of network coding
Definition: coding at a node in a packet network
All operations are performed over a Galois field GF(2^s); packets are divided into s-bit symbols, so a K-bit packet consists of K/s symbols
The process of network coding can be divided into two separate parts: encoding and decoding
[Figure: original packets enter the encoder, which produces encoded packets; the decoder uses the coding coefficients to recover the original packets. Each K-bit packet is split into Symbol 1 … Symbol K/s of s bits each.]
Encoding
Encoded packets are linear combinations of the original packets, where addition and multiplication are performed over GF(2^s)
We can use random coding coefficients
In matrix form: X = C · B, where C is the N×N matrix of coding coefficients, B is the N×L matrix holding the N original packets of L symbols each, and X is the N×L matrix of encoded data
Decoding
Assume a node has received M encoded packets (together with their coding coefficients)
Linear system with M equations and N unknowns
We need M ≥ N to have a chance of solving this system of equations using standard Gaussian elimination
At least N linearly independent encoded packets must be received in order to recover all the original data packets
CPU implementation
A simple C++ console application with some customizable parameters: L (packet length) and N (generation size)
Object-oriented implementation: Encoder and Decoder classes
Addition and subtraction over the Galois Field are simply XOR operations on the CPU
Galois multiplication and division tables are pre-calculated and stored in arrays: both operations can be performed by array lookups
Gauss-Jordan elimination is used for decoding: an “on-the-fly” version of the standard Gaussian elimination
It is used as a reference implementation
Graphics card
Originally designed for real-time rendering of 3D graphics
The past: the fixed-function pipeline
GPUs evolved into programmable parallel processors with enormous computing power
The present: the programmable pipeline
Now they can even perform general-purpose computations, with some restrictions
The future: the General-Purpose Graphics Processing Unit (GPGPU)
OpenGL & Cg implementation
OpenGL is a standard cross-platform API for computer graphics
It cannot be used on its own; a shader language is also necessary to implement custom algorithms
A shader is a short program used to program certain stages of the rendering pipeline
I chose NVIDIA’s Cg toolkit as the shader language
The developer is forced to think in the traditional concepts of 3D graphics (e.g. vertices, pixels, triangles, lines and points)
Encoder shader in Cg
A regular bitmap image serves as input data
Coefficients and data packets are stored in textures (2D arrays of bytes in graphics memory that can be accessed efficiently)
The XOR operation and Galois multiplication are also implemented by texture look-ups: a 256x256 black-and-white texture is necessary for each
The encoded packets are rendered (computed) line by line onto the screen and saved into a texture
Decoder shaders in Cg
The decoding algorithm is more complex, so it must be decomposed into 3 different shaders
These shaders correspond to the 3 consecutive phases of the Gauss-Jordan elimination:
1. Forward substitution: reduce the new packet by the existing rows
2. Pivoting: find the pivot element in the reduced packet
3. Backward substitution: substitute the reduced and normalized packet back into the existing rows
NVIDIA’s CUDA toolkit
Compute Unified Device Architecture (CUDA): parallel computing applications in the C language
Modern GPUs have many processor cores, and they can launch thousands of threads with zero scheduling overhead
Terminology: host = CPU, device = GPU, kernel = a function executed on the GPU
A kernel is executed in the Single Program Multiple Data (SPMD) model, meaning that a user-specified number of threads execute the same program.
CUDA implementation
A CUDA-capable device is required: an NVIDIA GeForce 8 series card at minimum
This is a more native approach; we have fewer restrictions
A large number of threads must be launched to achieve the GPU’s peak performance
All data structures are stored in CUDA arrays, which are bound to texture references if necessary
Computations are visualized using an OpenGL GUI
Encoder kernel in CUDA
Encoding is a matrix multiplication in the GF domain, and can be considered a highly parallel computation problem
We can achieve a very fine granularity by launching a thread for every single byte to be computed
Galois multiplication is implemented by array look-ups, but we have a native XOR operator
The encoder kernel is quite simple
Decoder kernels in CUDA
Gauss-Jordan elimination means that the decoding of each coded packet can only start after the decoding of the previous coded packets has finished => the algorithm is inherently sequential!
Parallelization is only possible within the decoding of the current coded packet
We need 2 separate kernels for forward and backward substitution
A search for the first non-zero element must be performed on the CPU side, because synchronization is not possible between all GPU threads => the CPU must assist the GPU!
Graphical User Interface
Performance evaluation
It is difficult to compare the actual performance of these implementations
A lot of factors have to be taken into consideration:
shader/kernel execution times
memory transfers between host and device memory
shader/kernel initialization & parameter setup
CPU-GPU synchronization
Measurement results are not uniform, because we cannot have exclusive control over the GPU: other applications may have a negative impact
CPU implementation
Desktop PC: Intel Core2 Quad CPU Q6600 @ 2.40 GHz
[Chart: encoding and decoding throughput (KB/s, y-axis up to 35,000) vs. generation size N = 16, 32, 64, 128, 256]
OpenGL & Cg implementation
OpenGL & Cg: NVidia GeForce 9600 GT
[Chart: encoding and decoding throughput (KB/s, y-axis up to 80,000) vs. generation size N = 16, 32, 64, 128, 256]
CUDA implementation
CUDA: NVidia GeForce 9600 GT
[Chart: encoding and decoding throughput (KB/s, y-axis up to 250,000) vs. generation size N = 16, 32, 64, 128, 256]
Achievements
It has been shown that the GPU is capable of performing network coding calculations
What’s more, it can outperform the CPU by a significant margin in some cases
We have a paper accepted at European Wireless ’09, titled:
Implementation of Random Linear Network Coding on OpenGL-enabled Graphics Cards
Demonstration
CPU implementation
OpenGL & Cg implementation
CUDA implementation
Questions?
Thank you for your kind attention!