Network coding on the GPU
Péter Vingelmann
Supervisor: Frank H.P. Fitzek
What is network coding?
Traditional routing in packet networks:
data is simply forwarded by the intermediate nodes
Network coding breaks with this principle:
nodes may recombine several input packets into one or several output packets
Linear network coding:
form linear combinations of incoming packets
What are the benefits of network coding?
Throughput: the butterfly network (figure below)
Robustness: each encoded packet is “equally important”
Complexity: less complex protocols (e.g. for content distribution)
Security: it is more difficult to “overhear” anything that makes sense
[Figure: the butterfly network. Source S sends b1 on one side and b2 on the other; the shared middle link carries b1 xor b2, so sink T1 receives b1 and b1 xor b2, and sink T2 receives b2 and b1 xor b2 — each sink can recover both packets.]
What is the problem?
The computational overhead introduced by network coding operations is not negligible
There is no dedicated network coding hardware yet
A possible solution:
use the Graphics Processing Unit (GPU) to perform the necessary calculations
Overview of network coding
Definition: coding at a node in a packet network
All operations are performed over a Galois field GF(2^s); packets are divided into s-bit symbols, so a K-bit packet consists of K/s symbols
The process of network coding can be divided into two separate parts: encoding and decoding
[Figure: original packets enter the encoder, which produces encoded packets; the decoder uses the coding coefficients to recover the original packets. Each K-bit packet is split into Symbol 1 … Symbol K/s of s bits each.]
Encoding
Encoded packets are linear combinations of the original packets, where addition and multiplication are performed over GF(2^s)
We can use random coding coefficients
In matrix form: X = C · B, where C is the N×N matrix of coding coefficients, B is the N×L matrix holding the N original packets of L symbols each, and X is the N×L matrix of encoded data
Decoding
Assume a node has received M encoded packets (together with their coding coefficients)
Linear system with M equations and N unknowns
We need M ≥ N to have a chance of solving this system of equations using standard Gaussian elimination
At least N linearly independent encoded packets must be received in order to recover all the original data packets
CPU implementation
A simple C++ console application with some customizable parameters: L (packet length) and N (generation size)
Object-oriented implementation: Encoder and Decoder classes
Addition and subtraction over the Galois Field are simply XOR operations on the CPU
Galois multiplication and division tables are pre-calculated and stored in arrays: both operations can be performed by array lookups
Gauss-Jordan elimination is used for decoding: an “on-the-fly” version of the standard Gaussian elimination
It is used as a reference implementation
Graphics card
Originally designed for real-time rendering of 3D graphics
The past: the fixed-function pipeline
GPUs evolved into programmable parallel processors with enormous computing power
The present: the programmable pipeline
Now they can even perform general-purpose computations, with some restrictions
The future: the General-Purpose Graphics Processing Unit (GPGPU)
OpenGL & Cg implementation
OpenGL is a standard cross-platform API for computer graphics
It cannot be used on its own; a shader language is also necessary to implement custom algorithms
A shader is a short program used to program certain stages of the rendering pipeline
I chose NVIDIA’s Cg toolkit as the shader language
The developer is forced to think in the traditional concepts of 3D graphics (e.g. vertices, pixels, triangles, lines and points)
Encoder shader in Cg
A regular bitmap image serves as input data
Coefficients and data packets are stored in textures (2D arrays of bytes in graphics memory that can be accessed efficiently)
The XOR operation and Galois multiplication are also implemented by texture look-ups: a 256x256 black-and-white texture is necessary for each
The encoded packets are rendered (computed) line by line onto the screen and saved into a texture
Decoder shaders in Cg
The decoding algorithm is more complex, so it must be decomposed into 3 different shaders
These shaders correspond to the 3 consecutive phases of the Gauss-Jordan elimination:
1. Forward substitution: reduce the new packet by the existing rows
2. Pivoting: find the pivot element in the reduced packet
3. Backward substitution: substitute the reduced and normalized packet back into the existing rows
NVIDIA’s CUDA toolkit
Compute Unified Device Architecture (CUDA): parallel computing applications in the C language
Modern GPUs have many processor cores, and they can launch thousands of threads with zero scheduling overhead
Terminology: host = CPU, device = GPU, kernel = a function executed on the GPU
A kernel is executed in the Single Program Multiple Data (SPMD) model, meaning that a user-specified number of threads execute the same program.
CUDA implementation
A CUDA-capable device is required: an NVIDIA GeForce 8 series card at minimum
This is a more native approach; we have fewer restrictions
A large number of threads must be launched to achieve the GPU’s peak performance
All data structures are stored in CUDA arrays, which are bound to texture references if necessary
Computations are visualized using an OpenGL GUI
Encoder kernel in CUDA
Encoding is a matrix multiplication in the GF domain, and can be considered a highly parallel computation problem
We can achieve a very fine granularity by launching a thread for every single byte to be computed
Galois multiplication is implemented by array look-ups, but we have a native XOR operator
The encoder kernel is quite simple
Decoder kernels in CUDA
Gauss-Jordan elimination means that the decoding of each coded packet can only start after the decoding of the previous coded packets has finished => the algorithm is inherently sequential!
Parallelization is only possible within the decoding of the current coded packet
We need 2 separate kernels for forward and backward substitution
A search for the first non-zero element must be performed on the CPU side, because synchronization is not possible between all GPU threads => the CPU must assist the GPU!
Graphical User Interface
Performance evaluation
It is difficult to compare the actual performance of these implementations
A lot of factors have to be taken into consideration:
shader/kernel execution times
memory transfers between host and device memory
shader/kernel initialization & parameter setup
CPU-GPU synchronization
Measurement results are not uniform, because we cannot have exclusive control over the GPU: other applications may have a negative impact
CPU implementation
Desktop PC: Intel Core2 Quad CPU Q6600 @ 2.40 GHz
[Chart: encoding and decoding throughput (KB/s, y-axis up to 35,000) vs. generation size N = 16, 32, 64, 128, 256]
OpenGL & Cg implementation
OpenGL & Cg: NVidia GeForce 9600 GT
[Chart: encoding and decoding throughput (KB/s, y-axis up to 80,000) vs. generation size N = 16, 32, 64, 128, 256]
CUDA implementation
CUDA: NVidia GeForce 9600 GT
[Chart: encoding and decoding throughput (KB/s, y-axis up to 250,000) vs. generation size N = 16, 32, 64, 128, 256]
Achievements
It has been shown that the GPU is capable of performing network coding calculations
What’s more, it can outperform the CPU by a significant margin in some cases
We have a paper accepted at European Wireless ’09, titled:
Implementation of Random Linear Network Coding on OpenGL-enabled Graphics Cards
Demonstration
CPU implementation
OpenGL & Cg implementation
CUDA implementation
Questions?
Thank you for your kind attention!