quants coding cuda in .net: pitfalls and...

1 1

Quants Coding CUDA in .NET: Pitfalls and Solutions

Ben Eimer GPU Technology Conference – San Jose, CA April 6, 2016

2 2

Outline

• What is the problem we are trying to solve?

• The Hybrid Object Linear Algebra library (HOLA)

• Pitfalls/Solution

• Performance Data

3 3

Chatham Financial is a recognized leader in financial risk management specializing in foreign currency, interest rate and commodity risk. Our hedge accounting, valuations and derivatives and debt solutions are retained by clients worldwide.

1,800+ Clients Both public and private, rely on Chatham for end-to-end hedge advisory, execution, and technology solutions that bring transparency to the market

$400 Billion Annual transaction notional volume

450+ Accounting Clients rely on Chatham for hedge accounting and/or derivative valuations

Web-based Technology Platform Proprietary ChathamDirect technology platform serves clients worldwide

400+ Employees Employee owned, with offices in 6 countries provides an independent, global perspective into the most effective risk management solutions available

Advisory services, technology solutions, leaders in debt & derivatives

4 4

The Setup

5 5

Why this solution exists?

Client/Dealer Facing Applications

Web Pages Public API Internal Tools

Financial Models

IR FX/Commodities Accounting/ Regulation

Math Acceleration Library

6 6

Requirements for our solution

Quant Side

• Numbers are 100% reproducible

• Intuitive to use

• Required Math functions:

• Linear Algebra

• Linear Regression

• Random Numbers

• Summary Stats

• Factorization

• Interpolation

• Non-linear Optimization

Developer Side

• Version control

• MKL integration (for CPU)

• Provide GPU acceleration

• Able to perform in a heterogeneous compute cluster

• Written in C#

7 7

Why C#?

C# is a full featured language with a large, active community. It is well supported by Microsoft as well as many other 3rd party offerings. Using this allows our developers to be highly productive.

The Quants development also takes advantage of the internal tools, processes and expertise of our Technology Team.

8 8

HOLA

9 9

Hybrid Object Linear Algebra (HOLA) Library

• Hybrid Objects can be created on the GPU or the host depending on the context at runtime.

• MKL and CUDA accessed via DLLImport

• Delegate Pattern heavily relied on

• C# Using statements relied on to manage resources and set context

10 10

Delegate Pattern

Delegator

<<Interface>>

Delegation Interface

Concrete

Delegate A

Concrete

Delegate B

Matrix

<<Interface>>

IMatrix

HostMatrix DeviceMatrix

11 11

The C# using Statement

MSDN - Provides a convenient syntax that ensures the correct use

of IDisposable objects.

DotNetPerls - The using block helps manage resources. Conceptually it

protects the whole system's resources by specifying the scope of the usage of the resource. The using statement is combined with a type that implements IDisposable.

12 12

Computational Scope

using (new ComputationScope()) { //Do Something Here } using (new ComputationScope(TypeCoersionMode.ForceHost)) {} using (new ComputationScope(TypeCoersionMode.ForceDevice)) {} using (new ComputationScope(TypeCoersionMode.FavorLargerObject)) {} using (new ComputationScope(TypeCoersionMode.FavorDevice)) {} using (var scope = new ComputationScope()) { var X = Matrix.Uninitialized(rows, cols); X.AsDurable(); X.AsPinned(); X.AsUntracked(); scope.RegisterDisposable(X); }

13 13

HostMatrix

Example of object creation

Use Case

// Use Host

var X = Matrix.Uninitialized(rows, cols);

// Use Device


Under the Hood

Matrix

Uninitialized

ctor

HostArrayHandle

ctor

MKL

MKLalloc

DeviceMatrix

CudaMemoryWrapper

Cuda

MemAlloc

14 14

CBlaswrapper

Example of object operation

Use Case

// Use Host


var Y = Matrix.Uninitialized(cols, rows);

var Z = X*Y;

// Use Device



var Z = X*Y;

Under the Hood

Matrix

Multiply

dgemm

MKLBlas

dgemm

CuBlas70wrapper

CuBlas

15 15

Under the Hood

DeviceMatrix

Custom Kernels are Possible too.

Use Case

// Use Device


var Z = Y.Inv();

__global__ void invKernel(double * sourcePtr, double * targetPtr, int size)

{

int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i < size)

{

targetPtr[i] = 1.0 / sourcePtr[i];

}

}

Matrix

Inv()

Inv()

KernelWrapper

Inv()

kernel.cu

invKernel

16 16

Pitfalls/Solution

17 17

Under the Hood

Challenge 1: Ephemeral Objects

Use Case

var C = X*Y*Z;

for (int i = 0; i < 500; i++)

{

var C = X*Y*Z;

}

var temp = X*Y

var C = temp*Z

Q: What happens to the temp Matrix?

A: It stays in memory.

0

200

400

600

800

1000

1200

1400

1600

1800

2000

1 2 3 4 5 6 7 8 9 10

Mem

ory

Allo

cate

d (

MB

)

Iteration number

18 18

Under the Hood

Ephemeral Objects revisited

Use Case

using (new ComputationScope(

TypeCoersionMode.ForceDevice))

{

var C = Matrix.Uninitialized(rows, cols);


var Y = Matrix.Uninitialized(rows, cols);

var Z = Matrix.Uninitialized(rows, cols);

for (int i = 0; i < 500; i++)

{

using (new ComputationScope())

{

C = X*Y*Z;

}

}

}

var temp = X*Y

var C = temp*Z

Q: What happens to the temp Matrix?

A: It is disposed as the inner Scope

closes.

19 19

Ephemeral Objects (cont.)

0

200

400

600

800

1000

1200

1400

1600

1800

2000

512 1024 2048 4096

Mem

ory

Allo

cate

d (

MB

)

Matrix Size

First Iteration without Scopes Tenth Iteration without Scopes

First Iteration with Scopes Tenth Iteration with Scopes

20 20

Under the Hood

Challenge 2: Over aggressive compiler optimization

Use Case

(constructed example)

var M = Matrix.Zero(rows, cols);

var stats = new SummaryStats(M);

stats.AddMean().AddCorrelation().Compute();

var result = stats.Correlation;

.NET will ‘finalize’ M here

(sometimes) because it is no

longer used by the managed

code.

21 21

Under the Hood

The Computation Scope is now responsible for letting the compiler know when a Matrix can be released. M will no longer be finalized unexpectedly.

SummaryStats is also managed by the computation scope.

Over aggressive compiler optimization

Use Case

using (new ComputationScope())

{

var M = Matrix.Zero(rows, cols);

var stats = new SummaryStats(M);

stats.AddMean().AddCorrelation()

.Compute();

var result = stats.Correlation;

}

22 22

Performance

23 23

Performance

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

128 256 512 1024 2048 4096 8192

Ele

men

ts p

er S

eco

nd

Matrix Size (n x n)

Exp

2x E5620 Tesla C2050

24 24

Performance (cont.)

1.E+04

1.E+05

1.E+06

1.E+07

32 64 128 256 512 1024 2048 4096 8192

Nu

mb

er G

ener

ated

per

Sec

on

d

Matrix Size (n x n)

MRG32K3A


25 25

Performance (cont.)

0

50

100

150

200

250

300

350

64 128 256 512 1024 2048 4096 8192

GFL

OP

S

Matrix Size (n x n)

dgemm


26 26

Performance (cont.)

0

10

20

30

40

50

60

70

80

0 20000 40000 60000 80000 100000 120000 140000

Mill

ion

s o

f Si

mu

atio

n S

tep

s p

er S

eco

nd

Simluation Paths

3 Currency FX Simulation


27 27

Take-Aways

• HOLA is a complete math Library that is ready to be used in a variety of financial models under many different load and scaling profiles.

• HOLA can be used to easily accelerate math intensive code using GPUs while minimizing the strain/training placed on Quants and Developers.

• HOLA provides consistent numeric stability needed in many different disciplines including Finance.

28 28

[email protected]

Interested in HOLA?

29 29

Thank You

Ben Eimer

Quantitative Developer

[email protected]

Other people on the project:

• Ryan Deering

• John Bagatta

• Liang Cao

• Ethan Giordano

• Maciej Rys

• Jaroslaw Stachowiak

• Piotr Stecko

• Sam Campbell

quants coding cuda in .net: pitfalls and...

Documents