quants coding cuda in .net: pitfalls and...

29
1 1 Quants Coding CUDA in .NET: Pitfalls and Solutions Ben Eimer GPU Technology Conference – San Jose, CA April 6, 2016

Upload: others

Post on 24-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

1 1

Quants Coding CUDA in .NET: Pitfalls and Solutions

Ben Eimer GPU Technology Conference – San Jose, CA April 6, 2016

Page 2: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

2 2

Outline

• What is the problem we are trying to solve?

• The Hybrid Object Linear Algebra library (HOLA)

• Pitfalls/Solution

• Performance Data

Page 3: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

3 3

Chatham Financial is a recognized leader in financial risk management specializing in foreign currency, interest rate and commodity risk. Our hedge accounting, valuations and derivatives and debt solutions are retained by clients worldwide.

1,800+ Clients Both public and private, rely on Chatham for end-to-end hedge advisory, execution, and technology solutions that bring transparency to the market

$400 Billion Annual transaction notional volume

450+ Accounting Clients rely on Chatham for hedge accounting and/or derivative valuations

Web-based Technology Platform Proprietary ChathamDirect technology platform serves clients worldwide

400+ Employees Employee owned, with offices in 6 countries provides an independent, global perspective into the most effective risk management solutions available

Advisory services, technology solutions, leaders in debt & derivatives

Page 4: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

4 4

The Setup

Page 5: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

5 5

Why this solution exists?

Client/Dealer Facing Applications

Web Pages Public API Internal Tools

Financial Models

IR FX/Commodities Accounting/ Regulation

Math Acceleration Library

Page 6: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

6 6

Requirements for our solution

Quant Side

• Numbers are 100% reproducible

• Intuitive to use

• Required Math functions:

• Linear Algebra

• Linear Regression

• Random Numbers

• Summary Stats

• Factorization

• Interpolation

• Non-linear Optimization

Developer Side

• Version control

• MKL integration (for CPU)

• Provide GPU acceleration

• Able to perform in a heterogeneous compute cluster

• Written in C#

Page 7: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

7 7

Why C#?

C# is a full featured language with a large, active community. It is well supported by Microsoft as well as many other 3rd party offerings. Using this allows our developers to be highly productive.

The Quants development also takes advantage of the internal tools, processes and expertise of our Technology Team.

Page 8: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

8 8

HOLA

Page 9: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

9 9

Hybrid Object Linear Algebra (HOLA) Library

• Hybrid Objects can be created on the GPU or the host depending on the context at runtime.

• MKL and CUDA accessed via DLLImport

• Delegate Pattern heavily relied on

• C# Using statements relied on to manage resources and set context

Page 10: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

10 10

Delegate Pattern

Delegator

<<Interface>>

Delegation Interface

Concrete

Delegate A

Concrete

Delegate B

Matrix

<<Interface>>

IMatrix

HostMatrix DeviceMatrix

Page 11: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

11 11

The C# using Statement

MSDN - Provides a convenient syntax that ensures the correct use

of IDisposable objects.

DotNetPerls - The using block helps manage resources. Conceptually it

protects the whole system's resources by specifying the scope of the usage of the resource. The using statement is combined with a type that implements IDisposable.

Page 12: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

12 12

Computational Scope

using (new ComputationScope()) { //Do Something Here } using (new ComputationScope(TypeCoersionMode.ForceHost)) {} using (new ComputationScope(TypeCoersionMode.ForceDevice)) {} using (new ComputationScope(TypeCoersionMode.FavorLargerObject)) {} using (new ComputationScope(TypeCoersionMode.FavorDevice)) {} using (var scope = new ComputationScope()) { var X = Matrix.Uninitialized(rows, cols); X.AsDurable(); X.AsPinned(); X.AsUntracked(); scope.RegisterDisposable(X); }

Page 13: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

13 13

HostMatrix

Example of object creation

Use Case

// Use Host

var X = Matrix.Uninitialized(rows, cols);

// Use Device

var X = Matrix.Uninitialized(rows, cols);

Under the Hood

Matrix

Uninitialized

ctor

HostArrayHandle

ctor

MKL

MKLalloc

DeviceMatrix

CudaMemoryWrapper

Cuda

MemAlloc

Page 14: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

14 14

CBlaswrapper

Example of object operation

Use Case

// Use Host

var X = Matrix.Uninitialized(rows, cols);

var Y = Matrix.Uninitialized(cols, rows);

var Z = X*Y;

// Use Device

var X = Matrix.Uninitialized(rows, cols);

var Y = Matrix.Uninitialized(cols, rows);

var Z = X*Y;

Under the Hood

Matrix

Multiply

dgemm

MKLBlas

dgemm

CuBlas70wrapper

CuBlas

Page 15: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

15 15

Under the Hood

DeviceMatrix

Custom Kernels are Possible too.

Use Case

// Use Device

var Y = Matrix.Uninitialized(cols, rows);

var Z = Y.Inv();

__global__ void invKernel(double * sourcePtr, double * targetPtr, int size)

{

int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i < size)

{

targetPtr[i] = 1.0 / sourcePtr[i];

}

}

Matrix

Inv()

Inv()

KernelWrapper

Inv()

kernel.cu

invKernel

Page 16: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

16 16

Pitfalls/Solution

Page 17: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

17 17

Under the Hood

Challenge 1: Ephemeral Objects

Use Case

var C = X*Y*Z;

for (int i = 0; i < 500; i++)

{

var C = X*Y*Z;

}

var temp = X*Y

var C = temp*Z

Q: What happens to the temp Matrix?

A: It stays in memory.

0

200

400

600

800

1000

1200

1400

1600

1800

2000

1 2 3 4 5 6 7 8 9 10

Mem

ory

Allo

cate

d (

MB

)

Iteration number

Page 18: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

18 18

Under the Hood

Ephemeral Objects revisited

Use Case

using (new ComputationScope(

TypeCoersionMode.ForceDevice))

{

var C = Matrix.Uninitialized(rows, cols);

var X = Matrix.Uninitialized(rows, cols);

var Y = Matrix.Uninitialized(rows, cols);

var Z = Matrix.Uninitialized(rows, cols);

for (int i = 0; i < 500; i++)

{

using (new ComputationScope())

{

C = X*Y*Z;

}

}

}

var temp = X*Y

var C = temp*Z

Q: What happens to the temp Matrix?

A: It is disposed as the inner Scope

closes.

Page 19: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

19 19

Ephemeral Objects (cont.)

0

200

400

600

800

1000

1200

1400

1600

1800

2000

512 1024 2048 4096

Mem

ory

Allo

cate

d (

MB

)

Matrix Size

First Iteration without Scopes Tenth Iteration without Scopes

First Iteration with Scopes Tenth Iteration with Scopes

Page 20: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

20 20

Under the Hood

Challenge 2: Over aggressive compiler optimization

Use Case

(constructed example)

var M = Matrix.Zero(rows, cols);

var stats = new SummaryStats(M);

stats.AddMean().AddCorrelation().Compute();

var result = stats.Correlation;

.NET will ‘finalize’ M here

(sometimes) because it is no

longer used by the managed

code.

Page 21: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

21 21

Under the Hood

The Computation Scope is now responsible for letting the compiler know when a Matrix can be released. M will no longer be finalized unexpectedly.

SummaryStats is also managed by the computation scope.

Over aggressive compiler optimization

Use Case

using (new ComputationScope())

{

var M = Matrix.Zero(rows, cols);

var stats = new SummaryStats(M);

stats.AddMean().AddCorrelation()

.Compute();

var result = stats.Correlation;

}

Page 22: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

22 22

Performance

Page 23: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

23 23

Performance

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

128 256 512 1024 2048 4096 8192

Ele

men

ts p

er S

eco

nd

Matrix Size (n x n)

Exp

2x E5620 Tesla C2050

Page 24: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

24 24

Performance (cont.)

1.E+04

1.E+05

1.E+06

1.E+07

32 64 128 256 512 1024 2048 4096 8192

Nu

mb

er G

ener

ated

per

Sec

on

d

Matrix Size (n x n)

MRG32K3A

2x E5620 Tesla C2050

Page 25: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

25 25

Performance (cont.)

0

50

100

150

200

250

300

350

64 128 256 512 1024 2048 4096 8192

GFL

OP

S

Matrix Size (n x n)

dgemm

2x E5620 Tesla C2050

Page 26: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

26 26

Performance (cont.)

0

10

20

30

40

50

60

70

80

0 20000 40000 60000 80000 100000 120000 140000

Mill

ion

s o

f Si

mu

atio

n S

tep

s p

er S

eco

nd

Simluation Paths

3 Currency FX Simulation

2x E5620 Tesla C2050

Page 27: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

27 27

Take-Aways

• HOLA is a complete math Library that is ready to be used in a variety of financial models under many different load and scaling profiles.

• HOLA can be used to easily accelerate math intensive code using GPUs while minimizing the strain/training placed on Quants and Developers.

• HOLA provides consistent numeric stability needed in many different disciplines including Finance.

Page 28: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

28 28

[email protected]

Interested in HOLA?

Page 29: Quants Coding CUDA in .NET: Pitfalls and Solutionson-demand.gputechconf.com/gtc/2016/...quants-coding... · • Performance Data . 33 ... productive. The Quants development also takes

29 29

Thank You

Ben Eimer

Quantitative Developer

[email protected]

Other people on the project:

• Ryan Deering

• John Bagatta

• Liang Cao

• Ethan Giordano

• Maciej Rys

• Jaroslaw Stachowiak

• Piotr Stecko

• Sam Campbell