
1

Counterparty Credit Risk and IM Computation for CCP

on Multicore Systems

Prasad Pawar

Nishant Kumar

Amit Kalele

Tata Consultancy Services Limited

2

Overview

Introduction

Counterparty Credit Risk

• Basic Terminology

Sequential Algorithm

Parallel algorithm using CUDA

Optimizations applied on GPGPU

Parallelized and Optimized algorithm on Intel Platform

Comparison Results

Conclusion and future work

3

Introduction

Counterparty credit risk

Counterparty credit risk is defined as the risk that the counterparty to a transaction could default before the final settlement of the transaction's cash flows.

Basic Terminology

• IRS trade – An interest rate swap trade, done primarily by market participants for hedging or for speculation on the direction of interest rates.

• Cash Flow – In the context of IRS trades, a cash flow is a sum of money to be paid or received on one of the predefined cash flow dates specified in the trade.

• Zero Coupon Yield Curve – A curve representing the yields of zero coupon bonds plotted against their time to maturity; it provides the forward rates and spot rates used for calculating and discounting cash flows.

• MTM value – The mark-to-market value of the IRS trade, which reflects the monetary gain or loss on the trade to its two parties.

4

Counterparty Credit Risk

Mark to Market Computations

• The central counterparty (CCP) values, using the current yield curve, the complete portfolio of interest rate swap trades received from all the members, on an intraday basis.

• It calculates the mark-to-market (MTM) margin requirement for each member.

• It blocks the margin from the member's collateral and, if the collateral is not sufficient, makes a margin call to the members concerned.

Initial Margin Computations

• The IM computation requires valuing the member's current portfolio 250 times using 250 different yield curves picked from historical data.

5

Challenges

On traditional (database) systems, MTM computation takes ~25 min for 20,000 trades, each with ~150 cash flows.

Initial Margin (IM) computation takes ~10 min for 250 different yield curves, each applied to 20,000 trades with ~150 cash flows, on a .NET-based solution.

Such high timings lead to the following problems:

• The process is inefficient, as the user is unproductive during the 25 minutes over which the valuation runs.

• The turnaround is too slow when information is required by senior executives or regulators on an urgent basis.

• If trade volumes increase, say to 100,000 (a realistic possibility), the time taken will exceed 2 hours, which is virtually unacceptable.

• Until the IM result is computed, a member can continue trading, but the trades are guaranteed for settlement from the point of trade in TS, which increases risk for the CCP.

An efficient solution is required to solve this problem.

6

Computational Steps

Yield curve generations:

1. Using linear interpolation, compute the intermediate swap rates for tenors whose swap rates are not provided, where (x, y) represents a tenor and its corresponding interest rate.

2. Zero rates for tenors up to one year are computed using the continuous compounding method.

3. The standard bootstrapping method is used to compute zero rates for tenors beyond one year.
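As an illustration of steps 1 and 2, a minimal host-side sketch is given below. It assumes the sub-one-year swap rates are simple rates converted to continuously compounded zero rates; the exact market conventions and the bootstrapping of longer tenors are not preserved in the transcript.

// Minimal host-side sketch of steps 1-2 above (assumed conventions).
#include <cmath>
#include <vector>

// Linearly interpolate a swap rate for `tenor` from known (tenor, rate) points (x, y).
double interp_swap_rate(const std::vector<double>& x,   // known tenors (years), ascending
                        const std::vector<double>& y,   // known swap rates
                        double tenor) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        if (tenor <= x[i]) {
            double w = (tenor - x[i - 1]) / (x[i] - x[i - 1]);
            return y[i - 1] + w * (y[i] - y[i - 1]);
        }
    }
    return y.back();   // beyond the last quoted tenor, hold the last rate
}

// Continuously compounded zero rate for a tenor of up to one year,
// assuming the quoted rate is a simple rate.
double zero_rate_short(double swap_rate, double tenor) {
    return std::log(1.0 + swap_rate * tenor) / tenor;
}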

7

Computational Steps

MTM values

The inputs to the mark-to-market computation are the business date, the immediate previous and future cash flow dates, principal amount, accrued interest, fixed rate, floating rate and zero rate.

1. Compute the fixed cash flows, and compute the floating cash flows using the discrete-equivalent formula for the future floating interest rates, i.e. the forward rates.

2. Calculate the MTM value of each trade by discounting the fixed and floating cash flows using the zero rate and netting the fixed and floating cash flows off against each other:

Discounted Value = Present value of cash flows
Trade MTM = Sum (Discounted Value)

3. The MTM value, i.e. the margin requirement, for each member is obtained by aggregating the MTM values of all the trades of that member:

MTM for Member = Sum (Trade MTM of that Member)
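A minimal sketch of steps 2 and 3, assuming continuous discounting with a flat zero rate and signed (already netted) cash-flow amounts; the struct and function names are illustrative, not the deck's code.

// Discount each trade's cash flows and aggregate per member (simplified).
#include <cmath>
#include <vector>

struct CashFlow {
    double amount;   // fixed/floating netted cash-flow amount (signed)
    double t;        // time to the cash-flow date, in years
};

// Trade MTM = sum of discounted cash flows.
double trade_mtm(const std::vector<CashFlow>& flows, double zero_rate) {
    double mtm = 0.0;
    for (const CashFlow& cf : flows)
        mtm += cf.amount * std::exp(-zero_rate * cf.t);   // continuous discounting
    return mtm;
}

// MTM for Member = sum of Trade MTMs over the member's portfolio.
double member_mtm(const std::vector<std::vector<CashFlow>>& trades, double zero_rate) {
    double total = 0.0;
    for (const auto& trade : trades)
        total += trade_mtm(trade, zero_rate);
    return total;
}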

8

Computational Steps

Initial Margin calculation

1. Value the complete IRS trade portfolio of a member 250 times using the 250 historical zero rate curves.

2. Compute the daily percentage change in the MTM value of the portfolio and record the 249 results.

3. Assign a weight to each result using an EWMA scheme, such that the more recent the result, the higher the weight.

4. Sort the 249 results in ascending order.

5. Add the weights from the top; where the cumulative weight reaches 0.05, i.e. the worst 5 percent, the corresponding percentage change is multiplied by the portfolio value and adjusted for the holding period to compute the IM. (A sketch of this tail pick follows.)
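The sketch below renders steps 3 to 5 in code. It is an assumption-laden illustration: the EWMA decay factor lambda and the exact holding-period adjustment are not stated in the deck, and IM is simply reported as a positive amount.

// EWMA-weighted 5% tail pick over the 249 daily percentage changes (sketch).
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

double initial_margin(const std::vector<double>& pct_changes,  // 249 daily % changes, oldest first
                      double portfolio_value,
                      double holding_period_scale,             // e.g. sqrt(holding period) adjustment
                      double lambda = 0.94) {                  // EWMA decay factor (assumed)
    const std::size_t n = pct_changes.size();

    // EWMA weights: the most recent observation (index n-1) gets the largest weight.
    std::vector<std::pair<double, double>> obs(n);              // (pct_change, weight)
    double norm = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double w = std::pow(lambda, static_cast<double>(n - 1 - i));
        obs[i] = {pct_changes[i], w};
        norm += w;
    }
    for (auto& o : obs) o.second /= norm;                       // weights sum to 1

    // Sort by percentage change, ascending, so the worst losses come first.
    std::sort(obs.begin(), obs.end());

    // Accumulate weights from the worst change until the cumulative weight reaches 5%.
    double cum = 0.0, tail_change = obs.front().first;
    for (const auto& o : obs) {
        cum += o.second;
        tail_change = o.first;
        if (cum >= 0.05) break;
    }

    // Report IM as a positive amount scaled for the holding period.
    return -tail_change * portfolio_value * holding_period_scale;
}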

9

Implementation Details

10

Input and Output of IRS

• Input - Swap Rate, Swap Tenor

• Output - Yield curve

• Input for MTM Computation -

Cash_flow_dates, Prev_cash_flow_dates, Notional_amt, accured_int,

fixed_rate

• Output – Present CashFlow, MTM value

11

Sequential Algorithm

Single MTM Computations

Compute zero rates

MTMfinal = 0
for trade = 0 to nTrade do
    MTM[trade] = 0
    for CF = 0 to nCf do
        Compute Present_CashFlow[CF]
        MTM[trade] = MTM[trade] + Present_CashFlow[CF]
    end for
    MTMfinal = MTMfinal + MTM[trade]
end for

12

Sequential Algorithm

Single MTM Computation

for CF = 0 to nCf do
    if (CF == 0)
        Read eff_date, curr_cash_flow_date
    else
        Read last_cash_flow_date, curr_cash_flow_date

    1. Calculate the number of days between curr_date and last_date to obtain the tenor.
    2. Calculate intermediate values such as fw_rate, comp_fw_rate, dist_fw_rate etc.
    3. Calculate the floating cash flow of the trade from the above values and inputs such as notional_amt, accrued_int and fixed_rate.
    4. Compute the present value of the cash flow using the fixed/floating cash flow and the yield curve.
    5. MTM = MTM + Present_CashFlow[CF]
end for

13

Sequential Algorithm

Initial Margin Computations using 250 MTM values:

Compute the current date's zero rates and retrieve 249 different zero rates from the database.

MTMfinal[rate] = 0 for all rates
for rate = 0 to nRate do
    for trade = 0 to nTrade do                  (inner two loops: single MTM computation)
        MTM[trade] = 0
        for CF = 0 to nCf do
            Compute Present_CashFlow[CF]
            MTM[trade] = MTM[trade] + Present_CashFlow[CF]
        end for
        MTMfinal[rate] = MTMfinal[rate] + MTM[trade]
    end for
end for

Compute IM using the MTMfinal[] values

14

NVIDIA GPU Systems

Kepler K20x –
Device: NVIDIA Kepler K20x GPU, 796 MHz, 2496 cores, 5 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.1 GHz, dual socket, 6 cores/socket, 16 GB RAM.

Kepler K40 –
Device: NVIDIA Kepler K40 GPU, 745 MHz, 2880 cores, 12 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.

Kepler K80 –
Device: NVIDIA Kepler K80 GPU, 562 MHz, 2x2496 cores, 2x12 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.

15

GPU Algorithm for single MTM

Compute zero rates
MTMfinal = 0
Launch CUDA kernel with nTrade = 20000 threads (T1, T2, …, T20000), one trade per thread Ti:
    MTM[Ti] = 0
    for CF = 0 to nCf do
        Compute discounted cash flow[CF]
        MTM[Ti] = MTM[Ti] + discounted cash flow[CF]
    end for
MTMfinal = Σ MTM[Ti]
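A hedged CUDA sketch of this one-thread-per-trade kernel is given below; the array layout, parameter names and the simplified flat-rate discounting are assumptions, not the deck's actual code.

// One thread per trade; each thread loops over that trade's cash flows.
#include <cuda_runtime.h>

__global__ void mtm_kernel(const double* cashflow,    // [nTrade * nCf] netted amounts
                           const double* year_frac,   // [nTrade * nCf] times in years
                           const double* zero_rate,   // single flat rate (simplified curve)
                           double* mtm,               // [nTrade] per-trade MTM
                           int nTrade, int nCf) {
    int ti = blockIdx.x * blockDim.x + threadIdx.x;   // one thread = one trade
    if (ti >= nTrade) return;

    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf) {
        int idx = ti * nCf + cf;
        acc += cashflow[idx] * exp(-zero_rate[0] * year_frac[idx]);  // discount and net
    }
    mtm[ti] = acc;   // MTMfinal is obtained afterwards by summing mtm[]
}

A typical launch would be mtm_kernel<<<(nTrade + 255) / 256, 256>>>(…), after which the per-trade MTM array is reduced on the host (or with a library such as Thrust) to obtain MTMfinal.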

16

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. yield curves on K20x | 9.612 | 8.44x

Results taken on Kepler K20x system

17

Further Optimization

Nvidia GPU optimization:

Multi level parallelism using Hyper-Q

Using Shared Memory with coalesced memory access

Modified data structure

Resolved the issue of warp divergence

Using constant memory

Read-only cache memory using const __restrict__

18

Multi level parallelism using Hyper-Q

Hyper-Q allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process.

It provides 32 concurrent work queues, which can receive work from 32 processes or streams at the same time.

A 1.5x performance benefit was achieved.

Figure source: nvidia.com

19

GPU Algorithm for 250 MTM & IM Computation

Set up 32 CUDA streams (S0, S1, …, S31) and nRate = 250
Distribute nRate/32 valuations to each of the 32 streams
MTMfinal[rate] = 0 for all rates
Compute zero rates and retrieve the 249 previous zero rates
Compute IM using the MTMfinal[] values
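A hedged sketch of this stream distribution, reusing the mtm_kernel sketched earlier; the round-robin assignment of curves to streams and the device buffer layout are assumptions.

// Distribute the 250 yield-curve valuations across 32 CUDA streams (Hyper-Q).
#include <cuda_runtime.h>

const int N_STREAMS = 32;
const int N_RATE    = 250;

void value_portfolio_all_curves(const double* d_cashflow, const double* d_year_frac,
                                const double* d_zero_rates,  // [N_RATE] one flat rate per curve (simplified)
                                double* d_mtm,               // [N_RATE * nTrade] results
                                int nTrade, int nCf) {
    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s) cudaStreamCreate(&streams[s]);

    dim3 block(256), grid((nTrade + 255) / 256);
    for (int r = 0; r < N_RATE; ++r) {
        // Round-robin: curve r goes to stream r % 32, i.e. roughly 8 curves per stream.
        cudaStream_t s = streams[r % N_STREAMS];
        mtm_kernel<<<grid, block, 0, s>>>(d_cashflow, d_year_frac,
                                          d_zero_rates + r,
                                          d_mtm + (size_t)r * nTrade,
                                          nTrade, nCf);
    }
    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}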

20

Hyper-Q using default streaming

nvcc --default-stream per-thread -c MTM_value.cu -arch sm_35 -w -Xcompiler -fopenmp

21

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x

Results taken on Kepler K20x system

22

Using Shared Memory

[Figure: CUDA memory hierarchy – host, global and constant memory, per-block shared memory, per-thread registers]

• Read/write per block
• Speed equivalent to local cache; ~100x faster than global memory
• Limited to 48 KB
• Zero rate and swap tenor data are placed in shared memory

Figure source: nvidia.com

23

Coalesced global memory access

[Figure: strided vs. coalesced global-memory access – threads T0–T31 of a warp reading widely strided addresses (128, 192, 256, …, 2176) versus consecutive 4-byte words (128, 132, 136, …, 256), with reused data staged through shared memory]

24

Using Shared Memory

Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, …

Since the zero_rate and swap_tenor data is used multiple times per thread, it is stored in shared memory (a sketch follows below).
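A hedged sketch of staging the zero_rate and swap_tenor arrays in shared memory. ARRAY_COUNT = 22 follows the slide; the cooperative load and the simplified kernel body are assumptions about code that is not shown in the transcript.

#define ARRAY_COUNT 22

__global__ void mtm_kernel_shared(const double* zero_rate,    // [ARRAY_COUNT] curve points
                                  const double* swap_tenor,   // [ARRAY_COUNT] curve tenors
                                  const double* cashflow, const double* year_frac,
                                  double* mtm, int nTrade, int nCf) {
    __shared__ double s_zero_rate[ARRAY_COUNT];
    __shared__ double s_swap_tenor[ARRAY_COUNT];

    // The first ARRAY_COUNT threads of the block load the curve once per block.
    if (threadIdx.x < ARRAY_COUNT) {
        s_zero_rate[threadIdx.x]  = zero_rate[threadIdx.x];
        s_swap_tenor[threadIdx.x] = swap_tenor[threadIdx.x];
    }
    __syncthreads();

    int ti = blockIdx.x * blockDim.x + threadIdx.x;
    if (ti >= nTrade) return;

    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf) {
        int idx = ti * nCf + cf;
        // Repeated curve reads now hit shared memory instead of global memory.
        // The swap_tenor lookup that selects the right curve point is omitted
        // in this simplified body.
        acc += cashflow[idx] * exp(-s_zero_rate[0] * year_frac[idx]);
    }
    mtm[ti] = acc;
}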

25

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x

Results taken on Kepler K20x system

26

NVVP Profiling

27

Data Structure of Dates

Data Structure –

Cash_flow_dates and prev_cash_flow_date are stored in the form of a structure:

typedef struct _d {
    int day;    /* 4 bytes */
    int mon;    /* 4 bytes */
    int year;   /* 4 bytes */
} dt;           /* 12 bytes in total */

28

Issue with Data Structure

[Figure: threads T0–T31 reading 12-byte date structs from global memory at strided addresses]

In one cycle the cache fetches 128 bytes, which covers only 128/12 ≈ 10 date elements.

A large amount of data has to be fetched from global memory.

The access pattern to global memory is strided.

29

Modified Data Structure

Original data structure – Cash_flow_dates and prev_cash_flow_date stored in the form of a structure:

typedef struct _d {
    int day;    /* 4 bytes */
    int mon;    /* 4 bytes */
    int year;   /* 4 bytes */
} dt;           /* 12 bytes in total */

Instead of the structure, the date information is stored in the form of a single integer, e.g.:

dt date;
date.day  = 12;
date.mon  = 3;
date.year = 2010;

int tmp_date = date.day | (date.mon << 8) | (date.year << 16);

30

Modified Data Structure

To extract the day, month and year back from the packed integer, we use:

date.day  = tmp_date & 0xFF;
date.mon  = (tmp_date >> 8) & 0xFF;
date.year = (tmp_date >> 16) & 0xFFFF;
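Put together, a small self-contained pair of helpers matching this layout (day in bits 0–7, month in bits 8–15, year in the upper 16 bits); the function names and the main() driver are illustrative, not the deck's code.

#include <cstdio>

struct dt { int day, mon, year; };

__host__ __device__ inline int pack_date(dt d) {
    return (d.day & 0xFF) | ((d.mon & 0xFF) << 8) | ((d.year & 0xFFFF) << 16);
}

__host__ __device__ inline dt unpack_date(int packed) {
    dt d;
    d.day  = packed & 0xFF;
    d.mon  = (packed >> 8) & 0xFF;
    d.year = (packed >> 16) & 0xFFFF;
    return d;
}

int main() {
    dt d = {12, 3, 2010};
    int p = pack_date(d);
    dt u = unpack_date(p);
    printf("%d-%02d-%02d packs to 0x%08x\n", u.year, u.mon, u.day, (unsigned)p);
    return 0;
}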

31

Modified Data Structure

[Figure: threads T0–T31 reading packed 4-byte dates from consecutive global-memory addresses 0, 4, 8, …]

In one cycle the cache fetches 128 bytes, which now covers 128/4 = 32 date elements.

Less data has to be fetched from global memory.

The access pattern to global memory is coalesced.

32

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x

Results taken on Kepler K20x system

33

NVVP Profiling

34

Warp Divergence

Threads are executed in warps of 32, with all threads in the

warp executing the same instruction at the same time

35

Warp Divergence

36

Warp Divergence - solution

37

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position of the code) | 3.148 | 3.05x

Results taken on Kepler K20x system

38

NVVP Profiling

39

Constant memory

Constant Memory

Where is constant memory?

- Data is stored in the device global memory

- Read data through multiprocessor constant cache

- 64KB constant memory and 8KB cache for each SM

// declare constant memory
__constant__ float cst_ptr[size];

// copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr, host_ptr, data_size);
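As a concrete (assumed) example for this application, the 22-entry curve arrays could be declared in constant memory and uploaded once per valuation; the names and the use of double are assumptions, not the deck's code.

#include <cuda_runtime.h>

#define ARRAY_COUNT 22

__constant__ double c_zero_rate[ARRAY_COUNT];
__constant__ double c_swap_tenor[ARRAY_COUNT];

void upload_curve(const double* h_zero_rate, const double* h_swap_tenor) {
    // All threads of a warp read the same curve entry at a time, which is the
    // broadcast access pattern that constant memory and its per-SM cache favor.
    cudaMemcpyToSymbol(c_zero_rate,  h_zero_rate,  ARRAY_COUNT * sizeof(double));
    cudaMemcpyToSymbol(c_swap_tenor, h_swap_tenor, ARRAY_COUNT * sizeof(double));
}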

40

Constant memory

Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, …

Issue with the above implementation: warp divergence.

41

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position of the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x

Results taken on Kepler K20x system

42

NVVP Profiling

43

Read-Only Cache Memory

The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs).

Each SMX has a 48 KB read-only cache.

The CUDA compiler automatically accesses data via the read-only cache when it can determine that the data is read-only for the lifetime of the kernel.

In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition.

A read-only data cache access can also be requested explicitly with the __ldg() intrinsic function.
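The "without / with read-only cache" code on the next slide is not preserved in the transcript; below is a hedged sketch of what the two kernel variants typically look like, with illustrative parameter names.

// Without read-only cache: plain pointers; loads go through the normal global path.
__global__ void mtm_plain(const double* cashflow, const double* year_frac,
                          double* mtm, int nTrade, int nCf) {
    int ti = blockIdx.x * blockDim.x + threadIdx.x;
    if (ti >= nTrade) return;
    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf)
        acc += cashflow[ti * nCf + cf] * year_frac[ti * nCf + cf];
    mtm[ti] = acc;
}

// With read-only cache: const __restrict__ lets the compiler route the loads
// through the 48 KB per-SMX read-only cache; __ldg() requests it explicitly.
__global__ void mtm_ldg(const double* __restrict__ cashflow,
                        const double* __restrict__ year_frac,
                        double* __restrict__ mtm, int nTrade, int nCf) {
    int ti = blockIdx.x * blockDim.x + threadIdx.x;
    if (ti >= nTrade) return;
    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf) {
        int idx = ti * nCf + cf;
        acc += __ldg(&cashflow[idx]) * __ldg(&year_frac[idx]);   // read-only loads
    }
    mtm[ti] = acc;
}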

44

Read-Only Cache Memory

Without Read-only Cache

With Read-only Cache

45

NVVP Profiling

46

Results

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position of the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
8 | Read-only cache memory using const __restrict__ | 2.765 | 3.47x

Results taken on Kepler K20x system

47

Intel Multi-core Systems

Experimental Setup

Ivy Bridge :

The Intel Xeon E5 2650 v2, Ivy Bridge 2.6 GHz, dual socket, 8

cores/socket with Hyper-threading, 24GB RAM.

Haswell :

The Intel Xeon E5-2697 v3 @ 2.60GHz, dual socket, 14

cores/socket with Hyper-threading, 64 GB RAM.

48

Intel OpenMP

Parallelization using OpenMP: in the parallel section, each thread (T1, T2, …, Tn) computes the MTM value of one trade, while the surrounding regions remain sequential.

Syntax:

#pragma omp parallel for <clauses: private, firstprivate, ...>
for (trades = 0 to nTrade)
{
    for (CF = 0 to nCf)
    {
        computation steps
    }
}
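A hedged, compilable rendering of this loop, using the schedule(dynamic, chunk) clause mentioned on the next slide and the same simplified discounting as the earlier sketches; it is not the deck's actual code.

// One trade per loop iteration; iterations are independent, so the loop
// parallelizes cleanly with a reduction for the portfolio total.
#include <cmath>
#include <omp.h>
#include <vector>

double mtm_openmp(const std::vector<double>& cashflow,    // [nTrade * nCf]
                  const std::vector<double>& year_frac,   // [nTrade * nCf]
                  double zero_rate, int nTrade, int nCf) {
    double mtm_final = 0.0;

    #pragma omp parallel for schedule(dynamic, 64) reduction(+ : mtm_final)
    for (int trade = 0; trade < nTrade; ++trade) {
        double mtm = 0.0;
        for (int cf = 0; cf < nCf; ++cf) {
            int idx = trade * nCf + cf;
            mtm += cashflow[idx] * std::exp(-zero_rate * year_frac[idx]);
        }
        mtm_final += mtm;
    }
    return mtm_final;
}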

49

Intel Optimizations

• Multi-thread settings

• Optimization using compiler flags
  – -no-prec-div
  – -unroll-aggressive
  – -ipo

• schedule(dynamic, chunk)

50

Intel Optimizations

-no-prec-div

-prec-div improves the precision of floating-point divides. -no-prec-div disables this option and enables optimizations that give slightly less precise results than full IEEE division.

-unroll-aggressive

This option determines whether the compiler uses more aggressive unrolling for certain loops.

-ipo[n]

This option permits inlining and other interprocedural optimizations among multiple source files.
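For reference, a possible Intel compiler invocation combining these flags might look like the following; the source-file and output names are illustrative, and the OpenMP flag spelling depends on the compiler version.

icc -O2 -qopenmp -no-prec-div -unroll-aggressive -ipo -o mtm_value mtm_value.c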

51

Intel Optimizations Results on HSW-EP

Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Parallel computation using OpenMP of 20000 trades with 150 cash flows each | 10.6647 | -
2 | Using KMP_AFFINITY=granularity=fine,compact,1,0 | 7.7874 | 1.37x
3 | Using optimization flag -O2 | 3.4197 | 3.12x
4 | Using optimization flags -no-prec-div -unroll-aggressive | 3.1642 | 3.37x
5 | Using optimization flag -ipo | 2.7451 | 3.87x

Results taken on Intel Haswell System

52

Comparative Performance

53

Conclusion and Future Work

Traditional computing approaches are not suitable for the workload involved in computing MTM and IM for CCP risk assessment.

Using HPC, 750 million cash flows can be computed for identifying the liquidity requirement in a few seconds.

We achieved more than a 100x improvement over the best-known system for a single MTM computation.

HPC is well suited for complex calculations such as credit value adjustment for pricing counterparty risk in a deal (our next PoC use case), expected shortfall calculation for market risk measurement, potential future exposure and exposure-at-default calculations, collateral valuation, basis risk computation, etc.

Risk management is moving towards intraday risk computation for all major risk categories (credit risk, market risk, liquidity risk and counterparty credit risk), and HPC is well suited to meeting the performance demands of these computations.

The performance of the NVIDIA Kepler K80 is the best among all systems compared in our benchmarking results.

54

Acknowledgement

We are thankful to Vinay Deshpande from Nvidia Pune, India for providing

access to latest K40 and K80 GPU systems to benchmark and fine tune the

application.

We are thankful to HPC Advisory Council for providing access to the Thor

system for evaluating our application.

55

THANK YOU