TRANSCRIPT
1
Counterparty Credit Risk and IM Computation for CCP
on Multicore Systems
Prasad Pawar
Nishant Kumar
Amit Kalele
Tata Consultancy Services Limited
2
Overview
Introduction
Counterparty Credit Risk
• Basic Terminology
Sequential Algorithm
Parallel Algorithm Using CUDA
Optimizations Applied on GPGPU
Parallelized and Optimized Algorithm on the Intel Platform
Comparison Results
Conclusion and Future Work
3
Introduction
Counterparty credit risk
Counterparty credit risk is defined as the risk that the counterparty to a
transaction could default before the final settlement of the transaction’s cash flows.
Basic Terminology
• IRS trade - An interest rate swap (IRS) trade, done primarily by market participants for hedging or for speculating on the direction of interest rates
• Cash Flow - In the context of IRS trades, a cash flow is a sum of money to be paid or received on one of the predefined cash flow dates specified in the trade
• Zero Coupon Yield Curve - A curve of zero coupon bond yields plotted against time to maturity; it provides the forward rates and spot rates used for computing and discounting cash flows
• MTM value - The mark to market value of the IRS trade, which reflects the monetary gain or loss on the trade to each of the two parties
4
Counterparty Credit Risk
Mark to Market Computations
• The central counterparty (CCP) values the complete portfolio of interest rate swap trades received from all members on an intraday basis, using the current yield curve.
• It calculates the mark to market (MTM) margin requirement for each member.
• It blocks the margin from the member's collateral and, if the collateral is not sufficient, makes a margin call to the member.
Initial Margin Computations
• The IM computation requires valuing each member's current portfolio 250 times, using 250 different yield curves picked from historical data.
5
Challenges
On traditional (database) systems, the MTM computation takes ~25 min for 20,000 trades, each with ~150 cash flows.
The Initial Margin (IM) computation takes ~10 min for 250 different yield curves, each applied to 20,000 trades with ~150 cash flows, on a .NET based solution.
Such high timings lead to the following problems:
• The process is inefficient: the user is unproductive during the 25 minutes the valuation takes
• The timings are too high when information is required urgently by senior executives or regulators
• If trade volumes grow to, say, 100,000 (a realistic possibility), the time taken will exceed 2 hours, which is virtually unacceptable
• Until the IM result is computed, a member can continue trading, yet those trades are guaranteed for settlement from the point of trade in TS, which increases the risk carried by the CCP
An efficient solution is required for this problem.
6
Computational Steps
Yield curve generation:
1. Using linear interpolation, compute the intermediate swap rates for tenors whose swap rates are not quoted. With $(x, y)$ representing a tenor and its corresponding interest rate, the rate at tenor $x$ between quoted points $(x_0, y_0)$ and $(x_1, y_1)$ is
   $y = y_0 + (y_1 - y_0)\,\dfrac{x - x_0}{x_1 - x_0}$
2. Zero rates for tenors up to one year are computed using the continuous compounding method; for a simple rate $r$ quoted at tenor $t$,
   $z = \dfrac{\ln(1 + r\,t)}{t}$
   (see the sketch below)
3. The standard bootstrapping method is used to compute zero rates for tenors beyond one year.
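A minimal host-side sketch of steps 1 and 2, assuming the short-tenor quotes are simple rates; the function names and the 365-day-year-style conventions are illustrative, not from the deck:

#include <math.h>

/* Step 1: linear interpolation of the swap rate at tenor x from the two
   quoted points (x0, y0) and (x1, y1) that bracket it. */
double interpolate_rate(double x, double x0, double y0, double x1, double y1)
{
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

/* Step 2: continuously compounded zero rate for a tenor t <= 1 year,
   converting a simple rate r quoted for that tenor: z = ln(1 + r*t) / t. */
double zero_rate_short(double r, double t)
{
    return log(1.0 + r * t) / t;
}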
7
Computational Steps
MTM values
The inputs to the mark to market computation are the business date, the immediately previous and future cash flow dates, the principal amount, accrued interest, fixed rate, floating rate, and zero rate.
1. Compute the fixed cash flows, and compute the floating cash flows using the discrete equivalent formula for the future floating interest rates, i.e. the forward rates; under continuous compounding, the forward rate between tenors $t_1 < t_2$ with zero rates $z_1, z_2$ is
   $f_{1,2} = \dfrac{z_2 t_2 - z_1 t_1}{t_2 - t_1}$
2. Calculate the MTM value of each trade by discounting the fixed and floating cash flows with the zero rate and netting the fixed cash flows off against the floating ones:
   Discounted Value = Present value of cash flows
   Trade MTM = Sum(Discounted Value)
3. The MTM value, i.e. the margin requirement for each member, is obtained by aggregating the MTM values of all the member's trades:
   MTM for Member = Sum(Trade MTM of that Member)
8
Computational Steps
Initial Margin calculation
1. Value the complete IRS trade portfolio of a member 250 times using 250 historical zero rate curves
2. Compute the daily percentage change in the MTM value of the portfolio and record the 249 results
3. Assign a weight to each result using an EWMA scheme, such that the more recent the result, the higher the weight
4. Sort the 249 results in ascending order
5. Accumulate the weights from the top; where the cumulative weight reaches 0.05, i.e. the worst 5 percent, the corresponding percentage change is multiplied by the portfolio value and adjusted for the holding period to give the IM (see the sketch below)
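A minimal sketch of these five steps. The EWMA decay lambda = 0.94, the weight normalization, and the sqrt holding-period adjustment with a 2-day horizon are illustrative assumptions; the deck does not give these values:

#include <algorithm>
#include <cmath>
#include <vector>

double compute_im(const std::vector<double>& pct_change,  // 249 daily changes, index 0 = oldest
                  double portfolio_value,
                  double lambda = 0.94, int holding_days = 2)
{
    const int n = (int)pct_change.size();                 // 249
    std::vector<std::pair<double, double>> scen(n);       // (change, weight)
    double norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double w = std::pow(lambda, n - 1 - i);           // newer => larger weight
        scen[i] = std::make_pair(pct_change[i], w);
        norm += w;
    }
    // Sort ascending: the worst (most negative) changes come first.
    std::sort(scen.begin(), scen.end());
    // Accumulate normalized weights from the top until the worst 5% is covered.
    double cum = 0.0, var_change = 0.0;
    for (const auto& s : scen) {
        cum += s.second / norm;
        if (cum >= 0.05) { var_change = s.first; break; }
    }
    // Scale by portfolio value and adjust for the holding period.
    return -var_change * portfolio_value * std::sqrt((double)holding_days);
}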
10
Input and Output of IRS
• Input - Swap Rate, Swap Tenor
• Output - Yield curve
• Input for MTM computation - Cash_flow_dates, Prev_cash_flow_dates, Notional_amt, accured_int, fixed_rate
• Output - Present CashFlow, MTM value
11
Sequential Algorithm
Single MTM Computation
Compute zero rates
MTMfinal = 0
for trade = 0 to nTrade - 1 do
    MTM[trade] = 0
    for CF = 0 to nCf - 1 do
        Compute Present_CashFlow[CF]
        MTM[trade] = MTM[trade] + Present_CashFlow[CF]
    end for
    MTMfinal = MTMfinal + MTM[trade]
end for
12
Sequential Algorithm
Single MTM Computation (a C sketch of this loop follows the steps)
for CF = 0 to nCf - 1 do
    if CF == 0 then
        Read eff_date, curr_cash_flow_date
    else
        Read last_cash_flow_date, curr_cash_flow_date
    1. Calculate the number of days between curr_date and last_date to get the tenor.
    2. Calculate intermediate values such as fw_rate, comp_fw_rate, dist_fw_rate.
    3. Calculate the floating cash flow of the trade from the values above and inputs such as notional_amt, accrued_int, and fixed_rate.
    4. Compute the present value of the cash flow using the fixed/floating cash flow and the yield curve.
    5. MTM = MTM + Present_CashFlow[CF]
end for
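A C skeleton of the loop above, using the dt date struct from the later data-structure slide. days_between(), forward_rate(), and zero_rate_at() are hypothetical stand-ins for the fw_rate / comp_fw_rate / dist_fw_rate computations the slide names, and the leg formulas are simplified:

#include <math.h>

typedef struct { int day, mon, year; } dt;  /* per the data-structure slide */

double days_between(dt a, dt b);   /* hypothetical helpers, declared only */
double forward_rate(dt a, dt b);
double zero_rate_at(dt d);

double trade_mtm(dt business_date, dt eff_date, const dt cf_dates[], int nCf,
                 double notional_amt, double accrued_int, double fixed_rate)
{
    double mtm = 0.0;
    dt last_date = eff_date;
    for (int CF = 0; CF < nCf; ++CF) {
        dt curr_date = cf_dates[CF];
        /* step 1: day count between the two dates gives the period tenor */
        double period = days_between(last_date, curr_date) / 365.0;
        /* step 2: forward rate for the period from the yield curve */
        double fwd = forward_rate(last_date, curr_date);
        /* step 3: net cash flow of the floating and fixed legs (simplified) */
        double cash_flow = notional_amt * (fwd - fixed_rate) * period + accrued_int;
        /* step 4: discount back to the business date with the zero rate */
        double t = days_between(business_date, curr_date) / 365.0;
        double present_cash_flow = cash_flow * exp(-zero_rate_at(curr_date) * t);
        /* step 5: accumulate */
        mtm += present_cash_flow;
        last_date = curr_date;
    }
    return mtm;
}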
13
Sequential Algorithm
Initial Margin Computation using 250 MTM values:
Compute the current date's zero rates and retrieve 249 historical zero rate curves from the database.
MTMfinal[0..nRate-1] = 0
for rate = 0 to nRate - 1 do
    // single MTM computation:
    for trade = 0 to nTrade - 1 do
        MTM[trade] = 0
        for CF = 0 to nCf - 1 do
            Compute Present_CashFlow[CF]
            MTM[trade] = MTM[trade] + Present_CashFlow[CF]
        end for
        MTMfinal[rate] = MTMfinal[rate] + MTM[trade]
    end for
end for
Compute IM using MTMfinal[0..nRate-1]
14
NVIDIA GPU Systems
Kepler K20x -
Device: NVIDIA Kepler K20x GPU, 796 MHz, 2496 cores, 5 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.1 GHz, dual socket, 6 cores/socket, 16 GB RAM.
Kepler K40 -
Device: NVIDIA Kepler K40 GPU, 745 MHz, 2880 cores, 12 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.
Kepler K80 -
Device: NVIDIA Kepler K80 GPU, 562 MHz, 2x2496 cores, 2x12 GB RAM.
Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.
15
GPU Algorithm for Single MTM
MTMfinal = 0
Compute zero rates
Launch CUDA kernel with nTrade = 20000 threads; each thread Ti computes:
    MTM[Ti] = 0
    for CF = 0 to nCf - 1 do
        Compute discount_rate[CF]
        MTM[Ti] = MTM[Ti] + discount_rate[CF]
    end for
MTMfinal = Σ MTM[Ti] over all threads T1 ... T20000 (a CUDA sketch follows)
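A minimal sketch of the one-thread-per-trade kernel, assuming the cash flows and discount factors are precomputed into flat device arrays (names and layout are illustrative):

__global__ void mtm_kernel(const double* cash_flows, const double* disc_factors,
                           double* mtm, int nTrade, int nCf)
{
    int ti = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per trade
    if (ti >= nTrade) return;
    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf)                  // this trade's cash flows
        acc += cash_flows[ti * nCf + cf] * disc_factors[cf];
    mtm[ti] = acc;                                    // per-trade MTM
}

// Host side: launch over the 20,000 trades, then reduce mtm[] to MTMfinal
// (e.g. with thrust::reduce or a reduction kernel):
// mtm_kernel<<<(20000 + 255) / 256, 256>>>(d_cf, d_disc, d_mtm, 20000, 150);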
16
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. yield curves on K20x | 9.612 | 8.44x
Results taken on Kepler K20x system
17
Further Optimization
NVIDIA GPU optimizations:
Multi-level parallelism using Hyper-Q
Using shared memory with coalesced memory access
Modified data structure
Resolved the issue of warp divergence
Using constant memory
Read-only cache memory using const __restrict__
18
Multi-level parallelism using Hyper-Q
Hyper-Q allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process.
It provides 32 concurrent hardware work queues, so the GPU can receive work from up to 32 processes or streams at the same time.
A 1.5x performance benefit was achieved.
Figure source: nvidia.com
19
GPU Algorithm for 250 MTM & IM Computation
Set up 32 CUDA streams (S0 ... S31) and nRate = 250
Compute zero rates and retrieve the 249 previous zero rate curves
MTMfinal[0..nRate-1] = 0
Distribute nRate/32 valuations to each of the 32 streams (see the streams sketch below)
Compute IM using MTMfinal[0..nRate-1]
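A minimal sketch of round-robining the 250 valuations over 32 streams. mtm_kernel_for_rate and the device pointers are illustrative; Hyper-Q lets kernels in different streams overlap on the device:

const int nStreams = 32, nRate = 250;
cudaStream_t streams[nStreams];
for (int s = 0; s < nStreams; ++s)
    cudaStreamCreate(&streams[s]);

dim3 grid((20000 + 255) / 256), block(256);
for (int rate = 0; rate < nRate; ++rate)      // rate % 32 picks the stream
    mtm_kernel_for_rate<<<grid, block, 0, streams[rate % nStreams]>>>(
        d_trades, d_curves, d_mtm_final, rate);

cudaDeviceSynchronize();                      // wait for all 250 valuations
for (int s = 0; s < nStreams; ++s)
    cudaStreamDestroy(streams[s]);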
20
Hyper-Q using default streaming
nvcc --default-stream per-thread -c MTM_value.cu -arch sm_35 -w -Xcompiler -fopenmp
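With --default-stream per-thread, each host thread launches into its own default stream, so the OpenMP threads enabled by -Xcompiler -fopenmp feed Hyper-Q without explicit cudaStream_t handling. A sketch (kernel name, grid, and block as in the previous sketch):

#pragma omp parallel for
for (int rate = 0; rate < 250; ++rate)       // each host thread: own stream
    mtm_kernel_for_rate<<<grid, block>>>(d_trades, d_curves, d_mtm_final, rate);
cudaDeviceSynchronize();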
21
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline for the gains below) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
Results taken on Kepler K20x system
22
Using Shared Memory
[Figure: CUDA memory hierarchy - the host and all blocks of the grid see global and constant memory; each block has its own shared memory; each thread has its own registers. Source: nvidia.com]
• Read/write per block
• Speed equivalent to a local cache
• 100x faster than global memory
• Limited to 48KB
• The zero rate and swap tenor arrays are placed in shared memory
23
Coalesced global memory access
[Figure: global memory access patterns. Top: threads 0-31 access global memory at widely strided addresses (128, 192, 256, ..., 2112, 2176), so each load touches a separate cache line. Bottom: threads T0-T31 access consecutive addresses (128, 132, 136, ..., 252, 256) staged through shared memory, so the warp's loads coalesce.]
24
Using Shared Memory
[Code listing: shared-memory version of the kernel; ARRAY_COUNT = 22 and the number of threads per block is 32, 64, ...]
Because the zero_rate and swap_tenor data are used multiple times per thread, they are stored in shared memory (see the sketch below).
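The slide's listing is not preserved; a minimal sketch of the idea, with the curve staged into shared memory once per block (kernel name and parameter list are illustrative):

#define ARRAY_COUNT 22   /* per the slide: 22 curve points */

__global__ void mtm_shared_kernel(const double* zero_rate_g,
                                  const double* swap_tenor_g /* , trade data ... */)
{
    __shared__ double zero_rate[ARRAY_COUNT];
    __shared__ double swap_tenor[ARRAY_COUNT];
    // The first ARRAY_COUNT threads of the block cooperatively load the curve.
    if (threadIdx.x < ARRAY_COUNT) {
        zero_rate[threadIdx.x]  = zero_rate_g[threadIdx.x];
        swap_tenor[threadIdx.x] = swap_tenor_g[threadIdx.x];
    }
    __syncthreads();   // curve must be fully loaded before any thread reads it
    // ... per-trade cash-flow loop now reads zero_rate[] / swap_tenor[]
    // from shared memory instead of global memory ...
}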
25
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
Results taken on Kepler K20x system
27
Data Structure of Dates
Data Structure -
Cash_flow_dates and prev_cash_flow_date were stored as a structure:

typedef struct _d {
    int day;   /* 4 bytes */
    int mon;   /* 4 bytes */
    int year;  /* 4 bytes */
} dt;          /* 12 bytes total */
28
Issue with the Data Structure
[Figure: threads T0-T31 reading 12-byte date structures at global-memory offsets 0, 12, 24, ...]
In one cycle the cache fetches 128 bytes, which holds only 128/12 ≈ 10 date elements
A large amount of data must be fetched from global memory
The access pattern to global memory is strided
29
Modified Data Structure
Original data structure - Cash_flow_dates and prev_cash_flow_date stored as a structure:

typedef struct _d {
    int day;   /* 4 bytes */
    int mon;   /* 4 bytes */
    int year;  /* 4 bytes */
} dt;          /* 12 bytes total */

Instead of the structure, the date information is packed into a single integer:

dt date;
date.day = 12;
date.mon = 3;
date.year = 2010;

int tmp_date = date.day | (date.mon << 8) | (date.year << 16);
30
Modified Data Structure
To extract the day, month, and year back from the packed integer, we use
the method below:
date.day = tmp_date & 0xFF;
date.mon = (tmp_date >> 8) & 0xFF;
date.year = (tmp_date >> 16) & 0xFFFF;
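A quick host-side round-trip check of the packing scheme (the struct is the dt from the earlier slide; day and month fit in 8 bits each, the year in 16 bits):

#include <assert.h>

typedef struct _d { int day, mon, year; } dt;

int main(void)
{
    dt date = { 12, 3, 2010 };
    int tmp_date = date.day | (date.mon << 8) | (date.year << 16);  /* pack */

    dt out;
    out.day  = tmp_date & 0xFF;                                     /* unpack */
    out.mon  = (tmp_date >> 8) & 0xFF;
    out.year = (tmp_date >> 16) & 0xFFFF;

    assert(out.day == 12 && out.mon == 3 && out.year == 2010);
    return 0;
}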
31
Modified Data Structure
[Figure: threads T0-T31 reading 4-byte packed dates at global-memory offsets 0, 4, 8, ...]
In one cycle the cache fetches 128 bytes, which now holds 128/4 = 32 date elements
Less data needs to be fetched from global memory
The access pattern to global memory is coalesced
32
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure - cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
Results taken on Kepler K20x system
34
Warp Divergence
Threads are executed in warps of 32, with all threads in a warp executing the same instruction at the same time. When threads of the same warp take different sides of a branch, the two paths are executed serially (see the sketch below).
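The deck only says the divergence was resolved "by separating the if-else position of the code"; a hypothetical illustration of that kind of restructuring (the actual branch in the MTM kernel is not shown):

// Divergent: the branch direction differs per thread within a warp,
// so the warp executes both paths serially.
if (is_fixed_leg[ti]) value = fixed_cash_flow(ti);
else                  value = floating_cash_flow(ti);

// Restructured: partition the trades on the host so each warp processes
// only one leg type (a uniform branch), or use a select the compiler can
// predicate instead of branching:
value = is_fixed_leg[ti] ? fixed_cash_flow(ti) : floating_cash_flow(ti);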
37
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure - cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x
Results taken on Kepler K20x system
39
Constant memory
Where is constant memory?
- Data is stored in device global memory
- Data is read through the per-multiprocessor constant cache
- 64KB of constant memory, with an 8KB cache per SM

// declare constant memory
__constant__ float cst_ptr[size];
// copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr, host_ptr, data_size);
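A minimal sketch of the change made here: the curve arrays live in constant memory, so all threads of a warp reading the same element get a constant-cache broadcast, with no shared-memory staging or __syncthreads() (kernel name and body are illustrative):

#define ARRAY_COUNT 22

__constant__ double c_zero_rate[ARRAY_COUNT];
__constant__ double c_swap_tenor[ARRAY_COUNT];

__global__ void mtm_const_kernel(double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && i < ARRAY_COUNT)
        out[i] = c_zero_rate[i] * c_swap_tenor[i];  // reads hit the constant cache
}

// Host side, once per valuation run:
// cudaMemcpyToSymbol(c_zero_rate,  h_zero_rate,  ARRAY_COUNT * sizeof(double));
// cudaMemcpyToSymbol(c_swap_tenor, h_swap_tenor, ARRAY_COUNT * sizeof(double));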
40
Constant memory
[Code listing: constant-memory version of the kernel; ARRAY_COUNT = 22 and the number of threads per block is 32, 64, ...]
Issue with the above implementation - warp divergence
41
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure - cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
Results taken on Kepler K20x system
43
Read-Only Cache Memory
The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs).
Each SMX has a 48KB read-only cache.
The CUDA compiler automatically routes accesses through the read-only cache when it can determine that the data is read-only for the lifetime of the kernel.
In practice, pointers must be qualified as const and __restrict__ before the compiler can verify this condition.
Alternatively, a read-only data cache access can be specified explicitly with the __ldg() intrinsic function (see the sketch below).
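A minimal sketch combining both mechanisms, reusing the illustrative one-thread-per-trade kernel from the earlier slide (names and layout are assumptions):

__global__ void mtm_ro_kernel(const double* __restrict__ cash_flows,
                              const double* __restrict__ disc_factors,
                              double* __restrict__ mtm, int nTrade, int nCf)
{
    int ti = blockIdx.x * blockDim.x + threadIdx.x;
    if (ti >= nTrade) return;
    double acc = 0.0;
    for (int cf = 0; cf < nCf; ++cf)
        // __ldg() forces the loads through the 48KB read-only cache;
        // with const __restrict__ the compiler can also do this on its own.
        acc += __ldg(&cash_flows[ti * nCf + cf]) * __ldg(&disc_factors[cf]);
    mtm[ti] = acc;
}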
46
Results
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates (baseline) | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure - cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
8 | Read-only cache memory using const __restrict__ | 2.765 | 3.47x
Results taken on Kepler K20x system
47
Intel Multi-core Systems
Experimental Setup
Ivy Bridge:
Intel Xeon E5-2650 v2 (Ivy Bridge) @ 2.6 GHz, dual socket, 8 cores/socket with Hyper-Threading, 24 GB RAM.
Haswell:
Intel Xeon E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket with Hyper-Threading, 64 GB RAM.
48
Intel OpenMP
[Figure: fork-join model - a sequential region forks into threads T1 ... Tn for the parallel section, where each thread calculates the MTM value of its trades, then joins back into a sequential region.]
Syntax (a runnable sketch follows):

#pragma omp parallel for clauses(private, firstprivate, ...)
for (trade = 0 to nTrade)
{
    for (CF = 0 to nCf)
    {
        computation steps
    }
}
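A minimal runnable sketch of this parallel region; trade_mtm_for() is a hypothetical stand-in for the inner cash-flow loop, and the chunk size 64 is an illustrative value for the schedule(dynamic, chunk) clause mentioned on the next slide:

#include <omp.h>

double trade_mtm_for(int trade);   /* hypothetical: inner nCf cash-flow loop */

double mtm_all_trades(int nTrade)
{
    double MTMfinal = 0.0;
    /* one trade per iteration; per-trade MTMs reduced into MTMfinal */
    #pragma omp parallel for reduction(+ : MTMfinal) schedule(dynamic, 64)
    for (int trade = 0; trade < nTrade; ++trade)
        MTMfinal += trade_mtm_for(trade);
    return MTMfinal;
}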
49
Intel Optimizations
Multi-thread settings
Optimization using compiler flags:
- -no-prec-div
- -unroll-aggressive
- -ipo
schedule(dynamic, chunk)
50
Intel Optimizations
-no-prec-div
-prec-div improves the precision of floating-point divides.
-no-prec-div disables it and enables optimizations that give slightly less precise results than full IEEE division.
-unroll-aggressive
This option makes the compiler use more aggressive unrolling for certain loops.
-ipo[n]
This option permits inlining and other interprocedural optimizations across multiple source files.
51
Intel Optimization Results on HSW-EP
Sr. No. | Experiment | Time in Sec | Performance Gain
1 | Parallel computation using OpenMP of 20000 trades with 150 cash flows each | 10.6647 | -
2 | Using KMP_AFFINITY=granularity=fine,compact,1,0 | 7.7874 | 1.37x
3 | Using optimization flag -O2 | 3.4197 | 3.12x
4 | Using optimization flags -no-prec-div -unroll-aggressive | 3.1642 | 3.37x
5 | Using optimization flag -ipo | 2.7451 | 3.87x
Results taken on Intel Haswell system
53
Conclusion and Future Work
Traditional computing approaches are not suitable for the workload involved in computing MTM and IM for CCP risk assessment.
Using HPC, 750 million cash flows can be computed in a few seconds to identify the liquidity requirement.
We achieved a more than 100x improvement over the best known system for single MTM computation.
HPC is well suited for complex calculations such as credit value adjustment for pricing counterparty risk into a deal (our next PoC use case), expected shortfall calculation for market risk measurement, potential future exposure and exposure at default calculations, collateral valuation, basis risk computation, etc.
Risk management is moving towards intraday risk computation for all major risk categories (credit risk, market risk, liquidity risk, and counterparty credit risk), and HPC is well suited to meeting the performance demands of these computations.
The performance of the NVIDIA Kepler K80 was the best among all systems compared in our benchmarking.
54
Acknowledgement
We are thankful to Vinay Deshpande from NVIDIA Pune, India, for providing access to the latest K40 and K80 GPU systems to benchmark and fine-tune the application.
We are thankful to the HPC Advisory Council for providing access to the Thor system for evaluating our application.