hardware acceleration of monte- carlo structural financial instrument pricing using a gaussian...

Hardware Acceleration of Monte-Carlo Structural Financial Instrument

Pricing Using a Gaussian Copula Model

Presenter: Alexander Kaganov

Supervisor: Paul Chow

Increasing Computational Requirements (1/3)

In recent years the financial industry has seen:

1. Increasing contract/model complexity Every year new models are developed Unavailability of closed-form solution Necessitate Monte-Carlo pricing


2. Increasing portfolio sizes Increase in simple to price instruments Increase in complex derivate security

CDO issuance has increased from $157 billion in 2004 to $507 billion in 2007 (>3x)¹

3xN instruments

3xY time (at least)

N instruments

Y time

¹ SIFMA


3. Ever-present need to make real-time decisions Market trends can change quickly Instruments traded electronically

1 ms in Latency is Worth $100 M in Stock Trading Business Value (AMD

Analyst Day-26 july 2007)

Trends in Financial Monte-Carlo Algorithms

1. Computationally intensive Accuracy Time2

2. Highly repetitive A large portion of the calculation time

is spent in a small portion of the code (~90% of the time is spent in ~10% of

the code)

3. High degree of coarse and fine-grain parallelism

Coarse-GrainFine-Grain

Typical MC Financial simulation

Collateralized Debt Obligation (CDO)

CDOProblem:

Banks typically hold portfolios with highly volatile assets.

Solution: Sell assets to an outside entity (SPV), which combines the

different assets together into one collateral pool Repackage the pool as CDO tranches. Sell tranches as form of protection to investors in return for

premium payments

CDO Structure/Pricing (1/2)

Investors

Sponsor (Bank)

Bonds

Loans

CDS

CDOs

Collateral Pool

SPV

Tranches

Super Senior: 12%-100%

Senior: 6% -12%

Mezzanine: 3% -6%

Equity: 0% -3%

Borrowers

(Credit Default Swap)

Each tranche has attachment and detachment points Losses below attachment point → the tranche is unaffected Losses above the detachment point → the tranche becomes inactive

Investor premium is paid based on the tranche width minus tranche losses

CDO Structure/Pricing (2/2)

Attachment (3%)

Detachment (6%)

Tranche Losses

Investor Premium Payments4%

Mezzanine Tranche:

Paid premium based on the full investment

Losses 1/3 of the principal investment. Premium paid based on 2/3 of the original investment

Investor Premium Payments

))())(1

11

T

iiii

T

iiii dLLEdLSsE

CDO Tranche Value = Premium Leg – Default Leg

S =tranche thickness si= Premium

di= Discount factor Li= Tranche loses at time interval i

iLE

Li’s Gaussian Copula Model

Monte-Carlo (MC) pricing model For each path:

2. Compare: )]([1 tPY ii

iiii ZXY 21 1. Generate: iiij

m

jiji ZXY

)(1

or

3. Record losses:

One-Factor Multi-Factor

Implementation

Multi-Core Architecture Three portions: Distributor, Pricing

cores, and Collector. Coarse Grain Parallelism: MC paths

divided among OFGC cores All cores have the same input data

except for random generator seeds Data transfer occurs in parallel to

calculations Double Buffering

Minimal required data transfer rate with an external processor of: 11.3MBytes/sec 1-Lane PCI express- 250 MBytes/sec Data transfer latency can be hidden

Pricing Core Design

Phase 4: Convert collateral pool losses to tranche losses

Phase 5: Accumulate tranche losses

Phase 3: Combine the partial sums, L(ti)’s.

Phase 1: Generate Yi

Phase 2: Compare Yi<Φ-1[P(τi<t)]. Record partial losses

Phase 1

One-Factor Multi-Factor

Four factors accumulated per clock cycle

Accumulation is performed in parallel to Phase 2 execution

Pricing Core Design






Phase 2 Compare Yi<Φ-1[P(τi<t)].

Record Losses Fine-grain parallelism:

parallelize over time 8 replicas

More replicas → higher speedup (potentially) However, large portions of

the hardware become underutilized

Pipelined adder latency creates multiple partial sums

Pricing Core Design









Results

One-Factor Utilization*Single

PrecisionDouble

PrecisionHybrid “Integer”

Flip-Flops 6530 9910 (+51.2%)

6721 (+2.9%)

4906

(-24.9%)

LUTs 7052 13325 (+89.0%)

7599 (+7.8%)

5224

(-25.9%)

BRAMs 15 31 (+106.7%)

15

(0%)

15

(0%)

DSP48Es 29 40 (+37.9%) 30 (+3.4%) 7 (-75.9%)

Frequency 248.8 190.9

(-23.3%)

244.8

(-1.6%)

268.2

(+7.8%)

Average Error (%)

0.37

[1.07]

0 3.02E-5

[5.27E-5]

0

*Utilization for Virtex 5 SX50T chip

Performance: Benchmarks# Based on Data From # of

Assets# of

Time Steps

# of Default Curves

1 CDX.NA.HY 100 15 5

2 CDX.NA.IG 125 35 5

3 CDX.NA.IG.HVOL 30 19 4

4 CDX.NA.XO 35 22 4

5 CDX.EM 14 6 4

6 CDX.DIVERSIFIED 40 23 5

7 CDX.NA.HY.BB 37 13 4

8 CDX.NA.HY.B 46 26 4

9 Semi-homogenous 400 24 2

Credit rating and number of instruments are based on Dow Jones CDX

Notionals obtained from Moody’s, range from $600,000 to $6.6 billion

α: uniformly distributed in [0, 1]

Recovery rate: Normally distributed, N (0.4,0.15)

# of Time Steps: Normally distributed, N (20,10)

# of Systemic Factors: varied from 1 to 16

Processor vs. FPGA setup

3.4 GHz Intel Xeon Processor

3GB RAM C++ program 100,000 Monte-Carlo paths

Virtex 5 SX50T speed grade -3

Connected to host through PCI express

100,000 Monte-Carlo paths

0

5

10

15

20

25

CD

X.N

A.H

Y

CD

X.N

A.IG

CD

X.N

A.IG

.HV

OL

CD

X.N

A.X

O

CD

X.E

M

CD

X.D

IVE

RS

IFIE

D

CD

X.N

A.H

Y.B

B

CD

X.N

A.H

Y.B

Sem

i-homogenous

AV

ER

AG

E

Benchmarks

Sp

eed

up Double Precision

Single Precision

Single/Double Hybrid

"Integer"

Performance: One-Factor Single Core Results (1/3)


0

5

10

15

20

25

30

0 10 20 30 40 50 60 70

# of Time Steps

Sp

eed

up Single Precision

Double Precision

Integer


Single Core Average Acceleration:

Double Precision: 10.8 X

Single Precision: 14.0 X

Single/Double Hybrid: 13.8 X

Integer: 15.8 X

Performance: One-Factor Multi-Core Results Monte-Carlo paths independence allows for a linear speedup as

more pricing cores are incorporated.

Double Single Single/Double Hybrid

Integer

Single Core

Acceleration

10.8X 14.0X 13.8X 15.8X

Maximum # of

Instantiations

2 4 4 5

Multi-Core Acceleration

15.9X 47.2X 47.5X 64.5X

Performance: Multi-Factor Results (1/3)

8

Steps Time of #TRF

4

Factors Systemic of #SFRF

• SFRF is the number of cycles before Phase 1 generates a new Yi

•TRF is the number of cycles before Phase 2 requires a new Yi

Consumer: Phase 2

Ratec: 1/TRF

Producer: Phase 1

Ratep: 1/SFRF

Performance: Multi-Factor Results (2/3)

Benchmark 1: CDX.NA.HY

0

5

10

15

20

25

30

0 2 4 6 8 10 12 14 16 18

# of Systemic Factors

Sp

eed

up

Double-Precision

Single-Precision

Integer

Ratec<= Ratep Ratec>Ratep

Performance: Multi-Factor Results (3/3)Ratep>Ratec Ratep= Ratec Ratep<Ratec Average

Double Precision

1 Core 13.3X 15.9X 12.0X 13.5X

2 Cores 18.7X 22.5X 16.9X 19.0X

Single Precision

1 Core 16.7X 20.1X 15.1X 17.0X

4 Cores 52.8X 63.4X 47.7X 53.6X

Hybrid

1 Core 16.5X 19.9X 14.9X 16.8X

4 Cores 50.3X 60.4X 45.4X 51.1X

Integer

1 Core 17.7X 22.4X 18.6X 18.9X

5 Cores 67.0X 84.5X 70.1X 71.5X

Summary Presented a hardware architecture for pricing

Collateralized Debt Obligations using Li’s models Demonstrating the flexibility of an FPGA in exploiting

both coarse- and fine-grain parallelism

Established that either a single/double hybrid or “integer” representations could be used to balance resource utilization and accuracy

Achieved a maximal acceleration of over 64-fold and 71-fold, for the one-factor and multi-factor models, respectively

Future Work

1. Expand to a multi-FPGA system

2. Attempt the algorithm on a different accelerator architectures

GPU

3. Incorporate into a financial software suite

Thank You(Questions?)

hardware acceleration of monte- carlo structural financial instrument pricing using a gaussian...

Documents