2011-eurex - gpu-accelerated stochastic volatility models

May 17-19, 2011 © Hanweck Associates, LLC

Hanweck Associates, LLC
30 Broad St., 42nd Fl.
New York, NY 10004

www.hanweckassoc.com
Tel: +1 646-414-7274

GPU-Accelerated Stochastic Volatility Modeling
Gerald A. Hanweck, Jr., PhD
CEO, Hanweck Associates, LLC

Eurex Quantitative Seminars
Chicago (May 17, 2011), Toronto (May 18, 2011), New York (May 19, 2011)


Agenda

Introduction

Stochastic Volatility Review
• Why Use Stochastic Volatility?
• The Heston (1993) Stochastic Volatility Model

A Primer on Graphics-Processing Unit (GPU) Computing

Using GPUs to Solve the Heston Model

Application: Fitting Eurex Euro-Bund Options

Questions


Stochastic Volatility Review


Why Use Stochastic Volatility?

Realized volatility is not constant over time

Implied volatility is not constant over time

Market prices viewed through constant-volatility models (e.g., Black-Scholes) exhibit:
• volatility smiles
• volatility skews

Better modeling of volatility leads to:
• better hedging
• better risk management


Volatility Is Not Constant

[Figure: Bund Realized 3M vs. Rolling Quarterly ATM Implied Volatility. Series: Realized Vol, ATM Implied Vol. Y-axis: Vol (0% to 12%); x-axis: Date (5/28/05 to 4/1/12).]

*Implied vol was calculated using the at-the-money front-month quarterly options, rolled at 3 weeks to expiration.


Volatility Smiles

Example:
• Options expire in 3 months
• Futures = 100
• Volatility to expiry can take, with equal probability, one of three levels: 5%, 10% or 15%.

Call option premia:

              Volatility (probability 1/3 each)     Expected    Implied
Strike        5%        10%       15%               Premium     Volatility
95            5.02      5.39      6.07              5.49        10.88%
100           1.00      1.99      2.99              1.99        10.00%
105           0.02      0.45      1.19              0.55        10.82%

[Figure: Implied Volatility of Expected Premium (9.80% to 11.00%) vs. Strike (95 to 105).]

What’s going on? As one moves away from the money, option premium becomes more convex in volatility, and Jensen’s Inequality tells us that the expected premium will be greater than the premium at the expected volatility.
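The table can be reproduced with a few lines of code. The sketch below assumes the premia are undiscounted Black (1976) call prices on the future; the slide does not state the pricing formula or the interest rate, so that convention is an assumption. The names `norm_cdf`, `black76_call` and `expected_premium` are illustrative and do not come from the original.

```cuda
#include <math.h>

// Standard normal CDF via the complementary error function.
__host__ __device__ inline double norm_cdf(double x)
{
    return 0.5 * erfc(-x / sqrt(2.0));
}

// Undiscounted Black (1976) call premium on a future F with strike K,
// volatility sigma and time to expiry T in years (assumed convention).
__host__ __device__ inline double black76_call(double F, double K,
                                               double sigma, double T)
{
    double s  = sigma * sqrt(T);
    double d1 = log(F / K) / s + 0.5 * s;
    double d2 = d1 - s;
    return F * norm_cdf(d1) - K * norm_cdf(d2);
}

// Expected premium when volatility is 5%, 10% or 15% with equal probability.
__host__ __device__ inline double expected_premium(double F, double K, double T)
{
    const double vols[3] = {0.05, 0.10, 0.15};
    double sum = 0.0;
    for (int i = 0; i < 3; ++i)
        sum += black76_call(F, K, vols[i], T);
    return sum / 3.0;
}
```

With F = 100 and T = 0.25, `expected_premium(100, 105, 0.25)` exceeds `black76_call(100, 105, 0.10, 0.25)`, which is the Jensen effect in the table; at the strike of 100 the two values nearly coincide.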


Volatility Smiles (cont’d)

Example: Euro-Bund March 2009 Implied Volatility on January 5, 2009

[Figure: Implied Volatility (9.00% to 10.80%) vs. Strike (114 to 136). Futures = 124.65.]


Volatility Skews

• Volatility skews typically arise from correlation between volatility and underlying price (“directional volatility”).
• Correlation drivers include macroeconomic events, central-bank activity, leverage and risk premia.

[Figure: Euro-Bund December 2008 Implied Volatility on October 1, 2008. Implied Volatility (7.00% to 10.00%) vs. Strike Price (108 to 126). Futures = 115.37.]

During the throes of the financial crisis, just after Lehman declared bankruptcy, the Euro-Bund volatility skew steepened to the call side (i.e., higher volatility at lower yields).


The Heston Model

Convenient properties of the Heston model:
• stochastic volatility ⇒ volatility smiles
• correlation between underlying and volatility ⇒ volatility skews
• mean-reverting volatility
• extends Black-Scholes
• extensible to stochastic jumps (e.g., Bates, 1996)
• quasi closed-form solution for European call and put options!

* Heston, Steven L., “A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options,” The Review of Financial Studies, 6(2), 1993, pp. 327-343.

Let S represent the underlying asset price, and v represent its variance. Heston* represents their evolution through time as:

where z1 and z2 are Wiener processes with correlation ρ.
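In the notation of Heston (1993), the two evolution equations referred to above take the following form. The long-run variance is written $v_\infty$ here to match the Vinf of the fitted-parameter tables (the paper uses $\theta$), and the drift $\mu$ is taken from the paper, so both symbols are assumptions about the slide's notation:

```latex
\begin{aligned}
dS_t &= \mu S_t\,dt + \sqrt{v_t}\,S_t\,dz_1 \\
dv_t &= \kappa\,(v_\infty - v_t)\,dt + \sigma\sqrt{v_t}\,dz_2 \\
dz_1\,dz_2 &= \rho\,dt
\end{aligned}
```

Here $\kappa$ is the speed of mean reversion, $\sigma$ the volatility of variance, and $v_0$ the initial variance.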


The Heston Model: Quasi Closed-Form Solution

Heston formula for pricing a European call option on S with strike K, time to expiration T, constant interest rate r, and market price of volatility risk λ:
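As given in Heston (1993), with $x = \ln S$, the formula can be written as below. The symbol choices are assumptions about the slide's notation: the long-run variance is written $v_\infty$ (Heston's $\theta$), and the call price is written $\mathrm{Call}$ to avoid a clash with the coefficient functions $C_j$.

```latex
\begin{aligned}
\mathrm{Call}(S, v, T) &= S\,P_1 - K e^{-rT} P_2 \\
P_j &= \frac{1}{2} + \frac{1}{\pi}\int_0^\infty
       \operatorname{Re}\!\left[\frac{e^{-i\phi \ln K}\, f_j(x, v, T; \phi)}{i\phi}\right] d\phi,
       \qquad j = 1, 2 \\
f_j(x, v, T; \phi) &= \exp\!\left[C_j(T;\phi) + D_j(T;\phi)\,v + i\phi x\right] \\
C_j(T;\phi) &= r\phi i T + \frac{a}{\sigma^2}\left[(b_j - \rho\sigma\phi i + d_j)\,T
       - 2\ln\frac{1 - g_j e^{d_j T}}{1 - g_j}\right] \\
D_j(T;\phi) &= \frac{b_j - \rho\sigma\phi i + d_j}{\sigma^2}\cdot
       \frac{1 - e^{d_j T}}{1 - g_j e^{d_j T}} \\
g_j &= \frac{b_j - \rho\sigma\phi i + d_j}{b_j - \rho\sigma\phi i - d_j}, \qquad
d_j = \sqrt{(\rho\sigma\phi i - b_j)^2 - \sigma^2\,(2u_j\phi i - \phi^2)} \\
u_1 &= \tfrac12,\quad u_2 = -\tfrac12,\quad a = \kappa v_\infty,\quad
b_1 = \kappa + \lambda - \rho\sigma,\quad b_2 = \kappa + \lambda
\end{aligned}
```

The two integrals defining $P_1$ and $P_2$ have no closed form and must be evaluated numerically.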


Quasi-Closed-Form Solution (cont’d)

Using Heston to price a European call or put option involves numerically integrating two complicated complex-valued functions.

Fitting the model to market data can require hundreds of thousands of option pricing evaluations.

Real-time applications require fast calculations.

Unfortunately, numerical integration is computationally intensive... but it is also embarrassingly parallel.

Enter the GPU!

[Figure: illustration of numerical integration of f(x) over x using the trapezoid rule.]
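As a reminder of why this problem parallelizes so well, the composite trapezoid rule on $N$ equal subintervals of width $h = (b-a)/N$ is shown below. This is a generic statement of the rule, not the specific quadrature settings used in the talk.

```latex
\int_a^b f(x)\,dx \;\approx\;
h\left[\tfrac{1}{2} f(x_0) + \sum_{i=1}^{N-1} f(x_i) + \tfrac{1}{2} f(x_N)\right],
\qquad x_i = a + i\,h
```

Every $f(x_i)$ can be evaluated independently of the others; only the final sum couples the terms.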


Graphics-Processing Unit Computing


Quantitative Finance: The Computation Conundrum

As financial products and processes have grown in complexity...
• Credit Default Swaps (CDS)
• Collateralized Debt Obligations (CDOs)
• Asset/Mortgage-Backed Securities (ABS/MBS)
• Structured Finance
• Algorithmic Trading
• High-Frequency Trading
• Program Trading
• Risk Management
• Asset Valuation

...their computational needs have become more demanding...
• Monte Carlo simulations
• Binomial / trinomial trees & lattices
• Numerical integration
• Matrix algebra
• Numerical optimization
• Finite-difference / finite-element methods
• Digital signal processing (FFT)
• Real-time data processing

...and compute-related resources are at a premium:
• Shrinking IT budgets
• Limited server rack space and power
• Higher energy costs for servers and cooling
• Productivity growth
• Increased regulatory demands
• “Green” pressure


High-Performance Computing: Who Needs It?

Financial Institutions
• Investment Banks
• Hedge Funds
• Asset Managers
• Insurance Companies
• Pension Plans
• Mortgage Servicers

Institutional Applications
• Risk Management
• Financial Modeling & Engineering
• Quantitative Analysis
• Portfolio Management
• Sales & Trading
• Research

Every firm in quantitative finance today needs more computing power!

“In the ongoing arms race on Wall Street, high-performance computing (HPC), also referred to as supercomputing, provides a huge competitive advantage that no firm can afford to miss out on.” – Wall Street & Technology, March 19, 2007.


HPC in Finance Today

Several hardware / software acceleration platforms exist today:
• Conventional multi-core CPU-based grids / clusters (MPI, OpenMP)
• GPUs and GPU-based clusters (NVIDIA Tesla/CUDA, AMD/ATI, OpenCL)
• FPGA accelerators (Celoxica, Exegy, Tervela, ACTIV)
• Specialized MPUs (IBM Cell, Tilera)

They are not mutually exclusive!
• CPUs excel at MIMD problems (multiple instruction, multiple data)
• GPUs are designed for SIMD problems (single instruction, multiple data)
• FPGAs handle SIMD and asynchronous real-time data processing well
• They can all be combined into optimized systems / clusters


Overview of Tesla C2050/C2070 GPU

NVIDIA Tesla C2050 Specifications

Processor clock             1.15 GHz
# of CUDA cores             448
Peak floating-point perf    1.03 Tflops (SP)
Memory clock                1.5 GHz
Memory bus width            384 bit
Memory size                 3 GB / 6 GB

Source: NVIDIA


GPUs = Higher Flops and Memory Bandwidth

[Figure: Peak Memory Bandwidth (GBytes/sec, 0 to 250), 2007 to 2012, NVIDIA GPU (ECC off) vs. x86 CPU. GPU points: M1060, Fermi M2070, Fermi+ M20xx, Kepler. CPU points: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]

[Figure: Peak Double Precision FP (GFlops/sec, 0 to 1200), 2007 to 2012, NVIDIA GPU vs. x86 CPU. GPU points: M1060, Fermi M2070, Fermi+ M20xx, Kepler. CPU points: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz.]

Source: NVIDIA


GPU Architecture: Two Main Components

Global memory
• Analogous to RAM in a CPU server
• Accessible by both GPU and CPU
• Currently up to 6 GB
• Bandwidth currently up to 150 GB/s for Quadro and Tesla products
• ECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs)
• Perform the actual computations
• Each SM has its own control units, registers, execution pipelines, caches

[Figure: GPU block diagram showing the host interface, GigaThread scheduler, DRAM interfaces, L2 cache and streaming multiprocessors (SMs).]

Source: NVIDIA


NVIDIA GPU Architecture

32 CUDA cores per streaming multiprocessor (512 total)

8x the peak double-precision floating-point performance of the previous generation (50% of peak single precision)

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)

[Figure: streaming multiprocessor block diagram: instruction cache, two scheduler / dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache / shared memory, uniform cache.]

Source: NVIDIA


Kernel Execution

• Each thread is executed by a core

• Each block is executed by one SM and does not migrate

• Several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources

• Each kernel is executed on one device

• Multiple kernels can execute on a device at one time

[Figure: a CUDA thread maps to a CUDA core, a CUDA thread block to a CUDA streaming multiprocessor, and a CUDA kernel grid to a CUDA-enabled GPU.]

Source: NVIDIA


CUDA Overview

Execution
• CPU code passes control to GPU functions called kernels
• Data is transferred between CPU and GPU memory via DMA copies

Threading
• A kernel operates on a grid of thread blocks (up to 65,535 x 65,535 x 65,535)
• Each thread block runs multiple threads (up to 1,024 per thread block)
• Threads are grouped into SIMD-like warps (32 threads)

Memory
• GPU DRAM, also called global memory or device memory (multiple gigabytes), with L2 cache; generally slow (hundreds of clocks)
• Per-SM shared memory / L1 cache (64KB) arranged in 32 banks; generally fast (2 clocks)
• Registers; very fast, but a limited resource
• Constant memory (64KB shared by all SMs, read-only by kernel code)
• Texture memory (spatially cached global memory with rudimentary interpolation, up to 65,536 x 65,535 elements, read-only by kernel code)

Source: NVIDIA


Anatomy of a CUDA C/C++ Application

• Serial code executes in a Host (CPU) thread

• Parallel code executes in many Device (GPU) threads across multiple processing elements

[Figure: a CUDA C/C++ application alternating between serial code on the host (CPU) and parallel code on the device (GPU).]

Source: NVIDIA


CUDA C: C with a few keywords

Kernel: function called by the host that executes on the GPU
• Can only access GPU memory
• No variable number of arguments
• No static variables

Functions must be declared with a qualifier:
• __global__ : GPU kernel function launched by CPU, must return void
• __device__ : can be called from GPU functions
• __host__ : can be called from CPU functions (default)
• __host__ and __device__ qualifiers can be combined


CUDA C: C with a few keywords (cont’d)

Standard C Code

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY function
    saxpy_serial(n, 2.0f, x, y);

Parallel C Code

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // One thread per element: compute this thread's global index
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        // Guard against the partially filled last block
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    // (x and y must point to GPU memory)
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
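The slide leaves out the host-side setup around the kernel launch. A minimal sketch of that setup follows; the wrapper name `saxpy_on_gpu`, the `const` qualifier on `x`, and the minimal error handling are choices made here for illustration, not part of the original slide. It assumes n > 0.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper: copies x and y to the device, runs SAXPY there,
// and copies the updated y back. Returns the status of the allocations or of
// the final copy.
cudaError_t saxpy_on_gpu(int n, float a, const float *x_host, float *y_host)
{
    float *x_dev = NULL, *y_dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    cudaError_t err;

    if ((err = cudaMalloc((void **)&x_dev, bytes)) != cudaSuccess) return err;
    if ((err = cudaMalloc((void **)&y_dev, bytes)) != cudaSuccess) {
        cudaFree(x_dev);
        return err;
    }

    // Host-to-device copies of the inputs.
    cudaMemcpy(x_dev, x_host, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, a, x_dev, y_dev);

    // This blocking copy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(y_host, y_dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(x_dev);
    cudaFree(y_dev);
    return err;
}
```

For small n the two host-to-device copies and the copy back will typically dominate the run time, which is why data is usually kept on the device across several kernel calls.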


CUDA Overview: Performance Tips

Coalesce Global Memory Accesses
• Global memory is read/written as 32-, 64- or 128-byte transactions (naturally aligned)
• A warp can access a naturally aligned contiguous group of 32, 64 or 128 bytes in a single coalesced memory transaction
• Uncoalesced memory accesses require multiple transactions, reducing performance
• While L1 and L2 caches can mitigate some of the performance hit, try to coalesce memory accesses as much as possible (e.g., organize data contiguously, use padding)

Avoid Bank Conflicts
• Shared memory is organized into 32 banks
• A bank conflict occurs if two or more threads in a warp access different 32-bit words in the same bank
• A warp can access all 32 banks in one transaction as long as there are no bank conflicts
• A bank conflict requires the shared-memory access to be broken up into multiple transactions, reducing performance (sometimes severely)
• Try to avoid bank conflicts if at all possible by organizing data in shared memory appropriately
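Both tips can be seen in one small kernel. The sketch below is a tiled matrix transpose, a common textbook illustration rather than anything taken from the talk; the names `transpose_tiled` and `transpose_on_gpu` and the tile size of 32 are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Transposes a rows x cols row-major matrix. Launch with dim3 block(TILE, TILE).
__global__ void transpose_tiled(const float *in, float *out, int rows, int cols)
{
    // One extra column of padding: without it, the column-wise read below
    // would map all 32 threads of a warp to the same shared-memory bank.
    __shared__ float tile[TILE][TILE + 1];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    if (row < rows && col < cols)
        tile[threadIdx.y][threadIdx.x] = in[row * cols + col];   // coalesced read

    __syncthreads();

    // Swap the block indices so consecutive threads write consecutive addresses.
    int out_col = blockIdx.y * TILE + threadIdx.x;   // index along 'rows'
    int out_row = blockIdx.x * TILE + threadIdx.y;   // index along 'cols'
    if (out_row < cols && out_col < rows)
        out[out_row * rows + out_col] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Illustrative host wrapper (error checks omitted for brevity).
void transpose_on_gpu(const float *in_host, float *out_host, int rows, int cols)
{
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, in_host, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);

    cudaMemcpy(out_host, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

A naive transpose would either read or write global memory with a stride of one full row; staging through shared memory keeps both global accesses contiguous, and the padding keeps the staged accesses free of bank conflicts.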


CUDA Overview: Performance Tips (cont’d)

Avoid Divergent Branches
• All threads in a warp run in SIMD fashion
• If threads take divergent branches (e.g., in an if-else clause or a loop), the different paths are serialized, which can severely reduce performance
• Try to avoid divergent branches if possible (e.g., use arithmetic operations, group similar threads together)

Use Constant Memory
• Useful for static model parameters, covariance matrices, cashflow schedules, call schedules, etc.
• Limited to 64KB, read-only by the kernel
• Fast, particularly when all threads in a warp read the same location

Use Texture Memory
• Useful for static model parameters, covariance matrices, cashflow schedules, call schedules, etc.
• Can be very large, read-only by the kernel
• 2-D arrays in texture memory benefit from spatial caching
• Best performance is achieved when reads are clustered together in the 2-D array
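A small sketch of the first two tips follows: a value shared by every thread is placed in constant memory, and the call/put distinction is handled with arithmetic instead of an if/else on option type. The struct `MarketParams`, the symbol `d_mkt` and the function names are hypothetical and exist only for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical market data shared by all options in the batch.
struct MarketParams { float futures; };

// Placed in the 64 KB constant memory space; read-only from kernel code.
__constant__ MarketParams d_mkt;

// Intrinsic value of calls (sign = +1) and puts (sign = -1). All threads read
// the same d_mkt.futures, and the payoff uses arithmetic rather than an
// if/else on option type.
__global__ void intrinsic_value(const float *strike, const float *call_put_sign,
                                float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(call_put_sign[i] * (d_mkt.futures - strike[i]), 0.0f);
}

// Illustrative host wrapper (error checks omitted for brevity).
void intrinsic_on_gpu(float futures, const float *strike, const float *sign,
                      float *out, int n)
{
    MarketParams mkt = { futures };
    cudaMemcpyToSymbol(d_mkt, &mkt, sizeof(mkt));   // upload the shared value

    size_t bytes = (size_t)n * sizeof(float);
    float *d_strike, *d_sign, *d_out;
    cudaMalloc((void **)&d_strike, bytes);
    cudaMalloc((void **)&d_sign, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_strike, strike, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_sign, sign, bytes, cudaMemcpyHostToDevice);

    intrinsic_value<<<(n + 255) / 256, 256>>>(d_strike, d_sign, d_out, n);

    cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_strike);
    cudaFree(d_sign);
    cudaFree(d_out);
}
```

In a fitting loop the same pattern would apply to the model parameters: upload them once per trial parameter set, then price the whole option chain against them.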


CUDA Programming Resources

CUDA Toolkit
• Compiler, libraries, and documentation
• Free download for Windows, Linux, and MacOS

GPU Computing SDK
• Code samples
• Whitepapers

GPU Tools
• Profiler: GUI or command-line tool used to inspect memory access and kernel execution patterns
• Debugger
    • Linux: cuda-gdb
    • Windows: Parallel Nsight


Using GPUs to Solve the Heston Model


Numerical Integration

[Figure: plot of f(x) against x.]

• Each thread evaluates one piece of the integral and saves the result to shared memory

• Each thread block then performs a sum reduction on all points evaluated within the thread block

• One value from each thread block is saved to global memory and copied back to host memory

• These results are then summed on the CPU to compute the value of the integral. If many thread blocks are needed in the first kernel call, then another sum reduction kernel can be executed to compute this sum
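The four steps above can be sketched as follows. The integrand is a stand-in ($e^{-x^2}$), since the slides do not show the Heston integrand code. The names `trapezoid_partial` and `integrate_on_gpu`, the trapezoid weights and the block size of 256 are illustrative choices, not the implementation used in the talk.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

#define THREADS 256   // threads per block; must be a power of two for the reduction

// Stand-in integrand (assumption): the Heston integrand would go here.
__device__ double integrand(double x) { return exp(-x * x); }

// Each thread evaluates one trapezoid-rule node; each block reduces its nodes
// in shared memory and writes one partial sum to global memory.
// Must be launched with exactly THREADS threads per block.
__global__ void trapezoid_partial(double a, double h, int n_nodes, double *block_sums)
{
    __shared__ double s[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    double w = (i == 0 || i == n_nodes - 1) ? 0.5 : 1.0;   // end-point weights
    s[threadIdx.x] = (i < n_nodes) ? w * h * integrand(a + i * h) : 0.0;
    __syncthreads();

    // Tree-based sum reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = s[0];
}

// Host driver: launches the kernel, copies one value per block back to the
// host and sums those values on the CPU (error checks omitted for brevity).
double integrate_on_gpu(double a, double b, int n_intervals)
{
    int n_nodes  = n_intervals + 1;
    int n_blocks = (n_nodes + THREADS - 1) / THREADS;
    double h = (b - a) / n_intervals;

    double *d_sums;
    cudaMalloc((void **)&d_sums, n_blocks * sizeof(double));
    trapezoid_partial<<<n_blocks, THREADS>>>(a, h, n_nodes, d_sums);

    double *sums = (double *)malloc(n_blocks * sizeof(double));
    cudaMemcpy(sums, d_sums, n_blocks * sizeof(double), cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int k = 0; k < n_blocks; ++k) total += sums[k];

    free(sums);
    cudaFree(d_sums);
    return total;
}
```

When the number of blocks is large, the final CPU loop can be replaced by a second reduction kernel over `d_sums`, as the last bullet describes.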


Sum Reduction

[Figure: two-level tree-based sum reduction (Level 0: 8 blocks; Level 1: 1 block). Within a block, values are summed pairwise: 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25.]

• A sum reduction technique is used to sum the value of the Heston integral

• As shown above, multiple kernel calls can be used to perform the sum if many points need to be evaluated


Application: Fitting Eurex Euro-Bund Options


Data and Methodology

Data
• Eurex Euro-Bund futures and options end-of-day settlement prices
• Jan 2006 – April 2011

Fitting Methodology
• Out-of-the-money options with premium greater than 0.01
• Minimize least-squares difference between market and model option premium (equivalent to vega weighting implied volatility differences, but more robust)
• Euro-Bund options are American style, but futures-style margining makes early exercise suboptimal, so the European Heston model applies (see Wu & Vischer, 2009)
• Can fit to an individual expiry or a surface; for one day or many; allow all parameters to vary or fix some (fitting is an art!)
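Written out, the least-squares objective and its link to vega weighting look as follows. The notation is introduced here and is not from the slides: $\Theta = (v_0, v_\infty, \kappa, \sigma, \rho, \lambda)$ is the parameter vector, $\mathcal{V}_i$ is the vega of option $i$, and $\mathrm{IV}_i$ is its implied volatility.

```latex
\min_{\Theta}\ \sum_{i=1}^{N}
\left[C_i^{\mathrm{mkt}} - C_i^{\mathrm{Heston}}(\Theta)\right]^2,
\qquad
C_i^{\mathrm{mkt}} - C_i^{\mathrm{Heston}}(\Theta)
\;\approx\;
\mathcal{V}_i\left[\mathrm{IV}_i^{\mathrm{mkt}} - \mathrm{IV}_i^{\mathrm{Heston}}(\Theta)\right]
```

The first-order expansion on the right shows why premium differences behave like vega-weighted implied-volatility differences, without requiring an implied-volatility inversion for every model price.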


Benchmarks

NVIDIA Tesla C2070 GPU vs. Intel Xeon E5640 @ 2.67 GHz (single core)

Time to price 10,000 options
• CPU: 348.2 seconds
• GPU: 5.1 seconds

Option pricings per second
• CPU ~ 30/s
• GPU ~ 2,000/s

GPU gives about 70x performance gain!


Euro-Bund Options Expiring September 2011

Fit performed on 04/12/2011 settlement prices for out-of-the-money options expiring 09/2011

Fitted Heston Parameters

V0      0.0033334
Vinf    0
κ       0
σ       0.0769625
ρ       0.0290827
λ       0

*κ and λ forced to 0

[Figure: Fitted Premiums. Premium (€) (0 to 1.8) vs. Strike (€) (110 to 135). Series: Fitted Puts, Fitted Calls, Market Puts, Market Calls.]

[Figure: Heston Fitted Vols. Vol (5.40% to 6.80%) vs. Strike (€) (105 to 135). Series: Heston Implied Vols, Market Implied Vols.]


September 2011 Euro-Bund Surface

Fit performed on 04/12/2011 settlement prices for out-of-the-money options expiring 07/2011 and 09/2011

Fitted Heston Parameters

V0      0.00315456
Vinf    0.02133107
κ       0.05127899
σ       0.07871576
ρ       0.03037642
λ       0

*λ forced to 0

[Figure: Heston Fitted Vols for 07/2011 Options. Vol (5.40% to 7.90%) vs. Strike (€) (105 to 135). Series: Heston Implied Vol, Market Implied Vol.]

[Figure: Heston Fitted Vols for 09/2011 Options. Vol (5.40% to 6.80%) vs. Strike (€) (105 to 135). Series: Heston Implied Vols, Market Implied Vols.]


Stability of Parameters – September 2011 Euro-Bund Options

V0 and κ parameters show stability over time for a fitted option chain

[Figure: three panels of fitted parameters by date (2/26/11 to 4/17/11): √V0 (0 to 0.08), Rho (-0.3 to 0.25), Sigma (0 to 0.14).]


Concluding Remarks



The Heston model applied to Euro-Bund options provides a reasonable model for individual volatility skews; less reasonable for entire surfaces (particularly very short-dated options)

Some parameters (mean reversion, market price of risk) might need to be fixed / removed due to limited degrees of freedom

Some parameters (rho) might benefit from fitting over time

GPU-based acceleration reduces Heston model valuation time by orders of magnitude, making real-time pricing and fitting feasible

GPU-based acceleration is applicable to other models that require numerical integration, lattices or Monte Carlo techniques


References

Bates, David S., “Jumps and Stochastic Volatility: Exchange Rate Processes Implicit in Deutsche Mark Options,” The Review of Financial Studies, 9(1), 1996, pp. 69-107.

Heston, Steven L., “A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options,” The Review of Financial Studies, 6(2), 1993, pp. 327-343.

West, Graeme, “Calibration of the SABR Model in Illiquid Markets,” Applied Mathematical Finance, 12(4), 2005, pp. 371-385.

Wu, Shengxiong, and Axel Vischer, “Similarities and Differences of the Volatility Smiles of Euro-Bund and 10-Year T-Note Futures and Options,” Eurex working paper, November 2009.


Copyright © 2011 Hanweck Associates, LLC. All rights reserved. Additional information is available upon request.

This presentation has been prepared for the exclusive use of the direct recipient. No part of this presentation may be copied or redistributed without the express written consent of the author. Opinions and estimates constitute the author’s judgment as of the date of this material and are subject to change without notice. Information has been obtained from sources believed to be reliable, but the author does not warrant its completeness or accuracy. Past performance is not indicative of future results. Securities, financial instruments or strategies mentioned herein may not be suitable for all investors. The recipient of this report must make its own independent decisions regarding any strategies, securities or financial instruments discussed. This material is not intended as an offer or solicitation for the purchase or sale of any financial instrument.
