2011-eurex - gpu-accelerated stochastic volatility models
TRANSCRIPT
May 17-19, 2011 © Hanweck Associates, LLC
Hanweck Associates, LLC
30 Broad St., 42nd Fl.
New York, NY 10004
www.hanweckassoc.com
Tel: +1 646-414-7274
GPU-Accelerated Stochastic Volatility Modeling
Gerald A. Hanweck, Jr., PhD
CEO, Hanweck Associates, LLC
Eurex Quantitative Seminars
Chicago: May 17, 2011
Toronto: May 18, 2011
New York: May 19, 2011
Agenda
Introduction
Stochastic Volatility Review
• Why Use Stochastic Volatility?
• The Heston (1993) Stochastic Volatility Model
A Primer on Graphics-Processing Unit (GPU) Computing
Using GPUs to Solve the Heston Model
Application: Fitting Eurex Euro-Bund Options
Questions
Stochastic Volatility Review
Why Use Stochastic Volatility?
Realized volatility is not constant over time
Implied volatility is not constant over time
Implied volatilities backed out of constant-volatility models (e.g., Black-Scholes) exhibit:
• volatility smiles
• volatility skews
Better modeling of volatility leads to:
• better hedging
• better risk management
Volatility Is Not Constant
[Chart: Bund Realized 3M vs Rolling Quarterly ATM Implied Volatility. Vol (0% to 12%) vs. Date (2005 to 2012); series: Realized Vol, ATM Implied Vol]
*Implied vol was calculated using the at-the-money front-month quarterly options, rolled 3 weeks to expiration.
Volatility Smiles
Example:
• Options expire in 3 months
• Futures = 100
• Volatility to expiry can take, with equal probability, one of three levels: 5%, 10% or 15%.
Call option premia (each volatility level has probability 1/3):
Strike   5% vol   10% vol   15% vol   Expected Premium   Implied Volatility
 95      5.02     5.39      6.07      5.49               10.88%
100      1.00     1.99      2.99      1.99               10.00%
105      0.02     0.45      1.19      0.55               10.82%
[Chart: Implied Volatility of Expected Premium (9.80% to 11.00%) vs. Strike (95 to 105)]
What’s going on? As one moves away from the money, option premium becomes more convex in volatility, and Jensen’s Inequality tells us that the expected premium will be greater than the premium at the expected volatility.
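The numbers in this example can be reproduced with a short calculation. The sketch below assumes the Black (1976) futures-option formula with a zero interest rate; the slide does not state the pricing model, but this choice matches the tabulated premia to the rounding shown. The function names and the bisection solver are illustrative, and the code is plain C++ rather than CUDA so it runs without a GPU.

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF via the complementary error function.
double norm_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Black (1976) call on a futures price F, assuming a zero interest rate.
double black76_call(double F, double K, double vol, double T) {
    double s = vol * std::sqrt(T);
    double d1 = std::log(F / K) / s + 0.5 * s;
    double d2 = d1 - s;
    return F * norm_cdf(d1) - K * norm_cdf(d2);
}

// Expected premium when vol is 5%, 10% or 15% with probability 1/3 each.
double expected_premium(double F, double K, double T) {
    const double vols[3] = {0.05, 0.10, 0.15};
    double sum = 0.0;
    for (double v : vols) sum += black76_call(F, K, v, T);
    return sum / 3.0;
}

// Implied volatility by bisection; the call premium is increasing in vol.
double implied_vol(double F, double K, double T, double premium) {
    double lo = 1e-4, hi = 1.0;
    for (int i = 0; i < 100; ++i) {
        double mid = 0.5 * (lo + hi);
        if (black76_call(F, K, mid, T) < premium) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

Calling `implied_vol(100.0, 95.0, 0.25, expected_premium(100.0, 95.0, 0.25))` gives a value near 10.9%, above the 10% expected volatility, while the at-the-money strike stays at about 10%. That gap is the smile.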
Volatility Smiles (cont’d)
Example: Euro-Bund March 2009 Implied Volatility on January 5, 2009
[Chart: Implied Volatility (9.00% to 10.80%) vs. Strike (114 to 136); Futures = 124.65]
Volatility Skews
• Volatility skews typically arise from correlation between volatility and underlying price (“directional volatility”).
• Correlation drivers include macroeconomic events, central-bank activity, leverage and risk premia.
Example: Euro-Bund December 2008 Implied Volatility on October 1, 2008
[Chart: Implied Volatility (7.00% to 10.00%) vs. Strike Price (108 to 126); Futures = 115.37]
During the throes of the financial crisis, just after Lehman declared bankruptcy, the Euro-Bund volatility skew steepened to the call side (i.e., higher volatility at lower yields).
The Heston Model
Let S represent the underlying asset price, and v represent its variance. Heston* represents their evolution through time as:
[Equations not captured in this transcript.]
where z1 and z2 are Wiener processes with correlation ρ.
Convenient properties of the Heston model:
• stochastic volatility ⇒ volatility smiles
• correlation between underlying and volatility ⇒ volatility skews
• mean-reverting volatility
• extends Black-Scholes
• extensible to stochastic jumps (e.g., Bates, 1996)
• quasi closed-form solution for European call and put options!
* Heston, Steven L., “A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options,” The Review of Financial Studies, 6(2), 1993, pp. 327-343.
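For reference, the dynamics in Heston (1993) take the following standard form. The version below uses the parameter names that appear on the later fit slides (κ, Vinf, σ, ρ, V0). The drift symbol μ and the identification of Vinf with Heston's long-run variance are assumptions, since the slide's own equations did not survive extraction.

```latex
\begin{aligned}
dS_t &= \mu S_t\,dt + \sqrt{v_t}\,S_t\,dz_1 \\
dv_t &= \kappa\,(V_\infty - v_t)\,dt + \sigma\sqrt{v_t}\,dz_2 \\
dz_1\,dz_2 &= \rho\,dt, \qquad v_0 = V_0
\end{aligned}
```

Here κ is the speed of mean reversion, V∞ the long-run variance, σ the volatility of variance and V0 the initial variance.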
The Heston Model: Quasi Closed-Form Solution
Heston formula for pricing a European call option on S with strike K, time to expiration T, constant interest rate r, and market price of volatility risk λ:
[Formula not captured in this transcript.]
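For reference, the pricing formula in Heston (1993) has the structure below, which is presumably what the slide displayed. It is written with V∞ for the long-run variance (Vinf on the later fit slides) and v for the current variance; those symbol choices, and the label Call for the option price, are assumptions made here.

```latex
\begin{aligned}
\text{Call} &= S\,P_1 - K e^{-rT} P_2 \\
P_j &= \frac{1}{2} + \frac{1}{\pi}\int_0^\infty
      \operatorname{Re}\!\left[\frac{e^{-i\phi\ln K}\,f_j(\phi)}{i\phi}\right] d\phi,
      \qquad j = 1, 2 \\
f_j(\phi) &= \exp\!\bigl(C_j(T;\phi) + D_j(T;\phi)\,v + i\phi\ln S\bigr) \\
C_j &= r\phi i T + \frac{\kappa V_\infty}{\sigma^2}
      \left[(b_j - \rho\sigma\phi i + d_j)\,T
      - 2\ln\frac{1 - g_j e^{d_j T}}{1 - g_j}\right] \\
D_j &= \frac{b_j - \rho\sigma\phi i + d_j}{\sigma^2}\,
      \frac{1 - e^{d_j T}}{1 - g_j e^{d_j T}} \\
g_j &= \frac{b_j - \rho\sigma\phi i + d_j}{b_j - \rho\sigma\phi i - d_j},
\qquad
d_j = \sqrt{(\rho\sigma\phi i - b_j)^2 - \sigma^2\,(2u_j\phi i - \phi^2)} \\
u_1 &= \tfrac12,\quad u_2 = -\tfrac12,\quad
b_1 = \kappa + \lambda - \rho\sigma,\quad b_2 = \kappa + \lambda
\end{aligned}
```

The two integrals defining P1 and P2 have no closed form, which is why the solution is called quasi closed-form: each option price needs two numerical integrations of a complex-valued integrand.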
Quasi-Closed-Form Solution (cont’d)
Using Heston to price a European call or put option involves numerically integrating two complicated complex-valued functions.
Fitting the model to market data can require hundreds of thousands of option pricing evaluations.
Real-time applications require fast calculations.
Unfortunately, numerical integration is computationally intensive... but it is also embarrassingly parallel.
Enter the GPU!
[Figure: Illustration of numerical integration using the trapezoid rule; f(x) plotted against x]
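As a minimal sketch of the trapezoid rule named in the figure, the function below integrates an arbitrary real-valued function on the CPU. The name `trapezoid` and its signature are illustrative, not taken from the talk; the Heston integrand is complex-valued, but only its real part is integrated, so the same pattern applies.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Composite trapezoid rule on [a, b] with n equal-width panels.
double trapezoid(const std::function<double(double)>& f, double a, double b, int n) {
    double h = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));   // end points carry half weight
    for (int i = 1; i < n; ++i)         // every evaluation is independent of the others
        sum += f(a + i * h);
    return sum * h;
}
```

Each evaluation of `f` depends only on its own index `i`, which is what makes the problem embarrassingly parallel: the loop body can be handed to one GPU thread per point, leaving only the final sum to coordinate.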
Graphics-Processing Unit Computing
Quantitative Finance: The Computation Conundrum
As financial products and processes have grown in complexity...
• Credit Default Swaps (CDS)
• Collateralized Debt Obligations (CDOs)
• Asset/Mortgage-Backed Securities (ABS/MBS)
• Structured Finance
• Algorithmic Trading
• High-Frequency Trading
• Program Trading
• Risk Management
• Asset Valuation
...their computational needs have become more demanding...
• Monte Carlo simulations
• Binomial / trinomial trees & lattices
• Numerical integration
• Matrix algebra
• Numerical optimization
• Finite-difference / finite-element methods
• Digital signal processing (FFT)
• Real-time data processing
...and compute-related resources are at a premium:
• Shrinking IT budgets
• Limited server rack space and power
• Higher energy costs for servers and cooling
• Productivity growth
• Increased regulatory demands
• “Green” pressure
High-Performance Computing: Who Needs It?
Financial Institutions
• Investment Banks
• Hedge Funds
• Asset Managers
• Insurance Companies
• Pension Plans
• Mortgage Servicers
Every firm in quantitative finance today needs more computing power!
Institutional Applications
• Risk Management
• Financial Modeling & Engineering
• Quantitative Analysis
• Portfolio Management
• Sales & Trading
• Research
“In the ongoing arms race on Wall Street, high-performance computing (HPC), also referred to as supercomputing, provides a huge competitive advantage that no firm can afford to miss out on.” – Wall Street & Technology, March 19, 2007.
HPC in Finance Today
Several hardware / software acceleration platforms exist today:
• Conventional multi-core CPU-based grids / clusters (MPI, OpenMP)
• GPUs and GPU-based clusters (NVIDIA Tesla/CUDA, AMD/ATI, OpenCL)
• FPGA accelerators (Celoxica, Exegy, Tervela, ACTIV)
• Specialized MPUs (IBM Cell, Tilera)
They are not mutually exclusive!
• CPUs excel at MIMD problems (multiple instruction, multiple data)
• GPUs are designed for SIMD problems (single instruction, multiple data)
• FPGAs handle SIMD and asynchronous real-time data processing well
• They can all be combined into optimized systems / clusters
NVIDIA Tesla C2050 Specifications
Processor clock 1.15 GHz
# of CUDA cores 448
Peak floating-point perf 1.03 Tflops (SP)
Memory clock 1.5 GHz
Memory bus width 384 bit
Memory size 3 GB / 6 GB
Overview of Tesla C2050/C2070 GPU
Source: NVIDIA
GPUs = Higher Flops and Memory Bandwidth
[Chart: Peak Memory Bandwidth (GBytes/sec, 0 to 250), 2007 to 2012. NVIDIA GPU (ECC off): M1060, Fermi M2070, Fermi+ M20xx, Kepler. x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz]
[Chart: Peak Double Precision FP (GFlops/sec, 0 to 1200), 2007 to 2012. NVIDIA GPU: M1060, Fermi M2070, Fermi+ M20xx, Kepler. x86 CPU: Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz]
Source: NVIDIA
GPU Architecture: Two Main Components
Global memory
• Analogous to RAM in a CPU server
• Accessible by both GPU and CPU
• Currently up to 6 GB
• Bandwidth currently up to 150 GB/s for Quadro and Tesla products
• ECC on/off option for Quadro and Tesla products
Streaming Multiprocessors (SMs)
• Perform the actual computations
• Each SM has its own control units, registers, execution pipelines, caches
[Diagram: GPU block diagram showing the host interface, GigaThread scheduler, DRAM interfaces, L2 cache and streaming multiprocessors (SMs)]
Source: NVIDIA
NVIDIA GPU Architecture
• 32 CUDA cores per streaming multiprocessor (512 total)
• 8x peak double-precision floating-point performance over the previous generation (50% of peak single precision)
• Dual Thread Scheduler
• 64 KB of RAM for shared memory and L1 cache (configurable)
[Diagram: streaming multiprocessor (SM) layout with instruction cache, two scheduler/dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory and uniform cache]
Source: NVIDIA
Kernel Execution
• Each thread is executed by a core
• Each block is executed by one SM and does not migrate
• Several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources
• Each kernel is executed on one device
• Multiple kernels can execute on a device at one time
[Diagram: a CUDA thread maps to a CUDA core, a CUDA thread block to a CUDA streaming multiprocessor, and a CUDA kernel grid to a CUDA-enabled GPU]
Source: NVIDIA
CUDA Overview
Execution
• CPU code passes control to GPU functions called kernels
• Data is transferred between CPU and GPU memory via DMA copies
Threading
• A kernel operates on a grid of thread blocks (up to 65,535 x 65,535 x 65,535)
• Each thread block runs multiple threads (up to 1,024 per thread block)
• Threads are grouped into SIMD-like warps (32 threads)
Memory
• GPU DRAM, or global memory, or device memory (multiple gigabytes) with L2 cache; generally slow (hundreds of clocks)
• Per-SM shared memory / L1 cache (64KB) arranged in 32 banks; generally fast (2 clocks)
• Registers; very fast, but limited resource
• Constant memory (64KB shared by all SMs, read-only by kernel code)
• Texture memory (spatially cached global memory with rudimentary interpolation, up to 65,536 x 65,535 elements, read-only by kernel code)
Source: NVIDIA
Anatomy of a CUDA C/C++ Application
• Serial code executes in a Host (CPU) thread
• Parallel code executes in many Device (GPU) threads across multiple processing elements
[Diagram: a CUDA C/C++ application alternates between serial code on the Host (CPU) and parallel code on the Device (GPU)]
Source: NVIDIA
CUDA C: C with a few keywords
Kernel: function called by the host that executes on the GPU
• Can only access GPU memory
• No variable number of arguments
• No static variables
Functions must be declared with a qualifier:
• __global__ : GPU kernel function launched by CPU, must return void
• __device__ : can be called from GPU functions
• __host__ : can be called from CPU functions (default)
• __host__ and __device__ qualifiers can be combined
CUDA C: C with a few keywords (cont’d)
Standard C Code
    // y = a*x + y, computed element by element on the CPU
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0f, x, y);
Parallel C Code
    // y = a*x + y, one element per GPU thread (x and y must point to GPU memory)
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;  // global thread index
        if (i < n) y[i] = a*x[i] + y[i];              // guard the partial last block
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;                    // round up to cover all n elements
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
CUDA Overview: Performance Tips
Coalesce Global Memory Accesses
• Global memory is read/written as 32-, 64- or 128-byte transactions (naturally aligned)
• A warp can access a naturally aligned contiguous group of 32, 64 or 128 bytes in a single coalesced memory transaction
• Uncoalesced memory accesses require multiple transactions, reducing performance
• While L1 and L2 caches can mitigate some of the performance hit, try to coalesce memory accesses as much as possible (e.g., organize data contiguously, use padding)
Avoid Bank Conflicts
• Shared memory is organized into 32 banks
• A bank conflict occurs if two or more threads in a warp access different 32-bit words in the same bank
• A warp can access all 32 banks in one transaction as long as there are no bank conflicts
• A bank conflict requires the shared-memory access to be broken up into multiple transactions, reducing performance (sometimes severely)
• Try to avoid bank conflicts if at all possible by organizing data in shared memory appropriately
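The bank-conflict rule can be checked on paper for a given access pattern. The sketch below is a CPU-side model, not CUDA code: it assumes 32 banks with successive 32-bit words assigned to successive banks, and a warp in which thread t reads word index t * stride with stride of at least 1. The function name is illustrative.

```cpp
#include <algorithm>
#include <cassert>

// Worst-case number of threads in a 32-thread warp that land in the same
// shared-memory bank when thread t reads 32-bit word index t * stride.
// For stride >= 1 every thread reads a different word, so a count of k
// means a k-way bank conflict (k serialized transactions).
int bank_conflict_degree(int stride) {
    const int kBanks = 32, kWarp = 32;
    int hits[kBanks] = {0};
    for (int t = 0; t < kWarp; ++t)
        ++hits[(t * stride) % kBanks];
    return *std::max_element(hits, hits + kBanks);
}
```

Under these assumptions a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 (for example, walking down a column of a 32-wide array) puts all 32 threads in one bank. Padding each row to 33 words changes the column stride to 33, and the accesses are conflict-free again.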
CUDA Overview: Performance Tips (cont’d)
Avoid Divergent Branches
• All threads in a warp run in SIMD fashion
• If threads take divergent branches (e.g., in an if-else clause or a loop), the different paths are serialized, which can severely reduce performance
• Try to avoid divergent branches if possible (e.g., use arithmetic operations, group similar threads together)
Use Constant Memory
• Useful for static model parameters, covariance matrices, cashflow schedules, call schedules, etc.
• Limited to 64KB, read-only by the kernel
• Fast, particularly when all threads in a warp read the same location
Use Texture Memory
• Useful for static model parameters, covariance matrices, cashflow schedules, call schedules, etc.
• Can be very large, read-only by the kernel
• 2-D arrays in texture memory benefit from spatial caching
• Best performance is achieved when reads are clustered together in the 2-D array
CUDA Programming Resources
CUDA Toolkit
• Compiler, libraries, and documentation
• Free download for Windows, Linux, and MacOS
GPU Computing SDK
• Code samples
• Whitepapers
GPU Tools
• Profiler – GUI or command-line tool used to inspect memory access and kernel execution patterns
• Debugger (Linux: cuda-gdb; Windows: Parallel Nsight)
Using GPUs to Solve the Heston Model
Numerical Integration
[Figure: f(x) plotted against x]
• Each thread evaluates one piece of the integral and saves the result to shared memory
• Each thread block then performs a sum reduction on all points evaluated within the thread block
• One value from each thread block is saved to global memory and copied back to host memory
• These results are then summed on the CPU to compute the value of the integral. If many thread blocks are needed in the first kernel call, then another sum reduction kernel can be executed to compute this sum
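The two-stage scheme in these bullets can be emulated on the CPU to see how the work is split. In the sketch below each loop iteration over `t` plays the role of one GPU thread, each value of `blk` one thread block, and the `partials` vector the per-block results copied back to the host. The trapezoid weights, the function name and the parameters are assumptions for illustration; a real kernel would use shared memory and a parallel reduction inside each block.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// CPU emulation: each "block" sums the weighted integrand values of its own
// "threads"; the per-block partial sums are then added up on the host.
double integrate_blocked(const std::function<double(double)>& f,
                         double a, double b, int n, int block_size,
                         std::vector<double>* partials_out = nullptr) {
    double h = (b - a) / n;
    int n_points = n + 1;                                     // one point per thread
    int n_blocks = (n_points + block_size - 1) / block_size;  // round up
    std::vector<double> partials(n_blocks, 0.0);
    for (int blk = 0; blk < n_blocks; ++blk) {
        for (int t = 0; t < block_size; ++t) {
            int i = blk * block_size + t;                     // global thread index
            if (i >= n_points) break;                         // guard the partial last block
            double w = (i == 0 || i == n) ? 0.5 : 1.0;        // trapezoid weights
            partials[blk] += w * f(a + i * h) * h;
        }
    }
    double total = 0.0;
    for (double p : partials) total += p;                     // final sum on the "CPU"
    if (partials_out) *partials_out = partials;
    return total;
}
```

With 1,001 points and 256 threads per block, four partial sums come back to the host. When the number of blocks grows large, that last loop is the part a second reduction kernel would replace.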
Sum Reduction
[Diagram: tree sum reduction, performed in each of 8 blocks at Level 0 and in 1 block at Level 1]
3   1   7   0   4   1   6   3
  4       7       5       9
      11              14
              25
Level 0: 8 blocks
Level 1: 1 block
• A sum reduction technique is used to sum the value of the Heston integral
• As shown above, multiple kernel calls can be used to perform the sum if many points need to be evaluated
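The reduction tree on this slide can be written as a short loop. The sketch below runs sequentially on the CPU; on a GPU the additions within one pass would be carried out by separate threads with a synchronization between passes. The function name is illustrative, and the pairing of adjacent elements follows the slide's example rather than any particular CUDA implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise tree reduction. Each pass adds elements `stride` apart,
// halving the number of live partial sums until one total remains.
int tree_reduce(std::vector<int> data) {
    for (std::size_t stride = 1; stride < data.size(); stride *= 2)
        for (std::size_t i = 0; i + stride < data.size(); i += 2 * stride)
            data[i] += data[i + stride];
    return data.empty() ? 0 : data[0];
}
```

For the slide's input 3 1 7 0 4 1 6 3, the first pass leaves the partial sums 4, 7, 5, 9, the second 11 and 14, and the third the total 25, so eight values are summed in three passes instead of seven sequential additions.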
Application: Fitting Eurex Euro-Bund Options
Data and Methodology
Data
• Eurex Euro-Bund futures and options end-of-day settlement prices
• Jan 2006 – April 2011
Fitting Methodology
• Out-of-the-money options with premium greater than 0.01
• Minimize least-squares difference between market and model option premium (approximately equivalent to vega weighting implied volatility differences, but more robust)
• Euro-Bund options are American style, but futures-style margining makes early exercise suboptimal, so the European Heston model applies (see Wu & Vischer, 2009)
• Can fit to an individual expiry or a surface; for one day or many; allow all parameters to vary or fix some (fitting is an art!)
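The least-squares criterion above can be sketched as an objective function. The `Quote` struct, its field names and the `model_premium` callback are hypothetical stand-ins: in practice the callback would be a Heston pricer evaluated at a trial parameter set, and an optimizer (not shown) would search over the parameters to minimize the returned value.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// One market quote; the field names are hypothetical.
struct Quote {
    double strike;
    double market_premium;
};

// Sum of squared differences between market and model premia over
// out-of-the-money quotes whose premium exceeds min_premium.
// `model_premium` stands in for a Heston pricer at fixed parameters.
double fit_objective(const std::vector<Quote>& otm_quotes,
                     const std::function<double(double)>& model_premium,
                     double min_premium = 0.01) {
    double sse = 0.0;
    for (const Quote& q : otm_quotes) {
        if (q.market_premium <= min_premium) continue;  // premium filter from the slide
        double diff = q.market_premium - model_premium(q.strike);
        sse += diff * diff;
    }
    return sse;
}
```

Every evaluation of this objective prices the whole option chain once, and an optimizer may call it thousands of times per fit, which is where the GPU pricing speed matters.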
Benchmarks
NVIDIA Tesla C2070 GPU vs. Intel Xeon E5640 @ 2.67 GHz (single core)
Time to price 10,000 options
• CPU: 348.2 seconds
• GPU: 5.1 seconds
Option pricings per second
• CPU ~ 30/s
• GPU ~ 2,000/s
GPU gives about 70x performance gain (348.2 s / 5.1 s ≈ 68x)!
Euro-Bund Options Expiring September 2011
Fit performed on 04/12/2011 settlement prices for out-of-the-money options expiring 09/2011
Fitted Heston Parameters
λ      0
ρ      0.0290827
σ      0.0769625
κ      0
Vinf   0
V0     0.0033334
*κ and λ forced to 0
[Chart: Fitted Premiums. Premium (€) (0 to 1.8) vs. Strike (€) (110 to 135); series: Fitted Puts, Fitted Calls, Market Puts, Market Calls]
[Chart: Heston Fitted Vols. Vol (5.40% to 6.80%) vs. Strike (€) (105 to 135); series: Heston Implied Vols, Market Implied Vols]
September 2011 Euro-Bund Surface
Fit performed on 04/12/2011 settlement prices for out-of-the-money options expiring 07/2011 and 09/2011
Fitted Heston Parameters
λ      0
ρ      0.03037642
σ      0.07871576
κ      0.05127899
Vinf   0.02133107
V0     0.00315456
*λ forced to 0
[Chart: Heston Fitted Vols for 07/2011 Options. Vol (5.40% to 7.90%) vs. Strike (€) (105 to 135); series: Heston Implied Vol, Market Implied Vol]
[Chart: Heston Fitted Vols for 09/2011 Options. Vol (5.40% to 6.80%) vs. Strike (€) (105 to 135); series: Heston Implied Vols, Market Implied Vols]
Stability Of Parameters – September 2011 Euro-Bund Options
V0 and κ parameters show stability over time for a fitted option chain
[Chart: √V0 (0 to 0.08) vs. Date (2/26/11 to 4/17/11)]
[Chart: Rho (-0.3 to 0.25) vs. Date (2/26/11 to 4/17/11)]
[Chart: Sigma (0 to 0.14) vs. Date (2/26/11 to 4/17/11)]
Concluding Remarks
The Heston model applied to Euro-Bund options provides a reasonable model for individual volatility skews; less reasonable for entire surfaces (particularly very short-dated options)
Some parameters (mean reversion, market price of risk) might need to be fixed / removed due to limited degrees of freedom
Some parameters (rho) might benefit from fitting over time
GPU-based acceleration reduces Heston model valuation time by orders of magnitude, making real-time pricing and fitting feasible
GPU-based acceleration is applicable to other models that require numerical integration, lattices or Monte-Carlo techniques
References
Bates, David S., “Jumps and Stochastic Volatility: Exchange Rate Processes Implicit in Deutsche Mark Options,” The Review of Financial Studies, 9(1), 1996, pp. 69-107.
Heston, Steven L., “A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options,” The Review of Financial Studies, 6(2), 1993, pp. 327-343.
West, Graeme, “Calibration of the SABR Model in Illiquid Markets,” Applied Mathematical Finance, 12(4), 2005, pp. 371-385.
Wu, Shengxiong, and Axel Vischer, “Similarities and Differences of the Volatility Smiles of Euro-Bund and 10-Year T-Note Futures and Options,” Eurex working paper, November 2009.
Copyright © 2011 Hanweck Associates, LLC. All rights reserved. Additional information is available upon request.
This presentation has been prepared for the exclusive use of the direct recipient. No part of this presentation may be copied or redistributed without the express written consent of the author. Opinions and estimates constitute the author’s judgment as of the date of this material and are subject to change without notice. Information has been obtained from sources believed to be reliable, but the author does not warrant its completeness or accuracy. Past performance is not indicative of future results. Securities, financial instruments or strategies mentioned herein may not be suitable for all investors. The recipient of this report must make its own independent decisions regarding any strategies, securities or financial instruments discussed. This material is not intended as an offer or solicitation for the purchase or sale of any financial instrument.