Sparse Optimization Methods and Statistical Modeling with Applications to Finance
Michael Ho
Advisor: Jack Xin
Department of Mathematics, University of California, Irvine
April 28, 2016
Outline
3 Covariance Estimation from High Frequency Data
4 Conclusion
Introduction/Mean-Variance Portfolios
Section 1
Mean-Variance Portfolio Selection
Modern Portfolio Theory
Modern portfolio theory (MPT) addresses the following problem:
  Suppose an investor needs to invest in a portfolio of assets
  How should the investor choose the portfolio?
To answer this question MPT makes the following assumptions:
  Investors make decisions based only on expected return and risk
  Given two portfolios with the same expected return, an investor will choose the lower risk portfolio
Michael Ho Sparse Finance April 27, 2016 4 / 38
Mean-variance criteria can be formulated as a quadratic program
Suppose there are N risky (random return) assets
Denote the single-period return of the nth asset as r_n
Then a mean-variance optimal portfolio w can be written as the solution to the following quadratic program (QP)
$$\min_{w} \; w^T \Gamma w \quad \text{s.t.} \quad w^T E[r] \ge \eta \ge 0, \quad w^T \vec{1} = \text{const} \qquad \text{(MV)}$$
where Γ is the covariance matrix of r. Here we assume E[r] ≠ 0 and Γ is positive definite
The above problem is convex and there are many techniques for solving (MV)
Sharpe ratio optimal portfolio
If rF is the return of a risk-free asset, the excess return of the risky assets is defined as r − rF
The Sharpe ratio (SR) optimal portfolio of risky assets can be computed via
$$\max_{w} \; \frac{w^T \mu}{\sqrt{w^T \Gamma w}}$$
where µ is the mean of r − rF
Since SR is invariant to positive scaling, this can be reformulated (up to a constant scaling) as
$$\min_{w} \; w^T \Gamma w - w^T \mu$$
SR optimal portfolio coincides with risky component of mean-variance optimal portfolio
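The scale invariance of the Sharpe ratio can be checked numerically. The sketch below is illustrative only: the values of `mu` and `Gamma` are made up, and it assumes the unconstrained quadratic criterion w^T Γ w − w^T µ as the reformulation, whose minimizer is Γ^{-1}µ/2.

```python
import numpy as np

def sharpe_ratio(w, mu, Gamma):
    """Sharpe ratio w'mu / sqrt(w'Gamma w) of a portfolio w."""
    return float(w @ mu / np.sqrt(w @ Gamma @ w))

# Illustrative (made-up) mean excess returns and covariance for 3 assets.
mu = np.array([0.05, 0.03, 0.04])
Gamma = np.array([[0.040, 0.006, 0.010],
                  [0.006, 0.020, 0.004],
                  [0.010, 0.004, 0.030]])

# Direction of the SR-optimal portfolio: w proportional to Gamma^{-1} mu.
w_sr = np.linalg.solve(Gamma, mu)

# Minimizer of w'Gamma w - w'mu: setting the gradient 2 Gamma w - mu to zero.
w_qp = 0.5 * w_sr

# Largest attainable Sharpe ratio, sqrt(mu' Gamma^{-1} mu).
sr_max = float(np.sqrt(mu @ np.linalg.solve(Gamma, mu)))
```

Both `w_sr` and `w_qp` attain `sr_max`, since they differ only by a positive scale factor.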
Mean-variance criteria is subject to parameter uncertainty
Implementation of mean-variance criteria is impeded by lack of information
Mean and covariance are unknown
An intuitive workaround is to estimate the mean and covariance using sample averages from past return data and plug them into the original MV problem
min w
Applied to the stock market, out-of-sample portfolio performance using this technique is poor
  High return standard deviation relative to mean
  Non-stationary statistics
  Ill-conditioned sample covariance matrix
Research Contribution
1. Penalization approach that improves portfolio performance under parameter uncertainty
  Method improves on other techniques proposed in literature
  Adaptive support split-Bregman solver proposed
  SIAM J. Financial Math. (with J. Xin, Z. Sun), Vol. 6, 2015
2. Robust covariance estimation from high frequency data
  Addresses market microstructure noise, asynchronous trading, jumps
  Sparse modeling approach (ℓ1, Spike and Slab) adds robustness to jumps
  Method outperforms simpler techniques proposed in literature
Pairwise Weighted Elastic Net
Pairwise weighted elastic net (PWEN) penalized criterion
$$\min_{w} \; w^T \Gamma w - w^T \mu + |w|^T \Delta |w| + \|w\|_{\vec{\beta},\ell_1}$$
Δ is a positive semidefinite matrix with non-negative entries, β is non-negative, and
$$\|w\|_{\vec{\beta},\ell_1} = \sum_i |w_i| \beta_i$$
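$$\min_{w} \max_{R \in A,\, v \in B} \; w^T R w - v^T w.$$
A and B are parameter uncertainty sets for covariance and mean
$$A = \{R : R_{i,j} = \Gamma_{i,j} + e_{i,j};\ |e_{i,j}| \le \Delta_{i,j}\}$$
$$B = \{v : v_i = \mu_i + c_i;\ |c_i| \le \beta_i\}.$$
Δ (the pairwise weighting matrix) is assumed to be diagonally dominant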
PWEN criterion optimizes worst-case performance
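The worst-case reading can be checked numerically for a fixed portfolio. The sketch below is a hedged illustration, not the authors' code: it writes the pairwise weighting matrix as `Delta` (a label assumed here), takes box-shaped uncertainty sets with entrywise bounds `Delta` and `beta`, and uses made-up data. For a given w, the maximizing perturbation aligns each error with the signs of w.

```python
import numpy as np

def pwen_objective(w, Gamma, mu, Delta, beta):
    """w'Gamma w - w'mu + |w|'Delta|w| + sum_i beta_i |w_i|."""
    a = np.abs(w)
    return float(w @ Gamma @ w - w @ mu + a @ Delta @ a + beta @ a)

def worst_case_parameters(w, Gamma, mu, Delta, beta):
    """Covariance and mean in the box sets that maximize w'Rw - v'w."""
    s = np.sign(w)
    R = Gamma + Delta * np.outer(s, s)   # e_ij = Delta_ij * sign(w_i w_j)
    v = mu - beta * s                    # c_i = -beta_i * sign(w_i)
    return R, v

rng = np.random.default_rng(0)
n = 4
Gamma = np.cov(rng.normal(size=(50, n)), rowvar=False)
mu = rng.normal(scale=0.1, size=n)
Delta = 0.05 * np.eye(n) + 0.01          # non-negative, diagonally dominant
beta = np.full(n, 0.02)
w = rng.normal(size=n)

R_star, v_star = worst_case_parameters(w, Gamma, mu, Delta, beta)
worst = float(w @ R_star @ w - v_star @ w)
```

Here `worst` equals `pwen_objective(w, Gamma, mu, Delta, beta)`, and any other feasible pair (R, v) gives a value no larger.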
Calibration of PWEN
Calibration of PWEN can be done by selecting an appropriate uncertainty set for the parameter estimate
  Classical approach is to form confidence intervals using some normal approximation
  Bootstrapping is a non-parametric way to quantify uncertainty
Robust optimization interpretation used in calibration
Elastic net promotes sparse portfolios
Sparse portfolios are advantageous in that they reduce management and transaction costs associated with investing
As shown below, the elastic net penalty encourages sparsity
  Markowitz portfolios require a large number of active positions
  Elastic net penalized portfolios have relatively few active positions
Performance Plot
Performance benefit of PWEN and WEN demonstrated on U.S. stock return data
630 stocks, from January 1, 2001 to July 1, 2014, mid to large cap
Covariance Estimation from High Frequency Data
Section 3
Leveraging High Frequency Data for Covariance estimation
The time-varying nature of asset return statistics places limits on the time interval over which training data for covariance is relevant
High-frequency data allows for more data in a shorter time interval
Can obtain covariance estimates using more recent data
However, estimation of covariance from high-frequency data is complicated by
  Asynchronous returns
  Market microstructure noise
  Jumps
Benefits of high-frequency data are complicated by noise, asynchronous trading and jumps
Asynchronous trading
Standard sample average estimation of the covariation of returns requires that returns of all assets are sampled on a common grid
In high frequency data assets trade asynchronously
Resampling the data to a common grid can be performed, but it does not use all the data or may cause the covariance estimate to be non-positive definite
Market Microstructure Noise
Market friction such as the bid-ask spread is a source of noise
True efficient price is not observed
Over short time periods, price variation due to the bid-ask spread can mask the “true” efficient return
Data Model for Hidden Price Process
Let X_n be a vector containing all true log-prices at time n. Model the discrete-time log-price as
$$X_n = X_{n-1} + \underbrace{V_n}_{N(D,\Gamma)} + \underbrace{J_n}_{\text{jump}}$$
J_n and V_n are i.i.d. sequences and independent
X is unobserved
D and Γ are unknown, but a known prior distribution is assumed
Observations are noisy and missing
Observations are noisy (market microstructure noise) and incomplete
$$Y_n = \underbrace{I_n X_n}_{\text{subset of prices observed}} + \underbrace{W_n}_{\text{microstructure noise},\ N(0,Q)}$$
Here I_n is a sub-matrix of the identity matrix (rows corresponding to missing observations are removed)
V independent of J, W and X
Q is unknown and diagonal, but a known prior distribution is assumed
Observations are assumed MAR (missing at random) and independent of prices
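To make the hidden-price and observation models concrete, the following sketch simulates them. It is a minimal illustration under stated assumptions: the function name `simulate_observations` and all parameter values are hypothetical, missing entries are encoded as NaN rather than by removing rows of the identity, the price starts from zero, and jumps are passed in as an optional array (zero by default).

```python
import numpy as np

def simulate_observations(T, D, Gamma, q_diag, obs_prob, rng, J=None):
    """Simulate X_n = X_{n-1} + V_n + J_n and noisy, partially observed Y_n."""
    N = len(D)
    if J is None:
        J = np.zeros((T, N))                         # no jumps by default
    V = rng.multivariate_normal(D, Gamma, size=T)    # V_n ~ N(D, Gamma)
    X = np.cumsum(V + J, axis=0)                     # hidden log-prices
    W = rng.normal(size=(T, N)) * np.sqrt(q_diag)    # W_n ~ N(0, Q), Q diagonal
    Y = X + W
    observed = rng.random((T, N)) < obs_prob         # missing at random
    Y[~observed] = np.nan                            # NaN marks a missing price
    return X, Y
```

For example, `simulate_observations(2000, np.zeros(3), 1e-4 * np.eye(3), 1e-6 * np.ones(3), 0.3, np.random.default_rng(1))` returns a 2000-by-3 hidden price path and an observation array in which roughly 70 percent of entries are missing.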
Missing Data Example
SVD of complete data
opportunity to infer missing data
Missing data can be inferred from observation of other assets at same and different times
Data Completion through Kalman smoothing
Kalman smoothing can be used to infer missing data and remove noise
Conditioned on parameters, θ, Kalman smoothing is a recursive method for computing the posterior distribution p(x | y, θ)
Applies only to normally distributed data and requires parameter estimates
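A compact sketch of the smoother for the random-walk price model is given below. It is an assumption-laden illustration rather than the authors' implementation: jumps are taken as zero, parameters D, Γ and the diagonal of Q are treated as known, missing observations are marked with NaN, and `m0`, `P0` are a hypothetical Gaussian prior on the first log-price vector. It runs a forward Kalman filter followed by a backward Rauch-Tung-Striebel pass.

```python
import numpy as np

def kalman_smoother(Y, D, Gamma, q_diag, m0, P0):
    """Posterior means and covariances of X_n given all of Y (NaN = missing)."""
    T, N = Y.shape
    m_f = np.zeros((T, N)); P_f = np.zeros((T, N, N))   # filtered moments
    m_p = np.zeros((T, N)); P_p = np.zeros((T, N, N))   # predicted moments
    m, P = m0, P0
    for n in range(T):
        if n > 0:                                   # predict: random walk with drift D
            m, P = m + D, P + Gamma
        m_p[n], P_p[n] = m, P
        idx = np.flatnonzero(~np.isnan(Y[n]))       # assets observed at time n
        if idx.size > 0:
            H = np.eye(N)[idx]                      # rows of identity for observed assets
            S = H @ P @ H.T + np.diag(q_diag[idx])  # innovation covariance
            K = np.linalg.solve(S, H @ P).T         # gain K = P H' S^{-1}
            m = m + K @ (Y[n, idx] - H @ m)
            P = P - K @ H @ P
        m_f[n], P_f[n] = m, P
    m_s = m_f.copy(); P_s = P_f.copy()
    for n in range(T - 2, -1, -1):                  # backward smoothing pass
        G = np.linalg.solve(P_p[n + 1], P_f[n]).T   # G = P_f P_p^{-1}
        m_s[n] = m_f[n] + G @ (m_s[n + 1] - m_p[n + 1])
        P_s[n] = P_f[n] + G @ (P_s[n + 1] - P_p[n + 1]) @ G.T
    return m_s, P_s
```

With one asset, nearly noise-free observations 0 and 2 at the first and third time steps, and the middle price missing, the smoothed middle value is the midpoint of its neighbors, which is the behavior expected of a random walk.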
Rudolf Kalman, 2008
Jumps are not addressed with Kalman smoothing
Kalman filter tends to oversmooth jumps
Jumps can contaminate the estimate of covariance (further degrading Kalman smoothing performance)
$$\hat{\Gamma} \sim \Gamma + \underbrace{E(JJ^T)}_{\text{jump bias}}$$
Sparsity inducing priors for jump modeling
Jumps need to be addressed to avoid corruption of covariance estimate
For discrete time modeling we consider two types of prior distributions for jumps
  Spike and Slab
  Laplace Distribution
Both priors induce sparsity in the posterior mode of jumps
Both models are also popular for variable selection in regression and machine learning
Sparse modeling applied to infrequent jumps
Spike and Slab Jump Model
For this model the prior of J_i(t) is a mixture of a point mass at 0 and a normal distribution
$$p(j_i(t)) = \underbrace{\zeta\, 1_{j_i(t)=0}}_{\text{spike at 0}} + \underbrace{(1-\zeta)\, N(j_i(t);\, 0,\, \sigma^2_{j,i}(t))}_{\text{slab}}$$
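Drawing from this prior is straightforward and shows why it models infrequent jumps. The helper below is a hypothetical sketch: the name `sample_spike_and_slab` is not from the slides, and the slab standard deviation is taken as constant across assets and time for simplicity.

```python
import numpy as np

def sample_spike_and_slab(zeta, sigma_j, size, rng):
    """Draw jumps that are exactly 0 with probability zeta, else N(0, sigma_j^2)."""
    slab = rng.random(size) >= zeta                  # True with probability 1 - zeta
    return np.where(slab, rng.normal(0.0, sigma_j, size), 0.0)
```

With ζ close to 1, almost every draw is exactly zero and the few non-zero draws come from the slab, so a sampled jump sequence is sparse.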
Laplace prior for jumps
The spike and slab distribution of J is non-continuous and multi-modal, which complicates estimation of J
As an approximation we consider the Laplace distribution
$$p(j) \propto \exp(-\lambda |j|)$$
Induces a weighted ℓ1 norm in the conditional log posterior
λ_n(t) is treated as unknown with a known distribution (gamma)
Iterative estimation of λ_n(t) induces a reweighting of the ℓ1 norm
Laplace prior effectively adds a weighted ℓ1 penalty to the log-likelihood
Maximum a posteriori (MAP) estimation of covariance
The MAP estimate of covariance, Γ̂, is the mode of the posterior
$$[\hat{\Gamma}, \hat{\theta}] = \arg\max_{\Gamma', \theta'} \; \log p(\theta', \Gamma' \mid y)$$
where θ are the nuisance parameters (jumps, noise variance, etc.)
Posterior is difficult to directly optimize due to missing data
Iterative approaches commonly employed
ECM approach to MAP estimation
Expectation conditional maximization (ECM) algorithm (Meng,Rubin 1993) alternates between two steps
E-step: Compute the following surrogate function
$$G^{(k)}([\Gamma, \theta]) = E_{X \mid Y, \Gamma^{(k)}, \theta^{(k)}} \log p(\Gamma, \theta \mid y, x)$$
M-step: Set [Γ^(k+1), θ^(k+1)] to conditional maximizers of G^(k)([Γ, θ])
E-step performed using Kalman smoother (model is normal when conditioned on jumps)
Monotonic increase in log posterior
KECM-Laplace recovery, Low Rank Covariance
KECM approach can recover missing prices when covariance is low rank
Bayesian Approach using MCMC
Problems with ECM approach
  Reports single mode (yet multiple modes exist)
  Nuisance parameters are estimated
  Uncertainty in estimate not reported
Bayesian Approach
  Posterior distribution estimated, uncertainty quantified
  Nuisance parameters are averaged out, not optimized
[Figure panels: single-mode jump posterior; multimodal jump posterior]
Gibbs sampling approximation
Computing the posterior of covariance directly involves integration over a high-dimensional parameter space
Markov Chain Monte Carlo (MCMC) approaches such as Gibbs sampling can be used to approximate the posterior in an efficient manner
  Sequentially draw each parameter from its conditional posterior distribution
  Sequence converges in distribution to the posterior
Each conditional distribution known in closed form and easy to draw from
Gibbs sampling allows for approximate Bayesian inference
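The mechanics of Gibbs sampling can be seen on a toy target. The sketch below is not the sampler for the price model; it assumes a standard bivariate normal with correlation `rho` as a stand-in target, chosen because both conditional distributions are known in closed form, and the function name is hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x, y = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho ** 2)
    samples = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        x = rng.normal(rho * y, cond_std)    # draw x from p(x | y)
        y = rng.normal(rho * x, cond_std)    # draw y from p(y | x)
        if t >= burn_in:                     # keep only post burn-in draws
            samples[t - burn_in] = (x, y)
    return samples
```

After burn-in, sample averages over the retained draws approximate posterior expectations; for instance, the sample correlation of the draws approaches `rho`.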
MCMC samples from jump posterior
Projection of samples onto two jump components
Utility of MCMC approach
Projection of samples onto a single asset's log-price and jump
Results of Covariance Estimation, 20 simulated assets
Characterize performance using the normalized Frobenius norm of the covariance estimation error
$$\frac{\sqrt{\sum_{i,j} |\hat{\Gamma}_{i,j} - \Gamma_{i,j}|^2}}{\sqrt{\sum_{i,j} |\Gamma_{i,j}|^2}}$$
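The error metric is a one-liner in NumPy. This sketch assumes the normalization is by the Frobenius norm of the true covariance, which is the usual reading of "normalized" here; the function name is hypothetical.

```python
import numpy as np

def normalized_frobenius_error(Gamma_hat, Gamma):
    """Frobenius norm of the estimation error divided by that of the true covariance."""
    return float(np.linalg.norm(Gamma_hat - Gamma, "fro") / np.linalg.norm(Gamma, "fro"))
```

A perfect estimate scores 0, and an estimate of all zeros scores 1, which gives a convenient scale for comparing estimators.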
MCMC and KECM outperform other techniques when jumps are present
Performance under GARCH(1,1)-jump model
$$h_i(t+1) = 0.5\, h_i(t) + 0.3\, (X_i(t) - X_i(t-1) - D)^2 + 0.2\, \Gamma_{i,i}$$
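The variance recursion can be simulated directly for one asset. The sketch below is illustrative: it omits jumps, starts the variance at Γ_ii, and uses a hypothetical function name. Because the coefficients 0.5, 0.3 and 0.2 sum to one, the long-run average of h equals Γ_ii in this jump-free setting.

```python
import numpy as np

def simulate_garch_variance(T, D, gamma_ii, rng):
    """Simulate h(t+1) = 0.5 h(t) + 0.3 (dx(t) - D)^2 + 0.2 gamma_ii for one asset."""
    h = np.empty(T)       # conditional variance of the log-price increment
    dx = np.empty(T)      # log-price increments X(t) - X(t-1)
    h[0] = gamma_ii       # assumed starting value
    for t in range(T):
        dx[t] = D + np.sqrt(h[t]) * rng.normal()
        if t + 1 < T:
            h[t + 1] = 0.5 * h[t] + 0.3 * (dx[t] - D) ** 2 + 0.2 * gamma_ii
    return h, dx
```

The simulated variance never drops below 0.2 Γ_ii and clusters in bursts, which is the stochastic volatility the estimators are being stress-tested against.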
MCMC and KECM spike and slab are more robust to stochastic volatility
Conclusion
Sparse modeling and optimization applied to finance in 2 ways
Portfolio robustness enhancements
  Pairwise Weighted Elastic Net (PWEN) and Weighted Elastic Net (WEN) penalties proposed for mean-variance portfolios
  Bootstrap calibration for PWEN and WEN penalties proposed
  Experimental results using U.S. stock return data show the utility of the WEN and PWEN approach
Covariance estimation from high frequency data
  Kalman EM approach extended to models that include price jumps
  Application of sparsity inducing priors to model jumps
  New approach shows enhanced performance under jump models for a variety of simulated data models (jumps, GARCH, dependent observation noise)
Conclusion
Pairwise weighted elastic net
  Consider constraints such as no short selling
  Relaxing the diagonal dominance restriction on the pairwise weighting matrix may improve performance
Covariance estimation from high frequency data
  Investigate low rank + sparse matrix factorization techniques to enhance covariance estimation
  Reweighted nuclear norm and reweighted ℓ1 penalties
  Investigate jumps in microstructure noise
Other applications
  Robust adaptive sensor processing (e.g. beamforming)
Backup Charts
Section 5
Solution via nuclear norm minimization
Missing data can also be recovered using matrix completion by noting returns are low rank
Definition
  R_{i,t}: unobserved low rank component return of asset i at time t
  J_{i,t}: unobserved sparse jump component return of asset i at time t
  X_i: unobserved efficient price of asset i at time 0
  Y_{i_k,t_k}: observed (noisy) price of asset i_k at time t_k
  S: discrete time integration (in time) operator (rectangular method)
Nuclear Norm Formulation
)^2 + λ2 ||J||ℓ1
[Figure: 80 percent observed, no noise. Panels: log-price vs. time; jump vs. time; singular value vs. singular value #. Curves: Truth, Nuclear Norm Minimization, KECM-Laplace, Observations.]
[Figure: 30 percent observed, no noise. Panels: log-price vs. time; jump vs. time; singular value vs. singular value #.]
[Figure: 80 percent observed, with noise. Panels: log-price vs. time; jump vs. time; singular value vs. singular value #.]
[Figure: 30 percent observed, with noise. Panels: log-price vs. time; jump vs. time; singular value vs. singular value #.]
Initialize estimates of Γ, σ², and J
while not converged
  Compute posterior distribution of X given Y, Γ, σ², J, D with Kalman smoother (E-step)
  Perform M-step for Γ, D and σ², assuming J is fixed
  Compute MAP estimate of J given Γ and σ² using ADMM, FISTA, etc.
  Update λ_i(t) (effectively reweights the ℓ1 penalty)
Algorithm for spike and slab model is similar.
Initialize parameters Θ^(0) = [Y_miss, X, Γ, D, J, σ², ζ, σ²_j]
for m = 0 ... M
  for k = 1 ... 8
    Sample Θ_k^(m+k/8) from p(Θ_k | Θ_{−k}^(m+(k−1)/8))
Discard first P samples (“burn-in”)
Take covariance samples to estimate posterior mean of covariance
Example: Bootstrapping the uncertainty set when statistics are unknown
Here we illustrate one way to calibrate the uncertainty set for µ
Suppose we have training data returns r(1), ..., r(T)
Randomly take T samples from {r(1), ..., r(T)} (with replacement)
Call these ζ(1), ..., ζ(T)
Use the empirical distribution of µ(ζ(1), ..., ζ(T)) − µ(r(1), ..., r(T)) as a proxy for estimation error
This can be done via Monte Carlo by resampling many times
β can be selected as a percentile of the empirical distribution
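The bootstrap steps above can be sketched as follows. This is a minimal illustration with assumptions labeled: the function name `bootstrap_beta` is hypothetical, µ(·) is taken to be the sample mean, and β is set to a percentile of the absolute error per asset, which is one natural choice given that the uncertainty set bounds |c_i|.

```python
import numpy as np

def bootstrap_beta(returns, n_boot, percentile, rng):
    """Per-asset bound on the mean estimation error from bootstrap resampling."""
    T, N = returns.shape
    mu_hat = returns.mean(axis=0)                    # mean of the training data
    errors = np.empty((n_boot, N))
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)             # T draws with replacement
        errors[b] = returns[idx].mean(axis=0) - mu_hat
    # Assumption: use a percentile of the absolute error as beta.
    return np.percentile(np.abs(errors), percentile, axis=0)
```

For T = 500 daily returns with a standard deviation of 2 percent, a 95th percentile gives β of roughly 0.002 per asset, close to the value suggested by a normal approximation.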
Consider the following experiment
Return data collected from 20 US stocks between 7-2001 and 7-2013
Sharpe ratio optimal portfolio computed based on 55 days of training data
Portfolio performance evaluated using next 30 trading days
Performance of plug-in mean-variance portfolio is disappointing
[Figure panels: calibration using bootstrap; calibration using Normal-χ² approximation]
Jumps in price can corrupt estimate of covariance
Jumps in market returns not explained by a diffusion can occur
These jumps can severely bias the covariance estimate of the diffusion component of the returns
Disentangling price movement due to jump and diffusion components is necessary to estimate covariance
Consider the following experiment
Suppose κ is Laplace distributed and q is N(0,1)
We observe η = κ + q. Suppose we observe η = 0.5
The maximum likelihood estimate of κ is 0.5. The posterior mode is 0!
[Figure: likelihood, Laplace prior, and posterior plotted against κ]
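The experiment can be reproduced in a few lines. The sketch below assumes a Laplace prior with rate λ = 1 (the slide does not state the scale) and unit noise variance; under those assumptions the posterior mode is a soft-thresholding of the observation, which is checked against a brute-force grid search. Function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t, returning exactly 0 when |x| <= t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def laplace_posterior_mode(eta, lam=1.0, noise_var=1.0):
    """Mode of p(kappa | eta) for a Laplace(lam) prior and N(0, noise_var) noise."""
    # Maximizes -(eta - kappa)^2 / (2 noise_var) - lam |kappa| over kappa.
    return soft_threshold(eta, lam * noise_var)

def grid_posterior_mode(eta, lam=1.0, noise_var=1.0):
    """Brute-force check of the mode on a fine grid of kappa values."""
    grid = np.linspace(-3.0, 3.0, 6001)
    log_post = -(eta - grid) ** 2 / (2.0 * noise_var) - lam * np.abs(grid)
    return grid[np.argmax(log_post)]
```

With η = 0.5 both functions return a mode at zero, while a larger observation such as η = 2 is shrunk to 1 rather than set to zero. This is the mechanism by which the Laplace prior yields sparse jump estimates.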
Quadratic programming approach to solving MV problem applicable to small scale problems
The ℓ1 component of the elastic net penalty is non-differentiable, which complicates our optimization
Problem can be reformulated as a convex quadratic program and solved with general purpose solvers
  Requires additional variables and constraints
  2N primal and 2N dual variables
  This approach becomes difficult for large scale problems
min w ,d
i=1
Splitting approaches scale better for large scale problems
Another approach to solving RMV is to first split the objective function into a smooth and non-smooth component
$$\underbrace{w^T \Gamma w - w^T \mu + \|w\|^2_{\vec{\alpha},\ell_2}}_{:= f(w)} + \underbrace{\|w\|_{\vec{\beta},\ell_1}}_{:= g(w)}$$
Idea is that each component of the problem is easily solved when considered separately
Algorithms that use this idea include
  Proximal gradient descent (ISTA, FISTA)
  Split-Bregman (or Alternating Direction Method of Multipliers (ADMM))
Split-Bregman/ADMM Approach
$$\min_{w,d} \; f(w) + g(d) \quad \text{s.t.} \quad d = w$$
Split-Bregman is an augmented Lagrangian approach where in each iteration the augmented Lagrangian
$$L(w, d, b) = f(w) + g(d) + \frac{\lambda}{2} \|d - w - b\|^2_{\ell_2}$$
is minimized with respect to w and d separately
Algorithm 1 Split-Bregman
Initialize: k = 1, b^k = 0, w^k = 0, d^k = 0, λ > 0
while not converged do
  w^{k+1} = argmin_w f(w) + (λ/2) ||d^k − w − b^k||²_ℓ2
  d^{k+1} = argmin_d g(d) + (λ/2) ||d − w^{k+1} − b^k||²_ℓ2
  b^{k+1} = b^k + w^{k+1} − d^{k+1}
  k = k + 1
end while
Sub-problems in Split-Bregman approach are easy to solve
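The advantage of the Split-Bregman approach is that each sub-problem is easy to solve.
The first sub-problem in Split-Bregman is quadratic and is easily solved
The solution to the second sub-problem can be obtained via shrinkage
$$d_i = \begin{cases} w_i + b_i + \beta_i/\lambda & \text{if } w_i + b_i < -\beta_i/\lambda \\ 0 & \text{if } |w_i + b_i| \le \beta_i/\lambda \\ w_i + b_i - \beta_i/\lambda & \text{if } w_i + b_i > \beta_i/\lambda \end{cases} \qquad (1)$$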
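The two sub-problems and the multiplier update can be put together in a short solver. The following is a sketch under stated assumptions, not the authors' adaptive-support implementation: it targets the weighted elastic net objective f(w) + g(w) with f(w) = w^T Γ w − w^T µ + Σ α_i w_i² and g(w) = Σ β_i |w_i|, uses the penalty (λ/2)||d − w − b||² so that the shrinkage acts on w + b, solves the quadratic sub-problem with a dense linear solve, and uses hypothetical names and default tolerances.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise shrinkage: the closed-form solution of the d sub-problem."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def split_bregman_wen(Gamma, mu, alpha, beta, lam=1.0, n_iter=2000, tol=1e-10):
    """Minimize w'Gamma w - w'mu + sum(alpha w^2) + sum(beta |w|) by Split-Bregman."""
    n = len(mu)
    w = np.zeros(n); d = np.zeros(n); b = np.zeros(n)
    # Normal equations of the w sub-problem: (2 Gamma + 2 diag(alpha) + lam I) w = mu + lam (d - b)
    H = 2.0 * Gamma + 2.0 * np.diag(alpha) + lam * np.eye(n)
    for _ in range(n_iter):
        w = np.linalg.solve(H, mu + lam * (d - b))      # quadratic sub-problem
        d_new = soft_threshold(w + b, beta / lam)       # shrinkage sub-problem
        b = b + w - d_new                               # Bregman (multiplier) update
        done = (np.linalg.norm(d_new - d) <= tol and
                np.linalg.norm(w - d_new) <= tol)
        d = d_new
        if done:
            break
    return d                                            # sparse iterate
```

Returning `d` rather than `w` gives exact zeros for inactive assets. With β = 0 and α = 0 the iteration reduces to solving 2Γw = µ, and with β large enough every weight is shrunk to zero.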
Adaptive Support Split-Bregman/ADMM Approach
Adaptive Support Split Bregman Approach developed to save computational time by exploiting sparse nature of solution
First sub-problem in Split-Bregman can be difficult to solve in high-dimensional setting
For stock returns, covariance matrix is often poorly conditioned and dense
As shown before, WEN penalized portfolios tend to be sparse (only a small number of holdings).
If the support of the portfolio is known a priori, then significant computational savings can be achieved by restricting attention to the support set
  Only need to solve the quadratic program on a smaller subspace
However, the support is unknown and needs to be estimated
A new adaptive support Split-Bregman approach has been developed which alternates between estimating the support and solving the problem on the smaller support set
Top level description of Adaptive Support Split-Bregman Algorithm
The following is a top level description of Adaptive Support Split-Bregman Algorithm.
1) Start with a small basket of "high potential" assets likely to be in the optimal portfolio.
Sub partial derivative criteria used to determine which assets to choose
2) Solve for portfolio restricted to the small basket using Split-Bregman
3) Prolong primal and dual variables to full dimensional space
4) If optimality conditions are (approximately) satisfied we are done
5) Otherwise adjust the basket by adding additional high potential assets and removing low potential assets
Uses sub partial derivatives as a guide to determine which assets to choose
6) Return to 2
Adaptive Support Split-Bregman converges quickly to a solution for sparse portfolios
Dimension Sparsity Multi-level Split-Bregman FISTA Multi-level Level Split-Bregman FISTA*
2000   88     0.1 sec    20.6 sec    0.39 sec   0.2 sec
2000   142    0.2 sec    14.5 sec    0.8 sec    0.2 sec
2000   450    0.9 sec    14.6 sec    3.6 sec    1.5 sec
2000   1181   2.2 sec    27.0 sec    13.7 sec   14.5 sec
2000   1692   10.4 sec   38.0 sec    21.4 sec   22.7 sec
3000   237    0.3 sec    48.2 sec    12.9 sec   2.7 sec
3000   805    1.3 sec    49.9 sec    55.7 sec   24.6 sec
4000   234    0.5 sec    107.6 sec   24.6 sec   2.2 sec
New algorithm consistently outperforms other algorithms in sparse high-dimensional setting
* E. Treister and I. Yavneh. A Multilevel Iterated-Shrinkage Approach to ℓ1 Penalized Least-Squares Minimization. IEEE Trans. Signal Proc., 60(12):6319-6329, 2012.
Performance with stochastic noise variance
Here we extend GARCH(1,1) model to stochastic microstructure noise variance
σ2 o,i(t) =
Sensitivity to changes in idiosyncratic component of variance
Model with jumps. ζ = 0.999, σ²_j = 2.5e-5
GARCH(1,1) model with jumps. ζ = 0.999, σ²_j = 2.5e-5