Predicting the Volatility Index Returns Using Machine Learning
by
Michael Yu
A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Mathematics
University of Toronto
© Copyright 2017 by Michael Yu
Abstract
Predicting the Volatility Index Returns Using Machine Learning
Michael Yu
Master of Science
Graduate Department of Mathematics
University of Toronto
2017
We probe how predictable the short term future behaviour of the Chicago Board
Options Exchange (CBOE) Volatility Index (ticker symbol VIX) is given past
market price data within the constraints of a simple classic machine learning
framework. We use past VIX and SPX price time windows as input to predict the
movement direction, i.e. sign of the return, of VIX over the next 1 to 6 weekdays.
For successful cases of predicting return direction from one particular weekday
to another particular future weekday, we have moderately reliable accuracies of
between about 55% and 65% depending on the particular time bridge. We find
that 1 day returns are difficult to predict except for a few particular cases, and
as the prediction window grows we have models that can predict more and more
accurately up to a consistent 62% for both 5 days and 6 days in the future.
Contents

1 Background
2 Setup
  2.1 Definitions
  2.2 Hypothesis
3 Method
  3.1 Machine Learning Paradigm
    3.1.1 Financial Time Series
  3.2 Machine Learning Concepts
    3.2.1 Feature Engineering
    3.2.2 Ensemble Learning
  3.3 Machine Learning Primitives
    3.3.1 Decision Trees
  3.4 Specified Model
    3.4.1 Input Space and Output Space
    3.4.2 Combinations Features Bank
    3.4.3 Hyperparameter Search
    3.4.4 Cross Validation Techniques
4 Results
  4.1 Classification Evaluation
    4.1.1 Confusion Matrix
    4.1.2 Precision and Recall
    4.1.3 Accuracy
  4.2 Sample Procedure
    4.2.1 Best Committee of the Best XGBoost Models
    4.2.2 XGBoost Specifications
  4.3 Sample Test Results
  4.4 Interpretation
5 Discussion
  5.1 Discovery Process
  5.2 Limitations and Extensions
Bibliography
1 Background
The S&P 500 index (ticker symbol SPX) is a weighted sum of the stock prices of
500 influential American companies. Its performance over time represents well
the economic growth of the US. The Chicago Board Options Exchange (CBOE) offers
SPX options to traders so that positions in their portfolios can be hedged. The
prices of SPX options fit as variables in a relation involving the volatility of
the SPX price in the future. As the prices of SPX options are determined purely
by buy and sell activity in their market, the volatility of SPX as expected by
participants of the stock market can be meaningfully inferred from market
options prices. CBOE's volatility index, VIX[2], defines a changing portfolio
of SPX options that constantly seeks to track the mathematically implied
volatility over the future 30 days according to market behaviour.
2 Setup
We obtained SPX data from Yahoo Finance and VIX data from CBOE. The
dates (in yyyy/mm/dd format) forming our dataset range from 1990/01/02 to
2017/09/06.
2.1 Definitions
We use [a, b) to denote the range of indices a, a + 1, . . . , b − 1 where a ≤ b (this
being an empty set of indices for a = b). Let V = {Vi : i ∈ [i0 − 1, i0 + N)} denote the
time series of VIX prices. We have on the order of N ≈ 7000 data points. Define
the time series Li = log(Vi) and Ri = log(Vi/Vi−1) over the domain [i0, i0 + N).
Define the parameters p = 30 and q, which mean the past time window size
and the future prediction horizon size, respectively. Use the variable names ℓ
and r to denote the collections of rolling window views onto the time series L
and R. That is, ℓi,j = Li+j for i = i0 + p, . . . , i0 + N − q, where for each i the
index j ranges over [−p, q), with an analogous definition for ri,j. In other words,
ℓi = L[i−p,i+q) and ri = R[i−p,i+q). For each i, the rolling windows ℓi and ri then have
p days of past, and therefore observable, data, and q days of future data (the
j = 0 entry being "current", though still interpreted as in the future).
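A minimal sketch in Python of how these rolling windows can be assembled; the variable and function names here are illustrative, not taken from the thesis code:

```python
import math

def make_windows(V, p, q):
    """Build log-price and log-return rolling windows from a price list V.

    V[0] corresponds to index i0 - 1, so L and R are defined from i0 on.
    Returns dicts mapping each valid i (relative to i0) to the windows
    l_i = L[i-p, i+q) and r_i = R[i-p, i+q).
    """
    # L_i = log(V_i) and R_i = log(V_i / V_{i-1}), both over [i0, i0 + N)
    L = [math.log(v) for v in V[1:]]
    R = [math.log(V[i] / V[i - 1]) for i in range(1, len(V))]
    windows_l, windows_r = {}, {}
    # i runs from i0 + p to i0 + N - q inclusive; indices here are relative to i0
    for i in range(p, len(L) - q + 1):
        windows_l[i] = L[i - p:i + q]   # p past days followed by q future days
        windows_r[i] = R[i - p:i + q]
    return windows_l, windows_r

# Tiny example: 40 fake prices, p = 5 past days, q = 2 future days
V = [20.0 + 0.1 * t for t in range(40)]
wl, wr = make_windows(V, p=5, q=2)
assert len(wl) == 33                 # N - q - p + 1 windows with N = 39
assert len(wl[5]) == 5 + 2           # each window has p + q entries
```
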
We want to treat other data series, beyond just the VIX price, in exactly the
same way. Let VIX be the index-0 data series, and add the SPX price as the
index-1 data series, so that we have the corresponding series L(0) = L as above
and L(1) defined analogously for the SPX price; the other series R and the rolling
windows ℓ and r all take the (0), (1) modifiers corresponding to VIX and SPX
respectively. Let d = 2 denote the number of data series considered.
2.2 Hypothesis
We hypothesize that the behaviour of VIX price over the near future few days
can be predicted using the data of VIX price and other related market time series
over the past. One particular stronger hypothesis we test is that there is a pattern
relating each p days of past data on all the considered data series to the behaviour
of VIX price over the next q days. In particular, we investigate the predictive
potential for the cumulative return over the future q days of VIX at each time
window. In general we could have chosen any target future quantity thought to
depend largely on market factors that manifest in the chosen data series in the
past. In a similar vein, we can grow the prediction model via considering more data
series thought to contain extra market information pertaining to future VIX prices.
The choice of a fixed-length past time window with which to make predictions
is made with the goal of restricting model complexity and probing whether a
fairly short history suffices for forecasting future returns.
3 Method
3.1 Machine Learning Paradigm
First we describe the standard machine learning paradigm, which does not the-
oretically subsume all of the work we do for the application in this paper but
nonetheless provides instructive insight and acts as a good frame of reference for
thinking about other practical prediction methods.
There is a space S of possible input signals and a space T of possible output
targets. The product space S × T has a probability distribution D from which
we can repeatedly sample. Our goal is to find a function m such that m(s)
and t are expected to be similar according to the similarity metric E when (s, t)
is sampled from D.
To find m, we use the idea behind the following approach. We
hypothesize a family of functions {ma : S → T}, where each function ma is
characterized by the parameter a in the parameter space A, and is also endowed
with a prior probability c where c(a) denotes how likely we believe ma is the
function out of all the ones in the family to best satisfy our goal when knowing
nothing about D. We take i.i.d. samples from D many times to get two sets
of observations, {(s, t)}train and {(s, t)}cv (cv stands for cross validation). We
choose a so as to balance maximizing c(a) and also optimizing that ma(s) and
t are most similar according to E when (s, t) is drawn from the distribution de-
fined by {(s, t)}train. Then we evaluate if ma(s) and t are similar by E when
(s, t) ∼ {(s, t)}cv. A satisfactory result indicates we can choose m = ma. Call
this procedure a machine learning iteration.
The family of functions must be inclusive enough to contain a function ma
that can satisfactorily meet our goal. On the other hand, the family of functions
should be exclusive enough so as to preclude the possibility of ma being chosen
especially to minimize expected E(ma(s), t) when (s, t) ∼ {(s, t)}train and missing
out on minimizing E(ma(s), t) for when (s, t) ∼ D in general.
The evaluation of how well ma(s) predicts t using {(s, t)}cv loses its reliability if we
repeat the machine learning iteration with different family definitions, different
c(a), or different methods of searching for a across A, and pick the best result. This
issue arises because we do not discern which machine learning iteration should give
the best predictive model prior to peeking at the results that the cross validation
dataset gives. The inaccessibility of {(s, t)}cv to the training part of the machine
learning iteration is what makes its measurement of ma(s) tracking t useful.
Executing a machine learning iteration on a prediction problem, however, can give
crucial insight on designing a better machine learning iteration approach for that
problem. Therefore we find ways to repeat the machine learning iteration with-
out incurring the cost of overfitting to the cross validation dataset with our meta
selection of families of functions and other machine learning iteration parameters.
The same training and cross validation datasets may be used in multiple dif-
ferent machine learning iterations if the number of machine learning iterations is
small. A series of machine learning iterations can be considered as one machine
learning iteration if we instead acknowledge a bigger training set
{(s, t)}train ∪ {(s, t)}cv and sample a different cross validation set anew.
The final cross validation set on which the performance of the machine learning
approach is evaluated is better known as the test set. The results from this set can
be taken as a probabilistic sample of the effectiveness of the resulting prediction
solution.
3.1.1 Financial Time Series
For our application to predicting VIX returns, the input space is all possible values
of a number of real valued time series up to any final day. We can ignore part of
the information in the input and instead consider the input space as all possible
p past days time windows into the data series we have chosen if our prediction
functions only need to take in these p days windows. The output space is either the
value of q days (log) return immediately following each of those p days windows
or some less specific characteristic of that value, like its sign.
We split the physically available collection of time series data up to some
cutoff day into three different sample sets: training, cross validation, and testing.
A free variable i, denoting the day after the latest day present in the inputs
to our machine learning models (the same variable that indexes time), can
designate the identity of an individual sample from the input space, i.e. act as an
index for the windows. Denote the sets of i indices represented by the training,
cross validation, and testing collections as tr, cv, and te, respectively.
We can assume that the distributions of these time windows are independent
across different i to align with the stated machine learning paradigm, but this
assumption is made more for the ease of understanding of the paradigm than for
application requirement. For the practical purpose of creating a VIX return
prediction model, however, results obtained by using data from the future to help
make predictions about earlier, unseen events do not generalize. Therefore,
all of the i ∈ cv are greater than those in tr, and likewise for te relative to cv, i.e.
the sets are in chronological order. The windows in te consist of the
latest 2 years of data, amounting to about 7.3% of the data. For
typical machine learning applications the split would not be so skewed, but for our
application market behaviour can change significantly very quickly, so being
able to test our algorithm on the 2 latest years should mark it as sufficiently
effective.
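A sketch of this chronological split follows; the helper is hypothetical, and the 252 trading days per year and the 20% cross validation share are illustrative assumptions, with the 2-year test share matching the roughly 7.3% stated above:

```python
def chronological_split(indices, cv_frac=0.2, test_days=2 * 252):
    """Split sorted window indices into tr < cv < te, in chronological order.

    The test set takes the final `test_days` trading days (about 2 years);
    the cross validation set takes `cv_frac` of what remains.
    """
    indices = sorted(indices)
    te = indices[-test_days:]
    rest = indices[:-test_days]
    n_cv = int(len(rest) * cv_frac)
    tr, cv = rest[:-n_cv], rest[-n_cv:]
    assert max(tr) < min(cv) < max(cv) < min(te)  # strictly chronological
    return tr, cv, te

# About 7000 daily windows, as in our dataset
tr, cv, te = chronological_split(range(7000))
assert abs(len(te) / 7000 - 0.073) < 0.01  # test set is ~7.3% of the data
```

In practice a few windows bordering the adjacent sets would also be dropped, since the future data of the last few tr windows overlaps the first few cv windows.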
We review the machine learning paradigm applied to our use case. We define
a family of functions characterized by having different parameterizations which
maps a past time window into a value that seeks to approximate the target future q
days VIX return. Using the input and target samples represented by the index set
tr, we select a choice of function from the family, or equivalently choose parameters,
seeking good predictive power on the tr future target values while maintaining
generalizability. We then fix this function and evaluate its predictive power on
the cv input-target samples. Good performance of the selected function from the
family in predicting our target value in the cv set gives a theoretically accountable
indication of the function capturing a predictive relation in the quantities studied,
as the function was chosen without using future information from the cv time
period.1 This process is repeated to select a best family of functions to choose
from and the best learning algorithm for attaining the optimal function within
the family according to prediction performance on the cv set, and then an optimal
function is chosen using all past and future data in tr combined with cv and tested
on the te set to determine a final measure of future predictive potential.
3.2 Machine Learning Concepts
3.2.1 Feature Engineering
A major part of most successful machine learning pipelines is feature engineering.
Features are transformations of the input values into a space that is more easily
related to the target space than the original inputs. We map S to the intermediary
space of features X, which is then more easily related to T. Famous deep learning
algorithms built on multi-layer neural network models can learn features without
manual human choice, given a very large amount (on the order of at least millions)
of training data. In our case, I continue to make creative judgements on what
features to use based on previous results on cross validation.
3.2.2 Ensemble Learning
Ensemble learning leverages the power of multiple learning algorithms, used in
possibly distinct contexts, in order to more effectively capture useful patterns in
the input. Creating a majority voting committee of base models is a simple way
of aggregating multiple decision algorithms. However, it is not a simple matter
to decide which base models should contribute votes to the committee decision
for a better combined model. I do not know any standard approaches to this, so I try a few
1Note that if the cv time indices consecutively follow those in tr, then the last few tr time windows' future data is also future data for the first few cv time windows, so the parameter learning process would leak some future information from the cv set. Therefore, some time windows bordering the adjacent split sets are dropped.
methods explained in the results section. Finally, with zero prior intuition about
which of several candidate models might work best on a problem, one solution
is the so called bucket of models approach. Train all the candidate models on
a training set and then pick the one that attains the best score on the cross
validation set to solve the test set problem.
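The bucket of models approach can be sketched as follows; the constant "models" and the fit and score functions here are toy placeholders, not the ones used in this thesis:

```python
def bucket_of_models(candidates, fit, score, train, cv):
    """Train every candidate on `train`, keep the one scoring best on `cv`."""
    fitted = [fit(model, train) for model in candidates]
    return max(fitted, key=lambda m: score(m, cv))

# Toy example: constant classifiers that always predict one class
train = [(x, +1) for x in range(8)] + [(x, -1) for x in range(3)]
cv = [(x, +1) for x in range(6)] + [(x, -1) for x in range(4)]

candidates = [+1, -1]                       # "model" = the constant it predicts
fit = lambda model, data: model             # nothing to learn for a constant
score = lambda m, data: sum(1 for _, t in data if m == t) / len(data)

best = bucket_of_models(candidates, fit, score, train, cv)
assert best == +1  # +1 wins since 6 of the 10 cv targets are +1
```
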
3.3 Machine Learning Primitives
3.3.1 Decision Trees
Decision trees are conceptually very simple. A single branch decision tree might
look like “Is the 5th component of s greater than 0.42? If so, output 0. Other-
wise, output 1.” More complicated decision trees will have more branches and
greater depth of branches. A gradient boosted tree model has multiple trees,
and the resulting prediction is a linear combination of the value outputs of each
individual tree. The term “gradient boosting” refers to the model learning by
iteratively adding on new trees chosen to descend on the gradient of the residual
error left by predictions made by the ensemble of preceding trees. An extremely
popular and industry tested version of a gradient boosted tree algorithm is called
XGBoost. We choose XGBoost as our main workhorse. It is available to be used
as a library[1].
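To make the single-branch example concrete, and to illustrate in a toy way (not the actual XGBoost algorithm) how boosting fits successive residuals:

```python
def stump(s):
    """The single-branch decision tree from the text."""
    return 0 if s[4] > 0.42 else 1   # "Is the 5th component of s greater than 0.42?"

assert stump([0, 0, 0, 0, 0.5]) == 0
assert stump([0, 0, 0, 0, 0.1]) == 1

# A gradient boosted model sums the outputs of many trees; each new tree is
# fitted to the residual error left by the trees before it.  Toy version with
# depth-0 "trees" that each predict a single constant:
def boost_constants(targets, rounds=3, lr=0.5):
    prediction, trees = 0.0, []
    for _ in range(rounds):
        residuals = [t - prediction for t in targets]
        tree = sum(residuals) / len(residuals)   # best constant fit to residuals
        trees.append(lr * tree)                  # shrink by the learning rate
        prediction += lr * tree
    return sum(trees)                            # the ensemble's prediction

pred = boost_constants([2.0, 2.0, 2.0])
assert abs(pred - 2.0) < 0.3   # successive residual fits approach the target
```
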
3.4 Specified Model
3.4.1 Input Space and Output Space
Given we only have technical indicators over the past p days, the quintessential
features relevant for prediction must be found in how high or low our numbers
are hovering compared to their historical tendency, and in how these numbers are
moving up and down in the past window. All of this information can be expressed
by the daily log returns data r(k)i,[−p,0) limited to the past p days in addition to
the original past p days price data. Arbitrary linear combinations of these log
returns can represent a wide class of shape descriptors for the movement of the
time series in the past window and certain linear combinations of the log price
can express comparison information between different days not available in just
the log returns data. We hypothesize that machine learning algorithms can learn
a variation of the appropriate linear transformation of the combined information
vector si = ⊕k (r(k)i,[−p,0) ⊕ ℓ(k)i,[−p,0)) and map it to the target.
The time windows model we have does not intrinsically factor day of the week
into consideration. Through experimentation, I have found that separately train-
ing 5 different models for each Monday to Friday weekday gives richer results than
training the same model on all days. We consider a completely distinct model for
each target prediction weekday and for each choice of 1–5 days earlier as the
last day of past input data.
For the target value, I have tried to predict the exact return q days into
the future, but it appears that my method cannot make useful predictions there.
Instead we use the sign of the return at q days in the future for a 2 class (− or
+) classification problem. We test our algorithm’s performance for each future
window from q = 1 to q = 6.
3.4.2 Combinations Features Bank
The primary bottleneck in feature creation is arriving at features that can with
little further effort effectively distinguish the target classes. The approach we take
here follows that principle and uses 3 different varieties of all inclusive combina-
torial patterns of linear, all integer, combinations on n data points. We describe
what the length n arrays of coefficients included in each of these types of combi-
nations consist of.
type C A specific collection of combinations of this variety is parameterized by vari-
ables h and m. These combinations include only arrays a[0,n) such that
a0 + a1 + · · · + an−1 = 0. h indicates the maximum absolute value an entry ai can
take and m indicates the maximum number of entries in a[0,n) that are allowed
to be nonzero. Exactly one of a and the array of the negatives of the values in a
is chosen to be kept in this collection.
type I A specific collection of combinations of this variety is parameterized by the
variable k. To begin, consider the case when k = 1. Included are all arrays
a[0,n) with entries 0, 1 only where all the 1s are consecutively placed. More-
over, and only in this specific case of k = 1, the trivial entry where all values
are 0 is not included. For k > 1, the entries consist of adding or subtracting,
decided independently, k arrays from the k = 1 type together. For each of
these arrays a there will be an array whose values are just the negative, i.e.
−a. We have a method to choose exactly one array from a and −a to keep.
Also, we throw away those arrays a whose values have a nontrivial common
factor.
type A A specific collection of combinations of this variety is parameterized by the
variables h and m. These combinations include all arrays a where the abso-
lute value of each entry does not exceed h and where at most m entries are
nonzero.
With the intuition of aiming to capture the data of price movement across
past days, ℓ(k)i,[−p,0) is fitted with type C combinations and r(k)i,[−p,0) is fitted with
type I and type A combinations, separately for each k. Due to computational
limitations, we want to limit n to be possibly smaller than p, so an example set
of features we might use is written explicitly as

xi = ⊕_{k=0}^{d−1} ⊕_{a ∈ C(h=1, m=4)} a · ℓ(k)i,[−n,0)

or, if we think of C(h = 1, m = 4) as a matrix of coefficient rows instead of a set
of coefficient arrays,

xi = ⊕_{k=0}^{d−1} C(h = 1, m = 4) ℓ(k)i,[−n,0)

So we have as options for the ith sample vector xi ∈ X of features any vector whose
components are the resulting values when linear combinations from a specific
collection of combinations of each type are applied to the data series that that
type is used with.
Combinations specification     Number of combinations   Short description/comments
A(h = 1, m = 1)                n                        the n values in ℓ(k)i,[−n,0) as is
I(k = 1) = C(h = 1, m = 2)     (n+1 choose 2)           all ℓ(k)i,j1 − ℓ(k)i,j0 where j0 < j1
A(h = 1, m = n)                (3^n − 1)/2              all {−1, 0, 1} combinations of r(k)i,[−n,0) (except the trivial 0 combination)
I(k = 1, n = 30)               465
C(h = 1, m = 4, n = 15)        4200                     ⊂ I(k = 2, n = 15)
I(k = 2, n = 15)               7260
A(h = 1, m = n, n = 10)        29524
xi is computed from n real numbers per data series. Some choices of combination
sets make the length of xi significantly exceed n. While that fact alone does
not imply that most of the dimensions in xi are redundant, the relative simplicity
with which we constructed xi from si likely indicates so. Moreover, the number
of training samples for each weekday is only on the order of 10^3 = 1000, which
the number of constructed features can easily exceed. XGBoost excels at solving
prediction problems like this, where there can be many redundant feature
dimensions, as it can quickly and effectively pinpoint the most relevant features
to use.
3.4.3 Hyperparameter Search
XGBoost has a number of manually adjustable settings that can greatly affect its
training and prediction effectiveness. We need to repeatedly run XGBoost using a
wide range of settings. These settings, or hyperparameters, can be thought of as
coming from a subspace of A, the parameter space, in that each instance of these
hyperparameters defines the function ma by specifying how ma is to be constructed
in the training procedure. Our approach to searching this hyperparameter space
is randomized grid search. For each hyperparameter we define a list of
distributions from which to sample. From these lists we make a grid, so each
grid point assigns one distribution to each hyperparameter. We sample from these
distributions to construct one search sample of the hyperparameter space. We
now describe the hyperparameters.
• booster This string parameter specifies the type of learning algorithm to
be used in training. "gbtree" is the default original XGBoost algorithm
and "dart" is a dropout analogue for XGBoost, where trees can be dropped
randomly. We use the parameters rate_drop=0.1 and skip_drop=0.5 for the
tree dropping behaviour.

• n_estimators This integer specifies how many trees will be fitted
in the model. The values we use are 10, 30, 100, 200.

• max_depth This integer specifies the maximum depth of
the trees built. We use 2, 3, 4, 5.

• learning_rate This floating point number is typically set at about 0.1. We
consider 0.1 and 0.5.

• subsample This floating point number in (0, 1] is the proportion of training
examples to keep at a learning step. Typically set between 0.5 and 0.9, it
promotes variance and so prevents overfitting. The default is 1. We use values
between 0.3 and 0.9 in our different hyperparameter search iterations.

• col_subsample Like subsample, but this is instead the proportion of feature
dimensions to keep at each learning step. We use values ranging from 0.003
to 0.7, depending on the number of dimensions in our input feature space.
We also try the default value 1 for this hyperparameter, as keeping all of
the features might be more important here due to the differing semantics of
each generated feature dimension.

• gamma This floating point number is a minimum bound on the training
loss reduction needed to partition deeper in a tree. Increasing this value makes
the model less likely to overfit. We try the default value of 0 and some positive
values.

• (early stopping) XGBoost can be given the option to train with early
stopping enabled, which makes it stop adding trees to the model when
the score on a separate cross validation data set stops improving after some
specified number of rounds. We choose 12 rounds for this threshold.
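Sampling one point of such a randomized grid can be sketched as follows; the lists here are simplified, and a point-mass distribution is represented as a constant-returning callable:

```python
import random

# Each hyperparameter maps to a list of "distributions"; a distribution is a
# callable returning one sampled value.
grid = {
    "booster":       [lambda: random.choice(["gbtree", "dart"])],
    "n_estimators":  [lambda: random.choice([10, 30, 100, 200])],
    "max_depth":     [lambda: random.choice([2, 3, 4, 5])],
    "learning_rate": [lambda: 0.1, lambda: 0.5],
    "subsample":     [lambda: random.uniform(0.3, 0.9)],
    "col_subsample": [lambda: random.uniform(0.003, 0.7), lambda: 1.0],
    "gamma":         [lambda: 0.0],
}

def sample_grid_point(grid, rng=random):
    """Pick one distribution per hyperparameter (a grid point), then sample it."""
    return {name: rng.choice(dists)() for name, dists in grid.items()}

random.seed(0)
params = sample_grid_point(grid)
assert set(params) == set(grid)
assert params["booster"] in ("gbtree", "dart")
```

A full search would enumerate or repeatedly sample such grid points and train one model per sampled configuration.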
3.4.4 Cross Validation Techniques
Limitations on the amount of data available, i.e. on samples of (s, t) from D, make it
fruitful for us to reuse each data point in multiple sample roles.
An example of this is k-fold cross validation. The entire training data is split
into equal groups, numbering in total about 5 – 10 groups. For each group, we
train candidate models on the other groups combined as the training set and
cross validate their performances on that selected group. We average the cross
validation scores on each group to determine overall model performance.
For our application, differing periods of time may have dramatically different
behaviours in price movements. For that reason, I decided on a 5-fold cross
validation where the cross validation sets consist of 4 four-year spans prior to the
final two years of test data, with each newer one offset forward from the previous
one by two years. The last cross validation set is only a two-year
time span: the two years immediately preceding the final two years of test data.
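In terms of whole years, these spans can be sketched as follows; the exact year boundaries are an inference from the description, chosen to be consistent with the 10 unique and 18 total cross validation years mentioned in the results section:

```python
def rolling_cv_spans(test_start):
    """Five cross validation spans as (start_year, end_year), end exclusive.

    Four 4-year spans, each offset 2 years forward from the previous, then a
    final 2-year span immediately preceding the test years.
    """
    spans = [(test_start - 10 + 2 * j, test_start - 6 + 2 * j) for j in range(4)]
    spans.append((test_start - 2, test_start))
    return spans

# With the test set covering roughly the final two years of the dataset:
spans = rolling_cv_spans(test_start=2015)
assert spans[-1] == (2013, 2015)              # two years just before the test set
assert sum(b - a for a, b in spans) == 18     # 18 span-years in total
assert len({y for a, b in spans for y in range(a, b)}) == 10  # 10 unique years
```
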
4 Results
4.1 Classification Evaluation
Suppose the target space T is a finite set. For each sample (s, t) ∼ D, we can say s
belongs to the class t. For the cross validation evaluation of a prediction algorithm,
each sample (s, t) ∈ {(s, t)}cv gets a predicted class t̃ = m(s). The target space
in our results consists mostly of two roughly equally likely classes, negative return
and positive return (labelled − and +).
4.1.1 Confusion Matrix
We can readily see how closely {s ↦ t̃}cv follows {s ↦ t}cv by making what is
known as the confusion matrix. The larger the numbers on the diagonal of the
matrix compared to the ones outside the diagonal, the better the predictions.
Figure 1: Confusion matrix

                 Predicted Class
                  0        1        Total
True Class   0   (count)  (count)  (count)
             1   (count)  (count)  (count)
         Total   (count)  (count)  (count)

e.g.

t\t̃    0   1   All
0      3   1     4
1      2   6     8
All    5   7    12
4.1.2 Precision and Recall
In binary classification, that is, classification with two classes, a standard method
of visualizing the effectiveness of different instances of prediction functions is a
precision-recall curve (or the closely related ROC curve, which instead plots true
positive rate against false positive rate). However, these visualization
techniques require choosing one class as the "positive" case and
the other as the "negative" case, asymmetrically presenting information regarding
either class thereafter. For a classification problem where both classes are equals,
both in proportion of occurrence and in theoretical meaning, we benefit instead
from considering the precision and recall of each of the two classes separately.
Precision, denoted as prc0 and prc1 for classes 0 and 1, is defined as

prc0 = |{t̃ = 0 ∧ t = 0}cv| / |{t̃ = 0}cv| = class 0 precision

prc1 = |{t̃ = 1 ∧ t = 1}cv| / |{t̃ = 1}cv| = class 1 precision

Recall, denoted as rec0 and rec1 for classes 0 and 1, is defined as

rec0 = |{t̃ = 0 ∧ t = 0}cv| / |{t = 0}cv| = class 0 recall

rec1 = |{t̃ = 1 ∧ t = 1}cv| / |{t = 1}cv| = class 1 recall
4.1.3 Accuracy
Accuracy, denoted as acc, is defined as

acc = |{(t̃ = 0 ∧ t = 0) ∨ (t̃ = 1 ∧ t = 1)}cv| / |{all t}cv| = accuracy
It is simply the ratio of correctly classified cross validation samples. For prediction
problems with a class imbalance in the number of cases, accuracy by itself is not
useful, since we could be 0.9 accurate on a problem where 0.9 of the cross
validation samples are class 0 simply by predicting that everything is in class 0.
However, for our prediction problem of the sign of returns of a financial time
series, accuracy is the most important measurement of performance when
considering the practical value of applying the prediction solution in trading.
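Applied to the example matrix from Figure 1, these definitions compute directly:

```python
def metrics(pairs):
    """Accuracy and per-class precision/recall from (t, t_hat) pairs."""
    n = {(t, th): 0 for t in (0, 1) for th in (0, 1)}
    for t, th in pairs:
        n[(t, th)] += 1
    acc = (n[(0, 0)] + n[(1, 1)]) / len(pairs)
    # precision divides by the predicted-class column totals
    prc = {c: n[(c, c)] / (n[(0, c)] + n[(1, c)]) for c in (0, 1)}
    # recall divides by the true-class row totals
    rec = {c: n[(c, c)] / (n[(c, 0)] + n[(c, 1)]) for c in (0, 1)}
    return acc, prc, rec

# The example matrix from Figure 1: true 0 -> (3, 1); true 1 -> (2, 6)
pairs = [(0, 0)] * 3 + [(0, 1)] * 1 + [(1, 0)] * 2 + [(1, 1)] * 6
acc, prc, rec = metrics(pairs)
assert acc == 9 / 12
assert prc[0] == 3 / 5 and rec[0] == 3 / 4
assert prc[1] == 6 / 7 and rec[1] == 6 / 8
```
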
4.2 Sample Procedure
To gain an idea of which directions to pursue in order to better tackle the
project's prediction problem, I tried many different approaches and methods,
not all of which had a formal structure. For the benefit of having a
quantifiable procedure and results, I present one model conceived and designed in
its completeness, after exploring the available data in an unstructured way for
much time beforehand.
4.2.1 Best Committee of the Best XGBoost Models
The same procedure is done for each separate problem of q days future return
with a particular starting weekday. We sort a collection of XGBoost models
trained on our 5 sections of training data according to their average cross validation
accuracies. There are a total of 10 unique years of cross validation data, and 18
years of simulated freshly encountered cross validation data. This is enough cross
validation that the XGBoost hyperparameter search finds it hard to overfit to the
cross validation sets, as is notable from the sometimes less than 50% accuracy
scores on some of the validation samples. We try to combine these models to better fit
the span of cross validation sets. We make majority voting committees of size 3,
4, 5 (one model will have two votes in the committee of 4) over some choice of
all the XGBoost models and choose the one that performs best over all the cross
validation sets. The limited choice is mainly due to computational reasons, but if
we limit our search to only amongst the best n cross validated XGBoost models,
we decrease our risk of overfitting this committee combinatorial search to the cross
validation slices. We choose the 10 best cross validated XGBoost models in the
results section.
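The committee search can be sketched as follows; the models here are stubs with fixed predictions, and which member receives the double vote in a committee of 4 is an assumption of this sketch:

```python
from itertools import combinations

def committee_vote(models, s):
    """Majority vote of ±1 predictions; in a committee of 4 the first listed
    model gets two votes, so ties cannot occur."""
    votes = [m(s) for m in models]
    if len(models) == 4:
        votes.append(models[0](s))
    return 1 if sum(votes) > 0 else -1

def best_committee(models, size, cv_samples):
    """Try every committee of the given size drawn from the candidate models
    and keep the one with the highest cross validation accuracy."""
    def acc(committee):
        return sum(1 for s, t in cv_samples
                   if committee_vote(committee, s) == t) / len(cv_samples)
    return max(combinations(models, size), key=acc)

# Three stub "models": each predicts a fixed sign per sample index
preds = {"a": [1, 1, -1, 1], "b": [1, -1, -1, -1], "c": [1, 1, 1, 1]}
models = [lambda s, k=k: preds[k][s] for k in preds]
cv = list(enumerate([1, 1, -1, 1]))
best = best_committee(models, size=3, cv_samples=cv)
assert all(committee_vote(best, s) == t for s, t in cv)
```

In the thesis procedure the candidates are restricted to the 10 best cross validated XGBoost models, which keeps this combinatorial search small and less prone to overfitting the cross validation slices.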
4.2.2 XGBoost Specifications
The same collection of 608 XGBoost models for each separate problem, specified
by different input features and hyperparameters, is selected with care to have the
best hope of capturing the correct patterns for prediction, based on prior experience.
We use 4 particular feature sets: A(h = 1, m = 1, n = 30), I(k = 1, n = 30), I(k = 2, n = 15), A(h = 1, m = 10, n = 10). The XGBoost parameter grid for each feature set, with the parameter lists in the order booster, n_estimators, max_depth, learning_rate, subsample, col_subsample, gamma, is respectively:

A(h = 1, m = 1, n = 30): ["gbtree", "dart"] × [30, 100, 200, (early stop), (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.6, 0.84] × [0]

I(k = 1, n = 30): ["gbtree", "dart"] × [30, 100, 200, (early stop), (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.5, 0.7] × [0]

I(k = 2, n = 15): ["gbtree", "dart"] × [30, 100, (early stop)] × [3, 4, 5, 6] × [0.1, 0.5] × [0.75] × [0.1, 0.3, 0.6] × [0]

A(h = 1, m = 10, n = 10): ["gbtree", "dart"] × [[10] × [4], [10] × [5], [10] × [6], [30] × [3], [30] × [3], [30] × [4], [30] × [5], [100] × [2], [100] × [3], [(early stop)] × [2], [(early stop)] × [3], [(early stop)] × [4], [(early stop)] × [5]] × [0.1, 0.5] × [0.75] × [0.05, 0.15, 0.5] × [0]

Each model is given a randomly and distinctly chosen random number generator starting seed.
Figure 2: Committee predictions for one day returns (0 with −)

last Fri to Mon
t\t̃     +   0−   All
+      38   12    50
0−     30   18    48
All    68   30    98
acc = 57, prc+ = 56, prc− = 60, rec+ = 76, rec− = 38 (all in %)

Mon to Tue
t\t̃     +   0−   All
+      26   27    53
0−     13   32    45
All    39   59    98
acc = 59, prc+ = 67, prc− = 54, rec+ = 49, rec− = 71 (all in %)

Tue to Wed
t\t̃     +   0−   All
+       9   41    50
0−     12   36    48
All    21   77    98
acc = 46, prc+ = 43, prc− = 47, rec+ = 18, rec− = 75 (all in %)

Wed to Thu
t\t̃     +   0−   All
+      17   27    44
0−     23   31    54
All    40   58    98
acc = 49, prc+ = 42, prc− = 53, rec+ = 39, rec− = 57 (all in %)

Thu to Fri
t\t̃     +   0−   All
+       3   29    32
0−      2   64    66
All     5   93    98
acc = 68, prc+ = 60, prc− = 69, rec+ = 9, rec− = 97 (all in %)

1 day returns
t\t̃     +   0−   All
+      93  136   229
0−     80  181   261
All   173  317   490
acc = 56, prc+ = 54, prc− = 57, rec+ = 41, rec− = 69 (all in %)
4.3 Sample Test Results
The test results are presented in Figures 2 through 8.
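The accuracy, precision, and recall values beneath each confusion matrix follow the standard definitions, with rows as the true class t and columns as the predicted class t̃. A minimal sketch (the function name and argument order are my own) reproduces the first panel of Figure 2: binary_metrics(38, 12, 30, 18) gives acc = 57, prc+ = 56, prc− = 60, rec+ = 76, rec− = 38 after rounding to whole percents.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy and per-class precision/recall from the four cells of a
    2x2 confusion matrix (rows: true class, columns: predicted class).
    tp: true +, predicted +;  fn: true +, predicted -;
    fp: true -, predicted +;  tn: true -, predicted -."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    prc_pos = tp / (tp + fp)   # precision of the + predictions (column total)
    prc_neg = tn / (tn + fn)   # precision of the - predictions
    rec_pos = tp / (tp + fn)   # recall of the true + cases (row total)
    rec_neg = tn / (tn + fp)   # recall of the true - cases
    return acc, prc_pos, prc_neg, rec_pos, rec_neg
```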
4.4 Interpretation
When a particular subproblem (a particular choice of starting and ending weekday for the prediction period) shows a testing accuracy below 55%, it essentially means that the committee model that performed best in cross validation failed to generalize appropriately from the data preceding the test set. We tried quite a number of hyperparameter states and extracted a wide range of input features, which, together with the consistent 60%+ prediction accuracies on some of the other subproblems, makes it more likely that in these cases the patterns of the previous years simply did not carry over to the most recent two years.
The relatively consistent, strong predictability of Friday's price direction from Thursday makes sense: at the end of the week traders want to close their positions and so may act more predictably. However, the prediction skews significantly towards the more likely direction, and the low precision scores indicate that this does not necessarily yield better discrimination of future behaviour. The growing consistency of good predictions as the future window lengthens shows that
Figure 3: Committee predictions for one day returns (0 with +)

last Fri to Mon
t\t̃    0+    −  All
0+     56    4   60
−      29    9   38
All    85   13   98
acc = 66, prc+ = 66, prc− = 69, rec+ = 93, rec− = 24 (all in %)

Mon to Tue
t\t̃    0+    −  All
0+     25   30   55
−      12   31   43
All    37   61   98
acc = 57, prc+ = 68, prc− = 51, rec+ = 45, rec− = 72 (all in %)

Tue to Wed
t\t̃    0+    −  All
0+     11   39   50
−      11   37   48
All    22   76   98
acc = 49, prc+ = 50, prc− = 49, rec+ = 22, rec− = 77 (all in %)

Wed to Thu
t\t̃    0+    −  All
0+     22   24   46
−      29   23   52
All    51   47   98
acc = 46, prc+ = 43, prc− = 49, rec+ = 48, rec− = 44 (all in %)

Thu to Fri
t\t̃    0+    −  All
0+     10   26   36
−       8   54   62
All    18   80   98
acc = 65, prc+ = 56, prc− = 68, rec+ = 28, rec− = 87 (all in %)

1 day returns
t\t̃    0+    −  All
0+    124  123  247
−      89  154  243
All   213  277  490
acc = 57, prc+ = 58, prc− = 56, rec+ = 50, rec− = 63 (all in %)
Figure 4: Committee predictions for two day returns (0 with +)

last Thu to Mon
t\t̃    0+    −  All
0+     26   12   38
−      42   18   60
All    68   30   98
acc = 45, prc+ = 38, prc− = 60, rec+ = 68, rec− = 30 (all in %)

last Fri to Tue
t\t̃    0+    −  All
0+     47   11   58
−      23   17   40
All    70   28   98
acc = 65, prc+ = 67, prc− = 61, rec+ = 81, rec− = 42 (all in %)

Mon to Wed
t\t̃    0+    −  All
0+     26   28   54
−      16   28   44
All    42   56   98
acc = 55, prc+ = 62, prc− = 50, rec+ = 48, rec− = 64 (all in %)

Tue to Thu
t\t̃    0+    −  All
0+     15   36   51
−      14   33   47
All    29   69   98
acc = 49, prc+ = 52, prc− = 48, rec+ = 29, rec− = 70 (all in %)

Wed to Fri
t\t̃    0+    −  All
0+      2   33   35
−      11   52   63
All    13   85   98
acc = 55, prc+ = 15, prc− = 61, rec+ = 6, rec− = 83 (all in %)

2 days returns
t\t̃    0+    −  All
0+    116  120  236
−     106  148  254
All   222  268  490
acc = 54, prc+ = 52, prc− = 55, rec+ = 49, rec− = 58 (all in %)
Figure 5: Committee predictions for three day returns (0 with +)

last Wed to Mon
t\t̃    0+    −  All
0+     24   18   42
−      32   24   56
All    56   42   98
acc = 49, prc+ = 43, prc− = 57, rec+ = 57, rec− = 43 (all in %)

last Thu to Tue
t\t̃    0+    −  All
0+     29   13   42
−      28   28   56
All    57   41   98
acc = 58, prc+ = 51, prc− = 68, rec+ = 69, rec− = 50 (all in %)

last Fri to Wed
t\t̃    0+    −  All
0+     42   12   54
−      20   24   44
All    62   36   98
acc = 67, prc+ = 68, prc− = 67, rec+ = 78, rec− = 55 (all in %)

Mon to Thu
t\t̃    0+    −  All
0+     26   25   51
−      17   30   47
All    43   55   98
acc = 57, prc+ = 60, prc− = 55, rec+ = 51, rec− = 64 (all in %)

Tue to Fri
t\t̃    0+    −  All
0+      6   30   36
−       5   57   62
All    11   87   98
acc = 64, prc+ = 55, prc− = 66, rec+ = 17, rec− = 92 (all in %)

3 days returns
t\t̃    0+    −  All
0+    127   98  225
−     102  163  265
All   229  261  490
acc = 59, prc+ = 55, prc− = 62, rec+ = 56, rec− = 62 (all in %)
Figure 6: Committee predictions for four day returns (0 with +)

last Tue to Mon
t\t̃    0+    −  All
0+     21   22   43
−      19   36   55
All    40   58   98
acc = 58, prc+ = 52, prc− = 62, rec+ = 49, rec− = 65 (all in %)

last Wed to Tue
t\t̃    0+    −  All
0+     31   14   45
−      32   21   53
All    63   35   98
acc = 53, prc+ = 49, prc− = 60, rec+ = 69, rec− = 40 (all in %)

last Thu to Wed
t\t̃    0+    −  All
0+     36   11   47
−      22   29   51
All    58   40   98
acc = 66, prc+ = 62, prc− = 72, rec+ = 77, rec− = 57 (all in %)

last Fri to Thu
t\t̃    0+    −  All
0+     43   12   55
−      20   23   43
All    63   35   98
acc = 67, prc+ = 68, prc− = 66, rec+ = 78, rec− = 53 (all in %)

Mon to Fri
t\t̃    0+    −  All
0+     13   27   40
−      11   47   58
All    24   74   98
acc = 61, prc+ = 54, prc− = 64, rec+ = 32, rec− = 81 (all in %)

4 days returns
t\t̃    0+    −  All
0+    144   86  230
−     104  156  260
All   248  242  490
acc = 61, prc+ = 58, prc− = 64, rec+ = 63, rec− = 60 (all in %)
Figure 7: Committee predictions for five day returns (0 with +)

last Mon to Mon
t\t̃    0+    −  All
0+     33   14   47
−      25   26   51
All    58   40   98
acc = 60, prc+ = 57, prc− = 65, rec+ = 70, rec− = 51 (all in %)

last Tue to Tue
t\t̃    0+    −  All
0+     27   20   47
−      18   33   51
All    45   53   98
acc = 61, prc+ = 60, prc− = 62, rec+ = 57, rec− = 65 (all in %)

last Wed to Wed
t\t̃    0+    −  All
0+     29   16   45
−      22   31   53
All    51   47   98
acc = 61, prc+ = 57, prc− = 66, rec+ = 64, rec− = 58 (all in %)

last Thu to Thu
t\t̃    0+    −  All
0+     32   14   46
−      19   33   52
All    51   47   98
acc = 66, prc+ = 63, prc− = 70, rec+ = 70, rec− = 63 (all in %)

last Fri to Fri
t\t̃    0+    −  All
0+     26   15   41
−      24   33   57
All    50   48   98
acc = 60, prc+ = 52, prc− = 69, rec+ = 63, rec− = 58 (all in %)

5 days returns
t\t̃    0+    −  All
0+    147   79  226
−     108  156  264
All   255  235  490
acc = 62, prc+ = 58, prc− = 66, rec+ = 65, rec− = 59 (all in %)
Figure 8: Committee predictions for six day returns (0 with +)

last last Fri to Mon
t\t̃    0+    −  All
0+     39   12   51
−      24   23   47
All    63   35   98
acc = 63, prc+ = 62, prc− = 66, rec+ = 76, rec− = 49 (all in %)

last Mon to Tue
t\t̃    0+    −  All
0+     30   22   52
−      17   29   46
All    47   51   98
acc = 60, prc+ = 64, prc− = 57, rec+ = 58, rec− = 63 (all in %)

last Tue to Wed
t\t̃    0+    −  All
0+     24   21   45
−      14   39   53
All    38   60   98
acc = 64, prc+ = 63, prc− = 65, rec+ = 53, rec− = 74 (all in %)

last Wed to Thu
t\t̃    0+    −  All
0+     28   18   46
−      21   31   52
All    49   49   98
acc = 60, prc+ = 57, prc− = 63, rec+ = 61, rec− = 60 (all in %)

last Thu to Fri
t\t̃    0+    −  All
0+     19   20   39
−      15   44   59
All    34   64   98
acc = 64, prc+ = 56, prc− = 69, rec+ = 49, rec− = 75 (all in %)

6 days returns
t\t̃    0+    −  All
0+    140   93  233
−      91  166  257
All   231  259  490
acc = 62, prc+ = 61, prc− = 64, rec+ = 60, rec− = 65 (all in %)
high-frequency price signals are more difficult to find algorithmically than lower-frequency signals.
5 Discussion
5.1 Discovery Process
This section recounts the process of arriving at the final model.
To begin, the time series in our work did not have its missing weekdays filled in, and the prediction problem was not separated by starting weekday. For each q = 1 . . . 5, I sought to find how well I could predict the q days return, or just its sign, regardless of what the current day is. Holidays and other days when prices were not recorded did not count as days in the time series, and the data for the time series only went back to 2004.
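The sign targets for this early setup can be sketched as follows, counting only recorded trading days; the helper and the price series in the usage example are illustrative, not the thesis code.

```python
import math

def qday_return_signs(prices, q):
    """Sign of the q-day log return log(p[t+q] / p[t]) over a series of
    recorded trading days (missing days are simply absent, not filled)."""
    signs = []
    for t in range(len(prices) - q):
        r = math.log(prices[t + q] / prices[t])
        signs.append(1 if r > 0 else -1 if r < 0 else 0)
    return signs
```

For example, qday_return_signs([10, 11, 9, 9, 12], 1) yields [1, -1, 0, 1]: a flat day is labelled 0 here, which is why the result tables must group 0 with either + or −.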
I tried simpler models, multilayer perceptrons and XGBoost trained on, for example, just the log daily returns over a 30 day past time window, and they could not predict the future day's return, or the sign of the return, better than chance. Slightly more hopeful results came from trying to predict the sign of the 3 days return, so I used this prediction task to search for models with potential.
I came up with the idea of using the A(h = 1,m = n) features to better capture the information hidden in the slopes of price movements. n = 12 was as high as was computationally feasible, and n = 6 led to faster training times, so I ran parameter searches for XGBoost using these features. Through the parameter searching, I gained an idea of which XGBoost learning hyperparameters mattered and of the optimal value ranges for those that did. I also found that using the prices of both VIX and SPX gave the best results, better than including additional features from SPX volume data or taking only one of the price time series. I was able to attain about 60% accuracy with fairly balanced precision and recall on q = 3, even attaining 59.5% accuracy on a test set of roughly the latest 2.5 years.
After failing to improve the cross validation accuracy despite much time spent hyperparameter searching, I considered how differing weekdays might systematically affect the behaviour of price movements. I filled in missing values with the last available price and separated the prediction windows by their starting weekday. Immediately the cross validation scores jumped to about 70%. I did more parameter searching and tried majority vote committee ensembles of the XGBoost models, which also improved the cross validation accuracies significantly, passing 75%.
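The forward-fill step can be sketched without any library dependence as follows; the `quotes` mapping and its dates are illustrative assumptions, not the thesis data.

```python
from datetime import date, timedelta

def fill_weekdays(quotes):
    """Forward-fill a {date: price} mapping so that every weekday between
    the first and last recorded date carries the last available price.
    Weekends are skipped; holidays become copies of the prior close."""
    days = sorted(quotes)
    filled, last = {}, None
    d = days[0]
    while d <= days[-1]:
        if d.weekday() < 5:                # Monday..Friday only
            last = quotes.get(d, last)     # keep prior price if missing
            filled[d] = last
        d += timedelta(days=1)
    return filled
```

With every weekday present, the prediction windows can then be grouped by starting weekday, since each calendar weekday now has a well-defined price.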
However, now came the problem of overfitting. It turned out that I had searched too many hyperparameters over too small a cross validation set, as the test scores were erratic and no better than chance. So I added more cross validation and, after some tuning, arrived at the model included here. I also tried another method of selecting the committee candidates: take the top n XGBoost models, where n is any odd number. This performed significantly worse on the final test set than the current committee selection method.
5.2 Limitations and Extensions
I note that I looked at the results of some of my previous models on the current test set multiple times before arriving at my current model. This should not significantly degrade the predictive power implied by the modeling process presented here, as I did not systematically mine the test set for patterns.
The histograms of the returns on successfully predicted days do not differ in any visible way from the histograms of the returns on unsuccessfully predicted days. Applying these results to trading will require an overhaul of the current cross validation model, to gain more certainty that the predictions are reliable, before they can be used in backtesting for possible trading profits.
References
[1] Tianqi Chen and Carlos Guestrin. "XGBoost: A scalable tree boosting system". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785–794.
[2] Chicago Board Options Exchange. "The CBOE Volatility Index – VIX". In: White Paper (2009), pp. 1–23.