forecasting high yield corporate bond industry excess return1185828/fulltext01.pdf · 96 di erent...

IN DEGREE PROJECT MATHEMATICS,SECOND CYCLE, 30 CREDITS

, STOCKHOLM SWEDEN 2018

Forecasting High Yield Corporate Bond Industry Excess Return

CARLOS JUNIOR LOPEZ VYDRIN

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ENGINEERING SCIENCES

Forecasting High Yield Corporate Bond Industry Excess Return An application of unsupervised hierarchical time series clustering and unbiased Random Forest regression using the GUIDE method

CARLOS JUNIOR LOPEZ VYDRIN Degree Projects in Financial Mathematics (30 ECTS credits) Degree Programme in Industrial Engineering and Management KTH Royal Institute of Technology year 2018 Supervisor at SEB Investment Management: Thomas Kristiansson Supervisor at KTH: Timo Koski Examiner at KTH: Timo Koski

TRITA-SCI-GRU 2018:018 MAT-E 2018:06 Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Abstract

In this thesis, we apply unsupervised and supervised statistical learning methods on thehigh-yield corporate bond market with the goal of predicting its future excess return.We analyse the excess return of industry based indices of high-yield corporate bondsbelonging to the Chemical, Metals, Paper, Building Materials, Packaging, Telecom, andElectric Utility industry. To predict the excess return of these high-yield corporate bondindustry indices we utilised externally given market-observable financial time series from96 different asset and indices that we believe to be of predictive value for the excess return.These input time series covers assets and indices of major equity indices, corporate creditspreads, FX-currencies, stock-, bond-, and FX volatility, swap rates, swap spreads, certaincommodities, and macro economic surprise indices. After pre-processing the input data wearrive at 154 predictors that are used in a two-phase implementation procedure consistingof an unsupervised time series Agglomerative Hierarchical clustering and a supervisedRandom Forest regression model. We use the Hierarchical time series clustering and theRandom Forest unbiased variable importance estimates as means to reduce our inputpredictor space to the ten most influential predictor variables for each industry. These tenmost influential predictors are then used in a Random Forest regression model to predict[1, 3, 5, 10] day future cumulative excess return. To accommodate for the characteristics ofsequential time series data we also apply a sliding window method to the input predictorsand the response variable in our Random Forest model.

The results show that excess returns in the various industries under study are pre-dictable using Random Forest regression with our market-observable input data. Theout-of-sample coefficient of determination R2

out is in majority of the cases statistically sig-nificant at 0.01 level. The predictability varies across the industries and is in some casesdependent on whether we apply the sliding window method or not. Furthermore, applyingthe sliding window method on the predictors and the response variable showed in majorityof the cases statistically significant improvements in the mean-squared prediction error.The variable importance estimates from such models show that the excess return timeseries exhibit some degree of autocorrelation.

Sammanfattning

I denna uppsats applicerar vi oovervakade och overvakade statistiska inlarningsmetoderpa marknaden for hogavkastande foretagsobligationer med malet om att predicera dessframtida overavkastning. Vi analyserar overavkastningen for industribaserade index avforetagsobligationer tillhorande Kemi-, Metal-, Pappers-, Byggnadsmaterial-, Paketerings-, Telekom-, och Kraftforsorjningsindustrin. For att predicera overavkastningen i dessahogavkastande foretagsobligations-industriindex anvander vi externa marknadsobservera-nde finansiella tidserier fran 96 olika tillgangsklasser och index som vi tror ar av prediktivvarde for overavkastningen. Dessa input tidsserier tacker tillgangar och index for valkandaaktieindex, foretagskredit-spreadar, valutor, aktie-, obligations, och valutavolatilitet, swaprantor, swap spreadar, vissa ravaror, och makroekonomiska overraskningsindex. Efter attha processat inputsdata har vi 154 olika prediktorer som anvands i en tva-fas implementer-ingsprocedur bestaende av en oovervakad tidserie Agglomerativ Hierarkisk klustering ochen overvakad Random Forest regressionsmodel. Vi anvander den hierarkiska klusterin-gen och Random Forest fordomsfria variabel viktighetsuppskattningar som medel for attreducera inputsvariabelrummet till de tio mest influerande prediktorvariablerna for varjeindustri. Dessa tio mest influerande prediktorer anvandes sedan i en Random Forest re-gressionsmodel for att predicera [1, 3, 5, 10] dagars kumulativ overavkastning. For atttillgodose for de egenskaper som sekventiell tidserie data uppvisar applicerar vi en Slid-ing Window metod pa inputsprediktorerna och pa sjalva responsvariabeln i var RandomForest model.

Resultaten visar att overavkastningen i de olika industrierna som vi studerar ar forutspa-bar nar man anvander Random Forest regression med var marknadsobserverande inputs-data. Ur-samplings determinationskoefficienten R2

out visar i de flesta fall pa statistisksignifikans pa en 0.01 niva. Forutsagbarheten varierar over de olika industrierna ochar i vissa fall beroende pa om Sliding Window metoden ar applicerad eller inte. Vi-dare visar resultaten att applicerandet av Sliding Window metoden pa prediktorerna ochresponsvariabeln uppvisar i de flesta fallen pa statistisk signifikant forbattring av demmedel-kvadrerade prediktionsfelet. Variabel viktighetsuppskattningarna fran dessa mod-eller visar att overavkastning tidserierna uppvisar en viss grad autokorrelation.

Acknowledgements

I would like to thank Timo Koski, my supervisor at KTH Royal Institute of Technology,for supervising my work and providing valuable guidance. Furthermore, I would like tothank Fredrik Niveman and his colleagues at SEB Investment Management for their valu-able input and support throughout this thesis work.

Chimbote Peru, January 2018

Carlos Junior Lopez Vydrin

“We may regard the present state of the universe as the effect of its past and the cause ofits future. An intellect which at a certain moment would know all forces that set naturein motion, and all positions of all items of which nature is composed, if this intellect werealso vast enough to submit these data to analysis, it would embrace in a single formulathe movements of the greatest bodies of the universe and those of the tiniest atom; forsuch an intellect nothing would be uncertain and the future just like the past would bepresent before its eyes.”

– P.-S. Laplace, Essai philosophique sur les probabilites, 1814

i

Contents

1 Introduction 3

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 This thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Previous related work 7

2.1 Statistical learning and financial time series . . . . . . . . . . . . . . . . . . 7

2.2 Prediction of corporate bond excess returns . . . . . . . . . . . . . . . . . . 9

3 Theoretical framework 11

3.1 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1.1 Clustering overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1.2 Similarity and dissimilarity measures . . . . . . . . . . . . . . . . . . 13

3.1.3 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1.3.1 Linkage criterion . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1.3.2 Cophenetic correlation coefficient . . . . . . . . . . . . . . . 17

3.1.3.3 Dendrogram . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.1.3.4 Inconsistency coefficient . . . . . . . . . . . . . . . . . . . . 20

3.1.4 Evaluating and assessing clustering quality . . . . . . . . . . . . . . 22

3.1.5 Timeseries clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.5.1 Whole time series clustering . . . . . . . . . . . . . . . . . 25

3.1.5.2 Representations . . . . . . . . . . . . . . . . . . . . . . . . 27

3.1.5.3 Similarity & dissimilarity measures in time series clustering 28

3.1.5.4 Time series cluster prototypes . . . . . . . . . . . . . . . . 32

3.1.5.5 Time series clustering algorithms . . . . . . . . . . . . . . . 33

3.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.1 Ensemble methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.2 Regression with Random forest . . . . . . . . . . . . . . . . . . . . . 35

3.2.2.1 Base learner: Classification and Regression Tree (CART) . 35

3.2.2.2 Random Forest algorithm . . . . . . . . . . . . . . . . . . . 38

3.2.2.3 Out-of-Bag observations, error estimation, and variableimportance measure . . . . . . . . . . . . . . . . . . . . . . 41

3.2.2.4 Unbiased variable importance estimates using GUIDE pro-cedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2.2.5 Handling missing data with surrogate split predictors . . . 44

3.2.2.6 Prediction intervals using quantile regression forest . . . . 45

3.2.3 Supervised learning with sequential time-series data . . . . . . . . . 47

3.2.3.1 Sliding window method . . . . . . . . . . . . . . . . . . . . 47

iii

SF291X Degree Project in Financial Mathematics Carlos Junior Lopez Vydrin

4 Methodology 494.1 Data processing and transformation . . . . . . . . . . . . . . . . . . . . . . 49

4.1.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.1.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1.3 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.1.4 Initial predictor variables . . . . . . . . . . . . . . . . . . . . . . . . 534.1.5 Dealing with missing data points . . . . . . . . . . . . . . . . . . . . 54

4.2 Two-phase implementation procedure . . . . . . . . . . . . . . . . . . . . . 554.2.1 Unsupervised time series clustering . . . . . . . . . . . . . . . . . . . 55

4.2.1.1 Implementation procedure . . . . . . . . . . . . . . . . . . 554.2.1.2 Selecting distance metric . . . . . . . . . . . . . . . . . . . 564.2.1.3 Selecting Linkage Criterion . . . . . . . . . . . . . . . . . . 564.2.1.4 Finding the natural clusters in the input data . . . . . . . 564.2.1.5 Evaluating and assessing our final clustering . . . . . . . . 574.2.1.6 Selecting predictor variables for our regression model . . . 57

4.2.2 Random Forest regression . . . . . . . . . . . . . . . . . . . . . . . . 584.2.2.1 Implementation procedure . . . . . . . . . . . . . . . . . . 584.2.2.2 Implementation settings for the Random Forest algorithm . 584.2.2.3 Forecasting days and the response variable . . . . . . . . . 594.2.2.4 Evaluating model performance . . . . . . . . . . . . . . . . 594.2.2.5 Unbiased selection of the 10 most influential predictors . . 604.2.2.6 Extending the regression analysis with sliding window . . . 61

4.2.3 Testing for model significance . . . . . . . . . . . . . . . . . . . . . . 61

5 Results 635.1 Unsupervised time series clustering . . . . . . . . . . . . . . . . . . . . . . . 63

5.1.1 Clustering results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635.1.2 Clustering procedure results . . . . . . . . . . . . . . . . . . . . . . . 63

5.2 Random Forest implementation . . . . . . . . . . . . . . . . . . . . . . . . . 665.2.1 Random Forest’s ability to handle several missing values . . . . . . . 665.2.2 Random Forest regression results . . . . . . . . . . . . . . . . . . . . 685.2.3 Plots: OOB Predictions and Prediction Intervals . . . . . . . . . . . 715.2.4 Unbiased variable importance estimates . . . . . . . . . . . . . . . . 74

5.3 Result Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Bibliography 87

iv

List of Figures

3.1 Illustration of Single-, Complete, and Average linkage . . . . . . . . . . . . 173.2 Dendrogram illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.3 Example: Hierarchical grouping of objects . . . . . . . . . . . . . . . . . . . 203.4 Dendrogram illustration of consistent and inconsistent links . . . . . . . . . 213.5 Time series clustering taxonomy . . . . . . . . . . . . . . . . . . . . . . . . 253.6 Time series clustering approaches . . . . . . . . . . . . . . . . . . . . . . . . 263.7 Four components of whole time series clustering . . . . . . . . . . . . . . . . 273.8 Distance measure approaches in the literature . . . . . . . . . . . . . . . . . 293.9 Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.10 Clustering algorithm approaches . . . . . . . . . . . . . . . . . . . . . . . . 333.11 CART Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.12 Random Forest illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1 Dendrogram result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645.2 Calinski-Harabasz and Davies-Bouldin index plots . . . . . . . . . . . . . . 655.3 Chemical industry out-of-bag 5 day excess return prediction . . . . . . . . . 675.4 Packaging industry out-of-bag 5 day excess return prediction . . . . . . . . 675.5 Chemical OOB-Prediction and 90% Percentile Prediction Intervals . . . . . 715.6 Metals OOB-Prediction and 90% Percentile Prediction Intervals . . . . . . . 715.7 Paper OOB-Prediction and 90% Percentile Prediction Intervals . . . . . . . 725.8 Building Materials OOB-Prediction and 90% Percentile Prediction Intervals 725.9 Packaging OOB-Prediction and 90% Percentile Prediction Intervals . . . . . 735.10 Telecom OOB-Prediction and 90% Percentile Prediction Intervals . . . . . . 735.11 Electric Utility OOB-Prediction and 90% Percentile Prediction Intervals . . 745.12 Chemical Industry Variable Importance Estimates . . . . . . . . . . . . . . 755.13 Metals Industry Variable Importance Estimates . . . . . . . . . . . . . . . . 765.14 Paper Industry Variable Importance Estimates . . . . . . . . . . . . . . . . 775.15 Packaging Industry Variable Importance Estimates . . . . . . . . . . . . . . 785.16 Building Materials Industry Variable Importance Estimates . . . . . . . . . 795.17 Telecom Industry Variable Importance Estimates . . . . . . . . . . . . . . . 805.18 Electric Utility Industry Variable Importance Estimates . . . . . . . . . . . 81

v

List of Tables

4.1 High yield excess return time series data by industry . . . . . . . . . . . . . 504.2 Initial input time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.3 Transformation of swap rates . . . . . . . . . . . . . . . . . . . . . . . . . . 524.4 Processed input time series . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.5 Common evaluation measures for binary classifiers . . . . . . . . . . . . . . 60

5.1 Clustering result using Agglomerative Hierarchical clustering . . . . . . . . 635.2 Cophenetic correlation coefficient values . . . . . . . . . . . . . . . . . . . . 645.3 Random Forest regression results . . . . . . . . . . . . . . . . . . . . . . . . 70

1

Chapter 1

Introduction

1.1 Background

The technological development over the past three decades has paved way for the applica-tion of analysis of massive data sets and computer intensive statistical learning methods inmany industries. With greater performance than conventional statistical methods, modernstatistical learning methods have had success in new as well as old areas of applicationsyielding greater insight into new as well as old problems. An early adopter of the saidmodern statistical methods has been the finance industry where statistical learning, alsoknown as Machine learning, has proven to be a major disruptive force that has and iscontinuing to reshape the finance industry landscape, most notably the financial markets.Quantitative methods for asset allocation and price prediction have been used for a longtime and is becoming ever more prevalent in the marketplace. The US investment bankMorgan Stanley estimated that 84% of all trading on US stock market in 2011 were doneby algorithmic (automated) programs ([54]). An in-depth article by the Wall Street Jour-nal titled ”The quants run Wall Street now”[58] revealed that quantitative hedge fundswere responsible for 27% of all US stock trades in 2017, up from 14% in 2013. This trend isnot limited to stocks, other major liquid asset classes like FX-currencies, bonds and ratesare experiencing similar developments where modern statistical methods are employed forprice prediction and algorithmic trading. The pace of this development, however, has notbeen the same in all asset classes. There are some asset classes where traditional quali-tative methods such as fundamental analysis are still preferred over quantitative methodsfor valuation and prediction of asset prices.

One such asset class is high-yield corporate bonds, i.e. riskier market-tradeable corpo-rate debt. The greater level of risk and uncertainty associated with high-yield corporatebonds warrants a careful qualitative approach, where focus is on fundamental cash flowanalysis and corporate evaluation, over more quantitative methods revolving around statis-tics. This preference for qualitative analysis in the high-yield space is one of the reasonsthat much of high-yield bond trading is still done in an traditional non-algorithmic wayand “over-the-counter”. Considering the success of modern statistical learning methodsin other areas of financial markets, applying modern statistical methods on the high-yieldbond market may perhaps yield greater insight into an otherwise still traditional part ofmarkets. This is what this thesis aims to do.

3


1.2 Problem Formulation

There is extensive research done on forecasting financial time series using sophisticatedstatistical learning methods. However, much of that research are focused on the topicof stock market prediction, with only a few studies covering other asset classes such ascurrencies, commodities, and government bond futures. While there is research done onpredicting returns in the corporate bond market, that research uses classical statisticalregression. There is to date, to the best of my knowledge, no research done on pre-dicting performance of corporate bonds, let alone on high-yield corporate bonds, usingmore modern statistical learning methods. This paper seeks to fill some of that researchgap by applying more modern non-conventional statistical learning methods on high-yieldcorporate bond market to predict future excess returns.

1.3 This thesis

This thesis seeks to apply non-conventional statistical learning methods on high yieldcorporate bond market with the goal to predict future performance measured in terms ofexcess return, and shed light on the drivers behind that excess return. The excess returnof a bond is defined as the excess return over the local risk-free rate, which in many casesis taken to be the local sovereign debt yield. In this thesis we are studying the excessreturn of seven high yield corporate bond industry indices, where each index represents abasket of corporate bonds belonging to a certain industry. The seven industries understudy are

• Building Materials

• Chemical

• Electric Utility

• Metals

• Packaging

• Paper

• Telecom

A two-phase implementation procedure consisting of unsupervised and supervised learningmethods will be used to:

1. Identify the 10 most influential predictors of excess return in each high yield corpo-rate bond industry

2. Forecast the [1,3,5,10] day future cumulative excess return of the high yield corporatebond industry indices using these 10 most influential predictors.

In the first phase, an unsupervised Hierarchical clustering method will be employed tonaturally separating the data and selecting relevant predictors from a larger set of inputvariables. In the second phase, we use the selected predictors from phase one as input toa supervised learning method called Random Forest for regression, prediction and variableimportance estimation. The input variables that will be considered are externally givenmarket-observable prices of assets and indices that are believed to influence the HY in-dustry excess return. This encompasses major FX-currency pairs, equity- and commodityindices, swap rates, swap spreads, FX-, bond- and equity volatilities, as well as macroeconomic surprise indices. In total 96 different time series will be considered, which aftervarious filters and transformations will amount to 154 initial input predictor for our sta-tistical learning tools.

4


A key difference of the method employed here compared to the methods used in relatedresearch on corporate bond excess return is that we let our statistical learning tools de-cide which predictors to use in the regression model, as opposite to letting the researcher(human) subjectively pick the predictors. While this method can possibly shed light onthe important drivers of excess return, it is still limited to the universe of the 154 initialinput time series we put into the model. Furthermore, in the regression phase we usenon-conventional regression methods as opposite to the classical multivariate regressionmethods used in related research. As such, the method in this thesis should be viewed asan exercise in dimensionality reduction techniques, variable importance estimation, andprediction using non-conventional regression methods.

This thesis is organised as follows. Chapter 2 provides a comprehensive review of re-lated work and literature. Chapter 3 presents the theoretical framework of the relevantmethods that will be considered and employed. Chapter 4 gives a detail walk-through ofthe methodology used, including a detailed description of the two-phase implementationprocedure. Chapter 5 presents the results followed by discussion and concluding remarksin Chapter 6.

1.4 Delimitations

The model employed in this thesis is limited to the study of the excess return of seven highyield corporate bond industry indices, where each industry index corresponds to a basketof certain underlying corporate bonds belonging to a specific industry. Furthermore, thetime series data used in this thesis is of daily resolution and is all retrieved from theBloomberg terminal. The time series data used is limited to market-observable pricesof FX-currencies, swap rates & spreads, equity indices, commodity indices, credit spreadindices, bond-, stock-, and FX-volatility indices. Fundamental factors of the underlyingcorporate bonds, such as ratings, cash flow characteristics, and specific credit spread ofunderlying bonds are not considered in this thesis. The time frame for forecasts are[1, 3, 5, 10] days.

5

Chapter 2

Previous related work

2.1 Statistical learning and financial time series

There are several research papers published on the topic of predicting future prices ofdifferent asset classes using modern statistical learning methods. Most of the research,however, has solely been focused on assets like stocks, government bonds or currencies,and not specifically on the HY corporate bond market. The common factor among theseresearch papers is that more sophisticated statistical learning methods such as SupportVector Machines, Artificial Neural Networks (ANN), Random Forest and AdaBoost hasshown promising results which seriously challenge the notion of semi-efficient markets.

There are many successful applications of ANNs, papers dating back to 1990s shows thatANNs and variation of those, such as Recurrent ANN and Hybrid Kohonen Self-OrganizingMaps (SOM), can be very useful for time series modelling and forecasting stock price per-formance [60] [51] [2] [1] [23]. A more recent study on financial time series forecasting withmachine learning techniques by Krollner, Vanstone, and Finnie [26] confirmed that ANNbased techniques are useful for stock market forecasting, either as a standalone techniqueor as embedded techniques within hybrid systems. Mingyue, Cheng, and Yu [38] [49] dida similar analysis and found that a GA-ANN model, a hybrid between ANN and GeneticAlgorithms achieved a hit ratio of 86.4% when predicting the direction of Nikkei 225 stockindex.

A different learning method called Support Vector Machines (SVM), which utilisesoptimisation theory, has its share of research showing convincing performance when pre-dicting financial time series [50] [19] [49] [25]. According to a recent research review byJaramillo, Velasquez, and Franco [22] on Research in financial time series forecasting withSVM: Contribution from literature, SVMs are commonly accepted to perform more accu-rately than classical time series forecasting methods and other types of neural networks.For instance, Kim [25] demonstrated in his research on forecasting Korean stock financialtime series a prediction performance of up to 58% in terms of hit ratio when using supportvector machines to predict the future direction. In line with Jaramillo, Velasquez, andFranco [22], Tay and Cao [49] concluded that it is more advantageous to apply SVMsto forecast financial time series over standard back propagation neural networks. Whenpredicting prices of five real futures contracts on stocks and bonds Tay and Cao [49]shows that SVM outperforms standard BP neural network based on four different criteria,normalised mean square error, mean absolute error, directional symmetry, and weighteddirectional symmetry. A similar finding were concluded by Huang, Nakamori, and Wang[19], when comparing SVMs to other individual classification methods they found thatSVMs is superior to other methods such as Linear / Quadratic discriminant analysis and

7


Elman Backpropagation Neural Networks when considered standalone, however the bestforecasting performance was achieved when several different prediction models were com-bined with SVMs into a hybrid model. This concept of combing several prediction modelsor learners to achieve better prediction accuracy is the fundamental idea behind anothergroup of statistical learning methods called Ensemble methods.

Random Forest and Adaboost are two different algorithms of Ensemble methods, theyshare the idea of combining several learners (predictors) for a single forecast by consider-ing the prediction of each individual learner. As a result, the combined predictions of alllearners have a better accuracy than any learner (predictor) alone. Research on these twomethods applied on financial time series is quite recent and has shown interesting results.Much of the research include performance comparison between Ensemble methods andother learning methods, such as ANN, SVM and ARIMA models, that consistently showthat ensemble methods are outperforming the other methods [36] [27]. When forecastingNew York electricity market [36] experiment results showed Random Forest (RF) out-performing ANN and ARMA models, with mean absolute prediction error of 12.03% forRF vs ANNs 12.83% and ARIMAs 13.65%. Different studies on stock market predictionswith RF [5] [35] show prediction accuracy between 54.12% up to 76.5% when forecastingthe directional move of the stocks, depending on how the input parameters are prepro-cessed and transformed. As for the AdaBoost, Yutong and Zhao [59] studied how wellAdaBoost performed in predicting stock directions on the Shanghai stock exchange, theydemonstrated prediction accuracy results of 54.5% for the Adaboost algorithm.

There are an interesting and noteworthy extension of the standard RF method pro-posed by Booth, Gerding, and McGroarty [5], in their experiment they suggest an ensembleof such Random Forests that is generated in an online fashion. It is essentially an ensembleof ensembles with set number of maximum quantity of Random Forests that partake inthe overall prediction. As new data are registered over time the model trains new set ofRandom forests on the new data, the least performing Random forests in terms of pre-diction accuracy are discarded from the ensemble group so that the maximum number ofRFs are maintained. This online learning RF ensemble of ensemble method outperformedother methods including a single RF, ensembles of ANNs, Linear region ensembles andSupport vector machines ensembles. While ensembles of RFs show interesting results thisthesis will, in order to maintain the complexity of this thesis at a reasonable level, onlyconsider standard Random Forest procedure.

The research on the ensemble methods is convincing, in the light of the model comparisonsin the various research papers ensemble methods are outperforming other methods whenforecasting financial time series data. Additionally, the algorithm and notion behind theRandom Forest and AdaBoost are intuitively sound and easy to grasp, which makes iteasy for new practitioners to understand and apply. This makes algorithms such as theRandom forest suitable statistical learning methods to use in this thesis.

An important point to highlight in this research review is that the results from each ofthe studies are not directly comparable to each other in terms of prediction performance.Comparing prediction accuracy from one study to another is not a valid comparison as dif-ferent authors have chosen different markets to analyse, e.g. Korean vs. US stock market,and the various authors have applied different transformation to their time series data,making the input variables (predictors) to their statistical learning methods different ineach study. Perfectly highlighted in Green and Pearson [17], a statistical model can only

8


be as good as the effort and input the practitioner puts into the model design, includingthe selection of predictors.

2.2 Prediction of corporate bond excess returns

The research on predicting corporate bond returns are not as plentiful as for stock return,however, there are studies done on the subject worth highlighting. Kim, Li, and Zhang[24] examined the predictive power of credit default swaps (CDS) bond basis for futurecorporate bond returns, they found that the CDS-bond basis strongly negatively predictsfuture returns of corporate bonds and CDS, and that the predictive power of the basis ismore significant for bonds than CDS. Cai and Jiang [8] studied the volatility of corporatebond excess return and its correlation to contemporaneous corporate bond excess returns.They decomposed the bond volatility into market-, time-to-maturity-, and ratings volatil-ity, and found that the bond volatility and idiosyncratic risk are significant predictors ofcorporate three-month and six-month ahead investment grade bond excess returns. Ina more relevant study Lin, Wang, and Wu [29] investigated the monthly, quarterly, andyearly predictability of corporate bond excess return using data sample for the period fromyear 1973 to 2010. They found that forward rate factors, liquidity factors, and a bond’scredit spread have predictive power on corporate bond excess returns. Forward rate fac-tors in particular captures substantial variations in expected bond excess returns. Theyconcluded that corporate bond returns are more predictable than stock returns, and thatthe predictability tends to be higher for low-grade (high-yield) bonds and short-maturitybonds.

In comparison to previous related research on corporate bond excess returns, this thesiscontrasts in three following ways. Firstly, this thesis considers non-conventional statisticaltools, such as the Random Forest, as opposite to the standard conventional ordinary least-square (OLS) regression. Secondly, we use externally given market-observable prices ofa diverse set of asset classes and indices, in contrast to ”internally” derived fundamentalpredictors such as bond rating and cash flow characteristics. The reason for this is thatwe are focusing on studying HY industries, i.e. basket of corporate bonds belonging to acertain industry, instead of studying each individual bond. Thirdly, we let our statisticaltools decide which of the 100+ predictors to use in the final regression model, as oppositeto subjectively picking the input predictors to our model.

9

Chapter 3

Theoretical framework

This chapter covers the theoretical framework behind the methodology used in this thesis.In line with the two-phase implementation procedure this chapter is divided into two parts,unsupervised learning and supervised learning. The first part focuses on the exploratoryanalysis using unsupervised learning techniques in order finding relevant predictors forour regression model through time series clustering. The second part focuses on super-vised learning using non-conventional regression techniques, where the goal is to predictfuture excess return and to identify from a larger set of input variables the most relevantpredictors for that prediction task. Much of the theory, methods and models coveredhere are based on well-established literature in machine learning and artificial intelligence[4][18][21][41].

3.1 Unsupervised learning

Unsupervised learning is the task of inferring a function to describe hidden structures fromunlabelled data, where there is no classification or categorisation of the data. Consideringa set of features in a feature vector X = (x1, x2, . . . , xk), Xi ∈ Rn, where n can bevery large number. In supervised learning there is a response variable Y ∈ Rn, alsoknown as “a label”, that is associated with the feature vector X. In an unsupervisedsetting there is no such label on X, we do not have any observation of Y and thus welack a response variable that can supervise our analysis. Nevertheless, structures andrelationships of the input data can still be learned through unsupervised learning. Some ofthe central tasks of unsupervised learning is to identify hidden structures between the inputfeatures, discovering groups of similar examples within the data, project data from high-dimensional space down to two or three dimensions for the purpose of visualisation, as wellas determine the distribution of data within the input space. Examples of unsupervisedlearning methods are K-means, K-Nearest Neighbours, Self-organising Maps, PrincipalComponent Analysis and Hierarchical Clustering.

3.1.1 Clustering overview

Cluster analysis, or clustering, is the task of segmenting or grouping data into subsetsin such a way that objects in the same group (a cluster) are more closely related to oneanother than objects assigned to different groups (clusters). Clustering can be said to beof an exploratory nature where the goal is to discover natural groups and structure inthe data. The various clustering methods attempt to group the data based on a specificdefinition of similarity or dissimilarity measure. Hence, central to the goal of clusteranalysis is the measure of similarity or dissimilarity between the objects being clustered.

11


There are several clustering algorithms, and they differ on the type of input and outputof the cluster analysis. In regard to the input, one distinguishes between similarity- andfeature based clustering. The former requires a N ×N dissimilarity matrix, or a distancematrix D, while the latter accepts a N×D feature matrix as input. In addition to the twotypes of inputs, there are two possible types of output that can be attained from clusteranalysis, namely partitional clustering and hierarchical-clustering. In partitional clusteringthe objects are partitioned into disjoint sets, while in hierarchical clustering a nested treeof partitions (clusters) is created. The most prominent examples of clustering methods areconnectivity-based clustering, centroid -based clustering and distribution-based clustering.A brief overview of these methods is given below.

Connectivity based clustering

The core idea behind connectivity-based clustering, also know as hierarchical clustering,is that objects close to each other by some specific distance measure are more related toeach other than to objects further away. Clusters in the connectivity based approach areformed based on the distance between the objects. One of the advantages of these methodsare that they can provide an extensive hierarchy of the clusters represented in a so calleddendrogram, a full visual representation of the distances and hierarchical connectivitybetween the different clusters. The methods in the connectivity based approach differ inthe way distance between objects are measured and the linkage criteria used for mergingthe clusters. It is important to keep in mind that these methods will not produce anyunique partitioning of the data set, but rather a hierarchy of clusters, hence the commonsynonym Hierarchical clustering. The drawbacks of this method are that it is not veryrobust towards outliers, as in such cases the outliers will either show up as additionalclusters or cause other clusters to merge unnaturally. This is the method that will be usedin this thesis and will be covered more thoroughly in the subsequent section.

Centroid based clustering

In contrast to connectivity-based clustering, centroid-based clustering aims to partitionthe data set into a fixed number of clusters, providing a single partition of the data set asoutput. The most popular algorithm in this family of methods is the K-means clusteringproduces a clustering such that the sum of squared error between samples and the meanof their cluster is small [18].

Given a set of observations the algorithm starts with k initial random cluster centers.For each object in the data set a distance measure is calculated to each of the clustercenters. The objects are then assigned to its closest cluster and the procedure is repeated.In each iteration the cluster centers are recalculated using the average values of the ob-jects assigned to each cluster. The iterative procedure is usually terminated when therecalculated cluster centers are unchanged.

Distribution based clustering

Distribution-based clustering, on the other hand, focuses on the statistics behind the dataset. As the name suggests, this group of methods focuses on the underlying distribution ofthe objects in the data, where clusters can be defined as objects belonging most likely tothe same distribution. Well-known models of this type of method are the Gaussian mixturemodels, where the data-set is usually modelled with a fixed number of Gaussian distribu-tions that are initialised randomly and whose parameters are iteratively optimized to betterfit the data set. To obtain clustering, the objects are typically assigned to the Gaussian

12


distribution they most likely belong to, this is called hard clustering. Distribution-basedclustering can also output a so called soft clustering, where each object can belong toseveral clusters with a certain probability. While this type of methods are complex, theadvantage of a successful implementation is that these methods can capture correlationand dependence between different attributes. However, assuming a specific distributionbehind the data, such as the Gaussian distribution, is a rather strong assumption on thedata.

The most popular algorithm used to optimise the parameters in Gaussian Mixturemodels are the Expectation-Maximisation (EM-) algorithm, an iterative method to find themaximum likelihood or maximum a posteriori estimates of parameters in various statisticalmodels where the models depends on unobserved latent variables.

3.1.2 Similarity and dissimilarity measures

The similarity or dissimilarity measure is an integral part of clustering analysis. The choiceof a similarity measure will certainly affect the resulting clustering structure, and differentmeasures can possibly yield different clustering structures on the same data. Therefore, themeasure of similarity (or dissimilarity) should be chosen with the characteristics of the dataset in mind. In similarity based clustering the input to the algorithm is aNxN dissimilaritymatrix, or distance matrix D. Such dissimilarity matrix D should have dissimilarities(distances) d satisfying the following mathematical conditions for all x,y,z.

1. Non-negativity

d(x, y) ≥ 0

2. Identity of indiscernible

d(x, y) = 0⇔ x = y

3. Symmetry

d(x, y) = d(y, x)

4. Triangle inequality

d(x, y) ≤ d(x, y) + d(y, z)

In subjectively judged dissimilarities, “distances” are seldom distances in the strict sense,meaning the triangle inequality does often not hold. However, many of the cluster algo-rithms do not require a true distance matrix, meaning that dissimilarity measures thatfulfil all the above properties except the triangle inequality are acceptable. Hierarchicalclustering algorithm being one of the algorithms that do not require triangle inequality tohold. Furthermore, if a similarity matrix S is available, one can easily convert it to a dis-similarity matrix by applying any monotonically decreasing function, e.g., D =max(S)−S,this is particularly useful when considering correlation measures between time series.

Common dissimilarity functions

There are several distance metrics that can be used to create the dissimilarity matrix, themost common way to define dissimilarity between objects is in terms of the dissimilarityof their attributes

d(xi, xj) =n∑k=1

dk(xi,k, xj,k) (3.1)

Examples of dissimilarity functions (or distance metrics) between the vector xs and xt are:

13


• Euclidean distanced(xs, xt) =

√(xs − xt)(xs − xt)′ (3.2)

• Squared Euclidean distance

d(xs, xt) = (xs − xt)(xs − xt)′ (3.3)

where x′ is vector transpose of a row vector.

• Standardised Euclidean distance

d(xs, xt) =√

(xs − xt)V−1(xs − xt)′ (3.4)

where V−1 is the n-by-n diagonal matrix whose jth diagonal element is (S(j))2,where S is a vector of scaling factor for each dimension. Each coordinate differencebetween rows in X is scaled by dividing by the corresponding element of the standarddeviation S = std(X).

• Mahalanobis distance

d(xs, xt) =√

(xs − xt)C−1(xs − xt)′ (3.5)

where C is the covariance matrix.

• Minkowski distance

d(xs, xt) = p

√√√√ n∑j=1

|xs,j − xt,j |p (3.6)

Where for the special case of p = 1, the Minkowski metric gives the city blockdistance. For the special case of p = 2, the Minkowski metric give the Euclidiandistance. For the special case of p = +infinity, the Minkowski metric gives theChebychev distance.

• Cosine distance

d(xs, xt) = 1− xsx′t√

(xsx′s)(xtx

′t)

(3.7)

When working with time series data, or other real-valued vectors, it is common to usecorrelation coefficients such as Pearson’s or Spearman’s sample correlation coefficients tocalculate correlation distances.

• Pearson’s correlation distance

d(xs, xt) = 1− ρp

where ρp =(xs − xs)(xt − xt)

′√(xs − xs)(xs − xs)′

√(xt − xt)(xt − xt)′

(3.8)

is the Pearson’s sample correlation coefficient.

• Spearman’s correlation distance

d(xs, xt) = 1− ρs

where ρs =(rs − rs)(rt − rt)

′√(rs − rs)(rs − rs)′

√(rt − rt)(rt − rt)′

(3.9)

14


is the Spearman’s rank-order correlation coefficient. Here rs,j is the rank of xs,jtaken over x1,j , x2,j , . . . , xm,j . rs and rt are the coordinate-wise rank vectors of xsand xt, i.e. rs = (rs,1, rs,2, . . . , rs,1n). Furthermore, we have that

rs =1

n

∑j

= rs,j =(n+ 1)

2,

and rt =1

n

∑j

= rt,j =(n+ 1)

2.

For ordinal variables, such as [small, medium, big] it is a common practice to encodethe values as real-valued numbers, for example 1/3, 2/3 and 3/3, in order to apply anydissimilarity function for quantitative variables. For categorical variables such as (red,green, blue), one can use the Hamming distance

• Hamming distance

d(xs, xt) =#(xs,j 6= xt,j)

n(3.10)

where a distance of one is assigned to features that are different, and a distance ofzero if the features are the same. Summing up over all the categorical features givesthe Hamming distance.

3.1.3 Hierarchical clustering

Hierarchical clustering is the unsupervised learning method that will be used in this thesisfor exploratory analysis of relevant predictors. Hierarchical clustering is a similarity basedclustering method that seek to build a hierarchy of clusters based on a selected distancemetric and a linkage criterion for the merging of clusters. As the name suggests, themethod produce hierarchical representations in which the clusters at each level of thehierarchy are created by merging clusters at the next lower level. At the lowest level,each cluster contains a single observation. At the highest level, there is only one clustercontaining all of the data. The pair of clusters chosen for merging consist of the twoclusters with the smallest inter-group dissimilarity.

There are two different procedures for hierarchical clustering, agglomerative and divi-sive. The former is bottom-up approach where each observation starts in its own clusterand new bigger clusters are created by merging the smaller ones in a bottom-up fashionas one moves up in the hierarchy. The divisive procedure is the opposite, a top-downapproach where all observations start in one cluster, new smaller clusters are created in atop-down fashion by splitting the bigger clusters as one moves down the hierarchy. Theoutput from a hierarchical clustering is a dendrogram, a tree diagram that illustrates thedistance and hierarchical connectivity (arrangement) between the different clusters. Thealgorithm for agglomerative hierarchical clustering is presented in algorithm (1).

The algorithm takes as input a dissimilarity matrix D = {di,j}, where di,j is the dis-similarity of objects i and j. Based on a specific distance metric (covered in the previoussection) the algorithm merges clusters through a linkage criterion which determines thedissimilarity between sets of observation. The agglomerative algorithm initialises by set-ting all the n observations as unique clusters. The clusters are then compared amongthemselves through the linkage-method and the two clusters with the smallest dissimilar-ity are merged. After each merger the dissimilarity matrix D = di,j is updated with newdissimilarity values. The procedure is repeated until there are no more clusters available

15


Algorithm 1 Agglomerative hierarchical clustering, Murphy [41]

1: Input: Dissimilarity matrixD2: Output: Hierarchical cluster tree-structure, and dendrogram3: initialise clusters as singletons: for i ← 1 to n do Ci ← {i};4: initialise set of clusters available for merging: S ← {1, . . . , n};5: repeat6: Pick 2 most similar clusters to merge: (j, k)← arg maxj,k∈S dj,k;7: Create new cluster Cl ← Cj ∪ Ck;8: Mark j and k as unavailable: S ← S\{j, k};9: if Cl 6= {1, . . . , n} then

10: Mark l as available, S ← S ∪ {l};11: for each i ∈ S do12: Update dissimilarity matrix d(i, l);13: until no more clusters are available for merging;

for merging. A convenient feature of hierarchical clustering is that the distance metricused do not need to fulfil the triangle equality.

3.1.3.1 Linkage criterion

It is important to distinguish between the distance measure (metric) used and the linkagecriterion. The linkage criterion is how you quantify the dissimilarity between clusters, whilethe distance measure is the metric used. The linkage criterion determines the distancebetween sets of observations as a function of the pairwise distances between observations.There are several different linkage criteria, i.e. ways to quantify dissimilarity betweenthe clusters. The different choices will most certainly have an impact on the resultingclustering and can give quite different results [41]. The most common linkage criterionsare single-, complete- and average linkage.

Consider two distinct groups, denoted A and B, that each contain a number of ob-jects nA and nB respectively. The Single linkage clustering, also called nearest neighbourmethod, merges the clusters based on the shortest distance between objects within A andB. Proximity between two clusters is the proximity between their two closest objects

dSL(A,B) = mina∈A,b∈B

da,b. (3.11)

The complete linkage, also called farthest neighbour method, quantifies dissimilarity be-tween A and B as the maximum distance between the observation within the groups.Proximity between two clusters is the proximity between their two most distant objects

dCL(A,B) = maxa∈A,b∈B

da,b. (3.12)

Average linkage is a mixture between the single and complete linkage, it considers theaverage distance between all the observations in both groups. Proximity between twoclusters is the arithmetic mean of all the proximities between the objects of one cluster,on one side, and the objects of another cluster, on the other side,

davg(A,B) =1

nAnB

∑a∈A

∑b∈B

da,b. (3.13)

The single-, complete, and average linkages can be used for both quantitative and quali-tative variables. These linkages are illustrated in figure 3.1 [41].

16


Figure 3.1: Illustration of (a) Single linkage. (b) Complete Linkage. (c) Average linkage

There are other linkages such as Centroid-, Median-, and Ward’s linkage that are onlyappropriate for quantitative variables. Centroid linkage merges two clusters based on theEuclidean distance between their geometric centroids. Proximity between clusters are theproximity between the cluster centroids or means. Denote cluster as r as xr, where xr,idenotes the ith object in cluster r, then the centriod linkage is given as

dcentroid(A,B) = ||xA − xB||2

where xr =1

nr

nr∑i=1

xr,i.(3.14)

With the same notation, the Median linkage is given by

dMedian(A,B) = ||xA − xB||2 (3.15)

where xA and xB are wighted centroids for the clusters A and B. If cluster A was createdby combining clusters p and q, xA is defined recursively as

xA =1

2(xp + xq).

The Ward’s linkage uses the incremental sum of squares, i.e. the increase in the totalwithin-cluster sum of squares as a result of joining two clusters. The within-cluster sumof squares is defined as the sum of the squares of the distances between all objects in thecluster and the centroid of the cluster. The sum of squares measure is equivalent to thefollowing distance measure d(A,B)

dward(A,B) =

√2nAnB

(nA + nB)||xA − xB||2 (3.16)

where xA and xB are centroids of cluster A and B, and nA and nB are the number ofelements in respective cluster.

3.1.3.2 Cophenetic correlation coefficient

Cophenetic correlation coefficient (CCC ) is a measure that can be used to compare differ-ent cluster solutions obtained using different hierarchical cluster algorithms. Introducedby Sokal and Rohlf [47], the cophenetic correlation coefficient is a technique for comparingdifferent dendrograms and is defined as the linear correlation coefficient between the dis-similarities di,j between each pair of observations (i, j) and their corresponding cophenetic

distances dcophi,j , which is the inter-group dissimilarity at which the observations (i, j) first

merged together in the same cluster. The correlation between di,j and dcophi,j is the cophe-netic correlation coefficient, and it is a measure of how faithfully a dendrogram represents

17


the dissimilarities among observations, i.e. the pairwise distances between the originalunmodeled data points [20] [46].

Consider the original set of observations {xi} being modelled using hierarchical cluster-ing, with the output being a dendrogram Zi. The dendrogram was created by consideringthe distance matrix D = {di,j}, now the cophenetic distance between two cluster pointsTi and Tj is represented by the output dendrogram, and is stored in a cophenetic distance

matrix Z = {dcophi,j } = {zi,j}. That is, a hierarchical clustering method imposes a den-drogram on the given dissimilarity matrix D and this establishes the cophenetic distancematrix Z [20]. The cophenetic distance zi,j can be referred as “dendrogrammic” distancebetween the model points (sub-clusters) Ti and Tj . This cophenetic distance is representedin the dendrogram by the height of the link in the dendrogram at which the two points Tiand Tj are first joined together. Thus, by letting d and z be the average of di,j and zi,jrespectively, the cophenetic correlation coefficient is given by

CCC = corr(d, z) =

∑i<j(di,j − d)(zi,j − z)√[∑

i<j(di,j − d)2][∑

i<j(zi,j − z)2] . (3.17)

This coefficient is used to verify dissimilarity and enables us to compare the results ofclustering the same data set using different distance calculations or linkage methods. Bycomparing the cophenetic distances with the original distance values generated in thedissimilarity matrix D, the CCC measures how well the cluster tree generated by a specifichierarchical clustering algorithm reflects the data. If the clustering is valid, the linking ofobjects (cophenetic distance) in the cluster tree should have a strong correlation with thedistances between objects in the distance in the dissimilarity matrix D. The closer thevalue of the cophenetic correlation coefficient is to 1, the more accurately the clusteringsolution reflects the data. Thus, the highest CCC value of two different hierarchicalclustering algorithms (i.e. different linkage methods) is the more superior method for theunderlying data, meaning that cluster tree and the dendrogram produced by the algorithmwith the highest CCC is the tree and dendrogram that best reflects the data.

18


3.1.3.3 Dendrogram

The output of hierarchical clustering is a dendrogram, it is a special type of tree structurethat provides a convenient picture of the hierarchical clustering. A dendrogram containslayers of nodes, each representing a cluster. The lines connect nodes representing clusterswhich are nested into one another. In a dendrogram, any two objects in the original dataset are eventually linked together at some level. The height of the link represents thedistance between the two clusters that contain those two objects. This height is knownas the cophenetic distance between the two objects. Cutting a dendrogram, for instancehorizontally, is what creates the final clustering and the final number of clusters (3.2b.

(a) A dendrogram of five objects (b) Cutting a dendrogram defines final clustering

Figure 3.2: Illustration of a dendrogram resulting from hierarchical clustering of five ob-jects [56]. In (b) a horizontal cut at height ∼ 1.6 defines a final clustering with 3 cluster.

In figure 3.2, the numbers along the horizontal axis represent the indices of the objectsin the original data set. The links between objects are represented as upside-down U-shaped lines. The height of the U indicates the distance between the objects, it representsthe linkage criterion (cophenetic distance) computed between the connected objects. Inthe example above, the link representing the cluster containing objects 1 and 3 has aheight of 1. The link representing the cluster that groups object 2 together with objects1, 3, 4, and 5, (which are already clustered as cluster object 8) has a height of 2.5. Figurebelow illustrates how a linkage criterion groups the five objects in figure (3.2) into 8 clusterobjects,

19


Figure 3.3: Graphical illustration of hierarchical grouping of five objects using a certainlinkage criterion [56].

One of the drawbacks of hierarchical clustering, compared to other non-hierarchicaltechniques, is that the output from hierarchical clustering is not a fixed partitioning ofthe data (i.e. fixed number of clusters), rather the output is a nested hierarchical tree ofpartitions (clusters). It is up to the practitioner to decide the final number clusters touse from the hierarchical analysis. Cutting a dendrogram at any level defines a clusteringand identifies clusters. Thus, an important decision that must be made in the hierarchicalclustering analysis is where to “cut the dendrogram”, or how many number of clusters toselect in the final output. This is a consideration that is especially important if one desiresthe final output to represent the natural structure of the data, i.e. the number of clustersthat represents the natural divisions in the data.

3.1.3.4 Inconsistency coefficient

The natural division of the data can be evident in the output dendrogram, where groups ofobjects are densely packed in certain areas and not in others. Thus, one way to determinethe natural cluster division in a data set when using hierarchical clustering methods isto consider the heights of the links in the dendrogram. The so-called inconsistent linkscan indicate the border of a natural division in a data set. By comparing the height ofeach link in a cluster tree with the heights of neighbouring links below in the tree one candetermine the natural cluster division. A link that is approximately the same height asthe links below it indicates that there are no distinct divisions between the objects joinedat that specific level of the hierarchy [56]. These links are said to exhibit a high levelof consistency, as the distance between objects joined is approximately the same as thedistance between the objects they contain. Similarly, a link whose height differs noticeablyfrom the height of the links below it indicates that the objects joined at that level in thecluster tree are much farther apart from each other than their components were when theywere joined. This link is said to be inconsistent with the links below it. The dendrogramin figure (3.5) illustrates the concept of inconsistent links.

20


Figure 3.4: Dendrogram illustration of consistent and inconsistent links, inconsistencyindicates the natural division of the objects.

One can distinguish how the objects in the dendrogram fall into two groups that areconnected by links at a much higher level in the tree. The top highlighted links show in-consistency when comparing to the links below them in the hierarchy. While the bottomhighlighted links show consistency.

The inconsistent coefficient is a quantified measure of the relative consistency of eachlink in a hierarchical cluster tree. It compares each link in the cluster hierarchy with adja-cent links and quantifies to what extent the cophenetic distance of a specific link is differentfrom the links below it. This coefficient is sensitive to sudden changes in the copheneticdistances along the direction of the building up, it identifies the divisions where the sim-ilarities between objects change abruptly and is suitable for finding boundaries betweendistinct groups [30]. The coefficient is calculated as the difference between the currentlink height and the mean of the height of all the links included in the calculation, whichdepends on depth of the comparison, normalised by the standard deviation of the heightof all the links included in the calculations [20]. Consider a cluster joint f and its twoconnected cluster objects g and h, then the inconsistency coefficient IC using comparisondepth level two is given by

IC(f) =Z(f)−mean(Z(g), Z(h), Z(f))

std(Z(g), Z(h), Z(f))(3.18)

where Z(∗) denotes the cophenetic distance at the specific cluster joint. The value of thiscoefficient is a comparison of the height of a link in a cluster hierarchy with the averageheight of links below it. Links that join distinct clusters have high inconsistency coefficient,while links that join indistinct clusters have a low inconsistency coefficient. Thus, whenusing the inconsistent coefficient to find the natural division of the data one decides on a

21


specific ”cutoff”-value of the coefficient that will serve as the dendrogram ”cut-off line”.All links with higher inconsistent coefficient than the cutoff value is considered to define acluster. Forming cluster this way may, but do not necessarily, correspond to a horizontalslice across the dendrogram at a certain height. If one desires clusters corresponding to ahorizontal slice of the dendrogram, one can either specify that the cutoff should be basedon distance rather than inconsistency, or directly specify the desired number of clusters.

In the calculation of the inconsistency coefficient one must decide how deep the com-parison should be in the tree, i.e. how many levels below each link one should considerfor the calculation of the inconsistency coefficient. This is also called the depth of thecomparison and the value of the depth is usually set to two levels, as in equation (3.18).Equation (3.18) compares the cluster joint link f with adjacent links that are less thantwo levels below in the cluster hierarchy. This depth level of the comparison can, however,be specified to any other desired depth level. But one should keep in mind that for leafnodes, nodes that have no further nodes under them, the inconsistency coefficient is setto zero, and clusters that join two leaves also have a zero inconsistency coefficient.

3.1.4 Evaluating and assessing clustering quality

An important thing to note about hierarchical clustering is that the method is just heuris-tics, which do not optimise any well-defined function[41]. The hierarchical clustering willalways produce a clustering of the input data, even if the data has no structure at alland is only random noise. This issue is not specific to hierarchical clustering, clusteringmethods in general have the nasty habit of creating clusters in data even when no naturalclusters exists, so hierarchies and clusterings must be viewed with extreme suspicion [20].As clustering in general is an unsupervised statistical method, there is no universal tech-nique or formula to assess the resulting clustering output and its correctness. It may befor that reasons evaluating an unsupervised clustering procedure can be the most difficultpart of clustering analysis.

There are, however, various measures that can be used to assess the resulting cluster-ing quality. These measures of cluster quality are broadly divided into two approaches,internal and external criteria of quality of clustering [15][45]. The key differences in thetwo approaches is that the internal criteria evaluate the cluster output based on the datathat was itself clustered, while the external criteria utilises known class labels or externalbenchmarks that was not used in the clustering to evaluate the resulting output. In otherwords, the external approach measures similarity of formed clusters with externally sup-plied information such as class labels or ground truth, while the internal approach measuresthe goodness of a clustering structure without respect to any external information.

Internal criterion of cluster quality

A high internal criterion of cluster quality is assigned to the algorithm that maximisesinter-cluster distance (similarity) and minimizes intra-cluster distance, i.e. an algorithmthat produces clusters with high similarity within a cluster and low similarity betweenclusters. Some of the common measures of internal criterion of quality are presentedbelow with the following notation. Denote our data set as D and let c denote the centerof our data D and n denote the number of observations in our data set. Furthermore, letNC be the number of clusters and ni be the number of object in ith cluster Ci. Also, letci denote the center of cluster Ci, and d(x, y) be the distance between x, y, then the wehave the following internal criterion of quality

22


• Davies-Bouldin index [11]

DB =1

NC

∑i

maxj,j 6=i

[ 1ni

∑x∈Ci d(x, ci) + 1

nj

∑x∈cj d(x, cj)

d(ci, cj)

](3.19)

where the optimal value is when DB is minimised.

• Calinski-Harabasz index [31]

CH =

∑i ni(d(ci, c)

2)/(NC − 1)∑i

∑x∈Ci(d(x, ci)2)/(n−Nc)

(3.20)

where the optimal value is when CH is maximised.

• Silhouette index [43]

S =1

NC

∑i

[ 1

ni

∑x∈Ci

b(x)− a(x)

max[b(x), a(x)]

]where a(x) =

1

ni − 1

∑x∈Ci,y 6=x

d(x, y) and b(x) = minj,j 6=i

[ 1

nj

∑y∈Cj

d(x, y)].

(3.21)

The drawbacks of internal criteria in cluster evaluation is that the evaluation is biasedtowards algorithms that use the same cluster model. For example, k-Means clusteringnaturally optimises object distances, and a distance-based internal criterion will likelyoverrate the resulting clustering [39]. The internal evaluation measures are best suited toget some insight into situations where one algorithm performs better than another, but thisdoes not imply that one algorithm produces more valid results than another. However,a useful application of the internal criterion measures is to use it to find the optimalnumber of clusters K in the data set. By monitoring the selected internal criterion fordifferent values of K, one can determine the optimal number of clusters in the data basedon that particular internal criterion. This is done by observing the K that maximises (orminimises) the internal criterion in question.

External criterion of cluster quality

The external criterion of cluster quality uses separate data of known class labels, or otherexternal benchmarks, to evaluate the resulting clustering. Clustering results are evaluatedbased on data that was not used in the clustering, e.g. with benchmark data such as a setof pre-classified labels created by human experts. This external approach measures thesimilarity of formed clusters to the externally supplied class labels or ground truth, andis the most popular clustering evaluation method. A common measure of such externalevaluation criterion is Purity.

When computing purity each cluster is assigned to the class which is most frequent inthe cluster. Then the accuracy of this assignment is measured by counting the numberof correctly assigned data points in the data set and dividing by total number of datapoints [41]. If we consider j = (1, . . . , C) to be true class of an object and i be the clusterthat the object has been assigned to by the clustering method. Then Ni,j is the numberof objects in cluster i that belong to class j. Let Ni =

∑j Ni,j be the total number of

objects in cluster i. Then define the empirical distribution over class labels for cluster i

23


as pi,j =Ni,jNi

. The purity of a cluster is then given by pi = maxjpi,j and the overall purity

of a clustering is calculated as [41]

Purity =∑i

Ni

Npi. (3.22)

The value of purity ranges from 0 (bad) to 1 (good), a good clustering is indicated by agreater value of purity. A purity value of zero corresponding to bad clustering where theresulting clustering is not in agreement with the true grouping of the data. While a purityvalue of 1 corresponds to the opposite situation. When using purity one should keep inmind that the measure do not penalise for the number of clusters, i.e. we can triviallyachieve a purity of 1 by putting each object into its own cluster.

Other examples of measures of the external approach are Rand index [45], Mutualinformation [41] and Fowlkes–Mallows index [16]. The paradoxical issue with the externalevaluation approach is that if we have true class labels or true class distribution, we wouldnot need to cluster the data in the first place. Another issue is that in practical applicationswe usually do not have such ”ground truth”-labels, and even if we did, such labels onlyreflect one possible partitioning of the data, which does not rule out that there exist adifferent, and maybe a better, clustering.

3.1.5 Timeseries clustering

This subsection is primarily based on the comprehensive research of Aghabozorgi, Shirkhor-shidi, and Wah [3] and their work on ”Time-series clustering - A decade review”. It is adetailed up-to-date review of time series clustering that is unrivalled in current academicliterature. Thus, much of what is to follow in this section is based on their work and is insome sections directly paraphrased.

A time series is classified as dynamic data with its feature values changing as a func-tion of time, which means that the values of each point of a time-series are one ormore observations that are made chronologically. The problem of time series cluster-ing is defined much like regular data clustering, i.e. given a data set of n time-series dataD = {F1, F2, . . . , Fn}, time series clustering is the process of unsupervised partitioning ofD into C = {C1, C2, . . . , Ck}, in such a way that homogeneous time series are groupedtogether based on a certain similarity measure.

Time-series data are of interest due to their omnipresence in various areas ranging fromscience, engineering, business, finance, economics, health-care, to government. The sub-ject of time series clustering has been covered in numerous research papers over the years,[3] provide a state-of-the-art review of time-series clustering over the past decade as well ascovering evaluation methods and measures available for validating time-series clustering.In their review, Aghabozorgi, Shirkhorshidi, and Wah [3] concluded that although differ-ent researches have been conducted on time-series clustering, the unique characteristics oftime-series data are barriers that fail most of conventional clustering algorithms to workwell for time series. In particular, the high dimensionality, very high feature correlation,and typically large amount of noise that characterise time series data have been viewedas an interesting research challenge in time-series clustering. Indeed, there arises a com-plication when one applies conventional clustering algorithms on time series data. Dueto the time dependency within the time series data one cannot simply cluster such datawith conventional clustering procedures, because parts of the time series will end up indifferent clusters and distort the time dependency of the data. Accordingly, most of the

24


studies in the literature have concentrated on two subroutines of clustering to overcomethe challenges, namely

• Focusing on the high dimensional characteristic of time-series data and finding waysto represent time-series in a lower dimension compatible with conventional clusteringalgorithms.

• Finding a suitable distance measurement based on raw time-series or the representeddata.

Most of the work on time series clustering fall into one of the three categories[3]:

1. Whole time-series clustering, considered as clustering of a set of individual time-serieswith respect to their similarity. Here, clustering implies conventional clustering ondiscrete objects, where objects are time-series.

2. Subsequence clustering, which is clustering on a set of subsequences of a time seriesthat are extracted via sliding window, that is, clustering of segments from a singlelong time-series.

3. Time point clustering, which is clustering of time point based on a combinationof their temporal proximity of time points and the similarity of the correspondingvalues. An approach that is similar to time-series segmentation, however, differentas all points do not need to be assigned to clusters, i.e. some of them are consideredas noise. The objective of time-point clustering is finding the clusters of time-pointinstead of clusters of time-series data.

Figure 3.5: Time series clustering taxonomy [3]

The focus of this thesis will be on the whole time-series clustering, which is covered inthe next subsection.

3.1.5.1 Whole time series clustering

In the literature various techniques have been recommended for the clustering of wholetime series data. Most of them take the approach of either customising the existingconventional clustering algorithms such that they become compatible with the nature oftime-series data, which usually entails modifying the distance measure to be compatiblewith the raw time series data. Or converting time series data into simple objects (staticdata) as input to conventional clustering algorithms, or using multi resolutions of timeseries as input of a multi-step approach. In addition to these common characteristics,there are generally three different ways to cluster time series, namely shape-, feature- andmodel-based [3].

25


• Shape-basedIn the shape-based approach, shapes of two time-series are matched as well as pos-sible, by a non-linear stretching and contracting of time axes. This approach isalso sometimes labelled as a “raw data-based” approach because it typically worksdirectly with the raw time series data. Shape-based algorithms usually employ con-ventional clustering methods, which are compatible with static data while theirdistance or similarity measure has been modified to accommodate for the nature oftime series.

• Feature-basedIn this approach, the raw time-series are converted into a feature vector of lowerdimension. A conventional clustering algorithm is then applied to the extractedfeature vectors. Usually in this approach, an equal length feature vector is calculatedfrom each time-series followed by the Euclidean distance measurement.

• Model-basedIn these type of methods, a raw time-series is transformed into model parameters,i.e. a parametric model for each time series. Then a suitable model distance anda clustering algorithm is chosen and applied to the extracted model parameters. Inregard to model-based approaches research show that they have scalability issuesand its performance reduces when the clusters are close to each other.

These approaches to whole time-series clustering are illustrated by Aghabozorgi, Shirkhor-shidi, and Wah [3] in figure (3.7).

Figure 3.6: Time series clustering approaches [3]

26


According to existing works in the literature whole time series clustering can broadlybe divided into four distinct components [3]. Namely, time series representation, similarityor distance measures, cluster prototypes, and time series clustering. These are covered inthe rest of this subsection.

Figure 3.7: An overview of the four components of whole time series clustering [3]

3.1.5.2 Representations

As the name suggests the time series representation component is concerned with howto represent the time-series data that will be used for clustering. This entails dimensionreduction or transformation of the original raw time series data into another (lower) di-mensional space or by feature extraction. Dimensionality reduction is of great importancein time series clustering because compared to when using raw data, it reduces memoryrequirements, reduces computational cost when calculating distances, and speeds up theclustering itself. Furthermore, because some distance measures are highly sensitive tosome “distortions” in the data, measuring the distance between two raw time series mayyield highly unintuitive results. Thus, using raw time series may consequently lead toa clustering of time-series which are similar in noise instead of similarity in shape. Thepotential to obtain a different type of cluster is the reason why choosing the appropriateapproach for dimension reduction (feature extraction) and its ratio is a challenging task [3][61], it is a trade-off between speed (execution time) and quality that must be made. Highdimensionality and noise are characteristics of most time series data, therefore choosingan appropriate data representation method can be considered as the key component whicheffects the efficiency and accuracy of the solution.

The definition of time series representation is as following, given a time-series data Fi ={f1, . . . , ft, . . . fT }, representation is transforming the time series to another dimension-ality reduced vector F

′i = {f ′1, . . . , f

′x}, where x < T and if two series are similar in the

original space, then their representations should be similar in the transformations pacetoo. In general, there are four representation types, data adaptive, non-data adaptive,model-based and data dictated representation approaches [3].

• Data adaptive representation methods are performed on all time series in datasetsand try to minimise the global reconstruction error using arbitrary length (non-equal) segments. Such data adaptive representation can better approximate eachseries, but the comparison of several time series is more difficult. Examples ofdata adaptive representations are Piecewise Polynomials Interpolation, Piecewise

27


Linear Approximation, Singular Value Decomposition, Adaptive Piecewise ConstantApproximation, and Symbolic Approximation.

• Non-data adaptive approaches are representations which are suitable for timeseries with fixed size (equal-length) segmentation, and the comparisons of represen-tations of several time series is straightforward. Examples of methods in this groupare Discrete Wavelet Transformation (DWT), Discrete Fourier Transform, DiscreteCosine Transform, Spectral Chebychev Polynomials, Random Mappings, PiecewiseAggregate Approximation, and Indexable Piecewise Linear Approximation.

• Model based approaches represent a time-series in a stochastic way such as Markovmodels and Hidden Markov Models, Statistical Models, time series Bitmaps, andAuto-Regressive Moving Average (ARMA). In the data adaptive, non-data adaptive,and model based approaches users can define the compression-ratio based on theapplication at hand.

• Data dictated approaches, in contrast, the compression ratio is defined automati-cally based on raw time series, example of such method is Clipped Data.

Much research have been done on representation and dimensionality reduction, Aghabo-zorgi, Shirkhorshidi, and Wah [3] chooses in their review to highlight one recent researchpaper in particular by Ding et al. [14]. In this research, Ding et al. [14] have performeda comprehensive comparison of eight representation methods on 38 datasets, the result ofthat study shows that there is very little difference between recent representation methods.Although, they had investigated the indexing effectiveness of representation methods, theresults are advantageous for clustering purpose as well.

3.1.5.3 Similarity & dissimilarity measures in time series clustering

Time series clustering relies on distance measures to a high extent. There are differentmeasures which can be applied to measure the distance among time series. Some similaritymeasures work regardless of representation methods or are compatible with raw time-series,other similarity measures are proposed based on the specific time series representation cho-sen. In traditional clustering, distance between static objects is exactly-match based, butin time series clustering, distance is calculated approximately. To compare time series withirregular sampling intervals and length, it is of great significance to adequately determinethe similarity of time series. There are various distance measures designed for specifyingsimilarity between time series, the most popular distance measures are modified Hausdorffdistance, Dynamic Time Warping (DTW), HMM-based distance, Euclidean distance, andLongest Common Sub-Sequence [3].

The choice of a proper distance approach depends on the characteristic of time series,the length of time series, representation method, and on the objective of clustering timeseries [3]. These considerations are depicted in figure (3.8).

28


Figure 3.8: Distance measure approaches in the literature [3]

ObjectiveAccording to Aghabozorgi, Shirkhorshidi, and Wah [3] there are typically three differentobjectives of distance measures, namely finding similar time-series in (1) time, (2) shapeor (3) change (structural similarity). For finding similar time series in time, measuressuch as correlation based or Euclidean distance measure are appropriate, as similarity iscompared on each time step. When one is trying to find similar time series in shape,the time of occurrence of patters is not important. In that case clusters of time serieswith similar patterns of change are constructed regardless of time point. Examples ofshape-based measures are Dynamic Time Warping (DTW) and Minimal Variance Match-ing. Similarity in time is a special case of similarity in shape, and research has shownthat similarity in shape is superior to metrics based on similarity in time [3]. The third ofthe objectives is finding similar time series in change. In this approach, usually modellingmethods such as Hidden Markov Models or an ARMA process are utilised, and similarityis measured on the parameters of the fitted model to time series. That is, clustering timeseries with similar autocorrelation structure, i.e. similarity of model’s parameters. Thisapproach is only proper for long time series, not for modest or short time series. Whenusing shape-based distance measuring of time series it is common to encounter challengeswith problems such as noise, amplitude scaling, offset translation, longitudinal scaling,linear drift, discontinuities, and temporal drift which are the common properties of time-series data.

Level (length of time-series)Clustering approaches could also be classified into two categories based on the length ofthe time series, namely shape- and structure level. The shape level is utilised to measuresimilarity in short-length time series clustering, e.g. clustering individual heartbeats bycomparing their local patterns. Whereas structure level measures similarity which is basedon global and high-level structure, and is used for long-length time series data such as anhour’s worth of ECG.

TypeBased on the objective and level (length of time-series) of the clustering proper type ofdistance measure is determined. Aghabozorgi, Shirkhorshidi, and Wah [3] specifies thefour types of distance measures found in the literature. They are shape-, compression-,feature- and model based similarities. Shape-based similarity measure, such as Euclidean,Dynamic Time Warping and Minimal Variance Matching, finds similar time series in timeand shape. It is a group of methods that are proper for short length time series. Com-

29


pression based similarity, on the other hand, is suitable for both short and long timeseries, examples of such measures are Pearson’s correlation coefficient, Autocorrelation,Piecewise normalization, and Cosine wavelets. Feature based similarity measures such as“Short time series distance” (STS), statistics, and coefficients are proper for long timeseries. Model based similarity measures such as HMM and ARMA is proper for long timeseries.

Among the conclusion that Aghabozorgi, Shirkhorshidi, and Wah [3] draw from theirliterature review is that the Euclidean distance and Dynamic Time Warping (DTW) arethe most common methods for similarity measure in time series clustering. One of theresearches in particular has shown that, in terms of time series classification accuracy,the Euclidean distance is surprisingly competitive, and DTW has its strength in similar-ity measurements which cannot be declined. However, a different study by (Prekopcsakand Lemire [42]) showed different results, they concluded that different distance measuresperform differently depending on the time series dataset. Prekopcsak and Lemire [42] com-pared the classification error of time series clustering using four different distance measures,Dynamic Time Warping, Euclidean, two configurations of the Mahalanobis distance (us-ing shrinkage- and diagonal approach), and one alternative classification technique, LargeMargin Nearest Neighbor (LMNN) classification. Their study uses the UCR time seriesclassification benchmark which includes 85 diverse time series data sets from many differ-ent domains. While it is unclear how many of the 85 time series that were actually usedin the study, Prekopcsak and Lemire [42] highlights the performance on 13 different timeseries data sets. The best performing distance measure were the Dynamic Time Warpingmeasure which had the greatest performance on 9 out of the 13 highlighted data sets, fol-lowed by LMNN with 6 best-performing results (some measures performed equally good onsome data sets), and Mahalanobis distance with 5 best-performing results (3 when usingdiagonal approach and 2 when using covariance shrinkage approach). The Euclidian dis-tance showed best performance on only one of the 13 highlighted time series data sets, andit was a tied performance with DTW on a specific data set, an expected occurrence giventhat the Mahalanobis distance converges to the Euclidian distance when the time seriesare uncorrelated. Prekopcsak and Lemire [42] concluded that the Mahalanobis distanceare far more superior to the Euclidian distance on some data sets, comparing them two,Euclidean distance outperformed the Mahalanobis distance measure in only two data setsby a small margin. Mahalanobis distance measure, on the other hand, outperformed theEuclidean measure 12 times, and sometimes by a large margin. It was further concludedthat the Dynamic Time Warping had the lowest error rates and provided best results forhalf of the data set.

For this thesis three time series similarity measures are worth highlighting and definemore precisely, namely Dynamic Time Warping, Pearson’s- and Spearman’s correlationdistance. Pearson’s and Spearman’s correlation distance are compression based similaritymeasures and are invariant to scale and location of data points, the definition of these aregiven in (3.8) and (3.9). The Dynamic Time Warping (DTW) distance is a shape-basedsimilarity that compares two time series locally using a local distance measure d(∗) on alocal subset of the two time series. The goal of DTW is to find an local (point-to-point)alignment between two time series such that minimal overall distance (sum of local dis-tances) is attained. Consider two time series X = (x1, x2, . . . , xN ) of length N ∈ N andY = (y1, y2, . . . , yM ) of length M ∈ M, and a n ×m matrix M where elements Mij in-dicates the local distance d(xi, yj) between xi and yi, here the local distance can be theEuclidean distance or other suitable distance measures. Then the point-to-point align-

30


ment and matching relationship between X and Y can be represented by a time warpingpath W = 〈w1, w2, . . . , wk〉, max(m,n) ≤ K < m + 1 − 1, where wk = (i, j) indicatesthe alignment and matching relationship between xi and yj . Then, if a path is the lowestcost path (equivalent to lowest sum of local distances), the corresponding dynamic timewarping is required to meet.

DTW(X,Y ) = minW{K∑k=1

dk,W = 〈w1, w2, . . . , wK〉}, (3.23)

where dk = d(xi, yj) indicates the distance represented as wk = (i, j) on the path W .measure that has in some applications shown to achieve better accuracy than Euclideandistance. Then, the formal definition of dynamic time warping distance between two seriesis given recursively as

DTW(〈〉, 〈〉) = 0,

DTW(X, 〈〉) = DTW(〈〉, Y ) =∞,

DTW(X,Y ) = d(xi, yj) + min

DTW(X,Y [2 : −])DTW(X[2 : −], Y )DTW(X[2 : −], Y [2 : −]),

(3.24)

where 〈〉 indicates empty series, [2 : −] indicates a sub-array whose elements includethe second element to the final element in an one-dimension array. Equation (3.24) caninitially be hard to grasp and for a more thorough description the reader is referred to [28]or [40]. However, the idea behind dynamic time warping distance is quite simple, it is theminimal value of the sum of local distance between two time series, this idea is illustratedin figure (3.9) below.

Figure 3.9: Illustration of the concept of Dynamic Time Warping matching the shape oftwo time series by minimising sum of the local distance

Equation (3.24) considers the whole length of the series X and Y in the calculations,i.e. the warping path is unrestricted. However, in some applications, such as financial timeseries, it is useful to restrict the warping path to be within a certain number of samples ofa straight-line fit between X and Y . For instance, for two financial time series which showa clear lead-lag relationship it may only make sense to compare local time series wherethis lead-lag relationship holds, i.e. utilise a restricted warping path equivalent number ofsamples where the relationship holds. After all, the DTW minimises the local distances,and for financial time series stretching several years it might only make sense to restrictthe ”local distance calculations” to be within a few days of each time-point.

31


3.1.5.4 Time series cluster prototypes

When working with time series clustering finding an appropriate cluster prototype (orcluster representative) is an essential subroutine [3]. Here, cluster prototype refers to thecluster representation of the set of sequences (time series) that is contained in the con-cerned cluster, i.e. a “cluster time series” that represents/reflects the various time seriescontained in the cluster. Choosing the appropriate cluster prototype is a challenging is-sue which can affect the accuracy of the clustering. One of the problems which lead tolow accuracy of clusters is poor definition or updating method of prototypes in time se-ries clustering process, especially in partitioning clustering algorithms such as K-Means,K-Medoids and fuzzy C-Means, or even in agglomerative Hierarchical clustering which re-quire a specification of the prototype in form of the linkage criterion (see 3.1.3.1). In thesealgorithms, the quality of clusters is highly dependent on quality of prototypes and manysuffer from low accuracy of representation methods. The inaccurate prototype can affectconvergence of clustering algorithms which also results in low quality of obtained clusters. ’

Generally there are three approaches to defining the cluster prototypes [3]:

1. The medoid sequence of the set

2. The average sequence of the set

3. The local search prototype

Cluster medoid is the most common cluster prototype in work related to time series clus-tering. Medoids are representative objects of a dataset, or a cluster with a dataset whoseaverage dissimilarity to all the objects in the cluster is minimal [48]. In this approach, thecenter of a cluster is defined as a sequence which minimises the sum of squared distancesto other objects within the cluster. Given time series in a cluster, the distance of all timeseries pairs within the cluster is calculated using a distance measure such as Euclidean orDTW. Then, one of the time series in the cluster, which has lower sum of square error isdefined as medoid of the cluster. Moreover, if the distance is a non-elastic approach suchas Euclidean, or if the centroid of the cluster can be calculated, it can be said that medoidis the nearest time series to centroid.

The averaging prototype is appropriate to use if the time series considered are of equallength and the distance metric is non-elastic (e.g. Euclidean distance) in the clusteringprocess. In such case, the averaging method is a simple averaging technique which is equalto the mean of the time series at each point. However, in the case that the time-series areof different length, or the similarity measure is an elastic one, i.e. based on “similarity inshape”, the one-to-one mapping of the averaging prototype makes it unable to capture theactual average shape. Thus, applying elastic approaches such as Dynamic Time Warpingand Longest Common Sub-Sequence is not a trivial task [3].

The third cluster prototype approach is the local search prototype. It is a slightly moresophisticated procedure that utilises a combination of different methods and techniquesto find a suitable cluster prototype. The local search prototype procedure is describedby Aghabozorgi, Shirkhorshidi, and Wah [3] as following: at first the medoid of clusteris computed, then averaged prototype is calculated on warping paths using averagingmethod. Thereafter, new warping paths are calculated to the averaged prototype. Anexample of such procedure is given by [3], in one of the reviewed studies the authorsapplied a combination of medoid, average and local search on K-Medoids, random Swapand Agglomerative hierarchical clustering.

32


3.1.5.5 Time series clustering algorithms

Time-series clustering algorithms can be broadly classified into six groups, Partitioning,hierarchical, Grid-based, Model-based, Density-based clustering, and Multi-step cluster-ing algorithms[3]. These six groups are highlighted in figure 3.10. The two groups ofalgorithms that are of interest for this thesis are the Partitioning- and Hierarchical clus-tering algorithms. The idea behind these two methods and how they work are describedin section (3.1.1), this section is devoted to the evaluation of how well these methods workwith time-series.

Figure 3.10: Clustering algorithm approaches [3]

The Hierarchical clustering algorithm applied on time series has similar benefits towhat is described in section (3.1.3). Hierarchical clustering of time series creates nestedhierarchy of similar groups based on a pair-wise distance matrix of time series. Thealgorithm has great visualisation power which makes it an appropriate method for timeseries clustering, the great visualisation is a great tool for the analysis of dimensionalityreduction and evaluation of distance measures (an important consideration for time seriesclustering). Another key strength of hierarchical clustering is that, in contrast to mostother algorithms, hierarchical clustering does not require an initial specification of thenumber of clusters. Furthermore, it is possible to cluster unequal time series using elasticdistance measures such as Dynamic Time Warping or Longest Common Sub-sequence tocompute the dissimilarity/similarity of the time series [3]. The fact that specifying clusterprototypes is not necessary in its process has made the hierarchical algorithm capable toaccept unequal time-series. However, due to its quadratic computational complexity, thehierarchical clustering algorithm is restricted to small time series as it is not capable todeal effectively with large time series.

The Partitioning clustering algorithms, on the other hand, has a very fast responsecompared to the hierarchical method and it has made them very suitable for time seriesclustering. The Partitioning clustering methods makes k number of groups from n un-labelled objects in such a way that each group contains at least one object. One of thecharacteristics of partitioning methods is that the number of clusters k needs to be pre-assigned. For many applications, this optimal value of k may not be available or feasibleto be determined, making it impractical to obtain natural clustering results. This problemis even worse in time series data because the datasets can be very large and diagnosti-cally checking it to determining the number of clusters can be unfeasible. While there aremethods to determine the “optimal” value of K, see section (3.1.4), the exercise of findingthe optimal k can be challenging depending on the data. The requirement of pre-assigningthe number of cluster is one of the drawbacks of partitioning clustering algorithms whichmakes them inapplicable in some real-world applications [3]. An example of a popular par-titioning clustering algorithms is the K-means, where each cluster has a prototype whichis the mean value of its objects. The main idea behind k-Means clustering algorithm is theminimisation of the total distance between all objects in a cluster to their cluster centers(prototype). The selection of a suitable cluster prototype is another challenging and non-trivial task of partitioning methods, their accuracy is directly dependent on the definition

33


of these prototypes and their updating method. For this reason, partitioning approachesare more compatible with finding clusters of similar time series in time and preferablyof equal length, because defining a cluster prototype for elastic distance measures whichhandle the similarity in shape is not very straight forward.

3.2 Supervised learning

Supervised learning is the statistical learning task of inferring a function from labelleddata, i.e. known output. The goal is to learn the mapping of a function from the in-put space to the output space with the help of input parameters X and correspondingoutput variable Y . Unlike in unsupervised learning the output variable Y is known andis supervising the learning process of the learning algorithm. In the supervised context,the learning algorithm makes prediction on the training data and is corrected by the trueoutput variable Y in an iteratively way until an acceptable level of model performance isachieved. There are in general two groups of supervised learning problems, regression andclassification problems. In classification, one is concerned with classifying a set of observa-tions into categorical values, e.g. “blue”, “flower”, “car”. In regression problems, one theother hand, one is concerned with estimating a continuous-value of the dependent variableY , e.g. “speed of a car”, “inflation rate”, “temperature”. The difference between thesetwo types of supervised learning problems lies in output variable Y , in classification oneis working with categorical values and in regression one is working real-continuous-valuedvalues.

There are several different supervised learning algorithms available and they performdifferently depending on the problem at hand and on the quality of the data. Exampleof supervised learning algorithms are Linear & Quadratic Discriminant Analysis, SupportVector Machines (SVM), K-Nearest Neighbours, Decision Regression Trees, and ArtificialNeural Networks. In this thesis we are working with a regression problem and will beemploying a supervised learning algorithm called Random Forest, an algorithm that belongto category of supervised learning algorithms called Ensemble methods.

3.2.1 Ensemble methods

Ensemble Methods is a supervised learning technique that uses multiple learning algo-rithms, called learners, to make a single prediction. With ensemble methods one trainsseveral so called base learners separately and combines their various predictions into asingle final prediction. The idea behind Ensemble methods is that the collective predic-tions of several base learners will achieve a better overall predictive performance than anysingle learner of the constituent learning algorithm would alone. In order for this to betrue, ensemble methods require the following:

1. There are enough number of diverse base learners in the ensemble, meaning thateach base learner is trained on a diverse subset of the training data, i.e. the learnersare trained only on a subset of the full data and their training set is different fromeach other. This creates learners that are “experts” on different parts (sub-spaces)of the training data.

2. When making predictions, learners that are not trained on the particular subspaceof the training data in question vote randomly such that their collective predictionscancel out. Such learners are referred to as “non-experts”.

3. Each learner (or “expert”) makes its own prediction autonomously and independentlyfrom the other learners in the ensemble.

34


The selection of the base learner, i.e. base supervised learning algorithm, can be anysupervised algorithm of choice, and one can use several different learning algorithms inthe ensemble. One of the most commonly used base learners are Decision Trees, mainlydue to their computational efficiency and interpretability. Ensemble methods, however,work with other supervised learning algorithms as well, and there are even successfulexperiments done using ensemble of ensembles.

3.2.2 Regression with Random forest

Introduced by Breiman [6] in 2001, Random Forest is an supervised ensemble learningmethod that can be used for both regression and classification problems. It is an ensembleof random decision tree learners (or classifiers) that makes predictions by combining thepredictions of the individual trees. There are different designs of random forests dependingon how the randomness is introduced in the tree building process [53]. The RandomForest algorithm has shown convincing performance in various researches [10] [36] [6], it isversatile enough to learn and predict with missing data points, it is able to handle mixeddiscrete and continuous inputs, and has the ability to estimate variable importance of theprediction model. In this thesis we will consider the random forest version as introducedby Breiman [6], which combines bagging and random feature selection, and utilises theClassification and Regression Tree (CART) model to grow the individual trees. In orderto compensate for the bias variable importance estimates of using the CART model, thisversion will be extended with another model for growing the trees called GUIDE-model.

3.2.2.1 Base learner: Classification and Regression Tree (CART)

The base learner in the Random Forest algorithm is Decision Trees, they can be used forboth regression and classification. The main idea behind decision trees is to recursivelypartition the predictor space into a number of simple distinct and non-overlapping regionsRm, m = 1, . . . ,M . The development of a tree structure comes from the recursive parti-tioning of the original data set in two subsets that are more homogeneous than the firstone, leading to a branching structure. The underlying idea is that, as the ramificationincreases, the homogeneity in each node increases too. A local constant model is definedin each resulting region of the predictor space, in form of a mean (regression) or mode(classification) of the training data Ytrain that the partitions contain. This mean/mode,denoted wm for partition m, serves as a model of the output variable Y . We can write themodel in the following form

f(x) = E[y|x] =

M∑m=1

wmI(x ∈ Rm), (3.25)

where wm = 1|Dm|

∑j∈Rm Yj in the regression case, and wm = Mode(Y ∈ Rm) in the

classification case. Mode is simply the most frequent value of Y in partition Rm. Dm

denotes the available data in partition Rm, and |Dm| denotes number of observations inpartition Rm. Ideally we would like to find the set of partitions R1, . . . , Rm that minimisesthe RSS, given by:

M∑m=1

∑i∈Rm

(yi − yRm)2, (3.26)

where yRm is the mean (or mode) response for training observations within the mth parti-tion, according to equation 3.25. Unfortunately, it is computationally infeasible to considerevery possible partition of the predictor space into M regions [21]. Therefore, a top-down

35


greedy approach known as recursive binary splitting is used, where at each step of thepartitioning of the predictor space (i.e. the tree-building process) the best split is madeat that particular step, rather than looking ahead and picking a split that will lead to abetter partitioning (tree) in some future step.

There are several methods for growing a tree, i.e. partitioning the predictor space. Themost common method is CART (Classification and Regression Tree), a binary recursivepartitioning method used by Breiman in his Random Forest implementation [6]. Othermethods include THAID [32], ID3/4/5/5R [52] and C4.5 [44], in this thesis we will con-sider the CART method according to Breiman’s Random Forest procedure.

The CART is a binary recursive method that splits each node in a tree into 2 sub-nodesby finding a best split variable along with a best split value tj . This procedure is donerecursively until reaching a minimum node size nmin, meaning that each end-node of thetree should contain at least nmin number of training observation, or equivalently that eachfinal partitioning Rj , j = 1, . . . ,m, should contain at least nmin training observations. Todemonstrate how the CART approach partitions the predictor space, we define two inputvariables X1 and X2 with a corresponding output variable Y . Assume that the inputvariables are defined on the unit interval, X1, X2 ∈ [0, 1]. The CART starts out by parti-tioning the input space into two distinct regions along one of the variables. The choice ofwhich variable Xk to split on, and on what particular split-point tj , is determined by thefollowing splitting criterion [41]:

(j∗, t∗) = argminj∈{1,...,D}

mint∈Tj

cost({xi, yi : xij ≤ t}) + cost({xi, yi : xij > t}) (3.27)

As can be seen in equation (3.27), the splitting criterion consist of a cost function thatshould be minimised. In the regression CART procedure that cost function is the sum ofsquared deviations about the mean and the values Yi contained in partition Rm [32][41].That is, in the regression case the cost of splitting on a certain predictor Xj is given by

cost(Dm) =∑i∈Dm

(Yi − YRm)2 (3.28)

Here, YRm is simply equation (3.25), or equivalently YRm = 1|Dm|

∑j∈Rm Yj . In the CART

procedure for classification problems, where the output variable Y is categorical, a dif-ferent cost function is used. It is called the Gini Index and is based on class-conditionalprobabilities [41]. The class-conditional probabilities are given by:

πc =1

|Dm|∑|Dm|

I(yi = c) (3.29)

The Gini Index is then formulated as follows:

C∑c=1

πc(1− πc) =∑c

πc −∑c

π2c = 1−

∑c

π2c (3.30)

Note that πc is the probability a random entry in a specific partition belongs to class c,and (1 − πc) is the probability that it would be miss classified, the Gini Index is thenequivalent to the expected error rate [41].

To illustrate the output of the CART procedure and the final partitioning, assume that

36


in our example the first split is done along variable X1 for some suitable value X1 = t1.Assume further that the region Xt ≤ t1 is split at X2 = t2, the region X1 > t1 is split atX1 = t3, and the final region X1 > t3 is split at X2 = t4. The result of this splitting processis a partitioning consisting of five regions R1, R2, R3, R4, R5 as shown in figure(3.11a). Thepartitioning is summarised in the output tree in figure (3.11b), along with a perspectiveplot of the prediction surface fig(3.11) corresponding to the final partitioning.

(a) CART partitioning (b) Tree-structure (c) Prediction surface

Figure 3.11: (a) Partitioning of the two-dimensional predictor space spanned by X1 andX2, using CART recursive binary splitting. (b) Tree-structure corresponding to the par-titioning in figure (a). (c) Perspective plot of the prediction surface. Source: Hastie,Tibshirani, and Friedman [18]

The binary splitting of the CART procedure is done recursively until a certain stop-ping criterion is reached. Following Breiman’s version of the random forest, that stoppingcriterion is a minimum node size nmin [7], i.e. a minimum count on the training datapoints in each node (or partition). However, other stopping criterion such as certain treesize, or amount of terminal nodes can be used.

Algorithm 2 Recursive procedure to grow a classification / regression tree, Murphy [41]

1: Input: Training dataD2: Output: Binary tree-structure3: function fitTree(node, D, depth)4: node.prediction = mean(yi : i ∈ D), i.e. YRD

, using equation 3.255: (j∗, t∗,DL,DR) = split(D), using 3.276:

7: if not worthSplitting then8: return node9: else

10: node.test = λx.xj∗ < t∗ // anonymous function11: node.left=fitTree(node,DL,depth+1)12: node.right=fitTree(node,DR,depth+1)13: return node14: end

There are many advantages of using CART decision trees, such as their ability to handlemixed discrete and continuous inputs, their insensitivity to monotone transformations ofinputs, their robustness against outliers, and the fact that they are easy to interpret [41].

37


However, on a stand-alone basis, CART decision trees as a single predictor suffer frompoor prediction accuracy. Due to the greedy nature of the tree construction algorithm thetrees are unstable and sensitive to small changes to the input data. Any small changesto the input data can have large impact on the tree structure due to the top-down hi-erarchical approach in the tree construction, causing errors at the top to affect the restof the tree[41]. In other words, CART trees have low bias but suffer from high variance[21].

Furthermore, there is also the challenge of selecting the parameters that control the modelcomplexity in form of tree size, e.g. selecting the splitting/stopping criterion. A poorchoice of parameters can result in either a very large tree that might overfit the data, ora too small tree might that have a poor representation of the underlying data. This isparticularly an issue when one is using only a single CART tree for prediction, and insuch case it is a common practice to use tree pruning to find the optimal tree size. Theidea behind tree pruning is to grow a very large tree T0, and then prune it back to asmaller subtree, where a smaller tree with fewer splits might lead to lower variance, betterprediction accuracy, and better interpretation at the cost of a little bias. To consider everypossible subtree Tj ⊂ T0 would be too cumbersome and is practically unfeasible. Ratherthan considering every possible subtree, one can consider a sequence of trees indexed by anon-negative tuning parameter α in a procedure called cost complexity pruning [21]. Foreach value of α there corresponds a subtree Tα ⊂ T0 such that

|Tα|∑m=1

∑i:xi∈Rm

(yi − yRm)2 + α|Tα|

is as small as possible. Here |Tα| indicates the number of terminal nodes of the tree Tα,Rm is the subset of the predictor space corresponding to the mth terminal node, and yRmis the predicted response associated with Rm. The tuning parameter α controls a trade-offbetween between the subtree’s complexity and its fit to the training data. As one increasesα from zero, branches get pruned from the tree in a nested and predictable fashion, andwe can easily obtain whole sequence of subtrees as a function of α. The optimal value of α,and in turn the optimal subtree, is obtained using a validation set or using cross-validationand identifying the value of α that corresponds to the lowest MSE on the validation set orin the CV procedure. While tree-pruning is an essential step when using a single CARTtree for prediction (in order to increase prediction accuracy at the expense of increasedbias), the Random Forest procedure do not apply any tree pruning to the individual treesin the ensemble. In Random Forest one is working with fully grown trees that individuallyhas low bias and high variance.

3.2.2.2 Random Forest algorithm

To remedy the issue of high variance and poor prediction accuracy of CART decisiontree, Breiman introduced the Random Forest procedure which extends the CART withthe methods bagging and random feature selection [6]. In brief, Random Forest (RF)procedure entails constructing multitude of CART trees using bootstrapped training sets,re-sampled from the original training data. The trees are constructed with random featureselection such that only a randomised subset of the predictor variables are considered ateach step of the tree construction process. The final RF prediction is made by combiningthe predictions of the individual trees, either through averaging (regression) or taking themode (classification) of their predictions. The use of several decision trees, in combina-tion with bagging and random feature selection, reduces the variance of the RF predictor

38


considerably compared to an individual CART tree. This in turn greatly improves theprediction accuracy of the RF predictor over that of an individual CART tree.

(a) a random forest (b) a Random Forest

Figure 3.12: Illustration of how RF combines several CART trees to make a final prediction

More specifically, a bootstrap procedure is used to create B randomised training setsΘ1 . . .ΘB, by re-sampling with replacement from the original training set (X,Y), suchthat the bootstrap replicates Θk contain an equal amount of instances as the originaltraining set. For each of the B bootstrapped training sets Θk a decision tree is constructed(trained) using the CART model. The individual trees are grown deep and are not pruned,thus they have high variance but low bias. To minimise correlation amongst the trees inthe ensemble, a random feature selection is used at each step of the construction of atree, where v variables are selected uniformly at random from the full set of p predictorcandidates X1, ..., Xp to serve as leaf splitting candidates, v < p. That is, a leaf nodesplit is selected along one of these v variables to minimise a certain cost function, sum ofsquared deviation (3.28) for regression or Gini-index (3.30) for classification. The valueof v is typically chosen such that v =

√p, i.e. squared root of the total number of

predictors. When fully trained, each of the trees in the ensemble is an individual treepredictor denoted as h(x,Θ), and given an input x each tree predictor has a correspondingprediction fk(x) = h(x,Θk). The final RF prediction in a regression setting is then givenby

Y BRF =

1

B

B∑k=1

fk(x) =1

B

B∑k=1

h(x,Θk). (3.31)

The full Breiman’s Random Forest algorithm for regression is presented in algorithm 3below.

In his paper, Breiman [6] shows following noteworthy result for RF in a regression setting.The mean-squared generalization error for any numerical predictor h(x) is

EX,Y (Y − h(X))2 (3.32)

and as the number of trees {h(x,Θk)} goes to infinity, the following holds almost surely

EX,Y (Y − avkh(X,Θk))2 a.s→ EX,Y (Y − EΘ(h(X,Θ)))2 (3.33)

The right-hand side of (3.33) denotes the generalisation error of the forest, PEforest, andEΘ(h(X,Θ) is the RF predictor. If we define the average generalization error of a singletree as

PEtree = EΘ[EX,Y (Y − h(X,Θ))2] (3.34)

39


Algorithm 3 Random Forest for regression [36]

1: Growing Stage2: Input:3: (a) Training data (X, Y ), N p-dimension samples along with their output values4: (b) Parameter B: Number of trees5: (c) Parameter v: Number of candidate variables to consider at each split6: (d) Parameter nmin: Minimum node size7: Output: Random Forest consisting of ensemble of trees {Tb}B18:

9: for b = 1 to B do10: (a) Draw a bootstrap sample Z∗ of size N from the training data11:

12: (b) Grow a CART tree Tb to the bootstrapped data, by recursively repeating thefollowing steps for each terminal node of the tree, until the minimum node sizenmin is reached.

13: i. Select v variables at random from the p predictors14: ii. Pick the best variable and split-point among the v variables15: iii. Split the node into two child nodes.

16: end17:

18: Prediction Stage19: Function: RF.prediction = Y B

RF (x), using equation (3.31)

20: Let Yb(x) be the prediction of the b-th random-forest tree, given an input x.21: Then Y B

RF (x) = 1B

∑Bb=1 Yb(x) is the RF prediction.

and further assume that for all Θ, E(Y ) = EX[h(X,Θ)], then we have the followingrelationship between the generalisation error of the forest and the generalisation of asingle tree, where the former is bounded by the latter in the following way

PEforest ≤ ρPEtree (3.35)

The ρ in equation (3.35) is the weighted correlation between the residuals Y −h(X,Θ) andY − h(X,Θ

′), where Θ and Θ

′are independent. The requirements for accurate regression

forests is pinpointed in equation (3.35) – low correlation between residuals and low errortrees [6]. The random forest decreases the average error of the trees employed by thefactor ρ. This implies that the randomisation employed needs to aim at low correlation,and it’s one of reasons for using random feature selection on top of bagging.

Bagging, short for Bootstrap aggregation, is a general-purpose procedure for reducingvariance of a statistical learning method [21]. It’s used in RF algorithm in order to reducethe variance of the predictor, and hence increase the prediction accuracy, by taking manybootstrapped training sets from the population, building separate CART tree predictionmodels using each training set, and averaging the resulting predictions. Recall that givena set of n independent observation Z1, . . . , Zn, each with variance σ2, the variance of themean Z of the observation is given by σ/n. Each individual tree in the ensemble has highvariance but low bias, bagging and averaging these trees reduces the variance. Bagging hasbeen demonstrated to give impressive improvements in accuracy by combining togetherhundreds or even thousands of trees into a single procedure [6]. However, averaging manyhighly correlated quantities does not lead to as large of a reduction in variance as averagingmany uncorrelated quantities. The issue of highly correlated boosted trees arises when weare working with a few very strong and dominant predictors. In such cases most of the

40


bagged trees will look quite similar to each other as most of them will use these strongpredictors in the top splits, consequently leading to similar and highly correlated trees.We know from equation (3.35) that in such scenario bagging alone will not lead to asubstantial reduction in variance over a single tree.

To de-correlate the trees in the ensemble a random feature selection is employed, itdecreases the variance of the predictor further and increases its prediction accuracy. Withrandom feature selection, each time a split is considered, only a random sample of v pre-dictors are chosen as split candidates from the full set of p predictors. The typical value ofv is chosen such that v =

√p for classification problems and p/3 for regression problems.

This procedure overcomes the issue of strong and dominant predictor as the algorithm isthen not allowed to consider a majority of the available predictors. On average (p−m)/pof the splits will not even consider the strong predictor, and there is more chance for otherpredictors to be considered. De-correlating the trees in this way makes the average of theresulting trees less variable and hence more reliable.

In addition to the low bias and low variance of the random forest predictor, the RFhas several other desirable features and statistical merits. Mei et al. [36] summarises theseas:

1. RF require only three input parameters that are very easy to tune

2. RF can generate so called out-of-bag error estimates, which is good estimate of thegeneralisation error, and these estimates can be generated while growing the randomforest. It’s a convenient feature as other models generally require multiple trainingprocedures such as Cross Validation to generate such estimates.

3. Through out-of-bag observation, the RF can generate variable importance rankingin its growing procedure, this serves as useful estimate of variable relevance for theprediction model.

4. RF is robust against irrelevant features and outliers in the training data

5. Structured as an ensemble of trees, the RF can easily be expanded to fit additionalnew data by growing more ’branches’ in existing trees [36]. Furthermore, one cantrain new trees and replace existing least-performing trees. This enables the RF tobe an adaptive machine learning method that can learn online.

6. With so-called surrogate splits the RF can be trained and predict with partiallymissing data points.

3.2.2.3 Out-of-Bag observations, error estimation, and variable importancemeasure

With random forest there is a very straightforward way to estimate the test error of themodel, without the need to perform cross-validation or using a validation set. Recall thata bagging procedure is employed in the RF algorithm, where B training sets are boot-strapped from the original data, and for each of the B bootstrapped sets a tree is fitted.It can be shown that on average, each of the trees makes use of around two-thirds of theoriginal observations[21]. The remaining one-third of the observations not used to fit agiven tree is referred to as the out-of-bag (OOB) observations for that particular tree.We can with the help of these out-of-bag observations estimate the generalization errorof the RF model. To obtain a single OOB prediction for the ith training observation, weobserve the predictions of the trees that were not trained with that particular observation,

41


in other words, with trees that have the ith observation as out-of-bag (OOB). This willroughly yield B/3 OOB predictions for the ith observation, which combined reflects theRF model’s OOB prediction for the ith observation. By obtaining RF OOB predictionsfor all n observations in the training data, we can compute the overall out-of-bag mean-squared error (OOB MSE) for the RF. The resulting OOB MSE is a valid estimate of thetest error for the random forest, since the response for each observation is predicted usingonly the trees that were not fit using that observation.

The out-of-bag observations can also be used to measure the predictor importance ofthe model, i.e. how influential the predictor variables in the model are at predicting theresponse [57]. To measure the variable importance in the model, we use out-of-bag pre-dictor importance estimates by permutation. The idea is that if a predictor is influentialin prediction, then permuting its values should affect the model error (OOB MSE), andby the same token, if a predictor is not influential, then permuting its values should havelittle to no effect on the model error. Thus, by studying how the model error is affected bypermuting a certain predictor we get insight into the predictor importance for the model.The procedure for out-of-bag predictor importance estimates by permutation is describedbelow

Algorithm 4 Out-of-bag predictor importance estimates by permutation [57]

1: Suppose that R is a random forest of B learners and p is the number of predictors inthe training data.

2:

3: for tree t, t = 1, . . . , B do4: (a) Identify the out-of-bag observations and the indices of the predictor variables

that were split to grow tree t, st ⊆ {1, ..., p}.5:

6: (b) Estimate the out-of-bag error εt.7:

8: (c) for each predictor variable xj , j ∈ st, do9: i. Randomly permute the observations of xj .

10: ii. Estimate the model error, εtj , using the out-of-bag observationscontaining the permuted values of xj .

11: iii. Take the difference dtj = εtj–εt. Predictor variables not split whengrowing tree t are attributed a difference of 0.

12: end for13: end for14:

15: For each predictor variable xj in the training data compute the mean, dj , and standarddeviation, σj , of the differences over the learners, j = 1, ..., p.

16: The out-of-bag predictor importance by permutation for predictor xj is dj/σj .

3.2.2.4 Unbiased variable importance estimates using GUIDE procedure

Applying the RF algorithm using the standard CART procedure and estimating the pre-dictor importance by permutation, as described in previous subsection (3.2.2.3), will resultin biased importance estimates. The reason for this is that the standard CART proceduretends to select split predictors containing many distinct values, e.g., continuous variables,over those containing few distinct values, e.g., categorical variables [34]. If the predictordata set is heterogeneous, or if there are predictors that have relatively fewer distinct val-

42


ues than other variables, then predictor importance estimates using standard CART willcertainly suffer from variable selection bias. As in such circumstances the CART splittingalgorithm will prefer splitting continuous predictors with many distinct values over thosewith fewer distinct values. Such a selection can sometimes be spurious and mask moreimportant predictors that have fewer levels, such as categorical predictors. The undesir-able property of CART stems from the exhaustive search approach taken to find the bestpredictor and splitting point. Consider the splitting criterion in (3.27), suppose that X1

and X2 are two ordered predictors with n1 and n2 distinct values, with n1 >> n2. Allother things being equal, X1 has a higher chance of being selected than X2. Also, notethat an ordered variable with m distinct values has (m − 1) splits of the form X ≤ c,and an unordered variable with m distinct unordered values has (2m−1 − 1) splits of theX ∈ S. Therefore, if everything else is equal, variables that have more distinct values havea greater chance to be selected, hence the variable selection bias [32].

Recognising this issue, Loh [33] introduced an algorithm called GUIDE that is specif-ically designed to eliminate the variable selection bias. The algorithm uses a two-stepapproach based on chi-squared significant tests to split each node in the tree constructionprocess. First, each predictor Xj is tested for association with response Y in what Loh[33] termed as curvature test. Then for each pair of variables (Xi, Xj) an interaction testis employed where the interaction test assesses the association between predictor variablesXi and Xj with respect to Y . The null hypothesis in both tests are no association. In thesecond step of the two-step approach, the the most significant predictor from either thecurvature test or the interaction test is selected as splitting variable. Then, an exhaustivesearch is performed for the set S. As every predictor Xj has the same chance of beingselected if each is independent of Y , this approach is effectively free of selection bias [32].In addition to achieving unbiased variable importance estimates, the GUIDE algorithmincreases the computational efficiency as the search for S is carried out only on the selectedvariable Xj . The full procedure as describe by Loh [33] in his paper is outlined below.

GUIDE algorithm: Chi-square tests for constant fit [33]:

1. Obtain the residuals from a constant model fitted to the Y data.

2. Curvature test: For each numerical-valued variable, divide the data into fourgroups at the sample quartiles; construct a 2 x 4 contingency table with the signsof the residuals (positive versus non-positive) as rows and the groups as columns;count the number of observations in each cell and compute the χ2-statistic and itstheoretical p-value from a χ2

3 distribution. This is referred to as a curvature test.

3. Do the same for each categorical variable, using the categories of the variable toform the columns of the contingency table and omitting columns with zero columntotals.

4. Interaction test: To detect interactions between a pair of numerical-valued vari-ables (Xi, Xj), divide the (Xi, Xj)-space into four quadrants by splitting the rangeof each variable into two halves at the sample median; construct a 2 x 4 contingencytable using the residual signs as rows and the quadrants as columns; compute theχ2-statistic and p-value. Again, columns with zero column totals are omitted. Thisis referred to as an interaction test.

5. Do the same for each pair of categorical variables, using their value pairs to dividethe sample space. For example, if Xi and Xj take ci and cj values, respectively, theχ2-statistic and p-value are computed from a table with two rows and number ofcolumns equal to cicj less the number of columns with zero totals

43


6. For each pair of variables (Xi, Xj) whereXi is numerical-valued andXj is categorical,divide the Xi-space into two at the sample median and the Xj - space into as manysets as the number of categories in its range (if Xj has c categories, this splits the(Xi, Xj)-space into 2c subsets); construct a 2 x 2c contingency table with the signs ofthe residuals as rows and the subsets as columns; compute a χ2-statistic and p-valuefor the table after omitting columns with zero totals.

7. Selecting splitting variable: If the smallest p-value is from a curvature test, it isnatural to select the associated Xj variable to split the node. If the smallest p-valueis from an interaction test, we need to select one of the two interacting variables. Wecould choose on the basis of the curvature p-values of the two variables, but becausethe goal is to fit a constant model in each node, the choice of variable is based onreduction in SSE. If both variables from the interaction test are numerical-valued,we use CART exhaustive search approach on the two interacting variables to findthe variable that yield the smaller total SSE (see (3.27) and (3.28)). Otherwise, ifat least one variable is categorical, the one with the smaller curvature p-value isselected.

Using the GUIDE algorithm as described by Loh [33] unfolds in the following node split-ting rules described below. These node splitting rules will in this paper partially replacethose of the standard CART.

Node splitting rules using GUIDE [55]:

1. For observations in node t, conduct curvature tests between each predictor and theresponse, and interaction tests between each pair of predictors and the response.

If all p-values are at least 0.05, then do not split node t.

If there is a minimal p-value and it is the result of a curvature test, then choosethe corresponding predictor to split node t.

If there is a minimal p-value and it is the result of an interaction test, then choosethe split predictor using standard CART on the corresponding pair of predictors.

If more than one p-value is zero due to underflow, then apply standard CART tothe corresponding predictors to choose the split predictor.

2. If a certain predictor is chosen, then apply standard CART to decide the cut point,see (3.27) and (3.28).

3.2.2.5 Handling missing data with surrogate split predictors

A nice feature of the Random Forest method is its ability handle partially missing datapoints, where we at a given time point do not have observation for all predictor variables.Constructing a RF with so called surrogate splits enable us to train a RF with partiallymissing data points, and also predict with partially missing data. When we have a missingdata point for a particular predictor we use the best surrogate split predictor to split a treein the construction process, or if are making a prediction we use the observation of thebest surrogate split predictor to orient us within the individual trees in the tree ensemble.If the value of the best surrogate split predictor for a certain observation is also missing,then we simply use the second-best surrogate split, we that value is also missing we use thethird best etc. These candidate surrogate splits are selected and sorted in a descendingorder by their predictive measure of association, a measure that indicates the similaritybetween decision rules that split observations. Among all possible decision splits that are

44


compared to the optimal split found by constructing a tree, the best surrogate decisionsplit yield the maximum predictive measure of association. The second-best surrogatesplit has the second-largest predictive measure of association etc.

Predictive measure of Association [55]Suppose that xj and xk are predictor variables j and k, respectively, and j 6= k. At nodet, the predictive measure of association between the optimal split xj < u and a surrogatesplit xk < v is,

λjk =min(PL, PR)− (1− PLj ,Lk − PRj ,Rk)

min(PL, PR), (3.36)

where λjk is a value in (−∞, 1]. If λjk > 0 then xk < v is a worthwhile surrogate forxj < u.

• PL is the proportion of observation in node t, such that xj < u. The subscript Lstands for the left child node t

• PR is the proportion of observations in node t, such that xj ≥ u. The subscript Rstands for the right child node t

• PLj ,Lk is the proportion of observations at node t, such that xj < u and xk < v.

• PRj ,Rk is the proportion of observations at node t, such that xj ≥ u and xk ≥ v.

• Observations with missing values for xj or xk do not contribute to the proportioncalculations

3.2.2.6 Prediction intervals using quantile regression forest

Along with our OOB-predictions we would like to have the confidence intervals for thosepredictions, for our RF model this can be attained using Quantile Regression Forest.Introduced by Meinshausen [37], the quantile regression forest utilises the fact that theRF provides information about the full conditional distribution of the response variable,and not only about the conditional mean. This information can be used to build predictionintervals and detect outliers in the data. The conditional distribution function F (y|X = x)is given by the probability that, for X = x, Y is smaller than y ∈ R,

F (y|X = x) = P (Y ≤ y|X = x).

For a continuous distribution function, the α-quantile Qα(x) is then defined such that theprobability of Y being smaller than Qα(x) is, for a given X = x, exactly equal to α,

Qα(x) = inf{y : F (y|X = x) ≥ α}. (3.37)

In quantile regression the prediction intervals are given as

I(x) = [Qα(x), Q1−α(x)]. (3.38)

The length of the prediction intervals reflect the variation of new observations aroundtheir predicted values. Meinshausen [37] show that quantile regression forest are, undersome reasonable assumptions, consistent for conditional quantile estimation. Followingthe notation of Meinshausen [37], the conditional distribution function of Y , given X = x,is given by

F (y|X = x) = P (Y ≤ y|X = x) = E(IY≤y|X = x), (3.39)

45


where the last expression is analogous to the random forest approximation of the condi-tional mean E(Y |X = x). Recall that the E(Y |X = x) is approximated by a weightedmean over the observations of Y (see 3.31 & 3.25), define an approximation to E(IY≤y|X =x) by the weighted mean over the observations of IY≤y,

F (y|X = x) =n∑i=1

wi(x)IYi≤y (3.40)

where wi(x) are the same weights as for the random forest. Using Meinshausen [37]notation, recall that the prediction of a single tree T (θ) for a new data point X = x isobtained by averaging the observed values in a leaf l(x, θ), i.e. the observed values inone of the final partitions that corresponds to the value X = x. Here, θ is the randomparameter vector that determines how a tree is grown. The weight vector of a single treewi(x, θ) is given by

wi(x, θ) =IXi∈Rl(x,θ)

#{j : Xj ∈ Rl(x,θ), (3.41)

where the prediction of a single tree is u =∑n

i=1wi(x, θ)Yi. Using random forests, theconditional mean E(Y |X = x) is approximated by the averaged prediction of k singletrees, each constructed with an i.i.d. vector θt, t = 1, . . . , k, as follows. Let wi(x) be theaverage of wi(θ) over the collection of the single trees,

wi(x) =1

k

k∑t=1

wi(x, θt), (3.42)

then the random forest approximation of the conditional mean of Y given X = x is givenas uRF (x) =

∑ni=1wi(x)Yi, as highlighted in equation 3.40 above.

Now to the main point of this subsection, the estimates Qα(x) of the conditional quantilesQα(x) are obtained by plugging F (y|X = x) instead of F (y|X = x) into equation 3.37 asfollows,

Qα(x) = inf{y : F (y|X = x) ≥ α}. (3.43)

The key difference between quantile regression forest and standard random forests is thatfor each node in each tree, the random forest keeps only the mean of the observationsthat fall into this node and neglects all other information. In contrast, quantile regressionforests keeps the value of all observations in this node, not just their mean, and assessesthe conditional distribution based on this information. Implementation-wise we estimatethe F (y|X = x) using following simple procedure,

Algorithm for computing the estimating F (y|X = x), [37]

• Grow k tree T (θt), t = 1, . . . , k as in random forest procedure. However, for everyleaf (final partition) of every tree, take note of all observations in this leaf, not justtheir average..

• For a given X = x, drop x down all trees. Compute the weight wi(x, θt) of obser-vation i ∈ {1, . . . , n} for every tree as in 3.41. Compute the weight wi(x) for everyobservation i ∈ {1, . . . , n} as an average over wi(x, θt), t = 1, . . . , k, as in equation3.42.

• Compute the estimate of the distribution function as in equation 3.40 for all y ∈ R,using the weights from the second step.

46


3.2.3 Supervised learning with sequential time-series data

Time series is a type of sequential data where its feature values are changing as a functionof time. In many cases time-series exhibits autocorrelation between points in time, wherethe value of future observations are dependent on observations from the past. Many ofthe supervised learning techniques, including the Random Forest method, are designed totreat each observation in a data set independently of each other. However, much like theissue with clustering of time-series data discussed in 3.1.5, due to the time dependencywithin time series one cannot simply treat the observations as independent data points.The unique characteristics of time series data are barriers that can, if not accounted for,result in substandard supervised learning performance as models fail to capture the timedependency of the data. There are methods that are specifically designed for sequentialdata, Bishop [4] provides examples of suitable methods including higher order MarkovChains, Hidden Markov models, Kalman-filters, Neural Networks, and ARMA-models.In a review study, Dietterich [13] provides additional examples of methods for sequentialsupervised learning, amongst the methods covered are maximum entropy Markov models,conditional random fields, graph transformer networks, and sliding window. Of these, themethod that is of interest for this thesis is the sliding window method. The reason for thisis that sliding window is compatible with most supervised learning algorithms and is ableto convert sequential supervised learning problems into the classical supervised learningproblems [13], enabling standard supervised learning algorithms to possibly capture thetime-dependency of the time-series data.

3.2.3.1 Sliding window method

A sequential supervised learning problem can be formulated as follows. Let {(xi, yi)}Ni=1

be a set of N training examples. Each example is a pair of sequences (xi, yi), where xi =〈xi,1, . . . , xi,Ti〉 and yi = 〈yi,1, . . . , yi,Ti〉. In a sequential learning problem we want givenan input sequence xk = 〈xk,1, . . . , xk,Tk〉 predict the whole sequence yk = 〈yk,1, . . . , yk,Tk〉,i.e. we want to predict all the values yk,1, . . . , yk,Tk simultaneously and for the full timeperiod T . This is a fundamentally different problem to that of time series prediction,where the task is to predict the t+1st element of a sequence 〈y1, . . . , yt〉 using 〈x1, . . . , xt〉,where t < T . In other words, in sequential supervised learning we have the entire sequence〈x1, . . . , xT 〉 available for prediction, whereas in time series we only have a prefix of thefull sequence up to current time 〈x1, . . . , xt〉, t < T . Furthermore, in time series analysis,we have the true observed y values up to current time t, whereas in sequential supervisedlearning, we are not given any y values and must predict them all.

In a sequential learning problem, the sliding window method converts the sequential train-ing example (xi, yi) into sub-sequences, called windows, of a selected width w. More pre-

cisely, let d = (w−1)2 be the half-width of the window, then the method creates a window

〈xi,t−d, xi,t−d+1, . . . , xi,t, . . . , xi,t+d−1, xi,t+d〉 which is used for the prediction of yi,t. Ineffect, the original input sequence xi is padded on each end by d null values and then con-verted into Ni separate examples, these are then compatible with any supervised learningalgorithm and are used to predict the value yi,t of the full sequence yi. The sliding windowmethod in a time series setting is applied differently due to the fundamental differences be-tween sequential supervised learning problems and time-series prediction problems. Sincethe predictor time-series (xt, xt−1, . . . , x1) is available to us only up to the current timet < T , the sliding window is applied backward-looking. For instance, consider the two pre-dictor time-series (x1,x2), applying a backward-looking sliding window of width w yieldsthe following sub-sequence for time t, (x1,t, x1,t−1, . . . , x1,t−w, x2,t, x2,t−1, . . . , x2,t−w). In

47


this case, the original predictor matrix Xn,p is padded on the end by the full windowwidth w and is expanded in column dimension by p · w, resulting in a new predictormatrix Wn−w,p(1+w). Below is an illustration of this sliding window transformationXn,p → Wn−w,p(1+w). Assume that n = 5, p = 2, and that we are applying a slidingwindow of width w = 2, then

Original predictor matrix Sliding windowx1 x2 transformed matrix

X5,2 =

x1,1 x2,1

x1,2 x2,2

x1,3 x2,3

x1,4 x2,4

x1,5 x2,5

→ W3,6 =

x1,1 x1,2 x1,3 x2,1 x2,2 x2,3

x1,2 x1,3 x1,4 x2,2 x2,3 x2,4

x1,3 x1,4 x1,5 x2,3 x2,4 x2,5

w1 w2 w3 w4 w5 w6

Here the jth index in xi,j indicate time from the most recent observation. Recall that intime series problems we also have the observed response values (yt−1, yt−2, . . . , y1) up totime t−1 available to us. Depending on the problem at hand these observed response valuescan be used for the prediction of yt. To incorporate these in a standard supervised learningalgorithm we perform the sliding window method on (yt−1, yt−2, . . . , y1) and append themto Wn−w,p(1+w).

48

Chapter 4

Methodology

This chapter describes the method and work process of the main analysis of this thesis.It contains three distinct parts, data processing & transformation, time series clustering,and Random Forest regression. Each part can be implemented and modified in variousways, and alterations to the implementation method described below will most certainlyyield different results. Thus, the reader should keep in mind that the method describedin this chapter is only one of several ways to implement our selected tools for our desiredanalysis. The specific work process described here has been put together according to ourtheoretical framework in chapter 3.

4.1 Data processing and transformation

4.1.1 The data

The time series data used in our analysis are market observable asset prices and indicesthat we believe can be of predictive value for the excess return of high yield corporatebonds. All the time series data is gathered from Bloomberg Terminal, which is a financeindustry standard software that has large and rich data bases of everything finance re-lated. The majority of the data gathered is daily time series from the date 2000-01-03 to2017-02-12.

Response variableThe response variables in our analysis are seven different time series of the high yield in-dustry excess return that corresponds to the seven high yield industries under study. Toclarify, we will in our analysis perform seven separate regressions, one on each industry,where we focus solely on one industry at a time. The excess return data is from 2002-02-19to 2017-02-14 and includes the following industries.

49


HY Industry Ticker Data type Start date Resolution

Chemical Excess return index 2002-02-19 DailyMetals Excess return index 2002-02-19 DailyPaper Excess return index 2002-02-19 Daily

Building Materials Excess return index 2002-02-19 DailyPackaging Excess return index 2002-02-19 Daily

Telecommunications Excess return index 2002-02-19 DailyElectric Utility Excess return index 2002-02-19 Daily

Table 4.1: High yield excess return time series data by industry

One should keep in mind that this excess return data is given as monthly cumulativeexcess return, meaning that each daily data point tracks the cumulative excess return forthat specific month, and the value of this cumulative excess return resets each new month.For the purpose of this thesis we transform this cumulative data into daily continuousexcess return through simple algebraic operations.

Initial input time seriesThe time series used in our analysis are prices and index levels from a wide range of assetclasses and indices that we believe can affect the excess return of high yield corporatebonds. Combined, these time series gives a holistic view of the situation in the financialmarkets. The asset classes and indices included are swap rates, swap spreads, commodi-ties, equity indices, various volatility indices, credit spreads, major FX currencies, andeconomic surprise indices. Some of these categories contain several time series reflectingdifferent regions, underlying assets, or different time-to-maturity aspects. For instance,the swap rates category contain the swap rates indices for EUR, USD, JPY and GBP,and each with 9 different time to maturities. The credit spreads category contain creditspreads for regular corporations as well as for financial corporations. A comprehensive listof the included time series are given below.

50


Category # of time series Data type Start date Resolution

Swap rates EUR 13 Rates 2000-01-03 DailySwap rates USD 9 Rates 2000-01-03 DailySwap rates GBP 9 Rates 2000-01-03 DailySwap rates JPY 9 Rates 2000-01-03 Daily

Swap spreads EUR 3 Rates spread 2007 / 2008 DailySwap spreads USD 3 Rates spread 2000-01-03 DailySwap spreads GBP 3 Rates spread 2000-01-04 DailySwap spreads JPY 3 Rates spread 2000-01-04 Daily

Oil 1 Commodity 2000-01-04 DailyCopper 1 Commodity 2000-01-04 DailyIron ore 1 Commodity 2013-10-18 Daily

Equity Indices 7 Equities 2000-01-04 Daily

US bond volatility 1 Volatility index 2000-01-03 DailyEuropean stock volatility 1 Volatility index 2000-01-03 Daily

US stock volatility 1 Volatility index 2000-01-03 DailyFX currency volatility 5 Volatility index 2000-01-03 Daily

EU corp. credit spreads 2 Credit spread 2004-06-16 DailyUS corp. credit spreads 2 Credit spread 2011-09-09 DailyEU subfin. credit spread 1 Credit spread 2004-06-21 Daily

Major FX currencies 14 Currency 2000-01-03 Daily

Economic surprise index 8 Macro data index 2000 - 2004 Daily

Table 4.2: A list of the initial input time series used in our analysis

4.1.2 Transformations

Swap rates to yield curveAs can be seen in the list of input time series data, the swap rates data category includeseveral number of time series. The reason for the many number of time series is that each”regional” swap rate category contain the swap rates for at least for 9 different maturities.In order to simplify the swap rates data and reduce the total number of input time series,we transform the ”regional” time series into a ”regional” yield-curve comprising only ofthree new time-series. The information contained in a yield-curve, in form of slope andcurvature, is of greater informational value than the ”raw” swap rates time series.

We use the Diebold and Li’s three-factor model for fitting a yield curve to transformour swap rates into a yield curve [12]. The model assumes that a yield curve can be fittedusing the three-factor model

yt(τ) = β1,t + β2,t(1− e−λtτ

λtτ) + β3,t(

1− e−λtτ

λtτ− e−λtτ ) (4.1)

where τ denotes the tenor. The above model allows the factors to be interpreted in thefollowing way, β1 corresponds to the level of the yield curve (long term), β2 correspondsto the slope (short-term), and β3 corresponds to the curvature (medium term). The λgoverns the exponential decay rate of the model, and determines the maturity at whichthe loading on the curvature is maximised. For a more thorough overview of the modelsee Diebold and Li [12].

Using the Dibold and Li’s three-factor model on the ”regional” swap rates time seriesdata, we essentially transform the several time series in each region into corresponding beta

51


factors {β1,t, β2,t, β3,t} for each region. These betas evolve in time and reflect the impliedswap rate curve for a certain region. These betas have sound economic interpretationsand will replace the ”raw” swap rates data in our analysis. Thus, we will in this thesisuse the absolute level of the swap rate curve (β1,t), the slope of the curve (β2,t), and thecurvature of the curve (β3,5). In effect, we have reduced the total number of time seriesrelated to swap rates from 40 to 12, the swap rates time series have been replaced withthe new swap curve factors time series as illustrated below.

Old time series # time series New time series Type Resolution

EUR b1 Swap curve level DailySwap rates EUR 13 EUR b2 Swap curve slope Daily

EUR b3 Swap curve curvature Daily

USD b1 Swap curve level DailySwap rates USD 9 USD b2 Swap curve slope Daily

USD b3 Swap curve curvature Daily

GBP b1 Swap curve level DailySwap rates GBP 9 GBP b2 Swap curve slope Daily

GBP b3 Swap curve curvature Daily

JPY b1 Swap curve level DailySwap rates JPY 9 JPY b2 Swap curve slope Daily

JPY b3 Swap curve curvature Daily

Table 4.3: Swap rates transformed into swap curve β1,t, β2,t and β3,t

52


4.1.3 Filters

The input time series presented in the previous subsection are all absolute values of eitherasset price or index levels. It is of interest for our analysis to filter some of these absolutevalue time series into daily percentage change, absolute value change, and relative levels.In order to avoid curse of dimensionality, we will confine our analysis to only these threesimple filters. Also, these filters will not be applied on all the time series, only on thosetime series where the filters make good sense to use. The next subsection (4.1.4) providesa comprehensive list on what filters have been used on which time series.

Denote Xj as the absolute value of an asset or an index at time j, then these simplefilters are given as

Daily percentage change (DoDpct)

DoDj =Xj −Xj−1

Xj−1(4.2)

Daily absolute value change (AbsChg)

Abs chgj = Xj −Xj−1 (4.3)

Relative level (RelLev)

We define the relative level filter as the ratio between an asset’s price today comparedto the its past d-day average price.

RelLevj(d) =Xj

Avg(Xj−1, Xj−2, . . . , Xj−d)(4.4)

In this thesis we will use 100 day relative level filter, i.e. d = 100. This filter has sensibleeconomic interpretation and can be useful for identifying market sentiment, for instancewhen applied on volatility data.

4.1.4 Initial predictor variables

After implementing the transformations and using the filters on the input time series (4.2),we arrive at the 154 initial predictor variables that will be used as input for our statisticallearning tools.

53


Category Filters # of time series Data type

Swap curve factors∗ AbsVal, AbsChg, DoDpct 28∗ Rates

Swap spreads AbsVal, DoDpct 24 Rates spread

Oil AbsVal, RelLev, DoDpct 3 CommodityCopper AbsVal, RelLev, DoDpct 3 Commodity

Iron ore∗∗ AbsVal, ReLev, DoDpct 3∗∗ Commodity

Equity Indices DoDpct 7 Equities

US bond volatility AbsVal, DoDpct, RelVal 3 Volatility indexEuropean stock volatility AbsVal, DoDpct, RelVal 3 Volatility index

US stock volatility AbsVal, DoDpct, RelVal 3 Volatility indexFX currency volatility AbsVal, DoDpct, RelVal 15 Volatility index

EU corp. credit spreads AbsVal, DoDpct, RelVal 6 Credit spreadUS corp. credit spreads AbsVal, DoDpct, RelVal 6 Credit spreadEU subfin. credit spread AbsVal, DoDpct, RelVal 3 Credit spread

Major FX currencies AbsVal,DoDpct, RelVal 42 Currency

Economic surprise index AbsVal 8 Macro data index

Total # times series 154∗

Table 4.4: A comprehensive list of the processed and transformed time series used as inputin our statistical analysis. *) No filters are used on the curvature factor β3. **) Iron oreis eventually discarded in the main analysis due to too few data points.

4.1.5 Dealing with missing data points

Some of the predictor time series in table (4.1.4), in particular credit spreads and EURswap spreads, are only available from year 2004 and 2007 respectively, while the rest of thetime series considered are from year 2000. This fact creates several ”missing data points”if we train our statistical models with data all the way from year 2000. In practice thisimplies that the predictor matrix M , which is created by merging all the time series andsort them according to date, will have several NaN-values. This can seriously distorts thetrue prediction accuracy of our model and thus we should pose some restriction on howmany missing data points (or NaNs) are allowed in our input predictor matrix M . Theselected restriction imposed is that each row in the predictor matrix M should not containmore than 50% of its values as NaNs. This restriction gives us a new training matrix MR

that will be used in our regression analysis. In total MR considers 3142 days (∼ 12.4 years)of data that ranges from 2004-06-21 to 2017-01-15. In our regression implementation wewill also consider an input matrix MR2 with considerably fewer missing data points. Theinput matrix MR2 have the same number of observation ( 12.4 yrs) covering same timeperiod from 2004-06-21 to 2017-01-15, the main difference compared to MR1 is that wehave removed the time series that were the main cause of the missing values in the trainingmatrix MR1. Accordingly, in MR2 we removed all time series that did not have registeredvalues prior year 2005. Looking at our table of initial input time series data in 4.2, in MR2

we essentially removed US corporate credit spreads (start year 2011), EUR swap spreads(start year from 2007 and 2008), the Chinese currency CHN (start year 2010), and theUSD-CHN FX volatility index USDCNHV1M (start year 2011).

This issue of missing data points in the input matrix M is a more severe issue whenit comes to clustering. One can essentially not perform hierarchical clustering if the timeseries are not of equal length. While there are ”tricks” that to some degree bypass this

54


issue, e.g. extrapolate values or setting the missing values as mean or medians of theseries, they don’t fully remedy the issue. Using such ”tricks” will most certainly resultin invalid clusterings. Thus, when it comes to clustering implementation we will consideran input matrix MC which has no missing values, such matrix has 983 days under studyfrom 2012-12-27 to 2017-01-12, and will only be used in the clustering procedure.

4.2 Two-phase implementation procedure

To briefly recap on the goal of this thesis and what the method employed aims to achieve.We want to predict the excess return of high yield corporate bond industry indices. Todo this, we gather several market observable time series that we believe are of predictivevalue for this excess return. Different time series will have different predictive value forthe different high yield industries, therefore we include all relevant time series data andtheir relevant filtered series, and let our modern statistical learning tools decide whichare the most influential for each industry in terms of predictive value. The aim is toarrive at the 10 most influential predictors of excess return for each different high yieldcorporate bond industry. Then using these 10 most influential predictors to predict futureexcess return. It is an exercise in dimensionality reduction techniques and prediction usingnon-conventional regression methods.

4.2.1 Unsupervised time series clustering

In the first part of our two-phase implementation we want reduce the dimension of ourinput training matrix M , which contains our initial predictor time series. In essence, wewant reduce the total number of predictor time series that will be used for our regressionmodel. The reason for this is two-fold, one related to the number of data points andthe other related to characteristics of the data variables. Firstly, when applying machinelearning techniques one would ideally want many more data points than variables used,in our m-by-n matrix M where m is number of data points and n number of predictors,this translates to m >> n. The reason is that having the number of data points mapproaching the number predictors n creates sparsity in the dataset, which implies thatthere is not enough data points to produce statistically significant results, this is in ourcase primarily due to too many predictors n. In literature this phenomenon is called curseof dimensionality. Clearly, with an input matrix MR2 consisting of n = 154 predictors overm = 1108 number of data points (days), we would achieve higher statistically significanceresults if we decreased the number of predictors n.

Secondly, considering the characteristics of our input predictor variables, many of themare filters of the same underlying time series, and some of the original raw time seriesexhibit significant correlation to each other (a common trait of some financial assets, suchas rates and currencies). Thus, using all the 154 predictors as input to our regressionmodel would result in a model with severe multicollinearity. Therefore, in order to avoidthis and the curse of dimensionality, we perform time series clustering with the goal ofselecting predictor variables that are diverse enough, and have the strongest correlationto our subject under study (HY industry excess return).

4.2.1.1 Implementation procedure

The idea behind using time series clustering for dimensionality reduction of our inputpredictor space is to let our unsupervised time series clustering technique decide whatnatural division there is in our time series data, i.e. what natural clusters there exist.

55


Then, considering this natural division of our data, we pick from each cluster a certainset of predictors that will be used as input in our regression model. This procedurewill ensure that we utilise a diverse set of the reduced input space. According to ourtheoretical framework (3.1.5), we will be using Agglomerative Hierarchical Clustering dueto its superior applicability on time series data. Furthermore, considering the discussionin (4.1.5) we cannot use a input predictor matrix M that has missing values (NaN), theHierarchical clustering procedure requires time series of equal length. Thus, as inputtraining matrix for our time series cluster will be the MC introduced in (4.1.5). The MC

has no missing values and covers 983 days from 2012-12-27 to 2017-01-12.

4.2.1.2 Selecting distance metric

Since we are concerned with avoiding multicollinearity we will use a distance measure thatcaptures similarity in change, and suitable distance measures for that are the correlationdistances. Ideally, we would also want to use Dynamic Time Warping (DTW) distancemeasure, especially considering its successful application on time series data. However,as we are working with filtered time series data, where we have absolute values and dailypercentage change of the same raw time series, this distance measure would have a hardtime matching related time series of different filters. After all, DTW is a shape-basedmeasure that measure the absolute distance between the series, a percentage value willhave a large distance to an absolute value and thus will be rendered as ”very dissimilar”.As such, we will in our analysis confine ourselves to the correlation distances Pearson’s- andSpearman’s correlation distance presented in (3.8) (3.9). A key difference between thesetwo correlation coefficients is that the Pearson’s coefficient captures the linear correlationbetween the objects, while the Spearman’s coefficient is a non-parametric measure of rankcorrelation, i.e. statistical dependence between the rank ordering of two variables. TheSpearman’s coefficient assesses how well the relationship between two variables can bedescribed by a monotonic function.

4.2.1.3 Selecting Linkage Criterion

Another important consideration of Hierarchical time series clustering is the choice ofthe Linkage method (see 3.1.3.1), in Hierarchical clustering this essentially equates tothe choice of a cluster prototype. According to the theoretical framework on time seriesclustering prototypes (3.1.5.4), the quality of the clustering of time series is highly sensitiveto the choice of the prototype, i.e. the linkage criterion used. Subsection (3.1.3.1) presentsa wide range of common linkage methods, however, given the high sensitivity of the qualityof the output clustering we should be cautious with the selection of the linkage criterion.To our help, we have the Cophenetic correlation coefficient presented in (3.1.3.2) to guideus to an appropriate linkage criterion. Recall that the CCC is a technique for comparingdifferent cluster solutions, it is a measure of how faithfully the resulting cluster representthe pairwise distances between the original unmodelled data. Thus, we can with thehelp of the CCC compare clustering results from different linkage methods and distancemeasures, and decide according to the value of the CCC the linkage and distance methodthat most faithfully represents the underlying data.

4.2.1.4 Finding the natural clusters in the input data

Having selected an appropriate linkage criterion and a distance measure for our Hierar-chical clustering implementation, our next concern is to find the natural cluster divisionin our nested hierarchical tree structure. Recall that the Hierarchical clustering method

56


has as output a nested tree-structure that represents hierarchical relationship between theclusters. That is, the output of hierarchical clustering is not a fixed ”final” ”true” cluster-ing of the data, it is just a structure depicting the hierarchical relationship between thedata and the clusters containing the data. Indeed, one of the drawbacks of hierarchicalclustering is that it will always create the hierarchical relationship structure even if thedata has no structure at all and is only random noise. To analyse if there are distinct nat-ural clusters in the data, we turn our focus to the Dendrogram (3.1.3.3). The hierarchicalrelationship between clusters is visualised in the dendrogram, and we can analyse it to findthe natural division of the data. Recall that ”cutting” a dendrogram defines the final clus-tering, thus, our main objective is to identify where to ”cut” the dendrogram such that weachieve the natural division of the data. For this task we use the Inconsistency coefficient(3.1.3.4) which compares the height of the links in the dendrogram. This coefficient issensitive to sudden changes in the cophenetic distance along the direction of the buildingup of the hierarchical structure, it identifies the divisions where the similarities betweenobjects change abruptly and is therefore suitable for finding boundaries between distinctgroups of data (see 3.1.3.4 for more details). The inconsistency coefficient guides us whereto ”cut” the dendrogram, and do note that this ”cut” does not necessarily correspond toa horizontal cut of the dendrogram.

4.2.1.5 Evaluating and assessing our final clustering

After finding the natural division of our data we have attained our final clustering. Toevaluate the quality of this clustering, we use measures of internal criterion of cluster qual-ity (3.1.4). A high internal criterion of cluster quality is attained when we have maximisedinter-cluster similarity (distances) and minimised the intra-cluster similarity (distances).These characteristics are reflected in the Davies-Bouldin index and the Calinski-Harabaszindex (3.1.4), and these indices will be used to evaluate our final clusters. More precisely,we will use the DB and CH indices for various ”cuts” of the dendrogram, the optimal(best) value of these indices should, if our final clustering is good, coincide with our final”natural cut” using the inconsistency coefficient.

The external approach to evaluating the final clustering, using external benchmarkor true labels, will not be considered. Because if we had true clustering labels, or goodexternal benchmark, then there would be no need to do unsupervised clustering in the firstplace. But of course, a subjective sanity check of the resulting clustering will be performedto see if the clustering makes sense, e.g. if credit spreads are in the same clusters etc.

4.2.1.6 Selecting predictor variables for our regression model

When we have our final clustering our last task of this phase is to select from each resultingcluster a certain number of variables to use in our regression model. The selection ofthese variables will be based on how correlated they are with the response variable (HYindustry excess return), i.e. we select the variables from each cluster that have the highestcorrelation to the response variable under study. We could choose a specific number ofvariables to extract from each cluster, but as we beforehand do not know the exact numberof final clusters it is more reasonable to select a certain percentage of the total variablesin each cluster. Furthermore, by specifying a certain percentage of variables to extract,we are indirectly specifying how large the dimensional reduction of the predictor spaceshould be. In this implementation we choose to select 25% of the variables in each clusteras our input variables for our regression model.

The measure of correlation between the predictor variables and the response variableunder study will be the Pearson’s- and Spearman’s correlation coefficient (see 3.8). Note

57


that we have seven response variables under study, thus, we will have seven differentregression models and seven different sets of predictor variables, one set for each regressionmodel.

4.2.2 Random Forest regression

4.2.2.1 Implementation procedure

Having reduced the predictor space through the unsupervised time series clustering pro-cedure described in the previous subsection, we now turn our attention to Random Forest(RF) regression and the prediction of the HY industry excess return. In this thesis we willtrain our RF models to predict the excess return for T = [1, 3, 5, 10] days in the future.Thus, we will consider several different RF regression corresponding to each HY industrygroup under study and each forecasting days. The input to our RF regression models arethe predictor variables extracted from the various clusters in the time series clusteringprocedure. Since we are extracting the variables from each cluster based on their correla-tion to the response variable under study, the set of extracted predictor variables will bedifferent for different HY industry group.

Our regression analysis for each individual HY industry and each forecasting day Tgoes as follows. We will in each such analysis perform two RF implementations for thespecific choice of HY industry and forecasting days. We first implement a RF model usingthe extracted set of predictor variables from the time series clustering procedure. Westudy the performance of that model and through out-of-bag predictor variable importanceestimates we identify the ten most influential predictors of excess return. These ten mostinfluential predictors are then used in a new final RF regression. This double RF regressionis performed for each HY industry, and for each forecasting days considered.

The input predictor matrix that will be used for the regression analysis is primarilythe regression training matrix MR2 (see 4.1.5), it covers 3142 days of data from 2004-06-22to 2017-01-15, it has very few missing values as time series with no values prior to year2005 has been removed. To evaluate how the RF method is capable of handling severalmissing values through surrogate splits we will also run the regression the larger trainingmatrix MR1, which covers the same time-period of 3142 days from 2004-06-21 to 2017-01-15. Compared to the training matrix MC used for clustering, which had no missingvalues, and the MR2, which have only a few missing values, the larger matrix MR containseveral missing values on certain time series.

4.2.2.2 Implementation settings for the Random Forest algorithm

In accordance with the theoretical framework on the Random Forest supervised learn-ing method (3.2.2), we will be implementing Breiman’s version of the Random Forestalgorithm with a slight adjustment to the way each individual tree base learner is con-structed. More precisely, we will use Breiman’s approach of random feature selection andbootstrap aggregation (bagging), however, we will disregard the CART-method for grow-ing decision trees in favour of the so called GUIDE-method. The reason for this is thatthe CART-method posses a selection bias when deciding the variables to split the treeconstruction process (see 3.2.2.4). This phenomenon will, if unaccounted for, yield biasedvariable (predictor) importance estimates. Thus, in order to avoid bias in our variableimportance estimates we employ the GUIDE-method (3.2.2.4) which uses a curvature testand a interaction test to make an unbiased selection of variables to split (construct) a tree.These aforementioned tests are essentially chi-squared tests that test for association of a

58


predictor with the response (curvature), and association amongst the predictor variables(interaction test).

When growing each individual decision tree, we will as our random feature selectionsetting only allow a third np/3 of the full number np of predictors to be considered assplit candidates at each step of the tree construction. The stopping criterion for the splitprocess is a minimum leaf size of 5 observation (MinLeafSize=5), i.e. each leaf should atleast contain 5 observations. To account for some missing data points in our input trainingmatrix M we will apply the RF algorithm with surrogate splits (see ref). We will in totaltrain 400 decision tree learners in each RF implementation, and use these 400 decisiontrees to make a single continuous-valued prediction of the excess return.

4.2.2.3 Forecasting days and the response variable

The response variable in our RF regression analysis is given in percentage excess return,meaning that a response variable value of 0.1 equates to 0.1%. Furthermore, as mentionedat the beginning of this subsection, we will train our RF models to predict the excessreturn for T = [1, 3, 5, 10] days in the future. Practically, this entails changing for eachforecasting day T the target response variable our RF model is trying to fit. That is, foreach different T forecasting days we are targeting to forecast the excess return for, we willchange the training response variable accordingly. We will be using the cumulative T -dayexcess return as our target response variable in the training phase of the RF,

Y cumi,T (T ) =

T+i∑j=i+1

Yj , (4.5)

where Yj is the daily excess return of a certain HY industry. Note that even though we usedifferent response training variable for each different forecasting day in our RF training,we still use the same daily input predictor variables (time series).

4.2.2.4 Evaluating model performance

Since we are using the Random Forest for regression a sensible evaluation metric is themean-squared error of our predictions. We will use the test mean-squared error MSEtest,which evaluates the generalisation error of our model and its performance on out-of-sampledata, i.e. data that were not used in the training (fitting) of the model. The out-of-samplemean-squared error is given by

MSEtest =1

n

n∑i=1

(Yi − Yi)2, (4.6)

where Yi is an out-of-sample data point. A nice feature of the Random Forest methodis that we can easily get an estimate of the MSEtest by simply considering the out-of-bagmean-squared error MSEOOB (see 3.2.2.3). The MSEOOB is a good approximation of thetrue MSEtest, and to ensure that is the case, we have verified this relationship in a 10 K-fold cross validation procedure. The closely related performance metric root mean-squarederror RMSEtest is given by simply taking the square root of the MSEtest

RMSEOOB ≈ RMSEtest =√

MSEtest. (4.7)

59


Furthermore, we will use an adjusted measure of the out-of-sample R2out coefficient of

determination to study the proportion of the variance in the dependent variable that ispredictable from the independent variables. The out-of-sample R2

out coefficient is given by

R2out = 1− SSres

SStot= 1− MSEtest

MSEN

where SStot =

n∑i=1

(yi − y)2, and SSres =

n∑i=1

(yi − yi)2.(4.8)

where y is the mean value of our response variable data and y is the our models out-of-bagprediction. The out-of-sample R2

out is positive when the RF regression has lower meansquare prediction errors than the forecast based on the historical average return.

It may be of interest to also understand how well our regression models are in theprediction of the direction of the excess return. That is, what percentage of the predicteddirection of excess return coincides with the actual true direction of excess return. Thismeasure will be summarised in a variable we call ACCdirr directional accuracy, it will usein this thesis to evaluate the improvement of various RF regressions, e.g. the improvementin directional accuracy when we apply the sliding window method.

Denote the number of positive and negative excess returns of our training responsedata as P and N respectively. Furthermore let

TP = True positive = # of Positive predictions that were correct

TN = True negative = # of Negative predictions that were correct

FP = False positive = # of Positive predictions that were incorrect

FN = False negative = # of Negative predictions that were incorrect.

(4.9)

We then have the following standard measures used in evaluation of binary classifiers.

Accuracy ACCdirr = (TP + TN)/(P +N)True positive rate TPR = TP/PTrue negative rate TNR = TN/N

Positive predictive value PPV = TP/(TP + FP )Negative predictive value NPV = TN/(TN + FN)

Table 4.5: Common evaluation measures for binary classifiers

4.2.2.5 Unbiased selection of the 10 most influential predictors

Recall that we perform two RF implementations for each HY industry and forecastingday. In each such analysis, after the first RF implementation our goal is to extract the 10most influential variables in terms of predictive value. To do this we use the out-of-bagpredictor importance estimates by permutation method described in (3.2.2.3), using thealgorithm (4). What that method essential does is that it studies how the model error(MSEOOB) is affected by permuting a certain predictor, the higher effect on the modelthrough permutation of certain predictor, the more influential it is in the prediction model.By the same token, if a certain predictor is not influential, then permuting its values shouldhave little to no effect on the model error. This predictor importance measure is calculatedusing the average of all differences in model errors when permuting a certain predictor,divided by the standard deviation of these differences, see (4) for more details. Since weused the GUIDE-method to construct our base learner trees, these estimates are unbiased,which implies that our selection of the 10 most influential predictors are also unbiased.

60


4.2.2.6 Extending the regression analysis with sliding window

It is a common trait of financial time series to exhibit some autocorrelation within theseries. Thus, to ”test” for autocorrelation and see if our RF model is able to capture it,we extend the RF analysis by incorporating the sliding window to our input data (see3.2.3.1). The sliding window extends our input predictor space by a factor correspondingto the number of sliding days Sdays consider. I.e. for each original predictor xi at time iwe will at time point i have additional Sdays number of predictors corresponding to theprevious values of xi, (xi−1, xi−2, . . . , xi−Sdays). In order to avoid a too large dimension ofthe predictor space we will confine the sliding window analysis on the 10 most influentialvariables from our earlier analysis. We will in total consider the sliding window up to 7sliding day.

Since we are dealing with a response variable that is also a financial time series, it is ofinterest to see if it exhibits some autocorrelation as well. Thus, we will also analyse howhistorical values of the response variable y affects our prediction performance. This impliesthat for sliding days Sdays we will at time i use (yi−1, yi−2, . . . , yi−Sdays) as additionalpredictor variables.

4.2.3 Testing for model significance

We will perform a significance test on our model performance using the sample mean-squared prediction error (MSPE). We will perform an approximately normal test for equalpredictive accuracy of two different and ”nested” RF regression models. That is, an one-sided test of the null hypothesis that the expected mean-squared prediction error from twodifferent nested RF models, say model A and model B, are equal. Against the alternativethat model B has a lower squared prediction error than model A. In this context, twonested models refers to the case where model A is a parsimonious model and model B isa larger model that nests model A, meaning that model B reduces to model A if some ofthe model B’s parameters are set to zero. The null is examined using the observed sampleof the two models’ MSPEs. Denote the sample mean-squared prediction error of a modelas σ, then we are examining the null using σ2

A − σ2B.

The purpose of the tests is to evaluate the statistical significance of our Random For-est performance measured in RMSEOOB. In our context, we will perform two such tests,one to assess the performance of the RF regression versus the historical average excess re-turn, and one test to assess the improvement of using the sliding window method with RF.These tests are inspired by Clark and West [9] Approximately normal test for equal pre-dictive accuracy in nested model, however, we do not use their proposed adjusted statisticsince are not estimating any β-parameters.

• MSPE test 1: RF regression versus historical average modelIn the first test we perform a one-sided test on the null hypothesis that expectedmean-square prediction errors from the historical average model and our RF regres-sion are equal, against the alternative that the RF regression model has a lowermean-squared prediction error than the historical average model. For each observa-tion, the historical average model predicts a constant corresponding to the historicalaverage excess return under study. Following the notation of Clark and West [9],we denote the sample mean-square prediction error of our RF as σ2

RF and of thehistorical average prediction model as σ2

avg. We are testing the null hypothesis byexamining σ2

avg − σ2RF , which in its full form is given as

ft+τ = (yt+τ − yH,t+τ )2 − (yt+τ − yRF,t+τ )2, (4.10)

61


where yH,t+τ denotes the prediction of the historical average model at time t for τdays in the future, and yRF,t+τ denotes the same prediction using our RF regression

model. The σ2H − σ2

RF is simply the average of ft+τ . In order to test for equal mean-

square prediction error we simply regress ft+τ on a constant and using the resultingt-statistic for a zero coefficient, i.e. the null that the historical average return modelis equally good as the RF regression model. In our one sided test, we reject the nullif this statistic is greater than a certain level corresponding to α = 0.01.

• MSPE test 2: Sliding window error improvement testRecall that the sliding window method expands the predictor space with new pre-dictors corresponding to the w-day historical values of existing predictors. That is,the sliding window method expands an existing training martix n×m by the size ofthe sliding window size w into a new training matrix of size (n−w)× (m ∗ (1 +w)),creating two nested models. In the second test, we test for the null that a RF modelwith sliding window applied has an equal mean-squared prediction error as a RFmodel without the sliding window method applied. The alternative is that a RFmodel with sliding window applied has a lower mean-squared prediction error thanthe standard RF model without sliding window applied. Implementation-wise thisentails changing the yH,t+τ -term in equation (4.10) with yRF,t+τ and introducing theprediction from the sliding window RF denoted ySW,t+τ as follows

ft+τ = (yt+τ − yRF,t+τ )2 − (yt+τ − ySW,t+τ )2. (4.11)

We test the null by examining the statistic ft+τ in the same way as in the testabove. Since the sliding window method is only applied on the ten most influentialvariables, it is important to note that this sliding window error improvement testonly considers as benchmark the RF model where the 10 most influential variablesare used.

In those cases where the first statistical test show no significance, while at the sametime show significance in the second sliding window (SW) error improvement test, wetest the new SW-improved RF-model for statistical significance over the historical averagemodel using MSPE test 1.

62

Chapter 5

Results

In accordance with our two-phase implementation method, this result section is dividedin two parts covering the result of the unsupervised time series clustering and RandomForest regression.

5.1 Unsupervised time series clustering

5.1.1 Clustering results

Following the procedure described in (4.2.1) we used the reduced predictor training matrixMc which has no missing values, the matrix covers 983 days from 2012-12-27 to 2017-01-12. In our data consisting of 154 initial input predictor time series, presented in 4.1.4,we found 7 natural clusters using Spearman’s correlation distance as distance metric, andAverage linkage criterion.

Cluster # C1 C2 C3 C4

# of time series 9 12 21 10% of time series 5.8% 7.8% 13.6% 6.5%

Stock Vol EUR Swap Spreads Swap Curve Slope JPY Swap SpreadsRelVal Credit Spreads EUR2/USD1/CNY1 Macro Index∗ Swap Curve Curvature Bond Vol

Type of series RelVal Stock Vol GBP Swap Spreads FX Currencies EURUSD FX VolCredit Spreads GBPUSD FX Vol EUR1/JPY1 Macro Index∗

USDJPY FX Vol USDCNH FX Vol

Cluster # C5 C6 C7

# of time series 14 34 54% of time series 9.1% 22.1% 35.1%

RelVal Bond Vol AbsChg Swap Curve Slope RelVal FX CurrenciesRelVal FX Curcy DoDpct Swap Spreads CNY2/JPY2/USD2 Macro Index∗

RelVal FX Vol DoDpct FX vol AbsChg Swap Curve LevelsType of series DoDpct Credit Spreads DoDpct Swap Curve Levels

DoDpct FX Currencies DoDpct Swap Curve SlopesDoDpct Volatilities DoDpct Equities

Commodities

Table 5.1: Clustering result using Agglomerative Hierarchical clustering with Spearman’scorrelation distance as distance metric and Average linkage criterion. A total of 7 naturalclusters were found in our data. *) EUR2 Macro Index refers to the monthly updatedmacro surprise index, while EUR1 Macro Index refers to the weekly updated macro surpriseindex.

5.1.2 Clustering procedure results

To get the natural clustering of our data we employed the methods described in (4.2.1.1).We first studied the Cophenetic correlation coefficient (CCC) to decide what distance

63


aaaaaaaaaaaLinkage criterion

Distancemetric Pearson’s

correlation distanceSpearman’s

correlation distance

Average 0.681 0.700

Complete 0.486 0.562

Single 0.269 0.313

Median 0.570 0.580

Table 5.2: Cophenetic correlation coefficient (CCC) values for different distance metricsand linkage criteria. Note that there are several other linkage criterion discussed in thisthesis, some of them are however not compatible with correlation distance metrics andhave thus been excluded in the analysis above.

metric to use (Pearson’s vs. Spearmans’s) and what linkage criterion is best matched tothe distance metric, in the sense of how faithfully the resulting clustering represent thepairwise distances between the original unmodelled data. As seen in table below the high-est CCC coefficient at 0.7 was achieved when using the Spearman’s correlation distance incombination with the Average linkage criterion.

Following our selection of the distance metric and linkage criterion, we used agglomerativehierarhical clustering with Spearman’s correlation distance and Average linkage criterion.The output of that clustering was a nested hierarchical tree-structure that is visualised inthe following dendrogram.

Figure 5.1: Dendrogram output from hierarhical clustering using Spearman’s correlationdistance and Average linkage criterion. The numbers along the horizontal axis representthe indices of the objects in the input data. The links between objects are representedas fork-shaped lines. Y-axis is the (cophenetic distance) computed between the connectedobjects, the height of the forks indicates the distance between the connected objects.

64


A ”cut” to this dendrogram defines the final clustering. To arrive at the clusteringthat corresponds to the natural division in our data, we used the Inconsistency coefficient(3.1.3.4). This coefficient is sensitive to the cophenetic distances along the buildup ofthe hierarchical structure, and can thus be used to identify divisions of the data wherethe similarity changes abruptly (i.e. where the natural division of the data occur). Wecalculated the Inconsistency coefficient according to (3.18) at a depth level of 3, and sortedthe values. These sorted values gives us an indication where there are inconsistent linksin our dendrogram, recall that inconsistent links indicate the border of a natural divisionin a data set. The ten highest sorted values of the Inconsistency coefficient where givenin descendent order as

ICsorted = (2.02, 1.80, 1.75, 1.66, 1.63, 1.53, 1.51, 1.45, 1.42, 1.40).

We see that some links in the dendrogram have large inconsistency coefficients, indicatingpossible natural borders of our data. In our analysis we pick 1.8 as ”cut-off” criteria for ourdendrogram, clusters are formed when a cluster object (”node”) and all of its sub-nodeshave inconsistent values less that 1.8, all data contained in the ”node” are grouped into acluster. Note that this ”cut-off” do not correspond to horizontal cut.

Using an inconsistent coefficient threshold of 1.8 resulted in 7 natural clusters. Toverify the cluster quality we performed an evaluation analysis using the internal clusterquality indices Davies-Bouldin (DB) index and the Calinski-Harabasz (CH) index. Theseindices measure how well a certain clustering have maximised inter-cluster similarity andminimised the intra-cluster similarity. Clearly, our dendrogram allow us to extract severaldifferent clusterings, both natural and unnatural clusterings. As it is unfeasible to calculatethese indices for all possible clusters, we confined our analysis to evaluate the clusteringscorresponding to ”horizontal cuts” to the dendrogram, where each horizontal cut ”forces”a certain number of clusters. Thus, we evaluated these two internal cluster quality indicesfor 50 different horizontal ”cuts” each for a different ”forced” K number of clusters, K =(1, 2, . . . , 50). The result is plotted in the following figures (5.2)

(a) Calinski-Harabasz Index (b) Davies-Bouldin Index

Figure 5.2: The Calinski-Harabasz and Davies-Bouldin index for different ”forced” clus-terings. The x-axis is the K forced number of clusters, whereas the y-axis correspondsto the value of respective index. The best value the Calinski-Harabasz index is attainedwhen the index is maximised. The best value of the Davies-Boulding index is attainedwhen the index is minimised.

What we can see from (5.2) is that in our analysis, using the dendrogram from ourhierarhical clustering, the optimal value of these indices are attained for K values between

65


5 and 9 ”forced cluster”. Note that the CB index optimal value is attained when theindex is maximised, while the optimal value of DB index is attained when the index isminimised. This implies that in our analysis, the K that maximises the inter-clustersimilarity and minimises intra-cluster similarity in our dendrogram is somewhere around5 and 9 clusters. We have to keep in mind that the two graphs (5.2) represent horizontalcuts to the dendrogram, it is of interest to see how well our ”final natural” non-horizontalcut compares to these values. The CB and DB indices values for our final clustering cutare CBfinal = 0.97 and DBfinal = 3.92. While these values are not the ”most optimal”in the sense of the graphs (5.2), they are within the range of reasonably good values(K = [5, . . . , 9]).

There is a contradicting behaviour of the two index graphs in (5.2), note how theDavies-Bouldin index moves to ”better” index values for higher K forced number of clus-ters, while at the same time the Calinski-Harabasz index moves to ”worse” values whenincreasing K. A closer look at how the DB index is calculated (see (3.1.4)) gives us apartial answer, in its calculation the DB index considers the ”worst-case” inter-clusterscatter for each cluster. Given that we use the Spearman’s correlation distance metricon financial time series with average linkage criterion, one can argue if the DB index is asuitable measure to use in our case. An adjusted measure that considers the ”average”inter-cluster scatter rather than the ”worst-case” inter-cluster scatter, would probably bemore appropriate in our case.

5.2 Random Forest implementation

This section presents the result from our regression and variable importance analysisdescribed in (4.2.2.1). The analysis covers several different RF implementations corre-sponding to the four different forecasting days of excess return, and for the seven differentHY industries under study. The structure of this subsection is as follows, we first presentthe results of using a training matrix with several missing values and evaluate how wellthe RF method with surrogate splits is able to handle several missing values. We thenproceed to the main analysis of this thesis where we show the performance results of allour different Random Forest regressions using a training matrix with very few missingvalues. For the main analysis we then proceed to present the 10 most influential variablesin each HY industry for the prediction of excess return. Lastly, we present our findings ofusing the sliding window method on the 10 most influential variable. The implementationwas done using time series data from 2004-06-22 to 2017-01-15.

5.2.1 Random Forest’s ability to handle several missing values

The initial implementation of the Random Forest analysis where done using the trainingmatrix MR1 (4.1.5), which contained several missing values attributed primarily to a fewtime series groups, namely US corporate credit spreads, EUR swap spreads and the Chinesecurrency CHN. Using the MR1 training matrix with several missing values resulted in apoor RF training and prediction performance, the figures below highlights the issue whenwe have several missing values.

66


Figure 5.3: Chemical industry out-of-bag 5 day excess return prediction usingMR1 trainingmatrix with several missing values

Figure 5.4: Packaging industry out-of-bag 5 day excess return prediction using MR1 train-ing matrix with several missing values

In the figures the blue line is the RF out-of-bag prediction, the black line is the actualexcess return for the forecasted period, and the red lines are the upper and lower quantilesof our OOB predictions, i.e. the red lines covers the 90% percentile OOB-predictioninterval. As can be seen in the figures, the Random Forest method with surrogate splitsperforms extremely poorly for the time-period where we have a lot of missing values.We see that during the period of year 2004 to 2012 the model predict almost a constantvalue of the excess return in both presented cases, and we observe that the 90% percentileintervals of our OOB-predictions are very wide compared to the intervals of prediction

67


done past year 2012. This combination of ”constant” OOB-predictions and very wideOOB-prediction percentile intervals coincides with the time-period where there are manymissing time series values, this suggests very high model uncertainty due to the missingvalues in the training matrix. As such, prediction and analysis using RF trained withMR1 with several missing values is pointless and nothing sensible can be inferred fromsuch model, both the prediction performance metrics and variable importance estimateswill be distorted by the RF’s inability to handle such large amount of missing values. Thisissue were note limited to these two presented industries, similar issue is found in the otherfive industries as well.

To fully understand why missing values in a limited number of time series can become aserious issue that can render our RF analysis useless, we have to recall how our two-phaseprocedure is implemented. Prior to the implementation we pre-process the data withtransformations and filters, essentially creating several new time series form the originalones. This means that the number of time series with several missing values have beendoubled or tripled because of the pre-processing. Furthermore, recall that from our unsu-pervised time series clustering we select from each resulting cluster 25% of the predictorsthat are the most correlated (Spearman’s coefficient) to the response variable under study.This means that each industry under study will use as training matrix a subspace of theoriginal input matrix MR1 that corresponds to these selected ”most correlated” predictorsfrom our clusters. This subspace training matrix is different for each different industry,and we can denote it according to industry as MIndustry, e.g. MChemical for chemicalindustry. It turns out that a significant part of the ”most correlated” predictors in thedifferent industries are time series with many missing values, or transformation of those.Time series such as EUR swap spreads, US corporate credit spread, or transformationof those, showed up as ”most correlated” predictors in many cases. This resulted in theindustry specific training matrices MIndustry having high percentage of missing values, insome cases up to 20% of a training matrix having missing values.

To bypass this issue of poor performance due to missing values, and for the sake ofhopefully inferring something sensible from our analysis, we removed all time series whichhave no registered values before year 2005. As a result we removed EUR swap spreads,US corporate credit spreads, and time series related to the Chinese currency CHN fromour analysis (see 4.1.5). All subsequent results are derived using the training matrix MR2,which contain almost no missing values.

5.2.2 Random Forest regression results

In this subsection we provide the performance metrics for all of our RF regressions, trainedand evaluated using the MR2 training matrix with very few missing values. The resultsare summarised in table 5.3. In total, we present four sets of RF regression results. Thefirst two set of RF regressions results are from our two-phase implementation procedure,where we perform two RF regressions for each HY industry and each specific forecastingperiod. The first RF regression is done using the predictor variables selected from ourunsupervised time series clustering. The second RF regression is done using the 10 mostinfluential variables from the first RF regression. Below, we present all the regressionsorted according to industry and forecast horizon. The performance metrics for the secondregression, where we use the 10 most influential variables as predictors, are denoted withTen before the metric name. Furthermore, to evaluate the performance of applying thesliding window method we have included the regressions when a 3-day sliding window wasapplied to the ten most influential variables, this set of regressions are denoted SW3 beforethe metric name. We also include a fourth set of regressions where we apply the sliding

68


window method to ten most influential variables as well as to the response variable itself,this fourth set regressions is denoted RSW3 before the metric name. To clarify, in theRSW3 we use a 3-day sliding window applied on the ten most influential variables and onthe daily historical values of the response variable, not to confuse with the actual responsevariable that might be measured in several days.

69


Ind

ustry

#d

ays

forecast

RM

SEtrain

RM

SEOOB

R2out

AC

Cdirr

Ten

RM

SEtrain

Ten

RM

SEOOB

Ten

R2out

Ten

AC

Cdirr

SW

3R

MS

Etrain

SW

3R

MS

EOOB

SW

3R

2out

SW

3A

CCdirr

RSW

3R

MS

Etrain

RSW

3R

MS

EOOB

RSW

3R

2out

RSW

3A

CCdirr

10.23

30.3

910.110

∗∗60.6%

0.2

450.386

0.131

∗∗60.7%

0.2

22

0.3

88

0.12561.3%

0.2190.386

0.13361.2%

Ch

emica

l3

0.46

70.8

080.167

∗∗61.2%

0.4

900.791

0.202

∗∗63.7%

0.4

36

0.7

81

0.223⊕⊕

62.7%0.430

0.7780.228

⊕⊕

63.3%5

0.65

11.1

350.207

∗∗63.1%

0.6

741.094

0.263

∗∗63.6%

0.5

78

1.0

47

0.326⊕⊕

64.0%0.575

1.0530.318

⊕⊕

64.5%10

0.9

253

1.6

710.3

39∗∗

66.2%

0.9

601.560

0.423

∗∗65.6%

0.7

80

1.4

46

0.505⊕⊕

69.2%0.792

1.4710.488

⊕⊕

68.0%

10.26

00.4

340.081

∗∗58.6%

0.2

760.435

0.078

∗∗57.6%

0.2

47

0.4

31

0.09258.5%

0.2380.424

0.123⊕⊕

59.1%M

etals

30.52

90.9

000.125

∗∗60.1%

0.5

650.896

0.134

∗∗59.9%

0.5

01

0.8

87

0.151⊕⊕

59.5%0.479

0.8690.186

⊕⊕

62.5%5

0.75

21.2

850.143

∗∗59.9%

0.7

981.273

0.159

∗∗58.9%

0.6

89

1.2

31

0.215⊕⊕

59.6%0.673

1.2240.223

⊕⊕

61.4%10

1.19

32.0

460.208

∗∗60.9%

1.2

502.000

0.243

∗∗59.3%

1.0

60

1.9

02

0.316⊕⊕

62.0%1.043

1.8960.321

⊕⊕

63.1%

10.23

60.3

450.097

∗∗58.2%

0.2

150.342

0.110

∗∗57.5%

0.1

93

0.3

41

0.11658.7%

0.1900.337

0.137⊕⊕

59.5%P

aper

30.47

40.6

810.213

∗∗62.7%

0.4

890.665

0.249

∗∗63.4%

0.4

90

0.6

58

0.266⊕⊕

63.3%0.485

0.6570.268

⊕⊕

63.4%5

0.66

60.9

640.268

∗∗64.1%

0.6

710.917

0.338

∗∗65.8%

0.6

61

0.8

96

0.367⊕⊕

67.0%0.660

0.9010.361

⊕⊕

66.5%10

0.97

21.4

740.390

∗∗68.2%

0.9

691.363

0.479

∗∗70.1%

0.9

06

1.3

04

0.523⊕⊕

72.3%0.924

1.3180.513

⊕⊕

71.4%

10.29

30.4

230.100

∗∗61.1%

0.3

070.417

0.124

∗∗61.2%

0.3

15

0.4

20

0.11461.6%

0.3090.420

0.11261.9%

Bu

ildin

gM

ateria

ls3

0.54

90.8

040.229

∗∗65.3%

0.5

590.775

0.284

∗∗66.0%

0.5

67

0.7

74

0.28766.1%

0.5570.769

0.29565.5%

50.75

21.1

050.313

∗∗64.6%

0.7

531.042

0.390

∗∗65.8%

0.7

40

1.0

29

0.405⊕⊕

66.6%0.743

1.0460.386

⊕67.3%

10

1.11

41.6

990.428

∗∗68.3%

1.0

891.562

0.516

∗∗70.5%

0.8

16

1.5

10

0.548⊕⊕

70.5%1.038

1.5270.538

⊕⊕

70.2%

10.22

40.3

590.03

2∗

56.1

%0.2

270.360

0.026

54.9%0.2

06

0.3

60

0.02756.3%

0.2020.358

0.038⊕|∗∗

57.0%P

acka

ging

30.42

80.6

960.043

∗∗59.1%

0.4

370.701

0.029

58.2%0.3

91

0.7

01

0.03057.8%

0.3810.694

0.05158.2%

50.59

30.9

720.045

∗∗58.1%

0.6

120.977

0.036

57.1%0.5

46

0.9

70

0.050⊕⊕|∗∗

57.7%0.529

0.9560.078

⊕⊕|∗∗

59.1%10

0.91

21.5

000.071

∗∗61.6%

0.9

431.509

0.060

∗∗60.1%

0.8

16

1.4

55

0.127⊕

61.7%0.824

1.4810.095

60.0%

10.32

10.5

230.055

∗∗59.2%

0.3

430.525

0.049

∗57.9%

0.3

08

0.5

20

0.06659.2%

0.3050.519

0.07159.2%

Teleco

m3

0.63

21.0

450.037

∗∗59.0%

0.6

791.054

0.020

58.2%0.6

02

1.0

38

0.050⊕|∗∗

59.6%0.589

1.0350.056

⊕⊕|∗∗

58.8%

50.87

61.4

560.02

7∗

57.0

%0.9

511.481

−0.001

56.6%0.8

42

1.4

54

0.030

⊕⊕|∗

57.6%0.830

1.4480.038

⊕⊕|∗∗

58.2%10

1.33

62.2

050.025

∗∗59.3%

1.4

252.213

0.018

56.5%1.2

89

2.2

03

0.02758.5%

1.2672.203

0.02758.3%

10.29

00.4

700.03

8∗

58.2

%0.3

120.473

0.023

57.9%0.2

83

0.4

72

0.03459.0%

0.2720.461

0.075

⊕⊕|∗∗

58.0%

Electric

Utility

30.57

80.9

620.03

0∗

58.6

%0.6

210.962

0.029

56.7%0.5

58

0.9

59

0.035⊕|∗

58.5%0.532

0.9370.079

⊕⊕|∗∗

60.8%

50.82

91.3

730.02

5∗

57.0

%0.8

941.382

0.012

56.4%0.8

05

1.3

72

0.02757.3%

0.7751.353

0.054

⊕⊕|∗∗

59.3%10

1.32

32.1

640.02

1∗

56.1

%1.4

122.172

0.014

53.9%1.2

74

2.1

54

0.03055.3%

1.2462.137

0.04657.1%

Tab

le5.3

:R

an

dom

Fore

stR

egre

ssion

Resu

lts:F

our

setsof

regressions

arep

resented

here,

the

first

setu

sesth

e∼

40variab

lesex

tractedfrom

the

un

sup

ervised

time

seriesclu

stering,

the

second

setd

enoted

Ten

utilises

the

10m

ostin

flu

ential

variables

fromth

efi

rstregression

asregressors,

the

third

setof

regressio

ns

den

oted

SW

3ap

plies

a3-d

ayslid

ing

win

dow

onth

e10

most

infl

uen

tialvariab

les.T

he

fourth

setof

regressions

den

oted

RSW

3ap

plies

a3-d

ayslid

ing

win

dow

the

10m

ostin

flu

ential

variables

and

the

respon

sevariab

leitself.

Th

eR

MS

Eis

givenin

percen

tages,a

RM

SE

value

of0.1

isin

terpreted

as0.1%

.T

he

direction

alaccu

racyd

enoted

asACCdirr

shou

ldon

lyb

ecom

pared

horizon

tally,as

for

each

increasin

gfo

recasting

day

the

nu

mb

erof

up

-day

sin

creases,from

∼55%

for1

forecasting

day

sto∼

60%up

-day

sfor

10forecastin

gd

ays.

Th

eR

2out

isgiv

enin

decim

al

formw

here

avalu

eof

1eq

uates

to100%

.*)

and

**)in

dicates

the

statisticalsign

ifican

ceof

ourR

2out

based

onth

eM

SP

Eerror

testw

here

the

histo

rical

averagem

od

elis

used

asa

ben

chm

arkatα

=0.05

andα

=0.01

levelresp

ectively(see

4.10).⊕

)an

d⊕⊕

)in

dica

testh

esta

tisticalsign

ifica

nce

ofou

rslid

ing

win

dow

error

impro

vemen

ttest

atα

=0.05

andα

=0.01

respectively

(see4.11).

Inth

oseca

sesw

here

the

first

statistical

testsh

own

osign

ifican

ce,w

hile

atth

esam

etim

esh

owsign

ifican

cein

the

second

slidin

gw

ind

ow(S

W)

error

imp

rovemen

ttest,

we

prov

ide

ad

dition

alin

formation

ofth

esign

ifican

ceof

the

SW

-app

liedR

Fm

od

elov

erth

eh

istoricalaverage

model

usin

gth

efi

rsttest.

E.g

.”⊕⊕|∗∗”

ind

icates

that

there

was

astatistical

signifi

cant

imp

rovemen

tin

RM

SE

usin

gth

eS

Wm

ethod

overth

ereg

ular

TenRF

-mod

el,an

dth

en

ewresu

lting

SW

RF

mod

elsh

owstatistical

signifi

cance

inR

2out

overth

eh

istoricalaverage

mod

el.C

learly,ap

ply

ing

aslid

ing

win

dow

onaTenRF

-mod

elth

atis

statisticalsign

ifican

tin

the

first

place,

will

yield

statisticalsign

ifican

tSWRF

-mod

elif

there

isa

statistically

signifi

cant

imp

rovemen

tin

RM

SE

wh

enu

sing

the

slidin

gw

ind

owm

ethod

,th

esecases

aresim

ply

indicated

by

”⊕⊕

”.

70


5.2.3 Plots: OOB Predictions and Prediction Intervals

In this section we provide the OOB-prediction plots corresponding the results in previoussubsection. The OOB-predictions are given with the 90% percentile prediction intervalsfrom using quantile regression (see 3.2.2.6). To keep the result section concise we onlyprovide the OOB-prediction and prediction interval plots for the regressions that utilisesthe 10 most influential variables as regressors, and predicts excess return 5 days in thefuture.

Figure 5.5: Chemical OOB-Prediction and 90% Percentile Prediction Intervals

Figure 5.6: Metals OOB-Prediction and 90% Percentile Prediction Intervals

71


Figure 5.7: Paper OOB-Prediction and 90% Percentile Prediction Intervals

Figure 5.8: Building Materials OOB-Prediction and 90% Percentile Prediction Intervals

72


Figure 5.9: Packaging OOB-Prediction and 90% Percentile Prediction Intervals

Figure 5.10: Telecom OOB-Prediction and 90% Percentile Prediction Intervals

73


Figure 5.11: Electric Utility OOB-Prediction and 90% Percentile Prediction Intervals

5.2.4 Unbiased variable importance estimates

The variable importance estimates presented in this subsection correspond to the RFregression results in table 5.3. For each HY industry analysis three variable importanceresults are presented. The initial variable importance estimates, denoted with ”1st RF”(left figures), are from the first RF regression where we used the extracted predictorvariables from our unsupervised clustering as regressors. Thus, the variable importanceestimate figure denoted with ”1st RF” lists all the variables selected from our unsupervisedtime series clustering procedure. From that first RF regression we selected the 10 mostinfluential predictor variables, according to the values of the 1st RF variable importanceestimates, to use in a second regression. The variable importance estimates from the secondregression are denoted with ”2nd RF” (top figures), and correspond to the regressionwhere we only used the 10 most influential variables. The third set of variable importanceestimates are from the regression where we used a 3-day sliding window applied on the tenmost influential variables and on the response variable itself, these estimates are denoted”sliding window” (right figures).

To keep the result section concise the variable importance estimates presented in thissubsection are confined to the results from forecasting the 5 days excess return, and for anapplied 3-day sliding window. The variable importance estimates for the other forecastingdays show similar results to what is presented here and has therefore been excluded tokeep this section as lucid as possible. Also, the variable importance estimates from ”2ndRF” correspond to the OOB-prediction plots given in previous subsection.

74


Fig

ure

5.12:C

hem

icalIn

du

stryV

ariable

Imp

ortance

Estim

ates

75


Fig

ure

5.13:M

etalsIn

du

stryV

ariable

Imp

ortance

Estim

ates

76


Figu

re5.14:

Pap

erIn

du

stryV

ariable

Imp

ortance

Estim

ates

77


Figu

re5.15:

Packagin

gIn

du

stryV

ariable

Imp

ortance

Estim

ates

78


Figu

re5.1

6:B

uild

ing

Materials

Ind

ustry

Variab

leIm

portan

ceE

stimates

79


Figu

re5.17:

Telecom

Ind

ustry

Variab

leIm

portan

ceE

stimates

80


Figu

re5.18:

Electric

Utility

Ind

ustry

Variab

leIm

portan

ceE

stimates

81


5.3 Result Discussion

Time series clustering

In our unsupervised time series clustering using agglomerative hierarchical clustering wefound seven natural clusters. The ”natural”-part refers to the methods from our theoreticalframework that are used to find natural division in data. It is important to understand thatthese methods presuppose that the hierarhical clustering was performed with appropriatedistance metric and linkage criterion, the effectiveness of these methods are limited bythe choice of the distance metric and linkage criterion. Our selection of the Spearman’scorrelation distance metric and the average linkage criterion is well-motivated by thecophenetic correlation coefficient. The resulting seven natural clusters were evaluatedwith internal criterion of cluster quality using the Calinski-Harabasz Index and the Davies-Boulding Index. Our seven natural clusters scored within the range of optimal values givenour nested hierarchical tree structure (dendrogram). Of the two indices it is in our casemore appropriate to put more focus on the CH index as the latter DB index uses the”worst-case” inter-cluster scatter in its calculation, which in presence of many objectsbecomes slightly biased towards a higher number of clusters.

We did not evaluate our seven clusters with any external criteria of cluster quality,because we clustered our data with a correlation distance metric and we do not haveany externally given ”true”-labels of what the true clusters should look like based on acorrelation coefficient, and if we did, then there would be no point in clustering the datain the first place. However, we can discuss the resulting natural clustering subjectively.Without digging ourselves into too detailed economics reasoning, clusters 1,2,3,5 and 6can be motivated logically using price dynamics from economics. On the other hand,cluster 7 with 54 time series and 35.1% of the objects under study are more difficult toprovide a logical economical reasoning for. One can argue that one could perhaps dividecluster 7 into further sub-clusters, however, we have to remember that we performed anunsupervised clustering based entirely on the Spearman’s correlation distance and theaverage linkage criterion, and according to our unsupervised method with these specificsettings cluster 7 is a single cluster.

Random Forest regression

The Random Forest regression results presented in 5.3 show that the high yield corporatebond industry index excess return is predictable with the Random Forest (RF) methodusing externally given market-observable prices of major assets and indices. The pre-dictability of the excess returns varies across the seven high yield industries under study.We note that the Chemical, Metals, Paper, and Building Materials industries are morepredictable than Packaging, Telecom, and Electric Utility, when using the RF method withexternally given market-observable predictors. The out-of-sample R2

out is statistically sig-nificant at 0.01 level for the more predictable industries when using the ∼ 40 predictorsfrom our unsupervised time series clustering analysis, as well as when only using the 10most influential variables from the first regression. The R2

out metric captures how well aRF regression performs in terms of RMSE compared to the historical average model, apositive value indicates a better performance than the historical average model. In the lesspredictable industries, Packaging, Telecom, and Electric Utility, the statistical significanceof our R2

out varied. For these industries, the RF model with ∼ 40 predictors showed statis-tically significance at 0.01 level for Packaging and Telecom, while only at a level of 0.05 forElectric Utility, the least predictable industry. Using the ten most influential predictorsRF model on these industries showed no statistical significance over the historical average

82


model, indicating perhaps that our selection of the ten most influential variables does notaccurately reflect the ten truly important predictors in these industries. The difference inpredictability across industries can be observed in the out-of-bag RMSE performance met-ric, which increasingly differs for longer forecasts. This difference can also be observed inthe OOB prediction interval plots in 5.2.3. In the OOB prediction plots we see that morepredictable industries have smaller width of the 90% percentile OOB prediction intervals,while less predictable industries, such as Telecom, have larger width of the 90% percentileprediction interval as well as showing increasing actual excess return volatility.

The difference in industry excess return predictability can perhaps be partially explainedby the results from our variable importance estimates. A closer look at the variable im-portance estimates from our RF regressions in 5.2.4 show that the ”more predictable”industries have a diverse variety of predictors, ranging between currencies, bond & stockvolatility, stock indices returns, credit spreads, swap spreads, and swap curve factors. The”less predictable” industries have a more concentrated selection of the ten most influentialvariables, belonging predominantly to stock indices returns and credit spreads. Applyingthe sliding window method on these ten most influential variables showed statistical sig-nificant improvement in the models’ RMSE in most of the cases, albeit by a small amount.The improvements were most evident in the Ten-RF models that were already statisticalsignificant over the historical average model. Applying the sliding window on the ”lesspredictable” industries, such as Electric Utility and Packaging, barely showed any im-provements. However, applying the sliding window method on the ten most influentialpredictors as well as on the response variable itself, i.e. applying a sliding window onthe daily excess return under study, yielded statistical significant improvements in RMSEin majority of the cases. In particular, for the ”less predictable” industries Packaging,Telecom, and Electric Utility, introducing previous values of the daily excess returns ofthe response variable through the sliding window method showed a statistical significantimprovement over the Ten RF-model, as well as statistical significant improvement overthe historical average model in cases that were previously insignificant (these cases arehighlighted by ”⊕ ⊕ | ∗ ∗” in table 5.3). The RSW RF-models, i.e. RF models withapplied sliding window method and introduced previous values of the excess return, sug-gest that the excess return across all industries exhibit some form of autocorrelation. Wenote that in the right figures of 5.2.4, which showcases the variable importance estimatesfrom a 5-day forecast RF model where we applied a 3-day sliding window on the ten mostimportant variables as well as on the daily excess return, that previous daily values of theresponse variable under study show up as the top 10 most influential variables. This resultindicates some form of autocorrelation, and we note that for Metals, Packaging, Telecom,and Electric Utility industry, current value of daily excess return is the most influentialpredictor in our models for forecasting future excess return.

Limitations to the results

When interpreting these results the reader should keep in mind that the results presentedhere are limited to the two-phase implementation procedure of this thesis. The ”ten mostinfluential” predictors are derived specifically from our two-phase procedure using theRandom Forest and unsupervised time series clustering with specific settings, alterationsto the settings in our two-phase procedure might give other results. For instance, we usedthe Spearman’s correlation coefficient as our distance metric in the clustering procedureas well as a measure to extract variables from our clustering. Using other measure ofcorrelation will most probably give a different result. Additional limitation to the results

83


are our selection of the specific input variables to our two-phase implementation procedure.We used a specific set of externally given market-observable prices of various assets andindices, using other variables such as fundamental bond data will probably give us differentresults.

Another limitation that needs to be mentioned is the limited amount of training dataand the quality of the training data. The derived results are limited to the amount oftraining data available, for instance, in unsupervised clustering we could only use 983days worth of data, from 2013 to 2017, as our clustering algorithm were unable to handlemissing values in the training matrix. A larger set of training data dating back to year2005 might have given us a different clustering that captures the correlation betweenobjects further back in time. In the RF regression analysis we used 3142 days worth ofdata, considering the many predictors we use one can question if this amount is enough toavoid curse of dimensionality and other related overfitting issues. Ideally, machine learningtechniques require a lot more training data in proportion to the number of predictors used.We try to avoid this problem in our two-phase implementation procedure by selectingthe ten most influential variables as regressors, however, by doing this we perhaps lostvaluable information that strongly influence the response variable. As an example, froman economic point of view the price of copper (HG1Comdty) should be an importantpredictor of the excess return in the Metals industry, however, in our first RF regressionthe price of copper did not even make the top 20 most influential predictors and wassubsequently not included in the second RF regression which uses the ten most influentialpredictors.

Appropriateness of the Random Forest for modelling financial time series

We also should address whether the Random Forest is an appropriate model for modellingfinancial time series. For the standard RF model the answer is probably not, for a RFmethod with sliding window method, perhaps. Financial time series is a complex setof data with continuously changing characteristics. Correlation, volatility, and economicprice dynamics of financial time series changes over time and will probably change topreviously unseen levels. The Random Forest is a piece-wise constant model, it predictsusing the average value of the observation in a leaf node. Such model could not possiblymodel ”unseen” market environments, and while sliding window method enable the RFto model sequences of price dynamics, it is questionable whether that is good enoughfor application and use in the real world. It is perhaps preferable to use a piece-wiselinear model to model unseen market environments, however, we should not disregard theRF method completely. As described in both the literature review and the theoreticalframework, the RF method can be a very versatile model. The RF model can learn ”on-the-go” by training additional new trees with recent observations, these new trees can thenbe appended to the ensemble of decision trees where the ”voting”-weights of each tree areadjusted according to information recency or prediction accuracy. The RF can also beeasily expanded to fit additional new predictors simply by growing more ”branches” inexisting trees. Evaluating how such modification of the RF performs on financial timeseries is a subject for another study. What we can conclude from our analysis, at least,is that the RF method with and without the sliding window method have a statisticallysignificant lower mean-square prediction error than the mean-square prediction error inthe historical average model.

84


5.4 Conclusion

In this thesis we demonstrated how hierarhical clustering can be applied on financialtime series data and be used as a dimensionality reduction technique. In the process,we showed how one goes about finding the natural division of the data and evaluate theresulting clustering using internal criteria of cluster quality. Furthermore, additional di-mensionality reduction approach was demonstrated using unbiased variable importanceestimates from a Random Forest constructed with trees using the GUIDE-method ratherthan the standard CART-method. We performed Random Forest regression with the tenmost influential variables in each industry under study, and showed the effectiveness ofapplying the sliding window method on the predictors as well as on the response variableitself. The regressions’ out-of-sample coefficient of determination R2

out showed statisticalsignificance in majority of the cases, but considering the accompanied OOB predictionplots with rather wide 90% percentile prediction intervals, it is questionable whether themodels are useful in the real world. Especially considering the limitation posed on themodel from the rather ”small” number of observations, the choice of input variables, andour two-phase implementation procedure. There are many alterations that can be madeto the two-phase implementation procedure that will most probably give different result.As such, the content of this thesis should at most be viewed as a possible dimensionalityreduction techniques and as an effort to make the Random Forest compatible with sequen-tial time series data.

The possible alterations to the employed two-phase implementation procedure lends itselfto possible future research work. For instance, in our regression efforts we use a piece-wiseconstant model to train and predict financial time series, it would be interesting to see howa piece-wise linear model performs in comparison to a constant model. Incorporating fun-damental bond data to the analysis of excess return would be another interesting area toinvestigate. In this thesis we only use externally-given market-observable prices of assetsand indices, the predictability would most certainly be improved by incorporating inter-nally-given bond-specific fundamental data such as bond ratings, duration, leverage, andcash flow characteristics. The measure of correlation between time series is an importantfactor in our two-phase implementation procedure, we considered Spearman’s and Pear-son’s correlation coefficient in clustering process as well as in the cluster variable extractionprocess. Perhaps there are other more suitable measure of correlation amongst time series,such as the p-value of a chi-squared association test used in the GUIDE-method to decideon the splitting variable.

85

Bibliography

[1] Ghada Abdelmouez et al. “Neural network vs. linear models for stock market sectorsforecasting”. In: Neural Networks, 2007. IJCNN 2007. International Joint Confer-ence on. IEEE. 2007, pp. 1365–1369.

[2] Mark O Afolabi and Olatoyosi Olude. “Predicting stock prices using a hybrid Ko-honen self organizing map (SOM)”. In: System Sciences, 2007. HICSS 2007. 40thAnnual Hawaii International Conference on. IEEE. 2007, pp. 48–48.

[3] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. “Time-series clustering–A decade review”. In: Information Systems 53 (2015), pp. 16–38.

[4] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.

[5] Ash Booth, Enrico Gerding, and Frank McGroarty. “Predicting equity market priceimpact with performance weighted ensembles of random forests”. In: ComputationalIntelligence for Financial Engineering & Economics (CIFEr), 2104 IEEE Confer-ence on. IEEE. 2014, pp. 286–293.

[6] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32.

[7] Leo Breiman et al. Classification and regression trees. CRC press, 1984.

[8] Nianyun Cai and Xiaoquan Jiang. “Corporate bond returns and volatility”. In: Fi-nancial Review 43.1 (2008), pp. 1–26.

[9] Todd E Clark and Kenneth D West. “Approximately normal tests for equal predic-tive accuracy in nested models”. In: Journal of econometrics 138.1 (2007), pp. 291–311.

[10] C Crisci, B Ghattas, and G Perera. “A review of supervised machine learning al-gorithms and their applications to ecological data”. In: Ecological Modelling 240(2012), pp. 113–122.

[11] David L Davies and Donald W Bouldin. “A cluster separation measure”. In: IEEEtransactions on pattern analysis and machine intelligence 2 (1979), pp. 224–227.

[12] Francis X Diebold and Canlin Li. “Forecasting the term structure of governmentbond yields”. In: Journal of econometrics 130.2 (2006), pp. 337–364.

[13] Thomas Dietterich. “Machine learning for sequential data: A review”. In: Structural,syntactic, and statistical pattern recognition (2002), pp. 227–246.

[14] Hui Ding et al. “Querying and mining of time series data: experimental comparisonof representations and distance measures”. In: Proceedings of the VLDB Endowment1.2 (2008), pp. 1542–1552.

[15] Ronen Feldman and James Sanger. The text mining handbook: advanced approachesin analyzing unstructured data. Cambridge university press, 2007.

87


[16] Edward B Fowlkes and Colin L Mallows. “A method for comparing two hierarchi-cal clusterings”. In: Journal of the American statistical association 78.383 (1983),pp. 553–569.

[17] Henry G Green and Michael A Pearson. “Artificial intelligence in financial markets”.In: Neural Networks, 1995. Proceedings., IEEE International Conference on. Vol. 2.IEEE. 1995, pp. 839–844.

[18] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statis-tical Learning: Data Mining, Inference, and Prediction”. In: Biometrics (2002).

[19] Wei Huang, Yoshiteru Nakamori, and Shou-Yang Wang. “Forecasting stock marketmovement direction with support vector machine”. In: Computers & OperationsResearch 32.10 (2005), pp. 2513–2522.

[20] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall,Inc., 1988.

[21] Gareth James et al. An introduction to statistical learning. Vol. 112. Springer, 2013.

[22] Johana Jaramillo, Juan David Velasquez, and Carlos Jaime Franco. “Research inFinancial Time Series Forecasting with SVM: Contributions from Literature”. In:IEEE Latin America Transactions 15.1 (2017), pp. 145–153.

[23] Ken-ichi Kamijo and Tetsuji Tanigawa. “Stock price pattern recognition-a recurrentneural network approach”. In: Neural Networks, 1990., 1990 IJCNN InternationalJoint Conference on. IEEE. 1990, pp. 215–221.

[24] Gi H Kim, Haitao Li, and Weina Zhang. “CDS-bond basis and bond return pre-dictability”. In: Journal of Empirical Finance 38 (2016), pp. 307–337.

[25] Kyoung-jae Kim. “Financial time series forecasting using support vector machines”.In: Neurocomputing 55.1 (2003), pp. 307–319.

[26] Bjoern Krollner, Bruce Vanstone, and Gavin Finnie. “Financial time series forecast-ing with machine learning techniques: A survey”. In: (2010).

[27] R Senthil Kumar and C Ramesh. “A study on prediction of rainfall using dataminingtechnique”. In: Inventive Computation Technologies (ICICT), International Confer-ence on. Vol. 3. IEEE. 2016, pp. 1–9.

[28] Yingmin Li, Huiguo Chen, and Zheqian Wu. “Dynamic time warping distance methodfor similarity test of multipoint ground motion field”. In: Mathematical Problems inEngineering 2010 (2010).

[29] Hai Lin, Junbo Wang, and Chunchi Wu. “Predictions of corporate bond excessreturns”. In: Journal of Financial Markets 21 (2014), pp. 123–152.

[30] Xiao Liu et al. “A correlation-matrix-based hierarchical clustering method for func-tional connectivity analysis”. In: Journal of neuroscience methods 211.1 (2012),pp. 94–102.

[31] Yanchi Liu et al. “Understanding of internal clustering validation measures”. In:Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE. 2010,pp. 911–916.

[32] Wei-Yin Loh. “Classification and regression trees”. In: Wiley Interdisciplinary Re-views: Data Mining and Knowledge Discovery 1.1 (2011), pp. 14–23.

[33] Wei-Yin Loh. “Regression tress with unbiased variable selection and interaction de-tection”. In: Statistica Sinica (2002), pp. 361–386.

88


[34] Wei-Yin Loh and Yu-Shan Shih. “Split selection methods for classification trees”.In: Statistica sinica (1997), pp. 815–840.

[35] Teo Manojlovic and Ivan Stajduhar. “Predicting stock market trends using randomforests: A sample of the Zagreb stock exchange”. In: Information and Communica-tion Technology, Electronics and Microelectronics (MIPRO), 2015 38th InternationalConvention on. IEEE. 2015, pp. 1189–1193.

[36] Jie Mei et al. “A random forest method for real-time price forecasting in newyork electricity market”. In: PES General Meeting— Conference & Exposition, 2014IEEE. IEEE. 2014, pp. 1–5.

[37] Nicolai Meinshausen. “Quantile regression forests”. In: Journal of Machine LearningResearch 7.Jun (2006), pp. 983–999.

[38] Qiu Mingyue, Li Cheng, and Song Yu. “Application of the Artifical Neural Networkin predicting the direction of stock market index”. In: Complex, Intelligent, andSoftware Intensive Systems (CISIS), 2016 10th International Conference on. IEEE.2016, pp. 219–223.

[39] Markela Muca, Gleda Kutrolli, and Maksi Kutrolli. “A proposed algorithm for deter-mining the optimal number of clusters”. In: European Scientific Journal, ESJ 11.36(2015).

[40] Meinard Muller. Information retrieval for music and motion. Vol. 2. Springer, 2007.

[41] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[42] Zoltan Prekopcsak and Daniel Lemire. “Time series classification by class-specificMahalanobis distance measures”. In: Advances in Data Analysis and Classification(2012), pp. 1–16.

[43] Peter J Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validationof cluster analysis”. In: Journal of computational and applied mathematics 20 (1987),pp. 53–65.

[44] Steven L Salzberg. “C4. 5: Programs for machine learning by j. ross quinlan. morgankaufmann publishers, inc., 1993”. In: Machine Learning 16.3 (1994), pp. 235–240.

[45] Mark Sanderson, D Christopher, Hinrich Manning, et al. “Introduction to Informa-tion Retrieval”. In: Natural Language Engineering 16.1 (2010), p. 100.

[46] Sinan Saracli, Nurhan Dogan, and Ismet Dogan. “Comparison of hierarchical clusteranalysis methods by cophenetic correlation”. In: Journal of Inequalities and Appli-cations 2013.1 (2013), p. 203.

[47] Robert R Sokal and F James Rohlf. “The comparison of dendrograms by objectivemethods”. In: Taxon 11.2 (1962), pp. 33–40.

[48] Anja Struyf, Mia Hubert, Peter Rousseeuw, et al. “Clustering in an object-orientedenvironment”. In: Journal of Statistical Software 1.4 (1997), pp. 1–30.

[49] Francis EH Tay and Lijuan Cao. “Application of support vector machines in financialtime series forecasting”. In: omega 29.4 (2001), pp. 309–317.

[50] Francis EH Tay and LJ Cao. “Modified support vector machines in financial timeseries forecasting”. In: Neurocomputing 48.1 (2002), pp. 847–861.

[51] Ray Tsaih, Yenshan Hsu, and Charles C Lai. “Forecasting S&P 500 stock indexfutures with a hybrid AI system”. In: Decision Support Systems 23.2 (1998), pp. 161–174.

89


[52] Paul E Utgoff. “Incremental induction of decision trees”. In: Machine learning 4.2(1989), pp. 161–186.

[53] Celine Vens and Fabrizio Costa. “Random forest based feature induction”. In: DataMining (ICDM), 2011 IEEE 11th International Conference on. IEEE. 2011, pp. 744–753.

[54] Financial Times website. Real’ investors eclipsed by fast trading. url: https://www.ft.com/content/da5d033c-8e1c-11e1-bf8f-00144feab49a#axzz1t4qPww6r

(visited on 04/27/2017).

[55] Matlab website. Matlab Fittree Function. url: https://se.mathworks.com/help/stats/fitrtree.html (visited on 04/27/2017).

[56] Matlab website. Matlab Hierarchical clustering. url: https://se.mathworks.com/help/stats/hierarchical-clustering.html (visited on 04/27/2017).

[57] Matlab website. OOB Permuted Predictor Importance. url: https://se.mathworks.com/help/stats/regressionbaggedensemble.oobpermutedpredictorimportance.

html (visited on 04/27/2017).

[58] Wall Street Journal website. The quants run wall street now. url: https://www.wsj.com/articles/the-quants-run-wall-street-now-1495389108 (visited on04/27/2017).

[59] Sun Yutong and Hanqing Zhao. “Stock selection model based on advanced Ad-aBoost algorithm”. In: Modelling, Identification and Control (ICMIC), 2015 7thInternational Conference on. IEEE. 2015, pp. 1–7.

[60] Guoqiang Zhang, B Eddy Patuwo, and Michael Y Hu. “Forecasting with artificialneural networks:: The state of the art”. In: International journal of forecasting 14.1(1998), pp. 35–62.

[61] Hui Zhang et al. “Unsupervised feature extraction for time series clustering usingorthogonal wavelet transform”. In: Informatica 30.3 (2006).

90

https://www.ft.com/content/da5d033c-8e1c-11e1-bf8f-00144feab49a#axzz1t4qPww6r

https://www.ft.com/content/da5d033c-8e1c-11e1-bf8f-00144feab49a#axzz1t4qPww6r

https://se.mathworks.com/help/stats/fitrtree.html

https://se.mathworks.com/help/stats/fitrtree.html

https://se.mathworks.com/help/stats/hierarchical-clustering.html

https://se.mathworks.com/help/stats/hierarchical-clustering.html

https://se.mathworks.com/help/stats/regressionbaggedensemble.oobpermutedpredictorimportance.html



https://www.wsj.com/articles/the-quants-run-wall-street-now-1495389108

https://www.wsj.com/articles/the-quants-run-wall-street-now-1495389108

TRITA TRITA-SCI-GRU 2018:018

www.kth.se

forecasting high yield corporate bond industry excess return1185828/fulltext01.pdf · 96 di erent...

Documents