

Data Mining for Prediction. Financial Series Case

Stefan Zemke

Doctoral Thesis

The Royal Institute of Technology

Department of Computer and Systems Sciences

December 2003


Doctoral Thesis
The Royal Institute of Technology, Sweden
ISBN 91-7283-613-X

Copyright © by Stefan Zemke
Contact: [email protected]
Printed by Akademitryck AB, Edsbruk, 2003


Abstract

Hard problems force innovative approaches and attention to detail, their exploration often contributing beyond the area initially attempted. This thesis investigates the data mining process resulting in a predictor for numerical series. The series experimented with come from financial data – usually hard to forecast.

One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship.

Because this approach does not assume any special knowledge or form of the regularities, the method is quite general – applicable to other time series, not just financial. However, the generality puts strong demands on the pattern detection – it must notice regularities in any of the many possible forms.

The thesis' quest for automated pattern-spotting involves numerous data mining and optimization techniques: neural networks, decision trees, nearest neighbors, regression, genetic algorithms and others. Comparison of their performance on stock exchange index data is one of the contributions.

As no single technique performed sufficiently well, a number of predictors have been put together, forming a voting ensemble. The vote is diversified not only by different training data – as usually done – but also by the learning method and its parameters. An approach is also proposed for speeding up predictor fine-tuning.

The algorithm development goes still further: a prediction can only be as good as the training data, hence the need for good data preprocessing. In particular, new multivariate discretization and attribute selection algorithms are presented.

The thesis also includes overviews of prediction pitfalls and possible solutions, as well as of ensemble-building for series data with financial characteristics, such as noise and many attributes.

The Ph.D. thesis consists of an extended background on financial prediction, 7 papers, and 2 appendices.


Acknowledgements

I would like to take the opportunity to express my gratitude to the many people who helped me with the developments leading to the thesis. In particular, I would like to thank Ryszard Kubiak for his tutoring and support reaching back to my high-school days and the beginnings of my university education, and also for his help to improve the thesis. I enjoyed and appreciated the fruitful exchange of ideas and cooperation with Michal Rams, to whom I am also grateful for comments on a part of the thesis. I am also grateful to Miroslawa Kajko-Mattsson for words of encouragement in the final months of the Ph.D. efforts and for her style-improving suggestions.

In the early days of my research Henrik Bostrom stimulated my interest in machine learning and Pierre Wijkman in evolutionary computation. I am thankful for that and for the many discussions I had with both of them. And finally, I would like to thank Carl Gustaf Jansson for being such a terrific supervisor.

I am indebted to Jozef Swiatycki for all forms of support during the study years. Also, I would like to express my gratitude to the computer support people, in particular, Ulf Edvardsson, Niklas Brunback and Jukka Luukkonen at DMC, and to other staff at DSV, in particular to Birgitta Olsson for her patience with the final formatting efforts.

I dedicate the thesis to my parents who always believed in me.

Gdynia, October 27, 2003.
Stefan Zemke


Contents

1 Introduction
  1.1 Background
  1.2 Questions in Financial Prediction
    1.2.1 Questions Addressed by the Thesis
  1.3 Method of the Thesis Study
    1.3.1 Limitations of the Research
  1.4 Outline of the Thesis

2 Extended Background
  2.1 Time Series
    2.1.1 Time Series Glossary
    2.1.2 Financial Time Series Properties
  2.2 Data Preprocessing
    2.2.1 Data Cleaning
    2.2.2 Data Integration
    2.2.3 Data Transformation
    2.2.4 Data Reduction
    2.2.5 Data Discretization
    2.2.6 Data Quality Assessment
  2.3 Basic Time Series Models
    2.3.1 Linear Models
    2.3.2 Limits of Linear Models
    2.3.3 Nonlinear Methods
    2.3.4 General Learning Issues
  2.4 Ensemble Methods
  2.5 System Evaluation
    2.5.1 Evaluation Data
    2.5.2 Evaluation Measures
    2.5.3 Evaluation Procedure
    2.5.4 Non/Parametric Tests

3 Development of the Thesis
  3.1 First half – Exploration
  3.2 Second half – Synthesis

4 Contributions of Thesis Papers
  4.1 Nonlinear Index Prediction
  4.2 ILP via GA for Time Series Prediction
  4.3 Bagging Imperfect Predictors
  4.4 Rapid Fine Tuning of Computationally Intensive Classifiers
  4.5 On Developing Financial Prediction System: Pitfalls and Possibilities
  4.6 Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data
  4.7 Multivariate Feature Coupling and Discretization

5 Bibliographical Notes

A Feasibility Study on Short-Term Stock Prediction

B Amalgamation of Genetic Selection and Boosting (Poster, GECCO-99, US, 1999)

List of Thesis Papers

Stefan Zemke. Nonlinear Index Prediction. Physica A 269 (1999)

Stefan Zemke. ILP and GA for Time Series Prediction. Dept. of Computer and Systems Sciences Report 99-006

Stefan Zemke. Bagging Imperfect Predictors. ANNIE'99, St. Louis, MO, US, 1999

Stefan Zemke. Rapid Fine-Tuning of Computationally Intensive Classifiers. MICAI'2000, Mexico, 2000. LNAI 1793

Stefan Zemke. On Developing Financial Prediction System: Pitfalls and Possibilities. DMLL Workshop at ICML-2002, Australia, 2002

Stefan Zemke. Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data. HIS-2002, Chile, 2002

Stefan Zemke and Michal Rams. Multivariate Feature Coupling and Discretization. FEA-2003, Cary, US, 2003


Chapter 1

Introduction

Predictions are hard, especially about the future. Niels Bohr and Yogi Berra

1.1 Background

As computers, sensors and information distribution channels proliferate, there is an increasing flood of data. However, the data is of little use unless it is analyzed and exploited. There is indeed little use in just gathering the telltale signals of a volcano eruption, heart attack, or a stock exchange crash, unless they are recognized and acted upon in advance. This is where prediction steps in.

To be effective, a prediction system requires good input data, good pattern-spotting ability, and good evaluation of the discovered patterns, among others. The input data needs to be preprocessed, perhaps enhanced by domain expert knowledge. The prediction algorithms can be provided by methods from statistics, machine learning, and analysis of dynamical systems, together known as data mining – concerned with extracting useful information from raw data. And predictions need to be carefully evaluated to see if they fulfill criteria of significance, novelty, usefulness etc. In other words, prediction is not an ad hoc procedure. It is a process involving a number of premeditated steps and domains, all of which influence the quality of the outcome.

The process is far from automatic. A particular prediction task requires experimentation to assess what works best. Part of the assessment comes from intelligent, but to some extent artful, exploratory data analysis. If the task is poorly addressed by existing methods, the exploration might lead to the development of a new algorithm.

The thesis research follows that progression, started by the question of days-ahead predictability of stock exchange index data. The thesis work and contributions consist of three developments. First, exploration of simple methods of prediction, exemplified by the initial thesis papers. Second, higher-level analysis of the development process leading to a successful predictor. The process also supplements the simple methods with specifics of the domain and advanced approaches such as elaborate preprocessing, ensembles, and chaos theory. Third, the thesis presents new algorithmic solutions, such as bagging a Genetic Algorithms population, parallel experiments for rapid fine-tuning, and multivariate discretization.

Time series are common. Road traffic in cars per minute, heart beats per minute, the number of applications to a school every year, and a whole range of scientific and industrial measurements all represent time series which can be analyzed and perhaps predicted. Many of the prediction tasks face similar challenges, such as how to decide which input series will enhance prediction, how to preprocess them, or how to efficiently tune various parameters. Despite the thesis referring to financial data, most of the work is applicable to other domains – if not directly, then indirectly, by pointing out different possibilities and pitfalls in predictor development.

1.2 Questions in Financial Prediction

Some questions of scientific and practical interest concerning financial prediction follow.

Prediction possibility. Is statistically significant prediction of financial markets data possible? Is profitable prediction of such data possible, which involves an answer to the former question, adjusted by the constraints imposed by real markets, such as commissions, liquidity limits, and the influence of the trades themselves?

Methods. If prediction is possible, what methods are best at performing it? What methods are best-suited for what data characteristics – could it be said in advance?


Meta-methods. What are the ways to improve the methods? Can meta-heuristics successful in other domains, such as ensembles or pruning, improve financial prediction?

Data. Can the amount and type of data needed for prediction be characterized?

Data preprocessing. Can data transformations that facilitate prediction be identified? In particular, what transformation formulae enhance input data? Are the commonly used financial indicator formulae any good?

Evaluation. What are the features of a sound evaluation procedure, respecting the properties of financial data and the expectations of financial prediction? How to handle rare but important data events, such as crashes? What are the common evaluation pitfalls?

Predictor development. Are there any common features of successful prediction systems? If so, what are they, and how could they be advanced? Can common reasons for the failure of financial prediction be identified? Are they intrinsic and irreparable, or is there a way to amend them?

Transfer to other domains. Can the methods developed for financial prediction benefit other domains?

Predictability estimation. Can financial data be reasonably quickly estimated to be predictable or not, without the investment to build a custom system? What are the methods, what do they actually say, what are their limits?

Consequences of predictability. What are the theoretical and practical consequences of demonstrated predictability of financial data, or the impossibility of it? How does a successful prediction method translate into economic models? What could be the social consequences of financial prediction?


1.2.1 Questions Addressed by the Thesis

The thesis addresses many of the questions, in particular the prediction possibility, methods, meta-methods, data preprocessing, and the prediction development process. More details on the contributions are provided in the chapter Contributions of Thesis Papers.

1.3 Method of the Thesis Study

The investigation behind the thesis has been mostly goal driven. As problems appeared on the way to realizing financial prediction, they were confronted by various means, including the following:

• Investigation of existing machine learning and data mining methods and meta-heuristics.

• Reading of financial literature for properties and hints of regularities in financial data which could be exploited.

• Analysis of existing financial prediction systems, for commonly working approaches.

• Implementation and experimentation with own machine learning methods and hybrid approaches involving a number of existing methods.

• Some theoretical considerations on mechanisms behind the generation of financial data, e.g. deterministic chaotic systems, and on general predictability demands and limits.

• Practical insights into the realm of trading: some contacts with professional investors, courses on finance and economics.

1.3.1 Limitations of the Research

Like any finished work, this thesis research has its limitations. One criticism of the thesis could be that the contributions do not directly tackle the prominent question: whether financial prediction can be profitable. A Ph.D. student concentrating efforts on this would make a heavy bet: either s/he would end up with a Ph.D. and as a millionaire, or without anything, should the prediction attempts fail. This is too high a risk to take. This is why in my research, after the initial head-on attempts, I took a more balanced path, investigating prediction from the side: methods, data preprocessing etc., instead of prediction results per se.

Another criticism could address the omission or shallowness of experiments involving some of the relevant methods. For instance, a researcher devoted to Inductive Logic Programming could bring forward a new system good at dealing with numerical/noisy series, or an econometrician could point out the omission of linear methods. The reply could be: there are too many possibilities for one person to explore, so it was necessary to skip some. Even then, the interdisciplinary research demanded much work, among others, for:

• Studying 'how to' in 3 areas: machine learning/data mining, finance and mathematics; 2 years of graduate courses taken.

• Designing systems exploiting and efficiently implementing the resulting ideas.

• Collecting data for prospective experiments – initially quite a time-consuming task of low visibility.

• Programming, which for new ideas not guaranteed to work, takes time going into hundreds of hours.

• Evaluating the programs, adjusting parameters, evaluating again – the loop possibly taking hundreds of hours. The truth here is that most new approaches do not work, so the design, implementation and initial evaluation efforts are not publishable.

• Writing papers, with extended background study, for the successful attempts.

Another limitation of the research concerns evaluation methods. The Evaluation section stresses how careful the process should be, preferably involving a trading model and commissions, whereas the evaluations in the thesis papers do not have that. The reasons are many-fold. First, as already pointed out, the objective was not to prove there is profit possibility in the predictions. This would involve not only commissions, but also a trading model. A simple model would not fit the bill, so there would be a need to investigate how predictions, together with general knowledge, trader's experience etc., merge into successful trading – a subject for another Ph.D. Second, after commissions, the above-random gains would be much thinner, demanding better predictions, more data, and more careful statistics to spot the effect – perhaps too much for a pilot study.

The lack of experiments backing some of the thesis ideas is another shortcoming. The research attempts to be practical, i.e. mostly experimental, but there are tradeoffs. As ideas become more advanced, the path from an idea to a reported evaluation becomes more involved. For instance, to predict, one needs data preprocessing, often including discretization. So, even having implemented an experimental predictor, it could not have been evaluated without the discretization completed, pressing one to describe just the prediction part – without real evaluation. Also computational demands grow – a notebook computer is no longer enough.

1.4 Outline of the Thesis

The rest of the initial chapters – preceding the thesis papers – is meant to provide the reader with the papers' background, often skimmed in them for page limit reasons. Thus, the Extended Background chapter goes through the subsequent areas and issues involved in time series prediction in the financial domain, one of the objectives being to introduce the vocabulary. The intention is also to present the width of the prediction area and of my study of it, which perhaps will allow one to appreciate the effort and knowledge behind the developments in this domain.

Then comes the Development of the Thesis chapter which, more or less chronologically, presents the research advancement. In this tale one can also see the many attempts proving to be dead-ends. As such, the positive published results can be seen as the essence of a much bigger body of work.

The next chapter, Contributions of Thesis Papers, summarizes all the thesis papers and their contributions. The summaries assume familiarity with the vocabulary of the Extended Background chapter.

The rest of the thesis consists of the 8 thesis papers, formatted for a common appearance, otherwise quoted the way they were published. The thesis ends with a common bibliography, resolving references for the introductory chapters and all the included papers.


Chapter 2

Extended Background

This chapter is organized as follows. Section 2.1 presents time series preliminaries and characteristics of financial series, Section 2.2 summarizes data preprocessing, Section 2.3 lists basic learning schemes, Section 2.4 ensemble methods, and Section 2.5 discusses predictor evaluation.

2.1 Time Series

This section introduces properties of time series appearing in the context of developing a prediction system in general, and in the thesis papers in particular. The presentation is divided into generic series properties and characteristics of financial time series. Most of the generic time series definitions follow (Tsay, 2002).

Time series, series for short, is a sequence of numerical values indexed by increasing time units, e.g. a price of a commodity, such as oranges in a particular shop, indexed by the time when the price is checked. In the sequel, a series' $s_t$ return values refer to $r_t = \log(s_{t+T}) - \log(s_t)$, with the return period $T$ assumed to be 1 if not specified. Remarks about a series' distribution refer to the distribution of the return series $r_t$. A predictor forecasts a future value $s_{t+T}$, having access only to past values $s_i$, $i \leq t$, of this and usually other series. For the prediction to be of any value it has to be better than random, which can be measured by various metrics, such as accuracy, discussed in Section 2.5.


2.1.1 Time Series Glossary

Stationarity of a series indicates that its mean value and arbitrary autocorrelations are time invariant. Finance literature commonly assumes that asset returns are weakly stationary. This can be checked, provided a sufficient number of values; e.g., one can divide the data into subsamples and check the consistency of mean and autocorrelations (Tsay, 2002). Determining whether a series has moved into a nonstationary regime is not trivial, let alone deciding which of the series properties still hold. Therefore, most prediction systems, which are based on past data, implicitly assume that the predicted series is to a great extent stationary, at least with respect to the invariants that the system may spot, which most likely go beyond mean and autocorrelations.

Seasonality means periodic fluctuations. For example, retail sales peak around the Christmas season and decline after the holidays. So the time series of retail sales will show increasing values from September through December and declining values in January and February. Seasonality is common in economic time series and less so in engineering and scientific data. It can be identified, e.g. by correlation or Fourier analysis, and removed, if desired.

Linearity and Nonlinearity are wide notions depending on the context in which they appear. Usually, linearity signifies that an entity can be decomposed into sub-entities, properties of which, such as influence on the whole, carry over to the whole entity in an easy-to-analyze additive way. Nonlinear systems do not allow such a simple decomposition analysis, since the interactions do not need to be additive, often leading to complex emergent phenomena not seen in the individual sub-entities (Bak, 1997).

In the much narrower context of prediction methods, nonlinear often refers to the form of the dependency between the data and the predicted variable, which need not be linear. Hence, linear approaches, such as correlation analysis and linear regression, are not sufficient. One must use less orthodox tools to find and exploit nonlinear dependencies, e.g. neural networks.


Deterministic and Nondeterministic Chaos. For a reader new to chaos, an illustration of the theory applied to finances can be found in (Deboeck, 1994). A system is chaotic if its trajectory through state space is sensitively dependent on the initial conditions, that is, if small differences are magnified exponentially with time. This means that initially unobservable fluctuations will eventually dominate the outcome. So, though the process may be deterministic, it is unpredictable in the long run (Kantz & Schreiber, 1999a; Gershenfeld & Weigend, 1993). Deterministic means that given the same circumstances the transition from a state is always the same.

Whether financial markets exhibit this kind of behavior is hotly debated, and there are numerous publications supporting each view. The deterministic chaos notion involves a number of issues. First, whether markets react deterministically to events influencing prices, versus a more probabilistic reaction. Second, whether magnified small changes indeed eventually take over, which does not need to be the case; e.g. self-correction could step in if a value is too far off the mark – overpriced or underpriced. Financial time series have been analyzed in those respects; however, the mathematical theory behind chaos often deals poorly with the noise prevalent in financial data, making the results dubious.

Even a chaotic system can be predicted up to the point where magnified disturbances dominate. The time when this happens depends inversely on the largest Lyapunov exponent, a measure of divergence. It is an average statistic – at any time the process is likely to have different divergence/predictability, especially if nonstationary. Beyond that point, prediction is possible only in statistical terms – which outcomes are more likely, no matter what we start with. Weather – a chaotic system – is a good illustration: despite global efforts in data collection, forecasts are precise up to a few days and in the long run offer only statistical views such as average monthly temperature. However, chaos is not to be blamed for all poor forecasts – it recently came to attention that the errors in weather forecasts initially grow not exponentially but linearly, which points more to imprecise weather models than to chaos at work.

Another exciting aspect of a chaotic system is its control. If at times the system is so sensitive to disturbances, a small influence at such a time can profoundly alter the trajectory, provided that the system remains deterministic for a while thereafter. So potentially a government, or a speculator, who knew the rules, could control the markets without a vast investment. Modern pacemakers for the human heart – another chaotic system – work by this principle, providing a little electrical impulse only when needed, without the need for constant overwhelming of the heart's electrical activity.

Still, it is unclear whether the markets are stochastic or deterministic, let alone chaotic. A mixed view is also possible: markets are deterministic only in part – so even short-term prediction cannot be fully accurate – or there are pockets of predictability: markets, or market conditions, where the moves are deterministic, otherwise being stochastic.

Delay vector embedding converts a scalar series $s_t$ into a vector series: $v_t = (s_t, s_{t-delay}, \ldots, s_{t-(D-1) \cdot delay})$. This is a standard procedure in (nonlinear) time series analysis, and a way to present a series to a predictor demanding an input of constant dimension $D$. More on how to fit the delay embedding parameters can be found in (Kantz & Schreiber, 1999a).
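To make the construction concrete, here is a minimal sketch in Python (not from the thesis; the function and parameter names are illustrative):

```python
import numpy as np

def delay_embed(series, D, delay=1):
    """Build delay vectors v_t = (s_t, s_{t-delay}, ..., s_{t-(D-1)*delay})."""
    s = np.asarray(series, dtype=float)
    start = (D - 1) * delay  # first index t with a full history available
    return np.array([[s[t - i * delay] for i in range(D)]
                     for t in range(start, len(s))])

# Embed a short price series with dimension D=3 and delay 2
prices = [10.0, 10.2, 10.1, 10.4, 10.3, 10.6, 10.5]
print(delay_embed(prices, D=3, delay=2))
# [[10.3 10.1 10. ]
#  [10.6 10.4 10.2]
#  [10.5 10.3 10.1]]
```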

Takens Theorem (Takens, 1981) states that we can reconstruct the dynamics of a deterministic system – possibly multidimensional, in which each state is a vector – by long-enough observation of just one noise-free variable of the system. Thus, given a series, we can answer questions about the dynamics of the system that generated it by examining the dynamics in a space defined by delayed values of just that series. From this, we can compute features such as the number of degrees of freedom and linking of trajectories, and make predictions by interpolating in the delay embedding space. However, Takens theorem holds for mathematical measurement functions, not the ones seen in the laboratory or the market: an asset price is not a noise-free function. Nevertheless, the theorem supports experiments with a delay embedding, which might yield useful models. In fact, they often do (Deboeck, 1994).


Prediction, modeling and characterization are three different goals of time series analysis (Gershenfeld & Weigend, 1993): "The aim of prediction is to accurately forecast the short-term evolution of the system; the goal of modeling is to find a description that accurately captures features of the long-term behavior. These are not necessarily identical: finding governing equations with proper long-term properties may not be the most reliable way to determine parameters for short-term forecasts, and a model that is useful for short-term forecasts may have incorrect long-term properties. Characterization attempts with little or no a priori knowledge to determine fundamental properties, such as the number of degrees of freedom of a system or the amount of randomness."

2.1.2 Financial Time Series Properties

One may wonder if there are universal characteristics of the many series coming from markets differing in size, location, commodities, sophistication etc. The surprising fact is that there are (Cont, 1999). Moreover, interacting systems in other fields, such as statistical mechanics, suggest that the properties of financial time series loosely depend on the market microstructure and are common to a range of interacting systems. Such observations have stimulated new models of markets based on analogies with particle systems and brought in new analysis techniques, opening the era of econophysics (Mantegna & Stanley, 2000).

Efficient Market Hypothesis (EMH), developed in 1965 (Fama, 1965), initially got wide acceptance in the financial community. It asserts, in its weak form, that the current price of an asset already reflects all information obtainable from past prices and assumes that news is promptly incorporated into prices. Since news is assumed unpredictable, so are prices.

However, real markets do not obey all the consequences of the hypothesis; e.g., a price random walk implies a normal distribution, which is not the observed case, and there is a delay while the price stabilizes to a new level after news. Such observations, among others, led to a more modern view (Haughen, 1997): "Overall, the best evidence points to the following conclusion. The market isn't efficient with respect to any of the so-called levels of efficiency. The value investing phenomenon is inconsistent with semi-strong form efficiency, and the January effect is inconsistent even with weak form efficiency. Overall, the evidence indicates that a great deal of information available at all levels is, at any given time, reflected in stock prices. The market may not be easily beaten, but it appears to be beatable, at least if you are willing to work at it."

The distribution of financial series (Cont, 1999) tends to be non-normal, sharp-peaked and heavy-tailed, these properties being more pronounced for intraday values. Such observations were pioneered in the 1960s (Mandelbrot, 1963), interestingly around the time the EMH was formulated.

Volatility – measured by the standard deviation – also has common characteristics (Tsay, 2002). First, there exist volatility clusters, i.e. volatility may be high for certain periods and low for others. Second, volatility evolves over time in a continuous manner; volatility jumps are rare. Third, volatility does not diverge to infinity but varies within a fixed range, which means that it is often stationary. Fourth, volatility's reaction to a big price increase seems to differ from its reaction to a big price drop.

Extreme values appear more frequently in a financial series, as compared to a normally-distributed series of the same variance. This is important to the practitioner, since often such values cannot be disregarded as erroneous outliers but must be actively anticipated, because of their magnitude, which can influence trading performance.

The scaling property of a time series indicates that the series is self-similar at different time scales (Mantegna & Stanley, 2000). This is common in financial time series, i.e. given a plot of returns without the axes labelled, it is next to impossible to say whether it represents hourly, daily or monthly changes, since all the plots look similar, with differences appearing at minute resolution. Thus prediction methods developed for one resolution could, in principle, be applied to others.

Data frequency refers to how often series values are collected: hourly, daily, weekly etc. Usually, if a financial series provides values on a daily, or longer, basis, it is low frequency data; otherwise – when many intraday quotes are included – it is high frequency. Tick-by-tick data includes all individual transactions, and as such, the event-driven time between data points varies, creating a challenge even for such a simple calculation as correlation. The minute market microstructure and massive data volume create new problems and possibilities not dealt with by the thesis. The reader interested in high frequency finance can start at (Dacorogna et al., 2001).

2.2 Data Preprocessing

Before data is scrutinized by a prediction algorithm, it must be collected, inspected, cleaned and selected. Since even the best predictor will fail on bad data, data quality and preparation is crucial. Also, since a predictor can exploit only certain data features, it is important to detect which data preprocessing/presentation works best.

2.2.1 Data Cleaning

Data cleaning fills in missing values, smoothes noisy data, handles or removes outliers, and resolves inconsistencies. Missing values can be handled by a generic method (Han & Kamber, 2001). Methods include skipping the whole instance with a missing value, filling the gap with the mean or a new 'unknown' constant, or using inference, e.g. based on most similar instances or some Bayesian considerations.

Series data has another dimension – we do not want to spoil the temporal relationship, thus data restoration is preferable to removal. The restoration should also accommodate the time aspect – not use too time-distant values. Noise is prevalent; data from low volume markets especially should be treated with suspicion. Noise reduction usually involves some form of averaging, or putting a range of values into one bin, i.e. discretization.

If data changes are numerous, a test whether the predictor picks up the inserted bias is advisable. This can be done by 'missing' some values from a random series – or better: permuted actual returns – and then restoring, cleaning etc. the series as if genuine. If the predictor can subsequently predict anything from this, after all random, series, too much structure has been introduced (Gershenfeld & Weigend, 1993).

2.2.2 Data Integration

Data integration combines data from multiple sources into a coherent store. Time alignment can demand consideration in series from different sources, e.g. different time zones. Series-to-instances conversion is required by most of the learning algorithms, which expect as input a fixed length vector. It can be done by the delay vector embedding technique. Such delay vectors with the same time index t – coming from all input series – appended together give an instance, data point or example, its coordinates referred to as data features, attributes or variables.

2.2.3 Data Transformation

Data transformation changes the values of a series to make it more suitable for prediction. Detrending is one such common transformation, removing the growth of a series, e.g. by working with subsequent value differentials, or by subtracting the trend (linear, quadratic etc.) interpolation. For stocks, indexes, and currencies, converting into the series of returns does the trick. For volume, dividing it by the average of the last k quotes, e.g. a yearly average, can scale it down.

Indicators are series derived from others, enhancing some features of interest, such as trend reversal. Over the years, traders and technical analysts trying to predict stock movements developed the formulae (Murphy, 1999), some later confirmed to carry useful information (Sullivan et al., 1999). Indicators can also reduce noise, due to the averaging in many of the formulae. Common indicators include: Moving Average (MA), Stochastic Oscillator, Moving Average Convergence Divergence (MACD), Rate of Change (ROC), and Relative Strength Index (RSI).
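As an illustration, the simplest of these formulae take only a few lines each; a sketch using the standard textbook definitions (the function names are mine, not from the thesis):

```python
import numpy as np

def moving_average(s, n):
    """Simple Moving Average: mean of the last n values, one value per full window."""
    s = np.asarray(s, dtype=float)
    return np.convolve(s, np.ones(n) / n, mode="valid")

def rate_of_change(s, n):
    """Rate of Change: relative change versus the value n steps back."""
    s = np.asarray(s, dtype=float)
    return s[n:] / s[:-n] - 1.0

prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])
print(moving_average(prices, 3))  # ≈ [101.  102.67 104.33 106.  ]
print(rate_of_change(prices, 2))  # ≈ [0.01  0.0294 0.0594 0.0095]
```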

Normalization brings values to a certain range, minimally distorting initial data relationships; e.g. the SoftMax norm increasingly squeezes extreme values, linearly mapping the middle 95% of values.


2.2.4 Data Reduction

Sampling – not using all the data available – might be worthwhile. In my experiments with NYSE predictability, skipping the half of training instances with the lowest weight (i.e. weekly return) enhanced predictions; a similar effect is reported in (Deboeck, 1994). The improvement could be due to skipping noise-dominated small changes, and/or the dominant changes being ruled by a mechanism whose learning is distracted by the numerous small changes.

Feature selection – choosing informative attributes – can make learning feasible, because of the curse of dimensionality (Mitchell, 1997): multi-feature instances demand (exponentially w.r.t. the feature number) more data to train on. There are two approaches to the problem: in the filter approach, a purpose-made algorithm evaluates and selects features, whereas in the wrapper approach, the final learning algorithm is presented with different feature subsets, selected on the quality of the resulting predictions.

2.2.5 Data Discretization

Discretization maps similar values into one discrete bin, with the idea that it preserves important information; e.g. if all that matters is a real value's sign, it could be digitized to {0, 1}, 0 for negative, 1 otherwise. Some prediction algorithms require discrete data, sometimes referred to as nominal. Discretization can improve predictions by reducing the search space, reducing noise, and by pointing to important data characteristics. Unsupervised approaches work by dividing the original feature value range into a few equal-length or equal-data-frequency intervals; supervised approaches – by maximizing a measure involving the predicted variable, e.g. entropy or the chi-square statistics (Liu et al., 2002).
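A sketch of the two unsupervised variants just mentioned – equal-length and equal-data-frequency intervals – follows (illustrative code, not the thesis' implementation):

```python
import numpy as np

def equal_width_bins(x, k):
    """Split the value range into k equal-length intervals; return labels 0..k-1."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.digitize(x, edges[1:-1])  # digitize on interior edges only

def equal_frequency_bins(x, k):
    """Split so each bin receives roughly the same number of data points."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.digitize(x, edges[1:-1])

returns = np.random.default_rng(0).normal(0.0, 0.01, 1000)
print(np.bincount(equal_width_bins(returns, 5)))      # uneven counts per bin
print(np.bincount(equal_frequency_bins(returns, 5)))  # roughly 200 per bin
```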

Since discretization is an information-losing transformation, it should be approached with caution, especially as most algorithms perform univariate discretization – they look at one feature at a time, disregarding that it may have (additional) significance only in the context of other features, as would be preserved in multivariate discretization. For example, if the predicted class = sign(xy), only discretizing x and y in tandem can discover their significance; alone, x and y can be inferred as not related to the class and even disregarded! The multivariate approach is especially important in financial prediction, where no single variable can be expected to bring significant predictability (Zemke & Rams, 2003).
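The class = sign(xy) example can be verified numerically; in the sketch below (illustrative, not from the paper), each feature alone is uncorrelated with the class, while a joint discretization into sign quadrants determines it completely:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=10000), rng.normal(size=10000)
cls = np.sign(x * y)  # the class depends on x and y only jointly

# Univariate view: each feature looks unrelated to the class
print(round(np.corrcoef(x, cls)[0, 1], 3))  # ~0.0
print(round(np.corrcoef(y, cls)[0, 1], 3))  # ~0.0

# Multivariate view: discretize x and y in tandem (sign quadrants)
quadrant = 2 * (x > 0) + (y > 0)
for q in range(4):
    print(q, cls[quadrant == q].mean())  # each joint bin determines the class
```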

2.2.6 Data Quality Assessment

Predictability assessment allows one to concentrate on feasible cases (Hawawini & Keim, 1995). Some tests are simple non-parametric predictors – prediction quality reflecting predictability. The tests may involve: 1) Linear methods, e.g. to measure correlation between the predicted and feature series. 2) The Nearest Neighbor prediction method, to assess local model-free predictability. 3) Entropy, to measure information content (Molgedey & Ebeling, 2000). 4) Detrended Fluctuation Analysis (DFA), to reveal long-term self-similarity, even in nonstationary series (Vandewalle et al., 1997). 5) Chaos and the Lyapunov exponent, to test short-term determinism. 6) Randomness tests like chi-square, to assess the likelihood that the observed sequence is random. 7) Nonstationarity tests.

2.3 Basic Time Series Models

This section presents basic prediction methods, starting with the linear models well established in the financial literature and moving on to modern nonlinear learning algorithms.

2.3.1 Linear Models

Most linear time series models descend from the AutoRegressive Moving Average (ARMA) and Generalized Autoregressive Conditional Heteroskedastic (GARCH) (Bollerslev, 1986) models, a summary of which follows (Tsay, 2002).

ARMA models join the simpler AutoRegressive (AR) and Moving-Average (MA) models. The concept is useful in volatility modelling, less so in return prediction. A general ARMA(p, q) model has the form:


$$r_t = \phi_0 + \sum_{i=1}^{p} \phi_i r_{t-i} + a_t - \sum_{i=1}^{q} \theta_i a_{t-i}$$

where $p$ is the order of the AR part, $\phi_i$ its parameters, $q$ the order of the MA part, $\theta_j$ its parameters, and $a_t$ normally-distributed noise. Given a data series $r_t$, there are heuristics to specify the order and parameters, e.g. either by the conditional or exact likelihood method. The Ljung-Box statistics of residuals can check the fit (Tsay, 2002).

GARCH models volatility, which is influenced by time-dependent information flows resulting in pronounced temporal volatility clustering. For a log return series $r_t$, we assume its mean ARMA-modelled, and let $a_t = r_t - \mu_t$ be the mean-corrected log return. Then $a_t$ follows a GARCH(m, s) model if:

$$a_t = \sigma_t \varepsilon_t, \qquad \sigma_t^2 = \alpha_0 + \sum_{i=1}^{m} \alpha_i a_{t-i}^2 + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^2$$

where $\varepsilon_t$ is a sequence of independent identically distributed (iid) random variables with mean 0 and variance 1, $\alpha_0 > 0$, $\alpha_i \geq 0$, $\beta_j \geq 0$, and $\sum_{i=1}^{\max(m,s)} (\alpha_i + \beta_i) < 1$.
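For intuition, the equations can be simulated directly; a minimal GARCH(1,1) sketch with arbitrary illustrative parameter values (m = s = 1):

```python
import numpy as np

def simulate_garch11(n, alpha0=1e-6, alpha1=0.1, beta1=0.85, seed=0):
    """Simulate a_t = sigma_t * eps_t with
    sigma_t^2 = alpha0 + alpha1 * a_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    a = np.zeros(n)
    sigma2 = np.zeros(n)
    sigma2[0] = alpha0 / (1 - alpha1 - beta1)  # unconditional variance
    for t in range(1, n):
        sigma2[t] = alpha0 + alpha1 * a[t - 1] ** 2 + beta1 * sigma2[t - 1]
        a[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return a

returns = simulate_garch11(2000)
# Volatility clustering: |a_t| stays autocorrelated even though a_t is not
r = np.abs(returns) - np.abs(returns).mean()
print(np.correlate(r, r, "full")[len(r):len(r) + 5] / (r @ r))  # lags 1..5 > 0
```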

Box-Jenkins AutoRegressive Integrated Moving Average (ARIMA) models extend the ARMA models, moreover coming with a detailed procedure for how to fit and test such a model – not an easy task (Box et al., 1994). Because of their wide applicability, extendability to nonstationary series, and the fitting procedure, the models are commonly used. ARIMA assumes that a probability model generates the series, with future values related to past values and errors.

Econometric models extend the notion of a series depending only on its past values – they additionally use related series. This involves a regression model in which the time series is forecast as the dependent variable; the related time series as well as the past values of the time series are the independent or predictor variables. This, in principle, is the approach of the thesis papers.


2.3.2 Limits of Linear Models

Modern econometrics increasingly shifts towards nonlinear models of risk and return. Bera – actively involved in (G)ARCH research – remarked (Bera & Higgins, 1993): "a major contribution of the ARCH literature is the finding that apparent changes in the volatility of economic time series may be predictable and result from a specific type of nonlinear dependence rather than exogenous structural changes in variables". Campbell further argued (Campbell et al., 1997): "it is both logically inconsistent and statistically inefficient to use volatility measures that are based on the assumption of constant volatility over some period when the resulting series moves through time."

2.3.3 Nonlinear Methods

Nonlinear methods are increasingly preferred for financial prediction, due to the perceived nonlinear dependencies in financial data which cannot be handled by purely linear models. A short overview of the methods follows (Mitchell, 1997).

Artificial Neural Network (ANN) advances linear models by applying a nonlinear function to the linear combination of inputs to a network unit – a perceptron. In an ANN, perceptrons are usually arranged in layers, with those in the first layer having access to the inputs, the perceptrons' outputs forming the inputs to the next layer, and the final layer providing the ANN output(s). Training a network involves adjusting the weights in each unit's linear combination so as to minimize an objective, e.g. the squared error. Backpropagation – the classical training method – may, however, miss an optimal network due to falling into a local minimum, so other methods might be preferred (Zemke, 2002b).
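A bare-bones forward pass showing the structure just described – each layer applying a nonlinear function to a linear combination of its inputs (an illustrative sketch; the weights here are random, in practice they are set by training):

```python
import numpy as np

def ann_forward(x, layers):
    """Propagate input x through layers of (weights, bias) pairs,
    each unit computing tanh(<weights, inputs> + bias)."""
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),  # 3 inputs -> 4 hidden
          (rng.normal(size=(1, 4)), rng.normal(size=1))]  # 4 hidden -> 1 output
print(ann_forward(np.array([0.1, -0.2, 0.3]), layers))
```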

Inductive Logic Programming (ILP) and the decision tree (Mitchell, 1997) learner C4.5 (Quinlan, 1993) generate if-conditions-then-outcome symbolic rules, human-understandable if small. Since the search for such rules is expensive, the algorithms either employ greedy heuristics, e.g. C4.5 looking at a single variable at a time, or perform exhaustive search, e.g. ILP Progol. These limit the applicability, especially in an area where data is voluminous and unlikely to be in the form of simple rules. Additionally, ensembles – putting a number of different predictors to vote – obstruct the acclaimed human comprehension of the rules. However, the approach could be of use in more regular domains, such as customer rating and perhaps fraud detection. Rules can also be extracted from an ANN, or used together with probabilities, making them more robust (Kovalerchuk & Vityaev, 2000).

Nearest Neighbor (kNN) does not create a general model; to predict, it looks back for the k most similar past cases. It is distracted by noisy/irrelevant features, but if this is ruled out, failure of kNN suggests that the most that can be predicted are general regularities, e.g. based on the output (conditional) distribution.

A Bayesian predictor first learns probabilities of how evidence supports outcomes, which are then used to predict the outcome for new evidence. Although the simple learning scheme is robust to violating the 'naive' independent-evidence assumption, watching independence might pay off, especially as in decreasing markets variables become more correlated than usual.

Support Vector Machines (SVM) offer a relatively new and powerful learner, with attractive characteristics for time series prediction (Muller et al., 1997). First, the model deals with multidimensional instances, actually the more features the better – reducing the need for (possibly wrong) feature selection. Second, it has few parameters, thus finding optimal settings can be easier; one parameter refers to the noise level the system can handle.

Genetic Algorithms (GAs) (Deboeck, 1994) mimic biological evolution by mutation and cross-over of solutions, in order to maximize their fitness. This is a general optimization technique, and thus can be applied to any problem – a solution can encode data selection, preprocessing, or a predictor. GAs explore novel possibilities, often not thought of by humans. Therefore, it may be worth keeping some predictor settings as parameters that can be (later) GA-optimized. Evolutionary systems – another example of evolutionary computation – work in a similar way to GAs, except that the solution is coded as a real-valued vector, and is optimized not only with respect to the values but also to the optimization rate.

2.3.4 General Learning Issues

Computational Learning Theory (COLT) theoretically analyzes prediction algorithms, with respect to the learning process assumptions, data and computation requirements.

Probably Approximately Correct (PAC) learnability is a central notion in the theory, meaning that we learn probably – with probability $1 - \delta$ – and approximately – within error $\varepsilon$ – the correct predictor drawn from a space $H$. The lower bound on the number of training examples $m$ needed to find such a predictor is an important result:

$$m \geq \frac{1}{\varepsilon}\left(\ln |H| + \ln(1/\delta)\right)$$

where $|H|$ is the size of the space – the number of predictors in it. This is usually an overly big bound – specifics about the learning process can lower it. However, it provides some insights: $m$ grows linearly in the error factor $1/\varepsilon$ and logarithmically in $1/\delta$, where $1 - \delta$ is the probability that we find the hypothesis at all (Mitchell, 1997).
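For a feel of the numbers (an illustrative calculation, not from the thesis): with a hypothesis space of $|H| = 2^{20}$ predictors, error bound $\varepsilon = 0.1$ and failure probability $\delta = 0.05$, the bound gives $m \geq 10\,(\ln 2^{20} + \ln 20) \approx 10\,(13.9 + 3.0) \approx 169$ training examples.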

The curse of dimensionality (Bellman, 1961) involves two related problems. As the data dimension – the number of features in an instance – grows, the predictor needs increasing resources to cover the growing instance space. It also needs more instances to learn – exponentially many with respect to the dimension. Some prediction algorithms, e.g. kNN, will not be able to generalize at all if the dimension is greater than $\ln(M)$, where $M$ is the number of instances. This is why feature selection – reducing the data dimension – is so important. The amount of data needed to train a predictor can be experimentally estimated (Walczak, 2001).

Overfitting means that a predictor memorizes non-general aspects of the training data, such as noise. This leads to poor prediction on new data. It is a common problem, due to a number of reasons. First, the training and testing data are often not well separated, so memorizing the common part will give the predictor a higher score. Second, multiple trials might be performed on the same data (split), so in effect the predictor coming out will be the one best suited for exactly that data. Third, the predictor complexity – the number of internal parameters – might be too big for the number of training instances, so the predictor learns even the unimportant data characteristics.

Precautions against overfitting involve: good separation of training and testing data, careful evaluation, use of ensembles averaging out the individual overfitting, and application of Occam's razor. In general, overfitting is a difficult problem that must be approached individually. A discussion of how to deal with it can be found in (Mitchell, 1997).

Occam's razor – preferring a smaller solution, e.g. a predictor involving fewer parameters, to a bigger one, other things equal – is not a specific technique but general guidance. There are indeed arguments (Mitchell, 1997) that a smaller hypothesis has a bigger chance to generalize well on new data. Speed is another motivation – a smaller predictor is likely to be faster, which can be especially important in an ensemble.

Entropy (Shannon & Weaver, 1949) is an information measure useful at many stages in a prediction system development. Entropy expresses the number of bits of information brought in by an entity, be it the next training instance or checking another condition. Since the notion does not assume any data model, it is well suited to deal with nonlinear systems. As such it is used in feature selection, predictability estimation, and predictor construction, e.g. in C4.5 as the information gain measure deciding which feature to split on.
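In code, the measure is a one-liner over the outcome frequencies; a small illustrative sketch:

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy H = -sum_i p_i * log2(p_i) of a discrete sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A balanced up/down series carries 1 bit per value; a biased one carries less
print(entropy_bits([1, 0, 1, 1, 0, 0, 1, 0]))  # 1.0
print(entropy_bits([1, 1, 1, 1, 1, 1, 1, 0]))  # ~0.544
```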

2.4 Ensemble Methods

An ensemble (Dietterich, 2000) is a number of predictors whose votes are put together into the final prediction. The predictors, on average, are expected to be above-random and to make independent errors. The idea is that a correct majority offsets individual errors, thus the ensemble will be correct more often than an individual predictor. The diversity of errors is usually achieved by training one scheme, e.g. C4.5, on different instance samples or features. Alternatively, different predictor types – like C4.5, ANN, kNN – can be used. Common schemes include Bagging, Boosting, Bayesian ensembles and their combinations (Dietterich, 2000).

Bagging produces an ensemble by training predictors on different bootstrap samples – each the size of the original data, but sampled allowing repetitions. The final prediction is the majority vote. This simple-to-implement scheme is always worth trying, in order to reduce prediction variance.
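A minimal bagging sketch over a generic learner; the fit/predict interface below is an assumption (in the style of scikit-learn), not something prescribed by the thesis:

```python
import numpy as np

def bagging_predict(make_learner, X, y, X_test, n_predictors=25, seed=0):
    """Train each predictor on a bootstrap sample (original size, drawn with
    repetitions) and return the majority vote on the test instances.
    Assumes non-negative integer class labels, e.g. 0 = down, 1 = up."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_predictors):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
        votes.append(make_learner().fit(X[idx], y[idx]).predict(X_test))
    votes = np.array(votes)  # shape: (n_predictors, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

With scikit-learn, for example, make_learner could be `lambda: DecisionTreeClassifier()`.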

Boosting initially assigns equal weights to all data instances and trains a predictor; then it increases the weights of the misclassified instances, trains the next predictor on the new distribution, and so on. The final prediction is a weighted vote of the predictors obtained in this way. Boosting increasingly pays attention to misclassified instances, which may lead to overfitting if the instances are noisy.

A Bayesian ensemble, similarly to the Bayesian predictor, uses conditional probabilities accumulated for the individual predictors to arrive at the most evidenced outcome. Given good estimates of the predictors' accuracy, a Bayesian ensemble results in a prediction closer to optimal than bagging.

2.5 System Evaluation

Proper evaluation is crucial to a prediction system development. First, it has to measure exactly the effect of interest, e.g. trading return, as opposed to the related, but not identical, prediction accuracy. Second, it has to be sensitive enough to spot even minor gains. Third, it has to convince that the gains are not merely a coincidence.


Usually prediction performance is compared against published results. Although this approach has its problems, such as data overfitting and accidental successes due to multiple (worldwide!) trials, it works well as long as everyone uses the same data and evaluation procedure, so meaningful comparisons are possible. However, when no agreed benchmark is available, as in the financial domain, another approach must be adopted. Since the main question concerning financial data is whether prediction is at all possible, it suffices to compare a predictor's performance against the intrinsic growth of a series – also referred to as the buy-and-hold strategy. Then a statistical test can judge if there is a significant improvement.

2.5.1 Evaluation Data

To reasonably test a prediction system, the data must include different trends and the assets for which the system is to perform, and must be plentiful enough to warrant significant conclusions. Overfitting a system to data is a real danger. Dividing data into three disjoint sets is the first precaution. The training portion of the data is used to build the predictor. If the predictor involves some parameters which need to be tuned, they can be adjusted so as to maximize performance on the validation part. Then, the system parameters frozen, its performance on an unseen test set provides the final performance estimation. In multiple tests, the significance level should be adjusted; e.g. if 10 tests are run and the best appears 99.9% significant, it really is $0.999^{10} \approx 99\%$ significant (Zemke, 2000). If we want the system to predict the future of a time series, it is important to maintain the proper time relation between the training, validation and test sets – basically, training should involve instances time-preceding any test data.

Bootstrap (Efron & Tibshirani, 1993) – sampling with repetitions as many elements as in the original data – and deriving a predictor for each such sample, is useful for collecting various statistics (LeBaron & Weigend, 1994), e.g. return and risk-variability. It can also be used for ensemble creation or best predictor selection, however not without limits (Hastie et al., 2001).


2.5.2 Evaluation Measures

Financial forecasts are often developed to support semi-automated trading (profitability), whereas the algorithms used in those systems might originally have different objectives. Accuracy – the percentage of correct discrete (e.g. up/down) predictions – is a common measure for discrete systems, e.g. ILP/decision trees. Square error – the sum of squared deviations from actual outputs – is a common measure in numerical prediction, e.g. ANN. A performance measure – incorporating both the predictor and the trading model it is going to benefit – is preferable, and ideally should measure exactly what we are interested in, e.g. commission- and risk-adjusted return (Hellstrom & Holmstrom, 1998), not just return. Actually, many systems' 'profitability' disappears once commissions are taken into account.

2.5.3 Evaluation Procedure

In data sets where instance order does not matter, N-fold cross-validation – data divided into N disjoint parts, N − 1 for training and 1 for testing, error averaged over all N (Mitchell, 1997) – is a standard approach. However, in the case of time series data, it underestimates error, because in order to train a predictor we would sometimes use data that comes after the test instances – unlike in real life, where a predictor knows only the past, not the future. For series, the sliding window approach is more adept: a window/segment of consecutive instances is used for training and a following segment for testing, the windows sliding over all the data as statistics are collected.
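A sketch of the sliding window split (illustrative; the segment lengths are arbitrary):

```python
def sliding_window_splits(n, train_len, test_len):
    """Yield (train_range, test_range) pairs: a segment of consecutive
    instances for training, the immediately following segment for testing,
    the window sliding forward over all n instances."""
    start = 0
    while start + train_len + test_len <= n:
        yield (range(start, start + train_len),
               range(start + train_len, start + train_len + test_len))
        start += test_len  # slide forward; training always precedes testing

for train, test in sliding_window_splits(10, train_len=4, test_len=2):
    print(list(train), list(test))
```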

2.5.4 Non/Parametric Tests

Parametric statistical tests have assumptions, e.g. concerning sample independence and distribution, and as such allow stronger conclusions from smaller data – the assumptions can be viewed as additional input information, so they need to be demonstrated, which is often missed. Nonparametric tests put much weaker requirements, so for equally numerous data they allow weaker conclusions. Since financial data have a non-normal distribution, while normality is required by many of the parametric tests, non-parametric comparisons might be safer (Heiler, 1999).


Surrogate data is a useful concept in system evaluation (Kantz & Schreiber, 1999a). The idea is to generate data sets sharing characteristics of the original data – e.g. permutations of a series have the same mean, variance etc. – and for each compute a statistic of interest, e.g. the return of a strategy. If $\alpha$ is the acceptable risk of wrongly rejecting the null hypothesis that the original series' statistic is lower (higher) than that of any surrogate, then $1/\alpha - 1$ surrogates are needed; if all give a higher (lower) statistic than the original series, then the hypothesis can be rejected. Thus, if a predictor's error was lower on the original series, as compared to 19 runs on surrogates, we can be 95% sure it was not a fluke.
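A sketch of the test with permutation surrogates (illustrative; the statistic below is a stand-in for, e.g., a predictor's error):

```python
import numpy as np

def beats_surrogates(series, statistic, alpha=0.05, seed=0):
    """True if the statistic on the original series is lower than on all
    1/alpha - 1 permutation surrogates (reject the null at level alpha)."""
    rng = np.random.default_rng(seed)
    n_surrogates = round(1 / alpha) - 1  # e.g. 19 for alpha = 0.05
    original = statistic(np.asarray(series))
    return all(statistic(rng.permutation(series)) > original
               for _ in range(n_surrogates))

# Stand-in statistic: negative |lag-1 autocorrelation| (lower = more structure)
stat = lambda s: -abs(np.corrcoef(s[:-1], s[1:])[0, 1])
series = np.sin(np.linspace(0, 20, 200))
print(beats_surrogates(series, stat))  # True: the sine has genuine structure
```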


Chapter 3

Development of the Thesis

Be concerned with the ends, not the means. – Bruce Lee

3.1 First half – Exploration

When introduced to the area of machine learning (ML) around 1996, I noticed that many of the algorithms were developed on artificial 'toy problems' and, once done, the search started for more realistic problems 'suitable' for the algorithm. As reasonable as such a strategy might initially appear – knowledge of the optimal performance area of a learning algorithm is often what is desired – such studies seldom yielded general area insights, merely performance comparisons for the carefully chosen test domains. This is in sharp contrast to the needs of a practitioner, who faces a learning problem first and searches for the solution method later, not vice versa. So, in my research, I adopted the practical approach: here is my prediction problem, what can I do about it.

My starting point was that financial prediction is difficult, but is it impossible? Or perhaps the notion of unpredictability emerged due to the nature of the method rather than the data – a case already known: with the advent of chaotic analysis, many processes previously considered random turned out to be deterministic, at least in the short run. Though I do not believe that such a complex socio-economical process as the markets will any time soon be found completely predictable, the question of a limited predictability remains open and challenging. And since challenging problems often lead to profound discoveries, I considered the subject worthwhile.


The experiments started with Inductive Logic Programming (ILP) – learning logic programs by combining provided background predicates supposedly useful in the domain in question. I used the then (in 1997) state-of-the-art system, Progol, reported successful in other domains, such as toxicology and chemistry. I provided the system with various financial indicators, however, despite many attempts, no compressed rules were ever generated. This could be due to the noise present in financial data and the rules, if any, being far from the compact form sought by an ILP system.

The initial failure reiterated the question: is financial prediction at all possible, and if so, which algorithm works best? The failure of an otherwise successful learning paradigm directed the search towards more original methods. After many fruitless trials, some promising results started appearing, with the unorthodox method shortly presented in the Feasibility Study on Short-Term Stock Prediction, Appendix A. This method looked for invariants in the time series predicted – not just patterns with high predictive accuracy, but patterns that have above-random accuracy in a number of temporally distinct time epochs, thus excluding those that work perhaps well, but only for a time. The work went unpublished since the trials were limited and in the early stages of my research I was encouraged to use more established methods. However, it is interesting to note that the method is similar to entropy-based compression schemes, which I discovered later.

So I went on to evaluate standard machine learning – to see which of the methods warrants further investigation. I tried: Neural Network, Nearest Neighbor, Naive Bayesian Classifier and Genetic Algorithms (GA) evolved rules. That research, presented and published as Nonlinear Index Prediction – thesis paper 1, concludes that Nearest Neighbor (kNN) works best. Some of the details, not included in the paper, made it into the report ILP and GA for Time Series Prediction, thesis paper 2.

The success of kNN suggested that delay embedding and local prediction work for my data, so perhaps could be improved. However, when I tried to GA-optimize the embedding parameters, the prediction results were not better. If fine-tuning was not the way, perhaps averaging a number of rough predictors would be. The majority voting scheme has indeed


improved the prediction accuracy. The originating publication Bagging Imperfect Predictors, thesis paper 3, presents bagging results from Nonlinear Index Prediction and an approach believed to be novel at that time – bagging predictions from a number of classifiers evolved in one GA population.
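A sketch of such a mixed-vote ensemble follows; the three toy rule classifiers are illustrative placeholders for classifiers differing by method, data partition, or GA individual:

from collections import Counter

def majority_vote(classifiers, x):
    # Bag predictions of classifiers differing in training data,
    # learning method and parameters; the most frequent class wins
    votes = Counter(classify(x) for classify in classifiers)
    return votes.most_common(1)[0][0]

ensemble = [lambda x: 'up' if x[-1] > 4 else 'down',
            lambda x: 'up' if sum(x) > 45 else 'down',
            lambda x: 'up' if x[-1] >= x[0] else 'down']
print(majority_vote(ensemble, [3, 6, 5, 8, 8, 8, 4, 8, 8, 7]))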

Another spin-off from the success of kNN in Nonlinear Index Prediction, and thus the implicit presence of determinism and perhaps limited dimension of the data, was a research proposal, Evolving Differential Equations for Dynamical System Modeling. The idea behind this more extensive project is to use a Genetic Programming-like approach, but instead of evolving programs, to evolve differential equations, known as the best descriptive and modeling tool for dynamical systems. This is what the theory says, but finding equations fitting given data is not yet a solved task. The project was stalled, awaiting financial support.

But coming back to the main thesis track. The GA experiments in Bagging Imperfect Predictors were computationally intensive, as is often the case while developing a new learning approach. This problem gave rise to an idea of how to try a number of development variants at once, instead of one-by-one, saving on computation time. Rapid Fine-Tuning of Computationally Intensive Classifiers, thesis paper 4, explains the technique, together with some experimental guidelines.

The ensemble of GA individuals, as in Bagging Imperfect Predictors, could further benefit from a more powerful classifier committee technique, such as boosting. The published poster Amalgamation of Genetic Selection and Boosting, Appendix B, highlights the idea.

3.2 Second half – Synthesis

At that point, I presented the mid-Ph.D. results and thought about what to do next. Since ensembles, becoming mainstream in the machine learning community, seemed the most promising way to go, I investigated how different types of ensembles performed with my predictors, with the Bayesian coming a bit ahead of Bagging and Boosting. However, the results were not that startling and I found more extensive comparisons in the literature, making me abandon that line of research.

However, while searching for the comparisons above, I had done quite an extensive review. I selected the most practical and generally-applicable papers in Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data, a publication which addresses the four data issues relevant to financial prediction, thesis paper 5.

Except for the general algorithmic considerations, there are also the tens of little decisions that need to be taken while developing a prediction system, many leading to pitfalls. While reviewing descriptions of many systems 'beating the odds' I realized that, although widely different, the acclaimed successful systems share common characteristics, while the naive systems – quite often manipulative in presenting the results – share common mistakes. This led to thesis paper 6, On Developing Financial Prediction System: Pitfalls and Possibilities, which is an attempt to highlight some of the common solutions.

Financial data are generated in complex and interconnected ways. What happens in Tokyo influences what happens in New York and vice versa. For prediction this has several consequences. First, there are very many data series to potentially take as inputs, creating data selection and curse-of-dimensionality problems. Second, many of the series are interconnected, in general, in nonlinear ways. Hence, an attempt to predict must identify the important series and their interactions, having decided that the data warrants predictability at all.

These considerations led me to a long investigation. Searching for a predictability measure, I had the idea to use the common Zip compression to estimate entropy in a constructive way – if the algorithm could compress (many interleaved) series, its internal working could provide the basis for a prediction system. But reviewing references, I found similar work, more mathematically grounded, so I abandoned mine. Then, I shifted attention to uncovering multivariate dependencies, along with a predictability measure, by means of a weighted and GA-optimized Nearest Neighbor, which failed.[1]

[1] It worked, but only up to 15 input data series, whereas I wanted the method to work for more than 50 series.
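For illustration, the compression idea can be sketched with the standard zlib library – a stand-in for the Zip compression mentioned, applied to illustrative series:

import random
import zlib

def compression_ratio(series):
    # Compressed/original size of a digitized series; a ratio well below
    # that of shuffled data hints at exploitable regularities
    raw = bytes(series)                    # small ints, e.g. 1..8
    return len(zlib.compress(raw, 9)) / len(raw)

random_series = [random.randint(1, 8) for _ in range(1000)]
regular_series = [1 + i % 8 for i in range(1000)]
print(compression_ratio(random_series), compression_ratio(regular_series))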

Then came a multivariate discretization idea, initially based on Shannon (conditional) entropy, later reformulated in terms of accuracy. After so many false starts, the feat was quite spectacular, as the method was able to spot multivariate regularities, involving only a fraction of the data, in up to 100 series. To my knowledge, this is also the first (multivariate) discretization having the maximization of ensemble performance as an objective. Multivariate Feature Coupling and Discretization is thesis paper number 7.

Along the second part of the thesis, I have steadily developed time series prediction software incorporating my experiences and expertise. However, at the thesis print time the system is not yet operational, so its description is not included.


Chapter 4

Contributions of Thesis Papers

This chapter summarizes some of the contributions of the 7 papers included in the thesis.

4.1 Nonlinear Index Prediction

This publication (Zemke, 1998) examines index predictability by means of Neural Networks (ANN), Nearest Neighbor (kNN), Naive Bayesian and Genetic Algorithms-optimized Inductive Logic Program (ILP) classifiers. The results are interesting in many respects. First, they show that a limited prediction is indeed possible. This adds to the growing evidence that an unqualified Efficient Market Hypothesis might one day be revised. Second, Nearest Neighbor achieves the best accuracy among the commonly used Machine Learning methods, which might encourage further exploration in this area dominated by Neural Network and rule-based, ILP-like, systems. Also, the success might hint at specific features of the data analyzed. Namely, unlike the other approaches, Nearest Neighbor is a local, model-free technique that does not assume any form of the learnt hypothesis, as is done by a Neural Network architecture or LP background predicates. Third, the superior performance of Nearest Neighbor, as compared to the other methods, points to the problems in constructing global models for financial data. If confirmed in more extensive experiments, it would highlight the intrinsic difficulties of describing some economical dependencies in terms of simple rules, as taught to economics students. And fourth, the failure of the Naive Bayesian classifier can point out limitations of some statistical techniques used to analyze complex preprocessed data, a common approach in the earlier studies of financial data so much contributing to the Efficient Market Hypothesis view.

4.2 ILP via GA for Time Series Prediction

With only the main results of the GA-optimized ILP included in the earlier paper, due to publisher space limits, this report presents some details of these computationally intensive experiments (Zemke, 1999c). Although the overall accuracy of LP on the index data was not impressive, the attempts still have practical value – in outlining limits of otherwise successful techniques. First, the initial experiments applying Progol – at that time a 'state of the art' Inductive Logic Programming system – show that a learning system successful on some domains can fail on others. There could be at least two reasons for this: a domain unsuitable for the learning paradigm, or unskillful use of the system. Here, I only note that most of the successful applications of Progol involve domains where few rules hold most of the time: chemistry, astronomy, (simple) grammars, whereas financial prediction rules, if any, are more soft. As for the unskillful use of an otherwise capable system, the comment could be that such a system would merely shift the burden to learning its 'correct usage' from learning the theory implied by the data provided – instead of lessening the burden altogether. As such, one should be aware that machine learning is still more of an art – demanding experience and experimentation – rather than engineering providing procedures for almost blindly solving a given problem.

The second contribution of this paper exposes background predicate sensitivity – exemplified by variants of equal. The predicate definitions can have a substantial influence on the achieved results – again highlighting the importance of an experimental approach and, possibly, a requirement for nonlinear predicates. Third, since GA-evolved LP can be viewed as an instance of Genetic Programming (GP), the results confirm that GP is perhaps not the best vehicle for time series prediction. And fourth, a general observation about GA-optimization and learning: while evolving LP of varying size, the best (accuracy) programs usually emerged in GA experiments with only a secondary fitness bonus for smaller programs, as opposed to runs in which programs would be penalized by their size. Actually, it was interesting to note that the path to smaller and accurate programs often led through much bigger programs which have been subsequently reduced – should the bigger programs not be allowed to appear in the first place, the smaller ones would not be found either. This observation, together with the not so good generalization of the smallest programs, issues a warning against blind application of Occam's Razor in evolutionary computation.

4.3 Bagging Imperfect Predictors

This publication (Zemke, 1999b), again due to publisher restrictions, compactly presents a number of contributions both to the area of financial prediction and to machine learning. The key tool here is bagging – a scheme involving majority voting of a number of different classifiers so as to increase the ensemble's accuracy. The contributions could be summarized as follows. First, instead of the usual bagging of the same classifier trained on different (bootstrap) partitions of the data, classifiers based on different data partitions as well as methods are bagged together – an idea described as 'neat' by one of the referees. This leads to higher accuracy than those achieved by bagging each of the individual method classifiers or data selections separately. Second, as applied to index data, prediction accuracy seems highly correlated with returns, a relationship reported to break up at higher accuracies. Third, since the above two points hold, bagging applied to a variety of financial predictors has the potential to increase the accuracy of prediction and, consequently, of returns, which is demonstrated. Fourth, in the case of GA-optimized classifiers, it is advantageous to bag all above-average classifiers present in the final GA population, instead of the usual taking of the single best classifier. And fifth, somewhat contrary to conventional wisdom, it turned out that on the data analyzed, big index movements were more predictable than smaller ones – most likely due to the smaller ones consisting of relatively more noise.


4.4 Rapid Fine-Tuning of Computationally Intensive Classifiers

This publication (Zemke, 2000), a spin-off of the experiments carried out for the previous paper, elaborates on a practical aspect applicable to almost any machine learning system development, namely, on rapid fine-tuning of parameters for optimal performance. The results could be summarized as follows. First, working on a specific difficult problem, as in the case of index prediction, can lead to solutions and insights for more general problems, and as such is of value beyond merely the domain of the primary investigation. Second, the paper describes a strategy for simultaneous exploration of many versions of a fine-tuned algorithm with different parameter choices. And third, a statistical analysis method for detection of superior parameter settings is presented, which together with the earlier point allows for rapid fine-tuning.

4.5 On Developing Financial Prediction System: Pitfalls and Possibilities

The publication (Zemke, 2002b) is the result of my own experiments with financial prediction system development and of a review of such in the literature. The paper succinctly lists issues appearing in the development process, pointing to some common pitfalls and solutions. The contributions could be summarized as follows.

First, it makes the reader aware of the many steps involved in a successful system implementation. The presentation tried to follow the development progression – from data preparation, through predictor selection and training, 'boosting' the accuracy, to evaluation issues. Being aware of the progression can help in a more structured development and pinpoint some omissions.

Second, for each stage of the process, the paper lists some common pitfalls. The importance of this cannot be overestimated. For instance, many 'profit-making' systems presented in the literature are tested only in the decade-long bull market 1990-2000, and never tested in long-term falling markets, which most likely would average out the systems' performance. Such are some of the many pitfalls pointed out.

Third, the paper suggests some solutions to the pitfalls and to general issues appearing in prediction system development.

4.6 Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data

This publication (Zemke, 2002a) is the result of an extensive literature search on ensembles applied to realistic data sets, with 4 objectives in mind: 1) time series prediction – how ensembles can specifically exploit the serial nature of the data; 2) accuracy estimation – how ensembles can measure the maximal prediction accuracy for a given data set, in a better way than any single method; 3) how ensembles can exploit multidimensional data; and 4) how to use ensembles in the case of noisy data.

The four issues appear in the context of financial time series prediction, though the examples referred to are non-financial. Actually, this cross-domain application of working solutions could bring new methods to financial prediction. The contributions of the publication can be summarized as follows.

First, after a general introduction to how and why ensembles work, and to the different ways to build them, the paper diverges into the four title areas. The message here can be that although ensembles are generally-applicable and robust techniques, a search for the 'ultimate ensemble' should not overlook the characteristics and requirements of the problem in question. A similar quest for the 'best' machine learning technique a few years ago failed with the realization that different techniques work best in different circumstances. Similarly with ensembles: different problem settings require individual approaches.

Second, the paper goes on to present some of the working approaches addressing the four issues in question. This has practical value. Usually the ensemble literature is organized by ensemble method, whereas a practitioner has data and a goal, e.g. to predict from noisy series data. The paper points to possible solutions.


4.7 Multivariate Feature Coupling and Discretization

This paper (Zemke & Rams, 2003) presents a multivariate discretization method based on Genetic Algorithms applied twice, first to identify important feature groupings, and second to perform the discretization maximizing a desired function, e.g. the predictive accuracy of an ensemble built on those groupings. The contributions could be summarized as follows.

First, as the title suggests, a multivariate discretization is provided, presenting an alternative to the very few multivariate methods reported. Second, feature grouping and ranking – the intermediate outcome of the procedure – has a value in itself: it allows one to see which features are interrelated and how much predictability they bring in, promoting feature selection. Third, the second, global GA-optimization allows an arbitrary objective to be maximized, unlike in other discretization schemes where the objective is hard-coded into the algorithm. The objective exemplified in the paper maximizes the goal of prediction: accuracy, whereas other schemes often only indirectly attempt to maximize it via measures such as entropy or the chi-square statistic. As a fourth contribution, to my knowledge this is the first discretization to allow explicit optimization for an ensemble. This forces the discretization to act on a global basis, not merely searching for maximal information gain per selected feature (grouping) but for all features viewed together. Fifth, the global discretization can also yield a global estimate of predictability for the data.


Chapter 5

Bibliographical Notes

This chapter is intended to provide a general bibliography introducing newcomers to the interdisciplinary area of financial prediction. I list a few books I have found to be both educational and interesting to read in my study of the domain.

Machine Learning

Machine Learning (Mitchell, 1997). As for now, I would regard this book as the textbook for machine learning. It not only presents the main learning paradigms – neural networks, decision trees, rule induction, nearest neighbor, analytical and reinforcement learning – but also introduces hypothesis testing and computational learning theory. As such, it balances the presentation of machine learning algorithms with practical issues of using them, and some theoretical aspects of their function. Next editions of this otherwise excellent book could also consider the more novel approaches: support vector machines and rough sets.

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (Witten & Frank, 1999). Using this book, and the software package Weka behind it, could save time otherwise spent on implementing the many learning algorithms. The book essentially provides an extended user guide to the open-source code available online. The Weka toolbox, in addition to more than 20 parameterized machine learning methods, offers data preparation, hypothesis evaluation and some visualization tools. A word of warning, though: most of the implementations are straightforward and non-optimized – suitable for learning the nuts and bolts of the algorithms rather than for big-scale data mining.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie et al., 2001). This book, in wide scope similar to Machine Learning (Mitchell, 1997), could be recommended for its more rigorous treatment and some additional topics, such as ensembles.

Data Mining and Knowledge Discovery with Evolutionary Algorithms (Alex, 2002). This could be a good introduction to practical applications of evolutionary computation to various aspects of data mining.

Financial Prediction

Here, I present a selection of books introducing various aspects of nonlinear financial time series analysis.

Data Mining in Finance: Advances in Relational and Hybrid Methods (Kovalerchuk & Vityaev, 2000). This is an overview of some of the methods used for financial prediction and of the features such a prediction system should have. The authors also present their own system, supposedly overcoming many of the common pitfalls. However, the book is somewhat short on details that would allow re-evaluating some of the claims, but it is good as an overview.

Trading on the Edge (Deboeck, 1994). This is an excellent book of self-contained chapters practically introducing the essence of neural networks, chaos analysis, genetic algorithms and fuzzy sets, as applied to financial prediction.

Neural Networks in the Capital Markets (Refenes, 1995). This collection on neural networks for economic prediction highlights some of the practical considerations in developing a prediction system. Many of the hints are applicable to prediction systems based on other paradigms, not just on neural networks.

Fractal Market Analysis (Peters, 1994). In this book, I found most interesting the chapters on various applications of Hurst or R/S analysis. Though this has not resulted in my immediately using that approach, it is always good to know what self-similarity analysis can reveal about the data at hand.


Nonlinear Analysis, Chaos

Nonlinear Time Series Analysis (Kantz & Schreiber, 1999a). As authors can be divided into those who write what they know and those who know what they write about, this is definitely the latter case. I would recommend this book, among other introductions to nonlinear time series, for its readability, practical approach, examples (though mostly from physics), and formulae with clearly explained meaning. I could easily convert many of the algorithms described in the text into code.

Time Series Prediction: Forecasting the Future and Understanding the Past (Weigend & Gershenfeld, 1994). A primer on nonlinear prediction methods. The book, finalizing the Santa Fe Institute prediction competition, introduces time series forecasting issues and discusses them in the context of the competition entries.

Coping with Chaos (Ott, 1994). This book, by a contributor to chaos theory, is a worthwhile read providing insights into aspects of chaotic data analysis, prediction, filtering and control, with the theoretical motivations revealed.

Finance, General

Modern Investment Theory (Haughen, 1997). A relatively easy to read book systematically introducing current views on investments, mostly from an academic standpoint, though. This book also discusses the Efficient Market Hypothesis.

Financial Engineering (Galitz, 1995). A basic text on what financial engineering is about and what it can do.

Stock Index Futures (Sutcliffe, 1997). Mostly an overview work, providing numerous references to research on index futures. I considered skimming the book essential for insights into documented futures behavior, so as not to reinvent the wheel.

A Random Walk down Wall Street (Malkiel, 1996) and Reminiscences of a Stock Operator (Lefvre, 1994). An enjoyable, leisure read about the mechanics of Wall Street. In some sense the books – presenting investment activity in a wider historical and social context – also have great educational value. Namely, they show the influence of subjective, not always rational, drives on the markets, which as such perhaps cannot be fully analyzed by rational methods.

Finance, High Frequency

An Introduction to High-Frequency Finance (Dacorogna et al., 2001). A good introduction to high frequency finance, presenting facts about the data and ways to process it, along with simple prediction schemes.

Financial Markets Tick by Tick (Lequeux, 1998). In high frequency finance, where data is usually not equally time-spaced, certain mathematical notions – such as correlation and volatility – require new, precise definitions. This book attempts that.


Nonlinear Index Prediction
International Workshop on Econophysics and Statistical Finance, 1998.

Physica A 269 (1999)


Nonlinear Index Prediction

Stefan Zemke
Department of Computer and System Sciences

Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Presented: International Workshop on Econophysics and Statistical Finance, Palermo, 1998.

Published: Physica A, volume 269, 1999

Abstract: Neural Network, K-Nearest Neighbor, Naive Bayesian Classifier and Genetic Algorithm evolving classification rules are compared for their prediction accuracies on stock exchange index data. The method yielding the best result, Nearest Neighbor, is then refined and incorporated into a simple trading system achieving returns above index growth. The success of the method hints at the plausibility of nonlinearities present in the index series and, as such, the scope for nonlinear modeling/prediction.

Keywords: Stock Exchange Index Prediction, Machine Learning, Dynamics Reconstruction via delay vectors, Genetic Algorithms optimized Trading System

Introduction

Financial time series present a fruitful area for research. On one hand there are economists claiming that profitable prediction is not possible, as voiced by the Efficient Market Hypothesis; on the other, there is growing evidence of exploitable features of these series. This work describes a prediction effort involving 4 Machine Learning (ML) techniques. These experiments use the same data and lack unduly specializing adjustments – the goal being a relative comparison of the basic methods. Only subsequently is the most promising technique scrutinized.

Machine Learning (Mitchell, 1997) has been extensively applied to finances (Deboeck, 1994; Refenes, 1995; Zirilli, 1997) and trading (Allen & Karjalainen, 1993; Bauer, 1994; Dacorogna, 1993). Nonlinear time series (Kantz & Schreiber, 1999a) approaches have also become commonplace (Trippi, 1995; Weigend & Gershenfeld, 1994). The controversial notion of (deterministic) chaos in financial data is important since the presence of a chaotic attractor warrants partial predictability of financial time series – in contrast to the random walk and Efficient Market Hypothesis (Fama, 1965; Malkiel, 1996). Some of the results supporting deviation from the log-normal theory (Mandelbrot, 1997) and a limited financial prediction can be found in (LeBaron, 1993; LeBaron, 1994).

The Task

Some evidence suggests that markets with lower trading volume are easier to predict (Lerche, 1997). Since the task of the study is to compare ML techniques, data from the relatively small and scientifically unexplored Warsaw Stock Exchange (WSE) (Aurell & Zyczkowski, 1996) is used, with the quotes, from the opening of the exchange in 1991, freely available on the Internet. At the exchange, prices are set once a day (with intraday trading introduced more recently). The main index, WIG, is a capitalization-weighted average of all the stocks traded on the main floor, and provides the time series used in this study.

The learning task involves predicting the relative index value 5 quotes ahead, i.e., a binary decision whether the index value one trading week ahead will be up or down in relation to the current value. The interpretation of up and down is such that they are equally frequent in the data set, with down also including small index gains. This facilitates detection of above-random predictions – their accuracy, as measured by the proportion of correctly predicted changes, is 0.5 + s, where s is the threshold for the required significance level. For the data including 1200 index quotes, the following table presents the s values for one-sided 95% significance, assuming that 1200 − WindowSize data points are used for the accuracy estimate.

Window size: 60 125 250 500 1000

Significant error: 0.025 0.025 0.027 0.031 0.06


Learning involves WindowSize consecutive index values. Index daily (relative) changes are digitized via monotonically mapping them into 8 integer values, 1..8, such that each is equally frequent in the resulting series. This preprocessing is necessary since some of the ML methods require bounded and/or discrete values. The digitized series is then used to create delay vectors of 10 values, with lag one. Such a vector, (c_t, c_{t-1}, c_{t-2}, ..., c_{t-9}), is the sole basis for prediction of the index up/down value at time t + 5 w.r.t. the value at time t. Only vectors, and their matching predictions, derived from index values falling within the current window are used for learning.
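This preprocessing can be sketched as follows, assuming numpy and illustrative helper names; the random changes stand in for the actual index series:

import numpy as np

def digitize_equal_frequency(changes, n_bins=8):
    # Map daily changes to integers 1..n_bins, each equally frequent
    cuts = np.quantile(changes, [i / n_bins for i in range(1, n_bins)])
    return np.searchsorted(cuts, changes) + 1

def delay_vectors(digitized, length=10, horizon=5):
    # Each vector of 10 consecutive changes predicts the up/down move
    # 'horizon' steps past its last element
    return [(list(digitized[t - length + 1 : t + 1]), t + horizon)
            for t in range(length - 1, len(digitized) - horizon)]

changes = np.diff(np.random.rand(1200))    # stand-in for index changes
coded = digitize_equal_frequency(changes)
print(delay_vectors(coded)[0])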

The best generated predictor – achieving the highest accuracy on the window cases – is then applied to the vector next to the last one in the window, yielding a prediction for the index value falling next to the window. With the accuracy estimate accumulating and the window shifting over all available data points, the resulting prediction accuracies are presented in the tables as percentages.

Neural Network Prediction

Five layered network topologies have been tested. The topologies, as described by the numbers of non-bias units in subsequent layers, are: G0: 10-1, G1: 10-5-1, G2: 10-5-3-1, G3: 10-8-5-1, G4: 10-20-5-1. Units in the first layer represent the input values. The standard backpropagation (BP) algorithm is used for learning weights, with the change values 1..8 linearly scaled down to the [0.2, 0.8] range required by the sigmoid BP, and up denoted by 0.8, and down – by 0.2.

The window examples are randomly assigned into either a training or a validation set, comprising 80% and 20% of the examples respectively. The training set is used by BP to update weights, while the validation set – to evaluate the network's squared output error. The minimal-error network for the whole run is then applied to the example next to the window for prediction. Prediction accuracies and some observations follow.


Window/Graph  G0  G1  G2  G3  G4
60            56  -   -   -   -
125           58  56  63  58  -
250           57  57  60  60  -
500           58  54  57  57  58
1000          -   -   -   61  61

• Prediction accuracy, without outliers, is in the significant 56 – 61% range

• Accuracies seem to increase with window size, reaching above 60% for bigger networks (G2 – G4); as such, the results could further improve with more training data

Naive Bayesian Classifier

Here the basis for prediction consists of the probabilities P(class_j) and P(evidence_i | class_j) for all recognized evidence/class pairs. The class_p preferred by observed evidence_o1, ..., evidence_on is given by maximizing the expression P(class_p) * P(evidence_o1 | class_p) * ... * P(evidence_on | class_p).

In the task at hand, evidence can take the form attribute_n = value_n, where attribute_n, n = 1..10, denotes the n-th position in the delay vector, and value_n is a fixed value. If the position has this value, the evidence is present. Class and conditional probabilities are computed by counting respective occurrences in the window, with missing conditionals assigned the default 1/equivalentSampleSize probability of 1/80 (Mitchell, 1997). Some results and comments follow.
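A sketch of this classifier under the stated 1/80 default; the helper names and the tiny window are illustrative:

from collections import defaultdict

def train_naive_bayes(window, m=80):
    # Count class and conditional frequencies; unseen evidence gets
    # the default 1/equivalentSampleSize probability 1/m
    class_count = defaultdict(int)
    cond_count = defaultdict(int)          # (position, value, class)
    for vector, cls in window:
        class_count[cls] += 1
        for pos, val in enumerate(vector):
            cond_count[(pos, val, cls)] += 1
    total = len(window)

    def classify(vector):
        best, best_p = None, -1.0
        for cls, cnt in class_count.items():
            p = cnt / total                # P(class)
            for pos, val in enumerate(vector):
                c = cond_count.get((pos, val, cls), 0)
                p *= c / cnt if c else 1.0 / m   # P(evidence | class)
            if p > best_p:
                best, best_p = cls, p
        return best

    return classify

window = [([3, 6, 5], 'up'), ([8, 8, 4], 'down'), ([3, 6, 4], 'up')]
print(train_naive_bayes(window)([3, 6, 8]))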

Window size: 60 125 250 500 1000

Accuracy: 54 52 51 47 50

• The classifier performs poorly – perhaps due to the preprocessing of the dataset removing any major probability shifts – in the bigger window case no better than a guessing strategy

• The results show, however, some autocorrelation in the data: positive for shorter periods (up to 250 data-points) and mildly negative for longer (up to 1000 data-points), which is consistent with other studies on stock returns (Haughen, 1997).

K-Nearest Neighbor

In this approach, the K window vectors most similar to the one being classified are found. The most frequent class among the K vectors is then returned as the classification. The standard similarity metric is the Euclidean distance between the vectors. Some results and comments follow.
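A minimal sketch of the classification step, with an illustrative toy window:

import math
from collections import Counter

def knn_classify(window, query, k=1):
    # Most frequent class among the k window vectors closest to the query
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(window, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(cls for _, cls in neighbors).most_common(1)[0][0]

window = [([3, 6, 5, 8], 'up'), ([8, 8, 4, 8], 'down'), ([3, 5, 5, 8], 'up')]
print(knn_classify(window, [3, 6, 5, 7], k=3))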

Window/K  1   11  125
125       56  -   -
250       55  53  56
500       54  52  54
1000      64  61  56

• Peak of 64%

• Accuracy always at least 50% and significant in most cases

The above table has been generated for the Euclidean metric. However, the peak of 64% accuracy (though for other Window/K combinations) has also been achieved for the Angle and Manhattan metrics[1], indicating that the result is not merely an outlier due to some idiosyncrasies of the data and parameters.

[1] The results were obtained from a GA run in the space MetricsType × K × WindowSize. For a pair of vectors, the Angle metric returns the angle between them, Maximal – the maximal absolute difference coordinate-wise, whereas Manhattan – the sum of such differences.

GA-evolved Logic Programs

The logic program is a list of clauses for the target up predicate. Each clause is 10 literals long, with each literal drawn from the set of available 2-argument predicates: lessOrEqual, greaterOrEqual – with the implied interpretation – as well as Equal(X, Y) if abs(X − Y) < 2 and nonEqual(X, Y) if abs(X − Y) > 1. The first argument of each literal is a constant among 1..8 – together with the predicate symbol – evolved through the GA. The other genetic operator is a 2-point list crossover, applied to the 2 programs – lists of clauses.

The second argument of the N-th literal is the clause's N-th head argument, which is unified with the N-th value in a delay vector. Applying the up predicate to a delay vector performs prediction. If the predicate succeeds, the classification is up, and down otherwise. Fitness of a program is measured as the proportion of window examples it correctly classifies. Upon the GA termination, the fittest program from the run is used to classify the example next to the current window. Programs in a population have different lengths – numbers of up clauses – limited by a parameter, as shown in the following table.

Window/Clauses  5   10  50  100  200
250             60  -   -   -    -
500             44  47  53  50   -
1000            48  50  50  38   44

• Accuracy, in general, non-significant

• Bigger programs (number of clauses > 10) are very slow to converge and result in erratic predictions

In subsequent trials, individual program clauses are GA-optimized for maximal up coverage, and one by one added to the initially empty program until no uncovered up examples remain in the window. A clause covers an example if it succeeds on that example's delay vector. The meaning of the values in the clause fitness formulas is the following: Neg is the count of window down examples (wrongly!) covered by the (up!) clause, Pos is the count of up examples yet uncovered by clauses already added to the program but covered by the current clause, and AllPos is the total count of all window up examples covered by that clause. The weights given to individual counts mark their importance in the GA search trying to maximize the fitness value. The results and some commentaries follow.
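The covering loop can be sketched as below; for brevity, plain random-restart search and interval literals stand in for the GA clause optimization, using the first table row's fitness:

import random

def covers(clause, vector):
    # A clause here is a list of (position, low, high) interval literals
    return all(low <= vector[pos] <= high for pos, low, high in clause)

def random_clause(n_literals=3, vector_len=10):
    clause = []
    for _ in range(n_literals):
        a, b = sorted(random.randint(1, 8) for _ in range(2))
        clause.append((random.randrange(vector_len), a, b))
    return clause

def learn_by_covering(positives, negatives, trials=2000):
    # Add clauses until all 'up' delay vectors are covered; the fitness
    # mirrors AllPos + Pos - 10^3 * Neg from the table (maximized)
    program, uncovered = [], list(positives)
    while uncovered:
        def fitness(cl):
            neg = sum(covers(cl, v) for v in negatives)
            pos = sum(covers(cl, v) for v in uncovered)
            all_pos = sum(covers(cl, v) for v in positives)
            return all_pos + pos - 1000 * neg
        best = max((random_clause() for _ in range(trials)), key=fitness)
        if not any(covers(best, v) for v in uncovered):
            break                          # no progress: give up
        program.append(best)
        uncovered = [v for v in uncovered if not covers(best, v)]
    return program

positives = [[random.randint(1, 8) for _ in range(10)] for _ in range(30)]
negatives = [[random.randint(1, 8) for _ in range(10)] for _ in range(30)]
print(len(learn_by_covering(positives, negatives)), "clauses")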


Clause fitness function/Window      60    125   250   500   1000
AllPos + Pos - 10^3 * Neg           54.8  50.3  51.7  51.9  53.2
AllPos + 10^3 * Pos - 10^6 * Neg    57.1  51.7  52.8  53.0  48.9
as above & ordinary equality        53.6  51.9  53.0  52.5  58.8

• Accuracies, in general, not significant

• The accuracy increase after the introduction of ordinary equality (Equal(X, Y) if X = Y) indicates the importance of the relations used

• The highest accuracy achieved, reaching 59% for the window of 1000, indicates the possibility of further improvement should a bigger window be available

K-nearest Neighbor Prediction Scrutinized

In the prediction accuracy measurements so far, no provision has been made for the magnitude of the actual index changes. As such, it could turn out that a highly accurate system is not profitable in real terms, e.g. by making infrequent but big losses (Deboeck, 1994). To check this, a more realistic prediction scheme is tested, in which prediction performance is measured as the extra growth in returns in relation to the intrinsic growth of the series. The series worked with is the sequence of logs of daily index changes: log_n = ln(index_n) − ln(index_{n-1}). The log change delay vectors still have length 10, but because of the high autocorrelation present (0.34), the delay lag has been set to 2, instead of 1 as before (Kantz & Schreiber, 1999a). Additional parameters follow.

Neighborhood Radius – maximal distance w.r.t. the chosen metric, up to which vectors are considered neighbors and used for prediction, in [0.0, 0.05)

Distance Metrics – between vectors, one of the Euclidean, Maximal, Manhattan metrics

Window size – limit on how many past data-points are looked at while searching for neighbors, in [60, 1000)


Kmin – minimal number of vectors required within a neighborhood to warrant prediction, in [1, 20)

Predictions' Variability – how much the neighborhood vectors' predictions can vary to justify a consistent common prediction, in [0.0, 1.0)

Prediction Variability Measure – how to compute the above measure from the series of the individual predictions: as the standard deviation, the difference max − min between the maximal and minimal value, or the same-sign proportion of predictions

Distance scaling – how contributory predictions are weighted in the common prediction sum, as a function of neighbor distance; no-scaling: 1, linear: 1/distance, exponential: exp(−distance)

The parameters are optimized via GA. The function maximized is the relative gain of an investment strategy involving a long position in the index when the aggregate prediction says it will go up, a short position – when down, and staying in cash if no prediction is warranted. The prediction period is 5 days and the investment continues for that period, after which a new prediction is made. An aggregate prediction is computed by adding all the weighted contributory predictions associated with valid neighbors. If some of the requirements, e.g. the minimal number of neighbors, fail – no prediction is issued.
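A sketch of the aggregate prediction rule, assuming neighbors arrive as (distance, prediction) pairs already filtered by the radius; the names and numbers are illustrative:

import math

def aggregate_prediction(neighbors, k_min=3, scaling='exponential'):
    # Sum distance-weighted neighbor predictions; stay in cash (None)
    # when fewer than k_min valid neighbors were found
    if len(neighbors) < k_min:
        return None
    weigh = {'no-scaling': lambda d: 1.0,
             'linear': lambda d: 1.0 / d if d else 1.0,
             'exponential': lambda d: math.exp(-d)}[scaling]
    total = sum(weigh(d) * p for d, p in neighbors)
    return 'up' if total > 0 else 'down'

# neighbors: (distance, predicted 5-day log change) pairs within the radius
print(aggregate_prediction([(0.01, 0.02), (0.02, 0.015), (0.04, -0.005)]))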

The following tests have been run. Test1 computed the average annual gain over index growth during 4 years of trading: 33%. Test2 computed the minimal (out of 5 runs shifted by 1 day each) gain during the last year (ending on Sept. 1, 1998): 28%. Test3 involved generating 19 sets of surrogate data – permuted logarithmic change series – and checking if the gain on the real series exceeds those for the surrogate series; the test failed – in 6 cases the gain on the permuted data was bigger. However, assuming normality of distribution in the Test2 and Test3 samples, the two-sample t procedure yielded a 95% significant result (t = 1.91, df = 14, P < 0.05) that the Test2 gains are indeed higher than those for Test3.[2]

[2] The (logarithmic) average for Test3 was around 0, as opposed to the strictly positive results and average for Test1 and Test2 – this could be the basis for another surrogate test.


Conclusion

The results show that some exploitable regularities do exist in the index data and Nearest Neighbor is able to profit from them. All the other, definitely more elaborate, techniques fall short of the 64% accuracy achieved via Nearest Neighbor. One of the reasons could involve the non-linearity of the problem in question: with only linear relations available, logic program classifier rules require a linear nature of the problem for good performance, with the nonlinear Neural Network performing somewhat better. On the other hand, the Nearest Neighbor approach can be viewed as generalizing only locally – with no linear structure imposed/assumed – moreover with the granularity set by the problem examples.

As further research, other data could be tested, independent tests for nonlinearity performed (e.g. dimension and Lyapunov exponent estimation) and the other Machine Learning methods refined as well.


ILP and GA for Time Series Prediction
Dept. of Computer and Systems Sciences Report 99-006


ILP via GA for Time Series Prediction

Stefan Zemke

Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University

Forum 100, 164 40 Kista, Sweden

Email: [email protected]

June 1998. Published: DSV report 99-006

Abstract: This report presents experiments using GA for optimizing Logic Programs for time series prediction. Both strategies – optimizing the whole program at once, and building it clause-by-clause – are investigated. The set of background predicates stays the same during all the experiments, though the influence of some variations is also observed. Despite extensive trials, none of the approaches exceeded 60% accuracy, with 50% for a random strategy, and 64% achieved by a Nearest Neighbor classifier on the same data. Some reasons for the weak performance are speculated upon, including the non-linearity of the problem and a too greedy approach.

Keywords: Inductive Logic Programming, Genetic Programming, Financial Applications, Time Series Forecasting, Machine Learning, Genetic Algorithms

Introduction

Inductive Logic Programming

Inductive Logic Programming (ILP) (Muggleton & Feng, 1990) – the automatic induction of logic programs, given a set of examples and background predicates – has shown successful performance in several domains (Lavarac & Dzeroski, 1994).

The usual setting for ILP involves providing positive examples of the relationship to be learned, as well as negative examples for which the relationship does not hold. The hypotheses are selected to maximize compression or information gain, e.g., measured by the number of used literals/tests or program clauses. The induced hypothesis, in the form of a logic program (or easily converted to one), can usually be executed without further modifications as a Prolog program.

The hypotheses are often found via covering, in which a clause succeeding on, or covering, some positive examples is discovered (e.g. by greedy local search) and added to the program, the covered positives are removed from the example set, and clauses are added until the set is empty. Each clause should cover positive examples only, with all the negative examples excluded, though this can be relaxed, e.g., because of different noise handling schemes.

As such, ILP is well suited for domains where compact representations of the concept learned are possible and likely. Without further elaboration it can be seen that most of the ILP success areas belong to such domains, with a concise mathematical-like description feasible and subsequently discovered.

The Task

The task attempted in this work consists in short-term prediction of a time series – a normalized version of stock exchange daily quotes. The normalization involved monotonically mapping the daily changes to 8 values, 1..8, ensuring that the frequency of those values is equal in the 1200 points considered. The value to be predicted is a binary up or down, referring to the index value five steps ahead in the series. These classifications were again made equally frequent (with down including also small index gains).

The normalization of the class data allows easy detection of above-random predictions – their accuracy is above 50% + s, where s is the threshold for the required significance level. If the level is one-sided 95% and the predictions are tested on all 1200 − WindowSize examples, then the significant deviations from 0.5 are as presented in the table.

Thus, predictions with accuracy above 0.56 are of interest, no matter what the window size. For an impression of the predictability of the time series: a Nearest Neighbor method yielded 64% accuracy (Zemke, 1998), with a Neural Network reaching similar results.


Window  Significant error
60      0.025
125     0.025
250     0.027
500     0.031
1000    0.06

Figure 5.1: One-sided 95% significance level errors for the tests

class_example(223, up, [3,6,5,8,8,8,4,8,8,7]).

class_example(224, up, [6,5,8,8,8,4,8,8,7,8]).

class_example(225, up, [5,8,8,8,4,8,8,7,8,8]).

class_example(226, down, [8,8,8,4,8,8,7,8,8,8]).

class_example(227, down, [8,8,4,8,8,7,8,8,8,1]).

class_example(228, up, [8,4,8,8,7,8,8,8,1,1]).

class_example(229, down, [4,8,8,7,8,8,8,1,1,8]).

class_example(230, down, [8,8,7,8,8,8,1,1,8,8]).

Figure 5.2: Data format sample: number, class and 10-changes vector

Data Format

The actual prediction involves looking at the pattern of 10 subsequent changes and from them forecasting the class. To make it more convenient, the sequence of change and class tuples has been pre-computed. The change tuples are generated at one-step resolution from the original series, so the next tuple's initial 9 values overlap with the previous tuple's last 9 values, with the most recent value concatenated as the 10th argument. Such tuples constitute the learning task's examples. A sample is presented.

The accuracy estimation for a prediction strategy consists in learning in a window of consecutive examples, and then trying to predict the class of the example next to the window via applying the best predictor for that window to the next example's change vector. The counts of correct and all predictions are accumulated as the window shifts one-by-one over all the available examples. The final accuracy is the ratio of correct predictions to all predictions made.

GA Program Learning

Common GA Settings

All the tests use the same GA module, with GA parameters constant for all trials, unless indicated otherwise. Random individual generation, mutation, crossover and fitness evaluation are provided as plug-ins to the module and are described for each experiment setting.

The Genetic Algorithm uses a 2-member tournament selection strategy with the fitter individual having the lower numerical fitness value (which can be negative or positive). The mutation rate is 0.1, with each individual mutated at most once before applying other genetic operators; the crossover rate is 0.3 (so offspring constitute 0.6 of the next population) and the population size is 100. Two-point (uniform) crossover is applied only to the top-level list in the individuals' representation. The number of generations is at least 5, no more than 30, and the run is additionally terminated if the GA run's best individual has not improved in the last 5 generations.
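These settings can be summarized in a GA skeleton – a sketch assuming plug-in operators as described, not the thesis' actual module; the toy usage at the end is illustrative only:

import random

def run_ga(random_individual, mutate, crossover, fitness,
           pop_size=100, mutation_rate=0.1, crossover_rate=0.3,
           min_gen=5, max_gen=30, patience=5):
    # Lower fitness is fitter; crossover on 0.3 of the population yields
    # offspring as 0.6 of the next one; stop after 5 stale generations
    pop = [random_individual() for _ in range(pop_size)]
    best, best_fit, stale = None, float('inf'), 0
    for gen in range(max_gen):
        def tournament():
            a, b = random.sample(pop, 2)   # 2-member tournament
            return a if fitness(a) <= fitness(b) else b
        nxt = []
        for _ in range(int(pop_size * crossover_rate)):
            nxt.extend(crossover(tournament(), tournament()))
        while len(nxt) < pop_size:
            nxt.append(tournament())
        pop = [mutate(i) if random.random() < mutation_rate else i
               for i in nxt]               # at most one mutation each
        gen_best = min(pop, key=fitness)
        if fitness(gen_best) < best_fit:
            best, best_fit, stale = gen_best, fitness(gen_best), 0
        else:
            stale += 1
        if gen + 1 >= min_gen and stale >= patience:
            break
    return best

# Toy usage: minimize |x - 7| over integers
best = run_ga(lambda: random.randint(0, 100),
              lambda i: i + random.choice([-1, 1]),
              lambda a, b: [a, b],
              lambda i: abs(i - 7))
print(best)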

A provision is made for the shifted window learning to benefit from the already learned hypothesis, in an incremental learning fashion. This can be conveniently done using a few (mutated) copies of the previous window's best hypothesis – while initializing a new population – instead of a totally random initialization. This is done both to speed up convergence and to increase GA exploitation.

Evolving Whole Logic Program

Representation The program is represented as a list of clauses for the target up predicate. Each clause is 10 literals long, with each literal drawn from the set of available 2-argument predicates: lessOrEqual, greaterOrEqual – with the implied interpretation – as well as equal(X, Y) if abs(X − Y) < 2 and nonEqual(X, Y) if abs(X − Y) > 1.

The first argument of each literal is an integer among 1..8 – together with the predicate symbol – evolved through the GA. The second argument of a clause's N-th literal is the value of the N-th head argument, which is unified with the N-th change value in an example's tuple.

Evaluation Classification is performed by applying the up predicate to an example's change vector. If it succeeds, the classification is up, and down otherwise. Fitness of a program is measured as the (negative) number of window examples it correctly classifies. Upon the GA termination, the fittest program from the run is used to classify the example next to the current window.

GA Operators Mutation changes a single predicate symbol or a constant in one of the program clauses. The 2-point crossover is applied to the list holding the program clauses and cuts in between them (not inside clauses), with the resulting offspring programs of uncontrolled length.

Other parameters The initial program population consists of randomly generated programs of up to L clauses. The L parameter has been varied from 1 to 300, with the more thoroughly tested cases reported. The L parameter is also used during fitness evaluation. If, as the result of crossover, a longer program is generated, its fitness is set to 0 (practically setting its tournament survival chance to nil).

Another approach to limiting program length has also been tried and given up because of no performance improvement and more computational effort. Namely, programs of up to 2*L clauses were evaluated and the actual fitness value returned for those longer than L was multiplied by a factor linearly declining from 1 to 0 as the length increased from L to 2*L.

When the learning window is shifted and the program population initialized, the new population has a 0.02 chance of being seeded with a mutated version of the previous window's best classifier – the one used for prediction. This initiative is intended to promote incremental learning on the new window, differing only by 2 examples (one added and one removed).


Window/Clauses  5   10  50  100  200
250             60  -   -   -    -
500             44  47  53  50   -
1000            48  50  50  38   44

Figure 5.3: GA-evolved logic program prediction accuracy

Results The results for bigger program sizes and smaller windows are missing, since the amount of information required to code the programs would be comparable to that needed to memorize the examples, which could easily lead to overfitting instead of generalization.

Observations from over 50 GA runs follow.

• Accuracy, in general, non-significant

• Up to a certain number of clauses, increased clause number improves accuracy, with more erratic results thereafter

• The influence of crossover seems to be limited to that of random mutations, with offspring less fit than parents

• Programs evolved with different settings (e.g. maximal program length) often fail to predict the same 'difficult' examples

• For bigger programs allowed (clause count more than 50; with population tried up to 2000), convergence is very slow and the best program is often (randomly) created in an initial population

• Window increase, as well as bigger population, generally improve prediction accuracy

• Window-cases accuracy (i.e. fitness) is not a good measure of prediction accuracy, though both remain related (especially for bigger window sizes)


Learning Individual Clauses via GA

To limit the search space explosion, perhaps responsible for the previous trial's poor performance, the next tests optimize individual clauses, one-by-one added to the program. In this more traditional ILP setting, the window up cases constitute the positive, and down – the negative examples.

Representation Clauses are represented as lists of literals – in the same way as an individual clause in the whole program learning. A GA population maintains a set of clauses, at a particular run of the GA all optimized by the same fitness function.

The classifier is built by running the GA search for an optimal clause, adding the clause to the (initially empty) program, updating the set of yet uncovered positives and initiating the GA procedure again. The process terminates when there are no more remaining positive examples to be covered.

Evaluation Details of the fitness function vary and will be described for the individual tests. In general, the function promotes a single clause covering a maximal number of positive and no negative examples in the current window. The variants include different sets of positives (all or yet uncovered), different weights assigned to their counts and some changes in the relations used.

GA Operators Crossover takes the 10-element lists encoding the body literals of 2 selected clauses and applies the 2-point crossover to them, with the restriction that each of the offspring must also have exactly 10 literals. Mutation changes an individual literal: its relation symbol or constant.

Unrestricted GA Clause Search

The first trial initializes the set of clauses randomly, with no connection to the window example set.


Evaluation Fitness of a clause is defined as the difference Negatives – Positives, where Negatives is the count of all negatives covered by the clause, and Positives is the count of positives yet uncovered by previously added clauses but covered by the current clause.

Termination The problem with this approach is that, although initially it seems to work, as soon as the set of remaining positives becomes sparse, the GA search has difficulty finding any clause covering a positive example at all, not to mention a number of positives and no negatives. The search did not terminate in many cases.

However, in those cases in which a set of clauses covering all positiveshas been found, the accuracy on classifying new examples looked promisingwhich lead to subsequent trials.

More Specific Genetic Operators

GA Operators In this setting all genetic operators, including clause initialization, have an invariant: a selected positive example must be covered. This leads to changes in the implementation of clause initialization and mutation. Crossover does not need to be specialized: a crossover of two clauses, each covering the same positive example, still covers that example.

Evaluation The fitness function is defined by the formula: 1000*Negatives – Positives – AllPositives, where the additional AllPositives indicates the count of all window positives covered by that clause; the other summands are as already explained. Such a formula showed better prediction accuracy than just promoting a maximal count among the remaining positives. Here there is a double premium for capturing the remaining positives: they are included in both positive counts.

Termination and full positive coverage are ensured by iterating over all positive examples, with each clause added covering at least one of them.

Some observations about the results follow.

• The only significant prediction is that for window size 60, but only just


Window   Accuracy
  60      54.8
 125      50.3
 250      51.7
 500      51.9
1000      53.2

Figure 5.4: More Specific Genetic Operators. Fitness: 1000*Negatives – Positives – AllPositives

Window   Accuracy
  60      57.1
 125      51.7
 250      52.8
 500      53.0
1000      48.9

Figure 5.5: Refined Fitness Function. Fitness: 1000000*Negatives – 1000*Positives – AllPositives

• The rest of the results hint at no prediction, giving overall poor performance

Refined Fitness Function

In this trial all the settings are as above, with a different fitness function.

Evaluation The new fitness formula, 1000000*Negatives – 1000*Positives – AllPositives, gives priority to covering no negatives, then to maximizing the coverage of yet-uncovered positives, and only then of all positives (with all the Negatives, Positives, AllPositives counts less than 1000 because of the window size).
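A minimal sketch of this scoring, assuming a hypothetical covers(clause, example) test; lower fitness is better, and the weights keep the three counts in separate 'digit' ranges since each stays below 1000.

    def fitness(clause, negatives, uncovered_positives, all_positives):
        # 1000000*Negatives - 1000*Positives - AllPositives (lower is better):
        # first avoid covering negatives, then cover yet-uncovered positives,
        # and only then all window positives
        neg = sum(covers(clause, e) for e in negatives)
        pos = sum(covers(clause, e) for e in uncovered_positives)
        allp = sum(covers(clause, e) for e in all_positives)
        return 1000000 * neg - 1000 * pos - allp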

As compared to the previous fitness setting, this resulted in:

• Improved accuracy for all window sizes


Window   Accuracy
  60      53.6
 125      51.9
 250      53.0
 500      52.5
1000      58.8

Figure 5.6: Ordinary Equality in the Relation Set. Fitness: 1000000*Negatives – 1000*Positives – AllPositives

• Predictions for window size 250 are significant

Ordinary Equality in the Relation Set

Another possibility for changes involves redefining the relations employed. Since with the relations used so far there was no possibility to select a specific value – all the relations, including equality, involving intervals – the definition of Equal has been narrowed.

Representation The Equal predicate has been set to ordinary equality, holding if its two arguments are the same numbers. Other settings stay as previously.

The results are interesting, among other reasons, because:

• The property that the more data fed (i.e. the bigger the window), the higher the accuracy, allows one to expect further accuracy gains

• The accuracy achieved for window size 1000 is the highest for all methods with individually evolved clauses

Other variations attempted led to no profound changes. In principle, all the changes to the clause invention scheme and to parameter values could be carried out by a meta-GA search in the appropriate space. However, due to the computational effort this is not yet feasible, e.g. achieving the above result for window size 1000 involved more than 50h of computation (UltraSparc, 248MHz).


Decision Trees

The prediction task was also attempted via the Spectre (Bostrom & L., 1999) system, a propositional learner with results equivalent to a decision tree classifier, equipped with hypothesis pruning and noise handling. I am grateful for the courtesy of Henrik Bostrom, who actually ran the test on the provided data. The results follow.

I tried SPECTRE on learning from a random subset consisting
of 95% of the entire set, and testing on the remaining 5%.
The results were very poor (see below).

[...]

******************* Experimental Results ********************

Example file: zemke_ex

No. of runs: 10

****************** Summary of results ***********************

=============================================================

Method: S

Theory file: zemke

-------------------------------------------------------------

Training size: 95

Mean no. of clauses: 3.4 Std deviation: 1.65 Std error: 0.52

Mean accuracy: 50.33 Std deviation: 7.11 Std error: 2.25

Pos. accuracy: 28.64 Std deviation: 14.69 Std error: 4.64

Neg. accuracy: 71.58 Std deviation: 17.59 Std error: 5.56

Since the set of predicates employed by Spectre included only =, <, >, excluding any notion of inequality, the last setting for GA clause induction was run for comparison with nonEqual disabled. The result for window size 1000 is 51.5%, slightly better than that of Spectre but still non-significant. The drop from 58.8% indicates the importance of the relation set.

The results are similar to experiments involving Progol (Muggleton, 1995) – unreported in the current study, which focuses on evolutionary approaches to ILP. This system searches the space of logic programs covering the given positive and excluding the negative examples by exhaustively (subject to some restrictions) considering combinations of background predicates. In the trials, the system came up with either up(any) or the example set itself as the most compressed hypothesis, thus effectively offering no learning.

Conclusion

The overall results are not impressive – none of the approaches has exceeded the 60% accuracy level. The failure of the standard ILP systems (Progol and the decision tree learner) can be indicative of the inappropriateness of the locally greedy, compression/information-gain driven approach to this type of problem. The failure of evolving whole programs once more shows the difficulty of finding optima in very big search spaces.

Another factor is the set of predicates used. As compared with the GA runs, the Progol and Spectre tests missed the inequality relation. As the introduction of ordinary equality or the removal of inequality showed, even the flexible GA search is very sensitive to the available predicates. This could be an area for further exploration.

All the above, definitely more elaborate, techniques fall short of the results achieved via the Nearest Neighbor method. One of the reasons could involve the non-linearity of the problem in question: with only linear relations available, all generalizations assume a linear nature of the problem for good performance. On the other hand, the Nearest Neighbor approach can be viewed as generalizing only locally, moreover with the granularity set by the problem examples themselves.


Bagging Imperfect Predictors
ANNIE’99, St. Louis, MO, US, 1999


Bagging Imperfect Predictors

Stefan Zemke

Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University

Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Presented: ANNIE’99.

Published: Smart Engineering System Design, ASME Press, 1999

Abstract Bagging – a majority voting scheme – has been applied to a population of stock exchange index predictors, yielding returns higher than those of the best predictor. The observation has been more thoroughly checked in a setting in which all above-average predictors evolved in a Genetic Algorithm population have been bagged, and their trading performance compared with that of the population's best, resulting in significant improvement.

Keywords: Bagging, Financial Applications, Performance Analysis, Time Series Forecasting, Machine Learning, Mixture of Experts, Neural Network Classifier, Genetic Algorithms, Nearest Neighbor

Introduction

Financial time series prediction presents a difficult task, with no single method best in all respects, the foremost of which are accuracy (returns) and variance (risk). In the Machine Learning area, ensembles of classifiers have long been used as a way to boost accuracy and reduce variance. Financial prediction could also benefit from this approach; however, due to the peculiarities of financial data, the usability needs to be experimentally confirmed.

This paper reports experiments applying bagging – a majority voting scheme – to predictors for a stock exchange index. The predictors come from efforts to obtain a single best predictor. In addition to observing bagging-induced changes in accuracies, the study also analyzes their influence on potential monetary returns.

The following chapter provides an overview of bagging. Next, the settings for the base study generating index predictions are described, and how the predictions are bagged in the current experiments. At last, a more realistic trading environment is presented together with the results.

Bagging

Bagging (Breiman, 1996) is a procedure involving a committee of different classifiers. This is usually achieved by applying a single learning algorithm to different bootstrap samples drawn from the training data – which should destabilize the learning process, resulting in non-identical classifiers. Another possibility is to use different learning algorithms trained on common data, or a mix of both. When a new case is classified, each individual classifier issues its unweighted vote, and the class which obtains the biggest number of votes is the bag outcome.
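A minimal sketch of the procedure, assuming hypothetical learner objects with fit(instances, labels) and predict(instance) methods.

    import random
    from collections import Counter

    def bag_train(learners, instances, labels):
        # each learner gets its own bootstrap sample (drawn with replacement)
        n = len(instances)
        for learner in learners:
            idx = [random.randrange(n) for _ in range(n)]
            learner.fit([instances[i] for i in idx], [labels[i] for i in idx])
        return learners

    def bag_predict(learners, instance):
        # unweighted vote; the most frequent class is the bag outcome
        votes = Counter(learner.predict(instance) for learner in learners)
        return votes.most_common(1)[0][0]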

For bagging to increase accuracy, the main requirement is that the individual classifiers make independent errors and are (mostly) above random. By majority voting, bagging promotes the average bias of the classifiers, reducing the influence of individual variability. Experiments show (Webb, 1998) that, indeed, bagging reduces variance while slightly increasing bias – bias measuring the contribution to classification error of the classifiers' central tendency, whereas variance measures the error due to deviation from that central tendency.

Bagging Predictors

Results in this study involve bagging the outcomes of 55 experiments run for earlier research comparing predictions via Neural Network (ANN, 10 predictors), Nearest Neighbor (kNN, 29), Evolved Logic Programs (ILP, 16) and Bayesian Classifier (not used in this study). A more detailed description of the methods can be found in (Zemke, 1998).

Experimental Settings

Some evidence suggests that markets with lower trading volume are easier to predict (Lerche, 1997). Since the task of the earlier research was to compare Machine Learning techniques, data from the relatively small and unexplored Warsaw Stock Exchange (WSE) was used, with the quotes freely available on the Internet (WSE, 1995 onwards). At the exchange, prices are set once a day (with intraday trading introduced more recently). The main index, WIG, a capitalization-weighted average of stocks traded on the main floor, provided the time series used in this study, with 1250 quotes from the formation of the exchange in 1991 up to the comparative research.

Index daily (log) changes were digitized by monotonically mapping them into 8 integer values, 1..8, such that each was equally frequent in the resulting series. The digitized series, {c}, was then used to create delay vectors of 10 values, with lag one. Such a vector, (c_t, c_{t−1}, c_{t−2}, ..., c_{t−9}), was the sole basis for prediction of the index up/down value at time t + 5 w.r.t. the value at time t. Changes up and down have been made equally frequent (with down including small index gains) for easier detection of above-random predictors. Only delay vectors and their matching 5-day returns derived from consecutive index values within a learning window were used for learning. Windows of half a year, 1 year (250 index quotes), 2 years and 4 years were tested.
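A minimal numpy sketch of the equal-frequency digitization step under the description above: ranking the changes and splitting the ranks into 8 equal groups yields the 1..8 codes.

    import numpy as np

    def digitize_equal_freq(changes, bins=8):
        # map values into 1..bins so that each code is (about) equally frequent
        ranks = np.argsort(np.argsort(changes))   # 0..n-1 rank of each value
        return ranks * bins // len(changes) + 1   # integer codes 1..bins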

For each method, the predictor obtained for the window was then applied to the vector next to the last one in the window, yielding the up/down prediction for the index value falling next to the window. With the counters of in/correct predictions accumulating as the window shifted over all available data points, the resulting average accuracies for each method are included in Table 1, with accuracy shown as the percentage (%) of correctly predicted up and down cases.

Estimating Returns

For estimating the index returns induced by predictions, the 5-day index changes have been divided into 8 equally frequent ranges, 1..8, with ranges 1..4 corresponding to down and 5..8 to up. Changes within each range obtained values reflecting the non-uniform distribution of index returns (Cizeau et al., 1997). The near-zero changes 4 and 5 obtained value 1, changes 3 and 6 – 2, changes 2 and 7 – 4, and the extreme changes 1 and 8 – value 8.

Return is calculated as the sum of the values corresponding to correct (up/down) predictions minus the values for incorrect predictions. To normalize, it is divided by the total sum of all values involved, thus ranging between −1 – for null – and 1 – for full predictability. It should be noted that such a return is not equivalent to accuracy, which gives the same weight to all correct predictions.
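A minimal sketch of this return measure, with the range weights just listed; actual_ranges holds the 1..8 codes of the realized 5-day changes and predicted_up the matching up/down calls.

    # weight of each 5-day-change range 1..8 (extreme ranges weigh most)
    RANGE_VALUE = {1: 8, 2: 4, 3: 2, 4: 1, 5: 1, 6: 2, 7: 4, 8: 8}

    def normalized_return(actual_ranges, predicted_up):
        # correct predictions add their range value, incorrect subtract it;
        # dividing by the total gives a value in [-1, 1]
        score = total = 0.0
        for rng, up in zip(actual_ranges, predicted_up):
            value = RANGE_VALUE[rng]
            correct = (rng >= 5) == up   # ranges 5..8 correspond to 'up'
            score += value if correct else -value
            total += value
        return score / total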

The different learning methods involved in this study – ILP, kNN and ANN – offer the classification error independence required for bagging to work. Within each method's predictors, there is still variety due to different training windows and parameters, such as background predicates for ILP, k values for kNN, and architectures for ANN.

In this context bagging is applied as follows: all selected predictors, e.g. those trained on a window of half a year – as in the first row of bagged results in Table 1 – issue their predictions for an instance, with the majority class being the instance's bagged prediction. The predictor selections in Table 1 are according to the learning method (columns): ILP, kNN, ANN, all of them; and according to training window size (rows), e.g. ’4 & 2 & 1 year’ – bagging predictions for all these window sizes.

Method        ILP, #16     kNN, #29     ANN, #10     all, #55
              %   Return   %   Return   %   Return   %   Return
Individual methods – no bagging involved
Average       56   0.18    57   0.19    62   0.32    57   0.21
Deviation    .029  .068   .038  .094   .018  .043   .039  .095
Window-wise bagged results
Half year     55   0.20    63   0.32     -    -      61   0.30
1 year        56   0.19    57   0.20    61   0.30    59   0.27
2 years       55   0.14    60   0.28    65   0.38    60   0.26
4 years       60   0.22    66   0.41    62   0.32    64   0.34
4 & 2 years   62   0.28    63   0.34    63   0.35    64   0.35
& 1 year      60   0.26    61   0.30    64   0.36    63   0.34
& half year   61   0.28    61   0.30    63   0.34    64   0.37

Figure 1: Accuracies and returns for individual and bagged methods.


With up to 1000 (4 years) of the 1250 index points used for training, the presented accuracies for the last 250 require a 6% increase for a significant improvement (one-sided, 0.05 error). Looking at the results, a number of observations can be attempted. First, increased accuracy – bagged accuracies exceed the average for each method. Second, poorly performing methods gain most, e.g. ILP (significantly) going up from a 56% average to 62% bagged accuracy. Third, overall, bagged predictors incorporating windows of 4 & 2 years achieve the highest accuracy. And fourth, return performance is positively correlated with bagged accuracy, with the highest returns for the highest accuracies.

Bagging GA Population

This section describes a trading application of bagged GA-optimized Nearest Neighbor classifiers. As compared to the previously used Nearest Neighbor classifiers, those in this section have additional parameters warranting what constitutes a neighbor and are optimized for maximizing the return implied by their predictions; they also work on more extensive data – the choice of which is also parameterized. Some of the parameters follow (Zemke, 1998).

Active features – binary vector indicating the features/coordinates in the delay vector included in the neighbor distance calculation, max. 7 active

Neighborhood Radius – maximal distance up to which vectors are considered neighbors and used for prediction, in [0.0, 0.05)

Window size – limit on how many past data-points are looked at while searching for neighbors, in [60, 1000)

Kmin – minimal number of vectors required within a neighborhood to warrant prediction, in [1, 20)

Predictions’ Variability – how much the neighborhood vectors’ predictions can vary to justify a consistent common prediction, in [0.0, 1.0)

Prediction Variability Measure – how to compute the above measure from the series of individual predictions: as the standard deviation, or as the difference max − min between the maximal and minimal value


Distance scaling – how contributory predictions are weighted in the common prediction sum, as a function of neighbor distance; no scaling: 1, linear: 1/distance, exponential: exp(−distance)

The kNN parameters are optimized for the above-index gain of an investment strategy involving a long index position for an up prediction, a short position for down, and staying out of the index if no prediction is warranted. The prediction and investment period is 5 days, after which a new prediction is executed. A kNN prediction is arrived at by adding all the weighted past 5-day returns associated with valid neighbors. If some of the requirements fail, e.g. the minimal number of neighbors, no overall prediction is issued.
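A minimal sketch of this prediction step, assuming neighbors is a precomputed list of (distance, past_5day_return) pairs within the radius; the names and the scaling dictionary are illustrative, and the predictions'-variability check is omitted.

    import math

    def knn_predict(neighbors, k_min, scaling="exp"):
        # too few valid neighbors: no prediction is issued
        if len(neighbors) < k_min:
            return None
        weight = {"none": lambda d: 1.0,
                  "linear": lambda d: 1.0 / (d + 1e-9),
                  "exp": lambda d: math.exp(-d)}[scaling]
        # sum of distance-weighted past 5-day returns of the neighbors
        score = sum(weight(d) * ret for d, ret in neighbors)
        return "up" if score > 0 else "down"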

The trading is tested over a period of one year, split into 1.5-month periods, for which new kNN parameters are GA-optimized. The delay vectors are composed of daily logarithmic changes derived from the series, with the number of delayed values (lag 1) indicated: WIG index (30), Dow Jones Industrial Average (10), and Polish-American Pioneer Stock Investment Fund (10). The results of the trading simulations are presented in Table 2.

Method            No. of Trials   Mean    Deviation
Random strategy       10000       0.171     0.192
Best strategy           200       0.23      0.17
Bagged strategy         200       0.32      0.16

Figure 2: Returns for a random, GA-best and bagged strategy

Random strategy represents trading according to the up/down sign of a randomly chosen 5-day index return from the past. Best strategy indicates trading according to the GA-optimized strategy (fitness = return over the preceding year). Bagged strategy indicates trading according to a majority vote of all above-random (i.e. positive fitness) predictors present in the final generation.

Trading by the best predictor outperforms the random strategy with 99.9% confidence (t-test); with the same confidence, trading by the bagged predictor outperforms the best strategy.


Conclusion

This study presents evidence that bagging multiple predictors can improve prediction accuracy on stock exchange index data. With the observation that returns are proportional to prediction accuracy, bagging makes an interesting approach for increasing returns. This is confirmed by trading in a more realistic setting, with the returns of bagging significantly outperforming those of trading by the single best strategy.


Rapid Fine-Tuning of Computationally Intensive Classifiers
MICAI’2000, Mexico, 2000. LNAI 1793


Rapid Fine-Tuning of Computationally Intensive Classifiers

Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Presented: MICAI’00. Published: LNAI 1793, Springer, 2000

Abstract This paper proposes a method for testing multiple parameter settings in one experiment, thus saving on computation time. This is possible by simultaneously tracing processing for a number of parameters and, instead of one, generating many results – for all the variants. The multiple data can then be analyzed in a number of ways, such as by the binomial test used here for superior parameter detection. This experimental approach might be of interest to practitioners developing classifiers and fine-tuning them for particular applications, or in cases when testing is computationally intensive.

Keywords: Analysis and design, Classifier development and testing, Significance tests, Parallel tests

Introduction

Evaluating a classifier and fine-tuning its parameters, especially when performed with non-optimal prototype code, often require lengthy computation. This paper addresses the issue of such experiments, proposing a scheme that speeds up the process in two ways: by allowing multiple classifier variants to be compared in shorter time, and by speeding up the detection of superior parameter values.

The rest of the paper is organized as follows. First, a methodology of comparing classifiers is described, pointing out some pitfalls. Next, the proposed method is outlined. And finally, an application of the scheme to a real case is presented.


Basic Experimental Statistics

Comparing Outcomes

While testing 2 classifiers, one comes up with 2 sets of resulting accuracies. The question is then: do the observed differences indicate actual superiority of one approach, or could they have arisen randomly?

The standard statistical treatment for comparing 2 populations, the t-test, came under criticism when applied in machine learning settings (Dietterich, 1996), or with multiple algorithms (Raftery, 1995). The test assumes that the 2 samples are independent, whereas usually, when two algorithms are compared, this is done on the same data set, so the independence of the resulting accuracies is not strict. Another doubt can arise when the quantities compared do not necessarily have a normal distribution.

If one wants to compare two algorithms, A and B, then the binomial test is more appropriate. The experiment is to run both algorithms N times and to count the S times A was better than B. If the algorithms were equal, i.e., P(A better than B in a single trial) = 0.5, then the probability of obtaining a difference of S or more amounts to the sum of binomial trials, P = 0.5, yielding between S and N successes. As S gets larger than N/2, the error of wrongly declaring A better than B decreases, allowing one to reach a desired confidence level. Table 1 provides the minimal S differentials as a function of the number of trials N and the (I- or II-sided) confidence level.
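A minimal sketch of this tail sum – an exact fair-coin computation using only the standard library. For example, binomial_error(16, 12) ≈ 0.038, matching the 16-trial, 95% one-sided entry of Table 1.

    from math import comb

    def binomial_error(n, s):
        # P(at least s successes in n fair-coin trials): the error of
        # wrongly declaring A better than B given s wins out of n
        return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n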

The weaknesses of the binomial test for accuracies include: a non-quantitative comparison – not showing how much one case is better than the other (e.g., as presented by their means); somewhat ambivalent results in the case of many draws – what if the number of draws D >> S, should the relatively small number of successes decide which sample is superior; and non-obvious ways of comparing more than 2 samples, or samples of different cardinality (Salzberg, 1997).

Significance Level

Performing many experiments increases the odds that one will find ’significant’ results where there are none.


#Trials   95% I   95% II   99% I   99% II   99.9% I   99.9% II   99.99% I   99.99% II
    5       5       -        -       -        -          -          -          -
    6       6       6        -       -        -          -          -          -
    7       7       7        7       -        -          -          -          -
    8       7       8        8       8        -          -          -          -
   16      12      13       13      14       15         15         16         16
   32      22      22       23      24       25         26         27         28
   64      40      41       42      43       45         46         47         48
  128      74      76       78      79       82         83         86         87
  256     142     145      147     149      153        155        158        160
  512     275     279      283     286      292        294        299        301
 1024     539     544      550     554      562        565        572        575

Figure 5.7: Minimal success differentials for desired confidence

                           Confidence desired
Confidence tested            95%    99%   99.9%
99%                            5      1      -
99.9%                         51     10      1
99.99%                       512    100     10

Figure 5.8: Required single-trial confidence for series of trials

For example, an experiment at a 95% confidence level draws a conclusion that is wrong with probability 0.05, so in fact, for every 20 experiments, one is expected to pass an arbitrary test at 95% confidence. The probability of not making such an error in any of K (independent) experiments goes down to 0.95^K, which for K > 1 is clearly less than the 95% confidence level.

Thus, in order to keep the overall confidence for a series of experiments, the individual confidences must be more stringent. If c is the desired confidence, then the product of the individual experiments' confidences must be at least that. Table 2 presents, for a few desired levels, maximally how many experiments at a higher level can be run for the series to still be within the intended level. The approximate (conservative) formula is quite simple: MaxNumberOfTrials = (1 − Confidence_desired) / (1 − Confidence_tested).
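A quick check of the formula against the table above, for the desired level 95% with trials tested at 99.9%:

    max_trials = (1 - 0.95) / (1 - 0.999)   # = 50 (conservative estimate)
    exact = 0.999 ** 51                     # = 0.9503 >= 0.95, so the exact
                                            # bound is 51, as in Table 2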


To avoid spurious inferences, one is strongly advised to always aim at significance higher than the bottom-line 95%, easily obtained in tens of testing runs. However, more stringent tests also increase the possibility that one will omit some genuine regularities. One solution to this trade-off could be to first search for any results, accepting relatively low significance, and once something interesting is spotted, to rerun the test on more extensive data, aiming at a higher pass.

Tuning Parameters

A common practice involves multiple experiments in order to fine-tune optimal parameters for the final trial. Such a practice increases the chances of finding an illusory significance – in two ways. First, it involves the above-discussed effect of numerous tests on the same data. Second, it specializes the algorithm to perform on the (type of) data on which it is later tested.

To avoid this pitfall, first, each fine-tuning experiment involving the whole data should appropriately adjust the significance level of the whole series – in the way discussed. The second possibility requires keeping part of the data for testing and never using it at the fine-tuning stage, in which case the significance level must only be adjusted according to the number of trials on the test portion.

Proposed Method

Usually it is unclear without a trial how to set parameter values for optimal performance. Finding the settings is often done in a change-and-test manner, which is computationally intensive, both to check the many possible settings and to get enough results to be confident that any observed regularity is not merely accidental. The proposed approach to implementing the change-and-test routine can speed up both.

The key idea is to run many experiments simultaneously. For example, if the tuned algorithm has 3 binary parameters A, B and C taking values -/+, then in order to decide which setting among A- B- C-, A- B- C+, ..., A+ B+ C+ to choose, all could be tried at once. This can be done by keeping 2 copies of all the variables influenced by parameter A: one variable set representing the setting A- and the other A+. Those 2 variable sets could then each be used in 2 ways – with respect to the processing required by B- and B+ – resulting in 4 variable sets representing the choices A- B-, A- B+, A+ B- and A+ B+. And in the same manner, the C choice would generate 8 sets of affected variables. Finally, as the original algorithm produces one result, the modified multiple-variable version would produce 8 values per iteration.
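A minimal sketch of the idea; expensive_core and the step_* functions are hypothetical stand-ins for the shared computation and the cheap parameter-dependent tail.

    from itertools import product

    def run_all_variants(data):
        # the computationally intensive core is computed once and shared
        core = expensive_core(data)
        results = {}
        # trace all 2^3 settings of the binary parameters A, B, C
        for a, b, c in product((False, True), repeat=3):
            x = step_a(core, plus=a)   # cheap step affected by A
            x = step_b(x, plus=b)      # cheap step affected by B
            results[(a, b, c)] = step_c(x, plus=c)
        return results                 # 8 outcomes per iteration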

The details of the procedure, namely which variables need to be traced in multiple copies, depend on the algorithm in question. Though the process might seem to change the structure of the algorithm – using a data structure in place of a single variable – once this step is properly implemented, it does not increase the conceptual complexity whether 2 or 10 variables are traced. Actually, with the use of any programming language allowing abstractions, such as an object-oriented language, it is easy to reveal the internal nature of variables only where necessary – without the need for any major code changes where the modified variables are merely passed.

Handling the variable choices obviously increases the computational complexity of the algorithm; however, as will be shown on an example, the overhead can be negligible when the variable parameters concern choices outside the computationally intensive core of the algorithm, as is usually the case for fine-tuning³.

³ It is a matter of terminology what constitutes parameter-tuning and what development of a new algorithm.

Superior Parameter Detection

Continuing the above case with 3 binary choices, for each classifier application 8 outcomes would be generated, instead of the one if all 3 parameters were fixed. Concentrating on just one parameter, say A, divides the 8 outcomes into 2 sets: that with A- and that with A+ – each including 4 elements indexed by the variants of the other parameters: B- C-, B- C+, B+ C-, B+ C+. The identical settings for the other parameters allow us to observe the influence of the value of A by comparing the corresponding outcomes.

The comparisons can be made according to the binomial test, as discussed (Salzberg, 1997). In order to collect the statistics, several iterations – applications of the algorithm – will usually be required, depending on the number of variable choices – so outcomes – at each iteration, and the required confidence. With 3 variable choices, each application allows 4 comparisons – in general, tracing K choices allows 2^(K−1).

This analysis can reveal if a certain parameter setting results in significantly better performance. The same procedure, and the same algorithm outcomes, can be used for all the parameters, here including also B and C, which equally divide the outcomes into B- and B+, etc. Any decisive results obtained in such a way indicate a strong superiority of a given parameter value – regardless of the combinations of the other parameters. However, in many cases the results cannot be expected to be so crisp – the influence of parameter values being inter-dependent, i.e. which value of a given parameter is optimal may depend on the configuration of the other parameters.

In that case the procedure can be extended: namely, the algorithm outcomes can be divided according to the value of a variable parameter, let it be A, into 2 sets: A- and A+. Each of the sets would then be subject to the procedure described above, with the already-fixed parameter excluded. So the analysis of the set A- might, for example, reveal that parameter B+ gives superior results no matter what the value of the other parameters (here: only C left), whereas analysis of A+ might possibly reveal superiority of B-. The point to observe is that fixing one binary variable reduces the cardinality of the sample by half, thus twice as many algorithm iterations will be required for the same cardinality of the analyzed sets. This kind of analysis might reveal the more subtle interactions between parameters, helpful in understanding why the algorithm works the way it does.

Parallel Experiments

In the limit, the extended procedure will lead to 2^K sets obtaining one element per iteration, K being the number of binary parameters traced. The sets so obtained can be subject to another statistical analysis, this time the gains in computation coming from the fact that, once generated, the 2^K sets can be compared to a designated set, or even pair-wise, corresponding to many experiments.

The statistics used in this case can again involve the binomial comparison or – unlike in the previous case – a test based on random sampling. In the superior parameter detection mode, the divisions obtained for a single parameter most likely do not have a normal distribution, thus tests assuming it, such as the t-test, are not applicable. Since the binomial test does not make any such assumption, it was used.

However, if the compared sets are built in a one-element-per-iteration fashion, where each iteration is assumed to be independent (or random-generator dependent) of the previous one, the sets can be considered random samples. The fact that they originate from the same random generator sequence forming the outcomes at each iteration can actually be considered helpful in getting a more reliable comparison of the sets – due only to the performance of the variants, and not to variation in the sampling procedure. This aspect could be considered another advantage of the parallel experiments. However, discussing the more advanced tests utilizing this property is beyond the scope of the current paper.

Example of Actual Application

This section provides a description of a classifier development (Zemke, 1999a) which inspired the parameter tuning and testing procedure. Since the developed algorithm was (believed to be) novel, there were no clear guidelines as to which, among the many small but important choices within the algorithm, should be preferred. By providing more results for analysis, the proposed testing approach helped both to find promising parameters and to clarify some misconceptions about the algorithm's performance. Generating the data took approximately one week of computation, thus repeating the run for each of the 13 variants considered would be impractical.

Algorithm

The designed classifier was an extension of the nearest neighbor algorithm, with parameters indicating what constitutes a neighbor, which features to look at, how to combine neighbor classifications, etc. The parameters were optimized by a genetic algorithm (GA) whose population explored their combinations. The idea, believed to be novel, involved taking – instead of the best GA-evolved classifier – part of the final GA population and bagging (Breiman, 1996) the individual classifiers together into an ensemble classifier. Trying the idea seemed worthwhile since bagging is known to increase accuracy by benefiting from the variation in the ensemble – exactly what a (not over-converged) GA population should offer.

The computationally intensive part was the GA search – evolving a population of parameterized classifiers and evaluating them. This had to be done no matter whether one was interested just in the best classifier or in a bigger portion of the population. As proposed, the tested algorithm needs to be multi-variant traced for a number of iterations. Here, an iteration involved a fresh GA run, and yielded accuracies (on the test set) – one for each variant traced.

The questions concerning bagging the GA population involved: which individual classifiers to bag – all above-random ones or only some of them; how to weight their votes – one vote each or according to the classifiers' accuracy; and how to solicit the bagged vote – by simple majority or only if the majority exceeds a threshold. The questions gave rise to 3 parameters, described below, and their 3 ∗ 2 ∗ 2 = 12 combinations, listed in Table 3, indicating which parameter (No) takes what value (+).

1. This parameter takes 3 values, depending on which of the above-random (fitness) accuracy classifiers from the final GA population are included in the bagged classifier: all, only the upper half, or a random half among the above-random.

2. This binary parameter distinguishes between an unweighted vote (+), where each classifier adds 1 to its class, and a weighted vote (-), where the class vote is incremented according to the classifier's accuracy.

3. This binary parameter decides how the bagged ensemble decision is reached – by taking the class with the biggest cumulative vote (+), or (-) only when the majority is by more than 1/3 of the total votes greater than that of the next class, returning the bias of the training data otherwise.


No  Parameter setting       1  2  3  4  5  6  7  8  9 10 11 12
1   Upper half bag          +  +  +  +  -  -  -  -  -  -  -  -
1   All above-random bag    -  -  -  -  +  +  +  +  -  -  -  -
1   Half above-random bag   -  -  -  -  -  -  -  -  +  +  +  +
2   Unweighted vote         +  +  -  -  +  +  -  -  +  +  -  -
3   Majority decision       +  -  +  -  +  -  +  -  +  -  +  -

Figure 5.9: Settings for 12 parameter combinations.

Parameter Analysis

The parameter analysis can identify algorithm settings that give superior performance, so they can be set to these values. The first parameter has 3 values, which can be dealt with by checking if results for one of the values are superior to both of the others. Table 4 presents the comparisons as probabilities of erroneously deciding superiority of the left parameter set versus the one on the right. Thus, for example, in the first-row comparison of {1..4} vs. {5..8}, which represent the different settings for parameter 1, the error 0.965 by 128 iterations indicates that setting {1..4} is unlikely to be better than {5..8}. Looking at it the other way: {5..8} is more likely to be better than {1..4}, with error⁴ around 0.035 = 1 − 0.965. The setting {0} stands for results by the reference non-bagged classifier – the respective GA run's fittest. The results in Table 4 allow us to make some observations concerning the parameters. The following conclusions are for results up to 128 iterations, with the results for the full trials up to 361 iterations included for comparison only.

1. There is no superior value for parameter 1 – such that it would outperform all the other values.

2. Both settings for parameter 2 are comparable.

⁴ The error probabilities of A- vs. A+ and A+ vs. A- do not add exactly to 1 for two reasons. First, draws are possible, thus the former situation, S successes out of N trials, can lead to fewer than F = N − S successes for the latter, so adding only the non-draw binomial probabilities would amount to less than 1. And second, even if there are no draws, both error binomial sums would involve a common factor binomial(N, S) = binomial(N, N − S), making the complementary probabilities add to more than 1. Thus, for the analysis to be strict, the opposite-situation error should be computed from scratch.


No  Parameter settings / Iterations        32       64       128      361
1   {1..4} vs. {5..8}                      0.46     0.95     0.965    0.9985
1   {1..4} vs. {9..12}                     0.53     0.75     0.90     0.46
1   {5..8} vs. {9..12}                     0.33     0.29     0.77     0.099
2   {1,2,5,6,9,10} vs. {3,4,7,8,11,12}     0.53     0.6      0.24     0.72
3   {1,3,5,7,9,11} vs. {2,4,6,8,10,12}     0.018    9E-5     3E-5     0
-   {1} vs. {0}                            0.0035   0.0041   0.013    1E-6
-   {2} vs. {0}                            0.19     0.54     0.46     0.91
-   {3} vs. {0}                            0.055    0.45     0.39     0.086
-   {4} vs. {0}                            0.11     0.19     0.33     0.12
-   {5} vs. {0}                            0.0035   3.8E-5   6.2E-5   0
-   {6} vs. {0}                            0.30     0.64     0.87     0.89
-   {7} vs. {0}                            0.055    0.030    0.013    1.7E-4
-   {8} vs. {0}                            0.025    0.030    0.02     0.0011
-   {9} vs. {0}                            0.055    7.8E-4   2.5E-4   0
-   {10} vs. {0}                           0.11     0.35     0.39     0.73
-   {11} vs. {0}                           0.19     0.64     0.39     0.085
-   {12} vs. {0}                           0.055    0.030    0.0030   0.0016

Figure 5.10: Experimental parameter setting comparisons.

3. Majority decision ({1,3,5,7,9,11}), for parameter 3, is clearly outperforming, with confidence 99.99% by 64 iterations.

4. In the comparisons against the non-bagged {0}, settings 5 and 9 are more accurate, at less than 0.1% error (by iteration 128), pointing out superior parameter values.

Speed up

In this case the speed-up of the aggregate experiments – as opposed to individual pair-wise comparisons – comes from the fact that the most computationally intensive part of the classification algorithm – the GA run – does not involve the multiply-threaded variables. They come into play only when the GA evolution is finished and the different modes of bagging and non-bagging are evaluated.

Exploring variants outside the inner loop can still benefit algorithms in which multiple threading will have to be added to the loop, thus increasing the computational burden. In this case, the cost of exploring the core variants should be fully utilized by carefully analyzing the influence of the (many) post-core settings, so as not to waste the core computation due to some unfortunate parameter choice afterwards.

Conclusion

This paper has proposed a method for testing multiple parameter settings in one experiment, thus saving on computation time. This is possible by simultaneously tracing processing for a number of parameters and, instead of one, generating many results – for all the variants. The multiple data can then be analyzed in a number of ways, such as by the binomial test used here for superior parameter detection. This experimental approach might be of interest to practitioners developing classifiers and fine-tuning them for particular applications, or in cases when testing is computationally intensive.

The current approach could be refined in a number of ways. First, a finer statistical framework could be provided, taking advantage of the specific features of the data-generating process, thus providing crisper tests, possibly at smaller sample sizes. Second, some standard procedures for dealing with common classifiers could be elaborated, making the proposed development process more straightforward.


On Developing Financial Prediction System: Pitfalls and Possibilities
DMLL Workshop at ICML-2002, Australia, 2002


On Developing Financial Prediction System: Pitfalls and Possibilities

Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Published: Proceedings of DMLL Workshop at ICML-2002, 2002

Abstract A successful financial prediction system presents many challenges. Some are encountered over and over again, and though an individual solution might be system-specific, general principles still apply. Using them as a guideline might save time and effort, and boost results, thus promoting a project's success.

This paper remarks on prediction system development, stemming from the author's experiences and published results. The presentation follows the stages in a prediction system development – data preprocessing, prediction algorithm selection and boosting, system evaluation – with some commonly successful solutions highlighted.

Introduction

Financial prediction presents challenges encountered over and over again. This paper highlights some of the problems and solutions. Predictor development demands extensive experimentation: with data preprocessing and selection, the prediction algorithm(s), a matching trading model, evaluation and tuning – to benefit from the minute gains, but not fall into over-fitting. The experimentation is necessary since there are no proven solutions, but the experiences of others, even failed ones, can speed the development.

The idea of financial prediction (and the resulting riches) is appealing, initiating countless attempts. In this competitive environment, if one wants above-average results, one needs above-average insight and sophistication. Reported successful systems are hybrid and custom-made, whereas straightforward approaches, e.g. a neural network plugged into relatively unprocessed data, usually fail (Swingler, 1994).

The individuality of a hybrid system offers chances and dangers. One can bring together the best of many approaches; however, the interaction complexity hinders judging where the performance dis/advantage is coming from. This paper provides hints on the major steps in prediction system development, based on the author's experiments and published results.

The paper assumes some familiarity with machine learning and financial prediction. As a reference one could use (Hastie et al., 2001; Mitchell, 1997), including Java code (Witten & Frank, 1999), applied to finance (Deboeck, 1994; Kovalerchuk & Vityaev, 2000); non-linear analysis (Kantz & Schreiber, 1999a), in finance (Deboeck, 1994; Peters, 1991); ensemble techniques (Dietterich, 2000), in finance (Kovalerchuk & Vityaev, 2000).

Data Preprocessing

Before data is fed into an algorithm, it must be collected, inspected, cleaned and selected. Since even the best predictor will fail on bad data, data quality and preparation is crucial. Also, since a predictor can exploit only certain data features, it is important to detect which data preprocessing/presentation works best.

Visual inspection is invaluable. At first, one can look for: trend – whether it needs removing; histogram – whether to redistribute; missing values and outliers; any regularities. There are financial data characteristics (Mantegna & Stanley, 2000) that differ from the normally-distributed, aligned-data assumptions in the general data mining literature.

Outliers may require different considerations: 1) genuine big changes – of big interest to prediction; such data could even be multiplied to promote recognition; 2) jumps due to a change in how a quantity is calculated, e.g. stock splits; all previous data could be re-adjusted, or a single outlier treated as a missing value; 3) outlier regularities could signal a systematic error.

Fat tails – extreme values more likely as compared to the normal distribution – are an established property of financial returns (Mantegna & Stanley, 2000). They can matter 1) in situations which assume a normal distribution, e.g. generating missing/surrogate data w.r.t. the normal distribution will underestimate extreme values; 2) in outlier detection. If capturing the actual distribution is important, the data histogram can be preferred to parametric models.

Time alignment – same date-stamp data may differ in the actual time, as long as the relationship is kept constant. The series originating the predicted quantity sets the time – extra time entries in other series may be skipped, whereas entries missing in other series may need to be restored. Alternatively, all series could be converted to an event-driven time scale, especially for intra-day data (Dacorogna et al., 2001).

Missing values can be dealt with by data mining methods (Han & Kamber, 2001; Dacorogna et al., 2001). If a miss spoils a temporal relationship, restoration is preferable to removal. Conveniently, all misses in the raw series are restored before feature derivation, alignment etc., skipping any later instances of undefined values. If data restorations are numerous, a test whether the predictor picks up the inserted bias is advisable.

Detrending removes the growth of a series. For stocks, indexes and currencies, converting into logarithms of subsequent (e.g. daily) returns does the trick. For volume, dividing it by the average of the last k quotes, e.g. yearly, can scale it down.

Noise, minimally at the price discretisation level, is prevalent; especially low-volume markets should be treated with suspicion. Discretisation of series into a few (< 10) categories (Gershenfeld & Weigend, 1993) along with noise cleaning could be evaluated against prediction quality. Simple cleaning: for each series value, find its nearest neighbors based on surrounding values, and then substitute the value by an average of the original and those from the neighbors (Kantz & Schreiber, 1999a); a sketch of this follows. Other operations limiting noise: averaging, instance multiplication, sampling – mentioned below.
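A minimal numpy sketch of the simple cleaning under the description above; neighbors of a point are located by comparing short windows of surrounding values.

    import numpy as np

    def clean(series, context=2, k=3):
        # replace each value by the average of itself and the values at the
        # k positions with the most similar surrounding contexts
        s = np.asarray(series, dtype=float)
        n, out = len(s), np.array(series, dtype=float)
        for t in range(context, n - context):
            window = s[t - context : t + context + 1]
            dists = sorted(
                (np.sum((s[u - context : u + context + 1] - window) ** 2), u)
                for u in range(context, n - context) if u != t)
            nearest = [u for _, u in dists[:k]]
            out[t] = (s[t] + s[nearest].mean()) / 2
        return out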

Normalization. Discretization – mapping the original values to fewer (new) ones – e.g. positive to 1 and other to -1 – is useful for noise reduction and for nominal-input predictors. Subsequent predictor training with input discretized into a decreasing number of values can estimate noise – prediction accuracy could increase (Kohavi & Sahami, 1996) once the difference between discretized values exceeds the noise, to decline later when rough discretization ignores important data distinctions.

Redistribution – changing the frequency of some values in relation to others – can better utilize the available range, e.g. if daily returns were linearly scaled to (-1, 1), the majority would be around 0.

Normalization brings values to a certain range, minimally distorting the initial data relationships. SoftMax norm increasingly squeezes extreme values, mapping the middle linearly, e.g. the middle 95% of input values could be mapped to [-0.95, 0.95], with the bottom and top 2.5% mapped nonlinearly to (-1, -0.95) and (0.95, 1) respectively. Normalization should precede feature selection, as non-normalized series may confuse the process.

Series-to-instances conversion is required by most learning algorithms, which expect a fixed-length vector as input. It can be a delay vector derived from a series, a basic technique in nonlinear analysis (Kantz & Schreiber, 1999a): v_t = (series_t, series_{t−delay}, ..., series_{t−(D−1)∗delay}). The delay can be the least one giving zero autocorrelation when applied to the series. Such vectors with the same time index t – coming from all input series – appended give an instance, its coordinates referred to as features or attributes.
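A minimal numpy sketch, with the delay chosen as the smallest lag at which the series' autocorrelation first reaches zero, as suggested above.

    import numpy as np

    def pick_delay(series, max_lag=50):
        # smallest lag with non-positive autocorrelation
        s = np.asarray(series, dtype=float) - np.mean(series)
        for lag in range(1, max_lag):
            if np.corrcoef(s[:-lag], s[lag:])[0, 1] <= 0:
                return lag
        return max_lag

    def embed(series, dim=10, delay=1):
        # delay vectors v_t = (s_t, s_{t-delay}, ..., s_{t-(dim-1)*delay})
        s = np.asarray(series, dtype=float)
        start = (dim - 1) * delay
        return np.array([s[t - np.arange(dim) * delay]
                         for t in range(start, len(s))])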

Data multiplication can be done on many levels. The frequency of a series can be increased by adding (Fourier-)interpolated points (Gershenfeld & Weigend, 1993).

Instances can be cloned with some features supplemented with Gaussian noise: 0-mean, with deviation between the noise level already present in the feature/series and the deviation of that series. This can be useful when only few instances are available for an interesting type, e.g. instances with a big return. Such data forces the predictor to look for important characteristics, ignoring noise – added and intrinsic. Also, by relatively increasing the number of interesting cases, training will pay more attention to their recognition.

Including more series can increase the number of features. A simple test of what to include is to look for series significantly correlated with the predicted one. It is more difficult to add non-numerical series; however, adding a text filter for keywords in news can bring substantial advantage.

Indicators are series derived from others, enhancing some features of interest, such as trend reversal. Over the years, traders and technical analysts trying to predict stock movements developed the formulae (Murphy, 1999), some later confirmed to pertain useful information (Sullivan et al., 1999). Feeding indicators into a prediction system is important due to 1) the averaging, thus noise reduction, present in many indicator formulae, 2) providing views of the data suitable for prediction. Common indicators follow; a sketch of two of them appears after the list.

MA, Moving Average, is the average of the past k values up to date. Exponential Moving Average: EMA_n = weight ∗ series_n + (1 − weight) ∗ EMA_{n−1}.

Stochastic (Oscillator) places the current value relative to the high/low range in a period: (series_n − low(k)) / (high(k) − low(k)), where low(k) is the lowest among the k values preceding n; k often 14 days.

MACD, Moving Average Convergence Divergence, the difference of short- and long-term exponential moving averages, with 8 and 17, or 12 and 26 days used.

ROC, Rate of Change, the ratio of the current price to the price k quotes earlier, k usually 5 or 10 days.

RSI, Relative Strength Index, relates growths to falls in a period. RSI can be computed as the sum of positive changes (i.e. series_i − series_{i−1} > 0) divided by the sum of all absolute changes, taking the last k quotes; k usually 9 or 14 days.
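A minimal sketch of two of the formulae above, on plain Python lists; the EMA weight and the RSI period k are the usual free parameters.

    def ema(series, weight=0.1):
        # EMA_n = weight*series_n + (1 - weight)*EMA_{n-1}
        out = [series[0]]
        for x in series[1:]:
            out.append(weight * x + (1 - weight) * out[-1])
        return out

    def rsi(series, k=14):
        # sum of positive changes over sum of absolute changes, last k quotes
        changes = [b - a for a, b in zip(series[-k - 1:-1], series[-k:])]
        gains = sum(c for c in changes if c > 0)
        total = sum(abs(c) for c in changes)
        return gains / total if total else 0.5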

Sampling. In my experiments with NYSE predictability, skipping the half of the training instances with the lowest weight (i.e. weekly return) enhanced predictions, as similarly reported (Deboeck, 1994). The distribution (for returns approximated by a lognormal) was such that the lowest-return half constituted only 0.2 of the cumulative return, and the lowest 0.75 – 0.5 (Mantegna & Stanley, 2000). The improvement could be due to skipping noise-dominated small changes, and/or to bigger changes being ruled by a mechanism whose learning is distracted by the numerous small changes. Thus, while sampling, it might be worth under-representing small-weight instances, missing-value-filled and evident-outlier instances, and older ones. The amount of data needed to train a model can be estimated (Walczak, 2001).


Bootstrap – sampling, with repetitions, as many elements as in the original – and deriving a predictor for each such sample, is useful for collecting various statistics (LeBaron & Weigend, 1994), e.g. on performance, and also for ensemble creation or best-predictor selection (e.g. via bumping), however not without limits (Hastie et al., 2001).

Feature selection can make learning feasible, since, because of the curse of dimensionality (Mitchell, 1997), long instances demand (exponentially) more data. As always, the feature choice should be evaluated together with the predictor: assuming feature importance because a feature worked well with other predictors may mislead.

Principal Component Analysis (PCA), and Independent Component Analysis – claimed better for stock data (Back & Weigend, 1998) – reduce dimension by proposing a new set of salient features.

Sensitivity Analysis trains a predictor on all features and then drops those least influencing predictions. Many learning schemes internally signal important features, e.g. a (C4.5) decision tree uses them first, neural networks assign them the highest weights, etc.

Heuristics such as hill-climbing or genetic algorithms operating on binary feature selection can be used not only to find salient feature subsets, but also – invoked several times – to provide different sets for ensemble creation.

Predictability assessment allows one to concentrate on feasible cases (Hawawini & Keim, 1995). Some of the tests below are simple non-parametric predictors – prediction quality reflecting predictability, measured, e.g., by the ratio of standard error to the series' standard deviation.

Linear methods measure correlation between the predicted series and a feature series – a significantly non-zero one implying predictability (Tsay, 2002). Multiple features can be taken into account by multivariate regression.

Nearest Neighbor (Mitchell, 1997) offers a powerful local predictor. It is distracted by noisy/irrelevant features, but if this is ruled out, its failure suggests that the most that can be predicted are general regularities, e.g. an outcome's overall probability.

Entropy measures information content, i.e. deviation from randomness (Molgedey & Ebeling, 2000). This general measure, not demanding big amounts of data and useful in discretisation or feature selection, is worth familiarizing with.


Compressibility – the ratio of the compressed to the original sequence length – shows how regularities can be exploited by a compression algorithm (which could be the basis of a predictor). An implementation: the series digitized to 4-bit values, packed in pairs into a byte array, subjected to Zip compression (Feder et al., 1992); a sketch appears after this list.

Detrended Fluctuation Analysis (DFA) reveals long-term correlations (self-similarity) even in non-stationary time series (Vandewalle et al., 1997). DFA is more robust, so it is recommended over Hurst analysis – a sensitive statistics of cycles whose proper interpretation requires experience (Peters, 1991).

Chaos and Lyapunov exponent tests probe short-term determinism, thus predictability (Kantz & Schreiber, 1999a). However, the algorithms are noise-sensitive and require long series, thus conclusions should be cautious.

Randomness tests, like chi-square, can assess the likelihood that the observed (digitized) sequence is random. Such a test on patterns of consecutive digits could hint at pattern non/randomness.

A non-stationarity test can be implemented by dividing the data into parts and computing part i predictability based only on part j data. The variability of the measures (visual inspection encouraged), such as their standard deviation, assesses stationarity.
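A minimal sketch of the compressibility test as described: 4-bit digitization, two codes packed per byte, then Zip (here zlib) compression; ratios close to 1 indicate no exploitable regularity.

    import zlib

    def compressibility(series, bins=16):
        # digitize into 4-bit codes 0..15 and pack two codes per byte
        lo, hi = min(series), max(series)
        codes = [min(int((x - lo) / (hi - lo + 1e-12) * bins), bins - 1)
                 for x in series]
        packed = bytes((a << 4) | b for a, b in zip(codes[::2], codes[1::2]))
        # ratio of compressed to original length
        return len(zlib.compress(packed, 9)) / len(packed)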

A battery of tests could include linear regression, DFA for long-term correlations, compressibility for an entropy-based approach, Nearest Neighbor for local prediction, and a non-stationarity test.

Prediction Algorithms

Below, common learning algorithms (Mitchell, 1997) are discussed, point-ing their features important to financial prediction.

Linear methods not main focus here, are widely used in financial pre-diction (Tsay, 2002). In my Weka (Witten & Frank, 1999) experiments,Locally Weighted Regression (LWR) – scheme weighting Nearest Neighborpredictions – discovered regularities in NYSE data 5. Also, Logistic – non-linear regression for discrete classes – performed above-average and withspeed. As such, regression is worth trying, especially its schemes more spe-cialized to the data (e.g. Logistic to discrete) and as a final optimization –weighting other predictions (LWR).

5 Unpublished, ongoing work.


Neural Network (ANN) – seems the method of choice for financial prediction (Kutsurelis, 1998; Cheng et al., 1996). Backpropagation ANNs present the problems of long training and of guessing the net architecture. Schemes training the architecture along with the weights could be preferred (Hochreiter & Schmidhuber, 1997; Kingdon, 1997), limiting under-performance due to a wrong (architecture) parameter choice. Note that a failure of an ANN attempt, especially using a general-purpose package, does not prove prediction impossible. In my experiments, Voted Perceptron performance often compared with that of ANN; it could be a start, especially when speed is important, such as in ensembles.

C4.5, ILP – generate decision trees/if-then rules – human-understandable, if small. In my experiments with Progol (Mitchell, 1997) – an otherwise successful rule-learner – applied to NYSE data, rules (resembling technical ones) seldom emerged; Weka J48 (C4.5) tree-learner predictions have not performed; GA-evolved rules' performance was very sensitive to the 'right' background predicates (Zemke, 1998). The conclusion is that small rule-based models cannot express certain relationships and do not perform well with noisy, at times inconsistent, financial data (Kovalerchuk & Vityaev, 2000). Ensembles of decision trees can make up for these problems, but readability is usually lost. Rules can also be extracted from an ANN, offering both accuracy and readability (Kovalerchuk & Vityaev, 2000).

Nearest Neighbor (NN) does not create a general model; to predict, it looks back for the most similar case(s) (Mitchell, 1997). Irrelevant/noisy features disrupt the similarity measure, so pre-processing is worthwhile. NN is a key technique in nonlinear analysis, which offers insights, e.g. weighting more neighbors, efficient NN search (Kantz & Schreiber, 1999a). Cross-validation (Mitchell, 1997) can also decide an optimal number of kNN neighbors. Ensembles/bagging of NNs trained on different instance samples usually do not boost accuracy, though training on different feature subsets might.


Bayesian classifier/predictor first learns probabilities of how evidence supports outcomes, then uses them to predict the outcome of new evidence. Though the simple scheme is robust to violating the 'naive' independent-evidence assumption, watching independence might pay off, especially as in decreasing markets variables become more correlated than usual. The Bayesian scheme might also combine ensemble predictions – more optimally than majority voting.

Support Vector Machines (SVM) are a relatively new and powerful learner, with attractive characteristics for time series prediction (Muller et al., 1997). First, SVM deals with multidimensional instances – actually the more features the better – reducing the need for (possibly wrong) feature selection. Second, it has few parameters, so finding optimal settings can be easier; one of the parameters refers to the noise level the system can handle.

Performance improvement

Most successful prediction systems are hybrid: several learning schemes coupled together (Kingdon, 1997; Cheng et al., 1996; Kutsurelis, 1998; Kovalerchuk & Vityaev, 2000). Predictions, indications of their quality, biases, etc. are fed into a (meta-learning) final decision layer. The hybrid architecture may also stem from performance-improving techniques:

Ensemble (Dietterich, 2000) is a number of predictors whose votes are put together into the final prediction. The predictors, on average, are expected to be above-random and to make independent errors. The idea is that a correct majority offsets individual errors, thus the ensemble will be correct more often than an individual predictor. The diversity of errors is usually achieved by training one scheme, e.g. C4.5, on different instance samples or features. Alternatively, different predictor types – like C4.5, ANN, kNN – can be used, or the predictor's training can be changed, e.g. by choosing the second-best decision, instead of the first, when building a C4.5 decision tree. Common schemes include bagging, boosting, their combinations, and Bayesian ensembles (Dietterich, 2000). Boosting is particularly effective in improving accuracy.

Note: an ensemble is not a panacea for non-predictable data – it only boosts the accuracy of an already performing predictor. Also, readability and efficiency are decreased.

Genetic Algorithms (GAs) (Deboeck, 1994) explore novel possibilities, often not thought of by humans. Therefore, it is always worth keeping some decisions as parameters that can be (later) GA-optimized, e.g., feature preprocessing and selection, sampling strategy, predictor type and settings, trading strategy. GAs (typically) require a fitness function – reflecting how well a solution is doing. A common mistake is to define the fitness one way and to expect the solution to perform another way; e.g. if not only return but also variance is important, both factors should be incorporated into the fitness. Also, with more parameters and GAs' ingenuity it is easier to overfit the data, thus testing should be more careful.

Local, greedy optimization can improve an interesting solution. This is worth combining with a global optimization, like GAs, which may get near a good solution without reaching it. If the parameter space is likely nonlinear, it is better to use a stochastic search, like simulated annealing, than simple hill-climbing.

Pruning properly applied can boost both 1) speed – by skipping unnecessary computation, and 2) performance – by limiting overfitting. Occam's razor – among equally performing models, the simpler is preferred – is a robust criterion to select predictors; e.g. Network Regression Pruning (Kingdon, 1997) and MMDR (Kovalerchuk & Vityaev, 2000) successfully use it. In C4.5, tree pruning is an intrinsic part. In ANNs, weight decay schemes (Mitchell, 1997) reduce towards 0 connections not sufficiently promoted by training. In kNN, often a few prototypes perform better than referring to all instances – as mentioned, high-return instances could be candidates. In ensembles, if the final vote is weighted, as in AdaBoost (Dietterich, 2000), only the highest-weighted predictors matter.

Tabu, cache, incremental learning, gene GA can accelerate search, allowing more exploration, bigger ensembles etc. Tabu search prohibits re-visiting recent points – besides not duplicating computation, it forces the search to explore new areas. Caching stores computationally expensive results for a quick recall, e.g. (partial) kNN can be precomputed. Incremental learning only updates a model as new instances arrive; e.g. training an ANN could start with an ANN previously trained on similar data, speeding up convergence. Gene expression GAs optimize a solution's compact encoding (gene), instead of the whole solution, which is derived from the encoding for evaluation.

I use a mixture: optimizing genes stored in a tabu cache (logged and later scrutinized if necessary).
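A sketch of how such a cache might look, assuming genes are hashable tuples and evaluate is the costly fitness call:

    def make_cached_fitness(evaluate, log=None):
        # Wraps a costly fitness function: genes seen before are recalled,
        # not re-evaluated, and optionally logged for later scrutiny.
        cache = {}
        def fitness(gene):
            gene = tuple(gene)
            if gene not in cache:
                cache[gene] = evaluate(gene)
                if log is not None:
                    log.append((gene, cache[gene]))
            return cache[gene]
        fitness.cache = cache                       # membership test doubles as a tabu check
        return fitness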

What if everything fails but the data seems predictable? There are still possibilities: more relevant data; playing with noise reduction/discretization; making the prediction easier, e.g. instead of return, predicting volatility (and separately direction), or instead of a stock (which may require company data) predicting an index, or a stock in relation to an index; changing the horizon – prediction in 1 step vs. many; another market or trading model.


Trading model, given predictions, makes trading decisions, e.g. predicted up – long position, down – short, with more possibilities (Hellstrom & Holmstrom, 1998). Return is just one objective; others include: minimizing variance, maximal loss (bankruptcy), risk (exposure), trades (commissions), taxes; the Sharpe ratio etc. A practical system employs precautions against the predictor's non-performance: monitoring recent performance and signaling if it is below the accepted/historic level. It is crucial in non-stationary markets to allow for market shifts beyond control – politics, disasters, entry of a big player. If the shifts cannot be dealt with, they should at least be signaled before inflicting irreparable loss. This touches the subject of a bigger (money) management system, taking the predictions into account while hedging, but it is beyond the scope of this paper.

System Evaluation

Proper evaluation is critical to a prediction system's development. First, it has to measure exactly the effect of interest, e.g. trading return, as opposed to prediction accuracy. Second, it has to be sensitive enough to distinguish often minor gains. Third, it has to convince that the gains are not merely a coincidence.

Evaluate the right thing. Financial forecasts are often developed to support semi-automated trading (profitability), whereas the algorithms underlying those systems might have a different objective. Thus, it is important to test the system performing in the setting it is going to be used in – a trivial, but often missed notion. Also, the evaluation data should be of exactly the same nature as planned for the real-life application; e.g. an index-futures trading system performed well with index data used as a proxy for the futures price, but real futures data degraded it. Some problems with common evaluation strategies (Hellstrom & Holmstrom, 1998) follow.

Accuracy – percentage of correct discrete (e.g. up/down) predictions; a common measure for discrete systems, e.g. ILP/decision trees. It values instances equally, disregarding both an instance's weight and the accuracies for different cases; e.g. a system might score high by predicting the numerous small changes while missing the big few. Actually, some of the best-performing systems have lower accuracy than could be found for that data (Deboeck, 1994).

Square error – sum of squared deviations from actual outputs – is a common measure in numerical prediction, e.g. ANN. It penalizes bigger deviations; however, if the sign is what matters, this might not be optimal, e.g. predicting -1 for -0.1 gets a bigger penalty than predicting +0.1, though the latter might trigger going long instead of short. Square error minimization is often an intrinsic part of an algorithm, such as ANN backpropagation, and changing it might be difficult. Still, many such predictors, e.g. trained on bootstrap samples, can be validated according to the desired measure and the best picked.

Reliability – a predictor's confidence in its forecast – is equally important and as difficult to develop as the predictor itself (Gershenfeld & Weigend, 1993). A predictor will not always be confident – it should be able to express this to the trading counterpart, human or not, e.g. by an 'undecided' output. No trade on dubious predictions is beneficial in many ways: lower errors, commissions, exposure. In my experiments optimizing the reliability requirement, stringent values emerged – why trade if the predicted move and confidence are low? Reliability can be assessed by comparing many predictions: coming from an ensemble, as well as made in one-step and multiple-steps fashion.

Performance measure (Hellstrom & Holmstrom, 1998) should incorporate the predictor and the (trading) model it is going to benefit. Some points: Commissions need to be incorporated – many trading 'opportunities' disappear exactly when commissions are counted. Risk/variability – what is the value of even a high-return strategy if in the process one goes bankrupt? Data difficult to obtain in real time, e.g. volume, might mislead historical data simulations.

Evaluation bias, resulting from the evaluation scheme and time series data, needs to be recognized. Evaluation similar to the intended operation can minimize performance estimate bias, though different tests can be useful to estimate different aspects, such as return and variance.

N-cross validation – data divided into N disjoint parts, N − 1 for training and 1 for testing, error averaged over all N (Mitchell, 1997) – in the case of time series data underestimates error. Reason: in at least N − 2 out of the N train-and-test runs, training instances both precede and follow the test cases, unlike in actual prediction when only the past is known. For series, the window approach is better suited.

Window approach – a segment ('window') of consecutive instances is used for training and a following segment for testing, the windows sliding over all data as statistics are collected. Often, to save training time, the test segment consists of many instances. However, more than 1 test instance overestimates error, since the training window does not include the data directly preceding some tested cases. Since markets undergo regime change in a matter of weeks, the test window should be no longer than that, or a fraction of the train window (< 20%). To speed up training for the next test window, the previous window's predictor could be used as the starting point while training on the next window, e.g. instead of starting with random ANN weights. A minimal walk-forward sketch follows.
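The sketch assumes time-ordered (features, label) instances and a make_model factory that trains on a window and returns a callable predictor – both assumptions of this illustration:

    def walk_forward(instances, make_model, train_len, test_len=1):
        # Slide a training window over the series; test on the segment
        # immediately following it; collect per-window error rates.
        errors, start = [], 0
        while start + train_len + test_len <= len(instances):
            train = instances[start:start + train_len]
            test = instances[start + train_len:start + train_len + test_len]
            model = make_model(train)      # e.g. warm-started on the previous model
            wrong = sum(model(x) != y for x, y in test)
            errors.append(wrong / len(test))
            start += test_len              # windows slide forward in time
        return errors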

Evaluation data should include different regimes, markets, even data errors, and be plentiful. Dividing test data into segments helps to spot performance irregularities (for different regimes).

Overfitting a system to data is a real danger. Dividing data into disjoint sets is the first precaution: training, validation for tuning, and a test set for performance estimation. A pitfall may be that the sets are not as separated as they seem; e.g. when predicting returns 5 days ahead, a set may end at day D, but that instance may contain the return for day D+5, falling into the next set. Thus data preparation and splitting should be careful.

Another pitfall is using the test set more than once. Just by luck, 1 out of 20 trials is 95% above average, 1 out of 100, 99% above etc. In multiple tests, the significance calculation must factor that in; e.g. if 10 tests are run and the best appears 99.9% significant, it really is only 0.999^10 ≈ 99% significant (Zemke, 2000).

Multiple use can be avoided, for the ultimate test, by taking data that was not available earlier. Another possibility is to test on similar, not tuned-for, data – without any tweaking until better results, only with predefined adjustments for the new data, e.g. switching the detrending preprocessing on.

Surrogate data is a useful concept in nonlinear system evaluation (Kantz & Schreiber, 1999a). The idea is to generate data sets sharing characteristics of the original data – e.g. permutations of a series have the same mean, variance etc. – and for each compute an interesting statistic, e.g. the return of a strategy. To compare the original series' statistic to those of the surrogates, there are 2 ways to proceed: 1) If the statistic is normally distributed, the usual one/two-sided test comparing to the surrogates' mean is used. 2) With no such assumption, the nonparametric rank test can be used: If α is the acceptable risk of wrongly rejecting the null hypothesis that the original series' statistic is lower (higher) than that of any surrogate, then 1/α − 1 surrogates are needed; if all give a higher (lower) statistic than the original series, then the hypothesis can be rejected. Thus, if a predictor's error was lower on the original series than in 19 runs on surrogates, we can be 95% sure it is up to something.
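A sketch of the rank test with permutation surrogates; statistic is whatever is being compared, e.g. a predictor's error on the series (lower assumed better):

    import random

    def surrogate_test(series, statistic, alpha=0.05, seed=0):
        # One-sided rank test: reject the null only if the original statistic
        # is lower than on all 1/alpha - 1 shuffled surrogates.
        rng = random.Random(seed)
        n_surrogates = round(1 / alpha) - 1    # e.g. 19 for 95% confidence
        original = statistic(series)
        for _ in range(n_surrogates):
            surrogate = list(series)
            rng.shuffle(surrogate)             # same mean/variance, no temporal structure
            if statistic(surrogate) <= original:
                return False                   # cannot reject the null
        return True                            # significant at the 1 - alpha level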

Non/Parametric tests. Most statistical tests (Hastie et al., 2001; Efron & Tibshirani, 1993) have preconditions. They often involve assumptions about sample independence and distributions – unfulfilled, they lead to unfounded conclusions. Independence is tricky to achieve, e.g. predictors trained on overlapping data are not independent. If the sampling distribution is unknown, as it usually is, it takes at least 30, better 100, observations for normal distribution statistics.

If the sample is smaller than 100, nonparametric tests are preferable, with less scope for assumption errors. The downside is that they have less discriminatory power – for the same sample size (Heiler, 1999).

A predictor should significantly win (nonparametric) comparisons with naive predictors, such as the two sketched below: 1) The Majority predictor outputs the commonest value all the time; for stocks it could be the dominant up move, translating into the buy-and-hold strategy. 2) The Repeat previous predictor issues for the next value the (sign of the) previous one.
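Sketches of the two baselines, for a series of up/down signs:

    from collections import Counter

    def majority_predictor(train_signs):
        # Always outputs the commonest class seen in training
        # (for stocks, typically 'up', i.e. buy-and-hold).
        commonest = Counter(train_signs).most_common(1)[0][0]
        return lambda history: commonest

    def repeat_previous_predictor():
        # Issues the (sign of the) previous value as the next prediction.
        return lambda history: history[-1]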

Sanity checks involve common sense (Gershenfeld & Weigend, 1993). Prediction errors along the series should not reveal any structure, unless the predictor missed something. Do predictions on a surrogate (permuted) series discover something? If valid, this is the bottom line for comparison with prediction on the original series – is it significantly better?

Putting it all together

To make the paper less abstract, some of the author's choices in a NYSE index prediction system follow. The ongoing research extends an earlier system (Zemke, 1998). The idea is to develop a 5-day return predictor, later on, to support a trading strategy.


Data used consists of 30 years of daily NYSE data: 5 index and 4 volume series. The data is plotted, and series visibly mimicking others are omitted. Missing values are filled by a nearest neighbor algorithm, and the 5-day return series to be predicted is computed. The index series are converted to logarithms of daily returns; the volumes are divided by lagged yearly averages. Additional series are derived, depending on the experiment: 10- and 15-day MA and ROC for the indexes. Then all series are Softmax normalized to -1..1 and discretized to 0.1 precision. In between major preprocessing steps, series statistics are computed: number of NaNs, min and max values, mean, standard deviation, 1,2-autocorrelation, zip-compressibility, linear regression slope, DFA – tracing whether preprocessing does what is expected – removing NaNs, trend, outliers, but not zip/DFA predictability. In the simplest approach, all series are then put together into instances with D = 3 and delay = 2. An instance's weight is the corresponding time's absolute 5-day return, and the instance's class – the return's sign.

The predictor is one of the Weka (Witten & Frank, 1999) classifiers handling numerical data, 4-bit coded into a binary string together with: which instance features to use, how much past data to train on (3, 6, 10, 15, 20 years) and what part of the lowest-weight instances to skip (0.5, 0.75, 0.85). Such strings are GA-optimized, with already evaluated strings cached and prohibited from costly re-evaluation. Evaluation: a predictor is trained on past data and used to predict values in a disjoint window, 20% the size of the data, ahead of it; this is repeated 10 times with the windows shifted by the smaller window size. The average of the 10 period returns, less the 'always up' return and divided by the standard deviation of the 10 values, gives a predictor's fitness.

Final Remarks

Financial markets, as described by the multidimensional data presented to a prediction/trading system, are complex nonlinear systems – with subtleties and interactions difficult for humans to comprehend. This is why, once a system has been developed, tuned and proven performing on (volumes of) data, there is no space for human 'adjustments', except for going through the whole development cycle. Without stringent re-evaluation, performance is likely to be hurt.

A system development usually involves a number of recognizable steps: data preparation – cleaning, selecting, making data suitable for the predictor; prediction algorithm development and tuning – for performance on the quality of interest; evaluation – to see if the system indeed performs on unseen data. But since financial prediction is very difficult, extra insights are needed. The paper has tried to provide some: data-enhancing techniques, predictability tests, performance improvements, evaluation hints and pitfalls to avoid. Awareness of them will hopefully make predictions easier, or at least quicken the realization that they cannot be done.


Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data
HIS-2002, Chile, 2002


Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data

Stefan Zemke
Department of Computer and System Sciences

Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Published: Proceedings of HIS-2002, 2002

Abstract This paper addresses 4 practical ensemble applications: time series prediction, estimating accuracy, dealing with multiple-feature and noisy data. The intent is to refer a practitioner to ensemble solutions exploiting the specificity of the application area.

Introduction

Recent years have seen a big interest in ensembles – putting several classifiers together to vote – and for a good reason. Even weak, by themselves not so accurate, classifiers can create an ensemble beating the best learning algorithms. Understanding why and when this is possible, and what the problems are, can lead to even better ensemble use. Many learning algorithms incorporate voting. Neural networks apply weights to inputs and a nonlinear threshold function to summarize the 'vote'. Nearest neighbor (kNN) searches for k prototypes for a classified case, and outputs the prototypes' majority vote. If the definition of an ensemble allows that all members classify, but only one outputs, then Inductive Logic Programming (ILP) is also an example.

A classifier can also be put into an external ensemble. Methods for how to generate and put classifiers together have been prescribed, reporting accuracy above that of the base classifiers. But this success and generality of ensemble use does not mean that there are no special cases benefiting from a problem-related approach. This might be especially important in extreme cases, e.g. when it is difficult to obtain above-random classifiers due to noise – an ensemble will not help. In such cases, it takes more knowledge and experiments (ideally by others) to come up with a working solution, which probably involves more steps and ingenuity than a standard ensemble solution. This paper presents specialized ensembles in 4 areas: prediction, estimating accuracy, dealing with multiple-feature and noisy data. The examples have been selected from many reviewed papers, with clarity and generality (within their area) of the solution in mind. The idea of this paper is to provide a problem-indexed reference to the existing work, rather than to detail it.

Why Ensembles Work

An ensemble outperforms the base classifier due to several reasons (Dietterich, 2000). First, given limited training data, a learning algorithm can find several classifiers performing equally well. An ensemble minimizes the risk of selecting the wrong one, as more can be incorporated and averaged. Second, the learning algorithm's outcome – a classifier – might be merely a local optimum in the algorithm's search. Ensemble construction restarts the algorithm from different points, avoiding the pitfall. Third, a number of even simple classifiers together can express more complex functions – better matching the data.

It is not difficult to make ensembles work – most of the methods to 'disturb' either the training data or the classifier construction result in an ensemble performing better than a single classifier. Ensembles have improved classification results for data repositories with different algorithms as the base learner, suggesting that the ensemble itself is behind the progress. Ensembles can be stacked on top of each other, adding benefits; e.g. if boosting is good at improving accuracy and bagging at reducing variance, bagging boosted classifiers, or the other way around, may be good at both, as experienced (Webb, 1998). Ensemble size is not crucial; even a small one benefits. Much of the reduction in error appears at 10 − 15 classifiers (Breiman, 1996), but AdaBoost and Arcing measurably improve their test-set error until around 25 classifiers (Opitz & Maclin, 1999) or more.

Increased computational cost is the first bad news. An ensemble of tens, perhaps hundreds, of classifiers takes that much more to train, classify and store. This can be alleviated by simpler base classifiers – e.g. decision stubs instead of trees – and by pruning, e.g. skipping low-weight members in a weighted ensemble (Margineantu & Dietterich, 1997a). Overfitting can result when an ensemble does not merely model the training data from many (random) angles, but tries to fit its whims. Such a way to boost accuracy may work on noise-free data, but in the general case this is a recipe for overfitting (Sollich & Krogh, 1996). Readability loss is another consequence of voting classifiers. Rule-based decision trees and predicate-based ILP, having accuracy similar to e.g. neural networks (ANN) and kNN, were favored in some areas because of their human-understandable models. However, an ensemble of 100 such models – different to make the ensemble work, and possibly weighted – blows the readability.

A note on vocabulary. Bias refers to the classification error part of the central tendency, or most frequent classification, of a learner when trained on different sets; variance – the error part of deviations from the central tendency (Webb, 1998). Stable learning algorithms are not that sensitive to changes in the training set; they include kNN, regression, Support Vector Machines (SVM), whereas decision trees, ILP, ANN are unstable. Global learning creates a model for the whole data, later used (the model, not the data) to classify new instances, e.g. ANN, decision tree, whereas local algorithms refrain from creating such models, e.g. kNN, SVM. An overview of learning algorithms can be found in (Mitchell, 1997).

Common Ensemble Solutions

For an ensemble to increase accuracy, the member classifiers need to have: 1) Independent, or better negatively correlated, errors (Ali & Pazzani, 1995), 2) Expected above-random accuracy. Ensemble methods usually do not check these assumptions; instead they prescribe how to generate compliant classifiers. In this sense, the methods are heuristics, found effective and scrutinized for various aspects, e.g. suitable learning algorithms, ensemble size, data requirements etc.

An ensemble explicitly fulfilling the assumptions might be even more effective: smaller – including only truly contributing classifiers, more accurate – taking validated above-random classifiers etc. Experiments confirm this (Liu & Yao, 1998). However, even the more advanced methods often refer to features of common ensembles. There are many ways to ensure an ensemble's classifier diversity, e.g. by changing the training data or the classifier construction process – the most common methods are described below.

Different Training Instances

Different subsets of the training set lead to different classifiers, especially if the learning algorithm is unstable. The subsets could be without or with replacement, i.e. allowing multiple copies of the same example. A common setting is a bootstrap – drawing as many elements as in the whole set, however with replacement, thus having on average 63.2% of the elements of the original set, some repeated. Another possibility is to divide the training set into N disjoint subsets and systematically train on N − 1 of them, leaving each one out. A more general way is to assign weights to the examples and change them before training a new classifier respecting the weights.

Bagging (Breiman, 1996) – classifiers trained on different bootstrap samples are put to a majority vote – the class issued by most wins. Bagging improves accuracy by promoting the average classification of the ensemble, thus reducing the influence of individual variances (Domingos, 1997), with bias mostly intact (Webb, 1998); it handles noise (up to 20%) well (Dietterich, 1998), but does not work at all with stable learning methods. A sketch follows.
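The sketch assumes learn maps a training list of (features, label) pairs to a callable model:

    import random
    from collections import Counter

    def bag(train, learn, n_classifiers=25, seed=0):
        # Train each member on a bootstrap sample (with replacement, same
        # size as the data), then majority-vote at prediction time.
        rng = random.Random(seed)
        models = [learn([rng.choice(train) for _ in train])
                  for _ in range(n_classifiers)]
        def predict(x):
            votes = Counter(m(x) for m in models)
            return votes.most_common(1)[0][0]    # class issued by most wins
        return predict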

AdaBoost (Freund & Schapire, 1995) forms a committee by applying a learning algorithm to the training set whose distribution is changed, after generating each classifier, so as to stress frequently misclassified cases. While classifying, a member of the ensemble is weighted by a factor proportional to its accuracy. With little training data, AdaBoost performs better than Bagging (Dietterich, 1998); however, it may deteriorate if there is insufficient training data relative to the complexity of the base classifiers or if their training errors grow (Schapire et al., 1997).


Feature selection

Classifiers can be trained with different feature subsets. The selection can be random or premeditated, e.g. providing a classifier with a selection of informative, uncorrelated features. If all features are independent and important, the accuracy of the restricted (feature-subset) classifiers will decline; however, putting them all together could still give a boost. Features can also be preprocessed, presenting different views of the data to different classifiers.

Changing Output Classes

The output values can be assigned to 2 super-classes, e.g. A1, B1 – each covering several of the original class values – and a classifier trained with the super-class; then another selection, A2, B2, is made and the next classifier trained etc. When classifying, all ensemble members issue their super-classifications and a count is made of which of the original classes appears most frequently – the final output.

Error Correcting Output Coding (ECOC) (Dietterich & Bakiri, 1991) is a multi-class classification, where each class is encoded as a string of binary code-letters, a codeword. Given a test instance, each of its code-letters is predicted, and the class whose codeword has the smallest Hamming distance to the predicted codeword is assigned, as sketched below. By reducing bias and variance, ECOC boosts global learning algorithms, but not local ones (Ricci & Aha, 1998) – in that case, ECOC code-letter classifiers can be differentiated by providing them with a subset of features. In data with few classes (K < 6), extending the codeword length yields increased error reduction.
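A sketch of the decoding step, assuming one trained binary classifier per code-letter position:

    def ecoc_classify(x, codewords, letter_classifiers):
        # codewords: {class: tuple of 0/1}; output the class whose codeword
        # is nearest, in Hamming distance, to the predicted code-letters.
        predicted = [clf(x) for clf in letter_classifiers]
        def hamming(codeword):
            return sum(a != b for a, b in zip(codeword, predicted))
        return min(codewords, key=lambda cls: hamming(codewords[cls]))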

Randomization

Randomization could be inserted at many points, resulting in ensemble variety. Some training examples could be distorted, e.g. by adding 0-mean noise. Some class values could be randomized. The internal working of the learning algorithm could be altered, e.g. by choosing a random decision among the 3 best in decision tree build-up.


Wagging (Bauer & Kohavi, 1998), a variant of bagging, requires a base learner accepting training set weights. Instead of bootstrap samples, wagging assigns random weights to instances in each training set; the original formulation used Gaussian noise to vary the weights.

Different Classifier Types

Even when trained on the same data, classifiers such as kNN, neural networks and decision trees create models classifying new instances differently, due to different internal languages, biases, sensitivity to noise etc. Learners could also induce varied models due to different settings, e.g. network architecture.

Bayesian ensemble uses k classifiers, obtained by any means, in the Bayes formula. The basis for the ensemble outcome are probabilities: the classes' priors and the conditionals for predicted/actual class pairs, for each classifier. The Bayes output, given k classifications, is then the class_p maximizing P(class_p) * P_1(predict_1 | class_p) * ... * P_k(predict_k | class_p). It can be viewed as the Naive Bayes Classifier (Mitchell, 1997) meta-applied to the ensemble.
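A sketch of the combination, with the priors and per-classifier conditional tables assumed estimated beforehand, e.g. on validation data:

    def bayes_combine(predictions, class_priors, conditionals):
        # predictions: the k members' classifications of one instance;
        # class_priors: {class: P(class)};
        # conditionals: k dicts {(predicted, actual): probability}.
        def score(cls):
            p = class_priors[cls]
            for pred, table in zip(predictions, conditionals):
                p *= table.get((pred, cls), 1e-9)    # smooth unseen pairs
            return p
        return max(class_priors, key=score)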

Specialized Ensemble Use

The quest for the ultimate ensemble technique resembles the previous efforts to find the 'best' learning algorithm, which discovered a number of similarly accurate methods, some somewhat better in specific circumstances, and usually further improved by problem-specific knowledge. Ensemble methods also show their strengths in different circumstances, e.g. no/data noise, un/stable learner etc. Problem specifics could be directly incorporated into a specialized ensemble. This section addresses four practical problem areas and presents ensemble adaptations. Though other problems in the areas might require an individual approach, the intention is to bring up some issues and worked-out solutions.


Time Series Prediction

Time series arise in any context in which data is linearly ordered, e.g. by time or distance. The index increment may be constant, e.g. 1 day, or not, as in the case of event-driven measurements, e.g. indicating a transaction time and its value. Series values are usually numeric, or in a more general case – vectors of fixed length. Time series prediction is to estimate a future value, given values up to date. There are different measures of success, the most common being accuracy – in the case of nominal series values, and squared mean error – in the case of numeric.

Series-to-instances conversion is required by most learning algorithms, which expect as input a fixed-length vector. It can be a lag vector derived from the series, a basic technique in nonlinear analysis: v_t = (series_t, series_{t-lag}, ..., series_{t-(D-1)*lag}). Such vectors with the same time index t – coming from all input series – are appended to give an instance, its coordinates referred to as features. The lag vectors have motivation in Takens' embedding theorem (Kantz & Schreiber, 1999b), stating that a deterministic – i.e. to some extent predictable – series' dynamics is mimicked by the dynamics of the lag vectors; so e.g. if a series has a cycle – coming back to the same values – the lag vectors will have a cycle too.
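A sketch of the lag-vector construction:

    def lag_vectors(series, dim, lag):
        # v_t = (s_t, s_{t-lag}, ..., s_{t-(dim-1)*lag}), for all valid t.
        span = (dim - 1) * lag
        return [tuple(series[t - j * lag] for j in range(dim))
                for t in range(span, len(series))]

    # e.g. lag_vectors([1, 2, 3, 4, 5], dim=2, lag=2) -> [(3, 1), (4, 2), (5, 3)]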

Embedding dimension D – the number of lagged series values used to model a series trajectory – according to the embedding theorem does not need to exceed 2d + 1, where d is the dimension of the series-generating system. In practice d is unknown, possibly infinite if the data is stochastic. D is usually arrived at by increasing the value until some measure – e.g. prediction accuracy – gets saturated. In theory – infinite data and no noise – the measure should stay the same even when D is increased; in practice it does not, due to the curse of dimensionality etc. A smaller dimension allows more and closer neighborhood matches. An ensemble involving different dimensions could resolve the dilemma.

Embedding lag according to Takens' theorem should only be different from the system's cycle; in practice it is more restricted. Too small a lag makes the differences between the lagged values not informative enough to model the system's trajectory – imagine a yearly cycle given just several values separated seconds apart. Too big a lag misses the details and risks putting together weakly related values – as in the case of a yearly cycle sampled at a 123-month interval. Without advanced knowledge of the data, a lag is preferred either at the first zero of the autocorrelation or at the minimum of the mutual information (Kantz & Schreiber, 1999b). However, those are only heuristics, and an ensemble could explore a range of values, especially as theory does not favor any.

Prediction horizon – how much ahead to predict at a time – is another decision. A target 10 steps ahead can be predicted in 1 shot, in 2 iterated 5-ahead predictions, 5 * 2-ahead, or 10 * 1. A longer horizon makes the predicted quantity less corrupted by noise; a shorter one can be all that can be predicted, and iterated predictions can be corrected for their systematic errors, as described below. An ensemble of different horizons could not only limit outliers, but also estimate the overall prediction reliability via agreement among the individual predictions.

Converting a short-term predictor into a longer-term one can also be done utilizing some form of metalearning/ensemble (Judd & Small, 2000). The method uses a second learner to discover the systematic errors of the (not necessarily very good, but above-average) short-term predictor, as it is iterated. These corrections are then used when a longer-term prediction is issued, resulting in much better results. The technique also provides an indication of a feasible prediction horizon and is robust w.r.t. noisy series.

Series preprocessing – meaning global data preparation before it is used for classification or prediction – can introduce a domain-specific data view, reduce noise and normalize, presenting the learning algorithm with more accessible data. E.g. in the analysis of financial series, so-called indicator series are frequently derived by preprocessing and consist of different moving averages and relative value measures within an interval (Zemke, 2002b). Preprocessing can precede, or be done at, learning time, e.g. as calls to background predicates.

The following system (Gonzalez & Diez, 2000) introduces general time series preprocessing predicates: relative increases, decreases, stays (within a range) and region: always, sometime, true percentage – testing if interval values belong to a range. The predicates, filled with values specifying the intervals and ranges, are the basis of simple classifiers – consisting of only one predicate. The classifiers are then subject to boosting, up to 100 iterations. The results are good, though noisy data causes some problems.

Initial conditions of the learning algorithm can differ for each ensemble member. Usually, the learning algorithm has some settings other than the input/output data features etc. In the case of an ANN, these are the initial weights, architecture, learning speed, weight decay rate etc. For an ILP system – the background predicates and allowed complexity of clauses. For kNN – the k parameter and the weighting of the k neighbors w.r.t. distance: equal, linear, exponential. All can be varied.

An ANN example of different weight initialization for time series prediction follows (Naftaly et al., 1997). Nets of the same architecture are randomly initialized and assigned to ensembles built at 2 levels. First, the nets are grouped into ensembles of fixed size Q, and the results for the groups are averaged at the second level. Initially, Q = 1, and as Q increases the variance expectably reduces. At Q = 20 the variance is similar to what could be extrapolated for Q = ∞. Besides suggesting a way to improve predictions, the study offers some interesting observations. First, the minimum of the ensemble predictor error is obtained at an ANN epoch that for a single net would already mean overfitting. Second, as Q increases, the test set error curves w.r.t. epochs/training time go flatter, making it less crucial to stop training at the 'right' moment.

Different series involved in the prediction of a given one are another ensemble possibility. They might be series other than the one predicted, but supporting its prediction, which could be revealed by, e.g., significant non-zero correlation. Or the additional series could be derived from the given one(s), e.g. according to indicator formulae in financial prediction. Then all the series can be put together into the lag vectors – already described for one series – and presented to the learning algorithm. Different ensemble members can be provided with their different selection/preprocessing combination.

Selection of delay vector lag and dimension, even for more input series, can be done with the following (Zemke, 1999b). For each series, lag is set to a small value, and dimension to a reasonable value, e.g. 2 and 10. Next, a binary vector, as long as the sum of embedding dimensions for all series, is optimized by a Genetic Algorithm (GA). The vector, by its '1' positions, indicates which lagged values should be used, their number restricted to avoid the curse of dimensionality. The selected features are used to train a predictor whose performance/accuracy measures the vector's fitness. In the GA population no 2 identical vectors are allowed and, after a certain number of generations, the top-performing half of the last population is subject to majority vote/averaging of their predictions.

Multiple Features

Multiple features, running into hundreds or even thousands, naturally appear in some domains. In text classification, a word's presence may be considered a feature; in image recognition – a pixel's value; in chemical design – a component's presence and activity; or in a joint database the features may mount. Feature selection and extraction are the main dimensionality reduction schemes. In selection, a criterion, e.g. correlation, decides the feature choice for classification. Feature extraction, e.g. Principal Component Analysis (PCA), reduces dimensionality by creating new features. Sometimes it is impossible to find an optimal feature set, when several sets perform similarly. Because different feature sets represent different data views, simultaneous use of them can lead to a better classification.

Simultaneous use of different feature sets usually lumps feature vectors together into a single composite vector. Although there are several methods to form the vector, the use of such a joint feature set may result in the following problems: 1) Curse of dimensionality – the dimension of the composite feature vector becomes much higher than that of any component feature vector, 2) Difficulty in formation – it is often difficult to lump several different feature vectors together due to their diversified forms, 3) Redundancy – the component feature vectors are usually not independent of each other (Chen & Chi, 1998). The problems of relevant feature and example selection are interconnected (Blum & Langley, 1997).

Random feature selection for each ensemble classifier is perhaps the simplest method. It works if 1) the data is highly redundant – it does not matter much which features are included, as many carry similar information, and 2) the selected subsets are big enough to create an above-random classifier – finding that size may require some experimentation. Provided that, one may obtain better classifiers in random subspaces than in the original feature space, even before the ensemble application. In a successful experiment (Skurichina & Duin, 2001), the original dimensionality was 80 (actually 24-60), the subspaces – 10, randomly selected for a 100-classifier majority vote.

Feature synthesis creates new features, exposing important data characteristics to classifiers. Different feature preprocessing for different classifiers ensures their variety for an effective ensemble. PCA – creating orthogonal combinations of features, maximizing variance – is a common way to deal with multi-dimensional data. PCA's new features, the principal components, generated in a sequence of decreasing variability/importance, in different subsets or derived from different data, can be the basis of an ensemble.

In an experiment to automatically recognize volcanos in Mars satellite images (Asker & Maclin, 1997), PCA was applied to 15 × 15 pixels = 225-feature images. A varying number, 6-16, of principal components, plus domain features – line filter values – were fed into 48 ANNs, making an ensemble reaching experts' accuracy. The authors conclude that the domain-specific features and PCA preprocessing were far more important than the learning algorithm choice. Such a scheme seems suitable for cases when domain-specific features can be identified and detecting which other features contributed most is not important, since PCA mixes them all.

Ensemble-based feature selection reduces data dimensionality by observing which classifiers – based on which features – perform well. The features can then contribute to an even more robust ensemble. The sensitivity of a feature is defined as the change in the output variable when an input feature is changed within its allowable range (while holding all other inputs frozen at their median/average value) (Embrechts et al., 2001).

In in-silico drug design with QSAR, 100-1000 dependent features and only 50-100 instances present related challenges: how to avoid the curse of dimensionality, and how to maximize classification accuracy given the few instances yet many features. A reported solution is to bootstrap an (ANN) ensemble on all features, adding one random feature – with values uniformly distributed – to estimate the sensitivities of the features, and skip features less sensitive than the random one. The process is repeated until no further feature can be dropped, and the final ensemble is trained. This scheme allows important features to be identified.

Class-aware feature selection – input decimation – is based on the following. 1) Class is important for feature selection (but ignored, e.g., in PCA). 2) Different classes have different sets of informative features. 3) Retaining original features is more human-readable. Input decimation works as follows (Oza & Tumer, 2001). For each of the L classes, decimation selects a subset of features most correlated to the class and trains a separate classifier on those features. The L classifiers constitute an ensemble. Given a new instance, each of the classifiers is applied (to its respective features) and the class voted for by most is the output, as sketched below. Decimation reduces classification error up to 90% over single classifiers and ensembles trained on all features, as well as ensembles trained on principal components. Ensemble methods such as bagging, boosting and stacking can be used in conjunction with decimation.
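A sketch, with correlation to a one-vs-rest class indicator standing in for 'most correlated to the class', and learn assumed to map (X, y) to a callable model:

    import numpy as np

    def decimate_and_train(X, y, classes, learn, n_keep):
        # One classifier per class, each trained on the n_keep features
        # most correlated with that class's 0/1 indicator.
        members = []
        for cls in classes:
            indicator = (y == cls).astype(float)
            corr = [abs(np.corrcoef(X[:, j], indicator)[0, 1])
                    for j in range(X.shape[1])]
            keep = np.argsort(corr)[-n_keep:]        # top-correlated features
            members.append((keep, learn(X[:, keep], y)))
        def predict(x):
            votes = [model(x[keep]) for keep, model in members]
            return max(set(votes), key=votes.count)  # majority vote
        return predict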

Accuracy Estimation

For many real-life problems, perfect classification is not possible. In addition to fundamental limits to classification accuracy arising from overlapping class densities, errors arise because of deficiencies in the classifier and the training data. Classifier-related problems, such as an incorrect structural model, parameters, or learning regime, may be overcome by changing or improving the classifier. However, errors caused by the data (finite training sets, mislabelled patterns) cannot be corrected during the classification stage. It is therefore important not only to design a good classifier, but also to estimate limits to achievable classification rates. Such estimates determine whether it is worthwhile to pursue (alternative) classification schemes.

The Bayes error provides the lowest achievable error for a given classification problem. A simple Bayes error upper bound is provided by the Mahalanobis distance; however, it is not tight – it might be twice the actual error. The Bhattacharyya distance provides a better range estimate, but it requires knowledge of the class densities. The Chernoff bound tightens the Bhattacharyya upper estimate but is seldom used since it is difficult to compute (Tumer & Ghosh, 1996). The Bayes error can also be estimated non-parametrically from the errors of a nearest neighbor classifier, provided the training data is large; otherwise the asymptotic analysis might fail. Little work has been reported on direct estimation of the performance of classifiers (Bensusan & Kalousis, 2001) and on data complexity analysis for optimal classifier combination (Ho, 2001).

Bayes error estimation via an ensemble (Tumer & Ghosh, 1996) exploits the fact that this error is only data dependent, thus the same for all classifiers, each adding to it extra error due to its specific limitations. By determining the amount of improvement obtained from an ensemble, the Bayes error can be isolated. Given the error E of a single classifier and the error E_ensemble of an averaging ensemble of N ρ-correlated classifiers, the Bayes error stands:

E_Bayes = (N * E_ensemble − ((N − 1) * ρ + 1) * E) / ((N − 1) * (1 − ρ))

The classifier correlation ρ is estimated by deriving the (binary) misclassification vector for each classifier, and then averaging the vectors' correlations. This can cause problems, as it treats classifiers equally, and is expensive if their number N is high. The correlation can, however, also be derived via mutual information, by averaging it between the classifiers and the ensemble as a fraction of the total entropy of the individual classifiers (Tumer et al., 1998). This yields an even better estimate of the Bayes error.
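The isolation formula as a function:

    def bayes_error(e_single, e_ensemble, n, rho):
        # Bayes error isolated from the single-classifier error, the averaging
        # ensemble's error, ensemble size n and the average correlation rho
        # (Tumer & Ghosh, 1996).
        return (n * e_ensemble - ((n - 1) * rho + 1) * e_single) \
               / ((n - 1) * (1 - rho))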


Noisy Data

There is little research specifically on ensembles for noisy data. This is an important combination, since most real-life data is noisy (in the broad sense of missing and corrupted data) and ensembles' success may partially come from reducing the influence of the noise by feature selection/preprocessing, bootstrap sampling etc.

Noise deteriorates weighted ensembles, as the optimization of the combining weights overfits difficult, including noisy, examples (Sollich & Krogh, 1996). This is perhaps the basic result to bear in mind while developing/applying elaborate (weighted) ensemble schemes. To assess the influence of noise, a controlled amount of 5-30% of input and output features was corrupted in an experiment involving Bagging, AdaBoost, Simple and Arcing ensembles of decision trees or ANNs (Opitz & Maclin, 1999). As the noise grew, the efficacy of the Simple and Bagging ensembles generally increased, while Arcing and AdaBoost gained much less. As for ensemble size, with its increase Bagging's error rate did not increase, whereas AdaBoost's did.

Boosting in the presence of outliers can work, e.g., by allowing a fraction of examples to be misclassified, if this improves overall (ensemble) accuracy. An overview of boosting performance on noisy data can be found in (Jiang, 2001). ν-Arc is the AdaBoost algorithm enhanced by a free parameter determining the fraction of allowable errors (Rätsch et al., 2000). In (toy) experiments on noisy data, ν-Arc performs significantly better than AdaBoost and comparably to SVM.

Coordinated ensembles specialize classifiers on different data aspects, e.g. so classifiers appropriately mis-classify outliers coming into their area, without the need to recognize outliers globally. Negatively correlated classifiers – making different (if need be) instead of independent errors – build highly performing ensembles (Ali & Pazzani, 1995). This principle has been joined with coordinated training specializing classifiers on different data parts (Liu & Yao, 1998). The proposed ANN training rule – an extension of backpropagation – clearly outperforms standard ANN ensembles on noisy data, both in terms of accuracy and ensemble size.

Missing data is another aspect of 'noise' where a specialized ensemble solution can increase performance. Missing features can be viewed as data to be predicted – based on the non-missing attributes. An approach sorts all data instances according to how many features they miss: complete instances, missing 1, 2, etc. features. A missing feature is a target for an ensemble trained on all instances where the feature is present. The feature is then predicted, the repaired instance added to the data, and the whole process repeated, if needed, for other features (Conversano & Cappelli, 2000).

Removing mislabelled instances, with the cleaned data used for training, can improve accuracy. The problem is how to recognize a corrupted label, distinguishing it from an exceptional, but correct, case. Interestingly, as opposed to labels, cleaning corrupted attributes may decrease accuracy if a classifier trained on the cleaned data later classifies noisy instances. In one approach (Brodley & Friedl, 1996), all data was divided into N parts and an ensemble trained (by whatever ensemble-generating method) on N − 1 parts, then used to classify the remaining part, in turn done for all parts. The voting method was consensus – only if the whole ensemble agreed on a class different from the actual one was the instance removed, as sketched below. Such a conservative approach is unlikely to remove correct labels, though it may still leave some misclassifications. Experiments have shown that using the cleaned data for training the final classifier (of whatever type) increased accuracy for 20-40% noise (i.e. corrupted labels), and left it the same for noise less than 20%.
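A sketch of the consensus filter, assuming build_ensemble returns a list of callable models:

    def consensus_filter(data, build_ensemble, n_parts=5):
        # Train on N-1 parts, classify the held-out part; drop an instance
        # only if ALL members agree on a class different from its label.
        parts = [data[i::n_parts] for i in range(n_parts)]    # N disjoint parts
        cleaned = []
        for i, held_out in enumerate(parts):
            train = [inst for j, p in enumerate(parts) if j != i for inst in p]
            ensemble = build_ensemble(train)
            for features, label in held_out:
                predictions = {m(features) for m in ensemble}
                if len(predictions) == 1 and label not in predictions:
                    continue                                  # consensus disagrees: drop
                cleaned.append((features, label))
        return cleaned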

Conclusion

Ensemble techniques, bringing together multiple classifiers for increased accuracy, have been intensively researched in the last decade. Most of the papers either propose a 'novel' ensemble technique, often a hybrid one bringing together features of several existing ones, or compare existing ensemble and classifier methods. This kind of presentation has 2 drawbacks. It is inaccessible to a practitioner with a specific problem in mind, since the literature is ensemble-method oriented, as opposed to problem oriented. It also gives the impression that there is the ultimate ensemble technique. A similar search for the ultimate machine learning proved fruitless. This paper concentrates on ensemble solutions in 4 problem areas: time series prediction, accuracy estimation, multiple-feature and noisy data. Published systems, often blending internal ensemble workings with some of the areas' specific problems, are presented, easing the burden of reinventing them.


Multivariate Feature Coupling and Discretization
FEA-2003, Cary, US, 2003


Multivariate Feature Coupling and Discretization

Stefan Zemke
Department of Computer and System Sciences

Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Michal Rams6

Institut de Mathematiques de Bourgogne
Universite de Bourgogne

Dijon, France

[email protected]

Published: Proceedings of FEA-2003, 2003

Abstract This paper presents a two-step approach to multivariate discretization, based on Genetic Algorithms (GA). First, subsets of informative and interacting features are identified – this is one outcome of the algorithm. Second, the feature sets are globally discretized, with respect to an arbitrary objective. We illustrate this by discretization for the highest classification accuracy of an ensemble diversified by the feature sets.

Introduction

Primitive data can be discrete, continuous or nominal. The nominal type merely lists the elements without any structure, whereas discrete and continuous data have an order – they can be compared. Discrete data differs from continuous in that it has a finite number of values. Discretization, digitization or quantization maps a continuous interval into one discrete value, the idea being that the projection preserves important distinctions. If all that matters, e.g., is a real value's sign, it could be digitized to {0, 1}: 0 for negative, 1 otherwise.

6 On leave from the Institute of Mathematics, Polish Academy of Sciences, Poland.


A data set has a data dimension of attributes or features – each holding a single type of values across all data instances. If attributes are the data columns, instances are the rows and their number is the data size. If one of the attributes is the class to be predicted, we are dealing with supervised data, versus unsupervised. The data description vocabulary carries over to the discretization algorithms. If an algorithm discretizing an attribute takes into account the class, it is supervised.

The most common, univariate methods discretize one attribute at a time, whereas multivariate methods consider interactions between attributes in the process. Discretization is global if performed on the whole data set, versus local if only part of the data is used, e.g. a subset of instances.

There are many advantages of discretized data. Discrete features are closer to a knowledge-level representation than continuous ones. Data can be reduced and simplified, so it is easier to understand, use, and explain. Discretization can make learning more accurate and faster, and the resulting hypotheses (decision trees, induction rules) more compact and shorter, hence more efficiently examined, compared and used. Some learning algorithms can only deal with discrete data (Liu et al., 2002).

Background

Machine learning and data mining aim at high accuracy, whereas most discretization algorithms promote accuracy only indirectly, by optimizing related metrics such as entropy or the chi-square statistic.

Univariate discretization algorithms are systemized and compared in (Liu et al., 2002). The best discretizations were supervised: the entropy-motivated Minimum Description Length Principle (MDLP) (Fayyad & Irani, 1993), and one based on the chi-square statistic (Liu & Setiono, 1997), later extended into a parameter-free version (Tay & Shen, 2002).

There is much less literature on multivariate discretization. A chi-square statistic approach (Bay, 2001) aims at discretizing data so its distribution is most similar to the original. Classification rules based on feature intervals can also be viewed as discretization, as done by (Kwedlo & Kretowski, 1999) who GA-evolve the rules. However, different rules may impose different intervals for the same feature.

Multivariate Considerations

Discretizing one feature at a time is computationally less demanding, though limiting. First, some variables are only important together, e.g., if the predicted class = sign(xy), only discretizing x and y in tandem can discover their significance; each alone can be inferred as not related to the class and even discarded.

Second, especially in noisy/stochastic data, a non-random feature may be only slightly above randomness, so it can still test as non-significant. Only grouping a number of such features can reveal their above-random nature.

Those considerations are crucial since discretization is an information-losing transformation – if the discretization algorithm cannot spot a regularity, it will discretize suboptimally, possibly corrupting the features or omitting them as irrelevant.

Besides, data mining applications proliferate beyond data in which each feature alone is informative. To exploit such data and the capabilities of advanced mining algorithms, the data preprocessing, including discretization, needs to be equally adequate.

Voting ensembles also pose demands. When data is used to train a number of imperfect classifiers, which together yield the final hypothesis, the aim of discretization should not be so much to perfect individual feature cut-points, but to ensure that the features as a whole carry as much information as possible – to be recaptured by the ensemble.

Discretization Measures

When discretization goes beyond the fixed interval length/frequency approach, it needs a measure guiding it through the search for salient features and their cut-points. The search strategy can itself have different implementations (Liu et al., 2002). However, let's concentrate on the score functions first.


Shannon conditional entropy (Shannon & Weaver, 1949) is commonly used to estimate the information gain of a cut-point, with the maximal-score point used, as in C4.5 (Quinlan, 1993). We encountered the problem that entropy has low discriminative power in some non-optimal equilibria and, as such, does not provide a clear direction for getting out of them.

Chi-square statistics assess how similar the discretized data is to the original (Bay, 2001). We experimented with chi-square as a secondary test, to further distinguish between nearly equal primary objectives. Eventually, we preferred the Renyi entropy, which has a similar quadratic formula, though one interpretable as accuracy. It can be linearly combined with the following accuracy measure.

Accuracy is rarely a discretization score function, though the complementary data inconsistency is an objective (Liu et al., 2002): For each instance, its features' discretized values create a pattern. If there are n_p instances of a pattern p in the data, then inconsistency_p = n_p − majorityClass_p, where majorityClass_p is the count of the most numerous class among the pattern's instances. The data totalInconsistency is the sum of inconsistencies over all patterns. We define discretization accuracy = 1 − totalInconsistency / dataSize.
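
The measure is straightforward to compute. A minimal Python sketch (function and variable names are ours, not the paper's):

    from collections import Counter

    def discretization_accuracy(patterns, classes):
        # accuracy = 1 - totalInconsistency / dataSize; a pattern's
        # inconsistency is its instance count minus its majority-class count
        counts = {}
        for pattern, cls in zip(map(tuple, patterns), classes):
            counts.setdefault(pattern, Counter())[cls] += 1
        total_inconsistency = sum(sum(c.values()) - max(c.values())
                                  for c in counts.values())
        return 1.0 - total_inconsistency / len(classes)

Here patterns is the sequence of discretized feature-value tuples, one per instance, and classes the corresponding class labels.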

Number of discrete values is an objective to be minimized. We use the related splits: the number of patterns a feature set can maximally produce. The number for a feature set is computed by multiplying the numbers of discrete values introduced by each feature, e.g. if feature 1 is discretized into 3 values, and features 2 and 3 into 4 each, splits = 3 * 4 * 4 = 48.

Our Measure

Accuracy alone is not a good discretization measure since a set of random features may have high accuracy, as the number of splits, and overfitting, grow. Also, some of the individual features in a set may induce accurate predictions. We need a measure of the extra gain over that of contributing features and overfitting. Such considerations led to the following.


Signal-to-Noise Ratio (SNR) expresses the accuracy gain of a feature set: SNR = accuracy / (1 − accuracy) = (dataSize − totalInconsistency) / totalInconsistency, i.e. the ratio of consistent to inconsistent pattern totals.

To correct for the accuracy induced by individual features, we normalize the SNR by dividing it by the SNR for all the features involved in the feature set, getting SNRn. SNRn > 1 indicates that a feature set predicts more than its individual features.
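
In code, the two ratios might look as follows (a sketch; how the accuracy induced by the individual features is obtained and aggregated is left to the caller):

    def snr(accuracy):
        # ratio of consistent to inconsistent pattern totals
        return accuracy / max(1.0 - accuracy, 1e-12)

    def snr_normalized(set_accuracy, individual_accuracy):
        # SNRn > 1 flags a feature set predicting more than its
        # features do individually
        return snr(set_accuracy) / snr(individual_accuracy)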

Fitness for a GA-individual, consisting of a feature set and its discretization, is provided by the SNRn. The GA population is sorted w.r.t. fitness, with a newly evaluated individual included only if it surpasses the current population's worst. If two individuals have the same fitness, further discretization preferences are used. Thus, a secondary objective is to minimize splits; individuals with splits > dataSize/40 are discarded. Next, the one with greater SNR is promoted. Eventually, the feature sets are compared in lexicographic order.

Two-Stage Algorithm

The approach uses Genetic Algorithms to select feature sets contributing to predictability and to discretize the features. First, it identifies different feature subsets, on the basis of their predictive accuracy. Second, the subsets fixed, all the features involved in them are globally fine-discretized.

Feature Coupling

This stage uses rough discretization to identify feature sets of above-random accuracy, via GA fitness maximization. A feature in different subsets may have different discretization cut-points.

After each population-size evaluations, the fittest individual's feature set is a candidate for the list of coupled feature sets. If its SNRn < 1, the set is rejected. Otherwise, it is locally optimized: in turn, each feature is removed and the remaining subset evaluated for SNRn. If a subset measures no worse than the original, it is recursively optimized. If the SNR (not normalized) of the smallest subset exceeds an acceptance threshold, the set joins the coupled feature list.

An acceptance threshold presents a balance. A threshold too low will let through random, overfitting feature subsets; one too high will reject genuine subsets, or postpone their discovery due to increased demands on fitness. Through long experimentation, we arrived at the following formula: A number of random features are evaluated for SNR on the class attribute. To simulate a feature set splitting the class into many patterns, each random feature is split into dataSize/40 discrete values. The greatest SNR so obtained defines the threshold. This has its basis in the surrogate data approach: generate K sets resembling the original data and compute some statistic of interest. Then, if the statistic on the actual data is greater (lower) than on all the random sets, the chance of getting this by coincidence is less than 1/K.
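
A sketch of the threshold computation, reusing discretization_accuracy and snr from the earlier snippets (the number of surrogates is a free parameter):

    import numpy as np

    def acceptance_threshold(classes, n_surrogates=100, rng=None):
        # best SNR achieved by random features, each split into
        # dataSize/40 discrete values, scored against the class
        rng = rng or np.random.default_rng(0)
        data_size = len(classes)
        n_values = max(data_size // 40, 2)
        best = 0.0
        for _ in range(n_surrogates):
            fake = rng.integers(0, n_values, size=data_size)
            acc = discretization_accuracy(fake.reshape(-1, 1), classes)
            best = max(best, snr(acc))
        return best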

Once a feature subset joins the above-random list, all its subsets and supersets in the GA population are mutated. The mutation is such as not to generate any subset or superset of a set already in the list. This done, the GA continues.

At the end of the GA run, single features are considered for the list of feature sets warranting predictability. The accuracy threshold for accepting a feature is arrived at by collecting statistics on the accuracy of permuted original features predicting the actual class. The features are randomized in this way so as to preserve their distribution, e.g. the features may happen to be binary, which should be respected when collecting the statistics. The threshold accuracy is then provided by the mean accuracy plus a required number of standard deviations.
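
A sketch of this permutation test, again reusing discretization_accuracy (the number of permutations and of standard deviations are free parameters):

    import numpy as np

    def single_feature_threshold(feature, classes, n_perm=100, n_std=2.0,
                                 rng=None):
        # shuffling preserves the feature's value distribution while
        # destroying any relation to the class
        rng = rng or np.random.default_rng(0)
        accs = [discretization_accuracy(
                    rng.permutation(feature).reshape(-1, 1), classes)
                for _ in range(n_perm)]
        return float(np.mean(accs) + n_std * np.std(accs))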

The discretization accuracy for a feature is computed by unsupervised discretization into 20 equally frequent intervals. The resulting cut-points are used in the feature set search. Restricting the cut-points in this way minimizes the search space of individuals, which are optimized for active features and their cut-point selection.
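
Equal-frequency cut-points are simply quantiles; a minimal sketch:

    import numpy as np

    def equal_frequency_cutpoints(values, n_intervals=20):
        # cut-points at the 1/20, 2/20, ..., 19/20 quantiles
        quantiles = np.linspace(0.0, 1.0, n_intervals + 1)[1:-1]
        return np.quantile(values, quantiles)

    def discretize(values, cutpoints):
        # index of the interval each value falls into
        return np.searchsorted(cutpoints, values, side='right')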


Global Discretization

Once we have identified the coupled feature sets, the second optimization can proceed. The user could provide the objective. We have attempted a fine discretization of the selected features, in which each feature is assigned only one set of cut-points. The fitness of such a discretization can be measured in many ways, e.g., in the spirit of the Naive Bayesian Classifier, as the product of the discretization accuracies for all the feature sets. The GA optimization proceeds by exploring the cut-points, the feature sets fixed.

The overall procedure provides:

Coupled features – sets inducing superior accuracy to that obtained by the features individually.

Above-random features – all features that induce predictability individually.

Measure of predictability – expressed by the global discretization Bayesian ensemble accuracy.

Implementation Details

GA-individual = active feature set + cut-point selection. The features are a subset of all the features available, no more than 4 selected at once. The cut-points are indices to sorted threshold values, precomputed for each data feature as the values at which the class changes (Fayyad & Irani, 1993). Thus, the discretization of a value is the smallest index whose corresponding threshold exceeds that value.
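
A sketch of the threshold precomputation and of the lookup just described (names are ours):

    import numpy as np

    def boundary_thresholds(values, classes):
        # candidate cut-points in the spirit of Fayyad & Irani (1993):
        # sorted feature values at which the class label changes
        order = np.argsort(values, kind='stable')
        v = np.asarray(values)[order]
        c = np.asarray(classes)[order]
        return np.unique(v[1:][c[1:] != c[:-1]])

    def discretize_value(x, thresholds, selected):
        # smallest index, among the individual's selected threshold
        # indices, whose threshold exceeds x
        return int(np.searchsorted(thresholds[selected], x, side='right'))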

Although non-active features are not processed in an individual, their thresholds are inherited from a predecessor. Once the corresponding feature is mutated active, the retained threshold indices will be used. The motivation is that even for non-active features the thresholds had been optimized, and as such have greater potential than randomly selected thresholds. This is not a big overhead, as all feature threshold index lists are merely pointers to the event at which they were created, and the constant-size representation promotes simple genetic operators.


Genetic operators currently include mutation at 2 levels. The first mutation, stage 1 only, may alter the active feature selection by adding a feature, deleting or changing it. The second mutation does the same to the threshold selection of an active feature: add, delete or change.
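
A sketch of the two mutation levels (the data layout and probabilities are our assumptions):

    import random

    def mutate(active, cuts, n_features, n_thresholds, rng=random):
        # active: set of active feature ids (at most 4);
        # cuts: feature id -> list of threshold indices
        if rng.random() < 0.5:                    # level 1: feature selection
            op = rng.choice(['add', 'delete', 'change'])
            if op in ('delete', 'change') and len(active) > 1:
                active.discard(rng.choice(sorted(active)))
            if op in ('add', 'change') and len(active) < 4:
                active.add(rng.randrange(n_features))
        else:                                     # level 2: threshold selection
            feature = rng.choice(sorted(active))
            idx = cuts.setdefault(feature, [])
            op = rng.choice(['add', 'delete', 'change'])
            if op in ('delete', 'change') and idx:
                idx.pop(rng.randrange(len(idx)))
            if op in ('add', 'change'):
                idx.append(rng.randrange(n_thresholds))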

Experiments

Since the advantage of a multivariate discretization over univariate lies in the ability to identify group-informative features, it is misguided to compare the two on the same data. The data could look random to the univariate approach, or it could not need the multivariate search if single features warranted satisfactory predictability. In the latter case, a univariate approach skipping all the multivariate considerations would be more appropriate and efficient. Comparison to another multivariate discretization would require the exact algorithm and data, which we do not have. Instead, we test our method on synthetic data designed to identify the limitations of the approach. On that data a univariate approach would completely fail.

The data is defined as follows. The data sets have dataSize = 8192 instances, with values uniformly in (0,1). The class of an instance is the xor function on subsequent groupings of classDim = 3 features: for half of the instances, the class is xor of features {0,1,2}, for another quarter – xor of features {3,4,5}, for another one-eighth – xor of {6,7,8}, etc. The xor is computed by multiplying the values involved, each minus 0.5, and returning 1 if the product > 0, otherwise 0. This data is undoubtedly artificial but most difficult. In applications where the feature sets could be incrementally discovered, e.g. {0,1} above random but {0,1,2} even better, we expect the effectiveness of the algorithm to be higher than reported.
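
The generator below reproduces this construction (a sketch; how the last, smallest portion is labeled is our assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    data_size, data_dim, class_dim = 8192, 30, 3
    n_groups = data_dim // class_dim

    X = rng.uniform(0.0, 1.0, size=(data_size, data_dim))
    y = np.zeros(data_size, dtype=int)

    start = 0
    for g in range(n_groups):
        # each group labels half of the still-unlabeled instances;
        # the final group takes whatever remains
        end = data_size if g == n_groups - 1 else start + (data_size - start) // 2
        block = X[start:end, g * class_dim:(g + 1) * class_dim] - 0.5
        y[start:end] = (np.prod(block, axis=1) > 0).astype(int)
        start = end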

The tables below have been generated for the default settings: GA population size 5000, allowed number of fitness evaluations 100,000; only for exploring the data dimension was the number of evaluations increased to 250,000. Unless otherwise indicated, data dimension is 30 and noise is 0. Note that since the feature groupings are defined on diminishing parts of the data, the rest effectively acts as noise. Data and class noise indicate the percentage of randomly assigned data, respectively class, values after the class had been computed on the non-corrupted data. The table results represent the percentages of cases, out of 10 runs, when the sets {0,1,2} etc. were found.

Data noise      0.1   0.15  0.2   0.25  0.35
{0,1,2} found   100   100   100   100   20
{3,4,5} found   100   80    0     0     0

Class noise     0.05  0.1   0.15  0.2   0.25
{0,1,2} found   100   100   100   100   100
{3,4,5} found   100   100   60    0     0

Data dim        30    60    90    120   150
{0,1,2} found   100   100   100   100   60
{3,4,5} found   100   100   100   100   60
{6,7,8} found   80    40    40    0     0

Conclusion

The approach presented invokes a number of interesting possibilities for data mining applications. First, the algorithm detects informative feature groupings even if they contribute only partially to the class definition and the noise is strong. In more descriptive data mining, where it is not only important to obtain good predictive models but also to present them in a readable form, the discovery that a feature group contributes to predictability with certain accuracy is of value.

Second, the global discretization stage can be easily adjusted to a particular objective. If it is prediction accuracy by another type of ensemble, or if only 10 features are to be involved, it can be expressed via the GA-fitness function for the global discretization.


Appendix A

Feasibility Study on Short-Term Stock Prediction


Feasibility Study on Short-Term Stock Prediction

Stefan Zemke
Department of Computer and System Sciences

Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

1997

Abstract This paper presents an experimental system predicting a stock exchange index direction of change with up to 76 per cent accuracy. The period concerned varies from 1 to 30 days. The method combines probabilistic and pattern-based approaches into one, highly robust system. It first classifies the past of the time series involved into binary patterns and then analyzes the recent data pattern and probabilistically assigns a prediction based on the similarity to past patterns.

Introduction

The objective of the work was to test if short-term prediction of a stock index is at all possible using simple methods and a limited dataset (Deboeck, 1994; Weigend & Gershenfeld, 1994). Several approaches were tried, both with respect to the data format and algorithms. Details of the successful setting follow.

Experimental Settings

The tests have been performed on 750 daily index quotes of the Polish stock exchange, with the training data reaching another 400 sessions back. Index changes obtained a binary characterization: 1 – for strictly positive changes, 0 – otherwise. Index prediction – the binary function of change between the current and future value – was attempted for periods of 1, 3, 5, 10, 20 and 30 days ahead. Pattern learning took place up to the most recent index value available (before the prediction period). Benchmark strategies are presented to account for biases present in the data. A description of the strategies used follows.

Always up assumes that the index always goes up.

Trend following assumes the same sign of index change to continue.

Patterns+trend following. Patterns of 7 subsequent binary index changes are assigned a probability of correctly predicting a positive index change. The process is carried out independently for a number of non-overlapping, adjacent epochs (currently 2 epochs, 200 quotes each). The patterns which consistently – with probability 50% or higher – predict the same sign of change in all epochs are retained and subsequently used for predicting the index change should the pattern occur. Since only some patterns pass the filtering process, in cases when no pattern is available, the outcome is the same as in the Trend following method.
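
A sketch of the pattern filter for the 1-day-ahead case (the epoch bounds and the exact consistency test are our reading of the description; requiring a pattern to occur in every epoch could be added):

    from collections import Counter

    def consistent_patterns(changes, pattern_len=7,
                            epochs=((0, 200), (200, 400))):
        # keep a 7-bit pattern only if every epoch agrees on the
        # majority sign of the change that follows it
        votes = {}
        for lo, hi in epochs:
            tallies = {}
            for t in range(lo, hi - pattern_len):
                pattern = tuple(changes[t:t + pattern_len])
                following = changes[t + pattern_len]
                tallies.setdefault(pattern, Counter())[following] += 1
            for pattern, counts in tallies.items():
                votes.setdefault(pattern, set()).add(max(counts, key=counts.get))
        return {p: signs.pop() for p, signs in votes.items() if len(signs) == 1}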

Results

The table presents the proportions of agreement between the actual index changes and those predicted by the strategies.

Prediction/quotes ahead   1     3     5     10    20    30

Always up                 0.48  0.51  0.51  0.52  0.56  0.60
Trend following           0.60  0.56  0.53  0.52  0.51  0.50
Patterns + trend          0.76  0.76  0.75  0.74  0.73  0.67

In all periods considered, patterns maintained predictability at levels considerably higher than the benchmark methods. Despite this relatively simple characterization of index behavior, patterns correctly predict the index move in 3 out of 4 cases up to 20 sessions ahead. The Trend following strategy diminishes from 60% accuracy to a random strategy at around 10 sessions, and Always up gains strength at 20 quotes ahead, in accordance with the general index appreciation.


Conclusions

The experiments show that a short-term index prediction is indeed possible (Haughen, 1997). However, as a complex, non-linear system, the stock exchange requires a careful approach (Peters, 1991; Trippi, 1995). In earlier experiments, when pattern learning took place only in epochs preceding the test period, or when the epochs extended too far back, the resulting patterns were of little use. This could be caused by shifting regimes (Asbrink, 1997) in the dynamic process underlying the index values.

However, with only a short history relevant, the scope for inferring any useful patterns, and so for prediction, is limited. A solution to this could be provided by a hill-climbing method, such as genetic algorithms (Michalewicz, 1992; Bauer, 1994), searching the space of (epoch-size * number-of-epochs * pattern-complexity) so as to maximize the predictive power. Other ways of increasing predictability include incorporating other data series and increasing the flexibility of the pattern-building process, which now only incorporates a simple probability measure and logical conjunction.

Other interesting possibilities follow from even a short analysis of the successful binary patterns: many of them point to the existence of short-period 'waves' in the index. This could be further explored, e.g., by Fourier or wavelet analysis.

Finally, I only mention trials with the symbolic ILP system Progol, employed to find a logical expression generalizing positive index change patterns (up to 10 binary digits long). The system failed to find any hypothesis in a number of different settings and a rather exhaustive search (more than 20h of computation on a SPARC 5 for the longer cases). I view the outcome as a result of the system's strong insistence on generating (only) compressed hypotheses, and of problems in dealing with partially inconsistent/noisy data.


Appendix B

Amalgamation of Genetic Selection and Boosting
Poster GECCO-99, US, 1999


Amalgamation of Genetic Selection and Boosting

Stefan Zemke
Department of Computer and System Sciences

Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden

Email: [email protected]

Published: poster at GECCO-99, 1999

This synopsis comes from research on financial time series prediction (Zemke, 1998). Initially, 4 methods – ANN, kNN, Bayesian Classifier and GP – were compared for accuracy, and the best, kNN, was scrutinized by GA-optimizing its various parameters. However, the resulting predictors were often unstable. This led to the use of bagging (Breiman, 1996) – a majority voting scheme provably reducing variance. The improvement came at no computational cost – instead of taking the best evolved kNN classifier (as defined by its parameters), all above a threshold voted on the class.

Next, a method similar to bagging, but acclaimed better, was tried: AdaBoost (Freund & Schapire, 1996), which works by creating a (weighted) ensemble of classifiers – each trained on an updated distribution of examples, with those misclassified by the previous ensemble getting more weight. A population of classifiers was GA-optimized for minimal error on the training distribution. Once the best individual exceeded a threshold, it joined the ensemble. After the distribution, thus fitness, update, the GA proceeded with the same classifier population, effectively implementing data-classifier co-evolution. However, as the distribution drifted from the (initial) uniform, GA convergence became problematic. The following averts this by re-building the GA population from the training set after each distribution update. A classifier consists of a list of prototypes, one per class, and a binary vector selecting the active features for 1-NN determination.

The algorithm extends an initially empty ensemble of 1-NN classifiers of the above form.


1. Split the training examples into an evaluation set (15%) and a training set (85%).

2. Build the GA classifier population by selecting prototypes for each class, copying examples from the training set according to their probability distribution. Each classifier also includes a random binary active-feature vector.

3. Evolve the GA population until the criterion for the best classifier is met.

4. Add the classifier to the ensemble list; perform ensemble Reduce-Error Pruning with Backfitting (Margineantu & Dietterich, 1997b) to maximize its accuracy on the evaluation set. Check the ensemble enlargement end criterion.

5. If not at an end, update the training set distribution according to AdaBoost (see the sketch below) and go to 2.
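
For step 5, a minimal sketch of the AdaBoost-style distribution update (a standard formulation, not code from the poster):

    import numpy as np

    def adaboost_update(weights, misclassified):
        # down-weight correctly classified examples by beta = eps/(1 - eps)
        # and renormalize; equivalently, misclassified examples gain weight
        eps = float(np.clip(weights[misclassified].sum(), 1e-9, 1 - 1e-9))
        beta = eps / (1.0 - eps)
        updated = np.where(misclassified, weights, weights * beta)
        return updated / updated.sum()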

The operators used in the GA search include: Mutation – changing a single bit in the feature-select vector, or randomly changing an active feature value in one of the classifier's prototypes. Crossover, given 2 classifiers, involves swapping either the feature-select vectors or the prototypes for one class. Classifier fitness (to be minimized) is measured as its error on the training set, i.e., as the sum of the probabilities of the examples it misclassifies. The end criterion for classifier evolution is that at least half of the GA population has below-random error. The end criterion for ensemble enlargement is that its accuracy on the evaluation set is not growing. The algorithm draws from several methods to boost performance:

• AdaBoost

• Pruning of ensembles

• Feature selection/small prototype set to destabilize individual classifiers (Zheng et al., 1998)

• GA-like selection and evolving of prototypes

• Redundancy in prototype vectors (Ohno, 1970) – only selected features influence the 1-NN distance, but all are subject to evolution


Experiments indicate the robustness of the approach – an acceptable classifier is usually found in an early generation, so the ensemble grows rapidly. Accuracies on the (difficult) financial data are fairly stable and, on average, above those obtained by the methods from the initial study, but below their peaks. Bagging the ensembles so obtained has also been attempted, further reducing variance but only minimally increasing accuracy.

Foreseen work includes pushing the accuracy further. Trials involving the UCI repository are planned for wider comparisons. Refinement of the algorithms will include the genetic operators (perhaps leading to many prototypes per class) and the end criteria. The intention is to promote rapid finding of (not perfect but) above-random and diverse classifiers contributing to an accurate ensemble.

In summary, the expected outcome of this research is a robust general-purpose system distinguished by generating a small set of prototypes which, nevertheless, in ensemble exhibit high accuracy and stable results.


Bibliography

Alex, F. (2002). Data mining and knowledge discovery with evolutionary algorithms. Natural Computing Series. Springer.

Ali, K. M., & Pazzani, M. J. (1995). On the link between error correlation and error reduction in decision tree ensembles (Technical Report ICS-TR-95-38). Dept. of Information and Computer Science, UCI, USA.

Allen, F., & Karjalainen, R. (1993). Using genetic algorithms to find technical trading rules (Technical Report). The Rodney L. White Center for Financial Research, The Wharton School, University of Pennsylvania.

Asbrink, S. (1997). Nonlinearities and regime shifts in financial time series. Stockholm School of Economics.

Asker, L., & Maclin, R. (1997). Feature engineering and classifier selection: A case study in Venusian volcano detection. Proc. 14th International Conference on Machine Learning (pp. 3–11). Morgan Kaufmann.

Aurell, E., & Zyczkowski, K. (1996). Option pricing and partial hedging: Theory of Polish options. Applied Math. Finance.

Back, A., & Weigend, A. (1998). A first application of independent component analysis to extracting structure from stock returns. Int. J. on Neural Systems, 8(4), 473–484.

Bak, P. (1997). How nature works: The science of self-organized criticality. Oxford University Press.

Bauer, E., & Kohavi, R. (1998). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. To be published.

Bauer, R. (1994). Genetic algorithms and investment strategies: An alternative approach to neural networks and chaos theory. New York: Wiley.

Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge and Information Systems, 3, 491–512.

Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton Univ. Press.

Bensusan, H., & Kalousis, A. (2001). Estimating the predictive accuracy of a classifier (Technical Report). Department of Computer Science, University of Bristol, UK.

Bera, A. K., & Higgins, M. (1993). ARCH models: Properties, estimation and testing. Journal of Economic Surveys, 7, 307–366.

Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245–271.

Bollerslev, T. (1986). Generalised autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327.

Bostrom, H., & Asker, L. (1999). Combining divide-and-conquer and separate-and-conquer for efficient and effective rule induction. Proceedings of the Ninth International Workshop on Inductive Logic Programming. Springer.

Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis, forecasting and control. Prentice Hall.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

Brodley, C. E., & Friedl, M. A. (1996). Identifying and eliminating mislabeled training instances. AAAI/IAAI, Vol. 1 (pp. 799–805).

Campbell, J. Y., Lo, A., & MacKinlay, A. (1997). The econometrics of financial markets. Princeton University Press.

Chen, K., & Chi, H. (1998). A method of combining multiple probabilistic classifiers through soft competition on different feature sets. Neurocomputing, 20, 227–252.

Cheng, W., Wagner, L., & Lin, C.-H. (1996). Forecasting the 30-year U.S. treasury bond with a system of neural networks.

Cizeau, P., Liu, Y., Meyer, M., Peng, C.-K., & Stanley, H. (1997). Volatility distribution in the S&P500 stock index. Physica A, 245.

Cont, R. (1999). Statistical properties of financial time series (Technical Report). Ecole Polytechnique, F-91128, Palaiseau, France.

Conversano, C., & Cappelli, C. (2000). Incremental multiple imputation of missing data through ensemble of classifiers (Technical Report). Department of Mathematics and Statistics, University of Naples Federico II, Italy.

Dacorogna, M. (1993). The main ingredients of simple trading models for use in genetic algorithm optimization (Technical Report). Olsen & Associates.

Dacorogna, M., Gencay, R., Muller, U., Olsen, R., & Pictet, O. (2001). An introduction to high-frequency finance. Academic Press.

Deboeck, G. (1994). Trading on the edge. Wiley.

Dietterich, T. (1996). Statistical tests for comparing supervised learning algorithms (Technical Report). Oregon State University, Corvallis, OR.

Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, ?, 1–22.

Dietterich, T., & Bakiri, G. (1991). Error-correcting output codes: A general method of improving multiclass inductive learning programs. Proceedings of the Ninth National Conference on AI (pp. 572–577).

Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple Classifier Systems (pp. 1–15).

Domingos, P. (1997). Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 155–158).

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall.

Embrechts, M., et al. (2001). Bagging neural network sensitivity analysis for feature reduction in QSAR problems. Proceedings INNS-IEEE International Joint Conference on Neural Networks (pp. 2478–2482).

Fama, E. (1965). The behavior of stock market prices. Journal of Business, January, 34–105.

Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proc. of the International Joint Conference on Artificial Intelligence (pp. 1022–1027). Morgan Kaufmann.

Feder, M., Merhav, N., & Gutman, M. (1992). Universal prediction of individual sequences. IEEE Trans. Information Theory, IT-38, 1258–1270.

Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of online learning and an application to boosting. Proceedings of the Second European Conference on Machine Learning (pp. 23–37). Springer-Verlag.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference.

Galitz, L. (1995). Financial engineering: Tools and techniques to manage financial risk. Pitman.

Gershenfeld, N., & Weigend, S. (1993). The future of time series: Learning and understanding. Addison-Wesley.

Gonzalez, C. A., & Diez, J. J. R. (2000). Time series classification by boosting interval based literals. Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, 11, 2–11.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. Springer.

Haughen, R. (1997). Modern investment theory. Prentice Hall.

Hawawini, G., & Keim, D. (1995). On the predictability of common stock returns: World-wide evidence, chapter 17. North Holland.

Heiler, S. (1999). A survey on nonparametric time series analysis.

Hellstrom, T., & Holmstrom, K. (1998). Predicting the stock market (Technical Report). Univ. of Umeå, Sweden.

Ho, T. K. (2001). Data complexity analysis for classifier combination. Lecture Notes in Computer Science, 2096, 53–.

Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computation, 9, 1–42.

Jiang, W. (2001). Some theoretical aspects of boosting in the presence of noisy data. Proc. of ICML-2001.

Judd, K., & Small, M. (2000). Towards long-term prediction. Physica D, 136, 31–44.

Kantz, H., & Schreiber, T. (1999a). Nonlinear time series analysis. Cambridge Univ. Press.

Kantz, H., & Schreiber, T. (1999b). Nonlinear time series analysis. Cambridge Univ. Press.

Kingdon, J. (1997). Intelligent systems and financial forecasting. Springer.

Kohavi, R., & Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. Proc. of Second Int. Conf. on Knowledge Discovery and Data Mining (pp. 114–119).

Kovalerchuk, B., & Vityaev, E. (2000). Data mining in finance: Advances in relational and hybrid methods. Kluwer Academic.

Kutsurelis, J. (1998). Forecasting financial markets using neural networks: An analysis of methods and accuracy.

Kwedlo, W., & Kretowski, M. (1999). An evolutionary algorithm using multivariate discretization for decision rule induction. Principles of Data Mining and Knowledge Discovery (pp. 392–397).

Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming: Techniques and applications. Ellis Horwood.

LeBaron, B. (1993). Nonlinear diagnostics and simple trading rules for high-frequency foreign exchange rates. In A. Weigend and N. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past, 457–474. Reading, MA: Addison Wesley.

LeBaron, B. (1994). Chaos and forecastability in economics and finance. Phil. Trans. Roy. Soc., 348, 397–404.

LeBaron, B., & Weigend, A. (1994). Evaluating neural network predictors by bootstrapping. Proc. of Int. Conf. on Neural Information Processing.

Lefevre, E. (1994). Reminiscences of a stock operator. John Wiley & Sons.

Lequeux, P. (Ed.). (1998). The financial markets tick by tick. Wiley.

Lerche, H. (1997). Prediction and complexity of financial data (Technical Report). Dept. of Mathematical Stochastic, Freiburg Univ.

Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 393–423.

Liu, H., & Setiono, R. (1997). Feature selection via discretization (Technical Report). Dept. of Information Systems and Computer Science, Singapore.

Liu, Y., & Yao, X. (1998). Negatively correlated neural networks for classification.

Malkiel, B. (1996). A random walk down Wall Street. Norton.

Mandelbrot, B. (1963). The variation of certain speculative prices. Journal of Business, 36, 392–417.

Mandelbrot, B. (1997). Fractals and scaling in finance: Discontinuity and concentration. Springer.

Mantegna, R., & Stanley, E. (2000). An introduction to econophysics: Correlations and complexity in finance. Cambridge Univ. Press.

Margineantu, D., & Dietterich, T. (1997a). Pruning adaptive boosting (Technical Report). Oregon State University.

Margineantu, D., & Dietterich, T. (1997b). Pruning adaptive boosting (Technical Report). Oregon State University.

Michalewicz, Z. (1992). Genetic algorithms + data structures = evolution programs. Springer.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Molgedey, L., & Ebeling, W. (2000). Local order, entropy and predictability of financial time series (Technical Report). Institute of Physics, Humboldt-University Berlin, Germany.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13, 245–286.

Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. Proceedings of the 1st Conference on Algorithmic Learning Theory (pp. 368–381). Ohmsma, Tokyo, Japan.

Muller, K.-R., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Using support vector machines for time series prediction.

Murphy, J. (1999). Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Prentice Hall.

Naftaly, U., Intrator, N., & Horn, D. (1997). Optimal ensemble averaging of neural networks. Network, 8, 283–296.

Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag.

Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 169–198.

Ott, E. (1994). Coping with chaos. Wiley.

Oza, N. C., & Tumer, K. (2001). Dimensionality reduction through classifier ensembles. Instance Selection: A Special Issue of the Data Mining and Knowledge Discovery Journal.

Peters, E. (1991). Chaos and order in the capital markets. Wiley.

Peters, E. (1994). Fractal market analysis. John Wiley & Sons.

Quinlan, R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Raftery, A. (1995). Bayesian model selection in social research, 111–196. Blackwells, Oxford, UK.

Refenes, A. (Ed.). (1995). Neural networks in the capital markets. Wiley.

Ricci, F., & Aha, D. (1998). Error-correcting output codes for local learners. Proceedings of the 10th European Conference on Machine Learning.

Ratsch, G., Scholkopf, B., Smola, A., Muller, K.-R., Onoda, T., & Mika, S. (2000). Nu-Arc: Ensemble learning in the presence of outliers.

Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1, 317–327.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. Proc. 14th International Conference on Machine Learning (pp. 322–330). Morgan Kaufmann.

Shannon, C., & Weaver, W. (1949). The mathematical theory of communication. Urbana, Illinois: University of Illinois Press.

Skurichina, M., & Duin, R. P. (2001). Bagging and the random subspace method for redundant feature spaces. Second International Workshop, MCS 2001.

Sollich, P., & Krogh, A. (1996). Learning with ensembles: How overfitting can be useful. Advances in Neural Information Processing Systems (pp. 190–196). The MIT Press.

Sullivan, R., Timmermann, A., & White, H. (1999). Data-snooping, technical trading rule performance and the bootstrap. J. of Finance.

Sutcliffe, C. (1997). Stock index futures: Theories and international evidence. International Thompson Business Press.

Swingler, K. (1994). Financial prediction, some pointers, pitfalls and common errors (Technical Report). Centre for Cognitive and Computational Neuroscience, Stirling Univ., UK.

Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence, 898.

Tay, F., & Shen, L. (2002). A modified chi2 algorithm for discretization. Knowledge and Data Engineering, 14, 666–670.

Trippi, R. (1995). Chaos and nonlinear dynamics in the financial markets. Irwin.

Tsay, R. (2002). Analysis of financial time series. Wiley.

Tumer, K., Bollacker, K., & Ghosh, J. (1998). A mutual information based ensemble method to estimate Bayes error.

Tumer, K., & Ghosh, J. (1996). Estimating the Bayes error rate through classifier combining. International Conference on Pattern Recognition (pp. 695–699).

Vandewalle, N., Ausloos, M., & Boveroux, P. (1997). Detrended fluctuation analysis of the foreign exchange markets. Proc. Econophysics Workshop, Budapest.

Walczak, S. (2001). An empirical analysis of data requirements for financial forecasting with neural networks.

Webb, G. (1998). Multiboosting: A technique for combining boosting and wagging (Technical Report). School of Computing and Mathematics, Deakin University, Australia.

Weigend, A., & Gershenfeld, N. (1994). Time series prediction: Forecasting the future and understanding the past. Addison-Wesley.

Witten, I., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann.

WSE (1995 onwards). Daily quotes. http://yogi.ippt.gov.pl/pub/WGPW/wyniki/.

Zemke, S. (1998). Nonlinear index prediction. Physica A, 269, 177–183.

Zemke, S. (1999a). Amalgamation of genetic selection and bagging. GECCO-99 Poster, www.genetic-algorithm.org/GECCO1999/phd-www.html (p. 2).

Zemke, S. (1999b). Bagging imperfect predictors. ANNIE'99. ASME Press.

Zemke, S. (1999c). ILP via GA for time series prediction (Technical Report). Dept. of Computer and System Sciences, KTH, Sweden.

Zemke, S. (2000). Rapid fine-tuning of computationally intensive classifiers. Proceedings of AISTA, Australia.

Zemke, S. (2002a). Ensembles in practice: Prediction, estimation, multi-feature and noisy data. Proceedings of HIS-2002, Chile, Dec. 2002 (p. 10).

Zemke, S. (2002b). On developing a financial prediction system: Pitfalls and possibilities. Proceedings of DMLL-2002 Workshop at ICML-2002, Sydney, Australia.

Zemke, S., & Rams, M. (2003). Multivariate feature coupling and discretization. Proceedings of FEA-2003.

Zheng, Z., Webb, G., & Ting, K. (1998). Integrating boosting and stochastic attribute selection committees for further improving the performance of decision tree learning (Technical Report). School of Computing and Mathematics, Deakin University, Geelong, Australia.

Zirilli, J. (1997). Financial prediction using neural networks. International Thompson Computer Press.