
Classification and Clustering of Stocks, using Genetic Algorithms and Fundamental Analysis

David Bugalho de Moura

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisors: Prof. Rui Fuentecilla Maia Ferreira Neves
Prof. Nuno Cavaco Gomes Horta

Examination Committee

Chairperson: Prof. Horacio Claudio de Campos Neto
Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves

Members of the Committee: Prof. Joao Paulo Baptista de Carvalho

November 2016

Acknowledgments

I would like to thank my supervisor, Rui Neves, for all the advice, ideas and knowledge transmitted, especially in the financial field, throughout this last year. I would also like to thank my family for the support given at every moment of my life, especially when solutions to the problems encountered seemed so distant. Lastly, I would like to thank Clara Paiva for the motivation given to do this work, even when willpower failed me.

Abstract

Over the last two decades, the ease of access to information has grown exponentially, making it easier to analyze and use this data in every field of science, including computational finance. This work presents an architecture, built from scratch, of a trading system that classifies stocks using two techniques, in order to conclude which one is superior: one with user input parameters, and an unsupervised one that uses a genetic algorithm to optimize cluster positions, with a constant number of clusters. A genetic algorithm is also applied to optimize fundamental indicators to give buy and sell signals in each of the groups obtained, in order to conclude whether stocks in the same group behave in a similar fashion. Results have shown that the group with the best results obtained with user input parameters is superior to the group with the best returns obtained with the clustering algorithm. However, the clustering algorithm classified stocks better, showing increased performance over the user input method when used with the genetic algorithm using fundamental indicators. The proposed system was implemented from scratch, and contains optimization modules, processing modules and a trading simulator.

Keywords

Genetic Algorithms; Fundamental Analysis; Fundamental Indicators; Classification; Clustering; Stock

Market; S&P500


Resumo

Over the last two decades, access to information has grown exponentially, making it easier to use this data in every scientific field, including artificial intelligence applied to finance. This work presents a system that classifies company stocks using two techniques, in order to conclude which is better: a supervised one, with parameters given by the user, and an unsupervised one that uses a genetic algorithm to optimize clustering. A genetic algorithm is also applied to optimize fundamental indicators with the goal of giving buy and sell signals in each of the previously obtained groups, in order to conclude whether the stocks in a group behave similarly. The results show that the groups with the greatest internal similarity were obtained with the clustering algorithm. The buy-and-sell system based on fundamental indicators showed a significant improvement in the return of the best-performing group of the clustering algorithm.

Keywords

Genetic Algorithms; Fundamental Analysis; Fundamental Indicators; Classification; Clustering; Stock Market; S&P500

Contents

1 Introduction
1.1 Motivation and Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Document Structure

2 State of Art
2.1 Financial Indicators
2.1.1 Fundamental Indicators
2.1.2 Technical Indicators
2.2 Computational Intelligence Algorithms
2.2.1 Genetic Algorithms
2.2.2 Neural Networks
2.2.3 Fuzzy Systems
2.3 Clustering Algorithms
2.4 Portfolio Composition Problem
2.5 Investment Strategies
2.5.1 Value Investing
2.5.2 Growth Investing
2.5.3 Growth At Reasonable Price (GARP) Investing
2.5.4 Income Investing
2.6 Classification of Stocks
2.7 Data Set / Markets
2.8 Related Work Results
2.8.1 Evolutionary Computing
2.8.2 Other methods

3 Architecture
3.1 Module View of the Global Architecture
3.2 General System Dataflow
3.3 Modules
3.3.1 Download
3.3.2 Stock
3.3.3 Fundamental Analysis
3.3.4 Classifier
3.3.5 Clustering
3.3.6 GA
3.3.7 Investment Simulator
3.3.8 Portfolio
3.4 Implementation Details
3.4.1 Data
3.4.2 Data Processing
3.4.3 Configuration

4 System Validation
4.1 Validation Metrics
4.1.1 ROI
4.1.2 Drawdown
4.1.3 Sharpe Ratio
4.1.4 Success rate of trades
4.1.5 Average time in the market
4.1.6 Rate of positive quarters
4.2 Case Studies
4.2.1 Case Study I - User Input Classification
4.2.1.A Conclusion
4.2.2 Case Study II - Clustering
4.2.2.A Conclusion
4.2.3 Case Study III - Using GAs to optimize FI
4.2.3.A Conclusions

5 Conclusion
5.1 Conclusions
5.2 Achievements
5.3 System limitations
5.4 Future Work

A List of Stocks used


List of Figures

2.1 Macroeconomic Indicators
2.2 Agilent Technologies Fundamentals
2.3 S&P500 index Technical Indicators
2.4 Crossover
2.5 Mutation
2.7 Dynamic Optimization Problems (DOP) Diversity Approaches
2.6 DOP Memory Approaches
2.8 DOP Multi-Population Approaches
2.9 Artificial Neural Network
3.1 Modules view of the architecture
3.2 Modules Data Flow
3.3 Stock’s Raw Data
3.4 Stock module
3.5 FA module
3.6 Types’ Quadrants
3.7 Classifier Structure
3.8 Fundamental Indicators’ Chromosome Representation
3.9 Clustering Chromosome Representation
3.10 Investment Simulator, Portfolio and Stock Interaction
4.1 User input classification portfolios
4.2 User input portfolios metrics table
4.3 Types B&H - Yearly Returns
4.4 Investment by Type table
4.5 Clustering classification portfolios
4.6 Clustering portfolios metrics table
4.7 Cluster’s portfolios yearly returns
4.8 Cluster’s representation in the whole plane - 2012Q4
4.9 Cluster’s representation in the whole plane - 2013Q4
4.10 Cluster’s representation in the plane - 2012Q4 Zoomed
4.11 Cluster’s representation in the plane - 2013Q4 Zoomed
4.12 Investment by Type table
4.13 GA optimization portfolios
4.14 FA GA portfolios metrics table
4.15 FA GA portfolios’ yearly returns


List of Tables

4.1 Average fitness per quarter of the clustering algorithm
4.2 Best, worst and median run of each portfolio


List of Algorithms

2.1 Simple Genetic Algorithm (GA)
2.2 Roulette Wheel Selection
2.3 Automatic Clustering Genetic Algorithm (ACGA)
3.1 Used GA


Acronyms

GA Genetic Algorithm

ML Machine Learning

AI Artificial Intelligence

CI Computational Intelligence

DM Data Mining

FA Fundamental Analysis

TA Technical Analysis

FI Fundamental Indicators

TI Technical Indicators

GDP Gross Domestic Product

CPI Consumer Price Index

PPI Producer Price Index

DR Debt Ratio

ROE Return On Equity

NI Net Income

PM Profit Margin

PER Price Earnings Ratio

RG Revenue Growth

CSO Common Stock Outstanding


EPS Earnings Per Share

NIG Net Income Growth

PR Payout Ratio

CE Capital Expenditures

CFOAG Cash From Operating Activities Growth

MA Moving Average

MSCI Morgan Stanley Capital International

OECD Organisation for Economic Co-operation and Development

DJI Dow Jones Industrial

NASDAQ National Association of Securities Dealers Automated Quotations

OBV On Balance Volume

VAMA Volume Adjusted Moving Average

EC Evolutionary Computation

EA Evolutionary Algorithms


ANN Artificial Neural Networks

AGA Adaptive Genetic Algorithm

DOP Dynamic Optimization Problems

SOS Self-Organizing Scouts

KDD Knowledge Discovery from Data

FCM Fuzzy C-means

ACGA Automatic Clustering Genetic Algorithm

MO Multi-Objective

SO Single-Objective

GARP Growth At Reasonable Price


MOEA Multi-Objective Evolutionary Algorithm

SPEA2 Strength Pareto Evolutionary Algorithm 2

R-SPEA2 Robust Strength Pareto Evolutionary Algorithm 2

MOGP Multi-Objective Genetic Programming

MCDM Multiple-Criteria Decision-Making

GP Genetic Programming

RST Rough Set Theory

MEPA Minimize the Entropy Principle Approach

CPDA Cumulative Probability Distribution Approach

FCMAC Fuzzy Cerebral Model Articulation Controller

RL Reinforcement Learning

SMA Simple Moving Average

RSI Relative Strength Index

MACD Moving Average Convergence Divergence


1 Introduction

Contents

1.1 Motivation and Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Document Structure

This section introduces the subject of computational techniques applied to companies’ past financial data as a way to find trading rules and patterns in the stock market, in order to achieve higher returns than the S&P500 index.

1.1 Motivation and Context

Since the recent advance of computational technologies (especially with the ease of information access since the early 2000s), several techniques have been used to try to extract trading rules from past stock market data [1]. The stock market is made up of humans, which gives it the same unpredictability that humans have, making it hard to find patterns in it. Nevertheless, with the increase in available information observed in the last couple of decades, this task seems more and more feasible. Though there is a property of randomness intrinsic to the stock market (associated with all the variables that are unknown to us), a large number of researchers (for examples, see the next chapter) have shown results that lead us to conclude that it is possible to obtain returns on the stock market using computational resources.

To predict future stock prices with Artificial Intelligence (AI) and Computational Intelligence (CI) systems, one needs to analyze data, applying data mining and machine learning algorithms to do so. Although Machine Learning (ML) started attracting attention in the 1980s, it only flourished in the 1990s, when AI shifted from rule-based methods¹ to data-driven methods² (with ML using approaches it had inherited from AI but shifting towards methods based on statistics and probability theory). Data analysis evolved in almost the same way, especially Data Mining (DM), which overlaps with ML in terms of the methods employed. They can be distinguished in one key aspect: ML focuses on prediction using known properties (learned in the training phase), while DM focuses on the discovery of unknown properties of data. Although these algorithms are several decades old, they were applied to finance only later, not because the technology was not ready, but because data was neither readily available nor affordable. In the late 1990s and early 2000s, the global spread of the Internet made access to information easier, cheaper and faster than at any other point in history, which was a big step in the way problems were approached and in the technologies used.

Several techniques are used to predict stock market quotes, the most popular being AI methods that optimize the parameters of financial indicators. There are two types of financial indicators: fundamental and technical indicators. Other methods include the evaluation of each stock of a certain type, where the type is defined by the stock’s sector or by any other rule. Since the amount of available data is increasing exponentially, probabilistic methods are gaining a lot of attention.

¹ Defines a sequence of steps to be taken, based on a Knowledge Base (which consists of facts and an inference engine).
² Describes data to be matched (pattern matching) and the processing in a more abstract way.


1.2 Problem Statement

The problem here is to construct a system that evaluates the stock market and accurately groups stocks that show similar behavior. It should also choose the companies that show higher returns in the market, and study which technique is better at doing so. The system should be able to perform in a similar way when trading in real time. The implemented system is built from scratch with the objective of being used by human traders. The architecture of the system should also be open to improvements. The main objective when solving this problem is to obtain investment strategies that achieve better results than the S&P500 index.

1.3 Proposed Solution

The implemented solution is a system made from scratch that classifies companies into groups based on their financial statements, in two ways: a supervised way with parameters defined by the user, and an unsupervised way that applies a genetic algorithm to optimize cluster positions, with a fixed number of clusters. After both classifications, the application uses a genetic algorithm once more to optimize indicators for buying stocks, and compares the results of the optimization on the whole pool of stocks with those on each classification, to conclude which classification groups stocks better. The proposed system was implemented in C++, with about 9000 lines of code, and it was made to be usable by third parties that wish to improve its features.
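As a rough illustration of the unsupervised path, the sketch below evolves centroid positions for a fixed number of clusters with a simple genetic algorithm. It is written in Python for brevity and is not the GA implemented in this work: the chromosome encoding (a list of k centroids), the sum-of-squared-distances fitness, tournament selection, uniform crossover, Gaussian mutation, elitism and every parameter value are assumptions made for the example, and the data points are made up.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def sse(points, centroids):
    """Fitness to minimise: sum of squared distances to the nearest centroid."""
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)


def ga_cluster(points, k, generations=40, pop_size=20, seed=0):
    """Evolve k centroid positions; all settings here are illustrative assumptions."""
    rng = random.Random(seed)
    # Chromosome: k centroids, each initialised as a copy of a random data point.
    population = [[list(rng.choice(points)) for _ in range(k)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at each generation
    for _ in range(generations):
        population.sort(key=lambda ch: sse(points, ch))
        history.append(sse(points, population[0]))
        # Elitism: the best chromosome survives unchanged.
        next_gen = [[c[:] for c in population[0]]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            a = min(rng.sample(population, 3), key=lambda ch: sse(points, ch))
            b = min(rng.sample(population, 3), key=lambda ch: sse(points, ch))
            # Uniform crossover at the centroid level.
            child = [(a[i] if rng.random() < 0.5 else b[i])[:] for i in range(k)]
            # Gaussian mutation of individual coordinates.
            for centroid in child:
                for d in range(len(centroid)):
                    if rng.random() < 0.1:
                        centroid[d] += rng.gauss(0.0, 0.05)
            next_gen.append(child)
        population = next_gen
    best = min(population, key=lambda ch: sse(points, ch))
    history.append(sse(points, best))
    labels = [nearest(p, best) for p in points]
    return best, labels, history


# Hypothetical two-indicator coordinates for six stocks (not real data).
points = [(0.10, 0.20), (0.15, 0.25), (0.12, 0.18),
          (0.80, 0.90), (0.85, 0.95), (0.82, 0.88)]
best, labels, history = ga_cluster(points, k=2)
```

Because the best chromosome is copied unchanged into the next generation, the best fitness recorded in `history` can never get worse from one generation to the next.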

1.4 Document Structure

This thesis is structured as follows:

• Chapter 2 presents the theoretical concepts required to develop this project, including the theory behind financial indicators, machine learning algorithms with a special focus on evolutionary computing, clustering algorithms and portfolio management. An overview of the results obtained by some related works is also given.

• Chapter 3 documents the proposed solution, with a detailed description of the architecture, methods and details of the application.

• Chapter 4 presents a validation of the system, showing the parameters used and a detailed study of the performance and robustness of the solution through case studies.

• Chapter 5 summarizes this work, presenting its achievements and limitations and proposing future improvements.


2 State of Art

Contents

2.1 Financial Indicators
2.2 Computational Intelligence Algorithms
2.3 Clustering Algorithms
2.4 Portfolio Composition Problem
2.5 Investment Strategies
2.6 Classification of Stocks
2.7 Data Set / Markets
2.8 Related Work Results

2.1 Financial Indicators

Financial indicators analyze statistics and present that information in the form of ratios, which support managers in decisions about future stock prices. These indicators give better results if used alongside the right strategy [1].

There are several types of financial indicators, divided into Fundamental Indicators (FI) and Technical Indicators (TI), which come from Fundamental Analysis (FA) and Technical Analysis (TA), respectively.

2.1.1 Fundamental Indicators

FA is used to assess the intrinsic value of a market, an industry or a company; FA applied to companies is the most common. FA uses financial statements to determine whether a company is undervalued or overvalued. FI are constructed using FA, and this type of indicator does not take market trends into account, only the intrinsic value, in other words the real raw value of the company. It ignores people’s feelings about the company as well as stock quotes and trends, and considers only the state of the company’s finances [2]. Through the analysis of the factors that reflect (or influence) a company’s productivity, profitability or competitive advantage, one can identify whether a stock is overvalued or undervalued.

FA has been made famous by value investors (see section 2.5.1) like Benjamin Graham and Warren Buffett (for their value investment strategies [3], [4]).

Fundamental analysis can generate trading rules that determine which stocks show signals of being

a good investment (by being financially stable), and which stocks show signs of not being financially

stable [5].

FA collects data from the financial statements of several years and analyzes the financial evolution of a company. This helps managers predict the growth of a company. Other applications of FA include the evaluation of data that is external to the company, for example Gross Domestic Product (GDP) or currency value, to evaluate the potential that a market may or may not have. There are three main types of FA [2]: Macroeconomic (analysis of macroeconomic factors such as GDP growth to study the effect of the macroeconomic environment on the future profit of a company), Industry Analysis (analysis of the status and prospects of an industry, to estimate the value of a company within it) and Company Analysis (analysis of the operational status of a company to evaluate its internal value, usually through its financial reports).

• Macroeconomic

This type of analysis uses macroeconomic indicators to make assumptions about the type of market we are investing in [6]. Examples are economic growth and price stability in the economy; price stability¹ can be measured as the rate of change in inflation [7]. There are several macroeconomic indicators (some can be found in: http://www.rbcpa.com/economic fundamentals.pdf). The Consumer Price Index (CPI)² is one such indicator. CPI measures changes in consumer prices and theoretically determines to what extent life is getting more expensive for the average consumer. Another important indicator that also measures inflation is the Producer Price Index (PPI), which measures the rate of change in the prices received by domestic producers for their output. When these prices increase substantially, it is likely that companies eventually pass the burden of the price increase on to consumers.

GDP is also an important indicator, and one of the most used, because it represents the total output of a given economy. The direction in which GDP is evolving (up/down) may represent an expansion/contraction of the economy. When GDP is stable or declining, most companies will not be able to increase their profits; however, if GDP growth is too high it may mean trouble, because it usually comes with a growth in inflation and may come with other negative side effects. See figure 2.1 to see how these indicators evolved in the United States from 2006 until 2016.

(a) United States GDP growth rate (b) United States Inflation rate (c) United States Producer Price Index

Figure 2.1: Macroeconomic indicators in the United States of America: GDP growth rate, Inflation rate and PPI, from 2006 until 2016 (adapted from http://www.tradingeconomics.com/)

In figure 2.1 one can see that the inflation graph looks like the PPI one, and GDP growth follows a similar fashion.

¹ http://www.eestipank.ee/en/monetary-policy/importance-price-stability
² http://www.investopedia.com/terms/c/consumerpriceindex.asp

• Industry

The analysis of the fundamental value of an industry or sector (number of possible clients, volume of transactions in that industry, etc.) is used as an indicator to check whether the target market is good for investment [2].

The possible number of customers in an industry can indicate what kind of market we are targeting. Usually, markets that rely on a small number of clients for a big part of their revenues are not good markets to invest in, since the loss of one of those clients may cause a major loss in revenues (for example, if a military supplier makes 100% of its sales to the government, a change in defense policy may cause the company to go bankrupt).

Industry growth, just like macroeconomic growth, is another indicator of whether a market is good or bad for investment. Before looking for companies with certain requirements, one can check the growth potential of an industry to see whether a target market is promising. If a market has a stable or declining number of clients, it will be harder for a company to grow in that market, since it will need to take market share from other companies.

• Company

Using information given in financial statements, one can calculate fundamental ratios used to compare companies and decide which ones are the best to invest in. See figure 2.2 for the evolution of the fundamentals of Agilent Technologies. Some company FI are described as follows (these appear in [8], [9], [10], [11], for example):

– Debt Ratio (DR)

DR is a ratio used to measure the level of debt of a company. Companies with a higher debt ratio have a larger amount of debt relative to their assets, leaving them more vulnerable to an adverse economy, a reduction in their profits or an increase in the interest on their debt. A high DR often means that a company is in a highly competitive market, with a constant need for research and development, usually funded by external financing.

$$\mathrm{DR} = \frac{\text{Total Debt}}{\text{Total Assets}} \qquad (2.1)$$

– Return On Equity (ROE)

ROE measures the company’s Net Income (NI) relative to its equity, that is, how much profit is generated per unit of equity. A high ROE is obtained through operational efficiency, efficient use of assets and financial leverage. This ratio allows one to select companies that maximize the return on the investment made in them, since the higher this ratio is, the higher the return on the money invested in the stock.

$$\mathrm{ROE} = \frac{\mathrm{NI}}{\text{Total Equity}} \qquad (2.2)$$

– Profit Margin (PM)

PM measures how much of the revenue remains as profit, that is, the margin the company keeps after paying all operating, administrative and financial costs, along with taxes. A high PM is usually a good sign. Large swings in this ratio, however, are a warning signal: a sharp increase may indicate decapitalization, and a sharp decrease may indicate that revenue is not being converted into profit.

$$\mathrm{PM} = \frac{\mathrm{NI}}{\text{Revenue}} \qquad (2.3)$$

– Price Earnings Ratio (PER)

PER is a ratio that compares a company’s share price with its per-share earnings. It is the inverse of the earnings yield (per-share earnings as a percentage of the share price). It is usually used to look for undervalued companies. When the PER is going up, it is usually because investors are expecting higher growth in the future. However, this indicator has to be compared with the PER of stocks in the same sector; used without such a comparison, it may be misleading.

$$\mathrm{PER} = \frac{\text{Share Price}}{\mathrm{EPS}} \qquad (2.4)$$

– Revenue Growth (RG)

RG is an indicator that shows the evolution of the business. It increases for two main reasons: either the company is gaining market share from competitors, or the company is in a growing market and is growing with it. It reflects only the growth of a company’s revenue, not of its profits.

$$\mathrm{RG} = \frac{\text{Revenue}_{\text{Current}} - \text{Revenue}_{\text{LastYear}}}{\text{Revenue}_{\text{LastYear}}} \qquad (2.5)$$

– Common Stock Outstanding (CSO)

CSO is an indicator of the ownership of the company held by shareholders. When a company issues shares there is share dilution, and when a company reduces its outstanding shares there is an increase in the Earnings Per Share (EPS) (since the same earnings are spread over fewer shares) and a decrease in the PER. This is a good indicator for finding companies that have repurchased their shares (reduced the outstanding shares).

$$\Delta \mathrm{CSO} = \frac{\mathrm{CSO}_{\text{Current}} - \mathrm{CSO}_{\text{LastYear}}}{\mathrm{CSO}_{\text{LastYear}}} \qquad (2.6)$$

– Net Income Growth (NIG)

NIG indicates the trend of a company’s profits, and it is used to check that a good result obtained in a certain year is not just a consequence of the economic climate or of financial engineering. This indicator can be used to search for undervalued stocks, when the stock price does not follow the same behavior as the net income trend.

$$\Delta \mathrm{NI} = \frac{\mathrm{NI}_{\text{Current}} - \mathrm{NI}_{\text{LastYear}}}{\mathrm{NI}_{\text{LastYear}}} \qquad (2.7)$$

– Payout Ratio (PR)

PR indicates the percentage of net income distributed to investors as dividends, computed as dividends per share (DPS) over EPS. A high PR indicates a stable company that does not need much investment to keep its business running, but that is also in a stable market where stock performance will be lower than in fast-growing markets, since the part of earnings not paid to investors is used to invest and create future earnings growth. Investors seeking high income with limited earnings growth choose high-PR stocks, and investors seeking capital growth choose lower-PR stocks.

PR = \frac{DPS}{EPS} \qquad (2.8)

– Capital Expenditures (CE)

CE, when increasing with a greater momentum than NI, is an indicator that the company is probably inserted in a competitive market. This indicator is compared to NI to avoid companies that show this type of behavior (that is, companies in competitive markets).

\Delta CE = \frac{CE_{Current} - CE_{LastYear}}{CE_{LastYear}} \qquad (2.9)

– Cash From Operating Activities Growth (CFOAG)

CFOAG is a measure of how well the company generates money through its operations (its ability to turn the operating income reported on paper in the income statement into cash actually received). It accounts for the cash flow that comes into the company, because the company may have a high operating net income but be inefficient in collecting its profits as cash.

CFOAG = \Delta CFOA = \frac{CFOA_{Current} - CFOA_{LastYear}}{CFOA_{LastYear}} \qquad (2.10)
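The indicators above share two simple patterns: a ratio of two quantities (PER, PR) and a relative year-over-year change (RG, ∆CSO, ∆NI, ∆CE, CFOAG). A minimal sketch of both patterns is given below. The function names and the figures are illustrative assumptions, not taken from this work's implementation.

```python
def growth(current: float, last_year: float) -> float:
    """Relative year-over-year change: (current - last_year) / last_year.

    This is the common form of equations 2.5, 2.6, 2.7, 2.9 and 2.10.
    Note: the result is hard to interpret when last_year is negative.
    """
    if last_year == 0:
        raise ValueError("last-year value must be non-zero")
    return (current - last_year) / last_year


def price_earnings_ratio(share_price: float, eps: float) -> float:
    """PER = SharePrice / EPS (equation 2.4)."""
    return share_price / eps


def payout_ratio(dps: float, eps: float) -> float:
    """PR = DPS / EPS (equation 2.8)."""
    return dps / eps


# Made-up figures, for illustration only.
revenue_growth = growth(current=120.0, last_year=100.0)  # revenue grew by 20%
cso_change = growth(current=95.0, last_year=100.0)       # negative: shares were repurchased
per = price_earnings_ratio(share_price=50.0, eps=2.5)
pr = payout_ratio(dps=1.0, eps=2.5)
```

A negative `cso_change` corresponds to the share repurchase case described for CSO.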

Figure 2.2: Financials of Agilent Technologies Inc from 2011 until 2015 (adapted from https://www.google.com/finance): (a) Balance Sheet, (b) Cash Flow Statement, (c) Income Statement. The balance sheet holds information about total debt, total assets and the DR. The cash flow statement holds information about cash from operating activities, cash from investing activities and cash from financing activities. The income statement shows the revenue, the net income, the profit margin, the operating income and the operating margin.

2.1.2 Technical Indicators

TA [12] and FA use different approaches towards investment, since TA uses the movement of stock prices [13] and the volume of transactions [14] as the main information to predict stock markets. TIs look for patterns in past data and use those patterns to forecast market tendencies (see figure 2.3 for how some TIs follow the trends of the S&P500 index). TA generates trading rules by analyzing previous patterns of technical indicators [2]. TIs can be grouped into eight main groups [15], five of which are described as follows (the other three are flow of funds, sentiment and raw data):

• Trend

Trend analysis is a price-based indicator used to track the price trends of stocks (or other assets).

Figure 2.3: S&P500 index and 3 TI - SMA, RSI and MACD (adapted from https://www.google.com/finance)

Strategies that use this indicator assume that political and economic events usually change market prices through a change in market trends instead of returning to the most rational point. The most common trend indicators are Moving Averages (MAs) (see for example [16]).

• Momentum

Momentum analysis is also a price-based indicator but used to evaluate the velocity of price

change, and evaluate if a trend reversal is about to happen.

• Volatility

Volatility analysis investigates fluctuations of price ranges in stocks. It can be used to evaluate risk and identify the levels of support and resistance. Stock prices are usually recognized to fluctuate between the level of support (lower level) and the level of resistance (higher level), but continue to fall or rise if they break through that level. Volatility indicators include Average True Range and Bollinger Bands, among others. Volatility can also be used to predict Macroeconomic Indicators.

According to [17], volatility is a good indicator of GDP growth, since GDP growth shrinks after spikes in volatility. Markets also react to volatility, whether or not in a crisis context, and regardless of the market context (bull or bear). An increase in volatility is usually associated with an increase in inflation and in the unemployment rate, and during recessions, on average, volatility rises and interest rates drop. When a random shock in volatility occurs, GDP reacts to it but reverts to its mean quickly afterwards (within 1 or 2 quarters). However, if volatility is created by economic policy uncertainty, the reversion to the mean can take a lot longer (especially if the policy shock is unexpected).

One way of measuring volatility, according to [17], is by the quarterly and monthly variance of the

average daily Morgan Stanley Capital International (MSCI) country stock market index.

In an attempt to proxy monetary policies, one can control for short term interest rates, with a given lag (to proxy implementation and effectiveness). Also, one can check the overall tax level of a country through the ratio between Tax Revenue and Real GDP. Industry production will decrease with an increase in tax rates.

Volatility affects growth much more than the other way around. Three possible measures of

Macroeconomic uncertainty are the Leading indicator index from Organisation for Economic Co-

operation and Development (OECD) (contains various macroeconomic indicators, one of which is

industrial production index), the Oil Price Volatility and economic policy volatility [17].

[18] shows that permanent shocks (being shock defined as a volatility measure) explain the bulk

of the variation of stock prices over short periods. The author also says that three big American

indexes (Dow Jones Industrial (DJI), National Association of Securities Dealers Automated Quota-

tions (NASDAQ) and S&P500) share a common trend and a common cycle relationship, therefore

shocks will affect all markets similarly.

• Volume

Volume based indicators reflect the amount of investment from buyers/sellers, which can also

predict stock price movements. Volume indicators include Volume change rate, On Balance Vol-

ume (OBV), among others.

In [14] a Volume Adjusted Moving Average (VAMA) is used. It is based on equivolume charting,

a technique that analyses stock prices in relation with the amount of volume traded. In this type of

charting the stock price goes to the vertical axis, and the volume traded goes to the horizontal axis.

Short and wide boxes tend to occur at turning points (stock price is having difficulties moving), and

tall, narrow boxes usually occur at stable markets (stock price is moving easily).

• Cycle

Cycle analysis is a type of indicator that assumes periodic variation in stock prices. Long cycles

can take years and include several smaller cycles. Strategies that use this indicator analyze the

position of the stock price in the cycle.

[19] tries to find a correlation in the amount of business between countries, and the impact of

shocks in business cycles and GDP.

Dow’s theory [20] (one of the origins of the trend analysis) assumes there are three types of trends

in the stock market:

– Primary trend: Long term movement of prices (from a year to three years)

– Secondary trend: Short term deviations of prices from the underlying trend. It can be seen

as a correction from the primary trend (from three weeks to three months).

– Tertiary trend: A corrective movement from the secondary trend (less than three weeks).

A cycle is defined as an up trend, down trend and up trend again [21], taking only one of the Dow’s

theory trends into consideration. Longer cycles are constituted by several smaller ones.
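To make the trend, momentum and volatility groups more concrete, the sketch below computes one simple indicator of each kind from a list of closing prices. It is only an illustration: the function names, the window sizes and the prices are assumptions, and the variance of daily returns is a generic volatility proxy, not the MSCI-based measure of [17].

```python
from statistics import pvariance


def simple_moving_average(prices, window):
    """Trend: mean of the last `window` prices, one value per day once enough data exists."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def rate_of_change(prices, lag):
    """Momentum: relative price change over `lag` days."""
    return [
        (prices[i] - prices[i - lag]) / prices[i - lag]
        for i in range(lag, len(prices))
    ]


def return_variance(prices):
    """Volatility proxy: population variance of the day-to-day returns."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return pvariance(returns)


# Made-up closing prices, for illustration only.
closes = [10.0, 11.0, 12.0, 11.0, 13.0]
sma3 = simple_moving_average(closes, window=3)
roc2 = rate_of_change(closes, lag=2)
vol = return_variance(closes)
```

A trading rule would typically compare such series against each other or against thresholds, for example buying when the price crosses above its moving average.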

2.2 Computational Intelligence Algorithms

CI combines methods and tools to solve problems that normally would require human intelligence.

There are several known CI algorithms: artificial neural networks, fuzzy logic systems, evolutionary

algorithms, among many others. In all of them the success on solving a problem depends mostly on

how that problem is represented by the algorithm.

When it comes to algorithmic implementation in computational finance, a popular approach is Evolutionary

Computation (EC) to optimize rule discovery, because the population based system used by EC greatly

increases the number of searches in the solution search space (by doing parallel search), thus reducing

computational time.

EC is a subfield of AI that will receive the focus of this work. Evolutionary Algorithms (EAs) are algorithms that optimize or learn tasks with the ability to evolve. EAs have three main characteristics: they are population-based (the algorithm maintains a set of solutions to search the solution space in a parallel way), fitness-oriented (the algorithm has a fitness function which measures the success of a solution, and this is the main aspect that guarantees convergence) and variation-driven (solutions undergo several variation operations, to cover more of the search space and to avoid local maxima) [22].

Most CI problems can be seen as a mapping of a domain space into a solution space, and usually the number of possible solutions is so large that it becomes impossible to search all of them. EAs are stochastic methods that use heuristics to find solutions, which means they do not guarantee the best solution, but offer a significant reduction in cost and time [23].

The EAs used in this work are GAs, the most widely used kind of EA. They can be used either as an optimization algorithm or to study adaptive systems. GAs simulate natural selection, where better solutions are more prone to reproduce than worse solutions, each solution (individual) has a limited life span, there is variation in the population and the ability to survive is positively correlated with the ability to reproduce [24].

Apart from EC techniques, there are other popular approaches, such as Artificial Neural Networks

(ANN) and Fuzzy Systems.

2.2.1 Genetic Algorithms

GAs are the type of EA that receives the focus of this work. GAs are based on the theory of evolution developed by Darwin, simulating the evolution of a species in a certain environment. A GA starts with a population of individuals (chromosomes), each of which encodes a solution. As happens in the evolutionary process of species, these individuals reproduce in order to create offspring solutions better than the parent solutions.

GAs have proved to be a useful optimization and search algorithm. Many problems in AI can be defined as a search in a solution space (called search space) which contains every possible solution. GAs search this space by comparing solutions and looking for the best one.

This heuristic allows the search of several solutions in parallel, converging to better ones. This convergence is measured by a fitness function. Fitter solutions are privileged when selecting solutions to “reproduce”, attracting the whole population of solutions to somewhere near them in the search space [22].

Usual implementations of GA individuals are arrays or trees of values (as in [25]), where each value

codifies a parameter to be optimized.

A set of genetic operators has to be defined for the GA. The way these genetic operators are implemented determines the success of the algorithm. In a simple GA, the algorithm has to take four steps on each iteration (generation) [26]:

• Selection After evaluating the fitness of each individual of the population, the first step is to select individuals to reproduce. This selection is done randomly, taking into account the relative fitness of individuals, such that the best solutions are more likely to be chosen.

• Reproduction In this step, offspring are created from the selected individuals. For this, both recombination and mutation of values can be used.

• Evaluation The fitness of the new population is reevaluated.

• Replacement In the last step, recently created individuals replace individuals from the old popu-

lation.

The algorithm repeats until a stopping condition is reached: a maximum number of generations, no change in the best fitted individuals of the population for a predetermined number of generations, or a specified elapsed time.

A simple GA pseudocode is given in algorithm 2.1.

Algorithm 2.1: Simple GA
    t ← 0;
    P(t) ← random;
    Evaluation P(t);
    while not Endcondition do
        Pp(t) ← Selection of parents from P(t);
        Pc(t) ← Crossover from Pp(t);
        Pm(t) ← Mutation of Pc(t);
        Evaluation Pm(t);
        P(t + 1) ← New Generation Creation from (P(t), Pm(t));
        t ← t + 1;
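The loop of Algorithm 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: chromosomes are arrays of bits, parents are chosen by fitness-proportional (roulette wheel) selection, crossover is single point, mutation flips bits, and the new generation keeps the best `elite` individuals. All names and default values are hypothetical, and fitness values are assumed to be non-negative.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = rng.uniform(0.0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= r:
            return individual
    return population[-1]  # guard against floating-point round-off


def single_point_crossover(a, b, rng):
    """Child takes the head of `a` and the tail of `b` (needs at least 2 genes)."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]


def simple_ga(fitness, n_genes=20, pop_size=30, generations=50,
              mutation_rate=0.02, elite=1, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):  # stopping condition: maximum number of generations
        fitnesses = [fitness(ind) for ind in population]
        ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
        next_population = [ind for _, ind in ranked[:elite]]  # elitism
        while len(next_population) < pop_size:
            p1 = roulette_wheel_select(population, fitnesses, rng)
            p2 = roulette_wheel_select(population, fitnesses, rng)
            child = mutate(single_point_crossover(p1, p2, rng), mutation_rate, rng)
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy problem: maximize the number of ones in the chromosome.
    print(simple_ga(fitness=sum))
```

The toy fitness function (`sum`) only serves to exercise the loop; in a trading context the fitness would instead measure, for example, the return obtained by the rules encoded in the chromosome.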

The basic operators of a GA are defined as follows:

• Selection

Selection is made by evaluating and ranking individuals, using the fitness function of the GA [27].

Selection has several ways of being implemented. This work will focus on two implementations:

Roulette Wheel Selection and Ranking Selection. These are described as follows:

– Roulette Wheel Selection The principle here is that of a linear search in a roulette wheel, where the slots in the wheel are weighted in proportion to each individual’s fitness value. To implement the roulette wheel, first the total expected value of the individuals in the population is obtained (see equation 2.11), and afterwards the algorithm can run (see algorithm 2.2).

T = \sum_{i=1}^{N} Fitness_i \qquad (2.11)

then:

Algorithm 2.2: Roulette Wheel Selection
    i = 0;
    while i != N do
        choose random number r ∈ ]0, T];
        j = 0;
        Fit Sum = 0;
        while Fit Sum < r do
            Fit Sum = Fit Sum + Fitness_j;
            ++j;
        ++i;

– Ranking Selection In Ranking Selection individuals receive their fitness according to their ranking. This results in slower convergence, but avoids premature convergence and possibly getting trapped in a local maximum. A suggestion to do this is to select two individuals at random; the one with the best ranking becomes a parent. Then, this process is repeated to find the other parent.

Figure 2.4: Single and double point crossover. (a) Single Point Crossover (b) Multi-point Crossover

Figure 2.5: Boolean valued and integer valued mutation. (a) Boolean Valued Mutation (b) Integer Valued Mutation. The integer valued mutation is not associated with any probabilistic distribution in this image; it is purely figurative.

• Crossover

Crossover simulates the biological crossover, and mixes values of individuals (of the old popula-

tion) to generate an offspring (that will be an individual in the new population). The Crossover can

be made in a single point (2 segments are exchanged between individuals), or in multiple points

(more than 2 segments exchanged) as shown in figure 2.4.

• Mutation

Mutation is an operator used to increase diversity in solutions (covering more of the search space and avoiding getting stuck at a local maximum). Mutation perturbs a value in the chromosome by adding noise with a certain probability distribution (a popular choice is Gaussian noise) in real valued chromosomes, by randomizing that value, by interchanging values or by flipping boolean values, as shown in figure 2.5.

• Other paradigms

In [25] a tree structure representation of a portfolio is used as the GA individual, where the GA has to fulfil some additional rules: for example, each branch of the tree must be a portfolio by itself, each node represents the weight of that branch, and the leaves are the stocks. In this representation operations in the GA are handled differently.

There are also other operators or techniques that can be applied to the GA in order to improve its results, such as elitism, which propagates a percentage of the best individuals in a population into the next generation.

Adaptive Genetic Algorithms (AGAs) are used for Dynamic Optimization Problems (DOPs) (problems where variables change over time). This kind of problem needs a solution that tracks the moving optima over time. To achieve this, one has to make some enhancements to the GA such that it adapts to the new optima over time. An AGA can be a GA whose parameters (such as population size, mutation or crossover probability) change while the GA is running. According to [28], a DOP is characterized as:

F = f(\vec{x}, \vec{\psi}, t) \qquad (2.12)

where \vec{x} are the decision variables, \vec{\psi} the parameters and t is time. The challenge here is to track the moving solution without having to restart the algorithm. There are 5 main approaches to this:

– Memory: store useful information

– Diversity: handle convergence

– Multi-Population: co-operate between sub-populations

– Adaptive: adapt generators and parameters

– Prediction: forecast changes and take action

Details about these techniques are given as follows:

– Memory Approaches

Memory approaches are particularly useful for cyclic DOPs. This approach can be divided into implicit memory approaches and explicit memory approaches. Implicit Memory approaches use redundant information. A way of implementing this in GAs is by using a pair of chromosomes on each individual (Diploid GA) that encode the genotype of the individual, and a dominance scheme that maps the genotype to the phenotype. Explicit Memory approaches use extra memory to store useful information about the population. The best solutions are saved in memory, such that when a change occurs the memory solutions will be used to track the new optima. If Direct Memory is used, only good solutions are stored in memory, and if Associative Memory is used, good solutions and environmental information (context) are stored. In this case, when a memory update occurs, a new pair (AD) (with \vec{D} being the environmental information) replaces another, and solutions are generated by sampling \vec{D}_M. An example is given in figure 2.6.

Figure 2.6: DOP Memory Approaches

Figure 2.7: Diversity Approaches. (a) Random Immigrants (b) Memory-based Immigrants

– Diversity Approaches

Diversity Approaches will use diversity of individuals to cover more of the search space in

order to have a faster convergence when a change occurs. A way of achieving this is the

Random Immigrants approach (see figure 2.7). This approach inserts random individuals

each generation to maintain diversity, such that when a change occurs the random individu-

als will attract the population to the new optimum. A second approach is using Memory-based

Immigrants, where some points in the search space are stored into memory and re-evaluated

each generation. On each generation the best memory point is chosen and the immigrants

are generated by mutating this point with a certain probability, and then the population re-

places the worst individuals with these solutions.

– Multi-Population Approaches

Multi-Population approaches use several co-operating sub-populations to explore the search space at the same time. One approach to this is the Shifting Balance, where a core population explores the area of the present optimum while several colonies (sub-populations) explore the rest of the search space. Whenever a change in the optimum occurs, the fittest individuals of the colony searching the space of the current optimum will migrate to the core population, attracting the core population to this search space. Another approach is the Self-Organizing Scouts (SOS), where a core population explores the promising search space and is split into child populations under certain conditions. Each child population explores limited promising areas and is also split under certain conditions (see figure 2.8).

Figure 2.8: DOP Multi-Population Approaches. (a) Shifting Balance (b) Self-organizing Scouts

– Adaptive Approaches

Adaptive approaches change the operators/parameters of the GA, usually after a change, to

pressure the population to dramatic changes for a certain period. Hyper-mutation, Hyper-

selection and Hyper-learning are 3 operators used to achieve this (augmenting mutation rate,

selection pressure and learning rate temporarily).

– Predictive Approaches

Prediction approaches analyze patterns in the DOP to forecast the next optimum, when the

next change will occur and which environment may appear. Kalman Filters and forecasting

are two examples of techniques used.
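Two of the approaches above are simple enough to sketch directly: Random Immigrants (diversity) and Hyper-mutation (adaptive). The code below is a hedged illustration only. The choice of replacing the worst individuals with the immigrants, the function names and the parameter values are assumptions made here, and are not prescribed by [28].

```python
import random


def random_immigrants(population, fitnesses, new_individual, fraction, rng):
    """Keep the fittest individuals and replace the worst `fraction` of the
    population with freshly generated random individuals (the immigrants)."""
    n_replace = int(len(population) * fraction)
    order = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    survivors = [population[i] for i in order[: len(population) - n_replace]]
    immigrants = [new_individual(rng) for _ in range(n_replace)]
    return survivors + immigrants


def current_mutation_rate(base_rate, hyper_rate, generations_since_change, hyper_period):
    """Hyper-mutation: use a raised mutation rate for `hyper_period`
    generations after a change in the environment has been detected."""
    if generations_since_change < hyper_period:
        return hyper_rate
    return base_rate


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[0, 0], [0, 1], [1, 0], [1, 1]]
    fits = [sum(ind) for ind in pop]
    make_random = lambda r: [r.randint(0, 1) for _ in range(2)]
    print(random_immigrants(pop, fits, make_random, fraction=0.25, rng=rng))
    print(current_mutation_rate(0.01, 0.3, generations_since_change=2, hyper_period=5))
```

Both functions are meant to be called once per generation inside a GA loop; detecting that a change has occurred (for instance by re-evaluating stored individuals) is left out of the sketch.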

2.2.2 Neural Networks

ANNs have attracted the attention of researchers due to their predictive power and flexibility. An ANN is a biologically inspired computational model which consists of processing elements (neurons) and connections between them with coefficients (weights). These connection weights are the “memory” of the system [23]. This kind of system can be used for either supervised or unsupervised learning.

Usually, neurons are visualized as being arranged in layers, and typically neurons in the same layer behave in the same manner. The arrangement of neurons into layers and the connection patterns within and between layers is called the “net architecture”. Figure 2.9 shows an example of a feedforward network: a network in which the signals flow from the input units to the output units, in a forward direction [29].

ANN applied to computational finance are implemented in several ways (see [30], [31], [21], [32], [33]

or [34] for some examples).

Figure 2.9: Artificial Neural Network representation. w_{ij} is the weight given to the connection between nodes i and nodes j, and w'_{jk} is the weight of the connections between nodes j and k. These weights are chosen according to a mathematical function, that will decide which neurons (nodes) will be used as path for the inputs.
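The forward flow of signals in a network like the one in figure 2.9 can be sketched as follows. This is a minimal illustration under assumptions made here: one hidden layer, no bias terms and a sigmoid activation. The names `w` and `w_prime` mirror the weights w_{ij} and w'_{jk} of the figure, and the numeric values are made up.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights):
    """Propagate `inputs` through one layer.

    weights[i][j] is the weight of the connection from node i of the
    previous layer to node j of this layer.
    """
    n_out = len(weights[0])
    return [
        sigmoid(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]


def feedforward(x, w, w_prime):
    """Input layer -> hidden layer (weights w) -> output layer (weights w_prime)."""
    hidden = layer_forward(x, w)
    return layer_forward(hidden, w_prime)


# Two inputs, three hidden neurons, one output; made-up weights.
w = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]
w_prime = [[0.7], [-0.8], [0.9]]
output = feedforward([1.0, 2.0], w, w_prime)
```

Training (adjusting `w` and `w_prime` from data) is deliberately left out; the sketch only shows how an input is mapped to an output once the weights are fixed.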

2.2.3 Fuzzy Systems

In 1965, Lotfi Zadeh published a paper [35] formally developing multi-valued set theory, which later came to be known as fuzzy logic. In that paper, the author showed how the indicator function I_A of a non-fuzzy subset A of X, described in equation 2.13, could be extended to the multivalued indicator function \mu_A of a fuzzy subset of X, given by the membership function in equation 2.14.

I_A(x) = \begin{cases} 1 & x \in A \\ 0 & \text{otherwise} \end{cases} \qquad (2.13)

In equation 2.13, 0 represents non-membership and 1 represents membership.

\mu_A(x) : X \to [0, 1] \qquad (2.14)

In equation 2.14, \mu_A(x) is interpreted as the degree of membership of element x in fuzzy set A for each x ∈ X [36].

If the universe is discrete, a membership function can be defined by a finite set in the following way:

A = \sum \mu_i / u_i \qquad (2.15)

In equation 2.15 the symbol / separates the membership degrees \mu(u_i) from the elements of the universe u_i ∈ U [23].

Fuzzy Rules applied to computational finance usually create linguistic rules of the type IF this THEN that using technical indicators, which can be understood by a human trader. Fuzzy systems are usually used with ANNs; examples are given in [37] and [32].
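The difference between the crisp indicator function of equation 2.13 and a membership function as in equation 2.14 can be illustrated with a short sketch. The shape of the membership function, the RSI thresholds and the rule itself are hypothetical choices made for this example and do not come from the cited works.

```python
def crisp_indicator(x, subset):
    """Equation 2.13: 1 if x belongs to the (non-fuzzy) subset, 0 otherwise."""
    return 1 if x in subset else 0


def falling_membership(x, full, zero):
    """A simple membership function with values in [0, 1] (equation 2.14):
    1 up to `full`, decreasing linearly to 0 at `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def rsi_is_low(rsi):
    """Degree to which 'RSI is low' holds (thresholds are assumptions)."""
    return falling_membership(rsi, full=30.0, zero=50.0)


# Linguistic rule: IF RSI is low THEN buy. The strength of the buy signal
# is taken to be the degree of membership of the antecedent.
buy_strength = rsi_is_low(40.0)

# A discrete fuzzy set in the sense of equation 2.15, stored as
# {element u_i: membership degree mu_i}.
low_rsi_set = {rsi: rsi_is_low(rsi) for rsi in (20.0, 35.0, 45.0, 60.0)}
```

Where a crisp rule would switch abruptly at a single threshold, the fuzzy rule gives a graded signal between the two thresholds.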

2.3 Clustering Algorithms

Data mining is the process of exploring data from different perspectives to discover previously un-

known patterns, and develop a model used to understand phenomena from the data and summarizing

it into useful information [38].

This analysis makes it possible to obtain correlations and to learn new features of the data set. Although the term is relatively new, the technology is not, and it is used by large distribution companies (Walmart for example) to relate customers’ buying patterns, using this information to increase revenue.

DM is considered a process in Knowledge Discovery from Data (KDD), which consists of the iteration of the following steps [39]:

• Data cleaning Removes noise and inconsistent data

• Data integration Multiple data sources may be combined

• Data selection Relevant data for the analysis task is retrieved from the databases

• Data transformation Data is transformed or consolidated into appropriate forms for mining

• Data mining Intelligent methods are applied to extract patterns from data

• Pattern evaluation Identifies patterns that represent knowledge based on some interestingness measures

• Knowledge presentation Visualization and knowledge representation techniques are used to

present the mined knowledge to the user.

There are several DM algorithms; for an explanation of some of them see [40]. This work uses a clustering algorithm, a type of algorithm that can be used in data mining (although in this work it is used as a classification algorithm), so this section focuses on detailing this technique.

Clustering is a tool of data analysis which solves classification problems, applied when there is no class to be predicted, but instead, when instances can be divided into natural groups. Clustering itself is not an algorithm, but a task, with several algorithms that can be used to find a solution. The best algorithm to apply depends on the data and the desired results [41]. It is an unsupervised technique, since it does not use preclassified data. Instead the algorithm discovers similarities (in the requested attributes) between objects of the set, grouping them in the same cluster. The identified groups may be exclusive (an instance belongs to only one group), overlapping (an instance belongs to several groups), or probabilistic (an instance belongs to each group with a certain probability) [42].

The objective of exclusive clustering is to group a set into smaller subsets, such that the degree of association is strong between members of the same cluster and weak between members of different clusters.

There are many clustering algorithms (see [43]). Some clustering algorithms use metrics to measure intra-cluster or inter-cluster distance. In [44] a GA is used to optimize clustering, using the Calinski-Harabasz index as fitness function, obtaining better results than classical clustering methods such as K-means and Fuzzy C-means (FCM) (FCM is a Fuzzy Clustering method, a method where an object can belong to more than one cluster; for more information see [45] and [46]).

The algorithm from [44], called ACGA, uses cluster points as chromosomes for the GA, together with activation values for those points. It is described in algorithm 2.3.

Algorithm 2.3: ACGA
    t ← 0;
    P(t) ← random;
    A(t) ← random;
    while not Endcondition do
        Pp(t) and Ap(t) ← Selection of parents from P(t) and A(t);
        Pc(t) and Ac(t) ← Crossover from Pp(t) and Ap(t);
        Pm(t) and Am(t) ← Mutation of Pc(t) and Ac(t);
        Check for bigger A(t) values;
        Choose clusters corresponding to bigger A(t) values;
        Compute Calinski-Harabasz index to attribute fitness values;
        P(t + 1) ← New Generation Creation from (P(t), Pm(t)) and (A(t), Am(t));
        t ← t + 1;

In algorithm 2.3, P(t) is the population (each individual is an array of points in the space that can be chosen as a cluster center) and A(t) are the activation values (each individual is an array of values ∈ [0, 1]). Each individual in A(t) corresponds to an individual in P(t) (chromosomes are of the same size).
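The fitness evaluation of one ACGA individual can be sketched as below. This is an interpretation under assumptions, not the procedure of [44]: the k candidate points with the largest activation values are taken as cluster centers, each data point is assigned to its nearest center, and the Calinski-Harabasz index is computed from the centroids of the resulting clusters. All function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def active_centers(candidates, activations, k):
    """Keep the k candidate centers with the largest activation values."""
    order = sorted(range(len(candidates)), key=lambda i: activations[i], reverse=True)
    return [candidates[i] for i in order[:k]]


def assign(points, centers):
    """Assign each point to its nearest center; empty clusters are dropped."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        clusters[nearest].append(p)
    return [c for c in clusters if c]


def calinski_harabasz(clusters):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    overall = mean_point(points)
    between = within = 0.0
    for c in clusters:
        centroid = mean_point(c)
        between += len(c) * squared_distance(centroid, overall)
        within += sum(squared_distance(p, centroid) for p in c)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Made-up data: four points, three candidate centers, two active clusters.
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
candidates = [[0.0, 1.0], [10.0, 1.0], [5.0, 5.0]]
activations = [0.9, 0.8, 0.1]
centers = active_centers(candidates, activations, k=2)
fitness = calinski_harabasz(assign(data, centers))
```

In a GA, `fitness` would be the value attributed to the individual formed by `candidates` and `activations`.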

2.4 Portfolio Composition Problem

The original portfolio composition optimization problem, as formulated by Markowitz [47], is given in equations 2.16.

\max \; \text{expected return} = \sum_{i=1}^{M} \mu_i x_i \qquad (2.16a)

\min \; \text{expected risk} = \sum_{i=1}^{M} \sum_{j=1}^{M} \sigma_{ij} x_i x_j \qquad (2.16b)

\text{s.t.} \quad \sum_{i=1}^{M} x_i = 1 \qquad (2.16c)

In equations 2.16, \mu_i is the expected return of asset i, x_i is the portion of the investment allocated to asset i, and \sigma_{ij} is the covariance between assets i and j.
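The two objectives and the constraint of equations 2.16 can be evaluated for a candidate portfolio as in the sketch below. The function names and the numbers (two assets, their expected returns and their covariance matrix) are made up for illustration.

```python
def expected_return(weights, mean_returns):
    """Equation 2.16a: weighted sum of the expected returns of the assets."""
    return sum(x * m for x, m in zip(weights, mean_returns))


def expected_risk(weights, covariance):
    """Equation 2.16b: double sum of covariance[i][j] * x_i * x_j."""
    n = len(weights)
    return sum(
        covariance[i][j] * weights[i] * weights[j]
        for i in range(n)
        for j in range(n)
    )


def is_fully_invested(weights, tol=1e-9):
    """Equation 2.16c: the investment portions must add up to 1."""
    return abs(sum(weights) - 1.0) <= tol


# Made-up two-asset example.
weights = [0.5, 0.5]
mean_returns = [0.10, 0.06]
covariance = [[0.04, 0.01],
              [0.01, 0.02]]

portfolio_return = expected_return(weights, mean_returns)
portfolio_risk = expected_risk(weights, covariance)
```

For these figures the portfolio risk is lower than the average of the two individual variances, which is the diversification effect the formulation is meant to capture.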

There are two main approaches to this problem. One is Single-Objective (SO) optimization, and the

other Multi-Objective (MO) optimization.

In the case of SO, a single criterion is optimized, an optimum is either its maximum or its minimum, and a solution dominates another if it is above (for maximization) or below (for minimization) the other solution. In MO there are several criteria to be optimized, and the exact mutual influences between objectives can become complicated and are not always obvious. This approach uses Pareto optimality, which defines the frontier of solutions that can be reached by trading off conflicting objectives [48]. According to [49] and [50], a MO algorithm is said to be robust if solutions remain as close as possible to the Pareto Front, the rankings are the same in training and validation, solutions are diverse (uniformly distributed along the Pareto Front) and non-dominated solutions remain so in both training and test. In [50], there are also techniques used to improve robustness, for example Mating Restriction (restricting mating to occur only between dominated and non-dominated individuals).

Fitness can be described as proximity to the Pareto front, and solution’s diversity can be described as

the distribution of solutions in the Pareto Front. The MO approach has shown great results in optimizing

the portfolio composition problem and in ranking stocks [51]. Examples of works using MO are [52] [11]

[53].
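Pareto dominance for the two objectives of the portfolio problem (maximize return, minimize risk) can be sketched as follows. The candidate portfolios are made-up (expected return, risk) pairs, and the quadratic-time filter is chosen for clarity rather than efficiency.

```python
def dominates(a, b):
    """True if portfolio `a` dominates `b`.

    Each portfolio is an (expected_return, risk) pair: `a` must be no worse
    than `b` in both objectives and strictly better in at least one.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep only the solutions that no other solution dominates."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]


# Made-up candidates: (expected return, risk).
candidates = [(0.08, 0.02), (0.10, 0.04), (0.07, 0.03), (0.10, 0.05)]
front = pareto_front(candidates)
```

Here the first two candidates form the front: neither dominates the other, since one offers a higher return and the other a lower risk.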

2.5 Investment Strategies

Strategies vary a lot, ranging from Value Investing strategies to Growth Investing strategies, or even

a mixture of both. One can choose between several strategies when investing, some of which are

described below.

2.5.1 Value Investing

As referred above, FIs were largely used by investors such as Warren Buffett, who made value investing famous. He would look for companies with a high intrinsic value, and/or companies that had some sort of competitive advantage, and buy them, as said in [4]. This generated huge profits for these investors, because the market (that is self regulated) eventually realized the value of the stocks, and since the companies had a competitive advantage in the market they were inserted in, the stocks never had a big breakdown in recession times, and continued to grow further. This type of strategy uses FIs above all other indicators, since they give the best mechanism to evaluate the financial strength of a company, even if its quote is currently falling. FIs do not take into account any kind of trends, and so all one can assume (using only FIs) is that a company is good or bad, according to its financial statements (comparing the intrinsic value with the market value), and wait for the market to eventually realize the company’s value.

2.5.2 Growth Investing

TIs are mostly used for strategies that take into account tendencies, such as growth investing (this investment strategy is the most commonly used in computational finance). This means that instead of measuring the intrinsic value of an asset at given times, it measures how the market reacts to it, either by simply analyzing the stock price trend, or going into a more complex analysis of relating the trend of a stock price to its volatility. Studying past tendencies of certain TIs can help to predict future tendencies, using that information to buy, sell or hold a certain stock. Many times this is not done by analyzing only a single TI, but several of them, and by drawing conclusions on how a market will evolve given that information.

2.5.3 GARP Investing

GARP takes the best of value and growth investing, by looking for companies with a good intrinsic value and good growth prospects. One of the biggest supporters of this kind of investing is Peter Lynch (see [54]). He segmented the types of existing markets, and searched for cycles and other types of indicators (mainly, but not only, macroeconomic indicators) that indicate the type of market in which the investor is inserted. By doing so, he was able to adapt a better strategy to that kind of market, and use more relevant indicators for the type of stocks he was looking for. GARP investing avoids companies with huge growth, since those companies have a higher risk associated with them. It also avoids companies that have a good intrinsic value but do not grow.

2.5.4 Income Investing

Income investors prefer a fixed income to the risk of investing in stocks that show volatility. So, income investors prefer bonds to stocks, and in stock picking they pick stocks that have high DPS values, so they can have a fixed income from their dividends.

2.6 Classification of Stocks

Several investors classify stocks according to the company’s FA or TA ([55], [56], for example). This classification allows them to group stocks with similar features, making the study of the behavior of those stocks easier. If stocks are well classified and inserted in the right cluster, it will be easier to evaluate whether a company is going to grow or not by evaluating its behavior within the group. Having this information, the investor is able to create customized investment strategies for each group of stocks (a better adapted investment strategy, since behaviors will be similar). Lynch and Rothschild did this in their book [54]. Using FA they evaluated stocks according to their growth rate, capitalization and economic behavior, and classified companies into six major types, creating specific investment strategies for each of these types. The main characteristic of each type is explained as follows, according to the book:

• Slow Grower

These companies are usually large and aging companies, that are expected to grow only slightly

faster than the Gross National Product. Normally slow growers start out as fast growers and

eventually stop growing as fast, either because they have grown as much as they can, or because
the growth of the industry they belong to slows down. Every fast-growing industry eventually
slows down and becomes a slow-growing industry. Usually slow growers pay a generous and
regular dividend (because these companies usually cannot use that money to expand the business).

• Stalwart

These are usually multi-billion dollar companies, that are not exactly agile climbers but are faster

growers than slow growers. These companies have around 10 to 12 percent annual growth in

earnings. They can give a sizable profit if bought and sold at the right time, and are also a good
protection in times of recession, since they are so big that they will not go bankrupt, and soon enough
after the recession their value will be restored.

• Fast Grower

These are small, aggressive enterprises that grow at a rate of 20 to 25 percent a year. A fast

growing company does not necessarily have to belong to a fast-growing industry. All it needs is
room to expand in a slow-growing industry. Usually these upstart enterprises learn to succeed
in one place, and then replicate their winning formula over and over. These kinds of stocks are


usually risky, especially in younger companies that tend to be overzealous and underfinanced, and

underfinanced companies do not end up well during recession times. Also, Wall Street does not

look kindly on fast growers that run out of stamina and turn into slow growers. Once a fast grower

grows too big it faces the problem of having trouble growing further.

• Cyclicals

These are stocks whose sales and profits rise and fall in a regular if not completely predictable

fashion. In a cyclical industry business expands and contracts, then expands and contracts again.

When coming out of a recession and into a vigorous economy, the cyclicals flourish, and their stock

prices tend to rise much faster than stalwart prices. However, during a recession the cyclicals

suffer, and so do shareholders. Buying a cyclical in the wrong part of the cycle can make one lose

a lot of money, and so, timing is everything when buying a cyclical.

• Turnaround

These are companies that have no growth at all. Sometimes turnarounds are poorly managed

cyclicals that go so far down in a cycle that people think they will never come back up. Nevertheless,
turnarounds are companies that can make up lost ground quickly, and the best thing about

investing in successful turnarounds is that of all the categories of stocks they are the least related

to the general market. Failed turnarounds are dragged into bankruptcy, making it a very risky type

of stock, that varies between a major success and a major failure.

• Asset Plays

These are companies that own one or more valuable assets that Wall Street has overlooked (and
so has not valued the stock accordingly). These assets can be as simple as a pile of cash or the
subscribers of a cable TV provider, for example, but usually they are real estate assets. These are
companies whose assets may be worth more now (or may become more valuable) than the value given to
the company itself. When the market realizes this value (or when the assets grow in value), the
stock price grows accordingly.

2.7 Data Set / Markets

Most work in computational finance has been done with well-known and stable markets (from
strong European or American economies). This work also focuses on this kind of market
(specifically on the S&P 500 index), not only because most previous work used similar
markets, but also because other kinds of markets have different relationships between indicators and,
although easier to predict (because they are less efficient), are usually more unstable. According
to [57], funds located in the US that invest in emerging markets underperformed funds physically located


in the emerging markets (one of the reasons being that some of these markets have less stable financial
policies, and the market is strongly affected by policy changes). The author also shows that geographically
focused funds outperform the ones that invest globally. These factors made me choose a well-known and
well-studied market, with easy access to information.

2.8 Related Work Results

In this section, the results and data of some previous works are analyzed and compared
in order to identify the best solutions.

2.8.1 Evolutionary Computing

In [8] a hybrid approach to portfolio composition is used, combining fundamental and technical
indicators. The paper uses a Multi-Objective Evolutionary Algorithm (MOEA) with two objectives
(return and variance of returns), computes the Pareto front and tries to find solutions near it with the
technical indicators.
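
To make the notion of a Pareto front with these two objectives concrete, the sketch below filters a list of candidate (return, variance) pairs, keeping only the non-dominated ones. It is a generic illustration and not the algorithm of [8]; the function name `pareto_front` and the sample values are assumptions made for this example.

```python
def pareto_front(candidates):
    """Return the (ret, var) pairs that no other candidate dominates.

    One candidate dominates another when its return is at least as high
    and its variance at least as low, with one of the two strictly better.
    """
    def dominates(a, b):
        return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical portfolios as (expected return, variance of returns).
SAMPLE = [(0.10, 0.04), (0.08, 0.02), (0.07, 0.03), (0.12, 0.09), (0.10, 0.05)]
FRONT = pareto_front(SAMPLE)
```

In this sample, (0.07, 0.03) is discarded because (0.08, 0.02) has both a higher return and a lower variance, and (0.10, 0.05) is discarded because (0.10, 0.04) offers the same return with less variance.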

In [13] an approach to the optimization of technical rules using a GA is proposed. In this approach each
individual is an asset classifier equation that takes into account the values of the technical indicators
applied to the available price data.

In [49] a change to the Strength Pareto Evolutionary Algorithm 2 (SPEA2) is proposed
(resulting in the Robust Strength Pareto Evolutionary Algorithm 2 (R-SPEA2)) in order to make it more
robust. In [50] the robustness of Multi-Objective Genetic Programming (MOGP) is also studied, and it is
concluded that mating restriction is a promising technique: mating similar parents makes
solutions converge, while mating dissimilar parents promotes diversity.

In [25] a tree-structure representation of the GA for portfolio optimization is proposed, instead of the
more common array approach. In a tree-structure GA each terminal node holds an asset and each
non-terminal node holds the weight of its subtree, and each subtree is also considered a portfolio.

In [16] GAs are used to search for the optimal period lengths, adjustment frequencies
and adjustment volumes of moving averages, in order to predict changes in the price of crude oil for
investment in the crude oil futures market.

In [58] a Multiple-Criteria Decision-Making (MCDM) method is applied to portfolio optimization,
dividing the criteria of return and risk into several measurements.

[59] uses standard Genetic Programming (GP) optimization, with a function set comprising a common
set of arithmetic operators and a terminal set comprising a collection of technical indicators and constants.
The objective is to consistently outperform the B&H strategy, based on the work of [60].


[61] uses GAs to decide trading strategies in two stages: elimination of unacceptable
stocks and stock trading construction.

2.8.2 Other methods

In [62] a correlation matrix between the most significant indicators and future prices is applied,
alongside data discretization using a Cumulative Probability Distribution Approach (CPDA) and a
Minimize the Entropy Principle Approach (MEPA). Afterwards Rough Set Theory (RST) is used to obtain
linguistic rules, and a GA is used to refine them.

In [14] a mix of strategies is used, applying ANN, fuzzy logic and GA in an approach based on the VAMA
indicator. The system consists of three phases: the ANN system with VAMA as baseline,
the definition of fuzzy rules from the ANN outputs, and the refinement of those rules by a GA.

In [32] a Fuzzy Cerebral Model Articulation Controller (FCMAC) approach to Forex Exchange is

proposed. This approach divides data into sets and uses local learning (focus on useful local information

from observed data).

In [30] a Modular Neural Network is used, with a sliding window, the error back-propagation method and
supplementary learning as a way of retraining while avoiding overfitting.

In [21] an Adaptive Network Fuzzy Inference System is used, supplemented with Reinforcement
Learning (RL). This RL process uses reward/punishment feedback according to the state of the environment.
The author uses Momentum and MA indicators to discover cycles in the data, and invests using that
information.


| Work | Date | Approach | Financial Data | Application | Benchmark | Validation Period | Returns |
|---|---|---|---|---|---|---|---|
| [8] | 2015 | MOGA | FI & TI | Portfolio Composition | S&P 500 | 2013-2014 | 28.3% |
| [13] | 2011 | GA | TI | Portfolio Composition | DJI | 2003-2009 | Better than B&H and Random Walk |
| [49] & [50] | 2008 | R-SPEA2 | Raw Data | Portfolio Composition | FTSE 100 | May 2004 - December 2005 | Better results than the SPEA2 approach |
| [58] | 2004 | EC | Markowitz Model | Fund Management Composition | S&P Database | N/A | GA got better results than 2PLS, SA and TS |
| [16] | 2015 | GA | TI | Crude Oil Future Market | Crude Oil Market | 1987-2013 | Better than B&H |
| [62] | 2010 | MEPA & CPDA & RST & GA | TI | Portfolio Composition | TAIEX | 2000-2005 | Better results than B&H, GA or RST alone |
| [14] | 2009 | ANN & FL & GA | TI | Portfolio Composition | S&P 500 | 1997-2002 | Better results than B&H, NN, NF or GA alone |
| [32] | 2007 | RST & Fuzzy Rules | TI | Forex Exchange | USD vs other main currencies | 2004-2006 | 33.95% |
| [59] | 2009 | GP | TI | Portfolio Composition | S&P 500 | 1990-2002 | Every model outperformed B&H |
| [61] | 2014 | GA | TI | Portfolio Composition | TAIEX | 118 days (end date 14 June 2014) | Outperforms B&H |
| [30] | 1990 | ANN | Raw Data | Portfolio Composition | TOPIX | January 1987 - September 1989 | Outperforms B&H |
| [21] | 2011 | ANFIS | TI | Portfolio Composition | IBM & Walmart & Citigroup & Wyeth & General Motors | 24th August 1994 - 30th August 2006 | 240.32%, better than S&P 500 and DJI |


3 Architecture

Contents

3.1 Module View of the Global Architecture
3.2 General System Dataflow
3.3 Modules
3.4 Implementation Details


This chapter describes the system architecture. It first gives an overview of
the proposed solution, then a module-by-module description of the most important modules and, lastly,
an explanation of some other implementation details.

The architecture of the proposed solution was built from scratch in C++. The implementation has
22 C++ classes and about 9000 lines of code.

3.1 Module View of the Global Architecture

The overview of the module architecture of the proposed solution is presented in this section (see

figure 3.1).

There are nine main modules in the system, each one with a specific functionality. Their specific

function is described as follows:

• Download: This module fetches information about companies (the stocks' raw data) and
their financial statements (the financial statements are fetched with the algorithm from [63]) and stores that
information for later use by the Stock and Fundamental Analysis modules.

• Fundamental Analysis (FA): This is the data analysis module. It processes the information
in the company's financial statements, computes all the FIs needed and performs the
growth analysis of the variables used.

• Stock: The Stock module holds the stock's raw data (the quotes) and the
classification of the stock given by the Classifier and the Clustering modules in each
quarter.

• Classifier: The Classifier module is used by the Stock module. It reads thresholds from a configuration
file and uses them to classify the stock in a given quarter, according to the
information provided by the Fundamental Analysis module.

• Clustering: This module is also used by the Stock module. Its objective is to assign a cluster
to a stock in a given quarter, using the Clustering Genetic Algorithm module, according to the
information provided by the Fundamental Analysis module. It uses an unsupervised data mining
technique, inspired by the ACGA algorithm from [44], that uses a GA to optimize cluster positioning.
The number of clusters is fixed (given by the user as input) and the algorithm simply
optimizes where each one should be located. The number of clusters used in this work was 5, in order
to match the number of types created by the Classifier module.

• Genetic Algorithm (GA): This module contains two submodules: the Clustering Genetic Algorithm
submodule, used by the Clustering module to optimize cluster positions (with a fixed number


of clusters), and the Fundamental Analysis Genetic Algorithm submodule, used by the Investment
Simulator module to optimize the Fundamental Indicators' weights that give buy and sell signals.

• Investment Simulator: This module is responsible for the kind of portfolio that is created. It is
used by the Investor module, and uses both the Stock and the Fundamental Analysis Genetic Algorithm
modules. It creates the Portfolio module and is responsible for giving it the buy and sell signals,
acting as the bridge between the Investor module and the Portfolio module.

• Portfolio: The Portfolio module is used by the Investment Simulator module, and uses the Stock
module to retrieve quote information. It saves the state of the portfolio: the stocks
currently held, when they were bought, at which price they were bought, and
the current return of the portfolio. It is also responsible for getting the necessary stock information
from the Stock module to simulate buying and selling.

• Investor: This is the main module of the system. It receives user input, coordinates data flow and
calls the other modules as necessary. It is responsible for: using the Download module to fetch
information; creating the Fundamental Analysis and Stock modules (one of each per company) from
the information provided by the Download module; receiving user input and distributing it to the
modules that use it; and giving information to the Investment Simulator module for the portfolio
construction.


Figure 3.1: Modules view of the architecture - The UML schematic shows which modules are used by each module. The full arrows in the GA modules represent inheritance, broken-line arrows mean usage without instantiation, and full-line arrows mean association

3.2 General System Dataflow

The overview of the general dataflow of the proposed solution is presented in this section (see figure

3.2).


Figure 3.2: Modules Data Flow - User input and the system output are represented with a box, every module has its name and functionality, and the arrows represent data flow

A step by step description of the data flow of the algorithm is given as follows:

• The system starts by receiving system parameters as inputs. In this work this is done with a

configuration file for general configuration (GA parameters), a second file with parameters for the

Stock Classifier (with the thresholds used) and a third file with the list of stock tickers used in this

work.


• Next, the system creates the Download module to fetch stock information.
This includes downloading quotes from the Yahoo website (constructing the URL and using GNU's wget
to retrieve the CSV files with the stocks' information), and also downloading, sorting and rewriting the
financial statements (Balance Sheets, Income Statements and Cash Flow Statements) from the
Securities and Exchange Commission website¹ (using the algorithm from [63]).

This fetching from EDGAR has several steps, which are described in [63]. Since
this part was not developed during this work, it will not be detailed.

• Afterwards, the Investor module creates both the FA and the Stock module (for each company),

with the information fetched, and attributes the FA module to the Stock one, so the FA data from a

company can be accessed by the Stock module of that company.

• The Investor creates the Classifier module and gives it the user input with the thresholds;
each stock then uses the Classifier to get a classification per quarter. The Classifier receives the Stock's FA data,
iterates over it, and returns to the stock the classifications (size, growth, health and
type) for each quarter.

• The Investor then creates the Clustering module, giving it the configuration used by the Clustering
GA (number of clusters and GA parameters). The Clustering module receives the Stock's FA data, and
creates and uses the Clustering GA submodule to assign a cluster to each quarter.

• With all the information processed, the Investment Simulator module is created and given access
to the Stocks and to the GA parameters for the FA GA submodule. It creates the Portfolio modules
and tells them which stocks to buy and sell each quarter. Five portfolios
use exclusively the classification given to the stock, and another five use exclusively
the cluster of the stock. The number of clusters was fixed at 5,
since the classifier based on user input creates 5 types of stocks.

Another ten portfolios are created in the same manner as the ones explained above,
with the difference that, after checking each stock to see whether it is of the type the Investment Simulator
wants and creating a pool of stocks of a certain type/cluster, the Investment Simulator creates and uses the
FA GA submodule for that pool of stocks. One more portfolio is constructed that uses the FA GA to optimize
the indicators' weights over the whole dataset.

• The Portfolio module receives the buy and sell signals. It first buys the stocks: it accesses
them, and retrieves and stores their quotes (transaction costs are applied,
and are added in this step) together with the date on which each stock is bought (which, from the point of view of
the Portfolio, is the current date transmitted by the Investment Simulator).

¹ https://www.sec.gov/edgar/searchedgar/companysearch.html


• Afterwards the Portfolio module accesses the stocks it holds to retrieve the current
quote of each one, and updates the return of the portfolio. Lastly it sells the stocks indicated by
the sell signals of the Investment Simulator or, in the case of classification/cluster portfolios, the
stocks that are not indicated in the buy signal. In the last quarter of every portfolio, the Investment
Simulator tells the Portfolio module to sell all the stocks.
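
The buy-then-sell bookkeeping of a classification/cluster portfolio described in the last two steps can be sketched as follows. This is an illustrative Python sketch, not the C++ code of this work: the class and method names, the proportional `cost_rate` and the way the return of a closed position is computed are all assumptions made for the example, since the text does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Position:
    buy_date: str
    buy_price: float  # price paid per share, transaction cost included


@dataclass
class Portfolio:
    cost_rate: float = 0.001  # hypothetical proportional transaction cost
    positions: dict = field(default_factory=dict)
    closed_returns: list = field(default_factory=list)

    def buy(self, ticker, quote, date):
        """Open a position unless the stock is already held."""
        if ticker not in self.positions:
            paid = quote * (1 + self.cost_rate)
            self.positions[ticker] = Position(date, paid)

    def sell(self, ticker, quote):
        """Close a position and record its return net of costs."""
        position = self.positions.pop(ticker)
        proceeds = quote * (1 - self.cost_rate)
        self.closed_returns.append(proceeds / position.buy_price - 1)

    def rebalance(self, buy_signal, quotes, date):
        """Buy the signalled stocks, then sell held stocks not in the signal."""
        for ticker in buy_signal:
            self.buy(ticker, quotes[ticker], date)
        for ticker in [t for t in self.positions if t not in buy_signal]:
            self.sell(ticker, quotes[ticker])
```

Calling `rebalance` with an empty buy signal in the last quarter sells every remaining position, which mirrors the final step described above.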

3.3 Modules

There are nine main modules in the application, connected as
shown in figure 3.2. This section explains their functionality (leaving out the Investor module,
whose functionality was already explained in sections 3.1 and 3.2).

3.3.1 Download

The Download module has two main functionalities: downloading the stocks' raw data
and downloading the companies' financial information.

The companies' financial information is fetched with the algorithm from [63], as explained before.

The stocks' raw data is fetched using GNU's wget (for more information check https://www.gnu.org/software/wget/)
to download CSV files from the Yahoo Finance URL API for historical information. It first constructs the
URL to be used, which has the form:

MAIN?&TICK&SM&SD&SY&EM&ED&EY&P&IG

Where

• MAIN: http://ichart.finance.yahoo.com/table.csv - the main URL of the API

• TICK: s=TICKER - (TICKER is the stock ticker)

• SM: a=STARTMONTH - (STARTMONTH is the start month to be downloaded)

• SD: b=STARTDAY - (STARTDAY is the start day to be downloaded)

• SY: c=STARTYEAR - (STARTYEAR is the start year to be downloaded)

• EM: d=ENDMONTH - (ENDMONTH is the end month to be downloaded)

• ED: e=ENDDAY - (ENDDAY is the end day to be downloaded)

• EY: f=ENDYEAR - (ENDYEAR is the end year to be downloaded)


• P: g=PERIODICITY - (PERIODICITY is the periodicity of the data to be downloaded; it can be daily with "d", weekly
with "w" or monthly with "m")

• IG: ignore=.csv - (the parameter that ends the URL)

To fetch the information daily (this can be used to update the data used in the application, making it
available to use in real time) the URL used is:

http://finance.yahoo.com/d/quotes.csv?s=TICKS&f=FLAGS

Where

• TICKS: the tickers of the stocks wanted. Several tickers can be joined with the + sign.

• FLAGS: the flags that specify the information required from the API (for more information check
http://www.jarloo.com/yahoo_finance/).

This will give the necessary information to create the Stock raw data. The information is
downloaded in the format described in figure 3.3, and saved in the Stock module.
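
The construction of the two URLs described in this section can be sketched as follows. This is an illustrative Python sketch rather than the C++ code of this work: the function names `build_history_url` and `build_quotes_url` and the (month, day, year) tuple layout are assumptions made for the example, and only the endpoints and parameter letters come from the text. The functions only build strings and do not download anything.

```python
from urllib.parse import urlencode

# Endpoints quoted in the text.
HISTORY_URL = "http://ichart.finance.yahoo.com/table.csv"
QUOTES_URL = "http://finance.yahoo.com/d/quotes.csv"


def build_history_url(ticker, start, end, periodicity="d"):
    """Assemble the historical-quotes URL.

    `start` and `end` are (month, day, year) tuples whose values are passed
    through unchanged; `periodicity` is "d", "w" or "m".
    """
    if periodicity not in ("d", "w", "m"):
        raise ValueError("periodicity must be 'd', 'w' or 'm'")
    params = [
        ("s", ticker),
        ("a", start[0]), ("b", start[1]), ("c", start[2]),
        ("d", end[0]), ("e", end[1]), ("f", end[2]),
        ("g", periodicity),
        ("ignore", ".csv"),
    ]
    return HISTORY_URL + "?" + urlencode(params)


def build_quotes_url(tickers, flags):
    """Join the tickers with '+' and pass the flag string through unchanged."""
    return f"{QUOTES_URL}?s={'+'.join(tickers)}&f={flags}"
```

The resulting string is what would be handed to wget; keeping URL construction separate from the download makes it easy to check without network access.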

Figure 3.3: Structure of the stock’s raw data downloaded from the Yahoo API

3.3.2 Stock

The Stock module is where the company's data is saved. This includes the stock's raw data (quotes, adjusted
quotes, the highest and lowest quote of the day, the volume of transactions on that day and the
date of the quotes) and the Fundamental Analysis of the company.

The module also holds the classification assigned by the Classifier and the
cluster given by the Clustering module. We can think of this module as the database from which the information
needed to simulate investment is retrieved (see figure 3.4).

3.3.3 Fundamental Analysis

The Fundamental Analysis module is built from the public information about the companies. This
data is obtained from three spreadsheets (Balance Sheet, Income Statement and Cash Flow
Statement), each one with information organized by quarter. For a better understanding of the structure of
the module see figure 3.5.

For further detail on how data is processed see section 3.4.1.


Figure 3.4: Representation of the Stock module

After all the information is retrieved, this module uses this information to create several indicators

and performs evaluations on the growth of the variables obtained by the sheets. This growth analysis is

made annually (compares the same quarter in different years).

Figure 3.5: Representation of the FA module

This is also the module where Fundamental Indicators are computed to later be optimized by the

GA module and used by the Investment Simulator module to give buy and sell signals. Each indicator

is modified if needed for the objective to be a maximization of the indicator. The indicators used are


described as follows:

• Debt

Assuming that no company in the S&P 500 index has more total debt than total assets, the DR will
always satisfy $DR \in [0, 1]$; however, the objective would be to minimize this indicator. To turn this minimization
objective into a maximization one (as required so that every indicator is to be maximized),
instead of calculating the percentage of debt, the indicator calculates the percentage of
the company that is not in debt, $\frac{\text{Total Assets}}{\text{Total Assets}} - DR$. The Debt indicator used in this work is
then:

$$\text{Debt}_{indicator} = \frac{\text{Total Assets}}{\text{Total Assets}} - DR = 1 - \frac{\text{Total Debt}}{\text{Total Assets}} \tag{3.1}$$

• PR

The Payout Ratio, as described in Chapter 2, is the percentage of net income distributed to the
investors as dividends. Since financially healthier companies have a bigger PR, the objective will
be to maximize it, so the PR indicator used in this work will be:

$$PR = \frac{DPS}{EPS} \tag{3.2}$$

• ROE

The Return on Equity is described in Chapter 2, and in this work, this indicator will be used as it is,
and the objective will simply be to maximize it. The indicator used is:

$$ROE = \frac{\text{Net Income}}{\text{Total Equity}} \tag{3.3}$$

• PM

The objective in this work will be to maximize the indicator as it is explained in Chapter 2:

$$PM = \frac{\text{Net Income}}{\text{Revenue}} \tag{3.4}$$

• RG

Although Revenue Growth (explained in Chapter 2) is already used in the Classification and Clustering
of stocks, it is also used as an indicator, to choose the stocks that grow the most inside each group. The
indicator used is:

$$RG = \Delta \text{Revenue} = \frac{\text{Revenue}_{Actual} - \text{Revenue}_{LastYear}}{\text{Revenue}_{LastYear}} \tag{3.5}$$


• NIG

NI growth will be used as an indicator, as explained in Chapter 2. The indicator used is:

$$NIG = \Delta NI = \frac{\text{NetIncome}_{Actual} - \text{NetIncome}_{LastYear}}{\text{NetIncome}_{LastYear}} \tag{3.6}$$

• ∆ RG

Sometimes the growth of a company does not affect the growth of its stock quote as much as the
prospect of growth does. Since the market is sensitive to changes in predicted returns, if a company
grows more (or less) than it is expected to, market confidence in that stock may change. This
indicator reflects the growth momentum of revenue, assuming that a bigger momentum creates
more confidence in a company, and so it is represented by:

$$\Delta RG = \Delta\Delta \text{Revenue} = \frac{RG_{Actual} - RG_{LastYear}}{RG_{LastYear}} \tag{3.7}$$

Since it compares the momentum of annual growth, this indicator is only available after the
second year of analysis.

• ∆ NIG

This indicator has the same role as $\Delta RG$, and is used to measure changes in
NI growth. It indicates whether the NI growth of a company is slowing down or not. The
indicator used is:

$$\Delta NIG = \Delta\Delta NI = \frac{NIG_{Actual} - NIG_{LastYear}}{NIG_{LastYear}} \tag{3.8}$$

• CFOA

This indicator (as explained in Chapter 2) will be used to choose companies that create cash flow
income from operating activities. The indicator used is:

$$\Delta CFOA = \frac{CFOA_{Actual} - CFOA_{LastYear}}{CFOA_{LastYear}} \tag{3.9}$$
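
The indicators above reduce to a few ratios and one year-over-year growth formula, which the sketch below illustrates. It is a Python illustration and not the C++ code of this work: the function names are assumptions made for the example, and returning None when the previous value is zero is a choice of this sketch that the text does not discuss.

```python
def growth(actual, last_year):
    """Year-over-year relative growth, the form shared by equations 3.5 to 3.9.

    Returns None when last year's value is zero, where the ratio is undefined.
    """
    if last_year == 0:
        return None
    return (actual - last_year) / last_year


def debt_indicator(total_debt, total_assets):
    """Equation 3.1: the share of the company that is not in debt."""
    return 1.0 - total_debt / total_assets


def payout_ratio(dps, eps):
    """Equation 3.2."""
    return dps / eps


def return_on_equity(net_income, total_equity):
    """Equation 3.3."""
    return net_income / total_equity


def profit_margin(net_income, revenue):
    """Equation 3.4."""
    return net_income / revenue


# Hypothetical revenue of the same quarter in three consecutive years.
REVENUE = [100.0, 120.0, 150.0]
RG_LAST_YEAR = growth(REVENUE[1], REVENUE[0])  # growth from year 1 to year 2
RG_ACTUAL = growth(REVENUE[2], REVENUE[1])     # growth from year 2 to year 3
DELTA_RG = growth(RG_ACTUAL, RG_LAST_YEAR)     # equation 3.7
```

With these sample values revenue growth rises from 20% to 25%, so the momentum indicator of equation 3.7 is approximately 0.25. The example also shows why that indicator needs three yearly observations of the same quarter.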

3.3.4 Classifier

This module does the classification of stocks for each quarter, based on the approach used by Peter

Lynch in [54], explained in section 2.6.

Although the book is old, and the economy changes at a fast pace, there are underlying theories that
still make sense (and can be applied) in the present, with adaptations. The book was written at the end of the
1980s, when the economic context was completely different, and because of this the reference parameters
used by the author are no longer suitable. The updates to the parameters, made both for the economic context and
for ease of implementation, are explained in the following paragraphs.

Although the author identifies six types of stocks, only five types are used in this work (and not all of them
are taken directly from the book), mainly because they are the ones that can be identified by directly
analyzing the companies' accounts. Cyclicals and Asset Plays are the two types of stocks from the book
not used in this work. Cyclicals need a more careful analysis, looking for patterns in the quotes and
in the accounts, so that one can determine the cycle parameters and the part of the cycle a company is in.
Asset Plays are mainly based on the valuation of the company's assets, which requires a careful
examination of those assets that cannot be made from spreadsheets.

There is a new type of stock used in this work, introduced to cover the whole Size × Growth
space. This type represents the small stocks with normal and good growth (see the following
paragraphs to understand these classifications, and figure 3.6 to better understand the structure),
which are given the name Potential Stocks.

Figure 3.6: Types' Quadrants (Size axis: Small, Medium, Big; Growth axis: Very Bad, Bad, Normal, Good, Very Good; regions: Fast Grower, Stalwart, Slow Grower, Potential, Turn Around)

The Classifier is the module that defines which of the five types a stock belongs to, and
also classifies the company's financial health, all evaluated through the company's FA. The
classification is given quarterly, so a company may be of one type in a quarter and change its classification
in the next one. The information needed for the classification is given by user
input; however, one could implement a fuzzy system instead of a human-input classifier, in order to make
the system more sophisticated and possibly to get better returns.

The type classification takes into account only the size and the growth of the company (see figure
3.6), and not the financial health, which was implemented for possible human analysis.

• Size

The size of a company is classified by giving classifications to assets (using thresholds given as user
input), and then averaging the classifications given to each one. Each asset is classified by
averaging its value over the last year (for further detail on how data is processed see section
3.4.1) and comparing that average with two thresholds. These thresholds indicate whether the asset is
small, medium or big. After all the assets are classified, the company's size is obtained by
averaging all those classifications and rounding the result (e.g. if 3 assets are taken into account, 2 of
them classified as big and the third as medium, the company is classified as big).

The classification is given using the thresholds as follows:

– Classification = 1 → $Value < TH_{Low}$

– Classification = 2 → $TH_{Low} \le Value < TH_{High}$

– Classification = 3 → $TH_{High} \le Value$

Afterwards the classification is averaged:

$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.10}$$

In equation 3.10, $N$ is the total number of assets to be classified, and $Classification_i$ is the
classification given to asset $i$. Lastly, the size classification is assigned as follows:

– Small → $TotalClassification \in\, ]0, 1.5[$

– Medium → $TotalClassification \in [1.5, 2.5[$

– Big → $TotalClassification \in [2.5, 3]$

In this work the only size indicator used is the last year's Total Assets average, and the thresholds
are:

– $TH_{Low}$ = 5B²

– $TH_{High}$ = 10B

• Growth

The growth of the company is classified in a similar way (by classifying the growth of the user-input
variables and then averaging the classifications), with some differences. The final growth
classification has five possible values (very bad, bad, normal, good, very good),

² 5B and 10B represent 5 and 10 billion dollars, respectively.


unlike the size classification, which has only three. Each variable receives one of 10 possible
classifications and, as in the size classification, the growth classification is the average of
the classifications of all the variables given as user input.

The growth is measured yearly (between the same quarter of different years). For further detail on

how this is done see section 3.4.1.

The procedure is similar to the one used to classify the size:

– Classification = 1 → $Indicator < TH_1$

– Classification = 2 → $TH_1 \le Indicator < TH_2$

– ...

– Classification = 10 → $TH_9 \le Indicator$

Afterwards the classification is averaged:

$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.11}$$

In equation 3.11, $N$ is the total number of variables to be classified, and $Classification_i$ is the
classification given to variable $i$. Lastly, the growth classification is assigned as follows:

– Very Bad → $TotalClassification \in\, ]0, 2]$

– Bad → $TotalClassification \in\, ]2, 4]$

– Normal → $TotalClassification \in\, ]4, 6]$

– Good → $TotalClassification \in\, ]6, 8]$

– Very Good → $TotalClassification \in\, ]8, 10]$

In this work the only growth indicator used is the yearly revenue growth, and the thresholds are:

– $TH_1 = -0.2$

– $TH_2 = -0.1$

– $TH_3 = -0.05$

– $TH_4 = -0.02$

– $TH_5 = 0$

– $TH_6 = 0.02$

– $TH_7 = 0.05$

– $TH_8 = 0.1$

– $TH_9 = 0.2$

• Health

The Health evaluation of a company takes into account the amount of debt the company has. It is
classified in a similar way to the size (by taking an annual average of the parameters and comparing
it with two thresholds). The financial health of a company does not interfere with its type (there
may be several companies of the same type with different financial health); it is simply indicative,
for human analysis.

In this work the only financial Health indicator used is the last year's DR indicator average, and the
thresholds are:

– $TH_{Low} = 0.3$

– $TH_{High} = 0.7$

Figure 3.7: Classifier Structure. Panels: (a) Health, (b) Size, (c) Growth

• Type

The type of the stock is, as said before, obtained from the evaluation of the size and growth of a

company, based on the approach presented by Peter Lynch in [54]. After a stock has its growth

and size evaluated, the type is determined by combinations between them.

The 5 classifications given in this work are (see figure 3.6):

– Slow Grower - These are the companies that are considered Medium in terms of size and

had a Normal or Good growth classification

– Stalwart - These are the companies that are considered Big in terms of size and had a Normal,

Good or Very Good growth classification

– Fast Grower - These are the companies that are considered Small or Medium in terms of size

and had a Very Good growth classification

– Potential - These are the companies that are considered Small in terms of size and had a

Normal or Good growth classification

– Turn Around - These are all the companies that had a Bad or Very Bad growth classification,

independently of their size.
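The five rules above cover every combination of size and growth, which can be checked with a small sketch. The function name `stock_type` and the string labels are illustrative assumptions; only the rules themselves come from the list above.

```python
def stock_type(size, growth):
    """Combine a size label and a growth label into one of the five types.

    size is one of "Small", "Medium", "Big"; growth is one of
    "Very Bad", "Bad", "Normal", "Good", "Very Good".
    """
    if growth in ("Bad", "Very Bad"):
        return "Turn Around"      # independent of size
    if size == "Big":
        return "Stalwart"         # Normal, Good or Very Good growth
    if growth == "Very Good":
        return "Fast Grower"      # Small or Medium
    # Remaining cases: Normal or Good growth, Small or Medium size.
    return "Slow Grower" if size == "Medium" else "Potential"
```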

3.3.5 Clustering

The clustering in this work is not treated as a typical clustering problem; instead it is used for classification, given a fixed number of clusters. This module creates five clusters each quarter, and associates each stock to the nearest cluster in the Growth × Size space. It uses the GA module to optimize the clusters’ positions, given at least one year of training. The number of clusters (5) was chosen taking into account the number of types created by the user input classifier (also five), to check if there was any kind of resemblance between the two classification methods. After the GA module outputs the clusters’ locations in the search space, the clustering module associates each company to a cluster by minimizing the Euclidean distance between company and clusters (it chooses the cluster with minimum distance). Since the two axes have very different units (size comes in billions, and growth comes as a fraction representing the percentage), values are scaled to help the algorithm converge. This is done such that 1B$ in assets (B stands for the American billion, 1000 million in European units) represents a distance equivalent to 1% in growth, both representing a unit distance from the origin. These scaled units are given in equation 3.12.

$$\text{Size} = \frac{\text{Total Assets}\,[\text{Million \$}]}{1000} \tag{3.12a}$$

$$\text{Growth} = \Delta\text{Revenue} \times 100 \tag{3.12b}$$

The growth measure is the annual revenue growth (for more details on how data is processed check

section 3.4.1), and the size measure is the average of the Total Assets over the last year.

After trying different scales, this one was the most successful in clustering the same type of stocks, since companies’ data is so sparse (the difference in size between companies is very big compared to the difference in growth). The most intuitive normalization would be to normalize both values over the maximum value of that quarter; however, this would make the density of points near the origin too high for the algorithm to converge properly.

There will be 5 clusters (the same amount as the types in the Classifier module), enumerated from A

to E, and they will move each quarter (the GA will recompute the best locations for clusters each quarter

passed). So, to maintain consistency, the first cluster (cluster A) will always be the one nearest to the

origin of the plane (Origin = Coordinates (0, 0)), B the second nearest, and so on. This way we can

check in a coherent way if a stock changed its cluster.
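The scaling, the naming of clusters by distance to the origin and the nearest-cluster assignment can be sketched as below. The function names are hypothetical and the centroids are made-up values; in the system they would come from the GA module.

```python
import math


def scale(total_assets_millions, revenue_growth_fraction):
    """Scaled (size, growth) position: 1B$ and 1% are both one unit."""
    return (total_assets_millions / 1000.0, revenue_growth_fraction * 100.0)


def order_clusters(centroids):
    """Sort centroids by distance to the origin, so that A is the nearest."""
    return sorted(centroids, key=lambda c: math.hypot(c[0], c[1]))


def nearest_cluster(point, ordered_centroids):
    """Return the letter (A..E) of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in ordered_centroids]
    return "ABCDE"[distances.index(min(distances))]
```

Because the letters are reassigned after every quarterly optimization, a stock that stays close to the origin keeps the label A even when the centroid itself moves.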

3.3.6 GA

This module has two functionalities (divided into two submodules). One is to define the weights given to fundamental indicators, used with the Investment Simulator module in order to give buy/sell signals. The other is used with the Clustering module to optimize the location of the clusters in the plane. Each works as follows:

• Fundamental Analysis Genetic Algorithm (FA GA)

This is where all the training phases used for the different portfolios occur. Since the portfolios use Fundamental Indicators (FI), and FI require a longer-term analysis than Technical Indicators, each time unit is a quarter. A generation is an iteration of the GA over the last 4 quarters. See figure 3.8 for the structure of a chromosome.

Figure 3.8: Fundamental Indicators’ Chromosome Representation

The pseudocode of the GA used is the following:

Algorithm 3.1: Used GA
g ← 0 ;
P(g) ← random ;
while g != number of generations do
    t ← current Quarter − 4 ;
    while t != current Quarter do
        Investment Simulation from P(g) in quarter t ;
        t ← t + 1 ;
    fitness ← Returns from P(g) Simulation ;
    Pp(g) ← Selection of parents from P(g) ;
    Pc(g) ← Crossover from Pp(g) ;
    Pm(g) ← Mutation of Pc(g) ;
    P(g + 1) ← New Generation Creation from (P(g), Pm(g)) ;
    g ← g + 1 ;

In algorithm 3.1, t indexes quarters (trimesters) and g indexes generations.

At the beginning of the algorithm the population is generated randomly. Each FI weight is initialized

as in equation 3.13a, the buy signal value is initiated as in equation 3.13b, and the sell signal value

initiated as in equation 3.13c.

$$r \in [0, 1] \tag{3.13a}$$

$$b = \sum_{i=1}^{N} r_i a_i \tag{3.13b}$$

$$s = \frac{b}{2} \tag{3.13c}$$

In equations 3.13 N is the number of indicators used, r and a are different random numbers, b

is the buy signal value and s is the sell signal value. Even though b can take values in [0, N ],

finding a random number in this interval will not simulate randomness of indicators and weights.

To construct a truly random b one has to create N random numbers r to simulate weights, N

different random numbers a to simulate indicators’ values, and apply the equation 3.13b. The s

value is calculated as in equation 3.13c to guarantee that it is smaller than the b value, and that it

has a substantial percentage difference from the b value.
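The random initialization of equations 3.13 can be sketched as below. The function name `random_thresholds` is hypothetical, and drawing the simulated indicator values a_i uniformly from [0, 1] is an assumption made here so that b stays in [0, N].

```python
import random


def random_thresholds(n_indicators, rng=random):
    """Draw random weights and derive the buy (b) and sell (s) values."""
    weights = [rng.random() for _ in range(n_indicators)]           # r_i
    indicator_values = [rng.random() for _ in range(n_indicators)]  # a_i (assumed range)
    b = sum(r * a for r, a in zip(weights, indicator_values))       # equation 3.13b
    s = b / 2                                                       # equation 3.13c
    return weights, b, s
```

Since b is a sum of products of two uniform numbers, its distribution is concentrated well below N, which is the reason given above for not drawing b directly from [0, N].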

A buy signal is given if the sum of the weights times the value of the fundamental indicators is

above b, and a sell signal is given if this sum is below s, as described in equations 3.14. Signal is

the type of signal given to the Portfolio module to simulate buy or sell, vi is the value of indicator

index i and wi is the weight of indicator index i. The average of the top 5 individuals of the algorithm

is used at the end to define the values used by the Investment Simulator.


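$$\text{Signal} = \begin{cases} \text{BUY} & \rightarrow \sum_{i=1}^{N} v_i \times w_i > b \\ \text{SELL} & \rightarrow \sum_{i=1}^{N} v_i \times w_i < s \end{cases} \tag{3.14a}$$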
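The signal rule can be written as a short function. This is a sketch with a hypothetical name (`signal`); returning `None` when the weighted sum lies between s and b is an assumption about how the absence of a signal is represented.

```python
def signal(values, weights, b, s):
    """Return "BUY", "SELL" or None from the weighted sum of indicators."""
    score = sum(v * w for v, w in zip(values, weights))
    if score > b:
        return "BUY"
    if score < s:
        return "SELL"
    return None  # between s and b: no signal is given
```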

The fitness function will be the ROI of each individual, when simulating investment.

$$\text{Fitness} = \text{ROI} = \frac{\text{Return} - \text{Initial Investment}}{\text{Initial Investment}} \tag{3.15}$$

The iterations start by simulating investment with the Fundamental Indicators’ weights and the

buy/sell values of the chromosomes. The returns from the simulations with each individual will be

the fitness of that individual.

• Clustering GA

This is where the training and validation of the cluster position optimization algorithm occur, inspired by the Automatic Clustering Genetic Algorithm from [44]. The GA module receives as input the size of the chromosome and the number of desired clusters (a fixed value) and runs the GA to find the best locations for the clusters’ centroids; its output is the centroids’ locations in the Size × Growth plane.

It uses the stocks’ FA to calculate their position in the plane, and uses the Calinski-Harabasz index as fitness function. See figure 3.9 for the structure of a chromosome.

The GA, apart from the usual GA parameters (such as population size, number of generations, etc.), uses two user inputs:

– Number of possible cluster positions

– Number of solutions (or clusters) created

The chromosomes are constructed with cluster points and have an auxiliary structure called activation values, in equal numbers, since the activation values will determine if a certain cluster point is going to be used or not. The number of cluster points and activation values is the number of possible cluster positions given by the user. Cluster points and activation values are such that:

– Cluster Point is a tuple (size, growth), where size ∈ R+ and growth ∈ R

– Activation value is a number n ∈ [0, 1]

Figure 3.9: Clustering Chromosome Representation


A stock’s position in the Size × Growth plane is normalized before computing centroids and distances, as described in equation 3.16 (see section 3.3.5 for an explanation of why this normalization was used).

$$\text{Size} = \frac{\text{Total Assets}_{\text{Last Year Average}}}{1000} \tag{3.16a}$$

$$\text{Growth} = \Delta\text{Revenue}_{\text{Last Year Average}} \times 100 \tag{3.16b}$$

At first, the maximum size of all stocks in the first year (the minimum training period is 1 year) is

obtained, and used as a reference (maxsize). The maximum growth is also computed and used

as reference (maxgrowth). Afterwards chromosomes are randomly initiated, by assigning random

values to Size and Growth, in order to create the N different possible points. This is done as in

equation 3.17.

$$\text{Size}_{random} = \frac{max_{size}}{2} \times r_1 \tag{3.17a}$$

$$\text{Growth}_{random} = \frac{max_{growth}}{2} \times r_2 \tag{3.17b}$$

In equation 3.17 r1 ∈ [0, 1] and r2 ∈ [0, 1] are distinct random numbers and maxsize, maxgrowth

are the maximum size and growth measured in the first year.
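A sketch of this initialization is given below, reading equation 3.17 as half of the reference maximum multiplied by a random number. The function name and the example maxima are hypothetical.

```python
import random


def random_cluster_points(n_points, max_size, max_growth, rng=random):
    """Create n_points random (size, growth) tuples as in equation 3.17."""
    return [
        ((max_size / 2) * rng.random(), (max_growth / 2) * rng.random())
        for _ in range(n_points)
    ]
```

With this reading every initial point lies in the rectangle [0, maxsize/2] × [0, maxgrowth/2]; points outside it can only be reached later through mutation.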

Activation values, an auxiliary structure whose only purpose is to evaluate which clusters have more members (the GA operators are not applied to the activation values), are computed after this initialization, by measuring the percentage of stocks that belong to each cluster.

First, stocks are assigned to clusters. To measure the distance between stocks’ positions and cluster positions, the distance function used is the Euclidean distance, or L2 norm, as in equation 3.18.

$$\|A - B\| = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2} \tag{3.18}$$

Applied to this specific problem, it takes the form of equation 3.19.

$$\text{Dist} = \|\text{Stock} - \text{Cluster}\| = \sqrt{(\text{Stock}_{size} - \text{Cluster}_{size})^2 + (\text{Stock}_{growth} - \text{Cluster}_{growth})^2} \tag{3.19}$$

The activation values are obtained as in equation 3.20.

$$\text{Activation}_i = \frac{\text{Number of Stocks} \in C_i}{\text{Total Number of Stocks}} \tag{3.20}$$

In equation 3.20 i is the index of the solution, and Ci is the cluster of index i.
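Equations 3.19 and 3.20 can be combined in a few lines. The sketch below uses a hypothetical function name and made-up points; stocks and cluster points are assumed to be already scaled (size, growth) tuples.

```python
import math
from collections import Counter


def activation_values(stocks, cluster_points):
    """Fraction of stocks whose nearest cluster point is each candidate point."""
    nearest = [
        min(range(len(cluster_points)),
            key=lambda i: math.dist(stock, cluster_points[i]))
        for stock in stocks
    ]
    counts = Counter(nearest)
    return [counts[i] / len(stocks) for i in range(len(cluster_points))]
```

A candidate point that attracts no stock gets an activation value of 0, which makes it the last to be picked as a solution cluster.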

The clusters with the biggest activation values, up to the number of solutions decided by the user, are chosen as solutions of that chromosome. To measure the fitness of the chromosome after this process, one first has to find the centroid of the whole data set. Afterwards stocks are assigned to the solution clusters (each to its nearest cluster) and then the Calinski-Harabasz metric (sometimes called variance ratio criterion) is applied (as in equation 3.21). The bigger the result, the fitter the individual is. This metric takes into account intra-cluster similarity and inter-cluster dissimilarity, getting higher values when clusters have high intra-cluster similarity and high inter-cluster dissimilarity.

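$$CH = \frac{SSB}{SSW} \times \frac{N - K}{K - 1} \tag{3.21a}$$

$$SSB = \sum_{j=1}^{K} n_j \|C_j - C\|^2 \tag{3.21b}$$

$$SSW = \sum_{j=1}^{K} \sum_{i \in I_j} \|X_{ij} - C_j\|^2 \tag{3.21c}$$

$$C = \left( \frac{\sum \text{Stocks}_{size}}{N}, \frac{\sum \text{Stocks}_{growth}}{N} \right) \tag{3.21d}$$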

In equations 3.21 SSB is the between-cluster variance, SSW is the within-cluster variance, N is

the total number of stocks, K is the number of clusters (the number of solution clusters), nj is the

number of data points belonging to cluster index j, C is the centroid of the dataset, Cj is cluster of

index j, Ij is the set of data points belonging to cluster j and Xij is data point index i belonging to

cluster index j.

Although the Calinski-Harabasz index is usually used as a metric to optimize the number of clusters created by a clustering algorithm, in this work it is simply used as a fitness function to optimize the position of the clusters.
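A direct transcription of equations 3.21 is sketched below. The function name is hypothetical, and the cluster positions C_j are taken as given by the chromosome (they are not recomputed as the mean of their members), which is how the equations above are read here.

```python
import math


def calinski_harabasz(stocks, centroids):
    """Fitness of a set of solution clusters, following equations 3.21."""
    n, k = len(stocks), len(centroids)
    # Centroid of the whole data set (equation 3.21d).
    overall = (sum(s[0] for s in stocks) / n, sum(s[1] for s in stocks) / n)
    # Assign every stock to its nearest solution cluster.
    members = [[] for _ in centroids]
    for stock in stocks:
        j = min(range(k), key=lambda i: math.dist(stock, centroids[i]))
        members[j].append(stock)
    ssb = sum(len(m) * math.dist(c, overall) ** 2
              for m, c in zip(members, centroids))
    ssw = sum(math.dist(x, c) ** 2
              for m, c in zip(members, centroids) for x in m)
    # Note: ssw is zero (division error) if every stock sits exactly on a centroid.
    return (ssb / ssw) * (n - k) / (k - 1)
```

For four stocks at sizes 0, 2, 10 and 12 (growth 0), clusters placed at 1 and 11 give a higher index than clusters placed at 0 and 12, so the GA is pushed towards positions in the middle of each group.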

• Operators

Two types of selection are implemented: a roulette wheel and a ranking selection (as explained in 2.2.1). For these, the fitness values are shifted by the fitness of the worst individual, to avoid negative fitnesses.

The crossover used is a single point crossover, which takes a random integer value between 1 and Number_weights − 1 (in a vector starting at 0) and exchanges the values of the weights in the indexes from that random point on (for example, if there are 5 weights, and the random value is 3, the weights 3, 4 and 5 are exchanged). The same crossover is applied in both the clustering and the FA GA.
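The roulette wheel with shifted fitness and the single point crossover can be sketched as follows. The function names are hypothetical; the sketch assumes 0-based list indexes, so a cut point p exchanges the positions p, p+1, ... up to the end, and it falls back to a uniform choice when all fitnesses are equal (an assumption, since the shifted weights would all be zero).

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to shifted fitness."""
    worst = min(fitnesses)
    shifted = [f - worst for f in fitnesses]  # worst individual gets weight 0
    if sum(shifted) == 0:
        return rng.choice(population)         # all equal: uniform choice (assumption)
    return rng.choices(population, weights=shifted, k=1)[0]


def single_point_crossover(parent_a, parent_b, rng=random):
    """Exchange the tails of two equally sized chromosomes at a random point."""
    point = rng.randint(1, len(parent_a) - 1)  # inclusive on both ends
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

One consequence of shifting by the worst fitness is that the worst individual of a generation is never selected as a parent by the roulette wheel.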

The mutation implemented (also inspired by [44]) uses a ψ value to determine the maximum value of the perturbation. This ψ is a percentage δ of the value that will be mutated. After finding this value, the mutation value α is computed as a random number in [0, ψ].

The mutation can be written mathematically as:

$$\psi = \delta \times value \tag{3.22}$$

$$\alpha = random \in [0, \psi] \tag{3.23}$$

$$value = value \pm \alpha \tag{3.24}$$

Here, value is the value receiving the mutation, α is the perturbation applied to the value, ψ is the maximum value of the perturbation and δ is the percentage of perturbation chosen. To choose whether the mutation will be a sum or a subtraction, a “coin flip” is used: a random number r ∈ [0, 1] is generated, and if the number is below 0.5 the mutation will be a subtraction, otherwise it will be a sum.

In this work values of δ = 0.5 for indicators’ weights and δ = 0.2 for cluster positions are used; this way a mutation can perturb a weight by at most 50% of its value, and a cluster position by at most 20% (since 5 clusters will be used, 100% of the Size × Growth plane / 5 = 20%).
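The mutation of equations 3.22 to 3.24 can be sketched as below. The name `mutate` is hypothetical, and using the absolute value of the gene to compute ψ is an assumption added so that negative growth coordinates are handled the same way as positive ones.

```python
import random


def mutate(value, delta, rng=random):
    """Perturb value by a random amount of at most delta * |value|."""
    psi = delta * abs(value)       # maximum perturbation (equation 3.22)
    alpha = rng.uniform(0, psi)    # actual perturbation (equation 3.23)
    # Coin flip: below 0.5 subtract, otherwise add (equation 3.24).
    return value - alpha if rng.random() < 0.5 else value + alpha
```

For example, `mutate(10.0, 0.2)` always returns a number between 8 and 12.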

3.3.7 Investment Simulator

The Investment Simulator is the coordinator between the Investor, the Genetic Algorithm and the

Portfolio modules. This is the module that will determine which model will run. There are 3 types of models implemented:

• Whole data set models

• Classification based models

• Cluster based models

The Investment Simulator is the module responsible for coordinating the resources needed for the modules to run (GA usage, access to pools of stocks, and parameter definition). It is the Investment Simulator module that is responsible for giving the buy signals to the portfolios, and for defining the weights given to each indicator (given as an average of the top 5 individuals of the population of the GA). See figure 3.10 for how the Investment Simulator and the Portfolio module interact.

3.3.8 Portfolio

The portfolio is the output of the system. It receives as input access to the Stock module, the buy signals of the stocks, and the current trimester. It uses this information to save information about the state of the investment for a certain strategy. It contains the current date (the date that is being evaluated), the companies with open positions in the portfolio, the date and quote at which each of them was bought and the value of the quotes at present time. It is the module responsible for simulating the buying and selling of stocks (looking for the quotes at a given date, saving that information).

Figure 3.10: Investment Simulator, Portfolio and Stock Interaction

The portfolios using classifications or clusters use only the stocks of a certain type/cluster. Since

the type/cluster of the stocks may change each quarter, the pool of stocks available for transactions is

dynamic (changes each quarter). There are 2 types of Portfolios that use the whole dataset. One is the

normal Buy&Hold, used for comparison with other portfolios, and the other is the FA GA portfolio using

all of the Data Set.

• Buy & Hold

There is only one B&H portfolio in this work, using the whole data set, used for comparison with the other portfolios (alongside the S&P500 index). Using only the S&P500 index as a control may not be enough, since every portfolio uses a dataset that is not the one of the index (but a subset of the index), and so a B&H of the dataset used is useful for comparison. In this B&H the whole dataset is bought at the beginning of the time duration of the portfolios and kept until the end of their duration.

• Classification and Clustering portfolios

There are 10 portfolios using a classification technique in this work. Five are portfolios using user input classification, and five use the clustering algorithm.

Since the pool of stocks is dynamic (each stock may change type/cluster each quarter), the portfolios are built in such a way that if a stock is of the required type/cluster in a quarter, it is added to the portfolio in that quarter (it is bought), and when it stops being of that type/cluster it is taken out of the portfolio (it is sold). The evolution is tracked after buying, but before selling. This way, if a stock enters the portfolio in a quarter it will not change the growth of the portfolio in that quarter (since the evolution is checked after the buy and, for the same day, the stock will have the same quote, so the difference will be only the transaction cost). However, if that same stock changes its type in the next quarter and has to leave the portfolio, the evolution of that quarter will be accounted for (since evolution is tracked before selling the stock).

This makes it possible to monitor how the stocks of each type or cluster evolve, and to conclude which type or cluster has superior results.
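The buy-on-entry, sell-on-exit rule of these portfolios can be illustrated with a small sketch. The function name, the tickers and the data layout (one dictionary of stock to type per quarter) are hypothetical; quotes and transaction costs are left out.

```python
def membership_trades(types_by_quarter, wanted):
    """List (quarter, action, stock) events for a single-type portfolio."""
    held, events = set(), []
    for quarter, types in enumerate(types_by_quarter):
        current = {stock for stock, t in types.items() if t == wanted}
        for stock in sorted(current - held):
            events.append((quarter, "BUY", stock))   # stock entered the type
        for stock in sorted(held - current):
            events.append((quarter, "SELL", stock))  # stock left the type
        held = current
    return events
```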

• FA GA portfolios

These are the Portfolios that use the GA module to train and use Fundamental Indicators to give

the buy and sell signals over a pool of stocks. This pool of stocks may be the whole data set, or

stocks from one of the 5 classification/clusters portfolios. There are 11 portfolios like this, one with

the whole data set, 5 using the stocks classification and 5 using stocks clusters.

The Portfolio starts after the training period of the GA, and the buying signals are given by FI

weights made from the average of the weights of the top 5 (5% in this work) chromosomes of

the training. At each iteration, after the portfolio computed its solution for the quarter, the GA

module retrains, with a sliding window, so the information used in the next quarter is updated with

information about the current.

When the portfolio uses classifications or clusters, the pool of stocks is dynamic (changes each quarter). Although these portfolios buy only stocks of a certain type/cluster, they keep them in the portfolio until a sell signal is given. This way the GA will be able to optimize parameters for a single type/cluster of stocks but will not sell them prematurely.
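Averaging the best chromosomes of the training, as described for the FA GA portfolios, can be sketched as follows. The function name `average_top` and the example numbers are hypothetical; chromosomes are assumed to be plain lists of weights of equal length.

```python
def average_top(chromosomes, fitnesses, top_n=5):
    """Element-wise average of the top_n chromosomes ranked by fitness."""
    ranked = sorted(zip(fitnesses, chromosomes),
                    key=lambda pair: pair[0], reverse=True)[:top_n]
    size = len(ranked[0][1])
    return [sum(ch[i] for _, ch in ranked) / len(ranked) for i in range(size)]
```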


3.4 Implementation Details

3.4.1 Data

There are two types of data obtained from information available on the Internet: one is the quotes of a certain company for a given period, and the other is the companies’ financials (Balance Sheets, Income Statements and Cash Flow Statements). The stock quotes come in tuples as in figure 3.3. The quotes used in this work are closing quotes.

Although the sheets used come with a lot of data (Balance Sheet with 19 variables of data, Income

Statement with 22 variables and Cash Flow Statement with 6 variables), the only information used from

each sheet is the following:

• Balance Sheet:

– Total Equity

– Total Assets

– Total Liabilities

• Income Statement:

– Revenue

– Net Income

– Dividends per Share

– Diluted Normalized EPS

• Cash Flow Statement:

– Cash from Operating Activities

When this data is being processed, there are several things to take into account. The first one is

data integrity. Some companies had missing rows or columns in their information. Companies missing

crucial information for the system were taken out of the dataset.

3.4.2 Data Processing

The evaluation of assets and growth in the clustering and classification methods averages the variables being evaluated, to smooth abrupt changes in these variables, and to take into account more than one quarter of information. For example, revenue growth is measured between the same quarters of different years: if the growth is measured between every quarter of 2012 and the corresponding quarter of 2011, the value used as a growth measure is the average of the growth of all the quarters of 2012. This average does not take into account zero values, since these correspond to missing information (for example in 2010 there is no way of measuring growth, since there is no information about 2009). This is also valid for size evaluations.
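The yearly average that skips missing (zero) quarters can be sketched as below. The function name is hypothetical, and returning 0.0 when every quarter is missing is an assumption made to keep the "zero means missing" convention.

```python
def average_ignoring_zeros(values):
    """Average of the quarterly values, skipping zeros (missing information)."""
    known = [v for v in values if v != 0]
    return sum(known) / len(known) if known else 0.0
```

For the quarterly growths 10%, missing, 30% and 20%, the value used is therefore the average of the three known quarters, 20%, and not 15%.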

3.4.3 Configuration

The parameters given in the configuration file include the following:

• GA Parameters

The parameters used by the GA are:

– Population size - number of individuals (chromosomes) used in the GA.

– Number of generations - number of iterations over the same period

– Mutation rate - percentage of total weights or cluster points that receive a mutation

– Elitism rate - percentage of the population with best fitnesses that is copied to the next generation

– Immigration rate - percentage of the population with worst fitnesses that is replaced by random

immigrants

– Training period - number of quarters used only for training

– Validation period - number of quarters used for validation of the models

– Chromosome size - number of parameters (weights or cluster points) to be optimized by each

individual

The GA parameters are the ones that describe how the GA functions. These may be tuned as we please, and bring slight changes to the output. The balance we want to find is between a system that has a meaningful training and one that adapts well to changes (that avoids overfitting).

The used parameters are:

– Population size - 100. This value was chosen since related works chose similar sizes. No consideration was given to the problem size or features.

– Number of generations - 50 for the FA GA and 200 for the clustering location optimization. 50 generations for the FA GA were chosen so the models could be computed in a viable time period (values of 35 and 100 generations were also tested, and 50 was the biggest value that could be computed in viable time for the different tests made). 200 generations were chosen for the clustering location optimization: since this algorithm had to run fewer times, the execution time could be bigger (values of 100 and 150 generations were also tested). Also, the clustering position optimization algorithm did not change much in the last generations.

– Mutation rate - 5%. This value was chosen having as reference the related works. A value of

3% was also tested, however 5% seemed more appropriate after selecting an elitism rate of

40%.

– Elitism rate - 40%. This value was chosen to guarantee that the global fitness of the population

would not go down.

– Random immigration rate - 20%. This value was chosen to give flexibility to the system and

help the system adapt to changes in the optimization problem.

– Training period - the minimum training period for the clustering GA is of 1 year, and the

minimum training period for the FA GA using the classification or clustering portfolios is 2

years (1 to obtain the classification and 1 to train the FA GA)

– Validation period - the maximum validation period is from the end of the training period until

the last quarter with information available (5 years for the clustering GA and 4 years for the

FA GA)

– Chromosome size - 11 for the FA GA (equals the number of indicators) and 75 for the clustering position optimization (number of possible cluster points). The size of the clustering chromosome was chosen to be 75 because of the execution time associated with it. Sizes of 50 and 100 were also tested, but 75 was the biggest size whose execution time would be viable.

• Transaction Costs

Although none of the portfolios solves the portfolio composition problem, and there is no allocation

of budget, transaction costs are taken into account in every trade, when the stock is bought and

when it is sold. These transactions have a cost of 0.3% (so when it is bought every stock costs 0.3% more, and when it is sold it is worth 0.3% less).

4 System Validation

Contents

4.1 Validation Metrics

4.2 Case Studies

In this section the performance of the implemented system is tested. First the metrics used to evaluate the models are described, and then the case studies on the proposed solution are presented. The quarterly returns presented in the tables are measured as the returns of positions from the first day of the quarter measured until the first day of the next quarter. This means that the first quarter’s returns are measured from the first day of quarter 1 until the first day of quarter 2, and quarter 4 returns are measured from the first day of quarter 4 until the first day of quarter 1 of the next year.

Every solution is compared with the S&P500 index and the B&H of the dataset in the specific time

period. The B&H of the dataset is constructed by buying all the stocks of the dataset (the 272 stocks listed

in appendix A) in the first day of the investment period, and tracking its returns during the investment

period.

There is no comparison with related studies because, when this work was proposed, I was unable to find works that used the same validation period or works that traded only quarterly, and so this work uses the B&H and the S&P500 index (especially the latter) as comparison for the obtained results.

The creation of each portfolio is independent of each other, and so, although the algorithm was run

sequentially without any parallelization, parallelizing the algorithm to create each portfolio at the same

time would not be hard if the necessary alterations were made.

The execution time of the clusters’ positions optimization algorithm is about 5 hours, and it takes 7 hours to create all of the portfolios once, with an i7 microprocessor.

4.1 Validation Metrics

In order to test the performance of the proposed solution, the metrics used in this work are the

following:

• ROI

• Drawdown

• Sharpe Ratio

• Success rate of trades

• Average time in the market

• Average return per trade

• Rate of positive quarters


• Average return per quarter

These metrics are used to evaluate every type of portfolio, and the performance of the portfolios is compared to the S&P500 index and the B&H of the dataset during the same time.

4.1.1 ROI

The Return on Investment is used to measure the amount of return, given in percentage, that a

certain investment had. In this work, this is given exclusively by the percentage difference in the stocks’

quotes, and is mathematically represented as in equation 4.1.

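$$\text{ROI}_n[\%] = \frac{\text{Return}_n - \text{InitialInvestment}_n}{\text{InitialInvestment}_n} \tag{4.1a}$$

$$\text{InitialInvestment} = P_{n,t} \tag{4.1b}$$

$$\text{Return} = P_{n,\tau} \tag{4.1c}$$

$$\text{TotalROI}[\%] = \frac{\sum_{n=1}^{NUM} \text{ROI}_n}{NUM} \tag{4.1d}$$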

In equation 4.1, NUM is the total number of companies in the data set being used, P is the quote of stock index n, t is the period at which stock index n was bought and τ is the period at which stock index n was sold or had its return evaluated (one can see the ROI of an investment without selling the stock). The number of stocks is not accounted for because this work only considers the evolution of stocks’ quotes, and does not solve the portfolio problem.

To include transaction costs in this work, the ROI formula used undergoes a slight change, as described in equations 4.2a and 4.2b.

$$\text{InitialInvestment}_{tx} = P_{n,t} + (P_{n,t} \times \psi) \tag{4.2a}$$

$$\text{Return}_{tx} = P_{n,\tau} - (P_{n,\tau} \times \psi) \tag{4.2b}$$

In equations 4.2a and 4.2b, ψ is the transaction cost. This work uses ψ = 0.003 (transaction costs of 0.3%).

Since all the portfolios studied in this work are long term investments, and the buy and selling dates

will be at the beginning of each quarter, the total ROI of a portfolio at a given time (compounded over

the time studied) will be given by:

$$\text{TotalROI}(t)[\%]_{tx} = \left( \prod_{t=1}^{T} (1 + \text{ROI}(t)_{tx}) \right) - 1 \tag{4.3}$$

In equation 4.3, T is the total number of quarters that a portfolio has and t is the current quarter being evaluated. ROI(1)_tx, the ROI of buying and selling stocks without any change in the quotes (equivalent to buying and selling on the first day, so that Return = InitialInvestment), would be 0 without transaction costs; with transaction costs, however, it is negative.

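$$\text{ROI}_{tx}[\%] = \frac{\text{Return}_{tx} - \text{InitialInvestment}_{tx}}{\text{InitialInvestment}_{tx}} = \frac{\text{Return}(1 - \psi) - \text{InitialInvestment}(1 + \psi)}{\text{InitialInvestment}(1 + \psi)} = \frac{-2 \times \text{InitialInvestment} \times \psi}{\text{InitialInvestment}(1 + \psi)} = \frac{-2 \times \psi}{1 + \psi} \tag{4.4}$$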

Applying the 0.3% transaction costs used in this work, the result is −2 × 0.003 / 1.003 ≈ −0.00598205. This is the value of ROI(1)_tx, and the fraction of invested money that goes to transaction costs in that quarter.
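The ROI with transaction costs (equations 4.1 and 4.2) and its compounding over quarters (equation 4.3) can be checked numerically with a short sketch. The function names and the example quotes are hypothetical.

```python
import math

PSI = 0.003  # transaction cost used in this work (0.3%)


def roi_tx(buy_quote, sell_quote, psi=PSI):
    """ROI of one position with the cost applied on both buy and sell."""
    invested = buy_quote * (1 + psi)   # equation 4.2a
    returned = sell_quote * (1 - psi)  # equation 4.2b
    return (returned - invested) / invested


def compounded_roi(quarterly_rois):
    """Total ROI compounded over consecutive quarters (equation 4.3)."""
    return math.prod(1 + r for r in quarterly_rois) - 1
```

Calling `roi_tx` with equal buy and sell quotes reproduces the closed form −2ψ/(1 + ψ) of equation 4.4, whatever the quote is.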

4.1.2 Drawdown

This metric evaluates the biggest peak-to-trough decline during a specific period of investment. It is quoted as the percentage between the peak and the subsequent trough. Investors can use this metric as a way to measure a portfolio’s volatility.

$$\text{Drawdown} = \min(0, \text{ROI}_i), \quad i \in [0, Q] \tag{4.5}$$

In equation 4.5 Q is the number of quarters in a portfolio, ROIi is the ROI in quarter i, and i is the

number of quarters passed since the beginning of the portfolio.

4.1.3 Sharpe Ratio

The Sharpe Ratio is one popular ratio, used to measure the risk associated with the return of a

portfolio. The excess of return of the portfolio over the risk free rate of return is standardized over the

standard deviation of the portfolio. The higher this ratio is, the better.

The risk free rate of return is a theoretical concept that represents the amount of return an investment without risk would have. This kind of investment does not exist, since there is always some risk associated with investing; however, United States Treasury Bills are usually used as a reference for the risk free rate, since they are considered the least risky investment worldwide.

$$\text{SharpeRatio} = \frac{\text{Portfolio Return} - \text{Risk free rate}}{\delta} \tag{4.6}$$

In equation 4.6 δ is the standard deviation of the portfolio.

This work uses a yearly risk free rate of 2% for 2012 and 2015, and 2.5% for 2013 and 2014.
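Equation 4.6 can be sketched as below. Two choices here are assumptions and not statements from the text: the standard deviation is computed over a list of periodic portfolio returns, and the population standard deviation (`statistics.pstdev`) is used rather than the sample one.

```python
import statistics


def sharpe_ratio(portfolio_return, risk_free_rate, periodic_returns):
    """Excess return over the risk free rate, divided by the std. deviation."""
    deviation = statistics.pstdev(periodic_returns)  # assumed: population std
    return (portfolio_return - risk_free_rate) / deviation
```

With a portfolio return of 10%, a risk free rate of 2% and periodic returns of 5% and 15%, the deviation is 5% and the ratio is 1.6.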

4.1.4 Success rate of trades

A trade is considered as a buy and subsequent sell of a stock (including the selling of stocks at the

end of the period of the portfolios, in the fourth quarter of 2015). The Success rate of trades will be used

to measure the percentage of trades that obtained a positive return, and is described in equation 4.7.

$$\text{Success Rate of Trades}[\%] = \frac{\text{number of trades with positive return}}{\text{number of total trades}} \tag{4.7}$$

4.1.5 Average time in the market

This metric evaluates the average time each investment was on the market, making it possible to conclude how long the portfolio maintains its investments. It is given in number of quarters, by:

$$\text{Average time in the market} = \frac{\sum_{t=1}^{T} M_t}{T} \tag{4.8}$$

In equation 4.8, T is the total number of trades done in a portfolio and M_t is the time spent in the market by trade t.

4.1.6 Rate of positive quarters

This metric evaluates if the portfolio is able to maintain a positive return in each quarter over the

evaluation period and can be used as a way to measure risk.

$$\text{Rate of positive quarters} = \frac{\#\text{Positive quarters}}{\#\text{Quarters}} \tag{4.9}$$
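The three counting metrics of equations 4.7 to 4.9 are simple ratios and averages, sketched below with hypothetical function names. The results are returned as fractions; multiplying by 100 gives percentages.

```python
def success_rate(trade_returns):
    """Fraction of trades with a positive return (equation 4.7)."""
    return sum(1 for r in trade_returns if r > 0) / len(trade_returns)


def average_time_in_market(quarters_per_trade):
    """Average number of quarters each trade stayed open (equation 4.8)."""
    return sum(quarters_per_trade) / len(quarters_per_trade)


def rate_of_positive_quarters(quarterly_returns):
    """Fraction of quarters with a positive return (equation 4.9)."""
    return sum(1 for r in quarterly_returns if r > 0) / len(quarterly_returns)
```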

4.2 Case Studies

In this section, the case studies of the models used are presented. The Dataset used for the models

is constituted by 272 companies of the S&P500 index. The application was tested with data obtained

from Yahoo finance API (as explained in Chapter 3) using the close quotes of stocks. Three constrains

are present in all of the case studies. They are the following:

• Only long positions: The portfolios created allow only long positions. Since this work measures the time a position is open in quarters and uses only Fundamental Analysis, short positions are not considered. Short positions would require a deeper technical analysis, since they are riskier than long positions.


• No dividends: This work uses the stocks’ closing quotes, with no adjustment to account for dividends (payments made by companies to their shareholders).

• Transaction costs: This work assumes transaction costs of 0,3% of the stock quote value on every buy or sell.
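As a rough illustration of the transaction cost constraint, the sketch below computes the net return of a single long trade. It assumes, since the text does not say how the cost is booked, that the 0,3% is added to the price paid when buying and deducted from the proceeds when selling. The function name and the quotes are hypothetical.

```python
COST = 0.003  # 0,3% of the quote value, charged on every buy and every sell


def net_trade_return(buy_quote, sell_quote, cost=COST):
    """Net return of one long trade after paying the cost on both legs."""
    paid = buy_quote * (1 + cost)       # cash spent per share when buying
    received = sell_quote * (1 - cost)  # cash received per share when selling
    return received / paid - 1


# A hypothetical 10% gross move leaves roughly 9,34% after costs.
print(f"{net_trade_return(100.0, 110.0):.4%}")
```

Under this assumption, a stock sold at the same quote it was bought at produces a small loss of about 0,6%, so very short or flat trades are penalized.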

Results will be compared with the B&H of the dataset in order to check whether the proposed solution is better than simply doing a B&H on a pool of stocks. Results will also be compared with the S&P500 index¹ to put each result in context.

Every portfolio starts in the first quarter of 2012 and ends in the fourth quarter of 2015. Even though the first two case studies could start in 2011, this decision was made to ease the comparison between all portfolios, including the ones in case study 3, which could only start at the beginning of 2012.

The 3 case studies are concerned only with the evolution of the portfolios' returns during the stipulated time period, and not with the portfolio composition problem.

Although a ranking selection was implemented in this work, its performance was worse than that of the roulette wheel selection, so the presented results were obtained using only the roulette wheel selection.
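For reference, roulette wheel (fitness-proportional) selection can be sketched as follows. This is a generic textbook version, not the implementation used in this work: the function name is illustrative, and it assumes that all fitness values are non-negative and that at least one is positive.

```python
import random


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness.

    Assumes all fitness values are non-negative and at least one is positive.
    """
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if pick < cumulative:
            return individual
    # Floating-point rounding can leave pick just above the final cumulative
    # sum; fall back to the last individual that owns a slice of the wheel.
    for individual, fitness in zip(reversed(population), reversed(fitnesses)):
        if fitness > 0:
            return individual


# Hypothetical population: "b" should be drawn about three times as often as "a".
rng = random.Random(42)
draws = [roulette_wheel_select(["a", "b"], [1.0, 3.0], rng) for _ in range(10000)]
print(draws.count("b") / len(draws))
```

A ranking selection would replace the raw fitness values with weights derived from each individual's rank, which reduces the dominance of a few very fit individuals.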

4.2.1 Case Study I - User Input Classification

This case study presents the results of classifying stocks using user input data.

The strategy is to build a portfolio containing a single type of stock. The portfolio buys a stock in the quarter in which it is classified as belonging to that type, and sells it in the quarter in which it stops belonging to that type.
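The buy and sell rule above can be sketched as a function that turns the sequence of quarterly labels of one stock into trades. The function name, the label strings and the use of quarter indices are assumptions made for illustration; the forced sale in the last quarter follows the definition of a trade given in section 4.1.4.

```python
def trades_for_type(labels, target):
    """Return (buy_quarter, sell_quarter) index pairs for one stock.

    labels: the class assigned to the stock in each quarter, in time order.
    A position is opened in the first quarter the stock carries the target
    label and closed in the first later quarter where it does not.
    """
    trades = []
    entry = None
    for quarter, label in enumerate(labels):
        if label == target and entry is None:
            entry = quarter                  # stock enters the type: buy
        elif label != target and entry is not None:
            trades.append((entry, quarter))  # stock leaves the type: sell
            entry = None
    if entry is not None:
        # Position still open at the end of the period: forced sale.
        trades.append((entry, len(labels) - 1))
    return trades


# Hypothetical stock classified over five quarters.
print(trades_for_type(["FG", "FG", "SW", "FG", "FG"], "FG"))  # [(0, 2), (3, 4)]
```

The same rule applies unchanged to the cluster portfolios of Case Study II, with cluster letters in place of type labels.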

Since this classification is fixed, the values presented are the result of a single run (every other run

would have exactly the same results, since data did not change, and this is a deterministic classification).

We can see in figure 4.1 that the Fast Growers type has the best returns overall, with a 4 year ROI of 79,32%, above the S&P500 return of 62,49% and the dataset B&H of 54,25%.

In this time period the Turnaround type has the worst result overall, with a ROI of 30,13%. Turnarounds and Slow Growers were the only types that underperformed both the B&H and the S&P500. Potentials and Stalwarts obtained results similar to that of the B&H.

This result was expected, since Turnarounds are all the companies that got a bad or very bad classification in growth. It was also expected that the type with the best results would be the Fast Growers, since it contains the stocks with the best revenue growth. The Stalwarts' return was a surprise, as one would expect them to perform better.

¹ Data obtained from http://performance.morningstar.com/Performance/index-c/performance-return.action?t=SPX


Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
SG | 7,91% | 3,43% | 10,52% | 14,04% | 24,27% | 26,46% | 32,62% | 37,55% | 43,75% | 45,16% | 41,50% | 51,82% | 54,16% | 51,47% | 45,49% | 44,09%
SW | 8,85% | 5,72% | 10,58% | 14,68% | 24,83% | 28,15% | 35,00% | 44,87% | 51,04% | 55,74% | 52,32% | 63,62% | 66,37% | 62,57% | 52,60% | 54,33%
FG | 14,07% | 6,67% | 15,88% | 25,37% | 28,56% | 29,92% | 35,68% | 46,95% | 56,97% | 61,02% | 53,34% | 65,29% | 78,10% | 72,37% | 68,14% | 79,32%
POT | 10,73% | 4,34% | 8,27% | 13,94% | 21,30% | 20,37% | 29,24% | 38,72% | 42,89% | 44,70% | 41,67% | 52,06% | 60,15% | 57,31% | 53,34% | 55,73%
TA | 6,11% | 1,52% | 6,40% | 4,65% | 14,36% | 20,35% | 29,90% | 40,10% | 45,02% | 47,30% | 43,27% | 45,95% | 45,32% | 40,83% | 27,82% | 30,13%
B&H | 10,25% | 5,35% | 11,73% | 15,50% | 26,05% | 29,50% | 37,89% | 48,41% | 55,24% | 59,29% | 54,97% | 65,68% | 68,94% | 64,51% | 52,44% | 54,25%
S&P500 | 11,99% | 8,31% | 14,54% | 13,39% | 24,76% | 27,70% | 33,69% | 46,96% | 48,87% | 55,85% | 56,80% | 63,68% | 64,40% | 64,03% | 52,64% | 62,49%

[Line chart "Type's Portfolios": Returns (vertical axis, -20% to 100%) against Quarters (horizontal axis).]

Figure 4.1: User input classification portfolios’ accumulated returns. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds.

Figure 4.2 shows that Stalwarts was the type with the highest rate of trade success, and Turnarounds the type with the lowest, by a large margin. Stalwarts was also the type with the biggest gain in a single trade. Surprisingly, the Fast Growers type had the smallest maximum gain in a trade by a large margin; nevertheless, it also had the smallest maximum loss. The type with the biggest loss in a trade was the Turnaround type. Fast Growers and Stalwarts obtained more positive quarters than any other type, and Turnarounds got the worst result once more, with only 62,5% of positive quarters. Fast Growers got the biggest average return per quarter, more than twice the Turnarounds average.

One of the surprising results is that the Fast Growers portfolio got the best Sharpe ratio of all portfolios even though it had the biggest return, meaning that it was not only the portfolio with the best returns but also the one carrying the least risk. The Fast Growers type did not have a big drawdown compared with the other types, although the smallest drawdown was that of the Potentials type (6,81% in figure 4.2). The biggest drawdown was that of the Turnaround type.

In figure 4.3 one can see that Fast Growers and Potentials were the only types (including the B&H and the S&P500 index) that got positive results every year. It is also visible that Turnarounds got not only the biggest return in the time period (in 2013) but also the biggest loss (in 2015), showing the high volatility of that type.

Type | Number of Trades | Rate of Trade Success | Biggest Trade Return | Biggest Trade Loss | Average Time on Market (in Quarters) | Average Return Per Trade | Rate of Positive Quarters | Average Return Per Quarter | 4 Year Return | Sharpe Ratio | Drawdown
SG | 132 | 65,91% | 213,04% | -65,39% | 4,95 | 24,46% | 68,75% | 2,39% | 44,09% | 0,83 | 10,07%
SW | 243 | 78,19% | 253,67% | -74,54% | 8,51 | 36,03% | 75,00% | 2,84% | 54,33% | 0,87 | 13,77%
FG | 72 | 73,61% | 151,01% | -34,50% | 3,93 | 18,67% | 75,00% | 3,87% | 79,32% | 2,15 | 9,96%
POT | 85 | 72,94% | 226,94% | -41,83% | 5,94 | 26,76% | 68,75% | 2,90% | 55,73% | 1,41 | 6,81%
TA | 203 | 59,11% | 221,01% | -80,57% | 4,09 | 10,83% | 62,50% | 1,78% | 30,13% | 0,36 | 18,13%

Figure 4.2: User input portfolios metrics table. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds.

Figure 4.4 presents the quarterly, accumulated and yearly results of the portfolios, with the worst quarter of each portfolio marked in red (this does not mean the drawdown occurs in the same quarter, since negative quarters have a bigger impact on portfolios that have bigger accumulated returns).
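The three kinds of results in figure 4.4 appear to be linked by compounding. The sketch below assumes that accumulated returns are the compounded quarterly returns and that a yearly return is the growth between two consecutive end-of-year accumulated values; the function names are illustrative. With the B&H quarterly returns reported for 2012 (10,25%, -4,44%, 6,06%, 3,37%), compounding gives values within about 0,01 percentage points of the reported accumulated returns (10,25%, 5,35%, 11,73%, 15,50%), the remaining difference being consistent with rounding of the published figures.

```python
def accumulate(quarterly_returns):
    """Compound per-quarter returns into accumulated returns since the start."""
    accumulated, growth = [], 1.0
    for r in quarterly_returns:
        growth *= 1 + r
        accumulated.append(growth - 1)
    return accumulated


def yearly_returns(accumulated, quarters_per_year=4):
    """Return of each year, derived from end-of-year accumulated returns."""
    year_ends = accumulated[quarters_per_year - 1::quarters_per_year]
    result, previous = [], 0.0
    for acc in year_ends:
        result.append((1 + acc) / (1 + previous) - 1)
        previous = acc
    return result


# B&H quarterly returns for 2012, as reported in figure 4.4.
bh_2012 = [0.1025, -0.0444, 0.0606, 0.0337]
print([f"{a:.2%}" for a in accumulate(bh_2012)])
print([f"{y:.2%}" for y in yearly_returns(accumulate(bh_2012))])
```

This also illustrates the remark about drawdowns: the same negative quarterly return removes more accumulated return from a portfolio that has already grown more.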

4.2.1.A Conclusion

The results of Fast Growers (the best type) and Turnarounds (the worst type) were unsurprising. The same cannot be said about the other types, since without these results there was no way of knowing what kind of result to expect.

It should be mentioned that these results were obtained with thresholds that were hardwired from the start. This means that the user who defines these thresholds has access to information about all past data and can use it to decide on the thresholds. Unless the user is a financial expert (or at least knows the subject), the robustness of this classification when applied in real time with the market is uncertain, since it depends on how well the user understands the market.


Portfolio | 2012 | 2013 | 2014 | 2015
SG | 14,04% | 20,61% | 10,38% | -5,10%
SW | 14,68% | 26,33% | 12,94% | -5,68%
FG | 25,37% | 17,22% | 12,48% | 8,49%
POT | 13,94% | 21,75% | 9,62% | 2,41%
TA | 4,65% | 33,88% | 4,18% | -10,84%
B&H | 15,51% | 28,40% | 11,65% | -6,68%
S&P500 | 13,41% | 29,60% | 11,39% | -0,73%

[Bar chart "Yearly Returns": Returns (vertical axis, -20% to 40%) against Years (horizontal axis).]

Figure 4.3: Types B&H - Yearly Returns. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds.


Quarterly returns:

Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
SG | 7,91% | -4,15% | 6,86% | 3,19% | 8,97% | 1,76% | 4,87% | 3,71% | 4,51% | 0,98% | -2,52% | 7,29% | 1,54% | -1,75% | -3,95% | -0,97%
SW | 8,85% | -2,87% | 4,59% | 3,71% | 8,86% | 2,66% | 5,34% | 7,32% | 4,26% | 3,11% | -2,20% | 7,41% | 1,69% | -2,29% | -6,13% | 1,13%
FG | 14,07% | -6,48% | 8,63% | 8,18% | 2,55% | 1,06% | 4,44% | 8,31% | 6,81% | 2,58% | -4,77% | 7,80% | 7,75% | -3,22% | -2,45% | 6,65%
POT | 10,73% | -5,77% | 3,76% | 5,24% | 6,45% | -0,77% | 7,37% | 7,34% | 3,01% | 1,27% | -2,10% | 7,34% | 5,32% | -1,77% | -2,53% | 1,56%
TA | 6,11% | -4,32% | 4,80% | -1,65% | 9,28% | 5,24% | 7,93% | 7,85% | 3,51% | 1,58% | -2,74% | 1,87% | -0,44% | -3,09% | -9,24% | 1,81%
B&H | 10,25% | -4,44% | 6,06% | 3,37% | 9,13% | 2,74% | 6,47% | 7,63% | 4,60% | 2,61% | -2,71% | 6,91% | 1,97% | -2,62% | -7,34% | 1,19%
S&P500 | 12,00% | -3,29% | 5,76% | -1,01% | 10,03% | 2,36% | 4,69% | 9,92% | 1,30% | 4,69% | 0,61% | 4,39% | 0,44% | -0,23% | -6,94% | 6,45%

Accumulated returns:

Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
SG | 7,91% | 3,43% | 10,52% | 14,04% | 24,27% | 26,46% | 32,62% | 37,55% | 43,75% | 45,16% | 41,50% | 51,82% | 54,16% | 51,47% | 45,49% | 44,09%
SW | 8,85% | 5,72% | 10,58% | 14,68% | 24,83% | 28,15% | 35,00% | 44,87% | 51,04% | 55,74% | 52,32% | 63,62% | 66,37% | 62,57% | 52,60% | 54,33%
FG | 14,07% | 6,67% | 15,88% | 25,37% | 28,56% | 29,92% | 35,68% | 46,95% | 56,97% | 61,02% | 53,34% | 65,29% | 78,10% | 72,37% | 68,14% | 79,32%
POT | 10,73% | 4,34% | 8,27% | 13,94% | 21,30% | 20,37% | 29,24% | 38,72% | 42,89% | 44,70% | 41,67% | 52,06% | 60,15% | 57,31% | 53,34% | 55,73%
TA | 6,11% | 1,52% | 6,40% | 4,65% | 14,36% | 20,35% | 29,90% | 40,10% | 45,02% | 47,30% | 43,27% | 45,95% | 45,32% | 40,83% | 27,82% | 30,13%
B&H | 10,25% | 5,35% | 11,73% | 15,50% | 26,05% | 29,50% | 37,89% | 48,41% | 55,24% | 59,29% | 54,97% | 65,68% | 68,94% | 64,51% | 52,44% | 54,25%
S&P500 | 11,99% | 8,31% | 14,54% | 13,39% | 24,76% | 27,70% | 33,69% | 46,96% | 48,87% | 55,85% | 56,80% | 63,68% | 64,40% | 64,03% | 52,64% | 62,49%

Yearly returns:

Portfolio | 2012 | 2013 | 2014 | 2015
SG | 14,04% | 20,61% | 10,38% | -5,10%
SW | 14,68% | 26,33% | 12,94% | -5,68%
FG | 25,37% | 17,22% | 12,48% | 8,49%
POT | 13,94% | 21,75% | 9,62% | 2,41%
TA | 4,65% | 33,88% | 4,18% | -10,84%
B&H | 15,39% | 28,50% | 11,63% | -6,90%
S&P500 | 13,41% | 29,60% | 11,39% | -0,73%

Figure 4.4: Investment by Type table. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds

4.2.2 Case Study II - Clustering

This case study presents the results of clustering stocks by their revenue growth and size (total assets) in the Size×Growth plane, with the Clustering GA algorithm described in section 3.3.6.

The algorithm was run 10 times, and the solution that achieved the median value of the averages of all quarters’ fitnesses was used, as shown in table 4.1.

Table 4.1: Average fitness per quarter of the clustering algorithm

 | Best | Worst | Median
Calinski-Harabasz | 806,4 | 665,37 | 767,8
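Table 4.1 reports the fitness as a Calinski-Harabasz score. For reference, the sketch below computes the standard Calinski-Harabasz index, the ratio of between-cluster to within-cluster dispersion, each normalized by its degrees of freedom. It is a generic version that uses the mean of each cluster as its center; whether the fitness in this work uses the GA-optimized cluster positions instead is not stated in this section, so that detail is an assumption. The function name and the sample points are illustrative.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index of a labelled set of points (higher is better).

    Requires at least two clusters, more points than clusters, and a
    non-zero within-cluster dispersion.
    """
    n, dims = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def centroid(group):
        return [sum(p[d] for p in group) / len(group) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = within = 0.0
    for group in clusters.values():
        center = centroid(group)
        between += len(group) * sq_dist(center, overall)    # spread of the centers
        within += sum(sq_dist(p, center) for p in group)    # spread inside a cluster
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well separated hypothetical clusters in a (size, growth) plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(calinski_harabasz(points, ["A", "A", "B", "B"]))  # 50.0
```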

The strategy used in this case study is similar to the one from Case Study I: a portfolio is built for each cluster of stocks. As in Case Study I, a portfolio buys a stock in the quarter in which it is classified as belonging to the cluster, and sells it in the quarter in which it stops belonging to that cluster.

Since clusters are ordered, cluster A is always the cluster nearest to the origin (point (0,0)) and cluster E is always the cluster furthest from the origin. Also, recall that the normalization applied to the Size×Growth plane is such that, in terms of distances, 1.000.000.000$ in size equals 1% in revenue growth.
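The normalization and the ordering of clusters can be sketched as follows. The function names and the cluster centers are hypothetical; the sketch assumes that distances in the plane are Euclidean and that cluster names are assigned by increasing distance of the cluster center from the origin.

```python
import math


def to_plane(total_assets_usd, revenue_growth_pct):
    """Map a company to the plane: 1.000.000.000$ of size counts as 1% of growth."""
    return (total_assets_usd / 1e9, revenue_growth_pct)


def name_clusters(centers):
    """Label cluster centers A, B, C, ... by increasing distance from (0, 0)."""
    ordered = sorted(centers, key=lambda c: math.hypot(c[0], c[1]))
    return {chr(ord("A") + i): c for i, c in enumerate(ordered)}


# Hypothetical cluster centers, already in plane coordinates (size, growth).
centers = [(200.0, 5.0), (10.0, 3.0), (900.0, 2.0), (50.0, 20.0), (30.0, -1.0)]
print(name_clusters(centers))
print(to_plane(2.5e9, 12.0))  # (2.5, 12.0)
```

With this scale, a company with hundreds of billions in total assets lies far from the origin regardless of its growth, which is why size dominates the assignment to the outer clusters.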

We can see in figure 4.5 that cluster E has the best returns overall, with a 4 year ROI of 77,34%, above the S&P500 return of 62,49% and the dataset B&H of 54,54%. One should also notice that, had the period been 3 years, cluster E had returns of 100,83% at the beginning of 2015. In this time period clusters C and D have the worst results overall, with a ROI of 45,02% and 34,37% respectively. All clusters except B and E underperformed the B&H. The fact that cluster E got the best results was expected, since the further away from the origin, the bigger the company, and possibly the bigger its revenue growth (although this was not the case in cluster E, which got its classification exclusively from its value on the size axis).

Figure 4.6 shows that cluster E has very few trades in this period and a significantly high time in the market (compared with other clusters), meaning that companies in cluster E rarely move to other clusters. Cluster B, however, has a high number of trades (relative to clusters C, D and E) and a low average time in the market, meaning that more companies move from cluster B into other clusters (and vice versa). Cluster A has almost the same number of trades as cluster B but almost double the time in the market, meaning that the cluster is simply bigger than B (companies in cluster A change cluster less often than companies in cluster B).

The rate of trade success is high in every cluster, with 3 clusters above an 80% rate of trade success and cluster E reaching almost 90%. Cluster A has the biggest gain in a trade, and cluster D has the smallest maximum gain in a trade. Cluster D also has the biggest loss in a trade, while cluster E has the smallest maximum loss.

Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
A | 9,13% | 3,96% | 10,33% | 14,36% | 24,87% | 27,79% | 36,86% | 46,58% | 53,08% | 56,58% | 52,68% | 63,90% | 66,93% | 60,47% | 51,00% | 51,87%
B | 11,64% | 7,31% | 12,49% | 17,81% | 24,72% | 23,45% | 28,92% | 38,72% | 47,30% | 54,08% | 47,79% | 53,25% | 61,41% | 60,19% | 55,58% | 56,06%
C | 10,06% | 4,45% | 7,59% | 8,75% | 19,24% | 26,03% | 30,88% | 42,32% | 48,97% | 54,63% | 51,91% | 59,03% | 57,79% | 59,05% | 43,94% | 45,02%
D | 9,46% | 15,49% | 22,17% | 19,05% | 25,65% | 30,77% | 32,35% | 43,80% | 44,59% | 40,37% | 35,78% | 43,74% | 46,18% | 47,82% | 32,55% | 34,37%
E | 28,17% | 21,00% | 36,22% | 49,04% | 59,20% | 70,12% | 78,24% | 94,57% | 91,90% | 100,64% | 97,29% | 104,26% | 96,95% | 101,68% | 77,85% | 77,34%
B&H | 10,14% | 5,25% | 11,77% | 15,51% | 26,10% | 29,49% | 37,84% | 48,32% | 55,03% | 59,04% | 54,74% | 65,60% | 68,90% | 64,51% | 52,77% | 54,54%
S&P500 | 11,99% | 8,31% | 14,54% | 13,39% | 24,76% | 27,70% | 33,69% | 46,96% | 48,87% | 55,85% | 56,80% | 63,68% | 64,40% | 64,03% | 52,64% | 62,49%

[Line chart "Cluster's Portfolios": Returns (vertical axis, -10,00% to 110,00%) against Quarters (horizontal axis).]

Figure 4.5: Clustering classification portfolios’ accumulated returns. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.

Cluster E also has the greatest drawdown by far: its returns fell from 101,68% in the second quarter of 2015 to 77,34% in the fourth quarter of 2015, as seen in figure 4.5. This explains the low value of its Sharpe ratio. Cluster B is the least risky portfolio, having the lowest drawdown and the highest Sharpe ratio.

Figure 4.7 shows a yearly decrease in the returns of cluster E. One can also see that cluster B was the only portfolio to have positive returns every year.

Figures 4.8 and 4.9 show how companies are divided into clusters. It is remarkable how sparse the data is along the Size axis. Cluster A, as said before, is the biggest cluster, containing most of the companies, while cluster E has only 4 companies in the fourth quarter of 2012 and 3 companies in the fourth quarter of 2013. This reduction is due to a shift to the right of the cluster center, which made the leftmost company in cluster E move into cluster D.

Figures 4.10 and 4.11 show a zoom of the first 3 clusters. The main difference between cluster A and cluster B is in the Growth axis, which explains the results of cluster B: the fact that it got the best Sharpe ratio and the smallest drawdown can now be related to revenue growth. One can also see that not only did the cluster center move, but the points belonging to cluster B also appear in different locations. This happens because revenue growth varies more than total assets, which explains why cluster B had the lowest average time in the market.

Cluster | Number of Trades | Rate of Trade Success | Biggest Gain | Biggest Loss | Average Time on Market (in Quarters) | Average Return Per Trade | Rate of Positive Quarters | Average Return Per Quarter | 4 Year Return | Sharpe Ratio | Drawdown
Cluster A | 104 | 84,62% | 202,85% | -48,88% | 6,64 | 46,58% | 75,00% | 2,76% | 51,87% | 0,76 | 15,93%
Cluster B | 107 | 74,77% | 128,38% | -50,11% | 3,55 | 14,75% | 68,75% | 2,91% | 56,06% | 1,50 | 6,29%
Cluster C | 65 | 73,85% | 167,25% | -57,45% | 8,80 | 29,69% | 75,00% | 2,48% | 45,02% | 0,60 | 15,11%
Cluster D | 31 | 80,65% | 107,34% | -84,16% | 6,23 | 23,36% | 75,00% | 1,99% | 34,37% | 0,51 | 15,27%
Cluster E | 9 | 88,89% | 150,20% | -36,59% | 7,33 | 66,46% | 62,50% | 4,00% | 77,34% | 0,66 | 23,83%

Figure 4.6: Clustering portfolios metrics table. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.

4.2.2.A Conclusion

One can conclude that the biggest companies of the dataset got huge returns in bull markets and huge losses in bear markets, with small Sharpe ratios and huge drawdowns. Big companies outside this small set of huge companies (the ones from cluster D) have the worst returns of all. It can also be concluded that revenue growth can be used as an indicator to pick stocks with relatively good Sharpe ratios.

With the scale and number of clusters used, the grouping was better in size than in growth. Due to the amount and sparsity of the data, more clusters would be necessary to obtain a better clustering along the growth axis (especially for high values of size).

This classification is more reliable than the one made with user input, since it was made and optimized automatically, making it less prone to human error and therefore probably more robust if used in real time with the market (its behavior is going to change less than that of the user input classification).


Portfolio | 2012 | 2013 | 2014 | 2015
A | 14,36% | 28,17% | 11,81% | -7,34%
B | 17,81% | 17,75% | 10,47% | 1,83%
C | 8,75% | 30,87% | 11,75% | -8,81%
D | 19,05% | 20,79% | -0,04% | -6,52%
E | 49,04% | 30,54% | 4,98% | -13,18%
B&H | 15,51% | 28,40% | 11,65% | -6,68%
S&P500 | 13,41% | 29,60% | 11,39% | -0,73%

[Bar chart "Yearly Returns": Returns (vertical axis, -20,00% to 60,00%) against Years (horizontal axis).]

Figure 4.7: Cluster’s portfolios yearly returns. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.

[Scatter plot "Clusters 2012 Q4": Growth [%] (vertical axis, -80 to 100) against Size [1.000.000.000$] (horizontal axis, 0 to 1000), showing the companies of clusters A to E and the five cluster centers.]

Figure 4.8: Cluster’s representation in the whole plane in the fourth quarter of 2012. Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster.


[Scatter plot "Clusters 2013 Q4": Growth [%] (vertical axis, -60 to 100) against Size [1.000.000.000$] (horizontal axis, 0 to 1000), showing the companies of clusters A to E and the five cluster centers.]

Figure 4.9: Cluster’s representation in the plane in the fourth quarter of 2013. Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster.

[Scatter plot "Clusters 2012 Q4": Growth [%] (vertical axis, -80 to 100) against Size [1.000.000.000$] (horizontal axis, 0 to 180), showing the companies of clusters A to C and the three cluster centers.]

Figure 4.10: Cluster’s representation in the plane in the fourth quarter of 2012, zoomed in on the first three clusters. Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster.


[Scatter plot "Clusters 2013 Q4": Growth [%] (vertical axis, -60 to 100) against Size [1.000.000.000$] (horizontal axis, 0 to 200), showing the companies of clusters A to C and the three cluster centers.]

Figure 4.11: Cluster’s representation in the plane in the fourth quarter of 2013, zoomed in on the first three clusters. Cluster centers have the key "Cluster X", where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster.


Quarterly returns:

Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
Cluster A | 9,13% | -4,74% | 6,13% | 3,65% | 9,19% | 2,33% | 7,10% | 7,10% | 4,43% | 2,29% | -2,50% | 7,35% | 1,85% | -3,87% | -5,90% | 0,58%
Cluster B | 11,64% | -3,88% | 4,82% | 4,73% | 5,87% | -1,02% | 4,43% | 7,60% | 6,18% | 4,60% | -4,08% | 3,70% | 5,32% | -0,75% | -2,88% | 0,30%
Cluster C | 10,06% | -5,10% | 3,01% | 1,08% | 9,64% | 5,69% | 3,85% | 8,74% | 4,67% | 3,80% | -1,76% | 4,69% | -0,78% | 0,80% | -9,50% | 0,75%
Cluster D | 9,46% | 5,51% | 5,78% | -2,56% | 5,55% | 4,08% | 1,20% | 8,65% | 0,55% | -2,92% | -3,27% | 5,86% | 1,70% | 1,12% | -10,33% | 1,37%
Cluster E | 28,17% | -5,59% | 12,58% | 9,41% | 6,81% | 6,86% | 4,77% | 9,16% | -1,37% | 4,55% | -1,67% | 3,53% | -3,58% | 2,40% | -11,82% | -0,29%
B&H | 10,14% | -4,45% | 6,20% | 3,35% | 9,17% | 2,69% | 6,45% | 7,60% | 4,52% | 2,59% | -2,70% | 7,02% | 1,99% | -2,60% | -7,14% | 1,16%
S&P500 | 12,00% | -3,29% | 5,76% | -1,01% | 10,03% | 2,36% | 4,69% | 9,92% | 1,30% | 4,69% | 0,61% | 4,39% | 0,44% | -0,23% | -6,94% | 6,45%

Accumulated returns:

Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
Cluster A | 9,13% | 3,96% | 10,33% | 14,36% | 24,87% | 27,79% | 36,86% | 46,58% | 53,08% | 56,58% | 52,68% | 63,90% | 66,93% | 60,47% | 51,00% | 51,87%
Cluster B | 11,64% | 7,31% | 12,49% | 17,81% | 24,72% | 23,45% | 28,92% | 38,72% | 47,30% | 54,08% | 47,79% | 53,25% | 61,41% | 60,19% | 55,58% | 56,06%
Cluster C | 10,06% | 4,45% | 7,59% | 8,75% | 19,24% | 26,03% | 30,88% | 42,32% | 48,97% | 54,63% | 51,91% | 59,03% | 57,79% | 59,05% | 43,94% | 45,02%
Cluster D | 9,46% | 15,49% | 22,17% | 19,05% | 25,65% | 30,77% | 32,35% | 43,80% | 44,59% | 40,37% | 35,78% | 43,74% | 46,18% | 47,82% | 32,55% | 34,37%
Cluster E | 28,17% | 21,00% | 36,22% | 49,04% | 59,20% | 70,12% | 78,24% | 94,57% | 91,90% | 100,64% | 97,29% | 104,26% | 96,95% | 101,68% | 77,85% | 77,34%
B&H | 10,14% | 5,25% | 11,77% | 15,51% | 26,10% | 29,49% | 37,84% | 48,32% | 55,03% | 59,04% | 54,74% | 65,60% | 68,90% | 64,51% | 52,77% | 54,54%
S&P500 | 11,99% | 8,31% | 14,54% | 13,39% | 24,76% | 27,70% | 33,69% | 46,96% | 48,87% | 55,85% | 56,80% | 63,68% | 64,40% | 64,03% | 52,64% | 62,49%

Yearly returns:

Portfolio | 2012 | 2013 | 2014 | 2015
Cluster A | 14,36% | 28,17% | 11,81% | -7,34%
Cluster B | 17,81% | 17,75% | 10,47% | 1,83%
Cluster C | 8,75% | 30,87% | 11,75% | -8,81%
Cluster D | 19,05% | 20,79% | -0,04% | -6,52%
Cluster E | 49,04% | 30,54% | 4,98% | -13,18%
B&H | 15,51% | 28,40% | 11,65% | -6,68%
S&P500 | 13,41% | 29,60% | 11,39% | -0,73%

Figure 4.12: Investment by Type table. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.

4.2.3 Case Study III - Using GAs to optimize FI

This case study presents the results of using the FA GA described in section 3.3.6 to optimize the weights given to Fundamental Indicators (FI). It studies the results of applying GAs to give buy and sell signals on each of the groups of case studies 1 and 2, and on the whole dataset.

Since the GA uses the last year of data on training, these portfolios will have a training period of 2

years (2010 and 2011). The first year of training is used to define the type/cluster of stocks, and the

second year is used to train the GA.

The algorithm was run 10 times for each model. For the models that use the clustering algorithm, the clustering algorithm was run twice, so for each portfolio that uses clustering, 5 runs used the first clustering run and the other 5 used the second. Results were compared between them. Table 4.2 shows the best, worst and median run of each portfolio.

Table 4.2: Best, worst and median run of each portfolio. For the type portfolios: SG - Slow Growers, SW - Stalwarts, FG - Fast Growers, POT - Potentials, TA - Turnarounds. The keys A, B, C, D and E refer to the clusters.

Portfolio | Best Run | Worst Run | Median Run
GA | 43,41% | 36,05% | 41,30%
SG-GA | 62,78% | 22,72% | 46,03%
SW-GA | 55,33% | 39,21% | 48,67%
FG-GA | 72,11% | 27,18% | 41,27%
POT-GA | 83,76% | 30,54% | 43,71%
TA-GA | 37,39% | 10,65% | 21,43%
A-GA | 49,26% | 27,96% | 42,09%
B-GA | 63,64% | 37,72% | 51,51%
C-GA | 51,52% | 32,33% | 42,13%
D-GA | 43,50% | 18,64% | 36,93%
E-GA | 112,33% | 81,55% | 104,68%

Only one portfolio using user input classification (type Slow Growers) and two portfolios using clustering (clusters D and E) showed improvements. Among the remaining portfolios, those optimizing weights on the clusters of stocks got results closer to those of the previous case studies. The results show how the GA improved a cluster portfolio (cluster E) and a classification portfolio (Slow Growers) relative to the original portfolios of case studies 1 and 2.

Figure 4.13 shows the results of the median runs for cluster E and the Slow Growers type. These were the only groups (apart from cluster D) whose performance increased when compared to the results obtained in case studies 1 and 2. In the E-GA portfolio in particular there was an increase of 27,34 percentage points in the 4 year return (from 77,34% to 104,68%), most of it in 2015 (see figure 4.15). The GA limited the losses of cluster E in 2015 and consequently increased the Sharpe ratio.

One can also see in figure 4.14 that, since companies were only sold on the signal given by the GA, the E-GA portfolio achieved a 100% rate of trade success and increased the rate of positive quarters to 87,50%.


Portfolio | 2012 Q1 | 2012 Q2 | 2012 Q3 | 2012 Q4 | 2013 Q1 | 2013 Q2 | 2013 Q3 | 2013 Q4 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2015 Q2 | 2015 Q3 | 2015 Q4
SG-GA | 6,65% | 4,50% | 12,73% | 16,63% | 23,62% | 24,44% | 31,57% | 39,78% | 45,98% | 46,81% | 44,12% | 54,93% | 62,32% | 56,63% | 45,87% | 46,03%
SG | 7,91% | 3,43% | 10,52% | 14,04% | 24,27% | 26,46% | 32,62% | 37,55% | 43,75% | 45,16% | 41,50% | 51,82% | 54,16% | 51,47% | 45,49% | 44,09%
E-GA | 29,50% | 30,50% | 41,65% | 49,71% | 61,30% | 72,99% | 74,92% | 89,28% | 94,96% | 105,01% | 93,62% | 105,34% | 118,52% | 125,49% | 103,61% | 104,68%
E | 28,17% | 21,00% | 36,22% | 49,04% | 59,20% | 70,12% | 78,24% | 94,57% | 91,90% | 100,64% | 97,29% | 104,26% | 96,95% | 101,68% | 77,85% | 77,34%
GA | 7,67% | 3,89% | 7,64% | 10,12% | 19,05% | 20,35% | 27,93% | 36,93% | 42,23% | 46,18% | 42,40% | 51,94% | 53,70% | 50,07% | 36,15% | 41,30%

[Line chart "Type's Portfolios": Returns (vertical axis, -20% to 140%) against Quarters (horizontal axis).]

Figure 4.13: GA optimization portfolios. The key "SG" represents the portfolio of type Slow Growers and the key "E" represents the portfolio of cluster E, as constructed in case studies 1 and 2. The suffix "-GA" represents the portfolios using GA to optimize FI weights. The key "GA" represents the GA algorithm running over the whole dataset.

Figure 4.14 also shows that the SG-GA portfolio increased the 4 year return by only 1,94 percentage points compared with the SG portfolio from case study 1. It substantially reduced the number of trades (by 78,03%, from 132 trades in portfolio SG to 29 trades in portfolio SG-GA). The rate of positive quarters increased (from 68,75% to 75%), but the drawdown also increased (from 10,07% to 16,45%). Figure 4.13 shows that the SG-GA portfolio rose more than the SG portfolio up to the first quarter of 2015 but fell right afterwards, to the point that it almost crossed below the SG portfolio, and figure 4.15 shows how small the difference between the two portfolios is in terms of yearly returns.
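As a check on the comparisons quoted in this case study, the differences can be recomputed from the values in figures 4.2, 4.6, 4.13 and 4.14. The gains in return are differences between 4 year returns and are therefore expressed in percentage points (p.p.), while the reduction in the number of trades is a relative change:

```latex
\begin{align*}
\text{E-GA vs. E:}   \quad & 104{,}68\% - 77{,}34\% = 27{,}34 \text{ p.p.} \\
\text{SG-GA vs. SG:} \quad & 46{,}03\% - 44{,}09\% = 1{,}94 \text{ p.p.} \\
\text{Trades, SG-GA vs. SG:} \quad & \frac{132 - 29}{132} = \frac{103}{132} \approx 78{,}03\%
\end{align*}
```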

4.2.3.A Conclusions

One can conclude from this case study that Fundamental Indicators optimization gives better results when applied to small sets of companies (as was the case for clusters E and D).

Also, the GA used in this work greatly reduced the number of trades made in all portfolios, reducing potential losses but also potential gains. In the case of the E-GA portfolio, the major difference from the cluster E portfolio presented in case study 2 was the reduction of losses in 2015.

Portfolio | Number of Trades | Rate of Trade Success | Biggest Trade Return | Biggest Trade Loss | Average Time on Market (in Quarters) | Average Return Per Trade | Rate of Positive Quarters | Average Return Per Quarter | 4 Year Return | Sharpe Ratio | Drawdown
GA | 368 | 72,83% | 276,67% | -69,44% | 5,49 | 24,02% | 75,00% | 2,29% | 41,30% | 0,67 | 17,55%
SG-GA | 29 | 68,97% | 264,07% | -35,46% | 3,79 | 29,16% | 75,00% | 2,49% | 46,03% | 0,84 | 16,45%
E-GA | 8 | 100,00% | 117,26% | - | 3,50 | 45,94% | 87,50% | 4,87% | 104,68% | 0,98 | 21,89%

Figure 4.14: GA portfolios metrics table. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.

One can also conclude that the clustering classification is better than the user input one, since it groups companies with more similarities than the user input classification does, which allows the FA GA algorithm to optimize better trading rules.


Portfolio | 2012 | 2013 | 2014 | 2015
SG-GA | 16,63% | 19,85% | 10,84% | -5,74%
SG | 14,04% | 20,61% | 10,38% | -5,10%
E-GA | 49,71% | 26,43% | 8,49% | -0,32%
E | 49,04% | 30,54% | 4,98% | -13,18%
GA | 10,12% | 24,35% | 10,96% | -7,00%

[Bar chart "Yearly Returns": Returns (vertical axis, -20% to 40%) against Years (horizontal axis).]

Figure 4.15: FA GA portfolios’ yearly returns. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.


5 Conclusion

Contents

5.1 Conclusions

5.2 Achievements

5.3 System limitations

5.4 Future Work

This work describes the implementation of a system that classifies stocks in the Size×Growth plane with two methods, both using the same Fundamental Analysis metrics, and that uses Fundamental Indicators for trading simulation. Various Artificial Intelligence methods and financial strategies were analyzed and several related works were presented. The implementation of the system was described, showing the methods used, the decisions made and the results of the evaluation of the system. This chapter summarizes the conclusions drawn from the results obtained, the system’s achievements and limitations, and possible future work.

5.1 Conclusions

The results obtained in this work allow the conclusion that genetic algorithms are suitable for large
amounts of computation and for the analysis of large amounts of data, having successfully classified
stocks according to their Fundamental Analysis. The classification methods used were successful, both
having achieved returns above the S&P500 index in the expected groups. The genetic algorithm for
weight optimization of fundamental indicators improved two cluster-based portfolios and one
classification-based portfolio. It also showed great success when applied to the cluster with the highest
returns, improving the returns by 27,34%. Better results in this optimization would require a larger
number of clusters, so that the algorithm could optimize Fundamental Indicator weights in smaller
sets of stocks. These results lead to the conclusion that automatic classification using genetic algorithms
is a better way of classifying stocks than using human input, since it is not prone to human error and
does a more careful analysis of the data, grouping companies with more similar behaviors.
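To make the idea of weight optimization of fundamental indicators more concrete, the following is a minimal sketch of a genetic algorithm that evolves one weight per indicator. It is not the implementation described in this work: the data is synthetic, the fitness (mean return of the top-ranked stocks) is a hypothetical stand-in for the trading simulator, and the operators (tournament selection, uniform crossover, Gaussian mutation, elitism) and all function names are assumptions chosen for illustration.

```python
import random


def make_synthetic_stocks(n_stocks, n_indicators, rng):
    """Synthetic (indicators, next-period return) pairs; NOT real market data."""
    stocks = []
    for _ in range(n_stocks):
        indicators = [rng.uniform(0.0, 1.0) for _ in range(n_indicators)]
        # Hypothetical ground truth: only the first indicator drives the return.
        ret = 0.2 * indicators[0] + rng.gauss(0.0, 0.02)
        stocks.append((indicators, ret))
    return stocks


def fitness(weights, stocks, top_n=5):
    """Mean return of the top_n stocks ranked by the weighted indicator score."""
    ranked = sorted(
        stocks,
        key=lambda s: sum(w * x for w, x in zip(weights, s[0])),
        reverse=True,
    )
    return sum(ret for _, ret in ranked[:top_n]) / top_n


def tournament(population, scores, rng, size=3):
    """Return the fittest of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]


def evolve(stocks, n_indicators, rng, pop_size=30, generations=40, mutation_rate=0.2):
    """Evolve indicator weights in [0, 1]; returns (best weights, best-fitness history)."""
    population = [[rng.random() for _ in range(n_indicators)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(ind, stocks) for ind in population]
        best_idx = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best_idx])
        next_pop = [population[best_idx][:]]  # elitism: keep the best unchanged
        while len(next_pop) < pop_size:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            # Uniform crossover: each gene comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation, clamped back into [0, 1].
            child = [
                min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                if rng.random() < mutation_rate else g
                for g in child
            ]
            next_pop.append(child)
        population = next_pop
    scores = [fitness(ind, stocks) for ind in population]
    best_idx = max(range(pop_size), key=lambda i: scores[i])
    history.append(scores[best_idx])
    return population[best_idx], history


rng = random.Random(42)
stocks = make_synthetic_stocks(n_stocks=40, n_indicators=4, rng=rng)
best_weights, history = evolve(stocks, n_indicators=4, rng=rng)
```

Because the elite individual is copied unchanged into each new generation, the best fitness in `history` never decreases. Running such an optimization on a smaller, more homogeneous group of stocks corresponds to the point made above about using more clusters.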

5.2 Achievements

The main achievements in this work were the following:

• The implementation from scratch of an architecture for a system containing data processing, genetic
algorithms and a trading simulator.

• A classification method using Fundamental Analysis and a Genetic Algorithm to optimize the clusters’
positions, with a fixed number of clusters.

• The comparison between a user input classification method and an automatic classification
method.

• A long term investment strategy based on the implemented classification methods.

• The exclusive use of Fundamental Analysis and Fundamental Indicators with Genetic Algorithms.

5.3 System limitations

• The main limitation of this work is that the system only trades quarterly, ignoring the periods in
between, and therefore does not take into account the best time to buy or sell.

• Another limitation is the use of only 272 of the 500 stocks of the S&P500 index, due to data availability.
All 500 companies of the index should be included, and dynamic tracking of the companies that
leave or enter the index should be added to this work.

5.4 Future Work

• Try different numbers of clusters, and optimize the number of clusters created to maximize the
Calinski-Harabasz index.

• Use Technical Analysis to give the buy and sell signals of a portfolio. If TA buy/sell signals are trained
within a specific group, they should be more effective than when trained on the whole dataset.

• Use macroeconomic data and cycle study in order to know which group has more potential within
a certain economic context.

• Tackle the Markowitz portfolio composition problem, in a way that creates a portfolio that
optimizes the number of positions open in each group.

• Implement more sophisticated Adaptive Genetic Algorithm techniques to allow the system to
adapt more quickly and obtain better results, in both trading and classification.
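The first item above proposes choosing the number of clusters by maximizing the Calinski-Harabasz index. As a minimal sketch, the index can be computed as CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster dispersion, W the within-cluster dispersion, k the number of clusters and n the number of points. The points below are hypothetical (size, growth) coordinates, not data from this work, and the candidate labelings are written by hand; in the system they would come from the clustering algorithm.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """CH index: higher values mean tighter, better-separated clusters."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n clusters")
    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical (size, growth) coordinates forming three visible groups.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.8, 0.9),
]

# Hand-written candidate labelings standing in for clustering results.
candidates = {
    "k=2": [0, 0, 0, 1, 1, 1, 1, 1, 1],
    "k=3": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "k=3 (mixed)": [0, 1, 2, 0, 1, 2, 0, 1, 2],
}
scores = {name: calinski_harabasz(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
```

Here the labeling that matches the three visible groups scores highest, so an outer loop over candidate values of k could keep the clustering with the largest index.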

A List of Stocks used

This work used a subset of 272 of the 500 companies of the S&P500 index. The stock tickers of these
companies are the following:

• MMM

• ABT

• ADBE

• AES

• AET

• AFL

• A

• APD

• AKAM

• AA

• ALL

• MO

• AMZN

• AEE

• AEP

• AIG

• AMP

• ABC

• AMGN

• APH

• ADI

• AON

• AAPL

• AMAT

• ADM

• AIZ

• T

• ADSK

• ADP

• AZO

• AVB

• AVY

• BHI

• BLL

• BCR

• BAX

• BDX

• BBBY

• BBY

• BIIB

• HRB

• BA

• BSX

• CHRW

• CPB

• COF

• CAH

• CBG

• CBS

• CELG

• CNP

• CTL

• CF

• SCHW

• CVX

• CTAS

• CSCO

• CLX

• COH

• CTSH

• CL

• CMA

• CAG

• COP

• ED

• STZ

• GLW

• COST

• CSX

• CMI

• CVS

• DHI

• DRI

• DE

• XRAY

• DOV

• DOW

• DTE

• DUK

• DNB

• ETFC

• EMN

• ETN

• EBAY

• ECL

• EA

• EMR

• EQT

• EQR

• EL

• ES

• EXC

• EXPE

• EXPD

• XOM

• FDX

• FITB

• FSLR

• FLIR

• FLS

• FLR

• FMC

• FTI

• F

• BEN

• FCX

• GPS

• GIS

• GPC

• GILD

• GS

• GT

• GWW

• HAL

• HAR

• HAS

• HCP

• HP

• HON

• HRL

• HPQ

• HUM

• HBAN

• ITW

• IR

• INTC

• IBM

• IPG

• IFF

• INTU

• ISRG

• IVZ

• JCI

• JNPR

• K

• KEY

• KMB

• KLAC

• LH

• LM

• LEG

• LEN

• LLY

• LLTC

• L

• MTB

• MRO

• MAR

• MAS

• MAT

• MKC

• MCD

• MCK

• MJN

• MDT

• MRK

• MET

• MCHP

• MSFT

• TAP

• MDLZ

• MON

• MYL

• NDAQ

• NOV

• NTAP

• NWL

• NEM

• NKE

• NI

• JWN

• NSC

• NOC

• NRG

• NVDA

• ORLY

• OXY

• OMC

• OKE

• ORCL

• OI

• PCAR

• PH

• PDCO

• PAYX

• PEP

• PKI

• PCG

• PM

• PBI

• PNC

• PPG

• PPL

• PX

• PCLN

• PFG

• PG

• PGR

• PEG

• PHM

• PWR

• QCOM

• DGX

• RRC

• RTN

• RHT

• RSG

• RAI

• RHI

• ROK

• COL

• ROP

• R

• CRM

• SLB

• SNI

• SRE

• SHW

• SPG

• SJM

• SNA

• SO

• SWN

• STJ

• SWK

• SPLS

• SBUX

• STT

• SRCL

• SYK

• STI

• SYY

• TROW

• TDC

• TSO

• TXN

• TXT

• HSY

• TRV

• TMO

• TIF

• TWX

• TJX

• TMK

• TSS

• TSN

• UNH

• UPS

• UTX

• UNM

• URBN

• VFC

• VAR

• VTR

• VRSN

• VZ

• VNO

• VMC

• WMT

• DIS

• WM

• ANTM

• WU

• WHR

• WFM

• WEC

• WYN

• WYNN

• XEL

• XRX

• XL

• YHOO
