
  • 8/17/2019 ViscoVery SOM (clustering)

    1/173

CREDIT RATING PREDICTION
USING SELF-ORGANIZING MAPS

Visually exploring and constructing a quantitative model

Roger P.G.H. Tan


CREDIT RATING PREDICTION
USING SELF-ORGANIZING MAPS

Visually exploring and constructing a quantitative model

Roger P.G.H. Tan
Student nr. 140033
Erasmus University Rotterdam
Faculty of Economics
July 2000


contents

contents iv
preface viii
1 introduction 1
1.1 Overview 2
1.2 Research domain 4
1.2.1 Bond ratings 4
1.2.2 Financial data and ratings 4
1.2.3 Self-Organizing Maps 5
1.3 Research topics 6
2 credit ratings 9
2.1 Credits and credit ratings 10
2.1.1 Bonds 10
2.1.2 Credits 11
2.1.3 Credit ratings 12
2.1.4 Ratings and default risk 13
2.2 The S&P credit rating process 15
2.2.1 Process steps 15
2.3 Financial statement analysis 18
2.3.1 Financial statement 18
2.3.2 Financial ratios 18
2.3.3 Balance sheet and income statement 19
2.3.4 Used ratios 20
2.4 Summary 23
3 self-organizing maps 25
3.1 Knowledge discovery 26
3.1.1 Introduction 26
3.1.2 Knowledge discovery process 26
3.1.3 Description and prediction 27
3.2 Projection and clustering techniques 28
3.2.1 Linear projection 28
3.2.2 Non-linear projection 29
3.2.3 Hierarchical clustering 30
3.2.4 Non-hierarchical clustering 30
3.3 Classification techniques 31
3.3.1 Linear regression 31
3.3.2 Ordered logit 31
3.3.3 Artificial neural networks 32
3.4 Self-Organizing Maps 33
3.4.1 Introduction 33
3.4.2 Overview 34
3.5 SOM projection 36
3.5.1 The self-organization process 36
3.5.2 A two-dimensional example 37
3.5.3 Mathematical description 39
3.5.4 A three-dimensional example 40
3.6 SOM visualization and clustering 41
3.6.1 Maps 41
3.6.2 Map quality 43
3.6.3 Clusters 45
3.6.4 Cluster quality 47
3.6.5 Map settings 47
3.7 SOM interpretation and evaluation 50
3.7.1 Description 50
3.7.2 Prediction 52
3.8 SOM questions and answers 55
3.9 Summary 57
4 descriptive analysis 59
4.1 Basic data analysis 60
4.1.1 Data selection 60
4.1.2 Pre-processing & transformation 64
4.2 Clustering companies 67
4.2.1 Creating suitable maps 67
4.2.2 Intermediate results 68
4.2.3 Results 70
4.3 Comparing S&P ratings 72
4.3.1 Associating ratings 72
4.3.2 Measuring the goodness of fit 73
4.3.3 Results 75
4.4 Sensitivity analysis 78
4.4.1 Cluster coincidence plots 78
4.4.2 Results 79
4.5 Benchmark 82
4.5.1 Principal Component Analysis 82
4.5.2 Results 82
4.5.3 Comparison with SOM 84
4.6 Summary 85
5 classification model 87
5.1 Model set-up 88
5.1.1 Training and prediction 88
5.1.2 Data 88
5.1.3 The prediction process 89
5.1.4 Ratings distribution 89
5.1.5 Measuring performance 91
5.2 Model construction 95
5.2.1 Initial model 95
5.2.2 Variable reduction 96
5.2.3 Sensitivity analysis 99
5.2.4 Results 102
5.3 Model validation 104
5.3.1 Comparison with constant prediction 104
5.3.2 Comparison with random prediction 104
5.3.3 Classifications per rating class 106
5.3.4 Equalized ratings distribution 106
5.4 Benchmark 108
5.4.1 Linear regression 108
5.4.2 Ordered logit 108
5.4.3 Results & comparison with SOM 109
5.5 Out-of-sample test 112
5.5.1 Results for test set 112
5.5.2 Results for older historical periods 113
5.5.3 Linking spreads 114
5.6 Summary 117
6 conclusions 119
6.1 Conclusions 120
6.2 Further research 122
7 bibliography 123
appendix 125
I Artificial neural networks 126
II Iterations of the SOM algorithm 128
III SOM example: Rectal muscle sizes 131
IV SOM example: Customer segmentation 134
V Statistical measures and tests 138
VI Descriptive analysis 140
VII Classification model 163


preface

This master's thesis forms the conclusion to my study of Econometrics, with specialization Business Oriented Computer Science, at the Erasmus University of Rotterdam. It was written during my internship at the Quantitative Research (QR) department of the Rotterdam-based asset manager Robeco Group. My time at the Robeco Group has been very enjoyable, and the combination of practical research and writing at the same time has proven to be a very relaxed and sure way of writing a thesis. I can recommend this to everyone in the final stage of his or her study.

This thesis is targeted at readers from two different scientific areas (computer science and financial econometrics), so some concepts are treated more extensively than may at first seem necessary. Considerable time was also spent making this thesis into an attractive package, but at all times I have striven to keep looks and content in good balance.

Naturally I could not have written this without the comments and encouragement I received from many people, some of whom I would like to mention especially: First and foremost I would like to thank dr.ir. Peter Ferket, my mentor at Robeco and head of QR, and dr.ir. Jan van den Berg and drs. Willem-Max van den Bergh, both associate professors at the faculty of Economics at the Erasmus University. They all provided invaluable comments on this thesis in its several stages of development. Furthermore my gratitude goes out to the members of the Credits research team, to my roommates and to the other colleagues at QR, for answering the many questions a Computer Science graduate inevitably has when acting like an econometrician. Finally I want to say thanks to dr. Guido Deboeck (Virtual Imagineer, U.S.A.) and dr. Gerhard Kranner (Eudaptics, Austria) for taking the time to answer my many emails, providing new insights and a better understanding of Self-Organizing Maps. Eudaptics also generously supplied me with the latest version of their Viscovery SOMine software, so that I could focus on the real research subject instead of having to devote time to programming.

As much as I have loved the past few years I spent partly studying, partly working and partly partying, I'm glad this stage of my life has come to a conclusion. I'm looking forward to putting even more energy into my new job than I have put into this thesis.

Roger Tan, July 2000


1 introduction

In chapter 1 we introduce the main problem and the research topics for this thesis. Paragraph 1 gives a brief overview of the problem setting and paragraph 2 describes the domain of research. Paragraph 3 reviews the central question and several sub-questions to be answered in the remainder of this thesis.


Several techniques have been developed for these kinds of analyses. We will focus on a less common technique called Self-Organizing Maps, which is a combination of a projection and a clustering algorithm. Its main advantages are the insightful visualizations of large datasets and its flexibility.


1.2 Research domain

1.2.1 Bond ratings

Bond ratings are letter values on an ordinal scale, giving an opinion of the creditworthiness of the issuer of a bond. The two most important rating agencies (issuers of ratings) are Standard & Poor's and Moody's. The ratings issued by these two agencies are comparable, but in this thesis we will focus on Standard & Poor's.

Examples of ratings are AA or B; the full rating scale is shown in table 1-1. A low rating (e.g. CC) corresponds to a high default risk, a high rating (e.g. AA) corresponds to a low default risk. A 'D' indicates an actual default on the bond. The scale is further refined by appending '+' or '-' to the letter rating, indicating a slightly better or slightly worse rating.

Nowadays, more and more companies have been rated, but still most rated companies are based in the United States of America. Also, more historical data is available for these companies. Therefore, our research will be conducted using only U.S. based companies.

1.2.2 Financial data and ratings

Rating agencies claim that the issued ratings are based on (1) a quantitative analysis of the financial statement of a company and (2) a qualitative analysis of the company and its environment: What is the long-term strategy, are there any impending threats to future profitability not expressible in the financial statement (like lawsuits), and what is the economic outlook for the sector as a whole? We will treat the credit rating process extensively in chapter 2, but suffice it to say that the contribution of qualitative factors to the rating is unclear. We can clarify the relationship between financial data and credit ratings using quantitative techniques like the Self-Organizing Map, and indirectly give an assessment of the contribution of qualitative factors.

Financial statement data on most US companies is available in huge databases from data sources like Compustat and WorldScope. The information in these databases could help us gain a better understanding of the relationship between financial information and bond ratings. It might even provide us with a means to correctly

Table 1-1 Credit rating scale

Standard & Poor's
AAA
AA+
AA
AA-
A+
A
A-
BBB+
BBB
BBB-
BB+
BB
BB-
B+
B
B-
CCC+
CCC
CCC-
CC
C
D
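Because the scale in table 1-1 is ordinal, quantitative work often maps the letter ratings to integers. A minimal sketch of such a mapping (the numeric coding, with AAA as the highest value, is our own illustrative choice and is not part of the thesis):

```python
# S&P letter ratings from table 1-1, ordered from worst (D) to best (AAA).
SP_SCALE = ["D", "C", "CC", "CCC-", "CCC", "CCC+",
            "B-", "B", "B+", "BB-", "BB", "BB+",
            "BBB-", "BBB", "BBB+", "A-", "A", "A+",
            "AA-", "AA", "AA+", "AAA"]

# Ordinal code per rating: 0 for D up to 21 for AAA.
RATING_TO_ORDINAL = {r: i for i, r in enumerate(SP_SCALE)}

def compare(r1: str, r2: str) -> int:
    """Positive if r1 is rated higher than r2, negative if lower, 0 if equal."""
    return RATING_TO_ORDINAL[r1] - RATING_TO_ORDINAL[r2]
```

For example, `compare("AA", "CC")` is positive, reflecting that AA carries a much lower default risk than CC.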


predict bond ratings, based on the stored financial data alone. However, transforming the stored data into knowledge is no trivial task.

1.2.3 Self-Organizing Maps

A common problem is the complex nature of large amounts of data. Our universe contains a large number of companies, and for each company many financial characteristics are available. This hinders the inference of sensible relationships; to cope with the problem, specific techniques have been developed2. In this thesis we will focus on the Self-Organizing Map technique.

Self-Organizing Maps (SOMs) use an advanced algorithm to form an as good as possible representation of the data. Clusters of similar companies are identified and displayed on a map, using colours to enhance the representation. The voluminous original dataset is compressed into a 2-dimensional, easily readable map. The contributions of individual characteristics are also part of the display, making it possible to visually infer relationships from the underlying data.

The Self-Organizing Map can be used as a visual exploration tool and as a classification model. Both functions will be illustrated using our bond rating problem.

     

2 Fayyad, U.M., 1996, Chapter 1.
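The self-organization idea sketched above can be made concrete in a few lines. The following is a minimal illustrative implementation of the classic online Kohonen algorithm, written for intuition only; the thesis itself uses the Viscovery SOMine package, not this code, and all parameter choices below are our own assumptions:

```python
import numpy as np

def train_som(data, rows=8, cols=8, iterations=2000, seed=0):
    """Train a small Self-Organizing Map on `data` (n_samples x n_features).

    Online Kohonen algorithm: pick a random sample, find its best-matching
    unit (BMU), then pull the BMU and its map neighbours towards the sample,
    with learning rate and neighbourhood radius shrinking over time.
    """
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # One weight vector per map node, randomly initialised within the data range.
    weights = rng.uniform(data.min(axis=0), data.max(axis=0), size=(rows, cols, dim))
    # Grid coordinates of every node, used by the neighbourhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    sigma0, lr0 = max(rows, cols) / 2.0, 0.5
    for t in range(iterations):
        x = data[rng.integers(n)]
        # BMU: the node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Exponentially decaying radius and learning rate.
        frac = t / iterations
        sigma = sigma0 * np.exp(-3.0 * frac)
        lr = lr0 * np.exp(-3.0 * frac)
        # Gaussian neighbourhood on the map grid around the BMU.
        d2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

def quantization_error(data, weights):
    """Average distance from each sample to its best-matching unit."""
    flat = weights.reshape(-1, weights.shape[-1])
    return np.mean(np.min(np.linalg.norm(flat[None] - data[:, None], axis=-1), axis=1))
```

After training, similar samples map to nearby nodes, which is what makes the 2-dimensional cluster display described above possible.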


1.3 Research topics

The earlier sketched domain forms the background for the following central question in this thesis:

In what way can we use Self-Organizing Maps to explore the relationship between financial statement data and credit ratings?

This question can be broken down into the following five sub-questions:

1. What are credit ratings and how is the credit rating process structured?

An analysis of the Standard & Poor credit rating process gives us a better understanding of the relation between credit ratings and financial statement data.

2. What are Self-Organizing Maps and how can they aid in exploring relationships in large data sets?

Before we can trust the results inferred from the SOM maps we first have to understand how the SOM gives a view on the underlying data. We provide an in-depth review of the algorithm itself and a guide on how to interpret the generated results.

3. Is it possible to find a logical clustering of companies, based on the financial statements of these companies?

First we would like to know if companies are discernible based on financial statement data alone.

4. If such a clustering is found, does this clustering coincide with levels of creditworthiness of the companies in a cluster?

We then compare the found clustering with the distribution of the ratings over the companies to determine to what extent they coincide.

5. Is it possible to classify companies in rating classes using only financial statement data?

Using previously found knowledge we set up a model specifically suited to the task of classifying new companies using financial statement data.


This thesis is divided into several chapters. Chapter 1 contains the introduction, a description of the research domain and this overview of the research topics. In chapter 2 we give a theoretical treatment of the credit rating process and in chapter 3 we provide an in-depth review of Self-Organizing Maps. Chapter 4 discusses the descriptive analysis, after which chapter 5 focuses on the classification model. In chapter 6 we draw our conclusions and present some suggestions for further research.


2 credit ratings

This chapter provides a background on credits and credit ratings. Question 1 from the introduction is answered:

1. What are credit ratings and how is the credit rating process structured?

Paragraph 1 addresses the theoretical foundations of credits and credit ratings. Paragraph 2 reviews the rating process of Standard & Poor's, a well-known rating agency. Paragraph 3 evaluates the key financial ratios applicable to the economic sector under scrutiny in this thesis, Consumer Cyclicals.


2.1 Credits and credit ratings

2.1.1 Bonds

In its most simple form a bond is a loan from one entity to another. The entity that receives the loan (often a government or a large company) is called the obligor or issuer; the loan itself is called a bond obligation or issue. The bond is freely tradable on the exchanges and split up into smaller parts, to make the bond more marketable.

Bonds belong to the group of fixed-income instruments, because they periodically pay a fixed amount (the coupon) to the buyer of the bond. Bonds differ from equity (or stockholders' shares) in that buyers of bonds do not become owners of the company. When a company goes into bankruptcy, the owner of the bond is in a better position than the shareholder: first all the loans are redeemed, and from what is left (if anything) the owners are repaid.

Characteristics

Each bond has certain characteristics, which fully describe the bond. The bond has to be redeemed on a fixed date, called the maturity date. Bonds with original maturities longer than a year are considered long-term; all bonds with maturities up to one year are considered short-term. Each period a certain interest percentage has to be paid in the form of the coupon. Often this percentage is fixed, but sometimes it depends on the market interest rate (the coupon is floating). Other variations on the standard bond include sinking redemptions (periodically a part of the bond is redeemed), callable bonds (at certain dates the issuer has the right to prematurely redeem the bond), and of course special combinations leading to more exotic variants.

Value

The value of a bond depends largely on the coupon percentage and the current market interest rate. If the market interest rate rises, the value of the bond falls: the coupon percentage is fixed, and investors would rather buy a new bond with a coupon that is more in line with the current market interest rate. If the market interest rate declines, the value of the bond rises: investors would rather buy our bond than new bonds with lower interest rates.

The value of the bond is determined in the market, by the forces of supply and demand. Using the market price, the current yield of the bond can be calculated. This is the internal discount factor needed when discounting all future cash flows of the bond (coupon payments and redemption payment) to arrive at the current price.
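The price-yield relationship described above can be made concrete with a small sketch. This is our own illustration, not part of the thesis; the annual-coupon simplification and the function names are assumptions:

```python
def bond_price(face, coupon_rate, years, yield_rate):
    """Price a standard annual-coupon bond by discounting all future
    cash flows (coupons and the redemption payment) at the given yield."""
    coupon = face * coupon_rate
    price = sum(coupon / (1 + yield_rate) ** t for t in range(1, years + 1))
    return price + face / (1 + yield_rate) ** years

def current_yield_from_price(face, coupon_rate, years, market_price):
    """Recover the internal discount factor (the yield) that equates the
    discounted cash flows with the observed market price, by bisection."""
    lo, hi = 0.0, 1.0  # assume the yield lies between 0% and 100%
    for _ in range(100):
        mid = (lo + hi) / 2
        if bond_price(face, coupon_rate, years, mid) > market_price:
            lo = mid  # computed price too high, so the yield must be higher
        else:
            hi = mid
    return (lo + hi) / 2
```

At par the yield equals the coupon rate: a 10-year 5% bond priced at 100 yields 5%. The bisection works because price is strictly decreasing in yield, which is exactly the inverse relationship between bond values and market interest rates described above.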


2.1.3 Credit ratings

According to Standard & Poor's (S&P), "the bond or credit rating is an opinion of the general creditworthiness of an obligor with respect to a particular debt security or other financial obligation, based on relevant risk factors."⁴ All rating agencies seem to support this definition.

Rating agencies

A rating agency, of which S&P is one of the best known examples, assesses the relevant factors relating to the creditworthiness of the issuer. These include quantitative factors like the profitability of the company and the amount of outstanding debt, but also qualitative factors like skill of management and economic expectations for the company. The whole analysis is then condensed into a letter rating⁵. Standard & Poor's and Moody's have both been rating bonds for almost a century and are the leading rating agencies right now. Other reputable rating institutions are Fitch and Duff & Phelps.

Ratings interpretation

The types of assigned ratings are comparable for most agencies, and for S&P and Moody's there is a direct

     

4 Standard & Poor's, 2000, page 7.

Table 2-1 Credit ratings and interpretation

S&P                  Moody's   Interpretation
AAA                  Aaa       Highest quality
AA+                  Aa1       High quality
AA                   Aa2
AA-                  Aa3
A+                   A1        Strong payment capacity
A                    A2
A-                   A3
BBB+                 Baa1      Adequate payment capacity
BBB                  Baa2
BBB-                 Baa3
BB+                  Ba1       Likely to fulfil obligations; ongoing uncertainty
BB                   Ba2
BB-                  Ba3
B+                   B1        High risk obligations
B                    B2
B-                   B3
CCC+, CCC, CCC-, CC  Caa       Current vulnerability to default, or in default (Moody's)
C                    Ca        Bankruptcy filed
D                    D         Defaulted
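The direct correspondence in table 2-1 can be written down as a lookup table. A sketch of one direction of the mapping (the grouping of the CCC range onto Caa follows our reading of the table's layout):

```python
# S&P -> Moody's correspondence, transcribed from table 2-1.
# Several S&P notches in the CCC range share the single Moody's class Caa.
SP_TO_MOODYS = {
    "AAA": "Aaa", "AA+": "Aa1", "AA": "Aa2", "AA-": "Aa3",
    "A+": "A1", "A": "A2", "A-": "A3",
    "BBB+": "Baa1", "BBB": "Baa2", "BBB-": "Baa3",
    "BB+": "Ba1", "BB": "Ba2", "BB-": "Ba3",
    "B+": "B1", "B": "B2", "B-": "B3",
    "CCC+": "Caa", "CCC": "Caa", "CCC-": "Caa", "CC": "Caa",
    "C": "Ca", "D": "D",
}
```

Because the CCC range collapses onto one Moody's class, the mapping is not invertible notch-for-notch at the low end of the scale.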


Figure 2-1 shows the default rates corresponding to Moody's rating classes for 1999⁷. As is to be expected, the lower rating classes have correspondingly higher default rates.

Investment grade versus speculative grade

Credits with an assigned rating from AAA to BBB- are known as investment grade credits. Lower rated issues are known as speculative grade credits, high yield issues or junk bonds. The spreads on these high yield issues are relatively wide, thus providing an interesting investment opportunity. This is even more so after finding an average recovery rate of 42%⁸ (for every U$ 100 worth of defaults, on average U$ 42 recovers).

Sometimes fund managers are restricted to purchasing investment grade issues, to avoid speculative investments. However, the absolute default rates do not remain stable over the years. For example, restricting the fund managers to purchase at least BBB- grade issues does not guarantee lower than 1% default rates.

7 Moody's, 2000, page 26.
8 Moody's, 2000, page 17.

[Figure 2-1: bar chart of default rates (%) for 1999, one bar per Moody's rating class from Aaa down to B3; vertical axis 0-12%.]

Figure 2-1 Default rates for 1999


2.2 The S&P credit rating process

"The rating experience is as much an art as it is a science." – Solomon B. Samson, Chief Rating Officer at Standard & Poor's⁹.

This paragraph describes the credit rating process of Standard & Poor's. Most information contained in this paragraph was taken from the "Corporate Ratings Criteria" document, published on-line at the S&P website. In this document the distinction between the qualitative and the quantitative analysis is less clear: the qualitative analysis is treated most extensively and thus receives most emphasis. The descriptive analysis in chapter 4 will try to uncover whether this depiction reflects the actual rating practice of S&P.

2.2.1 Process steps

The Standard & Poor's credit rating process can be broken down into several steps. The process is summarized in figure 2-2.

Request rating

Companies themselves often approach Standard & Poor's to request a rating. In addition to this, it is S&P's policy to rate any public corporate debt issue larger than U$ 50 million, with or without a request from the issuer.

Basic research

When the rating is requested, a team of analysts is gathered. The analysts working at S&P each have their own sector specialty, covering all risk categories in the sector.

9 Standard & Poor's, 1999.

[Figure 2-2: flowchart: request rating → assign analytical team → conduct basic research → meet issuer → rating committee meeting → issue(r) rating → surveillance, plus an appeals process.]

Figure 2-2 The Standard & Poor's credit rating process


The appropriate analysts are chosen and a lead analyst is assigned, who is responsible for the conduct of the rating process.

Some basic research is conducted, based on publicly available information and on information received from the company prior to the meeting with the management¹⁰. The information requested prior to the meeting should contain:

- five years of audited annual financial statements (balance sheet and profits and losses account),
- the last several interim financial statements (this is mostly applicable to US companies, as they are required by law to provide quarterly financial statements),
- narrative descriptions of operations and products,
- relevant industry information.

As some of this may be sensitive information, S&P has a strict policy of confidentiality on all the information obtained in a non-public fashion. Any published rationale on the realization of the assigned rating only contains publicly available information.

Meeting the issuer

In the next step, a part of the team meets with management of the company to review key factors that have an impact on the rating. This meeting covers the operating and financial plans of the company and the management policies, but it is also a qualitative assessment of management itself. The meeting is scheduled well in advance so ample time for preparation is given.

The specific topics discussed at the meeting are:

- the industry environment and prospects,
- an overview of the major business segments, including operating statistics and comparisons with competitors and industry norms,
- management's financial policies and financial performance goals,
- distinctive accounting practices,
- management's projections, including income and cash flow statements and balance sheets, together with the underlying market and operating assumptions,

10 So called 'public information ratings' are the exception to this rule; they are solely based on the annual publicly available financial statement.


- capital spending plans,
- financing alternatives and contingency plans.

Standard & Poor's does not base its rating on the issuer's financial projections, but uses them to indicate how the management assesses potential problems and future economic developments.

Rating committee and appeals process

Shortly after the meeting with the management of the issuer, the rating committee convenes. The rating committee consists of five to seven voting members, who decide on the rating using information presented by the lead analyst. His presentation covers:

- an analysis of the nature of the company's business and its operating environment,
- an evaluation of the company's strategic and financial management,
- a financial analysis,
- and finally a rating recommendation.

After a discussion about the rating recommendation and the facts supporting it, the committee votes on the recommendation. The issuer is notified of the rating and the major considerations supporting it. An appeal is possible (the issuer could possibly provide new information), but there is no guarantee that the committee will alter its decision.

Publishing the rating

For public issues the new rating is published using several media, e.g. the Internet site or the "CreditWeek" publication by Standard & Poor's. For ratings assigned on request by the issuer, the company itself may determine whether it wants the rating to be publicly available or not. This will often be the case, because rating requests are expensive and a public rating facilitates the negotiations for loans and leases.

Surveillance

The rated issues and issuers are monitored on an ongoing basis. New financial or economic developments are reviewed, and often a meeting with the management is scheduled annually. If these developments might lead to a rating change, this is made known using the CreditWatch listings. A more thorough analysis is performed, after which the rating committee again convenes and decides on the rating change.


2.3 Financial statement analysis

2.3.1 Financial statement

The financial statement of a company comprises the balance sheet and the profits and losses account. There are strict accounting regulations the financial statement must adhere to, which vary for different countries. The financial statements for companies in different sectors also diverge: we would expect a factory to have a raw materials inventory on its balance sheet, but not a bank. The most important differences occur between Financial companies and Industrial companies; the next section describes the financial ratios that are most applicable to Industrial companies.

2.3.2 Financial ratios

The financial performance of a company can be analyzed by carefully examining the balance sheet and income statement for that company. To make these large quantities of data more comprehensible and to make comparisons between firms possible, one often uses financial ratios.

There are several financial ratio classes:

- leverage ratios measure the debt level of a company,
- liquidity ratios measure the ease with which a company can acquire cash,
- profitability ratios measure the profits of a company in proportion to its assets.

In addition to these financial ratios, a few other classes of variables can be observed to characterize a company:

- size variables measure the size of a company,
- stability variables measure the stability of the company over time in terms of size and income,
- market value ratios measure the value investors assign to a company.

Although financial ratios provide a means to quickly compare companies, some caution should be taken when using them. Companies often use different accounting standards, so two comparable companies can have very different values for certain ratios just because of different ways of valuing the items on the balance sheet. Furthermore, companies often want to present an as favourable as possible image, known as 'window dressing'. This also leads to ratios not fully representing the real financial state of the company.


2.3.3 Balance sheet and income statement

The financial ratios are calculated using elements from the balance sheet and from the income statement of a company. They are shown in table 2-2 and table 2-3.

Table 2-2 Balance sheet

Assets                              Liabilities
+ cash & equivalents                + total short term debt
+ total net receivables             + accounts payable
+ total inventory                   + other current liabilities
+ other current assets              + income taxes payable
total current assets                total current liabilities
+ net property, plant & equipment   + total long term debt
+ investment & advances             + other non-current liabilities
+ intangibles                       + deferred income taxes & investment tax credit
+ other assets                      + minority interest
                                    total liabilities
                                    + preferred stock
                                    + total common equity
total assets                        total liabilities & capital

Table 2-3 Income statement

+ net sales
- cost of goods sold
- other expenses
earnings before interest, taxes, depreciation and amortization
- depreciation and amortization expense
earnings before interest and tax
- gross interest expense
+ special items (non-recurring)
pre-tax income
- total income taxes
- minority interest
net income
- preferred dividends
earnings applicable to common stock


    2.3.4  Used ratios

    Our preliminary selection yielded the following financial ratios.

    Interest coverage ratios

    These measure the extent to which interest or debt is covered by the earnings of a company.

    EBIT interest coverage:

    (earnings before interest and taxes) / (interest expenses)

    EBITDA interest coverage:

    (earnings before interest, taxes, depreciation and amortization) / (interest expenses)

    EBIT / total debt:

    (earnings before interest and taxes) / (total debt)

    Leverage ratios

    Financial leverage is created when firms borrow money. To measure this leverage, a number of ratios are available.

    Debt ratio

    (long term debt) / (long term debt + equity + minority interest)

    Debt-equity ratio

    This can be measured in several ways, two of which are:

    (long term debt) / (equity)

    and

    (long term debt) / (total capital)

    Net gearing

    (total liabilities – cash) / (equity)

    Profitability ratios

    Profitability ratios measure the profits of a company in proportion to its assets.


    Return on equity

    This measures the income the firm was able to generate for its shareholders11.

    (net income) / (average equity)

    Return on total assets

    (earnings before interest and taxes) / (total assets)

    Operating income / sales

    (operating income before depreciation) / (sales)

    Net profit margin

    (net income) / (total sales)

    Size variables

    These measure the size of a company.

    Total assets

    The total assets of the company.

    Market value

    Price per share * number of shares outstanding.

    Stability variables

    Stability variables measure the stability of the company over time in terms of size and income.

    Coefficient of variation of net income

    (standard deviation of net income over 5 years) / (mean of net income over 5 years)

    Coefficient of variation of total assets

    (standard deviation of total assets over 5 years) / (mean of total assets over 5 years)

    Market variables

    Market variables are used to assess the value investors assign to a company.

     

    11 Note the use of the average of the equity (at the beginning and the end of the quarter). Averages are often used when comparing flow data (net income) with snapshot data.


    Coefficient of variation of earnings forecasts (fiscal year 1)

    This measures the risk encapsulated in the earnings forecasts (for fiscal year 1) of the several analysts. If the analysts do not agree with each other, that should be an indication of higher risk involved with this company.

    (standard deviation of forecasts fiscal year 1 over analysts) / (mean of forecasts fiscal year 1 over analysts)

    Market beta relative to NYSE

    The beta is the sensitivity of the stock to market movements, in this case movements of the New York Stock Exchange12. A snapshot is taken on the last trading day of the quarter.

    Earnings per share

    This is calculated for the last month of the quarter.

    (earnings applicable to common stock) / (total number of shares)
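    As an illustration, the ratio definitions above translate directly into code. The statement figures below are hypothetical, chosen only to show the arithmetic:

    ```python
    # Toy balance sheet and income statement items (hypothetical, in millions)
    # used to illustrate a few of the ratios defined above.
    ebit = 120.0                 # earnings before interest and taxes
    depreciation = 30.0          # depreciation and amortization expense
    interest_expense = 40.0
    long_term_debt = 500.0
    equity = 800.0
    minority_interest = 20.0
    total_liabilities = 900.0
    cash = 100.0
    net_income = 60.0

    ebit_coverage = ebit / interest_expense
    ebitda_coverage = (ebit + depreciation) / interest_expense
    debt_ratio = long_term_debt / (long_term_debt + equity + minority_interest)
    net_gearing = (total_liabilities - cash) / equity
    roe = net_income / equity    # end-of-period equity, for simplicity

    print(round(ebit_coverage, 2))    # 3.0
    print(round(ebitda_coverage, 2))  # 3.75
    print(round(debt_ratio, 3))       # 0.379
    print(round(net_gearing, 2))      # 1.0
    print(round(roe, 3))              # 0.075
    ```

    Note that, as the text cautions, such numbers are only as comparable as the accounting choices behind them.
    
    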

     

    12 Brealey, R.A. and Myers, S.C., 1991, chapter 7.


    2.4 Summary

    2.4 Summary

    In this chapter we have reviewed some theoretical aspects of bonds and credits before exploring the ratings domain. The credits we are most interested in are bonds issued by companies (corporate bonds). We have seen the direct relation between creditworthiness, default probability and spread of a credit. If the perceived creditworthiness is better, then the assigned rating will be higher and the default probability will be lower. The difference in yield with a similar government bond (also known as the spread) will consequently be lower.

    The different process steps of the Standard & Poor's credit rating process emphasize the qualitative analysis performed by the agency. The quantitative analysis, based on financial statement data, is just a single step in the process. In the remainder of this thesis we will try to uncover whether actual rating practice reflects this depiction of matters, using the described financial ratios. These ratios form a means to summarize the balance sheet and income statement of a company and to compare the financial statements of different companies.


    3 self-organizing maps

    Chapter 3 reviews the Self-Organizing Map and its place in the knowledge discovery process. To provide a background for the SOM we will briefly discuss some related techniques before examining the Self-Organizing Map algorithm. Altogether this answers question 2 from the introduction:

    2.  What are Self-Organizing Maps and how can they aid in exploring relationships in large data sets?

    Paragraph 1 describes the knowledge discovery process. Paragraph 2 describes some projection and clustering methods related to SOM. Paragraph 3 describes the classification techniques that we also use in the classification model of chapter 5. The remainder of the chapter is dedicated to an explanation of SOM and guidelines for the use of SOM.


    3.1  Knowledge discovery

    3.1.1  Introduction

    These days it is quite common for corporations of all kinds and sizes to gather large amounts of data. This may vary from customer data (e.g. scanned purchase data for supermarkets) to data regarding some of the processes within a company (e.g. process states of a machine). On a meso-economic and macro-economic level a lot of data is available too, concerning the financial statements of individual companies or the financial statements of countries.

    The volumes of these databases are often gigantic, making it impossible to retrieve sensible information just by looking at the raw data. To gain access to the knowledge contained in the stored data one has to rely on specific techniques, which extract information from the database in a systematic way. In the ICT sector these techniques are referred to as data-mining13 techniques, and all the steps necessary to extract knowledge from databases are known as the knowledge discovery process.

    3.1.2  Knowledge discovery process

    The knowledge discovery process encompasses all the steps necessary to extract potentially useful information (knowledge) from the database14.

    The basic steps (displayed in figure 3-1) involve:

    -  Creating a target data set based on the available data, the knowledge of the underlying domain and the goals of the research.

    -  Pre-processing this data to account for extreme values and missing values.

    -  Applying any necessary transformations.

    -  ‘Mining’ the data so distinct patterns become available for interpretation and evaluation. In this thesis we will focus on visualization techniques, whereby specific patterns can be found in the resulting maps.

     

    13 Computer scientists use the term data-mining in a positive context (extracting previously unknown knowledge from large databases), econometricians use the term data-mining in a negative context (manipulating data and the used technique to support specific conclusions). This sometimes leads to confusion about the intended meaning.
    14 Fayyad, U.M., et al., 1996, chapter 2.

    Figure 3-1 The knowledge discovery process (data → selection → target data → preprocessing → preprocessed data → transformation → transformed data → visualization → patterns/maps → interpretation and evaluation → knowledge)


    -  Interpreting and evaluating these maps, often repeating one or more steps of the process.

    3.1.3  Description and prediction

    The knowledge discovery process serves two main purposes: description and prediction. Descriptive knowledge discovery tries to correctly represent the data in a compact form. The new representation implicitly or explicitly shows relationships in the data. Not so obvious relationships emerge, thus contributing to a greater knowledge of the underlying domain. Obvious relationships are of course visible too, strengthening the image one has of the data based on preliminary research. Commonly used techniques are projection and clustering algorithms.

    Predictive knowledge discovery is used to complement values for one or more characteristics (or variables) of observations in the data set. This often takes the form of a classification problem: a data set with known class memberships is used to build a model, and this model is used to predict the class membership for new observations. Commonly used techniques are linear-regression-based classifiers like ordered logit and artificial neural networks.

    Of course this division is not strict. Some of the algorithms are combinations of techniques, and often the descriptive techniques are used as an intermediate step in large investigations. The output of the descriptive analysis then may serve as input for some of the prediction algorithms.

    In the following sections we will highlight some of the available projection, clustering, and classification techniques. The Self-Organizing Map, treated extensively in the remainder of the chapter, is actually a neural network combining regression, projection and clustering!


    3.2 Projection and clustering techniques

    We use projection techniques to reduce the dimensionality of the data, making it easier to grasp the essence of the data. Projection techniques can be split into two groups, linear and non-linear projection methods. On the other hand, clustering techniques are designed to reduce the amount of data by grouping alike items together. The dimensionality of the data does not change. The several clustering methods can be split into two common types, hierarchical and non-hierarchical clustering.

    3.2.1  Linear projection

    Linear projection methods use a linear combination of the components of the original data to project the data onto a new co-ordinate system of lower dimensionality using a fixed set of scalar coefficients.

    Principal component analysis (PCA) is a commonly used linear projection method. The PCA technique tries to capture the intrinsic dimensionality of the data by finding the directions in which the data displays the greatest variance. Often the data is stretched in one or more directions and has an intrinsic lower dimensionality than it first may seem (see figure 3-2). These directions in the data are called ‘principal components’. The first principal component describes the direction of the largest variation in the data. The second principal component, orthogonal to the first, describes the direction of the second-largest variation in the data, et cetera. The variation in the data that has not been described by the first N principal components is called the residual variance.

    The data is projected onto a new co-ordinate system spanned by the first two principal components, to give a more accurate view of the data. A drawback of linear projection methods is that they cannot take non-linear or arbitrarily shaped structures in the data into account, possibly leading to incorrect projections.

    In chapter 4, we compare the PCA technique with SOM. A full explanation of principal components can be found in Johnson and Wichern15.

     

    15 Johnson, R.A., and Wichern, D.W., 1992, chapter 8.

    Figure 3-2 Two-dimensional data stretched in one direction
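    The idea can be sketched for two-dimensional data, where the first principal component of the 2×2 sample covariance matrix has a closed form. The data points below are hypothetical, stretched roughly along one direction as in figure 3-2:

    ```python
    import math

    # Hypothetical 2-D data stretched along the y = x direction.
    data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9), (4.0, 4.1)]

    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix (closed form) and its
    # eigenvector: this direction is the first principal component.
    lam = 0.5 * ((sxx + syy) + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
    vx, vy = sxy, lam - sxx                  # unnormalized eigenvector
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)

    # Share of total variance captured by the first component; the
    # remainder is the residual variance mentioned in the text.
    explained = lam / (sxx + syy)
    print(pc1, round(explained, 3))
    ```

    For this stretched data the first component points close to the diagonal and captures nearly all of the variance, illustrating the intrinsic one-dimensionality of the sample.
    
    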


    3.2.2  Non-linear projection

    Several techniques exist to project the non-linear structures in the data. They often focus on correctly displaying the differences between observations in the original data space.

    Multi Dimensional Scaling (MDS)16, developed by J.B. Kruskal during the sixties and seventies, actually denotes a whole range of techniques. It aims at placing the original, high dimensional data points on a lower dimensional display in such a way that the relative rank ordering of similarity between observations in the input space is preserved as much as possible. The new distance between the two least similar observations is largest, and vice versa the new distance between the two most similar observations is smallest. The specification of the similarity measure defines the specific version of MDS used; metric MDS uses Euclidean distances17 in the input space, non-metric MDS uses domain-specific relative rank orderings.

    One interesting application of non-metric MDS can be found in archaeology, for the reconstruction of the geography of the Mycenaean kingdom of Pylos in Greece (circa 1200 BC)18. The found palace archives (clay tablets) contain no direct geographical information, but relative distances between cities can be inferred from them. The MDS-based map of the kingdom (figure 3-4) matches the map drawn by experts (figure 3-3) quite closely.

     

    16 Johnson, R.A. and Wichern, D.W., 1992, pages 602-608.
    17 The Euclidean distance d(x, y) between vectors x and y is defined as d(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + … + (x_n − y_n)²).
    18 See “http://www.archaeology.usyd.edu.au/~myers/multidim.htm” for more information.

    Figure 3-4 MDS map of Pylos kingdom

    Figure 3-3 Expert map of Pylos kingdom


    3.2.3  Hierarchical clustering

    Hierarchical clustering techniques group data items according to some measure of similarity in a hierarchical fashion. They can be divided into splitting and merging methods.

    Splitting methods work top-down, starting with one big cluster. At each step the cluster is divided into two separate clusters, thereby maximizing some inter-cluster distance measure d. The divisional process is stopped when d becomes too small. The found division of the data set is equivalent to a binary tree structure.

    Merging methods work bottom-up, starting with each case in a separate cluster. Clusters having the least inter-cluster distance d are merged; often the Euclidean distance is used for d. An example clustering of car brands is shown in figure 3-5.
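    The merging procedure can be sketched as follows; the points, the single-linkage distance measure and the stopping threshold are illustrative choices:

    ```python
    import math

    # Bottom-up (merging) clustering of a few hypothetical 2-D points.
    # Each case starts in its own cluster; the two clusters with the
    # smallest inter-cluster distance d are merged until d grows too large.

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def cluster_dist(ca, cb):
        # single linkage: distance between the two closest members
        return min(dist(p, q) for p in ca for q in cb)

    def merge_clustering(points, stop_at):
        clusters = [[p] for p in points]          # one cluster per case
        while len(clusters) > 1:
            # find the pair of clusters with minimal inter-cluster distance
            (i, j), d = min(
                (((i, j), cluster_dist(clusters[i], clusters[j]))
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))),
                key=lambda t: t[1])
            if d > stop_at:                       # d has become too large
                break
            clusters[i] += clusters.pop(j)        # merge the two clusters
        return clusters

    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(merge_clustering(points, stop_at=3.0))  # two clusters remain
    ```

    Recording the order of the merges would yield the binary tree (dendrogram) structure mentioned above.
    
    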

    3.2.4  Non-hierarchical clustering

    Non-hierarchical or partitional clustering methods try to directly divide the data into a set of disjoint clusters. This is done in such a way that the intra-cluster distance is minimized and the inter-cluster distance is maximized.

    K-means clustering is a non-hierarchical clustering method that is very much related to Self-Organizing Maps. A set of K reference vectors is chosen with the same dimensionality as the input data. Then for each reference vector a list is made of the observations lying closest to it. The reference vectors are then recomputed by taking the mean over the respective list. Each reference vector (also called ‘centroid’) thus represents the centre of the cluster. This is repeated until the reference vectors do not change much anymore.

    Figure 3-5 Clustering car brands using merging
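    A minimal sketch of this iteration, using hypothetical two-dimensional observations:

    ```python
    import random

    # K-means sketch: K reference vectors (centroids) are repeatedly moved
    # to the mean of the observations assigned to them, until they stop
    # changing. Data and K are illustrative choices.

    def kmeans(data, k, iters=100):
        random.seed(0)
        centroids = random.sample(data, k)        # initial reference vectors
        for _ in range(iters):
            # assignment step: list the observations closest to each centroid
            lists = [[] for _ in range(k)]
            for x in data:
                c = min(range(k),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(x, centroids[i])))
                lists[c].append(x)
            # update step: recompute each centroid as the mean of its list
            new = [tuple(sum(v) / len(lst) for v in zip(*lst)) if lst
                   else centroids[i] for i, lst in enumerate(lists)]
            if new == centroids:                  # converged
                break
            centroids = new
        return centroids

    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2)]
    print(sorted(kmeans(data, 2)))
    ```

    On these two well-separated groups the centroids end up at the group means, which is exactly the role the SOM's neurons play later in the chapter.
    
    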


    3.3 Classification techniques

    The techniques treated in this paragraph can all be used as classification methods. Linear regression and neural networks are more general methods that can also be used to solve other kinds of problems. The ordered logit model is specifically used for classification problems. All three techniques are used in chapter 5.

    3.3.1  Linear regression

    The multiple linear regression model is used to study the relationship between a dependent variable and several independent variables. The regression equation has the following form:

    y_i = β_1·x_1i + β_2·x_2i + … + β_k·x_ki + ε_i,   i = 1,…,n,

    where y is the dependent or explained variable, x_1,…,x_k are the independent or explanatory variables (also known as regressors), and i indexes the n sample observations. The disturbance ε is used to model external random influences that we cannot capture with the model (e.g. errors of measurement). The coefficients of the independent variables (β_1…β_k) and the disturbance are most often estimated using the Ordinary Least Squares technique. Before we do this a number of assumptions have to be satisfied concerning, amongst others, the dependencies between variables and the distribution of the disturbances. A full overview of the multiple linear regression model is given in Greene19.
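    For illustration, in the special case of a single regressor the OLS estimates have a simple closed form: b1 = cov(x, y) / var(x) and b0 = mean(y) − b1·mean(x). The data below are made up:

    ```python
    # One-regressor OLS sketch on hypothetical data, roughly y = 2x
    # with small disturbances.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))      # slope estimate
    b0 = my - b1 * mx                             # intercept estimate

    # With an intercept included, the OLS residuals sum to zero.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    print(round(b1, 3), round(b0, 3))  # 1.99 0.05
    ```

    The multiple-regressor case solves the analogous normal equations in matrix form, but the principle of minimizing the sum of squared residuals is the same.
    
    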

    3.3.2  Ordered logit

    The ordered logit model is a so-called ordered response model. It is an extension of the binary logit model, which is a regression-based technique: a latent variable is assumed to be the determining factor for class membership. This latent variable is linearly dependent on several regressors and a disturbance.

    y_i = β_1·x_1i + β_2·x_2i + … + β_k·x_ki + ε_i,   i = 1,…,n

    We assume a logistic distribution for the disturbance ε, hence the name ordered logit. Although the classes have to be ordered they need not be of equal width. The classification is seen as a transformation of the latent variable and derived from y using

    x_i ∈ c_1   if   y_i ≤ α_1
    x_i ∈ c_j   if   α_(j−1) < y_i ≤ α_j
    x_i ∈ c_m   if   y_i > α_(m−1)
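    The classification step can be sketched in code; the cut-off points below are hypothetical, since in practice the αs (and the βs of the latent variable) are estimated by maximum likelihood:

    ```python
    import bisect

    # Given an estimated latent value y_i and ordered cut-off points
    # alpha_1 < ... < alpha_{m-1}, the observation is assigned to the
    # class whose interval contains y_i.

    alphas = [-1.0, 0.5, 2.0]        # hypothetical boundaries, m = 4 classes

    def assign_class(y_latent):
        # bisect_left counts the boundaries strictly below y_latent,
        # i.e. a class index in 0..m-1 (0 = lowest ordered class);
        # a value exactly on a boundary falls in the lower class (y <= alpha)
        return bisect.bisect_left(alphas, y_latent)

    print([assign_class(y) for y in (-2.0, 0.0, 1.0, 3.0)])  # [0, 1, 2, 3]
    ```

    Note the intervals need not be of equal width, matching the remark above.
    
    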


    3.4 Self-Organizing Maps

    3.4.1  Introduction

    The self-organizing map (SOM) is a combination of a clustering and a projection algorithm at the same time, driven by a neural network. The multi-dimensional input (e.g. companies with multiple financial ratios per company) is projected onto a 2-dimensional map, thereby preserving the local distances between the observations. The projected observations are subsequently merged into clusters, taking the placement on the map into account.

    The model of the self-organizing map was inspired by the human brain: the complex motoric and sensoric control of specific parts of the human body can be pinpointed to specific areas on a flat surface of the brain. More complex functions are appointed larger areas (or clusters) of brain tissue. The resulting man-like shape projected on the brain is known as the homunculus (figure 3-7).

    Figure 3-7 Picture of the homunculus in the brain, drawn by Wilder Penfield


    3.4.2  Overview

    The self-organizing map algorithm involves two steps. The first step projects the observations, the second step clusters the projected observations.

    Projection

    The first step of the algorithm involves projecting the observations onto a 2-dimensional, flexible grid composed of neurons or nodes. The grid is stretched and bended through the input space to form an as good as possible representation of the data. The projection on this grid is a generalization of simple projection (on the flat surface) and projection using Principal Component Analysis (PCA).

    Simple projection simply projects the datapoints on the flat surface defined by the x and y axes. Projection using principal components is more advanced than simple projection (reflecting the intrinsic dimensionality of the data), but is still limited because the observations are projected on a flat plane. The flat plane is aligned according to the axes defined by the two directions exhibiting the largest variance of the data. The projection part of the SOM algorithm (also known as the self-organization process) can be thought of as a non-linear generalization of PCA21. The plane onto which the observations are projected can stretch and bend through the input space, thus more thoroughly capturing the distribution of the observations in the input space.

    The first two types of projections are often too restricted to fully capture the irregularities of the data. The three-dimensional example in figure 3-7 shows this more clearly. The data is clustered in three distinct segments of the cube; simple projection projects the observations on the bottom of this cube (left picture). The flat plane shown in the middle picture is aligned along the first two principal components of the data. A projection on this surface gives a better representation of relative distances in the data set. The rightmost picture shows the flexible, bended and stretched grid used for SOM projection. By following the form of the data an even more

     

    21 Kaski, S., 1997.

    Figure 3-7 Plane of projection: using the X-Y plane, using PCA and using SOM


    accurate representation of relative distances in the data set is given. How the SOM achieves this projection is extensively treated in paragraph 3.5.

    Clustering

    The flexible grid, onto which the observations have been projected, is (for convenient output viewing) returned to a normal, unstretched flat plane and displayed as the map. The form of the grid in the input space remains fixed. The local ordering of the sample is preserved; neighbouring observations in the input space will be neighbouring observations on the map.

    A bottom-up clustering method is used to cluster the projected observations: starting with each observation in a separate cluster, 2 clusters are merged if their relative distance (e.g. Euclidean distance) in the input space is smallest and if they are adjacent in the map. The number of shown clusters varies with the specific step of the algorithm we want to see. One step later in the algorithm means one less cluster shown (another cluster has merged), one step earlier means one more cluster shown.

    Clusters are clear separations of the input space, so observations can only be a member of one cluster (the clusters do not overlap). The clustering algorithm is discussed in paragraph 3.6.




    3.5.3  Mathematical description

    The self-organization process can be described in mathematical form. The input consists of a sample of n-dimensional observations

    x(t) = [x_1(t), x_2(t), …, x_n(t)],

    where t is regarded as the index of the observations in the sample (t = 1, 2, …, T).

    The goal of the algorithm is to determine the values for a set of n-dimensional neurons,

    m_i(T) = [m_i1(T), m_i2(T), …, m_in(T)],

    where i denotes the index of the current neuron in the output map (i = 1, 2, …, I). The neurons are first initialized to arbitrary values. The placement of the neurons in the output map is fixed, so the index i does not change.

    For every t, the algorithm performs the following steps:

    1.  The winning neuron m_c(t) most closely resembling the current observation x(t) is selected (c denotes the winning and i denotes the current neuron):

    ||x(t) − m_c(t)|| = min_i ||x(t) − m_i(t)||.

    2.  The m_i are updated:

    m_i(t+1) = m_i(t) + α(t)·h_ci(t)·[x(t) − m_i(t)].

    The adjustment is monotonically decreasing as the number of iterations increases. This is controlled by the learning rate factor α(t) (0 < α(t) < 1).
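    The two steps can be sketched as a plain training loop. The grid size, the toy data and the decay schedules chosen for α(t) and the neighbourhood function h_ci(t) are illustrative assumptions, not those of any particular SOM package:

    ```python
    import math
    import random

    # Bare-bones self-organization sketch: a small 2-D grid of neurons m_i
    # is fitted to 2-D observations x(t) with the update rule
    #   m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)].

    random.seed(1)
    ROWS, COLS = 4, 4
    grid = {(r, c): [random.random(), random.random()]
            for r in range(ROWS) for c in range(COLS)}

    # Two well-separated groups of observations.
    data = ([(random.gauss(0.2, 0.05), random.gauss(0.2, 0.05))
             for _ in range(50)]
            + [(random.gauss(0.8, 0.05), random.gauss(0.8, 0.05))
               for _ in range(50)])

    T = 2000
    for t in range(T):
        x = random.choice(data)
        # 1. select the winning neuron c (the best matching unit)
        c = min(grid, key=lambda i: (grid[i][0] - x[0]) ** 2
                                    + (grid[i][1] - x[1]) ** 2)
        # 2. update all neurons; alpha(t) and the neighbourhood radius
        #    sigma(t) shrink monotonically as t grows
        alpha = 0.5 * (1 - t / T)
        sigma = 2.0 * (1 - t / T) + 0.5
        for i, m in grid.items():
            grid_dist2 = (i[0] - c[0]) ** 2 + (i[1] - c[1]) ** 2
            h = math.exp(-grid_dist2 / (2 * sigma ** 2))   # h_ci(t)
            m[0] += alpha * h * (x[0] - m[0])
            m[1] += alpha * h * (x[1] - m[1])

    # After training, neurons concentrate near the two data groups.
    bmus = {min(grid, key=lambda i: (grid[i][0] - x[0]) ** 2
                                    + (grid[i][1] - x[1]) ** 2)
            for x in data}
    print(len(bmus), "distinct best matching units used")
    ```

    The Gaussian h_ci(t) plays the role of the neighbourhood function: neurons close to the winner on the grid are pulled strongly toward x(t), distant ones hardly at all.
    
    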


    … the training process while keeping the same results. For more information on the batch training process please refer to Deboeck22.

    3.5.4  A three-dimensional example

    An example using a three-dimensional input space is more representative of a real-world application of the SOM: a high-dimensional input space mapped to a two-dimensional output grid. In figure 3-11 the neurons are placed in a three-dimensional input space with three groups of data. Please note that the network is not random but linearly initialized according to the first two principal components of the data set.

    The distribution of the neurons after the self-organization process is shown in figure 3-12. The network, still a 2-dimensional lattice, has curved and stretched to form an as good as possible fit to the original data. The neurons are concentrated in those areas of the input space containing the most observations. The largest separation occurs between the cluster of observations in the bottom half of the cube and the two clusters of observations in the upper half of the cube.

    22 Deboeck, G., 1998, page 167.

    Figure 3-11 Linearly initialized network in a 3D input space    Figure 3-12 Distribution of the neurons after self-organization



    3.6 SOM visualization and clustering

    The previous treatment of the inner workings of SOM is generic for most implementations, but the available visualizations of the final map vary per software package. We have made use of the Viscovery SOMine 3.0 Enterprise edition program, generously supplied to us by Eudaptics in Austria23. Some of the shown visualization and cluster capabilities cannot be found in other programs24.

    3.6.1  Maps

    The visible output of the algorithm consists of the map, which is an unstretched, flattened representation of the grid in the input space. Observations mapped to a specific neuron in the input space appear on the same specific neuron (grid point) in the map. Neighbouring observations in the input space are neighbouring observations on the map.

    The map has several manifestations:

    -  Clusters: to view the clustering of neurons25.

    -  U-matrix: to view relative distances between neurons (in the input space).

    -  Component planes: to view distributions of separate variables over the map.

    It is important to remember that for each map manifestation the distribution of observations over the map does not change. We are looking at the same map, but each time different information is shown.

    Unified distance matrix

    The Unified distance matrix (U-matrix) can be used to assess relative distances between neurons in the input space. When translating the grid in the input space to the output map, distance information is lost (the grid is returned to an unstretched, flattened state). This information is re-introduced by colour coding the map: greater differences between the neurons in the input space translate to darker colours in the map.

    23 Eudaptics, 1999.
    24 In addition to this, the intuitive interface and the ability to work with Excel files make it an attractive package.
    25 The clusters and specific clustering algorithms will be treated in paragraph 3.6.3.

    Figure 3-13 U-matrix
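    One way such colour values can be computed (a sketch, not necessarily Viscovery's actual implementation): each neuron is assigned the average input-space distance to its immediate grid neighbours. The 3×3 grid of one-dimensional neuron weights below is a hypothetical example with a clear border between its left and right columns:

    ```python
    import math

    # Hypothetical trained grid: two groups of neuron weights with a
    # sharp border between columns 1 and 2.
    weights = {(r, c): [0.0] if c < 2 else [10.0]
               for r in range(3) for c in range(3)}

    def u_matrix(weights):
        u = {}
        for (r, c), m in weights.items():
            # immediate grid neighbours (up, down, left, right)
            neigh = [weights[(r + dr, c + dc)]
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if (r + dr, c + dc) in weights]
            # average input-space distance to those neighbours
            u[(r, c)] = sum(math.dist(m, nm) for nm in neigh) / len(neigh)
        return u

    u = u_matrix(weights)
    print(u[(1, 0)], u[(1, 1)], u[(1, 2)])  # border shows up at columns 1-2
    ```

    Large U-values (dark colours) thus mark cluster borders; near-zero values mark the interior of a cluster.
    
    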


    The U-matrix for the earlier used three-dimensional example is shown in figure 3-13. The implicit clustering is visible as groups of neurons having almost equal colour separated by nodes with distinctly different colours. In this U-matrix one very clear cluster at the right of the map can be found. The two clusters at the left half, separated in the middle, are less clear. This agrees with the placement of the three clusters of observations, as can be checked in figure 3-12.

    Component planes

    A component plane is a manifestation of the map whereby the values for only one of the variables (a component) are shown. In this way the distribution of this separate variable over the map can easily be inspected. When comparing two different component planes of the same map, highly correlated variables would stand out because of the likeness of their component planes. Components not contributing much to the distribution of the observations show a more random pattern in their component planes; they are only contributing noise to the clustering.

    Often a display of the U-matrix surrounded by the component planes of all the variables is created. Figure 3-14 shows such a display for our three-dimensional example. The three component planes represent the X, Y and Z variables.


    Figure 3-14 U-matrix and component planes for all three variables

    The display shows that no two variables are highly correlated. The right cluster is characterized by small values for all variables. The top-left cluster is characterized by high values for X and Z, the bottom-left cluster displays high values for Y and Z. This also agrees with the placement of the clusters of observations in figure 3-12.

    3.6.2  Map quality

    We can discern two types of map quality:

    -  The data representation accuracy.

    -  The data set topology representation accuracy.

    Both make use of the ‘Best Matching Unit’ concept.

    Figure 3-15 Best matching unit for vector [2, 0, 1]


    Data set topology representation

  • 8/17/2019 ViscoVery SOM (clustering)

    53/173

    S O M v i s u a l i z a t i o n a n d c l u s t e r i n g

    45

The data set topology representation accuracy can be measured in several ways. One error function often used is the topographic error measure: the percentage of sample vectors whose first and second best matching units are not adjacent to each other. This also measures the smoothness of the mapping.
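This measure can be sketched directly (an illustrative implementation with hypothetical data; adjacency is taken here as a Chebyshev grid distance of 1 between the two units):

```python
import numpy as np

def topographic_error(weights, grid_coords, samples):
    """Fraction of samples whose first and second best matching units
    are not adjacent on the map grid."""
    errors = 0
    for x in samples:
        d = np.linalg.norm(weights - x, axis=1)
        first, second = np.argsort(d)[:2]
        # Adjacent means a (Chebyshev) grid distance of 1 between the units.
        if np.abs(grid_coords[first] - grid_coords[second]).max() > 1:
            errors += 1
    return errors / len(samples)

# A 1-dimensional 3-neuron map that folds back in the input space:
# neurons 0 and 2 are close in input space but far apart on the grid.
weights = np.array([[0.0], [100.0], [10.0]])
grid = np.array([[0], [1], [2]])
err = topographic_error(weights, grid, np.array([[1.0], [99.0]]))
```

For the first sample the two nearest neurons (0 and 2) are not grid neighbours, for the second they are, giving an error of 0.5.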

A more visual tool for evaluating the data set topology representation accuracy is the frequency map. This manifestation of the map displays the number of matched observations per neuron (a darker colour means more matched observations). A good map should show equally distributed frequencies on the frequency map (Figure 3-17).

3.6.3  Clusters

It is left to the user to find any clustering of observations based on the U-matrix and the component planes. This so-called implicit clustering can be complemented with other clustering techniques to find an explicit clustering. Most software implementations of the Self-Organizing Map do not incorporate any explicit clustering algorithms. The Viscovery SOMine package includes up to three different clustering methods.

The clustering algorithm frees the user from the difficult task of identifying clusters in the U-matrix. However, by altering parameters of the clustering algorithm the number of shown clusters may vary. The user still has to select the most adequate clustering based on all available information.

The three clustering methods implemented in Viscovery SOMine are Ward's clustering, SOM single linkage and a combination of these two, called SOM-Ward. Instead of directly clustering the original observations these algorithms perform a clustering on the neurons (grid points) in the map, on which the observations are projected. As these neurons form 'best representations' for the observations in the input space there is no qualitative difference. The clustering of the observations can be found by retrieving the projected observations for each neuron in each cluster.

Figure 3-17 Frequency map

Distance measure

Two of the implemented clustering algorithms make use of a specific distance measure, called the Ward distance. It is defined as:

Ward distance:

    d(x, y) = (n_x · n_y) / (n_x + n_y) · || mean_x − mean_y ||²

where x and y are clusters, n_x is the number of neurons in cluster x and mean_x is the vector with averages over all components of the neurons in cluster x, also known as the cluster centroid. Distances between clusters with an evenly distributed number of neurons are enlarged in comparison with distances between clusters with an uneven distribution of the numbers of neurons (see table 3-1). This accelerates the merging of stray small clusters.
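The measure transcribes directly into code (an illustrative sketch; cluster members are given as rows of a NumPy array):

```python
import numpy as np

def ward_distance(cluster_x, cluster_y):
    """Ward distance: n_x*n_y/(n_x+n_y) * ||mean_x - mean_y||^2."""
    x, y = np.asarray(cluster_x), np.asarray(cluster_y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    return len(x) * len(y) / (len(x) + len(y)) * float(diff @ diff)

# Same centroid distance, different size distributions (cf. table 3-1):
even = ward_distance(np.zeros((5, 2)), np.ones((6, 2)))     # sizes 5 and 6
uneven = ward_distance(np.zeros((1, 2)), np.ones((10, 2)))  # sizes 1 and 10
```

With identical centroid distances the evenly split pair (5, 6) is scaled by 2.73 while the uneven pair (1, 10) is scaled by only 0.91, so the stray small cluster is merged first.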

Ward's clustering

This is one of the classic bottom-up methods. It starts with all the neurons in a separate cluster, in each step merging the clusters having the least Ward distance. This distance is calculated without taking the ordering of the map into account; only distances between neurons in the input space are used. When the found clustering is shown on the map, the clusters may appear disconnected: in the input space the neurons are close by, warranting the inclusion in one cluster, but the grid may be bent through the input space in such a way that the neurons are far apart on the map.
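The bottom-up merging can be sketched as follows (a naive O(n³) illustration of the idea, not Viscovery's implementation; all names are hypothetical):

```python
import numpy as np

def wards_clustering(neurons, n_clusters):
    """Agglomerative Ward clustering of neuron weight vectors: start with
    singleton clusters, repeatedly merge the pair with the least Ward
    distance, ignoring the ordering of the neurons on the map."""
    neurons = np.asarray(neurons)
    clusters = [[i] for i in range(len(neurons))]

    def ward(a, b):
        d = neurons[a].mean(axis=0) - neurons[b].mean(axis=0)
        return len(a) * len(b) / (len(a) + len(b)) * float(d @ d)

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: ward(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

# Four neurons forming two obvious groups in a 1-dimensional input space.
groups = wards_clustering([[0.0], [0.1], [5.0], [5.1]], n_clusters=2)
```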

SOM single linkage

This clustering method concentrates on the ordering of the neurons on the map. For each neuron the distance to its neighbour is calculated; when this distance exceeds a certain threshold a separator is set between the neurons in the grid. If the separators form a closed loop the neurons within the loop are marked as a cluster. Because the forming of the clusters only depends on the smallest possible distances between clusters this clustering method is known as a single linkage method.

SOM-Ward

This clustering method is essentially the same as Ward's clustering, but this time the ordering of the neurons on the map is taken into account. Only clusters that are direct neighbours in the map can be merged together to form a larger cluster. The SOM-Ward clustering technique is primarily used in our research. An example of SOM-Ward clustering (using the same 3-dimensional data set) is shown in figure 3-18.

Table 3-1 Ward distances for different cluster sizes

    n_x   n_y   n_x·n_y / (n_x + n_y)
     1    10    0.91
     2     9    1.64
     3     8    2.18
     4     7    2.55
     5     6    2.73

Number of neurons

One of the main settings to choose when training a map is the number of output neurons.

A small number of neurons (smaller than the total number of observations in the train set) means a more general fit is made. The map is better at generalizing and is less sensitive to noise in the data. Figure 3-19 shows the underlying function (y = sin(x)), the train data with some uniformly distributed random noise added, and a 1-dimensional 5-neuron grid.

A large number of neurons (larger than the total number of observations in the train set) means a more precise fit is made, but the map is more sensitive to noise in the data. The neurons do not precisely match the original observations, but almost all observations are mapped to separate neurons. Figure 3-20 shows the same data, now with a 20-neuron grid.

Clearly, the fit of the network to the original data is better in this second case, but the error with respect to the underlying function is also greater. Notice that the network is not completely 'attracted' to outliers, due to the learning rate factor and the neighbourhood function. Although the network has more neurons it still is a fairly good generalizer for the underlying function. Compare this to polynomial fitting; higher order polynomials often lead to large errors!

The number of neurons should be chosen in proportion to the trust one places in his or her data: if a lot of noise is to be expected, then a relatively small number of neurons should be chosen. If the distribution of the sample data very closely resembles the underlying distribution of the population, then a relatively large number of neurons can be initialized. The extra neurons then warrant a more refined representation of the data by the network.

Figure 3-19 Fitting a 5-neuron network to datapoints with underlying function y = sin(x)

Figure 3-20 Fitting a 20-neuron network to datapoints with underlying function y = sin(x)
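The effect can be reproduced with a small self-organizing map. The following is an illustrative sketch (not the thesis's actual setup): a 1-dimensional chain of neurons trained on noisy datapoints from y = sin(x), with a linearly decreasing learning rate and a shrinking Gaussian neighbourhood.

```python
import numpy as np

def train_1d_som(data, n_neurons, n_iter=3000, seed=0):
    """Fit a 1-dimensional chain of neurons to 2-d datapoints."""
    rng = np.random.default_rng(seed)
    # Linear initialization along the x-range of the data.
    w = np.column_stack([
        np.linspace(data[:, 0].min(), data[:, 0].max(), n_neurons),
        np.zeros(n_neurons),
    ])
    grid = np.arange(n_neurons, dtype=float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        alpha = 0.5 * (1.0 - t / n_iter)                 # decreasing learning rate
        sigma = max(n_neurons / 2.0 * (1.0 - t / n_iter), 0.5)
        bmu = int(np.argmin(np.linalg.norm(w - x, axis=1)))
        h = np.exp(-(grid - grid[bmu]) ** 2 / (2.0 * sigma ** 2))
        w += alpha * h[:, None] * (x - w)                # pull neurons towards sample
    return w

rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 2.0 * np.pi, 200)
data = np.column_stack([xs, np.sin(xs) + rng.normal(0.0, 0.1, 200)])
w5 = train_1d_som(data, n_neurons=5)    # general fit, robust to noise
w20 = train_1d_som(data, n_neurons=20)  # closer fit, more noise-sensitive
```

The 5-neuron chain tracks the underlying sine only roughly while averaging out the noise; the 20-neuron chain follows the individual datapoints more closely.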

Initialization

Instead of random initialization one often uses linear initialization. Both can be used, but linear initialization provides a better starting point for the organization of the map. The map is often linearly initialized along the axes provided by the first two principal components of the data set.

Choice of learning rate factor and neighbourhood function

The learning rate factor α(t) is normally a linearly decreasing function over the iterations, but can also be specified as an inverse-time function:

    α(t) = A / (B + t),

where A and B are constants. Earlier and later samples will now be taken into account with approximately similar average weights [26].

The neighbourhood function often has the Gaussian form

    h_ij(t) = exp( −|| r_i − r_j ||² / (2σ²(t)) ),

where r_i denotes the place of neuron i in the map and σ(t) is some monotonically decreasing function over the iterations. Sometimes a simpler form of the neighbourhood function is used, e.g. the bubble function, which just denotes a fixed set of neurons around the winning neuron (in the map). The Gaussian form ensures a global best ordering of the map (the quantization error arrives at a global minimum instead of a local minimum) [27].

[26] Kohonen, T., 1997, page 117.
[27] Kohonen, T., 1997, page 118.
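Both functions are easy to write down in code; a minimal sketch (the constants A, B and the value of σ(t) are chosen arbitrarily for illustration):

```python
import numpy as np

def learning_rate(t, A=1.0, B=100.0):
    """Inverse-time learning rate: alpha(t) = A / (B + t)."""
    return A / (B + t)

def gaussian_neighbourhood(r_i, r_j, sigma_t):
    """h_ij(t) = exp(-||r_i - r_j||^2 / (2 * sigma(t)^2)) for the grid
    coordinates r_i, r_j of two neurons."""
    d2 = float(np.sum((np.asarray(r_i) - np.asarray(r_j)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma_t ** 2)))

h_near = gaussian_neighbourhood([0, 0], [1, 0], sigma_t=2.0)  # grid neighbour
h_far = gaussian_neighbourhood([0, 0], [5, 5], sigma_t=2.0)   # distant neuron
```

Neurons close to the winner on the grid receive nearly the full update, distant neurons almost none; as σ(t) decreases the updates become ever more local.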

3.7 SOM interpretation and evaluation


In the knowledge discovery process the SOM maps are mainly used for two reasons: describing the data set and predicting values for certain aspects of the data. Each of these applications demands a specific way of evaluating and interpreting the map.

3.7.1  Description

When a map has been created the user has to evaluate the map, determine a good clustering and possibly improve on the clustering so that a clear understanding of the underlying data set emerges.

Determining a good clustering is a non-trivial task. Of course the variables used for map creation have to be suitable for the research setting. Then each specific setting for the used clustering algorithm renders a different number of clusters visible. The map quality measures and the quantitative cluster quality measure form a starting point for determining a good clustering. It is up to the expert user to choose a clustering suitable for the task at hand, specifically by taking any domain knowledge into account.

Improving the clustering

Often one tries to improve on the results (clustering or readability of the display) by reducing the number of variables used in the creation of the map. Removing a variable is warranted only under certain conditions; if these conditions hold then the variable does not contribute much to the generated map and can safely be removed:

-  With or without the variable the distribution of the companies over the map remains equal.
-  With or without the variable the clustering remains the same (same size and same characteristics in terms of individual variables).

Two strong visual clues lead us to these kinds of variables:

-  The component plane of the variable shows a random distribution (Figure 3-21). The component only adds noise to the formation of the map; it does not contribute to the distribution of companies over the map. For instance, this could happen when the variance of the normalized variable is significantly lower than the variance of the other normalized variables.

-  The component plane of the variable bears a close resemblance with the component plane of another variable (Figure 3-21). The variables are then highly correlated (not necessarily in a linear fashion). The dependent variable does not contribute to the distribution of companies over the map, because the same information is already contained in the other variable.


A less strong visual clue also leads us to spurious variables:

-  The distribution of the high and low values of the component plane does not coincide with one or more specific clusters (Figure 3-22). A strong characterization of the clusters (regarding this variable) cannot be given. It is most likely that the variable does not contribute to the clustering, so we choose to remove the variable.

Figure 3-22 Distribution of variable does not coincide with clustering
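A quick numerical check can back up the visual clue of resembling component planes; for instance (an illustrative sketch, testing only linear correlation between the neuron values of two planes):

```python
import numpy as np

def planes_highly_correlated(plane_a, plane_b, threshold=0.9):
    """True when the absolute Pearson correlation between the neuron
    values of two component planes exceeds the threshold."""
    r = np.corrcoef(plane_a.ravel(), plane_b.ravel())[0, 1]
    return bool(abs(r) > threshold)

plane_x = np.linspace(0.0, 1.0, 25).reshape(5, 5)  # a smooth component plane
plane_z = 2.0 * plane_x + 0.1                      # a linearly dependent plane
plane_w = (np.indices((5, 5)).sum(axis=0) % 2).astype(float)  # checkerboard
```

Note that visual comparison of component planes also catches non-linear dependence, which a Pearson correlation like this one misses.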

Examples

In appendix III and IV two examples of descriptive SOM use can be found, one on a medical domain and the second on a database marketing domain. Chapter 4 also uses SOM in a descriptive way to evaluate the link between credit ratings and financial ratios.

Figure 3-21 Random and highly correlated component planes

a relatively short time span. When using all the variables for map creation, and then subsequently removing variables not contributing much to the prediction power, we can be certain that all contributing combinations are found. Unfortunately this strategy is more time consuming.

Using the target variable as a train variable

For classification purposes most often multi-layered backpropagation networks are used. For these networks it is possible to train the network based on the train variables and the target variable. For each observation the state of the train variables is shown to the network. The network gives a prediction for class membership, and this prediction is compared with the real class membership (the target variable), leading to adjustments in the network to account for any deviations (the backpropagation step). This is also known as supervised training; the network adapts to better distinguish the differences between the classes the observations can belong to.

For the SOM as a feed forward network, it is not possible to directly match the real value of the target variable with the predicted value of the target variable. But we can simulate it by using the target variable as a train variable during map creation; this is known as semi-supervised training [28]. How this can be beneficial to a distinction between observations in different clusters is illustrated in the following figures. Without using the target variable as a train variable, the map in figure 3-23 (consisting of just two neurons) is created using only 1 variable or 1 dimension. A distinction between the observations is difficult to make; it is hard to see to which neuron the new observation (green) is matched in the one-dimensional final map (the distance to either neuron is equal).

Figure 3-23 SOM network when only 1-dimensional (x-axis) information about the datapoints (red plusses) is available. The best matching neuron for the new observation (green star) is difficult to determine.


When using the target variable as a train variable, the map is created using two dimensions (figure 3-24). The placement of the neurons shifts; it is much clearer that the new observation matches the rightmost neuron. Remember that we do not have the value of the target variable for the new observation, so we can still only use the x-dimension to determine the best matching unit for this new observation.

Of course this particular example only illustrates one possible outcome of using the target variable as a train variable. A deeper investigation into the effects of this technique lies outside the scope of this thesis.
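The matching step for new observations can be sketched as follows (illustrative only; the weights are hypothetical): the map was trained on both dimensions, but the best matching unit for a new observation is searched using the train dimensions alone.

```python
import numpy as np

def bmu_excluding_target(weights, x, train_dims):
    """Best matching unit using only the train dimensions; the target
    dimension of the neuron weights is ignored during matching."""
    d = np.linalg.norm(weights[:, train_dims] - x[train_dims], axis=1)
    return int(np.argmin(d))

# Two neurons after semi-supervised training: column 0 is the train
# variable (x-axis), column 1 the target variable (y-axis).
weights = np.array([[0.8, 0.0],
                    [1.6, 5.0]])
new_obs = np.array([1.0, np.nan])  # target value unknown for a new observation
bmu = bmu_excluding_target(weights, new_obs, train_dims=[0])
```

The predicted target value for the new observation can then be read off from the target component of the matched neuron.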

Figure 3-24 SOM network when the target variable (y-axis) is also used when training the network. The best matching neuron for the new observation (green star) is easy to determine.

Examples

An example of the use of SOM as a prediction model can be found in chapter 5: financial ratios are used to classify companies according to creditworthiness.

[28] Kohonen, T., 1997.

3.8 SOM questions and answers


Q: Is it a neural network?
A: Yes, but a very special one; a feed forward neural network with no hidden layers. The inner workings of the SOM are relatively simple (see paragraph 3.5) and therefore much clearer than for networks using multiple layers and backpropagation.

Q: Is it a black box?
A: No, the SOM is nothing more than the projection on a non-linear plane drawn through the observations. The form of the plane is set using a very strict and clear algorithm, and the form of the plane is fixed after the algorithm has completed. The component planes give us insight into the contribution of individual variables to the clustering. Other neural nets use multiple layers and backpropagation, making the inner workings of the network more difficult to comprehend.

Q: How can the neural network be flattened and unstretched for output viewing (the map) but still keep the fixed form in the input space (fixed after completing the algorithm)?
A: It is not really the grid in the input space that is flattened and unstretched, rather a direct representation of this grid in 2 dimensions. Each neuron in the input space directly corresponds with a grid point in the 2-dimensional map.

Q: Is there a chance of overfitting the neural network when using a large number of neurons (larger than the number of observations)?
A: This depends on your definition of overfitting. The SOM algorithm includes automatic 'dampening' functions in the form of the learning rate factor and the neighbourhood function. When using a large number of neurons the network more precisely represents the underlying data set; some would consider this overfitting. However, thanks to the dampening functions the neurons are not completely attracted by the specific observations.

Q: Does the order in which the observations are being processed by the self-organization process make any difference for the final results?
A: No, because instead of processing the observations just once, often multiple iterations are used. Together with the used dampening functions the map converges to a stable form.

Q: What is the statistical significance of results found with SOM?
A: The SOM can be used in two ways: (1) to give an accurate description of the data set, and (2) to predict values for one or more variables. For descriptive use several SOM and cluster quality measures exist (see paragraph 3.6), but (like other visualization techniques) no general statistical 'goodness' indicator exists.

For predictive use we should see the SOM as a form of non-linear regression, without a presupposed form of the fitted function. Because of the non-linearity of the model the direct contributions of the individual variables are difficult to assess. The total performance of the model can be measured and validated using common statistical techniques.

    3.9 Summary


Chapter 3 covered the theoretical foundations of SOM. We viewed the place of Self-Organizing Maps in the knowledge discovery process, and we described some projection, clustering and classification techniques related to SOM. The SOM is a combination of non-linear projection and hierarchical clustering, driven by a simple feed forward neural network. The observations are projected on a flexible grid of neurons that stretches and bends to accommodate the distribution of the data in the input space. After the network has found its final form, it is displayed in a flattened state as a map. The observations projected on this map are then clustered, according to similarity of the used variables.

A Self-Organizing Map can be used in two ways: as a descriptive analysis tool, and as a prediction model. For use in a descriptive setting the map display and the clustering are most important. Visually comparing the clusters and other parts of the SOM display provides a good and insightful overview of the underlying data set. When deploying the SOM as a prediction model, we are more interested in the distribution of the companies over the map (or equivalently, the form of the map) than the clustering. The SOM then functions as a semi-parametric (possibly non-linear) regression model.


4 descriptive analysis


The paragraphs in chapter 4 form an account of our descriptive analysis, using the SOM as a visual exploration tool. We answer questions 3 and 4 from the introduction:

3.  Is it possible to find a logical clustering of the companies, based on the financial statements of these companies?

4.  If such a clustering is found, does this clustering coincide with levels of creditworthiness of the companies in a cluster?

Paragraph 1 covers the basic data analysis. Paragraph 2 explores the possibility of clustering companies based on financial data. In paragraph 3 we then compare the found clustering with the credit ratings of the clustered companies. Paragraph 4 reviews the performed sensitivity analysis and in paragraph 5 we benchmark the SOM results to a principal components analysis.

4.1 Basic data analysis


Our basic data analysis comprises the first three steps of the knowledge discovery process, namely data selection, data pre-processing and data transformation.

4.1.1  Data selection

The