faculty ofcognitive sciences and human development mining and data visualization using self... ·...

24
Faculty of Cognitive Sciences and Human Development DATA MINING AND DATA VISUALIZATION USING SELF-ORGANIZING MAP (SOM) Ahmad Naufal Khan Bin Jamil Khan (45433) Bachelor of Science with Honours (Cognitive Science) 2017

Upload: vokhue

Post on 08-Jun-2019

213 views

Category:

Documents


0 download

TRANSCRIPT

Faculty of Cognitive Sciences and Human Development

DATA MINING AND DATA VISUALIZATION USING SELF-ORGANIZING MAP (SOM)

Ahmad Naufal Khan Bin Jamil Khan

(45433)

Bachelor of Science with Honours (Cognitive Science)

2017

DAT A MINING AND DATA VISUALIZATION USING SELF-OR GAN IZING MAP (SOM)

AHMAD NAUFAL KHAN BIN IAMIL KHAN

This project is submitted in partia l ful fi lment of the requirements for a

Bachelor of Sc ience with Honours (Cogniti ve Sc ience)

Facul ty of Cogn itive Sciences and Human Deve lopment

UN IVERS ITI MA LAYS IA SARAWAK

(20 17)

UNIVERSITI MALAYSIA SARAWAK

A shyGra de

Please tick one Final Year Project Report IS]

Ma sters 0

PhD ~

DECLARATION OF ORIGINAL WORK

Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17

Students Declaration

I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES

AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND

D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person

9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)

1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or

above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work

9 J UNE 20 17 Da te _________

(A P DR E H CREE SIONG) Receive d for exa minatIOn by

I declare this ProJectThesis IS classified as (Please tick (Jraquo)

n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)

CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)

tEl OPEN ACCESS

I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))

tEl YES J NO

Validation of ProjectITh es is

I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows

bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)

bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes

bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database

bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes

bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS

bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS

Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017

Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn

The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for exam ination by

--~-------(AP DR TEH CHEE SIONG)

Date 9th June 20 17

Grade

Ashy

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

DAT A MINING AND DATA VISUALIZATION USING SELF-OR GAN IZING MAP (SOM)

AHMAD NAUFAL KHAN BIN IAMIL KHAN

This project is submitted in partia l ful fi lment of the requirements for a

Bachelor of Sc ience with Honours (Cogniti ve Sc ience)

Facul ty of Cogn itive Sciences and Human Deve lopment

UN IVERS ITI MA LAYS IA SARAWAK

(20 17)

UNIVERSITI MALAYSIA SARAWAK

A shyGra de

Please tick one Final Year Project Report IS]

Ma sters 0

PhD ~

DECLARATION OF ORIGINAL WORK

Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17

Students Declaration

I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES

AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND

D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person

9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)

1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or

above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work

9 J UNE 20 17 Da te _________

(A P DR E H CREE SIONG) Receive d for exa minatIOn by

I declare this ProJectThesis IS classified as (Please tick (Jraquo)

n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)

CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)

tEl OPEN ACCESS

I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))

tEl YES J NO

Validation of ProjectITh es is

I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows

bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)

bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes

bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database

bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes

bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS

bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS

Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017

Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn

The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for exam ination by

--~-------(AP DR TEH CHEE SIONG)

Date 9th June 20 17

Grade

Ashy

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

UNIVERSITI MALAYSIA SARAWAK

A shyGra de

Please tick one Final Year Project Report IS]

Ma sters 0

PhD ~

DECLARATION OF ORIGINAL WORK

Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17

Students Declaration

I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES

AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND

D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person

9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)

1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or

above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work

9 J UNE 20 17 Da te _________

(A P DR E H CREE SIONG) Receive d for exa minatIOn by

I declare this ProJectThesis IS classified as (Please tick (Jraquo)

n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)

CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)

tEl OPEN ACCESS

I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))

tEl YES J NO

Validation of ProjectITh es is

I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows

bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)

bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes

bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database

bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes

bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS

bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS

Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017

Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn

The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for exam ination by

--~-------(AP DR TEH CHEE SIONG)

Date 9th June 20 17

Grade

Ashy

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

I declare this ProJectThesis IS classified as (Please tick (Jraquo)

n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)

CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)

tEl OPEN ACCESS

I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))

tEl YES J NO

Validation of ProjectITh es is

I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows

bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)

bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes

bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database

bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes

bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS

bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS

Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017

Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn

The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for exam ination by

--~-------(AP DR TEH CHEE SIONG)

Date 9th June 20 17

Grade

Ashy

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)

Received for exam ination by

--~-------(AP DR TEH CHEE SIONG)

Date 9th June 20 17

Grade

Ashy

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

ACKNOWLEDGEMENT

I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner

I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope

when giving up seems like th e onl y option Without Him my wo rk will neve r be completed

To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest

inspiration of all time J would like to give the highest appreciation to my dear superv iso r for

showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s

wonderful topic You ve showed me what it means to be a responsible person and to wo rk

through eve ry hardship

I also wo uld like to thank my parents They have always enco uraged me to continue

wo rk and convince me that J am capable of doing it They al ways believe ill me and that has

been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to

con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your

undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck

and thin you will always be in my praye rs You are trul y a precious gem

bull

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

TABLE OF CONTENTS

ACKNOWLEDGEMENT III

vi

ABSTRACT vii

LIST OF FIGURES

ABSTRAK _ viii

CHAPTER 1 1

10 INTRODUCTION 1

101 Prelim inaries 1

102 Artificial Intelligence (AI) 2

103 Biological Neural Network (BNN) 3

LOA Artificial Neural Network (ANN) 4

105 Data vis ualization 5

1 1 Problem statement 5

12 Research Objectives 7

13 Specific Ob jectives 7

14 Resea rch Scopes 7

15 Research Methodologies 7

9CHAPTER 2

20 LITERATURE REVIEW 9

21 Intro duct ion 9

22 Dimension Redu ction 9

23 Da ta visualization 9

24 Visualization of high-dimensional data item s 10

2S Data visu al ization in the past and current 11

26 PrinCipal Component Analys is (PCA) bull 12

27 Sammon s Mapping (SM) 13

28 Self-Orga nizing Maps (SOM) 14

CHAPTER 3 16

30THE PROPOSED DATA VISUALIZATION METHOD 16

31 Introduction 16

31 Self-Organ izing Map (SOM ) 16

32 Advantages of Self-Organizing Map (SOM) 20

33 Issues Related to SO M 21

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

23

CHAPTER 4 23

40 EXPERIMENTS ON BENCHMARK DATAS ETS 23

4 1 Introduction 23

42 Data Visualization

43 Visualizations comparison of various datasets 23

44lris Datasets 24

45 W ine Datasets 35

46 Glass Identification Data set 40

CHAPTER S 43

50 DISCUSSION AND CONCLUSION

51 Introduction

52 Data visua liza tion Analysis

53 Data construction

54 Data normalization

5 5 Map Train ing

5 6 Data Visua lization

5 7 SOM_SHOW command in SO M Toolbox

58 SOM _5H OW_ADD command in SOM Toolbox

59 50M_GRID com mand in SOM Toolbox

5 10 Future applicat ions of 50M

REFEREN CES

43

43

43

43

43

44

44

45

45

46

48

v

46

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

LIST OF FIGURES

Figure I Representation of Kohoen SOM a lghorithm 17

Figure 2 Flow chart of training process in SOM 18

Figure 3 Map grid showin g how clustering works 19

Figure 4 Histogram and sca lier plots of Iri s Datasets 24

Figure 5 U-matrix Distance matrix of Iri s Dataset 27

Figure 6 U-matrix and empty label grid 28

Figure 7 Labelled U-m atrix for each class 39

Figure 8 Hit hi stogram showing the BM U 30

Figure 9 Coloured hit histograms accord ing to thei r classes 31

Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33

Figure 11 Many forms of U-matrix representations 37

Figure 12 U-matr ix are labelled by Wine c lasses 39

Figure 13 Map grid of Wine datasets (3 -dimens ional) 40

Figure 14 Distance matrix of glass identification data and its variables 41

Figure 15 Labe ll ed distance matri x of glass identification data 42

Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42

Figure 17 Di stance matrix of glass-identifi cat ion 46

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

ABSTRACT

Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in

recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible

to be ex tracted and have to be done manually With the aid of data mining techn ology now

information can be gathered from data sets at much shorter time The di scovery of data

visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by

everyone B ig dimensions can now be reduced to help data be more understandable In this

thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data

mining and data visualizations SOM is a neural network technique that can performs data

mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All

steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped

Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm

Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y

ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such

examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in

discuss ion section

Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

ABSTRAK

Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang

tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini

data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data

besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini

kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan

penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan

bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses

pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen

te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng

dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik

rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran

data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h

dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan

diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja

SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h

penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga

diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan

teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data

Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i

Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

CHAPTER

10 INTRODUCTION

101 Preliminaries

The pattern recognition re lated studies are concerned with teaching a machine to identify

pattern of interest from its background Little did we know humans are born with such gifted

abilities that we tend to take for granted Humans are ab le to identify patterns even as babies

Thi s given abi lity may be better than what machines are able to do even at current level of

technology

However this is limited to a certain extend at which humans are better than machines

Given a big dataset humans minds may be able to perform data recognition but at slow

speed compared to machines These massive datasets are also demandin g a thrust in data

visualization technology producing massive big leaps in related researches Thi s field of

study aims to increase our understanding of datasets what useful infonmation is latent in it

and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse

2002) A good portion of data visualization related researches focuses on making it easier to

v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et

aI 2002)

Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables

us to see analytic data in visual fonm instead of plain numbers This will allow humans to

identify patterns or understand hard concepts With the help of such interactive visualization

humans can use this technology to present infonmation in charts or graphs for more

1

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier

(Tufte 1983)

This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g

computer systems that can engage on behaviors that humans consider intelligent In current

time com putational computations use a lgor itl1mic approach where the computers must know

the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by

depict ing paradigm for computing rather than the conventional computing (Te h 2006)

Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial

Inte II igence is c reated through in spiration of how bio logical nervous system processes

information especia ll y in the brain Most of the work in ANN is related to neurons and how

they train data In data visual ization it is a lso important to classify data first so that we can

visua lize data better and c learer ANN is useful for this purpose because it can be

programmed to perform specific tasks (Teh 2006)

102 A rtificial Intelligence (AI)

A I has been defined in many ways to represent different point of views in literature

(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to

causes machi nes to perform tasks that requires cognition or perception while performed by

humans (Sage 1990) Simon (199 1) defined AI as

We call a program for a computer artificially intelligent if il does something which

when done by a human being will be (hought to require human il1lelligence

Wilson (1992) also prov ided a definiti on for AI as followi ng

2

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive

reason and act (W il son 1992as cited in Kumar 2004p12)

According to Kumar (2004) an y artificial should consists three elements in essence First

of all Al system has to be able to handle knowledge that is both general and domain specific

implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable

mechanism to constra in the search throu gh the knowledge base and able to arrive at

conclusion from premises and available ev idence Lastly it should possess mechanism for

learning new information with minimal noi se to the existing knowledge structure as provided

by the env ironment in which the system operates at

The Al study area is often related to biological neural system as it is hugely motivated by

the biological intelligence system provided by nature

103 Biological Neural Network (BNN)

The human brain is one of the most complex system ever discovered It is mOre complex

than any machine ever created This is because our brains contains billion of cell s that

regulates the body stores inform ation and retrieving them as well as making decisions in a

way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside

humans are made of hundred billion information-processing units called neurons These

neurons are probably the most vital units that enables humans brain to perform all the acti ons

that they are capable of Adding to the massive number of neurons each of them consists of

an average of 10000 synapses that pass along information between them in incredibly fast

speed This results in total num ber of connections in the network to reach up to a huge figure

of approximately 10 5

3

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

In depth each neurons contain dendrites spreading from each neurons These tree-like

structures transports s ignals from one neuron to other neuron At the end of the neuron there

exists soma a porti on of the neuron that co ntains the nucleus and other vital components that

decide what kind of output s ignals the neuron will give off in accordance to input s ignals

received by other ne urons Afterwards th eres another important structure ca lled axon whi ch

carries the output signal produced in soma towards other neurons At the e nd of the axon

there is located a synaptic term inal which is the transfer point of information T he synaptic

termina ls are not conn ected with each other There are gaps between them ca lled the sy napses

Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single

process that occurs while in fact in our brain there are billions o f same processes occurring

w ithin milliseconds

104 Artificial Neural Network (ANN)

ANNs are human s attempt on makin g biological neural network into computational

network Just like neurons in our brains it is able to so lve complex computations through

se lf-organiz ing abilities by learning information (Gra upe1977)

However there are perspectives that ANN model s are not as accurate and effective until

they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued

that even though nature has its own way that no machines can replicate the compliment is

a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have

limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in

details unless it can guarantee full benefits

4

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

105 Data visualization

As mentioned before data visualizatio n is the field of study that enab les user to explore

the properties of a dataset based on the person s own intuition and understand of domain

knowledge since it is meant to co nve y hidden information about the data to the user

(Teh2006) Data mining data warehousing retrieval systems and knowledge applications

have widely increase the demand of the expansion of data visualization field A huge

advantage of data vis uali zation over non-data visualization technique is it provides direct

feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the

necessity and capability of data v isualizatio n Konig (1998) explains more

To cope with today s flood ofdata from rapidly growing dalabases and relaled

compulalional resources especially 10 discover salienl Slruclures and imerescing

correlacions in dara requires advanced mechods o(machine learning palcern

recognition dara analysis and visualizations The remarkable abilities ofthe human

observers co perceive eluslen and correlalions and chus scruClllre in dara is of

greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and

interactive visualizations (Koni g 1998 p55)

11 Problem statement

Data vis uali zation is a field that has been supported by many respected approaches such

as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional

Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal

Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA

MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization

since they demonstrate major di sadvantages For example PCA loses certain useful

information upon dimension reduction while MDS and Sammon s Mapping require heavy

5

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

computati on which essentially leads to impractica l methods In addition MDS and

Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting

data (Wu amp Chow 2005) This also loses its practicality

Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused

unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from

N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not

able to reveal the underly ing structure of data as we ll as the inter-neuron distances from

-dimensional input space to low-dimensional output space (Yin 2002)

In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM

(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data

structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are

better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron

di stances data topology and data structure preservation by Yin (2002) and Wu and Chow

(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned

lim itations of PCA MDS and Sammon s Mapp ing

However SOM faces problems in optimizing the accuracy of class ifi cation as compared

agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even

its variants ViSOM and PRSOM are not ab le to perform data classification although they

contributed a significant boost to data visualization

A lthough the primary focus of this research is data visuali zation but it can never stands

on its own Data vi sualization usuall y comes hand in hand with data class ification In order to

correctl y visualize data more accurate ly the data has to be classified first so it can present the

data in more presentable manner If big datasets are presented without being class ified first it

will show confusing data 6

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

12 Research Objectives

The ai m of this research is to propose SOM as a mode l of data visuali zation The

visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances

and data topology from N-dimensional input space to low-d imensional o utput space

13 Specific Objectives

To study how SOM toolbox performs vari o us methods of data v isualization

To investigate the feas ibility of the pro posed method fo r data visualization

To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets

To evaluate the efficiency of data vis ualization methods

14 Research Scopes

The scope of thi s research is confined in developing an ANN method to support data

visuali zation with computati onal efficiency In depth data structure data topology and

inter-neuron distances wi ll be preserved as much as possible

Empirical studies on the benchmark datasets wi II be conducted to demonstrate data

visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons

will be conducted

J5 Research Methodologies

T he research kicks off with thorough literature reviews of the recent data visual iza tion

methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and

compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted

in order to define a way of enhancing a selected supervised ANN with adequate data

visua li zati on ab ilities

7

-- _ W=fshy

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

The proposed methods are implemented using MATLAB 70 so ftware packages

(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re

conducted to evaluate the performance o f selected method Iris flo wer dataset will first be

tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics

8

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

CHAPTER 2

20 LITERATURE REVIEW

21 IntJoduction

Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety

Various approaches have been introd uced a lo ng the line In this chapter methods fo r data

vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and

weaknesses

This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction

Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r

data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be

high lighted fo r each of th e techniques

22 Dimension Reduction

The o bjective of dimension reduction is to represent data of highers dimens ions usuall y

more than three in a much lower dimensiona l space with minimum loss of info rmation

Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n

ai ms to a id humans in better un derstanding the data and to make use of them fo r good use

The methods th at will be used and di scussed in later sections are Sammon s Mapping

(Sammon 1969) and SOM (Kohonen1984)

23 Data visualization

Visualizati on is the transformation of the symbol ic into the geometric There are man y

reasons why visua li zations a re created such as to answer questio ns make deci sions seeing

data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve

9

- ilif shy

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

the purpose visual ization has to be constructed obeying the des ign principles as proposed by

Mack inlay (1986) Those principles are

I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso

express ible in the sentence of the language including all the facts in the data

2) Effecti veness - a v isualizati on is considered effective if the information portrayed by

the v isua lization is mo re read ily perce ived than the info rmation provided by any other

fo rm of visua lization

Data visualizati on methods represent the underl y ing structures of data and the

mu ltivari ate re latio nship among the data samples which are in researchers interest for

decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck

visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the

so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue

med ian and the two qual1iles of data which can be drawn to visua lize the summary of

low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of

visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization

methods wi ll be d iscussed bri e fi y for hi gh-dimensional data

24 Visualization of high-dimensional data items

To help visualize data better numerica l approaches related to graphi ca l rep resentation

have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the

similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one

them related to data visual ization It is a very simple form of data visualization which

prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the

correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is

10

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

for each data sample If the connections are correctly arranged the patterns will form and

produce good pattern However relationships between data samples are not visible Vice

versa if the data are not properl y arranged in order it may cause confusing data visuali zati on

(Tan et al2006)

Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is

obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids

(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the

data attributes

25 Data visualization in the past and current

Data visualization date s back to the 2nd century AD where rapid development has

occurred during that period The earl iest tab le ever recorded was created in Egypt where it

was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a

textual representation of data but contains visual attributes that help humans recognize better

According to Few (2007) graphs didn t exist until 17th century where it was invented by the

infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts

In contemporary use data visualizatio n has shown its importance in business intelligence

but sti ll being igno red by a fraction of business people It is still a misunderstood concept

where people take it for granted and to be taken li ghtly However there are sti ll good news

related to the adv ancement of data visuali zatio n especiall y in academic field s In university as

an example all courses require data visualization as part of explanation in any matter at al l

Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on

visuali zatio n and a few have excellent programs that serve the needs of many graduate

students whom may req uire it for their researches and prototype application There al so exist

bad trends in data visualization in current days espec ially when people rush to embrace the

11

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on

is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in

field of adverti sements when they try to implement vi sua lizati ons technique wi thout

considering the design principle Many people are ye t to grasp the idea of data visualizati ons

full y While many may be und er the impressions that data visualizat ion is about making

things pretty an d cute truth is it is about dressing up presentations to tackle the audience It

doesnt necessarily have to be colorful just as long as it catches the attention of the users it is

alread y considered a success (Few 2007)

26 Principal Component Analysis (PCA)

PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l

directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It

di splays data in a linear projection o n such a subspace of the original data space that best

preserves the data variances (Kaski 1997) By se lecting the major components the PCA is

ab le to effectively reduce the data dimensions and di splay the data in terms of selected and

reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an

optimal linea r projecti on of the mean square e rror between origina l data po ints and th e

projected one as shown in Eq(l)

Eq(l)

Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re

the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx

onto the)h principal dimension

12

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13

However linear PCA loses celiain useful information about the data while dealing w ith

data having much higher dimension than two (Yin2002) Linear PCA is not able to display

the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional

dataset linear vi sualization methods may have difficulty to represent the data we ll in

low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to

produce non-linear low-dimensional di splay of high-dimensional data

27 Sammons Mapping (SM)

Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y

associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between

two data po ints in N-dimensional space to low-d imensional di splay space The advantage of

Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space

which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance

between two points i and j in N-dimensional data space and dl representin g their

corresponding dista nces in low-dimensional display space the cost functio n of Sammon s

Mapping can be read according to Kohonen (20 J I) as

Eq(2)

As in MDS Sammons Mapping performs dimensional reduction by di splaying

multidimensional data in a low-dimensional no nlinear di splay space However both vIDS

and Sammons Mapping display weakness in provid ing exp licit mapping function to

accommodate new data samples in the visua lizati on without re-computing the existing data

(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich

causes im practicality when using Sammons Mapping in mUltiple occas ions Although

13