faculty ofcognitive sciences and human development mining and data visualization using self... ·...
TRANSCRIPT
Faculty of Cognitive Sciences and Human Development
DATA MINING AND DATA VISUALIZATION USING SELF-ORGANIZING MAP (SOM)
Ahmad Naufal Khan Bin Jamil Khan
(45433)
Bachelor of Science with Honours (Cognitive Science)
2017
DAT A MINING AND DATA VISUALIZATION USING SELF-OR GAN IZING MAP (SOM)
AHMAD NAUFAL KHAN BIN IAMIL KHAN
This project is submitted in partia l ful fi lment of the requirements for a
Bachelor of Sc ience with Honours (Cogniti ve Sc ience)
Facul ty of Cogn itive Sciences and Human Deve lopment
UN IVERS ITI MA LAYS IA SARAWAK
(20 17)
UNIVERSITI MALAYSIA SARAWAK
A shyGra de
Please tick one Final Year Project Report IS]
Ma sters 0
PhD ~
DECLARATION OF ORIGINAL WORK
Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17
Students Declaration
I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES
AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND
D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person
9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)
1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or
above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work
9 J UNE 20 17 Da te _________
(A P DR E H CREE SIONG) Receive d for exa minatIOn by
I declare this ProJectThesis IS classified as (Please tick (Jraquo)
n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)
CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)
tEl OPEN ACCESS
I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))
tEl YES J NO
Validation of ProjectITh es is
I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows
bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)
bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes
bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database
bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes
bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS
bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS
Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017
Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn
The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for exam ination by
--~-------(AP DR TEH CHEE SIONG)
Date 9th June 20 17
Grade
Ashy
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
DAT A MINING AND DATA VISUALIZATION USING SELF-OR GAN IZING MAP (SOM)
AHMAD NAUFAL KHAN BIN IAMIL KHAN
This project is submitted in partia l ful fi lment of the requirements for a
Bachelor of Sc ience with Honours (Cogniti ve Sc ience)
Facul ty of Cogn itive Sciences and Human Deve lopment
UN IVERS ITI MA LAYS IA SARAWAK
(20 17)
UNIVERSITI MALAYSIA SARAWAK
A shyGra de
Please tick one Final Year Project Report IS]
Ma sters 0
PhD ~
DECLARATION OF ORIGINAL WORK
Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17
Students Declaration
I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES
AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND
D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person
9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)
1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or
above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work
9 J UNE 20 17 Da te _________
(A P DR E H CREE SIONG) Receive d for exa minatIOn by
I declare this ProJectThesis IS classified as (Please tick (Jraquo)
n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)
CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)
tEl OPEN ACCESS
I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))
tEl YES J NO
Validation of ProjectITh es is
I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows
bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)
bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes
bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database
bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes
bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS
bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS
Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017
Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn
The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for exam ination by
--~-------(AP DR TEH CHEE SIONG)
Date 9th June 20 17
Grade
Ashy
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
UNIVERSITI MALAYSIA SARAWAK
A shyGra de
Please tick one Final Year Project Report IS]
Ma sters 0
PhD ~
DECLARATION OF ORIGINAL WORK
Th is dedaraLlon 1S made on the 8 day of JUNE yea r 20 17
Students Declaration
I AHMAD NAUFAL KHA N BIN JAMIL KHAN 45433 FACULTY OF COG NITIVE SCIE NCES
AND HUMAN DEVELOPMENT hereby declare that the work entitled DATA MINING AND
D ATA VISUALIZATION USING SELFmiddot ORGANIZING l1AP(SOM) is ro y origlll a l work I have not copied from any other students work or from any other lources wIth the exception where due refere nce 01 acknowledge me nll s made explICitly in the tex t nor ha ~ a ny pa rt of the wor k been vritten for me by a nother person
9 JUNE 2017 AHlVIAD NAU FAL KHAN (45433)
1 AP DR tEH CH EE S IO NG hereby certll) that the work entitled was p(~ p ~lre d by the aforementlOoed or
above me ntlOned student a nd wa s s ubmitted to the FACULTY as a partlaVfull fulfIllm ent for the conferment of BACHELOR OF SCIE CE WITH HO OURS (COG ITlVE SCIE NCE) and the aforeme ntIOned work to the bes t of roy know ledge I S the said s tud ent$ work
9 J UNE 20 17 Da te _________
(A P DR E H CREE SIONG) Receive d for exa minatIOn by
I declare this ProJectThesis IS classified as (Please tick (Jraquo)
n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)
CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)
tEl OPEN ACCESS
I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))
tEl YES J NO
Validation of ProjectITh es is
I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows
bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)
bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes
bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database
bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes
bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS
bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS
Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017
Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn
The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for exam ination by
--~-------(AP DR TEH CHEE SIONG)
Date 9th June 20 17
Grade
Ashy
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
I declare this ProJectThesis IS classified as (Please tick (Jraquo)
n CONFIDENTIAL (Contai ns confid ential Information unde r the Officia l Secret Act 1972)
CJ RESTRI CTED (Contains restricled information as p(cliied by the organisa tlon where resea rch wa s done)
tEl OPEN ACCESS
I decla re this ProjecliThesls is to be s ubmitted to the Centre for AcademiC InformatIOn Se rvices (CAIS) a nd uploaded into UNIMAS Institutional Repository (UNlMAS IR) (Please tick (i))
tEl YES J NO
Validation of ProjectITh es is
I hereby duly a fljrmed with (ree consent a nd Willingness declared that this sa id ProjecliThesls shall be placed officially In th e Ce ntre for Acade mic InformatlOn Services with the abide Interest a nd rights as fo llows
bull This ProJectT heS IS is the sale legal prope rty of UnivenHti Malaysia Sa rawak (UNIMAS)
bull Th e Centre for Academic Information Services has the lawfu l rig ht to ma ke cop ies of the Projecttrhes is for acade mic a nd research purposes only and not [or other purposes
bull The Centre for Acade miC Information Services has the lawful n ght U I dlglllze the content to be up loaded In to Local Content Database
bull Th e Centre for AcademiC Information Services has t he lawful fi ght to ma ke copIes of the Projectrrh e~il s If required for use by other parties for academic pu rpOF=(S or by othe r Higher Learning Ins titutes
bull TO dispute or any cla im s hall arise from the s tudent himself I herse lf ne ithe r a thud party on thiS Proj ectnhes is once It becomes the sole property of UNIMAS
bull This ProjecLffhesls or any mate rial data and Information rela ted to It shall not be dlstnb uted published or di sclosed to any party by the student him~Wherse lf Wi thout fllst obtaimng a pprova l from UNIMAS
Studen ts signatu re _ _ v=-7 __ Supervisors slgnat-ureJ -S- middot ~_LJ_ _ Date _middot lrJunc 2017 Va tp 9 J une 2017
Notes bull If the ProjectfThesis is CONFIDENTIAL or RESTRI CTED ploase attach to~ether as an nexure a lette r from the orgamsation with the date of restriction indica ted and th E reasons for the confidentiahty and res trictIOn
The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for exam ination by
--~-------(AP DR TEH CHEE SIONG)
Date 9th June 20 17
Grade
Ashy
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
The project entitled Data Mining and Data Visualizations Using Self-Organizing Map (SOM) was prepared by Ahmad Naufal Khan bin Jamil Khan and submitted to the Faculty of Cogniti ve Sciences and Human Development in pal1ial fulfillment of the requirements for a Bachelor of Science with Honours (Cognitive Science)
Received for exam ination by
--~-------(AP DR TEH CHEE SIONG)
Date 9th June 20 17
Grade
Ashy
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
ACKNOWLEDGEMENT
I wou ld like to than k God for giving me thi s opportunity to complete my thes is in a manner
I trul y am sa ti sfi ed I thank god for showin g me the way wh en I felt lost for giv ing me hope
when giving up seems like th e onl y option Without Him my wo rk will neve r be completed
To my beloved and respected superviso r AP Dr Teh Chee S iong you are my bi ggest
inspiration of all time J would like to give the highest appreciation to my dear superv iso r for
showing the path to complete thi s thes is Thank you for handing me the fa ith to wo rk on thi s
wonderful topic You ve showed me what it means to be a responsible person and to wo rk
through eve ry hardship
I also wo uld like to thank my parents They have always enco uraged me to continue
wo rk and convince me that J am capable of doing it They al ways believe ill me and that has
been a major moti vation fo r me to go on Whenever I fee l like giving up they convince me to
con tinue till the end I thank them for that To the rest of my fa mily I th ank you a ll for your
undying prayers and help for all my li fe To those people who ha ve sti ck with me through thi ck
and thin you will always be in my praye rs You are trul y a precious gem
bull
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
TABLE OF CONTENTS
ACKNOWLEDGEMENT III
vi
ABSTRACT vii
LIST OF FIGURES
ABSTRAK _ viii
CHAPTER 1 1
10 INTRODUCTION 1
101 Prelim inaries 1
102 Artificial Intelligence (AI) 2
103 Biological Neural Network (BNN) 3
LOA Artificial Neural Network (ANN) 4
105 Data vis ualization 5
1 1 Problem statement 5
12 Research Objectives 7
13 Specific Ob jectives 7
14 Resea rch Scopes 7
15 Research Methodologies 7
9CHAPTER 2
20 LITERATURE REVIEW 9
21 Intro duct ion 9
22 Dimension Redu ction 9
23 Da ta visualization 9
24 Visualization of high-dimensional data item s 10
2S Data visu al ization in the past and current 11
26 PrinCipal Component Analys is (PCA) bull 12
27 Sammon s Mapping (SM) 13
28 Self-Orga nizing Maps (SOM) 14
CHAPTER 3 16
30THE PROPOSED DATA VISUALIZATION METHOD 16
31 Introduction 16
31 Self-Organ izing Map (SOM ) 16
32 Advantages of Self-Organizing Map (SOM) 20
33 Issues Related to SO M 21
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
23
CHAPTER 4 23
40 EXPERIMENTS ON BENCHMARK DATAS ETS 23
4 1 Introduction 23
42 Data Visualization
43 Visualizations comparison of various datasets 23
44lris Datasets 24
45 W ine Datasets 35
46 Glass Identification Data set 40
CHAPTER S 43
50 DISCUSSION AND CONCLUSION
51 Introduction
52 Data visua liza tion Analysis
53 Data construction
54 Data normalization
5 5 Map Train ing
5 6 Data Visua lization
5 7 SOM_SHOW command in SO M Toolbox
58 SOM _5H OW_ADD command in SOM Toolbox
59 50M_GRID com mand in SOM Toolbox
5 10 Future applicat ions of 50M
REFEREN CES
43
43
43
43
43
44
44
45
45
46
48
v
46
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
LIST OF FIGURES
Figure I Representation of Kohoen SOM a lghorithm 17
Figure 2 Flow chart of training process in SOM 18
Figure 3 Map grid showin g how clustering works 19
Figure 4 Histogram and sca lier plots of Iri s Datasets 24
Figure 5 U-matrix Distance matrix of Iri s Dataset 27
Figure 6 U-matrix and empty label grid 28
Figure 7 Labelled U-m atrix for each class 39
Figure 8 Hit hi stogram showing the BM U 30
Figure 9 Coloured hit histograms accord ing to thei r classes 31
Figure 10 Visualizations of Iris datasets in map grid (3-d imensiona l) 33
Figure 11 Many forms of U-matrix representations 37
Figure 12 U-matr ix are labelled by Wine c lasses 39
Figure 13 Map grid of Wine datasets (3 -dimens ional) 40
Figure 14 Distance matrix of glass identification data and its variables 41
Figure 15 Labe ll ed distance matri x of glass identification data 42
Figure 16 Colored distance matrix 6 classes mean 6 colors are used to label 42
Figure 17 Di stance matrix of glass-identifi cat ion 46
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
ABSTRACT
Data min ing and data visual izati ons are becoming essential parts in inform ation techno logy in
recent e ra Wi thout the existence of data minin g techno logy big data can be near imposs ible
to be ex tracted and have to be done manually With the aid of data mining techn ology now
information can be gathered from data sets at much shorter time The di scovery of data
visua lizati ons a lso aid s in managing data into presen table fom) that can be understood by
everyone B ig dimensions can now be reduced to help data be more understandable In this
thesis Kohonen se lf-organiz ing map(SOM) technique is di scussed and examined for data
mining and data visualizations SOM is a neural network technique that can performs data
mining data classificat ion and data visua lizati ons SOM Too lbox was used on MATLAB All
steps in SOM are exp la ined in detail s from weight initi a lization until trai nin g is stopped
Graphical explanations of how SOM works are a lso used to help vi sua li ze SOM a lgorithm
Many methods of visualizations a re di splayed in the thesis A ll methods are criti call y
ana lysed Few benchmark datasets are used as examples of the visua li zation techniques Such
examples are Iri s datasets and Wine data sets The strength and weaknesses are I isted out in
discuss ion section
Keywords Data mining data c lassificatio n data visualizati on and sel f-orga ni zi ng map
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
ABSTRAK
Pema haman makna data dan data visua lisasi adalah sesuatu yang sanga t penting da lam bidang
tek nologi maklumat pada era barLl ini Tanpa kewujudan teknologi pemahaman makna data ini
data yan g besar boleh di kataka n hampir musta hil untuk dio lah clan diekstrak Oleh itu data
besa r ini perlu dilak uk an seca ra manual Dengan ada nya teknologi pemahaman makna data ini
kini maklumal dapat dikumpul kan da ri se t data pada masa yang lebih s ingkal Penemuan
penggambaran datajuga membantu cla lal11 l11 enguruska n data ke dalam bentu k yang rapi dan
bo leh difa hami semua lapi san masya raka l Dimensi besarjuga boleh mela lui proses
pe ngurangan dimensi untuk l11enjadi kan data lebi h l1ludah difahal11i Dalam tesis ini Kohonen
te lah l11 ereka sualU teknik yang dipangg il SelrOrgonizing Mop (SOM) Teknik ini dibinca ng
dan dite liti bagi proses pemahal1lan mak na data dan data visualisasi SOM ialah sa tu teknik
rangkaianneural ya ng bo leh melak ukan pengolehan data klas iti kasi data clan peggal1lbaran
data dengan tepat SOM Toolhox tel ah digu naka n dalam ap li kas i MA TLAB Sem ua langka h
dalal1l SOM dij elaskan secara terperinci dari perl11ulaan pemberat sehingga lah latihan
diberhent ikan yang l1lenandakan proses sudah tamal Penjelasan grafik tentang gerak kerja
SOM juga digunakan untuk mel11bantu menggal11barkan a lrgo ritl11a SOM Ba nyak kaeda h
penggaillba ran yang dipaparkan dalam tes is ini Kese illua kaedah penggam baran ini juga
diana li sa seeara kritikal Set data penanda aras digunakan seba ga i eO l1toh untuk llIelllapa rkan
teknik-teknik visua li sas i ini Anta ra eo ntoh set yang digunakan ia lan set Iri s Floer dan data
Wi ne Kekualan dall kelcll1ahan bagi tek lli k SOM dirungkaikan da lam seksyel1 5 tes is il1i
Kata kU llci Peri oll1bonga n data klasifikasi data visua lisasi data dan SefOrganizing Map
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
CHAPTER
10 INTRODUCTION
101 Preliminaries
The pattern recognition re lated studies are concerned with teaching a machine to identify
pattern of interest from its background Little did we know humans are born with such gifted
abilities that we tend to take for granted Humans are ab le to identify patterns even as babies
Thi s given abi lity may be better than what machines are able to do even at current level of
technology
However this is limited to a certain extend at which humans are better than machines
Given a big dataset humans minds may be able to perform data recognition but at slow
speed compared to machines These massive datasets are also demandin g a thrust in data
visualization technology producing massive big leaps in related researches Thi s field of
study aims to increase our understanding of datasets what useful infonmation is latent in it
and how to detect th e portion of it that are of strong interest (Fayyad Grinstein amp Wierse
2002) A good portion of data visualization related researches focuses on making it easier to
v isualize more dimensions effectively using 20-30 display technologies available (Fayyad et
aI 2002)
Data v isua li zation is the prese ntation of data in a pictorial or graphical format It enables
us to see analytic data in visual fonm instead of plain numbers This will allow humans to
identify patterns or understand hard concepts With the help of such interactive visualization
humans can use this technology to present infonmation in charts or graphs for more
1
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
deta iled-oriented presentati on enab ling humans mind to encapsulate new in formation eas ier
(Tufte 1983)
This study is associated with Artificial Inte lli gence (AI) whi ch focuses on creatin g
computer systems that can engage on behaviors that humans consider intelligent In current
time com putational computations use a lgor itl1mic approach where the computers must know
the speci fi c problem so lving steps However Artificia l Intelligence proves to be different by
depict ing paradigm for computing rather than the conventional computing (Te h 2006)
Artificial Neural etwo rks (ANN) one of th e hi ghlighted study area under Artifi cial
Inte II igence is c reated through in spiration of how bio logical nervous system processes
information especia ll y in the brain Most of the work in ANN is related to neurons and how
they train data In data visual ization it is a lso important to classify data first so that we can
visua lize data better and c learer ANN is useful for this purpose because it can be
programmed to perform specific tasks (Teh 2006)
102 A rtificial Intelligence (AI)
A I has been defined in many ways to represent different point of views in literature
(Kumar 2004) In genera l A I a ims to develop paradigms or algorithm s w hich in attempt to
causes machi nes to perform tasks that requires cognition or perception while performed by
humans (Sage 1990) Simon (199 1) defined AI as
We call a program for a computer artificially intelligent if il does something which
when done by a human being will be (hought to require human il1lelligence
Wilson (1992) also prov ided a definiti on for AI as followi ng
2
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
Artificial Intelligence is the study ofcompuratiol1s that make it possible to perceive
reason and act (W il son 1992as cited in Kumar 2004p12)
According to Kumar (2004) an y artificial should consists three elements in essence First
of all Al system has to be able to handle knowledge that is both general and domain specific
implicit and exp licit and at different levels of abstractions Seco ndl y it should have suitable
mechanism to constra in the search throu gh the knowledge base and able to arrive at
conclusion from premises and available ev idence Lastly it should possess mechanism for
learning new information with minimal noi se to the existing knowledge structure as provided
by the env ironment in which the system operates at
The Al study area is often related to biological neural system as it is hugely motivated by
the biological intelligence system provided by nature
103 Biological Neural Network (BNN)
The human brain is one of the most complex system ever discovered It is mOre complex
than any machine ever created This is because our brains contains billion of cell s that
regulates the body stores inform ation and retrieving them as well as making decisions in a
way that humans are still not ab le to explain (Swerd low 1995) This complex organ inside
humans are made of hundred billion information-processing units called neurons These
neurons are probably the most vital units that enables humans brain to perform all the acti ons
that they are capable of Adding to the massive number of neurons each of them consists of
an average of 10000 synapses that pass along information between them in incredibly fast
speed This results in total num ber of connections in the network to reach up to a huge figure
of approximately 10 5
3
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
In depth each neurons contain dendrites spreading from each neurons These tree-like
structures transports s ignals from one neuron to other neuron At the end of the neuron there
exists soma a porti on of the neuron that co ntains the nucleus and other vital components that
decide what kind of output s ignals the neuron will give off in accordance to input s ignals
received by other ne urons Afterwards th eres another important structure ca lled axon whi ch
carries the output signal produced in soma towards other neurons At the e nd of the axon
there is located a synaptic term inal which is the transfer point of information T he synaptic
termina ls are not conn ected with each other There are gaps between them ca lled the sy napses
Thi s completes the so-called so phisticated neural networks in our brains Thi s is a single
process that occurs while in fact in our brain there are billions o f same processes occurring
w ithin milliseconds
104 Artificial Neural Network (ANN)
ANNs are human s attempt on makin g biological neural network into computational
network Just like neurons in our brains it is able to so lve complex computations through
se lf-organiz ing abilities by learning information (Gra upe1977)
However there are perspectives that ANN model s are not as accurate and effective until
they are able to replicate the work of neurons as preci se as possib le Kohonen (200 1) argued
that even though nature has its own way that no machines can replicate the compliment is
a lso true Machines are ab le to do what humans brain cannot do as wel l Human brains have
limitation that are va ri ed and prove to be a weakness at times So it is not logical to copy in
details unless it can guarantee full benefits
4
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
105 Data visualization
As mentioned before data visualizatio n is the field of study that enab les user to explore
the properties of a dataset based on the person s own intuition and understand of domain
knowledge since it is meant to co nve y hidden information about the data to the user
(Teh2006) Data mining data warehousing retrieval systems and knowledge applications
have widely increase the demand of the expansion of data visualization field A huge
advantage of data vis uali zation over non-data visualization technique is it provides direct
feedback to enhance the quality of data mining (A nkerst 2000) To further exp la in the
necessity and capability of data v isualizatio n Konig (1998) explains more
To cope with today s flood ofdata from rapidly growing dalabases and relaled
compulalional resources especially 10 discover salienl Slruclures and imerescing
correlacions in dara requires advanced mechods o(machine learning palcern
recognition dara analysis and visualizations The remarkable abilities ofthe human
observers co perceive eluslen and correlalions and chus scruClllre in dara is of
greal ineresl and can be well exploiced by ejJeclive syslems for dara projection and
interactive visualizations (Koni g 1998 p55)
11 Problem statement
Data vis uali zation is a field that has been supported by many respected approaches such
as Principal Component Analysis (PCA) (Johnson amp Wichern 1992) Multidimensional
Sca ling (MDS) (S hepard amp Carroll 1965) Sammons Mapping (Sa mmon 1969) Principal
Curves (Hastie amp Stuetzle 1989) and Principal Surfaces (LeBlanc amp Tibshirani 1994) PCA
MDS and Sammon s Mapp ing are cons idered to be not an effective way fo r data visualization
since they demonstrate major di sadvantages For example PCA loses certain useful
information upon dimension reduction while MDS and Sammon s Mapping require heavy
5
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
computati on which essentially leads to impractica l methods In addition MDS and
Sammons Mapping cannot adapt to new data sample without first re-co mputing the ex isting
data (Wu amp Chow 2005) This also loses its practicality
Self-Organizing Maps (SOM) (Ko honen 1984) is a data visualization focused
unsupervi sed ANN method SOMs visualization is able to preserve data topo logy from
N-dimensiona l space to low-dimensional di splay space (Kohonen 2001) However it is not
able to reveal the underly ing structure of data as we ll as the inter-neuron distances from
-dimensional input space to low-dimensional output space (Yin 2002)
In order to overcome these weaknesses SOM proposed to new variants ca ll ed ViSOM
(Yin 2002) and PRSOM (Wu amp Chow 2005) to preserve the inter-neuron di stances and data
structu re n addition to data topo logy preservati on It is proven that ViSOM and PRSOM are
better than PCA MDS and Sammons Mapping o r even SOM itself in terms of inte r-neuron
di stances data topology and data structure preservation by Yin (2002) and Wu and Chow
(200S) respectively A lso they are di scovered to ab le to overcome the aforementi oned
lim itations of PCA MDS and Sammon s Mapp ing
However SOM faces problems in optimizing the accuracy of class ifi cation as compared
agai nst supervised classification focu sed ANN methods such as LVQ (Kohonen 1988) Even
its variants ViSOM and PRSOM are not ab le to perform data classification although they
contributed a significant boost to data visualization
A lthough the primary focus of this research is data visuali zation but it can never stands
on its own Data vi sualization usuall y comes hand in hand with data class ification In order to
correctl y visualize data more accurate ly the data has to be classified first so it can present the
data in more presentable manner If big datasets are presented without being class ified first it
will show confusing data 6
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
12 Research Objectives
The ai m of this research is to propose SOM as a mode l of data visuali zation The
visuali zation ability is ex pected to be ab le to preserve data structure inter-neu ron dis tances
and data topology from N-dimensional input space to low-d imensional o utput space
13 Specific Objectives
To study how SOM toolbox performs vari o us methods of data v isualization
To investigate the feas ibility of the pro posed method fo r data visualization
To test a deve loped SOM too lbox to visua li ze numerous benchmark datasets
To evaluate the efficiency of data vis ualization methods
14 Research Scopes
The scope of thi s research is confined in developing an ANN method to support data
visuali zation with computati onal efficiency In depth data structure data topology and
inter-neuron distances wi ll be preserved as much as possible
Empirical studies on the benchmark datasets wi II be conducted to demonstrate data
visualizat ion abilities using proposed method Both quali tative and qu antitat ive comparisons
will be conducted
J5 Research Methodologies
T he research kicks off with thorough literature reviews of the recent data visual iza tion
methods such as PCA Sammon s Mapping(SM) and SOM in an effort to clearly state and
compare the strengths and limitations of each method A literature rev iew is a lso co ndu cted
in order to define a way of enhancing a selected supervised ANN with adequate data
visua li zati on ab ilities
7
-- _ W=fshy
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
The proposed methods are implemented using MATLAB 70 so ftware packages
(Ed ucatio n versio n) and SOM too lbox The em pirical studies on the benchmark datasets a re
conducted to evaluate the performance o f selected method Iris flo wer dataset will first be
tested and fo ll owed by other multi -d imensional datasets such as w ine and car characteristics
8
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
CHAPTER 2
20 LITERATURE REVIEW
21 IntJoduction
Data visua lization is a fi e ld being ex plored rapid ly by researchers in th is modern soc iety
Various approaches have been introd uced a lo ng the line In this chapter methods fo r data
vi sua lization will be di scussed and the a lgo ri thms invo lved hi gh light ing the ir strengths and
weaknesses
This chapter di scu sses the techniques invo lved specifica ll y dim ension red uction
Sammons Mappi ng and Self-Orga niz ing Maps (SOM) These are all re lated techniq ues fo r
data visualizati on which will also be d iscussed in thi s chapter Their perform ances w ill be
high lighted fo r each of th e techniques
22 Dimension Reduction
The o bjective of dimension reduction is to represent data of highers dimens ions usuall y
more than three in a much lower dimensiona l space with minimum loss of info rmation
Humans are unab le to view data if the d imension is hig her than three So data vis ua li zat io n
ai ms to a id humans in better un derstanding the data and to make use of them fo r good use
The methods th at will be used and di scussed in later sections are Sammon s Mapping
(Sammon 1969) and SOM (Kohonen1984)
23 Data visualization
Visualizati on is the transformation of the symbol ic into the geometric There are man y
reasons why visua li zations a re created such as to answer questio ns make deci sions seeing
data in a cleare r co ntext expand ing memory find ing patterns and to name a few To se rve
9
- ilif shy
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
the purpose visual ization has to be constructed obeying the des ign principles as proposed by
Mack inlay (1986) Those principles are
I) Exp ress iveness - a set of data is considered expressib le in v isual if it is a lso
express ible in the sentence of the language including all the facts in the data
2) Effecti veness - a v isualizati on is considered effective if the information portrayed by
the v isua lization is mo re read ily perce ived than the info rmation provided by any other
fo rm of visua lization
Data visualizati on methods represent the underl y ing structures of data and the
mu ltivari ate re latio nship among the data samples which are in researchers interest for
decades In o rde r to achieve th is severa l methods ha ve been developed to perform qui ck
visua li zation of s im ple data summaries fo r low-dimensional datasets As an example the
so-ca lled fi ve num ber summaries (Tukey 1977) whi ch consists of hi ghest and lowest va lue
med ian and the two qual1iles of data which can be drawn to visua lize the summary of
low-d imensional datasets However as the nu mber of d imensions increase their abi li ty of
visual iz ing data summary decreases (Kaski 1 997) From here on afterwards visualization
methods wi ll be d iscussed bri e fi y for hi gh-dimensional data
24 Visualization of high-dimensional data items
To help visualize data better numerica l approaches related to graphi ca l rep resentation
have been used fo r this purpose Tan Stein bach and Kumar (2006) have di scussed the
similarities between them fo r hi gh-dimensional data Paralle l Coordinate approach is one
them related to data visual ization It is a very simple form of data visualization which
prov ides one coordinate axis for each attributes spaced parall el in the x-axis and the
correspond ing value is plotted by draw ing lines connecting the paralle l att ri bu tes in the y-ax is
10
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
for each data sample If the connections are correctly arranged the patterns will form and
produce good pattern However relationships between data samples are not visible Vice
versa if the data are not properl y arranged in order it may cause confusing data visuali zati on
(Tan et al2006)
Andrews ( 1972) proposed a visuali zatio n technique where one curve for each data item is
obtained by us ing the components of the data vectors as coeffici ents of orthogonal sinuso ids
(Kaski 1997) Similar to Parallel Coo rdinate it is also greatly affected by the ordering the
data attributes
25 Data visualization in the past and current
Data visualization date s back to the 2nd century AD where rapid development has
occurred during that period The earl iest tab le ever recorded was created in Egypt where it
was used to organize astronom ica l information as a too l for navi gatio n In genera l a table is a
textual representation of data but contains visual attributes that help humans recognize better
According to Few (2007) graphs didn t exist until 17th century where it was invented by the
infamous mathematic ian Rene Descartes who is also known for hi s philosophical tho ughts
In contemporary use data visualizatio n has shown its importance in business intelligence
but sti ll being igno red by a fraction of business people It is still a misunderstood concept
where people take it for granted and to be taken li ghtly However there are sti ll good news
related to the adv ancement of data visuali zatio n especiall y in academic field s In university as
an example all courses require data visualization as part of explanation in any matter at al l
Due to thi s rapid advancements man y univ ersities nowadays have given their attentions on
visuali zatio n and a few have excellent programs that serve the needs of many graduate
students whom may req uire it for their researches and prototype application There al so exist
bad trends in data visualization in current days espec ially when people rush to embrace the
11
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
technology and fa il to full y understand thi s matter w ithout first studying it Data vi sua lizati on
is capti vating to the eyes and intrigue people to try and use them This is especiall y bad in
field of adverti sements when they try to implement vi sua lizati ons technique wi thout
considering the design principle Many people are ye t to grasp the idea of data visualizati ons
full y While many may be und er the impressions that data visualizat ion is about making
things pretty an d cute truth is it is about dressing up presentations to tackle the audience It
doesnt necessarily have to be colorful just as long as it catches the attention of the users it is
alread y considered a success (Few 2007)
26 Principal Component Analysis (PCA)
PCA ( Hotelling 1933) is a linear data ana lysis method to find the orthogonal principa l
directi ons in a dataset along which th e dataset show s the largest variances (Yin 2002) It
di splays data in a linear projection o n such a subspace of the original data space that best
preserves the data variances (Kaski 1997) By se lecting the major components the PCA is
ab le to effectively reduce the data dimensions and di splay the data in terms of selected and
reduced dimensions in low-dimensional space As described by Yin (2002) PCA is an
optimal linea r projecti on of the mean square e rror between origina l data po ints and th e
projected one as shown in Eq(l)
Eq(l)
Here X= [X I X2X3 XS] is the N-dimensional input vector and qj=123 mmSN a re
the first In princ ipal eigenvectors of the data and the term qJ represents th e projection ofx
onto the)h principal dimension
12
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13
However linear PCA loses celiain useful information about the data while dealing w ith
data having much higher dimension than two (Yin2002) Linear PCA is not able to display
the non-lineari ty of data s ince it di splays data in a linear subspace For high-dimensional
dataset linear vi sualization methods may have difficulty to represent the data we ll in
low-dimensional spaces (Kaski 1997) Thus several approaches ha ve been proposed to
produce non-linear low-dimensional di splay of high-dimensional data
27 Sammons Mapping (SM)
Sammons Mapping (Sammo n 1969) is a non-linear data projection method usuall y
associated with multidimensiona l sca ling (MDS) It preserves the pair w ise di stances between
two data po ints in N-dimensional space to low-d imensional di splay space The advantage of
Sammo n s Mapping is that the errors are divided by the di stances in the N-dimensional space
which emphas izes the small di stances (Kohonen 200 J) With (5 representing the di stance
between two points i and j in N-dimensional data space and dl representin g their
corresponding dista nces in low-dimensional display space the cost functio n of Sammon s
Mapping can be read according to Kohonen (20 J I) as
Eq(2)
As in MDS Sammons Mapping performs dimensional reduction by di splaying
multidimensional data in a low-dimensional no nlinear di splay space However both vIDS
and Sammons Mapping display weakness in provid ing exp licit mapping function to
accommodate new data samples in the visua lizati on without re-computing the existing data
(Mao amp Jain 1995) Sammon s Mapping requires high computational complexity wh ich
causes im practicality when using Sammons Mapping in mUltiple occas ions Although
13