PSZ 19:16 (Pind. 1/97)
UNIVERSITI TEKNOLOGI MALAYSIA
THESIS STATUS DECLARATION FORM
TITLE: BIOACTIVITY CLASSIFICATION OF ANTI AIDS COMPOUNDS USING NEURAL NETWORK AND SUPPORT VECTOR MACHINE: A COMPARISON
ACADEMIC SESSION: SEMESTER I 2004/2005
I, RAHAYU BINTI A. HAMID (BLOCK LETTERS),
agree that this thesis (PSM/Master's/Doctor of Philosophy)* be kept at the Library of Universiti Teknologi Malaysia under the following conditions of use:
1. This thesis is the property of Universiti Teknologi Malaysia.
2. The Library of Universiti Teknologi Malaysia is permitted to make copies for study purposes only.
3. The Library is permitted to make copies of this thesis as exchange material between institutions of higher learning.
4. ** Please tick (✓)
[ ] CONFIDENTIAL (SULIT): contains information classified for security or of importance to Malaysia, as set out in the OFFICIAL SECRETS ACT 1972
[ ] RESTRICTED (TERHAD): contains RESTRICTED information as determined by the organisation/body where the research was conducted
[ ] UNRESTRICTED (TIDAK TERHAD)
(AUTHOR'S SIGNATURE)
Permanent address: 19, Jalan Pertiwi, Stulang Laut, 80300 Johor Bahru, Johor Darul Takzim.
Date: OCTOBER 2004
Certified by
(SUPERVISOR'S SIGNATURE)
ASSOC. PROF. DR. NAOMIE BINTI SALIM
Name of Supervisor
Date: OCTOBER 2004
NOTES: * Delete whichever is not applicable.
** If this thesis is CONFIDENTIAL or RESTRICTED, please attach a letter from the relevant authority/organisation stating the reason and the period for which this thesis must be classified as CONFIDENTIAL or RESTRICTED.
"Thesis" here means a thesis for the degree of Doctor of Philosophy or Master's by research, a dissertation for study by coursework and research, or an Undergraduate Project Report (PSM).
"I declare that I have read this project and in my opinion this
project report has satisfied the scope and quality for the award
of the degree of Master of Science (Computer Science)."
Signature : ........................
Name of Supervisor : ASSOC. PROF. DR. NAOMIE BINTI SALIM
Date : ........................
BIOACTIVITY CLASSIFICATION OF ANTI AIDS COMPOUNDS USING
NEURAL NETWORK AND SUPPORT VECTOR MACHINE: A COMPARISON
RAHAYU BINTI A. HAMID
A report submitted in partial fulfillment of the
requirements for the award of the degree of
Master of Science (Computer Science)
Faculty Of Computer Science And Information System
University Of Technology Malaysia
OCTOBER 2004
"I declare that this project report is the result of my own research except as
cited in references. This report has not been accepted for any degree and is
not currently submitted in candidature of any degree."
Signature : ........................
Name of Author : RAHAYU BINTI A. HAMID
Date : ........................
To my husband, thank you for your love and support. To my son, you mean everything to me.
To my mother, thank you for always being there for me, supporting me and encouraging me to be the best that I can be.
ACKNOWLEDGEMENT
Praises to Allah for giving me the patience, strength and will to go through
and complete my study. I would like to express my appreciation to my supervisor,
Associate Professor Dr. Naomie bte Salim, for her support and guidance during the
course of this study and the writing of the thesis. A special thanks is due to Associate
Professor Dr. Siti Mariyam bte Shamsuddin for her helpful advice and insight in this
study. I would also like to extend my thanks to fellow classmates who have given me
the encouragement and support when I needed it. Finally, I would like to dedicate
this thesis to my family. Without their love and support I would have never come this
far.
ABSTRACT
High Throughput Screening is used in drug discovery to screen large numbers
of candidate compounds against a biological target, making it possible to test
tens of thousands to hundreds of thousands of compounds at the early stage of drug
design. However, it is impractical to test every available compound against every
biological target. Classification offers an alternative: compounds are sorted into
active and inactive classes based on already known actives. In this study, Neural Network
and Support Vector Machines (SVM) are used to classify AIDS data represented as
2D descriptors. The compounds used were selected to be the most diverse.
The classification models are tested using different ratios of the data
set to determine whether the size of the data affects the classification rate. In
addition, the study analyses the effect of dimensional reduction on the results
of the two techniques. The final results indicate that SVM produces better classification
results for both the original data and the reduced dimension data.
ABSTRAK
The use of High Throughput Screening to screen large numbers of potential
chemical molecules against a biological target has made it possible to identify
hundreds of thousands of chemical molecules at an early stage of the drug
development process. However, testing every chemical molecule against every
biological target is impractical. This study applies the Neural Network and
Support Vector Machines (SVM) techniques to classify AIDS data in 2D form.
The chemical molecules used were selected on the basis of the highest
dissimilarity. The classification models were tested using data sets
with different ratios to identify the effect of data size on the classification
results. In addition, the study analyses the effect of reducing the
dimension of the data on the results of both techniques. The results of the study
show that the SVM technique produces better results for both the
original data and the data with reduced dimension.
TABLE OF CONTENT
CHAPTER CONTENT PAGE
TITLE i
DECLARATION ii
DEDICATION iii
ACKNOWLEDGEMENT iv
ABSTRACT v
ABSTRAK vi
TABLE OF CONTENT vii
LIST OF TABLES xii
LIST OF FIGURES xv
LIST OF SYMBOLS xviii
LIST OF ABBREVIATIONS xix
LIST OF APPENDICES xx
1 INTRODUCTION 1
1.0 Introduction 1
1.1 Problem Background 2
1.2 Problem Statement 4
1.3 Aim 5
1.4 Objectives 5
1.5 Scope 6
1.6 Outline Of Thesis 6
1.7 Summary 7
2 LITERATURE REVIEW 8
2.0 Introduction 8
2.1 Chemoinformatics 8
2.2 The Drug Discovery And Development Process 10
2.2.1 Assay Development 10
2.2.2 Lead Identification 11
2.2.3 Lead Optimisation 11
2.2.4 Clinical Trial 11
2.2.5 Bringing The Drug Into The Market 12
2.3 Lead Identification - The Past, Present And Future 12
2.4 AIDS Data Set 15
2.5 Structure Of Chemical Molecules 16
2.6 Dimensional Reduction Techniques 18
2.6.1 Principal Components Analysis (PCA) 18
2.6.1.1 PCA Method 19
2.6.2 Factor Analysis (FA) 22
2.7 Data Mining 22
2.7.1 Data Mining Techniques 23
2.7.1.1 Predictive Modelling 23
2.7.1.2 Database Segmentation 24
2.7.1.3 Link Analysis 24
2.7.1.4 Deviation Detection 25
2.8 Classification 25
2.8.1 Linear Discriminants 26
2.8.2 k-Nearest Neighbour 27
2.8.3 Decision Tree 27
2.8.4 Logistic Regression 29
2.8.5 Generalized Additive Model (GAM) 29
2.9 Classification Methods In Chemoinformatics 30
2.10 Neural Network 35
2.10.1 Back Propagation Neural Network 36
2.10.2 Structuring The Network 38
2.10.2.1 The Input Layer 38
2.10.2.2 The Hidden Layer 38
2.10.2.3 The Number Of Node In The Hidden Layer 39
2.10.2.4 The Output Layer 40
2.10.2.5 Selecting An Activation Function 40
2.10.3 Advantages And Drawbacks Of Neural Network 41
2.11 Neural Network In Chemoinformatics 41
2.12 Support Vector Machine (SVM) 43
2.12.1 Linear SVM 45
2.12.2 Non Linear SVM 48
2.12.3 Advantages And Drawbacks Of Support Vector Machine 50
2.13 SVM In Chemoinformatics 51
2.14 Summary 53
3 PROJECT METHODOLOGY 55
3.0 Introduction 55
3.1 Problem Identification 55
3.2 Literature Review 56
3.3 Data Acquisition And Pre-Processing 56
3.3.1 Data Acquisition 57
3.3.2 Data Transformation 58
3.3.3 Data Labelling/Classification 59
3.3.4 Dimensional Reduction 59
3.3.5 Normalization Of Principal Components 60
3.3.6 Data Splitting 60
3.4 Implementation 62
3.4.1 Back Propagation Neural Network 62
3.4.1.1 Building The Network Structure 62
3.4.1.2 Learning Process 64
3.4.1.3 Testing Process/Output Generation 64
3.4.2 Support Vector Machine (SVM) 65
3.5 Analysis Of Classification Performance 67
3.6 Summary 67
4 RESULTS AND DISCUSSION 68
4.0 Introduction 68
4.1 Classification Results On The Original Aids Data Set 69
4.1.1 Result Of Back Propagation Neural Network 69
4.1.2 Result Of SVM 75
4.2 Classification Results On The PCA Data Set 87
4.2.1 Result Of Back Propagation Neural Network 87
4.2.2 Result Of SVM 93
4.3 Comparison Of Classification Results Between The Original Aids Data And PCA Data 105
4.4 Summary 107
5 CONCLUSION 108
5.0 Introduction 108
5.1 Findings 108
5.2 Advantages of Study 109
5.3 Contribution of Study 110
5.4 Conclusion 110
5.5 For Future Work 111
BIBLIOGRAPHY 112
APPENDIX 120
LIST OF TABLES
TABLE NO. TITLE PAGE
3.1 Principal Components selection criteria 60
3.2 Number of sample in training and testing set 62
3.3 Number of node in the network layer 63
4.1 The best network structure for the 20:80 Aids data set 70
4.2 The best network structure for the 50:50 Aids data set 72
4.3 The best network structure for the 80:20 Aids data set 73
4.4 Comparison of results for all three Aids data set using Back propagation neural network 75
4.5 Confusion matrix for classification of 20:80 Aids data set using Linear Kernel 77
4.6 Confusion matrix for classification of 20:80 Aids data set using Polynomial Kernel 78
4.7 Confusion matrix for classification of 20:80 Aids data set using RBF Kernel 79
4.8 Confusion matrix for classification of 50:50 Aids data set using Linear Kernel 80
4.9 Confusion matrix for classification of 50:50 Aids data set using Polynomial Kernel 81
4.10 Confusion matrix for classification of 50:50 Aids data set using RBF Kernel 82
4.11 Confusion matrix for classification of 80:20 Aids data set using Linear Kernel 83
4.12 Confusion matrix for classification of 80:20 Aids data set using Polynomial Kernel 84
4.13 Confusion matrix for classification of 80:20 Aids data set using RBF Kernel 85
4.14 Comparison of results for all three Aids data set using three different kernels in SVM 86
4.15 The best network structure for the 20:80 PCA data set 88
4.16 The best network structure for the 50:50 PCA data set 90
4.17 The best network structure for the 80:20 PCA data set 91
4.18 Comparison of results for all three PCA data set using Back propagation neural network 93
4.19 Confusion matrix for classification of 20:80 PCA data set using Linear Kernel 94
4.20 Confusion matrix for classification of 20:80 PCA data set using Polynomial Kernel 95
4.21 Confusion matrix for classification of 20:80 PCA data set using RBF Kernel 96
4.22 Confusion matrix for classification of 50:50 PCA data set using Linear Kernel 97
4.23 Confusion matrix for classification of 50:50 PCA data set using Polynomial Kernel 98
4.24 Confusion matrix for classification of 50:50 PCA data set using RBF Kernel 99
4.25 Confusion matrix for classification of 80:20 PCA data set using Linear Kernel 101
4.26 Confusion matrix for classification of 80:20 PCA data set using Polynomial Kernel 102
4.27 Confusion matrix for classification of 80:20 PCA data set using RBF Kernel 103
4.28 Comparison of results for all three PCA data set using three different kernels in SVM 104
4.29 Comparison of Classification Performance for Aids data set between Back propagation neural network and SVM 105
4.30 Comparison of Classification Performance for PCA data set between Back propagation neural network and SVM
LIST OF FIGURES
FIGURE NO. TITLE PAGE
2.1 The Drug Discovery and Development Process 10
2.2 An example of structural keys 17
2.3 An example of a decision tree 28
2.4 A neural network with one hidden layer 36
2.5 A linear support vector machine 44
2.6 SVM margin 46
2.7 SVM input and feature space 48
3.1 Data Acquisition and Pre-processing phase 57
4.1 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set 71
4.2 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set 72
4.3 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set 74
4.4 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using Linear Kernel 77
4.5 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using Polynomial Kernel 78
4.6 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using RBF kernel 79
4.7 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using Linear Kernel 80
4.8 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using Polynomial Kernel 81
4.9 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using RBF kernel 82
4.10 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using Linear Kernel 83
4.11 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using Polynomial Kernel 84
4.12 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using RBF kernel 85
4.13 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set 89
4.14 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set 90
4.15 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set 92
4.16 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using Linear Kernel 94
4.17 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using Polynomial Kernel 95
4.18 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using RBF kernel 96
4.19 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using Linear Kernel 98
4.20 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using Polynomial Kernel 99
4.21 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using RBF kernel 100
4.22 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using Linear Kernel 101
4.23 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using Polynomial Kernel 102
4.24 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using RBF kernel 103
LIST OF SYMBOLS
N          dimension of data
n          dimension of input data
ℋ          feature space
ℜ          Euclidean space
y ∈ Y      output and output space
x ∈ X      input and input space
⟨x · z⟩    inner product between x and z
K(x, z)    kernel ⟨φ(x) · φ(z)⟩
w          weight vector
b          bias
α          dual variables or Lagrange multipliers
L          primal Lagrangian
W          dual Lagrangian
‖·‖_p      p-norm
ln         natural logarithm
e          base of the natural logarithm
log        logarithm to the base 2
x', X'     transpose of vector, matrix
ℕ, ℜ       natural, real numbers
           learning rate
           momentum rate
           confidence
LIST OF ABBREVIATIONS
CA      Confirmed Active
CART    Classification And Regression Trees
CM      Confirmed Moderately Active
CI      Confirmed Inactive
ERM     Empirical Risk Minimization
FA      Factor Analysis
GAM     Generalized Additive Model
HTS     High Throughput Screening
LRM     Logistic Regression Method
MARS    Multivariate Adaptive Regression Splines
MLP     Multi Layer Perceptrons
MSE     Mean Squared Error
NCI     National Cancer Institute
NMR     Nuclear Magnetic Resonance
PCA     Principal Components Analysis
RBF     Radial Basis Function
SAR     Structure-Activity Relationship
SRM     Structural Risk Minimization
SVM     Support Vector Machine
LIST OF APPENDICES
APPENDIX TITLE PAGE
A1 Gantt Chart of Project I 120
A2 Gantt Chart of Project II 121
B Sample Data 122
C Calculation in Forward Propagation 125
D Calculation in Backward Propagation 126
E1 Output of Stage I and II for 20:80 Aids data set using Back propagation neural network 127
E2 Output of Stage I and II for 50:50 Aids data set using Back propagation neural network 131
E3 Output of Stage I and II for 80:20 Aids data set using Back propagation neural network 135
F1 Output of Stage I and II for 20:80 PCA data set using Back propagation neural network 139
F2 Output of Stage I and II for 50:50 PCA data set using Back propagation neural network 143
F3 Output of Stage I and II for 80:20 PCA data set using Back propagation neural network 147
G1 Classification results of 20:80 Aids data set using SVM 151
G2 Classification results of 50:50 Aids data set using SVM 155
G3 Classification results of 80:20 Aids data set using SVM 159
H1 Classification results of 20:80 PCA data set using SVM 163
H2 Classification results of 50:50 PCA data set using SVM 167