
PSZ 19:16 (Pind. 1/97)

UNIVERSITI TEKNOLOGI MALAYSIA

THESIS STATUS DECLARATION FORM

TITLE: BIOACTIVITY CLASSIFICATION OF ANTI AIDS COMPOUNDS USING NEURAL NETWORK AND SUPPORT VECTOR MACHINE: A COMPARISON

ACADEMIC SESSION: SEMESTER I 2004/2005

I, RAHAYU BINTI A. HAMID (CAPITAL LETTERS),

agree that this thesis (PSM/Master's/Doctor of Philosophy)* may be kept at the Universiti Teknologi Malaysia Library, subject to the following conditions of use:

1. This thesis is the property of Universiti Teknologi Malaysia.

2. The Universiti Teknologi Malaysia Library is permitted to make copies for study purposes only.

3. The Library is permitted to make copies of this thesis as exchange material between institutions of higher learning.

4. ** Please tick (✓)

[ ] SULIT (CONFIDENTIAL): contains information of a security classification or of importance to Malaysia as set out in the OFFICIAL SECRETS ACT 1972.

[ ] TERHAD (RESTRICTED): contains RESTRICTED information as determined by the organisation/body where the research was carried out.

[ ] TIDAK TERHAD (UNRESTRICTED)

(AUTHOR'S SIGNATURE)

Permanent address: 19, Jalan Pertiwi, Stulang Laut, 80300 Johor Bahru, Johor Darul Takzim.

Date: OCTOBER 2004

Certified by

(SUPERVISOR'S SIGNATURE)

ASSOC. PROF. DR. NAOMIE BINTI SALIM

Name of Supervisor

Date: OCTOBER 2004

NOTES:

* Delete whichever is not applicable.

** If this thesis is SULIT or TERHAD, please attach a letter from the relevant authority/organisation stating the reasons and the period for which the thesis must be classified as SULIT or TERHAD.

"Thesis" refers to a thesis for the degree of Doctor of Philosophy or Master's by research, a dissertation for studies by coursework and research, or a Bachelor's Degree Project Report (PSM).

"I declare that I have read this project and in my opinion this

project report has satisfied the scope and quality for the award

of the degree of Master of Science (Computer Science)."

Signature : .........................................

Name of Supervisor : ASSOC. PROF. DR. NAOMIE BINTI SALIM

Date : .........................................

BIOACTIVITY CLASSIFICATION OF ANTI AIDS COMPOUNDS USING

NEURAL NETWORK AND SUPPORT VECTOR MACHINE: A COMPARISON

RAHAYU BINTI A. HAMID

A report submitted in partial fulfillment of the

requirements for the award of the degree of

Master of Science (Computer Science)

Faculty Of Computer Science And Information System

University Of Technology Malaysia

OCTOBER 2004

"I declare that this project report is the result of my own research except as

cited in references. This report has not been accepted for any degree and is

not currently submitted in candidature of any degree."

Signature :

Name of Author : RAHAYU BINTI A. HAMID

Date :

To my husband, thank you for your love and support. To my son, you mean everything to me.

To my mother, thank you for always being there for me, supporting me and encouraging me to be the best that I can be.


ACKNOWLEDGEMENT

Praises to Allah for giving me the patience, strength and will to go through and complete my study. I would like to express my appreciation to my supervisor, Associate Professor Dr. Naomie bte Salim, for her support and guidance during the course of this study and the writing of the thesis. Special thanks are due to Associate Professor Dr. Siti Mariyam bte Shamsuddin for her helpful advice and insight in this study. I would also like to extend my thanks to fellow classmates who gave me encouragement and support when I needed it. Finally, I would like to dedicate this thesis to my family. Without their love and support I would never have come this far.


ABSTRACT

High Throughput Screening is used in drug discovery to screen large numbers of potential compounds against a biological target, making it possible to test tens of thousands to hundreds of thousands of compounds at an early stage of drug design. However, it is impractical to test every available compound against every biological target. Classification offers an alternative: compounds are sorted into active and inactive classes based on already known actives. In this study, Neural Network and Support Vector Machines (SVM) are used to classify AIDS data represented as 2D descriptors. The compounds used are selected for maximum diversity. The classification models are tested using different ratios of the data set (20:80, 50:50 and 80:20) to identify whether the size of the data affects the rate of classification. The study also analyses the effect of dimensional reduction on the results of the two techniques. Final results indicate that SVM produces better classification results for both the original data and the reduced-dimension data.


ABSTRAK

The use of High Throughput Screening to screen large numbers of potential chemical compounds against a biological target has made it possible to identify hundreds of thousands of compounds at an early stage of the drug development process. However, testing every compound against every biological target is impractical. This study applies Neural Network and Support Vector Machines (SVM) techniques to classify AIDS data in 2D form. The compounds used are selected on the basis of the highest dissimilarity. The classification models are tested using data sets of different ratios to identify the effect of data size on the classification results. In addition, the study analyses the effect of reducing the dimension of the data on the results of both techniques. The results show that SVM produces better results for both the original data and the data whose dimension has been reduced.

TABLE OF CONTENT

TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF ABBREVIATIONS
LIST OF APPENDICES

1 INTRODUCTION
1.0 Introduction
1.1 Problem Background
1.2 Problem Statement
1.3 Aim
1.4 Objectives

1.5 Scope
1.6 Outline Of Thesis
1.7 Summary

2 LITERATURE REVIEW
2.0 Introduction
2.1 Chemoinformatics
2.2 The Drug Discovery And Development Process
2.2.1 Assay Development
2.2.2 Lead Identification
2.2.3 Lead Optimisation
2.2.4 Clinical Trial
2.2.5 Bringing The Drug Into The Market
2.3 Lead Identification - The Past, Present And Future
2.4 AIDS Data Set
2.5 Structure Of Chemical Molecules
2.6 Dimensional Reduction Techniques
2.6.1 Principal Components Analysis (PCA)
2.6.1.1 PCA Method
2.6.2 Factor Analysis (FA)
2.7 Data Mining
2.7.1 Data Mining Techniques
2.7.1.1 Predictive Modelling
2.7.1.2 Database Segmentation
2.7.1.3 Link Analysis
2.7.1.4 Deviation Detection
2.8 Classification
2.8.1 Linear Discriminants
2.8.2 k-Nearest Neighbour
2.8.3 Decision Tree
2.8.4 Logistic Regression

2.8.5 Generalize Additive Model (GAM)
2.9 Classification Methods in Chemoinformatics
2.10 Neural Network
2.10.1 Back Propagation Neural Network
2.10.2 Structuring The Network
2.10.2.1 The Input Layer
2.10.2.2 The Hidden Layer
2.10.2.3 The Number Of Node In The Hidden Layer
2.10.2.4 The Output Layer
2.10.2.5 Selecting An Activation Function
2.10.3 Advantages And Drawbacks Of Neural Network
2.11 Neural Network In Chemoinformatics
2.12 Support Vector Machine (SVM)
2.12.1 Linear SVM
2.12.2 Non Linear SVM
2.12.3 Advantages And Drawbacks Of Support Vector Machine
2.13 SVM In Chemoinformatics
2.14 Summary

3 PROJECT METHODOLOGY
3.0 Introduction
3.1 Problem Identification
3.2 Literature Review
3.3 Data Acquisition And Pre-Processing
3.3.1 Data Acquisition
3.3.2 Data Transformation
3.3.3 Data Labelling/Classification

3.3.4 Dimensional Reduction
3.3.5 Normalization Of Principal Components
3.3.6 Data Splitting
3.4 Implementation
3.4.1 Back Propagation Neural Network
3.4.1.1 Building The Network Structure
3.4.1.2 Learning Process
3.4.1.3 Testing Process/Output Generation
3.4.2 Support Vector Machine (SVM)
3.5 Analysis Of Classification Performance
3.6 Summary

4 RESULTS AND DISCUSSION
4.0 Introduction
4.1 Classification Results On The Original Aids Data Set
4.1.1 Result Of Back Propagation Neural Network
4.1.2 Result Of SVM
4.2 Classification Results On The PCA Data Set
4.2.1 Result Of Back Propagation Neural Network
4.2.2 Result Of SVM
4.3 Comparison Of Classification Results Between The Original Aids Data And PCA Data
4.4 Summary

5 CONCLUSION
5.0 Introduction

5.1 Findings
5.2 Advantages of Study
5.3 Contribution of Study
5.4 Conclusion
5.5 For Future Work

BIBLIOGRAPHY
APPENDIX

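LIST OF TABLES

TABLE NO. TITLE

3.1 Principal Components selection criteria
3.2 Number of sample in training and testing set
3.3 Number of node in the network layer
4.1 The best network structure for the 20:80 Aids data set
4.2 The best network structure for the 50:50 Aids data set
4.3 The best network structure for the 80:20 Aids data set
4.4 Comparison of results for all three Aids data set using Back propagation neural network
4.5 Confusion matrix for classification of 20:80 Aids data set using Linear Kernel
4.6 Confusion matrix for classification of 20:80 Aids data set using Polynomial Kernel
4.7 Confusion matrix for classification of 20:80 Aids data set using RBF Kernel
4.8 Confusion matrix for classification of 50:50 Aids data set using Linear Kernel
4.9 Confusion matrix for classification of 50:50 Aids data set using Polynomial Kernel
4.10 Confusion matrix for classification of 50:50 Aids data set using RBF Kernel
4.11 Confusion matrix for classification of 80:20 Aids data set using Linear Kernel
4.12 Confusion matrix for classification of 80:20 Aids data set using Polynomial Kernel
4.13 Confusion matrix for classification of 80:20 Aids data set using RBF Kernel
4.14 Comparison of results for all three Aids data set using three different kernels in SVM
4.15 The best network structure for the 20:80 PCA data set
4.16 The best network structure for the 50:50 PCA data set
4.17 The best network structure for the 80:20 PCA data set
4.18 Comparison of results for all three PCA data set using Back propagation neural network
4.19 Confusion matrix for classification of 20:80 PCA data set using Linear Kernel
4.20 Confusion matrix for classification of 20:80 PCA data set using Polynomial Kernel
4.21 Confusion matrix for classification of 20:80 PCA data set using RBF Kernel
4.22 Confusion matrix for classification of 50:50 PCA data set using Linear Kernel
4.23 Confusion matrix for classification of 50:50 PCA data set using Polynomial Kernel
4.24 Confusion matrix for classification of 50:50 PCA data set using RBF Kernel
4.25 Confusion matrix for classification of 80:20 PCA data set using Linear Kernel
4.26 Confusion matrix for classification of 80:20 PCA data set using Polynomial Kernel
4.27 Confusion matrix for classification of 80:20 PCA data set using RBF Kernel
4.28 Comparison of results for all three PCA data set using three different kernels in SVM
4.29 Comparison of Classification Performance for Aids data set between Back propagation neural network and SVM
4.30 Comparison of Classification Performance for PCA data set between Back propagation neural network and SVM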

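LIST OF FIGURES

FIGURE NO. TITLE

2.1 The Drug Discovery and Development Process
2.2 An example of structural keys
2.3 An example of a decision tree
2.4 A neural network with one hidden layer
2.5 A linear support vector machine
2.6 SVM margin
2.7 SVM input and feature space
3.1 Data Acquisition and Pre-processing phase
4.1 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set
4.2 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set
4.3 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set
4.4 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using Linear Kernel
4.5 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using Polynomial Kernel
4.6 Comparison of Desired Target Output and Actual Output for 20:80 Aids data set using RBF kernel
4.7 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using Linear Kernel
4.8 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using Polynomial Kernel
4.9 Comparison of Desired Target Output and Actual Output for 50:50 Aids data set using RBF kernel
4.10 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using Linear Kernel
4.11 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using Polynomial Kernel
4.12 Comparison of Desired Target Output and Actual Output for 80:20 Aids data set using RBF kernel
4.13 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set
4.14 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set
4.15 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set
4.16 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using Linear Kernel
4.17 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using Polynomial Kernel
4.18 Comparison of Desired Target Output and Actual Output for 20:80 PCA data set using RBF kernel
4.19 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using Linear Kernel
4.20 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using Polynomial Kernel
4.21 Comparison of Desired Target Output and Actual Output for 50:50 PCA data set using RBF kernel
4.22 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using Linear Kernel
4.23 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using Polynomial Kernel
4.24 Comparison of Desired Target Output and Actual Output for 80:20 PCA data set using RBF kernel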

LIST OF SYMBOLS

$N$    dimension of data
$n$    dimension of input data
$\Re$    feature space
$\mathcal{H}$    Euclidean space
$y \in Y$    output and output space
$x \in X$    input and input space
$\langle x \cdot z \rangle$    inner product between x and z
$K(x, z)$    kernel $\langle \phi(x) \cdot \phi(z) \rangle$
$w$    weight vector
$b$    bias
$\alpha$    dual variables or Lagrange multipliers
$L$    primal Lagrangian
$W$    dual Lagrangian
$\| \cdot \|_p$    p-norm
$\ln$    natural logarithm
$e$    base of the natural logarithm
$\log$    logarithm to the base 2
$x', X'$    transpose of vector, matrix
$\mathbb{N}, \mathbb{R}$    natural, real numbers
       learning rate
       momentum rate
       confidence

LIST OF ABBREVIATIONS

CA      Confirmed Active
CART    Classification And Regression Trees
CM      Confirmed Moderately Active
CI      Confirmed Inactive
ERM     Empirical Risk Minimization
FA      Factor Analysis
GAM     Generalize Additive Model
HTS     High Throughput Screening
LRM     Logistic Regression Method
MARS    Multivariate Adaptive Regression Splines
MLP     Multi Layer Perceptrons
MSE     Mean Squared Error
NCI     National Cancer Institute
NMR     Nuclear Magnetic Resonance
PCA     Principal Components Analysis
RBF     Radial Basis Function
SAR     Structure-Activity Relationship
SRM     Structural Risk Minimization
SVM     Support Vector Machine

LIST OF APPENDICES

APPENDIX TITLE

A1 Gantt Chart of Project I
A2 Gantt Chart of Project II
B Sample Data
C Calculation in Forward Propagation
D Calculation in Backward Propagation
E1 Output of Stage I and II for 20:80 Aids data set using Back propagation neural network
E2 Output of Stage I and II for 50:50 Aids data set using Back propagation neural network
E3 Output of Stage I and II for 80:20 Aids data set using Back propagation neural network
F1 Output of Stage I and II for 20:80 PCA data set using Back propagation neural network
F2 Output of Stage I and II for 50:50 PCA data set using Back propagation neural network
F3 Output of Stage I and II for 80:20 PCA data set using Back propagation neural network
G1 Classification results of 20:80 Aids data set using SVM
G2 Classification results of 50:50 Aids data set using SVM
G3 Classification results of 80:20 Aids data set using SVM
H1 Classification results of 20:80 PCA data set using SVM
H2 Classification results of 50:50 PCA data set using SVM
