
Source: prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23

Protein Subcellular Classification using

Machine Learning Approaches

Muhammad Tahir

PhD Thesis

Department of Computer and Information Sciences,

Pakistan Institute of Engineering & Applied Sciences,

Islamabad, Pakistan


Protein Subcellular Classification using

Machine Learning Approaches

By

Muhammad Tahir

A dissertation submitted in partial fulfillment of the

requirements for the degree of Doctor of Philosophy in

Computer and Information Sciences

The Department of Computer and Information Sciences,

Pakistan Institute of Engineering and Applied Sciences,

Islamabad, Pakistan

2014


This thesis work is carried out under the supervision of

Dr. Asifullah Khan

Associate Professor

Department of Computer and Information Sciences,

Pakistan Institute of Engineering & Applied Sciences,

Islamabad, Pakistan

This work is financially supported by the Higher Education Commission of Pakistan under the Indigenous 5000 Ph.D. Fellowship Program

17-5-4 (Ps4-124)/HEC/Sch/2008/


Declaration

I confirm that the work presented in this thesis is the contribution of my original research work in candidature for a research degree at this university; references to others' published work are acknowledged and explicitly cited in the text. I affirm that the material presented in this thesis has neither been submitted nor approved previously for the award of a degree at any university.

Signature:

Muhammad Tahir

It is certified that the work in this thesis was carried out and completed under my supervision.

Supervisor:

Dr. Asifullah Khan

Associate Professor

DCIS, PIEAS, Islamabad.


Acknowledgments

First and foremost, I am very thankful to Allah Almighty for His blessings during my entire life and particularly for the duration of this research work. He blessed me with knowledge and purpose, and guided me whenever I faced any problems. Having sincere teachers and cooperative friends during my PhD adventure is all a blessing of Allah Almighty.

Next, I would like to pay my deepest gratitude to my supervisor, Dr. Asifullah Khan, whose dedication, enthusiasm, and devotion to work always inspired me. He has been a constant source of motivation and inspiration throughout my PhD research. I always found him generous in sharing his knowledge and wisdom. Working with him was a great learning experience, and his kind attitude really made the difference.

I would also like to pay my gratitude to Dr. Abdul Majid for his advice, guidance, and invaluable comments during my PhD. I extend my gratitude to Dr. Abdul Jalil for his appreciation and encouragement to complete my PhD. I would also like to appreciate my friends, particularly Dr. Maqsood Hayat, Mr. Khurram Jawad, Mr. Adnan Idris, Mr. Aksam Iftikhar, Mr. Mehdi Hassan, Mr. Manzoor, Mr. Fazal Badshah, Mr. Safdar Ali, Dr. Atta ur Rahman, Dr. Anwar Hussain, and Mr. Zaheer Uddin, for their cooperative and encouraging behavior during my stay at PIEAS.

I would certainly like to thank my loving parents, brothers, and sisters, whose support throughout my studies has made all this possible. Without their encouraging behavior and moral support, the completion of this research work would not have been possible.

I pay my gratitude to the Pattern Recognition Lab at the Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, which provided me with a good environment and round-the-clock technical support for conducting my PhD research.

Finally, I would like to thank the Higher Education Commission of Pakistan for its financial support under the Indigenous 5000 PhD scholarship program, with reference to award letter number 17-5-4(Ps4-124)/HEC/Sch/2008/.

Muhammad Tahir


List of Publications

International refereed journals

1. Muhammad Tahir, Asifullah Khan, and Abdul Majid, “Protein Subcellular Localization of Fluorescence Imagery using Spatial and Transform domain Features”, Journal of Bioinformatics, 28(1) 91-97 (2012). (impact factor: 5.323)

2. Muhammad Tahir, Asifullah Khan, Abdul Majid, and Alessandra Lumini, “Subcellular Localization using Fluorescence Imagery: Utilizing Ensemble Classification with Diverse Feature Extraction Strategies and Data Balancing”, Journal of Applied Soft Computing, 13(11) 4231-4243 (2013). (impact factor: 2.140)

3. Muhammad Tahir, Asifullah Khan, and Hüseyin Kaya, “Protein Subcellular Localization in Human and Hamster cell lines: Employing Local Ternary Patterns of Fluorescence Microscopy Images”, Journal of Theoretical Biology, 340 85-95 (2014). (impact factor: 2.351)

4. Muhammad Tayyeb Mirza, Asifullah Khan, Muhammad Tahir, and Yeon Soo Lee, “MitProt-Pred: Predicting mitochondrial proteins of Plasmodium falciparum parasite using diverse physiochemical properties and ensemble classification”, Journal of Computers in Biology and Medicine, 43(10) 1502-1511 (2013). (impact factor: 1.162)

5. Muhammad Tahir and Asifullah Khan, “Protein Subcellular Localization using SVM based Ensemble and Individual Orientations of both Gray Level Co-occurrence Matrices and Texton Images”, Under Review in IEEE Transactions on Cybernetics.


Contents

Declaration iv

Acknowledgments v

List of Publications vi

Abstract xvi

1 Introduction 1

1.1 Motivation and Research Objectives . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Research Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Relevant Literature and Techniques 7

2.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Feature Extraction Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Haralick Texture Features . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.1.1 Gray Level Co-occurrence Matrix . . . . . . . . . . . . . . 13

2.2.2 Texton Image based Statistical Features . . . . . . . . . . . . . . . . 18

2.2.3 Zernike Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2.4 Wavelet Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.5 Local Binary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.6 Local Ternary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2.7 Threshold Adjacency Statistics . . . . . . . . . . . . . . . . . . . . . 23

2.2.8 Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.9 Edge Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


2.2.10 Hull Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.11 Morphological Features . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.12 Histogram of Oriented Gradients . . . . . . . . . . . . . . . . . . . . 25

2.3 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3.1 Oversampling with SMOTE . . . . . . . . . . . . . . . . . . . . . . . 26

2.3.2 Feature Selection with mRMR . . . . . . . . . . . . . . . . . . . . . 27

2.4 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.4.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.4.2 Random Forest Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 30

2.4.3 Rotation Forest Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 31

2.5 Performance Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.5.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.5.2 Sensitivity/Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.5.3 Matthews Correlation Coefficient . . . . . . . . . . . . . . . . . . . . 33

2.5.4 F-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.5.5 Q-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.5.6 Multiclass ROC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.6 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Protein Subcellular Localization using Spatial and Transform Domain Features 39

3.1 The SVM-SubLoc Prediction System . . . . . . . . . . . . . . . . . . . . . . 39

3.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 40

3.1.1.1 GLCM Construction and Haralick Co-efficients . . . . . . . 41

3.1.1.2 Discrete Wavelet Transformation . . . . . . . . . . . . . . . 42

3.1.1.3 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 44

3.1.2 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2.1 Performance Analysis of SVM-SubLoc using Individual Features for HeLa dataset . . . . . . . . 45

3.2.2 Performance Analysis of SVM-SubLoc using Hybrid Features for 2D HeLa dataset . . . . . . . . 49


3.2.3 Performance Analysis of SVM-SubLoc for LOCATE datasets . . . . 51

3.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Protein Subcellular Localization in the Presence of Imbalanced data 57

4.1 The Proposed RF-SubLoc Prediction System . . . . . . . . . . . . . . . . . 57

4.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 58

4.1.1.1 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 58

4.1.2 Data Balancing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1.3 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2.1 Performance of RF-SubLoc using Individual Features . . . . . . . . 61

4.2.2 Performance Analysis of RF-SubLoc using Hybrid Features . . . . . 63

4.2.3 Performance Analysis of RotF Ensemble using Individual Features . 65

4.2.4 Performance Analysis of RotF Ensemble using Hybrid Features . . . 65

4.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5 Protein Subcellular Localization: Employing LTP with SMOTE 70

5.1 The Protein-SubLoc Prediction System . . . . . . . . . . . . . . . . . . . . 70

5.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 71

5.1.2 Oversampling Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.1.3 Feature Selection (Optional) . . . . . . . . . . . . . . . . . . . . . . . 71

5.1.4 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.5 Ensemble Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.1 Performance Analysis for 2D HeLa dataset . . . . . . . . . . . . . . 74

5.2.2 Performance Analysis for CHOA dataset . . . . . . . . . . . . . . . . 78

5.2.3 Ensemble Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6 Protein Subcellular Localization using GLCM and Texton Image based Features 84

6.1 The IEH-GT Prediction System . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 85


6.1.1.1 GLCM Construction . . . . . . . . . . . . . . . . . . . . . . 85

6.1.1.2 Texton Image Construction . . . . . . . . . . . . . . . . . . 86

6.1.1.3 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 86

6.1.2 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.2.1 Analysis of GLCM based Features for HeLa Dataset . . . . . . . . . 87

6.2.2 Analysis of Texton Image based Features for HeLa Dataset . . . . . 94

6.2.3 Analysis of the Hybrid Model for HeLa dataset . . . . . . . . . . . . 101

6.2.4 Ensemble Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Conclusions and Future Directions 105

7.1 Conclusive Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

References 109


List of Figures

Figure 1.1 Fluorescence microscopy image of microtubule from HeLa dataset [5] 2

Figure 1.2 A general pattern recognition system . . . . . . . . . . . . . . . . . . 3

Figure 2.1 GLCM construction for Ng = 8 at θ = {0◦, 45◦, 90◦, 135◦} and ∆ = 1 14

Figure 2.2 Different Texton masks . . . . . . . . . . . . . . . . . . . . . . . . . 18

Figure 2.3 Procedure of Texton image generation . . . . . . . . . . . . . . . . . 19

Figure 2.4 LBP code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Figure 2.5 LTP is split into two LBP codes . . . . . . . . . . . . . . . . . . . . 23

Figure 2.6 Threshold Adjacency Statistics . . . . . . . . . . . . . . . . . . . . . 24

Figure 3.1 The SVM-SubLoc prediction system . . . . . . . . . . . . . . . . . . 40

Figure 3.2 Feature extraction from GLCMH, GLCMV, GLCMD and GLCMoD . 41

Figure 3.3 DWT Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Figure 3.4 Fluorescence microscopy protein image of size M-by-N is split into four sub-images at each decomposition level. Decomposition level 0 indicates the original fluorescence microscopy protein image. . . . . . . . . 43

Figure 3.5 The Classification phase of SVM-SubLoc . . . . . . . . . . . . . . . . 44

Figure 4.1 The RF-SubLoc prediction system . . . . . . . . . . . . . . . . . . . 58

Figure 4.2 The Classification phase of RF-SubLoc . . . . . . . . . . . . . . . . . 61

Figure 5.1 Framework of the proposed system . . . . . . . . . . . . . . . . . . . 71

Figure 5.2 Comparison of original and synthetic samples in HeLa dataset . . . 72

Figure 5.3 Comparison of original and synthetic samples in CHOA dataset . . . 72

Figure 5.4 Classification phase of Protein-SubLoc prediction system . . . . . . . 73

Figure 5.5 Ratio of explained variance to the total variance for HeLa dataset . 75

Figure 5.6 ROC curves using URI-LTP(3, 24, 80) for 2D HeLa dataset . . . . . 76


Figure 5.7 Effect of mRMR on URI-LTP extracted on radius 3 for HeLa dataset 78

Figure 5.8 Effect of mRMR on U-LTP extracted on radius 2 for HeLa dataset . 78

Figure 5.9 ROC curves using URI-LTP(3, 24, 30) for CHOA dataset . . . . . . 80

Figure 6.1 Framework of IEH-GT prediction system . . . . . . . . . . . . . . . 85

Figure 6.2 Classification phase of IEH-GT system . . . . . . . . . . . . . . . . . 87

Figure 6.3 Golgia protein is wrongly predicted as Golgpp protein . . . . . . . . 89

Figure 6.4 Golgia protein is represented by red circle and Golgpp class is shown by green plus sign . . . . . . . . 89

Figure 6.5 Golgpp protein is wrongly predicted as Golgia protein . . . . . . . . 90

Figure 6.6 Golgpp instance is represented by red circle and Golgia class is shown by green plus sign . . . . . . . . 90

Figure 6.7 Endosome instance is shown by red circle and Lysosome class is represented by green plus sign . . . . . . . . 95

Figure 6.8 Similar patterns can be observed in the two images . . . . . . . . . . 96


List of Tables

Table 2.1 Offset pair with corresponding direction . . . . . . . . . . . . . . . . . 13

Table 2.2 A confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Table 2.3 Correlation between classifier BCi and classifier BCj . . . . . . . . . 35

Table 2.4 HeLa dataset classes and image breakup in each class . . . . . . . . . 36

Table 2.5 CHOM dataset classes and image breakup in each class . . . . . . . . 37

Table 2.6 CHOA dataset classes and image breakup in each class . . . . . . . . 37

Table 2.7 Vero dataset classes and image breakup in each class . . . . . . . . . 37

Table 2.8 LOCATE Endogenous and Transfected datasets: images breakup . . 38

Table 3.1 Performance of SVM-SubLoc using Haralick textures with/without DWT 46

Table 3.2 Performance of SVM-SubLoc using Zernike moments with/without DWT 47

Table 3.3 Performance of SVM-SubLoc using TAS features . . . . . . . . . . . . 47

Table 3.4 Performance of SVM-SubLoc using LBP with different mappings . . 48

Table 3.5 Performance of SVM-SubLoc using LTP with different mappings . . . 49

Table 3.6 Performance of SVM-SubLoc using ZHar with/without DWT . . . . 50

Table 3.7 Performance of SVM-SubLoc using HarTAS . . . . . . . . . . . . . . 50

Table 3.8 Performance of SVM-SubLoc using HarLBP . . . . . . . . . . . . . . 51

Table 3.9 Performance of SVM-SubLoc using HarLTP . . . . . . . . . . . . . . 51

Table 3.10 Performance of SVM-SubLoc using LTP for Endogenous dataset . . . 52

Table 3.11 Performance of SVM-SubLoc using LTP for Transfected dataset . . . 52

Table 3.12 Performance of SVM-SubLoc using HarLBP for Endogenous dataset . 53

Table 3.13 Performance of SVM-SubLoc using HarLBP for Transfected dataset . 53

Table 3.14 Performance of SVM-SubLoc using HarLTP for Endogenous dataset . 54

Table 3.15 Performance of SVM-SubLoc using HarLTP for Transfected dataset . 54

Table 3.16 Performance comparison with other published work . . . . . . . . . . 55

Table 4.1 Oversampled instances for CHOM dataset . . . . . . . . . . . . . . . 59


Table 4.2 Oversampled instances for CHOA dataset . . . . . . . . . . . . . . . . 60

Table 4.3 Oversampled instances for Vero dataset . . . . . . . . . . . . . . . . . 60

Table 4.4 Performance of RF-SubLoc using individual features . . . . . . . . . . 62

Table 4.5 Performance of RF-SubLoc using hybrid features . . . . . . . . . . . 64

Table 4.6 Performance of RotF ensemble using individual features . . . . . . . . 66

Table 4.7 Performance of RotF ensemble using hybrid features . . . . . . . . . 67

Table 4.8 Performance comparison with other published work . . . . . . . . . . 68

Table 5.1 The serial numbers, attached to the mapping used in LTP computation, representing a particular LTP variant in Tables 5.8 and 5.9 . . . . . . . . 73

Table 5.2 Performance of Protein-SubLoc using LTP for balanced HeLa dataset 74

Table 5.3 Performance of Protein-SubLoc using LTP for balanced mRMR based HeLa dataset . . . . . . . . 77

Table 5.4 Performance of Protein-SubLoc using LTP for imbalanced HeLa dataset 79

Table 5.5 Performance of Protein-SubLoc using LTP for balanced CHOA dataset 79

Table 5.6 Performance of Protein-SubLoc using LTP for balanced mRMR based CHOA dataset . . . . . . . . 79

Table 5.7 Performance of Protein-SubLoc using LTP for imbalanced CHOA dataset 80

Table 5.8 Performance of different combinations of SVM classifications using LTP for balanced HeLa dataset . . . . . . . . 81

Table 5.9 Performance of different combinations of SVM classifications using LTP for balanced CHOA dataset . . . . . . . . 81

Table 5.10 Performance comparison with other approaches . . . . . . . . . . . . 82

Table 6.1 Predictions of individual SVMs using Haralick features for HeLa dataset 87

Table 6.2 Confusion matrix using Haralick features from GLCMH . . . . . . . . 88

Table 6.3 Confusion matrix using Haralick features computed from GLCMD . . 91

Table 6.4 Confusion matrix using Haralick features computed from GLCMV . . 91

Table 6.5 Confusion matrix using Haralick features computed from GLCMoD . 92

Table 6.6 Confusion matrix using Haralick features computed from GLCMF . . 92

Table 6.7 Golgia proteins classified as Golgpp proteins . . . . . . . . . . . . . . 93

Table 6.8 Predictions of individual SVMs using statistical features computed from Texton images . . . . . . . . 94


Table 6.9 Confusion matrix using statistical features computed from Texton image constructed using T1 . . . . . . . . 95

Table 6.10 Confusion matrix using statistical features computed from Texton image constructed using T2 . . . . . . . . 96

Table 6.11 Confusion matrix using statistical features computed from Texton image constructed using T3 . . . . . . . . 97

Table 6.12 Confusion matrix using statistical features computed from Texton image constructed using T4 . . . . . . . . 98

Table 6.13 Confusion matrix using statistical features computed from Texton image constructed using T5 . . . . . . . . 98

Table 6.14 Confusion matrix using statistical features computed from Texton image constructed using T6 . . . . . . . . 99

Table 6.15 Confusion matrix using statistical features computed from Texton image constructed from the fusion of six Texton images . . . . . . . . 99

Table 6.16 Different Texton image based statistical features lead to different classification results: Endosome proteins as Lysosome proteins . . . . . . . . 100

Table 6.17 Confusion matrix obtained using the hybrid model . . . . . . . . . . . 101

Table 6.18 Confusion matrix using majority voting scheme . . . . . . . . . . . . 102

Table 6.19 Performance comparison with the existing approaches based on HeLa dataset . . . . . . . . 103


Abstract

Subcellular localization of proteins is one of the most significant characteristics of living cells and may reveal a wealth of information regarding the working of a cell. The subcellular localization property of proteins plays a key role in understanding numerous protein functions. Proteins located in their respective compartments, or localizations, are involved in their relevant cellular processes, which may include cell apoptosis, asymmetric cell division, cell cycle regulation, and spermatic morphogenesis. In fact, cells may not perform their regular operations well if proteins are not found in their proper subcellular locations; improper localization of proteins may lead to primary human liver tumors, breast cancer, and Bartter syndrome. Protein sequencing has undergone rapid expansion due to advances in genomic sequencing technologies, which has led the research community to recognize the functionalities of different proteins. In this connection, microscopy imaging provides protein images in a timely manner and at low cost compared to protein sequencing. However, automated systems are required for fast and reliable classification of these protein images, and comprehensive analysis of fluorescence microscopy images is required in order to develop efficient automated systems for accurate localization of various proteins. For this purpose, representing microscopy images with discriminative numerical descriptors has always been a challenge.

This thesis focuses on the identification of discriminative feature extraction strategies effective for protein subcellular localization, on the recognition capability of the prediction systems, and on the reduction of classifier bias towards the majority class due to the imbalance present in the data. The contributions of this thesis include (1) analysis of different spatial and transform domain features, (2) development of a novel idea for GLCM construction in the DWT domain, (3) analysis of SMOTE oversampling in the feature space, (4) analysis of GLCM in the spatial domain for capturing discriminative information from fluorescence microscopy protein images along different orientations, (5) exploitation of Texton images for their capability of extracting discriminative information along different orientations from fluorescence microscopy protein images, and (6) development of web-based prediction systems that can be accessed freely by academicians and researchers.

Extensive simulations are performed in order to assess the efficiency of the proposed prediction systems in discriminating different subcellular structures across various datasets.
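The GLCM-based contributions above rest on standard gray-level co-occurrence machinery. As a minimal, generic sketch of that machinery (pure NumPy, with a made-up 4x4 toy image and four gray levels; not the thesis's actual implementation), a single-orientation GLCM and two Haralick-style coefficients can be computed as follows:

```python
import numpy as np

def glcm(img, levels):
    """Symmetric, normalized co-occurrence matrix for horizontal
    neighbors (theta = 0 degrees, delta = 1)."""
    G = np.zeros((levels, levels))
    for a, b in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        G[a, b] += 1          # count each left-right gray-level pair
    G = G + G.T               # make symmetric: count both directions
    return G / G.sum()        # normalize to joint probabilities

# Toy quantized image (illustrative only)
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, levels=4)

i, j = np.indices(P.shape)
contrast = float(np.sum(P * (i - j) ** 2))  # Haralick contrast
energy = float(np.sum(P ** 2))              # angular second moment (ASM)
print(contrast, energy)
```

In the thesis, such coefficients are computed from GLCMs at several orientations (θ = 0°, 45°, 90°, 135°) and fed to the classifiers; libraries such as scikit-image offer equivalent routines (`graycomatrix`/`graycoprops`).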


List of Abbreviations

AIIA Artificial Intelligence for Investigating Anti-cancer solutions

ASM Angular Second Moment

AUC Area Under Curve

CHO Chinese Hamster Ovary

CHOA CHO dataset obtained from AIIA lab

CHOM CHO dataset obtained from Murphy lab

COF Center Of Fluorescence

DWT Discrete Wavelet Transform

EQP Elongated Quinary Patterns

FN False negatives

FP False positives

GFP Green Fluorescent Protein

GLCM Grey Level Co-occurrence Matrix

HH High Frequency Components

HL High Frequency and Low Frequency Components

HOG Histogram of Oriented Gradients

ICA Independent Component Analysis


IEH-GT Individual Exploitation of orientation and Hybridization of GLCM and

Texton Image

KPCA Kernel PCA

LBP Local Binary Patterns

LH Low Frequency and High Frequency Components

lin-SVM SVM with linear kernel

LL Low Frequency Components

LTP Local Ternary Patterns

MCC Matthews Correlation Coefficient

MDA Multiple Discriminant Analysis

mRMR minimum Redundancy Maximum Relevance

NPE Neighborhood Preserving Embedding

PCA Principal Component Analysis

poly-SVM SVM with polynomial kernel

Protein-SubLoc SVM with SMOTE based Subcellular Localization System

Q-Statistic Yule’s measure for diversity among multiple classifiers

RBF Radial Basis Function

RBF-SVM SVM with RBF kernel

RF Random Forest

RF-SubLoc RF based ensemble prediction system

RI Rotation Invariant

ROC Receiver Operating Characteristic

RotF Rotation Forest


RSE Random Subspace Ensemble

SDA Stepwise Discriminant Analysis

sig-SVM SVM with sigmoid kernel

SLF Subcellular Location Features

SMOTE Synthetic Minority Oversampling TEchnique

SRM Structural Risk Minimization

SVM Support Vector Machine

SVM-SubLoc SVM based Subcellular Localization System

TAS Threshold Adjacency Statistics

TN True negatives

TP True positives

U Uniform

URI Uniform Rotation Invariant


List of Symbols

α Lagrange multiplier

w Weight vector

x Feature vector

y Corresponding label vector of x

∆ Distance in terms of pixel

γ Width of the Gaussian function in RBF kernel of SVM

µ Mean value

σ Variance value

θ Orientation of GLCM or Texton image

τ Threshold value

ξ Positive slack variable associated with training data

B Base Classifier

BN Number of Base Classifiers

C Cost parameter of misclassification for SVM

d degree of polynomial kernel in SVM

gc Gray value of central pixel c ∈ PN

gu Gray value of pixel u ∈ PN


Gi,j Total number of times a particular combination of gray level i and gray level j occurs in a GLCM

Mf Modified dimensions of a feature space

Nf Dimensions of a feature space

Ng Number of gray levels in an image or the square dimensions of a GLCM

PN Pixels in a neighborhood

(i, j ) Index of a 2D image or matrix

CL Class Label

D Dimensionality of the feature space

i Some gray level in an image

j Some gray level in an image

K SVM kernel

L Decomposition Level

MI Mutual Information

MR Maximum Relevance

mR minimum Redundancy

S Feature Set

u A particular pixel ∈ PN

Y Predicted Label

z Target class


Chapter 1

Introduction

Subcellular localization of proteins provides information about the functional behavior of

proteins as well as about their tendency to interact with other proteins under different

circumstances [1, 2]. Therefore, knowledge of the subcellular distribution of these proteins

in a cell is of prime significance in the field of proteomics, cell biology, and computational

functional genomics [3, 4]. Determining protein subcellular localization is critical to the

understanding of various protein functions. For example, during the drug discovery pro-

cess, knowledge of the subcellular localization of a protein can considerably improve the

identification of drugs [5,6]. Comprehension of the protein functions is of prime importance

in biological sciences that may help understand the cell behavior in different situations [7].

Moreover, diagnosis of different diseases in early stages might be performed by adequately

finding the protein locations in cells [8, 9]. For instance, aberrant subcellular localization

of proteins has been observed in the cells of several fatal diseases, namely, breast cancer

and Alzheimer’s disease [10]. In addition, knowing the exact location of protein before and

after using the drugs may help in assessing the drugs ability to cure certain disease [11,12].

Different subcellular compartments of a cell include cytoplasm, mitochondria, Golgi appa-

ratus, lysosome, endoplasmic reticulum, and many others that help it to carry out different

activities such as digestion, movement, and reproduction [13–15].

Various microscopy techniques are frequently used to determine protein localizations

from images. Among these, fluorescence microscopy based imaging has received remarkable attention from researchers in different fields, particularly in the biological sciences [16–18].

The ability to efficiently quantify the object of interest lying on a black background is


the specialty of fluorescence microscopy. To image a protein, GFP needs to be attached to it so that the microscope can visualize it, because it is the light emitted by the GFP that is captured by the fluorescence microscopy imaging system. Due to the advancement of such

tools, the art of microscopy imaging has flourished in the fields of health and medicine. An

example fluorescence microscopy image of microtubule protein from HeLa cell lines [5] is

shown in Figure 1.1. Hereafter, the term “fluorescence microscopy image” will be referred to simply as “protein image” in the text.

Figure 1.1: Fluorescence microscopy image of microtubule from HeLa dataset [5]

Fluorescence microscopy generated data is typically used by researchers to train pattern

recognition systems based on their novel algorithms that might aid medical doctors in diag-

nosing various diseases [10,19]. The training of such algorithms is based on the availability

of large microscopy data, which can be efficiently provided by fluorescence microscopy. A

typical pattern recognition system consists of an optional pre-processing phase, a feature

extraction phase, an optional post-processing phase and a classification phase as depicted

in Figure 1.2. In this figure, dotted lines around some blocks indicate that these phases

are optional, which completely depend on the input data. Representation of fluorescence

microscopy protein images through discriminative numerical descriptors for the classifica-

tion stage is the main focus of research in this area. In this connection, pioneering work was conducted at Murphy’s Lab [2, 5, 11, 20–23]. They developed various feature extraction

mechanisms, which can efficiently distinguish among different protein structures. Their re-

search was further extended by other researchers in the field [7,24,25]. They have proposed


Figure 1.2: A general pattern recognition system (Input Image → Pre-Processing → Feature Extraction → Post-Processing → Classification)
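The flow of Figure 1.2 can be made concrete with a toy sketch; every function below (pre_process, extract_features, classify) and the nearest-centroid rule are illustrative stand-ins, not the actual methods developed in this thesis:

```python
import numpy as np

# Hypothetical stand-ins for the phases in Figure 1.2; a real system would
# plug in denoising, Haralick/Zernike extraction, a trained classifier, etc.

def pre_process(image):
    """Optional phase: scale intensities into [0, 1]."""
    image = image.astype(float)
    span = image.max() - image.min()
    return (image - image.min()) / span if span > 0 else image

def extract_features(image):
    """Toy descriptor: mean intensity, its spread, and gradient energy."""
    gy, gx = np.gradient(image)
    return np.array([image.mean(), image.std(), np.mean(gx ** 2 + gy ** 2)])

def classify(features, centroids):
    """Nearest-centroid decision over per-class mean feature vectors."""
    return min(centroids, key=lambda c: np.linalg.norm(features - centroids[c]))

# Usage on two synthetic "classes" (dark vs. textured images):
dark = np.zeros((8, 8))
textured = np.indices((8, 8)).sum(axis=0) % 2 * 255.0
centroids = {"dark": extract_features(pre_process(dark)),
             "textured": extract_features(pre_process(textured))}
label = classify(extract_features(pre_process(dark)), centroids)
```

Swapping in Haralick or Zernike descriptors and a trained SVM turns the same skeleton into the kind of prediction system studied in later chapters.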

modifications to the existing methods as well as developed novel strategies to classify protein

subcellular localizations. The objective of these researchers is to develop such prediction

systems, which are superior to others in terms of accuracy.

1.1 Motivation and Research Objectives

Precise information about the localization of proteins provides valuable clue toward deter-

mining the function of novel proteins [26–29]. Experimental methods are inherently expen-

sive and time-consuming as well as laborious due to the requirement of a human specialist

in the field. Further, conventional experimental techniques for proteomic and microscopic recognition are not viable for some of the species [30]. Their applications

are limited to a few proteins. Therefore, for basic research and drug discovery, automated,

reliably efficient, and fast computational methods are always required by researchers and

pharmaceutical industry so that unknown proteins can easily be identified and accurately

predicted. Extraction of informative features from fluorescence microscopy images of var-

ious proteins and consequently, efficient exploitation of the discriminative power of these

features is a challenging task in order to accurately predict the class of a particular protein.

Considerable progress is observed in recent years for the development of computational

methods that can automatically determine the subcellular protein locations from fluores-

cence microscopy images. The main objective of this research is to express fluorescence

microscopy protein images using their numerical descriptors such that different patterns

belonging to various classes can be discriminated from each other. Furthermore, novel pro-

tein localization systems would be developed using various machine learning approaches for

the accurate localization of proteins under the influence of imbalanced feature spaces.


1.2 Research Perspective

Advances in microscopy techniques have enabled researchers to create large databanks of

fluorescence microscopy protein images. Automated systems are capable of recognizing and

classifying these images efficiently and accurately. The research work presented in this thesis

focuses on the application of various feature extraction strategies on fluorescence microscopy

images of proteins in the field of Bioinformatics. Further, the effect of imbalanced data on

the performance of a pattern recognition system is also unveiled.

1.3 Contributions

This research takes into consideration the discrimination power of feature extraction strate-

gies, the recognition capability of the pattern recognition system, and the imbalanced nature of

the data. The key findings are listed below.

• A study of spatial and transform domain features has been performed, which reveals

that the discrimination power of certain feature spaces is improved when extracted

in transform domain rather than spatial domain. Further, the ensemble constructed

from the decisions of individual classifiers enhances the overall prediction performance.

• The discrimination power of Haralick features [31] extracted from GLCM constructed

in the DWT domain is enhanced. Further, level 2 is observed to be the best decom-

position level, which reveals that this level possesses the most discriminative features.

• Introducing synthetic samples using SMOTE [32] in the feature space prior to the

classification phase reduces the classifier bias towards majority class and consequently,

enhances the prediction performance of protein subcellular localization system.

• Accuracy of a prediction system using feature spaces oversampled through SMOTE is

directly proportional to the degree of imbalance present in the original dataset of fluorescence

microscopy protein images.

• It is revealed from numerous simulation results that discriminative strength of LTP

has not been improved with mRMR or alternatively, LTP does not require mRMR

based feature selection for the improvement of its discriminative strength.


• GLCM and Texton image [33] along different orientations capture different informa-

tion from the same image that is useful in identifying fluorescence microscopy protein

images from different classes.

• It is also observed that in some cases, GLCM constructed along a single direction

performs better than the combined GLCM along all the four orientations.

• Web based prediction systems are also developed that can be accessed freely by the

academicians and researchers.
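The SMOTE-based balancing invoked in these findings can be illustrated by its core interpolation step; this is a simplified sketch (the function name and parameters are illustrative, and the complete algorithm of [32] differs in detail):

```python
import numpy as np

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples by interpolating each chosen
    minority sample toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # k nearest neighbours of X[i] within the minority class (self excluded)
        order = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]
        nn = X[rng.choice(order)]
        out.append(X[i] + rng.random() * (nn - X[i]))  # point on the segment
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sketch(minority, n_new=5)  # 5 new minority-class points
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies.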

1.4 Structure of the Thesis

Chapter 2 highlights various approaches in the literature developed for protein subcellu-

lar localization. Researchers have utilized numerous individual and ensemble classification

based approaches by exploiting the discrimination power of various texture and statistical

based feature extraction strategies. Further, various existing machine learning algorithms

utilized in this research would also be reviewed. In addition, different feature extraction

mechanisms, adopted for feature generation from fluorescence microscopy protein images,

are also discussed in detail. The description of a number of performance measures used to

assess the performance of the proposed algorithms is provided. This chapter ends with the

discussion of some benchmark fluorescence microscopy protein image datasets, which are

utilized in this thesis.

Chapter 3 begins with the introduction of SVM-SubLoc prediction system. We propose

the extraction of Haralick and Zernike moments in spatial and DWT based transform do-

mains. Other features include LBP, LTP, TAS and various hybrid models of these feature

spaces that are utilized in spatial domain only. SVM with linear, polynomial, RBF, and

sigmoid kernels is employed as classification algorithm. The performance of these individual

SVMs is evaluated against their ensemble constructed through the majority voting scheme.

Accuracy, MCC, F-Score, and Q-Statistic are employed as performance measures to assess

the quality of SVM-SubLoc prediction system.
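The majority voting scheme that fuses the kernel-specific SVMs amounts to picking the most frequent label; a minimal sketch with hypothetical class labels:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier labels for one sample; ties are broken by
    first-seen order (a common convention, assumed here)."""
    return Counter(predictions).most_common(1)[0][0]

# Four kernel-specific SVMs (hypothetical outputs) voting on one image:
votes = ["Golgi", "Mitochondria", "Golgi", "Golgi"]
winner = majority_vote(votes)
```
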

Chapter 4 contributes to the classification of fluorescence microscopy protein images

under the effect of balanced and imbalanced feature spaces where balanced feature space

is constructed through SMOTE. The performance of RF and RotF ensemble classifiers is


explored using both balanced and imbalanced feature spaces. The extracted features in-

clude Haralick, HOG, Edge, URI-LBP, Image, and Hull based features in addition to some

hybrid models constructed by forming different combinations of these individual features.

The performance measures: accuracy, F-Score, and MCC revealed that the proposed tech-

nique is promising.
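For reference, the binary forms of F-Score and MCC follow directly from the confusion-matrix counts TP, TN, FP, and FN; a sketch (the thesis evaluates per-class variants of these measures):

```python
import math

def f_score(tp, fp, fn):
    """F-Score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Example counts for one subcellular class: 50 TP, 40 TN, 10 FP, 0 FN
score = f_score(50, 10, 0)
coefficient = mcc(50, 40, 10, 0)
```

Unlike accuracy, MCC stays informative on imbalanced data, which is why it appears alongside accuracy throughout this thesis.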

Chapter 5 is devoted to elaborating the effectiveness of LTP in conjunction with SMOTE using polynomial SVM in classifying fluorescence microscopy protein images. It is observed that the application of the mRMR feature selection technique

does not improve the discrimination power of LTP patterns. The performance measures

indicated the effectiveness of the proposed technique.

Chapter 6 deals with the implementation of IEH-GT prediction system. Recognition ca-

pability of the IEH-GT prediction system is assessed using the features, extracted separately

from GLCMs and Texton images. The simulation results reveal that GLCMs extracted in-

dividually along different angles are capable of describing fluorescence microscopy protein

images through distinctive features. Similarly, Texton image constructed along a single

direction reduces the overlap in the extracted information from fluorescence microscopy

protein images. In addition to the individual orientation analysis, the combined structures

of GLCMs as well as Texton images have also been analyzed. SVMs are trained on all the

feature spaces and the final prediction is obtained through the majority voting scheme. The

proposed IEH-GT prediction system is tested on HeLa dataset. The performance predic-

tions reveal the effectiveness of the proposed prediction system.

Chapter 7 draws the conclusion and sets the future directions towards the end of this

thesis.


Chapter 2

Relevant Literature and

Techniques

This chapter discusses the literature related to Bioinformatics based algorithms developed

for protein subcellular localization. It also presents machine learning theory and various

related techniques, which have been employed in this thesis. First, theoretical details of

the employed feature extraction strategies are discussed. Then, post-processing techniques

utilized in this work are described. Then, classification algorithms used in this work are

presented. Next, different performance measures are discussed. Finally, different fluores-

cence microscopy protein image datasets are presented, which we have utilized to assess the

prediction systems developed in this thesis.

2.1 Literature Review

Boland et al. were among the pioneering researchers [20] who started developing automated systems for the prediction of protein subcellular localizations. They adopted Haralick

and Zernike moment based numerical descriptors extracted from fluorescence microscopy

protein images of CHOM dataset, which are then classified using both BPNN and the clas-

sification tree. They have also evaluated the performance of BPNN using two subsets of 10

features; each selected using SDA and MDA from the combined feature space of Haralick

and Zernike moments. Their findings reveal that SDA and classification trees are unable

to perform well on these fluorescence microscopy protein images. On the contrary, BPNN has

performed well even on small number of features, which are selected using MDA from the


combined feature space of Haralick and Zernike moments. Continuing this research work,

Murphy et al. have proposed 22 new features in addition to the Haralick and Zernike mo-

ments [2], which were reported in their previous work. These new features are the result

of morphological and geometrical analysis of the fluorescence microscopy protein images.

They have evaluated the performance of BPNN using Haralick textures, Zernike moments

and the new set of 22 features for 2D HeLa dataset. Further, from the 84D combined

feature space of Haralick, Zernike and 22 new features, 37 features subset is selected using

SDA and evaluated on BPNN to assess the discriminative power of this subset. Boland et

al. [5] have further extended the studies reported in [2, 20]. They proposed some new sets

of features called SLF. The first set was termed as SLF1 composed of morphological based

features. Another set SLF2 constructed from SLF1 and 6 additional features extracted

from the processed images of proteins and their corresponding DNA images resulting in 22

features in total. SLF3 was composed of SLF1, 49D Zernike moments and 13D Haralick

coefficients producing a total of 78D feature space. Similarly, SLF4 was generated from

SLF2, concatenated with the same set of Haralick and Zernike features, resulting in 84D

feature space. Further, SDA was applied on SLF4 producing a subset of 37 features, which

was termed as SLF5. Then BPNN was trained using SLF5 for HeLa dataset. They have

also produced subset feature space using PCA. However, performance of SDA was reported

better than that of PCA.

Exploring new venues in this field, Murphy et al. [11] have proposed a new feature set

SLF7, which is composed of SLF3 [5] and six new features based on skeleton information

of the protein location. SDA was then applied on SLF7 generating 32 selected features,

which are termed as SLF8. BPNN was trained using this selected feature set. SDA is ap-

plied on the combined feature space of SLF7 and six DNA features that resulted in SLF13

composed of 31 selected features. The new method has achieved good performance with

relatively smaller feature space though reference DNA image was not utilized during the

feature extraction.

Previously, Murphy and his fellow researchers have mostly relied on a single feature se-

lection mechanism that is SDA and then a BPNN is trained using the selected feature space.

However, Huang et al. [34] have tested the effect of eight feature reduction techniques on

these SLF sets for 2D HeLa cell lines. The utilized classifier is the well known SVM. Feature

reduction techniques include PCA, nonlinear PCA, KPCA, ICA, classification trees, Fractal


Dimensionality Reduction, SDA, and GA. SVM achieved 86% accuracy using the reduced

feature set of SLF7 with KPCA among other feature reduction techniques. On the other

hand, SVM achieved 87.4% accuracy using the selected feature space from SLF7 with SDA

among feature selection techniques. SLF12 is introduced by selecting 8 features from SLF7

using SDA. The performance accuracy of SLF12 is 80.1% on average.

Huang and Murphy [35] reported the performance of various classifiers using SLF8 and

SLF13 feature sets, which were previously proposed in [11, 21]. In their previous papers,

they mostly used BPNN as a classifier in their experiments. In contrast to their previously

proposed algorithms, they have introduced SLF15 and SLF16 as well as they have trained

a number of classifiers to identify the best classifiers. SLF15 is generated by applying SDA

on a 174D feature space, which is composed of Daubechies4, Gabor, and SLF7 features.

Similarly, SLF16 is generated using the same procedure on 180D feature space, which in-

cludes 6 DNA features in addition to the 174 features just mentioned. This system achieved

92%–93% accuracy for the 2D HeLa dataset. Chen et al. have presented a comprehensive re-

view about automated systems developed till 2006 for protein localization [36]. This review

focuses on the importance of discriminative image descriptors of fluorescence microscopy

images. The systems discussed in this review were developed for various 2D and 3D fluo-

rescence microscopy protein images. This review considers SDA the best feature selection

technique among others.

Srinivasa et al. have tested the efficiency of Haralick and morphological features in

multi-resolution subspaces upto 2 levels where PCA is applied to represent the same feature

vectors in their eigenspaces [1]. They have shown that extracting features in multiresolu-

tion subspaces obtain maximum information from fluorescence microscopy protein images

that help the classifier efficiently classify the images. Prediction results are generated with

the help of K-means algorithm and combined afterwards through weighting. They showed

that their approach is able to improve performance by 10% for HeLa dataset compared

to features extracted from GLCM in spatial domain. Hamilton et al. [24] have reported

an SVM based classifier, ASPiC: Automated Subcellular Phenotype Classification system

that achieved 94.3% and 89.8% accuracy values for LOCATE endogenous and transfected

datasets, respectively. ASPiC used area and intensity measures as well as Haralick and

Zernike moments as numerical descriptors of fluorescence microscopy protein images. Ex-

panding their contribution to the bioinformatics community, Hamilton et al. have further


proposed TAS based feature extraction strategy to efficiently quantify protein subcellular

localization images [37]. SVM classifier, using TAS, achieved 94.4% and 90.3% accuracy

values for LOCATE endogenous and transfected datasets, respectively. The performance is

further enhanced when TAS and Haralick textures were combined and further utilized by

SVM that yielded accuracy values of 98.2% and 93.2%, respectively, for LOCATE endoge-

nous and transfected datasets.

Chebira et al. have contributed to the field by developing a multiresolution based clas-

sification system for protein subcellular localization [7]. This approach first decomposes

an input fluorescence microscopy protein image into multiresolution subspaces where Har-

alick, Zernike and morphological features are computed in different combinations at each

subspace. Decisions regarding the classification of different fluorescence microscopy protein

images are generated at each subspace using ANN and consequently, these decisions are

combined by weight assignment yielding accuracy of 95.3% for 2D HeLa dataset. Lin et

al. have developed a novel algorithm AdaBoost.ERC: AdaBoost with Error-correcting Re-

peating Codes [6], which is trained using strong and weak detectors to classify fluorescence

microscopy protein images into multiple classes. Their proposed algorithm achieved 94.7%

accuracy for CHOA dataset, 93.6% for HeLa dataset, and 89.1% for Vero dataset. Chen et

al. have proposed the utilization of different field-level and cell-level features to describe

fluorescence microscopy protein images from yeast GFP fusion localization database [38].

Before the classification step, they employed SDA for the selection of most discriminative

set of features and then employed SVM classifiers trained on these selected features. In

order to make the final decision, scheme of plurality voting is utilized to combine decisions

for various SVM classifiers.

This discussion will not be complete without presenting the contribution of Dr. Loris

Nanni to the field of Bioinformatics. Nanni and Lumini have proposed a novel application

of invariant LBP for extracting numerical features from protein subcellular localization im-

ages [39]. They were further combined with Haralick and TAS feature extraction strategies,

which enhanced the performance of RSE of neural networks for HeLa, LOCATE endogenous

and LOCATE transfected datasets. They showed that RSE of neural networks has outper-

formed SVM. The success rates of RSE of neural networks were 94.2%, 98.4%, and 96.5%

for HeLa, LOCATE endogenous, and LOCATE transfected datasets, respectively. Further,

Nanni et al. have reported an optimal set of features comprising of wavelet, Haralick, TAS


and some variants of LBP feature extraction strategies, which were used to train RSE of 100

Levenberg-Marquardt neural networks [8]. The final prediction is made through the sum

rule achieving accuracy values of 95.8%, 99.5% and 97.0% for HeLa, LOCATE endogenous,

and LOCATE transfected datasets, respectively. Nanni et al. have extended their efforts

and proposed a novel approach for selecting the most discriminative invariant LBP and LTP

patterns from a set of extracted features [12]. In this connection, they utilized PCA and

NPE to produce a reduced dimensionality feature space with high variance. Next, 50 SVMs

were trained using the resultant feature spaces and the final prediction is obtained using

the sum rule yielding 93.2% and 92.9% accuracy values for HeLa and LOCATE endogenous

datasets, respectively. In another effort, Nanni et al. have constructed different local

and global descriptors to numerically describe protein subcellular localization images [40].

The local descriptors are extracted according to the method proposed in [39]. Weak descrip-

tors as proposed in [6] are also utilized in this work. RSE of Levenberg-Marquardt neural

networks and Adaboost of weak learners are individually trained using these features. A

fusion of the two ensembles is also performed using the sum rule. This system yielded 97.5%

prediction accuracy for HeLa dataset. Focusing on LBP, Nanni et al. have proposed a novel

variant of LBP for image classification called EQP [41], which outperformed the existing

LBP and its various variants. EQP achieved 92.4% accuracy value for HeLa dataset, which

is the highest value yielded by the LBP variants. Furthermore, Nanni et al. have reported

the effectiveness of non-uniform LBP and LTP patterns [42]. They showed that RSE of

SVM classifiers may outperform a standalone SVM using non-uniform LBP/LTP patterns

in classifying protein as well as other non-protein images.

Literature survey reveals that most of the researchers have either developed novel fea-

ture extraction strategies or suggested modifications to the existing techniques for numerical

description of fluorescence microscopy protein images. In addition, different ensemble and

individual classification systems have been trained on these features. The next section discusses

the feature extraction strategies, which have been utilized in this thesis.

2.2 Feature Extraction Schemes

Feature extraction strategies are used to compute numerical features from images or from

a sub-part of the whole image. These numerical features are then used in the classification


process. The computed features should be discriminative enough so that the classifier

can easily distinguish the subcellular location images from each other. Different feature

extraction techniques are utilized in the development of this thesis, which are discussed as

follows.

2.2.1 Haralick Texture Features

Haralick texture features are first proposed in [31], which are based on second order statistics

computed from a GLCM. Haralick proposed thirteen statistical features to be extracted from

a GLCM. These features include:

• Energy

• Correlation

• Inertia

• Entropy

• Inverse difference moment

• Sum average

• Sum variance

• Sum entropy

• Difference average

• Difference variance

• Difference entropy

• Information measure of correlation 1

• Information measure of correlation 2

Before going to discuss Haralick coefficients in detail, it is worth discussing GLCM matrices

first.


2.2.1.1 Gray Level Co-occurrence Matrix

Each element of a GLCM matrix represents the co-occurring frequency of two pixels, sit-

uated ∆ pixels apart from each other, one with gray level i and the other with gray level

j along certain angle θ. Here ∆ is measured in terms of pixel distance and θ is quantized

along four directions including 0◦, 45◦, 90◦ and 135◦ [43,44]. Mathematically, each entry at

index (i, j) of a GLCM is computed as given in Equation 2.1.

P(i, j) = \sum_{x=1}^{N_g} \sum_{y=1}^{N_g} \begin{cases} 1, & \text{if } I(x, y) = i \text{ and } I(x + \Delta x,\, y + \Delta y) = j \\ 0, & \text{otherwise} \end{cases} \qquad (2.1)

P (i, j) is the co-occurrence count (normalized later, via Equation 2.2, into a probability) of intensity value i with intensity value j, and Ng is

the number of gray levels in the input image. (∆x,∆y) describes the distance and orientation

between the pixels. This is used to generate a separate GLCM for each orientation from the

set of pre-defined orientations {0◦, 45◦, 90◦, 135◦}. Table 2.1 shows different sets of offsets

each representing a different direction.

Table 2.1: Offset pair with corresponding direction

Offset Direction

(0, ∆) Horizontal

(-∆, ∆) Diagonal

(-∆, 0) Vertical

(-∆, -∆) Off-diagonal

The dependency upon the directions is usually avoided by obtaining the average of

GLCMs computed along the four directions. The size of a GLCM depends on the gray

levels in an input fluorescence microscopy protein image. An image having Ng gray levels

results in a GLCM matrix of size Ng-by-Ng. A separate GLCM matrix has to be maintained

for each (∆, θ) pair resulting in large memory requirements. Therefore, the number of gray

tones in an image, from which a GLCM has to be computed, is usually reduced so that

the resulting GLCM is of smaller dimensions [45, 46]. We have to keep in view that the

performance of the features computed from GLCM is greatly dependent upon the number

of utilized gray levels.
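Equation 2.1 together with the offsets of Table 2.1 translates directly into code; a minimal sketch in which gray levels run from 1 to `levels` and the function name is illustrative:

```python
import numpy as np

def glcm(image, offset, levels):
    """Co-occurrence counts per Equation 2.1: entry (i, j) counts pixel pairs
    with gray level i at (x, y) and gray level j at (x + dx, y + dy).
    Offsets follow Table 2.1, e.g. (0, 1) is horizontal at distance 1."""
    dx, dy = offset
    rows, cols = image.shape
    G = np.zeros((levels, levels), dtype=int)
    for x in range(rows):
        for y in range(cols):
            nx, ny = x + dx, y + dy
            if 0 <= nx < rows and 0 <= ny < cols:
                G[image[x, y] - 1, image[nx, ny] - 1] += 1
    return G

# One (row, column) offset per direction at distance 1:
offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
img = np.array([[1, 1, 2],
                [2, 3, 3],
                [3, 3, 1]])
G0 = glcm(img, offsets[0], levels=3)  # GLCM along 0 degrees
```

Averaging the four matrices, e.g. `np.mean([glcm(img, o, 3) for o in offsets.values()], axis=0)`, removes the directional dependence noted above.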

Consider an image I of size M-by-N with 8 gray levels ranging from 1 to 8, as illustrated in


Figure 2.1(a). Figure 2.1(b) is the resultant GLCM matrix along θ = 0◦ and at distance ∆ =

1. Figure 2.1(c)-(e) represent the obtained GLCMs along 45◦, 90◦ and 135◦, respectively.

After constructing and fusing these GLCMs, Haralick coefficients are computed from the

[Figure 2.1: GLCM construction for Ng = 8 at θ = {0◦, 45◦, 90◦, 135◦} and ∆ = 1; panels (a) example image, (b)–(e) GLCMs at 0◦, 45◦, 90◦, and 135◦]

resultant GLCM, which acts as descriptor for a particular image [43,47]. Before extracting

these statistical measures, it is necessary to normalize the GLCM so that the cells contain

probabilities of occurrence of different outcomes. Note that these are merely approximations

because the gray levels are integer values rather than continuous values. The probability of

occurrence of a particular combination can be calculated using Equation 2.2.

P(i, j) = \frac{G_{i,j}}{\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} G_{i,j}} \qquad (2.2)

Here, Ng is the total number of gray levels, Gi,j represents the total number of times

a particular combination occurs, and the denominator is used to show the total number of

possible outcomes.
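The construction and normalization just described can be sketched as follows; this is an illustrative implementation (the function names and the degree-keyed offset table are our own choices), not the exact code used in the thesis.

```python
import numpy as np

def glcm(image, d=1, angle=0, n_gray=8):
    """Count co-occurrences of gray level pairs (levels assumed to be
    integers in 1..n_gray) at distance d along the given angle in
    degrees (0, 45, 90, or 135)."""
    offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
    dr, dc = offsets[angle]
    g = np.zeros((n_gray, n_gray), dtype=int)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # pair (gray level at (r, c), gray level at the offset)
                g[image[r, c] - 1, image[r2, c2] - 1] += 1
    return g

def glcm_probabilities(g):
    """Normalization of Equation 2.2: P(i, j) = G_ij divided by the
    total number of counted pairs."""
    return g / g.sum()
```

The four directional GLCMs of Figure 2.1 can then be fused by summing the count matrices before normalizing.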

Haralick texture features have been observed to be very effective in texture classification [45, 47, 48]. Therefore, we provide the formulae and respective descriptions for some of these features as follows.

Energy

Energy, or uniformity, is a measure that can be extracted from a GLCM. It quantifies the homogeneity of an image and is derived from the sum of squared GLCM elements, the angular second moment (ASM). Homogeneous regions exhibit little variation in gray level intensity, and hence the gray level range is smaller. This concentrates the GLCM into fewer, but larger, P(i, j) values. Energy is obtained by calculating the square root of the ASM.

Energy = \sqrt{ASM} \qquad (2.3)

Higher values (maximum 1) of Energy or ASM reveal the constant behavior of an image.

ASM is used to express the amount of smoothness or homogeneity of an image.

ASM = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (p(i, j))^2 \qquad (2.4)

where Ng is the number of gray levels in the image or alternatively, the square dimensions

of GLCM whereas p(i, j) is a particular element of GLCM.

Contrast

Contrast, also known as inertia or variance, is used to measure the amount of local intensity

variations in the image. It returns high values where image regions exhibit large variations.

Alternatively, low values are returned for the image regions where gray level differences are

smaller. For constant images, the contrast is always zero. Equation 2.5 is used to quantify

the intensity contrast of neighboring pixels.

Contrast = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)\,(i - j)^2 \qquad (2.5)

where i and j are pixel indices, Ng is dimension of a square matrix and p(i, j) is the

probability of occurrence of pixel pairs. In Equation 2.5, (i − j)² is the weighting function. Pairs on the diagonal are assigned zero weight due to their complete similarity; in the equation, this corresponds to (i − j)² = 0 where i and j are equal. A difference |i − j| = 1 indicates a little contrast between the pixels, and hence the weight assigned to these pairs is 1. Similarly, a difference of 2 indicates increased contrast and the assigned weight is 4. The weighting function grows quadratically with the increase in |i − j|.

Correlation

GLCM correlation quantifies the amount of correlation among the neighboring pixels. This

feature identifies the gray level spatial dependence within the image. Pixels belonging to

same object have usually high correlation compared to the pixels belonging to different

objects. Similarly, pixels in close proximity with each other have high correlation. On the

other hand, pixels situated farther away from each other have low correlation. Correlation

can be computed using Equation 2.6.

Correlation = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)\, \frac{(i - \mu_i)(j - \mu_j)}{\sqrt{\sigma_i^2 \sigma_j^2}} \qquad (2.6)

where i and j represent gray levels, Ng is the dimension of the square matrix, p(i, j) is the probability of occurrence of i and j in combination, µ is the GLCM mean, and σ² is the GLCM variance. The variance is zero when all the pixels in the image have similar intensities, and consequently the correlation is undefined for such an image. However, for calculation purposes, the correlation is set to 1 in this situation, indicating that the pixels have similar intensities. The GLCM mean and variance are presented next to complete the discussion of GLCM correlation.

GLCM Mean. In the GLCM mean, each gray level is weighted by the frequency of its occurrence in combination with some other gray level in the neighborhood.

\mu_i = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} i \; p(i, j) \qquad (2.7)

\mu_j = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} j \; p(i, j) \qquad (2.8)

GLCM Variance. The calculation of GLCM variance involves the mean and the distribution of cell values around the mean in the GLCM matrix. Zero variance indicates a completely uniform image.

\sigma_i^2 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)\,(i - \mu_i)^2 \qquad (2.9)

\sigma_j^2 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)\,(j - \mu_j)^2 \qquad (2.10)

The GLCM standard deviations are given by:

\sigma_i = \sqrt{\sigma_i^2} \qquad (2.11)

\sigma_j = \sqrt{\sigma_j^2} \qquad (2.12)

Entropy

Entropy captures the randomness of a GLCM. Randomness is an important property of image texture and can enhance the classification capability of a prediction system. Entropy returns low values for a regular image, whose GLCM is dominated by a few large entries. High values are produced when the GLCM has nearly equal elements, as is constructed for an irregular image. Entropy is measured using Equation 2.13.

Entropy = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \ln p(i, j) \qquad (2.13)

Since p(i, j) is a probability measure, its value ranges from 0 to 1, and therefore the value of ln p(i, j) is either zero or negative. Smaller values of p(i, j) indicate that a particular pixel combination occurs infrequently, which results in large magnitudes of ln p(i, j). The negative sign in the equation makes the resulting entropy positive. Entropy attains its minimum value of zero because both 0 × ln(0) (taken as a limit) and 1 × ln(1) result in 0. The magnitude of p(i, j) × ln p(i, j) is maximal when its derivative with respect to p(i, j) is zero.

Local Homogeneity

The homogeneity measure indicates the homogeneous behavior of an image. The weighting term 1/(1 + (i − j)²) in the homogeneity computation is the inverse of that used in the calculation of Contrast. Therefore, this measure returns high values for low-contrast regions in an image. The weighting term decreases with the squared distance from the diagonal. Homogeneity can be computed using Equation 2.14.

Homogeneity = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{p(i, j)}{1 + (i - j)^2} \qquad (2.14)

Because the weight decreases away from the diagonal, a GLCM whose mass is concentrated near the diagonal yields larger values of the homogeneity measure, which indicates more homogeneous scenes in the image.
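As a summary, the five measures above can be computed from a normalized GLCM in a few lines. The sketch below follows Equations 2.3–2.6, 2.13, and 2.14; the function name and the dictionary output are our own conventions.

```python
import numpy as np

def haralick_features(p):
    """Five GLCM measures (Equations 2.3-2.6, 2.13, 2.14) from a
    normalized GLCM p whose entries sum to 1."""
    ng = p.shape[0]
    i, j = np.indices((ng, ng)) + 1               # gray levels 1..Ng
    asm = np.sum(p ** 2)                          # Equation 2.4
    energy = np.sqrt(asm)                         # Equation 2.3
    contrast = np.sum(p * (i - j) ** 2)           # Equation 2.5
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)     # Equations 2.7, 2.8
    var_i = np.sum(p * (i - mu_i) ** 2)           # Equation 2.9
    var_j = np.sum(p * (j - mu_j) ** 2)           # Equation 2.10
    if var_i > 0 and var_j > 0:                   # Equation 2.6
        correlation = np.sum(p * (i - mu_i) * (j - mu_j)) / np.sqrt(var_i * var_j)
    else:
        correlation = 1.0                         # convention for constant images
    nz = p[p > 0]                                 # skip the 0 * ln(0) terms
    entropy = -np.sum(nz * np.log(nz))            # Equation 2.13
    homogeneity = np.sum(p / (1 + (i - j) ** 2))  # Equation 2.14
    return {"energy": energy, "contrast": contrast, "correlation": correlation,
            "entropy": entropy, "homogeneity": homogeneity}
```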


2.2.2 Texton Image based Statistical Features

Texton images are obtained using the micro-structures known as Texton masks, which are

utilized to explore information related to the horizontal, vertical, diagonal or off-diagonal

patterns present in an image. Differences in the structure of the Texton masks help reduce the overlap of information along different orientations in an image [33].

Texton elements adopted in this thesis are illustrated in Figure 2.2. All the Texton elements

are described on a 2-by-2 grid termed a Texton Mask.

[Figure 2.2: The different Texton masks T1–T6, each defined on a 2-by-2 grid of pixels p1–p4.]

For Texton image construction, the

original image is first quantized to 16 gray levels. The 2-by-2 grid slides over the entire

image from left to right and top to bottom with step size of 2, which detects patterns in

the underlying image defined by a Texton Mask. If the pixel intensities inside a particular

Texton Mask are found similar, all pixels in the image under the Texton are retained intact because they form a valid Texton of the image. On the other hand, if the pixel intensities inside a particular Texton Mask differ, they are all set to zero.

Each Texton Mask constructs a unique Texton image. The final Texton image is obtained

by combining the individual Texton images. The complete process of Texton detection is

demonstrated in Figure 2.3. In this thesis, ten statistical features are extracted from Texton

images. These include:

• Energy

• Contrast

• Homogeneity


[Figure 2.3: Procedure of Texton image generation, showing the original image, the detected Textons, the locations of the detected Texton types, and the resulting Texton image.]

• Entropy

• Difference average

• Difference variance

• Difference entropy

• Inertia

• Inverse difference

• Information measure of correlation 1
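A minimal sketch of the Texton detection step described above, for a single mask: the boolean-mask representation and the function name are our own, and the quantization to 16 gray levels is assumed to have been applied to the input already.

```python
import numpy as np

def apply_texton_mask(image, mask):
    """Detect Textons for one 2-by-2 mask. `mask` is a boolean 2-by-2
    array marking the pixels (p1..p4 in Figure 2.2) that must share
    the same intensity; the 2-by-2 grid slides with step size 2, and
    non-matching blocks are zeroed, as described in the text."""
    out = np.zeros_like(image)
    for r in range(0, image.shape[0] - 1, 2):        # step size 2
        for c in range(0, image.shape[1] - 1, 2):
            block = image[r:r + 2, c:c + 2]
            vals = block[mask]                        # pixels under the mask
            if np.all(vals == vals[0]):               # they match: valid Texton
                out[r:r + 2, c:c + 2] = block
    return out
```

The final Texton image would then be obtained by combining the outputs of the individual masks T1–T6.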

2.2.3 Zernike Moments

Zernike moments have drawn the attention of researchers for many years in the fields of pattern recognition and bioinformatics [5, 20, 38]. They extract informative features about the protein distribution from fluorescence microscopy protein images while maintaining the property of rotational invariance.

Zernike polynomials are utilized as basis functions for the computation of Zernike mo-

ments. These polynomials are defined over a unit circle where all the moments are orthogo-

nal to each other thus guaranteeing no redundancy in the extracted information from images

and similarly, these moments are mathematically independent [20, 49]. Zernike polynomial


consists of a series of terms, each representing a particular property of an image [50]. Each term in the Zernike polynomial has a coefficient whose magnitude and sign identify the prominence of that property and its direction within the image components [51]. The magnitudes

of Zernike coefficients/moments are not dependent on the rotation angle of the object and

therefore, useful in extracting information from images, which efficiently illustrate the shape

properties of objects. Zernike moments are adopted to address many research problems re-

lated to the classification of different patterns. Each Zernike moment measures the similarity

of an image to a set of Zernike polynomials [23,36]. The invariance is achieved by computing

the resemblance of the transformed image to the complex conjugates of the Zernike polynomials. The absolute value obtained in this way exhibits the invariance property [17].

Zernike moments are computationally inexpensive as compared to other texture based

features [23]. Theoretically, perfect image reconstruction is possible if a complete set of

Zernike moments of the subject image is available.

2.2.4 Wavelet Features

Wavelet features of an image are obtained by applying the DWT procedure, which captures information from both the spatial and frequency domains [7, 8, 52, 53]. DWT decomposes the input image into approximation and detail descriptions by utilizing the scaling function and the wavelet function of the applied wavelet, which correspond to a low-pass and a high-pass filter, respectively [17, 18]. Since an image is a 2D signal comprising rows and columns, the columns of the input image are first convolved with the high-pass and low-pass filters, and then the rows of the resulting images are convolved once more with the high-pass and low-pass filters. Consequently, four convolved images are produced by applying DWT at each level. These four resultant images represent four different frequency groups: three high-frequency (detail) components and one low-frequency (approximation) component. The low-frequency component carries the coarse image content, while the three high-frequency components capture variations along the horizontal, vertical, and diagonal directions. The three high-frequency components are stored as they carry valuable information, while the low-frequency component is decomposed further by re-applying the same procedure so that any remaining information can be extracted. At each decomposition level, four new images are thus obtained for each input image.
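One decomposition level can be sketched with the Haar wavelet, whose low-pass and high-pass filters reduce to simple averaging and differencing; this is an unnormalized variant chosen for brevity, and the thesis does not prescribe this particular wavelet here. Even image dimensions are assumed.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar DWT: filter along columns, then along
    rows, producing one approximation and three detail sub-images."""
    img = img.astype(float)
    # low-pass (average) and high-pass (difference) along columns
    lo = (img[0::2, :] + img[1::2, :]) / 2.0
    hi = (img[0::2, :] - img[1::2, :]) / 2.0
    # repeat along rows to obtain the four frequency groups
    approx   = (lo[:, 0::2] + lo[:, 1::2]) / 2.0  # low-frequency component
    detail_h = (lo[:, 0::2] - lo[:, 1::2]) / 2.0  # detail, horizontal direction
    detail_v = (hi[:, 0::2] + hi[:, 1::2]) / 2.0  # detail, vertical direction
    detail_d = (hi[:, 0::2] - hi[:, 1::2]) / 2.0  # detail, diagonal direction
    return approx, detail_h, detail_v, detail_d
```

At the next level, the same function is re-applied to `approx` only, while the three detail images are stored.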


2.2.5 Local Binary Patterns

The LBP operator is used to extract gray level patterns from an image [54]. The rationale behind its extensive use by the medical image processing research community is its low computational cost, its resistance to illumination changes, and its invariance to rotation [12].

LBP can either be uniform, rotation invariant or uniform rotation invariant. Uniform

LBP patterns [55, 56] are of practical importance. Considering the binary string circular,

LBP is said to be uniform if there are at most two bitwise transitions from 1 to 0 or 0 to 1 in

that string. For instance, 11100011 is a uniform pattern because only two such transitions

exist whereas 11100101 is a non-uniform pattern because of the existence of 4 transitions.

Uniformity represents important structural features such as spots, edges, and corners.

Another variant of the LBP operator is rotation invariant LBP. LBP is invariant to monotonic transformations of the gray level values; that is, the LBP code remains constant as long as the relative order of the gray values in the image is preserved. Rotating a pattern consisting merely of 0s or of 1s does not produce any variation, i.e., it remains invariant. On the contrary, rotating a binary pattern comprising values other than merely 0s or 1s produces a different LBP code [55].

LBP codes are computed by evaluating the thresholded differences between the intensity of a central pixel c and the intensities of the PN pixels that surround c, lying on the circle of radius R in the neighborhood. The final LBP code is the weighted sum of these thresholded differences. The mathematical expression for LBP is given in Equation 2.15.

LBP_{R,PN} = \sum_{p=0}^{PN-1} s(g_p - g_c)\, 2^p \qquad (2.15)

Here, PN represents the number of neighboring pixels and R indicates the radius from the central pixel. The gray level intensity of the central pixel is represented by gc, whereas gp denotes the intensity of the p-th neighboring pixel. The function s(x) returns the output value for each neighbor after the calculation of its difference from the central pixel value, as expressed in Equation 2.16; its value depends on the difference of gp and gc.

s(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.16)


LBP can be computed with different PN and R values. The combinations that we have

adopted in our works are (R = 1, PN = 8), (R = 2, PN = 16), and (R = 3, PN = 24).

[Figure 2.4: LBP code generation, showing the original 3-by-3 neighborhood, the differences from the central pixel, the thresholded codeword, and the LBP operator weights; the resulting code is 0·2^0 + 0·2^1 + 0·2^2 + 1·2^3 + 1·2^4 + 1·2^5 + 1·2^6 + 0·2^7 = 120.]

Figure 2.4 illustrates the procedure of LBP code generation. In this example, the

procedure is depicted for R = 1 and PN = 8. First, a 3-by-3 neighborhood of a pixel is selected.

Next, the central pixel value is subtracted from the value of each neighboring pixel. Then,

the threshold is applied as given in Equation 2.16 to obtain the LBP textured image.

Finally, this 3-by-3 neighborhood is converted to a single decimal value, which is the LBP

code for this neighborhood. LBP codes for the whole image are obtained in this way.

2.2.6 Local Ternary Patterns

LTP, proposed by Tan and Triggs [57], is a generalized form of the LBP coding scheme. In LTP, the differences of the central pixel c with each of the PN pixels in the predefined neighborhood are encoded with a ternary value rather than a binary one, computed according to a threshold value τ as expressed in Equation 2.17.

s(u) = \begin{cases} 1 & \text{if } u \geq c + \tau \\ -1 & \text{if } u \leq c - \tau \\ 0 & \text{otherwise} \end{cases} \qquad (2.17)

This yields a textured image that is less sensitive to noise while remaining discriminative. The computational complexity of LTP coding is higher than that of the LBP scheme. Therefore, the LTP feature extraction strategy, based on its negative and positive components,


is usually transformed into two equivalent LBP computations as illustrated in Figure 2.5.

The obtained component histograms are concatenated in order to construct the LTP feature vector.

[Figure 2.5: An LTP code is split into two LBP codes: the positive component keeps the +1 entries and the negative component keeps the −1 entries.]

Similar to LBP, LTP codes are also computed using uniform, rotation invariant,

and uniform rotation invariant mappings. The mapping concepts are the same as those discussed earlier for LBP.
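The ternary encoding of Equation 2.17 and its split into two LBP codes can be sketched as follows for a single 3-by-3 neighborhood; the neighbor ordering is our own fixed convention.

```python
import numpy as np

def ltp_components(patch, tau):
    """Encode a 3-by-3 patch with the ternary function of Equation
    2.17 and split the result into positive and negative LBP codes,
    as in Figure 2.5. Neighbors are visited counter-clockwise
    starting from the pixel to the right of the center."""
    center = patch[1, 1]
    coords = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    pos = neg = 0
    for p, (r, c) in enumerate(coords):
        u = patch[r, c]
        if u >= center + tau:        # ternary value +1
            pos += 2 ** p
        elif u <= center - tau:      # ternary value -1
            neg += 2 ** p
    return pos, neg
```

Histogramming the positive and negative codes separately and concatenating the two histograms gives the LTP feature vector described above.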

2.2.7 Threshold Adjacency Statistics

TAS [37] is a simple and efficient morphological feature extraction technique in which an image is first transformed into three binary images by applying three different thresholds. The white pixels of the three binary images correspond to original intensities in the ranges µ to 255, µ − τ to 255, and µ + τ to 255, where µ is the average intensity of the original input image and τ is a threshold provided by the user. Then, from each output binary image, a 9-bin histogram is computed; the three histograms are concatenated at the end of TAS construction, resulting in a 27D feature vector [58]. After the thresholding step, the statistics are computed over the 3-by-3 neighborhood of each white pixel, as depicted in Figure 2.6. The first statistic is the total number of white pixels that have no white neighbor. Similarly, the second statistic is the total number of white pixels that have exactly one white neighbor. The third and fourth statistics are the total numbers of white pixels with two and three white neighbors, respectively. Similar statistics are computed for white pixels with four, five, six, seven, and eight white neighbors. This procedure results


[Figure 2.6: Threshold Adjacency Statistics: the nine neighborhood configurations, with 0 through 8 white neighbors.]

in 9 threshold adjacency statistics for one thresholded image. Similar operations are performed on the other two images to calculate their respective statistics. Each of these nine statistics is divided by the total number of white pixels in the thresholded image for normalization.
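For one thresholded binary image, the nine statistics and their normalization can be sketched as follows; the function name is ours, and the thresholding into the three binary images is assumed to have been done already.

```python
import numpy as np

def tas_statistics(binary):
    """9-bin TAS for one thresholded binary image: bin k counts the
    white pixels having exactly k white 8-neighbors, and each bin is
    normalized by the total number of white pixels."""
    b = binary.astype(int)
    h, w = b.shape
    padded = np.pad(b, 1)                      # zero border outside the image
    # number of white 8-neighbors of every pixel, via 8 shifted copies
    neighbors = np.zeros((h, w), dtype=int)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0):
                neighbors += padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
    counts = np.array([np.sum((b == 1) & (neighbors == k)) for k in range(9)])
    white = b.sum()
    return counts / white if white else counts.astype(float)
```

Running this on the three binary images and concatenating the results gives the 27D TAS feature vector.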

2.2.8 Image Features

Prior to the acquisition of Image features, Otsu global thresholding [59] is applied to convert the image into its binary form. Next, the 8-connected components (objects) in the resultant image are obtained. The Image features are:

• Number of objects in the image

• Average and variance of the non-zero pixels per object

• Average and variance of the object distances from the center of fluorescence (COF)

• Ratio of the largest object to the smallest object

• Ratio of the distance of the furthest object from the COF to the distance of the closest

object to the COF

• Euler number of the image (the number of objects minus the number of holes in those objects)

2.2.9 Edge Features

Edge Features [17] are captured using the Prewitt gradient [60], which recognizes vertical as well as horizontal edges in an image. The features derived from the edges include the mean, variance, and median of the gradient. Additional features are obtained from the gradient magnitude histogram and from the direction components distributed over 8 bins. Furthermore, the count of edge pixels contributes the area feature, whereas direction homogeneity is obtained from the total number of edge pixels present in the first two bins of the direction histogram. Another edge feature is generated by computing the difference of direction histogram bins at orientations θ and θ + π; in this connection, the difference between the sum of bins 1-4 and the sum of bins 5-8 is obtained, and the differences along both orientations are summed for normalization. Likewise, the ratio of the maximum to the minimum intensity and the ratio of the maximum to the next-highest intensity are also regarded as edge features.

2.2.10 Hull Features

Binary convex hull is the basis for generating Hull features of an image. The fraction of

the convex hull area occupied by protein fluorescence, the shape of the convex hull, and the

convex hull eccentricity are all Hull related features [17].

2.2.11 Morphological Features

Morphological features are efficient descriptors to differentiate various objects found in

fluorescence microscopy based protein images [17]. Morphological features are constructed

by combining Image, Edge and Hull features discussed in sections 2.2.8, 2.2.9, and 2.2.10,

respectively.

2.2.12 Histogram of Oriented Gradients

HOG is a frequently used feature generation mechanism in the fields of image processing

and computer vision for object detection [61]. The rationale behind HOG development is

that the distribution of intensity gradients or edge orientations can describe the local form and shape of an object in an image very well. In order to compute HOG, the image is first divided into sub-images called cells, which can either be rectangular or circular in shape. For

each cell a histogram of gradient orientations is then compiled for pixels within the cell.

Eventually, these histograms are combined to construct the feature space.

For gradient detection, a 1D point discrete derivative mask is applied along the X and Y axes. The kernel utilized for edge orientation detection is [−1, 0, 1]. Afterwards, the detected gradients are binned into a histogram distributed over either 0◦-180◦ or 0◦-360◦, according to whether signed gradient directions are required.

Cells are grouped together in order to construct blocks where detected gradients can be

normalized locally so that illumination changes and contrast variations can be countered.


Blocks can also be rectangular or circular. Rectangular block is normally described using

three parameters including the number of cells in a block, the number of pixels in a cell

and the number of bins in a cell histogram. The feature space is the concatenation of

normalized cell histograms from all the blocks. Circular block can be represented by four

parameters, which include the number of angular and radial bins, the radius of the center

bin, and the expansion factor for the radius of additional radial bins. In this study, we have

utilized the strategy of computing HOG reported in [62], which divides the image into 9 rectangular cells with a 9-bin histogram for each cell, resulting in an 81D feature vector.
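The 81-D variant just described can be sketched as follows; the [−1, 0, 1] derivative masks, the 3-by-3 cell grid, and the unsigned 0◦-180◦ binning follow the text, while the function name and the assumption of image dimensions divisible by three are ours (block normalization is omitted for brevity).

```python
import numpy as np

def hog_81d(image, n_cells=3, n_bins=9):
    """81-D HOG: a 3-by-3 grid of cells, each with a 9-bin histogram
    of unsigned gradient orientations weighted by gradient magnitude."""
    img = image.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # [-1, 0, 1] mask along x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # [-1, 0, 1] mask along y
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, 0..180 degrees
    ch, cw = img.shape[0] // n_cells, img.shape[1] // n_cells
    feats = []
    for r in range(n_cells):
        for c in range(n_cells):
            cell = np.s_[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            hist, _ = np.histogram(ang[cell], bins=n_bins, range=(0, 180),
                                   weights=mag[cell])
            feats.append(hist)
    return np.concatenate(feats)                  # 9 cells x 9 bins = 81 values
```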

Sometimes, features obtained through different feature extraction strategies need to be

processed further so that the performance of the classifier is enhanced by utilizing only the

selected feature subset of the full feature space. For this purpose, we have utilized SMOTE

oversampling and mRMR feature selection as discussed in the next section.

2.3 Post-Processing

Post-processing is usually performed on the feature space to further improve its quality so that the pattern recognition system can efficiently discriminate instances belonging to different classes. In addition, the instance space may also be modified synthetically in order to enhance the performance of a recognition

system. In this thesis, we have adopted SMOTE and mRMR as oversampling and feature

selection techniques, respectively.

2.3.1 Oversampling with SMOTE

In the presence of imbalanced data, a pattern recognition system usually becomes biased towards the majority class, whereas the minority class is neglected by the classifier, and hence the system's overall performance is degraded [63, 64]. In such situations, the number of minority class samples is usually increased so that balance with the majority class, in terms of the number of samples, is established.

For this purpose, SMOTE [32] has been utilized in this work to increase the number

of samples of a minority class in a dataset. SMOTE performs its operations in the feature space rather than the data space. An original minority class sample is input to the SMOTE algorithm, which produces new synthetic instances along the line segments connecting that sample to some or all of its k nearest minority class neighbors. The synthetic

samples are not merely replicas of the original samples; rather, they are created following a distinctive mechanism. A new synthetic sample is created by first taking the difference between the original sample and one of its nearest neighbors. Next, a random number in the range [0, 1] is multiplied with the result of the previous step. The new sample is created by adding this product to the original sample.

The classifier’s learning capability is more generalized due to the introduction of syn-

thetic samples using SMOTE. The generalization capability is enhanced due to the fact

that the new synthetic samples are generated along the separating line of two particular

features. In this way, bias towards the minority class is increased and the true performance

of the prediction system is revealed.
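The sample generation step reduces to a few lines; the sketch below creates one synthetic sample given an original minority sample and one of its (already found) k nearest minority neighbors.

```python
import numpy as np

def smote_sample(x, neighbor, rng):
    """One synthetic sample per the steps above: difference, times a
    random number in [0, 1], added back to the original sample."""
    gap = rng.random()                 # random number in [0, 1]
    return x + gap * (neighbor - x)    # a point on the segment x -> neighbor
```

Repeating this for randomly chosen neighbors of each minority sample grows the minority class toward balance.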

2.3.2 Feature Selection with mRMR

Feature selection is an important consideration in designing and developing pattern recognition and

classification systems [65]. Feature selection is a process in which an optimal subset of the

entire feature space is utilized to assess the discrimination power of a system. This provides a mechanism for analyzing the feature space constructed for classification purposes. The ultimate goal of feature selection is to eliminate unnecessary features, remove

redundancy, and reduce noise while keeping the discriminative information intact. Feature

selection may additionally improve the accuracy as well as generalization capability of the

classification system.

In this connection, mRMR is employed as feature selection mechanism, which has been

frequently reported by many researchers in the fields of Bioinformatics and machine learning [66-68]. Feature selection with mRMR attempts to reduce redundancy during feature subset selection while retaining the features most relevant for classifying instances belonging to different classes. The features selected by mRMR share minimum redundancy with the other features in the feature space and have maximum relevance to the target class variables. Both quantities are calculated using the mutual information among the features and between the features and the class variables. The mutual information between two features is calculated using Equation 2.18.


MI(x, y) = \sum_{i,j \in N} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \qquad (2.18)

Here, x and y are any two features, p (xi, yj) is the estimate of joint probability density

function and p(xi), p(yj) are marginal probability density functions. Similarly, mutual

information of features with the target class variables is obtained using Equation 2.19.

MI(x, z) = \sum_{i,k \in N} p(x_i, z_k) \log \frac{p(x_i, z_k)}{p(x_i)\, p(z_k)} \qquad (2.19)

Here, x is a feature and z is a target class. Minimum redundancy is achieved using

Equation 2.20.

\min(mR) = \frac{1}{|S|^2} \sum_{x,y \in S} MI(x, y) \qquad (2.20)

Here, |S| is the number of features in the subset S. The relevance of the features to the target

class is maximized using Equation 2.21.

\max(MR) = \frac{1}{|S|} \sum_{x \in S} MI(x, z) \qquad (2.21)

The final feature space is constructed by optimizing Equations 2.20 and 2.21, simulta-

neously as given in Equation 2.22.

mRMR = \max_{S} \left[ MR - mR \right] \qquad (2.22)
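A small sketch of the scheme, assuming discrete (integer-coded) features: the first function estimates the mutual information of Equation 2.18 from empirical frequencies, and the second performs a greedy selection trading relevance (Equation 2.21) against redundancy (Equation 2.20). Both function names are ours, and the greedy incremental search is a common practical approximation rather than the exact joint optimization of Equation 2.22.

```python
import numpy as np

def mutual_information(x, y):
    """MI of two discrete variables, estimated from empirical joint
    and marginal frequencies (Equation 2.18)."""
    mi = 0.0
    for xi in np.unique(x):
        for yj in np.unique(y):
            pxy = np.mean((x == xi) & (y == yj))
            if pxy > 0:
                px, py = np.mean(x == xi), np.mean(y == yj)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr_select(features, target, k):
    """Greedily pick k columns of `features`, each time maximizing
    relevance to `target` minus mean redundancy with those selected."""
    remaining = list(range(features.shape[1]))
    selected = []
    while remaining and len(selected) < k:
        def score(j):
            rel = mutual_information(features[:, j], target)
            red = (np.mean([mutual_information(features[:, j], features[:, s])
                            for s in selected]) if selected else 0.0)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```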

2.4 Classification Algorithms

Classification systems are developed for assigning an appropriate label to an unknown object [69]. In order to achieve this goal accurately, a classification system is first passed through

the training phase. A classification system predicts the label of an unknown test sample

using the knowledge acquired from the training set, which is composed of objects whose

labels are already known. The classifier learns about the objects in the training phase

from their attribute values obtained during the feature extraction. The classifier then

predicts the label of a test sample using the attribute values of that test sample. This

whole procedure in which a training set is utilized by a classification system to predict the

unknown sample is called supervised learning. A number of machine learning algorithms for classification have been applied in the fields of pattern recognition, bioinformatics, and

computer vision [1,70,71]. The classification algorithms, which have been employed in this

thesis, are discussed as follows.


2.4.1 Support Vector Machine

SVM [72] is a popular machine learning algorithm that has found many applications in the fields of bioinformatics, pattern recognition, and classification [12, 30, 37, 73]. SVM is based

upon the structural risk minimization (SRM) principle, which enables SVM to generalize efficiently. This is because SRM minimizes an upper bound on the generalization error. SVM is inher-

ently developed for two-class classification problems; however, it is applicable to multi-class

classification problems through one-versus-one, one-versus-all and directed acyclic graph

SVM strategies.

In order to convert a non-linearly separable problem into a linearly separable one, SVM

transforms the input data space into higher dimensional space where it may find a linear

separation between the classes. In a two-class classification problem, the objective of SVM

is to formulate a separation between the two classes that leads to good gener-

alization. In this connection, SVM maximizes the distance of the separating hyperplane

from the closest data points called support vectors of both the classes. Consequently, the

generalization capability of this hyperplane should be good on unseen samples.

Consider the training pairs (xi, yi), where xi ∈ R^Nf is an Nf-dimensional feature vector and yi ∈ {−1, +1} is its label, for i = 1, 2, ..., N training samples. The separating hyperplane is constructed as given in Equation 2.23.

f (x) =

Nf∑i=1

αiyixTi .x + bias, where αi > 0 (2.23)

where the αi are the Lagrange multipliers. For linearly separable samples, SVM uses the

dot product of two points as the kernel function in the data space. For non-linearly

separable samples, however, the separating hyperplane is obtained by minimizing the objective

given in Equation 2.24.

ϕ(w, ξ) = (1/2)‖w‖² + C ∑_{i=1}^{N} ξi    (2.24)

subject to the condition yi(wᵀϕ(xi) + b) ≥ 1 − ξi, where ξi ≥ 0. Further, C > 0 is the cost

of misclassification applied to the slack term ∑_{i=1}^{N} ξi. Here ϕ(x) is the nonlinear mapping function

used by SVM to transform the Nf-dimensional input space into a higher Mf-dimensional feature space,

ϕ : R^Nf → F^Mf with Mf > Nf. The nonlinear separating hyperplane is now given


in Equation 2.25.

f(x) = ∑_{i=1}^{N} αi yi K(xi, x) + bias    (2.25)

where N is the number of support vectors (the training samples with αi > 0) and K(xi, x) represents the kernel.

Researchers have proposed different types of kernels to compute the inner product efficiently.

The kernel functions that can be utilized inside SVM include the linear, polynomial,

RBF, and sigmoid kernels. SVM with an RBF or sigmoid kernel takes C and γ as input

parameters; SVM with a polynomial kernel takes an additional parameter d, the degree of

the polynomial; SVM with a linear kernel works with the C parameter only. If x and y

represent two feature vectors, then the linear kernel can be formulated as given in Equation 2.26.

K(x, y) = x · y    (2.26)

While using the linear kernel, SVM does not map the input data into a higher dimensional

feature space; the computations are therefore performed faster. In order to classify a non-linearly

separable feature space, the polynomial kernel of SVM can be employed, which is expressed as

shown in Equation 2.27.

K(x, y) = (x · y + 1)^d    (2.27)

In Equation 2.27, d represents the degree of the polynomial kernel. The shape of the separating

hyperplane depends on the degree d, which controls its complexity in the input data space.

For d = 1, the polynomial kernel reduces to the linear kernel up to a constant offset. Another

very important kernel, the RBF kernel, is mathematically expressed as given in Equation 2.28.

K(x, y) = exp(−γ‖x − y‖²)    (2.28)

In the RBF kernel, γ describes the width of the Gaussian function. In this thesis, the

LIBSVM1 library is utilized for conducting the experiments. The different parameters

of SVM are set through the grid search approach using internal cross validation on the

training data.
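The three kernel formulas translate directly into code. A minimal pure-Python sketch (the example vectors and the values of γ and d are illustrative choices, not the grid-searched values used in the thesis):

```python
import math

def linear_kernel(x, y):
    # Equation 2.26: K(x, y) = x . y
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    # Equation 2.27: K(x, y) = (x . y + 1)^d
    return (linear_kernel(x, y) + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    # Equation 2.28: K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))           # 11.0
print(polynomial_kernel(x, y, d=2))  # 144.0
print(rbf_kernel(x, x))              # 1.0 (a point is at distance 0 from itself)
```

In practice LIBSVM evaluates these kernels internally; the sketch only makes the formulas concrete.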

2.4.2 Random Forest Ensemble

RF ensemble is a classification algorithm [74] in which decision trees are utilized as base

classifiers. The trees are constructed by randomly drawing instances from the original

1 Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

instance space with replacement, i.e., bootstrap sampling. A large number of trees is

generated, and each tree is grown on a randomized subset of predictors; hence the name

Random Forest. At each node, a randomly selected subset of all available predictors is

considered in order to locate the optimal split. The prediction of each tree regarding

the label of a given input counts as a single vote, and the notion of majority voting is

employed to combine the predictions of all the individual trees in the forest.

RF ensemble is robust against noisy data due to its intrinsic tree structure. In addition,

it is able to manage a large number of attributes efficiently [75]. The functioning of RF

ensemble involves two input parameters: the number of variables randomly selected at each

node to determine the split, and the number of trees in the forest. The number of randomly

selected variables at each node should be smaller than the number of attributes in the

dataset. RF ensemble may perform poorly on imbalanced data, in which case the classifier

may become biased towards the majority class [76].
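A toy sketch of the two mechanisms described above, bootstrap sampling of instances and majority voting over the trees' predictions (a full implementation would also randomize the predictor subset at every split; the labels are made up for illustration):

```python
import random
from collections import Counter

def bootstrap_sample(instances, rng):
    # Draw len(instances) samples with replacement (bagging).
    return [rng.choice(instances) for _ in instances]

def forest_predict(tree_predictions):
    # Each tree casts one vote; the majority label wins.
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(5)), rng)
print(len(sample))                                 # 5: same size as the original set
print(forest_predict(["Golgi", "DNA", "Golgi"]))   # Golgi
```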

2.4.3 Rotation Forest Ensemble

The RotF classifier [77, 78] builds an ensemble of decision trees, which operate as base

learners for the ensemble. Although the decision trees are independent of each other,

each tree utilizes the whole dataset in a rotated feature space. The rotation of the data

around the feature axes is achieved through PCA. In the course of building a RotF ensemble,

H subsets are randomly generated from the attribute set, and each subset is subsequently

transformed by PCA. The principal components obtained in this manner are preserved in

their initial sequence, producing a linearly transformed feature space. Decision trees are

sensitive to rotations of the feature axes, which introduces diversity in the ensemble.

Another factor affecting diversity is the use of H splits of the attribute set, which leads

to different feature spaces for the classification stage. The following parameters may be

adopted for building the RotF ensemble.

• Number of features in each subset of H subsets

• Number of classifiers to build the ensemble

• Feature extraction technique

• Base learner
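The first step of building such an ensemble, randomly partitioning the attribute set into H disjoint subsets (each of which would then be rotated by PCA), can be sketched as follows; the ceiling-division handling of non-divisible sizes is an illustrative choice, and the result may contain fewer than H subsets when H is close to the attribute count:

```python
import random

def split_attributes(num_attributes, H, rng):
    """Randomly partition attribute indices into H disjoint subsets."""
    indices = list(range(num_attributes))
    rng.shuffle(indices)
    size = -(-num_attributes // H)  # ceiling division
    return [indices[i:i + size] for i in range(0, num_attributes, size)]

subsets = split_attributes(26, H=4, rng=random.Random(42))
print(len(subsets))                            # 4
print(sorted(i for s in subsets for i in s))   # all 26 indices, each exactly once
```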


2.5 Performance Parameters

Performance evaluation is an important phase of prediction system development. A newly

developed model is assessed in this phase so that it becomes evident whether the problem

has been addressed properly. In machine learning, quite a few performance indicators

have been identified and recognized as proficient in assessing the performance of an

algorithm. However, the suitability of a particular parameter in a given problem domain

depends upon the data distribution and the classification task in that domain. The

performance parameters are extracted from a confusion matrix, which tabulates the actual

labels against the predicted labels for each class. Table 2.2 illustrates a confusion

matrix for a binary class problem; confusion matrices for multi-class classification

problems are constructed analogously.

Table 2.2: A confusion matrix

                               Predicted Label
                         Positives           Negatives
Actual   Positives       True Positives      False Negatives
Label    Negatives       False Positives     True Negatives

TP indicates the number of positive instances predicted as positives, whereas FN is the

number of positive instances predicted as negatives. Similarly, FP is the number of

negative instances predicted as positives, and TN is the number of negative instances

predicted as negatives. FP is also called the Type-I error, or error of the first kind,

whereas FN is known as the Type-II error, or error of the second kind. Some of the

performance parameters extensively utilized by the machine learning and pattern recognition

community are discussed as follows.

2.5.1 Accuracy

The error rate, or accuracy, is utilized by researchers to measure the efficiency of a

prediction system. Accuracy accounts for both the true positives and the true negatives returned


by a learning system. It is computed using Equation 2.29.

Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100    (2.29)

Although accuracy is considered a proper parameter to assess the performance of a classifier,

it sometimes fails to measure the true performance of the system. For example, in the

case of imbalanced data, the classifier usually becomes biased towards the majority class,

and consequently a high accuracy does not reveal the actual performance of the entire system.
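The imbalance caveat is easy to demonstrate numerically; a degenerate classifier that predicts every instance as positive still scores high accuracy on a 95:5 split (the counts are illustrative):

```python
def accuracy(TP, FP, TN, FN):
    # Equation 2.29, reported as a percentage.
    return (TP + TN) / (TP + FP + TN + FN) * 100

# Degenerate classifier: predicts every instance as positive.
# 95 positives, 5 negatives -> all negatives become false positives.
print(accuracy(TP=95, FP=5, TN=0, FN=0))  # 95.0, despite never detecting a negative
```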

2.5.2 Sensitivity/Specificity

Sensitivity and specificity indicate the true positive and true negative rates, respectively,

of the prediction system. Equations 2.30 and 2.31 are used to measure the sensitivity and

specificity of a system.

Sensitivity = TP / (TP + FN) × 100    (2.30)

Specificity = TN / (FP + TN) × 100    (2.31)

Sensitivity and specificity both play a key role in the computation of accuracy. High

values of sensitivity and specificity lead to high accuracy. In the case of high sensitivity

and low specificity, accuracy is biased towards sensitivity; conversely, with low sensitivity

and high specificity, accuracy is dominated by specificity. Similarly, low values of both

sensitivity and specificity result in low accuracy.

2.5.3 Matthews Correlation Coefficient

MCC [79, 80] is a performance parameter used to measure the quality of a prediction system.

It intrinsically transforms a confusion matrix into a scalar value ranging from −1 to

+1, where −1 means that the classifier consistently produces incorrect predictions and +1

means that the classifier always generates correct predictions. The value 0 indicates that

the classifier performs no better than random prediction. Equation 2.32 is used to calculate MCC.

MCC = (TP × TN − FP × FN) / √([TP + FP][TP + FN][TN + FP][TN + FN])    (2.32)

MCC is a very useful measure that inherently resolves challenges faced by accuracy,

particularly in situations where balanced data is not available to the classifier. For

example, consider a classifier that correctly predicts all the instances of a positive

majority class and incorrectly predicts all the instances of a negative minority class. In such cases, the


performance of the classifier is not promising, yet accuracy will show reasonably good results.

However, the true performance of the classifier is exposed by MCC, which will report

the performance as 0.
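A sketch of that example; dropping the percentage scaling keeps the output in [−1, +1], and returning 0 for a degenerate confusion matrix (zero denominator) is a common convention rather than something stated in the thesis:

```python
import math

def mcc(TP, FP, TN, FN):
    # Equation 2.32, without percentage scaling, so the range is [-1, +1].
    num = TP * TN - FP * FN
    den = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return num / den if den else 0.0  # convention: MCC = 0 for a degenerate matrix

# All 90 majority-class positives correct, all 10 minority negatives wrong:
print(mcc(TP=90, FP=10, TN=0, FN=0))  # 0.0, even though accuracy is 90%
```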

2.5.4 F-measure

The F-measure, or F-Score, summarizes the accuracy of the executed test [81, 82]. It is

employed in tasks where the classification system is required to correctly predict instances

of a particular class without predicting too many instances of other classes. The F-measure

is computed as the harmonic mean of the precision p and recall r of the test and therefore

depends greatly on these two measures. Precision is the ratio of true positives to the

number of predicted positives, whereas recall is the ratio of true positives to the number

of actual positives. Precision and recall are also known as the positive predictive value

and sensitivity, respectively. The F-measure returns its output in the range [0, 1], where

output closer to 0 indicates poor performance and output closer to 1 reveals good performance.

Precision = TP / (TP + FP)    (2.33)

Recall = TP / (TP + FN)    (2.34)

F-Score = 2 × (Recall × Precision) / (Recall + Precision)    (2.35)

In order to obtain a reasonable F-measure, a trade-off between precision and recall is

usually sought, since improving precision typically comes at the cost of recall and vice versa.
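Equations 2.33 through 2.35 combine into a few lines; the counts below are illustrative:

```python
def f_score(TP, FP, FN):
    precision = TP / (TP + FP)   # Equation 2.33
    recall = TP / (TP + FN)      # Equation 2.34
    # Equation 2.35: harmonic mean of precision and recall.
    return 2 * recall * precision / (recall + precision)

print(f_score(TP=8, FP=2, FN=2))  # 0.8 (precision = recall = 0.8)
```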

2.5.5 Q-Statistic

The Q-Statistic is a performance parameter used to measure the diversity among the member

classifiers in an ensemble [25, 83]. The Q-Statistic outputs a similarity value between

two base classifiers, and its returned value is therefore subtracted from 1 to obtain a

diversity value. The 2-by-2 contingency table in Table 2.3 illustrates the correlation

between two base classifiers BCi and BCj in an ensemble. In Table 2.3, N11 and N00 are,

respectively, the counts of instances on which both classifiers are correct and on which

both are incorrect. The number of instances where the 1st classifier is correct and the

2nd is incorrect is indicated by N10. Conversely, the correct predictions


Table 2.3: Correlation between classifier BCi and classifier BCj

                    BCj hit (1)    BCj miss (0)
BCi hit (1)         N11            N10
BCi miss (0)        N01            N00

of the 2nd classifier combined with the incorrect of the 1st are shown by N01. The Q-Statistic

between any two base classifiers is calculated using Equation 2.36.

Qi,j = (N11 × N00 − N10 × N01) / (N11 × N00 + N10 × N01)    (2.36)

or, for the sake of discussion, we may write:

Qi,j = (hits × hits − misses × misses) / (hits × hits + misses × misses)    (2.37)

When the two base classifiers agree on every instance (N10 = N01 = 0), the numerator and

the denominator are equal, leading to Q-Statistic = 1. Conversely, when the two classifiers

disagree on every instance (N11 = N00 = 0), Q-Statistic = −1. Thus the output of the

Q-Statistic ranges from −1 to +1. Q-Statistic = 1 means that perfect positive correlation

exists between the two classifiers; similarly, Q-Statistic = −1 indicates that perfect

negative correlation exists between them. The Q-Statistic returns 0 for statistically

independent classifiers. In order to compute the Q-Statistic for an ensemble consisting of

multiple classifiers, the average over all pairs of the B base classifiers is computed using

Equation 2.38.

Equation 2.38.

Qavg = (2 / (B(B − 1))) ∑_{i=1}^{B−1} ∑_{k=i+1}^{B} Qi,k    (2.38)
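Given the hit/miss outcome of each base classifier on a common test set (1 for a hit, 0 for a miss), the pairwise Q-Statistic and its ensemble average can be sketched as follows; the outcome-list input format is an assumption for illustration:

```python
import itertools

def q_statistic(outcomes_i, outcomes_j):
    # Equation 2.36 from the 2x2 contingency counts N11, N00, N10, N01.
    n11 = sum(a and b for a, b in zip(outcomes_i, outcomes_j))
    n00 = sum((not a) and (not b) for a, b in zip(outcomes_i, outcomes_j))
    n10 = sum(a and not b for a, b in zip(outcomes_i, outcomes_j))
    n01 = sum((not a) and b for a, b in zip(outcomes_i, outcomes_j))
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

def q_average(all_outcomes):
    # Equation 2.38: mean Q over all pairs of base classifiers.
    pairs = list(itertools.combinations(all_outcomes, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)

identical = [1, 1, 0, 1, 0]
print(q_statistic(identical, identical))        # 1.0: perfectly correlated
print(q_statistic(identical, [0, 0, 1, 0, 1]))  # -1.0: always disagree
```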

2.5.6 Multiclass ROC

AUC, the area under the ROC curve, is a valuable performance metric for measuring how well

a classifier discriminates between two different categories. We have adopted the method

presented in [84] to compute


the AUC for multi-class classification problems. In this case, the AUC is computed for all

pairwise combinations of the 10 and 8 classes of the respective datasets. The total AUC is

the mean of all pairwise AUCs.
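One standard pairwise formulation of AUC is the Mann-Whitney statistic: the fraction of (positive, negative) score pairs that the classifier ranks correctly, with ties counting half. The sketch below follows that generic construction and is not necessarily the exact procedure of [84]:

```python
def pairwise_auc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos_scores) * len(neg_scores))

def total_auc(pairwise_aucs):
    # Total AUC: mean over all pairwise class combinations.
    return sum(pairwise_aucs) / len(pairwise_aucs)

print(pairwise_auc([0.9, 0.8], [0.1, 0.2]))  # 1.0: perfect separation
print(total_auc([1.0, 0.5]))                 # 0.75
```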

2.6 Datasets

We have used various protein subcellular localization image datasets to assess the

performance of our proposed models. These include the HeLa dataset from the Murphy Lab [5],

two LOCATE datasets from the LOCATE subcellular localization database [8], the Vero dataset

from the AIIA lab in Taiwan [6], and two CHO datasets: one from the Murphy Lab and the other

from the AIIA lab [6]. These are multi-class datasets covering the major subcellular

structures of eukaryotic cells.

Table 2.4 shows the breakup of samples across the classes of the HeLa2 dataset. The 2D

HeLa dataset includes 862 fluorescence microscopy protein images distributed over 10

distinct categories.

Table 2.4: HeLa dataset classes and image breakup in each class

S. No Class Name Images/class

1 Actin Filaments 98

2 Nucleus 87

3 Endosomes 91

4 ER 86

5 Golgi Giantin 87

6 Golgi GPP130 85

7 Lysosome 84

8 Microtubules 91

9 Mitochondria 73

10 Nucleolus 80

Total 862

The CHOM2 dataset contains 327 fluorescence microscopy protein images distributed over

five categories. Similarly, the CHOA3 dataset has 668 fluorescence microscopy protein

images categorized into eight classes. The Vero3 image dataset possesses 1472 images

representing eight subcellular structures of monkey cells. Tables 2.5, 2.6 and 2.7 show the

details of the CHOM, CHOA and Vero datasets, respectively.

3 Available at http://aiia.iis.sinica.edu.tw/

Table 2.5: CHOM dataset classes and image breakup in each class

S. No Class Name Images/class

1 Golgi 77

2 DNA 69

3 Lysosome 97

4 Nucleolus 33

5 Cytoskeleton 51

Total 327

Table 2.6: CHOA dataset classes and image breakup in each class

S. No Class Name Images/class

1 Actin 161

2 ER 143

3 Golgi 156

4 Microtubule 67

5 Mitochondria 46

6 Nucleolus 15

7 Nucleus 28

8 Peroxisome 52

Total 668

Table 2.7: Vero dataset classes and image breakup in each class

S. No Class Name Images/class

1 Actin 65

2 ER 372

3 Golgi 145

4 Microtubule 101

5 Mitochondria 179

6 Nucleolus 110

7 Nucleus 444

8 Peroxisome 56

Total 1472

Similarly, the two LOCATE4 datasets, Endogenous and Transfected, contain 502 and 553

fluorescence microscopy protein images, respectively, as shown in Table 2.8. The LOCATE

Endogenous protein images from fluorescence microscopy are grouped into 10 classes,

whereas the LOCATE Transfected protein images from the

4 Available at http://locate.imb.uq.edu.au/

same imaging technique are categorized in 11 classes.

Table 2.8: LOCATE Endogenous and Transfected datasets: images breakup

S. No Class Name Endogenous Transfected

1 Actin-cytoskeleton 50 50

2 Cytoplasm 0 50

3 Endosomes 49 50

4 Endoplasmic Reticulum 50 48

5 Golgi 46 50

6 Lysosome 50 50

7 Microtubule 50 50

8 Mitochondria 50 55

9 Nucleus 50 50

10 Peroxisomes 50 50

11 Plasma Membrane 57 50

Total 502 553

Summary

In this chapter, the existing literature devoted to the development of automated systems for

protein subcellular localization is reviewed. Further, the different feature extraction

strategies utilized in the development of this thesis are discussed. From machine learning

theory, different classification algorithms, post-processing techniques and various

performance indicators relevant to this thesis are also discussed. The chapter concludes

with a presentation of the benchmark datasets used to assess the performance of the

prediction systems developed during the course of this thesis. In the next chapter, the

SVM-SubLoc prediction system is presented, which is developed for protein subcellular

localization images from the HeLa and LOCATE datasets.


Chapter 3

Protein Subcellular Localization

using Spatial and Transform

Domain Features

Protein subcellular localization images of living cells are usually acquired through

fluorescence microscopy, and automated systems are required for the analysis and

classification of these images. Automated prediction systems do exist in the literature;

however, low prediction accuracy and large feature spaces make them less attractive to the

research community.

In this chapter, different spatial and transform domain features are assessed for their

efficacy in discriminating among fluorescence microscopy protein images. The aim of this

work is to build a more efficient and reliable prediction system for protein subcellular

localization that utilizes low dimensional feature spaces with enhanced prediction

accuracy. Both individual and hybrid feature spaces are constructed in the spatial as well

as transform domains. An overview of the prediction system is given in the following section.

3.1 The SVM-SubLoc Prediction System

The proposed SVM-SubLoc prediction system is shown in Figure 3.1. The main phases of

SVM-SubLoc are the feature extraction and classification phases. In the feature extraction

phase, different individual features are extracted in the spatial and transform domains,


whereas in the classification phase different SVMs are trained using these features. The

final prediction is obtained through a majority voting scheme.

The performance of the SVM-SubLoc prediction system is evaluated using three benchmark

fluorescence microscopy protein image datasets: 2D HeLa, LOCATE Endogenous and LOCATE

Transfected.

[Figure 3.1 depicts the SVM-SubLoc pipeline: an input image passes through an MR filter; spatial domain features (Haralick textures, Zernike features, LBP, LTP, TAS), transform domain features (Haralick textures, Zernike features) and hybrid features (ZHar, HarLBP, HarLTP, HarTAS) feed the four classifiers lin-SVM, poly-SVM, RBF-SVM and sig-SVM, whose ensemble decision gives the predicted result.]

Figure 3.1: The SVM-SubLoc prediction system

3.1.1 Feature Extraction Phase

In the feature extraction phase of SVM-SubLoc, different individual features are extracted

from fluorescence microscopy protein images in spatial as well as transform domains.

The spatial domain features are extracted by forwarding the image directly to the feature

extraction stage. The individual features include Haralick coefficients, Zernike moments,

LBP, LTP, and TAS. For transform domain feature extraction, an image is first passed through

the multi-resolution filter, and this transformed image is then forwarded to the feature

extraction stage. Zernike moments and Haralick textures are the only features extracted in

both the spatial and transform domains. Domain


transformation is achieved through DWT.

3.1.1.1 GLCM Construction and Haralick Coefficients

Prior to GLCM construction, each input image is quantized to eight gray levels, resulting

in GLCMs of 8-by-8 dimensions. For each input image, four GLCMs are constructed along the

horizontal, vertical, diagonal, and off-diagonal directions for pixel distance ∆ = 1.

Haralick features are then extracted from each GLCM separately, and the mean values for

each feature are computed. This is in contrast to the method of first computing a combined

GLCM and then computing the features. The construction of the GLCMs and the extraction of

the feature spaces from them are shown in Figure 3.2.

[Figure 3.2 depicts GLCMH, GLCMV, GLCMD and GLCMoD each feeding a feature extraction stage that computes the 13 Haralick features (energy, contrast, correlation, inertia, inverse difference, entropy, sum average, sum variance, sum entropy, difference variance, difference entropy, and the two information measures of correlation), followed by averaging of the resulting feature values.]

Figure 3.2: Feature extraction from GLCMH, GLCMV, GLCMD and GLCMoD

From each GLCM, 13 features are extracted, resulting in 52 features in total. The feature

space dimension is reduced by averaging the features from GLCMH with those from GLCMV;

similarly, the features generated from GLCMD are averaged with those from GLCMoD. The

final feature vector is composed of these mean values, which results in 26


features for each fluorescence microscopy protein image.
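A sketch of co-occurrence counting for one of the four directions; the 3-by-4 image is a toy input already quantized to eight gray levels, and only the horizontal GLCM at distance ∆ = 1 is shown:

```python
def glcm_horizontal(image, levels=8, delta=1):
    """Count co-occurrences of gray levels (i, j) at horizontal distance delta."""
    glcm = [[0] * levels for _ in range(levels)]
    for row in image:
        for x in range(len(row) - delta):
            glcm[row[x]][row[x + delta]] += 1
    return glcm

# 3x4 toy image quantized to 8 gray levels.
img = [[0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
g = glcm_horizontal(img)
print(g[0][0], g[2][2], g[1][1])  # 1 3 1
```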

3.1.1.2 Discrete Wavelet Transformation

The input image is decomposed up to four levels using the DWT with the Haar filter. For

each input image of size M-by-N, the DWT produces four sub-images of size M/2-by-N/2 at

each decomposition level, as shown in Figure 3.3. The sub-images are the approximation

detail (LL), the vertical detail (HL), the horizontal detail (LH), and the diagonal detail

(HH). The sub-band images are then forwarded to the feature extraction phase, where

features are extracted at each level and from each sub-band individually.

[Figure 3.3 depicts an M-by-N protein image transformed by the DWT into the four sub-bands LL, LH, HL and HH.]

Figure 3.3: DWT Transformation

As discussed earlier, each input image is decomposed up to four decomposition levels;

however, only two levels are shown in Figure 3.4. During the computation of Haralick

coefficients from the GLCMs, a 26D feature space is generated for each fluorescence

microscopy protein image at decomposition level 0, which is in fact the image in the

spatial domain. At each decomposition level, four decomposed images are generated for each

input image, and the feature vector for a decomposition level is obtained by concatenating

the feature vectors generated from its sub-images. For instance, at decomposition level 1,

a 26 × 4 = 104D feature space is generated, where 26 is the dimension of the feature vector

for a single image and 4 is the number of component images at the 1st decomposition level.

Similarly, for the second decomposition level, a 26 × 16 = 416D feature space is computed,

where 16 is the number of component images at that level. At decomposition level 3, a

26 × 64 = 1664D feature vector is constructed, and at decomposition level 4 the size of the

feature space is 26 × 256 = 6656D.
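One Haar analysis step can be sketched with plain pairwise averaging and differencing (unnormalized, for illustration); applying it first to every row and then to every column of an M-by-N image produces the LL, HL, LH and HH sub-images of size M/2-by-N/2:

```python
def haar_1d(signal):
    """One unnormalized Haar step: pairwise averages then pairwise differences."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages + details

row = [4, 2, 6, 8]
print(haar_1d(row))  # [3.0, 7.0, 1.0, -1.0]
```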


[Figure 3.4 depicts the original protein image at decomposition level 0 split into the approximation (LL1), vertical (HL1), horizontal (LH1) and diagonal (HH1) details at level 1, with LL1 further split into LL2, LH2, HL2 and HH2 at level 2.]

Figure 3.4: Fluorescence microscopy protein image of size M-by-N is split into four sub-images at each decomposition level. Decomposition level 0 indicates the original fluorescence microscopy protein image.

Likewise, Zernike moments are computed up to order 12, which results in a 49D feature

space for a fluorescence microscopy protein image at decomposition level 0. The feature

spaces obtained for the decomposed images are concatenated to produce a single feature

vector that represents all the sub-images at a particular decomposition level. For

decomposition level 1, the size of the feature vector is 49 × 4 = 196D, where 49 is the

dimension of the feature vector for one image and 4 is the number of sub-images generated

at decomposition level 1. Similarly, the feature vector at decomposition level 2 is of

49 × 16 = 784D, where 49 is


the feature space dimension for a single fluorescence microscopy protein image and 16 is

the number of component images at the 2nd decomposition level. The Zernike moments

generated at decomposition level 3 are of 49 × 64 = 3136D, and the feature space dimension

constructed for decomposition level 4 is 49 × 256 = 12544D.

3.1.1.3 The Hybrid Model

Hybrid feature spaces are developed in the hope that they may enhance the discrimination

capability of the individual feature spaces. Hybrid features are constructed by concatenating

individual features. These include ZHar (Zernike + Haralick), HarLBP (Haralick

+ LBP), HarLTP (Haralick + LTP) and HarTAS (Haralick + TAS). In forming the hybrid models,

only Haralick textures obtained in the spatial domain are combined with the other spatial

domain features; however, the ZHar hybrid model utilizes both spatial and transform domain

Haralick and Zernike features.

3.1.2 Classification Phase

The classification phase of SVM-SubLoc is highlighted in Figure 3.5. In this phase, four

SVMs are first trained using the individual and hybrid features in both the spatial and

transform domains. The four SVMs are lin-SVM, poly-SVM of degree 2, RBF-SVM, and sig-SVM.

The parameters for the SVM kernels are obtained using the grid search approach.

[Figure 3.5 depicts the features feeding lin-SVM, poly-SVM, RBF-SVM and sig-SVM, whose ensemble yields the final prediction.]

Figure 3.5: The classification phase of SVM-SubLoc

In order to further enhance the prediction accuracies, the output predictions of these

44

Page 67: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

individual SVMs have been integrated as given in Equation 3.1.

MajEns = lin-SVM ⋆ poly-SVM ⋆ RBF-SVM ⋆ sig-SVM    (3.1)

where ⋆ is the fusion operator and MajEns is the output ensemble classifier. The complete

procedure of MajEns is as follows. Let the predictions of the individual classifiers be

B1, B2, ..., BN, each taking a value in {CL1, CL2, ..., CLM}, where B1, B2, ..., BN indicate

the individual base classifiers and CL1, CL2, ..., CLM represent the labels of the protein

classes. Then the vote count for each class is computed as given in Equation 3.2.

Yj = ∑_{i=1}^{N} δ(Bi, CLj) for j = 1, 2, ..., M, with N = 4    (3.2)

where δ(Bi, CLj) = 1 if classifier Bi predicts class CLj, and 0 otherwise.

The final prediction is yielded by combining the individual predictions through the

majority voting scheme as given in Equation 3.3.

YMajEns = max{Y1, Y2, ..., YM}    (3.3)

where YMajEns is the output prediction of the ensemble.

In situations where the votes are tied for a certain instance, preference is given to the

prediction of the highest performing classifier.
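The voting rule of Equations 3.2 and 3.3, including the tie-break in favor of the strongest classifier, can be sketched as follows; the class labels and the strength ordering are illustrative:

```python
from collections import Counter

def majority_vote(predictions, classifiers_by_strength):
    """predictions: dict mapping classifier name -> predicted class label."""
    votes = Counter(predictions.values())   # Equation 3.2: vote count per class
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    if len(tied) == 1:
        return tied[0]                      # Equation 3.3: class with maximum votes
    # Tie: defer to the highest-performing classifier whose vote is among the tied labels.
    for clf in classifiers_by_strength:
        if predictions[clf] in tied:
            return predictions[clf]

preds = {"lin-SVM": "Golgi", "poly-SVM": "ER", "RBF-SVM": "ER", "sig-SVM": "Golgi"}
print(majority_vote(preds, ["RBF-SVM", "poly-SVM", "lin-SVM", "sig-SVM"]))  # ER
```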

3.2 Results and Discussion

The classifier performance is assessed through a 5-fold cross validation protocol on the

fluorescence microscopy protein image datasets. The data is stratified before being

forwarded to the classifier. Accuracy, MCC, F-Score, and Q-Statistic are adopted as

performance indicators for SVM-SubLoc because any single measure is sometimes inadequate

to capture the performance of a prediction system.
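Stratification simply distributes each class's samples evenly across the five folds before cross validation; a minimal round-robin sketch (the label list is a toy input):

```python
def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, class by class."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # round-robin keeps class ratios per fold
    return folds

labels = ["A"] * 10 + ["B"] * 5
folds = stratified_folds(labels, k=5)
print([len(f) for f in folds])  # [3, 3, 3, 3, 3] -> 2 A's and 1 B per fold
```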

3.2.1 Performance Analysis of SVM-SubLoc using Individual Features

for HeLa dataset

We analyze and discuss the performance of SVM-SubLoc for 2D HeLa dataset using individ-

ual features. Table 3.1 reports the performance predictions of SVM-SubLoc using Haralick


features with/without employing DWT. Among individual classifiers, RBF-SVM has out-

performed other SVMs at all decomposition levels. Similarly, among the decomposition

levels, 2nd level is observed to be the best decomposition level for protein discrimination

capability, where RBF-SVM yielded 84.1% accuracy. At the 0th and 1st decomposition levels,

some of the information is lost, whereas at the 3rd and 4th levels redundancy occurs in the extracted

information; the 2nd level preserves most of the valuable information,

which is why SVM-SubLoc shows superior performance at this level.

Table 3.1: Performance of SVM-SubLoc using Haralick textures with/without DWT

            Individual Acc (%)            Ensemble
L    D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
0    26     75.6   75.9   76.4   69.6     87.7   0.59   0.60     0.29
1    104    80.5   80.9   81.7   80.3     92.5   0.71   0.72     0.18
2    416    81.9   83.2   84.1   79.8     93.2   0.73   0.74     0.20
3    1664   77.3   78.3   80.6   77.9     90.2   0.65   0.66     0.30
4    6656   79.5   80.6   80.7   79.5     92.6   0.72   0.73     0.23

The ensemble accuracy of SVM-SubLoc is 93.2%, which is 9.1% higher than that of the individual RBF-SVM and shows

significant improvement in the performance of SVM-SubLoc through the introduction of

majority voting based ensemble. MCC and F-Score values of 0.73 and 0.74, respectively,

are also highest at 2nd decomposition level, which indicate the effectiveness of SVM-SubLoc

at this level. The highest diversity is, however, achieved at 1st decomposition level where

the Q-statistic value is 0.18.
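The Q-Statistic reported throughout this chapter is, presumably, the pairwise Yule's Q diversity measure computed from the correct/incorrect agreement of two classifiers and averaged over all classifier pairs; values near zero (or negative) indicate a diverse ensemble. A minimal sketch (assuming the denominator is nonzero, i.e. the classifiers are neither identical nor trivially perfect):

```python
import numpy as np

def q_statistic(correct_i, correct_j):
    """Yule's Q for two classifiers, given boolean 'prediction was correct' vectors."""
    ci, cj = np.asarray(correct_i, bool), np.asarray(correct_j, bool)
    n11 = np.sum(ci & cj)    # both correct
    n00 = np.sum(~ci & ~cj)  # both wrong
    n10 = np.sum(ci & ~cj)   # only the first correct
    n01 = np.sum(~ci & cj)   # only the second correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def mean_pairwise_q(correct_matrix):
    """Average Q over all classifier pairs; rows = classifiers, cols = test samples."""
    m = len(correct_matrix)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    return np.mean([q_statistic(correct_matrix[i], correct_matrix[j])
                    for i, j in pairs])
```

Two classifiers that err on exactly the same samples give Q = 1, while classifiers that err on disjoint samples give Q = -1, which is why lower Q-Statistic values are read as higher diversity here.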

The performance accuracies of SVM-SubLoc using Zernike moments are highlighted

in Table 3.2 with/without employing DWT. RBF-SVM has obtained the highest success

rate of 67.8% among the individual classifiers at the 4th decomposition level. The accuracy of

RBF-SVM is observed to increase at each successive decomposition level, which shows

the effectiveness of RBF-SVM coupled with the DWT-based Zernike moments. The

accuracy at the 0th level is below that of random prediction, but in the transform domain

SVM-SubLoc gradually attains higher accuracies employing RBF-SVM. The ensemble

accuracy of 58.5% at the 2nd decomposition level coincides with the highest diversity among the

ensemble members; however, this accuracy is 9.3% less than the highest accuracy achieved by

RBF-SVM among the individual classifiers at the 4th decomposition level.

Table 3.2: Performance of SVM-SubLoc using Zernike moments with/without DWT

            Individual Acc (%)            Ensemble
L    D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
0    49     28.3   43.8   46.5   15.5     46.7   -0.02  0.16     0.02
1    196    35.9   50.3   52.2   24.4     57.8   0.12   0.23     0.09
2    784    39.0   53.4   56.2   15.1     58.5   0.15   0.25     0.01
3    3136   50.2   52.0   60.5   11.3     54.5   0.10   0.22     0.24
4    12544  44.0   48.0   67.8   11.0     56.7   0.14   0.24     0.11

Similarly, at the 4th level, the ensemble accuracy is 56.7%, which is 11.1% less than that of the individual RBF-SVM. Here,

the diversity among the classifiers is lower than at the other levels except the 3rd, as is

evident from the Q-Statistic values. Despite reasonable accuracy and Q-Statistic values,

the prediction quality represented by MCC and the test accuracy indicated by F-Score

are less promising.
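The transform-domain feature spaces above grow by a factor of four per level (26, 104, 416, ... for Haralick; 49, 196, 784, ... for Zernike), which is consistent with a wavelet-packet-style scheme in which every sub-band is decomposed again and the descriptor is computed on each resulting sub-band. A minimal sketch with a single-level Haar transform in NumPy (the `descriptor` function is a hypothetical stand-in for the Haralick or Zernike extractor):

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar transform: approximation + 3 detail sub-bands."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    return [(a + b + c + d) / 4,   # LL (approximation)
            (a - b + c - d) / 4,   # LH
            (a + b - c - d) / 4,   # HL
            (a - b - c + d) / 4]   # HH

def packet_subbands(img, level):
    """Recursively decompose every sub-band, yielding 4**level sub-bands."""
    bands = [np.asarray(img, float)]
    for _ in range(level):
        bands = [sb for band in bands for sb in haar_dwt2(band)]
    return bands

def multires_features(img, level, descriptor):
    """Concatenate the descriptor computed on each sub-band at the given level."""
    return np.concatenate([descriptor(b) for b in packet_subbands(img, level)])

img = np.random.default_rng(0).random((64, 64))
stats = lambda b: np.array([b.mean(), b.std()])   # toy 2-feature descriptor
print(multires_features(img, 2, stats).shape)     # (32,) = 2 features x 16 sub-bands
```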

Performance of SVM-SubLoc using TAS features extracted in spatial domain is demon-

strated in Table 3.3. In the first column, τ indicates the user specified threshold. We have

obtained TAS features with different τ values. The highest accuracy of 81.0% has been

achieved by RBF-SVM among individual classifiers for τ = 140.

Table 3.3: Performance of SVM-SubLoc using TAS features

            Individual Acc (%)            Ensemble
τ    D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
100  27     75.9   78.1   77.0   69.0     89.0   0.62   0.64     0.31
130  27     76.9   79.6   78.1   69.9     88.5   0.61   0.63     0.39
140  27     77.6   80.3   81.0   70.7     91.6   0.69   0.70     0.31
180  27     78.3   80.7   78.6   69.9     90.8   0.67   0.68     0.32
220  27     77.3   79.2   79.0   69.1     90.4   0.66   0.67     0.28
240  27     77.9   80.1   79.8   70.0     90.4   0.66   0.67     0.34

The ensemble accuracy of 91.6% is superior to that of RBF-SVM by 10.6%. MCC and F-Score

values of 0.69 and 0.70, respectively, are also good. The significance of the ensemble is

evident from the enhanced accuracy.

Table 3.4 presents the outcome accuracies of SVM-SubLoc using LBP patterns. R, PN ,

and m, respectively, represent radius, neighborhood, and mapping, which are utilized to


extract these patterns. In the following discussion, LBP with a certain configuration will

be referred to as m-LBP(R, PN).

Table 3.4: Performance of SVM-SubLoc using LBP with different mappings

                      Individual Acc (%)            Ensemble
R   PN   m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    U     59     83.7   84.3   85.3   71.5     92.8   0.72   0.73     0.32
1   8    RI    36     83.5   82.4   82.7   72.9     92.8   0.72   0.73     0.34
1   8    URI   10     81.5   82.5   82.5   74.9     92.5   0.71   0.72     0.33
2   16   U     243    85.7   86.8   87.5   78.6     95.0   0.79   0.80     0.32
2   16   RI    4116   73.2   72.6   73.0   70.5     85.4   0.54   0.56     0.37
2   16   URI   18     87.1   86.8   87.8   78.3     95.1   0.79   0.80     0.37
3   24   U     555    85.0   86.3   87.4   78.8     95.5   0.81   0.82     0.25
3   24   URI   26     85.7   86.4   88.0   80.5     94.8   0.78   0.79     0.34

Among individual classifiers, RBF-SVM has yielded the highest accuracy value of 88.0% employing URI-LBP(3, 24) patterns. However, the en-

semble accuracy of 95.5% is achieved by SVM-SubLoc using U-LBP(3, 24) patterns. The

ensemble accuracy is 7.5% higher than the highest accuracy of individual RBF-SVM. The

highest diversity among the individual classifiers is observed for U-LBP(3, 24) patterns as

is evident from Q-Statistic value of 0.25. SVM-SubLoc has yielded the highest values of

MCC (= 0.81) and F-Score (= 0.82) using the same features.
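The dimensionalities in Table 3.4 follow directly from the LBP mappings: for PN = 8, the U mapping has 59 bins (58 uniform patterns plus one shared bin for all non-uniform codes), RI has 36 rotation-invariant bins, and URI has PN + 2 = 10 bins. A small sketch that derives these counts by enumeration:

```python
def is_uniform(pattern, P):
    """A pattern is uniform if its circular bit string has at most 2 transitions."""
    bits = [(pattern >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P)) <= 2

def n_bins(P, mapping):
    """Histogram length for the U, RI, and URI LBP mappings with P neighbors."""
    mask = (1 << P) - 1
    patterns = range(1 << P)
    if mapping == "U":    # each uniform pattern + one shared non-uniform bin
        return sum(is_uniform(p, P) for p in patterns) + 1
    if mapping == "RI":   # one bin per rotation-equivalence class
        return len({min(((p >> r) | (p << (P - r))) & mask for r in range(P))
                    for p in patterns})
    if mapping == "URI":  # uniform patterns grouped by popcount + non-uniform bin
        return P + 2

print(n_bins(8, "U"), n_bins(8, "RI"), n_bins(8, "URI"))  # 59 36 10
```

These counts reproduce the D column of Table 3.4 exactly: 59, 36, and 10 for PN = 8, and 243, 4116, and 18 for PN = 16.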

Prediction accuracies of SVM-SubLoc using LTP patterns are presented in Table 3.5.

Here, τ represents the threshold for constructing the LTP feature space. In further discussion,

LTP with a certain configuration will be referred to as m-LTP(R, PN, τ). Among individual

classifiers, poly-SVM has achieved the highest accuracy value of 94.4% using URI-LTP(3,

24, 80) patterns. The ensemble SVM-SubLoc, on the other hand, has achieved 99.0% accu-

racy, which is 4.6% higher than that of the individual poly-SVM using the same features.

The obtained accuracy revealed that LTP patterns have strong discriminative power com-

pared to other feature extraction strategies. The highest Q-Statistic value of 0.15 indicates

that individual classifiers have diversified predictions using URI-LTP(3, 24, 80) patterns.

From Table 3.5, it is evident that LTP patterns computed on smaller radii require a smaller

value of τ for efficient classification of fluorescence microscopy protein images, whereas for

larger radii τ needs to be greater. The MCC and F-Score values of 0.95 each show that

the performance of SVM-SubLoc has been enhanced by the utilization of LTP patterns.
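The LTP coding splits each thresholded neighborhood into two binary half-patterns, which explains why the LTP dimensionalities in Table 3.5 are exactly twice the corresponding LBP ones (e.g. 20 = 2 × 10 for URI with PN = 8). A minimal sketch of the ternary coding for one neighborhood:

```python
import numpy as np

def ltp_halves(center, neighbors, tau):
    """Ternary code: +1 if neighbor > center + tau, -1 if < center - tau, else 0.

    Returns the 'upper' and 'lower' binary half-patterns; each half is then
    histogrammed like an ordinary LBP and the two histograms are concatenated.
    """
    diff = np.asarray(neighbors, float) - center
    ternary = np.where(diff > tau, 1, np.where(diff < -tau, -1, 0))
    upper = (ternary == 1).astype(int)   # positive half-pattern
    lower = (ternary == -1).astype(int)  # negative half-pattern
    return upper, lower

up, lo = ltp_halves(100, [150, 105, 40, 100, 139, 60, 141, 100], tau=40)
print(up.tolist())  # [1, 0, 0, 0, 0, 0, 1, 0]
print(lo.tolist())  # [0, 0, 1, 0, 0, 0, 0, 0]
```

The zero band around the center value is what makes LTP less sensitive to noise than LBP, at the cost of the user-specified threshold τ.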


Table 3.5: Performance of SVM-SubLoc using LTP with different mappings

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    40   U     118    90.4   90.4   90.8   78.1     97.7   0.89   0.90     0.35
1   8    40   RI    72     89.3   87.5   89.3   80.5     97.3   0.87   0.88     0.28
1   8    40   URI   20     89.2   89.7   90.1   77.2     97.3   0.87   0.88     0.28
2   16   80   U     486    92.8   93.0   92.9   84.1     98.2   0.91   0.92     0.21
2   16   80   URI   36     91.8   92.9   93.5   85.1     98.6   0.93   0.93     0.26
3   24   80   URI   52     93.8   94.4   93.8   86.3     99.0   0.95   0.95     0.15

3.2.2 Performance Analysis of SVM-SubLoc using Hybrid Features for

2D HeLa dataset

Table 3.6 demonstrates the performance predictions of SVM-SubLoc employing ZHar hy-

brid features in spatial and transform domains. RBF-SVM has shown superior performance

achieving accuracy value of 80.8% among individual classifiers at 2nd decomposition level.

SVM-SubLoc has yielded 93.2% accuracy at the same level showing the ensemble sig-

nificance. The highest diversity is observed at the 1st decomposition level as indicated by

Q-Statistic value of 0.12, but it could achieve only 90.4% accuracy. It is evident that the

best decomposition level is the 2nd level among other levels for discriminating the fluores-

cence microscopy protein images. MCC and F-Score values of 0.73 and 0.74, respectively,

are also highest at 2nd level. The hybrid model of Haralick textures and Zernike moments

does not add any significant discrimination capability to these individual feature spaces. At

the level of individual classifiers, SVM-SubLoc could not achieve good performance accu-

racies using the hybrid model compared to the Haralick textures alone as shown in Table

3.1. However, the ensemble accuracy of 93.2% is achieved using both the hybrid model and

Haralick textures as shown in Tables 3.6 and 3.1, respectively. The prediction performance

of SVM-SubLoc using HarTAS has been highlighted in Table 3.7. Among individual base

learners, poly-SVM has outperformed other learners and achieved 87.9% accuracy. The

ensemble SVM-SubLoc has yielded 96.2% accuracy, which is 8.3% higher than that of poly-

SVM. The ensemble accuracy of this hybrid space is also 3% and 4.6% higher than that of

the Haralick and TAS features, respectively, as shown in Tables 3.1 and 3.3. The diversity

among the individual classifiers is sufficient to produce the improved ensemble accuracy as

is evident from the Q-Statistic value of 0.32. Similarly, the MCC value of 0.83 and the F-Score value of 0.84 show that both the prediction quality and the test accuracy are good.

Table 3.6: Performance of SVM-SubLoc using ZHar with/without DWT

            Individual Acc (%)            Ensemble
L    D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
0    75     69.7   69.7   71.8   57.5     84.4   0.52   0.54     0.19
1    300    74.8   76.2   76.9   74.4     90.4   0.66   0.67     0.12
2    1200   80.0   80.5   80.8   77.2     93.2   0.73   0.74     0.24
3    4800   77.3   77.9   78.5   77.4     89.2   0.62   0.64     0.35
4    19200  77.8   80.0   79.3   76.2     90.8   0.66   0.68     0.33

Table 3.7: Performance of SVM-SubLoc using HarTAS

            Individual Acc (%)            Ensemble
τ    D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
100  53     86.8   87.8   86.3   81.4     96.1   0.83   0.84     0.28
130  53     86.3   87.7   86.0   80.7     94.8   0.78   0.79     0.38
140  53     87.0   87.9   86.0   83.0     96.2   0.83   0.84     0.32
180  53     87.3   87.9   86.0   79.6     95.7   0.81   0.82     0.33
220  53     86.3   87.7   86.4   80.6     96.0   0.82   0.83     0.24
240  53     86.5   87.9   87.1   80.9     95.9   0.82   0.83     0.23

Among individual classifiers, RBF-SVM has achieved the highest accuracy of 94.4% us-

ing two separate hybrid models of Haralick textures with U-LBP(2, 16) and with U-LBP(3,

24) patterns as shown in Table 3.8. The yielded accuracy is 18% higher than the accuracy

achieved by SVM-SubLoc using individual Haralick textures in spatial domain as given in

Table 3.1 and similarly, 6.4% higher than the accuracy achieved using LBP patterns as

shown in Table 3.4 using RBF-SVM. Likewise, the ensemble accuracy of 99.7% is achieved

using Haralick with U-LBP(3, 24) patterns. This is 5.3% higher than that of RBF-SVM.

Haralick textures perform better when combined with U-LBP patterns extracted on large

circle and discriminate well among the subcellular structures from fluorescence microscopy

images. MCC (= 0.98) and F-Score (= 0.98) reveal that the quality of prediction and test

accuracy are excellent for these features. However, highly diversified ensembles can also be generated

using Haralick textures with U-LBP patterns on radii 1 and 2, as shown by the Q-Statistic values

of 0.14.

Table 3.8: Performance of SVM-SubLoc using HarLBP

                      Individual Acc (%)            Ensemble
R   PN   m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    U     85     92.1   92.4   92.5   88.5     99.3   0.96   0.96     0.14
1   8    RI    62     89.2   89.9   89.0   84.9     97.0   0.86   0.87     0.31
1   8    URI   36     88.3   90.2   89.6   87.9     97.5   0.88   0.89     0.28
2   16   U     269    93.2   93.9   94.4   91.7     99.0   0.95   0.95     0.14
2   16   RI    4142   81.6   81.6   78.5   81.0     92.5   0.71   0.72     0.28
2   16   URI   44     92.5   93.2   93.7   90.8     98.8   0.94   0.94     0.18
3   24   U     581    93.2   93.6   94.4   90.9     99.7   0.98   0.98     0.20
3   24   URI   52     91.8   92.6   92.8   89.7     99.3   0.96   0.96     0.34

The success rates of SVM-SubLoc using HarLTP are demonstrated in Table 3.9. The highest accuracy of 94.7% is achieved by poly-SVM among the individual classifiers using the

hybrid of Haralick textures with U-LTP(2, 16, 80) patterns. The ensemble accuracy of

99.4% is yielded using the hybrid of Haralick textures with URI-LTP(3, 24, 80) patterns.

The ensemble accuracy is 4.7% higher than that of the poly-SVM.

Table 3.9: Performance of SVM-SubLoc using HarLTP

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    40   U     144    92.4   92.5   92.2   85.2     98.1   0.91   0.91     0.24
1   8    40   RI    98     90.9   90.6   90.3   87.8     98.3   0.92   0.92     0.30
1   8    40   URI   46     88.8   91.5   91.8   88.2     98.2   0.91   0.92     0.19
2   16   80   U     512    94.3   94.7   94.4   92.3     99.0   0.95   0.95     0.07
2   16   80   URI   62     93.2   93.1   93.6   90.7     99.0   0.95   0.95     0.19
3   24   80   URI   78     93.9   93.9   93.1   90.8     99.4   0.97   0.97     0.05

The highest values of MCC and F-Score are also obtained for the same hybrid model. The Q-Statistic value of

0.05 indicates that the diversity among the classifiers is highest.

3.2.3 Performance Analysis of SVM-SubLoc for LOCATE datasets

In this section, we demonstrate the prediction performance of SVM-SubLoc for LOCATE

datasets using the high performing features for 2D HeLa dataset. In this connection, we

obtained experimental results using LTP, HarLBP, and HarLTP for LOCATE datasets.

Table 3.10 presents the prediction performance of SVM-SubLoc using LTP patterns for


LOCATE Endogenous dataset. Among the individual learners, lin-SVM has achieved the

highest success rates using URI-LTP(2, 16, 0) patterns. The yielded accuracy is 90.2%.

The ensemble SVM-SubLoc has obtained 95.6% accuracy for the Endogenous dataset.

Table 3.10: Performance of SVM-SubLoc using LTP for Endogenous dataset

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    0    RI    72     85.2   86.2   80.6   59.7     93.2   0.72   0.73     0.17
1   8    0    URI   20     75.6   80.6   78.0   63.1     91.0   0.65   0.67     0.15
2   16   0    URI   36     90.2   86.6   85.2   71.9     95.6   0.80   0.81     0.21
3   24   0    URI   52     87.8   87.0   85.2   71.7     92.6   0.69   0.71     0.38

The success rates of SVM-SubLoc using LTP patterns for the LOCATE Transfected dataset are

reported in Table 3.11. SVM-SubLoc has yielded 93.6% ensemble accuracy using URI-

LTP(2, 16, 0) patterns; however, the diversity is highest using RI-LTP(1, 8, 0) patterns,

as is evident from the Q-Statistic value of 0.06. Among the individual classifiers, poly-SVM has

outperformed the other classifiers and achieved 85.5% prediction accuracy.

Table 3.11: Performance of SVM-SubLoc using LTP for Transfected dataset

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    0    RI    72     78.8   84.2   75.2   54.4     89.6   0.60   0.61     0.06
1   8    0    URI   20     72.6   79.0   72.8   51.5     85.1   0.48   0.50     0.19
2   16   0    URI   36     81.1   85.5   77.0   57.1     93.6   0.71   0.73     0.07
3   24   0    URI   52     84.4   84.4   81.5   60.9     93.3   0.70   0.71     0.11

The performance of SVM-SubLoc for both LOCATE datasets using HarLBP features is reported in Tables

3.12 and 3.13. The proposed SVM-SubLoc has obtained 99.8% accuracy for Endogenous

dataset using HarLBP where LBP(1, 8) are generated with URI mapping. The MCC and

F-Score values are also promising for the same features. For LOCATE Transfected dataset,

SVM-SubLoc has achieved 98.5% accuracy using HarLBP where LBP(2, 16) are computed

on URI mapping.

Table 3.12: Performance of SVM-SubLoc using HarLBP for Endogenous dataset

                      Individual Acc (%)            Ensemble
R   PN   m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    U     85     94.6   93.8   93.8   88.6     99.6   0.97   0.98     0.24
1   8    URI   36     93.0   92.4   90.8   86.8     99.8   0.98   0.99     0.10
2   16   U     269    96.2   95.8   92.8   89.4     99.4   0.96   0.97     0.14
2   16   URI   44     95.0   94.6   94.2   90.6     99.6   0.97   0.98     0.06
3   24   U     581    94.4   94.4   92.8   88.4     99.6   0.97   0.98     0.25
3   24   URI   52     94.2   94.6   94.8   91.4     98.8   0.93   0.94     0.46

Among the individual learners, poly-SVM obtained the highest accuracy value of 93.6% using HarLBP, where LBP(2, 16) patterns are extracted through URI mapping. The

accuracy yielded by SVM-SubLoc is 4.9% higher than that of poly-SVM that shows the

significance of the developed ensemble. The highest diversity is also reported for these

specific features where Q-Statistic value of 0.06 is calculated. Table 3.14 highlights the

Table 3.13: Performance of SVM-SubLoc using HarLBP for Transfected dataset

lin poly RBF sig Ensemble

R PN m D Acc Acc MCC F−Score Q-Statistic

1 8 U 85 93.3 92.5 88.7 80.4 97.4 0.87 0.87 0.15

1 8 URI 36 89.8 90.0 87.1 81.0 96.5 0.83 0.84 0.21

2 16 U 269 93.3 93.1 90.5 80.2 98.0 0.89 0.90 0.11

2 16 URI 44 93.3 93.6 90.0 83.3 98.5 0.91 0.92 0.06

3 24 U 581 92.0 91.8 90.4 76.8 97.6 0.87 0.88 0.18

3 24 URI 52 92.5 92.5 91.5 84.2 97.8 0.88 0.89 0.25

prediction accuracies of SVM-SubLoc employing HarLTP features for LOCATE Endogenous

dataset. An ensemble accuracy of 99.6% is achieved by SVM-SubLoc, which is 5.4% higher

than the accuracy achieved by individual RBF-SVM using the same features. Similarly,

performance predictions of SVM-SubLoc are presented in Table 3.15 for Transfected dataset.

The highest ensemble accuracy yielded by SVM-SubLoc is 98.7% using HarLTP. In contrast,

the individual poly-SVM has achieved 93.4% accuracy, which is 5.3% less than that of the

ensemble. The improved performance for LOCATE datasets is due to the inclusion of

hybrid models as well as the ensemble decisions. Hybrid models exhibit the properties of

their individual constituent feature spaces, which combine to enhance the prediction capability of SVM-SubLoc.

Table 3.14: Performance of SVM-SubLoc using HarLTP for Endogenous dataset

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    0    RI    98     91.0   89.6   89.2   84.4     98.0   0.90   0.90     0.28
1   8    0    URI   46     91.2   91.2   90.2   85.6     99.0   0.94   0.95     0.17
2   16   0    URI   62     93.6   93.8   94.2   91.4     99.6   0.97   0.98     0.15
3   24   0    URI   78     93.2   94.2   94.0   91.2     98.6   0.92   0.93     0.12

Table 3.15: Performance of SVM-SubLoc using HarLTP for Transfected dataset

                           Individual Acc (%)            Ensemble
R   PN   τ    m     D      lin    poly   RBF    sig      Acc    MCC    F-Score  Q-Statistic
1   8    0    RI    98     90.4   91.8   86.0   77.2     98.1   0.90   0.90     0.12
1   8    0    URI   46     90.2   90.5   86.7   78.8     96.5   0.83   0.84     0.18
2   16   0    URI   62     91.8   92.4   90.5   82.2     98.5   0.91   0.92     0.09
3   24   0    URI   78     91.8   93.4   90.0   84.0     98.7   0.92   0.93     0.20

3.3 Comparative Analysis

The proposed SVM-SubLoc is also compared with existing state-of-the-

art approaches, as demonstrated in Table 3.16. Chebira et al. have developed a

prediction system that achieved accuracy of 95.4% for the HeLa cell lines [7]. The prediction

system proposed by Nanni and Lumini has achieved 94.2%, 98.4%, and 96.5% accuracies,

respectively, for HeLa, LOCATE Endogenous and LOCATE Transfected datasets [39]. An-

other prediction system, proposed by Nanni et al. [40], has achieved 97.5% accuracy for

HeLa dataset. They have developed another system, which has achieved accuracy values

of 95.8%, 99.5%, and 97.0% for HeLa, LOCATE Endogenous and LOCATE Transfected

datasets, respectively [8]. In contrast, SVM-SubLoc has obtained 99.7% accuracy for

HeLa dataset, 99.8% accuracy for LOCATE Endogenous dataset, and 98.7% accuracy for

LOCATE Transfected dataset, outperforming all the existing approaches.

Compared with the existing approaches, SVM-SubLoc is an accurate and efficient pre-

diction system for recognizing protein subcellular structures from the HeLa and the two LOCATE


datasets. Spatial as well as transform domain descriptors have been efficiently exploited to

quantitatively describe these subcellular localizations. The improved prediction rates using

multi-resolution sub-spaces have proved the importance of feature extraction in transform

domain. It is worth noting that describing an image at different resolutions may reveal dif-

ferent information in apparently similar structures. Similarly, the hybrid models constructed

Table 3.16: Performance comparison with other published work

Method HeLa LOCATE

Endogenous Transfected

Hamilton et al. [37] - 98.2(47) 93.2(47)

Chebira et al. [7] 95.4(78) - -

Nanni and Lumini [39] 94.2(107) 98.4(107) 96.5(81)

Nanni et al. [40] 97.5(322) - -

Nanni et al. [8] 95.8(305) 99.5(305) 97.0(305)

SVM-SubLoc using HarLBP 99.7(581) 99.8(36) 98.5(44)

SVM-SubLoc using HarLTP 99.4(78) 99.6(62) 98.7(78)

SVM-SubLoc using LTPs 99.0(52) 95.6(36) 93.6(36)

from different individual features spaces have performed well for fluorescence microscopy

protein images. For example, SVM-SubLoc has yielded the accuracies of 99.7% with 581D

feature space and 99.4% with 78D feature space using HarLBP and HarLTP, respectively,

in the spatial domain. Similarly, SVM-SubLoc has achieved accuracy 99.0% using LTP

patterns having 52D feature space. The discrimination capability of these features for sub-

cellular structures is quite promising with the advantage of reduced feature spaces. Besides,

hybrid feature spaces of HarLBP and HarLTP have performed well due to the combined

discriminative power of both Haralick and LBP as well as Haralick and LTP.

In all the above cases, the enhancement achieved by SVM-SubLoc is due to the two-level

ensemble: on one hand, we constructed the feature-level ensemble, and on the other,

we combined the classifiers' decisions. The feature-level ensemble has boosted the dis-

criminative capability of the individual features, while combining the decisions through

majority voting has further improved the prediction accuracies.


Summary

In this chapter, we have developed and analyzed the performance of the SVM-SubLoc prediction

system. This approach outperformed existing well-known approaches in terms of

increased accuracy and reduced dimensionality of the feature space. In the next chapter, we

will discuss the performance of RF-SubLoc prediction system in the presence of imbalanced

data.


Chapter 4

Protein Subcellular Localization in

the Presence of Imbalanced data

The advancement in computational resources has encouraged numerous researchers to de-

velop efficient prediction models for recognizing the various subcellular structures present in

eukaryotic cells from fluorescence microscopy images. These prediction systems employ dif-

ferent classification algorithms exploiting the discriminative power of a number of feature

extraction strategies. However, imbalanced data remains one of the major problems

in developing efficient and reliable prediction systems, even in the presence of highly dis-

criminative features.

The main objective of this work is to develop a prediction system capable of efficiently

handling imbalanced data. For this purpose, we employ the SMOTE oversampling technique,

which produces synthetic samples in the feature space for the minority classes. The resulting

balanced feature vectors are delivered to the RF-SubLoc prediction system, which classifies them into the various

categories. Balancing the data reduces the classifiers' bias towards the majority

class, and hence the overall performance improves.
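SMOTE generates each synthetic sample by interpolating between a minority-class instance and one of its k nearest minority-class neighbors. A minimal NumPy sketch of this idea (real experiments would typically use an existing implementation, e.g. the one in Weka or imbalanced-learn):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # k nearest minority neighbors
        j = rng.choice(neighbors)
        lam = rng.random()                       # random point on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = np.random.default_rng(1).random((20, 26))  # e.g. 20 Haralick vectors
new = smote(minority, n_new=30)
print(new.shape)  # (30, 26)
```

Because every synthetic point is a convex combination of two minority samples, the new samples stay inside the region already occupied by the minority class rather than being arbitrary noise.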

4.1 The Proposed RF-SubLoc Prediction System

The proposed RF-SubLoc prediction system is illustrated in Figure 4.1. Different phases

of the prediction system include the feature extraction phase, data balancing phase, and

classification phase. In the feature extraction phase, different features are extracted from

each input fluorescence microscopy protein image. In the data balancing phase, the instances


[Figure: input data → feature extraction (individual features such as Haralick, LBP, Image, Edge, and Hull, plus hybrid features) → data balancing (SMOTE) → classification (Random Forest, with Rotation Forest for comparison) → output prediction]

Figure 4.1: The RF-SubLoc prediction system

are oversampled in the feature space, which are then forwarded to the classification phase

for classification into different classes. The performance of the proposed prediction system

is assessed using CHOM, CHOA, and Vero datasets.

4.1.1 Feature Extraction Phase

Features from fluorescence microscopy protein images are extracted in spatial domain only.

The feature extraction techniques include Haralick coefficients, Edge, Hull, Image, HOG,

and LBP. Of the LBP patterns discussed in Chapter 2, only the URI mapping has been

considered, with parameter configurations of (R = 1, PN = 8), (R = 2, PN = 16), and

(R = 3, PN = 24). In this chapter, these will be referred to simply as LBP8, LBP16, and

LBP24, where the number following LBP indicates the number of neighboring

pixels. Other feature extraction techniques are adopted in accordance with the discussion

presented in section 2.2.

4.1.1.1 The Hybrid Model

In addition to the individual features, different hybrid models have also been constructed

by combining these individual feature spaces, which include HarEdge (Haralick + Edge),

HarHull (Haralick + Hull), HarImg (Haralick + Image), HullEdge (Hull + Edge), HullImg

58

Page 81: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

(Hull + Image), ImgEdge (Image + Edge), Morph (Morphological), HarHOG (Haralick +

HOG), MorphHOG (Morphological + HOG), HarMorphHOG (Haralick + Morphological +

HOG), HarLBP8Morph (Haralick + LBP8 + Morphological), HarLBP16Morph (Haralick

+ LBP16 + Morphological), and HarLBP24Morph (Haralick + LBP24 + Morphological).
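Each hybrid model is a plain concatenation of its constituent feature vectors, so the hybrid dimensionality is the sum of the individual ones (e.g. Haralick with 26 features plus LBP8 with 10 gives a 36-D space). A one-line sketch:

```python
import numpy as np

def hybrid(*feature_blocks):
    """Concatenate per-image feature matrices (rows = images) column-wise."""
    return np.hstack(feature_blocks)

n_images = 5
haralick = np.zeros((n_images, 26))   # placeholder feature matrices
lbp8 = np.zeros((n_images, 10))
print(hybrid(haralick, lbp8).shape)   # (5, 36)
```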

4.1.2 Data Balancing Phase

After the feature extraction phase, the SMOTE data balancing technique is employed in the

feature space to synthetically increase the minority-class samples. In Figure 4.1, synthetic

samples are shown in black among the other samples.

Tables 4.1, 4.2, and 4.3, respectively, show the oversampled CHOM, CHOA, and Vero datasets.

The balancing criterion is set against the majority class in each dataset. For example, in

the CHOM dataset, the majority class is the Lysosome class, which has the highest number of

instances. To balance a dataset, a sufficient number of synthetic samples is added to

the minority classes so that balance is established between the minority and majority classes.

However, oversampling is skipped for those classes in which the imbalance

is lower. In the CHOM dataset, shown in Table 2.5, the Lysosome class is the majority class, containing 97 samples. Accordingly, only the Nucleolus and Cytoskeleton protein classes are oversampled: their imbalance is high, which would otherwise bias the classifier against them, and adding synthetic SMOTE samples counters this bias. The oversampled CHOM dataset is shown in Table 4.1.

Table 4.1: Oversampled instances for CHOM dataset

S. No   Class Name      Images/class
1       Golgi           77
2       DNA             69
3       Lysosome        97
4       Nucleolus       97
5       Cytoskeleton    97
        Total           437

Similarly, for the CHOA dataset given in Table 2.6, Actin is the majority class,

which possesses 161 samples out of 668 samples in the entire dataset. Therefore, SMOTE

is applied to the Microtubule, Mitochondria, Nucleolus, Nucleus, and Peroxisome classes, because they are the minority classes of the dataset; their oversampled counts are shown in Table 4.2.

Table 4.2: Oversampled instances for CHOA dataset

S. No   Class Name      Oversampled instances
1       Actin           161
2       ER              143
3       Golgi           156
4       Microtubule     161
5       Mitochondria    161
6       Nucleolus       161
7       Nucleus         161
8       Peroxisome      161
        Total           1265

The original Vero dataset is shown in Table 2.7, where Nucleus is observed to be the majority class and the remaining classes are all minority classes. They were therefore oversampled to generate a balanced feature space for the classification phase; the oversampled dataset is given in Table 4.3.

Table 4.3: Oversampled instances for Vero dataset

S. No   Class Name      Images/class
1       Actin           444
2       ER              444
3       Golgi           444
4       Microtubule     444
5       Mitochondria    444
6       Nucleolus       444
7       Nucleus         444
8       Peroxisome      444
        Total           3552

4.1.3 Classification Phase

The classification phase of RF-SubLoc is depicted in Figure 4.2. In this phase, features

with/without SMOTE are classified using the RF-SubLoc prediction system. Additionally,

the RotF ensemble classifier is utilized for comparison with RF-SubLoc. The two

important parameters of RF-SubLoc, the number of trees in the forest and the number of randomly

selected variables, are chosen to be 500 and 10, respectively. In the case of the RotF ensemble,


[Figure: oversampled features → random subsets → decision trees → individual predictions → majority voting → final prediction]

Figure 4.2: The Classification phase of RF-SubLoc

the number of iterations and the feature subset size are the two parameters; the iterations range from

2 to 4 depending on the features, and the dimensions of the feature subsets also vary across features.
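The classification phase of Figure 4.2, where random subsets feed base learners whose votes are combined, can be sketched in miniature as follows. This is an illustrative toy in which nearest-centroid rules stand in for decision trees; with scikit-learn, the configuration used here would correspond to `RandomForestClassifier(n_estimators=500, max_features=10)`:

```python
import numpy as np

def forest_predict(X_train, y_train, X_test, n_learners=25, subset=10, seed=0):
    """Random-subspace ensemble: each learner sees a random feature subset and
    votes; the majority vote is the final prediction."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), int)
    for _ in range(n_learners):
        feats = rng.choice(X_train.shape[1], size=subset, replace=False)
        centroids = np.array([X_train[y_train == c][:, feats].mean(axis=0)
                              for c in classes])    # one centroid per class
        d = np.linalg.norm(X_test[:, feats][:, None, :] - centroids[None],
                           axis=2)                  # distance to each centroid
        votes[np.arange(len(X_test)), d.argmin(axis=1)] += 1
    return classes[votes.argmax(axis=1)]

# Two well-separated synthetic classes in a 20-D feature space.
rng = np.random.default_rng(1)
X0, X1 = rng.normal(0, 0.5, (40, 20)), rng.normal(5, 0.5, (40, 20))
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)
print((forest_predict(X, y, X) == y).mean())  # 1.0
```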

4.2 Results and Discussion

This section analyzes the performance of the proposed RF-SubLoc prediction system using

various feature spaces with/without the SMOTE technique. The RotF ensemble is also assessed

on the same set of features in order to provide a useful comparison

with the RF-SubLoc prediction system. A 10-fold cross-validation protocol is used for these

simulations on all three datasets. The performance of the prediction system is

measured in terms of accuracy, F-Score, and MCC.

4.2.1 Performance of RF-SubLoc using Individual Features

The prediction accuracies of RF-SubLoc for the three protein datasets utilizing individ-

ual features with/without SMOTE oversampling are presented in Table 4.4. The highest

accuracy of 95.1% is achieved using the Image features by RF-SubLoc for CHOM dataset

without SMOTE technique. Accuracy value of 96.6% is yielded using the same features by

RF-SubLoc with SMOTE oversampling. The improvement in other measures demonstrates

the effectiveness of the SMOTE oversampling technique over the imbalanced feature space. Less

promising performance is shown by RF-SubLoc using the Hull features, which indicates

that CHOM fluorescence microscopy protein images cannot be distinguished well by

the captured hull properties compared with the other

features. Similar behavior is observed using the Edge features.

Table 4.4: Performance of RF-SubLoc using individual features

                     Without SMOTE             With SMOTE
Feature    D     Acc    F-Score  MCC       Acc    F-Score  MCC
CHOM
Haralick   26    88.7   0.77     0.71      90.8   0.77     0.71
Edge       5     77.1   0.61     0.49      81.9   0.67     0.57
Hull       3     62.4   0.46     0.27      62.9   0.46     0.25
Image      8     95.1   0.89     0.86      96.6   0.91     0.89
HOG        81    82.9   0.68     0.58      88.6   0.74     0.66
LBP8       10    86.5   0.73     0.66      90.8   0.80     0.74
LBP16      18    90.5   0.80     0.74      92.7   0.84     0.79
LBP24      26    89.9   0.79     0.73      92.7   0.84     0.79
Average          84.1   0.71     0.63      87.1   0.75     0.67
CHOA
Haralick   26    72.9   0.49     0.37      85.8   0.57     0.52
Edge       5     62.4   0.37     0.20      76.0   0.39     0.30
Hull       3     44.2   0.26     -0.02     59.7   0.25     0.10
Image      8     84.6   0.66     0.59      92.2   0.73     0.70
HOG        81    66.2   0.42     0.26      87.8   0.60     0.55
LBP8       10    82.9   0.62     0.55      90.0   0.67     0.63
LBP16      18    81.3   0.60     0.51      90.9   0.69     0.65
LBP24      26    80.7   0.59     0.50      90.8   0.68     0.64
Average          71.9   0.50     0.37      84.2   0.57     0.51
Vero
Haralick   26    55.2   0.20     0.04      76.9   0.45     0.39
Edge       5     44.8   0.14     -0.09     65.9   0.32     0.21
Hull       3     40.8   0.12     -0.14     55.4   0.22     0.05
Image      8     60.3   0.24     0.11      75.8   0.44     0.37
HOG        81    57.2   0.22     0.06      83.4   0.56     0.52
LBP8       10    62.2   0.26     0.13      75.4   0.43     0.36
LBP16      18    69.8   0.33     0.24      83.3   0.56     0.51
LBP24      26    73.3   0.37     0.30      87.5   0.64     0.60
Average          57.9   0.23     0.08      75.5   0.45     0.37

However, the combined feature space of Hull and Edge features enhances the prediction performance of RF-SubLoc for the CHOM dataset, as shown in Table 4.5. For the CHOA dataset, RF-SubLoc achieves its highest accuracy without SMOTE, 84.6%, using the Image features; applying SMOTE improves this by 7.6%, resulting in 92.2% overall accuracy. This signifies the importance of data balancing through SMOTE in classification problems. In case of the Vero dataset, RF-SubLoc achieves its highest accuracy without SMOTE using the LBP24 features; with SMOTE, the same features gain 14.2%, reaching 87.5% accuracy.

The obtained simulation results make evident that performance improvement is achievable in the presence of balanced data. We have also reported the average performance over all the feature spaces for each of the three datasets. The average results show a clear improvement with SMOTE data balancing, which demonstrates the effectiveness of the SMOTE technique for fluorescence microscopy protein image classification.
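The SMOTE idea underlying these results can be sketched in a few lines: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority-class neighbours. The following is a simplified illustration, not the exact SMOTE algorithm or the thesis implementation:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    a randomly chosen seed point toward one of its k nearest minority
    neighbours (the core idea of SMOTE; a simplified sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class; exclude self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]        # nearest-neighbour indices
    n_nbrs = min(k, len(X_min) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # random seed sample
        j = nn[i, rng.integers(n_nbrs)]      # one of its neighbours
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

A production implementation of the full algorithm is available as `SMOTE` in the imbalanced-learn package.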

4.2.2 Performance Analysis of RF-SubLoc using Hybrid Features

Table 4.5 highlights the success rates of RF-SubLoc using hybrid features, with and without the SMOTE technique, for the three fluorescence microscopy protein image datasets. RF-SubLoc shows superior performance with hybrid features compared to the individual features. It yields the highest accuracies using the HarLBP24Morph hybrid model for all three datasets, both with and without SMOTE oversampling. The prediction system achieves 99.1% accuracy on the CHOM dataset without SMOTE oversampling, and 99.3% with it. For the CHOA dataset, RF-SubLoc obtains 96.5% and 93.0% accuracy with and without SMOTE data balancing, respectively, using the same features. Similarly, on the Vero dataset, RF-SubLoc achieves 91.2% and 74.6% accuracy, respectively, with and without SMOTE oversampling using the same set of features. The performance of RF-SubLoc degrades with the HullImg hybrid model on the CHOM dataset: as is evident from Table 4.4, the ensemble achieves 96.6% accuracy using the Image features alone, compared to 95.0% using the HullImg features with SMOTE oversampling (Table 4.5). This is because the hybrid model introduces redundancy into the discriminative information of the individual features, and it shows that hybrid models cannot be trusted blindly to improve model accuracy. RF-SubLoc performs well on oversampled feature spaces using both individual and hybrid models, and its performance with the hybrid models generally exceeds that with the individual feature spaces.


Table 4.5: Performance of RF-SubLoc using hybrid features

Without SMOTE With SMOTE

Feature D Acc F−Score MCC Acc F−Score MCC

CHOM

HarEdge 31 92.0 0.83 0.79 94.3 0.85 0.80

HarHull 29 94.8 0.89 0.86 96.1 0.89 0.86

HarImg 34 95.1 0.89 0.86 96.8 0.92 0.89

HullEdge 8 84.1 0.71 0.63 87.6 0.74 0.67

HullImg 11 94.8 0.89 0.86 95.0 0.89 0.86

ImgEdge 13 95.4 0.90 0.87 95.9 0.91 0.88

Morph 16 95.4 0.90 0.87 96.1 0.91 0.88

HarHoG 107 91.4 0.82 0.77 94.3 0.85 0.81

MorphHoG 97 97.6 0.94 0.93 99.1 0.98 0.98

HarMorphHoG 123 97.9 0.95 0.94 98.4 0.96 0.94

HarLBP8Morph 52 98.2 0.95 0.94 98.6 0.97 0.96

HarLBP16Morph 60 98.5 0.96 0.95 99.1 0.98 0.97

HarLBP24Morph 68 99.1 0.97 0.97 99.3 0.98 0.98
Average 87.4 0.89 0.86 96.2 0.91 0.88

CHOA

HarEdge 31 82.0 0.62 0.54 91.8 0.72 0.68

HarHull 29 77.4 0.55 0.46 89.2 0.65 0.61

HarImg 34 89.4 0.74 0.69 94.5 0.80 0.77

HullEdge 8 72.5 0.49 0.38 84.5 0.55 0.50

HullImg 11 86.5 0.69 0.63 92.2 0.74 0.71

ImgEdge 13 87.0 0.70 0.64 93.4 0.76 0.73

Morph 16 89.7 0.75 0.70 94.9 0.81 0.79

HarHoG 107 78.7 0.56 0.47 90.2 0.67 0.63

MorphHoG 97 88.6 0.73 0.68 96.0 0.85 0.83

HarMorphHoG 123 90.1 0.76 0.71 96.1 0.85 0.83

HarLBP8Morph 52 92.7 0.81 0.78 96.1 0.85 0.83

HarLBP16Morph 60 92.8 0.81 0.78 96.2 0.85 0.84

HarLBP24Morph 68 93.0 0.82 0.78 96.5 0.86 0.85
Average 86.2 0.69 0.63 93.2 0.77 0.74

Vero

HarEdge 31 58.9 0.23 0.08 80.6 0.51 0.45

HarHull 29 59.3 0.24 0.09 79.3 0.49 0.44

HarImg 34 64.4 0.28 0.16 84.0 0.57 0.53

HullEdge 8 54.1 0.20 0.02 74.7 0.42 0.35

HullImg 11 62.4 0.27 0.14 77.0 0.45 0.39

ImgEdge 13 63.7 0.27 0.15 79.8 0.49 0.44

Morph 16 65.4 0.28 0.17 82.0 0.53 0.48

HarHoG 107 61.9 0.25 0.12 85.8 0.60 0.56

MorphHoG 97 65.9 0.29 0.18 86.9 0.62 0.59

HarMorphHoG 123 66.8 0.30 0.19 88.9 0.67 0.64

HarLBP8Morph 52 68.5 0.31 0.22 87.6 0.64 0.60

HarLBP16Morph 60 71.2 0.34 0.26 89.8 0.69 0.66

HarLBP24Morph 68 74.6 0.39 0.32 91.2 0.72 0.69
Average 64.4 0.28 0.16 83.7 0.57 0.52


This is because the classifier exploits the discrimination power of the individual features concurrently. At the same time, owing to SMOTE oversampling, every class has equal representation, in terms of number of input samples, during the training phase. This certainly enhances the overall performance of the RF-SubLoc prediction system.

4.2.3 Performance Analysis of RotF Ensemble using Individual Features

Table 4.6 demonstrates the prediction performance of the RotF ensemble using individual features for the CHOM, CHOA, and Vero datasets, with and without SMOTE oversampling. The prediction performance of the RotF ensemble improves on balanced data. LBP patterns enable the RotF ensemble to achieve higher accuracies on all three datasets. Without SMOTE, RotF yields 92.0% accuracy on the CHOM dataset using the LBP8 and LBP16 features; with SMOTE, its performance improves to 93.8% using the LBP16 features. On the CHOA dataset, RotF achieves 78.0% accuracy using the LBP8 features without oversampling, which SMOTE oversampling raises to 84.7%. However, the highest accuracy obtained by the RotF ensemble on the CHOA dataset is 86.6%, using the Image features with SMOTE oversampling. For the Vero dataset, the RotF ensemble yields its highest accuracies, 66.4% and 73.3%, using the LBP24 features without and with the SMOTE technique, respectively.

The obtained results reveal that a feature space balanced with synthetic samples enables the classifier to decide the label of a sample without bias towards the majority class.

4.2.4 Performance Analysis of RotF Ensemble using Hybrid Features

Table 4.7 highlights the success rates of the RotF ensemble using hybrid features, with and without SMOTE, for the CHOM, CHOA, and Vero datasets. The effectiveness of the hybrid features and SMOTE data balancing is obvious from the obtained results for all three datasets. For CHOM, the highest accuracy of 97.3% is reported using the HarLBP16Morph features with SMOTE, whereas for the CHOA and Vero datasets the highest accuracies, 91.2% and 74.0% respectively, are obtained using the oversampled HarLBP24Morph features.


Table 4.6: Performance of RotF ensemble using individual features

Without SMOTE With SMOTE

Feature D Acc F−Score MCC Acc F−Score MCC

CHOM

Haralick 26 88.1 0.77 0.70 91.1 0.81 0.76

Edge 5 67.6 0.49 0.32 75.3 0.56 0.41

Hull 3 56.3 0.40 0.16 61.8 0.47 0.27

Image 8 90.5 0.80 0.75 93.4 0.85 0.80

HoG 81 68.2 0.48 0.30 74.8 0.54 0.39

LBP8 10 92.0 0.83 0.78 92.2 0.83 0.78

LBP16 18 92.0 0.83 0.78 93.8 0.84 0.80

LBP24 26 89.0 0.77 0.70 91.5 0.80 0.75

Average 80.5 0.67 0.56 84.2 0.71 0.62

CHOA

Haralick 26 69.3 0.45 0.31 81.1 0.49 0.42

Edge 5 54.2 0.30 0.07 66.9 0.29 0.17

Hull 3 39.4 0.22 -0.10 45.1 0.14 -0.10

Image 8 77.2 0.55 0.45 86.6 0.60 0.55

HoG 81 48.2 0.25 -0.01 64.9 0.28 0.15

LBP8 10 78.0 0.56 0.46 84.7 0.56 0.50

LBP16 18 77.1 0.54 0.44 86.1 0.58 0.53

LBP24 26 76.3 0.53 0.42 85.8 0.57 0.52

Average 65.0 0.42 0.25 75.2 0.43 0.34

Vero

Haralick 26 50.2 0.18 -0.01 60.2 0.27 0.13

Edge 5 35.7 0.11 -0.19 47.3 0.17 -0.04

Hull 3 35.6 0.10 -0.21 42.4 0.15 -0.10

Image 8 50.4 0.17 -0.01 61.0 0.27 0.14

HoG 81 44.2 0.14 -0.09 57.2 0.25 0.09

LBP8 10 49.7 0.17 -0.02 59.1 0.26 0.11

LBP16 18 60.7 0.25 0.12 66.9 0.33 0.23

LBP24 26 66.4 0.31 0.21 73.3 0.41 0.33

Average 49.1 0.17 -0.02 58.4 0.26 0.11

The effectiveness of SMOTE data balancing is evident from the obtained results, where improvement is achieved in every case. Likewise, the average success rates confirm the usefulness of the SMOTE data balancing technique. Similarly, the hybrid features show performance improvement over the individual features on all three datasets because the discrimination power of the individual features is combined and consequently enhanced.


Table 4.7: Performance of RotF ensemble using hybrid features

Without SMOTE With SMOTE

Feature D Acc F−Score MCC Acc F−Score MCC

CHOM

HarEdge 31 89.0 0.78 0.71 90.4 0.78 0.72

HarHull 29 89.9 0.80 0.74 90.4 0.77 0.70

HarImg 34 92.0 0.84 0.79 94.5 0.87 0.83

HullEdge 8 73.7 0.57 0.43 78.9 0.63 0.51

HullImg 11 91.4 0.82 0.77 94.3 0.86 0.82

ImgEdge 13 91.7 0.83 0.78 94.7 0.87 0.83

Morph 16 90.8 0.81 0.76 95.4 0.89 0.86

HarHoG 107 84.1 0.70 0.62 88.1 0.76 0.69

MorphHoG 97 89.6 0.79 0.73 92.9 0.82 0.76

HarMorphHoG 123 91.4 0.82 0.78 93.6 0.85 0.81

HarLBP8Morph 52 94.5 0.88 0.85 96.1 0.91 0.89

HarLBP16Morph 60 95.4 0.90 0.87 97.3 0.92 0.90

HarLBP24Morph 68 96.0 0.91 0.89 96.6 0.92 0.90
Average 90.0 0.80 0.75 92.6 0.83 0.79

CHOA

HarEdge 31 74.4 0.51 0.39 84.7 0.56 0.51

HarHull 29 71.0 0.47 0.34 80.9 0.49 0.43

HarImg 34 80.2 0.59 0.50 88.9 0.65 0.61

HullEdge 8 58.8 0.34 0.14 70.2 0.33 0.22

HullImg 11 76.8 0.55 0.45 85.8 0.58 0.54

ImgEdge 13 77.7 0.55 0.46 87.5 0.62 0.57

Morph 16 78.0 0.56 0.46 86.6 0.60 0.55

HarHoG 107 65.4 0.41 0.25 79.4 0.46 0.39

MorphHoG 97 72.8 0.49 0.36 83.6 0.53 0.48

HarMorphHoG 123 76.0 0.53 0.42 86.6 0.60 0.55

HarLBP8Morph 52 82.8 0.63 0.55 90.2 0.68 0.64

HarLBP16Morph 60 83.7 0.64 0.57 90.8 0.69 0.66

HarLBP24Morph 68 84.1 0.65 0.57 91.2 0.71 0.67
Average 75.5 0.53 0.42 85.1 0.58 0.52

Vero

HarEdge 31 50.1 0.17 -0.02 61.6 0.28 0.14

HarHull 29 49.2 0.17 -0.03 61.6 0.28 0.15

HarImg 34 53.9 0.20 0.02 67.1 0.34 0.23

HullEdge 8 43.6 0.13 -0.11 52.6 0.20 0.01

HullImg 11 52.6 0.19 0.01 62.0 0.29 0.16

ImgEdge 13 52.2 0.19 0.01 63.7 0.30 0.18

Morph 16 53.4 0.20 0.02 64.9 0.31 0.19

HarHoG 107 49.5 0.17 -0.03 61.8 0.28 0.15

MorphHoG 97 52.4 0.19 0.01 65.2 0.31 0.19

HarMorphHoG 123 53.8 0.19 0.02 67.3 0.33 0.23

HarLBP8Morph 52 55.2 0.21 0.05 67.8 0.34 0.24

HarLBP16Morph 60 59.3 0.24 0.10 70.4 0.37 0.29

HarLBP24Morph 68 65.4 0.29 0.19 74.0 0.42 0.35
Average 53.1 0.20 0.02 64.6 0.31 0.19


4.3 Comparative Analysis

Table 4.8 compares the proposed system with state-of-the-art approaches from various researchers. Zhang and Pham [85] developed an RSE of multi-layer perceptrons, which achieved 98.9% accuracy on the CHOM dataset. In a similar approach, Zhang et al. [86] reported 94.2% accuracy on the same dataset. Lin et al. [6] proposed an AdaBoost.ERC based prediction system, which obtained accuracies of 94.7% and 89.1% on the CHOA and Vero datasets, respectively. The feature extraction strategies of these prediction systems are complex in nature and occupy large amounts of memory. For example, the authors in [85] used curvelet transform based features, which must be extracted in multi-resolution sub-spaces; this increases the computation time of the prediction system. Similarly, in [6] the strong and weak detectors require large memory to be loaded. In contrast, our feature extraction strategies use feature spaces of smaller dimension. We extracted various individual features as well as constructed their hybrid models. The key novelty of the proposed prediction system is the introduction of SMOTE data balancing on the feature spaces extracted from fluorescence microscopy protein images, which greatly enhanced the performance of the RF-SubLoc prediction system. With this approach, RF-SubLoc achieved 99.3% accuracy on the CHOM dataset, 96.5% on CHOA, and 91.2% on Vero.

Table 4.8: Performance comparison with other published work

Method Accuracy

CHOM CHOA Vero

MLP-RSE [85] 98.9 - -

MLP-RSE [86] 94.2 - -

AdaBoost.ERC [6] - 94.7 89.1

RF-SubLoc 99.3 96.5 91.2

We have explored the performance of the RF-SubLoc prediction system and the RotF ensemble classifier in the presence of balanced and imbalanced data, employing datasets with varying degrees of imbalance. Simulation results reveal that the performance gain obtained under SMOTE is directly proportional to the imbalance present in the original dataset: the greater the original imbalance, the larger the improvement observed after data balancing.

Simulation results confirm the effectiveness of the RF-SubLoc ensemble, whose efficiency is boosted by balanced feature spaces. Synthetic samples created with SMOTE force RF-SubLoc to pay greater attention to minority class samples. This leads to more generalized learning and consequently improves the overall performance of the system.

Summary

We observed that SMOTE increased the efficiency of RF-SubLoc in classifying protein subcellular structures, particularly in the case of the HarLBP24Morph hybrid features. The significance of SMOTE based data balancing in the classification of fluorescence microscopy protein images is thus evident from the simulation results. In the next chapter, we combine the best techniques of the two systems presented in this chapter and Chapter 3: LTP from Chapter 3 and SMOTE from the current chapter.


Chapter 5

Protein Subcellular Localization:

Employing LTP with SMOTE

Prediction systems for protein localization depend greatly on the discrimination power of their feature extraction strategies for reliability and accuracy. High discrimination power would certainly aid the drug development process for various diseases. Existing approaches employ individual and ensemble classifier systems operating on different feature generation techniques in order to classify protein localization with high accuracy [23, 36]. These methods have achieved high prediction accuracies; however, room for more accurate systems still exists. Another problem faced by such prediction systems is imbalanced data, which severely degrades the performance of automated systems. The classifier's predictions are influenced by the majority class, and the classification system suffers from this bias in the form of degraded performance. The overall performance of the classifier may appear good in the form of high accuracy, yet the instances of the minority classes may all be neglected in the classification procedure. Tackling this issue is the main focus of this chapter.

5.1 The Protein-SubLoc Prediction System

The proposed Protein-SubLoc prediction system is shown in Figure 5.1. Its constituent phases, discussed in the following subsections, are feature extraction, oversampling, optional feature selection, classification, and ensemble generation.


Figure 5.1: Framework of the proposed system (LTP feature extraction, SMOTE oversampling, SVM classification, ensembles generation, and best ensemble selection, producing the final output)

5.1.1 Feature Extraction Phase

Each input image is forwarded to the feature extraction phase, where LTP patterns are extracted in the spatial domain. In this work, RI-LTP (R=1, PN=8), U-LTP (R=1, PN=8), URI-LTP (R=1, PN=8), U-LTP (R=2, PN=16), URI-LTP (R=2, PN=16), and URI-LTP (R=3, PN=24) are exploited for their discrimination power.
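As a point of reference, the basic LTP operator thresholds each neighbour against the centre pixel with tolerance τ and splits the resulting ternary code into an upper and a lower binary pattern, whose histograms form the descriptor. The sketch below covers the plain R=1, PN=8 case only, without the uniform or rotation-invariant mappings used in the thesis (see Chapter 2):

```python
import numpy as np

def ltp_histograms(img, tau):
    """Basic LTP for R=1, PN=8 (no uniform/rotation-invariant mapping):
    code each neighbour as +1/0/-1 against the centre pixel with
    tolerance tau, then histogram the upper and lower binary patterns."""
    img = np.asarray(img, dtype=int)
    h, w = img.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros(256, dtype=int)
    lower = np.zeros(256, dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y, x]
            u_code = l_code = 0
            for bit, (dy, dx) in enumerate(offsets):
                n = img[y + dy, x + dx]
                if n >= c + tau:        # +1 branch -> upper pattern bit
                    u_code |= 1 << bit
                elif n <= c - tau:      # -1 branch -> lower pattern bit
                    l_code |= 1 << bit
            upper[u_code] += 1
            lower[l_code] += 1
    return np.concatenate([upper, lower])   # 512-bin descriptor
```

The mapping step (U, RI, URI) then merges histogram bins, which is what produces the smaller dimensions D listed in Table 5.1.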

5.1.2 Oversampling Phase

The feature extraction phase is followed by the oversampling phase, where SMOTE synthetically oversamples the feature space so that a balanced feature space is produced before classification. In Figure 5.1, different colors indicate instances belonging to different classes; synthetically oversampled examples are shown in black.

In this work, LTP patterns are utilized for protein subcellular localization images from the HeLa and CHOA datasets. Of the two, HeLa exhibits slight imbalance whereas CHOA is comparatively highly imbalanced. To remedy the imbalance, SMOTE is applied to the extracted feature spaces before they are forwarded to the classification phase. The original and synthetic samples are depicted in Figures 5.2 and 5.3 for the HeLa and CHOA datasets, respectively.

5.1.3 Feature Selection (Optional)

Feature selection may improve the performance of a model by selecting the most discriminative and informative features from the full feature space. In addition, it reduces the dimension of the feature space, which lowers the computational cost. To further enhance the discrimination power of the oversampled feature space, the mRMR feature selection technique is adopted. Our findings reveal, however, that mRMR does not add anything significant to the performance of the LTP patterns.
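For illustration, the greedy mRMR ranking can be sketched as below. Note this is a simplified stand-in: it scores features by absolute Pearson correlation instead of the mutual information used in the original mRMR criterion, and the function name is hypothetical:

```python
import numpy as np

def mrmr_select(X, y, n_select):
    """Greedy mRMR feature ranking (MID criterion: relevance minus
    redundancy). This sketch uses |Pearson correlation| as a cheap
    proxy for mutual information."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_feat = X.shape[1]
    corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    # relevance: correlation of each feature with the target
    relevance = np.array([corr(X[:, j], y) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # redundancy: mean correlation with already selected features
            redundancy = np.mean([corr(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Given a feature that is an exact duplicate of an already selected one, the redundancy term pushes the ranking toward a complementary feature instead, which is the behavior mRMR is designed for.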

5.1.4 Classification Phase

In the classification phase, shown in Figure 5.4, the performance of the Protein-SubLoc prediction system is evaluated using an SVM with a polynomial kernel of degree 2. Six SVM classifiers are trained on six different LTP feature spaces, generated using six different configurations of the LTP feature extraction mechanism. A detailed discussion of the LTP configurations can be found in Chapter 2.

Figure 5.4: Classification phase of Protein-SubLoc prediction system (oversampled LTP features are fed to the SVM classifiers SVM1 to SVMn; ensemble generation and best ensemble selection yield the final prediction)

5.1.5 Ensemble Generation

A majority voting based ensemble is employed to predict the labels of the different protein subcellular localizations. All the LTP variations, based on the different mappings, are considered in forming the ensemble. As shown in Table 5.1, serial numbers are assigned to the various mappings for ease of discussion. If an ensemble of five members is considered, six different ensembles can be generated from the six individual classifiers. Different combinations are tested, and the ensemble with the highest accuracy is selected for the Protein-SubLoc prediction system. This practice explores the diversity among the classification results while restricting the ensemble to an odd number of members.
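The selection procedure above, enumerating all C(6, 5) = 6 five-member subsets, fusing each by majority vote, and keeping the most accurate, can be sketched as follows (the helper names are illustrative, not the thesis code):

```python
from collections import Counter
from itertools import combinations

def majority_vote(predictions):
    """Fuse per-classifier label lists by column-wise majority vote."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

def best_odd_ensemble(all_preds, y_true, size=5):
    """Evaluate every `size`-member subset of the classifier pool and
    return (member indices, accuracy) of the most accurate ensemble."""
    best = (None, -1.0)
    for members in combinations(range(len(all_preds)), size):
        fused = majority_vote([all_preds[i] for i in members])
        acc = sum(f == t for f, t in zip(fused, y_true)) / len(y_true)
        if acc > best[1]:
            best = (members, acc)
    return best
```

With six base classifiers and five-member ensembles, only six subsets exist, so the exhaustive search is trivially cheap.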

Table 5.1: The serial numbers, attached to the mappings used in LTP computation, representing a particular LTP variant in Tables 5.8 and 5.9

S. No m R PN D

1 U 1 8 118

2 RI 1 8 72

3 URI 1 8 20

4 URI 2 16 36

5 U 2 16 486

6 URI 3 24 52


5.2 Results and Discussion

This section provides a detailed analysis of the proposed Protein-SubLoc prediction system, discussing the importance of LTP patterns in conjunction with SMOTE. First, the results obtained on 2D HeLa are presented; then the results for the CHOA dataset are analyzed. Simulation results are obtained using the 10-fold cross-validation protocol, and the performance of the proposed system is measured in terms of accuracy, sensitivity, specificity, MCC, F-Score, and Q-statistic.
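The Q-statistic (Yule's Q) is a standard pairwise diversity measure for classifier ensembles; a minimal sketch of its computation from per-sample correctness flags follows (the function name is illustrative):

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q-statistic between two classifiers:
    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10),
    where N11 counts samples both get right, N00 both wrong, etc.
    Q near 0 indicates diverse errors; Q near 1 indicates the
    classifiers fail on the same samples."""
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(a and b for a, b in pairs)          # both correct
    n00 = sum(not a and not b for a, b in pairs)  # both wrong
    n10 = sum(a and not b for a, b in pairs)      # only A correct
    n01 = sum(not a and b for a, b in pairs)      # only B correct
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0
```

Low pairwise Q among the six SVMs is what makes the majority-voting ensemble of Section 5.1.5 worthwhile.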

5.2.1 Performance Analysis for 2D HeLa dataset

The output predictions of the SVM for the HeLa dataset are presented in Table 5.2. In the first column, m indicates the mapping method adopted during LTP feature extraction. The radius and the number of neighboring pixels considered during LTP feature extraction appear in columns 2 and 3, respectively, and the threshold τ required by LTP feature extraction is given in column 4. D, the feature space dimension, is shown in column 5. Columns 6-10 list accuracy, sensitivity, specificity, MCC, and F-Score, respectively. From this point forward, LTP patterns are referred to as m-LTP(R, PN, τ) in the text. Results in Table 5.2 reveal that LTP patterns with

Table 5.2: Performance of Protein-SubLoc using LTP for balanced HeLa dataset

Polynomial kernel

m R PN τ D Acc Sen Spe MCC F−Score

RI 1 8 40 72 90.7 92.6 90.5 0.65 0.66

U 1 8 40 118 93.1 93.7 93.0 0.71 0.73

URI 1 8 40 20 91.6 92.5 91.5 0.67 0.68

U 2 16 80 486 95.4 96.0 95.3 0.79 0.80

URI 2 16 80 36 94.6 95.9 94.5 0.77 0.78

URI 3 24 80 52 95.4 96.1 95.3 0.79 0.80

SMOTE discriminate the HeLa patterns very well, yielding accuracy above 90% with every mapping. Further, the URI-LTP(3, 24, 80) patterns achieve the highest accuracy of 95.4%. The Protein-SubLoc prediction system retains a good balance between sensitivity and specificity, made possible by the SMOTE-balanced dataset. The MCC value of 0.79 for the URI-LTP(3, 24, 80)


patterns reveals fine prediction quality; likewise, the F-Score value of 0.80 indicates good accuracy of the conducted test. Comparing the LTPs in Table 5.2, the performance of LTP patterns improves on larger image patches, as described by the radius R.

A discriminative feature space usually exhibits high variance. Figure 5.5 illustrates the high variance possessed by the LTP patterns, using the notion of explained variance: among the PCA components of the feature space, the leading components carry most of the information when their explained variance is high. From Figure 5.5, the first two components account for 77% of the total variance, and this ratio reaches 92% when the first five components are considered, supporting the enhanced performance of the Protein-SubLoc prediction system in differentiating protein structures. Adding components beyond the first five offers no guarantee of improved performance, because those components do not possess significant variability.

Figure 5.5: Ratio of explained variance to the total variance for HeLa dataset

Multiclass ROC curves for URI-LTP(3, 24, 80) are also presented in Figure 5.6. The features maintained discrimination between the classes as is evident

from the ROC curves illustrated in the figure, which are based on one feature rather than on the whole feature space. Since there are 10 classes, there are 45 possible pairwise combinations and, consequently, 45 ROC curves.

Figure 5.6: ROC curves using URI-LTP(3, 24, 80) for 2D HeLa dataset

The notion of mRMR is also introduced to select the most discriminative features from the full feature space, reducing its dimension as well as its noise. The performance of Protein-SubLoc on mRMR-reduced LTP feature spaces is recorded in Table 5.3; the performance of the reduced feature spaces was not promising

compared to the full feature spaces. This indicates that the LTP feature extraction strategy extracts from the fluorescence microscopy protein images most of the information necessary to differentiate the subcellular structures, and hence mRMR based feature selection is not required as a post-processing step to improve the performance of LTP patterns. However, the mRMR based feature space of the U-LTP(1, 8, 40) patterns achieves slightly better performance than the full feature space of the same LTP patterns, as is evident from Tables 5.2 and 5.3. In addition to the mRMR based LTP patterns presented

in Table 5.3, we have also investigated the effect of mRMR on the highest performing LTP patterns, U-LTP(2, 16, 80) and URI-LTP(3, 24, 80) (see Table 5.2). Figure 5.7 demonstrates the performance of various mRMR based feature subspaces selected from the full feature space of the URI-LTP(3, 24, 80) patterns. The highest accuracy of 95.5% is achieved on the 50D feature subspace, compared to 95.4% on the 52D full feature space, which cannot be considered an extraordinary improvement. Likewise, the mRMR based 450D feature subspace of the U-LTP(2, 16, 80) patterns yields 95.6% accuracy, compared to 95.4% using the 486D full feature space, as depicted in Figure 5.8. Although the reduction in the features through mRMR gives an initial advantage


Table 5.3: Performance of Protein-SubLoc using LTP for balanced mRMR based HeLa

dataset

Polynomial kernel

m (R, PN, τ) D Acc Sen Spe MCC F−Score

RI (1, 8, 40)

36 89.0 90.9 88.8 0.60 0.62

54 89.7 91.3 89.6 0.62 0.64

60 89.8 91.4 89.7 0.62 0.64

U (1, 8, 40)

59 91.0 91.8 90.9 0.65 0.67

89 93.3 94.0 93.2 0.72 0.73

100 93.0 93.8 92.9 0.71 0.73

URI (1, 8, 40)

10 85.9 88.8 85.5 0.54 0.55

15 90.4 92.2 90.2 0.64 0.65

18 91.6 92.5 91.5 0.67 0.68

U (2, 16, 80)

243 94.3 94.9 94.3 0.75 0.77

365 95.2 95.9 95.1 0.78 0.80

400 95.1 95.5 95.0 0.78 0.79

URI (2, 16, 80)

18 92.9 94.3 92.7 0.71 0.72

27 94.1 95.6 94.0 0.75 0.76

30 93.5 94.9 93.4 0.73 0.74

URI (3, 24, 80)

26 93.5 94.4 93.4 0.73 0.74

39 94.8 95.8 94.7 0.77 0.78

45 95.1 95.7 95.0 0.78 0.79

of 0.1% and 0.2% over the full feature spaces using the URI-LTP(3, 24, 80) and U-LTP(2, 16, 80) patterns, respectively, further reduction degrades performance, as is evident from Figures 5.7 and 5.8.

LTP patterns alone, in conjunction with SMOTE, show great performance without the aid of mRMR based feature selection, a positive indication of the strength of the balanced LTP patterns. The difference between balanced and imbalanced feature spaces can be understood from the LTP performance without SMOTE, tabulated in Table 5.4. The highest accuracy achieved using the URI-LTP(3, 24, 80) patterns without SMOTE is 94.0%, which is 1.4% lower than with SMOTE, as is evident from Table 5.2. Data balancing with SMOTE improves the discriminative capability of the prediction system because all classes have equal representation in the training and testing phases. Overall, LTP patterns perform well without the aid of mRMR based feature selection, provided that the Protein-SubLoc prediction system is trained on a balanced dataset.


Figure 5.7: Effect of mRMR on URI-LTP extracted on radius 3 for HeLa dataset

Figure 5.8: Effect of mRMR on U-LTP extracted on radius 2 for HeLa dataset

5.2.2 Performance Analysis for CHOA dataset

Table 5.5 presents the accuracies of the Protein-SubLoc prediction system using LTP patterns with SMOTE for the CHOA cell lines. The URI-LTP(3, 24, 30) patterns outperform the other LTP patterns in predicting the various subcellular structures of the CHOA dataset: the Protein-SubLoc prediction system yields 90.7% accuracy, 0.65 MCC, and 0.69 F-Score. The performance of LTP patterns on the balanced CHOA dataset with mRMR feature selection is highlighted in Table 5.6. No improvement is observed with feature selection, since LTP patterns are capable of extracting maximum information from the fluorescence microscopy protein images and hence no feature selection is required.


Table 5.4: Performance of Protein-SubLoc using LTP for imbalanced HeLa dataset

Polynomial kernel

m R PN τ D Acc Sen Spe MCC F−Score

RI 1 8 40 72 87.5 89.5 87.3 0.58 0.60

U 1 8 40 118 90.8 92.1 90.6 0.65 0.67

URI 1 8 40 20 89.7 91.9 89.5 0.63 0.65

U 2 16 80 486 92.5 94.0 92.3 0.71 0.72

URI 2 16 80 36 93.2 94.8 93.0 0.73 0.74

URI 3 24 80 52 94.0 95.7 93.8 0.75 0.77

Table 5.5: Performance of Protein-SubLoc using LTP for balanced CHOA dataset

Polynomial kernel

m R PN τ D Acc Sen Spe MCC F−Score

RI 1 8 30 72 70.4 59.6 71.9 0.22 0.33

U 1 8 30 118 72.8 61.0 74.4 0.25 0.35

URI 1 8 30 20 68.4 55.7 70.2 0.18 0.30

U 2 16 30 486 82.2 59.8 50.9 0.08 0.31

URI 2 16 30 36 83.9 77.1 84.9 0.48 0.54

URI 3 24 30 52 90.7 85.3 91.5 0.65 0.69

Table 5.6: Performance of Protein-SubLoc using LTP for balanced mRMR based CHOA

dataset

Polynomial kernel

m R PN τ D Acc Sen Spe MCC F−Score

RI 1 8 30 36 68.9 57.2 70.5 0.19 0.31

U 1 8 30 59 70.3 57.6 72.1 0.21 0.32

URI 1 8 30 10 64.2 52.0 65.9 0.12 0.26

U 2 16 30 243 80.4 67.3 82.3 0.38 0.45

URI 2 16 30 18 80.7 71.8 81.9 0.41 0.47

URI 3 24 30 26 89.3 83.6 90.1 0.61 0.65

In order to analyze the performance of LTP patterns without balancing the dataset, the results without SMOTE are also presented in Table 5.7. The highest accuracy achieved without SMOTE, using URI-LTP(3, 24, 30) patterns, is 8.6% lower than that with SMOTE shown in Table 5.5. This confirms the effectiveness of SMOTE in predicting protein subcellular localization from the CHOA dataset. The ROC curves for URI-LTP(3, 24,


Table 5.7: Performance of Protein-SubLoc using LTP for imbalanced CHOA dataset

Polynomial kernel

m R PN τ D Acc Sen Spe MCC F−Score

RI 1 8 30 72 58.5 57.4 58.8 0.12 0.33

U 1 8 30 118 55.6 54.8 55.9 0.08 0.30

URI 1 8 30 20 55.2 53.0 55.9 0.06 0.29

U 2 16 30 486 60.0 58.3 60.4 0.14 0.34

URI 2 16 30 36 72.3 72.3 72.3 0.35 0.48

URI 3 24 30 52 82.1 82.2 82.1 0.54 0.62

30) features are shown in Figure 5.9. There are 28 ROC curves, produced from pairwise combinations of the 8 classes.
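The count follows from a one-versus-one analysis: 8 classes give C(8, 2) = 28 unordered pairs, each contributing one curve. A minimal sketch of how such curves are traced; the helper `roc_points` is hypothetical, not the thesis code.

```python
import numpy as np
from itertools import combinations

def roc_points(scores, labels):
    """FPR/TPR pairs traced by sweeping the decision threshold
    from the highest score downwards (labels are 0/1)."""
    order = np.argsort(-np.asarray(scores, float))
    lab = np.asarray(labels)[order]
    tpr = np.cumsum(lab) / max(lab.sum(), 1)
    fpr = np.cumsum(1 - lab) / max((1 - lab).sum(), 1)
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

# one curve per unordered pair of the 8 CHOA classes: C(8, 2) = 28
pairs = list(combinations(range(8), 2))
```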

Figure 5.9: ROC curves using URI-LTP(3, 24, 30) for CHOA dataset

5.2.3 Ensemble Analysis

Tables 5.8 and 5.9 highlight the prediction accuracies of the ensemble for HeLa and CHOA

cell lines, respectively. The ensemble for the HeLa dataset has achieved an accuracy of 100% with the 5th group, as shown in Table 5.8. The diversity among the ensemble members of this group is the highest, as is evident from its Q-value of 0.01. Likewise, for the CHOA dataset, the same group has achieved 95% prediction accuracy, as shown in Table 5.9. The diversity of


Table 5.8: Performance of different combinations of SVM classifiers using LTP for balanced HeLa dataset

S. No  1  2  3  4  5  6   Acc   Sen   Spe   MCC   F-Score  Q-value
1      √  √  √  √  √  -   99.8  99.9  99.8  0.99  0.99     0.20
2      √  √  √  √  -  √   99.8  99.9  99.8  0.99  0.99     0.07
3      √  √  √  -  √  √   99.8  99.9  99.8  0.99  0.99     0.08
4      √  √  -  √  √  √   99.8  99.9  99.8  0.99  0.99     0.11
5      √  -  √  √  √  √   100   100   100   1     1        0.01
6      -  √  √  √  √  √   99.7  99.9  99.7  0.98  0.98     0.12

this group has thus yielded promising results for both the HeLa and CHOA datasets. The Q-value of 0.33 reveals the good level of diversity maintained by these ensemble members.
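The Q-value reported in these tables can be computed per classifier pair with Yule's Q-statistic and, as is common practice, averaged over all member pairs; the sketch below computes the pairwise statistic under that assumption.

```python
import numpy as np

def q_statistic(correct_i, correct_j):
    """Yule's Q-statistic between two ensemble members, computed from
    boolean 'prediction was correct' vectors. Q = 1 for identical
    behaviour; values near 0 (or negative) indicate diverse errors."""
    ci = np.asarray(correct_i, bool)
    cj = np.asarray(correct_j, bool)
    n11 = np.sum(ci & cj)        # both correct
    n00 = np.sum(~ci & ~cj)      # both wrong
    n10 = np.sum(ci & ~cj)       # only i correct
    n01 = np.sum(~ci & cj)       # only j correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)
```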

Table 5.9: Performance of different combinations of SVM classifiers using LTP for balanced CHOA dataset

S. No  1  2  3  4  5  6   Acc   Sen   Spe   MCC   F-Score  Q-value
1      √  √  √  √  √  -   91.7  87.3  92.4  0.69  0.72     0.38
2      √  √  √  √  -  √   92.8  89.1  93.4  0.72  0.75     0.36
3      √  √  √  -  √  √   92.9  88.8  93.5  0.72  0.75     0.36
4      √  √  -  √  √  √   94.7  91.4  95.1  0.78  0.81     0.34
5      √  -  √  √  √  √   95.0  91.6  95.4  0.79  0.81     0.33
6      -  √  √  √  √  √   93.6  90.2  94.1  0.75  0.77     0.38

5.3 Comparative Analysis

Comparative analysis of the proposed Protein-SubLoc prediction system is highlighted in

Table 5.10. Several state-of-the-art techniques are listed in the first column and their respective summaries are recorded in the next column. Chebira et al. have developed a multi-resolution based technique, which yielded 95.4% accuracy [7]. The RSE-based prediction system proposed in [39] has predicted HeLa images with 94.2% accuracy. In a similar approach, [8] has reported 95.8% prediction accuracy. Another approach, in which a fusion of two different ensembles was generated, achieved 97.5% prediction accuracy [40].

Nanni et al. have developed an ensemble of SVMs, which achieved 93.2% accuracy [12]. Lin et al. have proposed the AdaBoost.ERC algorithm, which achieved 94.7% and 93.6% accuracies


for CHOA and HeLa datasets, respectively [6]. From the above discussion, it is evident

Table 5.10: Performance comparison with other approaches

Method Summary of the Technique Accuracy

HeLa CHOA

Chebira et al. [7] Multi-resolution subspaces, weighted majority voting 95.4 -

Nanni and Lumini [39] RSE of NNs 94.2 -

Nanni et al. [40] RSE of NNs, AdaBoost ensemble of weak learners, sum rule 97.5 -

Nanni et al. [8] RSE of NNs 95.8 -

Nanni et al. [12] SVM, random subset of features, 50 classifiers, sum rule 93.2 -

Lin et al. [6] Variant of AdaBoost named as AdaBoost.ERC 93.6 94.7

Protein-SubLoc Majority Voting based Ensemble 100 95.0

that the proposed Protein-SubLoc approach outperforms the existing approaches. Among individual classifiers, the SVM has yielded 95.4% accuracy for the HeLa dataset using the 52D full feature space constructed from URI-LTP(3, 24, 80) patterns, as shown in Table 5.2, which is comparable with the existing ensemble approaches in the literature. The proposed Protein-SubLoc ensemble achieved 100% accuracy, which outperforms all the existing ensemble approaches. The reported accuracies of the proposed ensemble for the HeLa and CHOA datasets are, respectively, 2.5% and 0.3% higher than the highest accuracies of the existing approaches on the same datasets.

The performance accuracies yielded by the proposed Protein-SubLoc prediction system are promising due to the robustness of LTP patterns against noise. This enables LTP patterns to extract valuable and discriminative information from images, which helps the classifier to efficiently distinguish different subcellular structures from each other. LTP performs its operations over varying areas of observation, which enables this technique to obtain different views of the same image regions. Furthermore, providing balanced data to the classifier reduces its bias towards the majority class, and hence the performance of the classifier is enhanced with improved reliability.

The improvement in prediction accuracy with SMOTE grows with the degree of imbalance present in a dataset. For instance, the improvement for the HeLa cell lines with SMOTE over without SMOTE is modest, whereas for the CHOA dataset the improvement is large, because the imbalance in the HeLa dataset is low compared to that of the CHOA dataset. Feature selection using mRMR is not required for LTP, since it produces such a discriminative


feature space that further enhancement in its discriminative power is not possible. The informative and discriminative feature space, generated with LTP patterns and oversampled with SMOTE, enables the Protein-SubLoc prediction system to efficiently classify fluorescence microscopy images of different subcellular proteins into their respective classes. Consequently, the prediction system boosts the prediction performance to higher levels.

Summary

In this chapter, the effectiveness of LTP in conjunction with SMOTE for the classification of fluorescence microscopy protein images is reported. The results show that efficient predictions can be achieved by fully exploiting the learning power of the classifier through a balanced feature space. In the next chapter, a deeper insight is developed into the extraction of minute details from fluorescence microscopy images through GLCM and Texton image based feature extraction strategies. The decisions of individual classifiers are then combined through the majority voting scheme for enhanced classification results.


Chapter 6

Protein Subcellular Localization

using GLCM and Texton Image

based Features

The subcellular localization of proteins plays a key role in understanding their numerous functions. Comprehensive analysis of fluorescence microscopy images is required in order to develop efficient automated systems for the accurate localization of different proteins. In this chapter, we explore the effect of various feature spaces constructed from GLCM and

Texton images along different orientations. It is observed that the recognition capability and learning of the prediction system are enhanced when the feature spaces are constructed from a GLCM or Texton image along separate directions and then utilized individually by classifiers. A further enhancement in recognition is obtained by combining the individual decisions through the majority voting scheme.

6.1 The IEH-GT Prediction System

The IEH-GT prediction system is shown in Figure 6.1. It comprises two phases, namely feature extraction and classification. Individual SVMs are trained on the features extracted from GLCM and Texton images, and the final prediction is then produced through the majority voting based ensemble technique. Each phase of the IEH-GT prediction system is discussed in detail as follows.


[Figure 6.1 shows the framework: an input image is quantized, five GLCMs (GLCMH, GLCMV, GLCMD, GLCMoD, GLCMF) and seven Texton images (T1-T6 and the fused TF) are constructed, Haralick texture and statistical features are extracted and also combined into a hybrid model, and SVM-based ensemble classification produces the predicted label.]

Figure 6.1: Framework of IEH-GT prediction system

6.1.1 Feature Extraction Phase

In the feature extraction phase, GLCM matrices and Texton images are first constructed from each input fluorescence microscopy protein image, and statistical measures are then computed from the resulting GLCM matrices and Texton images. The input image is quantized to 16 gray levels prior to GLCM and Texton image construction and feature extraction.
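The quantization step can be sketched as follows, assuming 8-bit input intensities; the helper name `quantize` is illustrative.

```python
import numpy as np

def quantize(img, levels=16):
    """Map 8-bit intensities (0-255) onto `levels` gray levels."""
    img = np.asarray(img, float)
    return np.clip((img * levels / 256).astype(int), 0, levels - 1)
```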

6.1.1.1 GLCM Construction

GLCM matrices are usually constructed along the horizontal, vertical, diagonal, and off-diagonal orientations. Features are extracted from each GLCM separately and the computed features are then combined to form a single representative feature space. It is observed that this single combined feature space is not a true representation of fluorescence microscopy protein images that belong to different categories. Therefore, in this work, the GLCMs along the four directions are constructed separately and are not combined, with the intention that each GLCM along a particular direction may efficiently represent a separate class of fluorescence microscopy protein image. The classifier is thus better able to discriminate classes that fall close together in the feature space.

In addition to the aforementioned four GLCMs, a combined GLCM is also obtained by


their fusion. Haralick coefficients are then extracted from each of the five GLCMs, which

are utilized later by the classifiers.
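A minimal sketch of directional GLCM construction and one Haralick coefficient follows; the unit pixel offsets per orientation are an assumption, and the thesis extracts the full set of Haralick coefficients rather than only the contrast shown here.

```python
import numpy as np

# pixel offsets for the four orientations: horizontal, vertical,
# diagonal and off-diagonal (unit displacement assumed)
OFFSETS = {"H": (0, 1), "V": (1, 0), "D": (1, 1), "oD": (1, -1)}

def glcm(img, offset, levels=16):
    """Normalised co-occurrence matrix of a quantised image for one offset."""
    dr, dc = offset
    rows, cols = img.shape
    m = np.zeros((levels, levels))
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            m[img[r, c], img[r + dr, c + dc]] += 1
    return m / m.sum()

def contrast(p):
    """One Haralick coefficient: sum over (i, j) of (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())
```

The fused GLCMF can then be obtained by summing the four directional matrices and renormalising.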

6.1.1.2 Texton Image Construction

Texton masks are utilized to construct six Texton images from each input fluorescence

microscopy protein image. The procedure of Texton image construction with the help of

Texton masks is discussed in section 2.2.2.

Usually, a GLCM is constructed from a Texton image and feature spaces are then generated from this GLCM. In this work, by contrast, statistical features are generated directly from the individual Texton images without transforming them into a GLCM. The rationale behind constructing feature spaces directly from Texton images rather than from GLCM matrices is to obtain distinctive feature spaces capable of representing fluorescence microscopy protein images on an individual basis, which consequently helps in discriminative learning. In addition to the six Texton images, a combined Texton image is also obtained through the fusion of these six Texton images. Ten statistical features are then extracted from each Texton image, which are later utilized in the classification phase.
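A few such statistical descriptors can be sketched as below; the exact ten features used in the thesis are not reproduced here, so this selection of five is illustrative only.

```python
import numpy as np

def texton_stats(img, levels=16):
    """Five illustrative statistical descriptors of a (quantised) Texton
    image; the thesis extracts ten such features per image."""
    x = np.asarray(img, float).ravel()
    mu, sd = x.mean(), x.std()
    hist = np.bincount(x.astype(int), minlength=levels) / x.size
    nz = hist[hist > 0]
    return {
        "mean": mu,
        "std": sd,
        "skewness": float(((x - mu) ** 3).mean() / (sd ** 3 + 1e-12)),
        "energy": float((hist ** 2).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
    }
```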

6.1.1.3 The Hybrid Model

A hybrid feature space is also constructed by combining all the individual feature spaces developed from the seven Texton images and five GLCM matrices, in the hope that it may enhance the overall performance of the classification system.

6.1.2 Classification Phase

The IEH-GT prediction system utilizes RBF-SVM as the base classifier in the classification phase, as shown in Figure 6.2. A separate SVM is trained on each feature space extracted from the GLCM matrices, the Texton images, and the hybrid model; hence, 13 SVMs are utilized in this phase. The final prediction is obtained through majority voting over all the trained SVMs. The performance of the proposed IEH-GT prediction system is assessed on the 2D HeLa dataset.
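The voting step itself reduces to taking the most frequent label across the 13 per-classifier predictions for each sample; a minimal sketch (the helper name is illustrative):

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions by majority voting.
    `predictions` has shape (n_classifiers, n_samples); ties are
    broken in favour of the first label encountered."""
    predictions = np.asarray(predictions)
    return np.array([Counter(col).most_common(1)[0][0]
                     for col in predictions.T])
```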


[Figure 6.2 shows the classification phase: GLCM and Texton image features feed SVM1, SVM2, ..., SVMn; their individual predictions are combined by majority voting to give the final prediction.]

Figure 6.2: Classification phase of IEH-GT system

6.2 Results and Discussion

The objective of our simulation-based study is to validate the effectiveness of the proposed IEH-GT prediction system, which is based on statistics computed from GLCM matrices and Texton images. We have employed the 10-fold cross-validation protocol to assess the performance of the proposed model. The measures used to assess the effectiveness of the prediction system include accuracy, sensitivity, specificity, MCC, and F-Score.
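For a binary (one-class-versus-rest) view, these measures follow directly from the confusion-matrix counts; a sketch using the standard definitions (the helper name is illustrative):

```python
import numpy as np

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, MCC and F-score from the
    entries of a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)                      # recall / sensitivity
    spe = tn / (tn + fp)                      # specificity
    pre = tp / (tp + fp)                      # precision
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    f_score = 2 * pre * sen / (pre + sen)
    return acc, sen, spe, mcc, f_score
```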

6.2.1 Analysis of GLCM based Features for HeLa Dataset

Table 6.1 presents the performance of the IEH-GT prediction system using Haralick features computed from GLCMs constructed along different directions. Column 1 shows the GLCM type from which the feature space is constructed; columns 2-6 give the performance measures.

Table 6.1: Predictions of individual SVMs using Haralick features for HeLa dataset

GLCM type Acc Sen Spe MCC F−Score

GLCMH 72.0 68.6 68.1 0.23 0.31

GLCMD 72.9 69.0 65.6 0.21 0.29

GLCMV 71.8 68.3 67.1 0.22 0.30

GLCMoD 73.0 70.2 67.5 0.24 0.31

GLCMF 74.8 69.2 67.0 0.23 0.30

In this work, we employed Haralick coefficients to discriminate among different patterns

of HeLa cell lines. The computed Haralick features from GLCMs constructed along different

directions individually have produced diverse results, which ultimately led to the improved


performance of the ensemble classification model. The analysis revealed that each version

of GLCM holds information of some particular pattern present in different images.

Table 6.2: Confusion matrix using Haralick features from GLCMH

Actin 77 0 7 0 0 0 1 11 2 0

DNA 0 82 0 4 0 0 0 0 1 0

Endosome 18 0 44 0 0 0 8 10 10 1

ER 2 2 3 65 0 0 2 7 5 0

Golgia 0 0 2 0 66 18 0 0 0 1

Golgpp 0 0 1 2 12 64 3 0 0 3

Lysosome 1 0 5 2 2 1 69 3 0 1

Microtubule 18 0 5 11 0 0 0 51 6 0

Mitochondria 3 0 14 14 0 0 0 8 34 0

Nucleolus 0 0 1 1 1 3 5 0 0 69

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

For example, the confusion matrix presented in Table 6.2 indicates that Golgia proteins

are misclassified 18 times as Golgpp proteins due to the similarity of protein structure of

these classes in the fluorescence microscopy images of HeLa dataset. Golgia and Golgpp

proteins are classified with the accuracies of 75.8% and 75.2%, respectively, using the fea-

tures extracted from GLCMH. The tight structure of proteins around the nucleus in the

images of these two classes lead the SVM to wrong predictions. For instance, Figure 6.3-(a) is a Golgia protein but is predicted as an instance of the Golgpp class. The closest instance of the Golgpp class of the HeLa dataset in the Euclidean space is shown in Figure 6.3-(b). This fact is revealed by the scatter plot in Figure 6.4, where the fluorescence microscopy protein image of Figure 6.3-(a) is plotted in the feature space with all the instances of the Golgpp protein class. The Golgia instance is represented by a red circle in Figure 6.4, whereas the Golgpp instances are represented by green plus signs. The Golgia protein presented in Figure 6.3-(a) is correctly predicted by the SVM using the Haralick features extracted from GLCMD and GLCMoD, which reveals that these two GLCMs have the ability to extract features from this particular protein. This strengthens our argument for the split-GLCM concept, in which GLCMs are constructed along different directions separately so that

split GLCM in which we constructed GLCMs along different directions separately so that


(a) Golgia protein (b) Golgpp protein

Figure 6.3: Golgia protein is wrongly predicted as Golgpp protein

meaningful features from different images could be extracted.

Figure 6.4: Golgia protein is represented by red circle and Golgpp class is shown by green

plus sign

Consequently, this separation makes different GLCMs suitable for images with features that differ from one another. Similarly, Golgpp proteins are wrongly predicted as Golgia proteins 12 times. Again, the similar structures in these two classes lead the SVM to wrong conclusions about the labels of these misclassified fluorescence microscopy protein images. For example, Figure 6.5-(a) shows a Golgpp protein that is wrongly predicted as a Golgia protein. The minimum Euclidean distance from the Golgpp protein of Figure 6.5-(a) is found to the Golgia


protein of Figure 6.5-(b). Figure 6.6 illustrates the feature space of the Golgia class in relation to one instance of the Golgpp class. This instance, shown by a red circle, is misclassified as a Golgia protein.
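The closest-instance analysis used in Figures 6.3 and 6.5 amounts to a nearest-neighbour search in the feature space; a sketch (the helper name is hypothetical):

```python
import numpy as np

def closest_instance(query, X):
    """Index of the row of X nearest to `query` in Euclidean distance."""
    dists = np.linalg.norm(np.asarray(X) - np.asarray(query), axis=1)
    return int(np.argmin(dists))
```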

(a) Golgpp protein (b) Golgia protein

Figure 6.5: Golgpp protein is wrongly predicted as Golgia protein

On the other hand, Lysosomal proteins are predicted with a quite good accuracy of 82.1%, as shown in Table 6.2, even though there is considerable similarity between these proteins and the Endosome proteins.

Figure 6.6: Golgpp instance is represented by red circle and Golgia class is shown by green

plus sign


In Table 6.2, there are only 5 misclassifications of Lysosomal proteins as Endosomes. At the same time, however, these features fail to capture the information in the fluorescence microscopy protein images of the Mitochondria class, where the accuracy is merely 46.5%. Most of these patterns are wrongly classified as Endosome, ER, and Microtubule. Similarly, Tables 6.3, 6.4, and 6.5 present the confusion matrices generated by the SVM using features from GLCMD, GLCMV, and GLCMoD, respectively.

Table 6.3: Confusion matrix using Haralick features computed from GLCMD

Actin 77 0 7 0 0 0 0 12 2 0

DNA 0 84 0 3 0 0 0 0 0 0

Endosome 14 0 48 1 0 0 8 11 9 0

ER 1 1 0 62 0 0 2 10 10 0

Golgia 0 0 1 0 68 16 0 0 0 2

Golgpp 0 0 2 1 20 55 2 0 0 5

Lysosome 2 0 7 3 1 3 64 2 1 1

Microtubule 13 0 6 12 0 0 0 55 5 0

Mitochondria 6 0 5 15 0 0 0 4 43 0

Nucleolus 0 2 0 0 1 2 0 0 2 73

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

Table 6.4: Confusion matrix using Haralick features computed from GLCMV

Actin 76 0 6 0 0 0 1 13 2 0

DNA 0 82 0 4 0 0 0 0 1 0

Endosome 16 0 44 0 0 0 9 10 11 1

ER 1 1 1 64 0 0 3 8 8 0

Golgia 0 0 2 0 62 21 0 0 0 2

Golgpp 0 0 2 1 14 62 2 0 0 4

Lysosome 1 0 8 1 1 2 67 2 1 1

Microtubule 16 0 5 11 0 0 1 51 7 0

Mitochondria 2 1 10 10 0 0 1 10 38 1

Nucleolus 0 0 0 1 1 2 2 0 1 73

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus


Table 6.5: Confusion matrix using Haralick features computed from GLCMoD

Actin 77 0 12 0 0 0 0 9 0 0

DNA 0 84 0 2 0 0 0 0 0 1

Endosome 16 1 46 1 0 0 10 9 8 0

ER 1 2 2 66 0 0 1 8 6 0

Golgia 0 0 1 0 66 19 0 0 1 0

Golgpp 0 0 4 0 13 62 1 0 1 4

Lysosome 1 0 7 0 2 4 64 2 2 2

Microtubule 15 0 12 9 0 0 0 53 2 0

Mitochondria 5 1 12 8 0 0 1 6 40 0

Nucleolus 0 0 1 1 1 3 1 0 1 72

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

Mitochondrial proteins are distributed symmetrically around the nucleus, similar to the Endosome, ER, and Microtubule proteins. The SVM is able to classify Endosomes with an accuracy of 60.4% using Haralick features computed from GLCMF, as shown in Table 6.6.

Table 6.6: Confusion matrix using Haralick features computed from GLCMF

Actin 80 0 8 0 0 0 0 10 0 0

DNA 0 83 0 4 0 0 0 0 0 0

Endosome 9 0 55 2 0 0 8 7 10 0

ER 1 1 1 67 0 0 2 7 7 0

Golgia 0 0 0 0 66 18 0 0 2 1

Golgpp 0 0 1 2 15 63 1 0 0 3

Lysosome 0 0 10 2 1 4 62 2 1 2

Microtubule 13 0 10 8 0 1 0 53 6 0

Mitochondria 4 0 7 9 0 0 0 7 46 0

Nucleolus 0 1 0 1 2 4 1 0 1 70

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

On the other hand, the GLCM constructed along any single orientation except GLCMD


was not able to achieve an accuracy higher than 52.7% for these images. The diversity among the individual SVMs utilizing features from the GLCM matrices can be understood from the analysis of Table 6.7. This table shows the original labels of the Golgia proteins misclassified as Golgpp proteins by the individual SVMs using Haralick features computed from GLCMs constructed along different directions. The disagreements among the different classifiers yield the diversified members of the IEH-GT prediction system.

Table 6.7: Golgia proteins classified as Golgpp proteins

GLCMH GLCMD GLCMV GLCMoD GLCMF

- - golgia 002 - golgia 002

- - - - golgia 003

- golgia 004 - golgia 004 golgia 004

- - golgia 005 - -

golgia 006 - - - -

golgia 007 - - - -

golgia 008 - - - -

golgia 009 - - - -

- - - golgia 011 -

- - - golgia 012 -

- - - - golgia 014

- - golgia 015 - -

- golgia 016 - - -

golgia 017 - golgia 017 - -

- - - golgia 018 -

- golgia 022 - - -

golgia 024 golgia 024 - - golgia 024

golgia 025 - - - -

- - - golgia 028 -

- - golgia 031 - -

golgia 033 - - - golgia 033

- golgia 034 - - -

golgia 035 golgia 035 - - golgia 035

- - - - golgia 036

- golgia 037 golgia 037 - -

golgia 038 - - - -

- - golgia 039 - golgia 039

golgia 040 golgia 040 - - -

golgia 042 - - golgia 042 -

- golgia 043 golgia 043 - -

- - golgia 044 - -

- golgia 046 golgia 046 - -

- - golgia 049 - -

- - - golgia 050 golgia 050

- - - golgia 051 golgia 051

- golgia 052 - - -

- - golgia 055 golgia 055 -

- golgia 056 - - -

golgia 057 - golgia 057 - -

- golgia 058 - - -

- - golgia 059 - golgia 059


- - - - golgia 060

- - - golgia 061 golgia 061

- - - golgia 062 -

- - golgia 063 golgia 063 -

- - golgia 064 - -

- golgia 066 golgia 066 golgia 066 -

- - golgia 067 - golgia 067

- - - - golgia 068

golgia 069 - - - -

- - - - golgia 072

golgia 073 - - - -

- - - golgia 074 -

- golgia 076 golgia 076 golgia 076 -

- - - golgia 077 -

golgia 079 - - - -

- - - - golgia 080

- - - golgia 081 -

- - - golgia 083 -

golgia 084 - - - -

- golgia 085 golgia 085 golgia 085 -

golgia 087 - golgia 087 - -

6.2.2 Analysis of Texton Image based Features for HeLa Dataset

The prediction performance of the proposed IEH-GT prediction system using statistical features computed from Texton images constructed with different Texton masks is presented in Table 6.8. Column 1 shows the Texton mask; columns 2-6 give the performance measures used for the assessment of the prediction system.

Table 6.8: Predictions of individual SVMs using statistical features computed from Texton

images

Texton Mask Acc Sen Spe MCC F−Score

T1 63.8 62.6 57.7 0.12 0.23

T2 60.0 62.7 58.3 0.13 0.24

T3 63.4 65.0 60.1 0.15 0.25

T4 60.4 61.6 56.9 0.11 0.23

T5 63.8 65.8 60.5 0.16 0.26

T6 63.6 67.5 60.1 0.17 0.26

Fused Textons 56.3 53.2 53.7 0.04 0.19

Individual SVMs have produced reasonable results using features from Texton images.


Unlike the features from the GLCMs, the features from the fused Texton image possess redundancy, which is why the SVM produced a lower prediction rate of 56.3% for these features. The confusion matrix generated by the SVM using the features extracted from the Texton image employing the T1 Texton mask is presented in Table 6.9.

Table 6.9: Confusion matrix using statistical features computed from Texton image con-

structed using T1

Actin 72 1 8 3 0 0 1 10 2 1

DNA 0 79 0 7 0 0 0 0 1 0

Endosome 9 1 42 4 3 0 15 6 11 0

ER 5 4 4 54 0 0 0 13 6 0

Golgia 0 0 1 0 53 30 1 0 2 0

Golgpp 0 0 1 1 12 65 4 0 0 2

Lysosome 3 0 11 1 2 11 51 0 1 4

Microtubule 14 2 5 29 0 1 1 33 5 1

Mitochondria 7 1 13 11 1 0 2 3 35 0

Nucleolus 0 0 1 1 0 7 4 0 1 66

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

Figure 6.7: Endosome instance is shown by red circle and Lysosome class is represented by

green plus sign


It is evident that the SVM classifier confuses some instances of the Endosome and Lysosome proteins, because the proteins are concentrated along one side of the nucleus in both types of images. In addition, stains can be observed throughout the cytoplasm in both kinds of fluorescence microscopy protein images. Similarly, Endosome proteins are also confused with Mitochondrial proteins, since staining similar to that of Endosome protein images can be observed in Mitochondrial images.

(a) Endosome protein (b) Lysosome protein

Figure 6.8: Similar patterns can be observed in the two images

Table 6.10: Confusion matrix using statistical features computed from Texton image con-

structed using T2

Actin 76 1 6 1 0 0 2 11 1 0

DNA 0 79 1 4 0 0 0 2 1 0

Endosome 12 1 43 2 3 1 13 4 11 1

ER 10 5 9 37 0 0 0 19 6 0

Golgia 0 0 2 0 38 40 3 0 4 0

Golgpp 0 0 1 1 6 65 10 0 0 2

Lysosome 4 0 14 0 2 11 49 0 1 3

Microtubule 23 2 7 20 0 1 0 34 3 1

Mitochondria 8 2 17 7 3 1 1 5 29 0

Nucleolus 0 0 0 0 1 5 1 0 2 71

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus


Similar patterns can be observed in the images presented in Figure 6.8, where the Endosome and Lysosome instances exhibit similar textures. This indicates that the T1 Texton mask extracts patterns from images that possess horizontal texture; that is why the two images belonging to different classes are categorized as instances of the same class. Likewise, the confusion matrices generated by the SVM using statistical features computed from Texton images constructed with T2, T3, T4, T5, and T6 are presented in Tables 6.10, 6.11, 6.12, 6.13, and 6.14, respectively. Table 6.15 presents the confusion matrix generated by

Table 6.11: Confusion matrix using statistical features computed from Texton image con-

structed using T3

Actin 75 0 8 3 0 0 2 9 1 0

DNA 0 76 0 9 0 0 0 1 0 1

Endosome 8 1 46 2 3 1 17 5 8 0

ER 7 3 5 44 1 0 1 18 7 0

Golgia 0 0 2 0 54 28 1 0 2 0

Golgpp 0 0 2 1 13 64 4 0 0 1

Lysosome 3 0 8 2 2 12 53 0 0 4

Microtubule 15 1 7 22 0 2 1 38 5 0

Mitochondria 5 0 14 12 3 1 0 7 31 0

Nucleolus 0 0 1 1 0 6 6 0 0 66

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

the SVM using the statistical features extracted from a Texton image obtained from the fusion of the Texton images constructed using the aforementioned Texton masks. Due to the overlapping information, the SVM could not produce comparable results using the features extracted from the combined Texton image.


Table 6.12: Confusion matrix using statistical features computed from Texton image con-

structed using T4

Actin 70 1 6 8 0 0 2 8 3 0

DNA 1 79 1 3 0 0 1 1 0 1

Endosome 12 0 37 2 4 1 13 5 17 0

ER 15 1 4 41 0 0 0 16 9 0

Golgia 0 0 2 0 48 32 2 0 3 0

Golgpp 0 0 3 0 8 68 5 0 0 1

Lysosome 1 0 12 1 3 13 49 3 0 2

Microtubule 19 3 11 30 0 1 1 25 1 0

Mitochondria 9 0 15 8 4 1 1 5 30 0

Nucleolus 0 1 2 0 0 5 1 0 0 71

Columns (predicted class): Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus

Table 6.13: Confusion matrix using statistical features computed from the Texton image constructed using T5 (rows: true class; columns: predicted class)

              Actin  DNA  Endosome  ER  Golgia  Golgpp  Lysosome  Microtubule  Mitochondria  Nucleolus
Actin            73    0        11   5       0       0         1            8             0          0
DNA               0   78         0   7       0       0         0            1             1          0
Endosome         11    0        43   3       3       0        13            5            12          1
ER                5    3         3  54       0       0         0           11            10          0
Golgia            0    0         1   0      51      31         1            0             3          0
Golgpp            0    0         3   0      11      66         3            0             0          2
Lysosome          4    0        11   1       2      11        51            0             1          3
Microtubule      15    1         5  29       0       2         2           32             5          0
Mitochondria      5    0        14  10       1       1         1            5            36          0
Nucleolus         0    1         0   1       0       5         5            0             2         66


Table 6.14: Confusion matrix using statistical features computed from the Texton image constructed using T6 (rows: true class; columns: predicted class)

              Actin  DNA  Endosome  ER  Golgia  Golgpp  Lysosome  Microtubule  Mitochondria  Nucleolus
Actin            76    0         7   5       0       0         0            9             1          0
DNA               0   78         1   4       0       0         0            2             2          0
Endosome          8    1        44   3       2       1        12            4            16          0
ER                8    3         6  45       0       0         1           14             9          0
Golgia            0    0         1   0      50      28         3            0             5          0
Golgpp            0    0         2   1       8      67         4            0             0          3
Lysosome          3    0         9   0       2      10        54            2             2          2
Microtubule      16    0         8  23       0       1         1           37             4          1
Mitochondria      4    0        22   8       4       1         0            5            29          0
Nucleolus         0    0         0   1       0       5         4            0             1         69

Table 6.15: Confusion matrix using statistical features computed from the Texton image constructed from the fusion of six Texton images (rows: true class; columns: predicted class)

              Actin  DNA  Endosome  ER  Golgia  Golgpp  Lysosome  Microtubule  Mitochondria  Nucleolus
Actin            47    2         7  26       1       0         0            5            10          0
DNA               4   71         1   5       0       0         0            6             0          0
Endosome         10    0        43   6       2       1        13            5            11          0
ER                7    1         8  39       0       0         4           23             3          1
Golgia            0    0         3   0      58      20         4            0             2          0
Golgpp            0    0         6   0      15      57         5            0             0          2
Lysosome          3    0        14   2       3       5        55            1             0          1
Microtubule      10    6         8  25       1       1         0           34             6          0
Mitochondria     11    0        20   9       6       1         1            4            21          0
Nucleolus         0    0         1   1       5       3         8            0             1         61

Table 6.16 shows the original labels of the Endosome proteins misclassified as Lysosome proteins by SVM using the statistical features computed from Texton images constructed along different directions. The diversity among the classifications of the different classifiers can be observed.

Table 6.16: Different Texton image based statistical features lead to different classification results: Endosome proteins misclassified as Lysosome proteins

T1 T2 T3 T4 T5 T6 Fused Texton Image

- endosome 003 endosome 003 - - - -

- - - - endosome 005 - -

- endosome 007 - - - - -

- - endosome 008 - - - -

- - - - endosome 009 - endosome 009

- - - - endosome 010 - -

endosome 011 endosome 011 - - - - -

- endosome 013 - - - - -

- - endosome 014 - - - -

endosome 015 - - - - - -

- - - - endosome 016 - -

- - endosome 017 - - - -

- - - endosome 018 - - -

- - endosome 021 endosome 021 - - -

- - - endosome 023 endosome 023 - endosome 023

- endosome 026 - - - - -

- - endosome 027 - - - -

- - - endosome 028 - endosome 028 -

- endosome 029 - - - - endosome 029

endosome 031 - - - - - endosome 031

endosome 032 endosome 032 endosome 032 - - - -

endosome 034 - - - endosome 034 - -

endosome 035 - - - - - endosome 035

endosome 038 - - - - - endosome 038

- endosome 041 - endosome 041 endosome 041 endosome 041 -

endosome 043 - endosome 043 - - - -

- - - - endosome 044 - -

- - - - endosome 046 - -

- - endosome 047 - endosome 047 - -

endosome 050 - - endosome 050 - - -

- endosome 051 - endosome 051 - - -

- - - - - endosome 052 endosome 052

- - - endosome 053 - - -

- - - - - endosome 054 -

- - - - - - endosome 055

endosome 059 - - - - - -

- endosome 060 - - endosome 060 - -

- endosome 061 - - endosome 061 - endosome 061

- - - - - - endosome 062

endosome 063 - endosome 063 - - - -

- - - - - endosome 065 -

- - - endosome 067 - - -

- - - - - endosome 068 -

- endosome 070 endosome 070 endosome 070 - endosome 070 -

- - endosome 071 - - - -

- - - endosome 072 - - endosome 072

- - - - - endosome 073 -

- - - endosome 074 - endosome 074 -

- - - endosome 076 - endosome 076 -


- - endosome 077 - - - -

- endosome 079 endosome 079 - - - -

endosome 080 - - - - - -

endosome 081 - - - - endosome 081 -

- - endosome 082 - - - endosome 082

endosome 084 - endosome 084 - - - -

- - - - - - endosome 085

- - - - endosome 088 endosome 088 -

- - endosome 089 - - - -

endosome 090 - - - - - -

6.2.3 Analysis of the Hybrid Model for the HeLa Dataset

SVM trained on the Hybrid model achieved the highest accuracy, 79.5%, compared to the individual feature spaces. Among the other performance measures, SVM yielded 83.9% sensitivity, 79.0% specificity, an MCC of 0.43, and an F-score of 0.46. The confusion matrix obtained with the Hybrid model is presented in Table 6.17.

Table 6.17: Confusion matrix obtained using the hybrid model (rows: true class; columns: predicted class)

              Actin  DNA  Endosome  ER  Golgia  Golgpp  Lysosome  Microtubule  Mitochondria  Nucleolus
Actin            90    0         1   0       0       0         2            2             3          0
DNA               0   84         0   1       0       0         0            2             0          0
Endosome          4    0        59   0       1       1         7            8            11          0
ER                1    0         1  69       0       0         1            1            13          0
Golgia            0    0         1   0      75       9         0            0             2          0
Golgpp            0    0         1   1      14      63         3            0             1          2
Lysosome          1    0         9   0       2       2        64            2             2          2
Microtubule       3    0        13   6       0       0         0           62             7          0
Mitochondria      1    0         6   8       0       0         2            9            47          0
Nucleolus         0    0         0   0       1       3         1            0             2         73
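The per-class measures quoted in this section (sensitivity, specificity, MCC, and F-score) follow from one-vs-rest confusion counts. A minimal sketch with hypothetical counts — the numbers below are illustrative, not taken from Table 6.17:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Per-class metrics from one-vs-rest confusion counts."""
    sens = tp / (tp + fn)                     # sensitivity (recall)
    spec = tn / (tn + fp)                     # specificity
    prec = tp / (tp + fp)                     # precision
    f1 = 2 * prec * sens / (prec + sens)      # F-score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sens, "specificity": spec, "f_score": f1, "mcc": mcc}

# Hypothetical one-vs-rest counts for a single class:
m = binary_metrics(tp=90, fp=10, fn=8, tn=762)
```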

It can be observed that the combined feature space possesses more discriminative information than the individual feature spaces; consequently, the SVM trained on the hybrid model outperformed the SVMs trained on the individual features.


6.2.4 Ensemble Analysis

The ensemble is generated by combining the decisions of all the SVMs trained on the GLCM-based Haralick features, the Texton image based statistical features, and the Hybrid model. The resulting confusion matrix is given in Table 6.18. In total, 13 votes are combined, achieving a collective overall prediction accuracy of 95.3%. The sensitivity, specificity, MCC, and F-score values are 97.4%, 95.1%, 0.80, and 0.81, respectively. The results presented in Tables 6.1, 6.8, and 6.18 show that the accuracies of the individual classifiers are modest: the highest accuracy of an individual SVM is 79.5%, obtained with the hybrid model. The prediction accuracy of IEH-GT, however, reaches 95.3%, indicating that the individual SVM classifiers, using features from Texton images and GLCMs along different orientations, produce diversified decisions that lead to the improved prediction performance of the majority voting based ensemble.

Table 6.18: Confusion matrix using the majority voting scheme (rows: true class; columns: predicted class)

              Actin  DNA  Endosome  ER  Golgia  Golgpp  Lysosome  Microtubule  Mitochondria  Nucleolus
Actin            98    0         0   0       0       0         0            0             0          0
DNA               0   87         0   0       0       0         0            0             0          0
Endosome          3    0        85   0       0       0         1            0             2          0
ER                0    0         0  85       0       0         0            1             0          0
Golgia            0    0         0   0      81       6         0            0             0          0
Golgpp            0    0         0   0       0      85         0            0             0          0
Lysosome          0    0         0   0       0       0        84            0             0          0
Microtubule       5    0         1   9       0       0         0           76             0          0
Mitochondria      1    0         7   3       0       0         0            1            61          0
Nucleolus         0    0         0   0       0       0         0            0             0         80
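The majority voting rule underlying IEH-GT can be sketched as follows; the 13 vote values shown are hypothetical, not actual classifier outputs from the experiments:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions for one sample.

    predictions: list of class labels, one per base classifier
    (13 in the IEH-GT ensemble). Ties are broken by the label
    that reaches the top count first.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical decisions from 13 base SVMs for a single image:
votes = ["Lysosome"] * 7 + ["Endosome"] * 4 + ["Mitochondria"] * 2
print(majority_vote(votes))  # -> Lysosome
```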

6.3 Comparative Analysis

The proposed IEH-GT approach shows performance comparable to existing state-of-the-art approaches. In [7], the authors report that their system achieved 95.4% accuracy, utilizing features extracted in multi-resolution subspaces from 2D HeLa images. In another approach, the model developed in [39] yielded 94.2% accuracy on the 2D HeLa dataset, employing an RSE of neural networks. For the same dataset, Nanni et al. built a prediction system that obtained 97.5% accuracy [40].

Table 6.19: Performance comparison with existing approaches on the HeLa dataset

Method                  Summary of the Technique                                    Accuracy (%)
Chebira et al. [7]      Multi-resolution subspaces, weighted majority voting        95.4
Nanni and Lumini [39]   RSE of NNs                                                  94.2
Nanni et al. [40]       RSE of NNs, AdaBoost ensemble of weak learners, sum rule    97.5
Nanni et al. [8]        RSE of NNs                                                  95.8
Nanni et al. [12]       SVM, random subsets of features, 50 classifiers, sum rule   93.2
Lin et al. [6]          AdaBoost variant AdaBoost.ERC                               93.6
IEH-GT (proposed)       Majority voting based ensemble                              95.3

However, that significant improvement is achieved by combining two ensembles through the sum rule. Nanni et al. developed another prediction model for the 2D HeLa dataset that yielded 95.8% accuracy [8], again employing an RSE of neural networks. They also produced 93.2% accuracy on this dataset with 50 SVMs, each trained on a separate feature space obtained as a random subset of features [12]; the final decision is made using the sum rule. An accuracy of 93.6% is reported in [6] for the 2D HeLa dataset, using AdaBoost.ERC as the ensemble classifier. In contrast to all the aforementioned techniques, IEH-GT is a simple ensemble of 13 SVMs that achieves 95.3% accuracy, with features obtained in the spatial domain only. The only technique producing 2.2% higher accuracy than IEH-GT is proposed in [40]; however, that accuracy is achieved using a complex ensemble based on neural networks and AdaBoost.

We report a feature extraction mechanism for effective classification of subcellular localization images. It follows a divide-and-conquer strategy, targeting individual fluorescence microscopy protein images on the basis of their histogram co-occurrence information as well as the distribution of their texture patterns. A single combined GLCM may not be truly representative of the whole dataset; therefore, individual GLCMs along the four orientations are utilized to recognize individual protein images from fluorescence microscopy. A similar approach is employed for the Texton image based features: statistical features of different patterns are obtained so that individual fluorescence microscopy protein images are well addressed by the classification system. A GLCM constructed along a certain orientation captures information precisely for one image class, yet may not be the appropriate representation for another class. When these multiple GLCMs are combined, the information held in the individual GLCMs is suppressed, and the resulting combined GLCM may not represent any specific image class well. Although the combined GLCM might be a generalized form for all the images, it may not contribute well to the overall prediction performance of the underlying system. The significance of the proposed GLCM and Texton image based prediction system is therefore evident from the observed prediction performance.
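The orientation-wise treatment argued for above can be illustrated with a toy GLCM computation: one co-occurrence matrix, and one Haralick-style statistic, per displacement, rather than a single matrix summed over all directions. The 4x4 image and the choice of contrast as the statistic are illustrative assumptions only:

```python
def glcm(image, dx, dy, levels):
    """Gray-level co-occurrence matrix for one displacement (dx, dy).

    image: 2-D list of ints in [0, levels); the displacement encodes
    one orientation, e.g. (1, 0) for 0 degrees or (1, -1) for 45 degrees.
    """
    m = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                m[image[y][x]][image[ny][nx]] += 1
    return m

def contrast(m):
    """Haralick contrast: sum of P(i, j) * (i - j)^2 over a normalized GLCM."""
    total = sum(sum(row) for row in m)
    return sum(m[i][j] * (i - j) ** 2
               for i in range(len(m)) for j in range(len(m))) / total

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]

# One GLCM (and one contrast value) per orientation, as in IEH-GT,
# rather than a single matrix summed over all directions:
offsets = {"0deg": (1, 0), "45deg": (1, -1), "90deg": (0, -1), "135deg": (-1, -1)}
features = {name: contrast(glcm(img, dx, dy, 4)) for name, (dx, dy) in offsets.items()}
```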

We have shown that treating GLCMs and Texton images along different orientations separately exploits the hidden information in protein fluorescence microscopy images well. Evidently, GLCMs and Texton images along a certain direction are more suitable for extracting certain patterns, and their ensemble subsequently forms the basis for improved prediction performance. The proposed IEH-GT technique may thus be promising for studying structural variations of biological components or organisms.

Summary

In this chapter, the IEH-GT prediction system, based on features computed from GLCMs and Texton images along different orientations, is proposed. The analysis reveals that a GLCM or Texton image constructed along a single orientation may be efficient at extracting useful information from a particular class of images; each orientation thus describes some part of the dataset, which helps in building a diversified ensemble. The next chapter presents concluding remarks along with future directions.


Chapter 7

Conclusions and Future Directions

Analysis of fluorescence microscopy based protein images through efficient and reliable automated systems is a basic need in bioinformatics, computational biology, and the biological sciences. Developing such automated systems requires protein subcellular localization images to be represented numerically through informative and discriminative feature extraction strategies. To this end, different approaches have been developed in this thesis. Conclusive remarks on the conducted research are presented next, followed by possible future directions.

7.1 Conclusive Remarks

In this thesis, reliable and effective automated systems for protein subcellular localization have been developed based on discriminative feature extraction strategies. Moreover, the employed ensemble classification systems have exploited the generated feature spaces well, improving the overall prediction performance. The proposed approaches achieve significant improvements in accuracy compared to other approaches in the literature.

In Chapter 3, we studied different spatial and transform domain features and their effect on the classification capability of the developed SVM-SubLoc prediction system. The discriminative power of Haralick and Zernike features is enhanced in the transform domain with DWT; in particular, decomposition level 2 captures most of the information and improved the classification capability of the classifier. Further, different hybrid features formed by concatenating individual features also improved the prediction performance of


the classification system. In addition, the overall performance of the prediction system is further enhanced through the majority voting ensemble, where the misclassifications of one classifier are compensated by the others. No single member classifier of an ensemble may know everything, but every member certainly knows something; their ensemble therefore performed better than the individual classifiers. In general, the use of hybrid models in the transform and spatial domains, coupled with ensemble classification, yielded a prediction system that outperformed existing state-of-the-art approaches.

Chapter 4 introduced the notion of oversampling in the feature space in order to re-

duce the imbalance of data. In this connection, SMOTE is utilized to introduce synthetic

samples in the feature space, due to which the classifier bias towards the majority class is reduced. It is observed that the prediction performance of the proposed RF-SubLoc prediction system under SMOTE is directly proportional to the imbalance present in the original dataset: the greater the imbalance, the greater the improvement in the classifier's prediction performance under SMOTE. Consequently, the overall prediction performance of the system is enhanced with the oversampled features. Experimental analysis shows that RF-SubLoc is superior to other state-of-the-art approaches reported in the literature.
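SMOTE's core step — interpolating a synthetic point between a minority sample and one of its nearest neighbors — can be sketched as follows. The toy 2-D feature vectors are assumptions; the full algorithm additionally chooses k and the oversampling rate:

```python
import random

def smote_sample(x, neighbor, rng=random):
    """SMOTE interpolation: a synthetic point on the segment
    between a minority sample and one of its nearest neighbors."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def nearest_neighbor(x, candidates):
    """Brute-force nearest neighbor by squared Euclidean distance."""
    return min(candidates, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))

# Hypothetical minority-class feature vectors:
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
x = minority[0]
nb = nearest_neighbor(x, minority[1:])
synthetic = smote_sample(x, nb)  # lies on the segment between x and nb
```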

Chapter 5 focused on exploiting the discriminative power of LTP with SMOTE oversampling. It is observed that URI-LTP patterns are highly discriminative for fluorescence microscopy protein images, which is why feature selection with mRMR failed to further improve their discriminative power. Hence, it is concluded that LTP patterns may already possess sufficient discriminative power, leaving little room for further improvement through feature selection.
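The LTP coding referred to here thresholds each neighbor against the center pixel with a tolerance t and splits the ternary result into upper and lower binary patterns (after Tan and Triggs). A sketch for a single 3x3 patch; the pixel values and bit ordering are illustrative choices:

```python
def ltp_codes(patch, t):
    """Local Ternary Pattern of a 3x3 patch (Tan and Triggs style).

    Each neighbor is coded +1 / 0 / -1 against the center value with
    tolerance t; the ternary code is then split into the usual upper
    and lower binary patterns.
    """
    c = patch[1][1]
    # Neighbors clockwise from the top-left corner:
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    ternary = [1 if n >= c + t else -1 if n <= c - t else 0 for n in neighbors]
    upper = sum((1 if s == 1 else 0) << i for i, s in enumerate(ternary))
    lower = sum((1 if s == -1 else 0) << i for i, s in enumerate(ternary))
    return upper, lower

# Hypothetical gray-level patch; center 55, tolerance 3:
patch = [[54, 57, 61],
         [52, 55, 60],
         [50, 48, 56]]
upper, lower = ltp_codes(patch, t=3)
```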

Chapter 6 provided an in-depth analysis of GLCM and Texton images for extracting discriminative information from protein subcellular localization images. GLCM and Texton images are able to extract diverse information along different orientations from the same image. The simulation results show that the performance improvement of features extracted from GLCMoD over GLCMF is marginal; hence, in some cases, constructing GLCMs and Texton images along all directions might be avoided.


7.2 Future Directions

The proposed prediction systems have obtained significant improvements in classification accuracy over other state-of-the-art approaches. In Chapter 3, we explained the method of obtaining Haralick features and Zernike moments in the sub-bands through DWT, where the highest classifier performance is reported using features extracted at decomposition level 2. In future, more decomposition levels will be explored to probe further information that might be useful for discriminating subcellular structures.
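A multi-level DWT pipeline of the kind described can be sketched with a 1-D Haar transform; the level-2 bands correspond to the decomposition level highlighted above. The signal values are illustrative, and the thesis's actual wavelet and 2-D sub-band layout are not restated here:

```python
import math

def haar_step(signal):
    """One Haar DWT level: (approximation, detail) coefficients."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_multilevel(signal, levels):
    """Repeatedly decompose the approximation band, as in a level-2
    (or deeper) wavelet feature extraction pipeline."""
    bands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)       # detail band of this level
    bands.append(approx)           # final approximation band
    return bands

# Hypothetical 8-sample signal, decomposed to level 2:
bands = haar_multilevel([4.0, 2.0, 6.0, 8.0, 4.0, 4.0, 2.0, 0.0], levels=2)
```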

We have constructed different hybrid feature spaces to enhance the classifier's performance. In future, one may analyze weighted combinations of these features, which could lead to further enhanced performance of the classification system. Further, various pre-processing techniques applied prior to feature extraction might be explored in order to study their usefulness in discriminating fluorescence microscopy protein images.

The performance of prediction systems greatly depends on the discriminative power of the features extracted from fluorescence microscopy protein images. In most cases, the classification system depends on the successful execution of an ensemble. Therefore, we suggest developing a novel feature extraction strategy, or modifying an existing technique, capable of enabling a single classifier to yield high prediction performance without the aid of an ensemble.

In the current thesis, the proposed prediction systems were developed for 2D fluorescence microscopy protein images only. However, 3D protein images from fluorescence microscopy capture more information, which could be used in the development of more sophisticated and efficient classification systems. The proposed prediction systems could easily be modified for the analysis and classification of 3D images.

The feature extraction strategies utilized in this thesis are capable of characterizing

different properties of protein subcellular localization images. An additional benefit may

be to develop automated systems that are able to extract images from online journals and

estimate the research outcomes for a certain field for a specified duration. This will help in

finding new research trends in the field of image processing.

We have utilized many existing feature extraction strategies. Among them, LTP and TAS require user-defined thresholds for their computation. Researchers might focus on developing a system that automatically determines this threshold in accordance with the underlying dataset.

We have also developed web-based predictors, which are freely available to the research community. In future, the computational complexity and real-time cost of these predictors could also be reduced.


References

[1] G. Srinivasa, T. Merryman, A. Chebira, J. Kovacevic, and A. Mintos, “Adaptive mul-

tiresolution techniques for subcellular protein location classification,” in Proceedings

of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5.

IEEE, 2006, pp. 14–19.

[2] R. F. Murphy, M. V. Boland, and M. Velliste, “Towards a systematics for protein

subcellular location: Quantitative description of protein localization patterns and au-

tomated analysis of fluorescence microscope images.” in ISMB, 2000, pp. 251–259.

[3] S.-B. Wan, L.-L. Hu, S. Niu, K. Wang, Y.-D. Cai, W.-C. Lu, and K.-C. Chou, “Identi-

fication of multiple subcellular locations for proteins in budding yeast,” Current Bioin-

formatics, vol. 6, no. 1, pp. 71–80, 2011.

[4] Y.-Y. Xu, F. Yang, Y. Zhang, and H.-B. Shen, “An image-based multi-label human

protein subcellular localization predictor (ilocator) reveals protein mislocalizations in

cancer tissues,” Bioinformatics, vol. 29, no. 16, pp. 2032–2040, 2013.

[5] M. V. Boland and R. F. Murphy, “A neural network classifier capable of recognizing

the patterns of all major subcellular structures in fluorescence microscope images of

hela cells,” Bioinformatics, vol. 17, no. 12, pp. 1213–1223, 2001.

[6] C.-C. Lin, Y.-S. Tsai, Y.-S. Lin, T.-Y. Chiu, C.-C. Hsiung, M.-I. Lee, J. C. Simpson,

and C.-N. Hsu, “Boosting multiclass learning with repeating codes and weak detectors

for protein subcellular localization,” Bioinformatics, vol. 23, no. 24, pp. 3374–3381,

2007.

109

Page 132: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

[7] A. Chebira, Y. Barbotin, C. Jackson, T. Merryman, G. Srinivasa, R. F. Murphy,

and J. Kovacevic, “A multiresolution approach to automated classification of protein

subcellular location images,” BMC bioinformatics, vol. 8, no. 1, p. 210, 2007.

[8] L. Nanni, S. Brahnam, and A. Lumini, “Novel features for automated cell phenotype

image classification,” in Advances in Computational Biology. Springer, 2010, pp.

207–213.

[9] R. F. Murphy, “Automated proteome-wide determination of subcellular location using

high throughput microscopy,” in Proceedings of 5th IEEE International Symposium on

Biomedical Imaging: From Nano to Macro. IEEE, 2008, pp. 308–311.

[10] M. Riffle and T. N. Davis, “The yeast resource center public image repository: A large

database of fluorescence microscopy images,” BMC bioinformatics, vol. 11, no. 1, p.

263, 2010.

[11] R. F. Murphy, M. Velliste, and G. Porreca, “Robust numerical features for descrip-

tion and classification of subcellular location patterns in fluorescence microscope im-

ages,” Journal of VLSI signal processing systems for signal, image and video technology,

vol. 35, no. 3, pp. 311–321, 2003.

[12] L. Nanni, S. Brahnam, and A. Lumini, “Selecting the best performing rotation invariant

patterns in local binary/ternary patterns.” in IPCV, 2010, pp. 369–375.

[13] X. Xiao, S. Shao, Y. Ding, Z. Huang, and K.-C. Chou, “Using cellular automata

images and pseudo amino acid composition to predict protein subcellular location,”

Amino acids, vol. 30, no. 1, pp. 49–54, 2006.

[14] K.-C. Chou, Z.-C. Wu, and X. Xiao, “iloc-hum: using the accumulation-label scale to

predict subcellular locations of human proteins with both single and multiple sites,”

Molecular Biosystems, vol. 8, no. 2, pp. 629–641, 2012.

[15] K.-C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid

composition,” Journal of theoretical biology, vol. 273, no. 1, pp. 236–247, 2011.

[16] J. W. Lichtman and J.-A. Conchello, “Fluorescence microscopy,” Nature Methods,

vol. 2, no. 12, pp. 910–919, 2005.

110

Page 133: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

[17] J. Newberg, J. Hua, and R. F. Murphy, “Location proteomics: systematic determina-

tion of protein subcellular location,” in Systems Biology. Springer, 2009, pp. 313–332.

[18] K. Huang and R. F. Murphy, “Data mining methods for a systematics of protein

subcellular location,” in Data Mining in Bioinformatics. Springer, 2005, pp. 143–187.

[19] V. Ljosa and A. E. Carpenter, “Introduction to the quantitative analysis of two-

dimensional fluorescence microscopy images for cell-based screening,” PLoS compu-

tational biology, vol. 5, no. 12, p. e1000603, 2009.

[20] M. V. Boland, M. K. Markey, R. F. Murphy et al., “Automated recognition of patterns

characteristic of subcellular structures in fluorescence microscopy images,” Cytometry,

vol. 33, no. 3, pp. 366–375, 1998.

[21] R. F. Murphy, M. Velliste, and G. Porreca, “Robust classification of subcellular location

patterns in fluorescence microscope images,” in Proceedings of the 12th IEEE Workshop

on Neural Networks for Signal Processing. IEEE, 2002, pp. 67–76.

[22] R. F. Murphy, “Automated interpretation of subcellular location patterns,” in IEEE

International Symposium on Biomedical Imaging: Nano to Macro. IEEE, 2004, pp.

53–56.

[23] Y. Hu and R. F. Murphy, “Automated interpretation of subcellular patterns from

immunofluorescence microscopy,” Journal of immunological methods, vol. 290, no. 1,

pp. 93–105, 2004.

[24] N. Hamilton, R. Pantelic, K. Hanson, J. Fink, S. Karunaratne, and R. Teasdale, “Au-

tomated sub-cellular phenotype classification: an introduction and recent results,” in

Proceedings of the workshop on Intelligent systems for bioinformatics, vol. 73. Aus-

tralian Computer Society, Inc., 2006, pp. 67–72.

[25] L. Nanni and A. Lumini, “Ensemblator: An ensemble of classifiers for reliable classi-

fication of biological data,” Pattern Recognition Letters, vol. 28, no. 5, pp. 622–630,

2007.

[26] M. Tscherepanow, N. Jensen, and F. Kummert, “An incremental approach to auto-

mated protein localisation,” BMC bioinformatics, vol. 9, no. 1, p. 445, 2008.

111

Page 134: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

[27] Z.-C. Wu, X. Xiao, and K.-C. Chou, “iloc-gpos: A multi-layer classifier for predicting

the subcellular localization of singleplex and multiplex gram-positive bacterial pro-

teins,” Protein and Peptide Letters, vol. 19, no. 1, pp. 4–14, 2012.

[28] X. Xiao, Z.-C. Wu, and K.-C. Chou, “iloc-virus: A multi-label learning classifier for

identifying the subcellular localization of virus proteins with both single and multiple

sites,” Journal of Theoretical Biology, vol. 284, no. 1, pp. 42–51, 2011.

[29] B. Zhang and T. D. Pham, “Multiple features based two-stage hybrid classifier ensem-

bles for subcellular phenotype images classification,” International Journal of Biomet-

rics and Bioinformatics, vol. 4, no. 5, pp. 176–193, 2010.

[30] M. Hayat and A. Khan, “Predicting membrane protein types by fusing composite

protein sequence features into pseudo amino acid composition,” Journal of Theoretical

Biology, vol. 271, no. 1, pp. 10–17, 2011.

[31] R. M. Haralick, “Statistical and structural approaches to texture,” Proceedings of the

IEEE, vol. 67, no. 5, pp. 786–804, 1979.

[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic

minority over-sampling technique,” arXiv preprint arXiv:1106.1813, 2011.

[33] B. Julesz, “A theory of preattentive texture discrimination based on first-order statis-

tics of textons,” Biological cybernetics, vol. 41, no. 2, pp. 131–138, 1981.

[34] K. Huang, M. Velliste, and R. F. Murphy, “Feature reduction for improved recogni-

tion of subcellular location patterns in fluorescence microscope images,” in Proceedings

SPIE, vol. 4962, 2003, pp. 307–318.

[35] K. Huang and R. Murphy, “Boosting accuracy of automated classification of fluores-

cence microscope images for location proteomics,” Bmc Bioinformatics, vol. 5, no. 1,

p. 78, 2004.

[36] X. Chen, M. Velliste, and R. F. Murphy, “Automated interpretation of subcellular

patterns in fluorescence microscope images for location proteomics,” Cytometry part

A, vol. 69, no. 7, pp. 631–640, 2006.

112

Page 135: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

[37] N. Hamilton, R. Pantelic, K. Hanson, and R. Teasdale, “Fast automated cell phenotype

image classification,” BMC bioinformatics, vol. 8, no. 1, p. 110, 2007.

[38] S.-C. Chen, T. Zhao, G. J. Gordon, and R. F. Murphy, “Automated image analysis

of protein localization in budding yeast,” Bioinformatics, vol. 23, no. 13, pp. i66–i71,

2007.

[39] L. Nanni and A. Lumini, “A reliable method for cell phenotype image classification,”

Artificial intelligence in medicine, vol. 43, no. 2, pp. 87–97, 2008.

[40] L. Nanni, A. Lumini, Y.-S. Lin, C.-N. Hsu, and C.-C. Lin, “Fusion of systems for auto-

mated cell phenotype image classification,” Expert Systems with Applications, vol. 37,

no. 2, pp. 1556–1562, 2010.

[41] L. Nanni, A. Lumini, and S. Brahnam, “Local binary patterns variants as texture

descriptors for medical image analysis,” Artificial intelligence in medicine, vol. 49,

no. 2, pp. 117–125, 2010.

[42] L. Nanni, S. Brahnam, and A. Lumini, “A simple method for improving local binary

patterns by considering non-uniform patterns,” Pattern Recognition, vol. 45, no. 10,

pp. 3844–3852, 2012.

[43] A. Eleyan and H. Demirel, “Co-occurrence matrix and its statistical features as a new

approach for face recognition,” Turk J Elec Eng & Comp Sci, vol. 19, no. 1, pp. 97–107,

2011.

[44] L. Nanni, S. Brahnam, S. Ghidoni, E. Menegatti, and T. Barrier, “Different approaches

for extracting information from the co-occurrence matrix,” PloS one, vol. 8, no. 12, p.

e83554, 2013.

[45] F. Albregtsen et al., “Statistical texture measures computed from gray level coocur-

rence matrices,” Image Processing Laboratory, Department of Informatics, University

of Oslo, 1995.

[46] G. Srinivasan and G. Shobha, “Statistical texture analysis,” Proceedings of world

academy of science, engg & tech, vol. 36, 2008.

113

Page 136: Protein Subcellular Classi cation using Machine Learning …prr.hec.gov.pk/jspui/bitstream/123456789/2055/1/2352S.pdf · 2018-07-23 · 3. Muhammad Tahir, Asifullah Khan and Hseyin

[47] A. Chaddad, C. Tanougast, A. Dandache, A. Bouridane, J. Charara, and A. Al Houseini, "Classification of cancer cells based on Haralick's coefficients using multi-spectral images," in 7th ESBME Conference, International Federation for Medical and Biological Engineering, 2010.

[48] A. Gelzinis, A. Verikas, and M. Bacauskiene, "Increasing the discrimination power of the co-occurrence matrix-based features," Pattern Recognition, vol. 40, no. 9, pp. 2367–2372, 2007.

[49] V. Lakshminarayanan and A. Fleck, "Zernike polynomials: a guide," Journal of Modern Optics, vol. 58, no. 7, pp. 545–561, 2011.

[50] C.-W. Chong, P. Raveendran, and R. Mukundan, "A comparative analysis of algorithms for fast computation of Zernike moments," Pattern Recognition, vol. 36, no. 3, pp. 731–742, 2003.

[51] T. Arif, Z. Shaaban, L. Krekor, and S. Baba, "Object classification via geometrical, Zernike and Legendre moments," Journal of Theoretical and Applied Information Technology, vol. 7, no. 1, pp. 31–37, 2009.

[52] M. Hayat and A. Khan, "Membrane protein prediction using wavelet decomposition and pseudo amino acid based feature extraction," in Proceedings of the 6th International Conference on Emerging Technologies. IEEE, 2010, pp. 1–6.

[53] J.-D. Qiu, X.-Y. Sun, J.-H. Huang, and R.-P. Liang, "Prediction of the types of membrane proteins based on discrete wavelet transform and support vector machines," The Protein Journal, vol. 29, no. 2, pp. 114–119, 2010.

[54] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.

[55] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.


[56] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," in Analysis and Modeling of Faces and Gestures. Springer, 2007, pp. 168–182.

[57] ——, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1635–1650, 2010.

[58] N. A. Hamilton, J. T. Wang, M. C. Kerr, and R. D. Teasdale, "Statistical and visual differentiation of subcellular imaging," BMC Bioinformatics, vol. 10, no. 1, p. 94, 2009.

[59] N. Otsu, "A threshold selection method from gray-level histograms," Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.

[60] J. Prewitt and M. L. Mendelsohn, "The analysis of cell images," Annals of the New York Academy of Sciences, vol. 128, no. 3, pp. 1035–1053, 1966.

[61] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2005, pp. 886–893.

[62] O. Ludwig, D. Delgado, V. Gonçalves, and U. Nunes, "Trainable classifier-fusion schemes: an application to pedestrian detection," in Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems. IEEE, 2009, pp. 1–6.

[63] C. Chen, A. Liaw, and L. Breiman, "Using random forest to learn imbalanced data," University of California, Berkeley, 2004.

[64] P. Yang, L. Xu, B. Zhou, Z. Zhang, and A. Zomaya, "A particle swarm based hybrid system for imbalanced medical data sampling," BMC Genomics, vol. 10, no. Suppl 3, p. S34, 2009.

[65] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.


[66] Z.-S. He, X.-H. Shi, X.-Y. Kong, Y.-B. Zhu, and K.-C. Chou, "A novel sequence-based method for phosphorylation site prediction with feature selection and analysis," Protein and Peptide Letters, vol. 19, no. 1, pp. 70–78, 2012.

[67] T. Huang, L. Chen, Y.-D. Cai, and K.-C. Chou, "Classification and analysis of regulatory pathways using graph property, biochemical and physicochemical property, and functional property," PLoS ONE, vol. 6, no. 9, p. e25297, 2011.

[68] B.-Q. Li, L.-L. Hu, L. Chen, K.-Y. Feng, Y.-D. Cai, and K.-C. Chou, "Prediction of protein domain with mRMR feature selection and analysis," PLoS ONE, vol. 7, no. 6, p. e39308, 2012.

[69] L. Rokach, "Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography," Computational Statistics & Data Analysis, vol. 53, no. 12, pp. 4046–4072, 2009.

[70] A.-L. Boulesteix, S. Janitza, J. Kruppa, and I. R. König, "Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 6, pp. 493–507, 2012.

[71] L. Nanni, S. Brahnam, and A. Lumini, "Local ternary patterns from three orthogonal planes for human action classification," Expert Systems with Applications, vol. 38, no. 5, pp. 5125–5128, 2011.

[72] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.

[73] J. Li, L. Xiong, J. Schneider, and R. F. Murphy, "Protein subcellular location pattern classification in cellular images using latent discriminative models," Bioinformatics, vol. 28, no. 12, pp. i32–i39, 2012.

[74] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[75] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, vol. 33, no. 1-2, pp. 1–39, 2010.


[76] Y. Xie, X. Li, E. Ngai, and W. Ying, "Customer churn prediction using improved balanced random forests," Expert Systems with Applications, vol. 36, no. 3, pp. 5445–5449, 2009.

[77] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso, "Rotation forest: A new classifier ensemble method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619–1630, 2006.

[78] L. I. Kuncheva and J. J. Rodríguez, "An experimental study on rotation forest ensembles," in Multiple Classifier Systems. Springer, 2007, pp. 459–468.

[79] W. Chen, P.-M. Feng, H. Lin, and K.-C. Chou, "iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition," Nucleic Acids Research, vol. 41, no. 6, p. e68, 2013.

[80] Y. Xu, J. Ding, L.-Y. Wu, and K.-C. Chou, "iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition," PLoS ONE, vol. 8, no. 2, p. e55844, 2013.

[81] N. Ye, K. M. A. Chai, W. S. Lee, and H. L. Chieu, "Optimizing F-measures: A tale of two approaches," in Proceedings of the International Conference on Machine Learning, 2012.

[82] Y. Sasaki, "The truth of the F-measure," Teach Tutor Mater, pp. 1–5, 2007.

[83] J. Meynet and J.-P. Thiran, "Information theoretic combination of pattern classifiers," Pattern Recognition, vol. 43, no. 10, pp. 3412–3421, 2010.

[84] D. J. Hand and R. J. Till, "A simple generalisation of the area under the ROC curve for multiple class classification problems," Machine Learning, vol. 45, no. 2, pp. 171–186, 2001.

[85] B. Zhang and T. D. Pham, "Phenotype recognition with combined features and random subspace classifier ensemble," BMC Bioinformatics, vol. 12, no. 1, p. 128, 2011.

[86] B. Zhang, Y. Zhang, W. Lu, and G. Han, "Phenotype recognition by curvelet transform and random subspace ensemble," Journal of Applied Mathematics and Bioinformatics, vol. 1, no. 1, pp. 79–103, 2011.


Vitae

Muhammad Tahir is a Ph.D. candidate at the Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan. He received his M.Sc. degree in Computer Science from Central Science Post-Graduate College, Peshawar, University of Peshawar, in 2005, and his MS degree in Computer Science from the National University of Computer and Emerging Sciences, Islamabad, Pakistan, on August 8, 2010. His current research interests include machine learning, pattern recognition, computational intelligence, bioinformatics, and ensemble classification.

Asifullah Khan received his M.S. and Ph.D. degrees in Computer Systems Engineering from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan, in 2003 and 2006, respectively. He carried out two years of post-doctoral research at the Signal and Image Processing Lab, Department of Mechatronics, Gwangju Institute of Science and Technology, South Korea. He has more than 14 years of research experience and works as an Associate Professor in the Department of Computer and Information Sciences at PIEAS. His research areas include digital watermarking, pattern recognition, bioinformatics, and machine learning.
