Protein Subcellular Classification using
Machine Learning Approaches
Muhammad Tahir
PhD Thesis
Department of Computer and Information Sciences,
Pakistan Institute of Engineering & Applied Sciences,
Islamabad, Pakistan
Protein Subcellular Classification using
Machine Learning Approaches
By
Muhammad Tahir
A dissertation submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy in
Computer and Information Sciences
The Department of Computer and Information Sciences,
Pakistan Institute of Engineering and Applied Sciences,
Islamabad, Pakistan
2014
This thesis is carried out under the supervision of
Dr. Asifullah Khan
Associate Professor
Department of Computer and Information Sciences,
Pakistan Institute of Engineering & Applied Sciences,
Islamabad, Pakistan
This work is financially supported by Higher Education Commission of Pakistan
Under the indigenous 5000 Ph.D. fellowship program
17-5-4 (Ps4-124)/HEC/Sch/2008/
Declaration
I confirm that the work presented in this thesis is my original research work, carried
out in candidature for a research degree at this university; consultations of others’ published
work are acknowledged and explicitly cited in the text. I assure that the material presented
in this thesis has neither been submitted nor approved previously for the award of a degree
at any university.
Signature:
Muhammad Tahir
It is certified that the work in this thesis is carried out and completed under my supervision.
Supervisor:
Dr. Asifullah Khan
Associate Professor
DCIS, PIEAS, Islamabad.
Acknowledgments
First and foremost, I am very thankful to Allah Almighty for His blessings during my entire
life, particularly for the duration of this research work. He blessed me with knowledge and
purpose, and guided me whenever I faced any problem. Having sincere teachers and
cooperative friends during my PhD journey is also a blessing of Allah Almighty.
Next, I would like to pay my deepest gratitude to my supervisor Dr. Asifullah Khan
whose dedication, enthusiasm, and devotion to work always inspired me. He has been a
constant source of motivation and inspiration for me throughout my PhD research. I always
found him generous in sharing his knowledge and wisdom. Working with him was a great
learning experience. His kind attitude really made the difference.
I would also like to pay my gratitude to Dr. Abdul Majid for his advice, guidance,
and invaluable comments during my PhD. I extend my gratitude to Dr. Abdul Jalil
for his appreciation and encouragement to complete my PhD. I also appreciate my
friends, particularly Dr. Maqsood Hayat, Mr. Khurram Jawad, Mr. Adnan Idris, Mr.
Aksam Iftikhar, Mr. Mehdi Hassan, Mr. Manzoor, Mr. Fazal Badshah, Mr. Safdar Ali,
Dr. Atta ur Rahman, Dr. Anwar Hussain and Mr. Zaheer Uddin for their cooperative and
encouraging behavior during my stay at PIEAS.
I would certainly like to thank my loving parents, brothers, and sisters whose support,
throughout my studies, has made all this possible. Without their encouraging behavior and
moral support, the completion of this research work would not have been possible.
I pay my gratitude to the Pattern Recognition Lab at the Department of Computer and
Information Sciences, Pakistan Institute of Engineering and Applied Sciences, which
provided me with a good environment and round-the-clock technical support for conducting
my PhD research.
Finally, I would like to thank the Higher Education Commission of Pakistan for its
financial support under the Indigenous 5000 PhD scholarship program with reference to the
award letter number 17-5-4(Ps4-124)/HEC/Sch/2008/.
Muhammad Tahir
List of Publications
International refereed journals
1. Muhammad Tahir, Asifullah Khan, and Abdul Majid, “Protein Subcellular
Localization of Fluorescence Imagery using Spatial and Transform domain Fea-
tures”, Bioinformatics, 28(1) 91-97 (2012). (impact factor: 5.323)
2. Muhammad Tahir, Asifullah Khan, Abdul Majid, and Alessandra Lumini,
“Subcellular Localization using Fluorescence Imagery: Utilizing Ensemble Classi-
fication with Diverse Feature Extraction Strategies and Data Balancing”, Applied
Soft Computing, 13(11) 4231-4243 (2013). (impact factor: 2.140)
3. Muhammad Tahir, Asifullah Khan, and Hüseyin Kaya, “Protein Subcellular Lo-
calization in Human and Hamster cell lines: Employing Local Ternary Patterns
of Fluorescence Microscopy Images”, Journal of Theoretical Biology, 340 85-95
(2014). (impact factor: 2.351)
4. Muhammad Tayyeb Mirza, Asifullah Khan, Muhammad Tahir, and Yeon Soo
Lee, “MitProt-Pred: Predicting mitochondrial proteins of Plasmodium falci-
parum parasite using diverse physiochemical properties and ensemble classifica-
tion”, Computers in Biology and Medicine, 43(10) 1502-1511 (2013).
(impact factor: 1.162)
5. Muhammad Tahir and Asifullah Khan, “Protein Subcellular Localization us-
ing SVM based Ensemble and Individual Orientations of both Gray Level Co-
occurrence Matrices and Texton Images”, Under review in IEEE Transactions
on Cybernetics.
Contents
Declaration iv
Acknowledgments v
List of Publications vi
Abstract xvi
1 Introduction 1
1.1 Motivation and Research Objectives . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Relevant Literature and Techniques 7
2.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Feature Extraction Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Haralick Texture Features . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1.1 Gray Level Co-occurrence Matrix . . . . . . . . . . . . . . 13
2.2.2 Texton Image based Statistical Features . . . . . . . . . . . . . . . . 18
2.2.3 Zernike Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 Wavelet Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.5 Local Binary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.6 Local Ternary Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.7 Threshold Adjacency Statistics . . . . . . . . . . . . . . . . . . . . . 23
2.2.8 Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.9 Edge Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.10 Hull Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.11 Morphological Features . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.12 Histogram of Oriented Gradients . . . . . . . . . . . . . . . . . . . . 25
2.3 Post-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Oversampling with SMOTE . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Feature Selection with mRMR . . . . . . . . . . . . . . . . . . . . . 27
2.4 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Random Forest Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.3 Rotation Forest Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Performance Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.2 Sensitivity/Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.3 Matthews Correlation Coefficient . . . . . . . . . . . . . . . . . . . 33
2.5.4 F-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.5 Q-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.6 Multiclass ROC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Protein Subcellular Localization using Spatial and Transform Domain
Features 39
3.1 The SVM-SubLoc Prediction System . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.1.1 GLCM Construction and Haralick Co-efficients . . . . . . . 41
3.1.1.2 Discrete Wavelet Transformation . . . . . . . . . . . . . . . 42
3.1.1.3 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.2 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Performance Analysis of SVM-SubLoc using Individual Features for
HeLa dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Performance Analysis of SVM-SubLoc using Hybrid Features for 2D
HeLa dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.3 Performance Analysis of SVM-SubLoc for LOCATE datasets . . . . 51
3.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 Protein Subcellular Localization in the Presence of Imbalanced data 57
4.1 The Proposed RF-SubLoc Prediction System . . . . . . . . . . . . . . . . . 57
4.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.1.1 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.2 Data Balancing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.3 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.1 Performance of RF-SubLoc using Individual Features . . . . . . . . 61
4.2.2 Performance Analysis of RF-SubLoc using Hybrid Features . . . . . 63
4.2.3 Performance Analysis of RotF Ensemble using Individual Features . 65
4.2.4 Performance Analysis of RotF Ensemble using Hybrid Features . . . 65
4.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Protein Subcellular Localization: Employing LTP with SMOTE 70
5.1 The Protein-SubLoc Prediction System . . . . . . . . . . . . . . . . . . . . 70
5.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.2 Oversampling Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.3 Feature Selection (Optional) . . . . . . . . . . . . . . . . . . . . . . 71
5.1.4 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.5 Ensemble Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 Performance Analysis for 2D HeLa dataset . . . . . . . . . . . . . . 74
5.2.2 Performance Analysis for CHOA dataset . . . . . . . . . . . . . . . . 78
5.2.3 Ensemble Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Protein Subcellular Localization using GLCM and Texton Image based
Features 84
6.1 The IEH-GT Prediction System . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1.1 Feature Extraction Phase . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1.1.1 GLCM Construction . . . . . . . . . . . . . . . . . . . . . . 85
6.1.1.2 Texton Image Construction . . . . . . . . . . . . . . . . . . 86
6.1.1.3 The Hybrid Model . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.2 Classification Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Analysis of GLCM based Features for HeLa Dataset . . . . . . . . . 87
6.2.2 Analysis of Texton Image based Features for HeLa Dataset . . . . . 94
6.2.3 Analysis of the Hybrid Model for HeLa dataset . . . . . . . . . . . . 101
6.2.4 Ensemble Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 Conclusions and Future Directions 105
7.1 Conclusive Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
References 109
List of Figures
Figure 1.1 Fluorescence microscopy image of microtubule from HeLa dataset [5] 2
Figure 1.2 A general pattern recognition system . . . . . . . . . . . . . . . . . . 3
Figure 2.1 GLCM construction for Ng = 8 at θ = {0◦, 45◦, 90◦, 135◦} and ∆ = 1 14
Figure 2.2 Different Texton masks . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.3 Procedure of Texton image generation . . . . . . . . . . . . . . . . . 19
Figure 2.4 LBP code generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Figure 2.5 LTP is split into two LBP codes . . . . . . . . . . . . . . . . . . . . 23
Figure 2.6 Threshold Adjacency Statistics . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.1 The SVM-SubLoc prediction system . . . . . . . . . . . . . . . . . . 40
Figure 3.2 Feature extraction from GLCMH, GLCMV, GLCMD and GLCMoD . 41
Figure 3.3 DWT Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 3.4 Fluorescence microscopy protein image of size M -by-N is split into
four sub-images at each decomposition level. Decomposition level 0 indicates
the original fluorescence microscopy protein image. . . . . . . . . . . . . . . 43
Figure 3.5 The Classification phase of SVM-SubLoc . . . . . . . . . . . . . . . . 44
Figure 4.1 The RF-SubLoc prediction system . . . . . . . . . . . . . . . . . . . 58
Figure 4.2 The Classification phase of RF-SubLoc . . . . . . . . . . . . . . . . . 61
Figure 5.1 Framework of the proposed system . . . . . . . . . . . . . . . . . . . 71
Figure 5.2 Comparison of original and synthetic samples in HeLa dataset . . . 72
Figure 5.3 Comparison of original and synthetic samples in CHOA dataset . . . 72
Figure 5.4 Classification phase of Protein-SubLoc prediction system . . . . . . . 73
Figure 5.5 Ratio of explained variance to the total variance for HeLa dataset . 75
Figure 5.6 ROC curves using URI-LTP(3, 24, 80) for 2D HeLa dataset . . . . . 76
Figure 5.7 Effect of mRMR on URI-LTP extracted on radius 3 for HeLa dataset 78
Figure 5.8 Effect of mRMR on U-LTP extracted on radius 2 for HeLa dataset . 78
Figure 5.9 ROC curves using URI-LTP(3, 24, 30) for CHOA dataset . . . . . . 80
Figure 6.1 Framework of IEH-GT prediction system . . . . . . . . . . . . . . . 85
Figure 6.2 Classification phase of IEH-GT system . . . . . . . . . . . . . . . . . 87
Figure 6.3 Golgia protein is wrongly predicted as Golgpp protein . . . . . . . . 89
Figure 6.4 Golgia protein is represented by red circle and Golgpp class is shown
by green plus sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 6.5 Golgpp protein is wrongly predicted as Golgia protein . . . . . . . . 90
Figure 6.6 Golgpp instance is represented by red circle and Golgia class is shown
by green plus sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Figure 6.7 Endosome instance is shown by red circle and Lysosome class is rep-
resented by green plus sign . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 6.8 Similar patterns can be observed in the two images . . . . . . . . . . 96
List of Tables
Table 2.1 Offset pair with corresponding direction . . . . . . . . . . . . . . . . . 13
Table 2.2 A confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Table 2.3 Correlation between classifier BCi and classifier BCj . . . . . . . . . 35
Table 2.4 HeLa dataset classes and image breakup in each class . . . . . . . . . 36
Table 2.5 CHOM dataset classes and image breakup in each class . . . . . . . . 37
Table 2.6 CHOA dataset classes and image breakup in each class . . . . . . . . 37
Table 2.7 Vero dataset classes and image breakup in each class . . . . . . . . . 37
Table 2.8 LOCATE Endogenous and Transfected datasets: images breakup . . 38
Table 3.1 Performance of SVM-SubLoc using Haralick textures with/without DWT 46
Table 3.2 Performance of SVM-SubLoc using Zernike moments with/without DWT 47
Table 3.3 Performance of SVM-SubLoc using TAS features . . . . . . . . . . . . 47
Table 3.4 Performance of SVM-SubLoc using LBP with different mappings . . 48
Table 3.5 Performance of SVM-SubLoc using LTP with different mappings . . . 49
Table 3.6 Performance of SVM-SubLoc using ZHar with/without DWT . . . . 50
Table 3.7 Performance of SVM-SubLoc using HarTAS . . . . . . . . . . . . . . 50
Table 3.8 Performance of SVM-SubLoc using HarLBP . . . . . . . . . . . . . . 51
Table 3.9 Performance of SVM-SubLoc using HarLTP . . . . . . . . . . . . . . 51
Table 3.10 Performance of SVM-SubLoc using LTP for Endogenous dataset . . . 52
Table 3.11 Performance of SVM-SubLoc using LTP for Transfected dataset . . . 52
Table 3.12 Performance of SVM-SubLoc using HarLBP for Endogenous dataset . 53
Table 3.13 Performance of SVM-SubLoc using HarLBP for Transfected dataset . 53
Table 3.14 Performance of SVM-SubLoc using HarLTP for Endogenous dataset . 54
Table 3.15 Performance of SVM-SubLoc using HarLTP for Transfected dataset . 54
Table 3.16 Performance comparison with other published work . . . . . . . . . . 55
Table 4.1 Oversampled instances for CHOM dataset . . . . . . . . . . . . . . . 59
Table 4.2 Oversampled instances for CHOA dataset . . . . . . . . . . . . . . . . 60
Table 4.3 Oversampled instances for Vero dataset . . . . . . . . . . . . . . . . . 60
Table 4.4 Performance of RF-SubLoc using individual features . . . . . . . . . . 62
Table 4.5 Performance of RF-SubLoc using hybrid features . . . . . . . . . . . 64
Table 4.6 Performance of RotF ensemble using individual features . . . . . . . . 66
Table 4.7 Performance of RotF ensemble using hybrid features . . . . . . . . . 67
Table 4.8 Performance comparison with other published work . . . . . . . . . . 68
Table 5.1 The serial numbers, attached to the mapping used in LTP computation,
representing a particular LTP variant in Tables 5.8 and 5.9 . . . . . . . . . 73
Table 5.2 Performance of Protein-SubLoc using LTP for balanced HeLa dataset 74
Table 5.3 Performance of Protein-SubLoc using LTP for balanced mRMR based
HeLa dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Table 5.4 Performance of Protein-SubLoc using LTP for imbalanced HeLa dataset 79
Table 5.5 Performance of Protein-SubLoc using LTP for balanced CHOA dataset 79
Table 5.6 Performance of Protein-SubLoc using LTP for balanced mRMR based
CHOA dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Table 5.7 Performance of Protein-SubLoc using LTP for imbalanced CHOA dataset 80
Table 5.8 Performance of different combinations of SVM classifications using LTP
for balanced HeLa dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 5.9 Performance of different combinations of SVM classifications using LTP
for balanced CHOA dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 5.10 Performance comparison with other approaches . . . . . . . . . . . . 82
Table 6.1 Predictions of individual SVMs using Haralick features for HeLa dataset 87
Table 6.2 Confusion matrix using Haralick features from GLCMH . . . . . . . . 88
Table 6.3 Confusion matrix using Haralick features computed from GLCMD . . 91
Table 6.4 Confusion matrix using Haralick features computed from GLCMV . . 91
Table 6.5 Confusion matrix using Haralick features computed from GLCMoD . 92
Table 6.6 Confusion matrix using Haralick features computed from GLCMF . . 92
Table 6.7 Golgia proteins classified as Golgpp proteins . . . . . . . . . . . . . . 93
Table 6.8 Predictions of individual SVMs using statistical features computed
from Texton images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Table 6.9 Confusion matrix using statistical features computed from Texton im-
age constructed using T1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Table 6.10 Confusion matrix using statistical features computed from Texton im-
age constructed using T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Table 6.11 Confusion matrix using statistical features computed from Texton im-
age constructed using T3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Table 6.12 Confusion matrix using statistical features computed from Texton im-
age constructed using T4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Table 6.13 Confusion matrix using statistical features computed from Texton im-
age constructed using T5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Table 6.14 Confusion matrix using statistical features computed from Texton im-
age constructed using T6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 6.15 Confusion matrix using statistical features computed from Texton im-
age constructed from the fusion of six Texton images . . . . . . . . . . . . . 99
Table 6.16 Different Texton image based statistical features lead to different clas-
sification results. Endosome proteins as Lysosome proteins . . . . . . . . . . 100
Table 6.17 Confusion matrix obtained using the hybrid model . . . . . . . . . . . 101
Table 6.18 Confusion matrix using majority voting scheme . . . . . . . . . . . . 102
Table 6.19 Performance comparison with the existing approaches based on HeLa
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Abstract
Subcellular localization of proteins is one of the most significant characteristics of living
cells that may reveal plentiful information regarding the working of a cell. Subcellular
localization property of proteins plays a key role in understanding numerous functions of
proteins. The proteins, located in their respective compartments or localizations, are in-
volved in their relevant cellular processes, which may include cell apoptosis, asymmetric cell
division, cell cycle regulation, and spermatic morphogenesis. In fact, cells may not perform
their regular operations well if proteins are not found in their proper subcellular lo-
cations. Improper localization of proteins may lead to primary human liver tumors, breast
cancer, and Bartter syndrome. Protein sequencing has undergone rapid expansion due to
advancements in genomic sequencing technologies, enabling the research community to
recognize the functionalities of different proteins. In this connection, microscopy imaging
provides protein images in a timely and low-cost manner compared to protein sequencing.
However, automated systems are required for fast and reliable classification of these protein
images. Comprehensive analysis of fluorescence microscopy images is required in order to
develop efficient automated systems for accurate localization of various proteins. For this
purpose, representation of microscopy images with discriminative numerical descriptors has
always been a challenge.
This thesis focuses on the identification of discriminative feature extraction strategies
effective for protein subcellular localization, the recognition capability of the prediction sys-
tems, and the reduction of classifier bias towards the majority class due to the imbalance
present in data. The contributions of this thesis include (1) Analysis of different spatial
and transform domain features, (2) Development of a novel idea for GLCM construction
in DWT domain, (3) Analysis of SMOTE oversampling in the feature space, (4) Analysis
of GLCM in the spatial domain for capturing discriminative information from fluorescence
microscopy protein images along different orientations, (5) Exploitation of Texton images
for their capability of extracting discriminative information along different orientations from
fluorescence microscopy protein images, and (6) Development of web-based prediction
systems that can be accessed freely by academics and researchers.
Extensive simulations are performed in order to assess the efficiency of the proposed
prediction systems in discriminating different subcellular structures from various datasets.
List of Abbreviations
AIIA Artificial Intelligence for Investigating Anti-cancer solutions
ASM Angular Second Moment
AUC Area Under Curve
CHO Chinese Hamster Ovary
CHOA CHO dataset obtained from AIIA lab
CHOM CHO dataset obtained from Murphy lab
COF Center Of Fluorescence
DWT Discrete Wavelet Transform
EQP Elongated Quinary Patterns
FN False negatives
FP False positives
GFP Green Fluorescent Protein
GLCM Grey Level Co-occurrence Matrix
HH High Frequency Components
HL High Frequency and Low Frequency Components
HOG Histogram of Oriented Gradients
ICA Independent Component Analysis
IEH-GT Individual Exploitation of orientation and Hybridization of GLCM and
Texton Image
KPCA Kernel PCA
LBP Local Binary Patterns
LH Low Frequency and High Frequency Components
lin-SVM SVM with linear kernel
LL Low Frequency Components
LTP Local Ternary Patterns
MCC Matthews Correlation Coefficient
MDA Multiple Discriminant Analysis
mRMR minimum Redundancy Maximum Relevance
NPE Neighborhood Preserving Embedding
PCA Principal Component Analysis
poly-SVM SVM with polynomial kernel
Protein-SubLoc SVM with SMOTE based Subcellular Localization System
Q-Statistic Yule’s measure for diversity among multiple classifiers
RBF Radial Basis Function
RBF-SVM SVM with RBF kernel
RF Random Forest
RF-SubLoc RF based ensemble prediction system
RI Rotation Invariant
ROC Receiver Operating Characteristic
RotF Rotation Forest
RSE Random Subspace Ensemble
SDA Stepwise Discriminant Analysis
sig-SVM SVM with sigmoid kernel
SLF Subcellular Location Features
SMOTE Synthetic Minority Oversampling TEchnique
SRM Structural Risk Minimization
SVM Support vector machine
SVM-SubLoc SVM based Subcellular Localization System
TAS Threshold Adjacency Statistics
TN True negatives
TP True positives
U Uniform
URI Uniform Rotation Invariant
List of Symbols
α Lagrange multiplier
w Weight vector
x Feature vector
y Corresponding label vector of x
∆ Distance in terms of pixel
γ Width of the Gaussian function in RBF kernel of SVM
µ Mean value
σ Variance value
θ Orientation of GLCM or Texton image
τ Threshold value
ξ Positive slack variable associated with training data
B Base Classifier
BN Number of Base Classifiers
C Cost parameter of misclassification for SVM
d degree of polynomial kernel in SVM
gc Gray value of central pixel c ∈ PN
gu Gray value of pixel u ∈ PN
Gi,j Total number of times a particular combination of gray level i and gray
level j occurs in a GLCM
Mf Modified dimensions of a feature space
Nf Dimensions of a feature space
Ng Number of gray levels in an image or the square dimensions of a GLCM
PN Pixels in a neighborhood
(i, j ) Index of a 2D image or matrix
CL Class Label
D Dimensionality of the feature space
i Some gray level in an image
j Some gray level in an image
K SVM kernel
L Decomposition Level
MI Mutual Information
MR Maximum Relevance
mR minimum Redundancy
S Feature Set
u A particular pixel ∈ PN
Y Predicted Label
z Target class
Chapter 1
Introduction
Subcellular localization of proteins provides information about the functional behavior of
proteins as well as about their tendency to interact with other proteins under different
circumstances [1, 2]. Therefore, knowledge of the subcellular distribution of these proteins
in a cell is of prime significance in the field of proteomics, cell biology, and computational
functional genomics [3, 4]. Determining protein subcellular localization is critical to the
understanding of various protein functions. For example, during the drug discovery pro-
cess, knowledge of the subcellular localization of a protein can considerably improve the
identification of drugs [5,6]. Comprehension of the protein functions is of prime importance
in biological sciences that may help understand the cell behavior in different situations [7].
Moreover, diagnosis of different diseases in early stages might be performed by adequately
finding the protein locations in cells [8, 9]. For instance, aberrant subcellular localization
of proteins has been observed in the cells of several fatal diseases, namely, breast cancer
and Alzheimer’s disease [10]. In addition, knowing the exact location of a protein before and
after drug administration may help in assessing a drug’s ability to cure a certain disease [11, 12].
Different subcellular compartments of a cell include cytoplasm, mitochondria, Golgi appa-
ratus, lysosome, endoplasmic reticulum, and many others that help it to carry out different
activities such as digestion, movement, and reproduction [13–15].
Various microscopy techniques are frequently used to determine protein localizations
from images. Among these, fluorescence microscopy based imaging has received remarkable
attention from researchers in different fields, particularly in biological sciences [16–18].
The ability to efficiently quantify an object of interest lying on a black background is
the specialty of fluorescence microscopy. To image a protein, GFP must be attached to it
so that the microscope can visualize the protein; it is the fluorescence emitted by GFP that is
captured by the fluorescence microscopy imaging system.
tools, the art of microscopy imaging has flourished in the fields of health and medicine. An
example fluorescence microscopy image of microtubule protein from HeLa cell lines [5] is
shown in Figure 1.1. Henceforth, the term “fluorescence microscopy image” will be referred
to simply as “protein image” in the text.
Figure 1.1: Fluorescence microscopy image of microtubule from HeLa dataset [5]
Fluorescence microscopy generated data is typically used by researchers to train pattern
recognition systems based on their novel algorithms that might aid medical doctors in diag-
nosing various diseases [10,19]. The training of such algorithms is based on the availability
of large microscopy data, which can be efficiently provided by fluorescence microscopy. A
typical pattern recognition system consists of an optional pre-processing phase, a feature
extraction phase, an optional post-processing phase and a classification phase as depicted
in Figure 1.2. In this figure, dotted lines around some blocks indicate that these phases
are optional, which completely depend on the input data. Representation of fluorescence
microscopy protein images through discriminative numerical descriptors for the classification
stage is the main focus of research in this area. In this connection, pioneering work was
conducted at Murphy’s Lab [2, 5, 11, 20–23], where various feature extraction mechanisms
were developed that can efficiently distinguish among different protein structures. Their
research was further extended by other researchers in the field [7, 24, 25]. They have proposed
[Block diagram: Input Image → Pre-Processing → Feature Extraction → Post-Processing → Classification]
Figure 1.2: A general pattern recognition system
modifications to the existing methods as well as novel strategies to classify protein
subcellular localizations. The objective of these researchers is to develop prediction
systems that are superior to others in terms of accuracy.
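The four-phase structure of Figure 1.2 can be illustrated with a minimal sketch. This is a toy illustration only, not the thesis implementation: the class name, the mean/variance "features", and the threshold classifier are all invented placeholders standing in for the descriptors (Haralick, LBP/LTP, etc.) and SVM/ensemble classifiers studied later.

```python
# Minimal sketch of a four-phase pattern recognition pipeline (cf. Figure 1.2).
# The feature extractor and classifier are illustrative placeholders, not the
# descriptors or SVM/ensemble classifiers used in this thesis.

def extract_features(image):
    """Toy descriptors: mean and variance of the pixel intensities."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return (mean, var)

class PatternRecognitionSystem:
    def __init__(self, classify, preprocess=None, postprocess=None):
        self.classify = classify          # mandatory classification phase
        self.preprocess = preprocess      # optional pre-processing phase
        self.postprocess = postprocess    # optional post-processing (e.g. feature selection)

    def predict(self, image):
        if self.preprocess is not None:   # optional, depends on the input data
            image = self.preprocess(image)
        features = extract_features(image)
        if self.postprocess is not None:  # optional, depends on the feature space
            features = self.postprocess(features)
        return self.classify(features)

# A trivial intensity-threshold classifier standing in for the classification phase:
system = PatternRecognitionSystem(
    classify=lambda f: "bright" if f[0] > 128 else "dark"
)
label = system.predict([[200, 210], [190, 220]])  # mean 205 -> "bright"
```

Passing `None` for the optional phases mirrors the dotted blocks in Figure 1.2: those stages are skipped entirely when the input data does not require them.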
1.1 Motivation and Research Objectives
Precise information about the localization of proteins provides valuable clue toward deter-
mining the function of novel proteins [26–29]. Experimental methods are inherently expensive,
time consuming, and laborious due to the requirement of a human specialist in the field.
Further, conventional experimental techniques for proteomic and microscopic recognition
are not viable for some species [30], and their applications are limited to a few proteins.
Therefore, for basic research and drug discovery, automated,
reliably efficient, and fast computational methods are always required by researchers and
pharmaceutical industry so that unknown proteins can easily be identified and accurately
predicted. Extraction of informative features from fluorescence microscopy images of var-
ious proteins and consequently, efficient exploitation of the discriminative power of these
features is a challenging task in order to accurately predict the class of a particular protein.
Considerable progress is observed in recent years for the development of computational
methods that can automatically determine the subcellular protein locations from fluores-
cence microscopy images. The main objective of this research is to express fluorescence
microscopy protein images using their numerical descriptors such that different patterns
belonging to various classes can be discriminated from each other. Furthermore, novel protein
localization systems are developed using various machine learning approaches for
the accurate localization of proteins under the influence of imbalanced feature spaces.
1.2 Research Perspective
Advances in microscopy techniques have enabled researchers to create large databanks of
fluorescence microscopy protein images. Automated systems are capable of recognizing and
classifying these images efficiently and accurately. The research work presented in this thesis
focuses on the application of various feature extraction strategies on fluorescence microscopy
images of proteins in the field of Bioinformatics. Further, the effect of imbalanced data on
the performance of a pattern recognition system is also unveiled.
1.3 Contributions
This research takes into consideration the discrimination power of feature extraction strategies,
the recognition capability of pattern recognition systems, and the imbalanced nature of
the data. The key findings are listed below.
• A study of spatial and transform domain features has been performed, which reveals
that the discrimination power of certain feature spaces is improved when extracted
in transform domain rather than spatial domain. Further, the ensemble constructed
from the decisions of individual classifiers enhances the overall prediction performance.
• The discrimination power of Haralick features [31] extracted from GLCM constructed
in the DWT domain is enhanced. Further, level 2 is observed to be the best decom-
position level, which reveals that this level possesses the most discriminative features.
• Introducing synthetic samples using SMOTE [32] in the feature space prior to the
classification phase reduces the classifier bias towards majority class and consequently,
enhances the prediction performance of protein subcellular localization system.
• Accuracy of a prediction system using feature spaces oversampled through SMOTE is
directly proportional to the degree of imbalance present in the original dataset of fluorescence
microscopy protein images.
• Numerous simulation results reveal that the discriminative strength of LTP is not
improved with mRMR; alternatively stated, LTP does not require mRMR based
feature selection for the improvement of its discriminative strength.
• GLCM and Texton image [33] along different orientations capture different informa-
tion from the same image that is useful in identifying fluorescence microscopy protein
images from different classes.
• It is also observed that in some cases, GLCM constructed along a single direction
performs better than the combined GLCM along all the four orientations.
• Web based prediction systems are also developed that can be accessed freely by the
academicians and researchers.
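The SMOTE oversampling [32] referred to in the findings above can be sketched as follows: a synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. This is a minimal illustrative sketch, not the original SMOTE implementation; the function names and the toy minority set are assumptions.

```python
import random

def nearest_neighbors(sample, minority, k=2):
    # squared Euclidean distance to every other minority sample
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, other)), other)
        for other in minority if other is not sample
    )
    return [other for _, other in dists[:k]]

def smote(minority, n_synthetic, k=2, rng=None):
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    synthetic = []
    for _ in range(n_synthetic):
        sample = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(sample, minority, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        # new sample lies on the segment between sample and neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(sample, neighbor)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]  # toy minority class
new_samples = smote(minority, n_synthetic=5)
```

Because every synthetic sample is an interpolation between two real minority samples, it always falls inside the convex hull of the minority class, which is what reduces the classifier bias toward the majority class.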
1.4 Structure of the Thesis
Chapter 2 highlights various approaches in the literature developed for protein subcellu-
lar localization. Researchers have utilized numerous individual and ensemble classification
based approaches by exploiting the discrimination power of various texture and statistical
based feature extraction strategies. Further, various existing machine learning algorithms
utilized in this research are also reviewed. In addition, different feature extraction
mechanisms, adopted for feature generation from fluorescence microscopy protein images,
are also discussed in detail. The description of a number of performance measures used to
assess the performance of the proposed algorithms is provided. This chapter ends with the
discussion of some benchmark fluorescence microscopy protein image datasets, which are
utilized in this thesis.
Chapter 3 begins with the introduction of SVM-SubLoc prediction system. We propose
the extraction of Haralick and Zernike moments in spatial and DWT based transform do-
mains. Other features include LBP, LTP, TAS and various hybrid models of these feature
spaces that are utilized in spatial domain only. SVM with linear, polynomial, RBF, and
sigmoid kernels is employed as classification algorithm. The performance of these individual
SVMs is evaluated against their ensemble constructed through the majority voting scheme.
Accuracy, MCC, F-Score, and Q-Statistic are employed as performance measures to assess
the quality of SVM-SubLoc prediction system.
Chapter 4 contributes to the classification of fluorescence microscopy protein images
under the effect of balanced and imbalanced feature spaces where balanced feature space
is constructed through SMOTE. The performance of RF and RotF ensemble classifiers is
explored using both balanced and imbalanced feature spaces. The extracted features in-
clude Haralick, HOG, Edge, URI-LBP, Image, and Hull based features in addition to some
hybrid models constructed by forming different combinations of these individual features.
The performance measures: accuracy, F-Score, and MCC revealed that the proposed tech-
nique is promising.
Chapter 5 is devoted to elaborating the effectiveness of LTP in conjunction with SMOTE
using polynomial SVM in classifying fluorescence microscopy protein images. It is
observed that the application of the mRMR feature selection technique
does not improve the discrimination power of LTP patterns. The performance measures
indicated the effectiveness of the proposed technique.
Chapter 6 deals with the implementation of IEH-GT prediction system. Recognition ca-
pability of the IEH-GT prediction system is assessed using the features, extracted separately
from GLCMs and Texton images. The simulation results reveal that GLCMs extracted in-
dividually along different angles are capable of describing fluorescence microscopy protein
images through distinctive features. Similarly, Texton image constructed along a single
direction reduces the overlap in the extracted information from fluorescence microscopy
protein images. In addition to the individual orientation analysis, the combined structures
of GLCMs as well as Texton images have also been analyzed. SVMs are trained on all the
feature spaces and the final prediction is obtained through the majority voting scheme. The
proposed IEH-GT prediction system is tested on the HeLa dataset. The performance results
reveal the effectiveness of the proposed prediction system.
Chapter 7 draws the conclusion and sets the future directions towards the end of this
thesis.
Chapter 2
Relevant Literature and
Techniques
This chapter discusses the literature related to Bioinformatics based algorithms developed
for protein subcellular localization. It also presents machine learning theory and various
related techniques, which have been employed in this thesis. First, theoretical details of
the employed feature extraction strategies are discussed. Then, the post-processing techniques
utilized in this work are described, followed by the classification algorithms used in this
work. Next, different performance measures are discussed. Finally, the fluorescence
microscopy protein image datasets are presented, which we have utilized to assess the
prediction systems developed in this thesis.
2.1 Literature Review
Boland et al. [20] were among the pioneering researchers who started developing automated
systems for the prediction of protein subcellular localization. They adopted Haralick
and Zernike moment based numerical descriptors extracted from fluorescence microscopy
protein images of CHOM dataset, which are then classified using both BPNN and the clas-
sification tree. They have also evaluated the performance of BPNN using two subsets of 10
features; each selected using SDA and MDA from the combined feature space of Haralick
and Zernike moments. Their findings reveal that SDA and classification trees are unable
to perform well on these fluorescence microscopy protein images. On the contrary, BPNN
performed well even on the small number of features selected using MDA from the
combined feature space of Haralick and Zernike moments. Continuing this research work,
Murphy et al. have proposed 22 new features in addition to the Haralick and Zernike mo-
ments [2], which were reported in their previous work. These new features are the result
of morphological and geometrical analysis of the fluorescence microscopy protein images.
They have evaluated the performance of BPNN using Haralick textures, Zernike moments
and the new set of 22 features for 2D HeLa dataset. Further, from the 84D combined
feature space of Haralick, Zernike and 22 new features, 37 features subset is selected using
SDA and evaluated on BPNN to assess the discriminative power of this subset. Boland et
al. [5] have further extended the studies reported in [2, 20]. They proposed some new sets
of features called SLF. The first set, termed SLF1, was composed of morphological
features. Another set, SLF2, was constructed from SLF1 and 6 additional features extracted
from the processed images of proteins and their corresponding DNA images, resulting in 22
coefficients producing a total of 78D feature space. Similarly, SLF4 was generated from
SLF2, concatenated with the same set of Haralick and Zernike features, resulting in 84D
feature space. Further, SDA was applied on SLF4 producing a subset of 37 features, which
was termed as SLF5. Then BPNN was trained using SLF5 for HeLa dataset. They have
also produced subset feature space using PCA. However, performance of SDA was reported
better than that of PCA.
Exploring new avenues in this field, Murphy et al. [11] have proposed a new feature set
SLF7, which is composed of SLF3 [5] and six new features based on skeleton information
of the protein location. SDA was then applied on SLF7 generating 32 selected features,
which are termed as SLF8. BPNN was trained using this selected feature set. SDA is ap-
plied on the combined feature space of SLF7 and six DNA features that resulted in SLF13
composed of 31 selected features. The new method achieved good performance with a
relatively smaller feature space even though the reference DNA image was not utilized
during the feature extraction.
Previously, Murphy and his fellow researchers mostly relied on a single feature selection
mechanism, namely SDA, after which a BPNN was trained using the selected feature space.
However, Huang et al. [34] have tested the effect of eight feature reduction techniques on
these SLF sets for 2D HeLa cell lines. The utilized classifier is the well known SVM. Feature
reduction techniques include PCA, nonlinear PCA, KPCA, ICA, classification trees, Fractal
Dimensionality Reduction, SDA, and GA. SVM achieved 86% accuracy using the reduced
feature set of SLF7 with KPCA among other feature reduction techniques. On the other
hand, SVM achieved 87.4% accuracy using the selected feature space from SLF7 with SDA
among feature selection techniques. SLF12 is introduced by selecting 8 features from SLF7
using SDA. The performance accuracy of SLF12 is 80.1% on average.
Huang and Murphy [35] reported the performance of various classifiers using SLF8 and
SLF13 feature sets, which were previously proposed in [11, 21]. In their previous papers,
they mostly used BPNN as a classifier in their experiments. In contrast to their previously
proposed algorithms, they have introduced SLF15 and SLF16 as well as they have trained
a number of classifiers to identify the best classifiers. SLF15 is generated by applying SDA
on a 174D feature space, which is composed of Daubechies4, Gabor, and SLF7 features.
Similarly, SLF16 is generated using the same procedure on 180D feature space, which in-
cludes 6 DNA features in addition to the 174 features just mentioned. This system achieved
92%− 93% accuracy for 2D HeLa dataset. Chen et al. have presented a comprehensive re-
view about automated systems developed till 2006 for protein localization [36]. This review
focuses on the importance of discriminative image descriptors of fluorescence microscopy
images. The systems discussed in this review were developed for various 2D and 3D fluo-
rescence microscopy protein images. This review considers SDA the best feature selection
technique among others.
Srinivasa et al. have tested the efficiency of Haralick and morphological features in
multi-resolution subspaces up to 2 levels, where PCA is applied to represent the same feature
vectors in their eigenspaces [1]. They have shown that extracting features in multiresolution
subspaces obtains maximum information from fluorescence microscopy protein images,
which helps the classifier classify the images efficiently. Prediction results are generated with
the help of K-means algorithm and combined afterwards through weighting. They showed
that their approach is able to improve performance by 10% for HeLa dataset compared
to features extracted from GLCM in spatial domain. Hamilton et al. [24] have reported
an SVM based classifier, ASPiC: Automated Subcellular Phenotype Classification system
that achieved 94.3% and 89.8% accuracy values for LOCATE endogenous and transfected
datasets, respectively. ASPiC used area and intensity measures as well as Haralick and
Zernike moments as numerical descriptors of fluorescence microscopy protein images. Ex-
panding their contribution to the bioinformatics community, Hamilton et al. have further
proposed TAS based feature extraction strategy to efficiently quantify protein subcellular
localization images [37]. SVM classifier, using TAS, achieved 94.4% and 90.3% accuracy
values for LOCATE endogenous and transfected datasets, respectively. The performance is
further enhanced when TAS and Haralick textures were combined and further utilized by
SVM that yielded accuracy values of 98.2% and 93.2%, respectively, for LOCATE endoge-
nous and transfected datasets.
Chebira et al. have contributed to the field by developing a multiresolution based clas-
sification system for protein subcellular localization [7]. This approach first decomposes
an input fluorescence microscopy protein image into multiresolution subspaces where Har-
alick, Zernike and morphological features are computed in different combinations at each
subspace. Decisions regarding the classification of different fluorescence microscopy protein
images are generated at each subspace using ANN and consequently, these decisions are
combined by weight assignment yielding accuracy of 95.3% for 2D HeLa dataset. Lin et
al. have developed a novel algorithm AdaBoost.ERC: AdaBoost with Error-correcting Re-
peating Codes [6], which is trained using strong and weak detectors to classify fluorescence
microscopy protein images into multiple classes. Their proposed algorithm achieved 94.7%
accuracy for CHOA dataset, 93.6% for HeLa dataset, and 89.1% for Vero dataset. Chen et
al. have proposed the utilization of different field-level and cell-level features to describe
fluorescence microscopy protein images from yeast GFP fusion localization database [38].
Before the classification step, they employed SDA for the selection of the most discriminative
set of features and then trained SVM classifiers on these selected features. In
order to make the final decision, a plurality voting scheme is utilized to combine the decisions
of the various SVM classifiers.
This discussion will not be complete without presenting the contribution of Dr. Loris
Nanni to the field of Bioinformatics. Nanni and Lumini have proposed a novel application
of invariant LBP for extracting numerical features from protein subcellular localization im-
ages [39]. They were further combined with Haralick and TAS feature extraction strategies,
which enhanced the performance of RSE of neural networks for HeLa, LOCATE endogenous
and LOCATE transfected datasets. They showed that RSE of neural networks has outper-
formed SVM. The success rates of RSE of neural networks were 94.2%, 98.4%, and 96.5%
for HeLa, LOCATE endogenous, and LOCATE transfected datasets, respectively. Further,
Nanni et al. have reported an optimal set of features comprising of wavelet, Haralick, TAS
and some variants of LBP feature extraction strategies, which were used to train RSE of 100
Levenberg-Marquardt neural networks [8]. The final prediction is made through the sum
rule achieving accuracy values of 95.8%, 99.5% and 97.0% for HeLa, LOCATE endogenous,
and LOCATE transfected datasets, respectively. Nanni et al. have extended their efforts
and proposed a novel approach for selecting the most discriminative invariant LBP and LTP
patterns from a set of extracted features [12]. In this connection, they utilized PCA and
NPE to produce a reduced dimensionality feature space with high variance. Next, 50 SVMs
were trained using the resultant feature spaces and the final prediction is obtained using
the sum rule yielding 93.2% and 92.9% accuracy values for HeLa and LOCATE endogenous
datasets, respectively. In another effort, Nanni et al. have constructed different local
and global descriptors to numerically describe protein subcellular localization images [40].
The local descriptors are extracted according to the method proposed in [39]. Weak descrip-
tors as proposed in [6] are also utilized in this work. RSE of Levenberg-Marquardt neural
networks and Adaboost of weak learners are individually trained using these features. A
fusion of the two ensembles is also performed using the sum rule. This system yielded 97.5%
prediction accuracy for HeLa dataset. Focusing on LBP, Nanni et al. have proposed a novel
variant of LBP for image classification called EQP [41], which outperformed the existing
LBP and its various variants. EQP achieved 92.4% accuracy value for HeLa dataset, which
is the highest value yielded by the LBP variants. Furthermore, Nanni et al. have reported
the effectiveness of non-uniform LBP and LTP patterns [42]. They showed that RSE of
SVM classifiers may outperform a standalone SVM using non-uniform LBP/LTP patterns
in classifying protein as well as other non-protein images.
The literature survey reveals that most researchers have either developed novel feature
extraction strategies or suggested modifications to the existing techniques for the numerical
description of fluorescence microscopy protein images. In addition, different ensemble and
individual classification systems have been trained on these features. The next section discusses
the feature extraction strategies that have been utilized in this thesis.
2.2 Feature Extraction Schemes
Feature extraction strategies are used to compute numerical features from images or from
a sub-part of the whole image. These numerical features are then used in the classification
process. The computed features should be discriminative enough so that the classifier
can easily distinguish the subcellular location images from each other. Different feature
extraction techniques are utilized in the development of this thesis, which are discussed as
follows.
2.2.1 Haralick Texture Features
Haralick texture features were first proposed in [31]; they are based on second order statistics
computed from a GLCM. Haralick proposed thirteen statistical features to be extracted from
a GLCM. These features include:
• Energy
• Correlation
• Inertia
• Entropy
• Inverse difference moment
• Sum average
• Sum variance
• Sum entropy
• Difference average
• Difference variance
• Difference entropy
• Information measure of correlation 1
• Information measure of correlation 2
Before discussing Haralick coefficients in detail, it is worth discussing the GLCM itself.
2.2.1.1 Gray Level Co-occurrence Matrix
Each element of a GLCM matrix represents the co-occurring frequency of two pixels, sit-
uated ∆ pixels apart from each other, one with gray level i and the other with gray level
j along certain angle θ. Here ∆ is measured in terms of pixel distance and θ is quantized
along four directions including 0◦, 45◦, 90◦ and 135◦ [43,44]. Mathematically, each entry at
index (i, j) of a GLCM is computed as given in Equation 2.1.
P(i, j) = \sum_{x=1}^{M} \sum_{y=1}^{N} \begin{cases} 1, & \text{if } I(x, y) = i \text{ and } I(x + \Delta x,\, y + \Delta y) = j \\ 0, & \text{otherwise} \end{cases} \qquad (2.1)
Here, I is the input image of size M-by-N and P(i, j) counts how often intensity value i
co-occurs with intensity value j; Ng is the number of gray levels in the input image.
(∆x, ∆y) describes the distance and orientation between the pixels and is used to generate
a separate GLCM for each orientation from the set of pre-defined orientations
{0◦, 45◦, 90◦, 135◦}. Table 2.1 shows the different offset pairs, each representing a different
direction.
Table 2.1: Offset pair with corresponding direction
Offset Direction
(0, ∆) Horizontal
(-∆, ∆) Diagonal
(-∆, 0) Vertical
(-∆, -∆) Off-diagonal
The dependency upon the directions is usually avoided by obtaining the average of
GLCMs computed along the four directions. The size of a GLCM depends on the gray
levels in an input fluorescence microscopy protein image. An image having Ng gray levels
results in a GLCM matrix of size Ng-by-Ng. A separate GLCM matrix has to be maintained
for each (∆, θ) pair resulting in large memory requirements. Therefore, the number of gray
tones in an image, from which a GLCM has to be computed, is usually reduced so that
the resulting GLCM is of smaller dimensions [45, 46]. Note that the performance of the
features computed from a GLCM depends greatly on the number of gray levels utilized.
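The GLCM construction of Equation 2.1 and the directional offsets of Table 2.1 can be sketched as follows. The function names are illustrative; the image is assumed to be already quantized to gray levels 0..Ng-1 and is given as a list of lists.

```python
# Sketch of GLCM construction (Equation 2.1) for one (dy, dx) offset.
# Offsets follow Table 2.1: (0, 1) horizontal, (-1, 1) diagonal,
# (-1, 0) vertical, (-1, -1) off-diagonal, all at distance 1.

def glcm(image, n_gray, offset):
    dy, dx = offset
    rows, cols = len(image), len(image[0])
    P = [[0] * n_gray for _ in range(n_gray)]
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                P[image[y][x]][image[y2][x2]] += 1  # count the pair (i, j)
    return P

def mean_glcm(image, n_gray, offsets=((0, 1), (-1, 1), (-1, 0), (-1, -1))):
    # averaging the four directional GLCMs removes orientation dependence
    mats = [glcm(image, n_gray, off) for off in offsets]
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(n_gray)]
            for i in range(n_gray)]

img = [[0, 0, 1],
       [1, 2, 2],
       [2, 2, 2]]
P0 = glcm(img, 3, (0, 1))  # horizontal co-occurrences at distance 1
```

For the 3-by-3 example image, the horizontal GLCM counts six pixel pairs in total, e.g. the pair (2, 2) occurs three times.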
An example image I of size M-by-N with 8 gray levels ranging from 1 to 8 is illustrated in
Figure 2.1(a). Figure 2.1(b) is the resultant GLCM along θ = 0◦ at distance ∆ = 1, while
Figures 2.1(c)-(e) represent the GLCMs obtained along 45◦, 90◦ and 135◦, respectively.

[Figure 2.1: GLCM construction for Ng = 8 at θ = {0◦, 45◦, 90◦, 135◦} and ∆ = 1; panels (a)-(e) show the example image and its four directional GLCMs]

After constructing and fusing these GLCMs, Haralick coefficients are computed from the
resultant GLCM, which acts as a descriptor for a particular image [43, 47]. Before extracting
these statistical measures, it is necessary to normalize the GLCM so that the cells contain
probabilities of occurrence of different outcomes. Note that these are merely approximations
because the gray levels are integer values rather than continuous values. The probability of
occurrence of a particular combination can be calculated using Equation 2.2.
P(i, j) = \frac{G_{i,j}}{\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} G_{i,j}} \qquad (2.2)
Here, Ng is the total number of gray levels, G_{i,j} represents the total number of times
a particular combination occurs, and the denominator is the total number of observed
co-occurrences.
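The normalization of Equation 2.2 is a single division of each cell by the total count; a minimal sketch (function name is illustrative):

```python
# Normalize a GLCM of raw co-occurrence counts into the probability
# matrix of Equation 2.2: each cell is divided by the total count,
# so the normalized cells sum to 1.

def normalize_glcm(counts):
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

counts = [[2, 1], [1, 4]]  # toy 2-by-2 count matrix, total = 8
p = normalize_glcm(counts)
```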
Haralick texture features have been observed to be very effective in texture classification
[45,47,48]. Therefore, we provide the formulae and respective description for some of these
features as follows.
Energy
Energy, or uniformity, is a measure that can be extracted from a GLCM and is used to
quantify the homogeneity of an image. Homogeneous regions exhibit little variation in
the gray level intensities and hence the gray level range is smaller. This gives fewer, but
higher, P(i, j) values in the GLCM. Energy is measured by calculating the square root of
the ASM, the sum of squared elements of the GLCM.

Energy = \sqrt{ASM} \qquad (2.3)

Higher values (maximum 1) of Energy or ASM reveal the constant behavior of an image.
ASM expresses the amount of smoothness or homogeneity of an image.
ASM = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \left( p(i, j) \right)^2 \qquad (2.4)
where Ng is the number of gray levels in the image or alternatively, the square dimensions
of GLCM whereas p(i, j) is a particular element of GLCM.
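Equations 2.3 and 2.4 translate directly into code; a minimal sketch assuming a normalized GLCM `p` (function names are illustrative):

```python
import math

# ASM (Equation 2.4) and Energy (Equation 2.3) from a normalized GLCM p.

def asm(p):
    return sum(v * v for row in p for v in row)

def energy(p):
    return math.sqrt(asm(p))

# A perfectly constant image concentrates all probability in one cell,
# giving the maximum ASM (and Energy) of 1.
uniform = [[1.0, 0.0], [0.0, 0.0]]
```

For a maximally spread GLCM such as `[[0.25, 0.25], [0.25, 0.25]]`, ASM drops to 0.25, illustrating the low values produced by inhomogeneous images.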
Contrast
Contrast, also known as inertia or variance, is used to measure the amount of local intensity
variations in the image. It returns high values where image regions exhibit large variations.
Alternatively, low values are returned for the image regions where gray level differences are
smaller. For constant images, the contrast is always zero. Equation 2.5 is used to quantify
the intensity contrast of neighboring pixels.
Contrast = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) (i - j)^2 \qquad (2.5)
where i and j are gray level indices, Ng is the dimension of the square GLCM, and p(i, j)
is the probability of occurrence of a pixel pair. In Equation 2.5, (i − j)^2 is the weighting
function. Cells on the diagonal are assigned zero weight due to complete similarity, since
(i − j)^2 = 0 where i and j are equal. A difference |i − j| = 1 means there is little contrast
between the pixels, so the assigned weight is 1; similarly, a difference of 2 shows increased
contrast and the assigned weight is 4. The weighting function thus grows quadratically
with the increase in (i − j).
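The quadratic weighting of Equation 2.5 can be sketched as follows, again assuming a normalized square GLCM `p` (function name is illustrative):

```python
# Contrast (Equation 2.5): each probability is weighted by the squared
# gray-level difference (i - j)^2, so diagonal cells contribute nothing.

def contrast(p):
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2
               for i in range(n) for j in range(n))

# A GLCM concentrated on the diagonal (a near-constant image) has zero
# contrast; mass on the off-diagonal cells raises it.
diag = [[0.5, 0.0], [0.0, 0.5]]
off = [[0.0, 0.5], [0.5, 0.0]]
```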
Correlation
GLCM correlation quantifies the amount of correlation among the neighboring pixels. This
feature identifies the gray level spatial dependence within the image. Pixels belonging to
same object have usually high correlation compared to the pixels belonging to different
objects. Similarly, pixels in close proximity with each other have high correlation. On the
other hand, pixels situated farther away from each other have low correlation. Correlation
can be computed using Equation 2.6.
Correlation = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \, \frac{(i - \mu_i)(j - \mu_j)}{\sqrt{\sigma_i^2 \sigma_j^2}} \qquad (2.6)
where i and j represent gray levels, Ng is the dimension of the square matrix, p(i, j) is the
probability of occurrence of i and j in combination, µ is the GLCM mean, and σ² is the
GLCM variance. The variance is zero when all the pixels in the image have similar
intensities and consequently, correlation is undefined for such an image. However, for
calculation purposes the value of correlation is set to 1 in such situations, indicating that
the pixels have similar intensities. The GLCM mean and variance are presented next to
complete the discussion of GLCM correlation.
GLCM Mean In GLCM mean, pixel value is weighted by its occurrence frequency
in combination with some other pixel value in the neighborhood.
\mu_i = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} i \, p(i, j) \qquad (2.7)

\mu_j = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} j \, p(i, j) \qquad (2.8)
GLCM Variance Calculation of GLCM variance involves the mean and distribution
of cell values around the mean in the GLCM matrix. Zero variance is the indication of a
completely uniform image.
\sigma_i^2 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) (i - \mu_i)^2 \qquad (2.9)

\sigma_j^2 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) (j - \mu_j)^2 \qquad (2.10)

The GLCM standard deviation is given by:

\sigma_i = \sqrt{\sigma_i^2} \qquad (2.11)

\sigma_j = \sqrt{\sigma_j^2} \qquad (2.12)
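Equations 2.6 through 2.10 can be combined into one sketch, including the convention for constant images mentioned above (function names are illustrative):

```python
import math

# GLCM correlation (Equation 2.6) with its means (2.7, 2.8) and
# variances (2.9, 2.10) computed from a normalized GLCM p.

def glcm_stats(p):
    n = len(p)
    mu_i = sum(i * p[i][j] for i in range(n) for j in range(n))
    mu_j = sum(j * p[i][j] for i in range(n) for j in range(n))
    var_i = sum(p[i][j] * (i - mu_i) ** 2 for i in range(n) for j in range(n))
    var_j = sum(p[i][j] * (j - mu_j) ** 2 for i in range(n) for j in range(n))
    return mu_i, mu_j, var_i, var_j

def correlation(p):
    mu_i, mu_j, var_i, var_j = glcm_stats(p)
    if var_i == 0 or var_j == 0:
        return 1.0  # convention for constant images (see text)
    n = len(p)
    return sum(p[i][j] * (i - mu_i) * (j - mu_j)
               for i in range(n) for j in range(n)) / math.sqrt(var_i * var_j)

# A GLCM with all mass on the diagonal is perfectly correlated:
diag = [[0.5, 0.0], [0.0, 0.5]]
```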
Entropy
Entropy is used to capture the randomness of a GLCM. Randomness is an important property
of image texture and can enhance the classification capability of a prediction system.
Entropy returns low values for regular images, whose GLCMs concentrate probability in a
few cells, and is maximal when all elements of the GLCM are equal, as for a highly irregular
image. Entropy can be measured using Equation 2.13.
Entropy = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \ln p(i, j) \qquad (2.13)
Since p(i, j) is a probability measure, its value ranges from 0 to 1 and therefore, the
value of ln p(i, j) is either zero or negative. Smaller values of p(i, j) reveal that
the occurrence of a particular pixel combination is infrequent, which results in large
magnitudes of ln p(i, j). The negative sign in the equation makes the resulting entropy
positive. Entropy attains its minimum value of zero because both 0 × ln(0) (by convention)
and 1 × ln(1) result in 0, while the term −p(i, j) ln p(i, j) is maximal where its derivative
with respect to p(i, j) is zero, i.e., at p(i, j) = 1/e.
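The 0 × ln(0) convention discussed above shows up in code as a simple skip of zero-probability cells; a minimal sketch of Equation 2.13 (function name is illustrative):

```python
import math

# GLCM entropy (Equation 2.13). Cells with zero probability are skipped,
# implementing the 0 * ln(0) = 0 convention.

def glcm_entropy(p):
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

# A one-cell GLCM (constant image) gives the minimum entropy of 0;
# equal cells give the maximum, ln(Ng^2) = ln(4) for a 2-by-2 GLCM.
constant = [[1.0, 0.0], [0.0, 0.0]]
flat = [[0.25, 0.25], [0.25, 0.25]]
```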
Local Homogeneity
The homogeneity measure indicates the homogeneous behavior of an image. The
weighting term 1/(1 + (i − j)^2) in the homogeneity computation is the inverse of that used in
the calculation of Contrast. Therefore, this measure returns high values for low contrast
regions in an image. The weighting term decays with the square of the distance from the
diagonal. Homogeneity can be computed using Equation 2.14.
Homogeneity = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{p(i, j)}{1 + (i - j)^2} \qquad (2.14)
Because the weighting term decreases away from the diagonal, images whose GLCM mass
lies on or near the diagonal yield larger homogeneity values, indicating more homogeneous
scenes in the image.
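The inverse-quadratic weighting of Equation 2.14 can be sketched as follows, assuming a normalized square GLCM `p` (function name is illustrative):

```python
# Local homogeneity (Equation 2.14): the weight 1 / (1 + (i - j)^2)
# favours probability mass on or near the GLCM diagonal.

def homogeneity(p):
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2)
               for i in range(n) for j in range(n))

# Diagonal mass (a low-contrast image) gives the maximum value of 1;
# moving the same mass off the diagonal halves it here.
diag = [[0.5, 0.0], [0.0, 0.5]]
off = [[0.0, 0.5], [0.5, 0.0]]
```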
2.2.2 Texton Image based Statistical Features
Texton images are obtained using the micro-structures known as Texton masks, which are
utilized to explore information related to the horizontal, vertical, diagonal or off-diagonal
patterns present in an image. Difference present in the structure of Texton masks helps in
reducing the overlap of information present along different orientations in an image [33].
Texton elements adopted in this thesis are illustrated in Figure 2.2. All the Texton elements
are described on a 2-by-2 grid termed a Texton Mask.

[Figure 2.2: Different Texton masks T1-T6, each defined on a 2-by-2 grid of pixels p1-p4]

For Texton image construction, the
original image is first quantized to 16 gray levels. The 2-by-2 grid slides over the entire
image from left to right and top to bottom with step size of 2, which detects patterns in
the underlying image defined by a Texton Mask. If the pixel intensities inside a particular
Texton Mask are found to be similar, all pixels in the image under the Texton are retained
intact because they form a valid Texton of the image. On the other hand, if the
pixel intensities, inside a particular Texton Mask, are different, they are all set to zero.
Each Texton Mask constructs a unique Texton image. The final Texton image is obtained
by combining the individual Texton images. The complete process of Texton detection is
demonstrated in Figure 2.3. In this thesis, ten statistical features are extracted from Texton
images. These include:
• Energy
• Contrast
• Homogeneity
• Entropy
• Difference average
• Difference variance
• Difference entropy
• Inertia
• Inverse difference
• Information measure of correlation 1

[Figure 2.3: Procedure of Texton image generation, showing the original image, the detected Textons, the locations of detected Texton types, and the resulting Texton image]
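The sliding-window Texton detection described above can be sketched as follows. The mask definitions here, given as index pairs into the flattened 2-by-2 block [p1, p2, p3, p4], are illustrative assumptions; the exact T1-T6 masks follow [33]. For brevity this sketch also merges all masks into a single output image rather than building one Texton image per mask and combining them afterwards.

```python
# Sketch of Texton image construction: the (already quantized) image is
# scanned in non-overlapping 2-by-2 blocks (step size 2); if the pixels
# selected by any Texton mask have equal values, the block is kept,
# otherwise it is zeroed. MASKS is an illustrative stand-in for T1-T6.

MASKS = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3), (1, 2)]

def texton_image(image, masks=MASKS):
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(0, rows - 1, 2):          # step size 2, top to bottom
        for x in range(0, cols - 1, 2):      # step size 2, left to right
            block = [image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1]]
            if any(block[a] == block[b] for a, b in masks):
                # valid Texton: retain the block intact
                out[y][x], out[y][x + 1] = block[0], block[1]
                out[y + 1][x], out[y + 1][x + 1] = block[2], block[3]
    return out

img = [[5, 5, 1, 2],
       [3, 4, 3, 4]]
tex = texton_image(img)
```

In the toy image the left block contains the equal pair (5, 5) and is retained, while the right block has no matching pair under any mask and is zeroed.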
2.2.3 Zernike Moments
Zernike moments have drawn the attention of researchers for many years in the field of pat-
tern recognition and bioinformatics [5, 20, 38]. Zernike moments are capable of extracting
informative features from fluorescence microscopy protein images about the protein distri-
bution maintaining the property of rotational invariance.
Zernike polynomials are utilized as basis functions for the computation of Zernike mo-
ments. These polynomials are defined over a unit circle where all the moments are orthogo-
nal to each other thus guaranteeing no redundancy in the extracted information from images
and similarly, these moments are mathematically independent [20, 49]. Zernike polynomial
19
consists of a series of terms, which represent a particular property of an image [50]. Each
term in the Zernike polynomial has a coefficient with magnitude and sign that recognizes
the prominent property and its direction of the image components [51]. The magnitudes
of Zernike coefficients/moments are not dependent on the rotation angle of the object and
therefore, useful in extracting information from images, which efficiently illustrate the shape
properties of objects. Zernike moments are adopted to address many research problems re-
lated to the classification of different patterns. Each Zernike moment measures the similarity
of an image to a set of Zernike polynomials [23,36]. The invariance is achieved by computing
the resemblance of the transformed image with the Zernike polynomials using its conjugate.
The absolute value resulting in this way exhibits the property of invariance [17].
Zernike moments are computationally inexpensive as compared to other texture based
features [23]. Theoretically, perfect image reconstruction is possible if a complete set of
Zernike moments of the subject image are available.
2.2.4 Wavelet Features
Wavelet features of an image are obtained by applying DWT procedure that gets infor-
mation from both spatial and frequency domains [7,8,52,53]. DWT decomposes the input
image to a detail and approximation description by utilizing a scaling function and a wavelet
function of the applied wavelet, which correspond to a low-pass and a high-pass filter, re-
spectively [17, 18]. Since an image is a 2D signal comprising rows and columns, when
applying DWT, first, the columns of the input image are convolved with high-pass and
low-pass filters and then the rows of the resulting image are convolved once more with the
high-pass and low-pass filters. Consequently, four convolved images are produced by ap-
plying DWT at each level. These four resultant images represent four different frequency
groups, among these groups three are high-frequency components and one is low-frequency
component. The low-frequency component is along x -direction while the three high fre-
quency components are along x, y and diagonal directions each. The three high-frequency
components have valuable information while the only low-frequency component needs to
be decomposed further by re-applying the same procedure so that valuable information can
be extracted if it has any. At each decomposition level, four new images are obtained for
each input image. The three high-frequency components are stored and the low-frequency
component is further decomposed for extracting more information from it.
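The row-and-column filtering described above can be sketched with the simplest (Haar) filter pair. This is only an illustration: practical DWTs use longer wavelet filters and border handling that are omitted here, and sub-band naming conventions vary.

```python
def haar_dwt2(img):
    """One level of a 2D Haar DWT (illustrative sketch only).

    Returns (LL, LH, HL, HH): the low-frequency approximation and the
    three high-frequency detail sub-bands."""
    def split(m):
        # Pairwise averages (low-pass) and differences (high-pass) per row.
        lo = [[(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)] for row in m]
        hi = [[(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)] for row in m]
        return lo, hi

    def T(m):                             # transpose helper
        return [list(col) for col in zip(*m)]

    L, H = split(img)                     # filter along the rows
    LL, LH = (T(x) for x in split(T(L)))  # then along the columns
    HL, HH = (T(x) for x in split(T(H)))
    return LL, LH, HL, HH
```

At the next decomposition level, the same function is re-applied to the LL sub-band only.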
2.2.5 Local Binary Patterns
LBP operator is used to extract gray level patterns from an image [54]. The rationale
behind its extensive use, by the medical image processing research community, is its reduced
computational cost and resistance to the illumination changes as well as invariant behavior
to the rotation [12].
LBP can either be uniform, rotation invariant or uniform rotation invariant. Uniform
LBP patterns [55, 56] are of practical importance. Considering the binary string circular,
LBP is said to be uniform if there are at most two bitwise transitions from 1 to 0 or 0 to 1 in
that string. For instance, 11100011 is a uniform pattern because only two such transitions
exist whereas 11100101 is a non-uniform pattern because of the existence of 4 transitions.
Uniformity represents important structural features such as spots, edges, and corners.
Another variant of the LBP operator is rotation invariant LBP. The LBP code is invariant
to monotonic transformations of the gray-level values; that is, the code remains constant
as long as the order of the gray-scale values in the image is preserved. Rotating a pattern
consisting merely of 0s or merely of 1s produces no change, so it remains invariant. On the
contrary, rotating a binary pattern comprised of both 0s and 1s produces a different LBP
code, and the rotation invariant mapping assigns all such rotated versions the same code [55].
LBP codes are computed by evaluating the thresholded differences between the intensity of
a central pixel c and the intensities of the PN pixels that surround c on a circle of radius R.
The final LBP code is the weighted sum of these thresholded differences. The mathematical
expression for LBP is given in Equation 2.15.
LBP_{R,PN} = ∑_{p=0}^{PN−1} s(g_p − g_c) · 2^p        (2.15)
Here, PN represents the count of neighboring pixels and R indicates the radius from the
central pixel. The gray-level intensity of the central pixel is represented by g_c whereas g_p
denotes the intensity of the p-th neighboring pixel. The function s(x) thresholds the
difference of each neighboring pixel with the central pixel, as expressed in Equation 2.16.
s(x) = { 1   if x ≥ 0
       { 0   otherwise        (2.16)
LBP can be computed with different PN and R values. The combinations that we have
adopted in our works are (R = 1, PN = 8), (R = 2, PN = 16), and (R = 3, PN = 24).

[Figure 2.4: LBP code generation — a 3-by-3 neighborhood is differenced against its central pixel, thresholded by s(x), and the resulting bits are weighted by powers of two, e.g. 0·2⁰ + 0·2¹ + 0·2² + 1·2³ + 1·2⁴ + 1·2⁵ + 1·2⁶ + 0·2⁷ = 120.]

Figure 2.4 is provided to illustrate the procedure of LBP code generation. In this example, the
procedure is depicted for R=1 and PN=8. First, 3-by-3 neighborhood of a pixel is selected.
Next, the central pixel value is subtracted from the value of each neighboring pixel. Then,
the threshold is applied as given in Equation 2.16 to obtain the LBP textured image.
Finally, this 3-by-3 neighborhood is converted to a single decimal value, which is the LBP
code for this neighborhood. LBP codes for the whole image are obtained in this way.
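The steps above can be sketched as a minimal implementation of Equation 2.15 for R = 1 and PN = 8, together with the circular-transition count used to test uniformity; the clockwise neighbor ordering is a convention assumed here.

```python
def lbp_code(patch):
    """LBP code of a 3-by-3 neighborhood (R = 1, PN = 8), Equation 2.15.
    Neighbors are read clockwise from the top-left corner (an assumed order)."""
    gc = patch[1][1]
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    return sum(2 ** p                   # weight s(g_p - g_c) by 2^p
               for p, (r, c) in enumerate(coords)
               if patch[r][c] - gc >= 0)

def transitions(code, bits=8):
    """Circular 0<->1 transition count; a pattern is uniform if this is <= 2."""
    b = [(code >> i) & 1 for i in range(bits)]
    return sum(b[i] != b[(i + 1) % bits] for i in range(bits))
```

For instance, the pattern 11100011 from the text yields 2 transitions (uniform) while 11100101 yields 4 (non-uniform).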
2.2.6 Local Ternary Patterns
LTP, proposed by Tan and Triggs [57], is a generalized form of LBP coding scheme. In LTP,
the binary differences of the central pixel c with each of the PN pixels in the predefined
neighborhood are based on ternary value rather than binary that is computed according to
a threshold value τ as expressed in Equation 2.17.
s(u) = {  1   if u ≥ c + τ
       { −1   if u ≤ c − τ
       {  0   otherwise        (2.17)
Hence, a textured image is obtained that is less sensitive to noise yet remains discriminative.
The computational complexity of LTP coding is higher compared to the LBP scheme.
Therefore, LTP feature extraction strategy, based on its negative and positive components,
is usually transformed into two equivalent LBP computations as illustrated in Figure 2.5.
The obtained component histograms are concatenated in order to construct the LTP feature vector.

[Figure 2.5: LTP is split into two LBP codes — the positive component keeps the +1 entries of the ternary pattern and the negative component keeps the −1 entries.]

Similar to LBP, LTP codes are also computed using uniform, rotation invariant,
and uniform rotation invariant mappings. These mappings are the same as those discussed
earlier for LBP.
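The splitting of an LTP code into its positive and negative LBP-style components can be sketched as follows, reusing the same clockwise neighbor ordering assumed in the LBP example:

```python
def ltp_split(patch, tau):
    """Positive and negative LBP-style components of a 3-by-3 LTP code.
    The ternary values of Equation 2.17 are separated into two binary codes."""
    gc = patch[1][1]
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    pos = neg = 0
    for p, (r, c) in enumerate(coords):
        u = patch[r][c]
        if u >= gc + tau:               # ternary +1 -> bit of the positive code
            pos += 2 ** p
        elif u <= gc - tau:             # ternary -1 -> bit of the negative code
            neg += 2 ** p
    return pos, neg
```

The histograms of the two returned codes are then concatenated, as described above.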
2.2.7 Threshold Adjacency Statistics
TAS [37] is a simple and efficient morphological feature extraction technique in which an
image is first transformed into three binary images by applying three different thresholds.
The pixel values of the three binary images are in the range of µ to 255, µ− τ to 255 and
µ + τ to 255 where µ shows the average intensity of the original input image and τ is the
threshold provided by the user. Then from each output binary image, 9-bin histograms are
computed, which are concatenated at the end of TAS construction resulting in 27D feature
vector [58]. The statistics computation procedure is applied to 3-by-3 segments of the entire
image after the thresholding step, which is depicted in Figure 2.6. The first statistic is the
total number of white pixels that have no white neighbor. Similarly, the second statistic is
the total number of white pixels that have exactly one white neighbor. The third and fourth
statistics are the aggregate numbers of white pixels with two and three white neighbors,
respectively. Similar statistics are computed for four, five,
six, seven, and eight white pixels found adjacent to a white pixel.

[Figure 2.6: Threshold Adjacency Statistics — the nine statistics count white pixels having 0 through 8 white neighbors.]

This procedure results in 9 threshold adjacency statistics for one threshold image. For the other two images, similar
operations are performed to calculate their respective statistics. Each of these nine statistics
is divided by the total number of white pixels in the threshold image for normalization.
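The counting step described above can be sketched as follows; `tas_statistics` is a hypothetical helper that takes one already-thresholded binary image and returns the nine normalized statistics:

```python
def tas_statistics(binary):
    """Nine threshold adjacency statistics of one binary (0/1) image,
    counting 8-connected white neighbors of every white pixel."""
    h, w = len(binary), len(binary[0])
    counts, white = [0] * 9, 0
    for r in range(h):
        for c in range(w):
            if not binary[r][c]:
                continue
            white += 1
            n = sum(binary[rr][cc]               # white pixels among neighbors
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))
                    if (rr, cc) != (r, c))
            counts[n] += 1
    # Normalize each statistic by the total number of white pixels.
    return [x / white for x in counts] if white else counts
```

Applying this to the three thresholded images and concatenating the results gives the 27D TAS vector.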
2.2.8 Image Features
Prior to the acquisition of Image features from an image, Otsu global thresholding [59]
is applied to convert the image into its binary form. Next, 8-connected elements in the
resultant image are obtained. Image features are produced as:
• Number of objects in the image
• Average and variance of the non-zero pixels per object
• Average and variance of the object distances from the COF
• Ratio of the largest object to the smallest object
• Ratio of the distance of the furthest object from the COF to the distance of the closest
object to the COF
• Euler number of the image (the number of objects in the region minus the number
of holes in those objects)
2.2.9 Edge Features
Edge Features [17] are captured using the Prewitt gradient [60], which is utilized to recog-
nize vertical as well as horizontal edges in an image. Features related to edges in an image
are mean, variance, and median. Additional features are obtained by considering magni-
tude histogram and direction components distributed in 8 bins. Furthermore, the count
of edge pixels contribute to the area feature whereas direction homogeneity is obtained by
considering the total edge pixels present in the first two bins of the direction histogram.
Another edge feature is generated by computing the difference of direction histogram bins
at orientation θ and θ+π. In this connection, difference of the sum of bins 1-4 from the sum
of bins 5-8 is obtained. The differences along both orientations are summed up together
for normalization. Likewise, ratio of the maximum to minimum intensities and ratio of the
maximum to the next maximum intensities are also regarded as edge features.
2.2.10 Hull Features
Binary convex hull is the basis for generating Hull features of an image. The fraction of
the convex hull area occupied by protein fluorescence, the shape of the convex hull, and the
convex hull eccentricity are all Hull related features [17].
2.2.11 Morphological Features
Morphological features are efficient descriptors to differentiate various objects found in
fluorescence microscopy based protein images [17]. Morphological features are constructed
by combining Image, Edge and Hull features discussed in sections 2.2.8, 2.2.9, and 2.2.10,
respectively.
2.2.12 Histogram of Oriented Gradients
HOG is a frequently used feature generation mechanism in the fields of image processing
and computer vision for object detection [61]. The rationale behind HOG development is
that the intensity gradient distribution or edge orientation can exploit the local form and
shape of an object in an image very well. In order to compute HOG, the image is first
divided in sub-images called cells, which can either be rectangular or circular in shape. For
each cell a histogram of gradient orientations is then compiled for pixels within the cell.
Eventually, these histograms are combined to construct the feature space.
For gradient detection, 1D, point discrete derivative mask is applied along X and Y-
axes. The utilized kernel for edge orientation detection is [−1, 0, 1]. Afterwards, these
detected gradients are binned in a histogram, which is either distributed over 0◦ − 180◦
or 0◦ − 360◦ according to the requirements whether to support negative directions or not.
Cells are grouped together in order to construct blocks where detected gradients can be
normalized locally so that illumination changes and contrast variations can be countered.
Blocks can also be rectangular or circular. Rectangular block is normally described using
three parameters including the number of cells in a block, the number of pixels in a cell
and the number of bins in a cell histogram. The feature space is the concatenation of
normalized cell histograms from all the blocks. Circular block can be represented by four
parameters, which include the number of angular and radial bins, the radius of the center
bin, and the expansion factor for the radius of additional radial bins. In this study, we have
utilized the strategy of computing HOG reported in [62], in which the image is divided into
9 rectangular cells with a 9-bin histogram for each cell, resulting in an 81D feature vector.
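The per-cell histogram compilation can be sketched as below; block normalization and vote interpolation are omitted, and the unsigned 0°–180° orientation range is assumed:

```python
import math

def hog_cell_histogram(cell, bins=9):
    """Unnormalized gradient-orientation histogram of one rectangular cell,
    using the [-1, 0, 1] mask; border pixels are skipped for simplicity."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]     # derivative along x
            gy = cell[r + 1][c] - cell[r - 1][c]     # derivative along y
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned range
            hist[min(int(ang * bins / 180.0), bins - 1)] += mag
    return hist
```

Concatenating the histograms of all 9 cells yields the 81D descriptor described above.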
Sometimes, features obtained through different feature extraction strategies need to be
processed further so that the performance of classifier is enhanced by utilizing only the
selected feature subset of the full feature space. For this purpose, we have utilized SMOTE
oversampling and mRMR feature selection as discussed in the next section.
2.3 Post-Processing
Post processing is usually performed on the feature space for further improvement in the
quality of the extracted feature space so that pattern recognition system can efficiently
discriminate different instances belonging to different classes. In addition, instance space
may also be modified synthetically in order to enhance the performance of a recognition
system. In this thesis, we have adopted SMOTE and mRMR as oversampling and feature
selection techniques, respectively.
2.3.1 Oversampling with SMOTE
In the presence of imbalanced data, a pattern recognition system usually gets biased
towards the majority class whereas the minority class is neglected by the classifier, and
hence the system's overall performance is degraded [63,64]. In such situations, the number
of minority class samples is usually increased so that balance with the majority class, in
terms of number of samples, is established.
For this purpose, SMOTE [32] has been utilized in this work to increase the number
of samples of a minority class in a dataset. SMOTE performs its operations in the feature
space rather than the data space. The minority class original sample is input to SMOTE
algorithm and as a result it produces new synthetic instances along the same line segment
where some or all k nearest neighbors of the original input class are located. The synthetic
samples are not merely the replica of the original samples rather they are created following
a distinctive mechanism. The new synthetic sample is created by first finding the difference
of the original sample and its nearest neighbor. Next, a random number in the range [0, 1]
is multiplied with the result of the previous step. New sample is created when this product
is added to the original sample.
The classifier's learning capability becomes more generalized due to the introduction of
synthetic samples using SMOTE. The generalization capability is enhanced because the
new synthetic samples are generated along the line segments joining minority samples in
the feature space. In this way, the bias towards the majority class is reduced and the true
performance of the prediction system is revealed.
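The synthetic-sample construction described above can be sketched as follows; `smote_sample` is a hypothetical helper generating a single synthetic instance:

```python
import random

def smote_sample(minority, k=5, rng=random):
    """One synthetic minority-class sample (sketch of a single SMOTE step)."""
    x = rng.choice(minority)
    others = [m for m in minority if m is not x]
    # k nearest neighbors of x within the minority class.
    nearest = sorted(others,
                     key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
    nn = rng.choice(nearest)
    gap = rng.random()                  # random number in [0, 1)
    # The synthetic sample lies on the segment joining x and its neighbor.
    return [a + gap * (b - a) for a, b in zip(x, nn)]
```

Repeating this step generates as many synthetic minority samples as needed to balance the classes.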
2.3.2 Feature Selection with mRMR
Feature selection is an imperative issue in designing and developing pattern recognition and
classification systems [65]. Feature selection is a process in which an optimal subset of the
entire feature space is utilized to assess the discrimination power of a system. This phe-
nomenon provides us a mechanism to analyze the feature space constructed for classification
purpose. The ultimate goal of feature selection is to eradicate unnecessary features, remove
redundancy, and reduce noise while keeping the discriminative information intact. Feature
selection may additionally improve the accuracy as well as generalization capability of the
classification system.
In this connection, mRMR is employed as feature selection mechanism, which has been
frequently reported by many researchers in the fields of Bioinformatics and machine learn-
ing [66–68]. The feature selection with mRMR attempts to reduce redundancy during fea-
ture subset selection as well as retains most relevant features, required to classify instances
belonging to different classes. The features selected by mRMR share minimum redundancy
with the other features in the feature space and maximum relevance to the target class
variables. The minimum redundancy and maximum relevance can be calculated using the
mutual information found among the features and the class variables.
Mutual information among different features is calculated using Equation 2.18.
MI(x, y) = ∑_{i,j∈N} p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]        (2.18)
Here, x and y are any two features, p (xi, yj) is the estimate of joint probability density
function and p(xi), p(yj) are marginal probability density functions. Similarly, mutual
information of features with the target class variables is obtained using Equation 2.19.
MI(x, z) = ∑_{i,k∈N} p(x_i, z_k) log [ p(x_i, z_k) / (p(x_i) p(z_k)) ]        (2.19)
Here, x is a feature and z is a target class. Minimum redundancy is achieved using
Equation 2.20.
min(mR) = (1 / |S|²) ∑_{x,y∈S} MI(x, y)        (2.20)
Here, |S| denotes the number of features in the subset S. The relevance of the features towards the target
class is maximized using Equation 2.21.
max(MR) = (1 / |S|) ∑_{x∈S} MI(x, z)        (2.21)
The final feature space is constructed by optimizing Equations 2.20 and 2.21, simulta-
neously as given in Equation 2.22.
mRMR = max_S [MR − mR]        (2.22)
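Equation 2.18 can be sketched for discrete features, with the probability densities estimated from empirical counts (an assumption made here; the thesis does not specify the density estimator):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Equation 2.18 for two discrete feature columns, with all probability
    densities replaced by empirical relative frequencies."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

The same function applied to a feature column and the class-label column gives the relevance term of Equation 2.19.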
2.4 Classification Algorithms
Classification systems are developed for assigning appropriate label to an unknown object
[69]. In order to achieve this goal accurately, any classification system is first passed through
the training phase. A classification system predicts the label of an unknown test sample
using the knowledge acquired from the training set, which is composed of objects whose
labels are already known. The classifier learns about the objects in the training phase
from their attribute values obtained during the feature extraction. The classifier then
predicts the label of a test sample using the attribute values of that test sample. This
whole procedure in which a training set is utilized by a classification system to predict the
unknown sample is called supervised learning. A number of machine learning algorithms
for classification have been applied in the fields of pattern recognition, bioinformatics, and
computer vision [1,70,71]. The classification algorithms, which have been employed in this
thesis, are discussed as follows.
2.4.1 Support Vector Machine
SVM [72] is a popular machine learning algorithm, which has found many applications in the
fields of bioinformatics, pattern recognition and classification [12,30,37,73]. SVM is based
upon the SRM principle, which enables SVM to generalize efficiently. It is due to the fact
that SRM introduces a minimized upper bound on the generalization error. SVM is inher-
ently developed for two-class classification problems; however, it is applicable to multi-class
classification problems through one-versus-one, one-versus-all and directed acyclic graph
SVM strategies.
In order to convert a non-linearly separable problem into a linearly separable one, SVM
transforms the input data space into higher dimensional space where it may find a linear
separation between the classes. In a two-class classification problem, the objective of SVM
is to formulate such a separation between the two classes that could lead to fine gener-
alization. In this connection, SVM maximizes the distance of the separating hyperplane
from the closest data points called support vectors of both the classes. Consequently, the
generalization capability of this hyperplane should be good on unseen samples.
Consider the training pair (xi, yi) where xi ∈ RNf and yi ∈ {−1,+1} with i =
1, 2, ...., Nf . The instance xi is assigned either of the labels in yi. The separating hy-
perplane is constructed as given in Equation 2.23.
f(x) = ∑_{i=1}^{Nf} α_i y_i x_iᵀ x + bias,  where α_i > 0        (2.23)
where α_i is the Lagrange multiplier. For linearly separable samples, SVM utilizes the
dot product of two points as kernel function in the data space. However, for non-linearly
separable samples, the separating hyperplane is calculated differently as given in Equation
2.24.
ϕ(w, ξ) = (1/2)‖w‖² + C ∑_{i=1}^{Nf} ξ_i        (2.24)
provided that the condition y_i(wᵀϕ(x_i) + b) ≥ 1 − ξ_i, with ξ_i > 0, is satisfied. Further,
C > 0 is the cost of misclassification associated with the slack term ∑_{i=1}^{Nf} ξ_i. Here,
ϕ(x) is the nonlinear mapping function, which is used by SVM to transform the
Nf-dimensional input space into a higher, Mf-dimensional space, where ϕ : R^{Nf} → F^{Mf}
and Mf > Nf. The nonlinear separating hyperplane is now given
in Equation 2.25.
f(x) = ∑_{i=1}^{N} α_i y_i K(x_i, x) + bias        (2.25)

where N is the number of support vectors and K(x_i, x) represents the kernel.
Researchers have proposed different types of kernels to compute the inner product effi-
ciently. The kernel functions, which can be utilized inside SVM, include linear, polynomial,
RBF, and sigmoid kernels. SVM with RBF or sigmoid kernel takes C and γ as input
parameters, SVM with polynomial kernel takes an additional parameter d, which indicates
the degree of the polynomial; however, SVM with linear kernel works with C parameter
only. If x and y represent two feature vectors, then the linear kernel can be formulated as
given in Equation 2.26.
K(x, y) = x · y        (2.26)
SVM does not map the input data into higher dimensional feature space while using lin-
ear kernel, therefore, the computations are performed faster. In order to classify non-linearly
separable feature space, we employ polynomial kernel of SVM, which can be expressed as
shown in Equation 2.27.
K(x, y) = (x · y + 1)^d        (2.27)
In Equation 2.27 d represents the degree of polynomial kernel. The shape of separating
hyperplane depends on the degree d, which controls its complexity in the input data space.
Polynomial kernel is equivalent to the linear kernel for d = 1. Another very important
kernel, the RBF kernel is mathematically expressed as given in Equation 2.28.
K(x, y) = exp(−γ‖x − y‖²)        (2.28)
In the RBF kernel, γ is used to describe the width of the Gaussian function. In this
thesis, LIBSVM1 library is utilized for conducting the experiments. Different parameters
of SVM are set through the grid search approach using internal cross validation on the
training data.
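The three kernels of Equations 2.26–2.28 can be sketched directly:

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))          # Equation 2.26

def polynomial_kernel(x, y, d=2):
    return (linear_kernel(x, y) + 1) ** d            # Equation 2.27

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)                # Equation 2.28
```

The parameter values d = 2 and γ = 0.5 are placeholders; in practice they are tuned by the grid search mentioned above.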
2.4.2 Random Forest Ensemble
RF ensemble is a classification algorithm [74] in which decision trees are utilized as base
classifiers. These trees are constructed by randomly drawing instances from the original
¹Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
instance space with replacement, employing bootstrap sampling. A large number of trees
is generated, where each tree is based on a randomized subset of predictors; hence the
name Random Forest.
randomly selected in order to locate the optimal split at a particular node. The prediction
of each tree, regarding the label of certain input, is considered as a single vote. The notion
of majority voting is employed in order to combine the predictions of all the individual trees
in a forest.
RF ensemble is robust against noisy data due to the intrinsic tree structure. In addition,
it is able to manage large number of attributes efficiently [75]. The functioning of RF
ensemble involves two input parameters: the number of variables randomly selected at each
node in order to determine the prediction at a particular node and number of trees in the
forest. The number of randomly selected variables at each node should be fewer than the
attributes in the dataset. RF ensemble may perform poorly in the presence of imbalanced
data in which case the classifier may get biased towards the majority class [76].
2.4.3 Rotation Forest Ensemble
The RotF [77, 78] classifier builds an ensemble of classifiers from various decision trees,
which serve as base learners for the ensemble. Although the decision trees are independent
of each other, the whole dataset is utilized by each tree in a rotated feature space. The
rotation of the data around the feature axes is achieved through PCA. In the course
of building RotF ensemble, H subsets are randomly generated from the attribute set where
each subset is subsequently manipulated by PCA. The obtained principal components in this
manner are preserved according to the initial sequence and thus a linearly transformed fea-
ture space is produced. Decision trees are sensitive to rotations of the feature axes, which
therefore introduces diversity in the ensemble. Another factor affecting diversity is the introduction
of H splits of the attribute set, which leads to different feature spaces for the classification
stage. The following parameters may be adopted for building the RotF ensemble.
• Number of features in each subset of H subsets
• Number of classifiers to build the ensemble
• Feature extraction technique
• Base learner
2.5 Performance Parameters
Performance evaluation is an important phase of prediction system development. A newly
developed model is required to be assessed in this phase so that it becomes evident that
the problem is addressed properly. In machine learning, quite a few performance indicators
have been identified and recognized as proficient in assessing the performance of an
algorithm. However, the utilization of a particular parameter in different problem domains
is dependent upon the data distribution and classification task in that domain. The per-
formance parameters are extracted from a confusion matrix, which tabulates the actual
labels against the predicted labels for each class. Table 2.2 illustrates a confusion ma-
trix for a binary class problem. However, confusion matrices for multi-class classification
problems are easily constructed.

Table 2.2: A confusion matrix

                      Predicted Positives   Predicted Negatives
  Actual Positives    True Positives        False Negatives
  Actual Negatives    False Positives       True Negatives

TP indicates the number of positive instances predicted as
positives whereas FN shows number of positive instances predicted as negatives. Similarly,
FP specifies number of negative instances predicted as positives and TN signifies number
of negative instances predicted as negatives. FP is also called Type-I error or error of the
first kind whereas FN is known as Type-II error or error of the second kind. Some of the
performance parameters extensively utilized by machine learning and pattern recognition
community are discussed as follows.
2.5.1 Accuracy
The error rate or accuracy is utilized by the researchers to measure the efficiency of a
prediction system. Accuracy measures both the true positives and true negatives returned
by a learning system. It is computed using Equation 2.29.
Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100        (2.29)
Though accuracy is considered a proper parameter to assess the performance of a clas-
sifier, it sometimes fails to measure the true performance of the system. For example, in
case of imbalanced data, classifier usually gets biased towards the majority class and con-
sequently the higher accuracy does not reveal the actual performance of the entire system.
2.5.2 Sensitivity/Specificity
Sensitivity and specificity indicate the true positive and true negative rates, respectively,
of the prediction system. Equations 2.30 and 2.31 are used to measure sensitivity and
specificity of a system.
Sensitivity = TP / (TP + FN) × 100        (2.30)

Specificity = TN / (FP + TN) × 100        (2.31)
Sensitivity and specificity both play a key role in the computation of accuracy. High
values of sensitivity and specificity lead to high accuracy. In case of high sensitivity and
low specificity, accuracy gets biased towards sensitivity; conversely, in case of high specificity
and low sensitivity, accuracy is dominated by specificity. Similarly, low values of sensitivity
and specificity result in low accuracy.
2.5.3 Matthews Correlation Coefficient
MCC [79,80] is a performance parameter used to measure the quality of a prediction system.
It can intrinsically transform a confusion matrix into a scalar value ranging from −1 to
+1, where −1 means that the classifier consistently produces incorrect predictions and +1
assures that the classifier always generates correct predictions. The value 0 indicates that
the classifier produces average random predictions. Equation 2.32 is used to calculate MCC.
MCC = (TP × TN − FP × FN) / √([TP + FP][TP + FN][TN + FP][TN + FN])        (2.32)
MCC is a very useful measure that inherently has the ability to resolve challenges faced
by accuracy particularly in situations where balanced data is not available to the classifier.
For example, a classifier correctly predicts all the instances of a positive majority class
and incorrectly predicts all the instances of a negative minority class. In such cases, the
performance of the classifier is not promising but accuracy will show reasonably good results.
However, the true performance of the classifier is exposed using the MCC, which will show
the performance as 0.
2.5.4 F-measure
F-measure or F-Score approximates the performance accuracy of the executed test [81,
82]. It is employed in tasks where the classification system is required to correctly predict
instances of a particular class without predicting too many instances of other classes. F-
measure is computed using the harmonic mean of both the precision p and recall r of the
test. Therefore, F-measure greatly depends on these two measures. Precision is the ratio
of true positives to the number of predicted positives whereas recall is the ratio of true
positives to the number of actual positives. Precision and recall are also known as positive
predictive value and sensitivity, respectively. F-measure returns its output in the range
[0, 1] where output closer to 0 indicates poor performance and the output closer to 1 reveals
good performance.
Precision = TP / (TP + FP)        (2.33)

Recall = TP / (TP + FN)        (2.34)

F-Score = 2 × (Recall × Precision) / (Recall + Precision)        (2.35)
In order to obtain a reasonable score for F-measure, a trade-off between precision and
recall is usually sought, since tuning a system to improve precision typically lowers recall
and vice versa.
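Equations 2.29–2.35 can be collected into one sketch (MCC is returned on its natural −1 to +1 scale):

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, MCC, and F-score from one binary
    confusion matrix (Equations 2.29-2.35)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (fp + tn) * 100
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f_score = 2 * recall * precision / (recall + precision)
    return accuracy, sensitivity, specificity, mcc, f_score
```

For a multi-class problem, the same computation is applied per class in a one-versus-rest fashion.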
2.5.5 Q-Statistic
Q-Statistic is a performance parameter used to measure the diversity among the member
classifiers in an ensemble [25,83]. The Q-Statistic gives similarity value as its output between
the base classifiers and therefore its returned value is subtracted from 1 to obtain the
diversity value. The 2-by-2 contingency table illustrates the concept of correlation between
two base classifiers BCi and BCj in an ensemble.

Table 2.3: Correlation between classifier BCi and classifier BCj

                 BCj hit (1)   BCj miss (0)
  BCi hit (1)    N11           N10
  BCi miss (0)   N01           N00

In Table 2.3, N11 and N00 are, respectively, the correct and incorrect predictions of both
the classifiers. The correct predictions of the 1st classifier and the incorrect of the 2nd are
indicated by N10. Conversely, the correct predictions
of 2nd classifier and incorrect of 1st are shown by N01. The Q-Statistic between any two
base classifiers is calculated using Equation 2.36.
Q_{i,j} = (N11 × N00 − N10 × N01) / (N11 × N00 + N10 × N01)        (2.36)
or for the sake of discussion, we may write as under:
Qi,j =(hits× hits)− (misses×misses)(hits× hits) + (misses×misses)
(2.37)
When the two base classifiers always agree (N10 = N01 = 0), the numerator and the
denominator are the same, leading to Q-Statistic = 1. Conversely, when they always
disagree (N11 = N00 = 0), Q-Statistic = −1. When agreements and disagreements balance
(N11 × N00 = N10 × N01), Q-Statistic = 0. Thus the output of Q-Statistic ranges from
−1 to +1. Q-Statistic = 1 means that there exists perfect positive correlation between
the two classifiers. Similarly, Q-Statistic = −1 indicates that perfect negative correlation
exists between the two classifiers. However, Q-Statistic returns 0 for statistically
independent classifiers. In order to compute the Q-Statistic for an ensemble consisting of
multiple classifiers, the average over all pairs of the BN base classifiers is computed using
Equation 2.38.
Qavg = (2 / (BN(BN − 1))) × Σ(i=1 to BN−1) Σ(k=i+1 to BN) Qi,k    (2.38)
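A minimal sketch of Equations 2.36 and 2.38, assuming hard label predictions, known true labels, and that no pair of classifiers yields a zero denominator:

```python
from itertools import combinations

def q_statistic(c1, c2, y):
    """Pairwise Q-statistic (Eq. 2.36) from two classifiers' predictions c1, c2
    against true labels y."""
    n11 = n00 = n10 = n01 = 0
    for p1, p2, t in zip(c1, c2, y):
        if p1 == t and p2 == t:
            n11 += 1            # both correct
        elif p1 != t and p2 != t:
            n00 += 1            # both incorrect
        elif p1 == t:
            n10 += 1            # only the 1st classifier correct
        else:
            n01 += 1            # only the 2nd classifier correct
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

def q_avg(preds, y):
    """Average over all classifier pairs (Eq. 2.38); equivalent to the
    2 / (BN(BN-1)) double-sum, since that counts each unordered pair once."""
    pairs = list(combinations(range(len(preds)), 2))
    return sum(q_statistic(preds[i], preds[k], y) for i, k in pairs) / len(pairs)
```

The factor 2 / (BN(BN − 1)) in Equation 2.38 is simply the reciprocal of the number of unordered pairs, so the average over pairs implements the formula directly.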
2.5.6 Multiclass ROC
AUC, the area under the ROC curve, is a valuable performance metric for measuring how
well two different categories are separated. We have adopted the method presented in [84]
to compute the AUC for multi-class classification problems. In this case, AUC is computed
for all pairwise combinations of the classes (the 10 or 8 classes of the datasets used), and
the total AUC is the mean of all pairwise AUCs.
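The pairwise averaging can be sketched as follows, assuming a single decision score per sample (a simplification; the full method of [84] works with per-class probability estimates):

```python
from itertools import combinations

def pairwise_auc(scores, labels, pos, neg):
    """AUC separating class `pos` from class `neg` via the Mann-Whitney
    formulation: the fraction of (pos, neg) pairs ranked correctly."""
    pos_s = [s for s, l in zip(scores, labels) if l == pos]
    neg_s = [s for s, l in zip(scores, labels) if l == neg]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_s for n in neg_s)
    return wins / (len(pos_s) * len(neg_s))

def total_auc(scores, labels):
    """Total AUC: mean AUC over all unordered class pairs."""
    classes = sorted(set(labels))
    aucs = [pairwise_auc(scores, labels, a, b)
            for a, b in combinations(classes, 2)]
    return sum(aucs) / len(aucs)
```

For a 10-class dataset this averages over 45 class pairs, and for an 8-class dataset over 28.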
2.6 Datasets
We have used various protein subcellular localization image datasets to assess the
performance of our proposed models. These include the HeLa dataset from the Murphy Lab [5],
two LOCATE datasets from the LOCATE subcellular localization database [8], the Vero dataset
from the AIIA lab in Taiwan [6], and two CHO datasets: one from the Murphy Lab and the other
from the AIIA lab [6]. These are multi-class datasets covering the major subcellular
structures of eukaryotic cells.
Table 2.4 shows the breakup of samples into the different classes of the HeLa dataset. The
2D HeLa dataset includes 862 fluorescence microscopy protein images distributed over 10
distinct categories.
Table 2.4: HeLa dataset classes and image breakup in each class
S. No Class Name Images/class
1 ActinFilaments 98
2 Nucleus 87
3 Endosomes 91
4 ER 86
5 Golgi Giantin 87
6 Golgi GPP130 85
7 Lysosome 84
8 Microtubules 91
9 Mitochondria 73
10 Nucleolus 80
Total 862
The CHOM dataset contains 327 fluorescence microscopy protein images distributed
over five categories. Similarly, the CHOA dataset has 668 fluorescence microscopy
protein images categorized into eight classes. The Vero image dataset contains 1472 images
representing eight subcellular structures of monkey cells. Tables 2.5, 2.6 and 2.7 show the
details of the CHOM, CHOA and Vero datasets, respectively. (The CHOA and Vero datasets
are available at http://aiia.iis.sinica.edu.tw/)
Table 2.5: CHOM dataset classes and image breakup in each class
S. No Class Name Images/class
1 Golgi 77
2 DNA 69
3 Lysosome 97
4 Nucleolus 33
5 Cytoskeleton 51
Total 327
Table 2.6: CHOA dataset classes and image breakup in each class
S. No Class Name Images/class
1 Actin 161
2 ER 143
3 Golgi 156
4 Microtubule 67
5 Mitochondria 46
6 Nucleolus 15
7 Nucleus 28
8 Peroxisome 52
Total 668
Table 2.7: Vero dataset classes and image breakup in each class
S. No Class Name Images/class
1 Actin 65
2 ER 372
3 Golgi 145
4 Microtubule 101
5 Mitochondria 179
6 Nucleolus 110
7 Nucleus 444
8 Peroxisome 56
Total 1472
Similarly, the two LOCATE datasets, Endogenous and Transfected, contain 502 and 553
fluorescence microscopy protein images, respectively, as shown in Table 2.8. The LOCATE
Endogenous protein images are grouped into 10 classes, whereas the LOCATE Transfected
protein images from the same imaging technique are categorized into 11 classes. (Both
datasets are available at http://locate.imb.uq.edu.au/)
Table 2.8: LOCATE Endogenous and Transfected datasets: images breakup
S. No Class Name Endogenous Transfected
1 Actin-cytoskeleton 50 50
2 Cytoplasm 0 50
3 Endosomes 49 50
4 Endoplasmic Reticulum 50 48
5 Golgi 46 50
6 Lysosome 50 50
7 Microtubule 50 50
8 Mitochondria 50 55
9 Nucleus 50 50
10 Peroxisomes 50 50
11 Plasma Membrane 57 50
Total 502 553
Summary
In this chapter, the existing literature devoted to the development of automated systems for
protein subcellular localization is reviewed. Further, the different feature extraction
strategies utilized in this thesis are discussed. From machine learning theory, the
classification algorithms, post-processing techniques and performance indicators relevant to
this thesis are also discussed. The chapter concludes with a presentation of the benchmark
datasets used to assess the performance of the prediction systems developed during the
course of this thesis. In the next chapter, the SVM-SubLoc prediction system is presented,
which is developed for protein subcellular localization images from the HeLa and LOCATE
datasets.
Chapter 3
Protein Subcellular Localization
using Spatial and Transform
Domain Features
Protein subcellular localization images of living cells are usually acquired through
fluorescence microscopy, and automated systems are required for the analysis and
classification of these images. Automated prediction systems do exist in the literature;
however, their low prediction accuracy and large feature spaces make them less attractive
to the research community.
In this chapter, different spatial and transform domain features are assessed for their
efficacy in differentiating fluorescence microscopy protein images.
The aim of this work is to build a more efficient and reliable prediction system for protein
subcellular localization utilizing low dimensional feature spaces with enhanced prediction
accuracy. Both individual and hybrid feature spaces are constructed in spatial as well as
transform domains. The overview of the prediction system is given in the following section.
3.1 The SVM-SubLoc Prediction System
The proposed SVM-SubLoc prediction system is shown in Figure 3.1. The main phases of
SVM-SubLoc include the feature extraction and classification phases. In the feature ex-
traction phase, different individual features are extracted in spatial and transform domains
whereas in the classification phase different SVMs are trained using these features. The
final prediction is obtained through the majority voting scheme.
The performance of SVM-SubLoc prediction system is evaluated using three bench-
mark protein image datasets from fluorescence microscopy including 2D HeLa, LOCATE
Endogenous and LOCATE Transfected datasets.
[Figure: the input image is passed through an MR filter; spatial domain features (Haralick
textures, Zernike features, LBP, LTP, TAS), transform domain features (Haralick textures,
Zernike features) and hybrid features (ZHar, HarLBP, HarLTP, HarTAS) are fed to lin-SVM,
poly-SVM, RBF-SVM and sig-SVM, whose ensemble decision yields the predicted result.]

Figure 3.1: The SVM-SubLoc prediction system
3.1.1 Feature Extraction Phase
In the feature extraction phase of SVM-SubLoc, different individual features are extracted
from fluorescence microscopy protein images in spatial as well as transform domains.
The spatial domain features from an image are extracted by forwarding the image di-
rectly to the feature extraction stage. Individual features include Haralick coefficients,
Zernike moments, LBP, LTP, and TAS. However, for transform domain feature extraction,
an image is first passed through the multi-resolution filter and then this transformed image
is forwarded to the feature extraction stage. Zernike moments and Haralick textures are the
only features, which have been extracted in spatial as well as transform domains. Domain
transformation is achieved through DWT.
3.1.1.1 GLCM Construction and Haralick Co-efficients
Prior to GLCM construction, each input image is first quantized to eight gray levels,
resulting in GLCMs of 8-by-8 dimensions. For each input image, four GLCMs are constructed
along the horizontal, vertical, diagonal, and off-diagonal directions for pixel distance
∆ = 1. After this, Haralick features are extracted from each GLCM separately and then
the mean values for each of the features are computed. This is in contrast to the method of
first computing the combined GLCM and then computing the features. The construction
of GLCMs and extraction of the feature spaces from these GLCMs are shown in Figure 3.2.
[Figure: each of GLCMH, GLCMV, GLCMD and GLCMoD feeds a feature extraction block computing
13 Haralick features (energy, contrast, correlation, inertia, inverse difference, entropy,
sum average, sum variance, sum entropy, difference variance, difference entropy, and
information measures of correlation 1 and 2); the four resulting feature sets are then
averaged pairwise.]

Figure 3.2: Feature extraction from GLCMH, GLCMV, GLCMD and GLCMoD
In case of each GLCM, 13 features are extracted resulting in 52 features in total. The
feature space dimension is reduced by averaging the features from GLCMH with the features
from GLCMV. Similarly, the features generated from GLCMD are averaged with GLCMoD.
The final feature vector is composed of these mean values, resulting in 26 features for each
fluorescence microscopy protein image.
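The GLCM construction described above can be sketched in plain Python; the 3-level toy image below is hypothetical, and the 13 Haralick feature formulas themselves are omitted:

```python
def glcm(img, dy, dx, levels=8):
    """Gray-level co-occurrence matrix for one direction (dy, dx) at distance 1,
    for an image already quantized to `levels` gray levels."""
    g = [[0] * levels for _ in range(levels)]
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                g[img[r][c]][img[r2][c2]] += 1
    return g

# Horizontal, vertical, diagonal and off-diagonal GLCMs at pixel distance 1
directions = {"H": (0, 1), "V": (1, 0), "D": (1, 1), "oD": (-1, 1)}
img = [[0, 1, 1], [2, 2, 0], [1, 0, 2]]   # hypothetical 3-level quantized image
glcms = {name: glcm(img, dy, dx, levels=3)
         for name, (dy, dx) in directions.items()}
```

The 13 Haralick features would then be computed from each of the four GLCMs, the H and V feature sets averaged, the D and oD feature sets averaged, and the two means concatenated into the 26D vector.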
3.1.1.2 Discrete Wavelet Transformation
The input image is decomposed up to four levels using DWT with Haar filter. For each
input image of size M-by-N, DWT produces four sub-images of size M/2-by-N/2 at each
decomposition level as shown in Figure 3.3. The sub-images are the approximation (LL),
the vertical detail (HL), the horizontal detail (LH), and the diagonal detail (HH). The
sub-band images are then forwarded to the feature extraction phase, where features are
extracted individually for each level and each sub-band.
[Figure: an M-by-N protein image is transformed by DWT into the LL, LH, HL and HH
sub-bands.]

Figure 3.3: DWT Transformation
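A minimal one-level 2D Haar decomposition, assuming an even-sized image and a 1/2 normalization (conventions differ between libraries; in practice a package such as PyWavelets would typically be used):

```python
def haar_dwt2(img):
    """One-level 2D Haar DWT over non-overlapping 2-by-2 blocks.
    Returns the LL, LH, HL and HH sub-images, each of size M/2-by-N/2.
    Assumes M and N are even and uses the orthonormal 1/2 scaling."""
    M, N = len(img), len(img[0])
    ll = [[0.0] * (N // 2) for _ in range(M // 2)]
    lh = [row[:] for row in ll]
    hl = [row[:] for row in ll]
    hh = [row[:] for row in ll]
    for i in range(0, M, 2):
        for j in range(0, N, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2   # approximation
            lh[i // 2][j // 2] = (a + b - c - d) / 2   # horizontal detail
            hl[i // 2][j // 2] = (a - b + c - d) / 2   # vertical detail
            hh[i // 2][j // 2] = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh
```

On a constant image all detail sub-bands are zero and the approximation carries the whole signal, which is the behaviour exploited when features are extracted per sub-band.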
As discussed earlier, each input image is decomposed up to four decomposition levels,
however, only two decomposition levels are shown in Figure 3.4. During the computa-
tion of Haralick coefficients from GLCMs, 26D feature space is generated for fluorescence
microscopy protein image at decomposition level 0, which is in fact the image in spatial
domain. It can be observed that at each decomposition level, four decomposed images are
generated for each input image. The feature vector for each decomposition level is obtained
by concatenating the feature vectors generated from each sub-band image at that level.
For instance, at decomposition level 1, 26 × 4 = 104D feature space is generated where
26 is the dimension of the feature vector for a single fluorescence microscopy image and 4
indicates the number of component images at 1st decomposition level. Similarly, for the
second decomposition level, 26 × 16 = 416D feature space is computed where 26 is the
feature vector for single image and 16 is the number of component images at 2nd decompo-
sition level. At decomposition level 3, 26 ×64 = 1664D feature vector is constructed. At
decomposition level 4, the size of the feature space is 26× 256 = 6656D.
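The dimension arithmetic above follows a single pattern: each decomposition level multiplies the number of sub-band images, and hence the feature dimension, by four:

```python
# Feature-space dimension at decomposition level L: d * 4**L features,
# where d is the per-image dimension (26 for Haralick, 49 for Zernike).
def feature_dim(d, level):
    return d * 4 ** level

print([feature_dim(26, L) for L in range(5)])   # [26, 104, 416, 1664, 6656]
```

The same call with d = 49 reproduces the Zernike dimensions given in the next paragraph (196, 784, 3136, 12544).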
[Figure: the original protein image (decomposition level 0) is split at level 1 into the
approximation (LL1), vertical (HL1), horizontal (LH1) and diagonal (HH1) details; the
decomposition is repeated at level 2, shown for one branch as LL2, LH2, HL2 and HH2.]

Figure 3.4: Fluorescence microscopy protein image of size M-by-N is split into four sub-
images at each decomposition level. Decomposition level 0 indicates the original fluorescence
microscopy protein image.
Likewise, Zernike moments are computed up to order 12, resulting in a 49D feature space
for each fluorescence microscopy protein image at decomposition level 0. The feature spaces
obtained for multiple decomposed images are concatenated to produce a single feature vector
that represents all the sub-images at a particular decomposition level. For decomposition
level 1, the size of feature vector is 49× 4 = 196D, where 49 is the dimension of the feature
vector for one image and 4 indicates the number of sub-images generated at decomposition
level 1. Similarly, the feature vector at decomposition level 2 is of 49 × 16 = 784D, where
49 is the feature space dimension for a single fluorescence microscopy protein image and 16 is the
number of component images at 2nd decomposition level. The Zernike moments generated
at decomposition level 3 are of 49× 64 = 3136D. The feature space dimension constructed
for decomposition level 4 is 49× 256 = 12544D.
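The count of 49 moments up to order 12 can be checked by noting that order n contributes n // 2 + 1 moment magnitudes, assuming moments with non-negative repetition m (with n − m even) are used:

```python
# Number of Zernike moment magnitudes up to a given order: for order n, the
# repetitions m satisfy 0 <= m <= n with n - m even, giving n // 2 + 1 per order.
def zernike_count(max_order):
    return sum(n // 2 + 1 for n in range(max_order + 1))

print(zernike_count(12))   # 49, matching the 49D level-0 feature space
```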
3.1.1.3 The Hybrid Model
Hybrid feature spaces are developed in the hope that they may enhance the discrimination
capability of the individual feature spaces. Hybrid features are constructed from the
concatenation of individual features. These include ZHar (Zernike + Haralick), HarLBP (Haralick
+ LBP), HarLTP (Haralick + LTP) and HarTAS (Haralick + TAS). In the hybrid models
formation, only Haralick textures obtained in the spatial domain are combined with other
spatial domain features. However, in the formation of ZHar hybrid model both spatial and
transform domain Haralick and Zernike features are utilized.
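Concatenation-based hybrids can be sketched as follows; the dimensions used in the usage example are those appearing in the tables later in this chapter (26D Haralick + 27D TAS giving the 53D HarTAS space):

```python
def hybrid(*feature_vectors):
    """Concatenate individual feature vectors into one hybrid vector."""
    out = []
    for v in feature_vectors:
        out.extend(v)
    return out

# Hypothetical placeholder vectors: Haralick (26D) + TAS (27D) -> HarTAS (53D)
har, tas = [0.0] * 26, [0.0] * 27
assert len(hybrid(har, tas)) == 53
```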
3.1.2 Classification Phase
The classification phase of SVM-SubLoc is highlighted in Figure 3.5. In this phase, four
SVMs are trained using individual and hybrid features in both the spatial and transform
domains. The four SVMs are lin-SVM, poly-SVM of degree 2, RBF-SVM, and sig-SVM.
The parameters for SVM kernels are obtained using the grid search approach.
[Figure: the feature vector is fed to lin-SVM, poly-SVM, RBF-SVM and sig-SVM, whose outputs
are combined by the ensemble into the final prediction.]

Figure 3.5: The classification phase of SVM-SubLoc
In order to further enhance the prediction accuracies, the output predictions of these
individual SVMs have been integrated as given in Equation 3.1.
MajEns = lin-SVM ⋆ poly-SVM ⋆ RBF-SVM ⋆ sig-SVM    (3.1)

where ⋆ is the fusion operator and MajEns is the output ensemble classifier. The complete
procedure of MajEns is given as follows. Let the predictions of the individual base
classifiers be B1, B2, ..., BN, each taking a value from the class labels CL1, CL2, ..., CLM.
Then the vote count for each class label is computed as given in Equation 3.2.

Yj = Σ(i=1 to N) δ(Bi, CLj)  for j = 1, 2, ..., M, with N = 4    (3.2)

where δ(Bi, CLj) = 1 if Bi = CLj, and 0 otherwise.
The final prediction is yielded by combining the individual predictions through the
majority voting scheme as given in Equation 3.3.

YMajEns = max{Y1, Y2, ..., YM}    (3.3)
where YMajEns is the output prediction of the ensemble.
In situations where the votes are tied for a certain instance, preference is given to the
prediction of the highest performing classifier.
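The voting scheme of Equations 3.2 and 3.3, including the tie-breaking rule, can be sketched as follows (the classifier ranking and labels in the usage example are hypothetical):

```python
def majority_vote(predictions, ranking):
    """Majority voting over base-classifier predictions (Eqs. 3.2-3.3).
    `ranking` lists classifier indices from best to worst performing and is
    used to break ties in favour of the highest performing classifier."""
    votes = {}
    for label in predictions:                       # Eq. 3.2: count votes Yj
        votes[label] = votes.get(label, 0) + 1
    best = max(votes.values())                      # Eq. 3.3: max vote count
    tied = [label for label, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    for i in ranking:                               # tie: defer to best classifier
        if predictions[i] in tied:
            return predictions[i]

# Four SVM outputs; RBF-SVM (index 2) ranked best
print(majority_vote(["Golgi", "ER", "Golgi", "Nucleus"], ranking=[2, 1, 0, 3]))
```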
3.2 Results and Discussion
The classifier performance is assessed through a 5-fold cross-validation protocol on the
fluorescence microscopy protein image datasets. The data is stratified before being
forwarded to the classifier. Accuracy, MCC, F-Score, and Q-Statistic are adopted as
performance indicators for SVM-SubLoc, because any single measure is sometimes inadequate
to capture the performance of a prediction system.
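The stratified split can be sketched as a per-class round-robin assignment of sample indices to folds (in practice a library routine such as scikit-learn's StratifiedKFold would typically be used):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so that each fold keeps roughly the
    same class proportions, by round-robin assignment within each class."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds
```

Each fold then serves once as the test set while the remaining four are used for training, so every image is tested exactly once.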
3.2.1 Performance Analysis of SVM-SubLoc using Individual Features
for HeLa dataset
We analyze and discuss the performance of SVM-SubLoc for 2D HeLa dataset using individ-
ual features. Table 3.1 reports the performance predictions of SVM-SubLoc using Haralick
features with/without employing DWT. Among the individual classifiers, RBF-SVM has
outperformed the other SVMs at all decomposition levels. Among the decomposition levels,
the 2nd level is observed to provide the best protein discrimination capability, where
RBF-SVM yielded 84.1% accuracy. At the 0th and 1st decomposition levels, some hidden
information is lost, whereas at the 3rd and 4th levels, redundancy in the extracted
information occurs; the 2nd level has preserved most of the valuable information, which is
why SVM-SubLoc shows superior performance at this level.

Table 3.1: Performance of SVM-SubLoc using Haralick textures with/without DWT

              lin    poly   RBF    sig    Ensemble
L   D         Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
0   26        75.6   75.9   76.4   69.6   87.7       0.59   0.60      0.29
1   104       80.5   80.9   81.7   80.3   92.5       0.71   0.72      0.18
2   416       81.9   83.2   84.1   79.8   93.2       0.73   0.74      0.20
3   1664      77.3   78.3   80.6   77.9   90.2       0.65   0.66      0.30
4   6656      79.5   80.6   80.7   79.5   92.6       0.72   0.73      0.23

The ensemble accuracy of SVM-SubLoc is 93.2%, 9.1% higher than that of the individual RBF-SVM, which shows
significant improvement in the performance of SVM-SubLoc through the introduction of
majority voting based ensemble. MCC and F-Score values of 0.73 and 0.74, respectively,
are also highest at 2nd decomposition level, which indicate the effectiveness of SVM-SubLoc
at this level. The highest diversity is, however, achieved at 1st decomposition level where
the Q-statistic value is 0.18.
The performance accuracies of SVM-SubLoc using Zernike moments, with/without employing
DWT, are highlighted in Table 3.2. RBF-SVM has obtained the highest success rate of 67.8%
among the individual classifiers, at the 4th decomposition level. At each successive
decomposition level, the accuracy of RBF-SVM increases, which shows the effectiveness of
RBF-SVM coupled with DWT-based Zernike moments. The performance at the 0th level is poor,
but in the transform domain SVM-SubLoc gradually attains higher accuracies employing
RBF-SVM. The ensemble accuracy of 58.5% at the 2nd decomposition level reflects the
highest diversity among the ensemble members; however, this accuracy is 9.3% less than the
highest accuracy achieved by RBF-SVM among the individual classifiers at the 4th
decomposition level. Similarly, at the 4th level,
Table 3.2: Performance of SVM-SubLoc using Zernike moments with/without DWT
              lin    poly   RBF    sig    Ensemble
L   D         Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
0 49 28.3 43.8 46.5 15.5 46.7 -0.02 0.16 0.02
1 196 35.9 50.3 52.2 24.4 57.8 0.12 0.23 0.09
2 784 39.0 53.4 56.2 15.1 58.5 0.15 0.25 0.01
3 3136 50.2 52.0 60.5 11.3 54.5 0.10 0.22 0.24
4 12544 44.0 48.0 67.8 11.0 56.7 0.14 0.24 0.11
the ensemble accuracy is 56.7%, which is 11.1% less than that of the individual RBF-SVM.
Here, the diversity among the classifiers is lower than at the other levels except the 3rd,
as is evident from the Q-Statistic values. Despite the reasonable accuracy and Q-Statistic
values, the prediction quality represented by MCC and the test accuracy indicated by
F-Score are less promising.
Performance of SVM-SubLoc using TAS features extracted in spatial domain is demon-
strated in Table 3.3. In the first column, τ indicates the user specified threshold. We have
obtained TAS features with different τ values. The highest accuracy of 81.0% has been
achieved by RBF-SVM among individual classifiers for τ = 140. The ensemble accuracy
Table 3.3: Performance of SVM-SubLoc using TAS features
              lin    poly   RBF    sig    Ensemble
τ     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
100 27 75.9 78.1 77.0 69.0 89.0 0.62 0.64 0.31
130 27 76.9 79.6 78.1 69.9 88.5 0.61 0.63 0.39
140 27 77.6 80.3 81.0 70.7 91.6 0.69 0.70 0.31
180 27 78.3 80.7 78.6 69.9 90.8 0.67 0.68 0.32
220 27 77.3 79.2 79.0 69.1 90.4 0.66 0.67 0.28
240 27 77.9 80.1 79.8 70.0 90.4 0.66 0.67 0.34
91.6% is achieved, which is superior to that of RBF-SVM by 10.6%. MCC and F-Score
values of 0.69 and 0.70, respectively, are also good. The significance of the ensemble is
evident from the enhanced accuracy.
Table 3.4 presents the outcome accuracies of SVM-SubLoc using LBP patterns. R, PN ,
and m, respectively, represent radius, neighborhood, and mapping, which are utilized to
extract these patterns. In the following discussion, LBP with certain configuration will
be referred to as m-LBP(R, PN). Among individual classifiers, RBF-SVM has yielded the
Table 3.4: Performance of SVM-SubLoc using LBP with different mappings
                       lin    poly   RBF    sig    Ensemble
R   PN   m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 U 59 83.7 84.3 85.3 71.5 92.8 0.72 0.73 0.32
1 8 RI 36 83.5 82.4 82.7 72.9 92.8 0.72 0.73 0.34
1 8 URI 10 81.5 82.5 82.5 74.9 92.5 0.71 0.72 0.33
2 16 U 243 85.7 86.8 87.5 78.6 95.0 0.79 0.80 0.32
2 16 RI 4116 73.2 72.6 73.0 70.5 85.4 0.54 0.56 0.37
2 16 URI 18 87.1 86.8 87.8 78.3 95.1 0.79 0.80 0.37
3 24 U 555 85.0 86.3 87.4 78.8 95.5 0.81 0.82 0.25
3 24 URI 26 85.7 86.4 88.0 80.5 94.8 0.78 0.79 0.34
highest accuracy value of 88.0% employing URI-LBP(3, 24) patterns. However, the en-
semble accuracy of 95.5% is achieved by SVM-SubLoc using U-LBP(3, 24) patterns. The
ensemble accuracy is 7.5% higher than the highest accuracy of individual RBF-SVM. The
highest diversity among the individual classifiers is observed for U-LBP(3, 24) patterns as
is evident from Q-Statistic value of 0.25. SVM-SubLoc has yielded the highest values of
MCC (= 0.81) and F-Score (= 0.82) using the same features.
Prediction accuracies of SVM-SubLoc using LTP patterns are presented in Table 3.5.
Here, τ represents the threshold for constructing LTP feature space. In further discussions,
LTP with certain configuration would be referred to as m-LTP(R, PN, τ). Among individual
classifiers, poly-SVM has achieved the highest accuracy value of 94.4% using URI-LTP(3,
24, 80) patterns. The ensemble SVM-SubLoc, on the other hand, has achieved 99.0% accu-
racy, which is 4.6% higher than that of the individual poly-SVM using the same features.
The obtained accuracies reveal that LTP patterns have stronger discriminative power
compared to the other feature extraction strategies. The lowest Q-Statistic value of 0.15
indicates that the individual classifiers have the most diversified predictions using
URI-LTP(3, 24, 80) patterns.
From Table 3.5, it is evident that LTP patterns computed on smaller radii require a smaller
value of τ for efficient classification of fluorescence microscopy protein images, whereas
for larger radii τ needs to be greater. The MCC and F-Score values of 0.95 each show that
the performance of SVM-SubLoc is enhanced by the utilization of LTP patterns.
Table 3.5: Performance of SVM-SubLoc using LTP with different mappings
                            lin    poly   RBF    sig    Ensemble
R   PN   τ    m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 40 U 118 90.4 90.4 90.8 78.1 97.7 0.89 0.90 0.35
1 8 40 RI 72 89.3 87.5 89.3 80.5 97.3 0.87 0.88 0.28
1 8 40 URI 20 89.2 89.7 90.1 77.2 97.3 0.87 0.88 0.28
2 16 80 U 486 92.8 93.0 92.9 84.1 98.2 0.91 0.92 0.21
2 16 80 URI 36 91.8 92.9 93.5 85.1 98.6 0.93 0.93 0.26
3 24 80 URI 52 93.8 94.4 93.8 86.3 99.0 0.95 0.95 0.15
3.2.2 Performance Analysis of SVM-SubLoc using Hybrid Features for
2D HeLa dataset
Table 3.6 demonstrates the performance predictions of SVM-SubLoc employing ZHar hy-
brid features in spatial and transform domains. RBF-SVM has shown superior performance
achieving accuracy value of 80.8% among individual classifiers at 2nd decomposition level.
SVM-SubLoc has yielded 93.2% accuracy at the same level, showing the significance of
the ensemble. The highest diversity is observed at the 1st decomposition level, as indicated
by the Q-Statistic value of 0.12, but it could achieve only 90.4% accuracy. It is evident that the
best decomposition level is the 2nd level among other levels for discriminating the fluores-
cence microscopy protein images. MCC and F-Score values of 0.73 and 0.74, respectively,
are also highest at 2nd level. The hybrid model of Haralick textures and Zernike moments
does not add any significant discrimination capability to these individual feature spaces. At
the level of individual classifiers, SVM-SubLoc could not achieve good performance accu-
racies using the hybrid model compared to the Haralick textures alone as shown in Table
3.1. However, the ensemble accuracy of 93.2% is achieved using both the hybrid model and
Haralick textures as shown in Tables 3.6 and 3.1, respectively. The prediction performance
of SVM-SubLoc using HarTAS has been highlighted in Table 3.7. Among individual base
learners, poly-SVM has outperformed other learners and achieved 87.9% accuracy. The
ensemble SVM-SubLoc has yielded 96.2% accuracy, which is 8.3% higher than that of poly-
SVM. The ensemble accuracy of this hybrid space is also 3% and 4.6% higher than that of
the Haralick and TAS features, respectively, as shown in Tables 3.1 and 3.3. The diversity
among the individual classifiers is sufficient to produce the improved ensemble accuracy as
is evident by the Q-Statistic value of 0.32. Similarly, MCC value of 0.83 and F-Score value
Table 3.6: Performance of SVM-SubLoc using ZHar with/without DWT
              lin    poly   RBF    sig    Ensemble
L   D         Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
0 75 69.7 69.7 71.8 57.5 84.4 0.52 0.54 0.19
1 300 74.8 76.2 76.9 74.4 90.4 0.66 0.67 0.12
2 1200 80.0 80.5 80.8 77.2 93.2 0.73 0.74 0.24
3 4800 77.3 77.9 78.5 77.4 89.2 0.62 0.64 0.35
4 19200 77.8 80.0 79.3 76.2 90.8 0.66 0.68 0.33
Table 3.7: Performance of SVM-SubLoc using HarTAS
              lin    poly   RBF    sig    Ensemble
τ     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
100 53 86.8 87.8 86.3 81.4 96.1 0.83 0.84 0.28
130 53 86.3 87.7 86.0 80.7 94.8 0.78 0.79 0.38
140 53 87.0 87.9 86.0 83.0 96.2 0.83 0.84 0.32
180 53 87.3 87.9 86.0 79.6 95.7 0.81 0.82 0.33
220 53 86.3 87.7 86.4 80.6 96.0 0.82 0.83 0.24
240 53 86.5 87.9 87.1 80.9 95.9 0.82 0.83 0.23
of 0.84 show that both the prediction quality and the test accuracy are good.
Among individual classifiers, RBF-SVM has achieved the highest accuracy of 94.4% us-
ing two separate hybrid models of Haralick textures with U-LBP(2, 16) and with U-LBP(3,
24) patterns as shown in Table 3.8. The yielded accuracy is 18% higher than the accuracy
achieved by SVM-SubLoc using individual Haralick textures in spatial domain as given in
Table 3.1 and similarly, 6.4% higher than the accuracy achieved using LBP patterns as
shown in Table 3.4 using RBF-SVM. Likewise, the ensemble accuracy of 99.7% is achieved
using Haralick with U-LBP(3, 24) patterns. This is 5.3% higher than that of RBF-SVM.
Haralick textures perform better when combined with U-LBP patterns extracted on a larger
circle, and discriminate well among the subcellular structures in fluorescence microscopy
images. The MCC (= 0.98) and F-Score (= 0.98) values reveal that the quality of prediction
and the test accuracy are excellent for these features. A more diversified ensemble can,
however, be generated using Haralick textures with U-LBP patterns on radii 1 and 2, as
shown by the Q-Statistic values of 0.14.
The success rates of SVM-SubLoc using HarLTP are demonstrated in Table 3.9. The
Table 3.8: Performance of SVM-SubLoc using HarLBP
                       lin    poly   RBF    sig    Ensemble
R   PN   m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 U 85 92.1 92.4 92.5 88.5 99.3 0.96 0.96 0.14
1 8 RI 62 89.2 89.9 89.0 84.9 97.0 0.86 0.87 0.31
1 8 URI 36 88.3 90.2 89.6 87.9 97.5 0.88 0.89 0.28
2 16 U 269 93.2 93.9 94.4 91.7 99.0 0.95 0.95 0.14
2 16 RI 4142 81.6 81.6 78.5 81.0 92.5 0.71 0.72 0.28
2 16 URI 44 92.5 93.2 93.7 90.8 98.8 0.94 0.94 0.18
3 24 U 581 93.2 93.6 94.4 90.9 99.7 0.98 0.98 0.20
3 24 URI 52 91.8 92.6 92.8 89.7 99.3 0.96 0.96 0.34
highest accuracy of 94.7% is achieved by poly-SVM among individual classifiers using the
hybrid of Haralick textures with U-LTP(2, 16, 80) patterns. The ensemble accuracy of
99.4% is yielded using the hybrid of Haralick textures with URI-LTP(3, 24, 80) patterns.
The ensemble accuracy is 4.7% higher than that of the poly-SVM. The highest values of
Table 3.9: Performance of SVM-SubLoc using HarLTP
                            lin    poly   RBF    sig    Ensemble
R   PN   τ    m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 40 U 144 92.4 92.5 92.2 85.2 98.1 0.91 0.91 0.24
1 8 40 RI 98 90.9 90.6 90.3 87.8 98.3 0.92 0.92 0.30
1 8 40 URI 46 88.8 91.5 91.8 88.2 98.2 0.91 0.92 0.19
2 16 80 U 512 94.3 94.7 94.4 92.3 99.0 0.95 0.95 0.07
2 16 80 URI 62 93.2 93.1 93.6 90.7 99.0 0.95 0.95 0.19
3 24 80 URI 78 93.9 93.9 93.1 90.8 99.4 0.97 0.97 0.05
MCC and F-Score are also obtained for the same hybrid model. The Q-Statistic value of
0.05 indicates that the diversity among the classifiers is highest.
3.2.3 Performance Analysis of SVM-SubLoc for LOCATE datasets
In this section, we demonstrate the prediction performance of SVM-SubLoc on the LOCATE
datasets using the features that performed best on the 2D HeLa dataset. In this connection,
experimental results are obtained using LTP, HarLBP, and HarLTP for the LOCATE datasets.
Table 3.10 presents the prediction performance of SVM-SubLoc using LTP patterns for
LOCATE Endogenous dataset. Among the individual learners, lin-SVM has achieved the
highest accuracy of 90.2%, using URI-LTP(2, 16, 0) patterns. The ensemble SVM-SubLoc has
obtained 95.6% accuracy for the Endogenous dataset. The
Table 3.10: Performance of SVM-SubLoc using LTP for Endogenous dataset
                            lin    poly   RBF    sig    Ensemble
R   PN   τ    m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 0 RI 72 85.2 86.2 80.6 59.7 93.2 0.72 0.73 0.17
1 8 0 URI 20 75.6 80.6 78.0 63.1 91.0 0.65 0.67 0.15
2 16 0 URI 36 90.2 86.6 85.2 71.9 95.6 0.80 0.81 0.21
3 24 0 URI 52 87.8 87.0 85.2 71.7 92.6 0.69 0.71 0.38
success rates of SVM-SubLoc using LTP patterns for the LOCATE Transfected dataset are
reported in Table 3.11. SVM-SubLoc has yielded 93.6% ensemble accuracy using URI-
LTP(2, 16, 0) patterns; however, the diversity is maximum using RI-LTP(1, 8, 0) patterns,
as is evident from the Q-Statistic value of 0.06. Among the individual classifiers, poly-SVM
has outperformed the other classifiers, achieving 85.5% prediction accuracy. The performance of
Table 3.11: Performance of SVM-SubLoc using LTP for Transfected dataset
                            lin    poly   RBF    sig    Ensemble
R   PN   τ    m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 0 RI 72 78.8 84.2 75.2 54.4 89.6 0.60 0.61 0.06
1 8 0 URI 20 72.6 79.0 72.8 51.5 85.1 0.48 0.50 0.19
2 16 0 URI 36 81.1 85.5 77.0 57.1 93.6 0.71 0.73 0.07
3 24 0 URI 52 84.4 84.4 81.5 60.9 93.3 0.70 0.71 0.11
SVM-SubLoc for both the LOCATE datasets using HarLBP features is reported in Tables
3.12 and 3.13. The proposed SVM-SubLoc has obtained 99.8% accuracy for the Endogenous
dataset using HarLBP, where the LBP(1, 8) patterns are generated with URI mapping. The
MCC and F-Score values are also promising for the same features. For the LOCATE
Transfected dataset, SVM-SubLoc has achieved 98.5% accuracy using HarLBP, where the
LBP(2, 16) patterns are computed with URI mapping. Among the individual learners,
poly-SVM obtained the highest accuracy
Table 3.12: Performance of SVM-SubLoc using HarLBP for Endogenous dataset
                       lin    poly   RBF    sig    Ensemble
R   PN   m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 U 85 94.6 93.8 93.8 88.6 99.6 0.97 0.98 0.24
1 8 URI 36 93.0 92.4 90.8 86.8 99.8 0.98 0.99 0.10
2 16 U 269 96.2 95.8 92.8 89.4 99.4 0.96 0.97 0.14
2 16 URI 44 95.0 94.6 94.2 90.6 99.6 0.97 0.98 0.06
3 24 U 581 94.4 94.4 92.8 88.4 99.6 0.97 0.98 0.25
3 24 URI 52 94.2 94.6 94.8 91.4 98.8 0.93 0.94 0.46
value of 93.6% using HarLBP, where the LBP(2, 16) patterns are extracted through URI
mapping. The accuracy yielded by SVM-SubLoc is 4.9% higher than that of poly-SVM, which
shows the significance of the developed ensemble. The highest diversity is also reported
for these specific features, where a Q-Statistic value of 0.06 is calculated. Table 3.14
highlights the
Table 3.13: Performance of SVM-SubLoc using HarLBP for Transfected dataset
                       lin    poly   RBF    sig    Ensemble
R   PN   m     D       Acc    Acc    Acc    Acc    Acc        MCC    F-Score   Q-Statistic
1 8 U 85 93.3 92.5 88.7 80.4 97.4 0.87 0.87 0.15
1 8 URI 36 89.8 90.0 87.1 81.0 96.5 0.83 0.84 0.21
2 16 U 269 93.3 93.1 90.5 80.2 98.0 0.89 0.90 0.11
2 16 URI 44 93.3 93.6 90.0 83.3 98.5 0.91 0.92 0.06
3 24 U 581 92.0 91.8 90.4 76.8 97.6 0.87 0.88 0.18
3 24 URI 52 92.5 92.5 91.5 84.2 97.8 0.88 0.89 0.25
prediction accuracies of SVM-SubLoc employing HarLTP features for the LOCATE Endogenous
dataset. An ensemble accuracy of 99.6% is achieved by SVM-SubLoc, which is 5.4% higher
than the accuracy achieved by the individual RBF-SVM using the same features. Similarly,
the performance of SVM-SubLoc is presented in Table 3.15 for the Transfected dataset.
The highest ensemble accuracy yielded by SVM-SubLoc is 98.7% using HarLTP. In contrast,
the individual poly-SVM has achieved 93.4% accuracy, which is 5.3% less than that of the
ensemble. The improved performance for the LOCATE datasets is due to the inclusion of
hybrid models as well as the ensemble decisions. Hybrid models inherit the properties of
their constituent feature spaces, which combine to enhance the prediction
Table 3.14: Performance of SVM-SubLoc using HarLTP for Endogenous dataset
R PN τ m D Acc(lin) Acc(poly) Acc(RBF) Acc(sig) Acc(Ensemble) MCC F-Score Q-Statistic
1 8 0 RI 98 91.0 89.6 89.2 84.4 98.0 0.90 0.90 0.28
1 8 0 URI 46 91.2 91.2 90.2 85.6 99.0 0.94 0.95 0.17
2 16 0 URI 62 93.6 93.8 94.2 91.4 99.6 0.97 0.98 0.15
3 24 0 URI 78 93.2 94.2 94.0 91.2 98.6 0.92 0.93 0.12
Table 3.15: Performance of SVM-SubLoc using HarLTP for Transfected dataset
R PN τ m D Acc(lin) Acc(poly) Acc(RBF) Acc(sig) Acc(Ensemble) MCC F-Score Q-Statistic
1 8 0 RI 98 90.4 91.8 86.0 77.2 98.1 0.90 0.90 0.12
1 8 0 URI 46 90.2 90.5 86.7 78.8 96.5 0.83 0.84 0.18
2 16 0 URI 62 91.8 92.4 90.5 82.2 98.5 0.91 0.92 0.09
3 24 0 URI 78 91.8 93.4 90.0 84.0 98.7 0.92 0.93 0.20
capability of SVM-SubLoc.
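The Q-Statistic reported as the diversity measure in the tables above is, in its common formulation, Yule's Q computed on the joint correctness of a pair of base classifiers; values near zero indicate nearly independent errors. A minimal sketch of this computation (illustrative, with hypothetical correctness vectors):

```python
import numpy as np

def q_statistic(correct_i, correct_j):
    """Yule's Q between two classifiers, given boolean vectors that
    mark which test samples each classifier predicted correctly."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)
    n11 = np.sum(ci & cj)    # both correct
    n00 = np.sum(~ci & ~cj)  # both wrong
    n10 = np.sum(ci & ~cj)   # only classifier i correct
    n01 = np.sum(~ci & cj)   # only classifier j correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Two classifiers that err on largely different samples -> Q near zero
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 0, 1, 1, 1, 0, 1, 0]
print(round(q_statistic(a, b), 2))
```

A pairwise Q near 0, as for the 0.06 entries above, means the base SVMs err on largely different samples, which is exactly the kind of diversity that majority voting can exploit.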
3.3 Comparative Analysis
The performance of the proposed SVM-SubLoc is also compared with state-of-the-art
existing approaches, as shown in Table 3.16. Chebira et al. have developed a
prediction system that achieved an accuracy of 95.4% for the HeLa cell lines [7]. The prediction
system proposed by Nanni and Lumini has achieved 94.2%, 98.4%, and 96.5% accuracies
for the HeLa, LOCATE Endogenous, and LOCATE Transfected datasets, respectively [39]. An-
other prediction system, proposed by Nanni et al. [40], has achieved 97.5% accuracy for
the HeLa dataset. They have developed another system, which has achieved accuracy values
of 95.8%, 99.5%, and 97.0% for the HeLa, LOCATE Endogenous, and LOCATE Transfected
datasets, respectively [8]. In contrast, SVM-SubLoc has obtained 99.7% accuracy for the
HeLa dataset, 99.8% accuracy for the LOCATE Endogenous dataset, and 98.7% accuracy for the
LOCATE Transfected dataset, outperforming all the existing approaches.
Compared with the existing approaches, SVM-SubLoc is an accurate and efficient pre-
diction system for recognizing protein subcellular structures in the HeLa and the two LOCATE
datasets. Spatial as well as transform-domain descriptors have been efficiently exploited to
quantitatively describe these subcellular localizations. The improved prediction rates using
multi-resolution sub-spaces demonstrate the importance of feature extraction in the transform
domain. It is worth noting that describing an image at different resolutions may reveal dif-
ferent information in apparently similar structures. Similarly, the hybrid models constructed
Table 3.16: Performance comparison with other published work (accuracy %, with feature-space dimensionality in parentheses)
Method HeLa LOCATE Endogenous LOCATE Transfected
Hamilton et al. [37] - 98.2(47) 93.2(47)
Chebira et al. [7] 95.4(78) - -
Nanni and Lumini [39] 94.2(107) 98.4(107) 96.5(81)
Nanni et al. [40] 97.5(322) - -
Nanni et al. [8] 95.8(305) 99.5(305) 97.0(305)
SVM-SubLoc using HarLBP 99.7(581) 99.8(36) 98.5(44)
SVM-SubLoc using HarLTP 99.4(78) 99.6(62) 98.7(78)
SVM-SubLoc using LTPs 99.0(52) 95.6(36) 93.6(36)
from different individual feature spaces have performed well for fluorescence microscopy
protein images. For example, SVM-SubLoc has yielded accuracies of 99.7% with a 581D
feature space and 99.4% with a 78D feature space using HarLBP and HarLTP, respectively,
in the spatial domain. Similarly, SVM-SubLoc has achieved an accuracy of 99.0% using LTP
patterns with a 52D feature space. The discrimination capability of these features for sub-
cellular structures is quite promising, with the added advantage of reduced feature spaces. Besides,
the hybrid feature spaces of HarLBP and HarLTP have performed well due to the combined
discriminative power of Haralick with LBP and Haralick with LTP, respectively.
In all the above cases, the enhancement achieved by SVM-SubLoc is due to the two-
level ensemble: on one hand, we constructed the feature-level ensemble, and on the other,
we combined the classifiers' decisions. The feature-level ensemble has boosted the dis-
criminative capability of the individual features. Likewise, the decisions combined through
the majority voting technique have improved the prediction accuracies further.
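The decision-level combination can be sketched as follows (a simplified illustration; the class labels and base predictions are hypothetical):

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Fuse per-classifier label predictions (rows = classifiers,
    columns = samples) by plain majority voting.
    Ties are broken in favour of the label encountered first."""
    predictions = np.asarray(predictions)
    fused = []
    for sample_preds in predictions.T:
        fused.append(Counter(sample_preds).most_common(1)[0][0])
    return np.array(fused)

# Four base classifiers (e.g. lin/poly/RBF/sig SVMs) on five samples
preds = [
    ["golgi", "dna", "dna", "nucleolus", "golgi"],
    ["golgi", "dna", "golgi", "nucleolus", "dna"],
    ["dna", "dna", "dna", "golgi", "golgi"],
    ["golgi", "golgi", "dna", "nucleolus", "golgi"],
]
print(majority_vote(preds))
```
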
Summary
In this chapter, we have developed the SVM-SubLoc prediction system and analyzed its
performance. This approach outperformed other existing, well-known approaches in terms of
increased accuracy and reduced dimensionality of the feature space. In the next chapter, we
will discuss the performance of the RF-SubLoc prediction system in the presence of imbalanced
data.
Chapter 4
Protein Subcellular Localization in
the Presence of Imbalanced Data
The advancement in the computational resources encouraged numerous researchers to de-
velop efficient prediction models for recognizing various subcellular structures present in
eukaryotic cells from fluorescence microscopy images. These prediction systems employ dif-
ferent classification algorithms exploiting the discriminative power of a number of feature
extraction strategies. However, imbalanced data always remains one of the major problems
in developing efficient and reliable prediction systems even in the presence of highly dis-
criminative features.
The main objective of this work is to develop a prediction system capable of efficiently
handling imbalanced data. For this purpose, we employed the SMOTE oversampling technique,
which produces synthetic samples in the feature space for the minority classes. These new
samples are delivered to the RF-SubLoc prediction system, which classifies them into various
categories. The availability of balanced data reduces the classifiers' bias towards the
majority class and hence improves the overall performance.
4.1 The Proposed RF-SubLoc Prediction System
The proposed RF-SubLoc prediction system is illustrated in Figure 4.1. Different phases
of the prediction system include the feature extraction phase, data balancing phase, and
classification phase. In the feature extraction phase, different features are extracted from
each input fluorescence microscopy protein image. In the data balancing phase, the instances
[Figure 4.1: The RF-SubLoc prediction system. Pipeline: input data → feature extraction (individual features: Haralick, LBP, Image, Edge, Hull; hybrid features) → data balancing (SMOTE) → classification (Random Forest / Rotation Forest) → output prediction.]
are oversampled in the feature space and then forwarded to the classification phase
for classification into different classes. The performance of the proposed prediction system
is assessed using the CHOM, CHOA, and Vero datasets.
4.1.1 Feature Extraction Phase
Features from fluorescence microscopy protein images are extracted in spatial domain only.
The feature extraction techniques include Haralick coefficients, Edge, Hull, Image, HOG,
and LBP. Of the LBP patterns discussed in Chapter 2, only the URI mapping has been
considered, with parameter configurations of (R = 1, PN = 8), (R = 2, PN = 16), and
(R = 3, PN = 24). In this chapter, these will be referred to simply as LBP8, LBP16, and
LBP24, where the number following LBP indicates the number of neighboring
pixels. The other feature extraction techniques are adopted in accordance with the discussion
presented in section 2.2.
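As an illustrative sketch (using scikit-image, which is not necessarily the implementation used in this work), the URI mapping corresponds to the rotation-invariant uniform LBP method, which yields PN + 2 histogram bins: exactly the 10-, 18-, and 26-dimensional feature spaces of LBP8, LBP16, and LBP24 reported later in Table 4.4:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(image, radius, n_points):
    """Rotation-invariant uniform (URI) LBP histogram with n_points + 2 bins."""
    # skimage's "uniform" method is the rotation-invariant uniform mapping;
    # its codes take values 0 .. n_points + 1
    codes = local_binary_pattern(image, P=n_points, R=radius, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(n_points + 3), density=True)
    return hist

# Toy grayscale image standing in for a fluorescence microscopy image
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
for r, p in [(1, 8), (2, 16), (3, 24)]:
    print(r, p, lbp_histogram(img, r, p).shape)
```
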
4.1.1.1 The Hybrid Model
In addition to the individual features, different hybrid models have also been constructed
by combining these individual feature spaces, which include HarEdge (Haralick + Edge),
HarHull (Haralick + Hull), HarImg (Haralick + Image), HullEdge (Hull + Edge), HullImg
(Hull + Image), ImgEdge (Image + Edge), Morph (Morphological), HarHOG (Haralick +
HOG), MorphHOG (Morphological + HOG), HarMorphHOG (Haralick + Morphological +
HOG), HarLBP8Morph (Haralick + LBP8 + Morphological), HarLBP16Morph (Haralick
+ LBP16 + Morphological), and HarLBP24Morph (Haralick + LBP24 + Morphological).
4.1.2 Data Balancing Phase
After the feature extraction phase, the SMOTE data balancing technique is employed in the
feature space to synthetically increase the minority class samples. In Figure 4.1, synthetic
samples are shown in black among the other samples.
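The core of SMOTE can be sketched as follows (a simplified, illustrative version; practical work would typically use a library implementation such as imbalanced-learn, and the class sizes below are toy values):

```python
import numpy as np

def smote_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # distances from sample i to all other minority samples
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synthetic)

# Toy minority class: raise it by 28 synthetic samples towards the majority size
minority = np.random.default_rng(1).normal(size=(69, 8))
new_samples = smote_oversample(minority, n_synthetic=28)
print(new_samples.shape)  # (28, 8)
```

Because each synthetic point lies on the line segment between two real minority samples, the oversampled class stays inside the region already occupied by the minority class rather than adding arbitrary noise.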
Tables 4.1, 4.2, and 4.3 show the oversampled CHOM, CHOA, and Vero datasets, respectively.
The balancing criterion is set against the majority class in each dataset. For example, in the
CHOM dataset, the majority class is the Lysosome class, where the number of instances is
the highest. To balance the dataset, a sufficient number of synthetic samples is added to
the minority classes so that balance is established between the minority and majority classes.
However, classes in which the imbalance is lower are excluded from the oversampling
process. In the CHOM dataset as shown in Table 2.5, the Lysosome class is the majority class
Table 4.1: Oversampled instances for CHOM dataset
S. No Class Name Images/class
1 Golgi 77
2 DNA 69
3 Lysosome 97
4 Nucleolus 97
5 Cytoskeleton 97
Total 437
containing 97 samples. In this connection, only the Nucleolus and Cytoskeleton classes
are oversampled, because their imbalance is high enough to bias the classifier
against them. Increasing their samples using SMOTE therefore restores the classifier's
attention to these minority classes. The oversampled CHOM dataset is shown
in Table 4.1. Similarly, for the CHOA dataset given in Table 2.6, Actin is the majority class,
which possesses 161 samples out of the 668 samples in the entire dataset. Therefore, SMOTE
is applied to the Microtubule, Mitochondria, Nucleolus, Nucleus, and Peroxisome classes to
Table 4.2: Oversampled instances for CHOA dataset
S. No Class Name Oversampled instances
1 Actin 161
2 ER 143
3 Golgi 156
4 Microtubule 161
5 Mitochondria 161
6 Nucleolus 161
7 Nucleus 161
8 Peroxisome 161
Total 1265
increase their samples, as shown in Table 4.2, because they are the minority classes of the
dataset. The original Vero dataset is shown in Table 2.7, where Nucleus is observed to be the
Table 4.3: Oversampled instances for Vero dataset
S. No Class Name Images/class
1 Actin 444
2 ER 444
3 Golgi 444
4 Microtubule 444
5 Mitochondria 444
6 Nucleolus 444
7 Nucleus 444
8 Peroxisome 444
Total 3552
majority class of the dataset. The remaining classes are all the minority classes. Therefore,
they were oversampled to generate a balanced feature space for the classification phase.
The oversampled dataset is given in Table 4.3.
4.1.3 Classification Phase
The classification phase of RF-SubLoc is depicted in Figure 4.2. In this phase, features
with/without SMOTE are classified using the RF-SubLoc prediction system. Additionally,
the RotF ensemble classifier is also utilized for comparison with RF-SubLoc. The two
important parameters of RF-SubLoc, the number of trees in the forest and the number of
randomly selected variables, are chosen to be 500 and 10, respectively. In case of the RotF ensemble,
[Figure 4.2: The classification phase of RF-SubLoc. Oversampled features are passed to decision trees trained on random subsets; the individual predictions are combined by majority voting into the final prediction.]
the number of iterations and the feature subset selection are the two parameters, where the
number of iterations ranges from 2 to 4 and the dimensions of the feature subsets vary across
the different feature spaces.
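With scikit-learn as a stand-in implementation (the text does not name a specific library), the stated RF-SubLoc configuration of 500 trees and 10 randomly selected variables per split corresponds to:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for an extracted feature space (sizes are illustrative)
X, y = make_classification(n_samples=300, n_features=26, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# 500 trees in the forest, 10 randomly selected variables per split
rf = RandomForestClassifier(n_estimators=500, max_features=10,
                            random_state=0).fit(X, y)
print(rf.score(X, y))  # typically near 1.0 on the training data itself
```
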
4.2 Results and Discussion
This section analyzes the performance of the proposed RF-SubLoc prediction system using
various feature spaces with/without the SMOTE technique. The RotF ensemble is also assessed
on the same set of features in order to provide a useful comparison
with the RF-SubLoc prediction system. The cross-validation protocol for these
simulations is 10-fold for all three datasets. The performance of the prediction system is
measured in terms of accuracy, F-Score, and MCC.
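This evaluation protocol can be sketched as follows (the classifier and dataset here are placeholders, and the macro averaging of the F-Score is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_validate

# Placeholder feature space and labels
X, y = make_classification(n_samples=200, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)

# 10-fold cross-validation with the three reported measures
scoring = {
    "acc": "accuracy",
    "f_score": make_scorer(f1_score, average="macro"),  # macro-averaged F-Score
    "mcc": make_scorer(matthews_corrcoef),
}
results = cross_validate(RandomForestClassifier(random_state=0), X, y,
                         cv=10, scoring=scoring)
for name in scoring:
    print(name, round(results[f"test_{name}"].mean(), 3))
```
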
4.2.1 Performance of RF-SubLoc using Individual Features
The prediction accuracies of RF-SubLoc for the three protein datasets utilizing individ-
ual features with/without SMOTE oversampling are presented in Table 4.4. The highest
accuracy of 95.1% is achieved using the Image features by RF-SubLoc for CHOM dataset
without SMOTE technique. Accuracy value of 96.6% is yielded using the same features by
RF-SubLoc with SMOTE oversampling. The improvement in the other measures demonstrates
the effectiveness of SMOTE oversampling over the imbalanced feature space. Less
promising performance has been shown by RF-SubLoc using the Hull features, which indicates
that CHOM fluorescence microscopy protein images cannot be distinguished well using
hull properties compared to the other features. Similar behavior is observed using Edge
features.
Table 4.4: Performance of RF-SubLoc using individual features
Without SMOTE With SMOTE
Feature D Acc F−Score MCC Acc F−Score MCC
CHOM
Haralick 26 88.7 0.77 0.71 90.8 0.77 0.71
Edge 5 77.1 0.61 0.49 81.9 0.67 0.57
Hull 3 62.4 0.46 0.27 62.9 0.46 0.25
Image 8 95.1 0.89 0.86 96.6 0.91 0.89
HOG 81 82.9 0.68 0.58 88.6 0.74 0.66
LBP8 10 86.5 0.73 0.66 90.8 0.80 0.74
LBP16 18 90.5 0.80 0.74 92.7 0.84 0.79
LBP24 26 89.9 0.79 0.73 92.7 0.84 0.79
Average 84.1 0.71 0.63 87.1 0.75 0.67
CHOA
Haralick 26 72.9 0.49 0.37 85.8 0.57 0.52
Edge 5 62.4 0.37 0.20 76.0 0.39 0.30
Hull 3 44.2 0.26 -0.02 59.7 0.25 0.10
Image 8 84.6 0.66 0.59 92.2 0.73 0.70
HOG 81 66.2 0.42 0.26 87.8 0.60 0.55
LBP8 10 82.9 0.62 0.55 90.0 0.67 0.63
LBP16 18 81.3 0.60 0.51 90.9 0.69 0.65
LBP24 26 80.7 0.59 0.50 90.8 0.68 0.64
Average 71.9 0.50 0.37 84.2 0.57 0.51
Vero
Haralick 26 55.2 0.20 0.04 76.9 0.45 0.39
Edge 5 44.8 0.14 -0.09 65.9 0.32 0.21
Hull 3 40.8 0.12 -0.14 55.4 0.22 0.05
Image 8 60.3 0.24 0.11 75.8 0.44 0.37
HOG 81 57.2 0.22 0.06 83.4 0.56 0.52
LBP8 10 62.2 0.26 0.13 75.4 0.43 0.36
LBP16 18 69.8 0.33 0.24 83.3 0.56 0.51
LBP24 26 73.3 0.37 0.30 87.5 0.64 0.60
Average 57.9 0.23 0.08 75.5 0.45 0.37
However, the combined feature space of the Hull and Edge features enhances the prediction
performance of RF-SubLoc, as shown in Table 4.5 for the CHOM dataset. For the CHOA
dataset, RF-SubLoc achieved its highest accuracy without SMOTE data balancing, 84.6%,
using the Image features. An improvement of 7.6% is observed with the SMOTE technique,
resulting in 92.2% overall accuracy. This signifies the importance of data balancing through
SMOTE in classification problems. In case of the Vero dataset, RF-SubLoc achieved the
highest accuracy using LBP24 features without SMOTE oversampling. However, with the
SMOTE technique, a 14.2% enhancement in accuracy is achieved using the same features,
resulting in 87.5% accuracy.
The performance improvement achievable in the presence of balanced data is evident
from the obtained simulation results. We have also reported the average performance over all
the feature spaces for all three datasets. The average results show a substantial performance
improvement with SMOTE data balancing, which demonstrates the effectiveness of the
SMOTE technique for fluorescence microscopy protein image classification.
4.2.2 Performance Analysis of RF-SubLoc using Hybrid Features
Table 4.5 highlights the success rates of RF-SubLoc using hybrid features with/without
the SMOTE technique for the three fluorescence microscopy protein image datasets. RF-
SubLoc shows superior performance using hybrid features compared to the individual
features. RF-SubLoc yielded the highest accuracies using the HarLBP24Morph hybrid
model for all three datasets, both with and without SMOTE oversampling. The prediction
system achieved 99.1% accuracy without SMOTE oversampling for the CHOM dataset
and 99.3% accuracy with SMOTE oversampling. For the CHOA dataset, RF-SubLoc
has obtained 96.5% and 93.0% accuracies with and without SMOTE data balancing, re-
spectively, using the same features. Similarly, in case of the Vero dataset, RF-SubLoc has
achieved accuracies of 91.2% and 74.6%, respectively, with and without SMOTE oversam-
pling, utilizing the same set of features. The performance of RF-SubLoc has degraded
using the HullImg hybrid model for the CHOM dataset. On the other hand, the performance
of RF-SubLoc using the Image features alone is superior, as is evident from Table 4.4, where the
ensemble achieved 96.6% accuracy using the Image features alone compared to 95.0% accuracy
using the HullImg features with SMOTE oversampling, as shown in Table 4.5. This is due to
the fact that the hybrid model introduced redundancy into the discriminative information
of the individual features. This reveals that hybrid models cannot be blindly trusted to
improve model accuracy. RF-SubLoc performed well with oversampled
feature spaces using both individual and hybrid models. Overall, the performance of RF-SubLoc is
enhanced using the hybrid models compared to the individual feature spaces.
Table 4.5: Performance of RF-SubLoc using hybrid features
Without SMOTE With SMOTE
Feature D Acc F−Score MCC Acc F−Score MCC
CHOM
HarEdge 31 92.0 0.83 0.79 94.3 0.85 0.80
HarHull 29 94.8 0.89 0.86 96.1 0.89 0.86
HarImg 34 95.1 0.89 0.86 96.8 0.92 0.89
HullEdge 8 84.1 0.71 0.63 87.6 0.74 0.67
HullImg 11 94.8 0.89 0.86 95.0 0.89 0.86
ImgEdge 13 95.4 0.90 0.87 95.9 0.91 0.88
Morph 16 95.4 0.90 0.87 96.1 0.91 0.88
HarHoG 107 91.4 0.82 0.77 94.3 0.85 0.81
MorphHoG 97 97.6 0.94 0.93 99.1 0.98 0.98
HarMorphHoG 123 97.9 0.95 0.94 98.4 0.96 0.94
HarLBP8Morph 52 98.2 0.95 0.94 98.6 0.97 0.96
HarLBP16Morph 60 98.5 0.96 0.95 99.1 0.98 0.97
HarLBP24Morph 68 99.1 0.97 0.97 99.3 0.98 0.98
Average 87.4 0.89 0.86 96.2 0.91 0.88
CHOA
HarEdge 31 82.0 0.62 0.54 91.8 0.72 0.68
HarHull 29 77.4 0.55 0.46 89.2 0.65 0.61
HarImg 34 89.4 0.74 0.69 94.5 0.80 0.77
HullEdge 8 72.5 0.49 0.38 84.5 0.55 0.50
HullImg 11 86.5 0.69 0.63 92.2 0.74 0.71
ImgEdge 13 87.0 0.70 0.64 93.4 0.76 0.73
Morph 16 89.7 0.75 0.70 94.9 0.81 0.79
HarHoG 107 78.7 0.56 0.47 90.2 0.67 0.63
MorphHoG 97 88.6 0.73 0.68 96.0 0.85 0.83
HarMorphHoG 123 90.1 0.76 0.71 96.1 0.85 0.83
HarLBP8Morph 52 92.7 0.81 0.78 96.1 0.85 0.83
HarLBP16Morph 60 92.8 0.81 0.78 96.2 0.85 0.84
HarLBP24Morph 68 93.0 0.82 0.78 96.5 0.86 0.85
Average 86.2 0.69 0.63 93.2 0.77 0.74
Vero
HarEdge 31 58.9 0.23 0.08 80.6 0.51 0.45
HarHull 29 59.3 0.24 0.09 79.3 0.49 0.44
HarImg 34 64.4 0.28 0.16 84.0 0.57 0.53
HullEdge 8 54.1 0.20 0.02 74.7 0.42 0.35
HullImg 11 62.4 0.27 0.14 77.0 0.45 0.39
ImgEdge 13 63.7 0.27 0.15 79.8 0.49 0.44
Morph 16 65.4 0.28 0.17 82.0 0.53 0.48
HarHoG 107 61.9 0.25 0.12 85.8 0.60 0.56
MorphHoG 97 65.9 0.29 0.18 86.9 0.62 0.59
HarMorphHoG 123 66.8 0.30 0.19 88.9 0.67 0.64
HarLBP8Morph 52 68.5 0.31 0.22 87.6 0.64 0.60
HarLBP16Morph 60 71.2 0.34 0.26 89.8 0.69 0.66
HarLBP24Morph 68 74.6 0.39 0.32 91.2 0.72 0.69
Average 64.4 0.28 0.16 83.7 0.57 0.52
This is because the discrimination power of the individual features is exploited by the
classifier concurrently. At the same time, due to SMOTE oversampling, every class has
equal representation during the training phase in terms of the number of input samples.
This certainly enhanced the overall performance of the RF-SubLoc prediction
system.
4.2.3 Performance Analysis of RotF Ensemble using Individual Features
Table 4.6 demonstrates the prediction performance of the RotF ensemble using individual
features for the CHOM, CHOA, and Vero datasets with/without SMOTE oversampling.
Improvement in the prediction performance of the RotF ensemble is observed with balanced
data. LBP patterns have enabled the RotF ensemble to achieve higher accuracies for all
three datasets. RotF has yielded 92.0% accuracy using LBP8 and LBP16 features without
the SMOTE technique for the CHOM dataset. With SMOTE, the performance of the RotF
ensemble improves to an accuracy of 93.8% using LBP16 features. RotF has achieved 78.0%
accuracy using LBP8 features for the CHOA dataset without oversampling, which is enhanced
to 84.7% with SMOTE oversampling. However, the highest accuracy obtained by the RotF
ensemble for the CHOA dataset is 86.6%, using the Image features with SMOTE oversampling.
For the Vero dataset, the RotF ensemble yielded the highest accuracies of 66.4% and 73.3%
using LBP24 features, without and with the SMOTE technique, respectively.
The obtained results reveal that providing a balanced feature space through synthetic
samples enables the classifier to decide the label of a sample without bias towards
the majority class.
4.2.4 Performance Analysis of RotF Ensemble using Hybrid Features
Table 4.7 highlights the success rates of the RotF ensemble using hybrid features with/without
SMOTE for the CHOM, CHOA, and Vero datasets. The effectiveness of hybrid features and
SMOTE data balancing is obvious from the obtained results for all three datasets. For
CHOM, the highest accuracy of 97.3% is reported using HarLBP16Morph features with SMOTE.
However, for the CHOA and Vero datasets, the highest accuracies of 91.2% and 74.0%, respectively,
are obtained using the oversampled HarLBP24Morph features.
Table 4.6: Performance of RotF ensemble using individual features
Without SMOTE With SMOTE
Feature D Acc F−Score MCC Acc F−Score MCC
CHOM
Haralick 26 88.1 0.77 0.70 91.1 0.81 0.76
Edge 5 67.6 0.49 0.32 75.3 0.56 0.41
Hull 3 56.3 0.40 0.16 61.8 0.47 0.27
Image 8 90.5 0.80 0.75 93.4 0.85 0.80
HoG 81 68.2 0.48 0.30 74.8 0.54 0.39
LBP8 10 92.0 0.83 0.78 92.2 0.83 0.78
LBP16 18 92.0 0.83 0.78 93.8 0.84 0.80
LBP24 26 89.0 0.77 0.70 91.5 0.80 0.75
Average 80.5 0.67 0.56 84.2 0.71 0.62
CHOA
Haralick 26 69.3 0.45 0.31 81.1 0.49 0.42
Edge 5 54.2 0.30 0.07 66.9 0.29 0.17
Hull 3 39.4 0.22 -0.10 45.1 0.14 -0.10
Image 8 77.2 0.55 0.45 86.6 0.60 0.55
HoG 81 48.2 0.25 -0.01 64.9 0.28 0.15
LBP8 10 78.0 0.56 0.46 84.7 0.56 0.50
LBP16 18 77.1 0.54 0.44 86.1 0.58 0.53
LBP24 26 76.3 0.53 0.42 85.8 0.57 0.52
Average 65.0 0.42 0.25 75.2 0.43 0.34
Vero
Haralick 26 50.2 0.18 -0.01 60.2 0.27 0.13
Edge 5 35.7 0.11 -0.19 47.3 0.17 -0.04
Hull 3 35.6 0.10 -0.21 42.4 0.15 -0.10
Image 8 50.4 0.17 -0.01 61.0 0.27 0.14
HoG 81 44.2 0.14 -0.09 57.2 0.25 0.09
LBP8 10 49.7 0.17 -0.02 59.1 0.26 0.11
LBP16 18 60.7 0.25 0.12 66.9 0.33 0.23
LBP24 26 66.4 0.31 0.21 73.3 0.41 0.33
Average 49.1 0.17 -0.02 58.4 0.26 0.11
The effectiveness of SMOTE data balancing is evident from the obtained results, where
improvement has consistently been achieved. Likewise, the average success rates confirm
the usefulness of the SMOTE data balancing technique. Similarly, hybrid features have shown
performance improvement over individual features for all three datasets, because
the discrimination power of the individual features is combined and consequently enhanced.
Table 4.7: Performance of RotF ensemble using hybrid features
Without SMOTE With SMOTE
Feature D Acc F−Score MCC Acc F−Score MCC
CHOM
HarEdge 31 89.0 0.78 0.71 90.4 0.78 0.72
HarHull 29 89.9 0.80 0.74 90.4 0.77 0.70
HarImg 34 92.0 0.84 0.79 94.5 0.87 0.83
HullEdge 8 73.7 0.57 0.43 78.9 0.63 0.51
HullImg 11 91.4 0.82 0.77 94.3 0.86 0.82
ImgEdge 13 91.7 0.83 0.78 94.7 0.87 0.83
Morph 16 90.8 0.81 0.76 95.4 0.89 0.86
HarHoG 107 84.1 0.70 0.62 88.1 0.76 0.69
MorphHoG 97 89.6 0.79 0.73 92.9 0.82 0.76
HarMorphHoG 123 91.4 0.82 0.78 93.6 0.85 0.81
HarLBP8Morph 52 94.5 0.88 0.85 96.1 0.91 0.89
HarLBP16Morph 60 95.4 0.90 0.87 97.3 0.92 0.90
HarLBP24Morph 68 96.0 0.91 0.89 96.6 0.92 0.90
Average 90.0 0.80 0.75 92.6 0.83 0.79
CHOA
HarEdge 31 74.4 0.51 0.39 84.7 0.56 0.51
HarHull 29 71.0 0.47 0.34 80.9 0.49 0.43
HarImg 34 80.2 0.59 0.50 88.9 0.65 0.61
HullEdge 8 58.8 0.34 0.14 70.2 0.33 0.22
HullImg 11 76.8 0.55 0.45 85.8 0.58 0.54
ImgEdge 13 77.7 0.55 0.46 87.5 0.62 0.57
Morph 16 78.0 0.56 0.46 86.6 0.60 0.55
HarHoG 107 65.4 0.41 0.25 79.4 0.46 0.39
MorphHoG 97 72.8 0.49 0.36 83.6 0.53 0.48
HarMorphHoG 123 76.0 0.53 0.42 86.6 0.60 0.55
HarLBP8Morph 52 82.8 0.63 0.55 90.2 0.68 0.64
HarLBP16Morph 60 83.7 0.64 0.57 90.8 0.69 0.66
HarLBP24Morph 68 84.1 0.65 0.57 91.2 0.71 0.67
Average 75.5 0.53 0.42 85.1 0.58 0.52
Vero
HarEdge 31 50.1 0.17 -0.02 61.6 0.28 0.14
HarHull 29 49.2 0.17 -0.03 61.6 0.28 0.15
HarImg 34 53.9 0.20 0.02 67.1 0.34 0.23
HullEdge 8 43.6 0.13 -0.11 52.6 0.20 0.01
HullImg 11 52.6 0.19 0.01 62.0 0.29 0.16
ImgEdge 13 52.2 0.19 0.01 63.7 0.30 0.18
Morph 16 53.4 0.20 0.02 64.9 0.31 0.19
HarHoG 107 49.5 0.17 -0.03 61.8 0.28 0.15
MorphHoG 97 52.4 0.19 0.01 65.2 0.31 0.19
HarMorphHoG 123 53.8 0.19 0.02 67.3 0.33 0.23
HarLBP8Morph 52 55.2 0.21 0.05 67.8 0.34 0.24
HarLBP16Morph 60 59.3 0.24 0.10 70.4 0.37 0.29
HarLBP24Morph 68 65.4 0.29 0.19 74.0 0.42 0.35
Average 53.1 0.20 0.02 64.6 0.31 0.19
4.3 Comparative Analysis
We have provided a comparative analysis of the proposed system with state-of-the-art ex-
isting approaches, as given in Table 4.8. Zhang and Pham [85]
have developed an RSE of multi-layer perceptrons, which achieved 98.9% accuracy for the CHOM
dataset. In a similar approach, Zhang et al. [86] have reported 94.2% accuracy for the
same dataset. Lin et al. have proposed an AdaBoost.ERC-based prediction system [6], which
obtained accuracy values of 94.7% and 89.1% for the CHOA and Vero datasets, respectively.
The feature extraction strategies utilized by these prediction systems are complex in nature
and occupy large spaces in memory. For example, the authors in [85] have utilized curvelet-
transform-based features, which have to be extracted in multi-resolution sub-spaces. This
increases the computation time of the prediction system. Similarly, in [6] the strong and
weak detectors require large memory to be loaded. In contrast, our employed feature ex-
traction strategies use feature spaces with smaller dimensions. We have extracted
various individual features as well as constructed their hybrid models. The key novelty
of our proposed prediction system lies in the introduction of SMOTE data balancing on
the extracted feature spaces of fluorescence microscopy protein images, which greatly enhanced the
performance of the RF-SubLoc prediction system. In our proposed approach, RF-SubLoc
has achieved 99.3% accuracy for the CHOM dataset, 96.5% for the CHOA dataset, and 91.2% for the
Vero dataset.
Table 4.8: Performance comparison with other published work
Method Accuracy
CHOM CHOA Vero
MLP-RSE [85] 98.9 - -
MLP-RSE [86] 94.2 - -
AdaBoost.ERC [6] - 94.7 89.1
RF-SubLoc 99.3 96.5 91.2
We have explored the performance of the RF-SubLoc prediction system and the RotF ensemble
classifier in the presence of balanced and imbalanced data. In this connection, datasets with
varying degrees of imbalance are employed. Simulation results revealed that the enhancement
in classifier performance under SMOTE is directly proportional to the imbalance
present in the original dataset. In other words, more improvement in classifier
performance is observed after data balancing where the imbalance in the original
dataset was higher.
Simulation results proved the effectiveness of the RF-SubLoc ensemble, whose efficiency is
boosted with balanced feature spaces. Synthetic samples are created for this purpose using
SMOTE, which forced RF-SubLoc to increase its affinity towards minority class samples.
This leads to more generalized learning and, consequently, the overall performance of the
system is improved.
Summary
We observed that SMOTE increased the efficiency of RF-SubLoc in classifying protein
subcellular structures, particularly in the case of the HarLBP24Morph hybrid features. The sig-
nificance of SMOTE-based data balancing in the classification of fluorescence microscopy
protein images is thus evident from the simulation results. In the next chapter, we combine
the best techniques of the two systems presented in this chapter and Chapter 3. For this
purpose, LTP from Chapter 3 and SMOTE from the current chapter are selected.
Chapter 5
Protein Subcellular Localization:
Employing LTP with SMOTE
Prediction systems for protein localization depend greatly upon the discrimination power
of feature extraction strategies for their reliability and accuracy. High discrimination power
would certainly aid the drug development process for various diseases. The existing ap-
proaches have employed individual and ensemble classifier systems, which operate on dif-
ferent feature generation techniques in order to classify protein localization with high ac-
curacy [23, 36]. These methods have achieved high prediction accuracies; however, room
for more accurate systems still exists. Besides, another problem faced by such prediction
systems is the availability of imbalanced data, which severely degrades the performance of
automated systems. The classifier's predictions are influenced by the majority class and,
consequently, the classification system suffers from this bias in the form of degraded
performance. The overall performance of the classifier may appear good in the form
of higher accuracy; however, the instances of the minority classes may all be neglected in the
classification procedure. The main focus of this chapter is to tackle this issue.
5.1 The Protein-SubLoc Prediction System
The proposed Protein-SubLoc prediction system is highlighted in Figure 5.1. The con-
stituent phases of the prediction system include feature extraction, oversampling, optional
feature selection, classification, and ensemble generation, which are discussed as follows.
[Figure 5.1: Framework of the proposed system. Pipeline: LTP feature extraction → SMOTE oversampling → SVM classification → ensembles generation and best ensemble selection → output.]
5.1.1 Feature Extraction Phase
LTP patterns are extracted in the spatial domain from each input image forwarded to the
feature extraction phase. In this work, RI-LTP (R=1, PN=8), U-LTP (R=1, PN=8), URI-
LTP (R=1, PN=8), U-LTP (R=2, PN=16), URI-LTP (R=2, PN=16), and URI-LTP (R=3,
PN=24) have been exploited for their discrimination power.
5.1.2 Oversampling Phase
The feature extraction phase is followed by the oversampling phase, where SMOTE is introduced
to synthetically oversample the feature space. In this way, a balanced feature space is
produced before the classification phase. In Figure 5.1, different colors indicate
instances belonging to different classes; synthetically oversampled examples are shown
in black.
In this work, LTP patterns are utilized for protein subcellular localization images from
the HeLa and CHOA datasets. Of the two datasets, HeLa exhibits slight imbalance, whereas
CHOA possesses comparatively high imbalance. In order to remedy the imbalance, SMOTE
is employed in the extracted feature spaces before forwarding them to the classification
phase. The original and synthetic samples are depicted in Figures 5.2 and 5.3 for the HeLa and
CHOA datasets, respectively.
5.1.3 Feature Selection (Optional)
Feature selection may improve the performance of a model by selecting the most discrimi-
native and informative features from the full feature space. In addition, the feature space
Figure 5.2: Comparison of original and synthetic samples in HeLa dataset
Figure 5.3: Comparison of original and synthetic samples in CHOA dataset
dimension is reduced, which leads to lower computational cost. In order to further enhance
the discrimination power of the oversampled feature space, the mRMR feature selection tech-
nique is adopted. Our findings reveal, however, that mRMR does not add anything significant
to the performance of the LTP patterns.
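For reference, a greedy mRMR selection can be sketched as follows (a simplified version using scikit-learn mutual information estimators; the exact estimator and discretization used by the original mRMR method may differ):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_select, seed=0):
    """Greedy mRMR: repeatedly pick the feature maximizing
    (relevance to y) minus (mean redundancy with already-selected features)."""
    relevance = mutual_info_classif(X, y, random_state=seed)
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best_f, best_score = None, -np.inf
        for f in range(X.shape[1]):
            if f in selected:
                continue
            redundancy = np.mean([
                mutual_info_regression(X[:, [f]], X[:, s], random_state=seed)[0]
                for s in selected])
            score = relevance[f] - redundancy
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
    return selected

# Toy data: feature 2 is a near-copy of feature 0, so mRMR should avoid
# selecting both early on
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=150)
selected = mrmr_select(X, y, n_select=3)
print(selected)
```
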
5.1.4 Classification Phase
In the classification phase, as shown in Figure 5.4, the performance of the Protein-SubLoc pre-
diction system has been evaluated using SVM with a polynomial kernel of degree 2. Six SVM
classifiers have been trained using six different LTP feature spaces generated
with six different configurations of the LTP feature extraction mechanism. The detailed
Figure 5.4: Classification phase of Protein-SubLoc prediction system
discussion of LTP configurations can be found in Chapter 2.
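For reference, the degree-2 polynomial kernel used by these SVM classifiers has the following form; the gamma and coef0 values shown are illustrative defaults, not necessarily those used in the experiments.

```python
def poly_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree.

    degree=2 matches the kernel used in the classification phase;
    gamma and coef0 here are illustrative defaults.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

k = poly_kernel([1.0, 2.0], [3.0, 4.0])   # (1*11 + 1)^2 = 144
```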
5.1.5 Ensemble Generation
Majority voting based ensemble technique is employed to generate the ensemble for pre-
dicting labels of different protein subcellular localizations. In the formation of ensemble,
all the LTP variations based on different mappings are considered. As shown in Table 5.1,
serial numbers are assigned to the various mappings for ease of discussion. From six
individual classifications, six different ensembles can be generated if an ensemble of
five members is considered. Accordingly, the different combinations are tested and the
ensemble with the highest accuracy is selected for the Protein-SubLoc prediction system.
This practice exploits the diversity among the different classification results and
restricts the ensemble to an odd number of members.
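The majority voting combination described above can be sketched as follows; the labels and votes are illustrative.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions by majority voting.

    predictions: one label per ensemble member for a single sample.
    An odd number of members limits ties; Counter.most_common breaks
    any remaining tie by first occurrence.
    """
    return Counter(predictions).most_common(1)[0][0]

# Five SVM members voting on one protein image
label = majority_vote(["Golgia", "Golgpp", "Golgia", "Golgia", "Nucleolus"])
```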
Table 5.1: The serial numbers, attached to the mapping used in LTP computation, repre-
senting a particular LTP variant in Tables 5.8 and 5.9
S. No m R PN D
1 U 1 8 118
2 RI 1 8 72
3 URI 1 8 20
4 URI 2 16 36
5 U 2 16 486
6 URI 3 24 52
5.2 Results and Discussion
The detailed analysis of the proposed Protein-SubLoc prediction system is provided in this
section. The importance and significance of LTP patterns in conjunction with SMOTE
is discussed in detail. First, the results obtained using 2D HeLa are presented. Then
results regarding CHOA dataset are analyzed. Simulation results are obtained using 10-fold
cross validation protocol. The performance of the proposed system is measured in terms of
accuracy, sensitivity, specificity, MCC, F-Score, and Q-statistic.
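For a per-class (one-vs-rest) confusion matrix, the listed measures can be computed as follows; the counts in the example are illustrative, not taken from the experiments.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures used in this section from a
    binary (one-vs-rest) confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                      # sensitivity (recall)
    spe = tn / (tn + fp)                      # specificity
    pre = tp / (tp + fp)                      # precision, used for F-Score
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sen, spe, mcc, f1

acc, sen, spe, mcc, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```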
5.2.1 Performance Analysis for 2D HeLa dataset
The output predictions of SVM for HeLa dataset are presented in Table 5.2. In the first
column, m indicates mapping methods adopted during the feature extraction using LTP
mechanism. Radius and number of neighboring pixels under consideration during the LTP
feature extraction are mentioned in columns 2 and 3, respectively. Values of threshold τ
required by LTP feature extraction are presented in column 4. D represents the feature
space dimension, which is shown in column 5. Columns 6-10 indicate accuracy, sensitivity,
specificity, MCC, and F-Score, respectively. From this point forward, LTP patterns are
referred to as m-LTP(R, PN, τ) in the text. Results in Table 5.2 reveal that LTP patterns with
Table 5.2: Performance of Protein-SubLoc using LTP for balanced HeLa dataset
Polynomial kernel
m R PN τ D Acc Sen Spe MCC F−Score
RI 1 8 40 72 90.7 92.6 90.5 0.65 0.66
U 1 8 40 118 93.1 93.7 93.0 0.71 0.73
URI 1 8 40 20 91.6 92.5 91.5 0.67 0.68
U 2 16 80 486 95.4 96.0 95.3 0.79 0.80
URI 2 16 80 36 94.6 95.9 94.5 0.77 0.78
URI 3 24 80 52 95.4 96.1 95.3 0.79 0.80
SMOTE discriminate the HeLa patterns very well, yielding accuracies above 90% with any
mapping. Further, URI-LTP(3, 24, 80) patterns have achieved
the highest accuracy of 95.4%. The Protein-SubLoc prediction system has retained good
balance between sensitivity and specificity, which has been possible due to the availability
of SMOTE generated balanced dataset. The MCC value of 0.79 for URI-LTP(3, 24, 80)
patterns reveals a fine quality of prediction. Likewise, the F-Score of 0.80 confirms the
reliability of the conducted test. Comparing the LTP variants in Table 5.2, it is observed
that the performance of LTP patterns improves on larger image patches, as described by the
radius R.
A discriminative feature space usually exhibits high variance. Figure 5.5 illustrates
the high variance possessed by LTP patterns, using the notion of explained variance:
the leading PCA components of the feature space carry most of the information only if
their explained variance is high. From Figure 5.5, the first two components account for
77% of the total variance, and this ratio reaches 92% when the first five components are
considered, supporting the enhanced performance of the Protein-SubLoc prediction system
in differentiating protein structures. Adding components beyond the first five is
unlikely to improve performance, because those components do not possess
significant variability. Multiclass ROC curves for URI-LTP(3, 24, 80) are also presented
Figure 5.5: Ratio of explained variance to the total variance for HeLa dataset
in Figure 5.6. The ROC curves, each based on a single feature rather than the whole
feature space, show that the features maintain discrimination between the classes. Since
there are 10 classes, there are 45 possible pairwise combinations and consequently
45 ROC curves. The notion of mRMR is also introduced to select the most discriminative
features from the full feature space, which will
Figure 5.6: ROC curves using URI-LTP(3, 24, 80) for 2D HeLa dataset
reduce the dimension as well as the noise from the feature space. Performance of Protein-
SubLoc is recorded using mRMR based reduced feature spaces of LTP patterns, which is
demonstrated in Table 5.3. The performance of reduced feature spaces was not promising
in contrast to the full feature spaces. This indicates that LTP feature extraction strategy
extracts most of the information from the fluorescence microscopy protein images necessary
to differentiate different subcellular structures and hence mRMR based feature selection is
not required as a post-processing technique to improve the performance of LTP patterns.
However, the mRMR based feature space of U-LTP(1, 8, 40) patterns has achieved slightly
better performance compared to the full feature space of the same LTP patterns as is evident
from Tables 5.2 and 5.3. Similar to the analysis of mRMR based LTP patterns presented
in Table 5.3, we have also investigated the effect of mRMR on the highest performing LTP
patterns including U-LTP(2, 16, 80) and URI-LTP(3, 24, 80), as can be seen in Table 5.2.
Figure 5.7 demonstrates the performance of various mRMR based feature subspaces selected
from the full feature space of URI-LTP (3, 24, 80) patterns. The highest accuracy value
of 95.5% is achieved on the 50D feature subspace compared to 95.4% on the 52D full
feature space, which cannot be considered an extraordinary improvement. Likewise,
95.6% accuracy is yielded using the mRMR based 450D feature subspace of U-LTP(2, 16,
80) patterns compared to the 95.4% accuracy using 486D full feature space as depicted in
Figure 5.8. Although the reduction in features through mRMR shows an early advantage
Table 5.3: Performance of Protein-SubLoc using LTP for balanced mRMR based HeLa
dataset
Polynomial kernel
m (R, PN, τ) D Acc Sen Spe MCC F−Score
RI (1, 8, 40)
36 89.0 90.9 88.8 0.60 0.62
54 89.7 91.3 89.6 0.62 0.64
60 89.8 91.4 89.7 0.62 0.64
U (1, 8, 40)
59 91.0 91.8 90.9 0.65 0.67
89 93.3 94.0 93.2 0.72 0.73
100 93.0 93.8 92.9 0.71 0.73
URI (1, 8, 40)
10 85.9 88.8 85.5 0.54 0.55
15 90.4 92.2 90.2 0.64 0.65
18 91.6 92.5 91.5 0.67 0.68
U (2, 16, 80)
243 94.3 94.9 94.3 0.75 0.77
365 95.2 95.9 95.1 0.78 0.80
400 95.1 95.5 95.0 0.78 0.79
URI (2, 16, 80)
18 92.9 94.3 92.7 0.71 0.72
27 94.1 95.6 94.0 0.75 0.76
30 93.5 94.9 93.4 0.73 0.74
URI (3, 24, 80)
26 93.5 94.4 93.4 0.73 0.74
39 94.8 95.8 94.7 0.77 0.78
45 95.1 95.7 95.0 0.78 0.79
of 0.1% and 0.2% over the full feature spaces using URI-LTP(3, 24, 80) and U-LTP(2, 16,
80) patterns, respectively, further reduction degrades the performance, as is evident
from Figures 5.7 and 5.8.
LTP patterns have shown strong performance in conjunction with SMOTE, without
the aid of mRMR based feature selection, which is a positive indication of the strength
of balanced LTP patterns. The difference in performance between balanced and imbalanced
feature spaces can easily be understood through the analysis of LTP performance without
SMOTE, which is tabulated in Table 5.4. The highest accuracy achieved using URI-LTP(3,
24, 80) patterns without SMOTE is 94.0%, which is 1.4% lower than with SMOTE, as is
evident from Table 5.2. Balancing the data with SMOTE has improved
the discriminative capability of the prediction system because all the classes have equal
representation in training and testing phases. Overall, LTP patterns could perform well
without the aid of mRMR based feature selection provided that Protein-SubLoc prediction
system is trained on balanced dataset.
Figure 5.7: Effect of mRMR on URI-LTP extracted on radius 3 for HeLa dataset
Figure 5.8: Effect of mRMR on U-LTP extracted on radius 2 for HeLa dataset
5.2.2 Performance Analysis for CHOA dataset
In Table 5.5, performance accuracies of Protein-SubLoc prediction system using LTP pat-
terns with SMOTE are presented for CHOA cell lines. URI-LTP(3, 24, 30) patterns have
outperformed other LTP patterns in predicting various subcellular structures from CHOA
dataset. Protein-SubLoc prediction system has yielded 90.7% accuracy, 0.65 MCC and 0.69
F-Score values. The performance of LTP patterns for balanced CHOA dataset with mRMR
feature selection is highlighted in Table 5.6. No improvement is observed with
feature selection, since LTP patterns already extract most of the discriminative
information from the fluorescence microscopy protein images; hence feature selection
is not required.
Table 5.4: Performance of Protein-SubLoc using LTP for imbalanced HeLa dataset
Polynomial kernel
m R PN τ D Acc Sen Spe MCC F−Score
RI 1 8 40 72 87.5 89.5 87.3 0.58 0.60
U 1 8 40 118 90.8 92.1 90.6 0.65 0.67
URI 1 8 40 20 89.7 91.9 89.5 0.63 0.65
U 2 16 80 486 92.5 94.0 92.3 0.71 0.72
URI 2 16 80 36 93.2 94.8 93.0 0.73 0.74
URI 3 24 80 52 94.0 95.7 93.8 0.75 0.77
Table 5.5: Performance of Protein-SubLoc using LTP for balanced CHOA dataset
Polynomial kernel
m R PN τ D Acc Sen Spe MCC F−Score
RI 1 8 30 72 70.4 59.6 71.9 0.22 0.33
U 1 8 30 118 72.8 61.0 74.4 0.25 0.35
URI 1 8 30 20 68.4 55.7 70.2 0.18 0.30
U 2 16 30 486 82.2 59.8 50.9 0.08 0.31
URI 2 16 30 36 83.9 77.1 84.9 0.48 0.54
URI 3 24 30 52 90.7 85.3 91.5 0.65 0.69
Table 5.6: Performance of Protein-SubLoc using LTP for balanced mRMR based CHOA
dataset
Polynomial kernel
m R PN τ D Acc Sen Spe MCC F−Score
RI 1 8 30 36 68.9 57.2 70.5 0.19 0.31
U 1 8 30 59 70.3 57.6 72.1 0.21 0.32
URI 1 8 30 10 64.2 52.0 65.9 0.12 0.26
U 2 16 30 243 80.4 67.3 82.3 0.38 0.45
URI 2 16 30 18 80.7 71.8 81.9 0.41 0.47
URI 3 24 30 26 89.3 83.6 90.1 0.61 0.65
To analyze the performance of LTP patterns without balancing the dataset, results
without SMOTE are also presented in Table 5.7. Using URI-LTP(3, 24, 30) patterns, the
highest accuracy achieved without SMOTE is 8.6% lower than that with SMOTE (Table 5.5).
This confirms the effectiveness of SMOTE in predicting protein subcellular localization
from the CHOA dataset. The ROC curves for URI-LTP(3, 24,
Table 5.7: Performance of Protein-SubLoc using LTP for imbalanced CHOA dataset
Polynomial kernel
m R PN τ D Acc Sen Spe MCC F−Score
RI 1 8 30 72 58.5 57.4 58.8 0.12 0.33
U 1 8 30 118 55.6 54.8 55.9 0.08 0.30
URI 1 8 30 20 55.2 53.0 55.9 0.06 0.29
U 2 16 30 486 60.0 58.3 60.4 0.14 0.34
URI 2 16 30 36 72.3 72.3 72.3 0.35 0.48
URI 3 24 30 52 82.1 82.2 82.1 0.54 0.62
30) features are shown in Figure 5.9. There are 28 ROC curves, produced from pairwise
combinations of the 8 classes.
Figure 5.9: ROC curves using URI-LTP(3, 24, 30) for CHOA dataset
5.2.3 Ensemble Analysis
Tables 5.8 and 5.9 highlight the prediction accuracies of the ensemble for HeLa and CHOA
cell lines, respectively. The ensemble for the HeLa dataset achieved 100% accuracy with
the 5th group, as shown in Table 5.8. The diversity among the ensemble members of this
group is the highest, as is evident from the Q-value of 0.01. Likewise, for the CHOA
dataset, the same group achieved 95% prediction accuracy, as shown in Table 5.9. The diversity of
Table 5.8: Performance of different combinations of SVM classifications using LTP for
balanced HeLa dataset
S. No 1 2 3 4 5 6 Acc Sen Spe MCC F-Score Q-value
1 √ √ √ √ √ - 99.8 99.9 99.8 0.99 0.99 0.20
2 √ √ √ √ - √ 99.8 99.9 99.8 0.99 0.99 0.07
3 √ √ √ - √ √ 99.8 99.9 99.8 0.99 0.99 0.08
4 √ √ - √ √ √ 99.8 99.9 99.8 0.99 0.99 0.11
5 √ - √ √ √ √ 100 100 100 1 1 0.01
6 - √ √ √ √ √ 99.7 99.9 99.7 0.98 0.98 0.12
this group has yielded promising results for both the HeLa and CHOA datasets. The Q-value
of 0.33 indicates a good level of diversity among these ensemble members.
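The Q-value reported in Tables 5.8 and 5.9 is the pairwise Yule Q-statistic between classifiers, presumably averaged over member pairs for an ensemble. A sketch for one pair, on illustrative correctness vectors:

```python
def q_statistic(correct_i, correct_j):
    """Yule's Q-statistic between two classifiers' per-sample correctness.

    correct_i, correct_j: boolean lists, True where the classifier
    predicted the sample correctly. Q near 0 indicates high diversity;
    Q near 1 indicates the classifiers err on the same samples.
    """
    n11 = n00 = n10 = n01 = 0
    for ci, cj in zip(correct_i, correct_j):
        if ci and cj:
            n11 += 1          # both correct
        elif not ci and not cj:
            n00 += 1          # both wrong
        elif ci:
            n10 += 1          # only classifier i correct
        else:
            n01 += 1          # only classifier j correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

q = q_statistic([True, True, False, True, False, True],
                [True, False, True, True, False, True])
```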
Table 5.9: Performance of different combinations of SVM classifications using LTP for
balanced CHOA dataset
S. No 1 2 3 4 5 6 Acc Sen Spe MCC F-Score Q-value
1 √ √ √ √ √ - 91.7 87.3 92.4 0.69 0.72 0.38
2 √ √ √ √ - √ 92.8 89.1 93.4 0.72 0.75 0.36
3 √ √ √ - √ √ 92.9 88.8 93.5 0.72 0.75 0.36
4 √ √ - √ √ √ 94.7 91.4 95.1 0.78 0.81 0.34
5 √ - √ √ √ √ 95.0 91.6 95.4 0.79 0.81 0.33
6 - √ √ √ √ √ 93.6 90.2 94.1 0.75 0.77 0.38
5.3 Comparative Analysis
Comparative analysis of the proposed Protein-SubLoc prediction system is highlighted in
Table 5.10. Many state-of-the-art existing techniques are listed in the first column and
their respective summaries are recorded in the next column. Chebira et al. developed a
multi-resolution based technique, which yielded 95.4% accuracy [7]. The RSE based
prediction system proposed in [39] predicted HeLa images with 94.2% accuracy. In a
similar approach, the authors of [8] reported 95.8% prediction accuracy. Another approach,
in which a fusion of two different ensembles was generated, achieved 97.5% prediction
performance [40]. Nanni et al. developed an ensemble of SVMs, which achieved 93.2%
accuracy [12]. Lin et al. proposed the AdaBoost.ERC algorithm, which achieved 94.7% and
93.6% accuracies
for CHOA and HeLa datasets, respectively [6]. From the above discussion, it is evident
Table 5.10: Performance comparison with other approaches
Method Summary of the Technique Accuracy
HeLa CHOA
Chebira et al. [7] Multi-resolution subspaces, weighted majority voting 95.4 -
Nanni and Lumini [39] RSE of NNs 94.2 -
Nanni et al. [40] RSE of NNs, AdaBoost ensemble of weak learners, sum rule 97.5 -
Nanni et al. [8] RSE of NNs 95.8 -
Nanni et al. [12] SVM, random subset of features, 50 classifiers, sum rule 93.2 -
Lin et al. [6] Variant of AdaBoost named as AdaBoost.ERC 93.6 94.7
Protein-SubLoc Majority Voting based Ensemble 100 95.0
that the performance shown by the proposed Protein-SubLoc approach outperformed the
existing approaches. In individual classifiers, SVM has yielded 95.4% accuracy for HeLa
dataset using 52D full feature space constructed from URI-LTP(3, 24, 80) patterns as shown
in Table 5.2, which is comparable to the existing ensemble approaches in the literature.
The proposed Protein-SubLoc ensemble achieved 100% accuracy, clearly outperforming the
existing ensemble approaches. The reported accuracies of the proposed ensemble for the
HeLa and CHOA datasets are 2.5% and 0.3% higher, respectively, than the highest
accuracies of the existing approaches for the same datasets.
The performance accuracies yielded by the proposed Protein-SubLoc prediction system
are promising owing to the robustness of LTP patterns against noise. This enables
LTP patterns to extract valuable and discriminative information from images, which allows
the classifier to efficiently distinguish different subcellular structures from each other.
LTP operates over observation areas of varying size, which gives the technique different
views of the same image regions. Furthermore, providing balanced data to the classifier
reduces its bias towards the majority class and hence enhances its performance and
reliability.
The gain in prediction accuracy from SMOTE grows with the degree of imbalance present
in a dataset. For instance, the improvement with SMOTE is modest for HeLa cell lines
but large for the CHOA dataset, because the imbalance in the HeLa dataset is low
compared to that of CHOA. Feature selection using mRMR is not required for LTP, since
LTP already produces such a discriminative feature space that little further enhancement
of its discriminative power is possible. The informative and discriminative feature
space, generated with LTP patterns and oversampled with SMOTE, enables the
Protein-SubLoc prediction system to efficiently classify fluorescence microscopy images
of subcellular proteins into their various classes, boosting the prediction performance
to higher levels.
Summary
In this chapter, the effectiveness of LTP in conjunction with SMOTE for the classification
of fluorescence microscopy protein images is reported. The results show that efficient
predictions can be achieved by providing the classifier with a balanced feature space,
thereby fully utilizing its learning power. In the next chapter, a deeper insight is
developed into the extraction of minute details from fluorescence microscopy images
through GLCM and Texton image based feature extraction strategies. The decisions of
individual classifiers are then combined through the majority voting scheme for enhanced
classification results.
Chapter 6
Protein Subcellular Localization
using GLCM and Texton Image
based Features
Subcellular localization property of proteins plays a key role in understanding numerous
functions of proteins. Comprehensive analysis of fluorescence microscopy images is required
in order to develop efficient automated systems for accurate localization of different proteins.
In this chapter, we explore the effect of various feature spaces constructed from GLCM and
Texton images along different orientations. It is observed that the recognition capability
and learning of the prediction system are enhanced when the feature spaces are constructed
from GLCMs or Texton images along separate directions and then utilized individually
by classifiers. The enhancement in recognition is obtained by combining the individual
decisions through the majority voting scheme.
6.1 The IEH-GT Prediction System
The IEH-GT prediction system is shown in Figure 6.1. It consists of two phases, namely
feature extraction and classification. Individual SVMs are trained on the features extracted from
GLCM and Texton images, and then the final prediction is produced through the majority
voting based ensemble technique. Each of the phases of IEH-GT prediction system is
discussed in detail as follows.
Figure 6.1: Framework of IEH-GT prediction system
6.1.1 Feature Extraction Phase
In the feature extraction phase, first GLCM matrices and Texton images are constructed
from each input fluorescence microscopy protein image and then statistical measures are
computed from the output GLCM matrices and Texton images. The input image is quan-
tized to 16 gray levels prior to the GLCM and Texton image construction and feature
extraction.
6.1.1.1 GLCM Construction
GLCM matrices are usually constructed along horizontal, vertical, diagonal, and off-diagonal
orientations. Features are extracted from each GLCM separately and then these computed
features are combined to form a single representative feature space. However, such a
single combined representation does not faithfully capture fluorescence microscopy
protein images belonging to different categories. Therefore, in this work, GLCMs along
the four directions are constructed separately and are not combined, with the intention
that each GLCM along a particular direction may efficiently represent a separate class
of fluorescence microscopy protein image. The classifier is thus better able to
discriminate classes that fall close together in the feature space.
In addition to the aforementioned four GLCMs, a combined GLCM is also obtained by
their fusion. Haralick coefficients are then extracted from each of the five GLCMs and
later utilized by the classifiers.
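A minimal sketch of directional GLCM construction on a quantized image follows; the offsets correspond to the horizontal, vertical, diagonal, and off-diagonal GLCMs described above, while the example image and level count are illustrative.

```python
def glcm(image, offset, levels=16):
    """Build a grey-level co-occurrence matrix for one direction.

    image: 2D list of integers already quantized to `levels` grey levels.
    offset: (dr, dc) step between co-occurring pixels, e.g. (0, 1) for
    horizontal, (1, 0) for vertical, (1, 1) for diagonal, and (1, -1)
    for off-diagonal, matching the four GLCMs built separately here.
    """
    dr, dc = offset
    m = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                m[image[r][c]][image[r2][c2]] += 1
    return m

# Horizontal GLCM of a tiny 4-level image
g = glcm([[0, 0, 1],
          [2, 3, 1]], offset=(0, 1), levels=4)
```

The combined GLCM can then be obtained by element-wise summation of the four directional matrices before extracting Haralick coefficients.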
6.1.1.2 Texton Image Construction
Texton masks are utilized to construct six Texton images from each input fluorescence
microscopy protein image. The procedure of Texton image construction with the help of
Texton masks is discussed in section 2.2.2.
Usually, a GLCM is constructed from a Texton image and feature spaces are then generated
from this GLCM. The novelty here is that statistical features are generated directly
from the individual Texton images, without transforming them into GLCM matrices.
The rationale behind constructing feature spaces directly from Texton images rather
than from GLCM matrices is to obtain distinctive feature spaces capable of representing
fluorescence microscopy protein images on an individual basis, which in turn aids
discriminative learning. In addition to the six Texton images, a combined Texton image is
also obtained through the fusion of these six. Then, ten statistical features
are extracted from each Texton image and later utilized in the classification phase.
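Since the exact set of ten statistics is not restated here, the following sketch computes a representative subset of statistical descriptors directly from a toy Texton image; the particular statistics chosen (mean, variance, skewness, energy, entropy) are illustrative assumptions, not the thesis's exact list.

```python
import math

def texton_stats(texton_image):
    """Illustrative statistical descriptors of a Texton image.

    Computes a representative subset of per-image statistics (the
    thesis uses ten; this set is an assumption for illustration):
    mean, variance, skewness, and histogram energy and entropy.
    """
    pix = [p for row in texton_image for p in row]
    n = len(pix)
    mean = sum(pix) / n
    var = sum((p - mean) ** 2 for p in pix) / n
    std = math.sqrt(var)
    skew = (sum((p - mean) ** 3 for p in pix) / n) / std ** 3 if std else 0.0
    # Histogram-based measures over the quantized grey levels
    hist = {}
    for p in pix:
        hist[p] = hist.get(p, 0) + 1
    probs = [c / n for c in hist.values()]
    energy = sum(q * q for q in probs)
    entropy = -sum(q * math.log2(q) for q in probs)
    return mean, var, skew, energy, entropy

stats = texton_stats([[0, 1, 1],
                      [2, 1, 0]])
```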
6.1.1.3 The Hybrid Model
A hybrid feature space is also constructed by combining all the individual feature spaces
developed from the seven Texton images and five GLCM matrices, in the hope that it may
enhance the overall performance of the classification system.
6.1.2 Classification Phase
IEH-GT prediction system utilizes RBF-SVM as base classifier in the classification phase
as shown in Figure 6.2. Separate SVMs are trained using each feature space extracted from
GLCM matrices, Texton images, and the hybrid model. Hence, 13 SVMs are utilized in
this phase. The final prediction is obtained through the majority voting ensemble using all
the trained SVMs. The performance of the proposed IEH-GT prediction system is assessed
using 2D HeLa dataset.
Figure 6.2: Classification phase of IEH-GT system
6.2 Results and Discussion
The objective of our simulation based study is to validate the effectiveness of the proposed
IEH-GT prediction system, which is based on the statistics computed from GLCM matrices
and Texton images. We have employed 10-fold cross validation protocol to assess the per-
formance of the proposed model. Measures used to assess the effectiveness of the prediction
system include accuracy, sensitivity, specificity, MCC, and F-Score.
6.2.1 Analysis of GLCM based Features for HeLa Dataset
Table 6.1 presents performance accuracies of IEH-GT prediction system using Haralick
features computed from GLCMs constructed along different directions. Column 1 shows
GLCM type from which the feature space is constructed. Columns 2-6 indicate the perfor-
mance measures.
Table 6.1: Predictions of individual SVMs using Haralick features for HeLa dataset
GLCM type Acc Sen Spe MCC F−Score
GLCMH 72.0 68.6 68.1 0.23 0.31
GLCMD 72.9 69.0 65.6 0.21 0.29
GLCMV 71.8 68.3 67.1 0.22 0.30
GLCMoD 73.0 70.2 67.5 0.24 0.31
GLCMF 74.8 69.2 67.0 0.23 0.30
In this work, we employed Haralick coefficients to discriminate among different patterns
of HeLa cell lines. The computed Haralick features from GLCMs constructed along different
directions individually have produced diverse results, which ultimately led to the improved
performance of the ensemble classification model. The analysis revealed that each version
of GLCM holds information of some particular pattern present in different images.
Table 6.2: Confusion matrix using Haralick features from GLCMH
Actin 77 0 7 0 0 0 1 11 2 0
DNA 0 82 0 4 0 0 0 0 1 0
Endosome 18 0 44 0 0 0 8 10 10 1
ER 2 2 3 65 0 0 2 7 5 0
Golgia 0 0 2 0 66 18 0 0 0 1
Golgpp 0 0 1 2 12 64 3 0 0 3
Lysosome 1 0 5 2 2 1 69 3 0 1
Microtubule 18 0 5 11 0 0 0 51 6 0
Mitochondria 3 0 14 14 0 0 0 8 34 0
Nucleolus 0 0 1 1 1 3 5 0 0 69
(Column order, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
For example, the confusion matrix presented in Table 6.2 indicates that Golgia proteins
are misclassified 18 times as Golgpp proteins due to the similarity of protein structure of
these classes in the fluorescence microscopy images of HeLa dataset. Golgia and Golgpp
proteins are classified with the accuracies of 75.8% and 75.2%, respectively, using the fea-
tures extracted from GLCMH. The tight structure of proteins around the nucleus in the
images of these two classes leads SVM to the wrong predictions. For instance, Figure 6.3-(a)
is a Golgia protein but is predicted as an instance of the Golgpp class. The closest
instance of the Golgpp class in Euclidean space is shown in Figure 6.3-(b). This is
revealed by the scatter plot in Figure 6.4, where the image of Figure 6.3-(a) is plotted
in the feature space together with all instances of the Golgpp class. The Golgia instance
is represented by a red circle in Figure 6.4, whereas the Golgpp instances are represented
by green plus signs. The Golgia protein presented in Figure
6.3-(a) is correctly predicted by SVM using the Haralick features extracted from GLCMD
and GLCMoD, which reveals that these two GLCMs have the ability to extract features
from this particular protein. This strengthens our argument for the split-GLCM concept,
in which GLCMs along different directions are constructed separately so that
(a) Golgia protein (b) Golgpp protein
Figure 6.3: Golgia protein is wrongly predicted as Golgpp protein
meaningful features from different images could be extracted.
Figure 6.4: Golgia protein is represented by red circle and Golgpp class is shown by green
plus sign
Consequently, the separation makes different GLCMs suitable for images with differing
characteristics. Similarly, Golgpp proteins are wrongly predicted as Golgia proteins
12 times. Again, the structural similarity between these two classes leads the SVM to
mislabel these fluorescence microscopy protein images. For example, Figure 6.5-(a)
shows a Golgpp protein that is wrongly predicted as a Golgia protein; its minimum
Euclidean distance is to the Golgia protein of Figure 6.5-(b). Figure 6.6 illustrates
the feature space of the Golgia class in relation to this one Golgpp instance, shown by
a red circle, which is misclassified as a Golgia protein.
(a) Golgpp protein (b) Golgia protein
Figure 6.5: Golgpp protein is wrongly predicted as Golgia protein
On the other hand, Lysosomal proteins are predicted with quite good accuracy of 82.1%
as shown in Table 6.2, even though there is much similarity present among these proteins
and the Endosome proteins.
Figure 6.6: Golgpp instance is represented by red circle and Golgia class is shown by green
plus sign
In Table 6.2 there are only 5 misclassifications of Lysosomal proteins as Endosomes. At
the same time, however, these features fail to capture the information in the fluorescence
microscopy protein images of the Mitochondria class, where the accuracy is merely 46.5%.
Most of these patterns are wrongly classified as Endosome, ER, and Microtubule. Similarly,
Tables 6.3, 6.4, 6.5 present the confusion matrices generated by SVM using features from
GLCMD, GLCMV, and GLCMoD, respectively.
Table 6.3: Confusion matrix using Haralick features computed from GLCMD
Actin 77 0 7 0 0 0 0 12 2 0
DNA 0 84 0 3 0 0 0 0 0 0
Endosome 14 0 48 1 0 0 8 11 9 0
ER 1 1 0 62 0 0 2 10 10 0
Golgia 0 0 1 0 68 16 0 0 0 2
Golgpp 0 0 2 1 20 55 2 0 0 5
Lysosome 2 0 7 3 1 3 64 2 1 1
Microtubule 13 0 6 12 0 0 0 55 5 0
Mitochondria 6 0 5 15 0 0 0 4 43 0
Nucleolus 0 2 0 0 1 2 0 0 2 73
(Column order, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.4: Confusion matrix using Haralick features computed from GLCMV
Actin 76 0 6 0 0 0 1 13 2 0
DNA 0 82 0 4 0 0 0 0 1 0
Endosome 16 0 44 0 0 0 9 10 11 1
ER 1 1 1 64 0 0 3 8 8 0
Golgia 0 0 2 0 62 21 0 0 0 2
Golgpp 0 0 2 1 14 62 2 0 0 4
Lysosome 1 0 8 1 1 2 67 2 1 1
Microtubule 16 0 5 11 0 0 1 51 7 0
Mitochondria 2 1 10 10 0 0 1 10 38 1
Nucleolus 0 0 0 1 1 2 2 0 1 73
(Column order, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.5: Confusion matrix using Haralick features computed from GLCMoD
Actin 77 0 12 0 0 0 0 9 0 0
DNA 0 84 0 2 0 0 0 0 0 1
Endosome 16 1 46 1 0 0 10 9 8 0
ER 1 2 2 66 0 0 1 8 6 0
Golgia 0 0 1 0 66 19 0 0 1 0
Golgpp 0 0 4 0 13 62 1 0 1 4
Lysosome 1 0 7 0 2 4 64 2 2 2
Microtubule 15 0 12 9 0 0 0 53 2 0
Mitochondria 5 1 12 8 0 0 1 6 40 0
Nucleolus 0 0 1 1 1 3 1 0 1 72
(Column order, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Mitochondrial proteins are symmetrically distributed around the nucleus similar to those
of Endosome, ER and Microtubule proteins. SVM is able to classify Endosomes with the
accuracy of 60.4% using Haralick features computed from the GLCMF as shown in Table
6.6.
Table 6.6: Confusion matrix using Haralick features computed from GLCMF
Actin 80 0 8 0 0 0 0 10 0 0
DNA 0 83 0 4 0 0 0 0 0 0
Endosome 9 0 55 2 0 0 8 7 10 0
ER 1 1 1 67 0 0 2 7 7 0
Golgia 0 0 0 0 66 18 0 0 2 1
Golgpp 0 0 1 2 15 63 1 0 0 3
Lysosome 0 0 10 2 1 4 62 2 1 2
Microtubule 13 0 10 8 0 1 0 53 6 0
Mitochondria 4 0 7 9 0 0 0 7 46 0
Nucleolus 0 1 0 1 2 4 1 0 1 70
(Column order, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
On the other hand, the GLCM constructed along any single orientation except GLCMD
was not able to achieve accuracy higher than 52.7% for these images. The diversity among
individual SVMs, each utilizing features from a different GLCM, can be understood from the
analysis of Table 6.7. This table lists the Golgia proteins misclassified as Golgpp
proteins by each individual SVM using Haralick features computed from GLCMs constructed
along different directions. The disagreements among the classifiers yield the diversified
individual members of the IEH-GT prediction system.
Table 6.7: Golgia proteins classified as Golgpp proteins
GLCMH GLCMD GLCMV GLCMoD GLCMF
- - golgia 002 - golgia 002
- - - - golgia 003
- golgia 004 - golgia 004 golgia 004
- - golgia 005 - -
golgia 006 - - - -
golgia 007 - - - -
golgia 008 - - - -
golgia 009 - - - -
- - - golgia 011 -
- - - golgia 012 -
- - - - golgia 014
- - golgia 015 - -
- golgia 016 - - -
golgia 017 - golgia 017 - -
- - - golgia 018 -
- golgia 022 - - -
golgia 024 golgia 024 - - golgia 024
golgia 025 - - - -
- - - golgia 028 -
- - golgia 031 - -
golgia 033 - - - golgia 033
- golgia 034 - - -
golgia 035 golgia 035 - - golgia 035
- - - - golgia 036
- golgia 037 golgia 037 - -
golgia 038 - - - -
- - golgia 039 - golgia 039
golgia 040 golgia 040 - - -
golgia 042 - - golgia 042 -
- golgia 043 golgia 043 - -
- - golgia 044 - -
- golgia 046 golgia 046 - -
- - golgia 049 - -
- - - golgia 050 golgia 050
- - - golgia 051 golgia 051
- golgia 052 - - -
- - golgia 055 golgia 055 -
- golgia 056 - - -
golgia 057 - golgia 057 - -
- golgia 058 - - -
- - golgia 059 - golgia 059
- - - - golgia 060
- - - golgia 061 golgia 061
- - - golgia 062 -
- - golgia 063 golgia 063 -
- - golgia 064 - -
- golgia 066 golgia 066 golgia 066 -
- - golgia 067 - golgia 067
- - - - golgia 068
golgia 069 - - - -
- - - - golgia 072
golgia 073 - - - -
- - - golgia 074 -
- golgia 076 golgia 076 golgia 076 -
- - - golgia 077 -
golgia 079 - - - -
- - - - golgia 080
- - - golgia 081 -
- - - golgia 083 -
golgia 084 - - - -
- golgia 085 golgia 085 golgia 085 -
golgia 087 - golgia 087 - -
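As a concrete illustration, the per-orientation GLCM construction and one of the Haralick features (contrast) can be sketched in a few lines. This is a minimal sketch on a toy image; the offset names, the 4-level quantization, and the choice of contrast as the example feature are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

# Pixel offsets for the four GLCM orientations discussed in the text:
# horizontal (H), diagonal (D), vertical (V), and off-diagonal (oD).
OFFSETS = {
    "GLCM_H":  (0, 1),
    "GLCM_D":  (1, 1),
    "GLCM_V":  (1, 0),
    "GLCM_oD": (1, -1),
}

def glcm(image, offset, levels):
    """Normalized gray-level co-occurrence matrix for a single offset."""
    dr, dc = offset
    m = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                m[image[r, c], image[r2, c2]] += 1
    return m / m.sum()

def haralick_contrast(p):
    """One of the Haralick features: contrast = sum (i-j)^2 p(i,j)."""
    i, j = np.indices(p.shape)
    return ((i - j) ** 2 * p).sum()

# Toy 4-level image standing in for a quantized fluorescence image.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
per_direction = {name: haralick_contrast(glcm(img, off, levels=4))
                 for name, off in OFFSETS.items()}
```

Because each offset counts different pixel pairs, the same image yields four different matrices, which is exactly the source of the per-orientation diversity exploited by the ensemble.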
6.2.2 Analysis of Texton Image based Features for HeLa Dataset
The prediction performance of the individual SVMs of the proposed IEH-GT system, using
statistical features computed from Texton images constructed along different directions, is
presented in Table 6.8. Column 1 shows the Texton mask direction, while Columns 2-6
report the performance measures used to assess the prediction system.
Table 6.8: Predictions of individual SVMs using statistical features computed from Texton
images
Texton Mask Acc Sen Spe MCC F-Score
T1 63.8 62.6 57.7 0.12 0.23
T2 60.0 62.7 58.3 0.13 0.24
T3 63.4 65.0 60.1 0.15 0.25
T4 60.4 61.6 56.9 0.11 0.23
T5 63.8 65.8 60.5 0.16 0.26
T6 63.6 67.5 60.1 0.17 0.26
Fused Textons 56.3 53.2 53.7 0.04 0.19
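A minimal sketch of computing statistical features from a Texton image follows. The 2x2 mask correlation and the particular feature set (mean, standard deviation, energy, entropy) are simplifying assumptions for illustration; the actual T1-T6 masks and statistics are defined in the earlier chapters.

```python
import numpy as np

def texton_image(image, mask):
    """Correlate every 2x2 window with a binary texton mask; a simplified
    stand-in for the Texton-image construction used in the thesis."""
    h, w = image.shape
    out = np.zeros((h - 1, w - 1))
    for r in range(h - 1):
        for c in range(w - 1):
            out[r, c] = (image[r:r + 2, c:c + 2] * mask).sum()
    return out

def first_order_stats(t):
    """First-order statistical features of a texton image (feature set assumed)."""
    counts, _ = np.histogram(t, bins=16)
    p = counts / counts.sum()
    p = p[p > 0]
    return {"mean": t.mean(),
            "std": t.std(),
            "energy": (p ** 2).sum(),
            "entropy": -(p * np.log2(p)).sum()}

T1 = np.array([[1, 1],      # hypothetical horizontal texton mask
               [0, 0]])
img = np.random.default_rng(0).integers(0, 256, size=(32, 32))
feats = first_order_stats(texton_image(img, T1))
```

A mask with active cells in one row responds most strongly to horizontal structure, which matches the observation below that T1 favors images with horizontal texture.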
Individual SVMs have produced reasonable results using features from Texton images.
Unlike the features from the GLCMs, the features from the fused Texton image possess
redundancy, which is why the SVM produced a lower prediction rate of 56.3% for these
features. The confusion matrix generated by the SVM using the features extracted from
the Texton image employing the T1 Texton mask is presented in Table 6.9.
Table 6.9: Confusion matrix using statistical features computed from Texton image constructed using T1
Actin 72 1 8 3 0 0 1 10 2 1
DNA 0 79 0 7 0 0 0 0 1 0
Endosome 9 1 42 4 3 0 15 6 11 0
ER 5 4 4 54 0 0 0 13 6 0
Golgia 0 0 1 0 53 30 1 0 2 0
Golgpp 0 0 1 1 12 65 4 0 0 2
Lysosome 3 0 11 1 2 11 51 0 1 4
Microtubule 14 2 5 29 0 1 1 33 5 1
Mitochondria 7 1 13 11 1 0 2 3 35 0
Nucleolus 0 0 1 1 0 7 4 0 1 66
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Figure 6.7: Endosome instances are shown by red circles and the Lysosome class is represented by green plus signs
It is evident that the SVM classifier confuses some instances of the Endosome and Lysosome
proteins because, in both types of images, the proteins are concentrated along one side of
the nucleus. In addition, stains can be observed throughout the cytoplasm in both
fluorescence microscopy protein images. Similarly, Endosome proteins are also confused
with Mitochondrial proteins, since staining similar to that in Endosome protein images
can be observed in Mitochondrial images.
(a) Endosome protein (b) Lysosome protein
Figure 6.8: Similar patterns can be observed in the two images
Table 6.10: Confusion matrix using statistical features computed from Texton image constructed using T2
Actin 76 1 6 1 0 0 2 11 1 0
DNA 0 79 1 4 0 0 0 2 1 0
Endosome 12 1 43 2 3 1 13 4 11 1
ER 10 5 9 37 0 0 0 19 6 0
Golgia 0 0 2 0 38 40 3 0 4 0
Golgpp 0 0 1 1 6 65 10 0 0 2
Lysosome 4 0 14 0 2 11 49 0 1 3
Microtubule 23 2 7 20 0 1 0 34 3 1
Mitochondria 8 2 17 7 3 1 1 5 29 0
Nucleolus 0 0 0 0 1 5 1 0 2 71
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Similar patterns can be observed in the images presented in Figure 6.8, where the Endosome
and Lysosome images possess similar textures. This indicates that the T1 Texton mask
extracts patterns from images that possess horizontal texture, which is why the two images
belonging to different classes are categorized as instances of the same class. Likewise, the
confusion matrices generated by the SVM using statistical features computed from Texton
images constructed with T2, T3, T4, T5, and T6 are presented in Tables 6.10, 6.11, 6.12,
6.13, and 6.14, respectively. Table 6.15 presents the confusion matrix generated by
Table 6.11: Confusion matrix using statistical features computed from Texton image constructed using T3
Actin 75 0 8 3 0 0 2 9 1 0
DNA 0 76 0 9 0 0 0 1 0 1
Endosome 8 1 46 2 3 1 17 5 8 0
ER 7 3 5 44 1 0 1 18 7 0
Golgia 0 0 2 0 54 28 1 0 2 0
Golgpp 0 0 2 1 13 64 4 0 0 1
Lysosome 3 0 8 2 2 12 53 0 0 4
Microtubule 15 1 7 22 0 2 1 38 5 0
Mitochondria 5 0 14 12 3 1 0 7 31 0
Nucleolus 0 0 1 1 0 6 6 0 0 66
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
the SVM using the statistical features extracted from a Texton image obtained by fusing
the Texton images constructed with the aforementioned Texton masks. Owing to this
overlapping information, the SVM could not produce comparable results using the features
extracted from the combined Texton image.
Table 6.12: Confusion matrix using statistical features computed from Texton image constructed using T4
Actin 70 1 6 8 0 0 2 8 3 0
DNA 1 79 1 3 0 0 1 1 0 1
Endosome 12 0 37 2 4 1 13 5 17 0
ER 15 1 4 41 0 0 0 16 9 0
Golgia 0 0 2 0 48 32 2 0 3 0
Golgpp 0 0 3 0 8 68 5 0 0 1
Lysosome 1 0 12 1 3 13 49 3 0 2
Microtubule 19 3 11 30 0 1 1 25 1 0
Mitochondria 9 0 15 8 4 1 1 5 30 0
Nucleolus 0 1 2 0 0 5 1 0 0 71
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.13: Confusion matrix using statistical features computed from Texton image constructed using T5
Actin 73 0 11 5 0 0 1 8 0 0
DNA 0 78 0 7 0 0 0 1 1 0
Endosome 11 0 43 3 3 0 13 5 12 1
ER 5 3 3 54 0 0 0 11 10 0
Golgia 0 0 1 0 51 31 1 0 3 0
Golgpp 0 0 3 0 11 66 3 0 0 2
Lysosome 4 0 11 1 2 11 51 0 1 3
Microtubule 15 1 5 29 0 2 2 32 5 0
Mitochondria 5 0 14 10 1 1 1 5 36 0
Nucleolus 0 1 0 1 0 5 5 0 2 66
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.14: Confusion matrix using statistical features computed from Texton image constructed using T6
Actin 76 0 7 5 0 0 0 9 1 0
DNA 0 78 1 4 0 0 0 2 2 0
Endosome 8 1 44 3 2 1 12 4 16 0
ER 8 3 6 45 0 0 1 14 9 0
Golgia 0 0 1 0 50 28 3 0 5 0
Golgpp 0 0 2 1 8 67 4 0 0 3
Lysosome 3 0 9 0 2 10 54 2 2 2
Microtubule 16 0 8 23 0 1 1 37 4 1
Mitochondria 4 0 22 8 4 1 0 5 29 0
Nucleolus 0 0 0 1 0 5 4 0 1 69
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.15: Confusion matrix using statistical features computed from Texton image constructed from the fusion of six Texton images
Actin 47 2 7 26 1 0 0 5 10 0
DNA 4 71 1 5 0 0 0 6 0 0
Endosome 10 0 43 6 2 1 13 5 11 0
ER 7 1 8 39 0 0 4 23 3 1
Golgia 0 0 3 0 58 20 4 0 2 0
Golgpp 0 0 6 0 15 57 5 0 0 2
Lysosome 3 0 14 2 3 5 55 1 0 1
Microtubule 10 6 8 25 1 1 0 34 6 0
Mitochondria 11 0 20 9 6 1 1 4 21 0
Nucleolus 0 0 1 1 5 3 8 0 1 61
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
Table 6.16 shows the original labels of Endosome proteins misclassified as Lysosome
proteins by the SVMs using statistical features computed from Texton images constructed
along different directions. The diversity among the classifications of the different classifiers
can be observed.
Table 6.16: Different Texton image based statistical features lead to different classification
results: Endosome proteins misclassified as Lysosome proteins
T1 T2 T3 T4 T5 T6 Fused Texton Image
- endosome 003 endosome 003 - - - -
- - - - endosome 005 - -
- endosome 007 - - - - -
- - endosome 008 - - - -
- - - - endosome 009 - endosome 009
- - - - endosome 010 - -
endosome 011 endosome 011 - - - - -
- endosome 013 - - - - -
- - endosome 014 - - - -
endosome 015 - - - - - -
- - - - endosome 016 - -
- - endosome 017 - - - -
- - - endosome 018 - - -
- - endosome 021 endosome 021 - - -
- - - endosome 023 endosome 023 - endosome 023
- endosome 026 - - - - -
- - endosome 027 - - - -
- - - endosome 028 - endosome 028 -
- endosome 029 - - - - endosome 029
endosome 031 - - - - - endosome 031
endosome 032 endosome 032 endosome 032 - - - -
endosome 034 - - - endosome 034 - -
endosome 035 - - - - - endosome 035
endosome 038 - - - - - endosome 038
- endosome 041 - endosome 041 endosome 041 endosome 041 -
endosome 043 - endosome 043 - - - -
- - - - endosome 044 - -
- - - - endosome 046 - -
- - endosome 047 - endosome 047 - -
endosome 050 - - endosome 050 - - -
- endosome 051 - endosome 051 - - -
- - - - - endosome 052 endosome 052
- - - endosome 053 - - -
- - - - - endosome 054 -
- - - - - - endosome 055
endosome 059 - - - - - -
- endosome 060 - - endosome 060 - -
- endosome 061 - - endosome 061 - endosome 061
- - - - - - endosome 062
endosome 063 - endosome 063 - - - -
- - - - - endosome 065 -
- - - endosome 067 - - -
- - - - - endosome 068 -
- endosome 070 endosome 070 endosome 070 - endosome 070 -
- - endosome 071 - - - -
- - - endosome 072 - - endosome 072
- - - - - endosome 073 -
- - - endosome 074 - endosome 074 -
- - - endosome 076 - endosome 076 -
- - endosome 077 - - - -
- endosome 079 endosome 079 - - - -
endosome 080 - - - - - -
endosome 081 - - - - endosome 081 -
- - endosome 082 - - - endosome 082
endosome 084 - endosome 084 - - - -
- - - - - - endosome 085
- - - - endosome 088 endosome 088 -
- - endosome 089 - - - -
endosome 090 - - - - - -
6.2.3 Analysis of the Hybrid Model for HeLa dataset
The SVM trained using the Hybrid model achieved the highest accuracy, 79.5%, compared
to the individual feature spaces. Among the other performance measures, this SVM yielded
83.9% sensitivity, 79.0% specificity, 0.43 MCC, and 0.46 F-score. The confusion matrix
obtained using the Hybrid model is presented in Table 6.17.
Table 6.17: Confusion matrix obtained using the hybrid model
Actin 90 0 1 0 0 0 2 2 3 0
DNA 0 84 0 1 0 0 0 2 0 0
Endosome 4 0 59 0 1 1 7 8 11 0
ER 1 0 1 69 0 0 1 1 13 0
Golgia 0 0 1 0 75 9 0 0 2 0
Golgpp 0 0 1 1 14 63 3 0 1 2
Lysosome 1 0 9 0 2 2 64 2 2 2
Microtubule 3 0 13 6 0 0 0 62 7 0
Mitochondria 1 0 6 8 0 0 2 9 47 0
Nucleolus 0 0 0 0 1 3 1 0 2 73
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
It can be observed that the combined feature space possesses more discriminative
information than the individual feature spaces, which is why the SVM using the Hybrid
model outperformed the SVMs trained on the individual feature spaces.
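The Hybrid model itself is a plain concatenation of the individual feature spaces, which can be sketched as follows. The block sizes and the standardization step are illustrative assumptions rather than the thesis's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-image feature blocks (shapes are illustrative only):
# Haralick features from the GLCMs and statistics from the Texton images.
glcm_feats = rng.normal(size=(100, 52))
texton_feats = rng.normal(size=(100, 24))

# The Hybrid model: simple concatenation of the individual feature spaces.
hybrid = np.hstack([glcm_feats, texton_feats])

# Column-wise standardization, commonly applied before training an SVM
# so that no feature block dominates purely because of its scale.
hybrid = (hybrid - hybrid.mean(axis=0)) / hybrid.std(axis=0)
```

An SVM trained on `hybrid` then sees both co-occurrence and texton information for every image, which is the source of the accuracy gain reported above.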
6.2.4 Ensemble Analysis
The ensemble is generated by combining the decisions of all the SVMs trained on the
GLCM based Haralick features, the Texton image based statistical features, and the Hybrid
model. The resulting confusion matrix is given in Table 6.18. In total, 13 votes are
combined, achieving a collective overall prediction accuracy of 95.3%; the corresponding
sensitivity, specificity, MCC, and F-score values are 97.4%, 95.1%, 0.80, and 0.81,
respectively. The results presented in Tables 6.1, 6.8, and 6.18 show that the accuracies of
the individual classifiers are modest: the highest accuracy of an individual SVM is 79.5%,
obtained with the Hybrid model. However, the prediction accuracy of IEH-GT reaches
95.3%, which indicates that the individual SVM classifiers, using features from Texton
images and from GLCMs along different orientations, produce diversified results, and this
diversity leads to the improved prediction performance of the majority voting based
ensemble.
Table 6.18: Confusion matrix using majority voting scheme
Actin 98 0 0 0 0 0 0 0 0 0
DNA 0 87 0 0 0 0 0 0 0 0
Endosome 3 0 85 0 0 0 1 0 2 0
ER 0 0 0 85 0 0 0 1 0 0
Golgia 0 0 0 0 81 6 0 0 0 0
Golgpp 0 0 0 0 0 85 0 0 0 0
Lysosome 0 0 0 0 0 0 84 0 0 0
Microtubule 5 0 1 9 0 0 0 76 0 0
Mitochondria 1 0 7 3 0 0 0 1 61 0
Nucleolus 0 0 0 0 0 0 0 0 0 80
(Columns, left to right: Actin, DNA, Endosome, ER, Golgia, Golgpp, Lysosome, Microtubule, Mitochondria, Nucleolus)
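The majority voting scheme underlying IEH-GT can be sketched as follows. The voter accuracies, class counts, and sample sizes are synthetic stand-ins; the sketch only shows how individually modest but diversified classifiers fuse into a much stronger ensemble.

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Fuse label predictions of shape (n_classifiers, n_samples) by
    plurality vote; ties resolve to the label encountered first."""
    return np.array([Counter(col).most_common(1)[0][0]
                     for col in predictions.T])

# Toy setup: 13 voters (as in IEH-GT), 10 classes, each voter right
# about 70% of the time with its errors scattered across classes.
rng = np.random.default_rng(0)
true = rng.integers(0, 10, size=200)
noise = rng.integers(0, 10, size=(13, 200))
preds = np.where(rng.random((13, 200)) < 0.7, true, noise)

fused = majority_vote(preds)
ensemble_acc = (fused == true).mean()
individual_acc = (preds == true).mean(axis=1)
```

Because the wrong votes of the 13 classifiers are scattered over many classes, the true label usually wins the plurality even when several voters err, mirroring how the 13-SVM ensemble exceeds every individual member.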
6.3 Comparative Analysis
The proposed IEH-GT approach shows performance comparable with existing state-of-the-art
approaches. In [7], the authors report that their system achieved 95.4% accuracy using
features extracted in multi-resolution subspaces from 2D HeLa images. In another approach,
the model developed in [39] yielded 94.2% accuracy on the 2D HeLa dataset employing an
RSE of neural networks. For the same dataset, Nanni et al. built a prediction system that
obtained an accuracy of 97.5% [40].
Table 6.19: Performance comparison with the existing approaches based on HeLa dataset
Method Summary of the Technique Accuracy (%)
Chebira et al. [7] Multi-resolution subspaces, weighted majority voting 95.4
Nanni and Lumini [39] RSE of NNs 94.2
Nanni et al. [40] RSE of NNs, AdaBoost ensemble of weak learners, sum rule 97.5
Nanni et al. [8] RSE of NNs 95.8
Nanni et al. [12] SVM, random subset of features, 50 classifiers, sum rule 93.2
Lin et al. [6] Variant of AdaBoost named as AdaBoost.ERC 93.6
IEH-GT Majority Voting based Ensemble 95.3
However, this significant improvement was achieved by combining two ensembles through
the sum rule. Nanni et al. developed another prediction model for the 2D HeLa dataset
that yielded 95.8% accuracy [8], again employing an RSE of neural networks. Nanni et al.
also developed a system that produced 93.2% accuracy on this dataset by employing 50
SVMs, each trained on a separate feature space formed from a random subset of features
[12]; the final decision is made using the sum rule. An accuracy of 93.6% has been reported
in [6] for the 2D HeLa dataset, using AdaBoost.ERC as the ensemble classifier. In contrast
to all the aforementioned techniques, IEH-GT is a simple ensemble of 13 SVMs that
achieved 95.3% accuracy, with features obtained in the spatial domain only. The only
technique that produced 2.2% higher accuracy than IEH-GT is the one proposed in [40];
however, that accuracy is achieved using a complex ensemble based on neural networks
and AdaBoost.
We report a feature extraction mechanism for effective classification of subcellular
localization images. It resembles a divide-and-conquer approach, targeting individual
fluorescence microscopy protein images on the basis of their histogram co-occurrence
information as well as the distribution of their texture patterns. A single combined GLCM
may not be a true representative of the whole dataset, which is why individual GLCMs
along the four orientations are utilized to recognize individual fluorescence microscopy
protein images. A similar approach is employed for the Texton image based features:
statistical features of different patterns are obtained so that individual fluorescence
microscopy protein images are well addressed by the classification system. An individual
GLCM constructed along a certain orientation captures information more precisely for one
image class but may not be the appropriate representation for another. When these
multiple GLCMs are combined, the information held in the different GLCMs is suppressed,
and the resulting combined GLCM may no longer represent a specific image class well.
Although this combined GLCM might be a generalized form for all the images, it may not
contribute well to the overall prediction performance of the underlying system. Therefore,
the significance of the proposed GLCM and Texton image based prediction system is
evident from the observed prediction performance.
We have shown that treating GLCMs and Texton images along different orientations
separately exploits the hidden information in protein fluorescence microscopy images well.
It is evident that GLCMs and Texton images along certain directions are more suitable for
extracting certain patterns, and their ensemble subsequently forms the basis for improved
prediction performance. The proposed IEH-GT technique might therefore be promising
for studying the structural variations of biological components or organisms.
Summary
In this chapter, the IEH-GT prediction system, based on features computed from GLCMs
and Texton images along different orientations, is proposed. The analysis reveals that a
GLCM or Texton image constructed along a single orientation may be efficient in
extracting useful information from a particular class of images; each orientation thus
describes some part of the dataset, which helps in building a diversified ensemble of
classifiers. In the next chapter, concluding remarks are presented along with future
directions.
Chapter 7
Conclusions and Future Directions
Analysis of fluorescence microscopy based protein images through efficient and reliable
automated systems is one of the basic needs in Bioinformatics, computational biology, and
the biological sciences. Developing such automated systems requires protein subcellular
localization images to be represented numerically through informative and discriminative
feature extraction strategies. To this end, different approaches have been developed in this
thesis. Concluding remarks on the conducted research are presented next, followed by
possible future directions.
7.1 Conclusive Remarks
In this thesis, reliable and effective automated systems for protein subcellular localization
have been developed, based on discriminative feature extraction strategies. Further, the
employed ensemble classification systems have exploited the generated feature spaces well,
improving the overall prediction performance. The proposed approaches achieve significant
improvement in accuracy compared to other approaches in the literature.
In Chapter 3, we studied different spatial and transform domain features and their effect
on the classification capability of the developed SVM-SubLoc prediction system. The
discriminative power of Haralick and Zernike features is enhanced in the transform domain
with the DWT. In particular, decomposition level 2 retains most of the information and
improved the classification capability of the classifier. Further, different hybrid features,
formed by concatenating individual features, also improved the prediction performance of
the classification system. In addition, the overall performance of the prediction system is
further enhanced through the majority voting ensemble, where the misclassifications of one
classifier are compensated by the others: no single member classifier of an ensemble knows
everything, but every member knows something. Thus, the ensemble showed better
performance than the individual classifiers. In general, the use of hybrid models in the
transform and spatial domains, coupled with ensemble classification, yielded a prediction
system that outperformed existing state-of-the-art approaches.
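The level-2 decomposition mentioned above can be illustrated with an orthonormal Haar transform, the simplest DWT; the thesis's wavelet choice may differ, so this is only a sketch of the sub-band structure on which Haralick and Zernike features would be computed.

```python
import numpy as np

def haar2d(x):
    """One level of an orthonormal 2D Haar wavelet transform, returning
    the approximation (LL) and detail (LH, HL, HH) sub-bands."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # row-pair lowpass
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # row-pair highpass
    LL = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2)
    LH = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2)
    HL = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2)
    HH = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2)
    return LL, LH, HL, HH

img = np.random.default_rng(0).random((64, 64))
LL1, LH1, HL1, HH1 = haar2d(img)   # decomposition level 1
LL2, LH2, HL2, HH2 = haar2d(LL1)   # decomposition level 2, the best level above
# Haralick and Zernike features would then be computed on these sub-bands.
```

Because the transform is orthonormal, the sub-bands at each level conserve the image energy while separating it by scale and orientation.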
Chapter 4 introduced the notion of oversampling in the feature space in order to reduce
data imbalance. SMOTE is utilized to introduce synthetic samples into the feature space,
which reduces the classifier's bias toward the majority class. It is observed that the
performance gain of the proposed RF-SubLoc prediction system under SMOTE grows with
the imbalance present in the original dataset: the more imbalanced the original dataset,
the greater the improvement in the classifier's prediction performance under SMOTE.
Consequently, the overall prediction performance of the system is enhanced with
oversampled features. Experimental analysis shows that RF-SubLoc is superior to other
state-of-the-art approaches reported in the literature.
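SMOTE's interpolation step can be sketched in a few lines; the neighbor count, sample sizes, and data below are illustrative assumptions, not the thesis's settings.

```python
import numpy as np

def smote(X, n_synthetic, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is interpolated between
    a random minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]   # exclude the sample itself
        j = rng.choice(neighbors)
        out.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.array(out)

minority = np.random.default_rng(1).normal(size=(20, 4))
synthetic = smote(minority, n_synthetic=40)
```

Since each synthetic point lies on a segment between two real minority samples, the new samples stay inside the minority region rather than duplicating existing points, which is what reduces the majority-class bias.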
Chapter 5 focused on exploiting the discriminative power of LTP combined with SMOTE
oversampling. It is observed that URI-LTP patterns are highly discriminative for
fluorescence microscopy protein images, which is why feature selection with mRMR failed
to further improve their discriminative power. Hence, it is concluded that LTP patterns
may already possess sufficient discriminative power, leaving little room for further
improvement through feature selection.
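The basic LTP computation can be sketched as follows. This sketch stops at the raw upper/lower codes and their histograms; the thesis uses the uniform rotation-invariant (URI) mapping on top of them, which is omitted here for brevity.

```python
import numpy as np

def ltp_maps(image, t=5):
    """Basic local ternary patterns: each 8-neighbor comparison is split into
    an 'upper' bit (neighbor > center + t) and a 'lower' bit
    (neighbor < center - t); pixels within the t-band contribute to neither."""
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = image.shape
    center = image[1:-1, 1:-1].astype(int)
    upper = np.zeros((h - 2, w - 2), dtype=int)
    lower = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dr, dc) in enumerate(offs):
        nbr = image[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc].astype(int)
        upper |= (nbr > center + t).astype(int) << bit
        lower |= (nbr < center - t).astype(int) << bit
    return upper, lower

img = np.random.default_rng(0).integers(0, 256, size=(32, 32))
up, lo = ltp_maps(img, t=5)
# Feature vector: concatenated 256-bin histograms of the two code maps.
feats = np.concatenate([np.histogram(up, bins=256, range=(0, 256))[0],
                        np.histogram(lo, bins=256, range=(0, 256))[0]])
```

The threshold `t` is exactly the user-defined parameter discussed in the future directions below; no neighbor can be both above `center + t` and below `center - t`, so the two code maps never share a set bit.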
Chapter 6 provided an in-depth analysis of GLCM and Texton images for extracting
discriminative information from protein subcellular localization images. GLCM and Texton
images are able to extract diverse information along different orientations from the same
image. From the simulation results, it is observed that the performance of features
extracted from GLCMoD differs only marginally from that of GLCMF. Hence, in some
cases, constructing GLCMs and Texton images along all directions might be avoided.
7.2 Future Directions
The proposed prediction systems have achieved significant improvements in classification
accuracy over other state-of-the-art approaches. In Chapter 3, we explained the method of
obtaining Haralick and Zernike moments in the sub-bands through the DWT, where the
highest classifier performance is reported using features extracted at decomposition level 2.
In future work, more decomposition levels will be explored to probe additional information
that might be useful for discriminating subcellular structures.
We have constructed different hybrid feature spaces to enhance the classifier's performance.
In future work, one may analyze weighted combinations of these features, which could
further enhance the performance of the classification system. Further, various
pre-processing techniques applied prior to feature extraction might be explored in order to
study their usefulness in discriminating fluorescence microscopy protein images.
The performance of prediction systems greatly depends on the discriminative power of the
features extracted from fluorescence microscopy protein images, and in most cases the
classification system also depends on the successful execution of an ensemble. Therefore,
we suggest developing a novel feature extraction strategy, or modifying an existing
technique, capable of enabling a single classifier to yield high prediction performance
without the aid of an ensemble.
In the current thesis, the proposed prediction systems were developed for 2D fluorescence
microscopy protein images only. However, 3D protein images from fluorescence microscopy
capture more information, which could be used to develop more sophisticated and efficient
classification systems. The proposed prediction systems could easily be modified for the
analysis and classification of 3D images.
The feature extraction strategies utilized in this thesis are capable of characterizing
different properties of protein subcellular localization images. An additional direction may
be to develop automated systems that extract images from online journals and estimate
the research output of a certain field over a specified duration, which would help in
identifying new research trends in the field of image processing.
We have utilized many existing feature extraction strategies. Among them, LTP and TAS
require user-defined thresholds for their computation. Researchers might focus on
developing a system that automatically determines this threshold in accordance with the
underlying dataset.
We have also developed web-based predictors, which are freely available to the research
community. In future work, the computational complexity and real-time cost of these
predictors could also be reduced.
References
[1] G. Srinivasa, T. Merryman, A. Chebira, J. Kovacevic, and A. Mintos, “Adaptive mul-
tiresolution techniques for subcellular protein location classification,” in Proceedings
of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5.
IEEE, 2006, pp. 14–19.
[2] R. F. Murphy, M. V. Boland, and M. Velliste, “Towards a systematics for protein
subcellular location: Quantitative description of protein localization patterns and au-
tomated analysis of fluorescence microscope images.” in ISMB, 2000, pp. 251–259.
[3] S.-B. Wan, L.-L. Hu, S. Niu, K. Wang, Y.-D. Cai, W.-C. Lu, and K.-C. Chou, “Identi-
fication of multiple subcellular locations for proteins in budding yeast,” Current Bioin-
formatics, vol. 6, no. 1, pp. 71–80, 2011.
[4] Y.-Y. Xu, F. Yang, Y. Zhang, and H.-B. Shen, “An image-based multi-label human
protein subcellular localization predictor (ilocator) reveals protein mislocalizations in
cancer tissues,” Bioinformatics, vol. 29, no. 16, pp. 2032–2040, 2013.
[5] M. V. Boland and R. F. Murphy, “A neural network classifier capable of recognizing
the patterns of all major subcellular structures in fluorescence microscope images of
HeLa cells,” Bioinformatics, vol. 17, no. 12, pp. 1213–1223, 2001.
[6] C.-C. Lin, Y.-S. Tsai, Y.-S. Lin, T.-Y. Chiu, C.-C. Hsiung, M.-I. Lee, J. C. Simpson,
and C.-N. Hsu, “Boosting multiclass learning with repeating codes and weak detectors
for protein subcellular localization,” Bioinformatics, vol. 23, no. 24, pp. 3374–3381,
2007.
[7] A. Chebira, Y. Barbotin, C. Jackson, T. Merryman, G. Srinivasa, R. F. Murphy,
and J. Kovacevic, “A multiresolution approach to automated classification of protein
subcellular location images,” BMC bioinformatics, vol. 8, no. 1, p. 210, 2007.
[8] L. Nanni, S. Brahnam, and A. Lumini, “Novel features for automated cell phenotype
image classification,” in Advances in Computational Biology. Springer, 2010, pp.
207–213.
[9] R. F. Murphy, “Automated proteome-wide determination of subcellular location using
high throughput microscopy,” in Proceedings of 5th IEEE International Symposium on
Biomedical Imaging: From Nano to Macro. IEEE, 2008, pp. 308–311.
[10] M. Riffle and T. N. Davis, “The yeast resource center public image repository: A large
database of fluorescence microscopy images,” BMC bioinformatics, vol. 11, no. 1, p.
263, 2010.
[11] R. F. Murphy, M. Velliste, and G. Porreca, “Robust numerical features for descrip-
tion and classification of subcellular location patterns in fluorescence microscope im-
ages,” Journal of VLSI signal processing systems for signal, image and video technology,
vol. 35, no. 3, pp. 311–321, 2003.
[12] L. Nanni, S. Brahnam, and A. Lumini, “Selecting the best performing rotation invariant
patterns in local binary/ternary patterns.” in IPCV, 2010, pp. 369–375.
[13] X. Xiao, S. Shao, Y. Ding, Z. Huang, and K.-C. Chou, “Using cellular automata
images and pseudo amino acid composition to predict protein subcellular location,”
Amino acids, vol. 30, no. 1, pp. 49–54, 2006.
[14] K.-C. Chou, Z.-C. Wu, and X. Xiao, “iLoc-Hum: Using the accumulation-label scale to
predict subcellular locations of human proteins with both single and multiple sites,”
Molecular Biosystems, vol. 8, no. 2, pp. 629–641, 2012.
[15] K.-C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid
composition,” Journal of theoretical biology, vol. 273, no. 1, pp. 236–247, 2011.
[16] J. W. Lichtman and J.-A. Conchello, “Fluorescence microscopy,” Nature Methods,
vol. 2, no. 12, pp. 910–919, 2005.
[17] J. Newberg, J. Hua, and R. F. Murphy, “Location proteomics: systematic determina-
tion of protein subcellular location,” in Systems Biology. Springer, 2009, pp. 313–332.
[18] K. Huang and R. F. Murphy, “Data mining methods for a systematics of protein
subcellular location,” in Data Mining in Bioinformatics. Springer, 2005, pp. 143–187.
[19] V. Ljosa and A. E. Carpenter, “Introduction to the quantitative analysis of two-
dimensional fluorescence microscopy images for cell-based screening,” PLoS compu-
tational biology, vol. 5, no. 12, p. e1000603, 2009.
[20] M. V. Boland, M. K. Markey, R. F. Murphy et al., “Automated recognition of patterns
characteristic of subcellular structures in fluorescence microscopy images,” Cytometry,
vol. 33, no. 3, pp. 366–375, 1998.
[21] R. F. Murphy, M. Velliste, and G. Porreca, “Robust classification of subcellular location
patterns in fluorescence microscope images,” in Proceedings of the 12th IEEE Workshop
on Neural Networks for Signal Processing. IEEE, 2002, pp. 67–76.
[22] R. F. Murphy, “Automated interpretation of subcellular location patterns,” in IEEE
International Symposium on Biomedical Imaging: Nano to Macro. IEEE, 2004, pp.
53–56.
[23] Y. Hu and R. F. Murphy, “Automated interpretation of subcellular patterns from
immunofluorescence microscopy,” Journal of immunological methods, vol. 290, no. 1,
pp. 93–105, 2004.
[24] N. Hamilton, R. Pantelic, K. Hanson, J. Fink, S. Karunaratne, and R. Teasdale, “Au-
tomated sub-cellular phenotype classification: an introduction and recent results,” in
Proceedings of the workshop on Intelligent systems for bioinformatics, vol. 73. Aus-
tralian Computer Society, Inc., 2006, pp. 67–72.
[25] L. Nanni and A. Lumini, “Ensemblator: An ensemble of classifiers for reliable classi-
fication of biological data,” Pattern Recognition Letters, vol. 28, no. 5, pp. 622–630,
2007.
[26] M. Tscherepanow, N. Jensen, and F. Kummert, “An incremental approach to auto-
mated protein localisation,” BMC bioinformatics, vol. 9, no. 1, p. 445, 2008.
[27] Z.-C. Wu, X. Xiao, and K.-C. Chou, “iLoc-Gpos: A multi-layer classifier for predicting
the subcellular localization of singleplex and multiplex gram-positive bacterial pro-
teins,” Protein and Peptide Letters, vol. 19, no. 1, pp. 4–14, 2012.
[28] X. Xiao, Z.-C. Wu, and K.-C. Chou, “iLoc-Virus: A multi-label learning classifier for
identifying the subcellular localization of virus proteins with both single and multiple
sites,” Journal of Theoretical Biology, vol. 284, no. 1, pp. 42–51, 2011.
[29] B. Zhang and T. D. Pham, “Multiple features based two-stage hybrid classifier ensem-
bles for subcellular phenotype images classification,” International Journal of Biomet-
rics and Bioinformatics, vol. 4, no. 5, pp. 176–193, 2010.
[30] M. Hayat and A. Khan, “Predicting membrane protein types by fusing composite
protein sequence features into pseudo amino acid composition,” Journal of Theoretical
Biology, vol. 271, no. 1, pp. 10–17, 2011.
[31] R. M. Haralick, “Statistical and structural approaches to texture,” Proceedings of the
IEEE, vol. 67, no. 5, pp. 786–804, 1979.
[32] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic
minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16,
pp. 321–357, 2002.
[33] B. Julesz, “A theory of preattentive texture discrimination based on first-order statis-
tics of textons,” Biological cybernetics, vol. 41, no. 2, pp. 131–138, 1981.
[34] K. Huang, M. Velliste, and R. F. Murphy, “Feature reduction for improved recogni-
tion of subcellular location patterns in fluorescence microscope images,” in Proceedings
SPIE, vol. 4962, 2003, pp. 307–318.
[35] K. Huang and R. Murphy, “Boosting accuracy of automated classification of fluores-
cence microscope images for location proteomics,” BMC Bioinformatics, vol. 5, no. 1,
p. 78, 2004.
[36] X. Chen, M. Velliste, and R. F. Murphy, “Automated interpretation of subcellular
patterns in fluorescence microscope images for location proteomics,” Cytometry part
A, vol. 69, no. 7, pp. 631–640, 2006.
[37] N. Hamilton, R. Pantelic, K. Hanson, and R. Teasdale, “Fast automated cell phenotype
image classification,” BMC bioinformatics, vol. 8, no. 1, p. 110, 2007.
[38] S.-C. Chen, T. Zhao, G. J. Gordon, and R. F. Murphy, “Automated image analysis
of protein localization in budding yeast,” Bioinformatics, vol. 23, no. 13, pp. i66–i71,
2007.
[39] L. Nanni and A. Lumini, “A reliable method for cell phenotype image classification,”
Artificial intelligence in medicine, vol. 43, no. 2, pp. 87–97, 2008.
[40] L. Nanni, A. Lumini, Y.-S. Lin, C.-N. Hsu, and C.-C. Lin, “Fusion of systems for auto-
mated cell phenotype image classification,” Expert Systems with Applications, vol. 37,
no. 2, pp. 1556–1562, 2010.
[41] L. Nanni, A. Lumini, and S. Brahnam, “Local binary patterns variants as texture
descriptors for medical image analysis,” Artificial intelligence in medicine, vol. 49,
no. 2, pp. 117–125, 2010.
[42] L. Nanni, S. Brahnam, and A. Lumini, “A simple method for improving local binary
patterns by considering non-uniform patterns,” Pattern Recognition, vol. 45, no. 10,
pp. 3844–3852, 2012.
[43] A. Eleyan and H. Demirel, “Co-occurrence matrix and its statistical features as a new
approach for face recognition,” Turkish Journal of Electrical Engineering & Computer
Sciences, vol. 19, no. 1, pp. 97–107, 2011.
[44] L. Nanni, S. Brahnam, S. Ghidoni, E. Menegatti, and T. Barrier, “Different approaches
for extracting information from the co-occurrence matrix,” PLoS ONE, vol. 8, no. 12, p.
e83554, 2013.
[45] F. Albregtsen et al., “Statistical texture measures computed from gray level co-
occurrence matrices,” Image Processing Laboratory, Department of Informatics, Uni-
versity of Oslo, 1995.
[46] G. Srinivasan and G. Shobha, “Statistical texture analysis,” Proceedings of World
Academy of Science, Engineering and Technology, vol. 36, 2008.
[47] A. Chaddad, C. Tanougast, A. Dandache, A. Bouridane, J. Charara, and A. Al Hou-
seini, “Classification of cancer cells based on Haralick’s coefficients using multi-spectral
images,” in 7th ESBME Conference, International Federation for Medical and Biological
Engineering, 2010.
[48] A. Gelzinis, A. Verikas, and M. Bacauskiene, “Increasing the discrimination power
of the co-occurrence matrix-based features,” Pattern Recognition, vol. 40, no. 9, pp.
2367–2372, 2007.
[49] V. Lakshminarayanan and A. Fleck, “Zernike polynomials: a guide,” Journal of Modern
Optics, vol. 58, no. 7, pp. 545–561, 2011.
[50] C.-W. Chong, P. Raveendran, and R. Mukundan, “A comparative analysis of algo-
rithms for fast computation of Zernike moments,” Pattern Recognition, vol. 36, no. 3,
pp. 731–742, 2003.
[51] T. Arif, Z. Shaaban, L. Krekor, and S. Baba, “Object classification via geometrical,
Zernike and Legendre moments,” Journal of Theoretical and Applied Information Tech-
nology, vol. 7, no. 1, pp. 31–37, 2009.
[52] M. Hayat and A. Khan, “Membrane protein prediction using wavelet decomposition
and pseudo amino acid based feature extraction,” in Proceedings of 6th International
Conference on Emerging Technologies. IEEE, 2010, pp. 1–6.
[53] J.-D. Qiu, X.-Y. Sun, J.-H. Huang, and R.-P. Liang, “Prediction of the types of mem-
brane proteins based on discrete wavelet transform and support vector machines,” The
Protein Journal, vol. 29, no. 2, pp. 114–119, 2010.
[54] T. Ojala, M. Pietikainen, and D. Harwood, “A comparative study of texture measures
with classification based on featured distributions,” Pattern recognition, vol. 29, no. 1,
pp. 51–59, 1996.
[55] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns,” Pattern Analysis and Ma-
chine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971–987, 2002.
[56] X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition un-
der difficult lighting conditions,” in Analysis and Modeling of Faces and Gestures.
Springer, 2007, pp. 168–182.
[57] ——, “Enhanced local texture feature sets for face recognition under difficult lighting
conditions,” Image Processing, IEEE Transactions on, vol. 19, no. 6, pp. 1635–1650,
2010.
[58] N. A. Hamilton, J. T. Wang, M. C. Kerr, and R. D. Teasdale, “Statistical and visual
differentiation of subcellular imaging,” BMC bioinformatics, vol. 10, no. 1, p. 94, 2009.
[59] N. Otsu, “A threshold selection method from gray-level histograms,” Systems, Man,
and Cybernetics, IEEE Transactions on, vol. 9, no. 1, pp. 62–66, 1979.
[60] J. Prewitt and M. L. Mendelsohn, “The analysis of cell images,” Annals of the New
York Academy of Sciences, vol. 128, no. 3, pp. 1035–1053, 1966.
[61] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
vol. 1. IEEE, 2005, pp. 886–893.
[62] O. Ludwig, D. Delgado, V. Goncalves, and U. Nunes, “Trainable classifier-fusion
schemes: an application to pedestrian detection,” in Proceedings of 12th International
IEEE Conference on Intelligent Transportation Systems. IEEE, 2009, pp. 1–6.
[63] C. Chen, A. Liaw, and L. Breiman, “Using random forest to learn imbalanced data,”
University of California, Berkeley, 2004.
[64] P. Yang, L. Xu, B. Zhou, Z. Zhang, and A. Zomaya, “A particle swarm based hybrid
system for imbalanced medical data sampling,” BMC genomics, vol. 10, no. Suppl 3,
p. S34, 2009.
[65] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information cri-
teria of max-dependency, max-relevance, and min-redundancy,” Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 27, no. 8, pp. 1226–1238, 2005.
[66] Z.-S. He, X.-H. Shi, X.-Y. Kong, Y.-B. Zhu, and K.-C. Chou, “A novel sequence-based
method for phosphorylation site prediction with feature selection and analysis,” Protein
and Peptide Letters, vol. 19, no. 1, pp. 70–78, 2012.
[67] T. Huang, L. Chen, Y.-D. Cai, and K.-C. Chou, “Classification and analysis of regu-
latory pathways using graph property, biochemical and physicochemical property, and
functional property,” PLoS ONE, vol. 6, no. 9, p. e25297, 2011.
[68] B.-Q. Li, L.-L. Hu, L. Chen, K.-Y. Feng, Y.-D. Cai, and K.-C. Chou, “Prediction of
protein domain with mRMR feature selection and analysis,” PLoS ONE, vol. 7, no. 6, p.
e39308, 2012.
[69] L. Rokach, “Taxonomy for characterizing ensemble methods in classification tasks: A
review and annotated bibliography,” Computational Statistics & Data Analysis, vol. 53,
no. 12, pp. 4046–4072, 2009.
[70] A.-L. Boulesteix, S. Janitza, J. Kruppa, and I. R. Konig, “Overview of random for-
est methodology and practical guidance with emphasis on computational biology and
bioinformatics,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discov-
ery, vol. 2, no. 6, pp. 493–507, 2012.
[71] L. Nanni, S. Brahnam, and A. Lumini, “Local ternary patterns from three orthogonal
planes for human action classification,” Expert Systems with Applications, vol. 38, no. 5,
pp. 5125–5128, 2011.
[72] V. N. Vapnik, “An overview of statistical learning theory,” Neural Networks, IEEE
Transactions on, vol. 10, no. 5, pp. 988–999, 1999.
[73] J. Li, L. Xiong, J. Schneider, and R. F. Murphy, “Protein subcellular location pattern
classification in cellular images using latent discriminative models,” Bioinformatics,
vol. 28, no. 12, pp. i32–i39, 2012.
[74] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[75] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1-2,
pp. 1–39, 2010.
[76] Y. Xie, X. Li, E. Ngai, and W. Ying, “Customer churn prediction using improved
balanced random forests,” Expert Systems with Applications, vol. 36, no. 3, pp. 5445–
5449, 2009.
[77] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso, “Rotation forest: A new classifier
ensemble method,” Pattern Analysis and Machine Intelligence, IEEE Transactions on,
vol. 28, no. 10, pp. 1619–1630, 2006.
[78] L. I. Kuncheva and J. J. Rodríguez, “An experimental study on rotation forest ensem-
bles,” in Multiple Classifier Systems. Springer, 2007, pp. 459–468.
[79] W. Chen, P.-M. Feng, H. Lin, and K.-C. Chou, “iRSpot-PseDNC: identify recombination
spots with pseudo dinucleotide composition,” Nucleic Acids Research, vol. 41, no. 6, pp.
e68–e68, 2013.
[80] Y. Xu, J. Ding, L.-Y. Wu, and K.-C. Chou, “iSNO-PseAAC: predict cysteine S-
nitrosylation sites in proteins by incorporating position specific amino acid propensity
into pseudo amino acid composition,” PLoS ONE, vol. 8, no. 2, p. e55844, 2013.
[81] N. Ye, K. M. A. Chai, W. S. Lee, and H. L. Chieu, “Optimizing F-measures: A tale of
two approaches,” in Proceedings of the International Conference on Machine Learning,
2012.
[82] Y. Sasaki, “The truth of the F-measure,” Teach Tutor Mater, pp. 1–5, 2007.
[83] J. Meynet and J.-P. Thiran, “Information theoretic combination of pattern classifiers,”
Pattern Recognition, vol. 43, no. 10, pp. 3412–3421, 2010.
[84] D. J. Hand and R. J. Till, “A simple generalisation of the area under the roc curve for
multiple class classification problems,” Machine Learning, vol. 45, no. 2, pp. 171–186,
2001.
[85] B. Zhang and T. D. Pham, “Phenotype recognition with combined features and random
subspace classifier ensemble,” BMC bioinformatics, vol. 12, no. 1, p. 128, 2011.
[86] B. Zhang, Y. Zhang, W. Lu, and G. Han, “Phenotype recognition by curvelet trans-
form and random subspace ensemble,” Journal of Applied Mathematics and Bioinfor-
matics, vol. 1, no. 1, pp. 79–103, 2011.
Vitae
Muhammad Tahir is a Ph.D. candidate at the Department of Computer and Informa-
tion Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad,
Pakistan. He received his M.Sc. degree in Computer Science from Central Science Post-
Graduate College, University of Peshawar, Peshawar, in 2005, and his M.S. degree in
Computer Science from the National University of Computer and Emerging Sciences,
Islamabad, Pakistan, in August 2010. His current research interests include machine learning,
pattern recognition, computational intelligence, bioinformatics, and ensemble classification.
Asifullah Khan received his M.S. and Ph.D. degrees in Computer Systems Engineering
from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan,
in 2003 and 2006, respectively. He carried out two years of post-doctoral research at the
Signal and Image Processing Lab, Department of Mechatronics, Gwangju Institute of Science
and Technology, South Korea. He has more than 14 years of research experience and works
as an Associate Professor in the Department of Computer and Information Sciences at PIEAS.
His research areas include digital watermarking, pattern recognition, bioinformatics, and
machine learning.