Comparison of MLR, Isotonic Regression and
KNN Based QSAR Models for the Prediction of
Inhibitory Activity of HDAC6 Inhibitors
Sandhya Vijayasarathy and Jhinuk Chatterjee Department of Biotechnology, PES Institute of Technology, Bangalore, Karnataka, India
Email: [email protected]
Abstract—Histone deacetylase 6 (HDAC6), a member of
class II HDACs, is considered a drug target for cancer because
of its major contribution to oncogenic cell transformation.
The aim of this study was to develop Quantitative Structure
Activity Relationship (QSAR) models and evaluate their
performance in predicting the inhibitory activity of various
HDAC6 inhibitors. To achieve this, a dataset of forty HDAC6
inhibitors was collected from the PubChem database and
subjected to calculation of descriptors, data pre-processing,
selection of relevant descriptors and QSAR modeling, followed
by evaluation of the developed models using various
statistical parameters. The best results were obtained for
the k nearest neighbor based QSAR model, with Correlation
Coefficient (r) = 0.98, Squared Correlation Coefficient
(r²) = 0.96, Mean Absolute Error (MAE) = 0.06 and Root Mean
Squared Error (RMSE) = 0.15, indicating that it is the best
of the three methods for predicting the inhibitory activity
of HDAC6 inhibitors. Further, cross-validation procedures
need to be performed and the model must be tested against a
large dataset to authenticate its predictive accuracy.
Index Terms—QSAR, histone deacetylase, multiple linear
regression, isotonic regression, k nearest neighbor
I. INTRODUCTION
Histone Deacetylases (HDACs) are a promising and
novel class of anti-cancer drug targets. HDACs are
grouped into three classes based on their homology to
yeast HDACs. Class I HDACs consist of HDAC1, HDAC2, HDAC3
and HDAC8. Class II HDACs include HDAC4, HDAC5, HDAC6,
HDAC7, HDAC9 and HDAC10, and class III HDACs are similar
to the NAD+ dependent ySIR2. Histone deacetylases are mainly
involved in the
deacetylation of histones. However, some HDACs like
HDAC6 can also affect the function of cytoplasmic non-
histone proteins.
Over-expression of HDAC6 is associated with
tumorigenesis and improved survival. Hence, HDAC6
can be used as a marker for prognosis. Studies have
demonstrated that in multiple myeloma cells, HDAC6
inhibition results in apoptosis. Moreover, HDAC6 is
essential for the activation of heat-shock factor 1 (HSF1),
an activator of heat-shock protein (HSP) and a
cylindromatosis tumor suppressor gene (CYLD). HDAC6 also
affects transcription and translation by regulating
HSP90 and stress granules. Furthermore, HDAC6
contributes to metastasis, since its upregulation increases
cell motility in breast cancer MCF-7 cells [1].
Manuscript received January 12, 2015; revised April 7, 2015.
HDAC inhibitors have been evaluated in clinical trials
and have shown activity against several cancers.
However, these inhibitors act unselectively against
several or all HDAC family members. As a result, various
side effects were observed in clinical phase I trials. Thus,
identifying the cancer-relevant HDAC family member in a
given tumour type and designing a selective inhibitor that
targets only cancerous cells remain a challenge [2]. As
HDACs are over-expressed in tumour cells, their inhibition
can be useful in inducing growth arrest or specifically
promoting apoptosis. For example, HDAC6 deacetylates
alpha-tubulin and increases cell motility. It is also found
to be upregulated in oral squamous cell carcinoma, and its
expression is known to increase in the advanced stages of
the cancer [3]. Thus, selective inhibition of HDAC6 can be a
promising approach for the treatment of oral cancer.
Finding new drugs by experimental screening of a large
number of compounds is a complex, expensive and
time-consuming process. Thus, the use of computational
models like Quantitative Structure Activity Relationship
(QSAR) can be an alternative and reliable approach.
Significant insights into the rational design of novel and
potent compounds can be attained by studying the
relationships between the structure and biological activity
of a series of compounds through the use of reliable and
robust QSAR models.
A QSAR model is a mathematical model that relates the
structure-derived features of a compound to its biological
activity. It works on the assumption that
structurally similar compounds have similar activities [4].
QSAR models can be used to filter the drug candidates
before subjecting them to in vitro and in vivo studies.
These models can be developed through the application
of various supervised and/or unsupervised statistical and
machine learning techniques [5]. Apart from predicting
the unknown inhibitory activity of various compounds,
QSAR models can be used for classifying different
classes of compounds, lead compound optimization, as
well as prediction of biological activity, toxicity and
various physicochemical properties of molecules.
International Journal of Life Sciences Biotechnology and Pharma Research Vol. 4, No. 2, April 2015
©2015 Int. J. Life Sci. Biotech. Pharm. Res. 127
In this study, we have developed Multiple Linear
Regression, Isotonic regression and k-nearest neighbor
(IBk) based QSAR models for predicting the inhibitory
activity of various HDAC6 inhibitors and their
performance was assessed using various statistical
parameters.
II. MATERIALS AND METHODS
The various steps involved in QSAR analysis are shown
in Fig. 1.
Figure 1. Flowchart representing the various steps involved in QSAR
analysis.
A. Preparation of Dataset
A dataset of forty HDAC6 inhibitors with known IC50 values
was collected from the PubChem BioAssay database [6]. The
structures were downloaded in 3D Structure Data Format
(SDF). The biological activity in terms of IC50 (defined as
the concentration of an inhibitor required for 50%
inhibition of its target) was used in the study. The IC50
values of these inhibitors range from 0.661 µM to 2 µM. In
addition, structural diversity was checked by clustering
with the PubChem structure clustering tool.
All IC50 values were converted to their negative logarithms
to obtain pIC50 values (i.e., pIC50 = -log10 IC50), which
served as the dependent variable in the model.
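The conversion can be illustrated with a minimal Python sketch. It assumes that IC50 is expressed in mol/L before taking the logarithm, which is the usual convention but is not stated in the text; the helper names are illustrative only.

```python
import math

def micromolar_to_molar(ic50_um: float) -> float:
    """Convert a concentration from µM to mol/L (assumed unit convention)."""
    return ic50_um * 1e-6

def pic50_from_ic50(ic50_molar: float) -> float:
    """Return pIC50 = -log10(IC50) for an IC50 given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

# The two ends of the reported activity range (0.661 µM and 2 µM).
most_potent = pic50_from_ic50(micromolar_to_molar(0.661))
least_potent = pic50_from_ic50(micromolar_to_molar(2.0))
print(f"pIC50 range: {least_potent:.2f} to {most_potent:.2f}")
```

A lower IC50 gives a higher pIC50, so the most potent inhibitor has the largest value of the dependent variable.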
B. Calculation of Descriptors
The structural information of compounds can be
analyzed using molecular descriptors. Molecular
descriptors describe the properties of molecules and are
represented by numerical values [7]. In this study, 811
types of descriptors were calculated using E-Dragon 1.0
software [8]. These include: topological descriptors,
constitutional descriptors, walk and path counts, 2D
autocorrelation, connectivity indices, information indices,
edge adjacency indices, topological charge indices,
Burden eigenvalues, functional group counts, molecular
properties and eigenvalue-based indices [9].
Further, the dataset was randomized and divided into
training (30 compounds) and test set (10 compounds).
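A random 30/10 split of this kind can be sketched as follows. The compound identifiers and the seed are placeholders, not values from the study, and the split actually used by the authors is not reproduced here.

```python
import random

def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of items and split it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the forty inhibitors.
compound_ids = [f"cpd_{i:02d}" for i in range(1, 41)]
train_ids, test_ids = split_dataset(compound_ids, n_train=30)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the partition repeatable, which matters when several models are compared on the same test set.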
C. Selection of Descriptors
In QSAR modeling, descriptors play an important role,
and selecting highly significant descriptors is therefore
necessary for building an efficient QSAR model. To achieve
this, descriptors with invariable values were removed (data
pre-processing), and the CfsSubsetEval module implemented
in the Waikato Environment for Knowledge Analysis
(WEKA 3.6.8), a data mining tool [10], was applied to
the training set to remove highly correlated attributes.
The CfsSubsetEval module, together with the best-first
search method, finds the best descriptors by considering the
predictive ability of each descriptor. It selects descriptors
that are highly correlated with the class and have low
inter-correlation with the other descriptors. The search may
start with the full set of attributes (i.e., descriptors) and
search backward, start with an empty set of attributes and
search forward, or start at any point and search in both
directions by considering all possible single attribute
additions and deletions at a given point.
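The idea behind correlation-based feature selection can be sketched in Python. The sketch below uses the standard CFS merit heuristic, k·r_cf / sqrt(k + k(k−1)·r_ff), with Pearson correlation, and a plain greedy forward search in place of WEKA's best-first search; it is an illustration under those assumptions, not WEKA's implementation. The descriptor names and values are made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_invariant(columns):
    """Remove descriptors whose value is the same for every compound."""
    return {name: v for name, v in columns.items() if len(set(v)) > 1}

def cfs_merit(subset, columns, target):
    """CFS merit: rewards correlation with the target, penalises redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(columns, target):
    """Greedy forward search: add the descriptor that most improves merit."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        scored = [(cfs_merit(selected + [f], columns, target), f)
                  for f in sorted(remaining)]
        merit, cand = max(scored)
        if merit <= best + 1e-12:  # stop when no candidate improves merit
            break
        selected.append(cand)
        best = merit
        remaining.remove(cand)
    return selected, best

# Toy data: desc_b is nearly redundant with desc_a, desc_c is invariant.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_b": [1.0, 2.0, 3.0, 4.0, 6.0],
    "desc_c": [7.0, 7.0, 7.0, 7.0, 7.0],
    "desc_d": [5.0, 1.0, 4.0, 2.0, 3.0],
}
filtered = drop_invariant(descriptors)
selected, merit = forward_select(filtered, target)
print(selected, round(merit, 3))
```

In the toy data the near-duplicate descriptor is rejected because adding it raises the inter-correlation term more than it raises the correlation with the target.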
D. QSAR Model Construction
The QSAR models were then constructed using three
machine learning algorithms namely Multiple Linear
Regression (MLR), Isotonic Regression and k Nearest
Neighbor (IBk) implemented in WEKA 3.6.8.
In WEKA, the MLR method can deal with weighted instances
and uses the Akaike criterion for model selection. This
technique establishes a relationship between a dependent
variable and multiple independent variables.
The MLR equation represents the variation of
biological/chemical properties as a function of the
variations of the molecular substituents present in the
molecular data [11].
In isotonic regression, the model is learned by picking
the attribute that results in the lowest squared error.
The k-nearest neighbor method (IBk), on the other hand, can
select a suitable value of k based on cross-validation and
can also apply distance weighting. It uses a simple distance
measure to find the training instance closest to the given
test instance and predicts the same class as that training
instance. If multiple instances have the same (smallest)
distance to the test instance, the first one found is
used [12], [13].
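Two of the ideas above can be made concrete with a short Python sketch: a least-squares monotone fit on a single attribute (pool-adjacent-violators, increasing case only) and a nearest-neighbor prediction in which ties go to the first instance found. This is a simplified illustration, not WEKA's code; it assumes plain unnormalized Euclidean distance and unweighted averaging, and the data are invented.

```python
import math

def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit.

    `values` are targets already ordered by the chosen attribute.
    """
    blocks = []  # each block holds [sum of targets, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds this one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def knn_predict(train_x, train_y, query, k=1):
    """Average the targets of the k nearest training instances.

    Sorting (distance, index) pairs breaks distance ties by index,
    so the first instance found wins.
    """
    ranked = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))
    nearest = ranked[:k]
    return sum(train_y[i] for _, i in nearest) / k

# Toy data, not taken from the study.
monotone_fit = pava([1.0, 3.0, 2.0, 4.0])
train_x = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
train_y = [5.0, 6.0, 7.0]
pred_close = knn_predict(train_x, train_y, (0.9, 1.1), k=1)
pred_tie = knn_predict(train_x, train_y, (0.5, 0.5), k=1)
print(monotone_fit, pred_close, pred_tie)
```

In the tie case the query is equally far from the first two training instances, so the prediction is the target of the first one.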
In all the cases, the model was built on the training set
of 30 compounds and was evaluated against the test set of
10 compounds. The predicted activity was then compared
with the actual values to analyze the predictive behavior
and accuracy of the generated model.
In addition, a scatter plot was plotted for MLR based
model between actual and predicted pIC50 values to study
their linear association.
E. Evaluation of Models
The fitness of the models developed in this study was
assessed using the following statistical parameters:

Correlation coefficient:
$$ r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{N}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{N}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{N}\right)}} \tag{1} $$

Mean Absolute Error:
$$ \mathrm{MAE} = \frac{1}{N} \sum \left| y_i - x_i \right| \tag{2} $$

Root Mean Square Error:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum (y_i - x_i)^2} \tag{3} $$

where x_i and y_i denote the actual and predicted pIC50 values
for the i-th compound, and N is the number of
compounds [14].
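The three parameters can be computed directly from paired actual and predicted values. The following Python sketch follows Eqs. (1) to (3), taking the absolute difference in the MAE as in its standard definition; the example values are invented and are not results from this study.

```python
import math

def pearson_r(actual, predicted):
    """Correlation coefficient in the raw-sum form of Eq. (1)."""
    n = len(actual)
    sx, sy = sum(actual), sum(predicted)
    sxy = sum(x * y for x, y in zip(actual, predicted))
    sxx = sum(x * x for x in actual)
    syy = sum(y * y for y in predicted)
    return (sxy - sx * sy / n) / math.sqrt(
        (sxx - sx * sx / n) * (syy - sy * sy / n))

def mae(actual, predicted):
    """Mean absolute error, Eq. (2)."""
    return sum(abs(y - x) for x, y in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, Eq. (3)."""
    return math.sqrt(
        sum((y - x) ** 2 for x, y in zip(actual, predicted)) / len(actual))

# Hypothetical pIC50 values for four compounds.
actual = [5.0, 5.5, 6.0, 6.5]
predicted = [5.1, 5.4, 6.2, 6.4]
r = pearson_r(actual, predicted)
print(f"r={r:.3f} r2={r * r:.3f} "
      f"MAE={mae(actual, predicted):.3f} RMSE={rmse(actual, predicted):.3f}")
```

Because squaring weights large errors more heavily, the RMSE is always at least as large as the MAE on the same predictions.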
III. RESULTS
TABLE I. LIST OF DESCRIPTORS OBTAINED AFTER ATTRIBUTE SELECTION PROCESS
Descriptor   | Description
X0Av         | Connectivity index
BIC0         | Information index
JGI3         | 2D autocorrelations
nArCNO       | Number of oximes (aromatic)
nArNR2       | Number of tertiary amines (aromatic)
Hypertens-80 | Drug-like index
The equation obtained for MLR is given below:
$$ \mathrm{pIC_{50}} = -5.3313\,\mathrm{X0Av} - 13.3235\,\mathrm{BIC0} + 7.3719\,\mathrm{JGI3} - 1.0443\,\mathrm{nArCNO} + 1.4028\,\mathrm{nArNR2} + 0.7771\,\text{Hypertens-80} + 13.5977 \tag{4} $$
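For illustration, Eq. (4) can be written as a small Python function. The coefficients and intercept are those of the equation; the descriptor values passed to it are hypothetical inputs chosen only to show the arithmetic and do not correspond to any compound in the dataset.

```python
# Coefficients and intercept copied from Eq. (4).
MLR_COEFFICIENTS = {
    "X0Av": -5.3313,
    "BIC0": -13.3235,
    "JGI3": 7.3719,
    "nArCNO": -1.0443,
    "nArNR2": 1.4028,
    "Hypertens-80": 0.7771,
}
MLR_INTERCEPT = 13.5977

def predict_pic50(descriptors):
    """Evaluate the linear model for one compound's descriptor values."""
    return MLR_INTERCEPT + sum(
        coef * descriptors[name] for name, coef in MLR_COEFFICIENTS.items())

# Hypothetical descriptor vectors (not real compounds).
all_zero = {name: 0.0 for name in MLR_COEFFICIENTS}
only_x0av = dict(all_zero, X0Av=1.0)
print(predict_pic50(all_zero), predict_pic50(only_x0av))
```

With every descriptor at zero the prediction equals the intercept, and raising X0Av by one unit lowers the predicted pIC50 by 5.3313, which is how the sign of each coefficient should be read.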
Figure 2. Multiple linear regression result in WEKA 3.6.8.
Figure 3. Isotonic regression result in WEKA 3.6.8.
Figure 4. KNN (IBk) result in WEKA 3.6.8.
Figure 5. Scatter plot between observed pIC50 and predicted pIC50 of MLR based model.
TABLE II. PERFORMANCE OF DIFFERENT REGRESSION ALGORITHMS
Algorithm                | r    | r²   | MAE  | RMSE
MLR                      | 0.86 | 0.74 | 0.32 | 0.39
Isotonic regression      | 0.74 | 0.55 | 0.4  | 0.5
k nearest neighbor (IBk) | 0.98 | 0.96 | 0.06 | 0.15
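As a quick consistency check on Table II, the squared correlation coefficient of each model should equal the square of its r to within rounding. A minimal Python sketch of that arithmetic, using only the values reported in the table:

```python
# (r, r2) pairs as reported in Table II.
table_ii = {
    "MLR": (0.86, 0.74),
    "Isotonic regression": (0.74, 0.55),
    "k nearest neighbor (IBk)": (0.98, 0.96),
}

# Square each r and round to two decimals, as in the table.
squared = {name: round(r * r, 2) for name, (r, _) in table_ii.items()}
for name, (r, r2) in table_ii.items():
    print(f"{name}: r^2 from r = {squared[name]:.2f}, reported = {r2:.2f}")
```

All three reported values agree with the squared r at two decimal places.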
IV. DISCUSSION
Quantitative structure activity relationship modeling plays
an important role in discovering new potent chemical
entities. In this study, QSAR models have been
constructed for the prediction of pIC50 values for a series
of HDAC6 inhibitors. The flowchart representing the
various steps involved in the QSAR analysis is given in
Fig. 1.
After collecting HDAC6 inhibitors from PubChem
database, structural diversity was checked using
PubChem Structure Clustering tool to ensure that the
compounds collected are structurally similar.
Descriptor-computing software calculates a large number
of descriptors, so the set must be reduced by
eliminating duplicate and highly correlated descriptors in
order to narrow it down to the best performing and most
representative descriptors. After calculating
descriptors with E-Dragon 1.0 software, highly correlated
attributes were removed using WEKA's attribute
selection tool, in which the CfsSubsetEval attribute
evaluator and the best-first search method were employed.
This procedure resulted in 6 descriptors (Table I).
Further, the MLR model, isotonic regression model and
kNN model were developed using the Linear Regression
function, Isotonic Regression function and IBk,
respectively, in WEKA 3.6.8. In all the cases, the QSAR
model was built on the training set and evaluated against
the test set. The predicted activity was then compared
with the actual value. The results of the MLR, isotonic
regression and kNN based models for the training set are
shown in Fig. 2, Fig. 3 and Fig. 4, respectively.
The goodness of fit of the developed models was
assessed by the statistical parameters given in
Eqs. (1), (2) and (3). The performance of the different
regression algorithms is given in Table II.
The equation obtained from the multiple linear regression
analysis is given in Eq. (4). The squared correlation
coefficient of 0.74 for the MLR based model indicates that
74% of the variation in the dependent variable is explained
by the independent variables. In addition, a scatter plot was
plotted for the MLR model between the predicted and
observed pIC50 values. The predicted pIC50 shows a linear
association with the observed pIC50, and the fit of the data
to the regression line is good. From the graph, it can be
inferred that the selected descriptors contribute positively
towards the prediction of inhibitory activity (Fig. 5).
From Table II, it is evident that the k nearest neighbor
(IBk) model performs better than the other two learning
techniques, with r = 0.98, r² = 0.96, MAE = 0.06 and
RMSE = 0.15. The r² value of 0.96 indicates that 96% of the
variation in pIC50 is explained by the descriptors.
RMSE is a measure of accuracy, and smaller values are
better. MAE is a statistical measure of how far an
estimate is from the actual value on average; it can never
exceed RMSE, and smaller values are likewise better.
V. CONCLUSION
QSAR plays a key role in drug discovery and
development. Selection of an appropriate QSAR
modeling method plays a major role in prediction.
Integrated use of various machine learning techniques
and use of associated validation metrics allows
development of highly predictive QSAR models.
Building an efficient QSAR model would be of great
benefit to the drug discovery community engaged in lead
candidate generation, saving cost and time in
research studies.
In this study, the inhibitory activity of HDAC6
inhibitors was modelled by three different machine
learning techniques, namely Multiple Linear Regression,
Isotonic Regression and k nearest neighbor (IBk). From
the results, it was clear that the k nearest neighbor (IBk)
model performed better than the other two models. Therefore,
the k nearest neighbor (IBk) based QSAR model can be
considered a promising approach for predicting the
inhibitory activity of HDAC6 inhibitors, which in turn can
be used for designing new HDAC6 inhibitors or to help
improve the bioactivity of existing ones.
Further, internal and external cross-validation procedures
need to be performed and the model must be tested
against a large dataset to authenticate its predictive
accuracy.
ACKNOWLEDGMENT
The authors would like to thank Dr. Roshan Makam,
head of Department of Biotechnology, PES Institute of
Technology, Bangalore for his support.
REFERENCES
[1] G. I. Aldana-Masangkay and K. M. Sakamoto, “The Role
of HDAC6 in Cancer (Review),” J. Biomed. Biotechnol.,
vol. 2011, pp. 10, 2011.
[2] O. Witt, H. E. Deubzer, T. Milde, and I. Oehme, “HDAC
family: What are the cancer relevant targets?” Cancer Lett.,
vol. 277, pp. 8-21, 2009.
[3] T. Sakuma, K. Uzawa, and T. Onda, “Aberrant expression
of histone deacetylase 6 in oral squamous cell carcinoma,”
Int. J. Oncol., vol. 29, pp. 117-124, 2006.
[4] T. Puzyn, J. Leszczynski, and M. T. Cronin, Recent
Advances in QSAR Studies Methods and Applications
(Challenges and Advances in Computational Chemistry
and Physics), Dordrecht Heidelberg London, New York:
Springer, 2010.
[5] C. Ventura, D. A. Latino, and F. Martins, “Comparison of
multiple linear regressions and neural networks based
QSAR models for the design of new antitubercular
compounds,” Eur. J. Med. Chem., vol. 70, pp. 831-845,
2013.
[6] Y. Wang, J. Xiao, and T. O. Suzek, “Pubchem’s bioassay
database,” Nucleic Acids Res, vol. 40, pp. 400-412, 2012.
[7] A. Leach and V. Gillet, An Introduction to
Chemoinformatics, India: Springer, 2009.
[8] I. V. Tetko, J. Gasteiger, and R. Todeschini, “Virtual
computational chemistry laboratory - design and
description,” J. Comput. Aid. Mol. Des., vol. 19, pp. 453-
463, 2005.
[9] A. Garg, R. Tewari, and G. P. Raghava, “KiDoQ: Using
docking based energy scores to develop ligand based
model for predicting antibacterials,” BMC Bioinformatics,
vol. 11, pp. 125, 2010.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P.
Reutemann, and I. H. Witten, “The WEKA data mining
software: An update,” SIGKDD Explor. Newslett., vol. 11,
pp. 10-18, 2009.
[11] C. Nantasenamat, C. Isarankura-Na-Ayudhya, T. Naenna,
and V. Prachayasittikul, “A practical overview of
quantitative structure-activity relationship,” EXCLI, vol. 8,
pp. 74-88, 2009.
[12] S. M. Mwagha, M. Muthoni, and P. Ochieg, “Comparison
of nearest neighbor (ibk), regression by discretization and
isotonic regression classification algorithms for
precipitation classes prediction,” International Journal of
Computer Applications, vol. 96, pp. 44-48, 2014.
[13] D. Aha and D. Kibler, “Instance-based learning
algorithms,” Machine Learning, vol. 6, pp. 37-66, 1991.
[14] D. Singla, M. Anurag, D. Dash, and G. P. Raghava, “A
web server for predicting inhibitors against bacterial target
GlmU protein,” BMC Pharmacol., vol. 11, pp. 5-14, 2011.
Sandhya Vijayasarathy was born in Bangalore (India) on 6th
July, 1988. She is a full-time research scholar, pursuing Ph.D. in
biotechnology at PES Institute of Technology, Bangalore
(Visvesvaraya Technological University). She holds a
bachelor’s degree (B.E) in biotechnology and a master’s degree
(M.Tech) in bioinformatics. She is an enthusiastic researcher.
Her major areas of research include computational biology,
cheminformatics, data mining, structural bioinformatics and
machine learning.
She has published research papers in various international
journals such as International Journal of Pharmaceutical
Sciences and Research, Asian Pacific Journal of Cancer
Prevention and International Journal of Biomedical
Engineering and Health Informatics in addition to oral and
poster presentations at various national and international
conferences.
She is also an associate member of “The Institution of
Engineers”, India.
Jhinuk Chatterjee was born in Kalyani (West Bengal, India)
on 11th January, 1973. At present, she is working as an assistant
professor in PES Institute of Technology, Bangalore (India).
She holds a Ph.D. in biotechnology (Department of Life Sciences
& Biotechnology, Jadavpur University, Kolkata, India). Her
major areas of research are computational biology, systems
biology, cellular imaging and computational immunology.
University merit scholarship and academic excellence award
were awarded to her during graduation and post-graduation
studies. She also worked as full time post-doctoral researcher in
Indian Institute of Science (Bangalore, India). She has published
research papers in various journals of good repute.
She is a life member of Indian Science Congress Association
and ISTE. She has also reviewed papers from various journals
like Current Bioinformatics, Computers in Biology and
Medicine.