
Comparison of MLR, Isotonic Regression and

KNN Based QSAR Models for the Prediction of

Inhibitory Activity of HDAC6 Inhibitors

Sandhya Vijayasarathy and Jhinuk Chatterjee Department of Biotechnology, PES Institute of Technology, Bangalore, Karnataka, India

Email: [email protected]

Abstract—Histone deacetylase 6 (HDAC6), a member of

the class II HDACs, is considered a drug target for cancer because of its major role in oncogenic cell transformation. The aim of this study was to develop Quantitative Structure Activity Relationship (QSAR) models and evaluate their performance in predicting the inhibitory activity of various HDAC6 inhibitors. To achieve this, a dataset of forty HDAC6 inhibitors was collected from the PubChem database and subjected to descriptor calculation, data pre-processing, selection of relevant descriptors and QSAR modeling, followed by evaluation of the developed models using various statistical parameters. The best results were obtained for the k nearest neighbor based QSAR model, with correlation coefficient (r) = 0.98, squared correlation coefficient (r²) = 0.96, Mean Absolute Error (MAE) = 0.06 and Root Mean Squared Error (RMSE) = 0.15, indicating that it was the best of the three methods for predicting the inhibitory activity of HDAC6 inhibitors. Further, cross-validation procedures need to be performed and the model must be tested against a larger dataset to authenticate its predictive accuracy.

Index Terms—QSAR, histone deacetylase, multiple linear

regression, isotonic regression, k nearest neighbor

I. INTRODUCTION

Histone Deacetylases (HDACs) are a promising and

novel class of anti-cancer drug targets. HDACs are

grouped into three classes based on their homology to

yeast HDACs. The class I HDACs consists of HDAC1,

HDAC2, HDAC3 and HDAC8. Class II HDACs include

HDAC4, HDAC5, HDAC6, HDAC7, HDAC9, HDAC10

and class III HDACs are similar to the NAD+ dependent

ySIR2. Histone deacetylases are mainly involved in the

deacetylation of histones. However, some HDACs like

HDAC6 can also affect the function of cytoplasmic non-

histone proteins.

Over-expression of HDAC6 is associated with

tumorigenesis and improved survival. Hence, HDAC6

can be used as a marker for prognosis. Studies have

demonstrated that in multiple myeloma cells, HDAC6

inhibition results in apoptosis. Moreover, HDAC6 is

essential for the activation of heat-shock factor 1 (HSF1),

an activator of heat-shock protein (HSP) and a

cylindromatosis tumor suppressor gene (CYLD). HDAC6 also affects transcription and translation by regulating HSP90 and stress granules. Furthermore, HDAC6 contributes to metastasis since its upregulation escalates cell motility in breast cancer MCF-7 cells [1].

Manuscript received January 12, 2015; revised April 7, 2015

HDAC inhibitors have been evaluated in clinical trials and have shown activity against several cancers. However, these inhibitors act unselectively against several or all HDAC family members. As a result, various side effects were observed in phase I clinical trials. Thus, identifying the cancer-relevant HDAC family member in a given tumour type and designing a selective inhibitor that targets only cancerous cells remain a challenge [2]. As HDACs are over-expressed in tumour cells, their inhibition can be useful in inducing growth arrest or specifically promoting apoptosis. For example, HDAC6 deacetylates alpha-tubulin and increases cell motility. It is also upregulated in oral squamous cell carcinoma, and its expression is known to increase in the advanced stages of the cancer [3]. Thus, selective inhibition of HDAC6 can be a promising approach for the treatment of oral cancer.

Finding new drugs by experimental screening of a large number of compounds is a complex, expensive and time-consuming process. Computational models such as Quantitative Structure Activity Relationship (QSAR) models can therefore be a reliable alternative. Significant insights into the rational design of novel and potent compounds can be gained by studying the relationships between the structure and biological activity of a series of compounds through reliable and robust QSAR models.

A QSAR model is a mathematical model that relates the structure-derived features of a compound to its biological activity. It works on the assumption that structurally similar compounds have similar activities [4]. QSAR models can be used to filter drug candidates before subjecting them to in vitro and in vivo studies. These models can be developed through the application of various supervised and/or unsupervised statistical and machine learning techniques [5]. Apart from predicting the unknown inhibitory activity of various compounds, QSAR models can be used for classifying different classes of compounds, for lead compound optimization, and for predicting the biological activity, toxicity and various physicochemical properties of molecules.

International Journal of Life Sciences Biotechnology and Pharma Research Vol. 4, No. 2, April 2015

©2015 Int. J. Life Sci. Biotech. Pharm. Res. 127

In this study, we have developed Multiple Linear

Regression, Isotonic regression and k-nearest neighbor

(IBk) based QSAR models for predicting the inhibitory

activity of various HDAC6 inhibitors and their

performance was assessed using various statistical

parameters.

II. MATERIALS AND METHODS

The various steps involved in QSAR analysis are shown in Fig. 1.

Figure 1. Flowchart representing the various steps involved in QSAR

analysis.

A. Preparation of Dataset

A dataset of forty HDAC6 inhibitors with known IC50 values was collected from the PubChem BioAssay database [6]. The structures were downloaded in 3D Structure Data Format (SDF). The biological activity in terms of IC50 (defined as the concentration of an inhibitor required for 50% inhibition of its target) was used in the study. The activities of these inhibitors range from 0.661 µM to 2 µM. In addition, structural diversity was checked by clustering with the PubChem structure clustering tool.

All IC50 values were converted to their negative logarithm to obtain pIC50 values (i.e., pIC50 = -log10 IC50), which served as the dependent variable in the model.
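The conversion can be sketched in a few lines of Python. The paper does not state which concentration unit was passed to the logarithm; the sketch below assumes the common convention of converting micromolar values to mol/L first, and the function name is hypothetical. If the µM values were used directly, every pIC50 would simply be shifted by 6.

```python
import math

def pic50_from_micromolar(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_um * 1e-6  # assumption: convert µM to mol/L before taking the log
    return -math.log10(ic50_molar)

# The two ends of the activity range reported for the dataset.
for ic50 in (0.661, 2.0):
    print(f"IC50 = {ic50} µM -> pIC50 = {pic50_from_micromolar(ic50):.2f}")
```

Because the transformation is monotonic and decreasing, a more potent inhibitor (lower IC50) receives a higher pIC50.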

B. Calculation of Descriptors

The structural information of compounds can be

analyzed using molecular descriptors. Molecular

descriptors describe the properties of molecules and are

represented by numerical values [7]. In this study, 811

types of descriptors were calculated using E-Dragon 1.0

software [8]. These include topological descriptors, constitutional descriptors, walk and path counts, 2D autocorrelations, connectivity indices, information indices, edge adjacency indices, topological charge indices, Burden eigenvalues, functional group counts, molecular properties and eigenvalue-based indices [9].

Further, the dataset was randomized and divided into a training set (30 compounds) and a test set (10 compounds).
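A randomized 30/10 split of this kind might look as follows. This is only an illustrative sketch: the compound identifiers, the function name and the fixed seed are assumptions, and the paper does not report how the randomization was performed.

```python
import random

def split_dataset(compound_ids, n_train=30, seed=42):
    """Shuffle the compound identifiers and return (training set, test set)."""
    ids = list(compound_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

compounds = [f"cpd_{i:02d}" for i in range(1, 41)]  # hypothetical identifiers
train_ids, test_ids = split_dataset(compounds)
print(len(train_ids), len(test_ids))
```

Recording the seed lets the same partition be regenerated when models are compared later.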

C. Selection of Descriptors

In QSAR modeling, descriptors play an important role, and selecting highly significant descriptors is necessary for building an efficient QSAR model. To achieve this, descriptors with invariant values were removed, and the CfsSubsetEval module implemented in the Waikato Environment for Knowledge Analysis (WEKA 3.6.8), a data mining tool [10], was applied to the training set to remove highly correlated attributes (data pre-processing).

The CfsSubsetEval module, together with the best first search method, finds the best descriptors by considering the predictive ability of each descriptor. It selects descriptors that correlate highly with the class and weakly with one another. The search may start with the full set of attributes (i.e., descriptors) and search backward, start with an empty set of attributes and search forward, or start at any point and search in both directions by considering all possible single-attribute additions and deletions at a given point.
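The idea behind correlation-based subset selection can be illustrated with a small sketch. It uses the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), with absolute Pearson correlations and a plain greedy forward search. These are simplifying assumptions: WEKA's CfsSubsetEval and its best first search differ in detail, and the toy descriptor columns are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def cfs_merit(subset, columns, target):
    """Merit = k * mean|r_cf| / sqrt(k + k(k-1) * mean|r_ff|)."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[f], columns[g])) for f, g in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(columns, target):
    """Greedy forward search: keep adding the descriptor that raises the merit most."""
    # Descriptors whose value never varies carry no information and are dropped first.
    candidates = [name for name, col in columns.items() if len(set(col)) > 1]
    selected, best = [], 0.0
    while candidates:
        best_feat, best_merit = None, best
        for name in candidates:
            merit = cfs_merit(selected + [name], columns, target)
            if merit > best_merit:
                best_feat, best_merit = name, merit
        if best_feat is None:  # no candidate improves the merit any further
            break
        selected.append(best_feat)
        candidates.remove(best_feat)
        best = best_merit
    return selected, best

# Toy data (hypothetical): d2 duplicates d1, d3 is constant, d4 is uncorrelated with d1.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "d3": [7.0, 7.0, 7.0, 7.0, 7.0],
    "d4": [1.0, -1.0, 1.0, -1.0, 1.0],
}
target = [3.0, 0.0, 5.0, 2.0, 7.0]
selected, merit = forward_select(columns, target)
print(selected, round(merit, 3))
```

In this toy case the constant column is discarded up front, and only one of the two perfectly correlated columns survives, because adding the second raises the inter-correlation term without adding predictive information.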

D. QSAR Model Construction

The QSAR models were then constructed using three

machine learning algorithms namely Multiple Linear

Regression (MLR), Isotonic Regression and k Nearest

Neighbor (IBk) implemented in WEKA 3.6.8.

In WEKA, the MLR method can handle weighted instances and uses the Akaike criterion for model selection. This technique establishes a relationship between a dependent variable and multiple independent variables. The MLR equation represents the variation of biological/chemical properties as a function of the variations of the molecular substituents present in the molecular data [11].

In isotonic regression, the model is learned by picking the single attribute that results in the lowest squared error.
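A minimal sketch of this idea is given below, using the pool-adjacent-violators algorithm to fit a non-decreasing function of one descriptor and then choosing the descriptor with the lowest squared error. It is an illustration only and not WEKA's implementation: it ignores ties in the descriptor values, and all names and toy data are hypothetical.

```python
def pava(y):
    """Pool Adjacent Violators: least-squares non-decreasing fit to the sequence y."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the current block mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def isotonic_sse(x, y):
    """Squared error of the isotonic fit of y against a single descriptor x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    y_sorted = [y[i] for i in order]
    return sum((f - t) ** 2 for f, t in zip(pava(y_sorted), y_sorted))

def best_attribute(columns, y):
    """Return the name of the descriptor whose isotonic fit has the lowest squared error."""
    return min(columns, key=lambda name: isotonic_sse(columns[name], y))

# Toy example with two hypothetical descriptors.
columns = {"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 1.0, 3.0, 2.0]}
activity = [1.0, 3.0, 2.0, 4.0]
print(pava(activity))
print(best_attribute(columns, activity))
```

Because the fitted function depends on one descriptor only, an isotonic model cannot combine information from several descriptors the way MLR or kNN can.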

The k-nearest neighbor method (IBk) can select a suitable value of k based on cross-validation and can also apply distance weighting. It uses a simple distance measure to find the training instances closest to the given test instance and predicts from them; with k = 1, it predicts the same value as the closest training instance. If multiple instances have the same (smallest) distance to the test instance, the first one found is used [12], [13].
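The nearest neighbor prediction step can be sketched as follows. The sketch uses unweighted, unnormalized Euclidean distance and averages the k nearest targets; attribute normalization, distance weighting and the cross-validated choice of k are omitted, so this is a simplified illustration with hypothetical names and data rather than the IBk implementation.

```python
import math

def knn_predict(train_X, train_y, query, k=1):
    """Predict by averaging the targets of the k nearest training instances."""
    distances = [
        (math.dist(row, query), index)  # the index breaks ties in favour of the first instance found
        for index, row in enumerate(train_X)
    ]
    distances.sort()
    nearest = [index for _, index in distances[:k]]
    return sum(train_y[i] for i in nearest) / k

# Toy training data (hypothetical descriptor vectors and pIC50 values).
train_X = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [3.0, 3.0]]
train_y = [5.0, 6.0, 7.0, 8.0]
print(knn_predict(train_X, train_y, [1.0, 1.0], k=1))
print(knn_predict(train_X, train_y, [1.0, 1.0], k=2))
```

With k = 1 the two tied instances at distance zero illustrate the "first one found" rule: the prediction comes from the earlier of the two.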

In all the cases, the model was built on the training set

of 30 compounds and was evaluated against the test set of

10 compounds. The predicted activity was then compared

with the actual values to analyze the predictive behavior

and accuracy of the generated model.

In addition, a scatter plot was plotted for MLR based

model between actual and predicted pIC50 values to study

their linear association.

E. Evaluation of Models

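The fitness of the models developed in this study was assessed using the following statistical parameters:

Correlation coefficient:

$$ r = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{N}}{\sqrt{\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}\right)\left(\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{N}\right)}} \quad (1) $$

Mean Absolute Error:

$$ \mathrm{MAE} = \frac{\sum \left| y_i - x_i \right|}{N} \quad (2) $$

Root Mean Square Error:

$$ \mathrm{RMSE} = \sqrt{\frac{\sum \left( y_i - x_i \right)^2}{N}} \quad (3) $$

where x_i and y_i denote the actual and predicted pIC50 values for the i-th compound, and N is the number of compounds [14].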
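The three statistics in (1) to (3) can be computed directly, as in the sketch below. The function names and the sample values are hypothetical and are not taken from the paper's dataset; MAE is computed from absolute differences, as its name implies.

```python
import math

def correlation_coefficient(actual, predicted):
    """Pearson r in the computational form of (1)."""
    n = len(actual)
    sum_x, sum_y = sum(actual), sum(predicted)
    sum_xy = sum(x * y for x, y in zip(actual, predicted))
    sum_x2 = sum(x * x for x in actual)
    sum_y2 = sum(y * y for y in predicted)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

def mean_absolute_error(actual, predicted):
    """Average absolute difference between predicted and actual values, as in (2)."""
    return sum(abs(y - x) for x, y in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    """Square root of the average squared difference, as in (3)."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(actual, predicted)) / len(actual))

# Hypothetical actual and predicted pIC50 values for four compounds.
actual = [6.0, 5.8, 6.1, 5.7]
predicted = [5.9, 5.9, 6.0, 5.8]
r = correlation_coefficient(actual, predicted)
print(round(r, 3), round(r ** 2, 3))
print(round(mean_absolute_error(actual, predicted), 3))
print(round(root_mean_square_error(actual, predicted), 3))
```

The squared correlation coefficient reported alongside r is simply r multiplied by itself.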

III. RESULTS

TABLE I. LIST OF DESCRIPTORS OBTAINED AFTER ATTRIBUTE SELECTION PROCESS

| Descriptor | Description |
|---|---|
| X0Av | Connectivity index |
| BIC0 | Information index |
| JGI3 | 2D autocorrelations |
| nArCNO | Number of oximes (aromatic) |
| nArNR2 | Number of tertiary amines (aromatic) |
| Hypertens-80 | Drug-like index |

The equation obtained for MLR is given below:

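$$ \mathrm{pIC}_{50} = -5.3313 \times \text{X0Av} - 13.3235 \times \text{BIC0} + 7.3719 \times \text{JGI3} - 1.0443 \times \text{nArCNO} + 1.4028 \times \text{nArNR2} + 0.7771 \times \text{Hypertens-80} + 13.5977 \quad (4) $$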
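To show how (4) would be applied to a new compound, the sketch below evaluates the equation with the coefficients reported above. The descriptor values in the example are invented purely for illustration and do not correspond to any compound in the dataset; the function and variable names are likewise hypothetical.

```python
# Coefficients and intercept as reported in (4).
COEFFICIENTS = {
    "X0Av": -5.3313,
    "BIC0": -13.3235,
    "JGI3": 7.3719,
    "nArCNO": -1.0443,
    "nArNR2": 1.4028,
    "Hypertens-80": 0.7771,
}
INTERCEPT = 13.5977

def predict_pic50(descriptors):
    """Evaluate the MLR equation for one compound given its six descriptor values."""
    return INTERCEPT + sum(coef * descriptors[name] for name, coef in COEFFICIENTS.items())

# Invented descriptor values, for illustration only.
example = {"X0Av": 0.6, "BIC0": 0.4, "JGI3": 0.05, "nArCNO": 0, "nArNR2": 1, "Hypertens-80": 1}
print(round(predict_pic50(example), 3))
```

The signs of the coefficients indicate the direction in which each descriptor shifts the predicted pIC50 within this model.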

Figure 2. Multiple linear regression result in WEKA 3.6.8.

Figure 3. Isotonic regression result in WEKA 3.6.8.

Figure 4. KNN (IBk) result in WEKA 3.6.8.

Figure 5. Scatter plot between observed pIC50 and predicted pIC50 of MLR based model.

TABLE II. PERFORMANCE OF DIFFERENT REGRESSION ALGORITHMS

| Algorithm | r | r² | MAE | RMSE |
|---|---|---|---|---|
| MLR | 0.86 | 0.74 | 0.32 | 0.39 |
| Isotonic regression | 0.74 | 0.55 | 0.4 | 0.5 |
| k nearest neighbor (IBk) | 0.98 | 0.96 | 0.06 | 0.15 |

IV. DISCUSSION

Quantitative structure activity relationship plays an

imperative role in unearthing new potent chemical

entities. In this study, QSAR models have been

constructed for prediction of pIC50 value for a series of

HDAC6 inhibitors. The flowchart representing the

various steps involved in QSAR analysis is given in Fig.

1.

After collecting HDAC6 inhibitors from the PubChem database, the PubChem Structure Clustering tool was used to examine the structural similarity and diversity of the collected compounds.

Descriptor-computing software calculates a large number of descriptors, so it is necessary to reduce them by eliminating duplicate and highly correlated descriptors in order to narrow them down to the best performing and most representative set. After calculating descriptors with E-Dragon 1.0, highly correlated attributes were removed using WEKA's attribute selection tool, in which the CfsSubsetEval attribute evaluator and the best first search method were employed. This procedure resulted in 6 descriptors (Table I).

Further, the MLR, isotonic regression and kNN models were developed using the Linear Regression function, the Isotonic Regression function and IBk, respectively, in WEKA 3.6.8. In all cases, the QSAR model was built on the training set and evaluated against the test set. The predicted activity was then compared with the actual value. The results of the MLR, isotonic regression and kNN based models for the training set are shown in Fig. 2, Fig. 3 and Fig. 4, respectively.

The goodness of fit of the developed models was assessed by the statistical parameters given in (1), (2) and (3). The performance of the different regression algorithms is given in Table II.

The equation obtained from multiple linear regression analysis is given in (4). The squared correlation coefficient of 0.74 for the MLR based model indicates that 74% of the variation in the dependent variable is explained by the independent variables. In addition, a scatter plot of the predicted against the observed pIC50 values was plotted for the MLR model (Fig. 5). The predicted pIC50 shows a linear association with the observed pIC50, and the fit of the data to the regression line is good. From the graph, it can be inferred that the selected descriptors contribute positively towards the prediction of inhibitory activity.

From Table II, it is evident that the k nearest neighbor (IBk) model performs better than the other two learning techniques, with r = 0.98, r² = 0.96, MAE = 0.06 and RMSE = 0.15. The r² value of 0.96 indicates that 96% of the variation in pIC50 is explained by the descriptors. RMSE is a measure of accuracy, and smaller values are better. MAE is a statistical measure of how far an estimate is from the actual value; it can never exceed RMSE, and smaller values are likewise better.

V. CONCLUSION

QSAR plays a key role in drug discovery and

development. Selection of an appropriate QSAR

modeling method plays a major role in prediction.

Integrated use of various machine learning techniques

and use of associated validation metrics allows

development of highly predictive QSAR models.

Building an efficient QSAR model would be of great

benefit to drug discovery community engaged in lead

candidate generation, leading to cost and time-saving

effect on the research studies.

In this study, the inhibitory activity of HDAC6 inhibitors was modelled by three different machine learning techniques, namely Multiple Linear Regression, Isotonic Regression and k nearest neighbor (IBk). From the results, it was clear that the k nearest neighbor (IBk) model performed better than the other two models. Therefore, the k nearest neighbor (IBk) based QSAR model can be considered a promising approach for predicting the inhibitory activity of HDAC6 inhibitors, which in turn can be used for designing new HDAC6 inhibitors or for improving the bioactivity of existing ones. Further, internal and external cross-validation procedures need to be performed and the model must be tested against a larger dataset to authenticate its predictive accuracy.

ACKNOWLEDGMENT

The authors would like to thank Dr. Roshan Makam,

head of Department of Biotechnology, PES Institute of

Technology, Bangalore for his support.

REFERENCES

[1] G. I. Aldana-Masangkay and K. M. Sakamoto, “The Role

of HDAC6 in Cancer (Review),” J. Biomed. Biotechnol.,

vol. 2011, pp. 10, 2011.

[2] O. Witt, H. E. Deubzer, T. Milde, and I. Oehme, “HDAC

family: What are the cancer relevant targets?” Cancer Lett.,

vol. 277, pp. 8-21, 2009.

[3] T. Sakuma, K. Uzawa, and T. Onda, “Aberrant expression

of histone deacetylase 6 in oral squamous cell carcinoma,”

Int. J. Oncol., vol. 29, pp. 117-124, 2006.

[4] T. Puzyn, J. Leszczynski, and M. T. Cronin, Recent

Advances in QSAR Studies Methods and Applications

(Challenges and Advances in Computational Chemistry

and Physics), Dordrecht Heidelberg London, New York:

Springer, 2010.

[5] C. Ventura, D. A. Latino, and F. Martins, “Comparison of

multiple linear regressions and neural networks based

QSAR models for the design of new antitubercular

compounds,” Eur. J. Med. Chem., vol. 70, pp. 831-845,

2013.

[6] Y. Wang, J. Xiao, and T. O. Suzek, “Pubchem’s bioassay

database,” Nucleic Acids Res, vol. 40, pp. 400-412, 2012.

[7] A. Leach and V. Gillet, An Introduction to

Chemoinformatics, India: Springer, 2009.

[8] I. V. Tetko, J. Gasteiger, and R. Todeschini, “Virtual

computational chemistry laboratory - design and

description,” J. Comput. Aid. Mol. Des., vol. 19, pp. 453-

463, 2005.

[9] A. Garg, R. Tewari, and G. P. Raghava, “KiDoQ: Using

docking based energy scores to develop ligand based

model for predicting antibacterials,” BMC Bioinformatics,

vol. 11, pp. 125, 2010.

[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P.

Reutemann, and I. H. Witten, “The WEKA data mining

software: An update,” SIGKDD Explor. Newslett., vol. 11,

pp. 10-18, 2009.

[11] C. Nantasenamat, C. Isarankura-Na-Ayudhya, T. Naenna,

and V. Prachayasittikul, “A practical overview of

quantitative structure-activity relationship,” EXCLI, vol. 8,

pp. 74-88, 2009.

[12] S. M. Mwagha, M. Muthoni, and P. Ochieg, “Comparison

of nearest neighbor (ibk), regression by discretization and

isotonic regression classification algorithms for

precipitation classes prediction,” International Journal of

Computer Applications, vol. 96, pp. 44-48, 2014.

[13] D. Aha and D. Kibler, “Instance-based learning

algorithms,” Machine Learning, vol. 6, pp. 37-66, 1991.

[14] D. Singla, M. Anurag, D. Dash, and G. P. Raghava, “A

web server for predicting inhibitors against bacterial target

GlmU protein,” BMC Pharmacol., vol. 11, pp. 5-14, 2011.



Sandhya Vijayasarathy was born in Bangalore (India) on 6th

July, 1988. She is a full-time research scholar, pursuing Ph.D. in

biotechnology at PES Institute of Technology, Bangalore

(Visvesvaraya Technological University). She holds a

bachelor’s degree (B.E) in biotechnology and a master’s degree

(M.Tech) in bioinformatics. She is an enthusiastic researcher.

Her major areas of research include computational biology,

cheminformatics, data mining, structural bioinformatics and

machine learning.

She has published research papers in various international

journals such as International Journal of Pharmaceutical

Sciences and Research, Asian Pacific Journal of Cancer

Prevention and International Journal of Biomedical

Engineering and Health Informatics in addition to oral and

poster presentations at various national and international

conferences.

She is also an associate member of “The Institution of

Engineers”, India.

Jhinuk Chatterjee was born in Kalyani (West Bengal, India)

on 11th January, 1973. At present, she is working as an assistant

professor in PES Institute of Technology, Bangalore (India).

She holds Ph.D. in biotechnology (Department of Life Sciences

& Biotechnology, Jadavpur University, Kolkata, India). Her

major area of research is computational biology, systems

biology, cellular imaging and computational immunology.

University merit scholarship and academic excellence award

were awarded to her during graduation and post-graduation

studies. She also worked as full time post-doctoral researcher in

Indian Institute of Science (Bangalore, India). She has published

research papers in various journals of good repute.

She is a life member of Indian Science Congress Association

and ISTE. She has also reviewed papers from various journals

like Current Bioinformatics, Computers in Biology and

Medicine.


