
GLOBAL JOURNAL OF PHYSICAL CHEMISTRY


Global J. Phys. Chem. 2012, 3: 13 www.simplex-academic-publishers.com

© 2012 Simplex Academic Publishers. All rights reserved.

Quantitative structure–property relationship studies for predicting gas to carbon

tetrachloride solvation enthalpy based on partial least squares,

artificial neural network and support vector machine

Zahra Dashtbozorgi a,*, Hassan Golmohammadi b, William E. Acree, Jr. c

a Young Researchers Club, Science and Research Branch, Islamic Azad University, Tehran, Iran

b Department of Chemistry, Mazandaran University, P. O. Box 453, Babolsar, Iran

c Department of Chemistry, P. O. Box 305070, University of North Texas, Denton, TX 76203-5070, USA

*Author for correspondence: Zahra Dashtbozorgi, email: [email protected]

Received 17 Jan 2012; Accepted 20 Feb 2012; Available Online 20 Feb 2012

Abstract

In the present work, partial least squares (PLS), artificial neural network (ANN) and support vector machine (SVM) techniques were used for quantitative structure–property relationship (QSPR) studies of the gas to carbon tetrachloride solvation enthalpy (ΔHsolv) of various organic compounds, based on molecular descriptors calculated from the optimized structures. Different kinds of molecular descriptors, such as constitutional, topological, charge, and geometric descriptors, were calculated to characterize the molecular structures of the compounds. The genetic algorithm-partial least squares (GA-PLS) variable selection method was employed to select the most favorable subset of descriptors. The five descriptors selected using GA-PLS were used as inputs of the ANN and SVM to predict the gas to carbon tetrachloride solvation enthalpy. The correlation coefficients, R, between experimental and predicted solvation enthalpy for the test set by PLS, ANN and SVM are 0.922, 0.985 and 0.990, respectively. These satisfactory results indicate that the GA-PLS approach is a very effective method for variable selection and that the predictive ability of the SVM model is superior to that of the PLS and ANN models. The results demonstrate that SVM can be used as a powerful alternative modeling tool for QSPR studies.

Keywords: Gas to carbon tetrachloride solvation enthalpy; Quantitative structure–property relationship; Partial least squares; Artificial neural

network; Support vector machine

1. Introduction

The enthalpy of solvation of any species is defined as the heat gained or lost when the species is transferred from the gas phase into solution. The enthalpy of solvation is an important complement to the free energy of solvation because it provides additional information that helps to understand and interpret the physics of the solvation process. It is important to note that the thermodynamic properties related to a solvation process do not depend entirely, even for very dilute solutions, on solute/solvent interactions, because the addition of a solute entails the formation of a cavity of sufficient size to hold the solute molecule.

The solvation process can conveniently be decomposed into three steps: (1) formation of a cavity in the solvent; (2) van der Waals interactions; and (3) electrostatic contributions [1-3]. The first step is the creation of a cavity in the solvent that is large enough to accommodate the solute. Because this involves breaking the forces that maintain cohesion within the solvent, the free energy contribution of cavitation is unfavorable. In contrast, the van der Waals contribution is favorable, since the solute cavity is created in regions of the solvent where the dispersion term is larger than the repulsion term. The third step involves two mechanisms: the work necessary to create the gas-phase charge distribution of the solute in solution, and the work required for the solvent to polarize this charge distribution.

From a thermodynamic standpoint, the gas-to-condensed phase partition coefficient, K, can be estimated by [4]:

(1)

at other temperatures from measured partition coefficient data at 298.15 K and the solute’s enthalpy of solvation, ΔHSol, or enthalpy of transfer, ΔHtrans, between the two condensed phases. The enthalpy of transfer needed in Eq. 1 (for K = P, where P is the water to organic solvent partition coefficient) is defined as

ΔHtrans = ΔHSol,Org − ΔHSol,W (2)

that is, the enthalpy of solvation of the solute in the specified organic solvent minus its enthalpy of solvation in water. The above equations assume zero heat capacity changes. The gas-to-organic solvent enthalpy of solvation is obtained from the enthalpy of solution, ΔHSoln, by subtracting the solute’s standard molar enthalpy of vaporization [5], ΔHVap,298K, or standard molar enthalpy of sublimation [6], ΔHSub,298K, at 298.15 K:

Liquid solutes: ΔHSolv = ΔHSoln − ΔHVap,298K (3)

Crystalline solutes: ΔHSolv = ΔHSoln − ΔHSub,298K (4)
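The temperature correction and the enthalpy bookkeeping above can be illustrated with a short Python sketch. It assumes that the temperature dependence of log K follows the integrated van 't Hoff relation with a temperature-independent enthalpy, which is consistent with the zero heat capacity assumption stated in the text but is not taken from the original Eq. 1. The function names and the numbers used in any call are hypothetical.

```python
import math

R_GAS = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def log_k_at_temperature(log_k_ref, delta_h_kj_mol, temperature_k, t_ref=298.15):
    """Extrapolate log10 K from t_ref to temperature_k.

    Assumed form (integrated van 't Hoff, constant enthalpy):
    log K(T) = log K(T_ref) - dH / (2.303 R) * (1/T - 1/T_ref)
    """
    slope = delta_h_kj_mol / (math.log(10) * R_GAS)
    return log_k_ref - slope * (1.0 / temperature_k - 1.0 / t_ref)


def solvation_enthalpy(delta_h_soln, delta_h_phase_change):
    """Eqs. 3 and 4: subtract the vaporization (liquid solute) or
    sublimation (crystalline solute) enthalpy from the enthalpy of solution."""
    return delta_h_soln - delta_h_phase_change
```

For an exothermic solvation enthalpy the sketch gives a smaller log K at temperatures above 298.15 K, as expected for gas-to-solvent transfer.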

Physical and thermodynamic property data of organic

compounds such as solvation enthalpy are significant in the


engineering design and operation of industrial chemical processes. Since the experimental determination of solvation enthalpy is time-consuming and expensive, and there is an increasing demand for reliable physical and thermodynamic data for the optimization of chemical processes, it would be very useful to develop predictive models for these properties of organic compounds that have not yet been synthesized or whose properties are unknown. The quantitative structure-property relationship (QSPR) approach provides a promising method for estimating the solvation enthalpy of organic compounds from descriptors that are derived exclusively from the molecular structure and fitted to experimental data [7]. The QSPR approach has become very practical in the prediction of physical and chemical properties [8]. The support vector machine (SVM) was developed within the machine learning community by Vapnik and co-workers in 1995 [9,10]. It is an algorithm for regression and classification that has performed well in several successful classification applications [11–17]. In recent years, SVM has also shown strong performance in QSPR studies owing to its ability to capture the nonlinear relationships between molecular structure and properties [18–27].

In the present work, SVM was used for the first time to predict the gas to carbon tetrachloride solvation enthalpy of various organic compounds. The aim was to establish a QSPR model that could predict the solvation enthalpy of organic compounds from their molecular structures alone, to show the flexible modeling ability of SVM and, at the same time, to identify the important structural features related to the solvation enthalpy. PLS and ANN methods were also used to establish quantitative linear and nonlinear relationships for comparison with the results obtained by SVM.

2. Methodology

2.1. Data set

The data set of gas to carbon tetrachloride solvation enthalpies was taken from the values reported by Mintz et al. [28]. The molecules in the data set, which include alkanes, alkenes, alkyl halides, alcohols, phenols, ethers, esters, ketones, aldehydes, amines, anilines, nitriles, nitro compounds, polycyclic hydrocarbons, heterocyclic compounds and benzene derivatives, are summarized in Table 1 (see Appendix). The solvation enthalpies of all molecules in the data set were obtained under the same conditions and refer to a temperature of 298 K. They range from -3.01 kJ/mol for methane to -100.80 kJ/mol for 18-crown-6 ether. The entire data set was randomly divided into two subsets: a training set of 119 compounds and a test set of 50 compounds. The training set was used to build the models and the test set was used to evaluate the predictive power of the resulting models. Leave-one-out (LOO) cross-validation was performed to evaluate the modeling ability of the model. In leave-one-out, each sample in the data set is in turn singled out as a test sample and the remaining samples are used to train the model.
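The random split and the leave-one-out procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: the one-descriptor least-squares line stands in for the actual PLS, ANN or SVM model, and the function names and the seed are assumptions made for the example.

```python
import random


def random_split(items, n_test, seed=0):
    """Shuffle a copy of items and return (training_set, test_set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total.

    xs and ys must be lists, because each sample is removed by slicing."""
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)
```

Calling `random_split(range(169), 50)` reproduces the sizes used in this work (119 training and 50 test compounds), although not the actual assignment of compounds.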

2.2. Molecular descriptors generation

Because of the diversity of the molecules studied, different kinds of descriptors were calculated. The molecular descriptors were calculated as follows: molecules were drawn with the Hyperchem package (Version 7) [29] and then pre-optimized using the MM+ molecular mechanics force field. A more precise optimization was then done with the semiempirical PM6 method in Mopac [30]. To ensure that the structure with the optimum geometry was obtained, the optimization was repeated many times with different starting geometries. The optimization was carried out with the Polak–Ribière algorithm until a root mean square gradient of 0.01 was reached. As a next step, the Hyperchem and Mopac output files were used by the CODESSA program (Version 2.7.2) [31]. This software can calculate more than 500 different descriptors on the basis of molecular structural information [32,33]. Since CODESSA is not able to calculate some more recently developed 3D descriptors such as 3D-MoRSE, GETAWAY and WHIM, these were computed using the DRAGON software [34].

2.3. GA-PLS based variable selection

The strategy implemented for genetic algorithm-based variable selection in the framework of PLS regression follows the steps detailed in [35]. GA-PLS is a refined hybrid approach that combines GA [36], a heuristic global optimization method, with PLS [37], a robust statistical method, for variable selection. In GA-PLS, a chromosome and its fitness correspond to a set of variables and the internal predictivity of the derived PLS model, respectively [38].

In QSPR studies, it is essential to obtain a model containing as few variables as possible, because this leads to a simple and interpretable model. Therefore, the quality of a chromosome is determined by both the internal predictivity it gives and the number of variables it uses. To enhance the quality of the chromosomes in the population, an extra rule is added to GA-PLS following the idea of Leardi et al. [39]: the best chromosome using a given number of variables is protected unless a chromosome with a lower number of variables gives better internal predictivity.

In this paper, GA-PLS followed Leardi's method. The values of the empirical parameters affecting the performance of GA-PLS are defined in Table 2. Because each GA run gives a slightly different model, each run was repeated at least five times to verify the robustness of the predictive ability and the importance of the selected model.
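The chromosome and fitness idea can be sketched with a small genetic algorithm in Python. This is a simplified illustration and not Leardi's algorithm: it uses steady-state replacement of the worst chromosome, omits the protection rule for chromosomes with fewer variables, and takes the fitness as any user-supplied function. In GA-PLS that function would be the cross-validated explained variance of a PLS model built on the selected descriptors. The default parameter values follow Table 2; the function names, the sparse initialization and the toy fitness are assumptions.

```python
import random


def ga_select(n_variables, fitness, pop_size=30, p_mutation=0.01,
              p_crossover=0.5, n_evaluations=200, max_selected=30, seed=0):
    """Minimal GA over binary chromosomes (1 = descriptor selected).

    Returns (best_fitness, best_chromosome)."""
    rng = random.Random(seed)

    def valid(chromosome):
        return 0 < sum(chromosome) <= max_selected

    # Sparse random start (assumption): about 10 % of the genes switched on.
    population = []
    while len(population) < pop_size:
        candidate = tuple(1 if rng.random() < 0.1 else 0 for _ in range(n_variables))
        if valid(candidate):
            population.append((fitness(candidate), candidate))

    for _ in range(n_evaluations):
        parent_a, parent_b = rng.sample(population, 2)
        # Uniform crossover: each gene comes from parent_b with p_crossover.
        child = [gb if rng.random() < p_crossover else ga
                 for ga, gb in zip(parent_a[1], parent_b[1])]
        # Mutation: each gene flips with probability p_mutation.
        child = tuple(1 - g if rng.random() < p_mutation else g for g in child)
        if not valid(child):
            continue
        score = fitness(child)
        worst = min(range(pop_size), key=lambda i: population[i][0])
        if score > population[worst][0]:
            population[worst] = (score, child)
    return max(population)


def toy_fitness(chromosome):
    """Hypothetical fitness: rewards variables 0 and 3, penalizes model size."""
    return chromosome[0] + chromosome[3] - 0.1 * sum(chromosome)
```

A call such as `ga_select(10, toy_fitness)` returns the best score and the corresponding 0/1 selection vector; repeating it with different seeds mirrors the repeated runs mentioned in the text.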

2.4. Partial Least Squares (PLS)

PLS regression is a modern technique that generalizes

and combines features from principal component analysis and

multiple regression. The PLS method takes into account

information of dependent variables during the decomposition

of the independent variables data matrix. Suppose that X

represents independent variables (X is a matrix) and Y

represents dependent variables (Y is a vector). Then a brief

description of computations is given as follows:

Y=XB+E (5)

where B is the matrix of PLS regression coefficients and E is the matrix of the residuals. In this regression application, Y contains the solvation enthalpies of the training compounds. The estimate of B can be obtained through the generalized inverse of X (X⁺) provided by the PLS algorithm:

$$B = X^{+}Y = W(P'W)^{-1}Q \tag{6}$$

where W is the matrix of weights of the X-space, Q is the loading matrix for the Y-space, and P is the X-space loading matrix.

The PLS algorithm used in this investigation was the singular value decomposition (SVD)-based PLS. This algorithm was proposed by Lorber et al. [40]. A concise discussion of the SVD-based PLS algorithm can be found in the literature [41–43]. The SVD-based PLS program was written in MATLAB 7 in our laboratory [44].
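To make Eq. 6 concrete, the following Python sketch computes the regression vector for a PLS model with a single latent variable and a single response. It uses the classical weight, score and loading steps rather than the SVD-based algorithm used in this work, so it should be read as an illustration of B = W(P′W)⁻¹Q only. The function name and the mean-centering step are assumptions.

```python
def pls1_one_component(X, y):
    """One-latent-variable PLS1 on mean-centred data.

    X is a list of rows, y a list of responses. Returns (b, intercept),
    where b = w * q / (p . w) is Eq. 6 reduced to one component."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [value - y_mean for value in y]

    # Weight vector w is proportional to X'y and scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = sum(value * value for value in w) ** 0.5
    w = [value / norm for value in w]

    # Scores t = Xw, X-loadings p = X't / t't, Y-loading q = y't / t't.
    t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    tt = sum(value * value for value in t)
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
    q = sum(yc[i] * t[i] for i in range(n)) / tt

    pw = sum(pj * wj for pj, wj in zip(p, w))
    b = [wj * q / pw for wj in w]
    intercept = y_mean - sum(bj * xj for bj, xj in zip(b, x_mean))
    return b, intercept
```

With a single descriptor the result coincides with ordinary least squares, which gives a quick sanity check of the implementation.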

2.5. Artificial Neural Network (ANN)

An ANN is a biologically inspired computer program designed to learn from data in a manner that emulates the learning pattern of the brain. Most ANN systems are complex, high-dimensional processing systems. The theory behind neural networks has been described in detail in our previous works [45–49].

In the present work, an ANN program was written in MATLAB 7. The network was a fully connected feed-forward network with three layers and a sigmoidal transfer function. The descriptors selected by the GA-PLS method were used as inputs of the network, and its output signal represents the solvation enthalpy of the molecules of interest. Thus the network has five nodes in the input layer and one node in the output layer. The value of each input was divided by its mean value to bring it into the dynamic range of the sigmoidal transfer function of the network. The initial values of the weights were randomly selected from a uniform distribution ranging from -0.3 to +0.3, and the initial values of the biases were set to one. Before training, the network parameters were optimized. These parameters are the number of nodes in the hidden layer, the learning rates of the weights and biases, and the momentum. Procedures for the optimization of these parameters have been reported elsewhere [50, 51].
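The network set-up described above (five inputs, one hidden layer, one output, sigmoidal transfer function, weights drawn from [-0.3, 0.3], biases set to one, inputs divided by their mean) can be sketched in Python as follows. The number of hidden nodes used here is a placeholder, since the optimized value is reported in a later table; the function names are hypothetical, and training (learning rates, momentum) is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scale_by_mean(columns):
    """Divide every value of each descriptor column by that column's mean."""
    scaled = []
    for column in columns:
        mean = sum(column) / len(column)
        scaled.append([value / mean for value in column])
    return scaled


def init_network(n_inputs=5, n_hidden=4, seed=0):
    """Weights uniform in [-0.3, 0.3]; all biases start at one.

    n_hidden=4 is a placeholder, not the optimized value."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.3, 0.3) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [1.0] * n_hidden,
        "w_out": [rng.uniform(-0.3, 0.3) for _ in range(n_hidden)],
        "b_out": 1.0,
    }


def forward(net, inputs):
    """Forward pass through one sigmoidal hidden layer and a sigmoidal output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
              for weights, bias in zip(net["w_hidden"], net["b_hidden"])]
    return sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
```

Because the output node is sigmoidal, the output lies between 0 and 1, so the target enthalpies would also have to be rescaled to that range before training; how this was done is not stated in this section.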

2.6. Support Vector Machine (SVM)

SVM is gaining popularity owing to its many attractive features and promising empirical performance. It grew out of early concepts developed by Vapnik and Chervonenkis [52–54]. This technique has proven to be very effective for general-purpose classification and regression problems [55–59]. For nonlinear regression, the basic idea of support vector regression (SVR) is to map the input data X into a higher dimensional feature space F via a nonlinear mapping ϕ, so that a linear regression problem is obtained and solved in that feature space. The regression approximation therefore addresses the problem of estimating a function based on a given data set [60]

$$G = \{(X_i, d_i)\}_{i=1}^{l} \tag{7}$$

where Xi is the input vector, di is the desired value and l is the size of the training set.

The generic SVR estimating function takes the form of Eq. 8:

$$y = f(X) = w \cdot \phi(X) + b \tag{8}$$

where ϕ(X) denotes the features of the inputs, and w and b are coefficients. The coefficients are estimated by minimizing the regularized risk function

$$R(C) = C\,\frac{1}{l}\sum_{i=1}^{l} L_{\epsilon}(d_i, y_i) + \frac{1}{2}\lVert w \rVert^{2} \tag{9}$$

where

$$L_{\epsilon}(d, y) = \begin{cases} |d - y| - \epsilon, & |d - y| \ge \epsilon \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

In Eq. 9 the first term, $C\,\frac{1}{l}\sum_{i=1}^{l} L_{\epsilon}(d_i, y_i)$, is the empirical error (risk), which is measured with the ϵ-insensitive loss function given by Eq. 10. This loss function has the advantage that sparse data points can be used to represent the decision function of Eq. 8. The second term, $\frac{1}{2}\lVert w \rVert^{2}$, is the regularization term, which controls the model complexity and is used as a measure of function flatness. C is the regularization constant; it determines the tradeoff between the empirical risk and the regularization term. Increasing the value of C increases the relative importance of the empirical risk with respect to the regularization term. ϵ is called the tube size and corresponds to the approximation accuracy placed on the training data points. Both C and ϵ are user-prescribed parameters.
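The roles of C and ϵ can be illustrated with a short Python sketch of the ϵ-insensitive loss and the regularized risk. It assumes the conventional form of the risk, C times the mean loss over the training set plus one half of the squared norm of w; the averaging and the factor of one half are assumptions about the exact normalization, and the function names are hypothetical.

```python
def epsilon_insensitive_loss(desired, predicted, epsilon):
    """Zero inside the tube |d - y| <= epsilon, linear outside it."""
    return max(0.0, abs(desired - predicted) - epsilon)


def regularized_risk(desired, predicted, w, C, epsilon):
    """Assumed form: C * mean epsilon-insensitive loss + 0.5 * ||w||^2."""
    empirical = sum(epsilon_insensitive_loss(d, y, epsilon)
                    for d, y in zip(desired, predicted)) / len(desired)
    return C * empirical + 0.5 * sum(value * value for value in w)
```

Errors smaller than ϵ do not contribute to the empirical term at all, which is why only the points on or outside the tube end up determining the fitted function.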

Table 2. Parameters of the genetic algorithm.

| Parameter | Value |
| --- | --- |
| Population size | 30 chromosomes |
| Regression method | PLS |
| Maximum number of variables selected in the same chromosome | 30 |
| Maximum number of components | The optimal number |
| Response | Cross-validated % explained variance |
| Probability of mutation | 0.01 |
| Probability of cross over | 0.5 |
| Number of evaluations | 200 |
| Number of runs | 100 |

Then, by introducing Lagrange multipliers (αi, αi*) that satisfy αi·αi* = 0, αi ≥ 0, αi* ≥ 0, i = 1,…,l, the decision function (8) takes the following form:

$$f(X, \alpha_i, \alpha_i^{*}) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*})\,K(X, X_i) + b \tag{11}$$

In Eq. 11, the kernel function K is equal to the inner product K(X,Xi) = ϕ(X)·ϕ(Xi). All kernel functions must satisfy Mercer’s condition (the kernel function must be symmetric and positive definite), which ensures that the kernel corresponds to an inner product in some feature space. There are several possible choices for this kernel function, including linear, polynomial, spline, and radial basis functions. The elegance of using a kernel function lies in the fact that one can deal with feature spaces of arbitrary dimensionality without having to compute the map ϕ(x) explicitly. In SVR, a commonly used kernel function is the Gaussian radial basis function.
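The kernel expansion of the SVR decision function with a Gaussian radial basis function can be sketched in Python as below. The parametrization exp(-γ‖x − xi‖²) is one common way of writing the Gaussian kernel and is an assumption here, as are the function names; the dual coefficients are passed in directly as the differences αi − αi*, since solving the underlying optimization problem is outside the scope of the sketch.

```python
import math


def rbf_kernel(x, xi, gamma):
    """Gaussian radial basis function K(x, xi) = exp(-gamma * ||x - xi||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """f(x) = sum_i (alpha_i - alpha_i*) K(x, x_i) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support vector i."""
    return sum(coef * rbf_kernel(x, sv, gamma)
               for coef, sv in zip(dual_coefs, support_vectors)) + b
```

The feature map ϕ never appears in the code: only kernel values between the query point and the support vectors are needed, which is the practical meaning of not computing ϕ(x) explicitly.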

2.7. Estimation of the predictive ability of a QSPR model

For the optimized QSPR model, several parameters were chosen to test its predictive potential. A real QSPR model has a high predictive ability if it is close to the ideal one. This implies that the correlation coefficient R between the experimental (actual) $y$ and predicted $\tilde{y}$ properties must be close to 1, and that the regressions of $y$ against $\tilde{y}$ and of $\tilde{y}$ against $y$ through the origin, i.e. $y^{r0} = k\tilde{y}$ and $\tilde{y}^{r0} = k'y$, respectively, should have at least one of the slopes k or k' close to 1 [61]. The slopes k and k' are calculated as follows:

$$k = \frac{\sum y_i \tilde{y}_i}{\sum \tilde{y}_i^{2}} \tag{12}$$

$$k' = \frac{\sum y_i \tilde{y}_i}{\sum y_i^{2}} \tag{13}$$

The criteria formulated above may not be sufficient for a QSPR model to be truly predictive. The regression lines through the origin defined by $y^{r0} = k\tilde{y}$ and $\tilde{y}^{r0} = k'y$ (with the intercept set to zero) should be close to the optimum regression lines $y^{r} = a\tilde{y} + b$ and $\tilde{y}^{r} = a'y + b'$ (b and b' are intercepts). The correlation coefficients for the lines through the origin, $R_0^{2}$ and $R_0'^{2}$, are calculated as follows:

$$R_0^{2} = 1 - \frac{\sum (\tilde{y}_i - y_i^{r0})^{2}}{\sum (\tilde{y}_i - \bar{\tilde{y}})^{2}} \tag{14}$$

$$R_0'^{2} = 1 - \frac{\sum (y_i - \tilde{y}_i^{r0})^{2}}{\sum (y_i - \bar{y})^{2}} \tag{15}$$

where $\bar{y}$ and $\bar{\tilde{y}}$ are the average values of the observed and predicted properties, respectively, and the summations are over all n compounds in the validation set.

The difference between $R^{2}$ and $R_0^{2}$, expressed through $R_m^{2}$, also needs to be examined to check the predictive potential of a model [62]. This term is defined as:

$$R_m^{2} = R^{2}\left(1 - \sqrt{\left|R^{2} - R_0^{2}\right|}\right) \tag{16}$$

Finally, the following criteria for assessing the predictive ability of QSPR models should be considered:

1. High value of cross-validated R² (q² > 0.5).

2. Correlation coefficient R between the predicted and actual properties from an external test set close to 1; $R_0^{2}$ or $R_0'^{2}$ should be close to $R^{2}$.

3. At least one slope of the regression lines through the origin (k or k') should be close to 1.

4. $R_m^{2}$ should be greater than 0.5.
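The external validation criteria listed above can be collected in one Python function. The slopes follow Eqs. 12 and 13 directly. For the through-origin coefficient the sketch uses one particular reading, namely the coefficient of determination of the through-origin fit of the observed values on the predicted values; the literature is not uniform about which variable carries the tilde in Eqs. 14 and 15, so this choice, like the function and key names, is an assumption.

```python
def external_validation(y_obs, y_pred):
    """Return k, k', R^2, R0^2 (through-origin fit of y_obs on y_pred) and Rm^2."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n

    cross = sum(o * p for o, p in zip(y_obs, y_pred))
    k = cross / sum(p * p for p in y_pred)        # Eq. 12
    k_prime = cross / sum(o * o for o in y_obs)   # Eq. 13

    # Squared Pearson correlation between observed and predicted values.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    ss_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r2 = cov * cov / (ss_obs * ss_pred)

    # Assumed reading of R0^2: residuals of y_obs around the line k * y_pred.
    r0_2 = 1.0 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / ss_obs

    rm_2 = r2 * (1.0 - abs(r2 - r0_2) ** 0.5)     # Eq. 16
    return {"k": k, "k_prime": k_prime, "R2": r2, "R0_2": r0_2, "Rm_2": rm_2}
```

A model would pass criteria 2 to 4 when `R2` is close to 1, `R0_2` is close to `R2`, at least one of `k` and `k_prime` is close to 1, and `Rm_2` exceeds 0.5.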

3. Results and Discussion

3.1. Diversity validation

A basic topic in chemical database analysis is the diversity of sampling [63]. In this study, a diversity analysis was carried out on the data set to make certain that the structures in the test set are representative of those in the whole set. We consider a database of n compounds generated from m highly correlated chemical descriptors. Each compound, Xi, is represented by the following vector (Eq. 17):

$$X_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}) \quad \text{for } i = 1, 2, \ldots, n \tag{17}$$

where xij is the value of descriptor j of compound Xi. The combined database is represented by an n×m matrix X as follows (Eq. 18):

$$X = (X_1, X_2, \ldots, X_n)^{T} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix} \tag{18}$$

where the superscript T represents the vector/matrix transpose. A distance score, dij, for two different compounds Xi and Xj can be measured by the Euclidean distance norm (Eq. 19):

$$d_{ij} = \lVert X_i - X_j \rVert = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}} \tag{19}$$

The mean distance of each sample to the remaining ones was calculated as follows (Eq. 20):

$$\bar{d}_i = \frac{\sum_{j=1}^{n} d_{ij}}{n - 1}, \quad i = 1, 2, \ldots, n \tag{20}$$

The mean distances were then normalized to the interval from zero to one. A MATLAB program was written in our laboratory to calculate the mean distances according to Eqs. (19) and (20). The closer the mean distance is to one, the more the compound differs from the others. The mean distances of the samples were plotted against the experimental solvation enthalpy (EXP) (Figure 1), which shows the diversity of the molecules in the training and test sets. As can be seen from this figure, the molecules are diverse in both sets, and the training set, with a wide representation of the chemistry space, was sufficient to ensure the model's stability. The diversity of the test set supports a meaningful assessment of the predictive ability of the model.
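The diversity calculation of Eqs. 19 and 20 is short enough to sketch in Python. The original program was written in MATLAB and is not reproduced here; the min-max scaling used for the normalization step is an assumption, since the text only states that the mean distances were normalized to the interval from zero to one.

```python
def mean_distances(X):
    """Mean Euclidean distance of each compound to the other n - 1 compounds.

    X is a list of descriptor vectors (one row per compound)."""
    def distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    n = len(X)
    return [sum(distance(X[i], X[j]) for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def normalize_unit_interval(values):
    """Min-max scaling to [0, 1]; assumes the values are not all equal."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]
```

For three hypothetical compounds with descriptor vectors (0, 0), (3, 4) and (6, 8), the mean distances are 7.5, 5.0 and 7.5, so the middle compound is the least distinct from the rest and receives a normalized value of zero.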

3.2. PLS modeling

Table 1 (see Appendix) shows the data set and the corresponding observed and PLS-, ANN- and SVM-predicted values of the solvation enthalpy for all molecules studied in this work. The parameters of the genetic algorithm used for GA-PLS are shown in Table 2. Table 3 shows the specifications of the best PLS model. The optimum number of latent variables to be included in the model was three. It can be seen from this table that five descriptors appear in this model. These descriptors are: mean information index on atomic composition (AAC), maximal electrotopological negative variation (MAXDN), solvation connectivity index chi-0 (X0sol), total information content index (neighborhood symmetry of 0 order) (TIC0) and 3D-Balaban index (J3D). After the formation of the PLS model, the variance inflation factor (VIF = 1/(1-R²)) was calculated for each descriptor to see whether multicollinearity existed among the descriptors in the model. Here R is the multiple correlation coefficient of one descriptor regressed against all the other descriptors in the model. If VIF ranges from 1.0 to 5.0, the related equation is acceptable; when VIF is larger than 10.0, the regression equation is unstable and the correlations among the variables must be re-checked. As can be seen in the last column of Table 3, the VIF values of all descriptors are smaller than 5, indicating that the model is not affected by serious multicollinearity and has good stability. Table 4 gives the correlation matrix for these descriptors. By interpreting the descriptors in the model, it is possible to gain some insight into the factors that are likely related to the solvation enthalpy of the organic compounds.
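The VIF rule of thumb can be expressed as a small Python helper. The thresholds are the ones quoted in the text; values between 5 and 10 are not classified there, so the helper reports them as such. The function names are hypothetical.

```python
def vif(r_squared):
    """Variance inflation factor 1 / (1 - R^2) for one descriptor, where R^2
    comes from regressing that descriptor on the remaining descriptors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def vif_verdict(value):
    """Classify a VIF value using the thresholds quoted in the text."""
    if value <= 5.0:
        return "acceptable"
    if value > 10.0:
        return "unstable"
    return "not classified in the text"
```

As a rough illustration, if X0sol were regressed on TIC0 alone, the pairwise correlation of 0.723 from Table 4 would give a VIF of about 2.1; the values in Table 3 differ because each descriptor is regressed on all four others.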

To assess the relative importance and contribution of each descriptor in the model, the mean effect (ME) was calculated for each descriptor by the following equation:

(21)

where MEj is the mean effect of descriptor j, βj is the coefficient of descriptor j, dij is the value of descriptor j for molecule i, and m is the number of descriptors in the model. The calculated ME values are given in Table 3 and are also plotted in Figure 2. The magnitude and sign of the mean effect show the relative contribution and the direction of influence of each variable on the solvation enthalpy. The first descriptor according to its mean effect is the solvation connectivity index chi-0 (X0sol), which represents fragments consisting of a single atom and was defined in order to model solvation entropy and to describe dispersion interactions in solution. Taking the characteristic dimensions of the molecules into account through atomic parameters, it is defined as:

$${}^{m}\chi^{s} = \frac{1}{2^{m+1}} \sum_{k=1}^{K} \frac{\prod_{a=1}^{n} L_a}{\left(\prod_{a=1}^{n} \delta_a\right)_k^{1/2}} \tag{22}$$

where La is the principal quantum number (2 for C, N, O atoms, 3 for Si, S, Cl, …) of the ath atom in the kth subgraph; δa is the corresponding vertex degree; K is the total number of mth order subgraphs and n is the number of vertices in the subgraph [64]. This molecular descriptor has a negative sign for its mean effect, which indicates that ΔHSolv decreases as the value of this descriptor increases.

Table 3. The partial least squares regression coefficients.

| Descriptor | Notation | Coefficient | Mean effect | VIF |
| --- | --- | --- | --- | --- |
| Mean information index on atomic composition | AAC | -5.506 | -6.733 | 1.478 |
| Maximal electrotopological negative variation | MAXDN | 1.425 | 1.354 | 1.206 |
| Solvation connectivity index chi-0 | X0sol | -4.264 | -24.581 | 2.537 |
| Total information content index (neighborhood symmetry of 0 order) | TIC0 | -0.805 | -16.132 | 3.114 |
| 3D-Balaban index | J3D | 2.661 | 10.650 | 1.798 |
| Constant | | -7.087 | | |

Table 4. Correlation matrix for descriptors applied in this work.

| | AAC | MAXDN | X0sol | TIC0 | J3D |
| --- | --- | --- | --- | --- | --- |
| AAC | 1 | 0.382 | -0.014 | -0.012 | -0.411 |
| MAXDN | | 1 | -0.164 | -0.128 | -0.168 |
| X0sol | | | 1 | 0.723 | 0.066 |
| TIC0 | | | | 1 | 0.409 |
| J3D | | | | | 1 |

Figure 1. Scatter plot of samples for training and test sets.

Figure 2. Plot of descriptor’s mean effects.

The second descriptor is the total information content index (neighborhood symmetry of 0 order) (TIC0). This topological descriptor is a measure of graph complexity and is calculated as follows [65]:

TICr = A · ICr (23)

where A is the number of atoms and ICr is the neighborhood information content, defined as:

$$IC_r = -\sum_{g=1}^{G} \frac{A_g}{A} \log_2 \frac{A_g}{A} = -\sum_{g=1}^{G} P_g \log_2 P_g \tag{24}$$

where g runs over the G equivalence classes, Ag is the cardinality of the gth equivalence class, A is the total number of atoms and Pg is the probability of randomly selecting a vertex of the gth class. ICr represents a measure of structural complexity per vertex. The negative mean effect of this descriptor in the PLS model indicates that it contributes negatively to the value of ΔHSolv.
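The information content and its total can be computed from the sizes of the equivalence classes alone, as in the following Python sketch. It assumes the Shannon form with base-2 logarithms, and the example treats atoms of the same element as one equivalence class, which is the usual reading of zero-order neighborhood symmetry but is stated here as an assumption. The function names are hypothetical.

```python
import math


def information_content(class_sizes):
    """IC = -sum(P_g * log2(P_g)) with P_g = A_g / A, in bits per atom."""
    total = sum(class_sizes)
    return -sum((size / total) * math.log2(size / total)
                for size in class_sizes if size > 0)


def total_information_content(class_sizes):
    """TIC = A * IC (Eq. 23), where A is the total number of atoms."""
    return sum(class_sizes) * information_content(class_sizes)
```

For methane with hydrogens included, the classes are one carbon and four hydrogens, so `information_content([1, 4])` is about 0.722 bits and `total_information_content([1, 4])` is about 3.61.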

The next descriptor according to its mean effect is the 3D-Balaban index (J3D). The 3D-Balaban index is calculated from the geometric distance matrix [66]. This descriptor describes the mobility of the backbone chain and is defined as follows:

$$J_{3D} = \frac{B}{C + 1} \sum_{b} \left(\zeta_i\,\zeta_j\right)_b^{-1/2} \tag{25}$$

where ζi and ζj are the vertex distance degrees of the two adjacent atoms i and j that are connected by the bond b, the sum runs over all the bonds b in the molecule, B is the total number of bonds in the molecule, and C is the cyclomatic number (the minimum number of edges that must be removed from the molecular graph to make it acyclic). This descriptor has a positive sign for its mean effect, which indicates that ΔHSolv increases as the value of this descriptor increases.
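A Python sketch of a Balaban-type index computed from 3D coordinates is given below. It assumes the standard Balaban form, B/(C + 1) times the sum over bonds of (ζi ζj)^(-1/2), with the distance degree of an atom taken as the sum of its geometric distances to all other atoms, and it computes the cyclomatic number as B − A + 1, which holds for a connected molecular graph. Whether hydrogens are included and which units are used are choices left to the caller; the function name is hypothetical.

```python
def balaban_3d(coords, bonds):
    """Balaban-type index from Cartesian coordinates and a bond list.

    coords: list of (x, y, z) tuples; bonds: list of (i, j) atom index pairs.
    Assumes a connected graph, so the cyclomatic number is B - A + 1."""
    def distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    # Distance degree: sum of geometric distances from one atom to all atoms.
    degrees = [sum(distance(a, b) for b in coords) for a in coords]
    n_bonds = len(bonds)
    cyclomatic = n_bonds - len(coords) + 1
    bond_sum = sum((degrees[i] * degrees[j]) ** -0.5 for i, j in bonds)
    return n_bonds / (cyclomatic + 1) * bond_sum
```

For a hypothetical linear three-atom chain with unit spacing, the distance degrees are 3, 2 and 3, and the index evaluates to 4/√6.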

The fourth descriptor is the mean information index on atomic composition (AAC). This descriptor is the mean value of the total information content and was calculated as

$$\mathrm{AAC} = -\sum_{g} \frac{A_g}{A^h} \log_2 \frac{A_g}{A^h} = -\sum_{g} P_g \log_2 P_g \tag{26}$$

where $A^h$ is the total number of atoms (hydrogen included), $A_g$ is the number of atoms of the same type in the gth equivalence class, and $P_g$ is the probability of randomly selecting an atom of the gth type [67]. The negative mean effect of AAC (-6.733) in the PLS model indicates that this descriptor contributes negatively to the value of ΔHSolv.

The last descriptor described here is the maximal electrotopological negative variation (MAXDN). This descriptor represents the maximum negative intrinsic state difference in the molecule, can be related to the nucleophilicity of the molecule, and was calculated as follows [68]:

$$\mathrm{MAXDN} = \max_i \lvert \Delta I_i \rvert \quad \text{if } \Delta I_i < 0 \tag{27}$$

where $\Delta I_i$ is the field effect on the ith atom due to the perturbation of all other atoms, as defined by Kier and Hall:

$$\Delta I_i = \sum_{j \neq i} \frac{I_i - I_j}{(d_{ij} + 1)^2} \tag{28}$$

where the sum runs over all the other atoms in the molecular graph, $I$ is the atomic intrinsic state and $d$ is the topological distance between the two considered atoms. The positive mean effect of MAXDN (1.354) in the PLS model indicates that this descriptor contributes positively to the value of ΔHSolv.
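To make the field effect concrete, the sketch below evaluates it for the three heavy atoms of ethanol, assuming the Kier and Hall form in which each pairwise difference of intrinsic states is divided by the squared quantity (d + 1). The intrinsic states (2.0 for CH3, 1.5 for CH2, 6.0 for OH) are entered by hand and the function names are hypothetical; this is not the code used to compute the descriptors in this work.

```python
def field_effects(intrinsic_states, distances):
    """Field effect on each atom: sum over j != i of (I_i - I_j) / (d_ij + 1)^2."""
    n = len(intrinsic_states)
    return [
        sum(
            (intrinsic_states[i] - intrinsic_states[j]) / (distances[i][j] + 1) ** 2
            for j in range(n)
            if j != i
        )
        for i in range(n)
    ]


def maxdn(intrinsic_states, distances):
    """Largest magnitude among the negative field effects (0.0 if none is negative)."""
    negatives = [abs(d) for d in field_effects(intrinsic_states, distances) if d < 0]
    return max(negatives, default=0.0)


# Ethanol heavy atoms in the order CH3, CH2, OH (hydrogen-suppressed graph).
states = [2.0, 1.5, 6.0]
topological_distance = [
    [0, 1, 2],
    [1, 0, 1],
    [2, 1, 0],
]
effects = field_effects(states, topological_distance)
print("field effects:", [round(e, 4) for e in effects])
print("MAXDN:", maxdn(states, topological_distance))
```

Because every pairwise term appears twice with opposite sign, the field effects of a molecule sum to zero; here the CH2 carbon carries the most negative value.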

From the above discussion, it can be seen that all descriptors involved in the QSPR model have physical meaning and can account for the structural features that affect the solvation enthalpy of the molecules of interest.

3.3. ANN modeling

The next step was the construction of an ANN. Before training the ANNs, the network parameters, including the number of nodes in the hidden layer, the weights and biases learning rates, and the momentum value, were optimized. Table 5 shows the architecture and specifications of the optimized network. The predictive power of the ANN model developed on the selected training set was estimated from the predictions for the test set chemicals by calculating $q^2$, which is defined as follows:

$$q^2 = 1 - \frac{\sum_i (y_i - \tilde{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{29}$$

where $y_i$ and $\tilde{y}_i$ are, respectively, the measured and predicted values of the dependent variable (solvation enthalpy), $\bar{y}$ is the average value of the dependent variable for the training set, and the summations cover all the compounds. The calculated value of $q^2$ was 0.970.
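The q2 statistic of Eq. (29) can be written as a short function. This is a minimal sketch; the function name and the three observed and predicted values are made up for illustration and are not data from this study. Note that the reference mean is that of the training set, not of the compounds being predicted.

```python
def q2_external(y_obs, y_pred, y_train_mean):
    """Predictive q2: 1 - sum (y - y_pred)^2 / sum (y - training mean)^2."""
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))
    ss_tot = sum((y - y_train_mean) ** 2 for y in y_obs)
    return 1.0 - ss_res / ss_tot


# Made-up values in kJ/mol, for illustration only.
y_obs = [-30.0, -40.0, -50.0]
y_pred = [-31.0, -39.0, -52.0]
train_mean = -42.0
q2 = q2_external(y_obs, y_pred, train_mean)
print(f"q2 = {q2:.4f}")
```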

The statistics of the test set for the ANN model were $q^2$ = 0.970, $R^2$ = 0.971 (R = 0.985), $R_0^2$ = 0.971, $R_m^2$ = 0.939 and k = 0.996. These values and the other statistical parameters shown in Table 6 reveal the high predictive ability of the model. Figure 3a shows the plot of the ANN predicted versus experimental values of solvation enthalpy for all of the molecules in the data set. The residuals of the ANN calculated values of the solvation enthalpy are plotted against the experimental values in Figure 4a. The scatter of the residuals on both sides of the zero line indicates that no systematic error exists in the constructed QSPR model.
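The network specification in Table 5 (5 input, 6 hidden and 1 output node, sigmoid transfer function, learning rates of 0.2 for weights and 0.5 for biases, momentum of 0.3) can be sketched as follows. The training scheme is an assumption: online backpropagation on a squared-error loss, with inputs and target scaled to the range 0 to 1. The class name, the initial weight range and the sample values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """5-6-1 feed-forward network with sigmoid units and momentum updates."""

    def __init__(self, n_in=5, n_hidden=6, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = rng.uniform(-0.5, 0.5)
        # Previous updates, needed for the momentum term.
        self.dw1 = [[0.0] * n_in for _ in range(n_hidden)]
        self.db1 = [0.0] * n_hidden
        self.dw2 = [0.0] * n_hidden
        self.db2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, target, lr_w=0.2, lr_b=0.5, momentum=0.3):
        """One online update; returns the squared error before the update."""
        hidden, out = self.forward(x)
        # Error signals for E = 0.5 * (out - target)^2.
        delta_out = (out - target) * out * (1.0 - out)
        delta_hid = [delta_out * self.w2[j] * hidden[j] * (1.0 - hidden[j])
                     for j in range(len(hidden))]
        for j, h in enumerate(hidden):
            self.dw2[j] = -lr_w * delta_out * h + momentum * self.dw2[j]
            self.w2[j] += self.dw2[j]
            for i, xi in enumerate(x):
                self.dw1[j][i] = -lr_w * delta_hid[j] * xi + momentum * self.dw1[j][i]
                self.w1[j][i] += self.dw1[j][i]
            self.db1[j] = -lr_b * delta_hid[j] + momentum * self.db1[j]
            self.b1[j] += self.db1[j]
        self.db2 = -lr_b * delta_out + momentum * self.db2
        self.b2 += self.db2
        return 0.5 * (out - target) ** 2


net = TinyNet()
x = [0.1, 0.4, 0.7, 0.2, 0.9]  # five scaled descriptor values (made up)
target = 0.8                   # scaled solvation enthalpy (made up)
first_error = net.train_step(x, target)
for _ in range(200):
    last_error = net.train_step(x, target)
print(f"squared error: first step {first_error:.5f}, last step {last_error:.2e}")
```

Each update adds the momentum fraction of the previous update to the new gradient step, which is what the separate learning rates and the momentum value in Table 5 control.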

Table 5. Architecture and specifications of the optimized ANN model.

Number of nodes in the input layer     5
Number of nodes in the hidden layer    6
Number of nodes in the output layer    1
Weights learning rate                  0.2
Biases learning rate                   0.5
Momentum                               0.3
Transfer function                      Sigmoid

3.4. SVM modeling

The SVM method was then used to investigate the possible nonlinear relation between the selected descriptors and the ΔHSolv values. The performance of SVM for regression depends on the combination of several parameters: the capacity parameter C, the ε of the ε-insensitive loss function, and the kernel parameter γ. C is a regularization parameter that controls the tradeoff between maximizing the margin and minimizing the training error. If C is too small, insufficient weight is placed on fitting the training data. If C is too large, the algorithm will overfit the training data. To make the learning process stable, a large value should be set for C. The kernel type is another important parameter. For regression tasks, the Gaussian RBF kernel is commonly used. The form of the Gaussian RBF function is as follows:

$$K(u, v) = \exp\left(-\gamma \lVert u - v \rVert^{2}\right) \tag{30}$$

where γ is a constant, the parameter of the kernel, and u and v are two independent variables. γ controls the width of the Gaussian RBF function and therefore the generalization ability of SVM. The optimal value for ε depends on the type of noise present in the data, which is usually unknown. Even if enough is known about the noise to select an optimal value for ε, there is the practical consideration of the number of resulting support vectors. ε-insensitivity prevents the entire training set from meeting the boundary conditions and so allows for the possibility of sparsity in the solution of the dual formulation. Selecting a suitable value of ε is therefore critical. These parameters should be optimized to obtain better results. To select appropriate values, different values of the parameters were tried, and the set of values with the best leave-five-out cross-validation performance was selected as optimal. The overall performance of SVM was evaluated in terms of the root-mean-square (RMS) error, which is defined as:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} (d_i - o_i)^2}{n}} \tag{31}$$

where $d_i$ is the desired output in the test set, $o_i$ is the SVM output, and $n$ is the number of samples in the test set. The influences of the parameters on the performance of SVM are shown in Figures 5–7. Through the above process, γ, ε and C were fixed to 25, 0.03 and 200, respectively, when the number of support vectors was 119. The results of the optimal SVM are shown in Table 6. The model gave an RMS error of 1.708 for the training set and 2.016 for the test set, and the corresponding correlation coefficients (R) were 0.993 and 0.990, respectively. Figure 3b shows the plot of the SVM predicted versus experimental values of solvation enthalpy for all of the molecules in the data set. The residuals of the SVM calculated values of the solvation enthalpy are plotted against the experimental values in Figure 4b. The scatter of the residuals on both sides of the zero line indicates that no systematic error exists in the constructed QSPR model.

Table 6. Statistical parameters obtained using the PLS, ANN and SVM models (a).

Model   SEtr    SEt     Rtr     Rt      Ftr     Ft
PLS     5.444   5.425   0.927   0.922   712     273
ANN     2.126   2.410   0.989   0.985   5318    1580
SVM     1.708   2.016   0.993   0.990   8308    2279

(a) tr, training set; t, test set; SE, standard error; R, correlation coefficient; F, statistical F-value.

Figure 3a. Plot of ANN calculated versus experimental gas to carbon tetrachloride solvation enthalpy.

Figure 3b. Plot of SVM calculated versus experimental gas to carbon tetrachloride solvation enthalpy.

Figure 4a. Plot of ANN residual versus experimental values of gas to carbon tetrachloride solvation enthalpy.

Figure 4b. Plot of SVM residual versus experimental values of gas to carbon tetrachloride solvation enthalpy.
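The building blocks named in this section can be sketched in plain Python: the Gaussian RBF kernel (assumed here to have the common form exp(-γ‖u - v‖²)), the ε-insensitive loss, and the RMS error used to score each parameter combination. The function names and the candidate grid are hypothetical; only the selected values γ = 25, ε = 0.03 and C = 200 come from the text. Fitting the SVM itself requires a quadratic programming solver or an SVM library and is not shown.

```python
import math
from itertools import product


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def eps_insensitive_loss(desired, output, eps):
    """Zero inside the +/- eps tube around the target, linear outside it."""
    return max(0.0, abs(desired - output) - eps)


def rms_error(desired, outputs):
    """Root-mean-square error over n samples."""
    n = len(desired)
    return math.sqrt(sum((d - o) ** 2 for d, o in zip(desired, outputs)) / n)


# Hypothetical candidate grid: (gamma, epsilon, C) combinations to be scored
# by cross-validated RMS error with an SVM implementation of choice.
grid = list(product([1, 5, 25, 50], [0.01, 0.03, 0.1], [50, 100, 200, 400]))

u = [0.2, 0.4, 0.1, 0.9, 0.5]  # two made-up scaled descriptor vectors
v = [0.3, 0.4, 0.2, 0.7, 0.5]
print("K(u, v) at gamma = 25:", round(rbf_kernel(u, v, 25), 4))
print("loss for an error of 0.02 at eps = 0.03:", eps_insensitive_loss(1.00, 1.02, 0.03))
print("number of grid points:", len(grid))
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighborhood; a larger ε widens the tube in which errors are ignored and generally reduces the number of support vectors.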

3.5. Comparison of the results obtained by different QSPR approaches

The results of the different QSPR models are collected in Table 6. The correlation coefficients (R) between experimental and predicted solvation enthalpy obtained by PLS, ANN and SVM are 0.927, 0.989 and 0.993, respectively, for the training set and 0.922, 0.985 and 0.990, respectively, for the test set. As can be seen from Table 6, the results of the SVM model are better than those obtained by the PLS method.

Furthermore, the results of SVM are comparable to those of ANN. SVM exhibits the better overall performance because it embodies the structural risk minimization principle and has the advantage over the other techniques of converging to the global optimum rather than to a local optimum. It is important to note that, as a general machine learning method, SVM is based on the structural risk minimization principle, which minimizes an upper bound on the generalization error rather than the training error. SVM therefore generalizes better than PLS and ANN and is especially suitable for QSPR modeling on small datasets. Moreover, once the corresponding parameters are specified, the solution of SVM is unique and reproducible, which is a clear advantage over ANN.

It is also significant that the standard error values for the SVM model were not only low but also similar for the training and external test sets, which suggests that the proposed model has both predictive ability (low values) and sufficient generalization performance (similar values).
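One simple way to put numbers on "low" and "similar" is to tabulate the difference and the ratio between the test and training standard errors reported in Table 6. The sketch below does only that arithmetic; the choice of difference and ratio as summary measures is an assumption of this example, not a criterion used in this work.

```python
# Standard errors (training set, test set) copied from Table 6.
standard_error = {
    "PLS": (5.444, 5.425),
    "ANN": (2.126, 2.410),
    "SVM": (1.708, 2.016),
}

for model, (se_train, se_test) in standard_error.items():
    gap = se_test - se_train
    ratio = se_test / se_train
    print(f"{model}: SE_train = {se_train:.3f}, SE_test = {se_test:.3f}, "
          f"gap = {gap:+.3f}, ratio = {ratio:.3f}")

# Model with the lowest standard error on the external test set.
best_on_test = min(standard_error, key=lambda m: standard_error[m][1])
print("lowest test-set standard error:", best_on_test)
```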

4. Conclusions

In this paper, QSPR models based on PLS, ANN and SVM have been developed for the first time for predicting the ΔHSolv of a diverse set of organic compounds from molecular structure. The results show that the nonlinear SVM model, built on the same set of descriptors, has better predictive ability than the PLS and ANN models. Model validation indicates that the presented model is suitable and can be used to predict the ΔHSolv of organic compounds with an accuracy similar to that of experimental ΔHSolv determination. The proposed model can therefore be expected to predict ΔHSolv for new organic compounds or for other organic compounds for which experimental values are unknown.

Figure 5. Gamma (γ) versus RMS error on LOO cross-validation (C = 100, ε = 0.1) [69].

Figure 6. Epsilon (ε) versus RMS error on LOO cross-validation (C = 100, γ = 25) [69].

Figure 7. Cost (C) versus RMS error on LOO cross-validation (γ = 25, ε = 0.03) [69].


References

1. H.M. Neumann, J. Solution Chem. 6 (1977) 33.

2. N. Morel-Desrosiers, J. P. Morel, Can. J. Chem. 59 (1981) 1.

3. M. Irisa, K. Nagayama, F. Hirata, Chem. Phys. Lett. 207 (1993)

430.

4. C. Mintz, K. Burton, W.E. Acree Jr., Fluid Phase Equilibr. 258 (2007) 191.

5. J.S. Chickos, W.E. Acree Jr., J. Phys. Chem. Ref. Data 32

(2003) 519.

6. J.S. Chickos, W.E. Acree Jr., J. Phys. Chem. Ref. Data 31

(2002) 537.

7. X.J. Yao, Y.W. Wang, X.Y. Zhang, R.S. Zhang, M.C. Liu, Z.D.

Hu, B.T. Fan, Chemom. Intell. Lab. Syst. 62 (2002) 217.

8. X.J. Yao, M.C. Liu, X.Y. Zhang, Z.D. Hu, B.T. Fan, Anal.

Chim. Acta 462 (2002) 101.

9. V.N. Vapnik, The Nature of Statistical Learning Theory,

Springer, New York (1995).

10. V.N. Vapnik, Statistical Learning Theory, Wiley, New York

(1998).

11. J. Wang, H.Y. Du, X.J. Yao, Z.D. Hu, Anal. Chim. Acta 601

(2007) 156.

12. X.J. Yao, A. Panaye, J.P. Doucet, H.F. Chen, R.S. Zhang, B.T.

Fan, M.C. Liu, Z.D. Hu, Anal. Chim. Acta 535 (2005) 259.

13. U. Thissen, B. Ustun, W.J. Melssen, L.M.C. Buydens, Anal.

Chem. 76 (2004) 3099.

14. E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem.

Inf. Comput. Sci. 43 (2003) 1882.

15. S.R. Amendolia, G. Cossu, M.L. Ganadu, B. Golosio, G.L.

Masala, G.M. Mura, Chemometr. Intell. Lab. Syst. 69 (2003) 13.

16. Y. Lee, C.K. Lee, Bioinformatics 19 (2003) 1132.

17. A.I. Belousov, S.A. Verzakov, J.V. Frese, Chemometr. Intell.

Lab. Syst. 64 (2002) 15.

18. H.Z. Si, S.P. Yuan, K.J. Zhang, A.P. Fu, Y.B. Duan, Z.D. Hu,

Chemometr. Intell. Lab. Syst. 90 (2008) 15.

19. J. Wang, H.Y. Du, H.X. Liu, X.J. Yao, Z.D. Hu, B.T. Fan,

Talanta 73 (2007) 147.

20. C.X. Xue, R.S. Zhang, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu,

B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 1693.

21. W.P. Ma, X.Y. Zhang, F. Luan, H.X. Zhang, R.S. Zhang, M.C.

Liu, Z.D. Hu, B.T. Fan, J. Phys. Chem. A 109 (2005) 3485.

22. H.X. Liu, X.J. Yao, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, J.

Phys. Chem. B 109 (2005) 20565.

23. C.Y. Zhao, H.X. Zhang, X.Y. Zhang, M.C. Liu, Z.D. Hu, B.T.

Fan, Toxicology 217 (2006) 105.

24. J.Z. Li, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan,

Chemometr. Intell. Lab. Syst. 87 (2007) 139.

25. H.X. Liu, C.X. Xue, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu,

B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 1979.

26. M.K. Leong, Chem. Res. Toxicol. 20 (2007) 217.

27. L. Peter, M. Tatiana, J. Chem. Inf. Comput. Sci. 43 (2003) 1855.

28. C. Mintz, M. Clark, K. Burton, W. E. Acree, Jr., M. H.

Abraham, J. Solution Chem. 36 (2007) 947.

29. Hyperchem, release 4 for Windows, Autodesk, Sausalito, CA (1995).

30. Mopac for Windows, Stewart Computational Chemistry (2009).

31. A. R. Katritzky, M. Karelson, R. Petrukhin, Comprehensive

Descriptors for Structural and Statistical Analysis (CODESSA)

Version 2.7.2, University of Florida (1994).

32. A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA Training Manual, University of Florida, Gainesville (1995).

33. A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA Version 1 Reference Manual, University of Florida, Gainesville, Florida (1994).

34. I.V. Tetko, J. Gasteiger, R. Todeschini, A. Mauri, D. Livingstone, P. Ertl, V.A. Palyulin, E.V. Radchenko, N.S. Zefirov, A.S. Makarenko, V. Tanchuk, V.V. Prokopenko, J. Comput. Aid. Mol. Des. 19 (2005) 453.

35. R. Leardi, Chemom. Intell. Lab. Syst. 41 (1998) 195.

36. D.E. Goldberg, Genetic Algorithms in Search, Optimization and

Machine Learning, Addison–Wesley, New York (1989).

37. A. Hoskuldsson, Prediction Methods in Science and

Technology, Basic Theory, Thur Publishing, Denmark Vol 1

(1996).

38. K. Hasegawa, T. Kimura, K. Funatsu, Quant. Struct.-Act. Relat.

18 (1999) 262.

39. R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267.

40. A. Lorber, L. Wangen, B.R. Kowalski, J. Chemometrics 1 (1987) 19.

41. T. Khayamian, A.A. Ensafi, B. Hemmateenejad, Talanta 49

(1999) 587.

42. M. Shamsipur, B. Hemmateenejad, M. Akhond, H. Sharghi,

Talanta 54 (2001) 1113.

43. A. Hoskuldsson, Chemom. Intell. Lab. Syst. 55 (2001) 23.

44. MATLAB 7.0. The Mathworks Inc., Natick. http://www.mathworks.com.

45. H. Golmohammadi, J. Comput. Chem. 30 (2009) 2455.

46. Z. Dashtbozorgi, H. Golmohammadi, Eur. J. Med. Chem. 45

(2010) 2182.

47. Z. Dashtbozorgi, H. Golmohammadi, J. Sep. Sci. 33 (2010)

3800.

48. H. Golmohammadi, Z. Dashtbozorgi, Struct. Chem. 21 (2010)

1241.

49. H. Golmohammadi, M. Safdari, Microchem. J. 95 (2010) 140.

50. T.B. Blank, S.T. Brown, Anal. Chem. 65 (1993) 3081.

51. M. Jalali-Heravi, M. H. Fatemi, J. Chromatogr. A 915 (2001)

177.

52. V. Vapnik, Estimation of Dependencies Based on Empirical

Data, Nauka, Moscow (1979).

53. C. Cortes, V. Vapnik, Mach. Learn. 20 (1995) 273.

54. V. Vapnik, S. Golowich, A. Smola, Adv. Neural Inform.

Process. Syst. 9 (1997) 281.

55. E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem.

Inf. Comput. Sci. 43 (2003) 1882.

56. C.Y. Zhao, R.S. Zhang, H.X. Liu, C.X. Xue, S.G. Zhao, X.F.

Zhou, M.C. Liu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004)

2040.

57. F. Luan, C.X. Xue, R.S. Zhang, C.Y. Zhao, M.C. Liu, Z.D. Hu,

B.T. Fan, Anal. Chim. Acta 537 (2005) 101.

58. F. Luan, W.P. Ma, X.Y. Zhang, H.X. Zhang, M.C. Liu, Z.D. Hu,

B.T. Fan, Chemosphere 63 (2006) 1142.

59. V.Z. Vladimir, V.B. Konstantin, A.I. Andrey, P.S. Nikolay, V.P.

Igor, J. Chem. Inf. Comput. Sci. 43 (2003) 2048.

60. H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J.

Chem. Inf. Comput. Sci. 43 (2004) 161.

61. A. Golbraikh, A. Tropsha, J. Mol. Graphics Model. 20 (2002)

269.

62. P.P. Roy, K. Roy, QSAR Comb. Sci. 27 (2008) 302.

63. A.G. Maldonado, J.P. Doucet, M. Petitjean, Mol. Divers. 10

(2006) 39.

64. V.K. Gombar, A. Kumar, M.S. Murthy, Indian J. Chem. 268

(1987) 1168.

65. R. Sarkar, A.B. Roy, P.K. Sarkar, Math. Biosci. 39 (1978) 299.

66. A.T. Balaban, S.C. Basak, T. Colburn, G.D. Grunwald, J. Chem.

Inf. Comput. Sci. 34 (1994) 1118.

67. S.M. Dancoff, H. Quastler, Essays on the Use of Information

Theory in Biology. University of Illinois, Urbana, IL (1953).

68. L.B. Kier, L.H. Hall, Molecular Structure Description. The

Electrotopological State. Academic Press, London, UK (1999).

69. V. Cherkassky, F. Mulier, Learning from data: Concepts, theory,

and methods; Wiley, New York (1998).


Appendix

Table 1. Comparison of experimental and predicted values of gas to carbon tetrachloride solvation enthalpy (ΔHSolv, in kJ/mol) for the training and test sets.

ΔHSolv(SVM) ΔHSolv(ANN) ΔHSolv(PLS) ΔHSolv(EXP) Name Number

Training set

-2.91 -3.39 -3.27 -3.01 Methane 1

-19.49 -20.66 -22.43 -18.49 Butane 2

-25.07 -23.87 -28.36 -25.19 Pentane 3

-49.37 -49.55 -52.12 -48.37 Decane 4

-58.70 -59.51 -61.09 -57.70 Dodecane 5

-75.30 -72.08 -82.40 -76.30 Hexadecane 6

-26.90 -27.53 -29.94 -25.90 2,2-Dimethylbutane 7

-31.76 -31.47 -32.29 -33.89 Ethyl pentane 8

-37.23 -39.74 -43.71 -36.23 2,2,4,4-Tetramethylpentane 9

-29.47 -27.59 -29.58 -28.47 Cyclopentane 10

-34.26 -32.78 -35.85 -32.34 Cyclohexane 11

-38.33 -36.88 -38.32 -37.91 Cycloheptane 12

-41.88 -40.94 -41.91 -42.95 Cyclooctane 13

-51.00 -50.43 -50.66 -52.00 Cyclododecane 14

-36.90 -35.47 -39.52 -34.65 Methylcyclohexane 15

-50.71 -51.65 -55.83 -49.22 trans Decalin 16

-50.12 -49.89 -53.41 -48.40 Adamantane 17

-9.80 -9.75 -17.59 -9.58 Ethene 18

-34.27 -32.91 -32.50 -35.21 1-Heptene 19

-38.58 -37.78 -34.04 -39.58 1-Octene 20

-36.58 -38.13 -40.04 -35.52 Norbornadiene 21

-27.85 -29.12 -24.08 -28.14 Acetone 22

-32.13 -32.87 -28.80 -33.13 2-Butanone 23

-34.09 -35.74 -31.76 -37.07 2-Pentanone 24

-56.19 -57.78 -48.41 -55.19 2-Nonanone 25

-39.79 -39.79 -32.18 -41.38 Cyclopentanone 26

-42.27 -43.78 -31.47 -44.68 Cyclohexanone 27

-42.23 -43.24 -37.71 -44.89 2,2,4,4-Tetramethyl-3-pentanone 28

-32.92 -30.37 -36.27 -28.96 Diethyl ether 29

-39.74 -39.31 -40.30 -36.69 Dipropyl ether 30

-49.79 -49.48 -53.79 -46.46 Dibutyl ether 31

-30.07 -27.14 -35.67 -28.79 Methyl tert-butyl ether 32

-38.24 -38.03 -31.83 -37.50 1,2-Dimethoxyethane 33

-38.50 -36.93 -37.55 -37.20 Tetrahydropyran 34

-35.50 -33.97 -39.35 -34.50 Butyl methyl ether 35

-32.78 -32.33 -22.58 -30.40 Dimethoxymethane 36

-75.90 -79.92 -85.13 -76.90 15-Crown-5 37

-99.80 -97.43 -121.09 -100.80 18-Crown-6 38

-29.29 -30.34 -33.34 -30.13 Chloroform 39

-33.43 -30.46 -22.62 -32.43 Carbon tetrachloride 40

-32.89 -32.88 -32.51 -33.18 1-Chlorobutane 41

-29.67 -28.11 -38.73 -27.61 cis-1,2-Dichloroethylene 42

-30.67 -29.88 -34.73 -28.03 trans-1,2-Dichloroethylene 43

-26.12 -25.92 -28.51 -27.12 Iodomethane 44


-34.57 -33.30 -37.67 -35.57 2-Iodo-2-methylpropane 45

-40.88 -38.44 -32.93 -41.88 Diiodomethane 46

-11.13 -10.41 -17.54 -10.63 Chlorotrifluoromethane 47

-27.79 -29.28 -22.54 -28.79 Propanal 48

-31.86 -32.57 -30.10 -32.90 Butanal 49

-36.84 -37.52 -39.15 -35.84 2-Nitropropane 50

-24.36 -23.96 -21.16 -25.36 Acetonitrile 51

-19.40 -18.97 -13.90 -19.80 Methanol 52

-24.02 -24.07 -21.73 -24.28 Ethanol 53

-28.45 -28.93 -21.37 -27.90 1-Propanol 54

-39.46 -39.10 -38.33 -42.20 1-Hexanol 55

-50.91 -49.23 -47.87 -52.20 1-Octanol 56

-29.71 -30.04 -32.61 -30.71 2-Methyl-2-butanol 57

-36.17 -37.40 -42.12 -35.14 Propyl formate 58

-38.97 -40.73 -40.32 -39.97 Butyl formate 59

-33.39 -30.87 -33.04 -30.94 Methyl acetate 60

-36.40 -34.63 -37.14 -34.97 Ethyl acetate 61

-43.23 -42.27 -34.26 -44.23 Butyl acetate 62

-35.22 -34.55 -36.24 -35.77 Methyl propionate 63

-39.04 -39.03 -46.10 -39.78 Ethyl propionate 64

-45.15 -42.24 -45.70 -44.45 Propyl propionate 65

-32.31 -30.61 -33.57 -33.31 Benzene 66

-38.61 -39.74 -38.55 -38.12 Toluene 67

-52.76 -51.09 -49.65 -54.14 Naphthalene 68

-76.18 -74.41 -65.98 -77.18 Anthracene 69

-50.82 -47.24 -50.11 -50.08 Acetophenone 70

-48.28 -47.84 -46.90 -45.27 Anisole 71

-45.09 -43.61 -43.00 -46.86 Benzaldehyde 72

-69.20 -65.37 -74.31 -68.20 1,3,4,5-Tetrabromobenzene 73

-42.11 -43.12 -43.29 -40.33 Chlorobenzene 74

-49.36 -48.20 -51.03 -46.15 1,2-Dichlorobenzene 75

-50.00 -48.43 -51.19 -46.70 1,4-Dichlorobenzene 76

-57.91 -55.28 -66.74 -58.09 1,2,4,5-Tetrachlorobenzene 77

-70.48 -71.36 -62.28 -71.48 Hexachlorobenzene 78

-35.43 -33.00 -39.42 -34.43 Fluorobenzene 79

-48.31 -49.78 -42.33 -48.01 Iodobenzene 80

-35.35 -34.72 -38.28 -34.35 Trifluoromethylbenzene 81

-36.87 -41.15 -34.88 -44.52 N,N-Dimethylformamide 82

-42.11 -42.97 -32.19 -44.98 Dimethyl sulfoxide 83

-43.43 -43.97 -42.32 -46.47 Aniline 84

-48.49 -47.09 -44.58 -49.59 N-Methylaniline 85

-52.99 -49.27 -58.88 -51.99 N,N-Dimethylaniline 86

-52.22 -51.61 -47.57 -54.37 N-Ethylaniline 87

-37.31 -37.63 -37.74 -38.31 Pyridine 88

-44.01 -44.92 -42.81 -43.01 2-Methylpyridine 89


-49.39 -50.40 -48.21 -48.88 2,4-Dimethylpyridine 90

-49.51 -49.43 -48.32 -46.68 2,6-Dimethylpyridine 91

-50.10 -50.82 -49.92 -50.38 2-Bromopyridine 92

-50.58 -53.40 -47.12 -51.14 3-Bromopyridine 93

-47.17 -48.37 -47.25 -45.20 2-Chloropyridine 94

-47.60 -51.74 -47.45 -48.60 3-Chloropyridine 95

-47.65 -45.06 -46.17 -49.40 3-Cyanopyridine 96

-47.82 -45.88 -46.21 -49.10 4-Cyanopyridine 97

-33.59 -32.48 -29.93 -35.53 Butylamine 98

-32.22 -32.72 -30.96 -33.22 Diethylamine 99

-38.53 -38.82 -38.41 -38.04 Triethylamine 100

-67.72 -65.40 -61.93 -68.72 Tributylamine 101

-53.27 -53.26 -52.65 -54.72 Dibutyl sulfide 102

-45.24 -45.63 -38.95 -47.56 N,N-Dimethylacetamide 103

-41.80 -42.20 -41.03 -43.40 Phenol 104

-48.11 -47.05 -54.57 -46.10 2-Chlorophenol 105

-57.78 -56.53 -53.95 -59.70 4-Bromophenol 106

-41.00 -40.64 -35.68 -42.00 Pentafluorophenol 107

-47.37 -48.65 -41.59 -46.06 3-Methylphenol 108

-54.04 -54.51 -52.78 -54.40 2-Methoxyphenol 109

-57.39 -55.24 -53.06 -59.50 3-Methoxyphenol 110

-55.45 -55.93 -50.11 -57.60 4-Methoxyphenol 111

-61.70 -63.90 -57.84 -60.70 1-Naphthol 112

-43.59 -46.40 -46.97 -42.81 Diethyl carbonate 113

-51.16 -51.02 -49.29 -52.79 Phenyl methyl sulfide 114

-27.20 -32.20 -29.10 -27.20 Acrylonitrile 115

-35.46 -35.67 -36.63 -33.89 1,4-Difluorobenzene 116

-73.48 -74.51 -67.31 -74.48 Benzophenone 117

-56.59 -57.72 -53.73 -55.40 Quinoline 118

-65.39 -64.51 -71.36 -64.39 1-Nitronaphthalene 119

Test set

-9.34 -10.69 -15.11 -9.20 Ethane 120

-15.10 -15.69 -22.21 -14.40 Propane 121

-30.19 -29.27 -34.89 -29.75 Hexane 122

-34.54 -34.82 -39.57 -34.48 Heptane 123

-39.21 -40.16 -30.27 -39.13 Octane 124

-44.11 -45.34 -35.13 -43.18 Nonane 125

-16.26 -17.07 -21.62 -16.15 2-Methylpropane 126

-24.11 -24.18 -25.43 -23.93 2-Methylbutane 127

-31.21 -32.19 -24.35 -30.95 Methylcyclopentane 128

-50.66 -52.70 -58.59 -50.78 cis Decalin 129

-59.09 -60.33 -61.66 -57.16 Bicyclohexyl 130

-53.54 -52.69 -52.08 -55.50 Tetralin 131

-35.37 -36.37 -30.15 -37.73 3-Pentanone 132

-39.99 -41.57 -32.06 -41.84 2-Hexanone 133

-44.77 -45.64 -38.97 -46.40 2-Heptanone 134


-43.14 -46.62 -40.01 -45.64 4-Heptanone 135

-48.96 -50.10 -42.62 -50.65 2-Octanone 136

-56.66 -54.58 -51.91 -54.77 5-Nonanone 137

-37.37 -36.37 -32.15 -39.88 2,4-Pentanedione 138

-46.80 -50.17 -51.32 -45.50 1,2-Diethoxyethane 139

-38.53 -39.13 -43.19 -37.90 1,4-Dioxane 140

-69.77 -69.35 -68.57 -71.30 2,5,8,11-Tetraoxadodecane 141

-34.80 -34.83 -30.31 -34.60 Tetrahydrofuran 142

-61.78 -65.70 -65.27 -63.90 12-Crown-4 143

-51.62 -52.51 -47.52 -53.59 1-Chlorooctane 144

-41.94 -36.78 -38.90 -39.33 Tetrachloroethylene 145

-38.35 -39.54 -38.61 -40.39 1-Iodobutane 146

-36.17 -36.42 -34.11 -38.15 Pentanal 147

-28.22 -29.25 -34.68 -28.37 Nitromethane 148

-32.67 -34.44 -34.82 -32.90 Nitroethane 149

-37.65 -39.65 -38.98 -38.23 1-Nitropropane 150

-27.31 -27.23 -25.40 -26.40 2-Propanol 151

-33.52 -31.28 -29.38 -34.53 1-Butanol 152

-35.59 -36.01 -33.76 -37.82 1-Pentanol 153

-42.81 -43.66 -43.19 -41.83 Ethylbenzene 154

-47.95 -48.24 -49.47 -46.74 Mesitylene 155

-65.44 -61.68 -58.80 -63.06 Biphenyl 156

-47.10 -46.20 -44.57 -48.10 Benzonitrile 157

-43.46 -47.41 -45.96 -42.97 Bromobenzene 158

-61.81 -60.75 -64.82 -60.67 1,3,5-Tribromobenzene 159

-51.43 -52.67 -54.18 -50.21 Nitrobenzene 160

-58.84 -58.52 -64.44 -56.08 4-Chloro-1-nitrobenzene 161

-33.47 -36.04 -34.05 -33.21 Pyrrole 162

-40.03 -41.67 -39.59 -38.99 N-Methylpyrrole 163

-39.08 -36.98 -35.61 -41.31 Tetrahydrothiophene 164

-24.46 -27.30 -26.54 -23.38 Dimethyl sulfide 165

-35.02 -35.83 -34.38 -38.73 Diethyl sulfide 166

-39.55 -40.79 -38.98 -48.63 γ-Butyrolactone 167

-76.87 -79.59 -68.23 -78.24 trans-Stilbene 168

-64.31 -60.20 -59.71 -62.37 1-Chloronaphthalene 169
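Because each row of Table 1 lists the four enthalpy values first, then the compound name, then the compound number, a row can be split by position even when the name contains spaces. The sketch below parses three rows copied from the test set and prints the absolute error of each model; the function name and the dictionary keys are choices made for this example.

```python
def parse_row(line):
    """Split one Table 1 row: SVM, ANN, PLS and EXP values, then name, then number."""
    tokens = line.split()
    svm, ann, pls, exp = (float(t) for t in tokens[:4])
    return {
        "number": int(tokens[-1]),
        "name": " ".join(tokens[4:-1]),
        "exp": exp,
        "pls": pls,
        "ann": ann,
        "svm": svm,
    }


rows = [parse_row(r) for r in (
    "-9.34 -10.69 -15.11 -9.20 Ethane 120",
    "-50.66 -52.70 -58.59 -50.78 cis Decalin 129",
    "-34.80 -34.83 -30.31 -34.60 Tetrahydrofuran 142",
)]

for row in rows:
    errors = {m: round(abs(row[m] - row["exp"]), 2) for m in ("pls", "ann", "svm")}
    print(row["number"], row["name"], errors)
```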

Cite this article as:

Zahra Dashtbozorgi et al.: Quantitative structure–property relationship studies for predicting gas to carbon tetrachloride

solvation enthalpy based on partial least squares, artificial neural network and support vector machine.

Global J. Phys. Chem. 2012, 3: 13