

Chemometrics and Intelligent Laboratory Systems 102 (2010) 123–129

Journal homepage: www.elsevier.com/locate/chemolab

Counter-propagation artificial neural networks as a tool for prediction of pKBH+ for series of amides

Goran Stojković a, Marjana Novič b, Igor Kuzmanovski a,⁎

a Institut za hemija, PMF, Univerzitet “Sv. Kiril i Metodij”, PO Box 162, 1001 Skopje, Macedonia
b National Institute of Chemistry, Ljubljana, Hajdrihova 19, SLO-1115 Ljubljana, Slovenia

Article info

Article history:
Received 27 January 2010
Received in revised form 8 April 2010
Accepted 14 April 2010
Available online 27 April 2010

Keywords:
Counter-propagation artificial neural networks
Genetic algorithms
QSPR
Prediction of pK

Abstract

In this work, counter-propagation artificial neural networks (CPANN) were used as a tool for the development of interpretable quantitative structure–property relationship (QSPR) models for prediction of pKBH+ values of a series of amides. The methodology is based on our recently developed algorithm for automatic adjustment of the relative importance of the input variables during training of the CPANN. Using this algorithm we were able to develop several simple QSPR models.

One of the best models, discussed in detail in the article, has only three interpretable descriptors: the number of halogen atoms in the structure, the energy of the lowest unoccupied molecular orbital (LUMO), which reflects the electronic properties of the molecules, and the average molecular weight. The final analysis of this model shows that the descriptor most responsible for modeling of the pKBH+ values is the number of halogen atoms present in the structures. The LUMO energy has a similar relative importance; this descriptor helps in grouping similar substances in different parts of the CPANN. The average molecular weight, with a relative importance nearly seven times smaller than that of the previous two descriptors, is related to the influence on pKBH+ of the presence of, in most cases, more than one halogen atom in the structures. Finally, the developed models have excellent generalization performance, which was checked using an independent test set.

© 2010 Elsevier B.V. All rights reserved.

⁎ Corresponding author. E-mail address: [email protected] (I. Kuzmanovski).

0169-7439/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2010.04.013

1. Introduction

Acid–base properties and protonation–deprotonation equilibria are widely studied chemical phenomena. The importance of accurate and comparable determination of pK values has resulted in the development and application of many analytical techniques [1]. Information about the protonation behavior of weak bases helps in detailed kinetic analysis of hydrolysis. It also helps in better interpretation of quantitative structure–activity relationship (QSAR) studies [2–5].

Basicity and solvation of aliphatic amides are intrinsically important parameters for understanding the behavior of biochemical systems. Protonation has an important catalytic role in the hydrolysis of the amide bond in peptides. As a consequence, simple amides (formamide, for example) are widely used as model compounds for studying the protonation site in strained amides [6]. The pK values of some substituted benzamides are used for investigating the kinetics of the deamination of amides by nitrous acid [7].

Protonation constants of weak organic bases are obtained by analyzing the changes of some physical properties of the observed substrates, going from the free towards the protonated base by increasing the acidity of the solutions [8,9]. However, in carbonyl compounds, such as amides, the situation is more complex since the spectra of one or both forms are usually subject to substantial medium effects. Both commonly applied methods for the study of protonation equilibria (UV–Vis and NMR spectroscopy) are affected to some extent by the medium effects that accompany the change of the acid concentration. Various methods (more than twenty [10]) have been devised to correct for the influence of the medium (principal component analysis, PCA [11–17], target factor analysis [10,18,19], etc.).

The results obtained using PCA for removal of the medium effect are rather satisfactory for aromatic amides, but not for aliphatic amides. Aliphatic amides have been conveniently studied by NMR spectroscopy [20–26], where the solvent effect can be adequately handled [20]. UV spectroscopy is not appropriate for this purpose because the hypsochromic shift of the absorption band of the protonated base to below 190 nm [20] results in non-monotonic reconstructed spectra which are unsuitable for pKBH+ estimation [27].

Table 1. The preselected descriptors used during the search by genetic algorithms for the models with the best generalization performance.

| No. | Abbreviation | Description |
|---|---|---|
| 1 | ALOGP | Ghose–Crippen octanol–water partition coefficient |
| 2 | SPAM | Diameter of the molecule |
| 3 | SPH | Spherosity |
| 4 | ASP | Asphericity |
| 5 | IAC | Total information index of atomic composition |
| 6 | IC1 | Information content index (neighborhood symmetry of 1-order) |
| 7 | XMOD | Modified Randic connectivity index |
| 8 | MW | Molecular weight |
| 9 | AMW | Average molecular weight |
| 10 | Ss | Sum of Kier–Hall electrotopological states |
| 11 | Mv | Mean atomic van der Waals volume (scaled on carbon atom) |
| 12 | Me | Mean atomic Sanderson electronegativity (scaled on carbon atom) |
| 13 | Mp | Mean atomic polarizability (scaled on carbon atom) |
| 14 | nF | Number of fluorine atoms |
| 15 | nCL | Number of chlorine atoms |
| 16 | nX | Number of halogen atoms |
| 17 | nArCONHR | Number of secondary amides (aromatic) |
| 18 | nCHRX2 | Number of CHRX2 groups |
| 19 | nHAcc | Number of acceptor atoms for H-bonds (N, O, F) |
| 20 | nHBonds | Number of intramolecular H-bonds |
| 21 | HOMO | Energy of the highest occupied molecular orbital |
| 22 | LUMO | Energy of the lowest unoccupied molecular orbital |
| 23 | TotE | Total energy of the molecule |

Fig. 1. Graphical representation of CPANN. The input vectors are mapped in the Kohonen layer, the winning neuron is selected, and then, in both layers, its weights and the weights of the neurons in its neighborhood are corrected.

Hammett's equation [28] has been successfully applied for rationalizing rates and equilibria of meta- and para-substituted benzene derivatives [15,28] in terms of empirically obtained substituent constants (σ) and reaction constants (ρ). The lack of literature data for σ for substituents in the ortho-position makes attempts at further correlations with Hammett's constants impossible. The only available comparable investigation is on the thirteen N-methoxy-polynitroanilines [29]. Additionally, this equation is practically inapplicable for aliphatic compounds because of the significant contribution of the steric effect (besides the polar one). Usually, the combined contribution of both effects is expressed by Taft's equation [30].
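For orientation, the two relationships mentioned above can be written in their standard textbook forms. These are not reproduced from this paper; the symbols $K_0$ and $k_0$ (the constants for the unsubstituted reference compound), $\rho^{*}$, $\sigma^{*}$ (polar terms), $\delta$ and $E_s$ (steric terms) follow common usage and are assumptions as far as the notation of refs. [28,30] is concerned.

$$\log_{10}\frac{K}{K_0} = \rho\,\sigma \qquad \text{(Hammett)}$$

$$\log_{10}\frac{k}{k_0} = \rho^{*}\sigma^{*} + \delta E_s \qquad \text{(Taft)}$$

The additional steric term $\delta E_s$ in Taft's equation is what accounts for the steric contribution that Hammett's equation lacks for aliphatic compounds.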

When Hammett's equation [28] was applied to the N,N-dialkyl formamides and acetamides, in any possible combination with Taft's constants, no satisfactory correlation was obtained [28], a conclusion similar to that formulated by Bagno et al. [20].

Thus, it is impossible to determine accurate and comparable pKBH+ values of this insufficiently investigated group of amides. In order to solve these and similar problems, different chemometric methods, including artificial neural networks, have been employed to a significant extent in recent years [29,31–35].

A quantitative structure–property relationship (QSPR) study [29] has been presented for the acidities of thirteen N-methoxy-polynitroaniline derivatives. A QSPR study [31] has also been performed for prediction of the acidity constants of some recently synthesized 9,10-anthraquinone derivatives in binary methanol–water mixtures. The possibility of predicting protonation thermodynamic constants using Comparative Molecular Field Analysis has also been presented [32]. Artificial neural networks (ANN) optimized by genetic algorithms have been used for the prediction of the acidity constants of some 1-hydroxy-9,10-anthraquinone derivatives using quantum chemical descriptors [33]. Furthermore, ANN have been successfully used to predict the acidity constants (pKa) of 128 phenols with diverse chemical structures using a quantitative structure–activity relationship [34].

Additionally, there are several commercially available software packages for this purpose (ACD/pKa Batch [36], ChemDB [37], ChemAxon [38], Epik/Schrödinger [39]).

In this paper we present our results from an attempt to create a simple and interpretable model for prediction of pKBH+ values for amides using counter-propagation artificial neural networks (CPANN). To the best of our knowledge, the application of this modeling algorithm to amides has not been reported in the literature. Furthermore, using genetic algorithms for optimization, we developed a procedure which simplifies the interpretation of the developed model [40].

2. The data

Most of the data used in this work were collected from the literature [13,20–22,25,41,42], while the remaining data were obtained in our laboratory [15,27]. After removal of the duplicate entries, the final data set consisted of 92 structures of different amides.

The QSPR modeling was performed on the basis of calculated descriptors. For this purpose a large number of descriptors were calculated using Dragon 5.4 [43]. Additionally, several quantum chemical descriptors were calculated using MOPAC [44].

From the set of calculated descriptors, a subset of descriptors with the largest correlation with pKBH+ was selected. Next, the correlation coefficients for each pair of descriptors in the subset were calculated. From each pair of descriptors with a correlation coefficient larger than 0.8, the one which was simpler to interpret was kept. The final set used for modeling of pKBH+ is composed of 23 descriptors (see Table 1).
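The preselection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the choice of "the simpler descriptor" was a human judgment in the paper, so the sketch replaces it with a hypothetical `priority` ranking (lower means easier to interpret), and it resolves correlated pairs greedily. The function name `preselect` and the parameter `n_keep` are likewise made up for this example.

```python
import numpy as np

def preselect(X, y, names, n_keep, priority, r_max=0.8):
    """Keep the n_keep descriptors most correlated with y, then prune correlated pairs.

    X: (objects x descriptors) array, y: property values, names: descriptor labels,
    priority: hypothetical dict mapping a label to an interpretability rank.
    """
    # Absolute Pearson correlation of every descriptor column with the property
    r_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top = [int(j) for j in np.argsort(-r_y)[:n_keep]]
    # Visit candidates from the most to the least interpretable one (assumed ranking)
    top.sort(key=lambda j: priority[names[j]])
    kept = []
    for j in top:
        # Keep j only if it is not strongly correlated with an already kept descriptor
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= r_max for k in kept):
            kept.append(j)
    return [names[j] for j in kept]
```

With this greedy pass, whenever two candidates correlate above `r_max`, the one ranked as less interpretable is dropped if the better-ranked one was kept.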

3. Methods

Counter-propagation artificial neural networks [45,46] are an artificial neural network algorithm widely used in chemistry [46,47]. This type of ANN has two layers (Fig. 1). The first, or Kohonen, layer is responsible for mapping the multidimensional data onto a lower-dimensional grid of neurons. The second layer, which is called the output or Grossberg layer [48,49], serves as a pointing device.

The optimization of this type of artificial neural network is performed in a similar manner to the optimization of Kohonen self-organizing maps [46]. The only difference between these two types of networks is that, in the case of CPANN, the winning neuron is selected only by comparison of the input variables of the training objects with the corresponding weight levels from the Kohonen layer (Fig. 1). After the winning neuron is found, the correction of the weights is performed simultaneously in both layers.

The mapping of the objects performed in the Kohonen layer helps in grouping the data according to their similarity. Since objects with similar input variables usually have similar values of the output variables, the CPANN algorithm is suitable for modeling purposes.
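A single CPANN training step, as described above, might look like the following minimal sketch. The key point it illustrates is that the winner is chosen from the input (Kohonen) weights only, while the correction is applied to both layers. The triangular neighborhood function, the Chebyshev grid distance and the names `cpann_update` and `cpann_predict` are assumptions made for this illustration; the program used by the authors (built on SOM Toolbox) may differ in all of these details.

```python
import numpy as np

def cpann_update(W_in, W_out, grid, x, y, lr, radius):
    """One training step; W_in and W_out (float arrays) are modified in place.

    W_in: (neurons x inputs) Kohonen weights, W_out: (neurons x outputs) output weights,
    grid: (neurons x 2) map coordinates. Returns the index of the winning neuron.
    """
    # Winner: smallest squared Euclidean distance, computed on the inputs only
    winner = int(np.argmin(((W_in - x) ** 2).sum(axis=1)))
    # Chebyshev distance on the map between every neuron and the winner
    d = np.abs(grid - grid[winner]).max(axis=1)
    # Triangular neighborhood (assumed): 1 at the winner, 0 beyond the radius
    h = np.where(d <= radius, 1.0 - d / (radius + 1.0), 0.0)
    # The same correction factor is applied in both layers
    W_in += lr * h[:, None] * (x - W_in)
    W_out += lr * h[:, None] * (y - W_out)
    return winner

def cpann_predict(W_in, W_out, x):
    """Prediction: read the output weights stored at the winning neuron."""
    winner = int(np.argmin(((W_in - x) ** 2).sum(axis=1)))
    return W_out[winner]
```

In a full training run this step would be repeated for every object and epoch while `lr` and `radius` are gradually reduced.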

Genetic algorithms. In order to perform the optimization of the CPANN in an automated manner, we used genetic algorithms (GA). This algorithm has proven to be a valuable tool for solving different chemical problems [50].

In order to extract information about which of the input variables (descriptors in our case) have more influence on the obtained results, in this work we used GA not only for (1) selection of the most suitable descriptors, (2) finding the most suitable network size and (3) finding the number of training epochs, but also for the adjustment of the relative importance of the input variables [40].

The approach for automatic adjustment of relative importance [40] can be used with CPANN because there are no interactions between the weights from the different weight levels during their correction in the training phase. The results obtained with adjusted relative importance can give better insight into the factors (variables) that have a larger influence on the mapping of the data and, at the same time, on the modeling of the pKBH+.

For this purpose the crossover operator was applied in a standard way, while some restrictions were applied to the mutation. Namely, in the case where GA is used for selection of the input variables (descriptors in our case) and for adjustment of their relative importance, the mutation is applied in two stages (Fig. 2) [40]. In the first stage the mutation is applied only in the part of the chromosome responsible for the selection of the descriptors. In the second stage the mutation is applied only in the part of the chromosome responsible for the adjustment of the relative importance of the selected descriptors.

If these restrictions on the mutation are not applied, the chromosome could lose information about the relative importance of the variables not selected in the model it defines. This procedure helps to preserve the information about the relative importance (obtained in previous generations) of the descriptors not selected by the model defined by the current chromosome.
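The two-stage mutation can be illustrated with a short sketch. It assumes bit-string genes and bit-flip mutation, and it assumes that the second stage looks at the selection as it stands after the first stage; neither detail is stated explicitly in the text. The function name `mutate` and the split of the chromosome into `sel` and `imp` are hypothetical.

```python
import random

def mutate(sel, imp, p_sel, p_imp, rng=random):
    """Two-stage bit-flip mutation.

    sel: list of 0/1 selection genes (one per descriptor),
    imp: list of 5-bit lists holding the relative importance of each descriptor,
    p_sel, p_imp: per-gene mutation probabilities for the two parts.
    Returns mutated copies; the inputs are left untouched.
    """
    # Stage 1: mutate only the descriptor-selection genes
    new_sel = [b ^ 1 if rng.random() < p_sel else b for b in sel]
    # Stage 2: mutate importance genes only for descriptors that are selected,
    # so the importance stored for unselected descriptors is preserved
    new_imp = []
    for chosen, bits in zip(new_sel, imp):
        if chosen:
            bits = [b ^ 1 if rng.random() < p_imp else b for b in bits]
        new_imp.append(list(bits))
    return new_sel, new_imp
```

Because unselected descriptors are skipped in stage 2, a descriptor that re-enters the model in a later generation starts from the importance it had when it was last selected.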

Fig. 2. In order not to lose valuable information in the genes responsible for the relative importance of the descriptors not selected in the model defined by the current chromosome, the mutation is applied only to those parts of the chromosome that correspond to the selected descriptors. The genes in the part responsible for the adjustment of relative importance of the selected descriptors are later transformed into decimal numbers which represent the relative importance of the input variables.

Software. All programs used in this work were developed in the Matlab environment [51] using our recently developed program for CPANN [52] (based on SOM Toolbox [53]), the algorithm for automatic adjustment of the relative importance of the input variables [40] and the Genetic Algorithms Toolbox developed at the University of Sheffield [54].

4. Results and discussion

As previously stated, the entire optimization procedure used in this work is based on genetic algorithms. The use of GA was necessary to avoid the trial-and-error approach, which, with a large number of variables, would most probably produce models with suboptimal performance.

The chromosomes were encoded as follows: 23 genes were used for selection of the descriptors (Table 1); 3 genes were used for adjustment of the width of the CPANN (in the interval 5–12); 3 genes were used for adjustment of the length of the CPANN (in the interval 5–12); 4 genes were used for finding the most suitable number of epochs in the rough training phase (in the interval 10–25); and 7 genes were used for finding the most suitable number of epochs in the fine training phase (the final number of epochs in this phase was increased by twice the number of training epochs in the rough training phase, in order to make sure that the number of epochs in this phase is larger than in the previous phase). Additionally, 115 (=5×23) genes were used for adjustment of the relative importance of the selected descriptors, with 5 genes for the relative importance of each descriptor. The relative importance was adjusted in the interval 1/32–32/32. Due to the complexity of the optimization problem, the size of the population was 100 chromosomes.
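To make the encoding concrete, the following sketch decodes a 155-gene binary chromosome (23 + 3 + 3 + 4 + 7 + 115) into model settings. The gene order, the plain binary (most significant bit first) coding, the 0–127 range of the raw fine-phase value and the mapping of a 5-bit value k to an importance of (k + 1)/32 are all assumptions chosen so that the stated intervals are reproduced; the paper does not spell them out. The names `decode` and `bits_to_int` are hypothetical.

```python
def bits_to_int(bits):
    """Interpret a list of 0/1 genes as an unsigned binary number (MSB first)."""
    return int("".join(str(b) for b in bits), 2)

def decode(chrom):
    """Decode a 155-gene chromosome into a dictionary of model settings."""
    if len(chrom) != 155:
        raise ValueError("expected 155 genes")
    sel = chrom[0:23]                          # descriptor selection
    width = 5 + bits_to_int(chrom[23:26])      # 3 genes -> 5..12
    length = 5 + bits_to_int(chrom[26:29])     # 3 genes -> 5..12
    rough = 10 + bits_to_int(chrom[29:33])     # 4 genes -> 10..25
    # 7 genes, then increased by twice the number of rough-phase epochs
    fine = bits_to_int(chrom[33:40]) + 2 * rough
    imp_bits = chrom[40:155]                   # 23 blocks of 5 genes
    importance = {}
    for j in range(23):
        if sel[j]:
            k = bits_to_int(imp_bits[5 * j:5 * j + 5])
            importance[j] = (k + 1) / 32       # 1/32 .. 32/32
    return {"width": width, "length": length, "rough_epochs": rough,
            "fine_epochs": fine, "importance": importance}
```

Under this assumed mapping every relative importance is a multiple of 1/32; a value such as 0.625, for example, corresponds to 20/32.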

In chemistry, square CPANNs are most often used [46,47]. Our CPANN program is based on SOM Toolbox [53], which initializes the weights along the directions defined by the first two principal components. This weight initialization algorithm helps in faster training of Kohonen self-organizing maps. As a consequence, in order to better cover the space defined by the first two principal components, the use of non-square CPANNs is preferred.

Before the optimization started, the data set was divided into a training set consisting of 67% of the objects and a test set composed of the remaining 33% of the objects. The Kennard–Stone algorithm was used for this purpose [55].
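The Kennard–Stone split can be sketched as follows. This is the commonly described form of the algorithm (start from the two most distant objects, then repeatedly add the object farthest from its nearest already selected neighbor); the use of Euclidean distance and the choice of which variables enter `X` are assumptions here, since the text does not specify them. The function name `kennard_stone` is made up for this example.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) object indices; selected has n_select entries."""
    # Full matrix of pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Start with the two objects that are farthest apart
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining object to its nearest selected object
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        # Add the object whose nearest selected neighbor is farthest away
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

The selected objects would form the training set, so that it spans the descriptor space as uniformly as possible, and the remaining objects would form the test set.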

The optimization performed using GA lasted for 600 generations. The performance of each individual chromosome in the population was evaluated using the following performance function:

$$\mathrm{Perf} = 0.5 \cdot \mathrm{RMSEP}_{\mathrm{TrS}} + 0.5 \cdot \mathrm{RMSEP}_{\mathrm{CV}} + \mathrm{pen}.$$

Fig. 3. The change of the mutation as a function of the number of generations (dashed line: the part of the chromosomes responsible for adjustment of the relative importance; solid line: the remaining part of the chromosomes).


Fig. 4. The weight levels of the trained CPANN together with the relative importance of the descriptors (given in the brackets next to the labels of the descriptors).


In this equation, $\mathrm{RMSEP}_{\mathrm{TrS}}$ is the root mean squared error of prediction for the training data set and $\mathrm{RMSEP}_{\mathrm{CV}}$ is the root mean squared error of prediction for leave-two-out cross-validation. The last term in the performance function, labeled pen, is a penalty parameter. It was introduced in order to force GA to search for QSPR models with a smaller number of descriptors, which are easier to interpret. The penalty parameter was defined as follows:

$$\mathrm{pen} = \begin{cases} 0 & \text{if } N_d \le 5 \\ N_d/10 & \text{if } N_d > 5 \end{cases}$$

In the previous expression, $N_d$ represents the number of descriptors included in the model.
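The performance function is simple enough to write out directly. The sketch below reads the penalty as N_d/10 for models with more than five descriptors, which is an interpretation of the extracted formula, and it assumes lower values of the function are better. The function names are hypothetical, and the two RMSEP values would in practice come from the trained CPANN and from leave-two-out cross-validation.

```python
import math

def rmse(expected, found):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((e - f) ** 2 for e, f in zip(expected, found)) / len(expected))

def penalty(n_desc):
    """No penalty up to five descriptors, n_desc / 10 above that (assumed reading)."""
    return 0.0 if n_desc <= 5 else n_desc / 10

def performance(rmsep_trs, rmsep_cv, n_desc):
    """Equal-weight mix of training and cross-validation error plus the penalty."""
    return 0.5 * rmsep_trs + 0.5 * rmsep_cv + penalty(n_desc)
```

With this reading, a model with eight descriptors is charged an extra 0.8 on top of its error terms, so it can only compete with a small model if its errors are much lower.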

Fig. 5. Expected vs. found values of pKBH+ for the training set (left) and for the test set (right).

The mating pairs were formed using the roulette wheel selection rule and the genetic material between them was exchanged using two-point crossover.

In the part of the chromosomes responsible for adjustment of the relative importance, the mutation was applied with the restrictions described earlier in this manuscript. In order to force the search for the most appropriate relative importance of the selected variables, larger mutations were applied in this part of the chromosomes. Until generation 120 the mutation was 0.25 (Fig. 3). Until generation 300 the mutation was kept at 0.20, after which it was lowered to 0.15 and kept at this level until the end of the optimization using GA. In the remaining part of the chromosomes the mutation was kept at 0.15 until generation 200. After that it decreased linearly to 0.05 at generation 400 and was then kept at that level.
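The mutation schedule described in the text can be summarized as a small function of the generation number. Whether the boundary generations (120, 200, 300, 400) belong to the earlier or the later interval is not stated, so the inclusive comparisons below are an assumption, as is the function name `mutation_rates`.

```python
def mutation_rates(gen):
    """Return (rate for importance genes, rate for the remaining genes) at generation gen."""
    # Part of the chromosome that encodes the relative importance
    if gen <= 120:
        imp = 0.25
    elif gen <= 300:
        imp = 0.20
    else:
        imp = 0.15
    # Remaining part of the chromosome
    if gen <= 200:
        rest = 0.15
    elif gen <= 400:
        # Linear decrease from 0.15 at generation 200 to 0.05 at generation 400
        rest = 0.15 - 0.10 * (gen - 200) / 200
    else:
        rest = 0.05
    return imp, rest
```

The importance genes never drop below the starting rate of the remaining genes, which reflects the stated intent of searching the relative-importance part more aggressively.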


Table 2. The amides used in this study. The labels of the molecules which were used as the test set are given with bold numbers.

| Label | Chemical name | Formula | pKBH+ | Reference |
|---|---|---|---|---|
| 1 | 2-Methylbenzamide | C8H9NO | −1.64 | [13] |
| 2 | 2-Chlorobenzamide | C7H6ClNO | −2.11 | [13] |
| 3 | 2-Bromobenzamide | C7H6BrNO | −2.26 | [13] |
| 4 | 2-Nitrobenzamide | C7H6N2O3 | −1.9 | [13] |
| 5 | 2-Ethoxybenzamide | C9H11NO2 | −1.32 | [13] |
| 6 | 2-Aminobenzamide | C7H8N2O | −2.65 | [13] |
| 7 | 2-Fluorobenzamide | C7H6FNO | −1.98 | [13] |
| 8 | N-Methylbenzamide | C8H9NO | −1.46 | [25] |
| 9 | N-Ethylbenzamide | C9H11NO | −1.52 | [25] |
| 10 | N-Isopropylbenzamide | C10H13NO | −1.54 | [25] |
| 11 | N-Isobutylbenzamide | C11H15NO | −1.5 | [25] |
| 12 | N-sec-Butylbenzamide | C11H15NO | −1.55 | [25] |
| 13 | 2-Bromo-N-t-butylbenzamide | C11H14BrNO | −1.34 | [25] |
| 14 | N-Benzylbenzamide | C14H13NO | −1.83 | [25] |
| 15 | N-(1-Phenylethyl)benzamide | C15H15NO | −1.58 | [25] |
| 16 | N,N-Dimethylbenzamide | C9H11NO | −1 | [25] |
| 17 | N,N-Diethylbenzamide | C11H15NO | −1.14 | [25] |
| 18 | N,N-Diisopropylbenzamide | C13H19NO | −0.55 | [25] |
| 19 | 3-Methylbenzamide | C8H9NO | −1.37 | [15] |
| 20 | 3-Chlorobenzamide | C7H6ClNO | −1.65 | [15] |
| 21 | 3-Nitrobenzamide | C7H6N2O3 | −2.01 | [15] |
| 22 | 4-Methylbenzamide | C8H9NO | −1.44 | [15] |
| 23 | 4-Chlorobenzamide | C7H6ClNO | −1.66 | [15] |
| 24 | 4-Nitrobenzamide | C7H6N2O3 | −2.28 | [15] |
| 25 | N-Ethyl-3-methylbenzamide | C10H13NO | −1.88 | [41] |
| 26 | 3-Chloro-N-ethylbenzamide | C9H10ClNO | −2.28 | [41] |
| 27 | 3-Bromo-N-ethylbenzamide | C9H10BrNO | −2.4 | [41] |
| 28 | N-Ethyl-3-nitrobenzamide | C9H10N2O3 | −2.54 | [41] |
| 29 | N-Ethyl-4-methylbenzamide | C10H13NO | −1.77 | [41] |
| 30 | 4-Chloro-N-ethylbenzamide | C9H10ClNO | −2.21 | [41] |
| 31 | N-Ethyl-4-methoxybenzamide | C10H13NO2 | −1.6 | [41] |
| 32 | N-(2,2,2-Trifluoroethyl)benzamide | C9H8F3NO | −3.33 | [41] |
| 33 | 3-Methyl-N-(2,2,2-trifluoroethyl)benzamide | C10H10F3NO | −3.2 | [41] |
| 34 | 3-Chloro-N-(2,2,2-trifluoroethyl)benzamide | C9H7ClF3NO | −3.62 | [41] |
| 35 | 3-Nitro-N-(2,2,2-trifluoroethyl)benzamide | C9H7F3N2O3 | −3.83 | [41] |
| 36 | 4-Methyl-N-(2,2,2-trifluoroethyl)benzamide | C10H10F3NO | −3 | [41] |
| 37 | 4-Chloro-N-(2,2,2-trifluoroethyl)benzamide | C9H7ClF3NO | −3.41 | [41] |
| 38 | 4-Methoxy-N-(2,2,2-trifluoroethyl)benzamide | C10H10F3NO2 | −2.7 | [41] |
| 39 | N-(4-Hydroxyphenyl)acetamide | C8H9NO2 | −0.89 | [42] |
| 40 | N-(4-Methoxyphenyl)acetamide | C9H11NO2 | −1.2 | [42] |
| 41 | N-(4-Ethoxyphenyl)acetamide | C10H13NO2 | −1.14 | [42] |
| 42 | N-p-Tolylacetamide | C9H11NO | −1.16 | [42] |
| 43 | N-o-Tolylacetamide | C9H11NO | −2.13 | [42] |
| 44 | N-Phenylacetamide | C8H9NO | −1.43 | [42] |
| 45 | N-(4-Fluorophenyl)acetamide | C8H8FNO | −1.47 | [42] |
| 46 | N-(4-Chlorophenyl)acetamide | C8H8ClNO | −1.67 | [42] |
| 47 | N-(4-Bromophenyl)acetamide | C8H8BrNO | −1.76 | [42] |
| 48 | N-(4-Iodophenyl)acetamide | C8H8INO | −1.74 | [42] |
| 49 | N-(4-Aminophenyl)acetamide | C8H10N2O | −1.99 | [42] |
| 50 | 4-Acetamidobenzoic acid | C9H9NO3 | −1.73 | [42] |
| 51 | N-(4-Nitrophenyl)acetamide | C8H8N2O3 | −2.16 | [42] |
| 52 | Formamide | CH3NO | −1.47 | [20] |
| 53 | Acetamide | C2H5NO | −0.66 | [20] |
| 54 | Propanamide | C3H7NO | −0.86 | [20] |
| 55 | 2-Methylpropanamide | C4H9NO | −1.11 | [20] |
| 56 | t-Butylformamide | C5H11NO | −1.49 | [20] |
| 57 | N,N-Dimethylthioacetamide | C4H9NS | −2.25 | [20] |
| 58 | N-Methylformamide | C2H5NO | −1.1 | [20] |
| 59 | N-Methylacetamide | C3H7NO | −0.56 | [20] |
| 60 | N-Methylpropanamide | C4H9NO | −0.7 | [20] |
| 61 | N,2-Dimethylpropanamide | C5H11NO | −1.14 | [20] |
| 62 | N-Methyl-t-butylformamide | C6H13NO | −1.59 | [20] |
| 63 | N,N-Dimethylpropanethioamide | C5H11NS | −2.26 | [20] |
| 64 | N,N-Dimethyl-t-butylthioformamide | C7H15NS | −2.32 | [20] |
| 65 | N,N-Dimethylpropanamide | C5H11NO | −0.56 | [20] |
| 66 | N,N,2-Trimethylpropanamide | C6H13NO | −1.61 | [20] |
| 67 | N,N-Dimethyl-t-butylformamide | C7H15NO | −2.03 | [20] |
| 68 | N,N,2-Trimethylthiopropanamide | C6H13NS | −2.48 | [20] |
| 69 | N,N-Dimethylformamide | C3H7NO | −1.33 | [21] |
| 70 | N-Ethylformamide | C3H7NO | −1.42 | [21] |
| 71 | N-Isopropylformamide | C4H9NO | −1.26 | [21] |
| 72 | N-(Phenylmethyl)formamide | C8H9NO | −1.92 | [21] |
| 73 | Butanamide | C4H9NO | −1.03 | [22] |
| 74 | 2-Chloroacetamide | C2H4ClNO | −2.8 | [22] |
| 75 | 3-Methylbutanamide | C5H11NO | −1.26 | [22] |
| 76 | 2,2-Dichloro-N-methylacetamide | C3H5Cl2NO | −3.84 | [22] |
| 77 | 2,2-Dichloro-N,N-dimethylacetamide | C4H7Cl2NO | −3.73 | [22] |
| 78 | 2-Phenylacetamide | C8H9NO | −1.68 | [22] |
| 79 | t-Butylformamide | C5H11NO | −1.43 | [22] |
| 80 | N-Ethylacetamide | C4H9NO | −0.49 | [25] |
| 81 | N,N-Diethylacetamide | C6H13NO | −0.33 | [25] |
| 82 | N-t-Butylacetamide | C6H13NO | −0.41 | [25] |
| 83 | N-(Phenylmethyl)acetamide | C9H11NO | −1.08 | [25] |
| 84 | N-(4-Methoxyphenylmethyl)acetamide | C10H13NO2 | −0.94 | [25] |
| 85 | N-(4-Chlorophenylmethyl)acetamide | C9H10ClNO | −1.23 | [25] |
| 86 | N,N-Dimethylformamide | C3H7NO | −1.21 | [27] |
| 87 | N,N-Diethylformamide | C5H11NO | −0.7 | [27] |
| 88 | N,N-Diisopropylformamide | C7H15NO | −0.3 | [27] |
| 89 | N,N-Dibutylformamide | C9H19NO | −0.85 | [27] |
| 90 | N,N-Diisobutylformamide | C9H19NO | −1.13 | [27] |
| 91 | N,N-Dimethylacetamide | C4H9NO | −0.19 | [27] |
| 92 | N,N-Diethylacetamide | C6H13NO | 0.03 | [27] |


The weight levels for the best model are presented in Fig. 4. This model is very simple: it consists of only three descriptors (AMW, nX and LUMO). The relative importance of these descriptors for the presented model is given in the brackets next to the labels in Fig. 4.

The expected versus found values for the training set and for the test set are presented in Fig. 5. The correlation between the expected and found values of pKBH+ for the independent test set is very good.

The examination of the weight levels that correspond to the different descriptors and their relative importance gives a more detailed picture of how the mapping performed in the Kohonen layer helps in prediction of the pKBH+.

For the presented model the number of halogen atoms in the molecules (nX) has the largest influence on the performed mapping. For our data set this constitutional descriptor varies between 0 and 4. The compounds with halogen atoms are grouped in the central, upper central and upper right parts of the CPANN. The compounds with the largest number of halogen atoms (3 and 4) are derivatives of N-(2,2,2-trifluoroethyl)benzamide (structures 32–38 in Table 2; see Fig. 6). Three more structures in the entire data set have more than one halogen atom and two of them are also mapped in this region. These two structures are 2,2-dichloro-N-methylacetamide and 2,2-dichloro-N,N-dimethylacetamide. Comparison of the weight levels that correspond to nX and pKBH+ gives a clear picture of how this descriptor influences the prediction of the modeled property. Obviously, the inductive effect of the halogen atoms in the structures makes the conjugate acids of these substances the strongest in this data set.

For this model the average molecular weight (AMW), the descriptor with the smallest relative importance, can also be connected to the influence of the inductive effect of the halogen atoms on the pKBH+. Namely, among the ten molecules with the largest AMW, five have two or more halogen atoms. These compounds are 3-chloro-N-(2,2,2-trifluoroethyl)benzamide, 3-nitro-N-(2,2,2-trifluoroethyl)benzamide, 4-chloro-N-(2,2,2-trifluoroethyl)benzamide, 2,2-dichloro-N-methylacetamide and 2,2-dichloro-N,N-dimethylacetamide. Although fluorine is the halogen with the smallest relative atomic mass, the number of fluorine atoms in these molecules makes the AMW of these structures among the highest in this data set. Another structure that is grouped here is 2-chloroacetamide. All structures discussed so far have pKBH+ values which are among the lowest (most negative) in this data set. Additionally, five of them contain two or more halogen atoms in their structures.


Fig. 6. The weight level that corresponds to pKBH+, labeled with the numbers that correspond to the substances presented in Table 2.


Actually, this descriptor is the most important for grouping the remaining four compounds with the largest AMW (labeled as 3, 27, 47 and 48 in Table 2) close to the part of the CPANN that corresponds to smaller values of pKBH+. These molecules contain only one bromine or one iodine atom in their structure. The pKBH+ values for two of these structures (2-bromobenzamide and 3-bromo-N-ethylbenzamide) are among the highest in the data set, while the remaining two structures, N-(4-bromophenyl)acetamide and N-(4-iodophenyl)acetamide, have values smaller than the average pKBH+ in the data set.

nX takes only discrete values, yet it has the largest relative importance for the performed mapping of the data. Because it accepts only integer values, LUMO (with a relative importance of 0.625) must be responsible for the fine tuning of the mapping of the data into the CPANN.
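The interplay between a coarse, high-importance descriptor and a continuous descriptor that fine-tunes the mapping can be illustrated with a minimal sketch of winner selection in a Kohonen layer. This is not the authors' implementation: the way the relative importance enters the distance (as a multiplier of each squared difference), the value 1.0 for nX, the value 0.14 for AMW (approximated from "nearly seven times smaller"), the 2 × 2 grid and the scaled inputs are all assumptions made for illustration. Only the value 0.625 for LUMO is taken from the text.

```python
import math

# Relative importances in the order (nX, LUMO, AMW).
# 0.625 is quoted in the text; 1.0 and 0.14 are illustrative assumptions.
IMPORTANCE = (1.0, 0.625, 0.14)


def weighted_distance(x, w, importance):
    """Euclidean distance in which each squared difference is scaled
    by the relative importance of the corresponding descriptor."""
    return math.sqrt(sum(r * (xi - wi) ** 2
                         for xi, wi, r in zip(x, w, importance)))


def winning_neuron(x, kohonen_weights, importance):
    """Return (row, col) of the neuron whose weights are closest to x."""
    best, best_d = None, float("inf")
    for i, row in enumerate(kohonen_weights):
        for j, w in enumerate(row):
            d = weighted_distance(x, w, importance)
            if d < best_d:
                best, best_d = (i, j), d
    return best


# Hypothetical 2 x 2 Kohonen layer; every neuron holds (nX, LUMO, AMW)
# weights on a 0-1 scale. Rows differ in nX, columns differ in LUMO.
GRID = [
    [(0.0, 0.2, 0.5), (0.0, 0.8, 0.5)],
    [(1.0, 0.2, 0.5), (1.0, 0.8, 0.5)],
]

# A hypothetical compound without halogens, with a fairly high LUMO
# and a high AMW.
compound = (0.0, 0.7, 0.9)
print(winning_neuron(compound, GRID, IMPORTANCE))  # (0, 1)
```

In this toy grid nX selects the row, while LUMO decides between the two neurons within that row. The mismatch in AMW barely changes the distances because of its small importance.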

The structures with the largest values of LUMO (labeled in Table 2 as 52–56, 58–62, 65–67, 70, 71, 73, 75, 79–82 and 86–92) are mono- or di-N-substituted alkyl amides. These structures are mapped into the lower right corner of the CPANN, and most of them have large values of pKBH+.

Further examination of the weight level for this descriptor shows that the regions of the CPANN that correspond to structures with small LUMO energies are in the upper left and upper right corners. The common characteristic of the structures in the upper left corner is that they are nitrobenzamides (structures labeled 4, 21, 24 and 28 in Table 2). These structures are joined by N-(4-nitrophenyl)acetamide (structure number 52), which is placed in the lowest part of this region.

The small region in the upper right part of the CPANN contains three N-(2,2,2-trifluoroethyl)benzamides labeled with numbers 34, 35 and 37. The LUMO values for these structures are among the ten lowest in the data set.

Additionally, examination of Fig. 6 and comparison of its labels with those in Table 2 show that most of the benzamides are grouped in the left part of the CPANN. The exceptions are the seven structures grouped in the upper right corner (structures 32–38), which are separated because of their small pKBH+, influenced by the presence of three or more halogen atoms in the structure.

In the lower central part of the CPANN most of the grouped structures are acetamides (labels 39–44 and 49). N-(4-fluorophenyl)acetamide and N-(4-chlorophenyl)acetamide (labels 45 and 46) are also mapped here. Close to these two structures, in the region that corresponds to the highest AMW values, are mapped N-(4-bromophenyl)acetamide and N-(4-iodophenyl)acetamide. Most of these acetamides have large values of pKBH+. N-(4-bromophenyl)acetamide and N-(4-iodophenyl)acetamide have the smallest pKBH+ in this group.

Further examination of the weight levels leads to the conclusion that the largest LUMO values correspond to molecules with the largest values of pKBH+ (these molecules are grouped in the lower right corner of the CPANN). In other words, large LUMO values are related to strong bases. Conversely, small values of this descriptor correspond to strong conjugate acids (small values of pKBH+). The molecules with small values of this descriptor are grouped in the upper left and upper right parts of the CPANN.
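A visual comparison of two weight levels of this kind can be complemented by a simple number. The sketch below, which is not part of the original work, computes the Pearson correlation between a descriptor level and the output level over all neurons. The function names and the toy 2 × 2 levels are hypothetical; real levels would be taken from the trained CPANN.

```python
import math


def flatten(level):
    """Turn a 2D weight level (list of rows) into a flat list."""
    return [value for row in level for value in row]


def level_correlation(level_a, level_b):
    """Pearson correlation between two weight levels of the same shape."""
    a, b = flatten(level_a), flatten(level_b)
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("levels must have the same size and at least 2 neurons")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant level has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy levels (not values from the article): the LUMO level
# and the pKBH+ output level both increase toward the lower right corner.
lumo_level = [[0.1, 0.2],
              [0.6, 0.9]]
pk_level = [[-3.0, -2.5],
            [-1.5, -1.0]]

r = level_correlation(lumo_level, pk_level)
print(f"correlation between LUMO and pKBH+ levels: {r:.3f}")
```

A value close to +1 would agree with the reading that large LUMO values accompany large pKBH+ values. Because nX also shapes the map, a weaker correlation would not by itself contradict the visual interpretation.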

Furthermore, the separation of the area that corresponds to small LUMO values into two parts (the weakest bases in the upper right part and a group of weak bases in the upper left corner) is influenced by the presence of halogen atoms in the molecules.

Finally, we compared the prediction performance of the model discussed here with the results obtained using two different modeling algorithms. Namely, using a similar optimization procedure we developed models for prediction of pKBH+ based on back-propagation artificial neural networks (ANN) [46,56] and partial least squares (PLS) regression [57,58].

The RMSEP for the test set obtained using the best PLS regression model was 0.3355, while the RMSEP for the test set obtained using the best back-propagation ANN model was 0.3186. Compared to the RMSEP value of 0.3721 obtained for the test set by the CPANN model discussed in this article, the ANN and PLS regression models show better prediction performance. However, unlike the CPANN model, the models based on PLS and ANN are black-box models, which do not allow us to directly examine how different input variables influence pKBH+. Moreover, the algorithm for adjustment of the relative importance of the input variables [40] helps us to better understand how the descriptors influence the mapping of the data into the two-dimensional CPANN grid and the prediction of the pKBH+ values.
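For reference, the root mean squared error of prediction can be computed as in the minimal sketch below. It assumes the usual definition (square root of the mean squared difference between experimental and predicted values over the test set); the article does not spell out the formula, and the helper name `rmsep` is hypothetical. The three RMSEP values in the dictionary are those reported above.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction over a test set."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Test-set RMSEP values reported in the text for the three models.
REPORTED_RMSEP = {"CPANN": 0.3721, "PLS": 0.3355, "ANN": 0.3186}

# Rank the models from the smallest to the largest prediction error.
ranking = sorted(REPORTED_RMSEP, key=REPORTED_RMSEP.get)
print(ranking)  # ['ANN', 'PLS', 'CPANN']
```

The ranking reproduces the comparison made in the text: the ANN and PLS models predict slightly better, while the CPANN model remains the interpretable one.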

5. Conclusion

The analysis of the discussed model showed that, for the present data set, the presence of halogen atoms in the structure has the largest influence on pKBH+. The number of halogen atoms (nX) in the structures provides the rough separation of the data according to the pKBH+ values. For successful prediction of pKBH+ the LUMO energies of the examined amides are almost as important as nX. The LUMO energies of the analyzed substances are responsible for grouping similar structures in different parts of the CPANN.

In addition to the conclusions drawn from the analysis of the results obtained by the discussed model, we note that by using genetic algorithms for adjustment of the relative importance of the input variables [40] we developed a simple model (with only three descriptors). Furthermore, a model with a smaller number of input variables is also a model that is less subject to over-fitting, which, in our case, is supported by our results. Finally, a simpler model is also easier to interpret. Using a simpler model we were able to gain better insight into the structural features of the analyzed substances that help in successful prediction of the property of interest.

References

[1] R.F. Cookson, The determination of acidity constants, Chem. Rev. 74 (1974) 5–28.

[2] A. Arcelli, G. Porzi, S. Rinaldi, S. Sandri, An efficient acid hydrolysis of the ether bond assisted by the neighbouring benzamide group. Part 3, J. Chem. Soc. Perkin Trans. 2 (2001) 296–301.

[3] R.A. Cox, The mechanisms of the hydrolyses of N-nitrobenzenesulfonamides, N-nitrobenzamides and some other N-nitro amides in aqueous sulfuric acid, J. Chem. Soc. Perkin Trans. 2 (1997) 1743–1750.

[4] D.A. Marković, The hydrolysis of acrylamide and methacrylamide in aqueous sulphuric acid. I. The rate constants and the position of protonation, J. Serb. Chem. Soc. 59 (1994) 943–948.

[5] G.M. Loudon, M.R. Almond, J.N. Jacob, Mechanism of hydrolysis of N-(1-aminoalkyl) amides, J. Am. Chem. Soc. 103 (1981) 4508–4515.

[6] S.J. Cho, C. Cui, J.Y. Lee, J.K. Park, S.B. Suh, J. Park, B.H. Kim, K.S. Kim, N-protonation vs O-protonation in strained amides: ab initio study, J. Org. Chem. 62 (1997) 4068–4071.

[7] K. Al-Mallah, G. Stedman, Kinetics of the deamination of amides by nitrous acid, J. Chem. Res. (S) (1998) 670–671.

[8] J.T. Edward, I.C. Wang, Ionization of organic compounds: III. Basicities of propionic acid and propionamide, Can. J. Chem. 40 (1962) 966–975.

[9] K. Yates, J.B. Stevens, The ionization behavior of amides in concentrated sulfuric acids, Can. J. Chem. 43 (1965) 529–537.

[10] Ü. Haldna, M. Grebenkova, Evaluation of different factor analytical methods for estimation of pKBH+ and solvation parameter values of 2-hydroxybenzoic acid, Comput. Chem. 17 (1993) 241–243.

[11] J.T. Edward, S.C. Wong, Ionization of carbonyl compounds in sulfuric acid. Correction for medium effects by characteristic vector analysis, J. Am. Chem. Soc. 99 (1977) 4229–4232.

[12] G. Stojković, B. Andonovski, UV study of the protonation of benzamide and N-phenyl benzamide in sulfuric acid media, Proceedings, 3rd Aegean Analytical Chemistry Days, Polihnitos, Lesvos, Greece, University of Athens, Athens, 2002, pp. 498–501.

[13] B. Garcia, R.M. Casado, J. Castillo, S. Ibeas, I. Domingo, J.M. Leal, Acidity constants of benzamide and some ortho-substituted derivatives, J. Phys. Org. Chem. 6 (1993) 101–106.

[14] R.I. Zalewski, Adaptation of characteristic vector analysis and titration curve analysis for calculations of pKBH+ from ultraviolet–visible spectral data, J. Chem. Soc. Perkin Trans. II (1979) 1637–1639.

[15] G. Stojković, E. Popovski, Determination and structural correlation of pKBH+ for meta- and para-substituted benzamides in sulfuric acid solutions, J. Serb. Chem. Soc. 71 (2006) 1061–1071.

[16] G. Stojković, F. Anastasova, Protonation acidity constants for benzotoluidides in sulfuric acid solutions, Cent. Eur. J. Chem. 4 (2006) 56–67.

[17] E.R. Malinowski, D.G. Howery, Factor Analysis in Chemistry, Wiley, New York, 1980, pp. 32–82.

[18] Ü. Haldna, Estimation of the basicity constants of weak bases by the target testing method of factor analysis, Prog. Phys. Org. Chem. 18 (1990) 65–75.

[19] Ü. Haldna, A. Murshak, Estimation of the basicity constants of weak bases by the target testing method of factor analysis, Comput. Chem. 8 (1984) 201–204.

[20] A. Bagno, G. Lovato, G. Scorrano, Thermodynamics of protonation and hydration of aliphatic amides, J. Chem. Soc. Perkin Trans. 2 (1993) 1091–1098.

[21] M. Liler, Studies of nuclear magnetic resonance chemical shifts caused by protonation. Part II. Formamide and some N-alkyl and N,N-dialkyl derivatives, J. Chem. Soc. (B) 2 (1971) 334–338.

[22] M. Liler, Studies of nuclear magnetic resonance chemical shifts caused by protonation. Part I. Substituted acetamides and some N-methyl- and N,N-dimethyl-derivatives, J. Chem. Soc. (B) 4 (1969) 385–389.

[23] A. Bagno, V. Lucchini, G. Scorrano, Thermodynamics of protonation of N,N-dimethylthioamides in aqueous sulfuric acid, Can. J. Chem. 68 (1990) 1746–1749.

[24] A. Bagno, G. Scorrano, Acid–base properties of organic solvents, J. Am. Chem. Soc. 110 (1988) 4577–4586.

[25] R.A. Cox, L.M. Druet, A.E. Klausner, T.A. Modro, P. Wan, K. Yates, Protonation acidity constants for some benzamides, acetamides, and lactams, Can. J. Chem. 59 (1981) 1568–1573.

[26] B.G. Cox, A nuclear magnetic resonance study of the rates of protonation of dimethylacetamide and dimethylbenzamide in concentrated acid solutions, J. Chem. Soc. (B) 9 (1970) 1780–1783.

[27] G.M. Stojković, Investigation of the reactions of protonation of some amides in highly acidic media with UV spectroscopy, PhD Thesis, University “Sts. Cyril and Methodius”, Faculty of Natural Sciences and Mathematics, Institute of Chemistry, Skopje, Republic of Macedonia, 2006.

[28] L.P. Hammett, The effect of structure upon the reactions of organic compounds. Benzene derivatives, J. Am. Chem. Soc. 59 (1937) 96–103.

[29] A. Beteringhe, QSPR study on pKa values of N-methoxy-polynitroaniline derivatives, Cent. Eur. J. Chem. 3 (2005) 585–591.

[30] R.W. Taft Jr., Polar and steric substituent constants for aliphatic and o-benzoate groups from rates of esterification and hydrolysis of esters, J. Am. Chem. Soc. 74 (1952) 3120–3128.

[31] M. Shamsipur, B. Hemmateenejad, M. Akhond, H. Sharghi, Quantitative structure–property relationship study of acidity constants of some 9,10-anthraquinone derivatives using multiple linear regression and partial least-squares procedures, Talanta 54 (2001) 1113–1120.

[32] R. Gargallo, C.A. Sotriffer, K.R. Liedl, B.M. Rode, Application of multivariate data analysis methods to comparative molecular field analysis (CoMFA) data: proton affinities and pKa prediction for nucleic acid components, J. Comput.-Aided Mol. Des. 13 (1999) 611–623.

[33] B. Hemmateenejad, M.A. Safarpour, F. Taghavi, Application of ab initio theory for the prediction of acidity constants of some 1-hydroxy-9,10-anthraquinone derivatives using genetic neural network, J. Mol. Struct. (Theochem) 635 (2003) 183–190.

[34] A. Habibi-Yangjeh, M. Danandeh-Jenagharad, M. Nooshyar, Application of artificial neural networks for predicting the aqueous acidity of various phenols using QSAR, J. Mol. Model. 12 (2006) 338–347.

[35] M. Pompe, M. Randić, Variable connectivity model for determination of pKa values for selected organic acids, Acta Chim. Slov. 54 (2007) 605–610.

[36] The Advanced Chemistry Development Website, http://www.acdlabs.com/products/phys_chem_lab/pka/batch.html/ [1 April 2009].

[37] The ChemDBsoft Website, http://www.chemdbsoft.com/Physical-Property-Prediction.php/ [1 April 2009].

[38] The ChemAxon Website, http://www.chemaxon.com/product/pka.html [4 April 2009].

[39] The Schrödinger Website, http://www.schrodinger.com/ProductDescription.php?mID=6&sID=25&cID=0/ [1 April 2009].

[40] I. Kuzmanovski, M. Novic, M. Trpkovska, Automatic adjustment of the relative importance of different input variables for optimization of counter-propagation artificial neural networks, Anal. Chim. Acta 642 (2009) 142–147.

[41] D.B. Farlow, R.B. Moodie, Protonation equilibria of amides and related compounds. Part III. σ and σ+ correlations in some N-substituted benzamides, J. Chem. Soc. (B) 2 (1970) 334–336.

[42] J. Giffney, C.J. O'Connor, Spectrophotometric determination of basicity constants. Part II. Acetanilides, J. Chem. Soc. Perkin Trans. II 7 (1975) 706–712.

[43] Talete srl, Dragon for Windows (Software for Molecular Descriptor Calculations), Version 5.4-2006, http://www.talete.mi.it/.

[44] J.J.P. Stewart, MOPAC 6.0, Quantum Chemical Program Exchange, 455, 1990.

[45] R. Hecht-Nielsen, Counterpropagation networks, Appl. Optics 26 (1987) 4979–4983.

[46] J. Zupan, J. Gasteiger, Neural Networks in Chemistry and Drug Design, WILEY-VCH Verlag GmbH, Weinheim, Germany, 1999.

[47] J. Zupan, M. Novič, I. Ruisánchez, Kohonen and counterpropagation artificial neural networks in analytical chemistry, Chemom. Intell. Lab. Syst. 38 (1997) 1–23.

[48] G.A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vision Graph. Image Process. 37 (1987) 54–115.

[49] S. Grossberg, Nonlinear neural networks: principles, mechanisms, and architectures, Neural Netw. 1 (1988) 17–61.

[50] R. Leardi, A. Lupiáñez Gonzáles, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom. Intell. Lab. Syst. 41 (1998) 195–207.

[51] MATLAB 5.2, 1984–1998 Mathworks.

[52] I. Kuzmanovski, M. Novič, Counter-propagation neural networks in Matlab, Chemom. Intell. Lab. Syst. 90 (2008) 84–91.

[53] J. Vesanto, SOM-based data visualization methods, Intell. Data Anal. 6 (1999) 111–126.

[54] A. Chipperfield, P. Fleming, H. Pohlheim, C. Fonseca, Genetic Algorithm Toolbox User's Guide, University of Sheffield, Sheffield, UK, 1994.

[55] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969) 137–148.

[56] A. Bos, M. Bos, W.E. van der Linden, Artificial neural networks as a tool for soft-modelling in quantitative analytical chemistry: the prediction of the water content of cheese, Anal. Chim. Acta 256 (1992) 133–144.

[57] P. Geladi, B.R. Kowalski, Partial least-squares regression: a tutorial, Anal. Chim. Acta 185 (1986) 1–17.

[58] S. Wold, M. Sjostrom, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.