The aquatic toxicity values of 57 esters, with experimental and predicted LC50 in fish, EC50 in Daphnia and seaweed and IGC in Entosiphon sulcatum, were studied in the Principal Component space. The first component was found to be the most important with 61.8% of explained variance and can be considered as a general index of aquatic toxicity. In order to have a fast method to rank the esters according to their aquatic toxicity, the PC1 was modeled by theoretical molecular descriptors. The best model, selected by Genetic Algorithm, was verified for stability and predictivity by internal and external validation.
Gramatica, P., Battaini, F., Papa, E.
QSAR and Environmental Chemistry Research Unit, University of Insubria, Varese (Italy).
Web: http://dipbsf.uninsubria.it/qsar/ e-mail: [email protected]
QSAR PREDICTION OF AQUATIC TOXICITY OF ESTERS
INTRODUCTION
A large number of compounds (more than 100,000) are currently in common use, and
about 2,000 new ones appear each year. No data are available for the majority of these
compounds so we have no understanding of their environmental fate, their behavior or
effects [1]. This general lack of knowledge has led the European Commission to adopt
a “White Paper on a strategy for a future Community Policy for Chemicals” [2]. This
Directive requires, by the end of 2005 at the latest, physico-chemical data and toxicity
data for HPV (High Production Volume) compounds with a production volume of 1,000
tonnes/year. Among the HPV compounds, the class of esters is one of the largest and
environmentally most “interesting”. Some esters, e.g. phthalates, are known for their
weak carcinogenic and estrogenic effects [3]; there is thus a need to identify these
compounds and to assess their potential health hazard and their impact on the environment.
The aim of our research was to develop “local” QSAR (Quantitative Structure-Activity
Relationship) models to rapidly predict the toxicity of esters. As this prediction is based
simply on knowing molecular structure, the approach could be applied usefully to new
chemicals, even those not yet synthesised, if they belong to the chemical domain of the
training set. In this case it is possible to reduce the cost and the time needed for
experimental data.
RESULTS AND DISCUSSION
The most relevant molecular descriptors, calculated by the DRAGON software, were selected by Genetic Algorithm (GA – Variable Subset Selection). For each end-point the best model was validated with several validation techniques:
• Leave-one-out, using the QUIK rule (Q Under Influence of K (18)) to avoid chance correlation.
• Stronger validation using the leave-many-out procedure (15-30%).
• Y scrambling (permutation testing by recalculating models for randomly reordered responses).
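The three internal validation checks listed above can be sketched as follows. This is only an illustration on synthetic data, not the MOBY DIGS implementation: the function names, the number of repetitions and the way Q2 is averaged over the leave-many-out groups are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def r2(X, y):
    """Fitted R2 of the OLS model."""
    resid = y - predict(fit_ols(X, y), X)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out Q2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_lmo(X, y, fraction=0.3, repeats=200, seed=0):
    """Leave-many-out Q2, averaged over random hold-out groups (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = max(1, int(round(fraction * n)))
    scores = []
    for _ in range(repeats):
        out = rng.choice(n, size=n_out, replace=False)
        keep = np.setdiff1d(np.arange(n), out)
        coef = fit_ols(X[keep], y[keep])
        press = np.sum((y[out] - predict(coef, X[out])) ** 2)
        tss = np.sum((y[out] - y[keep].mean()) ** 2)
        scores.append(1.0 - press / tss)
    return float(np.mean(scores))

def y_scrambling(X, y, repeats=200, seed=0):
    """R2 values of models refitted on randomly reordered responses."""
    rng = np.random.default_rng(seed)
    return np.array([r2(X, rng.permutation(y)) for _ in range(repeats)])

# Synthetic stand-in for a 30-object, 2-descriptor data set (not the poster's data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 0.8 * X_demo[:, 0] - 1.2 * X_demo[:, 1] + rng.normal(scale=0.3, size=30)

print("R2       :", round(r2(X_demo, y_demo), 3))
print("Q2 LOO   :", round(q2_loo(X_demo, y_demo), 3))
print("Q2 LMO30%:", round(q2_lmo(X_demo, y_demo, fraction=0.3), 3))
print("mean R2 after Y scrambling:", round(y_scrambling(X_demo, y_demo).mean(), 3))
```

A model that owes its fit to chance correlation would typically show a scrambled R2 close to the real one, whereas a stable model keeps Q2 LOO and Q2 LMO close to R2.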
The models were not all validated externally owing to the small sets studied (14-30 obj.). The reliability of the
predictions was always checked by the leverage approach in order to verify the chemical domain of the models.
The regression lines of the fish and Daphnia models are reported (outliers and influential chemicals are
highlighted). Table 1 shows the performance of the best models for each end-point.
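The leverage check mentioned above can be illustrated with a short sketch. The cut-off h* = 3p'/n (p' = number of descriptors plus one) is a convention commonly used in the QSAR literature and is an assumption here, since the poster does not state which threshold was applied; the data are synthetic.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h = x (A'A)^-1 x' of each query row with respect to the training design."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)

def warning_leverage(n_train, n_descriptors):
    """Assumed cut-off h* = 3 p' / n, with p' = n_descriptors + 1."""
    return 3.0 * (n_descriptors + 1) / n_train

# Synthetic training descriptors (30 objects, 2 descriptors) and two query chemicals.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
X_new = np.array([[0.1, -0.2],    # close to the training domain
                  [10.0, 10.0]])  # far from the training domain

h_star = warning_leverage(len(X_train), X_train.shape[1])
for h in leverages(X_train, X_new):
    status = "outside" if h > h_star else "inside"
    print(f"h = {h:.3f} -> {status} the chemical domain (h* = {h_star:.2f})")
```

Predictions for chemicals whose leverage exceeds the cut-off are extrapolations and should be treated as less reliable.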
ABSTRACT
Esters are an important class of industrial chemicals, for which the EU-Directive “White Paper on a strategy for a
future Community Policy for Chemicals” requires toxicity data by, at the latest, the end of 2005. The object of the
study was to develop QSAR models to rapidly predict the aquatic toxicity of esters. Unfortunately the experimental
toxicity data are not known for a large number of these compounds or, if known, the data are not all homogeneous,
hindering an accurate and comparable evaluation of the toxicological behaviour of the considered compounds.
Different theoretical molecular descriptors (1D-constitutional, 2D-topological, and different 3D-descriptors) are
calculated by the DRAGON software. The Genetic Algorithm (GA-Variable Subset Selection) is used to select the
more relevant molecular descriptors in the modelling by Ordinary Least Squares (OLS) regression. The studied end-
points are: LC50 in Pimephales promelas, EC50 in Daphnia magna and in seaweed, IGC50 in Entosiphon sulcatum
and chronic toxicity in Daphnia magna. The best models were validated for their predictive performance using
leave-one-out (Q2LOO = 70-90%), leave-many-out (30% of perturbation, Q2LMO = 70-90%) and the scrambling of the
responses. The models were not all externally validated owing to the small size (14-30 objects) of the studied sets.
The reliability of the predictions was always checked by the leverage approach in order to verify the chemical
domain of the models. A PCA model, based on four acute toxicity end-points, has been proposed to evaluate the
trend of aquatic toxicity for the studied esters. The PC1 score is also modelled by theoretical molecular descriptors
(Q2LOO = 89%, Q2LMO = 88%): this last model can be used as an evaluative method for screening esters according to their
aquatic toxicity, starting only from their molecular structure.
End-point  Species      N.obj.  Variables            R2    Q2    Q2LMO15%  Q2LMO30%
LC50       Fish         30      DP02, n=CH2          82.5  79.2  79.7      78.8
EC50       Daphnia      30      TI1, Jhetv, GATS1v   85.1  80.8  80.2      78.2
EC50       Seaweed      12      DIPp, H8u            96    93.5  92.9      88.1
EC50       Pseudomonas  13      GATS5e, R2v+         92.5  86.5  85.9      83.7
IGC        Entosiphon   18      Me, Xindex           91.5  87.7  88.1      87.9
IGC        Scenedesmus  17      AAC, Jhetm           89.6  81.9  82.4      81.1
IGC        Pseudomonas  15      GATS1e, R5u+         83.4  74.3  73.7      71.6
LOEC       Daphnia      13      BELm4                94    90.1  91.2      90.1
NOEL       Daphnia      14      BELm4                91.4  86.9  87.3      85.5
Principal Component Analysis: Cum. E.V.% = 82% (PC1 = 61.8%)
[Figure: PCA plot of the 57 esters (numbered 1-57), PC1 vs. PC2, with the loadings of the four end-points: Fish, Seaweed, Daphnia and Entosiphon.]
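A minimal sketch of how a PCA of this kind could be computed is given below. It assumes autoscaled columns and uses a synthetic 57 x 4 matrix in place of the real toxicity data, which are not listed on the poster; the function name is hypothetical.

```python
import numpy as np

def pca_scores(data):
    """Autoscale the columns, then run PCA by SVD.

    Returns (scores, loadings, explained variance in % per component).
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u * s                      # object coordinates in PC space
    explained = 100.0 * s ** 2 / np.sum(s ** 2)
    return scores, vt.T, explained

# Synthetic stand-in: 57 objects, 4 end-points sharing one underlying trend.
rng = np.random.default_rng(1)
trend = rng.normal(size=57)
toxicity = np.column_stack([trend + rng.normal(scale=0.5, size=57) for _ in range(4)])

scores, loadings, explained = pca_scores(toxicity)
print("explained variance (%):", np.round(explained, 1))

# Rank the objects along PC1; the sign of a principal component is arbitrary,
# so the direction of the ranking has to be fixed from the loadings.
ranking = np.argsort(scores[:, 0])
print("first five objects along PC1:", ranking[:5] + 1)
```

When all end-points load on PC1 with the same sign, the PC1 score behaves as a single summary index, which is how it is used on this poster.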
AQUATIC TOXICITY = -5.76 - 0.39 TI2 + 4.99 GATS1v - 2.74 DISPp
[Figure: aquatic toxicity calculated by the model plotted against aquatic toxicity from PCA, for the training set (43 obj.) and the test set (14 obj.).]
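As an illustration of how the aquatic toxicity equation could be used for screening, the sketch below evaluates it for a few compounds and sorts them by the resulting index. The compound labels and descriptor values are hypothetical placeholders, not DRAGON output, and the poster does not state which direction of the index corresponds to higher toxicity.

```python
def aquatic_toxicity_index(ti2, gats1v, dispp):
    """PC1-based aquatic toxicity index from the regression equation on the poster."""
    return -5.76 - 0.39 * ti2 + 4.99 * gats1v - 2.74 * dispp

# Hypothetical descriptor values, for illustration only.
candidates = {
    "ester A": {"ti2": 2.0, "gats1v": 1.1, "dispp": 0.05},
    "ester B": {"ti2": 5.5, "gats1v": 0.9, "dispp": 0.30},
    "ester C": {"ti2": 1.2, "gats1v": 1.4, "dispp": 0.10},
}

indices = {name: aquatic_toxicity_index(**d) for name, d in candidates.items()}
for name, value in sorted(indices.items(), key=lambda item: item[1]):
    print(f"{name}: {value:+.2f}")
```

In practice the three descriptors would be calculated from the molecular structure, and the leverage of each new compound would be checked before the predicted index is trusted.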
Log(1/EC50) = 14.4 - 0.03 TI1 - 1.4 Jhetv - 7.1 GATS1v
[Figure: predicted vs. experimental Log(1/EC50) in Daphnia magna; methyl acrylate, butyl benzyl phthalate and glycerol trienanthate are highlighted.]
Log(1/LC50) = -2.4 + 0.7 DP02 + 1.1 n=CH2
[Figure: predicted vs. experimental Log(1/LC50) in fish; diethyl phthalate is highlighted.]
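The two single end-point equations can be applied in the same way. The sketch below assumes that "Log" is the base-10 logarithm and that concentrations are in mmol/L, as stated for the experimental data; the descriptor values passed in are hypothetical placeholders.

```python
def log_inv_lc50_fish(dp02, n_ch2):
    """Log(1/LC50) in fish; n_ch2 is the descriptor written n=CH2 on the poster."""
    return -2.4 + 0.7 * dp02 + 1.1 * n_ch2

def log_inv_ec50_daphnia(ti1, jhetv, gats1v):
    """Log(1/EC50) in Daphnia from the three-descriptor model."""
    return 14.4 - 0.03 * ti1 - 1.4 * jhetv - 7.1 * gats1v

def to_mmol_per_l(log_inv_conc):
    """Back-transform Log(1/C) to a concentration C (assumes base-10 log, mmol/L)."""
    return 10.0 ** (-log_inv_conc)

# Hypothetical descriptor values, for illustration only.
fish = log_inv_lc50_fish(dp02=2.0, n_ch2=1)
print(f"Log(1/LC50) = {fish:.2f}, LC50 = {to_mmol_per_l(fish):.3g} mmol/L")
```

A larger Log(1/LC50) corresponds to a lower LC50, i.e. to a compound that is toxic at a lower concentration.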
MOLECULAR DESCRIPTORS
The molecular structure of the studied compounds was described using several molecular descriptors calculated by the DRAGON software [8]:
• 0D descriptors – constitutional descriptors (atom and group counts)
• 1D descriptors – functional groups, atom-centred fragments and empirical descriptors
• 2D descriptors – BCUTs, Galvez indices from the adjacency matrix, walk counts, various autocorrelations from the molecular graph and topological descriptors
• 3D descriptors – Randic molecular profiles from the geometry matrix, WHIMs, GETAWAY and geometrical descriptors
CHEMOMETRIC METHODS
Multiple Linear Regression analysis and variable selection were performed with the software MOBY DIGS [9], using the Ordinary Least Squares (OLS) regression method and GA-VSS (Genetic Algorithm – Variable Subset Selection) [10]. All the calculations were performed using the leave-one-out (LOO) and leave-many-out (LMO) procedures and response scrambling for the internal validation of the models.
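The idea behind genetic-algorithm variable subset selection can be sketched in a few lines. This is a toy version written for illustration, not the MOBY DIGS algorithm: it assumes a fixed model size, uses leave-one-out Q2 of an OLS model as the fitness, and all parameter names and defaults are hypothetical.

```python
import numpy as np

def q2_loo(X, y):
    """Closed-form leave-one-out Q2 for OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                       # hat matrix
    resid = y - H @ y
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, size=2, pop_size=20, generations=40, p_mut=0.3, seed=0):
    """Search for the descriptor subset of a given size with the highest Q2 LOO."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]

    def fitness(subset):
        return q2_loo(X[:, list(subset)], y)

    def random_subset():
        return tuple(sorted(rng.choice(n_vars, size=size, replace=False).tolist()))

    def child_of(a, b):
        # Crossover: draw the child's descriptors from the union of both parents.
        pool = sorted(set(a) | set(b))
        genes = set(rng.choice(pool, size=size, replace=False).tolist())
        # Mutation: swap one selected descriptor for an unselected one.
        unused = [v for v in range(n_vars) if v not in genes]
        if unused and rng.random() < p_mut:
            genes.remove(int(rng.choice(sorted(genes))))
            genes.add(int(rng.choice(unused)))
        return tuple(sorted(genes))

    scored = {s: fitness(s) for s in {random_subset() for _ in range(pop_size)}}
    for _ in range(generations):
        ranked = sorted(scored, key=scored.get, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        for _ in range(pop_size):
            i, j = rng.choice(len(parents), size=2, replace=False)
            kid = child_of(parents[i], parents[j])
            if kid not in scored:
                scored[kid] = fitness(kid)
        # Elitism: keep only the best pop_size models for the next generation.
        best = sorted(scored, key=scored.get, reverse=True)[:pop_size]
        scored = {s: scored[s] for s in best}
    winner = max(scored, key=scored.get)
    return winner, scored[winner]

# Synthetic example: 10 candidate descriptors, only columns 1 and 4 matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 10))
y_demo = 2.0 * X_demo[:, 1] - 3.0 * X_demo[:, 4] + rng.normal(scale=0.2, size=30)
subset, q2 = ga_select(X_demo, y_demo, size=2)
print("selected descriptor columns:", subset, " Q2 LOO:", round(q2, 3))
```

Because the fitness is a cross-validated statistic rather than the fitted R2, subsets that merely overfit the training objects are penalised during the search.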
External validation [11-12] was performed on a validation set obtained by splitting the original data set at 75% with an Experimental Design procedure, applying the software DOLPHIN of Todeschini et al. [13].
Regression diagnostic tools such as residual plots and Williams plots were used to check the quality of the best models and to define their applicability with regard to the chemical domain, using the chemometric package SCAN [14]. RMS (residual mean squares) values are also reported for model comparison with ECOSAR [15].
EXPERIMENTAL DATA
The studied end-points are: LC50 in Pimephales promelas, EC50 in Daphnia magna, in Pseudomonas and in seaweed, IGC50 in Entosiphon sulcatum, in Scenedesmus and in Pseudomonas. The chronic toxicity of phthalates in Daphnia magna was also studied. The experimental data were taken from the literature [4-7], reported in mmol/L and transformed into logarithmic units.
MATERIALS & METHODS
For comparison purposes, the RMS (Residual Mean Squares) values are reported only for LC50 in fish and EC50 in seaweed, as the other end-points are not included in the ECOSAR software. The ECOSAR model for LC50 in fish and our new model show similar performance, but the EPA model for EC50 in seaweed has a much larger RMS (Tab. 2). This result appears particularly satisfactory considering that the EPIWIN model was obtained on a training set bigger than our data set.
End-point  Species  Obj. training  Variables    RMS from our model  RMS from ECOSAR
LC50       Fish     30             DP02, n=CH2  0.31                0.38
EC50       Seaweed  12             DIPp, H8u    0.13                3.47
Tab.1 – Model Performances
Tab.2 – Comparison of models
Aquatic Toxicity: n.obj = 43, R2 = 91.5%, Q2 = 89.9%, Q2LMO30% = 89.9%, Q2EXT = 95.6%
CONCLUSIONS
New predictive “local” models for ecotoxicity end-points of esters are proposed.
These models are based only on theoretical molecular descriptors selected by Genetic
Algorithm.
All models have good predictive power, verified by internal validation techniques.
Principal Component Analysis has been used to propose a ranking of esters for global
aquatic toxicity based on 4 acute toxicity end-points (LC50 in fish, EC50 in Daphnia magna and in
seaweed, IGC in Entosiphon sulcatum).
The PC1 score highlights the global trend of aquatic toxicity and is modelled by
theoretical molecular descriptors. This model can be used for the screening and ranking
of esters according to their global toxicity, starting only from their structure.
The application of these models reduces animal testing and minimises the time and money
needed to obtain experimental data.
REFERENCES
[1] Gramatica P., Fine Chemicals and Intermediates Technologies (Chemistry Today), 1991, 18-24;
[2] http://europa.eu.int/comm/environmental/chemicals/whitepaper.htm;
[3] Thomsen M. et al., Chemosphere, 1999, 38, 2613-2624;
[4] Cash G.G. and Clements R.G., SAR and QSAR in Environmental Research, 1996, 5, 113-124;
[5] European Commission – Joint Research Centre, IUCLID CD-ROM, 2000;
[6] Verschueren K., Handbook of Environmental Data on Organic Chemicals, 1983, 2nd Edition, Van Nostrand Reinhold;
[7] Rhodes J.E. et al., Environmental Toxicology and Chemistry, 1995, 14, 1967-1976;
[8] Todeschini R., Consonni V. and Pavan E., 2002. DRAGON – Software for the calculation of molecular descriptors, rel. 1.12 for Windows. Free download available at http://www.disat.unimib/chm.;
[9] Todeschini R., 2001. Moby Digs – Software for multilinear regression analysis and variable subset selection by Genetic Algorithm, rel. 2.3 for Windows, Talete srl, Milan (Italy);
[10] Leardi R., Boggia R. and Terrile M., J. Chemom., 1992, 6, 267-281;
[11] Wold S. and Eriksson L., Chemometric Methods in Molecular Design, 1995, VCH, Germany, 309-318;
[12] Golbraikh A. and Tropsha A., J. Mol. Graph. and Mod., 2002, 20, 269-276;
[13] Todeschini R. and Mauri A., 2000. DOLPHIN – Software for Optimal Distance-based Experimental Design, rel. 1.1 for Windows, Talete srl, Milan (Italy);
[14] SCAN – Software for Chemometric Analysis, rel. 1.1 for Windows, Jerll. Inc., Standard, CA, 1992;
[15] ECOSAR in EPIWIN-EPI Suite 2001, Ver. 3.10, Environmental Protection Agency (http://www.epa.gov).