QSAR Modelling of Carcinogenicity for Regulatory Use in Europe
Natalja Fjodorova, Marjana Novič, Marjan Vračko, Marjan Tušar
National Institute of Chemistry, Ljubljana, Slovenia
CAESAR Meeting, 17.11.2008, Berlin, Germany
Overview
• Carcinogenic potency prediction: state of the art
• Data and methods used for modelling by NIC_LJU
• Statistical performance of the obtained models and their evaluation
• Some findings about structural alerts
• Conclusion
Carcinogenic potency prediction: state of the art
The QSAR models can be divided into two families:
• congeneric (for certain classes of chemicals); external prediction accuracy for rodent carcinogenicity is 58 to 71%
• noncongeneric (for different classes of chemicals); accuracy is around 65%
Further studies are required to improve the predictive reliability for noncongeneric chemicals.
Ref.: Romualdo Benigni, Cecilia Bossa, Tatiana Netzeva, Andrew Worth. Collection and Evaluation of (Q)SAR Models for Mutagenicity and Carcinogenicity. EUR 22772 EN, 2007.
Carcinogenicity prediction in scope of the CAESAR project
• The chemicals involved in the study belong to different chemical classes (noncongeneric substances).
• The work addresses industrial chemicals, with reference to the REACH initiative. The aim is to cover as much of the chemical space as possible.
Present state:
- compilation of the dataset for carcinogenicity
- cross-checking of structures
- calculation of descriptors
- selection of descriptors
- development of models for carcinogenicity
- investigation of structural alerts (SA): ongoing
Dataset: 805 chemicals were extracted from rodent carcinogenicity study findings for 1481 chemicals taken from the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network (http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html), derived from the Lois Gold Carcinogenic Potency Database (CPDBAS).
Response:
• for quantitative models: TD50_Rat, the carcinogenic potency in rat (expressed in mmol/kg body wt/day)
• for qualitative models: yes/no principle
P: positive (active)
NP: not positive (inactive)
Training and test sets
The 805 chemicals were split into a training set (644 chemicals) and a test set (161 chemicals). The split was done at the Helmholtz Centre for Environmental Research (UFZ), Germany.
Distribution of active (P) and inactive (NP) chemicals in the total, training and test sets
Descriptors:
• 254 MDL descriptors calculated by MDL QSAR software (254MDLdes_806carcinogenicity.rar file)
• 835 Dragon descriptors calculated by DRAGON software (Dragon_Carc.xls file)
• 88 CODESSA descriptors calculated using CODESSA software (88_CODESSA_descr_Cancer.xls file)
Descriptors used for modeling
Model CARC_NIC_CPANN_01: 27 MDL descriptors provided by NIC_LJU (method for variable selection: Kohonen network and PCA).
Model CARC_NIC_CPANN_02: 18 DRAGON and MDL descriptors taken from one of the best models (CARC_CSL_KNN_05) developed by CSL. The goal was to compare the results obtained for carcinogenicity prediction using different methods.
Model CARC_NIC_CPANN_03: 34 CODESSA descriptors taken from one of the best models (CARC_CSL_KNN_02) developed by CSL.
(Methods for variable selection for models 2 and 3: cross-correlation matrix, multicollinearity technique, Fisher ratio and genetic algorithm.)
Counter Propagation Artificial Neural Network
Step 1: mapping of the molecule Xs (vector representing the structure) into the Kohonen layer
Step 2: correction of weights in both the Kohonen and the output layer
Step 3: prediction of the four-dimensional target (toxicity) Ts = carcinogenicity
Model input parameters
• Minimal correction factor: 0.01
• Maximum correction factor: 0.5
• Number of neurons in x direction: 35
• Number of neurons in y direction: 35
• Number of learning epochs: 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800
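The three CP ANN steps and the input parameters above can be illustrated with a minimal sketch. It is not the NIC_LJU implementation: the linear decrease of the correction factor from 0.5 to 0.01, the shrinking triangular neighbourhood, the random weight initialisation and the class name CPANN are all assumptions made for illustration, and the toy data are hypothetical. A small 4 x 4 map is used here instead of 35 x 35 to keep the example fast.

```python
import random


class CPANN:
    """Toy counter-propagation network: a Kohonen layer plus an output layer."""

    def __init__(self, nx, ny, n_inputs, n_outputs, seed=0):
        rng = random.Random(seed)
        self.nx, self.ny = nx, ny
        n = nx * ny
        # One weight vector per neuron in each layer, random in [0, 1) (assumption).
        self.kohonen = [[rng.random() for _ in range(n_inputs)] for _ in range(n)]
        self.output = [[rng.random() for _ in range(n_outputs)] for _ in range(n)]

    def winner(self, x):
        """Step 1: index of the neuron whose Kohonen weights are closest to x."""
        def sq_dist(w):
            return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        return min(range(len(self.kohonen)), key=lambda i: sq_dist(self.kohonen[i]))

    def train(self, X, T, epochs, a_max=0.5, a_min=0.01):
        """Step 2: correct the weights of both layers around each winner."""
        for epoch in range(epochs):
            frac = epoch / max(epochs - 1, 1)
            rate = a_max - (a_max - a_min) * frac          # assumed linear decrease
            radius = max(self.nx, self.ny) * (1.0 - frac)  # assumed shrinking neighbourhood
            for x, t in zip(X, T):
                cx, cy = divmod(self.winner(x), self.ny)
                for i in range(self.nx * self.ny):
                    ix, iy = divmod(i, self.ny)
                    d = max(abs(ix - cx), abs(iy - cy))    # grid distance to the winner
                    if d > radius:
                        continue
                    h = rate * (1.0 - d / (radius + 1.0))  # triangular neighbourhood
                    # Same correction in the Kohonen layer (towards x)
                    # and in the output layer (towards t).
                    for w, target in ((self.kohonen[i], x), (self.output[i], t)):
                        for j in range(len(w)):
                            w[j] += h * (target[j] - w[j])

    def predict(self, x):
        """Step 3: read the output-layer weights at the winning neuron."""
        return list(self.output[self.winner(x)])


# Hypothetical two-descriptor toy data; 1.0 = P (positive), 0.0 = NP (not positive).
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
T = [[0.0], [0.0], [1.0], [1.0]]
net = CPANN(nx=4, ny=4, n_inputs=2, n_outputs=1)
net.train(X, T, epochs=100)
print(net.predict([0.15, 0.15]))
```

The predicted value lies between 0 and 1, so a threshold is needed to turn it into a P or NP class.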
Statistical evaluation of models
Confusion matrix for two classes: true positive (TP), true negative (TN), false positive (FP), false negative (FN)
Accuracy (AC) = (TN + TP) / (TN + TP + FN + FP)
Sensitivity (SE) = TP / (TP + FN)
Specificity (SP) = TN / (TN + FP)
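As a quick check of the three formulas, the following sketch computes them from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the models in this presentation.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from two-class confusion counts."""
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for 100 test chemicals (50 P and 50 NP).
metrics = classification_metrics(tp=40, tn=30, fp=20, fn=10)
print(metrics)
```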
Statistical performance of models
Changing the threshold from 0 to 1 decreases the number of false positives and increases the number of false negatives. This tendency is common to all our models 1, 2 and 3.
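The effect of the threshold can be reproduced with a small sweep. This is a sketch under assumptions: the model output is taken to be a value between 0 and 1, a chemical is predicted P when the output is at or above the threshold, and the scores and labels are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, FN when score >= threshold is predicted positive (P)."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, FP, FN, accuracy) for each threshold."""
    rows = []
    for th in thresholds:
        tp, tn, fp, fn = confusion_counts(scores, labels, th)
        rows.append((th, fp, fn, (tp + tn) / len(labels)))
    return rows


# Hypothetical model outputs in [0, 1] and experimental classes (1 = P, 0 = NP).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
rows = sweep_thresholds(scores, labels, [i / 20 for i in range(21)])
best = max(rows, key=lambda row: row[3])  # first threshold with maximal accuracy
for th, fp, fn, acc in rows:
    print(f"threshold={th:.2f}  FP={fp}  FN={fn}  accuracy={acc:.2f}")
```

The same sweep also gives the threshold with maximal accuracy, which is how an optimal threshold can be picked for a given test set.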
[Figure: Threshold vs. wrong prediction rate for the test set (model 1). x-axis: threshold, 0.00 to 1.00; y-axis: wrong prediction rate, 0.00 to 1.00; curves: FP_rate, FN_rate.]
[Figure: Threshold vs. accuracy, SE and SP for the test set (model 1). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.45, accuracy = 0.68, SE = 0.71, SP = 0.65.]
In the figure we have marked the maximum accuracy and the corresponding threshold. For model 1 the optimal threshold is 0.45. At this threshold the accuracy reaches its maximum of 0.68, sensitivity is 0.71 and specificity is 0.65.
[Figure: Threshold vs. accuracy, SE and SP for the test set (model 2). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.6, accuracy = 0.70, SE = 0.69, SP = 0.72.]
For model 2 the optimal threshold for the test set is 0.6, where the accuracy reaches its maximum of 0.70. Sensitivity at this point is 0.69 and specificity is 0.72.
[Figure: Threshold vs. accuracy, SE and SP for the test set (model 3). x-axis: threshold, 0.00 to 1.00; y-axis: accuracy, SE and SP, 0.00 to 1.00; curves: SE, SP, ACC. Marked point: threshold = 0.5, accuracy = 0.68, SE = 0.70, SP = 0.62.]
For model 3 the optimal threshold is 0.5, the maximum accuracy is 0.68, sensitivity is 0.70 and specificity is 0.62.
Changing the threshold shifts the balance between sensitivity and specificity. It may be used to increase the number of correctly predicted carcinogens or non-carcinogens.
[Figure: ROCs for CARC_NIC_CPANN models 01, 02 and 03. x-axis: false positive rate (1 - specificity), 0.0 to 1.0; y-axis: true positive rate (sensitivity), 0.0 to 1.0; curves: Training_mod_01, Test_mod_01, Training_mod_02, Test_mod_02, Training_mod_03, Test_mod_03.]
The closer the curve tends towards the point (0,1), the more accurate the predictions are.
A model with no predictive ability yields the diagonal line.
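The two statements about the ROC curve can be checked numerically. The sketch below builds ROC points by using each distinct score as a threshold and computes the area under the curve (AUC) with the trapezoidal rule. The function names and the toy scores are hypothetical, and this is not necessarily how the curves in the figure were produced.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), one per distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores: a perfect ranking passes through (0, 1),
# while identical scores for all chemicals give the diagonal line.
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(auc(perfect), auc(uninformative))
```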
Accuracy of prediction and area under the curve (AUC) (models 1,2,3)
Study of structural alerts for our dataset, collected from the Benigni Toxtree program
• We have extracted the following alerts for our dataset of 805 compounds:
• GA: genotoxic alerts
• nGA: non-genotoxic alerts
• NA: no carcinogenic alerts
• Then we calculated how many chemicals with the indicated alerts fall into the NP (not positive) and P (positive) classes.
For substances with GA, about 2/3 belong to P (positive) and about 1/3 to NP (not positive).
For substances with nGA, about half belong to P and half to NP.
For substances with NA (no carcinogenic alerts), about 2/3 belong to NP and about 1/3 to P.
P (positive) and NP (not positive) relate only to results for rats.
Needs for future investigations
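The alert statistics above amount to a cross-tabulation of alert type against experimental class. A minimal sketch of that count follows; the function name and the four records are hypothetical and only show the bookkeeping, not the actual Toxtree output.

```python
from collections import Counter


def alert_class_fractions(records):
    """Fraction of P and NP chemicals within each alert group (GA, nGA, NA)."""
    pair_counts = Counter(records)
    alert_totals = Counter(alert for alert, _ in records)
    return {
        (alert, cls): count / alert_totals[alert]
        for (alert, cls), count in pair_counts.items()
    }


# Hypothetical (alert, experimental class) pairs, one per chemical.
records = [("GA", "P"), ("GA", "P"), ("GA", "NP"), ("NA", "NP")]
for (alert, cls), fraction in sorted(alert_class_fractions(records).items()):
    print(f"{alert:>3} {cls:>2}: {fraction:.2f}")
```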
Conclusion
• Quantitative models with the tumorigenic dose TD50 for rats as the dependent variable have shown low predictive power, with a correlation coefficient for the test set below 0.5.
• Conversely, qualitative models demonstrated excellent internal performance (accuracy of the training set is 91-93%) and good external performance (accuracy of the test set is 68-70%, sensitivity is 69-73% and specificity is 63-72%).
• Changing the threshold shifts the balance between sensitivity and specificity. It may be used to increase the number of correctly predicted carcinogens or non-carcinogens.