Making solubility models with Reaxys
TRANSCRIPT
Creating Solubility Models with Reaxys |
Creating Solubility Models with Reaxys
Elsevier R&D Solutions Services
Presented by: Dr. Matthew Clark
Date: 19 January 2016
• Reaxys has solubility data that can be used to create and study predictive models
• Appears to have data more diverse than the well-studied “Huuskonen” data set.
• The nature/diversity of the training set is very important for predictive models
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for prediction of more diverse compounds.
• Predictions on the Reaxys set from a model trained on the Huuskonen set are poor.
• Reaxys has a diverse set of structures and solubilities
• Each individual measurement is referenced.
• Good source for model making
What We Will Learn
• In addition to the well-known reactions and compounds, Reaxys is filled with hundreds of different measured properties reported for the compounds
• Each property is associated with a reference
• Each property has a “cluster” of values such as measurement temperature, pressure, solvents etc. describing the conditions of the measurement.
• In many cases multiple measurements are reported by different authors at different times for a particular value.
• A mean, median, and standard deviation can be assessed for the value. Each value is associated with a reference.
• One can use this data, combined with the chemical structures of the compounds to make structure-based predictive models for these properties.
• One can then predict the value of new or proposed compounds from their chemical structures.
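When several references report the same property for one compound, the values can be summarized before modeling. The sketch below shows one way to do that with the Python standard library; the function name and the four solubility values are hypothetical and are not taken from Reaxys.

```python
from statistics import mean, median, stdev

def summarize_measurements(values):
    """Summarize repeated measurements of one property for one compound."""
    if not values:
        raise ValueError("need at least one measurement")
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        # The sample standard deviation is undefined for a single value.
        "stdev": stdev(values) if len(values) > 1 else None,
    }

# Hypothetical solubilities (g/L) reported by four different references.
reported = [0.21, 0.25, 0.23, 0.90]
summary = summarize_measurements(reported)
print(summary)
```

In this made-up example the median (0.24) is much less affected by the 0.90 outlier than the mean is, which is one reason to inspect all three statistics before choosing a single value per compound.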
Reaxys Property Data
Reaxys Property Data is Grouped with Conditions
You can select the measurement conditions relevant to your model
Boiling Point
Boiling Point, °C (BP.BP)
Pressure, Torr (BP.P)
Refractive Index
Refractive Index (RI.RI)
Wavelength, nm (RI.W)
Temperature, °C (RI.T)
Dielectric Constant
Dielectric Constant (DIC.DIC)
Frequency, Hz (DIC.F)
Temperature, °C (DIC.T)
Electrical Moment
Description (EM.KW)
Moment, D (EM.EM)
Temperature, °C (EM.T)
Method (EM.MET)
Solvent (EM.SOL)
Enthalpy of Formation
Enthalpy of Formation, J mol⁻¹ (HFOR.HFOR)
Temperature, °C (HFOR.T)
Pressure, Torr (HFOR.P)
Solubility (MCS)
Solubility, g L⁻¹ (SLB.SLB)
Saturation (SLB.SAT)
Temperature, °C (SLB.T)
Solvent (SLB.SOL)
Ratio of Solvents (SLB.RAT)
• There are several ways to access this data
• API (Application Programming Interface) allows direct access
• Download tagged SD file from Reaxys after searching
• “Hop in to” links to automatically go to data
• Reaxys API allows direct access to the data
• XML-based interface
• KNIME and Pipeline Pilot supported.
• Need to query based on measurement conditions (temperature, solvent) and nature of molecules (organic, single-fragment)
• Form-based query
• “Advanced Query”
Model Making Tools
Solubility Query To Select Data and Molecules
Query condition                        Meaning
SLB.SLB > 0                            has a reported solubility
Temperature 19-25                      temperature range of measurement
Solvent 'H2O                           solubility in water
Number of Fragments =1                 only one contiguous fragment
Elements = 'c'                         contains carbon
NOT Chemical Name = '*radical          not a radical
Molecular Weight > 40 AND < 1000       molecular weight range
Number of Elements <5                  fewer than 5 different elements
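The same conditions can be applied as a filter to records that have already been downloaded. The sketch below is only an illustration in Python: the record layout and field names are hypothetical (they are not the Reaxys export format), and the name test reads the wildcard in the query as "name ends with radical".

```python
def passes_query(rec):
    """Return True if a record meets every condition of the solubility query."""
    name = rec["name"].lower()
    return (
        rec["solubility_g_per_l"] > 0          # has a reported solubility
        and 19 <= rec["temperature_c"] <= 25   # measured near room temperature
        and rec["solvent"] == "H2O"            # solubility in water
        and rec["n_fragments"] == 1            # one contiguous fragment
        and "C" in rec["elements"]             # contains carbon
        and not name.endswith("radical")       # not a radical
        and 40 < rec["mol_weight"] < 1000      # molecular weight range
        and len(rec["elements"]) < 5           # fewer than 5 different elements
    )

# Hypothetical records; the values are made up for illustration.
records = [
    {"name": "compound A", "solubility_g_per_l": 1.2, "temperature_c": 25,
     "solvent": "H2O", "n_fragments": 1, "elements": {"C", "H", "O"},
     "mol_weight": 180.2},
    {"name": "compound B", "solubility_g_per_l": 0.4, "temperature_c": 60,
     "solvent": "H2O", "n_fragments": 1, "elements": {"C", "H", "N"},
     "mol_weight": 121.1},
    {"name": "salt C", "solubility_g_per_l": 35.0, "temperature_c": 20,
     "solvent": "H2O", "n_fragments": 2, "elements": {"C", "H", "O", "Na"},
     "mol_weight": 82.0},
]

kept = [r["name"] for r in records if passes_query(r)]
print(kept)
```

Compound B is rejected for its measurement temperature and salt C for having two fragments, so only compound A is kept.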
Reviewing Solubility Data in Reaxys
Solubility Sources
Reaxys logS is -3.67
Data Processing in KNIME
• Combines compounds with solubility measured in desired conditions
• Converts values to molarity by dividing by molecular weight.
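The unit conversion is simple arithmetic: a solubility in g/L divided by the molecular weight in g/mol gives mol/L, and logS is the base-10 logarithm of that molarity. A minimal Python sketch follows; the function name and the example numbers are assumptions for illustration, not part of the KNIME workflow.

```python
import math

def log_s(solubility_g_per_l, mol_weight):
    """Convert a solubility in g/L to logS, the log10 of molar solubility."""
    if solubility_g_per_l <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    molarity = solubility_g_per_l / mol_weight  # (g/L) / (g/mol) = mol/L
    return math.log10(molarity)

# Hypothetical compound: 1.0 g/L solubility, molecular weight 100 g/mol,
# which is 0.01 mol/L.
print(round(log_s(1.0, 100.0), 2))
```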
• Used with data from Reaxys, and from the Huuskonen paper
• Uses “R” and stepwise multiple regression
• Results and error of prediction appear in a spreadsheet
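To make the idea of stepwise multiple regression concrete, here is a small forward-selection sketch in plain Python rather than R. The slides do not say which selection criterion was used, so the R² gain threshold here is an assumption, and the helper names and the toy descriptor table are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def residual_ss(X, y, cols):
    """Least-squares fit of y on an intercept plus the chosen columns."""
    rows = [[1.0] + [x[j] for j in cols] for x in X]
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve(ata, aty)  # normal equations
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def forward_stepwise(X, y, min_gain=0.01):
    """Add the descriptor that raises R^2 most until the gain is < min_gain."""
    mean_y = sum(y) / len(y)
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    selected, best_r2 = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        r2, j = max((1 - residual_ss(X, y, selected + [j]) / total_ss, j)
                    for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# Toy data: y depends only on descriptor columns 0 and 2.
X = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 3], [5, 9, 0], [6, 2, 5]]
y = [1 + 2 * x[0] - 3 * x[2] for x in X]
selected, r2 = forward_stepwise(X, y)
print(selected, round(r2, 3))
```

On the toy data the procedure picks column 2 first, then column 0, and stops because the unused column adds nothing. A production workflow would use a numerically more robust solver and a criterion such as AIC.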
Model Making Workflow
• Full compound set, no further filtering
• 3590 compounds
• Standard error of prediction 1.1 log units
• Not spectacular, but useful
• Training set covers a wider range of structural diversity than the sets used in most models
• r² = 0.56
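The two figures of merit quoted on these slides can be computed from paired experimental and predicted logS values. The sketch below assumes r² is the coefficient of determination and the standard error is the root-mean-square error of prediction; the slides do not state the exact definitions used, and the four data points are hypothetical.

```python
import math

def fit_statistics(experimental, predicted):
    """Return (r2, rmse) for paired experimental and predicted values."""
    if len(experimental) != len(predicted) or len(experimental) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_exp = sum(experimental) / len(experimental)
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination
    rmse = math.sqrt(ss_res / len(experimental))  # error in log units
    return r2, rmse

# Hypothetical logS values for four compounds.
exp_log_s = [-1.0, -2.0, -3.0, -4.0]
pred_log_s = [-1.5, -2.0, -2.5, -4.0]
r2, rmse = fit_statistics(exp_log_s, pred_log_s)
print(round(r2, 3), round(rmse, 3))
```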
Initial Model and Prediction Result is OK-ish
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4]
Reaxys Solubility Model 2 – Filtering of Source Compounds
Residual standard error: 0.6932 on 2697 degrees of freedom
Multiple R-squared: 0.8099, Adjusted R-squared: 0.8037
F-statistic: 132 on 87 and 2697 DF, p-value: < 2.2e-16
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4]
2785 compounds remain. Examples of filtered compounds:
Model is better, but does not improve prediction of Huuskonen data set
Comparison with Other Reports
Clark: fragment-based solubility model, r² 0.73, SE 0.89, using the “PHYSPROP” data set
Matthew Clark, “Generalized Fragment-Substructure Based Property Prediction Method,” J. Chem. Inf. Model., 2005, 45 (1), pp 30–38. DOI: 10.1021/ci049744c
Comparison with other data sets
Huuskonen defined a training set of compounds and solubilities, together with test sets, that have been used in several comparative studies
• Models made with Huuskonen structures and data using CDK descriptors and an R model
• Using the published training and test sets.
• Models are not as good as in the publication; Huuskonen used a different descriptor computation and statistical method. Standard error 0.67 log units.
Huuskonen Molecule/Data Set Models – (No Reaxys Data)
[Figure: three plots of predicted logS vs. experimental logS; all axes run from -12 to 4]
Training Set: y = 0.961x, R² = 0.8832
Test Set 1: y = 0.9452x, R² = 0.8598
Test Set 2: y = 0.9912x, R² = 0.7857
• Same molecule sets – Model Trained with Reaxys Training Set
• Standard error 0.98 log units – not bad
Huuskonen Molecule Sets – Predicted with Model Created from Reaxys Data Set
[Figure: three plots of predicted logS vs. experimental logS; all axes run from -12 to 4. Fitted lines, in slide order:]
y = 0.8824x, R² = 0.6522
y = 0.8834x, R² = 0.6889
y = 0.8741x, R² = 0.7968
• Standard Error 3.5 log units
• Issue is likely that many molecules from Reaxys are “outside” the structural diversity of the Huuskonen data set
• Illustrates a significant issue with modeling: predictions are generally best when the molecules are similar to the training set.
Reaxys Molecule Set Predicted with Model Created from Huuskonen Data Set – Not Very Good
[Figure: predicted logS vs. experimental logS; both axes run from -12 to 4. Fitted line: y = 0.6596x - 1.0645, R² = 0.1459]
• Only a subset of solubilities of the Huuskonen set are found in Reaxys.
• Differences are generally due to multiple measurements being reported with outliers
Does Reaxys Give The Same Solubility Values as Huuskonen Data Set? Yes.
[Figure: Reaxys logS vs. Huuskonen logS; both axes run from -12 to 4. Fitted line: y = 1.0082x - 0.0367, R² = 0.9607]
• Similarity matrix of each data set computed using fingerprints and Tanimoto similarity
• Molecules in the Huuskonen set are more similar to each other than molecules in the Reaxys set
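The diversity comparison can be sketched as follows: compute the Tanimoto similarity for every pair of fingerprints in a set, then bin the values and divide by the number of pairs so that sets of different sizes are comparable. This Python sketch represents each fingerprint as a set of on-bit positions; the three toy fingerprints are hypothetical, and a real workflow would generate fingerprints with a cheminformatics toolkit.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def similarity_histogram(fingerprints, bins=10):
    """Fraction of molecule pairs falling in each similarity bin."""
    counts = [0] * bins
    n_pairs = 0
    for fp_a, fp_b in combinations(fingerprints, 2):
        s = tanimoto(fp_a, fp_b)
        counts[min(int(s * bins), bins - 1)] += 1  # s == 1.0 goes in last bin
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two fingerprints")
    return [c / n_pairs for c in counts]

# Hypothetical toy fingerprints (sets of on-bit positions).
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
hist = similarity_histogram(fps)
print([round(h, 3) for h in hist])
```

A set with more of its pairs in the low-similarity bins is the more diverse one, which is the pattern the slide reports for Reaxys.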
Reaxys Solubility Data Set is Structurally More Diverse
[Figure: normalized fraction of pair-similarity count (0 to 0.2) vs. similarity value, for the Huuskonen and Reaxys sets; normalized for different data set sizes]
Reaxys has a higher proportion of molecules not similar to others in the set
• Reaxys has solubility data that can be used to create and study predictive models
• Appears to have data more diverse than the well-studied “Huuskonen” data set.
• The nature/diversity of the training set is very important.
• The best reported models have the smallest training sets.
• However, these training sets may not be useful for prediction of more diverse compounds.
• Predictions on the Reaxys set from a model trained on the Huuskonen set are poor.
• Generally good models can predict with a standard error of about 1 log unit – for compounds similar to training set.
• Question: what is the accuracy of measurement?
• $\dfrac{\partial \log S}{\partial\,(\mathrm{g\,L^{-1}})} = \dfrac{1}{2.303 \cdot \mathrm{g\,L^{-1}}}$, so logS changes by about 0.4 log units per mg/L at a solubility of 1 mg/L
• Reaxys has a diverse set of structures and solubilities
• Each individual measurement is referenced.
• Good source for model making
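The measurement-accuracy question above rests on the derivative of a base-10 logarithm, d(log10 S)/dS = 1/(ln(10) · S), where ln(10) is about 2.303. The short Python sketch below evaluates it at a few solubilities; the function name and the chosen values are illustrative assumptions.

```python
import math

def log_s_sensitivity(s):
    """d(log10 S)/dS = 1 / (ln(10) * S), in log units per unit of S."""
    if s <= 0:
        raise ValueError("solubility must be positive")
    return 1.0 / (math.log(10) * s)

# S in mg/L, so the result is in log units per mg/L.
for s_mg_per_l in (1.0, 10.0, 100.0):
    print(s_mg_per_l, round(log_s_sensitivity(s_mg_per_l), 4))
```

The sensitivity falls in proportion to 1/S, so a fixed absolute measurement error matters most for poorly soluble compounds. The derivative is a local linear approximation and overstates the effect of errors that are large relative to S.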
What We Learned
• Reaxys is a rich source of data for solubility and other properties.
• One can explore many subsets based on condition, molecule class etc.
• High diversity of molecules – organic, inorganic, peptides etc.
• Reaxys is a good source of data for making predictive models
• It provides not just the value, but the measurement conditions
• Selection of “good” measurements is an important factor in making models
• Reaxys contains hundreds of measured properties!
• Solubility is well studied
• Not as many models available for refractive index, magnetic susceptibility etc.
• Reaxys has only measured solubilities; SciFinder has predicted values
• We can see the effect of the training set and model quality in this presentation.
• Reaxys Medicinal Chemistry contains thousands of bioassay results on thousands of targets that can be used for predictive models.
Conclusion