Creating Solubility Models with Reaxys

Elsevier R&D Solutions Services

Dr. Matthew CLARK

19 January 2016


• Reaxys has solubility data that can be used to create and study predictive models

• Its data appear to be more diverse than the well-studied “Huuskonen” data set.

• The nature/diversity of the training set is very important for predictive models

• The best reported models have the smallest training sets.

• However, these training sets may not be useful for prediction of more diverse compounds.

• Predictions of a model trained on the Huuskonen set are poor for the Reaxys set.

• Reaxys has a diverse set of structures and solubilities

• Each individual measurement is referenced.

• Good source for model making


What We Will Learn


• In addition to the well-known reactions and compounds, Reaxys is filled with hundreds of different measured properties reported for the compounds

• Each property is associated with a reference

• Each property has a “cluster” of values such as measurement temperature, pressure, solvents etc. describing the conditions of the measurement.

• In many cases multiple measurements are reported by different authors at different times for a particular value.

• A mean, median, and standard deviation can be assessed for the value. Each value is associated with a reference.

• Combined with the chemical structures of the compounds, these data can be used to build structure-based predictive models for these properties.

• One can then predict the value of new or proposed compounds from their chemical structures.

Reaxys Property Data
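As a rough illustration of the point about multiple measurements, the sketch below summarizes replicate values for one compound using Python's standard `statistics` module. The numbers are made up for illustration and are not Reaxys data; the function name `summarize` is hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical replicate solubility values (g/L) for one compound,
# as if reported by different authors. Illustrative numbers only.
measurements_g_per_l = [0.21, 0.25, 0.23, 0.80]


def summarize(values):
    """Return count, mean, median, and sample standard deviation."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        # stdev needs at least two points; report 0.0 for a single value
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


summary = summarize(measurements_g_per_l)
print(summary)
```

With an outlier such as the last value, the median moves much less than the mean, which is one reason to look at both before choosing a value to model.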


Reaxys Property Data is Grouped with Conditions

You can select the measurement conditions relevant to your model

Boiling Point

Boiling Point, °C (BP.BP)

Pressure, Torr (BP.P)

Refractive Index

Refractive Index (RI.RI)

Wavelength, nm (RI.W)

Temperature, °C (RI.T)

Dielectric Constant

Dielectric Constant (DIC.DIC)

Frequency, Hz (DIC.F)

Temperature, °C (DIC.T)

Electrical Moment

Description (EM.KW)

Moment, D (EM.EM)

Temperature, °C (EM.T)

Method (EM.MET)

Solvent (EM.SOL)

Enthalpy of Formation

Enthalpy of Formation, J mol⁻¹ (HFOR.HFOR)

Temperature, °C (HFOR.T)

Pressure, Torr (HFOR.P)

Solubility (MCS)

Solubility, g L⁻¹ (SLB.SLB)

Saturation (SLB.SAT)

Temperature, °C (SLB.T)

Solvent (SLB.SOL)

Ratio of Solvents (SLB.RAT)


• There are several ways to access this data

• API (Application Programming Interface) allows direct access

• Download tagged SD file from Reaxys after searching

• “Hop in to” links to automatically go to data

• Reaxys API allows direct access to the data

• XML-based interface

• KNIME and Pipeline Pilot are supported.

• Queries need to be based on measurement conditions (temperature, solvent) and the nature of the molecules (organic, single-fragment)

• Form-based query

• “Advanced Query”


Model Making Tools


Solubility Query To Select Data and Molecules

Query criterion                       Meaning
SLB.SLB > 0                           has a reported solubility
Temperature 19-25                     measurement temperature in the range 19-25 °C
Solvent 'H2O'                         solubility in water
Number of Fragments = 1               only one contiguous fragment
Elements = 'c'                        contains carbon
NOT Chemical Name = '*radical'        not a radical
Molecular Weight > 40 AND < 1000      molecular weight range
Number of Elements < 5                fewer than 5 different elements
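The same selection logic can be sketched as a filter over records that have already been exported. This is only an illustration of the criteria listed above: the dictionary keys are hypothetical names (not Reaxys field codes), the records are made-up examples, and the Reaxys query itself is built in the query form, not in Python.

```python
# Hypothetical exported records; keys and values are illustrative only.
records = [
    {"name": "benzoic acid", "solubility_g_per_l": 3.4, "temp_c": 25,
     "solvent": "H2O", "fragments": 1,
     "elements": {"C", "H", "O"}, "mol_weight": 122.12},
    {"name": "sodium chloride", "solubility_g_per_l": 360.0, "temp_c": 25,
     "solvent": "H2O", "fragments": 2,
     "elements": {"Na", "Cl"}, "mol_weight": 58.44},
    {"name": "methyl radical", "solubility_g_per_l": 0.1, "temp_c": 20,
     "solvent": "H2O", "fragments": 1,
     "elements": {"C", "H"}, "mol_weight": 15.03},
    {"name": "naphthalene", "solubility_g_per_l": 0.03, "temp_c": 40,
     "solvent": "H2O", "fragments": 1,
     "elements": {"C", "H"}, "mol_weight": 128.17},
]


def passes_query(rec):
    """Apply the selection criteria from the query slide to one record."""
    return (
        rec["solubility_g_per_l"] > 0              # has a reported solubility
        and 19 <= rec["temp_c"] <= 25              # measured at 19-25 °C
        and rec["solvent"] == "H2O"                # solubility in water
        and rec["fragments"] == 1                  # one contiguous fragment
        and "C" in rec["elements"]                 # contains carbon
        and not rec["name"].lower().endswith("radical")  # not a radical
        and 40 < rec["mol_weight"] < 1000          # molecular weight range
        and len(rec["elements"]) < 5               # fewer than 5 elements
    )


selected = [rec["name"] for rec in records if passes_query(rec)]
print(selected)
```

Here the salt fails the fragment and carbon tests, the radical fails the name and molecular weight tests, and the 40 °C measurement fails the temperature window.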


Reviewing Solubility Data in Reaxys


Solubility Sources

Reaxys logS is -3.67


Data Processing in KNIME

• Combines compounds with solubility measured in the desired conditions

• Converts values to molarity by dividing by molecular weight.
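The unit conversion in this step can be sketched as follows, assuming logS is taken as the base-10 logarithm of the molar solubility (mol/L). The function name `log_s` and the example numbers are illustrative assumptions, not part of the KNIME workflow.

```python
import math


def log_s(solubility_g_per_l, mol_weight_g_per_mol):
    """Convert a solubility in g/L to logS, the log10 of molarity (mol/L)."""
    if solubility_g_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    molarity = solubility_g_per_l / mol_weight_g_per_mol  # mol/L
    return math.log10(molarity)


# Illustrative values: 3.4 g/L for a compound of molecular weight 122.12 g/mol
example = log_s(3.4, 122.12)
print(round(example, 2))
```

The guard against non-positive values matters in practice because the logarithm is undefined there, which is also why the query keeps only records with SLB.SLB > 0.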


• Used with data from Reaxys, and from the Huuskonen paper

• Uses “R” and stepwise multiple regression

• Results and error of prediction appear in a spreadsheet


Model Making Workflow


• Full compound set, no further filtering

• 3590 compounds

• Standard error of prediction 1.1 log units

• Not spectacular, but useful

• The training set covers a wider range of structural diversity than those used in most models

• r² = 0.56


Initial Model and Prediction Result is OK-ish

[Figure: scatter plot of predicted logS against experimental logS; both axes run from -12 to 4.]
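For readers who want to reproduce this kind of summary, here is a minimal sketch of two common fit statistics for a predicted-versus-experimental plot. The slides do not state exactly how the standard error and r² were computed; this sketch assumes a root-mean-square error and the squared Pearson correlation, and the function name and example values are hypothetical.

```python
import math


def prediction_stats(experimental, predicted):
    """Return (rmse, r2) for paired experimental and predicted logS values.

    rmse is the root-mean-square prediction error; r2 is the squared
    Pearson correlation. Both are assumptions about the slide's metrics.
    """
    n = len(experimental)
    residuals = [p - e for e, p in zip(experimental, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)

    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_e * var_p)
    return rmse, r2


# Made-up logS values for illustration
rmse, r2 = prediction_stats([-1.0, -2.0, -3.0, -4.0],
                            [-1.5, -2.0, -2.5, -4.0])
print(round(rmse, 3), round(r2, 3))
```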


Reaxys Solubility Model 2 – Filtering of Source Compounds

Residual standard error: 0.6932 on 2697 degrees of freedom

Multiple R-squared: 0.8099, Adjusted R-squared: 0.8037

F-statistic: 132 on 87 and 2697 DF, p-value: < 2.2e-16
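The regression summary above can be cross-checked with a little arithmetic. Assuming an ordinary least-squares model with an intercept, the number of observations is the residual degrees of freedom plus the number of predictors plus one, and the adjusted R² follows from the multiple R². The variable names below are illustrative.

```python
# Values copied from the regression summary on this slide
df_residual = 2697
n_predictors = 87
r_squared = 0.8099

# Assumption: the model includes an intercept term
n_observations = df_residual + n_predictors + 1

# Standard adjusted R-squared formula for a model with an intercept
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / df_residual

print(n_observations, round(adjusted_r_squared, 3))
```

The observation count agrees with the 2785 compounds reported as remaining after filtering, and the adjusted value agrees with the reported 0.8037 to within the rounding of the inputs.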

[Figure: scatter plot of predicted logS against experimental logS; both axes run from -12 to 4.]

2785 compounds remain after filtering.

[Figure: examples of filtered compounds.]

Model is better, but does not improve prediction of Huuskonen data set


Comparison with Other Reports

Clark: fragment-based solubility model, r² = 0.73, SE = 0.89, using the “PHYSPROP” data set

Clark, M. Generalized Fragment-Substructure Based Property Prediction Method. J. Chem. Inf. Model. 2005, 45 (1), 30–38. DOI: 10.1021/ci049744c


Comparison with other data sets

Huuskonen defined a training set of compounds and solubilities, along with test sets, that have been used in several comparative studies


• Models were made from the Huuskonen structures and data using CDK descriptors and an R model

• The published training and test sets were used.

• The models are not as good as those in the publication, which used a different descriptor computation and statistical method. Standard error 0.67 log units.

Huuskonen Molecule/Data Set Models – (No Reaxys Data)

[Figure: three scatter plots of predicted logS against experimental logS; all axes run from -12 to 4. Panel labels: Training Set, Test Set 1, Test Set 2. Trend lines, in the order extracted: y = 0.961x, R² = 0.8832; y = 0.9452x, R² = 0.8598; y = 0.9912x, R² = 0.7857.]


• Same molecule sets, predicted with the model trained on the Reaxys training set

• Standard error 0.98 log units – not bad


Huuskonen Molecule Sets – Predicted with Model Created from Reaxys Data Set

[Figure: three scatter plots of predicted logS against experimental logS; all axes run from -12 to 4. Trend lines, in the order extracted: y = 0.8824x, R² = 0.6522; y = 0.8834x, R² = 0.6889; y = 0.8741x, R² = 0.7968.]


• Standard Error 3.5 log units

• Issue is likely that many molecules from Reaxys are “outside” the structural diversity of the Huuskonen data set

• This illustrates a significant issue with modeling: predictions are generally best when the molecules are similar to those in the training set.


Reaxys Molecule Set Predicted with Model Created from Huuskonen Data Set – Not Very Good

[Figure: scatter plot of predicted logS against experimental logS; both axes run from -12 to 4. Trend line: y = 0.6596x - 1.0645, R² = 0.1459.]


• Only a subset of solubilities of the Huuskonen set are found in Reaxys.

• Differences are generally due to multiple measurements being reported with outliers


Does Reaxys Give The Same Solubility Values as Huuskonen Data Set? Yes.

[Figure: scatter plot of Reaxys logS against Huuskonen logS; both axes run from -12 to 4. Trend line: y = 1.0082x - 0.0367, R² = 0.9607.]


• A similarity matrix was computed for each data set using fingerprints and the Tanimoto coefficient

• Molecules in the Huuskonen set are more similar to one another than molecules in the Reaxys set


Reaxys Solubility Data Set is Structurally More Diverse

[Figure: distribution of pairwise similarity values for the Huuskonen and Reaxys sets. x-axis: Similarity Value; y-axis: Normalized Fraction of Pair-Similarity Count, from 0 to 0.2. Annotations: "Reaxys has a higher proportion of molecules not similar to others in the set"; "Normalized for different data set sizes".]
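The diversity comparison can be sketched in a few lines. This is a minimal illustration, assuming each fingerprint is represented as a Python set of "on" bit positions. The slides do not say which fingerprint type was used, so the toy fingerprints, the bin count, and the function names here are all assumptions; in practice the fingerprints would come from a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def similarity_histogram(fingerprints, n_bins=10):
    """Fraction of molecule pairs falling into each similarity bin.

    Dividing by the number of pairs normalizes for data set size, so
    histograms from sets of different sizes can be compared.
    """
    counts = [0] * n_bins
    n_pairs = 0
    for fp_a, fp_b in combinations(fingerprints, 2):
        similarity = tanimoto(fp_a, fp_b)
        # A similarity of exactly 1.0 goes into the last bin
        index = min(int(similarity * n_bins), n_bins - 1)
        counts[index] += 1
        n_pairs += 1
    return [c / n_pairs for c in counts] if n_pairs else [0.0] * n_bins


# Toy fingerprints: two related molecules and one unrelated molecule
toy_fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
histogram = similarity_histogram(toy_fingerprints)
print(histogram)
```

A more diverse set puts more of its pair fraction into the low-similarity bins, which is the pattern described for the Reaxys set.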


• Reaxys has solubility data that can be used to create and study predictive models

• Its data appear to be more diverse than the well-studied “Huuskonen” data set.

• The nature/diversity of the training set is very important.

• The best reported models have the smallest training sets.

• However, these training sets may not be useful for prediction of more diverse compounds.

• Predictions of a model trained on the Huuskonen set are poor for the Reaxys set.

• Generally good models can predict with a standard error of about 1 log unit – for compounds similar to training set.

• Question: what is the accuracy of measurement?

• $\dfrac{\partial \log S}{\partial\,(\mathrm{g\,L^{-1}})} = \dfrac{1}{2.303 \cdot \mathrm{g\,L^{-1}}}$, so logS changes by about 0.4 log units per mg/L at a solubility of 1 mg/L

• Reaxys has a diverse set of structures and solubilities

• Each individual measurement is referenced.

• Good source for model making


What We Learned
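The measurement-sensitivity estimate on this slide can be checked numerically. The sketch below evaluates the slope of log10(S) with respect to mass concentration and, for comparison, the finite shift in logS when a 1 mg/L measurement is off by 1 mg/L. Molecular weight drops out because log10(c/MW) differs from log10(c) only by a constant. The function name is hypothetical.

```python
import math


def dlogs_per_mg_per_l(solubility_mg_per_l):
    """Slope of log10(S) per mg/L of concentration, at the given solubility."""
    return 1.0 / (math.log(10) * solubility_mg_per_l)


# Local slope at 1 mg/L: about 0.43 log units per mg/L
slope_at_1 = dlogs_per_mg_per_l(1.0)

# Finite change: a reading of 2 mg/L instead of 1 mg/L
shift_1_to_2 = math.log10(2.0 / 1.0)

# The same absolute error matters far less for a more soluble compound
slope_at_1000 = dlogs_per_mg_per_l(1000.0)

print(round(slope_at_1, 3), round(shift_1_to_2, 3), slope_at_1000)
```

For poorly soluble compounds, small absolute measurement errors therefore translate into a sizeable fraction of the roughly 1 log unit standard error quoted for good models.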


• Reaxys is a rich source of data for solubility and other properties.

• One can explore many subsets based on condition, molecule class etc.

• High diversity of molecules – organic, inorganic, peptides etc.

• Reaxys is a good source of data for making predictive models

• It provides not just the value, but the measurement conditions

• Selection of “good” measurements is an important factor in making models

• Reaxys contains hundreds of measured properties!

• Solubility is well studied

• Not as many models available for refractive index, magnetic susceptibility etc.

• Reaxys contains only measured solubilities; SciFinder provides predicted values

• This presentation shows how the training set affects model quality.

• Reaxys Medicinal Chemistry contains thousands of bioassay results on thousands of targets that can be used for predictive models.


Conclusion
