quantum chemical and machine learning approaches to property prediction for druglike molecules
DESCRIPTION
Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules. Dr John Mitchell University of St Andrews. 1. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the pharma industry. - PowerPoint PPT PresentationTRANSCRIPT
Quantum Chemical and Machine Learning Approaches to Property Prediction for
Druglike Molecules
Dr John MitchellUniversity of St Andrews
1. Solubility is an important issue in drug discovery and a major source of attrition
This is expensive for the pharma industry
A good model for predicting the solubility of druglike molecules would be very valuable.
How should we approach the
prediction/estimation/calculation
of the aqueous solubility of
druglike molecules?
Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.
Theoretical Chemistry
• Calculations and simulations based on real physics.
• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.
• Attempt to model or simulate reality.
• Usually Low Throughput.
Dataset
•The thermodynamically most stable polymorph was selected where possible. •All have experimental crystal structures.
•All have experimental logS.
•10 have experimental ΔGsub and ΔGhydr (circled in red).
CheqSol MethodCheqSol Method
NH
O
Cl Cl
ONa NH
OCl Cl
O
Na+
NH
OCl Cl
OH
NH
OCl Cl
OH
In SolutionPowder
● We continue “Chasing equilibrium” until a specified number of crossing points have been reached ● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating….
SOLUTION IS IN EQUILIBRIUM
Repeatability better than 0.05 log units
dpH/dt Versus Time
dpH
/dt
Time (minutes)
-0.008
-0.004
0.000
0.004
0.008
20 25 30 35 40 45
Supersaturated Solution
Subsaturated Solution8 Intrinsic solubility values8 Intrinsic solubility values
* A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983
● First precipitation – Kinetic Solubility (Not in Equilibrium)● Thermodynamic Solubility through “Chasing Equilibrium”- Intrinsic Solubility (In Equilibrium)
Supersaturation Factor SSF = Skin – S0
“CheqSol”
Thermodynamic Cycle
Thermodynamic Cycle
Crystal
Gas
Solution
Sublimation Free Energy
Crystal
Gas
Sublimation Free Energy
Crystal
Gas
(rigid molecule approximation)
Sublimation Free Energy
Crystal
Gas
Calculating ΔGsub is a standard procedure in crystal structure prediction
Gsub from lattice energy & a phonon entropy term;DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.
Theoretical method for crystal
Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.
Results for ΔGsub
A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.
Thermodynamic Cycle
Crystal
Gas
Solution
Hydration Free Energy
We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc.
Ghydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).
Theoretical method for aqueous solution
Reference Interaction Site Model (RISM)
• Combines features of explicit and implicit solvent models.
• Solvent density is modelled, but no explicit molecular coordinates or dynamics.
~45 CPU minsper compound
Reference Interaction Site Model (RISM)
Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.
Perhaps surprisingly, error in Ghyd is smaller than in Gsub.
Results for ΔGhyd
logS from Thermodynamic Cycle
Crystal
Gas
Solution
Add the two terms to get ΔGsol and hence logS.
Results for ΔGsol
Conclusions: Solubility from Theory
• Must calculate Gsub & Ghyd separately;
• RISM is efficient & fairly accurate for Ghyd;
• Experimental data for Gsub & Ghyd sparse and errors may be large;
• Dataset size and composition make comparisons of methods hard;
• Not yet matched accuracy of informatics.
Informatics and Empirical Models
• In general, informatics methods represent phenomena mathematically, but not in a physics-based way.
• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.
• Do not attempt to simulate reality. • Usually High Throughput.
What Error is Acceptable?
• For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units.
• A RMSE > 1.0 logS unit is probably unacceptable.
• This corresponds to an error range of 4.0 to 5.7 kJ/mol in Gsol.
What Error is Acceptable?
• A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units;
• The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
Machine Learning Method
Random Forest
Random Forest: Solubility Results
RMSE(te)=0.69r2(te)=0.89Bias(te)=-0.04
RMSE(oob)=0.68r2(oob)=0.90Bias(oob)=0.01
DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007) Ntrain = 658; Ntest = 300
Support Vector Machine
SVM: Solubility Results
et al.,
Ntrain = 150 + 50; Ntest = 87RMSE(te)=0.94r2(te)=0.79
100 Compound Cross-Validation
Theoretical energies don’t seem to improve descriptor models.
Replicating Solubility Challenge (post hoc)
RMSE(te)=1.09; 1.00; 0.89; 1.08r2(te)=0.39; 0.49; 0.58; 0.4110; 12; 12; 13/28 correct within 0.5 logS units Ntrain 94; Ntest 28
CDK descriptors: RF, RF, PLS, SVM
Replicating Solubility Challenge (post hoc)
Ntrain 94; Ntest 28
CDK descriptors: RF, RF, PLS, SVM
Although the test dataset is small, it is a standard set.
Conclusions: Solubility from Informatics
• Experimental data: errors unknown, but limit possible accuracy of models;
• CheqSol - step in right direction; • Dataset size and composition hinder comparisons
of methods; • Solubility Challenge – step in right direction.
2. Protein Target Prediction
• Which protein does a given molecule bind to?• Virtual Screening• Multiple endpoint drugs - polypharmacology• New targets for existing drugs• Prediction of adverse drug reactions (ADR)
– Computational toxicology
Predicted Protein Targets
• Selection of 233 classes from the MDL Drug Data Report
• ~90,000 molecules• 15 independent
50%/50% splits into training/test set
Actually we are predicting closely target-related MDDR classes
Predicted Protein Targets
Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
Protein Target Prediction• Given a specific compound, is it possible to predict
computationally its biological interactions with protein targets?• Very important for
• In silico screening (time and money efficient)• off-target prediction (side effects)
• Can be used for identifying substances with performance-enhancing potential
Drug discovery: Predicting promiscuity, Andrew L. Hopkins, Nature 462, 167-168(12 November 2009),doi:10.1038/462167a
Substances Prohibited in Sports
• WADA publishes and maintains a prohibited list of banned compounds, updated every 6 months
• Substances are split into three main categories:
Substances prohibited at all times(in and out of competition)
S0. Non-Approved substancesS1. Anabolic AgentsS2. Peptide hormones, GrowthFactors and Related SubstancesS3. Beta-2 AgonistsS4. Hormone Antagonists andModulatorsS5. Diuretics and Other MaskingAgents
Substances prohibited in competitionS6. StimulantsS7. NarcoticsS8. CannabinoidsS9. Glucocorticosteroids
Substances prohibited in particularsports
P1. Alcohol with a violation thresholdof 0.10 g/L. (Archery, Karate etc)P2. Beta-Blockers prohibited In-Competition only (Bridge, Curling,Darts, Wrestling, Archery etc.)
Database
Rank
1
2
3
Class
Anabolic Agents
VitaminD
Glucocorticoids
MethodologyStanozolol
ChEMBL-Activities
• Each compound hasexperimental data for anumber of targets
• Activity data based on IC50,EC50, Ki, Kd etc.
• Some activities just labelled“inactive” or “active”
• Each compound can havemore than one record for agiven target
• Each of the 8,845 targets has anumber of compounds assigned
• Not all compounds have actualdata on the target or are active
• We filtered each of the familiesaccording to rules defining“active” and “inactive”
• The rules were decided byvisual inspection of distributions
• Rules• IC50
• ≤50000nM active & >50000nM inactive
• Ki
• <20000nM active & ≥20000nM inactive
• Kd
• ≤ 10000nM active & >10000nM inactive
• EC50• ≤ 40000nM active & >40000nM inactive
• ED50• ≤ 10000nM active & >10000nM inactive
• Potency• ≤ 10000nM active & >10000nM inactive
• Activity• ≥40% active & <40% inactive
• Inhibition• ≥45% active & <45% inactive
Filtering the CheMBLFamilies
Index Index Index
Example: Distributions of Ki
???
Refined Families
• Filtered families consist of compoundswith significant experimental activitiesagainst the relevant targets.
• Many targets have distinct groupsof ligands with different scaffolds.
• May be because there is more thanone binding site, or becausedifferent scaffolds can fit the same site.
• Splitting such a family into smallergroups based on ligand structure willallow us to identify the different setsof ligands.
Refined Families - PFClust
We selected the PFClust algorithm because it is aparameter free clustering algorithm and does notrequire any kind of parameter tuning.
PFClust : A novel parameter free clustering algorithm. Mavridis L, Nath N, Mitchell JBO. BMC Bioinformatics 2013, 14:213.
• 3563• 783690
RuleFiltering
• 19639• 616600
ClusteringDatabase Database
RefinedFamilies
Compounds Compounds5443Families1366460Compounds
Predicting the protein targets for athletic performance-enhancing substances. Mavridis L, Mitchell JBO. J Cheminformatics2013, 5:31.
Database Refinement
Original
Families
Database Refinement - Validation• Monte Carlo Cross-Validation
• The three versions of the databasewere examined (Original, Filtered andRefined)
• 10% of each family were randomlyremoved and used as queries
• If the top prediction was the familythat the query was a member of, a TPwould be counted; if not, a FP
• Average Matthews CorrelationCoefficient (MCC)
• Original : 0.02• Filtered : 0.03• Refined : 0.66
2.58% (6.61%)
66.98% (87.25%)
3.18% (7.21%)
Top Hit (Top four )
P2–BetaBlockers
20 explicitly prohibitedcompounds
Every compound, excepttimolol and levobunolol,gave a strong prediction(PR-Score) for at least onefamily
Good experimentalvalidation
We see that the majority of
the families are Beta-1,2 &3 adrenergic receptorligands, as expected.
Other families alsogenerate some interestingresults, such as theserotonin 1a receptor,indicated to make off-targetinteractions with pindolol
Compound Target PR-Score E-Value
P2-Beta Blockers
Alprenolol (266195) Cavia Porceullus (369) 0.039 LogB/F = −0.158
Carvedilol (723) β-1 adrenergic receptor (3252) 0.032 Ki = 0.81 nM
β-2 adrenergic receptor (210) 0.044 Ki = 0.166 nM
Pindolol (500)
β-2 adrenergic receptor (3754) 0.047
β-3 adrenergic receptor (4031) 0.036
β-1 adrenergic receptor (3252) 0.017
Prediction
Prediction
Ki = 1 nM
β-2 adrenergic receptor (210) 0.015 Ki = 0.4 nM
β-2 adrenergic receptor (3754) 0.026
β-3 adrenergic receptor (4031) 0.018
Inhibition = 84%
Ki = 1 nM
Serotonin 1a (5-HT1a (214) 0.026 Ki = 24 nM
Propranolol (27)
Sotalol (471)
β-2 adrenergic receptor (210)
β-3 adrenergic receptor (246)
0.003
0.009
IC50 = 12 nM
IC50 = 7200 nM
Carvedilol
WADA– P2 Beta Blockers
Metoprolol
Compound Target PR-Score E-ValueS8-Cannabinoids
Cannabidivarin (−)Cannabinoid CB1 receptor (218) 0.037 Prediction
Cannabigerol (497318)
Cannabinoid CB2 receptor (253)
HL-60 (383)
0.037
0.047
Prediction
Prediction
HU-210 (70625)
JWH-018 (561013)
Cannabinoid CB1 receptor (3571)
Cannabinoid CB2 receptor (5373)
Cannabinoid CB1 receptor (218)
Cannabinoid CB1 receptor (3571)
Cannabinoid CB2 receptor (253)
0.035
0.029
0.002
0.015
0.009
Ki = 0.82 nMa
Prediction
pKi = 8.7
pKi = 8.045
pKi = 8.2
JWH-073 (−)
Isoprenylcysteine carboxylmethyltransferase (4699)
MDA-MB-231 (400)
Cannabinoid CB1 receptor (218)
0.031
0.030
0.002
Prediction
Prediction
Prediction
Cannabinoid CB1 receptor (3571) 0.025 Prediction
Tetrahydrocannabinol(465)
Cannabinoid CB1 receptor (218) 0.037 Ki = 2.9 nM
Cannabinoid CB1 receptor (3571)
Cannabinoid CB2 receptor (2470)
Cannabinoid CB2 receptor (253)
Cannabinoid CB2 receptor (5373)
0.037
0.034
0.033
0.049
Ki = 37 nM
Ki = 20 nM
Ki = 3.3 nM
Ki = 9.2 nM
S8-Cannabinoids
10 explicitly prohibitedcompounds
17 refined families ofwhich 13 arecannabinoid CB1/2receptors
All compounds showstrong predicted affinityto at least onecannabinoid receptor,except cannabivarol
Excellent agreementbetween PR-scores andexperimental results
Tetrahydrocannabinol
HU-210
WADA– S8 Cannabinoids
JWH-018/073
Discussion• As for any method, the success of our approach depends on
the quality of the underlying data.
• Our methodology addresses the problem that only a smallfraction of possible activities of different molecules against
different targets have been assayed.
• For ChEMBL families that are not well populated, or for proteintargets which too few compounds are assayed against, wecannot make predictions.
Conclusions – Target Prediction• Automated data curation of the ChEMBL families greatly
increases the precision of our protein target predictions.
• Our validations show good correspondence withexperiment, 87% having the correct refined familyamong the top four hits.
• Across the seven WADA classes considered, we find acombination of expected and unexpected protein targets.
• Many of the non-obvious predicted targets have biochemically or clinically validated connections with the expected bioactivities.
Thanks• SULSA, WADA, BBSRC, SFC• Dr Lazaros Mavridis, James McDonagh, Dr Tanja van
Mourik, Dr Luna De Ferrari, Neetika Nath (St Andrews)• Prof. Maxim Fedorov, Dr David Palmer (Strathclyde) • Laura Hughes, Dr Toni Llinas (ex-Cambridge)• James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk,
William Walton (U/G project)