quantum chemical and machine learning approaches to property prediction for druglike molecules

Quantum Chemical and Machine Learning Approaches to Property Prediction for

Druglike Molecules

Dr John MitchellUniversity of St Andrews

http://www.sulsa.ac.uk/

http://www.eastchem.ac.uk/index.htm

1. Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the pharma industry

A good model for predicting the solubility of druglike molecules would be very valuable.

How should we approach the

prediction/estimation/calculation

of the aqueous solubility of

druglike molecules?

Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

Theoretical Chemistry

• Calculations and simulations based on real physics.

• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.

• Attempt to model or simulate reality.

• Usually Low Throughput.

Dataset

•The thermodynamically most stable polymorph was selected where possible. •All have experimental crystal structures.

•All have experimental logS.

•10 have experimental ΔGsub and ΔGhydr (circled in red).

CheqSol MethodCheqSol Method

NH

O

Cl Cl

ONa NH

OCl Cl

O

Na+

NH

OCl Cl

OH

NH

OCl Cl

OH

In SolutionPowder

● We continue “Chasing equilibrium” until a specified number of crossing points have been reached ● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating….

SOLUTION IS IN EQUILIBRIUM

Repeatability better than 0.05 log units

dpH/dt Versus Time

dpH

/dt

Time (minutes)

-0.008

-0.004

0.000

0.004

0.008

20 25 30 35 40 45

Supersaturated Solution

Subsaturated Solution8 Intrinsic solubility values8 Intrinsic solubility values

* A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983

● First precipitation – Kinetic Solubility (Not in Equilibrium)● Thermodynamic Solubility through “Chasing Equilibrium”- Intrinsic Solubility (In Equilibrium)

Supersaturation Factor SSF = Skin – S0

“CheqSol”

Thermodynamic Cycle

Thermodynamic Cycle

Crystal

Gas

Solution

Sublimation Free Energy

Crystal

Gas


Crystal

Gas

(rigid molecule approximation)


Crystal

Gas

Calculating ΔGsub is a standard procedure in crystal structure prediction

Gsub from lattice energy & a phonon entropy term;DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.

Theoretical method for crystal

Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

Results for ΔGsub

A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.

Thermodynamic Cycle

Crystal

Gas

Solution

Hydration Free Energy

We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc.

Ghydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).

Theoretical method for aqueous solution

Reference Interaction Site Model (RISM)

• Combines features of explicit and implicit solvent models.

• Solvent density is modelled, but no explicit molecular coordinates or dynamics.

~45 CPU minsper compound

Reference Interaction Site Model (RISM)

Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.

Perhaps surprisingly, error in Ghyd is smaller than in Gsub.

Results for ΔGhyd

logS from Thermodynamic Cycle

Crystal

Gas

Solution

Add the two terms to get ΔGsol and hence logS.

Results for ΔGsol

Conclusions: Solubility from Theory

• Must calculate Gsub & Ghyd separately;

• RISM is efficient & fairly accurate for Ghyd;

• Experimental data for Gsub & Ghyd sparse and errors may be large;

• Dataset size and composition make comparisons of methods hard;

• Not yet matched accuracy of informatics.

Informatics and Empirical Models

• In general, informatics methods represent phenomena mathematically, but not in a physics-based way.

• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.

• Do not attempt to simulate reality. • Usually High Throughput.

What Error is Acceptable?

• For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units.

• A RMSE > 1.0 logS unit is probably unacceptable.

• This corresponds to an error range of 4.0 to 5.7 kJ/mol in Gsol.

What Error is Acceptable?

• A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units;

• The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?

Machine Learning Method

Random Forest

Random Forest: Solubility Results

RMSE(te)=0.69r2(te)=0.89Bias(te)=-0.04

RMSE(oob)=0.68r2(oob)=0.90Bias(oob)=0.01

DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007) Ntrain = 658; Ntest = 300

Support Vector Machine

SVM: Solubility Results

et al.,

Ntrain = 150 + 50; Ntest = 87RMSE(te)=0.94r2(te)=0.79

100 Compound Cross-Validation

Theoretical energies don’t seem to improve descriptor models.

Replicating Solubility Challenge (post hoc)

RMSE(te)=1.09; 1.00; 0.89; 1.08r2(te)=0.39; 0.49; 0.58; 0.4110; 12; 12; 13/28 correct within 0.5 logS units Ntrain 94; Ntest 28

CDK descriptors: RF, RF, PLS, SVM

Replicating Solubility Challenge (post hoc)

Ntrain 94; Ntest 28

CDK descriptors: RF, RF, PLS, SVM

Although the test dataset is small, it is a standard set.

Conclusions: Solubility from Informatics

• Experimental data: errors unknown, but limit possible accuracy of models;

• CheqSol - step in right direction; • Dataset size and composition hinder comparisons

of methods; • Solubility Challenge – step in right direction.

2. Protein Target Prediction

• Which protein does a given molecule bind to?• Virtual Screening• Multiple endpoint drugs - polypharmacology• New targets for existing drugs• Prediction of adverse drug reactions (ADR)

– Computational toxicology

Predicted Protein Targets

• Selection of 233 classes from the MDL Drug Data Report

• ~90,000 molecules• 15 independent

50%/50% splits into training/test set

Actually we are predicting closely target-related MDDR classes

Predicted Protein Targets

Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)

Protein Target Prediction• Given a specific compound, is it possible to predict

computationally its biological interactions with protein targets?• Very important for

• In silico screening (time and money efficient)• off-target prediction (side effects)

• Can be used for identifying substances with performance-enhancing potential

Drug discovery: Predicting promiscuity, Andrew L. Hopkins, Nature 462, 167-168(12 November 2009),doi:10.1038/462167a

http://www.nature.com/nature/journal/v462/n7270/full/462167a.html

http://www.nature.com/nature/journal/v462/n7270/full/462167a.html

James

The template of the slides changes here and it looks odd to have such a sudden transition.

Substances Prohibited in Sports

• WADA publishes and maintains a prohibited list of banned compounds, updated every 6 months

• Substances are split into three main categories:

Substances prohibited at all times(in and out of competition)

S0. Non-Approved substancesS1. Anabolic AgentsS2. Peptide hormones, GrowthFactors and Related SubstancesS3. Beta-2 AgonistsS4. Hormone Antagonists andModulatorsS5. Diuretics and Other MaskingAgents

Substances prohibited in competitionS6. StimulantsS7. NarcoticsS8. CannabinoidsS9. Glucocorticosteroids

Substances prohibited in particularsports

P1. Alcohol with a violation thresholdof 0.10 g/L. (Archery, Karate etc)P2. Beta-Blockers prohibited In-Competition only (Bridge, Curling,Darts, Wrestling, Archery etc.)

Database

Rank

1

2

3

Class

Anabolic Agents

VitaminD

Glucocorticoids

MethodologyStanozolol

ChEMBL-Activities

• Each compound hasexperimental data for anumber of targets

• Activity data based on IC50,EC50, Ki, Kd etc.

• Some activities just labelled“inactive” or “active”

• Each compound can havemore than one record for agiven target

• Each of the 8,845 targets has anumber of compounds assigned

• Not all compounds have actualdata on the target or are active

• We filtered each of the familiesaccording to rules defining“active” and “inactive”

• The rules were decided byvisual inspection of distributions

• Rules• IC50

• ≤50000nM active & >50000nM inactive

• Ki

• <20000nM active & ≥20000nM inactive

• Kd

• ≤ 10000nM active & >10000nM inactive

• EC50• ≤ 40000nM active & >40000nM inactive

• ED50• ≤ 10000nM active & >10000nM inactive

• Potency• ≤ 10000nM active & >10000nM inactive

• Activity• ≥40% active & <40% inactive

• Inhibition• ≥45% active & <45% inactive

Filtering the CheMBLFamilies

Index Index Index

Example: Distributions of Ki

???

Refined Families

• Filtered families consist of compoundswith significant experimental activitiesagainst the relevant targets.

• Many targets have distinct groupsof ligands with different scaffolds.

• May be because there is more thanone binding site, or becausedifferent scaffolds can fit the same site.

• Splitting such a family into smallergroups based on ligand structure willallow us to identify the different setsof ligands.

Refined Families - PFClust

We selected the PFClust algorithm because it is aparameter free clustering algorithm and does notrequire any kind of parameter tuning.

PFClust : A novel parameter free clustering algorithm. Mavridis L, Nath N, Mitchell JBO. BMC Bioinformatics 2013, 14:213.

• 3563• 783690

RuleFiltering

• 19639• 616600

ClusteringDatabase Database

RefinedFamilies

Compounds Compounds5443Families1366460Compounds

Predicting the protein targets for athletic performance-enhancing substances. Mavridis L, Mitchell JBO. J Cheminformatics2013, 5:31.

Database Refinement

Original

Families

Database Refinement - Validation• Monte Carlo Cross-Validation

• The three versions of the databasewere examined (Original, Filtered andRefined)

• 10% of each family were randomlyremoved and used as queries

• If the top prediction was the familythat the query was a member of, a TPwould be counted; if not, a FP

• Average Matthews CorrelationCoefficient (MCC)

• Original : 0.02• Filtered : 0.03• Refined : 0.66

2.58% (6.61%)

66.98% (87.25%)

3.18% (7.21%)

Top Hit (Top four )

P2–BetaBlockers

20 explicitly prohibitedcompounds

Every compound, excepttimolol and levobunolol,gave a strong prediction(PR-Score) for at least onefamily

Good experimentalvalidation

We see that the majority of

the families are Beta-1,2 &3 adrenergic receptorligands, as expected.

Other families alsogenerate some interestingresults, such as theserotonin 1a receptor,indicated to make off-targetinteractions with pindolol

Compound Target PR-Score E-Value

P2-Beta Blockers

Alprenolol (266195) Cavia Porceullus (369) 0.039 LogB/F = −0.158

Carvedilol (723) β-1 adrenergic receptor (3252) 0.032 Ki = 0.81 nM

β-2 adrenergic receptor (210) 0.044 Ki = 0.166 nM

Pindolol (500)

β-2 adrenergic receptor (3754) 0.047



Prediction

Prediction

Ki = 1 nM

β-2 adrenergic receptor (210) 0.015 Ki = 0.4 nM



Inhibition = 84%

Ki = 1 nM

Serotonin 1a (5-HT1a (214) 0.026 Ki = 24 nM

Propranolol (27)

Sotalol (471)

β-2 adrenergic receptor (210)

β-3 adrenergic receptor (246)

0.003

0.009

IC50 = 12 nM

IC50 = 7200 nM

Carvedilol

WADA– P2 Beta Blockers

Metoprolol

Compound Target PR-Score E-ValueS8-Cannabinoids

Cannabidivarin (−)Cannabinoid CB1 receptor (218) 0.037 Prediction

Cannabigerol (497318)

Cannabinoid CB2 receptor (253)

HL-60 (383)

0.037

0.047

Prediction

Prediction

HU-210 (70625)

JWH-018 (561013)






0.035

0.029

0.002

0.015

0.009

Ki = 0.82 nMa

Prediction

pKi = 8.7

pKi = 8.045

pKi = 8.2

JWH-073 (−)

Isoprenylcysteine carboxylmethyltransferase (4699)

MDA-MB-231 (400)


0.031

0.030

0.002

Prediction

Prediction

Prediction

Cannabinoid CB1 receptor (3571) 0.025 Prediction

Tetrahydrocannabinol(465)

Cannabinoid CB1 receptor (218) 0.037 Ki = 2.9 nM





0.037

0.034

0.033

0.049

Ki = 37 nM

Ki = 20 nM

Ki = 3.3 nM

Ki = 9.2 nM

S8-Cannabinoids

10 explicitly prohibitedcompounds

17 refined families ofwhich 13 arecannabinoid CB1/2receptors

All compounds showstrong predicted affinityto at least onecannabinoid receptor,except cannabivarol

Excellent agreementbetween PR-scores andexperimental results

Tetrahydrocannabinol

HU-210

WADA– S8 Cannabinoids

JWH-018/073

Discussion• As for any method, the success of our approach depends on

the quality of the underlying data.

• Our methodology addresses the problem that only a smallfraction of possible activities of different molecules against

different targets have been assayed.

• For ChEMBL families that are not well populated, or for proteintargets which too few compounds are assayed against, wecannot make predictions.

Conclusions – Target Prediction• Automated data curation of the ChEMBL families greatly

increases the precision of our protein target predictions.

• Our validations show good correspondence withexperiment, 87% having the correct refined familyamong the top four hits.

• Across the seven WADA classes considered, we find acombination of expected and unexpected protein targets.

• Many of the non-obvious predicted targets have biochemically or clinically validated connections with the expected bioactivities.

Thanks• SULSA, WADA, BBSRC, SFC• Dr Lazaros Mavridis, James McDonagh, Dr Tanja van

Mourik, Dr Luna De Ferrari, Neetika Nath (St Andrews)• Prof. Maxim Fedorov, Dr David Palmer (Strathclyde) • Laura Hughes, Dr Toni Llinas (ex-Cambridge)• James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk,

William Walton (U/G project)

http://www.msm.cam.ac.uk/pfizer/index.html

http://www.sulsa.ac.uk/

http://www.eastchem.ac.uk/index.htm

quantum chemical and machine learning approaches to property prediction for druglike molecules

Documents

theoretical method

experimental gsub

intrinsic solubility

diclofenac solubility

fit atomatom model potential

subsaturated solution

solution switches

pharma industrya good