Predictive Modelling of the Primary and Secondary
Pharmacology of Compounds in Drug Discovery
Kathryn Anne Giblin
Darwin College
Submission Date: June 2019
University: University of Cambridge
Funding: AstraZeneca and European Research Council (ERC)
This Thesis is Submitted for the Degree of Doctor of Philosophy
Summary
Predictive Modelling of the Primary and Secondary Pharmacology of
Compounds in Drug Discovery
Kathryn Anne Giblin
Understanding the primary and secondary pharmacology of drugs is imperative for delivering a
drug molecule which has the necessary efficacy and safety profile in humans. The application of
machine learning and data mining methods to drug discovery has the promise to accelerate the
drug discovery process by learning from the large amounts of existing data available. This thesis
focusses on in silico approaches to address challenges related to selectivity and toxicity in drug
discovery.
The first section of this thesis is concerned with the prediction of ligand selectivity profiles using
proteochemometric (PCM) modelling, a technique which uses both compound similarity and
protein target similarity as input into machine learning models for the prediction of ligand-
target interactions. We showed that employing a multi-target PCM model outperformed other
methods, including QSAR models, on the same bioactivity dataset for the bromodomain-
containing proteins. Furthermore, we established the applicability domain of the model by
employing conformal prediction, which was further used to aid the selection of compounds for
prospective experimental testing in bromodomain assays. We achieved a 31 % hit rate for actives
from our experimental selections, including the discovery of diverse novel hits. The PCM models
were interpreted to reveal residues in the protein active site important for the classification of
activity for each bromodomain, which were further validated by the generation of co-crystal
structures for new ligands bound to bromodomains, as well as literature evidence. The PCM
technique can be used in virtual screening, in silico off-target panel screening of compounds
and to aid future structure-based design.
The second section of this thesis is concerned with the translation of toxic effects between
preclinical and clinical studies. Animal models of toxicity are used in the drug discovery process
to assess the risk of toxicity in humans. However, these effects do not always translate into
human studies as demonstrated by previous retrospective concordance analyses. Here, we asked
“what more can animal models tell us about toxicity in humans?” by extending the previous
concordance approaches to find associations between toxicities in animals and humans which
were not described by the same adverse event (AE) terms. Using data from PharmaPendium,
we derived 2,050 statistically significant associations using the mutual information and
subsequently located preclinical AEs which, when observed for a drug, were indicative of a high
risk of experiencing a clinical AE, as measured by a positive likelihood ratio of greater than 10.
To find mechanistic links for associations with the highest mutual information values, we
conducted an analysis to find intersecting genes between preclinical AE, clinical AE and the
drugs which were reported to cause both AEs, finding genetic evidence for 25 % of associations
that were analysed. We suggested that the protein targets identified by this method can be
considered for inclusion in in vitro toxicity panel screening to enable the earlier prediction of
toxicity. In summary, we quantified from existing data where animal models were useful in
clinical toxicity prediction and where they were not, as well as generated mechanistic
hypotheses for the connection between in vivo and clinical toxicities. This study can provide
evidence to support the case for in vivo animal model usage for the prediction of certain clinical
toxicities, as well as provide suggestions for targets to incorporate into in vitro screening panels.
Both outcomes contribute towards the aim to replace, reduce and refine animal usage in drug
discovery.
Overall, this work contributes to the prediction and understanding of the primary and
secondary pharmacology of drugs and how this relates to the concepts of selectivity and safety.
The four main applications of this work are: 1. a machine learning tool for the prediction
of selective bromodomain inhibitor chemotypes, 2. a guide for determining which binding site
residues to interact with to optimise binding to bromodomains, derived from machine learning
models, 3. a quantitative assessment of the drug-induced toxicities in animals which indicate a
high risk of drug-induced toxicities in humans to guide preclinical safety decisions, 4. proposed
targets for further investigation for inclusion in in vitro secondary pharmacology screening
panels for early detection of potential toxic liabilities. These techniques can be applied to other
target classes or toxicity datasets in the future.
Acknowledgements
Firstly, I would like to thank my PhD supervisor Andreas Bender for his technical guidance and
support during the PhD and for providing me with opportunities to present my work and
network within the field. I would also like to thank Samantha Hughes, my industrial supervisor
in AstraZeneca, particularly for her help in enabling the experimental validation of this work,
but also for her computational chemistry input, as well as her time and patience with reviewing
and approving publications.
Within AstraZeneca there are many individuals who contributed to the material contained
within this thesis, through discussions, reviewing written work and scientific expertise. Firstly,
for the experimental validation of the work, I would like to acknowledge Pia Hansson, Helen
Boyd, Marianne Schimpl and Clare Gregson. For their bromodomain expertise, I acknowledge
the contributions of Robert Sheppard and Thomas Hayhow. I thank Avid Afzal, Nigel Greene,
Lyn Rosenbrier-Ribeiro and Ian Barrett for their discussions and contributions to the work in
chapters 4 and 5, in terms of technical knowledge, idea generation, secondary pharmacology
knowledge and knowledge of bioinformatic databases, respectively. Generally, I would like to
thank members of the Oncology Computational Chemistry groups and Discovery Sciences
Quantitative Biology groups for listening to presentations and offering ongoing feedback and
advice throughout the PhD, as well as including me in group meetings and social events.
The time spent during my PhD would not have been half the experience it has been without
other members of my research group, past and present, many of whom are now my great friends.
Ben Alexander-Dann, Stephanie Ashenden, Erin Oerton, Fatima Baldo, Avid Afzal, Fynn Krause,
Krishna Bulusu, Christoph Schlaffner, Lewis Mervin, Azedine Zoufir, Georgios Drakakis, Ines
Smit, Maria-Anna Trapotsi, Samar Mahmoud, Chad Allen, Layla Gerami, Fredrik Svensson, and
many others: thank you so much for all the help with the PhD, the fantastic memories, from
the pub nights and formal dinners to the trips away (India, Germany, Norfolk!), and for keeping
me going through the hard times.
Completing my PhD would not have been possible without the love, support and
encouragement from my family. To my parents, Sue and Ged, and my brother Matt, thank you
for everything you have done for me; I am very lucky to have you in my life. Personally, the past
few years have also been a memorable time for me and my partner Tom; not only did we
buy our first home and adopt two lovely cats (thanks Oliver and Socks for your calming
influences), we also got married at Darwin College, my college for the PhD. I would like to
dedicate this thesis to my wonderful husband Tom for his unwavering love and encouragement
in helping me to achieve my goals, and for making me very happy!
List of Written Publications During PhD
Here we list publications relevant to work done throughout the PhD and, where applicable,
explain their relation to chapters of the thesis.
Kathryn A. Giblin, Samantha J. Hughes, Helen Boyd, Pia Hansson and Andreas Bender.
Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule
Binding to Bromodomain Proteins, J. Chem. Inf. Model., 2018, 58, 1870-1888
Sections from this publication are included in chapter 2 of this thesis.
Kathryn A. Giblin, Samantha J. Hughes, Marianne Schimpl, Helen Boyd, Pia Hansson, Clare
Gregson, Robert J Sheppard, Thomas Hayhow, Andreas Bender, Selectivity Insight for the
Rational Design of Bromodomain Inhibitors Derived from Proteochemometric Models (in
preparation)
This publication is described in chapter 3 of this thesis.
Kathryn A. Giblin, Avid Afzal, Lyn Rosenbrier Ribeiro, Nigel Greene, Ian Barrett, Samantha
Hughes, Andreas Bender, New Mechanistic Associations Between Drug-Induced Toxicities in
Animal Models and Humans Uncovered by Statistical Analysis (in preparation)
This publication is described in chapters 4 and 5 of this thesis.
Fredrik Svensson, Azedine Zoufir, Samar Mahmoud, Avid Afzal, Ines Smit, Kathryn A. Giblin,
Jerome Mettetal, Amy Pointon, James Harvey, Nigel Greene, Richard Williams and Andreas
Bender, Information-Derived Mechanisms of Structural Cardiotoxicity, Chem. Res. Toxicol.,
2018, 31, 1119-1127
This publication was the work of a collaboration between the University of Cambridge,
Lhasa, AstraZeneca and GlaxoSmithKline during the PhD.
1 Thesis Introduction
1.1 Challenges in Drug Discovery
1.1.1 Primary and Secondary Pharmacology of Drugs
1.1.2 Summary
1.2 General Computational Methods
1.2.1 Data Science Methods
1.2.2 Summary
1.3 Computational Methods and Applications for Bioactivity and Selectivity Profile Prediction
1.3.1 Ligand-Target Interactions
1.3.2 Computational Encoding of Ligands and Targets
1.3.3 Ligand-Based Bioactivity Prediction Against Multiple Targets
1.3.4 Target-Based Bioactivity Prediction Against Multiple Targets
1.3.5 Ligand and Target-Based Bioactivity Prediction Against Multiple Targets
1.3.6 Bromodomain Target Family Selectivity Profile Prediction
1.3.7 Summary
1.4 Computational Methods and Applications for Toxicity Prediction
1.4.1 Prediction of Clinical Adverse Events from Preclinical Adverse Events
1.4.2 Computational Analyses for the Proposal of Targets for In Vitro Secondary Pharmacology Screening
1.4.3 Summary
1.5 Aims
2 Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule Binding to Bromodomain Proteins
2.1 Introduction
2.2 Materials and Methods
2.2.1 Dataset
2.2.2 Analysis of Chemical and Biological space
2.2.3 Compound and Target descriptors
2.2.4 Algorithms and Generation of PCM models
2.2.5 Model Validation
2.2.6 Benchmarking to QSAR, Quantitative Sequence Activity Model (QSAM) and Baseline Models
2.2.7 Public Dataset Model
2.2.8 Applicability Domain
2.2.9 Experimental Validation
2.2.10 Experimental Testing
2.3 Results and Discussion
2.3.1 Analysis of Chemical and Biological Space
2.3.2 Benchmarking Compound and Target Descriptors for PCM Models
2.3.3 Algorithms and Model Performance
2.3.4 Model Validation
2.3.5 Benchmarking to QSAR, QSAM and baseline models
2.3.6 Public Dataset PCM Model
2.3.7 Applicability Domain
2.3.8 Experimental Validation
2.4 Conclusions
3 Selectivity Insight for the Rational Design of Bromodomain Inhibitors Derived from Proteochemometric Models
3.1 Introduction
3.2 Materials and Methods
3.2.1 Computational models
3.2.2 Model Interpretation
3.2.3 X-ray Crystallographic Structure Determination for BRD4 and BRD9
3.3 Results and Discussion
3.3.1 Interpretation
3.3.2 Binding Modes Elucidated by Co-Crystallization
3.4 Conclusions
4 Associations Between Drug-Induced Adverse Events in Animal Models and Humans: Beyond Concordance
4.1 Introduction
4.2 Materials and Methods
4.2.1 Dataset
4.2.2 Feature Filtering
4.2.3 Mutual Information Associations
4.2.4 Network Analysis of Significant Associations
4.2.5 Limitations
4.3 Results and Discussion
4.3.1 Concordance Analysis
4.3.2 Statistical Associations Between All Preclinical and Clinical AEs
4.3.3 Network Analysis of Significant AEs
4.4 Conclusions
5 Deriving Potential Mechanisms from Associations Between Drug-Induced Adverse Events in Animal Models and Humans
5.1 Introduction
5.2 Materials and Methods
5.2.1 Associations for Interpretation
5.2.2 Extracting Targets for Drugs
5.2.3 Mapping Preclinical and Clinical AEs to Ontology Terms
5.2.4 Extracting Genes for Preclinical and Clinical AEs
5.2.5 Gene Filtering and Mapping Animal Genes to Human Orthologs
5.2.6 Overlap Analysis of Genes from Preclinical AE, Clinical AE and Drugs
5.2.7 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
5.2.8 Visualisations
5.2.9 Limitations
5.3 Results and Discussion
5.3.1 Overall Analysis of Mechanistically Supported Associations
5.3.2 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
5.3.3 Mechanistically Supported One-to-One Relationships Between AEs
5.3.4 Mechanistically Supported Groups of Associations Between AEs
5.4 Conclusions
6 Conclusions and Future Perspective
7 References
8 Appendix
8.1 Tables and Figures
8.2 Supplementary Data Files
8.3 Compound Characterization Data
8.3.1 General Methods
8.3.2 Compound 1
8.3.3 Compound 2
8.3.4 Compound 3
8.3.5 Compound 4
8.3.6 Compound 5
8.3.7 Compound 6
8.3.8 Compound 7
8.3.9 Compound 8
8.4 Experimental Data
1 Thesis Introduction
This thesis explores data science and machine learning methods relevant to understanding and
predicting the selectivity and toxicity of drugs. In the introduction we first highlight the
challenges faced in drug discovery associated with obtaining drug molecules which are selective
and safe, describing how these parameters link to the concepts of primary and secondary
pharmacology. The theory behind general methods used within the thesis related to data
science will subsequently be discussed to provide the appropriate technical background. Finally,
we review and discuss previous applications of computational approaches to address the
endpoints of focus for this thesis, namely selectivity and toxicity, outlining gaps in existing work.
Overall, the introduction presents a case for the research questions addressed in this work.
1.1 Challenges in Drug Discovery
Small molecule drug discovery is traditionally a multi-step process,
starting with the identification of a target and culminating in drug approval for certain diseases,
patient populations and territories (Figure 1-1).1,2 Due to the complexity of the process, it usually
takes over 12 years3 and costs over $1 billion4 to develop a New Molecular Entity (NME) from
concept to launch.
Figure 1-1: Traditional stages of the drug discovery process
Stages shown, in order: Target Identification and Validation; Hit Generation; Lead Optimization; Pre-clinical Studies; Phase 1 (Safety); Phase 2 (Efficacy, Safety); Phase 3 (Efficacy, Safety); Approval and Launch.
There is a high attrition rate in many stages of the process and it has been estimated that over
24 drug projects are initiated for every drug launched, with numbers increasing for more
difficult therapeutic targets.2 The main stages of attrition are in phase 2 and phase 3 clinical
trials, with 66 % and 30 % of compounds respectively failing during these stages. It is therefore
important to understand efficacy and safety early in the drug discovery process to avoid late-
stage attrition.
1.1.1 Primary and Secondary Pharmacology of Drugs
There are challenges to overcome in all stages of the drug discovery process with the aim of
reducing attrition from lack of efficacy in patients and unanticipated toxicity. One of these
challenges is anticipating and understanding the consequences of the off-target or secondary
pharmacology profiles of drugs. In a traditional target-based model of drug discovery, the aim
is to develop a compound for a primary biological target (or sometimes targets) with a validated
role in disease. Due to the vast number and types of biological targets in the human body, it is
likely that drugs will not only interact with the specific target(s) of interest, but also exhibit a
range of off-target activities.5 Interaction with multiple targets, known as polypharmacology,6
can be favourable and used as an approach to modulate multiple targets with roles in disease,
inspired by the concept of combination therapy.7 However, off-target activities can often be
unfavourable and can contribute to unwanted pharmacological or toxicological effects.
Medicinal chemists attempt to design compounds that have the desired on-target profile whilst
mitigating the risk of off-target effects during the lead optimisation process. Two concepts
related to on- and off-target activities of drugs, namely selectivity and toxicity, are explored in
the material of this thesis.
Challenge of Selectivity
Selectivity and Polypharmacology
A specific drug is one which has only one desired effect on a biological system.8 In reality,
however, most drug molecules have effects on multiple components of the system, given a high
enough dose, and the degree of this promiscuity is termed the selectivity of a drug. These system
components can be genes, proteins, pathways or cells, so selectivity is a broad term covering
effects at different levels.9
In the traditional approach to drug discovery, with the aim to modulate a specific protein
function in disease, selectivity refers to target selectivity. Target selectivity is the degree to
which a small molecule interacts with a desired protein target versus its degree of interaction
with off-targets. In medicinal chemistry, small molecules are optimised for a biological target
or targets of interest, in an iterative process, using a variety of binding, functional and cell-based
assays to measure success. One of the main aims in lead optimisation programs is to design a
drug with a balance of high potency at the desired target, whilst minimising the interactions
with undesirable targets, termed optimising drug selectivity. Activity for undesirable off-targets
can cause unwanted effects including toxicities, which will be discussed in the section Off-
Target Pharmacology.
Not all drugs need to be selective for a single target. Polypharmacology refers to a single
molecule which modulates multiple molecular targets and can be described as either desirable
or undesirable.10 The established hypothesis that a drug needs to interact with a specific target
protein ignores the fact that multiple targets within a network cause a disease.11 For example, in
cancer, where mutations occur rapidly, it is useful to have a drug which binds not only the
wild-type target but also any resistant variants of the protein.5 Additionally, diseases are
sometimes caused by multiple pathways which compensate for one another when one is
modulated, for example in mood disorders and schizophrenia.12 This can provide an opportunity
for the development of a therapeutic agent to interact with at least one target in each pathway
to mediate the desired effect. The single-target hypothesis is also shown in some cases to have
limitations in duration of pharmacodynamic effect due to compensatory cellular mechanisms.13
Even for the multi-target objective, the aim would remain to create drug molecules with
selectivity over other undesired targets; these drugs can be termed “selectively
promiscuous”.14
Types of Selectivity
Given that most drugs require some type of selectivity optimisation, we will now discuss the
different types of selectivity.
One important type of selectivity addressed in the lead optimisation stage is target-family
selectivity. This is important when the biological target of interest belongs to a characterised
biological family, which is known to bind to similar endogenous substrates.5 For example,
kinases present one of the major target classes for the development of new therapeutics due to
their roles in diseases such as cancer and inflammatory disorders.15 Because of the functionally
conserved ATP binding site, inhibitor molecules for one kinase often also
inhibit other targets within the family.16 Thus, for kinase targets, target family selectivity is often
assessed early on using panel screening against the druggable kinome.15,17 Similar principles
apply to other target families, and members of the same target family are often screened as a
first port of call in projects. This is due to the hypothesis that targets phylogenetically related
to the primary target are more likely to bind similar endogenous molecules and therefore
interact with the same chemotypes of small molecules.18 The main aspects governing the
binding of small molecules to protein targets, namely shape, electrostatics, flexibility,
hydration and allostery, can be applied to measure target similarity and to understand and
predict selectivity.5 Targets displaying similarity in these aspects to the primary target are often
also targeted by similar small molecules, and exploiting the differences can help to
achieve selectivity.18 To obtain selectivity for specific related targets, screening can be conducted
for the secondary targets alongside the primary target and selectivity optimised iteratively based
on structure-activity relationships (SAR).
In addition to interacting with other members of the same target class as the primary intended
target(s), drugs can also interact in vivo with targets that are unrelated in structure or
function to the primary drug target. These targets include enzymes, ion channels, receptors and
transporters, which are known to cause unwanted effects, for example, the human
ether-à-go-go-related gene (hERG) channel and Nav1.5. In addition, molecules are screened for
drug-drug interactions (DDIs)5
and plasma protein binding,19,20 two other examples of non-specific binding. To mitigate
the risks of unrelated off-targets, there are a few avenues which can be explored, including in
vitro toxicity panel screening21 (discussed in Off-Target Toxicity Assessment) and avoiding
molecular functionality or properties known to interact with certain undesired off-targets (for
example, toxicophores).10
Measuring Selectivity
Between two targets, we can measure the ratio of activity for the on-target and off-target of
interest as a measurement of selectivity. This is known as the selectivity ratio (SR) and is
defined in Equation 1-1.22
Equation 1-1
\[ SR = \frac{\text{biological activity of off-target assay}}{\text{biological activity of on-target assay}} \]
This calculation requires the measurement of biological activity for both the on-target and off-
target. Biological activity is defined as the “specific ability or capacity of a particular molecular
entity to achieve a defined biological effect.”23 Multiple measurements of biological activity at a
target are commonly used, including the dissociation constant between the ligand-receptor
complex (Kd),24 the half-maximal effective concentration (EC50),25 the half maximal inhibitory
concentration (IC50),25 and the dissociation constant of the receptor-inhibitor complex at
equilibrium (Ki).26 The value of the SR is often described as the fold-selectivity. A higher fold-
selectivity means the drug is more selective, since there was a larger difference between the
concentrations needed to afford the same degree of response.
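As an illustration of Equation 1-1, the following minimal sketch computes the selectivity ratio from two concentration-based activity values (for example IC50 or Ki). The function names and the example values are hypothetical and are not taken from any dataset in this thesis; it assumes both activities are expressed in the same units, with lower values indicating higher potency.

```python
import math


def selectivity_ratio(off_target_activity: float, on_target_activity: float) -> float:
    """Equation 1-1: off-target activity divided by on-target activity.

    Both values are assumed to be concentrations (e.g. IC50 in nM) in the same units.
    """
    if off_target_activity <= 0 or on_target_activity <= 0:
        raise ValueError("activities must be positive concentrations")
    return off_target_activity / on_target_activity


def log_selectivity(off_target_activity: float, on_target_activity: float) -> float:
    """Selectivity expressed in log10 units (difference in pIC50-style values)."""
    return math.log10(selectivity_ratio(off_target_activity, on_target_activity))


# Hypothetical example values, for illustration only.
on_target_ic50_nm = 10.0
off_target_ic50_nm = 1000.0

fold = selectivity_ratio(off_target_ic50_nm, on_target_ic50_nm)
print(f"{fold:.0f}-fold selective ({log_selectivity(off_target_ic50_nm, on_target_ic50_nm):.1f} log units)")
```

Expressing the ratio in log units is a common convenience when activities are stored as pIC50-type values, since the fold-selectivity then becomes a simple difference.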
Beyond the comparison of two targets, basic measurements can determine the degree of
promiscuity of a drug, by counting the number of targets for which the bioactivity value is found
to be below a certain threshold for activity for each drug. Different methods have been used to
estimate the average drug promiscuity, finding that a drug might on average interact with 3-6
targets at a concentration of less than 10 μM according to estimations from DrugBank and
Wombat27 and that 50 % of drugs interact with more than 5 different targets at a concentration
of less than 10 μM determined from drugs annotated in the Anatomical Therapeutic Chemical
(ATC) classification.28 Utilising the large amounts of bioactivity data which have become
available in recent years, a detailed analysis of compound promiscuity has been conducted at
different stages of the drug development process, finding that around 50 % of screening hits
had activity against at least two targets and that a notable increase in promiscuity was observed
from screening hits and optimised compounds (2.9-3.7 targets on average, depending upon the
dataset), to experimental and approved drugs (4.7 and 6.9 targets on average, respectively).29 It
was also found that similar drugs are likely to bind similar targets, and in one study 71 % of
drugs were found to interact with at least two targets with similar binding sites.18
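The threshold-based promiscuity count described above can be sketched as follows. This is a minimal illustration under stated assumptions: the compound names, target names and bioactivity values are hypothetical, and the 10 μM cut-off simply mirrors the threshold quoted in the studies above.

```python
from statistics import mean

ACTIVITY_THRESHOLD_UM = 10.0  # activity cut-off in micromolar, as quoted in the text


def promiscuity(bioactivities_um: dict, threshold_um: float = ACTIVITY_THRESHOLD_UM) -> int:
    """Count the targets for which the measured bioactivity is below the threshold."""
    return sum(1 for value in bioactivities_um.values() if value < threshold_um)


# Hypothetical bioactivity profiles (target -> activity in micromolar).
profiles = {
    "compound_A": {"target_1": 0.05, "target_2": 2.0, "target_3": 50.0},
    "compound_B": {"target_1": 12.0, "target_2": 30.0},
    "compound_C": {"target_1": 0.5, "target_2": 0.9, "target_3": 8.0, "target_4": 9.9},
}

counts = {name: promiscuity(profile) for name, profile in profiles.items()}
average_promiscuity = mean(list(counts.values()))

print(counts)
print(f"average number of targets per compound: {average_promiscuity}")
```

Note that a count of this kind depends on how many targets each compound was actually tested against, so sparsely profiled compounds will appear less promiscuous than they may be.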
Overall, we note that most compounds display some degree of promiscuity or
polypharmacology, and that experimental and approved drugs have a large degree of
promiscuity, which may be causing undesired effects. Designing molecules with the appropriate
profile of on- and off-target pharmacology is therefore imperative to provide the appropriate in
vivo functionality.
Challenge of Designing Selective Bromodomain Inhibitors
As mentioned above, it can be difficult to achieve selectivity between members of the same
target family. Bromodomain-containing proteins are a family of epigenetic proteins, which play
important roles in immunological, developmental and cardiovascular disorders, as well as
cancers.30 To better elucidate their individual roles in disease, selective probe molecules are
required. Designing for selectivity remains difficult since small molecule bromodomain binding
requires the mimicry of endogenous peptide binding and therefore chemotypes which bind to
one bromodomain often also bind to other bromodomains. This lack of selectivity is particularly
notable between bromodomains with a high binding site sequence conservation.31 In this
section we provide the background to bromodomain-containing proteins including their
function and structure, as well as highlighting literature knowledge of the active site residues
which form interactions with known small molecule binders.
Bromodomain Function and Clinical Relevance
Bromodomain-containing proteins are epigenetic readers, which regulate gene transcription by
recognising and binding to acetyl lysine post translational modifications (PTMs) on histone
proteins, as well as other proteins.32–34 PTMs alter the accessibility of proximal chromatin to
gene transcription and the binding of bromodomain-containing proteins often causes the
transcriptional activation of genes.35 A total of 61 unique bromodomains have been identified in the
human proteome, which have been classified into eight subfamilies.36 Most bromodomains
contain a deep and well-defined acetyl lysine recognition binding site, which interacts with
acetylated lysine residues on histone peptides in a protein-protein interaction, providing an
interaction which can be disrupted by small molecule inhibition.37,33
Histone acetylation is associated with an open chromatin state and consequently transcriptional
activation. This open state is due to the neutralisation of the charged lysine upon acetylation,
which affects the packing of the chromatin by reducing the strong ionic interactions that form
the closed structure of the DNA.36 The histone acetylation process is dynamically regulated by
the writer proteins, histone acetyl-transferases (HATs), and the eraser proteins, histone
deacetylases (HDACs),33 inhibitors of which have been developed in order to interfere with
the DNA accessibility and therefore gene transcription processes. Once these acetylation marks
have been positioned by the HAT enzymes, bromodomains act to read these modifications and
recruit the transcription factors and chromatin remodelling factors that are responsible for the
increase in gene transcription.32 Specific bromodomains have different profiles of acetyl lysine
mark recognition, which determine their functional roles; this was elucidated by
Filippakopoulos et al., 2012 in a study that used SPOT peptide arrays in order to assess
bromodomain selectivity.38
Bromo and extra terminal (BET) family bromodomains have been well-studied, with
bromodomain containing protein 4 (BRD4) having received the most attention for the
development of inhibitors.39 The BET bromodomains include: BRD2, BRD3, BRD4 and BRDT.
Each protein contains two bromodomains, which are known as BD1 and BD2. The majority of
drugs in clinical trials exhibit pan-BET activity, as selectivity for the BET bromodomains is
difficult to achieve, especially due to the similarity of BD1 and BD2 domains across proteins.
There are 17 inhibitors for BET in clinical trials currently for a wide range of indications,
including cardiovascular conditions, such as atherosclerosis, diabetes, solid tumours and
haematological tumours.30,33,36,40–42 Five commonly reported BET inhibitors (belonging to
different scaffolds) are depicted in Figure 1-2.
Figure 1-2: Structures of five pan-BET bromodomain inhibitors: JQ1, I-BET762, I-BET151, RVX-208 and XD14.
JQ1 and I-BET762 (Figure 1-2) were the first potent BET inhibitors, which stimulated the interest
in targeting bromodomains for various indications.30,43,44 These both share the same diazepine
chemical scaffold. JQ1 was developed when compounds containing the triazolothienodiazepine scaffold were
discovered to have anti-cancer properties in addition to anti-inflammatory effects, caused by
BET inhibition.36 The inhibitor was shown to possess BRD4 BD1 activity of 50 nM and activity
against NUT midline carcinoma, an aggressive form of childhood cancer, driven by BRD4-NUT
translocations. This discovery led to the therapeutic rationale for targeting BET proteins with
inhibitors for oncology.45 Subsequently, other areas of oncology have been explored for the
potential for therapeutic benefit from a BET inhibitor. It was found that the most active target
of BET was BRD4 which when inhibited leads to the downregulation of transcription of genes
involved in cell proliferation, including the growth promoting oncogene c-MYC, which is
present as a key driver in many cancer types.41 I-BET762 was developed in the same chemical
scaffold as JQ1 and validated against the downregulation of inflammatory gene expression in
LPS-stimulated macrophages.44 The main effect was the suppression of inflammatory cytokines
and transcription factors that lead to inflammation. This led to the therapeutic application in
inflammatory disease for BET inhibitors. Other BET inhibitors have since been developed with
more varying chemotypes.36,46
The therapeutic potential identified from BET inhibitor validation studies led to the
investigation of the other bromodomains as potential biological targets of interest. The histone
acetyl transferase (HAT) proteins CREBBP and EP300 act as co-activators for transcription
factors including p53, which is mutated in 50 % of cancers.47 Ischemin was developed as the
first inhibitor of CREBBP/EP300 and was shown to display anti-apoptotic phenotypes in
ischemic cardiomyocytes, demonstrated to be due to the post translational modification of
histones and inhibition of p53 gene activation by CREBBP after recognition of the acetyl lysine
mark. This inhibitor dually inhibits both the HAT and bromodomains in the CREBBP protein
to produce this effect.48 Since then, multiple inhibitors of CREBBP have been developed,49,50 and
CREBBP and EP300 dysfunction have been implicated in developmental disorders,
haematological malignancies, inflammation and neuropsychiatric disorders.34
In addition to the above, inhibitors for the other bromodomain-containing proteins have also
been produced. For PCAF, one inhibitor was shown to prevent the binding of HIV trans-
activator Tat, preventing HIV provirus transcription and replication.51 PCAF inhibitors are also
broadly associated with roles in cancers and neuro-inflammation, indications which are
included in patents for inhibitor molecules.34 For CECR2, GNE-886 was produced as a selective
inhibitor, suitable for use as an in vivo tool molecule.52 For BPTF, a moderately potent inhibitor
rac-1 was optimised53 to better elucidate the role of BPTF in various cancers including bladder
and colorectal cancers.34 BAZ2A and BAZ2B have been associated with prostate cancer,
lymphoblastic leukaemia and sudden cardiac death.34 The BAZ2-ICR compound (ref. 54) and
GSK2801 (ref. 55) have been developed for the BAZ2 bromodomains to further elucidate their roles in
disease. Multiple inhibitors of BRD9/BRD7 have been discovered, including LP-99 (ref. 56), I-BRD9 (ref. 57),
BI-7273 and BI-9564 (both ref. 58), which caused the selective down-regulation of genes implicated in
cancer and immune response pathways in acute myeloid leukaemia (AML).34,36,56,57 BRPF1,
BRPF2 (BRD1) and BRPF3 have been targeted by the benzimidazolone scaffold present in
GSK6853 (ref. 59) and the 1,3-dimethylquinolin-2(1H)-one scaffold present in NI-57.60 Dual
TRIM24/BRPF1b inhibitors containing the benzimidazolone scaffold were also discovered.61
ATAD2 inhibitors (refs. 62–64) have recently been developed and the bromodomain has been linked to
a variety of different malignancies.34 PFI-3 (ref. 65) was discovered as an inhibitor of the SMARCA2/4
and PB1 bromodomains which are also linked to a number of cancers.34 In addition to selective
inhibitors, pan-BRD inhibitors have also been developed, including triazolophthalazines which
displayed activity for CECR2, BRD4, CREBBP, BRD9 and TAF1L, and Bromosporine, which was
active for BET, CECR2, TAF1, BRD9 and CREBBP.36 There is increasing evidence that some
existing kinase inhibitors are dual kinase-BRD inhibitors; inhibitors of JAK2, PLK1 and PI3K
have been found to also inhibit BET bromodomains.66,67
Structural Binding Site and Phylogenetics
A total of 61 bromodomains are encoded in the human genome, expressed in 46 unique proteins.68
Bromodomains are found in a range of proteins that are associated with other chromatin
functions including the histone acetyl transferases (HAT) and histone methyl transferase
(HMT) proteins.36 They are also found in transcription initiation factors and ATP dependent
helicases and in proteins containing other reader domains, including other bromodomains
within the same protein, as well as reader domains that recognise other marks including the
methyl lysine recognition domain (known as the PHD finger domain).69
Structurally, bromodomains have a highly conserved region in the acetyl lysine binding site,
known as the globular fold, which forms between a left-handed α-helical structure comprised
of four α-helices (αZ, αA, αB, αC) and two loop regions (known as the ZA and BC loops). The
hydrophobic binding site is nestled between these structures (Figure 1-3a).68 There are also
conserved residues across bromodomains which are involved in acetyl lysine binding, mediated
through water molecules, which have been observed in crystal structures. A tyrosine residue in
the ZA loop and an asparagine in the BC loop are usually present, except in a few
cases where the ZA loop configuration is unusual.36 The asparagine
residue forms a hydrogen bond to the peptide, as shown in Figure 1-3b. There is also extensive
H-bond stabilisation to the peptide backbone from the bromodomain site. Interaction of
diacetylated peptides with one bromodomain can also occur and these often bind with higher
affinity than singly acetylated peptides.70
Figure 1-3: left (a), shows the standard structure of a bromodomain with four α-helices and two loops. Diagram was generated using PDB structure 3UVW in MOE. Structure is formed of a diacetylated histone 4 peptide (in pink) and BRD4 BD1 (in ribbon form). Right (b), shows the binding site with the conserved residues, tyrosine 97 and asparagine 240 (in teal).
Crystal structures are now available for 47 members of the bromodomain family, many of which
were delivered as a result of efforts from the Structural Genomics Consortium (SGC).38
Bromodomains have been organised into subtypes according to their sequence similarity. The
most structurally characterised types are I, VI and VIII. The families still lacking in structural
characterisation include members of groups III, V and VII. Inhibitors for the type II
bromodomain and extra-terminal (BET) BRDs were the first to be identified, consistent with
the functional information known about this subfamily. It is interesting to observe that
inhibitors have been produced for members of all families to date, suggesting that all
bromodomains could be druggable.
The BET family of bromodomains is the most explored family of bromodomains. These are
characterised by their two distinct binding sites, known as bromodomain 1 (BD1) and
bromodomain 2 (BD2). Structurally, BET bromodomains within the same protein are less
similar to each other (<45 % sequence identity) than the same domain between proteins (∼75
% identity). For this reason, it is difficult to obtain selective inhibitors for this family of
bromodomains and explains why most of the drugs targeting these are pan-BET inhibitors.
Inhibitor Chemotypes
Structurally, inhibitors mimic the binding of acetyl lysine and therefore need to contain
functional groups which can exploit the key interaction of the peptide amide carbonyl with the
asparagine residue. Bromodomain inhibitors therefore have an equivalent acceptor which can
interact with this residue. Such chemotypes for the BET family include 1,2,4-triazoles,
quinolones, dimethylisoxazoles and thiazolinones.71 As mentioned above, five of these
structures are depicted in Figure 1-2 and whilst these compounds all bind to the same binding
site, they differ in their molecular properties including their acid/base properties (Table 1-1).
Lipophilicity is often calculated for the neutral species as the logP value;
however, this value does not consider the ionisation states of compounds at physiological pH,
which is why the logD value (the distribution coefficient at physiological pH, which accounts for ionised species) should also be considered and is
often different to the clogP value. Although all structures in Figure 1-2 are depicted as neutral,
which will be the dominant species at physiological pH for these compounds, these molecules
will exist in equilibrium between charged and uncharged species which is governed by the pKa
of the functional groups. For example, a proportion of RVX-208 at pH 7.4 may be in the NH+ lactam
tautomeric species since the pKa of this group is 8.9, which might be responsible for the
decrease in logD in comparison to logP. Additionally, the phenol moiety in XD14 will be mostly
protonated, however, some of the species will be deprotonated and therefore charged since the
pKa of this group is at 8.0. Again, this is likely to result in the lower logD prediction in
comparison to logP. It should be noted here that these are calculated values and therefore the
error in these values may also contribute significantly to the differences observed. More
generally, quinazolones, thienopyridones, methyl pyrazoles, acetylindolizines and propylated
benzoxazepines, have been identified as acetyl lysine mimetics for other bromodomains.36
Further chemotypes known for bromodomains are shown and discussed in chapter 2.
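The relationship between pKa, ionisation at physiological pH and the shift from logP to logD can be illustrated with the Henderson-Hasselbalch equation. The sketch below is an idealised calculation and not the AstraZeneca or ACD/Labs models used for Table 1-1: it assumes a single monoprotic acidic group and that only the neutral species partitions into the organic phase. The function names are hypothetical; the pKa of 8.0 and clogP of 3.27 are the values quoted for XD14 above.

```python
import math

PHYSIOLOGICAL_PH = 7.4


def fraction_ionised_acid(pka: float, ph: float = PHYSIOLOGICAL_PH) -> float:
    """Fraction of a monoprotic acid HA present as the deprotonated (charged) form A-."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


def logd_from_logp(logp: float, fraction_ionised: float) -> float:
    """Idealised logD, assuming only the neutral species partitions into octanol."""
    return logp + math.log10(1.0 - fraction_ionised)


# XD14 phenol: pKa 8.0 and clogP 3.27, as listed in Table 1-1.
phenol_ionised = fraction_ionised_acid(8.0)
idealised_logd = logd_from_logp(3.27, phenol_ionised)

print(f"fraction of phenol deprotonated at pH 7.4: {phenol_ionised:.2f}")
print(f"idealised logD from clogP 3.27: {idealised_logd:.2f}")
```

Under these assumptions roughly a fifth of the phenol is deprotonated at pH 7.4, which lowers logD by about 0.1 log units. The difference between the tabulated clogP and logD for XD14 is larger than this, which is in line with the caveat above that error in the calculated values may also contribute to the differences observed.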
Table 1-1: calculated properties for the five bromodomain inhibitors shown in Figure 1-2. clogP and logD were calculated based on models within AstraZeneca and pKa’s were calculated using the Advanced Chemistry Development, Inc ACD/Labs software.72
Compound; clogP; logD; pKa/functional group (one entry per ionisable group)
JQ1; 4.82; 3.97; 1.9/NH+ diazepine, 1.5/NH+ diazepine tautomer
I-BET762; 3.29; 2.76; 15.5/NH amide, 1.9/NH+ diazepine, 2.4/NH+ diazepine tautomer
RVX-208; 3.25; 2.44; 14.2/OH, 8.9/NH lactam, 8.9/NH+ lactam tautomer
I-BET151; 2.32; 2.72; 11.1/NH azabenzimidazolone, 5.3/NH+ azabenzimidazolone, 4.7/NH+ pyridine
XD14; 3.27; 2.71; 18.5/NH amide, 8.0/OH phenol
Selectivity Knowledge for Bromodomain Binding to Small Molecule Inhibitors
Recently, there has been an increase in numbers of chemical probes produced for the
bromodomain family of proteins, which has provided some information on specific target
residues important for obtaining activity at individual bromodomains or selectivity for one
bromodomain over other members of the target family. The structural binding sites of
bromodomains differ in specific residues and more broadly in their secondary and tertiary
structure. These features have been exploited for structure-based design of selective inhibitors
previously.
Table 1-2 provides a summary of residues in the active site of bromodomains which are currently
known to make interactions with small molecules, indicating that there is evidence to support
their importance for binding. The contents of the table were determined from a survey of the
literature and highlight the key differences in bromodomain sequence and structure which have
been exploited previously to design inhibitor and probe molecules for the bromodomain target
family. It is important to note that the crystallographic contents extracted here will be affected
by the quality of the crystal structures analysed. For all 1,492 bromodomain X-ray structures in
the Protein Data Bank (PDB), we extracted the refinement details report (Supplementary Data
File 1). The mean residual value (R-value) for these structures was 0.19, with a standard deviation
of 0.03 which is around the typical expected value for the difference between the atomic model
and the experimental data, with only small differences between the R-value and R-free of 0.03,
a sign that the atomic model did not overfit to the experimental data.73 The mean resolution
was 1.78 Å with a standard deviation of 0.42 Å, showing that on average we can be confident about
the location of atoms in bromodomain crystal structures due to the well-defined electron
density maps that can be created from the detailed diffraction pattern.
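The kind of summary described above (mean and standard deviation of resolution and R-value, and the gap between R-free and R-value) can be sketched as follows. The records are hypothetical placeholders and are not taken from Supplementary Data File 1, and the gap cut-off of 0.05 is an illustrative assumption rather than a threshold used in this work.

```python
from statistics import mean, stdev

# Hypothetical refinement records: (identifier, resolution in Å, R-value, R-free).
records = [
    ("struct_1", 1.5, 0.17, 0.20),
    ("struct_2", 1.8, 0.19, 0.22),
    ("struct_3", 2.1, 0.21, 0.28),
]

MAX_R_GAP = 0.05  # illustrative cut-off for the R-free minus R-value gap (assumption)

resolutions = [resolution for _, resolution, _, _ in records]
r_values = [r_value for _, _, r_value, _ in records]

mean_resolution, sd_resolution = mean(resolutions), stdev(resolutions)
mean_r_value, sd_r_value = mean(r_values), stdev(r_values)

# A large gap between R-free and R-value can indicate over-fitting of the atomic model.
flagged = [name for name, _, r_value, r_free in records if r_free - r_value > MAX_R_GAP]

print(f"resolution: {mean_resolution:.2f} ± {sd_resolution:.2f} Å")
print(f"R-value: {mean_r_value:.2f} ± {sd_r_value:.2f}")
print(f"structures with a large R-free gap: {flagged}")
```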
Notably, it can be observed that there are differences in the numbers of interacting residues
between bromodomains, with BRD4 BD1, CREBBP and BRD9 having the highest number of
known interacting residues, due to the number of publications and crystal structures for these
bromodomains. For some bromodomains, how to optimise interactions between small
molecules and the protein to obtain activity is largely unknown.
Table 1-2: Residues highlighted from the literature to have roles in determining binding affinity for different bromodomains. Not included here are the interactions with the conserved asparagine and tyrosine residues, since this interaction is common across all bromodomains studied.
Bromodomain Activity/Selectivity Residues References
BPTF F2887, D2834, W2824 53
CECR2 Y520, P458, F459, M506, W457 52,74
KAT2A Y814, E761, P752, W751 75
PCAF Y809, E756, P747, W746, V752, P751, K753 75,76
BRD2 BD1 Q101, D160, I162 31,77
BRD2 BD2 K374, H433, V435 31,77
BRD3 BD1 Q61, D120, I122 31,77
BRD3 BD2 K336, H395, V397 31,77
BRD4 BD1 Q85, D144, I146, Y139, L92, W81, L97, P82, F83 31,77–80
BRD4 BD2 K378, H437, V439, Y432, N433, P375, P376 31,77,78,80
BRDT BD1 I115 31,77
BRDT BD2 V357 31,77
BRWD1 BD2 None known
CREBBP R1173, L1109, V1174, L1120, P1110, F1111, Q1113, L1119 81–84
EP300 R1137, L1073 81–83
ATAD2 R1007, V1008, R1077, D1071, D1014, I1074, V1018, Y1063 85–87
ATAD2B N981, I982, R1051, D1045, D988, I1048, V992, Y1037 86,87
BRD1 F648, S592, V647, Q589, Y649 60,88–90
BRD7 Y217, A154, S157 57,58
BRD9 Y106, F44, I53, A54, G43, F45, V49, H42, R101, A46, F47, P48, T50 56–58,91
BRPF1 F714, P658, I713 60,88–90
BRPF3 F675 88,89
BAZ2A W1816, V1879, E1820 55,92
BAZ2B W1887, I1950, L1891 55,92
TIF1A A923, L922, F924, V932, V928, V986, P929, E985 61,93
TRIM33 None known
TAF1 BD2 N1533, W1526 74,94
TAF1L BD2 None known
PB1 BD5 A703, M699, L687, L693, I745, I683, F684, M731 34,65,95
SMARCA2 L1418, P1413, E1417 65
SMARCA4 L1494, P1489, E1493 65
Challenge of Toxicity
As mentioned in the section Primary and Secondary Pharmacology of Drugs, unanticipated
toxicity in human trials is one of the major causes of drug attrition.96 In total, over 30 % of drugs
are terminated due to animal or human adverse events.97 Efforts are therefore being made to
assess toxicity in more relevant models prior to clinical trials in order to prevent investment
into molecules which will be later terminated for toxicity. This is consistent with the concept
that it is better to fail early to reduce costs and avoid unnecessary risk to human volunteers and
patients.98
Mechanisms of Toxicity
“The dose makes the poison” was a principle outlined by Paracelsus,99 which described the fact
that all chemicals are poisons at high enough doses. Nowadays, it is generally considered that
there are five main types of drug-induced toxicities, namely on-target toxicity, hypersensitivity
and immunological reactions, biological activation to toxic metabolites, idiosyncratic toxicities
and off-target pharmacology.100,101
On-Target Toxicity
This can be described as on-target effects that occur in the wrong organ or cell type sometimes
due to elevated doses. A good example of this is statin toxicity, which eventually led to the
market withdrawal of cerivastatin.102 Statins act on the liver to reduce synthesis of cholesterol
through inhibition of the isoprenoid pathway, however in the muscles, inhibition of the same
mechanism can lead to excess myoglobin which can cause kidney failure.100 This is an example
of the same mechanism having different effects in different tissues.
Hypersensitivity and Immunological Reactions
These reactions often result from long term exposure to a drug and are unpredictable.103 These
reactions are split further into four categories: type 1 are IgE mediated reactions and cause
anaphylaxis, asthma and urticaria, type 2 are immunoglobulin-mediated cytotoxic mechanisms
which can lead to abnormality in blood cells, type 3 are immune complex mediated, for example
vasculitis from which inflammation can cause the destruction of blood vessels, and type 4 are a
range of reactions mediated by T-cells, causing skin rashes known as delayed hypersensitivity
reactions.104 Common agents associated with these hypersensitivity and immunological
reactions include penicillins and other β-lactam antibiotics.101
Biological Activation to Toxic Metabolites
Biological activation toxicity involves the biotransformation of the drug into toxic metabolites.
Since drug molecules are often lipophilic in nature due to their need to permeate across the
lipid bilayer, many of the biotransformations performed by enzymes in the body act to make
drug molecules more hydrophilic to aid excretion from the body. The enzymes responsible for
biotransformation act on a broader degree of chemical structures than therapeutic targets,
which explains why different classes of drug molecules can often interact with the same
cytochrome P450 enzymes.105 Biotransformation occurs in two steps, namely phase 1 and phase
2 reactions and occurs primarily in the liver. Phase 1 reactions involve oxidation, reduction and
hydrolysis reactions,101 whereas phase 2 bioconjugates the resulting phase 1 metabolite with an
endogenous high polarity molecule. Primarily, these transformations are designed to detoxify
and aid excretion of drugs, however, in some cases toxic metabolites can form.
Biotransformed metabolites fall into four main categories based on mechanism. The first type
is the conversion to a stable, but toxic metabolite. This is rare for drug molecules; however, an
example of this type of toxicity is demonstrated by the oxidation of dichloromethane (a common
solvent) to carbon monoxide, by cytochrome P450 enzymes, which subsequently complexes
with haemoglobin to form carboxyhaemoglobin, restricting oxygen transport.106 The second
type is the conversion of a chemical to a reactive electrophile metabolite and is by far the most
common bioactivation toxicity mechanism. This can lead to cytotoxicity and carcinogenicity,
due to the reaction of the resulting electrophile with endogenous nucleophiles from proteins,
lipids or DNA tissue components, forming a covalent interaction which can eventually lead to
tissue degradation and necrosis.107 This is also thought to contribute to oxidative stress, due to
glutathione depletion.101 The third type of biotransformation is the conversion to a free radical,
a highly reactive species which can induce cascades of subsequent reactions, including the
oxidative modification of proteins and lipid peroxidation.105 The fourth type of transformation
involves the formation of reactive oxygen radicals, which have consequences for oxidative stress.
With redox cycling and high concentrations of oxygen radicals, reduced oxygen metabolites can
form, which can initiate diseases including arteriosclerosis and polyarthritis.105
Idiosyncratic Toxicities
These toxicities arise from other mechanisms and occur very rarely (1 in 10,000
individuals)101 and therefore their mechanisms are not well-known. Their rarity is attributed
to the genetic predisposition of some individuals to certain toxicities. These toxicities
often become apparent when the drug is exposed to an increased number and diversity of
patients, often after drugs have been marketed or in late-stage clinical trials, making their
prediction very difficult, and very few relevant animal models are available.108 These
toxicities are hypothesised to be due to low-level infections, loss of mitochondrial function and
inhibition of RNA polymerase.
Off-Target Pharmacology
As discussed in the section Types of Selectivity, when undesired secondary targets are
modulated by the drug, this can lead to toxicity.101 This will be discussed in more detail in the
next sections as it forms the basis for many in vitro toxicity screens.
Drug-Drug Interactions
Another source of toxicity can come from the interactions between one drug and another,
termed Drug-Drug interactions (DDIs). Increasingly, patients are taking many drugs for
different indications, and so it is important to assess toxicity with relation to other common
medications taken within the patient population. A drug interaction occurs when the concurrent
usage of one drug with another affects the level or activity of the other drug.109 Changing the
concentration of drug in the body from the approved concentrations can have effects on toxicity
and efficacy. An example of a DDI is the concurrent treatment with sulfonylureas, such
as glyburide for diabetes, and sulphonamide antibiotics, which can cause
hypoglycaemia due to inhibition of the cytochrome P450 2C9 (CYP2C9)-mediated
metabolism of glyburide.110
Toxicity Assessment in Drug Development
Toxicity studies are conducted at various stages of the drug discovery process.97 There are three
main stages of toxicity determination, namely in vitro assays, in vivo animal studies and clinical
human studies. Many of the requirements for toxicity testing are mandated by national
regulatory bodies such as the Food and Drug Administration (FDA) and the Medicines and
Healthcare products Regulatory Agency (MHRA).
In Vitro Toxicity
Off-Target Toxicity Assessment
As mentioned previously, off-target interactions of a molecule can lead to unwanted toxicity. It
is not currently mandated that off-target pharmacology screening be implemented, except for
testing against the human ether-a-go-go-related gene (hERG) potassium channel to anticipate cardiac
arrhythmias.111,112 When hERG is inhibited, long QT syndrome and eventual sudden cardiac
death can occur.113,114 Due to the severity of the potential adverse event in humans, and the
number of drugs which have been withdrawn due to hERG effects,97 this target is now
considered high priority for early liability assessment. Despite this being the only mandated off-
target assessment, selected off-targets known to be associated with toxicities in in vivo settings
are often screened for in early stages of drug discovery to mitigate against downstream risks.
Panel screening in vitro against targets associated with a variety of toxicities has been
highlighted as important for the de-risking of compounds for late-stage toxicities.21 The ability
to conduct medium-to-high throughput screens enables the fast profiling of safety targets.
Several initiatives have defined safety targets which might be appropriate to include in such
studies, including the Tox-21 target panel of in vitro quantitative HTS assays,115 Bioprint,116 the
Novartis safety panel published by Lounkine et al., 2012,117 and the panel-44 safety screen which
uses the targets published by Bowes et al., 2012.118 The study by Lynch et al., 2017119 summarizes
the evidence for adverse events induced by agonism/activation and antagonism/inhibition of
each of 70 off-targets included in AbbVie’s in vitro screening panel.120 For targets to be
considered as off-targets, a link between a drug induced adverse event and the off-target should
be made. Subsequently, not all targets can be incorporated into a screening panel and several
factors should be used for prioritisation, including 1. the severity of the adverse event associated
with the targets; 2. the hit rate of the targets; 3. the selection of targets across target families to
increase panel diversity; and 4. the assay formats being suitable for high throughput and high
concentration screening.121
In addition to hERG, the activation or inhibition of cytochrome P450 enzymes is also monitored
closely at an early stage, as these enzymes are responsible for the metabolism of many drugs
and are therefore key to whether the drug survives liver metabolism and reaches the tissue of
action in sufficient concentration to sustain the desired effect.122 Furthermore, inhibition or
activation of these targets is important for the understanding of potential DDIs.123
Cellular Toxicity Assessment
It is also important to assess drug toxicity in a cellular context to determine if a toxic phenotype
occurs in mammalian cells. Cytotoxic compounds can affect the viability of in vitro cultured
cells in different ways including interference with cellular attachment, significant changes in
cell morphology, changes in cell growth rate or induction of cell death.124,125 Many assays have
been developed for the in vitro assessment of cytotoxicity and cell viability, including
permeability assays measuring membrane integrity, functional assays, for example to measure
mitochondrial function/energy capacity, morphological assays including measurements of
changes to cellular support structures and osmotic properties and reproductive assays
measuring cell numbers and capacity for colony formation by cell division.126,127
Toxicity on a pathway level can also be evaluated in vitro. Many cellular assays have been
designed in a high throughput fashion to determine pathway-based toxicities, for example in
the Tox-21 database the induction of hypoxia is measured, which is the reduction in the tissue
oxygen tension. This can be brought about by different mechanisms which are captured by the
endpoint of hypoxia-inducible factor 1a activity.115
Genotoxicity Assessment
Genotoxicity is defined as the effect of a chemical on the cell’s genetic material, which leads to
a loss of integrity.128 Genotoxins are classified into three classes: carcinogens, mutagens and
teratogens, which cause cancer, mutations and birth defects, respectively. There are three main
tests employed in vitro for assessment of the genotoxicity of drugs. These include the Ames
reverse mutagenesis test,129 the micronucleus clastogenicity assay130,122 and the mouse
lymphoma thymidine kinase (tk) gene mutation assay.131 The Ames test is the long-established
test to identify carcinogens.129 This is conducted by applying a compound to a mutated strain
of Salmonella typhimurium bacteria, which is unable to synthesise histidine. With a genotoxic
substance, a second mutation arises which reverses the effect of the original mutation and
enables the synthesis of histidine.132 Clastogens are agents which induce structural aberrations
to chromosomes.133 The micronucleus test counts the number of abnormal cells in a tissue
sample harbouring a micronucleus, which is a marker of chromosomal aberration.134 A positive
result in both the micronucleus and the Ames tests is predictive of in vivo mutagenicity or
carcinogenicity.135 The mouse lymphoma cell assay can be used in some cases to identify
mutagens or clastogens and measures the degree of mutation or damage to the thymidine kinase
(tk) locus of mouse lymphoma tk cells, by measuring how resistance to trifluorothymidine
nucleoside changes.131 This assay was determined to be less useful in detecting carcinogens than the
other two assays combined.135
Organ Toxicity
The cellular methods above have limitations in the prediction of toxicity because these methods
do not retain the original organ functions and morphologies when cultivated in vitro.136 The
traditional approach to overcome this is to use whole animal models (described in In Vivo
Toxicity); however, ethical considerations as well as species differences between animals and
humans have resulted in the development of in vitro methods which can determine organ or body
toxicities. A recent technology established in this regard is the concept of organs-on-a-chip.136,137
This technology uses sterile microfluidic components containing multiple cell types
which would be found in a tissue, connected by synthetic blood vessels and allowing the influx
of the drug into the tissue.136 The microenvironment for the cells can be adjusted by changing
the pressure (using vacuums e.g. for lung or heart tissues), or the supply of nutrients to the cells
to mimic the endogenous environment in the tissue. Furthermore, components representing
individual organs can be pieced together to mimic the entire human system and its
connectivity.137
In Vivo Toxicity
In vivo toxicity screening is still considered the standard for assessment of the toxicological risks
in humans, due to the ability to administer the drug to the whole system of the animal, rather
than in an isolated target-based or cell-based assay.98 However, when compared to in vitro
screening for toxicity assessment, in vivo studies are lower throughput. For an in vivo study to
be practical for use in early drug discovery it needs to have rapid turnover of results and require
low compound amounts.98 Other challenges in in vivo toxicity studies include the variation
between species in biology, pathophysiology and pharmacokinetics, which affects the degree of
translation of toxicity between animals and humans,98 as well as the initiative to replace, reduce
and refine the use of animals in drug development for ethical reasons.138,139
Initial in vivo toxicity measurements come from the early animal studies to establish efficacy
and pharmacokinetic properties of the drug where the drug is administered at a single dose to
rodent and non-rodent species.98 This means that any toxicities observed will be at
pharmacological dose. Following from the initial in vivo studies, designated acute toxicity
studies are conducted which explore escalated dosing to determine the maximum tolerated
doses (MTDs) of the drug. The determined MTDs are then investigated in repeated-dosing
schedules within the same species. These studies are designed to identify toxicities from
repeated exposure, identify affected organs and to determine the dose for each toxicity.140
Repeated Dose Toxicity (RDT) studies lead to quantification of No Observed Adverse Effect
Level (NOAEL) and Lowest Observed Adverse Effect Level (LOAEL) measurements in animal
models which are often used to estimate a safe dose to administer in clinical trials.141 The NOAEL
is the highest dose for which no adverse effects were observed and the LOAEL is the lowest dose
where an adverse effect was observed.142 The primary measurements made in these studies
include body weight, clinical measurements, histological evaluation of tissues, evaluation of
clinical chemistry and haematology. In addition toxicokinetic measurements allow the
determination of exposure levels associated with effect and no effect levels of toxicity. Before
clinical trials, acute toxicity and repeated-dose toxicity studies need to be conducted in two
species including one non-rodent species.143 In addition, individual drug and therapeutic
indication knowledge will guide the investigation of other potential toxic liabilities in in vivo
animal studies. Other studies might include the investigation of genotoxicity, carcinogenicity,
reproductive toxicity, immunotoxicity and drug abuse liabilities.143 Safety pharmacology studies are
conducted to assess the risk of on- or off-target pharmacology, for example dog telemetry
studies are used to assess cardiovascular liabilities.140
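The NOAEL and LOAEL definitions given above can be illustrated with a minimal sketch. The function name `noael_loael` and the dose groups are hypothetical, and the restriction of the NOAEL to doses below the LOAEL is an assumption made here so that the two values stay ordered; it is not taken from the cited guidance.

```python
from typing import Dict, Optional, Tuple


def noael_loael(adverse_by_dose: Dict[float, bool]) -> Tuple[Optional[float], Optional[float]]:
    """Return (NOAEL, LOAEL) from a mapping of dose -> adverse effect observed."""
    doses = sorted(adverse_by_dose)
    adverse_doses = [d for d in doses if adverse_by_dose[d]]
    # LOAEL: the lowest dose at which an adverse effect was observed.
    loael = adverse_doses[0] if adverse_doses else None
    # NOAEL: the highest dose with no adverse effect.
    # Assumption: only doses below the LOAEL are considered.
    clean = [d for d in doses
             if not adverse_by_dose[d] and (loael is None or d < loael)]
    noael = clean[-1] if clean else None
    return noael, loael


# Hypothetical repeated-dose study: dose (mg/kg/day) -> adverse effect seen?
study = {1.0: False, 3.0: False, 10.0: True, 30.0: True}
noael, loael = noael_loael(study)
print(f"NOAEL = {noael}, LOAEL = {loael}")
```

In a real study the adverse-effect call for each dose group is a toxicological judgement across many endpoints, so the boolean flags here stand in for that assessment.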
Clinical Toxicity
Toxicity Assessment in Clinical Trials
Clinical trials are usually split into different phases and toxicity is monitored and assessed
continually and according to the study design.
The objectives of phase 1 clinical studies are to establish the pharmacokinetics, tolerability and
safety of the drug in a small number of human subjects. Sub-therapeutic doses are initially
administered to a small number of human subjects who are monitored for adverse effects.144
Pharmacokinetic exposure is determined to establish the most appropriate incremental increase
of dose. Dose escalation continues until the expected pharmacological exposure range is
achieved or exceeded. Where the expected pharmacological exposure is not known, dose
escalation can continue up to the adverse effect level as determined by animal toxicity studies
or to the actual maximum tolerated dose in humans defined by the onset of adverse events.145
Serious adverse events at any dose level will usually result in the suspension of dosing or
termination of the study.
Phase 2 clinical trials are designed to assess the therapeutic efficacy of the drug in a larger group
of patients who have the target disease.144 There are different designs for a Phase 2 trial
depending on the outcomes required.146 A comparison of the efficacy with historical controls,
experimenting with different dosing regimens and randomisation of subjects between drug and
placebo help to inform the design of the subsequent phase 3 efforts.147 Toxicity and
pharmacokinetics are also monitored throughout all stages of the trials and incidences of
adverse events are strictly reported and regulated.
Because endpoints generated from a phase 2 clinical trial often lack statistical power, owing to
the relatively small sample sizes, phase 3 clinical trials are mandated by regulators
to assess the drug in a wider cohort of patients with the disease.144 Most phase 3 clinical trials
are structured as a comparative efficacy trial of the drug against placebo; however, some drugs
are compared to the current standard-of-care, which is known as the equivalency trial structure.148
Randomisation of patients between the treatment groups or the placebo or standard-of-care
groups is conducted to control for the effect of confounding factors on trial results. These may
include concomitant medications, comorbidities, or demographic features such as age, gender
or socioeconomic factors.149 Again, toxicity will be assessed throughout the trial and adverse
events reported and monitored for patients in both arms of the trial. The adverse events
reported may be different in frequency or severity to those experienced in smaller cohorts in
previous trials.
Clinical Adverse Events
Adverse events (AEs) are defined by the World Health Organisation (WHO) as “any untoward
medical occurrence in a patient or clinical investigation subject administered a pharmaceutical
product and which does not necessarily have to have a causal relationship with this
treatment”.150 As mentioned in Mechanisms of Toxicity there are a variety of mechanisms of
drug-induced toxicity which could be responsible for the AEs in humans. An important factor
to consider is the causality of AEs, as toxicity experienced after administration of the drug can
be due to mechanisms of the drug itself, as well as other confounding factors, and it is not always
trivial to distinguish between these situations. For example, the underlying disease and any
concomitant medications could also be linked to the AE.151 Serious adverse events (SAEs) are a
class of AE which result in either death, hospitalisation, significant disability or a congenital
abnormality or birth defect.152
Reporting of Adverse Events from Clinical Trials
AEs attributed to the drug are reported during clinical trials as part of the pharmacovigilance
framework. These AEs are reported as per requirements in the country in which the clinical trial is
conducted. For the UK, the Medicines and Healthcare products Regulatory Agency (MHRA)
requires that sponsors and clinical study investigators report all serious AEs to the European
Medicines Agency (EMA) to be collated into a database as Individual Case Safety Reports
(ICSRs).153 Likewise, the Food and Drug Administration (FDA) requires the reporting of
unanticipated and serious AEs in clinical trials to the Institutional Review Board (IRB).154 Other
information is documented in the medical records of the study participants and this is
transferred to a case report form, which is collated into pharmacovigilance databases. These
databases are often coded to a medical dictionary of AE terms to aid analysis. The Medical
Dictionary for Regulatory Activities (MedDRA) provides a 5-level hierarchy and is used for this
purpose.155 This is the main dictionary used in the UK, US and Japan to report AEs from clinical
trials. The highest level of this hierarchy is the System Organ Class (SOC) and the lowest level
terms are more likely to be the original terms described in a document.156 Finally, the published
trial results on ClinicalTrials.gov will include the incidence of AEs for the drug.157,158 In this
database, it is now required that the adverse event term, organ system, type of assessment
(spontaneous vs symptomatic) and the number of participants affected are included in the
report.159
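As an illustration of how a coded AE term sits within the 5-level MedDRA hierarchy, the sketch below walks from a lowest level term up to its System Organ Class. The level abbreviations (SOC, HLGT, HLT, PT, LLT) are the standard MedDRA level names; the `Term` class, the `path_to_soc` helper and the example terms are illustrative assumptions, and the single-parent structure is a simplification, since a MedDRA term can be linked to more than one SOC.

```python
from dataclasses import dataclass
from typing import Optional

# The five MedDRA levels, from most general to most specific.
MEDDRA_LEVELS = ["SOC", "HLGT", "HLT", "PT", "LLT"]


@dataclass
class Term:
    name: str
    level: str
    parent: Optional["Term"] = None  # simplification: one parent per term


def path_to_soc(term: Term) -> list:
    """Walk from a term up to its System Organ Class."""
    path = []
    node = term
    while node is not None:
        path.append((node.level, node.name))
        node = node.parent
    return path


# Illustrative terms only; consult the MedDRA dictionary for actual coding.
soc = Term("Gastrointestinal disorders", "SOC")
hlgt = Term("Gastrointestinal signs and symptoms", "HLGT", soc)
hlt = Term("Nausea and vomiting symptoms", "HLT", hlgt)
pt = Term("Nausea", "PT", hlt)
llt = Term("Feeling queasy", "LLT", pt)

for level, name in path_to_soc(llt):
    print(f"{level:>4}: {name}")
```

Aggregating reports at a higher level such as the SOC trades specificity for larger counts per term, which is one reason analyses are often run at more than one level.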
Attrition due to Toxicity
Toxicity in clinical trials can result in termination of the investigational drug. The frequency of
attrition can be broken down into target organs or tissues and was reported in the study by
Guengerich et al.,97 based on information from DuPont-Merck and Bristol-Myers Squibb
between 1993 and 2006. These results (Table 1-3) show that cardiovascular and liver toxicity
findings were responsible for > 42 % of drug attrition due to toxicity. Recent examples of drugs
withdrawn for toxicity observations include the pain medications flupirtine, which was
withdrawn for liver toxicity in 2018,160 and propoxyphene, withdrawn for cardiovascular toxicity
in 2010.161
Table 1-3: Percentage of molecule attritions in clinical trials which occurred due to a toxicity classified by each target organ or tissue.
Target organ or tissue                    Percentage of all advanced molecules
Cardiovascular                            27.3
Liver                                     14.8
Teratogenicity                            8.0
Hematologic                               6.8
Central and peripheral nervous system     6.8
Retina                                    6.8
Mutagenicity/clastogenicity               4.5
Reproductive toxicity                     4.5
Gastrointestinal/pancreatic               3.4
Muscle                                    3.4
Carcinogenicity                           3.4
Lung                                      2.3
Acute death (unspecified cause)           2.3
Renal                                     2.3
Irritant                                  2.3
Skeletal (arthritis/bone development)     1.1
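The statement that cardiovascular and liver findings account for more than 42 % of toxicity-related attrition can be checked directly from the values in Table 1-3. The snippet below is only an arithmetic check; the dictionary name `attrition_pct` is introduced here for illustration.

```python
# Values transcribed from Table 1-3 (percentage of all advanced molecules).
attrition_pct = {
    "Cardiovascular": 27.3,
    "Liver": 14.8,
    "Teratogenicity": 8.0,
    "Hematologic": 6.8,
    "Central and peripheral nervous system": 6.8,
    "Retina": 6.8,
    "Mutagenicity/clastogenicity": 4.5,
    "Reproductive toxicity": 4.5,
    "Gastrointestinal/pancreatic": 3.4,
    "Muscle": 3.4,
    "Carcinogenicity": 3.4,
    "Lung": 2.3,
    "Acute death (unspecified cause)": 2.3,
    "Renal": 2.3,
    "Irritant": 2.3,
    "Skeletal (arthritis/bone development)": 1.1,
}

total = sum(attrition_pct.values())
top_two = attrition_pct["Cardiovascular"] + attrition_pct["Liver"]
print(f"Cardiovascular + liver: {top_two:.1f} % of a total of {total:.1f} %")
```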
Toxicity Translation
Testing new chemical entities (NCEs) in animal models is a regulatory requirement for toxicity
assessment before administration of a drug to humans in clinical trials. The pharmaceutical
industry has long been evaluating the value of these models for this purpose, in the search for
new information that can be used for toxicity assessment. From an analysis of the attrition of
drug candidates from four pharmaceutical companies, it was found that 40 % of drugs were
terminated due to non-clinical toxicology findings and 11 % due to clinical safety findings.162
Safety was found to be the highest contributor to attrition in both preclinical and Phase 1
studies.162 In an AstraZeneca study conducted on drug projects between 2005–2010, it was
found that, of those drug projects terminated preclinically, 82 % were due to toxicity.163
Furthermore, of those drug projects which passed into clinical trials, 62 %, 35 % and 12 % of
projects were terminated due to toxicity in phase 1, phase 2 and phase 3, respectively.163
Since this paper was published there has been a renewed focus on 5 technical determinants of
drug success (the 5Rs) within AstraZeneca, one of which was focusing on the “right safety”
testing. The results of these efforts have led to the reduction in project closures due to safety
between 2012 and 2016, to 38 % for phase 1 and 8 % for phase 2.164
Since clinical toxicity is still a source of drug attrition, it is clear that the predictivity of animal
models is currently not sufficient to anticipate later compound toxicity in man. Toxicological
findings in preclinical studies are not always observed in the clinic and thus the development
of potential therapeutics could be terminated before progression to the clinic, causing
unwarranted attrition.165 Conversely, there may be no toxicological findings observed
preclinically, but when clinical studies are conducted new toxicities are uncovered.166 The
reasons for the lack of concordance in both directions can often be attributed to the increased
heterogeneity of patients in clinical trials,166 lack of relevance of the model organism or model
measurement to the human toxic endpoint,167 differences in toxic or detoxifying mechanisms
between species,168 or differences in exposure,169 amongst other effects. Vomiting is an example of
an AE in humans which is poorly predicted because of differences in physiology between species.
Rats do not have a vomiting response, and therefore vomiting in humans is hard to predict using
rodent studies and is instead predicted by taste aversion/food avoidance responses in rodents
or ferret or dog emesis models.170,171 Nausea is another example of a common side effect
experienced in humans, which is not easily recognisable in animal models, but which poses a
significant challenge to drug development due to patient compliance. In fact, a study conducted
by Parkinson et al.,172 explores nausea prediction in humans in relation to gastro-intestinal (GI)
AEs experienced in animals, finding that combinations of preclinical GI effects were strong
predictors of nausea. Teratogenicity does not always correlate between animals and humans,
for example corticosteroids are teratogenic in animal models but not in humans,173 and
conversely thalidomide is a teratogen in humans but not in many animal species, due to
metabolism differences.174 Other side effects which are difficult to observe in animals are those
associated with a psychological condition that might be unique to humans, for
example schizophrenia,175 although behavioural end-point rodent models have been developed
to provide some early evidence in animals of changes in mesolimbic dopamine function relevant
in schizophrenia.176
Despite this lack of correlation for some toxicities, others are more concordant between animals
and humans, for example cardiotoxicity.169,177 For these cases, the occurrence of drug-induced
toxicity preclinically may halt progression of an investigational drug due to an unacceptable risk
to patients or provide vital information that allows an informed decision to be made on patient
monitoring and inclusion/exclusion criteria for clinical trials. Sometimes drugs are continued
despite toxicity flags in preclinical studies. Reasons to continue a drug despite preclinical
toxicity are numerous and include, for example, low severity of the toxicity, the risk-benefit
analysis of the disease to be treated versus the toxicity, as well as evidence that the toxicity may
only occur in subsets of patients which can be excluded from the study or highly monitored
during the study. In particular, cancer drugs, antibiotics, antidepressants and antipsychotics are
drugs which are often taken forward despite cardiovascular risk factors, on a risk-benefit basis.96
Increasingly, there is a drive towards the reduction of animal usage in support of the 3Rs
initiative,138 by prioritising the use of only the most toxicologically relevant animal models, as
well as implementing and developing in vitro alternative models, such as discussed in In Vitro
Toxicity, including safety target screening panels and organs-on-a-chip technologies.
1.1.2 Summary
In this section of the introduction, we outlined the general context for this project and
highlighted the need to produce selective and safe drug molecules. The current drug discovery
set-up allows for the assessment of selectivity and toxicity, however, there is still progress to be
made.
Gaining target family selectivity remains a challenge due to the similarity of proteins with
diverse biological roles and lack of knowledge for designing selective molecules. On average,
drugs are active against six targets (see Measuring Selectivity), which can lead to undesired
effects, including toxicity. We furthermore introduced the bromodomain-containing proteins
with respect to their functional importance and structural composition, as well as highlighted
existing knowledge for designing selective bromodomain inhibitors. This thesis will explore this
target family in detail with respect to the prediction and understanding of selectivity profiles.
The translation of toxicity between animals and humans still represents a challenge, as many
adverse events do not translate across species for a variety of reasons. This lack of predictivity,
as well as the initiative to reduce animal testing, leads to the need to provide better in vitro
alternatives for clinical toxicity prediction.
This thesis addresses these general challenges by the implementation of in silico approaches to
generate new knowledge within these areas.
1.2 General Computational Methods
In this section we highlight the general computational methods that were implemented within
this thesis related to the field of data science, including machine learning algorithms and
statistical methods.
1.2.1 Data Science Methods
Data science is the study of extracting generalizable knowledge from data by unifying statistics,
machine learning and their related methods.178,179 This field can be thought to encompass other
related concepts including artificial intelligence (AI), machine learning and the more
recently popular method of deep learning (Figure 1-4), as well as statistics and data mining.
Figure 1-4: The field of data science and its connection to the fields of artificial intelligence, machine learning, deep learning, data mining and statistics.
AI is “the effort to automate intellectual tasks normally performed by humans” and originated
in the 1950s with the idea that computers might have the ability to think like a human to solve
problems in a variety of fields.180 Machine learning is a branch of AI which learns rules through
pattern recognition and statistical algorithms in an automated fashion from input data and their
expected answers. The difference between machine learning and the broader field of AI (which
also includes symbolic AI) is said to be that a machine learning system is trained by known
input and output data to find rules, rather than programmed with input data and human
inputted rules to produce an output (Figure 1-5).180
Figure 1-5: The difference between machine learning and symbolic AI.
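The contrast between the two paradigms can be sketched with a toy example. In the symbolic case a human supplies the rule (a potency threshold); in the machine learning case the same kind of rule is derived from input data and expected answers. All names and values below are hypothetical and chosen for illustration only; this is a sketch of the idea, not a method used in this thesis.

```python
def symbolic_classifier(value, threshold):
    """Symbolic AI: input data plus a human-supplied rule produce an answer."""
    return "active" if value >= threshold else "inactive"


def learn_threshold(values, answers):
    """Machine learning: input data plus expected answers produce a rule.

    Tries each midpoint between neighbouring sorted values and keeps the
    threshold that classifies the most training examples correctly.
    """
    xs = sorted(values)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def n_correct(t):
        return sum(symbolic_classifier(v, t) == y for v, y in zip(values, answers))

    return max(candidates, key=n_correct)


# Hypothetical potency values (e.g. pIC50) with known activity labels.
values = [4.1, 4.8, 5.2, 6.3, 6.9, 7.4]
answers = ["inactive", "inactive", "inactive", "active", "active", "active"]

learned = learn_threshold(values, answers)
print("learned threshold:", learned)
print("prediction for 6.0:", symbolic_classifier(6.0, learned))
```

Once learned, the threshold is applied exactly as a hand-written rule would be; the difference lies only in where the rule came from.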
Machine Learning Models
As defined above, machine learning is a field within AI which is concerned with learning rules
through pattern recognition and statistical algorithms in an automated fashion from input data
and their expected answers. A model is said to be trained on the input data and the expected
answers, which can then be applied to predict for new examples.180
In this section we describe the algorithms which are commonly used in machine learning
applications, with examples related to drug discovery.
There are different categories of machine learning problems. In this section we focus on
supervised machine learning frameworks, which are the most relevant to the content of the
thesis. Supervised machine learning involves feeding the model the desired solutions known as
the output labels as well as the input variables, whereas for unsupervised frameworks the
training data is unlabelled, and the learning algorithm tries to learn rules between input
variables, e.g. clustering or association rule mining.181,182
For supervised machine learning methods, tasks can be split into either regression or
classification problems.
This thesis is concerned with classification and so this will be discussed in the next section.
Classification Models
Classification models are built on data which has an output label of a discrete class, where there
are K possible classes to which instances can be assigned.183 Classes can be nominal or ordinal
and ordinal classes are often derived from continuous data to simplify a regression problem, for
example classifying into the levels “low”, “medium” and “high”.
Classification tasks can come in different forms, as illustrated in Figure 1-6, including binary
classification, multi-class classification, multi-label classification184 and multi-output
classification.185,186
Figure 1-6: Adapted from Read et al., 2015.185 L is the number of labels and K is the number of values that each label variable can take.
Binary classification is the most commonly encountered form of classification, where there is one label,
and there are two values which this label can take, for example for the label “drug biological
activity” there could be two values “active” or “not active”. Multi-class classification presents a
similar problem to binary classification except that the label can now take more than two values,
for example there could be three classes such as for a label “type of drug activity” which could
be “inhibition”, “activation” or “none”. Note that for both cases the value of the label must
belong to only one of the K options and these “classes” are mutually exclusive.
For multi-label classification, the number of labels is now increased, however each label itself is
binary (K=2). The output could be now a combination of two labels, for example the presence
and absence of two toxicity types for one drug. The two labels might be “cardiotoxicity” and
“renal toxicity” and each label has two classes “present” and “absent”. Note that now one
instance, in this example one drug, has two classifications; either presence or absence of
cardiotoxicity and presence or absence of renal toxicity. This type of classification predicts a
profile of binary label occurrences.
The fourth type of classification is multi-label, multi-class classification, which extends the
approach to when K > 2 and L > 1. This could be, for example, when each of our two labels for
toxicities has three options: “severe toxicity”, “medium toxicity” or “no toxicity”. This is called
multi-output classification.
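The four task types described above differ only in the number of labels L and the number of values K each label can take. A minimal sketch of this mapping follows; the function name `classification_type` and the example targets are assumptions introduced for illustration.

```python
def classification_type(n_labels: int, n_values: int) -> str:
    """Name the task for L = n_labels label variables, each taking K = n_values values."""
    if n_labels < 1 or n_values < 2:
        raise ValueError("need at least one label and at least two values per label")
    if n_labels == 1:
        return "binary" if n_values == 2 else "multi-class"
    return "multi-label" if n_values == 2 else "multi-output"


# Hypothetical targets for a single drug under each framing.
examples = {
    "binary": {"activity": "active"},
    "multi-class": {"type of activity": "inhibition"},
    "multi-label": {"cardiotoxicity": "present", "renal toxicity": "absent"},
    "multi-output": {"cardiotoxicity": "severe", "renal toxicity": "none"},
}

for L, K in [(1, 2), (1, 3), (2, 2), (2, 3)]:
    name = classification_type(L, K)
    print(f"L={L}, K={K}: {name}, e.g. {examples[name]}")
```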
Evaluation Metrics
Classification is often assessed by the calculation of metrics based on the confusion matrix.187,188
For a binary problem, the confusion matrix is a two-by-two table of predicted label values (class)
against actual label values (class) with counts in each quadrant equivalent to the number of
instances in each category (Figure 1-7).
Figure 1-7: Confusion matrix for the assessment of classification model performance.
Metrics can then be calculated from the confusion matrix to assess the performance of the
classifier, the most popular being the precision, recall (sensitivity), specificity, accuracy, F1 score,
Matthews correlation coefficient (MCC) and the area under the receiver operating
characteristic curve (ROC AUC).186–189
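As a sketch of how these metrics follow from the four confusion-matrix counts, the function below uses their standard definitions. The function name `confusion_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made here. ROC AUC is omitted because it requires ranked prediction scores rather than a single confusion matrix.

```python
import math


def _ratio(num, den):
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    return num / den if den else 0.0


def confusion_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)            # also called sensitivity
    specificity = _ratio(tn, tn + fp)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for an active/inactive classifier.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which makes it more informative when the classes are imbalanced.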
Machine Learning Algorithms
In this section we discuss the main machine learning classification algorithms which were used
in this work, including their theoretical basis, advantages and disadvantages.
Logistic Regression
Logistic regression adapts linear regression to classification problems. As
in linear regression, a function is learned from the training data which is a weighted sum of the
input features plus a bias term. However, instead of outputting continuous values, the logistic
function (𝜎) is applied to the continuous values to output a probability value between 0 and 1
for binary classification:190
[Figure 1-7 content: predicted class (active/inactive) against actual class (active/inactive), giving counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).]
Equation 1-2:
$$\hat{p} = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n)$$
Where $\hat{p}$ is the probability value that an instance belongs to the positive class, $\theta_0$ is the bias term,
$x_i$ is the ith feature value, n is the number of features and $\theta_j$ is the jth model parameter (including
the bias term and feature weights $\theta_1$ to $\theta_n$). $\sigma$ is the logistic function:
Equation 1-3:
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
From the probability output, values ≥ 0.5 predict membership of the positive class and values <
0.5 predict membership of the negative class.
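The prediction step described by Equation 1-2 and Equation 1-3 can be sketched as follows. The parameter values are hypothetical rather than fitted, and the function names are introduced here for illustration; training the parameters is not shown.

```python
import math


def sigmoid(t):
    """Logistic function (Equation 1-3)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta, x):
    """Probability of the positive class (Equation 1-2).

    theta[0] is the bias term; theta[1:] are the feature weights.
    """
    t = theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))
    return sigmoid(t)


def predict(theta, x, threshold=0.5):
    """Assign the positive class (1) when the probability is >= threshold."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical parameters for a two-feature model.
theta = [-1.0, 2.0, -0.5]
for x in ([1.0, 2.0], [0.0, 0.0], [2.0, 0.0]):
    print(x, round(predict_proba(theta, x), 3), predict(theta, x))
```

The first instance gives a weighted sum of exactly zero, so its probability sits on the 0.5 boundary and it is assigned to the positive class under the ≥ 0.5 rule.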
Logistic regression works under the following assumptions: 1. the dependent variable is
dichotomous, 2. the errors are independent and therefore each observation is independent from
others, 3. there is no multicollinearity in the predictor variables and 4. there are no outliers in
the data.191 Logistic regression is often used due to its efficient implementation, high
interpretability, easy regularization and ability to output well-calibrated probabilities. It has the
disadvantages that it cannot solve non-linear separation problems, requires prior knowledge of
which input variables are meaningfully related to the output variable, and performs less well under
multicollinearity of predictors or with a large ratio of predictor variables to samples.192
Support Vector Machines
Support Vector Machines (SVMs) are used in classification tasks and work by optimising a linear
decision surface known as a hyperplane to best separate the data into classes, whilst maintaining
the largest distance possible from any examples, known as large margin classification.193 Large
margin classification allows increased generalisability of the algorithm, since the hyperplane is
more likely to succeed in predicting for new examples.194 The lines which bound the margin pass
through the nearest training instances, which are called the support vectors (Figure 1-8). When the
data is not separable in the original feature space, SVM can map input features onto a higher
dimensional feature space in which the data can then be linearly separated into classes.195,196
Figure 1-8: SVM classification for a linearly separable dataset.
For linear SVMs each data point is represented by a vector in n-dimensional space ($\mathbb{R}^n$). The
equation of the hyperplane is derived from two vectors $\vec{x}_0$ and $\vec{x}$ which represent two points
$P_0$ and $P$ on the hyperplane (Figure 1-9). For $P$ to be on the plane, the vector $\vec{x} - \vec{x}_0$ must be
perpendicular to a vector $\vec{w}$ at point $P_0$. From this we know that the dot product of these two
vectors must be 0 as they are perpendicular:
Equation 1-4:
$$\vec{w} \cdot (\vec{x} - \vec{x}_0) = \vec{w} \cdot \vec{x} - \vec{w} \cdot \vec{x}_0 = 0$$
And therefore:
Equation 1-5:
$$\vec{w} \cdot \vec{x} + b = 0$$
where $b = -\vec{w} \cdot \vec{x}_0$. This can be visualised in Figure 1-9.
Figure 1-9: Derivation of the equation for the hyperplane for an $\mathbb{R}^3$ space from the position vectors of points $P$ and $P_0$.
Parallel hyperplanes have different values of the $b$ coefficient; changing $b$ moves the hyperplane along the
direction perpendicular to it, in the direction of $\vec{w}$. For an instance to be classified as positive
(+1) the value of $\vec{w} \cdot \vec{x} + b$ must be greater than 0 and for an instance to be classified as the
negative class (-1) the value of $\vec{w} \cdot \vec{x} + b$ must be less than 0. Because each margin boundary is a
line parallel to the hyperplane which intersects the nearest positive or negative examples (the support
vectors), the equations of the margin boundaries can, after rescaling $\vec{w}$ and $b$, be written as:
Equation 1-6:
$$\vec{w} \cdot \vec{x} + b = +1 \iff \vec{w} \cdot \vec{x} + (b - 1) = 0$$
Equation 1-7:
$$\vec{w} \cdot \vec{x} + b = -1 \iff \vec{w} \cdot \vec{x} + (b + 1) = 0$$
The distance d between two parallel hyperplanes is given by:
Equation 1-8:
$$d = \frac{|b_1 - b_2|}{\lVert \vec{w} \rVert}$$
Which, since $b_1 = b + 1$ and $b_2 = b - 1$, simplifies to:
Equation 1-9:
$$d = \frac{2}{\lVert \vec{w} \rVert}$$
To maximise the margin, we would need to minimise the value of $\lVert \vec{w} \rVert$, which is the norm of
the weight vector. In practice the quantity which is minimised, which has the same minimiser but is
more convenient to optimise, is:
Equation 1-10:
$$\frac{1}{2} \lVert \vec{w} \rVert^2$$
For a hard margin classifier, it is required to minimise the objective function
$\frac{1}{2} \sum_{j=1}^{n} w_j^2$ subject to $y_i(\vec{w} \cdot \vec{x}_i + b) - 1 \geq 0$, which encodes the
constraint that all instances $i$ are correctly classified and lie on or outside the margin
(all positive instances ≥ +1 and all negative instances ≤ -1).190,183 For soft margin classification,
margin violations are allowed. Therefore, a balance between maximising the margin and
minimising the number of violations is required. The hyperparameter C in SVM controls the
trade-off between these two objectives.
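The geometry of the linear SVM can be checked numerically for a given hyperplane. The sketch below evaluates the decision value, the margin width 2/‖w‖ and the hard margin constraint for a small hypothetical dataset; the weight vector and bias are chosen by hand rather than obtained by optimisation, and the function names are introduced here for illustration.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b for one instance."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """+1 if w . x + b > 0, otherwise -1 (a point exactly on the hyperplane maps to -1 here)."""
    return 1 if decision_value(w, b, x) > 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_hard_margin(w, b, X, y):
    """True if y_i (w . x_i + b) >= 1 holds for every training instance."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))


# Hypothetical, hand-picked hyperplane and linearly separable data.
w, b = [1.0, 1.0], -3.0
X = [[3.0, 1.0], [2.0, 3.0], [1.0, 1.0], [0.0, 1.0]]
y = [1, 1, -1, -1]

print("hard margin satisfied:", satisfies_hard_margin(w, b, X, y))
print("margin width:", margin_width(w))
print("class of [4, 4]:", classify(w, b, [4.0, 4.0]))
```

The instances [3, 1] and [1, 1] give decision values of exactly +1 and −1, so they lie on the margin boundaries and act as the support vectors for this hyperplane.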
For non-linearly separable data, as mentioned previously, we can map the instances from the
original feature space to a higher dimensional feature space using a mapping function 𝑥 → 𝑓(𝑥).
The kernel is a similarity function which computes the dot product
between two vectors in the higher dimensional space based only on the original vectors a and b
without having to know about the transformation.194 Therefore, the dot product of the vectors
transformed by function 𝑓(𝑥) is equal to a function of the original vectors. This is known as the
kernel trick and does not require the computation of the coordinates of the input data in the
higher dimensional space, but only the dot products between all vectors instead, which is more
computationally efficient. There are a variety of kernels which can be implemented; the most
common being those shown in Equation 1-11, Equation 1-12 and Equation 1-13:190
Equation 1-11:
Linear: $K(a, b) = a^T \cdot b$
Equation 1-12:
Polynomial: $K(a, b) = (\gamma a^T \cdot b + r)^d$
Equation 1-13:
Gaussian Radial Basis Function (RBF): $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$
Where $a^T$ is the transpose of a, d is the polynomial degree, r is a constant offset term and
$\lVert a - b \rVert$ is the Euclidean distance between the vectors in the original space. $\gamma$ is a
hyperparameter to be optimised.
Using this method, it may now be possible to find a hyperplane in the projected space which
can better separate the classes.
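The three kernels, and the kernel trick itself, can be illustrated with a short sketch. The explicit feature map `phi` below is the standard degree-2 map for two-dimensional inputs that corresponds to the polynomial kernel with γ = 1, r = 0 and d = 2; it is included only to show that the kernel returns the same value as a dot product in the higher dimensional space. Function names and example vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(a, b) + r) ** d


def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (gamma=1, r=0, d=2)."""
    x1, x2 = v
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


a, b = [1.0, 2.0], [3.0, 4.0]

# Kernel trick: same value with and without the explicit transformation.
via_kernel = polynomial_kernel(a, b)
via_mapping = dot(phi(a), phi(b))
print("polynomial kernel:", via_kernel)
print("dot product after mapping:", via_mapping)
print("RBF kernel:", rbf_kernel(a, b, gamma=0.1))
```

For the RBF kernel the corresponding feature space is infinite dimensional, so the explicit mapping cannot be written out and only the kernel form is practical.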
SVMs have the advantage that they can model non-linear relationships, where they can use the
kernel trick to reduce computational complexity. Additionally, by employing a convex
optimisation problem SVMs do not get stuck in local minima and converge to a unique
solution.197 They are, however, less interpretable, less suitable for multi-class classification,
favour discrete variables with more categories, and can be sensitive to the choice of kernel when
compared to other algorithms.197,183
Decision Trees
Decision trees can be used to perform both regression and classification tasks. Decision trees
split the data into smaller sub-groups by using input model features, until the instances are
separated into their classes (classification) or groups of values with a low standard deviation
(regression). At the start of the algorithm all instances are grouped together at the root node
(Figure 1-10). Subsequently the data is split by many decision nodes (including the root node),
which have multiple output branches and use one of the input features to split the data. Finally,
the leaf nodes hold the predicted class or a group of closely related values; these nodes
should now contain homogeneous data.
Figure 1-10: Exemplified decision tree for classification of dataset into positive and negative classes.
There are different algorithms for decision tree construction. The Iterative Dichotomiser 3
(ID3) algorithm198 uses a greedy search approach to find an optimal decision tree by iteratively forming a
decision tree on larger subsets of the training dataset until a tree forms which classifies the
remaining instances completely correctly. ID3 uses the concepts of entropy H(T) and
information gain (IG) for classification to select the best features for partitioning data.
Entropy describes the information contained within a variable and originates from the Shannon
entropy, derived from models of data communication systems, which operated based on
deriving the average minimum length (in bits or Shannons) needed to transmit information in
a compressed form from source to receiver.199 Entropy can be described as the average or
expected amount of information derived from identifying the outcome of a random trial; high
entropy means that the state is unpredictable, since more information is needed to encode the
possible outcomes. Mathematically, entropy can be calculated by the following formula:
Equation 1-14:
$H(T) = -\sum_{i=1}^{J} p_i \log_2 p_i$
Where 𝑝1, 𝑝2 … are fractions of each of the J classes, which in total sum up to one and represent
the proportion of each class present in the set of training inputs T.
Equation 1-15:
𝐼𝐺(𝑇, 𝑎) = 𝐻(𝑇) − 𝐻(𝑇|𝑎)
[Figure 1-10 content: the root node splits on Feature 1; decision nodes split on Feature 2 and Feature 3 through Yes/No branches; leaf nodes are labelled Positive or Negative.]
Where H(T) is the entropy of the training inputs T, and 𝑎 is the feature which has been used for
the split. This can be interpreted as the change in entropy from the parent node to the weighted
average of the child nodes after a split has been made using feature 𝑎. The feature with the highest
information gain is used to make the split; a high gain arises when the parent node has a large entropy
and the child nodes have a lower entropy (are more homogeneous).200
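Equations 1-14 and 1-15 can be made concrete with a small Python sketch. The labels and feature values below are hypothetical toy data, and the function names are illustrative assumptions rather than part of any ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Equation 1-14: H(T) = -sum_i p_i * log2(p_i) over the classes in T."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Equation 1-15: IG(T, a) = H(T) - H(T|a).

    H(T|a) is the entropy of each child node, weighted by the fraction of
    instances that reach it after splitting on the values of feature a.
    """
    n = len(labels)
    children = {}
    for label, value in zip(labels, feature_values):
        children.setdefault(value, []).append(label)
    conditional = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - conditional

# Hypothetical training set: four positive and four negative instances.
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
# A feature that separates the classes perfectly ...
informative = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
# ... and one whose child nodes are as mixed as the parent node.
uninformative = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]

print("H(T):", entropy(labels))
print("IG, informative feature:", information_gain(labels, informative))
print("IG, uninformative feature:", information_gain(labels, uninformative))
```

In this toy case the informative feature recovers the full parent entropy as information gain, whereas the uninformative feature yields a gain of zero, so a greedy algorithm would split on the former.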
The ID3 algorithm was later developed into the C4.5 algorithm, whose advantages include
that both discrete and continuous features can be used to make the split, that it can cope
with missing feature values for some instances and that it prunes those branches which do not
help with classification.201
Some decision tree algorithms, including the Classification and Regression Trees (CART)
algorithm, use the Gini Impurity I(T) instead of the entropy H(T) as the impurity measure for
attributes; this is defined as follows:
Equation 1-16:
$I(T) = \sum_{i=1}^{J} p_i (1 - p_i)$
Where 𝑝1, 𝑝2 … are fractions of each of the J classes, which in total sum up to one and represent
the proportion of each class present in the set of training inputs T.
Practically, it has been shown that the difference between the entropy-based and Gini impurity-
based measures is minimal (2 %).202 An important difference between CART and other
algorithms is that it produces only binary trees with two child nodes resulting from each parent
node in the tree.190
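As a brief illustration of Equation 1-16, the following Python sketch computes the Gini impurity for a few hypothetical label sets; the function name and the example labels are assumptions made for demonstration only.

```python
from collections import Counter

def gini_impurity(labels):
    """Equation 1-16: I(T) = sum_i p_i * (1 - p_i) over the classes in T."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# Hypothetical nodes with different class compositions.
pure_node = ["pos", "pos", "pos", "pos"]
skewed_node = ["pos", "pos", "pos", "neg"]
balanced_node = ["pos", "pos", "neg", "neg"]

print("pure:", gini_impurity(pure_node))
print("skewed:", gini_impurity(skewed_node))
print("balanced:", gini_impurity(balanced_node))
```

Like the entropy, the Gini impurity is zero for a pure node and is largest when the classes are evenly mixed (0.5 for two classes).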
Decision trees have the advantage that they are known as “white box” models and are
interpretable; some decision tree algorithms can output classification rules, which allows
understanding of how the dataset was classified. It is possible to calculate the importance of
each feature towards classification using the impurity measures mentioned above. Decision
trees are prone to overfitting to the training set, affecting generalisation. For this reason, trees
are often regularised by controlling the maximum depth of the tree and using pruning.190
Ensemble Methods
Ensemble methods make use of the aggregation of multiple weak learning algorithm outputs to
produce a stronger ensemble consensus output. The main purpose of employing ensemble
models is to increase model predictive performance. The key to the effectiveness of an ensemble
algorithm is that the errors between the weak learners are uncorrelated; that is, the weak
learners are as diverse as possible. One way to train diverse weak learners is to implement a
variety of different algorithms on the same training set and create the ensemble from
aggregating the resultant predictions. Other popular ensemble methods include bagging and
boosting. We explain these techniques and then provide detail for the Random Forest
algorithm, since it is used in this work.
Boosting
Boosting is a way of sequentially combining weak learners into one stronger ensemble learner.
Each new predictor aims to correct the error from the previous predictor in the sequence.203
The most popular methods of boosting are Adaptive Boosting (AdaBoost) and Gradient
Boosting. AdaBoost modifies the sample distribution by updating the weights assigned to each
instance to prioritize the instances that the previous predictor underfitted for the subsequent
predictor, whereas Gradient Boosting aims to fit the next predictor in the sequence to the
residual error from the previous predictor.
Bagging
Bagging refers to the use of the same algorithm on different subsets of the training set where
the training set for each predictor is selected by a sampling with replacement (bootstrapping)
procedure.204 The predictions for a new instance are then aggregated across all predictors; in
classification this is often achieved by a majority voting aggregator.
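The bootstrapping step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name, the sample size and the fixed random seed are assumptions made for reproducibility of the example.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples training indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_samples = 10           # hypothetical training set size

# Indices used to train one predictor; some appear more than once.
in_bag = bootstrap_indices(n_samples, rng)
# Instances never drawn for this predictor form its out-of-bag set.
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))

print("in-bag indices:", in_bag)
print("out-of-bag indices:", out_of_bag)
```

Each predictor in the ensemble would receive its own bootstrap sample, and the instances left out of a given sample can be used to estimate that predictor's error.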
Random Forest
Random Forests are popular algorithms in machine learning in the life sciences domain, due to
their high performance coupled with relatively simple implementation and interpretability.205
The Random Forest (RF) algorithm is an ensemble of weak learning decision trees implemented
using bagging.206 The difference between RFs and traditional bagging techniques is the
additional randomisation which allows the construction of diverse, independent decision trees
to minimise the correlation between the weak learners in the ensemble.207 Each tree is grown
as described in Decision Trees, with the difference that extra randomisation is introduced: at
each node, the best feature to split on (based on entropy or Gini impurity metrics) is selected
from only a random subset of the total possible features. By restricting to a
random subspace of features, it is possible to increase the tree diversity by avoiding the situation
where a few features always form the first splitting features and therefore dominate for all trees.
This increases the bias of the model and decreases the variance, resulting in less overfitting than
deep decision trees. Finally, all trees output the classification prediction for an instance, which
are aggregated by majority voting to produce the final class. The proportion of trees voting for
each class gives an indication of a class probability value, for example if an RF model has 100 trees
and 70 trees have predicted class 1, the probability would be 0.7; this is not a true likelihood of
the class prediction, however, as this does not take into account the distribution between
classes.208
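The majority-voting aggregation and the vote-fraction example above can be sketched as follows. The per-tree predictions are written out by hand to mirror the 100-tree example; the function name is a hypothetical helper, not an API of any Random Forest library.

```python
from collections import Counter

def aggregate_votes(tree_predictions):
    """Return the majority class and the fraction of trees voting for each class."""
    votes = Counter(tree_predictions)
    n_trees = len(tree_predictions)
    majority_class = votes.most_common(1)[0][0]
    fractions = {cls: count / n_trees for cls, count in votes.items()}
    return majority_class, fractions

# Mirrors the example in the text: 70 of 100 trees predict class 1.
tree_predictions = [1] * 70 + [0] * 30
majority_class, fractions = aggregate_votes(tree_predictions)

print("predicted class:", majority_class)
print("vote fractions:", fractions)
```

The vote fraction of 0.7 is an uncalibrated score; as noted above, it should not be read as a true likelihood.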
RFs model non-linear relationships between features and output data, are relatively quick to
train, are less likely to overfit relative to decision trees, allow the interpretation of feature
importance to the model, and do not require extensive tuning of model parameters.190 In
comparison to other methods, RFs do not require strict feature scaling. Conversely, their
weaknesses lie in the fact that they cannot predict error estimates for models and are less easily
interpreted than some simpler algorithms.209
Interpretation of Random Forests
RFs are a popular machine learning approach, due to their interpretability. As described in
Decision Trees, features are selected for splitting instances at each node in each tree based on
achieving the best split, as measured by the difference between the Gini impurity or the entropy
metrics between parent and child nodes. The mean decrease in the Gini impurity is often used
as a measure of overall feature importance for interpreting RF models. This measures the
average decrease in node impurity across all times a certain feature was used to make a split in
all trees within the forest, weighted by the number of samples which reach the node involved
in each of these splits. This results in a value for the importance of each feature in the overall
classification of instances across the ensemble and these values can be ranked from highest to
lowest importance.
Although the mean decrease in Gini impurity method is commonly used to determine feature
importance, there are other importance metrics which have been used. These include
permutation measures, such as computing the out-of-bag error when each input variable is
randomly permuted; those variables which cause an increase in error or decrease in accuracy
when permuted are considered important towards classification.206 Conditional permutation
testing was proposed to correct for the effects of highly correlated variables.210 This method
calculates the mean decrease in accuracy for a feature when it is permuted as above, but now
with respect to subsets of the data split by other highly correlated features. Additionally,
variable selection for RF using backward elimination methods has been used for interpretation
of features as it provides a smaller set of important features which contribute to model
performance, which can be interpreted.211 Another method suggests that the minimal distance
measure, which calculates the average depth of the tree when a variable is used as a split, should
be used to weight variables near the root of the tree as more important.212
Recent studies have explored the use of alternative methods to capture a more detailed picture
of the relevance of certain features towards the classification of subsets of the training
dataset.213–215 In this work a local feature importance method213 was implemented to determine
the features relevant for classification of individual instances or groups of instances. This
method computes the local increment $LI_f^c$ of feature f towards class 1 when it splits instances
between parent (p) and child (c) nodes in a tree:
Equation 1-17:
$LI_f^c = Y_{mean}^c - Y_{mean}^p$
Where $Y_{mean}^c$ is the fraction of instances in the child node belonging to class 1, and $Y_{mean}^p$ is the
fraction of instances in the parent node belonging to class 1.
Following this, for a specific instance i in the training set, the contribution of feature f for an
individual tree is the sum of all local increments from all splits in the tree that the instance
passes through which used this feature.
Finally, the contribution of feature f to the classification of instance i over the entire forest is
given by:
Equation 1-18:
$FC_i^f = \frac{1}{T} \sum_{t=1}^{T} FC_{i,t}^f$
Where T is the total number of trees and $FC_{i,t}^f$ is the feature contribution of feature f towards
instance i for tree t. For more details on the method the reader is referred to the publication by
Palczewska et al.213
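Equations 1-17 and 1-18 can be followed through on a toy forest with the Python sketch below. This is an illustrative reading of the method under stated assumptions, not the published implementation: the nested-dictionary tree format, the field names (`y_mean`, `feature`, `threshold`, `left`, `right`), the two hand-built trees and the example instance are all hypothetical.

```python
def tree_contributions(node, instance, n_features):
    """Sum the local increments (Equation 1-17) along one instance's path in a tree."""
    contributions = [0.0] * n_features
    while "feature" in node:  # internal (decision) node
        f = node["feature"]
        child = node["left"] if instance[f] <= node["threshold"] else node["right"]
        # LI_f^c = Y_mean^c - Y_mean^p for the split the instance passes through.
        contributions[f] += child["y_mean"] - node["y_mean"]
        node = child
    return contributions

def forest_contributions(trees, instance, n_features):
    """Equation 1-18: average the per-tree feature contributions over all T trees."""
    totals = [0.0] * n_features
    for tree in trees:
        for f, value in enumerate(tree_contributions(tree, instance, n_features)):
            totals[f] += value
    return [total / len(trees) for total in totals]

# Two hand-built trees; y_mean is the fraction of class 1 instances in a node.
tree_a = {
    "y_mean": 0.5, "feature": 0, "threshold": 0.5,
    "left": {"y_mean": 0.25},
    "right": {
        "y_mean": 0.75, "feature": 1, "threshold": 0.5,
        "left": {"y_mean": 0.5},
        "right": {"y_mean": 1.0},
    },
}
tree_b = {
    "y_mean": 0.5, "feature": 1, "threshold": 0.5,
    "left": {"y_mean": 0.0},
    "right": {"y_mean": 1.0},
}

instance = [1, 1]  # binary feature values for one instance
contributions = forest_contributions([tree_a, tree_b], instance, n_features=2)
print("feature contributions towards class 1:", contributions)
```

In this sketch the contributions sum to the difference between the average class 1 fraction in the leaves reached by the instance and the average class 1 fraction at the root nodes, which is what allows them to be read as a per-feature decomposition of the prediction.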
Since this method was implemented in this study, other local methods of feature contribution
for RFs have been developed. One, coined the “Intervention in Prediction Measure (IPM)”
and based on the internal structure of the trees, was recently shown to outperform other established
methods, including the Gini coefficient and mean decrease in accuracy methods.215 It works
by computing the percentage of times a variable is selected along the path of each instance
classified in the out-of-bag set in each tree, which is then averaged over the whole forest. Very
recently, another local feature importance approach was proposed based on computing only the
feature contributions of the positive feature values, which had utility for the interpretation of
binary input features (in this example Gene Ontology (GO) terms), since positive values were
indicative of the presence of a feature, whereas in some cases a negative value can mean either
the absence of the feature or that there was a missing feature value, which is a less concrete
interpretation.214
Assessment of Applicability Domain
The applicability domain (AD) is the area in descriptor space to which a machine learning model
can be reasonably applied for future predictions.216 This region, in which the model's
predictions can be relied upon, is determined by the descriptor space of the training set.
In the context of Quantitative Structure Activity Relationship (QSAR) models, the descriptor
space will be the chemical space for the training set compounds and the question asked would
be “Is the query compound inside or outside the applicability domain?”.217 Approaches to define
the applicability domain are grouped into three main methods.218 The first involves limiting the
model to interpolation, where only those new examples which fit within the range of the
descriptor set are within the applicability domain. Secondly, similarity approaches are often
employed to extend towards extrapolation of a model; these methods rely on the distance of the
new compound from existing compounds in the training set to determine reliability of a new
prediction. Many methods have extended this approach to also model the density of the
neighbourhood of the new compound within the training set, by measuring the average
similarity between all close training examples rather than the nearest neighbour distance
alone.219 The third type of approach comes from outputs of the algorithm itself; including, for
example in ensemble methods, determining how many of the individual predictors vote for the
same class for classification or the consistency between values for regression.220 The more
consistent the outputs between predictors, the more likely the aggregated prediction can be
relied upon. All methods require thresholds to be determined based on the application, are
often ambiguous as to their quantitative meaning and are often not comparable to one
another.221 This is due to many AD metrics being sensitive to the individual dataset and the
algorithm used, rather than being globally applicable.222
Conformal Prediction
Conformal prediction has been implemented to assess the applicability domain of models in
previous contexts, in particular in QSAR applied to virtual screening,223–226 and addresses some
of the problems with the approaches previously detailed, by providing a quantitative
interpretation of confidence in predictions.227 Conformal prediction is designed to provide
information on whether a new prediction falls into a prediction region at a defined confidence
level. It achieves this by implementing a mathematical framework to predict new instances with
a guaranteed error rate. In the classification case, the confidence of predicting a class label for
a new example is computed, given the training set instances to which the new example
conforms. Conformal prediction assumes that the new examples are sampled from an
exchangeable distribution and, under these conditions, a guaranteed error rate is achieved,
providing prediction regions which are always valid.221 For the binary classification case, new
examples can be assigned to one of four sets: 1. {Class 1}, 2. {Class 2}, 3. {both} classes 1 and 2, or 4.
{null}. Validity and efficiency are the metrics used to assess conformal prediction performance.
The conformal predictor is valid if the frequency of errors does not exceed ε at the chosen
confidence level (1 – ε). It is efficient if instances are assigned to as few labels as possible. In the
binary classification problem, assignment to the set {both} is always valid but is not
efficient, as the instance has two labels. Conversely, the set {null} is not valid but is efficient
since the instances are assigned to zero labels. Single class predictions are the most
useful in practice; however, if the confidence level is increased, the number of {both} set
predictions often increases, because the algorithm needs to be more certain of obtaining correct
predictions.
There are different variations of conformal prediction. The first distinction is between the
methods of transductive and inductive conformal prediction.221 In the transductive setting, each
instance is predicted separately, after which the model is updated based on the true value of the
instance, before the next instance is predicted. In this way the new prediction is based on all
available previous data. In contrast, inductive conformal prediction employs a batch approach
where the model is trained on a set of existing data and is not updated after each prediction.
Transductive conformal prediction is more computationally expensive, and so inductive
conformal prediction is often used in the QSAR setting.225,226 The second variation in conformal
prediction methods is between the Non-Mondrian or Mondrian cases, and a third important
distinction is the method used to generate the calibration sets. In this section we will focus on
the method implemented in this thesis, namely Inductive Mondrian Cross-Conformal
Prediction (IMCCP), for which further details of these variations in conformal prediction methods
will be discussed.227
Practically, the process of employing Inductive Mondrian Cross-Conformal Prediction (IMCCP)
with a hypothetical example is illustrated in Figure 1-11. The algorithm starts with the definition
of a non-conformity score (Step 1), which is a measure of how different the new example is from
existing training set examples. This can be defined in many ways and metrics used to define
traditional applicability domains can be used here. For classification, often the probability
scores for each class generated from the machine learning algorithm are used as a measure of
non-conformity. The difference between IMCCP and standard Inductive Mondrian Conformal
Prediction (IMCP) is that instead of the training set being split into a training set and a smaller
calibration set, IMCCP iteratively splits the training set into k training sets and k calibration
sets, in a procedure mimicking cross-validation. This is employed to relieve the biases from
generating a random calibration set and to ensure that all training examples appear in
calibration sets once.227 Once the non-conformity measure has been established, it is calculated
for the examples in each of the k calibration sets based on the model predictions from the model
trained on each of the k training sets. These non-conformity scores are then organised into
Mondrian class lists (Step 2), which are ordered lists of the non-conformity scores for each of
the possible class outcomes. Mondrian conformal prediction was introduced to represent the
non-conformity scores on a per class basis, which overcomes the problem of class imbalance in
previous Inductive Conformal Predictors (ICPs), ensuring that the framework is valid for each
class.221 There will be k class lists for the k calibration sets. Subsequently, we define a confidence
level (1-ε) which we require for our application, where ε is the error rate which must not be
exceeded for the conformal predictor to be valid. ε is defined as the significance level which
the p_values from conformal prediction must exceed for the class to be predicted (Step
3). In Step 4 a prediction is made for a new instance. For each new instance the non-conformity
score is calculated in the same way as for the calibration sets and a value for each class will be
obtained. The values are then inserted into the Mondrian class lists for the k calibration sets
separately and a p_value is obtained for each calibration set (Step 5). P_values are calculated
based on the position in the Mondrian class list of the new example. For example, if there is one
data point in the calibration set with a non-conformity score lower than the example and there
were 7 samples in the calibration set, the p_value would be 1/7 or 0.14. The p_values for each of
the k calibration sets for the new examples are then averaged for each class to produce one value
per class. This value is then compared to the significance level (Step 6); where the p_value is
greater than the significance level ε, the new instance is said to belong to the class. In this way
both p_values could be less than ε assigning the instance to set {null}, both may be greater than
ε assigning the instance to set {both}, or the p_value for only one class may be greater than the
significance level assigning the instance to either {Class 1} or {Class 2} only.
[Figure 1-11 content]
Step 1 - Define a non-conformity measure: how different a new sample is from previous samples. E.g. what is the predicted probability of an example belonging to a class?
Step 2 - Create k Mondrian class lists from calibration sets: sorted non-conformity scores per class.
Class 1: 0.02 0.34 0.44 0.67 0.87 0.91 0.99
Class 2: 0.05 0.10 0.56 0.69 0.70 0.88 0.89
Non-conforming samples: probability estimate for correct class is low (if the score is less than 0.5 then classification was wrong).
Step 3 - Define a confidence level and significance level: e.g. if confidence level (Cl) = 0.70 (70 %) then significance level (ε) = 0.30. Probability value for an instance must exceed the significance level to be predicted as a member of this class.
Step 4 - For a new example calculate the non-conformity score and place it into each of the k class lists (shown in Step 2): e.g. if p(Class 1) = 0.2, p(Class 2) = 0.8.
Step 5 - Calculate k p_values based on position in each class list:
p_value (Class 1) = 1/7 = 0.14
p_value (Class 2) = 5/7 = 0.71
Step 6 - Interpret outcome: average k p_values and assess if they are greater than the significance level, e.g. average p_value (Class 1) = 0.16 < 0.3 and average p_value (Class 2) = 0.80 > 0.3. New instance is assigned to Class 2 but not Class 1 at the 70 % confidence level.
Figure 1-11: The process for implementing Inductive Mondrian Cross-Conformal Prediction.
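The worked example in Figure 1-11 can be sketched in Python as follows. This is a minimal illustration for a single calibration fold (k = 1), using the class lists and new-example scores from the figure; the function names are hypothetical. The p_value follows the counting rule described in the text (the fraction of calibration scores lower than the new score), whereas conformal prediction implementations commonly add a +1 term to both numerator and denominator.

```python
from bisect import bisect_left

def p_value(sorted_scores, new_score):
    """Fraction of calibration scores in a Mondrian class list below the new score."""
    return bisect_left(sorted_scores, new_score) / len(sorted_scores)

def prediction_set(fold_class_lists, new_scores, epsilon):
    """Average the per-fold p_values for each class and keep classes exceeding epsilon."""
    averaged = {}
    labels = []
    for cls, score in new_scores.items():
        fold_p_values = [p_value(sorted(fold[cls]), score) for fold in fold_class_lists]
        averaged[cls] = sum(fold_p_values) / len(fold_p_values)
        if averaged[cls] > epsilon:
            labels.append(cls)
    return labels, averaged

# Mondrian class lists for one calibration fold, as shown in Figure 1-11.
fold_class_lists = [{
    "Class 1": [0.02, 0.34, 0.44, 0.67, 0.87, 0.91, 0.99],
    "Class 2": [0.05, 0.10, 0.56, 0.69, 0.70, 0.88, 0.89],
}]
# Scores of the new example for each class (Step 4 of the figure).
new_scores = {"Class 1": 0.2, "Class 2": 0.8}

# Confidence level 0.70 corresponds to significance level epsilon = 0.30.
labels, averaged = prediction_set(fold_class_lists, new_scores, epsilon=0.30)
print({cls: round(p, 2) for cls, p in averaged.items()})
print("prediction set:", labels)
```

An empty list corresponds to the {null} set and a list with both labels to the {both} set; with further calibration folds the same function averages the p_values across all k class lists before comparing them with ε.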
Statistical Methods
This section covers statistical methods relevant to this thesis that are not encompassed by the
machine learning section above.
Covariance and correlation
Covariance is a statistical measure of association between two variables and a high covariance
indicates that two random variables vary together. Correlation measures how strongly two
random variables are related to one another and is a scaled version of covariance with values
between -1 and 1. Correlation is a specific example of covariance which is unaffected by a change
in scale. The Pearson’s correlation coefficient (ρ) is often used to measure correlation of
numerical variables and can be calculated for a pair of random variables (𝑋, 𝑌) using the
following equation:
Equation 1-19:
$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
where 𝐸 is the expectation, 𝜇 is the mean and 𝜎 is the standard deviation. Pearson’s correlation
coefficient is therefore the covariance of two variables divided by the product of their standard
deviations.
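Equation 1-19 can be checked numerically with a short NumPy sketch. The data values are hypothetical, the function name is an assumption, and population (rather than sample) moments are used for both the covariance and the standard deviations so that the ratio is consistent.

```python
import numpy as np

def pearson(x, y):
    """Equation 1-19: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return float(covariance / (x.std() * y.std()))

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

rho = pearson(x, y)
print("Pearson correlation:", rho)
# Rescaling a variable changes the covariance but not the correlation.
print("after rescaling x by 10:", pearson(10.0 * x, y))
# Cross-check against NumPy's built-in correlation matrix.
print("numpy.corrcoef:", np.corrcoef(x, y)[0, 1])
```

The rescaled result illustrates the point made above that correlation, unlike covariance, is unaffected by a change in scale.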
Mutual Information
Mutual information (MI) can be used to measure the statistical dependence of two random
variables, like the measures of covariance and correlation. The main difference between
mutual information and common correlation methods is that correlation determines the effect
of non-independence on the product of two random variables, whereas MI determines the effect
of non-independence on the joint probability distribution. This allows the MI to measure non-
linear relationships between random variables, extending the generalisability of the correlation
metric which is limited to linear relationships. MI can be related to the entropy of the two
random variables and can be represented visually as in Figure 1-12.
Figure 1-12: H(X) is the entropy of random variable X and is represented by the left circle (yellow and green), H(Y) is the entropy of random variable Y and is represented by the right circle (blue and green), H(X|Y) is the conditional entropy of X given Y (yellow only), H(Y|X) is the conditional entropy of Y given X (blue only). H(X,Y) is the joint entropy of both variables and encompasses both circles. I(X;Y) is the mutual information and is the intersection of the two variable entropies (green region).
As mentioned in Decision Trees, entropy describes the amount of information contained within
a variable and was first proposed from the Shannon entropy.199 From Figure 1-12 it can be
rationalised that the MI can be described as the information that two random variables share
i.e. how much knowing one variable reduces the uncertainty of the other:228
Equation 1-20:
𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌)
Another interpretation from Figure 1-12 is that the MI measures the dependence of X and Y as the
departure of their joint distribution from that expected under the assumption of independence:
Equation 1-21:
𝐼(𝑋; 𝑌) = 𝐻(𝑋) + 𝐻(𝑌) − 𝐻(𝑋, 𝑌)
Overall, MI is calculated by the following equation for two discrete random variables X and Y:
Equation 1-22:
$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$
Where 𝑝(𝑥, 𝑦) is the joint probability distribution function of X and Y discrete random variables,
and 𝑝(𝑥) and 𝑝(𝑦) are the marginal probability distribution functions for X and Y respectively.
Values of the MI range from zero, indicating that X and Y are completely independent variables,
to +∞. For ease of interpretation, the MI values are often normalised to within the bounds of 0
to 1 to give the Normalized Mutual Information (NMI):
Equation 1-23:
$NMI(X, Y) = \frac{2 \times I(X; Y)}{H(X) + H(Y)}$
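The relationships in Equations 1-21 to 1-23 can be verified on small joint probability tables with the NumPy sketch below. The tables are hypothetical, the function names are assumptions, and base-2 logarithms are assumed (consistent with Equation 1-14) because the base is not stated in Equation 1-22.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability table, ignoring zero cells."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """Equation 1-22 for a joint table with rows indexed by x and columns by y."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0                        # cells with p(x, y) = 0 contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

def normalised_mutual_information(joint):
    """Equation 1-23: NMI = 2 * I(X;Y) / (H(X) + H(Y))."""
    joint = np.asarray(joint, dtype=float)
    hx = entropy(joint.sum(axis=1))
    hy = entropy(joint.sum(axis=0))
    return 2.0 * mutual_information(joint) / (hx + hy)

# Hypothetical joint distributions of two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # knowing X says nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]         # X fully determines Y
partial = [[0.4, 0.1], [0.1, 0.4]]           # intermediate dependence

for name, table in [("independent", independent), ("identical", identical), ("partial", partial)]:
    print(name, mutual_information(table), normalised_mutual_information(table))
```

For the partially dependent table the value from Equation 1-22 matches H(X) + H(Y) − H(X,Y) from Equation 1-21, and the NMI falls between the two extremes of 0 and 1.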
Likelihood Ratio
Sensitivity and specificity values are often used in diagnostic testing; however, the likelihood
ratio is a more powerful metric used in clinical applications.229 To construct a diagnostic test
analysis a 2 x 2 contingency table is created based on the test results against the disease
condition (Figure 1-13).
Figure 1-13: The set-up for evaluation of a diagnostic test.
The sensitivity metric is calculated by Equation 1-24 and can be interpreted as the ability of
the test to find the individuals with the disease. Conversely, the specificity (Equation 1-25) is the
ability of the test to identify those individuals without the disease. Both are important for
determining the degree of success of the test.
Equation 1-24:
$\mathrm{sensitivity} = \frac{TP}{TP + FN}$
Equation 1-25:
$\mathrm{specificity} = \frac{TN}{TN + FP}$
The likelihood ratio extends the analysis to provide a value describing how many times more or
less likely patients with the disease are to have that particular test result than patients without
the disease. The likelihood ratios are ratios of probabilities and are calculated directly from the
sensitivities and specificities for a two-outcome case:
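[Figure 1-13 content: a 2 x 2 contingency table of test result (positive or negative) against disease status (positive or negative). Positive test with disease: True Positives (TP); positive test without disease: False Positives (FP); negative test with disease: False Negatives (FN); negative test without disease: True Negatives (TN).]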
Equation 1-26:
$LR+ = \frac{\mathrm{sensitivity}}{1 - \mathrm{specificity}}$
Equation 1-27:
$LR- = \frac{1 - \mathrm{sensitivity}}{\mathrm{specificity}}$
There are two versions of the likelihood ratio (LR), the positive and negative likelihood ratios
(LR+ and LR-). The LR+ takes the proportion of those with the disease who had the positive test
outcome (sensitivity) and divides this value by the proportion of those without the disease who
had the positive test outcome (1- specificity) (Equation 1-26). The LR- is the proportion of those
with the disease who had the negative test outcome (1- sensitivity) divided by the proportion
of those without the disease who had the negative test outcome (specificity) (Equation 1-27).
The likelihood ratio has been argued to be a better indicator of risk for this type of analysis than
the traditional sensitivity metric, as it takes into account the false negatives and the false
positive values,230 as well as providing extra interpretation. The utility of the likelihood ratio is
that it can be used to calculate the probability of the presence of disease from different tests
using a modified form of Bayes' theorem, where the likelihood ratio is multiplied by the pre-test odds to
give the post-test odds (Equation 1-28). The advantage of the likelihood ratio in diagnostic
testing is that the prior probability from the context can be incorporated into calculating the
post-test risk of disease making for a more generalisable risk calculation, whereas the sensitivity
and specificity metrics are affected by the prevalence of the outcome and are context
dependent.231
Equation 1-28:
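$\frac{p(\text{disease} \mid \text{positive test})}{p(\text{no disease} \mid \text{positive test})} = \frac{p(\text{positive test} \mid \text{disease})}{p(\text{positive test} \mid \text{no disease})} \times \frac{p(\text{disease})}{p(\text{no disease})}$
Posterior odds ratio = Likelihood ratio (LR+) × Prior odds ratio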
Practically, a positive likelihood ratio of greater than 1 indicates that the test result is associated
with the disease, whereas a negative likelihood ratio less than 1 indicates that the test result is
associated with the absence of the disease. Ratios above 10 or below 0.1 are generally considered
of strong enough evidence to determine if the test is useful in diagnosing the disease or not.232
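Equations 1-24 to 1-28 can be followed through with a small Python sketch. The contingency table counts and the pre-test probability are hypothetical values chosen for illustration, and the function names are assumptions; the sketch also assumes that the specificity is neither 0 nor 1, so that both likelihood ratios are defined.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and both likelihood ratios from a 2 x 2 table."""
    sensitivity = tp / (tp + fn)                   # Equation 1-24
    specificity = tn / (tn + fp)                   # Equation 1-25
    lr_positive = sensitivity / (1 - specificity)  # Equation 1-26
    lr_negative = (1 - sensitivity) / specificity  # Equation 1-27
    return sensitivity, specificity, lr_positive, lr_negative

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Equation 1-28: post-test odds = likelihood ratio x pre-test odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = likelihood_ratio * pre_test_odds
    return post_test_odds / (1 + post_test_odds)   # convert odds back to a probability

# Hypothetical counts: 100 individuals with the disease and 100 without.
sensitivity, specificity, lr_positive, lr_negative = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
print("sensitivity:", sensitivity, "specificity:", specificity)
print("LR+:", lr_positive, "LR-:", lr_negative)

# Hypothetical pre-test probability (prevalence) of 10 %.
print("post-test probability after a positive test:", post_test_probability(0.10, lr_positive))
print("post-test probability after a negative test:", post_test_probability(0.10, lr_negative))
```

With these illustrative numbers a positive result raises the probability of disease from 10 % to roughly one third, while a negative result lowers it to under 2 %, showing how the same likelihood ratios can be combined with different prior odds.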
1.2.2 Summary
This section highlights the general computational methods employed as part of this thesis,
providing a broader explanation of their theory and positioning within the data science field.
This thesis is concerned with applying machine learning and statistical techniques to derive
knowledge from selectivity and toxicity data within the context outlined in Challenges in Drug
Discovery.
Next, we introduce the previous in silico studies conducted in the fields of target-family
selectivity prediction and clinical toxicity prediction.
1.3 Computational Methods and Applications for Bioactivity and Selectivity Profile Prediction
Given the importance of the concepts of selectivity and polypharmacology to drug design
(Selectivity and Polypharmacology) and the lack of feasibility of screening all potential targets
for all molecules during the drug discovery process, in silico approaches are important for the
prediction of the off-target effects of compounds.10
Approaches to selectivity prediction often use the principle that similar ligands bind similar
targets.233 This field is known as chemogenomics.234 It has been shown that similar ligands often
bind to protein targets in a similar way from an analysis of the protein data bank (PDB)
structures and their co-crystal ligands.235 From this concept, many approaches have been
developed to exploit the differences between targets and the differences between ligands, or a
combination of both, to predict or infer bioactivity against multiple targets, the aggregation of
which can be used for selectivity profile prediction.
1.3.1 Ligand-Target Interactions
To understand what is meant by the difference between ligands and targets, it is important to
discuss which types of features of the ligand or target are important for binding. Ligands are
often small molecules which bind to a protein. The free energy of this binding event is based on
the changes in entropy ΔS and enthalpy ΔH on binding and the temperature (T) and is given by
the Gibbs free energy ΔG equation:236
Equation 1-29:
∆𝐺 = ∆𝐻 − 𝑇∆𝑆
A negative ΔG is required for spontaneous binding. This is favoured by a positive overall entropy
change and a negative overall enthalpy change, although only their combined contribution
determines the sign of ΔG. Generally, upon binding of a small molecule the
entropy (ΔS) of the system increases, owing to the liberation of crystal water molecules from the
lipophilic binding pocket into the bulk solution. However, this is balanced by the loss of entropy
from the reduction of degrees of freedom (conformational, translational and rotational) of both
the ligand and protein upon binding relative to the free state.237 The degree of entropy
contribution to binding can be modified by parameters such as the conformational flexibility of
the ligand and its overall size and shape. The enthalpy of binding (∆𝐻) is determined by specific
interactions made between protein and ligand and these need to be stronger interactions than
the molecule would make in the aqueous environment to have a negative and favourable change
in enthalpy upon binding.237 The types of interactions that are enthalpically favourable include
those related to functional groups in the molecule which can make specific interactions with
the protein such as hydrogen-bonding or ionic interactions, or more general interactions such
as hydrophobic contacts, shape complementarity and flexibility.
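As a worked illustration of Equation 1-29, the following minimal sketch computes ΔG from hypothetical ΔH and ΔS values and converts it to a dissociation constant. The conversion uses the standard thermodynamic relation ΔG = RT ln(Kd), which is not stated in the text above and is introduced here as an assumption (1 M standard state). The function names and the numerical values are illustrative only and do not come from any experiment.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def gibbs_free_energy(delta_h, delta_s, temperature=298.15):
    """Return dG = dH - T*dS, with dH in kcal/mol and dS in kcal/(mol*K)."""
    return delta_h - temperature * delta_s

def dissociation_constant(delta_g, temperature=298.15):
    """Convert a binding free energy (kcal/mol) to Kd (M) via dG = RT ln(Kd)."""
    return math.exp(delta_g / (R_KCAL * temperature))

# Hypothetical ligand: enthalpically driven binding with an entropic penalty.
dg = gibbs_free_energy(delta_h=-12.0, delta_s=-0.010)
kd = dissociation_constant(dg)
print(f"dG = {dg:.2f} kcal/mol, Kd = {kd:.2e} M")

# Hypothetical ligand: unfavourable enthalpy compensated by a large entropy gain.
dg_entropic = gibbs_free_energy(delta_h=5.0, delta_s=0.020)
print(f"dG = {dg_entropic:.2f} kcal/mol")
```

The second case shows that binding can remain favourable (negative ΔG) with a positive ΔH, provided that the TΔS term is large enough.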
1.3.2 Computational Encoding of Ligands and Targets
As mentioned above, the binding of small molecules to proteins is governed by molecular
properties which contribute to maximising the entropy and minimising the enthalpy of binding.
Here we discuss how we can computationally represent features of ligands and targets which
may be important towards these binding interactions for use in in silico approaches.
Chemical Descriptors
Since there is complexity in the factors that govern ligand binding to proteins, many different
computational ligand descriptors are available to capture these diverse molecular properties.
The currently known chemical descriptors can be divided into three categories: 1D, 2D and 3D
descriptors.
1D descriptors capture the physicochemical properties of the compounds. Common examples
of these are DRAGON238 or PaDEL239 descriptors, which can be generated rapidly for large
numbers of compounds and describe various aspects of molecules including shape, size,
lipophilicity, hydrogen bonding ability, chemical composition and combinations thereof.
2D descriptors are also called topological descriptors,240 and are numerical values that describe
the 2D connectivity of the molecule, based on calculating properties on the molecular graph
representation, where atoms are nodes and bonds are edges. These descriptors encompass
molecular size, composition, shape, branching and bonding of the molecule.241 One example,
MDL MACCS keys, uses predefined substructure keys to record the presence or absence of a
substructure in molecules, based on the molecular graph. These are often used for substructure
searching and similarity approaches and are easily interpretable.242 Other examples use both
information about the physicochemical properties of the molecule and the position of atoms in
the molecular structure to describe the molecule, for example, the autocorrelation
descriptors.243 These are derived by calculating the number of bonds between all atom pairs and
introducing a coefficient of the product of the properties for each atom pair, overall leading to
a measure of the property distribution over the topological structure. Another commonly used
2D descriptor is the extended connectivity fingerprint (ECFP), a type of circular
fingerprint that was developed for structure-activity modelling purposes.244 Circular
fingerprints are an alternative way of representing the 2D structure of a molecule: they do
not encode specific connectivity, but contain information about the arrangement of all heavy
atoms within a certain radius of a central atom (Figure 1-14).245
Figure 1-14: Shows the generation of Morgan Fingerprints. The atoms within certain distances of the central atom are encoded in layers as substructures into a binary fingerprint of fixed length by the use of a hashing function.
ECFPs are generated based on the Morgan fingerprint algorithm.246 The number in the fingerprint
name is the diameter, i.e. the number of bonds covered by the largest substructure, which is
twice the radius specified. Therefore, ECFP4 fingerprints are generated from a two-layer
analysis (radius 2). ECFP or Morgan fingerprints encode the presence or absence of substructures
in a molecule using a linear binary string of defined bit length. These fingerprints have the
advantage that they are not restricted to predefined substructures. However, more than one
substructure can be mapped to the same bit position in a hashed fingerprint, since there may be
more substructures than the number of bits specified, causing bit collisions. A benchmarking
study found that ECFPs
outperformed other fingerprint types in a virtual screening application.247
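The layered hashing and folding described above can be sketched in pure Python. This is a deliberately simplified toy, not the ECFP algorithm as implemented in cheminformatics toolkits: it ignores bond orders, charges, hydrogens and the removal of duplicate environments, and the function names (`circular_fingerprint`, `tanimoto`) and the tiny `n_bits` value are assumptions chosen for illustration. Molecules are supplied as hypothetical atom-label lists and bond index pairs rather than parsed from SMILES.

```python
import zlib

def _hash(text):
    """Deterministic 32-bit hash (Python's built-in hash() is salted per run)."""
    return zlib.crc32(text.encode("utf-8"))

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint; returns the set of 'on' bit positions.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    radius=2 mirrors the two-layer analysis of ECFP4.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Layer 0: each atom is identified by its own label only.
    identifiers = [_hash(symbol) for symbol in atoms]
    features = set(identifiers)

    # Each iteration extends every atom environment by one bond.
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            environment = sorted(identifiers[j] for j in neighbours[i])
            updated.append(_hash(f"{identifiers[i]}|{environment}"))
        identifiers = updated
        features.update(identifiers)

    # Folding to n_bits: several substructures may share a bit (bit collision).
    return {feature % n_bits for feature in features}

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

# Heavy-atom graphs of ethanol and propan-1-ol (hydrogens omitted).
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(sorted(ethanol))
print(round(tanimoto(ethanol, propanol), 2))
```

Because each identifier depends only on an atom's label and the sorted identifiers of its neighbours, the resulting bit set does not depend on the order in which atoms are listed. Choosing a small `n_bits` makes the bit collisions mentioned above more likely.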
The methods described so far do not account for the 3D conformation of molecules in the
binding site, which can have an important influence on binding. One of the main limitations of
3D descriptors, however, is that they require the binding conformation of the ligand to be
known. Otherwise this conformation must be predicted, usually by generating a low-energy
conformer for the molecule, from which 3D descriptors are subsequently calculated. This makes
3D descriptors very dependent on the conformation selected. 3D descriptors encode the
conformation-specific properties across a molecule,
including 3D pharmacophores, shapes, potentials, fields and atomic co-ordinates. Many 3D
descriptors rely on ligand alignment and a comparison of properties on a grid in Cartesian
space, for example in comparative molecular field analysis (CoMFA).248 Grid Independent
Descriptors (GRINDs) are independent of ligand alignment and instead use structural features of
the compound that are energetically favourable in the interaction with probes to form the
descriptors.249 3D VolSurf descriptors are examples of internal co-ordinate based descriptors
calculated from the 3D conformation of molecules and encode the distribution of molecular
size, shape, hydrophilicity and hydrophobicity across the molecule.250–253 More recently,
3D circular fingerprints called extended three-dimensional fingerprints (E3FPs) have been
derived by encoding atoms in shells of increasing size from a central atom for an ensemble of
conformers of each molecule, which is similar in principle to the derivation of 2D ECFP
fingerprints.254
In addition, the resurgence of deep learning has led to the development of different
architectures which can be used to generate explicit or implicit chemical descriptors. One
example is variational autoencoders, which can be used as continuous descriptors of chemical
space, derived from SMILES or graph representations of molecules.255 These descriptors are
being used in many applications due to their ability to perform automatic feature extraction, as
well as their usage in the generation of new molecules. Convolutional neural networks can
derive representations from graphs, and have been used in the derivation of features from
compounds,256 and recurrent neural networks are adept at learning sequences and can learn
chemical structure from SMILES.257 These methods have the advantage that they can automatically
generate useful and meaningful representations of molecules, reducing the need for explicit
feature engineering, in this case, selecting and calculating chemical descriptors.
Target Descriptors
In a similar way to compounds, biological targets can be encoded by quantitative descriptors.
These descriptors can be divided into five groups: 1. atom-based, 2. sequence-based, 3. amino
acid-based, 4. conformation-based, and 5. structure-based descriptors.258
Atom-based descriptors encode SMARTS definitions of all the different atom types in the protein
and can generate a 39-bit target descriptor based on a count of the 39 different types of atom in
the protein.259,258 Sequence-based descriptors contain information about the primary sequence of
amino acids in the protein or binding site of interest.260 Examples include descriptors which
encode the relative composition of each of the 20 amino acids in the sequence (amino acid
composition descriptors),261 or the composition of combinations of residues (e.g. dipeptide
composition-based descriptors),262 within the structures of interest. These descriptors can
represent a local structure of the protein. In one study,263 target descriptors were generated
by identifying all the amino acids within a defined radius (6.5 Å) of the ligand and then
using each of these amino acids as the central point for identifying its four nearest sequence
neighbours (two on either side), producing a sequence of five amino acids. Each such sequence
found within the radius was then counted as a local substructure. This produced a library of
substructures to which each of the receptors was compared, leading to their description by five
local structures matched to the local substructures created from four proteins. Therefore 20
(4 × 5) bits described each receptor. Properties of the sequence can also be encoded by a number
of different sequence-based descriptors, for example Moran autocorrelation descriptors.264
Often these sequence-based descriptors are used for predicting protein functional families.260
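The amino acid composition and dipeptide composition descriptors mentioned above can be illustrated with a short sketch. The function names and the example binding-site string are hypothetical, and the sketch assumes a non-empty sequence containing only the 20 standard one-letter residue codes.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return a 20-element vector of residue fractions."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def dipeptide_composition(sequence):
    """Return a 400-element vector of fractions of adjacent residue pairs."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        raise ValueError("sequence must contain at least two residues")
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]

# Hypothetical binding-site residues, for illustration only.
binding_site = "WPFYNVDYYA"
aac = amino_acid_composition(binding_site)
dpc = dipeptide_composition(binding_site)
print(len(aac), len(dpc))
```

Both vectors have a fixed length regardless of sequence length, which is what allows proteins of different sizes to be compared within one model.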
Amino acid-based descriptors include information about each amino acid in the sequence.
Types of information used are physicochemical properties, topological information and feature
information. The most widely used amino acid based descriptors are the Z-Scales descriptors.259
These are derived by principal component analysis (PCA) and contain information about the
physicochemical properties of the amino acids, including hydrophobicity (Z1), steric bulk and
polarisability (Z2), polarity (Z3), and electronic effects (Z4 and Z5).265 Other descriptors of this
type commonly used are VHSE266 and ProtFP PCA.267 ST-Scales268 and T-scales269 are based on
PCA analysis of mainly topological properties, and ProtFP (feature) describes each amino acid
by a single feature.267 BLOSUM descriptors270 were derived from physicochemical properties
using a VARIMAX analysis of the BLOSUM62 substitution matrix used to score alignments for
protein sequences.271 Additionally, a different type of descriptor called MS-WHIM was generated
based on the 3D electrostatic properties of amino acids.272 These descriptors were
benchmarked, and it was found that the Z-Scales descriptors performed best in selectivity studies
and that combining Z-Scales with ProtFP (feature) descriptors consistently resulted in a small
increase in model performance.267
Conformational structure-based descriptors use 3D information about the protein or binding
site to create a descriptor. Wu et al., 2012273 derived structural similarity descriptors from the
Protein Comparison Tool, part of the PDB resources, and geometry descriptors were computed
from the bond lengths and angles of the main backbone atoms in the protein sequence.274
Additionally, Qiu et al., 2015 produced a novel protein descriptor for modelling the antigen-
antibody interactions using a cylinder model, where an interaction face was computed and
placed in the X-Y plane. Then, a rotating plane was defined in the Z axis with a defined distance
radius. Through a similar method to 2D circular fingerprinting for ligands e.g. ECFPs, the 3D
fingerprint can be generated at different shells (distances from the central point) by coding the
presence of neighbouring residues within each layer.275 More recently, WaterMap-derived
field-based descriptors were used to encode the binding site of proteins for selectivity modelling.
Kinase-specific 3D protein descriptors have been crafted in one study by computing the
pairwise distances between residues. This method also encoded the active and inactive forms of
kinases, which are known to be important for the binding of type I vs type II kinase inhibitors.276
Ligand-Target Interaction Descriptors
Owing to the increasing availability of liganded co-crystal structures of small molecules bound
to macromolecules, molecular descriptors can be derived from the interaction between
small molecule and protein; these are known as protein-ligand interaction descriptors. Many versions of
this concept have been produced,277 including fingerprints which encode interactions between
the ligand and its protein, such as hydrogen-bonding, hydrophobic contacts, π-stacking
interactions, π-cation interactions, salt bridges, water bridges and halogen bonds.278,279
1.3.3 Ligand-Based Bioactivity Prediction Against Multiple Targets
For ligand-based bioactivity prediction, the structures of known ligands for a target or multiple
targets are used to predict future ligand interactions with the same target(s). Compound
similarity methods can be used to predict bioactivity for a new compound based on the activities
of known compounds. Methods using chemical similarity have been employed extensively for
virtual screening against a target to predict new hits and can be thought of as a multi-step
process consisting of 1. taking a set of reference compounds, 2. computing their descriptors or
pharmacophores (as described above), 3. developing a screening protocol such as employing
quantitative structure activity relationship (QSAR) techniques, or pharmacophore/shape-based
searches, and then 4. screening a library of compounds against the protocol to identify hits with
the same predicted biological activity as the reference compounds.234 QSAR methods employ
different machine learning algorithms280 to learn the relationships between the descriptor
properties and the bioactivity endpoint, inherently modelling similarity in a variety of linear
and nonlinear ways. The pharmacophore and shape-based methods work by using the 3D
structures of existing ligands to find new molecules in the library that mimic either the key
functionality of the molecules or the binding conformation. One commonly used software package
for this purpose is Rapid Overlay of Chemical Structures (ROCS).281,282 These methods are
limited by the fact that the binding conformation information needs to be available or predicted
for known ligands.
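The similarity-based screening process outlined above can be sketched as a nearest-neighbour vote. This is only one simple way of turning chemical similarity into an activity prediction and is not the protocol of any cited study. The fingerprints are represented as hypothetical sets of "on" bit positions, and the function names, reference data and choice of k are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def predict_by_nearest_neighbours(query_fp, references, k=3):
    """Return the fraction of actives among the k most similar references.

    references: list of (fingerprint_set, activity) with activity 1 or 0.
    """
    ranked = sorted(references,
                    key=lambda ref: tanimoto(query_fp, ref[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(activity for _, activity in top) / len(top)

# Step 1-2: hypothetical reference compounds with precomputed fingerprints.
references = [
    ({1, 2, 3, 4}, 1),
    ({1, 2, 3, 5}, 1),
    ({7, 8, 9}, 0),
    ({8, 9, 10}, 0),
]

# Step 3-4: screen a small hypothetical library against the references.
library = {"query_a": {1, 2, 3, 6}, "query_b": {8, 9, 11}}
scores = {name: predict_by_nearest_neighbours(fp, references, k=2)
          for name, fp in library.items()}
print(scores)
```

A score close to 1 indicates that the most similar reference compounds are active. In practice a QSAR model trained on the same descriptors would replace the simple vote.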
For selectivity profile prediction, target prediction methods have been implemented as
ensembles of QSAR models, one for each target, based on the training data of the chemotypes
known previously for each target.283 Multi-task deep learning frameworks now allow
bioactivity prediction across multiple targets using only compound information.284 This
method is based on learning more complex representations shared between prediction tasks,
and allows for transfer learning, particularly for targets with less bioactivity data.
1.3.4 Target-Based Bioactivity Prediction Against Multiple Targets
Target-based bioactivity prediction can be used when information is available for the primary,
secondary or tertiary structure of the protein of interest, often in the form of protein
structures derived from X-ray crystallography or NMR, or homology models.285 Docking methods are
extensively used to screen libraries of compounds against the 3D structure, by firstly sampling
conformations of the ligand in the binding site and secondly scoring those conformations
according to an estimated binding affinity.286 Docking methods can be flexible or rigid with
respect to the ligand and receptor, with more flexible methods being more computationally
expensive.287 With a flexible ligand, the method aims to find the highest scoring pose of the
ligand. Often an ensemble of rigid protein conformations is used for docking and ligands are
scored against each conformation. Whilst docking is useful for predicting binding poses of a
ligand and ranking activities, the predictive ability of scoring functions is often poor.288
Flexible ligand docking programs commonly used include DOCK, GOLD, Glide and
AutoDock.288,289 Docking methods are suited to low- or medium-throughput bioactivity
prediction.
The main target-based in silico approaches for selectivity profile prediction are based on
sequence and structural analysis of the binding site and the concept of binding site similarity.
This can be used to identify the binding site features which could be targeted to discriminate
between activity for the on-target and off-target.8 Where structural information is limited,
sequence-based similarity methods have previously been used to predict selectivity for related
biological targets. As mentioned before, once sequences are aligned, there are many descriptors
that can be calculated for the binding site, from which distance matrices from sequence identity
or sequence similarity can be computed.290,291 By identifying residues responsible for selectivity
by inspection of the results of sequence analysis, targeted libraries of compounds have been
designed to interact with the residues identified.292
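A distance matrix based on sequence identity, as described above, can be computed from pre-aligned binding-site sequences along the following lines. This is a minimal sketch that assumes the alignment has already been performed and uses "-" for gaps. The target names and residue strings are hypothetical, and ignoring gapped positions is one of several possible conventions.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of aligned, non-gap positions at which residues are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)

def distance_matrix(aligned):
    """Return {(name_a, name_b): 1 - identity} for all pairs of targets."""
    names = list(aligned)
    return {(p, q): 1.0 - sequence_identity(aligned[p], aligned[q])
            for p in names for q in names}

# Hypothetical aligned binding-site residues for three related targets.
aligned_sites = {
    "target_A": "WPFYNV",
    "target_B": "WPFYDV",
    "target_C": "WVF-NI",
}
distances = distance_matrix(aligned_sites)
for pair, value in sorted(distances.items()):
    print(pair, round(value, 3))
```

Positions at which two closely related targets differ (here the fifth residue of target_A and target_B) are candidates for the selectivity-determining residues discussed above.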
Structure-based similarity methods can also be used to understand and predict the selectivity
profiles of small molecules, including different binding site comparison methods which have
been used to understand the similarities and differences between binding sites that could be
exploitable for selectivity.234 To implement these methods, the macromolecular protein
sequences need to be aligned and the binding sites identified, from which the binding site
descriptors can be calculated. Binding site comparison algorithms have previously used
different descriptors to calculate similarity,293 including fingerprint methods,294 grid
methods,295 graph methods,296 distribution methods297 and geometric methods including
volumes, points and clouds.298 Once the binding site comparisons have been made, approaches
to screen ligands can be used to find selective hits for one target over another.
In one study, “binding site signatures” from the active site of kinases were used for selectivity
prediction. These signatures were defined as the energetically important interactions from co-
crystal structures of kinase inhibitors, and this information was used to successfully predict
kinase off-targets for small molecule kinase inhibitors.299,300 This method achieved a prediction
accuracy of 90 % across a large data set, showing the relevance of these signatures to kinase
selectivity.
Screening approaches can be used once the binding sites have been identified as exploitable for
selectivity. These approaches are designed to find hits for a target based on the binding site
information. Docking-based virtual screening methods, which describe the binding site shapes
and pharmacophores to search for ligands, can be used for this purpose.301 For example, a
protocol involving structure-based virtual screening was used to find selective hits for the
histone deacetylase-6 protein.302
1.3.5 Ligand and Target-Based Bioactivity Prediction Against Multiple Targets
Combining both the ligand and the target information, where possible, is now a common
approach to bioactivity prediction. Ligand-protein interaction fingerprints (see Ligand-Target
Interaction Descriptors) can be used in docking-based methods for virtual screening approaches
to identify novel hits.303 In a similar way to the ligand shape-based methods, interaction
fingerprints can be used to determine if the predicted binding pose is similar to the reference
binding pose and this requires measurement of similarity between interaction fingerprints.277
Within cheminformatics approaches, a combination of compound and target descriptors has
been used to build machine learning models for selectivity prediction. These are known as
proteochemometric (PCM) models. PCM models are applied as part of this study, and so in the
next section we review this integrative technique for the prediction of the selectivity profiles of
ligands for target families.
Proteochemometric Modelling
As mentioned in Ligand-Based Bioactivity Prediction Against Multiple Targets, QSAR can be
used as a ligand-based method for bioactivity and selectivity profile prediction. However, this
method has a limited applicability domain, which restricts the chemical space covered by the
models.304 Additionally, QSAR only takes into account how chemical structure influences
bioactivity when, in reality, aspects of both the target and compound combined are important.
In order to develop models that contain a higher diversity of chemical space and also to
incorporate additional target information, predictive models for multiple related biological
targets have been created, known as proteochemometric (PCM) models.305 Target information
increases the ability of the model to identify new ligand scaffolds and binding interactions in
chemical space outside the training set and can also help distinguish how small differences in
SAR might lead to large differences in bioactivity.258 Additionally, PCM models offer the
advantage of exploring compound activity across the entire target family, which allows for the
interpretation of the degree of selectivity, or conversely, promiscuity of a compound, as opposed
to just the interaction of one compound with one target.209 The models use information from
both chemical and biological spaces in order to predict the bioactivity of new compounds
(interpolation from known targets) or new biological targets (interpolation from known
compounds) or both (extrapolation) (Figure 1-15).209 They rely on the concept that similar drugs
bind similar targets, which has been found to be at least partially true; in one study it was found
that 71 % of drugs interact with at least two targets with similar binding sites.18 This approach
has potential applications in repurposing compounds for new biological targets, or
suggesting new compounds for orphan receptors.209 Moreover, through PCM modelling, it is
possible to understand which features of the protein and ligand contribute to the bioactivity, by
identifying key structural motifs and binding site hotspots; this can then be extended to
interpretation of how structural binding site differences influence ligand selectivity profiles.
Non-bioactivity endpoints can also be used for PCM including phenotypic, genomic and
toxicological readouts.209
Figure 1-15: Adapted from Cortes et al. 2015.306 Shows the process of PCM modelling as a matrix format. The modelling is similar to QSAR except that the model incorporates data for multiple related targets and can therefore interpolate for both new biological and new chemical space.
The overall process for producing a PCM model is highlighted in Figure 1-16.258,209 Firstly,
compound and target descriptors are calculated and combined with the bioactivity data. The
data is then divided into training and test sets, with the model being trained on the training set.
Parameter and variable optimisation are conducted, each model usually being validated by
leaving a portion of the training set out (cross-validation) and testing on this portion. The model
is then used to predict the test set examples and evaluated using appropriate methods for
regression or classification. Subsequently, a test for overfitting can be conducted using Y-
randomisation approaches. Once the model is considered representative, other validation
methods can be explored to define the ability of the model to predict for unseen targets using
leave-one-target-out (LOTO) validation and for unseen compound clusters using
leave-one-compound-cluster-out (LOCCO) validation. The gold standard for assessing the
predictivity of a model is to make prospective predictions and have those predictions
experimentally confirmed, although in practice this is often more difficult to achieve. The
following sections will discuss some of
these aspects in further detail.
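The LOTO and LOCCO validation schemes and the Y-randomisation test described above can be sketched with grouped data splits. This sketch covers only the data partitioning and label shuffling, not model training, and it is not the implementation used in this thesis. The row layout, the `cluster` assignments and the function names are hypothetical.

```python
import random

def leave_one_group_out(rows, group_key):
    """Yield (held_out_group, train_rows, test_rows) for every group value."""
    groups = sorted({row[group_key] for row in rows})
    for held_out in groups:
        train = [row for row in rows if row[group_key] != held_out]
        test = [row for row in rows if row[group_key] == held_out]
        yield held_out, train, test

def y_randomise(rows, label_key="active", seed=0):
    """Return copies of the rows with the activity labels shuffled."""
    labels = [row[label_key] for row in rows]
    random.Random(seed).shuffle(labels)
    return [{**row, label_key: label} for row, label in zip(rows, labels)]

# Hypothetical compound-target pairs with precomputed compound clusters.
rows = [
    {"compound": "cpd1", "cluster": "A", "target": "T1", "active": 1},
    {"compound": "cpd1", "cluster": "A", "target": "T2", "active": 1},
    {"compound": "cpd2", "cluster": "A", "target": "T1", "active": 0},
    {"compound": "cpd3", "cluster": "B", "target": "T2", "active": 0},
    {"compound": "cpd3", "cluster": "B", "target": "T3", "active": 1},
]

for target, train, test in leave_one_group_out(rows, "target"):    # LOTO
    print("LOTO", target, len(train), len(test))
for cluster, train, test in leave_one_group_out(rows, "cluster"):  # LOCCO
    print("LOCCO", cluster, len(train), len(test))

scrambled = y_randomise(rows)  # a model trained on these should perform poorly
```

Grouping by target or by compound cluster guarantees that no held-out target or chemotype appears in the training data. A model that still performs well on the scrambled labels would indicate overfitting.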
[Figure 1-15 labels: axes "Compounds" and "Targets"; regions "QSAR", "Bioinformatics", "Interpolation from known compounds", "Interpolation from known targets" and "Extrapolation for new compounds on new targets"; margins "New compound" and "New target"; legend: Active, Inactive, Missing value]
Figure 1-16: Schematic describing the general process of PCM modelling in sequence. Compiled from review articles on PCM.258,209
Machine learning algorithms are used in PCM modelling to develop a mapping function to
predict the relationships between the input values (variables) and the output values
(observations) in a system.200 In the case of PCM, instances are compound-target pairs,
described by a concatenation of compound and target variables. For each instance either a
numerical or categorical value is given to represent the activity of the compound-target pair
combination (Figure 1-17). This learned model can then be used to predict observations for new
instances, provided it has access to the same input variables for the new instance. There are
many algorithms that have been used in this field including: Random Forests (RF), Support
Vector Machines (SVM), Neural Networks (NN) (including Deep Neural Networks), and Naive
Bayes.209,200 Details of machine learning methods are discussed in Machine Learning
Algorithms.
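The construction of PCM instances by concatenating compound and target variables can be sketched as follows. The descriptor values, identifiers and the function name `build_pcm_matrix` are hypothetical and serve only to show the shape of the input; any of the algorithms listed above could then be trained on the resulting matrix.

```python
def build_pcm_matrix(bioactivities, compound_descriptors, target_descriptors):
    """Build PCM inputs from (compound_id, target_id, label) records.

    Returns (X, y, ids): each row of X is the compound descriptor vector
    followed by the target descriptor vector for one compound-target pair.
    """
    X, y, ids = [], [], []
    for compound_id, target_id, label in bioactivities:
        row = list(compound_descriptors[compound_id]) \
            + list(target_descriptors[target_id])
        X.append(row)
        y.append(label)
        ids.append((compound_id, target_id))
    return X, y, ids

# Hypothetical descriptors: 4-bit compound fingerprints, 2 target variables.
compound_descriptors = {"cpd1": [1, 0, 1, 0], "cpd2": [0, 1, 1, 0]}
target_descriptors = {"T1": [0.2, 0.8], "T2": [0.9, 0.1]}

# Hypothetical binary activity labels (1 = active, 0 = inactive).
bioactivities = [("cpd1", "T1", 1), ("cpd1", "T2", 0), ("cpd2", "T1", 0)]

X, y, ids = build_pcm_matrix(bioactivities, compound_descriptors,
                             target_descriptors)
for pair, features, label in zip(ids, X, y):
    print(pair, features, label)
```

The same compound (cpd1) appears in two rows with different target variables and different labels, which is what allows a single model to learn selectivity across the target family.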
[Figure 1-16 flowchart steps: Calculate chemical descriptors; Align proteins; Calculate protein descriptors; Tune model to find best parameters using cross validation; Build final model using cross validation; Validate model on test set; Perform Y-scrambling to detect overfitting; Assess applicability domain (LOCCO, LOTO analysis); Experimentally validate and interpret model]
Figure 1-17: Shows the design of the PCM model. Each row is a compound-target pair and the model takes a concatenated input of compound and target descriptors for the pair and predicts an outcome (in this example binary classification where 1 and 0 are the positive and negative classes respectively).
As mentioned above, PCM modelling is an extension of Quantitative Structure Activity
Relationship (QSAR) modelling.307 PCM has been used to create bioactivity prediction models
for a variety of different target classes to date. These include G-protein coupled receptors, for
which PCM models were built using Support Vector Regression algorithms, achieving a model
with an R² = 0.93 and Q²test = 0.74. The ROC AUC for an external test set was 0.89.308,249 Kinases
have also been explored with this technique, including a model which covered 95 kinases and
1572 inhibitors using 3D protein field-based descriptors, achieving a ROC AUC on the external
test set of greater than 0.8.251,309 Viral mutants have also been studied by PCM, including a
study of HIV proteases containing 4792 protease-inhibitor pairs, which achieved R² = 0.92 and
Q² = 0.87, as well as a high predictivity for leave-one-target-out analysis of 0.72.310 A
classification model was built for the cytochrome P450 enzymes using a dataset with 63391
datapoints.311 This model achieved a ROC AUC of greater than 0.9 for both RF and SVM models on
the internal and external test sets. Epigenetic protein classes have been explored using PCM,
including the HDAC proteins, which, when modelled by SVM, achieved an R² = 0.99 and a
Q² = 0.75.274 A more recent study
employed PCM for the identification of allosteric modulators of glutamate receptors achieving
an overall ROC AUC of 0.97 for the model, which was used to identify hits using virtual
screening for the glutamate 7 receptor described as an orphan target.312 A model for nuclear
receptors was built achieving a ROC AUC on the external test set of 0.74 using RF and revealed
the molecular scaffolds predicted for five major nuclear receptor targets.313 What is notable is
the high performance of many previous PCM models, with ROC AUCs > 0.9. This could be attributed
to the structure of PCM models, where a random test set can contain the same compound but
with an activity measurement for a different target. If there are correlated bioactivity profiles
between targets, this could be the cause of inflated performance in the test set, which will not
generalise to new compounds or new targets (a more realistic situation). Therefore, it is
important to validate PCM models by leaving out compound clusters (LOCCO) and targets
(LOTO) to assess the ability for future interpolation or extrapolation, as discussed above. For
further information on PCM, we refer the reader to the comprehensive reviews by Cortes-
Ciriano et al.209 and Qiu et al.258
1.3.6 Bromodomain Target Family Selectivity Profile Prediction
This thesis is concerned with the prediction of selectivity for bromodomain-containing proteins
using PCM. Next, we highlight previous in silico methods to predict bromodomain selectivity,
which extends from the section Selectivity Knowledge for Bromodomain Binding to Small
molecule Inhibitors.
Previous in Silico Approaches for Predicting Bromodomain Inhibitor Affinity
When designing inhibitors for bromodomain-containing proteins it is important to understand
the selectivity profiles between subfamilies and between individual domains. To facilitate
efficient inhibitor design, studies have investigated the prediction of affinity of small molecules
for bromodomains. Examples include a molecular docking and QSAR study developed for the
prediction of the affinity of naphthyridone derivatives for the ATAD2 bromodomain,62 and the
prediction of binding affinities for 16 tetrahydroquinoline (THQ)-based ligands314 against BRD4
BD1 using free energy perturbation (FEP) calculations. Selectivity has also been predicted on a
small scale using relative binding free energy (RBFE) approaches;315 this study used three
inhibitor structures and predicted their affinities across up to 22 bromodomains, achieving mean
unsigned errors of 0.81-1.76 kcal/mol relative to experimental data. Multiple QSAR models have
been developed for a set of 88 organic molecules against the bromodomains in proteins BRD2, BRD3
and BRD4. These models achieved Q² values between 0.75 and 0.88 for predicting the activity for
the three bromodomains by employing QuBiLS-MIDAS 3D molecular descriptors and multiple
linear regression models with 6-9 variables per model.316 A de novo fragment-growing approach
called AutoCouple was successfully implemented to discover CREBBP bromodomain binders
within an expanded chemical space.317 The approach couples known headgroups of
bromodomain binders with commercially available building blocks to produce new molecules
which were docked and scored against the target binding site. However, no studies have yet
modelled bromodomain bioactivity and selectivity data for a large number of ligands and
targets.
Computational Studies to Understand Bromodomain Selectivity
In addition to the literature on individual bromodomain selectivity derived from both
experimental and computational approaches presented in Selectivity Knowledge for
Bromodomain Binding to Small molecule Inhibitors, a computational study on 24
bromodomains was previously conducted to provide insight into selectivity for the target family.
In this study, Vidler et al.318 explored the classification of bromodomains by their structural
motifs, using druggability scores to group bromodomains into 9 groups by combinations of
signatures of three amino acids in their binding sites. This led to the identification of 7 key
binding site residues, which can be used to help understand which other bromodomains may
bind to the same compounds, and therefore bromodomain selectivity.318 More recently, a
molecular dynamics approach to assess the structural and energetic properties of structural
waters between different bromodomains (ATAD2, BRD2 BD1, CREBBP and BAZ2B) was
conducted.319 This study found that for BRD2 BD1 the ZA loop was conformationally rigid (often
in a closed conformation), whereas in ATAD2 this loop was flexible with a higher population of
the open conformation, whilst the ZA loops in CREBBP and BAZ2B interconvert between the
two states frequently. Furthermore, the authors observed that the degree of flexibility was
linked to the water network: BRD2 BD1 keeps its well-constructed water network, whilst in the
other bromodomains the waters were more displaceable owing to the conformational flexibility
observed.
1.3.7 Summary
Here we highlight previous computational approaches for predicting the bioactivity of small
molecules against protein targets. These techniques can be categorised into ligand-based,
target-based and combined ligand- and target-based methods. We furthermore highlight how
these approaches can be extended to predict bioactivities against multiple targets, i.e. selectivity
profiles, which is the concern of this thesis. We introduce the method of proteochemometrics for
selectivity profile prediction and provide a survey of the literature on previous computational
methods applied to bromodomain-containing proteins. This survey highlights the lack of a
large-scale computational analysis of selectivity profiles in previous work. Given the context
outlined in Challenge of Designing Selective Bromodomain Inhibitors, there is a need
for new methods to understand and predict the bromodomain selectivity profiles of small
molecules.
1.4 Computational Methods and Applications for Toxicity Prediction
Many of the in silico approaches to selectivity and bioactivity prediction, including QSAR
methods, can be applied to toxicity prediction and to the prediction of off-targets for drugs, based on a
combination of chemical structure and other biological information.320 More recently, in silico
approaches used in the field of toxicity analysis include the integration of multiple datasets,
describing the effects of compounds on different system levels, including the interaction with
targets, cellular responses e.g. pathways and genetics, organ level responses e.g. altered tissue
function, and organism level responses e.g. in vivo and clinical phenotypes. The Adverse
Outcome Pathway (AOP) framework is an evidence-based approach to link the information
between these levels for chemical toxicity.321 However, the relationships between different types
of toxicity data within this framework remain poorly understood. Here we discuss the in silico
approaches used to help elucidate relationships between toxicity measurements which are
relevant to this thesis, namely the problem of preclinical to clinical adverse event translation
and the understanding of compound-target interactions measured in vitro which display
relevance to clinical toxicities.
1.4.1 Prediction of Clinical Adverse Events from Preclinical Adverse Events
It is important to understand which toxicity endpoints translate from preclinical to clinical
studies and which do not. Efforts to understand and quantify the concordance of
adverse events (AEs) between animal models and clinical studies have previously been conducted
using retrospective statistical analyses; details of these concordance studies
are summarized in Table 1-4. The first studies of this type used sensitivity to measure
concordance between AEs recorded in preclinical studies and AEs recorded in clinical studies.
Later studies moved towards the positive likelihood ratio (LR+) due to its more
meaningful interpretation, as well as its ability to account for both false negatives and false
positives.230 Studies of this type are not trivial to conduct, due to small data set sizes, biological
variability and species exposure differences,232 as well as biases in the data such as ‘survivor
bias’: data for drugs which are terminated before clinical trials for safety (or other)
reasons cannot be used in the analysis, which has consequences for the inclusion of severe
preclinical toxicities. The main findings from previous studies include that haematological,
gastrointestinal, injection site and some specific cardiovascular AEs display a high concordance
(LR+ values of 11),232,322 meaning that the odds of observing the clinical AE increased 11-fold given the
same preclinical finding; however, neurological and cutaneous toxicities have a poor
concordance of less than 35 %.169,323 It was also observed that concordance is higher for small
molecules than for large molecules (e.g. antibodies),323,324 and that concordance with humans
differs between preclinical species.169,324,325 For a
comprehensive review of previous concordance literature the reader is referred to Monticello et
al.326
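The two concordance measures discussed above can be illustrated with a short Python sketch. It computes sensitivity and the likelihood ratios from a 2x2 contingency table of drugs, and shows how an LR+ updates a prior probability. The function names and the counts in the usage note are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
def concordance_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+ and LR- from a 2x2 table of drugs.

    tp: preclinical AE observed and clinical AE observed
    fp: preclinical AE observed, clinical AE not observed
    fn: preclinical AE not observed, clinical AE observed
    tn: neither AE observed
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # LR+ = P(preclinical AE | clinical AE) / P(preclinical AE | no clinical AE)
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    # LR- = P(no preclinical AE | clinical AE) / P(no preclinical AE | no clinical AE)
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


def post_test_probability(prior, likelihood_ratio):
    """Update a prior probability of the clinical AE using a likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)
```

With the hypothetical counts `concordance_metrics(22, 8, 78, 392)`, sensitivity is only 0.22 although LR+ is approximately 11, which illustrates why the two measures can rank the same association very differently.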
What all previous studies have in common is that they measured the concordance between the
same toxicity or a toxicity related to the same system organ class (SOC) as the clinical AE. Some
adverse effects in humans are not predicted by the same AE in animals, due to the differences
in anatomy, physiology and biology between species. An example of this is the lack of a vomiting
response in rats,170 which excludes rats from being used as a model for vomiting in humans.
Instead, taste aversion/food avoidance responses in rodents, or ferret or dog emesis models, are
used.170,171 Linked to this, species differences exist for some teratogenic toxicities; for
example, corticosteroids are teratogenic in animal models but not in humans,173 and conversely
thalidomide is a teratogen in humans but not in many animal species, which has been attributed
to differences in metabolism across species.174 The picture is further complicated by a lack
of correlation of drug bioavailability between species,327 linked to poor dose extrapolation
between species which can lead to differences in toxicity observations.328 These and other
reasons demonstrate that concordance between the same AE across species is only part of the
picture, and that there is more to be learnt about the interrelationships of different AEs across
species. This highlights the gap for an in silico investigation into whether seemingly unrelated
preclinical AEs in different SOCs may be mechanistically predictive of clinical AEs.
Table 1-4 Summary of the previous concordance literature relating preclinical adverse events to clinical adverse events. Previous studies have examined the ability of preclinical AEs which are the same AE as, or contained within the same system organ class (SOC) as, the clinical AE to predict the clinical AE. Not yet analysed are the associations between recorded preclinical AEs which are not part of the same SOC as the clinical AE and the clinical AE, which is the gap that this study aims to address. LR+ is the positive likelihood ratio and LR- the negative likelihood ratio.

| Reference | Scale of study | Concordance measurement (Sensitivity or Likelihood Ratios (LR+/LR-)) | Concordance | Interpretation |
| --- | --- | --- | --- | --- |
| Olson et al., 2000 169 | 150 drugs | Sensitivity | 71 % concordance between AEs within same organ class in any animal species and human; 63 % concordance for non-rodent species with humans; 43 % concordance for rodent species with humans | Haematological, gastrointestinal and cardiovascular toxicities were highly concordant (> 80 %); cutaneous toxicities were the least concordant (< 35 %) |
| Bugelski et al., 2012 324 | 15 monoclonal antibodies | Sensitivity | 42 % concordance of AEs from rodent species to humans; 35 % concordance from non-human primates to humans | Human AEs were predicted poorly by animal models for monoclonal antibodies; individual drug concordance was discussed |
| Nishida et al., 2013 323 | 142 approved drugs in Japan | Sensitivity | 48 % concordance between AEs within same organ class in any animal species and human; 33 % concordance for large molecule drugs, including antibodies; 58 % concordance for small molecule drugs | Haematological, ocular, and injection site reactions reached a concordance of 70 %; cardiovascular, neurological and cutaneous toxicities showed low concordance (< 30 %) |
| Bailey et al., 2014 325 | 2,366 drugs | LR+/LR- | Median LR+ 253 and LR- 1.82 between AEs in rats and AEs in humans; median LR+ 203 and LR- 1.39 between AEs in mouse and AEs in humans; median LR+ 101 and LR- 1.12 between AEs in rabbit and AEs in humans | High risk of AE in humans given the presence of AE in animals; very little evidence for absence of AE in humans given absence of AE in animal models |
| Clark, 2015 232 | 3,815 drugs | LR+/LR- | ARRHYTHMIAS, QT PROLONGATION and ABNORMAL HEPATIC FUNCTION had the highest LR+ values (LR+ 11, 11, 26 respectively) | Asymmetry between LR+ and LR- values |
| Clark et al., 2018 322 | 3,290 drugs | LR+/LR- | QT PROLONGATION, ARRHYTHMIAS, DRUG SPECIFIC ANTIBODY PRESENT and INJECTION SITE REACTION had highest LR+ across species (LR+ 11, 11, 162, 17 respectively) | Only in a few cases was there an advantage to abstracting the description to a higher level of the MedDRA hierarchy, due to the increase in false positives and decrease in significance; concordance between animal species and human is dominated by the selection of species and selected species is predictive for the endpoint of interest |
1.4.2 Computational Analyses for the Proposal of Targets for In Vitro Secondary Pharmacology
Screening
As mentioned in Off-Target Toxicity Assessment, secondary pharmacology screening panels are
currently used to flag potential toxicities which may occur in vivo. For targets to be considered
as off-targets, a link between a drug-induced adverse event and the off-target should be made.
Often these links are made through statistical analyses of clinical trial and post-marketing
databases, looking for evidence of multiple drugs that cause the same adverse event profile and
express similar in vitro target modulation, as outlined in the review by Whitebread et al., 2016.121
This approach uses clinical adverse event data, phenotype data for diseases, and literature
mining to ultimately identify novel associations between drug adverse events and protein
targets. These types of analyses are often conducted on a small scale to find associations for
specific toxicities, with few large-scale analyses having been performed to date. Another review
suggests that “the panels of [safety] targets that are employed vary widely and are often selected
without justification or a description of their relevance to human safety”, which highlights the
need for approaches which are based on more concrete evidence.329 Recently, a large-scale in
silico approach was implemented to identify other safety targets which could be included in
safety panels.112 In this study, the authors derived links between adverse event phenotypes and
drug targets using an enrichment analysis, measured by the odds ratio. The data for
drug-associated adverse events were obtained from the intersection of the FDA Adverse Event
Reporting System (FAERS) and SIDER databases. Adverse events were mapped to phenotypes
using the Unified Medical Language System (UMLS), and gene information was then gathered
from phenotypes encoded by the Human Phenotype Ontology (HPO) which mapped to diseases
in Online Mendelian Inheritance in Man (OMIM). This study culminated in a proposal of 70
safety targets for screening.
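As an illustration of the enrichment measure mentioned above, the following Python sketch computes an odds ratio for the association between a target and an adverse event from a 2x2 table of drug counts. The function name is hypothetical, and the continuity correction (adding 0.5 to every cell when any cell is zero) is an assumption made here for numerical convenience; the cited study may have handled empty cells differently.

```python
def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio for a target-adverse event association.

    a: drugs that modulate the target and show the adverse event
    b: drugs that modulate the target and do not show the adverse event
    c: drugs that do not modulate the target and show the adverse event
    d: drugs that do neither
    """
    if 0 in (a, b, c, d):
        # Assumed continuity correction so the ratio stays finite and defined.
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)
```

An odds ratio above 1 indicates that the adverse event is reported more often for drugs that modulate the target than for those that do not; in practice such a value would be accompanied by a significance test before a target is proposed.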
One source of information which the previous study does not include is the in vivo adverse event
data, which has been outlined previously as an important parameter to include in analyses, since
positive correlations between secondary pharmacology results and adverse events in animals
are better indications of the expected toxicities in humans.329 This is a gap not yet addressed
using computational studies and by providing this extra mechanistic link, the targets proposed
for future secondary pharmacology screening will be based on combined evidence from in vitro,
in vivo and clinical information.
1.4.3 Summary
Here we discuss the previous literature on computational analyses of the translation of adverse
events between animals and humans. Previous studies have analysed the concordance of
toxicities i.e. the degree to which the same or related toxicities in animals predict the same
clinical toxicity. We define the clear gap for a study which attempts to find mechanistically
related toxicities across species, which are not necessarily part of the same system organ class
grouping. This will provide more information on the utility of animal models for clinical adverse
event prediction, whilst fitting with the need to reduce, replace and refine animal usage in drug
discovery, as discussed in Toxicity Translation. Finding a mechanistic link between the adverse event in animals and
humans will add further weight to using such associations in a practical setting.
We secondly discuss the generation of secondary pharmacology screening panels from
statistical associations between drug-induced adverse events and drug off-targets. We highlight
the gap within previous approaches for proposing new targets for secondary pharmacology
screening, namely the lack of inclusion of animal in vivo adverse event information. A study
which can combine the use of preclinical information to support new mechanistic relationships
between clinical drug-induced adverse events and drug off-targets will help to provide stronger
evidence for the incorporation of new targets into in vitro screening panels, an essential method
for the early anticipation of toxicity to avoid drug attrition.
1.5 Aims
The main aim of this thesis was to address the problem of off-target pharmacology of
compounds using data mining and machine learning techniques. The first half of this thesis will
be concerned with investigating selectivity prediction using machine learning methods, with
the application to modelling bioactivity data for the bromodomain family of proteins. We aimed
to use the models to predict activity and selectivity for new compounds, leading to the discovery
of novel experimentally confirmed chemical hits for bromodomains with a desired selectivity
profile. We furthermore tested the hypothesis that the models can be interpreted to identify
selectivity residues in the binding site of bromodomains for use in compound design. We
compared the findings to the literature and analysed the correspondence of the model
interpretation to the binding modes of novel hits in the protein of interest, as observed in the
crystal structures obtained. In summary, we explore the extent to which proteochemometric models can be used
in compound design and for on-target and off-target prediction of activity for bromodomains.
The second half of the thesis was concerned with extending the problem of selectivity to
toxicity, given that all drugs have affinity for off-targets, which can lead to toxicity. Using data mining
methods, we investigated the correspondence between preclinical and clinical toxicities for
marketed drugs with the aim to identify new links between toxicities across species. We
furthermore tested the hypothesis that the associated toxicities were mechanistically linked
across species, by integrating data from compound-target databases as well as phenotype-gene
and disease-gene databases to find intersecting evidence for targets which could be responsible
for inducing toxicities for the derived associations. In summary, we aimed to quantify the utility
of animal models for the assessment of the risk of drug-induced clinical toxicities, as well as to
propose new targets for in vitro secondary pharmacology screening panels. Overall, the thesis
employs in silico methods which can be used to improve the understanding of the primary and
secondary pharmacology of drugs with relation to selectivity and toxicity. These parameters are
vital to understand when designing and developing a successful drug molecule.
2 Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule Binding to Bromodomain Proteins
2.1 Introduction
It is important to find small molecules with the desired target family selectivity profile to avoid
off-target effects including toxicity, as well as to elucidate target roles in disease. Bromodomain-
containing proteins have functional relevance in immunological, developmental and
cardiovascular disorders, as well as cancers.30 To enable their individual roles to be further
elucidated there is a high interest in developing selective probe molecules31 (see Challenge of
Designing Selective Bromodomain Inhibitors). The studies described in this chapter
investigated the extent to which in silico modelling approaches can provide a framework for
virtual screening, with the aim to find both new small molecule hits for individual
bromodomain targets, as well as small molecules with specific selectivity profiles across the
bromodomain target family. The method implemented was proteochemometric (PCM)
modelling, which has been successfully applied to other target families for the purpose of
selectivity profile prediction across structurally related proteins.209,258 This technique had not
been previously applied to model the bromodomain target class and, due to the recent discovery
of a larger number of chemotypes of small molecule bromodomain binders, we applied this
method to produce the largest reported in silico study for bromodomain-containing proteins,
based on a diverse chemical space and the use of selectivity panel data. The technique allows
information about related biological targets to be transferred to inform the
structure-activity model for each target, providing an extension of traditional QSAR methods (see
Proteochemometric Modelling). We implemented conformal prediction to determine the
applicability domain of our models and tested high-confidence predictions of new
compound-target pairs from our virtual screen in our prospective validation, as described below.
2.2 Materials and Methods
2.2.1 Dataset
The dataset used to generate the models was extracted from the public and licensed sources
ChEMBL330, PubChem331, ChEpiMod332 and GOSTAR333, from the manual extraction of data from
recent publications,54–56,58,59,65,74,82,94,334–338 and from AstraZeneca proprietary databases.
Public Dataset
The public dataset composition across sources is shown in Figure 2-1 and the distribution across
bromodomains is shown in Figure 2-2.
Figure 2-1: Number of compound-target pair annotations provided by each source in the public dataset after filtering, coloured by the activity classification. “Manual” is data manually curated from publications at the time of data set construction. “PubChem others” were those data points that were found in PubChem but not in ChEMBL.
Compound-target bioactivity data points were extracted from ChEMBL-20 using
bromodomain UniProtKB Accession IDs, applying the following criteria: a bioactivity type of
IC50, Ki, Kd, % inhibition or ΔTm; a binding (B) assay type; a confidence score of at
least 8 (corresponding to the classification of “Homologous single protein target assigned”);330
and the presence of a numerical value for the bioactivity. Bioassay descriptions were used to
manually filter out compounds interacting with multiple protein domains or non-bromodomain
domains within a bromodomain-containing protein, and to place data into the correct domain
where multiple domains exist within one bromodomain-containing protein. Data points where
the domain was unresolved were removed. For percentage inhibition values between 20 % and 80 %
at a given concentration, the following Hill equation (Equation 2-1) was applied to convert to an
estimated pIC50 value:
Equation 2-1:
\[ \mathrm{pIC}_{50} = -\left(\log_{10}(100 - Y) - \log_{10}(Y) + X\right) \]
where Y is the inhibition value (in %) and X is the log10 of the concentration (in molar).24
Data underlying Figure 2-1 (number of compound-target pairs per data source):

| Data source | Active compounds | Inactive compounds |
| --- | --- | --- |
| ChEMBL | 429 | 179 |
| ChEpiMod | 1883 | 879 |
| GOSTAR | 76 | 14 |
| Manual | 343 | 326 |
| PubChem others | 210 | 113 |
Figure 2-2: The distribution of data per domain for the public dataset used in the model, coloured by assigned activity (A=active, N=not active) and annotated with data point counts for each class. In total 3,950 data points were collected from the public domain.
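Data underlying Figure 2-2 (number of compound-target pairs per bromodomain; A = active, N = not active):

| Bromodomain | N | A |
| --- | --- | --- |
| ATAD2 | 45 | 42 |
| BAZ2A | 9 | 27 |
| BAZ2B | 65 | 31 |
| BRD1 | 16 | 44 |
| BRD2 BD1 | 40 | 79 |
| BRD3 BD1 | 7 | 69 |
| BRD3 BD2 | 4 | 24 |
| BRD4 BD1 | 843 | 1124 |
| BRD4 BD2 | 79 | 540 |
| BRD7 | 7 | 21 |
| BRD9 | 27 | 113 |
| BRDT BD1 | 18 | 13 |
| BRPF1 | 33 | 66 |
| BRPF3 | 34 | 3 |
| CECR2 | 12 | 16 |
| CREBBP | 111 | 163 |
| PB1 BD5 | 10 | 16 |
| PCAF | 34 | 51 |
| SMARCA4 | 7 | 24 |
| TIF1A | 20 | 63 |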
Those inhibition values below 20 % inhibition and above 80 % inhibition were assigned to less
than the derived pIC50 at 20 % inhibition or greater than the derived pIC50 at 80 % inhibition
respectively, to account for the fact that the equation no longer applies to these parts of the IC50
curve.
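The conversion in Equation 2-1, together with the handling of values outside the 20-80 % range, can be sketched in Python as follows. The function name and the qualifier strings are hypothetical; the sketch assumes, as Equation 2-1 does, a Hill slope of 1.

```python
import math


def estimate_pic50(inhibition_pct, conc_molar):
    """Estimate a pIC50 from a single-point % inhibition (Equation 2-1).

    Returns a (qualifier, pIC50) pair. Values below 20 % or above 80 %
    inhibition are reported as '<' or '>' the pIC50 derived at the
    20 % or 80 % boundary, because the equation is not applied outside
    that part of the curve.
    """
    x = math.log10(conc_molar)  # log10 of the molar test concentration

    def hill(y):
        return -(math.log10(100 - y) - math.log10(y) + x)

    if inhibition_pct < 20:
        return "<", hill(20)
    if inhibition_pct > 80:
        return ">", hill(80)
    return "=", hill(inhibition_pct)
```

For example, 50 % inhibition at 10 μM (1e-5 M) corresponds to an estimated pIC50 of 5, since the two logarithmic terms cancel.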
Inactive data from PubChem were extracted through the PubChem API using the UniProtKB
Accession ID.339 Again, bioassay descriptions were used to filter out compounds with unresolved
bromodomain activity. Compound IDs were converted to SMILES using the PubChem
identifier exchange service340.
Data from ChEpiMod was provided by the curators. Data from the large functional screen for
BAZ2B (PubChem_AID:504391) was filtered out, as it was suggested in the assay description
that this screen should be used with caution due to the likelihood of screening artefacts. Any PDB
compounds without numerical data points were also removed, as these might be fragment
molecules with low activity.
GOSTAR data has been incorporated into the internal AstraZeneca database service and data
for bromodomain targets were extracted using SQL Developer (version 2.11.2), queried using
EntrezGene IDs (EGIDs).
For the combined public dataset, active (A) and not active (N) classes were assigned by the
following criteria: A = pIC50, pKd, pKi ≥ 5 or ΔTm ≥ 0.9 (at 10 μM). All other data points were
assigned as N and other endpoints were removed. ΔTm ≥ 0.9 was chosen to maintain
consistency between public data and AstraZeneca data, as this was used as a cut-off in the
internal AstraZeneca Differential Scanning Fluorimetry (DSF) assays, where it reflected 3 ×
the standard deviation of the DMSO controls. The cut-off of 10 μM (a pIC50, pKd or pKi of 5) was chosen to enable the
application of the model to virtual screening, where this is a frequently used activity threshold.
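The class assignment criteria above can be written as a small rule function. This Python sketch is illustrative only: the endpoint labels (including "dTm_10uM" for a thermal shift measured at 10 μM) and the function name are hypothetical, and the handling of qualified values (e.g. "<" or ">") is not covered.

```python
P_SCALE_ENDPOINTS = {"pIC50", "pKd", "pKi"}  # hypothetical endpoint labels


def assign_activity_class(endpoint, value):
    """Return 'A' (active), 'N' (not active) or None (endpoint removed)."""
    if endpoint in P_SCALE_ENDPOINTS:
        # A value of 5 on the p-scale corresponds to 10 uM.
        return "A" if value >= 5 else "N"
    if endpoint == "dTm_10uM":
        # Thermal shift measured at 10 uM; cut-off of 0.9 as stated in the text.
        return "A" if value >= 0.9 else "N"
    return None  # other endpoints are removed from the dataset
```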
AstraZeneca Dataset
Proprietary AstraZeneca data was extracted for bromodomains by querying the internal IBIS
SAR database. Inhibition data was extracted at four commonly used concentrations: 1, 3, 10 and
25 μM, as well as concentration response data and data from DSF assays. Where multiple data
were present for the same assay, these were aggregated as an average. The AstraZeneca dataset
was classified into active (A) or not active (N) classes. pKd and pIC50 data were classified using
the same thresholds as above, and compounds with thermal shift values of ΔTm ≥ 0.9 at a
concentration of 10 μM were classified as active, while the remainder were assigned to the not
active class. The binding data from DiscoveRx assays had pre-assigned activity flags for each
concentration point (based on their percentage of control values), which were used for
classification. Records with flags other than active or not active were removed. Since some
records had values at multiple concentrations, the records were then placed into classes in the
following order:
active at 1 μM: A, active at 3 μM: A, active at 10 μM: A, active at 25 μM: N, not active at 25 μM:
N, not active at 10 μM: N, not active at 3 μM: N, not active at 1 μM: N.
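One way to read the ordered list above is as a first-match rule set: the rules are checked in the order given and the first rule that matches any flag on the record determines its class. The Python sketch below implements that reading; the data structure (a mapping from concentration in μM to a flag string) and the function name are assumptions made for illustration.

```python
# Rules in the order stated in the text: (concentration in uM, flag, class).
ORDERED_RULES = [
    (1, "active", "A"),
    (3, "active", "A"),
    (10, "active", "A"),
    (25, "active", "N"),
    (25, "not active", "N"),
    (10, "not active", "N"),
    (3, "not active", "N"),
    (1, "not active", "N"),
]


def classify_record(flags):
    """Classify a record given {concentration_uM: 'active' | 'not active'}.

    The first matching rule wins, so activity at 1, 3 or 10 uM takes
    precedence over any result at 25 uM. Returns None if no rule matches.
    """
    for concentration, flag, label in ORDERED_RULES:
        if flags.get(concentration) == flag:
            return label
    return None
```

Under this reading, a compound flagged active only at 25 μM is assigned to the not active class, which is consistent with the 10 μM activity threshold used for the other endpoints.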
Combined Dataset
After classification of compounds into active and not active classes, duplicate entries were
removed from the dataset by comparison of structure (as calculated by the StandardiseMolecules
function in the camb306 package in R, which uses Indigo’s C API341), domain and activity assignment.
Additionally, for entries that were duplicates of structure and domain but had been classified
into opposite activity classes, both data points were removed due to inconsistency.
Bromodomains with fewer than 25 data points were removed from subsequent analysis.
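The de-duplication and filtering steps described above can be sketched as follows. This is a minimal Python illustration, not the code used in this work: it assumes each record is a (standardised structure key, domain, class) tuple, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def deduplicate_and_filter(records, min_points=25):
    """Collapse duplicates, drop conflicting pairs and sparse domains.

    records: iterable of (structure_key, domain, label) tuples, where
    structure_key is a standardised structure representation.
    """
    labels_per_pair = defaultdict(set)
    for structure_key, domain, label in records:
        labels_per_pair[(structure_key, domain)].add(label)

    # Keep one entry per compound-domain pair; pairs with conflicting
    # activity classes are removed entirely.
    kept = [
        (structure_key, domain, next(iter(labels)))
        for (structure_key, domain), labels in labels_per_pair.items()
        if len(labels) == 1
    ]

    # Remove domains with fewer than min_points remaining data points.
    counts = Counter(domain for _, domain, _ in kept)
    return [record for record in kept if counts[record[1]] >= min_points]
```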
The final dataset contained 15,350 data points; 6,352 compounds across 31 bromodomains. To
obtain a suitable and information-rich dataset, data from multiple endpoints were incorporated.
Although it is appreciated that there are limitations around incorporating different
experimental readouts,342 it was necessary to include as many data as possible to provide
sufficient small molecule bioactivity values across the bromodomains in the dataset. Much of
the Kd data incorporated originated from the BROMOscan®343 competition based assay panel
screens. Kd and % inhibition data originated from AstraZeneca data and public data. Ki and IC50
values were obtained from public datasets. ΔTm values (at 10 μM) were used from panel
selectivity screening studies and provide a large amount of inactive binding data. We explored
the comparability of data points which were collated from different sources for the same
compound-target pair between public data (IC50, Kd and Ki values, as well as the IC50 values
estimated from % inhibition values) and AstraZeneca concentration response (CR) assays,
which measured Kd values (Figure 2-3). For compounds with multiple data points, we observed
a correlation of R=0.96. Due to the heterogeneity of endpoints, a classification (instead of a
regression) model was generated to produce a global model of selectivity. Classification has
been implemented in place of regression previously for similar tasks where there is data
variability, including the DREAM challenge to model cytotoxicity.344 Classification and
regression PCM studies have been performed previously using multiple bioactivity
endpoints.345,346,312
Figure 2-3: Correlation between values of pIC50, pKd, pKi and % inhibition converted to pIC50 for published assay data with AstraZeneca concentration response assay data (pKd values) for compound-target pairs which exist in both the public and proprietary data sets. The plot shows that for these data points there is good correlation between bioactivity measurements.
The final distribution of the dataset, split into public and AstraZeneca data, per target and
activity label can be found in Table 8-1 and is depicted in Figure 2-4. 53.2 % of data points
correspond to type 2 bromodomains. In contrast, PB1 BD5, PCAF and SMARCA2 contain a low
proportion of data points (0.2 %, 0.6 %, and 0.2 % respectively). The BET family (type 2)
bromodomains contain an enriched proportion of active compounds (47.3 %) compared to the
whole dataset (34.3 %). Other domains which have a high proportion of actives include: BRD9
(27.8 %), CREBBP (37.3 %), EP300 (27.2 %) and TAF1 BD2 (55.2 %). PCAF, PB1 BD5 and
SMARCA2 bromodomains contain a high proportion of actives (60.0 %, 61.5 % and 100 %)
within a low number of data points, suggesting that these domains are rarely screened in
bromodomain panel assays. ATAD2B, BRWD1 BD2 and BRPF3 contain low proportions of actives
(0.0 %, 0.5 %, and 0.02 %), and data for these domains originates primarily from screening
panels. These domains were included in the model dataset since compounds overlapped with
other domains, and thus provided information on selectivity.
[Figure 2-3 scatter plot: x-axis, pKd AstraZeneca Data; y-axis, Log Public Data Activity.]
Figure 2-4: Number of data points employed in this study per bromodomain from combined public and proprietary data sources. It can be seen that there is a bias towards a high number of BRD4 data points, which was addressed when modelling the data by down-sampling. For numbers of data points for each category the reader is referred to Table 8-1.
[Figure 2-4 bar chart: x-axis, Bromodomain (ATAD2, ATAD2B, BAZ2A, BAZ2B, BPTF, BRD1, BRD2 BD1, BRD2 BD2, BRD3 BD1, BRD3 BD2, BRD4 BD1, BRD4 BD2, BRD7, BRD9, BRDT BD1, BRDT BD2, BRPF1, BRPF3, BRWD1 BD2, CECR2, CREBBP, EP300, KAT2A, PB1 BD5, PCAF, SMARCA2, SMARCA4, TAF1 BD2, TAF1L BD2, TIF1A, TRIM33); y-axis, Number of Compound-Target pairs (0 to 6000); bars coloured by class (Not Active, Active).]
2.2.2 Analysis of Chemical and Biological Space
Bromodomains were clustered based on sequence similarity using the R packages SeqinR347 and
APE348, and a phylogenetic tree was plotted using the plot.phylo function with the argument type="fan".
DataWarrior349 Similarity Analysis using Skelspheres descriptors was used to generate a
chemical space visualization of the dataset. The technique uses a rubberbanding forcefield
approach to place molecules in 2D space according to their similarity, with a cut-off value of 0.9
for the similarity.
Bemis-Murcko scaffolds350 for the whole dataset and the public portion of the dataset were
generated in KNIME351 using the RDKit Find Murcko Scaffolds node. The scaffold visualisations
including the network graph, which used the force-directed layout, were generated in TIBCO
Spotfire352. For clarity of presentation, the network graph was generated only for active data
points from the public dataset and for scaffolds where more than 4 compound members had active
data points for a bromodomain.
2.2.3 Compound and Target Descriptors
Compound Descriptors
Compounds were standardised from SMILES as described above using the Indigo API341,
removing inorganic molecules and salts. Different compound descriptors were benchmarked
against one another in PCM models. Compound 1D and 2D physicochemical descriptors were
calculated using the camb306 function GeneratePadelDescriptors, which uses the
PaDEL-Descriptor Java library,239 and 512-bit hashed binary Morgan fingerprints of radius 2 were
generated using the Python RDKit module.353 3D structures were generated using Corina-3.6,354
providing a single low-energy conformer for each molecule, from which 3D Vsurf descriptors
were generated in MOE355. 3D Vsurf descriptors are internal co-ordinate based descriptors
calculated from the 3D conformation of molecules and are similar to the Volsurf descriptors
used in previous studies.250–253 Since 3D Vsurf descriptors encode the distribution of molecular
size, shape, hydrophilicity and hydrophobicity across the molecule,250 they were
benchmarked for their use in PCM against the physicochemical 1D and 2D descriptors generated
by the PaDEL library.
Target Descriptors
Sequences were aligned by importing one crystal structure per bromodomain into MOE355
(Table 8-2), using the in-built “protein” function with the automatic alignment option,
and then manually adjusting the alignment using the sequence editor to minimise alignment
gaps. Binding site residues were chosen by importing all publicly known liganded bromodomain
crystal structures in the MOE355 protein family database, filtering to those that contained
ligands with a molecular weight of 100-600 Da (in total 352 crystal structures) and applying a
threshold of 4.5 Å from any ligand for a residue position to be included. The final alignment can
be found in Table 8-1. Z-Scales 5265 and other alignment dependent target descriptors were
calculated using the camb306 AADescs function. These descriptors are derived from
dimensionality reduction methods (e.g. principal component analysis) of the numerical
property values for amino acid residues in the binding site of each protein.267
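The 4.5 Å binding site criterion described above was applied in MOE; the Python sketch below only illustrates the underlying distance rule. It assumes that atomic coordinates are already available per aligned residue position, that a position is selected if any of its atoms lies within the cut-off of any ligand atom, and that positions are pooled across all structures. The function names and data layout are hypothetical.

```python
import math


def positions_near_ligand(residue_atoms, ligand_atoms, cutoff=4.5):
    """Return alignment positions with any atom within `cutoff` of the ligand.

    residue_atoms: {alignment_position: [(x, y, z), ...]} for one structure
    ligand_atoms: [(x, y, z), ...] for the bound ligand
    """
    selected = set()
    for position, atoms in residue_atoms.items():
        if any(
            math.dist(atom, ligand_atom) <= cutoff
            for atom in atoms
            for ligand_atom in ligand_atoms
        ):
            selected.add(position)
    return selected


def binding_site_positions(structures, cutoff=4.5):
    """Pool the selected positions over many (residue_atoms, ligand_atoms) pairs."""
    pooled = set()
    for residue_atoms, ligand_atoms in structures:
        pooled |= positions_near_ligand(residue_atoms, ligand_atoms, cutoff)
    return sorted(pooled)
```

Pooling over structures means that a position contacted by a ligand in any one crystal structure is described for every bromodomain, which keeps the target descriptor length the same across the aligned family.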
Peptide array data were obtained from the SPOT peptide array data for histone proteins against
bromodomains published in Filippakopoulos et al., 2012 (see Figure 4 of that work),38 on request
from the authors. The normalized raw intensity values (normalized to between 0 and
100) for each peptide-bromodomain interaction were used as a numerical descriptor for
bromodomain targets, providing a fingerprint of bromodomain binding specificity for
mono-lysine acetylated histone peptides. The final target descriptors comprised a matrix of 22
bromodomain numerical interactions with 136 acetylated-lysine histone peptides.
2.2.4 Algorithms and Generation of PCM models
PCM models were built using the camb and caret packages in R.306,356 Compound and target
descriptors were concatenated, and highly correlated and near-zero-variance descriptors were
removed using the functions RemoveHighlyCorrelatedFeatures (cut-off 0.95) and
RemoveNearZeroVarianceFeatures (cut-off 30/1). The number of data points for BRD4 BD1 was
randomly downsampled to 50 % of the original data points to reduce bias in the dataset and to
present a more even distribution across bromodomains. Variables were centred and scaled to
zero mean and unit variance using the PreProcess caret function and split into a 70/30
training-to-test ratio using the SplitSet function in camb, with stratified sampling according
to bioactivity labels.
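A stratified 70/30 split of the kind described above can be sketched in plain Python. This is an illustration only and does not reproduce the SplitSet function from camb; the function name, the seed and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) so that each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within each class
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)

# Toy bioactivity labels: "A" = active, "N" = not active
labels = ["A"] * 30 + ["N"] * 70
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the proportion of actives the same in both partitions, which matters for an imbalanced dataset such as this one.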
Models were created using the Random Forest (RF), Support Vector Machine (SVM) and
Generalized Linear Model (GLM) algorithms, via the rf, svmRadial and glm methods
respectively of the train function in caret. Training control was set up with the function
GetCVTrainControl, using 5-fold cross validation (CV) with the arguments classProbs=TRUE and
summaryFunction=twoClassSummary so that class probabilities were calculated and summary
statistics provided. Recursive feature elimination, using the caret rfe function, removed
redundant variables as determined by 5-fold CV assessed by ROC AUC. The number of input
variables randomly selected at each node in the random forest trees (mtry) was optimised in CV
using a random grid search of 15 values for RF, and a grid search was performed to optimise the
hyperparameter values of σ (0-0.9) and C (0-5) for SVM.
2.2.5 Model Validation
PCM models were validated through 5-fold cross validation (CV) and, after parameter
optimisation, by predicting the unseen test set (30% of the records, split using stratified
random sampling according to bioactivity) using the predict function from caret. ROC AUC,
Matthews Correlation Coefficient (MCC), Area under the precision-recall curve (PRAUC),
sensitivity and specificity were used to assess performance on the test set.
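For reference, the threshold-based metrics and the ROC AUC can be written out in a few lines of Python. This is a minimal sketch with toy labels and scores, not the caret implementation used in this work; the ROC AUC is computed through its rank interpretation (the probability that a random active is scored above a random inactive).

```python
import math

def confusion_counts(y_true, y_pred, positive="A"):
    """Return (tp, tn, fp, fn) for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_auc(y_true, scores, positive="A"):
    """Fraction of active/inactive pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: predicted probability of the active class and a 0.5 cut-off
y_true = ["A", "A", "A", "N", "N", "N", "N"]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = ["A" if s >= 0.5 else "N" for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

ROC AUC is threshold independent, whereas sensitivity, specificity and MCC all depend on the probability cut-off chosen to assign class labels.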
Leave-one-scaffold-out (LOSO) validation was conducted by training a model based on data
points for all scaffolds except one hold-out scaffold and predicting for the data points containing
compounds with the new scaffold. To obtain scaffolds, the carbon framework Bemis-Murcko
scaffolds350 of all compounds were calculated using the RDKit Find Murcko Scaffolds node in
KNIME.351 This yielded 3,553 framework scaffolds, of which a random sample of 50 scaffolds with
more than 10 data points was selected as hold-out scaffolds.
Leave-one-target-out (LOTO) validation was conducted by sequentially removing all
compound-target pairs associated with one bromodomain target, training a model based on the
remaining data and predicting for the hold-out target data points.
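Both LOSO and LOTO are leave-one-group-out schemes that differ only in the grouping variable (scaffold or target). The following Python sketch shows the index bookkeeping under that reading; the function names and the toy data are hypothetical, and model training is deliberately left out.

```python
import random
from collections import Counter

def leave_one_group_out(groups, holdouts):
    """Yield (held_out_group, train_idx, test_idx) for each hold-out group."""
    for g in holdouts:
        train_idx = [i for i, x in enumerate(groups) if x != g]
        test_idx = [i for i, x in enumerate(groups) if x == g]
        yield g, train_idx, test_idx

def sample_holdout_scaffolds(scaffolds, min_points=10, n_sample=50, seed=0):
    """Randomly sample hold-out scaffolds that have more than `min_points` data points."""
    counts = Counter(scaffolds)
    eligible = sorted(s for s, c in counts.items() if c > min_points)
    rng = random.Random(seed)
    return rng.sample(eligible, min(n_sample, len(eligible)))

# Toy data: one scaffold label and one target label per compound-target pair
scaffolds = ["s1"] * 12 + ["s2"] * 3 + ["s3"] * 11
targets = (["BRD4 BD1"] * 6 + ["BRD9"] * 7) * 2

holdout_scaffolds = sample_holdout_scaffolds(scaffolds)
loso_splits = list(leave_one_group_out(scaffolds, holdout_scaffolds))
loto_splits = list(leave_one_group_out(targets, sorted(set(targets))))
```

In each split a model would be trained on `train_idx` and evaluated on `test_idx`, so that no data point sharing the held-out scaffold (or target) is seen during training.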
2.2.6 Benchmarking to QSAR, Quantitative Sequence Activity Model (QSAM) and
Baseline Models
Models using only chemical descriptors (global QSAR) and only target descriptors (global
QSAM) were generated on the same dataset using the caret and camb R packages356, using the
same method described above on the same test set. These models test the hypotheses that
compound activities are correlated across targets (global QSAR) and that compound activities
are only target dependent (global QSAM). Their performances as measured by ROC AUC were
compared to the PCM model by implementing a pairwise two-sided Student’s t-test, utilising
the compare_models function in caret, with Bonferroni correction,356 to determine if there was
a significant difference.
Individual QSAR RF models were generated for each bromodomain using the same method as
for the PCM models.
To assess the utility of the extra target information encoded in the binding site alignment-
dependent target descriptors, a comparator RF model was trained on Morgan fingerprints with
binary target identity descriptors. The binding site alignment-dependent descriptors, such as Z-
Scales, provide a similarity measure between domains based on their amino acid binding site
residues and properties from which the model is expected to make interpolations between
similar domains. In contrast, the basic binary label identifiers for the different bromodomains
do not encode a relationship between the targets themselves.
To assess baseline performance, a generalised linear model was also generated on the whole
dataset applying the glm function implemented in caret, using compound and target IDs as
binary descriptors.
Y-scrambling was performed by randomly reorganising the labels associated with each
compound-target pair and retraining the models using the same method as for the final
model.357 Class labels were randomly scrambled by both 50% and 100% over 50 iterations. The
mean ROC AUC over all iterations was calculated.
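Partial Y-scrambling can be sketched as follows. The sketch assumes that "scrambling by 50%" means permuting the labels of a randomly chosen half of the records among themselves, which is one plausible reading rather than a confirmed detail of the procedure; the function name and toy labels are hypothetical, and the retraining step is omitted.

```python
import random

def scramble_labels(labels, fraction, seed=None):
    """Permute the labels of a random `fraction` of the records among themselves."""
    rng = random.Random(seed)
    n = round(fraction * len(labels))
    chosen = rng.sample(range(len(labels)), n)   # records whose labels are reshuffled
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    out = list(labels)
    for i, label in zip(chosen, shuffled):
        out[i] = label
    return out

# Toy labels and 50 scrambled copies at each scrambling level
labels = ["A"] * 20 + ["N"] * 80
half_scrambled = [scramble_labels(labels, 0.5, seed=i) for i in range(50)]
fully_scrambled = [scramble_labels(labels, 1.0, seed=i) for i in range(50)]
```

Because the labels are permuted rather than redrawn, the class balance is unchanged; a model retrained on fully scrambled labels is therefore expected to fall to a ROC AUC near 0.5 if the original model captured a real signal.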
2.2.7 Public Dataset Model
An RF model was constructed for the public dataset using the same methods described above.
This model was tested on a 30 % external test set as well as the proprietary AstraZeneca dataset.
2.2.8 Applicability Domain
Since applicability domain determination is difficult to rationalise in multi-dimensional
space,225 the Mondrian cross-conformal prediction (CCP) framework was used to define which
new compound-target pairs can be predicted at a certain confidence level (Cl).221 Applying
Mondrian CCP results in data points in the test set being classified into one of four classes: 1.
active, 2. not active, 3. “neither” active nor not active, or 4. “both” active and not active, to
achieve the specific error rate provided at different Cls. For our test set we assume that samples
are exchangeable with the training set, i.e. that it fulfils the exchangeable distribution
assumption required for a guaranteed error rate. For the conformal method to be suitable, the
validity metric, defined as the fraction of samples in the test set that are assigned a label
containing their experimental class (including the “both” class assignment), must be at least
the Cl; i.e. the error rate should not exceed 1-Cl.
The applicability domain was assessed by calculating conformal predictions implementing
Mondrian cross-conformal prediction227 in R, which provides a wrapper around the caret
package. We use the RF probability values (fraction of trees predicting for each class) as the
non-conformity measure. The validity and efficiency metrics on the test set were used to assess
whether the conformal prediction framework was valid for the performance on the external test
set.
When interpreting conformal predictions, the probability value (p_value) assigned to the new
sample prediction can be used as an indication of the degree of confidence that a sample is
contained within the predicted class.
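The class-conditional (Mondrian) p-value calculation can be illustrated with a short Python sketch. It is a simplification under stated assumptions: a single calibration set is used instead of the cross-conformal aggregation over folds, the class probability itself serves as the conformity score (equivalent to using 1 minus the probability as the non-conformity score), and all names and numbers are hypothetical.

```python
def mondrian_p_values(calibration, new_probs):
    """Class-conditional conformal p-values for one new record.

    calibration: list of (true_label, {label: probability}) for calibration records
    new_probs:   {label: probability} predicted for the new record
    """
    p_values = {}
    for label, prob in new_probs.items():
        # Only calibration records whose true class is `label` are compared
        same_class = [probs[label] for true, probs in calibration if true == label]
        n_not_more_conforming = sum(p <= prob for p in same_class)
        p_values[label] = (n_not_more_conforming + 1) / (len(same_class) + 1)
    return p_values

def prediction_set(p_values, confidence):
    """Labels whose p-value exceeds the significance level (1 - confidence)."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}

def describe(labels):
    """Map a prediction set onto the four conformal classes."""
    if labels == {"A", "N"}:
        return "both"
    if not labels:
        return "neither"
    return "active" if labels == {"A"} else "not active"

# Toy calibration set: (experimental class, RF class probabilities)
calibration = [
    ("A", {"A": 0.90, "N": 0.10}), ("A", {"A": 0.80, "N": 0.20}),
    ("A", {"A": 0.70, "N": 0.30}), ("A", {"A": 0.60, "N": 0.40}),
    ("N", {"A": 0.05, "N": 0.95}), ("N", {"A": 0.15, "N": 0.85}),
    ("N", {"A": 0.25, "N": 0.75}), ("N", {"A": 0.35, "N": 0.65}),
]
p = mondrian_p_values(calibration, {"A": 0.75, "N": 0.25})
at_07 = describe(prediction_set(p, 0.7))
at_09 = describe(prediction_set(p, 0.9))
```

In this toy case the new record receives a single label at the 0.7 confidence level, whereas at 0.9 both labels are retained: raising the confidence level lowers the error rate at the cost of less informative predictions.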
2.2.9 Experimental Validation
Predictions and Filtering
Compounds in the liquid sample library at AstraZeneca were screened against the model to
predict activity against four bromodomains, namely BRD1, BRD4 BD1, BRD9 and BRPF1b. A
total of 2,164,399 compounds were pre-processed in the same way as the training set molecules
(see above) and combined with Z-Scales 5 descriptors for each target. The model was used to
predict activity for these molecules using conformal prediction, applying the confidence levels
of 0.7, 0.8 and 0.9 and the corresponding significance values of 0.3, 0.2 and 0.1 to compare new
compound-target pair p_values to the model training set values to shortlist compounds for
experimental testing. We chose this range because we wanted to find hits with structural
diversity to the training set and any lower confidence than 0.7 would place too many new
samples into the “neither” class, especially as the screening library compound set will be more
diverse than the model test set. We also selected a range of significance values to observe the
relationship between p_values and prediction accuracy.
Firstly, the predictions with a p_value of > 0.1 for the active class (corresponding to the 0.9
confidence level) were selected. These actives were filtered to exclude those which were
unavailable for testing (availability as a solid < 3 mg, as a solution <0.05 mL). Compounds were
removed if they had a logBSF (frequent hitter) score358 < -3 or if they had already been tested
against the domain of interest according to the internal AstraZeneca IBIS database. We also
checked compounds for chemical substructure alerts using the AZfilters method, which
searches for fingerprints matching to common reactive or undesirable compound features.359
Those that were classified as “ugly” were removed from the set. A molecular weight filter was
then applied to restrict compounds to a molecular weight range between 200 and 700 Da.
Those compound-target pairs that were in the training or test set were removed. In total, the
above process resulted in 8,581 predicted active compound-target pairs.
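The filtering cascade can be summarised as a single predicate, sketched below in Python. The dictionary keys are hypothetical field names, and two readings are assumptions rather than statements of the actual workflow: a compound is treated as available if either the solid or the solution stock meets its minimum, and the logBSF criterion is applied exactly as worded above.

```python
def passes_filters(c):
    """Return True if a predicted active survives all triage filters."""
    available = c["solid_mg"] >= 3 or c["solution_ml"] >= 0.05   # assumed reading
    return (
        c["p_active"] > 0.1                 # active at the 0.9 confidence level
        and available
        and c["log_bsf"] >= -3              # removed if logBSF < -3, as worded above
        and not c["already_tested"]         # no existing result for this domain
        and c["alert_class"] != "ugly"      # substructure alert classification
        and 200 <= c["mol_weight"] <= 700   # molecular weight window in Da
        and not c["in_training_or_test"]
    )

# Toy candidates (hypothetical values)
base = {"p_active": 0.25, "solid_mg": 5.0, "solution_ml": 0.0, "log_bsf": -1.0,
        "already_tested": False, "alert_class": "good", "mol_weight": 420.0,
        "in_training_or_test": False}
candidates = [
    dict(base, id="c1"),
    dict(base, id="c2", mol_weight=750.0),   # outside the molecular weight window
    dict(base, id="c3", log_bsf=-3.5),       # fails the logBSF criterion
    dict(base, id="c4", p_active=0.05),      # not predicted active at 0.9 confidence
]
shortlisted = [c["id"] for c in candidates if passes_filters(c)]
```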
Calculating Sampling Parameters
These predicted actives were matched with any compound-target pairs containing the same
internal AstraZeneca compound ID for which an inactive prediction with a conformal prediction
p_value > 0.1 (inactive at 0.9 confidence level) was generated. These inactives were combined
with the active set to provide bioactivity profiles for the same compound against multiple
domains and compounds were annotated with their selectivity profile. Compound structures
that were in the public domain were identified to select a significant number of these structures.
In the next step, sampling parameters for a subsequent diversity selection of compounds were
calculated, namely the compound similarity to the training set and the cluster membership of
the selected compounds.
To provide measures of diversity for future selection processes and analysis, the compound-
target pairs were assessed for their diversity compared to the training set. Firstly, the similarity
of compound structures to the nearest neighbour compound structure in the training set (across
any domain) was calculated using the Tanimoto similarity360 index, calculated from 512-bit
Morgan Fingerprints in KNIME351. The experimentally determined selectivity profile of the most
similar compound in the training set was extracted. Additionally, the nearest compound-target
pair neighbour in the training set was identified by selecting the training set instance with the
minimum Euclidean distance from the new prediction instance, calculated in KNIME351 from all
compound and target descriptors used in the model.
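The nearest-neighbour similarity calculation can be illustrated in Python with fingerprints represented as sets of on-bit indices. This is a sketch only; the calculation in this work was carried out in KNIME, and the identifiers and bit sets below are toy values.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbour(query, training):
    """Return (training_id, similarity) for the most similar training fingerprint."""
    return max(
        ((cid, tanimoto(query, fp)) for cid, fp in training.items()),
        key=lambda item: item[1],
    )

# Toy fingerprints (on bits of a hypothetical 512-bit fingerprint)
training = {
    "T1": {1, 2, 3, 4},
    "T2": {3, 4, 5, 6, 7, 8},
}
query = {1, 2, 3, 9}
nn_id, nn_similarity = nearest_neighbour(query, training)
```

Here the query shares three of five distinct on bits with T1 (similarity 0.6) and one of nine with T2, so T1 is returned as the nearest neighbour.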
Clustering was conducted to provide a means of selecting a diverse set of compounds for testing.
To cluster a larger set of molecules, compounds were assigned their Bemis-Murcko carbon
framework scaffolds,350 from which 512-bit Morgan fingerprints were generated and used to
hierarchically cluster the scaffolds based on their distance matrix measured as the Tanimoto
distance. A distance threshold of 0.375 was used to merge scaffolds into clusters in KNIME,351
derived from the largest non-outlier distance value, which resulted in 585 clusters.
Compound Selection for the Prospective Validation Study
Compounds were sorted into three overlapping sets: 1. Interpolation Compounds, which were
present in the training set but lacked experimental data for one or more of the four
bromodomains (BRD1, BRD4 BD1, BRD9 and BRPF1b); 2. Selectivity Profiles for novel
compounds, defined as multiple confident active or inactive predictions (at the 0.7 to 0.9
confidence levels) for a given compound; and 3. Singular Active predictions, comprising
compounds that were not in the training set and were predicted to be active (at the 0.7 to 0.9
confidence levels) for one of the four bromodomains in the validation study.
All compounds available from the Interpolation Compound set were tested for their
bromodomain activity. This included 9 compounds that had not previously been tested on
BRD1, 18 for BRD9 and 8 for BRPF1.
For the selection of Selectivity Profile compounds, grouped stratified weighted sampling (using
sample_n in the dplyr R package361) was conducted to select compounds across groups of public
domain compounds, profile annotations, compound Tanimoto similarity to the nearest
compound neighbour in the training set (binned into 10 bins between 0 and 1), and cluster
numbers, weighted overall to an even distribution across public domain structure, profile
annotations and Tanimoto similarity (limited by overall availability of compounds in each
category). In total 721 compound-target pairs were tested as part of the Selectivity Profiles set.
For the Singular Actives set, grouped stratified weighted sampling was conducted to select active
compounds separately for each domain across groups of public availability, binned p_values
from conformal prediction (p_value >0.3, 0.2< p_value <0.3, 0.1< p_value <0.2), compound
Tanimoto similarity to the nearest compound neighbour in the training set (binned into 10 bins
between 0 and 1) and cluster numbers, weighted overall to an even distribution across public
availability, the 0.7, 0.8 and 0.9 confidence levels and Tanimoto similarity to the nearest
compound neighbour in the training set. In total 388 compound-target pairs were tested as part
of the Singular Actives set.
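The grouping logic behind the stratified selection can be sketched in Python. This is a simplified illustration and not the dplyr workflow used here: it draws an equal number of records from every group instead of applying weights, cluster membership is omitted, and the field names and toy records are hypothetical.

```python
import random
from collections import defaultdict

def p_value_bin(p):
    """Bin an active-class p_value into the three ranges used for selection."""
    if p > 0.3:
        return "p > 0.3"
    if p > 0.2:
        return "0.2 < p <= 0.3"
    if p > 0.1:
        return "0.1 < p <= 0.2"
    return None  # not predicted active at the 0.9 confidence level

def similarity_bin(similarity, n_bins=10):
    """Bin a Tanimoto similarity in [0, 1] into `n_bins` equal-width bins."""
    return min(int(similarity * n_bins), n_bins - 1)

def sample_per_group(records, per_group, seed=0):
    """Draw up to `per_group` records from each (public, p_value bin, similarity bin) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        p_bin = p_value_bin(r["p_active"])
        if p_bin is not None:
            groups[(r["public"], p_bin, similarity_bin(r["similarity"]))].append(r)
    selected = []
    for key in sorted(groups):
        members = groups[key]
        selected.extend(rng.sample(members, min(per_group, len(members))))
    return selected

# Toy prediction records (hypothetical values)
records = [
    {"id": i, "public": i % 2 == 0,
     "p_active": 0.05 + 0.05 * (i % 8),
     "similarity": (i % 10) / 10 + 0.05}
    for i in range(200)
]
selected = sample_per_group(records, per_group=2)
```

Sampling within groups prevents the selection from being dominated by the most populated region, for example compounds that are highly similar to the training set.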
2.2.10 Experimental Testing
Overall 1,139 compound-target pair data points were selected for prospective experimental
validation in Differential Scanning Fluorimetry (DSF) assays362 at 10 μM concentration.
The experimental assays were run by Helen Boyd and Pia Hansson in AstraZeneca. The DSF
assays were performed in 4titude FrameStar 384 well plates (4titude, Surrey, UK) in assay buffer
with 10 mM HEPES and 500 mM NaCl. A 10 µL reaction was conducted by addition of 10 µL of
protein to assay ready plates with 10 nL of compound solution or DMSO followed by addition
of 20 nL SYPRO® Orange (SigmaAldrich, St. Louis, MO). The plate was sealed with qPCR Seal
(4titude, Surrey, UK), shaken briefly and spun for 1 min at 800 RPM in a benchtop centrifuge.
Final concentrations were 0.1 mg/mL of either BRD4 BD1, BRD9, BRD1 or BRPF1b and 1/500
dilution of SYPRO® Orange. The positive controls used in the assays were JQ-1 for BRD4 BD1,43
iBRD9 for BRD957 and NI-57 for BRD1 and BRPF1b.363 Control compounds had a final assay
concentration of 10 µM and the DMSO concentration was 0.1%. The LightCycler® (ROCHE,
Basel, Switzerland) was programmed to increase temperature from 20 to 85°C at a rate of
0.6°C/s with 10 acquisitions per degree Celsius. Tm was calculated using midpoint (geometric)
and first derivative methods. The midpoint method defines the arithmetic midpoint between
upper and lower plateaus as Tm whereas the First Derivative method calculates the first
derivative of the melting curve and defines the maximum value as Tm. Melting temperatures
from both methods should agree with each other and differences can be used to identify
irregular curves. For a compound to be considered as an active hit the ΔTm (shift in Tm between
compound and DMSO control) should be greater than 3 times the standard deviation (SD) of
the DMSO control or 0.9°C (the latter agrees with activity assignments in the dataset used for
training). The control data can be found in Table 8-5. All data handling was performed using
Screener 13 (Genedata AG, Basel, Switzerland).
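The two Tm definitions can be illustrated on a synthetic melting curve with a short Python sketch. The sigmoid curve, the function names and the hit-calling rule are assumptions for illustration: real DSF traces are noisier and typically decay after the transition, and reading the activity criterion as "ΔTm must exceed the larger of 3 × SD and 0.9 °C" is one possible interpretation of the rule stated above.

```python
import math

def tm_first_derivative(temps, signal):
    """Temperature at which the finite-difference slope dF/dT is largest."""
    best_slope, best_t = -math.inf, None
    for i in range(1, len(temps)):
        slope = (signal[i] - signal[i - 1]) / (temps[i] - temps[i - 1])
        if slope > best_slope:
            best_slope, best_t = slope, 0.5 * (temps[i] + temps[i - 1])
    return best_t

def tm_midpoint(temps, signal):
    """Temperature at which the signal first crosses halfway between its plateaus."""
    half = 0.5 * (min(signal) + max(signal))
    for i in range(1, len(temps)):
        f0, f1 = signal[i - 1], signal[i]
        if f0 <= half <= f1 and f1 > f0:
            # Linear interpolation between the two bracketing acquisitions
            return temps[i - 1] + (half - f0) * (temps[i] - temps[i - 1]) / (f1 - f0)
    return None

def is_hit(delta_tm, dmso_sd, floor=0.9):
    """Assumed rule: delta Tm must exceed both 3 x SD of the DMSO control and 0.9 degrees."""
    return delta_tm > max(3 * dmso_sd, floor)

def sigmoid_curve(temps, tm, width=1.5):
    """Idealised two-state melting curve (synthetic data for illustration only)."""
    return [1.0 / (1.0 + math.exp(-(t - tm) / width)) for t in temps]

temps = [20 + 0.1 * i for i in range(651)]   # 20 to 85 degrees, 10 acquisitions per degree
dmso = sigmoid_curve(temps, tm=50.0)
compound = sigmoid_curve(temps, tm=52.0)
delta_tm = tm_first_derivative(temps, compound) - tm_first_derivative(temps, dmso)
```

On a clean curve both methods give nearly the same Tm, so a large disagreement between them is a practical flag for an irregular melting profile.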
2.3 Results and Discussion
2.3.1 Analysis of Chemical and Biological Space
The dataset used for proteochemometric modelling comprised 15,350 data points for 6,352
compounds across 31 bromodomains, corresponding to a matrix coverage of about 7.8 % (Figure
2-4).
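The matrix coverage follows directly from the three figures quoted above, as this short Python check shows (variable names are illustrative):

```python
data_points = 15_350
compounds = 6_352
bromodomains = 31

# Fraction of all possible compound-bromodomain pairs that have a measurement
possible_pairs = compounds * bromodomains
coverage = data_points / possible_pairs
```

The full matrix would contain 196,912 compound-target pairs, so more than 92 % of it is unmeasured, which is the sparsity that PCM modelling aims to fill by sharing information across targets.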
Figure 2-5 shows the bromodomains represented in this study, clustered by their binding site
sequence similarity. The dataset covers at least two bromodomains from seven out of eight
bromodomain subfamilies and 51% of all known bromodomains. The type 2 and 4
bromodomains are represented fully in the dataset, with many inhibitors of these targets having
been reported.36,45,34 Subfamilies of type 3, 5, 7 and 8 are less well-represented, which could be
for multiple reasons, including interest in biological function, ligandability or discovery date.34
Figure 2-5: Sequence similarity dendrogram based on sequence identity for the bromodomains with bioactivity data that was used in constructing models in this study. The plot is coloured by bromodomain subfamily.
The Bemis-Murcko scaffolds which describe four or more active compounds for each
bromodomain in the public portion of the dataset are visualised as a network map in Figure 2-6.
BRD4 BD1 and BRD4 BD2 form the centralised nodes in the public dataset network, comprising
many scaffolds for which chemotypes overlap with other bromodomains. Other bromodomains
are positioned around the periphery of the network. Scaffold 61, originating from the
publication by Crawford et al., 201674 is a common active scaffold for CECR2, CREBBP, BRD9,
BRD4 BD1 and BRD4 BD2 and presents one of the most connected scaffolds in the network. For
BRD1, TIF1A and BRPF1, scaffold 208 is the most frequently reported scaffold.93 BRD7 has only
one active scaffold (scaffold 29) with more than four member compounds for the public dataset,
which it shares with BRD9.58 BAZ2A and BAZ2B share their most common scaffold 21,55 and are
separated from the remainder of the network. Figure 2-6 also shows that overall there is
considerable scaffold diversity within active compounds for bromodomains, ranging from
fragment sized scaffolds such as scaffolds 21 and 29, to larger scaffolds including scaffolds 329
and 539. Large scaffold diversity for BRD4 bromodomain ligands can also be observed. The
chemical space diversity for each bromodomain for the combined dataset is further shown in
Figure 2-7.
2.3.2 Benchmarking Compound and Target Descriptors for PCM Models
Benchmarking Compound Descriptors
1D and 2D physicochemical descriptors from the PaDEL library, sub-structural Morgan
fingerprints and 3D Vsurf descriptors were benchmarked for use as chemical descriptors in PCM
models. We compared their performance against one another in cross-validation (CV) when
employed with a constant protein target descriptor (Z-Scales 5), the results of which are shown
in Table 2-1. The combination of Morgan fingerprints and Z-Scales 5 performs similarly to the
combination of PaDEL descriptors and Z-Scales 5 as measured by ROC AUC (0.964 vs 0.962),
showing that both are equally suitable as compound descriptors for this dataset. In a
previous study, using multiple descriptor types to describe the chemical space in PCM models
was shown to increase performance in some cases.364 In our case, when we combined
Morgan Fingerprints with PaDEL and Z-Scales 5 descriptors, the performance (ROC AUC 0.966)
is also similar to models with fewer descriptors, showing that the addition of nearly 200 further
ligand-side descriptors did not add significantly more information. This redundancy in
information has been found before in PCM modelling, including the findings from the study to
predict cell line growth inhibition, where the incorporation of multiple gene sets from gene
transcripts did not increase performance.365 Finally, we investigated the utility of 3D chemical
descriptors, namely Vsurf descriptors combined with Z-scales target descriptors, which
performed the worst of the compound descriptors explored (ROC AUC 0.945), suggesting that
the information encoded in the simpler 2D sub-structural fingerprints describing molecular
connectivity, or in physicochemical properties, is more suitable than 3D information for
describing chemical structure for the type of models generated here.
Figure 2-6: Network map for active compound-target annotations extracted from public databases. Nodes are formed of Bemis-Murcko scaffolds (covering those comprising four or more active compounds for a bromodomain) and bromodomain identities. Bemis-Murcko scaffold numbers are connected to a bromodomain by an edge when compounds with this scaffold are active for a given bromodomain. Bemis-Murcko scaffold structures for the scaffolds with the largest number of active compounds for each bromodomain are depicted.
Figure 2-7: Chemical similarity projected onto 2D space of all compounds for each bromodomain. Individual compounds are represented by dots and those close in chemical space are positioned neighbouring on the plot. A cut-off value of 0.9 for the similarity was used. Red dots represent the active (A) and blue represent the not active (N) activity assignment. The plot shows the similarity between bromodomains based on their bioactivity space. BRD4 BD1 has the largest number of data points and occupies a chemical space not overlapping with other domains.
Table 2-1: Shows the number of variables selected by the recursive feature elimination method and the model performance in cross-validation as measured by ROC AUC, sensitivity and specificity for RF models trained on different combinations of compound descriptors. The addition of ~ 200 new variables for the model with PaDEL, Morgan Fingerprints and Z-scales 5 did not significantly improve the model.
Model | Number of Variables selected | Value of mTry | ROC AUC | Sensitivity | Specificity
PaDEL, Morgan Fingerprints and Z-Scales 5 | 538 | 102 | 0.966 | 0.854 | 0.950
Morgan Fingerprints and Z-Scales 5 | 350 | 86 | 0.964 | 0.857 | 0.948
PaDEL and Z-Scales 5 | 218 | 56 | 0.962 | 0.830 | 0.948
Vsurf and Z-Scales 5 | 100 | 26 | 0.945 | 0.792 | 0.935
Benchmarking Target Descriptors
Alignment-dependent amino acid sequence descriptors were implemented in this study due to
their successful application in previous PCM studies.267 These target descriptors were
benchmarked, utilising Morgan fingerprints for the compound descriptors. The performance
measured by cross-validation (CV) for each target descriptor is shown in Table 2-2. As in
previous studies,267 we observe that target descriptors perform similarly with no significant
difference in performance (ROC AUC 0.962-0.964).
We further explored the use of alternative target descriptors generated from a SPOT peptide
array,38 to test if the specificity of peptides can infer information about the specificity of
compounds for bromodomains (see Methods section for details). The peptide model achieved
a high ROC AUC of 0.957 and MCC 0.801 (Table 2-3), only slightly outperformed by the final
traditional PCM models. This model shows that there is consistency between bromodomains
described by their binding affinities to peptides and their compound specificity.
Table 2-2: Description, number of components, ROC AUC, sensitivity and specificity values from cross-validation for the different alignment-dependent target descriptors used to build the initial proteochemometric models. Performance was very similar for all alignment-dependent descriptors.
Descriptors (combined with Morgan Fingerprints) | Description | Number of Components | Number of Variables selected | Value of mTry | ROC AUC | Sensitivity | Specificity
Z-scales 5 | Physicochemical PCA | 5 | 350 | 86 | 0.964 | 0.857 | 0.948
FASGAI | Physicochemical Factor Analysis | 6 | 350 | 89 | 0.962 | 0.844 | 0.950
ST-scales | Topological PCA | 5 | 506 | 96 | 0.963 | 0.849 | 0.947
VHSE | Physicochemical PCA | 8 | 509 | 37 | 0.964 | 0.861 | 0.948
ProtFP8 | Physicochemical PCA | 8 | 250 | 18 | 0.962 | 0.835 | 0.954
BLOSUM | Physicochemical and substitution matrix VARIMAX | 10 | 350 | 66 | 0.964 | 0.846 | 0.950
T-scales | Topological PCA | 8 | 350 | 66 | 0.963 | 0.841 | 0.947
Z-scales 3 | Physicochemical PCA | 3 | 350 | 66 | 0.963 | 0.841 | 0.947
MSWHIM | 3D electrostatic potential PCA | 3 | 350 | 26 | 0.962 | 0.848 | 0.948
Table 2-3: Comparison of the external test set performance of the PCM binary classification model with other models generated for the dataset. The peptide model uses histone peptide array data as a surrogate for target descriptors. Models utilising different algorithms (RF=Random Forest, SVM=Support Vector Machines, GLM= Generalised Linear Model) were compared, as well as those incorporating different feature spaces (only compound descriptors (QSAR) and only target descriptors (QSAM)). Binary target ID descriptors and a baseline linear model (GLM) based on using only binary compound IDs and target IDs were also benchmarked. A model built on the public dataset and tested on a 30 % random test set as well as the proprietary data was benchmarked. The performance measures were area under the receiver operating characteristic curve (ROC AUC), Matthews Correlation Coefficient (MCC), sensitivity (true positive predictions divided by all positive labels), specificity (true negative predictions divided by all negative labels) and the area under the precision recall curve (PRAUC). The table shows that the PCM model based on Morgan Fingerprints and Z-scales 5 outperformed other models in terms of ROC AUC and that the peptide model and the public data model performed well for their external test set validations. On average PCM outperformed individual target QSAR models.
Validation | ROC AUC | MCC | Sensitivity | Specificity | AUPR curve
RF PCM 5-fold Cross Validation | 0.964 +/- 0.0026 | - | 0.857 +/- 0.0141 | 0.948 +/- 0.0066 | -
RF PCM 30 % external test set | 0.968 | 0.826 | 0.859 | 0.955 | 0.879
Peptide descriptor model with Morgan fingerprints | 0.957 | 0.801 | 0.850 | 0.942 | 0.912
SVM PCM model 30 % external test set | 0.942 | 0.789 | 0.810 | 0.956 | 0.910
GLM Linear Model 30 % external test set | 0.916 | 0.680 | 0.754 | 0.912 | 0.860
Global QSAR | 0.801 | 0.498 | 0.511 | 0.926 | 0.679
Global QSAM (Quantitative Sequence Activity Models) | 0.747 | 0.376 | 0.390 | 0.919 | 0.555
Average individual PCM performance | 0.936 | 0.634 | 0.750 | 0.900 | -
Average Individual QSAR performance | 0.897 | 0.569 | 0.762 | 0.809 | -
RF PCM with binary target id descriptors 30 % external test set | 0.853 | 0.611 | 0.616 | 0.940 | 0.722
GLM Linear baseline Model (Target and compound binary IDs) 30 % external test set | 0.730 | 0.441 | 0.755 | 0.706 | 0.113
Public data model 30 % external test set | 0.951 | 0.767 | 0.919 | 0.848 | 0.884
Public data model proprietary data test set | 0.642 | 0.182 | 0.410 | 0.778 | 0.347
2.3.3 Algorithms and Model Performance
The first model listed in Table 2-3 is the RF binary classification model, which achieved a high predictive performance
(ROC AUC of 0.964 +/- 0.0026) in CV, showing a low standard deviation across folds, and
utilising 350 variables which were automatically selected using recursive feature elimination.
The test set performance for this model was similar to that in CV achieving a ROC AUC of 0.968
and a MCC of 0.826.
In addition to compound and target descriptors, different machine learning algorithms were
explored next. We found that the RF model performed similarly to SVM and outperformed GLM
on the same external test set when using 512-bit Morgan fingerprints for compound descriptors
and Z-scales 5 for target descriptors (ROC AUC 0.968 RF vs 0.942 SVM and 0.916 GLM; ROC
curves are shown in Figure 2-8). ROC AUC values above 0.9 and MCC values above 0.8 as
obtained from this study are comparable to the classification results for PCM models generated
previously on other datasets.308,249,311
Figure 2-8: ROC curves for different algorithms, including Random Forest (RF) (mtry 86), Radial Support Vector Machines (SVM) (C=3, σ=0.01) and a Generalised Linear Model (GLM) on the same test set. The non-linear methods of RF and SVM outperform the linear GLM technique for this dataset, with the highest performing algorithm being RF.
We selected the RF model because this was the highest performing model for our external test
set and due to the higher interpretability of RF for future applications. At the 0.5 probability
threshold for the number of trees voting for each class, we can analyse the model performance
in terms of sensitivity and specificity values. It can be observed in Table 2-3 that the true
negative prediction rate (specificity) is higher than the true positive prediction rate (sensitivity),
providing a marginally better prediction of inactive compounds than active compounds. The
probability threshold that maximised the sensitivity and specificity of the model (0.40)
led to 90% sensitivity at 93% specificity.
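Selecting a probability threshold can be sketched as a simple scan over candidate cut-offs. The Python example below assumes that the quantity maximised is the sum of sensitivity and specificity (Youden's J statistic plus one); the exact criterion used in this work is not stated, and the function name, labels and probabilities are toy values.

```python
def best_threshold(y_true, prob_active, positive="A"):
    """Scan thresholds from 0.00 to 1.00 and return (threshold, sensitivity, specificity)
    for the first threshold that maximises sensitivity + specificity."""
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    best = (None, -1.0, 0.0, 0.0)
    for step in range(101):
        thr = step / 100
        tp = sum(t == positive and p >= thr for t, p in zip(y_true, prob_active))
        tn = sum(t != positive and p < thr for t, p in zip(y_true, prob_active))
        sens, spec = tp / n_pos, tn / n_neg
        if sens + spec > best[1]:
            best = (thr, sens + spec, sens, spec)
    return best[0], best[2], best[3]

# Toy predictions: fraction of trees voting for the active class
y_true = ["A", "A", "A", "A", "N", "N", "N", "N", "N", "N"]
prob_active = [0.9, 0.8, 0.6, 0.45, 0.5, 0.3, 0.2, 0.2, 0.1, 0.05]
threshold, sens, spec = best_threshold(y_true, prob_active)
```

In this toy case, lowering the cut-off below 0.5 recovers an active that the default threshold would miss without losing further inactives, mirroring the trade-off described above.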
For the final PCM model, the class-averaged performance across bromodomains was an MCC of
0.645 with a corresponding ROC AUC 0.934 (Table 2-3). The performance for the test set per-
target is shown in Figure 2-9 and Table 8-4 and highlights the domains for which the model is
most predictive. The BET bromodomains are generally predicted well for PCM with high ROC
AUCs (0.941-0.985) and a higher sensitivity (0.844-1.00) than specificity (0.667-0.967), as
might be expected due to the enriched number of actives for these domains (47.3 %) in the
dataset (Figure 2-4). ATAD2B and BRWD1 have no actives in the test set and SMARCA2 has no
inactives in the test set. Therefore, as the model predicts all compounds as inactive for
ATAD2B and BRWD1 BD2, this gives a specificity of 1.00; conversely, the model predicts all
compounds as active for SMARCA2, leading to a sensitivity of 1.00 (see Figure 2-4).
The case is similar for BRPF3, KAT2A and TAF1L BD2, where due to their low proportion of
active compounds, the sensitivity of the resulting model is zero. PCM performs the best outside
of the BET bromodomains for BAZ2A, BAZ2B, CECR2, PB1 BD5, TIF1A and TRIM33 (ROC AUCs
> 0.99, sensitivity > 0.85, specificity > 0.90). For most of these domains the dataset is heavily
biased towards inactives (Figure 2-4), and the high performance can be explained by the active
scaffolds common to training and test set for these domains (Figure 2-10). High overall ROC
AUCs are found for ATAD2, BPTF, BRD1, EP300, BRD7, BRD9, BRPF1 and TAF1 BD2; however,
they generally have lower sensitivity values (0.48-0.84) at a higher specificity, resulting in an
overprediction of inactives. This could be due to their higher active scaffold diversity, which is
represented in Figure 2-11. Conversely, those domains with a higher sensitivity than specificity
outside the BET family include CREBBP and PCAF, both of which have a high proportion of active
data points (37.3 % and 60 %; Figure 2-4). Overall, 21 out of 31 bromodomains are modelled well
(with a ROC AUC > 0.8 and both sensitivity and specificity > 0.6). For bromodomains with
poorer performance, this can be attributed to data limitations in many cases, including class
imbalances and scaffold diversity as discussed above.
Figure 2-9: ROC AUC, sensitivity and specificity for the external test set, split by bromodomain. Overall, 21 bromodomains are modelled well, with a ROC AUC >0.8 and both sensitivity and specificity > 0.6.
Figure 2-10: The number of carbon framework Bemis-Murcko scaffolds for active molecules in the external test set plotted per bromodomain. Scaffolds that were in the training set and test set are coloured red and scaffolds only in the test set are coloured blue. This plot shows that most active scaffolds for those bromodomains with low numbers of scaffolds in the test set were also found in the training set.
Figure 2-11: The number of active Bemis-Murcko carbon framework scaffolds for each bromodomain in the combined public and proprietary dataset.
2.3.4 Model Validation
In addition to assessing performance on a representative random test set, it is important to
assess the model performance on a test set that is more representative of practical use. In drug discovery,
we often want to find hits and design compounds for relatively unestablished targets, often also
developing new chemical series. In these cases, limited existing bioactivity data will exist in one
or more of the chemical and biological spaces. For this reason, we performed the more realistic
validations of leave-one-scaffold-out (LOSO) and leave-one-target-out (LOTO) to assess the
ability of the model to interpolate/extrapolate for new compounds and new targets respectively.
Leave-One-Scaffold-Out Validation
Figure 2-12 displays the LOSO results, which show that there was a large variation in ROC AUC
across scaffolds (mean ROC AUC 0.92, standard deviation 0.12). There is a general observed
trend towards better discrimination of activity for more complex scaffolds (scaffolds with higher
numbers are more complex with a larger number of rings), which is particularly noticeable for
those scaffold numbers above 3,000, which have consistently high ROC AUC values above 0.95.
The higher predictivity for more specific scaffolds is likely to result from more consistent SAR
for these molecules, due to their high degree of optimisation, combined with the fact that larger
molecules are generally more similar to one another in fingerprint space.366 For less complex
scaffolds, represented here with lower Bermis-Murcko scaffold numbers, there was a higher
variability between scaffold interpolation ability and hence model performance. This can be
attributed to class imbalance in some cases (Figure 2-13); for example, scaffolds 192 and 2,753
both have a highly active data point composition, and consequently both have a high sensitivity
accompanied by a lower specificity, corresponding to a larger number of overpredictions.
Conversely, we observe high specificities for scaffolds 172, 740 and 1614 with zero sensitivity,
which have a low number of active data points within the scaffold. However, this is not always
the case, as for example scaffold 464 has a lower fraction of active data points (0.09) while still
achieving perfect performance. Therefore, we can conclude that multiple factors contribute to
the ability of models to interpolate for a new scaffold, including the scaffold similarity, the class
distribution and the SAR itself within the dataset.
Figure 2-12: The ROC AUC, sensitivity and specificity values for leave-one-scaffold-out (LOSO) validation for the final PCM model.
Figure 2-13: The fraction of active to inactive data points for the scaffolds predicted for in the leave-one-scaffold-out (LOSO) validation, showing that class imbalance can influence predictive performance for a scaffold.
Leave-One-Target-Out Validation
PCM offers the advantage over QSAR that predictions can also be made for new targets
based on information in the training set about similar targets. To test the ability to interpolate
to new targets we employed LOTO validation, the results of which are displayed in Figure 2-14.
The average performance across all bromodomains was ROC AUC 0.863, with a sensitivity of
0.672 and a specificity of 0.864, which was unsurprisingly lower than our original model results
averaged over all targets (ROC AUC 0.934, sensitivity 0.750, specificity 0.900). Target
interpolation/extrapolation for the BET bromodomains (BD1 and BD2 domains from BRD2,
BRD3, BRD4 and BRDT proteins) can be achieved with high predictive performance (ROC AUC
> 0.8, sensitivities and specificities > 0.7). This can be attributed to the high correlation of
activity of molecules between the BETs, as well as a high proportion of shared scaffolds between
the domains (Figure 2-4 and Figure 2-7). These targets are also well represented in selectivity
panel screens and therefore have data points for a larger number of chemotypes. All targets have
a ROC AUC well above 0.5 (random), except for TAF1 BD2, which cannot be predicted using
information about other domains. This can be rationalised by the fact that there is only one
other member of the subfamily, TAF1L BD2 (Figure 2-5); however, TAF1 BD2 has many
more active compounds than TAF1L BD2 (Figure 2-4), and extrapolation from TAF1L BD2 hence
results in the underprediction of active compounds, demonstrated by the low sensitivity of
0.081. The opposite is true for the predictions of TAF1L BD2 based on the activity for
TAF1 BD2, where we observe an underprediction of the inactive class, resulting in a low
specificity of 0.292. BRPF1 and ATAD2 also have lower ROC AUCs of around 0.7. This can
again be rationalised by the lower proportion of active compounds for the nearest domain
sequences, BRPF3/BRD1 and ATAD2B (Figure 2-4), resulting in a greater number of inactive
predictions. SMARCA4 predictions show an overprediction of actives because the nearest
domain, SMARCA2, has a large proportion of active data points (Figure 2-4). The ATAD2B data
consisted entirely of inactive data points, and we could predict this domain with a specificity of
0.887 and a false positive rate of 0.113; these findings likely result from the transfer of the
activity profile of ATAD2. BRD7, BRD1, KAT2A and PB1 BD5 are predicted with high ROC AUC,
sensitivity and specificity values > 0.8. The number of data points for each of these domains is
small, and a larger number of data points for a similar domain (BRD9, BRPF1b, CECR2 and
SMARCA4, respectively; Figure 2-4), with a similar activity profile (Figure 2-7), is contained
within the training set. This shows the ability of the model to interpolate between domains
with similar activity profiles, especially for domains where fewer data are available.
In summary, it was found that PCM can generally interpolate/extrapolate well for new targets
that have similar bioactivity profiles to those in the training set, and less well for novel targets
where this is not the case.
Figure 2-14: The ROC AUC, sensitivity and specificity values for leave-one-target-out (LOTO) validation for the final PCM model.
2.3.5 Benchmarking to QSAR, QSAM and baseline models
We next compared our PCM model to other techniques which could be applied to the same
dataset, in order to establish its comparative advantage for bromodomain bioactivity modelling.
The final PCM model significantly (p-value < 0.05, paired t-test) outperformed models built
on only compound descriptors (global QSAR) or only target descriptors (global
Quantitative Sequence Activity Models (QSAM)), as shown in Figure 2-15 and Table 2-3,
most notably in terms of sensitivity at the 0.5 probability cut-off (sensitivity: PCM=0.86,
QSAR=0.51, QSAM=0.39). The poorer prediction of actives by these methods is further
evidenced by the reduction in the area under the precision-recall curve (PRAUC:
PCM=0.879, QSAR=0.679, QSAM=0.555). Global QSAR outperforms global QSAM (ROC AUC:
QSAR=0.80, QSAM=0.75), showing that discrimination between active and inactive classes is
more dependent upon compound descriptors than target descriptors. This benchmarking
demonstrates that a more predictive model can be achieved by utilising a combination of target
and compound information, as is the case in the PCM method. Per-target PCM on average
outperforms individual QSAR models on the dataset (mean MCC PCM = 0.634, mean MCC
QSAR = 0.569); however, this is highly target dependent, as shown by the comparison between
individual QSAR models and the per-target PCM model (Figure 2-16). For example, for the BET
bromodomains (highlighted in Figure 2-16), PCM outperforms individual QSAR methods,
showing that bioactivity information about one target can be transferred to another target to
improve performance in these cases. However, for BRD7, SMARCA4 and BRWD1 BD2,
individual QSAR models result in noticeably better predictive performance. For SMARCA4 we
can attribute the better performance of the individual QSAR model to the PCM model wrongly
interpolating from SMARCA2, resulting in a high number of active predictions. These domains
are better suited to being modelled individually. Both individual QSAR and PCM perform poorly
for BRPF3, TAF1L BD2 and SMARCA2. Predictive performance for ATAD2, CECR2, PB1 BD5 and
TIF1A is greatly enhanced by PCM. Therefore, PCM is better than or similar in performance to
individual QSAR models for the majority of bromodomains (28 out of 31), with only three cases
to the contrary, where it is markedly outperformed. This information can inform future
decisions on which modelling techniques to employ for different bromodomains.
The PCM model was also benchmarked against a model which used simplistic binary target
identifier descriptors, in order to justify the use of the more complex alignment-dependent
amino acid descriptors such as Z-scales 5 for the protein side. The PCM model outperformed
this model (MCC: PCM = 0.826, Binary Target Descriptor model = 0.611, ROC AUC: PCM = 0.968,
Binary Target Descriptor model = 0.853), showing that the protein binding pocket information
of the PCM model is important for classification.
Figure 2-15: ROC AUC curves for PCM versus Global QSAR and Global QSAM techniques on the same test set. PCM outperforms both Global QSAR and QSAM techniques, showing an interaction of compound and target features is important for modelling this dataset.
Figure 2-16: Individual QSAR vs PCM performance for each target measured by Matthews Correlation Coefficient (MCC). On average, PCM outperforms QSAR. The BET bromodomains are highlighted by the green box outline.
RF is a non-linear machine learning algorithm and the method employed here uses complex
features to describe molecules and targets. For comparison a baseline generalised linear model
(GLM) was produced, using only compound and target binary identifiers as the input variables.
This model achieved adequate performance with a ROC AUC of 0.730 and MCC of 0.441, which
was however outperformed by the RF PCM model. Additionally, the GLM model is less
generalisable than the RF PCM model, since only those compound target pairs with identifiers
in the training set can be meaningfully modelled using previous chemical information, leaving
new compounds to be estimated using target activity profiles.
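The limitation of identifier descriptors can be illustrated with a small sketch. This is an illustration under stated assumptions rather than the encoding used for the baseline GLM: the compound identifiers and helper names are hypothetical, and a simple one-hot encoding is assumed for the binary identifiers.

```python
def one_hot(value, vocabulary):
    """Binary identifier vector: 1 at the position of `value`, 0 elsewhere."""
    return [1 if value == item else 0 for item in vocabulary]

def identifier_features(compound_id, target_id, compounds, targets):
    """Concatenate the compound identifier block and the target identifier block."""
    return one_hot(compound_id, compounds) + one_hot(target_id, targets)

# Hypothetical identifiers present in the training set.
train_compounds = ["cpd_A", "cpd_B", "cpd_C"]
train_targets = ["BRD1", "BRD9"]

seen = identifier_features("cpd_B", "BRD9", train_compounds, train_targets)
unseen = identifier_features("cpd_NEW", "BRD9", train_compounds, train_targets)
```

For the unseen compound the compound block is all zeros, so only the target block carries any information; this corresponds to estimating new compounds from target activity profiles alone.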
We next compared this baseline model directly to a GLM using compound and target
descriptors (Table 2-3), which outperforms the baseline model (ROC AUC: GLM = 0.916 vs
Baseline = 0.781), showing that additional information is contained within the chemical and
biological descriptors used, i.e. information which is further exploited by the final RF model,
which can consider nonlinear relationships between both spaces.
Finally, we investigated the model validity utilising Y-scrambling. Performance dropped from
ROC AUC 0.968 to 0.701 when 50 % of labels were scrambled, and then further to random
chance (ROC AUC 0.500) with 100 % of labels scrambled, showing that the model captures a
genuine relationship between the descriptors and the classification outcome rather than
chance correlations.
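The label-scrambling step of such a test can be sketched as follows. This is a minimal sketch under assumptions: the exact scrambling procedure is not specified here, so the example assumes that a random subset of positions is chosen and the labels at those positions are permuted among themselves; the function name and toy labels are hypothetical, and retraining and scoring the model on the scrambled labels are omitted.

```python
import random

def scramble_labels(labels, fraction, seed=0):
    """Return (scrambled_labels, touched_positions).

    A random `fraction` of positions is selected and the labels at those
    positions are shuffled among themselves; all other labels are unchanged.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_pick = round(fraction * len(labels))
    picked = rng.sample(range(len(labels)), n_pick)
    shuffled_values = [labels[i] for i in picked]
    rng.shuffle(shuffled_values)
    scrambled = list(labels)
    for position, value in zip(picked, shuffled_values):
        scrambled[position] = value
    return scrambled, sorted(picked)

# Hypothetical activity labels (1 = active, 0 = not active).
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
half, touched = scramble_labels(labels, 0.5)
full, _ = scramble_labels(labels, 1.0)
```

Because labels are only permuted, the class balance is preserved, so any drop in performance reflects the loss of the descriptor-label relationship and not a change in class distribution.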
Figure 2-16 axes: Bromodomain; Matthews Correlation Coefficient (MCC)
Overall, we conclude that using the non-linear RF method, with its ability to model interactions
between descriptors, as well as threshold cut-offs for descriptors,209 enhances the overall
predictive performance of the models for this dataset and increases generalisation to new
compounds and targets beyond the training set.
2.3.6 Public Dataset PCM Model
Both public and proprietary data were used to construct the PCM models; however, a model
based solely on public data was also generated for comparison. 3,950 data points were modelled
across 20 bromodomains for the public dataset, producing a model with high performance on
the random test set (MCC 0.767, ROC AUC 0.951, PRAUC 0.884; Table 2-3). The model is,
however, more limited in domain and scaffold coverage, comprising only 20 out of 31 domains
and 685 out of 3,618 Murcko scaffolds compared to the in-house model. On a per-target basis,
the performance varies between domains (Figure 2-17). In contrast to the trend for the combined
proprietary and public data model, we observe an overprediction of active labels for many
bromodomains, with a higher sensitivity than specificity, particularly for BAZ2A, BRD1, BRD3
BD1, BRD4 BD2, BRD9 and PB1 BD5, which can be explained, as above, by a high proportion of
active data points (Figure 2-2).
Figure 2-17: Per-bromodomain performance of the public model in terms of ROC AUC, sensitivity and specificity on the 30 % test set.
Since the performance analysis above is only for the 30 % test set of public data, we also tested
the ability of the public model to predict for the proprietary dataset. The overall performance
decreased to an MCC of 0.182, a ROC AUC of 0.642 and a PRAUC of 0.347 (Table 2-3), and the
per-target performance (Figure 2-18) dropped to an average ROC AUC of 0.648, with
overprediction of one class over another for many domains (average sensitivity 0.447, average
specificity 0.656). This shows that the two data sets are diverse in chemical and biological
spaces, and that the public model is not able to predict well for a structurally diverse dataset.
This supports the rationale for combining the two sets of data to increase the applicability
domain of the model.
Figure 2-18: Per-bromodomain performance of the public model in terms of ROC AUC, sensitivity and specificity on the proprietary dataset.
2.3.7 Applicability Domain
As we have seen in the previous section, we cannot expect a model to reliably predict for
chemical and biological space which has little relationship to the dataset the model is trained
on. Therefore, when applying the model to future test sets, we need to define an applicability
domain, within which the model can predict better than random guessing. We examined the
suitability of cross conformal prediction (CCP)227 as a method to define the applicability domain
of our best performing model, by employing the method to predict classes for the representative
external test set. The results are presented in Table 2-4, where it can be seen that the model has
a validity for the test set close to the confidence level, namely 70.1 %, 80.6 %, 90.4 % and
94.9 % for the 0.7, 0.8, 0.9 and 0.95 confidence levels (CLs) respectively, demonstrating that the
conformal predictions were valid and confirming the suitability of this method for future
compound selections. As described in the methods, to achieve a specified validity, the conformal
predictor allows samples to be classified into the “both” or “neither” classes, as well as single
label classes. The efficiency metric, which is the average number of output labels per sample,
demonstrates how the model achieves a defined confidence level. For the 70 and 80 %
confidence levels the efficiency is 0.73 and 0.85, respectively, and 27 % and 15 % of compounds
are classified into the “neither” class at these levels (i.e., they lie outside the applicability domain
of the model). For the 90 % confidence level, the efficiency is 0.99 and this is fulfilled by placing
compounds into all classes, including 0.12 % into the “both” class and 1.06 % into the “neither”
class, to fulfil the guaranteed error rate. Since the baseline performance of the model is at 91.9
% validity (accuracy), at the confidence level of 0.95 the model assigns many samples into the
“both” class to achieve a higher validity than the underlying classifier. This explains why the
efficiency metric is now greater than 1.00 (1.12), where 12 % of the labels are now in the “both”
class.
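The relationship between p_values, output labels, validity and efficiency can be made concrete with a short sketch. It assumes the common conformal convention that a label is output when its p_value exceeds the significance level (1 minus the confidence level); the function names and the toy p_values are hypothetical and this is not the CCP implementation used in this work.

```python
def prediction_set(p_active, p_inactive, confidence):
    """Labels whose p_value exceeds the significance level (1 - confidence)."""
    significance = 1.0 - confidence
    labels = set()
    if p_active > significance:
        labels.add("active")
    if p_inactive > significance:
        labels.add("inactive")
    return labels  # two labels = "both", empty set = "neither"

def validity_and_efficiency(p_values, true_labels, confidence):
    """Return (validity, efficiency, fraction 'both', fraction 'neither')."""
    sets = [prediction_set(pa, pi, confidence) for pa, pi in p_values]
    n = len(sets)
    validity = sum(t in s for s, t in zip(sets, true_labels)) / n
    efficiency = sum(len(s) for s in sets) / n
    both = sum(len(s) == 2 for s in sets) / n
    neither = sum(len(s) == 0 for s in sets) / n
    return validity, efficiency, both, neither

def efficiency_from_fractions(both, neither):
    """Average number of labels: single-label samples count 1, 'both' counts 2."""
    return (1.0 - both - neither) + 2.0 * both

# Hypothetical (p_active, p_inactive) pairs and true classes.
p_values = [(0.80, 0.05), (0.02, 0.60), (0.25, 0.22), (0.04, 0.03)]
true_labels = ["active", "inactive", "active", "inactive"]
result = validity_and_efficiency(p_values, true_labels, confidence=0.9)
```

Applying `efficiency_from_fractions` to the “both” and “neither” fractions reported in Table 2-4 gives 0.73, 0.99 and 1.12 for the 70 %, 90 % and 95 % confidence levels, consistent with the tabulated efficiencies.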
In summary, since conformal prediction performs with expected validities on our test set, we
use this method to select compounds for experimental validation in the next section (see
methods).
Table 2-4: Applicability domain of the model as defined at 4 different confidence levels of 0.7, 0.8, 0.9 and 0.95 with Cross Conformal Prediction (CCP). Conformal prediction has four outcomes: active, not active, “both” active and not active, or “neither”. The number of single class predictions increases as the confidence level increases to avoid the incorrect “neither” class assignment, until the baseline accuracy of the binary classifier (0.92) is reached, whereupon compounds are predicted to be members of “both” classes to increase the validity further. This table shows that the conformal method as applied to PCM is valid and fulfils the required maximum error criteria when tested on the external test set, making the model and the conformal prediction framework suitable for future predictions.
Confidence (%)  Expected Error (%)  Validity (%)  Efficiency  “Both” (%)  “Neither” (%)
70              30                  70.1          0.73        0           27.0
80              20                  80.8          0.85        0           15.0
90              10                  90.4          0.99        0.12        1.06
95              5                   94.9          1.12        12          0.00
Efficiency: average number of output labels per sample. “Both” and “Neither”: percentage of compound-target pairs with that label.
2.3.8 Experimental Validation
Using the best performing RF PCM model, compounds were next prioritised for experimental
testing in Differential Scanning Fluorimetry assays against the four bromodomains BRD1, BRD4
BD1, BRD9 and BRPF1b, as described in the methods section. The overall performance of the
model on the validation set was modest, with an MCC of 0.24 (Table 2-5). The compound
selections were further divided into three categories: (1) Interpolation Compounds, (2) Selectivity
Profiles and (3) Singular Actives, which were designed to test different hypotheses for model
prediction (see methods). The resulting performance for the three overlapping sets is shown
in Table 2-5. The balanced accuracy (mean of the sensitivity and specificity values) was used to
assess performance instead of the MCC due to the absence of inactive predictions for the
Singular Active set. For the Interpolation Compounds, Selectivity Profiles and Singular Actives
the balanced accuracies were 0.66, 0.64 and 0.29 respectively, showing better predictive
performance for the Interpolation Compounds and Selectivity Profile sets than for the Singular
Active set. This trend is not only driven by the inclusion of the inactive class, but also an increase
in precision for the active class (the precision of the Interpolation Compounds set is 0.42, while
that for the Selectivity Profiles set is 0.37 and that for the Singular Actives set is only 0.29). It is
unsurprising that Interpolation Compounds are generally better predicted, due to some
information about the compound’s bioactivity being included in the model training set. The
median Euclidean distance to the nearest compound-target neighbour in descriptor space
increases from Interpolation Compounds (0.37) to Selectivity Profiles (0.52) to Singular Actives
(0.57), showing that the compound-target pairs in the Singular Actives set are more diverse in
Euclidean space and likely represent a more difficult prediction set for the RF algorithm. This
can explain the better performance for the Selectivity Profiles set over the Singular Actives set.
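The distance measure discussed above can be sketched as follows. This is a minimal illustration only: the two-dimensional toy vectors stand in for the combined compound and target descriptors, and the function names are hypothetical.

```python
import math
from statistics import median

def nearest_neighbour_distance(query, training_vectors):
    """Euclidean distance from `query` to its closest training vector."""
    return min(math.dist(query, vector) for vector in training_vectors)

def median_nn_distance(queries, training_vectors):
    """Median of the nearest-neighbour distances over a set of query vectors."""
    return median(nearest_neighbour_distance(q, training_vectors) for q in queries)

# Hypothetical descriptor vectors for compound-target pairs.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(0.0, 0.5), (3.0, 4.0), (1.0, 0.1)]
result = median_nn_distance(queries, training)
```

A larger median indicates that a prediction set lies further from the training data in descriptor space, which is the sense in which the Singular Actives set is described as more diverse.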
Table 2-5: Performance reported as True positives (TP), True negatives (TN), False positives (FP), False negatives (FN), Balanced Accuracy, Matthews Correlation Coefficient (MCC), Precision and the median distance to the training set for the three overlapping subsets of compounds selected for testing. Interpolation performs the best for the model, as expected, followed by the Selectivity Profile predictions.
Set                      TP   TN   FP   FN  Balanced Accuracy  MCC   Precision  Median distance
Interpolation Compounds  9    12   12   2   0.66               0.30  0.42       0.37
Selectivity Profiles     204  166  343  8   0.64               0.31  0.37       0.52
Singular Actives         115  0    273  0   0.29               -     0.29       0.57
Median distance: median Euclidean distance to nearest compound-target neighbour in training set.
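For reference, the summary metrics can be derived from the confusion-matrix counts as sketched below. The function name is hypothetical; the formulas are the standard definitions of sensitivity, specificity, precision, balanced accuracy and MCC, and the counts are taken from the first two rows of Table 2-5.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }

interpolation = confusion_metrics(tp=9, tn=12, fp=12, fn=2)
selectivity = confusion_metrics(tp=204, tn=166, fp=343, fn=8)
```

Rounded to two decimal places, these counts give balanced accuracies of 0.66 and 0.64 and MCC values of 0.30 and 0.31 for the Interpolation Compounds and Selectivity Profiles sets, respectively.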
In the next sections we examine model performance in further detail for the individual sets
on a per-bromodomain basis, and also consider the overall hit rates for
active predictions, which are often used to measure the success of a virtual screen.
Interpolation Compounds
The model performance for Interpolation Compounds on a per-target basis is summarised in
Table 2-6. For BRD1 and BRPF1 the model performed well at predicting interpolations, with an
overall accuracy of 100 % and 75 %, respectively, for both the active and inactive classes. For
BRD9, the accuracy was lower (33 %), due to an overprediction of actives. Figure 2-19 shows the
Euclidean distance of the predicted compounds in combined compound and target space. It is
noticeable that many of the active BRD9 predictions were for compounds which had a larger
Euclidean distance to the nearest compound-target pair in the training set (see circled points in
Figure 2-19), than for active predictions for BRD1 and BRPF1b, which could mean that the
interpolation for BRD9 is not achievable from the other bromodomains considered here. In
general, interpolation is more successful for BRD1 and BRPF1b than for BRD9.
Table 2-6: Interpolation prediction results, showing number of active predictions, number of inactive predictions and the % accuracy for each domain. Interpolation compounds have activity in the training set against another domain but activity is not known for the tested domain. Interpolation for BRD1 and BRPF1b was more successful (as measured by accuracy) than for BRD9.
Bromodomain  Number of Active predictions  Number of Inactive predictions  Accuracy (%)
BRD1         3                             6                               100.0
BRD9         12                            6                               33.0
BRPF1b       6                             2                               75.0
Figure 2-19: Interpolation predictions categorised according to their domain and predicted activity at the 0.9 confidence level, plotted with their Euclidean distance in compound and target space from the nearest compound-target neighbour in the training set. Since all compounds are in the training set, this shows the distance to the nearest bromodomain on a relative scale. It is noticeable that the incorrect active BRD9 predictions were for compounds with a higher Euclidean distance to the nearest compound-target pair in the training set (circled points) than for BRD1 and BRPF1b, which could mean that the interpolation for BRD9 is not achievable from the other bromodomains considered here.
Axes: Bromodomain, Split by Predicted Activity; Euclidean Distance in Descriptor Space to Nearest Compound-Target Pair in Training Set
Selectivity Profiles
Table 2-7 shows the results of the Selectivity Profile predictions for two, three or four
bromodomains. For the Selectivity Profile compounds predicted for two domains,
corresponding to the first 10 rows of the table, we can see that the best predicted profiles are
activity at both BRD4 BD1 and BRD9 (Profile 1), activity at BRD4 BD1 with inactivity at BRPF1
(Profile 8) and activity at BRD4 BD1 with inactivity at BRD1 (Profile 10), with 61.1 %, 63.1 % and
62.5 % completely correct predictions across all bromodomains, respectively. This reflects
the general selectivity profiles in the training set; BRD4 BD1 and BRD9 are often reported as
active together (Figure 2-20d), whereas many compounds that have activity at BRD4 BD1 are
inactive against BRD1 and BRPF1 (Figure 2-20a and Figure 2-20c). The dual inactive profile for
BRD1 and BRPF1b when a compound displays activity against BRD4 BD1 can also be rationalised
by the similarity between domains (Figure 2-5), which shows that BRD1 and BRPF1b are the
most similar domains within the same type 4 subfamily, whereas BRD4 BD1 is in a separate
subfamily (type 2). When analysing the predicted Selectivity Profile that includes activity for
both BRD1 and BRPF1 (Profile 3, Table 2-7), we find a poor accuracy for completely correct
predictions of 12.3 %. Most of the incorrect predictions have a conformal prediction p_value
for the active class between 0.1 and 0.3, showing that this region represents more
speculative predictions (Figure 2-21a). Additionally, for this profile, the training dataset shows
a high degree of overlap for these domains in terms of compound activity profiles (Figure
2-20e). However, from experimental testing, we were able to find a relatively large set of
35 compounds which were active at one domain but not the other. These data would be important
to incorporate in the future to understand selectivity between the BRD1 and BRPF1 domains, which
has not been well explored previously.
Table 2-7: Selectivity profile predictions for the experimental validation set, per individual Selectivity Profile. The results are broken down into numbers of compounds for which 4, 3, 2, 1 or 0 predictions were correct and the percentage of each Selectivity Profile predictions which were predicted entirely correctly. Selectivity profiles tested vary between 2, 3 or 4 predictions per compound.
Profile  Predicted Profile                                 4 correct  3 correct  2 correct  1 correct  None correct  Total tested  Completely correct (%)
1        BRD4 BD1 and BRD9 active                          NA         NA         11         1          6             18            61.1
2        BRD9 and BRPF1 active                             NA         NA         2          5          18            25            8.0
3        BRD1 and BRPF1 active                             NA         NA         15         35         71            121           12.3
4        BRD4 BD1 and BRPF1 active                         NA         NA         1          1          5             7             14.2
5        BRD1 and BRD9 active                              NA         NA         0          1          9             10            0.0
6        BRD9 active, BRD4 BD1 inactive                    NA         NA         1          3          0             4             25.0
7        BRPF1 active, BRD9 inactive                       NA         NA         1          2          0             3             33.3
8        BRD4 BD1 active, BRPF1 inactive                   NA         NA         12         7          0             19            63.1
9        BRPF1 active, BRD4 BD1 inactive                   NA         NA         3          8          0             11            27.2
10       BRD4 BD1 active, BRD1 inactive                    NA         NA         5          3          0             8             62.5
11       BRD4 BD1, BRD9 and BRPF1 active                   NA         0          0          1          0             1             0.0
12       BRD1, BRD9 and BRPF1 active                       NA         8          15         6          34            55            14.5
13       BRD4 BD1 active, BRD1 and BRPF1 inactive          NA         10         3          0          0             13            76.9
14       BRD4 BD1 active, BRD1 and BRD9 inactive           NA         8          4          1          0             13            61.5
15       BRD9, BRPF1 active, BRD4 BD1 inactive             NA         0          1          1          0             2             0.0
16       BRD1, BRPF1 active, BRD4 BD1 inactive             NA         0          0          2          0             2             0.0
17       BRD1, BRD9, BRPF1 active, BRD4 BD1 inactive       0          2          2          3          0             7             0.0
18       BRD4 BD1 active, BRD1, BRD9 and BRPF1 inactive    4          5          1          0          0             10            40.0
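The per-profile tallies in Table 2-7 follow from counting, for each tested compound, how many of the predicted labels in the profile agree with the measured labels. A minimal sketch is given below; the function names and the measured outcomes are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter

def n_correct(predicted, measured):
    """Number of bromodomains for which the predicted label matches the measurement."""
    return sum(predicted[domain] == measured[domain] for domain in predicted)

def tally_profile(predicted, measured_per_compound):
    """Return (Counter of correct-prediction counts, % of compounds completely correct)."""
    counts = Counter(n_correct(predicted, m) for m in measured_per_compound)
    total = len(measured_per_compound)
    pct_completely_correct = 100.0 * counts[len(predicted)] / total
    return counts, pct_completely_correct

# Predicted two-domain profile (A = active, N = not active).
predicted = {"BRD4 BD1": "A", "BRD9": "A"}
# Hypothetical measured outcomes for four compounds predicted to have this profile.
measured = [
    {"BRD4 BD1": "A", "BRD9": "A"},
    {"BRD4 BD1": "A", "BRD9": "N"},
    {"BRD4 BD1": "N", "BRD9": "N"},
    {"BRD4 BD1": "A", "BRD9": "A"},
]
counts, pct = tally_profile(predicted, measured)
```

For Profile 1 in Table 2-7, for instance, 11 completely correct compounds out of 18 tested corresponds to 61.1 %.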
When analysing the results for the three- and four-domain Selectivity Profile predictions, we
observe trends consistent with the two-domain Selectivity Profile predictions. For the
Selectivity Profile including activity at BRD4 BD1 with inactivity at both BRD1 and BRPF1 (Profile
13), and the Selectivity Profile including activity at BRD4 BD1 with inactivity at both BRD1 and
BRD9 (Profile 14), we see high completely correct prediction rates of 76.9 % and 61.5 %,
respectively. This is consistent with the fact that Selectivity Profiles with actives for BRD4 BD1
were also well predicted for the two-domain Selectivity Profiles. For the Selectivity Profile
including activity at BRD4 BD1 with inactivity at BRD1 and BRPF1 (Profile 13), those compounds
with higher p_values and similarity to the nearest compound neighbour in the training set were
correctly predicted active for BRD4 BD1 (Figure 2-21b). Another Selectivity Profile of note
includes activity for BRD1, BRD9 and BRPF1 (Profile 12), for which, although the overall
completely correct prediction rate is poor at 14.5 %, there are still eight compounds predicted
correctly and 15 with two correct activity assignments, correctly identifying compounds with
multiple activities.
Figure 2-20: Pairwise bioactivity profiles of training set molecules for all two-domain combinations of BRD1, BRD4 BD1, BRD9, and BRPF1b (panels a to f). Each cell represents a compound-target pair activity with unique compounds which have data for both domains along the x-axis plotted against each of the bromodomains on the y-axis. Red coloured cells represent active (A) compound-target pairs, blue coloured cells represent not active (N) compound-target pairs. It can be seen that BRD1 and BRPF1b have highly correlated activities and BRD4 BD1 has the most active compounds, which tend to inversely correlate with other domain bioactivities.
Figure 2-21: Selectivity Profile predictions plotted with the p_values for the active class generated from conformal predictions against the Tanimoto similarity to the nearest compound neighbour in the training set (Morgan fingerprints, 512 bits) for a) compounds that were predicted active at both BRD1 and BRPF1; active=A, not active=N. The p_values are lower for this set on average compared to the data for the other Selectivity Profile, and most of the wrong predictions fall between 0.1-0.3. b) compounds that were predicted active at BRD4 BD1 and inactive at BRD1 and BRPF1; active=A, not active=N. Higher p_values and similarity to the nearest compound neighbour in the training set were predicted correctly as active for BRD4 BD1.
Axes (panels a and b): Tanimoto Similarity to Nearest Compound Neighbour in Training Set; P_value for Active Class from Conformal Prediction
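The similarity measure used in Figure 2-21 can be sketched in plain Python as follows. This is an illustration under assumptions: fingerprints are represented simply as sets of on-bit positions (generation of the 512-bit Morgan fingerprints is not shown), and the function names and bit positions are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbour_similarity(query_bits, training_fingerprints):
    """Highest Tanimoto similarity between the query and any training fingerprint."""
    return max(tanimoto(query_bits, fp) for fp in training_fingerprints)

# Hypothetical on-bit positions within a 512-bit fingerprint.
training_fps = [{1, 5, 9, 200}, {3, 5, 77}, {9, 10, 11, 12}]
query = {1, 5, 9, 300}
similarity = nearest_neighbour_similarity(query, training_fps)
```

Here the query shares three of five distinct on-bits with its closest training compound, giving a nearest-neighbour similarity of 0.6.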
Table 2-8: Three compound selectivity profiles which were predicted by the PCM model and their Thermal Shift values, as determined by Differential Scanning Fluorimetry (DSF). A= active, N= not Active. Shows the thermal shift values (ΔTm) tested at 10 μM averaged between the midpoint and first differential (see methods) and the range of the two measurements. p_values were calculated from conformal predictions and show the probability values that the compound belongs to the class assigned. NA values mean the compound was not selected for testing in the assay. The table shows three compound structures for which selectivity was predicted by the model including two examples of a dual BRD1 and BRPF1b active prediction (one of which was correctly predicted the other partially correctly predicted) and one example of a dual BRD4 BD1 and BRPF1b active.
Structure  BRD1 Experimental  BRD1 Predicted  BRPF1b Experimental  BRPF1b Predicted  BRD9 Experimental  BRD9 Predicted  BRD4 BD1 Experimental  BRD4 BD1 Predicted
1          A (4.65/0.10)      A (0.12)        A (7.81/0.35)        A (0.12)          NA                 NA              NA                     NA
2          A (2.15/0.70)      A (0.21)        N (0.87/0.25)        A (0.21)          NA                 NA              NA                     NA
3          NA                 NA              A (1.15/0.31)        A (0.17)          NA                 NA              A (1.15/0.10)          A (0.16)
Experimental: Experimental Activity (ΔTm average / range). Predicted: Prediction Activity (p_value).
Table 2-8 shows three examples of predicted profiles from the model along with their activities
as determined by DSF. It is important to note than DSF is only a semi-quantitative assay and
that there may only be small differences between weakly active compounds, and not active
compounds. An example of a dual active compound correctly predicted for Profile 3 (Table 2-7)
is compound 1 in Table 2-8, which is a regioisomer of the chemical probe for the BRPF1b and
BRD1 bromodomains, NI-57.363 Data for the 7-amino-1,3-dimethyl-quinolin-2-one warhead was
not in the training dataset, since it was more recently generated and published by AstraZeneca
in 2017; however, some scaffolds with a six-membered ring fused to an aryl ring were active for
BRPF1b in the dataset from internal AstraZeneca compounds, as well as the aryl-sulfonamide
linker being known from the benzimidazolone series.93 Combining these two molecular features
of a six-membered ring fused to an aryl ring containing an acceptor and the sulfonamide linker,
the model could correctly predict dual activity at BRD1 and BRPF1b. This compound was
predicted as active with a p_value of 0.12 for both domains, showing that the probability of
belonging to this class was close to the cut-off of 0.1 and that this compound was at the edge of
our defined applicability domain. Compound 2 is an example of an incorrectly predicted profile
for the same Selectivity Profile 3 (Table 2-8) and is also related to the benzimidazolone series,93
with the main differences of; 1. a reversed sulfonamide, 2. the expanded six-membered ring
system from the original five-membered ring, and 3. the addition of an extra carbonyl acceptor.
In this case, the compound was correctly predicted as active at BRD1 (p_value 0.21); however, it
was also predicted as active at BRPF1b (p_value 0.21) but tested as inactive. This example
confirms our earlier observation that the model often predicts dual activity for BRD1 and
BRPF1b, which occurs because the training set contains few examples of selectivity for
one of the bromodomains over the other (Figure 2-19e). This example is one of several of the
same chemotype tested which displays this selectivity profile and can be utilised in future
models to enable the distinction between chemical features which provide selectivity for BRD1
over BRPF1b and those which constitute activity for both. Compound 3 is an example of a
correctly predicted BRD4 BD1 and BRPF1b active (Profile 4, Table 2-8), which was predicted as
active at BRPF1b with a p_value of 0.17 and BRD4 BD1 with a p_value of 0.16. The higher
probability values come from the fact that the warhead is found in the training dataset with one
example of the O-linked aryl showing activity at multiple bromodomains, including BRPF1b and
BRD4 BD1.
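The way a p_value cut-off turns into a predicted label can be illustrated with a minimal sketch. It assumes the usual conformal prediction convention that a class is included in the prediction set when its p_value exceeds the significance level (1 minus the confidence level); the function name and the inactive-class p_value are hypothetical, and only the active-class p_value of 0.12 for compound 1 is taken from Table 2-8. The empty set corresponds to what the caption of Figure 2-22 calls the "neither" class.

```python
def conformal_label(p_active, p_inactive, significance=0.1):
    """Return the prediction-set label for one compound-target pair.

    A class is kept when its p_value exceeds the significance level
    (assumed convention; the thesis pipeline itself is not reproduced here).
    """
    active = p_active > significance
    inactive = p_inactive > significance
    if active and inactive:
        return "both"
    if active:
        return "active"
    if inactive:
        return "inactive"
    return "neither"  # outside the applicability domain at this level


# Active-class p_value 0.12 is from Table 2-8 (compound 1, BRD1);
# the inactive-class p_value 0.05 is a made-up value for illustration.
print(conformal_label(0.12, 0.05, significance=0.1))  # active
print(conformal_label(0.12, 0.05, significance=0.3))  # neither
```

Under these assumptions, raising the cut-off from 0.1 to 0.3 moves a borderline prediction such as this one out of the active class, which is the trade-off between hit rate and coverage discussed in the text.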
Singular Actives and Overall
For the Singular Active set, a hit rate of 29 % was achieved. When we expand the analysis to all
predicted actives, we obtain an overall active hit rate of 34.1 % for all four assays (Table 2-9),
which greatly exceeds a random hit rate of 0.05 % from previous testing against the same assays
in AstraZeneca screens. When considering per-bromodomain hit rates, BRD4 BD1 has the
highest hit rate of 53.8 %, whereas the other bromodomains ranged from 25-31 %. The
performance per bromodomain is shown in Table 2-9, showing that the number of active hits
correctly predicted as active by the model varied across bromodomains, comprising 57 for BRD1,
129 for BRD4 BD1, 62 for BRD9 and 71 for BRPF1b. Although hit rates are a useful assessment
for how well the model helped shortlist compounds for testing (and one focus was put on
identifying active compounds), it should be reiterated that the compound selection was not
only designed to optimise the hit rate, as the intention was to also sample across different
similarity ranges to the training set and clusters to identify hits which were more structurally
diverse. A consequence of optimising for diversity and reducing the confidence level is that
there was a high overprediction rate of actives, as can be seen by the high false positive
prediction rate of 0.76 (Table 2-9). When the p_value cut-off was increased post-hoc to 0.3 or
greater (corresponding to a confidence level of 0.7), the overall hit rate of classification was
further increased to 68.7 % (Figure 2-22, Table 2-10), showing that adjusting the p_value has
utility in increasing the number of correct predictions. However, this is accompanied by
decreased hit diversity compared to the training set.
Table 2-9: Metrics for the experimental test set predictions in terms of hit rate, sensitivity, specificity, balanced accuracy, Matthews correlation coefficient (MCC) and the number of correctly predicted new actives discovered.
Bromodomain | Hit rate (%) | Sensitivity | Specificity | Balanced Accuracy | MCC | Number of correctly predicted actives
All | 34.1 | 0.97 | 0.24 | 0.61 | 0.24 | 319
BRD1 | 26.6 | 1.00 | 0.29 | 0.65 | 0.28 | 57
BRD4 BD1 | 53.8 | 0.99 | 0.27 | 0.63 | 0.36 | 129
BRD9 | 25.4 | 0.86 | 0.15 | 0.50 | 0.02 | 62
BRPF1b | 30.6 | 0.97 | 0.25 | 0.61 | 0.25 | 71
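For readers who want to see how the metrics in Table 2-9 relate to one another, the following sketch computes them from the four cells of a confusion matrix using the standard definitions. The counts in the example are hypothetical and were chosen only to give round numbers; they are not the experimental counts from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)        # fraction of true actives recovered
    specificity = tn / (tn + fp)        # fraction of true inactives recovered
    hit_rate = tp / (tp + fp)           # fraction of predicted actives that are active
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "hit_rate": hit_rate,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
for name, value in classification_metrics(tp=45, fp=55, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```

The false positive rate is simply 1 minus the specificity, which is how the value of 0.76 quoted in the text follows from the overall specificity of 0.24.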
Figure 2-22: The p_value generated from conformal prediction as plotted against the Tanimoto similarity of compounds to the nearest compound neighbour in the training set (calculated from 512-bit Morgan fingerprints). A=active, N= not active. The overall trend is towards a higher confidence for molecules with a more similar compound neighbour in the training set. The false positive rate decreases as the p_value cut-off is increased to 0.3 for all bromodomains. Hit rate is increased to 68.7 % from 34.1 %. Many of the incorrectly predicted single-label actives are now classified as the “neither” class, which means they are considered outside of the newly defined applicability domain. However, improved hit rate comes with decreased diversity of screening hits.
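The similarity axis of Figure 2-22 can be illustrated with a short sketch of the Tanimoto coefficient and a nearest-neighbour lookup. Fingerprints are represented here as sets of on-bit indices, which is an assumption made to keep the example free of dependencies; the bit indices are invented and do not correspond to real Morgan fingerprints of any compound in this study.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def nearest_neighbour_similarity(query, training):
    """Highest Tanimoto similarity of the query to any training fingerprint."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


# Toy on-bit sets (indices within a 512-bit fingerprint); purely illustrative.
query = {1, 5, 9, 200}
training = [{1, 5, 9, 300}, {2, 3}]
print(nearest_neighbour_similarity(query, training))  # 0.6
```

In this toy case the query shares three of five distinct on-bits with its closest training compound, giving a nearest-neighbour similarity of 0.6.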
Table 2-10: Confusion matrix for overall performance for experimental predictions when the p_value threshold was raised to 0.3 (corresponding to a confidence level of 0.7).
Raised threshold to p_value > 0.3
Experimental
Prediction
A N
A 169 95
N 8 146
2.4 Conclusions
In this work we have generated the first PCM models for the bromodomain family of proteins,
using a dataset of 15,350 data points across 31 bromodomains. Comparison of different
algorithms, compound descriptors and protein descriptors allowed us to produce a high
performing PCM model using Morgan fingerprints as compound descriptors, physicochemical
property alignment-dependent descriptors as target descriptors, and the RF algorithm,
achieving an MCC of 0.83 on the external test set. PCM outperforms other techniques that can be
applied on the dataset, including global QSAR (MCC 0.50), global QSAM (MCC 0.38) and
average individual QSAR (mean MCC 0.57 QSAR vs mean MCC 0.63 for PCM), showing that
compound and target descriptors together should be used for modelling this bromodomain
data. The use of histone peptide array data as a replacement for target descriptors also resulted
in a highly predictive PCM model (MCC 0.80), showing that information from histone peptide
binding preferences can inform small molecule SAR. We prospectively validated the model
using the cross-conformal prediction framework to shortlist compounds for experimental
validation, to test the ability of the model to 1. interpolate between new combinations of training
set compounds and targets, 2. predict selectivity profiles across bromodomains and 3. find novel
and diverse active hits for bromodomains. Overall, we found that the accuracy of predictions
scaled with the similarity to the training set. The model achieved a performance of an MCC of
0.24 across the 1,139 compound-target pairs which were tested experimentally, identifying 319
new bromodomain active hits. We show that Selectivity Profiles including activity at BRD4 BD1,
in combination with activities at the other bromodomains, are often correctly predicted,
and we rationalise the predictions for examples of interesting selectivity profiles found from the
experimental testing. Furthermore, we propose that the p_values derived from conformal
prediction can be used to fine-tune the hit rate versus compound hit diversity in a
virtual screening setting. This study contributes towards addressing the need for a better
method to predict selectivity for the bromodomain family, which is becoming increasingly
important due to the biological relevance of bromodomains in disease.
3 Selectivity Insight for the Rational Design of
Bromodomain Inhibitors Derived from
Proteochemometric Models
3.1 Introduction
The previous chapter addressed the prediction of the binding of small molecules to
bromodomain-containing proteins using proteochemometric (PCM) modelling. We showed that
PCM models were able to distinguish selectivity between bromodomains; we now
focus on the interpretation of the PCM models to provide insight into which biological features
were important for determining activity and selectivity for bromodomains. The aim was to test
the hypothesis that model features corresponded to observed binding interactions in existing
and novel liganded crystal structures, and revealed new insights into binding interactions.
As mentioned previously, since small molecule bromodomain binding requires the mimicry of
endogenous peptide binding, chemotypes which bind to one bromodomain often also bind to
other bromodomains. This lack of selectivity is particularly notable between bromodomains
with a high binding site sequence conservation.31 Therefore, the challenge of designing a
selective inhibitor remains. Although this challenge exists, increased efforts to produce
chemical probes for the different members of the bromodomain family of proteins have led to
the identification of certain target features deemed important for gaining selectivity for one
bromodomain over another.
Table 1-2 in the introduction provides a summary of a literature survey of residues in the active
site of bromodomains (modelled within this study) which interact with small molecules. We
will use this information to compare to the residues identified as important for binding, derived
from the interpretation of the machine learning models trained on a large bioactivity dataset in
this study. We aim to confirm and extend existing knowledge to provide a more comprehensive
picture of selectivity.
3.2 Materials and Methods
3.2.1 Computational models
PCM models were generated as in the previous chapter and publication.367 The dataset used to
generate the models was extracted from the public and licensed sources of ChEMBL330,
PubChem331, ChEpiMod332, GOSTAR333 and the manual extraction of data from publications,54–
56,58,59,65,74,82,94,334–338, as well as AstraZeneca proprietary databases. The final dataset contained
15,350 data points; 6,352 compounds across 31 bromodomains. For further details of the
curation process and the dataset composition, the reader is referred to the previous chapter.367
The number of datapoints per bromodomain, split into active and inactive classes is detailed in
Table 8-1.
512-bit hashed binary Morgan fingerprints of radius 2 were generated using the Python RDKit
module353 and Z-scales 3 265 alignment-dependent target descriptors were calculated using the
function AADescs from the camb306 package in R. The PCM model was generated using the
camb and caret packages in R,53,71 as detailed in the previous chapter.367
3.2.2 Model Interpretation
Z-Scales 3 descriptors make up the first three principal components (PCs) of amino acids
encoded by 26 physicochemical properties. When inspecting the variable loadings of these 26
properties for each principal component it can be observed, as highlighted in the previous study
by Sandberg et al.265 and summarized in Table 3-1, that the three principal components are
primarily described by properties related to lipophilicity, size/polarizability and polarity for the
first, second and third Z-Scales respectively. Therefore, in this study the three principal
components were interpreted as lipophilicity, size/polarizability and polarity descriptors of
amino acids in each residue alignment position. Z-Scales 3 were used instead of the Z-Scales 5
descriptors because the 4th and 5th PCs are difficult to interpret and account for only a small
proportion of the variance.265
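As an illustration of how alignment-dependent Z-Scales descriptors can be laid out, the sketch below maps each aligned binding site residue to one named feature per principal component, using the Z3_X_AAY naming convention given in the figure captions of this chapter. It is not the camb AADescs implementation. The lookup table is deliberately partial: it contains only the Z1 and Z2 values for aspartic acid and histidine that are quoted later in this chapter for alignment position 33, and the function and variable names are hypothetical.

```python
# Partial lookup table: Z1 (lipophilicity) and Z2 (size/polarizability) for the
# two residues whose values are quoted in this chapter. Z3 and the remaining
# amino acids are omitted rather than guessed.
Z_SCALES = {
    "D": (3.98, 0.93),  # aspartic acid
    "H": (2.47, 1.95),  # histidine
}


def encode_binding_site(residues, table=Z_SCALES):
    """Map {alignment position: one-letter residue} to named descriptor values."""
    features = {}
    for position, amino_acid in sorted(residues.items()):
        for component, value in enumerate(table[amino_acid], start=1):
            features[f"Z3_{component}_AA{position}"] = value
    return features


# Alignment position 33 is D144 in BRD4 BD1 and H437 in BRD4 BD2.
print(encode_binding_site({33: "D"}))
print(encode_binding_site({33: "H"}))
```

Because every bromodomain is described at the same alignment positions, the resulting feature names line up across targets, which is what allows a single model to compare residues between bromodomains.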
Table 3-1: Z-Scales 3 principal component interpretation and variable loadings.
Principal Component (PC) | Interpretation | Variables with Highest Contribution to PC
1 | Lipophilicity | Log10(octanol/water) partition coefficient (LogP), non-polar surface area (Snp), TLC variables, polar surface area (Spol), number of hydrogen bond acceptors (HACCR)
2 | Size/Polarizability | Molecular weight (MW), side chain van der Waals volume (vdW), total surface area (Stot), polarizability (POLAR)
3 | Polarity | Electrophilicity (ELUMO), NMR α-proton shift at pD 1 and 7 (NM1, NM7), electronegativity (EN)
The PCM model was interpreted on two levels, namely global and local feature importance, with
the aim to find binding site amino acid residues and their properties which most highly
influenced the model discrimination between classes.
Global Feature Importance
For the global feature importance assessment, the varImp function from caret was used to rank
the most important variables, using the mean decrease in the Gini impurity.368 The Gini
impurity is an approximation to the entropy and measures how well a variable in the input
dataset splits the instances at each node into homogeneous classes at the successive nodes (see
Decision Trees for further details). The lower the value of the Gini impurity after a split, the more
effective the variable was at splitting instances into classes. To calculate the mean decrease in Gini
impurity for a variable, the decrease in Gini impurity was recorded each time the variable was used
to make a split in the forest, and these values were averaged. The higher the mean decrease in Gini impurity, the
more important the feature was for the classification of the training set instances.369 The top 10
% of input features were then ranked according to this metric. These important residues were
mapped to a reference crystal structure PDB 2YEM in BRD4 BD2370 for visualisation, generated
in MOE.355
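The quantity behind this ranking can be shown for a single split with a minimal sketch. It uses the textbook definition of Gini impurity and the size-weighted impurity decrease for one node; the way such decreases are accumulated over the splits and trees of a forest inside the R packages is not reproduced here, and the class labels are toy data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node with four actives (A) and four not actives (N).
parent = ["A"] * 4 + ["N"] * 4
print(gini_decrease(parent, ["A", "A", "A", "N"], ["A", "N", "N", "N"]))  # 0.125
print(gini_decrease(parent, ["A"] * 4, ["N"] * 4))                        # 0.5
```

A variable that separates the classes perfectly removes all of the parent impurity (0.5 here), whereas a weaker split removes only part of it, so averaging these decreases ranks variables by how cleanly they separate actives from inactives.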
Local Feature Importance
The package rfFC213 in R was used to compute the local feature contributions per instance in the
training set. This method is discussed in more detail in the section Interpretation of Random
Forests.
The features in the model can be interpreted from this method on a per-instance basis (in our
case compound-target pairs) or for groups of the data (e.g. per target or per subfamily). For the
analysis of selectivity between BD1 and BD2 domains in BET bromodomains, only the
compound-target pairs for which there was an activity annotation at these domains were
considered. This data was then split into the active and inactive compound-target pairs forming
two separate subsets, and the median feature contribution towards active class was calculated
for each of the subsets (where a positive median feature contribution indicated that the feature
contributed towards classifying the example as active and a negative median feature
contribution indicated that the feature contributed towards classifying the example as inactive).
Features with a positive median feature contribution towards classifying active instances and
features with a negative median feature contribution towards classifying inactive instances for
each of BD1 and BD2 domains were then analysed. For the selectivity analysis for CREBBP over
BRD9, data was split firstly by domain and then by activity classification, forming four subsets
of the data in this case. In a similar process to above, the median feature contribution values
towards the active class were calculated for each of the subsets. For each bromodomain, features
with a positive median feature contribution towards classifying active instances and features
with a negative median feature contribution towards classifying inactive instances were retained
for analysis. Finally, for the individual bromodomain activity analysis, the data was grouped by
domain, followed by selecting those with the “active” label to discover important features for
classifying actives at each bromodomain. Aggregation over the group was performed by taking
the median value of the feature contribution, to avoid the influence of outliers, where a variable
might be very important for one instance, but not generally for the group. Values for the median
importance of each variable towards the active class for each bromodomain group, as well as
the lower and upper quartiles and the interquartile range (IQR) are listed in Supplementary
Data File 2. The IQR gives an indication of how representative the median is of all values in the
data subset by providing an indicator of spread and therefore reliability. Features with a median
value of greater than 0 were classified as important towards activity for each bromodomain for
the purposes of the discretized visualisation in Figure 3-4. Those variables encoding the
lipophilicity of each residue which were important to classifying active compounds for each
bromodomain were visualised quantitatively by their median feature contribution values.
Figures were generated using ggplot2371 in R.
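The aggregation step described above can be sketched as follows. The example groups per-instance feature contributions by bromodomain, activity label and feature, then reports the median, quartiles and IQR for each group. The input rows are hypothetical values rather than output from rfFC, and the inclusive quartile method is an assumption, since the quartile convention used for Supplementary Data File 2 is not stated.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarise_contributions(rows):
    """rows: iterable of (target, label, feature, contribution towards active)."""
    grouped = defaultdict(list)
    for target, label, feature, value in rows:
        grouped[(target, label, feature)].append(value)

    summary = {}
    for key, values in grouped.items():
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = q3 = values[0]  # quartiles are undefined for a single value
        summary[key] = {"median": median(values), "q1": q1, "q3": q3, "iqr": q3 - q1}
    return summary


def important_for_activity(summary, target):
    """Features whose median contribution over the target's actives is above 0."""
    return sorted(
        feature
        for (tgt, label, feature), stats in summary.items()
        if tgt == target and label == "active" and stats["median"] > 0
    )


# Hypothetical contributions; the last value is an outlier.
rows = [("BRD9", "active", "Z3_1_AA34", v) for v in (0.01, 0.02, 0.03, 0.04, 0.50)]
rows += [("BRD9", "active", "Z3_2_AA6", v) for v in (-0.02, -0.01, 0.00)]
summary = summarise_contributions(rows)
print(summary[("BRD9", "active", "Z3_1_AA34")])
print(important_for_activity(summary, "BRD9"))
```

In the toy data a single outlying contribution of 0.50 leaves the median at 0.03, whereas the mean would be 0.12, which illustrates why the median was preferred for aggregation.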
3.2.3 X-ray Crystallographic Structure Determination for BRD4 and BRD9
Four novel bromodomain ligands identified in the previous chapter 367 were selected for crystal
structure determination in the available proteins BRD4 BD1 and BRD9. The crystal structures
were determined by Marianne Schimpl in AstraZeneca and the protein was prepared within
AstraZeneca.
The bromodomain of BRD9, amino acids 14–134, and the first bromodomain of BRD4 (BRD4
BD1), amino acids 42–169, were recombinantly expressed in E. coli with an N-terminal
hexahistidine tag. Proteins were purified by immobilised metal-affinity chromatography,
proteolytic cleavage of the affinity tag and size-exclusion chromatography.
BRD4 BD1 was concentrated to 9.2 mg/ml in 100 mM HEPES pH 7.5, 100 mM NaCl, 1 mM DTT.
Sitting drop vapour diffusion crystallisation experiments were set up by combining 200 nl of
protein with 200 nl of reservoir solution, followed by incubation at 4 °C. The complex with 4
was obtained by co-crystallisation with 1 mM ligand, where the buffer condition contained 0.2
M MgCl2, 0.1 M Tris pH 8.5 and 20 % PEG 8000. Crystals were cryo-protected by 1-second
immersion in reservoir solution supplemented with 20 % ethylene glycol, and flash-frozen in
liquid nitrogen. The other complexes with 5, 6 and 7 were obtained by soaking of BRD4 BD1
crystals grown in 0.2 M LiCl, 30 % PEG 4000, 0.1 M Tris pH 8.5. Crystals were soaked for 16
hours in reservoir solution containing 5 mM ligands 2 or 4 and 10 % DMSO, and harvested
without further cryoprotection.
BRD9 was concentrated to 14.5 mg/ml in 50 mM HEPES pH 7.5, 250 mM NaCl, 0.5 mM DTT,
and crystallised as above in 20 % PEG8000, 0.1 M CHES pH 9.5. Bar-shaped crystals appeared
within 3 days and were soaked in a 5 mM solution of ligand 4 as described above.
X-ray diffraction data were collected at Soleil Synchrotron beamline Proxima I and at Diamond
Light Source beamlines I03 and I04. Data were processed using XDS372 and Aimless373 from the
CCP4 suite374. Structures were solved by molecular replacement using Phaser375, and refined
using Buster376 and Coot377. Initial ligand topologies were calculated by Grade378. Data collection
and refinement statistics are summarised in Table 8-6.
MOE355 was used to interpret crystal structures and produce visualisations.
3.3 Results and Discussion
In this study we used the PCM model generated in our previous study for feature interpretation.
The model, built using Morgan fingerprints as compound descriptors and Z-Scales 3 as target
descriptors, achieved a ROC AUC of 0.97 and a Matthews correlation
coefficient (MCC) of 0.83 on the external test set. This performance was comparable with
previous PCM models on other target classes.249,308,311 Further details of PCM model performance
and benchmarking can be found in chapter 2.367 Importantly, high model performance is key to
the meaningful interpretation of features, since we can be confident that the most highly
contributing features are leading to the correct classification. However, it must be noted that
not all features found from this analysis will necessarily be relevant, as there will be false positive
results for various reasons, including the class balance between actives and inactives, as well as
biases in the number of data points for each bromodomain (Table 8-1).
3.3.1 Interpretation
We next assessed the model as a tool to interpret selectivity. Two levels of interpretation were
used to analyse PCM models: global and local feature importance, as described in the methods
section.
Z-scales 3 descriptors were used to describe the binding site amino acids in each alignment
position based on their properties. Z-scales are derived by principal component analysis (PCA)
of 26 different physicochemical properties. We attribute the three principal components (PCs)
of the Z-Scales 3 descriptors, which encode the binding site amino acids, to lipophilicity for PC1,
size/polarizability for PC2, and polarity for PC3, as discussed in the methods section. Notably,
this analysis did not consider binding site waters as descriptors in building PCM models and
assumes that only the direct interactions between ligand and protein are important
for binding. Previous studies have shown that interactions with stable waters in
bromodomains can lead to increases in binding affinity and selectivity; for example, it has been
shown that selectivity between BD1 and BD2 domains in BETs can be achieved through
interactions with the ZA channel water network.379
Global Selectivity
Firstly, the PCM model was interpreted on a global level to discover the binding site residue
features important across the whole dataset for the discrimination between the active and
inactive class. These residues correspond to regions of variability between bromodomains and
can be deemed selectivity residues. It should be noted that biases in the numbers of data points
and activity distributions for each bromodomain will have an impact on this interpretation;
these data imbalances are discussed in more detail in chapter 2.367 For example, the features
identified from this global analysis will be more likely to be relevant towards discriminating
between active and inactive compound-target pairs for bromodomains with a larger number
of datapoints, in particular the BET family in our study. Important target features, defined as
those with a large mean decrease in the Gini impurity, are summarised as binding site hotspots
for selectivity in Table 3-2 and represented visually as mapped onto the BRD4 BD2 structure
(one of the BET family members) in Figure 3-1. The residues for BRD4 BD2 are used in the next
section as a reference for the important alignment positions and the full alignment is shown in
Figure 8-1.
The alignment positions identified from the global feature importance analysis predominately
correspond to residues on the ZA and BC loop regions. Figure 3-1 shows that the residue position
corresponding to V440 in BRD4 BD2 is important towards discrimination between active and
inactive classes across all bromodomains. This residue neighbours the gatekeeper residue (V439
in BRD4 BD2), which is often used to classify the bromodomain family77. Y377 neighbours the
hydrophobic shelf (known as the WPF shelf in BETs) present in some bromodomains, and the
size of the residue in this position was deemed important for selectivity, on average, across the
whole set of bromodomains. C391, L387, M398, V382, Y377, Y372 and G386 were important
from our analysis due to their size. These residues are positioned on the ZA loop which is known
to be conformationally different between bromodomains, for example, in BRD1 and BRD9 the
ZA channel region has been shown to be conformationally more open, which was used as a
factor for grouping these bromodomains based on druggability scores by Vidler et al.318 The
importance of the ZA loop is furthermore corroborated by IACS-9571, a selective inhibitor for
the BRD1/BRPF bromodomains, which occupies a more open ZA channel; a greater affinity
for BRPF was achieved by positioning bulky groups in the 6-paramethoxy position of the
1,3-dimethylbenzimidazolone core, which occupies the ZA channel.61 A similar observation was made
for LP99, a BRD9 selective inhibitor which occupies the ZA channel in this bromodomain.56
In conclusion, we find that the residue positions identified to be important for determining
selectivity across all bromodomains corresponded to the residues highlighted in the literature
for obtaining selectivity for bromodomain subtypes.
Table 3-2: Top ranked binding site amino acids from the Z-Scales 3 model by mean decrease in Gini impurity. For each amino acid, the table lists the residue position in the alignment (Figure 8-1), the region in the bromodomain secondary structure, the number and type of residue in the BRD4 BD2 (2YEM) crystal structure and the broad property types encoded by the Z-Scales. The table corresponds to the visual representation of the binding site hotspots in Figure 3-1.
Amino Acid Alignment Number | Region of binding site | Residue Number in BRD4 BD2 | Z-Scales property encoding
1 | ZA loop | Y372 | Size
6 | ZA loop | Y377 | Size
11 | ZA loop | V382 | Size
15 | ZA loop | G386 | Lipophilicity, size
16 | ZA loop | L387 | Lipophilicity, size
20 | ZA loop | C391 | Lipophilicity, size, polarity
23 | ZA loop | M398 | Lipophilicity, size
29 | BC loop | P435 | Lipophilicity
32 | BC loop | D436 | Lipophilicity, polarity
33 | BC loop | H437 | Lipophilicity, size
36 | αC helix (neighbouring GK residue) | V440 | Lipophilicity, size, polarity
Figure 3-1: Binding site hotspots. Target features (Z-Scales 3) with the largest mean decrease in Gini impurity from the PCM model were mapped back to their alignment position in BRD4 BD2. The residue positions are coloured by the property of the amino acids in the alignment position which was successful in the discrimination between active and inactive classes.
Bromodomain Subfamily Activity/Selectivity
Whilst global importance analyses, including the importance measures of the Gini index and
the mean decrease in accuracy (see Interpretation of Random Forests), are used routinely to
interpret machine learning models and can offer insight such as highlighted in the previous
section,380 the use of these measures for this model is limited, since the feature importance is
averaged over the whole dataset and multiple targets. Therefore, it is more beneficial to apply a
feature importance method which can identify residue positions which are important for
classifying subsets of the data and, more specifically in this study, bromodomain subfamilies or
individual bromodomains. This local importance method,381 (described in the methods section)
identifies residues in the active site of each group (subfamily, individual bromodomain etc),
which were important towards classifying active or inactive compound-target pairs in the PCM
model by taking the median importance value across all the active or inactive instances for the
group.
Selectivity Between BET bromodomains: BD1 vs BD2
Understanding the drivers of BD1 versus BD2 selectivity for BET bromodomains is an area of
interest as many inhibitor molecules tend to display activity for both domains.36 Figure 3-2
shows the model interpretation results displaying those target features (encoded by Z-Scales 3)
which were important towards classifying active and inactive instances for BD1 BET domains
(Figure 3-2a) and BD2 BET domains (Figure 3-2b). The differences between the feature
importance for each of these subsets of the data can be studied for selectivity insight. One of
the most noticeable differences between BD1 and BD2 domains was that for the amino acid in
alignment position 6, all three properties of the amino acid (lipophilicity, size and polarity)
contributed towards classification for the inactive class for BD1 (blue bar with negative median
feature contribution), whereas there was a positive contribution towards the classification of
active instances for BD2 domains (orange bars with positive median feature contributions). This
residue position neighbours the WPF shelf in BET bromodomains. For the BD2 domains this
residue is tyrosine (Y377 in Figure 3-1) and in BD1 domains this residue varies between
glutamine (BRD4 BD1 and BRDT BD1), arginine (BRD2 BD1) and tyrosine (BRD3 BD1). The
variable describing the size (Z2 Z-Scale) has a larger positive contribution for active instances
for BD2 domains than the lipophilicity and polarity (Z1 and Z3 Z-scales) variables, and the size
variable (Z2 Z-Scale) has a larger negative contribution towards the classification of inactive
instances than the lipophilicity and polarity (Z1 and Z3 Z-scales) variables. This shows that,
although all three properties of this amino acid encode selectivity for BD2 over BD1 domains,
the size/polarizability of the residue in this position is particularly important for selectivity. The
amino acid in position 32 was found to be important towards the classification of inactives for
BD1 domains for all three principal components and has a small contribution to the
classification of actives for the lipophilicity and polarity descriptors (Z1 and Z3 Z-Scales) (Figure
3-2a). For BD2 domains there are small negative contributions of this residue to the prediction
of inactives for lipophilicity and polarity (Z1 and Z3 Z-Scales) and a larger
positive contribution towards the classification of actives for the size of the residue (Z2 Z-Scale)
(Figure 3-2b). For BD1 domains this residue is either threonine (BRD2 BD1 and BRD3 BD1) or
glycine (BRD4 BD1 and BRDT BD1), and for the BD2 domains the residue is consistently aspartic
acid. In BD1, the second Z-scale for this residue contributes towards the inactive class, whereas
for BD2 the second Z-scale is important towards the active class (Figure 3-2). This shows that
the size and polarizability of this residue are important for activity (aspartic acid being much
larger than threonine and glycine) and suggests that the aspartic acid is a key residue
for obtaining BD2 selectivity over BD1. These two discussed residues have not been identified
previously as selectivity residues for BD1 vs BD2 domains by an analysis of the sequence identity
variation,31 or in the literature search presented in Table 1-2, and so represent new information
which can be used to provide selectivity between BD1 and BD2 BET bromodomains by exploiting
interactions with amino acids in the binding site. One of the previously reported alignment
positions suggested to be exploitable for selectivity for BET bromodomains corresponds to
position 33 in our analysis. For this residue, the median feature contributions for all three
principal components are positive for the classification of actives against both BD1 and BD2
domains for dataset compounds, with higher contributions for the lipophilicity and size (Z1 and
Z2 Z-Scales) features (Figure 3-2). Amino acid 33 corresponds to D144 (BRD4 BD1) and H437
(BRD4 BD2). Because the lipophilicity and size of these amino acids are similar (aspartic acid
Z1 Z-Scale value is 3.98, Z2 Z-Scale value is 0.93, histidine Z1 Z-Scale value is 2.47, Z2 Z-Scale
value is 1.95),265 this may contribute to a more similar interaction across BD1 and BD2 domains
than previously expected. Further details of BD1 vs BD2 selectivity are discussed in the section
Type 2 bromodomains.
Example of Selectivity for CREBBP over BRD9
Next, we highlight an example of how the analysis was able to find a key selectivity residue for
CREBBP over BRD9, by examining the feature contributions for individual bromodomains.
Figure 3-3a shows all positive median feature contributions for classifying actives at CREBBP
(orange bars) and all negative median feature contributions for classifying inactives at CREBBP
(blue bars). Figure 3-3b shows the same analysis but for BRD9. Residue 34 has a large positive
feature contribution for the first two Z-Scales encoding lipophilicity and size for classifying
active compounds for CREBBP. Conversely, there is a large negative contribution from all three
Z-Scales for the classification of inactives at BRD9. Combining this information on the role of
the residue in position 34 for these two bromodomains, we can say that this residue is important
for selectivity for CREBBP over BRD9. This can be validated with the literature observation of
the well-known cation-π interaction that is made between the arginine R1173 in position 34 in
CREBBP and multiple selective inhibitor scaffolds. Cortopassi et al., 2016 discuss how
interaction strength relates to binding affinity for CREBBP over multiple chemical series of
CREBBP inhibitors.82 Figure 3-3c shows a selective CREBBP inhibitor (Compound 4) from the
training set, which makes a cation-π interaction between the arginine (R1173) residue and the
aromatic ring system (PDB 4NR7). This example shows that the PCM model was able to capture
this selectivity residue as important for binding to CREBBP and the closely related EP300
protein and captures that this residue was not important for binding in BRD9.
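The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the code used in this work: the record layout, the function names and all numerical values are hypothetical placeholders, chosen to show how a positive median contribution for actives at one target combined with a negative median contribution for inactives at another target flags a candidate selectivity descriptor.

```python
from statistics import median

# Hypothetical per-instance feature contributions (placeholder values only).
rows = [
    {"target": "CREBBP", "label": "active",
     "contributions": {"Z3_1_AA34": 0.02, "Z3_1_AA10": -0.01}},
    {"target": "CREBBP", "label": "active",
     "contributions": {"Z3_1_AA34": 0.04, "Z3_1_AA10": 0.005}},
    {"target": "BRD9", "label": "inactive",
     "contributions": {"Z3_1_AA34": -0.03, "Z3_1_AA10": 0.01}},
    {"target": "BRD9", "label": "inactive",
     "contributions": {"Z3_1_AA34": -0.01, "Z3_1_AA10": 0.02}},
]


def median_contributions(records, target, label):
    """Median contribution per feature for one target and one class label."""
    by_feature = {}
    for record in records:
        if record["target"] == target and record["label"] == label:
            for feature, value in record["contributions"].items():
                by_feature.setdefault(feature, []).append(value)
    return {feature: median(values) for feature, values in by_feature.items()}


def selectivity_candidates(records, target_a, target_b):
    """Features with a positive median for actives at target_a and a
    negative median for inactives at target_b."""
    actives_a = median_contributions(records, target_a, "active")
    inactives_b = median_contributions(records, target_b, "inactive")
    return sorted(
        feature for feature, value in actives_a.items()
        if value > 0 and inactives_b.get(feature, 0.0) < 0
    )


candidates = selectivity_candidates(rows, "CREBBP", "BRD9")
print(candidates)
```

With these placeholder values only the descriptor for alignment position 34 satisfies both conditions, which mirrors the pattern discussed for CREBBP and BRD9.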
Figure 3-2: The feature contribution values for features with a positive median feature contribution towards the classification of active (orange) instances and a negative median feature contribution towards inactive (blue) instances over all a) BD1 and b) all BD2 domains within the BET bromodomains in the dataset. Descriptors were encoded in the form Z3_X_AAY, where X is the principal component corresponding to 1-lipophilicity, 2-size and 3-polarity, and Y is the alignment position of the amino acid.
[Figure 3-2 panels: a) BD1 BET Domains; b) BD2 BET Domains. Key: median value for active instances; median value for inactive instances.]
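[Figure 3-3 panels: a), b) feature contribution plots for CREBBP and BRD9 (key: median value for active instances; median value for inactive instances); c) structural overlay of CREBBP (4NR7) and BRD9 (4Z6I) with Compound 4, labelled with the ZA loop, BC loop, αC helix, R1173, V105 and the π-cation interaction.]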
Figure 3-3: Target based feature interpretation example. Shows variables with median positive feature contributions towards classifying active compound-target pairs and median negative feature contributions towards classifying inactive compound-target pairs for a) BRD9 and b) CREBBP. Highlighted in blue boxes are the target descriptors for the residue in alignment position 34, showing a positive contribution towards the prediction of the active class for CREBBP and a negative contribution towards the prediction of the inactive class for BRD9. c) Interpretation of the selectivity residue in amino acid position 34, showing compound 4, a selective CREBBP inhibitor which makes a key interaction with an arginine residue in CREBBP, which cannot be exploited by the V105 in this position for BRD9. Overlay of CREBBP (4NR7, yellow), with BRD9 (4Z6I, green). Descriptors were encoded in the form Z3_X_AAY, where X is the principal component corresponding to 1-lipophilicity, 2-size and 3-polarity, and Y is the alignment position of the amino acid.
Individual Bromodomain Activity
In this section we also used the local feature importance method to analyse the median feature
contributions for subsets of the data, in this case active compounds at each bromodomain. We
provide the upper and lower quartile values, and therefore the interquartile range as a measure
of the spread of the data from the median to show the reliability of these values for representing
the feature contributions of the whole subset (here active compound-target pairs for each
bromodomain) (Supplementary Data File 2).
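The summary statistics described above (median, upper and lower quartiles, and interquartile range) can be computed as in the following minimal sketch. The quartile convention used for Supplementary Data File 2 is not stated here, so the `method="inclusive"` setting is an assumption, and the contribution values are hypothetical placeholders.

```python
from statistics import median, quantiles


def summarise(values):
    """Median, quartiles and interquartile range of a list of contributions.

    Requires at least two values. The 'inclusive' quartile method is an
    assumption made for this sketch.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median(values), "q1": q1, "q3": q3, "iqr": q3 - q1}


# Placeholder contributions of one descriptor over the active
# compound-target pairs of a single bromodomain.
contributions = [0.01, 0.02, 0.03, 0.04, 0.05]
summary = summarise(contributions)
print(summary)
```

A narrow interquartile range relative to the median suggests that the median is representative of the whole subset; a wide range suggests the contribution varies strongly between compounds.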
Profiles of the important residues for classifying active compound-target pairs for each
bromodomain are highlighted in Figure 3-4, along with the important property of each amino
acid, defined as either lipophilicity, size, polarity or a combination of these. Highly conserved
alignment positions (positions 5, 8, 9, 19, 22, 24, 28 and 30) were removed from the input
features for model construction due to low variance and thus do not appear as important
features for any bromodomain; this does not necessarily mean they are unimportant for binding,
but that this analysis will not pick up the constant binding features for all bromodomains. An
example of this is the asparagine in alignment position 28, which is necessary for the binding of
the acetyl lysine mimetic in bromodomain inhibitors.70
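The effect of removing low-variance alignment positions can be illustrated with the short sketch below. The variance threshold applied in this work is not given in this section, so the default of 0.0 is an assumption, and the constant column is a placeholder standing in for a fully conserved position. The three values for position 13 are the first Z-scale values quoted later in this chapter for lysine, alanine and leucine.

```python
from statistics import pvariance


def drop_low_variance(columns, threshold=0.0):
    """Keep only descriptor columns whose population variance across
    targets exceeds the threshold (threshold value is an assumption)."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) > threshold:
            kept[name] = values
    return kept


columns = {
    # Placeholder constant column: same residue in every target.
    "Z3_1_AA28": [1.0, 1.0, 1.0],
    # Lysine, alanine and leucine first Z-scale values quoted in the text.
    "Z3_1_AA13": [2.29, 0.24, -4.28],
}
kept = drop_low_variance(columns)
print(list(kept))
```

A descriptor removed in this way can never receive a feature contribution, which is why conserved residues such as the asparagine in position 28 do not appear in the interpretation even though they are required for binding.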
Figure 3-4: Alignment map for the binding site residues of bromodomains used in this study. Each residue is coloured according to the property (or properties) of the amino acid identified as important towards the classification of actives for each bromodomain from the local feature contribution method. The threshold for determination of importance towards activity was a median feature contribution of greater than 0. “None” refers to those residues which were not found to be important for the classification of actives from our model interpretation. “-” represents alignment gaps in the sequence. Alignment positions 5, 8, 9, 19, 22, 24, 28 and 30 were removed as model input features due to low variance of amino acid properties across targets. For the quantitative values of each median feature importance, as well as the spread of feature importance from the median, the reader is referred to Supplementary Data File 2.
[Figure 3-4 panels: Type 1, Type 2, Type 3, Type 4, Type 5, Type 7 and Type 8 bromodomains.]
Individual Bromodomain Activity: Lipophilicity
Whilst Figure 3-4 shows the residue positions that are important, for any of the three Z-Scales
principal components, towards the classification of actives, it does not quantify the feature contribution.
Therefore, we next analysed quantitatively the contribution of the first principal component,
which is related to lipophilicity (see methods), to the activity of ligands against individual
bromodomains within each subfamily. Figure 3-5 shows the relative magnitude of the positive
contribution of the lipophilicity of each amino acid towards the classification of activity. A larger
positive value indicates higher importance towards the classification of the active compounds
against the target and a value of zero indicates that the feature was not important for classifying
active compounds for this bromodomain.
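As a minimal sketch of how such a lipophilicity profile could be assembled, the snippet below parses descriptor names of the form Z3_X_AAY (as defined in the figure captions) and keeps the positive median contributions of the first principal component. The function names and the numerical values are hypothetical placeholders, and treating non-positive medians as "not important" follows the threshold stated for Figure 3-4.

```python
import re

# Z3_X_AAY: X is the principal component, Y the alignment position.
DESCRIPTOR = re.compile(r"^Z3_([123])_AA(\d+)$")
PROPERTY = {1: "lipophilicity", 2: "size", 3: "polarity"}


def parse_descriptor(name):
    """Return (property, alignment position), or None for other features."""
    match = DESCRIPTOR.match(name)
    if match is None:
        return None
    return PROPERTY[int(match.group(1))], int(match.group(2))


def lipophilicity_profile(median_contributions):
    """Positions with a positive median contribution for the first
    principal component, ordered from most to least important."""
    profile = {}
    for name, value in median_contributions.items():
        parsed = parse_descriptor(name)
        if parsed is None:
            continue  # e.g. a compound descriptor
        prop, position = parsed
        if prop == "lipophilicity" and value > 0:
            profile[position] = value
    return dict(sorted(profile.items(), key=lambda item: item[1], reverse=True))


# Placeholder median contributions for one bromodomain.
example = {"Z3_1_AA3": 0.012, "Z3_2_AA3": 0.004,
           "Z3_1_AA4": 0.009, "Z3_1_AA20": -0.002}
profile = lipophilicity_profile(example)
print(profile)
```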
Type 1 bromodomains
For all four of the type 1 bromodomains (BPTF, CECR2, KAT2A and PCAF) the amino acids in
positions 3 (tryptophan) and 4 (proline) were important for activity (Figure 3-4, Figure 3-5). The
median contributions of these features can be found in the Supplementary Data File 2. Positions
3 and 4 form the WPF shelf hydrophobic region, known to be important for targeting BET
bromodomains.31 There is a good weight of evidence to support the role of tryptophan in
position 3 and proline in position 4 in binding interactions with type 1 inhibitors and these
residues are highlighted in Table 1-2. For example, the CECR2 inhibitor GNE-886 makes
hydrophobic interactions with W457 and P458 of the WPF shelf.52 Additionally, molecular
dynamics studies for the binding of the inhibitor Rac-1 to BPTF identified interactions with
W2824,53 and the key for gaining selectivity for KAT2A and PCAF over BET bromodomains was
described to include interactions with W751/W746 and P752/P747.75 The other residue which
was consistently important for classification of active compound-target pairs for the type 1
bromodomains was in position 35, where this residue was either phenylalanine (BPTF) or
tyrosine (CECR2, KAT2A, PCAF). This is the residue known as the gatekeeper (GK) residue
across all bromodomains, and where this residue is tyrosine, this creates a narrower binding
pocket,75 and inhibitors have exploited π-stacking interactions with this residue.57 For PCAF,
many fragment hits are found to make an aromatic hydrophobic interaction with Y809,76 and
for CECR2, GNE-886 also forms a π-stacking interaction with Y520.52 For BPTF, F2887 (GK
residue) was also identified as important towards the selective binding of the Rac-1 inhibitor.
Overall, we identify that the commonly important residues for the type 1 bromodomains are
consistent with previous literature.
Some alignment positions were found to be important only for certain subfamily members.
Alignment position 1 was found to be important for only BPTF (Figure 3-5) with a median
feature contribution value of 0.019 (Supplementary Data File 2). This residue in BPTF is M2822
and, except in TIF1A, this is the only bromodomain featuring methionine in this position.
Hence, we hypothesise that we can make use of interactions with this methionine residue to
afford activity and selectivity for BPTF. Residues important for the classification of actives
against PCAF, but not other type 1 bromodomains, included residue 11 (R754) and residue 12
(T755) with the highest median feature contributions of 0.018 and 0.014 respectively. Other
residues contributed to a smaller extent (Figure 3-5, Supplementary Data File 2). When
comparing these positions to the residues in other bromodomains (Figure 3-4) we can see that
the arginine in position 11 is found for only four bromodomains, none of which are members of
subfamily 1, and that the threonine residue in position 12 is unique to PCAF. These two residues,
situated on the ZA loop, could be key selectivity features for PCAF activity over other
bromodomains and to the best of our knowledge are not currently known (Table 1-2). It is
known that the conformation of the ZA loop in PCAF is different to other bromodomains, for
example CREBBP, and that this is important for peptide binding.382 KAT2A shares one of its most
important residues with PCAF, which is amino acid position 29 (Figure 3-5), namely P810 and
P805 with feature contributions of 0.013 and 0.012 respectively (Supplementary Data File 2).
This residue is proline for a variety of bromodomains in our model (Figure 3-4), and it is the
most lipophilic residue in this alignment position. Therefore, we hypothesise that targeting this
residue with a lipophilic interaction could provide selectivity between type 1 bromodomains (i.e.
for KAT2A and PCAF over others) and activity against a range of bromodomains, since proline
is conserved across many targets.
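Statements such as "this is the only bromodomain featuring methionine in this position" amount to a lookup over the binding site alignment. The sketch below illustrates this with a deliberately partial alignment containing only the position 1 residues named in this chapter (methionine in BPTF and TIF1A, leucine in SMARCA2 and SMARCA4); the data structure and function names are assumptions for illustration, and the full alignment is given in Figure 3-4.

```python
from collections import Counter

# Partial alignment: target -> {alignment position: one-letter residue}.
# Only position 1 residues mentioned in the text are included.
PARTIAL_ALIGNMENT = {
    "BPTF": {1: "M"},
    "TIF1A": {1: "M"},
    "SMARCA2": {1: "L"},
    "SMARCA4": {1: "L"},
}


def targets_with_residue(alignment, position, residue):
    """Targets that carry the given residue at the given alignment position."""
    return sorted(
        target for target, residues in alignment.items()
        if residues.get(position) == residue
    )


def residue_counts(alignment, position):
    """How often each residue occurs at a position across all targets."""
    return Counter(
        residues[position] for residues in alignment.values()
        if position in residues
    )


print(targets_with_residue(PARTIAL_ALIGNMENT, 1, "M"))
print(residue_counts(PARTIAL_ALIGNMENT, 1))
```

A residue that occurs in only one or two targets at a position with a positive feature contribution is a natural candidate for a selectivity hypothesis.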
[Figure 3-5 panels: Type 1, Type 2, Type 3, Type 4, Type 5, Type 7 and Type 8 bromodomains. x-axis: Amino Acid Alignment Position; y-axis: Median Feature Importance for Actives.]
Figure 3-5: The median feature importance for the descriptor encoding lipophilicity (first Z-Scale) towards the active compound-target pairs for each bromodomain, as derived from the local feature contribution method. The plots are trellised by bromodomain subfamily. Those alignment positions with a high median feature importance correspond to residues (which can be identified from Figure 3-4) more important to the classification of active compound-target pairs for the bromodomain due to their lipophilic properties.
Type 2 bromodomains
The type 2 or BET bromodomains have the largest numbers of active compounds in the study
(Table 8-1) and so the feature interpretation for this subfamily is based on a larger amount of
data than other subfamilies. For type 2 (BET) bromodomains (Figure 3-5) the highest
contribution towards activity for all the type 2 bromodomains is position 36, adjacent to the
gatekeeper residue.(8) This residue is consistently occupied by a valine for all type 2
bromodomains, whereas for all non-BET bromodomains the position contains a different
residue (Figure 3-4). Other residues of common importance to the classification of BET
bromodomain activity included positions 16 (leucine), 29 (proline), 15 (glycine, asparagine,
glutamine and glutamic acid), 33 (BD1 aspartic acid, BD2 histidine), 3 (tryptophan) and 27
(methionine). Residue 33 has been noted as a key area of residue variation which has been
targeted to provide selectivity for BD1 vs BD2 BET.31,80 W81, the residue in position 3 in BRD4
BD1, was identified from machine learning models to be a key residue to interact with for activity
at BD1 domains of BETs,79 as well as being a part of the WPF shelf with which multiple inhibitors
make lipophilic interactions.80,383
As discussed previously, BD1 and BD2 domains of BET bromodomains have distinct functions;
however, it is challenging to design a small molecule to be selective for one domain over another
in the same protein.384 The differences in residue importance against the BD1 and BD2 domains
can offer further insights beyond the current knowledge (Table 1-2) for the design of selective
ligands. The lipophilicity of residue 17 is only found to be important towards the classification
of active instances for BD2 domains contained within type 2 bromodomains, but not for BD1
domains (Figure 3-4). Residue 17 is proline in BD1 domains and histidine in BD2 domains, while
the majority of other bromodomains have proline in this alignment position (Figure 3-4). This
residue is positioned at the bottom of the ZA loop and targeted interactions with the histidine
in this position could be exploited for selectivity. The lipophilicity of residue 6 is important for
classification of actives against BD2, but not BD1, domains in our model. We previously
highlighted this residue in the section Selectivity Between BET bromodomains: BD1 vs BD2. In
BD2 domains, this residue is tyrosine, whereas in BD1 domains this is arginine, tyrosine or
glutamine. The quantitative values for the first principal component (PC) of the Z-scale
(interpreted here as lipophilicity) were -2.54 for tyrosine, 3.52 for arginine and 1.75 for
glutamine,265 showing that this position in BD1 domains is hence occupied generally by more
hydrophilic residues. This residue has not been identified previously as important for selectivity;
however, the neighbouring residues Q85 (BRD4 BD1) and K378 (BRD4 BD2) have been
identified as a region of variation between BD1 and BD2 domains,31 and so this lipophilicity
change in alignment position 6 might also be important for binding in this region. Conversely,
the lipophilicity of amino acid 13 is specifically important for the classification of active instances
for BD1, but not BD2. Residue 13 is lysine in BD1 domains and alanine in BD2, showing a distinct
difference in lipophilicity (Z-scales first PC values 2.29 and 0.24 respectively).265 This residue
points into the binding site and will affect the properties of the binding site for an inhibitor. As
far as we know this has not yet been exploited for inhibitor design. For quantitative values of
the median feature contributions for the type 2 bromodomains, the reader is referred to
Supplementary Data File 2.
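The residue comparisons in this section reduce to differences in the first Z-scale value between the BD1 and BD2 residue at a given alignment position. The sketch below uses only the first principal component values quoted in this chapter; the dictionary and function names are hypothetical. As in the text, a lower first Z-scale value is interpreted as a more lipophilic residue.

```python
# First Z-scale (lipophilicity) values as quoted in this chapter.
Z1 = {"D": 3.98, "H": 2.47, "K": 2.29, "A": 0.24,
      "Y": -2.54, "R": 3.52, "Q": 1.75}

# (BD1 residue, BD2 residue) at alignment positions discussed in the text.
BD1_VS_BD2 = {33: ("D", "H"), 13: ("K", "A")}


def z1_shift(bd1_residue, bd2_residue):
    """Change in the first Z-scale value going from BD1 to BD2.
    A negative shift means the BD2 residue is the more lipophilic one."""
    return Z1[bd2_residue] - Z1[bd1_residue]


def more_lipophilic_domain(position):
    """Which domain carries the more lipophilic residue at this position."""
    bd1_residue, bd2_residue = BD1_VS_BD2[position]
    return "BD2" if z1_shift(bd1_residue, bd2_residue) < 0 else "BD1"


for position, (bd1_residue, bd2_residue) in sorted(BD1_VS_BD2.items()):
    shift = z1_shift(bd1_residue, bd2_residue)
    print(position, bd1_residue, bd2_residue, round(shift, 2),
          more_lipophilic_domain(position))
```

For position 13 the shift from lysine to alanine is 0.24 - 2.29 = -2.05, whereas for position 33 the shift from aspartic acid to histidine is 2.47 - 3.98 = -1.51, consistent with the larger lipophilicity difference described for position 13.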
Type 3 bromodomains
The type 3 bromodomains include CREBBP and EP300, as well as BRWD1 BD2. Residue 6
(arginine), residue 15 (glycine for CREBBP and EP300 and glutamic acid for BRWD1 BD2),
residue 34 (lysine in BRWD1 and arginine in CREBBP and EP300) and residue 10 (aspartic acid)
were important features for classifying active compound-target pairs for all three
bromodomains (Figure 3-4). Residue 6 (arginine) is neighbouring the LPF or EPF shelves
aligned with the WPF motif in BET bromodomains. This residue is less lipophilic than the
residues for other bromodomains in this position (Figure 3-4) and could hence potentially be
exploited for activity and selectivity for type 3 bromodomains. For alignment position 15,
the remaining subfamilies do not have a residue in this position, therefore this position encodes
the differences in length of the ZA loop, an important feature of type 3 bromodomains, which
makes them similar to BET bromodomains (Figure 3-4).81 Residue 34 (arginine) corresponds to
the R1173 residue in CREBBP, which has been previously exploited to design activity for CREBBP,
as demonstrated by the cation-π interaction with multiple series of inhibitors.81,82 Residue 10
(aspartic acid) is shared between the type 3 and type 2 (BET) bromodomains (Figure 3-4). This is
one of the least lipophilic residues in this position (Z-scales first PC value 3.98)265 and could be
targeted by a charged interaction. The fact that
some of the residues that are found to be important for classifying activity for type 3
bromodomains are also important for type 2 bromodomains is consistent with the fact that the
BET family are frequent off-targets of CREBBP bromodomain inhibitors,337 due to their
similarities in the ZA loop.31
For CREBBP and EP300, all but one of the remaining alignment positions selected by the model
are the same for both domains, due to their highly conserved binding site sequence (Figure 3-4).
The most important lipophilicity descriptor for classifying activity in this case corresponds to
residue 32, T1171 in CREBBP, with a median feature importance of 0.010, and T1135 in EP300,
also with a median feature importance of 0.010. This residue is positioned at the bottom of the
BC loop and threonine is shared with BRD2 BD1 and BRD3 BD1. Other residues in this position
mostly consist of either glycine or acidic residues for other bromodomains, showing that the
properties of this loop are different in CREBBP and EP300. Residue 13, corresponding to L1119
in CREBBP and L1083 in EP300 with median feature contributions of 0.007 and 0.007
respectively, is also important for the classification of actives at CREBBP and EP300 (Figure
3-4), and leucine in this position is not found in any other bromodomains. Leucine is one of the
more lipophilic residues in this position (with a value for the first PC of Z-scale of -4.28),
especially as compared to the BET bromodomains, where lysine and alanine residues are present
in this position (first PC of Z-scales are 2.29 and 0.24 respectively)265 (Figure 3-4). This residue
is near to the end of the ZA channel, making the properties of this region different between type
3 and type 2 bromodomains. L1119 has been highlighted previously as important for obtaining
CREBBP selectivity over BETs.83 Alignment position 37 was found to be important for the
lipophilicity descriptor for activity at CREBBP (F1177, with a median feature importance of
0.003) but not for EP300 (Y1141), and this is the only major difference between the two
bromodomains from this analysis (Figure 3-5). These two bromodomains have the most
lipophilic residues in this position, which is situated on the αC helix and could, for example,
produce π-stacking interactions with a potential future CREBBP inhibitor, since it provides a
more lipophilic pocket than for the BET bromodomains.
Here, we show that the lipophilic factors contributing to the classification of actives for CREBBP
and EP300 are highly similar and highlight a lipophilic pocket which could be targeted in future
CREBBP/EP300 inhibitors. Further residues found to have a positive feature contribution
towards classification of actives at CREBBP, which are confirmed from previous studies (Table
1-2), included V1174, P1110, L1109 and Q1113 (see Figure 8-1 for all important residues). Overall,
two key residues known from previous literature to be important for gaining CREBBP selectivity
over BET bromodomains have been found to have a high feature contribution towards
classifying actives for the CREBBP and EP300 bromodomains, and four other residues match
with literature interactions for CREBBP inhibitors.
Type 4 bromodomains
The type 4 bromodomains included in this study were ATAD2, ATAD2B, BRD1, BRPF1,
BRPF3, BRD7 and BRD9, with 65, 0, 53, 146, 5, 41 and 346 active datapoints respectively (Table
8-1). Overall, we find that there is a more complex relationship of features important for
classification of active compound-target pairs within this subfamily, and so we analysed it
separately by bromodomain.
For ATAD2, Figure 3-5 shows that there are several important residues used for classifying
actives in the model. By far the largest feature contribution for the first PC of the Z-Scales
descriptor for classification of actives for ATAD2 was residue 33 (R1072 with a median feature
contribution of 0.010; Supplementary Data File 2). The presence of arginine is consistent with
the observation that the binding site of ATAD2 is generally less lipophilic than BET
bromodomains and therefore less druggable.31,318 This residue is positioned near the end of the
αC helix, and ATAD2 is the only bromodomain for which this position is arginine in the set of
bromodomains studied. Residues 11 and 10, corresponding to P1015 and D1014, were also found
to be important for classifying actives for ATAD2, although to a smaller extent, with both having median feature
contribution values of 0.001. D1014 has been noted previously as a residue which can be used
to make H-bonding interactions to ATAD2 inhibitors.86,87
For the BRPF bromodomains including BRD1 (BRPF2), BRPF1 and BRPF3 there were a small
number of residue positions which were important for classification of actives for these domains
(see Supplementary Data File 2 for full list). Residue 7 was commonly identified as important
for all three bromodomains, which is formed by Q589 in BRD1 and E655 and E616 in BRPF1 and
BRPF3 respectively (Figure 3-4, Figure 3-5). The glutamine found in BRD1 (Q589) is also present in the BD1 BET
family members, and this residue was found to be responsible for selectivity at BRD1 over BRPF1
for a benzoisoquinolinedione inhibitor molecule.90 BRPF1 and BRPF3 are the only
bromodomains to contain glutamic acid in this position among the ones included in this study,
and the negative charge can potentially be targeted for selectivity and/or activity.
Other residues with a positive feature contribution value for classifying actives against BRD1
included residue 10 (S592, with median feature contribution of 0.005), residue 36 (Y649, with
median feature contribution of 0.001) and residue 3 (R585, with median feature contribution
of 0.0004) (Figure 3-4, Figure 3-5). BRD1 is the only member of the type 4 bromodomains to
have a serine residue in position 10 and S592 was found to be responsible for selectivity at BRD1
over BRPF1 for the same benzoisoquinolinedione inhibitor molecule mentioned above.90 Y649
in alignment position 36, neighbouring the gatekeeper residue, is in common with several
bromodomains, excluding the BET bromodomains, and this could be exploited for BET
selectivity. The combination of the gatekeeper F648 and the neighbouring Y649 residue makes
this region much more sterically constrained and lipophilic than the corresponding residues in
BETs (which are formed by isoleucine and valine); this has been exploited for achieving
selectivity in the past.31,89 Residue 3 is the R585 in the RIF set of residues in BRD1, in place of
the WPF shelf in BETs, delivering very different properties to this section of the active site,
which could be exploited by interactions in a similar way to the RVF shelf in ATAD2.385 Overall,
from our feature contribution analysis we find three residues in BRD1, which are consistent with
previous literature describing selectivity for BRD1 including Q589, S592 and Y649.
For BRPF1, in addition to residue 7 mentioned above, residue 10 (P658), residue 2 (G650) and
residue 6 (S654) also have positive median feature contributions towards the classification of
actives with values of 0.008, 0.006 and 0.002 respectively (Figure 3-4, Figure 3-5). Proline at
residue 10 is only present for three bromodomains among those studied here, and so this residue
differentiates BRPF1 by being one of the more lipophilic residues in this position (Figure 3-4).
Selectivity for BRPF1 over BET bromodomains has been achieved by targeting P658 in this
position of the ZA loop with a lipophilic interaction, which is not exploitable with the aspartic
acid in this position in BETs.88 G650 is uniquely glycine for BRPF1 which is situated next to the
NIF group of residues, which is in the same region as WPF in BETs. This glycine makes the
pocket larger and less lipophilic than for other bromodomains. S654 is on the opposite side to
the NIF series of residues and serine is only found in one other bromodomain (BRD7) in this
position, rendering it a potential selectivity residue. BRPF3 has relatively few active compound-
target pairs in the dataset and so will not be interpreted further in this discussion. Overall, we
note that we have identified the P658 residue, which is known to be involved in lipophilic
interactions of BRPF1 inhibitors, as well as highlighted the importance of other residues.
BRD7 and BRD9 are highly conserved domains, and it is difficult to achieve selectivity for one
over the other.31 Correspondingly, nearly all residues that are important for the classification of active
compounds at both bromodomains are conserved, including residue 29 (P213/P102), residue 11
(D162/D51), residue 12 (F163 in BRD7 and A52 in BRD9), residue 13 (I164/I53), residue 27
(M203/M92), residue 20 (S169/S58), residue 18 (G167/G56), residue 35 (Y217/Y109) and residue
7 (F158/F47), for BRD7/BRD9 respectively (Figure 3-4, Figure 3-5). These are listed in order of
their feature importance values, which can be found in Supplementary Data File 2. The proline
residue at alignment position 29 is shared with the BET bromodomains, as well as the type 1
PCAF bromodomain, and as discussed above, is positioned at the end of the BC loop and more
lipophilic than other residues in this position (Figure 3-4), and from our model this residue
seems to be globally important to activity (Figure 3-4). Residue 11 and residue 12 are situated on
the ZA loop; aspartic acid is unique in position 11 for BRD7 and BRD9 and so can be considered
a potential selectivity residue. F163 and A52 in alignment position 12 form two of the more
lipophilic residues found in this position across all bromodomains, which contributes to the
more lipophilic ZA channel in BRD9 compared to BET bromodomains.31 These residues,
although not identified previously among the key hydrophobic residues, could be included in
the lipophilic residues which line the ZA channel of BRD9 from which selectivity could be
gained. Conversely, residue 13 (I53) and residue 7 (F47) in BRD9 belong to the previously
identified lipophilic residues of the ZA channel, which were suggested as residues which can be
exploited for selectivity.58,91 BRD7 and BRD9 are the only bromodomains for which residue 20
is serine, while many other bromodomains have tyrosine or histidine in this position. The amino
acid in position 18 is glycine, which is only found for a small number of bromodomains, with
the remainder of the bromodomains containing a charged residue in this position. These
residues positioned at the bottom of the ZA loop may alter the properties of the ZA loop for
BRD7 and BRD9. Y109 in position 35 in BRD9 is the gatekeeper residue and is well-known to
make interactions with inhibitor molecules, for example in BI-9564,58 and is key for activity for
BRD9.58 Overall, we find that the lipophilic residues on the ZA loop in addition to the tyrosine
gatekeeper have been highlighted as important for the classification of actives against BRD9,
which is consistent with the literature.
For the type 4 bromodomains, we correctly identified from our data-driven analysis key residues
which have been found to be important towards activity in previous studies for BRD1, BRPF1
and BRD7/9. We furthermore contributed new potential residues which might be exploited for
activity or selectivity for type 4 bromodomains.
Type 5 bromodomains
The type 5 bromodomains in this study include BAZ2A, BAZ2B and TIF1A. Overall it should be
noted that there were fewer active compounds (27, 31 and 63 respectively) in the dataset than
for other bromodomains, and therefore that the findings below are based on less evidence than
other subfamilies.
The residue positions most important towards the classification of actives for BAZ2A and BAZ2B
included residues 18, 20, 27, 3 and 34 for both domains in order of their importance (Figure 3-4,
Figure 3-5). Residue 18 is G1829/G1900 in BAZ2A/BAZ2B respectively and this position is also
glycine for BRD9 and BRD7 bromodomains. Whilst glycine is not considered lipophilic, it is the
most lipophilic residue in this position (Figure 3-4), with a value for the first PC of Z-scales of
2.05 compared to asparagine and aspartic acid with a value for the first PC of Z-scales of 3.52
and 3.05 respectively.265 The glycine residue in this position could offer the opportunity for
interaction with inhibitors due to its smaller size and lack of polar atoms. Residue 20 is R1831
in BAZ2A and K1902 in BAZ2B, both of which are charged, which could be exploited via opposite
charge interactions. Residues 18 and 20 are positioned at the end of the ZA loop and the BAZ
bromodomains are known to have a shorter ZA loop than other domains,31 showing that the
model has located residues that describe the key differences in structure for these
bromodomains compared to other bromodomains. Residue 27 is V1865 for BAZ2A and V1936
for BAZ2B whereas the remaining bromodomains have methionine or isoleucine in this
position, which may change the properties of the αB helix. Residue 3 is W1816/W1887 of the
WPF shelf, which is conserved from the BET bromodomains and is important for activity at BAZ2A and BAZ2B
bromodomains, notably due to its flexibility to make π-stacking interactions, in agreement with
previous findings.55 The amino acids in position 34 are the negatively charged E1878 (BAZ2A)
and D1949 (BAZ2B), which are particularly hydrophilic amino acids (with a value for the first
PC of Z-scale of 3.11 and 3.98 respectively), and are also found in BET bromodomains.
Additionally, the residue in position 7 was found to be important towards classification of
actives for the BAZ2A, but not the BAZ2B bromodomain, which corresponds to E1820 in BAZ2A
and L1891 in BAZ2B and forms a known selectivity residue between these domains.92
The residue positions that were important to the classification of actives against TIF1A are
different to those for the BAZ bromodomains (Figure 3-4, Figure 3-5), including the globally
important proline residue at position 29 (P982 in TIF1A), as discussed previously. The amino
acid in alignment position 1 (M920) is the same as in BPTF, and in a similar way to BPTF, we
hypothesise that this methionine can be exploited by specific interactions to achieve broad
bromodomain selectivity. Amino acids 12 (in this case an alignment gap) and 10 (P929) are also
important for activity for TIF1A. This agrees with an important structural difference in the
binding pockets, namely that the ZA loop is shorter for TIF1A than for most other
bromodomains.31 P929 has been shown to be important for lipophilic interactions with a known
inhibitor.93 Residue 4 (A923), which replaces the proline found in the same region as the WPF shelf in many
bromodomains, is also highlighted as being important for the classification of actives for TIF1A,
which is consistent with previous knowledge that inhibitors occupy this hydrophobic region
flanked by A923.61 Overall, we identified by model interpretation residues which are important
for classification of actives for TIF1A, including P929 and A923, which have been reported previously
as being important for lipophilic interactions with inhibitors.
Type 7 bromodomains
The type 7 bromodomains for which data was included in the model were TAF1 BD2 and TAF1L
BD2, and the number of active datapoints in the model for these domains were 243 and 7
respectively. These domains share 37 out of 38 of the same binding site residues used in this
study, and therefore, it is unsurprising that most of the same alignment positions are found to
be important for the classification of actives, including residue 29 (P1585/P1604), 32
(E1586/E1605), 13 (F1536/F1555), 6 (H1529/H1548), 2 (S1525/S1544), 35 (Y1589/Y1608) and 34
(Q1588/Q1607), for TAF1 BD2/TAF1L BD2 (Figure 3-4, Figure 3-5), ranked in order of their
importance (Supplementary Data File 2). Residue 29 (proline) has been noted to be a
determinant of activity against the BET bromodomains and TIF1A in our study, where this
residue is also proline. Glutamic acid in position 32 is important for classification of actives, in
a similar way to the finding that the aspartic acid residue in this position was important for the
classification of actives at BD1 BET bromodomains. A potential selectivity residue identified for
TAF1 BD2 and TAF1L BD2 includes the F1536/F1555 residue at position 13, since this lipophilic
aromatic group located on the ZA channel is unique to these two bromodomains (Figure 3-4),
adding a potential lipophilic pocket to the ZA channel similar to that found in BRD7 and BRD9.
Histidine in position 6, neighbouring the WPF shelf, is also a unique residue in this position for
these bromodomains (Figure 3-4), which may be exploited to achieve selectivity. The gatekeeper
residue (tyrosine) and its neighbouring residue (glutamine) have also been identified as important
by our model. As seen for the other bromodomains with the tyrosine
gatekeeper residue, including BRD9 and PCAF, this tyrosine could be exploited for π-stacking
interactions to enhance activity for TAF1 and TAF1L BD2 domains.
Type 8 bromodomains
The type 8 bromodomains included in this study are SMARCA2, SMARCA4 and PB1 BD5, which
have low numbers of actives: 28, 70 and 10, respectively.
For SMARCA2 and SMARCA4, the residues in positions 16 (L1418/L1494), 32 (G1467/G1543), 1
(L1405/L1481) and 11 (R1415/R1491) are important for the classification of actives (Figure 3-4,
Figure 3-5), in order of their median feature importance. Leucine is a common residue at
position 16 across all bromodomains in this study, including the BET bromodomains. L1418
(SMARCA2) or L1494 (SMARCA4) is positioned on the ZA loop and has previously been
identified as important for lipophilic interactions with an inhibitor molecule (Table 1-2).65
Glycine at position 32 is unique to the SMARCA2 and SMARCA4 bromodomains
(Figure 3-4), suggesting a region for potential selectivity by extending a large, lipophilic group
into this region of the BC loop. L1405/L1481 and R1415/R1491 residues in positions 1 and 11 are at
the top and bottom of the ZA loop, respectively, affecting the properties of these regions.
Leucine has not been previously identified; however, arginine may be deemed a selectivity
residue which is important for activity at SMARCA2 and SMARCA4 bromodomains, where
previous experimental evidence of a π-cation interaction exists.82 Amino acid 34 (Q1469) next
to the gatekeeper residue is also found to be important for the classification of SMARCA2
actives. Since very little is currently known about targeting SMARCA2 and SMARCA4
bromodomains this interpretation may be useful in the future design of inhibitors for these
domains. In addition, the importance of L1418 found by the model is supported
by the interaction with this residue reported in the literature.
For PB1 BD5, in agreement with SMARCA2 and SMARCA4, residues 16 (L693) and 1 (L680) were also
identified as important by the model (Figure 3-5). L693 (position 16) in PB1 BD5 has been
identified previously to make interactions with inhibitor molecules.95 The amino acids in
alignment positions 25 (M706) and 29 (P741), which have been discussed previously for other
bromodomains in this study, are also important for activity. Interestingly, the lipophilicity of
amino acids in alignment positions 7 (R686), 4 (I683) and 6 (L685) is also important for the
classification of actives against PB1 BD5; these residues are positioned in the region where the WPF shelf
occurs in BET bromodomains, consistent with experimental evidence that I683 makes
interactions with inhibitors of PB1 BD5.95
Given the limited knowledge of inhibitors for this family, we find a consensus with residues
known to be important for binding to type 8 bromodomains, and suggest other residues
which, according to the model, might be exploitable for future ligand design.
Summary
The correspondence of the model feature importance towards classifying actives with the
residues discussed in the literature for each bromodomain is summarized in Table 3-3. For the
references to the literature the reader should consult Table 1-1. For 13 out of 31 bromodomains,
namely BPTF, BRD2 BD1, BRD2 BD2, BRD3 BD1, BRD3 BD2, BRDT BD1, BRDT BD2, EP300,
BRD1, BRD7, BRPF3, BAZ2A and TAF1 BD2, we successfully identify all the residues previously
reported to be important for activity (sensitivity of 1). The model features correspond less well
to the known residues for ATAD2, ATAD2B, PB1 BD5 and SMARCA4, with sensitivities of 0.5 or below (Table
3-3). What is not considered here is the false positive rate which is difficult to determine, as we
do not know which of the newly identified residues are novel and important contributors to
binding, and which are false positives. When considering which residues to target in design, it
is therefore important to consider the magnitude of the median feature contribution and the
spread of values across the subset indicated by the interquartile range (Supplementary Data File
2), as well as the literature evidence and plausibility of a small molecule interaction with the
residue. Overall, we propose that this study can be used as a guideline for generating future
design hypotheses for targeting residues in bromodomains with small molecule inhibitors.
Table 3-3: Shows the residues identified from the model to be important to the classification of active compounds for each bromodomain (according to the first PC of Z-Scales 3 target descriptor interpreted as lipophilicity), which corresponded to previous literature knowledge (in bold). Selectivity residues are a list of current literature residues deemed important for activity at the bromodomain and correspond to Table 1-2. The number of true positives, false negatives and the sensitivity are reported.
Bromodomain | Selectivity Residues (see Table 1-1 for references) | True Positives | False Negatives | Sensitivity
BPTF | F2887, D2834, W2824 | 3 | 0 | 1.00
CECR2 | Y520, P458, F459, M506, W457 | 4 | 1 | 0.80
KAT2A | Y814, E761, P752, W751 | 3 | 1 | 0.75
PCAF | Y809, E756, P747, W746, V752, P751, K753 | 5 | 2 | 0.71
BRD2 BD1 | Q101, D160, I162 | 3 | 0 | 1.00
BRD2 BD2 | K374, H433, V435 | 3 | 0 | 1.00
BRD3 BD1 | Q61, D120, I122 | 3 | 0 | 1.00
BRD3 BD2 | K336, H395, V397 | 3 | 0 | 1.00
BRD4 BD1 | Q85, D144, I146, Y139, L92, W81, P82, F83 | 6 | 2 | 0.75
BRD4 BD2 | K378, H437, V439, Y432, P375, F376 | 4 | 2 | 0.67
BRDT BD1 | I115 | 1 | 0 | 1.00
BRDT BD2 | V357 | 1 | 0 | 1.00
BRWD1 BD2 | None known | 0 | 0 | N/A
CREBBP | R1173, L1109, V1174, L1120, P1110, F1111, Q1113, L1119 | 7 | 1 | 0.88
EP300 | R1137, L1073 | 2 | 0 | 1.00
ATAD2 | R1007, V1008, R1077, D1071, D1014, I1074, V1018, Y1063 | 3 | 5 | 0.38
ATAD2B | N981, I982, R1051, D1045, D988, I1048, V992, Y1037 | 0 | 8 | 0.00
BRD1 | F648, S592, V647, Q589, Y649 | 5 | 0 | 1.00
BRD7 | Y217, A154, S157 | 3 | 0 | 1.00
BRD9 | Y106, F44, I53, A54, G43, F45, V49, H42, R101, A46, F47, P48, T50 | 9 | 4 | 0.69
BRPF1 | F714, P658, I713 | 2 | 1 | 0.67
BRPF3 | F675 | 1 | 0 | 1.00
BAZ2A | W1816, V1879, E1820 | 3 | 0 | 1.00
BAZ2B | W1887, I1950, L1891 | 2 | 1 | 0.67
TIF1A | A923, L922, F924, V932, V928, V986, P929, E985 | 6 | 2 | 0.75
TRIM33 | None known | 0 | 0 | N/A
TAF1 BD2 | N1533, W1526 | 2 | 0 | 1.00
TAF1L BD2 | None known | 0 | 0 | N/A
PB1 BD5 | A703, M699, L687, L693, I745, I683, F684, M731 | 4 | 4 | 0.50
SMARCA2 | L1418, P1413, E1417 | 2 | 1 | 0.67
SMARCA4 | L1494, P1489, E1493 | 1 | 2 | 0.33
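The sensitivity column of Table 3-3 follows the usual definition, true positives divided by the sum of true positives and false negatives. A minimal Python sketch of that arithmetic is given below, using counts copied from four rows of the table. The function name `sensitivity` and the choice to return `None` when no literature residues are known are illustrative assumptions, not part of the original analysis.

```python
def sensitivity(true_positives, false_negatives):
    """Return TP / (TP + FN), or None when no literature residues are known."""
    known = true_positives + false_negatives
    if known == 0:
        return None  # corresponds to the N/A entries in Table 3-3
    return true_positives / known


# (true positives, false negatives) copied from Table 3-3
rows = {"ATAD2": (3, 5), "BRD9": (9, 4), "CREBBP": (7, 1), "TRIM33": (0, 0)}

for name, (tp, fn) in rows.items():
    value = sensitivity(tp, fn)
    print(name, "N/A" if value is None else f"{value:.2f}")
```

Because only true positives and false negatives enter the calculation, this measure says nothing about false positives, which is the limitation discussed in the text.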
To further explore the consensus between model insight and experimental data, we next
prospectively validated interactions identified as driving activity and/or selectivity from PCM
models via novel crystal structures of ligand-protein complexes.
3.3.2 Binding Modes Elucidated by Co-Crystallization
Four bromodomain binders identified in our previous chapter367 are shown in Table 3-4. We
used differential scanning fluorimetry (DSF) to determine the activity of these hits for BRD4
BD1, BRD1, BRPF1 and BRD9, and obtained novel crystal structures to validate our selectivity
findings prospectively. Activity measurements were performed at 10 µM compound
concentration to establish if there was a binding interaction between compound and
bromodomain. DSF is only semi-quantitative, and therefore the
measurement was taken as a classification of “active” or “not active” rather than as a quantitative
measure of activity. The cut-off for activity was determined to be three times the
standard deviation of the negative control (see Table 8-5). We determined four crystal
structures for BRD4 BD1 and one for BRD9. The overall mapping of residues from the crystal
structures to our analysis for the two bromodomains is visualised in Figure 3-6. As can be
observed, of the 20 residues identified as important towards the classification of BRD9
actives from the model, 4 residues were found to be relevant to the binding of compound 8 to
BRD9. Of the 26 residues identified as important towards the classification of BRD4 BD1
actives from the model, 11 were found to make contacts with compounds 5-8 in co-crystal
structures with BRD4 BD1. We note that some residues were deemed important by the model
towards the classification of actives but not found to interact with the ligands for which
crystallographic data was obtained. This is to be expected because not all compounds will bind
in the same mode to each protein.
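The activity classification described above, a cut-off of three times the standard deviation of the negative control, can be sketched in a few lines of Python. This is an illustration only: the negative-control values below are hypothetical (the measured values are in Table 8-5 and are not reproduced here), the use of the sample standard deviation is an assumption, and the function names are invented for this sketch.

```python
from statistics import stdev


def activity_cutoff(negative_control_shifts, n_sd=3.0):
    """Cut-off = n_sd times the sample standard deviation of the negative control."""
    return n_sd * stdev(negative_control_shifts)


def classify(delta_tm, cutoff):
    """DSF is semi-quantitative, so only a binary label is returned."""
    return "active" if delta_tm > cutoff else "not active"


# HYPOTHETICAL negative-control thermal shifts in degrees C (not thesis data)
negative_control = [-0.2, 0.1, 0.3, -0.1, 0.0, -0.3, 0.2]
cutoff = activity_cutoff(negative_control)

# Two illustrative thermal shifts, one large and one close to zero
for delta_tm in (4.40, 0.06):
    print(delta_tm, classify(delta_tm, cutoff))
```

In practice a separate cut-off would be derived for each bromodomain assay, since the spread of the negative control can differ between proteins.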
Table 3-4: Bromodomain ligands for which crystal structures were obtained. Activity was experimentally confirmed by differential scanning fluorimetry. Values report the observed thermal shift (°C) midpoint/first derivative, NA indicates that the compound was not tested in the assay.
Compound | Structure | BRD4 BD1 | BRD1 | BRPF1 | BRD9
5 | [structure image] | 4.40/4.40 | NA | NA | NA
6 | [structure image] | 1.80/1.80 | -0.30/-0.30 | 0.06/0.08 | 1.80/1.70 (within assay error)
7 | [structure image] | 4.90/5.20 | NA | NA | NA
8 | [structure image] | 1.50/1.40 | NA | NA | 3.34/3.30
Figure 3-6: The amino acid identity and residue numbering for each alignment position in BRD9 and BRD4 BD1 which were important towards the classification of actives for each bromodomain. The figure is coloured by the property or combination of properties (defined by the principal component loadings of Z-Scales descriptor, see methods) of the amino acid which were important towards classifying active compounds for each domain. “None” refers to those residues which were not found to be important for the classification of actives from our model interpretation. Outlined in black are the residues which were identified from our study to make interactions with our newly generated liganded crystal structures for these domains. “-” represents alignment gaps in the sequence. Alignment positions 5, 8, 9, 19, 22, 24, 28 and 30 were removed as model input features due to low variance of amino acid properties across targets.
Compound 5 (Table 3-4), containing the dimethylisoxazole moiety, which is a well-known
acetyl lysine mimetic, displayed an average experimental activity of ΔTm 4.40°C at 10 µM for
BRD4 BD1 in the DSF assays. The unique aspect of this molecule, as compared to other similar
compounds in the public domain with bromodomain data, was its benzoxazolone core. We were
interested to see how the compound binds to BRD4 BD1 with this core because the compound
was published in a patent subsequent to our experimental testing,386 but there is no published
crystal structure for this compound bound to BRD4 BD1. Figure 3-7 shows the co-crystal
structure of compound 5 bound to the acetyl lysine binding site of BRD4 BD1. The 3,5-
dimethylisoxazole moiety makes an expected83,387,388 interaction with N140 and Y97 via a stably
bound water molecule (Figure 3-7a). These residues were not highlighted in our analysis
because they were excluded as features since they are conserved across bromodomains (See
methods). Compound 5 makes lipophilic interactions from one methyl group to L94, a residue
highlighted as important for lipophilic, size and polar interactions with BRD4 BD1 actives from
our selectivity analysis (Figure 3-6, alignment position 16). The benzoxazolone packs against the
WPF shelf and makes hydrophobic interactions with the residues of the WPF shelf and L92. L92
was highlighted from our selectivity analysis to be important towards the classification of actives
based on polarity (Figure 3-6). The aryl ring sits in the hydrophobic groove formed by the W81,
M149 and I146 residues (Figure 3-7b), all of which are residues which were highlighted from our
selectivity analysis as being important for lipophilic interactions with BRD4 BD1 active
compounds (Figure 3-6). Figure 3-7c shows that this molecule mimics the binding mode of a
similar BRD4 inhibitor with a dihydroindenol core, where the aryl ring also occupies the
hydrophobic groove.389 Overall, the binding mode of this compound is consistent with similar
compounds in the literature, as well as forming interactions with six of the residues which are
found to be important for classifying actives at BRD4 BD1 from our model.
Compound 6 (Table 3-4) also contains the dimethylisoxazole moiety and was found to be active
at BRD4 BD1 with an average ΔTm of 1.80 °C measured at 10 µM, while being inactive at BRD1,
BRD9 and BRPF1b (with ΔTm values falling below the threshold of 3 standard deviations from
the negative control), making this a selective compound for BRD4 BD1 according to this
measurement. From the crystal structure, as with compound 5, we observe the same
interactions of the isoxazole with N140 and Y97 via a conserved pocket water (Figure 3-8a),
along with the interaction of the methyl group with L94. The aryl ring is positioned along the
ridge of the hydrophobic WPF shelf, as seen for the benzoxazolone in compound 5. The meta-
substituted sulfonamide of compound 6 points in the opposite direction to the aryl ring of
compound 5, towards the ZA channel and away from the WPF shelf (Figure 3-8b), and forms an
interaction with the backbone nitrogen of D88. This residue is part of the ZA loop and was found
from our model to be an important residue with respect to its lipophilic, steric and polar
properties for the classification of active compounds (Figure 3-6). The sulfonamide displaces a
water in this position and extends towards K91, a residue for which lipophilicity and polarity
were found from the model to be important for the classification of active compounds for BRD4
BD1.
The selectivity of this molecule for BRD4 BD1 over BRD9 can partially be explained by the
residue in alignment position 7 from our model (Figure 3-6), which is phenylalanine in BRD9
and glutamine in BRD4 BD1. From the overlay of structures, the sulfonamide clashes sterically
with F47 (Figure 3-8c), rendering the compound inactive at BRD9. The size of Q85 was specifically
highlighted as important for the classification of actives for BRD4 BD1, and the lipophilicity, size
and polarity of F47 (corresponding to the same alignment position as Q85) in BRD9 were
important for classifying actives at BRD9 (Figure 3-6), providing evidence from the model
consistent with this observation. For selectivity over BRPF1b, the residue in alignment position
10 from our model can be hypothesised to have a role in the selectivity observed. In BRD4 BD1
this residue is aspartic acid (D88), with which compound 6 interacts via the
backbone nitrogen as discussed above. In BRPF1b this residue is proline (P658) and the
ligand can no longer make the same interaction.
In summary, we observe similarities in the binding mode of compounds 5 and 6, except for the
two binding vectors for the aryl vs sulfonamide groups. We identified three additional residues
which make interactions with compound 6 but not compound 5, namely Q85, D88 and K91,
which were consistent with the model activity analysis for BRD4 BD1. We note that residue
positions important for activity at BRD9 (position 7, F47) and BRD1 and BRPF1b (position 10,
S592 and P658) from our computational analysis were consistent with our experimental
findings. We furthermore generated selectivity hypotheses for why the molecule is active at
BRD4 BD1, but inactive at the other three bromodomains tested.
Figure 3-7: Crystal structure of compound 5 bound to BRD4 BD1. Compound 5 is shown in light blue sticks; key protein side chains are shown as yellow sticks. a) Key hydrogen bonding interactions of the dimethylisoxazole with N140 and Y97 via a stably bound water molecule. b) Ligand with a receptor surface representation of BRD4 BD1 with surface coloured by lipophilicity (lipophilic regions in green and hydrophilic regions in pink). The aryl ring occupies the hydrophobic groove formed by W81, M149 and I146. c) Overlay of compound 5 (light blue) with the ligand from crystal structure PDB 4GPJ (magenta), with a different bicyclic core. The aryl ring is in both cases positioned similarly into the hydrophobic groove.
[Figure 3-7 panel labels: WPF shelf; hydrophobic groove; residues W81, F83, L92, L94, Y97, N140, I146, M149; ligands: compound 5 and ligand from PDB 4GPJ]
Figure 3-8: Crystal structure for compound 6 bound to BRD4 BD1. Compound 6 is shown in orange sticks; key protein side chains are shown as yellow sticks. a) Key hydrogen bonding interactions of the dimethylisoxazole with N140 and with Y97 via a conserved water molecule. b) Surface representation of BRD4 BD1 with surface coloured by lipophilicity (lipophilic regions in green and hydrophilic regions in pink). Compound 6 is displayed in orange and compound 5 in blue. The sulfonamide occupies the region towards the ZA channel rather than the hydrophobic groove occupied by compound 5. c) Overlay of BRD9 structure in pink with our crystal structure for compound 6 in BRD4 BD1 in yellow. Surfaces are shown in the same colour as the corresponding receptor residues, F47 in BRD9 and Q85 in BRD4 BD1. F47 in alignment position 7 in BRD9 clashes sterically with the sulfonamide (pink surface), which is likely to be the reason for selectivity of this molecule for BRD4 BD1 over BRD9.
[Figure 3-8 panel labels: WPF shelf; hydrophobic groove; residues F47, W81, P82, F83, Q85, D88, K91, L94, Y97, N140; ligand: compound 6]
Compound 7, active at BRD4 BD1 with a high average ΔTm value of 5.05 °C (Table 3-4), is related
to BI-2536,66 a known inhibitor of BRD4 BD1, but is modified structurally in three ways:
1. the expansion from a 6- to a 7-membered ring system, 2. the change of the
substituents around the ring and 3. the alteration of the simple base to a more complex cyclic
amine base. In the crystal structure of this compound with BRD4 BD1 (Figure 3-9a) the carbonyl
of the amide in the 6,7-fused ring system interacts with the N140 residue. The cyclopropyl group
forms a lipophilic interaction with a hydrophobic sub-pocket defined by L94, A89, V87 and Y97,
while the methyl group extends into the lipophilic sub-pocket defined by M132, C136 and F83.
In agreement with this, L94, A89 and M132 were highlighted as important residues for lipophilic
interactions with active compounds by our model (Figure 3-6). The other residues mentioned
(V87, Y97 and F83) were in regions of low variability between bromodomains and were therefore
explicitly excluded as features in the model. The cyclopentyl group extends towards the
hydrophobic groove defined before for compounds 5 and 6 as the sub-pocket between W81,
M149 and I146. The pyrimidine interacts with a stable water in the binding site. The pyrimidine
ring forms a hydrophobic contact with P82 and the aryl ring forms an edge-to-face π-stacking
interaction with W81 in the WPF shelf and a hydrophobic interaction with the L92 residue on
the ZA loop; these residues have been mentioned previously in our study for their importance
for the classification of actives for BRD4 BD1 (Figure 3-6), as well as in previous studies.79
As shown in Figure 3-9b, the binding mode of compound 7 is similar to that of BI-2536 in that
the carbonyl interacts with N140 and the substituents around the 7-membered ring (methyl and
cyclopentyl groups) occupy the same regions in the structure. In addition, the cyclopropyl group
in compound 7 overlays with the ethyl group in BI-2536. Noticeable changes in binding mode
between the two compounds include that the 7-membered ring changes the position of the
pyrimidine to be able to form an interaction with the stable water molecule. In addition, the
chlorine substitution moves the aryl ring further into the ZA channel towards the D88 and K91
residues. This positions the aryl ring more favourably for an edge-to-face interaction with W81,
as well as a hydrophobic interaction with L92, in agreement with both residues being picked up as features
related to activity against BRD4 BD1 by our model.
This compound hence forms key interactions with several residues that were recognised by the
model as being important in terms of making interactions with inhibitor molecules of BRD4
BD1, including L94, A89, M132, M149, W81, I146, D88 and K91.
Figure 3-9: Crystal structure for compound 7 bound to BRD4 BD1. Compound 7 is shown in light blue sticks. a) Key hydrogen bonding interactions from the carbonyl of the amide in the 6,7-fused ring system with N140, as well as the interactions with multiple other residues. b) Overlay of BI-2536 (PDB 4O74) shown in purple sticks with compound 7, showing that the cyclopentyl and methyl groups contact similar regions of the protein, but that the 7-membered ring changes the position of the pyrimidine to be able to form an interaction with the stable water molecule. The chlorine substitution moves the aryl ring further into the pocket to enable an interaction with W81. Surface representation of BRD4 BD1 near to residues K91, D88 and L92 coloured by lipophilicity (lipophilic regions in pink and hydrophilic regions in green), showing that compound 7 extends towards the ZA channel residues K91 and D88.
[Figure 3-9 panel labels: residues W81, P82, F83, P86, V87, D88, A89, K91, L92, L94, Y97, C136, N140, I146, M149; ligands: compound 7 and BI-2536]
Compound 8 is formed of a quinazolinone warhead and an aryl mesylate and is active for BRD9
with an average ΔTm of 3.32 °C at 10 µM, as well as for BRD4 BD1 with a ΔTm of 1.45 °C (Table 3-4). This
is the first O-linked quinazolinone demonstrated to have activity for BRD9. From the crystal
structures we obtained for both bromodomains, the molecule forms a key interaction with N140
(BRD4 BD1) and N100 (BRD9) via the carbonyl group on the quinazolinone (Figure 3-10). For
BRD4 BD1, the quinazolinone interacts with I146 and V87 in the binding site via hydrophobic
interactions. The aryl mesylate extends into the hydrophobic groove between W81, M149 and
I146 in a similar way to compound 5 (Figure 3-10a). For BRD9 the binding mode is different
due to the difference in the amino acid residues in the binding site (Figure 3-10b). In this case,
the primary interaction of the quinazolinone, apart from that with N100, is the π-stacking
interaction with Y106, which has been noted to be of relevance for BRD9 inhibitors previously.58
This residue has been identified from our model as important for activity at BRD9 for the
properties of lipophilicity and size, which agrees with the interactions observed in the crystal
structure. This residue (position 35) is the gatekeeper and corresponds to I146 in BRD4, which
is unable to form π-stacking interactions. Other key lipophilic interactions are with F44 and
F47 either side of the quinazolinone core (Figure 3-10b), which agrees with important
interaction features derived from the model (Figure 3-6). As shown in Figure 3-10c, the molecule
cannot adopt the same binding mode in BRD9 as in BRD4 BD1, as Y106 blocks access to the
hydrophobic groove and would clash with the aryl ring in this conformation. Therefore, the aryl
mesylate extends into solvent in this case. On the other side of the protein, in the ZA channel,
L92 in BRD4 BD1 is replaced by I53 in BRD9, and hence the position of the ZA loop is altered, allowing
the quinazolinone system to shift towards the then more open ZA loop.
From our interpretation of the model, the alignment position 35 (corresponding to the
gatekeeper residue of Y106), position 13 (corresponding to the residue I53), position 7
(corresponding to F47) and position 4 (corresponding to F44) were important for classification
of activity at BRD9 (Figure 3-6) and these residues were indeed found to form interactions with
compound 8, as shown in the BRD9 cocrystal structure. The binding mode was different in
BRD4, mostly since BRD9 has a larger and more lipophilic gatekeeper residue Y106, which
blocks access to the hydrophobic groove.
Figure 3-10: a) Crystal structure for compound 8 (blue sticks) bound to BRD4 BD1 (yellow sticks). Key H-bonding interactions of the quinazolinone with the N140 residue, as well as the interaction of the quinazolinone with I146 and V87 in the binding site can be seen. The aryl mesylate extends into the hydrophobic groove between W81, M149 and I146. b) Crystal structure for compound 8 (green sticks) bound to BRD9 (pink sticks). Key H-bonding interactions of the quinazolinone with the N100 residue, as well as the π-stacking interaction of the quinazolinone with Y106 can be seen. In this case the aryl mesylate points out into solvent. c) Compound 8 bound to BRD4 BD1 conformation (blue sticks) overlaid with compound 8 (green sticks) bound to BRD9 (pink sticks). In pink hash is the BRD9 receptor surface, showing compound 8 cannot bind in the BRD4 BD1 binding mode due to the clash of the aryl mesylate with Y106 in BRD9.
[Figure 3-10 panel labels: residues G43, F44, F45, F47, I53, W81, P82, F83, V87, L92, Y97, N100, Y106, N140, I146, M149; ligand: compound 8]
3.4 Conclusions
This study provides information on how to optimise activity and selectivity for bromodomain-
containing proteins, through interactions with residues in the binding site. We systematically
examined selectivity features across bromodomains by interpreting proteochemometric models
at different levels, namely the global, subfamily and individual target levels. Our analysis led to
the identification of residues in the bromodomain active site which were important towards
obtaining activity and selectivity at these different levels, and we compared our findings to
existing knowledge from the literature, as well as to newly-generated experimental binding
modes of compounds. We showed that the model retrieved a highly important selectivity
residue for CREBBP over BRD9, namely R1173 known to make cation-π interactions with
multiple series of CREBBP inhibitors. We analysed the features important for classifying active
compound-target pairs at each bromodomain and found that for 13 out of 31 bromodomains,
the model identified all previously known important features from the literature. We also
highlighted and discussed potential new interactions that can be exploited for selectivity for
each bromodomain, for example, that interacting with the tyrosine residue in alignment
position 6 may gain selectivity for BET BD2 domains over BD1 domains, and that interacting
with the more lipophilic residues of F1177 and Y1141 in alignment position 37 for CREBBP and
EP300 bromodomains may gain selectivity over BETs. We focussed on the interpretation of the
first principal component of the target descriptors which describes the lipophilicity of the amino
acid in each alignment position for the per-target analysis, although the other properties of size
and polarity can be further explored in a similar systematic analysis in the future (and the data
is contained in Supplementary Data File 2). We next examined the correspondence between the
residues identified from the computational analysis and interaction residues from novel co-
crystal structures. For BRD4 BD1, 11 residues were confirmed from our model interpretation to
be relevant to the binding of at least one out of four compounds (for which cocrystal structures
of the compound in BRD4 BD1 were generated), namely W81, P82, Q85, D88, A89, K91, L92,
L94, M132, I146 and M149. These residues showed good overlap with the previous machine
learning interpretation study.79 For BRD9, we were able to confirm that 4 residues from our
model interpretation were relevant to the binding of compound 8, namely F44, F47, I53 and
Y106. Furthermore, we have shown that the residue in position 7, identified as important for
activity at BRD9, was key to the selectivity of compound 6 for BRD4 BD1 over BRD9, due to the change in
residue from Q85 in BRD4 BD1 to F47 in BRD9, which causes a steric clash with compound 6 in BRD9.
Overall, this study adds to the knowledge for bromodomain target family selectivity and
provides a tool to augment future structure-based design of small molecules for targeting the
bromodomain-containing proteins.
4 Associations Between Drug-Induced Adverse Events in
Animal Models and Humans: Beyond Concordance
4.1 Introduction
The problem of translating toxicities from animals to humans is still an important issue to be
addressed (see Toxicity Translation). Data-driven approaches can lead to new insight by retrospectively
analysing adverse event data in animals and humans to look for statistical associations.
Previously, most of these studies have been conducted as concordance analyses where the
degree of association between the same or similar toxicities (defined as toxicities in the same
system organ class (SOC) groups) in animals and humans has been determined by in silico
approaches (Prediction of Clinical Adverse Events from Preclinical Adverse Events).232 We
previously highlighted in this section that there is a broader need to understand not only the
concordance between the same or related AEs between species, but also the relationships
between AEs of a different nature in animals and humans, due to differences in anatomy,
physiology and biology across species.
To this end, we implemented a data-driven approach to find novel relationships between animal
and human toxicities. We extended the previous concordance approaches to find associations
between AEs encoded by different MedDRA terms. We did this to find mechanistic links
between AEs which might not be found by using prior information, such as the SOC groupings
or known relationships between AEs, as in previous studies. We implemented, in contrast with
previous studies, Mutual Information (MI) as a measure of non-linear dependency between AEs,
and quantitatively assessed the associated risk increase using Likelihood Ratios. Furthermore,
we aimed to understand the nature of the interrelationships between associations. This chapter
provides an unbiased approach designed to reveal new understanding on the relevance of
animal studies. New associations discovered by this analysis can be used to aid the future risk
assessment of clinical toxicities based on preclinical information.
4.2 Materials and Methods
4.2.1 Dataset
Preclinical and clinical adverse event (AE) data encoded by the Medical Dictionary for
Regulatory Activities (MedDRA)155 preferred term (PT) were manually extracted from
PharmaPendium (2017-04)390 for all drugs in the database. These drugs were filtered to only
retain those drugs which had at least one reported AE in a preclinical and a clinical study,
resulting in 2,259 drugs. Duplicate entries of the same drug and AE combination were removed,
retaining only one instance of each pair. The dataset was converted into a binary matrix of AEs
against drugs where presence of an AE for a drug was encoded by 1 and absence encoded by 0.
In total, 4,585 preclinical AE variables and 7,675 clinical AE variables were extracted.
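The dataset preparation steps above (removing duplicate drug and AE pairs, retaining drugs with both a preclinical and a clinical AE, and encoding presence as 1) can be sketched in plain Python as follows. The record layout, the function name and the drug labels are hypothetical, and placing drugs on the rows is an assumption made for this illustration rather than a description of the original implementation.

```python
def build_binary_matrix(records):
    """Turn (drug, study_type, adverse_event) records into a 0/1 matrix.

    study_type is 'preclinical' or 'clinical'. Returns (drugs, columns, matrix)
    with one row per drug and one column per (study_type, adverse_event).
    """
    pairs = set(records)  # duplicate drug/AE entries collapse to one instance
    seen = {"preclinical": set(), "clinical": set()}
    for drug, study_type, _ in pairs:
        seen[study_type].add(drug)
    # Retain only drugs with at least one preclinical AND one clinical AE
    kept = seen["preclinical"] & seen["clinical"]
    pairs = {rec for rec in pairs if rec[0] in kept}
    drugs = sorted(kept)
    columns = sorted({(study_type, ae) for _, study_type, ae in pairs})
    matrix = [
        [1 if (drug, study_type, ae) in pairs else 0 for study_type, ae in columns]
        for drug in drugs
    ]
    return drugs, columns, matrix


# HYPOTHETICAL records for illustration only
records = [
    ("drug_A", "preclinical", "Vomiting"),
    ("drug_A", "clinical", "Nausea"),
    ("drug_A", "clinical", "Nausea"),       # duplicate entry, removed
    ("drug_B", "preclinical", "Vomiting"),  # no clinical AE, drug dropped
    ("drug_C", "preclinical", "Alopecia"),
    ("drug_C", "clinical", "Nausea"),
]
drugs, columns, matrix = build_binary_matrix(records)
print(drugs)
print(columns)
print(matrix)
```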
4.2.2 Feature Filtering
Near-zero variance AE features were removed using the VarianceThreshold function from
sklearn.feature_selection in Python,391 with the threshold set to the variance of a Bernoulli random
variable, p(1-p), where p = 0.99. Different thresholds were assessed; however, this value was chosen as any
lower probability led to a large drop-off in the number of features retained (Figure 4-1). After filtering, 751
preclinical AEs and 1,740 clinical AEs remained, each present for a minimum of 23 and a maximum of
1,862 drugs.
Figure 4-1: Feature selection for preclinical and clinical adverse events. Shows the effect of different probability thresholds, on the number of variables retained for both the preclinical and clinical adverse events, when utilised to remove features with low variance.
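The threshold arithmetic used for feature filtering can be checked with a short plain-Python sketch. It does not call VarianceThreshold itself; it assumes that a binary feature is retained only when its variance q(1-q), with q the fraction of drugs showing the AE, is strictly greater than p(1-p). The helper names are illustrative.

```python
def bernoulli_variance(column):
    """Variance q(1 - q) of a 0/1 column, where q is the fraction of ones."""
    q = sum(column) / len(column)
    return q * (1 - q)


def keep_feature(column, p=0.99):
    """Retain the feature only if its variance exceeds p(1 - p)."""
    return bernoulli_variance(column) > p * (1 - p)


n_drugs = 2259  # number of drugs in the dataset described in the text


def column_with(k):
    """A 0/1 column in which the AE is present for exactly k drugs."""
    return [1] * k + [0] * (n_drugs - k)


# Smallest number of drugs an AE must be reported for in order to be retained
smallest_kept = next(k for k in range(1, n_drugs) if keep_feature(column_with(k)))
print(smallest_kept)
```

Under these assumptions the smallest retained count for 2,259 drugs is 23, which agrees with the minimum AE frequency reported for the filtered dataset.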
4.2.3 Mutual Information Associations
Concordance analysis
The normalized mutual information (MI) between each preclinical and clinical AE encoded by
the same MedDRA term was calculated using normalized_mutual_info_score from
sklearn.metrics in Python.391 Here we use the MI as a measure of the dependence between the
preclinical and clinical variables, as opposed to other methods of correlation, because it 1.
measures the general dependence rather than the linear dependence between two variables and
2. does not depend on the exact values but on the probability distribution of the variables.392
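To make the quantity concrete, the following plain-Python sketch computes a normalized mutual information for two binary AE vectors directly from their joint and marginal counts. It is an illustration rather than the code used in this work: it normalises by the arithmetic mean of the two entropies (the default in recent scikit-learn releases, whereas older releases used the geometric mean), it returns 0 for two constant vectors as a convention chosen for this sketch, and the example vectors are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two equal-length label sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x, count_y = Counter(x), Counter(y)
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


def normalized_mi(x, y):
    """MI divided by the arithmetic mean of the two entropies."""
    mean_entropy = (entropy(x) + entropy(y)) / 2
    if mean_entropy == 0:
        return 0.0  # sketch convention: constant vectors carry no information
    return mutual_information(x, y) / mean_entropy


# HYPOTHETICAL presence/absence of one AE term across eight drugs
preclinical = [1, 1, 1, 0, 0, 0, 0, 0]
clinical = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(normalized_mi(preclinical, clinical), 3))
```

Because the score depends only on the joint distribution of the two labels, it is unchanged if the 0 and 1 codes of either vector are swapped, which is the property that distinguishes it from a linear correlation coefficient.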
[Figure 4-1 axes: x-axis, Variance Threshold (Probability Value (p)); y-axis, Number of Variables]
The Fisher’s exact test implemented using scipy.stats.fisher_exact in Python393 was used to
assess the significance of the associations using a cut-off of 0.01 for the Bonferroni corrected p-
value to reduce the type 1 error.
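The significance test can be illustrated with a small self-contained sketch. It computes the two-sided Fisher's exact p-value from the hypergeometric distribution (summing the probabilities of all tables no more likely than the observed one) and applies a Bonferroni correction. It is written in plain Python instead of calling scipy.stats.fisher_exact; the contingency counts and the number of tests used below are hypothetical, as the number of tests entering the correction is not restated here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: drugs with both AEs, preclinical only, clinical only, neither.
p_raw = fisher_exact_two_sided(35, 5, 70, 2149)
p_adj = bonferroni(p_raw, n_tests=751 * 1740)  # assumed: one test per AE pair
significant = p_adj < 0.01
```

An association would then be retained only if the corrected p-value falls below the 0.01 cut-off.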
Statistical Associations between all preclinical and clinical AEs
The same methods as above were used to calculate the MI values between all
preclinical AEs and all clinical AEs, regardless of the MedDRA term. For each clinical AE, the top
three preclinical MI scores were retained.
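Retaining the three highest-scoring preclinical AEs per clinical AE can be sketched as below. The function name is hypothetical; the MI values in the example are those reported later in this chapter for clinical RENAL PAPILLARY NECROSIS.

```python
import heapq
from collections import defaultdict


def top_k_per_clinical(scores, k=3):
    """Keep the k preclinical AEs with the highest MI for each clinical AE.

    scores is an iterable of (preclinical_ae, clinical_ae, mi) tuples.
    """
    by_clinical = defaultdict(list)
    for pre, clin, mi in scores:
        by_clinical[clin].append((mi, pre))
    return {
        clin: [(pre, mi) for mi, pre in heapq.nlargest(k, items)]
        for clin, items in by_clinical.items()
    }


scores = [
    ("RENAL PAPILLARY NECROSIS", "RENAL PAPILLARY NECROSIS", 0.13),
    ("INTESTINAL ULCER", "RENAL PAPILLARY NECROSIS", 0.25),
    ("PERITONITIS", "RENAL PAPILLARY NECROSIS", 0.19),
    ("GASTROINTESTINAL ULCER", "RENAL PAPILLARY NECROSIS", 0.17),
]
top3 = top_k_per_clinical(scores)
```

In this example the matching preclinical term ranks fourth and is therefore dropped, which is how a concordant pair can disappear once the top-three restriction is applied.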
Assessing significance
The Fisher’s exact test implemented using scipy.stats.fisher_exact in Python393 was used to
assess the significance of the associations using a cut-off of 0.01 for the Bonferroni corrected p-
value.
Furthermore, for each individual preclinical and clinical AE the binary labels were randomised
using rv_discrete from SciPy, preserving the computed probability of observing the AE for each
of the vectors and generating randomised vectors of same length as the real vectors (2259
drugs). Then the normalized mutual information was calculated as before for the random set of
vectors for all preclinical to clinical AEs, repeated 10 times. The 99th percentile of the
distribution of randomised MI values was 0.011. This was used as a cut-off for significant
associations between the real data, where only associations with MI values greater than 0.011
were retained for subsequent analysis.
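The randomisation procedure can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: Bernoulli draws from the random module stand in for rv_discrete, the normalisation of the MI by the arithmetic mean of the entropies is assumed, the percentile is taken by simple rank, and the AE probabilities and vector length in the example are hypothetical and much smaller than the real data.

```python
import random
from collections import Counter
from math import log


def nmi(x, y):
    """Normalised MI of two label vectors (arithmetic-mean normalisation assumed)."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    hx = -sum(c / n * log(c / n) for c in px.values())
    hy = -sum(c / n * log(c / n) for c in py.values())
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant vector carries no information
    mi = sum(c / n * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())
    return max(0.0, mi) / ((hx + hy) / 2.0)


def random_binary(p, n, rng):
    """Draw n Bernoulli(p) labels, preserving the AE probability on average."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def null_threshold(pre_probs, clin_probs, n_drugs, repeats=10, percentile=99, seed=0):
    """Rank-based percentile of the MI between randomised preclinical/clinical AEs."""
    rng = random.Random(seed)
    null = []
    for _ in range(repeats):
        pre = [random_binary(p, n_drugs, rng) for p in pre_probs]
        clin = [random_binary(p, n_drugs, rng) for p in clin_probs]
        null.extend(nmi(x, y) for x in pre for y in clin)
    null.sort()
    idx = min(len(null) - 1, int(len(null) * percentile / 100))
    return null[idx]


# Hypothetical AE probabilities and a reduced vector length for illustration.
threshold = null_threshold([0.05, 0.2], [0.1, 0.3], n_drugs=500)
```

Associations in the real data whose MI does not exceed this threshold are indistinguishable from those arising between unrelated vectors with the same AE frequencies.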
This method implemented stringent cut-offs in both approaches to reduce the false positive rate
of our newly discovered associations. In total, after applying the Fisher’s exact test and the
randomisation derived cut-off, 2,050 significant associations were identified.
Quantifying risk
To quantify the risk implied by the associations, this method employed likelihood ratios.
The positive and negative likelihood ratios394 (LR+, LR-) were calculated to assess the risk of
experiencing a clinical AE given the presence of a preclinical AE and the likelihood of absence
of a clinical AE given the absence of a preclinical AE, respectively. The following formulas were
applied:
Equation 4-1:
$$\mathrm{LR+} = \frac{\mathit{sensitivity}}{1 - \mathit{specificity}}$$
Equation 4-2:
$$\mathrm{LR-} = \frac{1 - \mathit{sensitivity}}{\mathit{specificity}}$$
The likelihood ratios were used to compare to previous work, including the study by Clark et
al.232 This value is more easily interpretable than MI values in real terms and also uses all values
of the contingency table to provide an informative measure of the predictive value of a
preclinical AE for a clinical AE, which is not affected by prevalence of outcome.231 As mentioned
in the general introduction (Likelihood Ratio), the likelihood ratio has been argued to be a
better indicator of concordance for this type of analysis than the traditional sensitivity metric,
as it takes into account the false negatives and the false positive values,230 as well as providing
extra interpretation. A positive or negative likelihood ratio of 1 means that there is no useful
information in the probability of the preclinical finding to predict the clinical outcome, whereas
a positive likelihood ratio (LR+) of greater than 10 shows a large and conclusive change in
probability of a clinical AE given the probability of the preclinical AE. Values between 5-10 are
moderate and values above 2 can be significant. Conversely, for the negative likelihood ratio
(LR-), values closer to 0 are more useful to determine the likelihood of the absence of a
preclinical finding predicting the absence of the clinical AE.231,232 We can use these metrics
additionally to measure the directionality of our associations.
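A worked example of Equations 4-1 and 4-2 is sketched below. The contingency counts are hypothetical; the function names and the band labels are illustrative, with the band boundaries taken from the interpretation given above (greater than 10 large, 5-10 moderate, above 2 potentially significant).

```python
def likelihood_ratios(tp, fp, fn, tn):
    """LR+ and LR- from a 2x2 table of drugs.

    tp: preclinical and clinical AE both present
    fp: preclinical AE present, clinical AE absent
    fn: preclinical AE absent, clinical AE present
    tn: both absent
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1.0 - specificity) if fp > 0 else float("inf")
    lr_neg = (1.0 - sensitivity) / specificity if tn > 0 else float("inf")
    return lr_pos, lr_neg


def lr_pos_band(lr_pos):
    """Qualitative band for LR+ following the interpretation in the text."""
    if lr_pos > 10:
        return "large"
    if lr_pos >= 5:
        return "moderate"
    if lr_pos > 2:
        return "small"
    return "minimal"


# Hypothetical counts: sensitivity 0.4 and specificity 0.9.
lr_pos, lr_neg = likelihood_ratios(tp=40, fp=10, fn=60, tn=90)
band = lr_pos_band(lr_pos)
```

Here the LR+ is approximately 4 and the LR- approximately 0.67, i.e. the preclinical finding raises the odds of the clinical AE about fourfold, whereas its absence lowers them only slightly.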
4.2.4 Network Analysis of Significant Associations
All 2050 statistically significant associations were analysed as a network graph,352 where nodes
are AEs (preclinical or clinical) and edges connect two AEs which are statistically associated.
The network is directed since edges are formed by associations from one preclinical AE to one
clinical AE. A single node can have multiple connections, due to the interdependency of AEs
with one another. The network was constructed using the force-directed layout based on a
modified version of the Fruchterman-Reingold algorithm395 in TIBCO Spotfire.352 The network
was clustered by using the concept of a connected component,396,397 which is a maximal
connected subgraph of the entire graph, where each node belongs to one connected component
only. The degree of each node, defined as the number of edges connecting the node to other
nodes in the network was calculated to identify central AEs in the network. Due to the directed
nature of the network, all degrees for preclinical AEs should be considered as the outdegree and
all degrees for clinical AEs should be considered the indegree.
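The connected-component clustering and degree calculation can be sketched in a few lines of plain Python. This is an illustration only and not the Spotfire workflow used in this work; nodes are assumed to be (kind, term) tuples so that the same MedDRA term can occur as both a preclinical and a clinical node, and the edge list is a small subset of associations reported in this chapter.

```python
from collections import Counter, defaultdict


def connected_components(edges):
    """Connected components of a directed edge list, ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in list(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(comp)
    return components


edges = [
    (("preclinical", "MOBILITY DECREASED"), ("clinical", "BILE DUCT STONE")),
    (("preclinical", "GOITRE"), ("clinical", "GOITRE")),
    (("preclinical", "INTESTINAL ULCER"), ("clinical", "RENAL PAPILLARY NECROSIS")),
    (("preclinical", "PERITONITIS"), ("clinical", "RENAL PAPILLARY NECROSIS")),
]
components = connected_components(edges)
out_degree = Counter(u for u, _ in edges)  # preclinical AEs
in_degree = Counter(v for _, v in edges)   # clinical AEs
```

Each node appears in exactly one component, and the in-degree of a clinical node counts the preclinical AEs associated with it.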
4.2.5 Limitations
When interpreting the results of this study there are certain limitations that should be noted
which are related to the dataset and also the methods used.
Firstly, this analysis did not consider the dose of drug administered in either the preclinical or
the clinical studies. The dosages in preclinical studies were therefore assumed to be relevant to the
human dose in clinical trials and to achieve a similar plasma concentration.
Furthermore, this analysis did not analyse the frequency i.e. the number of animals or humans
which experienced the toxicity, and instead encoded the presence or absence of an AE across
all subjects and studies conducted for each drug. This is because this quantitative information
was not captured within the PharmaPendium database. Also, the degree or severity of the
toxicity was not encoded in the data used and, although some MedDRA terms can be deemed
more severe events, there was no classification of the degree of severity for the AE itself (i.e.
whether the AE was mild or if it caused hospitalisation or death).
Additionally, since only data for drugs which continued to clinical trials was used, it must be
noted that the drug set on which this analysis was performed was limited in that those drugs
causing severe preclinical toxicity observations will have been excluded from our analysis. In
consequence we are likely to have limited data on both the preclinical and clinical side for
serious AEs.
4.3 Results and Discussion
4.3.1 Concordance Analysis
To conduct a concordance analysis similar to previous studies, we computed the Mutual
Information (MI) and Likelihood Ratio (LR) values for those AEs which were recorded by the
same MedDRA terms preclinically and clinically.232 (MedDRA terms are represented throughout
the text in small capitals for clarity.) Overall, 102 of the possible 473 matching preclinical
and clinical MedDRA terms (22 %) were identified as significantly associated
with one another (Supplementary Data File 3).
The median MI value for the concordant associations was 0.048 covering a wide range of MI
values between 0.02 and 0.33 (Table 4-1). The median Positive Likelihood Ratio (LR+) was 3.75,
with a large range between 1.25 and 118.48. 96 % of those significant associations derived from
the MI values are also significant in terms of the LR+ (greater than 2) and can be interpreted as
showing some degree of concordance i.e. the presence of a preclinical AE increased the
likelihood of the presence of a clinical AE. The LR- median value was 0.82 and ranged from 0.18,
indicative of a moderate shift in the probability of the absence of the clinical AE given the
absence of the preclinical AE, to 0.96, indicative of a very small and not important shift of the
absence of the clinical AE given the absence of the preclinical AE. This finding that the LR+
values were significant, but the LR- values were not, is consistent with previous studies of
concordance.232 The preclinical model was often diagnostic for the clinical AE in only the
positive direction.
Table 4-1: Metrics for the mutual information, positive likelihood ratio and negative likelihood ratio across all concordant associations.
Measure | Minimum Value | Maximum Value | Median | Mean
Mutual Information | 0.02 | 0.33 | 0.05 | 0.05
Positive Likelihood Ratio | 1.25 | 118.48 | 3.75 | 5.99
Negative Likelihood Ratio | 0.18 | 0.96 | 0.82 | 0.80
The 10 highest associations for the same adverse event encoded preclinically and clinically, as
measured by the MI, are shown in Table 4-2. For 7 out of 10 reported concordant AEs, the LR+
values from this study were comparable to those reported in previous studies.232,322 For BLOOD
PROLACTIN INCREASED, DIARRHOEA and INJECTION SITE ERYTHEMA, the LR+ values from previous
studies were reported from only one species, rather than from all preclinical species as
investigated here. In contrast to previous studies, we also find that RENAL PAPILLARY NECROSIS,
SEDATION and PLATELET COUNT DECREASED were concordant between animals and humans with
LR+ values of 33, 5 and 4 respectively. We recommend that the concordant toxicities found
through this analysis should continue to be assessed through preclinical models, since we
provide evidence of their past success in predicting clinical drug-induced toxicities.
Table 4-2: The 10 most statistically significant concordant adverse events as ranked by the Mutual Information. Concordant here refers to AEs encoded by the same MedDRA term preclinically and clinically, for which statistically significant associations were derived from our method. The MI and the LR+s values were reported, as well as the LR+ values from previous studies, if present (For all other associations see Supplementary Data File 3).
Preclinical AE | Clinical AE | Normalised Mutual Information | Positive Likelihood Ratio | Previous Study Positive Likelihood Ratio (reference)
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 118 | 162 (322)
RENAL PAPILLARY NECROSIS | RENAL PAPILLARY NECROSIS | 0.13 | 33 | N/A
BLOOD PROLACTIN INCREASED | BLOOD PROLACTIN INCREASED | 0.13 | 23 | 33 from rats (322)
ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | 0.11 | 11 | 11 (232)
ALANINE AMINOTRANSFERASE INCREASED | ALANINE AMINOTRANSFERASE INCREASED | 0.11 | 3 | 6 (232)
CARDIOTOXICITY | CARDIOTOXICITY | 0.10 | 25 | N/A
PLATELET COUNT DECREASED | PLATELET COUNT DECREASED | 0.09 | 4 | N/A
DIARRHOEA | DIARRHOEA | 0.09 | 4 | 12 from monkeys (322)
INJECTION SITE ERYTHEMA | INJECTION SITE ERYTHEMA | 0.08 | 10 | 11 from rabbits (322)
SEDATION | SEDATION | 0.08 | 5 | N/A
From the study by Clark et al,232 there were five AEs reported at the preferred term level of the
MedDRA hierarchy as being the most concordant; these were ELECTROCARDIOGRAM QT
PROLONGED, INJECTION SITE REACTION, ARRHYTHMIA, ELECTROCARDIOGRAM QT CORRECTED
INTERVAL PROLONGED and DRUG SPECIFIC ANTIBODY PRESENT. Each of these preclinical AEs was
also found in our analysis to be significantly associated with the same event clinically, with
LR+ greater than 5, indicating moderate or high shifts in the probability of the presence of a
clinical AE given the presence of the same preclinical AE. Using our method to find associations
based on the MI between preclinical and clinical AEs we identified the same AEs as previous
studies with LR+ values of the same order of magnitude.
The number of associations with a LR+ above 5, indicating a moderate or large shift in the
probability of the presence of a clinical AE given the presence of the same preclinical AE, was
29 (28 % of significant associations) of which only 7 (6.8 % of significant associations)
(Supplementary Data File 3) exceeded a LR+ of 10, indicating a large and conclusive shift in
probability. This shows that most of our associations had likelihood ratios which were
significant but of low quantitative risk (see methods).
Overall, this section highlights that the concordance results from our analysis are largely
consistent with previous concordance analyses, displaying comparable positive likelihood
ratios. This highlights the suitability of our method to find known associations between
preclinical and clinical AEs. Our study also found that RENAL
PAPILLARY NECROSIS, SEDATION and PLATELET COUNT DECREASED were amongst the most
concordant AEs across species, observations which were not explicitly reported in previous
studies. This shows that the method was able to discover new examples of concordant toxicities,
for which existing preclinical models have succeeded in predicting the clinical toxicity in the
past and can therefore be used for toxicity testing in the future. Since only 22 %
of MedDRA terms which were present both preclinically and clinically were statistically
concordant, and only 7 % of these had a positive likelihood ratio greater than 10, the
question arises as to whether other preclinical AEs in the dataset might be more relevant than the
exact MedDRA term match in predicting certain clinical AEs.
4.3.2 Statistical Associations Between All Preclinical and Clinical AEs
Since only 22 % of clinical AEs with a matching term preclinically were significantly predicted
by their preclinical counterpart, we next investigated which drug-induced animal toxicities were
predictive for human adverse events (AEs), without the requirement that the AE be encoded by
the same term in both species. In total, 2,050 statistically significant associations were found
between preclinical and clinical AEs (Supplementary Data File 4), with Mutual Information (MI)
values ranging between 0.02 - 0.33, Positive likelihood ratio (LR+) values ranging between 1.5 -
118.5 and Negative Likelihood Ratio (LR-) values ranging between 0.10-0.97. Consistent with the
previous study,232 we find that whilst the Positive likelihood ratios show a significant change in
risk of the clinical AE given the preclinical AE (92 % of significant associations had LR+ greater
than 2), the negative likelihood ratios show that the absence of a preclinical AE does not alter
the probability of clinical safety by an important degree (91 % of the LR- values were between
0.5-1). Therefore, it is important to note that these associations are asymmetrical in nature and
absence of the preclinical finding does not (in most cases) indicate absence of the clinical AE.
The relationship between the MI and LR+ is represented in Figure 4-2. Overall there was a trend
between the MI and the LR+ in the positive direction, however, this was found to be dependent
on another variable which was the number of drugs which caused both AEs for the association.
As can be seen in Figure 4-2, for each quartile of the number of drugs which displayed both AEs,
there is a roughly linear relationship between the log MI and the log LR+ values, suggesting that
each LR+ value is a function of both variables. We also note that the higher the number of drugs
which have both AEs, the lower the LR+ value will be for the same value of the MI. Overall, this
shows that LR+ is a function of both MI and the number of intersecting drugs.
Figure 4-2: The relationship between the log positive likelihood ratio (LR+) and the mutual information (MI), coloured by the quartile of the number of compounds overlapping between the two AEs. There is a linear relationship between the log values of the MI and the LR+; however, the number of compounds which overlap between the preclinical and clinical AE also influences the LR+ value, but not the MI value. The higher the number of compounds which display both AEs, the lower the likelihood ratio for the same value of the MI.
We chose to restrict our associations to the three preclinical AEs which gave the highest MI
values for each clinical AE for interpretability reasons. In doing this, we reduced our previous
102 concordant AEs to only 43 AEs where the same AE preclinically is one of the highest three
MI values which can be derived for the clinical observation (Supplementary Data File 4). The
10 concordant associations with the highest MI are now shown in Table 4-3. As can be seen, two
of the previous associations in Table 4-2, RENAL PAPILLARY NECROSIS and ALANINE
AMINOTRANSFERASE INCREASED are now no longer in our results, since other preclinical AEs in
the dataset are more predictive of the clinical AE. For the AE of clinical RENAL PAPILLARY
NECROSIS, preclinical INTESTINAL ULCER, PERITONITIS and GASTROINTESTINAL ULCER had higher MI
values (0.25, 0.19 and 0.17) than the association between preclinical RENAL PAPILLARY NECROSIS
and the same event clinically (MI 0.13). For clinical ALANINE AMINOTRANSFERASE INCREASED,
preclinical RED BLOOD CELL COUNT DECREASED, VOMITING and DECREASED ACTIVITY had higher MI
values (0.12, 0.12 and 0.12 respectively) in comparison to preclinical ALANINE AMINOTRANSFERASE
INCREASED which had an MI value of 0.11 (Supplementary Data File 4). These two examples show
that whilst some preclinical AEs were predictive of the same clinical AE, there were other
preclinical AE predictors in the dataset which can also be considered significantly associated
with the clinical endpoint.
Table 4-3: Statistically significant concordant adverse events appearing within the top 3 preclinical AEs associated with each clinical AE. The MedDRA terms for which statistically significant associations between the same term preclinically and clinically were found according to the thresholds used in this study. The MI and the LR+s values were reported.
Preclinical AE | Clinical AE | Normalised Mutual Information (MI) | Positive Likelihood Ratio (LR+)
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 118
BLOOD PROLACTIN INCREASED | BLOOD PROLACTIN INCREASED | 0.13 | 23
ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | 0.11 | 11
CARDIOTOXICITY | CARDIOTOXICITY | 0.10 | 25
PLATELET COUNT DECREASED | PLATELET COUNT DECREASED | 0.09 | 4
DIARRHOEA | DIARRHOEA | 0.09 | 4
INJECTION SITE ERYTHEMA | INJECTION SITE ERYTHEMA | 0.08 | 10
SEDATION | SEDATION | 0.08 | 5
ELECTROCARDIOGRAM QT PROLONGED | ELECTROCARDIOGRAM QT PROLONGED | 0.08 | 7
OVARIAN DISORDER | OVARIAN DISORDER | 0.08 | 5
So far, we have analysed the associations which are most predictive of the presence of a clinical
toxicity, however, understanding where associations can predict safety for a clinical endpoint is
also important. This is quantified by the LR- values where low LR- values can be considered
important for understanding where an absence of a preclinical AE was indicative of clinical
safety. The three associations with a low value (< 0.2) for the LR-, indicative of a moderate shift
in probability given the absence of the preclinical AE that there will be no clinical AE, are shown
in Table 4-4. These were the findings that the absence of WEIGHT DECREASED in animals was
predictive of a high probability of no AORTIC ANEURYSM and no clinical ULCER in humans (LR-
0.19 and 0.10 respectively) and that, whilst preclinical OVARIAN DISORDER had a moderate LR+
value for the same clinical finding, we also observe a low value for the LR- of 0.18, indicative of the
absence of OVARIAN DISORDER in animals leading to a shift in the probability of the drug being
safe with respect to OVARIAN DISORDERS in humans.
Table 4-4: AEs contained in associations which were significant as determined by our analysis and which had low Negative Likelihood Ratios (<0.2), indicating a moderate shift in the probability that the clinical AE will not be present, given that the preclinical AE was not present.
Preclinical AE | Clinical AE | Normalised Mutual Information | Negative Likelihood Ratio | Positive Likelihood Ratio
WEIGHT DECREASED | AORTIC ANEURYSM | 0.03 | 0.19 | 1.8
WEIGHT DECREASED | ULCER | 0.04 | 0.10 | 2.0
OVARIAN DISORDER | OVARIAN DISORDER | 0.08 | 0.18 | 5.5
The 20 associations with the highest MI values from all significant associations are presented
in Table 4-5, along with their Likelihood Ratios to quantify the conditional risk. The association
with the highest MI score of 0.33 was the preclinical finding of DRUG SPECIFIC ANTIBODY PRESENT
associated with the same finding clinically, a concordant observation highlighted in the
previous section. The same AE preclinically was also associated with clinical INFUSION RELATED
REACTION, most likely because monoclonal antibodies are administered by infusion.324 The other
members of the 20 highest MI associations have a few common preclinical findings, including
WAXY FLEXIBILITY, BLOOD PROLACTIN INCREASED, INTESTINAL ULCER, ADRENAL CORTEX ATROPHY
and PERITONITIS, which associated with different clinical observations. Preclinical WAXY
FLEXIBILITY associates with the most clinical toxicities for the top 20 associations and the
preclinical finding with the second highest number of clinical associations was BLOOD
PROLACTIN INCREASED. Both preclinical findings were predictive of a range of central nervous
system (CNS) side effects, which have been recognised as a toxicity area which needs to be better
predicted in preclinical models.398,399 From our analysis we find that a measurement of BLOOD
PROLACTIN INCREASED combined with behavioural symptoms of catatonia in animal species (e.g.
WAXY FLEXIBILITY) may serve as good indicators of a variety of disorders, including
SCHIZOPHRENIA, COGWHEEL RIGIDITY, dyskinesias, spasms and male fertility problems. A wide
variety of behavioural observations have previously been identified in animals as indicative of
potential CNS side effects in humans,400 however, from our study we highlighted that the
behavioural observation of WAXY FLEXIBILITY when present in preclinical studies elevated the
risk of certain neurological side effects in the clinic by up to 44 times. Similarly, a wide range of
liquid biomarkers have been previously identified as being important for neurological toxicity
prediction,399 however, here we identified that the biomarker of increased blood prolactin levels
in animals strongly associates with clinical neurological toxicities, with the presence of
preclinical BLOOD PROLACTIN INCREASED elevating the risk of some neurological toxicities by over
50 times. Together, or separately, these preclinical observations can influence the strategy used
to flag an increased likelihood of clinical CNS toxicities. We discuss these findings further in the
section Frequently Predictive Preclinical AEs.
Preclinical INTESTINAL ULCER and preclinical PERITONITIS were both associated with RENAL
PAPILLARY NECROSIS within the highest 20 associations as ranked by Mutual Information, with
preclinical INTESTINAL ULCER found to be more predictive for the presence of clinical RENAL
PAPILLARY NECROSIS (LR+ = 60), than the preclinical observation of RENAL PAPILLARY NECROSIS
(LR+ = 33) itself. INTESTINAL ULCER, PERITONITIS and RENAL PAPILLARY NECROSIS are side effects
observed in nonsteroidal anti-inflammatory drugs, such as those targeting Cyclooxygenase-1
(Cox-1), and gastrointestinal ulceration is one of the early symptoms of RENAL PAPILLARY
NECROSIS. Cox-1 inhibition is known to lead to decreased prostaglandin production which is
necessary for both mucosal integrity in the GI tract and maintenance of blood flow to the renal
papillae.401 These associations can therefore be linked to the primary pharmacology of anti-
inflammatory drugs. This finding is particularly useful as there is difficulty in producing robust
reproducible models of RENAL PAPILLARY NECROSIS.402 Here we highlight that we can find
literature evidence to connect these AEs by mechanisms of drug action, despite their different
System Organ Class (SOC) groupings, showing that formal organ class categorizations of
biological effects imply differences where underlying mechanisms may well be conserved.
Preclinical ADRENAL CORTEX ATROPHY was associated with ADRENAL SUPPRESSION clinically (LR+
= 25). ADRENAL SUPPRESSION is one of the main side effects associated with long term
corticosteroid usage.403 Since the adrenal glands are suppressed from corticosteroid usage, less
cortisol is released over time. In animals this effect is often measured by histopathological
examination of the adrenal glands.404 This functional link suggests that ADRENAL CORTEX
ATROPHY in histopathology studies can be used to predict ADRENAL SUPPRESSION.
These highest-ranking associations can be linked by strong literature evidence. All other
significant associations can be analysed in the same way to indicate where preclinical models
might be useful for anticipating the risk of clinical AEs of interest. Notably, every clinical AE
contained within the 20 highest-ranking associations, except for DRUG SPECIFIC ANTIBODY PRESENT,
had a higher MI value with a non-identical preclinical AE than with the
matching preclinical AE. Furthermore, in cases where the clinical AE
cannot be replicated in animals due to species differences or difficulty of measurement, we
present here that alternate preclinical toxicities may be employed for risk profiling. Finally, by
extending beyond simple term matching of adverse events, this analysis demonstrates that it is
possible to find relationships between adverse events in preclinical and clinical space also
between different SOC groups, which indicates that underlying biological mechanisms can be
conserved despite their different formal ontological annotations. We show that our analysis was
able to discover 2,007 statistically significant relationships between preclinical and clinical AEs
that were beyond the straightforward term matching, as well as 43 concordant associations, to
make up 2,050 associations overall. These results can provide preclinical scientists with
information to determine which preclinical models give valuable information about which
clinical toxicities, as well as help to identify those preclinical models which may be less informative
than previously believed.
Table 4-5: The most significant statistical associations between preclinical and clinical adverse events ranked by the highest MI values. Reported are the normalised mutual information, Bonferroni corrected p-value from Fisher’s exact test, positive and negative likelihood ratios and the number of intersecting drugs between the preclinical and clinical AE terms.
Preclinical AE | Clinical AE | Normalised Mutual Information (MI) | Bonferroni corrected Fisher's exact test p-value | Positive Likelihood Ratio (LR+) | Negative Likelihood Ratio (LR-) | Number of intersecting drugs between preclinical and clinical AE
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 1.37 x 10^-39 | 119 | 0.67 | 35
DRUG SPECIFIC ANTIBODY PRESENT | INFUSION RELATED REACTION | 0.24 | 4.40 x 10^-24 | 44 | 0.66 | 24
WAXY FLEXIBILITY | DYSKINESIA OESOPHAGEAL | 0.20 | 3.98 x 10^-11 | 43 | 0.60 | 11
WAXY FLEXIBILITY | FACIAL SPASM | 0.20 | 3.98 x 10^-11 | 43 | 0.60 | 11
WAXY FLEXIBILITY | MEIGE'S SYNDROME | 0.20 | 7.98 x 10^-8 | 44 | 0.57 | 10
WAXY FLEXIBILITY | INTESTINAL DILATATION | 0.19 | 6.20 x 10^-10 | 42 | 0.59 | 10
WAXY FLEXIBILITY | SPERM COUNT INCREASED | 0.19 | 6.20 x 10^-10 | 42 | 0.59 | 10
WAXY FLEXIBILITY | ASPHYXIA | 0.19 | 1.92 x 10^-12 | 39 | 0.67 | 13
WAXY FLEXIBILITY | RETROGRADE EJACULATION | 0.18 | 1.63 x 10^-10 | 39 | 0.64 | 11
WAXY FLEXIBILITY | AUTONOMIC NERVOUS SYSTEM IMBALANCE | 0.18 | 9.39 x 10^-14 | 37 | 0.72 | 15
WAXY FLEXIBILITY | ASPERMIA | 0.18 | 1.65 x 10^-9 | 39 | 0.62 | 10
WAXY FLEXIBILITY | BLOOD ANTIDIURETIC HORMONE INCREASED | 0.18 | 1.65 x 10^-9 | 39 | 0.62 | 10
BLOOD PROLACTIN INCREASED | COGWHEEL RIGIDITY | 0.26 | 4.35 x 10^-15 | 53 | 0.46 | 13
BLOOD PROLACTIN INCREASED | DROOLING | 0.20 | 4.71 x 10^-12 | 33 | 0.58 | 12
BLOOD PROLACTIN INCREASED | FACIAL SPASM | 0.18 | 3.98 x 10^-11 | 36 | 0.60 | 11
BLOOD PROLACTIN INCREASED | SCHIZOPHRENIA | 0.18 | 3.36 x 10^-11 | 35 | 0.63 | 12
BLOOD PROLACTIN INCREASED | TARDIVE DYSKINESIA | 0.18 | 1.58 x 10^-16 | 33 | 0.75 | 19
INTESTINAL ULCER | RENAL PAPILLARY NECROSIS | 0.25 | 2.02 x 10^-15 | 60 | 0.65 | 17
ADRENAL CORTEX ATROPHY | ADRENAL SUPPRESSION | 0.20 | 7.97 x 10^-20 | 25 | 0.54 | 22
PERITONITIS | RENAL PAPILLARY NECROSIS | 0.19 | 4.08 x 10^-18 | 27 | 0.59 | 20
4.3.3 Network Analysis of Significant AEs
Next, we aimed to uncover the AEs which are involved in three different types of relationships;
namely the relationship of multiple preclinical AEs with one clinical AE, the relationship of one
preclinical AE to multiple clinical AEs and a one-to-one relationship between one preclinical
and one clinical AE. This can help to guide how preclinical toxicity endpoints could be used in
practice. To determine these relationships, we constructed a network of the significant
associations where nodes corresponded to preclinical and clinical AEs and edges connected AEs
which are statistically associated by our MI analysis (see Mutual Information Associations),
and analysed its structure using the concept of connected components. Connected
components are maximal connected subgraphs of a network which are isolated from the
other nodes of the network and can be considered distinct clusters of related nodes.
The overall network structure is displayed in Figure 4-3, where we can identify 27 different
connected components. These 27 components can be considered biological clusters of
associations. Table 4-6 shows the number and identity of the nodes involved in the network
components. The numbers of nodes in each component varied from 2 to 1,129, with component
1 containing 95 % of the AEs. The composition of component 1 will be discussed further in the
next section. The other components had fewer AEs and therefore represent AEs
with more specific relationships, examples of which will be analysed further in the context of
the relationships stated above.
Figure 4-3: Network analysis for the statistically significant associations. Shows visually the 27 connected components of the network formed from significant associations between preclinical and clinical AEs. Greyed out is the first connected component which forms the largest hub of the network, highlighted are the nodes in the network which form separate connected components, detailed in Table 4-6
Table 4-6: Adverse events contained in the connected components from the network analysis. 27 connected components of the network of 2050 significant associations, listing the preclinical and clinical AE terms, as well as the number of nodes for each connected component of the network
Connected Component ID
Preclinical Nodes Clinical Nodes Number of nodes
1 Large component containing the bulk of the network (see Supplementary Data File 5)
Large component containing the bulk of the network (see Supplementary Data File 5)
1,129
2 MOBILITY DECREASED
BILE DUCT STONE 2
3 HAEMATOCRIT DECREASED
BLADDER CANCER 2
4 CARDIOTOXICITY
BLOOD ALKALINE PHOSPHATASE CARDIOTOXICITY SKIN HYPERPIGMENTATION
4
5 URINE SODIUM INCREASED
BLOOD CHOLESTEROL ABNORMAL 2
6 BLOOD CALCIUM DECREASED
BLOOD PARATHYROID HORMONE
INCREASED HAEMOGLOBIN INCREASED
3
7 UTERINE DISORDER PROSTATIC DISORDER INCREASED APPETITE
BREAST DISORDER LOW DENSITY LIPOPROTEIN INCREASED
5
8 TESTIS CANCER
BREAST TENDERNESS 2
9 POOR WEIGHT GAIN NEONATAL
CARDIAC DEATH 2
10 GASTRITIS
DUODENAL ULCER HAEMORRHAGE 2
11 BIOPSY VAGINA ABNORMAL BIOPSY UTERUS ABNORMAL
ENDOMETRIAL HYPERPLASIA UTERINE HAEMORRHAGE
4
12 ABORTION
EYE PRURITIS GROWTH OF EYELASHES
3
13 GOITRE
GOITRE 2
14 SPINAL CORD DISORDER
HYPOMAGNESAEMIA 2
15 WEIGHT INCREASED JOINT SPRAIN 3
16 GROWTH RETARDATION MOOD SWINGS 2
17 COMPLICATION OF PREGNANCY MYELODYSPLASTIC SYNDROME 2
18 LYMPHOMA HEPATIC NEOPLASM MALIGNANT
MYOSITIS 3
19 ARTHROPATHY TENDON DISORDER NEUROTOXICITY
3
20 BONE DEVELOPMENT ABNORMAL SCIATICA 2
21 HEPATIC NECROSIS SENSORY DISTURBANCE 2
22 DECREASED APPETITE SINUS DISORDER 2
23 CONGENITAL ANOMALY SMALL INTESTINAL OBSTRUCTION 2
24 SPERMATOGENESIS ABNORMAL SPERM COUNT DECREASED 2
25 ELECTROCARDIOGRAM QT SHORTENED SUDDEN CARDIAC DEATH 2
26 NASAL MUCOSAL DISORDER THROAT IRRITATION 2
27 GASTROINTESTINAL INFLAMMATION UPPER GASTROINTESTINAL
HAEMORRHAGE 2
Specific Relationships Between AEs
Next, we discuss the components of the network which formed a variety of specific
relationships, including 1. multiple preclinical AEs being predictive of one clinical AE, 2. the
relationship of one preclinical AE being predictive of multiple clinical AEs, 3. multiple
preclinical AEs predictive of multiple clinical AEs, and 4. a one-to-one relationship where one
preclinical AE is predictive of one clinical AE. Overall, for the 26 connected components, there
were 19 one-to-one relationships, 4 one-to-many relationships, 1 many-to-one relationship and
3 many-to-many relationships (Supplementary Data File 5).
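The classification of a component into one of these relationship types depends only on how many preclinical and clinical nodes it contains, as sketched below. The function name and the (kind, term) node representation are assumptions made for illustration; the first example is component 25 of Table 4-6 and the second is a hypothetical component.

```python
def relationship_type(component):
    """Classify a component by its numbers of preclinical and clinical nodes."""
    n_pre = sum(1 for kind, _ in component if kind == "preclinical")
    n_clin = sum(1 for kind, _ in component if kind == "clinical")
    pre = "one" if n_pre == 1 else "many"
    clin = "one" if n_clin == 1 else "many"
    return f"{pre}-to-{clin}"


component_25 = {
    ("preclinical", "ELECTROCARDIOGRAM QT SHORTENED"),
    ("clinical", "SUDDEN CARDIAC DEATH"),
}
# Hypothetical component with one preclinical and two clinical nodes.
toy_component = {
    ("preclinical", "AE_X"),
    ("clinical", "AE_Y"),
    ("clinical", "AE_Z"),
}
types = [relationship_type(component_25), relationship_type(toy_component)]
```

Applying such a function to every connected component and tallying the results yields counts of the kind reported above.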
One-to-one Relationship
Here we discuss examples of one-to-one relationships along with their MI and LRs, as well as
literature evidence for the link.
The one-to-one relationship of preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical
SUDDEN CARDIAC DEATH was found from our network analysis to be the association with the
highest value for the MI (0.11) out of all one-to-one relationships (Component 25). The LR+
indicated that it was 23 times more likely to observe clinical SUDDEN CARDIAC DEATH given
preclinical ELECTROCARDIOGRAM QT SHORTENED than in the absence of this observation. This is
a known association with the well-understood molecular drivers of potassium and calcium
channels in the heart.405,406 Despite the knowledge that shortening of the QT interval is
associated with severe ventricular fibrillation, the use of QT shortening in animals as a
biomarker for clinical arrhythmias and death was identified as an area which needed to be
addressed, since pharmaceutical companies have noticed an increase in QT shortening being
reported in preclinical studies.407 This report noted that further research should be conducted
on the mechanisms behind QT shortening and its translation between preclinical and clinical
settings. Here, we highlighted the strong predictive value of preclinical ELECTROCARDIOGRAM
QT SHORTENED for clinical SUDDEN CARDIAC DEATH, quantified by a likelihood ratio of 23. We
also discuss potential mechanisms for this association in the next chapter. This information can
support the case for regulatory agencies to consider this risk factor in the future.
Another one-to-one relationship with a high MI value (0.10) was the association
of preclinical MOBILITY DECREASED with clinical BILE DUCT STONE (Component 2). The LR+ value
for this association was 21, showing a high risk given the preclinical observation (Supplementary
Data File 4). Drug-induced bile duct stones can be potentially fatal and experimental models
for this endpoint are not well known.408 The observation that preclinical MOBILITY DECREASED is
currently the best predictor and only significant predictor for determining the risk of a BILE DUCT
STONE clinically could be a starting point for assessing the risk of drug-induced bile duct stone
formation in humans in the future.
A third one-to-one relationship with a high MI value (0.10) was the association of
preclinical TESTIS CANCER with clinical BREAST TENDERNESS (Component 8). The link between
testicular cancer and breast tenderness is known, as breast tenderness is a symptom of testicular
cancer in 10 % of men, thought to be related to the release of beta human chorionic gonadotropin,
which is secreted by testicular tumours.409
Overall, we find strong explainable links for these examples of one-to-one associations with the
highest MI values from our analysis and we suggest that these one-to-one links can be used for
future risk assessment of clinical AEs based on the preclinical AE.
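To make the MI and LR+ values quoted in this section easier to interpret, the following is a minimal sketch of how both quantities can be computed for one preclinical-clinical AE pair from a 2x2 table of drug counts. It assumes the standard diagnostic-test definition of LR+ (sensitivity divided by the false positive rate, with the preclinical AE treated as the test) and MI in bits (log base 2); the function names and the counts are hypothetical and are not taken from the thesis data.

```python
import math


def mutual_information(a, b, c, d):
    """MI (in bits) between two binary variables from a 2x2 table of drug counts.

    a: drugs with both the preclinical and the clinical AE
    b: drugs with the preclinical AE only
    c: drugs with the clinical AE only
    d: drugs with neither AE
    """
    n = a + b + c + d
    mi = 0.0
    # (joint count, preclinical margin, clinical margin) for each cell
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    for joint, row, col in cells:
        if joint > 0:  # empty cells contribute nothing to the sum
            mi += (joint / n) * math.log2(joint * n / (row * col))
    return mi


def positive_likelihood_ratio(a, b, c, d):
    """LR+ = sensitivity / (1 - specificity); undefined when b == 0."""
    sensitivity = a / (a + c)          # P(preclinical AE | clinical AE)
    false_positive_rate = b / (b + d)  # P(preclinical AE | no clinical AE)
    return sensitivity / false_positive_rate


# Hypothetical counts for illustration only
a, b, c, d = 6, 14, 24, 1556
print(round(mutual_information(a, b, c, d), 4))
print(round(positive_likelihood_ratio(a, b, c, d), 1))
```

An LR+ well above 1 indicates that the preclinical AE is reported far more often for drugs that show the clinical AE than for drugs that do not, which is the sense in which the likelihood ratios above are read.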
One-to-many Relationships
The component demonstrating a one-to-many relationship containing the associations with the
highest MI values was component 4, in which preclinical CARDIOTOXICITY was associated with
three clinical AEs: BLOOD ALKALINE PHOSPHATASE, CARDIOTOXICITY and SKIN
HYPERPIGMENTATION, with MI values of 0.13, 0.10 and 0.08 and LR+ values of
33, 24 and 17 respectively. This shows concordance between the general term CARDIOTOXICITY
from animal models to humans and quantifies the risk of experiencing CARDIOTOXICITY
clinically, given the observation preclinically, to be 25 times as high as if CARDIOTOXICITY was
not experienced preclinically. BLOOD ALKALINE PHOSPHATASE usually indicates a problem with
the liver, gallbladder or bones; however, it has been connected to myocardial infarction after a
drug-eluting stent has been administered to coronary artery disease patients.410 This is an
example of a non-causal association. Tissue nonspecific alkaline phosphatase also has a role in
vasculature stiffening which can lead to heart failure.411 Considering the high LR+ (33) of
experiencing BLOOD ALKALINE PHOSPHATASE given CARDIOTOXICITY preclinically, it is surprising
that a stronger evidence-based link has not been previously reported. We did not find a causal
link from CARDIOTOXICITY to SKIN HYPERPIGMENTATION, which is usually caused by an
accumulation of melanin.412
Many-to-many Relationship
The component displaying a many-to-many relationship containing the associations with the
highest MI was component 11. This component contained the associations of preclinical BIOPSY
UTERUS ABNORMAL with clinical ENDOMETRIAL HYPERPLASIA and clinical UTERINE HAEMORRHAGE
(MI = 0.10 and 0.05, LR+ = 15 and 7). It also contained the association of preclinical BIOPSY
VAGINA ABNORMAL with clinical ENDOMETRIAL HYPERPLASIA (MI = 0.09 and LR+ = 16).
ENDOMETRIAL HYPERPLASIA is a thickening of the endometrium which can lead to endometrial
cancers.413 This is often caused by an excess of oestrogen, with a lack of progesterone. Animal
model biopsies can be used to predict these clinical AEs.
Overall, this section shows that we have found specific associations for some of our adverse
events which are not highly co-dependent on one another. These smaller clusters can be
analysed in more detail for use in guiding the utility of preclinical models. The determination
of the type of relationship in different cases will be important to decide which preclinical models
are important for assessing multiple clinical toxicities, which combinations of preclinical
models are important to assess one clinical toxicity, and which preclinical models are more
specific to one clinical toxicity. This can help to guide how these models could be used in
practice.
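The relationship types discussed in this section can be derived mechanically from the list of significant associations. The sketch below is one possible way to do this, assuming the associations are available as (preclinical AE, clinical AE) pairs; it finds connected components with a depth-first search and labels each by the number of preclinical and clinical nodes it contains. The function name is hypothetical, and the edge list contains only the associations named in the text above, not the full network.

```python
from collections import defaultdict


def classify_components(edges):
    """Group (preclinical, clinical) AE pairs into connected components
    and label each component by its relationship type."""
    # Tag each node with its side so the same term can occur on both sides
    graph = defaultdict(set)
    for pre, clin in edges:
        graph[("pre", pre)].add(("clin", clin))
        graph[("clin", clin)].add(("pre", pre))

    seen, results = set(), []
    for start in list(graph):
        if start in seen:
            continue
        # Depth-first search collects one connected component
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        n_pre = sum(1 for side, _ in component if side == "pre")
        n_clin = len(component) - n_pre
        label = "{}-to-{}".format(
            "one" if n_pre == 1 else "many",
            "one" if n_clin == 1 else "many",
        )
        results.append((label, sorted(component)))
    return results


edges = [
    ("ELECTROCARDIOGRAM QT SHORTENED", "SUDDEN CARDIAC DEATH"),
    ("CARDIOTOXICITY", "BLOOD ALKALINE PHOSPHATASE"),
    ("CARDIOTOXICITY", "CARDIOTOXICITY"),
    ("CARDIOTOXICITY", "SKIN HYPERPIGMENTATION"),
    ("BIOPSY UTERUS ABNORMAL", "ENDOMETRIAL HYPERPLASIA"),
    ("BIOPSY UTERUS ABNORMAL", "UTERINE HAEMORRHAGE"),
    ("BIOPSY VAGINA ABNORMAL", "ENDOMETRIAL HYPERPLASIA"),
]
for label, nodes in classify_components(edges):
    print(label, len(nodes))
```

Tagging nodes by side matters here because a term such as CARDIOTOXICITY appears both preclinically and clinically and must be treated as two distinct nodes.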
Frequently Predictive Preclinical AEs
Next, we analyse the largest component in the network (Component 1, Table 4-6). Figure 4-4 is
a visual representation of the first connected component in the network. The larger,
pink-coloured nodes are those with a higher degree (see Network Analysis of
Significant Associations) in the network. There is great variability in the outdegree of the
preclinical AE nodes, showing that there were preclinical AEs which feature in the highest three
predictors for multiple clinical AEs; the preclinical AEs with the highest outdegree can be
considered as the hubs of the network. These AEs (and their corresponding outdegree)
included DECREASED ACTIVITY (97), WAXY FLEXIBILITY (88), BLOOD PROLACTIN INCREASED (81),
INTRA-UTERINE DEATH (79), FOETAL GROWTH RETARDATION (56), VOMITING (48), BIOPSY LYMPH
GLAND ABNORMAL (43), WEIGHT GAIN POOR (41), RED BLOOD CELL COUNT DECREASED (40) and
others, a full list of which can be found in Supplementary Data File 5.
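As an illustration of how the hubs are identified, the outdegree of each preclinical AE is simply the number of distinct clinical AEs it points to. The following sketch assumes the network is stored as a list of (preclinical AE, clinical AE) edges; the AE names are placeholders and the helper names are hypothetical.

```python
from collections import Counter

# Placeholder edge list: (preclinical AE, clinical AE) for significant associations
edges = [
    ("PRECLINICAL_AE_1", "CLINICAL_AE_A"),
    ("PRECLINICAL_AE_1", "CLINICAL_AE_B"),
    ("PRECLINICAL_AE_1", "CLINICAL_AE_C"),
    ("PRECLINICAL_AE_2", "CLINICAL_AE_A"),
    ("PRECLINICAL_AE_2", "CLINICAL_AE_B"),
    ("PRECLINICAL_AE_3", "CLINICAL_AE_C"),
    ("PRECLINICAL_AE_1", "CLINICAL_AE_A"),  # duplicate edge, counted once
]


def outdegree(edges):
    """Number of distinct clinical AEs linked to each preclinical AE."""
    return Counter(pre for pre, _ in set(edges))


def hubs(edges, top_n=2):
    """Preclinical AEs ranked by outdegree (ties broken alphabetically)."""
    ranked = sorted(outdegree(edges).items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


print(hubs(edges))
```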
Figure 4-4: Network plot for the first connected component from the network analysis. On the left-hand side are nodes for all preclinical AEs and on the right-hand side are all clinical AEs. The size and colour of the nodes are proportional to the degree of each node. The larger and more pink nodes are the AEs with the highest number of connections in the network and the smaller blue nodes are AEs with the lowest number of significant connections.
Some of the AEs with the highest outdegree are more prevalent in the dataset (Figure 4-5), such
as WEIGHT DECREASED (1,115 drugs), INTRA-UTERINE DEATH (944 drugs), WEIGHT GAIN POOR (935
drugs) and DECREASED ACTIVITY (822 drugs). Since these AEs will be commonly assessed
preclinical toxicities across animal studies for all drugs, it is unsurprising that their reporting
prevalence is high across drugs in our data. Reproductive toxicity is assessed in multiple animal
models including rodent and non-rodent species and is responsible for a high animal usage (up
to 60 % of all experimental animals). Since thalidomide was withdrawn from the market due to
unanticipated birth defects, these studies are considered a vital component of the in vivo
toxicology assessment.414 Animal body weight is measured as part of all studies, since extreme
weight loss can be indicative of severe organ damage and is one of the factors assessed with
regard to the humane termination of animals before extensive pain and distress occurs.415 In
summary, it is unsurprising that these preclinical AEs are often found to be highly associated
with a range of commonly occurring clinical toxicities.
Figure 4-5: Correlation of the degree of the node representing a preclinical AE in the network for component 1 with the number of drugs in the dataset which displayed the preclinical AE
On the other hand, the degree of the node representing a preclinical AE in the network is not
directly correlated with the number of drugs for which the AE is experienced, as illustrated for
the cases of associations with a high degree but a low number of drugs including the preclinical
AEs of WAXY FLEXIBILITY (32 drugs), BLOOD PROLACTIN INCREASED (36 drugs) and BIOPSY LYMPH
GLAND ABNORMAL (133 drugs) (Figure 4-5). As mentioned above in the analysis of the top 20
associations, preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN INCREASED were highly
predictive together or separately of a range of CNS side effects. When looking at all associations
for these preclinical AEs, it was found that preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN
INCREASED are predictive of many of the same clinical AEs. 53 AEs were the same out of the 105
clinical AEs associated with one or both preclinical AEs (~50 %) showing that these preclinical
AEs were predictive of a similar set of clinical toxicities. The clinical toxicities for which these
observations were significantly associated were grouped according to their system organ class
(SOC) in Figure 4-6, which showed that the most populated SOC groups were for NERVOUS
SYSTEM DISORDERS (49 AEs) and PSYCHIATRIC DISORDERS (29 AEs). There is previous literature
evidence to connect WAXY FLEXIBILITY and BLOOD PROLACTIN INCREASED to these NERVOUS SYSTEM
and PSYCHIATRIC DISORDERS. WAXY FLEXIBILITY is a motor symptom of catatonia which causes
immobility and reduced response to stimulus.416 The main driver of this effect is believed to be
D2 receptor blockade and therefore WAXY FLEXIBILITY is often observed as an adverse effect of
antipsychotic agents. Antagonism of D2 receptors leads to Parkinson’s disease related side
effects. WAXY FLEXIBILITY is observed in a range of animals including rats, mice, dogs and
monkeys according to the data collated from PharmaPendium. Many of the clinical AEs have
links to catatonia (e.g. ANTIDIURETIC HORMONE INCREASE), Parkinson’s disease (e.g. FACIAL
SPASM, MEIGE’S SYNDROME417 and DYSKINESIA418) or antipsychotic medication (e.g. RETROGRADE
EJACULATION419), which explains their link to preclinical WAXY FLEXIBILITY.417 Preclinical BLOOD
PROLACTIN INCREASED is a known side effect of a decrease in dopamine, as well as being regulated
by other targets including serotonin, GABA, oestrogens and opioids.420 From our analysis,
BLOOD PROLACTIN INCREASED is associated with COGWHEEL RIGIDITY, DROOLING, FACIAL SPASM,
SCHIZOPHRENIA and TARDIVE DYSKINESIA in the 20 associations with the highest MI values, all of
which are known extra-pyramidal side effects associated with decreased dopamine effects.418
Overall, these findings along with the literature evidence make the observations of WAXY
FLEXIBILITY and BLOOD PROLACTIN INCREASED in animal models very important flags for toxicity
in humans, especially in combination with one another.
In summary, by analysing the most connected preclinical AEs within component 1, we identified
that the preclinical observations taken routinely during in vivo animal studies, including
reproductive toxicity, body weight and activity measurements are important for determining
the general risk of clinical AEs. Additionally, we found that preclinical AEs of WAXY FLEXIBILITY
and BLOOD PROLACTIN INCREASED were associated with a range of clinical NERVOUS SYSTEM and
PSYCHIATRIC DISORDERS, consistent with the fact that animal models are not well-developed for
these disorders.398,399 We propose that preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN
INCREASED can be used to assess the risk of experiencing clinical NERVOUS SYSTEM and
PSYCHIATRIC DISORDERS.
Figure 4-6: Frequency distribution for System Organ Class (SOC) MedDRA terms mapped from clinical AEs (MedDRA preferred terms) for significant associations with preclinical waxy flexibility and blood prolactin increased. Coloured by preclinical AE. Preferred MedDRA terms can be members of multiple MedDRA SOC groups.
4.4 Conclusions
Previous concordance studies have led to data-driven conclusions for the risk of experiencing
clinical AEs given similar preclinical AEs. Despite the recognition that AEs manifest differently
between animals and humans, for reasons including differences in anatomy and physiology
between species, as well as differences on the genetic and cellular level, no studies have yet
taken advantage of the opportunity to associate preclinical AEs with substantially different
clinical AEs. In this study we implemented a computational data-driven approach towards
assessing the value of preclinical toxicity measurements in determining the risk of toxicity in
clinical studies for approved drugs, by computing the pairwise mutual information between all
preclinical and all clinical toxicity observations, generating statistical associations. In
addition to confirming 102 associations between related terms (‘concordant associations’), the
method presented here extends previous approaches by identifying
2,007 new associations between AEs which were encoded by different terms preclinically and
clinically. Firstly, this highlights the scale on which we were able to find novel predictive links
between preclinical AEs and clinical AEs through our approach. Secondly, we showed that there
was a large range of clinical AEs which are better predicted by different preclinical AEs than by
their corresponding preclinical term. We observed that 57 % of the generated associations
have a positive likelihood ratio of greater than 2, which indicates a significant change in risk of
experiencing the clinical AE given the preclinical AE. This ratio can be used to assess the
diagnostic ability of preclinical AEs for predicting a clinical outcome, which can be used in safety
assessment cases. Additionally, we constructed a network from these associations to show the
interdependency of adverse events upon one another and highlighted examples of AEs which
formed smaller connected components of the network. These included the relationships of one-
to-one, many-to-one and many-to-many associations. We find strong evidence to support the
association between preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical SUDDEN
CARDIAC DEATH, with a positive likelihood ratio of 23, and therefore recommend that regulatory
agencies consider the assessment of QT shortening in preclinical models in a similar way to QT
prolongation. We also discussed the main preclinical AE predictors for the largest connected
component in the network, which included the finding that WAXY FLEXIBILITY and BLOOD
PROLACTIN INCREASED were important predictors for a variety of NERVOUS SYSTEM DISORDERS and
PSYCHIATRIC DISORDERS clinically.
5 Deriving Potential Mechanisms from Associations
Between Drug-Induced Adverse Events in Animal
Models and Humans
5.1 Introduction
In the previous chapter, we analysed the data from PharmaPendium to derive statistical
associations between adverse events in animals and humans which were not necessarily
encoded as the same or similar terms in both preclinical and clinical toxicity data. This allowed
us to identify potential toxicities in animals which can be used as predictors of toxicity in
humans, and to quantify the risk based on the data. However, for these associations to be of
further use, it is important to consider the mechanisms by which these adverse events may be
connected. By identifying biological targets and pathways which may have a role in connecting
the two adverse events, compound effects on these biological components can be assessed at an
earlier stage through in vitro toxicity screening.
To provide biological support for our newly discovered associations and to propose biological
targets for secondary pharmacology screens, in this chapter we constructed a novel gene overlap
analysis, outlined in Figure 5-1, by integrating genetic information from drug and phenotype databases.
Overall, for each association, any genes which were relevant to both the preclinical AE and the
clinical AE, whose protein targets were found to interact with one or more of the drugs causing
both AEs, were considered as plausible mechanistic hypotheses. The targets identified from our
analysis can be investigated as mechanisms for the toxicities, for which further interpretation
and validation should be explored.
Figure 5-1: Conceptual illustration for the gene overlap analysis. To derive plausible biological mechanisms to link the preclinical AE, clinical AE and the drugs which induced both AEs, we mapped drug-target interactions to genes, preclinical and clinical AEs to phenotypes/diseases and subsequently to genes with a role in the phenotype or disease. Overlap between the three spaces of one or more genes for each association, was indicative of a potential mechanistic link. A more detailed step-wise process can be found in Figure 5-2.
5.2 Materials and Methods
5.2.1 Associations for Interpretation
For the purposes of interpretation, we limited our associations to only those with the highest
association scores, chosen by applying a cut-off for the Mutual Information (MI) of greater than
0.095, which was the highest value for the randomised distribution (see Assessing significance).
This left 248 associations which we investigated for evidence of mechanistic drivers.
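The selection step above amounts to thresholding on the largest MI observed under randomisation. A minimal sketch, assuming the associations are held as a list of records with an MI field; the function name and the randomised values are hypothetical, while the three example MI values are those quoted for these pairs in the previous chapter.

```python
def select_top_associations(associations, randomised_mi):
    """Keep associations whose MI exceeds the largest MI seen under randomisation."""
    cutoff = max(randomised_mi)
    selected = [a for a in associations if a["mi"] > cutoff]
    return selected, cutoff


# Hypothetical randomised distribution whose maximum is 0.095
randomised_mi = [0.012, 0.048, 0.095, 0.031]

associations = [
    {"preclinical": "ELECTROCARDIOGRAM QT SHORTENED",
     "clinical": "SUDDEN CARDIAC DEATH", "mi": 0.11},
    {"preclinical": "MOBILITY DECREASED",
     "clinical": "BILE DUCT STONE", "mi": 0.10},
    {"preclinical": "BIOPSY UTERUS ABNORMAL",
     "clinical": "UTERINE HAEMORRHAGE", "mi": 0.05},
]

selected, cutoff = select_top_associations(associations, randomised_mi)
print(cutoff, [a["clinical"] for a in selected])
```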
A flow diagram of the overlap analysis method is shown in Figure 5-2.
Figure 5-2: Workflow for gene overlap analysis. The 5-step process for the gene overlap analysis to derive plausible biological mechanisms to link the preclinical AE, clinical AE and the association-driving drugs for each association.
Step 1: Map Association-Driving Drugs to Genes. Identify drugs which induced both clinical and preclinical AEs for each association. Extract genes from drug-target interaction databases.
Step 2: Map Preclinical and Clinical AEs to Diseases and Phenotypes. Using ontology mapping, find equivalent terms for AEs.
Step 3: Map Phenotype and Disease IDs to Genes for each AE. From disease-gene and phenotype-gene databases, extract all genes. Convert all gene identifiers to UniProtKB IDs.
Step 4: Filter Genes to Relevant Species and Map to Orthologs. Remove all human genes from preclinical AE gene lists and non-human genes from clinical AE gene lists. Map all animal genes to human orthologs.
Step 5: Conduct Overlap Analysis of Genes between Preclinical AE, Clinical AE and Association-Driving Drugs for each Association. For each association, obtain a list of genes which overlap between the three spaces for interpretation.
5.2.2 Extracting Targets for Drugs
Corresponds to Step 1, Figure 5-2.
All 2,259 drugs in PharmaPendium390 were standardised from their SMILES strings using the
StandardiseMolecules function of the camb package in R,306 which removes salts. The standardised
structures were then converted to InChIKeys using KNIME,351 to map the drugs to their target
activities in other databases.
Targets for the drugs in PharmaPendium were extracted from three main sources, namely the
AstraZeneca ChemistryConnect database (which contained Bioprint (2007 snapshot),116
ChEMBL-23330 and GOSTAR (GVK Bio) data), DrugTargetCommons421 and SuperDrug2.422
Data was extracted from ChemistryConnect by querying the database using a synonym search
from the drugs in the PharmaPendium database. Compounds extracted were standardised using
the same process as for the PharmaPendium drugs, and then matched by InChIKey423 to the
PharmaPendium drugs. It was necessary to carry out a second match on exact drug names
between the databases, as some of the drugs were non-small molecule drugs for which SMILES
were not available. Targets in ChemistryConnect were encoded by EntrezGeneID424 which was
mapped to their UniProtKB Accession IDs using the Uniprot Identifier exchange service339. The
data in ChemistryConnect is already categorized into the classes of active and inactive based on
the cut-off of 10 µM for endpoints including Ki, Kd, IC50 or % inhibition at 10 µM. We retained
the active entries for our analysis. Whilst this is a relatively high cut-off, we chose this activity
cut-off as we wanted to include as much off-target information for drugs as possible and many
of the panel screens were conducted at a concentration of 10 µM.
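The two-stage matching described above (InChIKey first, exact drug name as a fallback for drugs without a structure) can be sketched as follows. This is an illustration only: the record layout, the function name and all identifiers are hypothetical placeholders, and the InChIKeys are assumed to have been generated beforehand by the standardisation workflow.

```python
def match_drugs(reference, external):
    """Match external records to reference drugs by InChIKey, then by exact name."""
    by_key = {d["inchikey"]: d["name"] for d in reference if d.get("inchikey")}
    by_name = {d["name"]: d["name"] for d in reference}
    matched = {}
    for rec in external:
        key = rec.get("inchikey")
        if key and key in by_key:
            matched[rec["id"]] = (by_key[key], "inchikey")
        elif rec["name"] in by_name:
            # Fallback for records that did not match on structure,
            # e.g. non-small molecule drugs without SMILES
            matched[rec["id"]] = (by_name[rec["name"]], "name")
    return matched


# Placeholder records; the InChIKey strings are not real compounds
reference = [
    {"name": "SMALL_MOLECULE_X", "inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-N"},
    {"name": "BIOLOGIC_Y", "inchikey": None},
]
external = [
    {"id": "ext1", "name": "synonym of X", "inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-N"},
    {"id": "ext2", "name": "BIOLOGIC_Y", "inchikey": None},
    {"id": "ext3", "name": "UNRELATED", "inchikey": "BBBBBBBBBBBBBB-BBBBBBBBBB-N"},
]
print(match_drugs(reference, external))
```

Recording which route produced each match makes it straightforward to audit the name-based matches separately, since these are more error-prone than structure-based ones.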
Data from the SuperDrug2422 database was provided by the curators. Compounds extracted
were standardised using the same process as for the PharmaPendium drugs, and then matched
by InChIKey and exact name matching to the PharmaPendium drugs. The targets in this
database were already encoded in UniProtKB Accession IDs.
DrugTargetCommons421 data was downloaded from the web platform
(https://drugtargetcommons.fimm.fi/). Since the data did not have SMILES, but did include
ChEMBL IDs, we matched the database to ChEMBL-23 using these IDs to obtain SMILES. For
this purpose, the ChEMBL IDs and SMILES for all ChEMBL-23 compounds were extracted using
Toad for MySQL425 and then merged with the ChEMBL IDs in DrugTargetCommons. The
resulting compounds were standardised using the same process as for the PharmaPendium
drugs, and then matched by InChIKey and exact name matching (for biologic drugs) to the
PharmaPendium drugs. For the data in DrugTargetCommons, those with an “active” flag in the
activity comment column were retained, where a cut-off of 10 µM was used to define activity
and only those records for which there was Ki, Kd, IC50 or % inhibition at 10 µM were retained.
The targets in this database were already encoded in UniProtKB Accession IDs.
Active drug-target interaction data from all sources was combined into one database, retaining
PharmaPendium drug names, UniProtKB Accession IDs, gene names, gene symbols where
possible and an external reference key to the original database. Duplicates of PharmaPendium
drug name and UniProtKB Accession ID were removed. Overall, 82,459 positive drug-target
interactions were extracted for 1,604 out of 2,259 drugs in the PharmaPendium dataset, which
included 7,533 unique targets. The breakdown across databases was as follows: 1,397 drugs
mapped to ChemistryConnect, 1,511 drugs mapped to DrugTargetCommons and 1,253 drugs
mapped to SuperDrug2. 4,863 out of 7,533 targets were found to interact with multiple drugs
and 1,549 out of 1,604 drugs had more than one active target, and hence this combined dataset
was found to be encoding more than just the primary target activity.
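The combination and de-duplication step, together with the summary counts reported above, can be illustrated with the following sketch. It assumes every source has already been reduced to records with a drug name, a UniProtKB accession, an endpoint type, an activity flag and a source label; applying the endpoint filter uniformly to all sources is a simplification made here. All names and records are hypothetical placeholders.

```python
from collections import defaultdict

ALLOWED_ENDPOINTS = {"Ki", "Kd", "IC50", "% inhibition"}


def combine_sources(records):
    """Merge active drug-target records and drop duplicate (drug, target) pairs."""
    pairs = {}
    for rec in records:
        if not rec["active"] or rec["endpoint"] not in ALLOWED_ENDPOINTS:
            continue
        # The first source seen for a pair is kept as its external reference
        pairs.setdefault((rec["drug"], rec["uniprot"]), rec["source"])
    return pairs


def summarise(pairs):
    """Counts analogous to those reported for the combined dataset."""
    drugs_per_target, targets_per_drug = defaultdict(set), defaultdict(set)
    for drug, target in pairs:
        drugs_per_target[target].add(drug)
        targets_per_drug[drug].add(target)
    return {
        "interactions": len(pairs),
        "drugs": len(targets_per_drug),
        "targets": len(drugs_per_target),
        "targets_with_multiple_drugs": sum(len(v) > 1 for v in drugs_per_target.values()),
        "drugs_with_multiple_targets": sum(len(v) > 1 for v in targets_per_drug.values()),
    }


records = [
    {"drug": "DRUG_A", "uniprot": "TARGET_1", "endpoint": "IC50", "active": True, "source": "source_1"},
    {"drug": "DRUG_A", "uniprot": "TARGET_1", "endpoint": "Ki", "active": True, "source": "source_2"},
    {"drug": "DRUG_A", "uniprot": "TARGET_2", "endpoint": "Kd", "active": True, "source": "source_2"},
    {"drug": "DRUG_B", "uniprot": "TARGET_1", "endpoint": "IC50", "active": True, "source": "source_3"},
    {"drug": "DRUG_B", "uniprot": "TARGET_3", "endpoint": "IC50", "active": False, "source": "source_3"},
    {"drug": "DRUG_C", "uniprot": "TARGET_2", "endpoint": "EC50", "active": True, "source": "source_1"},
]
summary = summarise(combine_sources(records))
print(summary)
```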
5.2.3 Mapping Preclinical and Clinical AEs to Ontology Terms
Corresponds to Step 2, Figure 5-2.
For each preclinical or clinical AE which was found as part of the 248 statistically significant
associations generated from the analysis, the MedDRA encoded AE terms were mapped to
ontology IDs using Zooma,426 identifying phenotypes and diseases that describe the AE. The
ontologies searched in Zooma included the Human Disease Ontology (DOID),427 Mammalian
Phenotype Ontology (MP),428 Human Phenotype Ontology (HP),429 Experimental Factor
Ontology (EFO),430 Orphanet Rare Disease Ontology (ORDO),431 National Cancer Institute
Thesaurus (NCIT),432 Ontology of MIRNA Target (OMIT),433 Ontology of Adverse Events
(OAE),434 Monarch Merged Disease Ontology (MONDO),435 Symptom Ontology (SYMP),436
Mental Disease Ontology (MFOMD),437 Mouse Pathology Ontology (MPATH),438 Ontology of
Biological Attributes (OBA),439 and BioAssay Ontology (BAO).440 The mapping results were
manually filtered to leave only terms which have the same meaning as the MedDRA encoded
AE term. This resulted in 183 Zooma mappings for preclinical AEs and 405 mappings for clinical
AEs. Not all AEs were mapped to ontology identifiers. The lists of Zooma ontology terms mapped
from PharmaPendium MedDRA terms are in Supplementary Data Files 6 and 7.
5.2.4 Extracting Genes for Preclinical and Clinical AEs
Corresponds to Step 3, Figure 5-2.
From these mappings, the OpenTargets (version 3.4)441 database was queried using the
OpenTargets Python client442 to extract the genes associated with diseases encoded by the EFO
and DOID ontologies. The genes found were encoded by Ensembl gene (ENSG) IDs443 from
OpenTargets. In total 22,056 genes were mapped to preclinical AEs and 44,718 genes were
mapped to clinical AEs.
For the AE terms which were mapped to phenotype ontologies including HPOs, MPOs, NCITs,
Orphanet, as well as others, genes were extracted from the Monarch Initiative,444 using the
requests library in Python445 to import the data from the URL for the matched ontology ID. In
total 26,216 genes were mapped to preclinical AEs and 50,415 genes were mapped to clinical
AEs. The output gene IDs were HUGO Gene Nomenclature Committee (HGNC) symbols446 for
human genes and the relevant non-human gene IDs for other organisms including the Mouse
Genome Informatics (MGI)447 IDs.
For the AE terms which were mapped to disease ontologies including DOIDs and MONDO IDs,
genes were extracted from the Monarch Initiative,444 using the requests library in Python445 to
import the .tsv file from the URL for the matched ontology ID. In total 9,252 genes were mapped
to preclinical AEs and 35,239 genes were mapped to clinical AEs. The output gene IDs were as
for the previous step.
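Because each ontology is queried against a different gene source, the mapped ontology IDs first need to be routed by prefix. The sketch below shows one way this dispatch could look; it does not perform any database queries, the routing table is an assumed simplification of the description above (ontologies covered only by "as well as others" are omitted), and the ontology IDs are placeholders rather than real terms.

```python
# Assumed routing table: ontology prefix -> gene sources queried for that ontology
ROUTES = {
    "EFO": ["OpenTargets"],
    "DOID": ["OpenTargets", "Monarch (disease)"],
    "MONDO": ["Monarch (disease)"],
    "HP": ["Monarch (phenotype)"],
    "MP": ["Monarch (phenotype)"],
    "NCIT": ["Monarch (phenotype)"],
    "Orphanet": ["Monarch (phenotype)"],
}


def route(ontology_id):
    """Return the gene sources to query for one ontology ID such as 'HP:0000000'."""
    prefix = ontology_id.replace("_", ":").split(":", 1)[0]
    return ROUTES.get(prefix, [])


def plan_queries(ae_to_ids):
    """Build {source: [(AE term, ontology ID), ...]} from the AE-to-ontology mappings."""
    plan = {}
    for ae, ids in ae_to_ids.items():
        for ontology_id in ids:
            for source in route(ontology_id):
                plan.setdefault(source, []).append((ae, ontology_id))
    return plan


# Placeholder mappings; the IDs are illustrative only
ae_to_ids = {
    "AE_TERM_1": ["EFO_0000000", "HP:0000000"],
    "AE_TERM_2": ["DOID:0000", "BAO:0000000"],
}
print(plan_queries(ae_to_ids))
```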
Gene IDs from all sources were then mapped to UniprotKB identifiers using the Uniprot
Identifier exchange service. The genes were mapped back to their original AE term. Duplicate
UniprotKB Accession IDs for each AE were removed. In total, 242,997 gene-preclinical AE
pairings were found across all preclinical AEs and 546,902 gene-clinical AE pairings were found
for clinical AEs.
5.2.5 Gene Filtering and Mapping Animal Genes to Human Orthologs
Corresponds to Step 4, Figure 5-2.
The UniprotKB gene IDs associated with drugs, preclinical AEs and clinical AEs were filtered in
the next step as follows. For preclinical AE genes, we retained those which were from non-
human species, while for clinical AE genes we retained only human genes. For the drug-gene
associations the non-human genes were separated from the human genes for the next step.
Next, for the preclinical AE-associated non-human genes, we mapped their gene identifiers to
their orthologs in humans, utilising the Uniprot Identifier exchange service to map to Eggnog448
identifiers and then back to human UniprotKB genes via the REST API.339 We chose the Eggnog
database due to its higher level of overlap with UniprotKB identifiers than other ortholog
mapping databases. The non-human genes for drugs were also mapped using the same method
to their human orthologs, meaning all genes were mapped to human UniprotKB identifiers for
the overlap analysis.
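The species filter and the two-step ortholog mapping can be sketched as below. The sketch assumes each gene is held as an (accession, species) pair and that the identifier mappings (accession to orthologous group, and group to human accessions) have already been downloaded into dictionaries; genes without a human ortholog are silently dropped, which is an assumption of this illustration. All accessions and group IDs are placeholders.

```python
def to_human_genes(genes, keep, acc_to_group, group_to_human):
    """Filter (accession, species) pairs by species and express them as human accessions.

    keep="human" retains human genes unchanged (clinical AEs).
    keep="non-human" retains non-human genes and maps them to their
    human orthologs (preclinical AEs).
    """
    human = set()
    for acc, species in genes:
        is_human = species == "Homo sapiens"
        if keep == "human" and is_human:
            human.add(acc)
        elif keep == "non-human" and not is_human:
            group = acc_to_group.get(acc)               # accession -> orthologous group
            human |= group_to_human.get(group, set())   # group -> human accessions
    return human


# Placeholder identifiers for illustration only
acc_to_group = {"MOUSE_ACC_1": "OG1", "RAT_ACC_2": "OG2"}
group_to_human = {"OG1": {"HUMAN_ACC_1"}, "OG2": {"HUMAN_ACC_2", "HUMAN_ACC_3"}}

preclinical = [
    ("MOUSE_ACC_1", "Mus musculus"),
    ("HUMAN_ACC_9", "Homo sapiens"),     # removed: human gene in a preclinical list
    ("RAT_ACC_2", "Rattus norvegicus"),
]
clinical = [
    ("HUMAN_ACC_1", "Homo sapiens"),
    ("MOUSE_ACC_1", "Mus musculus"),     # removed: non-human gene in a clinical list
]

print(sorted(to_human_genes(preclinical, "non-human", acc_to_group, group_to_human)))
print(sorted(to_human_genes(clinical, "human", acc_to_group, group_to_human)))
```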
5.2.6 Overlap Analysis of Genes from Preclinical AE, Clinical AE and Drugs
Corresponds to Step 5, Figure 5-2.
The following analysis was conducted in Python utilising the pandas449 library. All the drugs,
preclinical and clinical AEs and ortholog-mapped genes with UniProtKB Accession IDs were
used in the overlap analysis. The overlap analysis was designed to check whether, for each
association, the drugs which displayed both the preclinical and clinical AEs possess protein
targets which are encoded by genes associated with both the preclinical and clinical AE, and
which could hence mechanistically be associated with AEs in both animals and humans. All
matches of a drug with protein targets which are present in the genes associated with both
preclinical and clinical AEs were retained for subsequent analysis. For each association, the
intersection between genes for the preclinical and clinical AE terms was extracted as a list of
genes.
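The core of the overlap analysis is a three-way set intersection per association. The following is a minimal sketch using plain Python sets rather than the pandas implementation used in the analysis; the function name, data layout and all identifiers are hypothetical placeholders.

```python
def gene_overlap(association, drug_targets, preclinical_genes, clinical_genes):
    """Return {drug: genes} where the genes are targets of the drug that are
    also linked to both the preclinical and the clinical AE of the association."""
    shared_ae_genes = (
        preclinical_genes.get(association["preclinical"], set())
        & clinical_genes.get(association["clinical"], set())
    )
    overlap = {}
    for drug in association["drugs"]:  # drugs displaying both AEs
        hits = drug_targets.get(drug, set()) & shared_ae_genes
        if hits:
            overlap[drug] = hits
    return overlap


# Placeholder data for illustration only
association = {
    "preclinical": "PRECLINICAL_AE",
    "clinical": "CLINICAL_AE",
    "drugs": ["DRUG_A", "DRUG_B", "DRUG_C"],
}
drug_targets = {
    "DRUG_A": {"GENE_1", "GENE_2"},
    "DRUG_B": {"GENE_3"},
    "DRUG_C": {"GENE_2", "GENE_4"},
}
preclinical_genes = {"PRECLINICAL_AE": {"GENE_2", "GENE_3", "GENE_5"}}
clinical_genes = {"CLINICAL_AE": {"GENE_2", "GENE_4", "GENE_5"}}

result = gene_overlap(association, drug_targets, preclinical_genes, clinical_genes)
print(result)
```

An association is considered mechanistically supported in this sketch when the returned dictionary is non-empty, and the fraction of its drugs that appear as keys gives the proportion of drugs displaying gene overlap.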
5.2.7 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
In vitro target safety panels were extracted from Lounkine et al., 2012117 and Bowes et al., 2012118,
which represent the Novartis Safety Target panel and Panel-44 respectively. The genes found
from the overlap analysis were compared to the genes represented by the safety targets in the
combined panels and flagged as either known safety target genes (if present in the panels) or
potentially novel safety target genes (if not present in the panels). To assess how much better
this method was at identifying known safety targets than a random selection of proteins
associated with active compounds, we simulated the sampling of 482 targets (the same number
as we found from our analysis) with known actives from ChEMBL and compared these targets
to the safety panel targets. Sampling was repeated 1000 times to produce a distribution of the
number of matches to the 77 safety panel targets (Figure 5-3), from which the mean was
calculated to be 16, corresponding to the retrieval of 21 % of known safety panel targets on
average.
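The random baseline can be reproduced in outline with a simple resampling loop. In the sketch below, the sample size (482), the number of repeats (1000) and the panel size (77) are taken from the text, whereas the pool of candidate targets is a placeholder: its size of 2,320 is an assumption chosen only so that the expected overlap (482 × 77 / 2,320) is close to 16, and does not represent the actual number of ChEMBL targets with known actives.

```python
import random
from statistics import mean


def simulate_panel_matches(pool, panel, sample_size=482, repeats=1000, seed=0):
    """Count safety-panel targets recovered by repeated random samples of the pool."""
    rng = random.Random(seed)
    panel = set(panel)
    return [
        len(panel.intersection(rng.sample(pool, sample_size)))
        for _ in range(repeats)
    ]


# Placeholder target pool; the first 77 entries stand in for the safety panel targets
pool = [f"TARGET_{i}" for i in range(2320)]
panel = pool[:77]

matches = simulate_panel_matches(pool, panel)
print(mean(matches), mean(matches) / len(panel))
```

Comparing the number of panel targets recovered by the overlap analysis against this distribution indicates whether the method retrieves known safety targets more often than would be expected by chance.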
Figure 5-3: Distribution of the number of matches to safety panels from 1000 random samples of 482 targets from ChEMBL. The mean of the distribution (marked in red) was 16 matches, identifying 21 % of known safety panel targets on average.
5.2.8 Visualisations
Visualisations were generated in TIBCO Spotfire352 and R.
5.2.9 Limitations
There are several limitations to be aware of with respect to the analysis conducted. Firstly, we
assumed that AEs can be represented by diseases or phenotypes; some mapping was possible,
however, not all AEs could be mapped in this way and this may be responsible for some
associations with a lack of mechanistic support. Additionally, we assume that gene associations
are equal to protein target associations. Finally, we did not discriminate on the level of
association required for a gene to be associated with a phenotype or disease because we wanted
to identify all plausible links which could be used for hypothesis generation.
5.3 Results and Discussion
5.3.1 Overall Analysis of Mechanistically Supported Associations
We set out to systematically identify mechanistic links between our associations. To do this we
identified targets modulated by drugs inducing both AEs in the association, as well as mapped
our AE terms to diseases or phenotypes from which we subsequently retrieved genes associated
with each AE. We then conducted an overlap analysis to retrieve genes which were common to
the preclinical and clinical AEs, and one or more of the drugs which induced AEs in both
domains (see Materials and Methods). We conducted our mechanistic overlap analysis on a
subset of 248 highly scoring associations as measured by the MI from the previous chapter, for
which we found overlapping evidence on the gene level for 60 associations. 482 unique genes
were found across all associations (see Supplementary Data File 8 for details).
Figure 5-4 shows the number of unique genes that were found to link preclinical and clinical
AEs induced by drug treatment. The association with the highest number of unique genes was
the association of preclinical FOETAL MALFORMATION with clinical NERVOUS SYSTEM DISORDERS
with 310 different genes, with the high number of genes involved likely to be due to the
multitude of AEs that can be described by NERVOUS SYSTEM DISORDERS, as well as overlap
between pathways in developmental and CNS disorders.450 Conversely, for some associations
very few genes were found linking the preclinical and clinical AEs; for
example, the association of preclinical BLOOD PROLACTIN INCREASED with BLOOD GROWTH
HORMONE INCREASED was linked by only one gene, namely Peroxisome proliferator-activated
receptor gamma (PPARG), for four drugs (see Figure 5-5 for the number of drugs
driving each association, as well as the proportion of drugs which displayed gene overlap). We
also observe links between other AEs which have an underlying mechanistic rationale (Figure
5-4), including the association of preclinical ADRENAL SUPPRESSION with clinical ADRENAL
INSUFFICIENCY, the association of preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical
SUDDEN CARDIAC DEATH, associations of preclinical MUTAGENIC tests with clinical
TERATOSPERMIA and the association of PSYCHOMOTOR HYPERACTIVITY with clinical DRUG
WITHDRAWAL SYNDROME; we discuss some of these in detail in the subsequent sections.
We found that the number of associations with mechanistic overlap between preclinical, clinical
and drug space amounted to only about 25 % of our initial associations; although this proportion is
low, it offers increased confidence that those associations are supported by genetic links.
Supplementary Data File 8 provides more detailed information on the gene overlap analysis.
This information can now be used to identify preclinical readouts which are mechanistically
linked to a clinical readout of interest, to identify preclinical safety models that go beyond
simple term and observation matching between the preclinical and clinical domain, extending
the analysis from the previous chapter.
For those 75 % of associations that were not mechanistically linked, based on existing data, we
propose that there were multiple possible reasons. Firstly, the individual AEs could be caused
by separate and distinct mechanisms of the same drugs (i.e., two AEs are produced by different
off-targets of the same drug) and therefore we find statistical, but no causal associations
between (mechanistically distinct) AEs. This highlights the need to produce selective drugs
where possible in order to better understand toxicities. Secondly, in some cases we were unable
to map the AE terms to a representative disease or phenotype found in one of our databases.
Thirdly, there may not be extensive knowledge about the involvement of genes in the linked
diseases or phenotypes, as annotated in the databases used here. Fourthly, for the drugs
producing given AEs in preclinical and clinical space, the protein target profile of the drug may
be incomplete (which is likely a considerable source of false negative links in this
work, given the incompleteness of available bioactivity data). Finally, there may be indirect
downstream biological effects via signalling networks, so that the target of a drug is not directly
involved in any toxic downstream events. Such indirect links are plausible in practice;
however, they would not be discovered by the current analysis.
Therefore, we conclude that there may be valuable information among the statistical
associations that were not supported by this method, as many of these may lack support because of the
limitations of knowledge in the public domain. In the future, for those associations where there
was no gene overlap found, it would be interesting to explore those where a target was found to
overlap between two of the three spaces. This could reveal insight into potential novel
mechanisms of disease and propose off-targets not yet discovered for the set of drugs.
Figure 5-4: Number of unique genes found for each mechanistically supported association between preclinical and clinical AEs. Plot is coloured by the number of unique genes overlapping between drug targets and genes involved in phenotypes in both preclinical and clinical AE space. Minimum number of genes was 1 and the maximum number of genes was 310. Only about 25% of links between preclinical and clinical AEs are mechanistically supported; however, those links are now in turn more likely to be biologically meaningful, and at the same time a potential biomarker/safety target hypothesis for the respective AE is derived.
[Figure 5-4 plot. Vertical axis: Preclinical AE. Horizontal axis: Clinical AE. Colour scale: Max (310), Average (22), Min (1).]
Figure 5-5: Number of drugs with gene overlap for each significant association. For those associations which had preclinical AE, clinical AE and drug gene overlap, the overall number of drugs driving the association is shown, coloured by whether there was gene overlap for each drug.
[Figure 5-5 plot. Axis labels: Clinical AE, Preclinical AE.]
5.3.2 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
We next assessed the correspondence between the genes found from this analysis and the
published safety target panel associated genes to determine the ability of our analysis to find
known safety targets. In total, our analysis identified 56 out of 73 (77 %) of known safety target
genes, as defined from the combination of the panels published in Lounkine et al., 2012117 and
Bowes et al., 2012118, which was enriched compared to a mean of 16 targets out of 73 (21 %)
retrieved from repeated random samples of 482 targets with activity from ChEMBL (Figure 5-3).
The gene names, as well as the clinical AEs with which each gene was found to be associated, are
displayed in Figure 5-6. Certain genes can be associated with groups of (as opposed to
individual) clinical AEs; for example, for nervous system and psychiatric disorders, the terms
DRUG DEPENDENCE, OBSESSIVE COMPULSIVE DISORDER, SCHIZOPHRENIA, TARDIVE DYSKINESIA and
TOURETTE’S DISORDER were associated with many genes, including the Dopamine Receptor
subtypes 1, 2 and 3, which are known to be associated with CNS adverse events;119 however, the
precise link between these receptors and OBSESSIVE COMPULSIVE DISORDER, SCHIZOPHRENIA and
TOURETTE’S DISORDER was not reported.119 The AR and ESR1 genes, which encode the Androgen
Receptor and Estrogen Receptor 1 respectively, are promiscuous across clinical AEs, due to their
roles in a wide range of biological processes, specifically linking to endocrine and CNS adverse
events.119 Similarly, the GCR gene, which encodes the Glucocorticoid Receptor, was also found
to be associated with many of the clinical AEs, in particular a range of CNS adverse events, which
have not been previously identified.119 More specific gene-clinical AE associations from our
analysis included the Acetylcholinesterase (ACES) and Tyrosine-Protein Kinase LCK (LCK)
genes associated with PNEUMONIA; ACES has previously been linked to
respiratory adverse events.119 Further specific associations were the Neuronal acetylcholine
receptor subunit alpha-4 (ACHA4) gene with POOR-QUALITY SLEEP; the Neuronal acetylcholine
receptor subunit alpha-7 (ACHA7), Neuronal Acetylcholine Receptor Subunit Beta-4 (ACHB4),
Substance-P Receptor (NK1R) and Glutamate Receptor Ionotropic, NMDA 2A (NMDE1) genes with DRUG
WITHDRAWAL SYNDROME; and the Steroid Hormone Receptor ERR2 (ERR2) and Gamma-aminobutyric
acid Receptor Subunit Alpha-5 (GBRA5) genes with NERVOUS SYSTEM
DISORDERS. None of these were summarized in the previous study,119 making these findings
novel associations between previously identified safety target encoding genes and adverse
events.
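The random-sampling baseline used earlier in this section (repeated random samples of 482 targets, compared against the 73 panel genes) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: the function name, the number of repeats, the seed and, in particular, the size of the target universe are hypothetical, since the number of ChEMBL targets with activity is not restated here.

```python
import random
from typing import Iterable, Set


def mean_random_overlap(
    universe: Iterable[str],
    panel: Set[str],
    sample_size: int,
    n_repeats: int,
    seed: int = 0,
) -> float:
    """Mean number of panel genes recovered by random target samples."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    pool = sorted(universe)            # random.sample needs a sequence
    total = 0
    for _ in range(n_repeats):
        sample = rng.sample(pool, sample_size)   # sampling without replacement
        total += len(panel.intersection(sample))
    return total / n_repeats


# Hypothetical universe of 1000 placeholder targets, 73 of which are "panel" genes.
universe = [f"TARGET_{i}" for i in range(1000)]
panel = set(universe[:73])

baseline = mean_random_overlap(universe, panel, sample_size=482, n_repeats=200)
```

The observed count of recovered panel genes can then be compared with this baseline; a permutation-style p-value could be derived from the same loop by counting how often a random sample matches or exceeds the observed value.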
The remaining 426 genes have not been previously identified as safety target encoding genes
and are not therefore routinely screened within pharmaceutical companies. These novel genes
can be found in Supplementary Data File 9 and can be used to help prioritise important toxicity
targets for further evaluation. These genes represent biological mechanisms that can be targeted
with a small molecule ligand, and which are mechanistically involved in different toxic
phenotypes across species.
In the next section we discuss examples of mechanistically supported associations highlighting
known and novel genes for further evaluation.
Figure 5-6: Genes whose resulting protein products are targeted by drugs and which are associated with toxicities on the preclinical and clinical level, which overlap with known safety targets from the published studies by Lounkine et al., 2012 117 and Bowes et al., 2012 118. Matrix elements are coloured by the number of times the gene was associated with each clinical AE. The most promiscuous genes across AEs included those encoding the nuclear hormone receptors Androgen receptor, Estrogen receptor 1 and Glucocorticoid receptor.
[Figure 5-6 plot. Vertical axis: Clinical AE. Horizontal axis: Gene Name (Uniprot). Colour scale: Max (73), Average (15), Min (1).]
5.3.3 Mechanistically Supported One-to-One Relationships Between AEs
From our mechanistically supported associations, 6 out of 60 showed a one-to-one relationship
between one preclinical AE and one clinical AE (Table 5-1). In this section, three examples of
mechanistically supported associations are discussed where there were no other associations
with the same preclinical or clinical AE.
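The one-to-one criterion above (no other association shares the same preclinical or clinical AE) can be expressed as a degree check on the list of association pairs. The sketch below is illustrative only: the function name is hypothetical, and the input is a small subset of associations named in this chapter rather than the full set of 60, with each pair assumed to appear once.

```python
from collections import Counter
from typing import List, Tuple

Association = Tuple[str, str]  # (preclinical AE, clinical AE)


def one_to_one(associations: List[Association]) -> List[Association]:
    """Keep pairs whose preclinical AE and clinical AE each occur exactly once."""
    pre_counts = Counter(pre for pre, _ in associations)
    clin_counts = Counter(clin for _, clin in associations)
    return [
        (pre, clin)
        for pre, clin in associations
        if pre_counts[pre] == 1 and clin_counts[clin] == 1
    ]


# Subset of associations discussed in this chapter (not the full set of 60).
associations = [
    ("ELECTROCARDIOGRAM QT SHORTENED", "SUDDEN CARDIAC DEATH"),
    ("MUTAGENIC: MICRONUCLEUS TEST", "TERATOSPERMIA"),
    ("MUTAGENIC: MICRONUCLEUS TEST", "OVARIAN FAILURE"),
    ("MUTAGENIC: CHROMOSOME ABERRATION", "TERATOSPERMIA"),
    ("MUTAGENIC: CHROMOSOME ABERRATION", "OVARIAN FAILURE"),
]

isolated = one_to_one(associations)
```

Within this subset only the QT shortening pair is retained, while the mutagenicity tests form a group of associations of the kind discussed in Section 5.3.4.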
Table 5-1: Biological targets and drugs for all six one-to-one associations with mechanistic support derived from the gene overlap analysis. Includes the MedDRA terms for both the preclinical and clinical adverse events in the association, as well as the positive likelihood ratio, number of drugs for which overlapping genetic evidence was found and the identity of the genes.
Preclinical AE | Clinical AE | Positive likelihood ratio (LR+) | Number of drugs overlapping (number with gene overlap) | Number of unique genes | Genes overlapping (order of decreasing number of drugs for which genes appear)
ELECTROCARDIOGRAM QT SHORTENED | SUDDEN CARDIAC DEATH | 22.9 | 9 (1) | 4 | CAC1C (1), KCND3 (1), KCNQ1 (1), SCN5A (1)
REFLEXES ABNORMAL | POOR QUALITY SLEEP | 17.3 | 10 (7) | 7 | AA2AR (6), OX2R (6), ACM2 (2), ACHA4 (1), ACHB2 (1), CAC1C (1), GBRB3 (1)
INJECTION SITE ERYTHEMA | INJECTION SITE RASH | 13.1 | 20 (1) | 7 | ABL1 (1), BTK (1), FGFR3 (1), KIT (1), PGFRA (1), RET (1), TIE2 (1)
PSYCHOMOTOR HYPERACTIVITY | DRUG WITHDRAWAL SYNDROME | 6.2 | 53 (49) | 26 | CP2B6 (44), AA1R (42), AA2AR (42), ADA2A (42), ADA2B (42), CNR1 (42), DRD2 (42), DRD3 (42), GASR (42), NPY2R (42), OPRD (42), OPRK (42), OPRX (42), OX1R (42), HRH1 (26), SC5A (25), ACHA7 (8), ACHB2 (8), ACHB4 (8), MK01 (5), MK03 (5), NMDE1 (3), SSDH (3), CREB1 (1), GRIA2 (1), NK1R (1)
FOETAL MALFORMATION | NERVOUS SYSTEM DISORDERS | 4.9 | 104 (54) | 310 | See Supplementary Data File 8
INTRA-UTERINE DEATH | BACK PAIN | 2.5 | 606 (114) | 17 | BRAF (19), CBP (22), CDK4 (17), EP300 (22), FGFR1 (61), FGFR3 (61), GNAS2 (4), HBB (1), P53 (16), PK3CA (24), PTN11 (2), RASH (1), TASK (6), RASN (5), SMAD4 (21), SMO (3), STK11 (61)
Preclinical ELECTROCARDIOGRAM QT SHORTENING and clinical SUDDEN CARDIAC DEATH are associated
via multiple genes/protein targets
The prediction of the risk of drug-induced sudden cardiac death in humans from non-clinical
studies is particularly important because, given the severity of this AE, there are health and cost
implications of taking forward a drug that carries this risk to patients.451 In this study we found that
preclinical ELECTROCARDIOGRAM QT SHORTENING is associated with clinical SUDDEN CARDIAC
DEATH with a rather high LR+ of 23 (Table 5-1), as discussed in the previous chapter. Overall in
the dataset, nine drugs supported this association. When analysing the gene overlap between
preclinical and clinical toxicities and the drugs which caused both toxicities, rufinamide was the
only drug whose known modulated targets overlapped with both toxic effects. These
protein targets were Voltage-dependent L-type calcium channel subunit alpha-1C (CAC1C),
Potassium voltage-gated channel subfamily D member 3 (KCND3), Potassium voltage-gated
channel subfamily KQT member 1 (KCNQ1) and Sodium channel protein type 5 subunit alpha
(SCN5A), which are well associated with short QT intervals and sudden cardiac death.406,452–454
For the other drugs which induced both AEs we did not find mechanistic support, suggesting
that the proteins responsible for the connection between the adverse events might not be
known for all drugs. Here, we found mechanistic evidence to support the link between
preclinical ELECTROCARDIOGRAM QT SHORTENING and clinical SUDDEN CARDIAC
DEATH for clinically tested drugs. KCND3 is not currently found in the published safety target panels
considered here;117,118 however, it is screened as part of the Comprehensive in Vitro Proarrhythmia
Assay (CiPA).455
This mechanistic evidence adds to the case made in the previous chapter where we discussed
the need for regulatory agencies to consider how preclinical ELECTROCARDIOGRAM QT
SHORTENING should be used to assess clinical risk of arrhythmias and clinical SUDDEN CARDIAC
DEATH, consistent with the rise in the incidence of the preclinical and clinical observations.406
Based on this data-driven study and the previous reports highlighting that drug-induced QT
shortening should be monitored in addition to QT prolongation due to the severe
prognosis,407,456 we recommend that the targets identified in this study (CAC1C, KCND3, KCNQ1
and SCN5A) and other associated targets be prioritised for drug safety screening due to the
severity of the clinical AE. This could be achieved by expanding efforts such as the CiPA panel
to encompass QT shortening-related targets, in addition to its current targets of
relevance to proarrhythmia.457
Preclinical ABNORMAL REFLEXES are associated with clinical POOR-QUALITY SLEEP via multiple
genes/protein targets
Many drugs induce sleep problems, which can have a significant effect on quality of life. Animal
models exist for assessing the risk of poor-quality sleep in patients; however, many differences
in sleep patterns exist between animals and humans.458 In this study we identified a strong
association between preclinical REFLEXES ABNORMAL and clinical POOR-QUALITY SLEEP with an LR+
of 17 (Table 5-1). This association is driven by the observation of both side effects for 10
individual drugs in the dataset. Of these 10 drugs, 7 were found to share at least one gene target
with the two phenotypes. The most common targets found for this association were the
Adenosine A2a receptor (AA2AR), annotated as a target of six drugs, and the Orexin 2 receptor
(OX2R), also annotated as a target of six drugs. Studies of A2aR-knockout mice show that
AA2AR is linked with a decrease in sensitivity to painful stimuli, which can lead to reduction in
the reflex response.459 AA2AR also plays a role in sleep directly, and mice lacking the functional
Adenosine A2a receptor no longer show increased wakefulness in response to caffeine.460 In
humans, polymorphisms in the A2a receptor are associated with impaired sleep, showing a role
for A2a receptors in sleep quality directly.461 The orexin receptors are well-known for their link
with sleep and wakefulness cycles,462–464 in particular the conditions of insomnia and
narcolepsy.465 More recently, orexin has also been linked to antinociceptive effects, showing a
role in neuropathic pain modulation, which may have a role in the reflex response to pain.466
While the A2a receptor is part of previously published safety panels, OX2R is currently not
found in these.455 However, in particular where sleep modulation is a concern during drug
development, the recommendation from this analysis (in agreement with previous literature)
would be to include OX2R in screening panels.
Preclinical PSYCHOMOTOR HYPERACTIVITY is associated with clinical DRUG WITHDRAWAL SYNDROME via
multiple genes/protein targets
Drug withdrawal syndrome presents a real challenge to patients and in some cases can be life-
threatening.467 One of the one-to-one associations from our analysis was that of PSYCHOMOTOR
HYPERACTIVITY in the preclinical setting being significantly associated with clinical DRUG
WITHDRAWAL SYNDROME, with an LR+ of 6.2 (Table 5-1). In total, 53 drugs were found which
induced both phenotypes, of which 49 had one or more genes overlapping between the drug and the
two AEs in the association. The genes whose protein targets were modulated by the highest
number of drugs for this association included CP2B6, AA1R, AA2AR, ADA2A, ADA2B, CNR1,
DRD2, DRD3, GASR, NPY2R, OPRD, OPRK, OPRX, OX1R, which were targets for at least 42 of
the drugs supporting the association. Many of those links identified from the data were
supported by biological knowledge. Polymorphisms in Cytochrome P450 family 2 subfamily B member 6
(CP2B6), in combination with attention deficit hyperactivity disorder (ADHD)
symptoms, have been linked to nicotine addiction.468 Antagonism of the Adenosine
receptors (AA1R and AA2AR) leads to psychomotor phenotypes from the reduction in adenosine
and regulation of genes in the striatum signalling pathway.469 AA1R and AA2AR agonists have
been linked to withdrawal symptoms for benzodiazepine drugs,470 and agonism of these
receptors antagonistically modulates dopaminergic neurotransmission and therefore reward
systems.471 The Cannabinoid receptor 1 (CNR1) and Dopamine receptors (DRD2, DRD3) have
well-known links to ADHD, neuropsychiatric disorders and substance withdrawal, as do the
Opioid receptors (OPRD, OPRK, OPRX).472–475 Cholecystokinin B receptor (GASR) mutations
are related to behavioural changes in animals476 and also affect morphine induced hyperactivity
and withdrawal symptoms.477,478 Finally, the Orexin 1 receptor (OX1R) is involved in ADHD as
well as naloxone-precipitated morphine withdrawal.479–481
This analysis presents strong evidence for a link between preclinical PSYCHOMOTOR
HYPERACTIVITY and clinical DRUG WITHDRAWAL SYNDROME, supported by multiple target
mechanisms. Of these targets, the majority are present in existing safety panels, and this analysis
adds to the weight of evidence to support their utility in predicting clinically relevant endpoints.
We also identified three targets, namely CP2B6, OPRX and OX1R, which were not routinely
screened according to published in vitro safety panels, although CP2B6 is screened in drug
metabolism and pharmacokinetic (DMPK) assays for potential drug-drug interactions.482 Based
on the analysis performed here, we propose these three targets for further investigation and
possible inclusion in such panels, where associated effects are of relevance for decision-making
for compound progression along the clinical development path.
5.3.4 Mechanistically Supported Groups of Associations Between AEs
We finally investigated whether associations in our analysis could be considered as groups of
associations, given that in many cases (see Figure 5-4) individual preclinical side effects and
clinical side effects are related in a more complex manner than one-to-one relationships, as
detailed in the previous chapter. Hence, we next analysed whether mechanistic targets were
present across a whole set of associations of preclinical and clinical effects. In the following we
focus on finding new safety targets for clinical reproductive toxicity, namely TERATOSPERMIA and
OVARIAN FAILURE. This is an area where non-clinical models are actively being investigated, to
improve the risk assessment for reproductive toxicity in the clinic.483
Preclinical mutagenicity tests are associated with clinical teratospermia and ovarian failure,
supported by multiple genes/protein targets
From our analysis, observations of preclinical mutagenicity using the MUTAGENIC:
MICRONUCLEUS TEST, IN BONE MARROW CELLS test and MUTAGENIC: CHROMOSOME ABERRATION, IN
BONE MARROW CELLS, were associated with clinical TERATOSPERMIA and OVARIAN FAILURE (Figure
5-7). The LR+ for TERATOSPERMIA, given a positive result in MUTAGENIC: SPECIFIC LOCUS TEST
(THYMIDINE KINASE), IN LYMPHOMA CELLS, was 28.9, while the ratio for the MUTAGENIC:
CHROMOSOME ABERRATION, IN BONE MARROW CELLS test was 23.1 and the ratio for the
MUTAGENIC: MICRONUCLEUS TEST, IN BONE MARROW CELLS was 20.7. This showed a very high
conditional likelihood for experiencing TERATOSPERMIA in clinical trials, given a positive readout
in any of these preclinical tests. For clinical OVARIAN FAILURE, the MUTAGENIC: MICRONUCLEUS
TEST IN BONE MARROW CELLS had an LR+ of 17.2 and the MUTAGENIC: CHROMOSOME ABERRATION
TEST an LR+ of 19.2. Hence, in this set of toxic events, we were able to identify a group of related
preclinical readouts that were predictive of related clinical toxicities.
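For orientation, the positive likelihood ratio quoted above can be computed from a 2x2 contingency table of drugs, treating the preclinical AE as the "test" and the clinical AE as the "condition". The sketch below uses the standard definition, LR+ = sensitivity / (1 - specificity), and assumes that this matches the definition given in the previous chapter; the function name and the counts are hypothetical and are not taken from the dataset.

```python
def positive_likelihood_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """LR+ = sensitivity / (1 - specificity) from 2x2 drug counts.

    tp: drugs with both the preclinical AE and the clinical AE
    fp: drugs with the preclinical AE but not the clinical AE
    fn: drugs with the clinical AE but not the preclinical AE
    tn: drugs with neither AE
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each clinical outcome group needs at least one drug")
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)   # equals 1 - specificity
    if false_positive_rate == 0:
        return float("inf")                # no false positives observed
    return sensitivity / false_positive_rate


# Hypothetical counts, for illustration only.
lr_plus = positive_likelihood_ratio(tp=8, fp=10, fn=2, tn=990)
```

An LR+ well above 1 indicates that the clinical AE is reported far more often among drugs with the preclinical finding than among drugs without it, which is the sense in which the ratios above are described as high.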
Figure 5-7 shows all 33 shared genes between the associations for mutagenicity-related
preclinical observations. It can be observed that the main gene present for all associations was
the human ANDR gene. This gene (the Androgen receptor) is well-known to be linked to
TERATOSPERMIA, as it has a vital role in spermatogenesis and mutations in the gene lead to male
infertility.484–486 OVARIAN FAILURE is linked to the absence of the Androgen receptor in mice and
serum androgen levels are elevated in women with ovarian failure.487,488
Many of the other 32 genes for these associations, shown in Figure 5-7, are related to DNA
damage or repair, apoptosis, cellular proliferation, angiogenesis, methylation effects and
transcriptional regulation. Interestingly, we can distinguish the different mutagenic tests in vivo
by their mechanistic drivers, as not all genes are found to be overlapping. For example, the
CHROMOSOME ABERRATION TEST IN BONE MARROW CELLS is associated with genetic evidence for
bromodomains BAZ1A, BRDT and BRWD1, whereas these genes are not associated with the
MUTAGENIC: MICRONUCLEUS TEST IN BONE MARROW CELLS. There are differences in the genetic
evidence found in the literature associated with clinical endpoints. For example, there are many
genes which are found for both associations with TERATOSPERMIA but not OVARIAN FAILURE, such
as the Retinoic acid receptors (RARA and RARG) and the Cyclin-dependent kinases CDK2
and CDK16. Conversely, there is only one gene which was found for OVARIAN FAILURE and not
TERATOSPERMIA, namely Cyclin-dependent kinase 4 (CDK4). Overall, we present the genes
which were found to support our associations between mutagenic preclinical AEs and the
clinical AEs of TERATOSPERMIA and OVARIAN FAILURE, providing protein targets which could be
investigated further for their roles in these serious reproductive toxicities of drugs.
In total, out of 33 genes, only five are currently routinely screened in in vitro safety panels,
namely ANDR, ESR1, ESR2, GCR and PRGR. The protein products of the remaining 28 genes,
namely BAZ1A, BRDT, BRWD1, CDK1, CDK16, CDK2, CDK4, COT2, FGFR1, FGFR2, GSK3A,
INSR, MTOR, NR0B2, P53, PPARA, PPARG, RARA, RARG, RET, RXFP2, RXRB, STK11, THB,
TXK, VDR, VGFR1 and VGFR2, are not currently routinely used within published safety
screening panels and, where relevant, based on this data-driven analysis we propose that they
could be investigated further for their role in the prediction of clinical reproductive toxicity.
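The split between panel and non-panel genes listed above amounts to a set difference. The following sketch reproduces it with the 33 gene names given in this section; the variable names are illustrative, and the panel set contains only the five panel genes named here, as a stand-in for the full published panels, which are not reproduced.

```python
# Panel genes among the 33 (stand-in for the full published safety panels).
panel_genes = {"ANDR", "ESR1", "ESR2", "GCR", "PRGR"}

# All 33 genes supporting the mutagenicity-related associations.
supported_genes = panel_genes | {
    "BAZ1A", "BRDT", "BRWD1", "CDK1", "CDK16", "CDK2", "CDK4", "COT2",
    "FGFR1", "FGFR2", "GSK3A", "INSR", "MTOR", "NR0B2", "P53", "PPARA",
    "PPARG", "RARA", "RARG", "RET", "RXFP2", "RXRB", "STK11", "THB",
    "TXK", "VDR", "VGFR1", "VGFR2",
}

# Genes already covered by panels, and candidates for further evaluation.
already_screened = sorted(supported_genes & panel_genes)
candidates = sorted(supported_genes - panel_genes)
```

With a complete panel list in place of the five-gene stand-in, the same two lines would yield the full partition for any set of mechanistically supported genes.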
Figure 5-7: Mechanistic targets for the associations between mutagenic preclinical adverse events and clinical TERATOSPERMIA and OVARIAN FAILURE. Matrix elements are coloured by the number of drugs which had active data at the protein target encoded by the gene. The Androgen receptor (ANDR) was identified as a mechanism for both TERATOSPERMIA and OVARIAN
FAILURE, and is well-known to be relevant to these processes. The Retinoic acid receptors (RARA and RARG) were identified as a mechanistic target for only TERATOSPERMIA, and Cyclin-dependent kinase 4 (CDK4) was identified as a mechanistic target for OVARIAN FAILURE only. Many of the targets overlap between the two reproductive toxicity adverse events.
[Figure 5-7 plot. Axes: Adverse Event (Clinical AE, Preclinical AE) and Gene Names. Colour scale: Max (8), Average (4.2), Min (1).]
5.4 Conclusions
In this study we establish the degree of mechanistic support for the most significant associations
found in chapter 4, by employing a novel gene overlap analysis. We found the intersection of
genetic evidence for the preclinical and clinical AEs and the drugs which caused both AEs. Out
of all associations for which we conducted the gene overlap analysis, we found one or more
genes supporting the association between preclinical and clinical AEs for 25% of cases. When
comparing these genes to known safety target genes, 77 % of known safety targets were found
via our overlap analysis. We also identified 426 genes, which we propose could be investigated
as future safety target encoding genes, for different clinical toxicities of interest. Table 5-2
summarizes the clinical AEs which we have highlighted as case studies in our work, as well as
the targets found from our mechanistic overlap analysis which are not included in currently
published in vitro screening panels. It should be noted that these case studies are only the
toxicities we analysed in detail in this work; for the full list of associations from which the reader
may pick out those of interest to their work please see Supplementary Data Files 3 and 7.
Table 5-2: Summary of the clinical AEs and the genes which did not encode targets in existing safety panels117,118 identified by our gene overlap analysis.
Clinical Toxicity | Genes identified in this work not encoding current targets in published safety panels117,118
SUDDEN CARDIAC DEATH | KCND3
POOR QUALITY SLEEP | OX2R
DRUG WITHDRAWAL SYNDROME | CP2B6, OPRX, OX1R
TERATOSPERMIA | BAZ1A, BRDT, BRWD1, CDK1, CDK16, CDK2, COT2, FGFR1, FGFR2, GSK3A, INSR, MTOR, NR0B2, P53, PPARA, PPARG, RARA, RARG, RET, RXFP2, RXRB, STK11, THB, TXK, VDR, VGFR1, VGFR2
OVARIAN FAILURE | BAZ1A, BRDT, BRWD1, CDK2, CDK4, MTOR, PPARG, VGFR2
This analysis presents an integrative knowledge-based approach to find mechanistic links
between statistically associated preclinical and clinical toxicities. This approach can be applied
for similar datasets in the future to determine the interrelationship between toxicological data.
We extend the work detailed in Chapter 4 by suggesting mechanistic drivers for the
associations previously found. The targets identified can be assessed for inclusion in secondary
pharmacology safety panel screens. By including in vitro, in vivo and clinical toxicity data and
requiring that the target be associated with both the preclinical and clinical toxicities, we
suggest that the mechanisms identified by our method should be highly relevant to toxicities
across species, helping to anticipate both animal model and human toxicities. Finally, this type
of analysis can be used to provide semantic reasoning between AEs, as well as a basis for machine
learning models of clinical toxicity.
6 Conclusions and Future Perspective
In this thesis, we discuss computational methods for the prediction of selectivity and toxicity
endpoints within drug discovery. We integrated data from different sources in order to generate
new knowledge from data mining and machine learning approaches.
For selectivity we show that multi-target proteochemometric (PCM) models can be useful for
predicting the polypharmacology of compounds against related biological targets and
outperform single-target QSAR models in the case of bromodomain-containing proteins. This
was attributed to the fact that PCM can interpolate for new targets and is therefore useful
for predicting activity for targets with less data, termed orphan targets. We established that for
the models to be practically useful, we needed to define the applicability domain for the models,
for which we successfully implemented Mondrian cross-conformal prediction. We discovered
new and structurally distinct chemical hits for the bromodomain-containing protein family
by shortlisting compounds for testing, achieving a hit rate of 31% overall. By interpreting the
model, it was possible to identify the key residues in the active site of different bromodomain
proteins which could be exploited for activity and selectivity in future design. We were able to
link these residues to the binding modes of hits derived from this study and previous literature
knowledge, showing that there was confidence that the model was based on knowledge of
bromodomain inhibitor binding. A particularly relevant extension of this work would be to
include stable binding site water molecules and their location, as well as their network
properties, as target descriptors in addition to amino acid alignment dependent descriptors
employed here. This would incorporate the more recent knowledge319 developed around how
water networks influence the binding of inhibitors to bromodomains and contribute to
observed selectivity.379 The PCM modelling technique can be applied to other target classes to
provide off-target in silico screening platforms within drug discovery, to prioritise targets for
experimental follow up, as potential secondary pharmacology targets. Interpretation of the
models for bromodomain targets and other target families could provide guidance for
designing small molecule inhibitors for these targets in the future.
For toxicity modelling we explored association analyses to quantify the conditional risk of
experiencing a clinical adverse event (AE), given the presence of a preclinical AE. In contrast to
previous studies, we did not require that the AEs preclinically and clinically be encoded by the
same term and therefore our study extended previous concordance approaches. We found 2050
significant associations which can be used to assess which animal models are relevant to human
toxicity prediction, and which, given the data, are not important to determining clinical toxicity.
By constructing a network, we showed different relationships between AEs existed including
one preclinical adverse event associating with one clinical adverse event, many preclinical
events associated with one clinical adverse event and one preclinical adverse event associated
with multiple clinical adverse events. In general, the latter relationship existed more frequently,
and the most relevant preclinical AEs were identified for the largest cluster. These preclinical
AEs were associated with a higher risk of experiencing a range of clinical AEs and should be
considered important information to be continually derived from animal models in the future.
The most significant associations were analysed to determine the degree of mechanistic
support; in total, 60 out of 248 associations were found to have gene evidence in common
between the drugs which caused both AEs, the preclinical AE and the clinical AE. Protein targets
identified are suggested for follow up for the inclusion in secondary pharmacology screening
panels to assess the potential risk of experiencing toxicity in vivo before proceeding to animal
studies. This analysis presents a systematic, data-driven and less biased approach for finding
compelling associations between preclinical and clinical toxicities, which can be applied in
similar studies to determine the interrelationship between toxicological data. This study can
provide a starting point to evaluate which preclinical models are useful to assess the risk of
experiencing which clinical toxicities. We furthermore see the utility of methods such as these
in hypothesis generation for mechanistic drivers for toxicity, potentially with the application of
inclusion in toxicity panel screens. For associations where there was no overlap in drug targets
with genes found from AEs, it would be interesting to further investigate if there was gene
overlap in the AE spaces. This could lead to additional mechanistic hypotheses for previously
untested off-targets which may be modulated by the drugs. Finally, this type of analysis can
provide semantic reasoning between AEs as well as a basis for machine learning models of clinical
toxicity.
7 References
(1) Mohs, Richard C.; Greig, Nigel H. Drug Discovery and Development: Role of Basic
Biological Research. Alzheimer’s Dement. Transl. Res. Clin. Interv. 2017, 3, 651–657.
(2) Paul, Steven M.; Mytelka, Daniel S.; Dunwiddie, Christopher T.; Persinger, Charles C.;
Munos, Bernard H.; Lindborg, Stacy R.; Schacht, Aaron L. How to Improve R&D
Productivity: The Pharmaceutical Industry’s Grand Challenge. Nat. Rev. Drug Discov.
2010, 9, 203–214.
(3) DiMasi, J. A.; Feldman, L.; Seckler, A.; Wilson, A. Trends in Risks Associated with New
Drug Development: Success Rates for Investigational Drugs. Clin. Pharmacol. Ther. 2010, 87,
272–277.
(4) DiMasi, Joseph A.; Grabowski, Henry G.; Hansen, Ronald W. Innovation in the
Pharmaceutical Industry: New Estimates of R&D Costs. J. Health Econ. 2016, 47, 20–33.
(5) Huggins, David J.; Sherman, Woody; Tidor, Bruce. Rational Approaches to Improving
Selectivity in Drug Design. J. Med. Chem. 2012, 55, 1424–1444.
(6) Reddy, A. Srinivas; Zhang, Shuxing. Polypharmacology: Drug Discovery for the Future.
Expert Rev. Clin. Pharmacol. 2013, 6, 41–47.
(7) de Lera, Angel R.; Ganesan, A. Epigenetic Polypharmacology: From Combination
Therapy to Multitargeted Drugs. Clin. Epigenetics 2016, 8, 105.
(8) Ortiz, Angel; Gomez-Puertas, Paulino; Leo-Macias, Alejandra; Lopez-Romero, Pedro;
Lopez-Vinas, Eduardo; Morreale, Antonio; Murcia, Marta; Wang, Kun. Computational
Approaches to Model Ligand Selectivity in Drug Design. Curr. Top. Med. Chem. 2005, 6,
41–55.
(9) Mencher, Simon K.; Wang, Long G. Promiscuous Drugs Compared to Selective Drugs
(Promiscuity Can Be a Virtue). BMC Clin. Pharmacol. 2005, 5, 3.
(10) Méndez-Lucio, Oscar; Jesús Naveja, J.; Vite-Caritino, Hugo; Prieto-Martínez, Fernando
D.; Medina-Franco, José L. Polypharmacology in Drug Discovery. In Drug Selectivity: An
Evolving Concept in Medicinal Chemistry; Handler, Norbert, Buschmann, Helmut, Eds.;
Wiley-VCH Verlag GmbH & Co. KGaA, 2010; p 510.
(11) Hopkins, Andrew L. Network Pharmacology: The next Paradigm in Drug Discovery. Nat.
Chem. Biol. 2008, 4, 682–690.
(12) Roth, Bryan L.; Sheffer, Douglas J.; Kroeze, Wesley K. Magic Shotguns versus Magic
Bullets: Selectively Non-Selective Drugs for Mood Disorders and Schizophrenia. Nat. Rev.
Drug Discov. 2004, 3, 353–359.
(13) Lapins, Maris; Wikberg, Jarl E. S. Kinome-Wide Interaction Modelling Using Alignment-
Based and Alignment-Independent Approaches for Kinase Description and Linear and
Non-Linear Data Analysis Techniques. BMC Bioinformatics 2010, 11, 339.
(14) Fernández, Ariel; Crespo, Alejandro; Tiwari, Abhinav. Is There a Case for Selectively
Promiscuous Anticancer Drugs? Drug Discov. Today 2009, 14, 1–5.
(15) Karaman, Mazen W.; Herrgard, Sanna; Treiber, Daniel K.; Gallant, Paul; Atteridge, Corey
E.; Campbell, Brian T.; Chan, Katrina W.; Ciceri, Pietro; Davis, Mindy I.; Edeen, Philip T.;
Faraoni, Raffaella; Floyd, Mark; Hunt, Jeremy P.; Lockhart, Daniel J.; Milanov, Zdravko
V; Morrison, Michael J.; Pallares, Gabriel; Patel, Hitesh K.; Pritchard, Stephanie; et al. A
Quantitative Analysis of Kinase Inhibitor Selectivity. Nat. Biotechnol. 2008, 26, 127–132.
(16) Klaeger, Susan; Heinzlmeir, Stephanie; Wilhelm, Mathias; Polzer, Harald; Vick, Binje;
Koenig, Paul Albert; Reinecke, Maria; Ruprecht, Benjamin; Petzoldt, Svenja; Meng, Chen;
Zecha, Jana; Reiter, Katrin; Qiao, Huichao; Helm, Dominic; Koch, Heiner; Schoof,
Melanie; Canevari, Giulia; Casale, Elena; Re Depaolini, Stefania; et al. The Target
Landscape of Clinical Kinase Drugs. Science 2017, 358, eaan4368.
(17) Davis, Mindy I.; Hunt, Jeremy P.; Herrgard, Sanna; Ciceri, Pietro; Wodicka, Lisa M.;
Pallares, Gabriel; Hocker, Michael; Treiber, Daniel K.; Zarrinkar, Patrick P.
Comprehensive Analysis of Kinase Inhibitor Selectivity. Nat. Biotechnol. 2011, 29, 1046–
1051.
(18) Haupt, V. Joachim; Daminelli, Simone; Schroeder, Michael. Drug Promiscuity in PDB:
Protein Binding Site Similarity Is Key. PLoS One 2013, 8, e65894.
(19) Ghuman, Jamie; Zunszain, Patricia A.; Petitpas, Isabelle; Bhattacharya, Ananyo A.;
Otagiri, Masaki; Curry, Stephen. Structural Basis of the Drug-Binding Specificity of
Human Serum Albumin. J. Mol. Biol. 2005, 353, 38–52.
(20) Smith, Dennis A.; Di, Li; Kerns, Edward H. The Effect of Plasma Protein Binding on in
Vivo Efficacy: Misconceptions in Drug Discovery. Nat. Rev. Drug Discov. 2010, 9, 929–
939.
(21) Urban, Laszlo; Whitebread, Steven; Hamon, Jacques; Mikhailov, Dmitri; Azzaoui, Kamal.
Screening for Safety-Relevant Off-Target Activities. In Polypharmacology in Drug
Discovery; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2012; pp 15–46.
(22) Goedken, Eric R.; Devanarayan, Viswanath; Harris, Christopher M.; Dowding, Lori A.;
Jakway, James P.; Voss, Jeffrey W.; Wishart, Neil; Jordan, David C.; Talanian, Robert V.
Minimum Significant Ratio of Selectivity Ratios (MSRSR) and Confidence in Ratio of
Selectivity Ratios (CRSR): Quantitative Measures for Selectivity Ratios Obtained by
Screening Assays. J. Biomol. Screen. 2012, 17, 857–867.
(23) Jackson, Craig M.; Esnouf, M. Peter; Winzor, Donald J.; Duewer, David L. Defining and
Measuring Biological Activity: Applying the Principles of Metrology. Accredit. Qual.
Assur. 2007, 12, 283–294.
(24) Hulme, Edward C.; Trevethick, Mike A. Ligand Binding Assays at Equilibrium: Validation
and Interpretation. Br. J. Pharmacol. 2010, 161, 1219–1237.
(25) Maréchal, Eric. Measuring Bioactivity: KI, IC50 and EC50. In Chemogenomics and
Chemical Genetics; Springer Berlin Heidelberg: Berlin, Heidelberg, 2011; pp 55–65.
(26) Cheng, Yung-Chi; Prusoff, William H. Relationship between the Inhibition Constant (KI)
and the Concentration of Inhibitor Which Causes 50 per Cent Inhibition (I50) of an
Enzymatic Reaction. Biochem. Pharmacol. 1973, 22, 3099–3108.
(27) Mestres, Jordi; Gregori-Puigjané, Elisabet; Valverde, Sergi; Solé, Ricard V. Data
Completeness—the Achilles Heel of Drug-Target Networks. Nat. Biotechnol. 2008, 26,
983–984.
(28) Jalencas, Xavier; Mestres, Jordi. On the Origins of Drug Polypharmacology.
Medchemcomm 2013, 4, 80–87.
(29) Hu, Ye; Gupta-Ostermann, Disha; Bajorath, Jürgen. Exploring Compound Promiscuity
Patterns and Multi-Target Activity Spaces. Comput. Struct. Biotechnol. J. 2014, 9,
e201401003.
(30) Pervaiz, Mehrosh; Mishra, Pankaj; Günther, Stefan. Bromodomain Drug Discovery – the
Past, the Present, and the Future. Chem. Rec. 2018, 18, 1808–1817.
(31) Galdeano, Carles; Ciulli, Alessio. Selectivity On-Target of Bromodomain Chemical Probes
by Structure-Guided Medicinal Chemistry and Chemical Biology. Future Med. Chem.
2016, 8, 1655–1680.
(32) Sanchez, Roberto; Meslamani, Jamel; Zhou, Ming-Ming. The Bromodomain: From
Epigenome Reader to Druggable Target. Biochim. Biophys. Acta 2014, 1839, 676–685.
(33) Muller, Susanne; Filippakopoulos, Panagis; Knapp, Stefan. Bromodomains as
Therapeutic Targets. Expert Rev. Mol. Med. 2011, 13, e29.
(34) Moustakim, Moses; Clark, Peter G. K.; Hay, Duncan A.; Dixon, Darren J.; Brennan, Paul
E. Chemical Probes and Inhibitors of Bromodomains Outside the BET Family.
Medchemcomm 2016, 7, 2246–2264.
(35) Josling, Gabrielle A.; Selvarajah, Shamista A.; Petter, Michaela; Duffy, Michael F. The Role
of Bromodomain Proteins in Regulating Gene Expression. Genes (Basel). 2012, 3, 320–
343.
(36) Ferri, Elena; Petosa, Carlo; McKenna, Charles E. Bromodomains: Structure, Function and
Pharmacology of Inhibition. Biochem. Pharmacol. 2016, 106, 1–18.
(37) Arkin, Michelle R.; Tang, Yinyan; Wells, James A. Small-Molecule Inhibitors of Protein-
Protein Interactions: Progressing toward the Reality. Chem. Biol. 2014, 21, 1102–1114.
(38) Filippakopoulos, Panagis; Picaud, Sarah; Mangos, Maria; Keates, Tracy; Lambert, Jean-
Philippe; Barsyte-Lovejoy, Dalia; Felletar, Ildiko; Volkmer, Rudolf; Müller, Susanne;
Pawson, Tony; Gingras, Anne-Claude; Arrowsmith, Cheryl H.; Knapp, Stefan. Histone
Recognition and Large-Scale Structural Analysis of the Human Bromodomain Family.
Cell 2012, 149, 214–231.
(39) Brand, Michael; Measures, Angelina M.; Wilson, Brian G.; Cortopassi, Wilian A.;
Alexander, Rikki; Höss, Matthias; Hewings, David S.; Rooney, Timothy P. C.; Paton,
Robert S.; Conway, Stuart J. Small Molecule Inhibitors of Bromodomain-Acetyl-Lysine
Interactions. ACS Chem. Biol. 2015, 10, 22–39.
(40) Prinjha, Rab K.; Witherington, Jason; Lee, Kevin. Place Your BETs: The Therapeutic
Potential of Bromodomains. Trends Pharmacol. Sci. 2012, 33, 146–153.
(41) Müller, Susanne; Knapp, Stefan. Discovery of BET Bromodomain Inhibitors and Their
Role in Target Validation. Medchemcomm 2014, 5, 288–296.
(42) Atkinson, Stephen J.; Soden, Peter E.; Angell, Davina C.; Bantscheff, Marcus; Chung,
Chun Wa; Giblin, Kathryn A.; Smithers, Nicholas; Furze, Rebecca C.; Gordon, Laurie;
Drewes, Gerard; Rioja, Inmaculada; Witherington, Jason; Parr, Nigel J.; Prinjha, Rab K.
The Structure Based Design of Dual HDAC/BET Inhibitors as Novel Epigenetic Probes.
Medchemcomm 2014, 5, 342–351.
(43) Wang, Limei; Wu, Xiuyin; Huang, Ping; Lv, Zhijun; Qi, Yuping; Wei, Xiujuan; Yang,
Pishan; Zhang, Fenghe. JQ1, a Small Molecule Inhibitor of BRD4, Suppresses Cell Growth
and Invasion in Oral Squamous Cell Carcinoma. Oncol. Rep. 2016, 36, 1989–1996.
(44) Mirguet, Olivier; Gosmini, Romain; Toum, Jérôme; Clément, Catherine A.; Barnathan,
Mélanie; Brusq, Jean-Marie; Mordaunt, Jacqueline E.; Grimes, Richard M.; Crowe,
Miriam; Pineau, Olivier; Ajakane, Myriam; Daugan, Alain; Jeffrey, Phillip; Cutler, Leanne;
Haynes, Andrea C.; Smithers, Nicholas N.; Chung, Chun-wa; Bamborough, Paul; Uings,
Iain J.; et al. Discovery of Epigenetic Regulator I-BET762: Lead Optimization to Afford a
Clinical Candidate Inhibitor of the BET Bromodomains. J. Med. Chem. 2013, 56, 7501–
7515.
(45) Filippakopoulos, Panagis; Qi, Jun; Picaud, Sarah; Shen, Yao; Smith, William B.; Fedorov,
Oleg; Morse, Elizabeth M.; Keates, Tracey; Hickman, Tyler T.; Felletar, Ildiko; Philpott,
Martin; Munro, Shonagh; McKeown, Michael R.; Wang, Yuchuan; Christie, Amanda L.;
West, Nathan; Cameron, Michael J.; Schwartz, Brian; Heightman, Tom D.; et al. Selective
Inhibition of BET Bromodomains. Nature 2010, 468, 1067–1073.
(46) Klein, Kerstin. Bromodomain Protein Inhibition: A Novel Therapeutic Strategy in
Rheumatic Diseases. RMD Open 2018, 4, e000744.
(47) Vogelstein, Bert; Lane, David; Levine, Arnold J. Surfing the P53 Network. Nature 2000,
408, 307–310.
(48) Borah, Jagat C.; Mujtaba, Shiraz; Karakikes, Ioannis; Zeng, Lei; Muller, Michaela; Patel,
Jigneshkumar; Moshkina, Natasha; Morohashi, Keita; Zhang, Weijia; Gerona-Navarro,
Guillermo; Hajjar, Roger J.; Zhou, Ming-Ming. A Small Molecule Binding to the
Coactivator CREB-Binding Protein Blocks Apoptosis in Cardiomyocytes. Chem. Biol.
2011, 18, 531–541.
(49) Rooney, Timothy P. C.; Filippakopoulos, Panagis; Fedorov, Oleg; Picaud, Sarah;
Cortopassi, Wilian A.; Hay, Duncan A.; Martin, Sarah; Tumber, Anthony; Rogers,
Catherine M.; Philpott, Martin; Wang, Minghua; Thompson, Amber L.; Heightman, Tom
D.; Pryde, David C.; Cook, Andrew; Paton, Robert S.; Müller, Susanne; Knapp, Stefan;
Brennan, Paul E.; et al. A Series of Potent CREBBP Bromodomain Ligands Reveals an
Induced-Fit Pocket Stabilized by a Cation-π Interaction. Angew. Chemie Int. Ed. 2014, 53,
6126–6130.
(50) Ghosh, Srimoyee; Taylor, Alexander; Chin, Melissa; Huang, Hon Ren; Conery, Andrew
R.; Mertz, Jennifer A.; Salmeron, Andres; Dakle, Pranal J.; Mele, Deanna; Cote, Alexandre;
Jayaram, Hari; Setser, Jeremy W.; Poy, Florence; Hatzivassiliou, Georgia; DeAlmeida-
Nagata, Denise; Sandy, Peter; Hatton, Charlie; Romero, F. Anthony; Chiang, Eugene; et
al. Regulatory T Cell Modulation by CBP/EP300 Bromodomain Inhibition. J. Biol. Chem.
2016, 291, 13014–13027.
(51) Zeng, Lei; Li, Jiaming; Muller, Michaela; Yan, Sherry; Mujtaba, Shiraz; Pan, Chongfeng;
Wang, Zhiyong; Zhou, Ming-Ming. Selective Small Molecules Blocking HIV-1 Tat and
Coactivator PCAF Association. J. Am. Chem. Soc. 2005, 127, 2376–2377.
(52) Crawford, Terry D.; Audia, James E.; Bellon, Steve; Burdick, Daniel J.; Bommi-Reddy,
Archana; Côté, Alexandre; Cummings, Richard T.; Duplessis, Martin; Flynn, E. Megan;
Hewitt, Michael; Huang, Hon-Ren; Jayaram, Hariharan; Jiang, Ying; Joshi, Shivangi;
Kiefer, James R.; Murray, Jeremy; Nasveschuk, Christopher G.; Neiss, Arianne; Pardo,
Eneida; et al. GNE-886: A Potent and Selective Inhibitor of the Cat Eye Syndrome
Chromosome Region Candidate 2 Bromodomain (CECR2). ACS Med. Chem. Lett. 2017,
8, 737–741.
(53) Kirberger, Steven E.; Ycas, Peter D.; Johnson, Jorden A.; Chen, Chen; Ciccone, Michael F.;
Woo, Rinette W. L.; Urick, Andrew K.; Zahid, Huda; Shi, Ke; Aihara, Hideki; McAllister,
Sean D.; Kashani-Sabet, Mohammed; Shi, Junwei; Dickson, Alex; Dos Santos, Camila O.;
Pomerantz, William C. K. Selectivity, Ligand Deconstruction, and Cellular Activity
Analysis of a BPTF Bromodomain Inhibitor. Org. Biomol. Chem. 2019, 17, 2020–2027.
(54) Drouin, Ludovic; McGrath, Sally; Vidler, Lewis R.; Chaikuad, Apirat; Monteiro, Octovia;
Tallant, Cynthia; Philpott, Martin; Rogers, Catherine; Fedorov, Oleg; Liu, Manjuan;
Akhtar, Wasim; Hayes, Angela; Raynaud, Florence; Müller, Susanne; Knapp, Stefan;
Hoelder, Swen. Structure Enabled Design of BAZ2-ICR, A Chemical Probe Targeting the
Bromodomains of BAZ2A and BAZ2B. J. Med. Chem. 2015, 58, 2553–2559.
(55) Chen, Peiling; Chaikuad, Apirat; Bamborough, Paul; Bantscheff, Marcus; Bountra, Chas;
Chung, Chun Wa; Fedorov, Oleg; Grandi, Paola; Jung, David; Lesniak, Robert; Lindon,
Matthew; Müller, Susanne; Philpott, Martin; Prinjha, Rab; Rogers, Catherine; Selenski,
Carolyn; Tallant, Cynthia; Werner, Thilo; Willson, Timothy M.; et al. Discovery and
Characterization of GSK2801, a Selective Chemical Probe for the Bromodomains BAZ2A
and BAZ2B. J. Med. Chem. 2016, 59, 1410–1424.
(56) Clark, Peter G. K.; Vieira, Lucas C. C.; Tallant, Cynthia; Fedorov, Oleg; Singleton, Dean
C.; Rogers, Catherine M.; Monteiro, Octovia P.; Bennett, James M.; Baronio, Roberta;
Müller, Susanne; Daniels, Danette L.; Méndez, Jacqui; Knapp, Stefan; Brennan, Paul E.;
Dixon, Darren J. LP99: Discovery and Synthesis of the First Selective BRD7/9
Bromodomain Inhibitor. Angew. Chemie Int. Ed. 2015, 54, 6217–6221.
(57) Theodoulou, Natalie H.; Bamborough, Paul; Bannister, Andrew J.; Becher, Isabelle; Bit,
Rino A.; Che, Ka Hing; Chung, Chun Wa; Dittmann, Antje; Drewes, Gerard; Drewry,
David H.; Gordon, Laurie; Grandi, Paola; Leveridge, Melanie; Lindon, Matthew; Michon,
Anne Marie; Molnar, Judit; Robson, Samuel C.; Tomkinson, Nicholas C. O.; Kouzarides,
Tony; et al. Discovery of I-BRD9, a Selective Cell Active Chemical Probe for
Bromodomain Containing Protein 9 Inhibition. J. Med. Chem. 2016, 59, 1425–1439.
(58) Martin, Laetitia J.; Koegl, Manfred; Bader, Gerd; Cockcroft, Xiao Ling; Fedorov, Oleg;
Fiegen, Dennis; Gerstberger, Thomas; Hofmann, Marco H.; Hohmann, Anja F.; Kessler,
Dirk; Knapp, Stefan; Knesl, Petr; Kornigg, Stefan; Müller, Susanne; Nar, Herbert; Rogers,
Catherine; Rumpel, Klaus; Schaaf, Otmar; Steurer, Steffen; et al. Structure-Based Design
of an in Vivo Active Selective BRD9 Inhibitor. J. Med. Chem. 2016, 59, 4462–4475.
(59) Bamborough, Paul; Barnett, Heather A.; Becher, Isabelle; Bird, Mark J.; Chung, Chun-wa;
Craggs, Peter D.; Demont, Emmanuel H.; Diallo, Hawa; Fallon, David J.; Gordon, Laurie
J.; Grandi, Paola; Hobbs, Clare I.; Hooper-Greenhill, Edward; Jones, Emma J.; Law, Robert
P.; Le Gall, Armelle; Lugo, David; Michon, Anne-Marie; Mitchell, Darren J.; et al.
GSK6853, a Chemical Probe for Inhibition of the BRPF1 Bromodomain. ACS Med. Chem.
Lett. 2016, 7, 552–557.
(60) Meier, Julia C.; Tallant, Cynthia; Fedorov, Oleg; Witwicka, Hanna; Hwang, Sung Yong;
Van Stiphout, Ruud G.; Lambert, Jean Philippe; Rogers, Catherine; Yapp, Clarence;
Gerstenberger, Brian S.; Fedele, Vita; Savitsky, Pavel; Heidenreich, David; Daniels,
Danette L.; Owen, Dafydd R.; Fish, Paul V.; Igoe, Niall M.; Bayle, Elliott D.; Haendler,
Bernard; et al. Selective Targeting of Bromodomains of the Bromodomain-PHD Fingers
Family Impairs Osteoclast Differentiation. ACS Chem. Biol. 2017, 12, 2619–2630.
(61) Palmer, Wylie S.; Poncet-Montange, Guillaume; Liu, Gang; Petrocchi, Alessia; Reyna,
Naphtali; Subramanian, Govindan; Theroff, Jay; Yau, Anne; Kost-Alimova, Maria;
Bardenhagen, Jennifer P.; Leo, Elisabetta; Shepard, Hannah E.; Tieu, Trang N.; Shi, Xi;
Zhan, Yanai; Zhao, Shuping; Barton, Michelle C.; Draetta, Giulio; Toniatti, Carlo; et al.
Structure-Guided Design of IACS-9571, a Selective High-Affinity Dual TRIM24-BRPF1
Bromodomain Inhibitor. J. Med. Chem. 2016, 59, 1440–1454.
(62) Sepehri, Bakhtyar; Rasouli, Zolaikha; Hassanzadeh, Zeinabe; Ghavami, Raouf. Molecular
Docking and QSAR Analysis of Naphthyridone Derivatives as ATAD2 Bromodomain
Inhibitors: Application of CoMFA, LS-SVM, and RBF Neural Network. Med. Chem. Res.
2016, 25, 2895–2905.
(63) Bamborough, Paul; Chung, Chun wa; Demont, Emmanuel H.; Furze, Rebecca C.;
Bannister, Andrew J.; Che, Ka Hing; Diallo, Hawa; Douault, Clement; Grandi, Paola;
Kouzarides, Tony; Michon, Anne Marie; Mitchell, Darren J.; Prinjha, Rab K.; Rau,
Christina; Robson, Samuel; Sheppard, Robert J.; Upton, Richard; Watson, Robert J. A
Chemical Probe for the ATAD2 Bromodomain. Angew. Chemie Int. Ed. 2016, 55, 11382–
11386.
(64) Bamborough, Paul; Chung, Chun Wa; Furze, Rebecca C.; Grandi, Paola; Michon, Anne
Marie; Sheppard, Robert J.; Barnett, Heather; Diallo, Hawa; Dixon, David P.; Douault,
Clement; Jones, Emma J.; Karamshi, Bhumika; Mitchell, Darren J.; Prinjha, Rab K.; Rau,
Christina; Watson, Robert J.; Werner, Thilo; Demont, Emmanuel H. Structure-Based
Optimization of Naphthyridones into Potent ATAD2 Bromodomain Inhibitors. J. Med.
Chem. 2015, 58, 6151–6178.
(65) Gerstenberger, Brian S.; Trzupek, John D.; Tallant, Cynthia; Fedorov, Oleg;
Filippakopoulos, Panagis; Brennan, Paul E.; Fedele, Vita; Martin, Sarah; Picaud, Sarah;
Rogers, Catherine; Parikh, Mihir; Taylor, Alexandria; Samas, Brian; O’Mahony, Alison;
Berg, Ellen; Pallares, Gabriel; Torrey, Adam D.; Treiber, Daniel K.; Samardjiev, Ivan J.; et
al. Identification of a Chemical Probe for Family VIII Bromodomains through
Optimization of a Fragment Hit. J. Med. Chem. 2016, 59, 4800–4811.
(66) Ember, Stuart W. J.; Zhu, Jin Yi; Olesen, Sanne H.; Martin, Mathew P.; Becker, Andreas;
Berndt, Norbert; Georg, Gunda I.; Schonbrunn, Ernst. Acetyl-Lysine Binding Site of
Bromodomain-Containing Protein 4 (BRD4) Interacts with Diverse Kinase Inhibitors.
ACS Chem. Biol. 2014, 9, 1160–1171.
(67) Dittmann, Antje; Werner, Thilo; Chung, Chun Wa; Savitski, Mikhail M.; Fälth Savitski,
Maria; Grandi, Paola; Hopf, Carsten; Lindon, Matthew; Neubauer, Gitte; Prinjha,
Rabinder K.; Bantscheff, Marcus; Drewes, Gerard. The Commonly Used PI3-Kinase Probe
LY294002 Is an Inhibitor of BET Bromodomains. ACS Chem. Biol. 2014, 9, 495–502.
(68) Filippakopoulos, Panagis; Knapp, Stefan. Targeting Bromodomains: Epigenetic Readers
of Lysine Acetylation. Nat. Rev. Drug Discov. 2014, 13, 337–356.
(69) Pérez-Salvia, Montserrat; Esteller, Manel. Bromodomain Inhibitors and Cancer Therapy:
From Structures to Applications. Epigenetics 2017, 12, 323–339.
(70) Filippakopoulos, Panagis; Knapp, Stefan. The Bromodomain Interaction Module. FEBS
Lett. 2012, 586, 2692–2704.
(71) Sharp, Phillip P.; Garnier, Jean Marc; Huang, David C. S.; Burns, Christopher J. Evaluation
of Functional Groups as Acetyl-Lysine Mimetics for BET Bromodomain Inhibition.
Medchemcomm 2014, 5, 1834–1842.
(72) ACD/Structure Elucidator. Advanced Chemistry Development, Inc.: Toronto, ON,
Canada, 2019.
(73) Kleywegt, Gerard J.; Jones, T. Alwyn. Model Building and Refinement Practice. Methods
Enzymol. 1997, 277, 208–230.
(74) Crawford, Terry D.; Tsui, Vickie; Flynn, E. Megan; Wang, Shumei; Taylor, Alexander M.;
Côté, Alexandre; Audia, James E.; Beresini, Maureen H.; Burdick, Daniel J.; Cummings,
Richard; Dakin, Les A.; Duplessis, Martin; Good, Andrew C.; Hewitt, Michael C.; Huang,
Hon Ren; Jayaram, Hariharan; Kiefer, James R.; Jiang, Ying; Murray, Jeremy; et al. Diving
into the Water: Inducible Binding Conformations for BRD4, TAF1(2), BRD9, and CECR2
Bromodomains. J. Med. Chem. 2016, 59, 5391–5402.
(75) Humphreys, Philip G.; Bamborough, Paul; Chung, Chun Wa; Craggs, Peter D.; Gordon,
Laurie; Grandi, Paola; Hayhow, Thomas G.; Hussain, Jameed; Jones, Katherine L.; Lindon,
Matthew; Michon, Anne Marie; Renaux, Jessica F.; Suckling, Colin J.; Tough, David F.;
Prinjha, Rab K. Discovery of a Potent, Cell Penetrant, and Selective P300/CBP-Associated
Factor (PCAF)/General Control Nonderepressible 5 (GCN5) Bromodomain Chemical
Probe. J. Med. Chem. 2017, 60, 695–709.
(76) Chaikuad, Apirat; Lang, Steffen; Brennan, Paul E.; Temperini, Claudia; Fedorov, Oleg;
Hollander, Johan; Nachane, Ruta; Abell, Chris; Müller, Susanne; Siegal, Gregg; Knapp,
Stefan. Structure-Based Identification of Inhibitory Fragments Targeting the P300/CBP-
Associated Factor Bromodomain. J. Med. Chem. 2016, 59, 1648–1653.
(77) Unzue, Andrea; Zhao, Hongtao; Lolli, Graziano; Dong, Jing; Zhu, Jian; Zechner, Melanie;
Dolbois, Aymeric; Caflisch, Amedeo; Nevado, Cristina. The “Gatekeeper” Residue
Influences the Mode of Binding of Acetyl Indoles to Bromodomains. J. Med. Chem. 2016,
59, 3087–3097.
(78) Su, Jing; Liu, Xinguo; Zhang, Shaolong; Yan, Fangfang; Zhang, Qinggang; Chen,
Jianzhong. A Theoretical Insight into Selectivity of Inhibitors toward Two Domains of
Bromodomain-Containing Protein 4 Using Molecular Dynamics Simulations. Chem. Biol.
Drug Des. 2018, 91, 828–840.
(79) Xing, Jing; Lu, Wenchao; Liu, Rongfeng; Wang, Yulan; Xie, Yiqian; Zhang, Hao; Shi, Zhe;
Jiang, Hao; Liu, Yu Chih; Chen, Kaixian; Jiang, Hualiang; Luo, Cheng; Zheng, Mingyue.
Machine-Learning-Assisted Approach for Discovering Novel Inhibitors Targeting
Bromodomain-Containing Protein 4. J. Chem. Inf. Model. 2017, 57, 1677–1690.
(80) Ma, Junlong; Chen, Heng; Yang, Jie; Yu, Zutao; Huang, Pan; Yang, Haofeng; Zheng,
Bifeng; Liu, Rangru; Li, Qianbin; Hu, Gaoyun; Chen, Zhuo. Binding Pocket-Based Design,
Synthesis and Biological Evaluation of Novel Selective BRD4-BD1 Inhibitors. Bioorganic
Med. Chem. 2019, 27, 1871–1881.
(81) Unzue, Andrea; Xu, Min; Dong, Jing; Wiedmer, Lars; Spiliotopoulos, Dimitrios; Caflisch,
Amedeo; Nevado, Cristina. Fragment-Based Design of Selective Nanomolar Ligands of
the CREBBP Bromodomain. J. Med. Chem. 2016, 59, 1350–1356.
(82) Cortopassi, Wilian A.; Kumar, Kiran; Paton, Robert S. Cation-π Interactions in CREBBP
Bromodomain Inhibition: An Electrostatic Model for Small-Molecule Binding Affinity
and Selectivity. Org. Biomol. Chem. 2016, 14, 10926–10938.
(83) Hewings, David S.; Fedorov, Oleg; Filippakopoulos, Panagis; Martin, Sarah; Picaud,
Sarah; Tumber, Anthony; Wells, Christopher; Olcina, Monica M.; Freeman, Katherine;
Gill, Andrew; Ritchie, Alison J.; Sheppard, David W.; Russell, Angela J.; Hammond, Ester
M.; Knapp, Stefan; Brennan, Paul E.; Conway, Stuart J. Optimization of 3,5-
Dimethylisoxazole Derivatives as Potent Bromodomain Ligands. J. Med. Chem. 2013, 56,
3217–3227.
(84) Lai, Kwong Wah; Romero, F. Anthony; Tsui, Vickie; Beresini, Maureen H.; de Leon
Boenig, Gladys; Bronner, Sarah M.; Chen, Kevin; Chen, Zhongguo; Choo, Edna F.;
Crawford, Terry D.; Cyr, Patrick; Kaufman, Susan; Li, Yingjie; Liao, Jiangpeng; Liu,
Wenfeng; Ly, Justin; Murray, Jeremy; Shen, Weichao; Wai, John; et al. Design and
Synthesis of a Biaryl Series as Inhibitors for the Bromodomains of CBP/P300. Bioorganic
Med. Chem. Lett. 2018, 28, 15–23.
(85) Bamborough, Paul; Chung, Chun Wa; Furze, Rebecca C.; Grandi, Paola; Michon, Anne
Marie; Watson, Robert J.; Mitchell, Darren J.; Barnett, Heather; Prinjha, Rab K.; Rau,
Christina; Sheppard, Robert J.; Werner, Thilo; Demont, Emmanuel H. Aiming to Miss a
Moving Target: Bromo and Extra Terminal Domain (BET) Selectivity in Constrained
ATAD2 Inhibitors. J. Med. Chem. 2018, 61, 8321–8336.
(86) Miller, Duncan C.; Martin, Mathew P.; Adhikari, Santosh; Brennan, Alfie; Endicott, Jane
A.; Golding, Bernard T.; Hardcastle, Ian R.; Heptinstall, Amy; Hobson, Stephen; Jennings,
Claire; Molyneux, Lauren; Ng, Yvonne; Wedge, Stephen R.; Noble, Martin E. M.; Cano,
Celine. Identification of a Novel Ligand for the ATAD2 Bromodomain with Selectivity
over BRD4 through a Fragment Growing Approach. Org. Biomol. Chem. 2018, 16, 1843–
1850.
(87) Lloyd, Jonathan T.; Gay, Jamie C.; Tonelli, Marco; Cornilescu, Gabriel; Nguyen, Paul;
Carlson, Samuel; Markley, John L.; Glass, Karen C. Structural Insights into the
Recognition of Mono- and Di-Acetyllysine by the ATAD2B Bromodomain-ATPase Family
AAA+ Domain Containing 2. bioRxiv 2018, 263624.
(88) Zhu, Jian; Zhou, Chunxian; Caflisch, Amedeo. Structure-Based Discovery of Selective
BRPF1 Bromodomain Inhibitors. Eur. J. Med. Chem. 2018, 155, 337–352.
(89) Demont, Emmanuel H.; Bamborough, Paul; Chung, Chun Wa; Craggs, Peter D.; Fallon,
David; Gordon, Laurie J.; Grandi, Paola; Hobbs, Clare I.; Hussain, Jameed; Jones, Emma
J.; Le Gall, Armelle; Michon, Anne Marie; Mitchell, Darren J.; Prinjha, Rab K.; Roberts,
Andy D.; Sheppard, Robert J.; Watson, Robert J. 1,3-Dimethyl Benzimidazolones Are
Potent, Selective Inhibitors of the BRPF1 Bromodomain. ACS Med. Chem. Lett. 2014, 5,
1190–1195.
(90) Bouché, Léa; Christ, Clara D.; Siegel, Stephan; Fernández-Montalván, Amaury E.; Holton,
Simon J.; Fedorov, Oleg; Ter Laak, Antonius; Sugawara, Tatsuo; Stöckigt, Detlef; Tallant,
Cynthia; Bennett, James; Monteiro, Octovia; Díaz-Sáez, Laura; Siejka, Paulina; Meier,
Julia; Pütter, Vera; Weiske, Jörg; Müller, Susanne; Huber, Kilian V. M.; et al.
Benzoisoquinolinediones as Potent and Selective Inhibitors of BRPF2 and TAF1/TAF1L
Bromodomains. J. Med. Chem. 2017, 60, 4002–4022.
(91) Hui, Ma; Jian, Zhang; Peiyuan, Zheng; Zhenwei, Wu; Huibin, Zhang. Research Progress
of Selective Small Molecule Bromodomain-Containing Protein 9 Inhibitors. Future Med.
Chem. 2018, 10, 895–906.
(92) Dalle Vedove, Andrea; Spiliotopoulos, Dimitrios; D’Agostino, Vito G.; Marchand, Jean
Rémy; Unzue, Andrea; Nevado, Cristina; Lolli, Graziano; Caflisch, Amedeo. Structural
Analysis of Small-Molecule Binding to the BAZ2A and BAZ2B Bromodomains.
ChemMedChem 2018, 13, 1479–1487.
(93) Bennett, James; Fedorov, Oleg; Tallant, Cynthia; Monteiro, Octovia; Meier, Julia; Gamble,
Vicky; Savitsky, Pavel; Nunez-Alonso, Graciela A.; Haendler, Bernard; Rogers, Catherine;
Brennan, Paul E.; Müller, Susanne; Knapp, Stefan. Discovery of a Chemical Tool Inhibitor
Targeting the Bromodomains of TRIM24 and BRPF. J. Med. Chem. 2016, 59, 1642–1647.
(94) McKeown, Michael R.; Shaw, Daniel L.; Fu, Harry; Liu, Shuai; Xu, Xiang; Marineau, Jason
J.; Huang, Yibo; Zhang, Xiaofeng; Buckley, Dennis L.; Kadam, Asha; Zhang, Zijuan;
Blacklow, Stephen C.; Qi, Jun; Zhang, Wei; Bradner, James E. Biased Multicomponent
Reactions to Develop Novel Bromodomain Inhibitors. J. Med. Chem. 2014, 57, 9019–9027.
(95) Myrianthopoulos, Vassilios; Gaboriaud-Kolar, Nicolas; Tallant, Cynthia; Hall, Michelle
Lynn; Grigoriou, Stylianos; Brownlee, Peter Moore; Fedorov, Oleg; Rogers, Catherine;
Heidenreich, David; Wanior, Marek; Drosos, Nikolaos; Mexia, Nikitia; Savitsky, Pavel;
Bagratuni, Tina; Kastritis, Efstathios; Terpos, Evangelos; Filippakopoulos, Panagis;
Müller, Susanne; Skaltsounis, Alexios Leandros; et al. Discovery and Optimization of a
Selective Ligand for the Switch/Sucrose Nonfermenting-Related Bromodomains of
Polybromo Protein-1 by the Use of Virtual Screening and Hydration Analysis. J. Med.
Chem. 2016, 59, 8787–8803.
(96) Ferri, Nicola; Siegl, Peter; Corsini, Alberto; Herrmann, Joerg; Lerman, Amir; Benghozi,
Renee. Drug Attrition during Pre-Clinical and Clinical Development: Understanding and
Managing Drug-Induced Cardiotoxicity. Pharmacol. Ther. 2013, 138, 470–484.
(97) Guengerich, F. Peter. Mechanisms of Drug Toxicity and Relevance to Pharmaceutical
Development. Drug Metab. Pharmacokinet. 2011, 26, 3–14.
(98) Fielden, Mark R.; Kolaja, Kyle L. The Role of Early in Vivo Toxicity Testing in Drug
Discovery Toxicology. Expert Opin. Drug Saf. 2008, 7, 107–110.
(99) Borzelleca, J. F. Paracelsus: Herald of Modern Toxicology. Toxicol. Sci. 2000, 53, 2–4.
(100) Liebler, Daniel C.; Guengerich, F. Peter. Elucidating Mechanisms of Drug-Induced
Toxicity. Nat. Rev. Drug Discov. 2005, 4, 410–420.
(101) Guengerich, F. Peter. Cytochrome P450s and Other Enzymes in Drug Metabolism and
Toxicity. AAPS J. 2006, 8, E101–E111.
(102) Furberg, Curt D.; Pitt, Bertram. Withdrawal of Cerivastatin from the World Market. Curr.
Control. Trials Cardiovasc. Med. 2001, 2, 205–207.
(103) Pichler, Werner J. Delayed Drug Hypersensitivity Reactions. Ann. Intern. Med. 2003, 139,
683.
(104) Faulk, W. Page. Clinical Aspects of Immunology: 3rd Edn; Wiley-Blackwell, 1976; Vol. 30.
(105) Dekant, Wolfgang. The Role of Biotransformation and Bioactivation in Toxicity. In
Molecular, Clinical and Environmental Toxicology; Luch, Andreas, Ed.; Berlin, 2009.
(106) Takano, T.; Miyazaki, Y. Metabolism of Dichloromethane and the Subsequent Binding of
Its Product, Carbon Monoxide, to Cytochrome P-450 in Perfused Rat Liver. Toxicol. Lett.
1988, 40, 93–96.
(107) Macherey, Anne-Christine; Dansette, Patrick M. Biotransformations Leading to Toxic
Metabolites: Chemical Aspects. Pract. Med. Chem. 2015, 585–614.
(108) Waring, Jeffrey F.; Anderson, Mark G. Idiosyncratic Toxicity: Mechanistic Insights
Gained from Analysis of Prior Compounds. Curr. Opin. Drug Discov. Devel. 2005, 8, 59–
65.
(109) Segura-Bedmar, Isabel; Martínez, Paloma; de Pablo-Sánchez, César. A Linguistic Rule-
Based Approach to Extract Drug-Drug Interactions from Pharmacological Documents.
BMC Bioinformatics 2011, 12, S1.
(110) Juurlink, David N.; Mamdani, Muhammad; Kopp, Alexander; Laupacis, Andreas;
Redelmeier, Donald A. Drug-Drug Interactions among Elderly Patients Hospitalized for
Drug Toxicity. J. Am. Med. Assoc. 2003, 289, 1652–1658.
(111) Safety Guidelines: ICH. https://www.ich.org/products/guidelines/safety/article/safety-guidelines.html
(accessed May 7, 2019).
(112) Deaton, Aimee M.; Fan, Fan; Zhang, Wei; Nguyen, Phuong A.; Ward, Lucas D.; Nioi, Paul.
Rationalizing Secondary Pharmacology Screening Using Human Genetic and
Pharmacological Evidence. Toxicol. Sci. 2019, 167, 593–603.
(113) Viskin, Sami. Long QT Syndromes and Torsade de Pointes. Lancet 1999, 354, 1625–1633.
(114) Thomas, D.; Karle, C. A.; Kiehn, J. The Cardiac HERG/IKr Potassium Channel as
Pharmacological Target: Structure, Function, Regulation, and Clinical Applications. Curr.
Pharm. Des. 2006, 12, 2271–2283.
(115) Shukla, Sunita J.; Huang, Ruili; Austin, Christopher P.; Xia, Menghang. The Future of
Toxicity Testing: A Focus on in Vitro Methods Using a Quantitative High-Throughput
Screening Platform. Drug Discov. Today 2010, 15, 997–1007.
(116) Krejsa, Cecile M.; Horvath, Dragos; Rogalski, Sherri L.; Penzotti, Julie E.; Mao, Boryeu;
Barbosa, Frédérique; Migeon, Jacques C. Predicting ADME Properties and Side Effects:
The BioPrint Approach. Curr. Opin. Drug Discov. Devel. 2003, 6, 470–480.
(117) Lounkine, Eugen; Keiser, Michael J.; Whitebread, Steven; Mikhailov, Dmitri; Hamon,
Jacques; Jenkins, Jeremy L.; Lavan, Paul; Weber, Eckhard; Doak, Allison K.; Côté, Serge;
Shoichet, Brian K.; Urban, Laszlo. Large-Scale Prediction and Testing of Drug Activity on
Side-Effect Targets. Nature 2012, 486, 361–367.
(118) Bowes, Joanne; Brown, Andrew J.; Hamon, Jacques; Jarolimek, Wolfgang; Sridhar, Arun;
Waldron, Gareth; Whitebread, Steven. Reducing Safety-Related Drug Attrition: The Use
of in Vitro Pharmacological Profiling. Nat. Rev. Drug Discov. 2012, 11, 909–922.
(119) Lynch, James J.; Van Vleet, Terry R.; Mittelstadt, Scott W.; Blomme, Eric A. G. Potential
Functional and Pathological Side Effects Related to Off-Target Pharmacological Activity.
J. Pharmacol. Toxicol. Methods 2017, 87, 108–126.
(120) Van Vleet, Terry R.; Liguori, Michael J.; Lynch, James J.; Rao, Mohan; Warder, Scott.
Screening Strategies and Methods for Better Off-Target Liability Prediction and
Identification of Small-Molecule Pharmaceuticals. SLAS Discov. 2019, 24, 1–24.
(121) Whitebread, Steven; Dumotier, Berengere; Armstrong, Duncan; Fekete, Alexander;
Chen, Shanni; Hartmann, Andreas; Muller, Patrick Y.; Urban, Laszlo. Secondary
Pharmacology: Screening and Interpretation of off-Target Activities – Focus on
Translation. Drug Discov. Today 2016, 21, 1232–1242.
(122) Houck, Keith A.; Kavlock, Robert J. Understanding Mechanisms of Toxicity: Insights from
Drug Discovery Research. Toxicol. Appl. Pharmacol. 2008, 227, 163–178.
(123) Obach, R. S. The Utility of in Vitro Cytochrome P450 Inhibition Data in the Prediction
of Drug-Drug Interactions. J. Pharmacol. Exp. Ther. 2005, 316, 336–348.
(124) Horváth, S. Cytotoxicity of Drugs and Diverse Chemical Agents to Cell Cultures.
Toxicology 1980, 16, 59–66.
(125) Niles, Andrew L.; Moravec, Richard A.; Riss, Terry L. Update on in Vitro Cytotoxicity
Assays for Drug Development. Expert Opin. Drug Discov. 2008, 3, 655–669.
(126) Cook, John A.; Mitchell, James B. Viability Measurements in Mammalian Cell Systems.
Anal. Biochem. 1989, 179, 1–7.
(127) Steinberg, Pablo. High-Throughput Screening Methods in Toxicity Testing; Steinberg,
Pablo, Ed.; Wiley, 2013.
(128) Shah, Shaily Umang. Importance of Genotoxicity & S2A Guidelines for Genotoxicity
Testing for Pharmaceuticals. IOSR J. Pharm. Biol. Sci. 2012, 1, 43–54.
(129) McCann, Joyce; Choi, Edmund; Yamasaki, Edith; Ames, Bruce N.; Haroun, L.; Maron, D.;
Keng, T.; Car, C. Detection of Carcinogens as Mutagens in the Salmonella/Microsome
Test: Assay of 300 Chemicals. Med. Sci. 1975, 72, 5135–5139.
(130) Krishna, G.; Fiedler, R.; Theiss, J. C. Simultaneous Evaluation of Clastogenicity,
Aneugenicity and Toxicity in the Mouse Micronucleus Assay Using Immunofluorescence.
Mutat. Res. 1992, 282, 159–167.
(131) Lloyd, Melvyn; Kidd, Darren. The Mouse Lymphoma Assay. Methods Mol. Biol. 2012, 817,
35–54.
(132) Rao, P. Rama; Jena, G. B.; Kaul, C. L.; Ramarao, P. Genotoxicity Testing, a Regulatory
Requirement for Drug Discovery and Development: Impact of ICH Guidelines. Indian J.
Pharmacol. 2002, 34, 86–99.
(133) Mishima, Masayuki. Chromosomal Aberrations, Clastogens vs Aneugens. Front. Biosci.
(Schol. Ed). 2017, 9, 1–16.
(134) Hayashi, Makoto. The Micronucleus Test-Most Widely Used in Vivo Genotoxicity Test.
Genes Environ. 2016, 38, 18.
(135) Kirkland, David; Reeve, Lesley; Gatehouse, David; Vanparys, Philippe. A Core in Vitro
Genotoxicity Battery Comprising the Ames Test plus the in Vitro Micronucleus Test Is
Sufficient to Detect Rodent Carcinogens and in Vivo Genotoxins. Mutat. Res. - Genet.
Toxicol. Environ. Mutagen. 2011, 721, 27–73.
(136) Kimura, Hiroshi; Sakai, Yasuyuki; Fujii, Teruo. Organ/Body-on-a-Chip Based on
Microfluidic Technology for Drug Discovery. Drug Metab. Pharmacokinet. 2018, 33, 43–
48.
(137) Ishida, Seiichi. Organs-on-a-Chip: Current Applications and Consideration Points for in
Vitro ADME-Tox Studies. Drug Metab. Pharmacokinet. 2018, 33, 49–54.
(138) Törnqvist, Elin; Annas, Anita; Granath, Britta; Jalkesten, Elisabeth; Cotgreave, Ian; Öberg,
Mattias. Strategic Focus on 3R Principles Reveals Major Reductions in the Use of Animals
in Pharmaceutical Toxicity Testing. PLoS One 2014, 9, e101638.
(139) Flecknell, Paul. Replacement, Reduction and Refinement. ALTEX 2002, 19, 73–78.
(140) Nuffield BioEthics. Chapter 9- Animal Use in Toxicity Studies. In The Ethics of Research
Involving Animals; Nuffield Council on Bioethics: London, 2005; pp 155–167.
(141) Calabrese, E. J.; Baldwin, L. A. Improved Method for Selection of the NOAEL. Regul.
Toxicol. Pharmacol. 1994, 19, 48–50.
(142) Dorato, Michael A.; Engelhardt, Jeffery A. The No-Observed-Adverse-Effect-Level in
Drug Safety Evaluations: Use, Issues, and Definition(S). Regul. Toxicol. Pharmacol. 2005,
42, 265–274.
(143) Food and Drug Administration, HHS. International Conference on Harmonisation;
Guidance on M3(R2) Nonclinical Safety Studies for the Conduct of Human Clinical Trials
and Marketing Authorization for Pharmaceuticals; Availability. Notice. Fed. Regist. 2010,
75, 3471–3472.
(144) Evans, Scott R. Fundamentals of Clinical Trial Design. J. Exp. Stroke Transl. Med. 2010, 3,
19–27.
(145) Umscheid, Craig A.; Margolis, David J.; Grossman, Craig E. Key Concepts of Clinical
Trials: A Narrative Review. Postgrad. Med. 2011, 123, 194–204.
(146) Evans, Scott R. Clinical Trial Structures. J. Exp. Stroke Transl. Med. 2010, 3, 8–18.
(147) Mandrekar, Sumithra J.; Sargent, Daniel J. Randomized Phase II Trials: Time for a New
Era in Clinical Trial Design. J. Thorac. Oncol. 2010, 5, 932–934.
(148) Walker, Esteban; Nowacki, Amy S. Understanding Equivalence and Noninferiority
Testing. J. Gen. Intern. Med. 2011, 26, 192–196.
(149) Sedgwick, P. Confounding in Randomised Controlled Trials. BMJ 2010, 341, c5403–
c5403.
(150) Lineberry, Neil; Berlin, Jesse A.; Mansi, Bernadette; Glasser, Susan; Berkwits, Michael;
Klem, Christian; Bhattacharya, Ananya; Citrome, Leslie; Enck, Robert; Fletcher, John;
Haller, Daniel; Chen, Tai Tsang; Laine, Christine. Recommendations to Improve Adverse
Event Reporting in Clinical Trial Publications: A Joint Pharmaceutical Industry/Journal
Editor Perspective. BMJ 2016, 355.
(151) Meyboom, Ronald H. B.; Hekster, Yechiel A.; Egberts, Antoine C. G.; Gribnau, Frank W.
J.; Edwards, I. Ralph. Causal or Casual? The Role of Causality Assessment in
Pharmacovigilance. Drug Saf. 1997, 17, 374–389.
(152) Coomarasamy, A.; Williams, H.; Truchanowwicz, E. Definitions of Adverse Events,
Seriousness and Causality. Ncbi 2016, 20, Appendix 3.
(153) EudraVigilance: electronic reporting. European Medicines Agency.
https://www.ema.europa.eu/en/human-regulatory/research-development/pharmacovigilance/eudravigilance/eudravigilance-electronic-reporting
(accessed Oct 31, 2018).
(154) US FDA. Adverse Event Reporting to IRBs — Improving Human Subject Protection; 2009.
(155) Brown, Elliot G.; Wood, Louise; Wood, Sue. The Medical Dictionary for Regulatory
Activities (MedDRA). In Drug Safety; John Wiley & Sons, Ltd: Chichester, UK, 1999; Vol.
20, pp 109–117.
(156) Schroll, Jeppe Bennekou; Maund, Emma; Gøtzsche, Peter C. Challenges in Coding
Adverse Events in Clinical Trials: A Systematic Review. PLoS One 2012, 7, e41174.
(157) Jurić, Diana; Pranić, Shelly; Tokalić, Ružica; Milat, Ana Marija; Mudnić, Ivana; Pavličević,
Ivančica; Marušić, Ana. Clinical Trials on Drug-Drug Interactions Registered in
ClinicalTrials.Gov Reported Incongruent Safety Data in Published Articles: An
Observational Study. J. Clin. Epidemiol. 2018, 104, 35–45.
(158) Riveros, Carolina; Dechartres, Agnes; Perrodeau, Elodie; Haneef, Romana; Boutron,
Isabelle; Ravaud, Philippe. Timing and Completeness of Trial Results Posted at
ClinicalTrials.Gov and Published in Journals. PLoS Med. 2013, 10, 1–12.
(159) Tse, Tony; Williams, Rebecca J.; Zarin, Deborah A. Reporting “Basic Results” in
ClinicalTrials.Gov. Chest 2009, 136, 295–303.
(160) Puljak, Livia. Flupirtine, an Effective Analgesic, but Hepatotoxicity Should Limit Its Use.
Anesth. Analg. 2018, 127, 309–310.
(161) Center for Drug Evaluation and Research. Drug Safety and Availability - FDA Drug Safety
Communication: FDA recommends against the continued use of propoxyphene
https://www.fda.gov/Drugs/DrugSafety/ucm234338.htm (accessed Jan 14, 2019).
(162) Waring, Michael J.; Arrowsmith, John; Leach, Andrew R.; Leeson, Paul D.; Mandrell, Sam;
Owen, Robert M.; Pairaudeau, Garry; Pennie, William D.; Pickett, Stephen D.; Wang,
Jibo; Wallace, Owen; Weir, Alex. An Analysis of the Attrition of Drug Candidates from
Four Major Pharmaceutical Companies. Nat. Rev. Drug Discov. 2015, 14, 475–486.
(163) Cook, David; Brown, Dearg; Alexander, Robert; March, Ruth; Morgan, Paul;
Satterthwaite, Gemma; Pangalos, Menelas N. Lessons Learned from the Fate of
AstraZeneca’s Drug Pipeline: A Five-Dimensional Framework. Nat. Rev. Drug Discov.
2014, 13, 419–431.
(164) Morgan, Paul; Brown, Dean G.; Lennard, Simon; Anderton, Mark J.; Barrett, J. Carl;
Eriksson, Ulf; Fidock, Mark; Hamrén, Bengt; Johnson, Anthony; March, Ruth E.;
Matcham, James; Mettetal, Jerome; Nicholls, David J.; Platz, Stefan; Rees, Steve;
Snowden, Michael A.; Pangalos, Menelas N. Impact of a Five-Dimensional Framework on
R&D Productivity at AstraZeneca. Nat. Rev. Drug Discov. 2018, 17, 167–181.
(165) Gintant, Gary; Sager, Philip T.; Stockbridge, Norman. Evolution of Strategies to Improve
Preclinical Cardiac Safety Testing. Nat. Rev. Drug Discov. 2016, 15, 457–471.
(166) Kim, Eunjung; Rebecca, Vito W.; Smalley, Keiran S. M.; Anderson, Alexander R. A. Phase
i Trials in Melanoma: A Framework to Translate Preclinical Findings to the Clinic. Eur. J.
Cancer 2016, 67, 213–222.
(167) Mak, Isabella Wy; Evaniew, Nathan; Ghert, Michelle. Lost in Translation: Animal Models
and Clinical Trials in Cancer Treatment. Am. J. Transl. Res. 2014, 6, 114–118.
(168) Li, Albert P. Accurate Prediction of Human Drug Toxicity: A Major Challenge in Drug
Development. Chem. Biol. Interact. 2004, 150, 3–7.
(169) Olson, Harry; Betton, Graham; Robinson, Denise; Thomas, Karluss; Monro, Alastair;
Kolaja, Gerald; Lilly, Patrick; Sanders, James; Sipes, Glenn; Bracken, William; Dorato,
Michael; Van Deun, Koen; Smith, Peter; Berger, Bruce; Heller, Allen. Concordance of the
Toxicity of Pharmaceuticals in Humans and in Animals. Regul. Toxicol. Pharmacol. 2000,
32, 56–67.
(170) King, G. L. Animal Models in the Study of Vomiting. Can. J. Physiol. Pharmacol. 1990, 68,
260–268.
(171) Holmes, A. M.; Rudd, J. A.; Tattersall, F. D.; Aziz, Q.; Andrews, P. L. R. Opportunities for
the Replacement of Animals in the Study of Nausea and Vomiting. Br. J. Pharmacol.
2009, 157, 865–880.
(172) Parkinson, Joanna; Muthas, Daniel; Clark, Matthew; Boyer, Scott; Valentin, Jean Pierre;
Ewart, Lorna. Application of Data Mining and Visualization Techniques for the
Prediction of Drug-Induced Nausea in Man. Toxicol. Sci. 2012, 126, 275–284.
(173) Needs, C. J.; Brooks, P. M. Antirheumatic Medication in Pregnancy. Br. J. Rheumatol.
1985, 24, 282–290.
(174) Lepper, Erin R.; Smith, Nicola F.; Cox, Michael C.; Scripture, Charity D.; Figg, William D.
Thalidomide Metabolism and Hydrolysis: Mechanisms and Implications. Curr. Drug
Metab. 2006, 7, 677–685.
(175) Powell, Craig M.; Miyakawa, Tsuyoshi. Schizophrenia-Relevant Behavioral Testing in
Rodent Models: A Uniquely Human Disorder? Biol. Psychiatry 2006, 59, 1198–1207.
(176) Jones, C. A.; Watson, D. J. G.; Fone, K. C. F. Animal Models of Schizophrenia. Br. J.
Pharmacol. 2011, 164, 1162–1194.
(177) Force, Thomas; Kolaja, Kyle L. Cardiotoxicity of Kinase Inhibitors: The Prediction and
Translation of Preclinical Models to Clinical Outcomes. Nat. Rev. Drug Discov. 2011, 10,
111–126.
(178) Dhar, Vasant. Data Science and Prediction. Ssrn 2012, 56, 64–73.
(179) Hayashi, Chikio. What Is Data Science? Fundamental Concepts and a Heuristic Example.
In Data Science, Classification, and Related Methods; Springer, Tokyo, 1998; pp 40–51.
(180) Chollet, Francois. Deep Learning With Python; Arritola, Toni, Gaines, Jerry,
Dragosavlijevic, Aleksandar, Taylor, Tiffany, Eds.; Manning Publications Co.: New York,
2018.
(181) Talwar, Anish; Kumar, Yogesh. Machine Learning: An Artificial Intelligence
Methodology. Int. J. Eng. Comput. Sci. 2013, 2, 3400–3404.
(182) Dey, Ayon. Machine Learning Algorithms: A Review. Int. J. Comput. Sci. Inf. Technol.
2016, 7, 1174–1179.
(183) Kotsiantis, S. B. Supervised Machine Learning: A Review of Classification Techniques.
Informatica 2007, 31, 249–268.
(184) Tsoumakas, Grigorios; Katakis, Ioannis. Multi-Label Classification: An Overview;
Thessaloniki, Greece.
(185) Read, Jesse; Martino, Luca; Olmos, Pablo M.; Luengo, David. Scalable Multi-Output Label
Prediction: From Classifier Chains to Classifier Trellises. Pattern Recognit. 2015, 48,
2096–2109.
(186) Sokolova, Marina; Lapalme, Guy. A Systematic Analysis of Performance Measures for
Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
(187) Fawcett, Tom. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–
874.
(188) Powers, David M. W. Evaluation: From Precision, Recall and F-Factor to ROC,
Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2007, 2, 37–63.
(189) Hajian-Tilaki, Karimollah. Receiver Operating Characteristic (ROC) Curve Analysis for
Medical Diagnostic Test Evaluation. Casp. J. Intern. Med. 2013, 4, 627–635.
(190) Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow, 1st
ed.; O’Reilly, 2017.
(191) Tabachnick, Barbara G.; Fidell, Linda S. Using Multivariate Statistics, 7th ed.;
Pearson: New York, 2019; Vol. 7.
(192) Ranganathan, Priya; Pramesh, C. S.; Aggarwal, Rakesh. Common Pitfalls in Statistical
Analysis: Logistic Regression. Perspect. Clin. Res. 2017, 8, 148–151.
(193) Cortes, Corinna; Vapnik, Vladimir. Support-Vector Networks. Mach. Learn. 1995, 20,
273–297.
(194) Raschka, Sebastian. A Tour of Machine Learning Classifiers Using Scikit-Learn. In Python
Machine Learning: unlock deeper insights into machine learning with this vital guide to
cutting-edge predictive analytics; Packt Publishing Ltd: Birmingham, UK, 2015; pp 49–96.
(195) Basak, Debasish; Pal, Srimanta; Patranabis, Dipak Chandra. Support Vector Regression.
In Neural Information Processing-Letters and Reviews; 2007; Vol. 11, pp 203–224.
(196) Noble, William S. What Is a Support Vector Machine? Nat. Biotechnol. 2006, 24, 1565–
1567.
(197) Auria, Laura; Moro, Rouslan A. Support Vector Machines (SVM) as a Technique for
Solvency Analysis; Berlin, Germany, 2008.
(198) Quinlan, J. R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
(199) Shannon, C. E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27,
379–423.
(200) Mitchell, John B. O. Machine Learning Methods in Chemoinformatics. Wiley
Interdiscip. Rev. Comput. Mol. Sci. 2014, 4, 468–481.
(201) Quinlan, J. R. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4,
77–90.
(202) Raileanu, Laura E.; Stoffel, Kilian. Theoretical Comparison between the Gini Index and
Information Gain. Ann. Math. Artif. Intell. 2004, 41, 77–93.
(203) Dietterich, Thomas G. An Experimental Comparison of Three Methods for Constructing
Ensembles of Decision Trees. Mach. Learn. 2000, 40, 139–157.
(204) Breiman, Leo. Bagging Predictors: Technical Report No. 421. Dep. Stat. Univ. Calif. 1994,
No. 2, 19.
(205) Touw, Wouter G.; Bayjanov, Jumamurat R.; Overmars, Lex; Backus, Lennart; Boekhorst,
Jos; Wels, Michiel; van Hijum, Sacha A. F. T. Data Mining in the Life Sciences with
Random Forest: A Walk in the Park or Lost in the Jungle? Brief. Bioinform. 2013, 14, 315–
326.
(206) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
(207) Ho, Tin Kam. The Random Subspace Method for Constructing Decision Forests. IEEE
Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
(208) Caruana, Rich; Niculescu-Mizil, Alexandru. Predicting Good Probabilities With
Supervised Learning. In ICML ’05: Proceedings of the 22nd International Conference
on Machine Learning; Bonn, Germany, 2005; pp 625–632.
(209) Cortés-Ciriano, Isidro; Ain, Qurrat Ul; Subramanian, Vigneshwari; Lenselink, Eelke B.;
Méndez-Lucio, Oscar; IJzerman, Adriaan P.; Wohlfahrt, Gerd; Prusis, Peteris; Malliavin,
Thérèse E.; van Westen, Gerard J. P.; Bender, Andreas. Polypharmacology Modelling
Using Proteochemometrics (PCM): Recent Methodological Developments, Applications
to Target Families, and Future Prospects. Med. Chem. Commun. 2015, 6, 24–50.
(210) Strobl, Carolin; Boulesteix, Anne-Laure; Kneib, Thomas; Augustin, Thomas; Zeileis,
Achim. Conditional Variable Importance for Random Forests. BMC Bioinformatics 2008,
9, 307.
(211) Diaz-Uriarte, Ramon; de Andres, Sara Alvarez. Variable Selection from Random Forests:
Application to Gene Expression Data. In Conference: Proceedings of the 5th Annual
Spanish Bioinformatics Conference; Barcelona, 2005; pp 1–11.
(212) Ishwaran, Hemant; Kogalur, Udaya B.; Gorodeski, Eiran Z.; Minn, Andy J.; Lauer, Michael
S. High-Dimensional Variable Selection for Survival Data. J. Am. Stat. Assoc. 2010, 105,
205–217.
(213) Palczewska, Anna; Palczewski, Jan; Robinson, Richard Marchese; Neagu, Daniel.
Interpreting Random Forest Classification Models Using a Feature Contribution Method.
Adv. Intell. Syst. Comput. 2014, 263, 193–218.
(214) Fabris, Fabio; Doherty, Aoife; Palmer, Daniel; de Magalhães, João Pedro; Freitas, Alex A.
A New Approach for Interpreting Random Forest Models and Its Application to the
Biology of Ageing. Bioinformatics 2018, 34, 2449–2456.
(215) Epifanio, Irene. Intervention in Prediction Measure: A New Approach to Assessing
Variable Importance for Random Forests. BMC Bioinformatics 2017, 18, 230.
(216) Hanser, T.; Barber, C.; Marchaland, J. F.; Werner, S. Applicability Domain: Towards a
More Formal Definition. SAR QSAR Environ. Res. 2016, 27, 865–881.
(217) Ruiz, Irene Luque; Gómez-Nieto, Miguel Ángel. Study of the Applicability Domain of the
QSAR Classification Models by Means of the Rivality and Modelability Indexes. Molecules
2018, 23, 2756.
(218) Kaneko, Hiromasa; Funatsu, Kimito. Applicability Domain Based on Ensemble Learning
in Classification and Regression Analyses. J. Chem. Inf. Model. 2014, 54, 2469–2482.
(219) Aniceto, Natália; Freitas, Alex A.; Bender, Andreas; Ghafourian, Taravat. A Novel
Applicability Domain Technique for Mapping Predictive Reliability across the Chemical
Space of a QSAR: Reliability-Density Neighbourhood. J. Cheminform. 2016, 8, 69.
(220) Sushko, Iurii; Novotarskyi, Sergii; Körner, Robert; Pandey, Anil Kumar; Cherkasov,
Artem; Li, Jiazhong; Gramatica, Paola; Hansen, Katja; Schroeter, Timon; Müller, Klaus-
Robert; Xi, Lili; Liu, Huanxiang; Yao, Xiaojun; Öberg, Tomas; Hormozdiari, Farhad; Dao,
Phuong; Sahinalp, Cenk; Todeschini, Roberto; Polishchuk, Pavel; et al. Applicability
Domains for Classification Problems: Benchmarking of Distance to Models for Ames
Mutagenicity Set. J. Chem. Inf. Model. 2010, 50, 2094–2111.
(221) Norinder, Ulf; Carlsson, Lars; Boyer, Scott; Eklund, Martin. Introducing Conformal
Prediction in Predictive Modeling. A Transparent and Flexible Alternative to
Applicability Domain Determination. J. Chem. Inf. Model. 2014, 54, 1596–1603.
(222) Toplak, Marko; Močnik, Rok; Polajnar, Matija; Bosnić, Zoran; Carlsson, Lars; Hasselgren,
Catrin; Demšar, Janez; Boyer, Scott; Zupan, Blaž; Stålring, Jonna. Assessment of Machine
Learning Reliability Methods for Quantifying the Applicability Domain of QSAR
Regression Models. J. Chem. Inf. Model. 2014, 54, 431–441.
(223) Svensson, Fredrik; Norinder, Ulf; Bender, Andreas. Modelling Compound Cytotoxicity
Using Conformal Prediction and PubChem HTS Data. Toxicol. Res. (Camb). 2017, 6, 73–
80.
(224) Svensson, Fredrik; Norinder, Ulf; Bender, Andreas. Improving Screening Efficiency
through Iterative Screening Using Docking and Conformal Prediction. J. Chem. Inf.
Model. 2017, 57, 439–444.
(225) Norinder, Ulf; Carlsson, Lars; Boyer, Scott; Eklund, Martin. Introducing Conformal
Prediction in Predictive Modeling for Regulatory Purposes. A Transparent and Flexible
Alternative to Applicability Domain Determination. Regul. Toxicol. Pharmacol. 2015, 71,
279–284.
(226) Eklund, Martin; Norinder, Ulf; Boyer, Scott; Carlsson, Lars. Application of Conformal
Prediction in QSAR. In IFIP Advances in Information and Communication Technology;
Springer, Berlin, Heidelberg, 2012; Vol. 382 AICT, pp 166–175.
(227) Sun, Jiangming; Carlsson, Lars; Ahlberg, Ernst; Norinder, Ulf; Engkvist, Ola; Chen,
Hongming. Applying Mondrian Cross-Conformal Prediction To Estimate Prediction
Confidence on Large Imbalanced Bioactivity Data Sets. J. Chem. Inf. Model. 2017, 57, 1591–
1598.
(228) Cahill, Nathan D. Normalized Measures of Mutual Information with General Definitions
of Entropy for Multimodal Image Registration. In Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics); 2010; Vol. 6204 LNCS, pp 258–268.
(229) Deeks, Jonathan J.; Altman, Douglas G. Diagnostic Tests 4: Likelihood Ratios. BMJ 2004,
329, 168–169.
(230) Wall, R. J.; Shani, M. Are Animal Models as Good as We Think? Theriogenology 2008,
69, 2–9.
(231) Chien, Patrick F. W.; Khan, Khalid S. Evaluation of a Clinical Test. II: Assessment of
Validity. BJOG An Int. J. Obstet. Gynaecol. 2003, 108, 568–572.
(232) Clark, Matthew. Prediction of Clinical Risks by Analysis of Preclinical and Clinical
Adverse Events. J. Biomed. Inform. 2015, 54, 167–173.
(233) Klabunde, T. Chemogenomic Approaches to Drug Discovery: Similar Receptors Bind
Similar Ligands. Br. J. Pharmacol. 2007, 152, 5–7.
(234) Rognan, D. Chemogenomic Approaches to Rational Drug Design. Br. J. Pharmacol. 2007,
152, 38–52.
(235) Boström, Jonas; Hogner, Anders; Schmitt, Stefan. Do Structurally Similar Ligands Bind
in a Similar Fashion? J. Med. Chem. 2006, 49, 6716–6725.
(236) Chodera, John D.; Mobley, David L. Entropy-Enthalpy Compensation: Role and
Ramifications in Biomolecular Ligand Recognition and Design. Annu. Rev. Biophys. 2013,
42, 121–142.
(237) Du, Xing; Li, Yi; Xia, Yuan Ling; Ai, Shi Meng; Liang, Jing; Sang, Peng; Ji, Xing Lai; Liu,
Shu Qun. Insights into Protein–ligand Interactions: Mechanisms, Models, and Methods.
Int. J. Mol. Sci. 2016, 17, 144.
(238) Mauri, Andrea; Consonni, Viviana; Pavan, Manuela; Todeschini, Roberto. Dragon
Software: An Easy Approach To Molecular Descriptor Calculations. MATCH Commun.
Math. Comput. Chem. 2006, 56, 237–248.
(239) Yap, Chun Wei. PaDEL-Descriptor: An Open Source Software to Calculate Molecular
Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.
(240) Rifaioglu, Ahmet Sureyya; Atas, Heval; Martin, Maria Jesus; Cetin-Atalay, Rengul; Atalay,
Volkan; Doğan, Tunca. Recent Applications of Deep Learning and Machine Intelligence
on in Silico Drug Discovery: Methods, Tools and Databases. Brief. Bioinform. 2018, 1–36.
(241) Gozalbes, R.; Doucet, J. P.; Derouin, F. Application of Topological Descriptors in QSAR
and Drug Design: History and New Trends. Curr. Drug Targets. Infect. Disord. 2002, 2,
93–102.
(242) Durant, Joseph L.; Leland, Burton A.; Henry, Douglas R.; Nourse, James G.
Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002,
42, 1273–1280.
(243) Sliwoski, Gregory; Mendenhall, Jeffrey; Meiler, Jens. Autocorrelation Descriptor
Improvements for QSAR: 2DA-Sign and 3DA-Sign. J. Comput. Aided. Mol. Des. 2016, 30,
209–217.
(244) Rogers, David; Hahn, Mathew. Extended-Connectivity Fingerprints. J. Chem. Inf. Model.
2010, 50, 742–754.
(245) Glem, Robert C.; Bender, Andreas; Arnby, Catrin H.; Carlsson, Lars; Boyer, Scott; Smith,
James. Circular Fingerprints: Flexible Molecular Descriptors with Applications from
Physical Chemistry to ADME. IDrugs 2006, 9, 199–204.
(246) Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures-
A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113.
(247) Bender, Andreas; Jenkins, Jeremy L.; Scheiber, Josef; Sukuru, Sai Chetan K.; Glick, Meir;
Davies, John W. How Similar Are Similarity Searching Methods? A Principal Component
Analysis of Molecular Descriptor Space. J. Chem. Inf. Model. 2009, 49, 108–119.
(248) Cramer, Richard D.; Patterson, David E.; Bunce, Jeffrey D. Comparative Molecular Field
Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am.
Chem. Soc. 1988, 110, 5959–5967.
(249) Lapinsh, Maris; Prusis, Peteris; Lundstedt, Torbjörn; Wikberg, Jarl E. S.; Witte, Peter de;
Augustijns, Patrick F.; Annaert, Pieter P. Proteochemometrics Modeling of the
Interaction of Amine G-Protein Coupled Receptors with a Diverse Set of Ligands. Mol.
Pharmacol. 2002, 61, 1465–1475.
(250) Cruciani, Gabriele; Pastor, Manuel; Guba, Wolfgang. VolSurf: A New Tool for the
Pharmacokinetic Optimization of Lead Compounds. Eur. J. Pharm. Sci. 2000, 11, S29-39.
(251) Subramanian, Vigneshwari; Prusis, Peteris; Pietilä, Lars-Olof; Xhaard, Henri; Wohlfahrt,
Gerd. Visually Interpretable Models of Kinase Selectivity Related
Features Derived from Field-Based Proteochemometrics. J. Chem. Inf. Model. 2013, 53,
3021–3030.
(252) Ekins, S.; Mestres, J.; Testa, B. In Silico Pharmacology for Drug Discovery: Applications
to Targets and Beyond. Br. J. Pharmacol. 2007, 152, 21–37.
(253) Mulliner, Denis; Schmidt, Friedemann; Stolte, Manuela; Spirkl, Hans-Peter; Czich,
Andreas; Amberg, Alexander. Computational Models for Human and Animal
Hepatotoxicity with a Global Application Scope. Chem. Res. Toxicol. 2016, 29, 757–767.
(254) Axen, Seth D.; Huang, Xi-Ping; Cáceres, Elena L.; Gendelev, Leo; Roth, Bryan L.; Keiser,
Michael J. A Simple Representation of Three-Dimensional Molecular Structure. J. Med.
Chem. 2017, 60, 7393–7409.
(255) Gómez-Bombarelli, Rafael; Wei, Jennifer N.; Duvenaud, David; Hernández-Lobato, José
Miguel; Sánchez-Lengeling, Benjamín; Sheberla, Dennis; Aguilera-Iparraguirre, Jorge;
Hirzel, Timothy D.; Adams, Ryan P.; Aspuru-Guzik, Alán. Automatic Chemical Design
Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4,
268–276.
(256) Duvenaud, David; Maclaurin, Dougal; Aguilera-Iparraguirre, Jorge; Gómez-Bombarelli,
Rafael; Hirzel, Timothy; Aspuru-Guzik, Alán; Adams, Ryan P. Convolutional Networks
on Graphs for Learning Molecular Fingerprints. In Proceedings of Advances in Neural
Information Processing Systems 28 (NIPS 2015); Montreal, 2015; pp 2215–2223.
(257) Gupta, Anvita; Müller, Alex T.; Huisman, Berend J. H.; Fuchs, Jens A.; Schneider, Petra;
Schneider, Gisbert. Generative Recurrent Networks for De Novo Drug Design. Mol.
Inform. 2018, 37, 1700111.
(258) Qiu, Tianyi; Qiu, Jingxuan; Feng, Jun; Wu, Dingfeng; Yang, Yiyan; Tang, Kailin; Cao,
Zhiwei; Zhu, Ruixin. The Recent Progress in Proteochemometric Modelling: Focusing on
Target Descriptors, Cross-Term Descriptors and Application Scope. Brief. Bioinform.
2017, 18, 125–136.
(259) Kramer, Christian; Gedeck, Peter. Global Free Energy Scoring Functions Based on
Distance-Dependent Atom-Type Pair Descriptors. J. Chem. Inf. Model. 2011, 51, 707–720.
(260) Ong, Serene A. K.; Lin, Hong Huang; Chen, Yu Zong; Li, Ze Rong; Cao, Zhiwei. Efficacy
of Different Protein Descriptors in Predicting Protein Functional Families. BMC
Bioinformatics 2007, 8, 300.
(261) Gao, Qing-Bin; Wang, Zheng-Zhi; Yan, Chun; Du, Yao-Hua. Prediction of Protein
Subcellular Location Using a Combined Feature of Sequence. FEBS Lett. 2005, 579, 3444–
3448.
(262) Bhasin, Manoj; Raghava, Gajendra P. S. Classification of Nuclear Receptors Based on
Amino Acid Composition and Dipeptide Composition. J. Biol. Chem. 2004, 279, 23262–
23266.
(263) Strömbergsson, Helena; Kryshtafovych, Andriy; Prusis, Peteris; Fidelis, Krzysztof;
Wikberg, Jarl E. S.; Komorowski, Jan; Hvidsten, Torgeir R. Generalized Modeling of
Enzyme-Ligand Interactions Using Proteochemometrics and Local Protein
Substructures. Proteins 2006, 65, 568–579.
(264) Horne, David S. Prediction of Protein Helix Content from an Autocorrelation Analysis of
Sequence Hydrophobicities. Biopolymers 1988, 27, 451–477.
(265) Sandberg, Maria; Eriksson, Lennart; Jonsson, Jörgen; Sjöström, Michael; Wold, Svante.
New Chemical Descriptors Relevant for the Design of Biologically Active Peptides. A
Multivariate Characterization of 87 Amino Acids. J. Med. Chem. 1998, 41, 2481–2491.
(266) Mei, Hu; Liao, Zhi H.; Zhou, Yuan; Li, Shengshi Z. A New Set of Amino Acid Descriptors
and Its Application in Peptide QSARs. Biopolym. - Pept. Sci. Sect. 2005, 80, 775–786.
(267) van Westen, Gerard J. P.; Swier, Remco F.; Cortes-Ciriano, Isidro; Wegner, Jörg K.;
Overington, John P.; IJzerman, Adriaan P.; van Vlijmen, Herman W. T.; Bender, Andreas.
Benchmarking of Protein Descriptor Sets in
Proteochemometric Modeling (Part 2): Modeling Performance of 13 Amino Acid
Descriptor Sets. J. Cheminform. 2013, 5, 42–62.
(268) Yang, Li; Shu, Mao; Ma, Kaiwang; Mei, Hu; Jiang, Yongjun; Li, Zhiliang. ST-Scale as a
Novel Amino Acid Descriptor and Its Application in QSAM of Peptides and Analogues.
Amino Acids 2010, 38, 805–816.
(269) Tian, Feifei; Zhou, Peng; Li, Zhiliang. T-Scale as a Novel Vector of Topological Descriptors
for Amino Acids and Its Application in QSARs of Peptides. J. Mol. Struct. 2007, 830, 106–
115.
(270) Georgiev, Alexander G. Interpretable Numerical Descriptors of Amino Acid Space. J.
Comput. Biol. 2009, 16, 703–723.
(271) Henikoff, S.; Henikoff, J. G. Amino Acid Substitution Matrices from Protein Blocks. Proc.
Natl. Acad. Sci. U. S. A. 1992, 89, 10915–10919.
(272) Zaliani, A.; Gancia, E. MS-WHIM Scores for Amino Acids: A New 3D-Description for
Peptide QSAR and QSPR Studies. J. Chem. Inf. Comput. Sci. 1999, 39, 525–533.
(273) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov,
I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.
(274) Wu, Dingfeng; Huang, Qi; Zhang, Yida; Zhang, Qingchen; Liu, Qi; Gao, Jun; Cao, Zhiwei;
Zhu, Ruixin. Screening of Selective
Histone Deacetylase Inhibitors by Proteochemometric Modeling. BMC Bioinformatics
2012, 13, 212.
(275) Qiu, Tianyi; Xiao, Han; Zhang, Qingchen; Qiu, Jingxuan; Yang, Yiyan; Wu, Dingfeng; Cao,
Zhiwei; Zhu, Ruixin. Proteochemometric Modeling of the Antigen-Antibody Interaction:
New Fingerprints for Antigen, Antibody and Epitope-Paratope Interaction. PLoS One
2015, 10, e0122416.
(276) Bosc, Nicolas; Wroblowski, Berthold; Meyer, Christophe; Bonnet, Pascal. Prediction of
Protein Kinase-Ligand Interactions through 2.5D Kinochemometrics. J. Chem. Inf. Model.
2017, 57, 93–101.
(277) Rácz, Anita; Bajusz, Dávid; Héberger, Károly. Life beyond the Tanimoto Coefficient:
Similarity Measures for Interaction Fingerprints. J. Cheminform. 2018, 10, 48.
(278) Bajusz, D.; Rácz, A.; Héberger, K. Chemical Data Formats, Fingerprints, and Other
Molecular Descriptions for Database Analysis and Searching. In Comprehensive
Medicinal Chemistry III; Elsevier, 2017; pp 329–378.
(279) Salentin, Sebastian; Schreiber, Sven; Haupt, V. Joachim; Adasme, Melissa F.; Schroeder,
Michael. PLIP: Fully Automated Protein-Ligand Interaction Profiler. Nucleic Acids Res.
2015, 43, W443–W447.
(280) Gini, Giuseppina. QSAR Methods. Methods Mol. Biol. 2016, 1425, 1–20.
(281) Kearnes, Steven; Pande, Vijay. ROCS-Derived Features for Virtual Screening. J. Comput.
Aided. Mol. Des. 2016, 30, 609–617.
(282) Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H. Fast 3D Shape
Screening of Large Chemical Databases through Alignment-Recycling. Chem. Cent. J.
2007, 1, 12.
(283) Mayr, Andreas; Klambauer, Günter; Unterthiner, Thomas; Steijaert, Marvin; Wegner,
Jörg K.; Ceulemans, Hugo; Clevert, Djork-Arné; Hochreiter, Sepp. Large-Scale
Comparison of Machine Learning Methods for Drug Target Prediction on ChEMBL.
Chem. Sci. 2018, 9, 5441–5451.
(284) Simões, Rodolfo S.; Maltarollo, Vinicius G.; Oliveira, Patricia R.; Honorio, Kathia M.
Transfer and Multi-Task Learning in QSAR Modeling: Advances and Challenges. Front.
Pharmacol. 2018, 9, 74.
(285) Huang, Hung-Jin; Yu, Hsin Wei; Chen, Chien-Yu; Hsu, Chih-Ho; Chen, Hsin-Yi; Lee,
Kuei-Jen; Tsai, Fuu-Jen; Chen, Calvin Yu-Chian. Current Developments of Computer-
Aided Drug Design. J. Taiwan Inst. Chem. Eng. 2010, 41, 623–635.
(286) Salmaso, Veronica; Moro, Stefano. Bridging Molecular Docking to Molecular Dynamics
in Exploring Ligand-Protein Recognition Process: An Overview. Front. Pharmacol. 2018,
9, 923.
(287) Meng, Xuan-Yu; Zhang, Hong-Xing; Mezei, Mihaly; Cui, Meng. Molecular Docking: A
Powerful Approach for Structure-Based Drug Discovery. Curr. Comput. Aided. Drug Des.
2011, 7, 146–157.
(288) Śledź, Paweł. Protein Structure-Based Drug Design: From Docking to Molecular
Dynamics. Curr. Opin. Struct. Biol. 2018, 48, 93–102.
(289) Repasky, Matthew P.; Shelley, Mee; Friesner, Richard A. Flexible Ligand Docking with
Glide. In Current Protocols in Bioinformatics; John Wiley & Sons, Inc.: Hoboken, NJ, USA,
2007; p UNIT 8.12.
(290) Surgand, Jean Sebastien; Rodrigo, Jordi; Kellenberger, Esther; Rognan, Didier. A
Chemogenomic Analysis of the Transmembrane Binding Cavity of Human G-Protein-
Coupled Receptors. Proteins Struct. Funct. Genet. 2006, 62, 509–538.
(291) Kratochwil, Nicole A.; Malherbe, Pari; Lindemann, Lothar; Ebeling, Martin; Hoener,
Marius C.; Mühlemann, Andreas; Porter, Richard H. P.; Stahl, Martin; Gerber, Paul R. An
Automated System for the Analysis of G Protein-Coupled Receptor Transmembrane
Binding Pockets: Alignment, Receptor-Based Pharmacophores, and Their Application. J.
Chem. Inf. Model. 2005, 45, 1324–1336.
(292) Frimurer, Thomas M.; Ulven, Trond; Elling, Christian E.; Gerlach, Lars-Ole; Kostenis, Evi;
Högberg, Thomas. A Physicogenetic Method to Assign Ligand-Binding Relationships
between 7TM Receptors. Bioorg. Med. Chem. Lett. 2005, 15, 3707–3712.
(293) Ehrt, Christiane; Brinkjost, Tobias; Koch, Oliver. Impact of Binding Site Comparisons on
Medicinal Chemistry and Rational Molecular Design. J. Med. Chem. 2016, 59, 4121–4151.
(294) Baroni, Massimo; Cruciani, Gabriele; Sciabola, Simone; Perruccio, Francesca; Mason,
Jonathan S. A Common Reference Framework for Analyzing/Comparing Proteins and
Ligands. Fingerprints for Ligands And Proteins (FLAP): Theory and Application. J. Chem.
Inf. Model. 2007, 47, 279–294.
(295) Totrov, Maxim. Ligand Binding Site Superposition and Comparison Based on Atomic
Property Fields: Identification of Distant Homologues, Convergent Evolution and PDB-
Wide Clustering of Binding Sites. BMC Bioinformatics 2011, 12, S35.
(296) Konc, Janez; Janezic, Dusanka. ProBiS Algorithm for Detection of Structurally Similar
Protein Binding Sites by Local Structural Alignment. Bioinformatics 2010, 26, 1160–1168.
(297) Das, Sourav; Kokardekar, Arshad; Breneman, Curt M. Rapid Comparison of Protein
Binding Site Surfaces with Property Encoded Shape Distributions. J. Chem. Inf. Model.
2009, 49, 2863–2872.
(298) Li, Gong-Hua; Huang, Jing-Fei. CMASA: An Accurate Algorithm for Detecting Local
Protein Structural Similarity and Its Application to Enzyme Catalytic Site Annotation.
BMC Bioinformatics 2010, 11, 439.
(299) Sheinerman, Felix B.; Giraud, Elie; Laoui, Abdelazize. High Affinity Targets of Protein
Kinase Inhibitors Have Similar Residues at the Positions Energetically Important for
Binding. J. Mol. Biol. 2005, 352, 1134–1156.
(300) Subramanian, Govindan; Sud, Manish. Computational Modeling of Kinase Inhibitor
Selectivity. ACS Med. Chem. Lett. 2010, 1, 395–399.
(301) Kontoyianni, Maria. Docking and Virtual Screening in Drug Discovery. In Methods in
Molecular Biology; 2017; Vol. 1647, pp 255–266.
(302) Uba, Abdullahi Ibrahim; Yelekçi, Kemal. Carboxylic Acid Derivatives Display Potential
Selectivity for Human Histone Deacetylase 6: Structure-Based Virtual Screening,
Molecular Docking and Dynamics Simulation Studies. Comput. Biol. Chem. 2018, 75, 131–
142.
(303) Jasper, Julia B.; Humbeck, Lina; Brinkjost, Tobias; Koch, Oliver. A Novel Interaction
Fingerprint Derived from per Atom Score Contributions: Exhaustive Evaluation of
Interaction Fingerprint Performance in Docking Based Virtual Screening. J. Cheminform.
2018, 10, 15.
(304) Weaver, Shane; Gleeson, M. Paul. The Importance of the Domain of Applicability in
QSAR Modeling. J. Mol. Graph. Model. 2008, 26, 1315–1326.
(305) Van Westen, Gerard J. P.; Wegner, Jörg K.; Ijzerman, Adriaan P.; Van Vlijmen, Herman
W. T.; Bender, A. Proteochemometric Modeling as a Tool to Design Selective Compounds
and for Extrapolating to Novel Targets. Med. Chem. Commun. 2011, 2, 16–30.
(306) Murrell, Daniel S.; Cortes-Ciriano, Isidro; Van Westen, Gerard J. P.; Stott, Ian P.; Bender,
Andreas; Malliavin, Thérèse E.; Glen, Robert C. Chemically Aware Model Builder (Camb):
An R Package for Property and Bioactivity Modelling of Small Molecules. J. Cheminform.
2015, 7, 45–55.
(307) Tropsha, Alexander; Golbraikh, Alexander. Predictive QSAR Modeling Workflow, Model
Applicability Domains, and Virtual Screening. Curr. Pharm. Des. 2007, 13, 3494–3504.
(308) Gao, Jun; Huang, Qi; Wu, Dingfeng; Zhang, Qingchen; Zhang, Yida; Chen, Tian; Liu, Qi;
Zhu, Ruixin; Cao, Zhiwei; He, Yuan. Study on Human GPCR–inhibitor Interactions by
Proteochemometric Modeling. Gene 2013, 518, 124–131.
(309) Subramanian, Vigneshwari; Prusis, Peteris; Xhaard, Henri; Wohlfahrt, Gerd. Predictive
Proteochemometric Models for Kinases Derived from 3D Protein Field-Based
Descriptors. Med. Chem. Commun. 2016, 7, 1007–1015.
(310) Lapins, Maris; Eklund, Martin; Spjuth, Ola; Prusis, Peteris; Wikberg, Jarl E. S.
Proteochemometric Modeling of HIV Protease Susceptibility. BMC Bioinformatics 2008,
9, 181.
(311) Lapins, Maris; Worachartcheewan, Apilak; Spjuth, Ola; Georgiev, Valentin;
Prachayasittikul, Virapong; Nantasenamat, Chanin; Wikberg, Jarl E. S. A Unified
Proteochemometric Model for Prediction of Inhibition of Cytochrome P450 Isoforms.
PLoS One 2013, 8, e66566.
(312) Tresadern, Gary; Trabanco, Andres A.; Pérez-Benito, Laura; Overington, John P.; van
Vlijmen, Herman W. T.; van Westen, Gerard J. P. Identification of Allosteric Modulators
of Metabotropic Glutamate 7 Receptor Using Proteochemometric Modeling. J. Chem. Inf.
Model. 2017, 57, 2976–2985.
(313) Qiu, Tianyi; Wu, Dingfeng; Qiu, Jingxuan; Cao, Zhiwei. Finding the Molecular Scaffold
of Nuclear Receptor Inhibitors through High-Throughput Screening Based on
Proteochemometric Modelling. J. Cheminform. 2018, 10, 21.
(314) Wan, Shunzhou; Bhati, Agastya P.; Zasada, Stefan J.; Wall, Ian; Green, Darren;
Bamborough, Paul; Coveney, Peter V. Rapid and Reliable Binding Affinity Prediction of
Bromodomain Inhibitors: A Computational Study. J. Chem. Theory Comput. 2017, 13,
784–795.
(315) Aldeghi, Matteo; Heifetz, Alexander; Bodkin, Michael J.; Knapp, Stefan; Biggin, Philip C.
Predictions of Ligand Selectivity from Absolute Binding Free Energy Calculations. J. Am.
Chem. Soc. 2017, 139, 946–957.
(316) García-Jacas, C. R.; Martinez-Mayorga, K.; Marrero-Ponce, Y.; Medina-Franco, J. L.
Conformation-Dependent QSAR Approach for the Prediction of Inhibitory Activity of
Bromodomain Modulators. SAR QSAR Environ. Res. 2017, 28, 41–58.
(317) Batiste, Laurent; Unzue, Andrea; Dolbois, Aymeric; Hassler, Fabrice; Wang, Xuan;
Deerain, Nicholas; Zhu, Jian; Spiliotopoulos, Dimitrios; Nevado, Cristina; Caflisch,
Amedeo. Chemical Space Expansion of Bromodomain Ligands Guided by in Silico Virtual
Couplings (AutoCouple). ACS Cent. Sci. 2018, 4, 180–188.
(318) Vidler, Lewis R.; Brown, Nathan; Knapp, Stefan; Hoelder, Swen. Druggability
Analysis and Structural Classification of Bromodomain Acetyl-Lysine Binding Sites. J.
Med. Chem. 2012, 55, 7346–7359.
(319) Zhang, Xiaoxiao; Chen, Kai; Wu, Yun-Dong; Wiest, Olaf. Protein Dynamics and
Structural Waters in Bromodomains. PLoS One 2017, 12, e0186570.
(320) Zhao, Linlin; Zhu, Hao. Big Data in Computational Toxicology: Challenges and
Opportunities. In Computational Toxicology; John Wiley & Sons, Inc.: Hoboken, NJ, USA,
2018; pp 291–312.
(321) Jeong, Jaeseong; Choi, Jinhee. Use of Adverse Outcome Pathways in Chemical Toxicity
Testing: Potential Advantages and Limitations. Environ. Health Toxicol. 2017, 33,
e2018002.
(322) Clark, Matthew; Steger-Hartmann, Thomas. A Big Data Approach to the Concordance of
the Toxicity of Pharmaceuticals in Animals and Humans. Regul. Toxicol. Pharmacol.
2018, 96, 94–105.
(323) Nishida, Minoru; Takashima, Yoshiharu; Ogino, Yamato; Yoneta, Yasuo; Nakamura,
Kazuichi; Fujiyoshi, Masato; Kodaira, Hiroshi; Hizue, Masanori; Hisada, Shigeru;
Nagayama, Takashi; Hashiba, Masamichi; Ohkura, Takako; Suzuki, Kazuhiko; Yasugi,
Daisaku; Tamaki, Chihiro. Potentials and Limitations of Nonclinical Safety Assessment
for Predicting Clinical Adverse Drug Reactions: Correlation Analysis of 142 Approved
Drugs in Japan. J. Toxicol. Sci. 2013, 38, 581–598.
(324) Bugelski, Peter J.; Martin, Pauline L. Concordance of Preclinical and Clinical
Pharmacology and Toxicology of Therapeutic Monoclonal Antibodies and Fusion
Proteins: Cell Surface Targets. Br. J. Pharmacol. 2012, 166, 823–846.
(325) Bailey, Jarrod; Thew, Michelle; Balls, Michael. An Analysis of the Use of Animal Models
in Predicting Human Toxicology and Drug Safety. Altern. Lab. Anim. 2014, 42, 181–199.
(326) Monticello, Thomas M. Drug Development and Nonclinical to Clinical
Translational Databases: Past and Current Efforts. Toxicol. Pathol. 2015, 43, 57–61.
(327) Shanks, Niall; Greek, Ray; Greek, Jean. Are Animal Models Predictive for Humans? Philos.
Ethics, Humanit. Med. 2009, 4, 2.
(328) Voisin, Emmanuelle M.; Ruthsatz, Manfred; Collins, Jerry M.; Hoyle, Peter C.
Extrapolation of Animal Toxicity to Humans: Interspecies Comparisons in Drug
Development. Regul. Toxicol. Pharmacol. 1990, 12, 107–116.
(329) Papoian, Thomas; Chiu, Haw-Jyh; Elayan, Ikram; Jagadeesh, Gowraganahalli; Khan,
Imran; Laniyonu, Adebayo A.; Li, Cindy Xinguang; Saulnier, Muriel; Simpson, Natalie;
Yang, Baichun. Secondary Pharmacology Data to Assess Potential Off-Target Activity of
New Drugs: A Regulatory Perspective. Nat. Rev. Drug Discov. 2015, 14, 294.
(330) Bento, A. Patrícia; Gaulton, Anna; Hersey, Anne; Bellis, Louisa J.; Chambers, Jon; Davies,
Mark; Krüger, Felix A.; Light, Yvonne; Mak, Lora; McGlinchey, Shaun; Nowotka, Michal;
Papadatos, George; Santos, Rita; Overington, John P. The ChEMBL Bioactivity Database:
An Update. Nucleic Acids Res. 2014, 42, D1083–D1090.
(331) Wang, Yanli; Suzek, Tugba; Zhang, Jian; Wang, Jiyao; He, Siqian; Cheng, Tiejun;
Shoemaker, Benjamin A.; Gindulyte, Asta; Bryant, Stephen H. PubChem BioAssay: 2014
Update. Nucleic Acids Res. 2014, 42, D1075–D1082.
(332) Meslamani, Jamel; Smith, Steven G.; Sanchez, Roberto; Zhou, Ming-Ming.
ChEpiMod: A Knowledgebase for Chemical Modulators of Epigenome Reader Domains.
Bioinformatics 2014, 30, 1481–1483.
(333) Jagarlapudi, Sarma A. R. P.; Kishan, K. V. Radha. Database Systems for Knowledge-Based
Discovery. In Methods in molecular biology (Clifton, N.J.); Jacoby E., Ed.; Humana Press:
Totowa, NJ, 2009; Vol. 575, pp 159–172.
(334) Kharenko, Olesya A.; Gesner, Emily M.; Patel, Reena G.; Norek, Karen; White, Andre;
Fontano, Eric; Suto, Robert K.; Young, Peter R.; McLure, Kevin G.; Hansen, Henrik C.
RVX-297- a Novel BD2 Selective Inhibitor of BET Bromodomains. Biochem. Biophys. Res.
Commun. 2016, 477, 62–67.
(335) Clark, Peter G. K.; Dixon, Darren J.; Brennan, Paul E. Development of Chemical Probes
for the Bromodomains of BRD7 and BRD9. Drug Discov. Today Technol. 2016, 19, 73–80.
(336) Guetzoyan, Lucie; Ingham, Richard J.; Nikbin, Nikzad; Rossignol, Julien; Wolling,
Michael; Baumert, Mark; Burgess-Brown, Nicola A.; Strain-Damerell, Claire M.; Shrestha,
Leela; Brennan, Paul E.; Fedorov, Oleg; Knapp, Stefan; Ley, Steven V. Machine-Assisted
Synthesis of Modulators of the Histone Reader BRD9 Using Flow Methods of Chemistry
and Frontal Affinity Chromatography. Med. Chem. Commun. 2014, 5, 540–546.
(337) Hay, Duncan A.; Fedorov, Oleg; Martin, Sarah; Singleton, Dean C.; Tallant, Cynthia;
Wells, Christopher; Picaud, Sarah; Philpott, Martin; Monteiro, Octovia P.; Rogers,
Catherine M.; Conway, Stuart J.; Rooney, Timothy P. C.; Tumber, Anthony; Yapp,
Clarence; Filippakopoulos, Panagis; Bunnage, Mark E.; Müller, Susanne; Knapp, Stefan;
Schofield, Christopher J.; et al. Discovery and Optimization of Small-Molecule Ligands
for the CBP/P300 Bromodomains. J. Am. Chem. Soc. 2014, 136, 9308–9319.
(338) Gao, Nana; Ren, Jixia; Hou, Li; Zhou, Yue; Xin, Ling; Wang, Jiedong; Yu, Heming; Xie,
Yong; Wang, Huiping. Identification of Novel Potent Human Testis-Specific and
Bromodomain-Containing Protein (BRDT) Inhibitors Using Crystal Structure-Based
Virtual Screening. Int. J. Mol. Med. 2016, 38, 39–44.
(339) Magrane, Michele; UniProt Consortium. UniProt Knowledgebase: A Hub of Integrated
Protein Data. Database 2011, 2011, bar009.
(340) PubChem Identifier Exchange Service
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi (accessed Feb 8, 2018).
(341) Indigo Toolkit http://lifescience.opensource.epam.com/ (accessed Feb 8, 2018).
(342) Burlingham, Benjamin T.; Widlanski, Theodore S. An Intuitive Look at the Relationship
of Ki and IC50: A More General Use for the Dixon Plot. J. Chem. Educ. 2003, 80, 214–218.
(343) Quinn, Elizabeth; Wodicka, Lisa; Ciceri, Pietro; Pallares, Gabriel; Pickle, Elyssa; Torrey,
Adam; Floyd, Mark; Hunt, Jeremy; Treiber, Daniel. Abstract 4238: BROMO Scan - a High
Throughput, Quantitative Ligand Binding Platform Identifies Best-in-Class
Bromodomain Inhibitors from a Screen of Mature Compounds Targeting Other Protein
Classes. Cancer Res. 2013, 73, 4238–4238.
(344) Eduati, Federica; Mangravite, Lara M.; Wang, Tao; Tang, Hao; Bare, J. Christopher;
Huang, Ruili; Norman, Thea; Kellen, Mike; Menden, Michael P.; Yang, Jichen; Zhan,
Xiaowei; Zhong, Rui; Xiao, Guanghua; Xia, Menghang; Abdo, Nour; Kosyk, Oksana;
Friend, Stephen; Dearry, Allen; Simeonov, Anton; et al. Prediction of Human Population
Responses to Toxic Compounds by a Collaborative Competition. Nat. Biotechnol. 2015,
33, 933–940.
(345) Ain, Qurrat U.; Méndez-Lucio, Oscar; Cortés-Ciriano, Isidro; Malliavin, Thérèse; van
Westen, Gerard J. P.; Bender, Andreas. Modelling Ligand Selectivity of Serine Proteases
Using Integrative Proteochemometric Approaches Improves Model Performance and
Allows the Multi-Target Dependent Interpretation of Features. Integr. Biol. 2014, 6,
1023–1033.
(346) Koutsoukas, Alexios; Lowe, Robert; KalantarMotamedi, Yasaman; Mussa, Hamse Y.;
Klaffke, Werner; Mitchell, John B. O.; Glen, Robert C.; Bender, Andreas. In Silico Target
Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the
Multiclass Naïve Bayes and Parzen-Rosenblatt Window. J. Chem. Inf. Model. 2013, 53,
1957–1966.
(347) Charif, Delphine; Lobry, Jean R. SeqinR 1.0-2: A Contributed Package to the R Project for
Statistical Computing Devoted to Biological Sequences Retrieval and Analysis; Springer,
Berlin, Heidelberg, 2007; pp 207–232.
(348) Paradis, E.; Claude, J.; Strimmer, K. APE: Analyses of Phylogenetics and Evolution in R
Language. Bioinformatics 2004, 20, 289–290.
(349) Sander, Thomas; Freyss, Joel; von Korff, Modest; Rufener, Christian. DataWarrior: An
Open-Source Program for Chemistry Aware Data Visualization and Analysis. J. Chem. Inf.
Model. 2015, 55, 460–473.
(350) Bemis, Guy W.; Murcko, Mark A. The Properties of Known Drugs. 1. Molecular
Frameworks. J. Med. Chem. 1996, 39, 2887–2893.
(351) Berthold, Michael R.; Cebron, Nicolas; Dill, Fabian; Gabriel, Thomas R.; Kötter, Tobias;
Meinl, Thorsten; Ohl, Peter; Thiel, Kilian; Wiswedel, Bernd. KNIME - the Konstanz
Information Miner. In ACM SIGKDD Explorations Newsletter; Preisach C., Burkhardt H.,
Schmidt-Thieme L., Decker R., Ed.; Springer: Berlin, Heidelberg, 2009; Vol. 11, p 26.
(352) Ahlberg, Christopher. Spotfire: An Information Exploration Environment. ACM SIGMOD
Rec. 1996, 25, 25–29.
(353) RDKit: Open-source cheminformatics http://www.rdkit.org (accessed Mar 19, 2018).
(354) Schwab, Christof H. Conformations and 3D Pharmacophore Searching. Drug Discov.
Today Technol. 2010, 7, e245–e253.
(355) Molecular Operating Environment (MOE), 2013.08; Chemical Computing Group Inc., 1010
Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2016.
(356) Kuhn, Max; contributions from Wing, Jed; Weston, Steve; Williams, Andre. caret:
Classification and Regression Training. R package version 5.15-044.
http://cran.r-project.org/package=caret (accessed Feb 8, 2018).
(357) Rücker, Christoph; Rücker, Gerta; Meringer, Markus. Y-Randomization and Its Variants
in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345–2357.
(358) Nissink, J. Willem M.; Blackburn, Sam. Quantification of Frequent-Hitter Behavior Based
on Historical High-Throughput Screening Data. Future Med. Chem. 2014, 6, 1113–1126.
(359) Cumming, John G.; Davis, Andrew M.; Muresan, Sorel; Haeberlein, Markus; Chen,
Hongming. Chemical Predictive Modelling to Improve Compound Quality. Nat. Rev.
Drug Discov. 2013, 12, 948–962.
(360) Godden, Jeffrey W.; Xue, Ling; Bajorath, Jürgen. Combinatorial Preferences Affect
Molecular Similarity/Diversity Calculations Using Binary Fingerprints and Tanimoto
Coefficients. J. Chem. Inf. Comput. Sci. 2000, 40, 163–166.
(361) Wickham, Hadley; Francois, Romain. A Grammar of Data Manipulation https://cran.r-
project.org/web/packages/dplyr/index.html (accessed Feb 8, 2018).
(362) Niesen, Frank H.; Berglund, Helena; Vedadi, Masoud. The Use of Differential Scanning
Fluorimetry to Detect Ligand Interactions That Promote Protein Stability. Nat. Protoc.
2007, 2, 2212–2221.
(363) Igoe, Niall; Bayle, Elliott D.; Tallant, Cynthia; Fedorov, Oleg; Meier, Julia C.; Savitsky,
Pavel; Rogers, Catherine; Morias, Yannick; Scholze, Sarah; Boyd, Helen; Cunoosamy,
Danen; Andrews, David M.; Cheasty, Anne; Brennan, Paul E.; Müller, Susanne; Knapp,
Stefan; Fish, Paul V. Design of a Chemical Probe for the Bromodomain and Plant
Homeodomain Finger-Containing (BRPF) Family of Proteins. J. Med. Chem. 2017, 60,
6998–7011.
(364) De Bruyn, Tom; van Westen, Gerard J. P.; Ijzerman, Adriaan P.; Stieger, Bruno; de Witte,
Peter; Augustijns, Patrick F.; Annaert, Pieter P. Structure-Based Identification of
OATP1B1/3 Inhibitors. Mol. Pharmacol. 2013, 83, 1257–1267.
(365) Cortés-Ciriano, Isidro; Van Westen, Gerard J. P.; Bouvier, Guillaume; Nilges, Michael;
Overington, John P.; Bender, Andreas; Malliavin, Thérèse E. Improved Large-Scale
Prediction of Growth Inhibition Patterns Using the NCI60 Cancer Cell Line Panel.
Bioinformatics 2015, 32, 85–95.
(366) Bender, Andreas; Glen, Robert C. Molecular Similarity: A Key Technique in Molecular
Informatics. Org. Biomol. Chem. 2004, 2, 3204.
(367) Giblin, Kathryn A.; Hughes, Samantha J.; Boyd, Helen; Hansson, Pia; Bender, Andreas.
Prospectively Validated Proteochemometric Models for the Prediction of Small-Molecule
Binding to Bromodomain Proteins. J. Chem. Inf. Model. 2018, 58, 1870–1888.
(368) Louppe, Gilles; Wehenkel, Louis; Sutera, Antonio; Geurts, Pierre. Understanding
Variable Importances in Forests of Randomized Trees. Neural Inf. Process. Syst. 2013, 1–
9.
(369) Menze, Bjoern H.; Kelm, B. Michael; Masuch, Ralf; Himmelreich, Uwe; Bachert, Peter;
Petrich, Wolfgang; Hamprecht, Fred A. A Comparison of Random Forest and Its Gini
Importance with Standard Chemometric Methods for the Feature Selection and
Classification of Spectral Data. BMC Bioinformatics 2009, 10, 213.
(370) Chung, Chun-wa; Coste, Hervé; White, Julia H.; Mirguet, Olivier; Wilde, Jonathan;
Gosmini, Romain L.; Delves, Chris; Magny, Sylvie M.; Woodward, Robert; Hughes,
Stephen A.; Boursier, Eric V.; Flynn, Helen; Bouillot, Anne M.; Bamborough, Paul; Brusq,
Jean-Marie G.; Gellibert, Françoise J.; Jones, Emma J.; Riou, Alizon M.; Homes, Paul; et
al. Discovery and Characterization of Small Molecule Inhibitors of the BET Family
Bromodomains. J. Med. Chem. 2011, 54, 3827–3838.
(371) Valero-Mora, Pedro M. Ggplot2: Elegant Graphics for Data Analysis. J. Stat. Softw. 2015,
35, 212.
(372) Kabsch, Wolfgang. XDS. Acta Crystallogr. Sect. D Biol. Crystallogr. 2010, 66, 125–
132.
(373) Evans, Philip R.; Murshudov, Garib N. How Good Are My Data and What Is the
Resolution? Acta Crystallogr. Sect. D Biol. Crystallogr. 2013, 69, 1204–1214.
(374) Winn, Martyn D.; Ballard, Charles C.; Cowtan, Kevin D.; Dodson, Eleanor J.; Emsley, Paul;
Evans, Phil R.; Keegan, Ronan M.; Krissinel, Eugene B.; Leslie, Andrew G. W.; McCoy,
Airlie; McNicholas, Stuart J.; Murshudov, Garib N.; Pannu, Navraj S.; Potterton, Elizabeth
A.; Powell, Harold R.; Read, Randy J.; Vagin, Alexei; Wilson, Keith S. Overview of the
CCP4 Suite and Current Developments. Acta Crystallogr. D. Biol. Crystallogr. 2011, 67,
235–242.
(375) McCoy, Airlie J.; Grosse-Kunstleve, Ralf W.; Adams, Paul D.; Winn, Martyn D.; Storoni,
Laurent C.; Read, Randy J. Phaser Crystallographic Software. J. Appl. Crystallogr.
2007, 40, 658–674.
(376) Bricogne, G.; Blanc, E.; Brandl, M.; Flensburg, C.; Keller, P.; Paciorek, W.; Roversi, P.;
Sharff, A.; Smart, O. S.; Vonrhein, C.; Womack, T. O. Welcome to Global Phasing Limited, 2017.
http://www.globalphasing.com./?_sm_au_=i0VZntN5H3PkNsP6 (accessed Feb 18,
2019).
(377) Emsley, P.; Lohkamp, B.; Scott, W. G.; Cowtan, K. Features and Development of Coot.
Acta Crystallogr. Sect. D Biol. Crystallogr. 2010, 66, 486–501.
(378) Smart, O. S.; Womack, T. O.; Sharff, A.; Flensburg, C.; Keller, P.; Paciorek, W.; Vonrhein,
C.; Bricogne, G. Grade, Version 1.2.9; Global Phasing Ltd: Cambridge, United Kingdom, 2011.
(379) Bharatham, Nagakumar; Slavish, Peter J.; Shadrick, William R.; Young, Brandon M.;
Shelat, Anang A. The Role of ZA Channel Water-Mediated Interactions in the Design of
Bromodomain-Selective BET Inhibitors. J. Mol. Graph. Model. 2018, 81, 197–210.
(380) Simeon, Saw; Anuwongcharoen, Nuttapat; Shoombuatong, Watshara; Malik, Aijaz
Ahmad; Prachayasittikul, Virapong; Wikberg, Jarl E. S.; Nantasenamat, Chanin. Probing
the Origins of Human Acetylcholinesterase Inhibition via QSAR Modeling and Molecular
Docking. PeerJ 2016, 4, e2322.
(381) Marchese Robinson, Richard L.; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan.
Comparison of the Predictive Performance and Interpretability of Random Forest and
Linear Models on Benchmark Data Sets. J. Chem. Inf. Model. 2017, 57, 1773–1792.
(382) Zeng, Lei; Zhang, Qiang; Gerona-Navarro, Guillermo; Moshkina, Natalia; Zhou, Ming-
Ming. Structural Basis of Site-Specific Histone Recognition by the Bromodomains of
Human Coactivators PCAF and CBP/P300. Structure 2008, 16, 643–652.
(383) Jennings, Laura E.; Schiedel, Matthias; Hewings, David S.; Picaud, Sarah; Laurin,
Corentine M. C.; Bruno, Paul A.; Bluck, Joseph P.; Scorah, Amy R.; See, Larissa; Reynolds,
Jessica K.; Moroglu, Mustafa; Mistry, Ishna N.; Hicks, Amy; Guzanov, Pavel; Clayton,
James; Evans, Charles N. G.; Stazi, Giulia; Biggin, Philip C.; Mapp, Anna K.; et al. BET
Bromodomain Ligands: Probing the WPF Shelf to Improve BRD4 Bromodomain Affinity
and Metabolic Stability. Bioorg. Med. Chem. 2018, 26, 2937–2957.
(384) Liu, Zhiqing; Wang, Pingyuan; Chen, Haiying; Wold, Eric A.; Tian, Bing; Brasier, Allan
R.; Zhou, Jia. Drug Discovery Targeting Bromodomain-Containing Protein 4. J. Med.
Chem. 2017, 60, 4533–4558.
(385) Demont, Emmanuel H.; Chung, Chun-wa; Furze, Rebecca C.; Grandi, Paola; Michon,
Anne-Marie; Wellaway, Chris; Barrett, Nathalie; Bridges, Angela M.; Craggs, Peter D.;
Diallo, Hawa; Dixon, David P.; Douault, Clement; Emmons, Amanda J.; Jones, Emma J.;
Karamshi, Bhumika V.; Locke, Kelly; Mitchell, Darren J.; Mouzon, Bernadette H.; Prinjha,
Rab K.; et al. Fragment-Based Discovery of Low-Micromolar ATAD2 Bromodomain
Inhibitors. J. Med. Chem. 2015, 58, 5649–5673.
(386) Quinn, John Frederick; Duffy, Bryan Cordell; Liu, Shuang; Wang, Ruifang;
Jiang, May Xiaowu; Martin, Gregory Scott; Wagner, Gregory Steven;
Young, Peter Ronald. Novel Bicyclic Bromodomain Inhibitors, 2017.
(387) Hewings, David S.; Wang, Minghua; Philpott, Martin; Fedorov, Oleg; Uttarkar, Sagar;
Filippakopoulos, Panagis; Picaud, Sarah; Vuppusetty, Chaitanya; Marsden, Brian; Knapp,
Stefan; Conway, Stuart J.; Heightman, Tom D. 3,5-Dimethylisoxazoles Act As Acetyl-
Lysine-Mimetic Bromodomain Ligands. J. Med. Chem. 2011, 54, 6761–6770.
(388) Xue, Xiaoqian; Zhang, Yan; Wang, Chao; Zhang, Maofeng; Xiang, Qiuping; Wang,
Junjian; Wang, Anhui; Li, Chenchang; Zhang, Cheng; Zou, Lingjiao; Wang, Rui; Wu,
Shuang; Lu, Yongzhi; Chen, Hongwu; Ding, Ke; Li, Guohui; Xu, Yong. Benzoxazinone-
Containing 3,5-Dimethylisoxazole Derivatives as BET Bromodomain Inhibitors for
Treatment of Castration-Resistant Prostate Cancer. Eur. J. Med. Chem. 2018, 152, 542–
559.
(389) Hay, Duncan; Fedorov, Oleg; Filippakopoulos, Panagis; Martin, Sarah; Philpott, Martin;
Picaud, Sarah; Hewings, David S.; Uttakar, Sagar; Heightman, Tom D.; Conway, Stuart J.;
Knapp, Stefan; Brennan, Paul E. The Design and Synthesis of 5- and 6-
Isoxazolylbenzimidazoles as Selective Inhibitors of the BET Bromodomains. Med. Chem.
Commun. 2013, 4, 140–144.
(390) Reed Elsevier Properties SA. PharmaPendium https://pharmapendium.com/ (accessed
Jun 4, 2018).
(391) Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion,
Bertrand; Grisel, Olivier; Blondel, Mathieu; Louppe, Gilles; Prettenhofer, Peter; Weiss,
Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David;
Brucher, Matthieu; Perrot, Matthieu; Duchesnay, Edouard. Scikit-Learn:
Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
(392) Li, Wentian. Mutual Information Functions Versus Correlation Functions in Binary
Sequences. J. Stat. Phys. 2012, 60, 249–252.
(393) Jones, Eric; Oliphant, Travis; Peterson, Pearu. SciPy: Open source scientific tools for
Python http://www.scipy.org/ (accessed Jun 4, 2018).
(394) Grimes, David A.; Schulz, Kenneth F. Refining Clinical Diagnosis with Likelihood Ratios.
Lancet 2005, 365, 1500–1505.
(395) Fruchterman, Thomas M. J.; Reingold, Edward M. Graph Drawing by Force-Directed
Placement. Software-Practice Exp. 1991, 21, 1129–1164.
(396) Imai, Hiroshi; Asano, Takao. Finding the Connected Components and a Maximum Clique
of an Intersection Graph of Rectangles in the Plane. J. Algorithms 1983, 4, 310–323.
(397) Fortunato, Santo. Community Detection in Graphs. Phys. Rep. 2010, 486, 75–174.
(398) Sharma, Hari S.; Menon, Preeti; Lafuente, José Vicente; Muresanu, Dafin F.; Tian, Z. Ryan;
Patnaik, Ranjana; Sharma, Aruna. Development of in Vivo Drug-Induced Neurotoxicity
Models. Expert Opin. Drug Metab. Toxicol. 2014, 10, 1637–1661.
(399) Roberts, Ruth A.; Aschner, Michael; Calligaro, David; Guilarte, Tomas R.; Hanig, Joseph
P.; Herr, David W.; Hudzik, Thomas J.; Jeromin, Andreas; Kallman, Mary J.; Liachenko,
Serguei; Lynch, James J.; Miller, Diane B.; Moser, Virginia C.; O’Callaghan, James P.;
Slikker, William; Paule, Merle G. Translational Biomarkers of Neurotoxicity: A Health
and Environmental Sciences Institute Perspective on the Way Forward. Toxicol. Sci. 2015,
148, 332–340.
(400) O’Donoghue, John L. Clinical Neurologic Indices of Toxicity in Animals. Environ. Health
Perspect. 1996, 104, 323–330.
(401) Süleyman, Halis; Demircan, Berna; Karagöz, Yalçin. Anti-Inflammatory and Side Effects
of Cyclooxygenase Inhibitors. Pharmacol. Reports 2007, 59, 247–258.
(402) Bach, P. H.; Nguyen, T. K. Thanh. Renal Papillary Necrosis 40 Years On. Toxicol.
Pathol. 1998, 26, 73–91.
(403) Nicolaides, Nicolas C.; Pavlaki, Aikaterini N.; Magiakou, Maria Alexandra;
Chrousos, George P. Glucocorticoid Therapy and Adrenal Suppression. In Endotext;
Feingold, Kenneth R., Ed.; MDText.com, Inc.: South Dartmouth (MA), 2018.
(404) Inomata, Akira; Sasano, Hironobu. Practical Approaches for Evaluating Adrenal Toxicity
in Nonclinical Safety Assessment. J. Toxicol. Pathol. 2015, 28, 125–132.
(405) Schimpf, Rainer; Borggrefe, Martin; Wolpert, Christian. Clinical and Molecular Genetics
of the Short QT Syndrome. Curr. Opin. Cardiol. 2008, 23, 192–198.
(406) Bjerregaard, Preben. Diagnosis and Management of Short QT Syndrome. Heart Rhythm
2018, 15, 1261–1267.
(407) Holbrook, Mark; Malik, Marek; Shah, Rashmi R.; Valentin, Jean-Pierre. Drug Induced
Shortening of the QT/QTc Interval: An Emerging Safety Issue Warranting Further
Modelling and Evaluation in Drug Research and Development? J. Pharmacol. Toxicol.
Methods 2009, 59, 21–28.
(408) Visentin, Michele; Lenggenhager, Daniela; Gai, Zhibo; Kullak-Ublick, Gerd A. Drug-
Induced Bile Duct Injury. Biochim. Biophys. Acta - Mol. Basis Dis. 2018, 1864, 1498–1506.
(409) Brown, Carlton Gene. Testicular Cancer: An Overview. MedSurg Nurs. 2003, 12, 37–45.
(410) Park, Jun-Bean; Kang, Do-yoon; Yang, Han-Mo; Cho, Hyun-Jai; Park, Kyung Woo; Lee,
Hae-Young; Kang, Hyun-Jae; Koo, Bon-Kwon; Kim, Hyo-Soo. Serum Alkaline
Phosphatase Is a Predictor of Mortality, Myocardial Infarction, or Stent Thrombosis after
Implantation of Coronary Drug-Eluting Stent. Eur. Heart J. 2013, 34, 920–931.
(411) Sheen, Campbell R.; Kuss, Pia; Narisawa, Sonoko; Yadav, Manisha C.; Nigro, Jessica;
Wang, Wei; Chhea, T. Nicole; Sergienko, Eduard A.; Kapoor, Kapil; Jackson, Michael R.;
Hoylaerts, Marc F.; Pinkerton, Anthony B.; O’Neill, W. Charles; Millán, José Luis.
Pathophysiological Role of Vascular Smooth Muscle Alkaline Phosphatase in Medial
Artery Calcification. J. Bone Miner. Res. 2015, 30, 824–836.
(412) Dereure, Olivier. Drug-Induced Skin Pigmentation. Am. J. Clin. Dermatol. 2001, 2, 253–
262.
(413) Kuokkanen, Satu; Zhu, Liyin; Pollard, Jeffrey W. Xenografted Tissue Models for the Study
of Human Endometrial Biology. Differentiation 2017, 98, 62–69.
(414) van der Laan, Jan Willem; Chapin, Robert E.; Haenen, Bert; Jacobs, Abigail C.; Piersma,
Aldert. Testing Strategies for Embryo-Fetal Toxicity of Human Pharmaceuticals. Animal
Models vs. in Vitro Approaches. Regul. Toxicol. Pharmacol. 2012, 63, 115–123.
(415) Stokes, William S. Humane Endpoints for Laboratory Animals Used in Regulatory
Testing. ILAR J. 2002, 43, S31–S38.
(416) Wijemanne, Subhashie; Jankovic, Joseph. Movement Disorders in Catatonia. J. Neurol.
Neurosurg. Psychiatry 2015, 86, 825–832.
(417) Behari, Madhuri. Current Status of Dystonias Including Meige’s Syndrome. Neurol. India
2018, 66, 36–37.
(418) Shin, Hae-Won; Chung, Sun Ju. Drug-Induced Parkinsonism. J. Clin. Neurol. 2012, 8, 15–
21.
(419) Dabbous, Zeinab; Atkin, Stephen L. Hyperprolactinaemia in Male Infertility: Clinical
Case Scenarios. Arab J. Urol. 2018, 16, 44–52.
(420) Fitzgerald, Peter; Dinan, Timothy G. Prolactin and Dopamine: What Is the Connection?
A Review Article. J. Psychopharmacol. 2008, 22, 12–19.
(421) Tang, Jing; Tanoli, Zia ur Rehman; Ravikumar, Balaguru; Alam, Zaid; Rebane, Anni; Vähä-
Koskela, Markus; Peddinti, Gopal; van Adrichem, Arjan J.; Wakkinen, Janica; Jaiswal,
Alok; Karjalainen, Ella; Gautam, Prson; He, Liye; Parri, Elina; Khan, Suleiman; Gupta,
Abhishekh; Ali, Mehreen; Yetukuri, Laxman; Gustavsson, Anna Lena; et al. Drug Target
Commons: A Community Effort to Build a Consensus Knowledge Base for Drug-Target
Interactions. Cell Chem. Biol. 2018, 25, 224–229.e2.
(422) Siramshetty, Vishal B.; Eckert, Oliver Andreas; Gohlke, Björn-Oliver; Goede, Andrean;
Chen, Qiaofeng; Devarakonda, Prashanth; Preissner, Saskia; Preissner, Robert.
SuperDRUG2: A One Stop Resource for Approved/Marketed Drugs. Nucleic Acids Res.
2018, 46, D1137–D1143.
(423) Heller, Stephen R.; McNaught, Alan; Pletnev, Igor; Stein, Stephen; Tchekhovskoi,
Dmitrii. InChI, the IUPAC International Chemical Identifier. J. Cheminform. 2015, 7, 23.
(424) Maglott, Donna; Ostell, Jim; Pruitt, Kim D.; Tatusova, Tatiana. Entrez Gene: Gene-
Centered Information at NCBI. Nucleic Acids Res. 2011, 39, D52–D57.
(425) Toad for mysql - MariaDB Knowledge Base http://www.toadworld.com/products#mysql
(accessed Feb 4, 2019).
(426) Burdett, T.; Jupp, Simon; Malone, James; Williams, Eleanor; Keays, Maria; Parkinson,
Helen. Zooma2 - A repository of annotation
knowledge and curation API http://www.ebi.ac.uk/spot/zooma/index.html (accessed
Jun 4, 2018).
(427) Schriml, Lynn M.; Mitraka, Elvira; Munro, James; Tauber, Becky; Schor, Mike; Nickle,
Lance; Felix, Victor; Jeng, Linda; Bearer, Cynthia; Lichenstein, Richard; Bisordi,
Katharine; Campion, Nicole; Hyman, Brooke; Kurland, David; Oates, Connor Patrick;
Kibbey, Siobhan; Sreekumar, Poorna; Le, Chris; Giglio, Michelle; et al. Human Disease
Ontology 2018 Update: Classification, Content and Workflow Expansion. Nucleic Acids
Res. 2019, 47, D955–D962.
(428) Smith, Cynthia L.; Eppig, Janan T. The Mammalian Phenotype Ontology: Enabling
Robust Annotation and Comparative Analysis. Wiley Interdiscip. Rev. Syst. Biol. Med.
2009, 1, 390–399.
(429) Robinson, Peter N.; Köhler, Sebastian; Bauer, Sebastian; Seelow, Dominik; Horn, Denise;
Mundlos, Stefan. The Human Phenotype Ontology: A Tool for Annotating and Analyzing
Human Hereditary Disease. Am. J. Hum. Genet. 2008, 83, 610–615.
(430) Malone, James; Holloway, Ele; Adamusiak, Tomasz; Kapushesky, Misha; Zheng, Jie;
Kolesnikov, Nikolay; Zhukova, Anna; Brazma, Alvis; Parkinson, Helen. Modeling Sample
Variables with an Experimental Factor Ontology. Bioinformatics 2010, 26, 1112–1118.
(431) Vasant, Drashtti; Chanas, Laetitia; Malone, James; Hanauer, Marc; Olry, Annie; Jupp,
Simon; Robinson, Peter N.; Parkinson, Helen; Rath, Ana. ORDO: An Ontology
Connecting Rare Disease, Epidemiology and Genetic Data. In Phenotype data at
ISMB2014; 2014.
(432) Ceusters, Werner; Smith, B.; Goldberg, L. A Terminological and Ontological Analysis of
the NCI Thesaurus. Methods Inf. Med. 2005, 44, 498–507.
(433) Huang, Jingshan; Dang, Jiangbo; Borchert, Glen M.; Eilbeck, Karen; Zhang, He; Xiong,
Min; Jiang, Weijian; Wu, Hao; Blake, Judith A.; Natale, Darren A.; Tan, Ming. OMIT:
Dynamic, Semi-Automated Ontology Development for the MicroRNA Domain. PLoS
One 2014, 9, e100855.
(434) He, Yongqun; Sarntivijai, Sirarat; Lin, Yu; Xiang, Zuoshuang; Guo, Abra; Zhang, Shelley;
Jagannathan, Desikan; Toldo, Luca; Tao, Cui; Smith, Barry. OAE: The Ontology of
Adverse Events. J. Biomed. Semantics 2014, 5, 29.
(435) Mungall, Christopher J.; Koehler, Sebastian; Robinson, Peter; Holmes, Ian; Haendel,
Melissa. K-BOOM: A Bayesian Approach to Ontology Structure Inference, with
Applications in Disease Ontology Construction. bioRxiv 2019, 048843.
(436) Mohammed, Osama; Benlamri, Rachid; Fong, Simon. Building a Diseases Symptoms
Ontology for Medical Diagnosis: An Integrative Approach. In 1st International Conference
on Future Generation Communication Technologies, FGCT 2012; IEEE, 2012; pp 104–108.
(437) Ceusters, Werner; Smith, Barry. Foundations for a Realist Ontology of Mental Disease. J.
Biomed. Semantics 2010, 1, 10.
(438) Schofield, P. N.; Gruenberger, M.; Sundberg, John P. Pathbase and the MPATH Ontology:
Community Resources for Mouse Histopathology. Vet. Pathol. 2010, 47, 1016–1020.
(439) Dönitz, Jürgen; Wingender, Edgar. The Ontology-Based Answers (OBA) Service: A
Connector for Embedded Usage of Ontologies in Applications. Front. Genet. 2012, 3, 197.
(440) Visser, Ubbo; Abeyruwan, Saminda; Vempati, Uma; Smith, Robin P.; Lemmon, Vance;
Schürer, Stephan C. BioAssay Ontology (BAO): A Semantic Description of Bioassays and
High-Throughput Screening Results. BMC Bioinformatics 2011, 12, 257.
(441) Koscielny, Gautier; An, Peter; Carvalho-Silva, Denise; Cham, Jennifer A.; Fumis, Luca;
Gasparyan, Rippa; Hasan, Samiul; Karamanis, Nikiforos; Maguire, Michael; Papa, Eliseo;
Pierleoni, Andrea; Pignatelli, Miguel; Platt, Theo; Rowland, Francis; Wankar, Priyanka;
Bento, A. Patrícia; Burdett, Tony; Fabregat, Antonio; Forbes, Simon; et al. Open Targets:
A Platform for Therapeutic Target Identification and Validation. Nucleic Acids Res. 2017,
45, D985–D994.
(442) opentargets - Python client for targetvalidation.org — opentargets 2.0.0 documentation
https://opentargets.readthedocs.io/en/stable/ (accessed Oct 8, 2018).
(443) Zerbino, Daniel R.; Achuthan, Premanand; Akanni, Wasiu; Amode, M. Ridwan; Barrell,
Daniel; Bhai, Jyothish; Billis, Konstantinos; Cummins, Carla; Gall, Astrid; Girón, Carlos
García; Gil, Laurent; Gordon, Leo; Haggerty, Leanne; Haskell, Erin; Hourlier, Thibaut;
Izuogu, Osagie G.; Janacek, Sophie H.; Juettemann, Thomas; To, Jimmy Kiang; et al.
Ensembl 2018. Nucleic Acids Res. 2018, 46, D754–D761.
(444) Mungall, Christopher J.; McMurry, Julie A.; Köhler, Sebastian; Balhoff, James P.;
Borromeo, Charles; Brush, Matthew; Carbon, Seth; Conlin, Tom; Dunn, Nathan;
Engelstad, Mark; Foster, Erin; Gourdine, J. P.; Jacobsen, Julius O. B.; Keith, Dan; Laraway,
Bryan; Lewis, Suzanna E.; NguyenXuan, Jeremy; Shefchek, Kent; Vasilevsky, Nicole; et al.
The Monarch Initiative: An Integrative Data and Analytic Platform Connecting
Phenotypes to Genotypes across Species. Nucleic Acids Res. 2017, 45, D712–D722.
(445) Requests: HTTP for Humans™ - Requests 2.21.0 documentation
http://docs.python-requests.org/en/master/ (accessed Mar 14, 2019).
(446) About the HGNC: HUGO Gene Nomenclature Committee
https://www.genenames.org/about/overview (accessed Oct 8, 2018).
(447) Smith, Cynthia L.; Blake, Judith A.; Kadin, James A.; Richardson, Joel E.; Bult, Carol J.;
Mouse Genome Database Group. Mouse Genome Database (MGD)-2018: Knowledgebase
for the Laboratory Mouse. Nucleic Acids Res. 2018, 46, D836–D842.
(448) Jensen, Lars Juhl; Julien, Philippe; Kuhn, Michael; von Mering, Christian; Muller, Jean;
Doerks, Tobias; Bork, Peer. EggNOG: Automated Construction and Annotation of
Orthologous Groups of Genes. Nucleic Acids Res. 2008, 36, D250-4.
(449) pandas: a Foundational Python Library for Data Analysis and Statistics | R (Programming
Language) | Database Index https://www.scribd.com/document/71048089/pandas-a-
Foundational-Python-Library-for-Data-Analysis-and-Statistics (accessed Jun 4, 2018).
(450) Huang, Yue; Yu, Sui; Wu, Zhanhe; Tang, Beisha. Genetics of Hereditary Neurological
Disorders in Children. Transl. Pediatr. 2014, 3, 108–119.
(451) Katritsis, Demosthenes G.; Gersh, Bernard J.; Camm, A. John. A Clinical Perspective on
Sudden Cardiac Death. Arrhythmia Electrophysiol. Rev. 2016, 5, 177–182.
(452) Antzelevitch, C.; Pollevick, G. D.; Cordeiro, J. M.; Casis, O.; Sanguinetti, M. C.; Aizawa,
Y.; Guerchicoff, A.; Pfeiffer, R.; Oliva, A.; Wollnik, B.; Gelber, P.; Bonaros, E. P.;
Burashnikov, E.; Wu, Y.; Sargent, J. D.; Schickel, S.; Oberheiden, R.; Bhatia, A.; Hsu, L. F.;
et al. Loss-of-Function Mutations in the Cardiac Calcium Channel Underlie a New
Clinical Entity Characterized by ST-Segment Elevation, Short QT Intervals, and Sudden
Cardiac Death. Circulation 2007, 115, 442–449.
(453) Schimpf, R.; Veltmann, C.; Wolpert, C.; Borggrefe, M. Arrhythmogenic Hereditary
Syndromes: Brugada Syndrome, Long QT Syndrome, Short QT Syndrome and CPVT.
Minerva Cardioangiol. 2010, 58, 623–636.
(454) Campuzano, Oscar; Sarquella-Brugada, Georgia; Brugada, Ramon; Brugada, Josep.
Brugada Syndrome. Clin. Cardiogenetics Second Ed. 2016, 175–191.
(455) Whitebread, Steven; Hamon, Jacques; Bojanic, Dejan; Urban, Laszlo. Keynote Review: In
Vitro Safety Pharmacology Profiling: An Essential Tool for Successful Drug Development.
Drug Discov. Today 2005, 10, 1421–1433.
(456) Lu, H. R.; Vlaminckx, E.; Hermans, A. N.; Rohrbacher, J.; Van Ammel, K.; Towart, R.;
Pugsley, M.; Gallacher, D. J. Predicting Drug-Induced Changes in QT Interval and
Arrhythmias: QT-Shortening Drugs Point to Gaps in the ICHS7B Guidelines. Br. J.
Pharmacol. 2008, 154, 1427–1438.
(457) Colatsky, Thomas; Fermini, Bernard; Gintant, Gary; Pierson, Jennifer B.; Sager, Philip;
Sekino, Yuko; Strauss, David G.; Stockbridge, Norman. The Comprehensive in Vitro
Proarrhythmia Assay (CiPA) Initiative — Update on Progress. J. Pharmacol. Toxicol.
Methods 2016, 81, 15–20.
(458) Toth, Linda A.; Bhargava, Pavan. Animal Models of Sleep Disorders. Comp. Med. 2013,
63, 91–104.
(459) Ledent, Catherine; Vaugeois, Jean-Marie; Schiffmann, Serge N.; Pedrazzini, Thierry;
Yacoubi, Malika El; Vanderhaeghen, Jean-Jacques; Costentin, Jean; Heath, John K.;
Vassart, Gilbert; Parmentier, Marc. Aggressiveness, Hypoalgesia and High Blood Pressure
in Mice Lacking the Adenosine A2a Receptor. Nature 1997, 388, 674–678.
(460) Yang, Amy; Palmer, Abraham A.; De Wit, Harriet. Genetics of Caffeine Consumption and
Responses to Caffeine. Psychopharmacology (Berl). 2010, 211, 245–257.
(461) Rétey, J. V; Adam, M.; Khatami, R.; Luhmann, U. F. O.; Jung, H. H.; Berger, W.; Landolt,
H. P. A Genetic Variation in the Adenosine A2A Receptor Gene (ADORA2A) Contributes
to Individual Sensitivity to Caffeine Effects on Sleep. Clin. Pharmacol. Ther. 2007, 81,
692–698.
(462) Mochizuki, T.; Arrigoni, E.; Marcus, J. N.; Clark, E. L.; Yamamoto, M.; Honer, M.; Borroni,
E.; Lowell, B. B.; Elmquist, J. K.; Scammell, T. E. Orexin Receptor 2 Expression in the
Posterior Hypothalamus Rescues Sleepiness in Narcoleptic Mice. Proc. Natl. Acad. Sci.
2011, 108, 4471–4476.
(463) Willie, Jon T.; Chemelli, Richard M.; Sinton, Christopher M.; Tokita, Shigeru; Williams,
S. Clay; Kisanuki, Yaz Y.; Marcus, Jacob N.; Lee, Charlotte; Elmquist, Joel K.; Kohlmeier,
Kristi A.; Leonard, Christopher S.; Richardson, James A.; Hammer, Robert E.; Yanagisawa,
Masashi. Distinct Narcolepsy Syndromes in Orexin Receptor-2 and Orexin Null Mice:
Molecular Genetic Dissection of Non-REM and REM Sleep Regulatory Processes. Neuron
2003, 38, 715–730.
(464) Ghanemi, Abdelaziz; Hu, Xintian. Targeting the Orexinergic System: Mainly but Not
Only for Sleep-Wakefulness Therapies. Alexandria J. Med. 2015, 51, 279–286.
(465) Irukayama-Tomobe, Yoko; Ogawa, Yasuhiro; Tominaga, Hiromu; Ishikawa, Yukiko;
Hosokawa, Naoto; Ambai, Shinobu; Kawabe, Yuki; Uchida, Shuntaro; Nakajima, Ryo;
Saitoh, Tsuyoshi; Kanda, Takeshi; Vogt, Kaspar; Sakurai, Takeshi; Nagase, Hiroshi;
Yanagisawa, Masashi. Nonpeptide Orexin Type-2 Receptor Agonist Ameliorates
Narcolepsy-Cataplexy Symptoms in Mouse Models. Proc. Natl. Acad. Sci. 2017, 114, 5731–
5736.
(466) Razavi, Bibi Marjan; Hosseinzadeh, Hossein. A Review of the Role of Orexin System in
Pain Modulation. Biomed. Pharmacother. 2017, 90, 187–193.
(467) Santos, Cynthia; Olmedo, Ruben E. Sedative-Hypnotic Drug Withdrawal Syndrome:
Recognition And Treatment. Emerg. Med. Pract. 2017, 19, 1–20.
(468) Bidwell, L. Cinnamon; Garrett, Melanie E.; McClernon, F. Joseph; Fuemmeler, Bernard
F.; Williams, Redford B.; Ashley-Koch, Allison E.; Kollins, Scott H. A Preliminary Analysis
of Interactions between Genotype, Retrospective ADHD Symptoms, and Initial Reactions
to Smoking in a Sample of Young Adults. Nicotine Tob. Res. 2012, 14, 229–233.
(469) Fisone, G.; Borgkvist, A.; Usiello, A. Caffeine as a Psychomotor Stimulant: Mechanism of
Action. Cell. Mol. Life Sci. 2004, 61, 857–872.
(470) Listos, Joanna; Malec, Danuta; Fidecka, Sylwia. Influence of Adenosine Receptor Agonists
on Benzodiazepine Withdrawal Signs in Mice. Eur. J. Pharmacol. 2005, 523, 71–78.
(471) Ballesteros-Yáñez, Inmaculada; Castillo, Carlos A.; Merighi, Stefania; Gessi, Stefania. The
Role of Adenosine Receptors in Psychostimulant Addiction. Front. Pharmacol. 2018, 8,
985.
(472) Lu, Ake T.; Ogdie, Matthew N.; Järvelin, Marjo-Ritta; Moilanen, Irma K.; Loo, Sandra K.;
McCracken, James T.; McGough, James J.; Yang, May H.; Peltonen, Leena; Nelson, Stanley
F.; Cantor, Rita M.; Smalley, Susan L. Association of the Cannabinoid Receptor Gene
(CNR1) with ADHD and Post-Traumatic Stress Disorder. Am. J. Med. Genet. B.
Neuropsychiatr. Genet. 2008, 147B, 1488–1494.
(473) Castelli, Maura; Federici, Mauro; Rossi, Silvia; De Chiara, Valentina; Napolitano,
Francesco; Studer, Valeria; Motta, Caterina; Sacchetti, Lucia; Romano, Rosaria; Musella,
Alessandra; Bernardi, Giorgio; Siracusano, Alberto; Gu, Howard H.; Mercuri, Nicola B.;
Usiello, Alessandro; Centonze, Diego. Loss of Striatal Cannabinoid CB1 Receptor
Function in Attention-Deficit / Hyperactivity Disorder Mice with Point-Mutation of the
Dopamine Transporter. Eur. J. Neurosci. 2011, 34, 1369–1377.
(474) Haughey, Heather M.; Marshall, Erin; Schacht, Joseph P.; Louis, Ashleigh; Hutchison,
Kent E. Marijuana Withdrawal and Craving: Influence of the Cannabinoid Receptor 1
(CNR1) and Fatty Acid Amide Hydrolase (FAAH) Genes. Addiction 2008, 103, 1678–1686.
(475) Jones, Declan N. C.; Holtzman, Stephen G. Influence of Naloxone upon Motor Activity
Induced by Psychomotor Stimulant Drugs. Psychopharmacology (Berl). 1994, 114, 215–
224.
(476) Abramov, Urho; Raud, Sirli; Kõks, Sulev; Innos, Jürgen; Kurrikoff, Kaido; Matsui,
Toshimitsu; Vasar, Eero. Targeted Mutation of CCK2 Receptor Gene Antagonises
Behavioural Changes Induced by Social Isolation in Female, but Not in Male Mice. Behav.
Brain Res. 2004, 155, 1–11.
(477) Schnur, P.; Cesar, S. S.; Foderaro, M. A.; Kulkosky, P. J. Effects of Cholecystokinin on
Morphine-Elicited Hyperactivity in Hamsters. Pharmacol. Biochem. Behav. 1991, 39, 581–
586.
(478) Kayser, V.; Idänpään-Hekkilä, J. J.; Christensen, D.; Guilbaud, G. The Selective
CholecystokininB Receptor Antagonist L-365,260 Diminishes the Expression of
Naloxone-Induced Morphine Withdrawal Symptoms in Normal and Neuropathic Rats.
Life Sci. 1998, 62, 947–952.
(479) Filardi, Marco; Pizza, Fabio; Tonetti, Lorenzo; Antelmi, Elena; Natale, Vincenzo; Plazzi,
Giuseppe. Attention Impairments and ADHD Symptoms in Adult Narcoleptic Patients
with and without Hypocretin Deficiency. PLoS One 2017, 12, e0182085.
(480) Gentile, Taylor A.; Simmons, Steven J.; Watson, Mia N.; Connelly, Krista L.; Brailoiu,
Eugen; Zhang, Yanan; Muschamp, John W. Effects of Suvorexant, a Dual
Orexin/Hypocretin Receptor Antagonist, on Impulsive Behavior Associated with
Cocaine. Neuropsychopharmacology 2018, 43, 1001–1009.
(481) Azizi, Hossein; Mirnajafi-Zadeh, Javad; Rohampour, Kambiz; Semnanian, Saeed.
Antagonism of Orexin Type 1 Receptors in the Locus Coeruleus Attenuates Signs of
Naloxone-Precipitated Morphine Withdrawal in Rats. Neurosci. Lett. 2010, 482, 255–
259.
(482) Hughes, J. P.; Rees, S.; Kalindjian, S. B.; Philpott, K. L. Principles of Early Drug Discovery.
Br. J. Pharmacol. 2011, 162, 1239–1249.
(483) Brannen, Kimberly C.; Chapin, Robert E.; Jacobs, Abigail C.; Green, Maia L. Alternative
Models of Developmental and Reproductive Toxicity in Pharmaceutical Risk Assessment
and the 3Rs. ILAR J. 2016, 57, 144–156.
(484) Komori, Shinji; Kasumi, Hiroyuki; Sakata, Kazuko; Koyama, Koji. The Role of Androgens
in Spermatogenesis. Soc. Reprod. Fertil. Suppl. 2007, 63, 25–30.
(485) Milatiner, D.; Halle, David; Huerta, Michael; Margalioth, Ehud J.; Cohen, Yoram; Ben-
Chetrit, Avraham; Gal, Michael; Mimoni, Tzvia; Eldar-Geva, Talia. Associations between
Androgen Receptor CAG Repeat Length and Sperm Morphology. Hum. Reprod. 2004, 19,
1426–1430.
(486) Melo, C. O. A.; Danin, A. R.; Silva, D. M.; Tacon, J. A.; Moura, K. K. V. O.; Costa, E. O. A.;
Da Cruz, A. D. Association between Male Infertility and Androgen Receptor Mutations
in Brazilian Patients. Genet. Mol. Res. 2010, 9, 128–133.
(487) Bachelot, Anne; Meduri, Géri; Massin, Nathalie; Misrahi, Micheline; Kuttenn, Frédérique;
Touraine, Philippe. Ovarian Steroidogenesis and Serum Androgen Levels in Patients with
Premature Ovarian Failure. J. Clin. Endocrinol. Metab. 2005, 90, 2391–2396.
(488) Shiina, H.; Matsumoto, T.; Sato, T.; Igarashi, K.; Miyamoto, J.; Takemasa, S.; Sakari, M.;
Takada, I.; Nakamura, T.; Metzger, D.; Chambon, P.; Kanno, J.; Yoshikawa, H.; Kato, S.
Premature Ovarian Failure in Androgen Receptor-Deficient Mice. Proc. Natl. Acad. Sci.
2006, 103, 224–229.
8 Appendix
8.1 Tables and Figures
Table 8-1: Number of data points for the public and proprietary dataset after filtering, divided by bromodomain and activity, along with the totals for each dataset for active (A) and not active (N) compound-target pairs.
Bromodomain | Activity (active (A) / not active (N)) | Number of public data points after filtering | Number of proprietary data points after filtering | Total number of data points after filtering
ATAD2 A 42 23 65
ATAD2 N 45 328 373
ATAD2B A - - 0
ATAD2B N 3 203 206
BAZ2A A 27 - 27
BAZ2A N 9 203 212
BAZ2B A 31 - 31
BAZ2B N 65 211 276
BPTF A 11 - 11
BPTF N 4 203 207
BRD1 A 44 9 53
BRD1 N 16 495 511
BRD2 BD1 A 79 163 242
BRD2 BD1 N 40 46 86
BRD2 BD2 A 15 135 150
BRD2 BD2 N 6 74 80
BRD3 BD1 A 69 167 236
BRD3 BD1 N 7 42 49
BRD3 BD2 A 24 137 161
BRD3 BD2 N 4 72 76
BRD4 BD1 A 1124 961 2085
BRD4 BD1 N 843 2682 3525
BRD4 BD2 A 540 161 701
BRD4 BD2 N 79 239 318
BRD7 A 21 20 41
BRD7 N 7 180 187
BRD9 A 113 20 133
BRD9 N 27 319 346
BRDT BD1 A 13 147 160
BRDT BD1 N 18 64 82
BRDT BD2 A 6 127 133
BRDT BD2 N 4 84 88
BRPF1b A 66 80 146
BRPF1b N 33 565 598
BRPF3 A 3 2 5
BRPF3 N 34 201 235
BRWD1 BD2 A - 1 1
BRWD1 BD2 N 4 202 206
CECR2 A 16 - 16
CECR2 N 12 319 331
CREBBP A 163 112 275
CREBBP N 111 350 461
EP300 A 8 116 124
EP300 N 4 328 332
KAT2A A - 2 2
KAT2A N 4 198 202
PB1 BD5 A 16 - 16
PB1 BD5 N 10 - 10
PCAF A 51 - 51
PCAF N 34 - 34
SMARCA2 A - 28 28
SMARCA2 N - - 0
SMARCA4 A 24 46 70
SMARCA4 N 7 242 249
TAF1 BD2 A 11 232 243
TAF1 BD2 N 6 191 197
TAF1L BD2 A 1 6 7
TAF1L BD2 N 3 179 182
TIF1A A 63 - 63
TIF1A N 20 189 209
TRIM33 A 2 1 3
TRIM33 N 1 202 203
Total all actives A 2583 2696 5279
Total all not actives N 1460 8611 10071
Grand Total A+N 4043 11307 15350
Table 8-2: PDB identifiers for the crystal structures used as the reference structure for each bromodomain in the alignment. “Not found” means that no crystal structure was found for this bromodomain in the MOE protein database at the time of collation. For these domains (BRPF3 and BRDT BD2), the alignment was inferred from the nearest bromodomain.
Bromodomain Crystal structure used for alignment (PDB)
CECR2 3NXB
BPTF 3UV2
KAT2A 1F68
PCAF 1WUG
BRD2 BD1 2YEK
BRD2 BD2 2DVV
BRD3 BD1 3S91
BRD3 BD2 3S92
BRD4 BD1 2YEL
BRD4 BD2 2YEM
BRDT BD1 4FLP
BRWD1 BD2 3Q2E
CREBBP 2D82
EP300 3I3J
ATAD2 4TT2
ATAD2B 3LXJ
BRD1 5AME
BRPF1 4UYE
BRD7 2I7K
BRD9 4NQN
TIF1A 3O35
TRIM33 3U5N
BAZ2A 4QBM
BAZ2B 3Q2F
TAF1 BD2 4YYM
TAF1L BD2 3HMH
PB1 BD5 4Q0N
SMARCA2 5DKC
SMARCA4 5DKD
BRPF3 Not found
BRDT BD2 Not found
Table 8-3: Binding site residue sequence alignment for bromodomains in this study. Binding site residues were defined as residues in the same alignment position as any residue within 4.5 Å of the ligands in public liganded co-crystal structures for bromodomains.
Domain 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
ATAD2 r f r v f t k p v d p d e - - v p d y v i p m d l l i n d p g d r l i r r a
ATAD2B r f n i f s k p v d i e e - - v s d y l i p m d l l i n d p g d k i i r r a
BAZ2A a a w p f l e p v n p r - l - v s g y r i p m d f l v n d - - d s e v g a g
BAZ2B d a w p f l l p v n l k - l - v p g y k i p m d f l v n d - - d s d i g a g
BPTF m a w p f l e p v d p n d - - a p d y y i p m d l k i n s - - d s p f y c a
BRD1 p a r i f a q p v s l k e - - v p d y l i p m d f l i n r - - d t v f y a a
BRD2 BD1 f a w p f r q p v d a v k l g l p d y h i p m d m t m n p - - t d d i v m a
BRD2 BD2 y a w p f y k p v d a s a l g l h d y h i p m d l l m n p - - d h d v v m a
BRD3 BD1 f a w p f y q p v d a i k l n l p d y h i p m d m t m n p - - t d d i v m a
BRD3 BD2 y a w p f y k p v d a e a l e l h d y h i p m d l l m n p - - d h e v v m a
BRD4 BD1 f a w p f q q p v d a v k l n l p d y y i p m d m t m n p - - g d d i v m a
BRD4 BD2 y a w p f y k p v d v e a l g l h d y c i p m d m l m n p - - d h e v v m a
BRD7 p s a f f s f p v t d f i - - a p g y s i p m d f l m n p - - e t i y y a a
BRD9 p h g f f a f p v t d a i - - a p g y s i p m d f l m n p - - d t v y y l a
BRDT BD1 f s w p f q r p v d a v k l q l p d y y i p m d l t m n p - - g d d i v m a
BRDT BD2 y a w p f y n p v d a d a l g l h n y y v p m d l l m n p - - d h e v v m a
BRPF1 t g n i f s e p v p l s e - - v p d y l i p m d f l i n k - - d t i f y a a
BRPF3 p a h i f a e p v n l s e - - v p d y l i p m d f l i n k - - d t i f h a a
BRWD1 BD2 d s e p f r q p v d l v - - e y p d y r i p m d f l i t k - - r s k i y m t
CECR2 d s w p f l e p v d e s y - - a p n y y i p m d i t m n e - - s s e y t m s
CREBBP e s l p f r q p v d p q l l g i p d y f v p m d l l m n k - - t s r v y f c
EP300 e s l p f r q p v d p q l l g i p d y f v p m d l l m n k - - t s r v y y c
KAT2A s a w p f m e p v k k s e - - a p d y y i p i d l r v n p - - d s e y c c a
PB1 BD5 l s a i f l r l p s r s e - - l p d y y i p m d m m m n p - - e s l i y d a
PCAF s a w p f m e p v k r t e - - a p g y y i p m d l r v n p - - e s e y y c a
SMARCA2 l s e v f i q l p s r k e - - l p e y y i p v d f l l n e - - g s q i y d s
SMARCA4 l s e v f i q l p s r k e - - l p e y y i p v d f l l n e - - g s l i y d s
TAF1 BD2 d s w p f h h p v n k k f - - v p d y y i p m d l l i n p - - e s q y t t a
TAF1L BD2 d s w p f h h p v n k k f - - v p d y y i p v d l l i n p - - e s q y t t a
TIF1A m s l a f q d p v p l - - - t v p d y y i p m d l l i n p - - d s e v a a g
TRIM33 l s i e f q e p v p a - - - s i p n y y i p m d l l i n a - - d s e v a a g
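The 4.5 Å criterion in the caption of Table 8-3 can be illustrated with a minimal sketch. It assumes atomic coordinates have already been parsed from a liganded co-crystal structure; the dictionary layout, residue labels and function name below are hypothetical and are not taken from the thesis workflow. The sketch covers only the distance filter, not the subsequent mapping of the selected residues onto alignment positions.

```python
import math

CUTOFF_ANGSTROM = 4.5  # distance threshold stated in the Table 8-3 caption


def residues_near_ligand(residue_atoms, ligand_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return labels of residues with at least one atom within `cutoff` of any ligand atom.

    residue_atoms: dict mapping a residue label to a list of (x, y, z) tuples (hypothetical layout)
    ligand_atoms:  list of (x, y, z) tuples for the ligand heavy atoms
    """
    near = []
    for label, atoms in residue_atoms.items():
        # A residue qualifies as soon as any atom pair is within the cutoff.
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            near.append(label)
    return near


# Toy coordinates, for illustration only (not real structural data).
toy_residues = {
    "TYR97": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "ASN140": [(10.0, 0.0, 0.0)],
}
toy_ligand = [(4.0, 0.0, 0.0)]
print(residues_near_ligand(toy_residues, toy_ligand))  # ['TYR97']
```

In this toy case the first residue has an atom 3.0 Å from the ligand and is retained, whereas the second is 6.0 Å away and is excluded.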
Table 8-4: The ROC AUC, sensitivity and specificity values per-bromodomain for the final PCM model based on Morgan fingerprints (512 bits) and Z-scales 5 descriptors. Corresponds to data found in Figure 2-10.
Bromodomain ROC AUC Sensitivity Specificity
ATAD2 0.96 0.71 0.97
ATAD2B NA NA 1.00
BAZ2A 1.00 1.00 0.96
BAZ2B 1.00 0.88 0.98
BPTF 1.00 0.75 1.00
BRD1 0.92 0.71 0.97
BRD2 BD1 0.96 0.95 0.80
BRD2 BD2 0.97 0.95 0.79
BRD3 BD1 0.99 0.97 0.88
BRD3 BD2 0.96 0.98 0.76
BRD4 BD1 0.95 0.84 0.97
BRD4 BD2 0.97 0.99 0.82
BRD7 0.97 0.67 0.97
BRD9 0.98 0.84 0.98
BRDT BD1 0.94 0.96 0.72
BRDT BD2 0.96 1.00 0.67
BRPF1 0.83 0.49 0.98
BRPF3 0.84 0.00 1.00
BRWD1 BD2 NA NA 1.00
CECR2 0.99 1.00 0.99
CREBBP 0.93 0.86 0.83
EP300 0.92 0.71 0.96
KAT2A 0.96 0.00 1.00
PB1 BD5 1.00 1.00 1.00
PCAF 0.84 0.89 0.58
SMARCA2 NA 1.00 NA
SMARCA4 0.63 0.28 0.93
TAF1 BD2 0.84 0.69 0.80
TAF1L BD2 0.99 0.00 1.00
TIF1A 0.99 0.88 0.99
TRIM33 1.00 1.00 1.00
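For reference, sensitivity and specificity follow the standard confusion-matrix definitions, sketched below. The function names are illustrative, and the handling of undefined values is an assumption: one plausible reading of the NA entries in Table 8-4 is that a metric cannot be computed when the evaluated set for a bromodomain contains no compounds of the relevant class (for example, no actives for sensitivity).

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actives predicted active; None if there are no actives."""
    total = true_pos + false_neg
    return true_pos / total if total else None


def specificity(true_neg, false_pos):
    """Fraction of not-actives predicted not active; None if there are no not-actives."""
    total = true_neg + false_pos
    return true_neg / total if total else None


# Toy counts, for illustration only (not taken from the thesis).
print(sensitivity(true_pos=8, false_neg=2))   # 0.8
print(specificity(true_neg=9, false_pos=1))   # 0.9
print(sensitivity(true_pos=0, false_neg=0))   # None (undefined: no actives)
```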
Figure 8-1: Binding site residue sequence alignment (with numbering) for bromodomains in this study, coloured by the interpretation of residues important for the classification of active compound-target pairs at each bromodomain. Corresponds to Figure 3-4 in the main text. Binding site residues were defined as residues in the same alignment position as any residue within 4.5 Å of the ligands in public liganded co-crystal structures for bromodomains.
8.2 Supplementary Data Files
Data referred to throughout the text as Supplementary Data Files were too large for
the Appendix and can be found as .csv files on the CD accompanying the thesis.
8.3 Compound Characterization Data
8.3.1 General Methods
NMR spectra
NMR spectra were obtained on a Bruker Avance 500 (500 MHz) system using d6-DMSO as
solvent. Measurements were taken at ambient temperature unless otherwise specified, and the
following abbreviations have been used: s, singlet; d, doublet; t, triplet; q, quartet; m, multiplet;
dd, doublet of doublets; ddd, doublet of doublet of doublets; dq, doublet of quartets; dt, doublet
of triplets; tt, triplet of triplets; p, pentet.
UPLC conditions
UPLC was carried out using a Waters UPLC fitted with a Waters SQD, SQD2 or QDA mass
spectrometer using electrospray ionisation (ESI) with positive/negative switching.
A: 0.1 % NH3 in water
B: acetonitrile
Column: Waters Acquity CSH™ C18 1.7 µm 2.1 x 50 mm
Gradient: 97% A/3% B to 3% A/97% B over 1.5 min
UV: 220 nm - 320 nm
Temperature: 40 °C
Flow rate: 1 mL/min
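As an illustration of the UPLC gradient, the mobile-phase composition at a given time can be sketched as below. This assumes the gradient is linear between its stated end points (97% A/3% B to 3% A/97% B over 1.5 min) and ignores any hold or re-equilibration step, neither of which is specified here; the function name is hypothetical.

```python
def percent_b(time_min, start_b=3.0, end_b=97.0, duration_min=1.5):
    """Percentage of eluent B at `time_min`, assuming a linear gradient.

    Times before the start or after the end of the gradient are clamped
    to the start and end compositions respectively.
    """
    fraction = min(max(time_min / duration_min, 0.0), 1.0)
    return start_b + (end_b - start_b) * fraction


for t in (0.0, 0.75, 1.5):
    b = percent_b(t)
    print(f"t = {t:.2f} min: {100 - b:.0f}% A / {b:.0f}% B")
```

Under the linearity assumption, the midpoint of the run (0.75 min) corresponds to 50% A/50% B.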
High Resolution Mass Spectrometry Accurate Mass conditions
The High Resolution Mass Spectrometer is run in electrospray (ESI) +ve or -ve ion mode, and
automatic MS/MS using CID at 35 eV is carried out on the two most intense ions
generated from MS1.
Mass range: 100 – 1000 amu
Mass Spectrometer: Orbitrap XL or Waters Xevo Qtof
Gradient: Acidic and basic mobile phases are available, consisting of A = aqueous
0.1% formic acid (or 0.1% ammonium hydroxide) and B = acetonitrile with 0.1%
formic acid (or 0.1% ammonium hydroxide). The method runs a 95% A to 5% A
gradient at 0.7 mL/min over 4.0 min, followed by a 0.5 min hold and a
return to 95% A by 5 min.
Total run time: 5 min
Column: Waters CSH 50 x 2.1 BEH
Temperature: 45 °C
Injection volume: 1 – 5 µL
UV: 220 to 400 nm (PDA detector)
Sample preparation: No greater than 0.5 mg/mL solution in DMSO/methanol
Purity criteria
Compounds are >90% pure based on UPLC and 1H NMR.
8.3.2 Compound 1
4-cyano-N-(1,3-dimethyl-2-oxo-7-quinolyl)-2-methoxy-benzenesulfonamide
1H NMR (500 MHz, DMSO) 2.06 – 2.1 (m, 3H), 3.55 (s, 3H), 3.98 (s, 3H), 7.26 (dd, J = 9.0, 2.4
Hz, 1H), 7.30 (d, J = 2.4 Hz, 1H), 7.37 (d, J = 9.0 Hz, 1H), 7.47 (dd, J = 8.1, 1.3 Hz, 1H), 7.69 (s, 1H),
7.73 (d, J = 1.2 Hz, 1H), 7.86 (d, J = 8.1 Hz, 1H), 10.27 (s, 1H); 13C NMR (126 MHz, DMSO, 27°C)
161.81, 156.81, 136.44, 135.48, 131.58, 131.57, 131.25, 130.39, 124.59, 123.45, 120.71, 119.57, 117.96,
117.35, 117.18, 115.80, 57.45, 29.86, 17.79. m/z: ES+ [M+H]+ 384, HRMS (ESI): calculated for
C19H18N3O4S [M+H]+: 384.1018, found 384.1004, error 3.6 ppm.
8.3.3 Compound 2
1,3-dimethyl-2,4-dioxo-N-[4-(trifluoromethyl)phenyl]quinazoline-6-sulfonamide
1H NMR (500 MHz, DMSO) 3.29 (s, 3H), 3.50 (s, 3H), 7.32 (d, J = 8.5 Hz, 2H), 7.60-7.65 (m,
3H), 8.09 (dd, J = 8.9, 2.3 Hz, 1H), 8.44 (d, J = 2.3 Hz, 1H), 10.98 (s, 1H); 13C NMR (126 MHz,
DMSO) 160.94, 150.84, 143.97, 141.80, 133.23, 132.92, 127.19, 127.15, 125.70, 124.28, 119.34, 116.55,
115.52, 31.49, 28.78; m/z: ES+ [M+H]+ 414, HRMS (ESI): calculated for C17H15N3O4SF3 [M+H]+:
414.0735, found 414.0752, error 4.1 ppm.
8.3.4 Compound 3
1H NMR (500 MHz, DMSO) 2.04 (s, 3H), 3.12 (s, 3H), 4.22 (s, 2H), 4.97 (s, 2H), 6.85 – 6.89 (m,
2H), 6.83 (s, J = 8.6, 1H), 6.93 (s, 1H), 7.35 (d, J = 8.4 Hz, 2H), 7.58 (d, J = 8.4 Hz, 2H); 13C NMR
(126 MHz, DMSO) 168.75, 155.53, 153.68, 139.43, 133.93, 132.02, 128.80, 122.60, 119.33, 114.05,
113.95, 113.14, 69.84, 42.39, 29.63, 24.46; m/z: ES+ [M+H]+ 326; HRMS (ESI): calculated for
C18H19N3O3 [M+Na]+: 348.1324, found 348.1342, error 5.2 ppm.
8.3.5 Compound 4
Literature compound
8.3.6 Compound 5
3-benzyl-5-(3,5-dimethylisoxazol-4-yl)-1,3-benzoxazol-2-one
1H NMR (500 MHz, DMSO) 2.15 (s, 3H), 2.33 (s, 3H), 5.07 (s, 2H), 7.11 (dd, J = 8.2, 1.8 Hz, 1H),
7.23 (d, J = 1.5 Hz, 1H), 7.28 – 7.34 (m, 1H), 7.33 – 7.4 (m, 2H), 7.41 – 7.48 (m, 3H); 13C NMR (126
MHz, DMSO) 165.63, 158.55, 154.42, 141.79, 135.98, 131.65, 129.24, 128.45, 128.38, 126.18, 123.62,
116.02, 110.58, 110.43, 45.61, 11.64, 10.74; m/z: ES+ [M+H]+ 321; HRMS (ESI): calculated for
C19H16N2O3 [M+H]+: 321.1239, found 321.1235, error 1.2 ppm.
8.3.7 Compound 6
N-[[3-(3,5-dimethylisoxazol-4-yl)phenyl]methyl]methanesulfonamide
1H NMR (500 MHz, DMSO) 2.23 (s, 3H), 2.40 (s, 3H), 2.87 (s, 3H), 4.22 (d, J = 6.3 Hz, 2H), 7.29
(d, J = 7.7 Hz, 1H), 7.32 – 7.39 (m, 2H), 7.46 (t, J = 7.6 Hz, 1H), 7.61 (t, J = 6.3 Hz, 1H); 13C NMR
(126 MHz, DMSO) 165.58, 158.53, 139.49, 130.37, 129.36, 128.53, 128.12, 127.22, 116.25, 46.26,
40.45, 11.76, 10.92; m/z: ES+ [M+H]+ 280; HRMS (ESI): calculated for C13H16N2O3S [M+Na]+:
303.0779, found 303.0790, error 3.6 ppm.
8.3.8 Compound 7
4-[(9-cyclopentyl-5-methyl-6-oxo-spiro[8H-pyrimido[4,5-b][1,4]diazepine-7,1'-
cyclopropane]-2-yl)amino]-2-fluoro-5-methoxy-N-[(1R,5S)-9-methyl-9-
azabicyclo[3.3.1]nonan-3-yl]benzamide
1H NMR (500 MHz, DMSO) 0.62 – 0.74 (m, 2H), 0.87 – 1.01 (m, 4H), 1.22 – 1.26 (m, 2H), 1.42
– 1.48 (m, 2H), 1.49 – 1.56 (m, 2H), 1.57 – 1.63 (m, 2H), 1.66 – 1.76 (m, 2H), 1.85 – 1.96 (m, 4H),
2.14 – 2.26 (m, 2H), 2.44 (s, 3H), 3.00 (s, 2H), 3.18 (s, 4H), 3.50 (s, 2H), 3.93 (s, 3H), 4.31 (s, 1H),
4.87 (p, J = 8.6, 8.2 Hz, 1H), 7.22 (d, J = 6.8 Hz, 1H), 7.72 (s, 1H), 7.78 (s, 1H), 8.03 (s, 1H), 8.35
(d, J = 13.7 Hz, 1H); m/z: ES+ [M+H]+ 592; HRMS (ESI): calculated for C32H42N7O3F [M+H]+:
592.3411, found 592.3397, error 2.4 ppm.
8.3.9 Compound 8
[2-[(1-methyl-2-oxo-6-quinolyl)oxymethyl]phenyl] methanesulfonate
1H NMR (500 MHz, DMSO) 3.51 (s, 3H), 3.59 (s, 3H), 5.24 (s, 2H), 6.61 (d, J = 9.5 Hz, 1H), 7.33
(dd, J = 9.2, 2.9 Hz, 1H), 7.38 – 7.43 (m, 2H), 7.44 – 7.52 (m, 3H), 7.62 – 7.68 (m, 1H), 7.83 (d, J
= 9.5 Hz, 1H); 13C NMR (126 MHz, DMSO) 161.12, 153.34, 147.33, 139.06, 135.06, 130.64, 130.46,
130.26, 127.87, 122.84, 122.23, 121.26, 120.05, 116.51, 112.49, 65.26, 38.60, 29.54; m/z: ES+ [M+H]+
360; HRMS (ESI): calculated for C18H17NO5S [M+H]+: 360.0906, found 360.0895, error 3.1 ppm.
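The ppm errors quoted with the HRMS data above are consistent with the usual relative mass error, |found − calculated| / calculated × 10^6. A minimal sketch follows; the function name is illustrative, and the m/z pairs are those reported in Sections 8.3.2 to 8.3.9.

```python
def ppm_error(calculated_mz, found_mz):
    """Absolute mass error in parts per million, relative to the calculated m/z."""
    return abs(found_mz - calculated_mz) / calculated_mz * 1e6


# (compound, calculated m/z, found m/z) as reported in the characterization data
hrms_data = [
    (1, 384.1018, 384.1004),
    (2, 414.0735, 414.0752),
    (3, 348.1324, 348.1342),
    (5, 321.1239, 321.1235),
    (6, 303.0779, 303.0790),
    (7, 592.3411, 592.3397),
    (8, 360.0906, 360.0895),
]

for compound, calculated, found in hrms_data:
    print(f"Compound {compound}: {ppm_error(calculated, found):.1f} ppm")
```

Rounded to one decimal place, these reproduce the reported errors (3.6, 4.1, 5.2, 1.2, 3.6, 2.4 and 3.1 ppm).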
8.4 Experimental Data
Table 8-5: Differential Scanning Fluorimetry (DSF) assay quality control, showing the identities, thermal shift values as measured in the DSF assay and the literature reported binding affinity of the positive controls for each assay as well as the standard deviation of the DMSO neutral controls.
Target | Positive control | Positive control ΔTm (°C) | Binding affinity Kd (nM) | Neutral control (DMSO) standard deviation
BRD1 | NI-57 | 5.1 | 110 | 0.10
BRD4 BD1 | JQ-1 | 10.5 | 60 | 0.37
BRD9 | i-BRD9 | 9.8 | 1.9 | 0.38
BRPF1b | NI-57 | 8.8 | 31 | 0.05
Table 8-6: X-ray crystallographic data collection and refinement statistics
Parameter | BRD4: 1 | BRD4: 2 | BRD4: 4 | BRD9: 4
PDB ID
Data collection
Space group | P212121 | P212121 | P212121 | P21212
Cell dimensions
a, b, c (Å) | 37.97, 42.99, 79.41 | 37.86, 44.05, 77.77 | 34.14, 47.41, 78.60 | 74.11, 122.96, 29.90
α, β, γ (°) | 90.0, 90.0, 90.0 | 90.0, 90.0, 90.0 | 90.0, 90.0, 90.0 | 90.0, 90.0, 90.0
Resolution (Å) | 40.00-1.25 (1.28-1.25) | 44.05-1.57 (1.61-1.57) | 40.60-1.30 (1.32-1.30) | 63.47-1.53 (1.57-1.53)
I / σI | 20.5 (1.4) | 14.1 (1.6) | 22.2 (5.3) | 12.8 (1.5)
CC 1/2 | 1.00 (0.49) | 0.999 (0.596) | 0.998 (0.986) | 0.998 (0.624)
Completeness (%) | 99.3 (94.4) | 99.1 (94.2) | 99.9 (97.2) | 100.0 (100.0)
Redundancy | 5.6 (2.9) | 6.9 (6.8) | 6.0 (3.6) | 6.2 (6.2)
Refinement
Resolution (Å) | 40.00-1.25 | 44.05-1.57 | 40.60-1.30 | 63.47-1.53
No. reflections | 205450 | 129989 | 191880 | 262534
Rwork / Rfree | 0.196 / 0.219 | 0.183 / 0.212 | 0.201 / 0.213 | 0.204 / 0.230
No. atoms
Protein | 1062 | 1056 | 1066 | 1838
Ligand | 24 | 19 | 25 | 50
Water | 130 | 108 | 150 | 188
B-factors
Protein | 20.6 | 25.4 | 20.1 | 39.9
Ligand | 18.7 | 31.6 | 23.7 | 40.5
Water | 29.1 | 33.8 | 28.0 | 42.4
R.m.s. deviations
Bond lengths (Å) | 0.01 | 0.01 | 0.01 | 0.01
Bond angles (°) | 0.94 | 0.94 | 1.11 | 0.80