Predictive Modelling of the Primary and Secondary

Pharmacology of Compounds in Drug Discovery

Kathryn Anne Giblin

Darwin College

Submission Date: June 2019

University: University of Cambridge

Funding: AstraZeneca and European Research Council (ERC)

This Thesis is Submitted for the Degree of Doctor of Philosophy

Summary

Predictive Modelling of the Primary and Secondary Pharmacology of

Compounds in Drug Discovery

Kathryn Anne Giblin

Understanding the primary and secondary pharmacology of drugs is imperative for delivering a

drug molecule which has the necessary efficacy and safety profile in humans. The application of

machine learning and data mining methods to drug discovery has the promise to accelerate the

drug discovery process by learning from the large amounts of existing data available. This thesis

focusses on in silico approaches to address challenges related to selectivity and toxicity in drug

discovery.

The first section of this thesis is concerned with the prediction of ligand selectivity profiles using

proteochemometric (PCM) modelling, a technique which uses both compound similarity and

protein target similarity as input into machine learning models for the prediction of ligand-

target interactions. We showed that employing a multi-target PCM model outperformed other

methods, including QSAR models, on the same bioactivity dataset for the bromodomain-

containing proteins. Furthermore, we established the applicability domain of the model by

employing conformal prediction, which was further used to aid the selection of compounds for

prospective experimental testing in bromodomain assays. We achieved a 31 % hit rate for actives

from our experimental selections, including the discovery of diverse novel hits. The PCM models

were interpreted to reveal residues in the protein active site important for the classification of

activity for each bromodomain, which were further validated by the generation of co-crystal

structures for new ligands bound to bromodomains, as well as literature evidence. The PCM

technique can be used in virtual screening, in silico off-target panel screening of compounds

and to aid future structure-based design.

The second section of this thesis is concerned with the translation of toxic effects between

preclinical and clinical studies. Animal models of toxicity are used in the drug discovery process

to assess the risk of toxicity in humans. However, these effects do not always translate into

human studies as demonstrated by previous retrospective concordance analyses. Here, we asked

“what more can animal models tell us about toxicity in humans?” by extending the previous

concordance approaches to find associations between toxicities in animals and humans which

were not described by the same adverse event (AE) terms. Using data from PharmaPendium,

we derived 2,050 statistically significant associations using the mutual information and

subsequently located preclinical AEs which, when observed for a drug, were indicative of a high

risk of experiencing a clinical AE, as measured by a positive likelihood ratio of greater than 10.

To find mechanistic links for associations with the highest mutual information values, we

conducted an analysis to find intersecting genes between preclinical AE, clinical AE and the

drugs which were reported to cause both AEs, finding genetic evidence for 25 % of associations

that were analysed. We suggested that the protein targets identified by this method can be

considered for inclusion in in vitro toxicity panel screening to enable the earlier prediction of

toxicity. In summary, we quantified from existing data where animal models were useful in

clinical toxicity prediction and where they were not, and generated mechanistic

hypotheses for the connection between in vivo and clinical toxicities. This study can provide

evidence to support the case for in vivo animal model usage for the prediction of certain clinical

toxicities, as well as provide suggestions for targets to incorporate into in vitro screening panels.

Both outcomes contribute towards the aim of replacing, reducing and refining animal usage in drug

discovery.

Overall, this work contributes to the prediction and understanding of the primary and

secondary pharmacology of drugs and how this relates to the concepts of selectivity and safety.

The four main applications of this work are: (1) a machine learning tool for the prediction of selective bromodomain inhibitor chemotypes; (2) a guide, derived from machine learning models, for determining which binding site residues to interact with to optimise binding to bromodomains; (3) a quantitative assessment of the drug-induced toxicities in animals which indicate a high risk of drug-induced toxicities in humans, to guide preclinical safety decisions; and (4) proposed targets for further investigation for inclusion in in vitro secondary pharmacology screening panels for early detection of potential toxic liabilities. These techniques can be applied to other target classes or toxicity datasets in the future.

Acknowledgements

Firstly, I would like to thank my PhD supervisor Andreas Bender for his technical guidance and

support during the PhD and for providing me with opportunities to present my work and

network within the field. I would also like to thank Samantha Hughes, my industrial supervisor

in AstraZeneca, particularly for her help in enabling the experimental validation of this work,

but also for her computational chemistry input, as well as her time and patience with reviewing

and approving publications.

Within AstraZeneca there are many individuals who contributed to the material contained

within this thesis, through discussions, reviewing written work and scientific expertise. Firstly,

for the experimental validation of the work, I would like to acknowledge Pia Hansson, Helen

Boyd, Marianne Schimpl and Clare Gregson. For their bromodomain expertise, I acknowledge

the contributions of Robert Sheppard and Thomas Hayhow. I thank Avid Afzal, Nigel Greene,

Lyn Rosenbrier-Ribeiro and Ian Barrett for their discussions and contributions to the work in

chapters 4 and 5, in terms of technical knowledge, idea generation, secondary pharmacology

knowledge and knowledge of bioinformatic databases, respectively. Generally, I would like to

thank members of the Oncology Computational Chemistry groups and Discovery Sciences

Quantitative Biology groups for listening to presentations and offering ongoing feedback and

advice throughout the PhD, as well as including me in group meetings and social events.

The time spent during my PhD would not have been half the experience it has been without

other members of my research group, past and present, many of whom are now my great friends.

Ben Alexander-Dann, Stephanie Ashenden, Erin Oerton, Fatima Baldo, Avid Afzal, Fynn Krause,

Krishna Bulusu, Christoph Schlaffner, Lewis Mervin, Azedine Zoufir, Georgios Drakakis, Ines

Smit, Maria-Anna Trapotsi, Samar Mahmoud, Chad Allen, Layla Gerami, Fredrik Svensson, and

many others: thank you so much for all the help with the PhD, the fantastic memories, from

the pub nights and formal dinners to the trips away (India, Germany, Norfolk!), and for keeping

me going through the hard times.

Completing my PhD would not have been possible without the love, support and

encouragement from my family. To my parents, Sue and Ged, and my brother Matt, thank you

for everything you have done for me; I am very lucky to have you in my life. Personally, the past

few years have also been a memorable time for me and my partner Tom; not only did we

buy our first home and adopt two lovely cats (thanks Oliver and Socks for your calming

influences), we also got married at Darwin College, my college for the PhD. I would like to

dedicate this thesis to my wonderful husband Tom for his unwavering love and encouragement

in helping me to achieve my goals, and for making me very happy!

List of Written Publications During PhD

Here we list publications relevant to work done throughout the PhD and, where applicable,

explain their relation to chapters of the thesis.

Kathryn A. Giblin, Samantha J. Hughes, Helen Boyd, Pia Hansson and Andreas Bender.

Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule

Binding to Bromodomain Proteins, J. Chem. Inf. Model., 2018, 58, 1870-1888

Sections from this publication are included in chapter 2 of this thesis.

Kathryn A. Giblin, Samantha J. Hughes, Marianne Schimpl, Helen Boyd, Pia Hansson, Clare

Gregson, Robert J Sheppard, Thomas Hayhow, Andreas Bender, Selectivity Insight for the

Rational Design of Bromodomain Inhibitors Derived from Proteochemometric Models (in

preparation)

This publication is described in chapter 3 of this thesis.

Kathryn A. Giblin, Avid Afzal, Lyn Rosenbrier Ribeiro, Nigel Greene, Ian Barrett, Samantha

Hughes, Andreas Bender, New Mechanistic Associations Between Drug-Induced Toxicities in

Animal Models and Humans Uncovered by Statistical Analysis (in preparation)

This publication is described in chapters 4 and 5 of this thesis.

Fredrik Svensson, Azedine Zoufir, Samar Mahmoud, Avid Afzal, Ines Smit, Kathryn A. Giblin,

Jerome Mettetal, Amy Pointon, James Harvey, Nigel Greene, Richard Williams and Andreas

Bender, Information-Derived Mechanisms of Structural Cardiotoxicity, Chem. Res. Toxicol.,

2018, 31, 1119-1127

This publication was the work of a collaboration between the University of Cambridge,

Lhasa, AstraZeneca and GlaxoSmithKline during the PhD.

1 Thesis Introduction
1.1 Challenges in Drug Discovery
1.1.1 Primary and Secondary Pharmacology of Drugs
1.1.2 Summary
1.2 General Computational Methods
1.2.1 Data Science Methods
1.2.2 Summary
1.3 Computational Methods and Applications for Bioactivity and Selectivity Profile Prediction
1.3.1 Ligand-Target Interactions
1.3.2 Computational Encoding of Ligands and Targets
1.3.3 Ligand-Based Bioactivity Prediction Against Multiple Targets
1.3.4 Target-Based Bioactivity Prediction Against Multiple Targets
1.3.5 Ligand and Target-Based Bioactivity Prediction Against Multiple Targets
1.3.6 Bromodomain Target Family Selectivity Profile Prediction
1.3.7 Summary
1.4 Computational Methods and Applications for Toxicity Prediction
1.4.1 Prediction of Clinical Adverse Events from Preclinical Adverse Events
1.4.2 Computational Analyses for the Proposal of Targets for In Vitro Secondary Pharmacology Screening
1.4.3 Summary
1.5 Aims

2 Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule Binding to Bromodomain Proteins
2.1 Introduction
2.2 Materials and Methods
2.2.1 Dataset
2.2.2 Analysis of Chemical and Biological space
2.2.3 Compound and Target descriptors
2.2.4 Algorithms and Generation of PCM models
2.2.5 Model Validation
2.2.6 Benchmarking to QSAR, Quantitative Sequence Activity Model (QSAM) and Baseline Models
2.2.7 Public Dataset Model
2.2.8 Applicability Domain
2.2.9 Experimental Validation
2.2.10 Experimental Testing
2.3 Results and Discussion
2.3.1 Analysis of Chemical and Biological Space
2.3.2 Benchmarking Compound and Target Descriptors for PCM Models
2.3.3 Algorithms and Model Performance
2.3.4 Model Validation
2.3.5 Benchmarking to QSAR, QSAM and baseline models
2.3.6 Public Dataset PCM Model
2.3.7 Applicability Domain
2.3.8 Experimental Validation
2.4 Conclusions

3 Selectivity Insight for the Rational Design of Bromodomain Inhibitors Derived from Proteochemometric Models
3.1 Introduction
3.2 Materials and Methods
3.2.1 Computational models
3.2.2 Model Interpretation
3.2.3 X-ray Crystallographic Structure Determination for BRD4 and BRD9
3.3 Results and Discussion
3.3.1 Interpretation
3.3.2 Binding Modes Elucidated by Co-Crystallization
3.4 Conclusions

4 Associations Between Drug-Induced Adverse Events in Animal Models and Humans: Beyond Concordance
4.1 Introduction
4.2 Materials and Methods
4.2.1 Dataset
4.2.2 Feature Filtering
4.2.3 Mutual Information Associations
4.2.4 Network Analysis of Significant Associations
4.2.5 Limitations
4.3 Results and Discussion
4.3.1 Concordance Analysis
4.3.2 Statistical Associations Between All Preclinical and Clinical AEs
4.3.3 Network Analysis of Significant AEs
4.4 Conclusions

5 Deriving Potential Mechanisms from Associations Between Drug-Induced Adverse Events in Animal Models and Humans
5.1 Introduction
5.2 Materials and Methods
5.2.1 Associations for Interpretation
5.2.2 Extracting Targets for Drugs
5.2.3 Mapping Preclinical and Clinical AEs to Ontology Terms
5.2.4 Extracting Genes for Preclinical and Clinical AEs
5.2.5 Gene Filtering and Mapping Animal Genes to Human Orthologs
5.2.6 Overlap Analysis of Genes from Preclinical AE, Clinical AE and Drugs
5.2.7 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
5.2.8 Visualisations
5.2.9 Limitations
5.3 Results and Discussion
5.3.1 Overall Analysis of Mechanistically Supported Associations
5.3.2 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels
5.3.3 Mechanistically Supported One-to-One Relationships Between AEs
5.3.4 Mechanistically Supported Groups of Associations Between AEs
5.4 Conclusions

6 Conclusions and Future Perspective

7 References

8 Appendix
8.1 Tables and Figures
8.2 Supplementary Data Files
8.3 Compound Characterization Data
8.3.1 General Methods
8.3.2 Compound 1
8.3.3 Compound 2
8.3.4 Compound 3
8.3.5 Compound 4
8.3.6 Compound 5
8.3.7 Compound 6
8.3.8 Compound 7
8.3.9 Compound 8
8.4 Experimental Data

1 Thesis Introduction

This thesis explores data science and machine learning methods relevant to understanding and

predicting the selectivity and toxicity of drugs. In the introduction we first highlight the

challenges faced in drug discovery associated with obtaining drug molecules which are selective

and safe, describing how these parameters link to the concepts of primary and secondary

pharmacology. The theory behind general methods used within the thesis related to data

science will subsequently be discussed to provide the appropriate technical background. Finally,

we review and discuss previous applications of computational approaches to address the

endpoints of focus for this thesis, namely selectivity and toxicity, outlining gaps in existing work.

Overall, the introduction presents a case for the research questions addressed in this work.

1.1 Challenges in Drug Discovery

Small molecule drug discovery is traditionally a multi-step process, starting with the identification of a target and culminating in drug approval for certain diseases, patient populations and territories (Figure 1-1).1,2 Due to the complexity of the process, it usually takes over 12 years3 and costs over $1 billion4 to develop a New Molecular Entity (NME) from concept to launch.

Figure 1-1: Traditional stages of the drug discovery process. [Stages shown, in order: Target Identification and Validation, Hit Generation, Lead Optimization, Pre-clinical Studies, Phase 1 (Safety), Phase 2 (Efficacy, Safety), Phase 3 (Efficacy, Safety), Approval and Launch.]

There is a high attrition rate at many stages of the process, and it has been estimated that over 24 drug projects are initiated for every drug launched, with numbers increasing for more difficult therapeutic targets.2 The main stages of attrition are phase 2 and phase 3 clinical trials, with 66 % and 30 % of compounds, respectively, failing during these stages. It is therefore important to understand efficacy and safety early in the drug discovery process to avoid late-stage attrition.

1.1.1 Primary and Secondary Pharmacology of Drugs

There are challenges to overcome at all stages of the drug discovery process, with the aim of reducing attrition from lack of efficacy in patients and unanticipated toxicity. One of these challenges is anticipating and understanding the consequences of the off-target or secondary pharmacology profiles of drugs. In a traditional target-based model of drug discovery, the aim is to develop a compound for a primary biological target (or sometimes targets) with a validated

role in disease. Due to the vast number and types of biological targets in the human body, it is

likely that drugs will not only interact with the specific target(s) of interest, but also exhibit a

range of off-target activities.5 Interaction with multiple targets, known as polypharmacology,6

can be favourable and used as an approach to modulate multiple targets with roles in disease,

inspired by the concept of combination therapy.7 However, off-target activities can often be

unfavourable and can contribute to unwanted pharmacological or toxicological effects.

Medicinal chemists attempt to design compounds that have the desired on-target profile whilst

mitigating the risk of off-target effects during the lead optimisation process. Two concepts

related to on- and off-target activities of drugs, namely selectivity and toxicity, are explored in

the material of this thesis.

Challenge of Selectivity

Selectivity and Polypharmacology

A specific drug is one which has only one desired effect on a biological system.8 In reality,

however, most drug molecules have effects on multiple components of the system, given a high

enough dose, and the degree of this promiscuity is termed the selectivity of a drug. These system

components can be genes, proteins, pathways or cells, and so selectivity is a broad term covering

effects at different levels.9

In the traditional approach to drug discovery, with the aim to modulate a specific protein

function in disease, selectivity refers to target selectivity. Target selectivity is the degree to

which a small molecule interacts with a desired protein target versus its degree of interaction

with off-targets. In medicinal chemistry, small molecules are optimised for a biological target

or targets of interest, in an iterative process, using a variety of binding, functional and cell-based

assays to measure success. One of the main aims in lead optimisation programs is to design a

drug with a balance of high potency at the desired target, whilst minimising the interactions

with undesirable targets, termed optimising drug selectivity. Activity for undesirable off-targets

can cause unwanted effects including toxicities, which will be discussed in the section Off-

Target Pharmacology.

Not all drugs need to be selective for a single target. Polypharmacology refers to a single

molecule which modulates multiple molecular targets and can be described as either desirable

or undesirable.10 The established hypothesis that a drug needs to interact with a specific target

protein ignores the fact that multiple targets within a network cause a disease.11 For example, in

cancer, where mutations occur rapidly, it is useful to have a drug which binds not only to the wild-type target but also to any resistant variants of the protein.5 Additionally, diseases are

sometimes caused by multiple pathways which compensate for one another when one is

modulated, for example in mood disorders and schizophrenia.12 This can provide an opportunity

for the development of a therapeutic agent to interact with at least one target in each pathway

to mediate the desired effect. The single-target hypothesis is also shown in some cases to have

limitations in duration of pharmacodynamic effect due to compensatory cellular mechanisms.13

Even for the multi-target objective, the aim would remain to create drug molecules with

selectivity over other undesired targets; these drugs can be termed “selectively

promiscuous”.14

Types of Selectivity

Given that most drugs require some type of selectivity optimization, we will now discuss the

different types of selectivity.

One important type of selectivity addressed in the lead optimisation stage is target-family

selectivity. This is important when the biological target of interest belongs to a characterised

biological family, which is known to bind to similar endogenous substrates.5 For example,

kinases present one of the major target classes for the development of new therapeutics due to

their roles in diseases such as cancer and inflammatory disorders.15 Because the ATP binding site is functionally conserved, inhibitor molecules for one kinase often also inhibit other targets within the family.16 Thus, for kinase targets, target-family selectivity is often assessed early on using panel screening against the druggable kinome.15,17 Similar principles apply to other target families, and members of the same target family are often screened as a first port of call in projects. This is due to the hypothesis that targets phylogenetically related to the primary target are more likely to bind similar endogenous molecules and therefore interact with the same chemotypes of small molecules.18 The main aspects governing the binding of small molecules to protein targets, namely shape, electrostatics, flexibility, hydration and allostery, can be applied to measure target similarity and to understand and predict selectivity.5 Targets that are similar to the primary target in these aspects are often also bound by similar small molecules, and the differences between them can be exploited to achieve selectivity.18 To obtain selectivity over specific related targets, screening can be conducted for the secondary targets alongside the primary target and selectivity optimised iteratively based on structure-activity relationships (SAR).

In addition to interacting with other members of the same target class as the primary intended target(s), drugs can also interact in vivo with targets which are unrelated in structure or function to the primary drug target. These targets include enzymes, ion channels, receptors and transporters which are known to cause unwanted effects, for example the human ether-a-go-go-related gene (hERG) channel and Nav1.5. In addition, molecules are screened for drug-drug interactions (DDIs)5 and plasma protein binding,19,20 two other examples of non-specific binding. To mitigate the risks of unrelated off-targets, there are a few avenues which can be explored, including in vitro toxicity panel screening21 (discussed in Off-Target Toxicity Assessment) and avoiding molecular functionality or properties known to interact with certain undesired off-targets (for example, toxicophores).10

Measuring Selectivity

Between two targets, we can measure the ratio of activity for the on-target and off-target of

interest, as a measurement of selectivity. This is known as the selectivity ratio (SR) and can be

defined in Equation 1-1.22

Equation 1-1

$$SR = \frac{\text{biological activity of off-target assay}}{\text{biological activity of on-target assay}}$$

This calculation requires the measurement of biological activity for both the on-target and off-

target. Biological activity is defined as the “specific ability or capacity of a particular molecular

entity to achieve a defined biological effect.”23 Multiple measurements of biological activity at a

target are commonly used, including the dissociation constant between the ligand-receptor

complex (Kd),24 the half-maximal effective concentration (EC50),25 the half maximal inhibitory

concentration (IC50),25 and the dissociation constant of the receptor-inhibitor complex at

equilibrium (Ki).26 The value of the SR is often described as the fold-selectivity. A higher

fold-selectivity means the drug is more selective, since there is a larger difference between the

concentrations needed to afford the same degree of response.
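As a minimal sketch of Equation 1-1, the snippet below computes the selectivity ratio from two potency values measured in the same units. The function names and the IC50 values are illustrative assumptions and are not taken from any dataset in this thesis. The log-scale difference is included only because fold-selectivity is often compared on a logarithmic scale.

```python
import math


def selectivity_ratio(off_target_activity: float, on_target_activity: float) -> float:
    """Selectivity ratio (Equation 1-1): off-target activity / on-target activity.

    Both values must be the same type of measurement in the same units
    (for example, IC50 in nM for both assays).
    """
    if off_target_activity <= 0 or on_target_activity <= 0:
        raise ValueError("activity values must be positive concentrations")
    return off_target_activity / on_target_activity


def log_selectivity(off_target_activity: float, on_target_activity: float) -> float:
    """log10 of the selectivity ratio (difference in potency in log units)."""
    return math.log10(selectivity_ratio(off_target_activity, on_target_activity))


# Hypothetical example values (assumed for illustration, not measured data).
on_target_ic50_nm = 10.0
off_target_ic50_nm = 1000.0

fold = selectivity_ratio(off_target_ic50_nm, on_target_ic50_nm)
log_units = log_selectivity(off_target_ic50_nm, on_target_ic50_nm)
print(f"{fold:.0f}-fold selective ({log_units:.1f} log units)")
```

Mixing measurement types (for example, a Kd for one target and an IC50 for the other) would make the ratio hard to interpret, which is why the sketch assumes a single measure for both assays.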

Beyond the comparison of two targets, basic measurements can determine the degree of

promiscuity of a drug, by counting the number of targets for which the bioactivity value is found

to be below a certain threshold for activity for each drug. Different methods have been used to

estimate the average drug promiscuity, finding that a drug might on average interact with 3-6

targets at a concentration of less than 10 μM according to estimations from DrugBank and

Wombat27 and that 50 % of drugs interact with more than 5 different targets at a concentration

of less than 10 μM determined from drugs annotated in the Anatomical Therapeutic Chemical

(ATC) classification.28 Utilising the large amounts of bioactivity data which have become

available in recent years, a detailed analysis of compound promiscuity has been conducted at


different stages of the drug development process, finding that around 50 % of screening hits

had activity against at least two targets and that a notable increase in promiscuity was observed

from screening hits and optimised compounds (2.9-3.7 targets on average, depending upon the

dataset), to experimental and approved drugs (4.7 and 6.9 targets on average, respectively).29 It

was also found that similar drugs are likely to bind similar targets, and in one study 71 % of

drugs were found to interact with at least two targets with similar binding sites.18
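The threshold-based promiscuity count described above can be sketched as follows. This is an illustration only: the compound names, target names and activity values are hypothetical, and the 10 μM cut-off is simply the threshold quoted in the text.

```python
from typing import Dict

ACTIVITY_THRESHOLD_UM = 10.0  # cut-off quoted in the text (10 uM)


def promiscuity(activities_um: Dict[str, float],
                threshold_um: float = ACTIVITY_THRESHOLD_UM) -> int:
    """Count the targets whose bioactivity value falls below the threshold."""
    return sum(1 for value in activities_um.values() if value < threshold_um)


# Hypothetical bioactivity profiles (target -> activity in uM).
profiles = {
    "compound_A": {"target_1": 0.05, "target_2": 2.0, "target_3": 35.0},
    "compound_B": {"target_1": 12.0, "target_4": 0.8},
}

counts = {name: promiscuity(profile) for name, profile in profiles.items()}
mean_promiscuity = sum(counts.values()) / len(counts)
print(counts, mean_promiscuity)
```

A count of this kind depends on how many targets each compound was actually tested against, so sparse bioactivity matrices will tend to underestimate promiscuity.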

Overall, we note that most compounds display some degree of promiscuity or

polypharmacology, and that experimental and approved drugs have a large degree of

promiscuity, which may be causing undesired effects. Designing molecules with the appropriate

profile of on- and off-target pharmacology is therefore imperative to provide the appropriate in

vivo functionality.

Challenge of Designing Selective Bromodomain Inhibitors

As mentioned above, it can be difficult to achieve selectivity between members of the same

target family. Bromodomain-containing proteins are a family of epigenetic proteins, which play

important roles in immunological, developmental and cardiovascular disorders, as well as

cancers.30 To better elucidate their individual roles in disease, selective probe molecules are

required. Designing for selectivity remains difficult since small molecule bromodomain binding

requires the mimicry of endogenous peptide binding and therefore chemotypes which bind to

one bromodomain often also bind to other bromodomains. This lack of selectivity is particularly

notable between bromodomains with a high binding site sequence conservation.31 In this

section we provide the background to bromodomain-containing proteins including their

function and structure, as well as highlighting literature knowledge of the active site residues

which form interactions with known small molecule binders.

Bromodomain Function and Clinical Relevance

Bromodomain-containing proteins are epigenetic readers, which regulate gene transcription by

recognising and binding to acetyl lysine post translational modifications (PTMs) on histone

proteins, as well as other proteins.32–34 PTMs alter the accessibility of proximal chromatin to

gene transcription and the binding of bromodomain-containing proteins often causes the

transcriptional activation of genes.35 In total, 61 unique bromodomains have been identified in the

human proteome, which have been classified into eight subfamilies.36 Most bromodomains

contain a deep and well-defined acetyl lysine recognition binding site, which interacts with

acetylated lysine residues on histone peptides in a protein-protein interaction, providing an

interaction which can be disrupted by small molecule inhibition.37,33


Histone acetylation is associated with an open chromatin state and consequently transcriptional

activation. This open state is due to the neutralisation of the charged lysine upon acetylation,

which affects the packing of the chromatin by reducing the strong ionic interactions that form

the closed structure of the DNA.36 The histone acetylation process is dynamically regulated by

the writer proteins, histone acetyltransferases (HATs), and the eraser proteins, histone

deacetylases (HDACs),33 inhibitors of which have been developed in order to interfere with

the DNA accessibility and therefore gene transcription processes. Once these acetylation marks

have been positioned by the HAT enzymes, bromodomains act to read these modifications and

recruit the transcription factors and chromatin remodelling factors that are responsible for the

increase in gene transcription.32 Specific bromodomains have different profiles of acetyl lysine

mark recognition, which determine their functional roles; this was elucidated by

Filippakopoulos et al., 2012 in a study that used SPOT peptide arrays in order to assess

bromodomain selectivity.38

Bromodomain and extra-terminal (BET) family bromodomains have been well-studied, with

bromodomain containing protein 4 (BRD4) having received the most attention for the

development of inhibitors.39 The BET bromodomains include: BRD2, BRD3, BRD4 and BRDT.

Each protein contains two bromodomains, which are known as BD1 and BD2. The majority of

drugs in clinical trials exhibit pan-BET activity, as selectivity for the BET bromodomains is

difficult to achieve, especially due to the similarity of BD1 and BD2 domains across proteins.

There are currently 17 BET inhibitors in clinical trials for a wide range of indications,

including cardiovascular conditions such as atherosclerosis, as well as diabetes, solid tumours and

haematological tumours.30,33,36,40–42 Five commonly reported BET inhibitors (belonging to

different scaffolds) are depicted in Figure 1-2.


Figure 1-2: Structures of five pan-BET bromodomain inhibitors (JQ1, I-BET762, RVX-208, I-BET151 and XD14).

JQ1 and I-BET762 (Figure 1-2) were the first potent BET inhibitors, which stimulated the interest

in targeting bromodomains for various indications.30,43,44 These both share the same diazepine

chemical scaffold. JQ1 was developed when compounds with the triazolothienodiazepine scaffold were

discovered to have anti-cancer properties in addition to anti-inflammatory effects, caused by

BET inhibition.36 The inhibitor was shown to possess BRD4 BD1 activity of 50 nM and activity

against NUT midline carcinoma, an aggressive form of childhood cancer, driven by BRD4-NUT

translocations. This discovery led to the therapeutic rationale for targeting BET proteins with

inhibitors for oncology.45 Subsequently, other areas of oncology have been explored for the

potential for therapeutic benefit from a BET inhibitor. It was found that the most active BET

target was BRD4, which, when inhibited, leads to the downregulation of transcription of genes

involved in cell proliferation, including the growth-promoting oncogene c-MYC, which is

a key driver in many cancer types.41 I-BET762 was developed in the same chemical

scaffold as JQ1 and validated against the downregulation of inflammatory gene expression in

LPS-stimulated macrophages.44 The main effect was the suppression of inflammatory cytokines

and transcription factors that lead to inflammation. This led to the therapeutic application in


inflammatory disease for BET inhibitors. Other BET inhibitors have since been developed with

more varying chemotypes.36,46

The therapeutic potential identified from BET inhibitor validation studies led to the

investigation of the other bromodomains as potential biological targets of interest. The histone

acetyl transferase (HAT) proteins CREBBP and EP300 act as co-activators for transcription

factors including p53, which is mutated in 50 % of cancers.47 Ischemin was developed as the

first inhibitor of CREBBP/EP300 and was shown to display anti-apoptotic phenotypes in

ischemic cardiomyocytes, demonstrated to be due to the post translational modification of

histones and inhibition of p53 gene activation by CREBBP after recognition of the acetyl lysine

mark. This inhibitor dually inhibits both the HAT and bromodomains in the CREBBP protein

to produce this effect.48 Since then, multiple inhibitors of CREBBP have been developed,49,50 and

CREBBP and EP300 dysfunction have been implicated in developmental disorders,

haematological malignancies, inflammation and neuropsychiatric disorders.34

In addition to the above, inhibitors for the other bromodomain-containing proteins have also

been produced. For PCAF, one inhibitor was shown to prevent the binding of HIV trans-

activator Tat, preventing HIV provirus transcription and replication.51 PCAF inhibitors are also

broadly associated with roles in cancers and neuro-inflammation, indications which are

included in patents for inhibitor molecules.34 For CECR2, GNE-886 was produced as a selective

inhibitor, suitable for use as an in vivo tool molecule.52 For BPTF, a moderately potent inhibitor

rac-1 was optimised53 to better elucidate the role of BPTF in various cancers including bladder

and colorectal cancers.34 BAZ2A and BAZ2B have been associated with prostate cancer,

lymphoblastic leukaemia and sudden cardiac death.34 Two inhibitors, the BAZ2-ICR compound54 and

GSK2801,55 have been developed for the BAZ2 bromodomains to further elucidate their roles in

disease. Multiple inhibitors of BRD9/BRD7 have been discovered, including LP-99,56 I-BRD9,57

and BI-7273 and BI-9564,58 which caused the selective down-regulation of genes implicated in

cancer and immune response pathways in acute myeloid leukaemia (AML).34,36,56,57 BRPF1,

BRPF2 (BRD1) and BRPF3 have been targeted by GSK6853,59 which contains a benzimidazolone

scaffold, and by NI-57,60 which contains a 1,3-dimethylquinolin-2(1H)-one scaffold. Dual

TRIM24/BRPF1b inhibitors containing the benzimidazolone scaffold were also discovered.61

ATAD2 inhibitors62–64 have recently been developed and the bromodomain has been linked to

a variety of different malignancies.34 PFI-3 was discovered65 as an inhibitor of the SMARCA2/4

and PB1 bromodomains, which are also linked to a number of cancers.34 In addition to selective

inhibitors, pan-BRD inhibitors have also been developed, including triazolophthalazines which


displayed activity for CECR2, BRD4, CREBBP, BRD9 and TAF1L, and Bromosporine, which was

active for BET, CECR2, TAF1, BRD9 and CREBBP.36 There is increasing evidence that some

existing kinase inhibitors are dual kinase-BRD inhibitors; inhibitors of JAK2, PLK1 and PI3K

have been found to also inhibit BET bromodomains.66,67

Structural Binding Site and Phylogenetics

61 bromodomains are encoded in the human genome, expressed in 46 unique proteins.68

Bromodomains are found in a range of proteins that are associated with other chromatin

functions including the histone acetyl transferases (HAT) and histone methyl transferase

(HMT) proteins.36 They are also found in transcription initiation factors and ATP dependent

helicases and in proteins containing other reader domains, including other bromodomains

within the same protein, as well as reader domains that recognise other marks including the

methyl lysine recognition domain (known as the PHD finger domain).69

Structurally, bromodomains have a highly conserved region in the acetyl lysine binding site,

known as the globular fold, which forms between a left-handed α-helical structure comprised

of four α-helices (αZ, αA, αB, αC) and two loop regions (known as the ZA and BC loops). The

hydrophobic binding site is nestled between these structures (Figure 1-3a).68 There are also

conserved residues across bromodomains which are involved in acetyl lysine binding, mediated

through water molecules, which have been observed in crystal structures. A tyrosine residue in

the ZA loop and an asparagine in the BC loop are usually present, except in a few exceptional

cases, where the ZA loop configuration is unusual.36 The asparagine

residue forms a hydrogen bond to the peptide, as shown in Figure 1-3b. There is also extensive

H-bond stabilisation to the peptide backbone from the bromodomain site. Interaction of

diacetylated peptides with one bromodomain can also occur and these often bind with higher

affinity than singly acetylated peptides.70


Figure 1-3: left (a), shows the standard structure of a bromodomain with four α-helices and two loops. Diagram was generated using PDB structure 3UVW in MOE. Structure is formed of a diacetylated histone 4 peptide (in pink) and BRD4 BD1 (in ribbon form). Right (b), shows the binding site with the conserved residues, tyrosine 97 and asparagine 240 (in teal).

Crystal structures are now available for 47 members of the bromodomain family, many of which

were delivered as a result of efforts from the Structural Genomics Consortium (SGC).38

Bromodomains have been organised into subtypes according to their sequence similarity. The

most structurally characterised types are I, VI and VIII. The families still lacking in structural

characterisation include members of groups III, V and VII. Inhibitors for the type II

bromodomain and extra-terminal (BET) BRDs were the first to be identified, consistent with

the functional information known about this subfamily. It is interesting to observe that

inhibitors have been produced for members of all families to date, suggesting that all

bromodomains could be druggable.

The BET family of bromodomains is the most explored family of bromodomains. These are

characterised by their two distinct binding sites, known as bromodomain 1 (BD1) and

bromodomain 2 (BD2). Structurally, BET bromodomains within the same protein are less

similar to each other (<45 % sequence identity) than the same domain between proteins (∼75

% identity). For this reason, it is difficult to obtain selective inhibitors for this family of

bromodomains, which explains why most of the drugs targeting these are pan-BET inhibitors.
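To make the notion of percentage sequence identity concrete, the sketch below computes it for two pre-aligned sequences. The function name, the choice to ignore columns containing a gap, and the toy sequences are all assumptions made for illustration; they are not bromodomain sequences and this is not necessarily how the identities quoted above were derived.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over aligned, non-gap columns.

    Assumes the two sequences are already aligned (equal length, gaps as '-').
    Columns where either sequence has a gap are ignored; other conventions
    for the denominator exist and give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = [(a, b) for a, b in zip(aligned_a, aligned_b)
                if a != gap and b != gap]
    if not compared:
        raise ValueError("no aligned residue pairs to compare")
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


# Toy aligned sequences (hypothetical).
seq_1 = "ACDEFG-HIK"
seq_2 = "ACDQFGLHIR"
identity = percent_identity(seq_1, seq_2)
print(f"{identity:.1f} % identity")
```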

Inhibitor Chemotypes

Structurally, inhibitors mimic the binding of acetyl lysine and therefore need to contain

functional groups which can exploit the key interaction of the peptide amide carbonyl with the

asparagine residue. Bromodomain inhibitors therefore have an equivalent acceptor which can

interact with this residue. Such chemotypes for the BET family include 1,2,4-triazoles,

quinolones, dimethylisoxazoles and thiazolinones.71 As mentioned above, five of these

structures are depicted in Figure 1-2 and whilst these compounds all bind to the same binding


site, they differ in their molecular properties including their acid/base properties (Table 1-1).

Lipophilicity is often calculated for the neutral species as the logP value;

however, this value does not consider the ionisation states of compounds at physiological pH,

which is why the logD value (the distribution coefficient at physiological pH, which accounts

for ionised species) should be considered and is often different to the clogP value. Although all

structures in Figure 1-2 are depicted as neutral, which will be the dominant species at

physiological pH for these compounds, these molecules will exist in equilibrium between charged

and uncharged species, which is governed by the pKa of the functional groups. For example, a

proportion of RVX-208 at pH 7.4 may be in the NH+ lactam tautomeric species since the pKa of this

group is 8.9, which might be responsible for the

decrease in logD in comparison to logP. Additionally, the phenol moiety in XD14 will be mostly

protonated, however, some of the species will be deprotonated and therefore charged since the

pKa of this group is at 8.0. Again, this is likely to result in the lower logD prediction in

comparison to logP. It should be noted here that these are calculated values and therefore the

error in these values may also contribute significantly to the differences observed. More

generally, quinazolones, thienopyridones, methyl pyrazoles, acetylindolizines and propylated

benzoxazepines, have been identified as acetyl lysine mimetics for other bromodomains.36

Further chemotypes known for bromodomains are shown and discussed in chapter 2.
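The relationship between logP, pKa and logD discussed above can be illustrated with the standard textbook approximation for a single ionisable group, which assumes that only the neutral species partitions into the organic phase. This is a sketch under that assumption, not the AstraZeneca model used for Table 1-1, so it should not be expected to reproduce the tabulated logD values. The function names are illustrative.

```python
import math


def fraction_neutral(pka: float, ph: float = 7.4, kind: str = "acid") -> float:
    """Fraction of a monoprotic compound present as the neutral species.

    kind="acid": neutral form is protonated (e.g. a phenol OH).
    kind="base": neutral form is deprotonated (pKa of the conjugate acid).
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    raise ValueError("kind must be 'acid' or 'base'")


def estimate_logd(logp: float, pka: float, ph: float = 7.4, kind: str = "acid") -> float:
    """Approximate logD, assuming only the neutral species partitions."""
    return logp + math.log10(fraction_neutral(pka, ph, kind))


# Example using the XD14 phenol pKa (8.0) and clogP (3.27) from Table 1-1.
neutral = fraction_neutral(8.0, ph=7.4, kind="acid")
logd = estimate_logd(3.27, 8.0, ph=7.4, kind="acid")
print(f"neutral fraction: {neutral:.2f}, estimated logD: {logd:.2f}")
```

Under this approximation the estimated logD can never exceed logP, and the two converge when the pKa is far from the pH on the side that favours the neutral species.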


Table 1-1: Calculated properties for the five bromodomain inhibitors shown in Figure 1-2. clogP and logD were calculated based on models within AstraZeneca and pKa values were calculated using the Advanced Chemistry Development, Inc. ACD/Labs software.72

Compound  clogP  logD  pKa/functional group         pKa/functional group         pKa/functional group
JQ1       4.82   3.97  1.9/NH+ diazepine            1.5/NH+ diazepine tautomer
I-BET762  3.29   2.76  15.5/NH amide                1.9/NH+ diazepine            2.4/NH+ diazepine tautomer
RVX-208   3.25   2.44  14.2/OH                      8.9/NH lactam                8.9/NH+ lactam tautomer
I-BET151  2.32   2.72  11.1/NH azabenzimidazolone   5.3/NH+ azabenzimidazolone   4.7/NH+ pyridine
XD14      3.27   2.71  18.5/NH amide                8.0/OH phenol


Selectivity Knowledge for Bromodomain Binding to Small Molecule Inhibitors

Recently, there has been an increase in numbers of chemical probes produced for the

bromodomain family of proteins, which has provided some information on specific target

residues important for obtaining activity at individual bromodomains or selectivity for one

bromodomain over other members of the target family. The structural binding sites of

bromodomains differ in specific residues and more broadly in their secondary and tertiary

structure. These features have been exploited for structure-based design of selective inhibitors

previously.

Table 1-2 provides a summary of residues in the active site of bromodomains which are currently

known to make interactions with small molecules, indicating that there is evidence to support

their importance for binding. The contents of the table were determined from a survey of the

literature and highlight the key differences in bromodomain sequence and structure which have

been exploited previously to design inhibitor and probe molecules for the bromodomain target

family. It is important to note that the crystallographic contents extracted here will be affected

by the quality of the crystal structures analysed. For all 1,492 bromodomain X-ray structures in

the protein data bank (PDB), we extracted the refinement details report (Supplementary Data

File 1). The mean residual value (R-value) for these structures was 0.19, with a standard deviation

of 0.03 which is around the typical expected value for the difference between the atomic model

and the experimental data, with only small differences between the R-value and R-free of 0.03,

a sign that the atomic model did not overfit to the experimental data.73 The mean resolution

was 1.78 Å with a standard deviation of 0.42 Å, showing that on average we can be confident about

the location of atoms in bromodomain crystal structures due to the well-defined electron

density maps that can be created from the detailed diffraction pattern.
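The summary statistics quoted above (mean and standard deviation of the R-value and resolution, and the gap between R-value and R-free) could be computed along the following lines. The records below are hypothetical placeholders rather than entries from Supplementary Data File 1, and the field names are assumptions.

```python
from statistics import mean, stdev

# Hypothetical refinement records; real values would be extracted from
# the refinement details of each PDB entry.
structures = [
    {"entry": "structure_1", "r_work": 0.18, "r_free": 0.21, "resolution_a": 1.5},
    {"entry": "structure_2", "r_work": 0.20, "r_free": 0.23, "resolution_a": 1.8},
    {"entry": "structure_3", "r_work": 0.19, "r_free": 0.22, "resolution_a": 2.1},
]


def summarise(values):
    """Return (mean, sample standard deviation) of an iterable of numbers."""
    values = list(values)
    return mean(values), stdev(values)


r_mean, r_sd = summarise(s["r_work"] for s in structures)
res_mean, res_sd = summarise(s["resolution_a"] for s in structures)
# A small mean gap between R-free and the R-value suggests limited overfitting.
gap_mean = mean(s["r_free"] - s["r_work"] for s in structures)

print(f"R-value: {r_mean:.2f} +/- {r_sd:.2f}")
print(f"Resolution: {res_mean:.2f} +/- {res_sd:.2f} angstrom")
print(f"Mean R-free minus R-value: {gap_mean:.2f}")
```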

Notably, it can be observed that there are differences in the numbers of interacting residues

between bromodomains, with BRD4 BD1, CREBBP and BRD9 having the highest number of

known interacting residues, due to the number of publications and crystal structures for these

bromodomains. For some bromodomains, how to optimise interactions between small

molecules and the protein to obtain activity is largely unknown.


Table 1-2: Residues highlighted from the literature to have roles in determining binding affinity for different bromodomains. Not included here are the interactions with the conserved asparagine and tyrosine residues, since this interaction is common across all bromodomains studied.

Bromodomain  Activity/Selectivity Residues                                      References
BPTF         F2887, D2834, W2824                                                53
CECR2        Y520, P458, F459, M506, W457                                       52,74
KAT2A        Y814, E761, P752, W751                                             75
PCAF         Y809, E756, P747, W746, V752, P751, K753                           75,76
BRD2 BD1     Q101, D160, I162                                                   31,77
BRD2 BD2     K374, H433, V435                                                   31,77
BRD3 BD1     Q61, D120, I122                                                    31,77
BRD3 BD2     K336, H395, V397                                                   31,77
BRD4 BD1     Q85, D144, I146, Y139, L92, W81, L97, P82, F83                     31,77–80
BRD4 BD2     K378, H437, V439, Y432, N433, P375, P376                           31,77,78,80
BRDT BD1     I115                                                               31,77
BRDT BD2     V357                                                               31,77
BRWD1 BD2    None known
CREBBP       R1173, L1109, V1174, L1120, P1110, F1111, Q1113, L1119             81–84
EP300        R1137, L1073                                                       81–83
ATAD2        R1007, V1008, R1077, D1071, D1014, I1074, V1018, Y1063             85–87
ATAD2B       N981, I982, R1051, D1045, D988, I1048, V992, Y1037                 86,87
BRD1         F648, S592, V647, Q589, Y649                                       60,88–90
BRD7         Y217, A154, S157                                                   57,58
BRD9         Y106, F44, I53, A54, G43, F45, V49, H42, R101, A46, F47, P48, T50  56–58,91
BRPF1        F714, P658, I713                                                   60,88–90
BRPF3        F675                                                               88,89
BAZ2A        W1816, V1879, E1820                                                55,92
BAZ2B        W1887, I1950, L1891                                                55,92
TIF1A        A923, L922, F924, V932, V928, V986, P929, E985                     61,93
TRIM33       None known
TAF1 BD2     N1533, W1526                                                       74,94
TAF1L BD2    None known
PB1 BD5      A703, M699, L687, L693, I745, I683, F684, M731                     34,65,95
SMARCA2      L1418, P1413, E1417                                                65
SMARCA4      L1494, P1489, E1493                                                65

Challenge of Toxicity

As mentioned in the section Primary and Secondary Pharmacology of Drugs, unanticipated

toxicity in human trials is one of the major causes of drug attrition.96 In total, over 30 % of drugs

are terminated due to animal or human adverse events.97 Efforts are therefore being made to

assess toxicity in more relevant models prior to clinical trials in order to prevent investment

into molecules which will be later terminated for toxicity. This is consistent with the concept


that it is better to fail early to reduce costs and avoid unnecessary risk to human volunteers and

patients.98

Mechanisms of Toxicity

“The dose makes the poison” was a principle outlined by Paracelsus,99 which described the fact

that all chemicals are poisons at high enough doses. Nowadays, it is generally considered that

there are five main types of drug-induced toxicities, namely on-target toxicity, hypersensitivity

and immunological reactions, biological activation to toxic metabolites, idiosyncratic toxicities

and off-target pharmacology.100,101

On-Target Toxicity

This can be described as on-target effects that occur in the wrong organ or cell type sometimes

due to elevated doses. A good example of this is statin toxicity, which eventually led to the

market withdrawal of cerivastatin.102 Statins act on the liver to reduce synthesis of cholesterol

through inhibition of the isoprenoid pathway; however, in the muscles, inhibition of the same

mechanism can lead to excess myoglobin which can cause kidney failure.100 This is an example

of the same mechanism having different effects in different tissues.

Hypersensitivity and Immunological Reactions

These reactions often result from long term exposure to a drug and are unpredictable.103 These

reactions are split further into four categories: type 1 are IgE mediated reactions and cause

anaphylaxis, asthma and urticaria, type 2 are immunoglobulin-mediated cytotoxic mechanisms

which can lead to abnormality in blood cells, type 3 are immune complex mediated, for example

vasculitis from which inflammation can cause the destruction of blood vessels, and type 4 are a

range of reactions mediated by T-cells, causing skin rashes known as delayed hypersensitivity

reactions.104 Common agents associated with these hypersensitivity and immunological

reactions include penicillins and other β-lactam antibiotics.101

Biological Activation to Toxic Metabolites

Biological activation toxicity involves the biotransformation of the drug into toxic metabolites.

Since drug molecules are often lipophilic in nature due to their need to permeate across the

lipid bilayer, many of the biotransformations performed by enzymes in the body act to make

drug molecules more hydrophilic to aid excretion from the body. The enzymes responsible for

biotransformation act on a broader range of chemical structures than therapeutic targets,

which explains why different classes of drug molecules can often interact with the same

cytochrome P450 enzymes.105 Biotransformation occurs in two steps, namely phase 1 and phase


2 reactions and occurs primarily in the liver. Phase 1 reactions involve oxidation, reduction and

hydrolysis reactions,101 whereas phase 2 bioconjugates the resulting phase 1 metabolite with an

endogenous high polarity molecule. Primarily, these transformations are designed to detoxify

and aid excretion of drugs, however, in some cases toxic metabolites can form.

Biotransformed metabolites fall into four main categories based on mechanism. The first type

is the conversion to a stable, but toxic metabolite. This is rare for drug molecules; however, an

example of this type of toxicity is demonstrated by the oxidation of dichloromethane (a common

solvent) to carbon monoxide, by cytochrome P450 enzymes, which subsequently complexes

with haemoglobin to form carboxyhaemoglobin, restricting oxygen transport.106 The second

type is the conversion of a chemical to a reactive electrophile metabolite and is by far the most

common bioactivation toxicity mechanism. This can lead to cytotoxicity and carcinogenicity,

due to the reaction of the resulting electrophile with endogenous nucleophiles from proteins,

lipids or DNA tissue components, forming a covalent interaction which can eventually lead to

tissue degradation and necrosis.107 This is also thought to contribute to oxidative stress, due to

glutathione depletion.101 The third type of biotransformation is the conversion to a free radical,

a highly reactive species which can induce cascades of subsequent reactions, including the

oxidative modification of proteins and lipid peroxidation.105 The fourth type of transformation

involves the formation of reactive oxygen radicals, which have consequences for oxidative stress.

With redox cycling and high concentrations of oxygen radicals, reduced oxygen metabolites can

form, which can initiate diseases including arteriosclerosis and polyarthritis.105

Idiosyncratic Toxicities

Idiosyncratic toxicities are those which occur very rarely (1 in 10,000

individuals)101 and whose mechanisms are therefore not well known. Their rarity is attributed

to the genetic predisposition of some individuals to certain toxicities. These toxicities

often become apparent when the drug is exposed to an increased number and diversity of

patients, often after drugs have been marketed or in late-stage clinical trials, making their

prediction very difficult, and very few relevant animal models are available.108 These

toxicities are hypothesised to be due to low-level infections, loss of mitochondrial function and

inhibition of RNA polymerase.


Off-Target Pharmacology

As discussed in the section Types of Selectivity, when undesired secondary targets are

modulated by the drug, this can lead to toxicity.101 This will be discussed in more detail in the

next sections as it forms the basis for many in vitro toxicity screens.

Drug-Drug Interactions

Another source of toxicity can come from the interactions between one drug and another,

termed drug-drug interactions (DDIs). Increasingly, patients are taking many drugs for

different indications, and so it is important to assess toxicity with relation to other common

medications taken within the patient population. Drug interactions occur when the concurrent

use of one drug affects the level or activity of another drug.109 Changing the

concentration of drug in the body from the approved concentrations can have effects on toxicity

and efficacy. An example of a DDI includes the concurrent treatment with sulfonylureas such

as glyburide for diabetes, in addition to sulphonamide antibiotics, which can cause

hypoglycaemia due to the inhibition of the metabolism of glyburide by the cytochrome P450 2C9

(CYP 2C9) enzyme.110

Toxicity Assessment in Drug Development

Toxicity studies are conducted at various stages of the drug discovery process.97 There are three

main stages of toxicity determination, namely in vitro assays, in vivo animal studies and clinical

human studies. Many of the requirements for toxicity testing are mandated by national

regulatory bodies such as the Food and Drug Administration (FDA) and the Medicines and

Healthcare products Regulatory Agency (MHRA).

In Vitro Toxicity

Off-Target Toxicity Assessment

As mentioned previously, off-target interactions of a molecule can lead to unwanted toxicity. It

is not currently mandated that off-target pharmacology screening be implemented, except for

testing against the human ether-a-go-go (hERG) potassium channel to anticipate cardiac

arrhythmias.111,112 When hERG is inhibited, long QT syndrome and eventual sudden cardiac

death can occur.113,114 Due to the severity of the potential adverse event in humans, and the

number of drugs which have been withdrawn due to hERG effects,97 this target is now

considered high priority for early liability assessment. Despite this being the only mandated off-

target assessment, selected off-targets known to be associated with toxicities in in vivo settings

are often screened for in early stages of drug discovery to mitigate against downstream risks.


Panel screening in vitro against targets associated with a variety of toxicities has been

highlighted as important for the de-risking of compounds for late-stage toxicities.21 The ability

to conduct medium-to-high throughput screens enables the fast profiling of safety targets.

Several initiatives have defined safety targets which might be appropriate to include in such

studies, including the Tox-21 target panel of in vitro quantitative HTS assays,115 Bioprint,116 the

Novartis safety panel published by Lounkine et al., 2012,117 and the panel-44 safety screen which

uses the targets published by Bowes et al., 2012.118 The study by Lynch et al., 2017119 summarizes

the evidence for adverse events induced by agonism/activation and antagonism/inhibition of

each of 70 off-targets included in AbbVie’s in vitro screening panel.120 For targets to be

considered as off-targets, a link between a drug induced adverse event and the off-target should

be made. Since not all targets can be incorporated into a screening panel, several

factors should be used for prioritisation, including 1. the severity of the adverse event associated

with the targets; 2. the hit rate of the targets; 3. the selection of targets across target families to

increase panel diversity; and 4. the assay formats being suitable for high throughput and high

concentration screening.121

In addition to hERG, the activation or inhibition of cytochrome P450 enzymes is also monitored

closely at an early stage, as these enzymes are responsible for the metabolism of many drugs

and are therefore key to whether the drug survives liver metabolism and reaches the tissue of

action in sufficient concentration to sustain the desired effect.122 Furthermore, inhibition or

activation of these targets is important for the understanding of potential DDIs.123

Cellular Toxicity Assessment

It is also important to assess drug toxicity in a cellular context to determine if a toxic phenotype

occurs in mammalian cells. Cytotoxic compounds can affect the viability of in vitro cultured

cells in different ways including interference with cellular attachment, significant changes in

cell morphology, changes in cell growth rate or induction of cell death.124,125 Many assays have

been developed for the in vitro assessment of cytotoxicity and cell viability, including

permeability assays measuring membrane integrity, functional assays, for example to measure

mitochondrial function/energy capacity, morphological assays including measurements of

changes to cellular support structures and osmotic properties and reproductive assays

measuring cell numbers and capacity for colony formation by cell division.126,127

Toxicity on a pathway level can also be evaluated in vitro. Many cellular assays have been

designed in a high throughput fashion to determine pathway-based toxicities, for example in

the Tox-21 database the induction of hypoxia is measured, which is the reduction in the tissue

oxygen tension. This can be brought about by different mechanisms which are captured by the

endpoint of hypoxia-inducible factor 1a activity.115

Genotoxicity Assessment

Genotoxicity is defined as the effect of a chemical on the cell’s genetic material, which leads to

a loss of integrity.128 Genotoxins are classified into three classes: carcinogens, mutagens and

teratogens, which cause cancer, mutations and birth defects, respectively. There are three main

tests employed in vitro for assessment of the genotoxicity of drugs. These include the Ames

reverse mutagenesis test,129 the micronucleus clastogenicity assay130,122 and the mouse

lymphoma thymidine kinase (tk) gene mutation assay.131 The Ames test is the long established

test to identify carcinogens.129 This is conducted by applying a compound to the mutated form

of Salmonella typhimurium bacteria, which is unable to synthesise histidine. With a genotoxic

substance, a second mutation arises which reverses the effect of the original mutation and

enables the synthesis of histidine.132 Clastogens are agents which induce structural aberrations

to chromosomes.133 The micronucleus test counts the number of abnormal cells in a tissue

sample harbouring a micronucleus which is a marker of chromosomal aberration.134 A positive

result in both the micronucleus and the Ames tests is predictive of in vivo mutagenicity or

carcinogenicity.135 The mouse lymphoma cell assay can be used in some cases to identify

mutagens or clastogens and measures the degree of mutation or damage to the thymidine kinase

(tk) locus of mouse lymphoma tk cells, by measuring how resistance to trifluorothymidine

nucleoside changes.131 This assay was found to be less useful in detecting carcinogens than the

other two assays combined.135

Organ Toxicity

The cellular methods above have limitations in the prediction of toxicity because these methods

do not retain the original organ functions and morphologies when cultivated in vitro.136 The

traditional approach to overcome this is to use whole animal models (described in In Vivo

Toxicity); however, ethical considerations as well as species differences between animals and

humans have resulted in the development of in vitro methods which can determine organ or body

toxicities. The recent technology established in this regard involves the concept of organs-on-

a-chip.136,137 This technology uses sterile microfluidic components containing multiple cell types

which would be found in a tissue, connected by synthetic blood vessels and allowing the influx

of the drug into the tissue.136 The microenvironment for the cells can be adjusted by changing

the pressure (using vacuums e.g. for lung or heart tissues), or the supply of nutrients to the cells

to mimic the endogenous environment in the tissue. Furthermore, components representing

individual organs can be pieced together to mimic the entire human system and its

connectivity.137

In Vivo Toxicity

In vivo toxicity screening is still considered the standard for assessment of the toxicological risks

in humans, due to the ability to administer the drug to the whole system of the animal, rather

than in an isolated target-based or cell-based assay.98 However, when compared to in vitro

screening for toxicity assessment, in vivo studies are lower throughput. For an in vivo study to

be practical for use in early drug discovery it needs to have rapid turnover of results and require

low compound amounts.98 Other challenges in in vivo toxicity studies include the variation

between species in biology, pathophysiology and pharmacokinetics, which affects the degree of

translation of toxicity between animals and humans,98 as well as the initiative to replace, reduce

and refine the use of animals in drug development for ethical reasons.138,139

Initial in vivo toxicity measurements come from the early animal studies to establish efficacy

and pharmacokinetic properties of the drug where the drug is administered at a single dose to

rodent and non-rodent species.98 This means that any toxicities observed will be at

pharmacological dose. Following from the initial in vivo studies, designated acute toxicity

studies are conducted which explore escalated dosing to determine the maximum tolerated

doses (MTDs) of the drug. The determined MTDs are then investigated in repeated-dosing

schedules within the same species. These studies are designed to identify toxicities from

repeated exposure, identify affected organs and to determine the dose for each toxicity.140

Repeated Dose Toxicity (RDT) studies lead to quantification of No Observed Adverse Effect

Level (NOAEL) and Lowest Observed Adverse Effect Level (LOAEL) measurements in animal

models which are often used to estimate a safe dose to administer in clinical trials.141 The NOAEL

is the highest dose for which no adverse effects were observed and the LOAEL is the lowest dose

where an adverse effect was observed.142 The primary measurements made in these studies

include body weight, clinical measurements, histological evaluation of tissues, evaluation of

clinical chemistry and haematology. In addition, toxicokinetic measurements allow the

determination of exposure levels associated with effect and no effect levels of toxicity. Before

clinical trials, acute toxicity and repeated-dose toxicity studies need to be conducted in two

species including one non-rodent species.143 In addition, individual drug and therapeutic

indication knowledge will guide the investigation of other potential toxic liabilities in in vivo

animal studies. Other studies might include the investigation of genotoxicity, carcinogenicity,

reproductive toxicity, immunotoxicity and drug abuse liabilities.143 Safety pharmacology studies are

conducted to assess the risk of on- or off- target pharmacology, for example dog telemetry

studies are used to assess cardiovascular liabilities.140
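The NOAEL and LOAEL definitions given above can be sketched in a few lines of Python. This is a minimal illustration rather than a regulatory procedure: the function name `noael_loael`, the dose groups and the observed outcomes are all hypothetical, and the sketch assumes that each dose group is simply flagged as showing an adverse effect or not.

```python
def noael_loael(observations):
    """Return (NOAEL, LOAEL) from {dose: adverse_effect_observed} pairs.

    LOAEL: lowest dose at which an adverse effect was observed.
    NOAEL: highest dose below the LOAEL with no adverse effect observed
           (identical to the highest no-effect dose when responses are monotonic).
    Either value is None if it cannot be determined from the data.
    """
    effect_doses = [dose for dose, effect in observations.items() if effect]
    loael = min(effect_doses) if effect_doses else None
    clean_doses = [
        dose for dose, effect in observations.items()
        if not effect and (loael is None or dose < loael)
    ]
    noael = max(clean_doses) if clean_doses else None
    return noael, loael


# Hypothetical repeated-dose study: dose (mg/kg/day) -> adverse effect observed?
study = {1: False, 3: False, 10: False, 30: True, 100: True}
noael, loael = noael_loael(study)
print(f"NOAEL = {noael} mg/kg/day, LOAEL = {loael} mg/kg/day")
```

In this made-up study the NOAEL is 10 mg/kg/day and the LOAEL is 30 mg/kg/day. Real studies grade effects by severity and incidence rather than using a single yes/no flag per dose group.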

Clinical Toxicity

Toxicity Assessment in Clinical Trials

Clinical trials are usually split into different phases and toxicity is monitored and assessed

continually and according to the study design.

The objectives of phase 1 clinical studies are to establish the pharmacokinetics, tolerability and

safety of the drug in a small number of human subjects. Sub-therapeutic doses are initially

administered to a small number of human subjects who are monitored for adverse effects.144

Pharmacokinetic exposure is determined to establish the most appropriate incremental increase

of dose. Dose escalation continues until the expected pharmacological exposure range is

achieved or exceeded. Where the expected pharmacological exposure is not known, dose

escalation can continue up to the adverse effect level as determined by animal toxicity studies

or to the actual maximum tolerated dose in humans defined by the onset of adverse events.145

Serious adverse events at any dose level will usually result in the suspension of dosing or

termination of the study.

Phase 2 clinical trials are designed to assess the therapeutic efficacy of the drug in a larger group

of patients who have the target disease.144 There are different designs for a Phase 2 trial

depending on the outcomes required.146 A comparison of the efficacy with historical controls,

experimenting with different dosing regimen and randomisation of subjects between drug and

placebo help to inform the design of the subsequent phase 3 efforts.147 Toxicity and

pharmacokinetics are also monitored throughout all stages of the trials and incidences of

adverse events are strictly reported and regulated.

Because endpoints generated from phase 2 clinical trials often lack statistical power owing to

the relatively small sample sizes, phase 3 clinical trials are mandated by regulators

to assess the drug in a wider cohort of patients with the disease.144 Most phase 3 clinical trials

are structured as a comparative efficacy trial of the drug against placebo; however, some drugs

are compared to the current standard-of-care, which is known as the equivalency trial structure.148

Randomisation of patients between the treatment groups or the placebo or standard-of-care

groups is conducted to control for the effect of confounding factors on trial results. These may

include concomitant medications, comorbidities, or demographic features such as age, gender

or socioeconomic factors.149 Again, toxicity will be assessed throughout the trial and adverse

events reported and monitored for patients in both arms of the trial. The adverse events

reported may be different in frequency or severity to those experienced in smaller cohorts in

previous trials.

Clinical Adverse Events

Adverse events (AEs) are defined by the World Health Organisation (WHO) as “any untoward

medical occurrence in a patient or clinical investigation subject administered a pharmaceutical

product and which does not necessarily have to have a causal relationship with this

treatment”.150 As mentioned in Mechanisms of Toxicity there are a variety of mechanisms of

drug-induced toxicity which could be responsible for the AEs in humans. An important factor

to consider is the causality of AEs, as toxicity experienced after administration of the drug can

be due to mechanisms of the drug itself, as well as other confounding factors, and it is not always

trivial to distinguish between these situations. For example, the underlying disease and any

concomitant medications could also be linked to the AE.151 Serious adverse events (SAEs) are a

class of AE which result in either death, hospitalisation, significant disability or a congenital

abnormality or birth defect.152

Reporting of Adverse Events from Clinical Trials

AEs attributed to the drug are reported during clinical trials as part of the pharmacovigilance

framework. These AEs are reported as per requirements in the country in which the clinical trial is

conducted. For the UK, the Medicines and Healthcare products Regulatory Agency (MHRA)

requires that sponsors and clinical study investigators report all serious AEs to the European

Medicines Agency (EMA) to be collated into a database as Individual Case Safety Reports

(ICSRs).153 Likewise, the Food and Drug Administration (FDA) requires the reporting of

unanticipated and serious AEs in clinical trials to the Institutional Review Board (IRB).154 Other

information is documented in the medical records of the study participants and this is

transferred to a case report form, which is collated into pharmacovigilance databases. These

databases are often coded to a medical dictionary of AE terms to aid analysis. The Medical

Dictionary for Regulatory Activities (MedDRA) provides a 5-level hierarchy and is used for this

purpose.155 This is the main dictionary used in the UK, US and Japan to report AEs from clinical

trials. The highest level of this hierarchy is the System Organ Class (SOC) and the lowest level

terms are more likely to be the original terms described in a document.156 Finally, the published

trial results on ClinicalTrials.gov will include the incidence of AEs for the drug.157,158 In this

database, it is now required that the adverse event term, organ system, type of assessment

(systematic vs non-systematic) and the number of participants affected are included in the

report.159

Attrition due to Toxicity

Toxicity in clinical trials can result in termination of the investigational drug. The frequency of

attrition can be broken down into target organs or tissues and was reported in the study by

Guengerich et al.,97 based on information from DuPont-Merck and Bristol-Myers Squibb

between 1993 and 2006. These results (Table 1-3) show that cardiovascular and liver toxicity

findings were responsible for > 42 % of drug attrition due to toxicity. Recent examples of drugs

withdrawn for toxicity observations include the pain medications flupirtine, which was

withdrawn for liver toxicity in 2018,160 and propoxyphene, withdrawn for cardiovascular toxicity

in 2010.161

Table 1-3: Percentage of molecule attritions in clinical trials which occurred due to a toxicity classified by each target organ or tissue.

Target organ or tissue                      Percentage of all advanced molecules (%)
Cardiovascular                              27.3
Liver                                       14.8
Teratogenicity                              8.0
Hematologic                                 6.8
Central and peripheral nervous system       6.8
Retina                                      6.8
Mutagenicity/clastogenicity                 4.5
Reproductive toxicity                       4.5
Gastrointestinal/pancreatic                 3.4
Muscle                                      3.4
Carcinogenicity                             3.4
Lung                                        2.3
Acute death (unspecified cause)             2.3
Renal                                       2.3
Irritant                                    2.3
Skeletal (arthritis/bone development)       1.1
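The "> 42 %" figure quoted for cardiovascular and liver findings can be checked directly against Table 1-3. The short sketch below transcribes the table into a dictionary and sums the relevant rows; the variable and function names are illustrative only.

```python
# Percentages transcribed from Table 1-3 (target organ or tissue -> %).
attrition_pct = {
    "Cardiovascular": 27.3,
    "Liver": 14.8,
    "Teratogenicity": 8.0,
    "Hematologic": 6.8,
    "Central and peripheral nervous system": 6.8,
    "Retina": 6.8,
    "Mutagenicity/clastogenicity": 4.5,
    "Reproductive toxicity": 4.5,
    "Gastrointestinal/pancreatic": 3.4,
    "Muscle": 3.4,
    "Carcinogenicity": 3.4,
    "Lung": 2.3,
    "Acute death (unspecified cause)": 2.3,
    "Renal": 2.3,
    "Irritant": 2.3,
    "Skeletal (arthritis/bone development)": 1.1,
}


def combined_share(table, organs):
    """Sum the percentage contributions of the named organs or tissues."""
    return sum(table[organ] for organ in organs)


cardio_liver = combined_share(attrition_pct, ["Cardiovascular", "Liver"])
total = sum(attrition_pct.values())
print(f"Cardiovascular + liver: {cardio_liver:.1f} %")
print(f"Sum over all categories: {total:.1f} %")
```

The two largest categories together account for 42.1 % of the toxicity-related attrition, and the sixteen rows sum to 100.0 %, so the table is internally consistent.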

Toxicity Translation

Testing new chemical entities (NCEs) in animal models is a regulatory requirement for toxicity

assessment before administration of a drug to humans in clinical trials. The pharmaceutical

industry has long been evaluating the value of these models for this purpose, in the search for

new information that can be used for toxicity assessment. From an analysis of the attrition of

drug candidates from four pharmaceutical companies, it was found that 40 % of drugs were

terminated due to non-clinical toxicology findings and 11 % due to clinical safety findings.162

Safety was found to be the highest contributor to attrition in both preclinical and Phase 1

studies.162 In an AstraZeneca study conducted on drug projects between 2005–2010, it was

found that, of those drug projects terminated preclinically, 82 % were due to toxicity.163

Furthermore, of those drug projects which passed into clinical trials, 62 %, 35 % and 12 % of

project terminations were due to toxicity in phase 1, phase 2 and phase 3, respectively.163

Since this paper was published there has been a renewed focus on 5 technical determinants of

drug success (the 5Rs) within AstraZeneca, one of which was focusing on the “right safety”

testing. The results of these efforts have led to the reduction in project closures due to safety

between 2012 and 2016, to 38 % for phase 1 and 8 % for phase 2.164

Since clinical toxicity is still a source of drug attrition, it is clear that the predictivity of animal

models is currently not sufficient to anticipate later compound toxicity in man. Toxicological

findings in preclinical studies are not always observed in the clinic and thus the development

of potential therapeutics could be terminated before progression to the clinic, causing

unwarranted attrition.165 Conversely, there may be no toxicological findings observed

preclinically, but when clinical studies are conducted new toxicities are uncovered.166 The

reasons for the lack of concordance in both directions can often be attributed to the increased

heterogeneity of patients in clinical trials,166 lack of relevance of the model organism or model

measurement to the human toxic endpoint,167 differences in toxic or detoxifying mechanisms

between species,168 or differences in exposure,169 amongst other effects. An example of a toxicity

in humans which is poorly predicted because of differences in species physiology is vomiting.

Rats do not have a vomiting response, and therefore vomiting in humans is hard to predict using

rodent studies and is instead predicted by taste aversion/food avoidance responses in rodents

or ferret or dog emesis models.170,171 Nausea is another example of a common side effect

experienced in humans, which is not easily recognisable in animal models, but which poses a

significant challenge to drug development due to patient compliance. In fact, a study conducted

by Parkinson et al.,172 explores nausea prediction in humans in relation to gastro-intestinal (GI)

AEs experienced in animals, finding that combinations of preclinical GI effects were strong

predictors of nausea. Teratogenicity does not always correlate between animals and humans,

for example corticosteroids are teratogenic in animal models but not in humans,173 and

conversely thalidomide is a teratogen in humans but not in many animal species, due to

metabolism differences.174 Other side effects that are difficult to observe in animals are those associated

with a psychological condition that might be a unique disorder to the human condition, for

example schizophrenia,175 although behavioural end-point rodent models have been developed

to provide some early evidence in animals of changes in mesolimbic dopamine function relevant

in schizophrenia.176

Despite this lack of correlation for some toxicities, others are more concordant between animals

and humans, for example cardiotoxicity.169,177 For these cases, the occurrence of drug-induced

toxicity preclinically may halt progression of an investigational drug due to an unacceptable risk

to patients or provide vital information that allows an informed decision to be made on patient

monitoring and inclusion/exclusion criteria for clinical trials. Sometimes drugs are continued

despite toxicity flags in preclinical studies. Reasons to continue a drug despite preclinical

toxicity are numerous and include, for example, low severity of the toxicity, the risk-benefit

analysis of the disease to be treated versus the toxicity, as well as evidence that the toxicity may

only occur in subsets of patients which can be excluded from the study or highly monitored

during the study. In particular, cancer drugs, antibiotics, antidepressants and antipsychotics are

drugs which are often taken forward despite cardiovascular risk factors, on a risk-benefit basis.96

Increasingly, there is a drive towards the reduction of animal usage in support of the 3Rs

initiative,138 by prioritising the use of only the most toxicologically relevant animal models, as

well as implementing and developing in vitro alternative models, such as discussed in In Vitro

Toxicity, including safety target screening panels and organs-on-a-chip technologies.

1.1.2 Summary

In this section of the introduction, we outlined the general context for this project and

highlighted the need to produce selective and safe drug molecules. The current drug discovery

set-up allows for the assessment of selectivity and toxicity, however, there is still progress to be

made.

Gaining target family selectivity remains a challenge due to the similarity of proteins with

diverse biological roles and lack of knowledge for designing selective molecules. On average,

drugs are active against six targets (see Measuring Selectivity), which can lead to undesired

effects, including toxicity. We furthermore introduced the bromodomain-containing proteins

with respect to their functional importance and structural composition, as well as highlighted

existing knowledge for designing selective bromodomain inhibitors. This thesis will explore this

target family in detail with respect to the prediction and understanding of selectivity profiles.

The translation of toxicity between animals and humans still represents a challenge, as many

adverse events do not translate across species for a variety of reasons. This lack of predictivity,

as well as the initiative to reduce animal testing, leads to the need to provide better in vitro

alternatives for clinical toxicity prediction.

This thesis addresses these general challenges by the implementation of in silico approaches to

generate new knowledge within these areas.

1.2 General Computational Methods

In this section we highlight the general computational methods that were implemented within

this thesis related to the field of data science, including machine learning algorithms and

statistical methods.

1.2.1 Data Science Methods

Data science is the study of extracting generalizable knowledge from data by unifying statistics,

machine learning and their related methods.178,179 This field can be thought to encompass other

related concepts including artificial intelligence (AI) and machine learning and the more

recently popular method of deep learning (Figure 1-4), as well as statistics and data mining.

Figure 1-4: The field of data science and its connection to the fields of artificial intelligence, machine learning, deep learning, data mining and statistics.

AI is “the effort to automate intellectual tasks normally performed by humans” and originated

in the 1950s with the idea that computers might have the ability to think like a human to solve

problems in a variety of fields.180 Machine learning is a branch of AI which learns rules through

pattern recognition and statistical algorithms in an automated fashion from input data and their

expected answers. The difference between machine learning and the broader field of AI (which

also includes symbolic AI) is said to be that a machine learning system is trained by known

input and output data to find rules, rather than programmed with input data and human

inputted rules to produce an output (Figure 1-5).180

Figure 1-5: The difference between machine learning and symbolic AI.
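The contrast drawn in Figure 1-5 can be made concrete with a toy example. In the sketch below, the symbolic approach applies a rule written by a person, whereas the machine learning approach recovers a rule (here just a single threshold) from input data and their expected answers. The feature (a potency value), the cut-off of 6.0 and the training data are all hypothetical and chosen purely for illustration.

```python
# Symbolic AI: a human supplies the rule; the program applies it to input data.
def rule_based_is_active(potency, threshold=6.0):
    """Hand-written rule with a threshold chosen by a person (illustrative)."""
    return potency >= threshold


# Machine learning: the rule is found from input data plus expected answers.
def learn_threshold(values, labels):
    """Return the candidate threshold that classifies the training data best."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(values):
        correct = sum((v >= candidate) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical training data: potency values and their known activity labels.
potencies = [4.2, 5.1, 5.8, 6.5, 7.0, 7.9]
is_active = [False, False, False, True, True, True]

learned = learn_threshold(potencies, is_active)
print("Learned threshold:", learned)
print("Rule-based prediction for 6.8:", rule_based_is_active(6.8))
print("Learned-rule prediction for 6.8:", 6.8 >= learned)
```

Both approaches end with a rule that can be applied to new input data. They differ only in where the rule comes from.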

Machine Learning Models

As defined above, machine learning is a field within AI which is concerned with learning rules

through pattern recognition and statistical algorithms in an automated fashion from input data

and their expected answers. A model is said to be trained on the input data and the expected

answers, which can then be applied to predict for new examples.180

In this section we describe the algorithms which are commonly used in machine learning

applications, with examples related to drug discovery.

There are different categories of machine learning problems. In this section we focus on

supervised machine learning frameworks, which are the most relevant to the content of the

thesis. Supervised machine learning involves feeding the model the desired solutions known as

the output labels as well as the input variables, whereas for unsupervised frameworks the

training data is unlabelled, and the learning algorithm tries to learn rules between input

variables, e.g. clustering or association rule mining.181,182

For supervised machine learning methods, tasks can be split into either regression or

classification problems.

This thesis is concerned with classification and so this will be discussed in the next section.

Classification Models

Classification models are built on data which has an output label of a discrete class, where there

are K possible classes to which instances can be assigned.183 Classes can be nominal or ordinal

and ordinal classes are often derived from continuous data to simplify a regression problem, for

example classifying into the levels “low”, “medium” and “high”.
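Deriving ordinal classes from a continuous variable amounts to choosing cut-offs and binning. A minimal sketch, in which the cut-off values and measurements are arbitrary placeholders rather than values used in this work:

```python
def to_ordinal_class(value, low_cutoff, high_cutoff):
    """Map a continuous value onto the ordered levels low < medium < high."""
    if value < low_cutoff:
        return "low"
    if value < high_cutoff:
        return "medium"
    return "high"


# Hypothetical continuous measurements and illustrative cut-offs.
measurements = [0.2, 1.5, 3.7, 9.4]
classes = [to_ordinal_class(v, low_cutoff=1.0, high_cutoff=5.0) for v in measurements]
print(classes)
```

The choice of cut-offs determines the class balance, so in practice they would be set from domain knowledge or from the distribution of the data.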

Classification tasks can come in different forms, as illustrated in Figure 1-6, including binary

classification, multi-class classification, multi-label classification184 and multi-output

classification.185,186

Figure 1-6: Adapted from Read et al., 2015.185 L is the number of labels and K is the number of values that each label variable can take.

Binary classification is the most commonly encountered form of classification, where there is one label,

and there are two values which this label can take, for example for the label “drug biological

activity” there could be two values “active” or “not active”. Multi-class classification presents a

similar problem to binary classification except that the label can now take more than two values,

for example there could be three classes such as for a label “type of drug activity” which could

be “inhibition”, “activation” or “none”. Note that for both cases the value of the label must

belong to only one of the K options and these “classes” are mutually exclusive.

For multi-label classification, the number of labels is now increased, however each label itself is

binary (K=2). The output could be now a combination of two labels, for example the presence

and absence of two toxicity types for one drug. The two labels might be “cardiotoxicity” and

“renal toxicity” and each label has two classes “present” and “absent”. Note that now one

instance, in this example one drug, has two classifications; either presence or absence of

cardiotoxicity and presence or absence of renal toxicity. This type of classification predicts a

profile of binary label occurrences.

The fourth type of classification is multi-label, multi-class classification, which extends the

approach to the case where K > 2 and L > 1. This could be, for example, when each of our two toxicity

labels has three options: “severe toxicity”, “medium toxicity” or “no toxicity”. This is called

multi-output classification.
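The four task types described above are distinguished only by the number of labels L and the number of values K that each label can take. The following sketch encodes that decision; the function name is illustrative and it assumes every label takes the same number of values.

```python
def classification_task(n_labels, n_values):
    """Name the task type from L (number of labels) and K (values per label)."""
    if n_labels < 1 or n_values < 2:
        raise ValueError("need L >= 1 labels and K >= 2 values per label")
    if n_labels == 1:
        return "binary" if n_values == 2 else "multi-class"
    return "multi-label" if n_values == 2 else "multi-output"


# The examples used in the text:
print(classification_task(1, 2))  # "active" vs "not active"
print(classification_task(1, 3))  # "inhibition", "activation" or "none"
print(classification_task(2, 2))  # cardiotoxicity and renal toxicity, present/absent
print(classification_task(2, 3))  # two toxicities, each severe/medium/none
```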

Evaluation Metrics

Classification is often assessed by the calculation of metrics based on the confusion matrix.187,188

For a binary problem, the confusion matrix is a two-by-two table of predicted label values (class)

against actual label values (class) with counts in each quadrant equivalent to the number of

instances in each category (Figure 1-7).

Figure 1-7: Confusion matrix for the assessment of classification model performance.

Metrics can then be calculated from the confusion matrix to assess the performance of the

classifier, the most popular being the precision, recall (sensitivity), specificity, accuracy, F1 score,

Matthews correlation coefficient (MCC) and the area under the receiver operating

characteristic curve (ROC AUC).186–189
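As an illustration of how these metrics follow from the four cells of the confusion matrix, the sketch below counts TP, FP, TN and FN for a small set of made-up predictions and derives the metrics using their standard definitions. The labels are hypothetical and the code assumes that no denominator is zero. ROC AUC is omitted because it requires ranked prediction scores rather than a single set of predicted classes.

```python
import math


def confusion_counts(actual, predicted, positive="active"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the four counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical results for ten compounds: four true actives, six true inactives.
actual = ["active"] * 4 + ["inactive"] * 6
predicted = ["active", "active", "active", "inactive",
             "active", "inactive", "inactive", "inactive", "inactive", "inactive"]

counts = confusion_counts(actual, predicted)
scores = classification_metrics(*counts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For imbalanced datasets, accuracy can be misleadingly high, which is one reason metrics such as MCC that use all four cells are often reported alongside it.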

Machine Learning Algorithms

In this section we discuss the main machine learning classification algorithms which were used

in this work, including their theoretical basis, advantages and disadvantages.

Logistic Regression

Logistic regression is an algorithm which adapts linear regression to classification problems. As

in linear regression, a function is learned from the training data which is a weighted sum of the

input features plus a bias term. However, instead of outputting continuous values, the logistic

function (𝜎) is applied to the continuous values to output a probability value between 0 and 1

for binary classification:190

Equation 1-2:

\[ \hat{p} = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n) \]

Where $\hat{p}$ is the probability value that an instance belongs to the positive class, $\theta_0$ is the bias term, $x_i$ is the ith feature value, n is the number of features and $\theta_j$ is the jth model parameter (including the bias term and feature weights $\theta_1$ to $\theta_n$). $\sigma$ is the logistic function:

Equation 1-3:

\[ \sigma(t) = \frac{1}{1 + e^{-t}} \]

From the probability output, values ≥ 0.5 predict membership of the positive class and values <

0.5 predict membership of the negative class.
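The prediction step described by Equation 1-2 and Equation 1-3 can be sketched as follows. The parameter values are hypothetical and no training is performed; the sketch only shows how a fitted parameter vector would be turned into a probability and a class. Note that $\sigma(t) \geq 0.5$ exactly when $t \geq 0$, so the 0.5 threshold corresponds to the sign of the weighted sum.

```python
import math


def logistic(t):
    """Equation 1-3: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta, x):
    """Equation 1-2: theta[0] is the bias term, theta[1:] the feature weights."""
    t = theta[0] + sum(weight * value for weight, value in zip(theta[1:], x))
    return logistic(t)


def predict_class(theta, x, threshold=0.5):
    """Positive class (1) if the probability is >= threshold, else negative (0)."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical parameters for a model with two features.
theta = [-1.0, 2.0, -0.5]

p_boundary = predict_proba(theta, [1.0, 2.0])   # weighted sum is exactly 0
p_negative = predict_proba(theta, [0.0, 0.0])   # weighted sum is -1
print(p_boundary, predict_class(theta, [1.0, 2.0]))
print(round(p_negative, 3), predict_class(theta, [0.0, 0.0]))
```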

Logistic regression works under the following assumptions: 1. the dependent variable is

dichotomous, 2. the errors are independent and therefore each observation is independent from

others, 3. there is no multicollinearity in the predictor variables and 4. there are no outliers in

the data.191 Logistic regression is often used owing to its efficient implementation, high

interpretability, easy regularization and ability to output well-calibrated probabilities. It has the

disadvantages that it cannot solve non-linear separation problems, requires prior knowledge of

which input variables are meaningfully related to the output variable, performs less well under

multicollinearity of predictors, or a large ratio of predictor variables to samples.192

Support Vector Machines

Support Vector Machines (SVMs) are used in classification tasks and work by optimising a linear

decision surface known as a hyperplane to best separate the data into classes, whilst maintaining

the largest distance possible from any examples, known as large margin classification.193 Large

margin classification allows increased generalisability of the algorithm, since the hyperplane is

more likely to succeed in predicting for new examples.194 The training instances which lie on the

edges of the margin are called the support vectors (Figure 1-8). When the data is not separable in the original feature

space, SVMs can map input features onto a higher dimensional feature space in which the data

can then be linearly separated into classes.195,196

Figure 1-8: SVM classification for a linearly separable dataset.

For linear SVMs each data point is represented by a vector in n-dimensional space ($\mathbb{R}^n$). The equation of the hyperplane is derived from two vectors $\vec{x}_0$ and $\vec{x}$ which represent two points $P_0$ and $P$ on the hyperplane (Figure 1-9). For $P$ to be on the plane, the vector $\vec{x} - \vec{x}_0$ must be perpendicular to a vector $\vec{w}$ at point $P_0$. From this we know that the dot product of these two vectors must be 0 as they are perpendicular:

Equation 1-4:

\[ \vec{w} \cdot (\vec{x} - \vec{x}_0) = \vec{w} \cdot \vec{x} - \vec{w} \cdot \vec{x}_0 = 0 \]

And therefore:

Equation 1-5:

\[ \vec{w} \cdot \vec{x} + b = 0 \]

where $b = -\vec{w} \cdot \vec{x}_0$. This can be visualised in Figure 1-9.

Figure 1-9: Derivation of the equation for the hyperplane for an $\mathbb{R}^3$ space from point vectors of points $P$ and $P_0$.

Parallel hyperplanes have different values of the $b$ coefficient, which shifts the hyperplane along the direction of $\vec{w}$, perpendicular to the hyperplane. For an instance to be classified as positive (+1) the value of $\vec{w} \cdot \vec{x} + b$ must be greater than 0 and for an instance to be classified as the negative class (-1) the value of $\vec{w} \cdot \vec{x} + b$ must be less than 0. Because the margin boundaries are parallel to the hyperplane and pass through the nearest positive and negative examples (the support vectors), their equations, with $\vec{w}$ and $b$ scaled accordingly, are the following:

Equation 1-6:

\[ \vec{w} \cdot \vec{x} + b = +1 \iff \vec{w} \cdot \vec{x} + (b - 1) = 0 \]

Equation 1-7:

\[ \vec{w} \cdot \vec{x} + b = -1 \iff \vec{w} \cdot \vec{x} + (b + 1) = 0 \]

The distance d between two parallel hyperplanes is given by:

Equation 1-8:

\[ d = \frac{|b_1 - b_2|}{\lVert \vec{w} \rVert} \]

Which, since $b_1 = b + 1$ and $b_2 = b - 1$, simplifies to:

Equation 1-9:

\[ d = \frac{2}{\lVert \vec{w} \rVert} \]
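The hyperplane equation and the margin width can be checked numerically. The sketch below uses a hypothetical two-dimensional weight vector and a point $P_0$ chosen to lie on the hyperplane; the helper names are illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(w):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(w, w))


def hyperplane_distance(w, b1, b2):
    """Equation 1-8: distance between w.x + b1 = 0 and w.x + b2 = 0."""
    return abs(b1 - b2) / norm(w)


# Hypothetical weight vector and a point P0 on the hyperplane.
w = [3.0, 4.0]                 # norm is 5
x0 = [1.0, 1.0]
b = -dot(w, x0)                # Equation 1-5: b = -w . x0

# Moving from P0 in a direction perpendicular to w stays on the hyperplane.
x = [x0[0] + 4.0, x0[1] - 3.0]   # (4, -3) is perpendicular to (3, 4)
on_plane = dot(w, x) + b

# Equation 1-9: the two margin boundaries have offsets b + 1 and b - 1.
margin = hyperplane_distance(w, b + 1, b - 1)
print(on_plane, margin)
```

With this choice of $\vec{w}$ the margin is 2/5 = 0.4, which shows that a larger weight norm corresponds to a narrower margin.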

To maximise the margin, we would need to minimise the value of $\lVert \vec{w} \rVert$, which is the norm of the weight vector. In practice the quantity which is minimised, and which has the same minimiser, is:

Equation 1-10:

\[ \frac{1}{2} \lVert \vec{w} \rVert^2 \]

For a hard margin classifier, it is required to minimise the objective function $\frac{1}{2}\sum_{j=1}^{n} w_j^2$ subject to $y_i(\vec{w} \cdot \vec{x}_i + b) - 1 \geq 0$ for every training instance $i$ with feature vector $\vec{x}_i$ and class label $y_i \in \{-1, +1\}$, which encodes the constraint that all instances are correctly classified (all positive instances ≥ +1 and all negative instances ≤ -1).190,183 For soft margin classification, margin violations are allowed. Therefore, a balance between maximising the margin and minimising the number of violations is required. The hyperparameter C in SVM controls the trade-off between these two objectives.
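One common way of writing the soft margin trade-off, which is an assumption here since the text does not spell out the formulation, is to add C times the summed hinge losses to the margin term: $\frac{1}{2}\lVert \vec{w} \rVert^2 + C \sum_i \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i + b))$. The sketch below evaluates this objective for a fixed, hypothetical hyperplane and shows how C weights the violations; it does not perform the optimisation itself.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (a common soft-margin form)."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    violation_term = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # Zero when the instance is on the correct side of its margin boundary.
        violation_term += max(0.0, 1.0 - yi * score)
    return margin_term + C * violation_term


# Hypothetical data: the third instance lies inside the margin.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
w, b = [1.0, 0.0], 0.0

low_c = soft_margin_objective(w, b, X, y, C=1.0)
high_c = soft_margin_objective(w, b, X, y, C=10.0)
print(low_c, high_c)
```

A larger C penalises the same violation more heavily, so the optimiser is pushed towards fewer margin violations at the cost of a narrower margin.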

For non-linearly separable data, as mentioned previously, we can map the instances from the

original feature space to a higher dimensional feature space using a mapping function $x \rightarrow f(x)$.

The kernel is a similarity function which computes the dot product

between two vectors in the higher dimensional space based only on the original vectors a and b

without having to know about the transformation.194 Therefore, the dot product of the vectors

transformed by function 𝑓(𝑥) is equal to a function of the original vectors. This is known as the

kernel trick and does not require the computation of the coordinates of the input data in the

higher dimensional space, but only the dot products between all vectors instead, which is more

computationally efficient. There are a variety of kernels which can be implemented; the most

common being those shown in Equation 1-11, Equation 1-12 and Equation 1-13:190

Equation 1-11:

Linear: $K(a, b) = a^T \cdot b$

Equation 1-12:

Polynomial: $K(a, b) = (\gamma a^T \cdot b + r)^d$

Equation 1-13:

Gaussian Radial Basis Function (RBF): $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$

Where $a^T$ is the transpose of a, d is the polynomial degree, r is a constant offset term and $\lVert a - b \rVert$ is the Euclidean distance

between the vectors in the original space. 𝛾 is a hyperparameter to be optimised.

Using this method, it may now be possible to find a hyperplane in the projected space which

can better separate the classes.
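
The three kernels and the kernel trick can be sketched in a few lines of Python. The explicit feature map `phi` below is an illustrative assumption: it is the mapping that corresponds to the degree-2 polynomial kernel with γ = 1 and r = 0 for two-dimensional inputs, and it is included only to show that the kernel returns the same value as the dot product in the higher dimensional space. The input vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(a, b) + r) ** d

def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

def phi(x):
    """Explicit 3-D feature map for the 2-D polynomial kernel (gamma=1, r=0, d=2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

# Hypothetical input vectors in the original two-dimensional space.
a, b = [1.0, 2.0], [3.0, 4.0]

k_trick = polynomial_kernel(a, b)      # computed in the original space
k_explicit = dot(phi(a), phi(b))       # computed in the projected space

print(k_trick, k_explicit)             # the two values agree up to rounding
print(linear_kernel(a, b), rbf_kernel(a, a))
```

For higher polynomial degrees or the RBF kernel the explicit mapping becomes very large or infinite dimensional, whereas the kernel evaluation remains a single computation on the original vectors.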

SVMs have the advantage that they can model non-linear relationships, using the

kernel trick to reduce computational complexity. Additionally, because training is a convex

optimisation problem, SVMs do not get stuck in local minima and converge to a unique

solution.197 They are, however, less interpretable, less suitable for multi-class classification,

favour discrete variables with more categories, and can be sensitive to the choice of kernel when

compared to other algorithms.197,183

Decision Trees

Decision trees can be used to perform both regression and classification tasks. Decision trees

split the data into smaller sub-groups by using input model features, until the instances are

separated into their classes (classification) or groups of values with a low standard deviation

(regression). At the start of the algorithm all instances are grouped together at the root node

(Figure 1-10). Subsequently the data is split by many decision nodes (including the root node),

which have multiple output branches and use one of the input features to split the data. Finally,

the leaf nodes consist of the decision of the class or group of closely related values; these nodes

should now consist of homogeneous data.

Figure 1-10: Exemplified decision tree for classification of dataset into positive and negative classes.

There are different algorithms for decision tree construction. The Iterative Dichotomiser 3

(ID3)198 uses a greedy search approach to find an optimal decision tree by iteratively forming a

decision tree on larger subsets of the training dataset until a tree forms which classifies the

remaining instances completely correctly. ID3 uses the concepts of entropy H(T) and

information gain (IG) for classification to select the best features for partitioning data.

Entropy describes the information contained within a variable and originates from the Shannon

entropy, derived from models of data communication systems, which operated based on

deriving the average minimum length (in bits or Shannons) needed to transmit information in

a compressed form from source to receiver.199 Entropy can be described as the average or

expected amount of information derived from identifying the outcome of a random trial; high

entropy means that the state is unpredictable, since more information is needed to encode the

possible outcomes. Mathematically, entropy can be calculated by the following formula:

Equation 1-14:

$$H(T) = -\sum_{i=1}^{J} p_i \log_2 p_i$$

Where $p_1, p_2, \ldots$ are fractions of each of the J classes, which in total sum up to one and represent

the proportion of each class present in the set of training inputs T.

Equation 1-15:

𝐼𝐺(𝑇, 𝑎) = 𝐻(𝑇) − 𝐻(𝑇|𝑎)

[Figure 1-10 content: the root node (Feature 1) splits through Yes/No branches into two decision nodes (Feature 2 and Feature 3), each of which splits through Yes/No branches into Positive and Negative leaf nodes.]

Where H(T) is the entropy of the training inputs T, and 𝑎 is the feature which has been used for

the split. This can be interpreted as the change in entropy from the parent node to the weighted

average of the child nodes after a split has been made using feature 𝑎. The feature with highest

information gain is used to make the split; a high gain relies on having a large entropy for the parent

node and lower entropy (more homogeneous) child nodes.200
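
As a minimal sketch of Equation 1-14 and Equation 1-15, the following snippet computes the entropy of a list of class labels and the information gain of a candidate splitting feature. The labels and the two candidate features are hypothetical; the conditional entropy H(T|a) is taken as the average of the child node entropies weighted by the fraction of instances in each child.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(T) in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(T, a) = H(T) - H(T|a) for a discrete feature a."""
    n = len(labels)
    children = {}
    for label, value in zip(labels, feature_values):
        children.setdefault(value, []).append(label)
    # Weighted average entropy of the child nodes produced by the split.
    conditional = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - conditional

# Hypothetical training labels and two candidate features.
labels = [1, 1, 0, 0]
feature_a = ["yes", "yes", "no", "no"]   # separates the classes perfectly
feature_b = ["yes", "no", "yes", "no"]   # carries no information about the class

print(entropy(labels))
print(information_gain(labels, feature_a))
print(information_gain(labels, feature_b))
```

In this toy case feature_a would be chosen for the split, since it reduces the entropy of the parent node (1 bit) to zero in both child nodes.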

The ID3 algorithm was later developed into the C4.5 algorithm, which has several advantages:

both discrete and continuous features can be used to make the split, it can cope

with missing feature values for some instances, and it prunes those branches which do not

help with classification.201

Some decision tree algorithms, including the Classification and Regression Trees (CART)

algorithm, use the Gini Impurity I(T) instead of the entropy H(T) as the impurity measure for

attributes; this is defined as follows:

Equation 1-16:

$$I(T) = \sum_{i=1}^{J} p_i (1 - p_i)$$

Where $p_1, p_2, \ldots$ are fractions of each of the J classes, which in total sum up to one and represent

the proportion of each class present in the set of training inputs T.

Practically, it has been shown that the difference between the entropy-based and Gini impurity-based

measures is minimal (2 %).202 An important difference between CART and other

algorithms is that it produces only binary trees with two child nodes resulting from each parent

node in the tree.190
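
Equation 1-16 and the evaluation of a binary split can be sketched as follows. The helper `weighted_gini_after_split` is a hypothetical name introduced for illustration; it scores a CART-style binary split by the size-weighted impurity of the two child nodes. The label lists are toy values.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity I(T) = sum_i p_i * (1 - p_i) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())

def weighted_gini_after_split(left, right):
    """Size-weighted Gini impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Hypothetical parent node with a balanced binary class distribution.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini_impurity(parent))
print(weighted_gini_after_split(left, right))
```

A candidate split is preferred when the weighted impurity of its children is lower than that of the alternatives, in the same way as a higher information gain is preferred when entropy is used.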

Decision trees have the advantage that they are known as “white box” models and are

interpretable; some decision tree algorithms can output classification rules, which allows

understanding of how the dataset was classified. It is possible to calculate the importance of

each feature towards classification using the impurity measures mentioned above. Decision

trees are prone to overfitting to the training set, affecting generalisation. For this reason, trees

are often regularised by controlling the maximum depth of the tree and using pruning.190

Ensemble Methods

Ensemble methods make use of the aggregation of multiple weak learning algorithm outputs to

produce a stronger ensemble consensus output. The main purpose of employing ensemble

models is to increase model predictive performance. The key to the effectiveness of an ensemble

algorithm is that the errors between the weak learners are uncorrelated; that is the weak

learners are as diverse as possible. One way to train diverse weak learners is to implement a

variety of different algorithms on the same training set and create the ensemble from

aggregating the resultant predictions. Other popular ensemble methods include bagging and

boosting. We explain these techniques and then provide detail for the Random Forest

algorithm, since it is used in this work.

Boosting

Boosting is a way of sequentially combining weak learners into one stronger ensemble learner.

Each new predictor aims to correct the error from the previous predictor in the sequence.203

The most popular methods of boosting are Adaptive Boosting (AdaBoost) and Gradient

Boosting. AdaBoost modifies the sample distribution by updating the weights assigned to each

instance to prioritize the instances that the previous predictor underfitted for the subsequent

predictor, whereas Gradient Boosting aims to fit the next predictor in the sequence to the

residual error from the previous predictor.

Bagging

Bagging refers to the use of the same algorithm on different subsets of the training set where

the training set for each predictor is selected by a sampling with replacement (bootstrapping)

procedure.204 The predictions for a new instance are then aggregated across all predictors; in

classification this is often achieved by a majority voting aggregator.
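
A minimal sketch of bagging with majority voting is given below. The base learner (a one-nearest-neighbour classifier on a single descriptor), the number of predictors, the random seed and the training data are all hypothetical choices made for illustration; any learning algorithm could take the place of `fit_nearest_neighbour`.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most frequently predicted class."""
    return Counter(predictions).most_common(1)[0][0]

def fit_nearest_neighbour(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (value, label) pairs."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_predict(train, x, fit, n_predictors=25, seed=0):
    """Train n_predictors models on bootstrap samples and aggregate their votes."""
    rng = random.Random(seed)
    votes = [fit(bootstrap_sample(train, rng))(x) for _ in range(n_predictors)]
    return majority_vote(votes)

# Hypothetical one-dimensional training set of (descriptor value, class label).
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

print(bagging_predict(train, 0.85, fit_nearest_neighbour))
print(bagging_predict(train, 0.15, fit_nearest_neighbour))
```

An odd number of predictors is used here so that a binary vote cannot be tied.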

Random Forest

Random Forests are popular algorithms in machine learning in the life sciences domain, due to

their high performance coupled with relatively simple implementation and interpretability.205

The Random Forest (RF) algorithm is an ensemble of weak learning decision trees implemented

using bagging.206 The difference between RFs and traditional bagging techniques is the

additional randomisation which allows the construction of diverse, independent decision trees

to minimise the correlation between the weak learners in the ensemble.207 Each tree is grown

as described in Decision Trees, with the difference that extra randomisation is introduced: at each

node, the best feature to split on (based on entropy or Gini impurity metrics) is selected from

only a random subset of the total possible features. By restricting to a

random subspace of features, it is possible to increase the tree diversity by avoiding the situation

where a few features always form the first splitting features and therefore dominate for all trees.

This increases the bias of the model and decreases the variance, resulting in less overfitting than

deep decision trees. Finally, all trees output the classification prediction for an instance, which

are aggregated by majority voting to produce the final class. The proportion of trees voting for

each class gives an indication of a class probability value, for example if an RF model has 100 trees

and 70 trees have predicted class 1, the probability would be 0.7; this is not a true likelihood of

the class prediction, however, as this does not take into account the distribution between

classes.208

RFs model non-linear relationships between features and output data, are relatively quick to

train, are less likely to overfit relative to decision trees, allow the interpretation of feature

importance to the model, and do not require extensive tuning of model parameters.190 In

comparison to other methods, RFs do not require strict feature scaling. Conversely, their

weaknesses lie in the fact that they cannot predict error estimates for models and are less easily

interpreted than some simpler algorithms.209

Interpretation of Random Forests

RFs are a popular machine learning approach, due to their interpretability. As described in

Decision Trees, features are selected for splitting instances at each node in each tree based on

achieving the best split, as measured by the difference between the Gini impurity or the entropy

metrics between parent and child nodes. The mean decrease in the Gini impurity is often used

as a measure of overall feature importance for interpreting RF models. This measures the

average decrease in node impurity across all times a certain feature was used to make a split in

all trees within the forest, weighted by the number of samples which reach the node involved

in each of these splits. This results in a value for the importance of each feature in the overall

classification of instances across the ensemble and these values can be ranked from highest to

lowest importance.
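
The mean decrease in impurity can be sketched as below. The representation of a tree as a list of split records, and all of the numbers in the toy forest, are assumptions made for illustration and do not correspond to the output format of any particular library. Each split contributes its impurity decrease weighted by the fraction of samples reaching the node, and the contributions are averaged over the trees.

```python
from collections import defaultdict

def split_decrease(n_total, n_node, impurity_node, children):
    """Weighted impurity decrease of one split.

    children is a list of (n_child, impurity_child) pairs.
    """
    weighted_children = sum(n_c / n_node * imp_c for n_c, imp_c in children)
    return (n_node / n_total) * (impurity_node - weighted_children)

def mean_decrease_impurity(forest, n_total):
    """Average, over all trees, of the summed weighted decreases per feature."""
    totals = defaultdict(float)
    for tree in forest:
        for feature, n_node, impurity_node, children in tree:
            totals[feature] += split_decrease(n_total, n_node, impurity_node, children)
    return {feature: value / len(forest) for feature, value in totals.items()}

# Hypothetical forest of two trees trained on 8 samples each.
# Record format: (feature, samples at node, Gini at node, [(samples, Gini) per child]).
forest = [
    [("f1", 8, 0.5, [(4, 0.375), (4, 0.375)]),
     ("f2", 4, 0.375, [(3, 0.0), (1, 0.0)])],
    [("f2", 8, 0.5, [(4, 0.0), (4, 0.0)])],
]

importance = mean_decrease_impurity(forest, n_total=8)
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance, ranking)
```

In this toy forest f2 is ranked above f1, because it produces pure child nodes both at a root node and deeper in a tree.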

Although the mean decrease in Gini impurity method is commonly used to determine feature

importance, there are other importance metrics which have been used. These include

permutation measures, such as computing the out-of-bag error when each input variable is

randomly permuted; those variables which cause an increase in error or decrease in accuracy

when permuted are considered important towards classification.206 Conditional permutation

testing was proposed to correct for the effects of highly correlated variables.210 This method

calculates the mean decrease in accuracy for a feature when it is permuted as above, but now

with respect to subsets of the data split by other highly correlated features. Additionally,

variable selection for RF using backward elimination methods has been used for interpretation

of features as it provides a smaller set of important features which contribute to model

performance, which can be interpreted.211 Another method suggests that the minimal distance

measure, which calculates the average depth of the tree when a variable is used as a split, should

be used to weight variables near the root of the tree as more important.212

Recent studies have explored the use of alternative methods to capture a more detailed picture

of the relevance of certain features towards the classification of subsets of the training

dataset.213–215 In this work a local feature importance method213 was implemented to determine

the features relevant for classification of individual instances or groups of instances. This

method computes the local increment $LI_f^c$ of feature f towards class 1 when it splits instances

between parent (p) and child (c) nodes in a tree:

Equation 1-17:

$$LI_f^c = Y_{mean}^c - Y_{mean}^p$$

Where $Y_{mean}^c$ is the fraction of instances in the child node belonging to class 1, and $Y_{mean}^p$ is the

fraction of instances in the parent node belonging to class 1.

Following this, for a specific instance i in the training set, the contribution of feature f for an

individual tree is the sum of all local increments from all splits in the tree that the instance

passes through which used this feature.

Finally, the contribution of feature f to the classification of instance i over the entire forest is

given by:

Equation 1-18:

$$FC_i^f = \frac{1}{T} \sum_{t=1}^{T} FC_{i,t}^f$$

Where T is the total number of trees and $FC_{i,t}^f$ is the feature contribution of feature f towards

instance i for tree t. For more details on the method the reader is referred to the publication by

Palczewska et al.213
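
Equation 1-17 and Equation 1-18 can be illustrated with a minimal sketch. The representation of the path of an instance through a tree as a list of (feature, Y_mean of parent, Y_mean of child) records is an assumption made for illustration, and the numbers are hypothetical; this is not the implementation used in this work.

```python
from collections import defaultdict

def local_increment(y_mean_child, y_mean_parent):
    """LI = Y_mean(child) - Y_mean(parent) for one split (Equation 1-17)."""
    return y_mean_child - y_mean_parent

def tree_contributions(path):
    """Sum the local increments per feature along one instance's path in a tree."""
    contributions = defaultdict(float)
    for feature, y_mean_parent, y_mean_child in path:
        contributions[feature] += local_increment(y_mean_child, y_mean_parent)
    return contributions

def forest_contributions(paths):
    """Average the per-tree contributions over all T trees (Equation 1-18)."""
    totals = defaultdict(float)
    for path in paths:
        for feature, value in tree_contributions(path).items():
            totals[feature] += value
    return {feature: value / len(paths) for feature, value in totals.items()}

# Hypothetical paths of one instance through a forest of two trees.
# Each record: (splitting feature, fraction of class 1 in parent, fraction in child).
paths = [
    [("f1", 0.5, 0.75), ("f2", 0.75, 1.0)],
    [("f2", 0.5, 0.25), ("f1", 0.25, 1.0)],
]

fc = forest_contributions(paths)
print(fc)
```

In this toy example feature f1 has a positive contribution towards class 1 for the instance, whereas the two increments of f2 cancel out across the forest.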

Since this method was implemented in this study, other local methods of feature contribution

for RFs have been developed. One, coined the “Intervention in Prediction Measure (IPM)” and

based on the internal structure of the trees, was recently shown to outperform other established

methods, including the Gini coefficient and mean decrease in accuracy methods.215 It works

by computing the percentage of times a variable is selected along the path of each instance

classified in the out-of-bag set in each tree, which is then averaged over the whole forest. Very

recently, another local feature importance approach was proposed based on computing only the

feature contributions of the positive feature values, which had utility for the interpretation of

binary input features (in this example Gene Ontology (GO) terms), since positive values were

indicative of the presence of a feature, whereas in some cases a negative value can mean either

the absence of the feature or that there was a missing feature value, which is a less concrete

interpretation.214

Assessment of Applicability Domain

The applicability domain (AD) is the area in descriptor space to which a machine learning model

can be reasonably applied for future predictions.216 This region, where the model can make

predictions which can be relied upon, is determined by the descriptor space of the training set.

In the context of Quantitative Structure Activity Relationship (QSAR) models, the descriptor

space will be the chemical space for the training set compounds and the question asked would

be “Is the query compound inside or outside the applicability domain?”.217 Approaches to define

the applicability domain are grouped into three main methods.218 The first involves limiting the

model to interpolation, where only those new examples which fit within the range of the

descriptor set are within the applicability domain. Secondly, similarity approaches are often

employed to extend towards extrapolation of a model; these methods rely on the distance of the

new compound from existing compounds in the training set to determine reliability of a new

prediction. Many methods have extended this approach to also model the density of the

neighbourhood of the new compound within the training set, by measuring the average

similarity between all close training examples rather than the nearest neighbour distance

alone.219 The third type of approach comes from outputs of the algorithm itself; including, for

example in ensemble methods, determining how many of the individual predictors vote for the

same class for classification or the consistency between values for regression.220 The more

consistent the outputs between predictors, the more likely the aggregated prediction can be

relied upon. All methods require thresholds to be determined based on the application, are

often ambiguous as to their quantitative meaning and are often not comparable to one

another.221 This is due to many AD metrics being sensitive to the individual dataset and the

algorithm used, rather than being globally applicable.222
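
The first two families of applicability domain methods can be sketched as follows. The descriptor values, the choice of k = 3 neighbours, the Euclidean distance and the distance threshold are all hypothetical and would need to be chosen for the application; they are not the settings used in this work.

```python
import math

def in_descriptor_range(query, training):
    """Interpolation check: every descriptor lies within the training set range."""
    for j, value in enumerate(query):
        column = [row[j] for row in training]
        if not (min(column) <= value <= max(column)):
            return False
    return True

def mean_knn_distance(query, training, k=3):
    """Similarity check: mean Euclidean distance to the k nearest training examples."""
    distances = sorted(math.dist(query, row) for row in training)
    return sum(distances[:k]) / k

# Hypothetical two-descriptor training set and two query compounds.
training = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inside, outside = [0.5, 0.5], [3.0, 3.0]
threshold = 1.0  # hypothetical, application-specific distance cut-off

for query in (inside, outside):
    distance = mean_knn_distance(query, training)
    print(in_descriptor_range(query, training), distance <= threshold)
```

Averaging over several neighbours rather than using the single nearest neighbour makes the check sensitive to the density of training examples around the query, as described above.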

Conformal Prediction

Conformal prediction has been implemented to assess the applicability domain of models in

previous contexts, in particular in QSAR applied to virtual screening,223–226 and addresses some

of the problems with the approaches previously detailed, by providing a quantitative

interpretation of confidence in predictions.227 Conformal prediction is designed to provide

information on whether a new prediction falls into a prediction region at a defined confidence

level. It achieves this by implementing a mathematical framework to predict new instances with

a guaranteed error rate. In the classification case, the confidence of predicting a class label for

a new example is computed, given the training set instances to which the new example

conforms. Conformal prediction assumes that the new examples are sampled from an

exchangeable distribution and, under these conditions, a guaranteed error rate is achieved,

providing prediction regions which are always valid.221 For the binary classification case, new

examples can be assigned to four sets; 1. {Class 1}, 2. {Class 2}, 3. {both} classes 1 and 2, or 4.

{null}. Validity and efficiency are the metrics used to assess conformal prediction performance.

The conformal predictor is valid if the frequency of errors does not exceed ε at the chosen

confidence level (1 – ε). It is efficient if instances are assigned to as few labels as possible. In the

binary classification problem, assignment to the set {both} is always valid but is not

efficient, as the instance has two labels. Conversely, the set {null} is not valid but is efficient

since the instances are assigned to zero labels. Ideally, single class predictions are the most

useful in practice; however, if the confidence level is increased, often the number of {both} set

predictions is increased, because the algorithm needs to be more certain in obtaining correct

predictions.

There are different variations of conformal prediction. The first distinction is between the

methods of transductive and inductive conformal prediction.221 In the transductive setting, each

instance is predicted separately, after which the model is updated based on the true value of the

instance, before the next instance is predicted. In this way the new prediction is based on all

available previous data. In contrast, inductive conformal prediction employs a batch approach

where the model is trained on a set of existing data and is not updated after each prediction.

Transductive conformal prediction is more computationally expensive, and so inductive

conformal prediction is often used in the QSAR setting.225,226 The second variation in conformal

prediction methods is between the Non-Mondrian or Mondrian cases, and a third important

distinction is the method used to generate the calibration sets. In this section we focus on

the method implemented in this thesis, namely Inductive Mondrian Cross-Conformal

Prediction (IMCCP), and discuss further details of these variations in conformal prediction

methods in that context.227

Practically, the process of employing Inductive Mondrian Cross-Conformal Prediction (IMCCP)

with a hypothetical example is illustrated in Figure 1-11. The algorithm starts with the definition

of a non-conformity score (Step 1), which is a measure of how different the new example is from

existing training set examples. This can be defined in many ways and metrics used to define

traditional applicability domains can be used here. For classification, often the probability

scores for each class generated from the machine learning algorithm are used as a measure of

non-conformity. The difference between IMCCP and standard Inductive Mondrian Conformal

Prediction (IMCP) is that instead of the training set being split into a training set and a smaller

calibration set, IMCCP iteratively splits the training set into k training sets and k calibration

sets, in a procedure mimicking cross-validation. This is employed to relieve the biases from

generating a random calibration set and to ensure that all training examples appear in

calibration sets once.227 Once the non-conformity measure has been established, it is calculated

for the examples in each of the k calibration sets based on the model predictions from the model

trained on each of the k training sets. These non-conformity scores are then organised into

Mondrian class lists (Step 2), which are ordered lists of the non-conformity scores for each of

the possible class outcomes. Mondrian conformal prediction was introduced to represent the

non-conformity scores on a per class basis, which overcomes the problem of class imbalance in

previous Inductive Conformal Predictors (ICPs), ensuring that the framework is valid for each

class.221 There will be k class lists for the k calibration sets. Subsequently, we define a confidence

level (1-ε) which we require for our application, where ε is the error rate which must not be

exceeded for the conformal predictor to be valid. ε is also the significance level which

the p_values from conformal prediction must exceed for the class to be predicted (Step

3). In Step 4 a prediction is made for a new instance. For each new instance the non-conformity

score is calculated in the same way as for the calibration sets and a value for each class is

obtained. The values are then inserted into the Mondrian class lists for the k calibration sets

separately and a p_value is obtained for each calibration set (Step 5). P_values are calculated

based on the position in the Mondrian class list of the new example. For example, if there is one

data point in the calibration set with a non-conformity score lower than the example and there

are 7 samples in the calibration set, the p_value would be 1/7 or 0.14. The p_values for each of

the k calibration sets for the new examples are then averaged for each class to produce one value

per class. This value is then compared to the significance level (Step 6); where the p_value is

greater than the significance level ε, the new instance is said to belong to the class. In this way

both p_values could be less than ε assigning the instance to set {null}, both may be greater than

ε assigning the instance to set {both}, or the p_value for only one class may be greater than the

significance level assigning the instance to either {Class 1} or {Class 2} only.
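
Steps 4 to 6 can be sketched as below, using the class lists and scores from the hypothetical example in Figure 1-11. The sketch follows the simplified counting described in the text (the fraction of calibration scores lower than the score of the new example); published conformal prediction formulations typically include the new example itself in the count, so this should be read as an illustration of the worked example rather than as a reference implementation. Only a single calibration fold is shown, so the averaged values are equal to the single-fold p_values.

```python
def mondrian_p_value(calibration_scores, new_score):
    """Fraction of calibration scores for one class that are lower than new_score."""
    lower = sum(1 for score in calibration_scores if score < new_score)
    return lower / len(calibration_scores)

def cross_conformal_p_values(class_lists_per_fold, new_scores):
    """Average the per-class p_values over the k calibration folds."""
    k = len(class_lists_per_fold)
    return {
        label: sum(mondrian_p_value(fold[label], score)
                   for fold in class_lists_per_fold) / k
        for label, score in new_scores.items()
    }

def prediction_set(p_values, epsilon):
    """Classes whose p_value exceeds the significance level epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}

# Mondrian class lists for one calibration fold (values from Figure 1-11).
fold = {
    "Class 1": [0.02, 0.34, 0.44, 0.67, 0.87, 0.91, 0.99],
    "Class 2": [0.05, 0.10, 0.56, 0.69, 0.70, 0.88, 0.89],
}
new_scores = {"Class 1": 0.2, "Class 2": 0.8}

p_values = cross_conformal_p_values([fold], new_scores)
print(p_values)                        # 1/7 for Class 1 and 5/7 for Class 2
print(prediction_set(p_values, 0.30))  # 70 % confidence level
```

Lowering the significance level to 0.10 in this example would place the instance in the set {both}, whereas raising it to 0.80 would give the set {null}, which illustrates the trade-off between validity and efficiency.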

[Figure 1-11 content]

Step 1 - Define a non-conformity measure: how different a new sample is from previous samples, e.g. what is the predicted probability of an example belonging to a class? Non-conforming samples: the probability estimate for the correct class is low (if the score is less than 0.5 then the classification was wrong).

Step 2 - Create k Mondrian class lists from calibration sets: sorted non-conformity scores per class.

Class 1: 0.02 0.34 0.44 0.67 0.87 0.91 0.99

Class 2: 0.05 0.10 0.56 0.69 0.70 0.88 0.89

Step 3 - Define a confidence level and significance level: e.g. if confidence level (Cl) = 0.70 (70 %) then significance level (ε) = 0.30. The probability value for an instance must exceed the significance level to be predicted as a member of this class.

Step 4 - For a new example calculate the non-conformity score and place it into each of the k class lists: e.g. if p(Class 1) = 0.2 and p(Class 2) = 0.8, these scores are placed into the Class 1 and Class 2 lists from Step 2.

Step 5 - Calculate k p_values based on position in each class list: p_value (Class 1) = 1/7 = 0.14; p_value (Class 2) = 5/7 = 0.71.

Step 6 - Interpret outcome: average the k p_values and assess if they are greater than the significance level, e.g. average p_value (Class 1) = 0.16 < 0.3; average p_value (Class 2) = 0.80 > 0.3. The new instance is assigned to Class 2 but not Class 1 at the 70 % confidence level.

Figure 1-11: The process for implementing Inductive Mondrian Cross-Conformal Prediction.

Statistical Methods

This section covers statistical methods relevant to this thesis that are not encompassed by the

machine learning section above.

Covariance and correlation

Covariance is a statistical measure of association between two variables and a high covariance

indicates that two random variables vary together. Correlation measures how strongly two

random variables are related to one another and is a scaled version of covariance with values

between -1 and 1. Correlation is therefore a normalised form of covariance which is unaffected by a change

in scale. Pearson’s correlation coefficient (ρ) is often used to measure correlation of

numerical variables and can be calculated for a pair of random variables (𝑋, 𝑌) using the

following equation:

Equation 1-19:

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where 𝐸 is the expectation, 𝜇 is the mean and 𝜎 is the standard deviation. Pearson’s correlation

coefficient is therefore the covariance of two variables divided by the product of their standard

deviations.
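
A minimal sketch of Equation 1-19 is shown below, using population (rather than sample) estimates of the covariance and standard deviations, which is an assumption of this example. The data are hypothetical and are chosen to show that rescaling one variable changes the covariance but not the correlation.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance E[(X - mu_X)(Y - mu_Y)]."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    sigma_x = math.sqrt(covariance(xs, xs))
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)

# Hypothetical paired observations; ys_scaled is ys expressed in different units.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
ys_scaled = [10.0 * y for y in ys]

print(covariance(xs, ys), covariance(xs, ys_scaled))  # covariance changes with scale
print(pearson(xs, ys), pearson(xs, ys_scaled))        # correlation does not
```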

Mutual Information

Mutual information (MI) can be used to measure the statistical dependence of two random

variables, as with the measures of covariance and correlation. The main difference between

mutual information and common correlation methods is that correlation determines the effect

of non-independence on the product of two random variables, whereas MI determines the effect

of non-independence on the joint probability distribution. This allows the MI to measure non-

linear relationships between random variables, extending the generalisability of the correlation

metric which is limited to linear relationships. MI can be related to the entropy of the two

random variables and can be represented visually as in Figure 1-12.

Figure 1-12: H(X) is the entropy of random variable X and is represented by the left circle (yellow and green), H(Y) is the entropy of random variable Y and is represented by the right circle (blue and green), H(X|Y) is the conditional entropy of X given Y (yellow only), H(Y|X) is the conditional entropy of Y given X (blue only). H(X,Y) is the joint entropy of both variables and encompasses both circles. I(X;Y) is the mutual information and is the intersection of the two variable entropies (green region).

As mentioned in Decision Trees, entropy describes the amount of information contained within

a variable and was first proposed from the Shannon entropy.199 From Figure 1-12 it can be

rationalised that the MI can be described as the information that two random variables share

i.e. how much knowing one variable reduces the uncertainty of the other:228

Equation 1-20:

𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌)

Another interpretation from Figure 1-12 is that the MI measures the departure of the

joint distribution of X and Y from that expected under the assumption of independence:

Equation 1-21:

𝐼(𝑋; 𝑌) = 𝐻(𝑋) + 𝐻(𝑌) − 𝐻(𝑋, 𝑌)

Overall, MI is calculated by the following equation for two discrete random variables X and Y:

Equation 1-22:

$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$$

Where 𝑝(𝑥, 𝑦) is the joint probability distribution function of X and Y discrete random variables,

and 𝑝(𝑥) and 𝑝(𝑦) are the marginal probability distribution functions for X and Y respectively.

Values of the MI range from zero, indicating that X and Y are completely independent variables,

to +∞. For ease of interpretation, the MI values are often normalised to within the bounds of 0

to 1 to give the Normalized Mutual Information (NMI):

Equation 1-23:

$$NMI(X, Y) = \frac{2 \times I(X; Y)}{H(X) + H(Y)}$$
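
Equation 1-21 to Equation 1-23 can be sketched for discrete variables as below. Probabilities are estimated from observed counts and logarithms are taken to base 2 (an assumption, giving values in bits). The example data are hypothetical and show a non-linear relationship (Y = X²) for which the covariance, and therefore the Pearson correlation, is zero but the MI is not.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, estimated from observed frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def normalised_mutual_information(xs, ys):
    """NMI = 2 I(X;Y) / (H(X) + H(Y)); undefined if both entropies are zero."""
    return 2.0 * mutual_information(xs, ys) / (entropy(xs) + entropy(ys))

# Hypothetical non-linear relationship: Y = X^2.
xs = [-1, 0, 1]
ys = [x * x for x in xs]

mi = mutual_information(xs, ys)
joint_entropy = entropy(list(zip(xs, ys)))
print(mi, entropy(xs) + entropy(ys) - joint_entropy)  # Equation 1-21 gives the same value
print(normalised_mutual_information(xs, ys))
```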

Likelihood Ratio

Sensitivity and specificity values are often used in diagnostic testing; however, the likelihood

ratio is a more powerful metric used in clinical applications.229 To construct a diagnostic test

analysis a 2 x 2 contingency table is created based on the test results against the disease

condition (Figure 1-13).

Figure 1-13: The set-up for evaluation of a diagnostic test.

The sensitivity metric is calculated by Equation 1-24 and can be interpreted as the ability of

the test to find the individuals with the disease. Conversely, the specificity (Equation 1-25) is the

ability of the test to identify those individuals without the disease. Both are important for

determining the degree of success of the test.

Equation 1-24:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN}$$

Equation 1-25:

$$\mathrm{specificity} = \frac{TN}{TN + FP}$$

The likelihood ratio extends the analysis to provide a value describing how many times more or

less likely patients with the disease are to have that particular test result than patients without

the disease. The likelihood ratios are ratios of probabilities and are calculated directly from the

sensitivities and specificities for a two-outcome case:

[Figure 1-13 content: 2 x 2 contingency table]

                          Disease: Positive       Disease: Negative
Test Result: Positive     True Positives (TP)     False Positives (FP)
Test Result: Negative     False Negatives (FN)    True Negatives (TN)

Equation 1-26:

$$LR^{+} = \frac{\mathrm{sensitivity}}{1 - \mathrm{specificity}}$$

Equation 1-27:

$$LR^{-} = \frac{1 - \mathrm{sensitivity}}{\mathrm{specificity}}$$

There are two versions of the likelihood ratio (LR), the positive and negative likelihood ratios

(LR+ and LR-). The LR+ takes the proportion of those with the disease who had the positive test

outcome (sensitivity) and divides this value by the proportion of those without the disease who

had the positive test outcome (1- specificity) (Equation 1-26). The LR- is the proportion of those

with the disease who had the negative test outcome (1- sensitivity) divided by the proportion

of those without the disease who had the negative test outcome (specificity) (Equation 1-27).

The likelihood ratio has been argued to be a better indicator of risk for this type of analysis than

the traditional sensitivity metric, as it takes into account the false negatives and the false

positive values,230 as well as providing extra interpretation. The utility of the likelihood ratio is

that it can be used to calculate the probability of the presence of disease from different tests

using a modified form of Bayes' theorem, where the likelihood ratio is multiplied by the pre-test odds to

give the post-test odds (Equation 1-28). The advantage of the likelihood ratio in diagnostic

testing is that the prior probability from the context can be incorporated into calculating the

post-test risk of disease making for a more generalisable risk calculation, whereas the sensitivity

and specificity metrics are affected by the prevalence of the outcome and are context

dependent.231

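Equation 1-28:

$$\frac{p(\text{disease} \mid \text{positive test})}{p(\text{no disease} \mid \text{positive test})} = \frac{p(\text{positive test} \mid \text{disease})}{p(\text{positive test} \mid \text{no disease})} \times \frac{p(\text{disease})}{p(\text{no disease})}$$

Posterior odds ratio = Likelihood ratio (LR+) × Prior odds ratio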

Practically, a positive likelihood ratio of greater than 1 indicates that the test result is associated

with the disease, whereas a negative likelihood ratio less than 1 indicates that the test result is

associated with the absence of the disease. Ratios above 10 or below 0.1 are generally considered

of strong enough evidence to determine if the test is useful in diagnosing the disease or not.232
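
Equation 1-24 to Equation 1-28 can be combined in a short sketch. The contingency table counts and the pre-test probability below are hypothetical and are chosen only to illustrate how the likelihood ratio converts pre-test odds into post-test odds.

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def positive_likelihood_ratio(sens, spec):
    return sens / (1.0 - spec)

def negative_likelihood_ratio(sens, spec):
    return (1.0 - sens) / spec

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the likelihood ratio, convert back."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = likelihood_ratio * pre_test_odds
    return post_test_odds / (1.0 + post_test_odds)

# Hypothetical contingency table counts.
tp, fn, tn, fp = 90, 10, 80, 20

sens, spec = sensitivity(tp, fn), specificity(tn, fp)
lr_pos = positive_likelihood_ratio(sens, spec)
lr_neg = negative_likelihood_ratio(sens, spec)

# Hypothetical pre-test probability (prevalence) of 10 %.
print(lr_pos, lr_neg)
print(post_test_probability(0.10, lr_pos))
```

With these counts the positive likelihood ratio is approximately 4.5, so a positive test raises a 10 % pre-test probability to approximately 33 %; by the thresholds quoted above this would not be considered strong evidence on its own.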

1.2.2 Summary

This section highlights the general computational methods employed as part of this thesis,

providing a broader explanation of their theory and positioning within the data science field.

This thesis is concerned with applying machine learning and statistical techniques to derive

knowledge from selectivity and toxicity data within the context outlined in Challenges in Drug

Discovery.

Next, we introduce the previous in silico studies conducted in the fields of target-family

selectivity prediction and clinical toxicity prediction.

1.3 Computational Methods and Applications for Bioactivity and Selectivity Profile Prediction

Given the importance of the concepts of selectivity and polypharmacology to drug design

(Selectivity and Polypharmacology) and the lack of feasibility of screening all potential targets

for all molecules during the drug discovery process, in silico approaches are important for the

prediction of the off-target effects of compounds.10

Approaches to selectivity prediction often use the principle that similar ligands bind similar

targets.233 This field is known as chemogenomics.234 An analysis of Protein Data Bank (PDB)

structures and their co-crystallised ligands has shown that similar ligands often bind to protein

targets in a similar way.235 From this concept, many approaches have been

developed to exploit the differences between targets and the differences between ligands, or a

combination of both, to predict or infer bioactivity against multiple targets, the aggregation of

which can be used for selectivity profile prediction.

1.3.1 Ligand-Target Interactions

To understand what is meant by the difference between ligands and targets, it is important to

discuss which types of features of the ligand or target are important for binding. Ligands are

often small molecules which bind to a protein. The free energy of this binding event is based on

the changes in entropy (ΔS) and enthalpy (ΔH) on binding and the temperature (T), and is given by

the Gibbs free energy (ΔG) equation:236

Equation 1-29:

$$\Delta G = \Delta H - T\Delta S$$

A negative ΔG is required for spontaneous binding, which is favoured by a positive overall entropy

change and a negative overall enthalpy change. Generally, upon binding of a small molecule the

entropy (ΔS) of the system increases, due to the liberation of crystal water molecules from the

lipophilic binding pocket into the bulk solution; however, this is balanced by the loss of entropy

from the reduction of degrees of freedom (conformational, translational and rotational) for both

the ligand and protein upon binding as opposed to the free state.237 The degree of entropy

contribution to binding can be modified by parameters such as the conformational flexibility of

the ligand and its overall size and shape. The enthalpy of binding (∆𝐻) is determined by specific

interactions made between protein and ligand and these need to be stronger interactions than

the molecule would make in the aqueous environment to have a negative and favourable change

in enthalpy upon binding.237 The types of interactions that are enthalpically favourable include

those related to functional groups in the molecule which can make specific interactions with

the protein such as hydrogen-bonding or ionic interactions, or more general interactions such

as hydrophobic contacts, shape complementarity and flexibility.
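
The trade-off between the enthalpic and entropic terms in Equation 1-29 can be made concrete with a small calculation. The following is a sketch only: the function name, the default temperature of 298.15 K and the ΔH and ΔS values are assumptions chosen for illustration and are not measurements from any study cited here.

```python
def gibbs_free_energy(delta_h: float, delta_s: float, temperature: float = 298.15) -> float:
    """Return dG = dH - T*dS (Equation 1-29).

    delta_h in kJ/mol, delta_s in kJ/(mol K), temperature in K.
    """
    return delta_h - temperature * delta_s


# Illustrative numbers only; they are not experimental values.
enthalpy_driven = gibbs_free_energy(delta_h=-50.0, delta_s=-0.05)  # pays an entropic penalty
entropy_driven = gibbs_free_energy(delta_h=10.0, delta_s=0.10)     # pays an enthalpic penalty

for label, delta_g in [("enthalpy-driven", enthalpy_driven), ("entropy-driven", entropy_driven)]:
    verdict = "favourable" if delta_g < 0 else "unfavourable"
    print(f"{label}: dG = {delta_g:.2f} kJ/mol ({verdict})")
```

Both hypothetical cases give a negative ΔG, although in each one of the two terms opposes binding. This is why the enthalpic and entropic contributions are considered together.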

1.3.2 Computational Encoding of Ligands and Targets

As mentioned above, the binding of small molecules to proteins is governed by molecular

properties which contribute to maximising the entropy and minimising the enthalpy of binding.

Here we discuss how we can computationally represent features of ligands and targets which

may be important towards these binding interactions for use in in silico approaches.

Chemical Descriptors

Since there is complexity in the factors that govern ligand binding to proteins, many different

computational ligand descriptors are available to capture these diverse molecular properties.

The currently known chemical descriptors can be divided into three categories: 1D, 2D and 3D

descriptors.

1D descriptors calculate the physicochemical properties of the compounds. Common examples

of these are DRAGON238 or PaDEL239 descriptors, which can be generated rapidly for large

numbers of compounds and describe various aspects of molecules including shape, size,

lipophilicity, hydrogen bonding ability, chemical composition and combinations thereof.

2D descriptors are also called topological descriptors,240 and are numerical values that describe

the 2D connectivity of the molecule, based on calculating properties on the molecular graph

representation, where atoms are nodes and bonds are edges. These descriptors encompass

molecular size, composition, shape, branching and bonding of the molecule.241 One example,

MDL MACCS keys, uses predefined substructure keys to record the presence or absence of a

substructure in molecules, based on the molecular graph. These are often used for substructure

59

searching and similarity approaches and are easily interpretable.242 Other examples use both

information about the physicochemical properties of the molecule and the position of atoms in

the molecular structure to describe the molecule, for example, the autocorrelation

descriptors.243 These are derived by calculating the number of bonds between all atom pairs and

introducing a coefficient of the product of the properties for each atom pair, overall leading to

a measure of the property distribution over the topological structure. Another commonly used

2D descriptor is the extended connectivity fingerprint (ECFP), which is a type of circular

fingerprint that was developed for structure-activity modelling purposes.244 Circular

fingerprints are an alternative way of representing the 2D structure of a molecule, which does

not encode specific connectivity, but contains information about the arrangement of all heavy

atoms within a certain radius in a molecule from a central atom (Figure 1-14).245

Figure 1-14: Shows the generation of Morgan Fingerprints. The atoms within certain distances of the central atom are encoded in layers as substructures into a binary fingerprint of fixed length by the use of a hashing function.

ECFPs are generated based on the Morgan fingerprint algorithm246 and the size is determined

by the number of bonds covered by the largest substructure, which is twice the radius specified.

Therefore, ECFP4 fingerprints will be generated from a two-layer analysis. ECFP or Morgan

fingerprints encode the presence or absence of substructures in a molecule using a linear binary

string of defined bit length. These fingerprints have the advantage that they are not restricted

to predefined substructures. However, more than one substructure can be mapped to one bit

position in a hashed fingerprint calculation, since there may be more substructures than the

number of bits specified, causing bit collisions. A benchmarking study found that ECFPs

outperformed other fingerprint types in a virtual screening application.247
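
The layer-by-layer hashing and the origin of bit collisions can be sketched with a toy example. The code below is not the ECFP or Morgan algorithm as implemented in any cheminformatics toolkit: it ignores bond orders, charges, ring membership and the removal of duplicate environments, and the function names and the molecule encoding (a list of element symbols plus a list of bonded index pairs) are assumptions made purely for illustration.

```python
import hashlib


def _hash_to_int(value: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Fold atom environments of up to `radius` bonds into `n_bits` bits.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    Returns the bit list and the number of distinct environments found.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Layer 0: the identifier of each atom depends only on the atom itself.
    identifiers = [_hash_to_int(symbol) for symbol in atoms]
    features = set(identifiers)

    # Each iteration extends every atom's environment by one bond.
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            neighbour_ids = sorted(identifiers[j] for j in neighbours[i])
            updated.append(_hash_to_int(f"{identifiers[i]}|{neighbour_ids}"))
        identifiers = updated
        features.update(identifiers)

    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1  # distinct environments may share a bit (collision)
    return bits, len(features)


# Ethanol heavy atoms (C-C-O); radius 2 covers up to four bonds, as in ECFP4.
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
bits, n_environments = toy_circular_fingerprint(ethanol_atoms, ethanol_bonds)
tiny_bits, _ = toy_circular_fingerprint(ethanol_atoms, ethanol_bonds, n_bits=2)
print(n_environments, sum(bits), sum(tiny_bits))
```

With only two bits available, the environments found for ethanol cannot all occupy separate positions, so several are forced onto the same bit. This is the collision effect described above, in exaggerated form.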

The methods described so far do not account for the 3D conformation of molecules in the

binding site, which can have an important influence on binding. 3D descriptors address this, but one of

their main limitations is that they require the binding conformation of the ligand to be

known. Otherwise this conformation must be predicted, usually

by generating a low energy conformer for the molecule, from which 3D descriptors are

subsequently calculated. This makes 3D descriptors very dependent on the conformation

selected. 3D descriptors encode the conformation-specific properties across a molecule,

including 3D pharmacophores, shapes, potentials, fields and atomic co-ordinates. Many 3D

descriptors rely on ligand alignment and a comparison of properties on a grid in Cartesian space, for

example in comparative molecular field analysis (CoMFA).248 Grid Independent Descriptors

(GRINDs) are independent of ligand alignment and instead use structural features of the

compound that are energetically favourable in the interaction with probes to form the

descriptors.249 3D Volsurf descriptors are examples of internal co-ordinate based descriptors

calculated from the 3D conformation of molecules and encode the distribution of molecular

size, shape, hydrophilicity and hydrophobicity across the molecule.250–253 More recently,

3D circular fingerprints called extended three-dimensional fingerprints (E3FPs) have been derived by encoding

atoms in shells of increasing size from a central atom for an ensemble of molecular conformers

for each molecule, an approach similar in principle to the derivation of 2D ECFP fingerprints.254

In addition, the resurgence of deep learning has led to the invention of different

architectures which can be used to generate explicit or implicit chemical descriptors. One

example is variational autoencoders, which can be used as continuous descriptors of chemical

space, derived from SMILES or graph representations of molecules.255 These descriptors are

being used in many applications due to their ability to perform automatic feature extraction, as

well as their usage in the generation of new molecules. Convolutional neural networks can

derive representations from graphs, and have been used in the derivation of features from

compounds,256 and recurrent neural networks are adept at learning sequences and can learn

chemical structure from SMILES.257 These methods have the advantage that they can automatically

generate useful and meaningful representations of molecules, reducing the need for explicit

feature engineering, in this case, selecting and calculating chemical descriptors.

Target Descriptors

In a similar way to compounds, biological targets can be encoded by quantitative descriptors.

These descriptors can be divided into five groups: 1. atom-based, 2. sequence-based, 3. amino

acid-based, 4. conformation-based and 5. structure-based descriptors.258

Atom descriptors encode SMARTS definitions of all the different atom types in the protein and

can generate a 39-bit target descriptor based on a count of the 39 different types of atom in the

protein.258,259 Sequence-based descriptors contain information about the primary sequence of

amino acids in the protein or binding site of interest.260 Examples include descriptors which

encode the relative composition of each of the 20 amino acids in the sequence (for amino acid

composition descriptors),261 or the composition of combinations of residues (e.g. for dipeptide

composition-based descriptors),262 within the structures of interest. These descriptors can

represent a local structure of the protein. In one study,263 target descriptors were generated

based on identifying all the amino acids within a defined radius (6.5 Å) of the ligand and then

using these amino acids as the central point for identification of the four nearest sequence

neighbours (two either side) to produce a sequence of 5 amino acids. This information

combined within the radius is then counted as a local substructure. This produced a library of

substructures to which each of the receptors was compared and led to their description by five

local structures, matched to these local substructures created from four proteins. Therefore, 20

(4 × 5) bits described each receptor. Properties of the sequence can also be encoded by a number

of different sequence-based descriptors, for example, with Moran autocorrelation descriptors.264

Often these sequence-based descriptors are used for predicting protein functional families.260
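
As an illustration of the simplest sequence-based descriptors mentioned above, the following sketch computes amino acid composition (20 values) and dipeptide composition (400 values) for a sequence. It is a minimal example under stated assumptions: the function names are ours, only the 20 standard residues are handled, and the binding-site string is invented for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence: str) -> list[float]:
    """Fraction of each of the 20 amino acids in the sequence."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> list[float]:
    """Fraction of each of the 400 ordered residue pairs among adjacent positions."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


binding_site = "WPFVYNLF"  # hypothetical residue string, for illustration only
aac = amino_acid_composition(binding_site)
dpc = dipeptide_composition(binding_site)
print(len(aac), len(dpc))  # 20 400
```

Both descriptors have a fixed length regardless of sequence length, which is what allows proteins or binding sites of different sizes to be compared in the same model.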

Amino acid-based descriptors include information about each amino acid in the sequence.

Types of information used are physicochemical properties, topological information and feature

information. The most widely used amino acid based descriptors are the Z-Scales descriptors.259

These are derived by principal component analysis (PCA) and contain information about the

physicochemical properties of the amino acids including: hydrophobicity (Z1), steric bulk and

polarisability (Z2), polarity (Z3), and electronic effects (Z4 and Z5).265 Other descriptors of this

type commonly used are VHSE266 and ProtFP PCA.267 ST-Scales268 and T-scales269 are based on

PCA analysis of mainly topological properties, and ProtFP (feature) describes each amino acid

by a single feature.267 BLOSUM descriptors270 were derived from physicochemical properties

using a VARIMAX analysis of the BLOSUM62 substitution matrix used to score alignments for

protein sequences.271 Additionally, a different type of descriptor was also generated based on 3D

electrostatic properties of amino acids called MS-WHIM.272 A benchmark of these descriptors

found that the Z-Scales descriptors performed the best in selectivity studies and that

combining Z-Scales with ProtFP (feature descriptors) consistently resulted in a small increase

in model performance.267

Conformational structure-based descriptors use 3D information about the protein or binding

site to create a descriptor. Wu et al., 2012273 derived structural similarity descriptors from the

Protein Comparison Tool, part of the PDB resources, and geometry descriptors were computed

from the bond lengths and angles of the main backbone atoms in the protein sequence.274

Additionally, Qiu et al., 2015 produced a novel protein descriptor for modelling the antigen-

antibody interactions using a cylinder model, where an interaction face was computed and

placed in the X-Y plane. Then, a rotating plane was defined in the Z axis with a defined distance

radius. Through a similar method to 2D circular fingerprinting for ligands e.g. ECFPs, the 3D

fingerprint can be generated at different shells (distances from the central point) by coding the

presence of neighbouring residues within each layer.275 More recently, Watermap derived field-

based descriptors were used to encode the binding site of proteins for selectivity modelling.

Kinase-specific 3D protein descriptors have been crafted in one study, by computing the

pairwise distances between residues. This method also encoded the active and inactive forms of

kinases, which is known to be important for the binding of type I vs type II kinase inhibitors.276

Ligand-Target Interaction Descriptors

Owing to the increasing availability of liganded co-crystal structures for small molecule

binders to macromolecules, molecular descriptors can be encoded from the interaction between

small molecule and protein, known as protein-ligand interaction descriptors. Many versions of

this concept have been produced,277 including fingerprints which encode interactions between

the ligand and its protein, such as hydrogen-bonding, hydrophobic contacts, π-stacking

interactions, π-cation interactions, salt bridges, water bridges and halogen bonds.278,279

1.3.3 Ligand-Based Bioactivity Prediction Against Multiple Targets

For ligand-based bioactivity prediction, the structures of known ligands for a target or multiple

targets are used to predict future ligand interactions with the same target(s). Compound

similarity methods can be used to predict bioactivity for a new compound based on the activities

of known compounds. Methods using chemical similarity have been employed extensively for

virtual screening against a target to predict new hits and can be thought of as a multi-step

process consisting of 1. taking a set of reference compounds, 2. computing their descriptors or

pharmacophores (as described above), 3. developing a screening protocol such as employing

quantitative structure activity relationship (QSAR) techniques, or pharmacophore/shape-based

searches, and then 4. screening a library of compounds against the protocol to identify hits with

the same predicted biological activity as the reference compounds.234 QSAR methods employ

different machine learning algorithms280 to learn the relationships between the descriptor

properties and the bioactivity endpoint, inherently modelling similarity in a variety of linear

and nonlinear ways. The pharmacophore and shape-based methods work by using the 3D

structures of existing ligands to find new molecules in the library that mimic either the key

functionality of the molecules or the binding conformation. One commonly used program for

this method is Rapid Overlay of Chemical Structures (ROCS).281,282 These methods are

limited by the fact that the binding conformation information needs to be available or predicted

for known ligands.

For selectivity profile prediction, target prediction methods have been implemented as

ensembles of QSAR models, one for each target, based on the training data of the chemotypes

known previously for each target.283 Multi-task deep learning frameworks now allow for the

bioactivity prediction across multiple targets, using only compound information.284 This

method is based on learning more complex representations shared between prediction tasks,

and allows for transfer learning particularly for targets with less bioactivity data.

1.3.4 Target-Based Bioactivity Prediction Against Multiple Targets

Target-based bioactivity prediction can be used when information is available for the primary,

secondary or tertiary structure of the protein of interest, often in the form of X-ray or NMR

derived protein structures or homology modelling.285 Docking methods are extensively used in

order to screen libraries of compounds against the 3D structure, by firstly sampling

conformations of the ligand in the binding site and secondly scoring those conformations

according to an estimated binding affinity.286 Docking methods can be flexible or rigid with respect

to the ligand and receptor, with more flexible methods being more computationally

expensive.287 With a flexible ligand, the method aims to find the highest scoring pose of the

ligand. Often an ensemble of rigid protein conformations is used for docking and ligands are

scored against each conformation. Whilst docking is useful for predicting binding poses of a

ligand and ranking activities, the predictive ability of scoring functions is often poor.288

Flexible ligand docking programs commonly used include Dock, Gold, Glide and

AutoDock.288,289 Docking methods are suited to low or medium throughput molecule bioactivity

prediction.

The main target-based in silico approaches for selectivity profile prediction are based on

sequence and structural analysis of the binding site and the concept of binding site similarity.

This can be used to identify the binding site features which could be targeted to discriminate

between activity for the on-target and off-target.8 Sequence-based similarity methods have

previously been used to predict selectivity for related biological targets where structural

information is limited. As mentioned before, once sequences are aligned, there are many descriptors

that can be calculated for the binding site, from which distance matrices from sequence identity

or sequence similarity can be computed.290,291 By identifying residues responsible for selectivity

by inspection of the results of sequence analysis, targeted libraries of compounds have been

designed to interact with the residues identified.292
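
A distance matrix based on sequence identity, as mentioned above, can be sketched as follows. This is a minimal illustration rather than the procedure of the cited studies: it assumes that the binding-site residues have already been aligned to equal length, that gap positions (written as "-") are skipped, and that distance is defined as one minus the fraction of identical residues. The target names and sequences are invented.

```python
def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical residues over aligned, non-gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def identity_distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise distances, defined here as 1 - sequence identity."""
    names = list(aligned)
    return {
        (p, q): 1.0 - sequence_identity(aligned[p], aligned[q])
        for p in names
        for q in names
    }


# Hypothetical aligned binding-site residues; names and sequences are made up.
aligned_sites = {"target_A": "WPFVYNL", "target_B": "WPFIYNL", "target_C": "WP-IYDL"}
distances = identity_distance_matrix(aligned_sites)
for pair in [("target_A", "target_B"), ("target_A", "target_C"), ("target_B", "target_C")]:
    print(pair, round(distances[pair], 3))
```

The positions at which two sites differ (here, for example, V versus I) are the kind of residue that can then be inspected by eye as a candidate for selectivity.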

Structure based similarity methods can also be used to understand and predict the selectivity

profiles of small molecules, including different binding site comparison methods which have

been used to understand the similarities and differences between binding sites that could be

exploitable for selectivity.234 To implement these methods, the macromolecular protein

sequences need to be aligned and the binding sites identified, from which the binding site

descriptors can be calculated. Binding site comparison algorithms have previously used

different descriptors to calculate similarity,293 including fingerprint methods,294 grid

methods,295 graph methods,296 distribution methods297 and geometric methods including

volumes, points and clouds.298 Once the binding site comparisons have been made, approaches

to screen ligands can be used to find selective hits for one target over another.

In one study, “binding site signatures” from the active site of kinases were used for selectivity

prediction. These signatures were defined as the energetically important interactions from co-

crystal structures of kinase inhibitors, and this information was used to successfully predict

kinase off-targets for small molecule kinase inhibitors.299,300 This method achieved a prediction

accuracy of 90 % across a large data set, showing the relevance of these signatures to kinase

selectivity.

Screening approaches can be used once the binding sites have been identified as exploitable for

selectivity. These approaches are designed to find hits for a target based on the binding site

information. Docking-based virtual screening methods, which describe the binding site shapes and

pharmacophores to search for ligands, can be used for this purpose.301 For example, a

protocol involving structure-based virtual screening was used to find selective hits for the

histone deacetylase-6 protein.302

1.3.5 Ligand and Target-Based Bioactivity Prediction Against Multiple Targets

Combining both the ligand and the target information, where possible, is now a common

approach to bioactivity prediction. Ligand-protein interaction fingerprints (see Ligand-Target

Interaction Descriptors) can be used in docking-based methods for virtual screening approaches

to identify novel hits.303 In a similar way to the ligand shape-based methods, interaction

fingerprints can be used to determine if the predicted binding pose is similar to the reference

binding pose and this requires measurement of similarity between interaction fingerprints.277
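
One common way to measure the similarity between two interaction fingerprints is the Tanimoto coefficient, which is used here as an assumption; the cited work may use other metrics. In the sketch below each fingerprint is represented as a set of (residue, interaction type) pairs, and the residue labels and interactions are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Shared interactions divided by all interactions seen in either pose."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical interaction fingerprints for a reference pose and a docked pose.
reference_pose = {("ASN140", "hbond"), ("TYR97", "water_bridge"), ("TRP81", "hydrophobic")}
docked_pose = {("ASN140", "hbond"), ("TRP81", "hydrophobic"), ("LEU92", "hydrophobic")}

print(tanimoto(reference_pose, docked_pose))  # 0.5
```

A docked pose that reproduces more of the reference interactions scores closer to 1, so a threshold on this value could be used to filter poses, although the appropriate threshold would depend on the system.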

Within cheminformatics approaches, a combination of compound and target descriptors has

been used to build machine learning models for selectivity prediction. These are known as

proteochemometric (PCM) models. PCM models are applied as part of this study, and so in the

next section we review this integrative technique for the prediction of the selectivity profiles of

ligands for target families.

Proteochemometric Modelling

As mentioned in Ligand-Based Bioactivity Prediction Against Multiple Targets, QSAR can be

used as a ligand-based method for bioactivity and selectivity profile prediction. However, this

method has limitations in its applicability domain, reducing chemical space inclusion in

models.304 Additionally, QSAR only takes into account how chemical structure influences

bioactivity when, in reality, aspects of both the target and compound combined are important.

In order to develop models that contain a higher diversity of chemical space and also to

incorporate additional target information, predictive models for multiple related biological

targets have been created, known as proteochemometric (PCM) models.305 Target information

increases the ability of the model to identify new ligand scaffolds and binding interactions in

chemical space outside the training set and can also help distinguish how small differences in

SAR might lead to large differences in bioactivity.258 Additionally, PCM models offer the

advantage of exploring compound activity across the entire target family, which allows for the

interpretation of the degree of selectivity, or conversely, promiscuity of a compound, as opposed

to just the interaction of one compound with one target.209 The models use information from

both chemical and biological spaces in order to predict the bioactivity of new compounds

(interpolation from known targets) or new biological targets (interpolation from known

compounds) or both (extrapolation) (Figure 1-15),209 and rely on the concept that similar drugs

bind similar targets, which has been found to be at least partially true; in one study it was found

that 71 % of drugs interact with at least two targets with similar binding sites.18 This approach

has potential application in repurposing compounds for new biological targets, or

suggesting new compounds for orphan receptors.209 Moreover, through PCM modelling, it is

possible to understand which features of the protein and ligand contribute to the bioactivity, by

identifying key structural motifs and binding site hotspots; this can then be extended to

interpretation of how structural binding site differences influence ligand selectivity profiles.

Non-bioactivity endpoints can also be used for PCM including phenotypic, genomic and

toxicological readouts.209

Figure 1-15: Adapted from Cortes et al. 2015.306 Shows the process of PCM modelling as a matrix format. The modelling is similar to QSAR except that the model incorporates data for multiple related targets and can therefore interpolate for both new biological and new chemical space.

The overall process for producing a PCM model is highlighted in Figure 1-16.258,209 Firstly,

compound and target descriptors are calculated and combined with the bioactivity data. The

data is then divided into training and test sets, with the model being trained on the training set.

Parameter and variable optimisation are conducted, each model usually being validated by

leaving a portion of the training set out (cross-validation) and testing on this portion. The model

is then used to predict the test set examples and evaluated using appropriate methods for

regression or classification. Subsequently, a test for overfitting can be conducted using Y-

randomisation approaches. Once the model is considered representative, other validation

methods can be explored to define the ability of the model to predict for unseen targets using

leave one target out (LOTO) and for unseen compound clusters using leave one compound

cluster out (LOCCO) validations. The gold standard for assessing predictivity of a model is to

make prospective predictions and have those predictions experimentally confirmed, although

in practice this is often more difficult to achieve. The following sections will discuss some of

these aspects in further detail.
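
The leave one target out (LOTO) and leave one compound cluster out (LOCCO) validations can both be expressed as a grouped split, in which every record belonging to one group is held out in turn. The sketch below is a minimal illustration: the record layout, the function name and the toy data (including the cluster assignments, which would normally come from a separate clustering step) are assumptions.

```python
def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train, test), holding out all records of one group."""
    groups = sorted({record[group_key] for record in records})
    for held_out in groups:
        train = [r for r in records if r[group_key] != held_out]
        test = [r for r in records if r[group_key] == held_out]
        yield held_out, train, test


# Toy compound-target records; cluster labels are assumed to be precomputed.
records = [
    {"compound": "cpd1", "cluster": 0, "target": "T1", "active": 1},
    {"compound": "cpd1", "cluster": 0, "target": "T2", "active": 0},
    {"compound": "cpd2", "cluster": 0, "target": "T1", "active": 1},
    {"compound": "cpd3", "cluster": 1, "target": "T2", "active": 0},
    {"compound": "cpd3", "cluster": 1, "target": "T3", "active": 1},
]

loto_folds = list(leave_one_group_out(records, "target"))    # LOTO
locco_folds = list(leave_one_group_out(records, "cluster"))  # LOCCO

for target, train, test in loto_folds:
    print(f"held-out target {target}: {len(train)} training pairs, {len(test)} test pairs")
```

In each fold a model would be trained on `train` and evaluated on `test`, so that performance is always measured on a target (or compound cluster) that the model has not seen.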

[Figure 1-15 panel labels: Compounds; Targets; New compound; New target; QSAR; Bioinformatics; Interpolation from known compounds; Interpolation from known targets; Extrapolation for new compounds on new targets. Legend: Active; Inactive; Missing value.]

Figure 1-16: Schematic describing the general process of PCM modelling in sequence. Compiled from information in review articles on PCM.209,258

Machine learning algorithms are used in PCM modelling to develop a mapping function to

predict the relationships between the input values (variables) and the output values

(observations) in a system.200 In the case of PCM, instances are compound-target pairs,

described by a concatenation of compound and target variables. For each instance either a

numerical or categorical value is given to represent the activity of the compound-target pair

combination (Figure 1-17). This learned model can then be used to predict observations for new

instances, provided it has access to the same input variables for the new instance. There are

many algorithms that have been used in this field including: Random Forests (RF), Support

Vector Machines (SVM), Neural Networks (NN) (including Deep Neural Networks), and Naive

Bayes.209,200 Details of machine learning methods are discussed in Machine Learning

Algorithms.
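
The construction of PCM instances by concatenating compound and target variables can be sketched as follows. All names and values are hypothetical: the compound descriptors stand in for fingerprint bits, the target descriptors stand in for amino acid-based descriptors, and the labels use 1 and 0 for the active and inactive classes.

```python
# Toy descriptor tables and bioactivity labels, invented for illustration.
compound_descriptors = {"cpd1": [1, 0, 1, 1], "cpd2": [0, 1, 1, 0]}
target_descriptors = {"T1": [0.2, -1.3], "T2": [1.1, 0.4]}
bioactivity = {("cpd1", "T1"): 1, ("cpd1", "T2"): 0, ("cpd2", "T1"): 1}


def build_pcm_matrix(bioactivity, compound_descriptors, target_descriptors):
    """Return (ids, X, y) with one row per measured compound-target pair."""
    ids, X, y = [], [], []
    for (compound, target), label in sorted(bioactivity.items()):
        ids.append((compound, target))
        # Concatenate the compound variables with the target variables.
        X.append(compound_descriptors[compound] + target_descriptors[target])
        y.append(label)
    return ids, X, y


ids, X, y = build_pcm_matrix(bioactivity, compound_descriptors, target_descriptors)
for pair, row, label in zip(ids, X, y):
    print(pair, row, label)
```

Unmeasured pairs, such as cpd2 against T2 here, are simply absent from the training matrix; the same concatenation can be applied to them later to obtain a prediction from a trained model.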

[Figure 1-16 flowchart steps, in order: Calculate chemical descriptors; Align proteins; Calculate protein descriptors; Tune model to find best parameters using cross validation; Build final model using cross validation; Validate model on test set; Perform Y-scrambling to detect overfitting; Assess applicability domain (LOCCO, LOTO analysis); Experimentally validate and interpret model.]

Figure 1-17: Shows the design of the PCM model. Each row is a compound-target pair and the model takes a concatenated input of compound and target descriptors for the pair and predicts an outcome (in this example binary classification where 1 and 0 are the positive and negative classes respectively).

As mentioned above, PCM modelling is an extension of Quantitative Structure Activity

Relationship (QSAR) modelling.307 PCM has been used to create bioactivity prediction models

for a variety of different target classes to date. This includes G-protein coupled receptors, for

which PCM models were built using Support Vector Regression algorithms achieving a model

with an R² = 0.93 and Q² (test) = 0.74. The ROC AUC for an external test set was 0.89.308,249 Kinases

have also been explored with this technique, including a model which covered 95 kinases and

1572 inhibitors using 3D protein field based descriptors, achieving a ROC AUC on external test

set of greater than 0.8.251,309 Viral mutants have also been studied by PCM, including a study of HIV

proteases containing 4792 protease-inhibitor pairs, which achieved R² = 0.92 and Q² = 0.87, as well as a

high predictivity of 0.72 for leave one target out analysis.310 A classification model was built for

the cytochrome P450 enzymes using a dataset with 63391 datapoints.311 This model achieved a

ROC AUC of greater than 0.9 for both RF and SVM models on the internal and external test

sets. Epigenetic protein classes have been explored using PCM, including the HDAC proteins,

which, when modelled by SVM, achieved an R² = 0.99 and a Q² = 0.75.274 A more recent study

employed PCM for the identification of allosteric modulators of glutamate receptors achieving

an overall ROC AUC of 0.97 for the model, which was used to identify hits using virtual

screening for the glutamate 7 receptor described as an orphan target.312 A model for nuclear

receptors was built achieving a ROC AUC on the external test set of 0.74 using RF and revealed

the molecular scaffolds predicted for five major nuclear receptor targets.313 What is notable is

the high performance for previous PCM models, with ROC AUCs > 0.9. This could be attributed

to the structure of PCM models, where a random test set can contain the same compound but

with an activity measurement for a different target. If there are correlated bioactivity profiles

between targets, this could be the cause of inflated performance in the test set, which will not

generalise to new compounds or new targets (a more realistic situation). Therefore, it is

important to validate PCM models by leaving compound clusters (LOCCO) and leaving targets

(LOTO) out to assess ability for future interpolation or extrapolation, as discussed above. For

further information on PCM, we refer the reader to the comprehensive reviews by Cortes-

Ciriano et al.209 and Qiu et al.258
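
The way in which a split over compound-target pairs can place the same compound in both the training and test sets can be checked directly. The sketch below uses invented pairs and a fixed, non-random split so that the outcome is reproducible; the function name is ours.

```python
def shared_compounds(train, test):
    """Compounds that appear in both the training and the test pairs."""
    return {compound for compound, _ in train} & {compound for compound, _ in test}


pairs = [
    ("cpd1", "T1"), ("cpd1", "T2"),
    ("cpd2", "T1"), ("cpd2", "T2"),
    ("cpd3", "T1"), ("cpd3", "T2"),
]

# Splitting over pairs: every compound ends up on both sides of the split.
pair_train, pair_test = pairs[::2], pairs[1::2]
leaked = shared_compounds(pair_train, pair_test)
print(sorted(leaked))  # ['cpd1', 'cpd2', 'cpd3']

# Splitting over compounds: the held-out compound is unseen during training.
held_out = {"cpd3"}
compound_train = [p for p in pairs if p[0] not in held_out]
compound_test = [p for p in pairs if p[0] in held_out]
print(sorted(shared_compounds(compound_train, compound_test)))  # []
```

If activities against T1 and T2 are correlated, the first split lets a model benefit from having seen each test compound against the other target, which is the source of inflated performance described above.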

1.3.6 Bromodomain Target Family Selectivity Profile Prediction

This thesis is concerned with the prediction of selectivity for bromodomain-containing proteins

using PCM. Next, we highlight previous in silico methods to predict bromodomain selectivity,

which extends from the section Selectivity Knowledge for Bromodomain Binding to Small

molecule Inhibitors.

Previous in Silico Approaches for Predicting Bromodomain Inhibitor Affinity

When designing inhibitors for bromodomain-containing proteins it is important to understand

the selectivity profiles between subfamilies and between individual domains. To facilitate

efficient inhibitor design, studies have investigated the prediction of affinity of small molecules

for bromodomains. Examples include the molecular docking and QSAR study developed for the

prediction of naphthyridone derivatives for the ATAD2 bromodomain,62 and the predictions of

binding affinities for 16 tetrahydroquinoline (THQ)-based ligands314 against BRD4 BD1 using

free energy perturbation (FEP) calculations. Selectivity has also been predicted on a small scale

using relative binding free energy (RBFE) approaches;315 this study used three inhibitor

structures and predicted their affinities across up to 22 bromodomains, achieving mean

unsigned errors of 0.81-1.76 kcal/mol relative to experimental data. Multiple QSAR models have been

developed for a set of 88 organic molecules against the bromodomains in proteins BRD2, BRD3

and BRD4. These models achieved Q² values between 0.75 and 0.88 for predicting the activity for

the three bromodomains by employing QuBiLS-MIDAS 3D molecular descriptors and multiple

linear regression models with 6-9 variables per model.316 A de novo fragment-growing approach

called AutoCouple was successfully implemented to discover CREBBP bromodomain binders

within an expanded chemical space.317 The approach couples known headgroups of

bromodomain binders with commercially available building blocks to produce new molecules

which were docked and scored against the target binding site. However, no studies have yet

modelled bromodomain bioactivity and selectivity data for a large number of ligands and

targets.


Computational Studies to Understand Bromodomain Selectivity

In addition to the literature on individual bromodomain selectivity derived from both

experimental and computational approaches presented in Selectivity Knowledge for

Bromodomain Binding to Small molecule Inhibitors, a computational study on 24

bromodomains was previously conducted to provide insight into selectivity for the target family.

In this study, Vidler et al.318 explored the classification of bromodomains by their structural

motifs, using druggability scores to group bromodomains into 9 groups by combinations of

signatures of three amino acids in their binding sites. This led to the identification of 7 key

binding site residues, which can be used to help understand which other bromodomains may

bind to the same compounds, and therefore bromodomain selectivity.318 More recently, a

molecular dynamics approach to assess the structural and energetic properties of structural

waters between different bromodomains (ATAD2, BRD2 BD1, CREBBP and BAZ2B) was

conducted.319 This study found that for BRD2 BD1 the ZA loop was conformationally rigid (often

in a closed conformation), whereas in ATAD2 this loop was flexible with a higher population of

the open conformation, whilst the ZA loops in CREBBP and BAZ2B interconvert between the

two states frequently. Furthermore, they observed that the degree of flexibility was linked to the

water network: BRD2 BD1 keeps its well-constructed water network, whilst in the other

bromodomains the waters were more displaceable due to the conformational flexibility

observed.

1.3.7 Summary

Here we highlight the previous computational approaches for predicting the bioactivity of small

molecules for protein targets. These techniques can be categorised into ligand-based, target-

based and ligand- and target-based methods. We furthermore highlight how these approaches

can be extended to predict bioactivities against multiple targets, i.e. selectivity profiles,

which is the concern of this thesis. We introduce the method of proteochemometrics for

selectivity profile prediction and provide a survey of the literature for previous computational

methods applied to bromodomain-containing proteins, highlighting the gap in previous work

for a large-scale computational analysis of selectivity profiles, especially given the context

outlined in Challenge of Designing Selective Bromodomain Inhibitors, which suggests the need

for new methods to understand and predict the bromodomain selectivity profiles of small

molecules.


1.4 Computational Methods and Applications for Toxicity Prediction

Many of the in silico approaches to selectivity and bioactivity prediction, including QSAR

methods, can be applied to toxicity prediction, to predict off-targets for drugs, based on a

combination of chemical structure and other biological information.320 More recently, in silico

approaches used in the field of toxicity analysis include the integration of multiple datasets,

describing the effects of compounds on different system levels, including the interaction with

targets, cellular responses e.g. pathways and genetics, organ level responses e.g. altered tissue

function, and organism level responses e.g. in vivo and clinical phenotypes. The Adverse

Outcome Pathway (AOP) framework is an evidence-based approach to link the information

between these levels for chemical toxicity.321 There remains, however, limited understanding of the

relationships between different types of toxicity data within this framework. Here we discuss the in silico

approaches used to help elucidate relationships between toxicity measurements which are

relevant to this thesis, namely the problem of preclinical to clinical adverse event translation

and the understanding of compound-target interactions measured in vitro which display

relevance to clinical toxicities.

1.4.1 Prediction of Clinical Adverse Events from Preclinical Adverse Events

It is important to understand which toxicity endpoints translate from preclinical to clinical

studies and which ones do not. Such efforts to understand and quantify the concordance of

adverse events (AEs) between animal models and clinical studies have been conducted

previously using retrospective statistical analyses; details of previous concordance studies

are summarized in Table 1-4. The first studies of this type used sensitivity to measure

concordance between AEs recorded in preclinical studies and AEs recorded in clinical studies;

however, later studies moved towards using the positive likelihood ratio (LR+) due to its more

meaningful interpretation, as well as its ability to account for both false negatives and false

positives.230 Studies of this type are not trivial to conduct, due to small data set sizes, biological

variability and species exposure differences,232 as well as biases in the data such as ‘survivor

bias’, since data for drugs which are terminated before clinical trials due to safety (or other)

reasons cannot be used in the analysis, which has consequences for the inclusion of severe

preclinical toxicities. The main findings from previous studies included that haematological,

gastrointestinal, injection site and some specific cardiovascular AEs display a high concordance

(LR+ values of 11),232,322 showing that it was 11 times more likely to see the clinical AE given the

same preclinical finding; however, neurological toxicity and cutaneous toxicities have a poor

concordance of less than 35 %.169,323 It was also observed that concordance is higher for small


molecules than it was for large molecules (e.g. antibodies),323,324 and that there were differences

between concordance with humans for different preclinical species.169,324,325 For a

comprehensive review of previous concordance literature the reader is referred to Monticello et

al.326
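To make the two concordance measures discussed above concrete, the following minimal Python sketch computes the sensitivity and the likelihood ratios from a 2x2 table of drugs cross-classified by preclinical and clinical AE status. The function name and the example counts are hypothetical and chosen purely for illustration; they are not taken from any of the cited studies.

```python
import math


def concordance_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+ and LR- from a 2x2 table of drug counts.

    tp: AE observed in both animals and humans
    fp: AE observed in animals only
    fn: AE observed in humans only
    tn: AE observed in neither
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # LR+ is undefined when there are no false positives; report infinity.
    lr_pos = math.inf if fp == 0 else sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Hypothetical counts for a single AE term.
metrics = concordance_metrics(tp=30, fp=10, fn=70, tn=90)
print(metrics)
```

In this illustrative case the sensitivity is low (0.3) because it ignores the false positive count, whereas the LR+ of about 3 combines both error types, which is one reason the later studies preferred it.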

What all previous studies have in common is that they measured the concordance between the

same toxicity or a toxicity related to the same system organ class (SOC) as the clinical AE. Some

adverse effects in humans are not predicted by the same AE in animals, due to the differences

in anatomy, physiology and biology between species. An example of this is the lack of a vomiting

response in rats,170 which excludes this species from being used as a model for vomiting in humans.

Instead, taste aversion/food avoidance responses in rodents or ferrets, or dog emesis models, are

used.170,171 Linked to this, species differences exist for some teratogenic toxicities, for

example corticosteroids are teratogenic in animal models but not in humans173 and conversely

thalidomide is a teratogen in humans but not in many animal species, which has been attributed

to the differences in metabolism across species.174 The picture is further complicated by a lack

of correlation of drug bioavailability between species,327 linked to poor dose extrapolation

between species which can lead to differences in toxicity observations.328 These and other

reasons demonstrate how concordance between the same AE across species is only part of the

picture, showing that there is more to be learnt about interrelationships of different AEs across

species. This highlights the gap for an in silico investigation into whether seemingly unrelated

preclinical AEs in different SOCs may be mechanistically predictive of clinical AEs.


Table 1-4 Summary of the previous concordance literature relating preclinical adverse events to clinical adverse events. Previous studies have examined the ability of preclinical AEs which are the same AE as, or contained within the same system organ class (SOC) as, the clinical AE to predict the clinical AE. Not yet analysed are the associations between recorded preclinical AEs which are not part of the same SOC as the clinical AE and the clinical AE, which is the gap that this study aims to address. LR+ is the positive likelihood ratio and LR- the negative likelihood ratio.

Columns: Reference; Scale of study; Concordance measurement (Sensitivity or Likelihood Ratios (LR+/LR-)); Concordance; Interpretation

Reference: Olson et al., 2000 169
Scale of study: 150 drugs
Concordance measurement: Sensitivity
Concordance: 71 % concordance between AEs within same organ class in any animal species and human; 63 % concordance for non-rodent species with humans; 43 % concordance for rodent species with humans
Interpretation: Haematological, gastrointestinal and cardiovascular toxicities were highly concordant (> 80 %); cutaneous toxicities were the least concordant (< 35 %)

Reference: Bugelski et al., 2012 324
Scale of study: 15 monoclonal antibodies
Concordance measurement: Sensitivity
Concordance: 42 % concordance of AEs from rodent species to humans; 35 % concordance from non-human primates to humans
Interpretation: Human AEs were predicted poorly by animal models for monoclonal antibodies; individual drug concordance was discussed

Reference: Nishida et al., 2013 323
Scale of study: 142 approved drugs in Japan
Concordance measurement: Sensitivity
Concordance: 48 % concordance between AEs within same organ class in any animal species and human; 33 % concordance for large molecule drugs, including antibodies; 58 % concordance for small molecule drugs
Interpretation: Haematological, ocular, and injection site reactions reached a concordance of 70 %; cardiovascular, neurological and cutaneous toxicities showed low concordance (< 30 %)

Reference: Bailey et al., 2014 325
Scale of study: 2,366 drugs
Concordance measurement: LR+/LR-
Concordance: Median LR+ 253 and LR- 1.82 between AEs in rats and AEs in humans; median LR+ 203 and LR- 1.39 between AEs in mouse and AEs in humans; median LR+ 101 and LR- 1.12 between AEs in rabbit and AEs in humans
Interpretation: High risk of AE in humans given the presence of AE in animals; very little evidence for absence of AE in humans given absence of AE in animal models

Reference: Clark, 2015 232
Scale of study: 3,815 drugs
Concordance measurement: LR+/LR-
Concordance: ARRHYTHMIAS, QT PROLONGATION and ABNORMAL HEPATIC FUNCTION had the highest LR+ values (LR+ 11, 11, 26 respectively)
Interpretation: Asymmetry between LR+ and LR- values

Reference: Clark et al., 2018 322
Scale of study: 3,290 drugs
Concordance measurement: LR+/LR-
Concordance: QT PROLONGATION, ARRHYTHMIAS, DRUG SPECIFIC ANTIBODY PRESENT and INJECTION SITE REACTION had highest LR+ across species (LR+ 11, 11, 162, 17 respectively)
Interpretation: Only in a few cases was there an advantage to abstracting the description to a higher level of the MedDRA hierarchy, due to the increase in false positives and decrease in significance; concordance between animal species and human is dominated by the selection of species and selected species is predictive for the endpoint of interest.


1.4.2 Computational Analyses for the Proposal of Targets for In Vitro Secondary Pharmacology

Screening

As mentioned in Off-Target Toxicity Assessment, secondary pharmacology screening panels are

currently used to flag potential toxicities which may occur in vivo. For targets to be considered

as off-targets, a link between a drug induced adverse event and the off-target should be made.

Often these links are made through statistical analyses of clinical trial and post-marketing

databases, looking for evidence of multiple drugs that cause the same adverse event profile and

express similar in vitro target modulation, as outlined in the review by Whitebread et al., 2016.121

This approach uses clinical adverse event data, phenotype data for diseases, and literature

mining to ultimately identify novel associations between drug adverse events and protein

targets. These types of analyses are often conducted on a small scale to find associations for

specific toxicities, with few large-scale analyses having been performed to date. Another review

suggests that “the panels of [safety] targets that are employed vary widely and are often selected

without justification or a description of their relevance to human safety”, which highlights the

need for approaches which are based on more concrete evidence.329 Recently, a large-scale in

silico approach was implemented to identify other safety targets which could be included in

safety panels.112 In this study, the authors derive links between adverse event phenotypes and

drug targets using an enrichment analysis, measured by the odds ratio. The data for drug

associated adverse events was obtained from the intersection of the FDA Adverse Event

Reporting System (FAERS) and SIDER databases. Adverse events were mapped to phenotypes

using the Unified Medical Language System (UMLS) and then gene information was gathered

from phenotypes encoded by the Human Phenotype Ontology (HPO) which mapped to diseases

in Online Mendelian Inheritance in Man (OMIM). This study culminated in a proposal of 70

safety targets for screening.
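As an illustration of the enrichment measure mentioned above, the odds ratio for a single adverse event and target pair can be sketched as follows. The contingency table layout, the function name and the counts are assumptions made for this example and do not reproduce the cited study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for an adverse event (AE) and target pair.

    a: drugs modulating the target that report the AE
    b: drugs modulating the target that do not report the AE
    c: other drugs that report the AE
    d: other drugs that do not report the AE
    """
    if b == 0 or c == 0:
        # A continuity correction would be needed here; not attempted in this sketch.
        raise ValueError("odds ratio undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts: the AE is enriched among drugs hitting the target.
print(odds_ratio(a=12, b=8, c=30, d=150))
```

An odds ratio above 1 indicates that the AE is reported more often among drugs modulating the target than among the remaining drugs; in practice a significance test would accompany the estimate.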

One source of information which the previous study does not include is the in vivo adverse event

data, which has been outlined previously as an important parameter to include in analyses, since

positive correlations between secondary pharmacology results and adverse events in animals

are better indications of the expected toxicities in humans.329 This is a gap not yet addressed

using computational studies; by providing this extra mechanistic link, the targets proposed

for future secondary pharmacology screening will be based on combined evidence from in vitro,

in vivo and clinical information.


1.4.3 Summary

Here we discuss the previous literature on computational analyses of the translation of adverse

events between animals and humans. Previous studies have analysed the concordance of

toxicities i.e. the degree to which the same or related toxicities in animals predict the same

clinical toxicity. We define the clear gap for a study which attempts to find mechanistically

related toxicities across species, which are not necessarily part of the same system organ class

grouping. This will provide more information on the utility of animal models for clinical adverse

event prediction, whilst fitting with the need to reduce, replace and refine animal usage in drug

discovery, as discussed in Toxicity Translation. Finding a mechanistic link between the adverse event in animals and

humans will add further weight to using such associations in a practical setting.

We secondly discuss the generation of secondary pharmacology screening panels from

statistical associations between drug-induced adverse events and drug off-targets. We highlight

the gap within previous approaches for proposing new targets for secondary pharmacology

screening, namely the lack of inclusion of animal in vivo adverse event information. A study

which can combine the use of preclinical information to support new mechanistic relationships

between clinical drug induced adverse events and drug off-targets will help to provide stronger

evidence for the incorporation of new targets into in vitro screening panels, an essential method

for the early anticipation of toxicity to avoid drug attrition.

1.5 Aims

The main aim of this thesis was to address the problem of off-target pharmacology of

compounds using data mining and machine learning techniques. The first half of this thesis was

concerned with investigating selectivity prediction using machine learning methods, with

the application to modelling bioactivity data for the bromodomain family of proteins. We aimed

to use the models to predict activity and selectivity for new compounds, leading to the discovery

of novel experimentally confirmed chemical hits for bromodomains with a desired selectivity

profile. We furthermore tested the hypothesis that the models can be interpreted to identify

selectivity residues in the binding site of bromodomains for the use in compound design. We

compared the findings to the literature as well as analysed the correspondence of the model

interpretation to the binding modes of novel hits in the protein of interest from obtained crystal

structures. In summary, we explore the extent to which proteochemometric models can be used

in compound design and for on-target and off-target prediction of activity for bromodomains.


The second half of the thesis was concerned with extending the problem of selectivity to toxicity,

since all drugs have affinity for off-targets, which can lead to toxicity. Using data mining

methods, we investigated the correspondence between preclinical and clinical toxicities for

marketed drugs with the aim to identify new links between toxicities across species. We

furthermore tested the hypothesis that the associated toxicities were mechanistically linked

across species, by integrating data from compound-target databases as well as phenotype-gene

and disease-gene databases to find intersecting evidence for targets which could be responsible

for inducing toxicities for the derived associations. In summary, we aimed to quantify the utility

of animal models for the assessment of the risk of drug-induced clinical toxicities, as well as to

propose new targets for in vitro secondary pharmacology screening panels. Overall, the thesis

employs in silico methods which can be used to improve the understanding of the primary and

secondary pharmacology of drugs with relation to selectivity and toxicity. These parameters are

vital to understand when designing and developing a successful drug molecule.


2 Prospectively Validated Proteochemometric Models for

the Prediction of Small Molecule Binding to

Bromodomain Proteins

2.1 Introduction

It is important to find small molecules with the desired target family selectivity profile to avoid

off-target effects including toxicity, as well as to elucidate target roles in disease. Bromodomain-

containing proteins have functional relevance in immunological, developmental and

cardiovascular disorders, as well as cancers.30 To enable their individual roles to be further

elucidated there is a high interest in developing selective probe molecules31 (see Challenge of

Designing Selective Bromodomain Inhibitors). The studies described in this chapter

investigated the extent to which in silico modelling approaches can provide a framework for

virtual screening, with the aim to find both new small molecule hits for individual

bromodomain targets, as well as small molecules with specific selectivity profiles across the

bromodomain target family. The method implemented was proteochemometric (PCM)

modelling, which has been successfully applied to other target families for the purpose of

selectivity profile prediction across structurally related proteins.209,258 This technique had not

been previously applied to model the bromodomain target class and, due to the recent discovery

of a larger number of chemotypes of small molecule bromodomain binders, we applied this

method to produce the largest reported in silico study for bromodomain-containing proteins,

based on a diverse chemical space and the use of selectivity panel data. The technique allows

the transfer of information about related biological targets to inform the

structure-activity model for each target, providing an extension of traditional QSAR methods (see

Proteochemometric Modelling). We implemented conformal prediction to determine the

applicability domain of our models and tested high confidence predictions of new compound-

target pairs from our virtual screen in our prospective validation, as described in the following.

2.2 Materials and Methods

2.2.1 Dataset

The dataset used to generate the models was extracted from the public and licensed sources of

ChEMBL330, PubChem331, ChEpiMod332 and GOSTAR333, from the manual extraction of data from

recent publications,54–56,58,59,65,74,82,94,334–338 and from AstraZeneca proprietary databases.


Public Dataset

The public dataset composition across sources is shown in Figure 2-1 and the distribution across

bromodomains is shown in Figure 2-2.

Figure 2-1: Number of compound-target pair annotations provided by each source in the public dataset after filtering, coloured by the activity classification. “Manual” is data manually curated from publications at the time of data set construction. “PubChem others” were those data points that were found in PubChem but not in ChEMBL.

Compound-target bioactivity data points were extracted from ChEMBL-20, using

bromodomain UniProtKB Accession IDs and applying the criteria of a bioactivity type of

IC50, Ki, Kd, % inhibition or ΔTm, a binding (B) assay type, as well as a confidence score of at

least 8 (corresponding to the classification of “Homologous single protein target assigned”),330

and presence of a numerical value for the bioactivity. Bioassay descriptions were used to

manually filter out compounds interacting with multiple protein domains or non-bromodomain

domains within a bromodomain-containing protein, and to place data into the correct domain

where multiple domains exist within one bromodomain-containing protein. Data points where

the domain was unresolved were removed. For percentage inhibition values between 20 % and 80 %

at a given concentration, the following Hill equation (Equation 2-1) was applied to convert to an

estimated pIC50 value:

Equation 2-1:

$$\mathrm{pIC}_{50} = -\left(\log_{10}(100 - Y) - \log_{10}(Y) + X\right)$$

where Y is the inhibition value (in %) and X is the log concentration (Molar).24

[Figure 2-1 chart data. x-axis: Data Source; y-axis: Number of Compound-Target Pairs.
Data source: ChEMBL, ChEpiMod, GOSTAR, Manual, PubChem others
Inactive compounds: 179, 879, 14, 326, 113
Active compounds: 429, 1883, 76, 343, 210]


Figure 2-2: The distribution of data per domain for the public dataset used in the model, coloured by assigned activity (A=active, N=not active) and annotated with data point counts for each class. In total 3,950 data points were collected from the public domain.

[Figure 2-2 chart data. x-axis: Bromodomain; y-axis: Number of Compound-Target Pairs.
Bromodomain: ATAD2, BAZ2A, BAZ2B, BRD1, BRD2 BD1, BRD3 BD1, BRD3 BD2, BRD4 BD1, BRD4 BD2, BRD7, BRD9, BRDT BD1, BRPF1, BRPF3, CECR2, CREBBP, PB1 BD5, PCAF, SMARCA4, TIF1A
N: 45, 9, 65, 16, 40, 7, 4, 843, 79, 7, 27, 18, 33, 34, 12, 111, 10, 34, 7, 20
A: 42, 27, 31, 44, 79, 69, 24, 1124, 540, 21, 113, 13, 66, 3, 16, 163, 16, 51, 24, 63]


Inhibition values below 20 % or above 80 % were assigned as less

than the derived pIC50 at 20 % inhibition or greater than the derived pIC50 at 80 % inhibition,

respectively, to account for the fact that the equation no longer applies to these parts of the IC50

curve.
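The conversion in Equation 2-1, together with the handling of values outside the 20-80 % range, can be sketched in Python as follows. This is an illustrative reimplementation rather than the code used in this work; the function name and the use of a qualifier string to mark censored values are assumptions.

```python
import math


def estimate_pic50(inhibition_pct, conc_molar):
    """Estimate pIC50 from a single-point % inhibition value (Equation 2-1).

    Returns (qualifier, pIC50), where the qualifier is "=", "<" or ">".
    Values below 20 % or above 80 % inhibition are reported as censored
    at the pIC50 derived for 20 % or 80 % inhibition, respectively.
    """
    x = math.log10(conc_molar)  # log concentration (Molar)
    if inhibition_pct < 20:
        qualifier, y = "<", 20.0
    elif inhibition_pct > 80:
        qualifier, y = ">", 80.0
    else:
        qualifier, y = "=", float(inhibition_pct)
    pic50 = -(math.log10(100 - y) - math.log10(y) + x)
    return qualifier, pic50


# 50 % inhibition at 10 uM corresponds to an estimated pIC50 of 5.
print(estimate_pic50(50, 1e-5))
```

At 50 % inhibition the two logarithmic terms cancel, so the estimated pIC50 equals the negative log of the test concentration, which is a useful sanity check on the sign convention.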

Inactive data from PubChem was extracted by using the UniProtKB Accession ID,339 using the

PubChem API. Again, bioassay descriptions were used to filter out compounds with unresolved

bromodomain activity. Compound IDs were converted to SMILES using the PubChem

identifier exchange service.340

Data from ChEpiMod was provided by the curators. Data from the large functional screen for

BAZ2B (PubChem_AID:504391) was filtered out, as it was suggested in the assay description

that this screen should be used with caution due to likelihood of screening artefacts. Any PDB

compounds without numerical data points were also removed, as these might be fragment

molecules with low activity.

GOSTAR data has been incorporated into the internal AstraZeneca database service and data

for bromodomain targets were extracted using SQL Developer (version 2.11.2), queried using

EntrezGene IDs (EGIDs).

For the combined public dataset, active (A) and not active (N) classes were assigned by the

following criteria: A = pIC50, pKd, pKi ≥ 5 or ΔTm ≥ 0.9 (at 10 μM). All other data points were

assigned as N and other endpoints were removed. ΔTm ≥ 0.9 was chosen to maintain

consistency between public data and AstraZeneca data, as this was used as a cut-off in the

internal AstraZeneca Differential Scanning Fluorimetry (DSF) assays, since it reflected three times the

standard deviation of the DMSO controls. The potency cut-off of 10 μM (equivalent to a pIC50, pKd or pKi of 5) was chosen to enable the

application of the model to virtual screening, where this is a frequently used activity threshold.
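The class assignment rule for the public dataset can be summarised in a few lines of Python. This is a minimal sketch of the thresholds stated above; the function name and the endpoint labels (for example "dTm" for ΔTm measured at 10 μM) are illustrative assumptions.

```python
POTENCY_ENDPOINTS = {"pIC50", "pKd", "pKi"}


def assign_activity(endpoint, value):
    """Return "A" (active), "N" (not active) or None (endpoint not used).

    Potency endpoints are active at a value of 5 or above (10 uM or better);
    thermal shift values ("dTm", measured at 10 uM) are active at 0.9 or above.
    """
    if endpoint in POTENCY_ENDPOINTS:
        return "A" if value >= 5 else "N"
    if endpoint == "dTm":
        return "A" if value >= 0.9 else "N"
    return None  # other endpoints were removed from the dataset


print(assign_activity("pKd", 6.2), assign_activity("dTm", 0.4))
```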

AstraZeneca Dataset

Proprietary AstraZeneca data was extracted for bromodomains by querying the internal IBIS

SAR database. Inhibition data was extracted at four commonly used concentrations: 1, 3, 10 and

25 μM, as well as concentration response data and data from DSF assays. Where multiple data

were present for the same assay, these were aggregated as an average. The AstraZeneca dataset

was classified into active (A) or not active (N) classes. pKd and pIC50 data were classified using

the same thresholds as above, and compounds with thermal shift values of ΔTm ≥ 0.9 at a

concentration of 10 μM were classified as active, while the remainder were assigned to the not

active class. The binding data from DiscoveRx assays had pre-assigned activity flags for each


concentration point (based on their percentage of control values), which were used for

classification. Records with flags other than active or not active were removed. Since some

records had values at multiple concentrations, the records were then placed into classes in the

following order:

active at 1 μM: A, active at 3 μM: A, active at 10 μM: A, active at 25 μM: N, not active at 25 μM:

N, not active at 10 μM: N, not active at 3 μM: N, not active at 1 μM: N.
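One way to read the ordering above is as a precedence list in which the first rule matched by a record determines its class. The sketch below encodes that reading in Python; the first-match interpretation, the function name and the representation of a record as a mapping from concentration (in μM) to flag are assumptions made for illustration.

```python
# (concentration in uM, flag, assigned class), in order of precedence.
RULES = [
    (1, "active", "A"),
    (3, "active", "A"),
    (10, "active", "A"),
    (25, "active", "N"),
    (25, "not active", "N"),
    (10, "not active", "N"),
    (3, "not active", "N"),
    (1, "not active", "N"),
]


def classify_record(flags):
    """Classify a record given as {concentration_uM: "active" or "not active"}.

    Returns the class of the first matching rule, or None if no rule matches.
    """
    for conc, flag, label in RULES:
        if flags.get(conc) == flag:
            return label
    return None


# Active at 10 uM outranks not active at 1 uM, so the record is active.
print(classify_record({1: "not active", 10: "active"}))
```

Under this reading, a compound that is only active at 25 μM is assigned to the not active class, which is consistent with the 10 μM activity threshold used for the other endpoints.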

Combined Dataset

After classification of compounds into active and not active classes, duplicate entries were

removed from the dataset by comparison of structure (standardised using the StandardiseMolecules

function in the camb306 package in R, which uses Indigo’s C API341), domain and activity assignment.

Additionally, for entries that were duplicates of structure and domain but had been classified

into opposite activity classes, both data points were removed due to inconsistency.

Bromodomains with fewer than 25 data points were removed from subsequent analysis.
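The three filtering steps described above (removal of exact duplicates, removal of conflicting pairs and removal of sparsely populated domains) can be sketched as follows. This is an illustrative Python version, not the code used in this work; it assumes that structures have already been standardised to a canonical string and that the 25 data point threshold is applied after de-duplication.

```python
from collections import defaultdict


def clean_dataset(records, min_points=25):
    """Filter (structure, domain, label) records.

    1. Collapse exact duplicates of structure, domain and label.
    2. Drop structure-domain pairs annotated with conflicting labels.
    3. Drop domains left with fewer than min_points data points.
    """
    labels = defaultdict(set)
    for structure, domain, label in records:
        labels[(structure, domain)].add(label)

    # Keep only pairs with a single, consistent activity label.
    unique = [
        (structure, domain, next(iter(seen)))
        for (structure, domain), seen in labels.items()
        if len(seen) == 1
    ]

    counts = defaultdict(int)
    for _, domain, _ in unique:
        counts[domain] += 1
    return [rec for rec in unique if counts[rec[1]] >= min_points]


# Toy example with a lowered threshold so that the effect is visible.
toy = [
    ("C1", "BRD4 BD1", "A"), ("C1", "BRD4 BD1", "A"),  # exact duplicate
    ("C2", "BRD4 BD1", "A"), ("C2", "BRD4 BD1", "N"),  # conflicting pair
    ("C3", "BRD4 BD1", "N"),
    ("C1", "BRD9", "A"),                               # sparse domain
]
print(clean_dataset(toy, min_points=2))
```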

The final dataset contained 15,350 data points, covering 6,352 compounds across 31 bromodomains. To

obtain a suitable and information-rich dataset, data from multiple endpoints were incorporated.

Although it is appreciated that there are limitations around incorporating different

experimental readouts,342 it was necessary to include as many data as possible to provide

sufficient small molecule bioactivity values across the bromodomains in the dataset. Much of

the Kd data incorporated originated from the BROMOscan®343 competition based assay panel

screens. Kd and % inhibition data originated from AstraZeneca data and public data. Ki and IC50

values were obtained from public datasets. ΔTm values (at 10 μM) were used from panel

selectivity screening studies and provide a large amount of inactive binding data. We explored

the comparability of data points which were collated from different sources for the same

compound-target pair between public data (IC50, Kd and Ki values, as well as the IC50 values

estimated from % inhibition values) and AstraZeneca concentration response (CR) assays,

which measured Kd values (Figure 2-3). For compounds with multiple data points, we observed

a correlation of R=0.96. Due to the heterogeneity of endpoints, a classification (instead of a

regression) model was generated to produce a global model of selectivity. Classification has

been implemented in place of regression previously for similar tasks where there is data

variability, including the DREAM challenge to model cytotoxicity.344 Classification and


regression PCM studies have been performed previously using multiple bioactivity

endpoints.345,346,312

Figure 2-3: Correlation between values of pIC50, pKd, pKi and % inhibition converted to pIC50 for published assay data with AstraZeneca concentration response assay data (pKd values) for compound-target pairs which exist in both the public and proprietary data sets. The plot shows that for these data points there is good correlation between bioactivity measurements.

The final distribution of the dataset, split into public and AstraZeneca data, per target and

activity label can be found in Table 8-1 and is depicted in Figure 2-4. 53.2 % of data points

correspond to type 2 bromodomains. In contrast, PB1 BD5, PCAF and SMARCA2 contain a low

proportion of data points (0.2 %, 0.6 %, and 0.2 % respectively). The BET family (type 2)

bromodomains contain an enriched proportion of active compounds (47.3 %) compared to the

whole dataset (34.3 %). Other domains which have a high proportion of actives include: BRD9

(27.8 %), CREBBP (37.3 %), EP300 (27.2 %) and TAF1 BD2 (55.2 %). PCAF, PB1 BD5 and

SMARCA2 bromodomains contain a high proportion of actives (60.0 %, 61.5 % and 100 %)

within a low number of data points, suggesting that these domains are rarely screened in

bromodomain panel assays. ATAD2B, BRWD1 BD2 and BRPF3 contain low proportions of actives

(0.0 %, 0.5 %, and 0.02 %), and data for these domains originate primarily from screening

panels. These domains were included in the model dataset since compounds overlapped with

other domains, and thus provided information on selectivity.

[Figure 2-3 axes. x-axis: pKd AstraZeneca Data; y-axis: Log Public Data Activity]


Figure 2-4: Number of data points employed in this study per bromodomain from combined public and proprietary data sources. It can be seen that there is a bias towards a high number of BRD4 data points, which was addressed when modelling the data by down-sampling. For numbers of data points for each category the reader is referred to Table 8-1.

[Figure 2-4 chart. x-axis: Bromodomain (ATAD2, ATAD2B, BAZ2A, BAZ2B, BPTF, BRD1, BRD2 BD1, BRD2 BD2, BRD3 BD1, BRD3 BD2, BRD4 BD1, BRD4 BD2, BRD7, BRD9, BRDT BD1, BRDT BD2, BRPF1, BRPF3, BRWD1 BD2, CECR2, CREBBP, EP300, KAT2A, PB1 BD5, PCAF, SMARCA2, SMARCA4, TAF1 BD2, TAF1L BD2, TIF1A, TRIM33); y-axis: Number of Compound-Target pairs (0 to 6000); legend: Not Active, Active]


2.2.2 Analysis of Chemical and Biological space

Bromodomains were clustered based on sequence similarity using R packages SeqinR347 and

APE348 to plot a phylogenetic tree using the plot.phylo function, with argument type=fan.

DataWarrior349 Similarity Analysis using Skelspheres descriptors was used to generate a

chemical space visualization of the dataset. The technique uses a rubberbanding forcefield

approach to place molecules in 2D space according to their similarity, with a cut-off value of 0.9

for the similarity.

Bemis-Murcko scaffolds350 for the whole dataset and the public portion of the dataset were

generated in KNIME351 using the RDKit Find Murcko Scaffolds node. The scaffold visualisations

including the network graph, which used the force-directed layout, were generated in TIBCO

Spotfire352. For clarity of presentation, the network graph was generated only for active data points from the public dataset

and only for scaffolds where more than 4 compound members had active data points for a

bromodomain.

2.2.3 Compound and Target descriptors

Compound Descriptors

Compounds were standardised from SMILES as described above using the Indigo API341,

removing inorganic molecules and salts. Different compound descriptors were benchmarked

against one another in PCM models. Compound 1D and 2D physicochemical descriptors were

calculated using the camb306 function GeneratePadelDescriptors, which uses the PaDEL-Descriptor

Java library,239 and 512-bit hashed binary Morgan fingerprints of radius 2 were

generated using the Python RDKit module.353 3D structures were generated using Corina-3.6354

providing a single low-energy conformer for each molecule, from which 3D Vsurf descriptors

were generated in MOE355. 3D VSurf descriptors are internal co-ordinate based descriptors

calculated from the 3D conformation of molecules and are similar to the Volsurf descriptors

used in previous studies.250–253 Since 3D Vsurf descriptors encode the distribution of molecular

size, shape, hydrophilicity and hydrophobicity across the molecule,250 they were

benchmarked for their use in PCM against the physicochemical 1D and 2D descriptors generated

by the PaDEL library.

Target Descriptors

Sequences were aligned by importing one crystal structure per bromodomain into MOE355 (Table 8-2), using the in-built “protein” function with the automatic alignment option, and then manually adjusting the alignment in the sequence editor to minimise alignment gaps. Binding site residues were chosen by importing all publicly known liganded bromodomain crystal structures in the MOE355 protein family database, filtering to those that contained ligands with a molecular weight between 100 and 600 (in total 352 crystal structures) and applying a threshold of 4.5 Å from any ligand for a residue position to be included. The final alignment can be found in Table 8-1. Z-Scales 5 265 and other alignment-dependent target descriptors were calculated using the camb306 AADescs function. These descriptors are derived from dimensionality reduction methods (e.g. principal component analysis) of the numerical property values for amino acid residues in the binding site of each protein.267
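The 4.5 Å selection rule can be illustrated with a minimal Python sketch. It assumes that an alignment position is kept when any atom of the residue at that position lies within the cut-off of any ligand atom in at least one structure. The data layout (coordinates keyed by alignment position) and the function name are hypothetical and chosen for illustration only; the selection in this work was carried out in MOE.

```python
import math

CUTOFF_ANGSTROM = 4.5  # distance threshold stated in the text


def binding_site_positions(structures, cutoff=CUTOFF_ANGSTROM):
    """Return the sorted alignment positions within `cutoff` of any ligand atom.

    `structures` is a list of dicts (hypothetical layout), one per crystal
    structure, with keys:
      "residues": {alignment_position: [(x, y, z), ...]}  residue atom coordinates
      "ligand":   [(x, y, z), ...]                         ligand atom coordinates
    """
    selected = set()
    for structure in structures:
        for position, atoms in structure["residues"].items():
            # Keep the position if any residue atom is close to any ligand atom.
            if any(math.dist(atom, lig) <= cutoff
                   for atom in atoms for lig in structure["ligand"]):
                selected.add(position)
    return sorted(selected)


# Toy example: two "structures" that share one alignment numbering.
toy_structures = [
    {"residues": {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)]},
     "ligand": [(3.0, 0.0, 0.0)]},
    {"residues": {1: [(20.0, 0.0, 0.0)], 3: [(1.0, 1.0, 1.0)]},
     "ligand": [(2.0, 2.0, 2.0)]},
]
site = binding_site_positions(toy_structures)
```

Taking the union over structures means that a position contacted by a ligand in only one crystal structure is still retained in the binding site definition.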

SPOT peptide array data for histone proteins against bromodomains, published by Filippakopoulos et al. 2012 (see Figure 4 of that work),38 were obtained on request from the authors. The normalized raw intensity values (normalized to between 0 and 100) for each peptide-bromodomain interaction were used as a numerical descriptor for bromodomain targets, providing a fingerprint of bromodomain binding specificity for mono-lysine acetylated histone peptides. The final target descriptors comprised a matrix of numerical interactions of 22 bromodomains with 136 acetylated-lysine histone peptides.

2.2.4 Algorithms and Generation of PCM models

PCM models were built using the camb and caret packages in R.306,356 Compound and target descriptors were concatenated, and highly correlated and near zero variance descriptors were removed using the functions RemoveHighlyCorrelatedFeatures (cut-off 0.95) and RemoveNearZeroVarianceFeatures (cut-off 30/1). The number of data points for BRD4 BD1 was randomly downsampled to 50 % of the original data points to reduce bias in the dataset and to present a more even distribution across bromodomains. Variables were centred and scaled to zero mean and unit variance using the caret preProcess function, and the data were split into training and test sets in a 70/30 ratio using the SplitSet function in camb, with stratified sampling according to bioactivity labels.
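The stratified 70/30 split can be sketched in plain Python as follows. This is only an illustration of the idea, not the camb SplitSet implementation; the function name, the seed and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split record indices into train/test sets, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut at the requested fraction.
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy activity labels: "A" = active, "N" = not active.
labels = ["A"] * 30 + ["N"] * 70
train_idx, test_idx = stratified_split(labels)
```

Because the cut is made within each class, the active fraction in the test set matches that of the full dataset, which matters for imbalanced bioactivity data.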

Models were created using the Random Forest (RF), Support Vector Machine (SVM) and Generalized Linear Model (GLM) algorithms, using the rf, svmRadial and glm methods respectively in the train function in caret. Training control was set up using the function GetCVTrainControl with 5-fold cross validation (CV) and the arguments classProbs=TRUE and summaryFunction=twoClassSummary, so that class probabilities were calculated and summary statistics were provided. Recursive feature elimination using the caret rfe function removed redundant variables, as determined by 5-fold CV assessed by ROC AUC. The number of input variables to randomly select at each node in the random forest trees (mtry) was optimised in CV using a random grid search of 15 values for RF, and a grid search was performed to optimise the hyperparameter values of σ (0-0.9) and C (0-5) for SVM.

2.2.5 Model Validation

PCM models were validated through 5-fold cross validation (CV) and, after parameter optimisation, by predicting the unseen test set values (30% of the records, split using stratified random sampling according to bioactivity) using the predict function from caret. ROC AUC, Matthews correlation coefficient (MCC), area under the precision-recall curve (PRAUC), sensitivity and specificity were used to assess performance on the test set.
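For reference, the threshold-dependent metrics can be computed from the confusion matrix as in the short Python sketch below. The function names and the toy labels ("A" for active, "N" for not active) are illustrative assumptions; the values reported in this work were calculated in R.

```python
import math


def confusion_counts(y_true, y_pred, positive="A"):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """True positive predictions divided by all positive labels."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative predictions divided by all negative labels."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


y_true = ["A", "A", "A", "N", "N", "N", "N", "N"]
y_pred = ["A", "A", "N", "N", "N", "N", "N", "A"]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

Unlike sensitivity and specificity, the MCC uses all four cells of the confusion matrix, which makes it informative for imbalanced class distributions.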

Leave-one-scaffold-out (LOSO) validation was conducted by training a model based on data

points for all scaffolds except one hold-out scaffold and predicting for the data points containing

compounds with the new scaffold. To obtain scaffolds, the carbon framework Bemis-Murcko scaffolds350 of all compounds were calculated using the RDKit Find Murcko Scaffolds node in KNIME.351 This resulted in 3,553 framework scaffolds, from which a random sample of 50 scaffolds with more than 10 data points each was selected as hold-out scaffolds.

Leave-one-target-out (LOTO) validation was conducted by sequentially removing all

compound-target pairs associated with one bromodomain target, training a model based on the

remaining data and predicting for the hold-out target data points.
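Both LOSO and LOTO follow the same leave-one-group-out pattern, differing only in whether the group is the scaffold or the target. A minimal Python sketch, with a hypothetical function name and toy target labels, is:

```python
def leave_one_group_out(groups, holdout_groups=None):
    """Yield (held_out_group, train_indices, test_indices) for each hold-out group.

    `groups` holds one group label (scaffold or target) per data point.
    If `holdout_groups` is None, every distinct group is held out in turn.
    """
    if holdout_groups is None:
        holdout_groups = sorted(set(groups))
    for held_out in holdout_groups:
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy example: one target label per compound-target pair.
targets = ["BRD1", "BRD9", "BRD1", "BRD4 BD1", "BRD9"]
splits = list(leave_one_group_out(targets))
```

For LOSO, `groups` would hold the scaffold identifier of each data point and `holdout_groups` the randomly sampled hold-out scaffolds; a model is then retrained for each split and evaluated only on the held-out indices.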

2.2.6 Benchmarking to QSAR, Quantitative Sequence Activity Model (QSAM) and Baseline Models

Models using only chemical descriptors (global QSAR) and only target descriptors (global QSAM) were generated on the same dataset using the caret and camb R packages,356 following the method described above and evaluated on the same test set. These models test the hypotheses that compound activities are correlated across targets (global QSAR) and that activities depend only on the target (global QSAM). Their performances as measured by ROC AUC were compared to that of the PCM model with a pairwise two-sided Student’s t-test with Bonferroni correction, using the compare_models function in caret,356 to determine whether there was a significant difference.

Individual QSAR RF models were generated for each bromodomain using the same method as

for the PCM models.

To assess the utility of the extra target information encoded in the binding site alignment-

dependent target descriptors, a comparator RF model was trained on Morgan fingerprints with

binary target identity descriptors. The binding site alignment-dependent descriptors, such as Z-

Scales, provide a similarity measure between domains based on their amino acid binding site

residues and properties from which the model is expected to make interpolations between

similar domains. In contrast, the basic binary label identifiers for the different bromodomains

do not encode a relationship between the targets themselves.

To assess baseline performance, a generalised linear model was also generated on the whole

dataset applying the glm function implemented in caret, using compound and target IDs as

binary descriptors.

Y-scrambling was performed by randomly reorganising the labels associated with each

compound-target pair and retraining the models using the same method as for the final

model.357 Class labels were randomly scrambled by both 50% and 100% over 50 iterations. The

mean ROC AUC over all iterations was calculated.
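The label scrambling step can be sketched in Python as below. The sketch assumes that scrambling "by 50%" means permuting the labels of a randomly chosen half of the data points among themselves; this interpretation, the function name and the seed are assumptions. Retraining the model and averaging the ROC AUC over the 50 iterations would follow each call.

```python
import random


def scramble_labels(labels, fraction=1.0, seed=0):
    """Return a copy of `labels` in which a random `fraction` of the positions
    has had its labels permuted among themselves."""
    rng = random.Random(seed)
    n_scramble = round(len(labels) * fraction)
    positions = rng.sample(range(len(labels)), n_scramble)
    shuffled = [labels[i] for i in positions]
    rng.shuffle(shuffled)
    scrambled = list(labels)
    for position, label in zip(positions, shuffled):
        scrambled[position] = label
    return scrambled


labels = ["A"] * 20 + ["N"] * 80
half_scrambled = scramble_labels(labels, fraction=0.5, seed=1)
fully_scrambled = scramble_labels(labels, fraction=1.0, seed=1)
```

Permuting rather than resampling the labels keeps the class balance unchanged, so any drop in performance reflects the broken link between descriptors and labels rather than a shift in the class distribution.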

2.2.7 Public Dataset Model

An RF model was constructed for the public dataset using the same methods described above.

This model was tested on a 30 % external test set as well as the proprietary AstraZeneca dataset.

2.2.8 Applicability Domain

Since applicability domain determination is difficult to rationalise in multi-dimensional

space,225 the Mondrian cross-conformal prediction (CCP) framework was used to define which

new compound-target pairs can be predicted at a certain confidence level (Cl).221 Applying

Mondrian CCP results in data points in the test set being classified into one of four classes: 1. active, 2. not active, 3. “neither” active nor not active, or 4. “both” active and not active, to achieve the specific error rate provided at different Cls. For our test set we assume that samples are exchangeable with the training set, i.e. that it fulfils the exchangeable distribution assumption required for a guaranteed error rate. For the conformal method to be suitable, the validity metric (defined as the fraction of samples in the test set that are classified into a label that contains their experimental class, including the “both” class assignment) must be at least the Cl; i.e. the error rate should not exceed 1-Cl.

The applicability domain was assessed by calculating conformal predictions implementing

Mondrian cross-conformal prediction227 in R, which provides a wrapper around the caret

package. We use the RF probability values (fraction of trees predicting for each class) as the

non-conformity measure. The validity and efficiency metrics on the test set were used to assess

whether the conformal prediction framework was valid for the performance on the external test

set.

When interpreting conformal predictions, the probability value (p_value) assigned to the new

sample prediction can be used as an indication of the degree of confidence that a sample is

contained within the predicted class.
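The class-conditional (Mondrian) p-values and the resulting four outcomes can be illustrated with the Python sketch below. It is a simplification that uses a single calibration set per class, whereas cross-conformal prediction aggregates over folds. It also assumes that the RF vote fraction for a class is used as the score, with larger values meaning a more typical member of that class. All function names and toy numbers are hypothetical.

```python
def mondrian_p_value(calibration_scores, test_score):
    """Class-conditional p-value: share of calibration scores for that class
    that are no larger than the test score (with the usual +1 correction)."""
    n_not_larger = sum(score <= test_score for score in calibration_scores)
    return (n_not_larger + 1) / (len(calibration_scores) + 1)


def conformal_label(p_active, p_inactive, confidence):
    """Map the two class p-values onto one of the four conformal outcomes."""
    significance = 1.0 - confidence
    in_active = p_active > significance
    in_inactive = p_inactive > significance
    if in_active and in_inactive:
        return "both"
    if in_active:
        return "active"
    if in_inactive:
        return "not active"
    return "neither"


def validity(true_labels, conformal_labels):
    """Fraction of samples whose predicted label set contains the true class."""
    hits = sum(pred == "both" or pred == true
               for true, pred in zip(true_labels, conformal_labels))
    return hits / len(true_labels)


# Calibration scores: vote fraction for the true class of each calibration sample.
cal_active = [0.9, 0.8, 0.7, 0.6]
cal_inactive = [0.95, 0.9, 0.85, 0.8]

# New compound-target pair with 75 % of trees voting "active".
p_active = mondrian_p_value(cal_active, 0.75)
p_inactive = mondrian_p_value(cal_inactive, 0.25)
label_90 = conformal_label(p_active, p_inactive, confidence=0.9)
label_70 = conformal_label(p_active, p_inactive, confidence=0.7)
```

In this toy case the prediction is “both” at the 0.9 confidence level and narrows to a single “active” label at 0.7, illustrating how lowering the confidence level shrinks the prediction sets.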

2.2.9 Experimental Validation

Predictions and Filtering

Compounds in the liquid sample library at AstraZeneca were screened against the model to

predict activity against four bromodomains, namely BRD1, BRD4 BD1, BRD9 and BRPF1b. A

total of 2,164,399 compounds were pre-processed in the same way as the training set molecules

(see above) and combined with Z-Scales 5 descriptors for each target. The model was used to

predict activity for these molecules using conformal prediction, applying the confidence levels

of 0.7, 0.8 and 0.9 and the corresponding significance values of 0.3, 0.2 and 0.1 to compare new

compound-target pair p_values to the model training set values to shortlist compounds for

experimental testing. We chose this range because we wanted to find hits that are structurally diverse relative to the training set, and any confidence level lower than 0.7 would place too many new samples into the “neither” class, especially as the screening library compound set will be more diverse than the model test set. We also selected a range of significance values to observe the relationship between p_values and prediction accuracy.

Firstly, the predictions with a p_value of > 0.1 for the active class (corresponding to the 0.9

confidence level) were selected. These actives were filtered to exclude those which were

unavailable for testing (availability as a solid < 3 mg, as a solution <0.05 mL). Compounds were

removed if they had a logBSF (frequent hitter) score358 < -3 or if they had already been tested

against the domain of interest according to the internal AstraZeneca IBIS database. We also checked compounds for chemical substructure alerts using the AZfilters method, which searches for fingerprints matching common reactive or undesirable compound features.359

Those that were classified as “ugly” were removed from the set. A molecular weight filter was

then applied to restrict compounds to a molecular weight range between 200 and 700 Da.

Those compound-target pairs that were in the training or test set were removed. In total, the

above process resulted in 8,581 predicted active compound-target pairs.

Calculating Sampling Parameters

These predicted actives were matched with any compound-target pairs containing the same

internal AstraZeneca compound ID for which an inactive prediction with a conformal prediction

p_value > 0.1 (inactive at 0.9 confidence level) was generated. These inactives were combined

with the active set to provide bioactivity profiles for the same compound against multiple

domains and compounds were annotated with their selectivity profile. Compound structures that were in the public domain were identified so that a significant number of them could be selected.

In the next step, sampling parameters for a subsequent diversity selection of compounds were

calculated, namely the compound similarity to the training set and the cluster membership of

the selected compounds.

To provide measures of diversity for future selection processes and analysis, the compound-

target pairs were assessed for their diversity compared to the training set. Firstly, the similarity

of compound structures to the nearest neighbour compound structure in the training set (across

any domain) was calculated using the Tanimoto similarity360 index, calculated from 512-bit

Morgan Fingerprints in KNIME351. The experimentally determined selectivity profile of the most

similar compound in the training set was extracted. Additionally, the nearest compound-target

pair neighbour in the training set was identified by selecting the training set instance with the

minimum Euclidean distance from the new prediction instance, calculated in KNIME351 from all

compound and target descriptors used in the model.
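The nearest-neighbour Tanimoto calculation can be sketched in Python as follows, with fingerprints represented as sets of on-bit indices. The compound identifiers, bit sets and function names are toy assumptions; in this work the calculation was performed in KNIME on 512-bit Morgan fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def nearest_training_neighbour(query_bits, training):
    """Return (training_id, similarity) of the most similar training compound."""
    return max(((cid, tanimoto(query_bits, bits)) for cid, bits in training.items()),
               key=lambda pair: pair[1])


# Toy fingerprints: indices of the bits that are set.
training = {"cpd_1": {1, 5, 9, 12}, "cpd_2": {2, 5, 7}}
query = {1, 5, 9, 40}
best_id, best_similarity = nearest_training_neighbour(query, training)
```

The Tanimoto distance used for clustering is simply one minus this similarity.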

Clustering was conducted to provide a means of selecting a diverse set of compounds for testing.

To cluster a larger set of molecules, compounds were assigned their Bemis-Murcko carbon

framework scaffolds,350 from which 512-bit Morgan fingerprints were generated and used to

hierarchically cluster the scaffolds based on their distance matrix measured as the Tanimoto

distance. A distance threshold of 0.375 was used to merge scaffolds into clusters in KNIME,351

derived from the largest non-outlier distance value, which resulted in 585 clusters.

Compound Selection for the Prospective Validation Study

Compounds were sorted into three overlapping sets: 1. Interpolation Compounds, which were present in the training set but lacked experimental data for one or more of the four bromodomains (BRD1, BRD4 BD1, BRD9 and BRPF1b); 2. Selectivity Profiles for novel compounds, defined as multiple confident active or inactive predictions (at the 0.7 to 0.9 confidence levels) for a given compound; and 3. Singular Active predictions, comprising compounds that were not in the training set and were predicted to be active (at the 0.7 to 0.9 confidence levels) for one of the four bromodomains in the validation study.

All compounds available from the Interpolation Compound set were tested for their

bromodomain activity. This included 9 compounds that had not previously been tested on

BRD1, 18 for BRD9 and 8 for BRPF1.

For the selection of Selectivity Profile compounds, grouped stratified weighted sampling (using

sample_n in the dplyr R package361) was conducted to select compounds across groups of public

domain compounds, profile annotations, compound Tanimoto similarity to the nearest

compound neighbour in the training set (binned into 10 bins between 0 and 1), and cluster

numbers, weighted overall to an even distribution across public domain structure, profile

annotations and Tanimoto similarity (limited by overall availability of compounds in each

category). In total 721 compound-target pairs were tested as part of the Selectivity Profiles set.

For the Singular Actives set, grouped stratified weighted sampling was conducted to select active

compounds separately for each domain across groups of public availability, binned p_values

from conformal prediction (p_value >0.3, 0.2< p_value <0.3, 0.1< p_value <0.2), compound

Tanimoto similarity to the nearest compound neighbour in the training set (binned into 10 bins

between 0 and 1) and cluster numbers, weighted overall to an even distribution across public

availability, the 0.7, 0.8 and 0.9 confidence levels and Tanimoto similarity to the nearest

compound neighbour in the training set. In total 388 compound-target pairs were tested as part

of the Singular Actives set.

2.2.10 Experimental Testing

Overall 1,139 compound-target pair data points were selected for prospective experimental

validation in Differential Scanning Fluorimetry (DSF) assays362 at 10 μM concentration.

The experimental assays were run by Helen Boyd and Pia Hansson in AstraZeneca. The DSF

assays were performed in 4titude FrameStar 384 well plates (4titude, Surrey, UK) in assay buffer

with 10 mM HEPES and 500 mM NaCl. A 10 µL reaction was conducted by addition of 10 µL of

protein to assay ready plates with 10 nL of compound solution or DMSO followed by addition

of 20 nL SYPRO® Orange (SigmaAldrich, St. Louis, MO). The plate was sealed with qPCR Seal

(4titude, Surrey, UK), shaken briefly and spun for 1 min at 800 RPM in a benchtop centrifuge.

Final concentrations were 0.1 mg/mL of either BRD4 BD1, BRD9, BRD1 or BRPF1b and 1/500

dilution of SYPRO® Orange. The positive controls used in the assays were JQ-1 for BRD4 BD1,43

iBRD9 for BRD957 and NI-57 for BRD1 and BRPF1b.363 Control compounds had a final assay

concentration of 10 µM and the DMSO concentration was 0.1%. The LightCycler® (ROCHE,

Basel, Switzerland) was programmed to increase temperature from 20 to 85°C at a rate of

0.6°C/s with 10 acquisitions per degree Celsius. Tm was calculated using midpoint (geometric)

and first derivative methods. The midpoint method defines the arithmetic midpoint between

upper and lower plateaus as Tm whereas the First Derivative method calculates the first

derivative of the melting curve and defines the maximum value as Tm. Melting temperatures

from both methods should agree with each other and differences can be used to identify

irregular curves. For a compound to be considered as an active hit the ΔTm (shift in Tm between

compound and DMSO control) should be greater than 3 times the standard deviation (SD) of

the DMSO control or 0.9°C (the latter agrees with activity assignments in the dataset used for

training). The control data can be found in Table 8-5. All data handling was performed using Screener 13 (Genedata AG, Basel, Switzerland).
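The two Tm definitions can be illustrated on a synthetic melting curve with the Python sketch below. It assumes an idealised sigmoidal curve sampled at 10 points per degree Celsius and takes the minimum and maximum of the signal as the lower and upper plateaus. The function names and the curve are illustrative assumptions and do not reproduce the instrument or Screener implementation.

```python
import math


def tm_first_derivative(temps, signal):
    """Temperature at which the finite-difference slope of the curve is largest."""
    slopes = [(signal[i + 1] - signal[i]) / (temps[i + 1] - temps[i])
              for i in range(len(temps) - 1)]
    i_max = max(range(len(slopes)), key=slopes.__getitem__)
    # Report the centre of the interval with the steepest slope.
    return (temps[i_max] + temps[i_max + 1]) / 2


def tm_midpoint(temps, signal):
    """First temperature at which the signal reaches halfway between plateaus."""
    half = (min(signal) + max(signal)) / 2
    for temperature, value in zip(temps, signal):
        if value >= half:
            return temperature
    return None


# Synthetic curve: 20 to 85 degrees C, 10 acquisitions per degree, true Tm = 50.
temps = [20 + 0.1 * i for i in range(651)]
signal = [1 / (1 + math.exp(-(t - 50.0) / 2.0)) for t in temps]

tm_derivative = tm_first_derivative(temps, signal)
tm_mid = tm_midpoint(temps, signal)
delta_between_methods = abs(tm_derivative - tm_mid)
```

On a regular curve the two estimates agree to within the sampling interval; a large difference between them is the kind of disagreement that flags an irregular curve.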

2.3 Results and Discussion

2.3.1 Analysis of Chemical and Biological Space

The dataset used for proteochemometric modelling comprised 15,350 data points for 6,352

compounds across 31 bromodomains, corresponding to a matrix coverage of about 7.8 % (Figure

2-4).
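The matrix coverage follows directly from the counts given above, as this short Python check shows (the variable names are chosen for illustration):

```python
n_data_points = 15_350
n_compounds = 6_352
n_bromodomains = 31

# Fraction of all possible compound-bromodomain pairs with a measured data point.
coverage = n_data_points / (n_compounds * n_bromodomains)
coverage_percent = round(100 * coverage, 1)
```

The full matrix would contain 6,352 × 31 = 196,912 compound-target pairs, of which 15,350 are populated.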

Figure 2-5 shows the bromodomains represented in this study, clustered by their binding site

sequence similarity. The dataset covers at least two bromodomains from seven out of eight

bromodomain subfamilies and 51% of all known bromodomains. The type 2 and 4

bromodomains are represented fully in the dataset, with many inhibitors of these targets having

been reported.36,45,34 Subfamilies of type 3, 5, 7 and 8 are less well-represented, which could be

for multiple reasons, including interest in biological function, ligandability or discovery date.34

Figure 2-5: Sequence similarity dendrogram based on sequence identity for the bromodomains with bioactivity data that was used in constructing models in this study. The plot is coloured by bromodomain subfamily.

The Bemis-Murcko scaffolds which describe four or more active compounds for each

bromodomain in the public portion of the dataset are visualised as a network map in Figure 2-6.

BRD4 BD1 and BRD4 BD2 form the centralised nodes in the public dataset network, comprising

many scaffolds for which chemotypes overlap with other bromodomains. Other bromodomains

are positioned around the periphery of the network. Scaffold 61, originating from the

publication by Crawford et al., 2016,74 is a common active scaffold for CECR2, CREBBP, BRD9,

BRD4 BD1 and BRD4 BD2 and presents one of the most connected scaffolds in the network. For

BRD1, TIF1A and BRPF1, scaffold 208 is the most frequently reported scaffold.93 BRD7 has only

one active scaffold (scaffold 29) with more than four member compounds for the public dataset,

which it shares with BRD9.58 BAZ2A and BAZ2B share their most common scaffold 21,55 and are

separated from the remainder of the network. Figure 2-6 also shows that overall there is

considerable scaffold diversity within active compounds for bromodomains, ranging from

fragment sized scaffolds such as scaffolds 21 and 29, to larger scaffolds including scaffolds 329

and 539. Large scaffold diversity for BRD4 bromodomain ligands can also be observed. The

chemical space diversity for each bromodomain for the combined dataset is further shown in

Figure 2-7.

2.3.2 Benchmarking Compound and Target Descriptors for PCM Models

Benchmarking Compound Descriptors

1D and 2D physicochemical descriptors from the PaDEL library, sub-structural Morgan

fingerprints and 3D Vsurf descriptors were benchmarked for use as chemical descriptors in PCM

models. We compared their performance against one another in cross-validation (CV) when

employed with a constant protein target descriptor (Z-Scales 5), the results of which are shown

in Table 2-1. The combination of Morgan fingerprints and Z-Scales 5 performs similarly to the combination of PaDEL descriptors and Z-Scales 5 as measured by ROC AUC (0.964 vs 0.962), showing that both are equally suitable as compound descriptors for this dataset. In a previous study, using multiple descriptor types to describe the chemical space in PCM models was shown to increase performance in some cases.364 In our case, when we combined Morgan Fingerprints with PaDEL and Z-Scales 5 descriptors, the performance (ROC AUC 0.966) was similar to that of models with fewer descriptors, showing that the addition of nearly 200 further ligand-side descriptors did not add significantly more information. This redundancy in information has been found before in PCM modelling, for example in a study predicting cell line growth inhibition, where the incorporation of multiple gene sets from gene transcripts did not increase performance.365 Finally, we investigated the utility of 3D chemical descriptors, namely Vsurf descriptors combined with Z-scales target descriptors, which performed the worst of the compound descriptors explored (ROC AUC 0.945). This suggests that the information encoded in the simpler 2D sub-structural fingerprints describing molecular connectivity, or in physicochemical properties, is more suitable than 3D information for describing chemical structure in the type of models generated here.

Figure 2-6: Network map for active compound-target annotations extracted from public databases. Nodes are formed of Bemis-Murcko scaffolds (covering those comprising four or more active compounds for a bromodomain) and bromodomain identities. Bemis-Murcko scaffold numbers are connected to a bromodomain by an edge when compounds with this scaffold are active for a given bromodomain. Bemis-Murcko scaffold structures for the scaffolds with the largest number of active compounds for each bromodomain are depicted.

Figure 2-7: Chemical similarity projected onto 2D space of all compounds for each bromodomain. Individual compounds are represented by dots and those close in chemical space are positioned neighbouring on the plot. A cut-off value of 0.9 for the similarity was used. Red dots represent the active (A) and blue represent the not active (N) activity assignment. The plot shows the similarity between bromodomains based on their bioactivity space. BRD4 BD1 has the largest number of data points and occupies a chemical space not overlapping with other domains.

Table 2-1: Number of variables selected by the recursive feature elimination method and model performance in cross-validation, as measured by ROC AUC, sensitivity and specificity, for RF models trained on different combinations of compound descriptors. The addition of ~200 new variables for the model with PaDEL, Morgan Fingerprints and Z-scales 5 did not significantly improve the model.

| Model | Number of variables selected | Value of mtry | ROC AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| PaDEL, Morgan Fingerprints and Z-Scales 5 | 538 | 102 | 0.966 | 0.854 | 0.950 |
| Morgan Fingerprints and Z-Scales 5 | 350 | 86 | 0.964 | 0.857 | 0.948 |
| PaDEL and Z-Scales 5 | 218 | 56 | 0.962 | 0.830 | 0.948 |
| Vsurf and Z-Scales 5 | 100 | 26 | 0.945 | 0.792 | 0.935 |

Benchmarking Target Descriptors

Alignment-dependent amino acid sequence descriptors were implemented in this study due to

their successful application in previous PCM studies.267 These target descriptors were

benchmarked, utilising Morgan fingerprints for the compound descriptors. The performance

measured by cross-validation (CV) for each target descriptor is shown in Table 2-2. As in

previous studies,267 we observe that target descriptors perform similarly with no significant

difference in performance (ROC AUC 0.962-0.964).

We further explored the use of alternative target descriptors generated from a SPOT peptide array,38 to test whether the binding specificity of peptides carries information about the specificity of compounds for bromodomains (see Methods section for details). The peptide model achieved a high ROC AUC of 0.957 and an MCC of 0.801 (Table 2-3), only slightly outperformed by the final traditional PCM models. This model shows that there is consistency between bromodomains described by their binding affinities to peptides and their compound specificity.

Table 2-2: Description, number of components, ROC AUC, sensitivity and specificity values from cross-validation for the different alignment-dependent target descriptors used to build the initial proteochemometric models. Performance was very similar for all alignment-dependent descriptors.

| Descriptors (combined with Morgan Fingerprints) | Description | Number of components | Number of variables selected | Value of mtry | ROC AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Z-scales 5 | Physicochemical PCA | 5 | 350 | 86 | 0.964 | 0.857 | 0.948 |
| FASGAI | Physicochemical Factor Analysis | 6 | 350 | 89 | 0.962 | 0.844 | 0.950 |
| ST-scales | Topological PCA | 5 | 506 | 96 | 0.963 | 0.849 | 0.947 |
| VHSE | Physicochemical PCA | 8 | 509 | 37 | 0.964 | 0.861 | 0.948 |
| ProtFP8 | Physicochemical PCA | 8 | 250 | 18 | 0.962 | 0.835 | 0.954 |
| BLOSUM | Physicochemical and substitution matrix VARIMAX | 10 | 350 | 66 | 0.964 | 0.846 | 0.950 |
| T-scales | Topological PCA | 8 | 350 | 66 | 0.963 | 0.841 | 0.947 |
| Z-scales 3 | Physicochemical PCA | 3 | 350 | 66 | 0.963 | 0.841 | 0.947 |
| MSWHIM | 3D electrostatic potential PCA | 3 | 350 | 26 | 0.962 | 0.848 | 0.948 |

Table 2-3: Comparison of the external test set performance of the PCM binary classification model with other models generated for the dataset. The peptide model uses histone peptide array data as a surrogate for target descriptors. Models utilising different algorithms (RF=Random Forest, SVM=Support Vector Machines, GLM= Generalised Linear Model) were compared, as well as those incorporating different feature spaces (only compound descriptors (QSAR) and only target descriptors (QSAM)). Binary target ID descriptors and a baseline linear model (GLM) based on using only binary compound IDs and target IDs were also benchmarked. A model built on the public dataset and tested on a 30 % random test set as well as the proprietary data was benchmarked. The performance measures were area under the receiver operating characteristic curve (ROC AUC), Matthews’ Correlation Coefficient (MCC), sensitivity (true positive predictions divided by all positive labels), specificity (true negative predictions divided by all negative labels) and the area under the precision recall curve (PRAUC). The table shows that the PCM model based on Morgan Fingerprints and Z-scales 5 outperformed other models in terms of ROC AUC and that the peptide model and the public data model performed well for their external test set validations. On average PCM outperformed individual target QSAR models.

| Validation | ROC AUC | MCC | Sensitivity | Specificity | AUPR curve |
| --- | --- | --- | --- | --- | --- |
| RF PCM 5-fold Cross Validation | 0.964 +/- 0.0026 | - | 0.857 +/- 0.0141 | 0.948 +/- 0.0066 | - |
| RF PCM 30 % external test set | 0.968 | 0.826 | 0.859 | 0.955 | 0.879 |
| Peptide descriptor model with Morgan fingerprints | 0.957 | 0.801 | 0.850 | 0.942 | 0.912 |
| SVM PCM model 30 % external test set | 0.942 | 0.789 | 0.810 | 0.956 | 0.910 |
| GLM Linear Model 30 % external test set | 0.916 | 0.680 | 0.754 | 0.912 | 0.860 |
| Global QSAR | 0.801 | 0.498 | 0.511 | 0.926 | 0.679 |
| Global QSAM (Quantitative Sequence Activity Models) | 0.747 | 0.376 | 0.390 | 0.919 | 0.555 |
| Average individual PCM performance | 0.936 | 0.634 | 0.750 | 0.900 | - |
| Average Individual QSAR performance | 0.897 | 0.569 | 0.762 | 0.809 | - |
| RF PCM with binary target id descriptors 30 % external test set | 0.853 | 0.611 | 0.616 | 0.940 | 0.722 |
| GLM Linear baseline Model (Target and compound binary IDs) 30 % external test set | 0.730 | 0.441 | 0.755 | 0.706 | 0.113 |
| Public data model 30 % external test set | 0.951 | 0.767 | 0.919 | 0.848 | 0.884 |
| Public data model proprietary data test set | 0.642 | 0.182 | 0.410 | 0.778 | 0.347 |

2.3.3 Algorithms and Model Performance

The final model reported in Table 2-3 is the RF binary classification model, which achieved a high predictive performance (ROC AUC of 0.964 +/- 0.0026) in CV, with a low standard deviation across folds, utilising 350 variables which were automatically selected using recursive feature elimination. The test set performance for this model was similar to that in CV, achieving a ROC AUC of 0.968 and an MCC of 0.826.

In addition to compound and target descriptors, different machine learning algorithms were

explored next. We found that the RF model performed similarly to SVM and outperformed GLM

on the same external test set when using 512-bit Morgan fingerprints for compound descriptors

and Z-scales 5 for target descriptors (ROC AUC 0.968 RF vs 0.942 SVM and 0.916 GLM; ROC

curves are shown in Figure 2-8). ROC AUC values above 0.9 and MCC values above 0.8 as

obtained from this study are comparable to the classification results for PCM models generated

previously on other datasets.308,249,311

Figure 2-8: ROC curves for different algorithms, including Random Forest (RF) (mtry 86), Radial Support Vector Machines (SVM) (C=3, σ=0.01) and a Generalised Linear Model (GLM) on the same test set. The non-linear methods of RF and SVM outperform the linear GLM technique for this dataset, with the highest performing algorithm being RF.

We selected the RF model because this was the highest performing model for our external test

set and due to the higher interpretability of RF for future applications. At the 0.5 probability

threshold for the number of trees voting for each class, we can analyse the model performance

in terms of sensitivity and specificity values. It can be observed in Table 2-3 that the true

negative prediction rate (specificity) is higher than the true positive prediction rate (sensitivity),

providing a marginally better prediction of inactive compounds than active compounds. The probability threshold that was optimal for jointly maximising sensitivity and specificity (0.40) led to 90% sensitivity at 93% specificity.

For the final PCM model, the class-averaged performance across bromodomains was an MCC of

0.645 with a corresponding ROC AUC 0.934 (Table 2-3). The performance for the test set per-

target is shown in Figure 2-9 and Table 8-4 and highlights the domains for which the model is

most predictive. The BET bromodomains are generally predicted well for PCM with high ROC

AUCs (0.941-0.985) and a higher sensitivity (0.844-1.00) than specificity (0.667-0.967), as

might be expected due to the enriched number of actives for these domains (47.3 %) in the

dataset (Figure 2-4). ATAD2B and BRWD1 BD2 have no actives in the test set and SMARCA2 has no inactives in the test set (see Figure 2-4). As the model predicts all compounds as inactive for ATAD2B and BRWD1 BD2, this gives a specificity of 1.00; conversely, the model predicts all compounds as active for SMARCA2, leading to a sensitivity of 1.00. The case is similar for BRPF3, KAT2A and TAF1L BD2, where due to their low proportion of active compounds, the sensitivity of the resulting model is zero. PCM performs the best outside

of the BET bromodomains for BAZ2A, BAZ2B, CECR2, PB1 BD5, TIF1A and TRIM33 (ROC AUCs

> 0.99, sensitivity > 0.85, specificity > 0.90). For most of these domains the dataset is heavily

biased towards inactives (Figure 2-4), and the high performance can be explained by the active

scaffolds common to training and test set for these domains (Figure 2-10). High overall ROC

AUCs are found for ATAD2, BPTF, BRD1, EP300, BRD7, BRD9, BRPF1 and TAF1 BD2; however,

they generally have lower sensitivity values (0.48-0.84) at a higher specificity, resulting in an

overprediction of inactives. This could be due to their higher active scaffold diversity, which is

represented in Figure 2-11. Conversely, those domains with a higher sensitivity than specificity

outside the BET family include CREBBP and PCAF, both of which have a high proportion of active

data points (37.3 % and 60 %; Figure 2-4). Overall, 21 out of 31 bromodomains are modelled well

(with a ROC AUC > 0.8 and both sensitivity and specificity > 0.6). Poorer performance for the

remaining bromodomains can in many cases be attributed to data limitations, including class

imbalance and scaffold diversity, as discussed above.
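
The per-target bookkeeping described above can be sketched as follows. The sketch assumes per-target confusion counts are available and treats a metric as undefined (`None`) when the corresponding class is absent from the test set; the "modelled well" rule is the one stated in the text (ROC AUC > 0.8, sensitivity and specificity both > 0.6). Function names and example counts are hypothetical.

```python
def per_target_metrics(tp, fn, tn, fp):
    """Sensitivity and specificity; None when that class is absent from the test set."""
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


def modelled_well(roc_auc, sensitivity, specificity):
    """Criterion from the text: ROC AUC > 0.8 and sensitivity, specificity both > 0.6."""
    if roc_auc is None or sensitivity is None or specificity is None:
        return False
    return roc_auc > 0.8 and sensitivity > 0.6 and specificity > 0.6


# Hypothetical target with no actives in the test set, all predicted inactive:
# specificity is 1.0, sensitivity is undefined.
print(per_target_metrics(tp=0, fn=0, tn=25, fp=0))
# Hypothetical target with balanced, good performance.
print(modelled_well(roc_auc=0.95, sensitivity=0.85, specificity=0.90))
```

Returning `None` rather than 0 or 1 for an absent class keeps single-class targets from inflating or deflating averages silently; whether the thesis handled them this way is not stated here.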


Figure 2-9: ROC AUC, sensitivity and specificity for the external test set, split by bromodomain. Overall, 21 bromodomains are modelled well, with a ROC AUC >0.8 and both sensitivity and specificity > 0.6.

Figure 2-10: The number of carbon framework Bemis-Murcko scaffolds for active molecules in the external test set plotted per bromodomain. Scaffolds that were in both the training set and the test set are coloured red and scaffolds only in the test set are coloured blue. This plot shows that most active scaffolds for those bromodomains with low numbers of scaffolds in the test set were also found in the training set.

[Figure 2-10 axis labels: Bromodomain; Number of Active Murcko Scaffolds in Test Set.]

Figure 2-11: The number of active Bemis-Murcko carbon framework scaffolds for each bromodomain in the combined public and proprietary dataset.

[Figure 2-11 axis labels: Bromodomain; Number of Active Murcko Scaffolds.]

2.3.4 Model Validation

In addition to assessing performance on a representative random test set, it is important to

assess model performance on a test set that better reflects practical use. In drug discovery,

we often want to find hits and design compounds for relatively unestablished targets, often also

developing new chemical series. In these cases, limited existing bioactivity data will exist in one

or more of the chemical and biological spaces. For this reason, we performed the more realistic

validations of leave-one-scaffold-out (LOSO) and leave-one-target-out (LOTO) to assess the

ability of the model to interpolate/extrapolate for new compounds and new targets respectively.
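
Both validations are instances of leave-one-group-out splitting, with the group being either the scaffold or the target. A minimal sketch under that assumption is given below; the record layout, field names and toy data are hypothetical, and model training and scoring are omitted.

```python
from collections import defaultdict


def leave_one_group_out(records, group_key):
    """Yield (group, train_indices, test_indices), holding out one group at a time."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[record[group_key]].append(index)
    for group, test_idx in groups.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        yield group, train_idx, test_idx


# Toy compound-target pairs (hypothetical).
toy_records = [
    {"scaffold": "S1", "target": "BRD4 BD1", "active": 1},
    {"scaffold": "S1", "target": "BRD9", "active": 0},
    {"scaffold": "S2", "target": "BRD4 BD1", "active": 1},
    {"scaffold": "S3", "target": "BRD9", "active": 0},
    {"scaffold": "S3", "target": "BRD4 BD1", "active": 0},
]

loso_splits = list(leave_one_group_out(toy_records, "scaffold"))  # LOSO
loto_splits = list(leave_one_group_out(toy_records, "target"))    # LOTO
for group, train_idx, test_idx in loso_splits:
    print(group, train_idx, test_idx)
```

In each split, every data point of the held-out scaffold (or target) is removed from training, so the model must predict it from the remaining chemical and biological space.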

Leave-One-Scaffold-Out Validation

Figure 2-12 displays the LOSO results, which show a large variation in ROC AUC

across scaffolds (mean ROC AUC 0.92, standard deviation 0.12). There is a general

trend towards better discrimination of activity for more complex scaffolds (scaffolds with higher

numbers are more complex with a larger number of rings), which is particularly noticeable for

those scaffold numbers above 3,000, which have consistently high ROC AUC values above 0.95.

The higher predictivity for more specific scaffolds is likely to result from more consistent SAR

for these molecules, due to their high degree of optimisation, combined with the fact that larger

molecules are generally more similar to one another in fingerprint space.366 For less complex

scaffolds, represented here by lower Bemis-Murcko scaffold numbers, there was higher

variability in scaffold interpolation ability and hence in model performance. This can be

attributed to class imbalance in some cases (Figure 2-13); for example, scaffolds 192 and 2,753

both consist largely of active data points, and consequently both have a high sensitivity

accompanied by a lower specificity, corresponding to a larger number of false active predictions.

Conversely, scaffolds 172, 740 and 1,614, which have few active data points, show high

specificities with zero sensitivity. However, this is not always

the case; for example, scaffold 464 has a low fraction of active data points (0.09) while still

achieving perfect performance. Therefore, we can conclude that multiple factors contribute to

the ability of models to interpolate for a new scaffold, including the scaffold similarity, the class

distribution and the SAR itself within the dataset.


Figure 2-12: The ROC AUC, sensitivity and specificity values for leave-one-scaffold-out (LOSO) validation for the final PCM model.

Figure 2-13: The fraction of active and inactive data points for each scaffold predicted in the leave-one-scaffold-out (LOSO) validation, showing that class imbalance can influence predictive performance for a scaffold.

Leave-One-Target-Out Validation

PCM offers the advantage over QSAR that predictions can also be interpolated for new targets

based on information in the training set about similar targets. To test the ability to interpolate

to new targets we employed LOTO validation, the results of which are displayed in Figure 2-14.

The average performance across all bromodomains was ROC AUC 0.863, with a sensitivity of

0.672 and a specificity of 0.864, which was unsurprisingly lower than our original model results

averaged over all targets (ROC AUC 0.934, sensitivity 0.750, specificity 0.900). Target

interpolation/extrapolation for the BET bromodomains (BD1 and BD2 domains from BRD2,

BRD3, BRD4 and BRDT proteins) can be achieved with high predictive performance (ROC AUC

> 0.8, sensitivities and specificities > 0.7). This can be attributed to the high correlation of

activity of molecules between the BETs, as well as a high proportion of shared scaffolds between

the domains (Figure 2-4 and Figure 2-7). These targets are also well represented in selectivity

panel screens and therefore have data points for a larger number of chemotypes. All targets have


a ROC AUC well above 0.5 (random), except for TAF1 BD2, which cannot be predicted using

information from other domains. This can be rationalised by the fact that there is only one

other member of the subfamily, TAF1L BD2 (Figure 2-5); however, TAF1 BD2 has many

more active compounds than TAF1L BD2 (Figure 2-4), and extrapolation from TAF1L BD2 hence

results in the underprediction of active compounds, demonstrated by the low sensitivity of

0.081. The opposite is true for the predictions of TAF1L BD2 based on the activity for

TAF1 BD2, where we observe an underprediction of the inactive class, resulting in a low

specificity of 0.292. BRPF1 and ATAD2 also have lower ROC AUCs of around 0.7. This can

again be rationalised by the lower proportion of active compounds for the nearest domain

sequences, BRPF3/BRD1 and ATAD2B respectively (Figure 2-4), resulting in a greater number of inactive

predictions. Actives are overpredicted for SMARCA4 because the nearest

domain, SMARCA2, has a large proportion of active data points (Figure 2-4). ATAD2B consisted

entirely of inactive data points, and we could predict this domain with a specificity of 0.887 and a

false positive rate of 0.113; these findings likely result from the transfer of the activity profile of

ATAD2. BRD7, BRD1, KAT2A and PB1 BD5 are predicted with ROC AUC, sensitivity and

specificity values all > 0.8. Each of these domains has few data points, whereas a

larger number of data points for a similar domain (BRD9, BRPF1b, CECR2 and SMARCA4,

respectively; Figure 2-4), with a similar activity profile (Figure 2-7), is contained within the

training set. This shows the ability of the model to interpolate between domains

with similar activity profiles, especially for domains where fewer data are available.

In summary, it was found that PCM can generally interpolate/extrapolate well for new targets

that have similar bioactivity profiles to those in the training set, and less well for novel targets

where this is not the case.


Figure 2-14: The ROC AUC, sensitivity and specificity values for leave-one-target-out (LOTO) validation for the final PCM model.

2.3.5 Benchmarking to QSAR, QSAM and baseline models

We next compared our PCM model to other techniques which could be performed on the same

dataset, in order to establish its comparative advantage for bromodomain bioactivity modelling.

The final PCM model significantly (p < 0.05, paired t-test) outperformed models built

on only compound descriptors (global QSAR) or only target descriptors (global

Quantitative Sequence Activity Models (QSAM)), as shown in Figure 2-15 and Table 2-3,

most notably in terms of sensitivity at the 0.5 probability cut-off (sensitivity: PCM=0.86,

QSAR=0.51, QSAM=0.39). The poorer prediction of actives by these methods is further

evidenced by the reduction in the area under the precision-recall curve (PRAUC:

PCM=0.879, QSAR=0.679, QSAM=0.555). Global QSAR outperforms global QSAM (ROC AUC:

QSAR=0.80, QSAM=0.75), showing that discrimination between active and inactive classes is

more dependent upon compound descriptors than target descriptors. This benchmarking

demonstrates that a more predictive model can be achieved by utilising a combination of target

and compound information, as is the case in the PCM method. Per-target PCM on average

outperforms individual QSAR models on the dataset (mean MCC PCM = 0.634, mean MCC

QSAR = 0.569); however, this is highly target dependent, as shown for the comparison between

individual QSAR models and the per-target PCM model (Figure 2-16). For example, for the BET

bromodomains (highlighted in Figure 2-16), PCM outperforms individual QSAR methods,

showing that bioactivity information about one target can interpolate to another target to

improve performance in these cases. However, for BRD7, SMARCA4 and BRWD1 BD2,

individual QSAR models result in noticeably better predictive performance. For SMARCA4 we


can attribute the better QSAR performance to the PCM model wrongly interpolating from

SMARCA2, resulting in a high number of active predictions. These domains are better suited to being

modelled individually. Both individual QSAR and PCM perform poorly for BRPF3,

TAF1L BD2 and SMARCA2. Predictive performance for ATAD2, CECR2, PB1 BD5 and TIF1A is

greatly enhanced by PCM. PCM therefore performs better than or similarly to individual

QSAR models for the majority of bromodomains (28 out of 31), with only three cases

where it is clearly outperformed. This information can be used for future decisions on

modelling techniques to employ for different bromodomains.
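
The paired t-test mentioned above compares two methods on matched samples. The sketch below computes the paired t statistic and its degrees of freedom from two matched score lists. Which samples were paired in this work (for example per-target or per-repeat scores) is not stated here, so the pairing, the function name and the toy MCC values are all assumptions for illustration.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for matched score lists of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Toy matched MCC values (hypothetical, not taken from Table 2-3 or Figure 2-16).
toy_pcm = [0.70, 0.65, 0.80, 0.55, 0.60]
toy_qsar = [0.60, 0.60, 0.70, 0.50, 0.58]
t_value, dof = paired_t_statistic(toy_pcm, toy_qsar)
print(round(t_value, 2), dof)
```

The statistic is then compared with the critical value of Student's t distribution for the given degrees of freedom; for 4 degrees of freedom the two-sided 5 % critical value is 2.776.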

The PCM model was also benchmarked against a model which used simplistic binary target

identifier descriptors to justify the use of the more complex alignment-dependent amino acid

descriptors such as Z-scales 5 for the protein side. The PCM model outperformed this model

(MCC: PCM = 0.826, Binary Target Descriptor model = 0.611, ROC AUC: PCM = 0.968, Binary

Target Descriptor model = 0.853), showing that the protein binding pocket information of the

PCM model is important for classification.

Figure 2-15: ROC curves for PCM versus Global QSAR and Global QSAM techniques on the same test set. PCM outperforms both Global QSAR and QSAM techniques, showing that an interaction of compound and target features is important for modelling this dataset.


Figure 2-16: Individual QSAR vs PCM performance for each target measured by Matthews Correlation Coefficient (MCC). On average, PCM outperforms QSAR. The BET bromodomains are highlighted by the green box outline.

RF is a non-linear machine learning algorithm and the method employed here uses complex

features to describe molecules and targets. For comparison a baseline generalised linear model

(GLM) was produced, using only compound and target binary identifiers as the input variables.

This model achieved adequate performance, with a ROC AUC of 0.730 and an MCC of 0.441, but

was outperformed by the RF PCM model. Additionally, the GLM is less

generalisable than the RF PCM model, since only those compound-target pairs with identifiers

in the training set can be meaningfully modelled using previous chemical information, leaving

new compounds to be estimated from target activity profiles alone.

We next compared this baseline model directly to the GLM model using compound and target

descriptors (Table 2-3), which outperforms the baseline model (ROC AUC: GLM = 0.916 vs

Baseline = 0.781) showing that additional information is contained within the chemical and

biological descriptors used, i.e. information which is further exploited by the final RF model,

which can consider nonlinear relationships between both spaces.

Finally, we investigated the model validity utilising Y-scrambling. Performance dropped from

ROC AUC 0.968 to 0.701 when 50 % of labels were scrambled and then further to random

chance (ROC AUC 0.500) with 100 % of labels scrambled, showing that the

model captures a genuine relationship between the descriptors and the classification outcome.
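
One way to scramble a given fraction of labels is sketched below: a random subset of samples is chosen and their labels are shuffled among themselves, so the overall class balance is preserved. This is an assumed procedure for illustration; the exact scrambling scheme used here is described in the methods, and the function name, seed handling and toy labels are hypothetical. Retraining and rescoring the model on the scrambled labels is omitted.

```python
import random


def scramble_labels(labels, fraction, seed=0):
    """Shuffle the labels of a randomly chosen fraction of samples among themselves."""
    rng = random.Random(seed)
    n_scramble = round(fraction * len(labels))
    chosen = rng.sample(range(len(labels)), n_scramble)
    shuffled_values = [labels[i] for i in chosen]
    rng.shuffle(shuffled_values)
    scrambled = list(labels)
    for index, value in zip(chosen, shuffled_values):
        scrambled[index] = value
    return scrambled


# Toy labels (hypothetical): 30 actives and 70 inactives.
toy_y = [1] * 30 + [0] * 70
half_scrambled = scramble_labels(toy_y, 0.5, seed=1)
changed = sum(a != b for a, b in zip(toy_y, half_scrambled))
print(changed, sum(half_scrambled))
```

Because some shuffled labels land back on a sample of the same class, the number of labels that actually change is smaller than the number selected for scrambling.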


Overall, we conclude that using the non-linear RF method, with its ability to model interactions

between descriptors, as well as threshold cut-offs for descriptors,209 enhances the overall

predictive performance of the models for this dataset and increases generalisation to new

compounds and targets beyond the training set.

2.3.6 Public Dataset PCM Model

Both public and proprietary data were used to construct the PCM models; however, a model

based solely on public data was also generated for comparison. 3,950 data points were modelled

across 20 bromodomains for the public dataset, producing a model with high performance on

the random test set (MCC 0.767, ROC AUC 0.951, PRAUC 0.884; Table 2-3). The model is,

however, more limited in domain and scaffold coverage, comprising only 20 out of 31 domains,

and 685 out of 3618 Murcko scaffolds compared to the in-house model. On a per-target basis,

the performance varies between domains (Figure 2-17). In contrast to the trend for the combined

proprietary and public data model, we observe an overprediction of active labels for many

bromodomains, with a higher sensitivity than specificity, particularly for BAZ2A, BRD1, BRD3

BD1, BRD4 BD2, BRD9 and PB1 BD5, which can be explained as above by a high proportion of

active data points (Figure 2-2).

Figure 2-17: Per-bromodomain performance of the public model in terms of ROC AUC, sensitivity and specificity on the 30 % test set.

Since the current performance analysis is only for the 30% test set of public data, we also tested

the ability of the public model to predict for the proprietary dataset. The overall performance

decreased to an MCC of 0.182, a ROC AUC of 0.642 and a PRAUC of 0.347 (Table 2-3), and the

per-target performance (Figure 2-18) dropped to an average ROC AUC of 0.648, with

overprediction of one class over the other for many domains (average sensitivity 0.447, average


specificity 0.656). This shows that the two data sets are diverse in chemical and biological

spaces, and that the public model is not able to predict well for a structurally diverse dataset.

This supports the rationale for combining the two sets of data to increase the applicability

domain of the model.

Figure 2-18: Per-bromodomain performance of the public model in terms of ROC AUC, sensitivity and specificity on the proprietary dataset.

2.3.7 Applicability Domain

As we have seen in the previous section, we cannot expect a model to reliably predict for

chemical and biological space which has little relationship to the dataset the model is trained

on. When applying the model to future test sets, we therefore need to define an applicability

domain within which the model can predict better than random guessing. We examined the

suitability of cross conformal prediction (CCP)227 as a method to define the applicability domain

of our best performing model, by employing the method to predict classes for the representative

external test set. The results are presented in Table 2-4, where it can be seen that the model has

a validity for the test set close to the confidence level, namely 70.1%, 80.6%, 90.4% and 94.9%

for the 0.7, 0.8, 0.9 and 0.95 confidence levels (CLs) respectively, demonstrating that the

conformal predictions were valid and confirming the suitability of this method for future

compound selections. As described in the methods, to achieve a specified validity, the conformal

predictor allows samples to be classified into the “both” or “neither” classes, as well as single

label classes. The efficiency metric, which is the average number of output labels per sample,

demonstrates how the model achieves a defined confidence level. For the 70 and 80 %

confidence levels the efficiency is 0.73 and 0.85, respectively, and 27 % and 15 % of compounds

are classified into the “neither” class at these levels (i.e., they lie outside the applicability domain

of the model). For the 90 % confidence level, the efficiency is 0.99 and this is fulfilled by placing


compounds into all classes, including 0.12 % into the “both” class and 1.06 % into the “neither”

class, to fulfil the guaranteed error rate. Since the baseline performance of the model is at 91.9

% validity (accuracy), at the confidence level of 0.95 the model assigns many samples into the

“both” class to achieve a higher validity than the underlying classifier. This explains why the

efficiency metric is now greater than 1.00 (1.12), with 12 % of compound-target pairs now assigned to the “both”

class.

In summary, since conformal prediction performs with expected validities on our test set, we

use this method to select compounds for experimental validation in the next section (see

methods).
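
The relationship between p_values, prediction sets, validity and efficiency can be sketched as follows. The sketch assumes that each sample already has a conformal p_value for the active and the not active class (how these are obtained by cross conformal prediction is described in the methods and is not reproduced here), and uses the standard rule that a label is included when its p_value exceeds the significance level, 1 minus the confidence. Function names and toy p_values are hypothetical.

```python
def prediction_set(p_active, p_not_active, confidence):
    """Labels whose p_value exceeds the significance level (1 - confidence)."""
    significance = 1.0 - confidence
    labels = set()
    if p_active > significance:
        labels.add("active")
    if p_not_active > significance:
        labels.add("not active")
    return labels  # empty set = "neither", two labels = "both"


def validity_and_efficiency(p_values, true_labels, confidence):
    """Validity: fraction of sets containing the true label. Efficiency: mean set size."""
    sets = [prediction_set(pa, pn, confidence) for pa, pn in p_values]
    validity = sum(t in s for s, t in zip(sets, true_labels)) / len(sets)
    efficiency = sum(len(s) for s in sets) / len(sets)
    return validity, efficiency


# Toy p_values as (p_active, p_not_active); hypothetical, not thesis data.
toy_p = [(0.60, 0.04), (0.08, 0.70), (0.25, 0.22), (0.04, 0.03)]
toy_truth = ["active", "not active", "active", "not active"]
for level in (0.7, 0.8, 0.95):
    print(level, validity_and_efficiency(toy_p, toy_truth, level))
```

Raising the confidence level lowers the significance threshold, so sets can only grow: "neither" predictions turn into single labels and single labels into "both", which is the behaviour described for the 0.90 and 0.95 levels above.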

Table 2-4: Applicability domain of the model as defined at four confidence levels (0.7, 0.8, 0.9 and 0.95) with Cross Conformal Prediction (CCP). Conformal prediction has four outcomes: active, not active, “both” (active and not active) or “neither”. The number of single class predictions increases as the confidence level increases to avoid the incorrect “neither” class assignment, until the baseline accuracy of the binary classifier (0.92) is reached, whereupon compounds are predicted to be members of “both” classes to increase the validity further. This table shows that the conformal method as applied to PCM is valid and fulfils the required maximum error criteria when tested on the external test set, making the model and the conformal prediction framework suitable for future predictions.

| Confidence (%) | Expected Error (%) | Validity (%) | Efficiency (average number of output labels per sample) | Percentage of compound-target pairs with label “both” | Percentage of compound-target pairs with label “neither” |
|---|---|---|---|---|---|
| 70 | 30 | 70.1 | 0.73 | 0 | 27.0 |
| 80 | 20 | 80.8 | 0.85 | 0 | 15.0 |
| 90 | 10 | 90.4 | 0.99 | 0.12 | 1.06 |
| 95 | 5 | 94.9 | 1.12 | 12 | 0.00 |
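
As a consistency check on the efficiency column, the efficiency can be written in terms of the fractions of samples labelled “both” and “neither”. The symbols f_single, f_both and f_neither are introduced here only for illustration and denote the fractions of compound-target pairs with one, two and zero output labels.

```latex
\begin{align*}
\text{efficiency} &= 1 \cdot f_{\text{single}} + 2 \cdot f_{\text{both}} + 0 \cdot f_{\text{neither}}
  = 1 + f_{\text{both}} - f_{\text{neither}},
  \qquad f_{\text{single}} + f_{\text{both}} + f_{\text{neither}} = 1 \\
70\,\%: \quad & 1 + 0 - 0.270 = 0.73 \\
80\,\%: \quad & 1 + 0 - 0.150 = 0.85 \\
90\,\%: \quad & 1 + 0.0012 - 0.0106 = 0.9906 \approx 0.99 \\
95\,\%: \quad & 1 + 0.12 - 0 = 1.12
\end{align*}
```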

2.3.8 Experimental Validation

Using the best performing RF PCM model, compounds were next prioritised for experimental

testing in Differential Scanning Fluorimetry assays against the four bromodomains BRD1, BRD4

BD1, BRD9 and BRPF1b, as described in the methods section. The overall performance of the

model on the validation set achieved a modest MCC of 0.24 (Table 2-5). The compound

selections were further divided into three categories: (1) Interpolation Compounds, (2) Selectivity

Profiles and (3) Singular Actives, which were designed to test different hypotheses for model

prediction (see methods), and the resulting performance for the three overlapping sets is shown

in Table 2-5. The balanced accuracy (mean of the sensitivity and specificity values) was used to

assess performance instead of the MCC due to the absence of inactive predictions for the

Singular Active set. For the Interpolation Compounds, Selectivity Profiles and Singular Actives

the balanced accuracies were 0.66, 0.64 and 0.29 respectively, showing better predictive


performance for the Interpolation Compounds and Selectivity Profile sets than for the Singular

Active set. This trend is not only driven by the inclusion of the inactive class, but also an increase

in precision for the active class (the precision of the Interpolation Compounds set is 0.42, while

that for the Selectivity Profiles set is 0.37 and that for the Singular Actives set is only 0.29). It is

unsurprising that Interpolation Compounds are generally better predicted, due to some

information about the compound’s bioactivity being included in the model training set. The

median Euclidean distance to the nearest compound-target neighbour in descriptor space

increases from Interpolation Compounds (0.37) to Selectivity Profiles (0.52) to Singular Actives

(0.57), showing that the compound-target pairs in the Singular Actives set are more diverse in

Euclidean space and likely represent a more difficult prediction set for the RF algorithm. This

can explain the better performance for the Selectivity Profiles set than for the Singular Actives set.

Table 2-5: Performance reported as True positives (TP), True negatives (TN), False positives (FP), False negatives (FN), Balanced Accuracy, Matthews Correlation Coefficient (MCC), Precision and the median distance to the training set for the three overlapping subsets of compounds selected for testing. Interpolation performs the best for the model, as expected, followed by the Selectivity Profile predictions.

| Set | TP | TN | FP | FN | Balanced Accuracy | MCC | Precision | Median Euclidean distance to nearest compound-target neighbour in training set |
|---|---|---|---|---|---|---|---|---|
| Interpolation Compounds | 9 | 12 | 12 | 2 | 0.66 | 0.30 | 0.42 | 0.37 |
| Selectivity Profiles | 204 | 166 | 343 | 8 | 0.64 | 0.31 | 0.37 | 0.52 |
| Singular Actives | 115 | 0 | 273 | 0 | 0.29 | - | 0.29 | 0.57 |

In the next sections we examine model performance in further detail for the

individual sets on a per-bromodomain basis, and look at the overall hit rates for

active predictions, which are often used to measure the success of a virtual screen.

Interpolation Compounds

The model performance for Interpolation Compounds on a per-target basis is summarised in

Table 2-6. For BRD1 and BRPF1 the model performed well at predicting interpolations, with an

overall accuracy of 100 % and 75 %, respectively, for both the active and inactive classes. For

BRD9, the accuracy was lower (33 %), due to an overprediction of actives. Figure 2-19 shows the


Euclidean distance of the predicted compounds in combined compound and target space. It is

noticeable that many of the active BRD9 predictions were for compounds which had a larger

Euclidean distance to the nearest compound-target pair in the training set (see circled points in

Figure 2-19), than for active predictions for BRD1 and BRPF1b, which could mean that the

interpolation for BRD9 is not achievable from the other bromodomains considered here. In

general, interpolation is more successful for BRD1 and BRPF1b than for BRD9.
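
The distance measure used in this analysis, the Euclidean distance from a predicted compound-target pair to its nearest neighbour in the training set, can be sketched as below. The sketch assumes each compound-target pair is represented by a numeric descriptor vector of fixed length; the function names and the three-dimensional toy vectors are hypothetical and much shorter than real compound and target descriptors.

```python
import math
import statistics


def nearest_training_distance(query, training_set):
    """Euclidean distance from one descriptor vector to its nearest training vector."""
    return min(math.dist(query, row) for row in training_set)


def median_nearest_distance(queries, training_set):
    """Median nearest-neighbour distance over a set of predicted pairs."""
    return statistics.median(
        nearest_training_distance(query, training_set) for query in queries
    )


# Toy descriptor vectors (hypothetical).
toy_training = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
toy_queries = [(0.0, 0.3, 0.4), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]
print([nearest_training_distance(q, toy_training) for q in toy_queries])
print(median_nearest_distance(toy_queries, toy_training))
```

A larger nearest-neighbour distance indicates a prediction made further from the data the model was trained on, which is the interpretation applied to the BRD9 predictions above.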

Table 2-6: Interpolation prediction results, showing number of active predictions, number of inactive predictions and the % accuracy for each domain. Interpolation compounds have activity in the training set against another domain but activity is not known for the tested domain. Interpolation for BRD1 and BRPF1b was more successful (as measured by accuracy) than for BRD9.

| Bromodomain | Number of Active predictions | Number of Inactive predictions | Accuracy (%) |
|---|---|---|---|
| BRD1 | 3 | 6 | 100.0 |
| BRD9 | 12 | 6 | 33.0 |
| BRPF1b | 6 | 2 | 75.0 |

Figure 2-19: Interpolation predictions categorised according to their domain and predicted activity at the 0.9 confidence level, plotted against their Euclidean distance in compound and target space from the nearest compound-target neighbour in the training set. Since all compounds are in the training set, this shows the distance to the nearest bromodomain on a relative scale. It is noticeable that the incorrect active BRD9 predictions were for compounds with a larger Euclidean distance to the nearest compound-target pair in the training set (circled points) than for BRD1 and BRPF1b, which could mean that the interpolation for BRD9 is not achievable from the other bromodomains considered here.

[Figure 2-19 axis labels: Bromodomain, Split by Predicted Activity; Euclidean Distance in Descriptor Space to Nearest Compound-Target Pair in Training Set.]

Selectivity Profiles

Table 2-7 shows the results of the Selectivity Profile predictions for two, three or four

bromodomains. For the Selectivity Profile compounds predicted for two domains,

corresponding to the first 10 rows of the table, we can see that the best predicted profiles are for

activity at both BRD4 BD1 and BRD9 (Profile 1), activity for BRD4 BD1 with inactivity at BRPF1

(Profile 8) and activity at BRD4 BD1 with inactivity at BRD1 (Profile 10), with 61.1 %, 63.1 % and

62.5 % completely correct predictions across all bromodomains, respectively. This is reflective

of the general selectivity profiles for the training set; BRD4 BD1 and BRD9 are often reported as

active together (Figure 2-20d), whereas many compounds that have activity at BRD4 BD1 are

inactive against BRD1 and BRPF1 (Figure 2-20a and Figure 2-20c). The dual inactive profile for

BRD1 and BRPF1b when a compound displays activity against BRD4 BD1 can also be rationalised

by the similarity between domains (Figure 2-5), which shows that BRD1 and BRPF1b are the

most similar domains within the same type 4 subfamily, and BRD4 BD1 is in a separate

subfamily (type 2). When analysing the predicted Selectivity Profile that includes activity for

both BRD1 and BRPF1 (Profile 3, Table 2-7), we find a poor accuracy for completely correct

predictions of 12.3 %. For most of the incorrect predictions, the conformal prediction p_value for the

active class lies between 0.1 and 0.3, showing that this region represents more

speculative predictions (Figure 2-21a). Additionally, for this profile, the training dataset shows

a high degree of overlap for these domains in terms of compound activity profiles (Figure

2-20e). However, from experimental testing, we found a relatively large set of

35 compounds which were active at one domain but not the other. These data would be important

to incorporate in the future to understand selectivity between the BRD1 and BRPF1 domains, which

has not been well explored previously.
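
The scoring used for the Selectivity Profiles, counting how many of a compound's predicted activities match the measured ones and reporting the percentage of compounds that are completely correct, can be sketched as follows. The data layout (one dictionary of predicted labels and one of measured labels per compound), the function names and the toy compounds are hypothetical.

```python
def count_correct(predicted, measured):
    """Number of domains for which the predicted label matches the measured label."""
    return sum(predicted[domain] == measured[domain] for domain in predicted)


def percent_completely_correct(compounds):
    """Percentage of compounds whose whole predicted profile was correct."""
    complete = sum(count_correct(p, m) == len(p) for p, m in compounds)
    return 100.0 * complete / len(compounds)


# Toy compounds predicted as dual BRD1 and BRPF1b actives (A = active, N = not active).
toy_profiles = [
    ({"BRD1": "A", "BRPF1b": "A"}, {"BRD1": "A", "BRPF1b": "A"}),
    ({"BRD1": "A", "BRPF1b": "A"}, {"BRD1": "A", "BRPF1b": "N"}),
    ({"BRD1": "A", "BRPF1b": "A"}, {"BRD1": "N", "BRPF1b": "N"}),
]
print([count_correct(p, m) for p, m in toy_profiles])
print(percent_completely_correct(toy_profiles))
```

As a worked instance with the reported numbers, 11 completely correct compounds out of 18 tested for Profile 1 gives 100 × 11 / 18 = 61.1 %.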

Table 2-7: Selectivity profile predictions for the experimental validation set, per individual Selectivity Profile. The results are broken down into numbers of compounds for which 4, 3, 2, 1 or 0 predictions were correct and the percentage of each Selectivity Profile predictions which were predicted entirely correctly. Selectivity profiles tested vary between 2, 3 or 4 predictions per compound.

| Profile Number | Predicted Profile | 4 correct | 3 correct | 2 correct | 1 correct | None correct | Total tested | Percentage completely correct (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | BRD4 BD1 and BRD9 active | NA | NA | 11 | 1 | 6 | 18 | 61.1 |
| 2 | BRD9 and BRPF1 active | NA | NA | 2 | 5 | 18 | 25 | 8.0 |
| 3 | BRD1 and BRPF1 active | NA | NA | 15 | 35 | 71 | 121 | 12.3 |
| 4 | BRD4 BD1 and BRPF1 active | NA | NA | 1 | 1 | 5 | 7 | 14.2 |
| 5 | BRD1 and BRD9 active | NA | NA | 0 | 1 | 9 | 10 | 0.0 |
| 6 | BRD9 active, BRD4 BD1 inactive | NA | NA | 1 | 3 | 0 | 4 | 25.0 |
| 7 | BRPF1 active, BRD9 inactive | NA | NA | 1 | 2 | 0 | 3 | 33.3 |
| 8 | BRD4 BD1 active, BRPF1 inactive | NA | NA | 12 | 7 | 0 | 19 | 63.1 |
| 9 | BRPF1 active, BRD4 BD1 inactive | NA | NA | 3 | 8 | 0 | 11 | 27.2 |
| 10 | BRD4 BD1 active, BRD1 inactive | NA | NA | 5 | 3 | 0 | 8 | 62.5 |
| 11 | BRD4 BD1, BRD9 and BRPF1 active | NA | 0 | 0 | 1 | 0 | 1 | 0.0 |
| 12 | BRD1, BRD9 and BRPF1 active | NA | 8 | 15 | 6 | 34 | 55 | 14.5 |
| 13 | BRD4 BD1 active, BRD1 and BRPF1 inactive | NA | 10 | 3 | 0 | 0 | 13 | 76.9 |
| 14 | BRD4 BD1 active, BRD1 and BRD9 inactive | NA | 8 | 4 | 1 | 0 | 13 | 61.5 |
| 15 | BRD9, BRPF1 active, BRD4 BD1 inactive | NA | 0 | 1 | 1 | 0 | 2 | 0.0 |
| 16 | BRD1, BRPF1 active, BRD4 BD1 inactive | NA | 0 | 0 | 2 | 0 | 2 | 0.0 |
| 17 | BRD1, BRD9, BRPF1 active, BRD4 BD1 inactive | 0 | 2 | 2 | 3 | 0 | 7 | 0.0 |
| 18 | BRD4 BD1 active, BRD1, BRD9 and BRPF1 inactive | 4 | 5 | 1 | 0 | 0 | 10 | 40.0 |

When analysing the results for the three- and four-domain Selectivity Profile predictions, we

observe trends consistent with the two-domain Selectivity Profile predictions. For the

Selectivity Profile including activity at BRD4 BD1 with inactivity at both BRD1 and BRPF1 (Profile

13), and the Selectivity Profile including activity at BRD4 BD1 with inactivity at both BRD1 and

BRD9 (Profile 14), we see high completely correct prediction rates of 76.9 % and 61.5 %,

respectively. This is consistent with the fact that Selectivity Profiles with activity at BRD4 BD1

were also well predicted among the two-domain Selectivity Profiles. For the Selectivity Profile

including activity at BRD4 BD1 with inactivity at BRD1 and BRPF1 (Profile 13), those compounds

with higher p_values and higher similarity to the nearest compound neighbour in the training set were

correctly predicted active for BRD4 BD1 (Figure 2-21b). Another Selectivity Profile of note

includes activity for BRD1, BRD9 and BRPF1 (Profile 12), for which, although the overall

completely correct prediction rate is poor at 14.5 %, there are still eight compounds predicted

correctly and 15 with two correct activity assignments, correctly identifying compounds with

multiple activities.
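
The similarity to the nearest compound neighbour in the training set referred to above is a Tanimoto similarity on fingerprints. A minimal sketch is given below, assuming each fingerprint is available as the set of its on-bit indices; fingerprint generation itself is not shown, and the function names and toy bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by on-bits present in either."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def nearest_neighbour_similarity(query_bits, training_fingerprints):
    """Highest Tanimoto similarity between a query and any training compound."""
    return max(tanimoto(query_bits, fp) for fp in training_fingerprints)


# Toy on-bit index sets (hypothetical).
toy_query = {1, 5, 9, 20}
toy_training_fps = [{1, 5, 9, 33}, {2, 4}]
print(nearest_neighbour_similarity(toy_query, toy_training_fps))
```

Plotting this value against the conformal p_value for the active class gives the kind of view used to separate confident predictions close to the training data from more speculative ones.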

[Figure 2-20 panels a-f; axis labels: Compound; Bromodomain.]

Figure 2-20: Pairwise bioactivity profiles of training set molecules for all two-domain combinations of BRD1, BRD4 BD1, BRD9, and BRPF1b. Each cell represents a compound-target pair activity with unique compounds which have data for both domains along the x-axis plotted against each of the bromodomains on the y-axis. Red coloured cells represent active (A) compound-target pairs, blue coloured cells represent not active (N) compound-target pairs. It can be seen that BRD1 and BRPF1b have highly correlated activities and BRD4 BD1 has the most active compounds, which tend to inversely correlate with other domain bioactivities.


Figure 2-21: Selectivity Profile predictions plotted with the p_values for the active class generated from conformal predictions against the Tanimoto similarity to the nearest compound neighbour in the training set (Morgan fingerprints, 512 bits); active=A, not active=N. a) Compounds that were predicted active at both BRD1 and BRPF1. The p_values are lower for this set on average compared to the data for the other Selectivity Profile, and most of the wrong predictions fall between 0.1 and 0.3. b) Compounds that were predicted active at BRD4 BD1 and inactive at BRD1 and BRPF1. Compounds with higher p_values and higher similarity to the nearest compound neighbour in the training set were predicted correctly as active for BRD4 BD1.

[Figure 2-21, panels a and b: x-axis: Tanimoto Similarity to Nearest Compound Neighbour in Training Set; y-axis: P_value for Active Class from Conformal Prediction.]
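
The similarity measure used on the x-axis of these plots, the Tanimoto similarity of a compound to its nearest neighbour in the training set, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: fingerprints are represented as sets of on-bit indices, the function names are hypothetical, and the toy fingerprints are made up (they are not Morgan fingerprints of real compounds).

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def nearest_neighbour_similarity(query, training_set):
    """Highest Tanimoto similarity between a query and any training fingerprint."""
    return max(tanimoto(query, fingerprint) for fingerprint in training_set)


# Toy fingerprints: indices of set bits in a 512-bit vector (made-up values)
training = [{1, 5, 42, 300}, {5, 42, 77, 128, 511}, {9, 10, 11}]
query = {5, 42, 77, 300}

nn_similarity = nearest_neighbour_similarity(query, training)
```

A compound with a low nearest-neighbour similarity lies further from the training data, which is where lower p_values for the active class would be expected.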

Table 2-8: Three compound selectivity profiles predicted by the PCM model and their thermal shift values, as determined by Differential Scanning Fluorimetry (DSF). A = active, N = not active. Thermal shift values (ΔTm) were measured at 10 μM and are shown as the average of the midpoint and first differential determinations (see methods), together with the range of the two measurements. p_values were calculated from conformal predictions and show the probability values that the compound belongs to the class assigned. NA means the compound was not selected for testing in the assay. The three compounds comprise two examples of a dual BRD1 and BRPF1b active prediction (one correctly predicted, the other partially correctly predicted) and one example of a dual BRD4 BD1 and BRPF1b active prediction.

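Structure  Value                                        BRD1           BRPF1b         BRD9  BRD4 BD1
1          Experimental Activity (ΔTm average / range)  A (4.65/0.10)  A (7.81/0.35)  NA    NA
1          Prediction Activity (p_value)                A (0.12)       A (0.12)       NA    NA
2          Experimental Activity (ΔTm average / range)  A (2.15/0.70)  N (0.87/0.25)  NA    NA
2          Prediction Activity (p_value)                A (0.21)       A (0.21)       NA    NA
3          Experimental Activity (ΔTm average / range)  NA             A (1.15/0.31)  NA    A (1.15/0.10)
3          Prediction Activity (p_value)                NA             A (0.17)       NA    A (0.16)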

Table 2-8 shows three examples of predicted profiles from the model along with their activities

as determined by DSF. It is important to note that DSF is only a semi-quantitative assay and

that there may only be small differences between weakly active compounds and not active

compounds. An example of a dual active compound correctly predicted for Profile 3 (Table 2-7)

is compound 1 in Table 2-8, which is a regioisomer of the chemical probe for the BRPF1b and

BRD1 bromodomains, NI-57.363 Data for the 7-amino-1,3-dimethyl-quinolin-2-one warhead was

not in the training dataset, since it was more recently generated and published by AstraZeneca

in 2017; however, some scaffolds with a six-membered ring fused to an aryl ring were active for

BRPF1b in the dataset from internal AstraZeneca compounds, as well as the aryl-sulfonamide

linker being known from the benzimidazolone series.93 Combining these two molecular features

of a six-membered ring fused to an aryl ring containing an acceptor and the sulfonamide linker,

the model could correctly predict dual activity at BRD1 and BRPF1b. This compound was

predicted as active with a p_value of 0.12 for both domains, showing that the probability of

belonging to this class was close to the cut-off of 0.1 and that this compound was at the edge of

our defined applicability domain. Compound 2 is an example of an incorrectly predicted profile

for the same Selectivity Profile 3 (Table 2-8) and is also related to the benzimidazolone series,93

with the main differences being: 1. a reversed sulfonamide, 2. the expanded six-membered ring

system from the original five-membered ring, and 3. the addition of an extra carbonyl acceptor.

In this case, the compound was correctly predicted as active at BRD1 (p_value 0.21); however, it

was also predicted as active at BRPF1b (p_value 0.21) but tested as inactive. This example

confirms our earlier observation that the model often predicts dual activity for BRD1 and

BRPF1b, which occurs due to the data in the training set having few examples of selectivity for

one of the bromodomains over the other (Figure 2-19e). This example is one of several of the

same chemotype tested which displays this selectivity profile and can be utilised in future

models to enable the distinction between chemical features which provide selectivity for BRD1

over BRPF1b and those which constitute activity for both. Compound 3 is an example of a

correctly predicted BRD4 BD1 and BRPF1b active (Profile 4, Table 2-8), which was predicted as

active at BRPF1b with a p_value of 0.17 and BRD4 BD1 with a p_value of 0.16. The higher

probability values come from the fact that the warhead is found in the training dataset with one

example of the O-linked aryl showing activity at multiple bromodomains, including BRPF1b and

BRD4 BD1.

Singular Actives and Overall

For the Singular Active set, a hit rate of 29 % was achieved. When we expand the analysis to all

predicted actives, we obtain an overall active hit rate of 34.1 % for all four assays (Table 2-9),

which greatly exceeds a random hit rate of 0.05 % from previous testing against the same assays

in AstraZeneca screens. When considering per-bromodomain hit rates, BRD4 BD1 has the

highest hit rate of 53.8 %, whereas the other bromodomains ranged from 25-31 %. The

performance per bromodomain is shown in Table 2-9, showing that the number of active hits

correctly predicted as active by the model varied across bromodomains, comprising 57 for BRD1,

129 for BRD4 BD1, 62 for BRD9 and 71 for BRPF1b. Although hit rates are a useful assessment

for how well the model helped shortlist compounds for testing (and one focus was put on

identifying active compounds), it should be reiterated that the compound selection was not

only designed to optimise the hit rate, as the intention was to also sample across different

similarity ranges to the training set and clusters to identify hits which were more structurally

diverse. A consequence of optimising for diversity and reducing the confidence level is that

there was a high overprediction rate of actives, as can be seen by the high false positive

prediction rate of 0.76 (Table 2-9). When the p_value cut-off was increased post-hoc to 0.3 or

greater (corresponding to a confidence level of 0.7), the overall hit rate of classification was

further increased to 68.7 % (Figure 2-22, Table 2-10), showing that adjusting the p_value has

utility in increasing the number of correct predictions. However, this is accompanied by

decreased hit diversity compared to the training set.
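
The effect of raising the p_value cut-off can be sketched for a binary conformal predictor, in which a label is retained only if its p_value exceeds the significance level (one minus the confidence level). The sketch below is an illustration under stated assumptions and not the implementation used in this work: the function names, the strict inequality and the example p_values are hypothetical.

```python
def prediction_set(p_active, p_inactive, significance):
    """Labels whose conformal p_value exceeds the significance level.

    significance = 1 - confidence level (e.g. 0.3 for a confidence of 0.7).
    """
    labels = set()
    if p_active > significance:
        labels.add("A")
    if p_inactive > significance:
        labels.add("N")
    return labels


def describe(labels):
    """Name the four possible outcomes of a binary conformal prediction."""
    if labels == {"A", "N"}:
        return "both"
    if not labels:
        return "neither"
    return "active" if labels == {"A"} else "not active"


# Hypothetical compound-target pair
p_active, p_inactive = 0.21, 0.05
at_cutoff_01 = describe(prediction_set(p_active, p_inactive, 0.1))
at_cutoff_03 = describe(prediction_set(p_active, p_inactive, 0.3))
```

With these made-up values the pair is a single-label active at a cut-off of 0.1 but falls into the "neither" class at 0.3, which is the mechanism by which a stricter cut-off removes low-confidence predicted actives from the applicability domain.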

Table 2-9: Metrics for the experimental test set predictions in terms of hit rate, sensitivity, specificity, balanced accuracy, Matthews correlation coefficient (MCC) and the number of correctly predicted new actives discovered.

Bromodomain  Hit rate (%)  Sensitivity  Specificity  Balanced Accuracy  MCC   Number of correctly predicted actives
All          34.1          0.97         0.24         0.61               0.24  319
BRD1         26.6          1.00         0.29         0.65               0.28  57
BRD4 BD1     53.8          0.99         0.27         0.63               0.36  129
BRD9         25.4          0.86         0.15         0.50               0.02  62
BRPF1b       30.6          0.97         0.25         0.61               0.25  71
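
For reference, the metrics reported in this table follow from the counts of a binary confusion matrix as sketched below. This is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not the experimental counts from this study. The hit rate is taken here as the fraction of predicted actives that tested active, and the false positive rate is one minus the specificity.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Common binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of experimental actives predicted active
    specificity = tn / (tn + fp)   # fraction of experimental inactives predicted inactive
    hit_rate = tp / (tp + fp)      # fraction of predicted actives that tested active
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "hit_rate": hit_rate,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to illustrate the arithmetic
metrics = classification_metrics(tp=45, fp=80, tn=20, fn=5)
```

The example shows how a model can combine a high sensitivity with a low specificity, giving a modest balanced accuracy and MCC even though few actives are missed.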

Figure 2-22: The p_value generated from conformal prediction as plotted against the Tanimoto similarity of compounds to the nearest compound neighbour in the training set (calculated from 512-bit Morgan fingerprints). A=active, N= not active. The overall trend is towards a higher confidence for molecules with a more similar compound neighbour in the training set. The false positive rate decreases as the p_value cut-off is increased to 0.3 for all bromodomains. Hit rate is increased to 68.7 % from 34.1 %. Many of the incorrectly predicted single-label actives are now classified as the “neither” class, which means they are considered outside of the newly defined applicability domain. However, improved hit rate comes with decreased diversity of screening hits.

Table 2-10: Confusion matrix for overall performance for experimental predictions when the p_value threshold was raised to 0.3 (corresponding to a confidence level of 0.7).

Raised threshold to p_value > 0.3

             Experimental
Prediction   A      N
A            169    95
N            8      146

2.4 Conclusions

In this work we have generated the first PCM models for the bromodomain family of proteins,

using a dataset of 15,350 data points across 31 bromodomains. Comparison of different

algorithms, compound descriptors and protein descriptors allowed us to produce a high

performing PCM model using Morgan fingerprints as compound descriptors, physicochemical

property alignment-dependent descriptors as target descriptors, and the RF algorithm,

achieving an MCC of 0.83 on the external test set. PCM outperforms other techniques that can be

applied on the dataset, including global QSAR (MCC 0.50), global QSAM (MCC 0.38) and

average individual QSAR (mean MCC 0.57 QSAR vs mean MCC 0.63 for PCM), showing that

compound and target descriptors together should be used for modelling this bromodomain

data. The use of histone peptide array data as a replacement for target descriptors also resulted

in a highly predictive PCM model (MCC 0.80), showing that information from histone peptide

binding preferences can inform small molecule SAR. We prospectively validated the model

using the cross-conformal prediction framework to shortlist compounds for experimental

validation, to test the ability of the model to 1. interpolate between new combinations of training

set compounds and targets, 2. predict selectivity profiles across bromodomains and 3. find novel

and diverse active hits for bromodomains. Overall, we found that the accuracy of predictions

scaled with the similarity to the training set. The model achieved a performance of an MCC of

0.24 across the 1,139 compound-target pairs which were tested experimentally, identifying 319

new bromodomain active hits. We show that Selectivity Profiles including activity at BRD4 BD1,

in combination with other activities for the other bromodomains, are often correctly predicted

and rationalise the predictions for examples of interesting selectivity profiles found from the

experimental testing. Furthermore, we propose that the derived p_values from conformal

prediction can be used as a method to fine-tune the hit rate versus compound hit diversity in a

virtual screening setting. This study contributes towards addressing the need for a better

method to predict selectivity for the bromodomain family, which is becoming increasingly

important due to the biological relevance of bromodomains in disease.

3 Selectivity Insight for the Rational Design of

Bromodomain Inhibitors Derived from

Proteochemometric Models

3.1 Introduction

The previous chapter addressed the prediction of the binding of small molecules to

bromodomain-containing proteins using proteochemometric (PCM) modelling. Having shown that

PCM models were able to distinguish selectivity between bromodomains, we now

focus on interpretation of the PCM models to provide insight into which biological features

were important for determining activity and selectivity for bromodomains. The aim was to test

the hypothesis that model features corresponded to observed binding interactions in existing

and novel liganded crystal structures, and to reveal new insights into binding interactions.

As mentioned previously, since small molecule bromodomain binding requires the mimicry of

endogenous peptide binding, chemotypes which bind to one bromodomain often also bind to

other bromodomains. This lack of selectivity is particularly notable between bromodomains

with a high binding site sequence conservation.31 Therefore, the challenge of designing a

selective inhibitor remains. Although this challenge exists, increased efforts to produce

chemical probes for the different members of the bromodomain family of proteins have led to

the identification of certain target features deemed important for gaining selectivity for one

bromodomain over another.

Table 1-2 in the introduction provides a summary of a literature survey of residues in the active

site of bromodomains (modelled within this study) which interact with small molecules. We

will use this information to compare to the residues identified as important for binding, derived

from the interpretation of the machine learning models trained on a large bioactivity dataset in

this study. We aim to confirm and extend existing knowledge to provide a more comprehensive

picture of selectivity.


3.2.1 Computational models

PCM models were generated as in the previous chapter and publication.367 The dataset used to

generate the models was extracted from the public and licensed sources of ChEMBL330,

PubChem331, ChEpiMod332, GOSTAR333 and the manual extraction of data from publications,54–56,58,59,65,74,82,94,334–338

as well as AstraZeneca proprietary databases. The final dataset contained

15,350 data points; 6,352 compounds across 31 bromodomains. For further details of the

curation process and the dataset composition, the reader is referred to the previous chapter.367

The number of datapoints per bromodomain, split into active and inactive classes is detailed in

Table 8-1.

512-bit hashed binary Morgan fingerprints of radius 2 were generated using the Python RDKit

module353 and Z-Scales 3 alignment-dependent target descriptors265 were calculated using the

function AADescs from the camb306 package in R. The PCM model was generated using the

camb and caret packages in R,53,71 as detailed in the previous chapter.367
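
The way a PCM input row combines compound and target information can be sketched as the concatenation of a binary fingerprint with the Z-Scales of each aligned binding site residue. The sketch below is an illustration only and is not the camb implementation: the function names, the three-residue toy alignment and the on-bit indices are hypothetical. The first two Z-Scale values for aspartic acid and histidine are those quoted later in this chapter; all remaining values in the lookup table are assumptions that should be checked against the published Z-Scales.

```python
# (z1, z2, z3) per amino acid. Assumption: apart from z1 and z2 for D and H,
# which are quoted in this chapter, these values are illustrative and should be
# verified against the published Z-Scales table before any real use.
Z_SCALES_3 = {
    "D": (3.98, 0.93, 1.93),
    "H": (2.47, 1.95, 0.26),
    "G": (2.05, -4.06, 0.36),
}


def target_descriptor(aligned_binding_site):
    """Concatenate (z1, z2, z3) for each aligned binding site residue."""
    values = []
    for residue in aligned_binding_site:
        values.extend(Z_SCALES_3[residue])
    return values


def pcm_features(fingerprint_bits, aligned_binding_site, n_bits=512):
    """Compound fingerprint followed by target descriptors, as one input row."""
    compound_part = [1 if i in fingerprint_bits else 0 for i in range(n_bits)]
    return compound_part + target_descriptor(aligned_binding_site)


# Hypothetical compound (on-bit indices) paired with a three-residue toy alignment
row = pcm_features({3, 17, 400}, ["D", "H", "G"])
```

Because every bromodomain is described at the same alignment positions, each target feature keeps the same meaning across targets (for example, the size of the residue at a given position), which is what makes the later per-residue interpretation possible.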

3.2.2 Model Interpretation

Z-Scales 3 descriptors make up the first three principal components (PCs) of amino acids

encoded by 26 physicochemical properties. When inspecting the variable loadings of these 26

properties for each principal component it can be observed, as highlighted in the previous study

by Sandberg et al.265 and summarized in Table 3-1, that the three principal components are

primarily described by properties related to lipophilicity, size/polarizability and polarity for the

first, second and third Z-Scales respectively. Therefore, in this study the three principal

components were interpreted as lipophilicity, size/polarizability and polarity descriptors of

amino acids in each residue alignment position. Z-Scales 3 were used instead of the Z-Scales 5

descriptors because the 4th and 5th PCs are difficult to interpret and account for only a small

proportion of the variance.265

Table 3-1: Z-Scales 3 principal component interpretation and variable loadings.

Principal Component (PC)  Interpretation       Variables with Highest Contribution to PC
1                         Lipophilicity        Log10(octanol/water) partition coefficient (LogP), non-polar surface area (Snp), TLC variables, polar surface area (Spol), number of hydrogen bond acceptors (HACCR)
2                         Size/Polarizability  Molecular weight (MW), side chain van der Waals volume (vdW), total surface area (Stot), polarizability (POLAR)
3                         Polarity             Electrophilicity (ELUMO), NMR α-proton shift at pD 1 and 7 (NM1, NM7), electronegativity (EN)

The PCM model was interpreted on two levels, namely global and local feature importance, with

the aim to find binding site amino acid residues and their properties which most highly

influenced the model discrimination between classes.

Global Feature Importance

For the global feature importance assessment, the varImp function from caret was used to rank

the most important variables, using the mean decrease in the Gini impurity.368 The Gini

impurity is an approximation to the entropy and measures how well a variable in the input

dataset splits the instances at each node into homogeneous classes at the successive nodes (see

Decision Trees for further details). The lower the Gini impurity of the resulting nodes, the more effective

the variable was at splitting instances into classes. When calculating the mean decrease in Gini

impurity for a variable, the decrease in Gini impurity was averaged over every split in the forest

that used the variable. The higher the mean decrease in Gini impurity, the

more important the feature was for the classification of the training set instances.369 The top 10

% of input features were then ranked according to this metric. These important residues were

mapped to a reference crystal structure PDB 2YEM in BRD4 BD2370 for visualisation, generated

in MOE.355
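
The quantity underlying this ranking can be sketched for a single split. The example below computes the Gini impurity of a node and the decrease achieved by splitting it, using the standard definitions; it is a minimal illustration with hypothetical function names and toy labels, and it does not reproduce how the random forest implementation aggregates these decreases across splits and trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity, 1 - sum(p_k ** 2), of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node: four active (A) and four not active (N) compound-target pairs
parent = ["A"] * 4 + ["N"] * 4
perfect_split = gini_decrease(parent, ["A"] * 4, ["N"] * 4)
uninformative_split = gini_decrease(parent, ["A", "A", "N", "N"], ["A", "A", "N", "N"])
```

A variable that separates the classes cleanly gives the largest possible decrease for this node, whereas a variable that leaves the class proportions unchanged gives a decrease of zero.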

Local Feature Importance

The package rfFC213 in R was used to compute the local feature contributions per instance in the

training set. This method is discussed in more detail in the section Interpretation of Random

Forests.

The features in the model can be interpreted from this method on a per-instance basis (in our

case compound-target pairs) or for groups of the data (e.g. per target or per subfamily). For the

analysis of selectivity between BD1 and BD2 domains in BET bromodomains, only the

compound-target pairs for which there was an activity annotation at these domains were

considered. This data was then split into the active and inactive compound-target pairs forming

two separate subsets, and the median feature contribution towards active class was calculated

for each of the subsets (where a positive median feature contribution indicated that the feature

contributed towards classifying the example as active and a negative median feature

contribution indicated that the feature contributed towards classifying the example as inactive).

Features with a positive median feature contribution towards classifying active instances and

features with a negative median feature contribution towards classifying inactive instances for

each of BD1 and BD2 domains were then analysed. For the selectivity analysis for CREBBP over

BRD9, data was split firstly by domain and then by activity classification, forming four subsets

of the data in this case. In a similar process to above, the median feature contribution values

towards the active class were calculated for each of the subsets. For each bromodomain, features

with a positive median feature contribution towards classifying active instances and features

with a negative median feature contribution towards classifying inactive instances were retained

for analysis. Finally, for the individual bromodomain activity analysis, the data was grouped by

domain, followed by selecting those with the “active” label to discover important features for

classifying actives at each bromodomain. Aggregation over the group was performed by taking

the median value of the feature contribution, to avoid the influence of outliers, where a variable

might be very important for one instance, but not generally for the group. Values for the median

importance of each variable towards the active class for each bromodomain group, as well as

the lower and upper quartiles and the interquartile range (IQR) are listed in Supplementary

Data File 2. The IQR gives an indication of how representative the median is of all values in the

data subset by providing an indicator of spread and therefore reliability. Features with a median

value of greater than 0 were classified as important towards activity for each bromodomain for

the purposes of the discretized visualisation in Figure 3-4. Those variables encoding the

lipophilicity of each residue which were important to classifying active compounds for each

bromodomain were visualised quantitatively by their median feature contribution values.

Figures were generated using ggplot2371 in R.
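
The grouping and aggregation steps described above can be sketched as follows. This is a minimal Python illustration and not the R code used in this work: the data layout, the function name, the feature names (such as "pos6_Z2") and the contribution values are all hypothetical, and the quartile method is an arbitrary choice.

```python
from statistics import median, quantiles


def summarise_contributions(rows, domain, label="A"):
    """Median and interquartile range of each feature's contribution for one group.

    rows: list of dicts with keys "domain", "label" and "contributions", where
    "contributions" maps a feature name to its contribution towards the active class.
    """
    group = [r["contributions"] for r in rows
             if r["domain"] == domain and r["label"] == label]
    if not group:
        return {}
    summary = {}
    for feature in group[0]:
        values = [contribution[feature] for contribution in group]
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[feature] = {"median": median(values), "iqr": q3 - q1}
    return summary


# Hypothetical per-instance feature contributions for one bromodomain
rows = [
    {"domain": "BRD4 BD2", "label": "A", "contributions": {"pos6_Z2": 0.03, "pos32_Z1": -0.01}},
    {"domain": "BRD4 BD2", "label": "A", "contributions": {"pos6_Z2": 0.05, "pos32_Z1": 0.00}},
    {"domain": "BRD4 BD2", "label": "A", "contributions": {"pos6_Z2": 0.01, "pos32_Z1": -0.02}},
    {"domain": "BRD4 BD2", "label": "N", "contributions": {"pos6_Z2": -0.04, "pos32_Z1": 0.01}},
]

summary = summarise_contributions(rows, "BRD4 BD2")
important = [name for name, stats in summary.items() if stats["median"] > 0]
```

Taking the median over the group, as opposed to the mean, limits the influence of a single instance for which a feature happens to be very important, and the interquartile range indicates how representative that median is.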

3.2.3 X-ray Crystallographic Structure Determination for BRD4 and BRD9

Four novel bromodomain ligands identified in the previous chapter 367 were selected for crystal

structure determination in the available proteins BRD4 BD1 and BRD9. The crystal structures

were determined by Marianne Schimpl in AstraZeneca and the protein was prepared within

AstraZeneca.

The bromodomain of BRD9, amino acids 14–134, and the first bromodomain of BRD4 (BRD4

BD1), amino acids 42–169, were recombinantly expressed in E. coli with an N-terminal

hexahistidine tag. Proteins were purified by immobilised metal-affinity chromatography,

proteolytic cleavage of the affinity tag and size-exclusion chromatography.

BRD4 BD1 was concentrated to 9.2 mg/ml in 100 mM HEPES pH 7.5, 100 mM NaCl, 1 mM DTT.

Sitting drop vapour diffusion crystallisation experiments were set up by combining 200 nl of

protein with 200 nl of reservoir solution, followed by incubation at 4 °C. The complex with 4

was obtained by co-crystallisation with 1 mM ligand, where the buffer condition contained 0.2

M MgCl2, 0.1 M Tris pH 8.5 and 20 % PEG 8000. Crystals were cryo-protected by 1-second

immersion in reservoir solution supplemented with 20 % ethylene glycol, and flash-frozen in

liquid nitrogen. The other complexes with 5, 6 and 7 were obtained by soaking of BRD4 BD1

crystals grown in 0.2 M LiCl, 30 % PEG 4000, 0.1 M Tris pH 8.5. Crystals were soaked for 16

hours in reservoir solution containing 5 mM ligands 2 or 4 and 10 % DMSO, and harvested

without further cryoprotection.

BRD9 was concentrated to 14.5 mg/ml in 50 mM HEPES pH 7.5, 250 mM NaCl, 0.5 mM DTT,

and crystallised as above in 20 % PEG8000, 0.1 M CHES pH 9.5. Bar-shaped crystals appeared

within 3 days and were soaked in a 5 mM solution of ligand 4 as described above.

X-ray diffraction data were collected at Soleil Synchrotron beamline Proxima I and at Diamond

Light Source beamlines I03 and I04. Data were processed using XDS372 and Aimless373 from the

CCP4 suite374. Structures were solved by molecular replacement using Phaser375, and refined

using Buster376 and Coot377. Initial ligand topologies were calculated by Grade378. Data collection

and refinement statistics are summarised in Table 8-6.

MOE355 was used to interpret crystal structures and produce visualisations.


In this study we used the PCM model generated in our previous study for feature interpretation.

The model, built using Morgan fingerprints as compound descriptors and Z-Scales 3 as target

descriptors, achieved a ROC AUC of 0.97 and a Matthews correlation

coefficient (MCC) of 0.83 on the external test set. This performance was comparable with

previous PCM models on other target classes.249,308,311 Further details of PCM model performance

and benchmarking can be found in chapter 2.367 Importantly, high model performance is key to

the meaningful interpretation of features, since we can be confident that the most highly

contributing features are leading to the correct classification. However, it must be noted that

not all features found from this analysis will necessarily be relevant, as there will be false positive

results for various reasons, including the class balance between actives and inactives, as well as

biases in the number of data points for each bromodomain (Table 8-1).

3.3.1 Interpretation

We next assessed the model as a tool to interpret selectivity. Two levels of interpretation were

used to analyse PCM models: global and local feature importance, as described in the methods

section.

Z-scales 3 descriptors were used to describe the binding site amino acids in each alignment

position based on their properties. Z-scales are derived by principal component analysis (PCA)

of 26 different physicochemical properties. We attribute the three principal components (PCs)

of the Z-Scales 3 descriptors, which encode the binding site amino acids, to lipophilicity for PC1,

size/polarizability for PC2, and polarity for PC3, as discussed in the methods section. Notably,

this analysis did not consider binding site waters as descriptors in building PCM models and

assumes that only direct interactions between ligand and protein are important

for binding. Previous studies have shown that interactions with stable waters in

bromodomains can lead to increases in binding affinity and selectivity; for example, it has been

shown that selectivity between BD1 and BD2 domains in BETs can be achieved through

interactions with the ZA channel water network.379

Global Selectivity

Firstly, the PCM model was interpreted on a global level to discover the binding site residue

features important across the whole dataset for the discrimination between the active and

inactive class. These residues correspond to regions of variability between bromodomains and

can be deemed selectivity residues. It should be noted that biases in the numbers of data points

and activity distributions for each bromodomain will have an impact on this interpretation;

these data imbalances are discussed in more detail in chapter 2.367 For example, the features

identified from this global analysis will be more likely to be relevant towards discriminating

between actives and inactive compound-target pairs for bromodomains with a larger number

of datapoints, in particular the BET family in our study. Important target features, defined as

those with a large mean decrease in the Gini coefficient, are summarised as binding site hotspots

for selectivity in Table 3-2 and represented visually as mapped onto the BRD4 BD2 structure

(one of the BET family members) in Figure 3-1. The residues for BRD4 BD2 are used in the next

section as a reference for the important alignment positions and the full alignment is shown in

Figure 8-1.

The alignment positions identified from the global feature importance analysis predominately

correspond to residues on the ZA and BC loop regions. Figure 3-1 shows that the residue position

corresponding to V440 in BRD4 BD2 is important towards discrimination between active and

inactive classes across all bromodomains. This residue neighbours the gatekeeper residue (V439

in BRD4 BD2), which is often used to classify the bromodomain family77. Y377 neighbours the

hydrophobic shelf (known as the WPF shelf in BETs) present in some bromodomains, and the

size of the residue in this position was deemed important for selectivity, on average, across the

whole set of bromodomains. C391, L387, M398, V382, Y377, Y372 and G386 were important

from our analysis due to their size. These residues are positioned on the ZA loop which is known

to be conformationally different between bromodomains, for example, in BRD1 and BRD9 the

ZA channel region has been shown to be conformationally more open, which was used as a

factor for grouping these bromodomains based on druggability scores by Vidler et al.318 The

importance of the ZA loop is furthermore corroborated by IACS-9571, a selective inhibitor for

the BRD1/BRPF bromodomains, which occupies a more open ZA channel, and a greater affinity

for BRPF was achieved by positioning bulky groups in the 6-paramethoxy position of the 1,3-dimethylbenzimidazolone

core, which occupies the ZA channel.61 A similar observation was made

for LP99, a BRD9 selective inhibitor which occupies the ZA channel in this bromodomain.56

In conclusion, we find that the residue positions identified to be important for determining

selectivity across all bromodomains corresponded to the residues highlighted in the literature

for obtaining selectivity for bromodomain subtypes.

Table 3-2: Top-ranked binding site amino acids from the Z-Scales 3 model by mean decrease in Gini coefficient score. Columns give the residue position in the alignment (Figure 8-1), the region in the bromodomain secondary structure, the number and type of the residue in the BRD4 BD2 (2YEM) crystal structure, and the broad property types encoded by the Z-Scales. The table corresponds to the visual representation of the binding site hotspots in Figure 3-1.

Amino Acid Alignment Number  Region of binding site              Residue Number in BRD4 BD2  Z-Scales property encoding
1                            ZA loop                             Y372                        Size
6                            ZA loop                             Y377                        Size
11                           ZA loop                             V382                        Size
15                           ZA loop                             G386                        Lipophilicity, size
16                           ZA loop                             L387                        Lipophilicity, size
20                           ZA loop                             C391                        Lipophilicity, size, polarity
23                           ZA loop                             M398                        Lipophilicity, size
29                           BC loop                             P435                        Lipophilicity
32                           BC loop                             D436                        Lipophilicity, polarity
33                           BC loop                             H437                        Lipophilicity, size
36                           αC helix (neighbouring GK residue)  V440                        Lipophilicity, size, polarity

Figure 3-1: Binding site hotspots. Target features (Z-Scales 3) with the largest mean decrease in Gini coefficient from the PCM model were mapped back to their alignment position in BRD4 BD2. The residue positions are coloured by the property of the amino acids in the alignment position which was successful in the discrimination between active and inactive classes.

Bromodomain Subfamily Activity/Selectivity

Whilst global importance analyses, including the importance measures of the Gini index and

the mean decrease in accuracy (see Interpretation of Random Forests), are used routinely to

interpret machine learning models and can offer insight such as highlighted in the previous

section,380 the use of these measures for this model is limited, since the feature importance is

averaged over the whole dataset and multiple targets. Therefore, it is more beneficial to apply a

feature importance method which can identify residue positions which are important for

classifying subsets of the data and, more specifically in this study, bromodomain subfamilies or

individual bromodomains. This local importance method381 (described in the methods section)

identifies residues in the active site of each group (subfamily, individual bromodomain, etc.),

which were important towards classifying active or inactive compound-target pairs in the PCM

model by taking the median importance value across all the active or inactive instances for the

group.

Selectivity Between BET bromodomains: BD1 vs BD2

Understanding the drivers of BD1 versus BD2 selectivity for BET bromodomains is an area of

interest as many inhibitor molecules tend to display activity for both domains.36 Figure 3-2

shows the model interpretation results displaying those target features (encoded by Z-Scales 3)

which were important towards classifying active and inactive instances for BD1 BET domains

(Figure 3-2a) and BD2 BET domains (Figure 3-2b). The differences between the feature

importance for each of these subsets of the data can be studied for selectivity insight. One of

the most noticeable differences between BD1 and BD2 domains was that for the amino acid in

alignment position 6, all three properties of the amino acid (lipophilicity, size and polarity)

contributed towards classification for the inactive class for BD1 (blue bar with negative median

feature contribution), whereas there was a positive contribution towards the classification of

active instances for BD2 domains (orange bars with positive median feature contributions). This

residue position neighbours the WPF shelf in BET bromodomains. For the BD2 domains this

residue is formed of tyrosine (Y377 in Figure 3-1) and in BD1 domains this residue varies between

glutamine (BRD4 BD1 and BRDT BD1), arginine (BRD2 BD1) and tyrosine (BRD3 BD1). The

variable describing the size (Z2 Z-Scale) has a larger positive contribution for active instances

for BD2 domains than the lipophilicity and polarity (Z1 and Z3 Z-scales) variables, and the size

variable (Z2 Z-Scale) has a larger negative contribution towards the classification of inactive

instances than the lipophilicity and polarity (Z1 and Z3 Z-scales) variables. This shows that,

although all three properties of this amino acid encode selectivity for BD2 over BD1 domains,

the size/polarizability of the residue in this position is particularly important for selectivity. The

amino acid in position 32 was found to be important towards the classification of inactives for

BD1 domains for all three principal components and has a small contribution to the

classification of actives for the lipophilicity and polarity descriptors (Z1 and Z3 Z-Scales) (Figure

3-2a). For BD2 domains there are small negative contributions of this residue to the prediction

of inactives for BD2 domains for lipophilicity and polarity (Z1 and Z3 Z-Scales) and a larger

positive contribution towards the classification of actives for the size of the residue (Z2 Z-Scale)

(Figure 3-2b). For BD1 domains this residue is either threonine (BRD2 BD1 and BRD3 BD1) or

glycine (BRD4 BD1 and BRDT BD1), and for the BD2 domains the residue is consistently aspartic

acid. In BD1, the second Z-scale for this residue contributes towards the inactive class, whereas

for BD2 the second Z-scale is important towards the active class (Figure 3-2). This shows that

the size and polarizability of this residue are important for activity (the size of which is much

larger for aspartic acid than threonine and glycine) and suggests the aspartic acid is a key residue

for obtaining BD2 selectivity over BD1. These two discussed residues have not been identified

previously as selectivity residues for BD1 vs BD2 domains by an analysis of the sequence identity

variation,31 or in the literature search presented in Table 1-2, and so represent new information

which can be used to provide selectivity between BD1 and BD2 BET bromodomains by exploiting


interactions with amino acids in the binding site. One of the previously reported alignment

positions suggested to be exploitable for selectivity for BET bromodomains corresponds to

position 33 in our analysis. For this residue, the median feature contributions for all three

principal components are positive for the classification of actives against both BD1 and BD2

domains for dataset compounds, with higher contributions for the lipophilicity and size (Z1 and

Z2 Z-Scales) features (Figure 3-2). Amino acid 33 corresponds to D144 (BRD4 BD1) and H437

(BRD4 BD2). Because the lipophilicity and size of these amino acids are similar (aspartic acid

Z1 Z-Scale value is 3.98, Z2 Z-Scale value is 0.93, histidine Z1 Z-Scale value is 2.47, Z2 Z-Scale

value is 1.95),265 this may contribute to a more similar interaction across BD1 and BD2 domains

than previously expected. Further details of BD1 vs BD2 selectivity are discussed in the section

Type 2 bromodomains.
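
The comparison of Z-Scale values for position 33 can be made explicit with a short calculation. The following is a minimal sketch and not the code used in this work: it uses only the Z1 and Z2 values quoted above for aspartic acid and histidine, and the helper name `zscale_distance` is a hypothetical one introduced for illustration.

```python
import math

# Z1 (lipophilicity) and Z2 (size) values as quoted in the text above.
Z_SCALES = {
    "D": (3.98, 0.93),  # aspartic acid
    "H": (2.47, 1.95),  # histidine
}


def zscale_distance(res_a: str, res_b: str) -> float:
    """Euclidean distance between two residues in (Z1, Z2) space.

    Hypothetical helper for illustration; a small distance means the two
    residues have similar lipophilicity and size descriptors.
    """
    a = Z_SCALES[res_a]
    b = Z_SCALES[res_b]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    z1_diff = Z_SCALES["D"][0] - Z_SCALES["H"][0]
    z2_diff = Z_SCALES["D"][1] - Z_SCALES["H"][1]
    print(f"Z1 difference (D - H): {z1_diff:.2f}")
    print(f"Z2 difference (D - H): {z2_diff:.2f}")
    print(f"Distance in (Z1, Z2) space: {zscale_distance('D', 'H'):.2f}")
```

Whether a given distance counts as "similar" is a judgment that depends on the spread of residues observed at that alignment position; the sketch only makes the arithmetic reproducible.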

Example of Selectivity for CREBBP over BRD9

Next, we highlight an example of how the analysis was able to find a key selectivity residue for

CREBBP over BRD9, by examining the feature contributions for individual bromodomains.

Figure 3-3a shows all positive median feature contributions for classifying actives at CREBBP

(orange bars) and all negative median feature contributions for classifying inactives at CREBBP

(blue bars). Figure 3-3b shows the same analysis but for BRD9. Residue 34 has a large positive

feature contribution for the first two Z-Scales encoding lipophilicity and size for classifying

active compounds for CREBBP. Conversely, there is a large negative contribution from all three

Z-Scales for the classification of inactives at BRD9. Combining this information on the role of

the residue in position 34 for these two bromodomains, we can say that this residue is important

for selectivity for CREBBP over BRD9. This can be validated with the literature observation of

the well-known cation-π interaction that is made between the arginine R1173 in position 34 in

CREBBP with multiple selective inhibitor scaffolds. Cortopassi et al., 2016 discuss how

interaction strength relates to binding affinity for CREBBP over multiple chemical series of

CREBBP inhibitors.82 Figure 3-3c shows a selective CREBBP inhibitor (Compound 4) from the

training set, which makes a cation-π interaction between the arginine (R1173) residue and the

aromatic ring system (PDB 4NR7). This example shows that the PCM model was able to capture

this selectivity residue as important for binding to CREBBP and the closely related EP300

protein and captures that this residue was not important for binding in BRD9.
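
The selection logic used in this example (a positive median contribution for actives at one target combined with a negative median contribution for inactives at another) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis code used in this work: the function name `selectivity_candidates`, the dictionary layout and all numbers are invented for illustration, and the descriptor names simply follow the Z3_X_AAY convention given in the figure captions.

```python
from statistics import median


def selectivity_candidates(active_at_a, inactive_at_b):
    """Find descriptors that favour activity at target A and inactivity at target B.

    Both arguments map a descriptor name to the list of per-instance feature
    contributions for one subset of compound-target pairs (hypothetical layout).
    """
    candidates = []
    for name, values_a in active_at_a.items():
        values_b = inactive_at_b.get(name, [])
        if not values_a or not values_b:
            continue  # descriptor missing from one of the subsets
        med_a, med_b = median(values_a), median(values_b)
        if med_a > 0 and med_b < 0:
            candidates.append((name, med_a, med_b))
    # Rank by the gap between the two medians, largest first.
    return sorted(candidates, key=lambda c: c[1] - c[2], reverse=True)


# Invented toy numbers, for illustration only (not values from this work).
crebbp_actives = {
    "Z3_1_AA34": [0.012, 0.009, 0.015],
    "Z3_2_AA34": [0.006, 0.004, 0.008],
    "Z3_1_AA29": [0.003, 0.002, 0.004],
}
brd9_inactives = {
    "Z3_1_AA34": [-0.010, -0.007, -0.011],
    "Z3_2_AA34": [-0.003, -0.005, -0.004],
    "Z3_1_AA29": [0.001, 0.002, 0.001],
}

for name, med_a, med_b in selectivity_candidates(crebbp_actives, brd9_inactives):
    print(f"{name}: actives {med_a:+.3f}, inactives {med_b:+.3f}")
```

In the toy data the position 29 descriptor is excluded because its median contribution for inactives at the second target is not negative, which mirrors the requirement that both conditions hold before a residue is proposed as a selectivity feature.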


Figure 3-2: The feature contribution values for features with a positive median feature contribution towards the classification of active (orange) instances and a negative median feature contribution towards inactive (blue) instances over a) all BD1 and b) all BD2 domains within the BET bromodomains in the dataset. Descriptors were encoded in the form Z3_X_AAY, where X is the principal component corresponding to 1-lipophilicity, 2-size and 3-polarity, and Y is the alignment position of the amino acid.

[Figure 3-2 panels: a) BD1 BET Domains; b) BD2 BET Domains. Key: median value for active instances; median value for inactive instances.]


[Figure 3-3 panels a, b and c. Bar chart panel titles: CREBBP, BRD9. Key: median value for active instances; median value for inactive instances. Panel c labels: ZA loop, BC loop, αC helix, R1173, V105, π-cation interaction. Structure key: CREBBP (4NR7), BRD9 (4Z6I), Compound 4.]


Figure 3-3: Target-based feature interpretation example. Shows variables with median positive feature contributions towards classifying active compound-target pairs and median negative feature contributions towards classifying inactive compound-target pairs for a) CREBBP and b) BRD9. Highlighted in blue boxes are the target descriptors for the residue in alignment position 34, showing a positive contribution towards the prediction of the active class for CREBBP and a negative contribution towards the prediction of the inactive class for BRD9. c) Interpretation of the selectivity residue in amino acid position 34, showing compound 4, a selective CREBBP inhibitor which makes a key interaction with an arginine residue in CREBBP, which cannot be exploited by the V105 in this position for BRD9. Overlay of CREBBP (4NR7, yellow), with BRD9 (4Z6I, green). Descriptors were encoded in the form Z3_X_AAY, where X is the principal component corresponding to 1-lipophilicity, 2-size and 3-polarity, and Y is the alignment position of the amino acid.

Individual Bromodomain Activity

In this section we also used the local feature importance method to analyse the median feature

contributions for subsets of the data, in this case active compounds at each bromodomain. We

provide the upper and lower quartile values, and therefore the interquartile range as a measure

of the spread of the data from the median to show the reliability of these values for representing

the feature contributions of the whole subset (here active compound-target pairs for each

bromodomain) (Supplementary Data File 2).
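
As an illustration of the summary statistics described here, the median, quartiles and interquartile range of one feature's contributions over a subset could be computed as in the minimal sketch below. It is not the code used to produce Supplementary Data File 2: the function name `summarise_contributions` is hypothetical, the numbers are invented, and the "inclusive" quartile convention is an assumption, since the quartile method is not stated in the text.

```python
from statistics import median, quantiles


def summarise_contributions(values):
    """Median, quartiles and interquartile range of one feature's contributions.

    `values` holds the per-instance feature contributions for one subset,
    for example all active compound-target pairs of one bromodomain.
    """
    if len(values) < 2:
        raise ValueError("need at least two contribution values")
    # Assumption: quartiles computed with the 'inclusive' convention.
    lower, _, upper = quantiles(values, n=4, method="inclusive")
    return {
        "median": median(values),
        "lower_quartile": lower,
        "upper_quartile": upper,
        "iqr": upper - lower,
    }


# Invented toy contributions, for illustration only.
toy = [0.00, 0.01, 0.02, 0.03, 0.04]
summary = summarise_contributions(toy)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

A narrow interquartile range relative to the median suggests that the median is representative of the subset, whereas a wide range indicates that the contribution varies strongly between compound-target pairs.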

Profiles of the important residues for classifying active compound-target pairs for each

bromodomain are highlighted in Figure 3-4, along with the important property of each amino

acid, defined as either lipophilicity, size, polarity or a combination of these. Highly conserved

alignment positions (positions 5, 8, 9, 19, 22, 24, 28 and 30) were removed from the input

features for model construction due to low variance and thus do not appear as important

features for any bromodomain; this does not necessarily mean they are unimportant for binding,

but that this analysis will not pick up the constant binding features for all bromodomains. An

example of this is the asparagine in alignment position 28, which is necessary for the binding of

the acetyl lysine mimetic in bromodomain inhibitors.70
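
The removal of conserved alignment positions can be pictured as a simple variance filter over descriptor columns. The sketch below is illustrative only: the function name `drop_low_variance` and the threshold of zero are assumptions (the exact cut-off is not given in this section), and the constant column uses a placeholder value rather than a real Z-Scale value.

```python
from statistics import pvariance


def drop_low_variance(columns, threshold=0.0):
    """Split descriptor columns into kept and removed by population variance.

    `columns` maps a descriptor name to its values across all targets.
    A column is kept only if its variance exceeds `threshold` (assumed 0.0 here).
    """
    kept = {name: vals for name, vals in columns.items() if pvariance(vals) > threshold}
    removed = [name for name in columns if name not in kept]
    return kept, removed


columns = {
    # Z1 values for Y, R, Q, Y as quoted elsewhere in this chapter.
    "Z3_1_AA6": [-2.54, 3.52, 1.75, -2.54],
    # Fully conserved position: placeholder constant, not a real Z-Scale value.
    "Z3_1_AA28": [1.0, 1.0, 1.0, 1.0],
}

kept, removed = drop_low_variance(columns)
print("kept:", sorted(kept))
print("removed:", removed)
```

Because a conserved column carries the same value for every target, a model cannot use it to distinguish targets, which is why such positions never appear as important features even when the residue itself is essential for binding.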


Figure 3-4: Alignment map for the binding site residues of bromodomains used in this study. Each residue is coloured according to the property (or properties) of the amino acid identified as important towards the classification of actives for each bromodomain from the local feature contribution method. The threshold for determination of importance towards activity was a median feature contribution of greater than 0. “None” refers to those residues which were not found to be important for the classification of actives from our model interpretation. “-” represents alignment gaps in the sequence. Alignment positions 5, 8, 9, 19, 22, 24, 28 and 30 were removed as model input features due to low variance of amino acid properties across targets. For the quantitative values of each median feature importance, as well as the spread of feature importance from the median, the reader is referred to Supplementary Data File 2.

[Figure 3-4 row groups: Type 1, Type 2, Type 3, Type 4, Type 5, Type 7 and Type 8 bromodomains.]


Individual Bromodomain Activity: Lipophilicity

Whilst Figure 3-4 shows the residue positions for which any of the three Z-Scale principal components

contributes towards the classification of actives, it does not allow quantification of the feature contribution.

Therefore, we next analysed quantitatively the contribution of the first principal component,

which is related to lipophilicity (see methods), to the activity of ligands against individual

bromodomains within each subfamily. Figure 3-5 shows the relative magnitude of the positive

contribution of the lipophilicity of each amino acid towards the classification of activity. A larger

positive value indicates higher importance towards the classification of the active compounds

against the target and a value of zero indicates that the feature was not important for classifying

active compounds for this bromodomain.
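
The per-bromodomain profile described above amounts to filtering descriptors by principal component and keeping those with a positive median contribution. A minimal sketch of that step is given below; it is not the analysis code used in this work. The function names and toy numbers are invented, and the regular expression assumes descriptor names of the exact form Z3_X_AAY with no padding (for example Z3_1_AA34), following the convention stated in the figure captions.

```python
import re
from statistics import median

# Assumed descriptor format: Z3_<principal component>_AA<alignment position>.
_DESCRIPTOR = re.compile(r"^Z3_(?P<pc>[123])_AA(?P<pos>\d+)$")


def parse_descriptor(name):
    """Return (principal component, alignment position) for a descriptor name."""
    match = _DESCRIPTOR.match(name)
    if match is None:
        raise ValueError(f"unexpected descriptor name: {name!r}")
    return int(match["pc"]), int(match["pos"])


def important_positions(contribs, pc=1):
    """Map alignment position to median contribution for one principal component.

    Only descriptors with a median contribution greater than zero are kept,
    ordered from the largest median to the smallest.
    """
    result = {}
    for name, values in contribs.items():
        descriptor_pc, position = parse_descriptor(name)
        if descriptor_pc != pc:
            continue
        med = median(values)
        if med > 0:
            result[position] = med
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


# Invented contributions for the active pairs of one bromodomain.
toy = {
    "Z3_1_AA3": [0.010, 0.012, 0.008],
    "Z3_1_AA4": [0.004, 0.006, 0.005],
    "Z3_1_AA20": [-0.001, 0.000, -0.002],
    "Z3_2_AA3": [0.020, 0.018, 0.022],
}

for position, med in important_positions(toy, pc=1).items():
    print(f"alignment position {position}: median contribution {med:.3f}")
```

Positions absent from the returned mapping correspond to a value of zero in the plot, that is, the lipophilicity descriptor at that position was not important for classifying actives at the bromodomain in question.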

Type 1 bromodomains

For all four of the type 1 bromodomains (BPTF, CECR2, KAT2A and PCAF) the amino acids in

positions 3 (tryptophan) and 4 (proline) were important for activity (Figure 3-4, Figure 3-5). The

median contributions of these features can be found in the Supplementary Data File 2. Positions

3 and 4 form the WPF shelf hydrophobic region, known to be important for targeting BET

bromodomains.31 There is a good weight of evidence to support the role of tryptophan in

position 3 and proline in position 4 in binding interactions with type 1 inhibitors and these

residues are highlighted in Table 1-2. For example, the CECR2 inhibitor GNE-886 makes

hydrophobic interactions with W457 and P458 of the WPF shelf.52 Additionally, molecular

dynamics studies for the binding of the inhibitor Rac-1 to BPTF identified interactions with

W2824,53 and the key for gaining selectivity for KAT2A and PCAF over BET bromodomains was

described to include interactions with W751/W746 and P752/P747.75 The other residue which

was consistently important for classification of active compound-target pairs for the type 1

bromodomains was in position 35, where this residue was either phenylalanine (BPTF) or

tyrosine (CECR2, KAT2A, PCAF). This is the residue known as the gatekeeper (GK) residue

across all bromodomains, and where this residue is tyrosine, this creates a narrower binding

pocket,75 and inhibitors have exploited π-stacking interactions with this residue.57 For PCAF,

many fragment hits are found to make an aromatic hydrophobic interaction with this Y809,76 and

for CECR2, GNE-886 also forms a π-stacking interaction with the Y520.52 For BPTF, F2887 (GK

residue) was also identified as important towards the selective binding of the Rac-1 inhibitor.

Overall, we identify that the commonly important residues for the type 1 bromodomains are

consistent with previous literature.


Some alignment positions were found to be important towards certain subfamily members.

Alignment position 1 was found to be important for only BPTF (Figure 3-5) with a median

feature contribution value of 0.019 (Supplementary Data File 2). This residue in BPTF is M2822

and, apart from TIF1A, BPTF is the only bromodomain featuring methionine in this position.

Hence, we hypothesise that we can make use of interactions with this methionine residue to

afford activity and selectivity for BPTF. Residues important for the classification of actives

against PCAF, but not other type 1 bromodomains, included residue 11 (R754) and residue 12

(T755) with the highest median feature contributions of 0.018 and 0.014 respectively. Other

residues contributed to a smaller extent (Figure 3-5, Supplementary Data File 2). When

comparing these positions to the residues in other bromodomains (Figure 3-4) we can see that

the arginine in position 11 is found for only four bromodomains, none of which are members of

subfamily 1, and that the threonine residue in position 12 is unique to PCAF. These two residues,

situated on the ZA loop, could be key selectivity features for PCAF activity over other

bromodomains and to the best of our knowledge are not currently known (Table 1-2). It is

known that the conformation of the ZA loop in PCAF is different to other bromodomains, for

example CREBBP, and that this is important for peptide binding.382 KAT2A shares one of its most

important residues with PCAF, which is amino acid position 29 (Figure 3-5), namely P810 and

P805 with feature contributions of 0.013 and 0.012 respectively (Supplementary Data File 2).

This residue is proline for a variety of bromodomains in our model (Figure 3-4), and it is the

most lipophilic residue in this alignment position. Therefore, we hypothesise that targeting this

residue with a lipophilic interaction could provide selectivity between type 1 bromodomains (i.e.

for KAT2A and PCAF over others) and activity against a range of bromodomains, since proline

is conserved across many targets.


[Figure 3-5 plot panels: Type 1, Type 2, Type 3, Type 4, Type 5, Type 7 and Type 8 bromodomains. x-axis: Amino Acid Alignment Position; y-axis: Median Feature Importance for Actives.]


Figure 3-5: The median feature importance for the descriptor encoding lipophilicity (first Z-Scale) towards the active compound-target pairs for each bromodomain, as derived from the local feature contribution method. The plots are trellised by bromodomain subfamily. Those alignment positions with a high median feature importance correspond to residues (which can be identified from Figure 3-4) more important to the classification of active compound-target pairs for the bromodomain due to their lipophilic properties.

Type 2 bromodomains

The type 2 or BET bromodomains have the largest numbers of active compounds in the study

(Table 8-1) and so the feature interpretation for this subfamily is based on a larger amount of

data than other subfamilies. For type 2 (BET) bromodomains (Figure 3-5) the highest

contribution towards activity for all the type 2 bromodomains is position 36, adjacent to the

gatekeeper residue.(8) This residue is consistently occupied by a valine for all type 2

bromodomains, whereas for all non-BET bromodomains the position contains a different

residue (Figure 3-4). Other residues of common importance to the classification of BET

bromodomain activity included positions 16 (leucine), 29 (proline), 15 (glycine, asparagine,

glutamine and glutamic acid), 33 (BD1 aspartic acid, BD2 histidine), 3 (tryptophan) and 27

(methionine). Residue 33 has been noted as a key area of residue variation which has been

targeted to provide selectivity for BD1 vs BD2 BET.31,80 W81, the residue in position 3 in BRD4

BD1, was identified from machine learning models to be a key residue to interact with for activity

at BD1 domains of BETs,79 as well as being a part of the WPF shelf with which multiple inhibitors

make lipophilic interactions.80,383

As discussed previously, BD1 and BD2 domains of BET bromodomains have distinct functions,

however, it is challenging to design a small molecule to be selective for one domain over another

in the same protein.384 The differences in residue importance against the BD1 and BD2 domains

can offer further insights beyond the current knowledge (Table 1-2) for the design of selective

ligands. The lipophilicity of residue 17 is only found to be important towards the classification

of active instances for BD2 domains contained within type 2 bromodomains, but not for BD1

domains (Figure 3-4). Residue 17 is proline in BD1 domains and histidine in BD2 domains, while

the majority of other bromodomains have proline in this alignment position (Figure 3-4). This

residue is positioned at the bottom of the ZA loop and targeted interactions with the histidine

in this position could be exploited for selectivity. The lipophilicity of residue 6 is important for

classification of actives against BD2, but not BD1, domains in our model. We previously

highlighted this residue in the section Selectivity Between BET bromodomains: BD1 vs BD2. In

BD2 domains, this residue is tyrosine, whereas in BD1 domains this is arginine, tyrosine or

glutamine. The quantitative values for the first principal component (PC) of the Z-scale

(interpreted here as lipophilicity) were -2.54 for tyrosine, 3.52 for arginine and 1.75 for


glutamine,265 showing that this position in BD1 domains is hence occupied generally by more

hydrophilic residues. This residue has not been identified previously as important for selectivity,

however, the neighbouring residues Q85 (BRD4 BD1) and K378 (BRD4 BD2) have been

identified as a region of variation between BD1 and BD2 domains,31 and so this lipophilicity

change in alignment position 6 might also be important for binding in this region. Conversely,

the lipophilicity of amino acid 13 is specifically important for the classification of active instances

for BD1, but not BD2. Residue 13 is lysine in BD1 domains and alanine in BD2, showing a distinct

difference in lipophilicity (Z-scales first PC values 2.29 and 0.24 respectively).265 This residue

points into the binding site and will affect the properties of the binding site for an inhibitor. As

far as we know this has not yet been exploited for inhibitor design. For quantitative values of

the median feature contributions for the type 2 bromodomains, the reader is referred to

Supplementary Data File 2.

Type 3 bromodomains

The type 3 bromodomains include CREBBP and EP300, as well as BRWD1 BD2. Residue 6

(arginine), residue 15 (glycine for CREBBP and EP300 and glutamic acid for BRWD1 BD2),

residue 34 (lysine in BRWD1 and arginine in CREBBP and EP300) and residue 10 (aspartic acid)

were important features for classifying active compound-target pairs for all three

bromodomains (Figure 3-4). Residue 6 (arginine) is neighbouring the LPF or EPF shelves

aligned with the WPF motif in BET bromodomains. This residue is less lipophilic than the

residues for other bromodomains in this position (Figure 3-4) and could hence potentially be

exploited for activity and selectivity for type 3 bromodomains. For alignment position 15,

the remaining subfamilies do not have a residue in this position, therefore this position encodes

the differences in length of the ZA loop, an important feature of type 3 bromodomains, which

makes them similar to BET bromodomains (Figure 3-4).81 Residue 34 (arginine) corresponds to

the R1173 residue in CREBBP, which has been previously exploited to design activity for CREBBP,

as demonstrated by the cation-π interaction with multiple series of inhibitors.81,82 Residue 10

(aspartic acid) is the same for the type 3 and type 2 bromodomains. This is one of the least

lipophilic residues in this position (Z-scales first PC value 3.98)265 (Figure 3-4).

This could be targeted by a charged interaction. The fact that

some of the residues that are found to be important for classifying activity for type 3

bromodomains are also important for type 2 bromodomains is consistent with the fact that the

BET family are frequent off-targets of CREBBP bromodomain inhibitors,337 due to their

similarities in the ZA loop.31


For CREBBP and EP300, all but one of the remaining alignment positions selected by the model

are the same for both domains, due to their highly conserved binding site sequence (Figure 3-4).

The most important lipophilicity descriptor for classifying activity in this case corresponds to

residue 32, T1171 in CREBBP, with a median feature importance of 0.010, and T1135 in EP300,

also with a median feature importance of 0.010. This residue is positioned at the bottom of the

BC loop and threonine is shared with BRD2 BD1 and BRD3 BD1. Other residues in this position

mostly consist of either glycine or acidic residues for other bromodomains, showing that the

properties of this loop are different in CREBBP and EP300. Residue 13, corresponding to L1119

in CREBBP and L1083 in EP300 with median feature contributions of 0.007 and 0.007

respectively, is also important for the classification of actives at CREBBP and EP300 (Figure

3-4), and leucine in this position is not found in any other bromodomains. Leucine is one of the

more lipophilic residues in this position (with a value for the first PC of Z-scale of -4.28),

especially as compared to the BET bromodomains, where lysine and alanine residues are present

in this position (first PC of Z-scales are 2.29 and 0.24 respectively)265 (Figure 3-4). This residue

is near to the end of the ZA channel, making the properties of this region different between type

3 and type 2 bromodomains. L1119 has been highlighted previously as important for obtaining

CREBBP selectivity over BETs.83 Alignment position 37 was found to be important for the

lipophilicity descriptor for activity at CREBBP (F1177, with a median feature importance of

0.003) but not for EP300 (Y1141), and this is the only major difference between the two

bromodomains from this analysis (Figure 3-5). These two bromodomains have the most

lipophilic residues in this position, which is situated on the αC helix and could, for example,

produce π-stacking interactions with a potential future CREBBP inhibitor, since it provides a

more lipophilic pocket than for the BET bromodomains.

Here, we show that the lipophilic factors contributing to the classification of actives for CREBBP

and EP300 are highly similar and highlight a lipophilic pocket which could be targeted in future

CREBBP/EP300 inhibitors. Further residues found to have a positive feature contribution

towards classification of actives at CREBBP, which are confirmed from previous studies (Table

1-2), included V1174, P1110, L1109 and Q1113 (see Figure 8-1 for all important residues). Overall,

two key residues known from previous literature to be important for gaining CREBBP selectivity

over BET bromodomains have been found to have a high feature contribution towards

classifying actives for the CREBBP and EP300 bromodomains, and four other residues match

with literature interactions for CREBBP inhibitors.


Type 4 bromodomains

The type 4 bromodomains included in this study were ATAD2 and ATAD2B, BRD1, BRPF1,

BRPF3, BRD7 and BRD9, with 65, 0, 53, 146, 5, 41 and 346 active datapoints respectively (Table

8-1). Overall, we find that there is a more complex relationship of features important for

classification of active compound-target pairs within this subfamily, and so we analysed it

separately by bromodomain.

For ATAD2, Figure 3-5 shows that there are several important residues used for classifying

actives in the model. By far the largest feature contribution for the first PC of the Z-Scales

descriptor for classification of actives for ATAD2 was residue 33 (R1072 with a median feature

contribution of 0.010; Supplementary Data File 2). The presence of arginine is consistent with

the observation that the binding site of ATAD2 is generally less lipophilic than BET

bromodomains and therefore less druggable.31,318 This residue is positioned near the end of the

αC helix, and it is the only bromodomain for which this position is arginine in the set of

bromodomains studied. Residues 11 and 10 corresponding to P1015 and D1014 were also found

to be of smaller importance to classifying actives for ATAD2 with both having median feature

contribution values of 0.001. D1014 has been noted previously as a residue which can be used

to make H-bonding interactions to ATAD2 inhibitors.86,87

For the BRPF bromodomains including BRD1 (BRPF2), BRPF1 and BRPF3 there were a small

number of residue positions which were important for classification of actives for these domains

(see Supplementary Data File 2 for full list). Residue 7 was commonly identified as important

for all three bromodomains, which is formed by Q589 in BRD1 and E655 and E616 in BRPF1 and

BRPF3 respectively (Figure 3-4, Figure 3-5). The glutamine at this position (Q589 in BRD1) is also present in BD1 BET

family members, and this residue was found to be responsible for selectivity at BRD1 over BRPF1

for a benzoisoquinolinedione inhibitor molecule.90 BRPF1 and BRPF3 are the only

bromodomains to contain glutamic acid in this position among the ones included in this study,

and the negative charge can potentially be targeted for selectivity and/or activity.

Other residues with a positive feature contribution value for classifying actives against BRD1

included residue 10 (S592, with median feature contribution of 0.005), residue 36 (Y649, with

median feature contribution of 0.001) and residue 3 (R585, with median feature contribution

of 0.0004) (Figure 3-4, Figure 3-5). BRD1 is the only member of the type 4 bromodomains to

have a serine residue in position 10 and S592 was found to be responsible for selectivity at BRD1

over BRPF1 for the same benzoisoquinolinedione inhibitor molecule mentioned above.90 Y649

in alignment position 36, neighbouring the gatekeeper residue, is shared with several


bromodomains, excluding the BET bromodomains and this could be exploited for BET

selectivity. The combination of the gatekeeper F648 and the neighbouring Y649 residue makes

this region much more sterically constrained and lipophilic than the corresponding residues in

BETs (which are formed by isoleucine and valine); this has been exploited for achieving

selectivity in the past.31,89 Residue 3 is the R585 in the RIF set of residues in BRD1, in place of

the WPF shelf in BETs, delivering very different properties to this section of the active site,

which could be exploited by interactions in a similar way to the RVF shelf in ATAD2.385 Overall,

from our feature contribution analysis we find three residues in BRD1, which are consistent with

previous literature describing selectivity for BRD1 including Q589, S592 and Y649.

For BRPF1, in addition to residue 7 mentioned above, residue 10 (P658), residue 2 (G650) and

residue 6 (S654) also have positive median feature contributions towards the classification of

actives with values of 0.008, 0.006 and 0.002 respectively (Figure 3-4, Figure 3-5). Proline at

residue 10 is only present for three bromodomains among those studied here, and so this residue

differentiates BRPF1 by being one of the more lipophilic residues in this position (Figure 3-4).

Selectivity for BRPF1 over BET bromodomains has been achieved by targeting P658 in this

position of the ZA loop with a lipophilic interaction, which is not exploitable with the aspartic

acid in this position in BETs.88 G650 is uniquely glycine for BRPF1 which is situated next to the

NIF group of residues, which is in the same region as WPF in BETs. This glycine makes the

pocket larger and less lipophilic than for other bromodomains. S654 is on the opposite side to

the NIF series of residues and serine is only found in one other bromodomain (BRD7) in this

position, rendering it a potential selectivity residue. BRPF3 has relatively few active compound-

target pairs in the dataset and so will not be interpreted further in this discussion. Overall, we

note that we have identified the P658 residue, which is known to be involved in lipophilic

interactions of BRPF1 inhibitors, as well as highlighted the importance of other residues.

BRD7 and BRD9 are highly conserved domains, and it is difficult to achieve selectivity for one

over the other.31 Correspondingly, nearly all residues that are important for the classification of active

compounds at both bromodomains are conserved, including residue 29 (P213/P102), residue 11

(D162/D51), residue 12 (F163 in BRD7 and A52 in BRD9), residue 13 (I164/I53), residue 27

(M203/M92), residue 20 (S169/S58), residue 18 (G167/G56), residue 35 (Y217/Y109) and residue

7 (F158/F47), for BRD7/BRD9 respectively (Figure 3-4, Figure 3-5). These are listed in order of

their feature importance values, which can be found in Supplementary Data File 2. The proline

residue at alignment position 29 is shared with the BET bromodomains, as well as the type 1

PCAF bromodomain, and as discussed above, is positioned at the end of the BC loop and more


lipophilic than other residues in this position (Figure 3-4), and from our model this residue

seems to be globally important to activity (Figure 3-4). Residue 11 and residue 12 are situated on

the ZA loop; aspartic acid is unique in position 11 for BRD7 and BRD9 and so can be considered

a potential selectivity residue. F163 and A52 in alignment position 12 form two of the more

lipophilic residues found in this position across all bromodomains, which contributes to the

more lipophilic ZA channel in BRD9 compared to BET bromodomains.31 These residues,

although not identified previously as key hydrophobic residues, could be included in

the lipophilic residues which line the ZA channel of BRD9 from which selectivity could be

gained. Conversely, residue 13 (I53) and residue 7 (F47) in BRD9 belong to the previously

identified lipophilic residues of the ZA channel, which were suggested as residues which can be

exploited for selectivity.58,91 BRD7 and BRD9 are the only bromodomains for which residue 20

is serine, while many other bromodomains have tyrosine or histidine in this position. The amino

acid in position 18 is glycine, which is only found for a small number of bromodomains, with

the remainder of the bromodomains containing a charged residue in this position. These

residues positioned at the bottom of the ZA loop may alter the properties of the ZA loop for

BRD7 and BRD9. Y109 in position 35 in BRD9 is the gatekeeper residue and is well-known to

make interactions with inhibitor molecules, for example in BI-9564,58 and is key for activity for

BRD9.58 Overall, we find that the lipophilic residues on the ZA loop in addition to the tyrosine

gatekeeper have been highlighted as important for the classification of actives against BRD9,

which is consistent with the literature.

For the type 4 bromodomains, we correctly identified from our data-driven analysis key residues

which have been found to be important towards activity in previous studies for BRD1, BRPF1

and BRD7/9. We furthermore contributed new potential residues which might be exploited for

activity or selectivity for type 4 bromodomains.

Type 5 bromodomains

The type 5 bromodomains in this study include BAZ2A, BAZ2B and TIF1A. Overall it should be

noted that there were fewer active compounds (27, 31 and 63 respectively) in the dataset than

for other bromodomains, and therefore the findings below are based on less evidence than for

other subfamilies.

The residue positions most important towards the classification of actives for BAZ2A and BAZ2B

included residues 18, 20, 27, 3 and 34 for both domains in order of their importance (Figure 3-4,

Figure 3-5). Residue 18 is G1829/G1900 in BAZ2A/BAZ2B respectively and this position is also

glycine for BRD9 and BRD7 bromodomains. Whilst glycine is not considered lipophilic, it is the


most lipophilic residue in this position (Figure 3-4), with a value for the first PC of Z-scales of

2.05 compared to asparagine and aspartic acid with a value for the first PC of Z-scales of 3.52

and 3.05 respectively.265 The glycine residue in this position could offer the opportunity for

interaction with inhibitors due to its smaller size and lack of polar atoms. Residue 20 is R1831

in BAZ2A and K1902 in BAZ2B, both of which are charged, which could be exploited via opposite

charge interactions. Residue 18 and 20 are positioned at the end of the ZA loop and the BAZ

bromodomains are known to have a shorter ZA loop than other domains,31 showing that the

model has located residues that describe the key differences in structure for these

bromodomains compared to other bromodomains. Residue 27 is V1865 for BAZ2A and V1936

for BAZ2B whereas the remaining bromodomains have methionine or isoleucine in this

position, which may change the properties of the αB helix. Residue 3 is W1816/W1887, the

tryptophan of the WPF shelf that is conserved from the BET bromodomains; it is important for activity

at BAZ2A and BAZ2B, notably because of its flexibility to make π-stacking interactions, in agreement with

previous findings.55 The amino acids in position 34 are the negatively charged E1878 (BAZ2A)

and D1949 (BAZ2B), which are particularly hydrophilic amino acids (with a value for the first

PC of Z-scale of 3.11 and 3.98 respectively), and are also found in BET bromodomains.

Additionally, the residue in position 7 was found to be important towards classification of

actives for the BAZ2A, but not the BAZ2B bromodomain, which corresponds to E1820 in BAZ2A

and L1891 in BAZ2B and forms a known selectivity residue between these domains.92
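The comparison of BAZ2A and BAZ2B by alignment position can be illustrated with a short Python sketch. This is not the analysis code used in this work: the dictionary layout and the function name `differing_positions` are assumptions made for illustration, and only the residues quoted in the text above are included.

```python
# Residues at the alignment positions discussed in the text for BAZ2A and BAZ2B.
# The dictionary layout and the function name are illustrative assumptions.
BAZ2A = {3: "W1816", 7: "E1820", 18: "G1829", 20: "R1831", 27: "V1865", 34: "E1878"}
BAZ2B = {3: "W1887", 7: "L1891", 18: "G1900", 20: "K1902", 27: "V1936", 34: "D1949"}


def differing_positions(domain_a, domain_b):
    """Return the shared alignment positions where the amino acid type differs."""
    shared = sorted(set(domain_a) & set(domain_b))
    # The first character of each label is the one-letter amino acid code.
    return [pos for pos in shared if domain_a[pos][0] != domain_b[pos][0]]


if __name__ == "__main__":
    for pos in differing_positions(BAZ2A, BAZ2B):
        print(pos, BAZ2A[pos], BAZ2B[pos])
```

Positions where the amino acid differs between the two domains, such as position 7 (E1820 versus L1891), are candidates for selectivity between them, whereas conserved positions such as 3 and 18 are not.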

The residue positions that were important to the classification of actives against TIF1A are

different to those for the BAZ bromodomains (Figure 3-4, Figure 3-5), including the globally

important proline residue at position 29 (P982 in TIF1A), as discussed previously. The amino

acid in alignment position 1 (M920) is the same as in BPTF, and in a similar way to BPTF, we

hypothesise that this methionine can be exploited by specific interactions to achieve broad

bromodomain selectivity. Amino acids 12 (in this case an alignment gap) and 10 (P929) are also

important for activity for TIF1A. This agrees with an important structural difference in the

binding pockets, namely that the ZA loop is shorter for TIF1A than for most other

bromodomains.31 P929 has been shown to be important for lipophilic interactions with a known

inhibitor.93 Residue 4 (A923), which replaces the proline found in the same region as the WPF shelf in many

bromodomains, is also highlighted as being important for the classification of actives for TIF1A,

which is consistent with previous knowledge that inhibitors occupy this hydrophobic region

flanked by A923.61 Overall, model interpretation identified residues which are important

for the classification of actives for TIF1A, including P929 and A923, which have previously been reported

as being important for lipophilic interactions with inhibitors.

Type 7 bromodomains

The type 7 bromodomains for which data was included in the model were TAF1 BD2 and TAF1L

BD2, and the numbers of active datapoints in the model for these domains were 243 and 7

respectively. These domains share 37 out of 38 of the same binding site residues used in this

study, and therefore, it is unsurprising that most of the same alignment positions are found to

be important for the classification of actives, including residue 29 (P1585/P1604), 32

(E1586/E1605), 13 (F1536/F1555), 6 (H1529/H1548), 2 (S1525/S1544), 35 (Y1589/Y1608) and 34

(Q1588/Q1607), for TAF1 BD2/TAF1L BD2 (Figure 3-4, Figure 3-5), ranked in order of their

importance (Supplementary Data File 2). Residue 29 (proline) has been noted to be a

determinant of activity against the BET bromodomains and TIF1A in our study, where this

residue is also proline. Glutamic acid in position 32 is important for classification of actives, in

a similar way to the finding that the aspartic acid residue in this position was important for the

classification of actives at BD1 BET bromodomains. A potential selectivity residue identified for

TAF1 BD2 and TAF1L BD2 is the F1536/F1555 residue at position 13, since this lipophilic

aromatic group located on the ZA channel is unique to these two bromodomains (Figure 3-4),

adding a potential lipophilic pocket to the ZA channel similar to that found in BRD7 and BRD9.

Histidine in position 6, neighbouring the WPF shelf, is also a unique residue in this position for

these bromodomains (Figure 3-4), which may be exploited to achieve selectivity. The gatekeeper

residue and its neighbouring residue, tyrosine and glutamine respectively, have also been identified

as important by our model. As seen for the other bromodomains with the tyrosine

gatekeeper residue, including BRD9 and PCAF, this tyrosine could be exploited for π-stacking

interactions to enhance activity for TAF1 and TAF1L BD2 domains.

Type 8 bromodomains

The type 8 bromodomains included in this study are SMARCA2, SMARCA4 and PB1 BD5, which

have low numbers of actives: 28, 70 and 10 respectively.

For SMARCA2 and SMARCA4, the residues in positions 16 (L1418/L1494), 32 (G1467/G1543), 1

(L1405/L1481) and 11 (R1415/R1491) are important for the classification of actives (Figure 3-4,

Figure 3-5), listed in order of their median feature importance. Leucine is a common residue at

position 16 across all bromodomains in this study, including the BET bromodomains. L1418

(SMARCA2) or L1494 (SMARCA4) is positioned on the ZA loop and has previously been

identified as important for lipophilic interactions with an inhibitor molecule (Table 1-2).65

Glycine at position 32 is unique to the SMARCA2 and SMARCA4 bromodomains in this position

(Figure 3-4), suggesting a region for potential selectivity by extending a large, lipophilic group

into this region of the BC loop. L1405/L1481 and R1415/R1491 residues in positions 1 and 11 are at

the top and bottom of the ZA loop, respectively, affecting the properties of these regions.

The leucine in position 1 has not been previously identified; however, the arginine in position 11 may be deemed a selectivity

residue which is important for activity at SMARCA2 and SMARCA4 bromodomains, where

previous experimental evidence of a π-cation interaction exists.82 Amino acid 34 (Q1469) next

to the gatekeeper residue is also found to be important for the classification of SMARCA2

actives. Since very little is currently known about targeting SMARCA2 and SMARCA4

bromodomains this interpretation may be useful in the future design of inhibitors for these

domains. In addition, the importance of the known interaction with L1418 has been supported

by literature.

For PB1 BD5, in agreement with SMARCA2 and SMARCA4, residues 16 (L693) and 1 (L680) were also

identified as important by the model (Figure 3-5). L693 (position 16) in PB1 BD5 has been

identified previously to make interactions with inhibitor molecules.95 The amino acids in

alignment positions 25 (M706) and 29 (P741) are also important for activity; these positions have been

discussed previously for other bromodomains in this study. Interestingly, the lipophilicity of the

amino acids in alignment positions 7 (R686), 4 (I683) and 6 (L685) is also important for the

classification of actives against PB1 BD5; these residues are positioned in the region where the WPF shelf

occurs in BET bromodomains, consistent with experimental evidence that I683 makes

interactions with inhibitors of PB1 BD5.95

Given the limited knowledge of inhibitors for this family, we find a consensus with known

residues important for binding against type 8 bromodomains, and suggest other residues

which, according to the model, might be exploitable in future ligand design.

Summary

The correspondence between the model feature importance towards classifying actives and the

residues discussed in the literature for each bromodomain is summarised in Table 3-3. For the

references to the literature the reader should consult Table 1-1. For 13 out of 31 bromodomains,

namely BPTF, BRD2 BD1, BRD2 BD2, BRD3 BD1, BRD3 BD2, BRDT BD1, BRDT BD2, EP300,

BRD1, BRD7, BRPF3, BAZ2A and TAF1 BD2, we successfully identify all the residues previously

reported to be important for activity (sensitivity of 1). The model features correspond less well

to the known residues for ATAD2, ATAD2B, PB1 BD5 and SMARCA4, with sensitivities of 0.5 or below (Table

3-3). What is not considered here is the false positive rate, which is difficult to determine, as we

do not know which of the newly identified residues are novel and important contributors to

binding, and which are false positives. When considering which residues to target in design, it

is therefore important to consider the magnitude of the median feature contribution and the

spread of values across the subset indicated by the interquartile range (Supplementary Data File

2), as well as the literature evidence and plausibility of a small molecule interaction with the

residue. Overall, we propose that this study can be used as a guideline for generating future

design hypotheses for targeting residues in bromodomains with small molecule inhibitors.
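As an illustration of weighing the median feature contribution against its spread, the following minimal Python sketch computes both summary statistics for one alignment position. The function name `summarise_contributions` and the example values are hypothetical and are not taken from Supplementary Data File 2; the quartile convention (`method="inclusive"`) is also an assumption.

```python
from statistics import median, quantiles


def summarise_contributions(values):
    """Return (median, interquartile range) of one feature's contributions."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


# Hypothetical per-compound contributions for two alignment positions.
# Both have the same median, but position 32 is far more variable.
contributions = {
    16: [4.0, 5.0, 5.0, 6.0, 5.0],
    32: [1.0, 9.0, 2.0, 8.0, 5.0],
}

if __name__ == "__main__":
    for position, values in contributions.items():
        med, iqr = summarise_contributions(values)
        print(f"position {position}: median={med}, IQR={iqr}")
```

Two positions with the same median contribution can differ considerably in spread, which is why a residue with a narrow interquartile range may be a more reliable design hypothesis than one with a wide range.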

Table 3-3: Shows the residues identified from the model to be important to the classification of active compounds for each bromodomain (according to the first PC of Z-Scales 3 target descriptor interpreted as lipophilicity), which corresponded to previous literature knowledge (in bold). Selectivity residues are a list of current literature residues deemed important for activity at the bromodomain and correspond to Table 1-2. The number of true positives, false negatives and the sensitivity are reported.

Bromodomain | Selectivity residues (see Table 1-1 for references) | True positives | False negatives | Sensitivity
BPTF | F2887, D2834, W2824 | 3 | 0 | 1.00
CECR2 | Y520, P458, F459, M506, W457 | 4 | 1 | 0.80
KAT2A | Y814, E761, P752, W751 | 3 | 1 | 0.75
PCAF | Y809, E756, P747, W746, V752, P751, K753 | 5 | 2 | 0.71
BRD2 BD1 | Q101, D160, I162 | 3 | 0 | 1.00
BRD2 BD2 | K374, H433, V435 | 3 | 0 | 1.00
BRD3 BD1 | Q61, D120, I122 | 3 | 0 | 1.00
BRD3 BD2 | K336, H395, V397 | 3 | 0 | 1.00
BRD4 BD1 | Q85, D144, I146, Y139, L92, W81, P82, F83 | 6 | 2 | 0.75
BRD4 BD2 | K378, H437, V439, Y432, P375, F376 | 4 | 2 | 0.67
BRDT BD1 | I115 | 1 | 0 | 1.00
BRDT BD2 | V357 | 1 | 0 | 1.00
BRWD1 BD2 | None known | 0 | 0 | N/A
CREBBP | R1173, L1109, V1174, L1120, P1110, F1111, Q1113, L1119 | 7 | 1 | 0.88
EP300 | R1137, L1073 | 2 | 0 | 1.00
ATAD2 | R1007, V1008, R1077, D1071, D1014, I1074, V1018, Y1063 | 3 | 5 | 0.38
ATAD2B | N981, I982, R1051, D1045, D988, I1048, V992, Y1037 | 0 | 8 | 0.00
BRD1 | F648, S592, V647, Q589, Y649 | 5 | 0 | 1.00
BRD7 | Y217, A154, S157 | 3 | 0 | 1.00
BRD9 | Y106, F44, I53, A54, G43, F45, V49, H42, R101, A46, F47, P48, T50 | 9 | 4 | 0.69
BRPF1 | F714, P658, I713 | 2 | 1 | 0.67
BRPF3 | F675 | 1 | 0 | 1.00
BAZ2A | W1816, V1879, E1820 | 3 | 0 | 1.00
BAZ2B | W1887, I1950, L1891 | 2 | 1 | 0.67
TIF1A | A923, L922, F924, V932, V928, V986, P929, E985 | 6 | 2 | 0.75
TRIM33 | None known | 0 | 0 | N/A
TAF1 BD2 | N1533, W1526 | 2 | 0 | 1.00
TAF1L BD2 | None known | 0 | 0 | N/A
PB1 BD5 | A703, M699, L687, L693, I745, I683, F684, M731 | 4 | 4 | 0.50
SMARCA2 | L1418, P1413, E1417 | 2 | 1 | 0.67
SMARCA4 | L1494, P1489, E1493 | 1 | 2 | 0.33
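The sensitivity values in Table 3-3 follow from the true positive and false negative counts as TP / (TP + FN). A minimal Python sketch of this calculation is given below; the function name `sensitivity` is an illustrative assumption, and the counts are copied from four rows of the table.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of literature residues recovered by the model.

    Returns None when no literature residues are known (N/A in Table 3-3).
    """
    total = true_positives + false_negatives
    if total == 0:
        return None
    return true_positives / total


# (true positives, false negatives) copied from Table 3-3.
counts = {"BPTF": (3, 0), "ATAD2": (3, 5), "BRD9": (9, 4), "TRIM33": (0, 0)}

if __name__ == "__main__":
    for domain, (tp, fn) in counts.items():
        value = sensitivity(tp, fn)
        print(domain, "N/A" if value is None else f"{value:.2f}")
```

Sensitivity alone says nothing about false positives, which is why the table reports only true positives and false negatives.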

To further explore the consensus between model insight and experimental data, we next

prospectively validated interactions identified as driving activity and/or selectivity from PCM

models via novel crystal structures of ligand-protein complexes.

3.3.2 Binding Modes Elucidated by Co-Crystallization

Four bromodomain binders identified in our previous chapter367 are shown in Table 3-4. We

used differential scanning fluorimetry (DSF) to determine the activity of these hits for BRD4

BD1, BRD1, BRPF1 and BRD9, and obtained novel crystal structures to validate our selectivity

findings prospectively. Activity measurements were performed at 10 µM compound

concentration to establish if there was a binding interaction between compound and

bromodomain. Differential Scanning Fluorimetry is only semi-quantitative and therefore the

measurement was taken as a classification of “active” or “not active” rather than a quantitative

degree of activity displayed. The cut-off for activity was determined to be three times the

standard deviation of the negative control (see Table 8-5). We determined four crystal

structures for BRD4 BD1 and one for BRD9. The overall mapping of residues from the crystal

structures to our analysis for the two bromodomains is visualised in Figure 3-6. As can be

observed, of the 20 residues identified as important towards the classification of BRD9

actives from the model, 4 residues were found to be relevant to the binding of compound 8 to

BRD9. Of the 26 residues identified as important towards the classification of BRD4 BD1

actives from the model, 11 were found to make contacts with compounds 5-8 in co-crystal

structures with BRD4 BD1. We note that some residues were deemed important by the model

towards the classification of actives but not found to interact with the ligands for which

crystallographic data was obtained. This is to be expected because not all compounds will bind

in the same mode to each protein.
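The activity call from the DSF measurements can be sketched as follows. This is a minimal illustration under stated assumptions, not the assay analysis itself: the negative-control shifts are hypothetical (the measured values are in Table 8-5), the sample standard deviation is assumed, and a strict inequality is assumed for the comparison with the cut-off.

```python
from statistics import stdev


def activity_cutoff(negative_control_shifts):
    """Cut-off taken as three times the standard deviation of the negative control."""
    return 3 * stdev(negative_control_shifts)


def classify(delta_tm, cutoff):
    """Return 'active' if the thermal shift exceeds the cut-off, else 'not active'."""
    return "active" if delta_tm > cutoff else "not active"


# Hypothetical negative-control thermal shifts (°C) for one bromodomain.
negative_control = [-0.2, 0.1, 0.0, 0.2, -0.1]

if __name__ == "__main__":
    cutoff = activity_cutoff(negative_control)
    for shift in (4.40, 0.10):
        print(f"dTm = {shift:.2f} °C -> {classify(shift, cutoff)}")
```

Because the negative control is measured per bromodomain, the cut-off differs between targets, so the same thermal shift can be classed as active for one domain and within assay error for another.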

Table 3-4: Bromodomain ligands for which crystal structures were obtained. Activity was experimentally confirmed by differential scanning fluorimetry. Values report the observed thermal shift (°C) as midpoint/first derivative; NA indicates that the compound was not tested in the assay.

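Compound | Structure | BRD4 BD1 | BRD1 | BRPF1 | BRD9
5 | (structure image) | 4.40/4.40 | NA | NA | NA
6 | (structure image) | 1.80/1.80 | -0.30/-0.30 | 0.06/0.08 | 1.80/1.70 (within assay error)
7 | (structure image) | 4.90/5.20 | NA | NA | NA
8 | (structure image) | 1.50/1.40 | NA | NA | 3.34/3.30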

Figure 3-6: The amino acid identity and residue numbering for each alignment position in BRD9 and BRD4 BD1 which were important towards the classification of actives for each bromodomain. The figure is coloured by the property or combination of properties (defined by the principal component loadings of the Z-Scales descriptor, see Methods) of the amino acid which were important towards classifying active compounds for each domain. “None” refers to those residues which were not found to be important for the classification of actives from our model interpretation. Outlined in black are the residues which were identified from our study to make interactions with the ligands in our newly generated crystal structures for these domains. “-” represents an alignment gap in the sequence. Alignment positions 5, 8, 9, 19, 22, 24, 28 and 30 were removed as model input features due to low variance of amino acid properties across targets.

Compound 5 (Table 3-4), containing the dimethylisoxazole moiety, which is a well-known

acetyl lysine mimetic, displayed an average experimental activity of ΔTm 4.40°C at 10 µM for

BRD4 BD1 in the DSF assays. The unique aspect of this molecule, as compared to other similar

compounds in the public domain with bromodomain data, was its benzoxazolone core. We were

interested to see how the compound binds to BRD4 BD1 with this core because the compound

was published in a patent subsequent to our experimental testing,386 but there is no published

crystal structure for this compound bound to BRD4 BD1. Figure 3-7 shows the co-crystal

structure of compound 5 bound to the acetyl lysine binding site of BRD4 BD1. The 3,5-

dimethylisoxazole moiety makes an expected83,387,388 interaction with N140 and Y97 via a stably

bound water molecule (Figure 3-7a). These residues were not highlighted in our analysis

because they were excluded as features since they are conserved across bromodomains (see

Methods). Compound 5 makes lipophilic interactions from one methyl group to L94, a residue

highlighted as important for lipophilic, size and polar interactions with BRD4 BD1 actives from

our selectivity analysis (Figure 3-6, alignment position 16). The benzoxazolone packs against the

WPF shelf and makes hydrophobic interactions with the residues of the WPF shelf and L92. L92

was highlighted from our selectivity analysis to be important towards the classification of actives

based on polarity (Figure 3-6). The aryl ring sits in the hydrophobic groove formed by the W81,

M149 and I146 residues (Figure 3-7b), all of which are residues which were highlighted from our

selectivity analysis as being important for lipophilic interactions with BRD4 BD1 active

compounds (Figure 3-6). Figure 3-7c shows that this molecule mimics the binding mode of a

similar BRD4 inhibitor with a dihydroindenol core, where the aryl ring also occupies the

hydrophobic groove.389 Overall, the binding mode of this compound is consistent with similar

compounds in the literature, as well as forming interactions with six of the residues which are

found to be important for classifying actives at BRD4 BD1 from our model.

Compound 6 (Table 3-4) also contains the dimethylisoxazole moiety and was found to be active

at BRD4 BD1 with an average ΔTm of 1.80 °C measured at 10 µM, while being inactive at BRD1,

BRD9 and BRPF1b (with ΔTm values falling below the threshold of 3 standard deviations from

the negative control), making this a selective compound for BRD4 BD1 according to this

measurement. From the crystal structure, as with compound 5, we observe the same

interactions of the isoxazole with N140 and Y97 via a conserved pocket water (Figure 3-8a),

along with the interaction of the methyl group with L94. The aryl ring is positioned along the

ridge of the hydrophobic WPF shelf, as seen for the benzoxazolone in compound 5. The

meta-substituted sulfonamide of compound 6 points in the opposite direction to the aryl ring of

compound 5, towards the ZA channel and away from the WPF shelf (Figure 3-8b), and forms an

interaction with the backbone nitrogen of D88. This residue is part of the ZA loop and was found

from our model to be an important residue with respect to its lipophilic, steric and polar

properties for the classification of active compounds (Figure 3-6). The sulfonamide displaces a

water in this position and extends towards K91, a residue for which lipophilicity and polarity

were found from the model to be important for the classification of active compounds for BRD4

BD1.

The selectivity of this molecule for BRD4 BD1 over BRD9 can partially be explained by the

residue in alignment position 7 from our model (Figure 3-6), which is phenylalanine in BRD9

and glutamine in BRD4 BD1. From the overlay of structures, the sulfonamide clashes sterically

with F47 (Figure 3-8c), rendering the compound inactive at BRD9. The size of Q85 was specifically

highlighted as important for the classification of actives for BRD4 BD1, and the lipophilicity, size

and polarity of F47 (corresponding to the same alignment position as Q85) in BRD9 were

important for classifying actives at BRD9 (Figure 3-6), providing evidence from the model

consistent with this observation. For selectivity over BRPF1b, the residue in alignment position

10 from our model can be hypothesised to have a role in the selectivity observed. In BRD4 BD1

this residue is aspartic acid (D88), with which compound 6 interacts via the

backbone nitrogen, as discussed above. In BRPF1b this residue is proline (P658) and the

ligand can no longer make the same interaction.

In summary, we observe similarities in the binding mode of compounds 5 and 6, except for the

two binding vectors for the aryl vs sulfonamide groups. We identified three additional residues

which make interactions with compound 6 but not compound 5, namely Q85, D88 and K91,

which were consistent with the model activity analysis for BRD4 BD1. We note that residue

positions important for activity at BRD9 (position 7, F47) and BRD1 and BRPF1b (position 10,

S592 and P658) from our computational analysis were consistent with our experimental

findings. We furthermore generated selectivity hypotheses for why the molecule is active at

BRD4 BD1, but inactive at the other three bromodomains tested.

Figure 3-7: Crystal structure of compound 5 bound to BRD4 BD1. Compound 5 is shown in light blue sticks; key protein side chains are shown as yellow sticks. a) Key hydrogen bonding interactions of the dimethylisoxazole with N140 and Y97 via a stably bound water molecule. b) Ligand with a receptor surface representation of BRD4 BD1 with surface coloured by lipophilicity (lipophilic regions in green and hydrophilic regions in pink). The aryl ring occupies the hydrophobic groove formed by W81, M149 and I146. c) Overlay of compound 5 (light blue) with the ligand from crystal structure PDB 4GPJ (magenta), which has a different bicyclic core. In both cases the aryl ring is positioned similarly in the hydrophobic groove.

[Figure 3-7 image labels: panels a, b and c; WPF shelf; hydrophobic groove; residues W81, L92, Y97, F83, I146, M149, N140 and L94; compound 5; ligand from PDB 4GPJ]

Figure 3-8: Crystal structure for compound 6 bound to BRD4 BD1. Compound 6 is shown in orange sticks; key protein side chains are shown as yellow sticks. a) Key hydrogen bonding interactions of the dimethylisoxazole with N140 and with Y97 via a conserved water molecule. b) Surface representation of BRD4 BD1 with surface coloured by lipophilicity (lipophilic regions in green and hydrophilic regions in pink). Compound 6 is displayed in orange and compound 5 in blue. The sulfonamide occupies the region towards the ZA channel instead of the hydrophobic groove occupied by compound 5. c) Overlay of the BRD9 structure (pink) with our crystal structure of compound 6 in BRD4 BD1 (yellow). Surfaces are shown in the same colour as the corresponding receptor residues, F47 in BRD9 and Q85 in BRD4 BD1. F47 in alignment position 7 in BRD9 clashes sterically with the sulfonamide (pink surface), which is likely to be the reason for the selectivity of this molecule for BRD4 BD1 over BRD9.

[Figure 3-8 image labels: panels a, b and c; WPF shelf; hydrophobic groove; residues N140, Y97, F47, Q85, L94, D88, K91, F83, W81 and P82; compound 6]

Compound 7, active at BRD4 BD1 with a high average ΔTm value of 5.05°C (Table 3-4), is related

to BI-2536,66 a known inhibitor of BRD4 BD1, but is modified structurally in three ways:

1. the expansion from a 6- to a 7-membered ring system, 2. the change of the

substituents around the ring and 3. the alteration of the simple base to a more complex cyclic

amine base. In the crystal structure of this compound with BRD4 BD1 (Figure 3-9a) the carbonyl

of the amide in the 6,7-fused ring system interacts with the N140 residue. The cyclopropyl group

forms a lipophilic interaction with a hydrophobic sub-pocket defined by L94, A89, V87 and Y97,

while the methyl group extends into the lipophilic sub-pocket defined by M132, C136 and F83.

In agreement with this, L94, A89 and M132 were highlighted as important residues for lipophilic

interactions with active compounds by our model (Figure 3-6). The other residues mentioned

(V87, Y97 and F83) are in regions of low variability between bromodomains and were therefore

explicitly excluded as features in the model. The cyclopentyl group extends towards the

hydrophobic groove defined before for compounds 5 and 6 as the sub-pocket between W81,

M149 and I146. The pyrimidine interacts with a stable water in the binding site. The pyrimidine

ring forms a hydrophobic contact with P82 and the aryl ring forms an edge-to-face π-stacking

interaction with W81 in the WPF shelf and a hydrophobic interaction with the L92 residue on

the ZA loop; these residues have been mentioned previously in our study for their importance

for the classification of actives for BRD4 BD1 (Figure 3-6), as well as in previous studies.79

As shown in Figure 3-9b, the binding mode of compound 7 is similar to that of BI-2536 in that

the carbonyl interacts with N140 and the substituents around the 7-membered ring (methyl and

cyclopentyl groups) occupy the same regions in the structure. In addition, the cyclopropyl group

in compound 7 overlays with the ethyl group in BI-2536. Noticeable changes in binding mode

between the two compounds include that the 7-membered ring changes the position of the

pyrimidine to be able to form an interaction with the stable water molecule. In addition, the

chlorine substitution moves the aryl ring further into the ZA channel towards the D88 and K91

residues. This positions the aryl ring more favourably for an edge-to-face interaction with W81,

as well as a hydrophobic interaction with L92, in agreement with these residues being picked up as features

related to activity against BRD4 BD1 by our model.

This compound hence forms key interactions with several residues that were recognised by the

model as being important in terms of making interactions with inhibitor molecules of BRD4

BD1, including L94, A89, M132, M149, W81, I146, D88 and K91.

Figure 3-9: Crystal structure for compound 7 bound to BRD4 BD1. Compound 7 is shown in light blue sticks. a) Key hydrogen bonding interactions from the carbonyl of the amide in the 6,7-fused ring system with N140, as well as the interactions with multiple other residues. b) Overlay of BI-2536 (PDB 4O74), shown in purple sticks, with compound 7, showing that the cyclopentyl and methyl groups contact similar regions of the protein, but that the 7-membered ring changes the position of the pyrimidine to be able to form an interaction with the stable water molecule. The chlorine substitution moves the aryl ring further into the pocket to enable an interaction with W81. The surface representation of BRD4 BD1 near residues K91, D88 and L92 is coloured by lipophilicity (lipophilic regions in pink and hydrophilic regions in green) and shows that compound 7 extends towards the ZA channel residues K91 and D88.

[Figure 3-9 image labels: panels a and b; residues W81, N140, Y97, P86, P82, F83, I146, L92, L94, A89, C136, V87, M149, K91 and D88; compound 7; BI-2536]

Compound 8 is formed of a quinazolinone warhead and an aryl mesylate and is active at BRD9

with an average ΔTm of 3.32 °C at 10 µM, as well as at BRD4 BD1 with a ΔTm of 1.45 °C (Table 3-4). This

is the first O-linked quinazolinone demonstrated to have activity for BRD9. From the crystal

structures we obtained for both bromodomains, the molecule forms a key interaction with N140

(BRD4 BD1) and N100 (BRD9) via the carbonyl group on the quinazolinone (Figure 3-10). For

BRD4 BD1, the quinazolinone interacts with I146 and V87 in the binding site via hydrophobic

interactions. The aryl mesylate extends into the hydrophobic groove between W81, M149 and

I146 in a similar way to compound 5 (Figure 3-10a). For BRD9 the binding mode is different

due to the difference in the amino acid residues in the binding site (Figure 3-10b). In this case,

the primary interaction of the quinazolinone, apart from that with N100, is the π-stacking

interaction with Y106, which has been noted to be of relevance for BRD9 inhibitors previously.58

This residue has been identified from our model as important for activity at BRD9 for the

properties of lipophilicity and size, which agrees with the interactions observed in the crystal

structure. This residue (position 35) is the gatekeeper and corresponds to I146 in BRD4, which

is unable to form π-stacking interactions. Other key lipophilic interactions are with F44 and

F47 either side of the quinazolinone core (Figure 3-10b), which agrees with important

interaction features derived from the model (Figure 3-6). As shown in Figure 3-10c, the molecule

cannot adopt the same binding mode in BRD9 as in BRD4 BD1, as Y106 blocks access to the

hydrophobic groove and would clash with the aryl ring in this conformation. Therefore, the aryl

mesylate extends into solvent in this case. On the other side of the protein, in the ZA channel,

L92 in BRD4 BD1 is replaced by I53 in BRD9, and hence the position of the ZA loop is altered, allowing

the quinazolinone system to shift towards the more open ZA loop.

From our interpretation of the model, the alignment position 35 (corresponding to the

gatekeeper residue of Y106), position 13 (corresponding to the residue I53), position 7

(corresponding to F47) and position 4 (corresponding to F44) were important for classification

of activity at BRD9 (Figure 3-6) and these residues were indeed found to form interactions with

compound 8, as shown in the BRD9 cocrystal structure. The binding mode was different in

BRD4, mostly since BRD9 has a larger and more lipophilic gatekeeper residue Y106, which

blocks access to the hydrophobic groove.

Figure 3-10: a) Crystal structure for compound 8 (blue sticks) bound to BRD4 BD1 (yellow sticks). Key H-bonding interactions of the quinazolinone with the N140 residue, as well as the interaction of the quinazolinone with I146 and V87 in the binding site, can be seen. The aryl mesylate extends into the hydrophobic groove between W81, M149 and I146. b) Crystal structure for compound 8 (green sticks) bound to BRD9 (pink sticks). Key H-bonding interactions of the quinazolinone with the N100 residue, as well as the π-stacking interaction of the quinazolinone with Y106, can be seen. In this case the aryl mesylate points out into solvent. c) Compound 8 in its BRD4 BD1-bound conformation (blue sticks) overlaid with compound 8 (green sticks) bound to BRD9 (pink sticks). The BRD9 receptor surface is shown in pink hash, showing that compound 8 cannot bind in the BRD4 BD1 binding mode due to the clash of the aryl mesylate with Y106 in BRD9.

[Figure 3-10 image labels: panels a, b and c; residues N140, I146, V87, M149, N100, Y106, I53, F47, F44, G43, F45, L92, W81, P82, F83 and Y97; compound 8]

3.4 Conclusions

This study provides information on how to optimise activity and selectivity for bromodomain-

containing proteins, through interactions with residues in the binding site. We systematically

examined selectivity features across bromodomains by interpreting proteochemometric models

at different levels, namely the global, subfamily and individual target levels. Our analysis led to

the identification of residues in the bromodomain active site which were important towards

obtaining activity and selectivity at these different levels, and we compared our findings to

existing knowledge from the literature, as well as to newly-generated experimental binding

modes of compounds. We showed that the model retrieved a highly important selectivity

residue for CREBBP over BRD9, namely R1173 known to make cation-π interactions with

multiple series of CREBBP inhibitors. We analysed the features important for classifying active

compound-target pairs at each bromodomain and found that for 13 out of 31 bromodomains,

the model identified all previously known important features from the literature. We also

highlighted and discussed potential new interactions that can be exploited for selectivity for

each bromodomain, for example, that interacting with the tyrosine residue in alignment

position 6 may gain selectivity for BET BD2 domains over BD1 domains, and that interacting

with the more lipophilic residues of F1177 and Y1141 in alignment position 37 for CREBBP and

EP300 bromodomains may gain selectivity over BETs. We focussed on the interpretation of the

first principal component of the target descriptors which describes the lipophilicity of the amino

acid in each alignment position for the per-target analysis, although the other properties of size

and polarity can be further explored in a similar systematic analysis in the future (and the data

is contained in Supplementary Data File 2). We next examined the correspondence between the

residues identified from the computational analysis and interaction residues from novel co-

crystal structures. For BRD4 BD1, 11 residues were confirmed from our model interpretation to

be relevant to the binding of at least one out of four compounds (for which cocrystal structures

of the compound in BRD4 BD1 were generated), namely W81, P82, Q85, D88, A89, K91, L92,

L94, M132, I146 and M149. These residues showed good overlap with the previous machine

learning interpretation study.79 For BRD9, we were able to confirm that 4 residues from our

model interpretation were relevant to the binding of compound 8, namely F44, F47, I53 and

Y106. Furthermore, we have shown that the residue in position 7, identified as important for

activity at BRD9, was key to the selectivity of compound 6 for BRD4 BD1 over BRD9, due to the change in

residue from Q85 in BRD4 BD1 to F47 in BRD9, which causes a clash with compound 6 in BRD9.

Overall, this study adds to the knowledge for bromodomain target family selectivity and

provides a tool to augment future structure-based design of small molecules for targeting the

bromodomain-containing proteins.
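The count of 11 confirmed BRD4 BD1 residues can be reproduced as the union of the per-compound contacts. The Python sketch below is illustrative only: the per-compound sets are an assumption, transcribed from the descriptions in Section 3.3.2 and restricted to residues that were retained as model features (conserved residues such as N140, Y97, F83 and V87 are left out).

```python
# Model-highlighted BRD4 BD1 residues contacted by each compound, as read from
# the text of Section 3.3.2. The per-compound assignment is an assumption.
contacts = {
    5: {"L92", "L94", "W81", "M149", "I146"},
    6: {"L94", "Q85", "D88", "K91"},
    7: {"L94", "A89", "M132", "M149", "W81", "I146", "D88", "K91", "P82", "L92"},
    8: {"I146", "W81", "M149"},
}


def confirmed_residues(per_compound):
    """Return residues contacted by at least one compound, sorted by residue number."""
    union = set().union(*per_compound.values())
    return sorted(union, key=lambda residue: int(residue[1:]))


if __name__ == "__main__":
    residues = confirmed_residues(contacts)
    print(len(residues), ", ".join(residues))
```

A residue counts as confirmed if at least one of the four compounds contacts it, so the total reflects the combined coverage of the binding modes and not that of any single ligand.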

4 Associations Between Drug-Induced Adverse Events in

Animal Models and Humans: Beyond Concordance

4.1 Introduction

The problem of translating toxicities from animals to humans is still an important issue to be

addressed (see Toxicity Translation). Data-driven approaches can lead to new insight by retrospectively

analysing adverse event data in animals and humans to look for statistical associations.

Previously, most of these studies have been conducted as concordance analyses where the

degree of association between the same or similar toxicities (defined as toxicities in the same

system organ class (SOC) groups) in animals and humans has been determined by in silico

approaches (Prediction of Clinical Adverse Events from Preclinical Adverse Events).232 We

previously highlighted in this section that there is a broader need to understand not only the

concordance between the same or related AEs between species, but also the relationships

between AEs of a different nature in animals and humans, due to differences in anatomy,

physiology and biology across species.

To this end, we implemented a data-driven approach to find novel relationships between animal

and human toxicities. We extended the previous concordance approaches to find associations

between AEs encoded by different MedDRA terms. We did this to find mechanistic links

between AEs which might not be found by using prior information, such as the SOC groupings

or known relationships between AEs, as in previous studies. We implemented, in contrast with

previous studies, Mutual Information (MI) as a measure of non-linear dependency between AEs,

and quantitatively assessed the associated risk increase using Likelihood Ratios. Furthermore,

we aimed to understand the nature of the interrelationships between associations. This chapter

provides an unbiased approach designed to reveal new understanding on the relevance of

animal studies. New associations discovered by this analysis can be used to aid the future risk

assessment of clinical toxicities based on preclinical information.


4.2.1 Dataset

Preclinical and clinical adverse event (AE) data encoded by the Medical Dictionary for

Regulatory Activities (MedDRA)155 preferred term (PT) were manually extracted from

PharmaPendium (2017-04)390 for all drugs in the database. These drugs were filtered to only


retain those drugs which had at least one reported AE in a preclinical and a clinical study,

resulting in 2,259 drugs. Duplicate entries of the same drug and AE combination were removed,

retaining only one instance of each pair. The dataset was converted into a binary matrix of AEs

against drugs where presence of an AE for a drug was encoded by 1 and absence encoded by 0.

In total, 4,585 preclinical AE variables and 7,675 clinical AE variables were extracted.
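The dataset preparation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in this work; the record layout `(drug, study_type, ae)`, the function name `build_binary_matrices` and the example drugs are assumptions made for the example.

```python
def build_binary_matrices(records):
    """Build preclinical and clinical binary AE-by-drug matrices.

    `records` is an iterable of (drug, study_type, ae) tuples, where
    study_type is "preclinical" or "clinical" (hypothetical layout).
    """
    pairs = set(records)  # removes duplicate drug/AE entries

    # Retain only drugs with at least one preclinical and one clinical AE.
    drugs_by_type = {"preclinical": set(), "clinical": set()}
    for drug, study_type, _ in pairs:
        drugs_by_type[study_type].add(drug)
    kept = sorted(drugs_by_type["preclinical"] & drugs_by_type["clinical"])
    index = {drug: i for i, drug in enumerate(kept)}

    # One row per AE, one column per retained drug: 1 = present, 0 = absent.
    matrices = {"preclinical": {}, "clinical": {}}
    for drug, study_type, ae in pairs:
        if drug in index:
            row = matrices[study_type].setdefault(ae, [0] * len(kept))
            row[index[drug]] = 1
    return kept, matrices


# Hypothetical toy records; drug_c has no preclinical AE and is dropped.
records = [
    ("drug_a", "preclinical", "VOMITING"),
    ("drug_a", "preclinical", "VOMITING"),
    ("drug_a", "clinical", "NAUSEA"),
    ("drug_b", "preclinical", "SEDATION"),
    ("drug_b", "clinical", "SEDATION"),
    ("drug_c", "clinical", "NAUSEA"),
]
kept_drugs, matrices = build_binary_matrices(records)
```

Storing each AE as a row keyed by its MedDRA term keeps the preclinical and clinical variables separate, so that the same term can appear once in each matrix.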

4.2.2 Feature Filtering

Near-zero variance AE features were removed using the VarianceThreshold function from

sklearn.feature_selection in Python391, with the threshold set to the variance of a Bernoulli random

variable, p(1 - p), where p = 0.99. Different thresholds were assessed; however, this value was chosen as any

lower probability led to a large drop off in the number of features retained (Figure 4-1). 751

preclinical AEs and 1,740 clinical AEs remained after filtering; each retained AE was reported for

a minimum of 23 and a maximum of 1,862 drugs.
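The filtering step can be illustrated with a short standard-library sketch that applies the same Bernoulli variance threshold. It is intended to mirror the behaviour of VarianceThreshold (features are kept only if their variance exceeds the threshold) rather than to reproduce the original code; the function name and the toy matrix are assumptions.

```python
def bernoulli_variance_filter(matrix, p=0.99):
    """Keep binary features whose variance exceeds p * (1 - p).

    `matrix` maps an AE name to its list of 0/1 values across drugs.
    """
    threshold = p * (1 - p)
    kept = {}
    for ae, column in matrix.items():
        freq = sum(column) / len(column)   # fraction of drugs with the AE
        if freq * (1 - freq) > threshold:  # variance of a Bernoulli variable
            kept[ae] = column
    return kept


# Hypothetical example with 200 drugs.
example = {
    "RARE": [1] * 1 + [0] * 199,       # variance below threshold, removed
    "UNCOMMON": [1] * 3 + [0] * 197,   # variance above threshold, kept
    "COMMON": [1] * 100 + [0] * 100,   # kept
    "UBIQUITOUS": [1] * 200,           # zero variance, removed
}
filtered = bernoulli_variance_filter(example)
```

Under this threshold and with 2,259 drugs, an AE reported for 23 drugs has a variance just above 0.0099 whereas one reported for 22 drugs falls just below it, which agrees with the minimum frequency of 23 stated above.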

Figure 4-1: Feature selection for preclinical and clinical adverse events. The effect of different probability thresholds on the number of variables retained for the preclinical and clinical adverse events, when used to remove features with low variance.

4.2.3 Mutual Information Associations

Concordance analysis

The normalized mutual information (MI) between each preclinical and clinical AE encoded by

the same MedDRA term was calculated using normalized_mutual_info_score from

sklearn.metrics in Python.391 Here we use the MI as a measure of the dependence between the

preclinical and clinical variables, as opposed to other methods of correlation, because it (1)

measures the general dependence rather than the linear dependence between two variables and

(2) does not depend on the exact values but on the probability distribution of the variables.392
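To make the quantity concrete, the normalized MI of two binary vectors can be computed from their joint and marginal counts. The sketch below is a standard-library illustration and not the scikit-learn implementation; it assumes normalisation by the arithmetic mean of the two entropies, and it returns 0 when either vector is constant, which is a convention chosen for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def normalized_mi(x, y):
    """MI divided by the arithmetic mean of the entropies (assumed convention)."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # a constant vector carries no information (example convention)
    return max(0.0, mutual_information(x, y)) / ((hx + hy) / 2)


identical = normalized_mi([0, 0, 1, 1], [0, 0, 1, 1])    # fully dependent
independent = normalized_mi([0, 0, 1, 1], [0, 1, 0, 1])  # no dependence
```

Because only the counts of each label combination enter the calculation, the score is unchanged if the labels 0 and 1 are swapped in either vector, which reflects point (2) above.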

[Figure 4-1 axes: x-axis, Variance Threshold (Probability Value (p)); y-axis, Number of Variables]

The Fisher’s exact test implemented using scipy.stats.fisher_exact in Python393 was used to

assess the significance of the associations using a cut-off of 0.01 for the Bonferroni corrected p-

value to reduce the type 1 error.
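The significance test can be sketched for a single 2x2 contingency table as below. This is a standard-library illustration of a two-sided Fisher's exact test followed by a Bonferroni correction, not the SciPy implementation; the table layout, the function names and the number of tests used in the example are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_obs * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical table: rows = preclinical AE present/absent,
# columns = clinical AE present/absent.
p_raw = fisher_exact_two_sided(8, 2, 1, 5)
p_corrected = bonferroni(p_raw, n_tests=100)  # n_tests is illustrative only
is_significant = p_corrected < 0.01
```

An association is retained only if the corrected p-value is below the 0.01 cut-off, so the hypothetical table above would not pass once the correction is applied.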

Statistical Associations between all preclinical and clinical AEs

The same methods as above were used to calculate the MI between all

preclinical AEs and all clinical AEs, regardless of the MedDRA term. For each clinical AE, the top

three preclinical MI scores were retained.
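The selection of the three highest-scoring preclinical AEs per clinical AE can be sketched as follows. The function name and the dictionary layout are assumptions; the MI values in the example are those reported later in this chapter for clinical RENAL PAPILLARY NECROSIS.

```python
import heapq


def top_k_preclinical(mi_scores, k=3):
    """For each clinical AE, keep the k preclinical AEs with the highest MI.

    `mi_scores` maps (preclinical_ae, clinical_ae) to an MI value.
    """
    by_clinical = {}
    for (preclinical, clinical), mi in mi_scores.items():
        by_clinical.setdefault(clinical, []).append((mi, preclinical))
    return {
        clinical: [(pre, mi) for mi, pre in heapq.nlargest(k, candidates)]
        for clinical, candidates in by_clinical.items()
    }


mi_scores = {
    ("INTESTINAL ULCER", "RENAL PAPILLARY NECROSIS"): 0.25,
    ("PERITONITIS", "RENAL PAPILLARY NECROSIS"): 0.19,
    ("GASTROINTESTINAL ULCER", "RENAL PAPILLARY NECROSIS"): 0.17,
    ("RENAL PAPILLARY NECROSIS", "RENAL PAPILLARY NECROSIS"): 0.13,
}
top3 = top_k_preclinical(mi_scores)
```

In this example the preclinical term that matches the clinical term ranks fourth and is therefore not retained, which is how a concordant association can drop out of the final set.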

Assessing significance

The Fisher’s exact test implemented using scipy.stats.fisher_exact in Python393 was used to

assess the significance of the associations using a cut-off of 0.01 for the Bonferroni corrected p-

value.

Furthermore, for each individual preclinical and clinical AE the binary labels were randomised

using rv_discrete from SciPy, preserving the computed probability of observing the AE for each

of the vectors and generating randomised vectors of the same length as the real vectors (2,259

drugs). Then the normalized mutual information was calculated as before for the random set of

vectors for all preclinical to clinical AEs, repeated 10 times. The 99th percentile of the

distribution of randomised MI values was 0.011. This was used as a cut-off for significant

associations between the real data, where only associations with MI values greater than 0.011

were retained for subsequent analysis.
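The randomisation procedure can be sketched as below, using only the standard library in place of rv_discrete. Each vector is replaced by a random binary vector with the same probability of observing the AE, the normalized MI is recomputed for all preclinical to clinical pairs, and a percentile of the pooled values is taken as the cut-off. The function names, the nearest-rank percentile and the toy vectors are assumptions, and the normalisation follows the arithmetic-mean convention assumed for this example.

```python
import math
import random
from collections import Counter


def nmi(x, y):
    """Normalized MI (arithmetic-mean normalisation, assumed convention)."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    hx = -sum(c / n * math.log(c / n) for c in px.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    if hx == 0 or hy == 0:
        return 0.0
    mi = sum(c / n * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())
    return max(0.0, mi) / ((hx + hy) / 2)


def randomise(vector, rng):
    """Random binary vector with the same length and AE probability."""
    p = sum(vector) / len(vector)
    return [1 if rng.random() < p else 0 for _ in vector]


def randomised_cutoff(preclinical, clinical, repeats=10, percentile=99, seed=0):
    """Percentile (nearest rank) of NMI values between randomised vectors."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        pre_random = [randomise(v, rng) for v in preclinical]
        clin_random = [randomise(v, rng) for v in clinical]
        scores.extend(nmi(a, b) for a in pre_random for b in clin_random)
    scores.sort()
    rank = math.ceil(percentile / 100 * len(scores))
    return scores[max(0, rank - 1)]


# Hypothetical toy data: 500 drugs, two preclinical and two clinical AEs.
preclinical = [[1] * 50 + [0] * 450, [1] * 100 + [0] * 400]
clinical = [[0, 1] * 250, [1] * 75 + [0] * 425]
cutoff = randomised_cutoff(preclinical, clinical)
```

Real associations would then be retained only if their MI exceeds this cut-off, in addition to passing the Fisher's exact test.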

This method implemented stringent cut-offs in both approaches to reduce the false positive rate

of our newly discovered associations. In total, after applying the Fisher’s exact test and the

randomisation derived cut-off, 2,050 significant associations were identified.

Quantifying risk

To quantify the risk implied by the associations, likelihood ratios were employed.

The positive and negative likelihood ratios394 (LR+, LR-) were calculated to assess the risk of

experiencing a clinical AE given the presence of a preclinical AE and the likelihood of absence

of a clinical AE given the absence of a preclinical AE, respectively. The following formulas were

applied:

Equation 4-1:

$$\mathrm{LR}^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}}$$

Equation 4-2:

$$\mathrm{LR}^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}}$$

The likelihood ratios were used to compare to previous work, including the study by Clark et

al.232 This value is more easily interpretable than MI values in real terms and also uses all values

of the contingency table to provide an informative measure of the predictive value of a

preclinical AE for a clinical AE, which is not affected by prevalence of outcome.231 As mentioned

in the general introduction (Likelihood Ratio), the likelihood ratio has been argued to be a

better indicator of concordance for this type of analysis than the traditional sensitivity metric,

as it takes into account the false negatives and the false positive values,230 as well as providing

extra interpretation. A positive or negative likelihood ratio of 1 means that there is no useful

information in the probability of the preclinical finding to predict the clinical outcome, whereas

a positive likelihood ratio (LR+) of greater than 10 shows a large and conclusive change in

probability of a clinical AE given the presence of the preclinical AE. Values between 5 and 10 are

moderate and values above 2 can be significant. Conversely, for the negative likelihood ratio

(LR-), values closer to 0 are more useful to determine the likelihood of the absence of a

preclinical finding predicting the absence of the clinical AE.231,232 We can use these metrics

additionally to measure the directionality of our associations.
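Equations 4-1 and 4-2 can be illustrated with a short worked example. The sketch below treats the preclinical AE as the test and the clinical AE as the outcome, which is an assumption about how the contingency table is oriented; the function names and the counts are hypothetical, and the interpretation bands follow the thresholds given above.

```python
def likelihood_ratios(preclinical, clinical):
    """Return (LR+, LR-) for two binary vectors over the same drugs."""
    pairs = list(zip(preclinical, clinical))
    tp = sum(1 for pre, clin in pairs if pre == 1 and clin == 1)
    fp = sum(1 for pre, clin in pairs if pre == 1 and clin == 0)
    fn = sum(1 for pre, clin in pairs if pre == 0 and clin == 1)
    tn = sum(1 for pre, clin in pairs if pre == 0 and clin == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("clinical AE must be both present and absent")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return lr_pos, lr_neg


def describe_lr_pos(lr_pos):
    """Interpretation bands for LR+ as described in the text."""
    if lr_pos > 10:
        return "large"
    if lr_pos >= 5:
        return "moderate"
    if lr_pos > 2:
        return "small but potentially significant"
    return "minimal"


# Hypothetical counts: 8 true positives, 10 false positives,
# 2 false negatives and 80 true negatives across 100 drugs.
pre = [1] * 8 + [1] * 10 + [0] * 2 + [0] * 80
clin = [1] * 8 + [0] * 10 + [1] * 2 + [0] * 80
lr_pos, lr_neg = likelihood_ratios(pre, clin)
band = describe_lr_pos(lr_pos)
```

With these counts the sensitivity is 0.8 and the specificity is 80/90, giving an LR+ of 7.2 and an LR- of 0.225.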

4.2.4 Network Analysis of Significant Associations

All 2,050 statistically significant associations were analysed as a network graph,352 where nodes

are AEs (preclinical or clinical) and edges connect two AEs which are statistically associated.

The network is directed since edges are formed by associations from one preclinical AE to one

clinical AE. A single node can have multiple connections, due to the interdependency of AEs

with one another. The network was constructed using the force-directed layout based on a

modified version of the Fruchterman-Reingold algorithm395 in TIBCO Spotfire.352 The network

was clustered by using the concept of a connected component,396,397 which is a maximal

connected subgraph of the entire graph, where each node belongs to one connected component

only. The degree of each node, defined as the number of edges connecting the node to other

nodes in the network, was calculated to identify central AEs in the network. Due to the directed

nature of the network, all degrees for preclinical AEs should be considered as the outdegree and

all degrees for clinical AEs should be considered the indegree.
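The clustering and degree calculations can be sketched without a graph library. The example below is an illustration rather than the Spotfire workflow used here: it assumes that a preclinical and a clinical AE with the same MedDRA term are separate nodes, and that components are found by ignoring edge direction. The edges are a subset of the associations reported later in this chapter.

```python
from collections import Counter, defaultdict, deque


def connected_components(edges):
    """Connected components of the network, ignoring edge direction.

    `edges` is a list of (preclinical_ae, clinical_ae) pairs. Nodes are
    tagged with their domain so identical terms stay distinct (assumption).
    """
    adjacency = defaultdict(set)
    for preclinical, clinical in edges:
        u, v = ("preclinical", preclinical), ("clinical", clinical)
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def degrees(edges):
    """Outdegree of each preclinical AE and indegree of each clinical AE."""
    out_degree, in_degree = Counter(), Counter()
    for preclinical, clinical in set(edges):
        out_degree[preclinical] += 1
        in_degree[clinical] += 1
    return out_degree, in_degree


edges = [
    ("INTESTINAL ULCER", "RENAL PAPILLARY NECROSIS"),
    ("PERITONITIS", "RENAL PAPILLARY NECROSIS"),
    ("ADRENAL CORTEX ATROPHY", "ADRENAL SUPPRESSION"),
    ("DRUG SPECIFIC ANTIBODY PRESENT", "DRUG SPECIFIC ANTIBODY PRESENT"),
    ("DRUG SPECIFIC ANTIBODY PRESENT", "INFUSION RELATED REACTION"),
]
components = connected_components(edges)
out_degree, in_degree = degrees(edges)
```

For this subset the search yields three components, and each node belongs to exactly one of them.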


4.2.5 Limitations

When interpreting the results of this study there are certain limitations that should be noted

which are related to the dataset and also the methods used.

Firstly, this analysis did not consider the dose of drug administered in either the preclinical or

clinical studies. The dosages in preclinical studies were therefore assumed to be relevant to the

human dose in clinical trials and to achieve a similar plasma concentration.

Furthermore, this analysis did not consider the frequency, i.e. the number of animals or humans

which experienced the toxicity, and instead encoded the presence or absence of an AE across

all subjects and studies conducted for each drug. This is because this quantitative information

was not captured within the PharmaPendium database. Also, the degree or severity of the

toxicity was not encoded in the data used and, although some MedDRA terms can be deemed

more severe events, there was no classification of the degree of severity for the AE itself (i.e.

whether the AE was mild or if it caused hospitalisation or death).

Additionally, since only data for drugs which continued to clinical trials was used, it must be

noted that the drug set on which this analysis was performed was limited in that those drugs

causing severe preclinical toxicity observations will have been excluded from our analysis. In

consequence we are likely to have limited data on both the preclinical and clinical side for

serious AEs.


4.3.1 Concordance Analysis

To conduct a concordance analysis similar to previous studies, we computed the Mutual

Information (MI) and Likelihood Ratio (LR) values for those AEs which were recorded by the

same MedDRA terms preclinically and clinically.232 (MedDRA terms are represented throughout

the text in small capitals to provide clarity). Overall, 102 of the possible 473 matching preclinical

and clinical MedDRA terms (22 %) were identified as significantly associated

with one another (Supplementary Data File 3).

The median MI value for the concordant associations was 0.048 covering a wide range of MI

values between 0.02 and 0.33 (Table 4-1). The median Positive Likelihood Ratio (LR+) was 3.75,

with a large range between 1.25 and 118.48. 96 % of those significant associations derived from

the MI values are also significant in terms of the LR+ (greater than 2) and can be interpreted as

showing some degree of concordance i.e. the presence of a preclinical AE increased the


likelihood of the presence of a clinical AE. The LR- median value was 0.82 and ranged from 0.18,

indicative of a moderate shift in the probability of the absence of the clinical AE given the

absence of the preclinical AE, to 0.96, indicative of a very small and not important shift of the

absence of the clinical AE given the absence of the preclinical AE. This finding that the LR+

values were significant, but the LR- values were not, is consistent with previous studies of

concordance.232 The preclinical model was often diagnostic for the clinical AE in only the

positive direction.

Table 4-1: Metrics for the mutual information, positive likelihood ratio and negative likelihood ratio across all concordant associations.

Measure | Minimum Value | Maximum Value | Median | Mean
Mutual Information | 0.02 | 0.33 | 0.05 | 0.05
Positive Likelihood Ratio | 1.25 | 118.48 | 3.75 | 5.99
Negative Likelihood Ratio | 0.18 | 0.96 | 0.82 | 0.80

The 10 highest associations for the same adverse event encoded preclinically and clinically as

measured by the MI are shown in Table 4-2. For 7 out of 10 reported concordant AEs, the LR+

values from this study were comparable to those reported in previous studies.232,322 For BLOOD

PROLACTIN INCREASED, DIARRHOEA and INJECTION SITE ERYTHEMA, the LR+ values from previous

studies were reported from only one species, rather than from all preclinical species as

investigated here. In contrast to previous studies, we also find that RENAL PAPILLARY NECROSIS,

SEDATION and PLATELET COUNT DECREASED were concordant between animals and humans with

LR+ values of 33, 5 and 4 respectively. We recommend that the concordant toxicities found

through this analysis should continue to be assessed through preclinical models, since we

provide evidence of their past success in predicting clinical drug-induced toxicities.

Table 4-2: The 10 most statistically significant concordant adverse events as ranked by the Mutual Information. Concordant here refers to AEs encoded by the same MedDRA term preclinically and clinically, for which statistically significant associations were derived from our method. The MI and the LR+s values were reported, as well as the LR+ values from previous studies, if present (For all other associations see Supplementary Data File 3).

Preclinical AE | Clinical AE | Normalised Mutual Information | Positive Likelihood Ratio (LR+) | Previous Study Positive Likelihood Ratio (reference)
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 118 | 162 (ref. 322)
RENAL PAPILLARY NECROSIS | RENAL PAPILLARY NECROSIS | 0.13 | 33 | N/A
BLOOD PROLACTIN INCREASED | BLOOD PROLACTIN INCREASED | 0.13 | 23 | 33 from rats (ref. 322)
ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | 0.11 | 11 | 11 (ref. 232)
ALANINE AMINOTRANSFERASE INCREASED | ALANINE AMINOTRANSFERASE INCREASED | 0.11 | 3 | 6 (ref. 232)
CARDIOTOXICITY | CARDIOTOXICITY | 0.10 | 25 | N/A
PLATELET COUNT DECREASED | PLATELET COUNT DECREASED | 0.09 | 4 | N/A
DIARRHOEA | DIARRHOEA | 0.09 | 4 | 12 from monkeys (ref. 322)
INJECTION SITE ERYTHEMA | INJECTION SITE ERYTHEMA | 0.08 | 10 | 11 from rabbits (ref. 322)
SEDATION | SEDATION | 0.08 | 5 | N/A

From the study by Clark et al,232 there were five AEs reported at the preferred term level of the

MedDRA hierarchy as being the most concordant; these were ELECTROCARDIOGRAM QT

PROLONGED, INJECTION SITE REACTION, ARRHYTHMIA, ELECTROCARDIOGRAM QT CORRECTED

INTERVAL PROLONGED and DRUG SPECIFIC ANTIBODY PRESENT. Each of these preclinical AEs was

found to be significantly associated with the same event clinically from our analysis also, with

LR+ greater than 5, indicating moderate or high shifts in the probability of the presence of a

clinical AE given the presence of the same preclinical AE. Using our method to find associations

based on the MI between preclinical and clinical AEs we identified the same AEs as previous

studies with LR+ values of the same order of magnitude.

The number of associations with a LR+ above 5, indicating a moderate or large shift in the

probability of the presence of a clinical AE given the presence of the same preclinical AE, was

29 (28 % of significant associations) of which only 7 (6.8 % of significant associations)

(Supplementary Data File 3) exceeded a LR+ of 10, indicating a large and conclusive shift in

probability. This shows that most of our associations had likelihood ratios which were

significant but of low quantitative risk (see methods).

Overall, this section highlights that the concordance results from our analysis are largely

consistent with previous concordance analyses, displaying comparable positive likelihood

ratios. This highlights the suitability of our method to find known associations between

preclinical and clinical AEs. We furthermore highlight that our study also found that RENAL


PAPILLARY NECROSIS, SEDATION and PLATELET COUNT DECREASED were amongst the most

concordant AEs across species, observations which were not explicitly reported in previous

studies, showing that this method was able to discover new examples of concordant toxicities,

for which we can say that existing preclinical models have succeeded in predicting the clinical

toxicity in the past and therefore can be used for toxicity testing in the future. Since only 22 %

of MedDRA terms which were present both preclinically and clinically were statistically

concordant, and only 7 % of these had a likelihood ratio of greater than 10, this poses the

question as to whether other preclinical AEs in the dataset might be more relevant than the

exact MedDRA term match in predicting certain clinical AEs.

4.3.2 Statistical Associations Between All Preclinical and Clinical AEs

Since only 22 % of clinical AEs with a matching term preclinically were significantly predicted

by their preclinical counterpart, we next investigated which drug-induced animal toxicities were

predictive for human adverse events (AEs), without the requirement that the AE be encoded by

the same term in both species. In total, 2,050 statistically significant associations were found

between preclinical and clinical AEs (Supplementary Data File 4), with Mutual Information (MI)

values ranging from 0.02 to 0.33, Positive Likelihood Ratio (LR+) values ranging from 1.5 to

118.5 and Negative Likelihood Ratio (LR-) values ranging from 0.10 to 0.97. Consistent with the

previous study,232 we find that whilst the Positive likelihood ratios show a significant change in

risk of the clinical AE given the preclinical AE (92 % of significant associations had LR+ greater

than 2), the negative likelihood ratios show that the absence of a preclinical AE does not alter

the probability of clinical safety by an important degree (91 % of the LR- values were between

0.5-1). Therefore, it is important to note that these associations are asymmetrical in nature and

absence of the preclinical finding does not (in most cases) indicate absence of the clinical AE.

The relationship between the MI and LR+ is represented in Figure 4-2. Overall, the LR+ tended to

increase with the MI; however, this trend was found to depend on a further variable,

the number of drugs which caused both AEs in the association.

As can be seen in Figure 4-2, for each quartile of the number of drugs which displayed both AEs,

there is a roughly linear relationship between the log MI and the log LR+ values, suggesting that

each LR+ value is a function of both variables. We also note that the higher the number of drugs

which have both AEs, the lower the LR+ value will be for the same value of the MI. Overall, this

shows that LR+ is a function of both MI and the number of intersecting drugs.


Figure 4-2: The relationship between the log positive likelihood ratio (LR+) and the mutual information (MI) coloured by the quartile of the number of compounds overlapping between the two AEs. As can be seen, there is a roughly linear relationship between the log values of the MI and log LR+; however, the number of compounds which overlap between the preclinical and clinical AE also influences the LR+ value, but not the MI value. The higher the number of compounds which display both AEs, the lower the likelihood ratio for the same value of the MI.

We chose to restrict our associations to the three preclinical AEs which gave the highest MI

values for each clinical AE for interpretability reasons. In doing this, we reduce our previous

102 concordant AEs to only 43 AEs where the same AE preclinically is one of the highest three

MI values which can be derived for the clinical observation (Supplementary Data File 4). The

10 concordant associations with the highest MI are now shown in Table 4-3. As can be seen, two

of the previous associations in Table 4-2, RENAL PAPILLARY NECROSIS and ALANINE

AMINOTRANSFERASE INCREASED are now no longer in our results, since other preclinical AEs in

the dataset are more predictive of the clinical AE. For the AE of clinical RENAL PAPILLARY

NECROSIS, preclinical INTESTINAL ULCER, PERITONITIS and GASTROINTESTINAL ULCER had higher MI

values (0.25, 0.19 and 0.17) than the association between preclinical RENAL PAPILLARY NECROSIS

and the same event clinically (MI 0.13). For clinical ALANINE AMINOTRANSFERASE INCREASED,

preclinical RED BLOOD CELL COUNT DECREASED, VOMITING and DECREASED ACTIVITY had higher MI

values (0.12, 0.12 and 0.12 respectively) in comparison to preclinical ALANINE AMINOTRANSFERASE

INCREASED which had an MI value of 0.11 (Supplementary Data File 4). These two examples show

that whilst some preclinical AEs were predictive of the same clinical AE, there were other

preclinical AE predictors in the dataset which can also be considered significantly associated

with the clinical endpoint.


Table 4-3: Statistically significant concordant adverse events appearing within the top 3 preclinical AEs associated with each clinical AE. The MedDRA terms for which statistically significant associations between the same term preclinically and clinically were found according to the thresholds used in this study. The MI and the LR+s values were reported.

Preclinical AE | Clinical AE | Normalised Mutual Information (MI) | Positive Likelihood Ratio (LR+)
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 118
BLOOD PROLACTIN INCREASED | BLOOD PROLACTIN INCREASED | 0.13 | 23
ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | ELECTROCARDIOGRAM QT CORRECTED INTERVAL PROLONGED | 0.11 | 11
CARDIOTOXICITY | CARDIOTOXICITY | 0.10 | 25
PLATELET COUNT DECREASED | PLATELET COUNT DECREASED | 0.09 | 4
DIARRHOEA | DIARRHOEA | 0.09 | 4
INJECTION SITE ERYTHEMA | INJECTION SITE ERYTHEMA | 0.08 | 10
SEDATION | SEDATION | 0.08 | 5
ELECTROCARDIOGRAM QT PROLONGED | ELECTROCARDIOGRAM QT PROLONGED | 0.08 | 7
OVARIAN DISORDER | OVARIAN DISORDER | 0.08 | 5

So far, we have analysed the associations which are most predictive of the presence of a clinical

toxicity, however, understanding where associations can predict safety for a clinical endpoint is

also important. This is quantified by the LR- values where low LR- values can be considered

important for understanding where an absence of a preclinical AE was indicative of clinical

safety. The three associations with a low value (< 0.2) for the LR-, indicative of a moderate shift

in probability given the absence of the preclinical AE that there will be no clinical AE, are shown

in Table 4-4. These were the findings that the absence of WEIGHT DECREASED in animals was

predictive of a high probability of no AORTIC ANEURYSM and no clinical ULCER in humans (LR-

0.19 and 0.10 respectively) and that, whilst preclinical OVARIAN DISORDER had a moderate LR+

value for the same clinical finding, we also observe a low value for the LR- of 0.18, indicative of the

absence of OVARIAN DISORDER in animals leading to a shift in the probability of the drug being

safe with respect to OVARIAN DISORDERS in humans.

Table 4-4: AEs contained in associations which were significant as determined by our analysis, which had low Negative Likelihood Ratios (<0.2), indicating a moderate shift in the probability that the clinical AE will not be present, given that the preclinical AE was not present.

Preclinical AE | Clinical AE | Normalised Mutual Information | Negative Likelihood Ratio | Positive Likelihood Ratio
WEIGHT DECREASED | AORTIC ANEURYSM | 0.03 | 0.19 | 1.8
WEIGHT DECREASED | ULCER | 0.04 | 0.10 | 2.0
OVARIAN DISORDER | OVARIAN DISORDER | 0.08 | 0.18 | 5.5


The 20 associations with the highest MI values from all significant associations are presented

in Table 4-5, along with their Likelihood Ratios to quantify the conditional risk. The association

with the highest MI score of 0.33 was the preclinical finding of DRUG SPECIFIC ANTIBODY PRESENT

associated with the same finding clinically, a concordant observation highlighted in the

previous section. The same AE preclinically was also associated with clinical INFUSION RELATED

REACTION, most likely because monoclonal antibodies are administered by infusion.324 The other

members of the 20 highest MI associations have a few common preclinical findings, including

WAXY FLEXIBILITY, BLOOD PROLACTIN INCREASED, INTESTINAL ULCER, ADRENAL CORTEX ATROPHY

and PERITONITIS, which associated with different clinical observations. Preclinical WAXY

FLEXIBILITY associates with the most clinical toxicities for the top 20 associations and the

preclinical finding with the second highest number of clinical associations was BLOOD

PROLACTIN INCREASED. Both preclinical findings were predictive of a range of central nervous

system (CNS) side effects, which have been recognised as a toxicity area which needs to be better

predicted in preclinical models.398,399 From our analysis we find that a measurement of BLOOD

PROLACTIN INCREASED combined with behavioural symptoms of catatonia in animal species (e.g.

WAXY FLEXIBILITY) may serve as good indicators of a variety of disorders, including

SCHIZOPHRENIA, COGWHEEL RIGIDITY, dyskinesias, spasms and male fertility problems. A wide

variety of behavioural observations have previously been identified in animals as indicative of

potential CNS side effects in humans,400 however, from our study we highlighted that the

behavioural observation of WAXY FLEXIBILITY when present in preclinical studies elevated the

risk of certain neurological side effects in the clinic by up to 44 times. Similarly, a wide range of

liquid biomarkers have been previously identified as being important for neurological toxicity

prediction,399 however, here we identified that the biomarker of increased blood prolactin levels

in animals strongly associates with clinical neurological toxicities, with the presence of

preclinical BLOOD PROLACTIN INCREASED elevating the risk of some neurological toxicities by over

50 times. Together, or separately, these preclinical observations can influence the strategy used

to flag an increased likelihood of clinical CNS toxicities. We discuss these findings further in the

section Frequently Predictive Preclinical AEs.

Preclinical INTESTINAL ULCER and preclinical PERITONITIS were both associated with RENAL

PAPILLARY NECROSIS within the highest 20 associations as ranked by Mutual Information, with

preclinical INTESTINAL ULCER found to be more predictive for the presence of clinical RENAL

PAPILLARY NECROSIS (LR+ = 60), than the preclinical observation of RENAL PAPILLARY NECROSIS

(LR+ = 33) itself. INTESTINAL ULCER, PERITONITIS and RENAL PAPILLARY NECROSIS are side effects

observed in nonsteroidal anti-inflammatory drugs, such as those targeting Cyclooxygenase-1


(Cox-1), and gastrointestinal ulceration is one of the early symptoms of RENAL PAPILLARY

NECROSIS. Cox-1 inhibition is known to lead to decreased prostaglandin production which is

necessary for both mucosal integrity in the GI tract and maintenance of blood flow to the renal

papillae.401 These associations can therefore be linked to the primary pharmacology of anti-

inflammatory drugs. This finding is particularly useful as there is difficulty in producing robust

reproducible models of RENAL PAPILLARY NECROSIS.402 Here we highlight that we can find

literature evidence to connect these AEs by mechanisms of drug action, despite their different

System Organ Class (SOC) groupings, showing that formal organ class categorizations of

biological effects imply differences where underlying mechanisms may well be conserved.

Preclinical ADRENAL CORTEX ATROPHY was associated with ADRENAL SUPPRESSION clinically (LR+

= 25). ADRENAL SUPPRESSION is one of the main side effects associated with long term

corticosteroid usage.403 Since the adrenal glands are suppressed from corticosteroid usage, less

cortisol is released over time. In animals this effect is often measured by histopathological

examination of the adrenal glands.404 This functional link suggests that ADRENAL CORTEX

ATROPHY in histopathology studies can be used to predict ADRENAL SUPPRESSION.

These highest-ranking associations can be linked by strong literature evidence. All other

significant associations can be analysed in the same way to indicate where preclinical models

might be useful for anticipating the risk of clinical AEs of interest. Notably, for all clinical AEs

contained within the 20 highest-ranking associations except DRUG SPECIFIC ANTIBODY PRESENT, the MI

was higher for a preclinical AE encoded by a different MedDRA term than for the preclinical AE

encoded by the same term. Furthermore, in cases where the clinical AE

cannot be replicated in animals due to species differences or difficulty of measurement, we

show here that alternative preclinical toxicities may be employed for risk profiling. Finally, by

extending beyond simple term matching of adverse events, this analysis demonstrates that it is

possible to find relationships between adverse events in preclinical and clinical space also

between different SOC groups, which indicates that underlying biological mechanisms can be

conserved despite their different formal ontological annotations. We show that our analysis was

able to discover 2,007 statistically significant relationships between preclinical and clinical AEs

that were beyond the straightforward term matching, as well as 43 concordant associations, to

make up 2,050 associations overall. These results can provide preclinical scientists with

information to determine which preclinical models give valuable information about which

clinical toxicities, as well as help to identify those preclinical models which may be less informative

than previously believed.


Table 4-5: The most significant statistical associations between preclinical and clinical adverse events ranked by the highest MI values. Reported are the normalised mutual information, Bonferroni corrected p-value from Fisher’s exact test, positive and negative likelihood ratios and the number of intersecting drugs between the preclinical and clinical AE terms.

Preclinical AE | Clinical AE | Normalised Mutual Information (MI) | Bonferroni corrected Fisher’s exact test p-value | Positive Likelihood Ratio (LR+) | Negative Likelihood Ratio (LR-) | Number of intersecting drugs between preclinical and clinical AE
DRUG SPECIFIC ANTIBODY PRESENT | DRUG SPECIFIC ANTIBODY PRESENT | 0.33 | 1.37 x 10^-39 | 119 | 0.67 | 35
DRUG SPECIFIC ANTIBODY PRESENT | INFUSION RELATED REACTION | 0.24 | 4.40 x 10^-24 | 44 | 0.66 | 24
WAXY FLEXIBILITY | DYSKINESIA OESOPHAGEAL | 0.20 | 3.98 x 10^-11 | 43 | 0.60 | 11
WAXY FLEXIBILITY | FACIAL SPASM | 0.20 | 3.98 x 10^-11 | 43 | 0.60 | 11
WAXY FLEXIBILITY | MEIGE'S SYNDROME | 0.20 | 7.98 x 10^-8 | 44 | 0.57 | 10
WAXY FLEXIBILITY | INTESTINAL DILATATION | 0.19 | 6.20 x 10^-10 | 42 | 0.59 | 10
WAXY FLEXIBILITY | SPERM COUNT INCREASED | 0.19 | 6.20 x 10^-10 | 42 | 0.59 | 10
WAXY FLEXIBILITY | ASPHYXIA | 0.19 | 1.92 x 10^-12 | 39 | 0.67 | 13
WAXY FLEXIBILITY | RETROGRADE EJACULATION | 0.18 | 1.63 x 10^-10 | 39 | 0.64 | 11
WAXY FLEXIBILITY | AUTONOMIC NERVOUS SYSTEM IMBALANCE | 0.18 | 9.39 x 10^-14 | 37 | 0.72 | 15
WAXY FLEXIBILITY | ASPERMIA | 0.18 | 1.65 x 10^-9 | 39 | 0.62 | 10
WAXY FLEXIBILITY | BLOOD ANTIDIURETIC HORMONE INCREASED | 0.18 | 1.65 x 10^-9 | 39 | 0.62 | 10
BLOOD PROLACTIN INCREASED | COGWHEEL RIGIDITY | 0.26 | 4.35 x 10^-15 | 53 | 0.46 | 13
BLOOD PROLACTIN INCREASED | DROOLING | 0.20 | 4.71 x 10^-12 | 33 | 0.58 | 12
BLOOD PROLACTIN INCREASED | FACIAL SPASM | 0.18 | 3.98 x 10^-11 | 36 | 0.60 | 11
BLOOD PROLACTIN INCREASED | SCHIZOPHRENIA | 0.18 | 3.36 x 10^-11 | 35 | 0.63 | 12
BLOOD PROLACTIN INCREASED | TARDIVE DYSKINESIA | 0.18 | 1.58 x 10^-16 | 33 | 0.75 | 19
INTESTINAL ULCER | RENAL PAPILLARY NECROSIS | 0.25 | 2.02 x 10^-15 | 60 | 0.65 | 17
ADRENAL CORTEX ATROPHY | ADRENAL SUPPRESSION | 0.20 | 7.97 x 10^-20 | 25 | 0.54 | 22
PERITONITIS | RENAL PAPILLARY NECROSIS | 0.19 | 4.08 x 10^-18 | 27 | 0.59 | 20

4.3.3 Network Analysis of Significant AEs

Next, we aimed to uncover the AEs which are involved in three different types of relationship:

the relationship of multiple preclinical AEs with one clinical AE, the relationship of one

preclinical AE to multiple clinical AEs, and a one-to-one relationship between one preclinical

and one clinical AE. This can help to guide how preclinical toxicity endpoints could be used in

practice. To determine these relationships, we constructed a network of the significant

associations, where nodes corresponded to preclinical and clinical AEs and edges connected AEs

which are statistically associated by our MI analysis (see Mutual Information Associations),

and analysed its structure using the concept of connected components. A connected

component is a maximal subgraph in which every node can be reached from every other node

and which shares no edges with the rest of the network; each component can therefore be

considered a distinct cluster of related nodes.
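
The idea of connected components can be illustrated with a minimal Python sketch. This is not the implementation used in this work; it is a plain breadth-first search over a handful of associations taken from Table 4-6. The function name connected_components and the ("pre"/"clin", term) node labelling are assumptions made for illustration, the latter so that a preclinical and a clinical AE with the same name remain separate nodes.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the connected components of an undirected graph given as edge pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components

# Nodes are tagged by origin so that preclinical and clinical GOITRE stay distinct.
edges = [
    (("pre", "CARDIOTOXICITY"), ("clin", "BLOOD ALKALINE PHOSPHATASE")),
    (("pre", "CARDIOTOXICITY"), ("clin", "CARDIOTOXICITY")),
    (("pre", "CARDIOTOXICITY"), ("clin", "SKIN HYPERPIGMENTATION")),
    (("pre", "GOITRE"), ("clin", "GOITRE")),
]
components = connected_components(edges)
sizes = sorted(len(component) for component in components)
print(sizes)
```

With these four edges the sketch yields two components, one with four nodes and one with two, mirroring components 4 and 13 of Table 4-6.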

The overall network structure is displayed in Figure 4-3, where we can identify 27 different

connected components. These 27 components can be considered biological clusters of

associations. Table 4-6 shows the number and identity of the nodes involved in the network

components. The numbers of nodes in each component varied from 2 to 1,129, with component

1 containing 95 % of the AEs. The composition of component 1 will be discussed further in the

next section. The other components contained fewer AEs and therefore represent AEs

with more specific relationships, examples of which will be analysed further in the context of

the relationships stated above.

Figure 4-3: Network analysis for the statistically significant associations. Shows visually the 27 connected components of the network formed from significant associations between preclinical and clinical AEs. Greyed out is the first connected component which forms the largest hub of the network, highlighted are the nodes in the network which form separate connected components, detailed in Table 4-6

Table 4-6: Adverse events contained in the connected components from the network analysis. 27 connected components of the network of 2,050 significant associations, listing the preclinical and clinical AE terms, as well as the number of nodes for each connected component of the network

Connected Component ID | Preclinical Nodes | Clinical Nodes | Number of nodes
1 | Large component containing the bulk of the network (see Supplementary Data File 5) | Large component containing the bulk of the network (see Supplementary Data File 5) | 1,129
2 | MOBILITY DECREASED | BILE DUCT STONE | 2
3 | HAEMATOCRIT DECREASED | BLADDER CANCER | 2
4 | CARDIOTOXICITY | BLOOD ALKALINE PHOSPHATASE; CARDIOTOXICITY; SKIN HYPERPIGMENTATION | 4
5 | URINE SODIUM INCREASED | BLOOD CHOLESTEROL ABNORMAL | 2
6 | BLOOD CALCIUM DECREASED | BLOOD PARATHYROID HORMONE INCREASED; HAEMOGLOBIN INCREASED | 3
7 | UTERINE DISORDER; PROSTATIC DISORDER; INCREASED APPETITE | BREAST DISORDER; LOW DENSITY LIPOPROTEIN INCREASED | 5
8 | TESTIS CANCER | BREAST TENDERNESS | 2
9 | POOR WEIGHT GAIN NEONATAL | CARDIAC DEATH | 2
10 | GASTRITIS | DUODENAL ULCER HAEMORRHAGE | 2
11 | BIOPSY VAGINA ABNORMAL; BIOPSY UTERUS ABNORMAL | ENDOMETRIAL HYPERPLASIA; UTERINE HAEMORRHAGE | 4
12 | ABORTION | EYE PRURITIS; GROWTH OF EYELASHES | 3
13 | GOITRE | GOITRE | 2
14 | SPINAL CORD DISORDER | HYPOMAGNESAEMIA | 2
15 | WEIGHT INCREASED | JOINT SPRAIN | 3
16 | GROWTH RETARDATION | MOOD SWINGS | 2
17 | COMPLICATION OF PREGNANCY | MYELODYSPLASTIC SYNDROME | 2
18 | LYMPHOMA; HEPATIC NEOPLASM MALIGNANT | MYOSITIS | 3
19 | ARTHROPATHY | TENDON DISORDER; NEUROTOXICITY | 3
20 | BONE DEVELOPMENT ABNORMAL | SCIATICA | 2
21 | HEPATIC NECROSIS | SENSORY DISTURBANCE | 2
22 | DECREASED APPETITE | SINUS DISORDER | 2
23 | CONGENITAL ANOMALY | SMALL INTESTINAL OBSTRUCTION | 2
24 | SPERMATOGENESIS ABNORMAL | SPERM COUNT DECREASED | 2
25 | ELECTROCARDIOGRAM QT SHORTENED | SUDDEN CARDIAC DEATH | 2
26 | NASAL MUCOSAL DISORDER | THROAT IRRITATION | 2
27 | GASTROINTESTINAL INFLAMMATION | UPPER GASTROINTESTINAL HAEMORRHAGE | 2

Specific Relationships Between AEs

Next, we discuss the components of the network which formed a variety of specific

relationships, including (1) multiple preclinical AEs being predictive of one clinical AE, (2) one

preclinical AE being predictive of multiple clinical AEs, (3) multiple

preclinical AEs being predictive of multiple clinical AEs, and (4) a one-to-one relationship where one

preclinical AE is predictive of one clinical AE. Overall, for the 26 connected components other than component 1, there

were 19 one-to-one relationships, 4 one-to-many relationships, 1 many-to-one relationship and

3 many-to-many relationships (Supplementary Data File 5).
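
As a hedged illustration of how a component can be assigned to one of these relationship types, the sketch below counts the preclinical and clinical nodes in a component. The function name relationship_type and the ("pre"/"clin", term) node labelling are assumptions introduced here, not part of the original analysis; the example components are taken from Table 4-6.

```python
def relationship_type(component):
    """Label a component by its number of preclinical and clinical nodes."""
    n_pre = sum(1 for origin, _term in component if origin == "pre")
    n_clin = sum(1 for origin, _term in component if origin == "clin")
    left = "one" if n_pre == 1 else "many"
    right = "one" if n_clin == 1 else "many"
    return f"{left}-to-{right}"

component_25 = {
    ("pre", "ELECTROCARDIOGRAM QT SHORTENED"),
    ("clin", "SUDDEN CARDIAC DEATH"),
}
component_4 = {
    ("pre", "CARDIOTOXICITY"),
    ("clin", "BLOOD ALKALINE PHOSPHATASE"),
    ("clin", "CARDIOTOXICITY"),
    ("clin", "SKIN HYPERPIGMENTATION"),
}
component_11 = {
    ("pre", "BIOPSY VAGINA ABNORMAL"),
    ("pre", "BIOPSY UTERUS ABNORMAL"),
    ("clin", "ENDOMETRIAL HYPERPLASIA"),
    ("clin", "UTERINE HAEMORRHAGE"),
}

labels = [relationship_type(c) for c in (component_25, component_4, component_11)]
print(labels)
```

A many-to-one component would be labelled in the same way, since only the two node counts are used.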

One-to-one Relationship

Here we discuss examples of one-to-one relationships along with their MI and LRs, as well as

literature evidence for the link.

The one-to-one relationship of preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical

SUDDEN CARDIAC DEATH was found from our network analysis to be the association with the

highest value for the MI (0.11) out of all one-to-one relationships (Component 25). The LR+

indicated that it was 23 times more likely to observe clinical SUDDEN CARDIAC DEATH given

preclinical ELECTROCARDIOGRAM QT SHORTENED than in the absence of this observation. This is

a known association with the well-understood molecular drivers of potassium and calcium

channels in the heart.405,406 Despite the knowledge that shortening of the QT interval is

associated with severe ventricular fibrillation, the use of QT shortening in animals as a

biomarker for clinical arrhythmias and death was identified as an area which needed to be

addressed, since pharmaceutical companies have noticed an increase in QT shortening being

reported in preclinical studies.407 This report noted that further research should be conducted

on the mechanisms behind QT shortening and its translation between preclinical and clinical

settings. Here, we highlighted the strong predictive value of preclinical ELECTROCARDIOGRAM

QT SHORTENED with clinical SUDDEN CARDIAC DEATH quantified by a likelihood ratio of 23. We

discuss potential mechanisms for this association in the next chapter. This information can

support the case for regulatory agencies to consider this risk factor in the future.

Another one-to-one relationship with a high MI value (0.10) was the association

of preclinical MOBILITY DECREASED with clinical BILE DUCT STONE (Component 2). The LR+ value

for this association was 21, showing a high risk given the preclinical observation (Supplementary

Data File 4). Drug-induced bile duct stones can be potentially fatal and experimental models

for this endpoint are not well known.408 The observation that preclinical MOBILITY DECREASED is

currently the best predictor and only significant predictor for determining the risk of a BILE DUCT

STONE clinically could be a starting point for assessing the risk of drug-induced bile duct stone

formation in humans in the future.

A third one-to-one relationship with a high MI value (0.10) was the association of

preclinical TESTIS CANCER with clinical BREAST TENDERNESS (Component 8). The link between

testicular cancer and breast tenderness is known, as breast tenderness is a symptom of testicular

cancer in 10 % of men, thought to be related to the release of beta human chorionic gonadotropin

which is secreted by testicular tumours.409

Overall, we find strong explainable links for these examples of one-to-one associations with the

highest MI values from our analysis and we suggest that these one-to-one links can be used for

future risk assessment of clinical AEs based on the preclinical AE.
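
The likelihood ratios quoted above can be illustrated with a small sketch, assuming the standard diagnostic-test definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity, with the preclinical AE treated as the "test" and the clinical AE as the outcome. The function name and the drug counts below are hypothetical and are not taken from the PharmaPendium data.

```python
def likelihood_ratios(both, pre_only, clin_only, neither):
    """Return (LR+, LR-) for a preclinical AE used as a 'test' for a clinical AE.

    both      -- drugs with the preclinical AE and the clinical AE
    pre_only  -- drugs with the preclinical AE but not the clinical AE
    clin_only -- drugs with the clinical AE but not the preclinical AE
    neither   -- drugs with neither AE
    No guard against empty cells is included in this sketch.
    """
    sensitivity = both / (both + clin_only)
    false_positive_rate = pre_only / (pre_only + neither)  # equals 1 - specificity
    specificity = neither / (pre_only + neither)
    lr_positive = sensitivity / false_positive_rate
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

# Hypothetical counts for illustration only.
lr_pos, lr_neg = likelihood_ratios(both=10, pre_only=20, clin_only=15, neither=955)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

An LR+ well above 1 together with an LR- below 1, as in this toy example, corresponds to the pattern seen for the associations in Table 4-5.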

One-to-many Relationships

The component demonstrating a one-to-many relationship containing the associations with the

highest MI values was component 4, which contained three clinical AEs that were associated

with preclinical CARDIOTOXICITY, which were BLOOD ALKALINE PHOSPHATASE, CARDIOTOXICITY

and SKIN HYPERPIGMENTATION, occurring with MI values of 0.13, 0.10 and 0.08 and LR+ values of

33, 24 and 17 respectively. This shows concordance between the general term CARDIOTOXICITY

from animal models to humans and quantifies the risk of experiencing CARDIOTOXICITY

clinically, given the observation preclinically to be 25 times as high as if CARDIOTOXICITY was

not experienced preclinically. BLOOD ALKALINE PHOSPHATASE usually indicates a problem with

the liver, gallbladder or bones; however, it has been connected to myocardial infarction after a

drug-eluting stent has been administered to coronary artery disease patients.410 This is an

example of a non-causal association. Tissue nonspecific alkaline phosphatase also has a role in

vasculature stiffening which can lead to heart failure.411 Considering the high LR+ (33) of

experiencing BLOOD ALKALINE PHOSPHATASE given CARDIOTOXICITY preclinically, it is surprising

that a stronger evidence-based link has not been previously reported. We did not find a causal

link from CARDIOTOXICITY to SKIN HYPERPIGMENTATION, which is usually caused by an

accumulation of melanin.412

Many-to-many Relationship

The component displaying a many-to-many relationship containing the associations with the

highest MI was component 11. This component contained the associations of preclinical BIOPSY

UTERUS ABNORMAL with clinical ENDOMETRIAL HYPERPLASIA and clinical UTERINE HAEMORRHAGE

(MI = 0.10 and 0.05, LR+ = 15 and 7). It also contained the association of preclinical BIOPSY

VAGINA ABNORMAL with clinical ENDOMETRIAL HYPERPLASIA (MI = 0.09 and LR+ = 16).

ENDOMETRIAL HYPERPLASIA is a thickening of the endometrium which can lead to endometrial

cancers.413 This is often caused by an excess of oestrogen, with a lack of progesterone. Animal

model biopsies can be used to predict these clinical AEs.

Overall, this section shows that we have found specific associations for some of our adverse

events which are not highly co-dependent on one another. These smaller clusters can be

analysed in more detail for use in guiding the utility of preclinical models. The determination

of the type of relationship in different cases will be important to decide which preclinical models

are important for assessing multiple clinical toxicities, which combinations of preclinical

models are important to assess one clinical toxicity, and which preclinical models are more

specific to one clinical toxicity. This can help to guide how these models could be used in

practice.

Frequently Predictive Preclinical AEs

Next, we analyse the largest component in the network (Component 1, Table 4-6). Figure 4-4 is

a visual representation of the first connected component in the network. The larger, pink

nodes are those with a higher degree (see Network Analysis of

Significant Associations) in the network. The outdegree of the preclinical AE nodes varies

greatly, showing that some preclinical AEs feature among the three highest-ranked

predictors for multiple clinical AEs; the preclinical AEs with the highest outdegree can be

considered as the hubs of the network. These AEs (and their corresponding outdegree)

included DECREASED ACTIVITY (97), WAXY FLEXIBILITY (88), BLOOD PROLACTIN INCREASED (81),

INTRA-UTERINE DEATH (79), FOETAL GROWTH RETARDATION (56), VOMITING (48), BIOPSY LYMPH

GLAND ABNORMAL (43), WEIGHT GAIN POOR (41), RED BLOOD CELL COUNT DECREASED (40) and

others, a full list of which can be found in Supplementary Data File 5.
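
Ranking preclinical AEs by outdegree amounts to counting, for each preclinical AE, the distinct clinical AEs it is linked to. The sketch below shows this on a small subset of the associations in Table 4-5; the function name preclinical_outdegree is an assumption for illustration, and the counts it produces refer only to this subset, not to the full network.

```python
from collections import Counter

def preclinical_outdegree(associations):
    """Count how many distinct clinical AEs each preclinical AE is linked to."""
    return Counter(pre for pre, _clin in set(associations))

# Subset of (preclinical AE, clinical AE) pairs from Table 4-5.
associations = [
    ("WAXY FLEXIBILITY", "DYSKINESIA OESOPHAGEAL"),
    ("WAXY FLEXIBILITY", "FACIAL SPASM"),
    ("WAXY FLEXIBILITY", "MEIGE'S SYNDROME"),
    ("BLOOD PROLACTIN INCREASED", "COGWHEEL RIGIDITY"),
    ("BLOOD PROLACTIN INCREASED", "FACIAL SPASM"),
    ("INTESTINAL ULCER", "RENAL PAPILLARY NECROSIS"),
]
hubs = preclinical_outdegree(associations).most_common(2)
print(hubs)
```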

Figure 4-4: Network plot for the first connected component from the network analysis. On the left-hand side are nodes for all preclinical AEs and on the right-hand side are all clinical AEs. The size and colour of the nodes are proportional to the degree of each node. The larger and more pink nodes are the AEs with the highest number of connections in the network and the smaller blue nodes are AEs with the lowest number of significant connections.

Some of the AEs with the highest outdegree are more prevalent in the dataset (Figure 4-5), such

as WEIGHT DECREASED (1,115 drugs), INTRA-UTERINE DEATH (944 drugs), WEIGHT GAIN POOR (935

drugs) and DECREASED ACTIVITY (822 drugs). Since these AEs will be commonly assessed

preclinical toxicities across animal studies for all drugs, it is unsurprising that their reporting

prevalence is high across drugs in our data. Reproductive toxicity is assessed in multiple animal

models including rodent and non-rodent species and is responsible for a high animal usage (up

to 60 % of all experimental animals). Since thalidomide was withdrawn from the market due to

unanticipated birth defects, these studies are considered a vital component of the in vivo

toxicology assessment.414 Animal body weight is measured as part of all studies, since extreme

weight loss can be indicative of severe organ damage and is one of the factors assessed with

regard to the humane termination of animals before extensive pain and distress occurs.415 In

summary, it is unsurprising that these preclinical AEs are often found to be highly associated

with a range of commonly occurring clinical toxicities.

Figure 4-5: Correlation of the degree of the node representing a preclinical AE in the network for component 1 with the number of drugs in the dataset which displayed the preclinical AE

On the other hand, the degree of the node representing a preclinical AE in the network is not

directly correlated with the number of drugs for which the AE is experienced, as illustrated for

the cases of associations with a high degree but a low number of drugs including the preclinical

AEs of WAXY FLEXIBILITY (32 drugs), BLOOD PROLACTIN INCREASED (36 drugs) and BIOPSY LYMPH

GLAND ABNORMAL (133 drugs) (Figure 4-5). As mentioned above in the analysis of the top 20

associations, preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN INCREASED were highly

predictive together or separately of a range of CNS side effects. When looking at all associations

for these preclinical AEs, it was found that preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN

INCREASED are predictive of many of the same clinical AEs. 53 AEs were the same out of the 105

clinical AEs associated with one or both preclinical AEs (~50 %) showing that these preclinical

AEs were predictive of a similar set of clinical toxicities. The clinical toxicities for which these

observations were significantly associated were grouped according to their system organ class

(SOC) in Figure 4-6, which showed that the most populated SOC groups were for NERVOUS

SYSTEM DISORDERS (49 AEs) and PSYCHIATRIC DISORDERS (29 AEs). There is previous literature

evidence to connect WAXY FLEXIBILITY and BLOOD PROLACTIN INCREASED to these NERVOUS SYSTEM

and PSYCHIATRIC DISORDERS. WAXY FLEXIBILITY is a motor symptom of catatonia which causes

immobility and reduced response to stimulus.416 The main driver of this effect is believed to be

D2 receptor blockade and therefore WAXY FLEXIBILITY is often observed as an adverse effect of

antipsychotic agents. Antagonism of D2 receptors leads to Parkinson’s disease related side

effects. WAXY FLEXIBILITY is observed in a range of animals including rats, mice, dogs and

monkeys according to the data collated from PharmaPendium. Many of the clinical AEs have

links to either catatonia e.g. ANTIDIURETIC HORMONE INCREASE, Parkinson’s disease e.g. FACIAL

SPASM, MEIGE’S SYNDROME417 and DYSKINESIA418, or anti-psychotic medication e.g. RETROGRADE

EJACULATION419 which explains their link to preclinical WAXY FLEXIBILITY.417 Preclinical BLOOD

PROLACTIN INCREASED is a known side effect of a decrease in dopamine, as well as being regulated

by other targets including serotonin, GABA, oestrogens and opioids.420 From our analysis,

BLOOD PROLACTIN INCREASED is associated with COGWHEEL RIGIDITY, DROOLING, FACIAL SPASM,

SCHIZOPHRENIA and TARDIVE DYSKINESIA in the 20 associations with the highest MI values, all of

which are known extra-pyramidal side effects associated with decreased dopamine effects.418

Overall, these findings, along with the literature evidence, make the observations of WAXY

FLEXIBILITY and BLOOD PROLACTIN INCREASED in animal models very important flags for toxicity

in humans, especially in combination with one another.
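
The overlap of roughly 50 % quoted above (53 shared clinical AEs out of 105 associated with one or both preclinical AEs) is the ratio of the intersection to the union of the two sets of clinical AEs. A minimal sketch follows; the function name shared_fraction is an assumption, and the two small sets are an illustrative subset of Table 4-5 rather than the full lists.

```python
def shared_fraction(clinical_a, clinical_b):
    """Return (number shared, number in either set, shared / either)."""
    shared = clinical_a & clinical_b
    either = clinical_a | clinical_b
    return len(shared), len(either), len(shared) / len(either)

# Illustrative subsets of the clinical AEs linked to each preclinical AE.
waxy_flexibility = {"FACIAL SPASM", "MEIGE'S SYNDROME", "ASPHYXIA"}
blood_prolactin_increased = {"FACIAL SPASM", "DROOLING", "COGWHEEL RIGIDITY"}

n_shared, n_either, fraction = shared_fraction(waxy_flexibility, blood_prolactin_increased)
reported_fraction = 53 / 105  # figures quoted in the text
print(n_shared, n_either, round(fraction, 2), round(reported_fraction, 3))
```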

In summary, by analysing the most connected preclinical AEs within component 1, we identified

that the preclinical observations taken routinely during in vivo animal studies, including

reproductive toxicity, body weight and activity measurements are important for determining

the general risk of clinical AEs. Additionally, we found that preclinical AEs of WAXY FLEXIBILITY

and BLOOD PROLACTIN INCREASED associated with a range of clinical NERVOUS SYSTEM and

PSYCHIATRIC DISORDERS, consistent with the fact that animal models are not well-developed for

these disorders.398,399 We propose that preclinical WAXY FLEXIBILITY and BLOOD PROLACTIN

INCREASED can be used to assess the risk of experiencing clinical NERVOUS SYSTEM and

PSYCHIATRIC DISORDERS.

Figure 4-6: Frequency distribution for System Organ Class (SOC) MedDRA terms mapped from clinical AEs (MedDRA preferred terms) for significant associations with preclinical waxy flexibility and blood prolactin increased. Coloured by preclinical AE. Preferred MedDRA terms can be members of multiple MedDRA SOC groups.

4.4 Conclusions

Previous concordance studies have led to data-driven conclusions for the risk of experiencing

clinical AEs given similar preclinical AEs. Despite the recognition that AEs manifest differently

between animals and humans, for reasons including differences in anatomy and physiology

between species, as well as differences on the genetic and cellular level, no studies have yet

taken advantage of the opportunity to associate preclinical AEs with substantially different

clinical AEs. In this study we implemented a computational data-driven approach towards

assessing the value of preclinical toxicity measurements in determining the risk of toxicity in

clinical studies for approved drugs, by computing the pairwise mutual information between all

preclinical and all clinical toxicity observations, generating statistical associations. In addition

to confirming 102 associations between related terms (‘concordant associations’), the method

presented here extended previous approaches by identifying

2,007 new associations between AEs which were encoded by different terms preclinically and

clinically. Firstly, this highlights the scale on which we were able to find novel predictive links

between preclinical AEs and clinical AEs through our approach. Secondly, we showed that there

were a large range of clinical AEs which are better predicted by different preclinical AEs than via

their corresponding term preclinically. We observed that 57 % of the generated associations

have a positive likelihood ratio of greater than 2, which indicates a significant change in risk of

experiencing the clinical AE given the preclinical AE. This ratio can be used to assess the

diagnostic ability of preclinical AEs for predicting a clinical outcome, which can be used in safety

assessment cases. Additionally, we constructed a network from these associations to show the

interdependency of adverse events upon one another and highlighted examples of AEs which

formed smaller connected components of the network. These included one-to-one,

one-to-many and many-to-many relationships. We find strong evidence to support the

association between preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical SUDDEN

CARDIAC DEATH, with a positive likelihood ratio of 23, and therefore recommend that regulatory

agencies consider the assessment of QT shortening in preclinical models in a similar way to QT

prolongation. We also discussed the main preclinical AE predictors for the largest connected

component in the network, which included the finding that WAXY FLEXIBILITY and BLOOD

PROLACTIN INCREASED were important predictors for a variety of NERVOUS SYSTEM DISORDERS and

PSYCHIATRIC DISORDERS clinically.

5 Deriving Potential Mechanisms from Associations

Between Drug-Induced Adverse Events in Animal

Models and Humans

5.1 Introduction

In the previous chapter, we analysed the data from PharmaPendium to derive statistical

associations between adverse events in animals and humans which were not necessarily

encoded as the same or similar terms in both preclinical and clinical toxicity data. This allowed

us to identify potential toxicities in animals which can be used as predictors of toxicity in

humans, and to quantify the risk based on the data. However, for these associations to be of

further use, it is important to consider the mechanisms by which these adverse events may be

connected. By identifying biological targets and pathways which may have a role in connecting

the two adverse events, compound effects on these biological components can be assessed at an

earlier stage through in vitro toxicity screening.

To provide biological support for our newly discovered associations and to propose biological

targets for secondary pharmacology screens, in this chapter we constructed a novel gene overlap

analysis, outlined in Figure 5-1, by integrating genetic information from drug and phenotype databases.

Overall, for each association, any genes which were relevant to both the preclinical AE and the

clinical AE, whose protein targets were found to interact with one or more of the drugs causing

both AEs, were considered as plausible mechanistic hypotheses. The targets identified from our

analysis can be investigated as mechanisms for the toxicities, for which further interpretation

and validation should be explored.

Figure 5-1: Conceptual illustration for the gene overlap analysis. To derive plausible biological mechanisms to link the preclinical AE, clinical AE and the drugs which induced both AEs, we mapped drug-target interactions to genes, preclinical and clinical AEs to phenotypes/diseases and subsequently to genes with a role in the phenotype or disease. Overlap between the three spaces of one or more genes for each association, was indicative of a potential mechanistic link. A more detailed step-wise process can be found in Figure 5-2.


5.2.1 Associations for Interpretation

For the purposes of interpretation, we limited our associations to only those with the highest

association scores, chosen by applying a cut-off for the Mutual Information (MI) of greater than

0.095, which was the highest value for the randomised distribution (see Assessing significance).

This left 248 associations which we investigated for evidence of mechanistic drivers.

A flow diagram of the overlap analysis method is shown in Figure 5-2.

Figure 5-2: Workflow for gene overlap analysis. The 5-step process for the gene overlap analysis to derive plausible biological mechanisms to link the preclinical AE, clinical AE and the association-driving drugs for each association.

Step 1: Map Association-Driving Drugs to Genes. Identify drugs which induced both clinical and preclinical AEs for each association. Extract genes from Drug-Target interaction databases.

Step 2: Map Preclinical and Clinical AEs to Diseases and Phenotypes. Using Ontology Mapping find equivalent terms for AEs.

Step 3: Map Phenotype and Disease IDs to Genes for each AE. From Disease-Gene and Phenotype-Gene Databases, extract all genes. Convert all gene identifiers to Uniprot KB IDs.

Step 4: Filter Genes to Relevant Species and Map to Orthologs. Remove all human genes from Preclinical AE and non-human genes from Clinical AE gene lists. Map all animal genes to human orthologs.

Step 5: Conduct Overlap Analysis of Genes between Preclinical AE, Clinical AE and Association-Driving Drugs for each Association. For each association obtain a list of genes which overlap between the three spaces for interpretation.

5.2.2 Extracting Targets for Drugs

Corresponds to Step 1, Figure 5-2.

All 2,259 drugs in PharmaPendium390 were standardised from their SMILES strings using the

StandardiseMolecules function of the camb package in R,306 which removes salts. The standardised

structures were then converted to InChIKeys using KNIME,351 to map the drugs to their target

activities in other databases.

Targets for the drugs in PharmaPendium were extracted from three main sources, namely the

AstraZeneca ChemistryConnect database (which contained Bioprint (2007 snapshot),116

ChEMBL-23330 and GOSTAR (GVK Bio) data), DrugTargetCommons421 and SuperDrug2.422

Data was extracted from ChemistryConnect by querying the database using a synonym search

from the drugs in the PharmaPendium database. Compounds extracted were standardised using

the same process as for the PharmaPendium drugs, and then matched by InChIKey423 to the

PharmaPendium drugs. It was necessary to carry out a second match on exact drug names

between the databases, as some of the drugs were non-small molecule drugs for which SMILES

were not available. Targets in ChemistryConnect were encoded by EntrezGeneID424 which was

mapped to their UniProtKB Accession IDs using the Uniprot Identifier exchange service339. The

data in ChemistryConnect is already categorized into the classes of active and inactive based on

the cut-off of 10 µM for endpoints including Ki, Kd, IC50 or % inhibition at 10 µM. We retained

the active entries for our analysis. Whilst this is a relatively high cut-off, we chose this activity

cut-off as we wanted to include as much off-target information for drugs as possible and many

of the panel screens were conducted at a concentration of 10 µM.

Data from the SuperDrug2422 database was provided by the curators. Compounds extracted

were standardised using the same process as for the PharmaPendium drugs, and then matched

by InChIKey and exact name matching to the PharmaPendium drugs. The targets in this

database were already encoded in UniProtKB Accession IDs.

DrugTargetCommons421 data was downloaded from the web platform

(https://drugtargetcommons.fimm.fi/). Since the data did not have SMILES, but did include

ChEMBL IDs, we matched the database to ChEMBL-23 using these IDs to obtain SMILES. For

this purpose, the ChEMBL IDs and SMILES for all ChEMBL-23 compounds were extracted using

Toad for MySQL425 and then merged with the ChEMBL IDs in Drug-Target Commons. The

resulting compounds were standardised using the same process as for the PharmaPendium

drugs, and then matched by InChIKey and exact name matching (for biologic drugs) to the

PharmaPendium drugs. For the data in DrugTargetCommons, those with an “active” flag in the

activity comment column were retained, where a cut-off of 10 µM was used to define activity

and only those records for which there was Ki, Kd, IC50 or % inhibition at 10 µM were retained.

The targets in this database were already encoded in UniProtKB Accession IDs.

Active drug-target interaction data from all sources was combined into one database, retaining

PharmaPendium drug names, UniProtKB Accession IDs, gene names, gene symbols where

possible and an external reference key to the original database. Duplicates of PharmaPendium

drug name and UniProtKB Accession ID were removed. Overall 82,459 positive drug-target

interactions were extracted for 1,604 out of 2,259 drugs in the PharmaPendium dataset, which

included 7,533 unique targets. The breakdown across databases was as follows: 1,397 drugs

mapped to ChemistryConnect, 1,511 drugs mapped to DrugTargetCommons and 1,253 drugs

mapped to SuperDrug2. 4,863 out of 7,533 targets were found to interact with multiple drugs

and 1,549 out of 1,604 drugs had more than one active target, and hence this combined dataset

was found to be encoding more than just the primary target activity.
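
The merging and de-duplication step described above can be sketched as follows. This is a simplified illustration in plain Python, not the workflow actually used: the function name combine_interactions, the drug names and the target identifiers are all placeholders, and only the de-duplication on (drug name, UniProtKB Accession ID) and the two summary counts are taken from the text.

```python
from collections import defaultdict

def combine_interactions(sources):
    """Merge active drug-target records, keeping one entry per (drug, target) pair.

    The value stored for each pair is the first source in which it was seen,
    standing in for the external reference key to the original database.
    """
    combined = {}
    for source_name, records in sources.items():
        for drug_name, uniprot_id in records:
            combined.setdefault((drug_name, uniprot_id), source_name)
    return combined

# Placeholder records; real entries would carry UniProtKB Accession IDs.
sources = {
    "ChemistryConnect": [("DRUG_A", "UNIPROT_1"), ("DRUG_A", "UNIPROT_2")],
    "DrugTargetCommons": [("DRUG_A", "UNIPROT_1"), ("DRUG_B", "UNIPROT_1")],
    "SuperDrug2": [("DRUG_B", "UNIPROT_3")],
}
combined = combine_interactions(sources)

drugs_per_target = defaultdict(set)
targets_per_drug = defaultdict(set)
for drug_name, uniprot_id in combined:
    drugs_per_target[uniprot_id].add(drug_name)
    targets_per_drug[drug_name].add(uniprot_id)

# Targets hit by several drugs, and drugs with more than one active target.
shared_targets = sum(1 for drugs in drugs_per_target.values() if len(drugs) > 1)
multi_target_drugs = sum(1 for targets in targets_per_drug.values() if len(targets) > 1)
print(len(combined), shared_targets, multi_target_drugs)
```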

5.2.3 Mapping Preclinical and Clinical AEs to Ontology Terms


For each preclinical or clinical AE which was found as part of the 248 statistically significant

associations generated from the analysis, the MedDRA encoded AE terms were mapped to

ontology IDs using Zooma,426 identifying phenotypes and diseases that describe the AE. The

ontologies searched in Zooma included the Human Disease Ontology (DOID),427 Mammalian

Phenotype Ontology (MP),428 Human Phenotype Ontology (HP),429 Experimental Factor

Ontology (EFO),430 Orphanet Rare Disease Ontology (ORDO),431 National Cancer Institute

Thesaurus (NCIT),432 Ontology of MIRNA Target (OMIT),433 Ontology of Adverse Events

(OAE),434 Monarch Merged Disease Ontology (MONDO),435 Symptom Ontology (SYMP),436

Mental Disease Ontology (MFOMD),437 Mouse Pathology Ontology (MPATH),438 Ontology of

Biological Attributes (OBA),439 and BioAssay Ontology (BAO).440 The mapping results were

manually filtered to leave only terms which have the same meaning as the MedDRA encoded

AE term. This resulted in 183 Zooma mappings for preclinical AEs and 405 mappings for clinical

AEs. Not all AEs were mapped to ontology identifiers. The list of Zooma Ontology terms mapped

from PharmaPendium MedDRA terms is in Supplementary Data Files 6 and 7.

5.2.4 Extracting Genes for Preclinical and Clinical AEs


From these mappings, the OpenTargets (version 3.4)441 database was queried using the

OpenTargets Python client442 to extract the genes associated with diseases encoded by the EFO

and DOID ontologies. The genes found were encoded by Ensembl gene (ENSG) IDs443 from

OpenTargets. In total 22,056 genes were mapped to preclinical AEs and 44,718 genes were

mapped to clinical AEs.

For the AE terms which were mapped to phenotype ontologies including HPOs, MPOs, NCITs,

Orphanet, as well as others, genes were extracted from the Monarch Initiative,444 using the

requests library in Python445 to import the data from the URL for the matched ontology ID. In

total 26,216 genes were mapped to preclinical AEs and 50,415 genes were mapped to clinical

AEs. The output gene IDs were HUGO Gene Nomenclature Committee (HGNC) symbols446 for

human genes and the relevant non-human gene IDs for other organisms including the Mouse

Genome Informatics (MGI)447 IDs.

For the AE terms which were mapped to disease ontologies including DOIDs and MONDO IDs,

genes were extracted from the Monarch Initiative,444 using the requests library in Python445 to

import the .tsv file from the URL for the matched ontology ID. In total 9,252 genes were mapped

to preclinical AEs and 35,239 genes were mapped to clinical AEs. The output gene IDs were as

for the previous step.

Gene IDs from all sources were then mapped to UniprotKB identifiers using the Uniprot

Identifier exchange service. The genes were mapped back to their original AE term. Duplicate

UniprotKB Accession IDs for each AE were removed. In total, 242,997 gene-preclinical AE

pairings were found across all preclinical AEs and 546,902 gene-clinical AE pairings were found

for clinical AEs.

5.2.5 Gene Filtering and Mapping Animal Genes to Human Orthologs


The UniprotKB gene IDs associated with drugs, preclinical AEs and clinical AEs were filtered in

the next step as follows. For preclinical AE genes, we retained those which were from non-

human species, while for clinical AE genes we retained only human genes. For the drug-gene

associations the non-human genes were separated from the human genes for the next step.

Next, for the preclinical AE-associated non-human genes, we mapped their gene identifiers to

their orthologs in humans, utilising the Uniprot Identifier exchange service to map to Eggnog448

identifiers and then back to human UniprotKB genes via the REST API.339 We chose the Eggnog

database due to its higher level of overlap with UniprotKB identifiers than other ortholog

mapping databases. The non-human genes for drugs were also mapped using the same method

to their human orthologs, meaning all genes were mapped to human UniprotKB identifiers for

the overlap analysis.
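
The species filtering and ortholog mapping can be sketched as below. This is an illustrative outline only: the dictionary `ortholog_map` is a hypothetical stand-in for the orthology lookup (performed here via Eggnog identifiers), and the accessions and species labels are invented.

```python
HUMAN = "human"


def to_human_accessions(genes, ortholog_map):
    """Convert (accession, species) records to a set of human accessions.

    Human genes are kept as they are; non-human genes are replaced by their
    human orthologs where `ortholog_map` has an entry and dropped otherwise.
    """
    human = set()
    for accession, species in genes:
        if species == HUMAN:
            human.add(accession)
        else:
            human.update(ortholog_map.get(accession, ()))
    return human


def preclinical_genes(genes, ortholog_map):
    """Keep only non-human genes for a preclinical AE, then map to human."""
    non_human = [(acc, sp) for acc, sp in genes if sp != HUMAN]
    return to_human_accessions(non_human, ortholog_map)


def clinical_genes(genes):
    """Keep only human genes for a clinical AE."""
    return {acc for acc, sp in genes if sp == HUMAN}


# Toy records; Z1 has no known human ortholog and is therefore dropped.
genes = [("H9", "human"), ("M1", "mouse"), ("R1", "rat"), ("Z1", "zebrafish")]
ortholog_map = {"M1": {"H1"}, "R1": {"H1", "H2"}}

print(sorted(preclinical_genes(genes, ortholog_map)))
print(sorted(clinical_genes(genes)))
print(sorted(to_human_accessions(genes, ortholog_map)))  # drug-gene case
```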

5.2.6 Overlap Analysis of Genes from Preclinical AE, Clinical AE and Drugs


The following analysis was conducted in Python utilising the pandas449 library. All the drugs,

preclinical and clinical AEs and ortholog-mapped genes with UniprotKB Accession IDs were

used in the overlap analysis. The overlap analysis was designed to check whether, for each

association, the drugs which displayed both the preclinical and clinical AEs possess protein

targets which are encoded by genes associated with both the preclinical and clinical AE, and

which could hence mechanistically be associated with AEs in both animals and humans. All

matches of a drug with protein targets which are present in the genes associated with both

preclinical and clinical AEs were retained for subsequent analysis. For each association, the

intersection between genes for the preclinical and clinical AE terms was extracted as a list of

genes.
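
The logic of the overlap analysis for a single association can be summarised in a few lines. The sketch below uses plain Python sets rather than pandas, and the function name and toy identifiers are illustrative assumptions rather than the implementation used in this work.

```python
def overlap_for_association(preclinical_genes, clinical_genes, drug_targets):
    """Return the genes shared by both AEs and, per drug, the shared targets.

    preclinical_genes, clinical_genes: sets of human accessions for the two AEs.
    drug_targets: dict mapping each drug showing both AEs to its target set.
    """
    shared_ae_genes = preclinical_genes & clinical_genes
    per_drug = {}
    for drug, targets in drug_targets.items():
        hits = targets & shared_ae_genes
        if hits:  # retain only drugs with at least one overlapping target
            per_drug[drug] = hits
    return shared_ae_genes, per_drug


# Toy association with two drugs, only one of which has target overlap.
pre = {"G1", "G2", "G3"}
clin = {"G2", "G3", "G4"}
drugs = {"drug_a": {"G2", "G9"}, "drug_b": {"G7"}}

shared, per_drug = overlap_for_association(pre, clin, drugs)
print(sorted(shared), per_drug)
```

An association counts as mechanistically supported in this sketch when `per_drug` is non-empty.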

5.2.7 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels

In vitro target safety panels were extracted from Lounkine et al., 2012117 and Bowes et al., 2012118,

which represent the Novartis Safety Target panel and Panel-44 respectively. The genes found

from the overlap analysis were compared to the genes represented by the safety targets in the

combined panels and flagged as either known safety target genes (if present in the panels) or

potentially novel safety target genes (if not present in the panels). To assess how much better

this method was at identifying known safety targets than a random selection of proteins

associated with active compounds, we simulated the sampling of 482 targets (the same number

as we found from our analysis) with known actives from ChEMBL and compared these targets

to the safety panel targets. Sampling was repeated 1000 times to produce a distribution of the

number of matches to the 77 safety panel targets (Figure 5-3), from which the mean was

calculated to be 16, corresponding to the retrieval of 21 % of known safety panel targets on

average.
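
The random baseline can be reproduced in outline with the standard library. In this sketch the universe of 2,000 targets and the panel membership are invented, since the number of ChEMBL targets with known actives is not stated here; only the sample size (482) and the number of repeats (1000) follow the text.

```python
import random
import statistics


def simulate_panel_matches(universe, panel, sample_size, n_repeats, seed=0):
    """Count safety-panel matches in repeated random samples of targets."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    universe = sorted(universe)  # rng.sample needs a sequence
    panel = set(panel)
    counts = []
    for _ in range(n_repeats):
        sample = rng.sample(universe, sample_size)  # without replacement
        counts.append(sum(1 for target in sample if target in panel))
    return counts


# Toy universe of hypothetical targets; the first 73 act as panel targets.
universe = [f"T{i}" for i in range(2000)]
panel = universe[:73]
counts = simulate_panel_matches(universe, panel, sample_size=482, n_repeats=1000)
mean_matches = statistics.mean(counts)
print(mean_matches, mean_matches / len(panel))
```

The list `counts` is the distribution that would be plotted as a histogram, and its mean is the baseline against which the observed number of matches is compared.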

Figure 5-3: Distribution of the number of matches to safety panels from 1000 random samples of 482 targets from ChEMBL. The mean of the distribution (marked in red) was 16 matches, identifying 21 % of known safety panel targets on average.

5.2.8 Visualisations

Visualisations were generated in TIBCO Spotfire352 and R.

5.2.9 Limitations

There are several limitations to be aware of with respect to the analysis conducted. Firstly, we

assumed that AEs can be represented by diseases or phenotypes; some mapping was possible,

however, not all AEs could be mapped in this way and this may be responsible for some

associations with a lack of mechanistic support. Additionally, we assumed that gene associations

are equal to protein target associations. Finally, we did not discriminate on the level of

association required for a gene to be associated with a phenotype or disease because we wanted

to identify all plausible links which could be used for hypothesis generation.


5.3.1 Overall Analysis of Mechanistically Supported Associations

We set out to systematically identify mechanistic links between our associations. To do this we

identified targets modulated by drugs inducing both AEs in the association, as well as mapped

our AE terms to diseases or phenotypes from which we subsequently retrieved genes associated

with each AE. We then conducted an overlap analysis to retrieve genes which were common to

the preclinical and clinical AEs, and one or more of the drugs which induced AEs in both

domains (see Materials and Methods). We conducted our mechanistic overlap analysis on a

subset of 248 highly scoring associations as measured by the MI from the previous chapter, for

which we found overlapping evidence on the gene level for 60 associations. In total, 482 unique

genes were found across all associations (see Supplementary Data File 8 for details).

Figure 5-4 shows the number of unique genes that were found to link preclinical and clinical

AEs induced by drug treatment. The association with the highest number of unique genes was

the association of preclinical FOETAL MALFORMATION with clinical NERVOUS SYSTEM DISORDERS

with 310 different genes, with the high number of genes involved likely to be due to the

multitude of AEs that can be described by NERVOUS SYSTEM DISORDERS, as well as overlap

between pathways in developmental and CNS disorders.450 Conversely, for some associations,

there were very few genes found for the association between preclinical and clinical AEs, for

example the association of preclinical BLOOD PROLACTIN INCREASED with BLOOD GROWTH

HORMONE INCREASED, which was linked by only one gene, namely Peroxisome proliferator-

activated receptor gamma (PPARG), for four drugs (see Figure 5-5 for the number of drugs

driving each association, as well as the proportion of drugs which displayed gene overlap). We

also observe links between other AEs which have an underlying mechanistic rationale (Figure

5-4), including the association of preclinical ADRENAL SUPPRESSION with clinical ADRENAL

INSUFFICIENCY, the association of preclinical ELECTROCARDIOGRAM QT SHORTENED with clinical

SUDDEN CARDIAC DEATH, associations of preclinical MUTAGENIC tests with clinical

TERATOSPERMIA and the association of preclinical PSYCHOMOTOR HYPERACTIVITY with clinical DRUG

WITHDRAWAL SYNDROME; we discuss some of these in detail in the subsequent sections.

We found that the number of associations with mechanistic overlap between preclinical, clinical

and drug space amounted to only about 25 % of our initial associations (60 of 248). Whilst low,

this offers increased confidence that those associations are supported by genetic links.

Supplementary Data File 8 provides more detailed information on the gene overlap analysis.

This information can now be used to identify preclinical readouts which are mechanistically

linked to a clinical readout of interest, to identify preclinical safety models that go beyond

simple term and observation matching between the preclinical and clinical domain, extending

the analysis from the previous chapter.

For those 75 % of associations that were not mechanistically linked, based on existing data, we

propose that there were multiple possible reasons. Firstly, the individual AEs could be caused

by separate and distinct mechanisms of the same drugs (i.e., two AEs are produced by different

off-targets of the same drug) and therefore we find statistical, but no causal associations

between (mechanistically distinct) AEs. This highlights the need to produce selective drugs

where possible in order to better understand toxicities. Secondly, in some cases we were unable

to map the AE terms to a representative disease or phenotype found in one of our databases.

Thirdly, there may not be extensive knowledge about the involvement of genes in the linked

diseases or phenotypes, as annotated in the databases used here. Fourthly, for the drugs

producing given AEs in preclinical and clinical space, the protein target profile of a drug may

be incomplete (which is likely a considerable source of false-negative links in this

work, given the incompleteness of available bioactivity data). Finally, there may be indirect

downstream biological effects via signalling networks, so that the target of a drug is not directly

involved in any toxic downstream events. Such indirect effects are plausible in practice;

however, links of this type cannot be discovered by the current analysis.

Therefore, we conclude that there may be valuable information in the statistical

associations that were not supported by this method, as many of these may be due to the

limitations of knowledge in the public domain. In the future, for those associations where there

was no gene overlap found, it would be interesting to explore those where a target was found to

overlap between two of the three spaces. This could reveal insight into potential novel

mechanisms of disease and propose off-targets not yet discovered for the set of drugs.

Figure 5-4: Number of unique genes found for each mechanistically supported association between preclinical and clinical AEs. Plot is coloured by the number of unique genes overlapping between drug targets and genes involved in phenotypes in both preclinical and clinical AE space. Minimum number of genes was 1 and the maximum number of genes was 310. Only about 25% of links between preclinical and clinical AEs are mechanistically supported; however, those links are now in turn more likely to be biologically meaningful, and at the same time a potential biomarker/safety target hypothesis for the respective AE is derived.

Figure 5-4 axes: Preclinical AE, Clinical AE. Colour scale (number of unique genes): Max (310), Average (22), Min (1).

Figure 5-5: Number of drugs with gene overlap for each significant association. For those associations which had preclinical AE, clinical AE and drug gene overlap, the overall number of drugs driving the association is shown, coloured by whether there was gene overlap for each drug.

Figure 5-5 axis labels: Clinical AE, Preclinical AE.

5.3.2 Comparison of Mechanistic Targets to In Vitro Safety Screening Panels

We next assessed the correspondence between the genes found from this analysis and the

published safety target panel associated genes to determine the ability of our analysis to find

known safety targets. In total, our analysis identified 56 out of 73 (77 %) known safety target

genes, as defined from the combination of the panels published in Lounkine et al., 2012117 and

Bowes et al., 2012118, which was enriched compared to a mean of 16 targets out of 73 (21 %)

retrieved from repeated random samples of 482 targets with activity from ChEMBL (Figure 5-3).

The gene names, as well as the clinical AEs with which each gene was found to be associated, are

displayed in Figure 5-6. Certain genes can be associated with groups of (as opposed to

individual) clinical AEs; for example, for nervous system and psychiatric disorders, the terms

DRUG DEPENDENCE, OBSESSIVE COMPULSIVE DISORDER, SCHIZOPHRENIA, TARDIVE DYSKINESIA and

TOURETTE’S DISORDER were associated with many genes, including the Dopamine Receptor

subtypes 1, 2 and 3, which are known to be associated with CNS adverse events;119 however, the

precise link between these receptors and OBSESSIVE COMPULSIVE DISORDER, SCHIZOPHRENIA and

TOURETTE’S DISORDER was not reported.119 The AR and ESR1 genes, which encode the Androgen

Receptor and Estrogen Receptor 1 respectively, are promiscuous across clinical AEs, due to their

roles in a wide range of biological processes, specifically linking to endocrine and CNS adverse

events.119 Similarly, the GCR gene which encodes the Glucocorticoid Receptor was also found

to be associated with many of the clinical AEs, in particular a range of CNS adverse events, which

have not been previously identified.119 More specific gene-clinical AE associations from our

analysis included the Acetylcholinesterase (ACES) and Tyrosine-Protein Kinase LCK (LCK)

genes associated with PNEUMONIA; ACES has previously been linked to

respiratory adverse events.119 Further associations were found for the Neuronal acetylcholine

receptor subunit alpha-4 (ACHA4) gene with POOR-QUALITY SLEEP, for the Neuronal acetylcholine

receptor subunit alpha-7 (ACHA7), Neuronal Acetylcholine Receptor Subunit Beta-4 (ACHB4),

Substance-P Receptor (NK1R) and Glutamate Receptor Ionotropic, NMDA 2A (NMDE1) genes with

DRUG WITHDRAWAL SYNDROME, and for the Steroid Hormone Receptor ERR2 (ERR2) and

Gamma-aminobutyric acid Receptor Subunit Alpha-5 (GBRA5) genes with NERVOUS SYSTEM

DISORDERS. None of these were summarized in the previous study,119 making these findings

novel associations between previously identified safety target encoding genes and adverse

events.

The remaining 426 genes have not been previously identified as safety target encoding genes

and are not therefore routinely screened within pharmaceutical companies. These novel genes

can be found in Supplementary Data File 9 and can be used to help prioritise important toxicity

targets for further evaluation. These genes represent biological mechanisms that can be targeted

with a small molecule ligand, and which are mechanistically involved in different toxic

phenotypes across species.
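
The split into known and potentially novel safety target genes amounts to a set intersection and a set difference. The sketch below uses placeholder identifiers sized to match the counts reported above (73 panel genes, 482 genes from the overlap analysis, 56 shared); the real gene lists are in the Supplementary Data Files.

```python
def flag_safety_targets(found_genes, panel_genes):
    """Split genes from the overlap analysis into known and novel targets."""
    found_genes, panel_genes = set(found_genes), set(panel_genes)
    known = found_genes & panel_genes  # present in the published panels
    novel = found_genes - panel_genes  # not present in the panels
    recall = len(known) / len(panel_genes)  # fraction of the panel recovered
    return known, novel, recall


# Placeholder identifiers only; sizes follow the counts given in the text.
panel = {f"panel_{i}" for i in range(73)}
found = {f"panel_{i}" for i in range(56)} | {f"other_{i}" for i in range(426)}

known, novel, recall = flag_safety_targets(found, panel)
print(len(known), len(novel), round(100 * recall))
```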

In the next section we discuss examples of mechanistically supported associations highlighting

known and novel genes for further evaluation.

Figure 5-6: Genes whose resulting protein products are targeted by drugs and which are associated with toxicities on the preclinical and clinical level, which overlap with known safety targets from the published studies by Lounkine et al., 2012 117 and Bowes et al., 2012 118. Matrix elements are coloured by the number of times the gene was associated with each clinical AE. The most promiscuous genes across AEs included those encoding the nuclear hormone receptors Androgen receptor, Estrogen receptor 1 and Glucocorticoid receptor.

Figure 5-6 axes: Clinical AE, Gene Name (Uniprot). Colour scale: Max (73), Average (15), Min (1).

5.3.3 Mechanistically Supported One-to-One Relationships Between AEs

From our mechanistically supported associations, 6 out of 60 showed a one-to-one relationship

between one preclinical AE and one clinical AE (Table 5-1). In this section, three examples of

mechanistically supported associations are discussed where there were no other associations

with the same preclinical or clinical AE.

Table 5-1: Biological targets and drugs for all six one-to-one associations with mechanistic support derived from the gene overlap analysis. Includes the MedDRA terms for both the preclinical and clinical adverse events in the association, as well as the positive likelihood ratio, number of drugs for which overlapping genetic evidence was found and the identity of the genes.

| Preclinical AE | Clinical AE | Positive likelihood ratio (LR+) | Number of drugs overlapping (number with gene overlap) | Number of unique genes | Genes overlapping (order of decreasing number of drugs for which genes appear) |
| --- | --- | --- | --- | --- | --- |
| ELECTROCARDIOGRAM QT SHORTENED | SUDDEN CARDIAC DEATH | 22.9 | 9 (1) | 4 | CAC1C (1), KCND3 (1), KCNQ1 (1), SCN5A (1) |
| REFLEXES ABNORMAL | POOR QUALITY SLEEP | 17.3 | 10 (7) | 7 | AA2AR (6), OX2R (6), ACM2 (2), ACHA4 (1), ACHB2 (1), CAC1C (1), GBRB3 (1) |
| INJECTION SITE ERYTHEMA | INJECTION SITE RASH | 13.1 | 20 (1) | 7 | ABL1 (1), BTK (1), FGFR3 (1), KIT (1), PGFRA (1), RET (1), TIE2 (1) |
| PSYCHOMOTOR HYPERACTIVITY | DRUG WITHDRAWAL SYNDROME | 6.2 | 53 (49) | 26 | CP2B6 (44), AA1R (42), AA2AR (42), ADA2A (42), ADA2B (42), CNR1 (42), DRD2 (42), DRD3 (42), GASR (42), NPY2R (42), OPRD (42), OPRK (42), OPRX (42), OX1R (42), HRH1 (26), SC5A (25), ACHA7 (8), ACHB2 (8), ACHB4 (8), MK01 (5), MK03 (5), NMDE1 (3), SSDH (3), CREB1 (1), GRIA2 (1), NK1R (1) |
| FOETAL MALFORMATION | NERVOUS SYSTEM DISORDERS | 4.9 | 104 (54) | 310 | See Supplementary Data File 8 |
| INTRA-UTERINE DEATH | BACK PAIN | 2.5 | 606 (114) | 17 | BRAF (19), CBP (22), CDK4 (17), EP300 (22), FGFR1 (61), FGFR3 (61), GNAS2 (4), HBB (1), P53 (16), PK3CA (24), PTN11 (2), RASH (1), TASK (6), RASN (5), SMAD4 (21), SMO (3), STK11 (61) |

Preclinical ELECTROCARDIOGRAM QT SHORTENING and clinical SUDDEN CARDIAC DEATH are associated

via multiple genes/protein targets

The prediction of the risk of drug-induced sudden cardiac death in humans from non-clinical

studies is particularly important because, owing to the severity of this AE, there are health and cost

implications of taking a drug with this risk forward to patients.451 In this study we found that

preclinical ELECTROCARDIOGRAM QT SHORTENING is associated with clinical SUDDEN CARDIAC

DEATH with a rather high LR+ of 23 (Table 5-1), as discussed in the previous chapter. Overall in

the dataset, nine drugs supported this association. When analysing the gene overlap between

preclinical and clinical toxicities and the drugs which caused both toxicities, rufinamide was the

only drug whose known modulated targets overlapped with both toxic effects. These

protein targets were Voltage-dependent L-type calcium channel subunit alpha-1C (CAC1C),

Potassium voltage-gated channel subfamily D member 3 (KCND3), Potassium voltage-gated

channel subfamily KQT member 1 (KCNQ1) and Sodium channel protein type 5 subunit alpha

(SCN5A), which are well associated with short QT intervals and sudden cardiac death.406,452–454

For the other drugs which induced both AEs we did not find mechanistic support, suggesting

that the proteins responsible for the connection between the adverse events might not be

known for all drugs. Here, we found mechanistic evidence to support the link between

preclinical ELECTROCARDIOGRAM QT SHORTENING and clinical SUDDEN CARDIAC

DEATH for clinically tested drugs. KCND3 is not currently found in the published safety target panels

considered here;117,118 however, it is screened as part of the Comprehensive in Vitro Proarrhythmia

Assay (CiPA).455

This mechanistic evidence adds to the case made in the previous chapter where we discussed

the need for regulatory agencies to consider how preclinical ELECTROCARDIOGRAM QT

SHORTENING should be used to assess clinical risk of arrhythmias and clinical SUDDEN CARDIAC

DEATH, consistent with the rise in the incidence of the preclinical and clinical observations.406

Based on this data-driven study and the previous reports highlighting that drug-induced QT

shortening should be monitored in addition to QT prolongation due to the severe

prognosis,407,456 we recommend that the targets identified in this study (CAC1C, KCND3, KCNQ1

and SCN5A) and other associated targets be prioritised for drug safety screening due to the

severity of the clinical AE. This could be achieved by expanding such efforts as the CiPA panel

to encompass QT shortening-related targets, in addition to its current content of targets of

relevance to proarrhythmia.457

Preclinical ABNORMAL REFLEXES are associated with clinical POOR-QUALITY SLEEP via multiple

genes/protein targets

Many drugs induce sleep problems, which can have a significant effect on quality of life. Animal

models exist for assessing the risk of poor-quality sleep in patients; however, many differences

in sleep patterns exist between animals and humans.458 In this study we identified a strong

association between preclinical REFLEXES ABNORMAL and clinical POOR-QUALITY SLEEP with an LR+

of 17 (Table 5-1). This association is driven by the observation of both side effects for 10

individual drugs in the dataset. Of these 10 drugs, 7 were found to share at least one gene target

with the two phenotypes. The most common targets found for this association were the

Adenosine A2a receptor (AA2AR), annotated as a target of six drugs, and the Orexin 2 receptor

(OX2R), also annotated as a target of six drugs. Studies of A2aR-knockout mice show that

AA2AR is linked with a decrease in sensitivity to painful stimuli, which can lead to reduction in

the reflex response.459 AA2AR also plays a role in sleep directly, and mice lacking the functional

Adenosine A2a receptor no longer show increased wakefulness in response to caffeine.460 In

humans, polymorphisms in the A2a receptor are associated with impaired sleep, showing a role

for A2a receptors in sleep quality directly.461 The orexin receptors are well-known for their link

with sleep and wakefulness cycles,462–464 in particular the conditions of insomnia and

narcolepsy.465 More recently, orexin has also been linked to antinociceptive effects, showing a

role in neuropathic pain modulation, which may have a role in the reflex response to pain.466

While the A2a receptor is part of previously published safety panels, OX2R is currently not

found in these.455 However, in particular where sleep modulation is a concern during drug

development, the recommendation from this analysis (in agreement with previous literature)

would be to include OX2R in screening panels.

Preclinical PSYCHOMOTOR HYPERACTIVITY is associated with clinical DRUG WITHDRAWAL SYNDROME via

multiple genes/protein targets

Drug withdrawal syndrome presents a real challenge to patients and in some cases can be life-

threatening.467 One of the one-to-one associations from our analysis was that of PSYCHOMOTOR

HYPERACTIVITY in the preclinical setting being significantly associated with clinical DRUG

WITHDRAWAL SYNDROME, with an LR+ of 6.2 (Table 5-1). In total, 53 drugs were found which

induced both phenotypes, of which 49 had one or more genes overlapping between their targets and the

two AEs in the association. The genes whose protein targets were modulated by the highest

number of drugs for this association included CP2B6, AA1R, AA2AR, ADA2A, ADA2B, CNR1,

DRD2, DRD3, GASR, NPY2R, OPRD, OPRK, OPRX, OX1R, which were targets for at least 42 of

the drugs supporting the association. Many of those links identified from the data were

supported by biological knowledge. The Cytochrome P450 family 2 subfamily B member 6

(CP2B6) polymorphisms, in combination with attention deficit hyperactivity disorder (ADHD)

symptoms, are found to be linked to nicotine addiction.468 Antagonism of the Adenosine

receptors (AA1R and AA2AR) leads to psychomotor phenotypes from the reduction in adenosine

and regulation of genes in the striatum signalling pathway.469 AA1R and AA2AR agonists have

been linked to withdrawal symptoms for benzodiazepine drugs,470 and agonism of these

receptors antagonistically modulates dopaminergic neurotransmission and therefore reward

systems.471 The Cannabinoid receptor 1 (CNR1) and Dopamine receptors (DRD2, DRD3) have

well-known links to ADHD, neuropsychiatric disorders and substance withdrawal, as do the

Opioid receptors (OPRD, OPRK, OPRX).472–475 Cholecystokinin B receptor (GASR) mutations

are related to behavioural changes in animals476 and also affect morphine induced hyperactivity

and withdrawal symptoms.477,478 Finally, the Orexin 1 receptor (OX1R) is involved in ADHD as

well as naloxone-precipitated morphine withdrawal.479–481

This analysis presents strong evidence for a link between preclinical PSYCHOMOTOR

HYPERACTIVITY and clinical DRUG WITHDRAWAL SYNDROME, supported by multiple target

mechanisms. Of these targets, the majority are present in existing safety panels, and this analysis

adds to the weight of evidence to support their utility in predicting clinically relevant endpoints.

We also identified three targets, namely CP2B6, OPRX and OX1R, which were not routinely

screened according to published in vitro safety panels, although CP2B6 is screened in drug

metabolism and pharmacokinetic (DMPK) assays for potential drug-drug interactions.482 Based

on the analysis performed here, we propose these three targets for further investigation and

possible inclusion in such panels, where associated effects are of relevance for decision-making

for compound progression along the clinical development path.

5.3.4 Mechanistically Supported Groups of Associations Between AEs

We finally investigated whether associations in our analysis could be considered as groups of

associations, given that in many cases (see Figure 5-4) individual preclinical side effects and

clinical side effects are related in a more complex manner than one-to-one relationships, as

detailed in the previous chapter. Hence, we next analysed whether mechanistic targets were

present across a whole set of associations of preclinical and clinical effects. In the following we

focus on finding new safety targets for clinical reproductive toxicity, namely TERATOSPERMIA and

OVARIAN FAILURE. This is an area where non-clinical models are actively being investigated, to

improve the risk assessment for reproductive toxicity in the clinic.483

Preclinical mutagenicity tests are associated with clinical teratospermia and ovarian failure,

supported by multiple genes/protein targets

From our analysis, observations of preclinical mutagenicity using the MUTAGENIC:

MICRONUCLEUS TEST, IN BONE MARROW CELLS test and MUTAGENIC: CHROMOSOME ABERRATION, IN

BONE MARROW CELLS, were associated with clinical TERATOSPERMIA and OVARIAN FAILURE (Figure

5-7). The LR+ for TERATOSPERMIA, given a positive result in MUTAGENIC: SPECIFIC LOCUS TEST

(THYMIDINE KINASE), IN LYMPHOMA CELLS, was 28.9, while the ratio for the MUTAGENIC:

CHROMOSOME ABERRATION, IN BONE MARROW CELLS test was 23.1 and the ratio for the

MUTAGENIC: MICRONUCLEUS TEST, IN BONE MARROW CELLS was 20.7. This showed a very high

conditional likelihood for experiencing TERATOSPERMIA in clinical trials, given a positive readout

in any of these preclinical tests. For clinical OVARIAN FAILURE, the MUTAGENIC: MICRONUCLEUS

TEST IN BONE MARROW CELLS had an LR+ of 17.2 and the MUTAGENIC: CHROMOSOME ABERRATION

TEST an LR+ of 19.2. Hence, in this set of toxic events, we were able to identify a group of related

preclinical readouts that were predictive of related clinical toxicities.
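
For reference, the positive likelihood ratio is conventionally defined as sensitivity divided by (1 - specificity). The sketch below computes it from a 2x2 contingency table of drugs; the assignment of cells to drug counts is our assumption about how such a table would be set up for a preclinical-clinical association, and the counts are invented for illustration.

```python
def positive_likelihood_ratio(tp, fp, fn, tn):
    """LR+ = sensitivity / (1 - specificity) from a 2x2 contingency table.

    tp: drugs with both the preclinical and the clinical AE
    fp: drugs with the preclinical AE only
    fn: drugs with the clinical AE only
    tn: drugs with neither AE
    """
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)  # equals 1 - specificity
    if false_positive_rate == 0:
        return float("inf")  # no false positives: the ratio is unbounded
    return sensitivity / false_positive_rate


# Invented counts for illustration only.
lr_plus = positive_likelihood_ratio(tp=9, fp=30, fn=11, tn=950)
print(round(lr_plus, 1))
```

An LR+ well above 1 indicates that a positive preclinical readout is considerably more frequent among drugs showing the clinical AE than among drugs that do not.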

Figure 5-7 shows all 33 shared genes between the associations for mutagenicity-related

preclinical observations. It can be observed that the main gene present for all associations was

the human ANDR gene. This gene (the Androgen receptor) is well-known to be linked to

TERATOSPERMIA, as it has a vital role in spermatogenesis and mutations in the gene lead to male

infertility.484–486 OVARIAN FAILURE is linked to the absence of the Androgen receptor in mice and

serum androgen levels are elevated in women with ovarian failure.487,488

Many of the other 32 genes for these associations, shown in Figure 5-7, are related to DNA

damage or repair, apoptosis, cellular proliferation, angiogenesis, methylation effects and

transcriptional regulation. Interestingly, we can distinguish the different mutagenic tests in vivo

by their mechanistic drivers, as not all genes are found to be overlapping. For example, the

CHROMOSOME ABERRATION TEST IN BONE MARROW CELLS is associated with genetic evidence for

bromodomains BAZ1A, BRDT and BRWD1, whereas these genes are not associated with the

MUTAGENIC: MICRONUCLEUS TEST IN BONE MARROW CELLS. There are differences in the genetic

evidence found in the literature associated with clinical endpoints. For example, there are many

genes which are found for both associations with TERATOSPERMIA but not OVARIAN FAILURE, such

as the Retinoic acid receptors (RARA and RARG) and the Cyclin-dependent kinases CDK2

and CDK16. Conversely, there is only one gene which was found for OVARIAN FAILURE and not

TERATOSPERMIA, namely Cyclin-dependent kinase 4 (CDK4). Overall, we present the genes

which were found to support our associations between mutagenic preclinical AEs and the

clinical AEs of TERATOSPERMIA and OVARIAN FAILURE, providing protein targets which could be

investigated further for their roles in these serious reproductive toxicities of drugs.

In total, out of 33 genes, only five are currently routinely screened as part of in vitro safety panels,

namely ANDR, ESR1, ESR2, GCR and PRGR. The protein products of the remaining 28 genes,

namely BAZ1A, BRDT, BRWD1, CDK1, CDK16, CDK2, CDK4, COT2, FGFR1, FGFR2, GSK3A,

INSR, MTOR, NR0B2, P53, PPARA, PPARG, RARA, RARG, RET, RXFP2, RXRB, STK11, THB,

TXK, VDR, VGFR1 and VGFR2, are not currently routinely used within published safety

screening panels and, where relevant, based on this data-driven analysis we propose that they

could be investigated further for their role in the prediction of clinical reproductive toxicity.

Figure 5-7: Mechanistic targets for the associations between mutagenic preclinical adverse events and clinical TERATOSPERMIA and OVARIAN FAILURE. Matrix elements are coloured by the number of drugs which had active data at the protein target encoded by the gene. The Androgen receptor (ANDR) was identified as a mechanism for both TERATOSPERMIA and OVARIAN FAILURE, and is well-known to be relevant to these processes. The Retinoic acid receptors (RARA and RARG) were identified as a mechanistic target for only TERATOSPERMIA, and Cyclin-dependent kinase 4 (CDK4) was identified as a mechanistic target for OVARIAN FAILURE only. Many of the targets overlap between the two reproductive toxicity adverse events.

Figure 5-7 axes: Adverse Event (Clinical AE, Preclinical AE), Gene Names. Colour scale: Max (8), Average (4.2), Min (1).

5.4 Conclusions

In this study we establish the degree of mechanistic support for the most significant associations

found in Chapter 4 by employing a novel gene overlap analysis. We found the intersection of

genetic evidence for the preclinical and clinical AEs and the drugs which caused both AEs. Out

of all associations for which we conducted the gene overlap analysis, we found one or more

genes supporting the association between preclinical and clinical AEs for 25% of cases. When

comparing these genes to known safety target genes, 77 % of known safety targets were found

via our overlap analysis. We also identified 426 genes, which we propose could be investigated

as future safety target encoding genes, for different clinical toxicities of interest. Table 5-2

summarizes the clinical AEs which we have highlighted as case studies in our work, as well as

the targets found from our mechanistic overlap analysis which are not included in currently

published in vitro screening panels. It should be noted that these case studies are only the

toxicities we analysed in detail in this work; for the full list of associations from which the reader

may pick out those of interest to their work please see Supplementary Data Files 3 and 7.

Table 5-2: Summary of the clinical AEs and the genes which did not encode targets in existing safety panels117,118 identified by our gene overlap analysis.

| Clinical Toxicity | Genes identified in this work not encoding current targets in published safety panels117,118 |
| --- | --- |
| SUDDEN CARDIAC DEATH | KCND3 |
| POOR QUALITY SLEEP | OX2R |
| DRUG WITHDRAWAL SYNDROME | CP2B6, OPRX, OX1R |
| TERATOSPERMIA | BAZ1A, BRDT, BRWD1, CDK1, CDK16, CDK2, COT2, FGFR1, FGFR2, GSK3A, INSR, MTOR, NR0B2, P53, PPARA, PPARG, RARA, RARG, RET, RXFP2, RXRB, STK11, THB, TXK, VDR, VGFR1, VGFR2 |
| OVARIAN FAILURE | BAZ1A, BRDT, BRWD1, CDK2, CDK4, MTOR, PPARG, VGFR2 |
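
As a worked illustration of how the gene lists for the two reproductive toxicities compare, the sketch below applies set operations to the TERATOSPERMIA and OVARIAN FAILURE rows of Table 5-2. Note that these rows contain only genes not covered by the published panels, so panel genes such as ANDR are not part of this comparison.

```python
# Gene lists copied from the TERATOSPERMIA and OVARIAN FAILURE rows of Table 5-2.
teratospermia = {
    "BAZ1A", "BRDT", "BRWD1", "CDK1", "CDK16", "CDK2", "COT2", "FGFR1",
    "FGFR2", "GSK3A", "INSR", "MTOR", "NR0B2", "P53", "PPARA", "PPARG",
    "RARA", "RARG", "RET", "RXFP2", "RXRB", "STK11", "THB", "TXK", "VDR",
    "VGFR1", "VGFR2",
}
ovarian_failure = {
    "BAZ1A", "BRDT", "BRWD1", "CDK2", "CDK4", "MTOR", "PPARG", "VGFR2",
}

shared = teratospermia & ovarian_failure  # genes found for both AEs
only_teratospermia = teratospermia - ovarian_failure
only_ovarian_failure = ovarian_failure - teratospermia

print(sorted(shared))
print(sorted(only_ovarian_failure))
print(len(teratospermia | ovarian_failure))
```

The single gene unique to OVARIAN FAILURE and the size of the union agree with the counts given in Section 5.3.4.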

This analysis presents an integrative knowledge-based approach to find mechanistic links

between statistically associated preclinical and clinical toxicities. This approach can be applied

for similar datasets in the future to determine the interrelationship between toxicological data.

We extend the work detailed in Chapter 4 by suggesting mechanistic drivers for the

associations previously found. The targets identified can be assessed for inclusion in secondary

pharmacology safety panel screens. By including in vitro, in vivo and clinical toxicity data and

requiring that the target be associated with both the preclinical and clinical toxicities, we

suggest that the mechanisms identified by our method should be highly relevant to toxicities

across species, helping to anticipate both animal model and human toxicities. Finally, this type

of analysis can be used to provide semantic reasoning between AEs, as well as a basis for machine

learning models of clinical toxicity.

6 Conclusions and Future Perspective

In this thesis, we discuss computational methods for the prediction of selectivity and toxicity

endpoints within drug discovery. We integrated data from different sources in order to generate

new knowledge from data mining and machine learning approaches.

For selectivity we show that multi-target proteochemometric (PCM) models can be useful for

predicting the polypharmacology of compounds against related biological targets and

outperform single-target QSAR models in the case of bromodomain-containing proteins. This

was attributed to the fact that PCM can interpolate for new targets and is therefore useful

for predicting activity for targets with less data, termed orphan targets. We established that for

the models to be practically useful, we needed to define the applicability domain for the models,

for which we successfully implemented Mondrian cross-conformal prediction. We discovered

new and structurally distinct chemical hits for the bromodomain-containing protein family

from shortlisting compounds for test, achieving a hit rate of 31% overall. By interpreting the

model, it was possible to identify the key residues in the active site of different bromodomain

proteins which could be exploited for activity and selectivity in future design. We were able to

link these residues to the binding modes of hits derived from this study and previous literature

knowledge, showing that there was confidence that the model was based on knowledge of

bromodomain inhibitor binding. A particularly relevant extension of this work would be to

include stable binding site water molecules and their location, as well as their network

properties, as target descriptors in addition to the amino acid alignment-dependent descriptors

employed here. This would incorporate the more recent knowledge (319) developed around how

water networks influence the binding of inhibitors to bromodomains and contribute to

observed selectivity (379). The PCM modelling technique can be applied to other target classes to

provide off-target in silico screening platforms within drug discovery, to prioritise targets for

experimental follow up, as potential secondary pharmacology targets. Interpretation of the

models for bromodomain targets and other target families could provide guidance for

designing small molecule inhibitors for these targets in the future.
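
As an illustration of how a Mondrian (class-conditional) cross-conformal predictor assigns labels, the following Python sketch computes per-class p-values from calibration nonconformity scores and averages them over folds. It is a simplified sketch under stated assumptions rather than the implementation used in this thesis: the nonconformity scores are made-up numbers, the function names are hypothetical, and averaging the fold-level p-values is only one of several possible aggregation choices.

```python
from statistics import mean


def mondrian_p_value(calibration_scores, test_score):
    """p-value for one class: share of that class's calibration
    nonconformity scores that are at least as large as the test score."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def cross_conformal_p_values(fold_calibration, fold_test_scores):
    """Average the class-wise p-values over folds (assumed aggregation).

    fold_calibration: one dict per fold, {label: [calibration scores]}
    fold_test_scores: one dict per fold, {label: test nonconformity score}
    """
    labels = fold_test_scores[0].keys()
    return {
        label: mean(
            mondrian_p_value(cal[label], test[label])
            for cal, test in zip(fold_calibration, fold_test_scores)
        )
        for label in labels
    }


def prediction_set(p_values, significance):
    """Labels that cannot be rejected at the chosen significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Made-up nonconformity scores for two folds, for illustration only.
fold_calibration = [
    {"active": [0.1, 0.2, 0.3, 0.4], "inactive": [0.1, 0.2, 0.6, 0.7]},
    {"active": [0.05, 0.25, 0.35, 0.5], "inactive": [0.2, 0.3, 0.5, 0.8]},
]
fold_test_scores = [
    {"active": 0.15, "inactive": 0.9},
    {"active": 0.3, "inactive": 0.85},
]

p_values = cross_conformal_p_values(fold_calibration, fold_test_scores)
labels = prediction_set(p_values, significance=0.25)
print(p_values, labels)
```

A compound whose prediction set contains exactly one label can be treated as inside the applicability domain at that significance level, whereas an empty set or a set containing both labels flags a prediction that should not be relied upon.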

For toxicity modelling we explored association analyses to quantify the conditional risk of

experiencing a clinical adverse event (AE), given the presence of a preclinical AE. In contrast to

previous studies, we did not require that the preclinical and clinical AEs be encoded by the

same term and therefore our study extended previous concordance approaches. We found 2050

significant associations which can be used to assess which animal models are relevant to human

toxicity prediction, and which, given the data, are not important to determining clinical toxicity.

By constructing a network, we showed that different relationships between AEs existed, including

one preclinical adverse event associated with one clinical adverse event, many preclinical

events associated with one clinical adverse event and one preclinical adverse event associated

with multiple clinical adverse events. In general, the latter relationship existed more frequently,

and the most relevant preclinical AEs were identified for the largest cluster. These preclinical

AEs were associated with a higher risk of experiencing a range of clinical AEs and should be

considered important information to be continually derived from animal models in the future.

The most significant associations were analysed to determine the degree of mechanistic

support; in total, 60 out of 248 associations were found to have gene evidence in common

between the drugs that caused both AEs, the preclinical AE and the clinical AE. The protein targets

identified are suggested for follow-up and possible inclusion in secondary pharmacology screening

panels to assess the potential risk of experiencing toxicity in vivo before proceeding to animal

studies. This analysis presents a systematic, data-driven and less biased approach for finding

compelling associations between preclinical and clinical toxicities, which can be applied in

similar studies to determine the interrelationship between toxicological data. This study can

provide a starting point to evaluate which preclinical models are useful to assess the risk of

experiencing which clinical toxicities. We furthermore see the utility of methods such as these

in hypothesis generation for mechanistic drivers for toxicity, potentially with the application of

inclusion in toxicity panel screens. For associations where there was no overlap in drug targets

with genes found from AEs, it would be interesting to further investigate if there was gene

overlap in the AE spaces. This could lead to additional mechanistic hypotheses for previously

untested off-targets which may be modulated by the drugs. Finally, this type of analysis can

provide semantic reasoning between AEs, as well as a basis for machine learning models of clinical

toxicity.
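
To make the idea of conditional risk concrete, the sketch below builds a 2x2 contingency table for one preclinical/clinical AE pair, compares the risk of the clinical AE with and without the preclinical AE, and computes a one-sided Fisher's exact p-value. The statistics, function names and drug identifiers here are assumptions chosen for illustration; they are not taken from the analysis in this thesis, which may have used different measures and corrections.

```python
from math import comb


def contingency(pre_drugs, clin_drugs, all_drugs):
    """Counts of drugs with both AEs, preclinical only, clinical only, neither."""
    universe = set(all_drugs)
    pre = set(pre_drugs) & universe
    clin = set(clin_drugs) & universe
    a = len(pre & clin)
    b = len(pre - clin)
    c = len(clin - pre)
    d = len(universe - pre - clin)
    return a, b, c, d


def conditional_risks(a, b, c, d):
    """P(clinical AE | preclinical AE) and P(clinical AE | no preclinical AE)."""
    risk_with = a / (a + b) if (a + b) else float("nan")
    risk_without = c / (c + d) if (c + d) else float("nan")
    return risk_with, risk_without


def fisher_one_sided(a, b, c, d):
    """Probability of seeing at least `a` co-occurrences under independence
    (hypergeometric tail, i.e. one-sided Fisher's exact test)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Toy drug sets for illustration only; these are not data from this work.
all_drugs = [f"D{i}" for i in range(20)]
pre_drugs = [f"D{i}" for i in range(6)]
clin_drugs = ["D0", "D1", "D2", "D3", "D10", "D11"]

counts = contingency(pre_drugs, clin_drugs, all_drugs)
risk_with, risk_without = conditional_risks(*counts)
p_value = fisher_one_sided(*counts)
print(counts, round(risk_with, 3), round(risk_without, 3), round(p_value, 4))
```

When many AE pairs are tested in this way, the resulting p-values would need a multiple-testing correction before any pair is called significant.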

7 References

(1) Mohs, Richard C.; Greig, Nigel H. Drug Discovery and Development: Role of Basic

Biological Research. Alzheimer’s Dement. Transl. Res. Clin. Interv. 2017, 3, 651–657.

(2) Paul, Steven M.; Mytelka, Daniel S.; Dunwiddie, Christopher T.; Persinger, Charles C.;

Munos, Bernard H.; Lindborg, Stacy R.; Schacht, Aaron L. How to Improve R&D

Productivity: The Pharmaceutical Industry’s Grand Challenge. Nat. Rev. Drug Discov.

2010, 9, 203–214.

(3) DiMasi, J. A.; Feldman, L.; Seckler, A.; Wilson, A. Trends in Risks Associated with New

Drug Development: Success Rates for Investigational Drugs. Clinical Pharmacology and

Therapeutics. March 3, 2010, pp 272–277.

(4) DiMasi, Joseph A.; Grabowski, Henry G.; Hansen, Ronald W. Innovation in the

Pharmaceutical Industry: New Estimates of R&D Costs. J. Health Econ. 2016, 47, 20–33.

(5) Huggins, David J.; Sherman, Woody; Tidor, Bruce. Rational Approaches to Improving

Selectivity in Drug Design. J. Med. Chem. 2012, 55, 1424–1444.

(6) Reddy, A. Srinivas; Zhang, Shuxing. Polypharmacology: Drug Discovery for the Future.

Expert Rev. Clin. Pharmacol. 2013, 6, 41–47.

(7) de Lera, Angel R.; Ganesan, A. Epigenetic Polypharmacology: From Combination

Therapy to Multitargeted Drugs. Clin. Epigenetics 2016, 8, 105.

(8) Ortiz, Angel; Gomez-Puertas, Paulino; Leo-Macias, Alejandra; Lopez-Romero, Pedro;

Lopez-Vinas, Eduardo; Morreale, Antonio; Murcia, Marta; Wang, Kun. Computational

Approaches to Model Ligand Selectivity in Drug Design. Curr. Top. Med. Chem. 2005, 6,

41–55.

(9) Mencher, Simon K.; Wang, Long G. Promiscuous Drugs Compared to Selective Drugs

(Promiscuity Can Be a Virtue). BMC Clin. Pharmacol. 2005, 5, 3.

(10) Méndez-Lucio, Oscar; Jesús Naveja, J.; Vite-Caritino, Hugo; Prieto-Martínez, Fernando

D.; Medina-Franco, José L. Polypharmacology in Drug Discovery. In Drug Selectivity: An

Evolving Concept in Medicinal Chemistry; Handler, Norbert, Buschmann, Helmut, Eds.;

Wiley-VCH Verlag GmbH & Co. KGaA, 2010; p 510.

(11) Hopkins, Andrew L. Network Pharmacology: The next Paradigm in Drug Discovery. Nat.

Chem. Biol. 2008, 4, 682–690.

(12) Roth, Bryan L.; Sheffer, Douglas J.; Kroeze, Wesley K. Magic Shotguns versus Magic

Bullets: Selectively Non-Selective Drugs for Mood Disorders and Schizophrenia. Nat. Rev.

Drug Discov. 2004, 3, 353–359.

(13) Lapins, Maris; Wikberg, Jarl E. S. Kinome-Wide Interaction Modelling Using Alignment-

Based and Alignment-Independent Approaches for Kinase Description and Linear and

Non-Linear Data Analysis Techniques. BMC Bioinformatics 2010, 11, 339.

(14) Fernández, Ariel; Crespo, Alejandro; Tiwari, Abhinav. Is There a Case for Selectively

Promiscuous Anticancer Drugs? Drug Discov. Today 2009, 14, 1–5.

(15) Karaman, Mazen W.; Herrgard, Sanna; Treiber, Daniel K.; Gallant, Paul; Atteridge, Corey

E.; Campbell, Brian T.; Chan, Katrina W.; Ciceri, Pietro; Davis, Mindy I.; Edeen, Philip T.;

Faraoni, Raffaella; Floyd, Mark; Hunt, Jeremy P.; Lockhart, Daniel J.; Milanov, Zdravko

V; Morrison, Michael J.; Pallares, Gabriel; Patel, Hitesh K.; Pritchard, Stephanie; et al. A

Quantitative Analysis of Kinase Inhibitor Selectivity. Nat. Biotechnol. 2008, 26, 127–132.

(16) Klaeger, Susan; Heinzlmeir, Stephanie; Wilhelm, Mathias; Polzer, Harald; Vick, Binje;

Koenig, Paul Albert; Reinecke, Maria; Ruprecht, Benjamin; Petzoldt, Svenja; Meng, Chen;

Zecha, Jana; Reiter, Katrin; Qiao, Huichao; Helm, Dominic; Koch, Heiner; Schoof,

Melanie; Canevari, Giulia; Casale, Elena; Re Depaolini, Stefania; et al. The Target

Landscape of Clinical Kinase Drugs. Science 2017, 358, eaan4368.

(17) Davis, Mindy I.; Hunt, Jeremy P.; Herrgard, Sanna; Ciceri, Pietro; Wodicka, Lisa M.;

Pallares, Gabriel; Hocker, Michael; Treiber, Daniel K.; Zarrinkar, Patrick P.

Comprehensive Analysis of Kinase Inhibitor Selectivity. Nat. Biotechnol. 2011, 29, 1046–

1051.

(18) Haupt, V. Joachim; Daminelli, Simone; Schroeder, Michael. Drug Promiscuity in PDB:

Protein Binding Site Similarity Is Key. PLoS One 2013, 8, e65894.

(19) Ghuman, Jamie; Zunszain, Patricia A.; Petitpas, Isabelle; Bhattacharya, Ananyo A.;

Otagiri, Masaki; Curry, Stephen. Structural Basis of the Drug-Binding Specificity of

Human Serum Albumin. J. Mol. Biol. 2005, 353, 38–52.

(20) Smith, Dennis A.; Di, Li; Kerns, Edward H. The Effect of Plasma Protein Binding on in

Vivo Efficacy: Misconceptions in Drug Discovery. Nat. Rev. Drug Discov. 2010, 9, 929–

939.

(21) Urban, Laszlo; Whitebread, Steven; Hamon, Jacques; Mikhailov, Dmitri; Azzaoui, Kamal.

Screening for Safety-Relevant Off-Target Activities. In Polypharmacology in Drug

Discovery; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2012; pp 15–46.

(22) Goedken, Eric R.; Devanarayan, Viswanath; Harris, Christopher M.; Dowding, Lori A.;

Jakway, James P.; Voss, Jeffrey W.; Wishart, Neil; Jordan, David C.; Talanian, Robert V.

Minimum Significant Ratio of Selectivity Ratios (MSRSR) and Confidence in Ratio of

Selectivity Ratios (CRSR): Quantitative Measures for Selectivity Ratios Obtained by

Screening Assays. J. Biomol. Screen. 2012, 17, 857–867.

(23) Jackson, Craig M.; Esnouf, M. Peter; Winzor, Donald J.; Duewer, David L. Defining and

Measuring Biological Activity: Applying the Principles of Metrology. Accredit. Qual.

Assur. 2007, 12, 283–294.

(24) Hulme, Edward C.; Trevethick, Mike A. Ligand Binding Assays at Equilibrium: Validation

and Interpretation. Br. J. Pharmacol. 2010, 161, 1219–1237.

(25) Maréchal, Eric. Measuring Bioactivity: KI, IC50 and EC50. In Chemogenomics and

Chemical Genetics; Springer Berlin Heidelberg: Berlin, Heidelberg, 2011; pp 55–65.

(26) Cheng, Yung-Chi; Prusoff, William H. Relationship between the Inhibition Constant (KI)

and the Concentration of Inhibitor Which Causes 50 per Cent Inhibition (I50) of an

Enzymatic Reaction. Biochem. Pharmacol. 1973, 22, 3099–3108.

(27) Mestres, Jordi; Gregori-Puigjané, Elisabet; Valverde, Sergi; Solé, Ricard V. Data

Completeness—the Achilles Heel of Drug-Target Networks. Nat. Biotechnol. 2008, 26,

983–984.

(28) Jalencas, Xavier; Mestres, Jordi. On the Origins of Drug Polypharmacology.

Medchemcomm 2013, 4, 80–87.

(29) Hu, Ye; Gupta-Ostermann, Disha; Bajorath, Jürgen. Exploring Compound Promiscuity

Patterns and Multi-Target Activity Spaces. Comput. Struct. Biotechnol. J. 2014, 9,

e201401003.

(30) Pervaiz, Mehrosh; Mishra, Pankaj; Günther, Stefan. Bromodomain Drug Discovery – the

Past, the Present, and the Future. Chem. Rec. 2018, 18, 1808–1817.

(31) Galdeano, Carles; Ciulli, Alessio. Selectivity On-Target of Bromodomain Chemical Probes

by Structure-Guided Medicinal Chemistry and Chemical Biology. Future Med. Chem.

2016, 8, 1655–1680.

(32) Sanchez, Roberto; Meslamani, Jamel; Zhou, Ming-Ming. The Bromodomain: From

Epigenome Reader to Druggable Target. Biochim. Biophys. Acta 2014, 1839, 676–685.

(33) Muller, Susanne; Filippakopoulos, Panagis; Knapp, Stefan. Bromodomains as

Therapeutic Targets. Expert Rev. Mol. Med. 2011, 13, e29.

(34) Moustakim, Moses; Clark, Peter G. K.; Hay, Duncan A.; Dixon, Darren J.; Brennan, Paul

E. Chemical Probes and Inhibitors of Bromodomains Outside the BET Family.

Medchemcomm 2016, 7, 2246–2264.

(35) Josling, Gabrielle A.; Selvarajah, Shamista A.; Petter, Michaela; Duffy, Michael F. The Role

of Bromodomain Proteins in Regulating Gene Expression. Genes (Basel). 2012, 3, 320–

343.

(36) Ferri, Elena; Petosa, Carlo; McKenna, Charles E. Bromodomains: Structure, Function and

Pharmacology of Inhibition. Biochem. Pharmacol. 2016, 106, 1–18.

(37) Arkin, Michelle R.; Tang, Yinyan; Wells, James A. Small-Molecule Inhibitors of Protein-

Protein Interactions: Progressing toward the Reality. Chem. Biol. 2014, 21, 1102–1114.

(38) Filippakopoulos, Panagis; Picaud, Sarah; Mangos, Maria; Keates, Tracy; Lambert, Jean-

Philippe; Barsyte-Lovejoy, Dalia; Felletar, Ildiko; Volkmer, Rudolf; Müller, Susanne;

Pawson, Tony; Gingras, Anne-Claude; Arrowsmith, Cheryl H.; Knapp, Stefan. Histone

Recognition and Large-Scale Structural Analysis of the Human Bromodomain Family.

Cell 2012, 149, 214–231.

(39) Brand, Michael; Measures, Angelina M.; Wilson, Brian G.; Cortopassi, Wilian A.;

Alexander, Rikki; Höss, Matthias; Hewings, David S.; Rooney, Timothy P. C.; Paton,

Robert S.; Conway, Stuart J. Small Molecule Inhibitors of Bromodomain-Acetyl-Lysine

Interactions. ACS Chem. Biol. 2015, 10, 22–39.

(40) Prinjha, Rab K.; Witherington, Jason; Lee, Kevin. Place Your BETs: The Therapeutic

Potential of Bromodomains. Trends Pharmacol. Sci. 2012, 33, 146–153.

(41) Müller, Susanne; Knapp, Stefan. Discovery of BET Bromodomain Inhibitors and Their

Role in Target Validation. Medchemcomm 2014, 5, 288–296.

(42) Atkinson, Stephen J.; Soden, Peter E.; Angell, Davina C.; Bantscheff, Marcus; Chung,

Chun Wa; Giblin, Kathryn A.; Smithers, Nicholas; Furze, Rebecca C.; Gordon, Laurie;

Drewes, Gerard; Rioja, Inmaculada; Witherington, Jason; Parr, Nigel J.; Prinjha, Rab K.

The Structure Based Design of Dual HDAC/BET Inhibitors as Novel Epigenetic Probes.

Medchemcomm 2014, 5, 342–351.

(43) Wang, Limei; Wu, Xiuyin; Huang, Ping; Lv, Zhijun; Qi, Yuping; Wei, Xiujuan; Yang,

Pishan; Zhang, Fenghe. JQ1, a Small Molecule Inhibitor of BRD4, Suppresses Cell Growth

and Invasion in Oral Squamous Cell Carcinoma. Oncol. Rep. 2016, 36, 1989–1996.

(44) Mirguet, Olivier; Gosmini, Romain; Toum, Jérôme; Clément, Catherine A.; Barnathan,

Mélanie; Brusq, Jean-Marie; Mordaunt, Jacqueline E.; Grimes, Richard M.; Crowe,

Miriam; Pineau, Olivier; Ajakane, Myriam; Daugan, Alain; Jeffrey, Phillip; Cutler, Leanne;

Haynes, Andrea C.; Smithers, Nicholas N.; Chung, Chun-wa; Bamborough, Paul; Uings,

Iain J.; et al. Discovery of Epigenetic Regulator I-BET762: Lead Optimization to Afford a

Clinical Candidate Inhibitor of the BET Bromodomains. J. Med. Chem. 2013, 56, 7501–

7515.

(45) Filippakopoulos, Panagis; Qi, Jun; Picaud, Sarah; Shen, Yao; Smith, William B.; Fedorov,

Oleg; Morse, Elizabeth M.; Keates, Tracy; Hickman, Tyler T.; Felletar, Ildiko; Philpott,

Martin; Munro, Shonagh; McKeown, Michael R.; Wang, Yuchuan; Christie, Amanda L.;

West, Nathan; Cameron, Michael J.; Schwartz, Brian; Heightman, Tom D.; et al. Selective

Inhibition of BET Bromodomains. Nature 2010, 468, 1067–1073.

(46) Klein, Kerstin. Bromodomain Protein Inhibition: A Novel Therapeutic Strategy in

Rheumatic Diseases. RMD Open 2018, 4, e000744.

(47) Vogelstein, Bert; Lane, David; Levine, Arnold J. Surfing the P53 Network. Nature 2000,

408, 307–310.

(48) Borah, Jagat C.; Mujtaba, Shiraz; Karakikes, Ioannis; Zeng, Lei; Muller, Michaela; Patel,

Jigneshkumar; Moshkina, Natasha; Morohashi, Keita; Zhang, Weijia; Gerona-Navarro,

Guillermo; Hajjar, Roger J.; Zhou, Ming-Ming. A Small Molecule Binding to the

Coactivator CREB-Binding Protein Blocks Apoptosis in Cardiomyocytes. Chem. Biol.

2011, 18, 531–541.

(49) Rooney, Timothy P. C.; Filippakopoulos, Panagis; Fedorov, Oleg; Picaud, Sarah;

Cortopassi, Wilian A.; Hay, Duncan A.; Martin, Sarah; Tumber, Anthony; Rogers,

Catherine M.; Philpott, Martin; Wang, Minghua; Thompson, Amber L.; Heightman, Tom

D.; Pryde, David C.; Cook, Andrew; Paton, Robert S.; Müller, Susanne; Knapp, Stefan;

Brennan, Paul E.; et al. A Series of Potent CREBBP Bromodomain Ligands Reveals an

Induced-Fit Pocket Stabilized by a Cation-π Interaction. Angew. Chemie Int. Ed. 2014, 53,

6126–6130.

(50) Ghosh, Srimoyee; Taylor, Alexander; Chin, Melissa; Huang, Hon Ren; Conery, Andrew

R.; Mertz, Jennifer A.; Salmeron, Andres; Dakle, Pranal J.; Mele, Deanna; Cote, Alexandre;

Jayaram, Hari; Setser, Jeremy W.; Poy, Florence; Hatzivassiliou, Georgia; DeAlmeida-

Nagata, Denise; Sandy, Peter; Hatton, Charlie; Romero, F. Anthony; Chiang, Eugene; et

al. Regulatory T Cell Modulation by CBP/EP300 Bromodomain Inhibition. J. Biol. Chem.

2016, 291, 13014–13027.

(51) Zeng, Lei; Li, Jiaming; Muller, Michaela; Yan, Sherry; Mujtaba, Shiraz; Pan, Chongfeng;

Wang, Zhiyong; Zhou, Ming-Ming. Selective Small Molecules Blocking HIV-1 Tat and

Coactivator PCAF Association. J. Am. Chem. Soc. 2005, 127, 2376–2377.

(52) Crawford, Terry D.; Audia, James E.; Bellon, Steve; Burdick, Daniel J.; Bommi-Reddy,

Archana; Côté, Alexandre; Cummings, Richard T.; Duplessis, Martin; Flynn, E. Megan;

Hewitt, Michael; Huang, Hon-Ren; Jayaram, Hariharan; Jiang, Ying; Joshi, Shivangi;

Kiefer, James R.; Murray, Jeremy; Nasveschuk, Christopher G.; Neiss, Arianne; Pardo,

Eneida; et al. GNE-886: A Potent and Selective Inhibitor of the Cat Eye Syndrome

Chromosome Region Candidate 2 Bromodomain (CECR2). ACS Med. Chem. Lett. 2017,

8, 737–741.

(53) Kirberger, Steven E.; Ycas, Peter D.; Johnson, Jorden A.; Chen, Chen; Ciccone, Michael F.;

Woo, Rinette W. L.; Urick, Andrew K.; Zahid, Huda; Shi, Ke; Aihara, Hideki; McAllister,

Sean D.; Kashani-Sabet, Mohammed; Shi, Junwei; Dickson, Alex; Dos Santos, Camila O.;

Pomerantz, William C. K. Selectivity, Ligand Deconstruction, and Cellular Activity

Analysis of a BPTF Bromodomain Inhibitor. Org. Biomol. Chem. 2019, 17, 2020–2027.

(54) Drouin, Ludovic; McGrath, Sally; Vidler, Lewis R.; Chaikuad, Apirat; Monteiro, Octovia;

Tallant, Cynthia; Philpott, Martin; Rogers, Catherine; Fedorov, Oleg; Liu, Manjuan;

Akhtar, Wasim; Hayes, Angela; Raynaud, Florence; Müller, Susanne; Knapp, Stefan;

Hoelder, Swen. Structure Enabled Design of BAZ2-ICR, A Chemical Probe Targeting the

Bromodomains of BAZ2A and BAZ2B. J. Med. Chem. 2015, 58, 2553–2559.

(55) Chen, Peiling; Chaikuad, Apirat; Bamborough, Paul; Bantscheff, Marcus; Bountra, Chas;

Chung, Chun Wa; Fedorov, Oleg; Grandi, Paola; Jung, David; Lesniak, Robert; Lindon,

Matthew; Müller, Susanne; Philpott, Martin; Prinjha, Rab; Rogers, Catherine; Selenski,

Carolyn; Tallant, Cynthia; Werner, Thilo; Willson, Timothy M.; et al. Discovery and

Characterization of GSK2801, a Selective Chemical Probe for the Bromodomains BAZ2A

and BAZ2B. J. Med. Chem. 2016, 59, 1410–1424.

(56) Clark, Peter G. K.; Vieira, Lucas C. C.; Tallant, Cynthia; Fedorov, Oleg; Singleton, Dean

C.; Rogers, Catherine M.; Monteiro, Octovia P.; Bennett, James M.; Baronio, Roberta;

Müller, Susanne; Daniels, Danette L.; Méndez, Jacqui; Knapp, Stefan; Brennan, Paul E.;

Dixon, Darren J. LP99: Discovery and Synthesis of the First Selective BRD7/9

Bromodomain Inhibitor. Angew. Chemie - Int. Ed. 2015, 54, 6217–6221.

(57) Theodoulou, Natalie H.; Bamborough, Paul; Bannister, Andrew J.; Becher, Isabelle; Bit,

Rino A.; Che, Ka Hing; Chung, Chun Wa; Dittmann, Antje; Drewes, Gerard; Drewry,

David H.; Gordon, Laurie; Grandi, Paola; Leveridge, Melanie; Lindon, Matthew; Michon,

Anne Marie; Molnar, Judit; Robson, Samuel C.; Tomkinson, Nicholas C. O.; Kouzarides,

Tony; et al. Discovery of I-BRD9, a Selective Cell Active Chemical Probe for

Bromodomain Containing Protein 9 Inhibition. J. Med. Chem. 2016, 59, 1425–1439.

(58) Martin, Laetitia J.; Koegl, Manfred; Bader, Gerd; Cockcroft, Xiao Ling; Fedorov, Oleg;

Fiegen, Dennis; Gerstberger, Thomas; Hofmann, Marco H.; Hohmann, Anja F.; Kessler,

Dirk; Knapp, Stefan; Knesl, Petr; Kornigg, Stefan; Müller, Susanne; Nar, Herbert; Rogers,

Catherine; Rumpel, Klaus; Schaaf, Otmar; Steurer, Steffen; et al. Structure-Based Design

of an in Vivo Active Selective BRD9 Inhibitor. J. Med. Chem. 2016, 59, 4462–4475.

(59) Bamborough, Paul; Barnett, Heather A.; Becher, Isabelle; Bird, Mark J.; Chung, Chun-wa;

Craggs, Peter D.; Demont, Emmanuel H.; Diallo, Hawa; Fallon, David J.; Gordon, Laurie

J.; Grandi, Paola; Hobbs, Clare I.; Hooper-Greenhill, Edward; Jones, Emma J.; Law, Robert

P.; Le Gall, Armelle; Lugo, David; Michon, Anne-Marie; Mitchell, Darren J.; et al.

GSK6853, a Chemical Probe for Inhibition of the BRPF1 Bromodomain. ACS Med. Chem.

Lett. 2016, 7, 552–557.

(60) Meier, Julia C.; Tallant, Cynthia; Fedorov, Oleg; Witwicka, Hanna; Hwang, Sung Yong;

Van Stiphout, Ruud G.; Lambert, Jean Philippe; Rogers, Catherine; Yapp, Clarence;

Gerstenberger, Brian S.; Fedele, Vita; Savitsky, Pavel; Heidenreich, David; Daniels,

Danette L.; Owen, Dafydd R.; Fish, Paul V.; Igoe, Niall M.; Bayle, Elliott D.; Haendler,

Bernard; et al. Selective Targeting of Bromodomains of the Bromodomain-PHD Fingers

Family Impairs Osteoclast Differentiation. ACS Chem. Biol. 2017, 12, 2619–2630.

(61) Palmer, Wylie S.; Poncet-Montange, Guillaume; Liu, Gang; Petrocchi, Alessia; Reyna,

Naphtali; Subramanian, Govindan; Theroff, Jay; Yau, Anne; Kost-Alimova, Maria;

Bardenhagen, Jennifer P.; Leo, Elisabetta; Shepard, Hannah E.; Tieu, Trang N.; Shi, Xi;

Zhan, Yanai; Zhao, Shuping; Barton, Michelle C.; Draetta, Giulio; Toniatti, Carlo; et al.

Structure-Guided Design of IACS-9571, a Selective High-Affinity Dual TRIM24-BRPF1

Bromodomain Inhibitor. J. Med. Chem. 2016, 59, 1440–1454.

(62) Sepehri, Bakhtyar; Rasouli, Zolaikha; Hassanzadeh, Zeinabe; Ghavami, Raouf. Molecular

Docking and QSAR Analysis of Naphthyridone Derivatives as ATAD2 Bromodomain

Inhibitors: Application of CoMFA, LS-SVM, and RBF Neural Network. Med. Chem. Res.

2016, 25, 2895–2905.

(63) Bamborough, Paul; Chung, Chun wa; Demont, Emmanuel H.; Furze, Rebecca C.;

Bannister, Andrew J.; Che, Ka Hing; Diallo, Hawa; Douault, Clement; Grandi, Paola;

Kouzarides, Tony; Michon, Anne Marie; Mitchell, Darren J.; Prinjha, Rab K.; Rau,

Christina; Robson, Samuel; Sheppard, Robert J.; Upton, Richard; Watson, Robert J. A

Chemical Probe for the ATAD2 Bromodomain. Angew. Chemie - Int. Ed. 2016, 55, 11382–

11386.

(64) Bamborough, Paul; Chung, Chun Wa; Furze, Rebecca C.; Grandi, Paola; Michon, Anne

Marie; Sheppard, Robert J.; Barnett, Heather; Diallo, Hawa; Dixon, David P.; Douault,

Clement; Jones, Emma J.; Karamshi, Bhumika; Mitchell, Darren J.; Prinjha, Rab K.; Rau,

Christina; Watson, Robert J.; Werner, Thilo; Demont, Emmanuel H. Structure-Based

Optimization of Naphthyridones into Potent ATAD2 Bromodomain Inhibitors. J. Med.

Chem. 2015, 58, 6151–6178.

(65) Gerstenberger, Brian S.; Trzupek, John D.; Tallant, Cynthia; Fedorov, Oleg;

Filippakopoulos, Panagis; Brennan, Paul E.; Fedele, Vita; Martin, Sarah; Picaud, Sarah;

Rogers, Catherine; Parikh, Mihir; Taylor, Alexandria; Samas, Brian; O’Mahony, Alison;

Berg, Ellen; Pallares, Gabriel; Torrey, Adam D.; Treiber, Daniel K.; Samardjiev, Ivan J.; et

al. Identification of a Chemical Probe for Family VIII Bromodomains through

Optimization of a Fragment Hit. J. Med. Chem. 2016, 59, 4800–4811.

(66) Ember, Stuart W. J.; Zhu, Jin Yi; Olesen, Sanne H.; Martin, Mathew P.; Becker, Andreas;

Berndt, Norbert; Georg, Gunda I.; Schonbrunn, Ernst. Acetyl-Lysine Binding Site of

Bromodomain-Containing Protein 4 (BRD4) Interacts with Diverse Kinase Inhibitors.

ACS Chem. Biol. 2014, 9, 1160–1171.

(67) Dittmann, Antje; Werner, Thilo; Chung, Chun Wa; Savitski, Mikhail M.; Fälth Savitski,

Maria; Grandi, Paola; Hopf, Carsten; Lindon, Matthew; Neubauer, Gitte; Prinjha,

Rabinder K.; Bantscheff, Marcus; Drewes, Gerard. The Commonly Used PI3-Kinase Probe

LY294002 Is an Inhibitor of BET Bromodomains. ACS Chem. Biol. 2014, 9, 495–502.

(68) Filippakopoulos, Panagis; Knapp, Stefan. Targeting Bromodomains: Epigenetic Readers

of Lysine Acetylation. Nat. Rev. Drug Discov. 2014, 13, 337–356.

(69) Pérez-Salvia, Montserrat; Esteller, Manel. Bromodomain Inhibitors and Cancer Therapy:

From Structures to Applications. Epigenetics 2017, 12, 323–339.

(70) Filippakopoulos, Panagis; Knapp, Stefan. The Bromodomain Interaction Module. FEBS

Lett. 2012, 586, 2692–2704.

(71) Sharp, Phillip P.; Garnier, Jean Marc; Huang, David C. S.; Burns, Christopher J. Evaluation

of Functional Groups as Acetyl-Lysine Mimetics for BET Bromodomain Inhibition.

Medchemcomm 2014, 5, 1834–1842.

(72) ACD/Structure Elucidator. Advanced Chemistry Development, Inc.: Toronto, ON,

Canada 2019.

(73) Kleywegt, Gerard J.; Jones, T. Alwyn. Model Building and Refinement Practice. Methods

Enzymol. 1997, 277, 208–230.

(74) Crawford, Terry D.; Tsui, Vickie; Flynn, E. Megan; Wang, Shumei; Taylor, Alexander M.;

Côté, Alexandre; Audia, James E.; Beresini, Maureen H.; Burdick, Daniel J.; Cummings,

Richard; Dakin, Les A.; Duplessis, Martin; Good, Andrew C.; Hewitt, Michael C.; Huang,

Hon Ren; Jayaram, Hariharan; Kiefer, James R.; Jiang, Ying; Murray, Jeremy; et al. Diving

into the Water: Inducible Binding Conformations for BRD4, TAF1(2), BRD9, and CECR2

Bromodomains. J. Med. Chem. 2016, 59, 5391–5402.

(75) Humphreys, Philip G.; Bamborough, Paul; Chung, Chun Wa; Craggs, Peter D.; Gordon,

Laurie; Grandi, Paola; Hayhow, Thomas G.; Hussain, Jameed; Jones, Katherine L.; Lindon,

Matthew; Michon, Anne Marie; Renaux, Jessica F.; Suckling, Colin J.; Tough, David F.;

Prinjha, Rab K. Discovery of a Potent, Cell Penetrant, and Selective P300/CBP-Associated

Factor (PCAF)/General Control Nonderepressible 5 (GCN5) Bromodomain Chemical

Probe. J. Med. Chem. 2017, 60, 695–709.

(76) Chaikuad, Apirat; Lang, Steffen; Brennan, Paul E.; Temperini, Claudia; Fedorov, Oleg;

Hollander, Johan; Nachane, Ruta; Abell, Chris; Müller, Susanne; Siegal, Gregg; Knapp,

Stefan. Structure-Based Identification of Inhibitory Fragments Targeting the P300/CBP-

Associated Factor Bromodomain. J. Med. Chem. 2016, 59, 1648–1653.

(77) Unzue, Andrea; Zhao, Hongtao; Lolli, Graziano; Dong, Jing; Zhu, Jian; Zechner, Melanie;

Dolbois, Aymeric; Caflisch, Amedeo; Nevado, Cristina. The “Gatekeeper” Residue

Influences the Mode of Binding of Acetyl Indoles to Bromodomains. J. Med. Chem. 2016,

59, 3087–3097.

(78) Su, Jing; Liu, Xinguo; Zhang, Shaolong; Yan, Fangfang; Zhang, Qinggang; Chen,

Jianzhong. A Theoretical Insight into Selectivity of Inhibitors toward Two Domains of

Bromodomain-Containing Protein 4 Using Molecular Dynamics Simulations. Chem. Biol.

Drug Des. 2018, 91, 828–840.

(79) Xing, Jing; Lu, Wenchao; Liu, Rongfeng; Wang, Yulan; Xie, Yiqian; Zhang, Hao; Shi, Zhe;

Jiang, Hao; Liu, Yu Chih; Chen, Kaixian; Jiang, Hualiang; Luo, Cheng; Zheng, Mingyue.

Machine-Learning-Assisted Approach for Discovering Novel Inhibitors Targeting

Bromodomain-Containing Protein 4. J. Chem. Inf. Model. 2017, 57, 1677–1690.

(80) Ma, Junlong; Chen, Heng; Yang, Jie; Yu, Zutao; Huang, Pan; Yang, Haofeng; Zheng,

Bifeng; Liu, Rangru; Li, Qianbin; Hu, Gaoyun; Chen, Zhuo. Binding Pocket-Based Design,

Synthesis and Biological Evaluation of Novel Selective BRD4-BD1 Inhibitors. Bioorganic

Med. Chem. 2019, 27, 1871–1881.

(81) Unzue, Andrea; Xu, Min; Dong, Jing; Wiedmer, Lars; Spiliotopoulos, Dimitrios; Caflisch,

Amedeo; Nevado, Cristina. Fragment-Based Design of Selective Nanomolar Ligands of

the CREBBP Bromodomain. J. Med. Chem. 2016, 59, 1350–1356.

(82) Cortopassi, Wilian A.; Kumar, Kiran; Paton, Robert S. Cation-π Interactions in CREBBP

Bromodomain Inhibition: An Electrostatic Model for Small-Molecule Binding Affinity

and Selectivity. Org. Biomol. Chem. 2016, 14, 10926–10938.

(83) Hewings, David S.; Fedorov, Oleg; Filippakopoulos, Panagis; Martin, Sarah; Picaud,

Sarah; Tumber, Anthony; Wells, Christopher; Olcina, Monica M.; Freeman, Katherine;

Gill, Andrew; Ritchie, Alison J.; Sheppard, David W.; Russell, Angela J.; Hammond, Ester

M.; Knapp, Stefan; Brennan, Paul E.; Conway, Stuart J. Optimization of 3,5-

Dimethylisoxazole Derivatives as Potent Bromodomain Ligands. J. Med. Chem. 2013, 56,

3217–3227.

(84) Lai, Kwong Wah; Romero, F. Anthony; Tsui, Vickie; Beresini, Maureen H.; de Leon

Boenig, Gladys; Bronner, Sarah M.; Chen, Kevin; Chen, Zhongguo; Choo, Edna F.;

Crawford, Terry D.; Cyr, Patrick; Kaufman, Susan; Li, Yingjie; Liao, Jiangpeng; Liu,

Wenfeng; Ly, Justin; Murray, Jeremy; Shen, Weichao; Wai, John; et al. Design and

Synthesis of a Biaryl Series as Inhibitors for the Bromodomains of CBP/P300. Bioorganic

Med. Chem. Lett. 2018, 28, 15–23.

(85) Bamborough, Paul; Chung, Chun Wa; Furze, Rebecca C.; Grandi, Paola; Michon, Anne

Marie; Watson, Robert J.; Mitchell, Darren J.; Barnett, Heather; Prinjha, Rab K.; Rau,

Christina; Sheppard, Robert J.; Werner, Thilo; Demont, Emmanuel H. Aiming to Miss a

Moving Target: Bromo and Extra Terminal Domain (BET) Selectivity in Constrained

ATAD2 Inhibitors. J. Med. Chem. 2018, 61, 8321–8336.

(86) Miller, Duncan C.; Martin, Mathew P.; Adhikari, Santosh; Brennan, Alfie; Endicott, Jane

A.; Golding, Bernard T.; Hardcastle, Ian R.; Heptinstall, Amy; Hobson, Stephen; Jennings,

Claire; Molyneux, Lauren; Ng, Yvonne; Wedge, Stephen R.; Noble, Martin E. M.; Cano,

Celine. Identification of a Novel Ligand for the ATAD2 Bromodomain with Selectivity

over BRD4 through a Fragment Growing Approach. Org. Biomol. Chem. 2018, 16, 1843–

1850.

(87) Lloyd, Jonathan T.; Gay, Jamie C.; Tonelli, Marco; Cornilescu, Gabriel; Nguyen, Paul;

Carlson, Samuel; Markley, John L.; Glass, Karen C. Structural Insights into the

Recognition of Mono- and Di-Acetyllysine by the ATAD2B Bromodomain-ATPase Family

AAA+ Domain Containing 2. bioRxiv 2018, 263624.

(88) Zhu, Jian; Zhou, Chunxian; Caflisch, Amedeo. Structure-Based Discovery of Selective

BRPF1 Bromodomain Inhibitors. Eur. J. Med. Chem. 2018, 155, 337–352.

(89) Demont, Emmanuel H.; Bamborough, Paul; Chung, Chun Wa; Craggs, Peter D.; Fallon,

David; Gordon, Laurie J.; Grandi, Paola; Hobbs, Clare I.; Hussain, Jameed; Jones, Emma

J.; Le Gall, Armelle; Michon, Anne Marie; Mitchell, Darren J.; Prinjha, Rab K.; Roberts,

Andy D.; Sheppard, Robert J.; Watson, Robert J. 1,3-Dimethyl Benzimidazolones Are

Potent, Selective Inhibitors of the Brpf1 Bromodomain. ACS Med. Chem. Lett. 2014, 5,

1190–1195.

(90) Bouché, Léa; Christ, Clara D.; Siegel, Stephan; Fernández-Montalván, Amaury E.; Holton,

Simon J.; Fedorov, Oleg; Ter Laak, Antonius; Sugawara, Tatsuo; Stöckigt, Detlef; Tallant,

Cynthia; Bennett, James; Monteiro, Octovia; Díaz-Sáez, Laura; Siejka, Paulina; Meier,

Julia; Pütter, Vera; Weiske, Jörg; Müller, Susanne; Huber, Kilian V. M.; et al.

Benzoisoquinolinediones as Potent and Selective Inhibitors of BRPF2 and TAF1/TAF1L


(91) Hui, Ma; Jian, Zhang; Peiyuan, Zheng; Zhenwei, Wu; Huibin, Zhang. Research Progress

of Selective Small Molecule Bromodomain-Containing Protein 9 Inhibitors. Future Med.

Chem. 2018, 10, 895–906.

(92) Dalle Vedove, Andrea; Spiliotopoulos, Dimitrios; D’Agostino, Vito G.; Marchand, Jean

Rémy; Unzue, Andrea; Nevado, Cristina; Lolli, Graziano; Caflisch, Amedeo. Structural

Analysis of Small-Molecule Binding to the BAZ2A and BAZ2B Bromodomains.

ChemMedChem 2018, 13, 1479–1487.

(93) Bennett, James; Fedorov, Oleg; Tallant, Cynthia; Monteiro, Octovia; Meier, Julia; Gamble,

Vicky; Savitsky, Pavel; Nunez-Alonso, Graciela A.; Haendler, Bernard; Rogers, Catherine;

Brennan, Paul E.; Müller, Susanne; Knapp, Stefan. Discovery of a Chemical Tool Inhibitor

Targeting the Bromodomains of TRIM24 and BRPF. J. Med. Chem. 2016, 59, 1642–1647.

(94) McKeown, Michael R.; Shaw, Daniel L.; Fu, Harry; Liu, Shuai; Xu, Xiang; Marineau, Jason

J.; Huang, Yibo; Zhang, Xiaofeng; Buckley, Dennis L.; Kadam, Asha; Zhang, Zijuan;

Blacklow, Stephen C.; Qi, Jun; Zhang, Wei; Bradner, James E. Biased Multicomponent

Reactions to Develop Novel Bromodomain Inhibitors. J. Med. Chem. 2014, 57, 9019–9027.

(95) Myrianthopoulos, Vassilios; Gaboriaud-Kolar, Nicolas; Tallant, Cynthia; Hall, Michelle

Lynn; Grigoriou, Stylianos; Brownlee, Peter Moore; Fedorov, Oleg; Rogers, Catherine;

Heidenreich, David; Wanior, Marek; Drosos, Nikolaos; Mexia, Nikitia; Savitsky, Pavel;

Bagratuni, Tina; Kastritis, Efstathios; Terpos, Evangelos; Filippakopoulos, Panagis;

Müller, Susanne; Skaltsounis, Alexios Leandros; et al. Discovery and Optimization of a

Selective Ligand for the Switch/Sucrose Nonfermenting-Related Bromodomains of

Polybromo Protein-1 by the Use of Virtual Screening and Hydration Analysis. J. Med.

Chem. 2016, 59, 8787–8803.

(96) Ferri, Nicola; Siegl, Peter; Corsini, Alberto; Herrmann, Joerg; Lerman, Amir; Benghozi,

Renee. Drug Attrition during Pre-Clinical and Clinical Development: Understanding and

Managing Drug-Induced Cardiotoxicity. Pharmacol. Ther. 2013, 138, 470–484.

(97) Guengerich, F. Peter. Mechanisms of Drug Toxicity and Relevance to Pharmaceutical

Development. Drug Metab. Pharmacokinet. 2011, 26, 3–14.

(98) Fielden, Mark R.; Kolaja, Kyle L. The Role of Early in Vivo Toxicity Testing in Drug

Discovery Toxicology. Expert Opin. Drug Saf. 2008, 7, 107–110.

(99) Borzelleca, J. F. Paracelsus: Herald of Modern Toxicology. Toxicol. Sci. 2000, 53, 2–4.

(100) Liebler, Daniel C.; Guengerich, F. Peter. Elucidating Mechanisms of Drug-Induced

Toxicity. Nat. Rev. Drug Discov. 2005, 4, 410–420.

(101) Guengerich, F. Peter. Cytochrome P450s and Other Enzymes in Drug Metabolism and

Toxicity. AAPS J. 2006, 8, E101–E111.

(102) Furberg, Curt D.; Pitt, Bertram. Withdrawal of Cerivastatin from the World Market. Curr.

Control. Trials Cardiovasc. Med. 2001, 2, 205–207.

(103) Pichler, Werner J. Delayed Drug Hypersensitivity Reactions. Ann. Intern. Med. 2003, 139,

683.

(104) Faulk, W. Page. Clinical Aspects of Immunology: 3rd Edn; Wiley-Blackwell, 1976; Vol. 30.

(105) Dekant, Wolfgang. The Role of Biotransformation and Bioactivation in Toxicity. In

Molecular, Clinical and Environmental Toxicology; Luch, Andreas, Ed.; Berlin, 2009.

(106) Takano, T.; Miyazaki, Y. Metabolism of Dichloromethane and the Subsequent Binding of

Its Product, Carbon Monoxide, to Cytochrome P-450 in Perfused Rat Liver. Toxicol. Lett.

1988, 40, 93–96.

(107) Macherey, Anne-Christine; Dansette, Patrick M. Biotransformations Leading to Toxic

Metabolites: Chemical Aspects. Pract. Med. Chem. 2015, 585–614.

(108) Waring, Jeffrey F.; Anderson, Mark G. Idiosyncratic Toxicity: Mechanistic Insights

Gained from Analysis of Prior Compounds. Curr. Opin. Drug Discov. Devel. 2005, 8, 59–

65.

(109) Segura-Bedmar, Isabel; Martínez, Paloma; de Pablo-Sánchez, César. A Linguistic Rule-

Based Approach to Extract Drug-Drug Interactions from Pharmacological Documents.

BMC Bioinformatics 2011, 12, S1.

(110) Juurlink, David N.; Mamdani, Muhammad; Kopp, Alexander; Laupacis, Andreas;

Redelmeier, Donald A. Drug-Drug Interactions among Elderly Patients Hospitalized for

Drug Toxicity. J. Am. Med. Assoc. 2003, 289, 1652–1658.

(111) Safety Guidelines: ICH. https://www.ich.org/products/guidelines/safety/article/safety-guidelines.html

(accessed May 7, 2019).

(112) Deaton, Aimee M.; Fan, Fan; Zhang, Wei; Nguyen, Phuong A.; Ward, Lucas D.; Nioi, Paul.

Rationalizing Secondary Pharmacology Screening Using Human Genetic and

Pharmacological Evidence. Toxicol. Sci. 2019, 167, 593–603.

(113) Viskin, Sami. Long QT Syndromes and Torsade de Pointes. Lancet 1999, 354, 1625–1633.

(114) Thomas, D.; Karle, C. A.; Kiehn, J. The Cardiac HERG/IKr Potassium Channel as

Pharmacological Target: Structure, Function, Regulation, and Clinical Applications. Curr.

Pharm. Des. 2006, 12, 2271–2283.

(115) Shukla, Sunita J.; Huang, Ruili; Austin, Christopher P.; Xia, Menghang. The Future of

Toxicity Testing: A Focus on in Vitro Methods Using a Quantitative High-Throughput

Screening Platform. Drug Discov. Today 2010, 15, 997–1007.

(116) Krejsa, Cecile M.; Horvath, Dragos; Rogalski, Sherri L.; Penzotti, Julie E.; Mao, Boryeu;

Barbosa, Frédérique; Migeon, Jacques C. Predicting ADME Properties and Side Effects:

The BioPrint Approach. Curr. Opin. Drug Discov. Devel. 2003, 6, 470–480.

(117) Lounkine, Eugen; Keiser, Michael J.; Whitebread, Steven; Mikhailov, Dmitri; Hamon,

Jacques; Jenkins, Jeremy L.; Lavan, Paul; Weber, Eckhard; Doak, Allison K.; Côté, Serge;

Shoichet, Brian K.; Urban, Laszlo. Large-Scale Prediction and Testing of Drug Activity on

Side-Effect Targets. Nature 2012, 486, 361–367.

(118) Bowes, Joanne; Brown, Andrew J.; Hamon, Jacques; Jarolimek, Wolfgang; Sridhar, Arun;

Waldron, Gareth; Whitebread, Steven. Reducing Safety-Related Drug Attrition: The Use

of in Vitro Pharmacological Profiling. Nat. Rev. Drug Discov. 2012, 11, 909–922.

(119) Lynch, James J.; Van Vleet, Terry R.; Mittelstadt, Scott W.; Blomme, Eric A. G. Potential

Functional and Pathological Side Effects Related to Off-Target Pharmacological Activity.

J. Pharmacol. Toxicol. Methods 2017, 87, 108–126.

(120) Van Vleet, Terry R.; Liguori, Michael J.; Lynch, James J.; Rao, Mohan; Warder, Scott.

Screening Strategies and Methods for Better Off-Target Liability Prediction and

Identification of Small-Molecule Pharmaceuticals. SLAS Discov. 2019, 24, 1–24.

(121) Whitebread, Steven; Dumotier, Berengere; Armstrong, Duncan; Fekete, Alexander;

Chen, Shanni; Hartmann, Andreas; Muller, Patrick Y.; Urban, Laszlo. Secondary

Pharmacology: Screening and Interpretation of off-Target Activities – Focus on

Translation. Drug Discov. Today 2016, 21, 1232–1242.

(122) Houck, Keith A.; Kavlock, Robert J. Understanding Mechanisms of Toxicity: Insights from

Drug Discovery Research. Toxicol. Appl. Pharmacol. 2008, 227, 163–178.

(123) Obach, R. S. The Utility of in Vitro Cytochrome P450 Inhibition Data in the Prediction

of Drug-Drug Interactions. J. Pharmacol. Exp. Ther. 2005, 316, 336–348.

(124) Horváth, S. Cytotoxicity of Drugs and Diverse Chemical Agents to Cell Cultures.

Toxicology 1980, 16, 59–66.

(125) Niles, Andrew L.; Moravec, Richard A.; Riss, Terry L. Update on in Vitro Cytotoxicity

Assays for Drug Development. Expert Opin. Drug Discov. 2008, 3, 655–669.

(126) Cook, John A.; Mitchell, James B. Viability Measurements in Mammalian Cell Systems.

Anal. Biochem. 1989, 179, 1–7.

(127) Steinberg, Pablo. High-Throughput Screening Methods in Toxicity Testing; Steinberg,

Pablo, Ed.; Wiley, 2013.

(128) Shah, Shaily Umang. Importance of Genotoxicity & S2A Guidelines for Genotoxicity

Testing for Pharmaceuticals: IOSR J. Pharm. Biol. Sci. 2012, 1, 43–54.

(129) McCann, Joyce; Choi, Edmund; Yamasaki, Edith; Ames, Bruce N.; Haroun, L.; Maron, D.;

Keng, T.; Car, C. Detection of Carcinogens as Mutagens in the Salmonella/Microsome

Test: Assay of 300 Chemicals. Proc. Natl. Acad. Sci. U. S. A. 1975, 72, 5135–5139.

(130) Krishna, G.; Fiedler, R.; Theiss, J. C. Simultaneous Evaluation of Clastogenicity,

Aneugenicity and Toxicity in the Mouse Micronucleus Assay Using Immunofluorescence.

Mutat. Res. 1992, 282, 159–167.

(131) Lloyd, Melvyn; Kidd, Darren. The Mouse Lymphoma Assay. Methods Mol. Biol. 2012, 817,

35–54.

(132) Rao, P. Rama; Jena, G. B.; Kaul, C. L.; Ramarao, P. Genotoxicity Testing, a Regulatory

Requirement for Drug Discovery and Development: Impact of ICH Guidelines. Indian J.

Pharmacol. 2002, 34, 86–99.

(133) Mishima, Masayuki. Chromosomal Aberrations, Clastogens vs Aneugens. Front. Biosci.

(Schol. Ed). 2017, 9, 1–16.

(134) Hayashi, Makoto. The Micronucleus Test-Most Widely Used in Vivo Genotoxicity Test.

Genes Environ. 2016, 38, 18.

(135) Kirkland, David; Reeve, Lesley; Gatehouse, David; Vanparys, Philippe. A Core in Vitro

Genotoxicity Battery Comprising the Ames Test plus the in Vitro Micronucleus Test Is

Sufficient to Detect Rodent Carcinogens and in Vivo Genotoxins. Mutat. Res. - Genet.

Toxicol. Environ. Mutagen. 2011, 721, 27–73.

(136) Kimura, Hiroshi; Sakai, Yasuyuki; Fujii, Teruo. Organ/Body-on-a-Chip Based on

Microfluidic Technology for Drug Discovery. Drug Metab. Pharmacokinet. 2018, 33, 43–

48.

(137) Ishida, Seiichi. Organs-on-a-Chip: Current Applications and Consideration Points for in

Vitro ADME-Tox Studies. Drug Metab. Pharmacokinet. 2018, 33, 49–54.

(138) Törnqvist, Elin; Annas, Anita; Granath, Britta; Jalkesten, Elisabeth; Cotgreave, Ian; Öberg,

Mattias. Strategic Focus on 3R Principles Reveals Major Reductions in the Use of Animals

in Pharmaceutical Toxicity Testing. PLoS One 2014, 9, e101638.

(139) Flecknell, Paul. Replacement, Reduction and Refinement. ALTEX 2002, 19, 73–78.

(140) Nuffield BioEthics. Chapter 9- Animal Use in Toxicity Studies. In The Ethics of Research

Involving Animals; Nuffield Council on Bioethics: London, 2005; pp 155–167.

(141) Calabrese, E. J.; Baldwin, L. A. Improved Method for Selection of the NOAEL. Regul.

Toxicol. Pharmacol. 1994, 19, 48–50.

(142) Dorato, Michael A.; Engelhardt, Jeffery A. The No-Observed-Adverse-Effect-Level in

Drug Safety Evaluations: Use, Issues, and Definition(s). Regul. Toxicol. Pharmacol. 2005,

42, 265–274.

(143) Food and Drug Administration, HHS. International Conference on Harmonisation;

Guidance on M3(R2) Nonclinical Safety Studies for the Conduct of Human Clinical Trials

and Marketing Authorization for Pharmaceuticals; Availability. Notice. Fed. Regist. 2010,

75, 3471–3472.

(144) Evans, Scott R. Fundamentals of Clinical Trial Design. J. Exp. Stroke Transl. Med. 2010, 3,

19–27.

(145) Umscheid, Craig A.; Margolis, David J.; Grossman, Craig E. Key Concepts of Clinical

Trials: A Narrative Review. Postgrad. Med. 2011, 123, 194–204.

(146) Evans, Scott R. Clinical Trial Structures. J. Exp. Stroke Transl. Med. 2010, 3, 8–18.

(147) Mandrekar, Sumithra J.; Sargent, Daniel J. Randomized Phase II Trials: Time for a New

Era in Clinical Trial Design. J. Thorac. Oncol. 2010, 5, 932–934.

(148) Walker, Esteban; Nowacki, Amy S. Understanding Equivalence and Noninferiority

Testing. J. Gen. Intern. Med. 2011, 26, 192–196.

(149) Sedgwick, P. Confounding in Randomised Controlled Trials. BMJ 2010, 341, c5403.

(150) Lineberry, Neil; Berlin, Jesse A.; Mansi, Bernadette; Glasser, Susan; Berkwits, Michael;

Klem, Christian; Bhattacharya, Ananya; Citrome, Leslie; Enck, Robert; Fletcher, John;

Haller, Daniel; Chen, Tai Tsang; Laine, Christine. Recommendations to Improve Adverse

Event Reporting in Clinical Trial Publications: A Joint Pharmaceutical Industry/Journal

Editor Perspective. BMJ 2016, 355.

(151) Meyboom, Ronald H. B.; Hekster, Yechiel A.; Egberts, Antoine C. G.; Gribnau, Frank W.

J.; Edwards, I. Ralph. Causal or Casual? The Role of Causality Assessment in

Pharmacovigilance. Drug Saf. 1997, 17, 374–389.

(152) Coomarasamy, A.; Williams, H.; Truchanowicz, E. Definitions of Adverse Events,

Seriousness and Causality. Ncbi 2016, 20, Appendix 3.

(153) EudraVigilance: Electronic Reporting. European Medicines Agency.

https://www.ema.europa.eu/en/human-regulatory/research-development/pharmacovigilance/eudravigilance/eudravigilance-electronic-reporting

(accessed Oct 31, 2018).

(154) US FDA. Adverse Event Reporting to IRBs — Improving Human Subject Protection; 2009.

(155) Brown, Elliot G.; Wood, Louise; Wood, Sue. The Medical Dictionary for Regulatory

Activities (MedDRA). In Drug Safety; John Wiley & Sons, Ltd: Chichester, UK, 1999; Vol.

20, pp 109–117.

(156) Schroll, Jeppe Bennekou; Maund, Emma; Gøtzsche, Peter C. Challenges in Coding

Adverse Events in Clinical Trials: A Systematic Review. PLoS One 2012, 7, e41174.

(157) Jurić, Diana; Pranić, Shelly; Tokalić, Ružica; Milat, Ana Marija; Mudnić, Ivana; Pavličević,

Ivančica; Marušić, Ana. Clinical Trials on Drug-Drug Interactions Registered in

ClinicalTrials.Gov Reported Incongruent Safety Data in Published Articles: An

Observational Study. J. Clin. Epidemiol. 2018, 104, 35–45.

(158) Riveros, Carolina; Dechartres, Agnes; Perrodeau, Elodie; Haneef, Romana; Boutron,

Isabelle; Ravaud, Philippe. Timing and Completeness of Trial Results Posted at

ClinicalTrials.Gov and Published in Journals. PLoS Med. 2013, 10, 1–12.

(159) Tse, Tony; Williams, Rebecca J.; Zarin, Deborah A. Reporting “Basic Results” in

ClinicalTrials.Gov. Chest 2009, 136, 295–303.

(160) Puljak, Livia. Flupirtine, an Effective Analgesic, but Hepatotoxicity Should Limit Its Use.

Anesth. Analg. 2018, 127, 309–310.

(161) Center for Drug Evaluation and Research. Drug Safety and Availability - FDA Drug Safety

Communication: FDA Recommends against the Continued Use of Propoxyphene.

https://www.fda.gov/Drugs/DrugSafety/ucm234338.htm (accessed Jan 14, 2019).

(162) Waring, Michael J.; Arrowsmith, John; Leach, Andrew R.; Leeson, Paul D.; Mandrell, Sam;

Owen, Robert M.; Pairaudeau, Garry; Pennie, William D.; Pickett, Stephen D.; Wang,

Jibo; Wallace, Owen; Weir, Alex. An Analysis of the Attrition of Drug Candidates from

Four Major Pharmaceutical Companies. Nat. Rev. Drug Discov. 2015, 14, 475–486.

(163) Cook, David; Brown, Dearg; Alexander, Robert; March, Ruth; Morgan, Paul;

Satterthwaite, Gemma; Pangalos, Menelas N. Lessons Learned from the Fate of

AstraZeneca’s Drug Pipeline: A Five-Dimensional Framework. Nat. Rev. Drug Discov.

2014, 13, 419–431.

(164) Morgan, Paul; Brown, Dean G.; Lennard, Simon; Anderton, Mark J.; Barrett, J. Carl;

Eriksson, Ulf; Fidock, Mark; Hamrén, Bengt; Johnson, Anthony; March, Ruth E.;

Matcham, James; Mettetal, Jerome; Nicholls, David J.; Platz, Stefan; Rees, Steve;

Snowden, Michael A.; Pangalos, Menelas N. Impact of a Five-Dimensional Framework on

R&D Productivity at AstraZeneca. Nat. Rev. Drug Discov. 2018, 17, 167–181.

(165) Gintant, Gary; Sager, Philip T.; Stockbridge, Norman. Evolution of Strategies to Improve

Preclinical Cardiac Safety Testing. Nat. Rev. Drug Discov. 2016, 15, 457–471.

(166) Kim, Eunjung; Rebecca, Vito W.; Smalley, Keiran S. M.; Anderson, Alexander R. A. Phase

i Trials in Melanoma: A Framework to Translate Preclinical Findings to the Clinic. Eur. J.

Cancer 2016, 67, 213–222.

(167) Mak, Isabella W. Y.; Evaniew, Nathan; Ghert, Michelle. Lost in Translation: Animal Models

and Clinical Trials in Cancer Treatment. Am. J. Transl. Res. 2014, 6, 114–118.

(168) Li, Albert P. Accurate Prediction of Human Drug Toxicity: A Major Challenge in Drug

Development. Chem. Biol. Interact. 2004, 150, 3–7.

(169) Olson, Harry; Betton, Graham; Robinson, Denise; Thomas, Karluss; Monro, Alastair;

Kolaja, Gerald; Lilly, Patrick; Sanders, James; Sipes, Glenn; Bracken, William; Dorato,

Michael; Van Deun, Koen; Smith, Peter; Berger, Bruce; Heller, Allen. Concordance of the

Toxicity of Pharmaceuticals in Humans and in Animals. Regul. Toxicol. Pharmacol. 2000,

32, 56–67.

(170) King, G. L. Animal Models in the Study of Vomiting. Can. J. Physiol. Pharmacol. 1990, 68,

260–268.

(171) Holmes, A. M.; Rudd, J. A.; Tattersall, F. D.; Aziz, Q.; Andrews, P. L. R. Opportunities for

the Replacement of Animals in the Study of Nausea and Vomiting. Br. J. Pharmacol.

2009, 157, 865–880.

(172) Parkinson, Joanna; Muthas, Daniel; Clark, Matthew; Boyer, Scott; Valentin, Jean Pierre;

Ewart, Lorna. Application of Data Mining and Visualization Techniques for the

Prediction of Drug-Induced Nausea in Man. Toxicol. Sci. 2012, 126, 275–284.

(173) Needs, C. J.; Brooks, P. M. Antirheumatic Medication in Pregnancy. Br. J. Rheumatol.

1985, 24, 282–290.

(174) Lepper, Erin R.; Smith, Nicola F.; Cox, Michael C.; Scripture, Charity D.; Figg, William D.

Thalidomide Metabolism and Hydrolysis: Mechanisms and Implications. Curr. Drug

Metab. 2006, 7, 677–685.

(175) Powell, Craig M.; Miyakawa, Tsuyoshi. Schizophrenia-Relevant Behavioral Testing in

Rodent Models: A Uniquely Human Disorder? Biol. Psychiatry 2006, 59, 1198–1207.

(176) Jones, C. A.; Watson, D. J. G.; Fone, K. C. F. Animal Models of Schizophrenia. Br. J.

Pharmacol. 2011, 164, 1162–1194.

(177) Force, Thomas; Kolaja, Kyle L. Cardiotoxicity of Kinase Inhibitors: The Prediction and

Translation of Preclinical Models to Clinical Outcomes. Nat. Rev. Drug Discov. 2011, 10,

111–126.

(178) Dhar, Vasant. Data Science and Prediction. Ssrn 2012, 56, 64–73.

(179) Hayashi, Chikio. What Is Data Science? Fundamental Concepts and a Heuristic Example.

In Data Science, Classification, and Related Methods; Springer, Tokyo, 1998; pp 40–51.

(180) Chollet, Francois. Deep Learning with Python; Arritola, Toni, Gaines, Jerry,

Dragosavljević, Aleksandar, Taylor, Tiffany, Eds.; Manning Publications Co.: New York,

2018.

(181) Talwar, Anish; Kumar, Yogesh. Machine Learning: An Artificial Intelligence

Methodology. Int. J. Eng. Comput. Sci. 2013, 2, 3400–3404.

(182) Dey, Ayon. Machine Learning Algorithms: A Review. Int. J. Comput. Sci. Inf. Technol.

2016, 7, 1174–1179.

(183) Kotsiantis, S. B. Supervised Machine Learning: A Review of Classification Techniques.

Informatica 2007, 31, 249–268.

(184) Tsoumakas, Grigorios; Katakis, Ioannis. Multi-Label Classification: An Overview;

Thessaloniki, Greece.

(185) Read, Jesse; Martino, Luca; Olmos, Pablo M.; Luengo, David. Scalable Multi-Output Label

Prediction: From Classifier Chains to Classifier Trellises. Pattern Recognit. 2015, 48,

2096–2109.

(186) Sokolova, Marina; Lapalme, Guy. A Systematic Analysis of Performance Measures for

Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.

(187) Fawcett, Tom. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–

874.

(188) Powers, David M. W. Evaluation: From Precision, Recall and F-Factor to ROC,

Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2007, 2, 37–63.

(189) Hajian-Tilaki, Karimollah. Receiver Operating Characteristic (ROC) Curve Analysis for

Medical Diagnostic Test Evaluation. Casp. J. Intern. Med. 2013, 4, 627–635.

(190) Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow, 1st

ed.; O’Reilly, 2017.

(191) Tabachnick, Barbara G.; Fidell, Linda S. Using Multivariate Statistics, 7th ed.; Pearson:

New York, 2019; Vol. 7.

(192) Ranganathan, Priya; Pramesh, C. S.; Aggarwal, Rakesh. Common Pitfalls in Statistical

Analysis: Logistic Regression. Perspect. Clin. Res. 2017, 8, 148–151.

(193) Cortes, Corinna; Vapnik, Vladimir. Support-Vector Networks. Mach. Learn. 1995, 20,

273–297.

(194) Raschka, Sebastian. A Tour of Machine Learning Classifiers Using Scikit-Learn. In Python

Machine Learning: Unlock Deeper Insights into Machine Learning with This Vital Guide to

Cutting-Edge Predictive Analytics; Packt Publishing Ltd: Birmingham, UK, 2015; pp 49–96.

(195) Basak, Debasish; Pal, Srimanta; Patranabis, Dipak Chandra. Support Vector Regression.

In Neural Information Processing-Letters and Reviews; 2007; Vol. 11, pp 203–224.

(196) Noble, William S. What Is a Support Vector Machine? Nat. Biotechnol. 2006, 24, 1565–

1567.

(197) Auria, Laura; Moro, Rouslan A. Support Vector Machines (SVM) as a Technique for

Solvency Analysis; Berlin, Germany, 2008.

(198) Quinlan, J. R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.

(199) Shannon, C. E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27,

379–423.

(200) Mitchell, John B. O. Machine Learning Methods in Chemoinformatics. Wiley

Interdiscip. Rev. Comput. Mol. Sci. 2014, 4, 468–481.

(201) Quinlan, J. R. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4,

77–90.

(202) Raileanu, Laura E.; Stoffel, Kilian. Theoretical Comparison between the Gini Index and

Information Gain. Ann. Math. Artif. Intell. 2004, 41, 77–93.

(203) Dietterich, Thomas G. An Experimental Comparison of Three Methods for Constructing

Ensembles of Decision Trees. Mach. Learn. 2000, 40, 139–157.

(204) Breiman, Leo. Bagging Predictors: Technical Report No. 421. Dep. Stat. Univ. Calif. 1994,

No. 2, 19.

(205) Touw, Wouter G.; Bayjanov, Jumamurat R.; Overmars, Lex; Backus, Lennart; Boekhorst,

Jos; Wels, Michiel; van Hijum, Sacha A. F. T. Data Mining in the Life Sciences with

Random Forest: A Walk in the Park or Lost in the Jungle? Brief. Bioinform. 2013, 14, 315–

326.

(206) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

(207) Ho, Tin Kam. The Random Subspace Method for Constructing Decision Forests. IEEE

Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.

(208) Caruana, Rich; Niculescu-Mizil, Alexandru. Predicting Good Probabilities with

Supervised Learning. In ICML ’05: Proceedings of the 22nd International Conference on

Machine Learning; Bonn, Germany, 2005; pp 625–632.

(209) Cortés-Ciriano, Isidro; Ain, Qurrat Ul; Subramanian, Vigneshwari; Lenselink, Eelke B.;

Méndez-Lucio, Oscar; IJzerman, Adriaan P.; Wohlfahrt, Gerd; Prusis, Peteris; Malliavin,

Thérèse E.; van Westen, Gerard J. P.; Bender, Andreas. Polypharmacology Modelling

Using Proteochemometrics (PCM): Recent Methodological Developments, Applications

to Target Families, and Future Prospects. Med. Chem. Commun. 2015, 6, 24–50.

(210) Strobl, Carolin; Boulesteix, Anne-Laure; Kneib, Thomas; Augustin, Thomas; Zeileis,

Achim. Conditional Variable Importance for Random Forests. BMC Bioinformatics 2008,

9, 307.

(211) Diaz-Uriarte, Ramon; de Andres, Sara Alvarez. Variable Selection from Random Forests:

Application to Gene Expression Data. In Conference: Proceedings of the 5th Annual

Spanish Bioinformatics Conference; Barcelona, 2005; pp 1–11.

(212) Ishwaran, Hemant; Kogalur, Udaya B.; Gorodeski, Eiran Z.; Minn, Andy J.; Lauer, Michael

S. High-Dimensional Variable Selection for Survival Data. J. Am. Stat. Assoc. 2010, 105,

205–217.

(213) Palczewska, Anna; Palczewski, Jan; Robinson, Richard Marchese; Neagu, Daniel.

Interpreting Random Forest Classification Models Using a Feature Contribution Method.

Adv. Intell. Syst. Comput. 2014, 263, 193–218.

(214) Fabris, Fabio; Doherty, Aoife; Palmer, Daniel; de Magalhães, João Pedro; Freitas, Alex A.

A New Approach for Interpreting Random Forest Models and Its Application to the

Biology of Ageing. Bioinformatics 2018, 34, 2449–2456.

(215) Epifanio, Irene. Intervention in Prediction Measure: A New Approach to Assessing

Variable Importance for Random Forests. BMC Bioinformatics 2017, 18, 230.

(216) Hanser, T.; Barber, C.; Marchaland, J. F.; Werner, S. Applicability Domain: Towards a

More Formal Definition. SAR QSAR Environ. Res. 2016, 27, 865–881.

(217) Ruiz, Irene Luque; Gómez-Nieto, Miguel Ángel. Study of the Applicability Domain of the

QSAR Classification Models by Means of the Rivality and Modelability Indexes. Molecules

2018, 23, 2756.

(218) Kaneko, Hiromasa; Funatsu, Kimito. Applicability Domain Based on Ensemble Learning

in Classification and Regression Analyses. J. Chem. Inf. Model. 2014, 54, 2469–2482.

(219) Aniceto, Natália; Freitas, Alex A.; Bender, Andreas; Ghafourian, Taravat. A Novel

Applicability Domain Technique for Mapping Predictive Reliability across the Chemical

Space of a QSAR: Reliability-Density Neighbourhood. J. Cheminform. 2016, 8, 69.

(220) Sushko, Iurii; Novotarskyi, Sergii; Körner, Robert; Pandey, Anil Kumar; Cherkasov,

Artem; Li, Jiazhong; Gramatica, Paola; Hansen, Katja; Schroeter, Timon; Müller, Klaus-

Robert; Xi, Lili; Liu, Huanxiang; Yao, Xiaojun; Öberg, Tomas; Hormozdiari, Farhad; Dao,

Phuong; Sahinalp, Cenk; Todeschini, Roberto; Polishchuk, Pavel; et al. Applicability

Domains for Classification Problems: Benchmarking of Distance to Models for Ames

Mutagenicity Set. J. Chem. Inf. Model. 2010, 50, 2094–2111.

(221) Norinder, Ulf; Carlsson, Lars; Boyer, Scott; Eklund, Martin. Introducing Conformal

Prediction in Predictive Modeling. A Transparent and Flexible Alternative to

Applicability Domain Determination. J. Chem. Inf. Model. 2014, 54, 1596–1603.

(222) Toplak, Marko; Močnik, Rok; Polajnar, Matija; Bosnić, Zoran; Carlsson, Lars; Hasselgren,

Catrin; Demšar, Janez; Boyer, Scott; Zupan, Blaž; Stålring, Jonna. Assessment of Machine

Learning Reliability Methods for Quantifying the Applicability Domain of QSAR

Regression Models. J. Chem. Inf. Model. 2014, 54, 431–441.

(223) Svensson, Fredrik; Norinder, Ulf; Bender, Andreas. Modelling Compound Cytotoxicity

Using Conformal Prediction and PubChem HTS Data. Toxicol. Res. (Camb). 2017, 6, 73–

80.

(224) Svensson, Fredrik; Norinder, Ulf; Bender, Andreas. Improving Screening Efficiency

through Iterative Screening Using Docking and Conformal Prediction. J. Chem. Inf.

Model. 2017, 57, 439–444.

(225) Norinder, Ulf; Carlsson, Lars; Boyer, Scott; Eklund, Martin. Introducing Conformal

Prediction in Predictive Modeling for Regulatory Purposes. A Transparent and Flexible

Alternative to Applicability Domain Determination. Regul. Toxicol. Pharmacol. 2015, 71,

279–284.

(226) Eklund, Martin; Norinder, Ulf; Boyer, Scott; Carlsson, Lars. Application of Conformal

Prediction in QSAR. In IFIP Advances in Information and Communication Technology;

Springer, Berlin, Heidelberg, 2012; Vol. 382 AICT, pp 166–175.

(227) Sun, Jiangming; Carlsson, Lars; Ahlberg, Ernst; Norinder, Ulf; Engkvist, Ola; Chen,

Hongming. Applying Mondrian Cross-Conformal Prediction To Estimate Prediction

Confidence on Large Imbalanced Bioactivity Data Sets. J. Chem. Inf. Model. 2017, 57, 1591–

1598.

(228) Cahill, Nathan D. Normalized Measures of Mutual Information with General Definitions

of Entropy for Multimodal Image Registration. In Lecture Notes in Computer Science

(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in

Bioinformatics); 2010; Vol. 6204 LNCS, pp 258–268.

(229) Deeks, Jonathan J.; Altman, Douglas G. Diagnostic Tests 4: Likelihood Ratios. BMJ 2004,

329, 168–169.

(230) Wall, R. J.; Shani, M. Are Animal Models as Good as We Think? Theriogenology 2008,

69, 2–9.

(231) Chien, Patrick F. W.; Khan, Khalid S. Evaluation of a Clinical Test. II: Assessment of

Validity. BJOG An Int. J. Obstet. Gynaecol. 2003, 108, 568–572.

(232) Clark, Matthew. Prediction of Clinical Risks by Analysis of Preclinical and Clinical

Adverse Events. J. Biomed. Inform. 2015, 54, 167–173.

(233) Klabunde, T. Chemogenomic Approaches to Drug Discovery: Similar Receptors Bind

Similar Ligands. Br. J. Pharmacol. 2007, 152, 5–7.

(234) Rognan, D. Chemogenomic Approaches to Rational Drug Design. Br. J. Pharmacol. 2007,

152, 38–52.

(235) Boström, Jonas; Hogner, Anders; Schmitt, Stefan. Do Structurally Similar Ligands Bind

in a Similar Fashion? J. Med. Chem. 2006, 49, 6716–6725.

(236) Chodera, John D.; Mobley, David L. Entropy-Enthalpy Compensation: Role and

Ramifications in Biomolecular Ligand Recognition and Design. Annu. Rev. Biophys. 2013,

42, 121–142.

(237) Du, Xing; Li, Yi; Xia, Yuan Ling; Ai, Shi Meng; Liang, Jing; Sang, Peng; Ji, Xing Lai; Liu,

Shu Qun. Insights into Protein–Ligand Interactions: Mechanisms, Models, and Methods.

Int. J. Mol. Sci. 2016, 17, 144.

(238) Mauri, Andrea; Consonni, Viviana; Pavan, Manuela; Todeschini, Roberto. Dragon

Software: An Easy Approach to Molecular Descriptor Calculations. MATCH Commun.

Math. Comput. Chem. 2006, 56, 237–248.

(239) Yap, Chun Wei. PaDEL-Descriptor: An Open Source Software to Calculate Molecular

Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.

(240) Rifaioglu, Ahmet Sureyya; Atas, Heval; Martin, Maria Jesus; Cetin-Atalay, Rengul; Atalay,

Volkan; Doğan, Tunca. Recent Applications of Deep Learning and Machine Intelligence

on in Silico Drug Discovery: Methods, Tools and Databases. Brief. Bioinform. 2018, 1–36.

(241) Gozalbes, R.; Doucet, J. P.; Derouin, F. Application of Topological Descriptors in QSAR

and Drug Design: History and New Trends. Curr. Drug Targets. Infect. Disord. 2002, 2,

93–102.

(242) Durant, Joseph L.; Leland, Burton A.; Henry, Douglas R.; Nourse, James G.

Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002,

42, 1273–1280.

(243) Sliwoski, Gregory; Mendenhall, Jeffrey; Meiler, Jens. Autocorrelation Descriptor

Improvements for QSAR: 2DA-Sign and 3DA-Sign. J. Comput. Aided. Mol. Des. 2016, 30,

209–217.

(244) Rogers, David; Hahn, Mathew. Extended-Connectivity Fingerprints. J. Chem. Inf. Model.

2010, 50, 742–754.

(245) Glem, Robert C.; Bender, Andreas; Arnby, Catrin H.; Carlsson, Lars; Boyer, Scott; Smith,

James. Circular Fingerprints: Flexible Molecular Descriptors with Applications from

Physical Chemistry to ADME. IDrugs 2006, 9, 199–204.

(246) Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures-

A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113.

(247) Bender, Andreas; Jenkins, Jeremy L.; Scheiber, Josef; Sukuru, Sai Chetan K.; Glick, Meir;

Davies, John W. How Similar Are Similarity Searching Methods? A Principal Component

Analysis of Molecular Descriptor Space. J. Chem. Inf. Model. 2009, 49, 108–119.

(248) Cramer, Richard D.; Patterson, David E.; Bunce, Jeffrey D. Comparative Molecular Field

Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am.

Chem. Soc. 1988, 110, 5959–5967.

(249) Lapinsh, Maris; Prusis, Peteris; Lundstedt, Torbjörn; Wikberg, Jarl E. S.; Witte, Peter de;

Augustijns, Patrick F.; Annaert, Pieter P. Proteochemometrics Modeling of the

Interaction of Amine G-Protein Coupled Receptors with a Diverse Set of Ligands. Mol.

Pharmacol. 2002, 61, 1465–1475.

(250) Cruciani, Gabriele; Pastor, Manuel; Guba, Wolfgang. VolSurf: A New Tool for the

Pharmacokinetic Optimization of Lead Compounds. Eur. J. Pharm. Sci. 2000, 11, S29-39.

(251) Subramanian, Vigneshwari; Prusis, Peteris; Pietilä, Lars-Olof; Xhaard, Henri; Wohlfahrt,

Gerd. Visually Interpretable Models of Kinase Selectivity Related

Features Derived from Field-Based Proteochemometrics. J. Chem. Inf. Model. 2013, 53,

3021–3030.

(252) Ekins, S.; Mestres, J.; Testa, B. In Silico Pharmacology for Drug Discovery: Applications

to Targets and Beyond. Br. J. Pharmacol. 2007, 152, 21–37.

(253) Mulliner, Denis; Schmidt, Friedemann; Stolte, Manuela; Spirkl, Hans-Peter; Czich,

Andreas; Amberg, Alexander. Computational Models for Human and Animal

Hepatotoxicity with a Global Application Scope. Chem. Res. Toxicol. 2016, 29, 757–767.

(254) Axen, Seth D.; Huang, Xi-Ping; Cáceres, Elena L.; Gendelev, Leo; Roth, Bryan L.; Keiser,

Michael J. A Simple Representation of Three-Dimensional Molecular Structure. J. Med.

Chem. 2017, 60, 7393–7409.

(255) Gómez-Bombarelli, Rafael; Wei, Jennifer N.; Duvenaud, David; Hernández-Lobato, José

Miguel; Sánchez-Lengeling, Benjamín; Sheberla, Dennis; Aguilera-Iparraguirre, Jorge;

Hirzel, Timothy D.; Adams, Ryan P.; Aspuru-Guzik, Alán. Automatic Chemical Design

Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4,

268–276.

(256) Duvenaud, David; Maclaurin, Dougal; Aguilera-Iparraguirre, Jorge; Gómez-Bombarelli,

Rafael; Hirzel, Timothy; Aspuru-Guzik, Alán; Adams, Ryan P. Convolutional Networks

on Graphs for Learning Molecular Fingerprints. In Proceedings of Advances in Neural

Information Processing Systems 28 (NIPS 2015); Montreal, 2015; pp 2215–2223.

(257) Gupta, Anvita; Müller, Alex T.; Huisman, Berend J. H.; Fuchs, Jens A.; Schneider, Petra;

Schneider, Gisbert. Generative Recurrent Networks for De Novo Drug Design. Mol.

Inform. 2018, 37, 1700111.

(258) Qiu, Tianyi; Qiu, Jingxuan; Feng, Jun; Wu, Dingfeng; Yang, Yiyan; Tang, Kailin; Cao,

Zhiwei; Zhu, Ruixin. The Recent Progress in Proteochemometric Modelling: Focusing on

Target Descriptors, Cross-Term Descriptors and Application Scope. Brief. Bioinform.

2017, 18, 125–136.

(259) Kramer, Christian; Gedeck, Peter. Global Free Energy Scoring Functions Based on

Distance-Dependent Atom-Type Pair Descriptors. J. Chem. Inf. Model. 2011, 51, 707–720.

(260) Ong, Serene A. K.; Lin, Hong Huang; Chen, Yu Zong; Li, Ze Rong; Cao, Zhiwei. Efficacy

of Different Protein Descriptors in Predicting Protein Functional Families. BMC

Bioinformatics 2007, 8, 300.

(261) Gao, Qing-Bin; Wang, Zheng-Zhi; Yan, Chun; Du, Yao-Hua. Prediction of Protein

Subcellular Location Using a Combined Feature of Sequence. FEBS Lett. 2005, 579, 3444–

3448.

(262) Bhasin, Manoj; Raghava, Gajendra P. S. Classification of Nuclear Receptors Based on

Amino Acid Composition and Dipeptide Composition. J. Biol. Chem. 2004, 279, 23262–

23266.

(263) Strömbergsson, Helena; Kryshtafovych, Andriy; Prusis, Peteris; Fidelis, Krzysztof;

Wikberg, Jarl E. S.; Komorowski, Jan; Hvidsten, Torgeir R. Generalized Modeling of

Enzyme-Ligand Interactions Using Proteochemometrics and Local Protein

Substructures. Proteins 2006, 65, 568–579.

(264) Horne, David S. Prediction of Protein Helix Content from an Autocorrelation Analysis of

Sequence Hydrophobicities. Biopolymers 1988, 27, 451–477.

(265) Sandberg, Maria; Eriksson, Lennart; Jonsson, Jörgen; Sjöström, Michael; Wold, Svante.

New Chemical Descriptors Relevant for the Design of Biologically Active Peptides. A

Multivariate Characterization of 87 Amino Acids. J. Med. Chem. 1998, 41, 2481–2491.

(266) Mei, Hu; Liao, Zhi H.; Zhou, Yuan; Li, Shengshi Z. A New Set of Amino Acid Descriptors

and Its Application in Peptide QSARs. Biopolym. - Pept. Sci. Sect. 2005, 80, 775–786.

(267) Van Westen, Gerard J. P.; Swier, Remco F.; Cortes-Ciriano, Isidro; Wegner, Jörg K.;

Overington, John P.; IJzerman, Adriaan P.; Van Vlijmen, Herman W. T.; Bender, Andreas.

Benchmarking of Protein Descriptor Sets in

Proteochemometric Modeling (Part 2): Modeling Performance of 13 Amino Acid

Descriptor Sets. J. Cheminform. 2013, 5, 42–62.

(268) Yang, Li; Shu, Mao; Ma, Kaiwang; Mei, Hu; Jiang, Yongjun; Li, Zhiliang. ST-Scale as a

Novel Amino Acid Descriptor and Its Application in QSAM of Peptides and Analogues.

Amino Acids 2010, 38, 805–816.

(269) Tian, Feifei; Zhou, Peng; Li, Zhiliang. T-Scale as a Novel Vector of Topological Descriptors

for Amino Acids and Its Application in QSARs of Peptides. J. Mol. Struct. 2007, 830, 106–

115.

(270) Georgiev, Alexander G. Interpretable Numerical Descriptors of Amino Acid Space. J.

Comput. Biol. 2009, 16, 703–723.

(271) Henikoff, S.; Henikoff, J. G. Amino Acid Substitution Matrices from Protein Blocks. Proc.

Natl. Acad. Sci. U. S. A. 1992, 89, 10915–10919.

(272) Zaliani, A.; Gancia, E. MS-WHIM Scores for Amino Acids: A New 3D-Description for

Peptide QSAR and QSPR Studies. J. Chem. Inf. Comput. Sci. 1999, 39, 525–533.

(273) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov,

I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242.

(274) Wu, Dingfeng; Huang, Qi; Zhang, Yida; Zhang, Qingchen; Liu, Qi; Gao, Jun; Cao, Zhiwei;

Zhu, Ruixin. Screening of Selective

Histone Deacetylase Inhibitors by Proteochemometric Modeling. BMC Bioinformatics

2012, 13, 212.

(275) Qiu, Tianyi; Xiao, Han; Zhang, Qingchen; Qiu, Jingxuan; Yang, Yiyan; Wu, Dingfeng; Cao,

Zhiwei; Zhu, Ruixin. Proteochemometric Modeling of the Antigen-Antibody Interaction:

New Fingerprints for Antigen, Antibody and Epitope-Paratope Interaction. PLoS One

2015, 10, e0122416.

(276) Bosc, Nicolas; Wroblowski, Berthold; Meyer, Christophe; Bonnet, Pascal. Prediction of

Protein Kinase-Ligand Interactions through 2.5D Kinochemometrics. J. Chem. Inf. Model.

2017, 57, 93–101.

(277) Rácz, Anita; Bajusz, Dávid; Héberger, Károly. Life beyond the Tanimoto Coefficient:

Similarity Measures for Interaction Fingerprints. J. Cheminform. 2018, 10, 48.

(278) Bajusz, D.; Rácz, A.; Héberger, K. Chemical Data Formats, Fingerprints, and Other

Molecular Descriptions for Database Analysis and Searching. In Comprehensive

Medicinal Chemistry III; Elsevier, 2017; pp 329–378.

(279) Salentin, Sebastian; Schreiber, Sven; Haupt, V. Joachim; Adasme, Melissa F.; Schroeder,

Michael. PLIP: Fully Automated Protein-Ligand Interaction Profiler. Nucleic Acids Res.

2015, 43, W443–W447.

(280) Gini, Giuseppina. QSAR Methods. Methods Mol. Biol. 2016, 1425, 1–20.

(281) Kearnes, Steven; Pande, Vijay. ROCS-Derived Features for Virtual Screening. J.

Comput.-Aided Mol. Des. 2016, 30, 609–617.

(282) Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H. Fast 3D Shape

Screening of Large Chemical Databases through Alignment-Recycling. Chem. Cent. J.

2007, 1, 12.

(283) Mayr, Andreas; Klambauer, Günter; Unterthiner, Thomas; Steijaert, Marvin; Wegner,

Jörg K.; Ceulemans, Hugo; Clevert, Djork-Arné; Hochreiter, Sepp. Large-Scale

Comparison of Machine Learning Methods for Drug Target Prediction on ChEMBL.

Chem. Sci. 2018, 9, 5441–5451.

(284) Simões, Rodolfo S.; Maltarollo, Vinicius G.; Oliveira, Patricia R.; Honorio, Kathia M.

Transfer and Multi-Task Learning in QSAR Modeling: Advances and Challenges. Front.

Pharmacol. 2018, 9, 74.

(285) Huang, Hung-Jin; Yu, Hsin Wei; Chen, Chien-Yu; Hsu, Chih-Ho; Chen, Hsin-Yi; Lee,

Kuei-Jen; Tsai, Fuu-Jen; Chen, Calvin Yu-Chian. Current Developments of Computer-

Aided Drug Design. J. Taiwan Inst. Chem. Eng. 2010, 41, 623–635.

(286) Salmaso, Veronica; Moro, Stefano. Bridging Molecular Docking to Molecular Dynamics

in Exploring Ligand-Protein Recognition Process: An Overview. Front. Pharmacol. 2018,

9, 923.

(287) Meng, Xuan-Yu; Zhang, Hong-Xing; Mezei, Mihaly; Cui, Meng. Molecular Docking: A

Powerful Approach for Structure-Based Drug Discovery. Curr. Comput.-Aided Drug Des.

2011, 7, 146–157.

(288) Śledź, Paweł. Protein Structure-Based Drug Design: From Docking to Molecular

Dynamics. Curr. Opin. Struct. Biol. 2018, 48, 93–102.

(289) Repasky, Matthew P.; Shelley, Mee; Friesner, Richard A. Flexible Ligand Docking with

Glide. In Current Protocols in Bioinformatics; John Wiley & Sons, Inc.: Hoboken, NJ, USA,

2007; p UNIT 8.12.

(290) Surgand, Jean Sebastien; Rodrigo, Jordi; Kellenberger, Esther; Rognan, Didier. A

Chemogenomic Analysis of the Transmembrane Binding Cavity of Human G-Protein-

Coupled Receptors. Proteins Struct. Funct. Genet. 2006, 62, 509–538.

(291) Kratochwil, Nicole A.; Malherbe, Pari; Lindemann, Lothar; Ebeling, Martin; Hoener,

Marius C.; Mühlemann, Andreas; Porter, Richard H. P.; Stahl, Martin; Gerber, Paul R. An

Automated System for the Analysis of G Protein-Coupled Receptor Transmembrane

Binding Pockets: Alignment, Receptor-Based Pharmacophores, and Their Application. J.

Chem. Inf. Model. 2005, 45, 1324–1336.

(292) Frimurer, Thomas M.; Ulven, Trond; Elling, Christian E.; Gerlach, Lars-Ole; Kostenis, Evi;

Högberg, Thomas. A Physicogenetic Method to Assign Ligand-Binding Relationships

between 7TM Receptors. Bioorg. Med. Chem. Lett. 2005, 15, 3707–3712.

(293) Ehrt, Christiane; Brinkjost, Tobias; Koch, Oliver. Impact of Binding Site Comparisons on

Medicinal Chemistry and Rational Molecular Design. J. Med. Chem. 2016, 59, 4121–4151.

(294) Baroni, Massimo; Cruciani, Gabriele; Sciabola, Simone; Perruccio, Francesca; Mason,

Jonathan S. A Common Reference Framework for Analyzing/Comparing Proteins and

Ligands. Fingerprints for Ligands And Proteins (FLAP): Theory and Application. J. Chem.

Inf. Model. 2007, 47, 279–294.

(295) Totrov, Maxim. Ligand Binding Site Superposition and Comparison Based on Atomic

Property Fields: Identification of Distant Homologues, Convergent Evolution and PDB-

Wide Clustering of Binding Sites. BMC Bioinformatics 2011, 12, S35.

(296) Konc, Janez; Janezic, Dusanka. ProBiS Algorithm for Detection of Structurally Similar

Protein Binding Sites by Local Structural Alignment. Bioinformatics 2010, 26, 1160–1168.

(297) Das, Sourav; Kokardekar, Arshad; Breneman, Curt M. Rapid Comparison of Protein

Binding Site Surfaces with Property Encoded Shape Distributions. J. Chem. Inf. Model.

2009, 49, 2863–2872.

(298) Li, Gong-Hua; Huang, Jing-Fei. CMASA: An Accurate Algorithm for Detecting Local

Protein Structural Similarity and Its Application to Enzyme Catalytic Site Annotation.

BMC Bioinformatics 2010, 11, 439.

(299) Sheinerman, Felix B.; Giraud, Elie; Laoui, Abdelazize. High Affinity Targets of Protein

Kinase Inhibitors Have Similar Residues at the Positions Energetically Important for

Binding. J. Mol. Biol. 2005, 352, 1134–1156.

(300) Subramanian, Govindan; Sud, Manish. Computational Modeling of Kinase Inhibitor

Selectivity. ACS Med. Chem. Lett. 2010, 1, 395–399.

(301) Kontoyianni, Maria. Docking and Virtual Screening in Drug Discovery. In Methods in

Molecular Biology; 2017; Vol. 1647, pp 255–266.

(302) Uba, Abdullahi Ibrahim; Yelekçi, Kemal. Carboxylic Acid Derivatives Display Potential

Selectivity for Human Histone Deacetylase 6: Structure-Based Virtual Screening,

Molecular Docking and Dynamics Simulation Studies. Comput. Biol. Chem. 2018, 75, 131–

142.

(303) Jasper, Julia B.; Humbeck, Lina; Brinkjost, Tobias; Koch, Oliver. A Novel Interaction

Fingerprint Derived from per Atom Score Contributions: Exhaustive Evaluation of

Interaction Fingerprint Performance in Docking Based Virtual Screening. J. Cheminform.

2018, 10, 15.

(304) Weaver, Shane; Gleeson, M. Paul. The Importance of the Domain of Applicability in

QSAR Modeling. J. Mol. Graph. Model. 2008, 26, 1315–1326.

(305) Van Westen, Gerard J. P.; Wegner, Jörg K.; IJzerman, Adriaan P.; Van Vlijmen, Herman

W. T.; Bender, A. Proteochemometric Modeling as a Tool to Design Selective Compounds

and for Extrapolating to Novel Targets. Med. Chem. Commun. 2011, 2, 16–30.

(306) Murrell, Daniel S.; Cortes-Ciriano, Isidro; Van Westen, Gerard J. P.; Stott, Ian P.; Bender,

Andreas; Malliavin, Thérèse E.; Glen, Robert C. Chemically Aware Model Builder (Camb):

An R Package for Property and Bioactivity Modelling of Small Molecules. J. Cheminform.

2015, 7, 45–55.

(307) Tropsha, Alexander; Golbraikh, Alexander. Predictive QSAR Modeling Workflow, Model

Applicability Domains, and Virtual Screening. Curr. Pharm. Des. 2007, 13, 3494–3504.

(308) Gao, Jun; Huang, Qi; Wu, Dingfeng; Zhang, Qingchen; Zhang, Yida; Chen, Tian; Liu, Qi;

Zhu, Ruixin; Cao, Zhiwei; He, Yuan. Study on Human GPCR–inhibitor Interactions by

Proteochemometric Modeling. Gene 2013, 518, 124–131.

(309) Subramanian, Vigneshwari; Prusis, Peteris; Xhaard, Henri; Wohlfahrt, Gerd. Predictive

Proteochemometric Models for Kinases Derived from 3D Protein Field-Based

Descriptors. Med. Chem. Commun. 2016, 7, 1007–1015.

(310) Lapins, Maris; Eklund, Martin; Spjuth, Ola; Prusis, Peteris; Wikberg, Jarl E. S.

Proteochemometric Modeling of HIV Protease Susceptibility. BMC Bioinformatics 2008,

9, 181.

(311) Lapins, Maris; Worachartcheewan, Apilak; Spjuth, Ola; Georgiev, Valentin;

Prachayasittikul, Virapong; Nantasenamat, Chanin; Wikberg, Jarl E. S. A Unified

Proteochemometric Model for Prediction of Inhibition of Cytochrome P450 Isoforms.

PLoS One 2013, 8, e66566.

(312) Tresadern, Gary; Trabanco, Andres A.; Pérez-Benito, Laura; Overington, John P.; van

Vlijmen, Herman W. T.; van Westen, Gerard J. P. Identification of Allosteric Modulators

of Metabotropic Glutamate 7 Receptor Using Proteochemometric Modeling. J. Chem. Inf.

Model. 2017, 57, 2976–2985.

(313) Qiu, Tianyi; Wu, Dingfeng; Qiu, Jingxuan; Cao, Zhiwei. Finding the Molecular Scaffold

of Nuclear Receptor Inhibitors through High-Throughput Screening Based on

Proteochemometric Modelling. J. Cheminform. 2018, 10, 21.

(314) Wan, Shunzhou; Bhati, Agastya P.; Zasada, Stefan J.; Wall, Ian; Green, Darren;

Bamborough, Paul; Coveney, Peter V. Rapid and Reliable Binding Affinity Prediction of

Bromodomain Inhibitors: A Computational Study. J. Chem. Theory Comput. 2017, 13,

784–795.

(315) Aldeghi, Matteo; Heifetz, Alexander; Bodkin, Michael J.; Knapp, Stefan; Biggin, Philip C.

Predictions of Ligand Selectivity from Absolute Binding Free Energy Calculations. J. Am.

Chem. Soc. 2017, 139, 946–957.

(316) García-Jacas, C. R.; Martinez-Mayorga, K.; Marrero-Ponce, Y.; Medina-Franco, J. L.

Conformation-Dependent QSAR Approach for the Prediction of Inhibitory Activity of

Bromodomain Modulators. SAR QSAR Environ. Res. 2017, 28, 41–58.

(317) Batiste, Laurent; Unzue, Andrea; Dolbois, Aymeric; Hassler, Fabrice; Wang, Xuan;

Deerain, Nicholas; Zhu, Jian; Spiliotopoulos, Dimitrios; Nevado, Cristina; Caflisch,

Amedeo. Chemical Space Expansion of Bromodomain Ligands Guided by in Silico Virtual

Couplings (AutoCouple). ACS Cent. Sci. 2018, 4, 180–188.

(318) Vidler, Lewis R.; Brown, Nathan; Knapp, Stefan; Hoelder, Swen. Druggability

Analysis and Structural Classification of Bromodomain Acetyl-Lysine Binding Sites. J.

Med. Chem. 2012, 55, 7346–7359.

(319) Zhang, Xiaoxiao; Chen, Kai; Wu, Yun-Dong; Wiest, Olaf. Protein Dynamics and

Structural Waters in Bromodomains. PLoS One 2017, 12, e0186570.

(320) Zhao, Linlin; Zhu, Hao. Big Data in Computational Toxicology: Challenges and

Opportunities. In Computational Toxicology; John Wiley & Sons, Inc.: Hoboken, NJ, USA,

2018; pp 291–312.

(321) Jeong, Jaeseong; Choi, Jinhee. Use of Adverse Outcome Pathways in Chemical Toxicity

Testing: Potential Advantages and Limitations. Environ. Health Toxicol. 2017, 33,

e2018002.

(322) Clark, Matthew; Steger-Hartmann, Thomas. A Big Data Approach to the Concordance of

the Toxicity of Pharmaceuticals in Animals and Humans. Regul. Toxicol. Pharmacol.

2018, 96, 94–105.

(323) Nishida, Minoru; Takashima, Yoshiharu; Ogino, Yamato; Yoneta, Yasuo; Nakamura,

Kazuichi; Fujiyoshi, Masato; Kodaira, Hiroshi; Hizue, Masanori; Hisada, Shigeru;

Nagayama, Takashi; Hashiba, Masamichi; Ohkura, Takako; Suzuki, Kazuhiko; Yasugi,

Daisaku; Tamaki, Chihiro. Potentials and Limitations of Nonclinical Safety Assessment

for Predicting Clinical Adverse Drug Reactions: Correlation Analysis of 142 Approved

Drugs in Japan. J. Toxicol. Sci. 2013, 38, 581–598.

(324) Bugelski, Peter J.; Martin, Pauline L. Concordance of Preclinical and Clinical

Pharmacology and Toxicology of Therapeutic Monoclonal Antibodies and Fusion

Proteins: Cell Surface Targets. Br. J. Pharmacol. 2012, 166, 823–846.

(325) Bailey, Jarrod; Thew, Michelle; Balls, Michael. An Analysis of the Use of Animal Models

in Predicting Human Toxicology and Drug Safety. Altern. Lab. Anim. 2014, 42, 181–199.

(326) Monticello, Thomas M. Drug Development and Nonclinical to Clinical

Translational Databases: Past and Current Efforts. Toxicol. Pathol. 2015, 43, 57–61.

(327) Shanks, Niall; Greek, Ray; Greek, Jean. Are Animal Models Predictive for Humans? Philos.

Ethics, Humanit. Med. 2009, 4, 2.

(328) Voisin, Emmanuelle M.; Ruthsatz, Manfred; Collins, Jerry M.; Hoyle, Peter C.

Extrapolation of Animal Toxicity to Humans: Interspecies Comparisons in Drug

Development. Regul. Toxicol. Pharmacol. 1990, 12, 107–116.

(329) Papoian, Thomas; Chiu, Haw-Jyh; Elayan, Ikram; Jagadeesh, Gowraganahalli; Khan,

Imran; Laniyonu, Adebayo A.; Li, Cindy Xinguang; Saulnier, Muriel; Simpson, Natalie;

Yang, Baichun. Secondary Pharmacology Data to Assess Potential Off-Target Activity of

New Drugs: A Regulatory Perspective. Nat. Rev. Drug Discov. 2015, 14, 294.

(330) Bento, A. Patrícia; Gaulton, Anna; Hersey, Anne; Bellis, Louisa J.; Chambers, Jon; Davies,

Mark; Krüger, Felix A.; Light, Yvonne; Mak, Lora; McGlinchey, Shaun; Nowotka, Michal;

Papadatos, George; Santos, Rita; Overington, John P. The ChEMBL Bioactivity Database:

An Update. Nucleic Acids Res. 2014, 42, D1083–D1090.

(331) Wang, Yanli; Suzek, Tugba; Zhang, Jian; Wang, Jiyao; He, Siqian; Cheng, Tiejun;

Shoemaker, Benjamin A.; Gindulyte, Asta; Bryant, Stephen H. PubChem BioAssay: 2014

Update. Nucleic Acids Res. 2014, 42, D1075–D1082.

(332) Meslamani, Jamel; Smith, Steven G.; Sanchez, Roberto; Zhou, Ming-Ming.

ChEpiMod: A Knowledgebase for Chemical Modulators of Epigenome Reader Domains.

Bioinformatics 2014, 30, 1481–1483.

(333) Jagarlapudi, Sarma A. R. P.; Kishan, K. V. Radha. Database Systems for Knowledge-Based

Discovery. In Methods in Molecular Biology; Jacoby, E., Ed.; Humana Press:

Totowa, NJ, 2009; Vol. 575, pp 159–172.

(334) Kharenko, Olesya A.; Gesner, Emily M.; Patel, Reena G.; Norek, Karen; White, Andre;

Fontano, Eric; Suto, Robert K.; Young, Peter R.; McLure, Kevin G.; Hansen, Henrik C.

RVX-297- a Novel BD2 Selective Inhibitor of BET Bromodomains. Biochem. Biophys. Res.

Commun. 2016, 477, 62–67.

(335) Clark, Peter G. K.; Dixon, Darren J.; Brennan, Paul E. Development of Chemical Probes

for the Bromodomains of BRD7 and BRD9. Drug Discov. Today Technol. 2016, 19, 73–80.

(336) Guetzoyan, Lucie; Ingham, Richard J.; Nikbin, Nikzad; Rossignol, Julien; Wolling,

Michael; Baumert, Mark; Burgess-Brown, Nicola A.; Strain-Damerell, Claire M.; Shrestha,

Leela; Brennan, Paul E.; Fedorov, Oleg; Knapp, Stefan; Ley, Steven V. Machine-Assisted

Synthesis of Modulators of the Histone Reader BRD9 Using Flow Methods of Chemistry

and Frontal Affinity Chromatography. Med. Chem. Commun. 2014, 5, 540–546.

(337) Hay, Duncan A.; Fedorov, Oleg; Martin, Sarah; Singleton, Dean C.; Tallant, Cynthia;

Wells, Christopher; Picaud, Sarah; Philpott, Martin; Monteiro, Octovia P.; Rogers,

Catherine M.; Conway, Stuart J.; Rooney, Timothy P. C.; Tumber, Anthony; Yapp,

Clarence; Filippakopoulos, Panagis; Bunnage, Mark E.; Müller, Susanne; Knapp, Stefan;

Schofield, Christopher J.; et al. Discovery and Optimization of Small-Molecule Ligands

for the CBP/P300 Bromodomains. J. Am. Chem. Soc. 2014, 136, 9308–9319.

(338) Gao, Nana; Ren, Jixia; Hou, Li; Zhou, Yue; Xin, Ling; Wang, Jiedong; Yu, Heming; Xie,

Yong; Wang, Huiping. Identification of Novel Potent Human Testis-Specific and

Bromodomain-Containing Protein (BRDT) Inhibitors Using Crystal Structure-Based

Virtual Screening. Int. J. Mol. Med. 2016, 38, 39–44.

(339) Magrane, Michele; UniProt Consortium. UniProt Knowledgebase: A Hub of Integrated

Protein Data. Database 2011, 2011, bar009.

(340) PubChem Identifier Exchange Service

https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi (accessed Feb 8, 2018).

(341) Indigo Toolkit http://lifescience.opensource.epam.com/ (accessed Feb 8, 2018).

(342) Burlingham, Benjamin T.; Widlanski, Theodore S. An Intuitive Look at the Relationship

of Ki and IC50: A More General Use for the Dixon Plot. J. Chem. Educ. 2003, 80, 214–218.

(343) Quinn, Elizabeth; Wodicka, Lisa; Ciceri, Pietro; Pallares, Gabriel; Pickle, Elyssa; Torrey,

Adam; Floyd, Mark; Hunt, Jeremy; Treiber, Daniel. Abstract 4238: BROMO Scan - a High

Throughput, Quantitative Ligand Binding Platform Identifies Best-in-Class

Bromodomain Inhibitors from a Screen of Mature Compounds Targeting Other Protein

Classes. Cancer Res. 2013, 73, 4238–4238.

(344) Eduati, Federica; Mangravite, Lara M.; Wang, Tao; Tang, Hao; Bare, J. Christopher;

Huang, Ruili; Norman, Thea; Kellen, Mike; Menden, Michael P.; Yang, Jichen; Zhan,

Xiaowei; Zhong, Rui; Xiao, Guanghua; Xia, Menghang; Abdo, Nour; Kosyk, Oksana;

Friend, Stephen; Dearry, Allen; Simeonov, Anton; et al. Prediction of Human Population

Responses to Toxic Compounds by a Collaborative Competition. Nat. Biotechnol. 2015,

33, 933–940.

(345) Ain, Qurrat U.; Méndez-Lucio, Oscar; Cortés-Ciriano, Isidro; Malliavin, Thérèse; van

Westen, Gerard J. P.; Bender, Andreas. Modelling

Ligand Selectivity of Serine Proteases Using Integrative Proteochemometric Approaches

Improves Model Performance and Allows the Multi-Target Dependent Interpretation of

Features. Integr. Biol. 2014, 6, 1023–1033.

(346) Koutsoukas, Alexios; Lowe, Robert; KalantarMotamedi, Yasaman; Mussa, Hamse Y.;

Klaffke, Werner; Mitchell, John B. O.; Glen, Robert C.; Bender, Andreas. In Silico Target

Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the

Multiclass Naïve Bayes and Parzen-Rosenblatt Window. J. Chem. Inf. Model. 2013, 53,

1957–1966.

(347) Charif, Delphine; Lobry, Jean R. SeqinR 1.0-2: A Contributed Package to the R Project for

Statistical Computing Devoted to Biological Sequences Retrieval and Analysis; Springer,

Berlin, Heidelberg, 2007; pp 207–232.

(348) Paradis, E.; Claude, J.; Strimmer, K. APE: Analyses of Phylogenetics and Evolution in R

Language. Bioinformatics 2004, 20, 289–290.

(349) Sander, Thomas; Freyss, Joel; von Korff, Modest; Rufener, Christian. DataWarrior: An

Open-Source Program for Chemistry Aware Data Visualization and Analysis. J. Chem. Inf.

Model. 2015, 55, 460–473.

(350) Bemis, Guy W.; Murcko, Mark A. The Properties of Known Drugs. 1. Molecular

Frameworks. J. Med. Chem. 1996, 39, 2887–2893.

(351) Berthold, Michael R.; Cebron, Nicolas; Dill, Fabian; Gabriel, Thomas R.; Kötter, Tobias;

Meinl, Thorsten; Ohl, Peter; Thiel, Kilian; Wiswedel, Bernd. KNIME - the Konstanz

Information Miner. In ACM SIGKDD Explorations Newsletter; Preisach C., Burkhardt H.,

Schmidt-Thieme L., Decker R., Ed.; Springer: Berlin, Heidelberg, 2009; Vol. 11, p 26.

(352) Ahlberg, Christopher. Spotfire: An Information Exploration Environment. ACM SIGMOD

Rec. 1996, 25, 25–29.

(353) RDKit: Open-source cheminformatics http://www.rdkit.org (accessed Mar 19, 2018).

(354) Schwab, Christof H. Conformations and 3D Pharmacophore Searching. Drug Discov.

Today Technol. 2010, 7, e245–e253.

(355) Molecular Operating Environment (MOE), 2013.08; Chemical Computing Group Inc.,

1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2016.

(356) Kuhn, Max; Wing, Jed; Weston, Steve; Williams, Andre. caret: Classification and

Regression Training. R package version 5.15-044.

http://cran.r-project.org/package=caret (accessed Feb 8, 2018).

(357) Rücker, Christoph; Rücker, Gerta; Meringer, Markus. Y-Randomization and Its Variants

in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345–2357.

(358) Nissink, J. Willem M.; Blackburn, Sam. Quantification of Frequent-Hitter Behavior Based

on Historical High-Throughput Screening Data. Future Med. Chem. 2014, 6, 1113–1126.

(359) Cumming, John G.; Davis, Andrew M.; Muresan, Sorel; Haeberlein, Markus; Chen,

Hongming. Chemical Predictive Modelling to Improve Compound Quality. Nat. Rev.

Drug Discov. 2013, 12, 948–962.

(360) Godden, Jeffrey W.; Xue, Ling; Bajorath, Jürgen. Combinatorial Preferences Affect

Molecular Similarity/Diversity Calculations Using Binary Fingerprints and Tanimoto

Coefficients. J. Chem. Inf. Comput. Sci. 2000, 40, 163–166.

(361) Wickham, Hadley; Francois, Romain. A Grammar of Data Manipulation https://cran.r-

project.org/web/packages/dplyr/index.html (accessed Feb 8, 2018).

(362) Niesen, Frank H.; Berglund, Helena; Vedadi, Masoud. The Use of Differential Scanning

Fluorimetry to Detect Ligand Interactions That Promote Protein Stability. Nat. Protoc.

2007, 2, 2212–2221.

(363) Igoe, Niall; Bayle, Elliott D.; Tallant, Cynthia; Fedorov, Oleg; Meier, Julia C.; Savitsky,

Pavel; Rogers, Catherine; Morias, Yannick; Scholze, Sarah; Boyd, Helen; Cunoosamy,

Danen; Andrews, David M.; Cheasty, Anne; Brennan, Paul E.; Müller, Susanne; Knapp,

Stefan; Fish, Paul V. Design of a Chemical Probe for the Bromodomain and Plant

Homeodomain Finger-Containing (BRPF) Family of Proteins. J. Med. Chem. 2017, 60,

6998–7011.

(364) De Bruyn, Tom; van Westen, Gerard J. P.; IJzerman, Adriaan P.; Stieger, Bruno; de Witte,

Peter; Augustijns, Patrick F.; Annaert, Pieter P. Structure-Based Identification of

OATP1B1/3 Inhibitors. Mol. Pharmacol. 2013, 83, 1257–1267.

(365) Cortés-Ciriano, Isidro; Van Westen, Gerard J. P.; Bouvier, Guillaume; Nilges, Michael;

Overington, John P.; Bender, Andreas; Malliavin, Thérèse E. Improved Large-Scale

Prediction of Growth Inhibition Patterns Using the NCI60 Cancer Cell Line Panel.

Bioinformatics 2015, 32, 85–95.

(366) Bender, Andreas; Glen, Robert C. Molecular Similarity: A Key Technique in Molecular

Informatics. Org. Biomol. Chem. 2004, 2, 3204.

(367) Giblin, Kathryn A.; Hughes, Samantha J.; Boyd, Helen; Hansson, Pia; Bender, Andreas.

Prospectively Validated Proteochemometric Models for the Prediction of Small-Molecule

Binding to Bromodomain Proteins. J. Chem. Inf. Model. 2018, 58, 1870–1888.

(368) Louppe, Gilles; Wehenkel, Louis; Sutera, Antonio; Geurts, Pierre. Understanding

Variable Importances in Forests of Randomized Trees. Neural Inf. Process. Syst. 2013, 1–

9.

(369) Menze, Bjoern H.; Kelm, B. Michael; Masuch, Ralf; Himmelreich, Uwe; Bachert, Peter;

Petrich, Wolfgang; Hamprecht, Fred A. A Comparison of Random Forest and Its Gini

Importance with Standard Chemometric Methods for the Feature Selection and

Classification of Spectral Data. BMC Bioinformatics 2009, 10, 213.

(370) Chung, Chun-wa; Coste, Hervé; White, Julia H.; Mirguet, Olivier; Wilde, Jonathan;

Gosmini, Romain L.; Delves, Chris; Magny, Sylvie M.; Woodward, Robert; Hughes,

Stephen A.; Boursier, Eric V.; Flynn, Helen; Bouillot, Anne M.; Bamborough, Paul; Brusq,

Jean-Marie G.; Gellibert, Françoise J.; Jones, Emma J.; Riou, Alizon M.; Homes, Paul; et

al. Discovery and Characterization of Small Molecule Inhibitors of the BET Family


(371) Valero-Mora, Pedro M. Ggplot2: Elegant Graphics for Data Analysis. J. Stat. Softw. 2015,

35, 212.

(372) Kabsch, Wolfgang. XDS. Acta Crystallogr. Sect. D Biol. Crystallogr. 2010, 66, 125–

132.

(373) Evans, Philip R.; Murshudov, Garib N. How Good Are My Data and What Is the

Resolution? Acta Crystallogr. Sect. D Biol. Crystallogr. 2013, 69, 1204–1214.

(374) Winn, Martyn D.; Ballard, Charles C.; Cowtan, Kevin D.; Dodson, Eleanor J.; Emsley, Paul;

Evans, Phil R.; Keegan, Ronan M.; Krissinel, Eugene B.; Leslie, Andrew G. W.; McCoy,

Airlie; McNicholas, Stuart J.; Murshudov, Garib N.; Pannu, Navraj S.; Potterton, Elizabeth

A.; Powell, Harold R.; Read, Randy J.; Vagin, Alexei; Wilson, Keith S. Overview of the

CCP4 Suite and Current Developments. Acta Crystallogr. Sect. D Biol. Crystallogr. 2011, 67,

235–242.

(375) McCoy, Airlie J.; Grosse-Kunstleve, Ralf W.; Adams, Paul D.; Winn, Martyn D.; Storoni,

Laurent C.; Read, Randy J. Phaser Crystallographic Software. J. Appl. Crystallogr.

2007, 40, 658–674.

(376) Bricogne, G.; Blanc, E.; Brandl, M.; Flensburg, C.; Keller, P.; Paciorek, W.; Roversi, P.; Sharff,

A.; Smart, O. S.; Vonrhein, C.; Womack, T. O. (2017). Welcome to Global Phasing Limited

http://www.globalphasing.com./?_sm_au_=i0VZntN5H3PkNsP6 (accessed Feb 18,

2019).

(377) Emsley, P.; Lohkamp, B.; Scott, W. G.; Cowtan, K. Features and Development of Coot.

Acta Crystallogr. Sect. D Biol. Crystallogr. 2010, 66, 486–501.

(378) Smart, O. S.; Womack, T. O.; Sharff, A.; Flensburg, C.; Keller, P.; Paciorek, W.; Vonrhein, C.;

Bricogne, G. Grade, Version 1.2.9. Global Phasing Ltd: Cambridge, United Kingdom, 2011.

(379) Bharatham, Nagakumar; Slavish, Peter J.; Shadrick, William R.; Young, Brandon M.;

Shelat, Anang A. The Role of ZA Channel Water-Mediated Interactions in the Design of

Bromodomain-Selective BET Inhibitors. J. Mol. Graph. Model. 2018, 81, 197–210.

(380) Simeon, Saw; Anuwongcharoen, Nuttapat; Shoombuatong, Watshara; Malik, Aijaz

Ahmad; Prachayasittikul, Virapong; Wikberg, Jarl E. S.; Nantasenamat, Chanin. Probing

the Origins of Human Acetylcholinesterase Inhibition via QSAR Modeling and Molecular

Docking. PeerJ 2016, 4, e2322.

(381) Marchese Robinson, Richard L.; Palczewska, Anna; Palczewski, Jan; Kidley, Nathan.

Comparison of the Predictive Performance and Interpretability of Random Forest and

Linear Models on Benchmark Data Sets. J. Chem. Inf. Model. 2017, 57, 1773–1792.

(382) Zeng, Lei; Zhang, Qiang; Gerona-Navarro, Guillermo; Moshkina, Natalia; Zhou, Ming-

Ming. Structural Basis of Site-Specific Histone Recognition by the Bromodomains of

Human Coactivators PCAF and CBP/P300. Structure 2008, 16, 643–652.

(383) Jennings, Laura E.; Schiedel, Matthias; Hewings, David S.; Picaud, Sarah; Laurin,

Corentine M. C.; Bruno, Paul A.; Bluck, Joseph P.; Scorah, Amy R.; See, Larissa; Reynolds,

Jessica K.; Moroglu, Mustafa; Mistry, Ishna N.; Hicks, Amy; Guzanov, Pavel; Clayton,

James; Evans, Charles N. G.; Stazi, Giulia; Biggin, Philip C.; Mapp, Anna K.; et al. BET

Bromodomain Ligands: Probing the WPF Shelf to Improve BRD4 Bromodomain Affinity

and Metabolic Stability. Bioorg. Med. Chem. 2018, 26, 2937–2957.

(384) Liu, Zhiqing; Wang, Pingyuan; Chen, Haiying; Wold, Eric A.; Tian, Bing; Brasier, Allan

R.; Zhou, Jia. Drug Discovery Targeting Bromodomain-Containing Protein 4. J. Med.

Chem. 2017, 60, 4533–4558.

(385) Demont, Emmanuel H.; Chung, Chun-wa; Furze, Rebecca C.; Grandi, Paola; Michon,

Anne-Marie; Wellaway, Chris; Barrett, Nathalie; Bridges, Angela M.; Craggs, Peter D.;

Diallo, Hawa; Dixon, David P.; Douault, Clement; Emmons, Amanda J.; Jones, Emma J.;

Karamshi, Bhumika V.; Locke, Kelly; Mitchell, Darren J.; Mouzon, Bernadette H.; Prinjha,

Rab K.; et al. Fragment-Based Discovery of Low-Micromolar ATAD2 Bromodomain

Inhibitors. J. Med. Chem. 2015, 58, 5649–5673.

(386) Quinn, John Frederick; Duffy, Bryan Cordell; Liu, Shuang; Wang, Ruifang;

Jiang, May Xiaowu; Martin, Gregory Scott; Wagner, Gregory Steven;

Young, Peter Ronald. Novel Bicyclic Bromodomain Inhibitors, 2017.

(387) Hewings, David S.; Wang, Minghua; Philpott, Martin; Fedorov, Oleg; Uttarkar, Sagar;

Filippakopoulos, Panagis; Picaud, Sarah; Vuppusetty, Chaitanya; Marsden, Brian; Knapp,

Stefan; Conway, Stuart J.; Heightman, Tom D. 3,5-Dimethylisoxazoles Act As Acetyl-

Lysine-Mimetic Bromodomain Ligands. J. Med. Chem. 2011, 54, 6761–6770.

(388) Xue, Xiaoqian; Zhang, Yan; Wang, Chao; Zhang, Maofeng; Xiang, Qiuping; Wang,

Junjian; Wang, Anhui; Li, Chenchang; Zhang, Cheng; Zou, Lingjiao; Wang, Rui; Wu,

Shuang; Lu, Yongzhi; Chen, Hongwu; Ding, Ke; Li, Guohui; Xu, Yong. Benzoxazinone-

Containing 3,5-Dimethylisoxazole Derivatives as BET Bromodomain Inhibitors for

Treatment of Castration-Resistant Prostate Cancer. Eur. J. Med. Chem. 2018, 152, 542–

559.

(389) Hay, Duncan; Fedorov, Oleg; Filippakopoulos, Panagis; Martin, Sarah; Philpott, Martin;

Picaud, Sarah; Hewings, David S.; Uttarkar, Sagar; Heightman, Tom D.; Conway, Stuart J.;

Knapp, Stefan; Brennan, Paul E. The Design and Synthesis of 5- and 6-

Isoxazolylbenzimidazoles as Selective Inhibitors of the BET Bromodomains. Med. Chem.

Commun. 2013, 4, 140–144.

(390) Reed Elsevier Properties SA. PharmaPendium https://pharmapendium.com/ (accessed

Jun 4, 2018).

(391) Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion,

Bertrand; Grisel, Olivier; Blondel, Mathieu; Louppe, Gilles; Prettenhofer, Peter; Weiss,

Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David;

Brucher, Matthieu; Perrot, Matthieu; Duchesnay, Édouard. Scikit-Learn:

Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.

(392) Li, Wentian. Mutual Information Functions Versus Correlation Functions in Binary

Sequences. J. Stat. Phys. 2012, 60, 249–252.

(393) Jones, Eric; Oliphant, Travis; Peterson, Pearu. SciPy: Open source scientific tools for

Python http://www.scipy.org/ (accessed Jun 4, 2018).

(394) Grimes, David A.; Schulz, Kenneth F. Refining Clinical Diagnosis with Likelihood Ratios.

Lancet 2005, 365, 1500–1505.

(395) Fruchterman, Thomas M. J.; Reingold, Edward M. Graph Drawing by Force-Directed

Placement. Software-Practice Exp. 1991, 21, 1129–1164.

(396) Imai, Hiroshi; Asano, Takao. Finding the Connected Components and a Maximum Clique

of an Intersection Graph of Rectangles in the Plane. J. Algorithms 1983, 4, 310–323.

(397) Fortunato, Santo. Community Detection in Graphs. Phys. Rep. 2010, 486, 75–174.

(398) Sharma, Hari S.; Menon, Preeti; Lafuente, José Vicente; Muresanu, Dafin F.; Tian, Z. Ryan;

Patnaik, Ranjana; Sharma, Aruna. Development of in Vivo Drug-Induced Neurotoxicity

Models. Expert Opin. Drug Metab. Toxicol. 2014, 10, 1637–1661.

(399) Roberts, Ruth A.; Aschner, Michael; Calligaro, David; Guilarte, Tomas R.; Hanig, Joseph

P.; Herr, David W.; Hudzik, Thomas J.; Jeromin, Andreas; Kallman, Mary J.; Liachenko,

Serguei; Lynch, James J.; Miller, Diane B.; Moser, Virginia C.; O’Callaghan, James P.;

Slikker, William; Paule, Merle G. Translational Biomarkers of Neurotoxicity: A Health

and Environmental Sciences Institute Perspective on the Way Forward. Toxicol. Sci. 2015,

148, 332–340.

(400) O’Donoghue, John L. Clinical Neurologic Indices of Toxicity in Animals. Environ. Health

Perspect. 1996, 104, 323–330.

(401) Süleyman, Halis; Demircan, Berna; Karagöz, Yalçin. Anti-Inflammatory and Side Effects

of Cyclooxygenase Inhibitors. Pharmacol. Reports 2007, 59, 247–258.

(402) Bach, Peter H.; Nguyen, Thi Kim Thanh. Renal Papillary Necrosis 40 Years On. Toxicol.

Pathol. 1998, 26, 73–91.

(403) Nicolaides, Nicolas C.; Pavlaki, Aikaterini N.; Magiakou, Maria Alexandra;

Chrousos, George P. Glucocorticoid Therapy and Adrenal Suppression. In Endotext;

Feingold, Kenneth R., Ed.; MDText.com, Inc.: South Dartmouth (MA), 2018.

(404) Inomata, Akira; Sasano, Hironobu. Practical Approaches for Evaluating Adrenal Toxicity

in Nonclinical Safety Assessment. J. Toxicol. Pathol. 2015, 28, 125–132.

(405) Schimpf, Rainer; Borggrefe, Martin; Wolpert, Christian. Clinical and Molecular Genetics

of the Short QT Syndrome. Curr. Opin. Cardiol. 2008, 23, 192–198.

(406) Bjerregaard, Preben. Diagnosis and Management of Short QT Syndrome. Heart Rhythm

2018, 15, 1261–1267.

(407) Holbrook, Mark; Malik, Marek; Shah, Rashmi R.; Valentin, Jean-Pierre. Drug Induced

Shortening of the QT/QTc Interval: An Emerging Safety Issue Warranting Further

Modelling and Evaluation in Drug Research and Development? J. Pharmacol. Toxicol.

Methods 2009, 59, 21–28.

(408) Visentin, Michele; Lenggenhager, Daniela; Gai, Zhibo; Kullak-Ublick, Gerd A. Drug-

Induced Bile Duct Injury. Biochim. Biophys. Acta - Mol. Basis Dis. 2018, 1864, 1498–1506.

(409) Brown, Carlton Gene. Testicular Cancer: An Overview. MedSurg Nurs. 2003, 12, 37–45.

(410) Park, Jun-Bean; Kang, Do-yoon; Yang, Han-Mo; Cho, Hyun-Jai; Park, Kyung Woo; Lee,

Hae-Young; Kang, Hyun-Jae; Koo, Bon-Kwon; Kim, Hyo-Soo. Serum Alkaline

Phosphatase Is a Predictor of Mortality, Myocardial Infarction, or Stent Thrombosis after

Implantation of Coronary Drug-Eluting Stent. Eur. Heart J. 2013, 34, 920–931.

(411) Sheen, Campbell R.; Kuss, Pia; Narisawa, Sonoko; Yadav, Manisha C.; Nigro, Jessica;

Wang, Wei; Chhea, T. Nicole; Sergienko, Eduard A.; Kapoor, Kapil; Jackson, Michael R.;

Hoylaerts, Marc F.; Pinkerton, Anthony B.; O’Neill, W. Charles; Millán, José Luis.

Pathophysiological Role of Vascular Smooth Muscle Alkaline Phosphatase in Medial

Artery Calcification. J. Bone Miner. Res. 2015, 30, 824–836.

(412) Dereure, Olivier. Drug-Induced Skin Pigmentation. Am. J. Clin. Dermatol. 2001, 2, 253–

262.

(413) Kuokkanen, Satu; Zhu, Liyin; Pollard, Jeffrey W. Xenografted Tissue Models for the Study

of Human Endometrial Biology. Differentiation 2017, 98, 62–69.

(414) van der Laan, Jan Willem; Chapin, Robert E.; Haenen, Bert; Jacobs, Abigail C.; Piersma,

Aldert. Testing Strategies for Embryo-Fetal Toxicity of Human Pharmaceuticals. Animal

Models vs. in Vitro Approaches. Regul. Toxicol. Pharmacol. 2012, 63, 115–123.

(415) Stokes, William S. Humane Endpoints for Laboratory Animals Used in Regulatory

Testing. ILAR J. 2002, 43, S31-38.

(416) Wijemanne, Subhashie; Jankovic, Joseph. Movement Disorders in Catatonia. J. Neurol.

Neurosurg. Psychiatry 2015, 86, 825–832.

(417) Behari, Madhuri. Current Status of Dystonias Including Meige’s Syndrome. Neurol. India

2018, 66, 36–37.

(418) Shin, Hae-Won; Chung, Sun Ju. Drug-Induced Parkinsonism. J. Clin. Neurol. 2012, 8, 15–

21.

(419) Dabbous, Zeinab; Atkin, Stephen L. Hyperprolactinaemia in Male Infertility: Clinical

Case Scenarios. Arab J. Urol. 2018, 16, 44–52.

(420) Fitzgerald, Peter; Dinan, Timothy G. Prolactin and Dopamine: What Is the Connection?

A Review Article. J. Psychopharmacol. 2008, 22, 12–19.

(421) Tang, Jing; Tanoli, Zia ur Rehman; Ravikumar, Balaguru; Alam, Zaid; Rebane, Anni; Vähä-

Koskela, Markus; Peddinti, Gopal; van Adrichem, Arjan J.; Wakkinen, Janica; Jaiswal,

Alok; Karjalainen, Ella; Gautam, Prson; He, Liye; Parri, Elina; Khan, Suleiman; Gupta,

Abhishekh; Ali, Mehreen; Yetukuri, Laxman; Gustavsson, Anna Lena; et al. Drug Target

Commons: A Community Effort to Build a Consensus Knowledge Base for Drug-Target

Interactions. Cell Chem. Biol. 2018, 25, 224–229.e2.

(422) Siramshetty, Vishal B.; Eckert, Oliver Andreas; Gohlke, Björn-Oliver; Goede, Andrean;

Chen, Qiaofeng; Devarakonda, Prashanth; Preissner, Saskia; Preissner, Robert.

SuperDRUG2: A One Stop Resource for Approved/Marketed Drugs. Nucleic Acids Res.

2018, 46, D1137–D1143.

(423) Heller, Stephen R.; McNaught, Alan; Pletnev, Igor; Stein, Stephen; Tchekhovskoi,

Dmitrii. InChI, the IUPAC International Chemical Identifier. J. Cheminform. 2015, 7, 23.

(424) Maglott, Donna; Ostell, Jim; Pruitt, Kim D.; Tatusova, Tatiana. Entrez Gene: Gene-

Centered Information at NCBI. Nucleic Acids Res. 2011, 39, D52-7.

(425) Toad for mysql - MariaDB Knowledge Base http://www.toadworld.com/products#mysql

(accessed Feb 4, 2019).

(426) Burdett, T.; Jupp, Simon; Malone, James; Williams, Eleanor; Keays, Maria; Parkinson,

Helen. Zooma2 - A repository of annotation

knowledge and curation API http://www.ebi.ac.uk/spot/zooma/index.html (accessed

Jun 4, 2018).

(427) Schriml, Lynn M.; Mitraka, Elvira; Munro, James; Tauber, Becky; Schor, Mike; Nickle,

Lance; Felix, Victor; Jeng, Linda; Bearer, Cynthia; Lichenstein, Richard; Bisordi,

Katharine; Campion, Nicole; Hyman, Brooke; Kurland, David; Oates, Connor Patrick;

Kibbey, Siobhan; Sreekumar, Poorna; Le, Chris; Giglio, Michelle; et al. Human Disease

Ontology 2018 Update: Classification, Content and Workflow Expansion. Nucleic Acids

Res. 2019, 47, D955–D962.

(428) Smith, Cynthia L.; Eppig, Janan T. The Mammalian Phenotype Ontology: Enabling

Robust Annotation and Comparative Analysis. Wiley Interdiscip. Rev. Syst. Biol. Med.

2009, 1, 390–399.

(429) Robinson, Peter N.; Köhler, Sebastian; Bauer, Sebastian; Seelow, Dominik; Horn, Denise;

Mundlos, Stefan. The Human Phenotype Ontology: A Tool for Annotating and Analyzing

Human Hereditary Disease. Am. J. Hum. Genet. 2008, 83, 610–615.

(430) Malone, James; Holloway, Ele; Adamusiak, Tomasz; Kapushesky, Misha; Zheng, Jie;

Kolesnikov, Nikolay; Zhukova, Anna; Brazma, Alvis; Parkinson, Helen. Modeling Sample

Variables with an Experimental Factor Ontology. Bioinformatics 2010, 26, 1112–1118.

(431) Vasant, Drashtti; Chanas, Laetitia; Malone, James; Hanauer, Marc; Olry, Annie; Jupp,

Simon; Robinson, Peter N.; Parkinson, Helen; Rath, Ana. ORDO: An Ontology

Connecting Rare Disease, Epidemiology and Genetic Data. In Phenotype data at

ISMB2014; 2014.

(432) Ceusters, Werner; Smith, B.; Goldberg, L. A Terminological and Ontological Analysis of

the NCI Thesaurus. Methods Inf. Med. 2005, 44, 498–507.

(433) Huang, Jingshan; Dang, Jiangbo; Borchert, Glen M.; Eilbeck, Karen; Zhang, He; Xiong,

Min; Jiang, Weijian; Wu, Hao; Blake, Judith A.; Natale, Darren A.; Tan, Ming. OMIT:

Dynamic, Semi-Automated Ontology Development for the MicroRNA Domain. PLoS

One 2014, 9, e100855.

(434) He, Yongqun; Sarntivijai, Sirarat; Lin, Yu; Xiang, Zuoshuang; Guo, Abra; Zhang, Shelley;

Jagannathan, Desikan; Toldo, Luca; Tao, Cui; Smith, Barry. OAE: The Ontology of

Adverse Events. J. Biomed. Semantics 2014, 5, 29.

(435) Mungall, Christopher J.; Koehler, Sebastian; Robinson, Peter; Holmes, Ian; Haendel,

Melissa. K-BOOM: A Bayesian Approach to Ontology Structure Inference, with

Applications in Disease Ontology Construction. bioRxiv 2019, 048843.

(436) Mohammed, Osama; Benlamri, Rachid; Fong, Simon. Building a Diseases Symptoms

Ontology for Medical Diagnosis: An Integrative Approach. In 1st International Conference

on Future Generation Communication Technologies, FGCT 2012; IEEE, 2012; pp 104–108.

(437) Ceusters, Werner; Smith, Barry. Foundations for a Realist Ontology of Mental Disease. J.

Biomed. Semantics 2010, 1, 10.

(438) Schofield, P. N.; Gruenberger, M.; Sundberg, John P. Pathbase and the MPATH Ontology:

Community Resources for Mouse Histopathology. Vet. Pathol. 2010, 47, 1016–1020.

(439) Dönitz, Jürgen; Wingender, Edgar. The Ontology-Based Answers (OBA) Service: A

Connector for Embedded Usage of Ontologies in Applications. Front. Genet. 2012, 3, 197.

(440) Visser, Ubbo; Abeyruwan, Saminda; Vempati, Uma; Smith, Robin P.; Lemmon, Vance;

Schürer, Stephan C. BioAssay Ontology (BAO): A Semantic Description of Bioassays and

High-Throughput Screening Results. BMC Bioinformatics 2011, 12, 257.

(441) Koscielny, Gautier; An, Peter; Carvalho-Silva, Denise; Cham, Jennifer A.; Fumis, Luca;

Gasparyan, Rippa; Hasan, Samiul; Karamanis, Nikiforos; Maguire, Michael; Papa, Eliseo;

Pierleoni, Andrea; Pignatelli, Miguel; Platt, Theo; Rowland, Francis; Wankar, Priyanka;

Bento, A. Patrícia; Burdett, Tony; Fabregat, Antonio; Forbes, Simon; et al. Open Targets:

A Platform for Therapeutic Target Identification and Validation. Nucleic Acids Res. 2017,

45, D985–D994.

(442) opentargets - Python client for targetvalidation.org — opentargets 2.0.0 documentation

https://opentargets.readthedocs.io/en/stable/ (accessed Oct 8, 2018).

(443) Zerbino, Daniel R.; Achuthan, Premanand; Akanni, Wasiu; Amode, M. Ridwan; Barrell,

Daniel; Bhai, Jyothish; Billis, Konstantinos; Cummins, Carla; Gall, Astrid; Girón, Carlos

García; Gil, Laurent; Gordon, Leo; Haggerty, Leanne; Haskell, Erin; Hourlier, Thibaut;

Izuogu, Osagie G.; Janacek, Sophie H.; Juettemann, Thomas; To, Jimmy Kiang; et al.

Ensembl 2018. Nucleic Acids Res. 2018, 46, D754–D761.

(444) Mungall, Christopher J.; McMurry, Julie A.; Köhler, Sebastian; Balhoff, James P.;

Borromeo, Charles; Brush, Matthew; Carbon, Seth; Conlin, Tom; Dunn, Nathan;

Engelstad, Mark; Foster, Erin; Gourdine, J. P.; Jacobsen, Julius O. B.; Keith, Dan; Laraway,

Bryan; Lewis, Suzanna E.; NguyenXuan, Jeremy; Shefchek, Kent; Vasilevsky, Nicole; et al.

The Monarch Initiative: An Integrative Data and Analytic Platform Connecting

Phenotypes to Genotypes across Species. Nucleic Acids Res. 2017, 45, D712–D722.

(445) Requests: HTTP for Humans™ — Requests 2.21.0 documentation

http://docs.python-requests.org/en/master/ (accessed Mar 14, 2019).

(446) About the HGNC: HUGO Gene Nomenclature Committee

https://www.genenames.org/about/overview (accessed Oct 8, 2018).

(447) Smith, Cynthia L.; Blake, Judith A.; Kadin, James A.; Richardson, Joel E.; Bult, Carol J.;

Mouse Genome Database Group. Mouse Genome Database (MGD)-2018: Knowledgebase

for the Laboratory Mouse. Nucleic Acids Res. 2018, 46, D836–D842.

(448) Jensen, Lars Juhl; Julien, Philippe; Kuhn, Michael; von Mering, Christian; Muller, Jean;

Doerks, Tobias; Bork, Peer. EggNOG: Automated Construction and Annotation of

Orthologous Groups of Genes. Nucleic Acids Res. 2008, 36, D250-4.

(449) pandas: a Foundational Python Library for Data Analysis and Statistics

https://www.scribd.com/document/71048089/pandas-a-Foundational-Python-Library-for-Data-Analysis-and-Statistics (accessed Jun 4, 2018).

(450) Huang, Yue; Yu, Sui; Wu, Zhanhe; Tang, Beisha. Genetics of Hereditary Neurological

Disorders in Children. Transl. Pediatr. 2014, 3, 108–119.

(451) Katritsis, Demosthenes G.; Gersh, Bernard J.; Camm, A. John. A Clinical Perspective on

Sudden Cardiac Death. Arrhythmia Electrophysiol. Rev. 2016, 5, 177–182.

(452) Antzelevitch, C.; Pollevick, G. D.; Cordeiro, J. M.; Casis, O.; Sanguinetti, M. C.; Aizawa,

Y.; Guerchicoff, A.; Pfeiffer, R.; Oliva, A.; Wollnik, B.; Gelber, P.; Bonaros, E. P.;

Burashnikov, E.; Wu, Y.; Sargent, J. D.; Schickel, S.; Oberheiden, R.; Bhatia, A.; Hsu, L. F.;

et al. Loss-of-Function Mutations in the Cardiac Calcium Channel Underlie a New

Clinical Entity Characterized by ST-Segment Elevation, Short QT Intervals, and Sudden

Cardiac Death. Circulation 2007, 115, 442–449.

(453) Schimpf, R.; Veltmann, C.; Wolpert, C.; Borggrefe, M. Arrhythmogenic Hereditary

Syndromes: Brugada Syndrome, Long QT Syndrome, Short QT Syndrome and CPVT.

Minerva Cardioangiol. 2010, 58, 623–636.

(454) Campuzano, Oscar; Sarquella-Brugada, Georgia; Brugada, Ramon; Brugada, Josep.

Brugada Syndrome. Clin. Cardiogenetics Second Ed. 2016, 175–191.

(455) Whitebread, Steven; Hamon, Jacques; Bojanic, Dejan; Urban, Laszlo. Keynote Review: In

Vitro Safety Pharmacology Profiling: An Essential Tool for Successful Drug Development.

Drug Discov. Today 2005, 10, 1421–1433.

(456) Lu, H. R.; Vlaminckx, E.; Hermans, A. N.; Rohrbacher, J.; Van Ammel, K.; Towart, R.;

Pugsley, M.; Gallacher, D. J. Predicting Drug-Induced Changes in QT Interval and

Arrhythmias: QT-Shortening Drugs Point to Gaps in the ICH S7B Guidelines. Br. J.

Pharmacol. 2008, 154, 1427–1438.

(457) Colatsky, Thomas; Fermini, Bernard; Gintant, Gary; Pierson, Jennifer B.; Sager, Philip;

Sekino, Yuko; Strauss, David G.; Stockbridge, Norman. The Comprehensive in Vitro

Proarrhythmia Assay (CiPA) Initiative — Update on Progress. J. Pharmacol. Toxicol.

Methods 2016, 81, 15–20.

(458) Toth, Linda A.; Bhargava, Pavan. Animal Models of Sleep Disorders. Comp. Med. 2013,

63, 91–104.

(459) Ledent, Catherine; Vaugeois, Jean-Marie; Schiffmann, Serge N.; Pedrazzini, Thierry;

Yacoubi, Malika El; Vanderhaeghen, Jean-Jacques; Costentin, Jean; Heath, John K.;

Vassart, Gilbert; Parmentier, Marc. Aggressiveness, Hypoalgesia and High Blood Pressure

in Mice Lacking the Adenosine A2a Receptor. Nature 1997, 388, 674–678.

(460) Yang, Amy; Palmer, Abraham A.; De Wit, Harriet. Genetics of Caffeine Consumption and

Responses to Caffeine. Psychopharmacology (Berl). 2010, 211, 245–257.

(461) Rétey, J. V; Adam, M.; Khatami, R.; Luhmann, U. F. O.; Jung, H. H.; Berger, W.; Landolt,

H. P. A Genetic Variation in the Adenosine A2A Receptor Gene (ADORA2A) Contributes

to Individual Sensitivity to Caffeine Effects on Sleep. Clin. Pharmacol. Ther. 2007, 81,

692–698.

(462) Mochizuki, T.; Arrigoni, E.; Marcus, J. N.; Clark, E. L.; Yamamoto, M.; Honer, M.; Borroni,

E.; Lowell, B. B.; Elmquist, J. K.; Scammell, T. E. Orexin Receptor 2 Expression in the

Posterior Hypothalamus Rescues Sleepiness in Narcoleptic Mice. Proc. Natl. Acad. Sci.

2011, 108, 4471–4476.

(463) Willie, Jon T.; Chemelli, Richard M.; Sinton, Christopher M.; Tokita, Shigeru; Williams,

S. Clay; Kisanuki, Yaz Y.; Marcus, Jacob N.; Lee, Charlotte; Elmquist, Joel K.; Kohlmeier,

Kristi A.; Leonard, Christopher S.; Richardson, James A.; Hammer, Robert E.; Yanagisawa,

Masashi. Distinct Narcolepsy Syndromes in Orexin Receptor-2 and Orexin Null Mice:

Molecular Genetic Dissection of Non-REM and REM Sleep Regulatory Processes. Neuron

2003, 38, 715–730.

(464) Ghanemi, Abdelaziz; Hu, Xintian. Targeting the Orexinergic System: Mainly but Not

Only for Sleep-Wakefulness Therapies. Alexandria J. Med. 2015, 51, 279–286.

(465) Irukayama-Tomobe, Yoko; Ogawa, Yasuhiro; Tominaga, Hiromu; Ishikawa, Yukiko;

Hosokawa, Naoto; Ambai, Shinobu; Kawabe, Yuki; Uchida, Shuntaro; Nakajima, Ryo;

Saitoh, Tsuyoshi; Kanda, Takeshi; Vogt, Kaspar; Sakurai, Takeshi; Nagase, Hiroshi;

Yanagisawa, Masashi. Nonpeptide Orexin Type-2 Receptor Agonist Ameliorates

Narcolepsy-Cataplexy Symptoms in Mouse Models. Proc. Natl. Acad. Sci. 2017, 114, 5731–

5736.

(466) Razavi, Bibi Marjan; Hosseinzadeh, Hossein. A Review of the Role of Orexin System in

Pain Modulation. Biomed. Pharmacother. 2017, 90, 187–193.

(467) Santos, Cynthia; Olmedo, Ruben E. Sedative-Hypnotic Drug Withdrawal Syndrome:

Recognition And Treatment. Emerg. Med. Pract. 2017, 19, 1–20.

(468) Bidwell, L. Cinnamon; Garrett, Melanie E.; McClernon, F. Joseph; Fuemmeler, Bernard

F.; Williams, Redford B.; Ashley-Koch, Allison E.; Kollins, Scott H. A Preliminary Analysis

of Interactions between Genotype, Retrospective ADHD Symptoms, and Initial Reactions

to Smoking in a Sample of Young Adults. Nicotine Tob. Res. 2012, 14, 229–233.

(469) Fisone, G.; Borgkvist, A.; Usiello, A. Caffeine as a Psychomotor Stimulant: Mechanism of

Action. Cell. Mol. Life Sci. 2004, 61, 857–872.

(470) Listos, Joanna; Malec, Danuta; Fidecka, Sylwia. Influence of Adenosine Receptor Agonists

on Benzodiazepine Withdrawal Signs in Mice. Eur. J. Pharmacol. 2005, 523, 71–78.

(471) Ballesteros-Yáñez, Inmaculada; Castillo, Carlos A.; Merighi, Stefania; Gessi, Stefania. The

Role of Adenosine Receptors in Psychostimulant Addiction. Front. Pharmacol. 2018, 8,

985.

(472) Lu, Ake T.; Ogdie, Matthew N.; Järvelin, Marjo-Ritta; Moilanen, Irma K.; Loo, Sandra K.;

McCracken, James T.; McGough, James J.; Yang, May H.; Peltonen, Leena; Nelson, Stanley

F.; Cantor, Rita M.; Smalley, Susan L. Association of the Cannabinoid Receptor Gene

(CNR1) with ADHD and Post-Traumatic Stress Disorder. Am. J. Med. Genet. B.

Neuropsychiatr. Genet. 2008, 147B, 1488–1494.

(473) Castelli, Maura; Federici, Mauro; Rossi, Silvia; De Chiara, Valentina; Napolitano,

Francesco; Studer, Valeria; Motta, Caterina; Sacchetti, Lucia; Romano, Rosaria; Musella,

Alessandra; Bernardi, Giorgio; Siracusano, Alberto; Gu, Howard H.; Mercuri, Nicola B.;

Usiello, Alessandro; Centonze, Diego. Loss of Striatal Cannabinoid CB1 Receptor

Function in Attention-Deficit / Hyperactivity Disorder Mice with Point-Mutation of the

Dopamine Transporter. Eur. J. Neurosci. 2011, 34, 1369–1377.

(474) Haughey, Heather M.; Marshall, Erin; Schacht, Joseph P.; Louis, Ashleigh; Hutchison,

Kent E. Marijuana Withdrawal and Craving: Influence of the Cannabinoid Receptor 1

(CNR1) and Fatty Acid Amide Hydrolase (FAAH) Genes. Addiction 2008, 103, 1678–1686.

(475) Jones, Declan N. C.; Holtzman, Stephen G. Influence of Naloxone upon Motor Activity

Induced by Psychomotor Stimulant Drugs. Psychopharmacology (Berl). 1994, 114, 215–

224.

(476) Abramov, Urho; Raud, Sirli; Kõks, Sulev; Innos, Jürgen; Kurrikoff, Kaido; Matsui,

Toshimitsu; Vasar, Eero. Targeted Mutation of CCK2 Receptor Gene Antagonises

Behavioural Changes Induced by Social Isolation in Female, but Not in Male Mice. Behav.

Brain Res. 2004, 155, 1–11.

(477) Schnur, P.; Cesar, S. S.; Foderaro, M. A.; Kulkosky, P. J. Effects of Cholecystokinin on

Morphine-Elicited Hyperactivity in Hamsters. Pharmacol. Biochem. Behav. 1991, 39, 581–

586.

(478) Kayser, V.; Idänpään-Heikkilä, J. J.; Christensen, D.; Guilbaud, G. The Selective

CholecystokininB Receptor Antagonist L-365,260 Diminishes the Expression of

Naloxone-Induced Morphine Withdrawal Symptoms in Normal and Neuropathic Rats.

Life Sci. 1998, 62, 947–952.

(479) Filardi, Marco; Pizza, Fabio; Tonetti, Lorenzo; Antelmi, Elena; Natale, Vincenzo; Plazzi,

Giuseppe. Attention Impairments and ADHD Symptoms in Adult Narcoleptic Patients

with and without Hypocretin Deficiency. PLoS One 2017, 12, e0182085.

(480) Gentile, Taylor A.; Simmons, Steven J.; Watson, Mia N.; Connelly, Krista L.; Brailoiu,

Eugen; Zhang, Yanan; Muschamp, John W. Effects of Suvorexant, a Dual

Orexin/Hypocretin Receptor Antagonist, on Impulsive Behavior Associated with

Cocaine. Neuropsychopharmacology 2018, 43, 1001–1009.

(481) Azizi, Hossein; Mirnajafi-Zadeh, Javad; Rohampour, Kambiz; Semnanian, Saeed.

Antagonism of Orexin Type 1 Receptors in the Locus Coeruleus Attenuates Signs of

Naloxone-Precipitated Morphine Withdrawal in Rats. Neurosci. Lett. 2010, 482, 255–

259.

(482) Hughes, J. P.; Rees, S.; Kalindjian, S. B.; Philpott, K. L. Principles of Early Drug Discovery.

Br. J. Pharmacol. 2011, 162, 1239–1249.

(483) Brannen, Kimberly C.; Chapin, Robert E.; Jacobs, Abigail C.; Green, Maia L. Alternative

Models of Developmental and Reproductive Toxicity in Pharmaceutical Risk Assessment

and the 3Rs. ILAR J. 2016, 57, 144–156.

(484) Komori, Shinji; Kasumi, Hiroyuki; Sakata, Kazuko; Koyama, Koji. The Role of Androgens

in Spermatogenesis. Soc. Reprod. Fertil. Suppl. 2007, 63, 25–30.

(485) Milatiner, D.; Halle, David; Huerta, Michael; Margalioth, Ehud J.; Cohen, Yoram; Ben-

Chetrit, Avraham; Gal, Michael; Mimoni, Tzvia; Eldar-Geva, Talia. Associations between

Androgen Receptor CAG Repeat Length and Sperm Morphology. Hum. Reprod. 2004, 19,

1426–1430.

(486) Melo, C. O. A.; Danin, A. R.; Silva, D. M.; Tacon, J. A.; Moura, K. K. V. O.; Costa, E. O. A.;

Da Cruz, A. D. Association between Male Infertility and Androgen Receptor Mutations

in Brazilian Patients. Genet. Mol. Res. 2010, 9, 128–133.

(487) Bachelot, Anne; Meduri, Géri; Massin, Nathalie; Misrahi, Micheline; Kuttenn, Frédérique;

Touraine, Philippe. Ovarian Steroidogenesis and Serum Androgen Levels in Patients with

Premature Ovarian Failure. J. Clin. Endocrinol. Metab. 2005, 90, 2391–2396.

(488) Shiina, H.; Matsumoto, T.; Sato, T.; Igarashi, K.; Miyamoto, J.; Takemasa, S.; Sakari, M.;

Takada, I.; Nakamura, T.; Metzger, D.; Chambon, P.; Kanno, J.; Yoshikawa, H.; Kato, S.

Premature Ovarian Failure in Androgen Receptor-Deficient Mice. Proc. Natl. Acad. Sci.

2006, 103, 224–229.

8 Appendix

8.1 Tables and Figures

Table 8-1: Number of data points for the public and proprietary dataset after filtering, divided by bromodomain and activity, along with the totals for each dataset for active (A) and not active (N) compound-target pairs.

Bromodomain | Activity (active (A) / not active (N)) | Number of public data points after filtering | Number of proprietary data points after filtering | Total number of data points after filtering

ATAD2 A 42 23 65

ATAD2 N 45 328 373

ATAD2B A - - 0

ATAD2B N 3 203 206

BAZ2A A 27 - 27

BAZ2A N 9 203 212

BAZ2B A 31 - 31

BAZ2B N 65 211 276

BPTF A 11 - 11

BPTF N 4 203 207

BRD1 A 44 9 53

BRD1 N 16 495 511

BRD2 BD1 A 79 163 242

BRD2 BD1 N 40 46 86

BRD2 BD2 A 15 135 150

BRD2 BD2 N 6 74 80

BRD3 BD1 A 69 167 236

BRD3 BD1 N 7 42 49

BRD3 BD2 A 24 137 161

BRD3 BD2 N 4 72 76

BRD4 BD1 A 1124 961 2085

BRD4 BD1 N 843 2682 3525

BRD4 BD2 A 540 161 701

BRD4 BD2 N 79 239 318

BRD7 A 21 20 41

BRD7 N 7 180 187

BRD9 A 113 20 133

BRD9 N 27 319 346

BRDT BD1 A 13 147 160

BRDT BD1 N 18 64 82

BRDT BD2 A 6 127 133

BRDT BD2 N 4 84 88

BRPF1b A 66 80 146

BRPF1b N 33 565 598

BRPF3 A 3 2 5

BRPF3 N 34 201 235

BRWD1 BD2 A - 1 1

BRWD1 BD2 N 4 202 206

CECR2 A 16 - 16

CECR2 N 12 319 331

CREBBP A 163 112 275

CREBBP N 111 350 461

EP300 A 8 116 124

EP300 N 4 328 332

KAT2A A - 2 2

KAT2A N 4 198 202

PB1 BD5 A 16 - 16

PB1 BD5 N 10 - 10

PCAF A 51 - 51

PCAF N 34 - 34

SMARCA2 A - 28 28

SMARCA2 N - - 0

SMARCA4 A 24 46 70

SMARCA4 N 7 242 249

TAF1 BD2 A 11 232 243

TAF1 BD2 N 6 191 197

TAF1L BD2 A 1 6 7

TAF1L BD2 N 3 179 182

TIF1A A 63 - 63

TIF1A N 20 189 209

TRIM33 A 2 1 3

TRIM33 N 1 202 203

Total all actives A 2583 2696 5279

Total all not actives N 1460 8611 10071

Grand Total A+N 4043 11307 15350

Table 8-2: PDB identifiers for the crystal structures used as the reference structure for each bromodomain in the alignment. “Not found” means that no crystal structure for this bromodomain was found in the MOE protein database at the time of collation. For these domains (BRPF3 and BRDT BD2), the alignment was asserted based on the nearest bromodomain.

Bromodomain Crystal structure used for alignment (PDB)

CECR2 3NXB

BPTF 3UV2

KAT2A 1F68

PCAF 1WUG

BRD2 BD1 2YEK

BRD2 BD2 2DVV

BRD3 BD1 3S91

BRD3 BD2 3S92

BRD4 BD1 2YEL

BRD4 BD2 2YEM

BRDT BD1 4FLP

BRWD1 BD2 3Q2E

CREBBP 2D82

EP300 3I3J

ATAD2 4TT2

ATAD2B 3LXJ

BRD1 5AME

BRPF1 4UYE

BRD7 2I7K

BRD9 4NQN

TIF1A 3O35

TRIM33 3U5N

BAZ2A 4QBM

BAZ2B 3Q2F

TAF1 BD2 4YYM

TAF1L BD2 3HMH

PB1 BD5 4Q0N

SMARCA2 5DKC

SMARCA4 5DKD

BRPF3 Not found

BRDT BD2 Not found

Table 8-3: Binding site residue sequence alignment for bromodomains in this study. Binding site residues were defined as residues in the same alignment position as any residue within 4.5 Å of the ligands in public liganded co-crystal structures for bromodomains.

Domain 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

ATAD2 r f r v f t k p v d p d e - - v p d y v i p m d l l i n d p g d r l i r r a

ATAD2B r f n i f s k p v d i e e - - v s d y l i p m d l l i n d p g d k i i r r a

BAZ2A a a w p f l e p v n p r - l - v s g y r i p m d f l v n d - - d s e v g a g

BAZ2B d a w p f l l p v n l k - l - v p g y k i p m d f l v n d - - d s d i g a g

BPTF m a w p f l e p v d p n d - - a p d y y i p m d l k i n s - - d s p f y c a

BRD1 p a r i f a q p v s l k e - - v p d y l i p m d f l i n r - - d t v f y a a

BRD2 BD1 f a w p f r q p v d a v k l g l p d y h i p m d m t m n p - - t d d i v m a

BRD2 BD2 y a w p f y k p v d a s a l g l h d y h i p m d l l m n p - - d h d v v m a

BRD3 BD1 f a w p f y q p v d a i k l n l p d y h i p m d m t m n p - - t d d i v m a

BRD3 BD2 y a w p f y k p v d a e a l e l h d y h i p m d l l m n p - - d h e v v m a

BRD4 BD1 f a w p f q q p v d a v k l n l p d y y i p m d m t m n p - - g d d i v m a

BRD4 BD2 y a w p f y k p v d v e a l g l h d y c i p m d m l m n p - - d h e v v m a

BRD7 p s a f f s f p v t d f i - - a p g y s i p m d f l m n p - - e t i y y a a

BRD9 p h g f f a f p v t d a i - - a p g y s i p m d f l m n p - - d t v y y l a

BRDT BD1 f s w p f q r p v d a v k l q l p d y y i p m d l t m n p - - g d d i v m a

BRDT BD2 y a w p f y n p v d a d a l g l h n y y v p m d l l m n p - - d h e v v m a

BRPF1 t g n i f s e p v p l s e - - v p d y l i p m d f l i n k - - d t i f y a a

BRPF3 p a h i f a e p v n l s e - - v p d y l i p m d f l i n k - - d t i f h a a

BRWD1 BD2 d s e p f r q p v d l v - - e y p d y r i p m d f l i t k - - r s k i y m t

CECR2 d s w p f l e p v d e s y - - a p n y y i p m d i t m n e - - s s e y t m s

CREBBP e s l p f r q p v d p q l l g i p d y f v p m d l l m n k - - t s r v y f c

EP300 e s l p f r q p v d p q l l g i p d y f v p m d l l m n k - - t s r v y y c

KAT2A s a w p f m e p v k k s e - - a p d y y i p i d l r v n p - - d s e y c c a

PB1 BD5 l s a i f l r l p s r s e - - l p d y y i p m d m m m n p - - e s l i y d a

PCAF s a w p f m e p v k r t e - - a p g y y i p m d l r v n p - - e s e y y c a

SMARCA2 l s e v f i q l p s r k e - - l p e y y i p v d f l l n e - - g s q i y d s

SMARCA4 l s e v f i q l p s r k e - - l p e y y i p v d f l l n e - - g s l i y d s

TAF1 BD2 d s w p f h h p v n k k f - - v p d y y i p m d l l i n p - - e s q y t t a

TAF1L BD2 d s w p f h h p v n k k f - - v p d y y i p v d l l i n p - - e s q y t t a

TIF1A m s l a f q d p v p l - - - t v p d y y i p m d l l i n p - - d s e v a a g

TRIM33 l s i e f q e p v p a - - - s i p n y y i p m d l l i n a - - d s e v a a g
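The binding-site definition in the caption of Table 8-3 (any residue within 4.5 Å of a ligand, pooled across structures by alignment position) can be illustrated with a minimal sketch. This is not the procedure used in the thesis, which relied on MOE; the function names, the input layout and the toy coordinates are all hypothetical, and treating "within" as "less than or equal to" is an assumption.

```python
import math

CUTOFF = 4.5  # Å, the distance cutoff stated in the caption of Table 8-3


def residues_near_ligand(protein_atoms, ligand_coords, cutoff=CUTOFF):
    """Return sorted residue numbers with any atom within `cutoff` Å of any ligand atom.

    protein_atoms: iterable of (residue_number, (x, y, z)) tuples (hypothetical layout)
    ligand_coords: iterable of (x, y, z) tuples
    """
    ligand_coords = list(ligand_coords)
    near = set()
    for resnum, xyz in protein_atoms:
        if resnum in near:
            continue  # residue already selected via another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            near.add(resnum)
    return sorted(near)


def binding_site_positions(structures):
    """Pool selected residues from several structures by alignment position.

    structures: list of dicts with keys 'atoms', 'ligand' and 'aln', where
    'aln' maps residue number -> alignment column (hypothetical layout).
    """
    positions = set()
    for s in structures:
        for resnum in residues_near_ligand(s["atoms"], s["ligand"]):
            positions.add(s["aln"][resnum])
    return sorted(positions)


# Toy coordinates, not taken from any PDB entry.
protein = [(81, (0.0, 0.0, 0.0)), (82, (3.0, 0.0, 0.0)), (83, (10.0, 0.0, 0.0))]
ligand = [(0.0, 4.0, 0.0)]
print(residues_near_ligand(protein, ligand))
```

Pooling by alignment column rather than by residue number is what allows a position to count as part of the binding site for a bromodomain that has no liganded structure of its own.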

Table 8-4: The ROC AUC, sensitivity and specificity values per-bromodomain for the final PCM model based on Morgan fingerprints (512 bits) and Z-scales 5 descriptors. Corresponds to data found in Figure 2-10.

Bromodomain ROC AUC Sensitivity Specificity

ATAD2 0.96 0.71 0.97

ATAD2B NA NA 1.00

BAZ2A 1.00 1.00 0.96

BAZ2B 1.00 0.88 0.98

BPTF 1.00 0.75 1.00

BRD1 0.92 0.71 0.97

BRD2 BD1 0.96 0.95 0.80

BRD2 BD2 0.97 0.95 0.79

BRD3 BD1 0.99 0.97 0.88

BRD3 BD2 0.96 0.98 0.76

BRD4 BD1 0.95 0.84 0.97

BRD4 BD2 0.97 0.99 0.82

BRD7 0.97 0.67 0.97

BRD9 0.98 0.84 0.98

BRDT BD1 0.94 0.96 0.72

BRDT BD2 0.96 1.00 0.67

BRPF1 0.83 0.49 0.98

BRPF3 0.84 0.00 1.00

BRWD1 BD2 NA NA 1.00

CECR2 0.99 1.00 0.99

CREBBP 0.93 0.86 0.83

EP300 0.92 0.71 0.96

KAT2A 0.96 0.00 1.00

PB1 BD5 1.00 1.00 1.00

PCAF 0.84 0.89 0.58

SMARCA2 NA 1.00 NA

SMARCA4 0.63 0.28 0.93

TAF1 BD2 0.84 0.69 0.80

TAF1L BD2 0.99 0.00 1.00

TIF1A 0.99 0.88 0.99

TRIM33 1.00 1.00 1.00
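As a reading aid for Table 8-4, the sketch below shows how sensitivity and specificity follow from confusion-matrix counts, and why a metric is undefined when one class is absent. The NA entries in the table are presumably of this kind (for example, Table 8-1 lists no actives for ATAD2B and no not-actives for SMARCA2), but that reading is an assumption. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity); a value is None when it is undefined.

    sensitivity = TP / (TP + FN), undefined without any actives
    specificity = TN / (TN + FP), undefined without any not-actives
    """
    sens = tp / (tp + fn) if (tp + fn) > 0 else None
    spec = tn / (tn + fp) if (tn + fp) > 0 else None
    return sens, spec


# Hypothetical counts for a target with both classes present.
print(sensitivity_specificity(tp=5, fn=2, tn=97, fp=3))

# Hypothetical counts for a target with no actives: sensitivity is undefined.
print(sensitivity_specificity(tp=0, fn=0, tn=10, fp=0))
```

A sensitivity of 0.00 is a different case from NA: there the actives exist but none was predicted active.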

Figure 8-1: Binding site residue sequence alignment (with numbering) for bromodomains in this study, coloured by the interpretation of residues important towards the classification of active compound-target pairs at each bromodomain. Corresponds to Figure 3-4 in the main text. Binding site residues were defined as residues in the same alignment position as any residue within 4.5 Å of the ligands in public liganded co-crystal structures for bromodomains.

8.2 Supplementary Data Files

Data referred to throughout the text as Supplementary Data Files were too large for

the Appendix and can be found as .csv files on the CD accompanying the thesis.

8.3 Compound Characterization Data

8.3.1 General Methods

NMR spectra

NMR spectra were obtained on a Bruker Avance 500 (500 MHz) system using d6-DMSO as

solvent. Measurements were taken at ambient temperature unless otherwise specified, and the

following abbreviations have been used: s, singlet; d, doublet; t, triplet; q, quartet; m, multiplet;

dd, doublet of doublets; ddd, doublet of doublets of doublets; dq, doublet of quartets; dt, doublet

of triplets; tt, triplet of triplets; p, pentet.

UPLC conditions

UPLC was carried out using a Waters UPLC fitted with a Waters SQD, SQD2 or QDA mass

spectrometer (ionisation: ESI with positive/negative switching).

A: 0.1 % NH3 in water

B: acetonitrile

Column: Waters Acquity CSH™ C18 1.7 µm 2.1 x 50 mm

Gradient: 97% A/3% B to 3% A/97% B over 1.5 min

UV: 220 nm - 320 nm

Temperature: 40 °C

Flow rate: 1 ml/min
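For orientation, the UPLC gradient above can be written as a function of time. This is a minimal sketch that assumes a linear ramp from 3% B to 97% B over 1.5 min; the text gives only the end points and the duration, so the linear shape, the function name and the clamping outside the ramp are assumptions.

```python
def percent_b(t_min, start_b=3.0, end_b=97.0, duration_min=1.5):
    """Percentage of eluent B at time t_min, assuming a linear ramp.

    Times before 0 or after the end of the ramp are clamped to the
    start and end compositions respectively.
    """
    frac = min(max(t_min / duration_min, 0.0), 1.0)
    return start_b + (end_b - start_b) * frac


for t in (0.0, 0.75, 1.5):
    b = percent_b(t)
    print(f"t = {t:.2f} min: {100.0 - b:.1f}% A / {b:.1f}% B")
```

Under this assumption the composition passes through 50% A / 50% B at the midpoint of the run (0.75 min).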

High Resolution Mass Spectrometry Accurate Mass conditions

The High Resolution Mass Spectrometer is run in Electrospray (ESI) +ve or -ve ion mode, and

MS/MS using CID at 35 eV is carried out automatically on the two most intense ions

generated from MS1.

Mass range: 100 – 1000 amu

Mass Spectrometer: Orbitrap XL or Waters Xevo Qtof

Gradient: Acidic and basic mobile phases are available, consisting of A = aqueous

0.1% formic acid (or 0.1% ammonium hydroxide) and B = acetonitrile with 0.1%

formic acid (or 0.1% ammonium hydroxide). The gradient runs from 95% A to 5% A

at 0.7 mL/min over 4.0 min, followed by a 0.5 min hold and a

return to 95% A by 5 min.

Total run time: 5 min

Column: Waters CSH 50 x 2.1 BEH

Temperature: 45°C.

Injection volume: 1 – 5 µL

UV: 220 to 400nm (PDA detector).

Sample preparation: No greater than 0.5mg/mL solution in DMSO/methanol

Purity criteria

Compounds are >90% pure, based on UPLC and 1H NMR.

8.3.2 Compound 1

4-cyano-N-(1,3-dimethyl-2-oxo-7-quinolyl)-2-methoxy-benzenesulfonamide

1H NMR (500 MHz, DMSO) 2.06 – 2.1 (m, 3H), 3.55 (s, 3H), 3.98 (s, 3H), 7.26 (dd, J = 9.0, 2.4

Hz, 1H), 7.30 (d, J = 2.4 Hz, 1H), 7.37 (d, J = 9.0 Hz, 1H), 7.47 (dd, J = 8.1, 1.3 Hz, 1H), 7.69 (s, 1H),

7.73 (d, J = 1.2 Hz, 1H), 7.86 (d, J = 8.1 Hz, 1H), 10.27 (s, 1H); 13C NMR (126 MHz, DMSO, 27°C)

161.81, 156.81, 136.44, 135.48, 131.58, 131.57, 131.25, 130.39, 124.59, 123.45, 120.71, 119.57, 117.96,

117.35, 117.18, 115.80, 57.45, 29.86, 17.79. m/z: ES+ [M+H]+ 384, HRMS (ESI): calculated for

C19H18N3O4S [M+H]+: 384.1018, found 384.1004, error 3.6 ppm.
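
The HRMS errors quoted in this section are consistent with the usual parts-per-million definition, error = |found − calculated| / calculated × 10^6. The following minimal sketch recomputes the value for Compound 1 from the calculated and found masses given above; the function name `ppm_error` is hypothetical and the formula is the standard definition rather than one stated explicitly in the text.

```python
def ppm_error(calculated_mz: float, found_mz: float) -> float:
    """Absolute mass error in parts per million, relative to the calculated m/z."""
    return abs(found_mz - calculated_mz) / calculated_mz * 1e6


if __name__ == "__main__":
    # Calculated and found [M+H]+ values reported above for Compound 1.
    print(f"{ppm_error(384.1018, 384.1004):.1f} ppm")
```

For Compound 1 this gives 3.6 ppm, matching the reported error.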

8.3.3 Compound 2

1,3-dimethyl-2,4-dioxo-N-[4-(trifluoromethyl)phenyl]quinazoline-6-sulfonamide

1H NMR (500 MHz, DMSO) 3.29 (s, 3H), 3.50 (s, 3H), 7.32 (d, J = 8.5 Hz, 2H), 7.60-7.65 (m,

3H), 8.09 (dd, J = 8.9, 2.3 Hz, 1H), 8.44 (d, J = 2.3 Hz, 1H), 10.98 (s, 1H); 13C NMR (126 MHz,

DMSO) 160.94, 150.84, 143.97, 141.80, 133.23, 132.92, 127.19, 127.15, 125.70, 124.28, 119.34, 116.55,

115.52, 31.49, 28.78; m/z: ES+ [M+H]+ 414, HRMS (ESI): calculated for C17H15N3O4SF3 [M+H]+:

414.0735, found 414.0752, error 4.1 ppm.

8.3.4 Compound 3

1H NMR (500 MHz, DMSO) 2.04 (s, 3H), 3.12 (s, 3H), 4.22 (s, 2H), 4.97 (s, 2H), 6.85 – 6.89 (m,

2H), 6.83 (s, J = 8.6 Hz, 1H), 6.93 (s, 1H), 7.35 (d, J = 8.4 Hz, 2H), 7.58 (d, J = 8.4 Hz, 2H); 13C NMR

(126 MHz, DMSO) 168.75, 155.53, 153.68, 139.43, 133.93, 132.02, 128.80, 122.60, 119.33, 114.05,

113.95, 113.14, 69.84, 42.39, 29.63, 24.46; m/z: ES+ [M+H]+ 326; HRMS (ESI): calculated for

C18H19N3O3 [M+Na]+: 348.1324, found 348.1342, error 5.2 ppm.

8.3.5 Compound 4

Literature compound

8.3.6 Compound 5

3-benzyl-5-(3,5-dimethylisoxazol-4-yl)-1,3-benzoxazol-2-one

1H NMR (500 MHz, DMSO) 2.15 (s, 3H), 2.33 (s, 3H), 5.07 (s, 2H), 7.11 (dd, J = 8.2, 1.8 Hz, 1H),

7.23 (d, J = 1.5 Hz, 1H), 7.28 – 7.34 (m, 1H), 7.33 – 7.4 (m, 2H), 7.41 – 7.48 (m, 3H); 13C NMR (126

MHz, DMSO) 165.63, 158.55, 154.42, 141.79, 135.98, 131.65, 129.24, 128.45, 128.38, 126.18, 123.62,

116.02, 110.58, 110.43, 45.61, 11.64, 10.74; m/z: ES+ [M+H]+ 321; HRMS (ESI): calculated for

C19H16N2O3 [M+H]+: 321.1239, found 321.1235, error 1.2 ppm.

8.3.7 Compound 6

N-[[3-(3,5-dimethylisoxazol-4-yl)phenyl]methyl]methanesulfonamide

1H NMR (500 MHz, DMSO) 2.23 (s, 3H), 2.40 (s, 3H), 2.87 (s, 3H), 4.22 (d, J = 6.3 Hz, 2H), 7.29

(d, J = 7.7 Hz, 1H), 7.32 – 7.39 (m, 2H), 7.46 (t, J = 7.6 Hz, 1H), 7.61 (t, J = 6.3 Hz, 1H); 13C NMR

(126 MHz, DMSO) 165.58, 158.53, 139.49, 130.37, 129.36, 128.53, 128.12, 127.22, 116.25, 46.26,

40.45, 11.76, 10.92; m/z: ES+ [M+H]+ 280; HRMS (ESI): calculated for C13H16N2O3S [M+Na]+:

303.0779, found 303.0790, error 3.6 ppm.

8.3.8 Compound 7

4-[(9-cyclopentyl-5-methyl-6-oxo-spiro[8H-pyrimido[4,5-b][1,4]diazepine-7,1'-

cyclopropane]-2-yl)amino]-2-fluoro-5-methoxy-N-[(1R,5S)-9-methyl-9-

azabicyclo[3.3.1]nonan-3-yl]benzamide

1H NMR (500 MHz, DMSO) 0.62 – 0.74 (m, 2H), 0.87 – 1.01 (m, 4H), 1.22 – 1.26 (m, 2H), 1.42

– 1.48 (m, 2H), 1.49 – 1.56 (m, 2H), 1.57 – 1.63 (m, 2H), 1.66 – 1.76 (m, 2H), 1.85 – 1.96 (m, 4H),

2.14 – 2.26 (m, 2H), 2.44 (s, 3H), 3.00 (s, 2H), 3.18 (s, 4H), 3.50 (s, 2H), 3.93 (s, 3H), 4.31 (s, 1H),

4.87 (p, J = 8.6, 8.2 Hz, 1H), 7.22 (d, J = 6.8 Hz, 1H), 7.72 (s, 1H), 7.78 (s, 1H), 8.03 (s, 1H), 8.35

(d, J = 13.7 Hz, 1H); m/z: ES+ [M+H]+ 592; HRMS (ESI): calculated for C32H42N7O3F [M+H]+:

592.3411, found 592.3397, error 2.4 ppm.

8.3.9 Compound 8

[2-[(1-methyl-2-oxo-6-quinolyl)oxymethyl]phenyl] methanesulfonate

1H NMR (500 MHz, DMSO) 3.51 (s, 3H), 3.59 (s, 3H), 5.24 (s, 2H), 6.61 (d, J = 9.5 Hz, 1H), 7.33

(dd, J = 9.2, 2.9 Hz, 1H), 7.38 – 7.43 (m, 2H), 7.44 – 7.52 (m, 3H), 7.62 – 7.68 (m, 1H), 7.83 (d, J

= 9.5 Hz, 1H); 13C NMR (126 MHz, DMSO) 161.12, 153.34, 147.33, 139.06, 135.06, 130.64, 130.46,

130.26, 127.87, 122.84, 122.23, 121.26, 120.05, 116.51, 112.49, 65.26, 38.60, 29.54; m/z: ES+ [M+H]+

360; HRMS (ESI): calculated for C18H17NO5S [M+H]+: 360.0906, found 360.0895, error 3.1 ppm.

8.4 Experimental Data

Table 8-5: Differential Scanning Fluorimetry (DSF) assay quality control, showing the identity, thermal shift value measured in the DSF assay and literature-reported binding affinity of the positive control for each assay, together with the standard deviation of the DMSO neutral controls.

Target      Positive control   Positive control ΔTm (°C)   Binding affinity Kd (nM)   Neutral control (DMSO) standard deviation
BRD1        NI-57              5.1                         110                        0.10
BRD4 BD1    JQ-1               10.5                        60                         0.37
BRD9        i-BRD9             9.8                         1.9                        0.38
BRPF1b      NI-57              8.8                         31                         0.05

Table 8-6: X-ray crystallographic data collection and refinement statistics

                      BRD4: 1                  BRD4: 2                  BRD4: 4                  BRD9: 4
PDB ID
Data collection
Space group           P212121                  P212121                  P212121                  P21212
Cell dimensions
  a, b, c (Å)         37.97, 42.99, 79.41      37.86, 44.05, 77.77      34.14, 47.41, 78.60      74.11, 122.96, 29.90
  α, β, γ (°)         90.0, 90.0, 90.0         90.0, 90.0, 90.0         90.0, 90.0, 90.0         90.0, 90.0, 90.0
Resolution (Å)        40.00-1.25 (1.28-1.25)   44.05-1.57 (1.61-1.57)   40.60-1.30 (1.32-1.30)   63.47-1.53 (1.57-1.53)
I / σI                20.5 (1.4)               14.1 (1.6)               22.2 (5.3)               12.8 (1.5)
CC 1/2                1.00 (0.49)              0.999 (0.596)            0.998 (0.98.6)           0.998 (0.624)
Completeness (%)      99.3 (94.4)              99.1 (94.2)              99.9 (97.2)              100.0 (100.0)
Redundancy            5.6 (2.9)                6.9 (6.8)                6.0 (3.6)                6.2 (6.2)
Refinement
Resolution (Å)        40.00-1.25               44.05-1.57               40.60-1.30               63.47-1.53
No. reflections       205450                   129989                   191880                   262534
Rwork / Rfree         0.196 / 0.219            0.183 / 0.212            0.201 / 0.213            0.204 / 0.230
No. atoms
  Protein             1062                     1056                     1066                     1838
  Ligand              24                       19                       25                       50
  Water               130                      108                      150                      188
B-factors
  Protein             20.6                     25.4                     20.1                     39.9
  Ligand              18.7                     31.6                     23.7                     40.5
  Water               29.1                     33.8                     28.0                     42.4
R.m.s. deviations
  Bond lengths (Å)    0.01                     0.01                     0.01                     0.01
  Bond angles (°)     0.94                     0.94                     1.11                     0.80
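
The cell dimensions in Table 8-6 can be converted into unit-cell volumes using the general formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ), which reduces to V = abc when all three angles are 90°, as they are for every crystal in the table. The sketch below is an illustrative calculation only: the volumes are derived values that are not reported in the table, and the function name `unit_cell_volume` is hypothetical.

```python
import math


def unit_cell_volume(a: float, b: float, c: float,
                     alpha: float = 90.0, beta: float = 90.0,
                     gamma: float = 90.0) -> float:
    """Unit-cell volume in cubic angstroms from cell edges (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # General triclinic expression; equals a * b * c for 90-degree angles.
    return a * b * c * math.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)


if __name__ == "__main__":
    # Cell edges (Å) as listed in Table 8-6; all angles are 90 degrees.
    cells = {
        "BRD4: 1": (37.97, 42.99, 79.41),
        "BRD4: 2": (37.86, 44.05, 77.77),
        "BRD4: 4": (34.14, 47.41, 78.60),
        "BRD9: 4": (74.11, 122.96, 29.90),
    }
    for name, (a, b, c) in cells.items():
        print(f"{name}: {unit_cell_volume(a, b, c):.0f} Å^3")
```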