MVE385: Project course in mathematical and statistical modelling

Modeling Quantitative Structure-Activity Relationships for G-protein Coupled Receptor ligands

Project report

Students
Adrià Amell Tosas ([email protected])
Richard Martin ([email protected])
Sebastian Oleszko ([email protected])

Partners
Peder Svensson, Mattias Sundén, Fredrik Wallner, Erik Lorentzen

2020–12–18


Background

IRLAB Therapeutics is a biotech company engaged in the discovery and development of novel pharmaceuticals to treat disorders of the brain, currently focusing on Parkinson’s disease. The research is based on a so-called phenotypic screening approach, which means that the effects of new chemical compounds are evaluated on a system level, to ensure that both direct and indirect effects on different neurotransmitters and brain pathways are captured. Evaluation of potential target receptor interactions is performed in silico as well as in different in vitro systems. In the design of new compounds, and in understanding how different structural elements of the compounds affect biological activity, quantitative structure-activity relationships (QSAR) are a key component. Our current drug discovery projects focus mainly on G-protein coupled receptors (GPCRs), the main class of molecular targets for CNS pharmaceuticals.

Project description

In this project, students will access large datasets covering chemical descriptors of a series of molecules, combined with data on receptor interactions. The focus will be on GPCRs of the monoamine family, but other target proteins will also be included. Chemical descriptors, including e.g. physico-chemical property estimates, shape, lipophilicity, ionization states, molecular graphs and fingerprints, are obtained from public databases and generated in-house at IRLAB. The task is to find statistical QSAR models that describe how biological activity, in this case receptor affinity, relates to chemical properties of the ligand molecule. Such models can be used to guide the design of novel compounds. Linear, principal component-based models as well as non-linear methods will be investigated.

Key points to consider in the project will be:

- Properties of the chemical descriptor space
- Data distribution – transforms?
- Different model types: linear (e.g. PLS, MR) and non-linear (e.g. SVM, neural networks, random forests)
- Choice of dependent variables for the models – some relations to the independent variables (chemical descriptors) will be common to several dependent variables (receptor affinities), while some are unique to specific Y variables. Depending on the statistical modelling approach, this suggests either separate models for each dependent variable of interest or a multiple-Y block.
- Diagnostics – how to assess the predictive capability of models

IRLAB will provide chemical descriptors and biological activity data for one or more series of compounds of interest for the pharmacological modulation of GPCRs, focusing on monoamine targets. Smartr will provide supervision regarding statistical models.



1 Abstract

G protein-coupled receptors (GPCRs) are the main class of molecular targets for central nervous system pharmaceuticals. Dopamine receptors are GPCRs which are activated by the neurotransmitter dopamine; they are central players in brain function and are involved in, for example, motor control. Quantitative structure-activity relationship (QSAR) models are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological activity. They are key in the design of new drugs and in understanding how different structural elements of compounds affect biological activity.

This project presents a number of QSAR models that can help describe relationships between series of organic molecules and the dopamine receptors D2 or D3 by predicting the corresponding inhibitory constant Ki. The molecules of study and the Ki values were obtained from ChEMBL, a public chemical database of manually curated bioactive molecules with drug-like properties, while the descriptors were generated computationally. Linear-based models, a genetic algorithm and tree-ensemble methods are motivated and evaluated on the constructed dataset. The tree-ensemble methods performed best, closely followed by the genetic algorithm. Finally, recommendations for QSAR modelling in general are discussed.

Contents

1 Abstract
2 Introduction
3 Construction of the data set to model
  3.1 Selection of compounds and inhibitory constants
  3.2 Activity cliffs
4 Modelling
  4.1 Motivation for methods
      4.1.1 Linear-based models
      4.1.2 Selection by a Genetic algorithm
      4.1.3 Ensemble methods
  4.2 Implementation details
      4.2.1 Linear-based models
      4.2.2 Variable selection
      4.2.3 Genetic algorithm
      4.2.4 Ensemble methods
5 Results
  5.1 Variable selection with linear-based models
  5.2 Evaluating the linear-based models
  5.3 Subset and model refinement with Genetic algorithm
  5.4 Ensemble models
6 Discussion
  6.1 Data discussions
  6.2 Applicability domain
  6.3 Modelling conclusions
      6.3.1 Computational limitations
      6.3.2 Feature subsets and importance
      6.3.3 Model interpretability
      6.3.4 Comparison of model candidates
  6.4 Future recommendations
7 Acknowledgements
A Result tables
  A.1 Variable selection chosen descriptors
  A.2 Specifications of the linear-based models
  A.3 Final descriptor sets of the genetic algorithm
B Ensemble Parameter Tuning
C Code
  C.1 Variable selection scripts
  C.2 Linear-based model scripts
  C.3 Genetic algorithm script

2 Introduction

Human cells are constantly communicating with each other and the surrounding environment. This requires a molecular mechanism for transmission of information over the cell plasma membrane. G protein-coupled receptors (GPCRs) are proteins located at the cell plasma membrane that provide this molecular mechanism, transferring signals upon binding of a ligand. This makes them the main class of molecular targets for central nervous system pharmaceuticals.

Dopamine receptors are GPCRs. They are activated by the neurotransmitter dopamine, are central players in brain function and are involved in, for example, motor control. This project covered the construction and modelling of a data set of chemical compounds whose targets are either dopamine receptor D2 or D3, also referred to as D2R and D3R, respectively, given that IRLAB Therapeutics, one of the partners, is currently focused on Parkinson’s disease. The purpose of the modelling is to predict an activity, in particular the inhibitory constant Ki, for different chemical compounds with the dopamine receptors D2 and D3 as targets. This is usually done as a stage in drug discovery projects, where the properties of the compounds and their relation to receptor interactions are a key component. The kind of compound-receptor models described in this report are called quantitative structure-activity relationship (QSAR) models [1], and are important in the development of new pharmaceuticals.

In this report, Section 3 presents the data source and the process followed to construct the data set of compounds and activities to model. Section 4 motivates and presents the methods employed, all of which are regression models: linear models, a genetic algorithm and tree-based ensemble methods. Modelling results are found in Section 5. Finally, a discussion on the construction of the data set, the modelling applicability domain and interpretability, as well as future recommendations, is found in Section 6. Appendix B explains technical details of the implementations and the hyperparameter tuning of the models. All the code and data sets used are provided as supplementary material.

3 Construction of the data set to model

In the initial stage of the project, the dataset of chemical compounds was collected, analyzed and processed in order to learn about its properties, to make it suitable for modelling and to help decide on appropriate model choices. This section describes the resources used to obtain the data set on which the modelling is based, as well as the steps performed in compiling and cleaning the data.

3.1 Selection of compounds and inhibitory constants

ChEMBL is a large, open-access, manually curated database of bioactive molecules with drug-like properties, maintained by the European Bioinformatics Institute of the European Molecular Biology Laboratory. The information in the database about small molecules and their biological activity is extracted from medicinal chemistry journals and integrated with data on approved drugs and clinical development candidates, as well as with bioactivity data from other databases, allowing users to benefit from an even larger body of interaction data [2].

This database can be accessed through a web user interface at https://www.ebi.ac.uk/chembl/, through web services, or in a number of download formats. This project used a local PostgreSQL copy of ChEMBL release 27, the latest release at the time, to access the data. The targets of interest are the dopamine receptors D2 and D3 for the Homo sapiens organism, which were identified by the ChEMBL IDs CHEMBL217 and CHEMBL234, respectively.

All the compounds having activities for these target IDs were queried. Each record contains the compound ID and a standardised activity value, type, units and relation type (i.e. whether the given value is exact or a bound of a range), as well as the molecule’s canonical SMILES (a description of the molecule as an ASCII string) and metadata such as an identifier for the assay, which can be linked to the data source. A variety of activity types were observed. Some can be easily identified, such as IC50 (half maximal inhibitory concentration), while the data sources may need to be reviewed to understand other types. In this context, we found the types pKi, Delta pKi, logKi, KiH, Ki(app), Ratio Ki, KiL, Log Ki, and Ki that can be related directly to the inhibitory constant Ki. We filtered out every type other than Ki because (i) it is not exactly clear how some types relate to Ki, (ii) the base of the logarithm in logKi or Log Ki is not stated, and (iii) all the pKi values are NaN.

Any compound whose weight was not between 120 and 620 Da, as well as any salt, was removed; the motivation was to keep only drug-like compounds. A set of 207 two-dimensional molecular descriptors for these compounds was computed using the Molecular Operating Environment 2019.01 from Chemical Computing Group [3], where a Merck molecular force field charge model was used to compute the charge descriptors.

The computed molecular descriptors were combined with the obtained Ki values, resulting in an initial data set. This data set had to be curated for any accurate modelling [4]. It is not trivial to identify and exclude records with inconsistencies or systematic errors. Any duplicated record (completely identical records, including Ki values) was removed, and it was ensured that all Ki values were non-negative. The manual curation of the ChEMBL database was assumed to minimise any other inconsistency, and no further analysis was done in this regard.

In order to prepare the data for regression models, records not having a = relation were filtered out. This relation type establishes that the provided Ki value is equal to the measured value and not in a range, which would be indicated by a relation type such as >= and could be useful in classification models. It is recommended that the compounds in the data set are unique, because otherwise the models may have artificially skewed predictivity [5, 6]. Some compounds were observed to be duplicated at this stage (Figure 1). A handful of cases was manually examined, revealing that the Ki values for the same compound can be both very close and, in relative terms, very different.

Figure 1: Frequency of compound duplicates for each receptor (D2R and D3R) before averaging pKi values. Unique compounds for D2R (D3R): 5886 (4224), where 1110 (446) have at least one duplicate.

It was opted to assign one Ki value to each duplicated compound as long as the spread of its Ki values was reasonable. The relative standard deviation (RSD),

    RSD = σ/µ,    (3.1)

was used as the measure of spread. The spread was computed on the pKi values, defined as

    pKi = −log10(Ki), with Ki in molar units.    (3.2)

All compounds with a pKi spread of less than 10% were kept, and the corresponding pKi values were averaged to produce a single value. This defined the final size of each data set (Figure 2). The use of pKi instead of Ki is motivated later. In this step it was necessary to drop any metadata. Other approaches, such as selecting one value at random, could be used to preserve the metadata were it to be used in any downstream analysis, but that was outside the project scope.
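The deduplication rule above can be sketched in a few lines of Python (a minimal illustration using only the standard library; the function names and the assumption of clean, NaN-free inputs are ours, not the project's actual script):

```python
import math
from statistics import mean, pstdev

def pki(ki_molar):
    """Convert an inhibitory constant Ki (in molar units) to pKi, as in (3.2)."""
    return -math.log10(ki_molar)

def collapse_duplicates(ki_values, max_rsd=0.10):
    """Average the pKi values of a duplicated compound if their relative
    standard deviation (RSD = sigma/mu, eq. 3.1) is below max_rsd;
    otherwise drop the compound by returning None."""
    pkis = [pki(k) for k in ki_values]
    rsd = pstdev(pkis) / mean(pkis)
    return mean(pkis) if rsd < max_rsd else None

# Two replicate Ki measurements of 10 nM and 12 nM agree closely,
# so they collapse to a single averaged pKi of about 7.96:
print(collapse_duplicates([10e-9, 12e-9]))
```

A compound whose replicate measurements disagree too much is thus dropped entirely rather than averaged.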

Figure 2: Number of Ki observations in each data set after each filtering step (query of all compounds with D2R/D3R as targets; Type = Ki; compounds remaining after computing the descriptors; duplicated rows removed and Ki >= 0; Relation = ; averaged pKi, train + test). The resulting data set to model has 4938/549 and 3604/401 train/test observations for D2R and D3R, respectively, with 207 descriptors.

Finally, the data was divided into a 90%–10% train–test split, defining the final data set to model for both D2R and D3R. The distributions of the pKi values were similar in the train and test sets (Figure 3). Note that the distribution of the raw Ki values is instead exponential-like and presents potential outliers; using pKi not only makes the modelling more tractable but also softens possible outliers. Moreover, since many descriptors are known to depend on the molecular weight, the molecular weight distribution was also visualised, and the train/test split obtained presented similar distributions for this descriptor as well.

Figure 3: Distributions of the pKi values in the train and test sets for D2R and D3R.

3.2 Activity cliffs

Similar molecules do not necessarily have similar activities; in fact, evidence supports this statement [7]. An activity cliff is observed when two similar compounds differ largely in their activity. Perfectly valid data points in cliff regions may appear to be outliers.

While activity cliffs can be of interest, they can also make the modelling process more complicated. To begin with, the region surrounding a cliff needs to contain many compounds. Moreover, it can be difficult to detect activity cliffs because (i) the compounds can be mislabelled, (ii) the measure of similarity is non-unique, (iii) the difference in activity values can be measured in different ways, and (iv) there can be different interpretations of how to connect the similarity and the difference in activity values for a pair of compounds.

Prior to beginning any computational modelling of a data set, all activity cliffs must be detected, verified and treated [5]. It has been claimed that fewer activity cliffs are detected when the activity of interest is Ki [8]; therefore many activity cliffs were not expected, especially after pruning the data set in the last cleaning step.

The Structure-Activity Landscape Index (SALI) [9] is defined in this project for two compounds A and B as

    SALI = |pKi(A) − pKi(B)| / (1 − sim(A, B))    (3.3)

where sim is a measure of similarity between A and B. In [9] it is argued that the choice of similarity metric does not negatively impact SALI. On the other hand, selecting the descriptors describing the molecules is crucial: compounds that are nearest neighbours in one space need not be so in another. Here the Tanimoto similarity metric was evaluated on a number of molecular fingerprints: ECFP4, ECFP6, FCFP4, FCFP6 and the RDKit fingerprint, all as binary vectors of length 2048 to ensure enough sparsity, as well as the 166 public MACCS keys. These fingerprints were computed using RDKit [10].
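Equation (3.3) can be computed directly. The sketch below is our own minimal illustration, with binary fingerprints represented as Python sets of "on" bit positions rather than RDKit objects; it also shows the degenerate case where two distinct compounds share an identical fingerprint, so sim = 1 and the denominator becomes zero:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def sali(pki_a, pki_b, sim):
    """Structure-Activity Landscape Index, eq. (3.3). Returns infinity when
    the fingerprints are identical (sim == 1), i.e. a zero denominator."""
    denom = 1.0 - sim
    if denom == 0.0:
        return float("inf")
    return abs(pki_a - pki_b) / denom

# Two toy fingerprints sharing 3 of 5 total bits (Tanimoto = 0.6):
a, b = {1, 5, 9, 12}, {1, 5, 9, 30}
print(sali(8.2, 6.0, tanimoto(a, b)))  # |8.2 - 6.0| / (1 - 0.6) = 5.5
```

A full SALI matrix is then just this function evaluated over all compound pairs, with rows and columns sorted by pKi.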

A SALI matrix contains at position (i, j) the SALI value for compounds i and j. The columns are sorted by increasing pKi, and the rows by decreasing pKi. Visualising these matrices as heatmaps, where the upper-left pixel is the first column and row, should reveal an activity cliff as a bright pixel (depending on the colour map) around the anti-diagonal.

Given the dimensions of the data set, it is infeasible to visually detect single bright pixels in the 5487 × 5487 (D2R) and 4005 × 4005 (D3R) SALI matrices as in [9], but a rough idea of the structure-activity landscape can be obtained: compounds with similar structures present similar activities, and the more dissimilar they are, the more this difference is highlighted (Figure 4).

In order to assess the presence of activity cliffs, the entries between the anti-diagonal and the 5th anti-diagonal were studied, that is, up to five positions of difference in the ranking of activity values, and any SALI value above a threshold was set to 1. Fifteen compounds were activity cliff candidates for D2R when evaluated with the different fingerprints, but only two compounds were activity cliff candidates for D3R in more than one fingerprint. These compounds presented extremal activity values and it was decided to keep them, mainly because they were found upon revision of the code just before submitting this report and all the modelling had included them, but also because it is reasonable to expect that extremal values will not be modelled accurately.

It is worth mentioning that several pairs of compounds presented infinite SALI values, which was also found upon code revision before the submission. One reason, though not necessarily the only one, is that different compounds presented the same fingerprint, producing a numerical 0 in the denominator of (3.3). An exhaustive study of this cause was not carried out, as it was considered out of the scope of the project given the time constraints, but it could have been another source of activity cliffs.

Figure 4: Heatmaps of the SALI matrices for the ECFP4, ECFP6 and 166 public keys MACCS fingerprints computed with (3.3). The FCFP4, FCFP6 and RDKit fingerprint matrices (not shown) present similar images. The D2R matrix is 5487 × 5487 and the D3R matrix 4005 × 4005, so it is infeasible to visualise all pixels; nevertheless, they provide a qualitative understanding of the structure-activity landscape.

4 Modelling

The purpose of a statistical model can be predictive, explanatory or both. In drug discovery and QSAR modelling, it is typically beneficial to find out which properties of the investigated compounds are relevant for the response in the target. Therefore, interpretable models are preferred over black-box models. This limits the modelling options and puts restrictions on the data, since interpretable models often come with various assumptions on the data. It is also preferable to have a small model when possible, since this adds to the interpretability and makes the model easier to use.

The modelling process can be divided into three stages: variable selection, linear-based models and ensemble models. The first stage aimed to reduce the set of descriptors and find the relevant ones. The second stage aimed to find a model as simple and interpretable as possible. Finally, the third stage aimed to produce a model with as high predictive capability as possible, independently of the first two stages.

4.1 Motivation for methods

The pre-processed dataset showed that a large portion of the descriptors presented distributions resembling a normal distribution. Furthermore, there were groups of correlated descriptors present, as illustrated in Figure 5. This could cause stability problems for traditional models based on linear regression.

The pruned dataset consisted of 207 descriptors, which is too many to interpret comfortably. For D2R, five of those 207 showed no variance over the compounds chosen and were therefore dropped, since they would not be useful; for D3R, four were dropped on the same grounds. In order to filter out as many descriptors as possible without losing predictive power, some variable selection techniques were applied. Initially, the set of descriptors was split into three different sets: one including all descriptors, called alldata; one including only continuous descriptors and discrete descriptors that could be approximated as continuous, called cont; and a third including only discrete and binary descriptors, called cat. Since predictive modelling is much simpler with a one-dimensional response, the two targets D2 and D3 were treated separately throughout the whole project. The data distributions for the two targets were very similar, and therefore only data and models regarding D2 were treated initially, since the D3 case could be assumed to have very similar properties. For the final models, the D3 case was also included.

Figure 5: Heatmap of the pairwise linear correlations among the descriptors. The horizontal and vertical yellow and blue lines indicate highly correlated descriptors.

4.1.1 Linear-based models

Partial Least Squares (PLS) [11] is a regression method that uses a projection similar to Principal Component Analysis, but one that also includes the target, to reduce the dimensionality before regressing the data. It is not sensitive to multicollinear data and was therefore tried as one of the techniques.

Support Vector Regression (SVR) [12] is another method; it selects a subset of support vectors with which to model the target. With a linear kernel it can perform variable selection in this way, but it could be sensitive to the multicollinearity. By using a non-linear kernel in the regression, this sensitivity can be alleviated. However, a non-linear kernel will not yield linear regression coefficients in the same way as the linear kernel does, and it is therefore less interpretable.

The Lasso [13] and Group Lasso [13] models use L1 regularisation on the linear regression, that is, a penalty on the absolute value of each descriptor coefficient, and will effectively cancel out variables deemed irrelevant. The difference between them is that Group Lasso considers whole groups of variables instead of single variables, which can be very useful for categorical data. There are 51 discrete descriptors, binary descriptors included, in the dataset, so Group Lasso had the potential to be useful. Both models are likely sensitive to multicollinearity, but if the truly significant descriptors are sparse, they can work very well for variable selection. Whether this assumption is fulfilled is not trivial to establish, especially without chemical knowledge; the easiest way to find out is to try the models and analyse the results. Elastic Net (EN) models [13] are much like the Lasso, but with an additional quadratic penalty term on the regression coefficients. This makes them more flexible than Lasso regression, although they do not perform explicit variable selection.

4.1.2 Selection by a Genetic algorithm

The simple models described above may come up with subsets of important descriptors, but may not agree in all situations because of their different strengths and weaknesses. Simply trying out all possible subsets of 1–202 descriptors is not possible, but by using subsets acquired by the method of Section 4.2.2 as starting conditions, it is possible to use a genetic algorithm to combine and evolve them into a better subset. The method is inspired by Darwinian evolution combined with DNA replication and operates on a population of descriptor subsets; a thorough introduction is given in [14]. The advantages of this method are that no assumptions are made on the model or the descriptors, and that it generalises well. In particular, the genetic algorithm can be used together with any model of choice, and will always perform variable selection. It is by design guaranteed to find solutions at least as good as the best input set, and very likely much better sets. The downside is the computational load it causes, since the model should be tuned over its parameter space as often as possible, which amounts to hundreds of times over a reasonable number of iterations. In addition, the genetic algorithm itself has a number of hyperparameters that need tuning.
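As an illustration of the mechanics (a toy sketch under our own design choices, namely elitist selection, one-point crossover and bit-flip mutation; it is not the implementation of Section 4.2.3), a chromosome is a boolean mask over the descriptors and the fitness can be any model-quality score, e.g. cross-validated adjusted R2 of a model restricted to the masked descriptors:

```python
import random

def evolve(n_features, fitness, seeds, generations=50, pop_size=20,
           mut_rate=0.01, rng=random.Random(0)):
    """Toy genetic algorithm for descriptor-subset selection.
    Chromosomes are boolean masks over the features; 'seeds' are starting
    subsets, e.g. those found by the simpler models."""
    pop = [list(s) for s in seeds]
    while len(pop) < pop_size:  # pad the population with random masks
        pop.append([rng.random() < 0.5 for _ in range(n_features)])
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # bit-flip mutation with probability mut_rate per gene
            child = [not g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = elite + children  # elitism: the best mask always survives
    return max(pop, key=fitness)
```

Because the elite is carried over unchanged, the final mask is never worse than the best seed, which mirrors the guarantee mentioned above.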

4.1.3 Ensemble methods

Tree-based ensemble methods have previously been successfully applied to the area of QSAR modelling [15], primarily in the form of Random Forest [16] and Gradient Boosting [17]. Both methods work by constructing several decision trees and combining the results into a final model. The main difference lies in the dependence between trees: Random Forest has independent trees that all contribute to the final decision, while boosting makes each tree depend on the previous one, with the objective of improving the predictions in a stepwise process.

Traditionally, Random Forest has long been one of the most common methods in QSAR modelling due to its good predictivity, few adjustable parameters and ease of use [17]. However, it has been shown that XGBoost [18], an implementation of Gradient Boosting, achieves higher predictive ability while also having important advantages such as training speed [17]. One possible limitation of XGBoost is its number of adjustable parameters. Nevertheless, in QSAR modelling, the method has been shown not to be very sensitive to changes in parameters [17]. Thus, hyperparameter tuning for a single dataset/domain, as in the studied case, will be straightforward and not too computationally demanding.

Additionally, XGBoost uses a sparsity-aware approach to making decision splits [17]. This can be very beneficial in QSAR modelling, especially when using molecular fingerprints as features in the model. To evaluate this approach to QSAR modelling, fingerprints were constructed for each compound in the dataset using the RDKit fingerprint [10], and the method was evaluated on the expanded dataset.

A measure of feature importance is built into both methods. The decision trees make splits based on improving predictions, thus the features which are most often picked for splits should be the most important and descriptive of the data. There is, however, a difference in how the two methods handle the importance of correlated features. Due to the random and independent nature of Random Forest, correlated features have similar chances of being picked, hence splitting the importance between them. In boosting, once a feature is picked, a correlated feature should not be picked over independent ones.
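The importance-splitting effect for correlated features can be demonstrated with scikit-learn's Random Forest on synthetic data (a toy setup of our own, not the project's dataset): a near-duplicate of an informative feature roughly halves the importance reported for it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.05, size=500)  # near-copy of x0: a correlated pair
x2 = rng.normal(size=500)                   # independent informative feature
X = np.column_stack([x0, x1, x2])
y = x0 + x2 + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
fi = rf.feature_importances_
print(fi)  # x0's importance is shared with its near-copy x1,
           # so x2 ends up with the largest single importance
```

Here x0 and x2 contribute equally to y, yet x2 receives the highest individual importance because x0's credit is split with x1.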

4.2 Implementation details

The models were implemented in Python using the scikit-learn framework for machine learning and statistical modelling [19]. Parameter tuning was needed for all models, since each had at least one hyperparameter. All model training was done with 5-fold cross-validation on the training set and evaluated on the test set, both described in Section 3.1. The adjusted R2 score was chosen as the evaluation metric for the parameter searches, while the final models were evaluated on the test set using the R2 and RMSE measures. When training models on continuous data, standard-normal scaling was applied to the descriptor values. For all modelling, the pKi values defined in Section 3.1 were used as the response variable.
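Scikit-learn has no built-in adjusted R2 scorer, so a small helper along these lines can be used to penalise models with many descriptors (a sketch; the function name is ours, not from the project code):

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Penalises additional descriptors, so that models with different
    numbers of descriptors can be compared fairly during tuning.
    """
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
```

With no descriptors the score equals R2; adding descriptors without improving R2 lowers it.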

4.2.1 Linear-based models

The training and evaluation of the linear-based models followed the principles outlined in Section 4.2. For each dataset that a model was trained and evaluated on, possible outliers were analysed by considering the standardised Pearson residuals and a normal Q-Q plot, to decide whether any data points should be removed for that model and dataset. Each model was then evaluated by its R2 and RMSE scores along with a visual representation of its regression. The models and other tools were taken from Scikit-learn [19]. The code for all evaluation scripts is presented in Appendix C.2. A brief explanation of each model follows, where X is the input matrix of descriptor values, y is the output target and β are the model weights trained in the fitting process.

Lasso The lasso model tries to minimise the function

    \frac{1}{2 n_{\mathrm{samples}}} \| y - X\beta \|_2^2 + \alpha \| \beta \|_1.    (4.1)

Lasso has one hyperparameter, α, which controls the strength of the regularisation. It is a real number, most likely between 0 and 1.

Elastic Net The elastic net is the Lasso with an extra squared penalty term. The objective is to minimise

    \frac{1}{2 n_{\mathrm{samples}}} \| y - X\beta \|_2^2 + \alpha \, \ell_{1\mathrm{ratio}} \| \beta \|_1 + \frac{\alpha (1 - \ell_{1\mathrm{ratio}})}{2} \| \beta \|_2^2.    (4.2)

It therefore has an additional hyperparameter ℓ1ratio that regulates the balance between ℓ1 and ℓ2 penalisation: ℓ1ratio = 0 gives only the ℓ2 penalty (also called Ridge regression), while ℓ1ratio = 1 is equivalent to the Lasso.
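As an illustration, the two objectives can be evaluated with a few lines of plain Python (a sketch following the scikit-learn parameterisation; setting l1_ratio = 1 recovers the Lasso objective (4.1)):

```python
def elastic_net_objective(X, y, beta, alpha, l1_ratio):
    """Evaluate (1/2n)||y - X beta||_2^2 + alpha*l1_ratio*||beta||_1
    + (alpha*(1 - l1_ratio)/2)*||beta||_2^2 for lists of floats."""
    n = len(X)
    sq = sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2
             for row, yi in zip(X, y))
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return sq / (2 * n) + alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2
```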

Group Lasso The Group Lasso model is a modification of the Lasso, and its objective is given by

    \frac{1}{2} \| y - X\beta \|_2^2 + \lambda_1 \| \beta \|_1 + \lambda_{\mathrm{group}} \sum_{k=1}^{K} \| \beta_k \|_2    (4.3)

where β_k are the coefficients for group k. Its hyperparameters are λ1 for the ℓ1-regularisation and λgroup for the groupwise regularisation. It was implemented via the FISTA algorithm [20]. Both λ1 and λgroup are positive real numbers, in the majority of cases between 0 and 1.
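The key step of FISTA for the group penalty is a block-wise proximal (soft-thresholding) operator; a minimal sketch of that update (the function name is ours, not from [20] or [22]):

```python
import math

def group_soft_threshold(beta_k, t):
    """Proximal step for t * ||beta_k||_2: shrink the whole group towards
    zero, and set it exactly to zero if its norm does not exceed t."""
    norm = math.sqrt(sum(b * b for b in beta_k))
    if norm <= t:
        return [0.0] * len(beta_k)
    scale = 1.0 - t / norm
    return [scale * b for b in beta_k]
```

Because the whole block is shrunk together, entire groups of descriptors are dropped at once, which is what makes the Group Lasso useful for variable selection.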

Partial Least Squares PLS produces n orthogonal directions

    z_n = \sum_{j=1}^{p} \hat{\varphi}_{nj} \, x_j^{(n-1)}, \quad \text{with the projections } \hat{\varphi}_{nj} = \langle x_j^{(n-1)}, y \rangle,    (4.4)

and then derives the regression coefficients as

    \hat{\beta}_n = \langle z_n, y \rangle / \langle z_n, z_n \rangle.    (4.5)


For each such component k, PLS works by maximising

    \mathrm{corr}(X_k u, y_k v) \cdot \mathrm{std}(X_k u) \cdot \mathrm{std}(y_k v), \quad \text{s.t. } \|u\| = 1    (4.6)

with respect to the weights u and v, where corr and std denote the correlation and standard deviation. n is therefore the only hyperparameter to tune; it is an integer between 1 and the total number of variables. It was implemented using the PLS2 algorithm [21].
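One PLS direction as in Eqs. (4.4)-(4.5) can be computed directly; a plain-Python sketch for centred data (in practice the PLS2 implementation [21] was used):

```python
def pls_component(X, y):
    """Compute one PLS direction z = X*phi with phi_j = <x_j, y>,
    and its regression coefficient beta = <z, y> / <z, z>."""
    n, p = len(X), len(X[0])
    phi = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    z = [sum(X[i][j] * phi[j] for j in range(p)) for i in range(n)]
    beta = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
    return z, beta
```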

Support Vector Regression SVR tries to find a hyperplane that fits the data as well as possible. The optimisation problem is defined as

    \min_{w, b, \zeta, \zeta^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)
    \text{subject to } y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i,
    \quad w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*,
    \quad \zeta_i, \zeta_i^* \ge 0, \; i = 1, \dots, n.    (4.7)

C is the (inverse) strength of the regularisation. ε is the width of the tube around the hyperplane in which no penalty is given. ζ_i and ζ_i^* are the penalties for points lying above or below the ε-tube. φ(x) is a mapping such that the kernel K(x_j, x_k) = ⟨φ(x_j), φ(x_k)⟩. The linear implementation uses the linear kernel K_linear with φ(x) = x; this kernel was used in the variable selection process because it produces regression coefficients β that make a certain variable selection method possible. For the final evaluation of the SVR models, several kernels were tried, but eventually the RBF kernel was used. Denote the support vectors that define the hyperplane by x′. The RBF kernel is then given by

    K_{\mathrm{RBF}}(x, x') = \exp\!\left(-\gamma \, \|x - x'\|^2\right)    (4.8)

and has its own hyperparameter γ, which controls the influence of single training samples. C and γ are real numbers between 0 and ∞. ε is a small number ≥ 0.
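The RBF kernel of Eq. (4.8) is straightforward to compute; a small sketch:

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """K_RBF(x, x') = exp(-gamma * ||x - x'||^2); gamma controls how
    quickly the influence of a training sample decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)
```

A larger γ makes the kernel more local, so each support vector influences a smaller neighbourhood of the descriptor space.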

4.2.2 Variable selection

Four models were used for variable selection: Lasso, Group Lasso, Linear SVR and PLS. These models were fit according to the scheme described in Section 4.2. All implementations were taken from Scikit-learn [19], except Group Lasso, which is not provided by Scikit-learn; an independent implementation was used instead [22].

Both Lasso and Group Lasso perform variable selection automatically, and the chosen descriptors were extracted directly from the trained models. The variable selection with PLS and SVR was done in a similar way but with a slight difference: since these models do not explicitly assign zero values to coefficients, the choice of descriptors had to be made manually.

In the SVR model, after the model was fit with the best parameters found, the resulting model coefficients corresponding to descriptors were sorted in order of descending absolute value. This order was taken as the descriptor significance. The model was then re-fit using different numbers of descriptors: first only the most significant descriptor, then the two most significant descriptors, and so on until the model with the full number of descriptors had been fit. For each of these fits, the adjusted R2 was taken as the performance measure, and the descriptor set that maximised the adjusted R2 was chosen as the best set. This method is commonly referred to as the Recursive Feature Elimination algorithm [23].
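The selection loop above can be sketched as follows, with score_fn standing in for the adjusted R2 of a re-fit model (the names are ours for illustration, not from [23]):

```python
def select_by_coefficients(names, coefs, score_fn):
    """Rank descriptors by |coefficient| of a fitted linear model, then
    grow the set one descriptor at a time, keeping the best-scoring prefix."""
    order = [n for n, _ in sorted(zip(names, coefs),
                                  key=lambda pair: -abs(pair[1]))]
    best_subset, best_score = order[:1], float("-inf")
    for k in range(1, len(order) + 1):
        score = score_fn(order[:k])  # e.g. adjusted R^2 of the re-fit model
        if score > best_score:
            best_subset, best_score = order[:k], score
    return best_subset
```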

The PLS model selection was done by first tuning a model over its parameter space using cross-validation on the training set and then considering the loading weights from that model. For each descriptor i and PLS component k in the weight matrix, the corresponding loading weights w_{i,k} were summed over the components to form the descriptor weights w_i. Let w be the vector of descriptor weights. The median \bar{w} and interquartile range IQR(w) were computed, and a threshold T_w deciding whether a descriptor should be included or excluded was determined as

    T_w = \bar{w} / \mathrm{IQR}(w)    (4.9)

so that all descriptors with |w_i| ≥ T_w were included. This method was proposed by Shao et al. [24].
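A sketch of this selection rule, assuming the threshold of Eq. (4.9) is the median of the weights divided by their interquartile range (our reading of the method in Shao et al. [24]; the function name is ours):

```python
import statistics

def pls_select(names, weights):
    """Keep descriptors whose |summed loading weight| reaches
    T_w = median(w) / IQR(w)."""
    q1, _, q3 = statistics.quantiles(weights, n=4)
    threshold = statistics.median(weights) / (q3 - q1)
    return [n for n, w in zip(names, weights) if abs(w) >= threshold]
```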

The code for these procedures can be found in Appendix C.1.

4.2.3 Genetic algorithm

The concept of the genetic algorithm explained in Section 4.1 was implemented in Python from scratch. The initial population is formed by taking the subsets acquired by the linear-based models in Table 2 and then adding a number of randomly chosen subsets to that population. These subsets are represented in the algorithm as binary "chromosomes", all of length 202, where a 1 at an index in the string indicates that the descriptor corresponding to that index is active.

An iterative process then starts, where each iteration is called a generation and consists of the following steps. First, all subsets (chromosomes) in the population are evaluated by scoring a model fit to each set; to punish large models, the adjusted R2 is taken as the fitness. Based on these fitness scores, a new population is drawn with replacement from the existing one, with probability proportional to the fitness score of each chromosome. The new population is then subjected to recombination through a crossover operator that, with a certain probability, cuts pairs of chromosomes in two at a random index and joins the ends with each other. Finally, each chromosome is mutated by flipping its bits with a small probability. The loop is then complete and the new population is evaluated. In each iteration, the best chromosome found so far is kept so that the evolution never goes backwards. This procedure continues until the best model fit stops improving, and the corresponding descriptor set is taken as optimal for the current model.
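One generation of this loop can be sketched as follows (a simplified illustration assuming non-negative fitness values; the actual implementation is the one in Appendix C.3):

```python
import random

def next_generation(population, fitness_fn, p_cross=0.8, r_mut=0.01,
                    rng=random):
    """One GA generation: fitness-proportional selection with replacement,
    single-point crossover on pairs, bit-flip mutation, and elitism."""
    scores = [fitness_fn(c) for c in population]
    best = population[scores.index(max(scores))]
    new = rng.choices(population, weights=scores, k=len(population))
    for i in range(0, len(new) - 1, 2):  # crossover on consecutive pairs
        if rng.random() < p_cross:
            cut = rng.randrange(1, len(new[i]))
            new[i], new[i + 1] = (new[i][:cut] + new[i + 1][cut:],
                                  new[i + 1][:cut] + new[i][cut:])
    new = [[bit ^ 1 if rng.random() < r_mut else bit for bit in c]
           for c in new]
    new[0] = best  # elitism: the evolution never goes backwards
    return new
```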

The genetic algorithm has three hyperparameters: the population size N, the crossover probability p_cross and the mutation rate r_mut. A larger population size results in a more diverse solution space and prevents "inbreeding", which in turn helps prevent premature convergence. A low crossover probability decreases the rate at which subsets spread across the population and also helps prevent premature convergence to suboptimal solutions. The mutation rate is a constant used to form the probability of mutating each descriptor in a subset; it introduces new descriptor sets to the population.

Apart from these genetic parameters, the algorithm also has a predictive model with its own parameters. The model was chosen before the start of the algorithm and was re-tuned with respect to its parameters according to the following scheme. The model is initially tuned to the best subset in the population of the first generation. From then on, every 10th generation the algorithm checks whether the currently best subset has changed and, if so, re-tunes the model to the new best subset. The algorithm thus alternates between tuning the descriptor subset to the model and tuning the model to the descriptor subset. The hyperparameter table for the model consisted of an adaptive grid: if the tuning process had previously chosen a marginal value for any parameter, that parameter's value range was shifted logarithmically to centre the previously chosen value in the new range. That way the tuning process only has to search a small space around the previously optimal point. The code for the algorithm is presented in Appendix C.3.
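The adaptive grid can be illustrated with a small helper (a sketch of the recentering idea with hypothetical names, not the project code):

```python
def recenter_log_grid(grid, chosen):
    """If the tuner picked a boundary value of a logarithmic grid, shift
    the grid so that the chosen value becomes its new centre."""
    if chosen not in (grid[0], grid[-1]):
        return list(grid)  # interior optimum: keep the grid as is
    ratio = grid[1] / grid[0]  # common ratio of the logarithmic grid
    centre = len(grid) // 2
    return [chosen * ratio ** (i - centre) for i in range(len(grid))]
```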


4.2.4 Ensemble methods

Random Forest regression was implemented with scikit-learn [19], and the scikit-learn wrapper interface for XGBoost [18] was used. This allowed full usage of scikit-learn's evaluation functions, such as cross-validation and scoring functions.

Three sets of descriptors were used with these methods: the set of molecular descriptors, RDKit fingerprints generated with the Python library RDKit [10], and a combined dataset of both the molecular and fingerprint descriptors. The fingerprints were generated with the default settings, resulting in a size of 2048 bits.

Parameter tuning of the ensemble methods was performed with the Python library Hyperopt [25]. The tuning process was split into three segments, tuning the tree-related parameters, the regularisation parameters and the learning rate separately. The algorithm is explained in further detail in Appendix B. For XGBoost, this resulted in a significant improvement (∼10%) over the default settings. However, the tuning method did not result in any improvement for Random Forest. Previous research on QSAR modelling has concluded that Random Forest does not overfit, so increasing the number of estimators only penalises computation time [16]. It was also found that this was the only significant tuning parameter [16]; therefore the number of estimators was set to a high value (n_estimators = 500). The parameters for both methods are shown in Table 1.

Table 1: Chosen parameters for the ensemble methods. The parameters for XGBoost were tuned following the algorithm in Appendix B. Random Forest used default settings except an increase in n_estimators.

Parameter            Random Forest   XGBoost
max_depth            ∞               23
min_child_weight^a   -               30
gamma^b              0               0.13
colsample_bytree^b   1.0             0.58
lambda^a             -               0.55
alpha^a              -               1.4
n_estimators         500             417
learning_rate^a      -               0.025

^a XGBoost only. ^b The parameter has another name in Random Forest.

5 Results

5.1 Variable selection with linear-based models

The variable selection results using the simpler models for the D2 target are presented in Table 2. Note that most scores are too low to be satisfactory, while the selected subsets tend to be large. For the full set of variables chosen by each model and dataset, see Appendix A.1. The models do not agree fully on which descriptors to select, but there is some correlation. Figure 6 illustrates this by plotting the number of descriptors that the models agree on to a certain degree (50%, 75% and 100%). This agreement score was used to extract new subsets of descriptors, of high agreement score, with which to continue the analyses. Table 3 lists the descriptors that have 100% agreement among the model selection processes over the three datasets. The continuous dataset proved to be the easiest of the three for most models to handle.

Table 2: Results from the variable selection method using linear-based models for the D2 target. The models and scores are defined in Section 4.2, while the datasets are defined in Section 4.1. The size is the number of descriptors chosen by the model according to the processes outlined in Section 4.2.

Model        Dataset   Test R2   Test RMSE   Size
Group Lasso  alldata    0.006     1.005       74
Group Lasso  cat       -0.017     1.017       42
Group Lasso  cont      -0.075     1.046       49
Lasso        alldata    0.396     0.961       86
Lasso        cat        0.272     1.007       33
Lasso        cont       0.387     0.965       77
Linear SVR   alldata    0.405     0.958       91
Linear SVR   cat        0.280     1.004       24
Linear SVR   cont       0.399     0.960       63
PLS          alldata    0.385     0.960       24
PLS          cat        0.275     1.005       55
PLS          cont       0.388     0.964       24

Figure 6: Histogram of how much the D2 models agree on selecting descriptors on the three datasets. The x-axis shows the level of agreement (50%, 75% and 100%) and the y-axis shows the number of descriptors at that level.

5.2 Evaluating the linear-based models

SVR, PLS, Lasso, Group Lasso and EN models as defined in Section 4.1 were fit on different subsets of the descriptors chosen by the variable selection techniques in Table 2. The full list of descriptors can be found in Appendix A.1. All models and subsets were chosen somewhat subjectively. The models were cross-validated as described in Section 4.2 and evaluated on the test set by their R2 and RMSE scores. The results for the D2 target are shown in Table 4.

The results in Table 4 show that many models and subsets are either poor in themselves or poor combinations. The SVR model with radial basis function (RBF) kernel seems to be the best choice in most cases. The (cont, linear_svr) subset has 63 descriptors, and the SVR model reaches R2 = 0.524 on that set. These model fitting procedures were not repeated for the D3 case, because tests showed that the differences between how the models perform on the two sets are very small. The results of this section therefore serve as guidelines for the D3 case as well.

Table 3: The names of the descriptors, with the size of each descriptor set in parentheses, that the linear-based models include with 100% agreement.

Dataset: alldata (41)
Weight-mol, apol, a_donacc, BCUT_PEOE_1, BCUT_PEOE_2, BCUT_SLOGP_3, BCUT_SMR_1, b_ar, b_count, b_rotN, chi0v_C, chi0_C, chi1, GCUT_PEOE_0, GCUT_PEOE_3, GCUT_SLOGP_1, GCUT_SLOGP_2, GCUT_SMR_1, h_log_dbo, h_pKb, Kier1, KierFlex, logP(o/w), opr_brigid, PC-, PEOE_VSA_FPPOS, petitjean, petitjeanSC, Q_PC+, Q_PC-, Q_RPC-, Q_VSA_FHYD, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, RPC-, SMR_VSA0, VAdjMa, VDistMa, vsa_hyd, Weight

Dataset: cont (56)
Weight-mol, apol, a_acc, a_count, a_heavy, a_nC, a_nH, BCUT_PEOE_1, BCUT_PEOE_2, b_count, b_heavy, chi0, chi0v, chi0v_C, chi0_C, chi1, chi1v, chi1v_C, chi1_C, diameter, GCUT_SLOGP_2, h_logS, h_mr, Kier1, KierA1, KierA2, KierA3, PC+, PEOE_VSA+3, PEOE_VSA_POS, Q_PC+, Q_RPC+, Q_RPC-, Q_VSA_FHYD, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_HYD, Q_VSA_POL, Q_VSA_POS, RPC+, RPC-, SlogP_VSA7, SlogP_VSA8, SMR, SMR_VSA0, SMR_VSA5, VAdjEq, VAdjMa, VDistMa, vdw_area, vdw_vol, vsa_other, vsa_pol, Weight, weinerPath, zagreb

Dataset: cat (11)
a_aro, a_count, a_don, a_heavy, a_nI, a_nS, b_count, b_double, lip_acc, lip_druglike, opr_violation

Table 4: Results from evaluating the linear-based models on different subsets of the descriptors for D2. The Subset column denotes the model used to produce the subset when the value is a model name, and an agreement level used to produce the subset when the value is a rational number. The parameter sets were found by cross-validation over a subjectively chosen parameter space. SVR with RBF kernel stands out as the best choice regardless of subset.

Model          Dataset   Subset       Subset size   Test R2   Test RMSE
SVR 1          alldata   None         202            0.529     0.693
SVR 2          alldata   pls          24             0.38      0.795
PLS 1          alldata   None         202            0.278     0.858
Lasso 1        alldata   None         202            0.252     0.873
PLS 2          alldata   pls          24             0.109     0.953
SVR 3          alldata   0.75         20             0.139     0.937
Elastic net 1  alldata   pls          24            -0.007     1.013
Lasso 2        cat       None         51             0.117     0.948
Group Lasso 1  cat       gl           42            -0.075     1.046
SVR 4          cont      linear_svr   63             0.524     0.696
SVR 5          cont      lasso        77             0.461     0.741
SVR 6          cont      None         180            0.512     0.705
SVR 7          cont      0.75         21             0.172     0.919
SVR 8          cont      1            6              0.271     0.861
PLS 3          cont      None         180            0.27      0.862
Lasso 3        cont      None         180            0.241     0.88
Elastic net 2  cont      None         180            0.237     0.881
Elastic net 3  cont      0.75         21             0.05      0.983

5.3 Subset and model refinement with the genetic algorithm

The genetic algorithm finds a subset of descriptors based on a model that is fit and tuned to that set, and therefore also finds a (locally) optimal model. The best such subsets and their respective models found for the D2 and D3 versions are presented in Table 5. The best model in terms of R2 and RMSE was the SVR with RBF kernel. Since this is a non-linear kernel mapping to an infinite-dimensional space, it is not possible to obtain a "feature importance" as with the linear kernel. For the full set of selected descriptors for each model, refer to Appendix A.1. These results make it clear that variable selection with a genetic algorithm works better than the previous methods in Section 5.1. The results also show that the SVR with RBF kernel performs better when applied to the GA subsets than by itself. PLS was also tried as a model for the GA, but it did not perform well. An unexpected result was the large descriptor subsets chosen by the algorithm. This could reflect a preference of the SVR model, or it could be that a large number of descriptors actually explain the receptor affinity well.

Table 5: The best results found from applying the genetic algorithm to the problem of subset selection and model tuning with the SVR model with RBF kernel. The two parts of the table correspond to the D2 and D3 models respectively. The full sets of chosen descriptors can be found in Appendix A.1.

D2
Model name             SVR 9
Model parameters       Kernel = RBF, C = 1.5, ε = 0.05, γ = 0.017
Genetic parameters     N = 50, p_cross = 0.8, r_mut = 1, generations = 100
Number of descriptors  166
Test R2                0.579
Test RMSE              0.654

D3
Model name             SVR 10
Model parameters       Kernel = RBF, C = 1.5, ε = 0.1, γ = 0.011
Genetic parameters     N = 50, p_cross = 0.8, r_mut = 1, generations = 100
Number of descriptors  170
Test R2                0.614
Test RMSE              0.738

5.4 Ensemble models

The ensemble models were found to have the best predictive performance of all evaluated models. Table 6 shows the evaluation of the ensemble models on the test set. Three sets of descriptors were used; see Section 4.2 for details. XGBoost was found to be superior to Random Forest for all datasets, with ∼5% better R2. Additionally, the evaluated XGBoost model was more than three times faster than Random Forest, as shown in Table 7. XGBoost used on the molecular descriptors is observed to reach approximately the same level of prediction as Random Forest using all descriptors. A consequence of this is that XGBoost effectively reaches a similar level of prediction as Random Forest in approximately 6.4% of the computation time.


Table 6: Results of the ensemble models on the three descriptor sets; MD: Molecular Descriptors, FP: Fingerprints, CB: Combined. The results indicate that the ensemble models make better predictions using fingerprints.

             Random Forest        XGBoost
             R2      RMSE         R2      RMSE
Target = D2
MD           0.579   0.655        0.624   0.619
FP           0.616   0.625        0.643   0.603
CB           0.630   0.614        0.669   0.580
Target = D3
MD           0.611   0.741        0.643   0.710
FP           0.638   0.715        0.680   0.673
CB           0.658   0.695        0.690   0.662

Table 7: Time to fit to the training data and predict on the test set for the D2 target. Both models were parameter-optimised as presented in Table 1, and XGBoost was found to be more than 3x faster than Random Forest.

Dataset                   Random Forest   XGBoost
Molecular Descriptors     50.5 s          11.6 s
Fingerprint Descriptors   133.5 s         39.5 s
Both Sets                 180.7 s         55.5 s


Table 8: Results on all descriptors (CB). The cross-validation (CV) results are the means over the folds.

             Random Forest        XGBoost
             R2      RMSE         R2      RMSE
Target = D2
CV Train     0.942   0.244        0.957   0.210
CV Test      0.612   0.632        0.649   0.600
Test         0.630   0.614        0.669   0.580
Target = D3
CV Train     0.947   0.274        0.962   0.231
CV Test      0.672   0.681        0.704   0.647
Test         0.658   0.695        0.690   0.662

Table 9: Feature importance results for the ensemble models. The set I is the set of feature importances ordered from highest to lowest. The upper part of the table gives the number of features whose cumulative importance adds up to (i) 50% and (ii) 95% of the total importance, together with the largest single importance. The lower part lists the 10 most important features for each model.

                                           Random Forest   XGBoost
#features reaching 50% of total importance   33              47
#features reaching 95% of total importance   141             167
Largest single importance                    0.044           0.049

Random Forest top 10: GCUT_PEOE_0, GCUT_PEOE_1, h_pstates, rsynth, SMR_VSA4, BCUT_SLOGP_0, SlogP_VSA4, PEOE_VSA+5, BCUT_SMR_0, balabanJ

XGBoost top 10: PEOE_VSA+5, vsa_don, SlogP_VSA4, h_log_dbo, GCUT_PEOE_0, a_nN, SMR_VSA4, a_nS, opr_nring, GCUT_PEOE_1

By comparing the cross-validation results in Table 8 with the evaluation on the external test set, we can see that both models generalised well. It was also hypothesised, and confirmed, that Random Forest does not overfit when the number of trees is increased; this is shown by the plot in Appendix B.

Feature importances were produced by the ensemble models, and the most important results are presented in Table 9. The table shows the minimum number of features that make up 50% and 95% of the total importance, the largest importance of a single feature, and the ten most important features for each model. These results show that both models use most of the available descriptors to some extent in the decision-making process. For this reason, the observed feature importances do not give any obvious insight into the descriptors. However, although many descriptors are needed to reach the best fit, some descriptors are observed to be more common among the top ten most important features.
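The cumulative counts in Table 9 can be reproduced from any importance vector with a short helper (a sketch; the function name is ours):

```python
def n_features_for(importances, target):
    """Number of top-ranked features whose importances sum to `target`
    (a fraction of the total, assuming the importances sum to 1)."""
    cumulative = 0.0
    for count, imp in enumerate(sorted(importances, reverse=True), start=1):
        cumulative += imp
        if cumulative >= target:
            return count
    return len(importances)
```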


6 Discussion

6.1 Data discussion

It may be of interest to limit the data set to data from certain sources or by other metadata, such as the date of publication. Users of the ChEMBL database sometimes use a time-split test set to validate their models, that is, assigning compounds tested in later phases of a study to the test set [15]. However, since this project combines data from different sources and assays, a time-split data set was not considered appropriate. Moreover, with the approach taken of aggregating the data by averaging pKi values, any metadata had to be dropped.

The threshold that determines whether the spread of pKi values for one compound is acceptable has a significant impact on the number of compounds removed, and can be considered somewhat arbitrary and dependent on the field of application. Nevertheless, 10% appears to be a common value in informal discussions, and it retains a good number of compounds to model. Filtering out compounds by their pKi spread can remove perfectly valid compounds. One reason a spread can be large is that the stereoisomers of a compound produce different inhibitory constants. To the best of our knowledge, ChEMBL does not record which stereoisomer each value corresponds to. Were this information available, compounds could be re-labelled by their stereoisomers, and modelling would require three-dimensional descriptors, as one- and two-dimensional descriptors cannot differentiate between stereoisomers.

This project assessed the presence of activity cliffs based on a proposal from [9], but without performing any detailed analysis. The main reason was that an error was found, shortly before submission, in the code that assessed the presence of activity cliffs. After the correction, a similar activity landscape was obtained, and it was decided not to re-run the modelling because the results were not expected to differ significantly and because of time constraints.

A more in-depth analysis of the data could have been performed if time had allowed, for example analysing why some compounds with different ChEMBL IDs are identical based on fingerprints, or analysing intra- and inter-assay variability in activity values, in order to obtain a data set of higher quality. In addition, other descriptors, such as three-dimensional descriptors, could be incorporated that might provide significant information on the activity values.

6.2 Applicability domain

It is advisable that a QSAR model is accompanied by an applicability domain (AD). The applicability domain of a model relates the descriptor space on which it was trained to the space to which it should be applied, in order to provide some prediction reliability. Unsurprisingly, a defined domain of applicability is one of the Organisation for Economic Co-operation and Development (OECD) principles for the validity of a QSAR model for regulatory purposes [26].

It has been claimed that the QSAR field precedes the general field of machine learning in defining applicability domains [15]. There are two main approaches to defining an AD: defining the chemical space for which predictions are reliable, or estimating a prediction uncertainty. While some QSAR methods, such as Gaussian processes, already provide a prediction uncertainty [27], the definition of an AD seems to be a very active area of research. The recent publications from Berenger and Yamanishi [28] and Liu et al. [29] briefly summarise different approaches, cite several reviews on applicability domains, and provide new AD definitions themselves.

A simple AD requires that, in order to predict the activity of a new compound, its descriptors lie in the convex hull of the descriptors of the training set. This can be computationally challenging, and can be approximated by requiring that its descriptors lie within the ranges of the training set descriptors, creating a hypercube. A more elaborate method is to create this hypercube using the principal components to reduce its dimension, but the number of components to use depends on a user-chosen value.
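The range-based (hypercube) approximation is simple to state in code; a sketch:

```python
def in_hypercube_ad(x, X_train):
    """A compound is inside the AD if every descriptor value lies within
    the min-max range observed for that descriptor in the training set."""
    for j, value in enumerate(x):
        column = [row[j] for row in X_train]
        if not (min(column) <= value <= max(column)):
            return False
    return True
```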

In general, distance-based methods compute the distance from a compound to some compounds in the data set, and often depend on a user-chosen threshold to categorise compounds as inside or outside the AD. Probably the most advanced techniques to estimate an AD are probability density distribution methods [28].

It can be argued that distance-based methods can be outperformed by uncertainty estimation, as the prediction performance is more likely to deteriorate gradually towards the edges of the chemical descriptor space [29]. One approach to determining some sort of uncertainty is to build an error model, as in the work of Sheridan [27, 30, 31]. There, a random forest error model assigns to a new molecule the mean error of training-set molecules in the same region with respect to a set of defined AD metrics, such as the fingerprint similarity to the nearest or five nearest compounds in the training set, the predicted activity value, or the standard deviation of the prediction among the random forest trees.

Some work has been published that aims to quantify uncertainty in random forests in a more elaborate fashion than the prediction standard deviation among the trees. Wager, Hastie and Efron [32] and Mentch and Hooker [33] have published work on estimating confidence intervals for random forests, while Zhang et al. [34] propose a method to determine prediction intervals for random forests. Prediction intervals can also be constructed with quantile regression forests [35], a generalisation of random forests.

In this project, the method presented in [36] to determine an applicability domain was applied to the generated data sets. This method consists of first classifying compounds as well (the negative class) or badly (the positive class) predicted by a regression, and then classifying them with a method that provides some measure of confidence in the predicted class or a probability of class membership. The initial class assignment is somewhat arbitrary; the original work reasons that the minority (positive) class should have a prevalence of 20-30%. The original work also argues that the same descriptors used in the regression model have to be employed in the classification.

This method was applied to the Random Forest and XGBoost regression models described in Section 4.1.3. Since these use all the computed descriptors, a random forest classifier was considered suitable, as it reduces the effect of correlated descriptors and provides a class membership probability. The results showed the desired behaviour on the training set without tuning, both in the confusion matrix and in the distributions of the class membership probabilities. However, when tuning the classifier in order to generalise, maximising either the true positive rate or the F1 score, aiming to reduce false positives while also taking the class imbalance into account, the results not only degraded significantly for the training set but were also unsatisfactory for the test set.

Further work could evaluate the feasibility of applying different methods to determine the applicability domain of the generated models, and apply those found suitable. This could highlight that different models may present different applicability domains, especially if the AD provides an estimate of the prediction uncertainty. In particular, the aforementioned work from Sheridan [27] could be a starting point for the tree models presented here, given the similarities.


6.3 Modelling conclusions

6.3.1 Computational limitations

Properly evaluating statistical models on datasets of this size takes time and resources. For example, the genetic algorithm of Section 4.2 fits the same model, without any tuning, for all N descriptor subsets in the population. Tuning is instead done only on the model with the highest fitness, and only every tenth generation (if the current best model has changed). Tuning was done on a parameter grid of size 3 × 3 × 3 using 5-fold cross-validation. With the SVR model it took around 7 hours to run 80 generations with a population of 40 descriptor sets on a relatively slow home computer. That amounts to 3200 model fits and evaluations in the fitness computation step, plus an additional 3 × 3 × 3 × 5 fits every tenth generation, for a total of 4280 model fits and evaluations. The algorithm would almost certainly benefit from tuning with cross-validation for each of the N subsets. Furthermore, it would have been appropriate to run for up to 200 generations with a slightly larger population, and to fine-tune the model over a higher-resolution parameter grid. This was not possible on home computers, but doing it with high-performance computers would most likely lead to better results. The conclusion is that some of the results presented in this report are not optimal, although the methods may very well be. An indication of the potential of the GA is the training progress plot in Figure 7, which shows that the fitness could probably improve with more generations and tuning of the genetic parameters. It is reasonable to assume a steeper fitness curve if the model tuning had been done more often and at higher resolution.

Figure 7: Training progress of the genetic algorithm with an SVR model on the D2 dataset. Even though the increase in fitness diminishes with more generations, some improvement could probably be gained by tuning the genetic parameters and running more iterations.

Additionally, computational limitations were the primary reason why XGBoost was not tuned to each descriptor set and target individually. With 5-fold cross-validation, a single evaluation of one parameter combination could take up to 10 minutes if the learning rate was low enough. Since 8 parameters are tuned for the model over relatively large ranges, this implies very many evaluations even when making smart choices with an optimisation tool such as Hyperopt. It was, however, concluded that the set of parameters presented in Table 1 performs well even for data it was not specifically tuned on. The adequate performance of an XGBoost model with roughly tuned parameters is also supported by previous research in the area [17]. The model could thus perform better if tuned individually for each situation it is evaluated on, but the improvements are minimal compared to the increase in resources needed. A related point is the computational advantage of XGBoost over Random Forest. After the parameters are tuned, XGBoost trains and predicts significantly faster than Random Forest, making it superior in this sense. It should, however, be noted that Random Forest did not require as many estimators as were used: small increases in performance were visible, but the model was reasonably accurate with very few estimators (fewer than 100). Hence, Random Forest can also be sped up significantly at the expense of some predictive accuracy. The conclusion is nonetheless that XGBoost is the superior of the two in terms of computational performance for training and prediction.

6.3.2 Feature subsets and importance

When comparing the most important features of the ensemble methods with the subset selected by the genetic algorithm, both methods indicate that the models are mostly explained by the same number of descriptors. For the D2 receptor target, the genetic algorithm chooses 166 features, and the same number of features contributes 94.6% of the feature importance in the XGBoost model. When comparing the specific features chosen, an agreement of 84% is found, that is, 139 common features.
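The agreement figure is simply the overlap between the two selected feature sets divided by their (equal) size; a minimal sketch, using hypothetical descriptor names rather than the actual selections:

```python
# Fraction of features shared by two equally sized feature subsets.
def agreement(subset_a, subset_b):
    assert len(subset_a) == len(subset_b)
    return len(subset_a & subset_b) / len(subset_a)

# Toy example with hypothetical descriptor names:
ga_subset = {"SlogP", "TPSA", "a_acc", "chi0", "Kier1"}
xgb_subset = {"SlogP", "TPSA", "a_acc", "chi0", "b_rotN"}
print(agreement(ga_subset, xgb_subset))  # 0.8

# The figure quoted above: 139 common features out of 166 selected.
print(round(139 / 166, 3))  # 0.837, i.e. roughly 84%
```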

A possible explanation is that the descriptors omitted from either model are correlated with included ones. The dependence structure of the features in the descriptor set would be an interesting area of further study, but it was outside the scope of this report.

A final note on the choice of descriptor sets concerns the ensemble models that used fingerprints as features. Including fingerprints seemed to increase performance for both Random Forest and XGBoost. Both the fingerprints and the molecular descriptors are computed from the SMILES string, but from these results the fingerprints seem to encode more information about the molecule and are therefore better for predicting molecular behaviour. Recent research [37] has shown promising results in bypassing the step of computing descriptors from the SMILES and having a neural network perform this step implicitly. This is an interesting area of study, which could reveal that some information is lost in the manual creation of descriptors.

6.3.3 Model interpretability

RBF kernel and support vector machines. The RBF kernel used with SVR proved to be the best choice among the linear-based models. Linear SVR works by finding the hyperplane in descriptor space that best separates the data. When the RBF kernel is applied, that hyperplane is subjected to a non-linear transformation into kernel space. The RBF kernel as defined in Equation (4.8) is a Gaussian function of the Euclidean distance between data points and the support vectors. Since the goal of SVR is to choose the support vectors that create the greatest separation, this can be interpreted as the RBF-SVR finding Gaussian-shaped groups of points and separating between those groups. This argument is not strictly rigorous, but it can be used to get a sense of how the descriptor data is clustered. The downside of the RBF kernel is that no variable importance can be extracted in a reasonable way: the kernel transforms the input data from descriptor space to an infinite-dimensional kernel space where the coefficient interpretation is lost.
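For reference, the Gaussian RBF kernel discussed here can be written in a few lines (a sketch of the standard definition exp(−γ‖x − y‖²); the default gamma value is illustrative, taken from the tuned SVR models in Table 10):

```python
import math

def rbf_kernel(x, y, gamma=0.001):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

# The kernel is 1 for identical points and decays with distance:
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))              # 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))   # exp(-2.5), about 0.082
```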



Tree-based ensemble methods. Both Random Forest and XGBoost create ensembles of decision trees to make predictions. In principle the models are fully interpretable, as each decision can be traced through the tree structures and specified. The problem is that in practice, due to the large number of trees (∼ 500), the decisions cannot easily be visualised and analysed, limiting model explainability. Thus, the primary tool of explainability becomes the feature importances, which can easily be extracted from the ensemble tree structures.
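Ensemble feature importances are typically the per-tree importance scores normalised and averaged over the forest; a minimal pure-Python sketch of that aggregation (hypothetical descriptor names and toy per-tree scores, not the project's actual trees):

```python
from collections import defaultdict

def ensemble_importance(per_tree_importances):
    """Average normalised per-tree importance scores over an ensemble.

    Each element is a dict mapping feature name to that tree's raw
    importance (e.g. total impurity decrease from splits on the feature).
    """
    totals = defaultdict(float)
    for tree in per_tree_importances:
        norm = sum(tree.values())  # normalise each tree to sum to 1
        for feature, score in tree.items():
            totals[feature] += score / norm
    n_trees = len(per_tree_importances)
    return {feature: s / n_trees for feature, s in totals.items()}

# Two toy trees that split mostly on SlogP:
trees = [{"SlogP": 3.0, "TPSA": 1.0}, {"SlogP": 1.0, "a_acc": 1.0}]
imp = ensemble_importance(trees)
print(imp)  # SlogP dominates; the importances sum to 1
```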

6.3.4 Comparison of model candidates

Figure 8 presents a sample of the most interesting models found during this project. The position of a model name on the grid indicates how many descriptors it considered important (model complexity, the x-axis) and its final evaluation score (R2, the y-axis). This gives an idea of how complexity and performance relate across the different models. The model specifications can be found in Table 4.

Figure 8: Each point on the grid indicates a model with a certain complexity and a certain evaluation score, as given by the x and y axes respectively. Refer to Table 8 for information on the RF and XGB models and to Table 4 for the other models. Note that the 95% level for the number of important descriptors is plotted.

6.4 Future recommendations

A number of articles that discuss best practices in QSAR modelling are useful to review before any study [5, 6, 15]. Moreover, the OECD guidance document on the validation of QSAR models [26] can be very convenient: it provides principles for QSAR modelling for regulatory purposes that can be considered in any QSAR modelling, guidance on each of these principles, and a checklist to fulfil them. Arguably, this project does not satisfy one of the principles, a well-defined applicability domain, which was not studied in detail due to time limitations; the discussion in Section 6.2 can be a starting point for any future study.

QSAR modelling should be led by the scientific question the study wants to answer. In general, a model that both predicts reliably and is interpretable can be difficult to obtain. Interpretable models often use empirical descriptors, are simpler and often linear, whereas models that include computational descriptors and are harder to interpret mechanistically are more reliable predictors [38]. Consequently, prior to modelling it can be useful to determine what is prioritised, always bearing in mind that correlation does not imply causality, a common cause of disappointment in QSAR modelling [7, 39]. This decision can help determine the data sources and descriptors to consider before any analysis, in particular if experimental conditions affect the measurement and therefore the prediction.

Using a Random Forest as a benchmark model can be useful given that it is a well-known and unambiguous method, it is extensively discussed in the QSAR literature, and it performs intrinsic variable selection. Nonetheless, XGBoost, which is also tree-based, can be an alternative benchmark by the reasoning in Section 4.1.3 and given the good results presented in Section 5.4. Notably, the public implementations of these models are usually open source, which allows inspection of the implementation if required. These particular models can also be used for classification tasks, an approach not considered in this project since the activity values could not easily be assigned a class.

7 Acknowledgements

We want to thank Peder Svensson and Fredrik Wallner from IRLAB Therapeutics and Mattias Sundén and Erik Lorentzen from Smartr for offering this project, for the interesting discussions over video calls, and for their availability. The project not only matched our expectations but also helped us gain new knowledge and discover QSAR modelling. We are very happy to have been chosen for this project and we hope that our findings are useful.



References

[1] Arkadiusz Z. Dudek, Tomasz Arodz and Jorge Gálvez. ‘Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review’. In: Combinatorial Chemistry & High Throughput Screening 9 (2006), pp. 213–228.

[2] David Mendez et al. ‘ChEMBL: towards direct deposition of bioassay data’. In: Nucleic Acids Research 47.D1 (Nov. 2018), pp. D930–D940. issn: 0305-1048. doi: 10.1093/nar/gky1075.

[3] Chemical Computing Group. Molecular Operating Environment. Version 2019.01. url: https://www.chemcomp.com/.

[4] Denis Fourches, Eugene Muratov and Alexander Tropsha. ‘Curation of chemogenomics data’. In: Nature Chemical Biology 11.8 (2015), p. 535. doi: 10.1038/nchembio.1881.

[5] Denis Fourches, Eugene Muratov and Alexander Tropsha. ‘Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation’. In: Journal of Chemical Information and Modeling 56.7 (2016), pp. 1243–1252. issn: 15205142. doi: 10.1021/acs.jcim.6b00129.

[6] Artem Cherkasov et al. ‘QSAR modeling: Where have you been? Where are you going to?’ In: Journal of Medicinal Chemistry 57.12 (2014), pp. 4977–5010. issn: 15204804. doi: 10.1021/jm4004285.

[7] Gerald M. Maggiora. ‘On Outliers and Activity Cliffs — Why QSAR Often Disappoints’. In: Journal of Chemical Information and Modeling 46.4 (2006). PMID: 16859285, p. 1535. doi: 10.1021/ci060117s.

[8] Dagmar Stumpfe et al. ‘Recent Progress in Understanding Activity Cliffs and Their Utility in Medicinal Chemistry’. In: Journal of Medicinal Chemistry 57.1 (2014). PMID: 23981118, pp. 18–28. doi: 10.1021/jm401120g.

[9] Rajarshi Guha and John H. Van Drie. ‘Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs’. In: Journal of Chemical Information and Modeling 48.3 (2008). PMID: 18303878, pp. 646–658. doi: 10.1021/ci7004093.

[10] RDKit: Open-Source Cheminformatics. Version 2020.09.1. url: https://www.rdkit.org/.

[11] Tahir Mehmood, Solve Sæbø and Kristian Hovde Liland. ‘Comparison of variable selection methods in partial least squares regression’. In: Journal of Chemometrics 34 (2020). doi: 10.1002/cem.3226.

[12] Alex J. Smola and Bernhard Schölkopf. ‘A tutorial on support vector regression’. In: Statistics and Computing 14 (2004), pp. 199–222. doi: 10.1023/B:STCO.0000035301.49549.88.

[13] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. doi: 10.1007/b94608.

[14] M. Wahde. Biologically inspired optimization algorithms. WIT Press, 2008.

[15] Eugene N. Muratov et al. ‘QSAR without borders’. In: Chem. Soc. Rev. 49.11 (2020), pp. 3525–3564. doi: 10.1039/D0CS00098A.

[16] Vladimir Svetnik et al. ‘Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling’. In: Journal of Chemical Information and Computer Sciences 43.6 (2003). PMID: 14632445, pp. 1947–1958. doi: 10.1021/ci034160g.

[17] Robert P. Sheridan et al. ‘Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships’. In: Journal of Chemical Information and Modeling 56.12 (2016). PMID: 27958738, pp. 2353–2360. doi: 10.1021/acs.jcim.6b00591.



[18] Tianqi Chen and Carlos Guestrin. ‘XGBoost: A Scalable Tree Boosting System’. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA: Association for Computing Machinery, 2016, pp. 785–794. isbn: 9781450342322. doi: 10.1145/2939672.2939785.

[19] Scikit-learn website for regression models. Dec. 2020. url: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning.

[20] Mathematical background of Group Lasso. Dec. 2020. url: https://group-lasso.readthedocs.io/en/latest/maths.html.

[21] Jacob A. Wegelin. ‘A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case’. In: Technical report 371 (2000).

[22] Python implementation of Group Lasso. Dec. 2020. url: https://github.com/yngvem/group-lasso.

[23] I. Guyon et al. ‘Gene selection for cancer classification using support vector machines’. In: Machine Learning 46 (2002), pp. 389–422.

[24] R. Shao et al. ‘Wavelets and nonlinear principal components analysis for process monitoring’. In: Control Engineering Practice 7 (1999), pp. 856–879. doi: 10.1016/S0967-0661(99)00039-8.

[25] James Bergstra et al. ‘Hyperopt: A Python library for model selection and hyperparameter optimization’. In: Computational Science & Discovery 8 (July 2015), p. 014008. doi: 10.1088/1749-4699/8/1/014008.

[26] OECD. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. 2014, p. 154. doi: 10.1787/9789264085442-en.

[27] Robert P. Sheridan. ‘The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity’. In: Journal of Chemical Information and Modeling 55.6 (2015), pp. 1098–1107. issn: 15205142. doi: 10.1021/acs.jcim.5b00110.

[28] Francois Berenger and Yoshihiro Yamanishi. ‘A Distance-Based Boolean Applicability Domain for Classification of High Throughput Screening Data’. In: Journal of Chemical Information and Modeling 59.1 (2019), pp. 463–476. issn: 15205142. doi: 10.1021/acs.jcim.8b00499.

[29] Ruifeng Liu et al. ‘General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity’. In: Journal of Chemical Information and Modeling 58.8 (2018), pp. 1561–1575. issn: 15205142. doi: 10.1021/acs.jcim.8b00114.

[30] Robert P. Sheridan. ‘Three useful dimensions for domain applicability in QSAR models using random forest’. In: Journal of Chemical Information and Modeling 52.3 (2012), pp. 814–823. issn: 15499596. doi: 10.1021/ci300004n.

[31] Robert P. Sheridan. ‘Using random forest to model the domain applicability of another random forest model’. In: Journal of Chemical Information and Modeling 53.11 (2013), pp. 2837–2850. issn: 15499596. doi: 10.1021/ci400482e.

[32] Stefan Wager, Trevor Hastie and Bradley Efron. ‘Confidence intervals for random forests: The jackknife and the infinitesimal jackknife’. In: Journal of Machine Learning Research 15 (2014), pp. 1625–1651. issn: 15337928. url: https://jmlr.org/papers/v15/wager14a.html.

[33] Lucas Mentch and Giles Hooker. ‘Quantifying uncertainty in random forests via confidence intervals and hypothesis tests’. In: Journal of Machine Learning Research 17 (2016), pp. 1–41. issn: 15337928. arXiv: 1404.6473. url: https://jmlr.org/papers/v17/14-168.html.



[34] Haozhe Zhang et al. ‘Random Forest Prediction Intervals’. In: American Statistician 74.4 (2020), pp. 392–406. issn: 15372731. doi: 10.1080/00031305.2019.1585288.

[35] Nicolai Meinshausen. ‘Quantile Regression Forests’. In: Journal of Machine Learning Research 7 (2006), pp. 983–999. url: https://www.jmlr.org/papers/v7/meinshausen06a.html.

[36] Rajarshi Guha and Peter C. Jurs. ‘Determining the validity of a QSAR model - A classification approach’. In: Journal of Chemical Information and Modeling 45.1 (2005), pp. 65–73. issn: 15499596. doi: 10.1021/ci0497511.

[37] Suman K. Chakravarti and Sai Radha Mani Alla. ‘Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks’. In: Frontiers in Artificial Intelligence 2 (2019), p. 17. issn: 2624-8212. doi: 10.3389/frai.2019.00017. url: https://www.frontiersin.org/article/10.3389/frai.2019.00017.

[38] Toshio Fujita and David A. Winkler. ‘Understanding the Roles of the Two QSARs’. In: Journal of Chemical Information and Modeling 56.2 (2016). PMID: 26754147, pp. 269–274. doi: 10.1021/acs.jcim.5b00229.

[39] Stephen R. Johnson. ‘The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy)’. In: Journal of Chemical Information and Modeling 48.1 (2008). PMID: 18161959, pp. 25–26. doi: 10.1021/ci700332k.



A Result tables

A.1 Variable selection chosen descriptors

See the separate CSV files named varselect_results_v3_<target>_<dataset>.csv for the complete lists of chosen descriptors and their variable agreement scores. <target> can be D2 or D3, while <dataset> can be alldata, cont or cat.

A.2 Specifications of the linear-based models

More details on the models presented in Table 4 are given in Table 10.

Table 10: Results from evaluating the linear-based models on different subsets of the descriptors. The Subset column denotes the model used to produce the subset when the value is a model name, and the agreement level used to produce the subset when the value is a rational number. The parameter sets were found by cross-validation over a subjectively chosen parameter space.

Model Dataset Subset Test R2 Test RMSE Best parameters

SVR 1 alldata None 0.529 0.693 C: 100.0, ϵ: 0.1, γ: 0.001, kernel: rbf
SVR 2 alldata pls 0.38 0.795 C: 1.0, ϵ: 0.1, γ: 0.1, kernel: rbf
PLS 1 alldata None 0.278 0.858 n_components: 100
Lasso 1 alldata None 0.252 0.873 α: 0.005
PLS 2 alldata pls 0.109 0.953 n_components: 16
SVR 3 alldata 0.75 0.177 0.916 C: 0.215, ϵ: 0, kernel: rbf
Elastic net 1 alldata pls -0.007 1.013 α: 1, l1_ratio: 0.5
Lasso 2 cat None 0.117 0.948 α: 0.005
Group Lasso 1 cat gl 0.079 0.968 group_reg: 0.005, l1_reg: 0
SVR 4 cont linear_svr 0.524 0.696 C: 1.0, ϵ: 0.1, γ: 0.1, kernel: rbf
SVR 5 cont lasso 0.461 0.741 C: 100.0, ϵ: 0.01, γ: 0.001, kernel: rbf
SVR 6 cont None 0.512 0.705 C: 100.0, ϵ: 0.1, γ: 0.001, kernel: rbf
SVR 7 cont 0.75 0.406 0.778 C: 0.01, ϵ: 0
PLS 3 cont None 0.27 0.862 n_components: 22
Lasso 3 cont None 0.241 0.88 α: 0.005
Elastic net 2 cont None 0.237 0.881 α: 0.1, l1_ratio: 0
Elastic net 3 cont 0.75 0.196 0.905 α: 0.1, l1_ratio: 0

A.3 Final descriptor sets of the genetic algorithm

Tables 11 and 12 list the names of the descriptors used in the final generation of the genetic algorithm for the results presented in Table 5.

B Ensemble Parameter Tuning

Parameters for the ensemble methods were tuned following the procedure specified in this section. The most relevant parameters were identified and split into three categories. Tree-related parameters: max_depth, min_child_weight, min_samples_split, gamma, colsample_bytree; regularisation parameters: lambda, alpha; learning parameters: n_estimators, learning_rate. Each set of parameters was optimised separately in the order presented. Finally, the optimisation of the tree-related parameters was rerun to produce the final results.

The Python library Hyperopt [25] was used for the optimisation step, specifically the function fmin, which minimises an objective function on a given parameter space. The search space was defined as in Table 13, and the objective function was the mean RMSE of a 5-fold cross-validation. The table also shows the optimal parameters found by the algorithm.
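The shape of this optimisation loop can be sketched with plain random search (a stand-in for Hyperopt's fmin, not the library's actual TPE algorithm; the objective here is a stub in place of the project's mean 5-fold CV RMSE, and the space values are merely illustrative of Table 13):

```python
import random

def tune(objective, space, n_trials=200, seed=0):
    """Random-search stand-in for hyperopt.fmin: sample the space, keep the best."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Search space loosely mirroring Table 13 (illustrative values only):
space = {
    "max_depth": range(4, 31),
    "learning_rate": [0.005 * i for i in range(1, 41)],
}

# Stub objective standing in for the mean 5-fold CV RMSE:
def objective(p):
    return abs(p["max_depth"] - 23) * 0.01 + abs(p["learning_rate"] - 0.025)

params, loss = tune(objective, space)
print(params, loss)
```

In the project, the objective would instead fit the ensemble model with the sampled parameters and return the cross-validated RMSE.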



Table 11: The descriptors chosen in the final iteration of the GA for the D2 dataset.

D2

Names

Weight-mol, apol, ast_fraglike, ast_violation, ast_violation_ext, a_acc, a_aro,a_base, a_count, a_don, a_donacc, a_heavy, a_ICM, a_nB, a_nBr, a_nC,a_nCl, a_nF, a_nH, a_nN, a_nO, a_nS, balabanJ, BCUT_PEOE_1,BCUT_PEOE_2, BCUT_PEOE_3, BCUT_SLOGP_0, BCUT_SLOGP_1,BCUT_SLOGP_3, BCUT_SMR_0, BCUT_SMR_1, BCUT_SMR_2,BCUT_SMR_3, bpol, b_1rotN, b_1rotR, b_ar, b_count, b_double, b_heavy,b_max1len, b_rotN, b_rotR, b_single, b_triple, chi0, chi0v, chi0_C, chi1_C,chiral, chiral_u, density, diameter, GCUT_PEOE_0, GCUT_PEOE_1,GCUT_PEOE_2, GCUT_PEOE_3, GCUT_SLOGP_0, GCUT_SLOGP_1,GCUT_SLOGP_2, GCUT_SLOGP_3, GCUT_SMR_0, GCUT_SMR_1,GCUT_SMR_2, GCUT_SMR_3, h_ema, h_emd_C, h_logP, h_logS,h_log_dbo, h_mr, h_pavgQ, h_pKa, h_pKb, h_pstates, h_pstrain, Kier1,Kier2, Kier3, KierA1, KierA3, lip_don, lip_druglike, lip_violation, logP(o/w),logS, mr, opr_brigid, opr_leadlike, opr_nring, opr_nrot, opr_violation, PC+,PEOE_PC+, PEOE_PC-, PEOE_RPC+, PEOE_VSA+0, PEOE_VSA+1,PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6,PEOE_VSA-0, PEOE_VSA-1, PEOE_VSA-2, PEOE_VSA-3, PEOE_VSA-4,PEOE_VSA-5, PEOE_VSA-6, PEOE_VSA_FHYD, PEOE_VSA_FNEG,PEOE_VSA_FPNEG, PEOE_VSA_FPOS, PEOE_VSA_FPPOS,PEOE_VSA_NEG, PEOE_VSA_POL, PEOE_VSA_POS, PEOE_VSA_PPOS,petitjean, petitjeanSC, Q_PC+, Q_PC-, Q_RPC+, Q_VSA_FHYD,Q_VSA_FNEG, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS,Q_VSA_FPPOS, Q_VSA_PNEG, radius, reactive, rings, rsynth, SlogP,SlogP_VSA0, SlogP_VSA2, SlogP_VSA3, SlogP_VSA4, SlogP_VSA5,SlogP_VSA6, SlogP_VSA7, SlogP_VSA8, SlogP_VSA9, SMR, SMR_VSA0,SMR_VSA1, SMR_VSA2, SMR_VSA3, SMR_VSA4, SMR_VSA5,SMR_VSA6, SMR_VSA7, TPSA, VAdjEq, VDistEq, VDistMa, vdw_area,vsa_acc, vsa_don, vsa_other, vsa_pol, weinerPath, weinerPol, zagreb,



Table 12: The descriptors chosen in the final iteration of the GA for the D3 dataset.

D3

Names

ast_fraglike_ext, ast_violation, ast_violation_ext, a_acc, a_base, a_don,a_donacc, a_heavy, a_hyd, a_IC, a_ICM, a_nBr, a_nCl, a_nF, a_nH,a_nI, a_nN, a_nO, a_nP, a_nS, balabanJ, BCUT_PEOE_0, BCUT_PEOE_1,BCUT_PEOE_2, BCUT_PEOE_3, BCUT_SLOGP_0, BCUT_SLOGP_1,BCUT_SLOGP_2, BCUT_SLOGP_3, BCUT_SMR_0, BCUT_SMR_1,BCUT_SMR_3, bpol, b_1rotR, b_ar, b_count, b_double, b_max1len, b_rotN,b_rotR, b_triple, chi0, chi0v, chi0v_C, chi1, chi1v, chi1v_C, chi1_C, chiral,chiral_u, density, FCharge, GCUT_PEOE_0, GCUT_PEOE_1,GCUT_PEOE_2, GCUT_PEOE_3, GCUT_SLOGP_0, GCUT_SLOGP_1,GCUT_SLOGP_2, GCUT_SLOGP_3, GCUT_SMR_0, GCUT_SMR_1,GCUT_SMR_2, GCUT_SMR_3, h_emd, h_emd_C, h_logP, h_logS,h_log_pbo, h_mr, h_pavgQ, h_pKa, h_pKb, h_pstates, h_pstrain, Kier1,Kier3, KierA3, lip_acc, lip_don, lip_violation, mr, mutagenic, opr_brigid,opr_leadlike, opr_nring, opr_nrot, opr_violation, PC+, PC-, PEOE_PC+,PEOE_PC-, PEOE_RPC+, PEOE_RPC-, PEOE_VSA+0, PEOE_VSA+1,PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5,PEOE_VSA+6, PEOE_VSA-0, PEOE_VSA-1, PEOE_VSA-2, PEOE_VSA-4,PEOE_VSA-5, PEOE_VSA-6, PEOE_VSA_FHYD, PEOE_VSA_FNEG,PEOE_VSA_FPNEG, PEOE_VSA_FPOL, PEOE_VSA_FPOS,PEOE_VSA_FPPOS, PEOE_VSA_HYD, PEOE_VSA_NEG,PEOE_VSA_PNEG, PEOE_VSA_POS, PEOE_VSA_PPOS, petitjean,petitjeanSC, Q_PC+, Q_PC-, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG,Q_VSA_FPOL, Q_VSA_FPPOS, Q_VSA_HYD, Q_VSA_NEG,Q_VSA_PNEG, Q_VSA_POS, Q_VSA_PPOS, radius, reactive, rings, RPC+,rsynth, SlogP, SlogP_VSA0, SlogP_VSA1, SlogP_VSA2, SlogP_VSA3,SlogP_VSA4, SlogP_VSA5, SlogP_VSA6, SlogP_VSA7, SlogP_VSA8,SlogP_VSA9, SMR_VSA0, SMR_VSA1, SMR_VSA2, SMR_VSA3,SMR_VSA4, SMR_VSA5, SMR_VSA6, SMR_VSA7, TPSA, VAdjEq,VAdjMa, VDistEq, VDistMa, vdw_area, vdw_vol, vsa_acc, vsa_don,vsa_hyd, vsa_other, vsa_pol, Weight, zagreb,



The dataset used in the optimisation was the set of all descriptors (molecular descriptors and fingerprints) for the target D2.

Table 13: Optimal parameters chosen by the tuning algorithm. The range and step columns indicate the parameter space searched by Hyperopt. Note that this was not the optimal set of parameters for Random Forest (see Appendix B).

Parameter Random Forest XGBoost Range Step

max_depth 17 23 [4, 30] 1
min_child_weight (a) - 30 [1, 30] 1
min_samples_split (b) 5 - [2, 30] 1
gamma (c) 0 0.13 [0, 0.5] 0.01
colsample_bytree (c) 0.61 0.58 [0.5, 1] 0.01
lambda (a) - 0.55 [0, 2] 0.05
alpha (a) - 1.4 [0, 2] 0.05
n_estimators 422 417 [100, 500] 1
learning_rate (a) - 0.025 [0.005, 0.2] 0.005

(a) XGBoost only. (b) Random Forest only. (c) The parameter has a different name in Random Forest.

Each optimised method was evaluated against its default settings as part of the tuning procedure. The 5-fold cross-validation results on the set of all descriptors are presented in Table 14. The metrics presented are the means of the scores on the test folds. It is observed that Random Forest barely benefits from parameter optimisation. This is explained by previous studies, where the authors note that only the number of trees impacts the predictive performance of the method [16]. Thus, default settings with n_estimators = 500 were later used, resulting in similar cross-validation scores. To confirm that this choice did not lead to overfitting, the plot in Figure 9 was generated; it confirms that Random Forest does not seem to overfit with an increased number of estimators.

XGBoost experienced a larger benefit from tuning, in the range of 5–10%. It was noted that a lot of the benefit came from reducing the learning rate and increasing the number of estimators. This is of course a trade-off between predictive accuracy and computational cost, so the choice of model is not trivial. We recommend that this is taken into consideration when choosing parameters for XGBoost. The other parameters can, however, reduce computation time by considering fewer features or building smaller trees.

Table 14: Mean scores of the test folds in 5-fold cross-validation. The dataset was the molecular descriptors.

Random Forest XGBoost

R2 RMSE R2 RMSE

Target = D2
Default 0.607 0.636 0.610 0.630
Tuned 0.612 0.631 0.669 0.580

Target = D3
Default 0.668 0.685 0.642 0.711
Tuned 0.674 0.679 0.690 0.662

Figure 9: Random Forest regression with an increasing number of trees n. The plot shows that, in the range 20 to 500, the model does not overfit the training data.

C Code

C.1 Variable selection scripts

See the separate file variable_selection.py.

C.2 Linear-based model scripts

See the separate file linear_models.py.

C.3 Genetic algorithm script

See the separate file genetic_algorithm.py.
