
Assessing Synthetic Accessibility of Chemical

Compounds Using Machine Learning Methods

Yevgeniy Podolyan,† Michael A. Walters,‡ and George Karypis∗,†

Department of Computer Science and Computer Engineering, University of Minnesota,

Minneapolis, MN 55455, and Institute for Therapeutics Discovery and Development, Department

of Medicinal Chemistry, University of Minnesota, Minneapolis, MN 55455

Abstract

With de novo rational drug design, scientists can rapidly generate a very large number of potentially biologically active probes. However, many of them may be synthetically infeasible and, therefore, of limited value to drug developers. On the other hand, most of the tools for synthetic accessibility evaluation are very slow and can process only a few molecules per minute. In this study, we present two approaches to quickly predict the synthetic accessibility of chemical compounds by utilizing support vector machines operating on molecular descriptors. The first approach, RSsvm, is designed to identify the compounds that can be synthesized using a specific set of reactions and starting materials and builds its model by training on the compounds identified as synthetically accessible or not by retrosynthetic analysis. The second approach, DRsvm, is designed to provide a more general assessment of synthetic accessibility that is not tied to any set of reactions or starting materials. The training set compounds for this approach are selected from a diverse library based on the number of other similar compounds within the same library. Both approaches have been shown to perform very well in their corresponding areas of applicability, with the RSsvm achieving a receiver operating characteristic score of 0.952 in cross-validation experiments and the DRsvm achieving a score of 0.888 on an independent set of compounds. Our implementations can successfully process thousands of compounds per minute.

†Computer Science and Computer Engineering. ‡Institute for Therapeutics Discovery and Development.

1 Introduction

Many modern tools available today for virtual library design or de novo rational drug design allow the generation of a very large number of diverse potential ligands for various targets. One of


the problems associated with de novo ligand design is that these methods can generate a large number of structures that are hard to synthesize. There are two broad approaches to address this problem. The first approach is to limit the types of molecules a program can generate. For example, restriction of the number of fragment types and other parameters in the CONCERTS1 program allows it to control the types of molecules being generated, which results in a greater percentage of the molecules being synthetically reasonable. SPROUT2 applies a complexity filter3 at each stage of candidate generation by making sure that all fragments in the molecule are frequent in the starting material libraries. An implicit inclusion of synthesizability constraints is implemented in programs such as TOPAS4 and SYNOPSIS.5 TOPAS uses only the fragments obtained from known drugs using a reaction-based fragmentation procedure similar to RECAP.6 Instead of using small fragments, SYNOPSIS uses the whole molecules in the available molecules library as fragments. Both programs then use a limited set of chemical reactions to combine the fragments into a single structure, thus increasing the likelihood that the produced molecule is synthetically accessible. However, a potential limitation of using reaction-based fragmentation procedures is that they can drastically limit the diversity of the generated molecules. Consequently, such approaches may fail to design molecules for protein targets that are significantly different from those with known ligands.

The second approach is to generate a very large number of possible molecules and only then apply a synthetic accessibility filter to eliminate hard-to-synthesize compounds. A number of approaches have been developed to identify such structures that are based on structural complexity,7–12 similarity to available starting materials,3 retrosynthetic feasibility,13–15 or a combination of those.16

Complexity-based approaches use empirically derived formulas to estimate molecular complexity. Various methods have been developed to compute structural complexity. One of the earliest approaches, suggested by Bertz,7 uses a mathematical approach to calculate chemical complexity. Whitlock9 proposed a simple metric based on the counts of rings, nonaromatic unsaturations, heteroatoms, and chiral centers. The molecular complexity index by Barone and Chanon10 considers only the connectivity of the atoms in a molecule and the size of the rings. In an attempt to overcome the high correlation of the complexity scores with the molecular weight, Allu and Oprea12 devised a more thorough scheme that is based on relative atomic electronegativities and bond parameters, which takes into account atomic hybridization states and various structural features such as the number of chiral centers, geminal substitutions, the number of rings and the types of fused rings, etc. All methods computing molecular complexity scores are very fast because they do not need to perform any comparisons to other structures. However, this is also a deficiency of these methods, because complexity alone is only loosely correlated with synthetic accessibility, as many starting materials with high complexity scores are readily available.

One of the newer approaches based on molecular complexity3 attempts to overcome the problems associated with regular complexity-based methods by including the statistical distribution of various cyclic and acyclic topologies and atom substitution patterns in existing drugs or commercially available starting materials. The method is based on the assumption that if a molecule contains only structural motifs that occur frequently in commercially available starting materials, then it is probably easy to synthesize.

The use of sophisticated retrosynthetic analysis is another approach to assess synthetic feasibility. A number of methods such as LHASA,13 WODCA,15 and CAESA14 utilize this approach. These methods attempt to find the starting materials for the synthesis by iteratively breaking the


bonds that are easy to create using high-yielding reactions. Unlike the methods described above, which require explicit reaction transformation rules, the Route Designer17 automatically generates these rules from reaction databases. However, these reaction databases are not always annotated with yields and/or possibly difficult reaction conditions, which would limit their use. In addition, almost all of the programs based on retrosynthetic analysis are highly interactive and relatively slow, often taking up to several minutes per structure. This renders these methods unusable for the fast evaluation of the synthetic accessibility of tens to hundreds of thousands of molecules that can be generated by de novo ligand generation programs.

One of the most recent methods, by Boda et al.,16 attempts to evaluate synthetic accessibility by combining molecular graph complexity, ring complexity, stereochemical complexity, similarity to available starting materials, and an assessment of strategic bonds where a structure can be decomposed to obtain simpler fragments. The resulting score is the weighted sum of the components, where the weights are determined using regression so that it maximizes the correlation with medicinal chemists' predictions. The method was able to achieve a correlation coefficient of 0.893 with the average scores produced by five medicinal chemists. However, this method was only shown to work for the 100 compounds used in the regression analysis to select the weights for the individual components. In addition, this method uses a complex retrosynthetic reaction fitness analysis, which was shown to have a very low correlation with the total score. The method is able to process only 200–300 molecules per minute.

In this paper we present methods for determining the relative synthetic accessibility of a large set of compounds. These methods formulate the synthetic accessibility prediction problem as a supervised learning problem in which the compounds that are considered to be synthetically accessible form the positive class and those that are not form the negative class. This approach is based on the hypothesis that compounds containing fragments that are present in a sufficiently large number of synthetically accessible molecules are themselves synthetically accessible. The actual supervised learning model is built using support vector machines (SVM) and utilizes a fragment-based representation of the compounds that allows it to precisely capture the compounds' molecular fragments and their frequency. Using this framework, we developed two complementary machine learning models for synthetic accessibility prediction that differ in the approach that they employ for determining the label of the compounds during training.

The first model, referred to as RSsvm, uses as positive instances a set of compounds that were retrosynthetically determined to be synthesizable in a small number of steps and as negative instances a set of compounds that required a large number of steps. The resulting machine learning model can then be used to substantially reduce the amount of time required by approaches based on retrosynthetic analysis by replacing retrosynthesis with the model's prediction. The advantage of this model is that, by identifying the positive and negative instances using a retrosynthetic approach, it is trained on a set of compounds whose labels are reliable. However, assuming a sufficiently large library of starting materials, the resulting model will be dependent on the set of reactions used in the retrosynthetic analysis, and as such it may not be able to generalize to compounds that can be synthesized using additional reactions. Consequently, this approach is well suited for cases in which the compounds need to be synthesized with a predetermined set of reactions, as is often the case in the pharmaceutical industry where only parallel synthesis reactions are used.

The second model, referred to as DRsvm, is based on the assumption that compounds belonging to dense regions of the chemical space were probably synthesized by a related route through combinatorial or conventional synthesis and, as such, are relatively easy to synthesize. Based


on that, this approach uses as positive instances a set of compounds that are similar to a large number of other compounds in a diverse library and as negative instances a set of compounds that are similar to only a few compounds (if any). The advantage of this approach is that, assuming the library from which they were obtained was sufficiently diverse, the resulting model is not tied to a specific set of reactions and as such it should be able to provide a more general assessment of synthetic accessibility. However, since the labels of the training instances are less reliable, the error rate of the resulting predictions can potentially be higher than those produced by RSsvm for compounds that can be synthesized by the reactions used to train RSsvm's models.

We experimentally evaluated the performance of these models on datasets derived from the Molecular Libraries Small Molecule Repository (MLSMR) and a set of compounds provided by Eli Lilly and Company. Specifically, cross-validation results on MLSMR using a set of high-throughput synthesis reactions show that RSsvm leads to models that achieve an ROC score of 0.952, and on the Eli Lilly dataset, both RSsvm and DRsvm correctly position only synthetically accessible compounds among the 100 highest ranked compounds.

2 Methods

2.1 Chemical Compound Descriptors

The compounds are represented as a frequency vector of the molecular fragments that they contain. The molecular fragments correspond to the GF descriptors,18 which are the complete set of unique subgraphs between a minimum and maximum size that are present in each compound. The frequency corresponds to the number of distinct embeddings of each subgraph in a compound's molecular graph. Details on how these descriptors were generated are provided in Section 3.3.

The GF descriptor-based representation of chemical compounds is well-suited for synthetic accessibility purposes for several reasons. (i) Molecular complexity (e.g., the number and types of rings, bond types, etc.) is easily encoded with such fragments. (ii) The fragments can implicitly encode chemical reactions through the new sets of bonds they create. (iii) The similarity to the starting materials can also be determined by the share of the common fragments present in the compound of interest and the compounds in the starting materials. (iv) Unlike descriptors based on physico-chemical properties,19,20 they capture elements of the compound's molecular graph (e.g., bonding patterns, rings, etc.), which are more directly relevant to the synthesizability of a compound. (v) Unlike molecular descriptors based on hashed fingerprints (e.g., Daylight or extended connectivity fingerprints21), they precisely capture the molecular fragments present in each compound without the many-to-one mapping errors introduced as a result of hashing. (vi) Unlike methods utilizing a very limited set of predefined structural features (e.g., MACCS keys by Molecular Design Ltd.), the GF descriptors are complete and, in conjunction with the use of supervised learning methods, can better identify the set of structural features that are important for predicting the synthetic accessibility of compounds.
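To make the fragment-frequency representation concrete, the sketch below builds such a vector for a toy molecular graph. It is a simplified stand-in for GF descriptor generation, not the authors' implementation: fragments here are canonicalized only by their sorted atom-label multiset and edge count, whereas real GF descriptors enumerate unique labeled subgraphs with distinct embeddings.

```python
from itertools import combinations

def connected_subgraphs(adj, min_size, max_size):
    """Enumerate connected node subsets of a molecular graph.
    `adj` maps node index -> set of neighbour indices."""
    found = set()
    for start in adj:
        frontier = [frozenset([start])]
        while frontier:
            sub = frontier.pop()
            if len(sub) >= min_size:
                found.add(sub)
            if len(sub) == max_size:
                continue
            for v in sub:
                for w in adj[v]:
                    if w not in sub:
                        frontier.append(sub | {w})
    return found

def fragment_vector(adj, labels, min_size=1, max_size=3):
    """Count fragments; canonical key = (sorted atom labels, edge count).
    This coarse key is an illustrative simplification."""
    counts = {}
    for sub in connected_subgraphs(adj, min_size, max_size):
        edges = sum(1 for a, b in combinations(sub, 2) if b in adj[a])
        key = (tuple(sorted(labels[v] for v in sub)), edges)
        counts[key] = counts.get(key, 0) + 1
    return counts

# Ethanol skeleton: C-C-O
adj = {0: {1}, 1: {0, 2}, 2: {1}}
labels = {0: "C", 1: "C", 2: "O"}
vec = fragment_vector(adj, labels)
```

For the C-C-O example, the vector records two single-carbon fragments, one C-C bond fragment, one C-O bond fragment, and one three-atom fragment, mirroring the "number of distinct embeddings" idea.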

2.2 Model Learning

We used support vector machines (SVM)22 to build the discriminative model that explicitly differentiates between easy- and hard-to-synthesize compounds. Once built, such a model can be


applied to any compound to assess its synthetic accessibility. The SVM machine learning method is a linear classifier of the form

f(x) = w^T x + b,  (1)

where w^T x is the dot product of the weight vector w and the input vector x, which represents the object being classified. However, by applying a kernel trick,23,24 the SVM can be turned into a non-linear classifier. The kernel trick is applied by simply replacing the dot product in the above equation with a kernel function that computes the similarity of two vectors. The SVM classifier can then be rewritten as

f(x) = \sum_{x_i \in S^+} \lambda_i^+ K(x, x_i) - \sum_{x_i \in S^-} \lambda_i^- K(x, x_i),  (2)

where S^+ and S^- are the sets of positive and negative class vectors, respectively, chosen as support vectors during the training stage; \lambda_i^+ and \lambda_i^- are the non-negative weights computed during the training; and K is the kernel function that computes the similarity between two vectors. The resulting value f(x) can be used to rank the testing set instances, while the sign of the value can be used to classify the vectors into the positive or negative class.
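A minimal sketch of the kernelized decision function of Eq. (2). The support vectors, weights, and kernel below are placeholders for what SVM training would actually produce, and the bias term is omitted as in Eq. (2).

```python
def svm_decision(x, pos_sv, neg_sv, pos_w, neg_w, kernel):
    """f(x) per Eq. (2): weighted kernel similarity to positive support
    vectors minus weighted similarity to negative support vectors."""
    pos = sum(w * kernel(x, s) for w, s in zip(pos_w, pos_sv))
    neg = sum(w * kernel(x, s) for w, s in zip(neg_w, neg_sv))
    return pos - neg

def dot(a, b):
    """Linear kernel; with it, Eq. (2) reduces to the form of Eq. (1)."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Toy support vectors and weights (illustrative, not a trained model).
f = svm_decision([1.0, 2.0], [[1.0, 0.0]], [[0.0, 1.0]], [0.5], [0.5], dot)
```

A positive f would rank the compound as more likely synthetically accessible; the magnitude provides the ranking score described in the text.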

In this study, we used the Tanimoto coefficient25 as the kernel function26 as well as for computations of the similarity to the compounds in the starting materials library. The Tanimoto coefficient has the following form:

S_{A,B} = \frac{\sum_{i=1}^n x_{iA} x_{iB}}{\sum_{i=1}^n x_{iA}^2 + \sum_{i=1}^n x_{iB}^2 - \sum_{i=1}^n x_{iA} x_{iB}},  (3)

where A and B are the vectors being compared, n is the total number of features, and x_{iA} and x_{iB} are the numbers of times the i-th feature occurs in vectors A and B, respectively. The Tanimoto coefficient is the most widely used metric to compute similarity between molecules encoded as vectors of various kinds of descriptors. It was recently shown27 to be the best performing coefficient for the widest range of query and database partition sizes in retrieving complementary sets of compounds from the MDL Drug Data Report database.

The Tanimoto coefficient for any two vectors is a value between 0 and 1 indicating how similar the compounds represented by those vectors are. Some examples of molecular pairs with different computed similarity scores (based on the GF descriptors) are presented in Figure 1.
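Eq. (3) translates directly into code for frequency vectors; a short sketch:

```python
def tanimoto(a, b):
    """Tanimoto coefficient of Eq. (3) for two equal-length
    fragment-frequency vectors."""
    ab = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    return ab / (aa + bb - ab)

# Identical vectors score 1.0; vectors with no shared features score 0.0.
s = tanimoto([1, 2, 0], [1, 1, 1])  # 3 / (5 + 3 - 3) = 0.6
```

On binary (presence/absence) vectors this reduces to the familiar intersection-over-union form of the Tanimoto/Jaccard coefficient.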

2.3 Identifying the Training Compounds

In order to train a machine learning model, one needs to supply a training set consisting of positive and negative examples. In the context of synthetic accessibility prediction, the positive examples would represent the compounds that are relatively easy to synthesize and the negative examples would represent the compounds that are relatively hard to synthesize. The problem here lies in the fact that there are no catalogs or databases that list compounds labeled as easy or hard to synthesize. The problem is complicated even further by the fact that even though the easy-to-synthesize molecules can be proven to be easy, the opposite is not true: one cannot prove with absolute certainty that the molecules classified as hard to synthesize are not actually easy to synthesize by some little-known or newly discovered method.

One way to create a training set of molecules would be to ask synthetic chemists to give a score to each molecule that would represent their opinion of the difficulty associated with the synthesis of that molecule. One can then split the set into two parts, one with lower scores and another with higher scores. However, such an approach cannot lead to very large training sets, as there is a limit on the number of compounds that can be manually labeled. Moreover, the resulting scores are not objective, as chemists often disagree with each other.16

Figure 1: Examples of molecular pairs with computed similarity scores (0.2, 0.4, 0.6, and 0.8).

In this paper, we developed two different approaches, described in the subsequent sections, that take as input a large diverse compound library and automatically identify the positive and negative compounds for training SVM models for synthetic accessibility prediction.

2.3.1 Retrosynthetic Approach (RSsvm Model)

This approach identifies the subsets of compounds that are easy or hard to synthesize by using a retrosynthetic decomposition approach28,29 to determine the number of steps required to decompose each compound into easily available compounds (e.g., commercially available starting materials). Those compounds that can be decomposed in a small number of steps are considered to be easy to synthesize, whereas those compounds that either required a large number of steps or could not be decomposed are considered to be hard to synthesize.

This approach requires three inputs: (i) the compounds that need to be decomposed, (ii) the reactions whose reverse application will split the compounds, and (iii) the starting materials library. The decomposition is performed in the following way. First, each reaction is applied to a given compound to check whether it is applicable to this particular compound. All applicable reactions, i.e., reactions that break bond(s) present in a given molecule, are applied to produce two or more smaller compounds. Each of the resulting compounds is checked for an exact match in


the starting materials library. Those compounds not found among the starting materials are further broken down into smaller compounds. The process is done in a breadth-first search fashion to find the shortest path to the starting materials. The search is halted when all parts of the compound along one of the reaction paths in this search tree are found in the starting materials library or when no complete decomposition is found after performing a certain number of decomposition rounds.

A schematic representation of the decomposition process is shown in Figure 2, where the squares represent compounds (dashed squares represent compounds found in the starting materials library) and circles represent reactions (dashed circles represent reactions that are not applicable to the chosen compound). One can see that the height of the decomposition tree for compound A is two since it can be decomposed with two rounds of reactions. Using reaction R4, compound A can be broken down into compounds D and E. Since D is already in the starting materials library, it does not need to be broken down any further. The compound E can be broken down into H and C, both of which are in the starting materials library, with another reaction R4. The retrosynthetic analysis stopped after two rounds because the path A → R4 → (D)E → R4 → (H)(C) terminates with all compounds found in the starting materials library.

In this paper, we considered all compounds that have decomposition search trees with one to three rounds of reactions (i.e., have a height of one to three measured in terms of reactions) to be easy to synthesize and compounds that cannot be synthesized in at least five rounds of reactions to be hard to synthesize. Note that this method assumes that the difficulty associated with performing these reactions is the same. However, the overall approach can be easily extended to account for reactions with different levels of difficulty by using variable depth cut-offs based on the reactions used. These easy- and hard-to-synthesize compounds are used as the sets of positive and negative instances for training the RSsvm models and will be denoted by RS+ and RS−, respectively. The molecules with a decomposition search tree height of four or five were not used in training or testing the RSsvm models.
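The breadth-first decomposition can be sketched as follows. The compound representation, the reaction encoding, and the single-counter depth are illustrative simplifications, not the authors' implementation; the sketch counts individual reaction applications, which coincides with tree height for linear decompositions like the Figure 2 example.

```python
from collections import deque

def decomposition_depth(compound, reactions, starting_materials, max_rounds=5):
    """Breadth-first retrosynthetic decomposition (sketch of Section 2.3.1).
    Each reaction is a function mapping a fragment to a list of alternative
    product tuples (empty list when not applicable). Returns the number of
    reaction applications needed to reduce everything to starting materials,
    or None if no complete decomposition is found within max_rounds."""
    pending0 = frozenset([compound]) - starting_materials
    if not pending0:
        return 0  # already a starting material
    queue = deque([(pending0, 0)])
    seen = {pending0}
    while queue:
        pending, depth = queue.popleft()
        if depth == max_rounds:
            continue  # halt this branch after the round limit
        for frag in pending:
            for rxn in reactions:
                for products in rxn(frag):
                    nxt = frozenset(((pending - {frag}) | set(products))
                                    - starting_materials)
                    if not nxt:
                        return depth + 1  # everything found among SM
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, depth + 1))
    return None

def r4(frag):
    # Hypothetical reversed reaction mirroring Figure 2: A -> D + E, E -> H + C.
    table = {"A": [("D", "E")], "E": [("H", "C")]}
    return table.get(frag, [])

sm = {"D", "H", "C"}
depth_a = decomposition_depth("A", [r4], sm)  # 2, as in the Figure 2 example
```

Under the paper's thresholds, a returned value of 1–3 would label the compound RS+, while None (no path within five rounds) would label it RS−.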

The advantage of this approach is that it provides an objective way of determining the synthetic accessibility of a compound based on the input reactions and the library of starting materials. Moreover, even though retrosynthetic analysis is computationally expensive for a large number of reactions, the RSsvm model requires this step to be performed only once, while generating its training set, and subsequent applications of the RSsvm model do not require retrosynthetic analysis. Thus, the RSsvm model can be used to dramatically reduce the computational cost of retrosynthetic analysis by using it to prioritize the compounds that will be subjected to this type of analysis.

2.3.2 Dense Regions Approach (DRsvm Model)

A potential limitation of the retrosynthetic approach is that the resulting RSsvm model will be biased towards the set of reactions used to identify the positive and negative compounds. As a result, it may fail to properly assess the synthetic accessibility of compounds that can be decomposed using a different set of reactions. To address this problem, we developed an alternate method for selecting the sets of positive and negative training instances from the input library. The assumption underlying this approach is that if a compound has a large number of other compounds in the library that are significantly similar to it, then this compound and many of its neighbors were probably synthesized by a related route through combinatorial or conventional synthesis and are, therefore, synthetically accessible.


Figure 2: Example of compound decomposition. The squares represent compounds (dashed squares represent compounds found in the starting materials library). The circles represent reactions (dashed circles represent reactions that are not applicable to the chosen compound). The height of the decomposition tree for compound A is 2 because it can be decomposed with a minimum of 2 rounds of reactions.

In this approach, the positive training set, denoted by DR+, is generated by selecting from the input library the compounds with at least κ+ other compounds that have a similarity of at least θ. These compounds represent the dense areas of the library. Conversely, the negative training set, denoted by DR−, is generated by selecting those library compounds that have no more than κ− other compounds with a similarity of at least θ. Note that κ− ≪ κ+ and that the similarity between compounds is determined using the Tanimoto coefficient of their GF descriptor representation. The DR+ and DR− sets are used for building the DRsvm model for synthetic accessibility prediction. In our study, we used κ+ = 20, κ− = 1, and θ = 0.6, as we found these values to produce a reasonable number of positive and negative training compounds.
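A direct, quadratic-time sketch of the DR+/DR− selection rule. Parameter names mirror κ+, κ−, and θ; the toy "compounds" and similarity function in the example are placeholders, and a production run over a ~200k-compound library would need a pruned neighbor search rather than all-pairs comparison.

```python
def select_training_sets(library, similarity,
                         kappa_pos=20, kappa_neg=1, theta=0.6):
    """Split a library into dense-region positives (DR+) and
    sparse-region negatives (DR-) by counting near neighbors."""
    dr_pos, dr_neg = [], []
    for i, c in enumerate(library):
        neighbours = sum(1 for j, d in enumerate(library)
                         if j != i and similarity(c, d) >= theta)
        if neighbours >= kappa_pos:
            dr_pos.append(c)   # dense region: assumed accessible
        elif neighbours <= kappa_neg:
            dr_neg.append(c)   # sparse region: assumed hard
    return dr_pos, dr_neg

# Toy example with numeric stand-ins for compounds and a toy similarity.
lib = [0.0, 0.05, 0.1, 5.0]
sim = lambda a, b: max(0.0, 1.0 - abs(a - b))
pos, neg = select_training_sets(lib, sim, kappa_pos=2, kappa_neg=0, theta=0.6)
```

Compounds with a neighbor count strictly between κ− and κ+ are left out of both sets, matching the text's use of two separate thresholds.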

2.3.3 Baseline Approaches

In order to establish a basis for comparison with the above models, we also developed two simple approaches for synthetic accessibility prediction. The first, denoted by SMNPsvm, is an SVM-based model that is trained on the starting materials as the positive class and natural products as the negative class. This approach is motivated by the facts that (i) most commercially available materials are easily synthesized and (ii) natural products are considered to be hard to synthesize.30

As a result, an SVM-based model that uses starting materials as the positive class and natural products as the negative class may be used to prioritize a set of compounds with respect to their synthetic accessibility. The second, denoted by maxSMsim, is a scheme that ranks the compounds whose synthetic accessibility needs to be determined based on their highest similarity to a library of starting materials. Specifically, for each unknown compound, its Tanimoto similarity to each of the compounds in a starting materials library is computed, and the maximum similarity value is used as its synthetic accessibility score. The final ranking of the compounds is then obtained by sorting them in descending order based on those scores. The use of the maximum similarity to


starting materials is motivated by the empirical observation that the molecules in the positive set have, on average, higher maximal similarity to starting materials than the molecules in the negative set (Table 3).
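The maxSMsim baseline amounts to a max-similarity scoring pass followed by a sort; a sketch, where the numeric "compounds" and similarity function are placeholders for GF descriptor vectors and the Tanimoto coefficient:

```python
def max_sm_sim_ranking(candidates, starting_materials, similarity):
    """Score each candidate by its maximum similarity to any starting
    material, then rank in descending order of that score."""
    scored = [(max(similarity(c, s) for s in starting_materials), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy usage: candidates closest to some starting material rank first.
sm = [0.0, 1.0]
sim = lambda a, b: max(0.0, 1.0 - abs(a - b))
ranked = max_sm_sim_ranking([0.9, 3.0, 0.2], sm, sim)
```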

3 Materials

3.1 Data Sets

In this study we used four sets of compounds. (i) A library of starting materials (SM) that is needed for the retrosynthetic analysis. This set was compiled from the Sigma-Aldrich Selected Structure Sets for Drug Discovery, Acros, and Maybridge databases. After stripping the salts and removing the duplicates, the starting materials library contained 97,781 compounds, including multiple tautomeric forms for some compounds. (ii) A library of natural products (NP) obtained from InterBioScreen Ltd, which is used to train the SVM-based baseline model (Section 2.3.3). This library contains 41,746 natural compounds, their derivatives, and analogs that are offered for high-throughput screening programs. (iii) The compounds in the Molecular Libraries Small Molecule Repository (MLSMR), which is available from PubChem.31 This library was used as the diverse library of compounds for training the RSsvm and DRsvm models and evaluating the performance of the RSsvm models. MLSMR is designed to collect, store, and distribute compounds, submitted by the research community, for high-throughput screening (HTS) of biological assays. Since a large portion of the deposited compounds were synthesized via parallel synthesis, one can assume that many of them should be relatively straightforward to synthesize. The MLSMR library that we used contained 224,278 compounds. (iv) A set of 1,048 molecules obtained from Eli Lilly and Company that is used for evaluating the performance of the RSsvm and DRsvm models that were trained on the MLSMR library, thus providing an external way of evaluating the performance of these methods. These compounds were considered to be hard to synthesize by Eli Lilly chemists, and their synthesis was outsourced to an independent company specializing in chemical synthesis. The set contains 830 compounds that were eventually synthesized and 218 compounds for which no feasible synthetic path was found by chemists. We will denote the set of synthesized compounds as EL+ and the set of compounds that were never synthesized as EL−.

3.2 Reactions

For the retrosynthetic analysis, we used a set of 23 reactions that are widely employed in high-throughput synthesis. The set included parallel synthesis reactions such as reductive amination; amine alkylation; nucleophilic aromatic substitution; amide, sulfonamide, urea, ether, carbamate, and ester formation; amido alkylation; thiol alkylation; and Suzuki and Buchwald couplings. Note that this set can be considerably expanded to include various bench-top reactions and as such lead to potentially more general RSsvm models for synthetic accessibility prediction. However, because we are interested in assessing the potential of the RSsvm models in the context of high-throughput synthesis, we only used parallel synthesis reactions.

Table 1 shows some statistics of the retrosynthetic decomposition of the MLSMR, natural products, and Eli Lilly datasets based on this set of reactions.

Table 1: Characteristics of the retrosynthetic decomposition of different datasets.

                   MLSMR             Natural Products   Eli Lilly EL+    Eli Lilly EL−
                   Count     %       Count     %        Count    %       Count    %
Found in SM        7,890     3.5     459       1.1      0        0.0     0        0.0
Depth 1            26,074    11.6    1,751     4.2      56       6.7     0        0.0
Depth 2            20,025    8.9     1,387     3.3      28       3.4     0        0.0
Depth 3            2,522     1.1     155       0.4      4        0.5     0        0.0
Depth 4            188       0.1     20        <0.1     0        0.0     0        0.0
Depth 5            5         <0.1    11        <0.1     0        0.0     0        0.0
Depth >5†          167,574   74.7    37,963    90.9     742      89.4    218      100.0
Total              224,278   100.0   41,746    100.0    830      100.0   218      100.0
† Compounds with no identifiable path with depth up to 5.

The compounds in these datasets that were found in the starting materials library (first line in the table) were not used in this study. These results show that a larger fraction of the MLSMR library (21.6%) can be synthesized in three or fewer steps (easy to synthesize) than the corresponding fractions of the NP library (7.9%) and the Eli Lilly dataset (10.6%). These results indicate that the compounds in the Eli Lilly dataset are considerably harder to synthesize than those in MLSMR. Moreover, since the retrosynthetic decomposition depth of all the compounds in EL− is greater than five, these results provide an independent confirmation of the difficulty of synthesizing the EL− compounds. Note that one reason for the relatively small fraction of the MLSMR and Eli Lilly compounds that are determined to be synthetically accessible is the use of a small number of parallel synthesis reactions; the fraction would be higher if additional reactions were used.
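The depth-limited search that produces these statistics can be sketched as follows. This is a minimal Python illustration, not the paper's OEChem-based implementation; `reverse_steps` (which applies the reverse-encoded reactions to a product and yields candidate precursor sets) and `is_sm` (membership in the starting materials library) are hypothetical stand-ins, shown here with a toy reaction table.

```python
def tree_height(mol, reverse_steps, is_sm, limit=5):
    """Smallest decomposition-tree height that reduces `mol` to starting
    materials, capped at limit + 1 (meaning: no path found within limit)."""
    if is_sm(mol):
        return 0
    if limit == 0:
        return limit + 1  # any further step would exceed the remaining budget
    best = limit + 1
    for precursors in reverse_steps(mol):
        # height = one step plus the tallest branch among the precursors
        h = 1 + max(tree_height(p, reverse_steps, is_sm, limit - 1)
                    for p in precursors)
        best = min(best, h)
    return min(best, limit + 1)

# Toy stand-ins: molecules are strings; the "reaction table" is hypothetical.
SM = {"amine", "acid", "halide"}
TABLE = {
    "amide": [["amine", "acid"]],              # reverse of amide formation
    "alkylated-amide": [["amide", "halide"]],  # reverse of amido alkylation
}
reverse_steps = lambda mol: TABLE.get(mol, [])
is_sm = lambda mol: mol in SM
```

A compound would then be labeled easy to synthesize when the returned height is at most three, and hard when it exceeds five, matching the cut-offs used in Table 1.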

In addition, Table 2 provides some information on the role of the different reactions in decomposing the easy to synthesize compounds in MLSMR. Specifically, for each reaction, Table 2 shows the number of compounds that used that reaction at least once and the number of compounds whose decomposition tree height increased when the reaction was excluded from the set of available reactions. These statistics indicate that not all reactions are equally important. Some of the reactions, such as reactions 4 and 11 (amide formation from amines with acids and acyl halides, respectively), are ubiquitous. Reaction 4 was used in the decomposition of 21,495 molecules, whereas reaction 11 was used in the decomposition of 12,392 molecules. On the other hand, several reactions have been used in the decomposition of fewer than 100 molecules. In addition, some of the reactions are replaceable, in the sense that their elimination will not cause significant changes in the decomposition trees of the compounds. For instance, the elimination of reaction 11 increased the synthesis path for only 68 of the 12,392 molecules that used that reaction in their shortest decomposition paths. On the other hand, the elimination of reaction 17 (thiol alkylation with alkyl halides) caused 96% of the 3,467 molecules that used that reaction to become hard to synthesize (i.e., the depth of their decomposition tree became greater than 5). Similarly, reaction 19 (the synthesis of imidazo-fused ring systems from aminoheterocycles and α-haloketones), while used by only 161 molecules, was absolutely essential to their decomposition, as decomposition paths of height up to five could not be discovered for any of the 161 molecules when the reaction was omitted.

Table 2: The use of each reaction in the decomposition of 48,781 easy to synthesize MLSMR molecules and the effect of each reaction's exclusion on the height of the shortest decomposition path.

Reaction  Molecules using  Changed to depth     Description‡
          reaction†        4 / 5 / >5
1         3989             5 / 0 / 1351         Reductive amination of aldehydes and ketones
2         4721             9 / 0 / 2855         Amine alkylation with alkyl halides
3         1055             24 / 28 / 844        NAS of haloheterocycles with amines
4         21495            125 / 179 / 17872    Amide formation with amines and acids
5         56               0 / 0 / 41           Urea formation with isocyanates and amines
6         844              0 / 0 / 729          Suzuki coupling
7         2620             0 / 0 / 318          Phenol alkylation with alkyl halides
8         644              0 / 0 / 451          Ether formation from phenols and secondary alcohols
9         7224             1 / 0 / 7169         Sulfonamide formation with sulfonyl chlorides and amines
10        2544             0 / 0 / 478          N,N'-Disuccinimidyl carbonate mediated urea formation
11        12392            0 / 0 / 68           Amide formation with amines and acyl halides
12        3622             45 / 0 / 827         Amido alkylation with alkyl halides
13        68               0 / 0 / 0            Urea formation with ethoxycarbonyl-carbamyl chlorides and amines
14        2131             0 / 0 / 4            Ester formation with acyl halides and alcohols
15        3411             0 / 0 / 3093         Ester formation with acids and alcohols
16        400              0 / 0 / 368          NAS of haloheterocycles with alcohols and phenols
17        3467             0 / 0 / 3345         Thiol alkylation with alkyl halides
18        236              0 / 0 / 235          Benzimidazole formation with phenyldiamines and aldehydes
19        161              0 / 0 / 161          Synthesis of imidazo-fused ring systems from aminoheterocycles and α-haloketones
20        429              0 / 0 / 302          Carbamate synthesis with primary alcohols and secondary amines
21        1061             12 / 0 / 108         Reductive amination of carbonyl compounds followed by acylation with carboxylic acids
22        40               0 / 0 / 0            NAS of di-haloheterocycles with amines
23        332              0 / 0 / 0            Ether formation from phenolic aldehydes followed by reductive amination with amines
† The number of molecules whose shortest decomposition search tree contains a given reaction.
‡ NAS = nucleophilic aromatic substitution.

3.3 Descriptors

The GF descriptors of the compounds were generated with the AFGen 2.0 program.32,33 AFGen treats each molecule as a graph with atoms as vertices and bonds as edges; the atom types serve as vertex labels, while the bond types serve as edge labels. We used fragments of length from four to six bonds. The fragments contained cycles and branch points in addition to linear paths. Since most chemical reactions create only one or a few new bonds, this fragment length is sufficient to implicitly encode the reactions that may have been used in the compound's synthesis. Our experiments with the maximum length of the fragments used as GF descriptors have shown that increasing it beyond six does not improve the classification accuracy but significantly increases the computational requirements of the methods.
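As a rough illustration of this kind of descriptor generation (not AFGen itself, which also enumerates fragments containing cycles and branch points), the following sketch enumerates only the linear path fragments in a given bond-length range from a labeled molecular graph; the single-character atom and bond labels are a simplification.

```python
def path_fragments(atoms, bonds, min_len=4, max_len=6):
    """Enumerate direction-independent linear path fragments with
    min_len..max_len bonds.  `atoms` maps atom index -> atom-type label;
    `bonds` maps frozenset({i, j}) -> bond-type label."""
    adj = {i: [] for i in atoms}
    for b in bonds:
        i, j = tuple(b)
        adj[i].append(j)
        adj[j].append(i)
    frags = set()

    def walk(path):
        n_bonds = len(path) - 1
        if n_bonds >= min_len:
            # interleave atom and bond labels along the path
            labels = []
            for a, b in zip(path, path[1:]):
                labels += [atoms[a], bonds[frozenset((a, b))]]
            labels.append(atoms[path[-1]])
            labels = tuple(labels)
            frags.add(min(labels, labels[::-1]))  # canonical direction
        if n_bonds < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in atoms:
        walk([start])
    return frags

# Toy graph for ethanol's heavy atoms: C-C-O with single bonds.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {frozenset((0, 1)): "-", frozenset((1, 2)): "-"}
```

With `min_len=4` and `max_len=6` this matches the four-to-six-bond range used above; the toy graph below is too small for that range, so a shorter range is used when exercising it.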

In order to compare the performance of the SVM models based on the GF descriptors to those using fingerprints, we generated the latter with ChemAxon's GenerateMD.34 The produced fingerprints are similar to the Daylight fingerprints. The 2048-bit fingerprints were produced by hashing the paths in the molecules of length up to six bonds. Note that both the GF descriptors and the hashed fingerprints contained only heavy atoms, i.e., all atoms except hydrogen.

3.4 Implementation Details

The retrosynthetic decomposition was implemented in C++ using OpenEye's OEChem software development kit.35 The kit, which is a library of classes and functions, allows, among other things, applying a given reaction (encoded as a Daylight SMIRKS pattern describing a molecular transformation) to a set of reactants and obtaining the products if the reaction is applicable. If the same reaction can be applied in multiple ways, then multiple sets of products are obtained. The reactions were encoded in reverse, i.e., in the direction from products to reactants. The SVMlight36 implementation of support vector machines was used to perform the learning and classification. All input parameters to SVMlight, with the exception of the kernel (see Section 2.2), were left at their default values. All experiments were performed on a Linux workstation with a 3.4 GHz Intel Xeon processor.

3.5 Training and Testing Framework

The MLSMR library was used to evaluate the performance of the RSsvm model in the following way. Initially, in order to avoid molecules that are very similar within the positive and negative sets as well as across the training and testing sets, a subset of MLSMR compounds was selected from the entire MLSMR set such that no two compounds had a similarity greater than 0.6 (the similarity was computed using the Tanimoto coefficient of the GF-descriptor representation). A greedy maximal independent set (MIS) algorithm was used to obtain this set of compounds. The algorithm works in the following way. Initially, a graph is built in which vertices represent compounds and edges are placed between vertices that have similarity greater than 0.6. After the graph is constructed, a vertex is picked from the graph, placed in the independent set, and then removed with all of its neighbors from the graph. The process is repeated until the graph is empty. In order to maximize the size of the independent set in general and of the positive subset (RS+) in particular, the order in which the vertices were picked from the graph was based on the label and the degree (number of neighbors) of the vertex. For that purpose, all vertices in the graph were sorted first by their label (positively labeled vertices were placed in the queue before the negatively labeled ones) and then by their degree (vertices with fewer neighbors were placed ahead of those with more neighbors). The order of vertices with the same label and degree was determined randomly. The application of this MIS algorithm, with a subsequent classification using retrosynthetic analysis, resulted in 12,113 easy to synthesize compounds (RS+) and 35,580 hard to synthesize compounds (RS−). The variations in the sizes of the sets and in the results of the performance studies due to the random ordering of vertices with the same label and degree were found to be insignificant and will not be discussed. Both the RS+ and RS− sets were randomly split into five subsets of similar size, leading to five folds, each containing a subset of positive and a subset of negative compounds. The RSsvm model was evaluated using five-fold cross-validation, i.e., the model was learned using four folds and tested on the fifth fold (repeating the process five times, once for each fold). The baseline models were also evaluated on each of the five test folds. In addition, an RSsvm model was trained on all RS+ and RS− compounds without performing the MIS selection. This model was then used to determine the synthetic accessibility of the compounds in the Eli Lilly dataset, thus providing an assessment of the RSsvm model on an independent dataset.

Table 3: Characteristics of molecules in the different positive and negative datasets.

                                     MLSMR                        Eli Lilly       Natural    Starting
                                  RS+    RS−    DR+    DR−     EL+    EL−      Products   Materials
Median number of unique features  190    268    276    207     257    346      332        163
Mean max. similarity to SM        0.74   0.65   0.69   0.63    0.63   0.52     0.70       0.88†
† Similarity to compounds in the SM library other than the compound itself.
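The greedy MIS selection described above can be sketched as follows. This is an illustrative quadratic-time version with a pluggable `similarity` function standing in for the Tanimoto coefficient over GF descriptors; the paper's actual implementation details (including the random tie-breaking) may differ.

```python
def greedy_mis(labels, similarity, threshold=0.6):
    """Greedy maximal-independent-set selection: returns indices of a
    compound subset in which no pair has similarity above the threshold.
    Positively labeled vertices are dequeued first, then lower degrees."""
    n = len(labels)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(i, j) > threshold:
                adj[i].add(j)
                adj[j].add(i)
    # sort by label (positives first), then by ascending degree
    order = sorted(range(n), key=lambda i: (not labels[i], len(adj[i])))
    removed, kept = set(), []
    for v in order:
        if v not in removed:
            kept.append(v)     # place v in the independent set
            removed.add(v)
            removed |= adj[v]  # drop all of its neighbors
    return kept

# Toy similarity: only pairs (0,1) and (1,3) exceed the 0.6 threshold.
PAIRS = {(0, 1): 0.9, (1, 3): 0.7}
sim = lambda i, j: PAIRS.get((min(i, j), max(i, j)), 0.1)
```

For example, with labels `[True, True, False, False]`, vertex 0 (positive, lower degree than vertex 1) is kept first, which eliminates its neighbor 1, and the two isolated negatives survive.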

The performance of the DRsvm-based approach was also assessed by using the method described in Section 2.3.2 to identify the DR+ and DR− compounds in MLSMR, train the DRsvm model, and then use it to predict the synthetic accessibility of the compounds in the Eli Lilly dataset. The DR+ and DR− training sets included 89,073 and 15,802 compounds, respectively, with DR+ containing 23.1% and DR− containing 9.9% of compounds identified as easy by retrosynthetic analysis. Note that we did not attempt to evaluate the performance of the DRsvm-based models on the RS+ and RS− compounds of the MLSMR dataset, because the labels of these compounds are derived from the small number of reactions used during the retrosynthetic decomposition, and as such they do not represent the broader sets of synthetically accessible compounds that the DRsvm-based approach is designed to capture.

Some characteristics of the above MLSMR and Eli Lilly sets of compounds, such as the median number of unique features (graph fragments) and the mean maximal similarity to the compounds in the starting materials library, are given in Table 3.

3.6 Performance Evaluation

We used the receiver operating characteristic (ROC) curve to measure the performance of the models. The ROC curve measures the rate at which positive class molecules are discovered compared to the rate at which negative class molecules are discovered.37 The area under the ROC curve, or simply the ROC score, indicates how good a method is. For example, in the ideal case, when all of the positive class molecules are discovered before (i.e., classified with a higher score than) the molecules of the negative class, the ROC score will be 1. In the case of a completely random selection, the ROC score will be 0.5. In addition to the ROC score, we also report two measures designed to give insight into the highest-ranked compounds. The first is the ROC50 score, a modified version of the ROC score that measures the area under the ROC curve up to the point at which 50 negatives have been encountered.38 The second is the percentage of the 100 highest-ranked test compounds that are true positive molecules, which we denote PA100 (positive accuracy in the top 100 compounds). Since in the case of a completely random selection the ROC50 score depends on the actual number of negative class test instances and PA100 depends on the ratio of positive to negative class instances, the expected random values for these measures are given in each table. We also provide the enrichment factor in the top 100 compounds, which is calculated by dividing the actual PA100 by the expected random PA100 value. Note that in the case of cross-validation experiments, the results reported are averages over the five folds.

Table 4: Cross-validation performance on the MLSMR dataset.

Method                           ROC     ROC50   PA100† (Enrichment)
RSsvm (GF descriptors)           0.952   0.276   99.0 (3.9)
RSsvm (2048-bit fingerprints)    0.948   0.208   97.0 (3.8)
SMNPsvm                          0.567   0.004   27.4 (1.1)
maxSMsim                         0.679   0.007   39.2 (1.5)
Random                           0.500   0.003   25.4 (1.0)
† The number of true positives among the 100 highest-ranked test compounds.
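All three metrics can be computed directly from a ranked list. The sketch below is a plain Python illustration that assumes no tied scores; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used for the full ROC.

```python
def roc_score(scores, y, max_negatives=None):
    """Area under the ROC curve; with max_negatives=50 this is the ROC50
    score (area up to the 50th encountered negative, normalized)."""
    ranked = sorted(zip(scores, y), key=lambda t: -t[0])
    pos = sum(y)
    tp = fp = area = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
            area += tp  # positives ranked above this negative
            if fp == max_negatives:
                break
    denom = pos * (fp if max_negatives is None else max_negatives)
    return area / denom if denom else 0.0

def pa_at_k(scores, y, k=100):
    """PA@k: percentage of true positives among the k highest-ranked."""
    ranked = sorted(zip(scores, y), key=lambda t: -t[0])[:k]
    return 100.0 * sum(label for _, label in ranked) / len(ranked)
```

The enrichment factor reported in the tables is then the actual PA100 divided by the expected random PA100 (100 times the positive fraction of the test set).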

4 Results

4.1 Cross-Validation Performance on the MLSMR Dataset

Table 4 shows the five-fold cross-validation performance achieved by the RSsvm models on the MLSMR dataset. Specifically, results are presented for two RSsvm models that differ in the descriptors used to represent the compounds (GF descriptors and 2048-bit fingerprints). In addition, the table shows the performance achieved by the two baseline approaches, SMNPsvm and maxSMsim (Section 2.3.3), and by an approach that simply produces a random ordering of the test compounds. To ensure that the performance numbers of all schemes are directly comparable, the numbers reported for the baseline and random approaches were obtained by first evaluating these methods on each of the test folds used by the RSsvm models and then averaging their performance over the five folds.

These results show that the two RSsvm models are quite effective in predicting the synthetic accessibility of the test compounds and that their performance is substantially better than that achieved by the two baseline approaches across all three performance assessment metrics. The performance difference between the various schemes is also apparent from their ROC curves, shown in Figure 3.

[ROC curve plot: x-axis "Negative rate", y-axis "Positive rate"; curves for RSsvm (GF descriptors), RSsvm (fingerprints), SMNPsvm, and maxSMsim]

Figure 3: ROC curves for ranking easy to synthesize MLSMR compounds (RS+) and hard to synthesize MLSMR compounds (RS−) using different methods/models.

These plots show that both of the RSsvm models are able to rank over 85% of the RS+ compounds at or above the level of the top 10% of ranked RS− compounds, which is substantially better than the corresponding performance of the baseline approaches. Comparing the two RSsvm models, the results show that the GF descriptors outperform the 2048-bit fingerprints. The performance difference between the two schemes is more pronounced in terms of the ROC50 metric (0.276 vs 0.208) and somewhat smaller for the other two metrics.

Comparing the performance of the two baseline approaches, we see that they perform quite differently. By using starting materials as the positive class and natural products as the negative class, the SMNPsvm model fails to rank the compounds in RS+ higher than the RS− compounds and achieves an ROC score of just 0.567. The reason for this poor performance may lie in the fact that natural products have very complex structures, so the model learns to differentiate between natural products and everything else rather than between easy and hard to synthesize compounds. On the other hand, by ranking the molecules based on their maximum similarity to the starting materials, maxSMsim performs better and achieves an area under the ROC curve of 0.679. The maxSMsim approach also produced better results in terms of ROC50 and PA100. Nevertheless, neither approach performs well enough to be used as a reliable synthetic accessibility filter, as both achieve very low ROC50 and PA100 values.

4.2 Performance on the Eli Lilly Dataset

Table 5 shows the performance achieved by the DRsvm, RSsvm, and the two baseline approaches on the Eli Lilly dataset, whereas Figure 4 shows the ROC curves associated with these schemes.


Table 5: Performance on the Eli Lilly dataset.

Model       ROC     ROC50   PA100 (Enrichment)
DRsvm       0.888   0.713   100.0 (1.3)
RSsvm       0.782   0.537   100.0 (1.3)
SMNPsvm     0.374   0.058   65.0 (0.8)
maxSMsim    0.711   0.378   100.0 (1.3)
maxDRsim    0.744   0.443   97.0 (1.2)
Random      0.500   0.115   79.2 (1.0)

As discussed in Section 3.5, the DRsvm and RSsvm models were trained on compounds derived from the MLSMR dataset; consequently, these results represent an evaluation of the methods on a dataset that is entirely independent of the one used to train them. All the results in this table were obtained using the GF descriptor representation of the compounds. In addition, the performance of the random scheme is presented in order to establish a baseline for the ROC50 and PA100 values.

These results show that both DRsvm and RSsvm perform well and achieve ROC and ROC50 scores that are considerably higher than those obtained by the baseline approaches. The DRsvm model achieves the best overall performance, with an ROC score of 0.888 and an ROC50 score of 0.713. Of the two baseline schemes, maxSMsim performs best (as was the case on the MLSMR dataset) and achieves ROC and ROC50 scores of 0.711 and 0.378, respectively. The fact that both RSsvm and maxSMsim achieve a PA100 score of 100 while the former achieves a much higher ROC50 score indicates that maxSMsim can easily identify some of the positive compounds, but its performance degrades considerably when somewhat lower-ranked compounds are considered.

In order to show that the SVM itself, and not just the training set, contributes to the performance of the DRsvm method, we also computed the performance of a simple method in which the score of each compound is the difference between its maximum similarity to the compounds in the DR+ set and its maximum similarity to those in the DR− set. This method is labeled maxDRsim in Table 5. Its lower performance compared to DRsvm (and even RSsvm) indicates the advantage of using an SVM over simple similarity to the training sets. Note that we also computed the performance of a method based on the maximum similarity to the positive class (DR+) alone, but found its performance to be slightly worse than that of maxDRsim.
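The maxDRsim baseline reduces to a one-line score per compound. The sketch below is illustrative (the names are not from the paper's code) and assumes the similarity is the Tanimoto coefficient between descriptor sets.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of descriptors (fragments)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def max_dr_sim(mol, dr_pos, dr_neg, similarity=tanimoto):
    """maxDRsim score: max similarity to DR+ minus max similarity to DR-.
    Higher scores mean the compound looks more synthetically accessible."""
    return (max(similarity(mol, p) for p in dr_pos)
            - max(similarity(mol, n) for n in dr_neg))

# Toy descriptor sets (sets of fragment identifiers).
mol = {"f1", "f2", "f3"}
dr_pos = [{"f1", "f2"}, {"f4"}]
dr_neg = [{"f3", "f5", "f6"}]
```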

The RSsvm model achieves the second-best performance, but in absolute terms its performance is considerably worse than that achieved by DRsvm and than its own performance on the MLSMR dataset. These results should not be surprising, as the RSsvm model was trained on positive and negative compounds identified via a retrosynthetic approach based on a small set of reactions. As the decomposition statistics in Table 1 indicate, only 10.6% of the EL+ compounds can be decomposed in a small number of steps using that set of reactions. Consequently, according to the positive and negative class definitions used for training the RSsvm model, the Eli Lilly dataset contained only 10.6% positive compounds, which explains RSsvm's performance degradation. What is surprising is that, despite this, RSsvm is still able to achieve good prediction performance, indicating that it can generalize beyond the set of reactions that were used to derive its training set.


[ROC curve plot: x-axis "Negative rate", y-axis "Positive rate"; curves for DRsvm, RSsvm, SMNPsvm, and maxSMsim]

Figure 4: ROC curves for ranking the compounds in the Eli Lilly dataset using different methods (the vertical line indicates the point at which ROC50 is calculated).

4.3 Generalization to New Reactions

A key parameter of the RSsvm model is the set of reactions used to derive the sets of positive and negative training instances during the retrosynthetic decomposition. As the discussion in Section 4.2 suggested, one of the key questions about the generality of the RSsvm-based approach is its ability to correctly predict the synthetic accessibility of compounds that can only be synthesized using additional reactions. To answer this question, we performed the following experiment, designed to evaluate RSsvm's generalizability to new reactions.

First, we selected 20 of our 23 reactions whose elimination had an impact on the decomposition tree depth of the MLSMR compounds and randomly grouped them into four sets of five reactions each. Then, for each training and testing dataset of the five-fold cross-validation framework used in the experiments of Section 4.1, we performed the following operations, which are presented graphically in Figure 5. For each group of reactions, we eliminated its reactions from the set of 23 reactions and used the remaining 18 reactions to determine the shortest synthesis path of each RS+ compound in the training set (RS+-train) via retrosynthetic decomposition (Section 2.3.1). This retrosynthetic decomposition partitioned the RS+-train compounds into three subsets, referred to as P+-train, Px-train, and P−-train. The P+-train subset contains the compounds that, even with the reduced set of reactions, still had a synthesis path of length up to three; the Px-train subset contains the compounds whose synthesis path became either four or five; and the P−-train subset contains the compounds whose synthesis path became greater than five. Thus, the elimination of the reactions turned the P−-train compounds into hard to synthesize compounds from a retrosynthetic analysis perspective. We then built an SVM model using as positive instances only the molecules in P+-train and as negative instances the union of the molecules in RS−-train and P−-train. The Px-train subset, which comprised less than 1% of RS+, was discarded. The performance of this model was assessed by first using the above approach to split the RS+ compounds of the testing set into the corresponding P+-test, Px-test, and P−-test subsets, and then creating a testing set whose positive instances were the compounds in P−-test and whose negative instances were the compounds in RS−-test. Note that since the short synthesis paths (i.e., ≤ 3) of the compounds in P−-test require the eliminated reactions, this testing set allows us to assess how well the SVM model can generalize to compounds whose short synthesis paths depend on reactions that were not used to determine the positive training compounds.

Figure 5: Schematic representation of the training and testing with a group of reactions eliminated. (P+, P−, and Px are the subsets of RS+ that are easy to synthesize (path ≤ 3), hard to synthesize (path > 5), or in between, respectively, after a group of reactions is eliminated.)

Table 6: Performance when extra reactions are used to determine the positive test compounds.

Group  Reactions      % molecules†       RSsvm                           maxSMsim                        Random
       in group       moved to P−-train  ROC    ROC50  PA100 (Enrich.)  ROC    ROC50  PA100 (Enrich.)  ROC50  PA100
1      8,9,10,20,21   17.3%              0.557  0.005  22.2 (1.3)       0.673  0.002  14.2 (0.8)       0.004  16.8
2      1,3,6,14,15    12.2%              0.752  0.006  27.6 (1.7)       0.668  0.007  34.2 (2.0)       0.004  16.7
3      2,7,16,17,18   14.2%              0.690  0.006  32.6 (1.7)       0.649  0.005  22.2 (1.2)       0.004  18.7
4      4,5,11,12,19   38.6%              0.717  0.004  41.2 (1.2)       0.631  0.004  37.4 (1.1)       0.004  33.6
† The percentages refer to all 48,621 easy to synthesize compounds. The proportions for the RS+ set after selecting the maximal independent set are slightly different.
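The partitioning of RS+-train (and, analogously, RS+-test) reduces to thresholding the recomputed decomposition heights. In the sketch below, `heights` is a hypothetical precomputed mapping from each RS+ compound to its shortest decomposition-tree height under the reduced reaction set.

```python
def partition_rs_plus(heights):
    """Split RS+ compounds by their decomposition height after a group of
    reactions is removed: P+ (still <= 3), Px (4 or 5, discarded),
    P- (> 5, i.e., turned hard to synthesize)."""
    p_plus, p_x, p_minus = [], [], []
    for mol, h in heights.items():
        if h <= 3:
            p_plus.append(mol)
        elif h <= 5:
            p_x.append(mol)
        else:
            p_minus.append(mol)
    return p_plus, p_x, p_minus
```

The SVM is then trained with P+-train as positives and the union of RS−-train and P−-train as negatives, and tested with P−-test as positives against RS−-test as negatives.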

Table 6 shows the results of these experiments. The table also lists the reactions in each group and the fraction of all 48,621 easy to synthesize molecules in RS+ that became hard to synthesize after a particular group of reactions was eliminated in the retrosynthetic decomposition. For comparison purposes, the table also shows the performance of the best-performing baseline method (maxSMsim) and of the random scheme. Note that, since a maximal independent set of RS+ is used for training and testing, the percentages of molecules that became negative in the latter are slightly different from those for the whole RS+ set, which explains why the PA100 values for the random scheme are not proportional to the percentage of molecules that became negative.

These results show that, as expected, the relative performance of the RSsvm model on these test sets decreases when compared to the results reported in Table 4. This reflects the considerably harder nature of this classification problem, as it is designed to test how well RSsvm generalizes to compounds that can only be synthesized when additional reactions are used. Despite this, for three of the four reaction groups, RSsvm is still able to achieve reasonably good performance, obtaining ROC scores ranging from 0.690 to 0.752. Its performance also compares favorably to the maxSMsim scheme for three of the four subsets, as it achieves better ROC, ROC50, and PA100 scores. Note that we also assessed the performance of these models on testing sets whose positive compounds were the P+-test subsets; the performance achieved by both RSsvm and maxSMsim was comparable to that reported earlier.

4.4 Computational Requirements

An advantage of the SVM-based synthetic accessibility approach over a similar approach based entirely on retrosynthetic decomposition is its prediction speed. For example, during the cross-validation experiments with the MLSMR dataset, the RSsvm models processed about 6,700 molecules per minute. This is significantly better than retrosynthetic approaches, which process only several molecules per minute or fewer. In general, the amount of time required by the SVM-based approach to rank the compounds is proportional to the size of the model (i.e., the number of support vectors), which in turn depends on the number of training compounds and the separability of the positive and negative classes. Even though its absolute computational performance will change with the training set, we expect it to remain considerably faster than the retrosynthetic-based approach. The time required to train the SVM models depends on the number of vectors in the training sets: the RSsvm model required 86 minutes for training, while the DRsvm model, which is trained on a larger number of molecules, required almost 9 hours. The classification of the MLSMR molecules into the DR+ and DR− sets required 69 minutes to complete. Note that training set preprocessing and SVM model training are operations that need to be performed only once.

5 Discussion & Conclusions

In this paper we presented and evaluated two different approaches for predicting the synthetic accessibility of chemical compounds that utilize support vector machines to build models based on descriptor representations of the compounds. The two approaches exhibit different performance characteristics and are designed to address two different application scenarios. The first approach, corresponding to the RSsvm-based models, is designed to identify the compounds that can be synthesized using a specific set of reactions and a starting materials library, whereas the second approach, corresponding to the DRsvm-based models, is designed to provide a more general assessment of synthetic accessibility that is not tied to any set of reactions or starting materials databases.

The results presented in Section 4.1 showed that the RSsvm-based models are very effective in identifying the compounds that can be synthesized in a small number of steps using the set of reactions and starting materials under consideration. As a result, these models can be used to prioritize the compounds to be retrosynthetically analyzed in order to confirm their synthetic accessibility and identify the desired synthesis path, and as such can significantly reduce the amount of time required by retrosynthetic analysis approaches. In addition, the experiments showed that even when additional reactions and/or starting materials are required for the synthesis of a compound, the RSsvm-based models are still able to prioritize a set of compounds based on their synthetic accessibility (Sections 4.2 and 4.3), although, as expected, the quality of the predictions is lower in such cases.

The results presented in Section 4.2 showed that the approach utilized by the DRsvm-based model, which automatically identifies a set of positive and negative training instances from a large diverse library based on their similarity to other compounds, leads to an effective synthetic accessibility prediction model. This approach outperforms both the baseline approaches and the RSsvm model, since the latter is biased toward the set of reactions and starting materials used to identify its training instances.

Acknowledgement

This work was supported by NSF ACI-0133464, IIS-0431135, NIH R01 LM008713A, and the Digital Technology Center at the University of Minnesota. We would like to thank Dr. Ian Watson at Eli Lilly and Company, who graciously agreed to evaluate our models with their compounds.

References

(1) Pearlman, D. A.; Murcko, M. CONCERTS: Dynamic collection of fragments as an approach to de novo ligand design. J. Med. Chem. 1996, 39, 1651–1663

(2) Gillet, V.; Johnson, A.; Mata, P.; Sike, S.; Williams, P. SPROUT: A program for structure generation. J. Comput.-Aided Mol. Des. 1993, 7, 127–153

(3) Boda, K.; Johnson, A. P. Molecular Complexity Analysis of de Novo Designed Ligands. J. Med. Chem. 2006, 49, 5869–5879

(4) Schneider, G.; Lee, M.-L.; Stahl, M.; Schneider, P. De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks. J. Comput.-Aided Mol. Des. 2000, 14, 487–494

(5) Vinkers, H. M.; de Jonge, M. R.; Daeyaert, F. F. D.; Heeres, J.; Koymans, L. M. H.; van Lenthe, J. H.; Lewi, P. J.; Timmerman, H.; Aken, K. V.; Janssen, P. A. J. SYNOPSIS: SYNthesize and OPtimize System in Silico. J. Med. Chem. 2003, 46, 2765–2773

(6) Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. RECAP–retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511–522

(7) Bertz, S. The First General Index of Molecular Complexity. J. Am. Chem. Soc. 1981, 103, 3599–3601

(8) Hendrickson, J. B.; Huang, P.; Toczko, A. G. Molecular complexity: a simplified formula adapted to individual atoms. J. Chem. Inf. Comput. Sci. 1987, 27, 63–67

(9) Whitlock, H. On the Structure of Total Synthesis of Complex Natural Products. J. Org. Chem. 1998, 63, 7982–7989

(10) Barone, R.; Chanon, M. A New and Simple Approach to Chemical Complexity. Application to the Synthesis of Natural Products. J. Chem. Inf. Comput. Sci. 2001, 41, 269–272

(11) Rücker, C.; Rücker, G.; Bertz, S. Organic Synthesis - Art or Science? J. Chem. Inf. Comput. Sci. 2004, 44, 378–386

(12) Allu, T.; Oprea, T. Rapid Evaluation of Synthetic and Molecular Complexity for in Silico Chemistry. J. Chem. Inf. Model. 2005, 45, 1237–1243

(13) Johnson, A.; Marshall, C.; Judson, P. Starting material oriented retrosynthetic analysis in the LHASA program. 1. General description. J. Chem. Inf. Comput. Sci. 1992, 32, 411–417

(14) Gillet, V.; Myatt, G.; Zsoldos, Z.; Johnson, A. SPROUT, HIPPO and CAESA: Tools for de novo structure generation and estimation of synthetic accessibility. Perspect. Drug Discovery Des. 1995, 3, 34–50

(15) Pförtner, M.; Sitzmann, M. Computer-Assisted Synthesis Design by WODCA (CASD). In Handbook of Cheminformatics; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003; pp 1457–1507

(16) Boda, K.; Seidel, T.; Gasteiger, J. Structure and reaction based evaluation of synthetic accessibility. J. Comput.-Aided Mol. Des. 2007, 21, 311–325

(17) Law, J.; Zsoldos, Z.; Simon, A.; Reid, D.; Liu, Y.; Khew, S. Y.; Johnson, A. P.; Major, S.; Wade, R. A.; Ando, H. Y. Route Designer: A Retrosynthetic Analysis Tool Utilizing Automated Retrosynthetic Rule Generation. J. Chem. Inf. Model. 2009, 49, 593–602

(18) Wale, N.; Watson, I.; Karypis, G. Indirect Similarity Based Methods for Effective Scaffold-Hopping in Chemical Compounds. J. Chem. Inf. Model. 2008, 48, 730–741

(19) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000

(20) Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Wiley-VCH: Weinheim, Germany, 2009

(21) Rogers, D.; Brown, R.; Hahn, M. Using Extended-Connectivity Fingerprints with Laplacian-Modified Bayesian Analysis in High-Throughput Screening. J. Biomol. Screen. 2005, 10, 682–686

(22) Vapnik, V. The Nature of Statistical Learning Theory; Springer-Verlag: New York, 1995

(23) Aizerman, M.; Braverman, E.; Rozonoer, L. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Rem. Cont. 1964, 25, 821–837

(24) Boser, B. E.; Guyon, I. M.; Vapnik, V. N. A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on COLT, Pittsburgh, PA, 1992; pp 144–152

(25) Willett, P. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983–996

(26) Geppert, H.; Horvath, T.; Gartner, T.; Wrobel, S.; Bajorath, J. Support-Vector-Machine-Based Ranking Significantly Improves the Effectiveness of Similarity Searching Using 2D Fingerprints and Multiple Reference Compounds. J. Chem. Inf. Model. 2008, 48, 742–746

(27) Chen, J.; Holliday, J.; Bradshaw, J. A Machine Learning Approach to Weighting Schemes in the Data Fusion of Similarity Coefficients. J. Chem. Inf. Model. 2009, 49, 185–194

(28) Corey, E. J. Retrosynthetic Thinking — Essentials and Examples. Chem. Soc. Rev. 1988, 17, 111–133

(29) Corey, E. J.; Cheng, X.-M. The Logic of Chemical Synthesis; Wiley: New York, 1995

(30) Bajorath, J. Chemoinformatics methods for systematic comparison of molecules from natural and synthetic sources and design of hybrid libraries. J. Comput.-Aided Mol. Des. 2002, 16, 431–439

(31) Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a public information system for analyzing bioactivities of small molecules. Nucl. Acids Res. 2009, 37, W623–W633

(32) Wale, N.; Karypis, G. Acyclic Subgraph-based Descriptor Spaces for Chemical Compound Retrieval and Classification. IEEE International Conference on Data Mining (ICDM), 2006

(33) AFGen 2.0. http://glaros.dtc.umn.edu/gkhome/afgen/overview (accessed July 1, 2009)

(34) GenerateMD, v. 5.2.0, ChemAxon. http://www.chemaxon.com/jchem/doc/user/GenerateMD.html (accessed July 1, 2009)

(35) OEChem TK, OpenEye Scientific Software. http://www.eyesopen.com/products/toolkits/oechem.html (accessed July 1, 2009)

(36) Joachims, T. Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning; Schölkopf, B., Burges, C. J. C., Smola, A. J., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp 169–184

(37) Fawcett, T. An introduction to ROC analysis. Pattern Recog. Lett. 2006, 27, 861–874

(38) Gribskov, M.; Robinson, N. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 1996, 20, 25–33
