
A Model to Search for Synthesizable Molecules

John Bradshaw
University of Cambridge
MPI for Intelligent Systems
[email protected]

Brooks Paige
University of Cambridge
The Alan Turing Institute
[email protected]

Matt J. Kusner
University College London
The Alan Turing Institute
[email protected]

Marwin H. S. Segler
BenevolentAI
Westfälische Wilhelms-Universität Münster
[email protected]

José Miguel Hernández-Lobato
University of Cambridge
The Alan Turing Institute
Microsoft Research Cambridge
[email protected]

Abstract

Deep generative models are able to suggest new organic molecules by generating strings, trees, and graphs representing their structure. While such models allow one to generate molecules with desirable properties, they give no guarantees that the molecules can actually be synthesized in practice. We propose a new molecule generation model, mirroring a more realistic real-world process, where (a) reactants are selected, and (b) combined to form more complex molecules. More specifically, our generative model proposes a bag of initial reactants (selected from a pool of commercially-available molecules) and uses a reaction model to predict how they react together to generate new molecules. We first show that the model can generate diverse, valid and unique molecules due to the useful inductive biases of modeling reactions. Furthermore, our model allows chemists to interrogate not only the properties of the generated molecules but also the feasibility of the synthesis routes. We conclude by using our model to solve retrosynthesis problems, predicting a set of reactants that can produce a target product.

1 Introduction

The ability of machine learning to generate structured objects has progressed dramatically in the last few years. One particularly successful example of this is the flurry of developments devoted to generating small molecules [15, 50, 30, 11, 53, 12, 22, 33, 59, 45, 2]. These models have been shown to be extremely effective at finding molecules with desirable properties: drug-like molecules [15], molecules with biological target activity [50], and soluble molecules [12].

However, these improvements in molecule discovery come at a cost: these methods do not describe how to synthesize such molecules, a prerequisite for experimental testing. Traditionally, in computer-aided molecular design, this has been addressed by virtual screening [52], where molecule data sets, |D| ≈ 10^8, are first generated via the expensive combinatorial enumeration of molecular fragments stitched together using hand-crafted bonding rules, and then are scored in an O(|D|) step.

In this paper we propose a generative model for molecules (shown in Figure 1) that describes how to make such molecules from a set of commonly-available reactants. Our model first generates a set of reactant molecules, and second maps them to a predicted product molecule via a reaction prediction model. It allows one to simultaneously search for better molecules and describe how such molecules can be made. By closely mimicking the real-world process of designing new molecules, we show that our model: 1. Is able to generate a wide range of molecules not seen in the training data; 2. Addresses practical synthesis concerns such as reaction stability and toxicity; and 3. Allows us to propose new reactants for given target molecules that may be more practical to manage.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

arXiv:1906.05221v2 [cs.LG] 4 Dec 2019


[Figure 1 graphic, three panels: Virtual Screening (combination rule, enumeration of fragments, scoring via simulation); Prior Machine Learning De Novo Design (optimization in a continuous latent space over µ, σ, z to a selected molecule); MOLECULE CHEF, ours (optimization in latent space, generate reactants, reaction predictor, generate products).]

Figure 1: An overview of approaches used to find molecules with desirable properties. Left: Virtual screening [52] aims to find novel molecules by the (computationally expensive) enumeration over all possible combinations of fragments. Center: More recent ML approaches, e.g. [15], aim to find useful, novel molecules by optimizing in a continuous latent space; however, there are no clues as to whether (and how) these molecules can be synthesized. Right: We approach the generation of molecules through a multistage process mirroring how complex molecules are created in practice, while maintaining a continuous latent space to use for optimization. Our model, MOLECULE CHEF, first finds suitable reactants which then react together to create a final molecule.


2 Background

We start with an overview of traditional computational techniques to discover novel molecules with desirable properties. We then review recent work in machine learning (ML) that seeks to improve parts of this process. We then identify aspects of molecule discovery we believe deserve much more attention from the ML community. We end by laying out our contributions to address these concerns.

2.1 Virtual Screening

To discover new molecules with certain properties, one popular technique is virtual screening (VS) [52, 18, 41, 8, 38]. VS works by (a) enumerating all combinations of a set of building-block molecules (which are combined via virtual chemical bonding rules), (b) for each molecule, calculating the desired properties via simulations or prediction models, and (c) filtering the most interesting molecules to synthesize in the lab. While VS is general, it has the important downside that the generation process is not targeted: VS needs to get lucky to find molecules with desirable properties; it does not search for them. Given that the number of possible drug-like compounds is estimated to be in [10^23, 10^100] [56], the chemical space usually screened in VS, in [10^7, 10^10], is tiny. Searching in combinatorial fragment spaces has been proposed, but is limited to simpler similarity queries [42].

2.2 The Molecular Search Problem

To address these downsides, one idea is to replace this full enumeration with a search algorithm, an idea called de novo design (DND) [46]. Instead of generating a large set of molecules with small variations, DND searches for molecules with particular properties, recomputes these properties for the newfound molecules, and searches again. We call this the molecular search problem. Early work on the molecular search problem used genetic algorithms, ant-colony optimization, or other discrete search techniques to make local changes to molecules [17]. While more directed than library generation, these approaches still explored locally, limiting the diversity of discovered molecules.

The first work to apply current ML techniques to this problem was Gómez-Bombarelli et al. [15] (in a late 2016 preprint). Their idea was to search by learning a mapping from molecular space to continuous space and back. With this mapping it is possible to leverage well-studied optimization techniques to do search: local search can be done via gradient descent and global search via Bayesian optimization [54, 14]. For such a mapping, the authors chose to represent molecules as SMILES strings [58] and leverage advances in generative models for text [5] to learn a character variational autoencoder (CVAE) [29]. Shortly after this work, in an early 2017 preprint, Segler et al. [50] trained recurrent neural networks (RNNs) to take properties as input and output SMILES strings with these properties, with molecular search done using reinforcement learning (RL).


In Search of Molecular Validity. However, the SMILES string representation is very brittle: if individual characters are changed or swapped, it may no longer represent any molecule (called an invalid molecule). Thus, the CVAE often produced invalid molecules (in one experiment, Kusner et al. [30], sampling from the continuous space, produced valid molecules only 0.7% of the time). To address this validity problem, recent works have proposed using alternative molecular representations such as parse trees [30, 11] or graphs [53, 12, 32, 22, 33, 59, 23, 26, 45], where some of the more recent among these enforce or strongly encourage validity [22, 33, 59, 26, 45]. In parallel, there has been work based on RL that aims to learn a validity function directly during training [16, 20].

2.3 The Molecular Recipe Problem

Crucially, all of the works in the previous section solving the molecular search problem focus purely on optimizing molecules towards desirable properties. These works, in addressing the downsides of VS, removed a benefit of it: knowledge of the synthesis pathway of each molecule. Without this we do not know how practical it is to make ML-generated molecules.

Addressing this concern means addressing the molecular recipe problem: what molecules are we able to make, given a set of readily-available starting molecules? So far, this problem has been addressed independently of the molecular search problem through synthesis planning (SP) [51]. SP works by recursively deconstructing a molecule. This deconstruction is done via (reversed) reaction predictors: models that predict how reactant molecules produce a product molecule. More recently, novel ML models have been designed for reaction prediction [57, 49, 21, 47, 6, 48].

2.4 This Work

In this paper, we propose to address both the molecular search problem and the molecular recipe problem jointly. To do so, we propose a generative model over molecules using the following map: first, a mapping from continuous space to a set of known, reliable, easy-to-obtain reactant molecules; second, a mapping from this set of reactant molecules to a final product molecule, based on a reaction prediction model [57, 49, 21, 48, 6]. Thus our generative model not only generates molecules, but also a synthesis route using available reactants. This addresses the molecular recipe problem, and also the molecular search problem, as the learned continuous space can also be used for search. Compared to previous work, in this work we are searching for new molecules through virtual chemical reactions, more directly simulating how molecules are actually discovered in the lab.

Concretely, we argue that our model, which we shall introduce in the next section, has several advantages over the current deep generative models of molecules reviewed previously:

Better extrapolation properties Generating molecules through graph editing operations, representing reactions, we hope gives us strong inductive biases for extrapolating well.

Validity of generated molecules Naive generation of molecular SMILES strings or graphs can lead to molecules that are invalid. Although the syntactic validity can be fixed by using masking [30, 33], the molecules generated can often still be semantically invalid. By generating molecules from chemically stable reactants by means of reactions, our model proposes more semantically valid molecules.

Provide synthesis routes Proposed molecules from other methods can often not be evaluated in practice, as chemists do not know how to synthesize them. As a byproduct of our model we suggest synthetic routes, which could have useful, practical value.

3 Model

In this section we describe our model.¹ We define the set of all possible valid molecular graphs as G, with an individual graph g ∈ G representing the atoms of a molecule as its nodes, and the type of bonds between these atoms (we consider single, double and triple bonds) as its edge types. The set of common reactant molecules, easily procurable by a chemist, which we want to act as building blocks for any final molecule is a subset of this, R ⊂ G.

¹ Further details can also be found in our appendix, and code is available at https://github.com/john-bradshaw/molecule-chef


As discussed in the previous section (and shown in Figure 1) our generative model for molecules consists of the composition of two parts: (1) a decoder from a continuous latent space, z ∈ R^m, to a bag (i.e. multiset²) of easily procurable reactants, x ⊂ R; (2) a reaction predictor model that transforms this bag of molecules into a multiset of product molecules y ⊂ G.

The benefit of this approach is that for step (2) we can pick from several existing reaction predictor models, including recently proposed methods that have used ML techniques [27, 49, 48, 6, 10]. In this work we use the Molecular Transformer (MT) of Schwaller et al. [48], as it has recently been shown to provide state-of-the-art performance in this task [48, Table 4].

This leaves us with the task of (1): learning a way to decode to (and encode from) a bag of reactants, using a parameterized encoder q(z|x) and decoder p(x|z). We call this co-occurrence model MOLECULE CHEF, and by moving around in the latent space we can use MOLECULE CHEF to select different "bags of reactants".
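To make this two-stage composition concrete, here is a minimal Python sketch of how a single product would be sampled at generation time. The `decoder` and `predict_products` callables are hypothetical stand-ins for MOLECULE CHEF's decoder and a wrapper around a reaction predictor such as the MT; they are not the released API.

```python
import torch

def sample_synthesizable_molecule(decoder, predict_products, latent_dim=25):
    """Sample one product: prior -> latent -> reactant bag -> reaction.

    `decoder` and `predict_products` are hypothetical callables standing in
    for MOLECULE CHEF's decoder and a reaction predictor (e.g. the Molecular
    Transformer); the latent dimensionality is illustrative.
    """
    z = torch.randn(latent_dim)                 # sample from the N(0, I) prior p(z)
    reactant_bag = decoder(z)                   # e.g. ["CCO", "CC(=O)Cl"]
    products = predict_products(reactant_bag)   # e.g. ["CC(=O)OCC"]
    return reactant_bag, products
```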

Again there are several viable options for how to learn MOLECULE CHEF. For instance, one could choose to use a VAE for this task [29, 44]. However, when paired with a complex decoder these models are often difficult to train [5, 1], such that much of the previous work for generating graphs has tuned down the KL regularization term in these models [33, 30]. We therefore instead propose using the WAE objective [55], which involves minimizing

$$\mathcal{L} = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{q(z|x)}\big[c(x, p(x|z))\big] + \lambda\, D\big(\mathbb{E}_{x \sim \mathcal{D}}[q(z|x)],\; p(z)\big)$$

where c is a cost function that enforces the reconstructed bag to be similar to the encoded one, and D is a divergence measure, weighted in relative importance by λ, that forces the marginalised distribution of all encodings to match the prior on the latent space. Following Tolstikhin et al. [55] we use the maximum mean discrepancy (MMD) divergence measure, with λ = 10 and a standard normal prior over the latents. We choose c so that this first term matches the reconstruction term we would obtain in a VAE, i.e. with c(x, z) = −log p(x|z). This means that the objective only differs from a VAE in the second, regularisation term, such that we are not trying to match each encoding to the prior but instead the marginalised distribution over all datapoints. Empirically, we find that this trains well and does not suffer from the same local optimum issues as the VAE.
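For illustration, a minimal PyTorch sketch of this objective is below. We assume an RBF kernel for the MMD term and a simple biased (diagonal-including) estimator; the exact kernel choice and estimator used in the paper may differ.

```python
import torch

def rbf_mmd2(z_q, z_p, sigma=1.0):
    """Biased MMD^2 estimate between encoded latents z_q and prior samples
    z_p with an RBF kernel (kernel choice and bandwidth are assumptions)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

def wae_loss(recon_nll, z_q, lam=10.0):
    """WAE objective: reconstruction negative log-likelihood plus the
    lambda-weighted divergence pushing the aggregate posterior towards the
    standard normal prior p(z)."""
    z_p = torch.randn_like(z_q)           # samples from the N(0, I) prior
    return recon_nll + lam * rbf_mmd2(z_q, z_p)
```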

3.1 Encoder and Decoder

We can now begin describing the structure of our encoder and decoder. In these functions it is often convenient to work with n-dimensional vector embeddings of graphs, m_g ∈ R^n. Again we are faced with a series of possible alternative ways to compute these embeddings. For instance, we could ignore the structure of the molecular graph and learn a distinct embedding for each molecule, or use fixed molecular fingerprints, such as Morgan fingerprints [37]. We instead choose to use deep graph neural networks [36, 13, 3] that can produce graph-isomorphic representations.

Deep graph neural networks have been shown to perform well on a variety of tasks involving small organic molecules, and their advantages compared to the previously mentioned alternative approaches are that (1) they take the structure of the graph into account and (2) they can learn which characteristics are important when forming higher-level representations. In particular, in this work we use 4-layer Gated Graph Neural Networks (GGNNs) [31]. These compute higher-level representations for each node. These node-level representations in turn can be combined by a weighted sum, to form a graph-level representation invariant to the order of the nodes, in an operation referred to as an aggregation transformation [24, §3].

Encoder The structure of MOLECULE CHEF's encoder, q(z|x), is shown in Figure 2. For the ith data point the encoder has as input the multiset of reactants x_i = {x_{i1}, x_{i2}, ...}. It first computes the representation of each individual reactant molecule graph using the GGNN, before summing these representations to get a representation that is invariant to the order of the multiset. A feed-forward network is then used to parameterize the mean and variance of a Gaussian distribution over z.
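A minimal PyTorch sketch of this encoder structure is shown below; the `ggnn` callable stands in for the graph neural network, and all layer widths and the latent dimensionality are illustrative rather than the paper's values.

```python
import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    """Sketch of q(z|x): embed each reactant graph with a GGNN, sum for an
    order-invariant bag embedding, then output Gaussian parameters. `ggnn`
    is a stand-in callable; widths and latent size are illustrative."""
    def __init__(self, ggnn, emb_dim=64, latent_dim=25):
        super().__init__()
        self.ggnn = ggnn                              # graph -> R^emb_dim
        self.ff = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU())
        self.mean = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)

    def forward(self, reactant_graphs):
        # (1) per-molecule embeddings; (2) order-invariant sum over the multiset
        bag_emb = torch.stack([self.ggnn(g) for g in reactant_graphs]).sum(dim=0)
        h = self.ff(bag_emb)
        return self.mean(h), self.log_var(h)          # parameters of q(z|x)
```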

Decoder The decoder, p(x|z), (Figure 3) maps from the latent space to a multiset of reactant molecules. These reactants are typically small molecules, which means we could fit a deep generative model which produces them from scratch.

² Note how we allow molecules to be present multiple times as reactants in our reaction, although practically many reactions only have one instance of a particular reactant.


[Figure 2 graphic: reactant embeddings (via GGNNs), order-invariant combination into a reactant bag embedding, then latent space parameters µ, σ and sample z.]

Figure 2: The encoder of MOLECULE CHEF. This maps from a multiset of reactants to a distribution over latent space. There are three main steps: (1) the reactant molecules are embedded into a continuous space by using GGNNs [31] to form molecule embeddings; (2) the molecule embeddings in the multiset are summed to form one order-invariant embedding for the whole multiset; (3) this is then used as input to a neural network which parameterizes a Gaussian distribution over z.

[Figure 3 graphic: latent space (µ, σ, z) feeding RNN cells that sequentially pick reactants (with a learnt halt option, shown as reactant probabilities) to generate the reactant bag, which is then passed to the reaction predictor.]

Figure 3: The decoder of MOLECULE CHEF. The decoder generates the multiset of reactants in sequence through calls to a RNN. At each step the model picks either one reactant from the pool or to halt, finishing the sequence. The latent vector, z, is used to parameterize the initial hidden layer of the RNN. Reactants that are selected are fed back into the RNN on the next step. The reactant bag formed is later fed through a reaction predictor to form a final product.

However, to better mimic the process of selecting reactant molecules from an easily obtainable set, we instead restrict the output of the decoder to pick the molecules from a fixed set of reactant molecules, R.

This happens in a sequential process using a recurrent neural network (RNN), with the full process described in Algorithm 1. The latent vector, z, is used to parametrize the initial hidden layer of the RNN. The selected reactants are fed back in as inputs to the RNN at the next generation stage. Whilst training we randomly sample the ordering of the reactants, and use teacher forcing.

3.2 Adding a predictive penalty loss to the latent space

As discussed in Section 2.2 we are interested in using and evaluating our model's performance on the molecular search problem, that is, using the learnt latent space to find new molecules with desirable properties. In reality we would wish to measure some complex chemical property that can only be measured experimentally. However, as a surrogate for this, following [15], we optimize instead for the QED (Quantitative Estimate of Drug-likeness [4]) score of a molecule, w, as a deterministic mapping from molecules to this score, y ↦ w, exists in RDKit [43].

To this end, in a similar manner to Liu et al. [33, §4.3] and Jin et al. [22, §3.3], we can simultaneously train a two-hidden-layer property predictor NN for use in local optimization tasks. This network tries to predict the QED property, w, of the final product y from the latent encoding of the associated bag of reactants. The use of this property predictor network for local optimization is described in Section 4.2.
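As a sketch, such a property predictor might look as follows in PyTorch; the layer widths and latent dimensionality are illustrative assumptions, and presumably its squared error against the true product QED would be added as a penalty to the training objective.

```python
import torch.nn as nn

# Sketch of the two-hidden-layer property predictor: it maps a latent code z
# to a predicted QED for the final product. Widths and latent size are
# illustrative, not the paper's values.
latent_dim = 25
qed_predictor = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),   # hidden layer 1
    nn.Linear(64, 64), nn.ReLU(),           # hidden layer 2
    nn.Linear(64, 1),                       # scalar QED prediction
)
```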


Algorithm 1 MOLECULE CHEF's Decoder

Require: z_i (latent space sample), GGNN (for embedding molecules), RNN (recurrent neural network), R (set of easy-to-obtain reactant molecules), s (learnt "halt" embedding), A (learnt matrix that projects from the size of the latent space to the size of the RNN's hidden space)

  h_0 ← A z_i ;  m_0 ← 0   {start symbol}
  for t = 1 to T_max do
    h_t ← RNN(m_{t−1}, h_{t−1}) ;  B ← STACK([GGNN(g) for all g in R] + [s])
    logits ← h_t B^T
    x_t ∼ softmax(logits)
    if x_t = HALT then
      break   {if the logit corresponding to the halt embedding is selected then we stop early}
    else
      m_t ← GGNN(x_t)
    end if
  end for
  return x_1, x_2, · · ·
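As a complement to the pseudocode, a minimal PyTorch sketch of this sampling loop is given below. It assumes the GGNN embedding size equals the RNN hidden size (so that the logits are a simple matrix-vector product), and all interfaces (`rnn_cell`, `ggnn`, `A`) are illustrative stand-ins rather than the released code.

```python
import torch

def decode_reactant_bag(z, rnn_cell, ggnn, reactant_pool, halt_emb, A, t_max=10):
    """Sampling-mode sketch of Algorithm 1. `rnn_cell` is e.g. an nn.GRUCell,
    `ggnn` embeds molecule graphs, `reactant_pool` is the pool R, `halt_emb`
    the learnt halt embedding, and `A` an nn.Linear projecting the latent
    vector to the RNN hidden size."""
    h = A(z)                                         # h_0 <- A z_i
    m = torch.zeros(rnn_cell.input_size)             # m_0 <- 0 (start symbol)
    # Embed the pool once; the halt embedding is the final row of B.
    B = torch.stack([ggnn(g) for g in reactant_pool] + [halt_emb])
    chosen = []
    for _ in range(t_max):
        h = rnn_cell(m.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        logits = B @ h                               # one logit per reactant + halt
        idx = torch.distributions.Categorical(logits=logits).sample().item()
        if idx == len(reactant_pool):                # halt selected: stop early
            break
        chosen.append(reactant_pool[idx])
        m = ggnn(reactant_pool[idx])                 # feed the selection back in
    return chosen
```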

4 Evaluation

In this section we evaluate MOLECULE CHEF on (1) its ability to generate a diverse set of valid molecules; (2) how useful its learnt latent space is when optimizing product molecules for some property; and (3) whether, by training a regressor back from product molecules to the latent space, MOLECULE CHEF can be used as part of a setup to perform retrosynthesis.

In order to train our model we need a dataset of reactant bags. For this we use the USPTO dataset [34], processed and cleaned up by Jin et al. [21]. We filter out reagents (molecules that form the context under which the reaction occurs but do not contribute atoms to the final products) by following the approach of Schwaller et al. [47, §3.1].

We wish to use as possible reactant molecules only popular molecules that a chemist would have easy access to. To this end, we filter our training dataset (using Jin et al. [21]'s split) so that each reaction only contains reactants that occur at least 15 times across different reactions in the original larger USPTO training dataset. This leaves us with a dataset of 34426 unique reactant bags for training MOLECULE CHEF. In total there are 4344 unique reactants. For training the baselines, we combine these 4344 unique reactants and the associated products from their different combinations to form a training set for the baselines, as even though MOLECULE CHEF has not seen the products during training, the reaction predictor has.
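As a sketch, the reactant-frequency filtering described above can be expressed in a few lines of Python; the exact bookkeeping in the released code may differ (e.g. the paper counts occurrences over the original, larger USPTO training set).

```python
from collections import Counter

def filter_reactant_bags(reactant_bags, min_count=15):
    """Keep only bags whose every reactant occurs at least `min_count` times
    across different reactions (each bag counts a reactant once)."""
    counts = Counter(r for bag in reactant_bags for r in set(bag))
    return [bag for bag in reactant_bags
            if all(counts[r] >= min_count for r in set(bag))]
```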

4.1 Generation

We begin by analyzing our model using the metrics favored by previous work³ [22, 33, 32, 30]: validity, uniqueness and novelty. Validity is defined as requiring that at least one of the molecules in the bag of products can be parsed by RDKit. For a bag of products to be unique we require it to have at least one valid molecule that the model has not generated before in any of the previously seen bags. Finally, for computing novelty we require that the valid molecules not be present in the same training set we use for the baseline generative models.
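A minimal RDKit-based sketch of these bag-level metrics is below; comparing molecules by canonical SMILES is our assumption, and the uniqueness and novelty counts are conditioned on validity, as in Table 1.

```python
from rdkit import Chem

def generation_metrics(product_bags, training_smiles):
    """Validity, uniqueness, and novelty for bags of product SMILES, with
    uniqueness and novelty conditioned on validity. Molecules are compared
    by RDKit canonical SMILES (an assumption of this sketch)."""
    train = set(training_smiles)
    seen = set()
    n_valid = n_unique = n_novel = 0
    for bag in product_bags:
        mols = [Chem.MolFromSmiles(s) for s in bag]
        valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
        if not valid:
            continue                                  # bag has no parsable molecule
        n_valid += 1
        if any(s not in seen for s in valid):         # something not generated before
            n_unique += 1
        if any(s not in train for s in valid):        # something outside training set
            n_novel += 1
        seen.update(valid)
    n = len(product_bags)
    return n_valid / n, n_unique / max(n_valid, 1), n_novel / max(n_valid, 1)
```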

In addition, we compute the Fréchet ChemNet Distance (FCD) [40] between the valid molecules generated by each method and our baseline training set. Finally, in order to try to assess the quality of the molecules generated, we record the (train-normalized) proportion of valid molecules that pass the quality filters proposed by Brown et al. [7, §3.3]; these filters aim to remove molecules that are "potentially unstable, reactive, laborious to synthesize, or simply unpleasant to the eye of medicinal chemists".

³ Note that we have extended the definition of these metrics to a bag (multiset) of products, given that our model can output multiple molecules for each reaction. However, when sampling 20000 times from the prior of our model, we generate single-product bags 97% of the time, so that in practice most of the time we are using the same definition for these metrics as the previous work, which always generated single molecules.


Table 1: Table showing the validity, uniqueness, novelty and normalized quality (all as %, higher better) of the products or molecules generated from decoding 20k random samples from the prior p(z). Quality is the proportion of valid molecules that pass the quality filters proposed in Brown et al. [7, §3.3], normalized such that the score on the training set is 100. FCD is the Fréchet ChemNet Distance [40], capturing a notion of distance between the generated valid molecules and the training dataset (lower better). The uniqueness and novelty figures are also conditioned on validity. MT stands for the Molecular Transformer [48].

Model Name            Validity   Uniqueness   Novelty   Quality     FCD
MOLECULE CHEF + MT       99.05        95.95     89.11     95.30    0.73
AAE [25, 39]             85.86        98.54     93.37     94.89    1.12
CGVAE [33]              100.00        93.51     95.88     44.45   11.73
CVAE [15]                12.02        56.28     85.65     52.86   37.65
GVAE [30]                12.91        70.06     87.88     46.87   29.32
LSTM [50]                91.18        93.42     74.03    100.12    0.43

For the baselines we consider the character VAE (CVAE) [15], the grammar VAE (GVAE) [30], the AAE (adversarial autoencoder) [25], the constrained graph VAE (CGVAE) [33], and a stacked LSTM generator with no latent space [50]. Further details about the baselines can be found in the appendix.

The results are shown in Table 1. As MOLECULE CHEF decodes to a bag made up from a predefined set of molecules, the reactants going into the reaction predictor are valid. The validity of the final product is not 100%, as the reaction predictor can make non-valid edits to these molecules, but we see that in a high number of cases the products are valid too. Furthermore, what is very encouraging is that the molecules generated often pass the quality filters, giving evidence that the process of building molecules up by combining stable reactant building blocks often leads to stable products.

4.2 Local Optimization

As discussed in Section 3.2, when training MOLECULE CHEF we can simultaneously train a property predictor network, mapping from the latent space of MOLECULE CHEF to the QED score of the final product. In this section we look at using the gradient information obtainable from this network to do local optimization to find a molecule created from our reactant pool that has a high QED score.

We evaluate the local optimization of molecular properties by taking 250 bags of reactants, encoding them into the latent space of MOLECULE CHEF, and then repeatedly moving in the latent space along the gradient direction of the property predictor until we have decoded ten different reactant bags. As a comparison, we consider instead moving in a random walk until we have also decoded ten different reactant bags. In Figure 4 we look at the distribution of the best QED score found when considering these ten reactant bags, and how this compares to the QEDs we started with.
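A minimal sketch of this local optimization loop is below; the step size, the gradient-step cap, and the `decoder`/`qed_predictor` interfaces are illustrative assumptions, not the paper's exact settings.

```python
import torch

def optimize_in_latent_space(z0, qed_predictor, decoder, n_bags=10, step=0.05,
                             max_steps=1000):
    """Step z along the gradient of the property predictor until `n_bags`
    distinct reactant bags have been decoded (step size and cap are
    illustrative). `decoder` is assumed to return a list of SMILES."""
    z = z0.clone().detach().requires_grad_(True)
    bags = []
    for _ in range(max_steps):
        if len(bags) >= n_bags:
            break
        pred = qed_predictor(z).squeeze()         # predicted QED of the product
        grad, = torch.autograd.grad(pred, z)
        z = (z + step * grad).detach().requires_grad_(True)   # gradient ascent
        bag = tuple(sorted(decoder(z)))           # decode; canonicalize ordering
        if bag not in bags:
            bags.append(bag)
    return bags
```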

When looking at individual optimization runs, we see that the QEDs vary a lot between different products, even if made with similar reactants. However, Figure 4 shows that overall the distribution of the final best found QED scores is improved when purposefully optimizing for this. This is encouraging, as it gives evidence of the utility of these models for the molecular search problem.

4.3 Retrosynthesis

A unique feature of our approach is that we learn a decoder from latent space to a bag of reactants. This gives us the ability to do retrosynthesis by training a model to map from products to their associated reactants' representation in latent space, and using this in addition to MOLECULE CHEF's decoder to generate a bag of reactants. This process is highlighted in Figure 5. Although retrosynthesis is a difficult task, with often multiple possible ways to create the same product, and with current state-of-the-art approaches built using large reaction databases and able to deal with multiple reactions [51], we believe that our model could open up new interesting and exciting approaches to this task. We therefore train a small network based on the same graph neural network structure used for MOLECULE CHEF, followed by four fully connected layers, to regress from products to latent space.
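A sketch of such a regressor is given below; the `ggnn` callable and the layer widths are illustrative, with only the overall shape (a GGNN followed by four fully connected layers outputting a latent code) taken from the text.

```python
import torch.nn as nn

class ProductToLatent(nn.Module):
    """Sketch of the retrosynthesis regressor: a GGNN embedding of the product
    graph followed by four fully connected layers producing a latent code z.
    Hidden widths and latent size are illustrative assumptions."""
    def __init__(self, ggnn, emb_dim=64, latent_dim=25):
        super().__init__()
        self.ggnn = ggnn
        self.fc = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),        # fourth layer outputs z
        )

    def forward(self, product_graph):
        return self.fc(self.ggnn(product_graph))
```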


[Figure 4 graphic: KDE plot over QED scores for the starting locations, the best found with a random walk, and the best found with local optimization.]

Figure 4: KDE plot showing that the distribution of the best QEDs found through local optimization, using our trained property predictor for QEDs, has higher mass over higher QED scores compared to the best found from a random walk. The starting locations' distribution (sampled from the training data) is shown in green. The final products, given a reactant bag, are predicted using the MT [48].

[Figure 5 graphic: z maps to reactants x (MOLECULE CHEF), which map to products y (reaction predictor, e.g. the Molecular Transformer), which map to a "desirable property" w (lab experiment, proxied for by a chemoinformatics toolkit function).]

Figure 5: Having learnt a latent space which can map to products through reactants, we can learn a regressor back from the suggested products to the latent space (orange dashed arrow shown) and couple this with MOLECULE CHEF's decoder to see if we can do retrosynthesis – the act of computing the reactants that create a particular product.

[Figure 6 graphic: reactants and products from the USPTO dataset shown alongside the predicted reactants and the predicted product.]

Figure 6: An example of performing retrosynthesis prediction using a trained regressor from products to latent space. This reactant-product pair has not been seen in the training set of MOLECULE CHEF. Further examples are shown in the appendix.

[Figure 8 graphics: scatter plots of the product's QED against the reconstructed product's QED; (a) Reachable Products, R² = 0.61; (b) Unreachable Products, R² = 0.26.]

Figure 8: Assessing the correlation between the QED scores for the original product and its reconstruction (see text for details). We assess on two portions of the test set: products that are made up of only reactants in MOLECULE CHEF's vocabulary are called 'Reachable Products'; those that have at least one reactant that is absent are called 'Unreachable Products'.

A few examples of the predicted reactants corresponding to products from reactions in the USPTO test set, but which can be made in one step from the predefined possible reactants, are shown in Figure 6 and the appendix. We see that often this approach, although not always able to suggest the correct whole reactant bag, chooses similar reactants that on reaction produce similar structures to the original product we were trying to synthesize. While we would not expect this approach to retrosynthesis to be competitive with complex planning tools, we think this provides a promising new approach, which could be used to identify bags of reactants that produce molecules similar to a desired target molecule. In practice, it would be valuable to be pointed directly to molecules with similar properties to a target molecule if they are easier to make than the target, since it is the properties of the molecules, and not the actual molecules themselves, that we are after.

With this in mind, we assess our approach in the following way: (1) we take a product and perform retrosynthesis on it to produce a bag of reactants, (2) we transform this bag of reactants using the Molecular Transformer to produce a new reconstructed product, and then finally (3) we plot the resulting reconstructed product molecule's QED score against the QED score of the initial product. We evaluate on a filtered version of Jin et al. [21]'s test set split of USPTO, where we have filtered out any reactions which have the exact same reactant and product multisets as a reaction present in the set used to train MOLECULE CHEF. In addition, we further split this filtered set into two sets: (i) 'Reachable Products', which are reactions in the test set that contain as reactants only molecules that are in MOLECULE CHEF's reactant vocabulary, and (ii) 'Unreachable Products', which have at least one reactant molecule that is not in the vocabulary.



The results are shown in Figure 8; overall we see that there is some correlation between the properties of products and the properties of their reconstructions. This is more prominent for the reachable products, which we believe is because our latent space is only trained on reachable product reactions and so is better able to model these reactions. Furthermore, some of the unreachable products may also require reactants that are not available in our pool of easily available reactants, at least when considering one-step reactions. However, given that unreachable products have at least one reactant which is not in MOLECULE CHEF's vocabulary, we think it is very encouraging that there still is some, albeit smaller, correlation with the true QED. This is because it shows that our model can suggest molecules with similar properties made from reactants that are available.
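A minimal sketch of one step of this evaluation pipeline is below; `regressor`, `decoder` and `predict_products` are hypothetical interfaces, and the QED computation uses RDKit's built-in implementation.

```python
from rdkit import Chem
from rdkit.Chem import QED

def qed_pair(product_smiles, regressor, decoder, predict_products):
    """Retrosynthesize a product, re-run the reaction predictor on the
    predicted reactants, and return (original QED, reconstructed QED).
    All three callables are hypothetical stand-ins for illustration."""
    z = regressor(product_smiles)                  # product -> latent code
    reactants = decoder(z)                         # latent -> reactant bag
    recon = predict_products(reactants)[0]         # bag -> predicted product
    qed = lambda s: QED.qed(Chem.MolFromSmiles(s))
    return qed(product_smiles), qed(recon)
```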

4.4 Qualitative Quality of Samples

[Figure 9 graphic: grid of molecules generated along a random walk in latent space, with rows for MOLECULE CHEF reactants, MOLECULE CHEF products, GVAE (Kusner et al., 2017) and CVAE (Gómez-Bombarelli et al., 2018), starting from the molecules in the left-most column; expert annotations mark individual molecules as unstable, toxic, or an oxidizing agent.]

Figure 9: Random walk in latent space. See text for details.

In Figure 9 we show molecules generated from a random walk starting from the encoding of a particular molecule (shown in the left-most column). We compare the CVAE, GVAE, and MOLECULE CHEF (for MOLECULE CHEF we encode the reactant bag known to generate the same molecule). We showed all generated molecules to a domain expert and asked them to evaluate their properties in terms of their stability, toxicity, oxidizing power, and corrosiveness (the rationales are provided in more detail in the appendix). Many molecules produced by the CVAE and GVAE show undesirable features, unlike the molecules generated by MOLECULE CHEF.

5 Discussion

In this work we have introduced MOLECULE CHEF, a model that generates synthesizable molecules by considering the products produced as a result of one-step reactions from a pool of pre-defined reactants. By constructing molecules through selecting reactants and running chemical reactions, while performing optimization in a continuous latent space, we can combine the strengths of previous VAE-based models and classical discrete de novo design algorithms based on virtual reactions. As future work, we hope to explore how to extend our approach to deal with larger reactant vocabularies and multi-step reactions. This would allow the generation of a wider range of molecules, whilst maintaining our approach's advantages of being able to suggest synthetic routes and often producing semantically valid molecules.

Acknowledgements

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. JB also acknowledges support from an EPSRC studentship.


References

[1] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168, 2018.

[2] Rim Assouel, Mohamed Ahmed, Marwin H Segler, Amir Saffari, and Yoshua Bengio. DEFactor: Differentiable edge factorization-based probabilistic graph generation. arXiv preprint arXiv:1811.09766, 2018.

[3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[4] G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90, 2012.

[5] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 2016.

[6] John Bradshaw, Matt J Kusner, Brooks Paige, Marwin HS Segler, and José Miguel Hernández-Lobato. A generative model for electron paths. In International Conference on Learning Representations, 2019.

[7] Nathan Brown, Marco Fiscato, Marwin H.S. Segler, and Alain C. Vaucher. GuacaMol: Benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 59(3):1096–1108, 2019. doi: 10.1021/acs.jcim.8b00839.

[8] Florent Chevillard and Peter Kolb. SCUBIDOO: A large yet screenable and easily searchable database of computationally created chemical compounds optimized toward high likelihood of synthetic tractability. J. Chem. Inf. Mod., 55(9):1824–1835, 2015.

[9] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1179. URL https://www.aclweb.org/anthology/D14-1179.

[10] Connor W Coley, Wengong Jin, Luke Rogers, Timothy F Jamison, Tommi S Jaakkola, William H Green, Regina Barzilay, and Klavs F Jensen. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science, 10(2):370–377, 2019.

[11] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. In International Conference on Learning Representations, 2018.

[12] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. In International Conference on Machine Learning Deep Generative Models Workshop, 2018.

[13] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[14] Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John P Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, pages 937–945, 2014.

[15] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci., 4(2):268–276, February 2018.

[16] Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.

[17] Markus Hartenfeller and Gisbert Schneider. Enabling future drug discovery by de novo design. Wiley Interdisc. Rev. Comp. Mol. Sci., 1(5):742–759, 2011.

[18] Qiyue Hu, Zhengwei Peng, Jaroslav Kostrowicki, and Atsuo Kuki. LEAP into the Pfizer global virtual library (PGVL) space: creation of readily synthesizable design ideas automatically. In Chemical Library Design, pages 253–276. Springer, 2011.

[19] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

[20] David Janz, Jos van der Westhuizen, Brooks Paige, Matt J Kusner, and José Miguel Hernández-Lobato. Learning a generative model for validity in complex discrete structures. In International Conference on Learning Representations, 2018.

[21] Wengong Jin, Connor W Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, 2017.

[22] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2018.

[23] Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecular optimization. In International Conference on Learning Representations, 2019.

[24] Daniel D Johnson. Learning graphical state transitions. In International Conference on Learning Representations, 2017.

[25] Artur Kadurin, Alexander Aliper, Andrey Kazennov, Polina Mamoshina, Quentin Vanhaelen, Kuzma Khrabrov, and Alex Zhavoronkov. The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget, 8(7):10883, 2017.

[26] Hiroshi Kajino. Molecular hypergraph grammar with its application to molecular optimization. In International Conference on Machine Learning, pages 3183–3191, 2019.

[27] Matthew A Kayala, Chloé-Agathe Azencott, Jonathan H Chen, and Pierre Baldi. Learning to predict chemical reactions. Journal of Chemical Information and Modeling, 51(9):2209–2222, 2011.

[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[29] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[30] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, 2017.

[31] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

[32] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, March 2018.

[33] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, 2018.

[34] Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.

[35] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, Marvin Steijaert, Jörg K. Wegner, Hugo Ceulemans, Djork-Arné Clevert, and Sepp Hochreiter. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci., 9:5441–5451, 2018. doi: 10.1039/C8SC00148K. URL http://dx.doi.org/10.1039/C8SC00148K.

[36] Christian Merkwirth and Thomas Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168, 2005.

[37] HL Morgan. The generation of a unique machine description for chemical structures – a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation, 5(2):107–113, 1965.

[38] Christos A Nicolaou, Ian A Watson, Hong Hu, and Jibo Wang. The Proximal Lilly Collection: Mapping, exploring and exploiting feasible chemical space. Journal of Chemical Information and Modeling, 56(7):1253–1266, 2016.

[39] Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, Artur Kadurin, Sergey Nikolenko, Alan Aspuru-Guzik, and Alex Zhavoronkov. Molecular Sets (MOSES): A benchmarking platform for molecular generation models. arXiv preprint arXiv:1811.12823, 2018.

[40] Kristina Preuer, Philipp Renz, Thomas Unterthiner, Sepp Hochreiter, and Günter Klambauer. Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling, 58(9):1736–1741, 2018. doi: 10.1021/acs.jcim.8b00234.

[41] Edward O Pyzer-Knapp, Changwon Suh, Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, and Alán Aspuru-Guzik. What is high-throughput virtual screening? A perspective from organic materials discovery. Annual Review of Materials Research, 45:195–216, 2015.

[42] Matthias Rarey and Martin Stahl. Similarity searching in large combinatorial chemistry spaces. Journal of Computer-Aided Molecular Design, 15(6):497–520, 2001.

[43] RDKit, online. RDKit: Open-source cheminformatics. http://www.rdkit.org. [Online; accessed 01-February-2018].

[44] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[45] Bidisha Samanta, DE Abir, Gourhari Jana, Pratim Kumar Chattaraj, Niloy Ganguly, and Manuel Gomez Rodriguez. NeVAE: A deep generative model for molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1110–1117, 2019.

[46] Petra Schneider and Gisbert Schneider. De novo design at the edge of chaos: Miniperspective. J. Med. Chem., 59(9):4077–4086, 2016.

[47] Philippe Schwaller, Théophile Gaudin, Dávid Lányi, Costas Bekas, and Teodoro Laino. "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci., 9:6091–6098, 2018. doi: 10.1039/C8SC02339E.

[48] Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A. Hunter, Costas Bekas, and Alpha A. Lee. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Central Science, 5(9):1572–1583, 2019. doi: 10.1021/acscentsci.9b00576.

[49] Marwin HS Segler and Mark P Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal, 23(25):5966–5971, 2017.

[50] Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci., 4(1):120–131, 2017.

[51] Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.

[52] Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862, 2004.

[53] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. In Vera Kurková, Yannis Manolopoulos, Barbara Hammer, Lazaros Iliadis, and Ilias Maglogiannis, editors, Artificial Neural Networks and Machine Learning – ICANN 2018, pages 412–422, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01418-6.

[54] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[55] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.

[56] Niek van Hilten, Florent Chevillard, and Peter Kolb. Virtual compound libraries in computer-assisted drug discovery. Journal of Chemical Information and Modeling, 2019.

[57] Jennifer N Wei, David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions. ACS Central Science, 2(10):725–732, 2016.

[58] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.

[59] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, 2018.


A Appendix

A.1 Generation Benchmarks on ZINC

We also ran the baselines for the generation task on the ZINC dataset [19]. The results are shown in Table 2.

Table 2: Table showing generation results for the baseline models when trained on the ZINC dataset [19]. The first four result columns show the validity, uniqueness, novelty and normalized quality (all as %, higher better) of the molecules generated from decoding 20k random samples from the prior p(z). Quality is the proportion of molecules that pass the quality filters proposed in Brown et al. [7, §3.3], normalized such that the score on the USPTO-derived training dataset (used in the main paper) is 100. FCD is the Fréchet ChemNet Distance [40], capturing a notion of distance between the generated molecules and the USPTO-derived training dataset used in the main paper.

Model Name     Validity   Uniqueness   Novelty   Quality     FCD
AAE [25, 39]      87.64       100.00     99.99     96.12    7.27
CGVAE [33]       100.00        95.39     96.54     43.48   15.30
CVAE [15]          0.31        40.98     24.59    128.54   40.10
GVAE [30]          3.66        85.23     95.08     38.86   27.31
LSTM [50]         95.71        99.98     99.93    108.68    7.99

A.2 Further Random Walk Examples and Rationales for Expert Annotation

Rationales for expert labels in Figure 9 in the main text, denoted using letters for rows and numbers for columns. A3: unstable, enol; A5: unstable, aminal; B2: reactive, radical; B4: unstable ring system; B5: toxic, reactive sulfur-chloride bond, unstable ring system; B6: unstable ring system; B7: unstable ring system; B9: toxic, thioketone.

[Figure 10 graphic omitted: grids of molecular structure drawings for the starting molecules, the MOLECULE CHEF reactants and products, and the GVAE (Kusner et al., 2017) and CVAE (Gómez-Bombarelli et al., 2018) walks, with expert annotations such as "corrosive" and "unstable".]

Figure 10: Another example random walk in latent space. See §4.4 of the main paper for further details. The rationales for the labels (using letters for rows and numbers for columns) are: A1: unstable, gem-thiolol; A7: unstable, gem-aminohydroxyl; B3: corrosive, acyl fluoride; B5: corrosive, acyl fluoride; B6: corrosive, acyl chloride; B7: corrosive, acyl chloride; B9: corrosive, acyl bromide.


[Figure 11 graphic omitted: grids of molecular structure drawings for the starting molecules, the MOLECULE CHEF reactants and products, and the GVAE (Kusner et al., 2017) and CVAE (Gómez-Bombarelli et al., 2018) walks, with expert annotations such as "toxic", "explosive", and "unstable".]

Figure 11: Another example random walk in latent space. See §4.4 of the main paper for further details. The rationales for the labels (using letters for rows and numbers for columns) are: A1: toxic, explosive, hydrazine, three consecutive aliphatic nitrogens; A2: toxic, hydrazine; A3: toxic, unstable, hydrazine, hemithioacetal; A4: toxic, hydrazine; A5: unstable, could be oxidized to 1,3,4-triazole, potentially also toxic due to the N-N bond/hydrazine; A6: unstable, three-membered ring is antiaromatic; A7: toxic, hydrazine; A8: toxic, explosive, hydrazine, three consecutive aliphatic nitrogens; B4: unstable, hemiacetal; B5–B8: unstable, unfavorable ring systems.

A.3 Further Retrosynthesis Results

In this section we first provide more retrosynthesis examples, before also describing an extra experiment in which we assess how good the retrosynthesis pipeline is at finding molecules with similar properties, even when it does not reconstruct the correct reactants themselves.

A.3.1 Further Examples


[Figure 12 graphics omitted: four panels, (a)–(d), each showing a USPTO reaction (reactants and products) alongside the corresponding predicted reactants and predicted product.]

Figure 12: Further examples of the predicted reactants associated with a given product, for product molecules not in MOLECULE CHEF's training dataset but with reactants belonging to MOLECULE CHEF's vocabulary (i.e., the reachable dataset).

[Figure 13 graphics omitted: two panels, (a) and (b), each showing a USPTO reaction (reactants and products) alongside the corresponding predicted reactants and predicted product.]

Figure 13: Further examples of the predicted reactants associated with a given product, for product molecules not in MOLECULE CHEF's training dataset and with at least one reactant not part of MOLECULE CHEF's vocabulary (i.e., the unreachable dataset).


A.3.2 ChemNet Distances between Products and their Reconstructions

We also consider an experiment in which we analyze the Euclidean distance between the ChemNet embeddings of a product and its reconstructed product (found by feeding the original product through our retrosynthesis pipeline and then the Molecular Transformer). ChemNet embeddings are used when calculating the FCD score [40] between molecule distributions, and so hopefully capture various properties of a molecule [35]. Whilst learning MOLECULE CHEF we include an NN regressor from the latent space to the associated ChemNet embeddings, whose MSE loss is minimized during training.

To establish how randomly chosen pairs of molecules in our dataset differ from each other when measured using this metric, we provide a distribution of the distances between random pairs. This distribution is formed by taking each of the molecules in our dataset (consisting of all the reactants and their associated products) and matching it up with another randomly chosen molecule from this set, before measuring the Euclidean distance between the embeddings of each of these pairs.
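
Concretely, the two distance distributions can be computed as in the following sketch, where embed is a hypothetical callable returning the ChemNet embedding of a molecule as a NumPy vector:

import random
import numpy as np

def distance_distributions(products, reconstructions, all_molecules, embed):
    # Matched pairs: each product against its reconstruction.
    matched = [np.linalg.norm(embed(p) - embed(r))
               for p, r in zip(products, reconstructions)]
    # Random-pairs baseline: each molecule in the dataset against another
    # randomly chosen molecule from the same set.
    partners = random.sample(all_molecules, len(all_molecules))
    random_pairs = [np.linalg.norm(embed(a) - embed(b))
                    for a, b in zip(all_molecules, partners)]
    return matched, random_pairs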

The results are shown in Figure 14. We see that the distribution of distances between the products and their reconstructions has greater mass on smaller distances compared to the random-pairs baseline.

[Figure 14 plots omitted: kernel density estimates of the matched-pairs and random-pairs distance distributions.]

Figure 14: KDE plots showing the distribution of the Euclidean distances between the ChemNet embeddings [40] of our products and reconstructed products. (a) Evaluated on the portion of USPTO test set reactions for which both reactants are present in MOLECULE CHEF's vocabulary. (b) Evaluated on the portion of USPTO test set reactions for which at least one reactant is not present in MOLECULE CHEF's vocabulary.

A.4 Details about our Dataset

In this section we provide further details about the molecules used in training our model and the baselines. We also describe details of the molecules used in the retrosynthesis experiments.

For MOLECULE CHEF's vocabulary we use reactants that occur at least 15 times in the USPTO train dataset, as processed and split by Jin et al. [21]. This dataset uses reactions collected by Lowe [34] from USPTO patents. In total we have 4344 reactants, and a training set of 34426 unique reactant bags in which these reactants co-occur. Each reactant bag is associated with a product.
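
A minimal sketch of this vocabulary construction (our own illustration; train_reactions is assumed to be an iterable of lists of canonical reactant SMILES):

from collections import Counter

def build_vocabulary(train_reactions, min_count=15):
    # Count how often each reactant occurs across the training reactions.
    counts = Counter(smiles for reaction in train_reactions
                     for smiles in reaction)
    # Keep only reactants that occur at least `min_count` times.
    return {smiles for smiles, n in counts.items() if n >= min_count}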

For the baselines we train on these reactants and the associated products. This results in a dataset of approximately 37000 unique molecules, containing a wide variety of heavy elements:

{'Al', 'B', 'Br', 'C', 'Cl', 'Cr', 'Cu', 'F', 'I', 'K', 'Li', 'Mg', 'Mn', 'N', 'Na', 'O', 'P', 'S', 'Se', 'Si', 'Sn', 'Zn'}.

Some examples of the molecules found in the dataset are shown in Figure 15. Note that the large number of heavy elements present, as well as the small overall dataset size, makes for a challenging learning task compared to some of the more common benchmark datasets used elsewhere (such as ZINC [19]).

We use examples from the USPTO test dataset when performing the retrosynthesis experiments. However, we first filter out any reactions for which the exact same reactant/product multiset tuple is also present in the training data for MOLECULE CHEF (after canonicalisation and the removal of reagents, the USPTO train and test datasets have some reactions present in both sets).


[Figure 15 graphic omitted: a grid of example molecular structure drawings from the dataset.]

Figure 15: Examples of molecules found in the dataset we use for training the baselines. This is a subset of the molecules found in USPTO [34]. It consists of the reactants that MOLECULE CHEF can produce along with their corresponding products. It contains complex molecules with challenging structures to learn.

We then split the resultant dataset into two subsets. The first, which we refer to as the reachable dataset, contains only reactants in MOLECULE CHEF's vocabulary. The second, which we refer to as the unreachable dataset, contains reactions with at least one reactant not in the vocabulary.
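
This split amounts to a short filter over the test reactions, sketched below under the assumption that each reaction is a (reactants, product) tuple of SMILES strings:

def split_test_reactions(test_reactions, vocabulary):
    reachable, unreachable = [], []
    for reactants, product in test_reactions:
        # Reachable: every reactant is in MOLECULE CHEF's vocabulary.
        if all(r in vocabulary for r in reactants):
            reachable.append((reactants, product))
        else:
            unreachable.append((reactants, product))
    return reachable, unreachable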

A.5 Implementation Details

Details of MOLECULE CHEF’s architecture and parameters

An overview of MOLECULE CHEF's architecture can be seen in Figure 16. The encoder takes in a multiset of reactants and outputs the parameters of a Gaussian distribution over z. The decoder maps from the latent space to a multiset of reactant molecules. Both of these networks rely in turn on vector representations of molecules computed by a graph neural network. We provide details of these networks' architectures below, as well as training details. Further information can be found in our code, available at https://github.com/john-bradshaw/molecule-chef.

Computing vector representations of molecular graphs using graph neural networks For computing the vector representations of molecular graphs we used Gated Graph Neural Networks [31], with the same network shared between the encoder and decoder. We run these networks for 4 propagation steps, and the node representations have a dimension of 101. We initialise the node representations with the atom features shown in Table 3. The final step's node representations are projected down to a dimension of 50 using a learnt linear projection. Graph-level representations are formed from these node representations by performing a weighted sum.
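
A minimal PyTorch sketch of the projection and readout step (our own names; the exact form of the learnt weighting is not stated above, so the sigmoid gate here is an assumption in line with standard GGNN readouts):

import torch
import torch.nn as nn

class WeightedSumReadout(nn.Module):
    # Projects final GGNN node representations (dim 101) down to dim 50
    # and combines them into a graph-level vector via a learnt weighted sum.
    def __init__(self, node_dim=101, graph_dim=50):
        super().__init__()
        self.project = nn.Linear(node_dim, graph_dim)
        self.gate = nn.Linear(node_dim, 1)  # assumed form of the weighting

    def forward(self, node_reprs):  # node_reprs: [num_nodes, node_dim]
        weights = torch.sigmoid(self.gate(node_reprs))  # [num_nodes, 1]
        return (weights * self.project(node_reprs)).sum(dim=0)  # [graph_dim]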



[Figure 16 diagram omitted: the encoder maps a multiset of reactants, via reactant embeddings and an order-invariant combination (a sum), to a reactant-bag embedding and a Gaussian in latent space; the decoder maps a latent point through an RNN cell that emits reactant probabilities (including a HALT symbol) to generate a reactant bag, which the reaction predictor maps to a multiset of products.]

Figure 16: Overview of MOLECULE CHEF showing how the encoder and decoder fit together.

Table 3: Atom features we use as input to the GGNN. These are calculated using RDKit.

Feature                    Description
Atom type                  72 possible elements in total, one hot
Degree                     One hot (0, 1, 2, 3, 4, 5, 6, 7, 10)
Explicit valence           One hot (0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14)
Hybridization              One hot (SP, SP2, SP3, Other)
H count                    Integer
Electronegativity          Float
Atomic number              Integer
Part of an aromatic ring   Boolean

Encoder The encoder sums the vector representations of the molecules present in the reactant multiset to get a 50-dimensional vector representation of the entire multiset. This representation is fed through a single-hidden-layer NN (with a hidden layer size of 200) to parameterise the mean and diagonal of the covariance matrix of a 25-dimensional multivariate Gaussian distribution over z.
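
A minimal PyTorch sketch of the encoder (our own names; parameterising the diagonal covariance via a log-variance output is an assumption):

import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    def __init__(self, mol_dim=50, hidden=200, latent=25):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(mol_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent))

    def forward(self, mol_embeddings):  # [bag_size, mol_dim]
        bag = mol_embeddings.sum(dim=0)  # order-invariant combination
        mean, log_var = self.mlp(bag).chunk(2, dim=-1)
        return mean, log_var  # parameters of the Gaussian over z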

Decoder The decoder maps from the latent space, z, to a multiset of reactants. It does this through a sequential process, selecting one reactant at a time using a gated recurrent unit (GRU) [9] RNN. The parameters used for this GRU are shown in Table 4. The initial hidden state of the RNN is set using a learnt linear projection of z. The final output of the GRU is fed through a single-hidden-layer NN (with a hidden size of 128) to form a final output vector. The dot product of this final output vector with each of the possible reactant embeddings, as well as with the HALT embedding, forms the logits for the next output of the decoder. The embedding of the selected reactant is fed back in as the input to the RNN at the next step.
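
A greedy-decoding sketch of this procedure in PyTorch (module and variable names are our own; the initial RNN input, start_input, is an assumption, as the text above does not specify it):

import torch
import torch.nn as nn

class BagDecoder(nn.Module):
    def __init__(self, latent=25, hidden=50, mlp_hidden=128, embed_dim=50,
                 num_layers=2, max_steps=5):
        super().__init__()
        self.num_layers, self.hidden, self.max_steps = num_layers, hidden, max_steps
        self.init_h = nn.Linear(latent, num_layers * hidden)
        self.gru = nn.GRU(embed_dim, hidden, num_layers)
        self.out_mlp = nn.Sequential(nn.Linear(hidden, mlp_hidden), nn.ReLU(),
                                     nn.Linear(mlp_hidden, embed_dim))

    def forward(self, z, embeddings, halt_idx, start_input):
        # embeddings: [vocab_size + 1, embed_dim]; row `halt_idx` is HALT.
        h = self.init_h(z).view(self.num_layers, 1, self.hidden)
        x = start_input.view(1, 1, -1)
        chosen = []
        for _ in range(self.max_steps):
            out, h = self.gru(x, h)
            # Logits are dot products with the reactant (and HALT) embeddings.
            logits = embeddings @ self.out_mlp(out.view(-1))
            idx = int(logits.argmax())
            if idx == halt_idx:
                break
            chosen.append(idx)
            x = embeddings[idx].view(1, 1, -1)  # feed the selection back in
        return chosen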


Table 4: Parameters for the GRU used in the decoder

Parameter                      Value
GRU hidden size                50
GRU number of layers           2
GRU maximum number of steps    5

Property Predictor In §3.2 of the main paper we discuss how we can also train a property predictor from the latent space to a property of interest, such as the QED, while training the WAE. For the QED property predictor NN we use a fully connected network with two hidden layers, both with a dimensionality of 40. The loss from this network is added to the WAE loss when training the model for the local optimization and retrosynthesis tasks.

Training We train the WAE (and the property predictor, when applicable) for 100 epochs. We use the Adam optimizer [28] with an initial learning rate of 0.001, and decay the learning rate by a factor of 10 every 40 epochs.
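
This schedule corresponds to a standard PyTorch setup, sketched below with a dummy model and loss standing in for MOLECULE CHEF and the WAE objective:

import torch

model = torch.nn.Linear(50, 25)  # dummy stand-in for MOLECULE CHEF
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 10 every 40 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(100):
    loss = model(torch.randn(8, 50)).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr: 1e-3 -> 1e-4 (epoch 40) -> 1e-5 (epoch 80)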

Implementation Details for the Baselines in Section 4.1 of Main Paper

For the baselines in the generation section in the main paper we use the following implementations:

• CGVAE [33]: https://github.com/microsoft/constrained-graph-variational-autoencoder
• LSTM [50]: https://github.com/BenevolentAI/guacamol_baselines
• AAE [25, 39]: https://github.com/molecularsets/moses/tree/master/moses/aae
• GVAE [30]: https://github.com/mkusner/grammarVAE
• CVAE [15]: https://github.com/mkusner/grammarVAE

The LSTM baseline implementation follows Segler et al. [50], whose alphabet is a list of all individual element symbols plus the special characters used in SMILES strings. This differs from the alphabet used by the decoder in the Molecular Transformer [48], which instead extracts "bracketed" atoms directly from the training set; this means that a portion of a SMILES string such as [OH+] or [NH2-] is represented as a single symbol, rather than as a sequence of five symbols. A regular expression can be used to extract a list of all such sequences from the training data. Effectively, this trades off an increased alphabet size (from 47 to 203 items) against a reduced chance of making syntax errors or suggesting invalid charges. In practice we found very little qualitative or quantitative difference in the performance of the LSTM model between the two alphabets; for the sake of consistency with MOLECULE CHEF we report the baseline using the larger alphabet.
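
The sketch below illustrates the idea; the regular expression is our own simplification rather than the exact one used by the Molecular Transformer, and a full tokeniser would additionally treat two-letter element symbols such as Cl and Br outside brackets as single tokens:

import re

BRACKETED = re.compile(r"(\[[^\]]+\])")  # e.g. [OH+] or [NH2-]

def tokenize(smiles):
    tokens = []
    for piece in BRACKETED.split(smiles):
        if piece.startswith("["):
            tokens.append(piece)   # whole bracketed atom as one symbol
        else:
            tokens.extend(piece)   # remaining characters one at a time
    return tokens

# tokenize("C[NH2-]c1ccccc1") ->
#   ['C', '[NH2-]', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']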

For the CGVAE we decide to include element-charge-valence triplets that occur at least 10 times over all the molecules in the training data. At generation time we pick one starting node at random.

Other Details

The majority of the experiments for MOLECULE CHEF were run on an NVIDIA Tesla K80 GPU. For running the Molecular Transformer and CGVAE we used NVIDIA P100 and P40 GPUs, as the latter in particular required a large-memory GPU for training on the larger datasets.

For MOLECULE CHEF we have not tried a wide range of hyperparameters. For the latent dimensionality we initially tried a dimension of 100 before trying, and sticking with, 25. Initially we did not anneal the learning rate, but we found slightly improved performance by annealing it by a factor of 10 after 40 epochs. These changes were made after considering the reconstruction error of the model on the validation set (the validation dataset of USPTO restricted to the reactants in MOLECULE CHEF's vocabulary).
