Artificial intelligence in drug design: generative adversarial network for molecules generation

Master Thesis
to obtain the academic degree of
Master of Science
in the Master's Program
Bioinformatics

Submitted by: Isaac Lazzeri
Submitted at: Institute of Bioinformatics
Supervisor: Univ.-Prof. Dr. Sepp Hochreiter
Co-Supervisor: Mag. Dr. Günter Klambauer
February 2018

Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Österreich, www.jku.at, DVR 0093696
Acknowledgment
I would like to express my deep gratitude to Dr. Günter Klambauer and Professor Dr. Sepp Hochreiter for their patient guidance and useful critiques of this master thesis. I would also like to thank all the people working at the Bioinformatics department for their constructive recommendations and help.
Finally, I wish to thank my family and Sandra for their support and encouragement throughout my studies.
Contents
Abstract
Zusammenfassung
1 Introduction
1.1 Soft introduction to machine learning
1.2 From Neural Networks to Deep Learning
1.2.1 History of Neural Networks
1.2.2 Artificial Neural Networks
1.3 Generative Adversarial Network (GAN)
1.4 Machine Learning in chemoinformatics
1.5 Aims of the master thesis
2 Methods
2.1 Simplified molecular-input line-entry system
2.2 Molecular fingerprints
2.3 ChEMBL
2.4 Data-sets preparation
2.5 Evaluation
2.5.1 Tanimoto coefficient
2.5.2 Fréchet inception distance
2.5.3 Fréchet Tox21 Distance
2.6 Chemo-GAN
2.7 Latent-Space-GAN
2.7.1 Auto-encoder
2.7.2 Generator and Discriminator
3 Results
3.1 Results Chemo-GAN
4 Discussion
5 Conclusion
List of Figures
1.1 Example of decision tree
1.2 Perceptron, F. Rosenblatt 1957-1958
1.3 Logistic function and Heaviside function
1.4 Structure of an artificial neuron
1.5 Forward pass for one layer
1.6 Deltas propagation
1.7 Structures of recurrent neural networks
1.8 LSTM cell structure
1.9 DCGAN faces
1.10 Structure of a generative adversarial network
1.11 Workflows for QSAR and QSPR
2.1 ECFP generation process
2.2 ChEMBL
2.3 Sequence length distribution of SMILES strings
2.4 Character frequency in ChEMBL
2.5 Latent-Space-GAN
2.6 Accuracy measured during training for the auto-encoder using the linear latent space
2.7 Accuracy measured during training for the auto-encoder using the sigmoidal latent space
2.8 Distribution of the percentages of valid SMILES strings per generator
3.1 Results Chemo-GAN: FTOXD and Tanimoto coefficient
3.2 Results Chemo-GAN: FTOXD and Tanimoto coefficient per group
3.3 FTOXD measured every 500 updates for the Chemo-GAN
3.4 Learning curves of the Chemo-GAN
3.5 Learning curves for the Chemo-GAN
4.1 100 generated chemical compounds
4.2 FTOXD measured for the Latent-Space-GAN for valid generated SMILES strings sampled from priors with different SD
List of Tables
1.1 Example of data-set
1.2 Timeline of the history of Neural Networks
2.1 Data-set for Chemo-GAN
2.2 Data-set for Latent-Space-GAN
2.3 Comparison between original and generated SMILES strings
Abstract
Introduction and background: The generation of new chemical compounds plays a key role in drug discovery, but in-silico methods based on hand-crafted rules can only cover a tiny part of the synthetically available chemical space. Therefore, computational methods able to automatically extract rules from data are desirable. The aim of this work is to adapt Generative Adversarial Networks (GANs) to generate novel chemical compounds.
Methods: Molecular fingerprints and canonical SMILES strings were retrieved from the ChEMBL data-set through the RDKit package and Python. A GAN was implemented using Keras and TensorFlow and trained on chemical fingerprints. This model, called Chemo-GAN, was implemented as fully connected deep neural networks. This is, to the best of our knowledge, the first time that GANs have been used to generate molecular descriptors. A second model was implemented using an auto-encoder to map SMILES strings to and from a latent space, and a GAN was trained to generate this latent-space representation of SMILES strings. The aim of this second approach was to obtain a generator able to produce latent-space representations of SMILES strings and a decoder able to map them back to SMILES strings. This model was called Latent-Space-GAN. To evaluate the distance between the distributions of original and generated data, and to assess sample quality, a new distance measure, called FTOXD, was designed and calculated during training, giving a way to tackle the hard problem of generative-model evaluation.
Results: Chemo-GAN successfully approximated the original data distribution, producing molecular fingerprints of high quality, as shown by the FTOXD, which decreases along the training process. Latent-Space-GAN is able to produce SMILES strings with a low FTOXD, but the percentage of valid SMILES strings is low. However, it keeps increasing during training, suggesting the possibility of better performance after longer training.
Zusammenfassung
Introduction and background: The development of new chemical compounds plays an important role in drug development, but in-silico methods based on hand-crafted rules can only cover a small part of the synthesizable chemical space. Therefore, computational methods able to find these rules automatically would be desirable. The aim of this master thesis is to adapt Generative Adversarial Networks (GANs) for the development of new chemical compounds.
Methods: Molecular fingerprints and canonical SMILES sequences were retrieved from the ChEMBL database using the RDKit package. A GAN was implemented in Keras and TensorFlow and trained on chemical fingerprints. This model, called Chemo-GAN, was implemented in the form of fully connected deep neural networks. This is, to the best of the authors' knowledge, the first time that GANs have been used to generate molecular fingerprints. A second method was developed, which uses an autoencoder to map SMILES sequences into a latent space; in addition, a GAN was trained to generate this latent-space representation. The aim of this second method was to obtain a generator able to produce representations that can be decoded into SMILES strings. This second model was called Latent-Space-GAN. To evaluate the distance between the distribution of the original data and that of the generated data, and to assess the quality of the generated SMILES strings, a new distance and quality criterion, called FTOXD, was developed and evaluated.
Results: The Chemo-GAN model was able to successfully approximate the distribution of the original data and to generate molecular fingerprints of high quality, as shown by means of the FTOXD. The Latent-Space-GAN model can generate valid SMILES sequences with a low FTOXD distance, but the fraction of valid SMILES sequences is low. However, it was also shown that more extensive model selection could yield better results.
1. Introduction
In the following chapters, the background needed to understand this thesis is given. Along with this introduction, a general explanation of machine learning and a detailed one concerning neural networks, deep learning and generative adversarial networks are provided. These are the basis on top of which the core methods of this thesis are built. Furthermore, the field of chemoinformatics is introduced, together with an explanation of the importance machine learning methods have in it.
1.1 Soft introduction to machine learning
Every day, billions of data points are generated and stored. Smart-phone applications, health-care systems, bank accounts and social networks are just some of the systems which are continuously generating and storing data. The technological development we have been witnessing during the last decades gave rise to the new problem of how to use this huge quantity of data to generate new knowledge. In this environment, machine learning arose [31].
In machine learning, data are fed into algorithms whose aim is to learn from them and solve a specific task.
Table 1.1: Example of data-set

              weight (x_{1:n,1})   eyes number (x_{1:n,2})   legs number (x_{1:n,3})   label (y_{1:n})
x_{1,1:m}     10                   2                         4                         dog
x_{2,1:m}     65                   2                         2                         human
x_{3,1:m}     0.85                 8                         8                         spider
x_{n,1:m}     ...                  ...                       ...                       ...
Examples of tasks are object recognition, stock-market trend prediction, photo caption generation, sequence-to-sequence translation, etc. But what does it mean for an algorithm "to learn"? When we think about learning processes, we generally associate them with human or animal behavior. People learn to associate names with persons or things, and the more often they come across them, the better they can recognize them. In this case, people learn to associate features of these persons, such as body size, voice tonality or hair color, with their names, and these associations become stronger each time. But learning is not only associating things or memorizing them. Indeed, it sometimes happens that we come across something whose specific characteristics we have associated with fear and experience this sensation even without having ever seen that thing before. This generalization process, which brought the association to our mind, represents the real difference between memorizing and learning, between algorithms with hand-crafted rules and machine learning ones. In machine learning, algorithms "observe" samples/objects composed of vectors of features, which may have labels representing their membership classes. Samples define a design matrix $X \in \mathbb{R}^{n \times m}$, where $X_{1,1:m}$ is the first sample and $X_{1:n,1}$ is the first feature vector. The design matrix, together with the label vector, constitutes a data-set. The data-set represents the "experience" algorithms have [31].
This "experience" is used to build a model of the perceived reality, upon which decisions and conclusions concerning the task to solve are based [31].
Machine learning involves many different algorithms and techniques, which can generally be grouped into three fields:

• Supervised machine learning (SML)

• Unsupervised machine learning (UML)

• Reinforcement learning (RL)

These three fields differ from one another in the kind of "experience" the algorithms are allowed to have. Indeed, while in SML algorithms observe both the features representing the samples and the labels defining the membership classes, in UML they are only allowed to inspect the features. In RL, instead, algorithms act in an environment, carrying out actions and observing their consequences, understanding through this trial-and-error approach which actions and states are favorable [31], [5], [52].
An example of a technique used in SML is the decision tree. This method builds a tree composed of nodes, which represent rules, leaves, which represent classes or predictions, and edges, which connect nodes with nodes or nodes with leaves. A sample follows a path from the root to a leaf. The path followed and the leaf reached depend on the rules encountered at each node and on the values of the features of the sample being classified. During the training process, rules are defined by splitting the training set into smaller sets with the aim of making them progressively more homogeneous. This is achieved when the splitting criterion that maximizes the information gain is found.
The information gain is defined as:
\[
IG = H(S) - \sum_{j=1}^{n} \frac{|S_j|}{|S|}\, H(S_j) \tag{1.1}
\]
\[
H(S) = -\sum_{i=1}^{k} p_i(S)\, \log\bigl(p_i(S)\bigr) \tag{1.2}
\]
where $H$ is the entropy, $S$ is the set of samples in the parent node, $n$ is the number of categories in the feature vector, $k$ is the number of classes and $\{S_j \mid j = 1, \dots, n\}$ are the sets obtained after applying the splitting criterion. In the case of categorical data, samples are grouped by the categories present in a specific feature vector, and the feature vector which maximizes the information gain is selected. In the case of numerical data, the splitting criterion is a value which acts as a threshold. For a specific feature vector, samples having values greater than the given threshold are grouped together, and so are those with lower values. The feature maximizing the information gain is accordingly selected [31], [5].
Figure 1.1: Example of a decision tree, with rule nodes on the number of legs (= 2) and on the weight (< 1 kg), and the leaves Human, Spider and Dog.
An example of an unsupervised technique is k-means. In this method, no labels are required. Each sample represents a point in a multidimensional space, where the number of dimensions is defined by the number of features a sample has. A defined number of cluster centers are initialized and the points are assigned to the closest cluster. After the assignment of all points, the position of each center is updated to the mean of the points assigned to it, where the distance measure is defined beforehand and can be, for example, the Euclidean distance or the Manhattan distance. These phases are repeated until the center coordinates stop changing or some stopping criterion has been reached [5].
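A compact sketch of the k-means loop just described (random initialization, assignment to the closest center, update of each center to the mean of its assigned points) is shown below; it assumes Euclidean distance, a fixed iteration budget as a fallback stopping criterion, and that no cluster becomes empty.

```python
# Minimal k-means sketch: assign points to the nearest center, then move each
# center to the mean of its assigned points; repeat until the centers stop moving.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest center for every point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([points[assignment == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # centers stopped changing
            break
        centers = new_centers
    return centers, assignment
```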
1.2 From Neural Networks to Deep Learning
1.2.1 History of Neural Networks
At the end of the 19th century, C. Golgi discovered the black reaction, which also became famous as the Golgi staining method, enabling the visualization of neurons under the light microscope [43]. Using this technique, Ramón y Cajal began his studies of the nervous system [61]. In the meanwhile, Sigmund Freud postulated the neuron theory [44], which was further expanded in the following years by C.S. Sherrington, who formalized the concept by suggesting that neural cells form a network and can communicate with one another through pathways along it [63]. These facts, together with Shannon's theory of information, laid the basis for the models proposed by McCulloch and Pitts in 1943 [41]. Indeed, they recognized that neurons, through their "all-or-none" activation, can accomplish the logic operations AND, OR and NOT, and that networks of neurons can define more complicated logic or arithmetic functions.
Some years later, Donald Hebb proposed that learning and experience are carried out in terms of synaptic changes, as he stated: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Based on this rule and the previous work of McCulloch and Pitts, Marvin Minsky built the first neural computer, the SNARC [15], which, however, did not work as expected. It was instead the Mark 1 Perceptron [24] that was the first working neural computer, modelling the first Artificial Neural Network architecture composed of a single layer of neurons [35]. The Perceptron is an exemplification of the influence the neuronal theory had during these years. Its architecture clearly resembles that of a nervous cell, which receives impulses from other cells through dendrites and transmits them along the axon with an "all-or-none" response. In the Perceptron, this is modeled through the Heaviside step function and a weighted sum of the inputs. The Heaviside step function is a discontinuous function which is equal to one for all x greater than zero and zero otherwise. In this model, it plays the role of a switch, or activation, of the neuronal cell, transmitting a signal when the input, in this case the weighted sum, surpasses a certain threshold [35]. In this model, the learning process was carried out by adjusting the weights of the "network", which, in the Mark 1 Perceptron, were represented by potentiometers connected to an array of 400 photocells transmitting the input and updated through electric motors [5].
Figure 1.2: Structure of the Perceptron model proposed by F. Rosenblatt in 1957-1958. The output y is the result of a weighted sum of the inputs, in which x are the input values and w the weights, followed by a nonlinear transformation f, generally called the Heaviside step function:
\[
f = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} x_j w_j \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]
A further success was the ADALINE (ADAptive LInear NEuron), which was invented by Widrow and Hoff, who used memistors (resistors with memory) to realize an adaptive neuron for pattern classification [57], [58]. The model was similar to the Perceptron, but it received and transmitted signals as plus or minus one. It also included a fixed input equal to one, regulated by an extra weight [57]. This was the first commercially used neural network device, which after 1960 was used in the majority of analog telephones as an echo filter [35]. The excitement for these discoveries and breakthroughs had a significant impact, and optimism and interest kept growing, enhanced by some scientists' far-too-optimistic statements, which, however, played a key role in attracting public and private funding [35]. This first hype came to an end in 1969, when Marvin Minsky and Seymour Papert published the book "Perceptrons", in which they mathematically analyzed the weaknesses of the Perceptron and similar approaches, proving in a rigorous manner that these methods were not able to solve trivial problems like the XOR, where data are not linearly separable [36]. As a consequence, the interest in and the funding for this field started diminishing.
The field, which until that point had been seen with enthusiasm and optimism, entered its first winter, considered to be a dead end [35]. Despite the little funding, the field survived this winter, which lasted until the 1980s. During these years, most of the research on neural networks was carried out in signal processing, biological modelling and pattern recognition, as can be observed in the work of T. Kohonen, who in 1972 suggested a linear model for associative memory, proposing its use for pattern classification, and at the beginning of the 1980s described Self Organizing Maps (SOM) [36]. In the same years, Werbos, who had been inspired by Freud's psychological theories and a previous paper by Minsky about the use of reinforcement learning to address general-purpose AI problems, had the intuition of using back-propagation to train neural networks, solving in this way the problem that Rosenblatt's Perceptron model had encountered, ironically pointed out by Minsky and Papert. He proved that, with the use of a differentiable function and the chain rule, he could train a multi-layer Perceptron, enabling it to solve non-linear problems such as the XOR. However, the effect of the publication of "Perceptrons" was still too strong, and no one considered the idea of publishing these discoveries until 1986, when "Learning representations by back-propagating errors" was published [47], [36]. As Werbos recounted: "In the early 1970s, I did in fact visit Minsky at MIT. I proposed that we do a joint paper showing that MLPs can in fact overcome the earlier problems ... But Minsky was not interested. In fact, no one at MIT or Harvard or any place I could find was interested at the time" [56].
After the publication of Rumelhart's paper, and thanks to the efforts of Hopfield, who managed to attract the interest of new researchers to the field, neural network research began its second golden age, which officially
started with the first open conference on neural networks and the foundation of the International Neural Network Society (INNS) in 1987.
Table 1.2: Timeline of the history of Neural Networks

1943 • McCulloch and Pitts: the neuron as a logic-operation approximator
1949 • Hebb's rules
1951 • Marvin Minsky builds the SNARC
1957 • Frank Rosenblatt builds the Mark 1 Perceptron
1960 • Bernard Widrow and Marcian E. Hoff develop the ADALINE
1969 • Marvin Minsky and Seymour Papert publish "Perceptrons"
1972 • Teuvo Kohonen: linear associator model of associative memory
1973 • Christoph von der Malsburg: nonlinear neuron
1974 • Paul Werbos: backpropagation
1976 • Stephen Grossberg and Gail Carpenter: adaptive resonance theory
1982 • Teuvo Kohonen: Self Organizing Maps (SOM)
1983 • Fukushima, Miyake, Ito: Neocognitron
1983 • John Hopfield network
1985 • John Hopfield: Hopfield nets for the solution of the travelling salesman problem
1986 • David Rumelhart, Geoffrey Hinton and Ronald Williams rediscover the back-propagation algorithm and publish "Learning representations by back-propagating errors"
1986 • Rumelhart and McClelland: publication of the "PDP book"
1987 • IEEE: first open conference on neural networks
1987 • Foundation of the International Neural Network Society (INNS)
1988 • Foundation of the INNS journal Neural Networks
1989 • Foundation of the journal Neural Computation
1989 • Hornik, Stinchcombe and White: "Multilayer feedforward networks are universal approximators"
1989 • LeCun: handwritten digit recognition
1990 • Foundation of the IEEE Transactions on Neural Networks
1.2.2 Artificial Neural Networks
Artificial Neural Networks (ANNs) are composite parametric non-linear functions, which were inspired by the neuronal theory and in particular by the structure of the neuron [5]. The fundamental part of each network is the "neuron", also called a unit. An ANN has input, hidden and output units. They are connected by edges, whose strength is defined by weights. Each hidden and output unit is defined as the weighted sum of its inputs followed by a nonlinear function called the activation function [5]. Examples of activation functions are the logistic function, the Heaviside step function and the tanh function, which were the first ones used.
\[
y = \frac{1}{1+e^{-x}} \tag{1.3}
\]
\[
y = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} x_j w_j \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{1.4}
\]
Figure 1.3: Equation 1.3 represents the logistic (sigmoid) function. Equation 1.4 represents the Heaviside step function.
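For concreteness, the two activation functions of equations (1.3) and (1.4) can be written in a few lines of NumPy; this is only an illustration of the definitions above, not code from the thesis.

```python
# Logistic (sigmoid) and Heaviside step activations from eqs. (1.3)-(1.4).
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def heaviside(x):
    return np.where(x >= 0.0, 1.0, 0.0)
```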
Units can be grouped into layers, which are composed of those units having the same distance from the inputs. An artificial neural network is said to be fully connected when all the units in a layer are connected to all the units in the successive one. Following the nomenclature proposed in [5], which suggests counting only the trainable layers when defining the number of layers of a neural network, artificial neural networks are said to be shallow or deep when they have, respectively, one or more than one hidden layer. ANNs can be represented as directed acyclic graphs (i.e. feed forward neural networks) or as directed cyclic graphs (i.e. recurrent neural networks).
The training of a neural network can therefore be considered a two-step process: a first propagation of the information from the input layer to the output layer through the hidden ones, which is called the forward pass, and the calculation of the prediction error between the outputs and the targets followed by its back-propagation to the inputs, which is called the backward pass. The back-propagation of the error allows the assessment of the contribution of each neuron to the overall error, defining an index of the magnitude by which the weights need to be tweaked to reduce it [5].
\[
y = f_{act}\bigl(b + \mathbf{w}^{T}\mathbf{x}\bigr) \tag{1.5}
\]
Figure 1.4: Equation 1.5 represents the computation of the value of an artificial neuron; $f_{act}$ stands for activation function.
Until the discovery of this method by P. Werbos, the training of a neural network with multiple layers was thought to be impossible, and so was the training of a Perceptron able to solve the XOR problem, which, for this purpose, would have needed a hidden layer and a differentiable activation function. The expressive power of such networks was later affirmed by K. Hornik, who published in 1990 an article in which he proved that a feed forward neural network can approximate any continuous function arbitrarily well if the activation function is continuous, bounded and non-constant and the hidden layer has enough hidden units [29].
Training of an Artificial Neural Network
In this part, a more mathematical explanation of the training of a neural network is given.
\[
\mathbf{y}^{l} = f_{act}\bigl((\mathbf{W}^{l})^{T}\mathbf{a}^{l-1} + \mathbf{b}^{l}\bigr) \tag{1.6}
\]
Figure 1.5: Equation 1.6 shows how the values of all the units in one layer are calculated. Here $\mathbf{W} \in \mathbb{R}^{n \times m}$, where $n$ is the number of inputs and $m$ the number of units in the hidden layer, $\mathbf{a} \in \mathbb{R}^{n}$ is the activation vector of the previous layer (or the input vector) and $l$ is the index of the layer. Each value $y$ is the result of a weighted sum of the activations of the previous layer followed by a nonlinear transformation.
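A minimal NumPy transcription of equation (1.6), computing the activations of one fully connected layer from the previous layer's activations, is given below; the shapes follow the convention of the figure (W ∈ R^{n×m}), and the layer sizes and the tanh activation are illustrative choices.

```python
# One-layer forward pass, y^l = f_act((W^l)^T a^{l-1} + b^l), as in eq. (1.6).
import numpy as np

def dense_forward(a_prev, W, b, f_act=np.tanh):
    """a_prev: (n,) activations of layer l-1; W: (n, m); b: (m,)."""
    net = W.T @ a_prev + b          # weighted sums of the m units
    return f_act(net)               # element-wise activation

# Example with random weights: 5 inputs feeding a layer of 3 units
rng = np.random.default_rng(0)
y = dense_forward(rng.normal(size=5), rng.normal(size=(5, 3)), np.zeros(3))
```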
Let us define a data-set $D$ as:
\[
D = \bigl\{(\mathbf{x}^{(i)}, y^{(i)})\bigr\} \tag{1.7}
\]
where $\mathbf{x}^{(i)} \in \mathbb{R}^{n}$ represents a sample of the data-set $D$ having $n$ features and $y^{(i)}$ is the target of sample $i$.
Let us define a loss function as:
\[
L = L\bigl(y^{(i)}, g(\mathbf{x}^{(i)}; \mathbf{w})\bigr) \tag{1.8}
\]
where $g(\mathbf{x}^{(i)}; \mathbf{w})$ is an ANN parametrized by the weights $\mathbf{w}$. Many different loss functions have been used in the ANN literature. Among these, the Mean Squared Error (MSE), the Mean Absolute Error (MAE), the Kullback-Leibler divergence (KL) and the cross-entropy are some of the most commonly used ones. They are defined as:
\[
L_{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^{2} \tag{1.9}
\]
\[
L_{MAE} = \frac{1}{n}\sum_{i=1}^{n}\bigl|y^{(i)} - \hat{y}^{(i)}\bigr| \tag{1.10}
\]
\[
L_{KL} = \frac{1}{n}\sum_{i=1}^{n} D_{KL}\bigl(y^{(i)} \,\|\, \hat{y}^{(i)}\bigr)
       = \frac{1}{n}\sum_{i=1}^{n} y^{(i)}\log\frac{y^{(i)}}{\hat{y}^{(i)}}
       = \underbrace{\frac{1}{n}\sum_{i=1}^{n} y^{(i)}\log\bigl(y^{(i)}\bigr)}_{\text{(negative) entropy}}
         \;-\;\underbrace{\frac{1}{n}\sum_{i=1}^{n} y^{(i)}\log\bigl(\hat{y}^{(i)}\bigr)}_{\text{(negative) cross-entropy}} \tag{1.11}
\]
\[
L_{cross\text{-}entropy} = -\frac{1}{n}\sum_{i=1}^{n} y^{(i)}\log\bigl(\hat{y}^{(i)}\bigr) \tag{1.12}
\]
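The four losses in equations (1.9)-(1.12) translate directly into NumPy; the sketch below assumes that y and y_hat are arrays of targets and predictions (probability vectors per sample in the KL and cross-entropy cases), and is only an illustration of the definitions above.

```python
# Losses from equations (1.9)-(1.12); eps avoids log(0).
import numpy as np
eps = 1e-12

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def kl_divergence(y, y_hat):
    # sum over classes of y * log(y / y_hat), averaged over samples
    return np.mean(np.sum(y * np.log((y + eps) / (y_hat + eps)), axis=-1))

def cross_entropy(y, y_hat):
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=-1))
```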
The importance of a loss function in the training of an ANN lies in the fact that it measures the error between the values predicted by the ANN ($\hat{y}$) and the targets ($y$); its value reflects the quality of a model, and it influences the way a model takes errors into account. This can easily be seen by comparing the MSE and the MAE. Indeed, in the first case large errors have a much greater impact due to the square, while the second one is more robust against outliers.
This error is generally called the empirical error, and it is an evaluation of the model based on data which may not fully represent the real data distribution. Therefore, a model trained on these data can incur over-fitting or under-fitting problems, where the first refers to a model able to fit the training data well but not able to generalize to unseen data, and the second refers to a too simple model that cannot capture the complexity of the data [5], [31]. Instead of minimizing the empirical error, the aim of a model should be the minimization of the generalization error, defined as the expected loss on future data. One approach to obtain an estimation of this error is the test set method. This approach consists in splitting the data-set into training and validation sets, using the first one to train the model and the second one to evaluate it. To obtain an unbiased estimation of the error of the model, a third set needs to be selected, which is only used after model selection to evaluate the quality of the final selected model. These concepts are especially important when using highly expressive models like ANNs, which can easily over-fit [17], [5].
Back-propagation
In practice, ANNs are trained using the training set and evaluated on a test set, which provides an estimation of the generalization error. On the one hand, this estimation gets better when the number of test samples increases; on the other hand, the quality of the model increases when the number of training samples increases [5]. Therefore, a trade-off between the number of training samples and the number of test samples is needed.
The training process is carried out using the back-propagation algorithm shown in Algorithm 1. As can be seen in the description, this algorithm
consists of two procedures, carried out for each sample and target in the data-set: the Forward-Pass and the Backward-Pass. In the Forward-Pass, for each sample of the data-set, the input layer of the ANN is initialized with the values of its features, then the inputs are propagated through the network:
\[
\mathbf{y}^{(0)} = \mathbf{x}^{(i)} \tag{1.13}
\]
\[
\mathbf{net}^{l} = (\mathbf{W}^{l})^{T}\mathbf{y}^{(l-1)} + \mathbf{b}^{(l)} \tag{1.14}
\]
\[
\mathbf{y}^{(l)} = f_{act}\bigl(\mathbf{net}^{l}\bigr) \tag{1.15}
\]
Using a previously selected loss function, such as those described in equations (1.9)-(1.12), the error between the value predicted by the ANN and the real one is calculated.

Algorithm 1 Back-Propagation
 1: procedure Back-Propagation(D, L, ANN)                        ▷ Forward Pass
 2:   for each (x_i, y_i) ∈ D do
 3:     y^0 := x_i                                               ▷ Initialization of the input layer
 4:     for each l ∈ (1, ..., L) do
 5:       y^l := f_act((W^l)^T y^{l-1} + b^l)                    ▷ Propagation of the information
 6:     end for
 7:                                                              ▷ Backward Pass
 8:     δ^L := ∂/∂net_out L(y_i, y^L)                            ▷ Calculation of the output error
 9:     for each l ∈ (L-1, ..., 1) do
10:       δ^l := (W^{l+1} δ^{l+1}) ⊙ f'_l(net^l)                 ▷ Back-propagation of the deltas
11:       W^l := W^l − η δ^l (y^{l-1})^T                         ▷ Update of the weights
12:       b^l := b^l − η δ^l                                     ▷ Update of the biases
13:     end for
14:   end for
15: end procedure
To assess the proportion in which each weight participated in the error, the gradient of the loss function with respect to a specific weight is calculated:
\[
\nabla w_{ij} = \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial net_{j}}\,\frac{\partial net_{j}}{\partial w_{ij}} \tag{1.16}
\]
where $net_j$ is the linear combination of the weights connecting the previous layer to unit $j$ and the values of the units in the previous layer, $(\mathbf{w}_j)^{T}\mathbf{y}$. The term $\frac{\partial L}{\partial net_j}$ is called the delta and denoted $\delta_j$. It represents the error that is back-propagated to the previous layer. The deltas for the output layer are easily calculated in the case of the halved quadratic loss as:
\[
\delta_{out} = \frac{\partial}{\partial net_{out}}\left(\frac{1}{2}\bigl(y^{(i)} - f(net_{out})\bigr)^{2}\right) = \bigl(f(net_{out}) - y^{(i)}\bigr)\, f'(net_{out}) \tag{1.17}
\]
and:
\[
\frac{\partial net_{out}}{\partial w_{i,out}} = \frac{\partial}{\partial w_{i,out}}(\mathbf{w}_{out})^{T}\mathbf{y} = y_{i} \tag{1.18}
\]
so:
\[
\nabla w_{i,out} = \delta_{out}\, y_{i} \tag{1.19}
\]
Figure 1.6: The deltas propagate back through the network as a weighted sum of the deltas in the output layer times the derivative of the activation function. The weights in the hidden layer are calculated considering the deltas and the activations in the ancestor layer.
\[
\frac{\partial L}{\partial \mathbf{W}^{l-1}} = \boldsymbol{\delta}^{l-1}(\mathbf{y}^{l-2})^{T} \tag{1.20}
\]
\[
\boldsymbol{\delta}^{l-1} = f'(\mathbf{net}^{l-1}) \odot (\mathbf{w}_{out})^{T}\delta_{out} \tag{1.21}
\]
\[
\frac{\partial L}{\partial \mathbf{W}^{l-1}} = \bigl(f'(\mathbf{net}^{l-1}) \odot (\mathbf{w}_{out})^{T}\delta_{out}\bigr)(\mathbf{y}^{l-2})^{T} \tag{1.22}
\]
From the last formula it is possible to understand that the deltas are propagated back, and that the weights of a hidden layer are updated taking into account the activations of the ancestor layer and the errors coming from the successor one.
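To make Algorithm 1 and equations (1.16)-(1.22) concrete, the following NumPy sketch performs one training step for a network with a single hidden layer and the halved quadratic loss; the sigmoid activation, the shapes and the learning rate are illustrative assumptions, not the configuration used later in the thesis.

```python
# One back-propagation step for a 1-hidden-layer network, halved quadratic loss.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, W1, b1, W2, b2, eta=0.1):
    # ---- forward pass (eqs. 1.13-1.15) ----
    net1 = W1.T @ x + b1
    y1 = sigmoid(net1)
    net2 = W2.T @ y1 + b2
    y2 = sigmoid(net2)
    # ---- backward pass: deltas (eqs. 1.17, 1.21) ----
    delta2 = (y2 - y) * y2 * (1 - y2)            # output-layer delta
    delta1 = (W2 @ delta2) * y1 * (1 - y1)       # back-propagated hidden delta
    # ---- gradient-descent update (Algorithm 1, lines 10-12) ----
    W2 -= eta * np.outer(y1, delta2)
    b2 -= eta * delta2
    W1 -= eta * np.outer(x, delta1)
    b1 -= eta * delta1
    return W1, b1, W2, b2
```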
Gradient descent based methods
The update rule used in Algorithm 1 to tweak the parameters w of the network is called gradient descent, and it is a general optimization technique used to find a solution to a problem by minimizing a cost function. It looks at the gradient of the cost function with respect to each weight and updates that weight in the direction of the negative gradient, which represents a step downhill in the direction of the steepest slope [5], [17]. The length of the step is defined by another parameter called the learning rate, denoted η in Algorithm 1. It is an important hyper-parameter which must be taken into account during the training of an ANN, because it determines the number of iterations the algorithm needs to approach the minimum (the smaller the step, the more steps and therefore the more time are needed), and it can lead to overstepping the minimum or jumping around it in case it is set to a too-large value, especially in the case of functions with many local minima [17]. During the past years, improvements to this algorithm were proposed to achieve better and faster convergence; examples of these improvements are Stochastic Gradient Descent (SGD), the RMSProp algorithm [26] and Adam [17]. The first one takes only a random sample at a time to calculate the gradient and updates the weights according to it. In this way it avoids using all the data for each update, speeding the process up and reducing the memory usage [17]. The second one uses the idea of Rprop of dividing the gradient by its magnitude and applies it to mini-batches. It does so by keeping a moving average of the squared gradient and dividing the gradient by it during the calculation of the update of a
mini-batch, as shown in equations (1.25) and (1.26) [17].
\[
\theta := \theta - \eta\nabla_{\theta}J = \theta - \eta\,\frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}J^{(i)}(\theta) \tag{1.23}
\]
\[
\theta := \theta - \eta\nabla_{\theta}J^{(i)}(\theta) \tag{1.24}
\]
\[
\mathbf{s} := \beta\mathbf{s} + (1-\beta)\,\nabla_{\theta}J(\theta) \otimes \nabla_{\theta}J(\theta) \tag{1.25}
\]
\[
\theta := \theta - \eta\,\nabla_{\theta}J(\theta) \oslash \bigl(\sqrt{\mathbf{s}} + \epsilon\bigr) \tag{1.26}
\]
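The update rules of equations (1.23)-(1.26) can be written directly in code; the sketch below applies them to a single parameter vector theta, with the hyper-parameters (eta, beta, eps) chosen purely for illustration.

```python
# Parameter-update rules from equations (1.23)-(1.26).
import numpy as np

def sgd_update(theta, grad, eta=0.01):
    """Eq. (1.24): step along the negative gradient of one sample/mini-batch."""
    return theta - eta * grad

def rmsprop_update(theta, grad, s, eta=0.001, beta=0.9, eps=1e-8):
    """Eqs. (1.25)-(1.26): keep a moving average of the squared gradient and
    divide the gradient element-wise by its square root."""
    s = beta * s + (1.0 - beta) * grad * grad
    theta = theta - eta * grad / (np.sqrt(s) + eps)
    return theta, s
```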
Deep Learning
Deep learning is a term used to describe the use of deep artificial neural networks (DNNs) to accomplish a specific task. DNNs are basically ANNs with more than one hidden layer. In 1991 and 1989, respectively, Hornik and Cybenko reached the conclusion and proved that an ANN can approximate any continuous function, provided that there are enough hidden units in the hidden layer [29]. However, the proof did not define how many units are enough, which could be an extremely large number, as suggested in [31] for the case of an ANN with one hidden layer and proved by Eldan and Shamir in [14] for 2-hidden-layer networks. In their work, they demonstrated that there is a function approximated by a 3-layer ANN which cannot be approximated by a 2-layer one unless the number of its hidden units is exponential in the dimension, suggesting that deep networks can achieve the same or better results using more layers with fewer units. Another problem concerns the way to find the function that approximates the one of interest. Indeed, the training may fail by choosing another function because of over-fitting, or the optimization method used may hinder the selection of the right parameters needed to approximate it [31]. Instead, DNNs were shown to generalize better in many works, and this, together with
the fact that exponentially fewer neurons per layer are needed, makes DNNs interesting.
The learning process in DNNs can be interpreted as a hierarchical process in which, in each hidden layer, from the first one to the last one, more and more abstract features are learned. This can be especially well observed in DNNs applied to image tasks. Indeed, in the case of facial images, in the first hidden layers features like edges and corners are learned, while in the successive ones details like noses, eyes, etc. can be appreciated. The use of DNNs to capture complex patterns had already been hypothesized during the 80s, but the computational power, and also the amount of data available, was not sufficient. Another problem encountered was related to the lack of a method able to train deep networks. Indeed, the gradient used to calculate the weight updates of a DNN was observed to vanish or explode when back-propagating the error through the layers, making the training of DNNs with many layers impossible. These problems were often observed and caused a loss of interest in DNNs, until the field was rescued by the success of the article published by Hinton and Osindero [27], who suggested that a good initialization would allow the training of a deep network. Specifically, they used an unsupervised method to initialize the weights of each layer of the network and showed that this method achieved the best performance on the MNIST data-set, which is a standard data-set used to evaluate a model's performance in machine learning. One of the most important breakthroughs was the discovery of the role played by the sigmoid activation function in the vanishing of the gradient. Other important discoveries were the importance of layer initialization, described by Glorot and Bengio in [18], and the Rectified Linear Unit (ReLU) as an alternative activation function. Indeed, this activation function does not saturate as
the sigmoid one does and is more easily computed, allowing in this way not only a better backward flow of the gradient, but also a speed-up of the process. Results obtained using this new non-linearity and a better initialization were even better than those obtained using unsupervised training of each layer. This provided an end-to-end learning system able to extract more and more complex features along the layers and to map inputs to outputs directly, thus minimizing hand-crafting [31], [36]. Deep learning methods achieved state-of-the-art results in many different fields, such as speech recognition, machine translation, text generation and music tracking, bringing a new wave of interest to the field of neural networks.
Figure 1.7: Figure (a) represents a simple ANN (one-to-one), while the others represent different configurations of RNNs unfolded in time, mapping a vector to a sequence (one-to-many), a sequence to a vector (many-to-one) and a sequence to a sequence (many-to-many). Figure source: [34]
Until this moment, only feed forward neural networks were considered, but there are other deep neural network architectures which have been highly successful in different tasks, such as convolutional neural networks for image classification and generation, auto-encoders for image denoising or data compression, and recursive neural networks for structured data. Recurrent neural networks have become more and more important during the last years for the analysis of sequences or time-dependent data. They can be considered a special case of recursive neural networks applied to unary trees, where each member of the unary tree is obtained by the application of the neural network on the predecessor node. In the case of a sequence represented as $\{x_0, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_n\}$, where $t$ is the position of an element in the sequence and $n$ denotes the last one, the network can be defined as:
\[
\mathbf{a}^{(t)} = \mathbf{b} + \mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} \tag{1.27}
\]
\[
\mathbf{h}^{(t)} = f_{act}\bigl(\mathbf{a}^{(t)}\bigr) \tag{1.28}
\]
\[
\mathbf{o}^{(t)} = \mathbf{c} + \mathbf{V}\mathbf{h}^{(t)} \tag{1.29}
\]
\[
\hat{\mathbf{y}}^{(t)} = f_{act}\bigl(\mathbf{o}^{(t)}\bigr) \tag{1.30}
\]
where U, W and V are matrices representing the weights between the input-hidden, hidden-hidden and hidden-output layers, respectively. The loss of the model in this case is given by the sum of the losses along all time steps. In these formulas, it can be seen that the calculation of the hidden layer takes into account the hidden layer at the preceding time step and the input at the current one, and that the parameters W, U and V are shared. With this in mind, an RNN can be seen as a deep neural network in which each layer shares the parameters and which, when unfolded in time, is represented as a directed acyclic graph (DAG). For this reason, RNNs are trained using a method similar to back-propagation called back-propagation-through-time (BPTT) and can incur the same problems of vanishing and exploding gradients. Therefore, long-range dependencies in the data are hard for the network to learn.
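A direct transcription of equations (1.27)-(1.30) into NumPy, unrolling the recurrence over a sequence, is shown below; the matrix sizes and the choice of tanh and softmax non-linearities are illustrative assumptions.

```python
# Vanilla RNN forward pass over a sequence, following eqs. (1.27)-(1.30).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    """xs: list of input vectors x^(t); U: input-to-hidden, W: hidden-to-hidden,
    V: hidden-to-output weight matrices; b, c: biases."""
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in xs:
        a_t = b + W @ h + U @ x_t      # eq. (1.27)
        h = np.tanh(a_t)               # eq. (1.28)
        o_t = c + V @ h                # eq. (1.29)
        outputs.append(softmax(o_t))   # eq. (1.30)
    return outputs
```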
A solution to this problem was achieved with the invention of the Long Short-Term Memory (LSTM) model, proposed by Hochreiter and Schmidhuber in [28]. The LSTM has a modular structure similar to that of RNNs, with incoming inputs from the previous LSTM cell and from the current time step, and outgoing information flowing to the next LSTM cell or being used to predict the output at that time step; but, in comparison with a normal RNN, LSTMs offer a way to regulate this flow of information through gates, defining what has to be "remembered" and what has to be "forgotten". The structure of an LSTM can be mathematically defined using the formulas from [31, pp. 410-411]:
Figure 1.8: LSTM cell structure: the σ represents the sigmoid functions used to build the forget, state and output gates; × and + represent element-wise multiplication and addition, respectively. Figure source: [42]
\[
f_i^{(t)} = \sigma\Bigl(b_i^{f} + \sum_j U_{i,j}^{f} x_j^{(t)} + \sum_j W_{i,j}^{f} h_j^{(t-1)}\Bigr) \tag{1.31}
\]
\[
s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)}\,\sigma\Bigl(b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)}\Bigr) \tag{1.32}
\]
\[
g_i^{(t)} = \sigma\Bigl(b_i^{g} + \sum_j U_{i,j}^{g} x_j^{(t)} + \sum_j W_{i,j}^{g} h_j^{(t-1)}\Bigr) \tag{1.33}
\]
\[
h_i^{(t)} = \tanh\bigl(s_i^{(t)}\bigr)\, q_i^{(t)} \tag{1.34}
\]
\[
q_i^{(t)} = \sigma\Bigl(b_i^{o} + \sum_j U_{i,j}^{o} x_j^{(t)} + \sum_j W_{i,j}^{o} h_j^{(t-1)}\Bigr) \tag{1.35}
\]
The internal structure of an LSTM involves four neural network layers, which are connected to one another in a special way to allow the cell to remember or forget things. Specifically, there is a forget gate, represented by the first σ in Figure 1.8 and calculated through equation 1.31, a state gate, represented by the second one and calculated by equation 1.32, and an output gate, represented by the third one and calculated by equation 1.35. The first gate is responsible for the removal of unwanted information, the second one is responsible for the addition of information to the cell state, and the last one defines the output [31]. These gates are neural network layers that use a sigmoid function, whose outputs are in the range zero to one, and an element-wise multiplication or addition to define the part of the information that has to be forgotten or added. Thanks to this element-wise addition, the cell does not suffer from the vanishing of the gradient, avoiding one of the major problems which hindered the learning of long dependencies with RNNs.
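The gate equations (1.31)-(1.35) can also be written compactly in vectorized form; the NumPy sketch below computes one time step of an LSTM cell, with one (U, W, b) weight triple per gate. The layout of the parameters is an illustrative assumption, not the thesis implementation.

```python
# One LSTM time step following eqs. (1.31)-(1.35), in vectorized form.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, params):
    """params holds one (U, W, b) triple per gate: forget "f", external input
    "g", output "o", and the cell-input transformation "c"."""
    Uf, Wf, bf = params["f"]; Ug, Wg, bg = params["g"]
    Uo, Wo, bo = params["o"]; Uc, Wc, bc = params["c"]
    f = sigmoid(bf + Uf @ x_t + Wf @ h_prev)                     # eq. (1.31)
    g = sigmoid(bg + Ug @ x_t + Wg @ h_prev)                     # eq. (1.33)
    q = sigmoid(bo + Uo @ x_t + Wo @ h_prev)                     # eq. (1.35)
    s = f * s_prev + g * sigmoid(bc + Uc @ x_t + Wc @ h_prev)    # eq. (1.32)
    h = np.tanh(s) * q                                           # eq. (1.34)
    return h, s
```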
Figure 1.9: Examples of faces generated with a Deep Convolutional Generative Adversarial Network. Figure source: [12]
1.3 Generative Adversarial Network (GAN)
Since 2006, the interest in deep learning methods has increased more and more, enhanced by the numerous successes achieved by these models. At the same time, deep learning methods themselves evolved impressively fast. New architectures to tackle different tasks, more powerful training algorithms and optimization techniques were proposed and applied to many tasks spanning from unsupervised to supervised and reinforcement learning. In the previous chapters, the Perceptron and the ADALINE networks were introduced; both are examples of discriminative models, which try to learn a conditional distribution, in which a model infers the label of a sample given its features. While for some tasks the manipulation of a conditional distribution is of interest, for others, being able to
manipulate the full joint distribution is desirable. In this chapter, Generative Adversarial Networks (GANs) will be introduced. Generative models are models that try to learn the distribution from which the data in the data-set were sampled. Here, this distribution is called the data generating distribution. According to I. Goodfellow's classification scheme proposed in [20], generative models can be divided into explicit density models, which are trained using the full joint distribution directly, and implicit density models, which are trained using the data generating distribution indirectly, by sampling from it. Examples of generative models are variational auto-encoders and Boltzmann machines, which are explicit density models, and generative adversarial networks, which are implicit density models. Generative models became of interest in those fields where the generation of new samples is required, for example in simulations when new scenarios must be generated, or coupled with reinforcement learning to generate new environments for the agents; they could also be used to give an RL agent some kind of imagination, where the generative model is used to generate new virtual environments in which the agent can carry out actions and hypothesize consequences. In the past years, generative models were successfully used for the improvement of image resolution [38], the generation of new images from sketches [32], and for music [33] and text generation [30]. Generative adversarial networks were proposed for the first time by Ian Goodfellow and colleagues in 2014 [21], and during the last years their popularity has exploded. A GAN is a model trained using an adversarial process in which two actors, a generator G and a discriminator D, compete with each other. While the task of the generator is to produce samples which look real, the task of the discriminator is to understand whether the generated samples are real or not.
Algorithm 2 Train GAN
 1: procedure TRAIN GAN(D, G, Data-set)
 2:   for number of training iterations do
 3:     for k steps do
 4:       Sample m noise samples from p_g(z)
 5:       Sample m samples from p_data(x)
 6:       Update D by ascending its stochastic gradient:
 7:         ∇_{θ_d} (1/m) Σ_{i=1}^{m} [ log D(x^{(i)}) + log(1 − D(G(z^{(i)}))) ]
 8:     end for
 9:     Sample m noise samples from p_g(z)
10:     Update G by descending its stochastic gradient:
11:       ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^{(i)})))
12:   end for
13: end procedure
During this process, the generator receives feedback about the quality of the generated samples from the discriminator and improves itself to produce samples of better quality, while the discriminator becomes better at understanding the origin of the samples. Both D and G can be any differentiable function, and a common choice is to use artificial neural networks. They can have different architectures, depending on the type of samples one wants to generate. The task of the generator is to map Gaussian noise to the sample space and to feed the discriminator with it. The discriminator receives both data coming from the generator and data from the original data-set. When the data come from the data-set, the discriminator tries to output 1, while, when they are generated samples, the generator will push the discriminator to output 1 and the discriminator will try to output 0. This process is often compared to a game because, at the same time, the discriminator is trying to minimize J^(D)(θ^(D), θ^(G)) but is only allowed to tweak the parameters θ^(D), while, on the other hand, the generator tries to minimize J^(G)(θ^(D), θ^(G)) but is allowed to tweak only the parameters θ^(G). This process is described in Algorithm 2, which is the original algorithm proposed by
Goodfellow and coworkers in [21]. In that article, the convergence of this algorithm and the equality between the generator's distribution and the data generating distribution when the global minimum is reached were proved. However, in practice this approach does not work well, due to the feedback passed from the discriminator to the generator in the form of a gradient, which becomes smaller and smaller as the confidence of the discriminator increases, as proved in [4]. For this reason, a different objective to train the generator, aiming to solve this problem, was defined as:
\[
J^{(G)} = -\frac{1}{2}\,\mathbb{E}_{\mathbf{z}}\,\log D\bigl(G(\mathbf{z})\bigr) \tag{1.36}
\]
This different objective should guarantee a better flow of the gradient from the discriminator to the generator, but further theoretical analysis carried out in [4] showed that the price paid for a better flow of the gradient is the instability that may emerge during training. Since the first article about GANs, new models aiming at improving the quality of the generated samples and at stabilizing the training process have been proposed. DCGAN, Wasserstein GAN and InfoGAN are some examples of these improved implementations. The popularity they achieved, thanks to the quality of the produced images compared to other generative models, gives hope for the use of this method for the generation of data required in other fields like chemistry and pharmacology.
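As an illustration of how the alternating updates of Algorithm 2 look in the framework used later in this thesis (Keras with the TensorFlow backend), the sketch below trains a toy fully connected GAN with the non-saturating generator objective of equation (1.36); the layer sizes, latent dimension and optimizer settings are illustrative assumptions and do not correspond to the Chemo-GAN configuration.

```python
# Minimal GAN training loop (Algorithm 2) sketched with Keras; not the thesis code.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim = 64, 1024        # e.g. a fingerprint length (assumed)

generator = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(data_dim, activation="sigmoid"),
])
discriminator = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Classic "frozen discriminator" pattern: D is frozen inside the combined model,
# so training it updates only G (equivalent to maximizing log D(G(z))).
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_gan(real_data, iterations=1000, batch_size=128):
    for _ in range(iterations):
        # --- update D on one real and one generated batch ---
        idx = np.random.randint(0, len(real_data), batch_size)
        z = np.random.normal(size=(batch_size, latent_dim))
        fake = generator.predict(z)
        discriminator.train_on_batch(real_data[idx], np.ones((batch_size, 1)))
        discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
        # --- update G through the frozen discriminator ---
        z = np.random.normal(size=(batch_size, latent_dim))
        gan.train_on_batch(z, np.ones((batch_size, 1)))
```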
Figure 1.10: Structure of a generative adversarial network. Vectors sampled from a Gaussian distribution with zero mean and variance equal to 1 are fed into the generator, generating fake samples. Fake samples and real ones are fed into the discriminator, whose task is to discriminate whether samples are real or not.
1.4 Machine Learning in chemoinformatics
In the past decades, the amount of data produced in the field of chemistry has undergone an exponential increase. Breakthroughs in different fields, spanning from array-based technologies to liquid-handling ones and robotics, allowed the miniaturization of common procedures, which were previously carried out manually by operators. This improvement made it possible to overcome the throughput of past technologies, making them compatible with ultra-High Throughput Screening (uHTS) methods and opening this field to computational methods and machine learning techniques [37]. Chemoinformatics or cheminformatics are the most common names used to refer to the application of these methods to chemical data [37].
General data used in chemoinformatics work-flows involve SDF files (Structure Data Format), WLN (Wiswesser Line Notation) or SMILES (Simplified Molecular Input Line Entry Specification), which are representations of 2D or 3D chemical structures. In the past years, the SMILES notation has become more and more common due to its simplified rules in comparison with the WLN ones [37]. The use of machine learning methods in chemoinformatics has become especially important since their use for the inference of molecules' properties or their activities in bio-assays. The first uses of these methods for these purposes, which are currently referred to as QSAR (Quantitative Structure-Activity Relationship) and QSPR (Quantitative Structure-Property Relationship) respectively, date back to 1935 [22] and 1964 [23]. Initially, only simple linear regression models on compounds with few descriptors, covering small chemical spaces, could be applied due to the lack of computational power and data, but nowadays these limitations are being overcome, and new methods extending the applicability of QSAR and QSPR to nonlinear classification and regression tasks have been implemented and exhaustively studied. Repositories like PubChem, ChEMBL, etc., have been created to allow storage and retrieval of chemical information, playing a key role in the evolution of chemoinformatics. The general work-flow of QSAR and QSPR follows two steps: an encoding step and a mapping one [39].
\[
\text{Activity or Property} = f(\text{structure}) = M\bigl(E(\text{structure})\bigr) \tag{1.37}
\]
During the encoding process, molecules are generally encoded into vectors of chemical descriptors, which are "numerical values that characterize properties of molecules", as defined in [37], calculated from the 2D or 3D
structure, the mutual orientation and the time-dependent dynamics of molecules, which are respectively named 2D, 3D and 4D chemical descriptors [7]. More than 5000 descriptors have been defined [49], [54]; among them, ClogP, molecular refractivity, topological indices like the Wiener index, and 2D fingerprints are some examples, which can be calculated with open-source software like PaDEL [62] or closed-source software such as DRAGON 7.0 [2]. During the second step, these vectors are mapped to a property or activity class through a function, which is generally what most machine learning methods try to optimize. Other approaches were suggested to directly extract features from molecular structures, reducing the problems concerning descriptor definition, computation and feature selection. An example of these methods was presented in [39] by Lusci and colleagues, in which they used a recursive neural network to encode undirected molecular graphs into vectors, retaining in this way both structural and chemical information and managing to obtain, with this model, state-of-the-art performance in the prediction of aqueous solubility [39].
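As a small example of the encoding step E(structure) described above, the snippet below uses RDKit (the package used later in this thesis to retrieve fingerprints) to compute a simple 2D descriptor and a circular fingerprint from a SMILES string; the example molecule and the fingerprint parameters (radius 2, 1024 bits) are illustrative choices.

```python
# Encoding step of eq. (1.37): from a SMILES string to descriptors/fingerprints.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"            # aspirin, used only as an example
mol = Chem.MolFromSmiles(smiles)

clogp = Descriptors.MolLogP(mol)             # a simple 2D descriptor
fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
bits = list(fingerprint)                     # 1024-dimensional 0/1 feature vector
```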
QSAR and QSPR methods are playing a significant role in drug discovery and especially in "de novo" drug design. Indeed, the prediction of molecular properties can be highly valuable for the evaluation of those chemical compounds which may pass all the phases of the drug development life cycle. The solubility of drugs in water is an example of this. Indeed, it determines the body absorption efficiency of the chemical compounds being analyzed, allowing the early rejection of compounds that would otherwise be discarded only at later stages. Recently, other measures, such as drug toxicity, highlight the usefulness of these models for the rejection of toxic molecules in early phases of drug development, thus increasing the quality of the selected candidates.
In the previous chapters, the success of deep learning models and generative models in many different fields was introduced. As a result, many attempts to apply these methods in chemoinformatics have been made. The work carried out at the Institute of Bioinformatics of the Johannes Kepler University Linz [40] is an illustrative example of the power of deep learning methods, multitask and ensemble learning in toxicity prediction. During the Tox21 data challenge, this model (DeepTox) achieved the best performance among all computational methods in many assays [40]. The work carried out by Gómez-Bombarelli and coworkers, instead, is an illustrative example of generative models in cheminformatics. Indeed, they used an RNN variational auto-encoder to encode SMILES strings into a latent chemical space and decode them back, enabling the use of this latent space for the generation of new molecules having specific properties [19]. The use of a 3-stacked-LSTM for the generation of molecules was presented by Segler et al. in their article [50], where they showed the ability of their model to produce both data-sets of general molecules and data-sets enriched in molecules with specific molecular properties, which they used to implement an in-silico "de novo" drug design cycle. Due to the high success of deep learning and generative models in chemoinformatics, and that of GANs for image generation, the use of a generative adversarial network for the generation of chemical compounds is proposed.
Figure 1.11: Workflows for QSAR and QSPR. Data are encoded into chemical descriptors and fingerprints and labeled with activities or properties. These data are then used to train a model, which will learn to predict the properties or activities of unseen chemical compounds.
1.5 Aims of the master thesis
The development of new drugs is a highly expensive and risky process, which involves the identification of a target and the definition of the lead
compound. This is generally achieved after searching libraries of chemical compounds for molecules with the desired characteristics, filtering out those showing undesired ones, such as toxicity and insolubility [37]. This filtering process is computationally and time expensive, but it can dramatically increase the probability that a selected compound reaches the market. It reduces the number of compounds that have to be tested in the laboratory, the amount of money that has to be invested in this process and the probability that compounds fail in successive steps. However, the success of these screening methods is highly dependent on the chemical compounds present in the searched libraries and on the hand-crafted rules used to screen them [37]. In the last years, the chemical space of synthetically available molecules was estimated to be around 10^60 [19]; unfortunately, even the newest technologies are still far from achieving this throughput and, consequently, computational methods like QSAR or QSPR are used to score the molecules present in chemical libraries, selecting in this way only the most promising ones for the next steps. As seen before, these methods are highly dependent on the compounds present in the chemical libraries, which in turn depend on the hand-crafted rules used at compound generation time. As a consequence, the search is led towards a specific part of the chemical space defined by these rules, leaving molecules lying in other parts overlooked [19]. Of course, new rules can be defined, but the time and effort required are huge. Therefore, new methods able to automatically extract these rules would be extremely desirable because, on the one hand, they would reduce the problem concerning the part of the chemical space analyzed and, on the other hand, they would not require the definition of rules.
In this thesis, generative adversarial networks are proposed for the first
time as a possible method to address these problems, allowing the generation of new chemical compounds without previously defined hand-crafted rules, covering in this way a wider chemical space and providing a new powerful tool for chemoinformatics and drug discovery. For this purpose, two architectures were implemented in Keras, called Chemo-GAN and Latent-Space-GAN respectively. The first one is a generative adversarial network trained on molecular fingerprints, where both generator and discriminator are fully connected artificial neural networks. The second one uses an auto-encoder to map SMILES strings from and to a latent space and uses this latent space representation of SMILES strings to train a generative adversarial network. In this case, the auto-encoder uses LSTM layers to consider the relations between each character and those preceding or following it, while the GAN uses fully connected neural networks for both the generator and the discriminator.
For the evaluation of these models a new metric, the Fréchet Tox21 Distance (FTOXD), was defined, inspired by the Fréchet Inception Distance (FID) proposed in [25]. With the definition of this metric, a way to evaluate the distance between the distribution of molecular fingerprints coming from the original data-set and the one represented by the generator part of the GAN is provided. This metric offers the possibility to measure the sample quality of generated SMILES strings and molecular fingerprints and to evaluate the capability of a GAN to approximate the original data distribution, considering during this evaluation also automatically extracted, highly relevant chemical features. For the evaluation of molecular fingerprint similarity the Tanimoto coefficient was also used. The second model was instead evaluated using the FTOXD together with the percentage of valid generated molecules. Both models were trained on data derived
from the ChEMBL data-set, while the model used for the calculation of the FTOXD was trained on the Tox21 data-set.
In the next chapter, these methods are explained in detail. In the first sub-chapter the data-sets used are described together with the work carried out to derive them. Afterwards, the methods for the evaluation of the models are described. Finally, the Chemo-GAN and the Latent-Space-GAN are presented.
2. Methods
2.1 Simplified molecular-input line-entry system
The invention of computers opened new possibilities for the storage of the chemical structures of chemical compounds. For this reason, during the 20th century, different systems were studied with the aim of efficiently storing chemical compound structures, avoiding redundancy and providing an easy and exchangeable data format. During the 1980s, the Simplified molecular-input line-entry system (SMILES) was invented [59], [37]. This method aims to represent chemical compounds in the form of unique and unambiguous ASCII strings [59]. SMILES representations which map each chemical compound to a single, unique string are generally called canonical SMILES. They generally depend on the software or canonicalization algorithm used. Indeed, chemical compounds can have multiple SMILES representations and different software packages have different strategies to select a specific one. A general approach is to use the CANON algorithm to prioritize the atoms in the given compound and subsequently write the SMILES atom by atom following the order defined by a depth-first traversal of the molecular graph [55, 420-432]. This prioritization guarantees the uniqueness and reproducibility of the generated representation. In this
representation, chemical elements are generally represented by their chemical symbols enclosed in square brackets - unless they are part of the organic subset defined as {N, C, B, O, P, F, S, Cl, I, Br} - and the adjacent characters represent the atoms to which they are connected. Side chains are represented using round brackets and bonds by the set {-, =, #, $}, whose symbols stand for single, double, triple and quadruple bonds respectively. Partial and ionic bonds are depicted by {:, .}. Chirality is expressed by the "@" symbol, indicating that the following neighbors are disposed anticlockwise around the chiral center; when they are disposed clockwise, the "@@" symbol is used. Cis and trans configurations are defined by "\" and "/" respectively, and aromaticity is represented by lowercase symbols, with the opening and closing points of a ring marked by a number, e.g. "c1ccccc1" [55, 420-432], [59], [37].
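As a small illustration of this canonicalization step, the following sketch uses the RDKit package (also used later in this thesis) to parse two different but equivalent SMILES strings and write them back in canonical form; the toluene example is purely illustrative.

from rdkit import Chem

# Two different, but equivalent, SMILES strings for toluene
smiles_variants = ["Cc1ccccc1", "c1ccccc1C"]

for smi in smiles_variants:
    mol = Chem.MolFromSmiles(smi)       # parse the SMILES string into a molecular graph
    canonical = Chem.MolToSmiles(mol)   # write it back using RDKit's canonicalization
    print(smi, "->", canonical)         # both variants map to the same canonical string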
2.2 Molecular fingerprints
Another way to represent molecules is the use of molecular fingerprints. Molecular fingerprints are vectors of features representing molecules, which are unique representations of the given chemical compounds [37]. There exist several types of molecular fingerprints, which differ in the way features are calculated and stored. They can generally be divided into binary fingerprints and count fingerprints [37]. The first ones are binary vectors, also called bit vectors or boolean vectors; each one in such a vector represents the presence of a specific feature in the chemical compound. The second ones store not only the features, but also the number of times each feature is found in a chemical compound. This is an important difference in comparison with binary fingerprints. Indeed, with count fingerprints there is less
ambiguity between fingerprints: two compounds having the same features in different quantities are mapped to the same fingerprint by the first method and to different ones by the second. There are different implementations, which generally use bit vectors or lists to store features. In the first case, for each feature, a one is set at the position of the vector representing that feature. In the second one, the position of each present feature is stored in a list. The first method, on the one hand, allows fast computations, which are especially desired in similarity-search tasks; on the other hand, it can be inefficient when the number of features owned by a chemical compound is low, due to the extremely sparse vectors that have to be stored. The second one may not be as fast, but it is an efficient implementation in terms of memory usage. Sparsity can be reduced through the "folding" process, which consists of dividing a bit vector into two symmetric parts and merging them with a logical OR operation. This method allows the compression of molecular fingerprints to bit vectors of the desired size and ratio of ones over zeros. However, as the vector size decreases, the risk of feature collisions increases, which may result in an undesired loss of information [55, 441-447]; a minimal sketch of this folding operation is given below. As mentioned before, there exist several types of molecular fingerprints, which depend on the way features are calculated. Extended connectivity fingerprints (ECFP), Chemical Hashed Fingerprints and Pharmacophore Fingerprints are the most used ones for different purposes, spanning from sub-structure searching to structure similarity and QSAR or QSPR experiments [3], [1], [13].
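The folding operation mentioned above can be sketched in a few lines: the bit vector is split into two halves of equal length, which are combined with a logical OR. This NumPy illustration is not the implementation of any particular fingerprint software; the positions of the example features are arbitrary.

import numpy as np

def fold_fingerprint(bits):
    """Fold a binary fingerprint to half its length by OR-ing its two halves."""
    half = len(bits) // 2
    return np.logical_or(bits[:half], bits[half:]).astype(np.uint8)

fp = np.zeros(4096, dtype=np.uint8)
fp[[3, 100, 2051]] = 1            # three example features
folded = fold_fingerprint(fp)     # length 2048; features 3 and 2051 collide at position 3
print(folded.sum())               # 2 set bits remain, i.e. one collision occurred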
Figure 2.1: ECFP generation process. For each atom, identifiers are calculated considering the neighborhood. The diameter 0, 1, 2, etc. refers to the distance in terms of bonds separating two atoms along the shortest path connecting them. The calculated identifiers are mapped to a specific position in a bit vector through a hashing function. Bit collision happens when two different identifiers are mapped to the same position in the binary vector. Figure source: [13]
Extended connectivity fingerprints
Extended connectivity fingerprints (ECFP) are circular topological fingerprints. The neighborhood of each atom in a chemical compound is used to calculate an integer, which defines the position of the represented feature in a binary vector. The neighborhood is defined in terms of the number of bonds lying on the shortest path connecting two nodes in the molecular graph representing a chemical compound. So, the neighborhood of diameter one around an atom corresponds to all atoms that are directly connected to it, the neighborhood of diameter two is represented by the atoms directly connected to the neighborhood of diameter one, and so on. In this way, different ECFPs are defined, which consider different diameters. The diameter used is generally specified by an integer written after the acronym ECFP: ECFP2 denotes ECFPs using a neighborhood of diameter 2, ECFP4 those using a neighborhood of diameter 4, and so on. The integer is calculated starting at the central atom by applying a hashing function, which can differ from implementation to implementation, to its specific properties. This integer is subsequently combined with those of the neighbors to calculate the integer for the neighborhood, and the process is repeated recursively until the desired diameter is reached. The integers of each neighborhood define identifiers representing features, which are mapped to specific positions in the bit vector representing the molecular fingerprint. In this way, the presence or absence of a specific one in the bit vector corresponds to the presence or absence of a specific feature in the analyzed molecule. For this reason, ECFPs are useful in structure-similarity search and less interesting for substructure search, which is therefore accomplished using Chemical Hashed Fingerprints [13].
2.3 ChEMBL
ChEMBL is a database storing information about drug-like molecules [16]. This information is generally manually extracted from scientific publications and undergoes multiple checks and normalization processes to ensure
Figure 2.2: Composition of ChEMBL version 23. The size of the circles is proportional to the number of activities and the names linked to each circle represent the bioassay data sources.
the correctness of the chemical structures, avoiding redundancy and ensuring consistency of representation. During this process, structures are checked for potential problems, different names referring to the same chemical compound are linked to one another, charges are neutralized and the units of measure used are standardized [16]. The retrieved information covers absorption, distribution, metabolism, excretion and toxicity (ADMET) properties, functional assays and binding assays, as well as the chemical structures of the compounds [16]. This makes ChEMBL an important data-set for many tasks in cheminformatics, such as QSPR experiments. In this master thesis, version 23 of this data-set was used [6], which contains 1,735,442 compounds, 14,675,320 activities, 1,302,147 assays, 11,538 targets and 67,722 documents.
2.4 Data-sets preparation
In the previous sub-chapter, the ChEMBL data-set was introduced. In this master thesis, in order to train both generative adversarial networks, Chemo-GAN and Latent-Space-GAN, the 23rd version of ChEMBL was used. The data-set was downloaded in Structure-Data-File (SDF) format. This type of format contains molecules in MDL format delimited by "$$$$" [11], [10]. This delimiter is not only used to include multiple MDL files in a single SDF, but also to include further meta data, i.e. molecular properties like the molecular weight or the molecular ID. MDL files are molecular representations whose structure can be divided into four parts. The first one is the header, which contains the name of the molecule and information concerning the program that generated the MDL file [10]. The second part contains information about the atoms, i.e. the spatial coordinates x, y, z and the type of each element [11]; therefore, this part is generally called the atom block. Each line in the third part describes a bond between two atoms, and this part is therefore called the bond block [11]. The last part of the MDL file contains further information about molecular properties [10]. To obtain a data-set that could be fed into the generative adversarial networks, the SMILES representation of each molecule contained in the SDF file had to be extracted. This extraction was carried out using the RDKit package in Python, which provides the function MolToSmiles() for this purpose [53]; a minimal sketch of this step is shown below. From this point on, two different work-flows were used to generate the data-sets containing molecular fingerprints and SMILES strings, which were used to train the Chemo-GAN and the Latent-Space-GAN respectively.
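The extraction step can be sketched as follows; the file name used here is a placeholder for the downloaded ChEMBL SDF file.

from rdkit import Chem

supplier = Chem.SDMolSupplier("chembl_23.sdf")   # hypothetical file name of the ChEMBL SDF export

smiles_list = []
for mol in supplier:
    if mol is None:                              # molecules that RDKit cannot parse are skipped
        continue
    smiles_list.append(Chem.MolToSmiles(mol))    # canonical SMILES string of the molecule

print(len(smiles_list), "SMILES strings extracted")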
Table 2.1 Examples of data-set used for the Chemo-GAN
SMILES strings 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 CCCCOC(=O)CSc1nnc(-c2cc(OC)c(OC)c(OC)c2)o1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
1 CCCCc1cc(O)c(CCCC)c(O)c1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 O=C(Nc1cc(N2CCOCC2)ncn1)c1ccccc1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 O=C(c1cccc(F)c1)N1CCCC2(CCN(C(c3ccccc3)c3ccccc... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1
4 Cc1noc(C)c1C(=O)N1CCC2(CCCN(C(c3ccccc3)c3ccccc... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
5 O=C(c1ccncc1)N1CCC2(CCCN(C(c3ccccc3)c3ccccc3)C... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
6 O=C(c1cnccn1)N1CCC2(CCCN(C(c3ccccc3)c3ccccc3)C... 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7 O=C(NCCN1CCOCC1)c1cc(-c2cccs2)on1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
8 CCCc1[nH]nc2c1C(C1CCCCC1)C(C#N)=C(N)O2 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 O=C1CC(c2ccco2)CC2=C1C1c3ccccc3C(=O)N1c1ccccc1N2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Table 2.1: Example of the data-set used for the training of the Chemo-GAN. Each row represents a chemical compound. The first column of each row represents the index, the second one the SMILES string, while all the others represent the molecular fingerprint bits (2,048 in total; only the first 19 are shown here).
Data-set generation for Chemo-GAN
To train the Chemo-GAN, the SMILES strings extracted from the SDF file downloaded from ChEMBL were converted into molecular fingerprints. To achieve this, the RDKit package was used. Specifically, Morgan fingerprints were calculated through the function GetMorganFingerprintAsBitVect(). This function uses a variant of the Morgan algorithm to calculate this type of molecular fingerprint, which are topological fingerprints comparable to the ECFPs. This implementation is described in [45]. Only SMILES strings with a length lower than or equal to 40 were considered. This resulted in a data-set of 382,688 molecules after the removal of invalid structures and duplicates. For these SMILES strings, molecular fingerprints with a diameter of 4 were calculated and folded to 2,048 bits, as sketched below.
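A minimal sketch of this fingerprint calculation is given below; note that in RDKit the neighborhood size is specified as a radius, so a radius of 2 corresponds to the diameter of 4 used here.

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles, n_bits=2048):
    """ECFP4-like Morgan fingerprint folded to n_bits (radius 2 = diameter 4)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                                        # invalid structures are discarded
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)               # copy the bit vector into a NumPy array
    return arr

print(smiles_to_fingerprint("CCCCc1cc(O)c(CCCC)c(O)c1").sum())   # number of set bits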
Data-set generation for Latent-Space-GAN
To train the Latent-Space-GAN, SMILES strings were converted to numbers. This encoding was carried out at the character level. Only SMILES strings with a length shorter than 40 characters were used. From this new data-set containing 382,754 chemical compounds, duplicates were removed, giving rise to a data-set of 382,748 SMILES strings. These SMILES strings were checked for structural problems with the RDKit package. After this check, a final data-set containing 382,688 chemical compounds was obtained. 53 different characters were found to be present in the data-set, but most of them had a low frequency. For this reason, and to speed the process up, a different number was assigned to each of the 15 most commonly encountered characters, while all the others were encoded with a single shared number. A dictionary was built to map the characters present in the SMILES strings to numbers and vice versa. At the end of each sequence an end-of-sequence symbol ("|") was added. Furthermore, sequences shorter than 40 characters were padded with zeros to obtain a data-set of sequences of the same length. A minimal sketch of this encoding is given below.
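In the sketch below, the list of the fifteen most common characters is hypothetical (the actual characters follow the frequencies shown in Figure 2.4); indices 16, 17 and 0 are used for uncommon characters, the end-of-sequence symbol and padding, as in Table 2.2.

import numpy as np

# Hypothetical reduced alphabet: the 15 most frequent characters get indices 1-15
common_chars = ['C', 'c', '(', ')', '1', 'O', '2', '=', 'N', 'n', '3', 'F', '[', ']', 'S']
char_to_int = {ch: i + 1 for i, ch in enumerate(common_chars)}
UNKNOWN, END, MAX_LEN = 16, 17, 40

def encode_smiles(smiles):
    codes = [char_to_int.get(ch, UNKNOWN) for ch in smiles] + [END]   # unknown chars -> 16, then "|" -> 17
    codes += [0] * (MAX_LEN - len(codes))                             # pad with zeros up to length 40
    return np.array(codes[:MAX_LEN], dtype=np.int32)

print(encode_smiles("CCCCc1cc(O)c(CCCC)c(O)c1"))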
2.5 Evaluation
The quality of the Chemo-GAN and the Latent-Space-GAN was evaluated through a new distance measure, which was called the Fréchet Tox21 Distance (FTOXD). Furthermore, to detect possible mode collapses, the Tanimoto
Figure 2.3: Sequence length distribution of the SMILES strings present in the data-set, expressed as a percentage.
Figure 2.4: Number of times each character appears in the data-set, expressed as a percentage.
Table 2.2 Examples of data-set used for the training of the Latent-Space-GAN
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
0 1 5 3 1 8 9 1 3 7 10 2 3 7 4 7 4 16 1 8 4 1 9 3 1 3 1 5 4 2 4 2 17 0 0 0 0 0 0 0 0
1 2 3 10 7 4 3 2 1 5 1 1 1 3 1 1 5 4 6 4 1 5 1 3 1 1 3 1 3 1 5 6 4 6 2 4 6 4 6 17 0
2 14 9 16 13 5 1 8 1 1 1 3 1 1 8 1 1 5 2 3 10 6 4 6 2 2 4 2 3 10 6 4 6 17 0 0 0 0 0 0 0
3 2 3 10 6 4 3 1 5 1 1 1 3 1 1 5 4 16 4 7 6 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 9 5 1 3 7 2 4 1 8 1 3 9 3 1 9 8 4 2 4 1 8 1 5 16 1 3 9 8 4 15 2 17 0 0 0 0 0 0 0 0
Table 2.2: Example of the data-set used for the training of the Latent-Space-GAN. Each row represents a SMILES string. The fifteen most common characters are encoded with integer numbers from one to fifteen. The uncommon characters are encoded with the number sixteen. The number seventeen encodes the end of the sequence and zero is used for the padding.
coefficient was used. In the case of the Latent-Space-GAN, the percentage of generated valid SMILES strings was also calculated. While the aim of the FTOXD is the assessment of the similarity of the distributions, the Tanimoto coefficient was used to detect possible mode collapses. Indeed, without a measure to assess this, a model could be considered good even though it always produces the same SMILES string, simply because of the high score given to the produced molecules. The use of the FTOXD was inspired by the Fréchet Inception Distance used by Martin Heusel and coworkers in [25]. In the following sub-chapters, these measures are introduced.
2.5.1 Tanimoto coefficient
The Tanimoto coefficient was defined for the first time in 1960 [46] as a possible measure of plant similarity. In this context, it was used to express the similarity of plants in terms of the shared and distinct features shown by the plants. A set of features can be represented as a bit vector with one entry per feature, in which each entry equal to one represents the presence of the feature encoded at that position in the considered chemical compound, as explained in chapter 2.2 for molecular fingerprints. Using bit vectors, the
Tanimoto coefficient is expressed as:
T_s = \frac{\sum_i (X_i \wedge Y_i)}{\sum_i (X_i \vee Y_i)}    (2.1)
where ∧ and ∨ represent the logical AND and logical OR operators respectively. The Tanimoto coefficient was used to check the similarity of the generated fingerprints to one another. This coefficient was measured between 100 pairs of generated fingerprints every 500 updates.
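A minimal implementation of Equation 2.1 on binary fingerprints is sketched below; the two short example vectors are arbitrary.

import numpy as np

def tanimoto(x, y):
    """Tanimoto coefficient between two binary fingerprints (Equation 2.1)."""
    x, y = x.astype(bool), y.astype(bool)
    union = np.logical_or(x, y).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(x, y).sum()) / union

a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 0])
print(tanimoto(a, b))   # 2 common bits / 4 set bits overall = 0.5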
2.5.2 Fréchet inception distance
The Fréchet Inception Distance (FID), which was used by Martin Heusel and coworkers in [25], improves on the Inception Score described in [48] by providing a measure of the difference between the data distribution and the generated one. The Inception Score is defined as:
\text{Inception Score} = \exp\left( \mathbb{E}_x \, \mathrm{KL}\big( p(y|x) \,\|\, p(y) \big) \right)    (2.2)
in which p(y|x) is the conditional distribution of the labels obtained as output of the Inception model when feeding generated samples into it and p(y) is the marginal distribution. In this case, x = G(z), in which G is the generator. This score expresses a quality evaluation of generated samples, but it does not take into account the original data distribution, which could be different. Instead, for the calculation of the FID, both generated data and real data are fed into the Inception model and the output of some hidden layer is retrieved to obtain visually relevant features. In this way, two distributions are obtained, which are assumed to follow a multidimensional Gaussian distribution, because this is the maximum entropy distribution. Finally, the
FID is calculated as the Fréchet distance between the first two moments of these two Gaussian distributions, as shown in Equation 2.3:

\mathrm{FID} = d^2\big( (m, C), (m_w, C_w) \big) = \| m - m_w \|_2^2 + \mathrm{Tr}\big( C + C_w - 2 (C C_w)^{1/2} \big)    (2.3)
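A minimal computation of this distance from two sets of feature vectors is sketched below; the random arrays only stand in for the hidden-layer activations of real and generated samples.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians (Equation 2.3)."""
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):              # drop tiny imaginary parts due to numerical noise
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)

real = np.random.randn(1000, 64)              # stand-in for activations of real samples
fake = np.random.randn(1000, 64) + 0.5        # stand-in for activations of generated samples
print(frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False)))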
2.5.3 Fréchet Tox21 Distance
To evaluate the Chemo-GAN and the Latent-Space-GAN, the Fréchet Tox21 Distance (FTOXD) was designed. This metric is similar to the FID, but instead of using the Inception model to generate the conditional probability p(y|G(z)), it uses a model trained on the Tox21 data-set, which was called the Tox21-FTOXD model. The Tox21 data-set, which consists of 12,000 training samples and 647 test samples, was used. From the chemical structures contained in this data-set, the equivalent of the ECFP4 molecular fingerprints was calculated for each compound through the RDKit package in Python. The Tox21-FTOXD model was designed as a three-layer fully connected neural network; 1,024 units were used for each hidden layer and 12 outputs corresponding to the different labels were used. In each hidden layer, the selu activation function with LeCun weight initialization was used. Many labels provided in the Tox21 data-set were missing; therefore, missing values were masked during the training. The binary cross-entropy was used as loss function. This model obtained an AUC of 0.74 on the test set. Generated fingerprints were fed into this model and the outputs were extracted from its second hidden layer to obtain chemically relevant features. Molecular fingerprints derived from SMILES strings belonging to the test set of the ChEMBL data-set were used to calculate the conditional distribution p(y|x_data), while the conditional distribution p(y|G(z)) was calculated with the generated molecular fingerprints. Formula 2.3 was used to calculate
the FTOXD using the means and covariance matrices derived from the distributions obtained with the Tox21-FTOXD model. A sketch of such a model is given below.
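The following Keras sketch reflects the specification above (1,024-unit hidden layers with selu activations and LeCun initialization, 12 sigmoid outputs, a masked binary cross-entropy loss and feature extraction from the second hidden layer); the number of hidden layers shown, the optimizer and the encoding of missing labels as -1 are illustrative assumptions.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def masked_binary_crossentropy(y_true, y_pred):
    """Binary cross-entropy averaged only over the labels that are present.
    Assumption: missing Tox21 labels are encoded as -1."""
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    y_true = tf.clip_by_value(y_true, 0.0, 1.0)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    bce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_sum(bce * mask, axis=-1) / tf.maximum(tf.reduce_sum(mask, axis=-1), 1.0)

def build_tox21_ftoxd_model(n_bits=2048, n_tasks=12):
    model = keras.Sequential([
        layers.Dense(1024, activation="selu", kernel_initializer="lecun_normal", input_shape=(n_bits,)),
        layers.Dense(1024, activation="selu", kernel_initializer="lecun_normal"),
        layers.Dense(1024, activation="selu", kernel_initializer="lecun_normal"),
        layers.Dense(n_tasks, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss=masked_binary_crossentropy)   # optimizer is an assumption
    return model

tox21_model = build_tox21_ftoxd_model()
# After training, the activations of the second hidden layer serve as chemically relevant features
feature_extractor = keras.Model(tox21_model.inputs, tox21_model.layers[1].output)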
2.6 Chemo-GAN
The Chemo-GAN was the first method implemented and analyzed in this master thesis, through which we tried to apply generative adversarial networks for the first time in chemoinformatics. It is a generative adversarial network composed of a generator and a discriminator, which were both implemented as fully connected multi-layer artificial neural networks. The architecture was implemented using Keras, which is a high-level API able to run on top of different open source libraries such as Tensorflow and Theano [9]. For the implementation of all models, Keras was run on top of Tensorflow. The generator and the discriminator were connected, giving rise to the Chemo-GAN. To train the discriminator, a set of molecular fingerprints was randomly sampled from the fingerprints data-set and labeled with ones, and a second set of generated molecular fingerprints of the same size as the first set was sampled from the generator. In the context of generative adversarial networks, sampling from a generator is carried out by feeding the generator with a random vector sampled from a predefined prior, carrying out a forward pass through the network and retrieving the output of the last layer. In this method, a Gaussian prior with µ = 0 and σ = 1 was used. For each update of the discriminator, firstly, the discriminator was trained on real data labeled with ones and secondly, it was trained on generated data labeled with zeros. The training of the generator was realized using the full Chemo-GAN after having frozen the weights of the discriminator. Instead of optimizing the generator by descending
its stochastic gradient J^{(G)}(\theta^{(D)}, \theta^{(G)}), the generator was trained by maximizing the log probability of the discriminator being mistaken, as suggested in [20], by flipping the labels given to the samples instead of the sign of the cost function. This different objective for the generator was used to avoid the vanishing of the gradient when the confidence of the discriminator becomes too high, as suggested in [20]. The learning process was monitored using the FTOXD measure between the generated molecular fingerprints and the original data-set. To also monitor possible mode collapses, the Tanimoto coefficient between the generated molecules was calculated. These measures were taken every 500 updates of the models, where each update was carried out using a batch size of 10,000 samples. In a first phase, different architectures with different hyper-parameters were trained. A common structure was used for all models: both generator and discriminator have one hidden layer. During the tuning of the parameters, hidden layers were added to both the generator and the discriminator. All generators were implemented as fully connected feed-forward neural networks taking vectors of size 2,048 as input and generating vectors of size 2,048. Their last layer used a sigmoid activation function. All discriminators were implemented as fully connected feed-forward neural networks taking vectors of size 2,048 as input and generating a prediction in the range zero to one through a sigmoid activation function. Each generative adversarial network used the binary cross-entropy as loss function. The number of hidden layers, the activation functions of the hidden layers and the learning rates of the generator and discriminator were tuned. Stochastic gradient descent was used as optimization method. A minimal sketch of this training scheme is given below.
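The sketch below follows the usual Keras GAN pattern under the training scheme just described. The single extra hidden layer with a tanh activation and the fixed learning rate are illustrative choices only; the actual numbers of hidden layers, activation functions and learning rates were tuned as described above.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NOISE_DIM = FP_DIM = 2048

generator = keras.Sequential([
    layers.Dense(2048, activation="tanh", input_shape=(NOISE_DIM,)),   # hidden layer (tuned in practice)
    layers.Dense(FP_DIM, activation="sigmoid"),                        # fingerprint-like output in [0, 1]
])

discriminator = keras.Sequential([
    layers.Dense(2048, activation="tanh", input_shape=(FP_DIM,)),
    layers.Dense(1, activation="sigmoid"),                             # probability of being real
])
discriminator.compile(optimizer=keras.optimizers.SGD(0.01), loss="binary_crossentropy")

discriminator.trainable = False                 # frozen only inside the combined model
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer=keras.optimizers.SGD(0.01), loss="binary_crossentropy")

def train_step(real_fps, batch_size=10_000):
    idx = np.random.randint(0, len(real_fps), batch_size)
    noise = np.random.normal(0.0, 1.0, size=(batch_size, NOISE_DIM))   # Gaussian prior with mu=0, sigma=1
    fake_fps = generator.predict(noise, verbose=0)

    # Discriminator updates: real fingerprints labelled 1, generated ones labelled 0
    discriminator.train_on_batch(real_fps[idx], np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_fps, np.zeros((batch_size, 1)))

    # Generator update through the frozen discriminator, with flipped ("real") labels
    gan.train_on_batch(noise, np.ones((batch_size, 1)))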
Figure 2.5: The Latent-Space-GAN. Original data are encoded into a latent space through an encoder (ENC), while fake samples are obtained through the generator (G), which maps Gaussian noise to the same latent space. This latent space representation is subsequently used for the training of the GAN composed of G and D. Generated samples are decoded back to the SMILES encoding through a decoder network.
2.7 Latent-Space-GAN
The second part of this master thesis focused on the implementation of the Latent-Space-GAN. The name derives from the fact that this generative adversarial network learns to produce a latent space representation of SMILES strings. Basically, the Latent-Space-GAN is composed of four models: a generator, a discriminator, an encoder and a decoder. The encoder part is used in the first phase of the training to project the real data (SMILES strings) into the latent space. These data are used as real samples during the training of the generative adversarial network. During the training process, both fake samples, which are sampled from the generator, and real encoded ones are fed into the discriminator, whose task is to provide feedback in the form of a gradient to the generator, helping it, in this way, to improve itself. Generated samples are finally mapped back from the latent space to the SMILES space through the decoder part of the model. During the data-set generation, uncommon characters were encoded with the same number, in this case sixteen. For this reason, the generator produces SMILES strings containing from time to time some sixteens, which need to be replaced by one of the 35 possible uncommon characters before the evaluation of the model. To assign a character to each sixteen, another model was used. This model, which was called the "corrector", replaces each sixteen with a character whose selection is context-based; it was implemented as a stacked LSTM. The Latent-Space-GAN was trained in two separate phases: firstly, the auto-encoder was trained; subsequently, the generator and discriminator were trained. The quality of the model was measured with the FTOXD and the percentage of valid generated SMILES strings.
2.7.1 Auto-encoder
The auto-encoder is a neural network which tries to learn the identity function under certain constraints that hinder this process, e.g. noise addition, dropout or hidden layer size reduction. This generates an efficient
representation of the data due to the ability of the auto-encoder to exploit patterns. It is composed of an encoder and a decoder network. Both were implemented using recurrent neural networks, which allow the model to encode information concerning the context in which characters are placed through the sharing of parameters. To enable the model to "perceive" the context on both sides of a character, bidirectional layers were used. A bidirectional layer is composed of two recurrent neural networks, which read the sequence starting from opposite ends and whose outputs are used at the same time to calculate the values of the neurons of the next layer. Instead of simple recurrent neurons, LSTM cells were used, which do not suffer from the vanishing gradient problem and allow information to be added to or removed from the cell state. LSTMs have been successfully applied to many problems belonging to different fields, spanning from speech recognition to chemistry, showing the ability to learn dependencies between distant elements and to drop useless information. On top of the bidirectional LSTM layers, further LSTM layers were stacked, which should provide the model with a further level of abstraction in which more complex features can be learned. Gaussian noise was introduced in the model to force it to learn and generalize better, and for the same reason dropout was also used in each layer. In this model, data are embedded through an embedding layer and subsequently fed into the bidirectional LSTM layer. The last layer of the architecture applies a softmax activation function to each time step of the sequence retrieved from the previous layer, defining in this way a distribution over the characters at each time step. The predicted sequence was obtained from the last LSTM layer and the characters were retrieved by taking the argmax at each
time step of these predictions, which corresponds to the character to which the model assigns the highest probability. This is represented in Formula 2.4, in which L is the index of the last layer and x^{L-1}_i, with i ∈ {1, ..., 18} (18 being the dimension of the alphabet used), is the i-th output of the model:

\text{prediction} = \operatorname*{argmax}_i \big( \operatorname{softmax}(x^{L-1}_i) \big)    (2.4)
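The following Keras sketch illustrates an auto-encoder of this kind. The layer sizes, the single bidirectional layer, the RepeatVector-based decoder and the use of the sparse categorical cross-entropy on integer-encoded targets are simplifying assumptions made for illustration; the architecture used in the thesis stacks further LSTM layers and was tuned over several hidden sizes and dropout rates.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB, LATENT = 40, 18, 128   # latent size is an assumption (100, 150 and 300 units were tried)

# Encoder: embedding, Gaussian noise, bidirectional LSTM and an LSTM whose last state is the latent code
enc_in = keras.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB, 32)(enc_in)
x = layers.GaussianNoise(0.1)(x)
x = layers.Bidirectional(layers.LSTM(LATENT, return_sequences=True, dropout=0.3))(x)
latent = layers.LSTM(LATENT, activation="sigmoid", dropout=0.3)(x)   # "sigmoidal" latent space; a linear variant was also tried

# Decoder: repeat the latent code over the sequence and predict a character distribution per time step
y = layers.RepeatVector(MAX_LEN)(latent)
y = layers.LSTM(LATENT, return_sequences=True, dropout=0.3)(y)
dec_out = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(y)

autoencoder = keras.Model(enc_in, dec_out)
autoencoder.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
encoder = keras.Model(enc_in, latent)   # later used to produce "real" latent vectors for the GAN

# Characters are recovered by taking the argmax of the softmax outputs at each time step (Formula 2.4):
# decoded = np.argmax(autoencoder.predict(batch_of_sequences), axis=-1)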
The weights of the model were updated using RMSprop, described in chapter 1.2.2, and the categorical cross-entropy was used as loss function. Two different latent space representations were tried: in one case the values of the latent vector were constrained between zero and one through the sigmoid function, and this representation was called the "sigmoidal latent space", while in the second one a linear activation was applied and the representation was called the "linear latent space". Due to the time needed to train the model, no cross-validation was carried out, but a hold-out data-set corresponding to 35% of the data was left out for testing. This was done to guarantee an unbiased estimate of the prediction score on future data. The rest was divided into a training set (80% of the data) and a validation set (20%). The models were trained on the training set and validated on the validation set to select the best parameters. The performances of these models on the validation set are reported in Figures 2.6 and 2.7. The models that achieved the best performances were trained for 2,000 epochs and tested on the test set. Both architectures achieved an accuracy above 90% on the test set.
2.7.2 Generator and Discriminator
The generator and the discriminator were implemented using fully connected hidden layers with the selu activation function in each hidden layer. They were trained in a second step, after the selection of the auto-encoders.
Figure 2.6: Accuracy of the auto-encoders using the linear activation function for the latent space in reconstructing the original SMILES strings. Each row represents the accuracy of a specific model, identified by the name on the y axis, during training. The color encodes the value of the accuracy at each epoch: the darker the color, the lower the accuracy.
The generator was used to map Gaussian noise to a vector of continuous values of the same size as the encoder output vector. The same activation functions used to calculate the output of the encoders were used to calculate the output of the generator; a generator with a linear output and one with a sigmoidal output were thus obtained. Dropout was used as regularization in each layer.
Figure 2.7: Accuracy of the auto-encoders using the sigmoid activation function for the latent space in reconstructing the original SMILES strings. Each row represents the accuracy of a specific model, identified by the name on the y axis, during training. The color encodes the value of the accuracy at each epoch: the darker the color, the lower the accuracy.
Gaussian noise was added at each layer of the generator and applied to the input of the discriminator. Indeed, as suggested in [51] and [4], the instability of generative adversarial networks can be caused by non-overlapping supports of the generator and discriminator functions, which may lead to the possible presence of multiple optimal discriminators and thus to the invalidity of the convergence proof proposed by Goodfellow and coworkers in [21]. The noise addition should push the supports to overlap better and reduce instability problems. For the same reason, label switching was also used, which flips the class labels after a predefined number of epochs. The model was optimized using the Adam and SGD optimization methods and the binary cross-entropy was used to measure the loss of the model. The learning rate was decreased every 25 updates.
During the training of the generator, the weights of the discriminator were frozen. The discriminator was trained multiple times if its loss exceeded 0.5. This was done to keep the discriminator near optimal and able to provide helpful feedback to the generator. In this case, as in the training of the Chemo-GAN, the objective optimized is the one defined in equation 1.36. During the optimization of this objective, the only moment in which the gradient vanishes is when the generator manages to make the discriminator make mistakes. Generally, this is not a real problem if the discriminator is near optimal, because by the time the generator manages to fool the discriminator, the quality of the generated samples is already good, as also mentioned in [20]. Both models, the one with the linear latent space and the one with the sigmoidal latent space, were trained for 24,500 updates with 100,000 samples randomly sampled from the original data and 100,000 generated ones. From the first computations, auto-encoders with an accuracy above 90% were obtained. Despite these results, the fraction of generated valid SMILES strings turned out to be below 0.1% for the Latent-Space-GAN using the sigmoidal latent space and below 0.2% for the one using the linear latent space. This can be observed in Figure 2.8, in which the percentage of valid generated SMILES strings is represented for models saved at different points during the training of the Latent-Space-GAN using the
Figure 2.8: Distribution of the percentage of valid SMILES strings generated by generators saved along the training of the Latent-Space-GAN with linear latent space. Each generator was used to generate 20,000 SMILES strings 10 times. After each sample generation, the percentage of valid SMILES strings was measured and the values obtained by each generator are summarized with a boxplot.
linear latent space. At a visual inspection, it was observed that the quality of the generated samples improves during training and that the invalidity of SMILES strings is often caused by missing parentheses or unclosed rings, as can be observed in Table 2.3. Nevertheless, due to these performances, we did not continue further with the study of the Latent-Space-GAN.
Table 2.3 Comparison between original and generated SMILES strings
Original SMILES strings Generated SMILES strings
0 c1(c2nc(N=C(N)N)sc2)cn(c(c1)C)C S.O.CC(N(F)/CCCCNCC322oc1nnncc31)F
1 C(=N)(Cc1ccc(cc1)O)c1c(cc(c(c1O)OC)O)O O=C2N(nC21CN2SN=C/C=NC[N-])OCC=CC2C=C1
2 [nH]1c2ccc(cc2cc1C(=O)OCC)C(=O)O N(CNNOC1cc1/CO/C \OC(CF)/CP(CO)OC)O
3 C(=O)(c1ccc(cc1)I)NO n1(N(C2(OC21)CCO)C)C/1OC1CCC.C(C)O
4 n1c(NC)c2c(n(cn2)C)c2c1sc(n2)SC CN=NC(NN2)C.N1C=CN1OS3=nNC1=CCCCC3C2s1
5 N1(C(=O)/C=C/C=C(/CCC=C(C)C)\C)CCCC1 n12n(ncc(n2)/N)C.Sc2=NN2N=C/1F.[O-]C
6 C(=S)(Nc1ccc2c(c1)C(=O)OC2)Nc1cccc(c1)C P1O.C2CCOCP12NNNN=C1/NCCCC#CCC#CCCC1
7 c1(cnnn1c1c(cc(cc1Cl)C(F)(F)F)Cl)CCC B.FB(CC(ON2C.N1C2Nc1O1)(CCCC1)/C)O
8 c1(c2nc(NC(=O)C(=O)O)sc2)cc(no1)Cl C1#CC(OC(C#C)C/1C#CNCCl)N.NC(F)C=O
9 n1c(nc2c(c1NCC(CC)C)NCN2Cc1ccccc1)C#N P1OOCB(F.Br)N(/CC#COC(CCCN=C22)n=C1S)C
10 c1c(c(ccc1OC)CCc1ccc(c(c1)C(=O)OC)O)OC n1(NCC2=N \CCc12cnnnnnnc2)/C21CCNCnnn1C
11 c1ccc(c(c1)C1=NOC(O1)(C)c1ccccn1)Cl S1(N(N=CC(=C(CC1)CCCCCCCF)/CC)F)(F)C
12 C1(=NCCN1)Cc1cc2c(cc1)cccc2 c1(onc(c1C.O1)COC)C#CC2NN=C33c3312
13 c1(cccc(c1C(=O)O)CCCCCCC/C=C \CCCCCC)O O1C4=C(Br)CCN1C(SNc4/C(C)(C)C)(C)N
Table 2.3: Comparison between original and generated SMILES strings. The first column represents the index, the second the SMILES representations of chemical compounds belonging to the ChEMBL data-set and the third the generated SMILES strings.
3. Results
3.1 Results Chemo-GAN
In this first approach, the Chemo-GAN was successfully trained to generate molecular fingerprints which look as if they were sampled from the original data distribution. The similarity between the two distributions was measured through the FTOXD, a newly defined metric which offers the possibility to measure the distance between the data distribution and the one represented by the generator, using highly relevant chemical features retrieved through the Tox21-FTOXD model. This generative adversarial network was implemented as a fully connected artificial neural network, and parameters like the learning rate, the number of hidden layers and the activation functions were tuned to obtain better performance. In Figure 3.1, the results obtained by different models using different numbers of hidden layers and activation functions can be observed, together with the Tanimoto coefficient for each model. On the y axis, the names of the models with the respective parameters are shown. Each name is composed of: the number of hidden layers added to the general structure of the generator, the number of hidden layers added to the general structure of the discriminator, the learning rates used to update the generator and the discriminator, the number of updates
Figure 3.1: FTOXD and Tanimoto coefficient calculated for each trained model. The Tanimoto coefficient was measured between all pairs of 500 generated molecular fingerprints per model and these distributions are represented by box plots. The FTOXD was calculated 50 times per model using 10,000 generated molecular fingerprints each time, and also in this case the 50 measures are summarized using box plots.
and finally the activation function used in the hidden layers. For example, the name 1_1_0.01_0.01_10000_tanh stands for a GAN with a generator and a discriminator having one extra hidden layer each, trained using a learning rate of 0.01 for both the generator and the discriminator for 10,000 updates, and in which each hidden layer used the tanh activation function. From Figure 3.1, the importance of the number of hidden layers used in the discriminator part of the network can easily be observed. Indeed, the box plots representing the results obtained after repeated generation of molecular fingerprints and their quality assessment using the FTOXD can be clustered into three groups depending on the number of extra hidden layers used in the discriminator part of the model. In the upper left corner, the models using three extra hidden layers in the discriminator obtained the best results, which are similar to one another, but far better when compared to the results obtained by models using fewer hidden layers in the discriminator. A second and a third group can be discerned, which contain models having an FTOXD in the ranges 300-600 and 800-900 respectively. The second group was implemented using two hidden layers in the discriminator, while the third one was implemented using only one hidden layer. It is also interesting that the Tanimoto coefficients are in general low, suggesting that the models are generating diverse molecular fingerprints; the FTOXD values of the models belonging to the first group show a higher variance, with a more left-skewed distribution. To better appreciate the differences between models within each group, the results of each group were represented in separate plots, as can be observed in Figure 3.2. From these plots, the effect of using different activation functions can be observed. This is especially true for the models represented in the second row, in which the differences in FTOXD are more pronounced.
Figure 3.2: FTOXD and Tanimoto coefficient for models belonging to the same group, represented in the same plot. The first row contains the results of the models belonging to the third group, the second row those belonging to the second group and the third row those belonging to the first group.
Figure 3.3: Distributions of the FTOXD measured every 500 updates for the Chemo-GAN architecture that obtained the best results. Every 500 updates, 10,000 molecular fingerprints were generated 50 times and the FTOXD was calculated; the distributions of these values are represented as boxplots.
The first row shows that the model using the sigmoid activation function achieved the best results on average, in the second row the best results were achieved by the model using the relu activation function, and in the last row the best results were achieved by the model using the elu activation function. It is also interesting to note that the distributions of the Tanimoto coefficient are similar within the first group and become more and more different as the number of hidden layers increases. In Figure 3.4, the binary accuracy and the losses of the generator and of the discriminator are depicted for the model of each group using the elu activation function. From this picture, the behavior of the generator and the
Figure 3.4: Learning curves of the discriminator and generator and accuracy of the discriminator calculated during the training of different Chemo-GAN architectures. From top to bottom, the plots show the results for Chemo-GAN models using a discriminator with 1, 2 and 3 extra hidden layers respectively, for the models using the elu activation function in the hidden layers.
Figure 3.5: Learning curves, accuracy and Tanimoto coefficient for the Chemo-GAN architecture which achieved the best results. First row: learning curves of the generator and discriminator. Second row: the first plot represents the binary accuracy of the discriminator in discriminating the sample classes during training, while the second one represents the Tanimoto coefficient measured between molecules generated through the generator after each epoch.
discriminator during the training can be easily observed. Indeed, at the beginning of the training, when the data distribution and the one represented by the generator are different, the losses of the generator and the discriminator are high, but during the training they decrease towards zero until a certain point is reached, at which the discriminator is no longer able to discriminate the origin of the samples, as can be observed in the last plot of Figure 3.4 and in the first three plots of Figure 3.5. In these plots, at the last epochs, the accuracy approaches 0.5 and, while the loss of the generator tends to zero, that of the discriminator increases. In the first and second rows of Figure 3.4, the losses of the discriminator and the generator fell to zero, probably because the training was interrupted too early. This behavior is mirrored in the FTOXD plot represented in Figure 3.3. Indeed, while at the beginning of the training this distance measure between the two distributions is high, it decreases during the training, showing that the distributions are becoming more and more similar. These data clearly show that the distribution approximated by the generator gets closer and closer to that of the original data during training. It can further be inferred that the depth of the discriminator plays a key role in the quality of the approximation achieved, as represented in Figure 3.1. Thus, the Chemo-GAN was successful in approximating the original data distribution, showing that generative adversarial networks can be a powerful tool also in chemoinformatics.
4. Discussion
The development of new drugs is a process which requires many years to be carried out. Libraries containing chemical compounds are screened with the aim of finding molecules presenting the right properties and the fewest side effects. The selected candidates must go through many phases before they are accepted, and these phases, which generally imply testing the candidates in the laboratory to assess potency, efficacy and toxicity, are highly expensive. With the aim of reducing the costs and increasing the chances for the candidates to reach the market, in-silico approaches have been used to assess properties or activities of chemical compounds and to generate new ones. Neural networks have been widely used, together with other methods like random forests and support vector machines, in QSAR and QSPR experiments. Recurrent neural networks and variational auto-encoders have proven their ability to generate chemical compounds without the need for hand-crafted rules, covering in this way a wider chemical space. Since their introduction, which was published in 2014, GANs have become more and more popular and successful in a wide range of tasks, spanning from music and images to art. In this thesis it was hypothesized that GANs could also be successful in the generation of molecular fingerprints and chemical compounds, and to prove it we implemented and applied, for the first time,
models trained in an adversarial setting in chemoinformatics, with the aim of generating SMILES strings and molecular fingerprints. To evaluate the quality of the generated samples and to measure the distance between the original data distribution and the generated one, a new metric, the FTOXD, was defined, providing a tool to evaluate generative models that aim to generate SMILES strings or molecular fingerprints; this is especially useful for the evaluation of generative models such as GANs, for which the estimation of the log-likelihood is difficult [60]. The results obtained with the Chemo-GAN showed that the FTOXD decreases during the training, suggesting that the data generating distribution and the learned one become more and more similar, a fact that is also supported by the binary accuracy, represented in the bottom-left plot of Figure 3.5, which, after having reached a plateau around 1,500 epochs, drops until it reaches 0.5 after 5,000 epochs. When the accuracy is near 0.5, the discriminator is no longer able to provide useful feedback because it cannot discriminate whether samples are coming from the generator or from the original data. Indeed, the FTOXD measure increases when the model is trained for more epochs beyond this point.
Figure 4.1: 100 generated chemical compounds sampled from a Gaussian prior with SD 0.4.
The same sigmoidally shaped curve was observed for all the trained models with all the different parameters. The only difference lay in the point at which the minimum is reached, which was lower for models using a deeper discriminator, as shown in Figure 3.1. The low Tanimoto coefficient further indicates that the model is producing molecular fingerprints that are not equal to one another and share few features. This may be due to
the type of molecular fingerprints used. Indeed, for the ECFP4, features are calculated on the basis of the neighborhood of an atom within a diameter of 4 bonds, as explained in Chapter 2.2. It also suggests that no mode collapse happened. Although the models succeeded in approximating the original data distribution, they cannot directly generate molecular graphs; instead, chemical compounds must be "fished" from a library. However, this can be accomplished by measuring the Tanimoto coefficient between the generated molecular fingerprints and those present in a data-set, retrieving the SMILES strings of those molecular fingerprints for which the Tanimoto coefficient is higher than a certain threshold. The Latent-Space-GAN was
designed to allow the direct generation of chemical graphs. In this case, the encoder part of the auto-encoder maps the SMILES strings into a latent space, which is represented by a multidimensional vector and can be considered as a sort of molecular fingerprint calculated through a neural network. This latent space representation of SMILES strings was used to train a GAN. Two latent space representations of SMILES were used: one generated through a linear activation function and one generated through a sigmoid activation function. The better results of the model using the linear latent space could be explained by the fact that values in the sigmoidal latent space tend to be positioned at the corners of a hypercube, and this hinders the backpropagation of the gradient, slowing down the learning process. The low percentage of valid SMILES strings when mapping points in the latent space to SMILES representations is a problem which was already described in [19] and the reason, as they suggested, could be the fragility of the SMILES syntax. Indeed, as pointed out in the methods part, many SMILES strings turned out to be invalid because of a
missing parenthesis or open rings. Furthermore, the accuracy achieved by the auto-encoder reached 90% and, consequently, a further error source was introduced into the model at this step. The accuracy was measured as the mean of the accuracies of the individual predicted characters. Therefore, one symbol can be mistakenly decoded into another one, leading, in the case of a mistakenly decoded parenthesis or number, to a syntax error and the consequent invalidity of the SMILES string. Another source of error was introduced by the use of the corrector model to replace the unknown symbol "16" with the correct characters based on context. The slight improvement of the percentage of valid SMILES strings represented in Figure 2.8, and the better quality of the SMILES strings observed at a visual inspection of those generated with a generator saved during the last epochs of the training process, suggest that a longer training could have led to a further improvement. Despite these problems, the generated valid SMILES strings showed a low FTOXD, highlighting a similarity between the original data distribution and the one represented by the generator, and a decent quality. Therefore, it is possible to retrieve data-sets composed of only valid SMILES strings by filtering out the invalid ones using RDKit. Finally, it was observed that the percentage of valid SMILES strings increases when the noise vectors are sampled around the mean of the Gaussian prior. This could be because these values are sampled more often; therefore, the network has been trained on those values more often, becoming able to better map them to valid latent space representations. Using this strategy, a higher percentage of valid SMILES strings is produced and data-sets containing only valid SMILES strings can be generated quickly. An example is represented in Figure 4.1, in which 100 valid SMILES strings were generated by sampling noise vectors with a SD equal to 0.4 in 5 minutes on a normal laptop, while the
same number can be generated in one minute using an SD of 0.2.
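As a sketch of this sampling strategy (not the thesis code; the generator, decode and noise_dim names are hypothetical), the noise vectors are simply drawn from a narrower Gaussian before being passed to the trained generator:

```python
# Sketch only: noise concentrated around the mean of the Gaussian prior.
import numpy as np

def sample_noise(n_samples, noise_dim, sd=0.2):
    """A smaller sd keeps the samples in the region the generator saw most
    often during training, which raises the fraction of valid SMILES."""
    return np.random.normal(loc=0.0, scale=sd, size=(n_samples, noise_dim))

# z = sample_noise(100, noise_dim=128, sd=0.2)   # hypothetical noise dimension
# smiles = decode(generator.predict(z))          # hypothetical decoding step
```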
Figure 4.2: FTOXD measured for the Latent-Space-GAN on valid generated SMILES strings sampled from priors with different SD (x-axis: Gaussian prior SD, 0.1 to 1.0; y-axis: FTOXD).
Furthermore, the time needed can be greatly reduced when working on a GPU. However, sampling from a prior with a different SD causes the generated distribution to deviate slightly, as is also confirmed by Figure 4.2, in which the FTOXD was measured for molecules generated with the same model but with noise sampled from Gaussian priors with different SD.
5. Conclusion
To the best of our knowledge, this is the first time a Generative Adversarial Network has been used to generate molecular graphs and molecular fingerprints, and one of the first attempts to use generative Deep Learning models for drug discovery [19], [8], [50]. This master thesis opens new perspectives in chemoinformatics and drug discovery by showing the suitability of GANs for chemoinformatics-related tasks and by providing a new tool, the Chemo-GAN, for molecular fingerprint generation. This tool may speed up screening processes by offering a new way to obtain molecules with a high potential to be selected in later stages of the drug discovery cycle, while covering a wider chemical space that is no longer limited to the molecules proposed by human drug developers and chemists. This work also provides a new metric, the FTOXD, which considers chemically relevant features for the evaluation of generative models tackling molecular fingerprint or SMILES string generation. It was also shown that an LSTM auto-encoder mapping molecular graphs to and from a latent space is a fast alternative to the generation of molecular fingerprints or chemical descriptors. Furthermore, the Latent-Space-GAN was able to produce SMILES strings with low FTOXD, but the percentage of valid generated SMILES strings is low. Further studies could improve this performance by reducing the sources of error introduced by the use of the
auto-encoder and the corrector model, for example through the use of more data and the full SMILES character alphabet. Further experiments could aim to improve the percentage of valid SMILES strings by evaluating the effect of different GAN architectures that have shown improvements in realistic image generation, as well as the use of different priors. Another possible improvement could be the use of multitask learning in the discriminator to learn multiple properties at the same time. Transfer learning could also be used to allow the already-trained model to generate not only valid SMILES strings, but valid strings having particular properties or activities.
Supplementary Material
The data and code used to implement and train the models are available at:
https://github.com/Isy89/GAN-in-Chemoinformatics
Acronyms
ADALINE ADAptive LInear NEuron.
ADAMET Absorption, Distribution, Metabolism, Excretion, Toxicity.
ANN Artificial Neural Network.
API Application Programming Interface.
DCGAN Deep Convolutional Generative Adversarial Network.
DNN Deep artificial Neural Network.
ECFP Extended Connectivity Fingerprints.
FID Fréchet Inception Distance.
FNN Fully connected Neural Network.
FTOXD Fréchet Tox21 Distance.
GAN Generative Adversarial Network.
KL Kullback-Leibler Divergence.
LS-GAN Latent-Space-GAN.
LSTM Long Short-Term Memory.
MAE Mean Absolute Error.
MLP Multi-layer Perceptron.
MSE Mean Squared Error.
QSAR Quantitative Structure-Activity Relationship.
QSPR Quantitative Structure-Property Relationship.
RL Reinforcement Learning.
RNN Recurrent Neural Network.
SDF Structure Data Format.
SGD Stochastic Gradient Descent.
SMILES Simplified Molecular Input Line Entry Specification.
SML Supervised Machine Learning.
UML Unsupervised Machine Learning.
WLN Wiswesser Line Notation.
Bibliography
[1] Chemical hashed fingerprints. https://docs.chemaxon.com/display/docs/Chemical+Hashed+Fingerprint. Accessed: 14/11/2017.
[2] Dragon 7.0. https://chm.kode-solutions.net/products_dragon.php. Accessed: 2017-09-15.
[3] Pharmacophore fingerprints. https://docs.chemaxon.com/display/
docs/Pharmacophore+Fingerprint+PF. Accessed: 14/11/2017.
[4] Martin Arjovsky and Léon Bottou. Towards principled meth-
ods for training generative adversarial networks. arXiv preprint
arXiv:1701.04862, 2017.
[5] Christopher M. Bishop. Pattern Recognition andMachine Learning.
Springer, 2006.
[6] ChEMBL version 23. ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_release_notes.txt. Accessed: 14/11/2017.
[7] Artem Cherkasov, Eugene N Muratov, Denis Fourches, Alexandre
Varnek, Igor I Baskin, Mark Cronin, John Dearden, Paola Gramatica,
Yvonne C Martin, Roberto Todeschini, et al. QSAR modeling: where
have you been? where are you going to? Journal of medicinal chemistry,
57(12):4977–5010, 2014.
[8] Mehdi Cherti, Balázs Kégl, and Akin Kazakci. De novo drug design with
deep generative models: An empirical study, 2017/2/17.
[9] François Chollet et al. Keras. https://github.com/fchollet/keras,
2015.
[10] Wikipedia contributors. Chemical table file — wikipedia, the free
encyclopedia, 2018. [Online; accessed 2-February-2018].
[11] Arthur Dalby, James G Nourse, W Douglas Hounshell, Ann KI Gushurst,
David L Grier, Burton A Leland, and John Laufer. Description of several
chemical structure file formats used by computer programs developed
at molecular design limited. Journal of chemical information and
computer sciences, 32(3):244–255, 1992.
[12] Brandon Amos Blog. http://bamos.github.io/. Accessed:
20/11/2017.
[13] Extended Connectivity Fingerprints. https://docs.chemaxon.com/
display/docs/Extended+Connectivity+Fingerprint+ECFP. Accessed:
14/11/2017.
[14] Ronen Eldan and Ohad Shamir. The power of depth for feedforward
neural networks. In Conference on Learning Theory, pages 907–940,
2016.
[15] Lawrence M Fisher. Marvin Minsky: 1927-2016. Communications of
the ACM, 59(4):22–24, 2016.
[16] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers,
Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David
Michalovich, Bissan Al-Lazikani, et al. ChEMBL: a large-scale bioactiv-
ity database for drug discovery. Nucleic acids research, 40(D1):D1100–
D1107, 2011.
[17] Aurélien Géron. Hands-on machine learning with scikit-learn and
tensorflow: concepts, tools, and techniques to build intelligent sys-
tems, 2017.
[18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of train-
ing deep feedforward neural networks. In Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, pages
249–256, 2010.
[19] Rafael Gómez-Bombarelli, David Duvenaud, José Miguel Hernández-
Lobato, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams,
and Alán Aspuru-Guzik. Automatic chemical design using a data-
driven continuous representation of molecules. arXiv preprint arXiv:1610.02415, 2016.
[20] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks.
arXiv preprint arXiv:1701.00160, 2016.
[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[22] Louis P. Hammett. Reaction rates and indicator acidities. Chemical
Reviews, 16(1):67–79, 1935.
[23] Corwin Hansch and Toshio Fujita. ρ−σ−π analysis. A method for the
correlation of biological activity and chemical structure. Journal of
the American Chemical Society, 86(8):1616–1626, 1964.
[24] John C Hay, FC Martin, and CW Wightman. The Mark-1 perceptron -
design and performance. In Proceedings of the Institute of Radio Engin-
eers, volume 48, pages 398–399, 1960.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard
Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by
a two time-scale update rule converge to a nash equilibrium. arXiv
preprint arXiv:1706.08500, 2017.
[26] Geoffrey Hinton. Rmsprop. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014.
[27] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning
algorithm for deep belief nets. Neural computation, 18(7):1527–1554,
2006.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-termmemory.
Neural computation, 9(8):1735–1780, 1997.
[29] Kurt Hornik. Approximation capabilities of multilayer feedforward
networks. Neural networks, 4(2):251–257, 1991.
[30] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and
Eric P Xing. Toward controlled generation of text. In International
Conference onMachine Learning, pages 1587–1596, 2017.
[31] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016.
[32] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-
to-image translation with conditional adversarial networks. arXiv
preprint arXiv:1611.07004, 2016.
[33] Vasanth Kalingeri and Srikanth Grandhe. Music generation with deep
learning. arXiv preprint arXiv:1612.04928, 2016.
[34] Andrej Karpathy. The unreasonable effectiveness of recurrent
neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2017.
[35] David Kriesel. A Brief Introduction to Neural Networks. 2007.
[36] Andrey Kurenkov. A brief history of neural nets and deep learning,
2015.
[37] Andrew R Leach and Valerie J Gillet. An introduction to chemoinform-
atics. Springer Science & Business Media, 2007.
[38] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew
Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Jo-
hannes Totz, Zehan Wang, et al. Photo-realistic single image super-
resolution using a generative adversarial network. arXiv preprint
arXiv:1609.04802, 2016.
[39] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep archi-
tectures and deep learning in chemoinformatics: the prediction of
aqueous solubility for drug-like molecules. Journal of chemical in-
formation andmodeling, 53(7):1563–1575, 2013.
[40] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp
Hochreiter. Deeptox: toxicity prediction using deep learning. Tox21
Challenge to Build Predictive Models of Nuclear Receptor and Stress Re-
sponse Pathways as Mediated by Exposure to Environmental Toxicants
and Drugs, page 17, 2017.
[41] Warren SMcCulloch andWalter Pitts. A logical calculus of the ideas
immanent in nervous activity. The bulletin of mathematical biophysics,
5(4):115–133, 1943.
[42] Christopher Olah. Understanding lstm networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2017.
[43] Ennio Pannese. The golgi stain: invention, diffusion and impact on
neurosciences. Journal of the history of the neurosciences, 8(2):132–140,
1999.
[44] Karl H Pribram. The neuropsychology of Sigmund Freud. Experimental
foundations of clinical psychology, pages 442–468, 1962.
[45] David Rogers and Mathew Hahn. Extended-connectivity fingerprints.
Journal of chemical information andmodeling, 50(5):742–754, 2010.
[46] David J Rogers, Taffee T Tanimoto, et al. A computer program for
classifying plants. Science, 132(3434):1115–1118, 1960.
[47] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learn-
ing representations by back-propagating errors. Cognitive modeling,
5(3):1, 1988.
[48] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec
Radford, and Xi Chen. Improved techniques for training gans. In
Advances in Neural Information Processing Systems, pages 2234–2242,
2016.
[49] Ryusuke Sawada, Masaaki Kotera, and Yoshihiro Yamanishi. Bench-
marking a wide range of chemical descriptors for drug-target interac-
tion prediction using a chemogenomic approach. Molecular inform-
atics, 33(11-12):719–731, 2014.
[50] Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, and Mark P.
Waller. Generating focussed molecule libraries for drug discovery with
recurrent neural networks. arXiv preprint arXiv:1701.01329, 2017.
[51] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and
Ferenc Huszár. Amortisedmap inference for image super-resolution.
arXiv preprint arXiv:1610.04490, 2016.
[52] Richard S Sutton and Andrew G Barto. Reinforcement learning: An
introduction, volume 1. MIT press Cambridge, 1998.
[53] Symyx. CTfile formats. http://infochim.u-strasbg.fr/recherche/Download/Fragmentor/MDL_SDF.pdf, 2010.
[54] Roberto Todeschini and Viviana Consonni. Molecular descriptors for
chemoinformatics, volume 41 (2 volume set). John Wiley & Sons, 2009.
[55] Alexandre Varnek. Tutorials in Chemoinformatics. John Wiley & Sons,
2017.
[56] Paul Werbos. Backwards differentiation in ad and neural nets: Past
links and new opportunities. Automatic differentiation: Applications,
theory, and implementations, pages 15–34, 2006.
[57] Bernard Widrow et al. Adaptive "adaline" neuron using chemical
"memistors". 1960.
[58] Bernard Widrow and Michael A Lehr. 30 years of adaptive neural
networks: perceptron, madaline, and backpropagation. Proceedings
of the IEEE, 78(9):1415–1442, 1990.
[59] Wikipedia. Simplifiedmolecular-input line-entry system—wikipedia,
the free encyclopedia, 2017. [Online; accessed 14-November-2017].
[60] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On
the quantitative analysis of decoder-based generative models. arXiv
preprint arXiv:1611.04273, 2016.
[61] Santiago Ramón y Cajal. Estructura de los centros nerviosos de las aves.
1888.
[62] Chun Wei Yap. PaDEL-descriptor: An open source software to calculate
molecular descriptors and fingerprints. Journal of Computational
Chemistry, 32(7):1466–1474, 2011.
[63] Rafael Yuste. From the neuron doctrine to neural networks. Nature
Reviews Neuroscience, 16(8):487–497, 2015.
Curriculum Vitae Isaac Lazzeri
Graz, 07.01.2018
PERSONAL INFORMATION
Name: Isaac Lazzeri
Address: Idlhofgasse 36/7
8020 Graz
E-mail: [email protected]
Tel.: +43 650 6726227
Date of birth: 18.12.1989
Nationality: Italian
EDUCATION AND TRAINING
Since 15/04/2015 Master’s degree program in Bioinformatics (Johannes
Kepler Universität Linz)
11/09/2016 – 16/09/2016 Summer school “Advanced School on Modelling and
Statistics for Biology, Biochemistry and Biosensing”
(Johannes Kepler Universität Linz)
24/03/2014 Bachelor’s degree in Biotechnology (Final grade: 106/110)
(Università degli studi dell’Insubria, Varese/Italy)
Thesis title: “Topological structure of disease
associated molecular networks”
01/04/2013 – 04/10/2013
Erasmus Placement (Emergentec Biodevelopment GmbH
Vienna)
15/09/2011 – 17/07/2012 Erasmus Project at the University of Salamanca/Spain
WORK EXPERIENCE
01/03/2017 – 01/07/2017 Tutor (Machine Learning: Unsupervised Techniques, JKU)
01/10/2016 – 01/03/2017 Tutor (Machine Learning: Supervised Techniques, JKU)
11/06/2014 – 05/09/2014 Personal Assistance (Liverpool/England)
01/04/2013 – 04/10/2013 Erasmus Placement (Emergentec Biodevelopment GmbH
Vienna)
2011 Administrative activities – Università dell’Insubria
(Varese/Italy)
01/08/2008 – 02/02/2009 Administrative activities – AVIS (Associazione Volontari
Italiani Sangue, Varese/Italy)
SCHOOL EDUCATION
2003 – 2008 Liceo Artistico A. Frattini (Varese/Italy)
OTHER SKILLS
Languages: Italian: mother tongue
English: C1
Spanish: C1
German: B2
Computer Skills: Linux, Windows
Programming languages: R, Python, Perl
Technologies: Microsoft Office, Latex, Keras, Tensorflow, Pandas,
Numpy, SQL, HTML, XML, XPath, XQuery, WEKA
LANGUAGE COURSES
01/10/2016 – 28/02/2017 German immersion course B2 (JKU, Linz)
27/10/2014 – 18/12/2014 German immersion course B1 (DIG, Graz)
16/06/2014 – 22/08/2014 English immersion course C1 (LILA*, Liverpool/England)
03/2013 German immersion course (Deutsch-Akademie, Vienna)
INTERESTS
Playing the guitar, travelling, drawing, languages, sports (football, breakdance, basketball, trekking), cooking.
Statutory Declaration
I declare under oath that I have written this master thesis independently and without outside help, that I have not used any sources or aids other than those indicated, and that all passages taken from other works, either verbatim or in substance, have been marked as such. This master thesis is identical to the electronically submitted text document.
Date and signature
08.02.2018 Isaac Lazzeri