Predicting Protein-Ligand Interactions using Iterative
Stochastic Elimination Algorithm
Boris Gorelik, B.Pharm, M.Sc
A dissertation submitted to the Hebrew University of Jerusalem
for the degree of Doctor of Philosophy
·2007·
This work was conducted under the supervision of
Professor Amiram Goldblum
Abstract
Molecular Docking is an “in silico” process for predicting the structure of
receptor-ligand complexes. Such a prediction is of great importance in vari-
ous fields of life sciences, mainly in drug design efforts. Numerous methods
for solving this problem have been developed, employing a plethora of algo-
rithms.
Four main difficulties affect docking algorithms: the vast space that
needs to be searched, the scoring of ligand poses, the flexibility of both
partners (protein and ligand), and the treatment of water molecules that
frequently mediate intermolecular interactions.
This work presents ISE-dock, a docking program which is based on
the Iterative Stochastic Elimination (ISE) algorithm. ISE is a generic opti-
mization algorithm that is based on elimination of values that consistently
lead to the worst results. It constructs large sets of near optimal solutions
with no additional computational cost compared to producing single poses.
ISE-dock is based on the source code of AutoDock v.3.0.5 and uses its
scoring function. Development of a scoring function is beyond the scope of
this work. Unlike the original AutoDock program, ISE-dock is capable of
dealing with conformational changes of the receptor. The changes in the re-
ceptor’s backbone are implemented in an implicit, multistep way. Using this
approach, multiple structures of the protein are generated by an ISE-based
program. The resulting structures serve as a target for subsequent docking of
the ligand. Explicit handling of changes in the protein 3D structure is made
possible by “tearing off” side-chain atoms from the protein. Thus, movable
protein atoms are treated as a part of the ligand. In the current version of
ISE-dock, such a handling of protein flexibility is limited to unconstrained
rotations of protein side chain atoms. Although not done in this work, it is
possible to use rotamer libraries to decrease the complexity of the problem;
thus, the experiments presented here represent the “worst case” scenario
in terms of side chain flexibility.
The ISE algorithm begins by constructing a matrix that contains a set of the
possible (discrete) values for each degree of freedom (variable) that defines
the problem (system). If the problem is prediction of molecular conformation
and the degrees of freedom are rotatable bonds, then the angular rotations
around each bond are its discrete variables. One value is picked randomly out
of the set of each variable, to determine the full configuration (conformation)
of the system, which is then evaluated by a scoring function. This step
is repeated many times to form a large sample, usually in the 10^4 – 10^6
range. The scores of that sample are arranged in a virtual histogram in which
only a small fraction (1%-10%) of worst and of best results are examined in
detail, to assess the contribution of each and every variable value on the final
scores. A value that appears in the worst results with a significantly higher
frequency than expected from its random distribution (based on its total
appearance in the full sample) or one that appears with a significantly lower
frequency than expected among the best results, is marked for elimination.
The next iteration of random picking, scoring, sampling and eliminating thus
begins with a smaller number of possible combinations. The elimination
process is performed iteratively until the number of possible conformations
enables exhaustive search in feasible time. Additions in this work to previous
applications of ISE include:
• Local optimization of a randomly picked fraction of the sampled struc-
tures during the stochastic search (as performed by LGA in Auto-
Dock [Morris et al. J Comp Chem 1998, 19, 1639–62] ). During the
final, exhaustive search step, any screened conformation has a proba-
bility of 60% to undergo optimization. The main purpose of the local
optimization step is to solve clashes and unfavorable conformations that
are caused by the discrete nature of the algorithm.
• Only a limited portion of the values may be discarded for any given
variable in any iteration.
• Keeping and updating a list of best encountered conformations. The
size of the list is user defined.
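The sampling and elimination loop described above can be sketched on a toy problem. This is a minimal illustration and not the ISE-dock implementation: the function name `ise_minimize`, its parameters and the quadratic toy score are assumptions made for this example. The statistical significance test is replaced by a simpler rule that drops, for each variable, the single value most enriched in the worst tail of the sample (which also caps eliminations at one value per variable per iteration). The best-tail criterion and the local optimization step are omitted.

```python
import heapq
import itertools
import math
import random
from collections import Counter


def ise_minimize(score, domains, sample_size=2000, tail=0.1,
                 max_exhaustive=5000, keep=10, seed=0):
    """Toy Iterative Stochastic Elimination; lower scores are better.

    Returns the `keep` best (score, configuration) pairs encountered.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    domains = [list(d) for d in domains]  # remaining values per variable
    scored = {}                           # configuration -> score

    def evaluate(cfg):
        if cfg not in scored:
            scored[cfg] = score(cfg)
        return scored[cfg]

    # Eliminate values until the remaining space can be enumerated.
    while math.prod(len(d) for d in domains) > max(1, max_exhaustive):
        # Pick one random value per variable, many times, and rank by score.
        sample = [tuple(rng.choice(d) for d in domains)
                  for _ in range(sample_size)]
        sample.sort(key=evaluate)
        worst = sample[-max(1, int(tail * sample_size)):]
        for i, dom in enumerate(domains):
            if len(dom) < 2:
                continue
            total = Counter(cfg[i] for cfg in sample)
            in_worst = Counter(cfg[i] for cfg in worst)
            # Share of each value's appearances that fall in the worst tail.
            enrichment = {v: in_worst[v] / total[v] for v in dom if total[v]}
            dom.remove(max(enrichment, key=enrichment.get))

    # Exhaustive search over the reduced space.
    for cfg in itertools.product(*domains):
        evaluate(cfg)
    return heapq.nsmallest(keep, ((s, c) for c, s in scored.items()))


# Hypothetical toy problem: six variables with twelve discrete values each.
target = (3, 7, 1, 9, 5, 4)


def toy_score(cfg):
    return sum((v - t) ** 2 for v, t in zip(cfg, target))


results = ise_minimize(toy_score, [range(12)] * 6)
best_score, best_cfg = results[0]
```

The returned list plays the role of the user-sized list of best encountered conformations: it contains every scored configuration that ranks among the `keep` best, whether it was met during sampling or during the final exhaustive step.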
ISE-dock was validated using four independent data sets. Flexible lig-
and – rigid protein docking was validated using 81 protein-ligand complexes
from the PDB and ISE-dock performance was compared to those of Glide,
GOLD and AutoDock. Flexible ligand – flexible protein docking was
tested using three additional data sets: collagenase (backbone flexibility, 2
complexes), Acetylcholine Esterase – AChE (flexibility of a single side chain,
2 complexes) and trypsin (flexibility of several side chains, 10 complexes).
When no protein flexibility is allowed, ISE-dock has a better chance
than the other three programs to find more than 60% of top single poses
within RMSD = 2.0 Å and more than 80% within RMSD = 3.0 Å of the
experimental structure. ISE alone produced at least one solution of 3.0 Å
or better among the top 20 poses for the entire test set. In 98% of the
examined molecules, ISE produced solutions that are closer than 2.0 Å to
the experimental structure. Paired t-tests (PTT) were used throughout to
assess the significance of comparisons between the performance of the
different programs. ISE-dock provides more than 100-fold more docking
solutions than LGA in AutoDock in a similar time frame. The usefulness of
the large near optimal populations of ligand poses is demonstrated by show-
ing a correlation between the docking results and experiments that support
multiple binding modes in p38 MAP kinase [Pargellis, C. et al. Nat Struct
Biol 2002, 9, (4), 268–72] and in Human Transthyretin [Hamilton JA, Benson
MD. Cell Mol Life Sci 2001; 58(10):1491–1521].
Introduction of partial handling of protein flexibility into ISE-dock re-
quires several changes to the original scoring function, which has a strong
impact on the quality of the top ranked solutions. Nevertheless, the entire
docking solutions in this work always contain ligand poses of reasonable to
very high quality.
Docking of a flexible ligand into a protein while partially “unfreezing” the
backbone was tested on two collagenase-inhibitor complexes from the PDB
(PDB codes: 456c, 966c). In this case, the bound docking solutions contain
ligand poses with reasonably low RMSD values of 1.33 Å (456c) and 1.18 Å
(966c).
Two structures of AChE (4 cross docking experiments) and 10 struc-
tures of trypsin (100 cross docking experiments) with their respective in-
hibitors demonstrate the capabilities of ISE-dock to deal with protein side
chain flexibility. In both cases, high quality docking solutions are obtained
in terms of RMSD of all movable atoms from their experimental positions.
Docking populations for AChE contain solutions with RMSD≥0.37 Å, and
in the “worst” case, RMSD≥0.85 Å. In 74 (out of 100) cases in the trypsin
data set, the top 20 docking solutions contain poses with RMSD<2.0 Å. In
94 cases, the entire docking sets contain solutions with RMSD<2.0 Å and all
docking sets contain solutions with RMSD<3.0 Å.
This work shows that ISE-dock is superior in many aspects to the cur-
rently well established docking programs Glide, GOLD and AutoDock
in flexible ligand – rigid protein docking. It has also been shown that
ISE-dock deals successfully with various degrees of protein flexibility. In
order to handle flexible proteins to the full extent, the scoring scheme needs
to be redesigned. The latter task is beyond the scope of this work.
Protein flexibility is an important aspect of a protein-ligand docking pro-
gram. Other degrees of freedom that were not accounted for in this work,
but that can be introduced into ISE-dock relatively easily are modeling
of structurally important water molecules and protonation and tautomeric
states of the interacting molecules.
Contents
1 Introduction
1.1 Current drug discovery process
1.2 Flexibility in molecular interactions
1.3 Energy and thermodynamic potentials
1.4 Common energy components
1.5 Force fields and scoring functions
1.5.1 Force field based energy functions
1.5.2 Approximate energy functions
1.5.3 Statistical potentials
1.5.4 Geometric and chemical complementarity functions
1.6 Energy funnels
1.7 Multiple binding modes
1.8 Docking techniques
1.8.1 Flexibility in docking programs
1.8.2 Search algorithms
1.8.3 Evaluating docking programs
1.9 Open problems and issues
2 Methods
2.1 Energy function
2.2 AutoDock docking program
2.2.1 Lamarckian Genetic Algorithm
2.2.2 Problem representation
2.3 ISE-dock program
2.3.1 Iterative Stochastic Elimination algorithm
2.3.2 Problem representation
2.3.3 Protein flexibility
2.4 Rigid protein docking
2.4.1 LGA docking
2.4.2 The data set
2.4.3 Comparisons and their analysis
2.4.4 Paired t-test
2.4.5 Comparing CPU time
2.4.6 Energy funnels
2.5 Flexible protein docking
2.5.1 Protein backbone flexibility
2.5.2 Flexibility of a single side chain
2.5.3 Flexibility of several side chains
2.5.4 Comparisons and their analysis
3 Flexible ligand – rigid protein docking
3.1 Top scoring poses
3.2 Top 20 poses
3.3 Solution space coverage
3.4 Time performance
3.5 Multiple binding modes
3.6 PDB data supports distinct funnels
4 Flexible Ligand – Flexible Protein Docking
4.1 Protein backbone flexibility
4.2 Flexibility of a single side chain
4.3 Flexibility of several side chains
4.4 Discussion on protein flexibility
5 Conclusions
Appendices (submitted separately)
A Results published in a peer reviewed journal
B ISE-dock and AutoDock parameters and their values
B.1 AutoDock parameters and their default values
B.2 ISE-dock parameters and their default values
C Detailed Results
C.1 Flexible ligand – rigid protein docking results
C.2 Flexible ligand – rigid protein docking energy landscapes
D Flexible ligand – flexible protein docking. Trypsin data set
List of Figures
List of Tables
Acknowledgments
Bibliography
Hebrew abstract
Chapter 1
Introduction
1.1 Current drug discovery process
Since the dawn of history, humankind has been searching for ways to fight
diseases and improve the quality of life. Modern science has undergone
tremendous developments and has successfully developed a great variety of
medicines. Nevertheless, the constant search for better drugs that reduce
side effects, cure more diseases, and extend life expectancy and quality has
never stopped. Drugs have traditionally been discovered by experimental
methods, but more recently, computerized (virtual) drug discovery methods
have been devised and prove to be helpful in the process of drug discovery
and in designing drugs. Figure 1.1 presents an overview of current methods
for designing drugs and discovering them. Roughly, the systematic search
for new active molecules can be divided into three categories: classical
chemistry drug discovery, high throughput screening and virtual high throughput
screening.
Figure 1.1: Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR – structure-activity relationship; QSAR – quantitative SAR; ADME-Tox – absorption, distribution, metabolism, excretion, toxicity
Classical chemistry drug discovery During the classical drug design pro-
cess, medicinal chemists use their personal experience, combined with ratio-
nalizing the knowledge of active compounds and the suspected drug target.
The process involves iterations of data evaluation, synthesis and purifica-
tion, and assessment of biological activity. Only a few compounds can be
processed simultaneously using this approach. This approach is still labor-
intensive, slow, and expensive, requiring costly materials and techniques.
High throughput screening In several large and medium-sized pharmaceutical
companies, high throughput screening (HTS), in which robots scan the
activities of hundreds of thousands of compounds, has become a
major method. The targets for screening can be single molecules, colonies
of bacteria, fungi, or animal cells. In experiments of this kind, the effect
is recorded using fast and, sometimes, non-specific parameters such as color
change, conductivity of electric current, particle count, etc. HTS experiments
are frequently conducted without exact knowledge about the target structure
or about the mechanism of action. While faster than the first approach, HTS
often suffers from ambiguity during the process of results interpretation and
still may require expensive materials and equipment.
Virtual high throughput screening (V-HTS) In order to save time and
reduce costs, virtual HTS is designed to mimic the HTS task in silico and
is expected to indicate which compounds are worth testing in “wet” exper-
iments. Instead of screening real compounds against real targets, virtual
computer libraries of existing and not yet existing chemicals are used. Nat-
urally, this process is much cheaper and, usually, faster than the two former
ones. On the other hand, the V-HTS process relies heavily on the con-
struction and validation of the underlying computational methods and on
the interpretation of the results. Availability of fast, validated and accurate
computational screening methods is, usually, the major bottleneck of the
V-HTS approach. The main tool of V-HTS is molecular “docking”, in which
a ligand or potential drug is “driven” in order to find a good “parking place”
on the biological target.
Docking programs are computational tools that model the structure and
the nature (affinity) of molecular complexes. These programs aim to predict
geometry of inter- and intra-molecular interactions and to rank the various
possibilities. The main advantage of computational techniques in general,
and of docking programs particularly, is that they are much cheaper and
faster than the corresponding “wet” techniques. Docking programs are used
as a primary screening tool during the virtual high throughput process. They
also assist biologists, biochemists, and medicinal chemists in designing novel
molecules and in interpreting experiments that assess the activity of already
existing ones.
Two main goals of docking tools are (1) to assist in designing novel chemi-
cal compounds, and (2) to study the nature of interactions between biological
targets and ligands. These may include endogenous molecules such as hor-
mones, or external ones such as drugs or toxic compounds.
Every docking program requires that the three dimensional (3D) structure
of the target molecule be known to some extent. The Protein Data Bank
(PDB)[5] is a publicly available repository that contains more than 42,000 3D
structures of biological macromolecules solved at various resolutions.
Since 1999, the U.S. National Institute of General Medical Sciences
(NIGMS) has sponsored a large scale project called the Protein Structure
Initiative (PSI)[75]. The main goal of this initiative is to enlarge the number
of solved 3D structures of proteins, which would enable better coverage of
the existing drug targets and the discovery of new ones. Since the estab-
lishment of the PSI, the project has yielded more than 1,800 solved protein
structures (as of June 2006), with current estimated rate of more than 500
solved structures per year[67].
Despite the progress that the field of docking has undergone in the last
few years, several problems still exist. One of the major problems is energy
calculation. Another major problem is accounting for the many degrees of
freedom of the docking problem. These include flexibility of the molecules,
protonation and tautomeric states etc. Considering all these degrees of free-
dom results in a tremendous combinatorial space that each docking tool has
to search. Due to the flexible nature of molecules, it is important not to limit
the scope of the docking solution to a single structure, but instead to predict
a collection (“ensemble”) of multiple low energy conformations that contribute
to the biological activity.
In this work, I present ISE-dock – a protein ligand docking tool that
successfully overcomes the huge combinatorial space problem, while accounting
for ligand and, to a lesser extent, protein flexibility, and that is capable of
producing arbitrarily large docking populations without substantial extension
of CPU time.
1.2 Flexibility in molecular interactions
Since 1894, when Emil Fischer proposed the famous lock and key model[21],
the perception of the nature of binding between biological molecules has
undergone several changes. Although evidence that supports the lock and key
model exists (see for example [6, 55, 22]), two models that are considered
to represent the majority of receptor-ligand interactions are induced fit [48]
and equilibrium of multiple pre-existing conformations [63, 52, 76]. Induced
fit theory assumes that the conformation of the target and ligand affect each
other as they approach an encounter. The conformation of the final complex
may not be derived directly from the conformation of the separate molecules.
The “pre-existing conformations” model assumes that the final target and ligand
conformations are already probed by the isolated molecules, but they could be
of much higher energy than the most abundant conformation and therefore,
their accessibility is minute in the absence of the partner. It is not uncom-
mon that the most populated unbound states of a protein are not those that
are most populated in the bound structure[97, 10]. The same notion is true
for ligands: it was found[83] that ligands rarely bind their receptors in the
calculated global minimum conformation. Moreover, in 60% of the cases, the
bound ligand is not found even in a local energy minimum, and at least
10% of the examined ligands bind with strain energies over 9 kcal/mol.
Many theoretical and experimental studies support either the induced fit
or pre-existing populations models in different cases of binding[37, 55, 63].
From a thermodynamic point of view, the two models are equivalent;
however, describing biological systems in terms of pre-existing populations and
conformational selection is more useful in the process of drug discovery[97].
Regardless of which of the two models more accurately describes the na-
ture of binding, it is clear that molecular flexibility is involved in complex
formation.
The process of binding may result in either increase or decrease of flexibil-
ity. Decreased flexibility may be attributed to enthalpy-entropy compensa-
tion, when more effective binding interactions are gained by freezing motion.
On the other hand, complex formation may be stabilized by entropic contri-
bution, associated with increased flexibility[116]. It has been suggested[101]
that in 13 different MHC receptor-peptide complexes, the flexibility is asso-
ciated with as much as 50% of free energy of binding.
Flexibility plays an important role not only in complex formation, but also
in the mechanism of action of various complexes. For example, the conforma-
tional changes of several enzymes are very important for their activity[10, 40,
51]. Solved structures of protein-ligand complexes frequently show complexes
with 70 – 100% of the ligands’ surface area buried. Clearly, this kind of con-
formations could not be achieved without at least a minimal degree of protein
flexibility. Works that analyze bound and apo-proteins show that although
there are complexes, where the protein undergoes almost no change upon
ligand binding[41], proteins that bind small molecules are usually subjected
to conformational changes[97, 61, 60, 88].
1.3 Energy and thermodynamic potentials
The three most common thermodynamic potentials are: internal energy, en-
thalpy and Gibbs free energy.
Internal Energy The internal energy (denoted as U or E) of a thermody-
namic system is the total kinetic energy due to the motion of particles and
the potential energy associated with the vibrational and electronic energy
of atoms, including the energy of chemical bonds. Internal energy does not
include the kinetic energy due to the motion of the system as a whole. It
does not account for potential energy due to the position of the system in an
external gravitational, electric or magnetic field.
The internal energy is essentially defined by the first law of thermody-
namics, which states that energy is conserved:
$$\Delta U = Q + W + W' \tag{1.1}$$
Where ∆U is the change in internal energy of a system during a process, Q
is heat “added” to a system, W is the mechanical work “done on” a system,
and W ′ is energy added by all other processes.
Most biological interactions occur in a fluid environment. In such an en-
vironment the mechanical work done on the system is related to the pressure
(P ) and the volume (V ):
$$\delta W = -P\,dV \tag{1.2}$$
The heat energy is a function of temperature (T ) and entropy (S):
$$\delta Q = T\,dS \tag{1.3}$$
Thus, the internal energy of a system of biological interest may be expressed:
$$dU = T\,dS - P\,dV \tag{1.4}$$
Enthalpy Enthalpy or heat content (denoted as H or ∆H) describes the
amount of “useful work” that may be obtained from a closed thermodynamic
system, under constant pressure. In the absence of an external field, enthalpy
is defined as:
$$H = U + PV \tag{1.5}$$
Where U, P and V are internal energy, pressure and volume respectively.
Enthalpy is sometimes referred to as heat content because under constant
pressure and volume:
$$dH = \delta Q \le T\,dS \tag{1.6}$$
Thus, the difference in enthalpy is the maximum amount of thermal energy
that may be obtained from a system.
The total enthalpy of a system cannot be measured directly; the change in
enthalpy, ∆H, is measured instead. In an exothermic reaction at constant
pressure, the change in enthalpy equals the energy released from the system.
Similarly, in endothermic reactions, ∆H equals the energy absorbed
by the system. If the system is kept under constant pressure and constant
volume, the change in enthalpy equals the amount of heat that is released from
or absorbed by a system.
Gibbs Free Energy The Gibbs free energy (which frequently is referred to
simply as free energy), is defined as:
$$
\begin{aligned}
G &= U + PV - TS \\
  &= H - TS \\
  &= \sum_{i=1} \mu_i N_i
\end{aligned}
\tag{1.7}
$$
Where: U is the internal energy; P is pressure; V is volume; T is the
temperature; S is the entropy; µi is the chemical potential of the i-th chemical
component; Ni is the number of particles (or number of moles) composing
the i-th chemical component. It can be shown that
$$\Delta G = \Delta H - T\Delta S \tag{1.8}$$
Where ∆S is the change in the internal entropy of the system. The value
of ∆G from equation (1.8) is used to determine whether a chemical reaction
is favorable or not: reactions with ∆G < 0 will occur spontaneously, while
those with ∆G ≥ 0 will not.
Binding Affinity Non-covalent receptor-ligand interactions may be written
in the following general form[72]:
$$RL \underset{k_a}{\overset{k_d}{\rightleftharpoons}} L + R$$
Where: R, L and RL are the receptor, ligand and receptor-ligand complex,
respectively; kd and ka are kinetic constants of dissociation and associa-
tion, respectively. This reaction describes dissociation of a receptor-ligand
complex. The thermodynamic equilibrium constant of this reaction in ideal
conditions is defined as:
$$K_d = \frac{[R][L]}{[RL]} \tag{1.9}$$
Where [X] denotes the molar concentration of the component X. The equilibrium
constant can be related to the change in the Gibbs free energy (eq.
(1.8)) of binding, that is, of the reverse of the above dissociation reaction:
$$\Delta G = \Delta G^0 = RT \ln K_d \tag{1.10}$$
Here, R is the universal gas constant, and T is the absolute temperature.
∆G0 is the free energy change at equilibrium under standard conditions (all
the chemical components are at 1M concentration, T=273.15K, pressure =
1atm).
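Equation (1.10) is routinely used to convert a measured dissociation constant into a free energy. The following is a minimal sketch, assuming that the free energy is reported for the binding direction (so that a K_d below 1 M gives a negative value) and that energies are expressed in kcal/mol. The function name and the example temperature of 298.15 K are assumptions made for illustration only.

```python
import math

GAS_CONSTANT = 1.987204e-3  # universal gas constant R in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_kelvin):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Implements Delta G = R * T * ln(Kd); the sign convention (binding
    direction) is an assumption stated in the accompanying text.
    """
    if kd_molar <= 0 or temperature_kelvin <= 0:
        raise ValueError("Kd and temperature must be positive")
    return GAS_CONSTANT * temperature_kelvin * math.log(kd_molar)


# Illustrative values only: a 1 nM and a 1 uM binder at 298.15 K.
dg_nanomolar = binding_free_energy(1e-9, 298.15)
dg_micromolar = binding_free_energy(1e-6, 298.15)
```

Because the dependence on K_d is logarithmic, every tenfold improvement in K_d changes the free energy by the same increment, R T ln 10.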
In attempts to calculate the change in free energy upon binding (free
energy of binding), it is customary to separate the overall energy into distinct
components. These components usually include entropy loss due to
association, entropy gain of water due to binding of the ligand (hydrophobic
effect), entropy loss in the receptor and the ligand due to constraints of
internal degrees of freedom, interaction between the ligand and the receptor,
and changes in the conformational (internal) energy of the molecules upon
binding.
The basic assumption of most of the works on experimental or computa-
tional determination of binding energy is that different contributions to the
binding energy are independent and additive. Thus binding energy may be
written as a sum of its components[72]:
$$
\Delta G_{bind} = \Delta G_{solvent} + \Delta G_{conf}^{receptor} + \Delta G_{conf}^{ligand} + \Delta G_{int} + \Delta G_{motion}
\tag{1.11}
$$
One should note that, based on the principles of independence and additivity
of energy components, many other variants of this equation may be
written. Furthermore, the same assumption of additivity and independence
allows the creation of statistical functions that approximate the binding free
energy without direct connection to the underlying physical and thermody-
namic processes.
1.4 Common energy components
Based on equation (1.11), energy calculations are divided into distinct
components. In this section I will describe the most commonly used terms
of energy functions. This list is by no means complete, but rather serves as
a brief introduction.
Physically based potentials are mainly divided between bonding and non-
bonding expressions. Supplementary expressions for solvation or entropy loss
due to restricted rotations are sometimes added.
Non-bonding expressions
It is common to model pairwise interactions between atoms that are separated
by at least 4 covalent bonds in terms of electrostatic (Coulomb) and Van der
Waals interactions.
Coulomb potential We use Coulomb potential to estimate the enthalpy
contribution of any two charged particles to the overall potential energy:
$$E_{el} = \frac{Q_1 Q_2}{\varepsilon r} \tag{1.12}$$
Where Q1 and Q2 are the partial charges of the two particles, r is the distance
separating them, and ε is the dielectric constant of the separating
medium. In vacuum, ε equals 1. Figure 1.2 shows a typical shape of electro-
static potential of charged particles.
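The Coulomb term can be evaluated directly for a pair of partial charges. The sketch below assumes the standard form in which the dielectric constant divides the interaction, with charges in units of the elementary charge, distances in Å and energies in kcal/mol. The conversion factor of about 332.06 kcal·Å/(mol·e²) is the value commonly used in molecular mechanics codes; the function name and the example charges are illustrative.

```python
COULOMB_CONSTANT = 332.0637  # kcal*Angstrom/(mol*e^2), common force-field value


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Electrostatic energy (kcal/mol) of two point charges.

    q1, q2     -- partial charges in units of the elementary charge
    r          -- separation in Angstrom
    dielectric -- relative dielectric constant of the medium (1.0 in vacuum)
    """
    if r <= 0 or dielectric <= 0:
        raise ValueError("distance and dielectric constant must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


# Illustrative partial charges 3 Angstrom apart.
attractive = coulomb_energy(+0.4, -0.4, 3.0)
repulsive = coulomb_energy(+0.4, +0.4, 3.0)
screened = coulomb_energy(+0.4, -0.4, 3.0, dielectric=80.0)
```

Opposite charges give a negative (attractive) energy and identical charges a positive one, and a high dielectric constant such as that of bulk water scales the interaction down proportionally.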
Figure 1.2: Typical shapes of electrostatic interaction energy. The energy of two identical (full line) and opposite (dashed line) charges in vacuum is shown
Hydrogen bonds The hydrogen bond (H-bond) effect is highly related to
electrostatic interactions. This effect is caused by interaction of electronegative
atoms with hydrogen connected to other electronegative atoms. The
nature of H-bonds allows charge transfer along the bond. The strongest H-bond
effect is achieved when the three interacting atoms (hydrogen donor,
hydrogen atom and hydrogen acceptor) and the mediating lone electron pair
lie on a single line. To account for this directionality, many force fields
contain explicit terms for the angle of the H-bond. For example, following is
the H-bond component of the MM3 force field[58] that demonstrates an explicit
term for the angle θ between the interacting atoms:
$$E_{HB} = \varepsilon^{\#}\left[1.84\times10^{5}\,e^{-12.0/P} - 2.25\,\frac{P^{6}}{D}\left(\frac{l}{l_0}\right)\cos\theta\right] \tag{1.13}$$
Where l and l0 denote the actual and the reference H-bond lengths, respec-
tively, ε# is the depth of the energy potential well, P is the ratio of the sum
of the van der Waals radii of the atoms divided by the sum of the effective
interatomic distances between them and D is the dielectric constant. The
dependence of energetics on the angular relations of H-bonds plays an im-
portant role in the specificity of molecular interactions. Figure 1.3 shows
examples of typical inter- and intra-molecular hydrogen bonds.
Figure 1.3: Examples of inter- (left) and intra- (right) molecular H-bonds
The majority of existing scoring functions do not include explicit terms
for hydrogen bonds[54], but rather rely on Van der Waals or electrostatic
interactions.
Van der Waals interactions Van der Waals (VdW) forces account for both
attraction and repulsion of non bonded atoms. Usually, Van der Waals en-
thalpy contribution of atoms is estimated using the Lennard-Jones (LJ) po-
tential:
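$$E_{VdW} = \sum_{i}^{N-1}\sum_{j=i+1}^{N} 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r}\right)^{12} - \left(\frac{\sigma_{ij}}{r}\right)^{6}\right] \tag{1.14}$$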
Where εij is the depth of the potential well between the atoms i and j, r is
the distance between two atoms, σij is the distance at which the inter-particle
potential is zero and N is the number of atoms.
Equation (1.14) is sometimes referred to as the 6-12 LJ potential, as op-
posed to 4-10 potential, a more “smoothed” estimation with lower repulsion
effect. Figure 1.4 presents the shape of the Van der Waals potential of two
identical atoms. Although equation (1.14) is the most frequently encountered
one, there are other ways to estimate Van der Waals energy (for example
Hill’s equation[38]).
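A short numerical sketch of the Lennard-Jones term follows. It assumes the conventional 12-6 form, in which the twelfth-power term is repulsive (positive) and the sixth-power term attractive, and a single atom type so that one ε and one σ serve all pairs. The function names and the argon-like parameter values are illustrative assumptions.

```python
import itertools
import math


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy of one atom pair at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_vdw(coords, epsilon, sigma):
    """Sum of pair energies over all unique atom pairs (one atom type)."""
    return sum(lennard_jones(math.dist(a, b), epsilon, sigma)
               for a, b in itertools.combinations(coords, 2))


# Illustrative, roughly argon-like parameters (kcal/mol and Angstrom).
EPSILON, SIGMA = 0.24, 3.4
r_min = 2 ** (1 / 6) * SIGMA  # separation at the bottom of the well
atoms = [(0.0, 0.0, 0.0), (r_min, 0.0, 0.0), (0.0, r_min, 0.0)]
energy = total_vdw(atoms, EPSILON, SIGMA)
```

In this form the pair energy is zero at r = σ, reaches its minimum of −ε at r = 2^(1/6) σ, and rises steeply at shorter distances.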
Figure 1.4: Van der Waals interaction energy of argon dimer. Taken from the Wikipedia[113] under the GNU Free Documentation License
Bonding expressions
The three most common terms that describe the contribution of bonding
interactions to the overall energy are bond stretching, angle bending and
bond rotation (torsion).
Bond stretching One of the equations that describe the potential energy
for a covalent bond is:
$$E_{stretch} = D_e\left(1 - e^{-\alpha(r - r_0)}\right)^2 \tag{1.15}$$
In this equation (which is often referred to as the Morse equation), De is the
depth of the energy minimum, r0 is the reference bond length, and
$\alpha = \omega\sqrt{\mu/(2D_e)}$, where µ is the reduced mass and ω is the bond vibration frequency.
To simplify the energy calculations, a harmonic potential is often applied
to bond stretching (Hooke’s law). Although less accurate, the harmonic
potential is faster to calculate and is accurate enough at the bottom of the
potential well.
$$E_{stretch} = \frac{1}{2}k(r - r_0)^2 \tag{1.16}$$
Figure 1.5 presents the shapes of the Morse and Hooke’s potentials around the
minimum.
Figure 1.5: Comparison of Morse (dashed line) and Hooke’s harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1.
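The difference between the two bond stretching potentials can be checked numerically. The sketch below evaluates eqs. (1.15) and (1.16) with all parameters set to 1, as in Figure 1.5; the function names are illustrative assumptions.

```python
import math

def morse(r, d_e=1.0, alpha=1.0, r0=1.0):
    """Morse bond stretching energy, eq. (1.15)."""
    return d_e * (1.0 - math.exp(-alpha * (r - r0))) ** 2

def harmonic(r, k=1.0, r0=1.0):
    """Harmonic (Hooke's law) bond stretching energy, eq. (1.16)."""
    return 0.5 * k * (r - r0) ** 2

# Both potentials vanish at r0; far from r0 the Morse energy levels off
# at d_e while the harmonic energy keeps growing.
for r in (0.5, 0.9, 1.0, 1.1, 2.0, 5.0):
    print(f"r={r:4.1f}  Morse={morse(r):8.4f}  harmonic={harmonic(r):8.4f}")
```

A harmonic curve reproduces the curvature of a given Morse well at its minimum only when k = 2·D_e·α², since that is the second derivative of eq. (1.15) at r = r0. With all parameters equal to 1 the two curves therefore share the position of the minimum but not its curvature.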
Angle bending The angle bending contribution to the potential energy may
be estimated using the following equation:
$$E_{bending} = \frac{1}{2}(\theta - \theta_0)^2\left[1 - k_1(\theta - \theta_0) - k_2(\theta - \theta_0)^2 - k_3(\theta - \theta_0)^3 \ldots\right] \tag{1.17}$$
Where $\theta$ is the angle, $\theta_0$ is the reference angle and $k_1, k_2, \ldots$ are force
constants specific to the bonds that form the angle. A good approximation of
this general form equation is Hooke’s harmonic potential:
$$E_{bending} = \frac{k}{2}(\theta_0 - \theta)^2 \tag{1.18}$$
Bond torsion One of the possible equations that describe the contribution
of torsions around chemical bonds is
$$E_{torsion} = \sum_{n=0}^{N} C_n \cos^n(\omega) \tag{1.19}$$
Where $C_n$ are force constants, $\omega$ is the torsion angle, and N sets the
number of terms in the expansion. Although many force field terms of bond
torsion contain the above equation, more accurate estimations are sometimes
needed. On the other hand, many force fields do not contain explicit terms
for torsions[54]. In these cases non-bonding terms for Van der Waals and
electrostatic interactions are used to achieve the desired potential profile.
Entropy estimation and solvation terms
A solute molecule that leaves the solution in favor of a complex with another
molecule produces two main effects on the system’s entropy. First, it changes
the micro-structure of the water bulk that surrounds the two solute molecules.
This change results in more water molecules that are capable of creating
hydrogen bonds between themselves. The second effect is the change in the
internal degrees of freedom.
Entropy change estimation is one of the most challenging problems in
computational research of biological systems. The reason for the complexity
of this task may be demonstrated by the Gibbs entropy formula:
$$S = -k_B \sum_{i=1}^{N} p_i \log p_i \tag{1.20}$$
Where N is the number of possible discrete states of a system, and $p_i$ is the
probability of state i. Evaluating equation (1.20) directly is extremely
demanding. The large number of possible states of a system leads to very
small values of $p_i$, which in turn requires extensive sampling and may lead
to a large accumulation of errors. Several additional ways to evaluate the
entropy exactly exist, but they do not change the complex nature of the
calculations. For a review on entropy calculations in biological systems see
ref.[3].
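The sampling problem can be illustrated with a toy calculation. The sketch below evaluates eq. (1.20) for a uniform distribution and compares it with an estimate obtained from a limited number of random samples; the uniform model, the function names and the choice of k_B = 1 are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

def gibbs_entropy(probabilities, k_b=1.0):
    """Gibbs entropy, eq. (1.20), using the natural logarithm."""
    return -k_b * sum(p * math.log(p) for p in probabilities if p > 0.0)

def sampled_entropy(n_states, n_samples, rng):
    """Estimate the entropy of a uniform system from observed frequencies."""
    counts = Counter(rng.randrange(n_states) for _ in range(n_samples))
    return gibbs_entropy([c / n_samples for c in counts.values()])

rng = random.Random(0)
n_states = 10_000
exact = gibbs_entropy([1.0 / n_states] * n_states)  # equals ln(n_states)
for n_samples in (100, 1_000, 100_000):
    estimate = sampled_entropy(n_states, n_samples, rng)
    print(f"samples={n_samples:7d}  estimate={estimate:.3f}  exact={exact:.3f}")
```

An estimate built from n samples can never exceed ln(n), so it stays far below the exact value until the number of samples is much larger than the number of states. For a molecular system, where the number of states is astronomically large, this is the root of the difficulty described above.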
1.5 Force fields and scoring functions
During the process of docking, many conformations are searched. The
program needs to choose between the different conformations, thus each
conformation is given a numerical value, which in most cases is supposed
to represent its relative stability. Computational functions that estimate
the energy of the system can be based on the principles of classical physics
(force field based functions). Another class of functions combines statistical
physics equations with many approximations that are based on known macro-
structures. This class of methods is often called approximate or knowledge
based functions[82]. In addition, purely statistical scoring functions exist.
Such functions are based on statistical analysis of various patterns, such as
the distribution of contacts between different types of atoms[69]. Another
approach to estimating the “fitness” of docking structures is to use shape
complementarity.
1.5.1 Force field based energy functions
Force field based scoring functions are based on the equations that were
mentioned in Section 1.4. Two major energy functions of this kind are AMBER
[14] and CHARMM [65]. These functions differ in atom typing, in the
parameters for the various terms and in the basic equations that build them
up. The main equation of the AMBER force field reveals the complexity that
is common to all the energy functions in this class:
$$\begin{aligned}
E_{total} = {} & \sum_{bonds} K_r (r - r_0)^2 + \sum_{angles} K_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} \frac{V_n}{2}\left[1 + \cos(n\Phi - \gamma)\right] \\
& + \sum_{i<j}\left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{\varepsilon r_{ij}}\right] + \sum_{i<j}\left[\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}\right]
\end{aligned} \tag{1.21}$$
In this equation, $\gamma$ is the phase of the dihedral term, and the last term
is the estimation of the hydrogen bond energy.
The rest of the terms have already been discussed. A review of CHARMM,
AMBER and other common force fields has been recently published[64].
Due to the complexity of force field based scoring functions, they pose
relatively heavy computational load on the computer, which results in rela-
tively low calculation speed. Thus, in the case of the docking problem, the
full forms of these functions are mostly suitable for structure preparation
before docking or during the post-docking processing.
1.5.2 Approximate energy functions
As stated before, one of the major drawbacks of force field based scoring func-
tions is their extensive computational cost due to the large number of energy
terms and their complexity. Moreover, several terms, such as solvation effect,
the contribution of the flexibility to the overall system energy and others re-
quire sampling of multiple conformations in the solution space. To overcome
this obstacle, several knowledge based potentials have been proposed. In this
class of functions, the number of energy terms and the number of supported
atom or bond types are reduced. The general form of the remaining terms
resembles that of the force field based functions. The parametrization is done
using statistical analysis of known structures of macromolecules. The struc-
tures are chosen according to the problem and may include folded proteins,
proteins bound to other proteins, small molecules, DNA, etc. It is possible to
perform calibration of the parameters using focused sets of structures (target
tailored functions). Studies exist that show that such a strategy improves the
accuracy of scoring functions[11, 92]. Because the parametrization of knowl-
edge based scoring functions is done using known macro-structures, they
implicitly account for entropic effects such as solvation and changes in inter-
nal degrees of freedom. Estimation of entropic and solvation contributions to
the overall binding affinity is usually done using one or more of the following
terms[109, 49, 70]: hydrophobic match, solvent accessible surface (divided to
atom types according to the extent of hydrophobicity/hydrophilicity), and
the number of internal degrees of freedom (usually, the count of rotatable
bonds). This support of entropic terms is gained without the costly compu-
tations.
On the other hand, the calibration process does not account for non-native
structures. This might lead to meaningless results when one attempts to
quantitatively evaluate poses that reside far away from an energy minimum.
Most existing docking programs (for example AutoDock [33, 71, 70],
FlexX [49], FlexE [12], Glide [23], GOLD [42] and others) use approxi-
mate scoring functions. It is possible to compensate for the relative lack of
accuracy of this class of functions by further re-scoring docking candidates
with or without an additional simulation step (such as minimization, molecu-
lar dynamics). This multistage approach was successfully adopted by several
research groups[8, 108]. For example, in one work[108], molecular dynamics
combined with MM-PBSA (molecular mechanics Poisson-Boltzmann/surface
area) was used to re-rank the solutions suggested by DOCK 4.0. In that
work, a conformation within 1.1 Å RMSD from an HIV-1 RT inhibitor was
predicted before the 3D structure was published.
1.5.3 Statistical potentials
Another approach to simplify energy calculations even more is to use purely
statistical potentials. One such potential was proposed by Miyazawa and
Jernigan[69]. In that work, inter-residue contacts in folded proteins were
examined. It was found that residues occur near one another with different
propensities. These findings were used to compare proposed folded structures
in terms of probability. A similar statistical approach was also used to
analyze the distribution of various intermolecular contacts in
protein-protein[29] and protein-ligand[79] complexes. Similar to
semi-empirical scoring functions, statistical potentials account implicitly
for solvation and other entropic effects and, on the other hand, are of
limited validity when analyzing non-native structures.
Generally, statistical potentials provide high calculation speed which,
unfortunately, comes at the expense of accuracy. Preliminary tests during the
early stages of my research with the probability tables provided in the work
of Glasser et al.[29] gave unacceptable results. (These results are neither
shown nor discussed in this work.)
1.5.4 Geometric and chemical complementarity functions
When two molecules bind to each other, a certain degree of shape comple-
mentarity has to exist[74, 43]. This notion serves as a rationale behind shape
complementarity or geometry complementarity scoring functions. Geometry
complementarity was the exclusive scoring scheme in many early docking
programs[74, 20, 25, 7]. Current scoring functions use additional criteria
in order to improve accuracy. For example, in the work by Bohacek
and McMartin[68], the accessible protein surface was divided into
hydrophobic, hydrogen-bond donating, or H-bond accepting zones. Other
criteria for assessing chemical complementarity are based on partial charges,
hydrophobicity/hydrophilicity, atom types, etc.
1.6 Energy funnels
As stated earlier, energy is a complex function that depends on an enormous
number of variables. The multidimensional hypersurface that describes the
energy as a function of all the relevant variables is known as the “energy land-
scape”. According to eq. (1.8), at equilibrium, any thermodynamic system is
supposed to reside in a minimum (local or global) of such a landscape; other-
wise, the system would spontaneously move until it reaches one. One should
note that most probably, multidimensional energy hyperspace contains many
local minima, as opposed to a single global one.
The existence of funnels in the energy landscape has been proposed
for protein folding[16, 56, 105, 100, 52] and has been further expanded for
protein-protein[99, 115] and protein-ligand recognition[109, 96]. It has been
suggested that the shape of the energy landscape is in correlation with the
nature of protein folding or binding between the molecules.
The funnel-shaped energy landscape theory suggests that structures with a
single-minimum energy landscape may represent an extremely stable folded
structure or the “lock and key” binding mechanism. Several minima at
the bottom of the energy landscape with small barriers between them may
be a result of induced fit or non-specific interactions. Finally, a rugged
landscape with multiple minima separated by relatively high energy barriers
may indicate domain swapping or the existence of multiple binding modes.
1.7 Multiple binding modes
As was presented previously, biomolecules are flexible and mobile entities.
The molecular thermal motion results in a reality that is dramatically dif-
ferent from the static picture that is seen in structures solved with X-rays
or even in the multiple structures obtained by NMR. Although eq. (1.8)
implies that any system at thermodynamic equilibrium resides in a single
energy minimum, the real-life situation is quite different. The constant
thermal motion and ever-changing environmental conditions prevent
thermodynamic equilibrium, and energy barriers may rule out the transition
from one conformation to another, potentially more stable one.
At a non-zero temperature, thermodynamic systems are able to occupy
non-minimal regions of the landscape according to the distribution:
$$\frac{N_i}{N} = \frac{e^{-E_i/kT}}{\sum_j e^{-E_j/kT}} \tag{1.22}$$
In this equation (also known as the Boltzmann or Maxwell-Boltzmann
distribution), $N_i$ is the number of molecules, at equilibrium temperature T,
in a state i that has energy $E_i$; N is the total number of molecules in the
system and k is the Boltzmann constant, which is replaced by the universal
gas constant (R) from eq. (1.8) when energies are expressed per mole. If the
energy barrier between two minima is low enough, and the temperature is high
enough, then the molecules in a system can alternate between multiple states.
If the difference between the binding energies (i.e. ∆(∆Gbind)) of two or
more conformations is such that transformation of the system between them
doesn’t effectively compensate for the separating energy barriers, these
multiple conformations may exist in the system simultaneously, presenting a
phenomenon known as multiple or alternative binding modes.
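To give a feeling for the numbers involved, the sketch below applies eq. (1.22) to a hypothetical system with two binding modes. Energies are assumed to be given in kcal/mol, so the molar gas constant is used in place of k; the function name and the example energies are illustrative and do not come from any measured system.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal/(mol*K); per-mole counterpart of k

def boltzmann_populations(energies, temperature):
    """Relative populations of states with the given energies, eq. (1.22)."""
    e_min = min(energies)  # shift energies to avoid overflow in exp()
    weights = [math.exp(-(e - e_min) / (GAS_CONSTANT * temperature))
               for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]

# Two hypothetical binding modes that differ by 1 kcal/mol, at 298 K.
populations = boltzmann_populations([0.0, 1.0], 298.0)
print([round(p, 3) for p in populations])
```

Under these assumptions the less favorable mode still accounts for roughly 16% of the population, which shows that a small energy difference between two poses does not make the higher-energy one negligible.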
A growing body of data supports the existence of multiple binding modes
of ligands to receptors. These may manifest in the form of a ligand that binds
the same (or similar) protein in different distinct modes, or alternatively,
ligand molecules that share structural similarity may be observed in different
binding modes when bound to the same protein[18, 9, 44, 91, 57]. It is clear
that individual conformations of multiple binding modes, if they exist, may
have a unique contribution to the binding energies or specificity. The program
presented in this work, ISE-dock, is capable of producing arbitrarily large
near-optimal populations of docking solutions, resulting in an efficient
sampling of the energy hyperspace and increasing the chances of detecting
alternative binding modes.
1.8 Docking techniques
1.8.1 Flexibility in docking programs
The structural and energy considerations that were presented above imply
that accounting for flexibility in docking programs is a necessary task. The-
oretically, accounting for molecule flexibility in a system that contains N
atoms will result in 3N degrees of freedom (3 degrees of freedom for trans-
lating each atom). This number of degrees of freedom results in a colossal
rise in the computational complexity of docking calculations and cannot be
treated directly. In order to reduce the size of the solution space, several
approaches are taken, either alone or (more frequently) in various
combinations. These approaches include explicit flexibility of only small
parts of the system, “soft” potentials and low resolution docking, and the
use of multiple conformations.
Selective flexibility Among all the internal degrees of freedom that the-
oretically exist in the system, only dihedral torsions are usually taken into
consideration. This is due to the substantially lower energy barriers that
are needed for this type of movement, compared to bond stretching and an-
gle bending[54]. In addition, internal flexibility is usually limited to certain
portions of the interacting molecules. Treating ligand flexibility alone, and
keeping the protein rigid, reduces dramatically the combinatorial complexity
of a protein-ligand docking program. This approach is very popular. In fact,
most of the modern protein-ligand docking programs are capable of deal-
ing with full ligand flexibility but not with the conformational changes of a
protein[95]. The rigidity of the protein is a reasonable approximation in many
cases, and it has led to several successes. Nevertheless, accounting for
receptor flexibility is a very important step toward improving the process of
docking[4, 46, 49, 12]. Najmanovich et al. have shown that in many cases
only a few side chains in the active site of a receptor change their
conformations during ligand binding[73]. In other cases, hinge-like movements of
large portions of the protein occur[89, 90], while retaining relative rigidity
of the remaining parts of the system. These findings allow the user to par-
tially “unfreeze” the protein, while keeping a feasible combinatorial size of
the problem. Version 4.0.1 of the program AutoDock takes this approach,
by allowing the user to specify the flexible parts of the receptor (side chains
only). The ISE-dock program that is presented in this work (and was devel-
oped before the publication of AutoDock 4.0.1) takes a similar approach.
Hinge-based docking studies have also been reported[89, 90, 78].
Soft potentials Allowing partial inter-penetration of molecules by lowering
the repulsion penalties of VdW interactions is a way to implicitly account
for molecule flexibility in docking simulations. For example, in a work by
Ferrari et al.[19], a modified, softer, Lennard-Jones potential was used in
order to screen large libraries of molecules against T4 lysozyme, a protein
that undergoes small conformational changes when binding different ligands.
Yet another way to allow intermolecular penetration, and thereby to handle
protein flexibility implicitly, is to use only the protein’s Cα atoms in the
first stages of the docking (low resolution docking)[103, 102].
Multiple conformations Flexibility of the interacting molecules may be
simulated by using multiple structures. The ways to obtain these structures
include the use of multiple X-ray structures, ensembles of structures
obtained by NMR techniques, and the results of molecular dynamics
or other simulations. Three major ways exist to use multiple molecular
structures in protein-ligand docking studies: separate docking of a ligand
into each individual protein structure[19, 94], identifying the conformational
changes and considering their combinations[12], and using energy functions
that consider an energy-weighted or geometry-weighted average of the mul-
tiple structures[46, 70].
One of the important advantages of using multiple conformations is that,
unlike the rest of the previously mentioned methods, it easily allows the
movement of side chains and the backbone to be considered. In addition,
point mutations or even completely different proteins may be considered in
a single docking study.
1.8.2 Search algorithms
Docking algorithms can be roughly divided into two categories: those explor-
ing the energy landscape of the system and those (re)constructing the ligands
in the binding pockets of the macromolecule. The first class is represented by
various implementations and combinations of Simulated Annealing (SA),
Genetic Algorithm (GA), molecular dynamics, geometric complementarity
matching, etc. Examples of the implementations of this approach are: Dock3.5 [53],
AutoDock [33, 71, 70], and GOLD [42].
The second class of algorithms (represented by FlexX [49]) involves placing
one or more base fragments of the ligand into the binding pockets of
the protein and reconstructing the remainder of the molecule according to
predefined criteria. This approach is much faster and gives good results in
cases where the binding site has a deeply buried pocket with the ability to
make hydrogen bonds. However, if the binding pocket is shallow, or the main
contribution to the binding process comes from hydrophobic interactions, the
placement of the base fragments and further reconstruction of the ligand are
unreliable[107].
Genetic algorithms Genetic Algorithms (GA’s) are a general-purpose fam-
ily of optimization techniques that mimic the process of evolution[45]. During
the optimization process, an instance of the problem is encoded using linear
representation (chromosome). In the first step multiple random configura-
tions (individuals) are generated, and a fitness function is calculated for each
of them. During the subsequent steps, several operators may be applied
to some of the individuals, such as point mutation in the chromosome or
cross-over exchange of the information encoded in chromosomes between two
individuals. The fitness function is used to decide which individuals are
allowed to survive to the next iteration and to produce offspring. GOLD (Genetic
Optimization for Ligand Docking)[42] was the first docking program to use
GA. GOLD performs automated docking with full acyclic ligand flexibility,
partial cyclic ligand flexibility and partial protein flexibility in the neighbor-
hood of the binding site. The location of the site must be provided by the
user (with a possibility of using other software). Another GA-based docking
program, AutoDock [70] uses local search techniques to modify the en-
coding chromosome, and to propagate the optimized “genetic information”
to the next generations. For a detailed description of AutoDock and its
algorithm, see Section 2.2.
Monte Carlo simulated annealing Monte Carlo Simulated Annealing (SA)
techniques involve random alteration of the system that undergoes optimiza-
tion. If the change creates a conformation with lower (better) values of the
scoring function, then the new structure is accepted for the next steps. If,
on the other hand, the energy increases, the new structure is accepted with
a temperature-dependent probability $P = e^{-(E_t - E_{t-1})/(k_B T)}$, where $E_{t-1}$
and $E_t$ are the energy values before and after the random change, $k_B$ is
the Boltzmann constant, and T is the temperature. During the SA process,
the temperature T is reduced according to the predefined scheme (cooling
schedule), resulting in less permissive acceptance criteria. The MCDOCK
[1] program uses SA to solve the docking problem. The conformations are
generated using geometry-based docking and then energy-based docking is
performed.
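The acceptance rule and the cooling schedule described above can be sketched as follows. This is a generic illustration on a one-dimensional toy function and not the MCDOCK or AutoDock implementation; the geometric cooling schedule, the parameter values and all names are assumptions.

```python
import math
import random

def metropolis_accept(e_old, e_new, temperature, rng, k_b=1.0):
    """Accept every improvement; accept a worse state with probability
    exp(-(e_new - e_old) / (k_b * T))."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / (k_b * temperature))

def anneal(energy, propose, state, rng,
           t_start=1.0, t_end=1e-3, cooling=0.95, steps_per_t=50):
    """Simulated annealing with a geometric cooling schedule (assumed)."""
    e_state = energy(state)
    best, e_best = state, e_state
    temperature = t_start
    while temperature > t_end:
        for _ in range(steps_per_t):
            candidate = propose(state, rng)
            e_candidate = energy(candidate)
            if metropolis_accept(e_state, e_candidate, temperature, rng):
                state, e_state = candidate, e_candidate
                if e_state < e_best:
                    best, e_best = state, e_state
        temperature *= cooling  # less permissive acceptance at each stage
    return best, e_best

# Toy problem: find the minimum of (x - 3)^2 starting from x = -5.
rng = random.Random(42)
best_x, best_e = anneal(lambda x: (x - 3.0) ** 2,
                        lambda x, r: x + r.uniform(-0.5, 0.5),
                        -5.0, rng)
print(best_x, best_e)
```

At high temperature almost any move is accepted, which lets the search leave local minima; as the temperature drops, the rule approaches a pure descent.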
FlexX FlexX is an incremental docking program[85]. It binds flexible lig-
ands into the binding pockets of a rigid receptor. FlexX involves three
steps: selection of base fragments in the ligand molecule, placement of these
fragments in the active site of the receptor, and incremental reconstruction
of the whole ligand. The reconstruction is made fragment by fragment so
that the energy of the complex is locally minimal. For a better sampling,
the algorithm is allowed to diverge to various energetically favorable regions.
This algorithm saves only a limited number of the best scoring partial so-
lutions to continue to the next round of ligand reconstruction. Since the
greedy algorithm selects only the best partial solutions to continue to the
next round of ligand reconstruction, flexible docking is likely to be more
demanding on the quality of the scoring function used to evaluate (partial)
docking solutions[47]. FlexX-Ensemble (formerly known as FlexE)[12]
introduces a new feature to the FlexX algorithm. FlexX-Ensemble takes
into account flexibility of the receptor by using a predefined ensemble of re-
ceptor conformations. The ensemble may be derived from multiple X-ray
structures or homology modeling or generated by molecular dynamics simu-
lations. The protein is dissected into a constant (rigid) and several flexible
parts. The flexible parts may be combined to create conformations that are
not observed in the original ensemble of the structures.
Internal Coordinate Mechanics Internal Coordinate Mechanics (ICM)
performs global optimization of a flexible ligand in the receptor field[2]. This
algorithm is based on a large number of random moves with gradient local
minimization. A history mechanism is used to escape local minima.
Computer vision techniques Image recognition techniques were described
in a review by Nussinov and Wolfson[77]. These methods were applied
to both rigid and flexible docking. While the shape complementarity of molecules
that were crystallized together (bound docking) is expected to be good, the
docking of unbound molecules is less trivial.
1.8.3 Evaluating docking programs
Comparing docking programs is not a trivial task. Many criteria to perform
this task have been proposed and used in the literature[34, 13, 110, 44, 25].
The most common criterion to assess the “correctness” of a docked
complex, compared to the experimentally determined structure, is to compare
the Cartesian coordinates of the solution and the reference structure. This
comparison is reported as the Root Mean Squared Deviation (RMSD) of
atoms:
$$RMSD = \sqrt{\frac{\sum_{i,j}^{N}\left[(\Delta x_{ij})^2 + (\Delta y_{ij})^2 + (\Delta z_{ij})^2\right]}{N}} \tag{1.23}$$
Lack of specificity, inability to differentiate between more and less important
regions in a complex, and the need for a reference structure are several pitfalls
of this measure[13]. Nevertheless, RMSD is the measure of choice in the vast
majority of docking techniques. It is widely accepted to treat solutions with
RMSD values below 2.0 Å as successful ones. Other methods of evaluation
include modified deviation functions[1] and accounting for correct positioning
of intermolecular contacts[50]. An additional approach is to screen a large
library of compounds, of which only a few are known to bind efficiently to
a molecular target. In this type of test, the enrichment factor of correctly
recognized binding molecules is checked[84, 24, 26].
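A minimal sketch of the RMSD criterion is given below. It assumes that the docked pose and the reference structure list the same atoms in the same order and that both are expressed in the receptor coordinate frame, so no superposition is performed; the function names are illustrative, and the default cutoff is the 2.0 Å convention mentioned above.

```python
import math

def rmsd(pose, reference):
    """Root mean squared deviation between matched atoms, eq. (1.23)."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("pose and reference must list the same atoms")
    squared = sum((a - b) ** 2
                  for p, q in zip(pose, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(pose))

def is_successful(pose, reference, cutoff=2.0):
    """Apply the customary success threshold (in the coordinate units)."""
    return rmsd(pose, reference) < cutoff

def best_rmsd(poses, reference):
    """Smallest deviation among several poses returned by a docking run."""
    return min(rmsd(pose, reference) for pose in poses)

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]  # rigid shift of 1 unit
print(rmsd(shifted, reference), is_successful(shifted, reference))
```

The function `best_rmsd` corresponds to the practice of reporting the best available deviation among several proposed solutions rather than the deviation of the top-scored pose alone.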
Theoretically, only the lowest (best) scored docking pose of a ligand needs
to be examined. But there are several factors that require treating multiple
docking solutions. Among them are mobility of the molecules, inaccuracy of
scoring functions, and the fact that molecules are not always found in their
global minimum. Due to these reasons, it is customary to check the best
available deviation from the experimentally known structure among several
solutions that were provided by a docking program. The comparison of
multiple docking solutions is thought to downscale the dependency of the
results on a scoring function, and to better reflect the ability of the docking
algorithm[34].
1.9 Open problems and issues
Protein-ligand docking is a valuable tool in the processes of drug discovery
and lead optimization and during the basic study of intermolecular inter-
actions. Protein-ligand docking was successfully used in a wide range of
problems, but despite the plethora of existing solutions, the docking problem
is far from being solved. Many of the existing programs tend to converge
around a certain local minimum or not to converge at all. Sometimes, bio-
logically irrelevant solutions are produced.
Most of the existing programs are capable of proposing multiple docking
poses, but the time needed by many of them to do so increases with the size
of the output population.
Protein flexibility is another and, perhaps, the most difficult and urgent
challenge in the protein-ligand docking field. Other degrees of freedom that
are very important, but are hardly addressed by the existing docking pro-
grams are the position of mediating water molecules[85], co-factor position-
ing, electron transfer and protonation states of the interacting molecules.
Any docking program is tightly connected to at least one scoring function.
Although the development of a scoring function is beyond the scope of this
work, one should remember that the choice of such a function has a direct
impact on the docking program performance.
Chapter 2
Methods
2.1 Energy function
An ideal scoring function in a protein-ligand docking program would combine
speed and the ability to distinguish quantitatively between native and non-
native poses. Developing a scoring function is beyond the scope of this work.
ISE-dock uses AutoDock’s grid-based scoring function[70]. Auto-
Grid, a part of AutoDock suite, pre-calculates grids of Van der Waals,
electrostatics, and solvation interactions of a biomolecular target, based on
atom types. Following are the terms that construct the scoring function used
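in AutoDock and, subsequently, in ISE-dock:
$$\begin{aligned}
\Delta G = {} & \Delta G_{VdW} \sum_{i,j}\left(\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right) \\
& + \Delta G_{hbond} \sum_{i,j} E(t)\left(\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} + E_{hbond}\right) \\
& + \Delta G_{elect} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} \\
& + \Delta G_{sol} \sum_{i_C,j} S_i V_j\, e^{-r_{ij}^2 / 2\sigma^2} \\
& + \Delta G_{tor} N_{tor}
\end{aligned} \tag{2.1}$$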
The five ∆G terms in this equation are empirically determined using lin-
ear regression analysis, correlating a set of 30 protein-ligand complexes with
known binding constants and solved 3D structures. The first and the third
terms of the above equation are standard expressions for VdW and
electrostatic interactions, respectively. In the second (H-bond) term, E(t) is a
directional weight based on the H-bond angle t, and $E_{hbond}$ is the estimated
average energy of hydrogen bonding between water molecules and a polar
atom. The unfavorable entropy effect of ligand binding (the fifth term) is
a function of the number of sp3 bonds, $N_{tor}$. The solvation term of eq.
(2.1) considers fragmental volumes of only carbon atoms in the ligand (i)
and all atom types in the receptor (j). Parametrization of the carbon atoms
distinguishes between aliphatic and aromatic atom types. The constant co-
efficients in equation (2.1) (Aij, Bij, Cij and Dij) are specific for each pair of
atom types.
During the docking process, the program evaluates any position of the
ligand by interpolating over those grids for the protein-ligand interaction
of each atom of the ligand according to its current position and adding the
internal conformational energy of the ligand. By default, the docking box has
the dimensions 22.5 Å × 22.5 Å × 22.5 Å with a resolution of 0.375 Å between
grid points. The version of AutoDock used in this work (3.0.5) supports
eight atom types: C (aliphatic carbon), A (aromatic carbon), N (nitrogen),
O (oxygen), S (sulfur), H (hydrogen), X and M (“spare” types for additional
atoms such as metal, halogen, phosphorus etc). It is customary[106, 98, 17]
to substitute the original AutoDock parameters for zinc. We used the
following parameters, which lead to more accurate energy calculations[39]:
radius: 0.87 Å; well depth: 0.35 kcal/mol; and charge: +0.95e.
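The grid look-up can be illustrated with a short sketch. The text above states only that the grids are interpolated; trilinear interpolation is assumed here, and the data layout (nested lists anchored at the origin, one grid per atom type) and all function names are hypothetical choices made for illustration, not the AutoDock data structures. The internal energy of the ligand is left out.

```python
def points_per_axis(box_edge, spacing):
    """Number of grid points along one edge of the docking box."""
    return round(box_edge / spacing) + 1

def trilinear(grid, spacing, point):
    """Interpolate a value at point = (x, y, z) from a cubic grid whose
    first point sits at the origin (assumed layout: grid[i][j][k])."""
    n = len(grid)
    idx, frac = [], []
    for c in point:
        i = min(max(int(c / spacing), 0), n - 2)  # lower corner of the cell
        idx.append(i)
        frac.append(c / spacing - i)
    (i, j, k), (fx, fy, fz) = idx, frac
    value = 0.0
    # Weighted sum over the eight corners of the enclosing cell.
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            for dk, wz in ((0, 1.0 - fz), (1, fz)):
                value += wx * wy * wz * grid[i + di][j + dj][k + dk]
    return value

def interaction_energy(atoms, grids, spacing):
    """Sum interpolated grid values over ligand atoms.

    atoms: list of (atom_type, (x, y, z)); grids: dict atom_type -> grid.
    """
    return sum(trilinear(grids[t], spacing, xyz) for t, xyz in atoms)

print(points_per_axis(22.5, 0.375))  # default box and resolution
```

With the default box and resolution each axis holds 61 points, that is 61³ (about 227,000) pre-calculated values per grid. Scoring a pose then costs only eight look-ups per atom and per grid, regardless of the size of the protein.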
The use of grid-based scoring functions has two important consequences:
first, the simulation is accelerated significantly and second, the protein
structure cannot vary during the docking process.
2.2 AutoDock docking program
The AutoDock program[32, 33, 70] served as a source code base and a
reference point for ISE-dock performance. The source code of AutoDock
(version 3.0.5) was obtained from the authors. AutoDock performs flexible
ligand – rigid protein docking using one of the following algorithms:
Simulated Annealing (SA), Genetic Algorithm (GA) and Lamarckian Genetic
Algorithm (LGA). LGA is a hybrid optimization algorithm that derives from
GA and has been shown[33] to give the best quality of performance out of
the three available ones. AutoDock was the most cited docking program
in the scientific literature during the years 2001 – 2005[95].
2.2.1 Lamarckian Genetic Algorithm
Genetic Algorithm (GA) is a general type of optimization algorithms, and it
exists in several variants. The version of GA that is used in AutoDock is
described as Algorithm 2.1.
Algorithm 2.1 Genetic Algorithm used in AutoDock [70]
Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:   mate random pairs of individuals (crossover)
 4:   perform random mutations
 5:   for all i ∈ P do
 6:     evaluate i
 7:   end for
 8:   sort P according to the scoring function
 9:   select best individuals to survive to the next iteration
10: until stopping criteria are met
11: return best individual
In order to be optimized by GA, an instance of a problem is encoded into
a flat string (chromosome), which may be subjected to several GA operators
and then scored. The GA operators include crossover of two chromosomes
and point mutations. These operators (lines 3 and 4 in Algorithm 2.1) are
applied randomly with user-defined probability. The optimization terminates
if no improvement in the scoring function is achieved over a number of
generations or after a specified number of generations.
The LGA differs from the canonical GA by an additional step of local
optimization. The addition of local optimization ensures that an acquired
adaptation of an individual changes its chromosome, and that this change in
turn passes to the next generations.
The LGA is described using pseudocode as Algorithm 2.2.
Algorithm 2.2 Lamarckian Genetic Algorithm used in AutoDock [70]
Require: string representation of a problem (chromosome)
 1: create random population P
 2: repeat
 3:   mate random pairs of individuals (crossover)
 4:   perform random mutations
 5:
 6:   select sub population S ⊂ P to undergo local optimization
 7:   for all individual i ∈ S do
 8:     perform local optimization of i
 9:     modify i’s chromosome to reflect the optimized state
10:   end for
11:
12:   for all i ∈ P do
13:     evaluate i
14:   end for
15:   sort P according to the scoring function
16:   select best individuals to survive to the next iteration
17: until stopping criteria are met
18: return best individual
Here, an instance is encoded in a chromosome (genotype), which in turn is
translated to a phenotype. At the initial stage, the phenotype is identical to
the genotype. As in the basic GA, several operators are applied randomly on
the population with the predefined probabilities (Algorithm 2.2, lines 3 and
4). At the next stage, several individuals are randomly selected (Algorithm
2.2, line 6). These individuals undergo local optimization of the phenotype.
The optimized phenotype is translated back to a new genotype, which then
propagates to the next generations.
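The Lamarckian step can be illustrated with a minimal Python sketch on a toy real-valued problem. This is not AutoDock's code: the function names (`lamarckian_ga`, `local_search`), the toy scoring function, the operator details (one-point crossover, Gaussian mutation, elitist survival) and all parameter values are assumptions chosen for illustration only.

```python
import random

def score(chrom):
    # Toy scoring function (assumption): lower is better, minimum at the origin.
    return sum(x * x for x in chrom)

def local_search(chrom, rng, steps=20, step_size=0.1):
    # Simple greedy random search standing in for the local optimizer (assumption).
    best, best_score = list(chrom), score(chrom)
    for _ in range(steps):
        trial = [x + rng.gauss(0.0, step_size) for x in best]
        trial_score = score(trial)
        if trial_score < best_score:
            best, best_score = trial, trial_score
    return best

def lamarckian_ga(n_genes=4, pop_size=30, generations=40,
                  p_mut=0.1, p_local=0.06, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)          # mate a random pair
            cut = rng.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [x + rng.gauss(0.0, 0.5) if rng.random() < p_mut else x
                     for x in child]           # point mutations
            if rng.random() < p_local:
                # Lamarckian step: the optimized state replaces the chromosome.
                child = local_search(child, rng)
            children.append(child)
        # Sort by score and keep the best individuals for the next iteration.
        pop = sorted(pop + children, key=score)[:pop_size]
    return pop[0]
```

In this sketch the result of the local search overwrites the child's genes, so the improvement is inherited by its offspring. A canonical GA would only use the optimized score and leave the chromosome unchanged.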
AutoDock uses the Solis Wets (SW) local optimization algorithm [93], which
is a greedy local search heuristic. During
the SW search, random moves along all the axes of the solution space are
performed until an improvement is found. The variance of the random moves
is influenced by the frequency with which improving moves are found. The
pseudocode of SW is provided as Algorithm 2.3.
Algorithm 2.3 Pseudocode of Solis Wets local search algorithm
initialize variance
numberOfSuccesses = numberOfFailures = 0
repeat
  perform random move using variance
  calculate currentEnergy
  if currentEnergy < previousEnergy then
    numberOfFailures = 0
    increase numberOfSuccesses
    if numberOfSuccesses > threshold then
      numberOfSuccesses = 0
      expand variance
    end if
  else
    numberOfSuccesses = 0
    increase numberOfFailures
    if numberOfFailures > threshold then
      numberOfFailures = 0
      contract variance
    end if
  end if
until stopping criteria are met
There are no robust stopping rules for SW algorithm, since the conver-
gence isn’t guaranteed. In AutoDock, the SW search step stops after the
variance of the random moves drops below a threshold or when a specified
number of optimization steps is reached. The original SW algorithm uses
random steps with equal variances for each degree of freedom. In an attempt
to improve the optimization results, AutoDock can keep a separate random
move variance for each degree of freedom. This variant of the SW method is
referred to as Pseudo Solis Wets (PSW) local search.
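A minimal Python sketch of the search loop of Algorithm 2.3 is given below, with the two stopping rules mentioned above. It is an illustration under stated assumptions, not AutoDock's implementation: a single step width `sigma` (a standard deviation) is shared by all degrees of freedom, the expansion and contraction factors (2.0 and 0.5), the `threshold` of 4 and the function name `solis_wets` are hypothetical choices, and refinements of the full method are omitted.

```python
import random

def solis_wets(energy, x0, sigma=1.0, max_steps=300, sigma_min=1e-3,
               threshold=4, seed=0):
    """Greedy random local search with an adaptive step width."""
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    successes = failures = 0
    for _ in range(max_steps):             # stop after a fixed number of steps
        if sigma < sigma_min:              # or when the step width is too small
            break
        trial = [xi + rng.gauss(0.0, sigma) for xi in x]
        e_trial = energy(trial)
        if e_trial < e:                    # improving move: accept it
            x, e = trial, e_trial
            failures = 0
            successes += 1
            if successes > threshold:
                successes = 0
                sigma *= 2.0               # expand the step width
        else:
            successes = 0
            failures += 1
            if failures > threshold:
                failures = 0
                sigma *= 0.5               # contract the step width
    return x, e
```

A PSW-like variant would replace the scalar `sigma` by one value per degree of freedom; how those values are updated individually is not specified here.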
2.2.2 Problem representation
In AutoDock, the ligand’s pose relative to the protein and its internal
conformation are encoded using a vector of real values. The first three values
in the vector define the translation of the ligand. The rotation is encoded
using quaternion notation. This notation represents rigid body rotation using
a unit vector (represented by three numbers) and an angle of rotation around
this vector. Thus, the three degrees of freedom of rigid body rotation are
represented by four degrees of freedom in AutoDock. The rest of the
values in the chromosome vector represent the dihedral angles of the ligand’s
rotatable bonds.
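The layout of such an encoding vector can be sketched in Python as follows. The ordering of the elements (translation, rotation axis, rotation angle, torsions), the use of radians and the function names are assumptions made for illustration; the conversion from an axis and an angle to a unit quaternion is the standard one.

```python
import math

def split_chromosome(chrom):
    """Split [tx, ty, tz, ax, ay, az, angle, torsion_1, ...] into its parts."""
    translation = chrom[0:3]
    axis, angle = chrom[3:6], chrom[6]
    torsions = chrom[7:]
    return translation, axis, angle, torsions

def axis_angle_to_quaternion(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation by `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    s = math.sin(angle / 2.0) / norm       # normalizes the axis on the fly
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)
```

The four rotation numbers carry only three independent degrees of freedom, because the length of the axis does not affect the resulting rotation.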
Since AutoDock uses a grid-based energy function, the receptor is rep-
resented by a set of pre-calculated grid maps.
During the docking process, the encoding vector (genotype) is gener-
ated and translated to molecule coordinates (phenotype). A local search
is randomly applied with user-defined probability. The applied local search
changes the coordinates. This change is translated to respective changes of
the genotype.
All the algorithms used by AutoDock result in a single optimized pop-
ulation. Frequently, more than one solution is desired (as explained in Chap-
ter 1). In this case, the program may be configured to perform the docking
procedure several times. The total time needed to produce multiple docking
solutions is proportional to the number of desired structures.
2.3 ISE-dock program
The ISE-dock program was implemented as a set of added and modified
classes in the AutoDock source code, and uses its energy function. As in the
original AutoDock application, molecular flexibility is treated by allowing
changes in dihedral angles. Our representation of rigid body rotation
differs slightly from the one that is implemented in the original program:
while AutoDock encodes rotations using quaternions, ISE-dock uses
successive rotations around the X, Y and Z axes, as this option provided better
results in preliminary experiments with ISE. ISE simulations produce large
populations of docking poses, which is one of the standard results of applying
this algorithm to any problem. The number of energy evaluations performed
by the program is not affected by the size of the docking population. Since
energy evaluations account for more than 85% of the CPU time, the time
needed to complete the docking is practically independent of the number of
docking solutions that the program produces. The output population was
limited to 4096 (2^12) poses in order to limit the sorting of poses; this
limit was mainly dictated by the available space on our hard disks.
2.3.1 Iterative Stochastic Elimination algorithm
The Iterative Stochastic Elimination algorithm (ISE) is described as pseu-
docode in Algorithm 2.4.
This is a general optimization algorithm that can be applied to any prob-
lem described by independent variables and a set of discrete values for each
variable. In the case of ISE-dock, the variables are: translation (3 vari-
ables), rotation (3 variables) and bond torsion angles (one for each rotatable
bond). The algorithm begins by constructing a matrix that contains, for
each degree of freedom, a set of all possible values. This matrix is referred to
as possibilities pool. Two terms are required with respect to the possibilities
pool: problem size (PS) and pool depth (PD). Problem size is defined as
the total number of all possible combinations that can be generated from the
pool. Pool depth is the maximum number of remaining values among all the
variables that define the problem.
$$PS = \prod_{i=1}^{N_{variables}} n_i \qquad (2.2)$$
$$PD = \max_{1 \le i \le N_{variables}} n_i \qquad (2.3)$$
where $N_{variables}$ is the number of variables and $n_i$ is the number of
possible values for the i-th variable.
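As a small worked example, the two quantities can be computed in Python for a hypothetical possibilities pool. The dictionary layout, the variable names and the grid spacings below are assumptions for illustration and do not reflect the actual ISE-dock data structures.

```python
import math

def problem_size(pool):
    """PS (eq. 2.2): product of the number of remaining values of every variable."""
    return math.prod(len(values) for values in pool.values())

def pool_depth(pool):
    """PD (eq. 2.3): largest number of remaining values over all variables."""
    return max(len(values) for values in pool.values())

# Hypothetical pool: three translations, three rotations, two torsions.
pool = {
    "tx": [0.0, 0.5, 1.0],
    "ty": [0.0, 0.5, 1.0],
    "tz": [0.0, 0.5, 1.0],
    "rx": list(range(0, 360, 30)),
    "ry": list(range(0, 360, 30)),
    "rz": list(range(0, 360, 30)),
    "torsion_1": list(range(0, 360, 6)),
    "torsion_2": list(range(0, 360, 6)),
}

ps = problem_size(pool)   # 3^3 * 12^3 * 60^2 combinations
pd = pool_depth(pool)     # the torsions have the most values (60)
```

Each elimination reduces some n_i, so PS shrinks multiplicatively while PD decreases only when the deepest variables lose values.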
During the first phase (referred to as the elimination phase), a large number
of conformations is generated. Each conformation is generated by randomly
picking, for every variable, a single value from the pool and assigning it to
that variable.
Algorithm 2.4 Iterative Stochastic Elimination Algorithm
Require: problem represented as a set of variables and possible discrete values
 1: generate pool
 2: initialize population
 3: while size(pool) > threshold do
 4:   generate sample S of s random configurations
 5:   for all i ∈ S do
 6:     perform local optimization with probability P
 7:     score_i = evaluate(i)
 8:     if (size(population) < outputSize) or (score_i < score_max) then
 9:       add i to population
10:       truncate population to outputSize
11:     end if
12:   end for
13:   sort S
14:   L = low energy part of S
15:   H = high energy part of S
16:   for all variables var ∈ pool do
17:     for all values val ∈ pool_var do
18:       observedLow = number of occurrences of (var, val) ∈ L
19:       ratio = expectedLow(val)/observedLow(val)
20:       if ratio > threshold then
21:         rank = ratio/threshold
22:         mark pool_var,val for elimination with rank
23:       end if
24:       observedHigh = number of occurrences of (var, val) ∈ H
25:       ratio = observedHigh/expectedHigh
26:       if ratio > threshold then
27:         rank = threshold/ratio
28:         mark pair (var, val) for elimination with rank
29:       end if
30:     end for
31:     eliminate up to e% values with highest rank from pool_var
32:   end for
33: end while
34:
35: perform exhaustive search of pool, add best scored configurations to population
36: return population
The randomly generated conformations have a certain probability (0.06
by default) to undergo local optimization. The main purpose of the local
optimization step is to resolve clashes and unfavorable conformations that are
caused by the discrete nature of the variable values (translation, rotation
and torsions). Unlike local optimization by the Lamarckian Genetic Algorithm
in AutoDock, local optimization in ISE-dock does not affect the variables in
the possibilities matrix, only the energy values that are associated with them.
The sample is evaluated and sorted. The sorted sample is divided into three
uneven parts: subsets of lowest, highest and intermediate energy
conformations. The intermediate subset is not used in the analysis. A particular
value of a variable may be discarded from the pool of values if one of the
two following criteria is met. The first criterion is the occurrence of a value
in the higher energy subset with significantly higher frequency than is
expected under the random distribution assumption. Alternatively, a value
may be eliminated if it appears in the lower energy subset with lower than
random frequency. Not more than a user-specified portion of values may
be discarded at each iteration (the default value is 10%). The elimination
process is performed iteratively until the number of possible conformations
enables exhaustive search in a feasible time. During the exhaustive phase,
the solution candidates have a probability of P = 0.6 (default value) to
undergo local optimization. Note that the local optimization probability in
this exhaustive phase is ten times larger than the probability of local
optimization during the elimination phase. During the whole process, a list of
the best seen conformations is kept and updated.
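The two elimination criteria and the per-iteration limit can be sketched in Python as a single elimination step. This is a simplified illustration, not the ISE-dock code: the function name, the subset fractions, the ratio threshold, the use of the larger of the two ratios as the rank, and the guard value of 0.5 for values that never occur in the low energy subset are all assumptions.

```python
from collections import Counter

def elimination_step(pool, sample, low_frac=0.2, high_frac=0.2,
                     ratio_threshold=2.0, max_elim_frac=0.10):
    """One elimination iteration over `sample`, a list of (config, energy) pairs.

    `config` maps each variable name to the value picked from `pool`.
    Returns a new pool; at least one value is kept for every variable.
    """
    ranked = sorted(sample, key=lambda pair: pair[1])
    n_low = max(1, int(low_frac * len(ranked)))
    n_high = max(1, int(high_frac * len(ranked)))
    low = [cfg for cfg, _ in ranked[:n_low]]       # lowest energies
    high = [cfg for cfg, _ in ranked[-n_high:]]    # highest energies

    new_pool = {}
    for var, values in pool.items():
        low_counts = Counter(cfg[var] for cfg in low)
        high_counts = Counter(cfg[var] for cfg in high)
        expected_low = n_low / len(values)     # uniform-sampling expectation
        expected_high = n_high / len(values)
        marked = {}
        for val in values:
            # Criterion 1: over-represented among the worst configurations.
            r_high = high_counts[val] / expected_high
            # Criterion 2: under-represented among the best configurations
            # (0.5 avoids a division by zero for values that never occur).
            r_low = expected_low / max(low_counts[val], 0.5)
            rank = max(r_high, r_low)
            if rank > ratio_threshold:
                marked[val] = rank
        # Discard at most a fixed fraction of the values of this variable.
        limit = min(int(max_elim_frac * len(values)), len(values) - 1)
        doomed = set(sorted(marked, key=marked.get, reverse=True)[:limit])
        new_pool[var] = [v for v in values if v not in doomed]
    return new_pool
```

Repeating this step on fresh random samples drawn from the reduced pool shrinks the problem size until an exhaustive enumeration of the remaining combinations becomes feasible.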
The local optimization steps, the limit of discarded values per iteration
and the fact that the best seen conformations are collected during the elimi-
nation phase are new to this implementation of ISE and were not present in
previously published ones[30, 31].
The sample size and the sizes of lower- and higher-energy subsets depend
on the current pool depth (eq. 2.3) and are user configurable, as is the
required ratio between the expected and the observed occurrences of a value
(Algorithm 2.4, lines 20 and 26). The maximal fraction of eliminated values
for each variable and the probability of local search during the elimination
and exhaustive phases are also determined by the user.
2.3.2 Problem representation
As in AutoDock, the ligand’s configuration is encoded by real values that
define its position in space (translation), its orientation (rotations about
axes), and the internal rotations around single bonds. Unlike in AutoDock,
we have decided to use three degrees of freedom to describe the rotations of
the ligand around the principal axes. In our implementation, the rotation is
defined by sequential rotations of the molecule around the X, Y and Z axes
(in this order).
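A minimal Python sketch of such a sequential rotation is shown below. It assumes rotations about fixed (laboratory frame) axes, angles in radians and column vectors; these conventions and the function names are assumptions, as the text does not specify them.

```python
import math

def rotation_xyz(ax, ay, az):
    """Rotation matrix for rotations about X, then Y, then Z (angles in radians)."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    # R = Rz * Ry * Rx, so that Rx acts on a column vector first.
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def rotate(matrix, point):
    """Apply a 3x3 rotation matrix to a point given as [x, y, z]."""
    return [sum(matrix[r][c] * point[c] for c in range(3)) for r in range(3)]
```

With this representation each of the three angles is an independent variable with its own set of discrete values, which fits the pool structure required by ISE.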
2.3.3 Protein flexibility
Accounting for protein flexibility is a very important task which, until
recently, was ignored by the majority of protein-ligand docking programs[34,
97]. Proper inclusion of flexibility (as a set of rotations around side-chain
and main-chain bonds) requires extensive changes to the current source code
of ISE-dock and is thus beyond the scope of this work. Nevertheless, before
further work is done, it is important to assess the ability of ISE to cope with
this problem. The docking experiments accounting for protein flexibility that
are presented in this work serve as a demonstration of ISE capabilities.
Figure 2.1: “Tearing off” atoms to represent side chain flexibility using phenylalanine as an example. Dummy atoms are marked by the letter “D” in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.
Side chain flexibility in ISE-dock
As was previously described in Section 2.1 (page 35), the grid-based energy
function implies that the entire protein remains frozen during the docking
simulation. To overcome this limitation I have decided to “transfer” selected
atoms from the protein to the ligand, as the ligand may be treated with flex-
ibility. This is a technical choice to overcome that limitation of the original
program. Figure 2.1 describes the process. First, a set of flexible residues
is identified using previous knowledge. Then, for each flexible residue, all
the side-chain atoms, except for Cα, Cβ and the adjacent hydrogen atoms,
are deleted. The resulting structure (the constant part of the protein) is
used in all further calculations, mainly for calculating the interactions on the
grids. In order to include the flexible part of the receptor in the docking cal-
culations, the original coordinates of the side chains of the flexible residues
are copied to the ligand molecule. In addition, the backbone nitrogen atom is
also copied. A dummy bond connects the residue’s N atom to an
atom from the ligand. We now have three atoms that are common to the
ligand and the protein. These overlapping atoms serve as reference points
in defining the side chain’s torsions: the atoms N, Cα, Cβ and Cγ define χ1;
Cα, Cβ, Cγ and Cδ define χ2, and so on. In order to prevent a clash penalty
due to the overlap, the common atoms on the ligand’s side are marked
as dummy atoms. Dummy atoms are ignored during energy calculations.
All the atoms that originally belonged to the receptor molecule are excluded
from the operations of translation and rotation, thus only the dihedral angles
change during the ISE search.
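For reference, a side chain torsion such as χ1 can be computed from the coordinates of its four defining atoms with the standard dihedral angle formula. The Python sketch below is illustrative only; the function name is hypothetical and the sign convention is simply that of the formula used.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points, e.g. N, CA, CB, CG for chi1."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def dot(a, b):
        return sum(a[i] * b[i] for i in range(3))

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)      # normals of the two planes
    b1_len = math.sqrt(dot(b1, b1))
    y = dot(cross(n1, n2), b1) / b1_len
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Because the N, Cα and Cβ reference atoms stay fixed, changing χ1 moves only the atoms beyond Cβ, which is what allows the torn-off side chain to be handled like a ligand torsion.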
The transfer of atoms from the receptor to the ligand breaks a cova-
lent bond between Cβ and Cγ. After the transfer, Cγ is considered a part
of another molecule. This means that Cβ—Cγ interactions are interpreted
as intermolecular ones. Nevertheless, the distance between the two atoms
remains the distance of covalent C—C bond. To prevent the large energy
penalty that would have been caused by this misinterpretation, the Cγ atom is
also marked as dummy. This measure means that the Cγ atom is not included in
any energy calculation. The transfer of atoms and the exclusion of Cγ atoms from
energy calculations inevitably lead to some loss of accuracy. To test the validity
of the “tearing off” approach, I have “docked” only the side chains, with a ligand
molecule fixed in its crystallographic position. In these experiments (data
not shown), the RMSD of the side chain atoms with respect to their observed
position was below 0.3 Å.
Backbone flexibility in ISE-dock
The “tearing off” approach that was undertaken to include side chain flexi-
bility of proteins isn’t suitable for flexibility of the backbone due to various
technical limitations that are posed by the original code of AutoDock. In
this work, multiple protein conformations were used as a “target bank” for
the docking process. The multiple conformations of the protein were gener-
ated using the Iterative Stochastic Elimination algorithm[76, 86]. The ligand
is docked separately to each of the generated protein conformations, which
is kept frozen as usual. The results are combined according to the energy
values.
2.4 Rigid protein docking
2.4.1 LGA docking
LGA docking has been proposed to be superior to other methods in Auto-
Dock [70]. We have used the original (unmodified) AutoDock program to
obtain the results for LGA. As already mentioned, we substituted the default
Pseudo Solis Wets local optimization by the original Solis Wets algorithm.
We have also changed the default solution size from 10 to 35 in order to allow
AutoDock to perform as many energy evaluations (≈ 8.8 × 10^6) as were
performed on average by ISE (≈ 8.6 × 10^6).
2.4.2 The data set
We used the public portion of the test set used by Perola et al[84] in their
comparison of docking algorithms. The original test set consisted of 150
pharmaceutically relevant protein-ligand structures, of which 100 are pub-
licly available. The preparation process of these structures was performed by
the Perola group[84]. We converted these files to mol2 format. Protein struc-
tures were kept in their bound conformation and were assigned charges from
the Kollman (United Atoms) forcefield [111, 112]. In this forcefield, heavy
atoms and the non-polar hydrogen atoms adjacent to them are treated as
single (united) spheres and the only hydrogen atoms that are accounted for
individually are the polar ones. Ligands, co-factors and metal ions were
assigned charges using the Gasteiger-Hückel method [28], which, unlike the
former, treats all the atoms separately. Charge assignment was performed
using Sybyl® 7.1. Ligand rotatable bonds were marked by visual examination.
After the preparation, any existing co-factors were merged with the
protein and treated as part of the appropriate protein model. Atom types
were assigned automatically by the appropriate utilities in AutoDock suite.
Of the 100 complexes, 19 were excluded due to the following reasons:
• 1 complex (PDB code: 830c) containing both zinc and calcium
• 6 complexes with a co-factor that contains phosphorus atoms (due to
lack of validated parameters): 1aoe, 1dib, 1dlr, 1frb, 1syn, 7dfr
• 8 complexes with ligands that contain more than 8 atom types (this
limitation is imposed by AutoDock): 1qwx, 1ls, 1mq5, 1mq6, 1gl9,
1ydt, 2csn
• 4 structures with incomplete protein structure in proximity to the
ligand (cutoff: 10 Å): 1f4f, 1f4g, 1ohr, 1uvs
The remaining 81 complexes are listed in Table 2.1.
Table 2.1: PDB codes of the 81 complexes in the rigid protein test set.
13gs 1cim 1f0r 1h1s 1k1j 1nhu 1ydr 5std
1a42 1d3p 1f0t 1h9u 1k22 1nhv 1yds 5tln
1a4k 1d4p 1f4e 1hdq 1k7e 1o86 2cgr 7est
1a8t 1d6v 1fcx 1hfc 1k7f 1ppc 2pcp 966c
1afq 1efy 1fcz 1hpv 1kv1 1pph 2qwi
1atl 1ela 1fjs 1htf 1kv2 1qbu 3cpa
1azm 1etr 1fkg 1i7z 1l8g 1qhi 3erk
1bnw 1ett 1fm6 1i8z 1lqd 1qpe 3ert
1bqo 1eve 1fm9 1if7 1m48 1r09 3std
1br6 1exa 1g4o 1iy7 1mmb 1thl 3tmn
1cet 1ezq 1h1p 1jsv 1mnc 1uvt 4dfr
2.4.3 Comparisons and their analysis
In this work I compare the performance of ISE-dock to that of AutoDock,
Glide and GOLD. AutoDock was chosen due to the fact that it allows
direct comparison of ISE and LGA search algorithms, without any bias from
the scoring function. The latter two programs showed the best performance
in a previous extensive analysis by Perola et al[84]. ISE and LGA results
reported are average values of three independent simulations with different
seed numbers of the random number generator. Glide and GOLD results
were kindly provided by Dr. E. Perola.
The ISE algorithm is compared to LGA using the same energy function.
ISE differs from GOLD and Glide in both the search strategy
and the scoring. Such differences, in search and in scoring, characterize
most comparisons of docking programs. In all the tested programs, ligand
flexibility (torsion angles only) is accounted for, while the proteins are
kept rigid. Several protocols for comparing docking algorithms have been
proposed[34, 13, 110, 44, 25]. The choice of a particular comparison pro-
cedure frequently depends on the particular problem, the data set and the
programs under investigation. In order to be able to compare our results
to those obtained by Perola et al. with Glide and GOLD, we followed
their criteria[84] and used the RMSD of the top ranking solution versus the
corresponding crystal structure, and the best RMSD within the top 20 so-
lutions. We have also used the best RMSD within the entire docked set of
ISE and LGA as an additional criterion. This latter criterion indicates the
ability of the algorithm to cover the solution space, and is less dependent on
the scoring function. RMSD is calculated using heavy atoms of the ligands.
To examine the statistical significance of those criteria, we added the paired
t-test (PTT).
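The three RMSD criteria can be sketched in Python as follows. The sketch assumes that docked and crystallographic heavy atom coordinates are given in the same atom order and in the same reference frame, so that no superposition is applied, and it ignores symmetry-equivalent atoms; the function names and the data layout are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Heavy-atom RMSD between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((a[i] - b[i]) ** 2
                for a, b in zip(coords_a, coords_b) for i in range(3))
    return math.sqrt(total / len(coords_a))

def rmsd_criteria(poses, crystal, top_n=20):
    """`poses` is a list of (energy, coords); returns the three RMSD criteria."""
    ranked = sorted(poses, key=lambda pose: pose[0])   # best score first
    values = [rmsd(coords, crystal) for _, coords in ranked]
    return {
        "top": values[0],                      # top ranking solution
        "best_in_top_n": min(values[:top_n]),  # best RMSD within the top N
        "best_overall": min(values),           # best RMSD in the entire set
    }
```

The first two values depend on how well the scoring function ranks the poses, whereas the third reflects only whether the search produced a near-native pose at all.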
2.4.4 Paired t-test
The need to apply statistical methods for comparing docking algorithms has
been recently suggested[13]. We have the RMSD results for each docking
experiment available for each of the algorithms to be compared (either ob-
tained by us or by Perola et al.[84]), therefore we can compare results of
ISE-dock to those obtained by the others by using a paired t-test (PTT).
We compare the paired RMSD differences (for all protein complexes docked
by two algorithms – ISE and another) under the assumption that the paired
differences are independent and identically normally distributed.
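For illustration, the test statistic of the paired t-test can be computed in Python as shown below. The function name and the input layout (two equally ordered lists of RMSD values, one entry per complex) are assumptions; the formula itself is the standard one, t = mean(d) / (s_d / sqrt(n)) with n - 1 degrees of freedom.

```python
import math

def paired_t_statistic(rmsd_a, rmsd_b):
    """t statistic and degrees of freedom for two paired samples."""
    if len(rmsd_a) != len(rmsd_b) or len(rmsd_a) < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    diffs = [a - b for a, b in zip(rmsd_a, rmsd_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    if var == 0.0:
        raise ValueError("all paired differences are identical")
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

The p-value is then obtained from Student's t distribution with the returned number of degrees of freedom, for example from statistical tables or a statistics package.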
2.4.5 Comparing CPU time
Variable computation times are the result of differences in CPU, in algorithm
implementation and in other program specific issues. As both LGA and
ISE are parts of the same program, and most (>85%) computation time is
spent on energy evaluations, we use the number of energy evaluations as
an independent estimate of time performance. In order to enable a common
basis for comparing performance we changed the default output size for LGA
from 10 poses to 35. This size was chosen so that the average number of
energy evaluations using LGA (≈ 8.8 × 10^6) would approximately equal the
average number of energy evaluations performed during ISE optimizations
(≈ 8.6 × 10^6).
2.4.6 Energy funnels
The existence of funnels in the energy landscape has been proposed for pro-
tein folding[109, 62, 100, 105] and has been further expanded for protein-
protein[99, 115] and protein-ligand recognition[96, 109]. It has been sug-
gested that the shape of such plots is in correlation with the nature of binding
between the molecules[62]. In the part of this work that deals with flexible
ligand - rigid protein docking, I utilize the ability of ISE to generate large
populations of near-optimal solutions to estimate the energy landscape in the
vicinity of the minimum. For each docked complex, I construct an energy vs
RMSD plot.
2.5 Flexible protein docking
As stated above, protein flexibility is introduced into this work as a series
of experiments that serve as a proof of concept that ISE can successfully
represent protein flexibility. Therefore, the experimental design is limited
to several typical cases and no statistical analysis of the results is performed.
The test cases were chosen so that the flexible regions in the proteins are
limited to small ones in proximity to the bound ligand.
The ability of ISE-dock to represent changes in the protein backbone
was tested using two structures of collagenase with inhibitors. In our repre-
sentation of backbone flexibility, we follow other studies that produce mul-
tiple backbone conformations and dock a ligand to each of them, in order
to identify the protein conformation to which the ligand would preferentially
dock. However, ISE has been shown to produce higher-quality backbone
conformations that are close to the experimental ones. Docking to a protein with flexible
side chains was tested on two systems: acetylcholinesterase (single side chain)
and trypsin (several side chains). In all cases, the structures chosen are from
results of X-ray crystal structure determination in the PDB and represent
real modifications of the protein structures. All the selected complexes are
pharmaceutically relevant.
Docking process
Applying ISE-dock to flexible backbone docking requires initial separation
of ligand from the receptor-ligand complex. In the next stage our loop pre-
dicting program (ISE-based) predicts conformations of flexible protein frag-
ments. This program was developed in our group and has been successfully
applied [86, 76]. During the search for optimal backbone conformations, ISE
samples flexible fragments by probing dipeptide conformations. Dipeptide
selection is performed according to the given sequence. The conformations
are evaluated by an energy function that combines penalties for deviations
from peptide geometry and interactions between the fragment and the rest
of the protein. This process results in a set of conformations sorted by the
value of the scoring function. Side chains are represented as centered on Cα
in the evaluation of interactions. The side chains are added to each backbone
conformation in a subsequent step using the program SCAP[114].
Main chain Following the generation of backbone conformations of a loop
or protein fragment, the ligand is docked into each of a selected set of pro-
tein conformations. This set is limited in number due to computational
restrictions and also due to the energy gap from the lowest energy (global
minimum) conformation. It is reasonable to assume that an energy loss of
5 kcal/mol may be compensated by interactions with a ligand. Thus, we
used a threshold of 5 kcal/mol for backbone conformations above the global
minimum in order to pick a small set out of a much larger one, produced
by ISE. In each docking experiment, 4096 conformations were generated as
a result of applying ISE to the flexible ligand positions, with each protein
conformation. The sets for all protein conformations are merged and sorted,
and only the best 4096 conformations remain for final examination.
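The selection of backbone conformations and the merging of the docking results can be sketched in Python as follows. The function names and data layouts are hypothetical, and sorting the merged set by docking energy is an assumption that follows the statement that results are combined according to the energy values.

```python
def select_backbones(backbones, window=5.0):
    """Keep (name, energy) backbone conformations within `window` kcal/mol of the minimum."""
    e_min = min(energy for _, energy in backbones)
    return [name for name, energy in backbones if energy - e_min <= window]

def merge_docking_runs(runs, keep=4096):
    """Merge per-backbone pose lists and keep the `keep` lowest-energy poses.

    `runs` maps a backbone name to a list of (energy, pose) pairs; each
    returned entry is (energy, backbone_name, pose).
    """
    merged = [(energy, name, pose)
              for name, poses in runs.items()
              for energy, pose in poses]
    merged.sort(key=lambda entry: entry[0])    # lowest energy first
    return merged[:keep]
```

Keeping the backbone name with every pose makes it possible to see, after the merge, which protein conformation the best ligand poses prefer.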
Side chains To perform ligand docking that includes flexible side chains,
an initial decision must be made, which side chains will be treated as flex-
ible. Those specific side chains are then “combined” with the ligand, thus
becoming flexible as the ligand is. Preparation of structures for computations
follows the procedure described in section 2.4.2.
2.5.1 Protein Backbone Flexibility –
Test Case of Collagenase
General
The protein family of Matrix Metalloproteinases (MMPs) is responsible for
metabolizing the macromolecular components of extracellular matrix. The
collagenase subfamily (MMP–1, –8 and –13) enzymes are responsible for
cleaving fibrillar collagen. This cleavage is a key process in rheumatoid and
osteoarthritis[59]. The crystal structure of collagenase-3 (MMP–13) with
RS-130830 (diphenyl-ether sulphone based hydroxamic acid) has been solved
with a resolution of 2.4 Å (PDB code: 456c)[59]. Fibroblast collagenase-1
(MMP–1) in complex with RS-104966
(N-hydroxy-2-[4-(4-phenoxybenzenesulfonyl)-tetrahydropyran-4-yl]-acetamide)
has been solved with 1.9 Å resolution (PDB code: 966c)[59]. The ligands
RS-130830 and RS-104966 are chemically similar, differing only by an additional
substitution on one of the two phenyl rings. The molecules have different
specificity profiles towards MMP–1 and –13 (Table 2.2).
Table 2.2: Affinities to collagenase (Ki, nM)
              RS-130830   RS-104966
PDB complex   456c        966c
MMP–1         590         23
MMP–13        0.52        0.13
The two proteins share 59% sequence identity, and have very similar 3D
structures. The major difference between the structures of these two proteins
is in a few characteristics (orientation, amino acid contents and length) of a
single loop: residues 243–255 (13 amino acids) for PDB structure 456c and
residues 239–249 (11 amino acids) for structure 966c. This loop, together with
the residue at position 218 (according to SWISS-PROT numbering of 456c),
forms the specificity pocket – the sub-site that is responsible for collagenase
specificity as well as the specificity of quite a few other MMPs. Figure 2.2
presents a structural alignment of the two collagenases. As one may see, the
backbone traces of the two differ mainly in fragments Gly 248 — Met 253 in
456c (6 amino acids) and Ser 244 — Leu 247 in 966c (4 amino acids). These
fragments belong to the S1′–specificity pocket.
Figure 2.2: Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.
Comparisons and their analysis
Our flexible backbone docking involves initial prediction of loop positions,
rigid docking of the ligand to these multiple loops and then combining the
results into a single set. The computational effort that is involved in this
multistep methodology is much greater than the computational cost of rigid-
protein docking. Due to the need to apply a few programs in order to obtain a
set of final results, the effect of the additional investment of CPU time cannot
be assessed nor isolated. Therefore we do not compare flexible-backbone
docking to rigid protein docking.
Protein backbone conformations of fragments or loops are produced by
applying ISE to the structure of the protein in the protein-ligand complex,
without the presence of ligand. To evaluate the results, we compare the frag-
ment conformations to the original loop/fragment conformation in the com-
plex. We compare by measuring backbone atoms deviations (using RMSD).
For the ligand, its predicted position is compared to the one observed crystal-
lographically using RMSD of heavy atoms. Ligand RMSD of the top scored
conformation, best RMSD in top 20 and in all available solutions are re-
ported. Ideally, RMSD of all movable atoms (protein backbone, side chain
atoms and the ligand) needs to be calculated. To calculate RMSD over this
set of atoms, one needs to take into account the numerous local axes of sym-
metry present in any protein-ligand complex. Phenyl rings, carboxylate and
guanidine groups are examples of substructures that contain such axes. Cor-
rect accounting for symmetry axes is a complex combinatorial problem with
an exponential complexity. Due to the preliminary nature of flexible protein
docking experiments and in order to simplify the process of evaluation, I de-
cided to use two values simultaneously: RMSD of the ligand and RMSD of
protein backbone atoms.
2.5.2 Flexibility of a single side chain –
Test case of acetylcholinesterase
General
Acetylcholinesterase (AChE) plays an important role in regulating the
functions of the central and peripheral nervous systems. This enzyme cleaves
acetylcholine, which is secreted by neuron vesicles into the synapse that
separates the vesicle and the membrane of the next cell in line. Acetylcholine
encounters receptors on that membrane and activates the continuation of the
neuronal transmission. AChE cleaves acetylcholine in a two step reaction into
choline and acetate, thus terminating the signal. The catalysis occurs in a
very deep, electron-rich binding pocket, which is also called the gorge (see
Figure 2.3). The structures of AChE complexed with Huperzine
A (PDB code: 1vot) and with Aricept (PDB code: 1eve) differ mainly in
the position of the side chain of one residue, Phe 330 (Figure 2.4)[97]. When
AChE is complexed with Huperzine A (1vot), Phe 330 adopts the
conformation that keeps the binding gorge closed. When, on the other hand, the
bulkier Aricept molecule is present in the complex (1eve), Phe 330 adopts
a conformation that allows the entry of this bigger ligand to the binding
pocket. The two conformations differ in the χ1 angle (1eve: 105.3°;
1vot: 58.9°).
Figure 2.3: Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored by (A) partial charge of the atoms and (B) the residue type (colored by PyMol): hydrophobic (GILMPV) – white, aromatic (FWY) – magenta, semipolar (C) – yellow, polar (HNQST) – cyan, positive (KR) – blue, negative (DE) – red. Acetylcholine is colored blue in both panes.
Comparisons and their analysis
To assess the performance of ISE-dock, results of rigid-protein docking and
cross-docking to AChE (1eve and 1vot) are compared to those obtained by
flexible docking. A total of 4 cross docking experiments are performed with
each method. The comparison is done using RMSD of the ligand only (heavy
atoms) due to the very strong similarity between the backbones of 1eve and
1vot, which differ by an RMSD of only ≈ 0.2 Å. In addition, RMSD of all movable
heavy atoms is calculated (including side chains). This allows an evaluation
of our docking by the commonly accepted RMSD criteria, but does not allow a
comparison of rigid and flexible docking.
Figure 2.4: AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both the complexes are highlighted using sticks.
As with the rigid docking, we use the three criteria of (1) top ranked
solution, (2) best out of top 20 poses, and (3) best available pose to compare
to the crystallographic structure.
2.5.3 Flexibility of several side chains –
Test case of trypsin
General
Trypsin is a serine protease in the gastrointestinal tract, where it is respon-
sible for protein hydrolysis. It is a very well studied protein with numerous
available 3D structures in the PDB. Due to its role in the digestive system,
trypsin is not very selective, as it is supposed to bind and cleave a very broad
range of proteins and peptides. Due to this nonspecific binding, many
structurally diverse small molecules bind to trypsin. A set of 10 protein-ligand
structures was chosen as a data set for this study. Their PDB codes are:
1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. This data set
is similar (but not identical) to that used by Kramer et al. in their
evaluation of the FlexX program[49]. In the data set used here, three residues
(Leu 99, Gln 192, and Gln 221) show conformational changes of their
side chains over this data set. These residues were identified using visual
examination of the binding pockets of all the proteins in the set. Figure 2.5
summarizes the trypsin set.
Figure 2.5: Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are treated as flexible are shown as sticks. The remaining parts of the proteins are shown as backbone trace.
On average, the ligands in the trypsin data set contribute 4.1 rotatable
bonds per complex. The addition of three flexible side chains results in a
more than threefold rise in the number of rotatable bonds (average of 12.4
bonds per complex). For each rotatable bond, ISE-dock has to consider 60
possible angles, one for each 6°. This leads to a dramatic, exponential
increase in the problem size that ISE-dock has to consider: ≈10^16
combinations in rigid-protein docking vs ≈10^31 combinations after the
inclusion of protein flexibility. Because of this increase in problem size, a
fair comparison of the results of flexible docking to those obtained by
rigid docking is problematic. Thus, in this proof-of-concept examination, the
results are not compared to those obtained by rigid docking, but are evaluated
as described below.
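The size of this jump can be checked with a short calculation. The sketch below counts torsional combinations only and assumes nothing beyond the figures quoted above (60 values per rotatable bond, 4.1 vs 12.4 bonds on average); the function name is illustrative. Translation and orientation variables are not included, which is presumably why the absolute numbers in the text are larger than the torsion-only counts.

```python
import math

ANGLES_PER_BOND = 60  # one value per 6 degrees, as stated in the text


def log10_torsion_combinations(n_bonds: float) -> float:
    """log10 of the number of torsion-angle combinations for n_bonds bonds."""
    return n_bonds * math.log10(ANGLES_PER_BOND)


rigid = log10_torsion_combinations(4.1)      # ligand torsions only
flexible = log10_torsion_combinations(12.4)  # ligand plus three side chains
growth = flexible - rigid                    # orders of magnitude added

print(f"torsion-only space grows by about 10^{growth:.1f}")
```

The growth of roughly 15 orders of magnitude agrees with the step from ≈10^16 to ≈10^31 combinations reported above.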
2.5.4 Comparisons and their analysis
Having 10 trypsin-ligand complexes, it is possible to construct a 10x10 cross
docking matrix. The values of RMSD (as calculated over all the movable
heavy atoms) of the top scored solution, the best RMSD over the top 20
poses and the best available RMSDs are reported and analyzed.
Chapter 3
Flexible ligand – rigid protein
docking
Flexible ligand – rigid protein docking was compared between ISE and other
algorithms by assigning the results to different RMSD threshold bins.
Docking results are summarized in Table 3.1. In this table, three criteria for
comparing the methods are presented: RMSD of top scoring poses, best
RMSD in top 20 poses, and best RMSD in all available poses. The first criterion
assumes that the scoring function, which is related to energies, is an exact
measure of stability, and therefore concentrates on the best scored results. The
second criterion assumes that the scoring function may not be able to
distinguish between the best RMSD and some other poses, and limits those to
the best 20 by energy. The third criterion extends the second to a much
larger number of poses. The table presents the minimal RMSD for the set of
81 proteins, the maximal RMSD, its mean, median and standard deviation
and, finally, the t-test probability (PTT) for ISE with respect to each of the
other algorithms.
Table 3.1: Summary of docking results by ISE, LGA, Glide and GOLD.
          RMSD of top pose             Best RMSD in top 20 poses    Best RMSD in all available poses α
          ISE    LGA    Glide  GOLD    ISE    LGA    Glide  GOLD    ISE    LGA
Minimum   0.52   0.39   0.3    0.41    0.25   0.31   0.3    0.34    0.2    0.31
Maximum   5.99   5.95   10.63  10.19   2.46   3.65   10.36  6.35    1.64   2.74
Mean      1.73   1.9    2.57   3       0.98   0.99   1.49   1.56    0.73   0.89
Median    1.33   1.55   1.63   2.17    0.84   0.81   1.11   1.1     0.69   0.72
SD        1.14   1.39   2.58   2.44    0.51   0.62   1.44   1.26    0.37   0.5
P(PTT) 0.09 0 0 — 0.46 0 0 — 0.01 0.006
α ISE-dock: 4096 poses; AutoDock: 35 poses.
The detailed results for each of the complexes are presented as Table C.1 in
Appendix C. In the analysis of Table 3.1 and additional figures presented
below, I demonstrate that the performance of ISE-dock is in many aspects
better than the performance of several well established docking programs.
Note that the results for Glide and GOLD, which were obtained
by Perola et al.[84] and are reported here, differ slightly from those
published in the original work, because the original results were obtained using
100 publicly available and 50 internal company structures[84], whereas this
report uses the subset of 81 publicly available structures.
3.1 Top scoring poses
Figure 3.1 presents the fraction of top scored poses in the full set of docking
experiments that fall within a given RMSD threshold from the crystal structures. It
may be seen that ISE-dock achieves better results than the other three
programs when considering 50% of the complexes or more. ISE did not
dock any complex with a top scored solution below 0.5 Å. At the remaining
threshold values, ISE and LGA outperform (to various degrees) Glide and
GOLD with respect to the number of structures with top scored solutions
below the corresponding threshold. For thresholds above 1.0 Å, there is a
slight advantage of ISE over LGA, which increases for larger threshold values.
About 70% (65% for LGA) of the top scoring structures are found by ISE-dock
to be under 2.0 Å RMSD from experiment and nearly 85% (76% for
LGA) are found under 3.0 Å.
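The kind of threshold analysis plotted in Figure 3.1 can be sketched as follows. This is a minimal illustration, assuming one RMSD value per complex and the thresholds 0.5, 1.0, 2.0 and 3.0 Å mentioned in the text; the function name `fraction_within` is hypothetical and the strict "below" comparison is an assumption.

```python
from typing import Dict, Iterable, Sequence


def fraction_within(rmsds: Sequence[float],
                    thresholds: Iterable[float] = (0.5, 1.0, 2.0, 3.0)
                    ) -> Dict[float, float]:
    """Fraction of complexes whose RMSD is strictly below each threshold."""
    if not rmsds:
        raise ValueError("at least one RMSD value is required")
    n = len(rmsds)
    # Booleans sum as 0/1, so the sum counts complexes below the threshold.
    return {t: sum(r < t for r in rmsds) / n for t in thresholds}
```

For example, `fraction_within([0.4, 0.9, 1.5, 2.5])` gives 0.25, 0.5, 0.75 and 1.0 for the four thresholds.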
The mean and median RMSD values for the top scored poses, as well as
the standard deviations, are better with ISE than LGA, Glide or GOLD.
The PTT for ISE results vs the others are: LGA: P=0.09, Glide: P=0.002
and GOLD: P< 0.001. P is the probability that the difference between the
algorithms is random, as calculated by PTT.
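As an illustration of such a pairwise comparison, the sketch below computes the statistic of a paired t-test over per-complex RMSD values of two programs. It assumes that PTT denotes a paired t-test and that the two lists are ordered by the same complexes; neither detail is confirmed by this section. Only the t statistic and the degrees of freedom are computed; converting them to P requires the Student t distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float],
                       b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with >= 2 values")
    # Per-complex differences between the two programs.
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```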
Top scoring poses are the complexes with the best interaction energy, and are
expected to show the lowest RMSD from the experimental structure. However, they are
frequently found to have larger RMSD values due to (1) limited inclusion
of flexibility and (2) limitations of the scoring functions, which compromise
between speed and quality. Still, these scoring functions are expected to be
good enough to identify the best answers among the top results of a docking
experiment, and the number 20 was chosen[84] to probe for such best RMSD
results.
Figure 3.1: Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al.[84].
3.2 Top 20 poses
Comparison of the top 20 poses demonstrates that ISE-dock outperforms both
Glide and GOLD and shows better or similar performance compared to
AutoDock's LGA. The mean and the median RMSD values of the best out
of the top 20 poses obtained by ISE are similar to those obtained by LGA
and are better than those obtained by the other two algorithms. Pairwise
comparison shows that the performances of ISE and LGA on the top 20 poses
are statistically indistinguishable (P=0.46). Examination of the best 20 docking poses shows that
ISE is clearly better than Glide and GOLD, with a probability P≤0.001
with respect to either of these two (see Table 3.1). Figure 3.2 demonstrates
that LGA and ISE have an advantage over Glide and GOLD for the top 20
poses in all RMSD ranges. ISE results for the 0.5 Å, 2.0 Å and 3.0 Å thresholds are
better than those of LGA. ISE alone produced at least one solution of 3.0 Å or better
among the top 20 poses for the entire test set (100.0% compared to
97.5%, 90.1% and 87.6% for LGA, Glide and GOLD, respectively). In 98%
of the examined molecules, ISE produced solutions that are closer than 2.0 Å
to the experimental structure. Examination of the top 20 poses is most meaningful for
comparing the programs, as it appears to indicate that the sampling
conducted by ISE-dock is indeed more thorough than the sampling of the
other programs.
Figure 3.2: Top 20 docking poses, RMSD to corresponding crystal structures. Results for Glide and GOLD were obtained by Perola et al.[84].
3.3 Solution space coverage
ISE’s ability to generate very large populations of near-optimal solutions
results in much better coverage of solution space near the (global) minimum.
This is borne out by comparing the best RMSD in the full set of solutions by
ISE and LGA in similar CPU time (4096 and 35 solutions, respectively). The
population obtained in standard runs of ISE is larger than that obtained by
LGA by more than 100-fold. This significantly increases the chance of
finding docking poses with lower RMSD values. It is reasonable to compare
populations that differ that much in size, as we show in the discussion of
alternative binding modes in the results section. I could not compare
extended docking populations for Glide and GOLD, as no such data were
reported. It should be emphasized that ISE’s 4096 solutions in this case, and
any number of solutions in other cases, are not merely poses encountered
during the random search, but are the best ones following the probing of the
whole space. The PTT probability value for comparison of the two docking
sets is P=0.006. ISE results are better with respect to all the terms in the
five-number summary (minimum, maximum, average, median and standard
deviation) of the best RMSD in the entire solution set (Table 3.1). When
examining the percentage of complexes with at least one solution below a
certain threshold, as depicted in Figure 3.3, the most prominent difference
between the algorithms is at 0.5 Å: 32.0% vs 17.3% in favor of ISE. This
difference drops to 3.7% in favor of ISE at 2.0 Å. All 81 complexes
were docked by ISE with at least one solution below 2.0 Å. LGA succeeded
in docking all the complexes with at least one solution below 3.0 Å. These
findings suggest that populations docked by ISE, combined with a more accurate
scoring technique, may lead to better detection and identification of relevant
docking results.
The ISE docking population (comparing by CPU time, 4096 top solutions
of ISE vs 35 of LGA) is much more diverse in its poses than that
produced by LGA. We clustered the poses using the Sequential Leader
Clustering algorithm[36], with a default distance criterion of 1.0 Å. The average
number of clusters for the 81 molecules is ≈1870 for ISE and ≈14 for LGA.
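The general leader-clustering scheme can be sketched as below. This is a generic textbook version, not the implementation of reference [36]: poses are assumed to be equally ordered lists of (x, y, z) coordinates in a common frame, the distance is a plain coordinate RMSD without superposition or symmetry handling, and the "within or equal to the threshold" rule is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pose_rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Plain coordinate RMSD between two poses with matching atom order."""
    sq = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
             for (xa, ya, za), (xb, yb, zb) in zip(a, b))
    return math.sqrt(sq / len(a))


def leader_clustering(poses: Sequence[Sequence[Coord]],
                      threshold: float = 1.0) -> List[List[int]]:
    """Assign each pose, in input order, to the first cluster whose leader
    lies within `threshold`; otherwise the pose starts a new cluster."""
    leaders: List[Sequence[Coord]] = []
    clusters: List[List[int]] = []
    for idx, pose in enumerate(poses):
        for c, leader in enumerate(leaders):
            if pose_rmsd(pose, leader) <= threshold:
                clusters[c].append(idx)
                break
        else:
            # No existing leader is close enough: this pose becomes a leader.
            leaders.append(pose)
            clusters.append([idx])
    return clusters
```

Because the result depends on input order, feeding poses sorted by score makes the best scored member of each cluster its leader.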
Figure 3.3: Top available docking poses produced in equal CPU times, RMSD to corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).
3.4 Time performance
We used the time performance of ISE and LGA in order to choose
approximately equal processing times and analyze the number of solutions obtained
in that span of time. The average time needed to obtain 4096 docking
solutions on an Intel® Xeon™ 3 GHz computer, using ISE with the current
settings, was about 7.5 minutes. The average time needed to obtain 35
solutions using LGA was about 8.3 minutes. As mentioned above, the time
required by LGA is linear with the number of solutions. Thus, it is expected
that more than 16 hours are required to obtain 4096 docking solutions with
LGA. For AutoDock, it has been recently suggested to increase the
reliability of results by obtaining more solutions and by increasing the number
of evaluations[66]. Such an increase has a substantial toll in computer time,
which is absent in ISE. We could not compare the time performance of
ISE-dock to those of Glide and GOLD. Results for the quality of the solutions
with these programs are reported here as they appear in Perola et al.[84].
Figure 3.4: Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).
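The 16-hour estimate for LGA in Section 3.4 follows directly from the linear scaling stated there. The short sketch below reproduces it; the function name is illustrative and strict linearity of LGA run time with the number of solutions is the only assumption.

```python
def extrapolate_hours(minutes_for_batch: float,
                      batch_size: int,
                      target_solutions: int) -> float:
    """Linear extrapolation of run time (in hours) to target_solutions."""
    minutes_per_solution = minutes_for_batch / batch_size
    return minutes_per_solution * target_solutions / 60.0


# 35 LGA solutions took about 8.3 minutes; scale to 4096 solutions.
lga_hours = extrapolate_hours(8.3, 35, 4096)
print(f"estimated LGA time for 4096 solutions: {lga_hours:.1f} h")
```

The result is a little over 16 hours, against about 7.5 minutes for the same number of ISE solutions.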
The initial number of total possible combinations for ISE docking ranged
from 10^12 to 10^34, depending on the number of ligand rotatable bonds,
which ranged between 2 and 14. The number of iterations (between 50 and 76
for different molecules) needed to reduce the size of the problem below the
threshold (10^5 combinations for switching to exhaustive computations) is
approximately linear with respect to the logarithm of the initial problem size.
The graph that describes this relationship is shown in Figure 3.4. Based on
that linearity, it should be possible to extend the number of variables and
values to include protein side chains, main chain angles as well as additional
degrees of freedom.
3.5 Multiple binding modes
A growing body of data supports the existence of multiple binding modes of
ligands to receptors[18, 9, 44, 91, 57, 27, 35, 81]. In order to learn about mul-
tiple binding modes from ISE-dock, the shape of energy landscapes around
minima in energy vs RMSD graphs of ISE results is examined. These plots
may be roughly divided by visual examination into three groups: those with
one distinct funnel, those with multiple funnels and those with no distinct
funnel. It has been suggested[62] that existence of a single “canyon” at the
bottom of the energy landscape corresponds to a stable structure, multiple
minima might indicate the existence of multiple binding modes, and rugged
and unshaped energy vs RMSD plots may be the result of a looser or non-
specific binding, induced fit phenomena or domain swapping.
Figure 3.5A shows a representative of a few complexes that appear on
energy vs RMSD plots with a single funnel-like region (PDB code 1yds).
As expected, in this case, the docking solutions are structurally close to
the crystallographic pose and to one another (Figure 3.5B). Figure 3.6A
demonstrates an energy vs RMSD plot with two funnels (PDB code 1bqo),
while Figure 3.7A shows such a plot with no distinct funnel (docking results
of 1hpv). As one may see from Figure 3.7B, there are at least two predicted
binding modes for this complex, which is in agreement with our previous
suggestion. In Figure 3.7B, the ligand positions are spread over a large
conformational variation. Energy vs RMSD plots of the entire data set of
81 complexes after a single docking run are presented in Figures C.1 – C.7
(Appendix C.2).
Figure 3.5: A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol[15].
In 27 cases (34%), the span of energy for 4096 solutions between the
global minimum (GM) and the docking solution of highest energy is less than 5
kcal/mol. In 50 cases (61%), all 4096 solutions are within 5 – 15 kcal/mol
of the GM, and in only 4 cases (5%) is the energy spread larger than 15
kcal/mol. Figure 3.8 shows the cumulative percentage of solutions (for 81
complexes, each with 4096 poses) with increasing energy gaps from the GM,
thus clarifying that most conformations are close to the GM. These 4 plots
with high energy minima (1fm9, 1hpv, 1qbu, 3std) have, as in Figure 3.7A, no distinct
funnel. The docking poses of these 4 complexes have no single binding mode,
but are dispersed. The main feature of these complexes is that the ligands are
deeply buried in the binding pockets (data shown for 1hpv, Figure 3.7).
Figure 3.6: A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand (gray sticks) and the first 35 solutions (dark lines) docked by ISE.
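The grouping of complexes by energy span (less than 5, 5 – 15, and more than 15 kcal/mol between the global minimum and the highest-energy solution) can be expressed as a small helper. This is a sketch only: the function name and the handling of values that fall exactly on a boundary are assumptions.

```python
from typing import Sequence


def energy_span_class(energies: Sequence[float]) -> str:
    """Classify a docking population by the span between its lowest
    (global minimum) and highest energy, in kcal/mol."""
    if not energies:
        raise ValueError("at least one energy value is required")
    span = max(energies) - min(energies)
    if span < 5.0:
        return "<5"
    if span <= 15.0:
        return "5-15"
    return ">15"
```

Applied to the three energies listed for 1kv1 in Table 3.2 (-10.51, -9.13 and -8.84 kcal/mol), the span is 1.67 kcal/mol and the class is "<5"; a real classification would of course use all 4096 energies of a run.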
Figure 3.7: A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand and the first 35 solutions docked by ISE.
Figure 3.8: Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy span between the global minimum of each (pose number 1) and the other 4095 poses, below the given threshold (X-axis).
3.6 PDB data supports distinct funnels
Twenty four plots with multiple distinct funnels are found in our test set
(1azm, 1bqo, 1cim, 1eve, 1f4e, 1fm6, 1h1p, 1h9u, 1hdq, 1if7, 1iy7, 1jsv, 1k7e,
1kv1, 1qhi, 1qpe, 1r09, 1uvt, 1ydr, 3cpa, 3std, 4dfr and 5std). Ligands of
two of the twenty four complexes are present in the PDB in complexes with
other proteins (5-acetamido-1,3,4-thiadiazole-2-sulfonamide from 1azm in 9
complexes; 6-O-cyclohexylmethyl guanine from 1h1p in 2 complexes) but
display similar binding modes in all of them. One complex (3cpa) contains
glycyl-tyrosine as a ligand, which is not searchable in the PDB as it is not rec-
ognized as a hetero compound. Two complexes contain “related structures”
– same or similar proteins with different ligands (1f4e, 1kv1). Of these two,
I would like to concentrate on p38 MAP kinase that was crystallized with an
inhibitor (PDB code: 1kv1; ligand HET ID: BMU)[80]. Another structure
of the same protein exists in the PDB bound to a structurally different lig-
and (PDB code: 1kv2, ligand HET ID: B96)[80]. Figure 3.9 demonstrates
that those ligands bind in two different modes. The ligand in 1kv2 is much
larger (527 g/mol) than the ligand in 1kv1 (306 g/mol). An additional no-
ticeable difference between the two ligands is that the toluyl group of 1kv2
is positioned in the place of the CH2 pyrrole group of the ligand in 1kv1.
The energy vs RMSD plot for the 1kv1 complex (Figure 3.10) displays
three distinct funnels with solutions ranked 1, 222 and 270 at their bottom
(marked d1, d222 and d270). These three poses are summarized in Table 3.2.
As may be seen in Figure 3.11, the top scored pose is close to the crystal
structure position (RMSD of 1.37 Å). In the d222 solution (Figure 3.12) the
ligand is positioned in reverse to d1, while in d270 it is positioned so that
chlorophenyl is in the position of toluyl in 1kv2 (Figure 3.13). Generally,
LGA is capable of producing cluster-like structures when plotting the
calculated solution energy vs RMSD from a single structure, even when configured
to predict a relatively small number of docking solutions (see for example
docking solutions for complexes 2cgr, 3cpa or 4dfr in section C.2 of the Appendix).
Nevertheless, in the case of 1kv1, the points in Figure 3.10B, representing
35 LGA solutions, are all clustered around a small, well defined region in the
E vs RMSD plot and do not suggest any alternative binding modes.
Figure 3.9: Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using backbone atoms. The ligands are shown as sticks and the backbone of the residues closest (within 5.5 Å) to the ligand is shown as PyMol cartoons.
It has been proposed that thyroxine binds to Transthyretin in two
antiparallel modes[27, 35, 81]. ISE-dock and AutoDock's LGA were
applied to re-dock the thyroxine ligand from its crystal structure complex with
Human Transthyretin (PDB: 2rox). The energy vs RMSD plot for the
population obtained by ISE shows at least two distinct funnels with docking
solutions ranked 1st and 2nd (marked d1 and d2, respectively) at the
bottom of the energy funnels. Figure 3.14B shows that the two solutions are
indeed antiparallel. The solutions by LGA do not suggest an alternative
binding mode. Figure 3.14A shows the energy vs RMSD plots for ISE and
LGA docking solutions of 2rox.
Figure 3.10: Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA (B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with arrows.
Figure 3.11: The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C – cyan, N – blue, Cl – green. 1kv2 is colored according to: C – yellow, N – blue, O – red.
Figure 3.12: ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.
Table 3.2: Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea (from 1kv1)
Pose   E (kcal/mol)   RMSD (Å)
1      -10.51         1.37
222    -9.13          3.95
270    -8.84          4.69
Figure 3.13: ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11.
In AutoDock, the lower number of solutions supplied by LGA compared
to ISE in similar CPU time provides fewer suggestions for ligand binding
modes. This is further emphasized by the smaller number of clusters of LGA
docking compared to ISE-dock, which covers solution space better than
LGA in a similar CPU time. Large ISE populations may thus improve upon
the imperfections in the energy functions.
Figure 3.14: Energy vs. RMSD plot for docking populations of the complex 2rox, obtained by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and 2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest (within 5.5 Å) to the ligand is shown in PyMol cartoon representation colored cyan.
Chapter 4
Flexible Ligand – Flexible Protein
Docking
4.1 Protein backbone flexibility – test case of
collagenase
The coordinates of MMP–13 and MMP–1 (456c and 966c) were obtained
from the PDB. All the water molecules and metal ions, except for the
catalytic zinc, were removed. The ligands were separated from the protein and
saved in a separate file. As the 456c structure contains two identical chains,
only one of them (A) was used. Alternate positions of the conformationally
flexible loops (residues 248 – 253 for 456c and residues 244 – 247 for
966c) were produced by ISE. As any ISE implementation produces multiple
near-optimal solutions, only the conformations that differ from the best
scored one (global minimum) by not more than 5 kcal/mol were chosen for
the next step. For 966c, there were 31 such solutions, and the RMSD of backbone
atoms with respect to the crystallographic structure (over the flexible region
only) ranged between 0.09 Å and 0.33 Å. In the case of 456c, only 5 solutions
with energy values within 5 kcal/mol of the global minimum were generated.
Their RMSD values were slightly higher than those of 966c and ranged
between 0.59 Å and 0.61 Å. These RMSD values describe only the backbone of
the flexible protein fragment (loop).
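The selection rule described above (keep every loop conformation whose energy is at most 5 kcal/mol above the global minimum) amounts to a simple filter. The sketch below assumes conformations are available as hypothetical `(energy, label)` pairs; it is not taken from the ISE source code.

```python
from typing import List, Sequence, Tuple


def within_energy_window(conformations: Sequence[Tuple[float, str]],
                         window: float = 5.0) -> List[Tuple[float, str]]:
    """Keep conformations no more than `window` kcal/mol above the best one.

    conformations: hypothetical (energy, label) pairs; lower energy is better.
    """
    if not conformations:
        return []
    best = min(energy for energy, _ in conformations)  # global minimum
    return [(energy, label) for energy, label in conformations
            if energy - best <= window]
```

Each retained conformation would then serve as a separate rigid target for the subsequent ligand docking step.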
Loop generation is conducted without the presence of the ligand, and side
chain conformations are generated by SCAP[114] using the optimized
backbone conformations (section 2.5.1, page 58). Although the geometry of the
backbone in both proteins is very close to that observed in the PDB
structures, the prediction of side chain positions was also performed with
no ligand present. This might be the reason for the relatively high RMSD
values observed in this data set for the positions of the ligands (Table 4.1).
In the top 20 docking solutions the best RMSD values ranged between 1.59 Å
(456c-966c)ᵃ and 2.20 Å (966c-456c). The top scored solutions had RMSDs
between 2.25 Å and 3.49 Å. Nevertheless, the docking results indicate that
ISE-dock has successfully included good docking poses in the final docking
sets of all four docking experiments. This conclusion follows from the best
RMSD values in the entire docking populations of 4096 structures: RMSD
values are below 2 Å in all cases. If the ligand and the protein originate from
the same complex, the predictions of the ligand poses are even better: 1.33 Å for
456c and 1.18 Å for 966c. The fact that no solution with RMSD <1 Å was
found among the top 20 docking solutions is easily explained by two
factors: (1) the scoring function used during the docking process is not capable
of distinguishing between changes in the protein 3D structure and (2) the loop
structure was optimized with no ligand present in the binding site, while
the subsequent docking process did not allow the protein to accommodate to the
presence of the ligand.
ᵃ In this work, the names of cross docking experiments follow the [ligand name]-[receptor name] template.
Table 4.1: Collagenase data set, best ligands’ RMSD (Å) in top 1, top 20 and all available (4096) solutions. RMSD of the backbone from the crystal position of the corresponding solution is also reported.
Ligand  Receptor  Top 1               Top 20              Top 4096
                  Ligand  Backbone    Ligand  Backbone    Ligand  Backbone
456c    456c      2.25    0.61        1.59    0.61        1.33    0.61
966c    966c      2.92    0.28        2.09    0.20        1.18    0.21
456c    966c      3.49    0.13        2.14    0.20        1.61    0.27
966c    456c      2.75    0.61        2.20    0.61        1.76    0.61
Loop conformations were successfully predicted by the ISE algorithm, in
terms of the backbone structure. Nevertheless, due to the small ranges in
backbone RMSD values, no conclusion could be drawn about the ability of
ISE-dock on its own to discriminate between backbone positions.
4.2 Flexibility of a single side chain –
Test case of acetylcholinesterase
The two AChE structures in this study differ in the side chain positions of
the residue Phe 330 (Figure 2.4)[97]. The values of χ1 angles of 1eve and
1vot are 105.3° and 58.9°, respectively. The results of docking experiments
with AChE test set are presented in Table 4.2.
Rigid protein docking Rigid bound docking resulted in good accuracy:
RMSD values of top scoring solutions were 1.85 Å for 1eve and 0.86 Å for 1vot.
The best RMSD values among the top 20 solutions were 0.63 Å and 0.81 Å
for 1eve and 1vot, respectively. When no protein flexibility was allowed,
cross docking experiments, as expected, gave worse results than the “native”
(bound) docking. A decrease in the quality of the results was observed when
Aricept (1eve), the larger of the two ligands, was cross-docked into the protein
structure that was solved in complex with Huperzine A (1vot). The RMSD
value for the top ranked solution in that case was 2.91 Å. However, the ligand
pose closest to the experimental structure (pose #813 out of 4096 poses) had
an RMSD value of 1.43 Å.
Flexible protein docking – Cross docking When protein side chain (Phe
330) flexibility was allowed, cross docking of Aricept resulted in minor
improvements of RMSD values in the three tested parameters. On the other
hand, in the cross docking of Huperzine A, the top 1 and the top 20
solutions had worse RMSD values, compared to those obtained by rigid cross
docking. However, a ligand pose much closer to the experimental one was found for
Huperzine A in the entire docking solution set, with an RMSD of 0.45 Å,
compared to 0.65 Å that was obtained without protein flexibility. Examining
the predicted positions of all the movable atoms, one may find that high
quality results were included in the final docking sets of all four
protein-ligand combinations. This conclusion emerges from the RMSD values of the
closest solution out of the 4096 available ones: 0.85 Å for 1eve-1vot and 0.52 Å for
1vot-1eve cross-docking. On the other hand, the top solution and the top
20 solutions in the cross-docking cases have relatively high RMSD values. Figure 4.1
demonstrates the results of unbound docking for the AChE data set.
Table 4.2: Results of acetylcholinesterase cross docking experiments (RMSD [Å]). The results are reported for the best scored solution (Top 1) and the best RMSD values out of the top 20 and out of all the available solutions (Top 4096). The ligand structures are listed in rows and the protein structures are listed in columns.
                    Rigid docking    Flexible docking:     Flexible docking:
                                     ligand's position     all movable atoms
          Ligand    1eve    1vot     1eve    1vot          1eve    1vot
Top 1     1eve      1.85    2.91     2.17    2.12          1.95    1.85
          1vot      1.09    0.86     2.60    0.72          2.28    0.70
Top 20    1eve      0.63    1.97     1.87    1.59          1.55    1.19
          1vot      1.03    0.81     2.47    0.70          2.14    0.68
Top 4096  1eve      0.39    1.43     1.29    1.40          0.48    0.85
          1vot      0.65    0.54     0.45    0.24          0.52    0.37
Figure 4.1: The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in unbound (cross-) docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
Flexible protein docking – Bound docking When flexibility of Phe 330
was included, the quality of bound docking results for Aricept (1eve-1eve)
was worse than that obtained without protein flexibility. The ligand’s
RMSD values for the top scored solution, the best out of the top 20 and the
best available solution were 2.17 Å, 1.87 Å and 1.29 Å, respectively. In the case
of Huperzine A bound docking (1vot-1vot), there was a slight improvement
in the prediction of ligand position: 0.72 Å vs 0.86 Å for the best scored pose,
0.70 Å vs 0.81 Å for the best out of the top 20 solutions and 0.24 Å vs 0.54 Å for the best
available solution. The decrease in quality of bound docking results upon the
introduction of flexibility (as was observed in the case of 1eve-1eve) can be
related to the increase in problem complexity. On the other hand, Phe 330
flexibility during docking of Huperzine A into a closed pocket (1vot-1vot) may
have resolved minor clashes and, as a result, gave better results. Figure 4.2
illustrates the results of bound docking for the AChE data set.
Figure 4.2: The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in bound docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
4.3 Flexibility of several side chains – Test case
of trypsin
The RMSD values of χ torsional angles of the three residues that were treated
as flexible in this work are listed in Table 4.3. The structural differences
between the proteins along the data set (in terms of torsional RMSD values)
range from 2.7° (1tng – 1tnh) to 62.1° (1ppc – 3ptb).
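A torsional RMSD of this kind can be sketched as follows. The thesis does not state here how angle differences were wrapped, so the periodic treatment below (differences folded into 0 to 180 degrees) and the function names are assumptions for illustration.

```python
import math
from typing import Sequence


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def torsion_rmsd(angles_a: Sequence[float], angles_b: Sequence[float]) -> float:
    """RMSD over corresponding torsion angles of two structures, in degrees."""
    if len(angles_a) != len(angles_b) or not angles_a:
        raise ValueError("need two non-empty angle lists of equal length")
    diffs = [angle_diff(a, b) for a, b in zip(angles_a, angles_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

For the single χ1 angle of Phe 330 in the AChE pair (105.3° vs 58.9°), `angle_diff` gives 46.4°; for trypsin, the lists would hold the χ angles of Leu 99, Gln 192 and Gln 221 in a fixed order.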
Cross docking of the 10 PDB structures resulted in 100 different docking
experiments. The detailed results of all the experiments are listed in Ap-
pendix D. RMSD of top scoring poses, the best RMSD in top 20 poses and
the best RMSD of all the available poses are reported and analyzed in Ta-
ble 4.4 and Figure 4.3. These results are assigned to RMSD threshold bins.
The bins are identical to the ones that were used in the rigid protein docking
experiments (Section 4.2, page 87).
The overall results of cross docking over the trypsin data set are good.
Contrary to the intuitive expectation, the RMSD values on the diagonal
of Table 4.4 (bound docking) are frequently not the lowest ones. The
ligand from the 1tng complex is docked to all the protein structures with
lower RMSD values than the remaining ligands. On the other hand,
the ligand from 1tpp has the highest RMSD values. The detailed docking
results for the trypsin data set are listed in Table D.1 in the Appendix.
Table 4.3: Torsion RMSD (in degrees) of flexible residues in the trypsin data set
       1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc   0.0   43.4  42.6  42.1  43.0  40.7  40.3  41.0  61.5  62.1
1pph   43.4  0.0   35.5  34.5  36.7  34.0  33.3  34.3  58.3  61.0
1tng   42.6  35.5  0.0   2.7   4.3   4.8   5.0   4.6   48.1  37.6
1tnh   42.1  34.5  2.7   0.0   3.5   4.5   3.7   3.3   48.6  37.9
1tni   43.0  36.7  4.3   3.5   0.0   6.9   6.1   4.2   48.0  36.9
1tnj   40.7  34.0  4.8   4.5   6.9   0.0   2.8   4.4   50.6  40.9
1tnk   40.3  33.3  5.0   3.7   6.1   2.8   0.0   3.3   48.8  39.6
1tnl   41.0  34.3  4.6   3.3   4.2   4.4   3.3   0.0   49.2  39.8
1tpp   61.5  58.3  48.1  48.6  48.0  50.6  48.8  49.2  0.0   31.7
3ptb   62.1  61.0  37.6  37.9  36.9  40.9  39.6  39.8  31.7  0.0
Color map (degrees): 0 6 12 18 24 30 36 42 48 54 ≥60
Figure 4.3: Top docking poses at different RMSD bins with respect to crystal structures
No protein-ligand combination could be docked with a top scored solution
below an RMSD of 0.5 Å. In 5 cases, the entire docking set contained at least
one such pose. In 17 cases, the top scored docking solution had an RMSD
below 2.0 Å, and in 74 cases the 20 top scored solutions contained at least one pose
with RMSD<2.0 Å. Solutions with RMSD<3.0 Å were present in all the 100
protein-ligand combinations, and 92 of them contained at least one such
conformation among the top 20 docking solutions.
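The counting behind such statements is simple thresholding. The sketch below, with a hypothetical helper name, applies it to one row of Table D.1: the best ligand-only RMSD in the whole docking set for the 1tng ligand docked into each of the ten receptors. Since no other ligand in Table D.1 reaches a ligand-only RMSD below 0.5 Å, this row alone appears to account for the five cases mentioned above.

```python
def count_below(rmsd_values, threshold):
    """Number of docking experiments whose RMSD is strictly below the threshold."""
    return sum(1 for value in rmsd_values if value < threshold)


# Ligand-only "top4096" RMSD values (in Angstrom) for the 1tng ligand,
# receptors in the order 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp, 3ptb
# (copied from Table D.1).
tng_best_of_all = [0.43, 0.96, 0.28, 0.40, 0.63, 0.42, 0.55, 0.38, 0.53, 0.66]

below_half = count_below(tng_best_of_all, 0.5)
below_two = count_below(tng_best_of_all, 2.0)
```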
Table 4.4: Trypsin data set, RMSD values (Å) of top single docking poses and best docking poses in top 20 and top 4096 solutions, color coded. Rows: ligand; columns: receptor.

Top 1
      1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc   1.7   3.4   2.8   3.0   2.6   2.0   2.7   3.0   3.4   2.5
1pph   3.9   4.7   4.6   4.3   4.5   3.9   4.3   2.8   4.5   3.6
1tng   1.0   1.1   0.5   0.6   1.0   0.8   0.9   0.6   1.0   1.0
1tnh   3.4   4.4   2.8   3.4   2.6   2.1   3.5   2.1   2.1   2.8
1tni   3.0   3.1   2.8   2.3   2.5   2.3   4.1   3.8   4.1   2.7
1tnj   4.0   3.2   3.8   2.5   3.2   2.3   2.8   2.5   2.2   3.5
1tnk   4.7   4.3   2.9   3.6   2.7   2.6   3.5   4.4   3.7   3.1
1tnl   4.5   2.1   3.0   2.6   2.7   1.8   2.7   3.3   2.1   3.4
1tpp   4.8   5.6   5.2   4.6   4.6   4.5   5.5   5.2   4.2   6.0
3ptb   3.0   3.3   3.2   3.7   3.2   2.7   3.5   3.1   3.1   2.8

Top 20
      1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc   0.9   2.5   1.4   1.9   1.5   1.3   1.1   1.3   1.6   1.7
1pph   2.4   2.1   2.0   2.2   2.0   2.0   2.7   2.0   2.5   2.2
1tng   0.6   1.0   0.4   0.5   0.6   0.5   0.6   0.5   0.6   0.8
1tnh   1.6   1.5   1.4   1.4   1.4   1.3   1.4   1.4   1.5   1.8
1tni   2.0   1.9   1.6   1.7   1.7   1.8   1.6   1.5   1.9   1.9
1tnj   2.2   1.8   1.5   1.5   1.7   1.4   1.4   1.5   1.7   1.7
1tnk   1.9   1.8   1.7   1.6   1.6   1.6   1.4   1.7   1.7   1.6
1tnl   1.9   1.5   1.3   1.3   1.3   1.3   1.4   1.3   1.4   1.4
1tpp   3.1   2.7   4.5   3.4   4.4   3.5   4.1   4.1   2.6   4.0
3ptb   1.8   1.6   2.6   2.3   2.4   2.3   1.9   2.4   1.9   2.2

Top 4096
      1ppc  1pph  1tng  1tnh  1tni  1tnj  1tnk  1tnl  1tpp  3ptb
1ppc   0.9   1.4   1.3   1.6   1.1   1.2   1.0   1.3   1.3   1.4
1pph   2.0   1.7   1.7   1.7   1.6   1.6   1.8   1.4   1.9   1.8
1tng   0.4   1.0   0.3   0.4   0.6   0.4   0.5   0.4   0.5   0.7
1tnh   1.3   1.1   1.1   1.2   1.2   1.1   1.1   1.2   1.2   1.3
1tni   1.3   1.4   1.2   1.4   1.3   1.2   1.2   1.1   1.5   1.2
1tnj   1.3   1.4   1.1   1.2   1.3   1.1   1.1   1.0   1.3   1.2
1tnk   1.6   1.4   1.4   1.2   1.4   1.3   1.2   1.4   1.4   1.4
1tnl   1.1   1.2   1.1   1.2   1.1   1.2   1.1   1.0   1.4   1.2
1tpp   2.1   1.9   2.2   1.8   2.5   2.0   2.0   2.7   1.8   2.8
3ptb   1.0   1.1   1.0   1.2   0.9   1.4   1.2   1.1   0.8   1.2

Color map (Å): 0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 ≥3
4.4 Discussion on protein flexibility
Accounting for protein flexibility introduces additional degrees of freedom,
but is a more realistic representation of biological systems. Until recently,
most major docking programs ignored conformational variations
of the side chains and backbone of the receptor[97]. Nevertheless, owing to
advances in docking algorithms and in computational power, four out of
the five most cited docking programs for the year 2005[95] allow some extent of
protein flexibility (Table 4.5). Therefore, any newly proposed protein-ligand
docking program is expected to address protein flexibility. Due to time
constraints, handling of protein flexibility by ISE-dock was implemented only
partially, as a preliminary step before further development. In order to include
protein flexibility, the scoring function of AutoDock (and thus of
ISE-dock) was extended and applied to conditions that were not accounted
for during its construction and calibration. Applying the scoring
function to cases that differ dramatically from the ones used for
its construction and calibration was a trade-off between accuracy and
speed of development in the proof-of-concept phase, and it
has a direct impact on the quality of the results. Although limited to small
regions, protein flexibility handling in ISE-dock is successful and is another
demonstration of the ability of ISE to deal with multiple degrees of freedom
in protein-ligand docking problems. Indeed, docking experiments in all
three test sets succeeded in producing high quality docking poses. The solutions
in the collagenase set contained ligand poses with ligand RMSD values
as low as 1.18 Å (966c-966c), the AChE set contained solutions with RMSD for
all movable atoms of 0.48 Å (1eve-1eve), and the trypsin set contained docking
solutions with even lower RMSD: 0.3 Å (1tng-1tng).

Table 4.5: Current status of protein flexibility handling in ISE-dock and in five popular docking programs (sorted according to the number of citations in 2005[95])
ISE-dock   Explicit flexibility of several side chains specified by the user. Implicit handling of changes in the backbone using pregenerated populations.
AutoDock   No protein flexibility in AutoDock ver.3. Recently released ver.4 allows side chain flexibility of selected residues.
DOCK       Protein flexibility is not implemented.
FlexX      FlexX-Ensemble (formerly known as FlexE) – an extension of FlexX. The flexibility of the protein is represented by an ensemble of structures, combined into a so-called united protein description. It is possible to recombine elements from different ensemble structures.
GOLD       Partial protein flexibility, including protein side chains and backbone flexibility for up to ten user-defined residues.
ICM        Partial protein flexibility, including protein side chains and selected loops.
The main pitfall of flexible ligand – flexible protein docking using
ISE-dock is the scoring function. The original energy function does not
account for changes in the 3D structure of the protein. Implicit protein
flexibility (collagenase data set) involves combining solutions of docking a
ligand into different protein structures. Explicit handling of changes in the
protein 3D structure during the docking process involves transferring atoms
from the protein to the ligand and excluding the Cγ atoms of the flexible
residues from the scoring scheme. The quality of top scored solutions is
heavily biased by the scoring function. As one may see from the results
for the collagenase data set (Section 4.1) and, to a greater extent, from the
AChE and trypsin sets (Sections 4.2 and 4.3), poses that are very close to
those of the crystal structures are always sampled in the final docking sets,
but are not scored well. Rescoring the docking solutions, with or without a
post-docking processing step, may improve the predictive ability of
ISE-dock.
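The implicit approach, in which docking solutions obtained for different protein structures are combined, can be pictured with the following minimal Python sketch. It is only an illustration: the function name, the structure labels and the scores are made up, and it assumes that a lower score is better, as with AutoDock-style energies.

```python
import heapq


def merge_docking_sets(poses_by_structure, keep=20):
    """Pool (pose_id, score) lists from several receptor structures
    and return the `keep` best-scored entries as (score, structure, pose_id)."""
    tagged = [
        (score, structure, pose_id)
        for structure, poses in poses_by_structure.items()
        for pose_id, score in poses
    ]
    return heapq.nsmallest(keep, tagged)


# Made-up scores for two hypothetical receptor conformations.
pooled = merge_docking_sets(
    {
        "conf_a": [(0, -8.2), (1, -7.5)],
        "conf_b": [(0, -9.1), (1, -6.9)],
    },
    keep=3,
)
```

Because every pooled entry keeps the label of the structure it came from, a combined ranking of this kind still identifies which receptor conformation each pose belongs to.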
Protein flexibility is an important aspect of a protein-ligand docking pro-
gram. Other degrees of freedom that were not accounted for in this work,
but that can be introduced into ISE-dock with relative ease are modelling
of structurally important water molecules, as well as tautomeric and proto-
nation states.
Chapter 5
Conclusions
Iterative Stochastic Elimination is a generic optimization algorithm that aims
to solve highly complex combinatorial optimization problems in an efficient
and fast manner. We find that it is able to solve the docking problem, like many
others, in polynomial time. Another advantage of ISE is its ability to produce
arbitrarily large numbers of near-optimal solutions without a substantial
penalty in terms of CPU time. ISE was first implemented in our lab in 2000
to solve the problem of positioning polar protons in protein structures[30]
and is under constant development. Since then it has been successfully applied
to side chain positioning[31], structure prediction of cyclic
peptides[87], flexible fragments in protein backbones[86, 76] and other problems.
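The general idea of eliminating values that keep showing up in the worst results can be illustrated with a deliberately simplified toy in Python. This is not the ISE-dock implementation: the function names, the cost function and all parameter values are hypothetical, and the statistical criteria of the real algorithm are replaced by a crude rule (a value is dropped if it occurs among the worst sampled solutions but never among the best ones).

```python
import itertools
import math
import random


def toy_elimination(domains, cost, sample_size=200, tail=20,
                    threshold=100, max_iters=50, seed=0):
    """Prune per-variable value lists by sampling, then enumerate what is left."""
    rng = random.Random(seed)
    pool = [list(values) for values in domains]  # remaining values per variable
    for _ in range(max_iters):
        # Stop sampling once the remaining space is small enough to enumerate.
        if math.prod(len(values) for values in pool) <= threshold:
            break
        samples = [tuple(rng.choice(values) for values in pool)
                   for _ in range(sample_size)]
        samples.sort(key=cost)
        best, worst = samples[:tail], samples[-tail:]
        changed = False
        for i, values in enumerate(pool):
            seen_in_best = {solution[i] for solution in best}
            doomed = {solution[i] for solution in worst} - seen_in_best
            if doomed:
                # Values seen in the best subset always survive, so the list stays non-empty.
                pool[i] = [v for v in values if v not in doomed]
                changed = True
        if not changed:
            break
    # Exhaustive phase: rank every combination of the surviving values.
    ranked = sorted(itertools.product(*pool), key=cost)
    return pool, ranked


def toy_cost(solution):
    """Hypothetical separable cost with its minimum at value 2 for every variable."""
    return sum((value - 2) ** 2 for value in solution)


pool, ranked = toy_elimination([range(6)] * 5, toy_cost)
```

The final ranked list is a whole population of low-cost solutions rather than a single answer, which is the property of ISE that the text above emphasizes.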
ISE-dock is a new docking program based on the Iterative Stochastic
Elimination algorithm. The program's performance in flexible ligand –
rigid protein docking was compared to that of AutoDock, Glide and
GOLD on 81 complexes which are part of a set of complexes previously
chosen to compare docking programs. The ability to handle conformational
changes in the backbone and the side chains of the protein was assessed using
three independent data sets: collagenase (backbone flexibility, 2 structures),
acetylcholinesterase (single side chain flexibility, 2 structures) and trypsin
(flexibility of several side chains, 10 structures).
In flexible ligand – rigid protein docking, ISE-dock performs better than
the three other docking programs on these complexes. ISE-dock succeeds in
docking all the 81 complexes with at least one solution of RMSD<3.0 Å
among the top 20 scored poses (LGA of AutoDock finds 97.5%, Glide finds
90.1% and GOLD finds 87.7%), and with at least one solution of RMSD<2.0 Å within
the entire docking population (LGA finds 96.3%; no information is available
on Glide and GOLD). PTT of the top 20 solutions and of all the available solutions,
applied to the results of ISE-dock and of the other algorithms, shows
a clear advantage for ISE-dock.
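Assuming that PTT denotes a paired t-test, the comparison can be sketched in Python as follows. The data are the best RMSD values in the top 20 poses of ISE-dock and LGA for the first twelve complexes of Table C.1 (13gs to 1cim); this subset is used only to keep the example short, so the resulting statistic is not the one obtained for the full set of 81 complexes. The function name is hypothetical.

```python
import math
import statistics


def paired_t(first, second):
    """Return (mean difference, t statistic) for paired samples."""
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err


# Best RMSD (Angstrom) in the top 20 poses, complexes 13gs ... 1cim of Table C.1.
ise = [0.46, 0.47, 1.50, 0.86, 1.06, 0.95, 1.97, 1.03, 0.72, 1.64, 1.71, 0.66]
lga = [0.72, 0.97, 1.54, 0.80, 1.01, 1.22, 2.17, 3.02, 0.51, 1.83, 1.88, 0.65]

mean_diff, t_stat = paired_t(ise, lga)
```

A negative mean difference means that, on this subset, the ISE-dock values are lower on average; the t statistic would then be compared with the Student distribution with n - 1 degrees of freedom.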
The more significant results of the flexible ligand - rigid protein docking
experiments stem from the ability of ISE to produce large near-optimal
populations of solutions without significant additional CPU effort. These
populations improve the coverage of solution space and may be used to estimate
the shape of energy landscapes near minima and to suggest multiple
binding modes, as was demonstrated in two cases (p38 MAP kinase – 1kv1
and Human Transthyretin – 2rox). The ability to analyze energy landscapes
accessible to ligands in a pocket has thus been shown to be useful. However,
the accuracy of that analysis cannot be fully assessed yet due to the lack
of experimental data. Although such an analysis of very large docking
populations is theoretically possible with other docking programs, to the best of
our knowledge the energy (score) vs RMSD plots of docking solutions, although
known previously, have not been used to visualize and estimate the energy
landscape of a protein – ligand complex.
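One simple way to summarize an energy (score) vs RMSD plot numerically is to record the lowest score found in each RMSD bin. The Python sketch below does this for a handful of made-up poses; the function name, the bin width and the values are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def energy_profile(poses, bin_width=0.5):
    """Return {bin start (Angstrom): lowest score} for (rmsd, score) pairs."""
    lowest = defaultdict(lambda: float("inf"))
    for rmsd, score in poses:
        start = int(rmsd // bin_width) * bin_width
        lowest[start] = min(lowest[start], score)
    return dict(sorted(lowest.items()))


# Made-up (RMSD, score) pairs for a hypothetical docking population.
poses = [(0.3, -9.1), (0.4, -8.7), (1.2, -9.4), (1.4, -7.9), (3.8, -9.0)]
profile = energy_profile(poses)
```

Several well separated bins with comparably low scores, as in this made-up population, are the kind of pattern that may hint at multiple binding modes.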
Accounting for protein flexibility introduces additional degrees of freedom,
but gives a more realistic representation of biological systems. Handling
of protein flexibility was introduced into ISE-dock in a partial way. Even in
this preliminary implementation, ISE-dock was shown to successfully dock
flexible ligands into partially flexible protein structures, with a few flexible
side chains or with backbone flexibility. In all the cases, the docking
populations obtained by ISE-dock contained good to excellent solutions.
In the collagenase data set (Section 4.1), flexible ligands were successfully
docked into protein structures with partially flexible loops. The accuracy in
predicting the structure of the backbone is very high, with RMSD of backbone
atoms as low as 0.13 Å from the crystal structure. Although the top ranked
solutions for ligand positions had a high RMSD from the experimental
structure (2.25 Å – 2.49 Å), the docking populations contained high quality
solutions (RMSD of 1.18 Å – 1.76 Å).
Docking experiments with side chain flexibility (AChE, Section 4.2, and
trypsin, Section 4.3) were even more accurate: in the AChE case, the docking
populations contained solutions with RMSD values as low as 0.37 Å, and in the
case of trypsin, the best population contained a solution with RMSD=0.30 Å.
The experiments presented in this work show that ISE is capable of solving
very complex problems. In addition to molecular flexibility, such problems
may target protonation and tautomerization states of both the protein
and the ligand, explicit simulation of water molecules, etc. The latter task is
of great importance, as it is known (see, for example, [85, 104]) that including
water molecules improves the quality of docking results. In order to equip
ISE-dock with all these important features, one has to overcome two major
obstacles: (1) adaptation of the grid based scoring function to correctly treat
conformational changes in the protein and (2) docking several molecules (or
any independent entities) simultaneously.
Appendix A
Results published in a peer reviewed journal
Following is the letter from the editor of the journal “PROTEINS: Structure, Function, and Bioinformatics” confirming that an article based on this work has been accepted for publication.
Return-path: <[email protected]>
Envelope-to: [email protected]
...
Message-ID:
<439655644.1187888215280.JavaMail.wladmin@mcv3-wl18>
Date: Thu, 23 Aug 2007 12:56:55 -0400 (EDT)
From: [email protected]
Subject: PROTEINS: Manuscript Prot-00274-2007.R1 Accepted
Errors-To: [email protected], [email protected]
PROTEINS: Structure, Function, and Bioinformatics
23-Aug-2007
Dear Mr. Boris Gorelik:
Your manuscript entitled "High quality binding modes in docking
ligands to proteins" has passed all required peer review and has
been recommended to me by the Editorial Board. I am pleased
to accept the paper for publication in the next available issue of
PROTEINS.
You will receive an e-mail immediately following with instructions
for production of your article. I look forward to seeing it in press.
Congratulations on submitting such an excellent study.
Sincerely,
Eaton E. Lattman
Editor-in-Chief
PROTEINS: Structure, Function, and Bioinformatics
The Johns Hopkins University
Department of Biophysics
Baltimore, MD 21218 U.S.A.
Appendix B
ISE-dock and AutoDock parameters and their values
B.1 AutoDock parameters and their default values
Following are the default parameters of AutoDock v 3.0.5 and their short description. For more details see the manual published by the AutoDock authors.
seed time pid # for random number generator
types CANOSH # atom type names
fld [PROTEIN_NAME].maps.fld # grid data file
map [PROTEIN_NAME].C.map # C-atomic affinity map file
map [PROTEIN_NAME].A.map # A-atomic affinity map file
map [PROTEIN_NAME].N.map # N-atomic affinity map file
map [PROTEIN_NAME].O.map # O-atomic affinity map file
map [PROTEIN_NAME].S.map # S-atomic affinity map file
map [PROTEIN_NAME].H.map # H-atomic affinity map file
map [PROTEIN_NAME].e.map # electrostatics map file

move [LIGAND_NAME].pdbq # small molecule file
about [X],[Y],[Z] # small molecule center

# Initial Translation, Quaternion and Torsions
tran0 random # initial coordinates/A or "random"
quat0 random # initial quaternion or "random"
ndihe 10 # number of initial torsions
dihe0 random # initial torsions
torsdof 0 0.3113 # num. non-Hydrogen torsional DOF & coeff.

# Initial Translation, Quaternion and Torsion Step Sizes
# and Reduction Factors
tstep 2.0 # translation step/A
qstep 50.0 # quaternion step/deg
dstep 50.0 # torsion step/deg
trnrf 1. # trans reduction factor/per cycle
quarf 1. # quat reduction factor/per cycle
dihrf 1. # tors reduction factor/per cycle

# Internal Non-Bonded Parameters
intnbp_r_eps 4.00 0.0222750 12 6 #C-C lj
[LENNARD JONES PARAMETERS FOR EACH PAIR OF ATOM TYPES]
intnbp_r_eps 2.00 0.0029700 12 6 #H-H lj

outlev 1 # diagnostic output level

# Docked Conformation Clustering Parameters for
# "analysis" command
rmstol 1.0 # cluster tolerance (Angstroms)
rmsref [LIGAND_NAME].pdbq # reference structure
# file for RMS calc.
write_all # write all conformations in a cluster

extnrg 1000. # external grid energy
e0max 0. 10000 # max. allowable initial energy,
# max. num. retries

# Genetic Algorithm (GA) and Lamarckian
# Genetic Algorithm (LGA) Parameters
ga_pop_size 50 # number of individuals in population
ga_num_evals 250000 # maximum number of
# energy evaluations
ga_num_generations 27000 # maximum number
# of generations
ga_elitism 1 # num. of top individuals that
# automatically survive
ga_mutation_rate 0.02 # rate of gene mutation
ga_crossover_rate 0.80 # rate of crossover
ga_window_size 10 # num. of generations for
# picking worst individual
ga_cauchy_alpha 0 # ~mean of Cauchy distribution
# for gene mutation
ga_cauchy_beta 1 # ~variance of Cauchy distribution
# for gene mutation
set_ga # set the above parameters for GA or LGA

# Local Search (Solis & Wets) Parameters
# (for LS alone and for LGA)
sw_max_its 300 # number of iterations of
# Solis & Wets local search
sw_max_succ 4 # number of consecutive successes
# before changing rho
sw_max_fail 4 # number of consecutive failures before
# changing rho
sw_rho 1.0 # size of local search space to sample
sw_lb_rho 0.01 # lower bound on rho
ls_search_freq 0.06 # probability of performing local
# search on an indiv.
set_psw1 # set the above pseudo-Solis & Wets parameters

# Perform Dockings
ga_run 10 # do this many GA or LGA runs

# Perform Cluster Analysis
analysis # do cluster analysis on results
B.2 ISE-dock parameters and their default values
Following are the default parameters of ISE-dock. Parameters that are common to AutoDock are not listed here.
# ISE docking parameters
ise_sample_size -50 # sample size. negative values mean that
# the size will be the product of current pool depth and
# the absolute value of this parameter

ise_conf_in_h_l -2 # number of conformations in the
# highest- and lowest- energy subsets. negative values
# mean that the size will be the product of current pool
# depth and the absolute value of this parameter

ise_output_size 40 # number of solutions in the final
# docking set

ise_z_value 3.84 # statistical value that determines
# the rigidity of the elimination process

ise_elimination_fraction 0.1 # limit the number of values
# that can be eliminated from any given gene

ise_threshold 1e5 # threshold to switch from the
# stochastic to the exhaustive search

ise_method stochastic # one of the following:
# stochastic exhaustive

ise_pool_file <use_dpf> # if file name is specified,
# read the initial pool from it; if "<use_dpf>", then
# use the *grid parameters listed below to initialize
# the possibilities pool

ise_t_grid 1.5 # translation grid
ise_r_grid 6 # rotation grid
ise_d_grid 6 # dihedral torsions grid

ise_optimize_solution FALSE # perform local
# optimization on the final docking solution

ise_optimize_on_elimination TRUE # perform local
# optimization during the elimination phase. use the
# value of ls_search_freq parameter for probability
# of performing local search

ise_optimize_on_exhaustive_freq 0.6 # probability
# of local search during the exhaustive phase

set_ise # set the above parameters

# Perform ISE docking
ise_run

# Perform Cluster Analysis
analysis # do cluster analysis on results
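The "keyword, values, # comment" layout of these parameter listings is easy to read programmatically. The following Python sketch is a hypothetical helper, not part of ISE-dock or AutoDock: it parses such lines into a dictionary and resolves the convention stated above for negative sizes, namely that the effective size is the product of the current pool depth and the absolute value of the parameter. The pool depth of 7 used in the example is an arbitrary illustrative number.

```python
def parse_params(text):
    """Parse 'keyword value ... # comment' lines into {keyword: [values]}."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, *values = line.split()
        params[key] = values  # note: a repeated keyword overwrites the earlier one
    return params


def resolve_size(value, pool_depth):
    """Non-negative sizes are used as given; negative ones scale with pool depth."""
    size = int(value)
    return size if size >= 0 else abs(size) * pool_depth


sample = """
# ISE docking parameters
ise_sample_size -50 # sample size
ise_conf_in_h_l -2
ise_output_size 40 # number of solutions in the final docking set
set_ise # set the above parameters
"""

params = parse_params(sample)
effective_sample = resolve_size(params["ise_sample_size"][0], pool_depth=7)
```

A dictionary is sufficient for the ISE-specific keywords, which occur once each; a keyword that is repeated, such as the `map` lines of the AutoDock listing, would need a list of entries instead.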
Appendix C
Detailed Results
C.1 Flexible Ligand – Rigid Protein docking results
Table C.1.
Top scoring pose (ISE, LGA, Glide, GOLD) | Best RMSD in top 20 (ISE, LGA, Glide, GOLD) | Best RMSD in all available poses (ISE, LGA)
CODE ISE LGA Glide GOLD ISE LGA Glide GOLD ISE LGA
13gs 1.86 2.30 2.81 1.52 0.46 0.72 2.69 1.09 0.25 0.58
1a42 1.65 3.30 1.47 5.28 0.47 0.97 1.47 2.26 0.47 0.79
1a4k 1.88 1.91 2.29 2.33 1.50 1.54 1.38 1.81 0.76 1.46
1a8t 2.27 3.51 1.11 4.69 0.86 0.80 1.11 2.07 0.85 0.71
1afq 2.07 2.93 1.12 1.35 1.06 1.01 0.53 1.35 1.06 1.01
1atl 3.21 3.04 2.10 1.55 0.95 1.22 1.46 1.55 0.92 1.04
1azm 2.33 2.81 2.04 2.60 1.97 2.17 1.24 0.66 0.54 1.97
1bnw 3.93 4.21 4.36 4.88 1.03 3.02 1.35 4.30 0.61 1.12
1bqo 0.92 0.61 1.60 1.55 0.72 0.51 1.60 1.35 0.72 0.48
1br6 1.85 1.85 3.51 1.82 1.64 1.83 1.69 0.63 0.44 1.82
1cet 2.05 4.21 3.05 8.52 1.71 1.88 2.80 5.30 0.75 1.81
1cim 1.16 1.16 1.54 1.30 0.66 0.65 1.34 1.03 0.23 0.58
1d3p 1.32 3.91 2.40 4.03 1.03 0.86 1.61 1.57 0.91 0.85
1d4p 0.98 1.56 2.35 2.69 0.74 0.86 0.74 0.99 0.50 0.79
1d6v 2.31 2.50 4.06 4.08 1.79 2.36 2.01 1.68 0.97 2.19
1efy 2.53 4.45 1.95 2.88 1.98 2.03 0.38 0.69 0.52 1.95
1ela 1.15 1.55 0.75 1.25 1.14 0.87 0.75 1.06 1.08 0.87
1etr 1.71 0.66 1.49 2.60 1.19 0.66 1.15 2.18 1.01 0.66
1ett 2.55 4.59 0.92 4.37 0.85 0.72 0.65 1.29 0.85 0.72
1eve 1.52 2.58 1.94 2.39 0.58 0.59 1.15 1.03 0.51 0.52
1exa 0.52 0.46 0.43 0.41 0.36 0.44 0.43 0.41 0.23 0.41
1ezq 2.65 2.19 10.63 2.25 1.68 1.06 4.30 1.10 1.58 1.02
1f0r 1.53 1.66 8.72 3.19 0.80 0.62 1.90 1.23 0.80 0.62
1f0t 1.24 4.84 2.26 2.12 0.84 0.89 1.60 2.06 0.84 0.89
1f4e 3.92 3.92 1.23 1.75 2.46 1.73 1 1.55 0.56 1.36
1fcx 0.58 0.58 0.48 0.74 0.50 0.55 0.48 0.49 0.20 0.53
1fcz 0.57 0.59 0.77 0.91 0.45 0.54 0.52 0.50 0.24 0.49
1fjs 1.49 1.59 5.04 2.12 1.31 0.73 3.44 1.44 1.31 0.73
1fkg 1.07 1.20 1.75 4.18 0.93 0.93 1.67 4.05 0.93 0.93
1fm6 2.84 0.40 0.64 0.68 0.69 0.35 0.64 0.65 0.69 0.35
1fm9 1.72 1.60 1.74 3.38 1.21 0.85 1.74 1.49 1.17 0.85
1g4o 3.70 3.99 2.15 4.59 2.21 2.92 1.62 0.81 0.58 2.44
1h1p 4.08 3.72 0.65 1.21 1.35 1.35 0.65 0.52 0.38 1.31
1h1s 0.80 0.62 0.97 1.16 0.61 0.42 0.97 1.16 0.58 0.36
1h9u 0.59 0.53 0.82 1.12 0.33 0.47 0.48 1.03 0.33 0.35
1hdq 1.07 1.88 2.16 3.67 0.55 0.84 0.62 0.84 0.37 0.77
1hfc 1.55 4.47 2.37 2.34 1.40 0.98 1 0.61 1.34 0.98
1hpv 1.11 1.73 1.20 9.47 1.01 0.88 1.19 1.38 1.01 0.88
1htf 2.55 1.64 10.12 10.19 1.53 0.59 1.99 3.13 1.49 0.59
1i7z 0.87 1.02 0.60 0.86 0.45 0.82 0.44 0.82 0.45 0.38
1i8z 0.72 1.92 3.82 3.66 0.55 0.74 2.55 2.69 0.39 0.63
1if7 3.65 4.40 1.43 5.42 1.64 3.65 1.34 1.65 0.87 2.74
1iy7 0.96 1.04 1.16 0.91 0.75 0.99 0.99 0.59 0.75 0.77
1jsv 0.88 1.25 5.45 6.94 0.74 0.71 3.40 5.36 0.69 0.71
1k1j 4.11 1.47 5.88 6.54 1.59 1.23 4.48 3.24 1.57 1.23
1k22 1.69 0.55 0.74 1.03 1.06 0.42 0.74 0.72 1.06 0.41
1k7e 0.88 0.74 0.72 0.96 0.56 0.53 0.68 0.53 0.21 0.31
1k7f 0.79 0.77 2.02 0.84 0.69 0.68 0.51 0.76 0.69 0.66
1kv1 1.21 1.21 0.66 0.81 0.70 1.14 0.59 0.56 0.27 0.66
1kv2 0.73 0.78 1.63 0.80 0.58 0.69 0.91 0.74 0.52 0.63
1l8g 1.33 1.60 2.90 2.17 0.74 1.50 1.57 2.17 0.70 1.16
1lqd 0.89 0.39 1.93 0.65 0.74 0.31 1.93 0.45 0.74 0.31
1m48 1.89 1.12 0.68 1.64 1.10 0.55 0.68 1.12 1.10 0.55
1mmb 2.11 2.12 3.18 6.11 1.79 1.32 1.16 1.37 1.64 1.32
1mnc 3.96 0.69 0.36 1.95 1.53 0.60 0.36 1.38 1.21 0.60
1nhu 3.38 3.51 6.07 5.17 1.02 1.07 3.16 3.75 0.69 1.07
1nhv 3.26 4.68 6.57 8.95 1.35 1.76 5.96 4.45 1.04 1.76
1o86 3.46 1.25 1.06 1.85 1.80 1.25 0.97 0.99 1.54 1.25
1ppc 1.60 1.59 1.69 1.76 1.37 1.20 1.62 1.76 1.30 1.20
1pph 3.39 2.38 5.09 4.95 1.36 1.42 1.09 0.88 1.02 1.42
1qbu 0.97 0.72 10.36 2.59 0.86 0.66 10.36 2.59 0.86 0.66
1qhi 0.66 0.69 0.30 0.66 0.51 0.58 0.30 0.41 0.31 0.55
1qpe 0.63 0.67 1.50 0.52 0.44 0.47 0.52 0.34 0.25 0.45
1r09 5.99 5.95 0.82 1.81 1.85 1.50 0.82 0.53 0.49 1.21
1thl 2.88 2.12 8.54 10.08 1.72 1.15 1.78 2.12 1.11 1.15
1uvt 0.85 0.60 0.44 1.47 0.66 0.49 0.44 0.54 0.66 0.49
1ydr 1.51 0.65 1.56 2.52 0.53 0.62 0.67 2.52 0.32 0.57
1yds 0.69 0.66 0.50 0.55 0.54 0.60 0.50 0.55 0.49 0.60
2cgr 0.79 0.85 0.85 6.54 0.62 0.73 0.67 6.35 0.62 0.66
2pcp 1 0.99 0.64 3.89 0.30 0.96 0.62 1.08 0.30 0.95
2qwi 0.56 0.71 0.70 1.30 0.37 0.60 0.70 0.96 0.37 0.51
3cpa 0.84 0.85 0.79 0.73 0.69 0.62 0.53 0.60 0.69 0.61
3erk 0.59 0.72 0.44 1.42 0.25 0.64 0.44 0.63 0.21 0.64
3ert 1.14 1.44 4.66 4.74 0.88 1.03 2.48 2.39 0.88 0.90
3std 0.60 0.56 2.44 0.85 0.40 0.48 2.44 0.85 0.39 0.35
3tmn 0.66 3.09 8.07 7.59 0.54 0.58 3.18 3.90 0.48 0.58
4dfr 1.10 1.01 1.27 1.20 0.74 0.81 1.10 1.18 0.72 0.81
5std 0.52 0.47 0.73 0.86 0.34 0.42 0.73 0.58 0.28 0.40
5tln 1.73 3.82 9.67 6.52 1.11 0.88 1.20 1.01 1.11 0.88
7est 0.84 0.79 1.02 3.76 0.75 0.63 0.82 0.87 0.75 0.63
966c 1.05 0.70 2.44 2.42 0.81 0.55 2.21 2.34 0.81 0.55
Table C.1: Detailed docking results of the flexible ligand – rigid protein data set. RMSD [Å]
C.2 Flexible ligand – rigid protein docking energy landscapes
Following are the energy vs RMSD plots for ISE-dock and AutoDock of all the 81 complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.1: Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.
Figure C.2: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.3: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.4: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.5: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.6: Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand - rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
Figure C.7: Continued from the previous figure. Energy vs RMSD plots for ISE-dock(red) and AutoDock (green) of complexes in the flexible ligand - rigid protein dockingset. The graphs are sorted alphabetically according to the PDB code of the complex.
Appendix D
Flexible ligand – flexible protein docking. Trypsin data set
Table D.1
Ligand only atoms All movable atoms
ligand protein top1 top20 top4096 top1 top20 top4096
1ppc 1ppc 1.72 0.87 0.87 1.84 1.26 1.23
1ppc 1pph 3.38 2.48 1.42 2.8 2.17 1.6
1ppc 1tng 2.84 1.44 1.27 2.59 1.48 1.48
1ppc 1tnh 3.02 1.91 1.59 2.56 1.83 1.6
1ppc 1tni 2.59 1.53 1.08 2.3 1.53 1.31
1ppc 1tnj 1.99 1.3 1.25 2.21 1.73 1.64
1ppc 1tnk 2.73 1.13 1.02 2.48 1.41 1.4
1ppc 1tnl 3.05 1.34 1.34 2.85 1.74 1.65
1ppc 1tpp 3.44 1.6 1.33 3.02 2.01 1.69
1ppc 3ptb 2.49 1.69 1.45 2.4 1.94 1.84
1pph 1ppc 3.86 2.38 2.04 3.86 2.38 2.04
1pph 1pph 4.66 2.14 1.7 4.66 2.14 1.7
1pph 1tng 4.56 1.97 1.69 4.56 1.97 1.69
1pph 1tnh 4.31 2.21 1.74 4.31 2.21 1.74
1pph 1tni 4.5 1.99 1.57 4.5 1.99 1.57
1pph 1tnj 3.88 2.01 1.59 3.88 2.01 1.59
1pph 1tnk 4.27 2.7 1.79 4.27 2.7 1.79
1pph 1tnl 2.77 1.96 1.42 2.77 1.96 1.42
1pph 1tpp 4.46 2.51 1.9 4.46 2.51 1.9
1pph 3ptb 3.57 2.22 1.76 3.57 2.22 1.76
1tng 1ppc 0.97 0.64 0.43 1.85 1.55 1.03
1tng 1pph 1.12 0.99 0.96 1.92 1.75 1.6
1tng 1tng 0.53 0.43 0.28 1.58 1.27 0.89
1tng 1tnh 0.64 0.54 0.4 1.69 1.32 1.09
1tng 1tni 0.99 0.63 0.63 1.9 1.47 1.25
1tng 1tnj 0.77 0.54 0.42 2.07 1.4 1.01
1tng 1tnk 0.9 0.59 0.55 1.91 1.29 1.08
1tng 1tnl 0.62 0.5 0.38 1.28 1.03 0.85
1tng 1tpp 1.04 0.61 0.53 2.34 1.75 1.58
1tng 3ptb 1 0.78 0.66 2.25 2.04 1.86
1tnh 1ppc 3.36 1.56 1.3 3.36 1.56 1.3
1tnh 1pph 4.36 1.5 1.08 4.36 1.5 1.08
1tnh 1tng 2.82 1.36 1.15 2.82 1.36 1.15
1tnh 1tnh 3.39 1.4 1.18 3.39 1.4 1.18
1tnh 1tni 2.56 1.41 1.17 2.56 1.41 1.17
1tnh 1tnj 2.08 1.31 1.11 2.08 1.31 1.11
1tnh 1tnk 3.51 1.43 1.09 3.51 1.43 1.09
1tnh 1tnl 2.07 1.41 1.23 2.07 1.41 1.23
1tnh 1tpp 2.12 1.45 1.21 2.12 1.45 1.21
1tnh 3ptb 2.78 1.82 1.27 2.78 1.82 1.27
1tni 1ppc 2.95 2 1.29 2.95 2 1.29
1tni 1pph 3.09 1.85 1.39 3.09 1.85 1.39
1tni 1tng 2.83 1.6 1.19 2.83 1.6 1.19
1tni 1tnh 2.32 1.68 1.35 2.32 1.68 1.35
1tni 1tni 2.53 1.71 1.3 2.53 1.71 1.3
1tni 1tnj 2.31 1.81 1.2 2.31 1.81 1.2
1tni 1tnk 4.12 1.64 1.2 4.12 1.64 1.2
1tni 1tnl 3.81 1.5 1.14 3.81 1.5 1.14
1tni 1tpp 4.12 1.89 1.5 4.12 1.89 1.5
1tni 3ptb 2.66 1.85 1.22 2.66 1.97 1.22
1tnj 1ppc 3.99 2.17 1.27 3.99 2.17 1.27
1tnj 1pph 3.24 1.84 1.35 3.24 1.84 1.35
1tnj 1tng 3.85 1.48 1.15 3.85 1.49 1.15
1tnj 1tnh 2.5 1.49 1.19 2.5 1.49 1.19
1tnj 1tni 3.25 1.68 1.29 3.25 1.68 1.29
1tnj 1tnj 2.34 1.44 1.14 2.34 1.44 1.14
1tnj 1tnk 2.77 1.43 1.07 2.77 1.43 1.07
1tnj 1tnl 2.46 1.49 1.04 2.46 1.49 1.04
1tnj 1tpp 2.2 1.72 1.3 2.2 1.72 1.3
1tnj 3ptb 3.53 1.67 1.22 3.53 1.67 1.22
1tnk 1ppc 4.74 1.95 1.62 4.74 1.95 1.62
1tnk 1pph 4.28 1.79 1.42 4.28 1.79 1.42
1tnk 1tng 2.9 1.66 1.39 2.9 1.66 1.39
1tnk 1tnh 3.62 1.56 1.17 3.62 1.56 1.17
1tnk 1tni 2.66 1.61 1.42 2.66 1.61 1.42
1tnk 1tnj 2.6 1.59 1.28 2.6 1.59 1.28
1tnk 1tnk 3.52 1.41 1.25 3.52 1.41 1.25
1tnk 1tnl 4.44 1.73 1.43 4.44 1.73 1.43
1tnk 1tpp 3.67 1.75 1.45 3.67 1.75 1.45
1tnk 3ptb 3.09 1.65 1.37 3.09 1.65 1.37
1tnl 1ppc 4.46 1.86 1.12 4.46 1.86 1.12
1tnl 1pph 2.1 1.5 1.21 2.1 1.5 1.21
1tnl 1tng 3.05 1.34 1.14 3.05 1.34 1.14
1tnl 1tnh 2.56 1.33 1.18 2.56 1.33 1.18
1tnl 1tni 2.67 1.34 1.07 2.67 1.34 1.07
1tnl 1tnj 1.78 1.33 1.23 1.78 1.33 1.23
1tnl 1tnk 2.72 1.38 1.1 2.72 1.38 1.1
1tnl 1tnl 3.29 1.3 1.03 3.29 1.3 1.03
1tnl 1tpp 2.06 1.37 1.37 2.06 1.37 1.37
1tnl 3ptb 3.39 1.38 1.21 3.39 1.38 1.21
1tpp 1ppc 4.85 3.09 2.06 4.84 3.09 2.06
1tpp 1pph 5.58 2.73 1.94 5.58 2.73 1.94
1tpp 1tng 5.15 4.5 2.2 5.15 4.5 2.2
1tpp 1tnh 4.56 3.44 1.77 4.56 3.44 1.77
1tpp 1tni 4.61 4.39 2.49 4.61 4.39 2.49
1tpp 1tnj 4.5 3.54 1.96 4.5 3.54 1.96
1tpp 1tnk 5.53 4.06 1.99 5.53 4.06 1.99
1tpp 1tnl 5.19 4.11 2.74 5.19 4.11 2.74
1tpp 1tpp 4.23 2.61 1.78 4.23 2.61 1.78
1tpp 3ptb 5.97 3.99 2.82 5.97 3.99 2.82
3ptb 1ppc 3.04 1.75 0.98 3.04 1.75 0.98
3ptb 1pph 3.28 1.61 1.11 3.28 1.61 1.11
3ptb 1tng 3.16 2.59 0.97 3.16 2.59 0.97
3ptb 1tnh 3.7 2.3 1.18 3.7 2.3 1.18
3ptb 1tni 3.17 2.35 0.93 3.17 2.35 0.93
3ptb 1tnj 2.7 2.32 1.37 2.7 2.32 1.37
3ptb 1tnk 3.49 1.95 1.2 3.49 1.95 1.2
3ptb 1tnl 3.09 2.38 1.09 3.09 2.38 1.09
3ptb 1tpp 3.14 1.94 0.77 3.14 1.94 0.77
3ptb 3ptb 2.8 2.22 1.24 2.8 2.22 1.24
Table D.1: RMSD [Å] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross docking experiments
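The distinction between the two column groups of Table D.1 can be made concrete with a short Python sketch. It assumes the usual docking convention that the RMSD is computed over matched atoms without any superposition, since the pose and the crystal structure share the receptor frame; the coordinates and names below are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between matched atom coordinates, with no superposition applied."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Made-up coordinates (Angstrom): two ligand atoms displaced by 1 A along z,
# and one movable side chain atom that coincides with the crystal position.
ligand_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ligand_ref = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
side_chain_pose = [(5.0, 5.0, 5.0)]
side_chain_ref = [(5.0, 5.0, 5.0)]

ligand_only = rmsd(ligand_pose, ligand_ref)
all_movable = rmsd(ligand_pose + side_chain_pose, ligand_ref + side_chain_ref)
```

In this made-up case the well placed side chain atom lowers the value for all movable atoms relative to the ligand-only value; badly placed side chain atoms would raise it instead.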
List of Figures
1.1 Schematic diagram of the main methods in the drug discovery process. Arrows designate process flow. Black asterisks mark steps that may involve molecular docking. Abbreviations: SAR – structure-activity relationship; QSAR – quantitative SAR; ADME-Tox – absorption, distribution, metabolism, elimination, toxicity
1.2 Typical shapes of electrostatic interaction energy. The energies of two identical (full line) and opposite (dashed line) charges in vacuum are shown
1.3 Examples of inter- (left) and intra- (right) molecular H-bonds
1.4 Van der Waals interaction energy of argon dimer. Taken from Wikipedia [113] under the GNU Free Documentation License
1.5 Comparison of Morse (dashed line) and Hooke’s harmonic (full line) potentials of bond stretching energy around the minimum. To construct this graph, all the parameters in equations (1.15) and (1.16) were assigned the value of 1
2.1 “Tearing off” atoms to represent side chain flexibility using phenylalanine as an example. Dummy atoms are marked by the letter “D” in their names. The N, Cα and Cβ atoms on the receptor molecule overlap with their respective dummy counterparts.
2.2 Structural alignment of 456c and 966c. Backbone traces of the proteins are color coded according to the distance (in Å) between the aligned backbone atoms. RS-130830 (red) and RS-104966 (green) are shown as stick models.
2.3 Cross section of AChE complexed with acetylcholine (PDB code: 2ace), colored (A) by partial charge of the atoms and (B) by residue type (colored by PyMol): hydrophobic (GILMPV) – white, aromatic (FWY) – magenta, semipolar (C) – yellow, polar (HNQST) – cyan, positive (KR) – blue, negative (DE) – red. Acetylcholine is colored blue in both panels.
2.4 AChE complexed with Huperzine A (PDB code: 1vot, light gray) and with Aricept (PDB code: 1eve, dark gray). The ligands and Phe 330 side chains from both complexes are highlighted using sticks.
2.5 Trypsin data set. 10 superimposed trypsin structures: 1ppc, 1pph, 1tng, 1tnh, 1tni, 1tnj, 1tnk, 1tnl, 1tpp and 3ptb. The ligand molecules and the residues that are treated as flexible are shown as sticks. The remaining parts of the proteins are shown as backbone trace.
3.1 Top single docking poses at different RMSD bins with respect to crystal structures, 4 different programs. Results for Glide and GOLD were obtained by Perola et al. [84].
3.2 Top 20 docking poses, RMSD to corresponding crystal structures. Results for Glide and GOLD were obtained by Perola et al. [84].
3.3 Top available docking poses produced in equal CPU times, RMSD to corresponding crystal structures. The numbers of poses are 4096 (ISE) and 35 (LGA).
3.4 Number of iterations before switching to exhaustive search as a function of initial combinatorial size (number of initial combinations).
3.5 A: Energy vs RMSD plot for docking populations of the complex 1yds obtained with ISE, showing a single distinct funnel. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The first 35 solutions (dark lines) docked by ISE vs the ligand in the crystal (gray sticks). Receptor residues with at least one atom within 5.5 Å of the ligand are shown as light gray cartoon. All structures in this work were visualized using PyMol [15].
3.6 A: Energy vs RMSD plot for docking populations of the complex 1bqo obtained with ISE, showing two distinct funnels. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand (gray sticks) and the first 35 solutions (dark lines) docked by ISE.
3.7 A: Energy vs RMSD plot for docking populations of the complex 1hpv obtained with ISE, showing a scatter of the results. B: the same plot for 35 solutions obtained by LGA. The plots are shown using the same scale. C: The crystal structure of the ligand and the first 35 solutions docked by ISE.
3.8 Cumulative fractions (Y-axis) of 81 ISE docking complexes with an energy span between the global minimum of each (pose number 1) and the other 4095 poses, below the given threshold (X-axis).
3.9 Complexes 1kv1 (light gray) and 1kv2 (dark gray) superimposed using backbone atoms. The ligands are shown as sticks and the backbone of the residues closest to the ligand (within 5.5 Å) is shown as PyMol cartoons.
3.10 Energy vs RMSD plot for docking populations obtained by ISE (A) and LGA (B) of the complex 1kv1. The plots are shown using the same scale. The best single ISE solutions at each of the three funnels have ranks 1, 222 and 270 and are marked with arrows.
3.11 The best ISE-dock solution for 1kv1 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). 1kv1 is colored according to: C – cyan, N – blue, Cl – green. 1kv2 is colored according to: C – yellow, N – blue, O – red.
3.12 ISE-dock solution for 1kv1, ranked 222 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11
3.13 ISE-dock solution for 1kv1, ranked 270 (sticks). The crystal structures of 1kv1 and 1kv2 ligands are shown for comparison (lines). The coloring scheme is identical to that of Figure 3.11
3.14 Energy vs. RMSD plot for docking populations of the complex 2rox, obtained by ISE (A) and LGA (B). The best single ISE solutions at each of the two funnels have ranks 1 and 2 and are marked with arrows. C: Antiparallel docking solutions ranked 1 and 2 for 2rox (green and magenta sticks respectively). The carbons in the crystal structure of thyroxine are shown as thin sticks colored cyan. The backbone of the residues closest to the ligand (within 5.5 Å) is shown in PyMol cartoon representation colored cyan.
4.1 The best available docking solution for (A) 1eve-1vot and (B) 1vot-1eve in unbound (cross-) docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
4.2 The best available docking solution for (A) 1eve-1eve and (B) 1vot-1vot in bound docking experiments. The docking solutions for all the movable atoms are shown as lines and the crystal structures are shown as sticks. The protein structures are shown as backbone trace.
4.3 Top docking poses at different RMSD bins with respect to crystal structures
C.1 Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex. Continued on the following figures.
C.2 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.3 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.4 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.5 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.6 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
C.7 Continued from the previous figure. Energy vs RMSD plots for ISE-dock (red) and AutoDock (green) of complexes in the flexible ligand – rigid protein docking set. The graphs are sorted alphabetically according to the PDB code of the complex.
List of Tables
2.1 PDB codes of the 81 complexes in the rigid protein test set.
2.2 Affinities to collagenase
3.1 Summary of docking results by ISE, LGA, Glide and GOLD.
3.2 Binding modes of 1-(5-tert-butyl-2-methyl-2h-pyrazol-3-yl)-3-(4-chloro-phenyl)-urea (from 1kv1)
4.1 Collagenase data set, best ligands’ RMSD (Å) in top 1, top 20 and all available (4096) solutions. RMSD of the backbone from the crystal position of the corresponding solution is also reported.
4.2 Results of Acetylcholinesterase cross docking
4.3 Torsion RMSD of flexible residues in the trypsin data set
4.4 Trypsin data set, RMSD values of top single docking poses and best docking poses in top 20 and top 4096 solutions
4.5 Current status of protein flexibility handling in ISE-dock and in five popular docking programs (sorted according to the number of citations in 2005 [95])
C.1 Detailed docking results of the flexible ligand – rigid protein data set. RMSD [Å]
D.1 RMSD [Å] of all movable atoms and of ligand atoms only in the trypsin data set, 100 cross docking experiments
Acknowledgments
First of all, I thank Prof. Amiram Goldblum, my supervisor, for the unlimited freedom and trust and for his guidance and support.
This research was supported by the Israel Science Foundation (ISF) grant no. 608/02. I thank the Alex Grass Center for Drug Design and Synthesis for further support, Dr. Emmanuele Perola for sending his data as well as for making useful suggestions, Dr. Anwar Rayan for helpful discussions and Mrs. Efrat Noy for her ideas and for helping with the programming. Dr. Morris M. Garret was instrumental in solving our problems with AutoDock usage and provided suggestions for improving LGA results.
This work would not have been possible without the support of my wife, Einat, who released me from all my domestic duties and supported me during the preparation of this work.
Bibliography
[1] R Abagyan and M Totrov. High-throughput docking for lead generation. Curr Opin Chem Biol, 5(4):375–82, 2001.
[2] R Abagyan, M Totrov, and D Kuznetsov. ICM – a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformations. J Comp Chem, 15(5):488–506, 1994.
[3] LM Amzel. Calculation of entropy changes in biological processes: folding, binding, and oligomerization. Methods Enzymol, 323:167–177, 2000.
[4] AC Anderson, RH O’Neil, TS Surti, and RM Stroud. Approaches to solving the rigid receptor problem by identifying a minimal set of flexible residues during ligand docking. Chem Biol, 8(5):445–457, May 2001.
[5] FC Bernstein, TF Koetzle, GJB Williams, EF Meyer, MD Brice, JR Rodgers, O Kennard, T Shimanouchi, and M Tasumi. Protein data bank – computer-based archival file for macromolecular structures. Arch Biochem Biophys, 185(2):584–591, 1978.
[6] A Bialonska and Z Ciunik. Hydrophobic ’lock and key’ recognition of n-4-nitrobenzoylamino acid by strychnine. Acta Crystallogr B Struct Sci, 62:1061–1070, 2006.
[7] W Cai, X Shao, and B Maigret. Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast and efficient filter for large virtual throughput screening. J Mol Graph Model, 20(4):313–328, Jan 2002.
[8] CJ Camacho, DW Gatchell, SR Kimura, and S Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40(3):525–537, Aug 2000.
[9] MD Cameron, B Wen, KE Allen, AG Roberts, JT Schuman, AP Campbell, KL Kunze, and SD Nelson. Cooperative binding of midazolam with testosterone and alpha-naphthoflavone within the CYP3A4 active site: a NMR T1 paramagnetic relaxation study. Biochemistry, 44(43):14143–14151, Nov 2005.
[10] HA Carlson. Protein flexibility and drug design: how to hit a moving target. Curr Opin Chem Biol, 6(4):447–452, Aug 2002.
[11] C Catana and PFW Stouten. Novel, customizable scoring functions, parameterized using n-pls, for structure-based drug discovery. J Chem Inf Model, 47(1):85–91, 2007.
[12] H Claussen, C Buning, M Rarey, and T Lengauer. Flexe: efficient molecular docking considering protein structure variations. J Mol Biol, 308(2):377–395, 2001.
[13] JC Cole, CW Murray, JW Nissink, RD Taylor, and R Taylor. Comparing protein-ligand docking programs is difficult. Proteins, 60(3):325–332, Aug 2005.
[14] WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz, DM Ferguson, DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. Second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc, 117:5179–5197, 1995.
[15] WL DeLano. The PyMol molecular graphics system. DeLano Scientific LLC, San Carlos, Ca, USA.
[16] KA Dill and HS Chan. From Levinthal to pathways to funnels. Nat Struct Biol, 4(1):10–19, Jan 1997.
[17] OA Donini and PA Kollman. Calculation and prediction of binding free energies for the matrix metalloproteinases. J Med Chem, 43(22):4180–4188, Nov 2000.
[18] M Ekroos and T Sjogren. Structural basis for ligand promiscuity in cytochrome P450 3A4. Proc Natl Acad Sci U S A, 103(37):13682–13687, Sep 2006.
[19] AM Ferrari, BQ Wei, L Costantino, and BK Shoichet. Soft docking and multiple receptor conformations in virtual screening. J Med Chem, 47(21):5076–5084, Oct 2004.
[20] D Fischer, SL Lin, HL Wolfson, and R Nussinov. A geometry-based suite of molecular docking processes. J Mol Biol, 248(2):459–477, Apr 1995.
[21] E Fischer. Einfluss der configuration auf die wirkung der enzyme. Ber Dt Chem Ges, 27:2985–2993, 1894.
[22] E Freire. The propagation of binding interactions to remote sites in proteins: Analysis of the binding of the monoclonal antibody d1.3 to lysozyme. Proc Natl Acad Sci U S A, 96(18):10118–10122, 1999.
[23] RA Friesner, JL Banks, RB Murphy, TA Halgren, JJ Klicic, DT Mainz, MP Repasky, EH Knoll, M Shelley, JK Perry, DE Shaw, P Francis, and PS Shenkin. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. J Med Chem, 47(7):1739–1749, March 2004.
[24] RA Friesner, RB Murphy, MP Repasky, LL Frye, JR Greenwood, TA Halgren, PC Sanschagrin, and DT Mainz. Extra precision Glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J Med Chem, 49(21):6177–6196, Oct 2006.
[25] HA Gabb, RM Jackson, and MJ Sternberg. Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol, 272(1):106–120, Sep 1997.
[26] P Gadakar, S Phukan, P Dattatreya, and V Balaji. Pose prediction accuracy in docking studies and enrichment of actives in the active site of gsk-3beta. J Chem Inf Model, Jun 2007.
[27] L Gales, S Macedo-Ribeiro, G Arsequell, G Valencia, MJ Saraiva, and AM Damas. Human transthyretin in complex with iododiflunisal: structural features associated with a potent amyloid inhibitor. Biochem J, 388(2):615–621, Jun 2005.
[28] J Gasteiger and M Marsili. Iterative partial equalization of orbital electronegativity–a rapid access to atomic charges. Tetrahedron, 36(22):3219–3228, 1980.
[29] F Glaser, DM Steinberg, IA Vakser, and N Ben-Tal. Residue frequencies and pairing preferences at protein-protein interfaces. Proteins, 43(2):89–102, May 2001.
[30] M Glick and A Goldblum. A novel energy-based stochastic method for positioning polar protons in protein structures from x-rays. Proteins, 38(3):273–287, Feb 2000.
[31] M Glick, A Rayan, and A Goldblum. A stochastic algorithm for global optimization and for best populations: a test case of side chains in proteins. Proc Natl Acad Sci U S A, 99(2):703–708, Jan 2002.
[32] DS Goodsell, GM Morris, and AJ Olson. Automated docking of flexible ligands: applications of autodock. J Mol Recognit, 9(1):1–5, Jan-Feb 1996.
[33] DS Goodsell and AJ Olson. Automated docking of substrates to proteins by simulated annealing. Proteins, 8(3):195–202, 1990.
[34] I Halperin, BY Ma, H Wolfson, and R Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47(4):409–443, 2002.
[35] JA Hamilton and MD Benson. Transthyretin: a review from a structural perspective. Cell Mol Life Sci, 58(10):1491–1521, Sep 2001.
[36] JA Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[37] D Herschlag. The role of induced fit and conformational-changes of enzymes in specificity and catalysis. Bioorg Chem, 16(1):62–96, 1988.
[38] TL Hill. Steric effects. I. Van der Waals potential energy curves. J Chem Phys, 16:399, 1948.
[39] X Hu and WH Shelver. Docking studies of matrix metalloproteinase inhibitors: zinc parameter optimization to improve the binding free energy prediction. J Mol Graph Model, 22(2):115–126, Nov 2003.
[40] MN James, A Sielecki, F Salituro, DH Rich, and T Hofmann. Conformational flexibility in the active sites of aspartyl proteinases revealed by a pepstatin fragment binding to penicillopepsin. Proc Natl Acad Sci U S A, 79(20):6137–6141, Oct 1982.
[41] J Janin and C Chothia. The structure of protein-protein recognition sites. J Biol Chem, 265(27):16027–16030, Sep 1990.
[42] G Jones, P Willett, RC Glen, AR Leach, and R Taylor. Development and validation of a genetic algorithm for flexible docking. J Mol Biol, 267(3):727–48, 1997.
[43] A Kahraman, RJ Morris, RA Laskowski, and JM Thornton. Shape variation in protein binding pockets and their ligands. J Mol Biol, 368(1):283–301, Apr 2007.
[44] P Kallblad, RL Mancera, and NP Todorov. Assessment of multiple binding modes in ligand-protein docking. J Med Chem, 47(13):3334–3337, Jun 2004.
[45] CD Kirkpatrick. Optimization by simulated annealing. Science, 220:671–680, 1983.
[46] RM Knegtel, ID Kuntz, and CM Oshiro. Molecular docking to ensembles of protein structures. J Mol Biol, 266(2):424–440, Feb 1997.
[47] RMA Knegtel, DM Bayada, RA Engh, W von der Saal, VJ van Geerestein, and PDJ Grootenhuis. Comparison of two implementations of the incremental construction algorithm in flexible docking of thrombin inhibitors. Angew Chem Int Ed, 13(2):167–183, 1999.
[48] DE Koshland. Application of a theory of enzyme specificity to protein synthesis. Proc Natl Acad Sci U S A, 44(2):98–104, February 1958.
[49] B Kramer, M Rarey, and T Lengauer. Evaluation of the FLEXX incremental construction algorithm for protein-ligand docking. Proteins, 37(2):228–241, Nov 1999.
[50] RT Kroemer, A Vulpetti, JJ McDonald, DC Rohrer, JY Trosset, F Giordanetto, S Cotesta, C McMartin, M Kihlen, and PFW Stouten. Assessment of docking poses: interactions-based accuracy classification (IBAC) versus crystal structure deviations. J Chem Inf Comput Sci, 44(3):871–881, 2004.
[51] M Kumar and MV Hosur. Adaptability and flexibility of HIV-1 protease. Eur J Biochem, 270(6):1231–1239, 2003.
[52] S Kumar, B Ma, CJ Tsai, N Sinha, and R Nussinov. Folding and binding cascades: dynamic landscapes and population shifts. Protein Sci, 9(1):10–19, Jan 2000.
[53] ID Kuntz, JM Blaney, SJ Oatley, R Langridge, and TE Ferrin. A geometric approach to macromolecule-ligand interactions. J Mol Biol, 161(2):269–88, Oct 25 1982.
[54] AR Leach. Molecular Modelling. Principles and Applications, chapter Empirical Force Field Models: Molecular Mechanics, pages 165–252. Prentice Hall, 2001.
[55] BM Lee, J Xu, BK Clarkson, MA Martinez-Yamout, HJ Dyson, DA Case, JM Gottesfeld, and PE Wright. Induced fit and “lock and key” recognition of 5S’ RNA by zinc fingers of transcription factor IIIA. J Mol Biol, 357(1):275–291, 2006.
[56] PE Leopold, M Montal, and JN Onuchic. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A, 89(18):8721–8725, September 1992.
[57] PJ Lewis, M de Jonge, F Daeyaert, L Koymans, M Vinkers, J Heeres, PAJ Janssen, E Arnold, K Das, AD Clark, SH Hughes, PL Boyer, M Bethune, R Pauwels, K Andries, M Kukla, and D Ludovici. On the detection of multiple-binding modes of ligands to proteins, from biological, structural, and modeling data. J Comput Aided Mol Des, 17(2–4):129–134, 2003.
[58] JH Lii and NL Allinger. Directional hydrogen bonding in the MM3 force field. J Comp Chem, 19(9):1001–1016, 1998.
[59] B Lovejoy, AR Welch, S Carr, C Luong, C Broka, T Hendricks, et al. Crystal structures of MMP-1 and -13 reveal the structural basis for selectivity of collagenase inhibitors. Nat Struct Biol, 6(3):217–221, 1999.
[60] H Lu, J Macosko, D Habel-Rodriguez, RW Keller, JA Brozik, and DJ Keller. Closing of the fingers domain generates motor forces in the hiv reverse transcriptase. J Biol Chem, 279(52):54529–54532, Dec 2004.
[61] BH Luo, TA Springer, and J Takagi. High affinity ligand binding by integrins does not involve head separation. J Biol Chem, 278(19):17185–17189, May 2003.
[62] B Ma, S Kumar, CJ Tsai, and R Nussinov. Folding funnels and binding mechanisms. Protein Eng, 12(9):713–720, Sep 1999.
[63] B Ma, M Shatsky, HJ Wolfson, and R Nussinov. Multiple diverse ligands binding at a single protein site: A matter of pre-existing populations. Prot Sci, 11(2):184–197, 2002.
[64] AD Mackerell. Empirical force fields for biological macromolecules: overview and issues. J Comput Chem, 25(13):1584–1604, Oct 2004.
[65] AD Mackerell, D Bashford, M Bellott, RL Dunbrack, JD Evanseck, MJ Field, S Fischer, J Gao, H Guo, S Ha, D Joseph-Mccarthy, L Kuchnir, K Kuczera, FTK Lau, C Mattos, S Michnick, T Ngo, DT Nguyen, B Prodhom, WE Reiher, B Roux, M Schlenkrich, JC Smith, R Stote, J Straub, M Watanabe, J Wiorkiewicz-Kuczera, D Yin, and M Karplus. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B, 102(18):3586–3616, April 1998.
[66] TG Marshall, RE Lee, and FE Marshall. Common angiotensin receptor blockers may directly modulate the immune system via VDR, PPAR and CCR2b. Theor Biol Med Model, 3:1, 2006.
[67] BW Matthews. Protein structure initiative: getting into gear. Nat Struct Mol Biol, 14(6):459–460, Jun 2007.
[68] C McMartin and RS Bohacek. QXP: powerful, rapid computer algorithms for structure-based drug design. J Comput Aided Mol Des, 11(4):333–344, Jul 1997.
[69] S Miyazawa and RL Jernigan. A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng, 6(3):267–278, Apr 1993.
[70] GM Morris, DS Goodsell, RS Halliday, R Huey, WE Hart, RK Belew, and AJ Olson. Automated docking using a lamarckian genetic algorithm and an empirical binding free energy function. J Comp Chem, 19(14):1639–1662, 1998.
[71] GM Morris, DS Goodsell, R Huey, and AJ Olson. Distributed automated docking of flexible ligands to proteins: Parallel applications of autodock 2.4. J Comput Aid Mol Des, 10(4):293–304, 1996.
[72] A Murcko and MA Murcko. Computational methods to predict binding free energy in ligand-receptor complexes. J Med Chem, 38(26):4953–4967, Dec 1995.
[73] R Najmanovich, J Kuttner, V Sobolev, and M Edelman. Side-chain flexibility in proteins upon ligand binding. Proteins, 39(3):261–268, May 2000.
[74] R Norel, SL Lin, HJ Wolfson, and R Nussinov. Shape complementarity at protein-protein interfaces. Biopolymers, 34(7):933–940, Jul 1994.
[75] J Norvell and JM Berg. The protein structure initiative, five years later. Scientist, 19(20):30–31, 2005.
[76] E Noy, T Tabakman, and A Goldblum. Constructing ensembles of flexible fragments in native proteins by iterative stochastic elimination is relevant to protein–protein interfaces. Proteins, 68:702–711, 2007.
[77] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes i. rigid and flexible hinge-bending docking algorithms. Comb Chem High Throughput Screen, 2(5):249–59, 1999.
[78] R Nussinov and HJ Wolfson. Efficient computational algorithms for docking and for generating and matching a library of functional epitopes ii. computer vision-based techniques for the generation and utilization of functional epitopes. Comb Chem High Throughput Screen, 2(5):261–269, Oct 1999.
[79] VD Ozrin, MV Subbotin, and SM Nikitin. Plass: protein-ligand affinity statistical score–a knowledge-based force-field model of interaction derived from the pdb. J Comput Aided Mol Des, 18(4):261–270, Apr 2004.
[80] C Pargellis, L Tong, L Churchill, PF Cirillo, T Gilmore, AG Graham, PM Grob, ER Hickey, N Moss, S Pav, and J Regan. Inhibition of p38 map kinase by utilizing a novel allosteric binding site. Nat Struct Biol, 9(4):268–272, Apr 2002.
[81] P De La Paz, JM Burridge, SJ Oatley, and CCF Blake. Multiple modes of binding of thyroid hormones and other iodothyronines to human plasma transthyretin, pages 119–172. 1992.
[82] DA Pearlman. Free Energy Calculations in Rational Drug Design, chapter Theory, pages 9–35. Springer, 2001.
[83] E Perola and PS Charifson. Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J Med Chem, 47(10):2499–2510, May 2004.
[84] E Perola, WP Walters, and PS Charifson. A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins, 56(2):235–249, Aug 2004.
[85] M Rarey, B Kramer, and T Lengauer. The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins, 34(1):17–28, 1999.
[86] A Rayan, E Noy, D Chema, A Levitzki, and A Goldblum. Stochastic algorithm for kinase homology model construction. Curr Med Chem, 11:675–692, 2004.
[87] A Rayan, H Senderowitz, and A Goldblum. Exploring the conformational space of cyclic peptides by a stochastic search method. J Mol Graph Model, 22(5):319–333, May 2004.
[88] TJ Rydel, A Tulinsky, W Bode, and R Huber. Refined structure of the hirudin-thrombin complex. J Mol Biol, 221(2):583–601, Sep 1991.
[89] B Sandak, R Nussinov, and HJ Wolfson. An automated computer vision and robotics-based technique for 3-d flexible biomolecular docking and matching. Comput Appl Biosci, 11(1):87–99, Feb 1995.
[90] B Sandak, R Nussinov, and HJ Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J Comput Biol, 5(4):631–654, 1998.
[91] DM Schulz, C Ihling, GM Clore, and A Sinz. Mapping the topology and determination of a low-resolution three-dimensional structure of the calmodulin-melittin complex by chemical cross-linking and high-resolution fticrms: direct demonstration of multiple binding modes. Biochemistry, 43(16):4703–4715, Apr 2004.
[92] J Singh, Z Deng, G Narale, and C Chuaqui. Structural interaction fingerprints: a new approach to organizing, mining, analyzing, and designing protein-small molecule complexes. Chem Biol Drug Des, 67(1):5–12, January 2006.
[93] FJ Solis and RJ-B Wets. Minimization by random search techniques. Math Oper Res, 6:19–30, 1981.
[94] CA Sotriffer and I Dramburg. “In situ cross-docking” to simultaneously address multiple targets. J Med Chem, 48(9):3122–3125, May 2005.
[95] SF Sousa, PA Fernandes, and MJ Ramos. Protein-ligand docking: current status and future challenges. Proteins, 65(1):15–26, Oct 2006.
[96] RD Taylor, PJ Jewsbury, and JW Essex. FDS: flexible ligand and receptor docking with a continuum solvent model and soft-core energy function. J Comput Chem, 24(13):1637–1656, Oct 2003.
[97] SJ Teague. Implications of protein flexibility for drug discovery. Nat Rev Drug Discov, 2(7):527–541, Jul 2003.
[98] GE Terp, IT Christensen, and FS Jørgensen. Structural differences of matrix metalloproteinases. homology modeling and energy minimization of enzyme-substrate complexes. J Biomol Struct Dyn, 17(6):933–946, Jun 2000.
[99] A Tovchigrechko and IA Vakser. How common is the funnel-like energy landscape in protein-protein interactions? Protein Sci, 10(8):1572–1583, Aug 2001.
[100] CJ Tsai, S Kumar, B Ma, and R Nussinov. Folding funnels, binding funnels, and protein function. Protein Sci, 8(6):1181–1190, Jun 1999.
[101] S Vajda, Z Weng, R Rosenfeld, and C DeLisi. Effect of conformational flexibility and solvation on receptor-ligand binding free energies. Biochemistry, 33(47):13977–13988, Nov 1994.
[102] IA Vakser. Low-resolution docking: prediction of complexes for underdetermined structures. Biopolymers, 39(3):455–464, Sep 1996.
[103] IA Vakser, OG Matar, and CF Lam. A systematic study of low-resolution recognition in protein–protein complexes. Proc Natl Acad Sci U S A, 96(15):8477–8482, Jul 1999.
[104] ADJ van Dijk and AMJ Bonvin. Solvated docking: introducing water into the modelling of biomolecular complexes. Bioinformatics, 22(19):2340–2347, Oct 2006.
[105] GM Verkhivker, PA Rejto, DK Gehlhaar, and ST Freer. Exploring the energy landscapes of molecular recognition by a genetic algorithm: analysis of the requirements for robust docking of hiv-1 protease and fkbp-12 complexes. Proteins, 25(3):342–353, Jul 1996.
[106] DF Wang, O Wiest, P Helquist, HY Lan-Hargest, and NL Wiech. On the function of the 14 Å long internal cavity of histone deacetylase-like protein: implications for the design of histone deacetylase inhibitors. J Med Chem, 47(13):3409–3417, Jun 2004.
[107] J Wang, PA Kollman, and ID Kuntz. Flexible ligand docking: a multistep strategy approach. Proteins, 36(1):1–19, 1999.
[108] J Wang, P Morin, W Wang, and PA Kollman. Use of mm-pbsa in reproducing the binding free energies to hiv-1 rt of tibo derivatives and predicting the binding mode to hiv-1 rt of efavirenz by docking and mm-pbsa. J Am Chem Soc, 123(22):5221–5230, Jun 2001.
[109] R Wang, Y Lu, and S Wang. Comparative evaluation of 11 scoring functions for molecular docking. J Med Chem, 46(12):2287–2303, Jun 2003.
[110] GL Warren, CW Andrews, AM Capelli, B Clarke, J LaLonde, MH Lambert, M Lindvall, N Nevins, SF Semus, S Senger, G Tedesco, ID Wall, JM Woolven, CE Peishoff, and MS Head. A critical assessment of docking programs and scoring functions. J Med Chem, 49(20):5912–5931, Oct 2006.
[111] PK Weiner and PA Kollman. Amber: Assisted model building with energy refinement. a general program for modeling molecules and their interactions. J Comp Chem, 2, 1981.
[112] SJ Weiner, PA Kollman, DA Case, UC Singh, C Ghio, G Alagona, S Profeta, and P Weiner. A new force field for molecular mechanical simulation of nucleic acids and proteins. J Am Chem Soc, 106(3):765–784, 1984.
[113] Wikipedia. Interaction energy of argon dimer.
[114] Z Xiang and B Honig. Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol, 311(2):421–430, Aug 2001.
[115] C Zhang, J Chen, and C DeLisi. Protein-protein recognition: exploring the energy funnels near the binding sites. Proteins, 34(2):255–267, Feb 1999.
[116] L Žídek, MV Novotny, and MJ Stone. Increased protein backbone conformational entropy upon hydrophobic ligand binding. Nat Struct Biol, 6(12):1118–1121, Dec 1999.
Hebrew abstract