jacob kleine undergrad. thesis

CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

CALCULATIONS OF BINDING FREE ENERGIES OF

CUCURBIT[7]URIL AND SMALL LIGANDS USING

END-STATE METHODS

A thesis submitted in partial fulfillment of the requirements for the degree of Undergraduate in Physics and Astronomy

by

Jacob D. Kleine

June 2014

© Copyright by Jacob D. Kleine 2014. All Rights Reserved


Acknowledgements

I would first like to thank Dr. Tyler Luchko, my advisor and mentor throughout this undergraduate thesis project. You have been, and continue to be, an incredible source of knowledge and inspiration to me. Thank you for your endless patience, and thank you for always pushing me to succeed, both in class and in the office.

I would like to thank every one of my professors at CSUN: thank you Dr. Peric, Dr. Lim, Dr. Doty, Dr. Ranganathan, Dr. Shiferaw, Dr. Postma, and Dr. Luchko (yes, thank you again). To put it simply, physics is the coolest thing in the world. Thank you all for bringing such profound understanding (and confusion) into my life.

Lastly, I would like to thank Amazon for admitting this project to the private beta, Dan Mobley (UC Irvine) for access to the SAMPL4 data after the contest had closed, and Dr. Gang Lu for funding the research.


Table of Contents

Copyright

Signature page

Acknowledgements

Abstract

1 Introduction

2 Theory
    2.1 Molecular Dynamics
        2.1.1 Parameterization
        2.1.2 Particle Mesh/Ewald Summation
    2.2 MM/IS - Binding Free Energy
        2.2.1 Implicit Solvent Models
        2.2.2 Entropy Calculations

3 Methods
    3.1 SAMPL4
    3.2 Initial Structures
    3.3 Calculations
        3.3.1 Analysis

4 Results
    4.1 R2 Correlation
    4.2 Root Mean Squared Error
    4.3 Linear Regression Slope

5 Discussion
    5.1 Ranking

6 Conclusions

A List of Acronyms

Bibliography


ABSTRACT

CALCULATIONS OF BINDING FREE ENERGIES OF CUCURBIT[7]URIL AND

SMALL LIGANDS USING END-STATE METHODS

By

Jacob D. Kleine

Undergraduate in Physics and Astronomy

The binding affinity between a host molecule and a small ligand is a valuable quantity in the field of drug design. Computational chemists attempt to calculate binding free energies using molecular modeling techniques; however, none of the current implementations produce accurate results on manageable time scales. In an effort to improve these methods, we systematically explore the major contributors to error, modify the implementation, and analyze our results. By increasing the duration of the simulations, we address the error due to poor conformational sampling and incomplete ensemble averages. With this modification, we calculate binding free energies using molecular modeling with implicit solvation techniques (MM/PBSA and MM/GBSA) and compare against known experimental values. Our resulting error metrics indicate progress from previous implementations and inspire further research.


Chapter 1

Introduction

A pharmaceutical drug operates by binding to a target (a biological molecule in the human body) and altering its function in a particular way so as to alleviate symptoms. The efficacy of the drug is largely dependent on the strength of binding and on a lack of unintentional binding (binding to molecules other than the target). The tendency or strength of binding between a small ligand and a larger macromolecule is quantified by the binding affinity [1]. Therefore, binding affinity, or binding free energy, is especially important to chemists who seek to design a new pharmaceutical drug. While there are many naturally existing chemicals that can be readily modified for drug use, most of these have already been discovered. An alternative approach is to use synthetic chemicals. However, testing is substantially more expensive: the molecules must first be synthesized before binding affinities can be determined experimentally. Therefore, it is essential for the development of new and improved drugs to be able to accurately predict the binding affinity of a hypothetical molecule before it is synthesized, drastically improving the efficiency of the design process.

A common approach to predicting binding free energy is molecular modeling [2, 3, 4]. The behavior of molecules in a system is dependent on the atomic structure of each molecule. Therefore, accurate simulations necessitate atomic resolution. We must apply the best atomic theory and computer algorithms to process the information quickly and accurately. As it stands today, molecular simulations can take months to calculate only nanoseconds of data [3]. Therefore, we must test new methods that can potentially reduce the number of calculations required. Validation of these methods for computing binding affinities can help assess their potential performance in drug design applications.

There are many approaches to solving this problem. For example, in the binding free energy portion of the SAMPL4 blind prediction challenge, hosted by OpenEye Scientific Computing [1], participants submitted 35 sets of binding energy predictions for a series of host-guest systems, based on methods ranging from simple docking, to extensive free energy simulations, to quantum mechanical calculations. The purpose of the SAMPL4 challenge was to encourage participants to systematically explore parameter, solvent, and force field options for the best combination of accuracy, versatility, consistency, and precision when predicting binding affinities of an arbitrary molecule. The problem, however, is that the number of approximations used in existing methods makes it impossible to achieve consistently accurate results.

The ideal method for computing binding affinities would use first-principles quantum mechanical simulations of all nuclei and electrons with dissolved host and guest molecules [2]. Unfortunately, even the fastest computers cannot handle this number of calculations in any reasonable amount of time. Currently, we can only consider systems with well under a mole of solution (by almost 20 orders of magnitude), and we can only produce simulations on the microsecond scale. We must, therefore, rely on theoretical frameworks and approximations that allow energies to be computed more rapidly. For example, one can model the water around the complex explicitly, using a molecular representation of water, or implicitly, representing water as a continuous dielectric, as in MM/PBSA and MM/GBSA [5] (see §2.2). Also, the use of empirical force fields is a computationally fast alternative to quantum mechanical calculations. Current examples include CHARMM [6], AMBER [7] (Eq. (2.4)), and GROMOS [8], each taking a different functional form. Achieving accurate affinity calculations will likely require advances on multiple fronts: potential energy functions, solvation models, and the treatment of conformational flexibility [2].

Molecular mechanics with implicit solvent (MM/IS) methods, a general class of computational methods for calculating binding free energy that includes MM/GBSA and MM/PBSA, are theoretically sound and have relatively low computational cost [9]. However, the approximations required for solvent contributions, conformational sampling, and other factors such as time scale and parameterization have led to unreliable results [9]. Therefore, in an effort to improve MM/IS methods, we attempt to limit these approximations by improving our conformational sampling with a longer run time. We test this strategy on SAMPL4 data and evaluate the results. Although it is still unclear whether the MM/IS methods with improved sampling outperform any of the other methods used in the SAMPL4 binding free energy challenge, they showed promising results.

This thesis begins in Chapter 2 with the theory of molecular dynamics (MD), force fields, molecular mechanics with implicit solvent (MM/IS), and the theory underlying the methods used in this study. Chapter 3 is a detailed description of our methods and of the steps necessary to reproduce our work. Chapter 4 presents the results and error analysis of our final calculations. Chapter 5 discusses the results, and Chapter 6 presents our conclusions.


Chapter 2

Theory

2.1 Molecular Dynamics

Molecular Dynamics (MD) is a method of calculating the time-dependent behavior of a molecular system by applying molecular theory to computer simulation. It can be used to examine atomic-level phenomena, such as thin film growth [10], and compare the results to theory. MD simulations generate complete trajectories (positions and velocities) of the particles in a system using Newtonian mechanics: the system is first initialized, and then, for each time step, the force on every particle is calculated, the equations of motion are integrated to update positions and velocities, and the system is sampled to calculate properties and output trajectories. Statistical mechanics can then be applied to generate macroscopic observables such as pressure, energy, and heat capacities [3].
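The loop described above (compute forces, integrate, sample) can be sketched for a single particle in one dimension. This is a minimal illustration only: the velocity Verlet integrator, the harmonic test force, and all names and parameter values are assumptions chosen for the example, not the integrator settings used in this work.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation for one particle in 1D.

    Returns the sampled trajectory as a list of (position, velocity) pairs.
    """
    trajectory = [(x, v)]
    f = force(x)                              # force at the initial position
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass      # half-step velocity update
        x = x + dt * v_half                   # full-step position update
        f = force(x)                          # force at the new position
        v = v_half + 0.5 * dt * f / mass      # second half-step velocity update
        trajectory.append((x, v))             # sample the system
    return trajectory


# Hypothetical test system in reduced units: a harmonic spring, F = -k x.
k, m = 1.0, 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                       mass=m, dt=0.01, n_steps=1000)

# Total energy at every sampled step; it should stay nearly constant.
energies = [0.5 * m * v * v + 0.5 * k * x * x for x, v in traj]
```

A nearly constant total energy along the trajectory is a common sanity check on the choice of time step.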

Due to the exceptionally large number of calculations required in the MD approach (every particle in every time step) and limited computational capacity, the size of the system is limited to thousands of molecules and nanoseconds of simulated time. To reduce computational cost, periodic boundary conditions (PBC) are applied to the system, effectively treating it as a crystalline structure. Under this approximation angular momentum is not conserved because the PBC system is not rotationally symmetric, introducing minor artifacts; for example, a system of N particles will behave as a system of N − 1 particles. The artifacts have quantifiable consequences for small toy models, but for a standard biomolecular simulation (much larger N) the effects will be almost negligible for a liquid at STP [11].

With PBC applied, MD simulations are typically run in either the constant number of particles, volume, and temperature (NVT) ensemble or the constant number of particles, pressure, and temperature (NPT) ensemble [3]. Many barostat and thermostat methods are available. The SAMPL4 molecules are synthetic molecules studied under biological conditions, so we use the NPT ensemble with a Langevin thermostat [12]. Langevin dynamics is an MD approach that maintains the temperature by applying a velocity-dependent damping force along with a random force, according to

$$M\ddot{X} = -\nabla V(X) - \gamma M \dot{X} + \sqrt{2\gamma k_B T M}\,R(t) \qquad (2.1)$$

for a system of N particles with masses M and coordinates X = X(t). V(X) is the particle interaction potential; $\nabla$ is the gradient operator; the dot is a time derivative, such that $\dot{X}$ is the velocity and $\ddot{X}$ is the acceleration; $\gamma$ is the damping constant; T is the temperature; $k_B$ is Boltzmann's constant; and R(t) is a delta-correlated stationary Gaussian process (a random process) with zero mean, satisfying

$$\langle R(t) \rangle = 0 \qquad (2.2)$$

$$\langle R(t) R(t') \rangle = \delta(t - t') \qquad (2.3)$$

At every time step the velocity is damped and a random "kick" force is applied, thereby controlling the system as a thermostat would.
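The damping-plus-kick update can be illustrated for the simplest case of Eq. (2.1), a force-free particle (V = 0), where the velocity update has a closed form. This is a sketch under stated assumptions: the function name, the reduced units, and the parameter values are hypothetical, and a full integrator would interleave this step with force and position updates.

```python
import math
import random


def langevin_velocity_update(v, gamma, kT, mass, dt, rng):
    """Damp the velocity and add a random kick.

    Exact update over one time step for a particle with no deterministic force.
    """
    c = math.exp(-gamma * dt)                          # damping factor
    kick = math.sqrt((1.0 - c * c) * kT / mass) * rng.gauss(0.0, 1.0)
    return c * v + kick


rng = random.Random(0)                                 # fixed seed for repeatability
gamma, kT, mass, dt = 1.0, 1.0, 1.0, 0.1               # illustrative reduced units
v = 5.0                                                # start far from equilibrium
squared_velocities = []
for step in range(200000):
    v = langevin_velocity_update(v, gamma, kT, mass, dt, rng)
    if step >= 1000:                                   # skip the equilibration period
        squared_velocities.append(v * v)

# Equipartition: the average of v^2 should approach kT / mass.
mean_v2 = sum(squared_velocities) / len(squared_velocities)
```

Regardless of the starting velocity, the average kinetic energy relaxes to the value set by T, which is the sense in which the damping and random forces together act as a thermostat.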

Ideally, binding free energy calculations would come from first-principles quantum mechanical calculations of all nuclei and electrons in the system. Unfortunately, computers are not fast enough to handle this many calculations for a system of considerable size. In the AMBER molecular modeling suite [13], molecular dynamics forces are typically calculated using the AMBER force field, a simplified model of bond, solvent, and electrostatic effects. The forces are then integrated to obtain trajectories.

The functional form of the AMBER force field is [7]

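$$V(r^N) = \sum_{\mathrm{bonds}} k_b (l - l_0)^2 + \sum_{\mathrm{angles}} k_a (\theta - \theta_0)^2 + \sum_{\mathrm{torsions}} \sum_n \frac{1}{2} V_n \left[1 + \cos(n\omega - \gamma)\right] + \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \epsilon_{ij} \left[ \left(\frac{r_{0ij}}{r_{ij}}\right)^{12} - 2 \left(\frac{r_{0ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} \right\} \qquad (2.4)$$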

The first summation (over the bonds) models the energy between covalently bonded atoms with an ideal spring potential with spring constant $k_b$, where l is the bond length and $l_0$ is its equilibrium value. This is a good approximation near $l = l_0$, but becomes less accurate as atoms separate. The second summation (over the angles) is the energy due to electron geometry and is also modeled with a spring potential, with spring constant $k_a$. The third term is the energy for bond twisting. The fourth term, the double summation, represents the non-bonded energy between all atom pairs. This term consists of van der Waals (Lennard-Jones) and electrostatic contributions, where $\epsilon_{ij}$ is the depth of the potential well, $r_{0ij}$ is the separation at which the potential reaches a minimum, $r_{ij}$ is the separation between particles i and j, $q_i$ and $q_j$ are the charges, and $\epsilon_0$ is the electric constant.
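As an illustration of how the individual terms of Eq. (2.4) are evaluated, the sketch below writes each term as a small function. The function names and the unit convention are assumptions made for this example (energies in kcal/mol, lengths in Å, charges in units of the elementary charge, so that 1/(4πε0) is approximately 332.06 kcal·Å/(mol·e²)); this is not the AMBER implementation.

```python
import math

# Assumed unit convention: 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2).
COULOMB_CONSTANT = 332.06


def bond_energy(k_b, l, l0):
    """Harmonic bond term: k_b (l - l0)^2."""
    return k_b * (l - l0) ** 2


def angle_energy(k_a, theta, theta0):
    """Harmonic angle term: k_a (theta - theta0)^2, angles in radians."""
    return k_a * (theta - theta0) ** 2


def torsion_energy(v_n, n, omega, gamma):
    """One term of the torsion series: (1/2) V_n [1 + cos(n*omega - gamma)]."""
    return 0.5 * v_n * (1.0 + math.cos(n * omega - gamma))


def lennard_jones_energy(eps_ij, r0_ij, r_ij):
    """Lennard-Jones term with well depth eps_ij and minimum at r0_ij."""
    ratio6 = (r0_ij / r_ij) ** 6
    return eps_ij * (ratio6 * ratio6 - 2.0 * ratio6)


def coulomb_energy(q_i, q_j, r_ij):
    """Electrostatic term: q_i q_j / (4*pi*eps0*r_ij)."""
    return COULOMB_CONSTANT * q_i * q_j / r_ij


# At r = r0 the Lennard-Jones energy equals minus the well depth.
well_bottom = lennard_jones_energy(eps_ij=0.1, r0_ij=3.5, r_ij=3.5)
```

Writing the Lennard-Jones term with the factor of 2 on the sixth-power part is what places the minimum exactly at $r_{0ij}$ with depth $\epsilon_{ij}$, as stated in the text.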

2.1.1 Parameterization

Chemical bonding is fundamentally described by quantum theory, which is too expensive to apply directly to systems of this size. Therefore, in practice, we must approximate the bond characteristics with classical models. The AMBER force field requires all of the parameters in Eq. (2.4) to run successful MD simulations. In AMBER, parameters are determined by the program antechamber, which uses the atomic structure of a molecule to assign appropriate parameters from the general Amber force field (GAFF) [14], a library of bond, angle, torsion, charge, and Lennard-Jones parameters for small organic molecules.


Charges

As motion at the molecular level is highly influenced by electrostatic effects, it is particularly important to apply accurate charge configurations. In this study, we use the RESP ESP Derive Server (REDS) to compute the charges on the atoms. REDS is a web server that calculates restrained electrostatic potential (RESP) and electrostatic potential (ESP) charges by interfacing quantum mechanics programs, RESP/ESP, and the latest version of R.E.D. tools [15].

2.1.2 Particle Mesh/Ewald Summation

In the AMBER force field model (Eq. (2.4)), the computation of electrostatic interactions requires a double summation, which accounts for the large majority of the computational cost. The particle mesh Ewald summation (PME) method [16] is a means to reduce the scaling of electrostatic calculations. Space is discretized on a mesh grid, a system of particles is represented by density values on the grid ("particle mesh"), and the potentials are solved using Ewald summation.

In Ewald summation, the interaction potential is separated into a short-range term whose sum converges quickly in real space, and a long-range term whose sum converges quickly in Fourier space,

$$\varphi(r) = \varphi_{sr}(r) + \varphi_{lr}(r) \qquad (2.5)$$

where $\varphi_{sr}(r)$ is the short-range term, $\varphi_{lr}(r)$ is the long-range term, and $\varphi(r)$ is the total interaction potential.

With the PME method, the direct summation of interaction energies between particles,

$$E_{TOT} = \sum_{i,j} \varphi(r_j - r_i) = E_{sr} + E_{lr} \qquad (2.6)$$

is replaced with two summations: one direct sum $E_{sr}$ of the short-range potential in real space,

$$E_{sr} = \sum_{i,j} \varphi_{sr}(r_j - r_i) \qquad (2.7)$$

and a summation of the long-range part in Fourier space,

$$E_{lr} = \sum_{k} \tilde{\Phi}_{lr}(k)\,|\tilde{\rho}(k)|^2 \qquad (2.8)$$

where $\tilde{\Phi}_{lr}$ and $\tilde{\rho}(k)$ are the Fourier transforms of the potential and the charge density, respectively. Both of the summations converge quickly in their respective spaces. Whereas direct calculation gives $O(N^2)$ scaling, where N is the number of atoms in the system, the PME method gives $O(N \log N)$ scaling, which grows considerably more slowly with system size.
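The splitting in Eq. (2.5) can be made concrete for the Coulomb potential 1/r. The sketch below assumes the Gaussian-screening split commonly used with Ewald summation, in which the short-range part is erfc(αr)/r and the long-range part is erf(αr)/r. The function name, the parameter α, and its value are hypothetical choices for illustration.

```python
import math


def split_coulomb(r, alpha):
    """Split 1/r into a short-range and a long-range part.

    short-range: erfc(alpha * r) / r, which decays rapidly in real space
    long-range:  erf(alpha * r) / r, which is smooth and suited to Fourier space
    """
    short_range = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short_range, long_range


alpha = 0.35                                   # illustrative splitting parameter (1/Angstrom)
near_short, near_long = split_coulomb(1.0, alpha)
far_short, far_long = split_coulomb(10.0, alpha)
```

Because erfc(x) + erf(x) = 1, the two parts always sum to exactly 1/r, while the short-range part becomes negligible beyond a modest cutoff, so the real-space sum can be truncated.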


2.2 MM/IS - Binding Free Energy

Molecular mechanics with implicit solvent (MM/IS) is a class of post-processing end-state methods used to compute the binding free energy ($\Delta G_{\mathrm{bind}}$) of a receptor-ligand complex.

With MM/IS, the binding free energy is divided into three separate calculations of physically well-defined terms: the free energies of the complex, receptor, and ligand [17, 18]. The binding free energy $\Delta G_{\mathrm{bind}}$ is then

$$\Delta G_{\mathrm{bind}} = \Delta G_{RL} - (\Delta G_{R} + \Delta G_{L}) \qquad (2.9)$$

where $\Delta G_{RL}$, $\Delta G_{R}$, and $\Delta G_{L}$ are the free energies of the complex, receptor, and ligand, respectively. Each free energy, $\Delta G_{R,L,RL}$, is defined by

$$\Delta G_{R,L,RL} = \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{solv}} - T \Delta S_{\mathrm{MM}} \qquad (2.10)$$

where $\Delta E_{\mathrm{MM}}$ is the change in molecular mechanics gas-phase energy, consisting of the total internal energy and the non-bonded electrostatic and van der Waals energies. In a single trajectory approach, the internal energies cancel out, and $\Delta E_{\mathrm{MM}}$ in Eq. (2.10) is calculated using Eq. (2.4) and is exact within the classical molecular modeling framework. Since the temperature T in Eq. (2.10) is constant, the free energy $\Delta G_{\mathrm{solv}}$ and the entropy $\Delta S_{\mathrm{MM}}$ are the only values that require approximation. $\Delta G_{\mathrm{solv}}$ is separated into its polar and non-polar contributions,

$$\Delta G_{\mathrm{solv}} = \Delta G_{\mathrm{polar}} + \Delta G_{\mathrm{non\text{-}pol}}, \qquad (2.11)$$

where $\Delta G_{\mathrm{polar}}$ is the electrostatic solvation energy (polar contribution) and $\Delta G_{\mathrm{non\text{-}pol}}$ is the non-electrostatic solvation energy (non-polar contribution). These terms are calculated using MM/IS, and the remaining term, $\Delta S_{\mathrm{MM}}$, is usually calculated using normal mode analysis [9].
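The bookkeeping in Eqs. (2.9) to (2.11) can be sketched as two small functions. The function names are hypothetical, and the numerical inputs below are made-up placeholders chosen only to show how the terms combine; they are not results from this work.

```python
def end_state_free_energy(e_mm, g_polar, g_nonpolar, temperature, entropy):
    """Eq. (2.10) with Eq. (2.11): E_MM + (G_polar + G_nonpolar) - T * S_MM."""
    g_solv = g_polar + g_nonpolar
    return e_mm + g_solv - temperature * entropy


def binding_free_energy(g_complex, g_receptor, g_ligand):
    """Eq. (2.9): G_RL - (G_R + G_L)."""
    return g_complex - (g_receptor + g_ligand)


# Placeholder inputs (kcal/mol for energies, kcal/(mol*K) for entropies).
T = 300.0
g_rl = end_state_free_energy(-250.0, -120.0, 6.0, T, 0.50)   # complex
g_r = end_state_free_energy(-180.0, -110.0, 5.0, T, 0.40)    # receptor
g_l = end_state_free_energy(-40.0, -30.0, 2.0, T, 0.15)      # ligand

dg_bind = binding_free_energy(g_rl, g_r, g_l)
```

In practice each input is an average over trajectory frames, so the quality of the final difference depends directly on how well those averages are converged.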

2.2.1 Implicit Solvent Models

The two forms of MM/IS used to calculate $\Delta G_{\mathrm{solv}}$ in this study are Poisson-Boltzmann surface area (PBSA) and generalized Born surface area (GBSA). Both methods use a continuous dielectric to represent the solvent, reducing computational cost substantially, and both methods estimate $\Delta G_{\mathrm{non\text{-}pol}}$ (Eq. (2.11)) using the solvent accessible surface area (SASA).

SASA

The MM/P(G)BSA methods use the solvent accessible surface area to calculate the non-polar contribution according to

$$\Delta G_{\mathrm{non\text{-}pol}} = \gamma\,\mathrm{SASA} + b \qquad (2.12)$$

with $\gamma = 0.0227$ (kJ/mol)/Å$^2$ and $b = 3.85$ kJ/mol [18].
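As a quick worked example of Eq. (2.12), the sketch below evaluates the non-polar term for a hypothetical surface area of 1000 Å² and converts the result to kcal/mol. The function name and the example surface area are assumptions; the constants are the ones quoted above.

```python
GAMMA = 0.0227          # (kJ/mol)/Angstrom^2, from Eq. (2.12)
OFFSET = 3.85           # kJ/mol, from Eq. (2.12)
KJ_PER_KCAL = 4.184     # unit conversion factor


def nonpolar_solvation_energy(sasa):
    """Non-polar solvation energy in kJ/mol for a surface area in Angstrom^2."""
    return GAMMA * sasa + OFFSET


dg_nonpolar_kj = nonpolar_solvation_energy(1000.0)     # hypothetical 1000 A^2 surface
dg_nonpolar_kcal = dg_nonpolar_kj / KJ_PER_KCAL        # same value in kcal/mol
```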


PBSA

In the MM/PBSA method, $\Delta G_{\mathrm{polar}}$ is calculated by solving the discretized Poisson-Boltzmann equation [19],

$$\nabla \cdot \epsilon(r) \nabla V(r) = -\rho_f(r) - \sum_{i}^{N_{\mathrm{ions}}} c_i^{\infty} z_i q\,\lambda(r) \exp\left(\frac{-z_i q V(r)}{k_B T}\right) \qquad (2.13)$$

where $\epsilon(r)$ is the position-dependent permittivity of the dielectric, V is the potential, $\rho_f$ is the free charge, $N_{\mathrm{ions}}$ is the number of ion species, $c_i^{\infty}$ is the bulk concentration of the ion species, $z_i$ is the charge of the ion, q is the elementary charge, $\lambda(r)$ defines the accessible space of the ions (1 for accessible and 0 for inaccessible), $k_B$ is Boltzmann's constant, and T is the temperature.

GBSA

The generalized Born (GB) [20] approach is an approximation to the exact Poisson-Boltzmann (PB) equation, Eq. (2.13). It models the solute as a set of spheres whose dielectric constant differs from that of the external solvent. This model has the functional form

$$\Delta G_{\mathrm{solv}} = \frac{1}{8\pi} \left(\frac{1}{\epsilon_0} - \frac{1}{\epsilon}\right) \sum_{i,j}^{N} \frac{q_i q_j}{\sqrt{r_{i,j}^2 + a_{i,j}^2 \exp(-D)}}, \qquad (2.14)$$

where

$$D = \left(\frac{r_{i,j}}{2 a_{i,j}}\right)^2, \quad \text{and} \quad a_{i,j} = \sqrt{a_i a_j} \qquad (2.15)$$

and $\epsilon_0$ is the permittivity of free space, $\epsilon$ is the dielectric constant of the solvent, $q_i$ is the charge on particle i, $r_{i,j}$ is the distance between particles i and j, and $a_i$ is the effective Born radius [21]. The GB approximation to the Poisson equation is a much faster calculation than the PB approach, but the accuracy is largely dependent on the choice of $a_i$.
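The pairwise part of Eq. (2.14) can be sketched as follows. The code evaluates only the denominator and the double sum over charges; the dielectric prefactor is deliberately left out, since its value depends on the unit and sign convention adopted. Function names and the example inputs are hypothetical.

```python
import math


def gb_effective_distance(r_ij, a_i, a_j):
    """Denominator of Eq. (2.14): sqrt(r^2 + a_ij^2 exp(-D)), with Eq. (2.15)."""
    a_ij_squared = a_i * a_j                        # a_ij = sqrt(a_i * a_j)
    d = r_ij * r_ij / (4.0 * a_ij_squared)          # D = (r_ij / (2 a_ij))^2
    return math.sqrt(r_ij * r_ij + a_ij_squared * math.exp(-d))


def gb_pair_sum(charges, born_radii, coordinates):
    """Double sum of q_i q_j over the effective distance, self terms included."""
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(n):
            r = math.dist(coordinates[i], coordinates[j])
            total += charges[i] * charges[j] / gb_effective_distance(
                r, born_radii[i], born_radii[j])
    return total


# A single ion: only the self term q^2 / a remains.
single_ion = gb_pair_sum([1.0], [2.0], [(0.0, 0.0, 0.0)])
```

The effective distance reduces to the Born radius when two charges coincide and to the plain separation when they are far apart, which is what lets a single expression interpolate between the self-energy and Coulomb limits.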

2.2.2 Entropy Calculations

Normal Mode (n-mode) Analysis

Normal mode analysis [22] is a means to calculate the entropy term in Eq. (2.10). Normal modes are the independent harmonic vibrations of the system about an energy minimum. With the thermodynamic formula

$$\left(\frac{\partial S}{\partial E}\right)_{N,V} = \frac{1}{T} \qquad (2.16)$$

the entropy per normal mode is

$$S(\nu) = \int_0^T \frac{dE(\nu)}{dT} \frac{dT}{T} \qquad (2.17)$$

$$= \frac{E(\nu)}{T} - k \ln\left(1 - \exp\left(-\frac{h\nu}{kT}\right)\right) \qquad (2.18)$$

where E is the energy, T is the temperature, k is the Boltzmann constant, h is Planck's constant, and $\nu$ is the frequency of the normal mode. This is applied to the trajectory files at the end of the simulation.
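Eq. (2.18) can be evaluated directly once a form for E(ν) is chosen. The sketch below assumes the thermal energy of a quantum harmonic oscillator without the zero-point contribution, E(ν) = hν / (exp(hν/kT) − 1); that choice, the function name, and the example frequencies are assumptions made for illustration.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J*s
N_A = 6.02214076e23      # Avogadro constant, 1/mol


def mode_entropy(frequency_hz, temperature):
    """Entropy of one harmonic mode in J/K, following Eq. (2.18)."""
    x = H * frequency_hz / (K_B * temperature)          # h*nu / (k*T)
    energy_over_t = K_B * x / math.expm1(x)             # E(nu) / T
    return energy_over_t - K_B * math.log(-math.expm1(-x))


# Hypothetical mode frequencies (Hz); a real calculation would use all modes.
frequencies = [1.0e12, 5.0e12, 1.0e13]
molar_entropy = N_A * sum(mode_entropy(f, 300.0) for f in frequencies)  # J/(mol*K)
```

Low-frequency modes contribute the most entropy, which is why the soft, collective motions of a complex dominate the entropy term.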


Chapter 3

Methods

With the exception of charge parameterization, we used the Amber and AmberTools molecular modeling suite [13] for all processes in this work. The charges were derived from the RED server [15].

3.1 SAMPL4

We confine the focus of this work to the binding free energy portion of the SAMPL4 blind prediction challenge [1]. The focus of this portion of SAMPL4 is to encourage collaboration in the development of new methodologies for calculating binding affinities. Participants submit their results for the binding energies $\Delta G_{\mathrm{bind}}$ of the SAMPL4 host-guest complexes, and error analysis is performed based on the known $\Delta G_{\mathrm{bind}}$ values that were determined experimentally to ± 0.1 kJ/mol accuracy (see Table 4.1, second column). This experiment provides participants a means to compare their parameterization methods and algorithms against other methods and evaluate the qualitative and quantitative differences. Although the contest was already over when we began, the information still serves its purpose in that we can use well-studied complexes to model biological receptor-ligand binding for the purpose of testing our computational methods. We use the cucurbit[7]uril data set and focus on five of the 15 associated guest molecules.

3.2 Initial Structures

The coordinate files of the small molecules, cucurbit[7]uril (CB7) and 15 guest molecules, were provided by OpenEye Scientific Computing, the hosts of the SAMPL4 contest [1]. We generated the charge configurations for the CB7 host and guest molecules from R.E.D.S. [15] using the Gaussian 09 program [23] with RESP [24]. GAFF was used for all van der Waals and bond parameters. We used Antechamber [25] to assign GAFF [14] parameters and produce parameter and coordinate files with the complete set of force field parameters. LEaP was used to place the parameterized molecules into 20 Å octahedral water boxes with periodic boundaries, using the TIP4P-Ew water model. Each water box contained approximately 4000 water molecules. The system was then brought to 0.3 M NaCl concentration by adding the appropriate number of Na+ ions, calculated with dimensional analysis, and neutralizing with Cl− ions. Antechamber was again used to produce parameter and coordinate files for the complete systems [25].
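The dimensional analysis for the ion count can be sketched as follows. The exact numbers used in this work are not given above, so the inputs here are assumptions: a box of 4000 water molecules and a molarity of about 55.3 mol/L for liquid water near room temperature. The function name is hypothetical.

```python
WATER_MOLARITY = 55.3    # mol/L, approximate value for liquid water (assumed)


def ion_pairs_for_concentration(n_waters, salt_molarity):
    """Number of ion pairs giving the target molarity in a box of n_waters.

    The ratio of salt to water molarity equals the ratio of ion pairs to
    water molecules, so the box volume cancels out.
    """
    return round(n_waters * salt_molarity / WATER_MOLARITY)


n_ion_pairs = ion_pairs_for_concentration(4000, 0.3)
```

With these assumed inputs the estimate comes to roughly 22 ion pairs per box; the actual count depends on the real number of waters and on the net charge that has to be neutralized.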

3.3 Calculations

Minimization and heating calculations were performed using the particle mesh Ewald molecular dynamics program (pmemd), and production runs were performed using the pmemd.CUDA program [26, 27], which uses graphics processing units (GPUs). 1000 steps of minimization were performed, the first 500 with the steepest descent algorithm and the second 500 with the conjugate gradient algorithm. The systems were heated from 0 K to 300 K, using the SHAKE algorithm to constrain bonds involving hydrogen. Due to time constraints, we chose to study five of the 15 guest molecules, and ran production on the five systems for 50 ns each. The time step for the calculations was 0.002 ps, and trajectory information was recorded every 500 steps. Heating and production were both run under constant pressure conditions with isotropic position scaling.

3.3.1 Analysis

MMPBSA.py was used to process the trajectory files for binding free energy information of the selected CB7 host-guest systems: CB7-c3, CB7-c6, CB7-c7, CB7-c9, and CB7-c13. As a precaution against possible artifacts due to initialization, the first 10 ns of the trajectory files were excluded from the calculation. The MM/PBSA and MM/GBSA solvent models were used to generate two separate free energy predictions ($\Delta G_{\mathrm{solv}}$ from Eq. (2.10)), one for each solvent model. For the MM/GBSA approach, we used igb=5, a modified GB model developed by A. Onufriev, D. Bashford, and D. A. Case [20]. Two sets of entropy ($\Delta S_{\mathrm{MM}}$) predictions were calculated: one using normal-mode (NM) analysis, and the other using the quasi-harmonic (QH) approximation. We made four sets of $\Delta G_{\mathrm{bind}}$ predictions using all combinations of the $\Delta S_{\mathrm{MM}}$ and $\Delta G_{\mathrm{solv}}$ calculations.
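The simulation settings quoted above imply the following trajectory sizes. This is plain arithmetic on the stated numbers (0.002 ps time step, a frame every 500 steps, 50 ns of production, first 10 ns discarded); the variable names are illustrative, and any additional frame interval applied during analysis would reduce the final count further.

```python
TIME_STEP_PS = 0.002       # integration time step
STEPS_PER_FRAME = 500      # trajectory written every 500 steps
PRODUCTION_NS = 50         # production run length per system
DISCARDED_NS = 10          # initial portion excluded from analysis

ps_per_frame = TIME_STEP_PS * STEPS_PER_FRAME                        # time between frames
total_steps = round(PRODUCTION_NS * 1000 / TIME_STEP_PS)             # integration steps
total_frames = total_steps // STEPS_PER_FRAME                        # frames written
analysis_frames = round((PRODUCTION_NS - DISCARDED_NS) * 1000 / ps_per_frame)
```

Each production run therefore takes 25 million integration steps and writes one frame per picosecond, leaving 40,000 saved frames after the first 10 ns are dropped.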


Chapter 4

Results

Our $\Delta G_{\mathrm{bind}}$ calculations are shown below in Table 4.1. We analyze the data using root mean squared (RMS) error, the Pearson coefficient of determination ($R^2$), and the linear regression slope.

4.1 R2 Correlation

The $R^2$ value is a measure of the degree of linear relationship (correlation) between the calculated data and the reference data (in this case, the experimentally determined binding affinities). Fig. 4.1 illustrates the correlation between experiment and calculation with a plot; each point represents the $\Delta G_{\mathrm{bind}}$ values for a given complex, where the x-coordinate is the experimental value and the y-coordinate is the calculated value. The $R^2$ values for our prediction sets are shown in column eight of Table 4.2.
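As a concrete check, the sketch below computes the squared Pearson correlation coefficient for the MM/PBSA normal-mode predictions, using the experimental and calculated values listed in Table 4.1. The function name is illustrative.

```python
def pearson_r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy * s_xy / (s_xx * s_yy)


# Values from Table 4.1 (kcal/mol): experiment and MM/PBSA with normal modes.
experimental = [-6.6, -7.9, -10.1, -12.6, -14.1]
pbsa_nm = [12.33, 7.24, 4.80, -0.77, -1.88]

r_squared = pearson_r_squared(experimental, pbsa_nm)
```

The result is about 0.96, in agreement with the corresponding entry in Table 4.2. Note that a high $R^2$ says nothing about the constant offset between calculation and experiment, which is why the error metrics of the next section are also needed.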

4.2 Root Mean Squared Error

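Next we analyze the root mean squared error (RMSE) as outlined in [1]. To summarize, SAMPL4 results were processed for error using two variations of the standard RMS error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{\sum_i \left(\Delta G_i^{\mathrm{calc}} - \Delta G_i^{\mathrm{exp}}\right)^2}{n}} \qquad (4.1)$$

that account for discrepancies in reported data, with some participants reporting absolute binding affinities and some reporting relative affinity predictions. The first variation, $\mathrm{RMSE}_o$, is derived by computing the RMSE of predicted binding affinities after subtracting the average signed error,

$$\mathrm{RMSE}_o = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ \Delta G_i^{\mathrm{exp}} - \Delta G_i^{\mathrm{calc}} - \frac{1}{n} \sum_{j=1}^{n} \left( \Delta G_j^{\mathrm{exp}} - \Delta G_j^{\mathrm{calc}} \right) \right]^2} \qquad (4.2)$$

where n is the number of complexes, and $\Delta G^{\mathrm{exp}}$ and $\Delta G^{\mathrm{calc}}$ are the experimental and calculated binding affinities for each complex, respectively [1]. The second variation, $\mathrm{RMSE}_r$, is derived by considering all differences among all pairs of guest molecules,

$$\mathrm{RMSE}_r = \sqrt{\frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left[ \left( \Delta G_j^{\mathrm{calc}} - \Delta G_i^{\mathrm{calc}} \right) - \left( \Delta G_j^{\mathrm{exp}} - \Delta G_i^{\mathrm{exp}} \right) \right]^2} \qquad (4.3)$$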

The RMSE error metrics for our work are also appended to the SAMPL4 results in Table 4.2. Note: the other participants' analysis includes all of the SAMPL4 molecules, whereas our figures are derived from only five data points for each method we used.
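Eqs. (4.1) to (4.3) can be evaluated directly from the values in Table 4.1. The sketch below does so for the MM/PBSA normal-mode predictions; the function names are illustrative.

```python
import math


def rmse(calc, exp):
    """Standard RMS error, Eq. (4.1)."""
    n = len(calc)
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, exp)) / n)


def rmse_offset(calc, exp):
    """RMSE_o, Eq. (4.2): RMS error after removing the mean signed error."""
    n = len(calc)
    errors = [e - c for c, e in zip(calc, exp)]
    mean_error = sum(errors) / n
    return math.sqrt(sum((d - mean_error) ** 2 for d in errors) / n)


def rmse_relative(calc, exp):
    """RMSE_r, Eq. (4.3): RMS error over all pairwise differences."""
    n = len(calc)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += ((calc[j] - calc[i]) - (exp[j] - exp[i])) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))


# Values from Table 4.1 (kcal/mol): experiment and MM/PBSA with normal modes.
experimental = [-6.6, -7.9, -10.1, -12.6, -14.1]
pbsa_nm = [12.33, 7.24, 4.80, -0.77, -1.88]

plain = rmse(pbsa_nm, experimental)
offset_corrected = rmse_offset(pbsa_nm, experimental)
relative = rmse_relative(pbsa_nm, experimental)
```

For this data set the offset-corrected and relative values come out near 2.5 and 4.0 kcal/mol, consistent with the entries in Table 4.2, while the plain RMSE is much larger (about 14.8 kcal/mol) because the predictions are shifted by a nearly constant offset.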


                            MM/PBSA                     MM/GBSA
Complex    Experimental     NM               QH         NM                QH
CB7-C3     −6.6             12.33 ± 0.12     1.69       −7.67 ± 0.09      −18.31
CB7-C6     −7.9             7.24 ± 0.07      −3.79      −12.64 ± 0.06     −23.68
CB7-C7     −10.1            4.80 ± 0.06      −4.67      −15.18 ± 0.06     −24.65
CB7-C9     −12.6            −0.77 ± 0.07     −17.47     −19.76 ± 0.06     −36.46
CB7-C13    −14.1            −1.88 ± 0.07     −13.37     −23.05 ± 0.06     −34.54

Table 4.1: Summary of experimental and calculated $\Delta G_{\mathrm{bind}}$ values, measured in kcal/mol, where NM is the calculation using normal mode analysis and QH is the calculation using the quasi-harmonic approximation.

4.3 Linear Regression Slope

The last metric evaluated on SAMPL4 data is the linear regression slope, the slope of the line of best fit through the data. Linear regression lines are computed using least squares fit analysis, which minimizes the sum of squared vertical distances between the data points and the fitted line. The target slope is 1, as that is the slope for data in direct correlation. However, a linear regression slope of 1 is not, by itself, indicative of accurate calculations. The error must be small as well. Our linear regression slopes were around 2 for all four methods.
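The least squares slope has a closed form, sketched below for the MM/PBSA normal-mode predictions with the experimental values on the x-axis, as in Fig. 4.1. The data are taken from Table 4.1; the function name is illustrative.

```python
def least_squares_slope(x, y):
    """Slope of the least squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


# Values from Table 4.1 (kcal/mol): experiment and MM/PBSA with normal modes.
experimental = [-6.6, -7.9, -10.1, -12.6, -14.1]
pbsa_nm = [12.33, 7.24, 4.80, -0.77, -1.88]

slope = least_squares_slope(experimental, pbsa_nm)
```

The slope comes out at about 1.8, matching Table 4.2: the calculated binding free energies spread over nearly twice the range of the experimental ones.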


[Figure 4.1: correlation plot. Horizontal axis: Experimental Energy (kcal/mol), −16 to −6. Vertical axis: Calculated Energy (kcal/mol), −40 to 20. Legend: PBSA w/ NM, PBSA w/ QH, GBSA w/ NM, GBSA w/ QH, Direct correlation, Null 1, Null 2.]

Figure 4.1: Plot of all four sets of binding free energy calculations from this work for five CB7 complexes (4.2): two sets for MM/PBSA, one with normal mode entropy and one with quasi-harmonic entropy; and two sets for MM/GBSA, one with normal mode entropy and one with quasi-harmonic entropy. The direct correlation line and each null model are plotted for comparison.


ID         Method                  Energy Model        Solvent Model      Conformational Sampling   RMSEr        RMSEo        R2           Slope
1          NULL1                   -                   -                  -                         3.2 ± 0.4    2.2 ± 0.3    0.0 ± 0.0    0.0 ± 0.0
10         NULL2                   -                   -                  -                         3.3 ± 0.4    2.3 ± 0.3    0.2 ± 0.1    0.5 ± 0.1
187        SIE                     GAFF/AM1-BCC        BRI BEM            Wilma                     2.7 ± 0.4    1.8 ± 0.3    0.6 ± 0.1    0.18 ± 0.04
188        SIE+HB                  GAFF/AM1-BCC        BRI BEM            Wilma                     2.6 ± 0.4    1.8 ± 0.3    0.6 ± 0.1    0.19 ± 0.04
194        DOCK 3.7                AMSOL               AMSOL              DOCK 3.7                  7.9 ± 1.9    5.4 ± 1.3    0.1 ± 0.2    −0.5 ± 0.7
528        RRHO                    DFT-D/HF-3c         COSMO-RS           Manual                    3.7 ± 0.7    2.5 ± 0.5    0.8 ± 0.1    1.9 ± 0.3
541        QM/M2                   PM6-DH+             COSMO              Tork                      4.5 ± 1.0    3.0 ± 0.7    0.2 ± 0.2    0.7 ± 0.5
550        M2                      CHARMm/VCharge      PBSA               Tork                      5.0 ± 1.0    4.3 ± 0.7    0.7 ± 0.1    2.0 ± 0.4
579        PMF                     CGenFF              TIP3P              Funnel Metadynamics       5.8 ± 1.1    4.0 ± 0.8    0.1 ± 0.2    −0.4 ± 0.4
600        EES                     GAFF/AM1-BCC        TIP3P              MD                        5.0 ± 1.1    3.4 ± 0.7    0.7 ± 0.1    1.9 ± 0.4
601        EES                     GAFF/AM1-BCC        TIP3P              MD                        4.2 ± 0.7    2.9 ± 0.5    0.6 ± 0.1    1.5 ± 0.3
This work  MM/IS-Normal Mode       GAFF/RESP 6-31G*    TIP4P-Ew/MMPBSA    MD                        4.03         2.6          0.96         1.8
This work  MM/IS-Quasi-Harmonic    GAFF/RESP 6-31G*    TIP4P-Ew/MMPBSA    MD                        7.14         4.5          0.83         2.3
This work  MM/IS-Normal Mode       GAFF/RESP 6-31G*    TIP4P-Ew/MMGBSA    MD                        4.18         2.6          0.98         1.9
This work  MM/IS-Quasi-Harmonic    GAFF/RESP 6-31G*    TIP4P-Ew/MMGBSA    MD                        6.86         4.3          0.95         2.3

Table 4.2: This table summarizes the results of the SAMPL4 prediction challenge [1], with our results appended. (Groups that did not provide their calculated values are excluded.) Methods are discussed in Section 2.2 and error metrics are discussed in Chapter 4.


Chapter 5

Discussion

SAMPL4 results were compared to two null models: one in which there is no variation and the prediction is independent of the observed system, and one in which the binding free energy is directly related to the number of heavy atoms in the system by

$$\Delta G_{\mathrm{bind}} = -1.5\,n \qquad (5.1)$$

where n is the number of heavy atoms and the factor of −1.5 is a scaling factor measured in kJ/mol. More than half of the SAMPL4 submissions provided better correlations with experiment than the two simple null models, but most underperformed in terms of root mean squared error and linear regression slope [1]. Our methods yielded similar results: while our lowest $R^2$ correlation value is 0.83, far greater than that of either null model, none of the MM/IS methods outperformed the null models in terms of $\mathrm{RMSE}_o$ or $\mathrm{RMSE}_r$ (see Table 4.2).

5.1 Ranking

R2 Correlation

The $R^2$ correlation values for all four methods used in this work outranked all other submissions. The strongest possible $R^2$ value is 1 and the weakest is 0 (the underlying Pearson coefficient r ranges from −1, for a perfect negative correlation, to 1). Of the four methods, the lowest $R^2$ is 0.83, for the MM/PBSA method with the quasi-harmonic approximation (see Table 4.2); this is still in range of the largest $R^2$ value among the other submissions (group #528, $R^2 = 0.8 \pm 0.1$). While this is inconclusive on its own (especially due to lack of data, as only five of the 15 complexes were tested), there appears to be a strong correlation between experimental and calculated values.

RMSE

Each of our methods failed to beat either of the null models in $\mathrm{RMSE}_r$. However, while our $\mathrm{RMSE}_r$ metrics are quite high, they are still not the highest. The range in $\mathrm{RMSE}_r$ values for this work is [4.03, 7.14], whereas the range for SAMPL4 data is [2.6, 7.9].

In $\mathrm{RMSE}_o$, our methods also failed to beat the null models, but the two methods that used normal mode analysis were in range, each having $\mathrm{RMSE}_o = 2.6$ (this is an assumption, as we did not compute the error in the error). The range in $\mathrm{RMSE}_o$ values for this work is [2.6, 4.5] (see Table 4.2), whereas the range for the other participants' results is [1.8, 5.4].


Chapter 6

Conclusions

When compared to other SAMPL4 submissions, the methods implemented in this work did not perform the best, nor did they perform the worst. These methods had moderate RMSE metrics in comparison to others, but outranked all other submissions in $R^2$ correlation. Furthermore, of the methods tested in this work, the ones using normal mode analysis had exceptionally high $R^2$ values (0.96 for PBSA and 0.98 for GBSA), and their $\mathrm{RMSE}_o$ values were relatively low. Therefore, although our error metrics only included data from five of the 15 complexes, the MM/IS methods (GBSA and PBSA) with normal mode analysis show promising results.

Although we successfully validated MM/IS methods with normal mode analysis as worthy candidates for future research, we have not yet solved the major problem of creating a method to predict binding free energies to experimental accuracy.

The main issue with these methods is that the error is spread among five different sources: parameterization, conformational sampling, implicit solvent model approximations, entropy calculations, and the use of only the single trajectory of the complex for all calculations (complex, receptor, and ligand). For now we assume that our parameterization is sufficient; the AMBER force field was developed in 1994 and parameterization methods have since undergone decades of iterative improvements. We address error due to run time and conformational sampling by computing trajectories for 50 ns (sampled every 50 ps); MM/P(G)BSA simulations are typically run for well under 1 ns [18], and in some cases only minimization is used [28]. The increased number of frames provides a better statistical average, and the increased spacing between frames reduces the possibility of correlation between frames. For the entropy approximations, the normal mode calculations appear to give the best correlations with the lowest error. The last major sources of error that remain are the single trajectory approach and the choice of solvent model. The single trajectory approach uses only the trajectory of the complex: the ligand atoms are stripped away to obtain the receptor trajectory, and the receptor atoms are stripped away to obtain the ligand trajectory. This method assumes that the bound conformation of each molecule is the same as its unbound conformation, which introduces additional error.

Therefore, to continue our efforts to create a gold standard method for computing binding free energies, we will continue this research in the following sequence. We will first test the MM/IS methods from this work on all 15 molecules to get a larger data set. We will then use the three trajectory approach (separate trajectory computations for receptor, ligand, and complex) to isolate the last source of error, the implicit solvent model. Having isolated the error due to the implicit solvent model, we can attempt other methods for computing ΔGsolv, such as 3D-RISM.
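The bookkeeping behind the single trajectory approach can be sketched in Python as follows. The sketch assumes that per-frame free energies for the complex, receptor, and ligand have already been evaluated on the same frames of the complex trajectory; the function names and input layout are hypothetical, the entropy term is left out, and the standard error assumes that the frames are uncorrelated.

```python
import math


def frame_count(total_ps, stride_ps):
    """Number of frames when a trajectory is sampled every stride_ps picoseconds."""
    return total_ps // stride_ps


def single_trajectory_binding(g_complex, g_receptor, g_ligand):
    """Mean binding free energy and its standard error from per-frame energies.

    All three lists come from the same complex frames; the receptor and ligand
    energies are evaluated after stripping the partner's atoms from each frame.
    """
    per_frame = [c - r - l for c, r, l in zip(g_complex, g_receptor, g_ligand)]
    n = len(per_frame)
    if n < 2:
        raise ValueError("need at least two frames to estimate a standard error")
    mean = sum(per_frame) / n
    variance = sum((x - mean) ** 2 for x in per_frame) / (n - 1)
    standard_error = math.sqrt(variance / n)  # valid only for uncorrelated frames
    return mean, standard_error
```

For example, `frame_count(50_000, 50)` gives 1000 frames for a 50 ns trajectory sampled every 50 ps. A three trajectory variant would instead average each of the three lists separately, since the frames would no longer correspond to one another.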


Appendix A

List of Acronyms

MM/PBSA Molecular mechanics Poisson-Boltzmann solvent accessible surface area

MM/GBSA Molecular mechanics generalized Born solvent accessible surface area

CHARMM Chemistry at HARvard Macromolecular Mechanics

AMBER Assisted Model Building with Energy Refinement

GROMOS GROningen MOlecular Simulation computer program package

MM/IS Molecular mechanics with implicit solvation (e.g. MM/PBSA, MM/GBSA)

MD Molecular dynamics

PBC Periodic boundary conditions

STP Standard temperature and pressure

NVT Constant number of particles, volume and temperature

NPT Constant number of particles, pressure and temperature

GAFF General Amber force field

REDS (R)ESP (E)SP charge (D)erive Server

RESP Restrained electrostatic potential

ESP Electrostatic potential

PME Particle mesh Ewald summation

SASA Solvent accessible surface area


Bibliography

[1] Hari S. Muddana, Andrew T. Fenley, David L. Mobley, and Michael K. Gilson. The SAMPL4 host–guest blind prediction challenge: an overview. Journal of Computer-Aided Molecular Design, 28(4):305–317, April 2014.

[2] Michael K Gilson and Huan-Xiang Zhou. Calculation of protein-ligand binding affinities. Annu. Rev. Biophys. Biomol. Struct., 36:21–42, 2007.

[3] Daan Frenkel and Berend Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, October 2001. ISBN 9780080519982.

[4] B. J. Alder and T. E. Wainwright. Studies in molecular dynamics. I. General method. The Journal of Chemical Physics, 31(2):459, 1959.

[5] Jayashree Srinivasan, Thomas E. Cheatham, Piotr Cieplak, Peter A. Kollman, and David A. Case. Continuum solvent studies of the stability of DNA, RNA, and phosphoramidate-DNA helices. Journal of the American Chemical Society, 120(37):9401–9409, 1998.

[6] BR Brooks, RE Bruccoleri, BD Olafson, DJ States, S Swaminathan, and M Karplus. CHARMM: a program for macromolecular energy, minimization and dynamics calculations. J. Comput. Chem., 4:187–217, 1983.

[7] WD Cornell, P Cieplak, CI Bayly, IR Gould, KM Merz Jr, DM Ferguson, DC Spellmeyer, T Fox, JW Caldwell, and PA Kollman. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc., 117:5179, 1995.

[8] Wilfred F van Gunsteren and Herman JC Berendsen. Computer simulation of molecular dynamics: Methodology, applications, and perspectives in chemistry. Angewandte Chemie International Edition in English, 29(9):992–1023, 1990.

[9] Tingjun Hou, Junmei Wang, Youyong Li, and Wei Wang. Assessing the performance of the MM/PBSA and MM/GBSA methods. 1. The accuracy of binding free energy calculations based on molecular dynamics simulations. Journal of Chemical Information and Modeling, 51(1):69–82, 2011.

[10] Liang Dong, Richard W Smith, and David J Srolovitz. A two-dimensional molecular dynamics simulation of thin film growth by oblique deposition. Journal of Applied Physics, 80(10):5682–5690, 1996.

[11] R.B. Shirts, S.R. Burt, and A.M. Johnson. Periodic boundary condition induced breakdown of the equipartition principle and other kinetic effects of finite sample size in classical hard-sphere molecular dynamics simulation. J. Chem. Phys., 125(16):164102, 2006.


[12] Ruslan L. Davidchack, Richard Handel, and M. V. Tretyakov. Langevin thermostat for rigid body dynamics. The Journal of Chemical Physics, 130(23):234101, 2009.

[13] D.A. Case, V. Babin, J.T. Berryman, R.M. Betz, Q. Cai, D.S. Cerutti, T.E. Cheatham III, T.A. Darden, R.E. Duke, H. Gohlke, A.W. Goetz, S. Gusarov, N. Homeyer, P. Janowski, J. Kaus, I. Kolossváry, A. Kovalenko, T.S. Lee, S. LeGrand, T. Luchko, R. Luo, B. Madej, K.M. Merz, F. Paesani, D.R. Roe, A. Roitberg, C. Sagui, R. Salomon-Ferrer, G. Seabra, C.L. Simmerling, W. Smith, J. Swails, R.C. Walker, J. Wang, R.M. Wolf, X. Wu, and P.A. Kollman. Amber 14. University of California, San Francisco, 2014.

[14] J. Wang, R.M. Wolf, J.W. Caldwell, P.A. Kollman, and D.A. Case. Development and testing of a general Amber force field. Journal of Computational Chemistry, 25:1157–1174, 2004.

[15] E. Vanquelef, S. Simon, G. Marquant, E. Garcia, G. Klimerak, J. C. Delepine, P. Cieplak, and F.-Y. Dupradeau. R.E.D. Server: a web service for deriving RESP and ESP charges and building force field libraries for new molecules and molecular fragments. Nucl. Acids Res. (Web server issue), 2011.

[16] Tom Darden. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. The Journal of Chemical Physics, 98:10089, 1993.

[17] Bill R Miller III, T Dwight McGee Jr, Jason M Swails, Nadine Homeyer, Holger Gohlke, and Adrian E Roitberg. MMPBSA.py: An efficient program for end-state free energy calculations. Journal of Chemical Theory and Computation, 8(9):3314–3321, 2012.

[18] Samuel Genheden, Tyler Luchko, Sergey Gusarov, Andriy Kovalenko, and Ulf Ryde. An MM/3D-RISM approach for ligand binding affinities. The Journal of Physical Chemistry B, 114(25):8505–8516, July 2010.

[19] Nathan A. Baker, David Sept, Simpson Joseph, Michael J. Holst, and J. Andrew McCammon. Electrostatics of nanosystems: application to microtubules and the ribosome. Proceedings of the National Academy of Sciences, 98(18):10037–10041, 2001.

[20] Alexey Onufriev, Donald Bashford, and David A Case. Exploring protein native states and large-scale conformational changes with a modified generalized Born model. Proteins: Structure, Function, and Bioinformatics, 55(2):383–394, 2004.

[21] W.C. Still, A. Tempczyk, R.C. Hawley, and T. Hendrickson. Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc., 112(16):6127–6129, 1990.

[22] Robert D Blevins and R Plunkett. Formulas for natural frequency and mode shape. Journal of Applied Mechanics, 47:461, 1980.


[23] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, Ö. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski, and D. J. Fox. Gaussian 09, revision D.01. Gaussian Inc., Wallingford CT, 2009.

[24] C. I. Bayly, P. Cieplak, W. Cornell, and P. A. Kollman. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem., 97:10269–10280, 1993.

[25] J. Wang, W. Wang, P.A. Kollman, and D.A. Case. Automatic atom type and bond type perception in molecular mechanical calculations. Journal of Molecular Graphics and Modelling, 25, 2006.

[26] Romelia Salomon-Ferrer, Andreas W. Goetz, Duncan Poole, Scott Le Grand, and Ross C. Walker. Routine microsecond molecular dynamics simulations with Amber on GPUs. 2. Explicit solvent particle mesh Ewald. J. Chem. Theory Comput., 9:3878–3888, 2013.

[27] Andreas W. Goetz, Mark J. Williamson, Dong Xu, Duncan Poole, Scott Le Grand, and Ross C. Walker. Routine microsecond molecular dynamics simulations with Amber - Part I: Generalized Born. J. Chem. Theory Comput., 8(5):1542–1555, 2012.

[28] Paulette A. Greenidge, Christian Kramer, Jean-Christophe Mozziconacci, and Romain M. Wolf. MM/GBSA binding energy prediction on the PDBbind data set: Successes, failures, and directions for further improvement. Journal of Chemical Information and Modeling, 53(1):201–209, January 2013.
