andreas kukol editor molecular modeling of...

Molecular Modeling of Proteins

Andreas Kukol Editor

Second Edition

Methods in Molecular Biology 1215

M E T H O D S I N M O L E C U L A R B I O L O G Y

Series EditorJohn M. Walker

School of Life SciencesUniversity of Hertfordshire

Hat fi eld, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651

http://www.springer.com/series/7651

Molecular Modeling of Proteins

Second Edition

Edited by

Andreas KukolUniversity of Hertfordshire, Hatfield, Hertfordshire, UK

Additional material to this book can be downloaded from http://extras.springer.com

ISSN 1064-3745 ISSN 1940-6029 (electronic)ISBN 978-1-4939-1464-7 ISBN 978-1-4939-1465-4 (eBook) DOI 10.1007/978-1-4939-1465-4 Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2014945209

© Springer Science+Business Media New York 2015 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifi cally the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfi lms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifi cally for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specifi c statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Humana Press is a brand of SpringerSpringer is part of Springer Science+Business Media (www.springer.com)

Editor Andreas Kukol University of Hertfordshire Hatfield Hertfordshire , UK

http://extras.springer.com

www.springer.com

v

Over the years, molecular modeling and simulation of biomolecules has become an impor-tant tool in the molecular biosciences. Initially situated in the realm of specialists with in- depth knowledge of physics and computer science and access to supercomputers, molecular modeling is used increasingly by bioscientists who are mainly interested in investigating biological problems. This development has been supported by improved hardware, such as multi-core processors or graphic processing units, on the one hand, and accelerated sam-pling algorithms on the other hand that increase the timescale without increasing the demands on the hardware or the calculation time. The purpose of Molecular Modeling of Proteins is to provide a theoretical background of various methods available and to enable nonspecialists to apply methods to their problems. Most chapters contain, in addition to a thorough introduction, step-by-step instructions and notes on troubleshooting and how to avoid common pitfalls.

The current second edition of Molecular Modeling of Proteins provides some updated chapters and new material not covered in the fi rst edition. The fi rst part describes classical and advanced simulation methods as well as methods to set up complex systems such as lipid membranes and membrane proteins. The second part is devoted to the simulation and analysis of conformational changes of proteins, while Part III covers computational meth-ods for protein structure prediction as well as using experimental data in combination with computational techniques. The fi nal part contains chapters concerning protein–ligand interactions, which are relevant in the drug design process.

The topics cover some long established methods together with the latest developments in the fi eld. The chapters are written by internationally renowned investigators: they include leading developers of popular simulation packages or force fi elds.

The second edition of Molecular Modeling of Proteins is directed at researchers in the physical-, chemical-, and biosciences working in industry and academia, who are interested in applying the methods in their own research. Additionally, the book forms a valuable resource for educators who wish to teach courses about molecular modeling.

Hertfordshire, UK Andreas Kukol

Pref ace

vii

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

PART I SIMULATION METHODS

1 Molecular Dynamics Simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Erik Lindahl

2 Transition Path Sampling with Quantum/Classical Mechanics for Reaction Rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Frauke Gräter and Wenjin Li

3 Current Status of Protein Force Fields for Molecular Dynamics Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Pedro E. M. Lopes, Olgun Guvench, and Alexander D. MacKerell Jr.

4 Lipid Membranes for Membrane Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Andreas Kukol

5 Molecular Dynamics Simulations of Membrane Proteins . . . . . . . . . . . . . . . . . 91 Philip C. Biggin and Peter J. Bond

6 Membrane-Associated Proteins and Peptides . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Marc F. Lensink

7 Coarse-Grained Force Fields for Molecular Simulations. . . . . . . . . . . . . . . . . . 125 Jonathan Barnoud and Luca Monticelli

8 Tackling Sampling Challenges in Biomolecular Simulations . . . . . . . . . . . . . . . 151 Alessandro Barducci, Jim Pfaendtner, and Massimiliano Bonomi

9 Calculation of Binding Free Energies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Vytautas Gapsys, Servaas Michielssens, Jan Henning Peters, Bert L. de Groot, and Hadas Leonov

PART II CONFORMATIONAL CHANGE

10 The Use of Experimental Structures to Model Protein Dynamics. . . . . . . . . . . 213 Ataur R. Katebi, Kannan Sankar, Kejue Jia, and Robert L. Jernigan

11 Computing Ensembles of Transitions with Molecular Dynamics Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Juan R. Perilla and Thomas B. Woolf

12 Accelerated Molecular Dynamics and Protein Conformational Change: A Theoretical and Practical Guide Using a Membrane Embedded Model Neurotransmitter Transporter. . . . . . . . . . . . . . . . . . . . . . . 253 Patrick C. Gedeon, James R. Thomas, and Jeffry D. Madura

13 Simulations and Experiments in Protein Folding . . . . . . . . . . . . . . . . . . . . . . . 289 Giovanni Settanni

Contents

viii

PART III PROTEIN STRUCTURE DETERMINATION

14 Comparative Modeling of Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 Gerald H. Lushington

15 De Novo Membrane Protein Structure Prediction. . . . . . . . . . . . . . . . . . . . . . 331 Timothy Nugent

16 NMR-Based Modeling and Refinement of Protein 3D Structures . . . . . . . . . . 351 Wim F. Vranken, Geerten W. Vuister, and Alexandre M. J. J. Bonvin

PART IV PROTEIN–LIGAND INTERACTIONS

17 Methods for Predicting Protein–Ligand Binding Sites . . . . . . . . . . . . . . . . . . . 383 Zhong-Ru Xie and Ming-Jing Hwang

18 Information-Driven Structural Modelling of Protein–Protein Interactions . . . . 399 João P. G. L. M. Rodrigues, Ezgi Karaca, and Alexandre M. J. J. Bonvin

19 Identifying Putative Drug Targets and Potential Drug Leads: Starting Points for Virtual Screening and Docking . . . . . . . . . . . . . . . . . . . . . 425 David S. Wishart

20 Molecular Docking to Flexible Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445 Jesper Sørensen, Özlem Demir, Robert V. Swift, Victoria A. Feher, and Rommie E. Amaro

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471

Contents

ix

ROMMIE E. AMARO • Department of Chemistry and Biochemistry , University of California , San Diego , CA , USA

ALESSANDRO BARDUCCI • Laboratory of Statistical Biophysics , Ecole Polytechnique Fédérale de Lausanne , Lausanne , Switzerland

JONATHAN BARNOUD • INSERM , Lyon , France PHILIP C. BIGGIN • Department of Biochemistry , University of Oxford , Oxford , UK PETER J. BOND • Department of Chemistry, The Unilever Centre for Molecular Science

Informatics, Cambridge, USA; Department of Biological Sciences, National University of Singapore , Singapore

MASSIMILIANO BONOMI • Department of Bioengineering and Therapeutic Sciences and California Institute of Quantitative Biosciences , University of California , San Francisco , CA , USA

ALEXANDRE M. J. J. BONVIN • Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Faculty of Science , Utrecht University , Utrecht , The Netherlands

ÖZLEM DEMIR • Department of Chemistry and Biochemistry , University of California , San Diego , CA , USA

VICTORIA A. FEHER • Department of Chemistry and Biochemistry , University of California , San Diego , CA , USA

VYTAUTAS GAPSYS • Max Planck Institute for Biophysical Chemistry , Göttingen , Germany PATRICK C. GEDEON • Department of Biomedical Engineering , Duke University , Durham ,

NC , USA FRAUKE GRÄTER • Heidelberg Institute for Theoretical Studies , Heidelberg , Germany BERT L. DE GROOT • Max Planck Institute for Biophysical Chemistry , Göttingen , Germany OLGUN GUVENCH • Department of Pharmaceutical Sciences , University of New England ,

Portland , ME , USA MING-JING HWANG • Institute of Biomedical Sciences , Academia Sinica , Taipei , Taiwan ROBERT L. JERNIGAN • National Cancer Institute, National Institute of Health, Bethesda,

MD, USA; Interdepartmental Program for Bioinformatics and Computational Biology, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA

KEJUE JIA • National Cancer Institute, National Institute of Health, Bethesda, MD, USA; Interdepartmental Program for Bioinformatics and Computational Biology, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA

EZGI KARACA • Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Faculty of Science , Utrecht University , Utrecht , The Netherlands

ATAUR R. KATEBI • National Cancer Institute, National Institute of Health, Bethesda, MD, USA; Interdepartmental Program for Bioinformatics and Computational Biology, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA

ANDREAS KUKOL • School of Life and Medical Sciences , University of Hertfordshire , Hatfi eld , UK

Contributors

x

MARC F. LENSINK • Interdisciplinary Research Institute, CNRS USR3078 , University Lille1 , Villeneuve d’Ascq , France

HADAS LEONOV • Max Planck Institute for Biophysical Chemistry , Göttingen , Germany WENJIN LI • Department of Bioengineering , University of Illinois at Chicago , Chicago ,

IL , USA ERIK LINDAHL • Department of Biochemistry and Biophysics , Stockholm University ,

Stockholm , Sweden PEDRO E. M. LOPES • Department of Pharmaceutical Sciences , University of Maryland ,

Baltimore , MD , USA GERALD H. LUSHINGTON • LiS Consulting , Lawrence , KS , USA ALEXANDER D. MACKERELL JR. • Department of Pharmaceutical Sciences ,

University of Maryland , Baltimore , MD , USA JEFFRY D. MADURA • Department of Chemistry and Biochemistry and Center

for Computational Sciences , Duquesne University , Pittsburgh , PA , USA SERVAAS MICHIELSSENS • Max Planck Institute for Biophysical Chemistry , Göttingen , Germany LUCA MONTICELLI • INSERM , Lyon , France TIMOTHY NUGENT • Department of Computer Science , University College London , London , UK JUAN R. PERILLA • Beckman Institute , University of Illinois , Urbana at Urbana-

Champaign , IL , USA JAN HENNING PETERS • Max Planck Institute for Biophysical Chemistry , Göttingen , Germany JIM PFAENDTNER • Department of Chemical Engineering , University of Washington , Seattle ,

WA , USA JOÃO P. G. L. M. RODRIGUES • Computational Structural Biology Group,

Bijvoet Center for Biomolecular Research, Faculty of Science , Utrecht University , Utrecht , The Netherlands

KANNAN SANKAR • National Cancer Institute, National Institute of Health, Bethesda, MD, USA; Interdepartmental Program for Bioinformatics and Computational Biology, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA

GIOVANNI SETTANNI • Physics Department , Johannes Gutenberg Universität , Mainz , Germany JESPER SØRENSEN • Department of Chemistry and Biochemistry , University of California ,

San Diego , CA , USA ROBERT V. SWIFT • Department of Chemistry and Biochemistry , University of California ,

San Diego , CA , USA JAMES R. THOMAS • Department of Chemistry and Biochemistry and Center for Computational

Sciences , Duquesne University , Pittsburgh , PA , USA WIM F. VRANKEN • Department of Structural Biology , Structural Biology Brussels,

Vrije Universiteit Brussel , Brussels , Belgium GEERTEN W. VUISTER • Department of Biochemistry , University of Leicester , Leicester , UK DAVID S. WISHART • Department of Computing Science , University of Alberta , Edmonton ,

AB , Canada ; Department of Biological Sciences , University of Alberta , Edmonton , AB , Canada

THOMAS B. WOOLF • Department of Physiology , John Hopkins University , Baltimore , MD , USA

ZHONG-RU XIE • Institute of Biomedical Sciences , Academia Sinica , Taipei , Taiwan

Contributors

Part I

Simulation Methods

3

Andreas Kukol (ed.), Molecular Modeling of Proteins, Methods in Molecular Biology, vol. 1215,DOI 10.1007/978-1-4939-1465-4_1, © Springer Science+Business Media New York 2015

Chapter 1

Molecular Dynamics Simulations

Erik Lindahl

Abstract

Molecular dynamics has evolved from a niche method mainly applicable to model systems into a cornerstone in molecular biology. It provides us with a powerful toolbox that enables us to follow and understand structure and dynamics with extreme detail—literally on scales where individual atoms can be tracked. However, with great power comes great responsibility: Simulations will not magically provide valid results, but it requires a skilled researcher. This chapter introduces you to this, and makes you aware of some potential pitfalls. We focus on the two basic and most used methods; opti-mizing a structure with energy minimization and simulating motion with molecular dynamics. The statistical mechanics theory is covered briefly as well as limitations, for instance the lack of quantum effects and short timescales. As a practical example, we show each step of a simulation of a small pro-tein, including examples of hardware and software, how to obtain a starting structure, immersing it in water, and choosing good simulation parameters. You will learn how to analyze simulations in terms of structure, fluctuations, geometrical features, and how to create ray-traced movies for presentations. With modern GPU acceleration, a desktop can perform μs-scale simulations of small proteins in a day—only 15 years ago this took months on the largest supercomputer in the world. As a final exer-cise, we show you how to set up, perform, and interpret such a folding simulation.

Key words Molecular dynamics, Simulation, Force field, Protein, Solvent, Energy minimization, Position restraints, Equilibration, Trajectory analysis, Secondary structure

1 Introduction

Biomolecular dynamics occur over a wide range of scales in both time and space, and the choice of approach to study them depends on the question asked. In many cases the best alternative is an experimental technique, for instance spectroscopy to study bond vibrations or electrophysiology to study ion channels opening and closing. However, theoretical methods have made huge advances the last few decades, and there are now large domains where mod-eling and simulation either provide more detail or are more effi-cient to use compared to setting up a new experiment.

Molecular dynamics simulation is far from the only theoretical method; when the aim is to predict for example the structure

4

and/or function of proteins (rather than studying the dynamics of a protein) the best tool is normally bioinformatics that detect related proteins from amino acid sequence similarity. Similarly, for computational drug design often it is much more productive to use less accurate but exceptionally fast statistical methods like QSAR (Quantitative Structure–Activity Relationship) instead of spending billions of CPU hours to simulate binding of thousands of compounds.

Traditionally, the role of simulations has been to test if simple theoretical models can predict experimental observations. For example, simulations of ion channels have been useful to explain why some ions pass while others are blocked, although the conduc-tivity itself was already known from experiments. Similarly, simula-tions can provide detail not accessible through experiments, for instance pressure distributions inside membranes. However, this is changing rapidly—simulations have moved far beyond confirming experiments, and today they frequently make predictions about properties such as binding or folding dynamics that are later con-firmed in the lab. With ever-increasing computational power this development will not only continue, but it is likely to accelerate significantly the next few years.

From an ideal physics point-of-view, the time-dependent Schrödinger equation should be able to predict all properties of any molecule with arbitrary precision ab initio. However, as soon as more than a handful of particles are involved it is necessary to introduce approximations. In quantum chemistry, one common approximation is to assume that atomic nuclei do not move, and using an implicit representation of solvent. This is obviously not realistic for large biomolecules if we are interested in understand-ing their motion and sampling of lots of different states, so for most biomolecular systems we instead choose to work with empirical parameterizations of models, for instance classical Coulomb interactions between pointlike atomic charges. The conceptual difference is that quantum chemistry is excellent at describing the electronic structure and enthalpy (potential) of the system, while classical molecular dynamics instead excels at sam-pling the billions of states a macromolecule will adapt—in par-ticular this means they properly include the entropy part of free energy. These models are not only orders of magnitude faster, but since they have been parameterized from experiments they also perform better when it comes to reproducing observations on microsecond scale (Fig. 1), rather than extrapolating quantum models 10 orders of magnitude. The first molecular dynamics simulation was performed as late as 1957 [1], although it was not until the 1970s that it was possible to simulate water [2] and biomolecules [3].

Erik Lindahl

5

2 Theory

Macroscopic properties measured in an experiment are not direct observations, but averages over billions of molecules representing a statistical mechanics ensemble. This has deep theoretical implica-tions that are covered in great detail in the literature [4, 5], but even from a practical point of view there are important conse-quences: (1) It is not sufficient to work with individual structures, but systems have to be expanded to generate a representative ensemble of structures (see Note 1) at the given experimental con-ditions, e.g., temperature and pressure—this is one thing that sets classical molecular dynamics apart from quantum chemistry. (2) Thermodynamic equilibrium properties related to free energy, such as binding constant, solubilities, and relative stability cannot be calculated directly from individual simulations, but require more elaborate techniques covered in later chapters—these all rely on entropy. (3) For equilibrium properties (in contrast to kinetic) the aim is to examine the ensemble of structures, and not necessar-ily to reproduce individual atomic trajectories!

The two most common ways to generate statistically faithful equilibrium ensembles are Monte Carlo and Molecular Dynamics simulations, where the latter also has the advantage of accurately reproducing kinetics of non-equilibrium properties such as diffu-sion or folding times. However, these methods cannot handle the case where a structure is very far from equilibrium, for instance if two atoms are almost overlapping after building a new side chain. To remove this type of clashes prior to simulation, we typically start with an Energy Minimization. This type of minimization is also commonly used to refine low-resolution experimental structures.

All classical simulation methods rely on more or less empirical sets of parameters called Force fields [6–9] to calculate interactions and evaluate the potential energy of the system as a function of pointlike atomic coordinates. A force field consists of both the set of equations used to calculate the potential energy and forces from particle coordinates, as well as a collection of parameters used in

10-15s 10-12s 10-9s 10-6s 10-3s 1s 103s

bond lengthvibration rotation

around bonds transport inion channel

rapidprotein folding

normal proteinfolding

lipidrotation

ribosomesynthesis

waterrelaxation

lipiddiffusion "biology"

membraneprotein fodling

Accessible to atomic-detail simulation today (2013)

Fig. 1 Range of time scales for dynamics in biomolecular systems. While the individual time steps of molecular dynamics is 1–2 fs, parallel computers make it possible to simulate on microsecond scale, and distributed computing techniques can sample even slower processes, almost reaching milliseconds


6

the equations. For most purposes these approximations work great, but they cannot reproduce quantum effects such as bond forma-tion or breaking (see Note 2).

All common force fields subdivide potential functions in two classes. Bonded interactions cover stretching of covalent bonds, angle-bending, torsion potentials when rotating around bonds, and out-of-plane “improper torsion” potentials, all which are nor-mally fixed throughout a simulation (Fig. 2). The remaining non-bonded interactions between atoms that are merely close in space consist of Lennard–Jones repulsion and dispersion as well as Coulomb electrostatic. These are typically computed from neigh-borlists updated periodically.

Given the force on all atoms, the coordinates are updated for the next step. For energy minimization, the steepest descent algo-rithm simply moves each atom a short distance in direction of decreasing energy (force is the negative gradient of energy), while molecular dynamics is performed by integrating Newton’s equa-tions of motion [10]:

Fr r

r

rF

iN

i

ii

i

V

mt

= -¶ ¼( )

¶

¶¶

=

1

2

2

, ,

The updated coordinates are then used to evaluate the potential energy again, as shown in the flowchart of Fig. 3.

Fig. 2 Examples of interaction functions in modern force fields. Bonded interac-tions include covalent bond-stretching, angle-bending, torsion rotation around bonds, and out-of-plane or “improper” torsions (not shown). Nonbonded interac-tions are based on neighborlists and consist of Lennard–Jones attraction and repulsion, as well as Coulomb electrostatics. Even a small amino acid residue contains a large number of interactions, and for a protein there are thousands

Erik Lindahl

7

Even the smallest chemical sample we can imagine is far too large to include completely in a simulation. Instead, biomolecular simulations normally uses periodic boundary conditions to avoid surface artifacts, so that a water molecule that exits to the right reappears on the left in the system; if the box is sufficiently large the molecules will not interact significantly with their periodic cop-ies. This is intimately related to the nonbonded interactions, which ideally should be summed over all neighbors in the resulting infi-nite periodic system. Simple cutoffs can work for Lennard–Jones interactions that decay very rapidly, but for Coulomb interactions a sudden cutoff can lead to large errors. In the early days of simula-tion is was common to “switch off” the electrostatic interaction before the cutoff as shown in Fig. 4, but this too has severe artifacts—the current method of choice is to use Particle-Mesh-Ewald sum-mation (PME) to calculate the infinite electrostatic interactions by splitting the summation into short- and long-range parts [11].

Update coordinates &velocities according toequations of motion

More steps?

Compute potential V(r) andforces Fi = iV(r) on atoms

Initial input data:Interaction function V(r) - "force field"

coordinates r, velocities v

Collect statistics and writeenergy/coordinates to

trajectory files

Done!

Yes

No

Rep

eat f

or m

illio

ns o

f ste

ps

Fig. 3 Simplified flowchart of a typical molecular dynamics simulation. The basic idea is to generate structures from a natural ensemble by calculating potential functions and integrating Newton’s equations of motion, structures which are then used to evaluate equilibrium properties of the system. A typical time step is in the order of 1 or 2 femtoseconds, unless special techniques are used


8

For PME, the cutoff is not really a cutoff; it only determines the balance between the two parts, and the long-range part is treated by assigning charges to a grid that is solved in reciprocal space through Fourier transforms.

Cutoffs and rounding errors can lead to drifts in energy, which will cause the system to heat up during the simulation. Even with a theoretically perfect simulation we would run into problems since we typically start from an imperfect structure. As the poten-tial energy of this structure decreases during the simulation, the kinetic energy (i.e., temperature) would increase if the total system energy was constant. To control this, the system is normally cou-pled to a thermostat that scales velocities during the integration to maintain room temperature. Similarly, the total pressure in the sys-tem can be adjusted through scaling the simulation box size, either isotropically or separately in x/y/z dimensions.

The single most demanding part of simulations is the compu-tation of nonbonded interactions, since millions of pairs have to be evaluated for each time step. Extending the time step is thus an important way to improve simulation performance, but unfortu-nately errors are introduced in bond vibrations already at 1 fs. However, in most simulations these bond vibrations are not of interest per se, and can be removed entirely by introducing bond constraint algorithms such as SHAKE [12] or LINCS [13]. Constraints make it possible to extend time steps to 2 fs, and fixed- length bonds are likely better approximations of the quantum mechanical oscillators than harmonic springs (see Note 3)—and in the final section we will show you how to go even further.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1-1

0

1

2

3

4

-1

0

1

2

3

4

r (nm)

rela

tive

pot

ential

Fig. 4 Alternatives to a sharp cutoff for nonbonded coulomb interactions. Top: By switching off the interaction (dashed ) before the cutoff the force will be the exact derivative of potential, but the derivative (and thus force) will unnaturally increase just before the cutoff. Bottom: Particle-Mesh-Ewald is an amazing algorithm where the coulomb interaction (solid ) is divided into a short-range term that is evaluated within a cutoff (dashed ) and a long-range term which can be solved exactly in reciprocal space with Fourier transforms (dot-dash)

Erik Lindahl

9

3 Methods

With the basic theory covered, this section will describe how to (1) choose and obtain a starting structure, (2) prepare it for a simula-tion, (3) create a simulation box, (4) add solvent water, (5) perform energy minimization, (6) equilibrate the structure with simulation, (7) perform the production simulation, and (8) analyze the trajec-tory data. To reproduce it, you will need access to a Unix/Linux machine (see Note 4) with a molecular dynamics package installed. While the options and files below refer to the GROMACS pro-gram [14], the description should be reasonably straightforward to follow with other programs like AMBER [15], CHARMM [16], or NAMD [17]. It will also be useful to have the molecular viewer PyMOL [18] and Unix graph program Grace installed (see Note 5).

The Bovine Pancreatic Trypsin Inhibitor (BPTI) is a small 58- residue water-soluble protein that inhibits several serine proteases [19]. It was one of the first proteins to be simulated [3], and has often been referred to as a “hydrogen atom” of protein simulation. There are several high-resolution X-ray structures of BPTI [20] in the Protein Data Bank (http://www.pdb.org), and also NMR structures. It was actually the early simulations of BPTI [3] that lead experimentalists to realize that X-ray temperature factors can be used to study the local dynamics of a protein [21]. Choose the entry 6PTI with 1.7 Å resolution [20], and download it as 6PTI.pdb (see Note 6). Figure 5 shows a cartoon representation of this structure; the small crosses are crystal water oxygen atoms visible in the X-ray experiment (see Note 7).

In addition to the coordinates/velocities that change each step, simulations also need a static description of all atoms and interac-tions in the system, called topology. In GROMACS, this is created from the PDB structure by the program pdb2gmx, which also adds all the hydrogen atoms that are not present in most X-ray struc-tures. For this example we will work with the Amber99SB-ILDN force field, the TIP3P [22] water model (see Note 8), and accept the default choices for all residue protonation states, termini, disul-fide bridges, etc. If you just try the command below right away you will get an error due to the issues with the structure mentioned in Note 6. This is common, so it is something you need to learn how to fix. Open the PDB file in an editor and scroll down to residue 57 at the end of the chain. Remove the single nitrogen atom line just before the line starting with “TER”—we simply skip the miss-ing residues. Just after the “TER” line there are also some lines for a phosphate ion (residue “PO4”). To avoid problems with finding parameters for this, remove these five lines too. Now you should be good to go. The command to use is then

3.1 Obtaining a Starting Structure

3.2 Preparation of Input Data


http://www.pdb.org/

10

pdb2gmx –f 6PTI.pdb –water tip3p

You will be prompted for the force field (select “6” for Amber99SB-ILDN), and the command will produce three files: conf.gro contains coordinates with hydrogens, topol.top is the topology, and posre.itp contains a list of position restraints that will be used in Subheading 3.7. For all these programs, you can use the –h flag for help and a detailed list of options (see Note 9).

The default box is taken from the PDB crystal cell, but a simula-tion in water requires something larger. The box size is a trade-off, though: volume is proportional to the box side cubed, and more water means the simulation is slower. The easiest option it to place the solute in the center of a cube, with for example 0.75 nm to the box sides. We will show up some more advanced alternatives later, but for now this will suffice:

editconf –f conf.gro –d 0.75 –o box.gro

where the distance (-d) flag automatically centers the protein in the box, and the new conformation is written to the file box.gro (see Note 10).

The last step before the simulation is to add water in the box to solvate the protein. This is done by using a small pre-equilibrated system of water coordinates that is repeated over the box, and

3.3 Creating a Simulation Box

3.4 Adding Solvent Water

Fig. 5 Cartoon representation of the BPTI structure 6PTI from Protein Data Bank, with side chains shown as sticks. Including hydrogens, the protein contains roughly 800 atoms. Ray-traced image generated with PyMOL

Erik Lindahl

11

overlapping water molecules removed. The BPTI system will require roughly 3,400 water molecules, which increases the num-ber of atoms significantly. GROMACS does not use a special pre- equilibrated system for TIP3P water since water coordinates can be used with any model—the actual parameters are stored in the topology and force field. In GROMACS, a suitable command to solvate the new box would be

genbox –cp box.gro –cs spc216.gro \–p topol.top –o solvated.gro

The backslash means the entire command should be written on a single line. Solvent coordinates (-cs) are taken from an SPC water system [23], and the –p flag adds the new water to the topol-ogy file. The resulting system is illustrated in Fig. 6.

In principle you could use the system as is, but the net charge on the protein is unphysical in an infinite system, and many proteins interact with counterions. There is a GROMACS program to help us with this, but we first need an input file. GROMACS uses a separate preprocessing program grompp to collect parameters, topology, and coordinates into a single run input file (em.tpr) from which the simulation is then started (this makes it easier to move it to a another computer). Here we are not really going to run anything, so just create an empty file called ions.mdp and prepare an input file as:

3.5 Adding Ions

Fig. 6 BPTI solvated in water in a cubic box. Note that there is quite a lot of water, in particular in the box corners


12

grompp –f ions.mdp –p topol.top –c solvated.gro \–o ions.tpr

To neutralize the system and add 100 mM NaCl to the output file ions.gro, use the command

genion –s ions.tpr –neutral –conc 0.1 \–p topol.top –o ions.gro

The added hydrogens and broken hydrogen bond network in water would lead to quite large forces and structure distortion if molecular dynamics was started immediately. To remove these forces it is nec-essary to first run a short energy minimization. The aim is not to reach any local energy minimum, so 500 steps of steepest descent (as mentioned in the theory section) works very well as a stable rather than maximally efficient minimization. Nonbonded interactions and other settings are specified in a parameter file (em.mdp); it is only necessary to specify parameters where we deviate from the default value (this is why we could use an empty file above), for example:

------em.mdp------integrator = steepnsteps = 500nstlist = 10rlist = 1.0coulombtype = pmercoulomb = 1.0vdw-type = cut-offrvdw = 1.0nstenergy = 10------------------

See Note 11 contains a more detailed description of these set-tings. Then prepare the input file and run the energy minimization:

grompp –f em.mdp –p topol.top –c ions.gro \–o em.tpr<lots of output>mdrun –v –deffnm em

The –deffnm is a smart shortcut that uses “em” as the base filename for all options, but with different extensions. The minimi-zation will complete in a few seconds (see Note 12).

To avoid unnecessary distortion of the protein when the molecular dynamics simulation is started, we first perform a 100 ps equilibra-tion run where all heavy protein atoms are restrained to their start-ing positions (using the file posre.itp generated earlier) while the water is relaxing around the structure. As covered in the theory section, bonds will be constrained to enable 2 fs time steps. Other settings are identical to energy minimization, but for molecular

3.6 Energy Minimization

3.7 Position Restrained Equilibration

Erik Lindahl

13

dynamics we also control the temperature with the Bussi thermostat [24] (see Note 13). The settings used are (see Note 14):

------pr.mdp------integrator = mdnsteps = 50000dt = 0.002nstenergy = 1000nstlist = 10rlist = 1.0coulombtype = pmercoulomb = 1.0vdw-type = cut-offrvdw = 1.0tcoupl = v-rescaletc-grps = protein water_and_ionstau-t = 0.5 0.5ref-t = 300 300pcoupl = parrinello-rahmanpcoupltype = isotropictau-p = 2.0compressibility = 4.5e-5ref-p = 1.0cutoff-scheme = Verletdefine = -DPOSRESrefcoord_scaling = comconstraints = all-bonds------------------

For a small protein like BPTI it should be more than enough with 100 ps (50,000 steps) for the water to equilibrate around it, but in a large membrane system the slow lipid motions can require several nanoseconds of relaxation. The only way to know for certain is to watch the potential energy, and extend the equilibration until it has converged. Running this equilibration in GROMACS you execute

grompp –f pr.mdp –p topol.top –c em.gro \–o pr.tprmdrun –v –deffnm pr

This simulation will finish in a few minutes on a GPU-equipped workstation.

The difference between equilibration and production run is mini-mal: the position restraints and pressure coupling are turned off (see Note 15), we decide how often to write output coordinates to analyze (say, every 5,000 steps), and start a significantly longer simulation. How long depends on what you are studying, and that should be decided before starting any simulations. For decent sam-pling the simulation should be at least ten times longer than the phenomena you are studying, which unfortunately sometimes

3.8 Production Runs


14

conflicts with reality and available computer resources. We will perform a 10 ns simulation (5 million steps), which takes about an hour on a GPU workstation. If you are not that patient (or have a slow machine) you can choose a shorter simulation just to get an idea of the concepts, and the analysis programs in the next section can read the simulation output trajectory as it is being produced.

------run.mdp------integrator = mdnsteps = 5000000dt = 0.002nstlist = 10rlist = 1.0coulombtype = pmercoulomb = 1.0vdw-type = cut-offrvdw = 1.0tcoupl = v-rescaletc-grps = protein water_and_ionstau-t = 0.5 0.5ref-t = 300 300cutoff-scheme = Verletnstxtcout = 5000nstenergy = 5000------------------

Prepare and perform the production run as (the extra option to mdrun avoids spending too much time on writing out the current step to frequently):

grompp –f run.mdp –p topol.top –c pr.gro –o run.tpr<output>mdrun –v –deffnm run –stepout 10000

One of the most important fundamental properties to analyze is whether the protein is stable and close to the experimental struc-ture. The standard way to measure this is the root-mean-square displacement (RMSD) of all heavy atoms with respect to the X-ray structure. GROMACS has a program to do this, as

g_rms –s em.tpr –f run.xtc

Note that the reference structure here is taken from the input before energy minimization. The program will prompt both for a fit group, and the group to calculate RMSD for—choose “Protein-H” (protein except hydrogens) for both. The output will be written to rmsd.xvg, and if you installed the Grace pro-gram you will directly get a finished graph with

xmgrace rmsd.xvg

The RMSD is also illustrated in Fig. 7. It increases pretty rap-idly in the first part of the simulation, but stabilizes around 0.14 nm,

3.9 Trajectory Analysis

3.9.1 Deviation from X-Ray Structure

Erik Lindahl

15

roughly the resolution of the X-ray structure. The difference is partly caused by limitations in the force field, but also because atoms in the simulation are moving and vibrating around an equilibrium structure. A better measure can be obtained by first creating a run-ning average structure (see Note 16) from the simulation and com-paring the running average to the X-ray structure, which gives a more realistic RMSD around 0.12 nm (see Note 17).

Vibrations around the equilibrium are not random, but depend on local structure flexibility. The root-mean-square-fluctuation (RMSF) of each residue is straightforward to calculate over the trajectory, but more important they can be converted to tempera-ture factors that are also present for each atom in a PDB file. Once again there is a program that will do the entire job:

g_rmsf –s run.tpr –f run.xtc –o rmsf.xvg \–oq bfac.pdb

You can use the group “C-alpha” to get one value per residue. Figure 8 displays both the residue RMSF from the simulation (xmgrace rmsf.xvg), as well as the calculated and experimental temperature factors. The overall agreement is reasonable for a pro-tein this small and a short simulation. Longer simulations of larger proteins can fit almost perfectly.

Another measure of stability is the protein secondary structure. This can be calculated for each frame with a program such as DSSP [25]. If the DSSP program is installed and the environment variable DSSP points to the binary (see Note 18), the GROMACS program do_dssp can create time-resolved secondary structure plots. Since the program writes output in a special xpm (X pixmap) format you probably also need the GROMACS program xpm2ps to convert it to postscript:

do_dssp –s run.tpr –f run.xtc –dt 50xpm2ps –f ss.xpm –o ss.eps

Use the group “protein” for the calculation. Figure 9 shows the resulting output in grayscale, with some unused formatting

3.9.2 Comparing Fluctuations with Temperature Factors

3.9.3 Secondary Structure

0 2 4 6 8 10Time (ns)

0

0.05

0.1

0.15

0.2

RM

SD

(nm

)

Fig. 7 Instantaneous Root-mean-square displacement (RMSD) of all heavy atoms in Lysozyme during the simulation (solid), relative to the crystal structure. To a large extent atoms are vibrating around an equilibrium, so the RMSD of a 1-ns running average structure (dashed gray) is a better measure


16

removed. The DSSP secondary structure definition is pretty tight, so it is quite normal for residues to fluctuate around the well- defined state, in particular at the ends of helices or sheets. For a (long) protein folding simulation, a DSSP plot would show how the secondary structures form during the simulation.

There are two more very basic properties that are useful to analyze: The size of the protein defined by the “radius of gyration” and the number of hydrogen bonds. To calculate the radius of gyration, use the command:

g_gyrate –s run.tpr –f run.xtc

3.9.4 Radius of Gyration and Hydrogen Bonds

0.03

0.04

0.05

0.06

0.07

0.08

0 10 20 30 40 50 600

5

10

15

20

25

30

Residue

B-factor

RMSF

(nm)

Fig. 8 Top: Root-mean-square fluctuations of residue coordinates in the simulation. Bottom: The fluctuations can be converted to X-ray temperature factors (solid), which agree quite well with the experimental B-factors from the PDB file (dashed)

0 2000 4000 6000 8000 10000

10

20

30

40

50

Res

idue

Time (ps)

Coil B-Sheet B-Bridge Bend Turn A-Helix 3-Helix

Fig. 9 Local secondary structure in BPTI as a function of time during the simulation, according to the DSSP definition. Note how some elements periodically lose a bit of structure, but it rapidly reforms and the overall structure is quite stable over 10 ns

Erik Lindahl

17

The result will be written to the file gyrate.xvg, which includes both the overall radius and the radii around the three axes. Similarly, you get the hydrogen bonds with

g_hbond –s run.tpr –f run.xtc–num hbnum.xvg

Select the group “protein” twice to get all hydrogen bonds between the protein and the protein itself. Figure 10 shows the results for both these analyses (see Note 19).

A normal movie uses roughly 30 frames/second, so a 10-s movie requires 300 simulation trajectory frames. To make a smooth movie the frames should not be more than 1–2 ps apart, or it will just appear to shake nervously (see Note 20). In many cases it makes sense to rerun a shorter trajectory just for the movie, but here we just export a short trajectory from the first 500 ps in PDB format (readable by PyMOL) as

trjconv –s run.tpr –f run.xtc \–e 2500.0 –o movie.pdb

Choose the protein group for output rather than the entire sys-tem (see Note 21). If you open this trajectory in PyMOL as “PyMOL movie.pdb” you can immediately play it using the VCR- style con-trols on the bottom right, adjust visual settings in the menus, and even use photorealistic ray-tracing for all images in the movie. With MacPyMOL you can directly save the movie as a quicktime file, and on Linux you can save it as a sequence of PNG images for assembly in another program. Rendering a movie only takes a few minutes, and the final product bpti.mov is included with the reference files.

3.9.5 Making a Movie

1.05

1.1

1.15

1.2

Rgy

r (n

m)

0 2000 4000 6000 8000 10000Time (ps)

20

25

30

35

# H

-bon

ds

Fig. 10 Top: Radius of gyration of BPTI during 10 ns simulation. This is a good measure of how compact a structure is. Bottom: Number of hydrogen bonds inside the protein


18

4 Speeding Things up to Solve a Real Problem

Once you have gotten your feet wet you will likely want to approach more realistic problems. One important such problem is folding of small proteins, for instance the Villin headpiece where some mutants fold within a microsecond [30]. This mutant contains a special residue (norleucine), so you will need to copy the entire amber force field directory from the installation directory of Gromacs to your current working directory, place the file norleucine.rtp in the amber subdirectory, and also put the file residuetypes.dat in your current directory (so GROMACS recognizes the new residue as a protein residue).

The first challenge you are likely to hit is that you need your simulations to run faster to be able to reach relevant time scales. Here we will briefly go through a couple of recommendations that will help you achieve this.

You have probably seen that the problem is the number of steps we must take. One way to improve performance is to make each time step longer, but this is limited by the vibrations in the angles involving hydrogens (try a longer timestep in your files above and see what happens). GROMACS has a feature to remove these vibra-tions by replacing individual hydrogens with virtual sites. This retains the rotation of for example CH3-groups, but removed the fast vibrations. To enable this, use the –vsite option when you run pdb2gmx (or skip it if you want to stick to 2 fs steps):

pdb2gmx –f protein.pdb –water tip3p \–vsite hydrogen

This will instantly enable us to take time steps up to 5 fs, which will improve your performance by 150 %. Second, if you look at the box you used for BPTI you will likely see that it would better match the shape of the protein as a more spherical shape. Unfortunately spheres are not periodical, but we can ask GROMACS to use a rhom-bic dodecahedron box instead, which is at least more spherical than a cube and has only 71 % of the cube volume. That reduces the number of water molecules required to solvate the protein. This is difficult to visualize in three dimensions, but Fig. 11 illustrates in two dimen-sions how a hexagonal cell is more efficient than a square (very useful for membrane simulations). The hexagonal box achieves the same distance between periodic copies as a rectangular box at 86 % of the volume (see Note 22). Since we really want to push performance, we also accept a very small margin to the box side (-d option):

editconf –f conf.gro –bt dodecahedron \–d 0.3 –o box.gro

Finally, although PME does provide state-of-the-art electro-statics, the Villin headpiece is quite well-behaved and there are no large mobile charges in this system. To get a bit of extra perfor-mance, this is a case where we can decide to forego PME and use

Erik Lindahl

19

reaction-field electrostatics instead. This difference is easy—just write “reaction-field” instead of “PME” for electrostatics in your mdp files. We also use very short cutoffs (8 Å). The exact files used, including an input structure from the paper [30], are included in a separate directory.

With these settings you should be able to work energy minimi-zation and equilibration of Villin exactly the same way as we did for BPTI, but it will be even faster—on a single Core i7 desktop with a GPU we get almost a microsecond a day. Study what happens with the protein by using the analysis tools you learned above—you should see that it starts to compact and form more hydrogen bonds, but that it takes quite a while for secondary structure to form. To see how far away you are from the native state you can prepare a second TPR file using the native state as reference. However, after a bit of fluctuation, and possible 2–3 days of simu-lation, you should also be able to reach the native state of Villin.

5 Conclusions

This chapter should hopefully provide a basic introduction to gen-eral simulations. An important lesson is that high-quality simula-tions require a lot of care from the user—just as with experimental techniques the entire result can be ruined by a single sloppy step. Further, recent techniques based on distributed computing and markovian state models have been able to probe dynamics in the millisecond range without extending individual simulations to those scales [31]—this will be covered in much more detail in sub-sequent chapters presenting metadynamics (Chapter 8) and accelerated MD (Chapter 12). While simulations are advancing rapidly due to the continuous development of faster computers, the field has also been plagued by (published) simulations that

Fig. 11 Two-dimensional example of how a hexagonal box leads to lower volume than a square one, with the same separation distance. In three dimensions, the shape most similar to a sphere is the rhombic dodecahedron


http://dx.doi.org/10.1007/978-1-4939-1465-4_8

http://dx.doi.org/10.1007/978-1-4939-1465-4_12

20

have not advanced our knowledge either of simulation methods or biomolecules. Instead of just starting a simulation and hoping for something to happen, you should decide beforehand what you want to study, estimate the timescales necessary or see if it can be accomplished with more advanced methods (e.g., free energy cal-culations), and not start simulations until you are fairly confident both about sampling, analysis required, and the force field accu-racy. Used with caution, molecular dynamics is an amazingly pow-erful tool, and a great complement to experiments.

6 Notes

1. Most simulations rely on systems being ergodic, that is, the time average of the properties of a single molecule on a long simulation should be the same as the instantaneous ensemble average over all molecules in an experimental measurement. This is often (but not always) true, although it assumes our single simulation is sufficiently long, which can be very ineffi-cient to achieve.

2. The standard harmonic bond potentials in molecular simula-tions will never allow atoms to separate. However, the alterna-tive Morse potential is supported in many programs (including GROMACS) and will allow atoms to separate. Still, this is not used very frequently—if your problem involves breaking and forming bonds it is likely a better solution to use a QM/MM simulation.

3. The classical representations can be corrected in a number of ways to make sure that they are faithful representations of the real system. This is discussed in great detail in the first chapter of the GROMACS manual, to which we refer the interested reader. However, the really important thing in modeling is to understand your system and decide in each case what approxi-mations are reasonable. It is easy to add more detail (e.g., by using quantum chemistry), but that automatically means you lose in the other end by not getting as much sampling. The challenge is to strike the right balance for each problem!

4. In general, most computational chemistry programs behave best with the Linux operating system, although it is possible to run GROMACS on Windows. When starting out, you want a stan-dard AMD or Intel desktop. Currently (2013), you will get the best price–performance ratio by investing in a single- socket machine with fastest consumer processor you can buy, for instance Intel Core i7 4770. You can get this for well under $1000. GROMACS and some other codes support GPU accel-eration for NVIDIA cards, so to improve performance signifi-cantly it is a good idea to add a high-end graphics card such as

Erik Lindahl

andreas kukol editor molecular modeling of...

Documents