principal component analysis of gramicidin
PRINCIPAL COMPONENT ANALYSIS OF GRAMICIDIN: A multivariate statistical analysis of collective modes in a model protein
by
Martin Kurylowicz
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Molecular Structure and Function, Hospital for Sick Children and Graduate Department of Biochemistry in the
University of Toronto
© Copyright by Martin Kurylowicz (2010)
Principal Component Analysis of Gramicidin: A multivariate statistical analysis of collective modes in a model protein
Martin Kurylowicz
Dissertation for the degree of Doctor of Philosophy (PhD) Department of Molecular Structure and Function, Hospital for Sick Children, Toronto
and Graduate Department of Biochemistry University of Toronto, 2010
Abstract
Computational research making use of molecular dynamics (MD) simulations has begun to expand
the paradigm of structural biology to include dynamics as the mediator between structure and
function. This work aims to expand the utility of MD simulations by developing Principal
Component Analysis (PCA) techniques to extract the biologically relevant information in these
increasingly complex data sets. Gramicidin is a simple protein with a very clear functional role and a
long history of experimental, theoretical and computational study, making it an ideal candidate for
detailed quantitative study and the development of new analysis techniques. First we quantify the
convergence of our PCA results to underwrite the scope and validity of three 64 ns simulations of gA
and two covalently linked analogs (SS and RR) solvated in a glycerol mono-oleate (GMO)
membrane. Next we introduce a number of statistical measures for identifying regions of
anharmonicity on the free energy landscape and highlight the utility of PCA in identifying functional
modes of motion at both long and short wavelengths. We then introduce a simple ansatz for
extracting physically meaningful modes of collective dynamics from the results of PCA, through a
weighted superposition of eigenvectors. Applied to the gA, SS and RR backbone, this analysis results
in a small number of collective modes which relate structural differences among the three analogs to
dynamic properties with functional interpretations. Finally, we apply elements of our analysis to the
GMO membrane, yielding two simple modes of motion from a large number of noisy and complex
eigenvectors. Our results demonstrate that PCA can be used to isolate covariant motions on a number
of different length and time scales, and highlight the need for an adequate structural and dynamical
account of many more PCs than have been conventionally examined in the analysis of protein motion.
Acknowledgements
I am grateful to my supervisor Régis Pomès, for giving me the opportunity to pursue
this research, as well as my committee members Boris Steipe and Ray Kapral for
scientifically fruitful discussions over the years. My thanks also extend to all members of the
Pomès lab, especially my immediate neighbours Nilu Chakrabarti, Rowan Henry and Grace
Li for their daily conversation and motivation, Chris Neale for his expertise and
insightfulness, and Chris Madill for his frequent aid and longsuffering comradery. My work
builds on simulations created by Ching-Hsing Yu, and his expertise was invaluable both as a
post-doc in the lab and during his tenure at the Centre for Computational Biology. All of
their fellowship over the last six years has been invaluable to me.
I am ever appreciative of my scientific mentors at the University of British Columbia,
who continue to lend me their ear, pen and encouragement. Walter Hardy, Doug Bonn and
Myer Bloom taught me what I know about experimentation and physics, while Lee Gass and
Mark Maclean taught me much about the art of science as well as living. I am in their debt
not only for instilling a love of science and its practice, but also the motivation to stick with it
in hard times, if only to pay back their many investments in me.
The companionship of my friends has sustained me from close and far. Thanks to my
west-coast family: Josie Hughes, Mike Melnychuk, Tom Bird, Sarah Henderson, Roger
Donaldson, Janet Tecklenborg; I’m glad we’re all still travelling along parallel paths. My
Toronto brothers have also made life grand for me in the big city: Dan Fraleigh, Christopher
Oates and Davy Boon, I’ll miss living with and next door to you.
The support of my parents Zosia and Stan, as well as my brother Mike, has been
invaluable over my many years of schooling. I hope to make them proud at convocation,
becoming the first doctor in our family. Thanks for bringing me to Canada, Mom and Dad.
And finally, my most heartfelt love and gratitude to Nancy my wife, for loving me so well,
for keeping our heads above water, and for giving birth to our daughter Ivy Lumina, who has
brought new meaning and light to our lives together.
Table of Contents
ABSTRACT
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Appendices
List of Abbreviations
PREFACE
CHAPTER 1: Introduction
1.1: Biophysical Background
1.2: Gramicidin
1.2.1: Biological Characterization
1.2.2: Structure of Gramicidin and its Dioxolane-Linked Analogs
1.2.3: Studies of Gramicidin Dynamics
1.3: Summary and Overview
CHAPTER 2: Theory and Methods
2.1: Molecular Dynamics
2.1.1: Background
2.1.2: Simulation of gA/SS/RR in GMO membrane
2.2: Principal Component Analysis (PCA)
2.2.1: Background
2.2.2: PCA and Protein Dynamics
2.2.3: PCA vs. NMA of Proteins
2.2.4: PCA and its Development in Climatology
CHAPTER 3: Convergence of PCA
3.1: Background
3.2: Convergence of Structure: Overlap of Covariance Matrices
3.2.1: Backbone of gA, SS and RR: Converged Eigenvectors
3.2.2: Side Chains and GMO: Unconverged Eigenvectors
3.3: Convergence of Dynamics: Average Distributions and Deviations from Gaussian
3.3.1: Backbone and Side Chains of gA
3.3.2: Backbone of SS and RR
3.4: Summary and Conclusions
CHAPTER 4: Anharmonic Features of Collective Motion
4.1: Scaling of PCA Eigenvalues
4.2: Non-Gaussian PC Distributions
4.3: MSD and Anomalous Diffusion
4.4: Collective Oscillations in the Small Covariance Regime
4.5: Discussion
4.6: Conclusion
CHAPTER 5: Emergent Modes at Large Covariance
5.1: Introduction
5.2: Band Gaps in the Eigenvalue Spectra
5.3: Spatial Structure of PC Eigenvectors
5.4: The Principal Components of gA, SS and RR
5.5: Coherent Modes from Weighted Sums of PCs
5.6: Covariance of PC Trajectories
5.6: Discussion and Conclusions
CHAPTER 6: PCA of GMO Lipids Solvating Gramicidin
6.1: Background
6.2: Methods
6.3: Results and Discussion
CHAPTER 7: General Conclusions and Future Directions
References
Appendix 1: Normal Mode Analysis
Appendix 2: Side Chain Conformations of gA
List of Figures
Figure 1.1: Gramicidin in a hydrated GMO membrane
Figure 1.2: Structure of dioxolane linker and its orientation in RR and SS analogs
Figure 3.1: Convergence of gA backbone dynamics in 10 ns and 64 ns simulations
Figure 3.2: Comparison of convergence for gA, SS and RR backbone dynamics
Figure 3.3: Unconverged dynamics of side chains and GMO lipids
Figure 3.4: Average difference from Gaussian distributions for gA
Figure 3.5: Average difference from Gaussian distributions for SS and RR
Figure 4.1: PCA eigenvalue spectra for gA for various atomic subsets
Figure 4.2: Non-Gaussian distributions of long and short PCs vs. timescale for gA
Figure 4.3: Non-Gaussian distributions of 5 long and 5 short PCs at 1 ns for gA
Figure 4.4: Surfaces of normalized difference from Gaussian
Figure 4.5: MSD of various PCs for gA backbone and side chains
Figure 4.6: MSD slope of various PCs for gA backbone and side chains
Figure 4.7: Spectral power of MSD oscillations
Figure 4.8: Illustration of short oscillating backbone PCs
Figure 5.1: Detail of gA backbone eigenvalue spectrum
Figure 5.2: Illustration of eigenvector directional coordinates
Figure 5.3: PC 1-3 of gA, SS and RR backbone
Figure 5.4: PC 4-9 of gA backbone
Figure 5.5: Projection of directional coordinate for PC 1-8: gA, SS and RR
Figure 5.6: Coherent modes of the gA backbone
Figure 5.7: Projection of directional coordinate for coherent modes of gA
Figure 5.8: Comparison of modes A and B for gA, SS and RR
Figure 5.9: Covariance matrix and its absolute value for gA
Figure 5.10: Covariance matrices for gA/SS/RR
Figure 6.1: Radial distribution function of GMO surrounding gA
Figure 6.2: Planar distribution functions of GMO surrounding gA
Figure 6.3: Comparison of average structure of annular lipids from 2 to 64 ns
Figure 6.4: Eigenvalue spectra for annular GMO lipids
Figure 6.5: PC 1-3 for annular GMO lipids
Figure 6.6: Emergent modes for PC 1-3 and PC 4-12
Figure 6.7: Normalized distributions of PC trajectories
Figure A1: NMA eigenvalues for atomic subsets of gA
Figure A2: Distributions of PC1 vs. PC2 for NCαC atoms in gA
Figure A3: Side chain conformations of gA in GMO over 64 ns
List of Tables
Table 1: PCA studies of protein dynamics
List of Appendices
Appendix 1: Normal Mode Analysis
Appendix 2: Conformational Basins of gA Side Chains
List of Abbreviations
Å: Angstroms
Ala: Alanine
ATP: Adenosine triphosphate
BPTI: Bovine pancreatic trypsin inhibitor
CFA: Common Factor Analysis
CHARMM: Chemistry at Harvard Molecular Mechanics (an MD package)
CPT: Constant Pressure and Temperature algorithm for CHARMM dynamics
DMPC: 1,2-dimyristoylglycero-3-phosphocholine
EOF: Empirical Orthogonal Functions
gA: Gramicidin A
GMO: glycerol monooleate (hydrophilic headgroup with single chain mono-unsaturated lipid)
Lys: Lysine
MD: Molecular Dynamics
MSD: Mean Squared Deviation
NCaC: Nitrogen, alpha-carbon, carbonyl-carbon atoms
NHCaCO: Nitrogen, amide hydrogen, alpha-carbon, carbonyl-group (carbon and oxygen) atoms
NMR: Nuclear Magnetic Resonance
PC: Principal Component
PCA: Principal Component Analysis
POD: Proper Orthogonal Decomposition
RMS: Root Mean Square
RMSΔθ: Root Mean Square change in angular coordinate
RMSD: Root Mean Square Deviation
RMSF: Root Mean Square Fluctuation
RMSIP: Root Mean Square Inner Product
RR: dioxolane-linked analog of gramicidin A with ring perpendicular to helical pitch
SS: dioxolane-linked analog of gramicidin A with ring parallel to helical pitch
TIP3P: Three-point transferable intermolecular potential for water
Trp: Tryptophan
SVD: Singular Value Decomposition
Preface
This work aims to expand the utility of molecular dynamics (MD) simulations
through the use of a multivariate statistical technique called Principal Component Analysis
(PCA). While MD simulations continue to become more powerful, creating longer
trajectories of increasingly large and complex systems, there is a need to develop and refine
mathematical and computational techniques to extract the biologically relevant information
in these increasingly elaborate data sets.
There are four main sections of results, each expanding the use of Principal
Component Analysis (PCA) beyond the traditional applications currently found in the
biomolecular literature. The first concerns the quantification of convergence in Chapter 3,
which is relevant not only to PCA but to the sampling of conformational state space of
complex dynamics in general. The second introduces quantitative statistical measures for
identifying regions of anharmonicity on the free energy landscape (Chapter 4), and highlights
the utility of PCA in identifying functional modes of motion at the equivalent of short
wavelengths, whereas PCA has traditionally been focused almost exclusively on long
wavelength modes. Chapter 5 introduces a simple ansatz for extracting simplified and
physically interpretable modes of collective dynamics from the results of PCA, through a
weighted superposition of eigenvectors. Finally, in Chapter 6 PCA is applied to the
membrane lipids surrounding gramicidin. This is a test case for the utility of PCA on diffuse
collections of monomers which behave as a continuous medium, whose eigenvectors are very
noisy and difficult to interpret.
The structural biologist’s insights linking molecular structure to function in complex
biochemical systems have contributed significantly to the tremendous success of molecular
biology. The advent of molecular dynamics has begun to expand this paradigm to include
dynamics as the mediator between structure and function. The development of multivariate
methods like PCA promises to enrich the analysis of MD data and contribute quantitative
insights into the relationships between structure, dynamics and function.
Chapter 1: Introduction
1.1: Biophysical Background
Computer simulation has become an essential research tool for understanding how the
dynamics of proteins link their structure to their function (1-5). Molecular dynamics (MD) in
particular can be helpful in obtaining information that is experimentally inaccessible with
current technologies. This is especially true in the single molecule regime, where it is
currently impossible to measure the internal motions of proteins with atomic resolution and
at timescales fast enough to resolve conformational transitions. On the other hand, despite
spanning ~9 orders of magnitude, from femtoseconds to microseconds, MD simulations
are still not capable of reaching long enough timescales to model many biologically relevant
processes; even the fastest protein folding event takes microseconds, and simulations on the
millisecond timescale are necessary to model the kinetics of this process. Hopefully there
will be a time in the future when computational and experimental technology will overlap in
the middle ground, when experimental techniques are able to probe small and fast enough,
and computational simulations are large and long enough, to study the same phenomena and
complement each other directly. Until then, many biophysical processes can only be studied
by simulation, and contact with experimental data remains a significant challenge for
computational biochemistry.
By combining energetics and dynamics, MD simulations are capable of calculating
the free energy of a complex system with many degrees of freedom. At temperature T, the
change in free energy ΔG has two components, the change in enthalpy ΔH and entropy ΔS:
ΔG = ΔH − TΔS. The enthalpy is defined as H = U + PV, so a change in enthalpy is the sum of
the change in internal energy U and the mechanical terms arising from changes in pressure
(VΔP) or volume (PΔV). In thermal
equilibrium where no work is done on the system, the internal energy in the microcanonical
ensemble is the sum of all pair-wise potential energy terms between the atoms of a molecule;
this is the quantity which is calculated at each time step of an MD simulation. The values of
molecular parameters in these pair-wise interactions are derived by calibration against both
measured and computed (with high-level quantum calculations) values of well-established
molecular and bulk properties, such as the atomic charge distribution, the orientational
relaxation rate, the dielectric constant, etc. While the enthalpy can be calculated
instantaneously, the entropy is a function of dynamics since the time-evolution of a system
generates the ensemble of states which are actually explored on the potential energy surface.
Hence the entropic component of the free energy is included in an MD simulation by
integrating Newton’s equations of motion over many time steps.
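As a minimal sketch of what integrating Newton’s equations of motion over many time steps involves, the Python fragment below sums a pair-wise potential over all atom pairs and advances the positions with a velocity Verlet integrator. The Lennard-Jones form, the reduced units, the three-atom starting geometry and the function names are illustrative assumptions only; they are not the CHARMM force field or the integrator settings used for the simulations in this work.

```python
import numpy as np

def pair_energy_and_forces(pos, epsilon=1.0, sigma=1.0):
    """Sum an (assumed) Lennard-Jones pair potential over all atom pairs.

    Returns the total potential energy and the force on each atom.
    """
    n = len(pos)
    energy = 0.0
    forces = np.zeros_like(pos)
    for i in range(n - 1):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = np.dot(rij, rij)
            s6 = (sigma * sigma / r2) ** 3
            energy += 4.0 * epsilon * (s6 * s6 - s6)
            # Force on atom i due to atom j: -dU/dr directed along rij
            fij = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2 * rij
            forces[i] += fij
            forces[j] -= fij
    return energy, forces

def velocity_verlet(pos, vel, mass, dt, n_steps):
    """Integrate Newton's equations; return the trajectory and total energies."""
    pos = pos.copy()
    vel = vel.copy()
    pot, forces = pair_energy_and_forces(pos)
    traj, total = [], []
    for _ in range(n_steps):
        vel += 0.5 * dt * forces / mass          # half-step velocity update
        pos += dt * vel                          # full-step position update
        pot, forces = pair_energy_and_forces(pos)
        vel += 0.5 * dt * forces / mass          # second half-step
        kin = 0.5 * mass * np.sum(vel ** 2)
        traj.append(pos.copy())
        total.append(pot + kin)
    return np.array(traj), np.array(total)

# Hypothetical three-atom cluster, slightly stretched from the pair minimum r0
r0 = 2.0 ** (1.0 / 6.0)
start = np.array([[0.0, 0.0, 0.0],
                  [1.05 * r0, 0.0, 0.0],
                  [0.5 * r0, 0.9 * r0, 0.0]])
traj, total = velocity_verlet(start, np.zeros_like(start),
                              mass=1.0, dt=0.002, n_steps=2000)
drift = np.max(np.abs(total - total[0]))  # total energy should stay nearly constant
```

The stored frames in `traj` are the ensemble of visited states referred to above; the near-constant total energy is a basic check that the time step is small enough for the chosen potential.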
In general if the enthalpy is large, a complex molecular structure is very stable and
hence the entropy is small. On the other hand if the enthalpy is small then more
conformations become accessible and the entropy is large. Together these two terms create a
balancing act which determines whether any biochemical event will proceed spontaneously,
or how large an activation barrier must be overcome, and determines the ratio of substrate to
product in a reaction. One of the defining characteristics of biomolecular dynamics is that
both enthalpic and entropic contributions are very large, since these molecules have many
stabilizing interactions but also many degrees of freedom to explore. This cancelation of
large positive and negative terms gives rise to a free energy landscape which is intrinsically
“rough”, with many minima and maxima. Such a fine balance also means that calculations
of internal energy and of dynamics must be very accurate to yield meaningful free-energy
results.
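The sensitivity described in this paragraph can be made concrete with a short worked example. The relation between free energy and the equilibrium ratio is standard thermodynamics; the numerical values of ΔH and TΔS below are hypothetical, chosen only to illustrate the cancellation of two large terms at an assumed temperature of 298 K (RT ≈ 2.48 kJ/mol).

```latex
\begin{align}
\Delta G &= \Delta H - T\,\Delta S, \qquad
K = \frac{[\text{product}]}{[\text{substrate}]} = \exp\!\left(-\frac{\Delta G}{RT}\right) \\
\Delta H &= -500\ \mathrm{kJ\,mol^{-1}}, \quad
T\,\Delta S = -480\ \mathrm{kJ\,mol^{-1}}
\;\Rightarrow\; \Delta G = -20\ \mathrm{kJ\,mol^{-1}}, \quad K \approx 3 \times 10^{3} \\
\Delta H &= -495\ \mathrm{kJ\,mol^{-1}} \ (\text{a } 1\%\ \text{error})
\;\Rightarrow\; \Delta G = -15\ \mathrm{kJ\,mol^{-1}}, \quad K \approx 4 \times 10^{2}
\end{align}
```

In this illustration a 1% error in the enthalpy changes the predicted equilibrium ratio by a factor of roughly 7.5, which is why both the energetics and the sampling must be accurate.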
This rough free-energy landscape model has become a paradigm for understanding
protein dynamics, and especially protein folding (6). It is generally accepted that proteins
exist on a complex free-energy landscape that is “rugged” in the sense of having multiple
nested minima corresponding to stable conformations, while a global funneling or ravine-like
structure of the landscape guides folding around kinetic barriers toward the native structure.
Simulations usually provide insight by describing the conformational ensemble
corresponding to the free-energy minima which are accessible to a complex biomolecule
under physiological conditions. Simulations are also very useful for studying the pathways
between these minima, which elucidates the kinetic barriers, intermediate structures and
transition states along a complex reaction pathway. This is why dynamics are important in
addition to structure, since they fill in the connections among an ensemble of conformations
that make proteins into machines capable of function, rather than static objects.
In recent years MD has contributed significantly to our understanding of biochemical
mechanism in enzyme catalysis as well as protein folding (2). For example, a coarse-grained
MD approach was able to compute hundreds of folding trajectories for a simple three-helix
bundle protein to understand the role of native and non-native contacts along the folding
pathway (7), and these results were consistent with shorter all-atom simulations (8). These
studies were able to determine the relative contributions of secondary structure formation and
hydrophobic collapse in the folding pathway of a simple protein, as well as the sequence in
which these events occur. Such a study elucidates the relative importance and structure of
on-pathway intermediates. Intermediate structures are of special interest in enzyme catalysis,
since it has long been recognized that enzymes function by binding the transition state in a
reaction (9), thereby lowering the activation energy and increasing reaction rates by factors
of up to 10^19 (10). Both the structure and dynamics of an enzyme are important in this
regard, as the structure provides a pre-organized environment which stabilizes the transition
state (11), and dynamic fluctuations are often important in allowing for the substrate to enter
and the product to leave the reactive site (12). Dynamics also play a role through
conformational changes as well as vibrational modes, since these may also contribute directly
to lowering the activation energy for an enzymatic reaction (13). A good example of
structural and dynamic effects can be found in the enzyme triosephosphate isomerase (TIM),
which has a finely structured pocket of residues whose positions lower the activation energy
for the transfer of a proton from substrate to enzyme through electrostatic interactions (9, 13).
Proton transfer reactions are particularly sensitive to structural changes, and can be catalyzed
by deforming a C-H bond by as little as 0.5 Å, and an O-H bond by as little as 0.1 Å. Moreover, proton
transfer reactions are also sensitive to the presence of water, which may catalyze unwanted
side-reactions at the reaction site; the TIM enzyme has a dynamic conformational mechanism
for closing a “lid” over the active site during catalysis, making the reaction centre accessible
to substrate but not water (14).
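To put the rate enhancement quoted above on an energy scale, the following back-of-envelope estimate assumes simple transition-state theory, in which the rate constant scales as exp(−ΔG‡/RT) and the prefactor is taken to be the same for the catalyzed and uncatalyzed reactions. The temperature of 298 K is likewise an assumption made for illustration, not a value taken from the cited studies.

```latex
\frac{k_{\mathrm{cat}}}{k_{\mathrm{uncat}}}
  = \exp\!\left(\frac{\Delta\Delta G^{\ddagger}}{RT}\right)
\quad\Longrightarrow\quad
\Delta\Delta G^{\ddagger} = RT \ln\!\left(10^{19}\right)
  \approx \left(2.48\ \mathrm{kJ\,mol^{-1}}\right)\left(43.7\right)
  \approx 108\ \mathrm{kJ\,mol^{-1}}
  \approx 26\ \mathrm{kcal\,mol^{-1}}
```

Under these assumptions, the largest enhancements correspond to a reduction of the activation free energy by roughly 108 kJ/mol.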
Conformational changes are generally the best characterized examples of functional
motion in proteins. Many proteins bind their ligands through very specific conformational
changes around the binding site (as in myoglobin and hemoglobin), often coupled with other
conformational changes which exert allosteric control over the binding at other sites on the
protein (15). Global conformational changes may also exert mechanical forces in the
function of molecular motors, as in myosin (16, 17), or facilitate chemical catalysis in the
modification of chemical bonds, as in serine proteases (18). However, large-scale global
conformation changes are not the only interesting feature of protein dynamics; motions at
very different length scales are also important to the functioning of a protein. While changes
in tertiary and quaternary structure may span the size of an entire protein, individual residues
will also have important collective motions at much smaller spatial scales, and modification
of hydrogen bonds within the secondary structure will occur on even smaller scales yet. The
same is true for processes occurring at very different timescales, spanning at least 12 orders of
magnitude from femtoseconds (bond vibrations) to milliseconds (folding).
The coupling of large and small structural changes, as well as slow and fast
dynamical processes, is especially pertinent in the study of membrane proteins which form
ion channels. These proteins are responsible for regulating the permeation of material in and
out of cells or organelles, and transporting charges across membranes, an activity essential to
many fundamental biophysical processes from the transmission of electrical signals in
neurons to the generation of ATP. Ion channels have evolved sophisticated molecular
mechanisms to control the specificity with which they conduct various molecular species.
The intrinsically dynamic nature of transport processes makes MD simulations particularly
helpful in elucidating the mechanism of action of these channels. The transport process is
usually much faster than any conformational changes in the protein which modulates it.
Indeed, ion channels are an excellent target for MD studies precisely because the timescale of
ion diffusion is accessible to these simulations, and hence functional properties of the
channel can be probed at equilibrium without biasing dynamics to encourage rare events. At
the other extreme, the membrane in which these proteins function is governed by much
longer timescales, and must be described dynamically as well since no fixed structure exists
for this liquid-crystalline environment. Furthermore, the low dielectric constant of lipids
makes membrane-bound proteins more sensitive to electrostatic forces than water-soluble
proteins (4). The detailed atomistic study of ion channels presents a special opportunity for
understanding the structural and dynamic correlates of function.
The KcsA potassium channel illustrates many of these features, and is among the
most studied transmembrane channels after gramicidin. KcsA conducts K+ at rates near the
diffusion limit while discriminating against Na+ by more than a thousand-fold. The “knock-
on” mechanism was described long ago by Hodgkin and Keynes (19), where concerted multi-
ion transitions are mediated simultaneously by ion-channel attraction and ion-ion repulsion,
allowing several ions to move in single file through the narrow pore. This illustrates the fine
balance of interactions and dynamics which exist in ion channels. The selectivity of KcsA
was not clearly understood until its crystal structure was solved (20, 21), showing multiple
dehydrated K+ ions coordinated by main-chain carbonyl groups which line a very narrow
region of the pore corresponding to a highly conserved sequence of six amino acids common
to all K+ channels. Atomic fluctuations are essential to this selectivity mechanism, since
there are regions of the filter that are effectively narrower than suggested by the van der
Waals radii of K+ and the carbonyl oxygens in the channel (22). This is also intriguing given
the relatively small size difference between Na+ and K+ (0.38 Å); it would be expected that
only a very rigid pore could discriminate between these, but the pore has been shown to be
quite flexible with RMS fluctuations on the order of 1 Å (23). However, this small size
difference allows for an optimum coordination number of 8 for K+, and only 6 for Na+. Since
there are eight carbonyls in the selectivity filter of KcsA, this turns out to be the basis of K+
selectivity in KcsA (24). We will see below that the solvation of ions by backbone carbonyls
is also a significant feature of the gramicidin channel.
Finally, it has long been recognized that interactions between membrane proteins and
their lipid environment may be integral to function. There has been considerable interest in
this problem in structural biology (25), where understanding lipid interactions may be
essential to crystallization and structural characterization of membrane proteins. There are
many roles for lipid-protein interactions: specific lipid species may confer structural stability
to membrane proteins, control insertion and folding processes, or aid in the assembly or
oligomerisation of multi-subunit complexes (26).
MD simulations of integral membrane proteins have demonstrated a number of
effects which are thought to be relevant to membrane proteins in general. For example,
simulations demonstrated that the presence of the transmembrane region formed by the alpha
helical bundle of the nAChR glycoprotein increases the orientational order of the DMPC
lipid acyl chains relative to the pure lipid bilayer, an effect which is enhanced deeper in the
membrane interior (27). This study also showed a decrease in the number of gauche defects,
a broadening of the orientational distribution of lipid headgroup dipole moments, and an
increase in headgroup orientation toward the water phase. Simulation studies of OmpA have
demonstrated a strong differentiation between bound and free lipids, where the lateral
diffusion coefficients of lipids solvating the protein are about half that of free lipids (28).
The same study also showed that lipid-protein interactions are able to relax to a stable state
on the 20 ns timescale. The shell of relatively immobilized lipids interacting directly with a
protein has been called “annular” lipids (29), in that they form a ring-like structure around
the protein whose properties are distinct from the bulk lipids in the rest of the membrane.
Spin-labelling has been particularly successful in characterizing annular lipids (30),
demonstrating that their interaction with the protein is ‘non-sticky’ and that a particular lipid
molecule remains in the annular shell for approximately 100 ns in the case of diacyl
phospholipids. (These timescales are important to keep in mind as our simulations of
gramicidin are 64 ns long).
These effects are in general agreement with experimental and simulation studies of
gramicidin in a membrane environment. An increase in the ordering of acyl chains was
observed using ESR and ²H NMR for gA in DMPC lipid bilayers in the liquid crystalline
phase (31), although the opposite effect is observed in the gel phase (32). 2D-ELDOR
(electron double resonance) has been used to differentiate between bound and free lipid
behaviour (33), demonstrating that lipids bound in the first solvation shell are immobilized
compared to bulk lipids. A 0.5 ns simulation of gA in a DMPC bilayer has demonstrated
good agreement with the ²H NMR data and an increase in the ordering of the acyl chains was
observed (34). Another 1.2 ns simulation of the same system demonstrated that the effects of
the channel on the lipid bilayer were short range, affecting only those DMPC molecules
bound to the channel (35). However, a comparison of gA simulations in DiPhPC and GMO
bilayers shows that GMO molecules are significantly more ordered than the diacyl chains,
with three distinct solvation shells apparent in the radial distribution function (36).
All of the phenomena described above demonstrate the interplay between structure
and dynamics which is essential to the function of large and small biological molecules.
While MD simulations have often been successful in providing insight into these
relationships, the size of the resulting data set makes their interpretation difficult. Different
parts of a complex molecule (and its solvation environment) may play various functional
roles at different length and time scales, and it is difficult to identify these motions in the
large amount of data resulting from MD trajectories. This is a general problem facing much
of structural biology and computational science: our ability to generate experimental or
simulated data has begun to outpace our ability to analyze it for biologically meaningful
information and insight. One of the outstanding questions posed by the study of molecular
dynamics is how to quantify the structure of motion: we must account not only for the
three-dimensional pattern of atomic positions, but also for that of their displacements. Which atoms move
together, how far do they move, and most importantly, in which directions? These are the
questions which motivate this study to undertake Principal Component Analysis as a means
of characterizing the 3D structure of collective displacements.
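One conventional way to pose these questions numerically is sketched below: the covariance matrix of atomic displacements is diagonalized, so that each eigenvector gives a pattern of directions in which atoms move together and each eigenvalue gives the variance of that collective displacement. This is generic textbook PCA, assuming the frames have already been superimposed on a reference structure. The synthetic four-atom trajectory and all function and variable names are hypothetical and do not reproduce the gramicidin analysis.

```python
import numpy as np

def pca_of_trajectory(traj):
    """PCA of a trajectory with shape (n_frames, n_atoms, 3).

    Assumes overall rotation and translation have already been removed.
    """
    n_frames = traj.shape[0]
    x = traj.reshape(n_frames, -1)        # flatten to (n_frames, 3N)
    dx = x - x.mean(axis=0)               # displacements from the average structure
    cov = dx.T @ dx / (n_frames - 1)      # 3N x 3N covariance matrix
    evals, evecs = np.linalg.eigh(cov)    # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]       # re-sort so the largest variance comes first
    evals, evecs = evals[order], evecs[:, order]
    pcs = dx @ evecs                      # projection of each frame onto each eigenvector
    return evals, evecs, pcs

# Synthetic test case: four atoms share one collective motion along z, plus noise
rng = np.random.default_rng(0)
n_frames, n_atoms = 500, 4
mode = np.zeros((n_atoms, 3))
mode[:, 2] = [1.0, 1.0, -1.0, -1.0]       # atoms 0,1 move opposite to atoms 2,3
mode /= np.linalg.norm(mode)
amplitude = rng.normal(scale=2.0, size=n_frames)
noise = rng.normal(scale=0.1, size=(n_frames, n_atoms, 3))
traj = amplitude[:, None, None] * mode + noise

evals, evecs, pcs = pca_of_trajectory(traj)
pc1 = evecs[:, 0].reshape(n_atoms, 3)     # per-atom displacement directions of PC 1
overlap = abs(np.sum(pc1 * mode))         # inner product with the planted mode
fraction = evals[0] / evals.sum()         # share of total variance captured by PC 1
```

Reshaping an eigenvector back to `(n_atoms, 3)` answers "which atoms move together, and in which directions", while the corresponding eigenvalue answers "how far".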
In order to develop quantitative techniques for the analysis of dynamics, it behooves
us to study simple systems which have been well characterized in the past, yet also have
adequate complexity to capture the essential features of biological function. Gramicidin is
one of the simplest membrane proteins with a very clear functional role and a long history of
experimental, theoretical and computational study. Other examples of such archetypal
systems include cytochrome c, BPTI, ubiquitin and lysozyme, but all of these are globular
proteins while gramicidin is a membrane-bound channel, which adds an important layer of
complexity. It is very small compared to most proteins, yet it has both secondary and
quaternary structure, and is also a membrane protein which interacts with its lipid
environment. Moreover, since its function is well understood as a channel, MD studies of
gramicidin are especially tractable, since we know functional events take place within the
duration of our simulations. On the other hand, after decades of theoretical and
computational studies of gramicidin, only recently have nanosecond-scale MD simulations of
proteins in an explicit membrane bilayer become tractable. All these features make
gramicidin an ideal candidate for detailed quantitative study and the development of new
analysis techniques.
1.2: Gramicidin
1.2.1: Biological Characterization
Gramicidin was discovered by René Dubos in 1939 (37), who isolated it from the soil
bacterial species Bacillus brevis, and named it for its bactericidal properties. Gramicidin was
one of the first commercially produced antibiotics, making a significant impact on battlefield
medicine during the Second World War. It is active primarily against Gram-positive bacteria
other than the Bacilli, as well as select Gram-negative species. Its use as an antibiotic is
limited to topical applications, as it induces hemolysis when taken internally, and is most
commonly found today in the commercial ointment Neosporin. To give a historical
perspective on the importance of this molecule, when Soviet researchers isolated an entirely
different compound with similar antibacterial properties in 1942, it was named Gramicidin S,
for Soviet. At the end of the wartime effort in 1944 the Soviet Ministry of Health was
collaborating with Great Britain to solve its structure. While the culmination of this effort
had to await the development of x-ray crystallography and NMR spectroscopy, gramicidin
was one of the first proteins whose structure was definitively solved by NMR (38) and for
about 15 years was the only transmembrane channel with known structure. This contributed
significantly to the wealth of research which has been devoted to this molecule.
When inserted into a membrane gramicidin forms a passive trans-membrane pore
which is selective for small monovalent cations (39), and this is essential to its mode of
action as an antibiotic. It kills bacteria by increasing the permeability of their cell walls,
thereby destroying the ion gradients (primarily of H+, Na+ and K+) between the cytoplasm
and the extracellular environment. The experimentally observed selectivity sequence for
gramicidin is Li+ < Na+ < K+ < Rb+ < Cs+ (40, 41) – which is the same as these ions’ mobility
in water – with overall activation free energy barriers on the order of 5-10 kcal/mol and
conductance of ~10^7 ions per second (39, 42). Gramicidin is impermeable to anions and is
blocked by divalent cations.
It is interesting to note that the natural function of gramicidin in Bacillus brevis is not
known, although it is apparently not used as an antibacterial pore-forming agent in its native
environment. It has been shown to inhibit E. coli RNA polymerase, and in B. brevis it is
believed to play a role in gene regulation during the shift from vegetative growth to
sporulation (43).
1.2.2: Structure of Gramicidin and its Dioxolane-Linked Analogs
Gramicidin has a number of structural analogs, all of which are pentadeca-peptides
which dimerize to form beta-helical transmembrane channels when inserted into a membrane
bilayer. Gramicidin D is the pharmacological extract (named for Dubos), and is a
heterogeneous mixture of 80% gramicidin A, 6% gramicidin B and 14% gramicidin C.
These are all naturally occurring dimers and differ only in the residue at position 11 with the
following chemical formula:
X_L-Gly-Ala_L-Leu_D-Ala_L-Val_D-Val_L-Val_D-Trp_L-Leu_D-Y_L-Leu_D-Trp_L-Leu_D-Trp_L
The L and D subscripts indicate left-handed and right-handed enantiomers of the amino acids
(note that Gly has no optical activity since it is not chiral). Gramicidin A has Y=Trp,
gramicidin B has Y=Phe and gramicidin C has Y=Tyr. There are variants of all three analogs
where X=Val or X=Ile. There are also a number of artificial analogs where the Trp residues
at positions 9, 13 and 15 are also modified. The analog in which all Trp residues have been
replaced with Phe is called gramicidin M.
The structure of gramicidin A has been characterized at high resolution with 1H-NMR
in lipid micelles (38) and using solid-state NMR in lamellar-phase lipid bilayers (44, 45).
The native channel is composed of two monomers which assemble as a head-to-head non-
covalently-linked dimer, forming a cylindrical pore when solvated in a membrane bilayer, as
shown in Fig. 1.1. Each monomer has 15 alternating L- and D-amino acid residues which
form a β6.3-helix with 2.5 turns per monomer. Four Trp residues stabilize the C-termini at
the water-membrane interface. The gA helix forms a 4-Å-wide cylindrical pore which hosts
a single file chain of water molecules traversing the membrane, thereby creating a pathway
for cation permeation and a hydrogen-bonded wire for the conduction of protons. Divalent
cations are too large to pass through the mouth of the channel, and block it by binding there.
The unique ability to form a beta-helical secondary structure is due to the alternating
L and D amino acids in the structure of gramicidin. L amino acids are by far the dominant
component of proteins in most life forms, with D forms found only in the outer
peptidoglycan walls of bacteria (46). This beta-helix has the carbonyl oxygens alternating
from one side of the backbone to the other. Each of these is hydrogen bonded to an amide
hydrogen, which is the characteristic pattern of hydrogen-bonds in the beta-sheet (hence the
name). However, the alternating L and D amino acids allow for a continuous curve in a
single direction rather than flattening the chain as in a beta-sheet formed exclusively of L-
amino acids. This pattern of alternating carbonyl orientations results in twice the distance
between neighbouring hydrogen-bonds than in an alpha helix, making the beta helix less
rigid. This pattern of carbonyls also exposes a periodic set of partial negative charges to the
lumen of the channel, which play a significant role in lowering the energetic barrier due to
cation dehydration upon entry into the channel, and also in solvating positive ions as they
pass through the channel. Note however that the orientation of the carbonyl dipoles is
parallel to the helical axis, while the optimal solvation geometry would point the dipole
moment radially towards the ion within a pore. This makes the solvation of ions by
backbone carbonyls a more subtle process in gramicidin than in the selectivity filter of KcsA
discussed above, and intrinsically couples tilting motions of carbonyl groups with ion
solvation.
Gramicidin A is a dimer held together by six hydrogen-bonds at the N-termini
located in the centre of the bilayer. These interactions play a dominant role in dimer
association and dissociation. Gating of the native channel is associated with the lifetime of
dimerization, which is on the order of 100 ms (47, 48). Dioxolane-linked analogs of gA have
been synthesized which inhibit dimer dissociation (49, 50), resulting in channels with
increased conductive lifetimes. The presence of two chiral carbon atoms in the dioxolane
ring leads to two distinct diastereoisomers, where both linking carbon atoms are either in the
S or R state. The structure of the linker bridging a dipeptide (in the R configuration) is
shown in Fig. 1.2A. The R and S designation defines the nomenclature of linked channels;
since both chiral carbons must be in the same configuration when linking the gramicidin
dimer, the two diastereoisomers are named SS and RR. The most significant structural
difference between these channels relates to the strain of the linker acting on the helical
backbone: the SS linker fits easily along the pitch of the helix, while the RR linker is
perpendicular to it, creating a wedge-like dislocation with the ring parallel to the helical axis,
as depicted in Fig. 1.2B and 1.2C. Significantly, the SS dimer is much more stable in its
conducting state (hours) than the RR dimer (minutes) (51-53).
Figure 1.1: Gramicidin A in a hydrated lipid bilayer. The GMO molecules are shown in thin lines (cyan carbons, red oxygens), bulk water molecules are shown as small spheres while 9 lumen water molecules are emphasized as large spheres. The β-helical backbone is shown in blue, while hydrophilic side chains (Trp in transparent red) and hydrophobic side chains (Leu, Val, Ala and Gly in transparent green) are shown in stick representation. A top-down view of the channel can be found in Figure A3 of Appendix 2. The SS- and RR-linked analogs were simulated in the same hydrated membrane environment.
Figure 1.2: A: The dioxolane linker (atoms 1-9) inserted in the R configuration between two amino acids. The SS-linked (B) and RR-linked (C) analogs vary in the degree of structural perturbation caused by the linker to the pitch of the beta-helix. Only backbone atoms are shown for clarity.
1.2.3: Studies of Gramicidin Dynamics
Gramicidin has a long history of theoretical and computational study; reviews may
be found in Refs (54-56). The first theoretical model of ion transport in gramicidin was
proposed by Läuger in 1973 (57), and consisted of a simplified array of dipole moments.
Since then the evolution of computing power and the refinement of potential energy
functions has given rise to molecular dynamics simulations with increasingly realistic
membrane and hydration environments and at increasingly long timescales, along with
Monte Carlo simulations, ab initio quantum mechanical calculations, activated dynamics and
free-energy simulation techniques (56). In addition to these, hybrid models have also been
constructed where, for example, an MD treatment of proton transport, channel and lumen
dynamics combines with a Monte-Carlo treatment of entrance and exit from the channel to
yield conductivity values which can be compared with experiment (58).
Calculating the free energy of ion permeation is essential to understanding the
functional mechanism of ion channels, since this quantity relates fundamentally to the
conductance of a channel which can be measured experimentally. Potential of Mean Force
(PMF) calculations were introduced by Kirkwood in 1935 (59) as a means of obtaining free
energy results in liquids, and this technique has become central in the treatment of transition
rates, ion transport and reaction dynamics in general. PMF calculations have become a
benchmark by which to judge the quality of MD calculations (60). A recent review of PMF
results for various cations in gramicidin (61) has demonstrated that semi-quantitative
agreement between experiment and calculation can be attained, but it also highlights the
challenges faced by theorists in the treatment of gramicidin. Since this is such a small
molecule with a very narrow pore, changes in dielectric constant occur over the space of a
few atoms, and polarizability has a large influence over electrostatic properties.
Polarizability is generally not included within standard molecular mechanics force fields, and
the treatment of the dielectric constant (an intrinsically macroscopic quantity) is also
problematic at the atomic scale. This review (61) also showed that the single-file water chain
in the lumen does a surprisingly good job of stabilizing the ion, providing about half the ion’s
bulk hydration free energy even though the ion loses 5 of its 7 solvating water molecules
upon entry into the pore.
Solvation and H-bonding play a strong role in modulating the conduction of ions and
protons in particular along water chains (62-65). Studies of water wires in an electrostatic
field (66) and in the water-transporting channel aquaporin have also shown that the
electrostatic environment modulated by the global conformation of the protein also strongly
influences the conductive properties of its water wire (67, 68). In the case of gramicidin A,
hydrogen-bonding between lumen water molecules and backbone carbonyl groups is thought
to play a significant role in organizing the water wire within the channel to provide surrogate
solvation to the hydrated ion (54, 55, 69, 70). Protons in particular are very sensitive to the
orientation of nearby water molecules, as H+ ions conduct by hopping from one water
molecule to the next, which can only happen if an oxygen atom is oriented toward the excess
proton. This is known as the Grotthuss mechanism of proton transport (70-72), which
recognizes that the conduction of a proton along the length of a water wire necessitates the
reorientation of the entire wire before another proton can be transmitted from the same end of
the wire. Not only are carbonyl oxygen atoms well suited to both hydration and mobility of
protons in gramicidin (71), but they also assist in the reorientation step of the Grotthuss
mechanism (70).
The surrogate solvation of cations by carbonyl oxygens in the gramicidin backbone
has a long history of study. A peptide-plane libration mechanism was first proposed on the
basis of experimental conductance measurements (73). A normal mode analysis (NMA)
study concluded there was a band of short wavelength (high frequency) modes between 75
cm^-1 and 175 cm^-1 which represent librational motions of the peptide planes (74). Early MD
studies concluded that the flexibility of the gA channel modulates its conductivity, and
suggested that picosecond librations of the carbonyl moieties lining the pore were coupled to
the fluctuation of water molecules and of ions in the lumen (75). Tian and Cross have
reviewed the experimental evidence for carbonyl tilting in gA (76), and NMR studies have
provided experimental evidence of peptide plane librations (77, 78) demonstrating that
motions of the backbone occur on the same time scale as cation translation through the
channel. Powder-pattern NMR revealed picosecond librations (78), while 15N T1 relaxation
measurements indicated a nanosecond timescale (77), although this slower result was
interpreted as the effect of damping by slower correlated motions. Recent MD studies have
also computed the amplitude of these librations (72), finding significant agreement with the
amplitudes measured by NMR. The frequency of carbonyl librations has also been measured
by far-infrared spectroscopy (79, 80) and found to be in general agreement with the NMA
results reported by Roux and Karplus (74).
In gramicidin the collective structural fluctuations of the entire backbone influence
the more localized dynamics of the water wire, with significant functional impact. It has
been demonstrated that the flexibility of the gA backbone in general influences its ionic
conduction properties (75). Specifically, the local perturbations of channel structure caused
by the dioxolane linker in the SS and RR analogs of gramicidin have a significant impact on
the channel’s conductivity; experimental investigations have shown that single channel
conductance is a factor of 2-4 times greater in the SS-linked dimer than in the RR analog
(53), and is strongly influenced by the membrane environment (51, 81, 82). Moreover,
experiments on the concentration-dependence of proton mobilities have suggested that
channel conductance is modified by structural differences in the protein which affect the
organization of water and hydrogen-bonds in the lumen (83). Computational studies of the
linked analogs have revealed that the RR dimer has 4 conformational states not present in the
native or SS channels (84), which are defined by the orientation of the two dioxolane
carbonyl groups pointing in or out of the channel. This study showed that while all the
carbonyl groups of gA and the SS-linked channels undergo unimodal librational motions
with RMS fluctuations on the order of 15°, the dioxolane carbonyl groups of the RR-linked
channel undergo multimodal switching transitions of ~50°, largely to compensate for the
distortion of secondary structure caused by the RR-linker. Furthermore, thermally activated
transitions between these four states were shown to limit the movement of protons through the
channel lumen (72), by coupling proton translocation to the conformational transitions of the
dioxolane linker.
This coupling between fast (localized) modes of water orientation and slow
(distributed) collective modes of the protein is of central relevance to biomolecular function,
and necessitates the quantitative description of both fast and slow collective motions in this
and other biomolecular systems. We note that the computational studies listed above were
based on simulations without a membrane; the hierarchy of local and global modes they
revealed raises the question of the membrane’s role in this hierarchy, which has motivated our
inclusion of a GMO membrane in the current study.
1.3: Summary and Overview
The gramicidin channel is a transmembrane protein with a long history of
computational and experimental study, and serves as an appropriately simple system for the
development of analytical techniques to quantitatively characterize its dynamics. Molecular
dynamics simulations provide a large data set which describes the time evolution of
molecular structure at finite temperatures and in a biologically relevant environment such as
the hydrated membrane. MD provides a natural progression of the structural biologist’s
strategy for understanding molecular biology, by including dynamics in the mechanistic
explanations which link structure to function. In Chapter 2 we describe the MD technique in
more detail, and explain how this data set is amenable to study by multivariate statistical
techniques such as Principal Component Analysis (PCA), which can extract the structure of
collective modes from atomistic dynamics. By and large, PCA has been used on MD data
sets in its most basic form, but we demonstrate that the history of PCA in atmospheric
science shows that there is a rich diversity of strategies for improving the ability of PCA to
calculate physically meaningful modes of motion from large, noisy data sets. In Chapter 3
we establish a few quantitative measures of convergence which underwrite the reliability and
statistical meaning of PCA results computed from finite simulation times. In Chapter 4 we
describe how access to the full time-ordered trajectory of an empirical simulation cast onto
the Principal Components offers insights into the statistical mechanics and dynamics of
collective motions in the biophysical state. Here we undertake a study of the Gaussian and
non-Gaussian characteristics of collective motions, which offers insight into the anharmonic
properties of the free energy landscape, and we suggest a few reasons to believe that these
are essential to understanding biologically important dynamics. In Chapter 5 we study the
spatial structures (eigenvectors) of the principal components and propose a simple
transformation which reduces the leading 25 PCs to 4 collective modes with much simplified
structure. Finally, in Chapter 6 we apply some of our PCA techniques to the study of annular
lipids surrounding the gramicidin channel in a membrane, in an attempt to find simplified
structure in the noisy dynamics of solvating lipid molecules.
Chapter 2: Theory and Methods
2.1: Molecular Dynamics (MD)
2.1.1: Background
Molecular Dynamics is an algorithm routinely used to simulate the motion of many
chemically bonded atoms in the classical approximation. It is classical in the sense that it
does not model electrons, whose dynamics are intrinsically quantum mechanical and occur
on timescales much faster than the motion of the nuclei. The simulation of atomic dynamics
is achieved by means of a “molecular mechanics force field”, which comprises a set of
interaction energies which depend exclusively on nuclear coordinates and atomic type (which
can also depend on the molecular group – i.e. a carbon atom is described differently in
methyl and carboxyl groups). These functions are empirically parameterized by integrating
over all electronic degrees of freedom in a high level quantum mechanical calculation,
yielding a small number of parameters, within a few simplified functional forms, which
describe bonded and non-bonded interactions. The spatial derivative of these functions
yields force, and the vector sum across all force components yields the total force on any
atom at a single moment in time. There are a number of parameterizations available for MD
of biological molecules, the most common of which are CHARMM (85, 86), AMBER (87,
88), GROMOS (89-91) and OPLS (92).
The sum of all interaction energies is the total potential energy U and is split into
bonding and non-bonding terms. An essential part of the MD algorithm is keeping track of
which atoms are bonded to which other atoms as nearest neighbours (1-2 interactions) or
next-nearest neighbours (1-k interactions, with k < 5). This effectively encodes the primary
structure of the molecule to be simulated. There are 5 possible through-bond interaction
terms for each atom in the CHARMM potential, which account for bond stretching, bond-
angle deformation, and bond torsion:
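$$
\begin{aligned}
U_{\mathrm{bond}} &= k_b\,(b_{ij} - b_0)^2 \\
U_{\mathrm{angle}} &= k_\theta\,(\theta_{ij} - \theta_0)^2 \\
U_{\mathrm{UB}} &= k_u\,(u_{ij} - u_0)^2 \\
U_{\mathrm{improper}} &= k_\omega\,(\omega_{ij} - \omega_0)^2 \\
U_{\mathrm{dihedral}} &= k_\phi\,[1 + \cos(n\phi_{ij} - \delta)]
\end{aligned}
\tag{2.1}
$$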
Variables marked with the ij subscript denote that the interaction is between atoms i and j.
These are the dynamic variables in the simulation, and are all functions of atomic positions.
The distance between atoms separated by a single covalent bond is b (1-2 interactions), the
angle at the junction between two bonds is θ, and u is the distance between the two atoms
joined at such a junction (this is the Urey-Bradley term UUB, which is the cross-term in angle
bending involving next-nearest neighbours, i.e. 1-3 interactions). Torsions are created by
series of three bonds (1-4 interactions): the dihedral angle between the bonds on either end
with respect to the axis of the middle bond is φ. Udihedral is the first term of a Fourier
expansion, where n denotes the periodicity of the function and δ is the phase. Improper
dihedral angles are also written as 1-4 interactions, where the torsion around a bond which
branches into three bonds is ω.
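The bonded terms of Eq. 2.1 can be illustrated with a minimal Python sketch. It is not the CHARMM implementation: the three-atom fragment, the force constants and the equilibrium values below are hypothetical numbers chosen only to exercise the functional forms, and the helper names (`distance`, `bond_angle`, `harmonic`, `dihedral`) are invented for this example.

```python
import math

def distance(p, q):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.dist(p, q)

def bond_angle(p, q, r):
    """Angle in radians at atom q between the bonds q-p and q-r."""
    v1 = [a - b for a, b in zip(p, q)]
    v2 = [a - b for a, b in zip(r, q)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against round-off

def harmonic(k, x, x0):
    """Shared form of the bond, angle, Urey-Bradley and improper terms."""
    return k * (x - x0) ** 2

def dihedral(k_phi, n, phi, delta):
    """Periodic torsion term of Eq. 2.1 with periodicity n and phase delta."""
    return k_phi * (1.0 + math.cos(n * phi - delta))

# Hypothetical three-atom fragment (coordinates in Å); NOT real CHARMM parameters.
a1, a2, a3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
u_bond = harmonic(300.0, distance(a1, a2), 1.0)                        # 1-2 term
u_angle = harmonic(50.0, bond_angle(a1, a2, a3), math.radians(109.5))  # angle term
u_ub = harmonic(10.0, distance(a1, a3), 1.5)                           # 1-3 Urey-Bradley term
u_dihedral = dihedral(0.2, 3, math.radians(60.0), 0.0)                 # torsion at an assumed angle
```

Every term is a function of atomic positions only, through the internal coordinates b, θ, u, ω and φ, which is why the force on each atom follows from a spatial derivative of these expressions.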
The two non-bonded terms in the CHARMM potential account for electrostatic and
electrodynamic (dispersion, or van der Waals) interactions:
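$$
\begin{aligned}
U_{\mathrm{elec}} &= \frac{q_i q_j}{D\, r_{ij}} \\
U_{\mathrm{VdW}} &= \varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
\end{aligned}
\tag{2.2}
$$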
where rij is the distance between atom i and every atom j within a given cut-off radius for the
electrical or van der Waals interactions (cut-offs are independently adjustable parameters
when setting up the simulation, as is the dielectric constant D). The non-bonded forces are
only applied to atom pairs separated by at least three bonds. These interactions are
computationally much more demanding than the bonded interactions, since they include a
sum across a large number of atoms within the cut-off radius at each site.
All other variables in Eq. 2.1 (kb, kθ, ku, kω, kφ, b0, θ0, u0, ω0, φ0, n, δ) and Eq. 2.2 (qi,
qj, εij, σij) are parameters established by theoretical calculations or empirical calibration for
each atomic kind in the appropriate chemical environment. Those named k are all spring
constants (kcal/mol/Å^2 or kcal/mol/rad^2) for a given harmonic interaction, and those with
subscript 0 denote equilibrium distances (Å) and angles (radians). qi and qj are the partial
charges on atoms i and j, σij is the average of atomic radii i and j (Å), and εij is the square
root of the product εiεj, denoting the strength of the van der Waals interaction for a given
atom (kcal/mol). The units stated here are those used in CHARMM. It is important to note
the assumption underlying the validity of implementing any molecular mechanics force field:
equilibrium parameters derived for smaller molecular subunits must scale appropriately
across the much larger biomolecules in an MD simulation, and also hold across a relatively
broad and high temperature range (compared to zero-temperature quantum calculations).
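The non-bonded terms of Eq. 2.2 and the combination rules described above can be sketched in the same way. This is an illustrative sketch under stated assumptions rather than the CHARMM code path: a single cut-off is used for both terms (the text notes they are independently adjustable), the `prefactor` argument is a placeholder for whatever unit-conversion constant a real unit system requires, and the `excluded` set is a hypothetical stand-in for the bookkeeping that skips pairs separated by fewer than three bonds.

```python
import math
from itertools import combinations

def coulomb(qi, qj, r, dielectric=1.0, prefactor=1.0):
    """Electrostatic term of Eq. 2.2; prefactor stands in for a unit conversion."""
    return prefactor * qi * qj / (dielectric * r)

def lennard_jones(eps_ij, sigma_ij, r):
    """Van der Waals term of Eq. 2.2; minimum of depth -eps_ij at r = sigma_ij."""
    x6 = (sigma_ij / r) ** 6
    return eps_ij * (x6 * x6 - 2.0 * x6)

def nonbonded_energy(pos, q, eps, sigma, cutoff, excluded=frozenset()):
    """Sum both terms over atom pairs within the cut-off, skipping excluded pairs."""
    total = 0.0
    for i, j in combinations(range(len(pos)), 2):
        if (i, j) in excluded:
            continue  # pair is too close along the bond graph
        r = math.dist(pos[i], pos[j])
        if r > cutoff:
            continue  # outside the (assumed shared) cut-off radius
        eps_ij = math.sqrt(eps[i] * eps[j])     # geometric mean, as stated in the text
        sigma_ij = 0.5 * (sigma[i] + sigma[j])  # arithmetic mean, as stated in the text
        total += coulomb(q[i], q[j], r) + lennard_jones(eps_ij, sigma_ij, r)
    return total

# Two hypothetical neutral atoms placed exactly at their van der Waals minimum.
pair_energy = nonbonded_energy([(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)],
                               q=[0.0, 0.0], eps=[0.1, 0.1],
                               sigma=[3.5, 3.5], cutoff=12.0)
```

The double loop over pairs makes the cost of this sketch grow with the square of the number of atoms, which illustrates why the non-bonded terms dominate the computational expense and why cut-offs are used.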
Computing the potential energy is the purely spatial part of the MD algorithm. To
propagate dynamics the acceleration a of each atom is determined from the spatial derivative
of the internal energy by Newton’s Second Law F=-∇U=ma, where m is the mass of a given
atom, and ∇ is the vector gradient operator (variables in bold are vectors with x, y and z
components). This is achieved by choosing a time step δt (typically on the order of 1 fs) which is short
enough to assume constant acceleration through its duration, such that the position r can be
determined at the next time step:
$$\mathbf{r}_i(t+\delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t)\,\delta t + \tfrac{1}{2}\,\mathbf{a}_i(t)\,\delta t^2 \tag{2.3}$$
This is a set of 3N simultaneous equations where i=1…N labels each atom in the system, and
can be solved numerically even for large N. The explicit inclusion of the velocities vi(t) in
the equations of motion (2.3) can be circumvented by including one step back in time ri(t-δt):
$$\mathbf{r}_i(t-\delta t) = \mathbf{r}_i(t) - \mathbf{v}_i(t)\,\delta t + \tfrac{1}{2}\,\mathbf{a}_i(t)\,\delta t^2 \tag{2.4}$$
The sum of Eqs. 2.3 and 2.4 eliminates vi(t) and yields the Verlet algorithm:
$$\mathbf{r}_i(t+\delta t) = 2\,\mathbf{r}_i(t) - \mathbf{r}_i(t-\delta t) + \mathbf{a}_i(t)\,\delta t^2 \tag{2.5}$$
Note that to create the initial two time steps ri(t-δt ) and ri(t), Eq. 2.3 must be solved at the
first time step by specifying an initial set of velocities at the beginning of the simulation, in
addition to initial positions. These are chosen from a Maxwell-Boltzmann distribution, and
their evolution in time determines the trajectory of kinetic energy in the system, and therefore
the temperature.
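The update rule of Eq. 2.5, together with the start-up step of Eq. 2.3, can be sketched for a single coordinate. The test system here is an assumption made purely for illustration: a one-dimensional harmonic oscillator with unit mass and unit spring constant, for which the exact solution x(t) = cos(t) is known. The function name `verlet_trajectory` is hypothetical.

```python
import math

def verlet_trajectory(x0, v0, accel, dt, n_steps):
    """Positions at t = 0, dt, ..., n_steps*dt from the Verlet recursion.

    accel(x) returns the acceleration -grad U / m at position x.
    """
    # Eq. 2.3 supplies the second point from the initial position and velocity.
    x_prev = x0
    x = x0 + v0 * dt + 0.5 * accel(x0) * dt ** 2
    xs = [x_prev, x]
    # Eq. 2.5 then needs only the two most recent positions.
    for _ in range(n_steps - 1):
        x_next = 2.0 * x - x_prev + accel(x) * dt ** 2
        x_prev, x = x, x_next
        xs.append(x)
    return xs

# Assumed test system: harmonic oscillator with m = k = 1, so a(x) = -x.
dt = 0.01
xs = verlet_trajectory(x0=1.0, v0=0.0, accel=lambda x: -x, dt=dt, n_steps=628)
error = abs(xs[-1] - math.cos(628 * dt))  # deviation from the exact solution
```

Velocities never appear inside the loop; they enter only through the first step, which is where the Maxwell-Boltzmann initial velocities described above would be used in a real simulation.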
The inclusion of temperature is essential in MD, since it allows for the calculation of
free energies G rather than simply evolving dynamics on a potential energy surface (at zero
temperature), which describes only the enthalpy H. The temperature algorithm – the
inclusion of thermal fluctuations – controls the distribution of states which is selected on this
potential surface, hence the time-ordered trajectory generated by MD includes the entropy S.
It is dynamics at a finite temperature that make MD simulations an object of interest to
biochemists and molecular biologists, by describing dynamics on the free-energy surface,
dG=dH-TdS.
There are three main temperature regulation algorithms: Berendsen (93), Nosé-
Hoover (94, 95), and Langevin dynamics (96). The Berendsen thermostat (93) is the
simplest, in that it rescales all velocities at every step (or few steps) of the simulation such
that the average kinetic energy is maintained at the desired temperature. This has been
shown to give rise to some problematic artifacts which violate equipartition of energy; the
correct average energy is maintained but becomes distributed asymmetrically across all
degrees of freedom. A canonical example of this is known as the “flying ice cube” (97),
where energy from internal degrees of freedom is channeled into translation and rotation
about the centre of mass of the whole system. If different components of a simulation such
as the water, the protein, or the membrane are governed by different relaxation rates, the
kinetic energy may be deposited largely in one of these phases over another (98). This is
clearly problematic in simulations which make use of aqueous or oil-phase solvent.
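The velocity rescaling described above can be sketched as follows. The sketch assumes the commonly quoted weak-coupling scaling factor λ = [1 + (δt/τ)(T0/T - 1)]^(1/2), where τ is a coupling time and T0 the target temperature; these symbols, the reduced units (k_B = 1) and the function names are assumptions of this example rather than details given in the text, and constrained degrees of freedom and centre-of-mass motion are ignored.

```python
import math

def temperature(masses, velocities, k_B=1.0):
    """Instantaneous temperature from equipartition, KE = (N_dof / 2) k_B T."""
    kinetic = 0.5 * sum(m * sum(c * c for c in v)
                        for m, v in zip(masses, velocities))
    n_dof = 3 * len(masses)  # assumes no constraints are applied
    return 2.0 * kinetic / (n_dof * k_B)

def berendsen_rescale(masses, velocities, t_target, dt, tau, k_B=1.0):
    """Scale every velocity by the same weak-coupling factor lambda."""
    t_now = temperature(masses, velocities, k_B)
    lam = math.sqrt(1.0 + (dt / tau) * (t_target / t_now - 1.0))
    return [tuple(lam * c for c in v) for v in velocities]

# Two hypothetical particles in reduced units, initially at T = 0.5.
masses = [1.0, 2.0]
velocities = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
relaxed = berendsen_rescale(masses, velocities, t_target=1.0, dt=0.1, tau=1.0)
```

Because one factor multiplies all velocities, the total kinetic energy is steered toward the target while its partition among degrees of freedom is left untouched, which is consistent with the equipartition artifacts described above.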
The Nosé-Hoover (94, 95) thermostat solves this problem by coupling each degree of
freedom in the atomic ensemble independently to another “virtual” degree of freedom,
coupling an oscillator of a specific mass to each particle in the system. By introducing a set
of virtual oscillators which exchange energy with each real particle in the system, the kinetic
energy is redistributed uniformly through the system. This method also has the advantage
that different parts of the solute/solvent ensemble may be coupled independently to different
heat baths; this can be helpful in simulations of membrane-bound proteins, since proteins and
membranes have very different relaxation times and may benefit from different values for the
virtual masses attached to them.
Langevin dynamics (96) explicitly includes friction as well as stochastic perturbations
directly in the differential equations of motion, i.e. Newton’s Second Law is written including a
velocity-dependent friction term with a damping constant γ, and a stochastic term R(t):
$$\mathbf{F} = m\mathbf{a} = -\nabla U - \gamma m\,\mathbf{v}(t) + (2\gamma m k_B T)^{1/2}\,\mathbf{R}(t) \tag{2.6}$$
where $k_B$ is Boltzmann’s constant, T is temperature, $\langle \mathbf{R}(t) \rangle = 0$ and $\langle \mathbf{R}(t)\,\mathbf{R}(t') \rangle = \delta(t-t')$.
Here δ is the Dirac delta function, and these definitions make R(t) an uncorrelated Gaussian
process. While the Langevin approach performs better than the Berendsen or Nosé-Hoover
thermostats, it is not possible to implement with constant surface tension and pressure with
the algorithms currently available in CHARMM. Since these are necessary for simulations
of the GMO membrane, the Nosé-Hoover thermostat was used in the simulations described
below.
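Although Langevin dynamics was not used for the production runs, Eq. 2.6 is simple to illustrate. The sketch below integrates it for a single one-dimensional coordinate with a plain Euler-Maruyama step, which is an assumption of this example and not the integrator used by CHARMM; the reduced units, the parameter values and the function names are likewise hypothetical.

```python
import math
import random

def langevin_step(x, v, grad_U, m, gamma, kT, dt, rng):
    """One Euler-Maruyama step of Eq. 2.6 for a single 1D coordinate."""
    # Delta-correlated noise of strength 2*gamma*m*kT integrates to a Gaussian
    # velocity kick with variance 2*gamma*kT*dt/m over one time step.
    noise = math.sqrt(2.0 * gamma * kT * dt / m) * rng.gauss(0.0, 1.0)
    v_new = v + dt * (-grad_U(x) / m - gamma * v) + noise
    x_new = x + dt * v_new
    return x_new, v_new

def mean_square_velocity(n_steps, m=1.0, gamma=1.0, kT=1.0, dt=0.1, seed=0):
    """Time-averaged v^2 of a free particle (U = 0) coupled to the bath."""
    rng = random.Random(seed)
    x, v, total = 0.0, 0.0, 0.0
    for _ in range(n_steps):
        x, v = langevin_step(x, v, lambda _x: 0.0, m, gamma, kT, dt, rng)
        total += v * v
    return total / n_steps

msv = mean_square_velocity(100_000)
```

With friction and noise balanced in this way, the averaged v^2 should settle near kT/m, as equipartition requires, up to a discretization error that shrinks as the time step is reduced. Setting kT to zero removes the noise and leaves pure frictional decay of the velocity.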
2.1.2: Simulation of gA/SS/RR in a GMO Membrane
All simulations described in this work were carried out using the CHARMM 31.1
molecular dynamics package (85) with the TIP3P water model (99) and the CHARMM22
force field (86) for all other atoms. The parameters for the dioxolane linker in the SS and RR
analogs were developed in a previous study (84), by fitting geometry, vibrational
frequencies, and energy to the results of ab initio calculations. The calculations were
performed using Gaussian 98 (Rev. A.9) together with the RHF/6-311G** level of theory, on
both the [1,3]dioxolane fragment alone and on (RR)[1,3]dioxolane-4,5-dicarboxylic acid bis-
methyl-amide (a linked dipeptide). A crystalline array of 122 glycerol-1-mono-oleate
(GMO) molecules was arranged with 5 Å spacing in a bilayer configuration with 3210 water
molecules, and allowed to relax for 100 ps using Langevin dynamics in a cubic box with 50
Å sides. Then a cylindrical hole was created in the centre of the membrane and a gA dimer
was inserted, whose structure was obtained from simulations inside a phospholipid bilayer
(34) with harmonic restraints on selected side chains (72). The same initial structure was
used for the RR and SS analogs, with the addition of the linker and removal of any GMO
molecules in conflict with it. The entire system was then equilibrated at 300 K with strong
harmonic restraints (100 kcal/mol/Å^2) on all heavy atoms of the channel for 0.2 ns, then with
moderate restraints (5 kcal/mol/Å^2) for 0.8 ns, and finally with no restraints for 1 ns before
the production runs, as described below.
Two sets of simulations of the gA molecule were carried out to probe long- and
short-time dynamics of the gramicidin dimer. One 64 ns production run probed longer time-
scale dynamics using a 2 fs time step and saving coordinates every 200 fs, using the SHAKE
algorithm (100) to constrain stretching of covalent bonds involving hydrogen. Another 10 ns
production run had no bond constraints within the protein and a 0.5 fs time step (saving every
10 fs) to probe hydrogen dynamics at shorter time scales and to yield accurate PCA
eigenvalues at the shortest spatial scales. Only the 64 ns simulations with 2 fs timesteps were
carried out for the SS and RR molecules.
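The size of the resulting data sets follows directly from the run lengths, time steps and saving intervals quoted above. The short calculation below is only a worked piece of arithmetic; the helper name `run_stats` is hypothetical.

```python
def run_stats(length_ns, dt_fs, save_fs):
    """Return (integration steps, steps between saved frames, saved frames)."""
    n_steps = round(length_ns * 1.0e6 / dt_fs)  # 1 ns = 1e6 fs
    stride = round(save_fs / dt_fs)
    return n_steps, stride, n_steps // stride

long_run = run_stats(64, 2.0, 200.0)   # 64 ns, 2 fs step, saved every 200 fs
short_run = run_stats(10, 0.5, 10.0)   # 10 ns, 0.5 fs step, saved every 10 fs
```

The 64 ns run therefore corresponds to 32 million integration steps and 320,000 saved frames, while the 10 ns run corresponds to 20 million steps and 1,000,000 saved frames.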
The leapfrog Verlet algorithm was used to propagate dynamics with constant
surface tension and normal pressure on the membrane based on the Parrinello-Rahman
barostat as described in Ref. (101), with a piston mass of 500 amu and a 5 ps coupling
constant. The surface tension was kept constant at a value of zero, since the application of a
finite external pressure has been shown to be unnecessary for GMO bilayers (102). The area
per lipid was stable at 0.25 nm^2, in quantitative agreement with a previous study (102), upon
equilibration of the membrane with gA inserted. The Nosé-Hoover algorithm was used to
control temperature at 300 K with a thermal piston mass of 1000 kcal·ps^2 and 5 ps coupling
constant. The simulations were carried out with tetragonal periodic boundary conditions,
updating the crystal parameters for box length every 200 ps. We used the particle-mesh
Ewald (PME) summation method, with a width κ=0.3 Å^-1 and grid point spacing of 1.0 Å.
The Lennard-Jones interactions used a force switching function from 10 to 12 Å, with a cut-
off at 14 Å.
2.2: Principal Component Analysis (PCA)
2.2.1: Background
Principal Component Analysis (PCA) is a multivariate statistical analysis for the
reduction of high-dimensional data sets onto collective coordinates. It is related to singular
value decomposition (SVD), which aims to extract only the largest components from data
sets with prohibitively high dimensionality (103, 104). PCA was originally developed to
identify the directions of most variation in data from the social sciences (105, 106) and
meteorology (107). In these disciplines the relationships between measured variables are
complex and non-intuitive, and PCA solved the problem of finding appropriate linear
combinations which capture a large fraction of the variance across the data set. PCA has
long been used in a number of disciplines concerned with the study of noisy, many-particle
trajectories. For the continuous case considered in the study of turbulence, the technique is
called Proper Orthogonal Decomposition (POD) (108), and in climatology the technique is
called Empirical Orthogonal Functions (EOF) (107).
PCA can be applied to the time trajectory of a collection of moving particles, and has
become a well-established technique for extracting collective modes of displacement from
atomistic MD trajectories (109, 110). The application of PCA to protein dynamics was
pioneered by García (111), whose study demonstrated that there are multi-modal
distributions of PCs along a simulated protein trajectory, and hence any harmonic
approximation of protein dynamics fails to capture the essential features of their collective
dynamics. Since it is often the large-amplitude motions which are of interest to biochemists,
PCA can afford significant data reduction by concentrating a large fraction of a system’s total
fluctuations into a small fraction of the collective motions. To this end, there have been
many studies of the largest few PCs of protein motion (18, 111-120). Application of PCA to
proteins has also been called "essential dynamics" (112, 116, 118), where it is argued that
only the largest non-Gaussian distributed PCs are sufficient to account for the functional
dynamics of a protein (112).
Consider a trajectory of N atoms in time $R_i(t) = (x_i(t), y_i(t), z_i(t))$ and let $r(t)$ represent
any one of $x_i(t)$, $y_i(t)$ or $z_i(t)$, where i=1,2,…,N and t=1,2,…,T, with T equal to the duration of
the trajectory. To study only the internal dynamics of a protein, it is conventional to align
each snapshot in the trajectory to the time-averaged structure, thereby eliminating translation
and rotation of the entire molecule from the trajectory. The mass-weighted covariance
matrix is
$$C = \langle M_{ij}\, \Delta r_i\, \Delta r_j \rangle, \qquad (2.7)$$
where $M_{ij} = M_i^{1/2} M_j^{1/2}$ and $\Delta r_i = r_i(t) - \langle r_i(t) \rangle_t$ is the change of position from the time-averaged
structure, for each spatial component of all atoms i and j included in the analysis. Note that
Eq. 2.7 is a product of all coordinates, and yields a matrix of dimension 3N-6 (subtracting the
six largest degrees of freedom representing translation and rotation of the molecule) rather
than the dot product $\Delta R_i \cdot \Delta R_j$ which would yield a matrix of dimension N-6. Diagonalization
of the covariance matrix C solves the eigenvalue problem
$$C\, v = \sigma^2 v \qquad (2.8)$$
and yields a set of eigenvalues $\sigma_k^2$ and eigenvectors $v_k$, where k=1,2,…,3N-6.
Each eigenvector $v_k$ represents a principal component of displacement, and may be
visualized as a set of N three-dimensional vectors attached to the N atoms analyzed within
the molecule. Each of these 3D vectors describes the magnitude and direction of the RMS
fluctuations at a given atom, within a given PC. The MD trajectory can be projected onto
each eigenvector by forming the dot product of atomic displacements with each eigenvector
for all time steps. The resulting distribution of each projection would have a variance
$\sigma_k^2$ (and standard deviation $\sigma_k$); this is the physical meaning of the eigenvalues, which
measure the spatial amplitude of each PC across the full trajectory.
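The steps of Eqs. 2.7 and 2.8 can be illustrated with a minimal Python sketch. This is not the code used in this work: the function name mass_weighted_pca, the (T, N, 3) array layout and the random synthetic trajectory are assumptions made purely for illustration, and the snapshots are assumed to have been aligned to the time-averaged structure beforehand.

```python
import numpy as np

def mass_weighted_pca(coords, masses):
    """Sketch of mass-weighted PCA for an already-aligned trajectory.

    coords : array of shape (T, N, 3), one aligned snapshot per time step
    masses : array of shape (N,), atomic masses
    Returns eigenvalues (descending), eigenvectors (as columns) and the
    projection of every snapshot onto every eigenvector.
    """
    T, N, _ = coords.shape
    x = coords.reshape(T, 3 * N)
    dx = x - x.mean(axis=0)                 # displacement from the time average
    w = np.sqrt(np.repeat(masses, 3))       # M_i^{1/2} for each Cartesian component
    dq = dx * w                             # mass-weighted displacements
    cov = dq.T @ dq / T                     # covariance matrix, Eq. 2.7
    evals, evecs = np.linalg.eigh(cov)      # eigenvalue problem, Eq. 2.8
    order = np.argsort(evals)[::-1]         # largest variance first
    evals, evecs = evals[order], evecs[:, order]
    proj = dq @ evecs                       # trajectory projected onto each PC
    return evals, evecs, proj

# Hypothetical 4-atom, 500-frame trajectory of random numbers (not MD data).
rng = np.random.default_rng(0)
coords = rng.normal(size=(500, 4, 3))
masses = np.array([12.0, 14.0, 16.0, 1.0])
evals, evecs, proj = mass_weighted_pca(coords, masses)
```

In this sketch the variance of each column of proj equals the corresponding eigenvalue, which is the interpretation of the eigenvalues given above. Because the synthetic data are not aligned, no eigenvalues vanish here; for an aligned molecular trajectory the rigid-body directions would carry essentially no variance.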
2.2.2: PCA and Protein Dynamics
Large collective displacements may be used to study conformational changes, and
these are often the best characterized examples of functional motion in proteins. PCA can be
used to compute the RMS fluctuations along the protein backbone, and has been particularly
successful in identifying large concerted motions which may be related to function.
Examples include the hinge-bending motion of thermolysin (118), identification of ligand
binding sites in Cu Plastocyanin and Azurin (114), and regions of hydrogen exchange in
cytochrome c (117). The technique is also useful for comparing the dynamics of similar
proteins within the same superfamily (18). Table 1 shows a representative sample of proteins
that have been analyzed using PCA (17, 18, 111-114, 116-119, 121, 122), and is focused on
those studies where the emphasis is on novel aspects of PCA in the analysis of protein
dynamics. The table demonstrates that the time scales, solvation and size of the systems
analyzed have varied greatly, and also that most studies have focused on the leading 1 to 3
PCs, although select higher PCs have occasionally been examined for comparison in the
earliest work from the 1990’s.
Protein                 No. of Residues  Solvent              ΔT (ns)  PCs examined      Ref.   Year
Crambin                 46               Water                0.24     1-5               (111)  1992
Lysozyme                129              Vac. & Water         1.00     1-10, 20, 50      (112)  1993
BPTI                    58               Vacuum               0.20     1-4, 10, 100      (116)  1994
Thermolysin             319              Vac. & Water         0.09     1-5, 10, 20, 50   (118)  1995
G-Actin                 375              Water                0.24     1-3               (121)  1996
Cytochrome c            105              Water                1.50     1-3               (117)  1999
Cu Plastocyanin         99               Water                0.80     1-3               (114)  2001
Azurin                  128              Water                0.80     1-3               (114)  2001
Apo-Adenylate Kinase    225              Water                6.00     1-3               (113)  2006
Lambda repressor        80               Water                10.00    1                 (119)  2006
Protease family         119-1023         Water                20.00    1-3               (18)   2006
Myosin II motor head    744              Water                5.00     1-20              (17)   2007
Rhodopsin               696              H2O + lipid bilayer  100.00   1                 (122)  2007
Table 1: A representative list of proteins studied using PCA.
Large-scale global conformation changes are not the only interesting feature of
protein dynamics. While the tertiary and quaternary structural changes may span the entire
protein – and we would expect the largest PCs to capture these motions – individual residues
also have important collective motions at much smaller length scales, and modification of
hydrogen bonds or changes in the structure of a binding pocket in an enzyme occur on even
shorter length scales. Biomolecular processes also span at least 9 orders of magnitude in
time, from femtoseconds (bond vibrations) to milliseconds or even longer (folding). The
covariance eigenvalues of short and fast collective motions are necessarily smaller than those
of long and slow motions, and may even be smaller than the covariance of noisy motions at
larger length scales. Hence these motions may not be represented in the largest set of
principal components, and may even be found among small covariance eigenvectors
normally ascribed to motions arising from thermal noise. In general, motions on long
length scales occur on long timescales and motions on short length scales on fast
timescales, but there are exceptions to this rule; the flipping of aromatic rings buried in the
core of a protein (123, 124) is one example of a short length scale motion which occurs on
long timescales. However, this only reinforces the need to examine small covariance
eigenvectors to isolate functional motions at small length scales and at whatever timescales
are available within an MD simulation.
Although most PCA studies have focused almost exclusively on large-scale motions
to date, there is nothing intrinsic to PCA that gives more meaning to large covariance
motions than to small ones. In chapter 4 we conduct a comprehensive analysis of the entire
set of PCs and find non-Gaussian distributed PCs with small covariance eigenvalues. We
argue that these are also ‘essential’ in the same sense as the largest components (112), in that
they span the anharmonic portion of the free energy landscape.
2.2.3: PCA vs. NMA of Proteins
The anharmonic portions of the free energy landscape are interesting for a number of
reasons, and foremost among them is the ability to capture multimodal dynamics which can
describe conformational transitions. However, this is not to say that harmonic portions of the
landscape are unimportant. The calculation of normal modes dates back to the early 1970’s
with the work of Gō and Scheraga (125), who undertook a systematic search of minimum-
energy conformations of cyclo-hexaglycyl (126). Much has been learned since then from
harmonic approximations around an equilibrium structure through normal mode analysis
(NMA) (74, 127-132) and elastic network models (ENM) (133-139). Both of these
techniques are based on a single energy-minimized structure, and do not involve dynamics.
In NMA a molecular force field is chosen (such as CHARMM 22) to calculate the potential
energy of the structure, and by taking a harmonic approximation around every degree of
freedom (see Appendix 1) a list of collective modes is generated and ranked from highest to
lowest (spatial) frequency, and thereby energy. ENM is similar, but involves another
approximation which eliminates the force field entirely by replacing all interactions with a
simple harmonic spring linking each atom to every other atom within a cutoff radius. Hence
NMA and ENM are computationally much less expensive than PCA.
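To make the ENM construction concrete, the following Python sketch builds the Hessian of a simple elastic network in which every pair of sites closer than a cutoff is joined by an identical harmonic spring. It assumes the commonly used anisotropic-network form with a uniform spring constant; the function name anm_hessian, the parameter gamma, the cutoff value and the random coordinates are illustrative assumptions and are not taken from the studies cited above.

```python
import numpy as np

def anm_hessian(coords, cutoff, gamma=1.0):
    """Hessian of an elastic network with one spring per pair within `cutoff`.

    coords : array of shape (n, 3), the single reference structure
    cutoff : pairs separated by more than this distance are not connected
    gamma  : uniform spring constant (assumed identical for all pairs)
    """
    n = coords.shape[0]
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / dist2   # off-diagonal 3x3 block
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block       # diagonal blocks balance the row
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess

# Hypothetical 12-site structure; the cutoff exceeds the box diagonal,
# so every pair of sites is connected.
rng = np.random.default_rng(1)
sites = rng.uniform(0.0, 10.0, size=(12, 3))
hess = anm_hessian(sites, cutoff=20.0)
mode_eigenvalues = np.linalg.eigvalsh(hess)           # ascending order
```

For a connected, non-degenerate network the six smallest eigenvalues vanish (rigid-body translation and rotation) and the remaining ones give the harmonic modes. No trajectory is needed, which is the source of the low computational cost noted above.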
NMA has been used to study protein dynamics in the harmonic approximation for
almost 30 years, since the development of empirical potentials which made it possible to
compute their potential energy landscape (140). Two groups first applied NMA to the study
of a small globular protein, BPTI (127, 128), followed shortly by a comparative study of
trypsin inhibitor, crambin, ribonuclease and lysozyme (129). Typical frequencies resulting
from NMA of these proteins range from 5 to 200 cm⁻¹, and the harmonic approximation
makes the timescales of these motions slower than 10-100 ps. The magnitude of RMS
fluctuations at each atomic site may be investigated as a measure of site flexibility, and
comparisons can be made between NMA analysis and MD simulations (141), or with
experimental data such as neutron scattering (142) and crystallographic B-factors (129).
To study domain motion in larger proteins, a coarse-grained approach is desirable as
it simplifies the energy potentials of a particular force field (e.g. CHARMM, GROMACS,
NAMD etc.) and extracts only the lowest frequency modes. Tirion was the first to
demonstrate that the results of NMA for low frequency modes are insensitive to the details of
any particular MD parametrization (143). The resulting model is known as the Elastic
Network Model (ENM), and many recent studies of large proteins have used this approach
(139, 144-146). A related method is called the Gaussian Network Model (GNM) (134, 147).
GNM has been used to analyze the α-amylase inhibitor (141) and to make comparisons
among homologous proteins within the globin family (148).
While most NMA and ENM studies have focused on the longest wavelengths as in
PCA studies, some have studied the shorter wavelengths as well (74, 116, 131). For
example, the normal modes of the binding pocket of wild-type α-lytic protease were found to
have a symmetric character, vibrating in phase to maintain the size of the binding pocket,
while a non-binding mutant had asymmetric modes which resulted in contraction and
expansion of the binding site (131). I have argued in the Introduction that the carbonyl
oxygen atoms lining the lumen of gramicidin show functionally relevant dynamics. An early
NMA study of gramicidin argued that the frequency separation of collective modes spanning
the whole protein (< 50 cm⁻¹) and modes describing amide plane fluctuations involving
carbonyl oxygen atoms (75 to 175 cm⁻¹) rules out coherent librations of many amide planes
together (74). An analysis of different amide planes showed that their motion was
uncorrelated, and perturbation of the hydrogen bonds resulted in only small changes to the
NMA frequencies. The fact that functional features have been found in the short wavelength
regime of NMA justifies the examination of the same regime in PCA of simulated MD
trajectories, where the results include the influence of an anharmonic molecular force field as
well as temperature and solvent effects.
It is worth noting a few similarities and differences between PCA and NMA. Both
techniques yield a set of eigenvectors whose components describe the directions and
magnitudes of atomic displacements across the molecular structure, and associated
eigenvalues which describe the spatial amplitude of these eigenvectors. However, PCA
derives its eigenvectors from a time trajectory of all atoms, and as such allows for the
exploration of phase space which gives rise to entropic forces such as the hydrophobic effect.
MD simulations also include a thermostat which regulates the system at a finite temperature,
thereby including the entropic component of free energy. While the entropy can also be
approximated from the curvature of the energy minimum in NMA, this is exact only in the
zero temperature limit where the results of PCA and NMA converge. On the other hand,
PCA describes a dynamic molecular structure at finite temperature, which allows for
thermally activated transitions over energy barriers. Hence the harmonic approximation
lacks the ‘essential’ anharmonicity of an atomistic force field which allows for
conformational changes on a multimodal free energy landscape. While analytical methods
that include entropic terms arising from multiple conformational basins have also been
developed (149), again these approximations are only valid in the low-T limit.
The influence of temperature as well as solvent makes biomolecular dynamics
intrinsically dissipative and over-damped. This makes it problematic to translate the spatial
wavelengths from NMA into temporal frequencies (thereby imitating dynamics), whereas the
time-ordered dynamics of collective modes can be extracted from a simulation by casting the
full dynamics onto a given PCA eigenvector. Also, while the directions of motion described
by NMA eigenvectors are expected to give reliable information regarding protein flexibility,
the spatial magnitudes of NMA eigenvalues are not expected to be physically meaningful
(130). By contrast, PCA eigenvalues describe the real spatial amplitude of motion observed
in the full (simulated) dynamics. Since the non-Gaussian features in PCA describe precisely
that portion of the free energy landscape which is lacking in the harmonic approximation,
this motivates and necessitates the generation of MD trajectories. For these reasons, the
remainder of this study is focused on PCA and extensions thereof.
2.2.4: PCA and its Development in Climatology
PCA is one of a number of eigenvector techniques such as Common Factor Analysis
(CFA) which have their roots in social science and go back to Pearson in 1902 (150), and
later to Hotelling in the 1930’s (105, 106) who first used the term PCA. Lorenz introduced
the technique to atmospheric science and climatological modeling in 1956 (151), where it is
known as Empirical Orthogonal Functions (EOFs). EOFs are widely used in climate
research to identify dominant patterns of variability and reduce the dimensionality of climate
data. From the 1960’s through the 1980’s, climatologists developed many elaborations of the
basic EOF/PCA technique which either extract more information from complex spatio-
temporal data or make the resulting collective modes more physically meaningful or
interpretable. An excellent review is available in Hannachi et al. 2007 (107). It is
advantageous to study the implementation of EOFs in climatology for two reasons: the use of
the technique is much more fully developed with a long history in this discipline, and there
are known climatological patterns (such as variations in sea level pressure throughout the
year, or the shape and distribution of ocean currents) against which comparisons can be made
with PCA/EOF calculations. Having these as a target has allowed for the development of
techniques which can get the “right” answer. The ‘interpretability’ of ‘physical’ modes is the
essential task of structural biology, and in MD we generally do not know what our target is.
Richman (152) reviews four major pitfalls to the conventional use of principal
components. The first is the domain shape dependence of EOF/PCs, that is, the geometry of
boundary conditions surrounding the data set. Buell first illustrated that the topography of
PC patterns is predictable primarily as a function of the geometric shape of the data domain,
and not the covariation of the data (153, 154). Calahan (155) notes that the ‘Buell patterns’
are closely related to spherical harmonics when represented on a sphere, and ‘patch
harmonics’ when represented on a limited domain such as a rectangle. The second drawback
is subdomain instability, a corollary of domain shape dependence; the shapes of eigenvectors
within a subdomain (i.e. a subset of atoms, or weather stations) are in general not the same as
when PCA is executed on the full domain (156). A third problem has to do with sampling
errors becoming very large if neighbouring eigenvalues have similar values, and eigenvectors
become strongly mixed (157, 158). Indeed, some authors have warned that interpretation of
eigenvector shapes is virtually meaningless in such cases (159, 160). Finally, a comparison
of known input patterns with the results of PCA on the combined inputs (161), as well as an
analysis of patterns with obvious physical interpretations (156), has shown that PCs often
have no physical basis for interpretation, and rotation or other transformations are needed to
yield intuitively and physically meaningful patterns of variance. A number of climatological
studies (162-165) indicate that the mathematical constraints of orthogonal PCs which account
for successively maximal residual variance can impair the straightforward physical
interpretation of the modes. Real physical modes, both in atmospheric and biomolecular
science, do not necessarily exhibit this characteristic because physical processes are generally
not independent, and therefore physical modes are expected in general to be non-orthogonal.
Nor are they necessarily uncorrelated in time. Hence, despite the large number of studies
which focus on the structure of one or two dominant PCs in the study of protein dynamics,
the climatological literature suggests there is reason to believe that the shapes of individual
PCs may be meaningless on their own, and extensions of PCA must be developed to yield
physically meaningful results when applying PCA to the MD of proteins.
There is a broad class of extensions to EOF which aim to reduce principal
components to more physically meaningful or interpretable patterns. The most common of
these is ‘rotated’ EOFs, whereby a matrix rotation is executed on the leading group of EOFs
(152), resulting in more variance being concentrated in fewer eigenvectors (that is, the
eigenvalues are pushed either towards one or zero). This is referred to in the literature as
finding EOFs with “simplified structure”. The most common algorithm for orthogonal
rotations is called Varimax (166, 167), which maximizes the difference between fourth order
moments and the square of second order moments (where the kth order ‘moment’ is the sum
across each eigenvector element raised to the kth power). Another related algorithm is
Quartimax (168, 169), which maximizes the fourth order moment of the eigenvectors. It is
worth noting that the normalized 4th order cumulant of a distribution is known as the “kurtosis” κ, and
can be written in terms of the second and fourth moments $\mu_2$ and $\mu_4$ as follows:
$$\kappa = \frac{\mu_4}{\mu_2^2} - 3 \qquad (2.9)$$
For a Gaussian distribution the second moment $\mu_2 = \sigma^2$ is the variance, and the
fourth moment is $\mu_4 = 3\sigma^4$. Substituting $\mu_4$ and $\mu_2$ into Eq. 2.9 gives zero kurtosis for a
Gaussian distribution. Hence kurtosis measures how the shape of a distribution departs from
a Gaussian: it is negative for flatter, lighter-tailed distributions and positive for more sharply peaked, heavier-tailed ones. While both
Quartimax and Varimax optimize the width of eigenvector distributions by means of the
fourth order moment $\mu_4$, we see from Eq. 2.9 that Varimax optimizes for kurtosis κ by
maximizing the difference between $\mu_4$ and the square of $\mu_2$.
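As a numerical check of the kurtosis defined in Eq. 2.9, the short Python sketch below estimates κ from samples using moments taken about the sample mean. The function name excess_kurtosis and the three synthetic samples are assumptions chosen for illustration; they are not data from this work.

```python
import numpy as np

def excess_kurtosis(x):
    """Kurtosis of Eq. 2.9, using moments about the sample mean."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    mu2 = np.mean(d ** 2)          # second moment (variance)
    mu4 = np.mean(d ** 4)          # fourth moment
    return mu4 / mu2 ** 2 - 3.0

rng = np.random.default_rng(0)
gauss = rng.normal(size=200_000)                        # unimodal Gaussian sample
bimodal = np.concatenate([rng.normal(-2.0, 0.5, 100_000),
                          rng.normal(+2.0, 0.5, 100_000)])
two_state = np.array([-1.0, 1.0] * 50)                  # idealized two-state variable

k_gauss = excess_kurtosis(gauss)        # close to zero
k_bimodal = excess_kurtosis(bimodal)    # clearly negative
k_two_state = excess_kurtosis(two_state)
```

The Gaussian sample gives a value near zero, while the symmetric two-well sample gives a strongly negative value. A statistic of this kind can therefore flag projections whose distributions depart from Gaussian form.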
Oblique rotations have in general been more successful in capturing simple structure
than the orthogonal rotations described above. These algorithms attain solutions by
optimizing certain products or differences of eigenvector moments; common oblique
algorithms include Quartimin (168), Biquartimin (170), Oblimax (171), Promax (172) and
DAPPFR (173).
There are various more detailed schemes for obtaining simplified structure, some of
which are discussed in Chapter 11 of Jolliffe 2002 (174). A technique called SCoTLASS
(175-177) successively maximizes variance and constrains EOF patterns to be orthogonal
and ‘simple’ according to a number of rules, such as pushing the size of eigenvector elements
to zero when they are far from the centre of action in a given eigenvector. This is based on a
technique called LASSO (178), an algorithm which solves the problem of unstable regression
coefficients in optimizations involving multiple linear regressions, which implicitly selects
variables by forcing some regression coefficients exactly to zero. There are a number of
similar simplification methods for pushing eigenvalues towards one or zero (179-183). It
should be noted that the extra simplification criteria appropriate for constraining the shapes
of atmospheric modes on a sphere are not necessarily the same as those which would be
appropriate for biomolecules.
Since PCA only uses the average instantaneous covariance in the construction of its
matrix, its eigenvectors lack any time-ordered information. There is a class of modified EOF
techniques called Extended EOFs, which modify the matrix to be diagonalized to include
temporal correlations by expanding this matrix to include new columns of variable values at
two or more different time steps. This allows for memory in the system, and is a much more
realistic vehicle for capturing real physical modes as time-lagged information is included in
the analysis. It was introduced by Weare and Nasstrom (184), further developed by
Broomhead and King (185, 186) for analysis of low order chaotic systems (called Singular
System Analysis – SSA) and multivariate systems (MSSA), and has been used to find
propagating structures in climatological data (187, 188).
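The construction of an Extended EOF matrix can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name extended_eof, the choice of contiguous lags and the synthetic travelling wave are hypothetical and are not taken from the cited climatological studies.

```python
import numpy as np

def extended_eof(data, n_lags):
    """EOFs of a lag-embedded data matrix.

    data   : array of shape (T, p), p variables sampled at T time steps
    n_lags : number of consecutive time steps placed side by side in each row
    """
    T, p = data.shape
    rows = T - n_lags + 1
    # Each row holds the p variables at n_lags successive time steps,
    # so the covariance matrix now contains time-lagged correlations.
    embedded = np.hstack([data[k:k + rows] for k in range(n_lags)])
    embedded = embedded - embedded.mean(axis=0)
    cov = embedded.T @ embedded / rows
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Synthetic travelling wave on 8 positions with a period of 20 time steps.
t = np.arange(200)[:, None]
pos = np.arange(8)[None, :]
wave = np.cos(2 * np.pi * (pos / 8 - t / 20))
eeof_vals, eeof_vecs = extended_eof(wave, n_lags=5)
leading_pair_fraction = eeof_vals[:2].sum() / eeof_vals.sum()
```

For this noise-free wave the two leading extended EOFs carry essentially all of the variance, and each eigenvector spans five successive time steps, so the propagation of the pattern is encoded directly in its shape.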
Finally, there is a category of modified EOF which uses complex numbers in the
construction of the covariance matrix. The real and imaginary parts of the complex number
a+ib may encode information from two different fields of associated variables, e.g. the zonal
(a) and meridional (b) components of wind velocity (189-191). The eigenvectors of this
matrix encode covarying spatial patterns between the two fields. The complex number may
also be used to encode the value of a single field at two points in time separated by a chosen
time lag τ: x(t)+ix(t+τ). This encodes phase information for a particular time lag. With the
right choice of this parameter, propagating patterns may be revealed within a data set.
Frequency domain (FD) EOF also falls under the complex category (192-194), but was
abandoned in climatological research in favour of the more elegant Hilbert EOF (195, 196).
The Hilbert transform essentially provides information about the rate of change of x(t) with
respect to t at a given frequency, and has been used to study the monsoon (196-198),
atmospheric angular momentum (199), and coastal ocean currents (200).
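The time-lagged complex encoding x(t)+ix(t+τ) described above can be sketched in Python. The sketch covers only this lagged form, not the frequency-domain or Hilbert variants; the function name complex_lag_eof and the synthetic travelling wave are assumptions made for illustration.

```python
import numpy as np

def complex_lag_eof(data, lag):
    """EOFs of the complex field z(t) = x(t) + i x(t + lag).

    data : array of shape (T, p), real-valued
    lag  : time lag in units of the sampling interval (positive integer)
    """
    z = data[:-lag] + 1j * data[lag:]
    z = z - z.mean(axis=0)
    cov = z.conj().T @ z / z.shape[0]     # Hermitian covariance matrix
    evals, evecs = np.linalg.eigh(cov)    # real eigenvalues, complex eigenvectors
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Synthetic travelling wave with a period of 20 steps; a lag of a quarter
# period (5 steps) puts the real and imaginary parts in quadrature.
t = np.arange(200)[:, None]
pos = np.arange(8)[None, :]
wave = np.cos(2 * np.pi * (pos / 8 - t / 20))
ceof_vals, ceof_vecs = complex_lag_eof(wave, lag=5)
leading_fraction = ceof_vals[0] / ceof_vals.sum()
# np.angle(ceof_vecs[:, 0]) gives the relative phase at each position,
# defined only up to an arbitrary global phase.
```

With the lag matched to a quarter period, a single complex eigenvector captures the whole propagating pattern in this example, which illustrates why the choice of lag matters.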
This brief review makes it clear that the state of the art in PCA/EOF of climatological
data is considerably more advanced than its use in MD. Horel (201) states that in
climatology “principal component analysis was used for many years before its inherent
limitations were fully realized”. Let us hope that its use in molecular dynamics can benefit
from the experience of climatologists; this thesis aims to point the way forward in this
regard. One of the key features of these enhanced PCA techniques is that many of them are
applied to the eigenvectors after weighting by the square root of their associated eigenvalues,
such that the norm squared of each PC is the variance of the corresponding time series. This
is the physical and mathematical basis of our own proposal for enhancing PCA in Chapter 5.
To the best of the author’s knowledge none of the techniques described above (rotated,
simplified, extended or complex PCA) have been applied to protein dynamics, nor does the
MD literature refer to any of the climatology literature cited above, with the exception of
very general references to Jolliffe’s 2002 book (174).
However, a few new developments of PCA on MD data have appeared in recent
years. One of these is ‘Nonlinear’ PCA (NLPCA) which employs hierarchically arranged
neural networks which are trained to build a set of adequate nonlinear mapping functions
between an input vector and its counterpart in PC space (202). When applied to the analysis
of peptide dynamics (triglycine, hexaalanine, and the C-terminal β-hairpin of protein G) it
was shown (203) that this technique reduces the dimensionality of these systems much better
than PCA. In the case of the β-hairpin, 4 NLPCs capture the same structure that is described
by 21 conventional PCs. Furthermore, the free energy landscapes constructed by NLPCA are
much more complex and capture conformational states not apparent in the landscapes
resulting from PCA, and also cleanly separate conformational states which are mixed
together in conventional PCA. Another enhanced PCA technique has been called
‘Multivariate Frequency Domain Analysis’ (MFDA) which is PCA executed on a band pass
filtered process across a range of frequencies (204), and is therefore related to FDEOF.
Applied to the BPTI protein, this study demonstrated that at zero temperature MFDA
eigenvectors are the same as those acquired from NMA, but at 300 K significant differences
become apparent with NMA as well as PCA eigenvectors. By applying the VARIMAX
algorithm to the MFDA eigenvectors this study was able to establish a set of orthogonal
modes which describe BPTI dynamics at each frequency used in the analysis, thereby
directly assigning a unique timescale with each set of eigenvectors (whereas PCA
eigenvectors have many frequencies in the trajectory of each PC). These advancements, in
addition to those employed by atmospheric scientists, suggest that there is ample room for
enhancement and development of PCA as applied to protein dynamics, and no single solution
to the problems described above has been proposed and accepted.
Chapter 3: Convergence of PCA
3.1: Background
Statistical convergence is the first concern of any scientific simulation. Is our system
in equilibrium? Has it exhaustively explored its available phase space? The answers to these
two questions underpin the scope and validity of a simulated result. It is relatively easy to
ensure that a system is in equilibrium by monitoring various energy terms over time,
ensuring that they fluctuate around a consistent average. The second question is much harder
to address, especially when simulating large complex systems with empirically determined
force fields, as in MD. In principle any finite MD simulation is too short to ‘prove’ the
complete exploration of its conformational space; there may always be unexplored states on
the other side of a large kinetic barrier. In practice it is well-known that biomolecular
dynamics have a large spread of relevant timescales ranging from picoseconds to
milliseconds, and the free energy landscape explored by the conformational dynamics of a
protein is complex, multimodal and ‘rough’ in a fractal sense, such that there are effectively
an infinite number of nested minima to explore. Certainly there are simple systems where
this difficulty is eased, but in general the complete exploration of conformational phase space
for the average protein is more than we can hope for. To make progress we need to be able
to quantify how broadly our system has explored its available phase space, and to establish
how converged is converged enough.
MD simulations have always been limited to timescales shorter than we would like.
Every year the average length of what is considered a ‘reasonably long’ simulation increases;
currently it is on the order of 100 ns, and a 1 μs simulation is considered ‘very long’. Ten
years ago 100 ps simulations were the average, and a few ns was considered ‘very long’. Yet
computational biochemists have been making scientific progress with simulations at limited
timescales for over 20 years. For example, while gA has been fruitfully studied on
picosecond timescales for decades, and more recently on the nanosecond timescale necessary
to reasonably describe membrane dynamics, it is known that the association time of the two
monomers in a membrane is on the order of 100 ms. Not only does this mean that the “brute
force” atomistic study of dimerization kinetics (i.e. at equilibrium, as opposed to biased or
steered towards this reaction) is out of reach for one of the simplest possible protein dimers,
we are unlikely to even observe a single dissociation event with current simulations.
However, this does not mean we cannot study the many interesting faster processes of the
associated dimer. This is the functional state of gA, which is why technically “unconverged”
simulations of this molecule are still extremely informative. This highlights the need to be
somewhat flexible, one might say “reasonable”, about what constitutes a “converged”
simulation; this must be judged with respect to the properties of interest, some of which
converge faster than others. Indeed, simulations of ion channels are of particular interest
since the timescale of ionic diffusion and transport is known to be much shorter than current
simulation times.
Since conformation changes constitute large covariant changes in atomic positions,
PCA has been a useful technique when quantifying the convergence of a system’s
exploration of conformational phase space. There are two aspects to this characterization:
structural and dynamic. On the one hand, the consistency of PC eigenvector shapes at
various timescales measures the convergence of spatial characteristics by quantifying how
quickly the exploration of new conformations slows down. On the other hand, the
distribution of states over time will converge differently, depending on how often different
states are visited and how long it takes to achieve equilibrium state populations. The spatial
characteristics are determined by thermodynamics, i.e. the potential energy surface, while the
distributions are determined by dynamics, which include stochastic processes and activation
barriers. We investigate both these aspects of convergence by studying the PCA
eigenvectors and eigenvalues for trajectories of various size and duration. We do this first
for the backbone, since this is the standard practice in the literature, and yields considerably
simpler (and in the case of gA, unimodal) distributions of eigenvector projections over the
simulation trajectory. We compare results for both NCαC and NHCαCO atoms in order to
highlight any differences which may arise from the inclusion of hydrogen bonding elements.
For comparison with multimodal behavior we also analyze the convergence of side chain and
solvating GMO dynamics.
3.2: Convergence of Structure: Overlap of Covariance Matrices
The eigenvalue-normalized overlap s(A,B) introduced by Hess (205) to measure the
‘distance’ between two matrices has been adopted as a measure of convergence in a number
of studies (122, 206, 207):
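$$ s(A,B) = 1 - \frac{\sqrt{\mathrm{tr}\left[\left(A^{1/2} - B^{1/2}\right)^{2}\right]}}{\sqrt{\mathrm{tr}\,A + \mathrm{tr}\,B}}, \qquad (3.1) $$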
where A and B are covariance matrices defined by the eigenvectors and eigenvalues from the
PCA of two different trajectories, such that
$$ A^{1/2} = v \, \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_N) \, v^{T}. \qquad (3.2) $$
Here v is the complete eigenvector matrix and diag(σk) is a diagonal matrix with the square
roots of all eigenvalues σk². It is conventional to compute the overlap of halves (122, 206) or
thirds (207) of an MD trajectory, to estimate the degree to which the conformational space
explored by a trajectory has ceased to expand. Note that this measure of overlap, by
including the entire covariance matrix in s(A,B) weighted by its eigenvalues, is dominated by
the characteristics of the longest PCs. The convergence of short PCs with small eigenvalues is
likely to be much faster than that of the longest PCs.
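A minimal sketch of the eigenvalue-normalized overlap is given here for concreteness. It assumes the usual form of the measure, one minus the distance between the matrix square roots divided by the square root of the summed traces; the function names are hypothetical and the matrices are toy inputs, not MD data.

```python
import numpy as np

def sqrt_psd(M):
    """Symmetric square root of a positive semi-definite matrix via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(M)
    evals = np.clip(evals, 0.0, None)            # guard against tiny negative round-off
    return (evecs * np.sqrt(evals)) @ evecs.T    # v diag(sigma) v^T

def covariance_overlap(A, B):
    """Normalized overlap s(A,B): 1 for identical matrices, 0 for orthogonal subspaces."""
    diff = sqrt_psd(A) - sqrt_psd(B)
    d = np.sqrt(np.trace(diff @ diff))
    return 1.0 - d / np.sqrt(np.trace(A) + np.trace(B))

# Identical covariance matrices give an overlap of one.
rng = np.random.default_rng(0)
A = np.cov(rng.normal(size=(500, 6)), rowvar=False)
s_same = covariance_overlap(A, A)

# Matrices whose variance lies in mutually orthogonal subspaces give an overlap of zero.
P = np.diag([1.0, 2.0, 0.0, 0.0])
Q = np.diag([0.0, 0.0, 1.0, 2.0])
s_orth = covariance_overlap(P, Q)
```

Because the square roots are weighted by σk, directions with large variance dominate the distance, which is why the measure mostly reflects the longest PCs.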
3.2.1: Backbone of gA, SS and RR: Converged Eigenvectors
To demonstrate the convergence of sampling for our MD simulations of gA, as well
as certain PCA results derived from them, we calculate the overlap s(A,B) of eigenvector
matrices A and B derived from independent PCA of different time windows within a
trajectory. In Fig. 3.1 (top) we show the overlap s(AΔT,BTtot) of subsets ΔT in a trajectory
with its full duration Ttot, as done by Hess (205). These curves are necessarily equal to one at
ΔT=Ttot, since they overlap increasingly large portions of the same trajectory with each other.
This accounts for their exponential scaling and small error bars. Other studies have
computed s(AΔT1,BΔT2) for independent simulations or non-overlapping sub-segments ΔT1
and ΔT2 of a trajectory, where ΔT1=ΔT2=1/2 Ttot (122, 206), or 1/3 Ttot (207). To
generalize this approach, in Fig. 3.1 (bottom) we show the average overlap s(AΔTk,BΔTk+1) of
all consecutive trajectory sub-segments of equal duration ΔTk and ΔTk+1. The horizontal axis
of this curve extends over increasing durations between 200 ps and 64 ns in the simulation
with SHAKE, or between 50 ps and 10 ns in the simulation without SHAKE. The average
overlap of half the 64 ns trajectory with its full length is 0.93, while for the 10 ns simulation
this quantity is 0.91. For our 64 ns simulation the overlap of two 32 ns segments is 0.89, the
average overlap of 8 ns segments is 0.85, and for 1 ns segments it is 0.81. For our
unconstrained 10 ns simulation the average overlap of 5 ns halves is 0.84, and for 1 ns
segments it is 0.82. The consistency of overlap values between simulations with and without
SHAKE gives us additional confidence in these results.
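The averaging over consecutive sub-segments of equal duration can be sketched as below. This is an illustrative outline under stated assumptions: the trajectory is a plain (frames x coordinates) array, windows are non-overlapping, and the names `overlap` and `consecutive_overlaps` as well as the synthetic data are hypothetical.

```python
import numpy as np

def overlap(A, B):
    """Eigenvalue-normalized overlap of two covariance matrices."""
    def root(M):
        w, v = np.linalg.eigh(M)
        return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T
    diff = root(A) - root(B)
    return 1.0 - np.sqrt(np.trace(diff @ diff) / (np.trace(A) + np.trace(B)))

def consecutive_overlaps(traj, window):
    """Overlaps between each pair of adjacent, non-overlapping windows of `window` frames."""
    n_win = traj.shape[0] // window
    covs = [np.cov(traj[i * window:(i + 1) * window], rowvar=False) for i in range(n_win)]
    return np.array([overlap(covs[i], covs[i + 1]) for i in range(n_win - 1)])

# Synthetic stationary data standing in for a trajectory (not MD output).
rng = np.random.default_rng(1)
traj = rng.normal(size=(4000, 9)) * np.linspace(3.0, 0.5, 9)

for window in (250, 500, 1000, 2000):
    s = consecutive_overlaps(traj, window)
    # Mean and one standard deviation over the available window pairs.
    print(window, s.mean(), s.std())
```

At the longest window only one pair of segments exists, so the standard deviation there carries no information; this is the same limitation that applies to the half-trajectory point of any such curve.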
What level of convergence these numbers reflect can only be answered by
comparison with values obtained by similar studies. Our overlap values are in agreement
with another study which computed the PCA overlap for simulations of gA embedded in a
DMPC membrane (206). The authors analyzed the convergence of PCA for the backbone of
membrane proteins of various size on the 10 ns timescale, using gA as a comparative
standard for convergence of simulations of larger proteins. For gA the overlap of two
independent 8 ns trajectories was 0.82 while for two 4 ns trajectories it was 0.8. The overlap
of half a trajectory with its full 8 ns length was between 0.88 and 0.92. The study concluded
that “multi-nanosecond molecular dynamics calculations can provide satisfactory, albeit not
perfect, conformational sampling”. Grossfield et al. (122) have studied the convergence of
26 independent 100 ns simulations of rhodopsin solvated in a membrane containing 99
phospholipids with 1–stearoyl–2-docosahexaenoyl fatty acyl chains attached to 49
phosphatidylcholine and 50 phosphatidylethanolamine headgroups, and 24 cholesterols.
They show that different parts of a large protein exhibit very different convergence of their
respective PCA. The whole protein had a narrow distribution of overlap values centred on
0.2 and the transmembrane helices centred on 0.4. Both extracellular and cytoplasmic loops
had broad distributions ranging from 0.2 to 0.7. Only the CI loop, which is small and
stabilized by secondary structure (and is thereby comparable to gA) converged very well
with overlap values distributed mostly over 0.8. However, this was also the only bimodal
distribution, whose minor peak is centred on 0.3.
Figure 3.1: Eigenvalue-normalized overlap between a sub-trajectory of length ΔT and the full trajectory (top) and overlap between two consecutive trajectories of length ΔT (bottom), averaged over the number of samples available in the trajectory. The analysis is carried out for the gA main chain NCαC atoms (left) as well as the NHCαCO backbone atoms (right), for both the simulations with (solid circles) and without SHAKE (open circles). Error bars indicate one standard deviation.
Given that gA is a single transmembrane helix which is entirely folded into its
secondary structure, we would expect that convergence is much more easily achieved here
than most proteins in the PCA literature, even in their more stable sub-segments. Indeed, our
convergence curves for the gA backbone (Fig. 3.1) are all at higher overlap values than those
quoted for particular timescales in the studies above. Taken together these studies also
suggest that an overlap of s(A,B)=0.8 is an acceptable value of convergence.
In Fig. 3.2 we compare the structural convergence s(AΔTk,BΔTk+1) of the gA backbone
eigenvectors with the convergence of SS and RR backbone eigenvectors. The overlap values
of the linked analogs are lower by ~0.05 at timescales shorter than ~2 ns, but are almost
identical with the native dimer and are larger than 0.8 at longer timescales. This gives us
confidence that the backbone eigenvectors are converged for all three gramicidin molecules
at timescales longer than 2 ns (as studied in chapter 5). The conformational isomerization of
the dioxolane linker in the RR analog, which was found to occur frequently in 16 ns
simulations (72, 84), may be responsible for the lower values at short timescales. However,
this isomerization was not observed for the SS analog, which has overlap values between
those of gA and RR.
Figure 3.2: Comparison of gA, SS and RR backbone eigenvector convergence for 64 ns
simulations.
3.2.2: Side Chains and GMO: Unconverged Eigenvectors
To highlight the utility of s(A,B) as a measure of convergence, in this section we
show what the timescale-dependent overlap looks like for eigenvectors which are not
converged. In Appendix 2 we show certain properties of the gA side-chains (using PCA)
which demonstrate the inadequate sampling of their multi-modal distribution on the 64 ns
timescale. In Chapter 6 we demonstrate that GMO monomers surrounding the surface of the
gA molecule exchange positions only occasionally on the 64 ns timescale, also inadequately
sampling their multimodal distribution. Hence we would not expect the spatial structure of
gA side-chain or annular GMO eigenvectors to converge for our simulations. In Fig. 3.3 we
show that s(AΔT,BTtot) and s(AΔTk,BΔTk+1) have much smaller values for gA side chains and
annular GMO monomers, and even scale distinctly from the backbone case shown above.
s(AΔT,BTtot) is almost flat and ranges from 0.25 to 0.4 for the side chains, and from almost
zero to 0.15 for GMO, for timescales shorter than 10 ns. This part of the curve also has
surprisingly small error bars, with larger error bars apparent when the curve shoots up
towards 1. This is exactly the opposite of the case for NCαC and NHCαCO atoms shown in
Fig. 3.1. Notably, s(AΔTk,BΔTk+1) decreases over most of its timescale range, from 0.6 to 0.4
for the side chains and from 0.2 to almost zero for GMO. Thus s(A,B) is capable of
discerning converged from non-converged eigenvectors.
It is also worth noting that s(A,B) is a measure of the similarity between two
complete PCA matrices where the weighting of all eigenvectors by their associated
eigenvalues makes the leading eigenvectors the dominant terms. Hence the results in this
section represent the convergence of the longest and most likely the slowest modes in the
system. In general we would expect eigenvectors associated with covariant motion at small
length scales to converge faster than the longest components of motion.
Figure 3.3: Un-converged systems: overlap of side-chains and solvation lipids
3.3: Convergence of Dynamics: Average Distributions and Deviation from Gaussian
Gaussian distributions are indicative of motion on a harmonic free energy
landscape, while non-Gaussian distributions are the result of anharmonicity on this
landscape. This follows from the definition of free energy G = -k_B T log(P), where a Gaussian
probability P = exp[-ax^2] yields the harmonic function G = Kx^2 with spring constant K = a k_B T
(here x is a collective reaction coordinate representing protein conformation). This makes the
shape of PC distributions of considerable interest to structural biologists. Anharmonicity
could be evidence of multi-modal dynamics, coupling among modes, activated transitions or
trapped kinetics. A number of studies have observed and interpreted non-Gaussian yet uni-
modal distributions in the largest PCs of the backbone for various proteins on the timescale
of a few nanoseconds (112, 113, 116). It is therefore of interest to know whether this is an
artifact of insufficient sampling, in which case these PCs would converge to Gaussian
distributions given enough sampling time, or whether this is an intrinsic anharmonicity in the
free energy surface explored by protein conformations. A similar question may be asked of
multimodal distributions: do they converge to the sum of Gaussians with different centres, or
is the shape of the distribution indeed non-Gaussian? It is important to note that the spatial
shapes of eigenvectors described in section 3.2 converge much faster than their dynamics: the
former requires a quarter oscillation in time, while the latter requires many cycles to
adequately converge the distribution of states.
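The link between a Gaussian distribution and a harmonic free energy stated at the start of this section can be written out explicitly. The short derivation below uses the same symbols as the text (a, K, x); the variance σ² is introduced here only to relate a to the width of the distribution, and the additive constant comes from the normalization of P.

```latex
\begin{align}
P(x) &\propto \exp\!\left[-a x^{2}\right], \\
G(x) &= -k_B T \ln P(x) = a\,k_B T\,x^{2} + \text{const}
      \equiv K x^{2} + \text{const}, \qquad K = a\,k_B T, \\
a &= \frac{1}{2\sigma^{2}} \quad\Longrightarrow\quad K = \frac{k_B T}{2\sigma^{2}} .
\end{align}
```

A PC with a larger variance therefore corresponds to a softer effective spring, and any departure of ln P from a parabola signals anharmonicity along that coordinate.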
To compare the acquired distribution Pk with a Gaussian distribution, we normalized
each PC trajectory by its standard deviation σk (the square root of the kth covariance
eigenvalue σk²). We also re-binned the distributions into a common 100 bins to align all
distributions with each other for comparison. The resulting normalized distributions PkN
have σk=1, and their shape can be compared against a Gaussian distribution of unit variance
and height (2π)^(-1/2), by taking the difference between this Gaussian and PkN. The resulting
ΔPkN will be a flat line at zero if the acquired distribution is Gaussian; otherwise the curve
will show positive and negative deviations from zero.
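The construction of the difference curve can be sketched as follows. This is an illustration only: the bin range of ±4 standard deviations is an assumption (the text specifies only that 100 common bins were used), the function name `delta_p` is hypothetical, and the two test signals are synthetic.

```python
import numpy as np

def delta_p(pc_traj, n_bins=100, x_max=4.0):
    """Normalized histogram of a PC projection minus a unit Gaussian.

    The projection is scaled to unit standard deviation, binned into n_bins
    common bins on [-x_max, x_max] (x_max is an assumed range), and compared
    with exp(-x^2/2)/sqrt(2*pi) evaluated at the bin centres.
    """
    z = (pc_traj - pc_traj.mean()) / pc_traj.std()   # mean is ~0 for a PC projection
    edges = np.linspace(-x_max, x_max, n_bins + 1)
    hist, _ = np.histogram(z, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    gauss = np.exp(-0.5 * centers**2) / np.sqrt(2.0 * np.pi)
    return centers, hist - gauss

rng = np.random.default_rng(2)

# A Gaussian signal: the difference curve stays close to zero everywhere.
x_gauss, dp_gauss = delta_p(rng.normal(size=200_000))

# A bimodal signal: large positive lobes at the two modes, negative at the centre.
bimodal = np.concatenate([rng.normal(-2.0, 0.5, 100_000), rng.normal(2.0, 0.5, 100_000)])
x_bi, dp_bi = delta_p(bimodal)
```

Using common bin edges for every PC is what allows difference curves from different windows to be averaged point by point.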
While a trajectory distribution does not have any time-ordered information in it, it
may be suggestive of the statistical properties which generate the trajectory in time. For
example, Mandelbrot and van Ness (208) connect the non-Gaussian properties of a trajectory
distribution to non-Brownian (or “fractional” Brownian) properties of its time-ordered
behavior. That is, if consecutive steps in a noisy trajectory are uncorrelated they will have a
Gaussian distribution and their mean square displacement (MSD) will exhibit a linear
dependence on time. This is the signature of diffusive (Brownian) motion. On the other
hand, if steps are anti-correlated the distribution will be narrower than a Gaussian (hugging
close to the average and under-sampling extremes, low kurtosis) and the MSD will scale with
a power less than 1. This sub-linear scaling is the signature of sub-diffusion, since this
trajectory moves more slowly away from its centre than it would by thermal diffusion. For a
trajectory with correlated steps the distribution will be broader than a Gaussian (under-
sampling near the average and over-sampling the extremes, high kurtosis) and the MSD will
scale with a power greater than 1. This super-linear scaling is indicative of super-diffusion,
in that the trajectory moves away from its centre faster than it would by thermal diffusion.
Hence the occurrence of maxima or minima in ΔPkN on either side of the inflection points at
±1 may be indicative of anomalous diffusion in the time evolution of PCs.
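The terms "low kurtosis" and "high kurtosis" used above can be made concrete with a small numerical example. The sketch below only illustrates the vocabulary for distribution shapes; it does not test the connection to sub- or super-diffusion, and the choice of uniform and Laplace distributions as examples is ours.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so that a Gaussian scores zero."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(3)
n = 200_000

k_gauss = excess_kurtosis(rng.normal(size=n))             # theoretical value 0
k_uniform = excess_kurtosis(rng.uniform(-1.0, 1.0, n))    # theoretical value -1.2 (under-sampled extremes)
k_laplace = excess_kurtosis(rng.laplace(size=n))          # theoretical value +3 (over-sampled extremes)
```

A single scalar such as this is a coarser diagnostic than the full difference curve, but it gives a quick sign for the direction of the deviation from a Gaussian.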
3.3.1: Backbone and Side Chains of gA
Fig. 3.4 shows <ΔPkN> for the gA backbone with and without hydrogen bonding
atoms (NHCαCO, NCαC), and for the heavy side chain atoms (SIDE). Each panel shows
results averaged over multiple windows of width 1 ns (64 samples) to 64 ns (1 sample), taken
from the 64 ns simulation with SHAKE. A single distribution for the entire 10 ns simulation
without holonomic constraints on hydrogen atoms is also shown in each panel for comparison,
since short PCs probe dynamics involving hydrogen atoms. PC1 is shown on the left and
PC2 is shown on the right.
It is well known that the side chains of gA have multiple conformations, especially
the Trp residues (209). Trp 9 in particular has been observed in two distinct states using
NMR: in the 1MAG (PDB ID) structure obtained by solid state NMR studies it is stacked on
the Trp 15 residue (44, 45), while in the 1JNO (PDB ID) structure obtained by solution state
NMR it is splayed away from Trp 15 in the opposite orientation (210). This discrepancy has
been resolved by MD studies which concluded that Trp 9 spends 80% of its time in the
splayed orientation and 20% in the stacked orientation (211). Hence the multimodal
distributions of side chains (Fig. 3.4E) are not surprising, and reflect this conformational
flexibility (see Appendix 2 for more details). However, the non-Gaussian features of the
longest backbone PCs (Figs. 3.4A and 3.4C) exhibit a similar multimodal profile at short
timescales, though with smaller amplitude. Although these backbone PCs have the
appearance of super-diffusive distributions at short time scales, at long timescales <ΔPkN>
approaches zero, indicating that the longest PCs are actually harmonic (Gaussian). This time
dependence is an artifact of inadequate sampling, as the distribution of an oscillation sampled
over less than the order of a wavelength appears asymmetric; averaging together the
distributions of many such sub-cycles would yield a super-diffusive profile. This analysis
suggests that the dynamics of the gA backbone converge around 10 ns. In Chapter 4 we
contrast this behaviour with the non-Gaussian features of the short PCs.
Figure 3.4: Average difference from Gaussian distributions at various timescales, for the gA
backbone and side chains.
3.3.2: Backbone of SS and RR
In Fig. 3.5 we show similar results for <ΔPkN> of the SS and RR NCαC backbone.
Deviations from a Gaussian for the leading PCs wash out at about the same timescale as
for the gA backbone, though they are in general larger in PC1 and more apparent and
persistent in PC2 for SS and RR than for gA. This may be explained by the structural
dislocation caused by the dioxolane linker. It has been shown that this linker has four
conformational states in the RR molecule (72, 84), and we might expect this to influence the
distribution of the longest PCs. Apparently the dislocation is local enough that this is not the
case, though in Chapter 5 we show that it does make a difference in the shapes of the
eigenvectors. From this we learn that the linker exerts its influence on the thermodynamics
of the molecule by changing the potential energy landscape, rather than changing its
dynamics by introducing kinetic barriers (at least for the longest PCs).
Figure 3.5: Average difference from Gaussian distributions at various timescales, for the SS
and RR main chain.
3.4: Summary and Conclusions
We have seen that for timescales over ~2 ns the spatial shapes of backbone
eigenvectors for all three gA analogs have converged adequately, in that PCs extracted from
longer portions of the simulation yield very similar eigenvectors. On the other hand, the
multimodal dynamics of the side chains as well as the solvating GMO molecules do not show
this convergent behavior, and in fact the shapes of their eigenvectors are increasingly
dissimilar at longer simulation times. The distributions obtained from projecting the
eigenvectors onto the simulation trajectory also offer another measure of convergence, but in
this case it is the convergence of dynamics rather than eigenvector structure. These
distributions show that an apparently super-diffusive behavior at short timescales disappears
for trajectory lengths over ~10 ns in the case of the backbone, whereas it is persistent in the
case of the multimodal dynamics exhibited by side chains and GMO molecules. These
results give us confidence that the backbone eigenvectors and eigenvalues examined in
Chapters 4 and 5 are statistically meaningful, and also delineate the scope of applicability for
PCA of side chains and solvating molecules.
Chapter 4: Anharmonic Features of Collective Modes
The work described in this chapter has been published with the following reference: Kurylowicz M, Yu CH, Pomès R: A Systematic Study of Anharmonic Features in the Principal Component Analysis of Gramicidin A (2010). Biophys. J. 98 (3), 386-395.
In this chapter we present a number of quantitative measures which identify
anharmonic collective motions in gramicidin A: eigenvalue scaling, non-Gaussian PC
distributions and the Mean Square Deviation (MSD). We first study the anharmonic features of
motions in the large covariance regime traditionally studied by PCA; as shown in
Chapter 3, the anharmonicity of these motions is timescale dependent and disappears in
simulations longer than 10 ns. Prompted by the observation of distinct scaling regimes in
the eigenvalue spectrum, we go on to study the MSD and distributions of PCs in the small
covariance regime, where we show that anharmonic features persist over all timescales
studied here. This allows us to isolate bands of PCs which describe short and fast collective
motions which are associated with hydrogen bonding. We focus on a description of one
mode with known functional consequences in the channel backbone: the libration of amide
planes (55, 70, 72, 73, 78-80, 212, 213).
4.1 Scaling of PCA Eigenvalues
The complete PCA eigenvalue spectra for various atomic subsets of gA are shown in
Fig. 4.1. These results are taken from the 10 ns simulation without constraints on hydrogen
atoms, and with a 0.5 fs time step; simulations with SHAKE do not yield the correct
eigenvalues at the short-PC end of the spectrum because they freeze the covalent bond
vibrations of hydrogen atoms. This in turn limits the number of collective degrees of
freedom to less than 3N-6, and yields artificially small eigenvalues for degrees of freedom
involving hydrogen atoms. On the other hand, the long PCs in the 10 ns simulation differ
very little from those of the 64 ns simulation (with holonomic constraints) used for
comparison; the PCA matrix from the 64 ns and 10 ns simulations has an overlap of 0.88.
Each curve in Fig. 4.1A shows the variance of all principal components for a different
atomic subset of the molecular structure: a single atom per residue (Cα), backbone atoms
(NCαC and NHCαCO), side chain atoms (SIDE and SIDEH), and the combined atom set
(ALL and ALLH). While the SIDEH and ALLH sets include all hydrogen atoms, the
NHCαCO curve includes only the amide hydrogen in order to emphasize dynamics within
secondary structure involving hydrogen bonds, and to explicitly capture amide plane
motions. While the long-PC spectra for the whole protein (ALL) and the side chains (SIDE)
are almost identical, the scaling of their short PCs is significantly different. This suggests
that the small eigenvalues and eigenvectors may encode real physical information about the
behaviour of our system, and are not just noise to be ignored as commonly done in previous
PCA studies of protein motion.
There are generally two scaling transitions in the spectra, one at ~25 PCs and the
other at ~100 PCs. Fig. 4.1B shows three distinct power-law scaling regimes in the heavy-
atom PCA eigenvalues σk = k^(-α), with all linear regressions on the log-log scale scoring
R² > 0.99. While the largest PCs follow a power of α~1, there are significant differences in
scaling of the shorter PCs for different parts of the protein. The mid-size regime of the
backbone scales with α=2, while for side-chains it is distinctly more shallow with α~1.5.
The whole protein (ALL) lacks a clear scaling in this mid-scale regime, making a smooth
transition towards steeper scaling at the shortest end of the PC spectrum. In this short-PC
regime the backbone scales with roughly α~2, the side-chains scale much more steeply with
α=4, while the whole protein (ALL) approaches an average between the two, with α=3.
Different numbers of PCs span the same scaling features in these spectra for different atomic
inclusions (e.g. Cα vs NCαC), suggesting that blocks of PCs span statistically distinct regimes
of motion. Hence the scaling shown in Fig. 4.1 may be used as a guide to search for
components of motion with interesting statistical features and determine the boundaries
between distinct regimes of principal components.
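A power-law exponent of this kind is typically obtained from a straight-line fit on log-log axes. The sketch below shows one way to do so; the fitting ranges, the function name `fit_power_law` and the synthetic two-regime spectrum are illustrative assumptions and do not reproduce the spectra of Fig. 4.1.

```python
import numpy as np

def fit_power_law(eigvals, k_min, k_max):
    """Fit log10(eigenvalue) against log10(index) for PCs k_min..k_max (1-based, inclusive).

    Returns the power alpha (the negative of the fitted slope) and the R^2 of the fit.
    """
    k = np.arange(k_min, k_max + 1)
    x = np.log10(k)
    y = np.log10(eigvals[k_min - 1:k_max])
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.var() / y.var()
    return -slope, r2

# Synthetic spectrum: alpha = 1 for the first 25 PCs, alpha = 2 beyond,
# joined continuously at k = 25 (hypothetical values chosen for illustration).
k = np.arange(1, 301)
spectrum = np.where(k <= 25, k**-1.0, 25.0 * k**-2.0)

alpha_long, r2_long = fit_power_law(spectrum, 1, 25)
alpha_mid, r2_mid = fit_power_law(spectrum, 26, 100)
```

Fitting each regime separately, rather than the whole spectrum at once, is what makes the transitions between regimes visible as changes in the fitted power.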
Figure 4.1: Log10-log10 plots of the complete PCA eigenvalue spectrum of gramicidin A as a function of eigenvalue index i. A: Spectra for the backbone, side chains and whole protein without hydrogen atoms (Gray: Cα, NCαC, SIDE, ALL) and with them (Black: NHCαCO, SIDEH, ALLH). The data have been thinned towards the high indices for clarity. B: The complete PC set for heavy atoms, with linear regressions in regions of different power-law scaling. The ALL curve has been translated upwards for clarity (+ci). Bold lines indicate the range included in the fit, while the thin lines are guides for the eye. The bold number above each line indicates the slope of the fit (i.e. the power α), and the R² value for the linear fit is italicized in brackets below the slope.
4.2 Non-Gaussian PC Distributions
Gaussian distributions are indicative of motion on a harmonic free energy landscape,
while non-Gaussian distributions are the result of anharmonicity on this landscape. To
compare the acquired PC distribution Pk with a Gaussian, we normalized each PC trajectory
by its standard deviation σk (the square root of the kth covariance eigenvalue σk²). We also
re-binned the distributions into a common 100 bins to align all distributions with each other
for comparison. The resulting normalized distributions PkN have σk=1, and their shape can
be compared against a Gaussian distribution of unit variance and height (2π)^(-1/2). Fig. 4.2
shows ΔPkN, the difference between the acquired PC distributions and a unit Gaussian, for the
gA backbone with and without hydrogen bonding atoms (NHCαCO, NCαC), and for the
heavy side chain atoms (SIDE). Each panel shows results averaged over multiple windows
of width 1 ns (64 samples) to 64 ns (1 sample), taken from the 64 ns simulation, as in Fig.
3.4. A single distribution for the entire 10 ns simulation without holonomic constraints is also
shown for comparison, since short PCs probe dynamics involving hydrogen atoms. Here we
compare a representative long PC (PC1, left), as shown in Chapter 3, with a representative
short PC (right). By contrast with the long PCs, there is no dependence on timescale for the
non-Gaussian features of the short PCs (Fig. 4.2B, D, F), indicating that these sub-diffusive
profiles (under-sampling extremes and over-sampling the average) correspond to persistent
anharmonic aspects of backbone and side-chain dynamics. The short PCs shown in Fig. 4.2
are representative of groups of neighbouring PCs which have similar distributions. Fig. 4.3
highlights this by showing a group of 5 long and short PCs from the 10ns simulation
averaged over ten 1 ns windows.
We further emphasize the band structure of non-Gaussian short PCs in Fig. 4.4, which
shows ΔPkN surfaces for all 270 PCs for NCαC atoms, 470 PCs for NHCαCO and 430 PCs for
SIDE atoms, from the 10 ns simulation. Flat regions indicate PCs with nearly perfect
Gaussian distributions (ΔPkN<0.01), while peaks and valleys indicate non-Gaussian
distributions and suggest anomalous diffusion of those PCs in time. The landscapes show
central peaks for short PCs indicating sub-diffusion. While the sub-diffusive features at high
PC are concentrated in one band (i.e. a single spatial scale) for NCαC, this is not the case for
the main chain (NHCαCO) or the side chain atoms. The NHCαCO results reveal clusters of
sub-diffusive modes across a number of spatial scales at high PC index. This is also true of
the heavy side chain atoms, where it is interesting to note that these features ride on a super-
diffusive envelope. Fig. 4.4 also shows the RMS deviation between the acquired PC
distribution and a Gaussian curve for each atomic subset. These plots show the distribution
of anharmonic features across the short PC spectrum, and reveal distinct bands of sub-
diffusive components.
Figure 4.2: Difference between eigenvalue-normalized PC distributions and a unit Gaussian, ΔP, for the longest PC (left) and a representative short PC (right) of the backbone and side chains of gramicidin A. PCA was executed independently for multiple windows at various timescales from the 64 ns simulation (with holonomic restraints), and PC distributions were averaged for a given timescale. Results for the 10 ns simulation (without restraints) are also shown for comparison.
Figure 4.3: Non-Gaussian features of long (left) and short (right) PCs for the NCαC backbone (top), the NHCαCO main chain (middle), and side chains (bottom) of gramicidin A. The data were averaged across 10 samples of PC trajectories extracted from PC analyses on 1 ns windows of H-unconstrained simulations recorded every 10 fs. The distributions were normalized by σk (the square root of their eigenvalue) for comparison with a unit Gaussian. Clustered around the dotted Gaussian curve are the acquired distributions P (left axis), while around the origin is shown the difference ΔP between the acquired distribution and the unit Gaussian (right axis). Each 1 ns sample has 100,000 steps, so each distribution in the figures represents about a million data points.
Figure 4.4: Surfaces on the left show the difference between the normalized PC distribution and a unit Gaussian, ΔP, for all components in the PCA of heavy backbone atoms (NCαC), main chain (NHCαCO) and heavy side chain atoms (SIDE) in gramicidin A. The root mean square difference between acquired distributions and a Gaussian distribution is shown on the right.
4.3 MSD and Anomalous Diffusion
All the results presented above describe the spatial characteristics of the system
averaged over time. To make proper contact with anomalous diffusivity, we now study the
time-ordered behaviour of our system by computing the MSD of each PC. The projection
δxk(t) of the full MD trajectory onto each PC is a trajectory of steps whose size is measured
relative to the time-averaged structure. We construct a PC walk to represent the total
displacement along a given eigenvector through the course of our simulated trajectory:
$$ x_k(t) = \sum_{t'=0}^{t} \delta x_k(t'). \qquad (4.1) $$
The MSD is related to the autocorrelation function of any trajectory of displacements in time.
It is the ensemble average of all possible displacements x(t) such that
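$$ \mathrm{MSD}(t) = \left\langle \left[ x(t_0 + t) - x(t_0) \right]^{2} \right\rangle_{t_0}. \qquad (4.2) $$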
The average <> is over all possible origins t0 and for every timescale t in the trajectory.
Since the number of possible origins t0 for a timescale t is (T-t), we can only expect adequate
statistical sampling up to t~T/2.
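The averaging over time origins can be sketched as below. This is a minimal illustration under stated assumptions: the PC walk is taken to be the cumulative sum of the projection steps (our reading of the construction described above), the lag is truncated at half the trajectory length as suggested in the text, and the function names and the random test signal are hypothetical.

```python
import numpy as np

def pc_walk(dx):
    """Total displacement along a PC: cumulative sum of the projection steps."""
    return np.cumsum(dx)

def msd(x, max_lag):
    """MSD at lags 1..max_lag (in frames), each averaged over all available origins t0."""
    lags = np.arange(1, max_lag + 1)
    values = np.array([np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags])
    return lags, values

# Uncorrelated unit-variance steps stand in for a PC projection (not MD data).
rng = np.random.default_rng(4)
dx = rng.normal(size=4000)
x = pc_walk(dx)

# Only lags up to T/2 are kept, since longer lags have too few origins to average over.
lags, m = msd(x, max_lag=len(x) // 2)
```

For uncorrelated steps the MSD grows linearly with the lag, which provides a simple check of the routine before it is applied to correlated data.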
The term “anomalous diffusion” properly applies to systems whose particles have an
MSD which scales nonlinearly in time (214, 215). These are non-Brownian processes which
obey a generalized Einstein relation:
$$ \mathrm{MSD}(t) = 2 d D_{\beta} t^{\beta}, \qquad (4.3) $$
where Dβ is the (anomalous) diffusion coefficient and d is the dimensionality of the system.
If β<1, a process is sub-diffusive in the sense that it moves away from its average more
slowly than Brownian diffusion (“sub”-linear). A sub-diffusive process has anti-persistent
correlations, where consecutive steps are more likely to move in opposite directions than they
would in a random walk. If β>1, a process is super-diffusive in that it moves away more
quickly than Brownian diffusion (“super”-linear). A super-diffusive process exhibits
persistent correlations, where consecutive steps are biased to continue in the same direction.
Note the distinction between this temporal exponent β and a spatial scaling exponent, which
we denote as α in the covariance eigenvalue spectra presented above.
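The temporal exponent can be estimated from an MSD curve by a linear fit on log-log axes, assuming the generalized Einstein form MSD(t) = 2 d D t^β over the fitted range. The sketch below uses exact synthetic power laws; the function name and the numerical values are hypothetical.

```python
import numpy as np

def fit_anomalous_exponent(t, msd, d=1):
    """Fit MSD(t) = 2 d D t^beta on log-log axes and return (beta, D)."""
    beta, log_prefactor = np.polyfit(np.log(t), np.log(msd), 1)
    return beta, np.exp(log_prefactor) / (2.0 * d)

t = np.logspace(-2, 2, 50)       # hypothetical time axis, arbitrary units
D_true = 0.3                     # hypothetical diffusion coefficient

results = {}
for beta_true in (0.5, 1.0, 2.0):   # sub-diffusive, Brownian, ballistic
    results[beta_true] = fit_anomalous_exponent(t, 2.0 * D_true * t**beta_true)
```

On real data the fit should be restricted to a range where the MSD is visibly straight on log-log axes, since a single exponent is not meaningful across a crossover.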
Fig. 4.5 shows the MSD for a representative subset spanning all PCs of the NHCαCO
and SIDE atomic subsets, across six orders of magnitude in time, from the 10 ns simulation.
Careful examination of this figure reveals a number of interesting features. First, there is a
leveling of MSD(t) at long timescales past ~1 ns (with the exception of the first PC). This
leveling is a result of the fact that we are analyzing a bounded system of fixed volume: at
some timescale all PCs must cease moving away from their average and return to it. Thus
the rollover in the MSD may be considered an “edge effect”, though it may also contain
interesting information about the timescales of the collective motions in our system. For
example, we would expect that covariant motions at smaller spatial scales in the protein will
be bounded at increasingly short timescales, and this is evident in Fig. 4.5. Comparison of
the MSD curves with lines of slope 1 and 2 (dotted gray lines) makes some general trends
apparent. The longest PCs scale with β=2, indicating ballistic motion unimpeded by thermal
perturbation, while shorter PCs tend towards β=1.5 or even β=1. Moreover, there is non-
trivial structure to the groupings of trends (in time) among PCs in the backbone, which is
made evident by the changes in spacing between groups of curves in the figure. This is
similar to the groupings of non-Gaussian distributions shown in Fig. 4.4.
The most interesting feature in Fig. 4.5 is the observation of pronounced oscillations
among the shortest PCs at timescales below ~1 ps. These oscillations are most visible in the
case of the side chains, although they are also present in the backbone with lower
frequencies. This suggests that the sub-diffusive features apparent in the non-Gaussian
distributions of short PCs are a result of short timescale oscillations, rather than longer
timescale sub-diffusive sampling. The superposition of locally sub-diffusive PCs on a global
super-diffusive envelope in the side chains of Fig. 4.4 may also be attributable to this
interplay of short and long timescale behavior. Note that although oscillations are not
‘diffusive’, they meet the definition of sub-diffusion in that consecutive steps are anti-
correlated (at a particular timescale).
Figure 4.5: The mean square deviation of every 11th PC for the NHCαCO and SIDE atomic subsets. The curves are evenly spaced by a constant c at their origin. Linear (β=1) and ballistic (β=2) values of slope β are shown in dotted gray as a guide for the eye.
In order to amplify small changes in the scaling of the MSD, in Fig. 4.6 we plot the
instantaneous slope of the MSD as a function of time for long and short groups of PCs.
These plots also highlight the fact that consistent power-law scaling is persistent on all
timescales up to ~100 ps for all PCs (and up to ~1 ns for the longest PCs). These plots reveal
a surprising array of oscillations in the short PC regime, with consistent frequencies across
groups of PCs and transitions to higher frequencies for shorter PCs. This figure also makes
clear that there is a general transition at ~1 ps, between very short timescale behavior and
longer timescale dynamics (100 ps - 1 ns). This is the expected ballistic (β=2) to diffusive
(β<2) transition for the longest PCs, indicating the timescale at which collective motions
become restrained by thermal perturbations of their directions and velocities of motion.
However, for short PCs the opposite trend is also apparent in the backbone, from slower sub-
diffusive scaling at short timescales to faster diffusive scaling at long timescales.
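The instantaneous slope used here can be sketched in a few lines. This is a minimal illustration rather than the code used to produce Fig. 4.6: the function names `msd` and `instantaneous_slope`, the choice of lags, and the synthetic ballistic trajectory are all assumptions made for the example.

```python
import numpy as np

def msd(x, lags):
    """Mean square displacement of a 1D time series x at integer lags (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.mean((x[lag:] - x[:-lag]) ** 2) for lag in lags])

def instantaneous_slope(lags, msd_values, dt=1.0):
    """Local exponent beta = d log10(MSD) / d log10(t)."""
    log_t = np.log10(np.asarray(lags) * dt)
    log_msd = np.log10(msd_values)
    return np.gradient(log_msd, log_t)

# Hypothetical test signal: ballistic motion x = v * t, for which beta = 2.
t = np.arange(2000)
x_ballistic = 0.5 * t
lags = np.array([1, 2, 4, 8, 16, 32, 64, 128])
beta = instantaneous_slope(lags, msd(x_ballistic, lags))
```

Applied to the projection of a trajectory onto a single PC, the same two functions would give one curve of β against time per PC; oscillations in that curve correspond to the anti-correlated steps discussed above.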
Figure 4.6: Instantaneous slope of the log10(MSD) functions shown in Fig. 4.5, for the long (left) and short (right) PCs of the heavy-atom backbone (top), main chain (middle) and the heavy side chain atoms (bottom).
4.4: Collective Oscillations in the Small Covariance Regime
To systematically investigate the frequencies of collective motions revealed in the
MSD, we computed the Fourier transform of the curves depicted in Fig. 4.6, for the
oscillatory regime below 1 ps. In Fig. 4.7 we plot the square of the Fourier amplitude for all
PCs of our three atomic subsets, representing power in the frequency domain. These results
reveal the existence of two dominant collective oscillations in the backbone of gA which can
be compared with experimental results from infrared and Raman spectroscopy. The first is a
broad peak centered at ~5 THz (165 cm-1), spanning PCs 90-120. The second is a sharper
peak centered at ~40 THz (1320 cm-1) near PC 250. There is good agreement between the
results for the two backbone atomic subsets, with the NHCαCO showing the same dominant
features at similar frequencies and PCs as the NCαC set, but with higher resolution and
higher frequency components in the former, as expected from the inclusion of the hydrogen
bonding elements. The side chain spectra also show many sharply resolved modes at high
PC index, with a pair of dominant modes at 20 THz (660 cm-1) and 40 THz (1330 cm-1), and
other distinct modes apparent both above and below these frequencies.
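The spectral power and the conversion between frequency and wavenumber can be illustrated as follows. This is a sketch under stated assumptions, not the analysis pipeline behind Fig. 4.7: the helper names, the 2 fs sampling interval and the synthetic 5 THz test signal are hypothetical.

```python
import numpy as np

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s

def power_spectrum(signal, dt_ps):
    """Return (frequencies in THz, squared Fourier amplitude) of a real signal."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                    # remove the zero-frequency offset
    amplitude = np.fft.rfft(signal)
    freqs_thz = np.fft.rfftfreq(signal.size, d=dt_ps)  # 1/ps is the same as THz
    return freqs_thz, np.abs(amplitude) ** 2

def thz_to_wavenumber(freq_thz):
    """Convert a frequency in THz to a wavenumber in cm^-1 (frequency / c)."""
    return freq_thz * 1e12 / C_CM_PER_S

# Hypothetical signal: a 5 THz oscillation sampled every 2 fs over 1 ps.
dt_ps = 0.002
t = np.arange(500) * dt_ps
signal = 2.0 + 0.1 * np.sin(2 * np.pi * 5.0 * t)
freqs, power = power_spectrum(signal, dt_ps)
peak_thz = freqs[np.argmax(power)]
peak_wavenumber = thz_to_wavenumber(peak_thz)
```

Note that a 1 ps window limits the frequency resolution to 1 THz (about 33 cm-1), which sets the lowest frequency that can be resolved in the oscillatory regime.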
Although it is tempting to attribute these oscillations to covalent bond vibrations,
analysis of the associated eigenvectors reveals that this is not the case in general. In fact, the
lowest frequency oscillations are associated with motions that span many heavy atoms in
both the backbone and the side chains, and hence represent collective oscillations across
functionally significant portions of our protein. Here we focus on the structure of PC
eigenvectors associated with the lowest frequency backbone oscillations in order to highlight
the possible functional significance of these motions, and the utility of information in the
previously ignored small-covariance regime of PCA.
Figure 4.7: Spectral power of the oscillatory regime for β (below 1 ps, as shown in Fig. 4.6).
Fig. 4.8 depicts three sample backbone eigenvectors from the broad 5 THz (165 cm-1)
band near PC 100. We illustrate the structure of displacements along each eigenvector by
superposition of the NHCαCO backbone projected away from the average structure along the
positive and negative directions of the PC eigenvector. Careful examination of this figure
reveals that in general the displacements are on the scale of a single peptide plane, with
tilting of the carbonyl oxygens and amide hydrogens apparent at a number of amino acids.
This suggests amide plane librations, whose functional significance for cation transport was
reviewed in Chapter 1. There are about 30 PCs in this group, and examination of the
eigenvectors in time makes clear that the group as a whole spans tilting motions of each
amide plane in the protein (note that there are 30 amide planes in gA). Far-infrared FT-IR
spectroscopic measurements of gA without cations have determined that carbonyl librations
occupy a band between 75 cm-1 and 175 cm-1, and there are other IR-active modes up to 500
cm-1 (79, 80). This is consistent with the low-frequency features in Fig. 4.7B, which span the
entire far-IR range from ~33 cm-1 to 500 cm-1. Moreover, the same experiments measured
broad absorption peaks upon addition of Li+ (79), K+, Rb+, and Cs+ (80) cations to the
channel, with the frequencies of cation mobility similar to those of the carbonyl libration
band. This shared timescale suggests that the librational modes of the amide planes may be
coupled to cation transport through the channel.
We have also examined the eigenvectors associated with the higher frequency
backbone mode near 40 THz (1320 cm-1). These are motions within the amide plane
associated with stretching of the carbonyl oxygen and amide hydrogen bonds, and are thus
clearly visible in the NHCαCO eigenvectors. We conclude that gA has coherent oscillations
near 40 THz (1330 cm-1) within the hydrogen bonds which define the secondary structure.
Finally, examination of the side chain eigenvectors shows that the dominant oscillation
modes correspond to bending and torsion of the Trp indole rings (peaks c1 and c2
respectively, in Fig. 4.7C), which carry a significant dipole moment and form hydrogen
bonds with the lipid headgroups in the membrane (34). This suggests that all the MSD
oscillations of short PCs are associated with hydrogen bonding, which also explains their
sub-diffusive distributions as well as their anharmonicity.
Figure 4.8: Illustration of backbone eigenvectors for sub-diffusive PCs 100, 110 and 120 of the main chain NHCαCO atomic subset. The front and back of the helix are shown separately for clarity. The superimposed structures are displaced 5 Å away from the average structure along the appropriate eigenvector, in the positive (red) and negative (blue) directions. Areas where peptide plane motions result in large displacements of the carbonyl oxygen are highlighted in circles.
4.5: Discussion
While most NMA and elastic network studies have focused on the longest
wavelengths as in PCA studies, some have studied the shorter wavelengths (74, 116, 131) as
well. The present study suggests that the same regime should prove to be a fruitful area of
study in the PCA of simulated MD trajectories. The analysis presented above could be used
as a guide to isolate regions of anharmonic motion in a protein. If this region is smaller than
the entire protein, such as a ligand-binding pocket, then a new PCA could be executed on just
this region and the relevant dynamics would now be apparent at the longest PCs of this re-
analysis. There is one example of a PCA study which has focused on such a binding pocket
in carbonmonoxy-myoglobin (216). I will discuss this study in more detail in Chapter 5.
Another interesting study has suggested that short PCs are more important in
determining the protein folding pathway than long PCs, using a method called “Essential
Dynamics Sampling” (217). After extracting 306 eigenvectors for the Cα atoms by
performing PCA on the folded structure of cytochrome c at equilibrium, the authors
performed biased MD on the unfolded protein by accepting steps which approached the
folded state, and projecting steps which did not onto various subsets of the equilibrium
eigenvectors. Surprisingly, the protein could be re-folded by biasing on the shortest 100
eigenvectors but not on the longest or mid-range 100 eigenvectors, or even on the complete
306 eigenvector set. The study concluded that “the most rigid quasiconstraint eigenvectors,
representing in the folded protein the smallest collective vibrations, contain the proper
mechanical information for the folding process”.
The anharmonic character of the Fourier spectra in Fig. 4.7 is also worthy of
comment; the orthogonal decomposition of collective modes in the implementation of
atomistic MD clearly lumps many frequencies of motion together in modes at different
spatial scales. This indicates that collective modes in a protein have complex dynamics with
a nonlinear dispersion relation, as originally pointed out by García (111). This finding
underlines the need to exercise caution when interpreting the spatial wavelengths from PCA
using quasi-harmonic approximations which map one wavelength onto one frequency. This
is one of the central issues in the interpretation of IR spectra, which often uses this
assumption when assigning modes with the aid of NMA calculations.
Finally, the global structure of the PCA eigenvalue spectrum shown in Fig. 4.1
deserves some discussion, as do the different values of scaling exponent α. The linear
scaling observed for all long PCs is evidence that these PCs do not describe thermal motion,
but what of the various values of α≠2 in the short PC regime? It may be that the backbone
and the side-chains exhibit different ‘colors’ of noise (i.e. the frequency dependence of the
spectral density). Moreover, the power-law implies that PCs with common scaling are
structurally related to one another through scale invariance. Mathematically, a function f(x)
is scale invariant if multiplying x by a factor m rescales f(x) by a factor that depends only
on m (independent of x). In general, such scale invariance is defined by the relation
$$f(mx) \propto f(x).$$
It is easy to verify by substitution that the power law $f(x)=Ax^p$ satisfies this relationship,
since $f(mx)=A(mx)^p=m^p f(x)$ (218). Hence, observation of power-law scaling among the PCs of a protein implies scale
invariance among its collective modes of motion. This suggests a hierarchical structure
among large and small scale motions, and important geometric relationships among the PC
eigenvectors which share the same scaling (as previously pointed out by (117, 219)). This in
turn indicates that an adequate description of protein motion is likely to require information
spread across the entire PCA spectrum, or at least across all components which scale
together, and not just the largest few PCs as conventionally analyzed.
The suggestion of grouping PCs together runs counter to the idea that PCs are
independent; by construction, PCA yields eigenvectors whose time trajectories
are linearly uncorrelated. But uncorrelated trajectories are statistically independent only
if they have a Gaussian distribution, and
this is precisely what we have shown not to be the case for isolated bands of short PCs. This
means that the time trajectories of different PCs may be correlated (in the super-diffusive
case) or anti-correlated (in the sub-diffusive case) in the non-Gaussian regime. The
oscillations in Figs. 4.6 and 4.7 are anti-correlated for a distinct timescale, and many
neighboring PCs share the same frequency of motion; this is evidence that many PCs may be
meaningfully grouped into a single mode. This finding could have a significant impact on
the interpretation of PCA and its ability to isolate functionally meaningful modes of motion
in MD simulations, not only for fast motions but also for the largest PCs spanning the
conformational degrees of freedom in a protein. A quantitative structural analysis of such
groupings within the largest backbone PCs of gA, SS and RR is the subject of Chapter 5.
4.6: Conclusion
PCA has traditionally been used in many disciplines to characterize the degrees of
freedom which span most of the fluctuations in a system. PCA studies of protein dynamics
have been no exception, focusing on the longest (slowest) PCs, motivated by predicting long-
time dynamics beyond the reach of current simulations (220). In an early and influential
study, Amadei et al. (112) defined the ‘essential subspace’ as “a few degrees of freedom in
which anharmonic motion occurs that comprises most of the positional fluctuations” in the
system. Here, we have shown that the anharmonic features of the long PCs may be artifacts
of insufficient sampling, whereas they are persistent for some shorter PCs. Thus,
anharmonicity extends beyond the motions which comprise “most of the positional
fluctuations”, and we suggest that these non-Gaussian-distributed modes are potentially
important in the description of function, regardless of their spatial scale. While function is
difficult to define and quantify, anharmonicity is evidence of coupling among modes, which
is likely to be necessary in the complex motions required for function.
Systematic examination of anharmonic features in the short PC regime has
identified collective oscillations with functional implications for gA; a group of backbone
oscillations were revealed at ~5 THz (165 cm-1) and can be identified as peptide plane
librations, whose carbonyl oxygens help solvate the lumen and cation in the channel. Our
results demonstrate that PCA can be used to isolate interesting covariant motions on a
number of different space and time scales – in a part of the PCA spectrum that is usually
ignored – and highlight the need for an adequate structural and dynamical account of many
more PCs than have been conventionally examined in the analysis of protein motion. This
analysis is readily applicable to any protein system for which MD simulations are available.
Chapter 5: Collective Modes at Large Covariance
5.1: Introduction
We are interested in Principal Component Analysis primarily as a technique for
transforming atomistic trajectories of N particles with 3N degrees of freedom into 3N distinct
coordinates of motion which span all particles in the system. These coordinates represent
collective modes of motion, i.e. the instantaneous displacement of particular groups of atoms
away from an average structure. They are orthogonal in space and uncorrelated in time by
construction. In Chapter 2 we discussed the lessons learned in the atmospheric sciences
through use of PCA (EOF) to isolate physically distinct patterns of atmospheric variables
(such as air pressure or wind velocity): complex systems are not likely to have either
orthogonal or uncorrelated collective dynamics. This implies that further transformations of
PCs are necessary to accurately represent the structure of physically distinct collective modes
in a protein. These patterns, these dynamic ‘structures’, are encoded in the 3D shapes of PC
eigenvectors, that is, the magnitudes and directions of the N 3D-vectors described by the 3N
components of each eigenvector. The accurate structural description of collective modes is
fundamental to the aims of molecular dynamics and structural biology in general – if it can
be shown that the collective dynamics are relevant to biological function.
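For concreteness, the transformation described in this paragraph can be sketched as an eigendecomposition of the covariance matrix of atomic displacements. The sketch below is an assumption-laden illustration: the function name `pca` and the random stand-in trajectory are hypothetical, and no structural alignment is performed, so the rigid-body degrees of freedom are not removed as they would be for a real MD trajectory.

```python
import numpy as np

def pca(trajectory):
    """PCA of a trajectory of shape (n_frames, 3N).

    Returns eigenvalues (descending), eigenvectors (as columns), and the
    projection of every frame onto every eigenvector.
    """
    centered = trajectory - trajectory.mean(axis=0)   # displacements from the average structure
    covariance = centered.T @ centered / (len(centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(covariance)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    projections = centered @ eigvecs
    return eigvals, eigvecs, projections

# Hypothetical stand-in for a trajectory: 500 frames of N = 4 atoms.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(12, 12))
trajectory = rng.normal(size=(500, 12)) @ mixing
eigvals, eigvecs, projections = pca(trajectory)
projection_cov = np.cov(projections, rowvar=False)
```

The eigenvectors are orthonormal and the covariance of the projections is diagonal, which is the sense in which PCs are orthogonal in space and uncorrelated in time by construction.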
In this chapter we undertake a detailed study of the structure of eigenvectors from
PCA of gA, SS and RR channels solvated in GMO and water. There are 3N-6 eigenvectors,
each with 3N elements, three at each of the N atomic sites. Normally we would refer to each
element of a vector as a component, but since this term becomes ambiguous when discussing
‘Principal Components’ (eigenvectors), we will use the term ‘loading’ below, since this is
conventional in other disciplines. Our first task is to quantitatively describe both the
magnitudes and directions of displacements represented by PC eigenvector loadings at each
site of the structure. What quantity do we compute from the PC loadings to represent
direction, and more to the point, what is the functionally relevant coordinate? Do we monitor
an angle (if so, with respect to what axis), or a dot product (with which unit vector in space)?
And how do we compare different PCs, by calculating RMSD of loading magnitudes or
RMSΔθ of their orientations? In this chapter we emphasize that measuring the relative
orientations Δθ of the PC vectors of neighbouring atoms, i.e. the coherence of motion among
subunits of a molecule, is an excellent measure of the ‘simplicity’ sought by the Empirical
Orthogonal Functions (EOF) methods described in Chapter 2. We can reasonably expect that
functionally organized motions of a complex biomolecule would have co-directional motion
of their structural sub-units. We argue that this is the right thing to measure about the
structure of PC eigenvectors, and also to judge the results of combining PCs together.
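One possible way to quantify the relative orientation Δθ of neighbouring loadings is sketched below. The function names, the choice of consecutive atoms along the chain as 'neighbours', and the toy loadings are assumptions for illustration; the sketch also assumes that no atom has a zero-length loading.

```python
import numpy as np

def neighbour_angles(eigenvector):
    """Angle (degrees) between the 3D loadings of consecutive atoms along the chain."""
    loadings = np.asarray(eigenvector, dtype=float).reshape(-1, 3)   # shape (N, 3)
    unit = loadings / np.linalg.norm(loadings, axis=1, keepdims=True)
    cosines = np.sum(unit[:-1] * unit[1:], axis=1)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

def rms_delta_theta(eigenvector):
    """Root-mean-square neighbour angle; small values indicate co-directional motion."""
    angles = neighbour_angles(eigenvector)
    return np.sqrt(np.mean(angles ** 2))

# Hypothetical loadings for N = 3 atoms.
coherent = np.array([1, 0, 0, 2, 0, 0, 0.5, 0, 0], dtype=float)
incoherent = np.array([1, 0, 0, 0, 1, 0, -1, 0, 0], dtype=float)
```

Because the loadings are normalized before the dot product is taken, this measure depends only on direction and is insensitive to the magnitude of displacement at each site.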
The main proposition in this chapter – and this thesis – is a simple transformation of
eigenvectors guided by a straightforward interpretation of the eigenvalue spectrum. If the
covariance eigenvalues represent the spatial amplitude of PCs – defining the variance of their
distributions over time – then a sum of eigenvectors weighted by these amplitudes should
yield a reasonable approximation of a physical mode of motion which has been decomposed
across a number of different Principal Components. Furthermore, there is substructure
apparent in the eigenvalue spectrum of the polypeptide backbone, which suggests one or
more ‘band-gaps’ between distinct modes of motion. We use this structure to determine
which PCs to add together and also to propose a quantitative criterion to separate the
conformational motions from internal dynamics of the backbone. We use our ‘directional
coordinate’ to identify and describe four apparently coherent modes from the 25 PCs in this
conformational regime. This is a valuable reduction of the MD data set, and fulfills the
primary goal of PCA to reduce a high-dimensional MD data set onto a few convenient
coordinates describing comprehensible modes of motion. Comparisons of results for the
gramicidin dimer gA with two of its covalently linked analogs SS and RR demonstrate the
ability of our technique to differentiate functionally relevant motions which arise from
structural differences.
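The weighted sum proposed above can be written compactly. The sketch below is one reading of the proposal and not a definitive implementation: it weights each eigenvector by its eigenvalue, normalizes the result, and takes the signs of the eigenvectors as given. The function name `emergent_mode` and the toy inputs are hypothetical, and weighting by the square root of the eigenvalue (a standard deviation rather than a variance) would be an equally simple variant.

```python
import numpy as np

def emergent_mode(eigvals, eigvecs, pc_indices):
    """Eigenvalue-weighted sum of selected eigenvectors, normalized to unit length.

    eigvecs holds one eigenvector per column; pc_indices are zero-based.
    """
    idx = np.asarray(pc_indices)
    combined = eigvecs[:, idx] @ eigvals[idx]      # sum over i of lambda_i * v_i
    return combined / np.linalg.norm(combined)

# Hypothetical two-PC example in a 3-dimensional space.
eigvals = np.array([3.0, 4.0, 1.0])
eigvecs = np.eye(3)
mode = emergent_mode(eigvals, eigvecs, [0, 1])
```

Since the sign of each eigenvector returned by a numerical eigensolver is arbitrary, any practical use of such a sum would also need a convention for fixing those signs before the eigenvectors are added.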
The analysis presented below is for the subset of main-chain backbone atoms NCαC
for the 30 amino acids in gA, SS or RR molecules, not including the C-terminus
ethanolamine groups or the N-terminus formyl groups (the terminal groups fluctuate in
an unstructured manner, and while eigenvectors extracted with these included in the analysis
are similar, their features are less clear). Note that PCA is only a descriptive technique, and
the choice of atomic subset only influences the portion of simulation data which is observed;
all atoms still move under the influence of the complete force field created by every atom
within the simulation. Hence we do not include the dioxolane linker in our analysis of SS
and RR, simply for the sake of direct comparison with PCA of gA, but the influence of this
linker is encoded in the motion of the NCαC atoms. Since our primary interest is a
comparison of linked and non-linked analogs of gA, which are distinguished from each other
by the structural characteristics of their backbone, we do not include any analysis of side
chains here.
5.2: Band Gaps in the Eigenvalue Spectra
In the previous chapter Fig. 4.1 showed the PCA eigenvalue spectra for gA on a log-
log scale, where a power law is made obvious by the linear slope of the curve. The different
curves were for PCA carried out with various atomic subsets, from a single atom per residue
(Cα), through backbone atoms (NCαC or NHCαCO), to all atoms including side chains with
(ALLH) and without (ALL) hydrogen atoms. In all these spectra there is clear separation of
regimes with two 1/k^α power laws evident: the long-covariance components have α=1 while
the short components scale with α>1. We hypothesize that the transition made evident by
this change in scaling gives a quantitative criterion by which to separate the concerted
motions representing conformational changes of the protein backbone from the smaller
fluctuations internal to the backbone. Note that Fig. 4.1 displays results from the 10 ns
unconstrained trajectory, to adequately capture the smallest eigenvalues. In this chapter we
are interested in the long and mid-range PCs and use the most converged eigenvectors
available, from 64 ns simulations.
In Fig. 5.1A we examine the first 100 (of 270) eigenvalues for the NCαC backbone
atoms of gA. The change in power law for this spectrum occurs near PC 25. Within the
linear regime there are at least three regions with discernable flattened substructure, marked
A, B and C, with obvious gaps between them. This form indicates a sequence of increasingly
degenerate eigenvalues, suggesting strong mixing of those PCs. The regions marked D and E
indicate regions with similar if less discernable substructure, while F falls in a different
scaling regime with α=2. We call these ‘emergent modes’ A-E, since they emerge from
combinations of many individual PCs. Comparison of the Cα, NCαC, and NHCαCO curves
in Figure 4.1 shows that including more backbone atoms in the PCA fills in the same features
with more points, which suggests that differing numbers of PCs span the same underlying
physical modes; including more atoms is equivalent to increasing the resolution with which
these modes are observed. This in turn implies that the structure of individual PCs is
strongly dependent on the atomic subset included in the PCA (the “subdomain instability”
mentioned in section 2.2.4), while the structure of physical ‘modes’ described by groups of
PCs must be conserved.
The root-mean-square fluctuations (RMSF) of our system can be computed by taking
a sum over normalized eigenvalues, each of which represents the RMSF at a particular
spatial scale. The RMSF spanned by each feature of the spectrum is shown as a percentage
under the groups of components labeled A-F in Fig. 5.1A. Each group of components
exhibits an RMSF of comparable size, indicating that neglecting higher-index components
based on their individual spatial covariance σi may not be justified, as is commonly done in
PCA studies of proteins. Mode A spans 34%, mode B spans 20% and mode C spans 11%
of the system covariance. The less obvious modes span 9% (D), and 4% (E), while the sum
of all eigenvalues above component 23 in the steeper scaling regime account for 22% (F,G)
of the total RMSF. This is a significant quantitative result: 78% of the backbone fluctuations
fall in the linear scaling regime and can be described by 23/270=9% of all principal
components, while 91% of the components describe motion in the different scaling regime.
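The bookkeeping behind these percentages is a sum of normalized eigenvalues over each group of PCs. The following sketch shows the calculation on a synthetic spectrum; the pure 1/k eigenvalues are hypothetical, and only the boundaries after PCs 3, 8 and 23 follow the text, with the other two boundaries inserted as placeholders.

```python
import numpy as np

def group_fractions(eigvals, last_pc_in_group):
    """Percentage of the summed eigenvalues carried by each consecutive group of PCs.

    last_pc_in_group lists the one-based index of the final PC in each group;
    everything after the last boundary is returned as a final remainder group.
    """
    normalized = np.asarray(eigvals, dtype=float) / np.sum(eigvals)
    groups = np.split(normalized, last_pc_in_group)
    return np.array([100.0 * g.sum() for g in groups])

# Hypothetical spectrum: 270 eigenvalues following a pure 1/k power law.
k = np.arange(1, 271)
eigvals = 1.0 / k
# Boundaries after PCs 3, 8 and 23 follow the text; 13 and 18 are placeholders.
fractions = group_fractions(eigvals, [3, 8, 13, 18, 23])
```

By construction the percentages sum to 100, so the share of the steeper scaling regime is simply the remainder once the groups in the linear regime have been accounted for.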
The essential information derived from Fig. 5.1A is the placement of boundaries
between PCs which distinguish separate modes of motion. While there is one particularly
obvious band-gap between PCs 3 and 4, and another between PCs 8 and 9, a more reliable
and quantitative criterion is needed to delineate other modes. The gray line in Fig. 5.1A is
constructed from 6 points which are the average within each group labeled A-F. In red we
show the same line displaced upwards for visual clarity. A line drawn through these points A
to D yields a power law with α=1 and an R2~0.999, while moving the boundaries between
modes results in obvious kinks along this line and much worse fits to the linear scaling curve.
We take this as quantitative evidence that our proposal for grouping PCs is sound.
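A fit of this kind can be sketched as a least-squares line through the group averages in log-log space. The code below is illustrative only: the function names, the synthetic 1/k spectrum and the group boundaries are hypothetical, and representing each group by its arithmetic mean PC index is an assumption, since the averaging convention is not specified here.

```python
import numpy as np

def fit_power_law(k, values):
    """Least-squares line in log-log space; returns (alpha, r_squared) for values ~ 1/k**alpha."""
    log_k, log_v = np.log10(k), np.log10(values)
    slope, intercept = np.polyfit(log_k, log_v, 1)
    residuals = log_v - (slope * log_k + intercept)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((log_v - log_v.mean()) ** 2)
    return -slope, r_squared

def group_averages(eigvals, last_pc_in_group):
    """Mean PC index and mean eigenvalue of each group (one-based boundaries)."""
    k = np.arange(1, len(eigvals) + 1)
    k_groups = np.split(k, last_pc_in_group)[:-1]
    v_groups = np.split(np.asarray(eigvals, dtype=float), last_pc_in_group)[:-1]
    return (np.array([g.mean() for g in k_groups]),
            np.array([g.mean() for g in v_groups]))

# Hypothetical spectrum and boundaries, for illustration only.
eigvals = 1.0 / np.arange(1, 101)
mean_k, mean_val = group_averages(eigvals, [3, 8, 13, 18, 23])
alpha_groups, r2_groups = fit_power_law(mean_k, mean_val)
alpha_exact, r2_exact = fit_power_law(np.arange(1, 101), eigvals)
```

Repeating the fit with shifted boundaries and comparing the resulting R2 values is one way to turn the visual 'kink' criterion described above into a number.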
Figure 5.1B compares the eigenvalue spectra of gA, SS and RR. While the bandgap between
PCs 3 and 4 persists for the linked analogs, it is about half the size of that in the unlinked gA
dimer. Moreover, the other gaps are not apparent in the linked dimers, nor are there any
other groups of flattened neighbouring eigenvalues: modes B-E are no longer distinct in the
linear scaling regime. Bandgaps are usually indicative of an energy separation between
distinct modes of motion, and it seems reasonable that modes which were clearly separated
in the un-linked dimer become mixed when a covalent linker is introduced. The basis for
these observations becomes clear upon examination of the structure of PCs and modes for all
three structures.
Figure 5.1: A: PCA eigenvalue spectrum for the NCαC backbone atom subset of gA, with groupings of multiple components into modes A-G. The dashed lines are a guide for the eye, showing two distinct power laws 1/f^α (α=1 and α=2). Groups of PCs inside a mode are delineated by the light gray vertical lines: the mode number of the last PC in each group is shown along the top of the plot. Along the bottom the cumulative RMSF is shown as a sum across normalized eigenvalues within each mode. The dark gray line plots the average eigenvalue of groups within each mode, and the red line is the same curve displaced upwards for clarity. B: A comparison of eigenvalues for gA, SS and RR channels.
5.3: Spatial Structure of PC Eigenvectors
In the literature applying PCA to protein dynamics, the most common scheme for
representing the structure of PCs is the superposition of molecular structures projected along
a given eigenvector in the positive and negative directions (112-114, 118, 130, 221).
Alternatively, one can display molecular structures representative of highly visited basins in
PC-projection space (117, 222). While these schemes highlight displacements perpendicular
to the chain, they fail to resolve motion parallel to the backbone since all traces superimpose
onto each other in this direction. Sometimes eigenvectors may be visualized by attaching the
eigenvector loadings as arrows on atomic sites directly (128, 129, 144), but on the page this
often fails to convey the full 3D patterns of displacement. This approach may be useful in
schematic form, especially if the motion is much simplified and approximates the
displacement of entire domains (115, 146). More commonly, no attempt is made to fully
characterize the 3D structure of PCs, and only the magnitudes of fluctuations are analyzed
along the primary sequence (116, 119, 127, 131, 134, 141, 143, 145, 148). While such plots
should predict B-factors and hence make contact with experimental data, they do not help
characterize how the protein moves since they ignore the directions of eigenvector loadings.
To give a detailed structural account of the backbone PCs as well as their emergent
modes A-E, we must first develop good quantitative tools for understanding the full 3D
structure of eigenvectors. In order to describe coherence of motion we focus on quantifying
the direction of PC displacement vectors at each atom (rather than their magnitudes) in order
to portray patterns of common direction. To this end a ‘directional coordinate’ is helpful, in
analogy with the ‘reaction coordinate’ commonly used to simplify the multi-dimensional
description of chemical reactions. The motion along the backbone of a protein can be
decomposed into two convenient directions: parallel or normal to the chain. For a helix the
latter can be either along the helical axis (approximately) or perpendicular to it along the
cylinder radius, and this redundancy makes it difficult to visualize. However, motion parallel
to the chain can be quantified by taking the dot product of an atom’s displacement vector
with the tangent to the chain.
Figure 5.2 illustrates the use of two directional coordinates, dθ±, which characterize the
direction of motion for any atom relative to a vector chosen carefully with regard to the
molecular structure. We take the dot product of every atom’s 3D PC loading with the
average helical pitch of the backbone, once along the front of the molecule (+), and again
along the back (-). We get two vectors for the two sides of the molecule: dθ− on one side of
the axis defined by θ, and dθ+ on the other side. This is a vector of average backbone
direction cast against a plane through the helical axis, chosen at angle θ in the plane normal
to this axis, where the origin θ0 is chosen at the midpoint of the gap (or linker) between the
monomers of the molecule. We then encode the value of this dot product as a colour ranging
from blue (=-1) through purple (=0) to red (=+1). In Figure 5.2A and 5.2B this vector is
shown in bold colors for both the front (dθ+) and back (dθ−) of the β-helix. Since the helical
structure of the backbone reverses the direction of the chain from the front to the back of the
molecule, the direction encoded in the color scale is reversed along the back of the molecule
for visual clarity; in this way every atom of the same color moves in roughly the same
direction in absolute space. This procedure aids considerably in reducing the ambiguity of
representing 3D vectors on a 2D surface. In Fig. 5.2C our directional coordinate is
implemented to display the pattern of coherent displacements for PC2 of gA. While the color
scheme captures the direction of each vector relative to the helical pitch to highlight the
displacement pattern, both the magnitude and direction in absolute space are encoded in the
length and direction of each vector attached to each atom of the average structure.
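The directional coordinate and its colour encoding can be sketched as follows. This is a simplified reading of the procedure and not the code used for Fig. 5.2: the function names are hypothetical, the front of the molecule is taken to be x>0, the two pitch vectors are assumed to be supplied by the user, and the linear blue-purple-red ramp is an assumption about the colour scale.

```python
import numpy as np

def directional_coordinate(loadings, positions, pitch_front, pitch_back):
    """Cosine between each atom's PC loading and the helical pitch on its side of the helix.

    loadings, positions: arrays of shape (N, 3); atoms with x > 0 are treated as 'front'.
    The sign is flipped on the back so equal values mean equal direction in absolute space.
    """
    loadings = np.asarray(loadings, dtype=float)
    unit = loadings / np.linalg.norm(loadings, axis=1, keepdims=True)
    front = np.asarray(positions, dtype=float)[:, 0] > 0
    pitch_front = np.asarray(pitch_front, dtype=float)
    pitch_back = np.asarray(pitch_back, dtype=float)
    pitch_front = pitch_front / np.linalg.norm(pitch_front)
    pitch_back = pitch_back / np.linalg.norm(pitch_back)
    return np.where(front, unit @ pitch_front, -(unit @ pitch_back))

def to_rgb(d):
    """Map d in [-1, 1] to a colour from blue (-1) through purple (0) to red (+1)."""
    d = np.asarray(d, dtype=float)
    red = (d + 1.0) / 2.0
    return np.stack([red, np.zeros_like(red), 1.0 - red], axis=-1)

# Hypothetical example: one front atom and one back atom moving in the same direction.
loadings = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
positions = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
d = directional_coordinate(loadings, positions, [0.0, 1.0, 0.0], [0.0, -1.0, 0.0])
colours = to_rgb(d)
```

In this toy case the chain tangent reverses between the front and back of the helix, and the sign flip ensures that both atoms receive the same value and hence the same colour.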
In Fig. 5.2D we demonstrate the utility of the directional coordinate by plotting it
directly as a function of chain position. In the same way that the magnitudes of fluctuation
are commonly displayed as a function of chain position, the directional coordinate allows for
a plot of the directional structure of a given PC; in this case it is easy to see that PC2 is a
bending of the molecule, where the central atoms in the hydrophobic core generally move to
the right while the extremal atoms at the hydrophilic ends move to the left. Fig. 5.2
demonstrates the importance of choosing a casting angle which maximizes the displacement
pattern, and it becomes very useful in comparing the PC structures for analogs of the same
molecule. With this new tool for assessing the directional structure of collective modes, we
may now investigate the shapes of principal components for gA, and make quantitative
comparisons with the PCs of SS and RR.
Figure 5.2: A & B: A color key for representing the direction of motion with respect to the helical pitch. The front atoms of the gA dimer are shown in A and the back in B. These two sets of atoms fall along different helical pitch vectors, shown in bold red and blue. Hence for x>0 (front), red atoms move down the pitch and blue atoms move up the pitch, and vice versa for x<0 (back). The color scales from red (-1) to blue (+1) as a function of the dot product of an atom’s displacement vector with the tangent to the chain at the extreme front (or back) of the molecule. At the center of the channel is an appropriately colored bar aligned with the average pitch of the helix for either the front or back of the molecule, along with a bar normal to the pitch. C: Implementation of the color scheme displaying PC#2. The vectors attached to each atom display the components of PC#2 directly, while the color scheme represents the directional information relative to the pitch vectors displayed in A and B. D: 1-dimensional trace of the directional coordinate encoded in the color scheme of C. Both the horizontal position and the color of the trace encode the same information, laid out against a map of the amino acids along the vertical axis. This example demonstrates the utility of this coordinate in separating the dominant directions of motion for the outer and inner turns of the helix in PC#2.
5.4: The Principal Components of gA, SS and RR
Figure 5.3 implements the color scheme illustrated by Fig. 5.2, and depicts the
structure of collective displacement for the largest 3 PCs of gA from three perspectives in
space (along the x, y and z-axes), and compares them to the same PCs for SS and RR from a
single perspective (z-axis). Blue and red atoms move in opposite directions along the helical
pitch. The purple scale picks out motion perpendicular to the helical axis. The first three
principal components of gA have an identifiable structure which spans the whole dimer. PC1
exhibits a counter-rotation of each monomer around the helical axis, much like the wringing
of a towel; each monomer twists in the opposite sense. This is apparent in PC1 of gA in Fig.
5.3, where the top monomer is blue along the back (moving right) and red along the front of
the molecule (moving left), while the bottom monomer has the opposite color scheme,
indicating that it is twisting in the opposite sense. PCs 2 and 3 of gA are orthogonal bending
motions. PC2 exhibits red vertical extremities and a blue core: all the atoms at the top and
bottom of the protein move to the right while the middle uniformly moves to the left. PC3
has a similar structure, but the bend is normal to the page, as indicated by the dark shades of
purple at the extremities and lighter shades of purple towards the centre.
The depiction of PC2 and PC3 looking from the z-axis shows that these modes of
motion are orthogonal bends for gA. The vertical perspective also easily distinguishes the
differences between the first three PCs of gA, SS and RR. The bending patterns move to
PC1 and PC2 for the linked dimers, while the twisting pattern is PC3. This seems reasonable
given the structural differences between these molecules, as a twisting motion is
comparatively hindered by covalent bonds in the linked analogs, while bending modes
require only torsions around these bonds. Inversely, collective bending is comparatively
disfavored in the non-covalent dimer gA due to the main chain discontinuity at the dimer
junction. Another difference that becomes apparent is
Figure 5.3: Illustrations of the three largest PCA eigenvectors using the color key in figure 5.2. Each PC of gA is displayed from three perspectives for clarity. The atoms move in the direction shown by the vectors at each atom, while their color encodes their direction relative to the helical pitch.
that while the angle between the PC2 and PC3 bends in gA and RR is ~90°, it is smaller in
SS, and the bending modes are in general less apparent in this linked analog. Again this can
be explained by the orientation of the dioxolane linker with respect to the backbone pitch,
which acts like a wedge in RR, strongly favouring one direction of bend, while in SS it runs
along the helix which inhibits bending motions in general (PC1 of SS is not even a clear
bend, incorporating some twist). We also note that the relative directions of the bending modes
are different for the linked and un-linked analogs.
In general, any periodicity of coloring along the backbone makes coherent patterns of
displacement discernable. Figure 5.4 shows PCs 4-9 of gA, where the loss of uniform
displacements is apparent in the variations in magnitude of the vectors, and loss of obvious
coherence is apparent in the loss of uniform stretches of color along the backbone. While
Figs. 5.3 and 5.4 give a qualitative picture of motion on the actual protein structure, a more
quantitative representation is shown in Fig. 5.5 for gA, SS and RR. Here the sequence of the
protein is 'unwrapped' onto the horizontal axis of the plot, which shows the primary amino
acid sequence of gA. The dot product of each atom’s direction of displacement with the
helical pitch vectors (as in Fig. 5.2) is plotted on the vertical axis. The color of the curve also
replicates the directional information plotted on the vertical axis, using the same color
scheme as Figs 5.3 and 5.4 to make the relationship between the figures obvious. The gray
vertical lines denote boundaries between turns of the helix, which help map atomic motions
onto structural features of the protein. For example, a twisting motion shows up along the
helical pitch as a sinusoidal trace with a period of one turn. The coherent motion of a domain
of neighboring atoms moving in the same direction is seen as a straight horizontal line in this
representation, and the relative symmetry of displacements between monomers is also readily
apparent from left to right.
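The directional coordinate described above can be sketched in a few lines. This is a minimal illustration, assuming each PC is available as an array of per-atom displacement vectors; the function name `direction_coordinate` and the reference vector `[1, 0, 0]` are hypothetical stand-ins for the pitch vectors of Fig. 5.2, which are not reproduced here.

```python
import numpy as np

def direction_coordinate(displacement, reference):
    """Cosine of the angle between each atomic displacement and a reference.

    displacement : (N, 3) per-atom displacement vectors of one PC or mode
    reference    : (3,) reference direction, e.g. a pitch vector
    Returns (N,) values in [-1, 1]; displacement magnitudes are discarded.
    """
    ref = np.asarray(reference, dtype=float)
    ref = ref / np.linalg.norm(ref)
    norms = np.linalg.norm(displacement, axis=1, keepdims=True)
    return (displacement / norms) @ ref

# Toy twist: atoms on a two-turn helix rotating rigidly about the z-axis.
angle = np.linspace(0.0, 4.0 * np.pi, 81)
twist = np.column_stack([-np.sin(angle), np.cos(angle), np.zeros_like(angle)])

# Hypothetical fixed reference direction standing in for a pitch vector.
trace = direction_coordinate(twist, reference=[1.0, 0.0, 0.0])
```

In this toy case the trace is a sinusoid with a period of one turn, which is the signature of a twisting motion noted in the text.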
Figures 5.3 and 5.5 make it easy to compare the PCs of the gA, SS and RR channels,
and thereby quantify the differences in dynamics which are caused by the inclusion of a
covalent linker between channel monomers. It is apparent that the wringing motion of PC1
in gA is present with smaller amplitude in both linked channels as PC3. The two bending
motions of PC2 and PC3 in gA are also present in SS and RR with larger amplitude, as PC1
and PC2. Although the directions of the bends are slightly different, the two bending
motions are still orthogonal within each linked analog.
Figure 5.4: Illustration of gA principal components 4 through 9, using the color scheme of Figure 5.2. These patterns of displacement are more difficult to describe and generally less coherent than PC1-3, although they share the general characteristic whereby the monomers roughly mirror each other’s pattern.
Figure 5.5: Projection of displacement direction onto helical pitch, as a function of residue. The vertical axis plots the dot product of an atom’s displacement vector with the axes of projection shown in Fig. 5.2. The line across the centre of each curve is zero, with +1 above and -1 below. The curve is colored using the same scheme as figures 5.3 and 5.4 to make the relationship between the figures obvious, such that parts of the chain moving to the left (right) are red (blue). Note that magnitudes of displacement are not captured by this plot, only relative direction.
5.5: Coherent Modes From Weighted Sums of PCs
The central result of this study follows from a straightforward interpretation of the
meaning behind the eigenvalues and eigenvectors of PCA: if the eigenvectors $\mathbf{v}_k$ describe
the shapes of collective displacements while the (square root of) eigenvalues $s_k$ represent the
spatial amplitude of each $\mathbf{v}_k$ over the average of a time trajectory, then the following weighted
sum describes a physically meaningful collective mode $l$ with displacement vector $\Delta\mathbf{r}_l$ of
bandwidth $\Delta k$:
$$\Delta\mathbf{r}_l = \sum_{k=k_1}^{k_2} s_k\,\mathbf{v}_k .$$
In addition to this observation an ansatz must be made to establish the appropriate bounds k1
and k2. The substructure apparent in the log-log representation of the PCA eigenvalue
spectrum (as in Fig. 5.2) is a physically reasonable guide in this respect. Just as spectral
peaks span the modes of motion in an oscillatory system, regions of distinct power law
scaling may delineate separate modes in the diffusive dynamics of a protein system.
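A minimal sketch of the weighted sum is given below. It assumes the PCA eigenvectors are stored as columns of a NumPy array and that the weights are the square roots of the eigenvalues; the function name `coherent_mode`, the array layout, and the random toy trajectory are illustrative assumptions, not the analysis code used in this work.

```python
import numpy as np

def coherent_mode(eigvals, eigvecs, k1, k2):
    """Eigenvalue-weighted sum of PCs k1..k2 (1-based, inclusive).

    eigvals : (M,) PCA eigenvalues sorted in decreasing order
    eigvecs : (3N, M) matrix whose k-th column is the k-th eigenvector
    Returns an (N, 3) array of per-atom displacement vectors.
    """
    idx = np.arange(k1 - 1, k2)          # convert to 0-based indices
    weights = np.sqrt(eigvals[idx])      # assumed weights: sqrt of eigenvalues
    mode = eigvecs[:, idx] @ weights     # sum over k of s_k * v_k
    return mode.reshape(-1, 3)

# Toy example: PCA of a random "trajectory" of 5 atoms over 200 frames.
rng = np.random.default_rng(0)
traj = rng.normal(size=(200, 15))
cov = np.cov(traj - traj.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

mode_a = coherent_mode(eigvals, eigvecs, 1, 3)   # block of PCs 1-3
```

Because the sign of each PCA eigenvector is arbitrary, a practical implementation would also need some convention for fixing relative signs before summing; none is assumed here.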
Following the bandwidths shown in Fig. 5.2 for gA, mode A is composed of PCs 1-3,
while mode B spans PCs 4-8. Figure 5.6 shows these two largest collective modes for gA,
depicted on the molecular structure of the backbone (as in Figs. 5.3 and 5.4). Mode A
exhibits coherent motion of the hydrophobic turns at the junction of the two monomers
(blue), moving out of phase with the outermost hydrophilic turns (red). This stands in
contrast to Mode B, where the hydrophobic turns move out of phase with each other, and
where the inner two turns of each monomer move out of phase with that monomer’s outer
turn. This symmetric and anti-symmetric character is apparent in the constitutive PCs of
these two modes; PCs 1-3 are generally symmetric while PCs 4-8 are anti-symmetric (see
Fig. 5.5), though no individual PC yields as clear a directional profile as their weighted sum.
The character of motion in the modes is also quite distinct from their constitutive PCs. For
example, the wringing and two orthogonal bends of PCs 1-3 combine to yield almost uniform
lateral displacement of individual turns.
Figure 5.6: Coherent Modes of gA, from the eigenvalue-weighted sum of PCs. In Mode A the middle three hydrophobic turns move laterally out of phase with the outermost hydrophilic turns. In mode B the two inner turns of each monomer move out of phase with each other, opening the connection between them.
Figure 5.7: Coherent Modes of gA, represented by two different directional coordinates as a function of atom number, with the corresponding residues shown below the figure. Motion projected along the helix pitch is shown at left, while motion along the helix axis is shown at right. The vertical gray lines delineate equal angular position (i.e. the five turns of the helix).
To more clearly display these features, in Fig. 5.7 we show modes A-F on the
tangential direction coordinate described above, as a function of chain position (as in Fig.
5.5). In this figure we also present these gA modes on a different directional coordinate for
comparison, using the helical axis for the dot product with atomic displacements, where
yellow and cyan colour opposite vertical directions. The most striking feature of the
weighted sums is that they possess larger sections of uniform direction than their constituent
PCs, as is made apparent by the stretches of horizontal lines in this figure. Furthermore, this
uniformity is concentrated along individual turns of the helix. This feature is most clear for
modes A, B, and E along the tangential coordinate of Fig. 5.7, and along the axis coordinate
for modes C and D. The term 'coherent mode' is justified by this observation, since clearly
identifiable structural sub-units of the protein move together, either in or out of phase with
each other. In fact, it is clear from Figure 5.7 that mode B is simply mode A with one of the
monomers having the opposite phase. Mode C can be best described on the axis coordinate,
where it appears the monomers move in opposite directions along the vertical axis. Mode D
is a shearing motion, where the front and back of the molecule move out of phase vertically.
Finally, in mode E each helical turn moves out of phase laterally with its neighbours.
The main difference between modes A and B is the phase of motion at the junction of
monomers, which suggests a functional interpretation for this conductive channel. Notice
that mode A would preserve a continuous water column at the centre of the channel, while
mode B would disturb this path of water molecules. This suggests that mode A may be the
conductive state of gA, maintaining a conductive path for ions to move between gA
monomers, while mode B may be the non-conducting state, breaking the ionic conduction
path between monomers. It is worth noting that mode B is very similar to the one described
by Miloshevsky and Jordan using Normal Mode Analysis with biased path sampling (based
on the Monte Carlo algorithm) on gA simulated in vacuum: "The open state gating
mechanism of gramicidin A requires relative opposed monomer rotation and simultaneous
lateral displacement" (132). A similar mode and gating mechanism may also lead to
dissociation of the dimer, on the 100 ms timescale.
Figure 5.8 compares modes A and B for the three gramicidin channels. We see here
the remarkable result, that despite differences in the size and shape of the leading 3 PCs (as
shown in Figs. 5.3-5.5) the eigenvalue weighted sum of PCs 1-3 yields almost identical
modes in gA, SS and RR. However, this is not true for mode B of gA, which is not present
in SS and RR. The eigenvalue spectrum of Fig. 5.2 is easily interpreted in light of these
results; the three analogs share the same eigenvalue scaling and band gap for PCs 1-3, but not
for higher-index PCs. Indeed, there is no bandgap to separate PCs 4-8 in SS and RR, and so
no reason to think that those PCs should yield a coherent mode in these analogs. These
results are also sensible in light of the structural differences among these three molecules: the
covalent link between monomers prevents the opening motion of mode B, as well as vertical
modes C and D, but has little influence on mode A.
Figure 5.8 also compares the mode structure to individual PC structures by overlaying
them on the same axis. The symmetry of the directional coordinate between the two
monomers (left and right of the curve) is much stronger in the coherent modes than their
constituent components. While PCs 1-3 have their own symmetry and coherence, it is
striking that such a simple pattern should emerge for mode B from the apparently incoherent
PCs 4-8 seen in Fig. 5.4. This is also true of mode E, which exhibits counter-motion of
all neighboring turns while its components are so non-uniform as to be un-interpretable. This
emergent symmetry suggests that the grouping ansatz which underlies our results is both
sensible and useful.
Figure 5.8: Comparison of modes A and B for gA, SS and RR. Modes are shown in bold, and their constituent PCs in thin lines on the same axis. The modes are more symmetric across both monomers, and more uniform in their directions, than the individual PCs. Mode A is the same for all three analogs, while mode B is different in gA and the linked channels.
5.6: Covariance of PC Trajectories
The analysis presented above depends strongly on grouping certain blocks of
components into modes. While the eigenvalue spectrum seems to contain this information, it
is difficult to discern spectral structure for higher components. The most convincing
evidence for grouping would come from correlations in the time trajectory of components
within a mode group, correlations which should not be there if indeed PCA yielded
independent modes of motion. In Figure 5.9 we show the average covariance matrix for the
first 23 PC trajectories:
$$\langle pc_i(t)\, pc_j(t) \rangle_{t<T}$$
where $pc_i(t)$ is the time trajectory of the MD simulation projected onto the $i$th PC, and the
average is over the entire time trajectory of duration $T = 64$ ns. By calculating the covariance
of every $i$th PC with every $j$th PC, a matrix of values between -1 (anti-correlated) and 1
(correlated) is generated. In Figure 5.9 we also present the absolute value of this covariance
matrix, which yields a more readily interpretable pattern of uncorrelated (white = 0) and
correlated (black = 1) PC trajectories. These plots show that there are significant correlations
in time among the first 9 eigenvectors, and none for PCs of higher index. This is evidence
that the first 9 PCs are not independent of each other, and that groups of PCs must be
considered together to form a single mode of motion.
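The covariance of projected trajectories can be sketched as follows, assuming the values between -1 and 1 are obtained by normalising each projection by its standard deviation (here via `np.corrcoef`). One caveat worth keeping in mind: if the eigenvectors are computed from exactly the same mean-centred frames that are then projected, this matrix is the identity by construction. The sketch therefore takes the eigenvectors and reference structure as separate inputs, and the toy example projects a different window than the one used for the PCA. That choice, the function name `pc_trajectory_covariance`, and the random data are assumptions made for illustration only.

```python
import numpy as np

def pc_trajectory_covariance(frames, eigvecs, mean_structure):
    """Normalised covariance <pc_i(t) pc_j(t)> between projected trajectories.

    frames         : (T, 3N) aligned coordinates, one frame per row
    eigvecs        : (3N, K) leading PCA eigenvectors as columns
    mean_structure : (3N,) reference structure subtracted before projection
    Returns a (K, K) matrix with entries between -1 and 1.
    """
    proj = (frames - mean_structure) @ eigvecs   # pc_i(t), shape (T, K)
    return np.corrcoef(proj, rowvar=False)

# Toy data: eigenvectors from the first half, projections of the second half.
rng = np.random.default_rng(1)
frames = rng.normal(size=(400, 12)) @ rng.normal(size=(12, 12))
first, second = frames[:200], frames[200:]
mean = first.mean(axis=0)
eigvals, vecs = np.linalg.eigh(np.cov(first - mean, rowvar=False))
vecs = vecs[:, np.argsort(eigvals)[::-1][:5]]    # five largest PCs

c = pc_trajectory_covariance(second, vecs, mean)
abs_c = np.abs(c)   # absolute value, as plotted alongside the signed matrix
```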
In Figure 5.10 we focus on the absolute value of covariance for the first 10 PCs,
comparing results for the gA, SS and RR channels. The dominant feature in the results for
gA is a white square pattern formed by the rows and columns of PC3 and PC8, which
delineates two blocks of components corresponding to modes A and B of the gA dimer.
The results for the SS and RR channels demonstrate that the correlations among PCs 1-3 are
persistent in the linked dimers, and mode A is largely the same as for the non-linked gA
channel. However, the correlations among PCs 4-8 are degraded in SS and almost non-
existent in RR. Hence the ‘opening’ mode B of gA is not present in the linked dimers, as
expected due to the covalent linkage at their centre (with RR more strongly perturbed than
SS). This is strong evidence that our grouping ansatz is reasonable, and independent modes
of motion are indeed spread across a number of PCs in PCA of adequate atomic resolution.
Figure 5.9: Covariance matrix of projected trajectories for the leading 23 PCs of gA (left) and the absolute value of the same quantity (right). i and j label the PC index.
Figure 5.10: Comparison of covariance matrices (absolute value) for the PC trajectories of gA, SS and RR channels.
5.7: Discussion and Conclusions
The features which are apparent in the log-log plot of the eigenvalue spectrum
suggest the notion of ‘covariance bandwidth’, which may be as relevant to the description of
overdamped dissipative systems as ‘frequency bandwidth’ is in describing the modes of
motion for oscillatory systems. Integrating across these features seems to describe real
physical modes which are projected across multiple PCs, in much the same way that many
points along a broad peak in a high resolution Fourier analysis describe an oscillatory mode
across a band of frequencies. The main claim of this chapter is that a linear combination of
PCs within these groups describes physically meaningful modes of motion, and provides a
simple means of describing large-scale functional motions from the components extracted
through PCA.
In the field of climatology the use of ‘EOF’ (107) is much more developed than the
current use of PCA in computational biophysics. There is a wide array of techniques for
extending and modifying PCA to make PCs more interpretable, or to determine how a
physical mode of activity is projected across more than one PC. For example, in the
“extended EOF” where time-lagged covariance is included (see page 33), PCs with
degenerate eigenvalues are understood to be components of a single mode (degeneracy here
is used in the approximate sense, where the eigenvalues fall within each other’s error bars).
We note that a flattening of the power spectra, as observed within the PC groups proposed
above (see Fig. 5.1A), favors this effect; the flatter the spectrum, the more degenerate the
grouping. Weighting PCs by the square root of their eigenvalue is also a standard operation
before rotating EOFs to obtain simplified mode structure, and suggests that our ansatz is a
physically meaningful operation. Moreover, in Chapter 4 we found groups of PCs in the
small covariance regime which shared the same frequency and phase of oscillations in their
MSD over picosecond timescales. If it is reasonable to group short components into a single
mode, then it may be expected that the same should hold true for the longest components.
Furthermore, it should be noted that the number of components resulting from PCA
scales with the number of atoms included in the analysis, while the number of “real,
physical” collective modes should be conserved independent of the subset of atoms included
in the PCA. Backbone motion of a protein provides a good illustration of this property; we
would expect to find the same set of physically meaningful backbone modes whether we
included only Cα, NCα, or NCαC atoms in our PCA. However, including N amino acids in
the analysis would result in 3N, 6N or 9N components which span the motion of the
backbone (the factor of 3 arises from the dimensionality of space); with more components
describing the same motion, the resulting eigenvectors would have to change shape. This
means that the shapes of individual components are not likely to be meaningful: no more so
than the shape of a single sin(x) function in the context of a Fourier transform on a noisy
signal. We would also expect the number of eigenvalues falling on the plateaus of Fig. 5.2 to
increase as we increase the number of atoms used in PCA. This is apparent to some degree
in Fig. 4.1.
In conclusion, a weighted superposition of principal components yields a small set of
physical modes of motion from a much larger number of Principal Components. To find
meaningful patterns of biomolecular motion one must target different windows within the
average covariance spectrum; the trick is knowing which components to sum together, and
how much of each. Our results suggest that the PCA eigenvalue spectrum contains this
information, and that there are five distinct collective modes of motion for the backbone of
gA solvated in a membrane. With the aid of carefully chosen directional coordinates for this
simple system, the spatial structure of the gA backbone modes (as well as their constituent
PCs) has been quantified to an unprecedented degree, such that differences in dynamic
modes could be resolved among linked analogs of the same molecule. This work presents
an approach to extracting the coherent structure from the apparent noise of biomolecular
motions, and will be helpful in future analysis of MD simulations.
Chapter 6: PCA of GMO Lipids Solvating Gramicidin
6.1: Background
In section 1.1 we introduced a few features of protein-lipid interactions which are
relevant to protein function and gA in particular. In general the presence of a protein within
a phospholipid bilayer increases the orientational order in the lipid matrix, and differentiates
the behavior of “annular” lipids which solvate the protein from those in the bulk of the
membrane (29). A comparative study of gA simulated in DiPhPC and GMO bilayers has
shown that GMO molecules are significantly more ordered than the diacyl chains, with three
distinct solvation shells apparent in the radial distribution function (36). In the case of the
diacyl phospholipid DMPC it is known that annular lipids remain associated with gA for
approximately 100 ns (30); we speculated that, since the free energy for moving the single acyl
chain found in GMO is lower than that for moving the two acyl chains in DMPC, the
annular residence time of a GMO molecule would be shorter than 100 ns, and therefore similar to
the simulation times considered in the current study of gA solvated in a GMO bilayer.
To the best of the author’s knowledge, there have been no PCA studies of membrane
dynamics. There are a number of intrinsic difficulties in applying PCA to a fluid composed
of many monomers. The diffusion and exchange of monomers prevents convergence to a
well-defined average structure at long timescales; indeed, in the long-time limit we would
expect the average structure to converge to a single point in the centre of the plane of
diffusion (given periodic boundary conditions). In the case of a well-structured liquid-crystal
where a lattice may be defined with a single monomer at each lattice site, this difficulty may
be overcome by exchanging the identity of two monomers when they exchange positions.
This is not possible in the case of a more amorphous liquid like the membrane bilayer, where
such a crystalline lattice cannot be defined (as is apparent in the planar distributions shown in
Fig. 6.2 below). Furthermore, PCA demands a well-defined set of atoms with continuous
coordinate trajectories. Any discontinuities of position associated with identity exchange
would give rise to artifacts in the long PCs, since these would appear as large displacements in
the covariance matrix.
In light of these difficulties we limit the current investigation to the relatively
structured annular shell of lipids in the first solvation shell of the gA molecule, where a
particular set of monomers can be chosen for study using PCA. This necessitates selecting
an appropriate timescale for PCA which is shorter than the residence time of monomers
within the annular shell, but long enough to capture any collective motions of lipids within
the shell itself. We hope to use PCA to compare the collective motions of the annular lipids
with the collective modes of the gA molecule itself, in order to establish whether there is
significant dynamical coupling between the protein and its immediate lipid environment.
6.2: Methods
We have performed PCA of the GMO membrane at six logarithmically spaced
timescales between 2 ns and 64 ns, aligning all frames on the NCαC atoms to subtract out
translation and rotation of the system with respect to the gA molecule. While our interest is
mainly in the annular lipids, we have performed PCA on the full membrane, on one solvation
shell (24 lipids) and on two solvation shells (48 lipids) for comparison, and in each case PCA
has been performed on the lipid headgroups alone, the acyl tails alone, and both together. To
limit the size of the data sets while capturing the relevant degrees of freedom, the tails were
represented by a subset of 6 carbon atoms, including both carbons flanking the double bond
in the middle of the acyl chain, the carbon atoms at the ends of the chain, and the carbon
atoms midway between these two positions. In order to address the coupling of lipids with
the gA molecule, we have performed PCA in each of these cases on the lipids alone as well
as the lipids plus the NCαC backbone atoms.
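The time windows can be laid out as in the short sketch below. It assumes the six timescales double from 2 ns to 64 ns and that sub-windows of the 64 ns trajectory are consecutive and non-overlapping; the helper `window_slices` and the frame rate `frames_per_ns=10` are hypothetical values chosen only to make the example concrete.

```python
import numpy as np

# Six logarithmically spaced durations: 2, 4, 8, 16, 32 and 64 ns.
windows_ns = 2.0 * 2.0 ** np.arange(6)

def window_slices(total_ns, window_ns, frames_per_ns):
    """Frame slices for consecutive, non-overlapping windows of equal length."""
    n = int(round(window_ns * frames_per_ns))        # frames per window
    total = int(round(total_ns * frames_per_ns))     # frames in the trajectory
    return [slice(start, start + n) for start in range(0, total - n + 1, n)]

# Hypothetical frame rate of 10 frames per ns; four 16 ns windows result.
slices_16 = window_slices(64.0, 16.0, frames_per_ns=10)
```

Each slice could then be applied to the aligned coordinate array before an independent PCA of that window.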
After equilibration of our simulation the gA molecule was located near the edge of a
box of GMO molecules with periodic boundary conditions. To avoid artifacts due to
periodic translation of GMO molecules, the trajectory needed to be ‘unwrapped’ given the
crystal parameters and configuration at a particular moment in time. Since examination of
the complete 64 ns trajectory revealed a number of exchange events between the annular
lipids and the bulk, we created two such unwrapped trajectories: one from the beginning of
the 64 ns trajectory and one from the mid-point of the trajectory at 32 ns. This resulted in
choosing two different sets of GMO monomers for study with PCA. Annular lipids were
selected by choosing all GMO molecules whose centre of mass was within the first and
second minima of the radial distribution function (RDF) to create one and two solvation
shells respectively, for both configurations at 0 ns and 32 ns. The RDF for GMO solvating
gA is shown in Figure 6.1, with minima at 37 Å and 58 Å.
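The shell selection can be sketched as a simple distance cut on lipid centres of mass. The cutoffs are the RDF minima quoted above; measuring the distance in the membrane plane only, the function name `select_shell`, and the random toy coordinates are assumptions for illustration.

```python
import numpy as np

def select_shell(lipid_com, protein_com, cutoff):
    """Indices of lipids whose centre of mass lies within `cutoff` of the protein.

    lipid_com   : (M, 3) lipid centres of mass from one unwrapped frame
    protein_com : (3,) centre of mass of the gA dimer
    cutoff      : radius of an RDF minimum, in the units of the coordinates
    """
    # Assumption: distance is measured in the membrane (xy) plane only.
    radial = np.linalg.norm(lipid_com[:, :2] - protein_com[:2], axis=1)
    return np.flatnonzero(radial < cutoff)

# Toy configuration: 300 lipids scattered around a protein at the origin.
rng = np.random.default_rng(2)
lipid_com = rng.uniform(-80.0, 80.0, size=(300, 3))
protein_com = np.zeros(3)

shell_one = select_shell(lipid_com, protein_com, cutoff=37.0)   # first minimum
shell_two = select_shell(lipid_com, protein_com, cutoff=58.0)   # second minimum
```

Selecting once from a single frame, as here, fixes the set of monomers for the whole window, which is why the choice of starting configuration matters.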
Figure 6.1: Radial distribution function of GMO lipids surrounding gA.
6.3: Results and Discussion
In order to determine the appropriate timescale for PCA of the annular GMO
monomers, we compared the 2-dimensional distributions of GMO monomers surrounding the
gA dimer in the plane of the membrane. Figure 6.2 shows contour maps of representative
distributions for the 24 monomers in two solvation shells within a single leaflet of the
membrane at various timescales, where the GMO monomer position was represented by its
centre of mass. The distributions for the mass-weighted average of the headgroup and acyl
tail were also computed for comparison, as well as the ester oxygen linking the two; all
results were qualitatively similar. Figure 6.2 makes it clear that there is no fixed solvation
structure to be found in the annular lipids, and even identifying a coordination number is
difficult in this fluid system. It is also clear that the monomers are more localized at short
timescales and become less so at longer timescales. At the longest timescales the symmetry
of the distributions is broken, and it becomes apparent that the same set of monomers no
longer constitutes two solvation shells around the gA dimer: the 32 ns distribution is the
longest sample for which the annular structure is still apparent, but at 64 ns the monomers
have been displaced too much to discern the solvation structure. To investigate the annular
structure further we also compared the average structures obtained for independent PCA
across various subsets of the 64 ns trajectory. These structures are shown in Figure 6.3, and
reveal that relatively symmetric and uniform lipid distributions around the gA molecule
were obtained up to 32 ns, but not for the full 64 ns trajectory (using either set of annular
GMO monomers selected from configurations at 0 or 32 ns). This figure also reveals that the
internal degrees of freedom of the GMO monomers average out near 32 ns, resulting in
straight-chain average monomer structures. Taken together, the results of Figures 6.2 and 6.3
indicate that the solvation structure of gA is persistent for longer than 16 ns but less than 64
ns.
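A planar distribution of this kind can be sketched as a two-dimensional histogram of centre-of-mass positions accumulated over a time window. The function name `planar_distribution`, the bin count, and the random toy positions are assumptions for illustration; contouring the returned array would give a map analogous to those of Figure 6.2.

```python
import numpy as np

def planar_distribution(com_xy, extent, bins=50):
    """Normalised 2D histogram of lipid positions in the membrane plane.

    com_xy : (T, M, 2) in-plane centre-of-mass positions of M lipids, T frames
    extent : half-width of the square region, centred on the protein
    """
    pts = com_xy.reshape(-1, 2)          # pool all frames and all lipids
    hist, xedges, yedges = np.histogram2d(
        pts[:, 0], pts[:, 1], bins=bins,
        range=[[-extent, extent], [-extent, extent]], density=True)
    return hist, xedges, yedges

# Toy data: 24 lipids over 100 frames, scattered around the origin.
rng = np.random.default_rng(3)
com_xy = rng.normal(scale=10.0, size=(100, 24, 2))
hist, xedges, yedges = planar_distribution(com_xy, extent=100.0)
```

Repeating the calculation for windows of increasing duration would show the broadening of the distributions described in the text.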
Figure 6.4 shows representative eigenvalue spectra for PCA of 2 through 64 ns time
windows on the headgroup (bottom), tails (middle), and complete GMO monomers (top) for
a single solvation shell around gA, including (right) and not including (left) the gA backbone
atoms in the PCA. There are almost no differences in the largest eigenvalues for the
headgroups, tails, or both together, either with or without the gA atoms, but differences
between the headgroup and tail spectra appear in the mid- and short-scale PCs. The main
difference arising from inclusion of gA atoms appears to be a steeper scaling of the shortest
PCs. The main feature apparent in all spectra is that the shape of the curve for large
eigenvalues is not consistent for the various durations tested. The 64 ns curve is distinctly
different than the others, and combined with its asymmetric average structure this suggests
that the annular lipid structure is not conserved at this timescale. The similarity of the 16 ns
and 32 ns spectra, in addition to the results of Figures 6.2 and 6.3, leads us to focus on the 32
ns timescale in the following analysis of eigenvectors.
The eigenvectors of the largest three PCs of the complete GMO monomers are
illustrated in Figure 6.5. We have used a coloured direction coordinate to resolve some of
the ambiguity of reading 3D data on a 2D graph, by taking the dot product of each arrow
with the (1,1,0) vector, and mapping +1 to red and -1 to blue, with continuous shading
through purple for the values in between. The length of the arrows indicates the magnitude
of fluctuations on a given atom. We show four panels for each PC for clarity, depicting the
front and back (top and bottom) of the system viewed from the X (Z) direction. The average
structure of the gA backbone is also shown for reference in each panel, taken from
independent PCA of the backbone in the same time window. One of the main features of the
eigenvectors is the largely uniform motion of entire GMO monomers, with no significant
differences between the lipid headgroups and their tails; this suggests that the largest PCs
capture the diffusive motion of whole monomers. There are considerable differences
between the motion of monomers in the top and bottom leaflet of the bilayer. The first PC is
dominated by the displacement of two neighbouring monomers, as is the second PC, where
the same monomers move in the opposite direction. There are no obvious global patterns of
motion apparent in these or the third PC.
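The coloured direction coordinate used for these eigenvector plots can be sketched as below. The dot product with the (1,1,0) vector and the red/blue endpoints follow the description above; the linear blend through purple, the normalisation of each arrow to unit length, and the function name `direction_to_rgb` are assumptions for illustration.

```python
import numpy as np

def direction_to_rgb(arrows, reference=(1.0, 1.0, 0.0)):
    """Map each arrow's direction relative to `reference` onto a red-blue scale.

    arrows : (N, 3) eigenvector components, one arrow per atom
    Returns (d, rgb): d in [-1, 1] and an (N, 3) array of RGB colours,
    with +1 -> red, -1 -> blue and intermediate values shading through purple.
    """
    ref = np.asarray(reference, dtype=float)
    ref /= np.linalg.norm(ref)
    unit = arrows / np.linalg.norm(arrows, axis=1, keepdims=True)
    d = unit @ ref
    red = (1.0 + d) / 2.0                # assumed linear blend
    blue = (1.0 - d) / 2.0
    return d, np.column_stack([red, np.zeros_like(d), blue])

# Three test arrows: along, against, and perpendicular to (1,1,0).
arrows = np.array([[1.0, 1.0, 0.0], [-2.0, -2.0, 0.0], [0.0, 0.0, 3.0]])
d, rgb = direction_to_rgb(arrows)
```

Arrow length is left to the plotting step, so that colour encodes direction only while length encodes the magnitude of fluctuation.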
Figure 6.6 shows the eigenvalue-weighted superposition of PCs 1 to 3, a block
suggested by the distinct and common scaling shown in the eigenvalue spectrum in Fig. 6.4.
These ‘rotated’ PCA results feature more uniformly distributed magnitudes of displacement
across the GMO monomers than individual PCs, and reveal a torsional mode of collective
tangential motions moving clockwise around the gA backbone. This is most apparent in the
top leaflet (+Z) of the bilayer. There is a quadrant of monomers in the top leaflet which
depart from this pattern, but the monomers in the same quadrant of the bottom leaflet match
the tangential pattern of the top. While still quite noisy, it is clear that the linear combination
of PCs 1 through 3 yields a more coherent collective pattern of displacement than any
individual PC shown in Fig. 6.5. The wringing pattern here is also suggestive of the largest
PC of gA, as shown in Figure 5.3, though not of the emergent mode composed of gA
backbone PCs 1-3. This may be evidence of coupling among the largest covariant motions of
the gA backbone and its solvating phospholipids. Note that this raises the difficult question
of how many PCs are to be compared when searching for common patterns among differing
subsets of a complex system’s motion.
Figure 6.6 also shows the eigenvalue-weighted sum of PCs 4 through 12, which is the
next scaling group in the eigenvalue spectrum. While the pattern of motion here is less
obvious, the collective displacements are largest on the headgroups with very little motion of
the lipid tails. This mode seems to describe motion internal to the GMO
monomer, while the largest mode describes the relative motion of whole monomers. This
result demonstrates the ability of linear superpositions of PCs to separate modes of motion in
a complex and noisy system into patterns which are more interpretable than any individual
PC. Moreover, the differentiation of headgroup and tail motion in this mode is also
suggestive of the countermotion of hydrophilic and hydrophobic turns observed in the
dominant mode of the gA backbone, again suggesting the possibility of coupling between
gA and annular lipids.
The PC distributions shown in Figure 6.7 reveal that the largest PCs of the annular
lipids are multimodal, indicating that the tangential motion described in Figure 6.6 is not
likely to be a gradual drift, but a concerted hopping motion of monomers between favoured
solvation sites. There is a very strong overlap between the distributions of the first and
second PC, which is strong evidence that these are components of the same mode. The
multimodal character of the distributions is still apparent up to ~PC10, and converges to a
relatively smooth unimodal Gaussian after ~PC25. These distributions are reminiscent of the
multimodal side chain distributions shown in Figures 3.4 and 4.3, and suggest the possibility
of coupling between concerted motions of Trp side-chains and annular GMO molecules.
However, examination and comparison of the PC trajectories for GMO and side chains did
not reveal any consistent patterns of correlated jumps between stable positions.
Examination of the magnitudes of eigenvectors on the NCαC+GMO data set reveals
that there are no large displacements of the backbone atoms within the first 30 PCs; all the
fluctuations of the largest PCs are concentrated on the annular GMO molecules. Furthermore,
collective modes from the PCA of the full membrane are dominated by concerted motions of
lipids far from the gA site. These observations remind us that PCA is most effective with a
judicious choice of atoms to be included in the analysis, since the largest PCs are generally
dominated by motions at the largest spatial scale included in the analysis. One must either
look much further into the PC spectrum toward shorter covariant motions to find collective
modes of interest, or look at the largest PCs for a set of atoms which only span the
appropriate spatial dimension.
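Examining eigenvector magnitudes per atom amounts to a short calculation, sketched here under the assumption that each eigenvector is stored as a flat array ordered (x1, y1, z1, x2, ...). The helper names and the idea of summarizing a subset by its share of squared amplitude are hypothetical conveniences.

```python
import numpy as np

def atom_amplitudes(evec):
    """Displacement magnitude of each atom in one eigenvector of length 3N."""
    return np.linalg.norm(evec.reshape(-1, 3), axis=1)

def subset_weight(evec, atom_indices):
    """Fraction of the eigenvector's squared amplitude carried by the chosen atoms."""
    amp2 = atom_amplitudes(evec) ** 2
    return amp2[atom_indices].sum() / amp2.sum()
```

Calling `subset_weight` with the indices of the backbone atoms for each of the leading eigenvectors would show directly how much of each PC resides on the backbone rather than on the annular lipids.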
In conclusion, there are some suggestions of shared patterns of motion among the gA
dimer and its annular lipids, although these are largely qualitative and no obvious
correlations were found among the relevant time trajectories of collective motions. One of
the main obstacles which is made apparent in this application is the limited ability of PCA to
analyze and differentiate dynamics spread across widely varying timescales, if these are not
directly coupled to widely varying length scales. While this is a preliminary attempt to apply
PCA in a novel situation, and our results show promise in their ability to extract patterns
from very noisy data, more sophisticated methods of time trajectory analysis are needed to
study the coupling of subsystem dynamics in any detail. The standard application of PCA
relies on simple averaging, and as such it is very difficult to adequately address multi-
timescale behavior. Elaborations of PCA which are capable of separating patterns in time
(i.e. “extended” PCA) in addition to patterns in space would be necessary to more fruitfully
tackle this problem. Moreover, application of PCA to unbounded, diffusive, multimeric
systems is intrinsically problematic for the same reasons, since average structures are ill-
defined in this context. PCA is best used on bounded systems with well-defined average
structure.
Figure 6.2: Top leaflet: planar distribution function of GMO lipids surrounding gA.
Figure 6.3: Comparison of average structures for one solvation shell of GMO monomers.
Figure 6.4: Eigenvalue spectra for a single solvation shell of GMO around gA. Representative curves are taken from independent PCA of various timescales, doubling in duration from 2 ns (red) to 64 ns (purple). The NCαC gA backbone is included in the analysis on the right, while only annular GMO molecules are included on the left.
Figure 6.5: PC 1 (top), PC 2 (middle) and PC 3 (bottom) for a single solvation shell of GMO monomers around the gA molecule. The directions of displacement are coloured according to their dot product with the (1,1,0) vector.
Figure 6.6: Eigenvalue weighted sum of PC 1 to 3 (top) and PC 4 to 12 (bottom).
Figure 6.7: Eigenvalue-normalized distributions of PC trajectories for the large eigenvalue regime, shown in comparison with a unit Gaussian.
Chapter 7: General Conclusions and Future Directions
In this study we have seen that there is room for the expansion and development of
PCA as a technique for translating large datasets of atomic motions into quantitative
descriptions of collective motions related to function. We have used gramicidin and its
linked analogs as a test system in this regard, due to its structural and functional simplicity.
In this system we have seen that there is information of interest to structural biologists not
only in a few leading principal components, but also distributed throughout the PC spectrum
in the form of eigenvalue scaling, non-Gaussian distributions and MSD oscillations.
Eigenvalue scaling separates conformational changes where many atoms move in a uniform
direction from vibrations internal to a complex molecular structure. Non-Gaussian
distributions identify collective motions dwelling on an anharmonic free energy
surface, indicating coupled or multi-modal dynamics which are suggestive of functional
organization. Oscillations in the MSD connect these dynamics to spectroscopic IR
measurements. Moreover, we have demonstrated that the 3D structures of PCs are not likely
to be individually meaningful, and further transformations or combinations of PCs are
needed to yield a description of concerted physical modes of motion. We have proposed an
eigenvalue-weighted linear superposition of eigenvectors grouped according to band-gaps
observed in the eigenvalue spectrum. With the aid of a directional coordinate we have
reduced the conformational degrees of freedom for our simple protein to a small set of
collective modes which are physically intuitive and functionally interpretable.
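The superposition proposed above can be sketched in a few lines. The sketch assumes eigenvalues sorted in descending order with eigenvectors as the columns of a matrix, detects band gaps with a simple ratio threshold between consecutive eigenvalues (a hypothetical criterion), and weights each eigenvector linearly by its eigenvalue (one possible reading of "eigenvalue-weighted"). Eigenvector signs are arbitrary in PCA, so the signs must be fixed by some external convention, such as a directional coordinate, before the sum is physically meaningful.

```python
import numpy as np

def group_by_gaps(evals, ratio=2.0):
    """Split a descending eigenvalue spectrum into index groups.

    A new group starts wherever consecutive eigenvalues drop by more than
    `ratio`; this threshold is an illustrative stand-in for visual
    identification of band gaps.
    """
    evals = np.asarray(evals, dtype=float)
    breaks = np.where(evals[:-1] / evals[1:] > ratio)[0] + 1
    return np.split(np.arange(len(evals)), breaks)

def weighted_superposition(evals, evecs, group):
    """Eigenvalue-weighted sum of the eigenvectors in `group`, normalized to unit length.

    evecs: array of shape (n_coords, n_modes), one eigenvector per column.
    """
    evals = np.asarray(evals, dtype=float)
    mode = evecs[:, group] @ evals[group]    # sum_i lambda_i * v_i
    return mode / np.linalg.norm(mode)
```

For a spectrum such as (10, 9, 8, 2, 1.8, 0.5), the default threshold yields three groups, and each group can be passed to `weighted_superposition` to obtain one combined mode per plateau.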
The most obvious next step in this study would be to apply our proposals to different
protein systems with known functional modes, both at long and short spatial scales. For
example, it would be interesting to apply the metrics described in Chapter 4 to
carbonmonoxy myoglobin (MbCO) for the short PCs, to see if any non-Gaussian
distributions or oscillations in the MSD relate to collective modes of the CO ligand, heme
group, or the surrounding hydrophobic binding pocket. This system has been studied using
PCA (216), and the results of conformational analysis have been related to spectroscopic “A-
states” which exhibit four IR absorption bands. These are believed to relate to MbCO’s
ability to differentiate between diatomic ligands CO and O2. Just as our results in Chapter 4
related MSD oscillations to the IR spectra of gA, it would be interesting to produce spectra
like those in Fig. 4.7 for MbCO, and associate the structure of PC eigenvectors with peaks in
the IR regime.
At long spatial scales there are a number of studies which could elucidate the validity
of our ansatz regarding superpositions of PCs. As with any spectroscopy, we can expect that
the mixing of modes becomes more problematic with increasing system size; the separation
of modes and the bandgaps between them are likely to be clearer for less complex
systems. Hence it would be most helpful to study a number of relatively small proteins such
as lysozyme or crambin, to see if there are plateaus or bandgaps in their eigenvalue spectra.
If so, eigenvalue-weighted superpositions within these plateaus should reveal easily
interpreted collective modes of motion which may relate to the function of these proteins.
Another set of interesting comparisons could be made with model systems to separate
harmonic from diffusive degrees of freedom. PCA of MD simulations of pure crystals would
show the eigenvalue signature of purely intramolecular (harmonic) interactions while PCA of
simulations of simple gases such as argon would elucidate intermolecular (diffusive)
interactions. Of course, an essential ingredient in any of these studies would be the
development of appropriate directional coordinates by which to judge the coherent structure
of resulting eigenvectors. Comparisons of PCA with NMA for these systems would also be
insightful.
The coupling of motion among subsystems within a complex molecular assembly is
of general interest to biophysicists and structural biologists. This is the general line of
inquiry begun in Chapter 6 regarding gA and GMO dynamics, and it would be fruitful to
continue this approach to study the coupling of gA dynamics with the water molecules in the
channel lumen, or the ions which translocate through the lumen. Many of the challenges
outlined in Chapter 6 would arise in such a study, not least the
treatment of the changing molecular identity of the water molecules which occupy the
lumen, and the development of quantitative techniques to detect correlations among collective
motions and to ascribe causal directions to any such correlations. On the other
hand, the average structure of lumen waters is much better defined than the solvation
structure of GMO molecules, and PCA may have an easier time of detecting functional
motions in this application.
Our discussion of EOFs in section 2.2.4 provides a long list of possible future studies
using PCA on protein dynamics, most of which would be original at this time. Choosing an
appropriate test system is especially important when trying novel analysis techniques, and
there are two criteria to observe in this case. The structural and functional simplicity of
gramicidin is useful in allowing a relatively straightforward description of results. But there
is also a need for well characterized patterns of motion of adequate biological complexity, in
order to test the ability of a technique to extract the relevant biological information from
atomic motions. The transition from T (tense) to R (relaxed) conformation upon O2 binding
in hemoglobin is a good example of this, and could serve as the equivalent of the well-
characterized patterns of atmospheric disturbances which were necessary in developing
extensions of EOFs.
While the utility of rotated EOFs remains controversial, it would be interesting to test
the ability of Varimax and related algorithms to extract simplified modes of motion from
PCA of MD simulations. Probably the most limiting factor in standard PCA is its reliance on
instantaneous covariance among atoms, which ignores the effect of memory and time-lags in
constructing the structure of dynamic modes; “extended” PCA which includes covariance of
motion at different points in time may reveal a much more accurate description of collective
modes in a protein. “Complex” PCA also offers the potential to explore correlations among
related variables in the MD data set; creating a complex number from the position and
velocity of an atom offers the possibility of discovering modes of motion which span the
complete phase space of protein dynamics.
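One common way to build time-lagged covariance into PCA, borrowed from the extended EOF literature, is to stack time-shifted copies of the coordinate matrix and apply ordinary PCA to the result. The following is a minimal sketch of that construction, assuming a trajectory array of shape (frames, coordinates); the function names and the choice of lag window are illustrative assumptions, and nothing here has been applied to the gramicidin data.

```python
import numpy as np

def lag_embed(X, n_lags):
    """Stack n_lags time-shifted copies of X side by side.

    X: array of shape (n_frames, n_coords). Row t of the result holds the
    coordinates at frames t, t+1, ..., t+n_lags-1.
    """
    n_rows = X.shape[0] - n_lags + 1
    return np.hstack([X[k:k + n_rows] for k in range(n_lags)])

def extended_pca(X, n_lags):
    """PCA of the lag-embedded trajectory; returns eigenvalues and eigenvectors."""
    Y = lag_embed(X, n_lags)
    Yc = Y - Y.mean(axis=0)
    # The SVD of the centred data gives the covariance eigensystem without
    # forming the (n_lags * n_coords)^2 covariance matrix explicitly.
    _, s, vt = np.linalg.svd(Yc, full_matrices=False)
    evals = s ** 2 / (Yc.shape[0] - 1)
    return evals, vt.T                       # eigenvectors as columns
```

Each eigenvector then has length `n_lags * n_coords` and can be reshaped to `(n_lags, n_coords)` to read it as a short sequence of spatial patterns, which is how such a mode encodes structure in time as well as in space.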
We can always learn more from extending MD simulations to longer timescales.
Further characterization of the timescale dependence of eigenvector shapes, eigenvalue
spectra and PC distributions is always welcome, especially to elucidate the convergence of
the grouping plateaus in the eigenvalue spectrum. While we have argued that the backbone
dynamics of gA have adequately converged in this study, much longer simulations would
presumably capture dissociation events of the dimer, and it would be very interesting to
investigate whether mode B as described in Chapter 5 is associated with these events. It is
also clear that neither side chain nor annular GMO dynamics have converged at 64 ns, and
further analysis of these aspects of gramicidin dynamics would require longer simulations.
References
1. Karplus, M. 1987. Molecular dynamics simulations of proteins. Phys. Today 40:68‐70.
2. Karplus, M., and J. Kuriyan. 2005. Molecular dynamics and protein function. Proc. Nat. Acad. Sci. USA 102:6679‐6685.
3. McCammon, J. A., and S. C. Harvey. 1987. Dynamics of Proteins and Nucleic Acids. Cambridge University Press, New York.
4. Roux, B., and K. Schulten. 2004. Computational studies of membrane channels. Structure 12:1343‐1351.
5. Sanbonmatsu, K. Y., and C. S. Tung. 2007. High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology 157:470‐480.
6. Wolynes, P. G. 2005. Energy landscapes and solved protein‐folding problems. Philos. Trans. R. Soc. A 363:453‐467.
7. Zhou, Y., and M. Karplus. 1999. Interpreting the folding kinetics of helical proteins. Nature 401:400‐403.
8. Gianni, S., N. R. Guydosh, F. Khan, T. D. Caldas, U. Mayor, G. W. N. White, M. L. DeMarco, V. Daggett, and A. R. Fersht. 2003. Unifying features in protein‐folding mechanisms. Proc. Nat. Acad. Sci. USA 100:13286‐13291.
9. Garcia‐Viloca, M., J. Gao, M. Karplus, and D. G. Truhlar. 2004. How Enzymes Work: Analysis by Modern Rate Theory and Computer Simulations. Science 303:186‐195.
10. Wolfenden, R., and M. J. Snider. 2001. The Depth of Chemical Time and the Power of Enzymes as Catalysts. Acc. Chem. Res. 34:938‐945.
11. Villa, J., and A. Warshel. 2001. Energetics and Dynamics of Enzymatic Reactions. J. Phys. Chem. B 105:7887‐7907.
12. Brooks, C. L., M. Karplus, and B. M. Pettitt. 1988. Proteins: A Theoretical Perspective of Dynamics, Structure and Thermodynamics. Wiley, New York.
13. Cui, Q., and M. Karplus. 2002. Promoting Modes and Demoting Modes in Enzyme‐Catalyzed Proton Transfer Reactions: A Study of Realistic Systems. J. Phys. Chem. B 106:1768‐1798.
14. Kursula, I., M. Salin, J. Sun, B. V. Norledge, A. M. Haapalainen, N. S. Sampson, and R. K. Wierenga. 2004. Understanding protein lids: structural analysis of active hinge mutants in triosephosphate isomerase. Protein Eng., Des. Sel. 17:375‐382.
15. Horton, H. R., L. A. Moran, K. G. Scrimgeour, M. D. Perry, and J. D. Rawn. 2006. Principles of Biochemistry. Pearson Prentice Hall.
16. Koppole, S., J. C. Smith, and S. Fischer. 2006. Simulations of the myosin II motor reveal a nucleotide‐state sensing element that controls the recovery stroke. J. Mol. Biol. 361:604‐616.
17. Mesentean, S., S. Koppole, J. C. Smith, and S. Fischer. 2007. The principal motions involved in the coupling mechanism of the recovery stroke of the myosin motor. J. Mol. Biol. 367:591‐602.
18. Carnevale, V., S. Raugei, C. Micheletti, and P. Carloni. 2006. Convergent Dynamics in the Protease Enzymatic Superfamily. J. Am. Chem. Soc. 128:9766‐9772.
19. Hodgkin, A. L., and R. D. Keynes. 1955. The potassium permeability of a giant nerve fibre. J. Physiol. (Lond.) 128:61‐88.
20. Doyle, D. A., J. M. Cabral, R. A. Pfuetzner, A. Kuo, J. M. Gulbis, S. L. Cohen, B. T. Chait, and R. MacKinnon. 1998. The structure of the potassium channel: molecular basis of K+ conduction and selectivity. Science 280:69‐77.
21. Zhou, Y., J. H. Morais‐Cabral, A. Kaufman, and R. MacKinnon. 2001. Chemistry of ion coordination and hydration revealed by a K+ channel‐Fab complex at 2.0 Å resolution. Nature 414:43‐48.
22. Berneche, S., and B. Roux. 2000. Molecular dynamics of the KcsA K+ channel in a bilayer membrane. Biophys. J. 78:2900‐2917.
23. Noskov, S. Y., S. Berneche, and B. Roux. 2004. Control of ion selectivity in potassium channels by electrostatic and dynamic properties of carbonyl ligands. Nature 431:830‐834.
24. Thomas, M., D. Jayatilaka, and B. Corry. 2007. The Predominant Role of Coordination Number in Potassium Channel Selectivity. Biophys. J. 93:2635‐2643.
25. Lee, A. G. 2003. Lipid‐protein interactions in biological membranes: a structural perspective [Review]. Biochimica et Biophysica Acta 1612:1‐40.
26. Hunte, C., and S. Richers. 2008. Lipids and membrane protein structures. Curr. Opin. Struc. Biol. 18:406‐411.
27. Saiz, L., S. Bandyopadhyay, and M. L. Klein. 2004. Effect of the Pore Region of a Transmembrane Ion Channel on the Physical Properties of a Simple Membrane. J. Phys. Chem. B 108:2608‐2613.
28. Deol, S. S., P. J. Bond, C. Domene, and M. S. P. Sansom. 2004. Lipid‐Protein Interactions of the Integral Membrane Proteins: A Comparative Simulation Study. Biophys. J. 87:3737‐3749.
29. Lee, A. G. 2004. How lipids affect the activities of integral membrane proteins [Review]. Biochimica et Biophysica Acta 1666:62‐87.
30. Marsh, D., and L. I. Horvath. 1998. Structure, dynamics and composition of the lipid‐protein interface. Perspectives from spin‐labelling. Biochimica et Biophysica Acta 1376:267‐296.
31. de Planque, M. R. R., D. V. Greathouse, R. E. I. Koeppe, H. Schafer, D. Marsh, and J. A. Killian. 1998. Influence of Lipid/Peptide Hydrophobic Mismatch on the Thickness of Diacylphosphatidylcholine Bilayers. A 2H NMR and ESR Study Using Designed Transmembrane α‐Helical Peptides and Gramicidin A. Biochemistry 37:9333‐9345.
32. Killian, J. A. 1992. Gramicidin and gramicidin‐lipid interactions. Biochimica et Biophysica Acta 1113:391‐425.
33. Costa‐Filho, A. J., R. H. Crepeau, P. P. Borbat, M. Ge, and J. H. Freed. 2003. Lipid‐Gramicidin Interactions: Dynamic Structure of the Boundary Lipid by 2D‐ELDOR. Biophys. J. 84:3364‐3378.
34. Woolf, T. B., and B. Roux. 1994. Molecular‐dynamics of the gramicidin channel in a phospholipid membrane. Proc. Nat. Acad. Sci. USA 91:11631‐11635.
35. Chiu, S., S. Subramaniam, and E. Jakobsson. 1999. Simulation study of a gramicidin/lipid bilayer system in excess water and lipid. II. Rates and mechanisms of water transport. Biophys. J. 76:1939‐1950.
36. Qin, Z., H. L. Tepper, and G. A. Voth. 2007. Effect of membrane environment on proton permeation through gramicidin A channels. J. Phys. Chem. B 111:9931‐9939.
37. Dubos, R. J., and C. Cattaneo. 1939. Studies on a bactericidal agent extracted from a soil bacillus: III. Preparation and activity of a protein‐free fraction. J. Exp. Med. 70:249‐256.
38. Arseniev, A. S., I. L. Barsukov, V. F. Bystrov, A. L. Lomize, and Y. A. Ovchinnikov. 1985. 1H‐NMR study of gramicidin A transmembrane ion channel. Head‐to‐head right‐handed single stranded helices. FEBS Letters 186:168‐174.
39. Hladky, S. B., and D. A. Haydon. 1972. Ion transfer across lipid membranes in the presence of gramicidin A: 1. Studies of the unit conductance channel. Biochim. Biophys. Acta 274:294‐312.
40. Eisenman, G., and R. Horn. 1983. Ionic selectivity revisited: the role of kinetic and equilibrium processes in ion permeation through channels. J. Membr. Biol. 76:197‐225.
41. Finkelstein, A., and O. S. Andersen. 1981. The Gramicidin A Channel: a review of its permeability characteristics with special reference to the single‐file aspect of transport. J. Membr. Biol. 59:155‐171.
42. Hladky, S. B., and D. A. Haydon. 1974. Temperature‐dependent properties of gramicidin A channels. Biochim. Biophys. Acta 367:127‐133.
43. Killian, J. A. 1992. Gramicidin and gramicidin‐lipid interactions. Biochim. Biophys. Acta 1113:391‐425.
44. Ketchem, R. R., B. Roux, and T. A. Cross. 1997. High‐resolution polypeptide structure in a lamellar phase lipid environment from solid state NMR derived orientational constraints. Structure 5:1655‐1669.
45. Ketchem, R. R., W. Hu, and T. A. Cross. 1993. High‐Resolution Conformation of Gramicidin A in a Lipid Bilayer by Solid‐State NMR. Science 261:1457‐1460.
46. Mitchell, J. B. O., and J. Smith. 2003. D‐amino acid residues in peptides and proteins. Proteins 50:563‐571.
47. Elliott, J. R., D. Needham, J. P. Dilger, and D. A. Haydon. 1983. The effects of bilayer thickness and tension on gramicidin single‐channel lifetime. Biochim. Biophys. Acta 735:95‐103.
48. Huang, H. W. 1986. Deformation free energy of bilayer membrane and its effect on gramicidin channel lifetime. Biophys. J. 50:1061‐1070.
49. Stankovic, C. J., S. H. Heinemann, J. M. Delfino, F. J. Sigworth, and S. L. Schreiber. 1989. Transmembrane Channels Based on Tartaric Acid ‐ Gramicidin A Hybrids. Science 244:813‐817.
50. Stankovic, C. J., S. H. Heinemann, and S. L. Schreiber. 1990. Immobilizing the Gate of a Tartaric Acid Gramicidin ‐ A Hybrid Channel Molecule by Rational Design. J. Am. Chem. Soc. 112:3702‐3704.
51. Cukierman, S., E. P. Quigley, and D. S. Crumrine. 1997. Proton conduction in gramicidin A and in its dioxolane‐linked dimer in different lipid bilayers. Biophys. J. 73:2489‐2502.
52. Quigley, E. P., D. S. Crumrine, and S. Cukierman. 2000. Gating and Permeation in Ion Channels Formed by Gramicidin A and Its Dioxolane‐linked Dimer in Na+ and Cs+ Solutions. J. Membrane Biol. 174:207‐212.
53. Quigley, E. P., P. Quigley, D. S. Crumrine, and S. Cukierman. 1999. The Conduction of Protons in Different Stereoisomers of Dioxolane‐Linked Gramicidin A Channels. Biophys. J. 77:2479‐2491.
54. Roux, B. 2002. Computational studies of the gramicidin channel. Acc. Chem. Res. 35:366‐375.
55. Roux, B., and M. Karplus. 1991. Ion transport in a model gramicidin channel: Structure and thermodynamics. Biophys. J. 59:961‐981.
56. Roux, B., and M. Karplus. 1994. Molecular dynamics simulations of the gramicidin channel. Annu. Rev. Biophys. Biomol. Struct. 23:731‐761.
57. Lauger, P. 1973. Ion transport through pores: a rate theory analysis. Biochim. Biophys. Acta 311:423‐441.
58. Schumaker, M., R. Pomès, and B. Roux. 2000. A combined molecular dynamics and diffusion model of single proton conduction through gramicidin. Biophys. J. 79:2840‐2857.
59. Kirkwood, J. G. 1935. Statistical mechanics of fluid mixtures. J. Chem. Phys. 3:300.
60. Allen, T. W., O. S. Andersen, and B. Roux. 2006. Ion permeation through a narrow channel: Using gramicidin to ascertain all‐atom molecular dynamics potential of mean force methodology and biomolecular force fields. Biophys. J. 90:3447‐3468.
61. Allen, T. W., O. S. Andersen, and B. Roux. 2006. Molecular dynamics ‐ potential of mean force calculations as a tool for understanding ion permeation and selectivity in narrow channels. Biophys. Chem. 124:251‐267.
62. Decornez, H., K. Drukker, and S. Hammes‐Schiffer. 1999. Solvation and Hydrogen‐Bonding Effects on Proton Wires. J. Phys. Chem. A 103:2891‐2898.
63. Pomès, R. 1995. Quantum effects on the structure and energy of a protonated linear chain of hydrogen‐bonded water molecules. Chemical Physics Letters 234:416‐424.
64. Pomès, R., and B. Roux. 1996. Theoretical Study of H+ Translocation along a Model Proton Wire. J. Phys. Chem. 100:2519‐2527.
65. Pomès, R., and B. Roux. 1998. Free Energy Profiles of H+ Conduction along Hydrogen‐Bonded Chains of Water Molecules. Biophys. J. 75:33‐40.
66. Drukker, K., S. W. de Leeuw, and S. Hammes‐Schiffer. 1998. Proton transport along water chains in an electric field. J. Chem. Phys. 108:6799‐6808.
67. Chakrabarti, N., B. Roux, and R. Pomès. 2004. Structural Determinants of Proton Blockage in Aquaporins. J. Mol. Biol. 20:1‐18.
68. Chakrabarti, N., E. Tajkhorshid, B. Roux, and R. Pomès. 2004. Molecular Basis of Proton Blockage in Aquaporins. Structure 12:65‐74.
69. Pomès, R., and B. Roux. 1996. Structure and Dynamics of a Proton Wire: A Theoretical Study of H+ Translocation along the Single‐File Water Chain in the Gramicidin A Channel. Biophys. J. 71:19‐39.
70. Pomès, R., and B. Roux. 2002. Molecular mechanism of H+ conduction in the single‐file water chain of the gramicidin channel. Biophys. J. 82:2304‐2316.
71. Pomès, R., and B. Roux. 1996. Structure and Dynamics of a Proton Wire: A Theoretical Study of H+ Translocation along the Single‐File Water Chain in the Gramicidin A Channel. Biophys. J. 71:19‐39.
72. Yu, C. H., and R. Pomès. 2003. Functional dynamics of ion channels: modulation of proton movement by conformational switches. J. Am. Chem. Soc. 125:13890‐13894.
73. Urry, D. W., S. Alonso‐Romanowski, C. M. Venkatachalam, R. J. Bradley, and R. D. Harris. 1984. Temperature Dependence of Single Channel Currents and the Peptide Libration Mechanism for ion Transport through the Gramicidin A Transmembrane Channel. J. Membr. Biol. 81:205‐217.
74. Roux, B., and M. Karplus. 1988. The normal modes of the gramicidin‐A dimer channel. Biophys. J. 53:297‐309.
75. Chiu, S., E. Jakobsson, S. Subramaniam, and J. A. McCammon. 1991. Time‐correlation analysis of simulated water motion in flexible and rigid gramicidin channels. Biophys. J. 60.
76. Tian, F., and T. A. Cross. 1999. Cation Transport: An Example of Structural Based Selectivity. J. Mol. Biol. 285:1993‐2003.
77. North, C. L., and T. A. Cross. 1995. Correlations between Function and Dynamics: Time Scale Coincidence for Ion Translocation and Molecular Dynamics in the Gramicidin Channel Backbone. Biochemistry 34:5883‐5895.
78. Lazo, N. D., W. Hu, and T. A. Cross. 1995. Low‐Temperature Solid‐State 15N NMR Characterization of Polypeptide Backbone Librations. J. Magn. Reson. B 107:43‐50.
79. Bartl, F., B. Brzezinski, B. Rozalski, and G. Zundel. 1998. FT‐IR Study of the Nature of the Proton and Li+ Motions in Gramicidin A and C. J. Phys. Chem. B 102:5234‐5238.
80. Pankiewicz, R., G. Wojciechowski, G. Schroeder, B. Brzezinski, F. Bartl, and G. Zundel. 2001. FT‐IR study of the nature of K+, Rb+ and Cs+ cation motions in gramicidin A. J. Mol. Struct. 565:213‐217.
81. Armstrong, K. M., and S. Cukierman. 2002. On the Origin of Closing Flickers in Gramicidin Channels: A New Hypothesis. Biophys. J. 82:1329‐1337.
82. de Godoy, C. M. G., and S. Cukierman. 2001. Modulation of Proton Transfer in the Water Wire of Dioxolane‐Linked Gramicidin Channels by Lipid Membranes. Biophys. J. 81:1430‐1438.
83. Cukierman, S. 2000. Proton Mobilities in Water and Different Stereoisomers of Covalently Linked Gramicidin A Channels. Biophys. J. 78:1825‐1834.
84. Yu, C. H., S. Cukierman, and R. Pomès. 2003. Theoretical Study of the Structure and Dynamics Fluctuations of Dioxolane‐Linked Gramicidin Channels. Biophys. J. 84:816‐831.
85. Brooks, B. R., R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. 1983. CHARMM ‐ A Program For Macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 4:187‐217.
86. MacKerell, A. D., Jr., D. Bashford, M. Bellott, R. L. Dunbrack, J. D. Evanseck, M. J. Field, S. Fischer, J. B. Gao, H. Guo, S. Ha, D. Joseph‐McCarthy, L. Kuchnir, K. Kuczera, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watanabe, J. Wiorkiewicz‐Kuczera, D. Yin, and M. Karplus. 1998. All‐atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102:3586‐3616.
87. Cornell, W. D., P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, and P. A. Kollman. 1995. A 2nd Generation Force‐Field for the Simulation of Proteins, Nucleic‐Acids, and Organic‐Molecules. J. Am. Chem. Soc. 117:5179‐5197.
88. Pearlman, D. A., D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, S. Debolt, D. Ferguson, G. Seibel, and P. Kollman. 1995. Amber, A Package of Computer‐Programs for Applying Molecular Mechanics, Normal‐Mode Analysis, Molecular‐Dynamics and Free‐Energy Calculations to Simulate the Structural And Energetic Properties of Molecules. Comput. Phys. Commun. 91:1‐41.
89. Berendsen, H. J. C., D. Vanderspoel, and R. Vandrunen. 1995. Gromacs ‐ a Message‐Passing Parallel Molecular‐Dynamics Implementation. Comput. Phys. Commun. 91:43‐56.
90. Lindahl, E., B. Hess, and D. van der Spoel. 2001. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 7:306‐317.
91. van Gunsteren, W. F., S. R. Billeter, A. A. Eising, P. H. Hunenberger, P. Kruger, A. E. Mark, W. R. P. Scott, and I. G. Tironi. 1996. Biomolecular Simulation: The GROMOS96 manual and user guide. Hochschulverlag AG an der ETH Zurich, Zurich.
92. Jorgensen, W. L., and J. Tiradorives. 1988. The Opls Potential Functions for Proteins ‐ Energy Minimizations for Crystals of Cyclic‐Peptides and Crambin. J. Am. Chem. Soc. 110:1657‐1666.
93. Berendsen, H. J. C., J. P. M. Postma, W. F. Van Gunsteren, A. DiNola, and J. R. Haak. 1984. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81:3684‐3690.
94. Hoover, W. G. 1985. Canonical dynamics: Equilibrium phase‐space distributions. Phys. Rev. A 31:1695‐1697.
95. Nosé, S. 1984. A unified formulation of the constant temperature molecular dynamics methods. J. Chem. Phys. 81:511‐519.
96. Allen, M. P., and D. J. Tildesley. 1987. Computer Simulation of Liquids. Oxford University Press, Oxford.
97. Harvey, S. C., R. K. Z. Tan, and T. E. Cheatham. 1998. The flying ice cube: Velocity rescaling in molecular dynamics leads to violation of energy equipartition. J. Comput. Chem. 19:726‐740.
98. Herce, H. D., and A. E. Garcia. 2006. Correction of apparent finite size effects in the area per lipid of lipid membranes simulations. J. Chem. Phys. 125:224711.
99. Jorgensen, W. L., J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein. 1983. Comparison of Simple Potential Functions for Simulating Water. J. Chem. Phys. 79:926‐935.
100. Ryckaert, J.‐P., G. Ciccotti, and H. J. C. Berendsen. 1977. Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n‐Alkanes. J. Comp. Phys. 23:327–341.
101. Zhang, Y., S. Feller, B. Brooks, and R. W. Pastor. 1995. Computer simulation of liquid/liquid interfaces. I. Theory and application to octane/water. J. Chem. Phys. 103:10252‐10266.
102. Marrink, S. J., and A. E. Mark. 2001. Effect of Undulations on Surface Tension in Simulated Bilayers. J. Phys. Chem. B 105:6122‐6127.
103. Eckart, C., and G. Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1:211‐218.
104. Golub, G. H., and W. Kahan. 1965. Calculating the singular values and pseudo‐inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2:205‐224.
105. Hotelling, H. 1935. The most predictable criterion. J. Educ. Psychol. 26:139‐142.
106. Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24:417‐520.
107. Hannachi, A., I. T. Jolliffe, and D. B. Stephenson. 2007. Empirical orthogonal functions and related techniques in atmospheric science: A review. Int. J. Climatology 27:1119‐1152.
108. Berkooz, G., P. Holmes, and J. L. Lumley. 1993. The Proper Orthogonal Decomposition in the Analysis of Turbulent Flows. Ann. Rev. Fluid Mech. 25:539‐575.
109. Berendsen, H., and S. Hayward. 2000. Collective protein dynamics in relation to function. Curr. Opin. Struc. Biol. 10:165‐169.
110. Kitao, A., and N. Go. 1999. Investigating protein dynamics in collective coordinate space. Curr. Opin. Struc. Biol. 9:164‐169.
111. García, A. 1992. Large‐amplitude nonlinear motions in proteins. Phys. Rev. Lett. 17:2696‐2699.
112. Amadei, A., A. B. M. Linssen, and H. J. C. Berendsen. 1993. Essential dynamics of proteins. Proteins 17:412‐425.
113. Lou, H., and R. I. Cukier. 2006. Molecular dynamics of Apo‐Adenylate Kinase: A Principal component Analysis. J. Phys. Chem. 110:12796‐12808.
114. Arcangeli, C., A. R. Bizzarri, and S. Cannistraro. 2001. Concerted motions in copper plastocyanin and azurin: an essential dynamics study. Biophys. Chem. 90:45‐56.
115. Hayward, S., and H. J. C. Berendsen. 1998. Systematic analysis of domain motions in proteins from conformational change: New results on citrate synthase and T4 lysozyme. Proteins 30:144‐154.
116. Hayward, S., A. Kitao, and N. Go. 1994. Harmonic and anharmonic aspects in the dynamics of BPTI: A normal mode analysis and principal component analysis. Protein Science 3:936‐943.
117. García, A., and G. Hummer. 1999. Conformational dynamics of cytochrome c: correlation to hydrogen exchange. Proteins 36:175‐191.
118. van Aalten, D. M. F., A. Amadei, A. B. M. Linssen, V. G. H. Eijsink, G. Vriend, and H. J. C. Berendsen. 1995. The essential dynamics of thermolysin: confirmation of the hinge‐bending motion and comparison of simulations in vacuum and water. Proteins: Structure, Function and Genetics 22:45‐54.
119. Maisuradze, G. G., and D. M. Leitner. 2006. Principal component analysis of fast‐folding lambda‐repressor mutants. Chem. Phys. Lett. 421:5‐10.
120. Materese, C. K., C. C. Goldmon, and G. A. Papoian. 2008. Hierarchical organization of eglin c native state dynamics is shaped by competing direct and water‐mediated interactions. Proc. Natl. Acad. Sci. USA 105:10659‐10664.
121. Balsera, M. A., W. Wriggers, Y. Oono, and K. Schulten. 1996. Principal component analysis and long time protein dynamics. J. Phys. Chem. 100:2567‐2572.
122. Grossfield, A., S. Feller, and M. Pitman. 2007. Convergence of molecular dynamics simulations of membrane proteins. Proteins 67:31‐40.
123. Hattori, M., H. Li, H. Yamada, K. Akasaka, W. Hengstenberg, W. Gronwald, and H. R. Kalbitzer. 2004. Infrequent cavity‐forming fluctuations in HPr from Staphylococcus carnosus revealed by pressure‐ and temperature‐dependent tyrosine ring flips. Protein Science 13:3104‐3114.
124. Rao, D. K., and A. K. Bhuyan. 2007. Complexity of aromatic ring‐flip motions in proteins: Y97 ring dynamics in cytochrome c observed by cross‐relaxation suppressed exchange NMR spectroscopy. J. Biomol. NMR 39:187‐196.
125. Go, N., and H. A. Scheraga. 1970. Calculation of the Conformation of the Pentapeptide cyclo‐(Glycylglycylglycylprolylprolyl). I. A Complete Energy Map. Macromolecules 3:188‐194.
126. Go, N., and H. A. Scheraga. 1973. Calculation of the Conformation of cyclo‐Hexaglycyl. Macromolecules 6:525‐541.
127. Brooks, B., and M. Karplus. 1983. Harmonic dynamics of proteins: Normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc. Natl. Acad. Sci. USA 80:6571‐6575.
128. Go, N., T. Noguti, and T. Nishikawa. 1983. Dynamics of a small globular protein in terms of low‐frequency vibrational modes. Proc. Natl. Acad. Sci. USA 80:3696‐3700.
129. Levitt, M., C. Sander, and P. S. Stern. 1985. Protein Normal‐mode Dynamics: Trypsin Inhibitor, Crambin, Ribonuclease and Lysozyme. J. Mol. Biol. 181:423‐447.
130. Ma, J. 2005. Usefulness and Limitations of Normal Mode Analysis in Modeling Dynamics of Biomolecular Complexes. Structure 13:373‐380.
131. Miller, D. W., and D. A. Agard. 1999. Enzyme specificity under dynamic control: a normal mode analysis of alpha‐lytic protease. J. Mol. Biol. 286:267‐278.
132. Miloshevsky, G., and P. Jordan. 2006. The open state gating mechanism of gramicidin A requires relative opposed monomer rotation and simultaneous lateral displacement. Structure 14:1241‐1249.
133. Atilgan, A. R., S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar. 2001. Anisotropy of Fluctuation Dynamics of Proteins with an Elastic Network Model. Biophys. J. 80:505‐515.
134. Bahar, I., A. R. Atilgan, and B. Erman. 1997. Direct evaluation of thermal fluctuations in proteins using a single‐parameter harmonic potential. Folding Design 2:173‐181.
135. Bahar, I., C. Chennubhotla, and B. Erman. 2007. Reply to 'Comment on elastic network models and proteins'. Phys. Biol. 4:64‐65.
136. Bahar, I., and A. Rader. 2005. Coarse‐grained normal mode analysis in structural biology. Current Opinion in Structural Biology 15:586‐592.
137. Chennubhotla, C., A. J. Rader, L. Yang, and I. Bahar. 2005. Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2:S172‐S180.
138. Eyal, E., and I. Bahar. 2008. Toward a Molecular Understanding of the Anisotropic Response of Proteins to External Forces: Insights from Elastic Network Models. Biophys. J. 94:3424‐3435.
139. Tama, F., M. Valle, J. Frank, and C. L. Brooks. 2003. Dynamic reorganization of the functionally active ribosome explored by normal mode analysis and cryo‐electron microscopy. Proc. Natl. Acad. Sci. USA 100:9319‐9323.
140. McCammon, J. A., B. R. Gelin, and M. Karplus. 1977. Dynamics of folded proteins. Nature 267:585‐590.
141. Doruker, P., A. R. Atilgan, and I. Bahar. 2000. Dynamics of Proteins Predicted by Molecular Dynamics Simulations and Analytical Approaches: Application to α‐Amylase Inhibitor. Proteins: Structure, Function and Genetics 40:512‐524.
142. Smith, J., S. Cusack, U. Pezzeca, B. Brooks, and M. Karplus. 1986. Inelastic neutron scattering analysis of low frequency motion in proteins: A normal mode study of the bovine pancreatic trypsin inhibitor. J. Chem. Phys. 85:3636‐3654.
143. Tirion, M. M. 1996. Large Amplitude Elastic Motions in Proteins from a Single‐Parameter, Atomic Analysis. Phys. Rev. Lett. 77:1905‐1908.
144. Delarue, M., and Y.‐H. Sanejouand. 2002. Simplified Normal Mode Analysis of Conformational Transitions in DNA‐dependent Polymerases: the Elastic Network Model. J. Mol. Biol. 320:1011‐1024.
145. Tama, F., and Y.‐H. Sanejouand. 2001. Conformational change of proteins arising from normal mode calculations. Protein Engineering 14:1‐6.
146. Zheng, W., B. Brooks, and D. Thirumalai. 2006. Low‐frequency normal modes that describe allosteric transitions in biological nanomachines are robust to sequence variations. Proc. Natl. Acad. Sci. USA 103:7664‐7669.
147. Hinsen, K. 1998. Analysis of Domain Motions by Approximate Normal Mode Calculations. Proteins: Structure, Function, and Genetics 33:417‐429.
148. Maguid, S., S. Fernandez‐Alberti, L. Ferrelli, and J. Echave. 2005. Exploring the common dynamics of homologous proteins. Application to the Globin family. Biophys. J. 89:3‐13.
149. Stillinger, F. H., and T. A. Webber. 1982. Hidden structure in liquids. Phys. Rev. A 25:978‐989.
150. Pearson, K. 1902. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559‐572.
151. Lorenz, E. N. 1956. Empirical Orthogonal Functions and Statistical Weather Prediction. In Technical Report, Statistical Forecast Project Report 1. Department of Meteorology, MIT. 49.
152. Richman, M. B. 1986. Rotation of Principal Components. J. Climatology 6:293‐335.
153. Buell, C. E. 1975. The topography of empirical orthogonal functions. In Fourth Conf. on Prob. and Stats. in Atmos. Sci. Amer. Metero. Soc., Tallahassee, FL. 188.
154. Buell, C. E. 1979. On the physical interpretation of empirical orthogonal functions. In Sixth Conf. on Prob. and Stats. in Atmos. Sci. Amer. Metero. Soc., Banff, Alta. 112.
155. Calahan, R. F. 1983. EOF spectral estimation in climate analysis. In Second International Conf. on Stat. Climat. National Institute of Metero. and Geophysics, Lisbon, Portugal. 4.5.1.
156. Richman, M. B., and P. J. Lamb. 1985. Climate pattern analysis of 3‐ and 7‐day summer rainfall in the central United States: Some methodological considerations and a regionalization. J. Clim. Appl. Meteor. 24:1325.
157. Kendall, M. G. 1980. Multivariate Analysis. C. Griffin, London.
158. North, G. R., T. L. Bell, R. F. Calahan, and F. J. Moeng. 1982. Sampling errors in the estimation of empirical orthogonal functions. Mon. Wea. Rev. 110:699.
159. Cliff, N., and C. D. Hamburger. 1967. A study of sampling errors in factor analysis by means of artificial experiments. Psych. Bull. 68:430.
160. Storch, H., and G. Hannoschock. 1985. Statistical aspects of estimated principal vectors (EOFs) based on small sample sizes. J. Clim. Appl. Meteor. 24:716.
161. Vargas, W. M., and R. H. Compagnucci. 1983. Methodological aspects of principal component analysis in meteorological fields. In Second International Conf. on Stat. Climat. National Institute of Metero. and Geophysics, Lisbon, Portugal. 5.3.1.
162. Barnston, A. G., and R. E. Livezey. 1987. Classification, seasonality and persistence of low‐frequency atmospheric circulation patterns. Mon. Wea. Rev. 115:1083‐1126.
163. Craddock, J. M. 1965. A meteorological application of factor analysis. The Statistician 15:143‐156.
164. Horel, J. D. 1981. A rotated principal component analysis of the interannual variability of the Northern Hemisphere 500 mb height field. Mon. Wea. Rev. 109:2080‐2092.
165. Richman, M. B. 1981. Obliquely rotated principal components: an improved meteorological map typing technique? J. App. Met. 20:1145‐1159.
166. Kaiser, H. F. 1958. The Varimax criterion for analytic rotation in factor analysis. Psychometrika 23:187.
167. Kaiser, H. F. 1959. Computer program for Varimax rotation in factor analysis. Educ. Psych. Meas. 19:413.
168. Carroll, J. B. 1953. An analytic solution for approximating simple structure in factor analysis. Psychometrika 18:23.
169. Neuhaus, J. O., and C. Wrigley. 1954. The Quartimax method: an analytical approach to simple structure. Brit. J. Stat. Psych. 7:81.
170. Carroll, J. B. 1957. Biquartimin criterion for rotating to oblique simple structure in factor analysis. Science 126:1114.
171. Saunders, D. R. 1961. The rationale for an "Oblimax" method of transformation in factor analysis. Psychometrika 26:317.
172. Hendrickson, A. E., and P. O. White. 1964. Promax: a quick method to oblique simple structure. Brit. J. Stat. Psych. 17:65.
173. Tucker, L. R., and C. T. Finkbeiner. 1982. Transformation of factors by artificial personal probability functions. In ETS research report 81‐58, test and measurement no. TM 820429.
174. Jolliffe, I. T. 2002. Principal Component Analysis. Springer, New York.
175. Hannachi, A., I. T. Jolliffe, D. B. Stephenson, and N. Trendafilov. 2006. In Search of Simple Structures in Climate: Simplifying EOFs. Int. J. Climatology 26:7‐28.
176. Jolliffe, I. T., N. Trendafilov, and M. Uddin. 2003. A modified principal component technique based on the LASSO. J. Computational and Graphical Statistics 12:531‐547.
177. Trendafilov, N., and I. T. Jolliffe. 2005. Numerical solution of the SCoTLASS. Computational Statistics and Data Analysis 50:242‐253.
178. Tibshirani, R. 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B 58:267‐288.
179. Bibby, J. 1980. Some effects of rounding optimal estimates. Sankhya B 42:165‐178.
180. Green, B. F. 1977. Parameter sensitivity in multivariate methods. Journal of Multivariate Behavioral Research 12:263‐287.
181. Hausmann, R. 1982. Constrained multivariate analysis. In Optimisation and Statistics. S. H. Zanckis, and J. S. Rustagi, editors. North‐Holland, Amsterdam. 137‐151.
182. Van den Dool, H. M., S. Saha, and J. A. 2000. Empirical orthogonal teleconnections. Journal of Climate 13:1421‐1435.
183. Vines, S. K. 2000. Simple principal components. Applied Statistics 49:441‐451.
184. Weare, B. C., and J. S. Nasstrom. 1982. Examples of extended empirical orthogonal function analysis. Monthly Weather Review 110:481‐485.
185. Broomhead, D. S., and G. P. King. 1986. Extracting qualitative dynamics from experimental data. Physica D 20:217‐236.
186. Broomhead, D. S., and G. P. King. 1986. On the qualitative analysis of experimental dynamical systems. In Nonlinear Phenomena and Chaos. S. Sarkar, editor. Adam Hilger, Bristol. 113‐144.
187. Kimoto, M., M. Ghil, and K. C. Mo. 1991. Spatial structure of the extratropical 40‐day oscillation. In Proceedings of the 8th Conference on Atmospheric and Oceanic Waves and Stability. American Meteorological Society, Boston, MA. 115‐116.
188. Plaut, G., and R. Vautard. 1994. Spells of low‐frequency oscillations and weather regimes in the northern hemisphere. Journal of the Atmospheric Sciences 51:210‐236.
189. Brink, K. H., and R. D. Muench. 1986. Circulation in the point conception‐Santa Barbara channel region. Journal of Geophysical research C 91:877‐895.
190. Hardy, D. M., and J. J. Walton. 1978. Principal components analysis of vector wind measurements. J. App. Meteorology 17:1153‐1162.
191. Kundu, P. K., and J. S. Allen. 1976. Some three‐dimensional characteristics of low‐frequency current fluctuations near the Oregon coast. Journal of Physical Oceanography 6:181‐199.
192. Johnson, E. S., and M. J. McPhaden. 1993. Structure of intraseasonal Kelvin waves in the equatorial Pacific Ocean. Journal of Physical Oceanography 23:608‐625.
193. Wallace, J. M. 1972. Empirical orthogonal representation of time series in the frequency domain. Part II: Application to the study of tropical wave disturbances. J. App. Meteorology 11:893‐900.
194. Wallace, J. M., and R. E. Dickinson. 1972. Empirical orthogonal representation of time series in the frequency domain. Part I: Theoretical consideration. J. App. Meteorology 11:887‐892.
195. Rasmusson, E. M., P. A. Arkin, W. Y. Chen, and J. B. Jalickee. 1981. Biennial variations in surface temperature over the United States as revealed by singular decomposition. Monthly Weather Review 109:587‐598.
196. Barnett, T. P. 1983. Interaction of the monsoon and pacific trade wind system at interannual time scales. Part I: The equatorial case. Monthly Weather Review 111:756‐773.
197. Barnett, T. P. 1984. Interaction of the monsoon and pacific trade wind system at interannual time scales. Part II: The tropical band. Monthly Weather Review 112:2380‐2387.
198. Barnett, T. P. 1984. Interaction of the monsoon and pacific trade wind system at interannual time scales. Part III: A partial anatomy of the Southern Oscillation. Monthly Weather Review 112:2388‐2400.
199. Anderson, J. R., and R. D. Rosen. 1983. The latitude‐height structure of 40‐50 day variations in atmospheric angular momentum. Journal of the Atmospheric Sciences 40:1584‐1591.
200. Merrifield, M. A., and C. D. Winant. 1989. Shelf circulation in the gulf of California: a description of the variability. Journal of Geophysical Research 94:18133‐18160.
201. Horel, J. D. 1984. Complex principal component analysis: theory and examples. J. Clim. Appl. Meteor. 23:1660.
202. Saegusa, R., H. Sakano, and S. Hashimoto. 2004. Nonlinear principal component analysis to preserve the order of principal components. Neurocomputing 61:57‐70.
203. Nguyen, D. T. 2006. Complexity of Free Energy Landscapes of Peptides Revealed by Nonlinear Principal Component Analysis. Proteins: Structure, Function and Genetics 65:898‐913.
204. Matsunaga, Y., S. Fuchigami, and A. Kidera. 2009. Multivariate frequency domain analysis of protein dynamics. J. Chem. Phys. 130:124104.
205. Hess, B. 2002. Convergence of sampling in protein simulations. Phys. Rev. E 65:031910.
206. Faraldo‐Gomez, J. D., L. R. Forrest, M. Baaden, P. J. Bond, C. Domene, G. Patargias, J. Cuthbertson, and M. S. P. Sansom. 2004. Conformational Sampling and Dynamics of Membrane Proteins From 10‐Nanosecond Computer Simulations. Proteins 57:783‐791.
207. Luchko, T., J. T. Huzil, M. Stepanova, and J. Tuszynski. 2008. Conformational Analysis of the Carboxy‐Terminal Tails of Human β‐Tubulin Isotypes. Biophys. J. 94:1971‐1982.
208. Mandelbrot, B., and J. W. Van Ness. 1968. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Rev. 10:422‐437.
209. Bingham, N. C., N. E. Smith, T. A. Cross, and D. D. Busath. 2003. Molecular dynamics simulations of Trp side‐chain conformational flexibility in the gramicidin A channel. Biopolymers 71:593‐600.
210. Townsley, L. E., W. A. Tucker, S. Sham, and J. F. Hinton. 2001. Structures of Gramicidins A, B, and C Incorporated into Sodium Dodecyl Sulfate Micelles. Biochemistry 40:11676‐11686.
211. Allen, T. W., O. S. Andersen, and B. Roux. 2003. Structure of Gramicidin A in a Lipid Bilayer Environment Determined Using Molecular Dynamics Simulations and Solid‐State NMR Data. J. Am. Chem. Soc. 125:9868‐9877.
212. Andersen, O. S., and R. E. Koeppe. 1992. Molecular determinants of channel function. Physiol. Rev. 72:89S‐158S.
213. Urry, D. W., C. M. Venkatachalam, K. U. Prasad, R. J. Bradley, G. Parenti‐Castelli, and G. Lenaz. 1981. Conduction Processes of the Gramicidin Channel. Int. J. Quantum Chem. Quantum Biol. Symp. 8:385.
214. Mandelbrot, B. 2002. Gaussian Self‐affinity and Fractals. Springer, New York.
215. Metzler, R., and J. Klafter. 2004. The restaurant at the end of the random walk: recent developments in the description of anomalous transport by fractional dynamics. Journal of Physics A 37:R161‐R208.
216. Schulze, B. G., and J. D. Evanseck. 1999. Cooperative role of Arg45 and His64 in the spectroscopic A3 state of carbonmonoxy myoglobin: Molecular dynamics simulations, multivariate analysis and quantum mechanical computations. J. Am. Chem. Soc. 121:6444‐6454.
217. Daidone, I., A. Amadei, D. Roccatano, and A. Di Nola. 2003. Molecular Dynamics Simulation of Protein Folding by Essential Dynamics Sampling: Folding Landscape of Horse Heart Cytochrome c. Biophys. J. 85:2865‐2871.
218. Stanley, H. E., S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.‐K. Peng, and M. Simons. 1999. Scaling features of noncoding DNA. Physica A 273:1‐18.
219. Gao, J. B., Y. Cao, and J. M. Lee. 2003. Principal component analysis of 1/fα noise. Phys. Lett. A 314:392‐400.
220. Maisuradze, G. G., and D. M. Leitner. 2007. Free energy landscape of a biomolecule in dihedral principal component space: sampling convergence and correspondence between structures and minima. Proteins 67:569‐578.
221. Ma, J., and M. Karplus. 1998. The allosteric mechanism of the chaperonin GroEL: A dynamics analysis. Proc. Natl. Acad. Sci. USA 95:8502‐8507.
222. García, A. 1997. Multi‐basin dynamics of a protein in a crystal environment. Physica D 107:225‐239.
223. Goldstein, H. 1980. Classical Mechanics. Addison‐Wesley.
224. Lindahl, E., C. Azuara, P. Koehl, and M. Delarue. 2006. NOMAD‐Ref: visualization, deformation and refinement of macromolecular structures based on all‐atom Normal Mode Analysis. Nucleic Acids Res. 34:W52‐56.
Appendix 1: Normal Mode Analysis
Given a molecular structure with N atoms at coordinates $r_i$ ($i = 1, 2, \ldots, 3N$), and a
molecular energy landscape $U(r)$, the force constant matrix may be written

$$F_{ij} = \frac{\partial^2 U}{\partial r_i \, \partial r_j}.$$

The Hessian $H = M^{-1/2} F M^{-1/2}$ is the mass-weighted form of $F$, where $M$ is the
diagonal matrix of atomic masses. If a harmonic approximation (quadratic function) is taken
for the coordinate dependence of $U(r)$ around a minimum, then the spatial frequencies
$\omega_i$ can be obtained by solving the eigenvalue problem

$$H q_i = \omega_i^2 q_i,$$

where $q = M^{1/2} \Delta r$ is a 3N-dimensional eigenvector of mass-weighted displacements. A
more detailed derivation of NMA can be found in standard mechanics textbooks, such as
(223).
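As an illustration of the eigenvalue problem described above, the following minimal sketch mass-weights a force constant matrix and diagonalizes it with NumPy. The function name `normal_modes`, the two-mass toy system and the spring constant are assumptions introduced only for this example; they are not taken from the gA calculations.

```python
import numpy as np

def normal_modes(force_constants, masses):
    """Diagonalize the mass-weighted Hessian of a harmonic system.

    force_constants: (n, n) symmetric matrix of second derivatives of U.
    masses: length-n array with the mass attached to each coordinate.
    Returns frequencies (ascending) and the matching eigenvectors as columns.
    """
    inv_sqrt_m = 1.0 / np.sqrt(masses)
    # Mass-weighted Hessian: M^(-1/2) F M^(-1/2) for a diagonal mass matrix
    hessian = force_constants * np.outer(inv_sqrt_m, inv_sqrt_m)
    eigvals, eigvecs = np.linalg.eigh(hessian)
    # Eigenvalues are squared frequencies; clip round-off below zero
    freqs = np.sqrt(np.clip(eigvals, 0.0, None))
    return freqs, eigvecs

# Toy system (hypothetical): two unit masses joined by one spring in 1D
k = 2.0
F = np.array([[k, -k], [-k, k]])
m = np.array([1.0, 1.0])
freqs, modes = normal_modes(F, m)
print(freqs)  # one translational zero mode and one stretch at sqrt(2k/m)
```

For a real molecule the same two steps apply to the full 3N x 3N matrix, with each atomic mass repeated for its x, y and z coordinates.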
Note that traditional NMA requires the detailed form of U(r), which is taken
from existing molecular mechanics force fields such as CHARMM. The more recent
Elastic Network Models (143) do away with this detailed energy function
and replace it with a network of elastic interactions connecting every atom to every other
atom within a cut-off radius (typically 10 Å). Notice that this replaces the harmonic
functions usually associated with the bonding topology of a molecule with other harmonic
functions that are now distributed much like non-bonded interactions.
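The elastic network construction can be sketched in a few lines. The example below assumes a uniform spring constant and the anisotropic-network form of the Hessian, which is one common ENM variant and not necessarily the one implemented in NOMAD-Ref; the function name `enm_hessian` and the four-point test geometry are hypothetical.

```python
import numpy as np

def enm_hessian(coords, cutoff=10.0, gamma=1.0):
    """Build a 3N x 3N elastic-network Hessian.

    Every pair of points closer than `cutoff` is joined by a spring of
    strength `gamma`, regardless of the bonding topology.
    """
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue  # no spring beyond the cut-off radius
            block = -gamma * np.outer(d, d) / dist2
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            # Diagonal blocks accumulate minus the off-diagonal blocks
            hess[3*i:3*i+3, 3*i:3*i+3] -= block
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess

# Hypothetical test geometry: four non-coplanar points, all within the cut-off
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [0.0, 3.8, 0.0],
                   [0.0, 0.0, 3.8]])
hess = enm_hessian(coords)
eigvals = np.linalg.eigvalsh(hess)
n_zero = int(np.sum(np.abs(eigvals) < 1e-8))
print(n_zero)  # rigid-body translations and rotations appear as zero modes
```

The only inputs are the coordinates, the cut-off and a single spring constant, which is what makes the approach independent of a detailed force field.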
In Figure A1 we present NMA results on the gA dimer solvated in a GMO
membrane, which were computed using an online engine called NOMAD-Ref (224) on a
very well minimized structure from our simulations. A special request was made in order to
compute all eigenvalues and eigenvectors for comparison with PCA results (rather than only
the first 10, which is the default). We present eigenvalues for the Cα atoms, the
NCαC main chain, the backbone with H atoms (NCαCH), with O atoms (NCαCO), and the
complete backbone (NCαCOH), as well as the full gA molecule with (GRAall) and without
(GRAnoH) hydrogen atoms. All of these curves have the same form, where the leading 3
eigenvalues are separated by a significant bandgap from the rest of the eigenvalues, and they
all scale with a power of 0.25. Along with this shallower scaling, the most notable
differences from the PCA spectra are the lack of more than one bandgap and the absence of a
transition to steeper scaling at high frequencies.
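A scaling exponent of this kind can be estimated by a straight-line fit in log-log coordinates. The sketch below is run on a synthetic spectrum generated with an exponent of 0.25, not on the NMA eigenvalues of Figure A1; the function name `power_law_exponent` and its `first` and `last` arguments are assumptions made for the example.

```python
import numpy as np

def power_law_exponent(values, first=0, last=None):
    """Fit values ~ prefactor * index**exponent over a range of mode indices.

    `first` and `last` select the fitting window, e.g. to skip leading
    eigenvalues that are separated from the rest by a bandgap.
    """
    idx = np.arange(1, len(values) + 1)[first:last]
    vals = np.asarray(values, dtype=float)[first:last]
    # Linear regression of log(value) against log(index)
    slope, intercept = np.polyfit(np.log(idx), np.log(vals), 1)
    return slope, np.exp(intercept)

# Synthetic spectrum (hypothetical data) with a known exponent of 0.25
mode_index = np.arange(1, 201)
synthetic = 3.0 * mode_index ** 0.25
exponent, prefactor = power_law_exponent(synthetic, first=3)
print(exponent, prefactor)
```

Fitting separate windows at low and high mode index is one way to test for a transition between two scaling regimes.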
The interpretation of these curves is not obvious. On the one hand, it is tempting to
ascribe differences in the NMA spectrum – like the lack of a scaling transition – to the
presence of temperature in the PCA of MD simulations, which is not present in the NMA
approximation. This seems reasonable given the association of exponents in power law
scaling with properties of noise. On the other hand, we have suggested that the shallow
scaling at low PC index is associated with conformational degrees of freedom, while the
steeper scaling at high PC index is associated with internal vibrations. It is exactly these
internal vibrations which seem to be lacking in the NMA spectrum, which is surprising given
the harmonic nature of the approximation. However, the shallow scaling of the NMA
spectrum would suggest the approximation captures only conformational degrees of freedom,
which arise due to the non-bonded topology of interactions in the ENM, despite its
harmonic form. Unraveling the effects of dynamics, entropy, temperature, interaction
topology and interaction functions is clearly non-trivial; a more detailed understanding of
the differences between PCA and NMA is a task that would benefit from comparisons of
model systems such as crystals and gases with the protein spectra presented here.
Figure A1: NMA eigenvalues (spatial frequencies ωk) for various atomic subsets of gA.
Appendix 2: Side Chain Conformations of gA
Our analysis of the convergence for side chain eigenvectors in section 2.2.2 shows
that their longest PCs are not converged. In Figure A2 we demonstrate the multi-modal
character of the side-chain eigenvectors by contrasting the PCs of the gA backbone with the
PCs of its side-chains. We plot the density of points along the trajectory of PC1 vs. PC2, as
well as a scatter plot of PC1 vs. PC2 for every 1000th point in the trajectory for the backbone,
and the same plot of PC1 vs. PC2 vs. PC3 for the side-chains. A unimodal density is
apparent for the gA backbone, while the side-chains have a multi-modal density distribution
for PC1 vs. PC2 with 4 distinct peaks, and the scatter plot of PC1 vs. PC2 vs. PC3 shows 4 or
5 distinct clusters. We have labeled these clusters with a representative time step (divided by
1000) as well as a coloured dot. The colours correspond to those used in Figure A3 to
display the 5 different conformations at these representative time steps in a 64 ns simulation
of gA in a GMO membrane. The characteristic feature of a given conformation is also
labeled with its time step in this figure.
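The kind of PC1 vs. PC2 density and subsampled scatter data described here can be sketched as follows. The trajectory in this example is synthetic (two Gaussian basins in a six-dimensional space) and the function name `pca_project`, the bin count and the subsampling stride are assumptions for illustration only; they do not reproduce the gA analysis.

```python
import numpy as np

def pca_project(traj, n_components=2):
    """Project a (frames x coordinates) trajectory onto its leading PCs."""
    centered = traj - traj.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending order; PCA convention is largest variance first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return centered @ eigvecs[:, :n_components], eigvals

rng = np.random.default_rng(0)
# Synthetic trajectory (hypothetical): two well-separated conformational basins
basin_a = rng.normal(0.0, 0.1, size=(500, 6))
basin_b = rng.normal(0.0, 0.1, size=(500, 6)) + np.array([2.0, 0, 0, 0, 0, 0])
traj = np.vstack([basin_a, basin_b])

pcs, eigvals = pca_project(traj)
# 2D density of points in the PC1-PC2 plane
density, xedges, yedges = np.histogram2d(pcs[:, 0], pcs[:, 1], bins=20)
# Time-ordered subsample for a scatter plot (every 100th frame here)
subsampled = pcs[::100]
```

A unimodal system gives a single peak in `density`, whereas a trajectory that visits several basins, as in this synthetic case, gives separated peaks along the leading PCs.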
The blue structure in Fig. A3 is the starting conformation. The red structure
corresponds to step 15, which is the smallest cluster visible in Fig. A2 and is only briefly
visited before returning to the initial cluster. This conformation does not differ significantly
from the starting structure except for a tilting of Trp 13 on both monomers. The green
structure at step 135 has a 120° change in χ1 of Trp 11 on monomer 1, as well as a tilt in χ2 of
Trp 9 on monomer 2. The orange structure at step 165 has a 120° change in χ1 of Trp 9 on
monomer 2. The yellow structure at step 221 exhibits a 120° change in χ1 and a 30° change
in χ2 of Trp 15 on monomer 2. The tan structure at step 315 has this same change at Trp 15
on monomer 2, accompanied by a 120° change in χ1 of Trp 13 on monomer 1.
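Changes in χ1 such as those listed above are differences between dihedral angles, which for a Trp side chain are defined by the N, Cα, Cβ and Cγ atoms. The sketch below computes a dihedral from four positions and a wrapped angular difference; the function names and the test coordinates are hypothetical and do not come from the gA trajectory.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (e.g. N, CA, CB, CG)."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def angle_change(new, old):
    """Signed difference new - old in degrees, wrapped into [-180, 180)."""
    return (new - old + 180.0) % 360.0 - 180.0

# Hypothetical coordinates giving a right-angle dihedral about the z axis
p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])
p3 = np.array([0.0, 1.0, 1.0])
chi = dihedral(p0, p1, p2, p3)
jump = angle_change(-60.0, 180.0)  # a rotamer hop from 180 to -60 degrees
```

Wrapping the difference avoids reporting a 240° jump where the side chain has in fact rotated by 120° across the ±180° boundary.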
These figures demonstrate that our simulation has only limited sampling of a few
conformational states, with only one or two transitions into each well on the free energy
surface of side-chain dynamics. MD studies in vacuo (209) and in DMPC (211) have
described six rotameric states available to each Trp in gA, although only Trp 9 showed a
significant number (eighteen) of transitions among them in a 100 ns simulation in DMPC
(211). Our results are in general agreement with this study, indicating that Trp rotameric
basins are visited on the 10 ns timescale in the GMO membrane, and therefore the longest
side chain PCs are not expected to converge within 64 ns.
Figure A2: 2D Distribution of the complete PC1 vs. PC2 scatter plot for the NCαC backbone and side chains, as well as the time-ordered scatter plot for every 1000th point colored from blue (start) through red (end).
Figure A3: Side-chain conformations for a 64 ns simulation of gA in a hydrated GMO bilayer.