doi.org/10.26434/chemrxiv.6030563.v1
Statistical Force-Field for Structural Modeling Using Chemical Cross-Linking/mass Spectrometry Distance Constraints
Allan J. R. Ferrari, Fabio C. Gozzo, Leandro Martinez
Submitted date: 26/03/2018 • Posted date: 27/03/2018
Licence: CC BY-NC-ND 4.0
Citation information: Ferrari, Allan J. R.; Gozzo, Fabio C.; Martinez, Leandro (2018): Statistical Force-Field for Structural Modeling Using Chemical Cross-Linking/mass Spectrometry Distance Constraints. ChemRxiv. Preprint.
Statistical force-field for structural modeling using chemical cross-linking/mass spectrometry distance constraints
Allan J R Ferrari1, Fabio C Gozzo1, Leandro Martínez1,2*
1Institute of Chemistry, University of Campinas, Campinas, SP, Brazil and 2Center for Computational Engineering & Sciences, University of Campinas, Campinas, SP, Brazil.
Abstract
Motivation
Chemical cross-linking/Mass Spectrometry (XLMS) is an experimental method to obtain distance constraints between amino acid residues, which can be applied to structural modeling of tertiary and quaternary biomolecular structures. These constraints provide, in principle, only upper limits to the distance between amino acid residues along the surface of the biomolecule. In practice, attempts to use XLMS constraints for tertiary protein structure determination have not been widely successful. This indicates the need for specifically designed strategies for the representation of these constraints within modeling algorithms.
Results
Here, a force-field designed to represent XLMS-derived constraints is proposed. The potential energy functions are obtained by computing, in the database of known protein structures, the probability of satisfaction of a topological cross-linking distance as a function of the Euclidean distance between amino acid residues. The force-field can be easily incorporated into current modeling methods and software. In this work, the force-field was implemented within the Rosetta ab initio relax protocol. We show a significant improvement in the quality of the models obtained relative to current strategies for constraint representation. This force-field contributes to the long-desired goal of obtaining the tertiary structures of proteins using XLMS data.
Availability
Force-field parameters and usage instructions are freely available at http://m3g.iqm.unicamp.br/topolink/xlff
1. Introduction
The number of protein structures determined is much smaller than that of proteins known at the sequence level (Bateman et al., 2017; Pundir et al., 2017; Berman et al., 2000; Bairoch and Apweiler, 2000). This discrepancy is the result of experimental limitations in obtaining high-resolution structures and, in parallel, of the experimental advances in genome sequencing. In silico approaches for protein structure prediction have been applied to fill that gap. If the structure of homologous proteins has already been solved, modeling the target protein is relatively simple (Fiser, 2010; Eswar et al., 2006; Song et al., 2013). Without structural homologs, however, the determination of a protein fold is a major challenge in computational biology.
Many types of data can be used to improve structural modeling of proteins. Examples include sparse NMR data (Bowers et al., 2000; Tang et al., 2015; Thompson et al., 2012), Small Angle X-Ray Scattering (SAXS) (Schneidman-Duhovny et al., 2012), Cryo-electron Microscopy (Cryo-EM) (DiMaio et al., 2015), distance constraints derived from residue coevolution statistics (Ovchinnikov et al., 2017; Ovchinnikov et al., 2016) and, more recently, distance constraints from chemical cross-linking/Mass Spectrometry (XLMS) (Brodie et al., 2017; Belsom et al., 2016; Santos et al., 2018). XLMS is attractive experimentally because it requires cheap and accessible instrumentation, simple sample handling and small amounts of sample. Furthermore, the results are tolerant to contaminants and, in principle, XLMS data can be obtained for almost any protein, as MS is a universally applicable technique.
In some sense, the information XLMS provides is similar to that obtained from NMR, that is, a list of distance constraints between atoms. Nevertheless, important differences exist: 1) the XLMS constraint is a distance along the surface of the protein; 2) the constraint is in principle associated only with the maximum linker reach, that is, it is only an upper bound to the distance between the residues; 3) the length of the linker can be of the order of several Angstroms and, thus, it geometrically associates residues which are relatively far apart in the protein structure. Experimentally, XLMS presents its own limitations, which are a field of intense state-of-the-art research: the number of XLMS constraints is limited by the diversity of the reactivity of the linkers, and by the exposure of the residues to the protein surface. Also, the interpretation of XLMS spectra is still a complex task (Iacobucci and Sinz, 2017), requiring specific algorithms and software (Lima et al., 2015; Götze et al., 2012; Hoopmann et al., 2015; Kosinski et al., 2015; Sarpe et al., 2016), and possibly manual curation.
Because of the current limitations in experimental and modeling techniques, the use of chemical cross-links has been indisputably successful only for the determination of quaternary arrangements. Their use for tertiary structure modeling remains a challenge. For instance, in the CASP11 and CASP12 assisted competitions, no clear improvements in the quality of the models were observed from the use of experimental cross-linking constraints (Schneider et al., 2016; Tamò et al., 2017). Recently, we were able to model the tertiary structures of a variety of proteins with the support of XLMS distance constraints, but only in combination with distance constraints derived from amino acid coevolution analysis, which played a determinant role in obtaining models with fold-level accuracy (Santos et al., 2018).
The incorporation of XLMS constraints in structural modeling strategies is indeed a challenge. The distance constraints are along the protein surface, thus their precise evaluation depends on the model structure which, in principle, is not known. Furthermore, the evaluation of the surface-accessible distance between two residues requires specialized strategies and, of course, is much more computationally demanding than the evaluation of straight Euclidean distances. Therefore, these constraints have been implemented in the modeling process through Euclidean-distance-dependent energy functions that aim to constrain the maximum distance between residues observed to be cross-linked. The maximum distance is usually derived from the maximum cross-linker and side chain extensions, through simple geometrical arguments.
Here, we formulate a Euclidean-distance-dependent structure-based statistical force-field for cross-linking/mass spectrometry constraints, named XLFF. In summary, we compute from a database of non-redundant protein structures the probability of observing two residues at a surface-accessible cross-linking distance as a function of the Euclidean distance between their Cβ atoms. This probability curve is converted into a potential energy function assuming it obeys a Boltzmann distribution. The potential is dependent on the cross-linker length and on the nature of the residues involved, thus defining a residue and linker-dependent force-field for structural modeling with XLMS distance constraints. We implemented the force-field in the Rosetta ab initio protocol (Simons et al., 1999; Bonneau et al., 2001, 2002; Bradley et al., 2005; Raman et al., 2009) and demonstrate that this statistical force-field significantly increases the probability of obtaining native-like tertiary structures compared to current approaches to represent the constraints. Although here we focus on the more challenging problem of tertiary protein structure determination, the principles described here find application in other structural modeling goals, including the determination of general protein assemblies.
2. Approach
Chemical cross-linking is an experimental method to obtain structural information from a chemical modification of the protein with a reagent called cross-linker, or simply linker. If a residue is found to attach to the linker, it means that the residue is accessible to the solvent in some significantly populated protein conformation in solution. If, additionally, the linker is found to be attached to a pair of residues, A and B, it follows that the reactive atoms Ax and By are closer to each other than the length of the cross-linker spacer arm, LXL. This linker works as a molecular ruler over the protein surface. Thus, when measuring the distance between Ax and By, d(Ax,By), one should consider the physical path between them, dtop(Ax,By), where the subscript top stands for “topological” distance, which we define here as the shortest path physically accessible to the linker connecting the reactive atoms (see Figure 1).
Usually, and for every case which will be discussed here, the linker reactivity is associated with a side-chain chemical group, for example the amine group of Lysine residues or the carboxylate group of acidic side chains. Therefore, in principle, the observation of a cross-link is associated with these side-chain atoms being within the linker length, dtop(Ax,By) ≤ LXL. Since structural models are static and side chains exposed to solvent are frequently mobile, the distance between side-chain atoms in a model reflects poorly the possibility of a cross-link.
Backbone and Cβ atoms provide more stable reference positions for the introduction of constraining potentials. Here, we define a statistical force-field based on the probability of the reactive atoms of the side chain being at a cross-linkable topological distance given the Euclidean distance between corresponding Cβ atoms. This statistical force-field considers implicitly the flexibility of the side chains and, by being based on Euclidean distances, is practical to use.
The maximum topological distance between Cβ atoms consistent with the formation of a cross-link is Lmax=LXL+LA+LB, where LXL is the maximum linker length and LA and LB are the lengths of the side chains of the residues involved. If the topological distance between Cβ atoms, dtop(ACβ,BCβ), is smaller than Lmax, residues A and B may form a cross-link, since the side-chains can potentially fluctuate to assume conformations compatible with the topological path along the surface associated with dtop(ACβ,BCβ). To a first approximation, using dtop(ACβ,BCβ) and Lmax to constrain residue distances can be considered a strategy to incorporate the side-chain flexibility into the modeling procedure.
However, Lmax, when implemented as a Euclidean distance constraint, represents an unlikely scenario, in which the linker and both side chains are in their fully extended conformations. Intuitively, constraining Cβ-Cβ distances to something smaller than Lmax should be a good strategy in most cases. In this work, we first propose that an effective Lmax can be assessed by statistical analysis of known protein structures. We compute the frequency distribution of dtop(ACβ,BCβ) in a protein database, under the condition that dtop(Ax,By) ≤ LXL. By eliminating the 1% least likely scenarios, we define a more restrained statistical distance cutoff, Lmax(0.99).
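As a minimal sketch of this cutoff, assume a hypothetical list of (dtop(Ax,By), dtop(ACβ,BCβ)) pairs already extracted from a structure database. The statistical limit is then a quantile of the Cβ topological distances of the cross-linkable pairs. The function name, the nearest-rank quantile rule and the toy distances below are illustrative assumptions, not the authors' implementation.

```python
import math

def statistical_limit(pairs, linker_length, quantile=0.99):
    """Return the C-beta topological distance covering `quantile` of the
    cross-linkable pairs.

    pairs: iterable of (dtop_reactive, dtop_cbeta) tuples, in angstroms.
    """
    # Keep only pairs whose reactive atoms are within reach of the linker.
    linkable = sorted(d_cb for d_xy, d_cb in pairs if d_xy <= linker_length)
    if not linkable:
        raise ValueError("no pair satisfies the linker length")
    # Nearest-rank quantile: smallest value covering `quantile` of the sample.
    rank = max(1, math.ceil(quantile * len(linkable)))
    return linkable[rank - 1]

# Toy, made-up distances (not taken from any database).
toy_pairs = [(5.0, 9.0), (8.0, 12.5), (11.0, 16.0), (11.4, 17.5), (14.0, 19.0)]
print(statistical_limit(toy_pairs, linker_length=11.5, quantile=0.75))  # 16.0
```

With a real database the sample would contain many thousands of pairs, so the choice of quantile estimator should matter little.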
The statistical analysis above can be further refined for the establishment of a distance-dependent force field for XLMS constraints. Imagine that a pair of Cβ atoms from residues A and B are found at a Euclidean distance deuc(ACβ,BCβ). This distance is associated with a topological distance, dtop(ACβ,BCβ), as defined above. Given a database of known protein structures, we ask what is the probability that the topological distance dtop(ACβ,BCβ) is smaller than Lmax(0.99) given the Euclidean distance deuc(ACβ,BCβ), that is, p[(dtop(ACβ,BCβ) < Lmax(0.99)) | deuc(ACβ,BCβ)]. The potential energy that would imply this probability distribution, assuming Boltzmann sampling, is

V(deuc) = -RT ln p[(dtop(ACβ,BCβ) < Lmax(0.99)) | deuc(ACβ,BCβ)]    (1)

where, at room temperature, RT=0.569 kcal mol⁻¹. This potential can be directly incorporated into most modeling procedures, as it is dependent on the Euclidean distances between Cβ atoms and on the Lmax(0.99) from the structural database. Section 3.1 and Supporting Information S1 describe the details of the parameterization and implementation of this potential energy function.
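Equation 1 can be illustrated with a short sketch that converts a tabulated probability curve into energies. The value of RT is the one quoted above. The function name and the small probability floor, introduced here only to avoid taking the logarithm of zero in empty bins, are assumptions of this example.

```python
import math

RT = 0.569  # kcal/mol, value quoted in the text

def boltzmann_potential(probabilities, floor=1e-6):
    """Convert probabilities p(d) into V(d) = -RT ln p(d), in kcal/mol."""
    # `floor` is an illustrative safeguard against log(0), not part of Equation 1.
    return [-RT * math.log(max(p, floor)) for p in probabilities]

# A probability of 1 costs nothing; halving the probability adds RT ln 2.
energies = boltzmann_potential([1.0, 0.5, 0.1])
print([round(v, 3) for v in energies])
```

Because the potential depends only on the logarithm of the probability, the penalty grows slowly while a cross-link remains likely and steeply once it becomes improbable.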
The statistical force-field was implemented in the Rosetta ab initio protocol and proved to be superior in terms of modeling quality to current state-of-the-art approaches for XLMS constraint representation, as we will show. Modeling details are available in Supporting Information S2. All modeling results, including input and output raw files, are available at http://m3g.iqm.unicamp.br/topolink/xlff. Each modeling round consisted of generating 5,000 models with Rosetta. We evaluated the quality of the models by the distribution of the structural similarity of the models to the crystallographic structure, as given by the TM-score metric computed with LovoAlign (Andreani et al., 2009). Structures with TM-scores greater than 0.5 relative to the crystallographic structure are considered to have roughly the correct fold (Xu and Zhang, 2010). Structures with TM-scores greater than 0.6 are likely winner candidates at the CASP modeling competitions.
3. Results
3.1. Parametrization of the statistical force field
Figures 2 and 3 exemplify the construction of the statistical force-field for a pair of reactive residues. In Figure 2A we display the frequency of observation of topological distances between Lys Nζ atoms in the CATH database of non-redundant domains. The subset of pairs of Lys residues for which the Nζ atoms are within the length of the linker molecule (11.5 Å) was shortlisted, and the distribution of Cβ distances for these pairs was obtained, as shown in Figure 2B. Within the subset of pairs of Lys residues for which the Nζ atoms are within 11.5 Å, 99% of the Cβ atoms are closer than 17.8 Å. Therefore, we consider 17.8 Å the maximum effective distance between Cβ atoms for this linker. This maximum effective distance will be named the Statistical Limit of the linker.
Then, we compute the probability of finding the Cβ atoms of the K residues at a topological distance shorter than 17.8 Å as a function of their Euclidean distance, as shown in Figure 3A. This probability shows, for example, that if the Euclidean distance between Cβ atoms of K residues is greater than ~14 Å, there is only a 50% probability that a topological path connecting these residues exists within the reactive distance.
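The conditional probability described above can be estimated by binning residue pairs according to their Euclidean Cβ distance and counting, in each bin, the fraction whose topological distance is below the Statistical Limit. The following is a minimal sketch under that assumption. The function name, the 1 Å bin width and the toy data are hypothetical and are not taken from the parameterization reported in Supporting Information S1.

```python
def satisfaction_probability(pairs, stat_limit, bin_width=1.0):
    """Estimate p(dtop_cbeta < stat_limit | deuc_cbeta) per distance bin.

    pairs: iterable of (deuc_cbeta, dtop_cbeta) tuples, in angstroms.
    Returns a dict mapping the lower edge of each bin to the probability.
    """
    counts = {}
    for d_euc, d_top in pairs:
        b = int(d_euc // bin_width)
        total, hits = counts.get(b, (0, 0))
        # A pair counts as a hit when its surface path is within the limit.
        counts[b] = (total + 1, hits + (d_top < stat_limit))
    return {b * bin_width: hits / total
            for b, (total, hits) in sorted(counts.items())}

# Toy, made-up pairs; float("inf") marks a pair with no accessible path.
toy = [(10.2, 12.0), (10.8, 15.0),
       (14.1, 16.0), (14.3, 17.0), (14.5, 20.0), (14.9, float("inf"))]
print(satisfaction_probability(toy, stat_limit=17.8))  # {10.0: 1.0, 14.0: 0.5}
```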
This probability distribution is translated, according to Equation 1, into a statistical potential, which is represented in Figure 3B. The energy increases with the Euclidean distance between Cβ atoms at all distances, but the increase is particularly noticeable above ~12 Å. Therefore, effectively, the force field penalizes distances which are greater than ~12 Å.
In Figure 3C, the potential energy profiles for the pairs KK, KS, and SS, obtained with the same protocol, are shown. As expected, for shorter side-chains, the potential energy increases at shorter Cβ-Cβ distances. This reflects the fact that, for example, linkers of the same length may bind Lys residues at larger Cβ-Cβ distances than Ser residues, as a result of the difference between side chain lengths. The exact profile of the potential is dependent not only on the lengths of the side chains, but also on the nature of their interactions with the surface of the proteins, and these are implicitly taken into account in the present approach because the profiles are obtained from the structural database. The profiles of the potential energies of all other reactive pairs of residues are shown in Figure S3 and are available for download.
3.2. Modeling performance
3.2.1. Overview of previously attempted cross-linking representation strategies
Different interaction potentials have been proposed for the use of XLMS derived constraints in protein modeling protocols. Kahraman and collaborators (Kahraman et al., 2013) have proposed using a flat harmonic potential to integrate cross-linking constraint data into de novo and comparative protein structure prediction. The flat harmonic potential penalizes models in which the Euclidean distance between two atoms exceeds an upper distance limit, UL (see Figure S4). In that work, the UL was chosen as 30 Å for all constraints associated with the same linker (DSS/BS3). Belsom and collaborators (Belsom et al., 2016) have proposed a Lorentz-like potential. As shown in Figure S4, instead of penalizing models having Euclidean distances above a certain UL, the Lorentz potential rewards models for which Euclidean distances are below a threshold, that is, in which it is believed that the cross-link should be satisfied. Above this limit, there is a progressive decrease of the energy bonus to zero. This potential is argued to be more tolerant to the presence of incorrect constraints. The Serum Albumin domains were modeled using this function, but in combination with contact prediction constraints from evolutionary information. Therefore, the specific role of the XLMS constraints in model quality was not addressed. Merkley and collaborators (Merkley et al., 2014) have proposed a justification for this UL=30 Å for cross-links between Lysine pairs: the correlation between Cα Euclidean distances in a set of crystal structures and molecular dynamics simulations, and a set of experimental cross-linking data for cytochrome C, showed that the experimental information of cross-linking data is often not represented by a single conformational state. It is then proposed that using larger constraint distances would account for the conformational flexibility of the structure. This concept of adding some threshold to the extended conformation of the linker to account for structural variability has guided much of the XL modeling and validation strategies (Kahraman et al., 2013; Fritzsche et al., 2012; Kalisman et al., 2012; Herzog et al., 2012; Chavez et al., 2016, 2018). Alternatively, the structural variability can be addressed by the proposal of multiple models, without sacrificing the precision of the XLMS ruler (Degiacomi et al., 2017).
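The qualitative shapes of the penalties discussed above can be sketched as simple functions of the Euclidean distance d and the upper limit UL. The functional forms and parameter names below (sd, slope, width, depth) are assumptions chosen only to reproduce the behavior described in the text. The exact expressions used in the cited works, in Figure S4 and in Rosetta may differ.

```python
def flat_harmonic(d, upper_limit, sd=1.0):
    """Zero up to the upper limit, quadratic penalty above it."""
    excess = max(0.0, d - upper_limit)
    return (excess / sd) ** 2

def linear_penalty(d, upper_limit, slope=1.0):
    """Zero up to the upper limit, linearly increasing penalty above it."""
    excess = max(0.0, d - upper_limit)
    return slope * excess

def lorentz_like(d, upper_limit, width=1.0, depth=1.0):
    """Constant bonus below the limit that decays towards zero above it."""
    excess = max(0.0, d - upper_limit)
    return -depth / (1.0 + (excess / width) ** 2)

# Compare the three shapes below, near and far above a 17.8 angstrom limit.
for d in (15.0, 18.8, 30.0):
    print(d, flat_harmonic(d, 17.8), linear_penalty(d, 17.8), lorentz_like(d, 17.8))
```

The penalizing forms keep growing as a constraint is violated more strongly, whereas the rewarding form flattens out, which is what makes it tolerant to incorrect constraints.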
All these proposals were evaluated here in the context of modeling with Rosetta’s ab initio relax protocol (Simons et al., 1997, 1999; Bonneau et al., 2001, 2002; Bradley et al., 2005; Raman et al., 2009). Four different targets were chosen: SalBIII (Luhavaya et al., 2015), a 15.6 kDa protein with a low sequence similarity to other proteins in the Protein Data Bank; and the three domains of Albumin (Sugio et al., 1999) (ALB-D1, ALB-D2 and ALB-D3), which have been standard examples in cross-linking experiments (Belsom et al., 2016; Huang et al., 2004; Fischer et al., 2013). In this section, we will describe the modeling performed with ideal cross-linking data sets, computed from the crystallographic models with Topolink (Martinez et al., 2017). The SalBIII set contains 62 constraints compatible with the crystallographic structure. For ALB-D1, ALB-D2, and ALB-D3, the sets contained 125, 153 and 92 constraints, respectively. Section 3.3 describes the modeling results obtained with more limited experimental sets of constraints.
Modeling was performed without constraints and with three different penalization choices previously used to represent cross-linking constraints: the flat harmonic potential, the linear penalization, and the Lorentz-like potential. The upper limit distances considered were: (i) UL=25 Å between Cβ atoms, assuming that the UL incorporates the conformational diversity of the structure; (ii) a slightly more restrained UL=20 Å between Cβ atoms, which roughly represents the extended length of a link formed by the DSS linker bound to two Lys residues; (iii) UL=Extended length, in which each constraint has a UL that is the sum of the lengths of the side chains and of the linker, and which varies according to the linker and side chains involved; and, finally, (iv) UL=Lmax(0.99), the Statistical Limit, which is computed for each pair of residue types independently.
3.2.2. The statistical upper limit improves significantly the quality of the models
The distributions of model quality obtained using the different representations of the constraints and upper limits are shown in Figure 4 for the SalBIII protein. Distributions obtained without constraints are repeated in each graph as a reference. The left-side graphs display the quality distributions of all 5,000 models, while the right panels show the fraction of models with fold-level accuracy that are obtained for subsets of the models as classified by percentiles of their Rosetta energy scores. The results are summarized in Table 2.
None of the models obtained without constraints have a TM-score greater than 0.5, thus confirming that experimental constraints are essential for the modeling of this protein. The flat harmonic or linear energy functions with UL=25 Å perform as badly as the modeling without constraints. Decreasing the UL distance, however, dramatically improves the overall quality of the models. Using the statistical limits derived here, it is possible to obtain 10 and 14% of native-like structures using the linear and flat harmonic energy functions, respectively. Notably, by selecting the 10% best-scored models, the native-like populations increase to 63 and 72%. Therefore, using a more constrained potential significantly increases the quality of the models obtained. Here, this choice is justified by the statistical analysis of cross-linkable pairs in known protein structures. For example, we know that Lmax(0.99)=17.8 Å is too restrictive for only 1% of the cross-links between Lysine residues.
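The two quantities reported above, the fraction of native-like models overall and within the best-scored subset, can be computed as in the following sketch. It assumes that each model is summarized by a hypothetical (energy score, TM-score) pair and that lower energy scores are better. The function name and the toy values are illustrative only.

```python
def native_like_fraction(models, top_fraction=1.0, tm_cutoff=0.5):
    """Fraction of models with TM-score above `tm_cutoff` among the
    best-scored `top_fraction` of the set.

    models: list of (energy_score, tm_score); lower energy is assumed better.
    """
    ranked = sorted(models, key=lambda m: m[0])
    n_keep = max(1, int(round(top_fraction * len(ranked))))
    kept = ranked[:n_keep]
    return sum(tm > tm_cutoff for _, tm in kept) / len(kept)

# Toy, made-up models: (energy score, TM-score).
toy_models = [(-120.0, 0.62), (-118.0, 0.55), (-110.0, 0.41), (-105.0, 0.52),
              (-100.0, 0.30), (-98.0, 0.35), (-95.0, 0.28), (-90.0, 0.45),
              (-88.0, 0.33), (-80.0, 0.25)]
print(native_like_fraction(toy_models))                    # 0.3
print(native_like_fraction(toy_models, top_fraction=0.2))  # 1.0
```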
The use of the Lorentz potential did not result in any improvement of the models relative to modeling without constraints, regardless of the UL used. This is likely a consequence of the Lorentz potential having a null gradient at almost every distance (see Figure S4). Therefore, in gradient-dependent modeling strategies, like the ones used by Rosetta, this potential could only affect the selection of the models by their final energy. In other words, this potential might be useful for modeling using Monte-Carlo sampling methods alone, but any advantage that might be gained by using gradient information is lost.
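The gradient argument can be made explicit with a short derivation. It assumes, for illustration only, a Lorentz-like bonus of depth w and width γ that is constant below UL, and a flat harmonic penalty of width s. These functional forms are assumptions of this note and may differ from those plotted in Figure S4.

```latex
\begin{aligned}
V_{L}(d) &= -\frac{w}{1 + \left(\frac{d-UL}{\gamma}\right)^{2}} \quad (d > UL),
\qquad V_{L}(d) = -w \quad (d \le UL) \\
\frac{dV_{L}}{dd} &= \frac{2w\,(d-UL)/\gamma^{2}}{\left[1 + \left(\frac{d-UL}{\gamma}\right)^{2}\right]^{2}}
\;\approx\; \frac{2w\,\gamma^{2}}{(d-UL)^{3}} \quad \text{for } d-UL \gg \gamma \\
V_{FH}(d) &= \left(\frac{d-UL}{s}\right)^{2}, \qquad
\frac{dV_{FH}}{dd} = \frac{2\,(d-UL)}{s^{2}} \quad (d > UL)
\end{aligned}
```

Under these assumptions the Lorentz-like gradient is exactly zero below UL and vanishes as the inverse cube of the violation far above it, so only distances within a few γ of UL exert a force. The flat harmonic gradient, in contrast, grows linearly with the violation.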
Similar results were obtained for all three Albumin domains, and are reported in Supplementary Information Figure S5 and Table S2.
3.2.3. The statistical force-field optimally weights constraint penalties
Finally, we used the complete statistical force-field (XLFF) to model protein domains. The statistical weights introduced substantially improved the modeling results: Figure 5A compares the quality of the models obtained with each of the energy functions described in the previous section with those obtained with the XLFF force-field. We compare the different representations of the penalty function using the best upper distance limit, defined by the statistical analysis described in the previous section. Using the statistical potential, 27% of all models are native-like structures, which is almost double and almost triple the fractions of native-like structures obtained with the flat harmonic and linear energy functions, respectively. Furthermore, 94% of the 10% best Rosetta-scored models are native-like structures (Figure 5B). For Albumin domains, the fraction of native-fold models increases from the best results obtained previously (using the flat harmonic potential with statistical limits): from 15% to 29% for ALB-D1, from 44% to 74% for ALB-D2, and from 40% to 70% for ALB-D3 (Supplementary Figure S6 and Table S2).
Finally, beyond being crucial to sampling, the potential energy introduced by the force-field helps to rank the models, as compared with the Rosetta score alone, as shown in Figure 5C. In fact, for SalBIII, the energy of constraints is almost as effective as the composed score function (Rosetta + XL score) to differentiate models.
In summary: (i) the extended linker lengths are unnecessarily large, leading to information loss, independently of the potential energy function used; (ii) restricting UL to statistically relevant distances, Lmax(0.99), improves significantly the probability of generating native-like structures; and (iii) the statistical force-field, V(Lmax, deuc), is the best strategy to implement XLMS derived constraints in modeling.
3.3. Modeling with experimental XLMS constraints
In the previous sections, we evaluated tertiary structure modeling performance from the ideal perspective of having identified all potential cross-links consistent with the crystallographic structure.
We now discuss the modeling results obtained using experimental data. In this work, we consider the use of XLMS constraints only. Evidently, the present approach can be, and should be, combined with other sources of data if available.
We performed XLMS experiments using DSS and the XPlex chemistry (Fioramonte et al., 2018) on SalBIII and Serum Albumin domains (Supplementary Information S2) and analyzed the resulting MS raw files with SIM-XL (Borges et al., 2015; Lima et al., 2015) to obtain a list of cross-link candidates. The set of cross-links obtained experimentally which is consistent with the crystallographic structure contains 25 to 35% of the ideal cross-link set (Supplementary Information S2 and Table S1). This reduced set of cross-links was used for structural modeling.
Figure 6A shows, as expected, that reducing the number of constraints worsens the TM-score distribution of the models obtained by Rosetta (red) relative to the ideal set. Nevertheless, there is a significant improvement relative to modeling without constraints: 3% of all models have TM-scores greater than 0.5 and, within the 10% best-scored models, 20% display native-like folds. These results are similar to those obtained for SalBIII using the ideal constraint set and the flat harmonic or linear penalization functions, unless the statistical limit we propose is used (Table 2). In other words, the improvement in modeling obtained by the XLFF force-field is comparable to what would be obtained from an ideal experiment using the previous constraint representation strategies. Similar results for Albumin domains are shown in Supplementary Information Figure S7 and Table S2.
4. Conclusions
A force-field designed for modeling XLMS constraints was developed. The potential energy functions representing the constraints are obtained from the statistics of physically-accessible distances between residues in the database of known protein structures. The potential energy function is dependent on the Euclidean distance between residues and on the structural properties of the linker, and associates more favorable interactions with pairs of residues which are more likely associated with a valid cross-linking path. The force-field was implemented in the Rosetta modeling suite, and substantially improves the quality of the models obtained. These results bring to reality the possibility of modeling from XLMS constraints the tertiary structures of proteins for which other structural data is not available or is insufficient for characterizing the protein fold.
Acknowledgements
We thank FAPESP (Grants 2010/16947-9, 2013/05475-7, 2013/08293-7, 2014/17264-3 and 2016/13195-2), and CNPq (Grant 470374/2013-6) for financial support.
References 332
Andreani,R. et al. (2009) Low Order-Value Optimization and applications. J. Glob. Optim., 43, 1–22.
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Bateman,A. et al. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.
Belsom,A. et al. (2016) Serum Albumin Domain Structures in Human Blood Serum by Mass Spectrometry and Computational Biology. Mol. Cell. Proteomics MCP, 15, 1105–1116.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Bonneau,R. et al. (2002) De Novo Prediction of Three-dimensional Structures for Major Protein Families. J. Mol. Biol., 322, 65–78.
Bonneau,R. et al. (2001) Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins Struct. Funct. Bioinforma., 45, 119–126.
Borges,D. et al. (2015) Using SIM-XL to identify and annotate cross-linked peptides analyzed by mass spectrometry. Protoc. Exch.
Bowers,P.M. et al. (2000) De novo protein structure determination using sparse NMR data. J. Biomol. NMR, 18, 311–318.
Bradley,P. et al. (2005) Toward High-Resolution de Novo Structure Prediction for Small Proteins. Science, 309, 1868–1871.
Brodie,N.I. et al. (2017) Solving protein structures using short-distance cross-linking constraints as a guide for discrete molecular dynamics simulations. Sci. Adv., 3, e1700479.
Chavez,J.D. et al. (2018) Chemical Crosslinking Mass Spectrometry Analysis of Protein Conformations and Supercomplexes in Heart Tissue. Cell Syst., 6, 136–141.e5.
Chavez,J.D. et al. (2016) In Vivo Conformational Dynamics of Hsp90 and Its Interactors. Cell Chem. Biol., 23, 716–726.
Degiacomi,M.T. et al. (2017) Accommodating Protein Dynamics in the Modeling of Chemical Crosslinks. Structure, 25, 1751–1757.e5.
DiMaio,F. et al. (2015) Atomic-accuracy models from 4.5-Å cryo-electron microscopy data with density-guided iterative local refinement. Nat. Methods, 12, 361–365.
Eswar,N. et al. (2006) Comparative protein structure modeling using Modeller. Curr. Protoc. Bioinforma., Chapter 5, Unit-5.6.
Fioramonte,M. et al. (2018) XPlex: an effective, multiplex cross-linking chemistry for acidic residues. Anal. Chem.
Fischer,L. et al. (2013) Quantitative cross-linking/mass spectrometry using isotope-labelled cross-linkers. J. Proteomics, 88, 120–128.
Fiser,A. (2010) Template-based protein structure modeling. Methods Mol. Biol. Clifton NJ, 673, 73–94.
Fritzsche,R. et al. (2012) Optimizing the enrichment of cross‐linked products for mass spectrometric protein analysis. Rapid Commun. Mass Spectrom., 26, 653–658.
Götze,M. et al. (2012) StavroX--a software for analyzing crosslinked products in protein interaction studies. J. Am. Soc. Mass Spectrom., 23, 76–87.
Herzog,F. et al. (2012) Structural Probing of a Protein Phosphatase 2A Network by Chemical Cross-Linking and Mass Spectrometry. Science, 337, 1348–1352.
Hoopmann,M.R. et al. (2015) Kojak: Efficient Analysis of Chemically Cross-Linked Protein Complexes. J. Proteome Res., 14, 2190–2198.
Huang,B.X. et al. (2004) Probing three-dimensional structure of bovine serum albumin by chemical cross-linking and mass spectrometry. J. Am. Soc. Mass Spectrom., 15, 1237–1247.
Iacobucci,C. and Sinz,A. (2017) To Be or Not to Be? Five Guidelines to Avoid Misassignments in Cross-Linking/Mass Spectrometry. Anal. Chem., 89, 7832–7835.
Kahraman,A. et al. (2013) Cross-Link Guided Molecular Modeling with ROSETTA. PLOS ONE, 8, e73411.
Kalisman,N. et al. (2012) Subunit order of eukaryotic TRiC/CCT chaperonin by cross-linking, mass spectrometry, and combinatorial homology modeling. Proc. Natl. Acad. Sci., 109, 2884–2889.
Kosinski,J. et al. (2015) Xlink Analyzer: Software for analysis and visualization of cross-linking data in the context of three-dimensional structures. J. Struct. Biol., 189, 177–183.
Lima,D.B. et al. (2015) SIM-XL: A powerful and user-friendly tool for peptide cross-linking analysis. J. Proteomics, 129, 51–55.
Luhavaya,H. et al. (2015) Enzymology of Pyran Ring A Formation in Salinomycin Biosynthesis. Angew. Chem. Int. Ed., 54, 13622–13625.
Martinez,L. et al. (2017) TopoLink: A software to validate structural models using chemical crosslinking constraints. Protoc. Exch.
Merkley,E.D. et al. (2014) Distance restraints from crosslinking mass spectrometry: Mining a molecular dynamics simulation database to evaluate lysine–lysine distances. Protein Sci. Publ. Protein Soc., 23, 747–759.
Ovchinnikov,S. et al. (2017) Protein structure determination using metagenome sequence data. Science, 355, 294–298.
Ovchinnikov,S. et al. (2016) Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins Struct. Funct. Bioinforma., 84, 67–75.
Pundir,S. et al. (2017) UniProt Protein Knowledgebase. In, Protein Bioinformatics, Methods in Molecular Biology. Humana Press, New York, NY, pp. 41–55.
Raman,S. et al. (2009) Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins, 77, 89–99.
Santos,D. et al. (2018) Enhancing protein fold determination by exploring the complementary information of chemical cross-linking and coevolutionary signals. Bioinformatics, bty074.
Sarpe,V. et al. (2016) High Sensitivity Crosslink Detection Coupled With Integrative Structure Modeling in the Mass Spec Studio. Mol. Cell. Proteomics MCP, 15, 3071–3080.
Schneider,M. et al. (2016) Blind testing of cross‐linking/mass spectrometry hybrid methods in CASP11. Proteins, 84, 152–163.
Schneidman-Duhovny,D. et al. (2012) Integrative structural modeling with small angle X-ray scattering profiles. BMC Struct. Biol., 12, 17–17.
Simons,K.T. et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J. Mol. Biol., 268, 209–225.
Simons,K.T. et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins Struct. Funct. Bioinforma., 34, 82–95.
Song,Y. et al. (2013) High-Resolution Comparative Modeling with RosettaCM. Structure, 21, 1735–1742.
Sugio,S. et al. (1999) Crystal structure of human serum albumin at 2.5 Å resolution. Protein Eng. Des. Sel., 12, 439–446.
Tamò,G.E. et al. (2017) Assessment of data-assisted prediction by inclusion of crosslinking/mass-spectrometry and small angle X-ray scattering data in the 12th Critical Assessment of protein Structure Prediction experiment. Proteins.
Tang,Y. et al. (2015) Protein structure determination by combining sparse NMR data with evolutionary couplings. Nat. Methods, 12, 751–754.
Thompson,J.M. et al. (2012) Accurate protein structure modeling using sparse NMR data and homologous structure information. Proc. Natl. Acad. Sci., 109, 9875–9880.
Xu,J. and Zhang,Y. (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics, 26, 889–895.

Figure 1: Cross-linking information. The identification of a cross-link between two residues (yellow
line) implies that the distance between reactive atoms, d(Ax,By), is smaller than the extended linker length,
LXL. However, a fixed backbone structure cannot represent all cross-linkable side-chain configurations.
As indicated by the red lines connecting Cβ atoms, alternative side-chain configurations for a single
fixed backbone can potentially validate four other cross-links. Therefore, at least the variability of
side-chain orientations should be taken into account to define the effective maximum distance, Lmax, between
residues that might be cross-linked.

Figure 2: Statistical definition of Lmax. (A) After computing the topological distance between Nζ
atom pairs, we selected the subset of pairs with distances shorter than the linker length, 11.5 Å. (B) Next,
we selected the subset of topological distances between Cβ atom pairs whose corresponding
reactive atoms were in the previous subset. The topological distance distribution reveals that distances
corresponding to side chains and linker in extended conformations (~22 Å) are never observed. We define
a cross-linkable distance for Lysine pairs and the DSS/BS3 cross-linker after removing unlikely scenarios (1%)
as Lmax(0.99) = 17.8 Å (vertical dashed line), increasing the restrictive role of the constraint by more than
4 Å. Similar profiles for other residue pairs are shown in Supplementary Figure S1.

Table 1: Extended and statistical (Lmax) distances for cross-linked residue pairs and linkers. The
effective maximum distances that account for 99% of the possible cross-links are significantly more
restrictive than the maximum linker lengths. Extended conformations are not frequently observed.

cross-link ID        extended distance / Å    Lmax(0.99)* / Å
BS3/DSS
  KK                 21.8                     17.8
  KS                 18.0                     15.8
  SS                 14.1                     13.4
1,6-hexanediamine
  DD                 14.1                     13.5
  DE                 15.4                     14.3
  EE                 16.7                     15.1
zero-length
  KE                 -                        10.5
  KD                 -                        9.7
  SE                 -                        7.7
  SD                 -                        7.0
*statistically-derived topological Cβ-Cβ distance / Å

Figure 3: Statistical force-field determination. (A) Probability that a topological distance is below
Lmax(0.99) = 17.8 Å as a function of the Euclidean distance between Cβ Lys/Lys pairs, as a representation
of the DSS/BS3 cross-linker. As the Euclidean distance reaches Lmax, the probability of satisfying the
topological length decreases because fewer physical paths connecting the residues are possible.
(B) A potential energy curve can be derived from (A) assuming that this probability distribution is a
Boltzmann distribution. (C) For each pair of residue types a different energy function is derived. Here, we
show the energy functions for KK, KS and SS pairs assuming a DSS/BS3 cross-linker. Potential energy
profiles for other residue pairs and linkers are shown in Supplementary Figure S3.

Figure 4: Performance of cross-linking energy functions in the modeling of the SalBIII structure with
the Rosetta ab initio relax protocol. For the linear and flat harmonic energy functions, the generation of
native-like structures improves as more restrictive distance limits are applied. The Lorentz
energy function produces the same distributions as the modeling without constraints, our negative
control. There is a significant improvement in the quality of the models obtained when using the
statistical upper distance limit proposed here, which is justified because this limit is too restrictive for,
on average, only 1% of the expected cross-links. Similar results for Albumin domains are shown in Figure S5.
Table 2: Fraction of SalBIII models with fold-level accuracy obtained using the different
available energy functions to represent XLMS constraints. The results obtained with the statistical
upper limit and the full statistical potential (XLFF) developed in this work are highlighted. Using
statistical upper limits significantly improves the quality of the models obtained, even when using
the flat harmonic or linear penalization functions, but the proper statistical representation of the
functional form of the constraint energy improves the quality of the models even further.
The modeling obtained with XLFF and the experimental constraint set is comparable to that of the best
previous energy functions using the ideal constraint set. Similar results for Albumin domains are
shown in Supplementary Table S2.

fraction of models with TM-score > 0.5

Constraint set   Energy function   UL                  all models   50% models with best Rosetta score   10% models with best Rosetta score
-                no constraints    -                   0.00         0.00                                 0.00
Ideal            flat harmonic     25 Å                0.00         0.00                                 0.00
Ideal            flat harmonic     20 Å                0.02         0.04                                 0.12
Ideal            flat harmonic     Extended length     0.04         0.07                                 0.25
Ideal            flat harmonic     Statistical Limit   0.14         0.27                                 0.72
Ideal            linear            25 Å                0.00         0.00                                 0.03
Ideal            linear            20 Å                0.02         0.04                                 0.16
Ideal            linear            Extended length     0.03         0.07                                 0.24
Ideal            linear            Statistical Limit   0.10         0.20                                 0.63
Ideal            Lorentz           25 Å                0.00         0.00                                 0.00
Ideal            Lorentz           20 Å                0.00         0.00                                 0.00
Ideal            Lorentz           Extended length     0.00         0.00                                 0.00
Ideal            Lorentz           Statistical Limit   0.00         0.00                                 0.01
Ideal            XLFF              Statistical Limit   0.27         0.56                                 0.94
Experimental     XLFF              Statistical Limit   0.03         0.06                                 0.20

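The quantities reported in Table 2 can be illustrated with a minimal Python sketch. It assumes each model is summarized by a (score, TM-score) pair in which a lower score is better; the function name `fold_level_fraction` and the toy values are hypothetical and are not taken from the modeling runs.

```python
def fold_level_fraction(models, best_fraction=1.0, tm_cutoff=0.5):
    """Fraction of models with TM-score above tm_cutoff among the best-scored subset.

    models: list of (score, tm_score) tuples; lower score is assumed to be better.
    best_fraction: fraction of best-scored models to keep (1.0, 0.5 or 0.1 in Table 2).
    """
    if not models:
        raise ValueError("at least one model is required")
    ranked = sorted(models, key=lambda m: m[0])            # best (lowest) score first
    n_keep = max(1, int(best_fraction * len(ranked)))      # keep at least one model
    kept = ranked[:n_keep]
    return sum(1 for _, tm in kept if tm > tm_cutoff) / len(kept)


# Toy example (made-up numbers, for illustration only)
toy_models = [(-10.0, 0.6), (-8.0, 0.4), (-5.0, 0.7), (-1.0, 0.2)]
print(fold_level_fraction(toy_models, best_fraction=1.0))
print(fold_level_fraction(toy_models, best_fraction=0.5))
```

Whether the ranking uses the Rosetta score alone or the combined Rosetta and XL energy only changes the first element of each pair.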
Figure 5: The XL statistical force-field in comparison to other energy functions to model the SalBIII
structure. The initial distribution of model quality (A) and the selection of best-scored models (B)
show that the XLMS force field outperforms other functional forms, even when those use the statistical upper
limit. (C) The XL constraint energy contributes to the classification of the models. Similar results
for Albumin domains are shown in Supplementary Figure S6.

Figure 6: Modeling SalBIII protein with XLMS force field and experimental constraints. (A) Using
experimental constraints, 3% of the models obtained using the current modeling protocol achieve fold
level accuracy. (B) If the 10% best-scored models are selected in terms of their combined XL and Rosetta
energies, a subset of models in which ~20% have fold level accuracy is obtained. This result is
comparable with those obtained using the ideal constraint set and flat harmonic or linear penalization
functions (Table 2). Similar results for Albumin domains are shown in Figure S7.

Supporting information
Statistical force-field for structural modeling using chemical cross-
linking/mass spectrometry distance constraints
Allan J R Ferrari1, Fabio C Gozzo1, Leandro Martínez1,2
1Institute of Chemistry, University of Campinas, Campinas, SP, Brazil and 2Center for
Computational Engineering & Sciences, University of Campinas, Campinas, SP, Brazil.
S1. Computational Methods
We have recently developed the Topolink software
(http://m3g.iqm.unicamp.br/topolink) (Martinez et al., 2017), which evaluates the consistency of
the experimental constraints derived from XLMS with structural models and computes, from the
potential reactivity of the residues on the protein structure, the experimental constraints one
should expect. Topolink (version 17.332) was used to derive the statistics of possible cross-link
formation in the non-redundant protein domains of the CATH database (Sillitoe et al.,
2015; Lam et al., 2016) (S40 v4.1). For each structure, the topological and Euclidean distances
between the Cβ and reactive atoms were computed for every pair of residues of interest. Here,
we considered two cross-linkers with different residue specificity, DSS/BS3 and
1,6-hexanediamine, both of which have spacer arms of approximately 11.5 Å. That is, the
topological distance between the reactive atoms must be smaller than 11.5 Å for a residue pair to
be considered a cross-linkable pair. Given the residue specificity of each cross-linker, the
following six pairs of residues were considered: Lys/Lys, Lys/Ser, and Ser/Ser (for DSS/BS3),
and Asp/Asp, Asp/Glu, and Glu/Glu (for 1,6-hexanediamine). Additionally, we examined the
species formed from zero-length reactions, in which the reactive side chains become directly
bonded: Lys/Asp, Lys/Glu, Ser/Asp and Ser/Glu. Here, we exemplify the results for the Lys/Lys
pair. The statistics for DSS/BS3, 1,6-hexanediamine, the chemically analogous shorter linkers
(BSG and 1,3-propanediamine) and zero-length cross-links, and the derived statistical force-field, are available at
http://m3g.iqm.unicamp.br/topolink/xlff/.
S1.1 Determination of the maximum effective linker length Lmax
Figure 2 of the main manuscript shows the distribution of topological distances smaller
than 11.5Å (the extended linker length) between Nζ atoms of Lys/Lys pairs and their associated
Cβ-Cβ topological distance distribution. For instance, if the Cβ-Cβ topological distance is greater
than 22Å, there is no side chain conformation allowing the approximation of the Nζ atoms to less
than 11.5Å, as the sum of the lengths of the side chains is 10.5Å. The most frequent Cβ-Cβ
topological distance associated with Nζ-Nζ topological distances smaller than 11.5Å is about
7.5Å, corresponding to Cβ-Cβ topological distances of vicinal residues in α-helices.
The integrated distribution of Figure 2B illustrates the fraction of Nζ-Nζ pairs that can be
found at a topological distance smaller than 11.5Å as a function of the Cβ-Cβ topological
distance. There is more than 99% probability that the topological distance between Cβ atoms is
smaller than 17.8Å if the Nζ-Nζ pair satisfies the 11.5Å cutoff. This means that the effective
reach of the linker and side-chains is about 17.8Å instead of the fully-extended 22Å limit for this
pair of residues. This distance is 4.2Å (~20%) shorter than the extended limit, thus increasing
significantly the precision of the structural information provided by the XLMS constraint. This
effective limit is defined as Lmax for every pair of reactive residues and linker arm lengths, in this
case that of the DSS/BS3 reagent. This additional restriction plays a key role in improving the
modeling results. Figure S1 presents the distributions for all cross-links which have either
DSS/BS3 or 1,6-hexanediamine as cross-linkers. A discussion of the zero-length Lmax definition is
presented in Figure S2. Table 1 of the main manuscript summarizes the maximum extended
linker lengths and the statistically derived limits for each cross-linkable pair considered.
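The definition of Lmax described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of (reactive-atom topological distance, Cβ-Cβ topological distance) pairs, as would be produced by a Topolink-like analysis, and the cutoff is taken with a simple nearest-rank percentile. The function name `lmax_from_pairs` is hypothetical and the percentile rule used by the authors may differ.

```python
import math


def lmax_from_pairs(pairs, linker_length=11.5, coverage=0.99):
    """Effective maximum Cβ-Cβ topological distance covering `coverage` of linkable pairs.

    pairs: iterable of (d_top_reactive, d_top_cbeta) distances in Å.
    A pair is considered cross-linkable when the topological distance between
    its reactive atoms is smaller than the linker length.
    """
    linkable = sorted(cb for reactive, cb in pairs if reactive < linker_length)
    if not linkable:
        raise ValueError("no cross-linkable pairs in the input")
    # Nearest-rank percentile (an assumed convention): smallest value such that
    # at least `coverage` of the cross-linkable pairs lie at or below it.
    rank = math.ceil(coverage * len(linkable))
    return linkable[rank - 1]
```

With coverage = 0.99 this discards the 1% longest Cβ-Cβ topological distances among the cross-linkable pairs, which is the idea behind Lmax(0.99).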
S1.2. Development of a statistical force-field
In the previous section, we have shown that an effective maximum linker length can be
obtained by analysis of topological distances in a database of protein structures. The restricted
maximum lengths improve the precision of the distance constraint significantly by excluding
unlikely arrangements of side chains and linker molecules.
Now we will use this effective maximum linker length to create a potential energy
function that depends on the Euclidean distances between Cβ atoms, amenable for computational
modeling. The underlying problem is how to infer the existence of a physically possible
topological distance from the Euclidean distance measured.
If the topological distance between the Cβ atoms is smaller than the effective linker
reach, determined by the procedure above, the XLMS constraint can be satisfied by some side
chain conformation. Intuitively, the shorter the Euclidean distance between Cβ atoms, the greater
the probability that a topological distance will exist corresponding to a valid constraint.
Conversely, as the Cβ-Cβ Euclidean distance increases, the probability of a valid topological
distance between these atoms decreases since physical barriers for the linker are likely to be
present.
We compute, then, from the database of protein structures, the probability that the
topological distance between two Cβ atoms is smaller than the effective linker reach, Lmax, as a
function of their Euclidean distance: $p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$. This computation
consists in evaluating the topological (physically-accessible) distances between every pair of Cβ
atoms in all protein structures of the CATH S40 non-homologous protein database and
classifying these distances as a function of their Euclidean distances, which are trivial to
compute.
Numerically, the probability distribution $p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$ was
computed starting from a distance of 3 Å, with 1 Å intervals. That means, for example, that we
compute the probability that a topological distance is smaller than Lmax(0.99) given that the
Euclidean distance between Cβ atoms is between 3 and 4 Å, and so on up to the maximum
effective linker reach Lmax. Specifically, we compute
$$p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})] = \frac{N[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]}{N(A/B)},$$
where $N[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$ is the number of topological distances found to be below
Lmax(0.99) for each Euclidean distance in the database, and N(A/B) is the number of solvent
accessible pairs of residues of type A and B.
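The binning procedure above can be sketched as follows. This is an illustration, not the authors' code: it assumes the input is a list of (Euclidean, topological) Cβ-Cβ distance pairs for solvent-accessible residues of the types of interest, and it normalizes each 1 Å bin by the number of pairs whose Euclidean distance falls in that bin, which is one possible reading of N(A/B). The function name `satisfaction_probability` is hypothetical.

```python
from collections import defaultdict


def satisfaction_probability(pairs, lmax, d_start=3.0, bin_width=1.0):
    """Estimate p(d_top < lmax | d_euc) on bins of width bin_width starting at d_start.

    pairs: iterable of (d_euc, d_top) distances in Å.
    Returns a dict mapping the lower edge of each populated bin to the probability.
    """
    total = defaultdict(int)       # pairs per Euclidean-distance bin
    satisfied = defaultdict(int)   # pairs per bin whose topological distance is below lmax
    for d_euc, d_top in pairs:
        if d_euc < d_start or d_euc >= lmax:
            continue               # statistics are collected only between d_start and lmax
        k = int((d_euc - d_start) // bin_width)
        total[k] += 1
        if d_top < lmax:
            satisfied[k] += 1
    return {d_start + k * bin_width: satisfied[k] / total[k] for k in sorted(total)}


# Toy example (made-up distances): two pairs in the 3-4 Å bin, one in the 5-6 Å bin
print(satisfaction_probability([(3.5, 4.0), (3.9, 20.0), (5.2, 6.0)], lmax=17.8))
```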
In Figure 3A (main manuscript) we show the probability that a topological distance
between Cβ atoms is smaller than Lmax as a function of the Euclidean distance, for pairs of
Lysine residues and DSS/BS3 cross-linker (defining Lmax=17.8Å). The shorter the Euclidean
distance measured between Cβ atoms, the higher the probability of having a cross-linkable
physical path between them, as expected. The probability of finding valid topological distances
decreases monotonically from roughly 6 Å on, and is about 20% at about 16 Å. This means that
two Cβ atoms found at a Euclidean distance of 16 Å have a statistical probability of 20% of
being connected by a valid topological path. We propose, here, that the potential energy
associated with this constraint should depend on this probability. That is, if a pair of Cβ
atoms is found at a given Euclidean distance, the potential energy associated with the XLMS
constraint should be smaller (more favorable) the greater the statistical probability that the atoms
have a valid topological distance between them.
Two independent parameters must be set to complete the potential energy profile: 1) The
minimum energy, associated with small distances, must be chosen, corresponding to the energy
of the reference state in a thermodynamic potential. 2) A choice must be made concerning the
behavior of the potential energy at distances close to and above Lmax, for which statistical data is
not available.
Concerning the minimum energy of the constraint, we argue that an equal energy bonus
should be associated to any experimental information obtained from a cross-linking experiment,
independently of the residues involved. We chose that a minimum energy of -2.5 kcal mol-1 is to
be associated with probability 1.0 (small Euclidean distances). With this choice, the potential
energy profiles for cross-link constraints involving Lysine and Serine for LXL = 11.5Å are shown
in Figure 3C.
For distances greater than Lmax, one should expect the potential energy to increase
quickly to infinity. How fast it should grow depends on whether false positives may be present.
A faster growth will strongly penalize constraints that are not satisfied. This is in general
desirable, unless one wants to develop a force-field which is insensitive to false positives. We
have described the potential above Lmax as a linear penalization function of the deviation from Lmax,
as shown in Figure 3A. We examined the possibility of using faster increasing functions, but no
significant differences in the results were obtained, so we opted for the simplest functional
form. We also examined the possibility of flattening the energy function after Lmax, such that the
potential is insensitive to unsatisfied constraints, an alternative which was not effective in the
present examples, but which might be useful if the constraint dataset is dominated by false
positives. This choice might deserve further investigation depending on the nature of the
constraint data available.
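The construction of the energy profile can be sketched as below. The exact functional form is not written out in this section, so the sketch assumes a Boltzmann inversion, E = E_min - kT ln p, which gives E = E_min = -2.5 kcal mol-1 for p = 1.0, followed by a linear penalty above Lmax. The value of kT (0.593 kcal mol-1, about 298 K), the slope of the penalty, the probability floor and the function name `xl_energy` are all assumptions made for illustration; the distributed force-field parameters should be used in practice.

```python
import math


def xl_energy(d_euc, probs, lmax, e_min=-2.5, kt=0.593, slope=1.0,
              d_start=3.0, p_floor=1e-3):
    """Illustrative XLMS constraint energy (kcal/mol) for a Cβ-Cβ Euclidean distance.

    probs: list with p(d_top < lmax | d_euc) for consecutive 1 Å bins starting at d_start.
    kt, slope and p_floor are assumed values, not parameters from the source.
    """
    def tabulated(k):
        # Boltzmann inversion (assumed form); p = 1.0 gives exactly e_min
        p = max(probs[k], p_floor)   # floor avoids log(0) in empty or zero bins
        return e_min - kt * math.log(p)

    if d_euc < d_start:
        return e_min                 # short distances: constraint fully satisfied
    if d_euc >= lmax:
        # linear penalization of the deviation from lmax, starting at the last bin energy
        return tabulated(len(probs) - 1) + slope * (d_euc - lmax)
    k = min(int(d_euc - d_start), len(probs) - 1)
    return tabulated(k)


# Toy profile with two bins (3-4 Å: p = 1.0, 4-5 Å: p = 0.5) and lmax = 5 Å
for d in (2.0, 3.5, 4.5, 6.0):
    print(d, round(xl_energy(d, [1.0, 0.5], lmax=5.0), 3))
```

A flat profile above Lmax, as discussed for datasets dominated by false positives, would correspond to setting the slope to zero in this sketch.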
S2. Experimental Cross-links
S2.1. Materials
SalBIII protein was obtained as described previously (Luhavaya et al., 2015). The following
materials were used: serum albumin (Sigma Aldrich), 1,6-diaminohexane (Sigma Aldrich), suberic acid
bis(N-hydroxysuccinimide ester) (DSS, Sigma Aldrich), 2-(N-morpholino)ethanesulfonic acid (MES,
Sigma Aldrich), N-hydroxybenzotriazole (HOBt, Sigma Aldrich), 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide (EDC,
Thermo Fisher Scientific), DL-dithiothreitol (DTT, Sigma Aldrich), iodoacetamide (IAA, Sigma
Aldrich), and trypsin solution from porcine pancreas (Sigma Aldrich).
S2.2. Cross-linking reaction
S2.2.1 DSS cross-linking reaction
35 µL of a solution of DSS in DMF (1 mg mL-1) was added to 500 µL of protein solution (10
µM) in PBS buffer 200 mM, pH 7 (~200-fold molar excess cross-linker to protein). The reaction
was carried out for two hours at 25 °C and 300 rpm.
S2.2.2 Xplex cross-linking reaction
For the Xplex reaction, all reagents were prepared in 200 mM MES buffer, pH 6. The reaction was
conducted by adding to 500 µL of protein solution (10 µM): (i) 10 µL of a solution (500 mM) of
1,6-diaminohexane (~1000-fold molar excess cross-linker to protein), (ii) 5 µL of a HOBt
solution (500 mM) (~500-fold molar excess to protein) and (iii) 20 µL of EDC solution
(~2000-fold molar excess to protein). The reaction was carried out for two hours at 25 °C and 300 rpm.
Reaction details were reported recently (Fioramonte et al., 2018).
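The molar excesses quoted for the Xplex reagents follow from the volumes and concentrations given above, as the short Python check below illustrates. The helper name `molar_excess` is hypothetical; only the 1,6-diaminohexane and HOBt additions are evaluated, because the EDC concentration is not stated in this section.

```python
def molar_excess(reagent_ul, reagent_mm, protein_ul, protein_um):
    """Molar ratio of reagent to protein from volumes (µL) and concentrations."""
    reagent_nmol = reagent_ul * reagent_mm            # µL x mM = nmol
    protein_nmol = protein_ul * protein_um / 1000.0   # µL x µM = pmol, converted to nmol
    return reagent_nmol / protein_nmol


# 500 µL of 10 µM protein contains 5 nmol of protein
print(molar_excess(10, 500, 500, 10))  # 1,6-diaminohexane: 10 µL of 500 mM
print(molar_excess(5, 500, 500, 10))   # HOBt: 5 µL of 500 mM
```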
S2.2.3 Reaction follow-up
The samples were incubated at 80 °C for 0.5 h to promote thermal denaturation. After cooling, 20
µL of an IAA solution (1 mg mL-1) was added and the samples were kept in the dark for 0.5 h. Next,
a buffer exchange to sodium bicarbonate buffer, 100 mM, pH 8, was performed. Samples were
submitted to overnight digestion by adding trypsin solution in the ratio 1:50 protease/protein
(m/m).
S2.3. Protein cross-linking analysis
All cross-linking samples were analyzed in duplicate by nanoLC-nESI-MS/MS using a Dionex
UltiMate™ 3000 RSLCnano system coupled online to a Q-Exactive Plus orbitrap mass
spectrometer (Thermo Scientific) operating in positive ion mode. Tryptic peptides were
desalted/concentrated by self-packed Poros 20 R2 (Applied Biosystems) tip columns, dried
using a SpeedVac concentrator (Savant), solubilized in 1% formic acid and sonicated for 10
min in an ultrasonic bath. Peptide concentration was estimated by 280 nm absorbance measured
on a NanoDrop 2000 spectrophotometer (1 Abs = 1 mg/mL, considering 1 cm path length)
(Thermo Scientific). The peptides (1-1.5 µg) were injected at 2 µL/min for 10 min onto a home-
made trapping column (2 cm length, 100 µm internal diameter, 1-2 mm Kasil frit) packed with 5
µm 200 Å Magic C18AQ beads (Michrom Bioresources Inc). Sample fractionation was
performed at room temperature using a flow rate of 200 nL/min and a laser-pulled fused-silica
column (30 cm length, 75 µm internal diameter) packed with 1.9 µm ReproSil-Pur 120 C18-AQ
(Dr. Maisch GmbH) and equilibrated with 0.1% formic acid in water containing 2% acetonitrile.
After 10 min elution under the initial equilibration condition, acetonitrile concentration ramped
to 50% in 162 min, followed by a further increase until 80% in 4 min and a final washing step
with that acetonitrile concentration for 2 min. Using data-dependent acquisition, up to 6 most
abundant precursor ions per MS survey scan (300 – 1800 m/z) were selected for MS/MS scans
using the XCalibur software (Version 3.0.63, Thermo Scientific). MS1 data were acquired in the
profile mode and the parameters were: AGC target of 1E6, IT of 100 ms, 70,000 resolution
(FWHM at 200 m/z) and 1 microscan. Precursor ions with z ≥ 3 were isolated using 2 m/z
window and an offset of 0.5 m/z, followed by fragmentation with Higher-energy Collisional
Dissociation (HCD) applying the following Stepped Normalized Collision Energy (SNCE):
30-40 for samples cross-linked with DSS and 28-33 for those submitted to the XPlex chemistry. In
an attempt to optimize the fragmentation of zero-length cross-linked peptides, these last samples
were also analyzed using 30-40 SNCE. Ions selected for fragmentation were dynamically
excluded for 60 s, with the “exclude isotopes” option. MS/MS scans were acquired in the centroid
mode with a resolution of 35,000, a fixed first mass of 200 m/z, AGC of 5E5, IT of 100 ms and
the underfill ratio set to 10%. The spray voltage was set to 1.9 kV with no sheath or auxiliary gas
flow and with a capillary temperature of 250 °C. The mass spectrometer was externally calibrated
using a calibration mixture that was composed of caffeine, peptide MRFA and Ultramark 1621,
as recommended by the instrument manufacturer.
S2.3. Cross-linking identification
Raw data were processed using SIM-XL software v. 1.3.2.0 and the search parameters were:
cross-linker XPlex C6Ac2 (1,6-hexanediamine and zero-length) or DSS/BS3, 10 ppm error tolerance
for precursors and fragments, trypsin fully specific digestion with up to 3 missed cleavages,
carbamidomethylation of cysteine as fixed modification. Our list of candidates was obtained
using parameters for non-restrictive identification of XLs (post analysis filters): 10 ppm error
tolerance for precursor and fragments, score = 2.5, spectral count = 1 and 1 peak matched per
chain.
Figure S1: Distribution of topological distances for Cβ-Cβ atoms for DSS/BS3 and
1,6-hexanediamine cross-linkable pairs. The strategy consists in computing all topological distances
between Cβ atoms and between the corresponding reactive atoms. Next, we shortlist the subset of Cβ-Cβ
topological distances whose associated reactive-atom distances are below the linker length, LXL. The
exclusion of the most unlikely distances (here, 1%) returns the statistical upper limit cutoff, Lmax(0.99),
which should be considered to validate cross-link data (dashed line in each graph). Refer to Table 1 for
each distance value.
Figure S2: Zero-length Lmax definition. Zero-length species are amides or esters that result from covalent
binding of the nitrogen of a Lys or an oxygen of a Ser to the carbonyl of an acidic Asp or Glu residue,
with elimination of a water molecule. That is, the presence of a zero-length cross-link implies a more constrained
system than in the native protein. To evaluate the state just before the zero-length formation, one needs to
approximate the contact between the residues involved before the reaction. Here, we computed the
topological distance between the atoms which would be connected after the reaction through the covalent
bond. A maximum at ~4 Å is observed in the distributions (left, black curve), which is due to
residues interacting through polar interactions before bond formation. A normal distribution (left, blue
dashed curve) was used to fit the probability distribution for this condition, and the distance containing
99% of the area under the Gaussian curve (left, red curves) was chosen to represent the maximum
distance allowed between the reactive atoms. This distance is treated as LXL (although there is no linker),
and the determination of Lmax is analogous to that of other linkers (Figure S1). The dashed vertical lines
on the right panels define the Lmax(0.99) values reported in Table 1.
Figure S3: Potential energy curves for different cross-linking types. (A) DSS/BS3, (B) 1,6-
hexanediamine and (C) zero-length.
Figure S4: Representation of cross-linking constraints. In each of the right panels the red line indicates
the UL from which the potential energy increases.
Figure S5: Albumin (ALB) domains (D1, D2 and D3) modelled without constraints and with the ideal
set of constraints and the linear penalization energy function for four UL (25 Å, 20 Å, Extended
length limit, and Statistical Limit). As observed for SalBIII, more restrictive upper limit distances
improve the models obtained. Results comparing all heuristic energy functions are presented in Figure S6.
Figure S6: Use of XLMS statistical force field and its comparison to heuristic energy functions to
model Albumin domains structures with ideal set of constraints. The statistical force-field performs
better than linear and flat harmonic energy functions (with statistical limits) or the Lorentz energy
function. Also, the energy function including the constraint energy from the statistical force-field (Rosetta
+ XL energy) provides a better discrimination of models with fold level accuracy (TM-score > 0.5)
compared to the Rosetta score alone.
Figure S7: Albumin domains modelled with experimental constraints. The number of experimental
constraints for each domain corresponds to roughly 1/4 of the number of theoretical constraints applied
previously (Table S1). Among the 10% best-scored models, 11% (ALB-D1), 51% (ALB-D2) and 2% (ALB-D3)
have TM-score > 0.5 (Table S2).
Table S1: XLMS constraints in modelling SalBIII and Serum Albumin Domains

protein ID   sequence length   number of theoretical constraints*   number of experimental constraints‡   number of experimental validated constraints*   experimental validated / experimental   experimental validated / theoretical
SalBIII      141               62                                   156                                   22                                              0.14                                    0.35
ALB-D1       201               125                                  163                                   33                                              0.20                                    0.26
ALB-D2       189               153                                  195                                   40                                              0.21                                    0.26
ALB-D3       193               92                                   98                                    23                                              0.23                                    0.25
*as evaluated by Topolink with Lmax(0.99) defined in Table 1.
‡list of XLMS constraints candidates as defined in S2.3.
Table S2: Evaluation of fold-level accuracy of Serum Albumin domain’s models generated using
different energy functions to represent XLMS constraints. Distributions are shown in Figures S5,
S6 and S7.
Values are the fraction of models with TM-score > 0.5, for each fraction of best scored models (1.0, 0.5 and 0.1).

                                                           fraction of best scored models
Constraint set   Energy function   UL                      = 1.0    = 0.5    = 0.1

ALB-D1
-                no constraints    -                       0.00     0.00     0.00
Ideal            Linear            25 Å                    0.02     0.03     0.05
Ideal            Linear            20 Å                    0.06     0.10     0.24
Ideal            Linear            Extended length         0.12     0.20     0.38
Ideal            Linear            Statistical Limit       0.15     0.26     0.50
Ideal            flat harmonic     Statistical Limit       0.12     0.21     0.39
Ideal            Lorentz           Statistical Limit       0.00     0.01     0.02
Ideal            XLFF              Statistical Limit       0.29     0.43     0.67
Experimental     XLFF              Statistical Limit       0.05     0.07     0.11

ALB-D2
-                no constraints    -                       0.00     0.00     0.00
Ideal            Linear            25 Å                    0.02     0.03     0.06
Ideal            Linear            20 Å                    0.10     0.17     0.32
Ideal            Linear            Extended length         0.29     0.45     0.66
Ideal            Linear            Statistical Limit       0.39     0.60     0.80
Ideal            flat harmonic     Statistical Limit       0.44     0.71     0.92
Ideal            Lorentz           Statistical Limit       0.00     0.00     0.00
Ideal            XLFF              Statistical Limit       0.74     0.95     0.99
Experimental     XLFF              Statistical Limit       0.21     0.33     0.51

ALB-D3
-                no constraints    -                       0.00     0.00     0.00
Ideal            Linear            25 Å                    0.01     0.03     0.07
Ideal            Linear            20 Å                    0.07     0.13     0.29
Ideal            Linear            Extended length         0.26     0.47     0.80
Ideal            Linear            Statistical Limit       0.36     0.66     0.92
Ideal            flat harmonic     Statistical Limit       0.40     0.70     0.94
Ideal            Lorentz           Statistical Limit       0.00     0.00     0.00
Ideal            XLFF              Statistical Limit       0.70     0.96     0.99
Experimental     XLFF              Statistical Limit       0.01     0.02     0.02
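
The quantity reported in Table S2 can be sketched in a few lines of Python: rank the models by score, keep a given fraction of the best scored ones, and count how many exceed the TM-score threshold of 0.5. This is a minimal illustration and not the authors' analysis script. It assumes that lower scores are better (as for Rosetta energies) and that the subset size is obtained by rounding; the function name `fraction_fold_level` and the toy data are hypothetical and do not come from the paper.

```python
def fraction_fold_level(models, best_fraction, tm_cutoff=0.5):
    """Fraction of the best scored models whose TM-score exceeds tm_cutoff.

    models: list of (score, tm_score) pairs; lower score is assumed better.
    best_fraction: fraction of best scored models to keep (e.g. 1.0, 0.5, 0.1).
    """
    if not models:
        raise ValueError("no models given")
    # Assumption: the subset size is rounded and at least one model is kept
    n_best = max(1, int(round(len(models) * best_fraction)))
    best = sorted(models, key=lambda m: m[0])[:n_best]
    return sum(1 for _, tm in best if tm > tm_cutoff) / len(best)


# Toy data (made up for illustration): (score, TM-score)
toy_models = [
    (-150.0, 0.71), (-140.0, 0.64), (-130.0, 0.48), (-120.0, 0.52),
    (-110.0, 0.35), (-100.0, 0.30), (-90.0, 0.55), (-80.0, 0.28),
    (-70.0, 0.22), (-60.0, 0.25),
]

for frac in (1.0, 0.5, 0.1):
    print(frac, fraction_fold_level(toy_models, frac))
```

A value that grows as the fraction of best scored models shrinks, as in most rows of the table, indicates that the score enriches the best ranked subset in models with fold-level accuracy.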
Protocol S1. Rosetta ab initio relax protocol
Structural modeling with Rosetta was performed using the ab initio relax protocol with the
following flags. Each placeholder in braces must be replaced by the corresponding file. The 3-mer
and 9-mer fragment files were generated with the Robetta server (Kim et al., 2004)
(http://robetta.bakerlab.org/fragmentsubmit.jsp), excluding homologues. To generate models
without constraints, the constraint file and weight flags must be omitted.
-abinitio
    -fastrelax
    -increase_cycles 1
    -rg_reweight 0.25
-in
    -file
        -fasta {fasta file}
        -frag3 {fragments3 file}
        -frag9 {fragments9 file}
    -path
        -database $rosetta_path/main/database/
-out
    -nstruct 5000
    -file
        -fullatom
        -silent {output silent file}
-constraints
    -cst_file {constraint file}
    -cst_weight 1
    -cst_fa_file {constraint file}
    -cst_fa_weight 1
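
When many targets are modeled, the flags file can be generated programmatically. The Python sketch below mirrors the flags listed above and drops the whole constraints section when no constraint file is given, as required for unconstrained runs. The function name `build_flags`, its parameters, the example file names, the use of indentation to mark option nesting, and the use of one constraint file for both `-cst_file` and `-cst_fa_file` are assumptions made for illustration.

```python
def build_flags(fasta, frag3, frag9, database, silent_out,
                cst_file=None, nstruct=5000):
    """Return the text of a flags file mirroring Protocol S1.

    If cst_file is None, the constraints section is omitted entirely.
    """
    lines = [
        "-abinitio",
        "    -fastrelax",
        "    -increase_cycles 1",
        "    -rg_reweight 0.25",
        "-in",
        "    -file",
        f"        -fasta {fasta}",
        f"        -frag3 {frag3}",
        f"        -frag9 {frag9}",
        "    -path",
        f"        -database {database}",
        "-out",
        f"    -nstruct {nstruct}",
        "    -file",
        "        -fullatom",
        f"        -silent {silent_out}",
    ]
    if cst_file is not None:
        # Assumption: the same file serves the centroid and full-atom stages
        lines += [
            "-constraints",
            f"    -cst_file {cst_file}",
            "    -cst_weight 1",
            f"    -cst_fa_file {cst_file}",
            "    -cst_fa_weight 1",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only
flags_with_cst = build_flags("target.fasta", "target.frag3", "target.frag9",
                             "$rosetta_path/main/database/", "models.silent",
                             cst_file="xlff.cst")
flags_no_cst = build_flags("target.fasta", "target.frag3", "target.frag9",
                           "$rosetta_path/main/database/", "models.silent")
print(flags_with_cst)
```

Keeping the constraints section behind a single optional argument makes it harder to leave a stray weight flag in an unconstrained run.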
References
Fioramonte,M. et al. (2018) XPlex: an effective, multiplex cross-linking chemistry for acidic
residues. Anal. Chem.
Kim,D.E. et al. (2004) Protein structure prediction and analysis using the Robetta server. Nucleic
Acids Res., 32, W526–W531.
Lam,S.D. et al. (2016) Gene3D: expanding the utility of domain assignments. Nucleic Acids
Res., 44, D404–D409.
Luhavaya,H. et al. (2015) Enzymology of Pyran Ring A Formation in Salinomycin Biosynthesis.
Angew. Chem. Int. Ed., 54, 13622–13625.
Martinez,L. et al. (2017) TopoLink: A software to validate structural models using chemical
crosslinking constraints. Protoc. Exch. doi:10.1038/protex.2017.035
Sillitoe,I. et al. (2015) CATH: comprehensive structural and functional annotations for genome
sequences. Nucleic Acids Res., 43, D376–D381.