yoonjoo choi , sumeet agarwal and charlotte m. deaneafter superimposing anchor structures. modeller...

15
Submitted 13 November 2012 Accepted 21 November 2012 Published 12 February 2013 Corresponding author Charlotte M. Deane, [email protected] Academic editor Walter de Azevedo, Jr Additional Information and Declarations can be found on page 12 DOI 10.7717/peerj.1 Copyright 2013 Choi et al. Distributed under Creative Commons CC-BY 3.0 OPEN ACCESS How long is a piece of loop? Yoonjoo Choi 1 , Sumeet Agarwal 2,3,4 and Charlotte M. Deane 5 1 Department of Computer Science, Dartmouth College, Hanover, NH, USA 2 Department of Systems Biology Doctoral Training Centre, University of Oxford, Oxford, United Kingdom 3 Department of Physics, University of Oxford, Oxford, United Kingdom 4 Department of Electrical Engineering, Indian Institute of Technology Delhi, Delhi, India 5 Department of Statistics, University of Oxford, United Kingdom ABSTRACT Loops are irregular structures which connect two secondary structure elements in proteins. They often play important roles in function, including enzyme reactions and ligand binding. Despite their importance, their structure remains dicult to predict. Most protein loop structure prediction methods sample local loop segments and score them. In particular protein loop classifications and database search methods depend heavily on local properties of loops. Here we examine the distance between a loop’s end points (span). We find that the distribution of loop span appears to be independent of the number of residues in the loop, in other words the separation between the anchors of a loop does not increase with an increase in the number of loop residues. Loop span is also unaected by the secondary structures at the end points, unless the two anchors are part of an anti-parallel beta sheet. As loop span appears to be independent of global properties of the protein we suggest that its distribution can be described by a random fluctuation model based on the Maxwell–Boltzmann distribution. It is believed that the primary diculty in protein loop structure prediction comes from the number of residues in the loop. Following the idea that loop span is an independent local property, we investigate its eect on protein loop structure prediction and show how normalised span (loop stretch) is related to the structural complexity of loops. Highly contracted loops are more dicult to predict than stretched loops. Subjects Biochemistry, Bioinformatics, Biophysics, Computational Biology Keywords Protein structure, Protein loop, Protein structure prediction, Protein loop structure, Protein loop structure prediction, Protein, Loop stretch, Loop span INTRODUCTION Protein loops are patternless regions which connect two regular secondary structures. They are generally located on the protein’s surface in solvent exposed areas and often play important roles, such as interacting with other biological objects. Despite the lack of patterns, loops are not completely random structures. Early studies of short turns and hairpins showed that these peptide fragments could be clustered into structural classes (Richardson, 1981; Sibanda & Thorton, 1985). Such classifications have also been made across all loops (Burke, Deane & Blundell, 2000; Chothia & Lesk, 1987; Donate et al., 1996; Espadaler et al., 2004; Oliva et al., 1997; Vanhee et al., 2011) How to cite this article Choi et al. (2013), How long is a piece of loop? PeerJ 1:e1; DOI 10.7717/peerj.1

Upload: others

Post on 08-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Submitted 13 November 2012Accepted 21 November 2012Published 12 February 2013

    Corresponding authorCharlotte M. Deane,[email protected]

    Academic editorWalter de Azevedo, Jr

    Additional Information andDeclarations can be found onpage 12

    DOI 10.7717/peerj.1

    Copyright2013 Choi et al.

    Distributed underCreative Commons CC-BY 3.0

    OPEN ACCESS

    How long is a piece of loop?

    Yoonjoo Choi1, Sumeet Agarwal2,3,4 and Charlotte M. Deane5

    1 Department of Computer Science, Dartmouth College, Hanover, NH, USA2 Department of Systems Biology Doctoral Training Centre, University of Oxford, Oxford,

    United Kingdom3 Department of Physics, University of Oxford, Oxford, United Kingdom4 Department of Electrical Engineering, Indian Institute of Technology Delhi, Delhi, India5 Department of Statistics, University of Oxford, United Kingdom

    ABSTRACTLoops are irregular structures which connect two secondary structure elements inproteins. They often play important roles in function, including enzyme reactionsand ligand binding. Despite their importance, their structure remains difficultto predict. Most protein loop structure prediction methods sample local loopsegments and score them. In particular protein loop classifications and databasesearch methods depend heavily on local properties of loops. Here we examine thedistance between a loop’s end points (span). We find that the distribution of loopspan appears to be independent of the number of residues in the loop, in other wordsthe separation between the anchors of a loop does not increase with an increasein the number of loop residues. Loop span is also unaffected by the secondarystructures at the end points, unless the two anchors are part of an anti-parallelbeta sheet. As loop span appears to be independent of global properties of theprotein we suggest that its distribution can be described by a random fluctuationmodel based on the Maxwell–Boltzmann distribution. It is believed that theprimary difficulty in protein loop structure prediction comes from the numberof residues in the loop. Following the idea that loop span is an independent localproperty, we investigate its effect on protein loop structure prediction and showhow normalised span (loop stretch) is related to the structural complexity ofloops. Highly contracted loops are more difficult to predict than stretched loops.

    Subjects Biochemistry, Bioinformatics, Biophysics, Computational BiologyKeywords Protein structure, Protein loop, Protein structure prediction, Protein loop structure,Protein loop structure prediction, Protein, Loop stretch, Loop span

    INTRODUCTIONProtein loops are patternless regions which connect two regular secondary structures.

    They are generally located on the protein’s surface in solvent exposed areas and often play

    important roles, such as interacting with other biological objects.

    Despite the lack of patterns, loops are not completely random structures. Early studies

    of short turns and hairpins showed that these peptide fragments could be clustered

    into structural classes (Richardson, 1981; Sibanda & Thorton, 1985). Such classifications

    have also been made across all loops (Burke, Deane & Blundell, 2000; Chothia & Lesk,

    1987; Donate et al., 1996; Espadaler et al., 2004; Oliva et al., 1997; Vanhee et al., 2011)

    How to cite this article Choi et al. (2013), How long is a piece of loop? PeerJ 1:e1; DOI 10.7717/peerj.1

    mailto:[email protected]:[email protected]://peerj.com/academic-boards/editors/https://peerj.com/academic-boards/editors/http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://creativecommons.org/licenses/by/3.0/http://creativecommons.org/licenses/by/3.0/https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • or within specific protein families such as antibody complementarity determining

    regions (CDRs) (Al-Lazikani, Lesk & Chothia, 1998; Chothia & Lesk, 1987; Chothia et

    al., 1989). Loop classifications are generally based on local properties such as sequence,

    the secondary structures from which the loop starts and finishes (anchor region), the

    distance between the anchors, and the geometrical shape along the loop structure

    (Kwasigroch, Chomilier & Mornon, 1996; Leszczynski & Rose, 1986; Ring et al., 1992; Wojcik,

    Mornon & Chomilier, 1999).

    Loops can also be classified in terms of function. There is some evidence that a loop

    can have local functionality. Experiments have been carried out which support the idea

    that swapping a local loop sequence for a different functional loop sequence enables the

    new function to be taken on (Pardon et al., 1995; Toma et al., 1991; Wolfson et al., 1991).

    One important example of functional loop exchange is in the development of humanised

    antibodies (Queen et al., 1989; Riechmann et al., 1988).

    Accurate protein loop structure prediction remains an open question. Protein loop

    predictors have dealt with the problem as a case of local protein structure prediction.

    Protein structures are hypothesised to be in thermodynamic equilibrium with their

    environment (Anfinsen, 1973). Thus the primary determinant of a protein structure

    is considered to be its atomic interactions, i.e. its amino acid sequence. An analogous

    conjecture has arisen at the local scale where environment other than loop structure is

    fixed. Thus the modelling of protein loops is often considered a mini protein folding

    problem (Fiser, Do & Sali, 2000; Nagi & Regan, 1997). Although most loop structure

    prediction methods are based on this conjecture, loop sequence alone is not the complete

    determinant of the loop structure as even identical loop sequences can take multiple

    structural conformations depending on external environmental factors such as solvent

    and ligand binding (Fernandez-Fuentes & Fiser, 2006). Quintessential examples of such

    multiple loop structure conformations can be found in antibody CDR loops upon antigen

    binding (Choi & Deane, 2011).

    Database search methods have been successful in the realm of loop structure prediction

    (Verschueren et al., 2011). They depend upon the assumption that similarity between local

    properties may suggest similar local structures. All database search methods work in an

    analogous fashion using either a complete set or a classified set of loops and selecting

    predictions using local features including sequence similarity and anchor geometry

    (Choi & Deane, 2010; Fernandez-Fuentes, Oliva & Fiser, 2006; Hildebrand et al., 2009;

    Peng & Yang, 2007; Wojcik, Mornon & Chomilier, 1999). Ab initio loop modelling methods

    aim to predict peptide fragments that do not exist in homology modelling templates

    without structure databases. Generally, ab initio methods generate large local structure

    conformation sets and select predictions (de Bakker et al., 2003; Fiser, Do & Sali, 2000;

    Jacobson et al., 2004; Mandell, Coutsias & Kortemme, 2009; Soto et al., 2008). The generated

    loop candidates are optimised against scoring functions. In all loop modelling procedures

    anchor regions are often problematic and the accuracy of loop modelling depends upon

    the distance between the anchors (Xiang, 2006).

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 2/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • Here, we focus on a specific local property of protein loop structure: the distance

    between the two terminal Cα atoms of the loop, which we refer to as its span. The natureof the span distribution is broadly similar across different protein classes or anchor types,

    except for loops linking anti-parallel strands (anti-parallel β loops). In particular, the

    most highly frequent span appears to stay the same irrespective of the number of residues.

    This suggests that the span is distributed independently of other local properties and

    global structures. We demonstrate that the observed span distribution can largely be

    explained by a simple model of random fluctuations with a given length scale, based on the

    Maxwell–Boltzmann distribution.

    It is widely believed that the accuracy of loop structure prediction depends on the

    number of residues, i.e. the larger the number of residues, the more difficult a loop is

    to predict (Choi & Deane, 2010; Karen et al., 2007). We introduce the normalised span

    which indicates how stretched a loop is (loop stretch λ). Fully stretched loops (λ ' 1)are almost always predicted accurately, whereas contracted loops (λ� 1) are harder topredict. In fact, shorter loops tend to be more stretched whereas longer loops are likely

    to be highly contracted. We suggest that loop stretch should be addressed in practical

    modelling situations and loop structure prediction should be concerned with predicting

    highly contracted loops.

    MATERIALS AND METHODSLoop definitionIn each of the sets of protein structures loops, were identified using the following protocol.

    Secondary structures were annotated using JOY (Mizuguchi et al., 1998). A loop structure

    was defined as any region between two regular secondary structures that was at least three

    residues in length (Donate et al., 1996). Short (less than 4 residues in length) loops were

    discarded. Redundancy was removed using sequence identity. If a pair of loops shares over

    40% sequence identity (Fernandez-Fuentes & Fiser, 2006), the loop which has a higher

    average B-factor was discarded.

    Membrane protein structuresMembrane proteins (3,789 chains) were extracted from PDBTM (Tusnady, Dosztanyi &

    Simon, 2004). The membrane layer was defined as being from−20 to+20 Å (Scott et al.,

    2008) from the centre of the protein and loops whose two end Cα atom coordinates wereoutside the layer were discarded. A total of 1,027 non-redundant membrane loops were

    defined.

    Soluble protein structuresAll protein chains determined by X-ray crystallography which share less than 99%

    sequence identity (

  • Figure 1 The definition of loop span and loop stretch. Loop span is the separation of the two Cαs ateither end of the loop. In this example, 2J9O Chain A (198-205) has a span of 13.7 Å and contains 8residues. Maximum span can be calculated from the number of residues in the loop to be 21.6 Å. Loopstretch is the normalised span (13.7/21.6' 0.63).

    was then used to compare the 3,789 membrane chains against the soluble set. Any chains

    found during 5 iterations with an E-value cut-off of 0.001 were removed from the list of

    soluble protein chains. A total of 25,191 non-redundant soluble loops were identified from

    27,717 soluble protein chains.

    Loop span and loop stretchThe loop span (l) is the distance between the two terminal Cα atoms of a loop (Fig. 1).

    The maximum span lmax is a function of the number of residues n and calculated asfollows:

    lmax(n)=

    {γ · (n/2− 1)+ δ if n is even

    γ · (n− 1)/2 if n is odd

    where γ = 6.046 Å and δ = 3.46 Å (Flory, 1998; Tastan, Klein-Seetharaman & Meirovitch,2009). If the distance between two terminal Cα atoms in the loop (i.e. the span) is l, theloop stretch (λ) of the loop is defined as a normalised span.

    λ≡l

    lmax. (1)

    Note that the values of γ and δ are theoretical approximations so the λ of some loops

    may occasionally be larger than 1. Similar notations are found in Ring et al. (1992), Tastan,

    Klein-Seetharaman & Meirovitch (2009).

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 4/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • PROTEIN STRUCTURE PREDICTION AND LOOPSTRETCHLoop modelling test setsThere are two modelling test sets. The first set includes loops of 8 residues. The loops were

    binned every 0.1 loop stretch. In each bin, 40 test loops were randomly selected. A total of

    320 test loops from 0.2 to 1 in loop stretch were used (a full list is given in Table S1).

    The second set consists of loops of between 6 and 10 residues in length. Two classes

    of loops were collected at each length: contracted loops (λ < 0.4) and stretched loops(λ > 0.95); an identical number of loops was kept in each of these classes at each length.A total of 346 test loops were identified (58, 72, 110, 58 and 48 loops respectively, See

    Tables S2 and S3). For example, there are 55 contracted test loops and 55 stretched loops

    for loops of 8 residues.

    The measurement of accuracy is loop RMSD of all backbone atoms (N, Cα , C and O)after superimposing anchor structures.

    MODELLER settingThe default loop refinement script was used. One hundred loop models were sampled

    under the molecular dynamics level of slow. The DOPE potential energy (Shen & Sali,

    2006) was used for model quality assessment.

    FREAD settingA database was constructed using the 27,717 soluble protein chains defined above. All the

    parameters were set as default (the environment substitution score cut-off value≥25). Any

    results from self-prediction were eliminated.

    RESULTSNomenclatureIn this paper, proteins are divided into two main classes: membrane and soluble proteins.

    Loops from membrane protein structures are called “membrane loops” and those from

    soluble protein structures are referred to as “soluble loops”. Loops are also described by

    their secondary structure types: for example, loops connecting anti-parallel β sheets are

    termed “anti-parallel β loops”. The physical spatial distance between the two end Cα atomsof a loop is referred to as “span” (l). Maximum loop span (lmax) is the furthest that a setof residues can spread. “Loop stretch” (λ) is the normalised loop span: the observed span

    between two Cα atoms at each end of a loop in a protein structure over the loops maximumspan (Fig. 1).

    Loop span distributionThe number of residues in a loop is distributed in a similar fashion regardless of anchor

    types except for the loops linking anti-parallel β sheets due to the constraint of hydrogen

    bonds between adjacent β strands (Fig. 2A). Figure 2B displays how loop spans are

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 5/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1

  • Figure 2 Statistics of protein loops. (A) The frequency distribution of loops containing different numbers of residues. Anti-parallel β loops tendto have fewer residues. (B) The loop span distribution in terms of the anchor secondary structure do not show differences except for anti-parallel βloops. The upper part of the anti-parallel β loop span distribution is omitted in the figure. (C) The distributions of soluble loop span and membraneloop span appear to be similar. (D)–(G) Q–Q plots showing that the membrane and soluble loop span distributions are from the same probabilitydistribution.

    distributed for different anchor types. Again, apart from anti-parallel β loops, the loop

    span distributions do not change with anchor structures.

    The loop span distribution also does not alter when considering different protein

    classes. Figures 2C–2G show how the loop spans of membrane loops and soluble loops

    are distributed in a similar manner.

    Essentially a loop span value reflects how distant the end tips of the two secondary

    structures that the loop connects are. These observations suggest that the loop span may

    be distributed independently of local anchor structures and protein types, i.e. anchor

    distances do not depend on local secondary structure elements or global protein structures.

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 6/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • Figure 3 The span distributions for loops containing different numbers of residues. (A) These appearto show a constant mode. Data here is soluble loops excluding anti-parallel beta loops. (B) The modesfor the span distributions for loops containing different numbers of residues compared to the maximumspan for that length. The span modes were estimated using the Gaussian kernel density estimation. Notethat the estimated mode of loops of 4 residues is close to its maximum span.

    The modes of loop span distributions are roughly constant (Fig. 2B), even if we split the

    loops in terms of the number of residues (Fig. 3A). We fit our data using the Gaussian

    kernel density estimation. The estimated distributions show a nearly constant mode

    ('13 Å on average, Fig. 3B). This constant span value may be due to protein packing.

    Folded proteins tend to be tightly packed and thus secondary structures are placed close

    to one another while avoiding side chain steric clashes. This packing effect may mean that

    the end points of two secondary structures (i.e. span) will lie within a constant span value

    regardless of the number of residues in a loop.

    Maxwell–Boltzmann distribution for loop spanFrom the above observations, it appears that loop span is distributed independently of

    local anchor structures or global protein classes. Here we assume that a protein loop is an

    independent unit of the protein structure and the span is determined regardless of any

    other effects including sequence or the rest of the structure.

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 7/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • Here a model for the loop span distribution is established under the hypothesis

    that the two end points of a loop fluctuate in three dimensional space, following the

    Maxwell–Boltzmann distribution. Two constraints are imposed in this model: the

    minimum span lmin and the maximum span as a function of the number of residueslmax(n). Within these constraints, the span oscillates according to a normal distributionN (µ,σ 2) with a given length-scale lmode in three dimensional space.

    The underlying assumptions are that the end points cannot approach each other too

    closely, and that there is a maximum span achievable for a loop with a given number of

    residues (n). Within these constraints, the span is allowed to fluctuate around the givenlength-scale lmode in three dimensional space. Thus, in this model, the loop span l of nresidues is distributed as

    l=√

    l2x + l2y + l2z lx,ly,lz ∼N

    (0,

    l2mode2

    )(2)

    subject to the constraints that l ≥ lmin and l ≤ lmax(n), as stated above. The varianceof l2mode/2 corresponds to a modal span of lmode. Thus there are two parameters to bedetermined in our model: lmin and lmode. We set lmin to 3.8 Å, which is the typical distancebetween two neighbouring Cα atoms in a protein chain. lmode is set to an estimate of theempirical mode using the Gaussian kernel density estimation (12.7 Å).

    As there are not many longer loops in the data set, loops longer than 20 residues were

    discarded. In addition, all anti-parallel β loops were eliminated due to their physical

    constraints. These eliminations left 21,597 soluble loops (The frequency distribution for

    each number of residues is in Fig. S2). Having set the two parameters lmin and lmode, loopspans were generated 10 times per model in accordance with the Maxwell–Boltzmann

    distribution, preserving the observed distribution of the number of residues (i.e. 10

    simulated loop spans were generated for each real loop in the data set). The simulation out-

    come is depicted in Fig. 4A. The two distributions show the same shape and the quantile

    comparison in Fig. 4B indicates that they are statistically similar except for the tail region.

    There are apparent anomalies between the simulated and real span distributions

    towards the extremes. The model seems to predict more short-span loops than observed.

    Our model imposes a sharp lower threshold at lmin = 3.8 Å, whereas in reality we expecta smoother transition. In other words, we expect our assumption of free fluctuation to

    break down when the span gets close to the lower bound and the physical constraints begin

    to become relevant. On the other side of the distribution, we see a substantially higher

    number of long-span loops (>20 Å) than predicted by the model. The mismatches in the

    long-span region tend to become more prominent as the number of residues is increased.

    When we examined which loops tend to have exceptionally long spans, we found that

    some of these “loops” are domain linkers between independent folding units and therefore

    likely to be under different constraints. Others appear to have been misclassified, as the

    loop definition used here is based only on the anchors containing at least three consecutive

    residues of secondary structures and the loop containing none. This allows segments such

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 8/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1

  • Figure 4 Maxwell–Boltzmann distribution and loop span distribution. (A) The loop span distribution(black) from soluble loops and that of the Maxwell–Boltzmann distribution (red). (B) The Q–Q plotsuggesting that they follow the same distribution.

    as termini structures to be included if there happen to be very short helical segments at a

    protein structure’s terminus (Fig. S1).

    Protein structure prediction and loop stretchThe number of residues in loops is known to be related to the protein stability (Nagi

    & Regan, 1997) and the accuracy of most loop modelling techniques. Based on our

    observation that the loop span is independent of other properties, we examine its effects

    on protein loop structure prediction. Here we introduce loop stretch, the normalised loop

    span (Eq. (1)). Loop stretch values take on a range of 0 to 1, which indicates how stretched a

    loop is (1: fully stretched).

    Figure 5 displays how loop stretch frequencies are distributed for different numbers

    of residues, demonstrating that the number of residues is negatively correlated with loop

    stretch, i.e. the longer a loop is, the more likely it is to be contracted. This may suggest

    that, instead of the standard belief that loop modelling performs worse as the number

    of residues in the loop increases, it may be that the real problem is better described by

    considering how stretched the loop to be predicted is. For example, if a loop contains many

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 9/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1

  • Figure 5 Loop stretch of long and short loops. Loop stretch distributions for loops containing differentnumbers of residues. Shorter loops tend to be more stretched whereas longer loops are likely to be morecontracted. Only soluble loops excluding anti-parallel β loops are plotted.

    residues but is highly stretched, it will be predicted relatively accurately, as it can take on

    only a small number of different conformations.

    In order to check the relationship between accuracy and loop stretch we used a

    test set containing only 8 residue loops with 40 non-redundant loops in every 0.1

    loop stretch bin. Two loop modelling methods, which use two different sampling

    methods, were tested. MODELLER (Fiser, Do & Sali, 2000) is a popular protein structure

    prediction programme which has a built-in ab initio loop modelling module. FREAD

    (Choi & Deane, 2010) is a database search method which samples candidate loops

    depending on local properties and ranks predictions based on local loop sequence

    similarity and anchor geometry matches.

    The average accuracy of MODELLER shows a negative linear correlation against loop

    stretch for the first test set (Fig. 6A). In the case of fully stretched loops (λ > 0.95),MODELLER can produce consistently accurate predictions, but its predictions worsen

    as the target loops are less stretched. FREAD produces more accurate predictions than

    MODELLER in general. However its predictions also begin to disperse as the loops become

    more contracted (Fig. 6B). FREAD generates candidate loops based on anchor matches

    and sequence similarity for a given loop target. This may imply that contracted loops tend

    to have multiple structural conformations or stringent sequence identity is required to

    predict such highly contracted loops. It should be noted that FREAD is not able to predict

    all the target loops due to the incompleteness of the structure database it uses (Fig. 6C).

    In order to further assess the effect of loop stretch in loop structure prediction,

    MODELLER was re-examined on a second set. The second test set consists of loops from

    6 to 10 residues in length. In this set, for each number of residues, the same numbers

    of loops (See Materials and Methods) were selected for both contracted (λ < 0.4) andfully stretched loops (λ > 0.95). MODELLER produces consistently accurate results for

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 10/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • Figure 6 Protein loop structure prediction and loop stretch. Accuracy of protein loop structure prediction methods do not only depend on thenumber of residues, but also on loop stretch. MODELLER (A) and FREAD (B) both show accurate results when the target loop is stretched on thefirst set (including loops of 8 residues in length only). MODELLER shows worse prediction as loop stretch decreases whereas FREAD gives consistentaccuracy on loop stretch. However both fail to predict very contracted loops (λ < 0.4). (C) The coverage of FREAD predictions in terms of loopstretch. (D) The second test set (contracted (λ < 0.4) and stretched (λ > 0.95) loops). The test loops are also split by the number of residues. Forfully stretched loops (λ > 0.95), regardless of the number of residues, MODELLER predicts accurately.

    fully stretched loops regardless of the number of residues, but fails to accurately predict

    contracted loops (Fig. 6D).

    We calculated the partial correlations (Spearman’s rank correlation) between accuracy,

    and the number of residues and loop stretch on the second test set so as to investigate what

    affects the prediction accuracy more (the number of residues or loop stretch). The partial

    correlation between loop stretch and RMSD is larger than that between the number of

    residues and RMSD (−0.465 and 0.367 respectively). Loop stretch, just like the number of

    residues is something that can be calculated without knowledge of loop conformation and

    therefore can be used in the design of loop structure prediction software.

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 11/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1

  • DISCUSSIONIn this paper, we focus on a specific local property (span) and demonstrate that the modes

    of loop span distribution appear to be independent of the number of residues. Loop

    span shows a distinct frequency distribution which does not depend on anchor types or

    protein classes. From these observations, we hypothesised that loop span is independent

    of the other effects and showed how the loop span distribution appears to correspond to a

    truncated Maxwell–Boltzmann distribution.

    The reason behind the independence of loop span from the number of loop residues

    or secondary structure type is not known. The fact that the loop span distribution can be

    captured by a simple Maxwell–Boltzmann model allows one to speculate that protein loop

    structure prediction is indeed a local mini protein folding problem.

    ADDITIONAL INFORMATION AND DECLARATIONS

    FundingYoonjoo Choi was funded by the Department of Statistics, St. Cross College and the

    University of Oxford. Sumeet Agarwal was funded by the Clarendon Fund of the University

    of Oxford. The funders had no role in study design, data collection and analysis, decision to

    publish, or preparation of the manuscript.

    Grant DisclosuresThe following grant information was disclosed by the authors:

    Department of Statistics, St. Cross College and the University of Oxford.

    Clarendon Fund of the University of Oxford.

    Competing InterestsProfessor Charlotte M. Deane is an editor of PeerJ.

    Author Contributions• Yoonjoo Choi conceived and designed the experiments, performed the experiments,

    analyzed the data, contributed reagents/materials/analysis tools, wrote the paper.

    • Sumeet Agarwal performed the experiments, analyzed the data, contributed

    reagents/materials/analysis tools.

    • Charlotte M. Deane conceived and designed the experiments, wrote the paper.

    Supplemental InformationSupplemental information for this article can be found online at http://dx.doi.org/

    10.7717/peerj.1.

    REFERENCESAl-Lazikani B, Lesk AM, Chothia C. 1998. Standard conformations for the canonical structures of

    immunoglobulins. Journal of Molecular Biology 273:927–948 DOI ./jmbi...

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 12/15

    https://peerj.comhttp://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.7717/peerj.1http://dx.doi.org/10.1006/jmbi.1997.1354http://dx.doi.org/10.7717/peerj.1

  • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. GappedBLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic AcidsResearch 25:3389–3402 DOI ./nar/...

    Anfinsen CB. 1973. Principles that govern the folding of protein chains. Science 181:223–230DOI ./science....

    Burke DF, Deane CM, Blundell TL. 2000. Browsing the SLoop database of structurally classifiedloops connecting elements of protein secondary structure. Bioinformatics 16:513–519DOI ./bioinformatics/...

    Choi Y, Deane CM. 2010. FREAD revisited: accurate loop structure prediction using a databasesearch algorithm. Proteins 78:1431–1440 DOI ./prot..

    Choi Y, Deane CM. 2011. Predicting antibody complementarity determining region structureswithout classification. Molecular Biosystems 7:3327–3334 DOI ./cmbc.

    Chothia C, Lesk AM. 1987. Canonical structures for the hypervariable regions ofimmunoglobulins. Journal of Molecular Biology 196:901–917 DOI ./-()-.

    Chothia C, Lesk AM, Tramontano A, Levitt M, Smith-Gill SJ, Air G, Sheriff S,Padlan EA, Davies D, Tulip WR, Colman PM, Spinelli S, Alzari PM, Poljak RJ. 1989.Conformations of immunoglobulin hypervariable regions. Nature 342:877–883DOI ./a.

    de Bakker PI, DePristo MA, Burke DF, Blundell TL. 2003. Ab initio construction of polypeptidefragments: accuracy of loop decoy discrimination by an all-atom statistical potential andthe AMBER force field with the Generalized Born solvation model. Proteins 51:21–40DOI ./prot..

    Donate LE, Rufino SD, Canard LH, Blundell TL. 1996. Conformational analysis and clustering ofshort and medium size loops connecting regular secondary structures: a database for modelingand prediction. Protein Science 5:2600–2616 DOI ./pro..

    Espadaler J, Fernandez-Fuentes N, Hermoso A, Querol E, Aviles FX, Sternberg MJE, Oliva B.2004. ArchDB: automated protein loop classification as a tool for structural genomics. NucleicAcids Research 32:D185–D188 DOI ./nar/gkh.

    Fernandez-Fuentes N, Fiser A. 2006. Saturating representation of loop conformational fragmentsin structure databanks. BMC Structural Biology 6: DOI ./---.

    Fernandez-Fuentes N, Oliva B, Fiser A. 2006. A supersecondary structure library and searchalgorithm for modeling loops in protein structures. Nucleic Acids Research 34:2085–2097DOI ./nar/gkl.

    Fiser A, Do RK, Sali A. 2000. Modeling of loops in protein structures. Protein Science 9:1753–1773DOI ./ps....

    Flory P. 1998. Statistical mechanics of chain molecules. Hanser.

    Hildebrand PW, Goede A, Bauer RA, Gruening B, Ismer J, Michalsky E, Preissner R. 2009.SuperLooper – a prediction server for the modeling of loops in globular and membraneproteins. Nucleic Acids Research 37:W571–W574 DOI ./nar/gkp.

    Jacobson MP, Pincus DL, Rapp CS, Day TJ, Honig B, Shaw DE, Friesner RA. 2004. A hierarchicalapproach to all-atom protein loop prediction. Proteins 55:351–367 DOI ./prot..

    Karen AR, Weigelt CA, Nayeem A, Krystek SR Jr. 2007. Loopholes and missing links in proteinmodeling. Protein Science 16:1–14 DOI ./ps..

    Kwasigroch KM, Chomilier J, Mornon JP. 1996. A global taxonomy of loops in globular proteins.Journal of Molecular Biology 259:855–872 DOI ./jmbi...

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 13/15

    https://peerj.comhttp://dx.doi.org/10.1093/nar/25.17.3389http://dx.doi.org/10.1126/science.181.4096.223http://dx.doi.org/10.1093/bioinformatics/16.6.513http://dx.doi.org/10.1002/prot.22658http://dx.doi.org/10.1039/c1mb05223chttp://dx.doi.org/10.1016/0022-2836(87)90412-8http://dx.doi.org/10.1038/342877a0http://dx.doi.org/10.1002/prot.10235http://dx.doi.org/10.1002/pro.5560051223http://dx.doi.org/10.1093/nar/gkh002http://dx.doi.org/10.1186/1472-6807-1186-1115http://dx.doi.org/10.1093/nar/gkl156http://dx.doi.org/10.1110/ps.9.9.1753http://dx.doi.org/10.1093/nar/gkp338http://dx.doi.org/10.1002/prot.10613http://dx.doi.org/10.1110/ps.072887807http://dx.doi.org/10.1006/jmbi.1996.0363http://dx.doi.org/10.7717/peerj.1

  • Leszczynski JF, Rose GD. 1986. Loops in globular proteins: a novel category of secondarystructure. Science 234:849–855 DOI ./science..

    Mandell DJ, Coutsias EA, Kortemme T. 2009. Sub-angstrom accuracy in protein loopreconstruction by robotics-inspired conformational sampling. Nature Methods 6:551–552DOI ./nmeth-.

    Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP. 1998. JOY: proteinsequence-structure representation and analysis. Bioinformatics 14:617–623 DOI ./bioin-formatics/...

    Nagi AD, Regan L. 1997. An inverse correlation between loop length and stability in afour-helix-bundle protein. Folding and Design 2:67–75 DOI ./S-()-.

    Oliva B, Bates PA, Querol E, Aviles FX, Sternberg MJE. 1997. An automated classficationof the structure of protein loops. Journal of Molecular Biology 266:814–830DOI ./jmbi...

    Pardon E, Haezebrouck P, De Baetselier A, Hooke SD, Fancourt KT, Dobson JDCM, Dael HV,Joniau M. 1995. A Ca(2+)-binding chimera of human lysozyme and bovine alpha-lactalbuminthat can form a molten globule. Journal of Biological Chemistry 270:10514–10524DOI ./jbc....

    Peng H, Yang A. 2007. Modeling protein loos with knowledge-based prediction of sequence-structure alignment. Bioinformatics 23:2836–2842 DOI ./bioinformatics/btm.

    Queen C, Schneider WP, Selick HE, Payne PW, Landolfi NF, Duncan JF,Avdalovic NM, Levitt M, Junghans RP, Waldmann TA. 1989. A humanized antibodythat binds to the interleukin 2 receptor. Proceedings of the National Academy of Sciences USA86:10029–10033 DOI ./pnas....

    Richardson JS. 1981. The anatomy and taxonomy of protein structure. Advances in ProteinChemistry 34:167–339 DOI ./S-()-.

    Riechmann L, Clark M, Waldmann H, Winter G. 1988. Reshaping human antibodies for therapy.Nature 332:323–327 DOI ./a.

    Ring CS, Kneller DG, Langridge R, Cohen FE. 1992. Taxonomy and conformational analysis ofloops in proteins. Journal of Molecular Biology 224:685–699 DOI ./-()-V.

    Scott KA, Bond PJ, Ivetac A, Chetwynd AP, Khalid S, Sansom MSP. 2008. Coarse-GrainedMD simulations of membrane protein-bilayer self-assembly. Structure 16:621–630DOI ./j.str....

    Shen MY, Sali A. 2006. Statistical potential for assessment and prediction of protein structures.Protein Science 15:2507–2524 DOI ./ps..

    Sibanda BL, Thorton JM. 1985. Beta-hairpin families in globular proteins. Nature 316:170–174DOI ./a.

    Soto CS, Fasnacht M, Zhu J, Forrest L, Honig B. 2008. Loop modeling: sampling, filtering, andscoring. Proteins 70:834–843 DOI ./prot..

    Tastan O, Klein-Seetharaman J, Meirovitch H. 2009. The effect of loops on the structuralorganization of-helical membrane proteins. Biophysical Journal 96:2299–2312DOI ./j.bpj....

    Toma S, Campagnoli S, Margarit I, Gianna R, Grandi G, Bolognesi M, De Filippis V, Fontana A.1991. Grafting of a calcium-binding loop of thermolysin to Bacillus subtilis neutral protease.Biochemistry 30:97–106 DOI ./bia.

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 14/15

    https://peerj.comhttp://dx.doi.org/10.1126/science.3775366http://dx.doi.org/10.1038/nmeth0809-551http://dx.doi.org/10.1093/bioinformatics/14.7.617http://dx.doi.org/10.1093/bioinformatics/14.7.617http://dx.doi.org/10.1016/S1359-0278(97)00007-2http://dx.doi.org/10.1006/jmbi.1996.0819http://dx.doi.org/10.1074/jbc.270.18.10514http://dx.doi.org/10.1093/bioinformatics/btm456http://dx.doi.org/10.1073/pnas.86.24.10029http://dx.doi.org/10.1016/S0065-3233(08)60520-3http://dx.doi.org/10.1038/332323a0http://dx.doi.org/10.1016/0022-2836(92)90553-Vhttp://dx.doi.org/10.1016/j.str.2008.01.014http://dx.doi.org/10.1110/ps.062416606http://dx.doi.org/10.1038/316170a0http://dx.doi.org/10.1002/prot.21612http://dx.doi.org/10.1016/j.bpj.2008.12.3894http://dx.doi.org/10.1021/bi00215a015http://dx.doi.org/10.7717/peerj.1

  • Tusnady GE, Dosztanyi ZD, Simon I. 2004. Transmembrane proteins in the Protein Data Bank:identification and classification. Bioinformatics 20:2964–2972 DOI ./bioinformat-ics/bth.

    Vanhee P, Verschueren E, Baeten L, Stricher F, Serrano L, Rousseau F, Schymkowitz J. 2011.BriX: a database of protein building blocks for structural analysis, modeling and design. NucleicAcids Research 39:D435–D442 DOI ./nar/gkq.

    Verschueren E, Vanhee P, van der Sloot AM, Serrano L, Rousseau F, Schymkowitz J. 2011.Protein design with fragment databases. Current Opinion in Structural Biology 21:452–459DOI ./j.sbi....

    Wang G, Dunbrack RL Jr. 2005. PISCES: recent improvements to a PDB sequence culling server.Nucleic Acids Research 33:W94–W98 DOI ./nar/gki.

    Wojcik J, Mornon JP, Chomilier J. 1999. New efficient statistical sequence-dependent structureprediction of short to medium-sized protein loops based on an exhaustive loop classification.Journal of Molecular Biology 289:1469–1490 DOI ./jmbi...

    Wolfson AJ, Kanaoka M, Lau FT, Ringe D. 1991. Insertion of an elastase-binding loop intointerleukin-1 beta. Protein Engineering 4:313–317 DOI ./protein/...

    Xiang Z. 2006. Advances in homology protein structure modeling. Current Protein & PeptideScience 7:217–227 DOI ./.

    Choi et al. (2013), PeerJ, DOI 10.7717/peerj.1 15/15

    https://peerj.comhttp://dx.doi.org/10.1093/bioinformatics/bth340http://dx.doi.org/10.1093/bioinformatics/bth340http://dx.doi.org/10.1093/nar/gkq972http://dx.doi.org/10.1016/j.sbi.2011.05.002http://dx.doi.org/10.1093/nar/gki402http://dx.doi.org/10.1006/jmbi.1999.2826http://dx.doi.org/10.1093/protein/4.3.313http://dx.doi.org/10.2174/138920306777452312http://dx.doi.org/10.7717/peerj.1

    How long is a piece of loop?IntroductionMaterials and MethodsLoop definitionMembrane protein structuresSoluble protein structuresLoop span and loop stretch

    Protein Structure Prediction and Loop StretchLoop modelling test setsMODELLER settingFREAD setting

    ResultsNomenclatureLoop span distributionMaxwell--Boltzmann distribution for loop spanProtein structure prediction and loop stretch

    DiscussionReferences