protein structure 2

Protein Structural Bioinformatics

Definition: The subdiscipline of bioinformatics that focuses on the representation, storage, retrieval, analysis, and display of structural information at the atomic and subcellular spatial scales. (From Structural Bioinformatics, P.E. Bourne & H. Weissig (eds.), John Wiley & Sons, Inc., 2003, p. 4.)

Why is STRUCTURAL bioinformatics important? Because a protein’s function is determined by its structure. Knowledge of a protein’s structure is necessary in order to gain a full understanding of the biological role of a protein.

Bioinformatics methods can be used to analyze protein structural data in the following ways:

• Visualization of protein structures

• Alignment of protein structures

• Classification of proteins into families, based on similarity of their structures

• Prediction of protein structures

• Simulation of protein folding and dynamic motions

Protein structure determination by X-ray crystallography or NMR is difficult (see PowerPoint slides from last module).

It takes 1-3 years to solve a protein structure by these methods. Certain proteins, such as membrane proteins, are extremely difficult or impossible to solve by these methods. Due to genomic sequencing efforts, the gap between known protein sequences and known protein structures is increasing: only about 3,000 unique protein structures have been determined, but over 1 million unique sequences have been determined.

Therefore, it is necessary to use bioinformatics methods to predict the structures of proteins for which a crystal structure or NMR structure has not been determined.

Bioinformatics methods can predict:

(1) secondary structural elements in a protein sequence

(2) the tertiary structure of the entire sequence

(3) “special” structures, such as transmembrane α-helices, transmembrane β-barrels, coiled coils, and leucine zippers

Protein Secondary Structure Prediction

All secondary structure prediction is based on the assumption that there should be a correlation between amino acid sequence and secondary structure. In other words, it is assumed that certain stretches of amino acids are more likely to form one type of secondary structure than another.

During secondary structure prediction, the conformational state of each residue in a protein sequence is predicted; generally each residue is predicted as having one of three possible states:

(1) α-helical structure

(2) β-strand

(3) “other” (β-turn, loop, or random coil)

Sometimes β-turn is separated as a 4th state.

Why is prediction of secondary structure useful? It can help guide sequence alignment or improve existing sequence alignment of distantly related sequences. It is also an intermediate step in some methods for tertiary structure prediction.

Methods of secondary structure prediction fall into two broad classes:

Ab initio methods: predict secondary structure based solely on protein sequence. These methods compute statistics for the residues that occur in different secondary structural elements in proteins with known structures, in order to identify “patterns” in the types of residues that occur in a given type of secondary structure.

Homology-based methods: make use of multiple sequence alignments of homologous proteins to predict secondary structure. These methods are able to locate conserved patterns that are characteristic of particular secondary structural elements across the aligned family members.

Certain amino acids are observed more frequently than others in α-helices, β-strands, and β-turns in crystal structures (see Figure). This leads to the idea that each amino acid tends to “prefer” being constrained in a certain type of secondary structure, or has an “intrinsic propensity” to adopt that secondary structure.

Fig. 4-10 from Lehninger Principles of Biochemistry, 4th ed.

The figure shows that:

Glu, Met, Ala are most frequent in α-helices

Val, Tyr, Ile are most frequent in β-strands

Pro, Gly, Asn are most frequent in β-turns

Based on these data, Glu is considered to have a high α-helical propensity but a low β-strand propensity.

Ab initio methods of secondary structure prediction:

• These methods calculate the relative propensity (intrinsic tendency) of each amino acid in a protein sequence to belong to a certain secondary structural element.

• Propensity scores for the 20 amino acids are derived from known protein structures: these propensities are calculated from the relative frequency of a given amino acid within the proteins, its frequency in a given type of secondary structure, and the fraction of all amino acids occurring in that type of secondary structure.

• Stretches of a protein’s sequence that contain many residues with a high α-helical propensity are predicted to fold into α-helices. Stretches of sequence that contain many residues with a high β-strand propensity are predicted to fold into β-strands.

• Two examples: Chou-Fasman method, GOR method
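The propensity calculation described in the bullets above can be sketched in a few lines. This is a simplified illustration, not the published Chou-Fasman or GOR procedure: the function names, the toy "known structure," the window width, and the threshold of 1.0 are all assumptions chosen to keep the example small.

```python
from collections import Counter

def propensities(sequence, states, state):
    """Propensity of each amino acid for one secondary-structure state.

    propensity = (fraction of this amino acid that occurs in the state)
                 / (fraction of all residues that occur in the state)
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    n_in_state = states.count(state)
    if n_in_state == 0:
        raise ValueError("state does not occur in the data")
    fraction_in_state = n_in_state / len(sequence)
    aa_total = Counter(sequence)
    aa_in_state = Counter(aa for aa, s in zip(sequence, states) if s == state)
    return {aa: (aa_in_state[aa] / n) / fraction_in_state
            for aa, n in aa_total.items()}

def window_scores(sequence, props, width=3):
    """Average propensity in a window centered on each residue."""
    half = width // 2
    scores = []
    for i in range(len(sequence)):
        window = sequence[max(0, i - half): i + half + 1]
        # Residues never seen in the training data get a neutral score of 1.0.
        scores.append(sum(props.get(aa, 1.0) for aa in window) / len(window))
    return scores

# Toy "known structure" invented for illustration: H = helix, C = other.
known_sequence = "AEAEGPGP"
known_states = "HHHHCCCC"
helix_props = propensities(known_sequence, known_states, "H")

# Score a query (here the same toy sequence) and call helix where the
# window average exceeds 1.0 (a hypothetical threshold).
scores = window_scores("AEAEGPGP", helix_props, width=3)
prediction = "".join("H" if s > 1.0 else "-" for s in scores)
print(helix_props["A"], helix_props["G"])
print(prediction)
```

A propensity above 1 means the amino acid occurs in that structure more often than expected from the overall composition; a value below 1 means it occurs less often. Real methods derive these values from many known structures, not from a single toy sequence.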

Accuracy of ab initio methods:

• These methods are not very accurate:

  • Chou-Fasman method: 50%-60% accuracy

  • GOR method: 64% accuracy; drastically underpredicts β-strands

• These methods are only a little better than randomly assigning secondary structure! Known proteins consist of ~31% α-helix and ~28% β-sheet, so randomly assigning secondary structural elements to residues would result in ~30% accuracy.

• Specific problems with these methods:

  • Tend to underpredict the lengths of α-helices and β-strands, because they cannot identify the first and last residues of helices and strands very well

  • Tend to miss β-strands completely
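The random baseline quoted above can be checked with a short calculation. The sketch below assumes that per-residue (three-state) accuracy is the fraction of residues whose predicted state matches the observed state, and that the "other" fraction is whatever remains after helix and sheet; the function names are illustrative.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)

def random_baseline(state_fractions):
    """Expected accuracy when states are assigned at random with the same
    frequencies at which they occur: the sum of the squared fractions."""
    return sum(f * f for f in state_fractions.values())

# Helix and sheet fractions are taken from the text; "other" is assumed
# to be the remainder.
fractions = {"helix": 0.31, "sheet": 0.28, "other": 1.0 - 0.31 - 0.28}
print(round(random_baseline(fractions), 3))   # roughly one third
print(q3("HHHH", "HHEE"))
```

Under these assumptions the random baseline comes out at roughly one third, in line with the figure of about 30% given above.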

A few homology-based secondary (2°) structure prediction methods:

Neural network methods:

• PROFsec (an improved version of PHDsec) http://www.predictprotein.org/

• PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/

• SSpro (newest version is 4.0) http://scratch.proteomics.ics.uci.edu/

• SAM-T (SAM-T08 is the newest version; SAM-T06, SAM-T02, and SAM-T99 are old versions) http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html

Nearest-neighbor methods:

• NNSSP (no longer available online)

• PREDATOR http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::predator

HMM methods:

• HMMSTR http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

A few methods for predicting transmembrane α-helices:

• TMHMM http://www.cbs.dtu.dk/services/TMHMM/

• HMMTOP http://www.enzim.hu/hmmtop/index.html

• Phobius (also predicts presence of signal peptides) http://phobius.sbc.su.se/

• TopPred http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::toppred

• PRED-TMR http://athina.biol.uoa.gr/PRED-TMR/

• DAS http://mendel.imp.ac.at/sat/DAS/DAS.html

• TMpred http://www.ch.embnet.org/software/TMPRED_form.html

• MEMSAT http://bioinf.cs.ucl.ac.uk/psipred/

Accuracies of the methods:

Levels of accuracy are reported by the developers to be in the range of 75-95%.

At least one study (2001) found TMHMM to be the best-performing program.

It is best to use several methods and compare the results to arrive at a consensus prediction. When different methods, specifically methods that are based on different algorithms, give similar results, the reliability of the results is higher.
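One simple way to combine several predictions is a per-residue majority vote. The sketch below assumes each method's output has already been converted to a string of one-letter states of the same length; marking ties with "?" is an arbitrary choice made for this example, and the three input strings are invented.

```python
from collections import Counter

def consensus(predictions, tie_char="?"):
    """Per-residue majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        # If the two most common states are equally frequent, report a tie.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie_char)
        else:
            result.append(ranked[0][0])
    return "".join(result)

# Invented outputs from three hypothetical methods (H, E, C states).
print(consensus(["HHHEC", "HHEEC", "HCEEC"]))
```

Positions where the methods disagree (ties, or a bare majority) are the ones that deserve a closer look by eye.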

Tertiary structure prediction methods fall into three classes:

(1) Homology modeling (also called comparative modeling): A structure is built based on the known structure of another protein that is similar in sequence (a homolog).

(2) Threading (also called structural fold recognition): A structure is predicted for a protein by “threading” its sequence through a variety of known structures to determine which structure the sequence best fits.

(3) Ab initio prediction (also called de novo prediction): A structure is predicted based only on the amino acid sequence of the protein, using the physicochemical properties of its residues and the principles governing protein folding.

Homology modeling for tertiary structure prediction:

Homology modeling is based on the idea that if two proteins share a high degree of sequence similarity (i.e., they are close homologs), they are likely to have very similar 3D structures. In general, proteins that share >30% sequence identity are likely to be quite similar in structure.

Therefore, if a protein of unknown structure is similar in sequence to a protein of known structure, the known structure can be used as a template to which the unknown sequence is fit. The structure that is built for the unknown sequence is then called a homology model for the structure of that sequence.

The “safe homology modeling zone,” above the gray curve, is the region where two proteins are likely to have the same structure.

Fig. 5 from R. Nair & B. Rost, Protein Science (2002) 11: 2836-47.

Steps in homology modeling for tertiary structure prediction:

The protein of unknown structure for which a structural model is to be built will be called the “target sequence.”

1. Template selection: Identify protein(s) in the PDB that are homologous to the target sequence using BLAST or PSI-BLAST. If a close homolog with known structure is found, its structure will serve as a template to which the target sequence will be matched. The template should have at least 30% sequence identity with the target. (Proteins that share less than 30% sequence identity may not be similar enough in structure to carry out homology modeling.) If PSI-BLAST does not identify a suitable template, it will probably be necessary to construct a structural model by threading.

It is possible to use multiple templates if more than one good template is identified. When multiple templates are available, it is best to use more than one template to avoid biasing the model toward a single protein. The template used in the next step of homology modeling will then be an averaged structure based on all of the chosen templates.


2. Sequence alignment: Construct a multiple sequence alignment of the target, the template, and other homologous sequences. It is actually the alignment of the target and template that is of interest, but the inclusion of other homologs provides more information, helping to ensure that the best alignment of homologous residues is achieved. The quality of the target-template alignment is critical for constructing an accurate structural model for the target. If a given residue in the target is not aligned with the proper residue in the template, the error cannot be corrected in later steps of model building. A robust multiple sequence alignment program should be used for this step, and the resulting alignment should be very carefully examined and manually refined if necessary.


3. Backbone model building: Residues in the aligned regions of the target and template are assumed to adopt the same structure. Therefore, the backbone atoms of these residues in the target can be placed in the same 3D location as the backbone atoms of these residues in the template. See the alignment below as an example.

Target:   ...FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF...
Template: ...FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF...

For these residues, backbone atoms of the target are assumed to occupy the same 3D location as those of the template.

F aligned with F. They are identical, so all atoms of target F will overlap the 3D positions of all atoms of template F.

D aligned with E. They are not identical, but their backbone atoms can be assumed to occupy the same 3D position. So backbone atoms of target D will overlap the 3D positions of backbone atoms of template E.
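The sequence identity of a target-template alignment such as the one above can be counted directly from the aligned strings. This is a minimal sketch: the function name is illustrative, and it assumes identity is measured over columns where neither sequence has a gap (other definitions use a different denominator). The fragment is short, so the resulting percentage is only illustrative.

```python
def percent_identity(target, template, gap="-"):
    """Return (identical, aligned, percent) for two aligned sequences.

    Only columns where neither sequence has a gap count as aligned.
    """
    if len(target) != len(template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for t, m in zip(target, template):
        if t == gap or m == gap:
            continue
        aligned += 1
        if t == m:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned columns")
    return identical, aligned, 100.0 * identical / aligned

# The alignment fragment from the example above.
target = "FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF"
template = "FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF"

identical, aligned, percent = percent_identity(target, template)
print(identical, aligned, round(percent, 1))
```

The aligned, non-gap columns are exactly the positions where target backbone atoms are copied from the template.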

4. Loop building: There are likely to be regions in the alignment where gaps appear because the target sequence does not match the template. The target sequence residues in these gap regions are assumed to form a loop that is not present in the template structure. The structure of this loop can be built using several different methods. In any case, it is a difficult problem since the template provides no information to guide the building of the loop structure.


“Extra” residues in the target sequence (the target loop) do not match the template and are assumed to form a loop.
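The target residues that must be built as loops can be read off the alignment: they are the runs of target residues that sit opposite gaps in the template. The sketch below uses the alignment fragment from the backbone-building step; the function name and the 0-based position convention are assumptions made for the example.

```python
def target_insertions(target, template, gap="-"):
    """List (start, residues) for each run of target residues that is
    aligned to gaps in the template. `start` is the 0-based position of
    the run within the ungapped target sequence."""
    loops = []
    run = ""
    run_start = None
    target_pos = 0
    for t, m in zip(target, template):
        if m == gap and t != gap:
            if not run:
                run_start = target_pos
            run += t
        elif run:
            loops.append((run_start, run))
            run = ""
        if t != gap:
            target_pos += 1
    if run:
        loops.append((run_start, run))
    return loops

target = "FKSQAAIHEAYCNFHYKVTAAASRTPEIDFDVHFSSIF"
template = "FKQQANIHCAYCNGAYKIG-------GKELQVHFSWLF"
print(target_insertions(target, template))
```

For this fragment the only insertion is the seven-residue stretch AAASRTP, which has no counterpart in the template and therefore has to be modeled as a loop.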


5. Side chain addition: The side chains are added to the backbone structure. Each side chain could potentially have many possible conformations due to bond rotation, but steric clashes with neighboring atoms are not allowed. Therefore, the side-chain conformations (rotamers) that have the lowest interaction energy with nearby atoms are chosen.


Target and template are both F, so all atoms of the target side chain can be modeled as having the same 3D positions as the template side chain, at least initially. (Small changes in position may be necessary in later refinement steps.)

Target and template have different side chains (D vs. E), so the side chain rotamer that is chosen for the target D must not overlap/clash with any neighboring atoms.


6. Model refinement: Unfavorable bond angles, bond lengths, and atom contacts are likely to exist in the preliminary model, so an energy minimization procedure is applied to refine the model. In this procedure, atom positions are shifted so that the overall conformation of the entire structure has the lowest potential energy. Only limited energy minimization should be applied (a few hundred iterations) so that major errors are removed but residues are not moved from their correct positions.

7. Model evaluation: The model is checked for anomalies in dihedral angles, bond lengths, and atom contacts.
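As an illustration of one quantity checked during model evaluation, a dihedral angle can be computed from the coordinates of four consecutive atoms. This is a minimal sketch in plain Python: the function name and the test coordinates are invented for the example, and evaluation programs go further by comparing the resulting angles with the ranges observed in known structures.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees, -180 to 180) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Invented coordinates: first and last atoms on the same side (cis)
# or on opposite sides (trans) of the central bond.
cis = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(round(abs(cis)), round(abs(trans)))
```

Applied to the backbone atoms of consecutive residues, the same function gives the angles that are inspected when looking for unreasonable dihedral values.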

Programs for homology modeling:

Many programs for automated homology modeling are now available, so anyone can construct a homology model on a regular PC. However, construction of a “good” homology model (at least for sequences that are not highly similar) usually requires some expertise and usually should be done with human intervention, rather than in a fully automated fashion.

A few of the freely available programs for homology modeling:

• SWISS-MODEL: Produces accurate models; fast; good tutorials available. http://swissmodel.expasy.org/

• I-TASSER: Produces accurate models; easy to use, but slow. http://zhanglab.ccmb.med.umich.edu/I-TASSER/

• Modeller: Must be downloaded and installed locally. http://salilab.org/modeller/modeller.html

• WHAT IF: http://swift.cmbi.ru.nl/servers/html/index.html http://swift.cmbi.ru.nl/whatif/

Is a homology model CORRECT?

Since the actual (experimentally determined) structure of the target is not known, there is no way to say whether or not the homology model is “correct.” Instead, the best a researcher can do is compare the homology model to the structure of the template from which it was derived. If the atom positions in the model do not deviate very much from those of the template, the homology model is said to be “accurate.” The greater the deviation between model and template, the lower the accuracy of the model.
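A common way to express how far atom positions deviate between two structures is the root-mean-square deviation (RMSD). The sketch below assumes the two structures have already been superimposed and that the atoms are listed in matching order; the coordinates are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (same units as the input, e.g. Å)
    between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / len(coords_a))

# Invented coordinates: every "model" atom is shifted 1 Å along z.
template_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_atoms, template_atoms))
```

A small RMSD means the atom positions in the model stay close to those of the template; a large value means greater deviation.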

When is a homology model definitely INCORRECT?

A homology model has regions that are incorrect if it contains structural features that do not occur in native proteins, such as:

• Hydrophobic side chains on the surface of the model (these side chains should be buried)

• Unreasonable bond lengths or angles

• Unfavorable noncovalent contacts between atoms (clashes)

• Unreasonable dihedral angles

Accuracy of homology modeling:

The template selection and alignment accuracy are crucial to the accuracy of a homology model. The accuracy of the model depends on the percentage of sequence identity between the target and template. The average coordinate agreement between the modeled structure and the actual structure drops ~0.3 Å for each 10% reduction in sequence identity.

The largest structural differences between homologous proteins are in surface loops. In other words, the structure of the protein core is more highly conserved. Therefore, the regions that are most likely to be in error in a homology model are the surface loops.

High-accuracy homology models can be built when the target and template have 50% or greater sequence identity. Errors are mostly mistakes in side-chain packing, small shifts of the core backbone regions, and occasionally larger errors in loops.

Medium-accuracy homology models can be built when the proteins share 30-50% sequence identity. There can be alignment mistakes, and there are more frequent side-chain packing, core distortion, and loop modeling errors.

Low-accuracy homology models are based on proteins that share <30% sequence identity. If a model is based on an almost insignificant alignment to a known structure, the model may have an entirely incorrect fold.
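The rules of thumb in this section can be written down as a small lookup. The thresholds (50% and 30% sequence identity) and the figure of ~0.3 Å per 10% identity are taken from the text above; the function names are illustrative, and the result is a rough expectation, not a measurement.

```python
def expected_accuracy(percent_identity):
    """Expected accuracy class of a homology model, given the
    target-template sequence identity in percent."""
    if percent_identity >= 50:
        return "high"
    if percent_identity >= 30:
        return "medium"
    return "low"

def extra_deviation(identity_from, identity_to, per_10_percent=0.3):
    """Approximate additional coordinate deviation (Å) expected when the
    sequence identity drops from identity_from to identity_to."""
    return per_10_percent * (identity_from - identity_to) / 10.0

print(expected_accuracy(65), expected_accuracy(40), expected_accuracy(22))
print(round(extra_deviation(60, 40), 2))
```

For example, a template at 40% identity falls in the medium-accuracy range, and moving from a 60% to a 40% template would be expected to cost roughly 0.6 Å in coordinate agreement.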

The best model-building programs will produce models of similar accuracy, provided that the methods are used optimally.

Stephen [email protected]
