fssp uses dali alignments to classify structures all pdb entries representative set of structures...

28
FSSP uses DALI alignments to classify structures all PDB entries representative set of structures representative set of domains group domains into fold types (clusters of similar structures) eliminate similar sequences divide into domains align domains with DALI! 8320 947 1484 540

Post on 20-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

FSSP uses DALI alignments to classify structures

all PDB entries

representative set of structures

representative set of domains

group domains into fold types

(clusters of similar structures)

and make set of representatives of each fold

eliminate similar sequences

divide into domains

align domains with DALI!

8320

947

1484

540

Judging DALI alignments• Z-score: how much better than average is the alignment,

i.e. how many standard deviations from the mean of a distribution of alignments of random pairs of proteins.

>16 very close, 8-16 pretty close, <8 not so close.

• RMSD: root mean square deviation of alpha carbons for the matching portion of the structures.

• LALI: length of alignment (recognizably matching portion of the structures)

• LSEQ2: total length of the sequence being matched.

• %IDE: % sequence identity between the two sequences

if you go into FSSP, and search for a particular structure, you’ll get an output of its best DALI alignments with other structures

STRID2 Z RMSD LALI LSEQ2 %IDE PROTEIN

1plc 24.4 0.0 99 99 100 Plastocyanin (cu2+, ph 6.0)

2pcy 23.4 0.2 99 99 100 Apo-plastocyanin (pH 6.0)

1bqk 12.1 2.0 89 124 29 pseudoazurin

1aac 11.0 1.9 84 104 24 amicyanin

1ibzA 9.1 2.5 83 111 19 nitrosocyanin

1qhqA 8.3 2.4 87 139 29 auracyanin

1rcy 8.2 2.5 90 151 17 rusticyanin biological_unit

1qniA 7.7 2.2 78 572 19 nitrous-oxide reductase

1kcw 7.1 2.4 81 1017 17 ceruloplasmin biological_unit

2cuaA 7.0 2.2 80 122 15 cua fragment

1nwpA 6.7 3.1 85 128 24 azurin

Divergence of sequence vs. divergence of structure

• one of the things that structural classification systems allow us to do is to assess how structure changes as the sequences of related proteins diverge.

• as we will see, this is extremely useful information when trying to predict structure from sequence.

Structures of closely related sequences (>50% identity)

• at this level, usually >90% of residues retain a qualitatively similar structure (common core)

• 90% of the identical residues would have similar sidechain conformations. 50% of the nonidentical residues would also show some similarity in sidechain conformation.

• Backbone atoms for common core would have 1.0 Å RMSD or less.

More distant relatives: below 25% sequence identity

plastocyanin azurin

common core of 7-stranded beta-sandwich is 50-60% of the total residues. Sequence identity within this common core is ~20%. RMSD of common core residues is ~2.2Å.

both havetwo-histidine copper bindingsite

so RMSD between the common core of two structures (portion that retains the same qualitative fold) varies as a

function of sequence identity...

...as does the percentage of the total residues which belong to the common core...

why two points per pair of proteins?

Homology modelling

• the relationship between divergence of sequence and structure forms the basis for homology modelling--model the structure of a protein based on its sequence similarity

to protein(s) of known structure. • homology modelling is the only consistently accurate

structure prediction method known, but...

• its accuracy depends critically on the similarity between the sequences--best when they are >50% identical, not very useful below about 30%. (Although if there are multiple templates one can do a little better)

Homology modelling and structural genomics

• homology modelling is one of the stated justifications for many structural genomics projects.

• structural genomics projects take several forms. Some of them try to solve as many structures as possible for a given model organism, and hope that this will yield information on the structures of related proteins in other organisms.

• another approach is to choose target structures to solve based on how many sequences can be homology modelled from them.

• another approach is to solve structures for sequences which are deemed least likely to have a fold which is already known.

Choosing the template(s)

• the target sequence (sequence to be modelled) is used in a sequence similarity search of the PDB to determine if it has recognizable similarity to sequences for which a structure is known.

• if it does, then sequence alignments are constructed between the sequence to be modelled and the template sequences (sequence similarity searches and sequence alignments are an interesting subject in themselves, which we can’t cover here, unfortunately)

• in the case of multidomain proteins, there may be multiple regions to model separately.

Alignment of target and templates: example of FASL receptor precursor

Running pair-wise alignments with target sequence

Sequence identity of templates with target:

11DDF.pdb: 100 % identity

11CDF.pdb: 37.33 % identity

11EXT.pdb: 26.4 % identity

21EXT.pdb: 26.4 % identity

21NCF.pdb: 26.4 % identity

21TNR.pdb: 26.4 % identity

11NCF.pdb: 26.4 % identity

one exact match:has the structure alreadybeen solved?

these five are different chainswithin the same crystal structure orthe same protein boundto different ligands

Alignment of target and templates

Looking for template groups

Global alignment overview:

Taget Sequence: |=======================================================|

11DDF.pdb | -------------------

11CDF.pdb | ---------------

11EXT.pdb | -------------

21EXT.pdb | -------------

21NCF.pdb | -------------

21TNR.pdb | -------------

11NCF.pdb | -------------

AlignMaster found 2 regions to model separately:

1: Using template(s) 11DDF.pdb

2: Using template(s) 11CDF.pdb 11EXT.pdb 11NCF.pdb 21EXT.pdb 21NCF.pdb 21TNR.pdb

this was ourexact match:intracellularDEATH domainhas been solved

six matches for extracellularligand-binding domain

Target-Template Sequence Alignment

• part of alignment to go here

Align the structures of the templates

so here are oursix ligand-bindingdomain templatestructures, aligned

the more similar regions of the sequence are used to establishstructurally equivalent residues in the set of templates, and these correspondences are then used to superpose the protein structures

the structures areless similar in some places thanothers-->loops

Generating a framework for the target

1. An initial set of coordinates is generated for the target based a local similarity weighted average of the template structures.

2. In other words, the target may be more similar to one of the templates than to the others over a particular range of sequence, so this template will be favored in generating the framework over that range. For instance, if your target has a lysine at a certain position in the alignment, and one of the templates also has a lysine at that position, but none of the others do, it will essentially use that template to model the target lysine sidechain conformation. The bottom line is that the conformations of target sidechains which have a corresponding identical residue in at least one of the templates will tend to mimic the conformation found in the template.

3. This averaging method for placing the atoms of the target structure may lead to some of the residues having ridiculous geometries. Those are thrown out at this stage, to be rebuilt later.

Framework generation

a frameworkfor just thetarget backboneis shown at right in yellowagainst thetemplatestructures

Loop building

• there are likely to be some segments of sequence in the target for which there is no corresponding sequence in the template(s), or for which there is such poor agreement among the templates that a reasonable framework cannot be generated. Most often these will be in loop regions.

• in these cases, the backbone of the target framework will essentially end at some position and then resume abruptly a few residues later. So you have “stems” that need to be connected somehow.

Loop libraries•what is done is to scan a database of structural fragments derived from the PDB, and look for fragments which have the right conformation to properly connect the stems without colliding with anything else in the structure. The red loop at right is discarded because it collides with part of the framework.

framework is shown in orange

Side chain building• some of the side chains will be missing or somewhat

distorted (but not enough to have been thrown out earlier) at this point.

• this is where rotamer libraries, which consist of sidechain conformations found in known protein structures, come into the picture!

• we want to use the rotamer library to correct distorted sidechains, and rebuild absent ones.

More sidechain building

• the rotamers in a rotamer library are sorted according to frequency of occurrence in known structures--some are more probable than others, and we want to take that into account.

• if some sidechains are somewhat distorted, they can be corrected by matching to a rotamer in the library, with more probable rotamers being given a higher weight.

• if sidechains are altogether missing, they can be rebuilt in the same way, with the restriction that the sidechain not have any steric clashes with other atoms.

Profiling: is our model “reasonable”?

• a method called profiling is used to assess whether a given residue is in a reasonable three-dimensional environment

• if a stretch of residues has a low profile score, that region

is probably modelled wrong. • this method can be used not only to judge the quality of

models, but also to judge the quality of experimentally determined structures as well.

• further, it can be used to predict whether a given sequence is compatible with a particular structure--this is known as fold recognition.

How profiling works

Judging the Quality of Models by Profiling

x axis would be the sequence, and the y-axis is theprofile score for each residue.

Surfaces: does our model look like a real protein?

• a solvent-accessible surface calculation (slightly different from the one we learned) can now be done to identify any cavities (holes in the middle of the protein)

• these cavities are compared with any present in the template crystal structures in order to try to uncover poorly modelled regions.

A slight digression:“Accessible” vs. “molecular” surface

note: these are alternative ways of representing the same thing:the surface which is accessible to solvent!

Usefulness of molecular surface

• molecular and accessible surfaces are both useful representations, but molecular surface is more closely related to the actual atomic surfaces. This makes it somewhat better for visualizing the texture of the outer surface, as well as for assessing the shape and volume of any internal cavities.

• you will hear the term Connolly surface used often. A Connolly surface is a particular way of calculating the molecular surface.

Energy minimization

• at the end, some molecular dynamics is done (John Rupley may teach you more about this subject), which can slightly reposition the atoms to achieve a lower energy.

• This can fix tiny little problems, such as slightly unrealistic bond angles and lengths, but it can’t fix anything bigger than that.