Methods for comparing protein structures
Protein structural classifications
How do structures and functions diverge in protein superfamilies?
What proportion of genome sequences can be predicted to belong to superfamilies of known structure?
Comparing and Classifying Domain Structures
Protein Domain Family Classifications
Known domain structures: Alexey Murzin, LMB, Cambridge
Predicted domain structures: Julian Gough, Bristol University
Known and predicted domain structures: Christine Orengo, UCL
Domain sequences: Alex Bateman, Sanger
60-80% of genes in genomes code for multidomain proteins
domains are important evolutionary units
Evolution gives rise to families of proteins (homologues)
[Figure: a domain superfamily with relatives from Th. thermophilus, human, yeast and M. tuberculosis]
Structure is more highly conserved than sequence during evolution
At least 40-50% of the structure is conserved
[Figure repeated with the label: orthologues]
[Figure repeated with the label: paralogues]
Structural diversity in the CATH Domain Family P-loop hydrolases
Cutinase Cocaine esterase Acetylcholinesterase
Structure is more highly conserved than sequence during evolution
At least 40-50% of the structure is conserved
Challenges in comparing protein structures
residue substitutions due to single base mutations
insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops
Usually the structural cores are highly conserved
Although structure is much more conserved than sequence, there can still be considerable structural differences between relatives outside the core
• residue insertions usually occur in the loops connecting secondary structures
• substitutions can cause shifts in the orientations of secondary structures
Superposition of OB fold Structures
Related structures: RMSD usually < 3.5 Å
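The RMSD quoted above is the root-mean-square distance between equivalent atoms once two structures have been superposed: the square root of the mean squared distance over the N equivalent pairs. The following is a minimal sketch, assuming the superposition has already been done and that each structure is given as a list of C-alpha (x, y, z) coordinates in the same order. The function name `rmsd` and the coordinates are hypothetical and are not taken from the slides.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between equivalent atoms of two ALREADY superposed structures.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples,
    where position i in one list is equivalent to position i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates in angstroms (illustration only).
domain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
domain_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]

print(f"RMSD = {rmsd(domain_a, domain_b):.2f} angstroms")
```

The sketch leaves out the superposition itself, which a real comparison would compute first with a least-squares fit.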
Coping with Insertions and Deletions
ignore the variable loop regions and only compare the secondary structures
use algorithms which can explicitly handle insertions/deletions, e.g. dynamic programming, simulated annealing
Fast structure comparison by secondary structures
[Figure: secondary structure elements of two domains, helices (H) and strands (E), drawn as graphs]
Graphs can be compared using the Bron-Kerbosch algorithm to find the largest common subgraph
In this example the common graph contains 5 nodes.
Generally ~1000 times faster than residue-based methods
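One common way to apply Bron-Kerbosch to this problem is to build a correspondence graph, whose nodes are pairs of same-type secondary structures and whose edges join pairs with matching relationships, and then look for its largest clique. The sketch below follows that idea under stated assumptions: the node types, the edge labels ("pack", "anti") and the function names are hypothetical, and this is not presented as the implementation used by GRATH or SSM.

```python
from itertools import combinations

def bron_kerbosch(r, p, x, adj, cliques):
    """Basic Bron-Kerbosch recursion (no pivoting): collect all maximal cliques."""
    if not p and not x:
        cliques.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}

def largest_common_subgraph(types_a, edges_a, types_b, edges_b):
    """Return the largest set of node pairings consistent between two graphs."""
    # Correspondence graph: one node per pair of same-type elements.
    pairs = [(a, b) for a in types_a for b in types_b if types_a[a] == types_b[b]]
    adj = {pair: set() for pair in pairs}
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        if a1 == a2 or b1 == b2:
            continue  # each element may be matched only once
        # Compatible when relationship a1-a2 equals b1-b2 (None means no edge).
        if edges_a.get(frozenset((a1, a2))) == edges_b.get(frozenset((b1, b2))):
            adj[(a1, b1)].add((a2, b2))
            adj[(a2, b2)].add((a1, b1))
    cliques = []
    bron_kerbosch(set(), set(pairs), set(), adj, cliques)
    return max(cliques, key=len) if cliques else set()

# Hypothetical domains: nodes are helices (H) or strands (E),
# edge labels describe how two elements relate (illustrative labels).
types_a = {1: "H", 2: "E", 3: "E", 4: "H"}
edges_a = {frozenset((1, 2)): "pack", frozenset((2, 3)): "anti", frozenset((3, 4)): "pack"}
types_b = {"a": "H", "b": "E", "c": "E"}
edges_b = {frozenset(("a", "b")): "pack", frozenset(("b", "c")): "anti"}

common = largest_common_subgraph(types_a, edges_a, types_b, edges_b)
print(len(common), sorted(common, key=str))
```

Working on a handful of secondary structure elements rather than hundreds of residues keeps the correspondence graph small, which is consistent with the speed advantage noted above.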
STRUCTAL
1. Align sequences
2. Superpose structures
3. Score distances between superposed residues in path matrix
4. Use dynamic programming to find best path
5. Use equivalences given by the best path to re-superpose the structures
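The path matrix scoring and dynamic programming steps of this procedure can be sketched as follows. This is a simplified illustration rather than STRUCTAL itself: the score function 1 / (1 + (d/d0)^2), the value of `d0`, the linear gap penalty and the function names are all assumptions, and the re-superposition step (a least-squares fit) is left out.

```python
import math

def distance_scores(coords_a, coords_b, d0=3.0):
    """Path matrix: residue pairs that lie close together score near 1.

    The functional form and d0 are illustrative assumptions.
    """
    return [[1.0 / (1.0 + (math.dist(a, b) / d0) ** 2) for b in coords_b]
            for a in coords_a]

def best_path(scores, gap=0.2):
    """Global dynamic programming over the path matrix.

    Returns the equivalent residue pairs (i, j) lying on the best path.
    """
    n, m = len(scores), len(scores[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + scores[i - 1][j - 1],  # match
                           dp[i - 1][j] - gap,                       # gap in B
                           dp[i][j - 1] - gap)                       # gap in A
    # Trace back from the bottom-right corner to recover the equivalences.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + scores[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

# Hypothetical superposed C-alpha traces; chain_b has one extra loop residue.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.7, 3.0, 0.0),
           (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]

equivalences = best_path(distance_scores(chain_a, chain_b))
print(equivalences)
```

In this toy case the inserted loop residue in `chain_b` is skipped as a gap, and the remaining residues are paired one to one.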
Structure Comparison Algorithms
Secondary structure based:
SSM        Henrick              PDB
GRATH      Harrison & Orengo    CATH
Residue based:
SSAP       Taylor and Orengo    CATH
DALI       Holm and Sander      SCOP
Comparer   Sali and Blundell    HOMSTRAD
FatCat     Adam Godzik          PDB
Structal   Levitt               PDB
Structural Bioinformatics, Ed: Phil Bourne, Wiley 2003
Bioinformatics: Genes, Proteins and Computers, Bios, 2003
Structure classification
CATH domain structure database (Orengo & Thornton 1993)
~200,000 domains
C: Class (3)
A: Architecture (~40)
T: Topology or Fold Group (~1200)
H: Homologous Superfamily (2600)
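CATH identifies each superfamily by a four-part dotted code, one number per level of the hierarchy (Class, Architecture, Topology, Homologous superfamily). The sketch below parses such a code and reports the node at each level. The helper names `CathCode` and `parse_cath_code` are hypothetical. The example code 3.40.50.300 is the CATH superfamily of P-loop containing nucleotide triphosphate hydrolases, and the class names in `CLASS_NAMES` follow CATH's naming for classes 1 to 3.

```python
from typing import NamedTuple

# CATH class numbers for the three main classes.
CLASS_NAMES = {1: "Mainly Alpha", 2: "Mainly Beta", 3: "Alpha Beta"}

class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

    def level(self, depth):
        """Dotted code truncated to the given depth (1=C, 2=A, 3=T, 4=H)."""
        return ".".join(str(n) for n in self[:depth])

def parse_cath_code(code):
    """Split a four-part CATH superfamily code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-part CATH code: {code!r}")
    return CathCode(*(int(p) for p in parts))

cath = parse_cath_code("3.40.50.300")
print("Class:", cath.level(1), CLASS_NAMES.get(cath.class_, "other"))
print("Architecture:", cath.level(2))
print("Topology:", cath.level(3))
print("Homologous superfamily:", cath.level(4))
```

Two domains share a fold when their codes agree to depth 3, and belong to the same homologous superfamily when all four numbers agree.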
CATH Architectures
Orthogonal bundle
Up-down bundle
-horseshoe
-solenoid -barrel -ribbon
-sheet -roll -barrel
Clam 2-layer -sandwich
Trefoil
3-layer -sandwich
-propeller -solenoid
Orthogonal -prism Parallel -prism
-roll
CATH Architectures
2-layer () sandwich-barrel 3-layer () sandwich
3-layer () sandwich 3-layer () sandwich
4-layer () sandwich
-prism -horseshoe -box
CATH Architectures
Topology or Fold Group: ~1200
Homologous Superfamily: ~2600
Sequence Family (30%): 40,000 domain entries
~200,000 domain entries
CATH
Divergent Evolution
Convergent Evolution
..VILST… ..KLST… ...SLTRF...
Homologous Structures
cholera toxin, pertussis toxin, heat labile enterotoxin
SSAP score 97, sequence identity 79%
SSAP score 81, sequence identity 12%
• high structural similarity, RMSD often < 4 Å
• may have detectable sequence similarity, e.g. by HMMs
• related functions
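Sequence identity figures like those quoted for the toxins are computed from an alignment. A minimal sketch follows, assuming the two sequences are already aligned with "-" marking gaps, and taking the denominator as the number of columns where both sequences have a residue. Other conventions for the denominator exist, so this is one reasonable choice rather than the definition used on the slide. The aligned fragments are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared

# Illustrative aligned fragments (hypothetical alignment).
print(percent_identity("VILST", "-KLST"))
```

Here four columns are compared (I/K, L/L, S/S, T/T) and three are identical.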
Evolutionary Ancestry Uncertain
• structural similarity
• no sequence similarity
• no functional similarity
How do proteins evolve new functions?
Evolution of Protein Functions in Domain Superfamilies
domain duplication
domain fusion, change in domain partner
residue mutations and domain structure embellishments
oligomerisation
Mutation of Residues
TIM barrel glycosyl hydrolases
chitinase A: Glu general acid
narbonin: Glu incorporated in a salt-bridge and this blocks substrate access
Changes in the domain structure can modify the binding site or domain surface
Pantetheine-phosphate adenyltransferase, EC code: 2.7.7.3 (binding site)
Glycerol-3-phosphate cytidylyl transferase, EC code: 2.7.7.39 (binding site)
Changes in domain function in paralogous relatives
1od6A00: Pantetheine-phosphate adenyltransferase
1f7uA01: Arginyl-tRNA synthetase
binding site
Arginyl-tRNA synthetase
Asparagine synthetase B
Changes in the domain partnerships can modify the binding site
binding site
Pantetheine-phosphate adenyltransferase
Change in Oligomerisation
calsequestrin
peroxidase
Thioredoxin superfamily
60-80% of proteins are multi-domain
a few thousand domain superfamilies (< 10,000 in CATH, SCOP and Pfam)
> 2 million domain combinations (multi-domain architectures)
The Mosaic Theory of Protein Evolution (Teichmann et al. 2001, 2003; Gerstein et al. 2001)
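The contrast between a small number of superfamilies and a large number of domain combinations can be illustrated by counting distinct multi-domain architectures, treating each protein as an ordered tuple of domain superfamilies. The sketch below uses made-up proteins and placeholder superfamily labels ("A", "B", "C"); the function name `summarise_architectures` is hypothetical.

```python
from collections import Counter

def summarise_architectures(proteins):
    """Count superfamilies, distinct architectures and multi-domain proteins.

    proteins maps a protein name to its ordered tuple of domain superfamilies.
    """
    architectures = Counter(proteins.values())
    superfamilies = {domain for arch in proteins.values() for domain in arch}
    multi_domain = sum(1 for arch in proteins.values() if len(arch) > 1)
    return {
        "superfamilies": len(superfamilies),
        "architectures": len(architectures),
        "multi_domain_fraction": multi_domain / len(proteins),
        "most_common": architectures.most_common(1)[0],
    }

# Hypothetical domain assignments (illustration only).
proteins = {
    "protein_1": ("A",),
    "protein_2": ("A", "B"),
    "protein_3": ("B", "A"),
    "protein_4": ("A", "B"),
    "protein_5": ("A", "B", "C"),
}

summary = summarise_architectures(proteins)
print(summary)
```

Even this toy set has more architectures than superfamilies, because domain order and domain partners both vary.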
Similarity in Chemistry
[Figure: reaction schemes (I, I', P, P') illustrating each category]
conserved: 19%
semiconserved: 67%
poorly conserved: 7%
unconserved: 7%
nearly 90% of families show full or partial conservation of functions
chemistry is conserved or semi-conserved across the family but the substrate can change
[Figure: chemical structures of substrates. Labels: thioredoxin, disulphide bond, H2O2, Hg2+, cytochrome P450s, FAD/NAD(P)(H)-dependent disulphide oxidoreductases, hexapeptide repeat proteins]
blade domain
fulcrum domain
handle domain
How representative are these structural superfamilies (i.e. in CATH, SCOP) of all proteins in nature?
Domain structure predictions in genome sequences
scan against library of sequence patterns (HMM models) for CATH
protein sequences from UniProt
~26 million domain sequences assigned to CATH superfamilies
~6000 annotated genomes
CATH and Pfam coverage of genomes
CATH: 51%
Pfam: 33%
Unassigned region (NewFam?): 16%
[Chart legend: Pfam-A, Pfam-B, Other]
Protein Family Databases
Each family is represented by a sequence profile or HMM
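To make the idea of a family profile concrete, the sketch below builds a simple ungapped log-odds profile from a few aligned sequences and slides it along a query to find the best-scoring window. This is deliberately much simpler than the profile HMMs used by family databases: it has no insertion or deletion states. The uniform background, the pseudocount, the toy family and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum of column scores; segment must use the 20 standard residue letters."""
    return sum(column[aa] for column, aa in zip(profile, segment))

def best_window(profile, sequence):
    """Return (start, score) of the best ungapped match of the profile."""
    width = len(profile)
    starts = range(len(sequence) - width + 1)
    return max(((s, score(profile, sequence[s:s + width])) for s in starts),
               key=lambda hit: hit[1])

# Hypothetical aligned family members and query sequence.
family = ["VILST", "VLLST", "IILST", "VILSS"]
profile = build_profile(family)
start, hit_score = best_window(profile, "MKVILSTAA")
print(start, round(hit_score, 2))
```

A family member scores well above an unrelated segment, which is the basis for assigning a query sequence to a family.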