Comparing and Classifying Domain Structures

Methods for comparing protein structures. Protein structural classifications. How do structures and functions diverge in protein superfamilies? What proportion of genome sequences can be predicted to belong to superfamilies of known structure?

Upload: richard-tucker

Post on 17-Jan-2016


Page 1

Methods for comparing protein structures

Protein structural classifications

How do structures and functions diverge in protein superfamilies?

What proportion of genome sequences can be predicted to belong to superfamilies of known structure?

Comparing and Classifying Domain Structures

Page 2

Protein Domain Family Classifications

Known domain structures: Alexey Murzin, LMB, Cambridge

Predicted domain structures: Julian Gough, Bristol University

Known domain structures, predicted domain structures: Christine Orengo, UCL

Domain sequences: Alex Bateman, Sanger

Page 3

60-80% of genes in genomes code for multidomain proteins

domains are important evolutionary units

Page 4

Evolution gives rise to families of proteins (homologues)

[Figure: a domain superfamily with relatives from Th. thermophilus, human (two relatives), yeast and M. tuberculosis]

Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Page 5

Evolution gives rise to families of proteins (homologues)

[Figure: a domain superfamily with relatives from Th. thermophilus, human (two relatives), yeast and M. tuberculosis; label: orthologues]

Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Page 6

Evolution gives rise to families of proteins

[Figure: a domain superfamily with relatives from Th. thermophilus, human (two relatives), yeast and M. tuberculosis; label: paralogues]

Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Page 7

Structural diversity in the CATH Domain Family P-loop hydrolases

[Figure: cutinase, cocaine esterase, acetylcholinesterase]

Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Page 8

Challenges in comparing protein structures

residue substitutions due to single base mutations

insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops

Usually the structural cores are highly conserved

Although structure is much more conserved than the sequence, there can still be considerable structural differences between relatives outside the core

Page 9

• residue insertions usually occur in the loops connecting secondary structures
• substitutions can cause shifts in the orientations of secondary structures

Page 10
Page 11

Superposition of OB fold Structures

Page 12

Related structures: RMSD usually < 3.5 Å
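The RMSD quoted on this slide can be illustrated with a minimal Python sketch. It assumes the two structures have already been superposed and that the equivalent residue pairs are known; the coordinates are hypothetical, and no superposition step is included.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superposed and that position k in one
    list is equivalent to position k in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates (in Å) for three equivalent residue pairs;
# every pair is displaced by exactly 1 Å along y.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]

print(rmsd(structure_a, structure_b))  # 1.0
```

Under the rule of thumb on the slide, a pair scoring below about 3.5 Å in this way would usually count as related.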

Page 13

Coping with Insertions and Deletions

ignore the variable loop regions and only compare the secondary structures

use algorithms which can explicitly handle insertions/deletions, e.g. dynamic programming, simulated annealing

Page 14

Fast structure comparison by secondary structures

Page 15

[Figure: two protein structures drawn as graphs of secondary structure elements, helices (H) and strands (E)]

Graphs can be compared using the Bron-Kerbosch algorithm to find the largest common graph.

In this example the common graph contains 5 nodes.

Generally ~1000 times faster than residue-based methods
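A rough Python sketch of the graph comparison described on this slide. The assumptions are ours, not the slide's: nodes are secondary structure elements typed H or E, each edge carries one hypothetical distance value, and two elements can be matched only if their types agree. The largest common subgraph is found as the largest clique in a correspondence graph, using the basic Bron-Kerbosch recursion without pivoting. Real methods would use richer edge descriptions than a single distance.

```python
from itertools import combinations

def bron_kerbosch(r, p, x, adj, cliques):
    """Collect every maximal clique of the graph `adj` (basic form, no pivoting)."""
    if not p and not x:
        cliques.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}

def largest_common_subgraph(types_a, dist_a, types_b, dist_b, tol=1.0):
    """Return the largest set of compatible (element in A, element in B) pairs."""
    # Correspondence graph: one node per pair of same-type elements.
    nodes = [(i, j) for i, ta in enumerate(types_a)
             for j, tb in enumerate(types_b) if ta == tb]
    adj = {n: set() for n in nodes}
    # Two pairings are compatible if they use distinct elements and the
    # inter-element distances agree within the tolerance.
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist_a[i][k] - dist_b[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    cliques = []
    bron_kerbosch(set(), set(nodes), set(), adj, cliques)
    return max(cliques, key=len) if cliques else set()

# Hypothetical example: protein A has H, E, E; protein B has the same three
# elements plus one distant extra helix.
types_a = ["H", "E", "E"]
dist_a = [[0, 10, 12], [10, 0, 5], [12, 5, 0]]
types_b = ["H", "E", "E", "H"]
dist_b = [[0, 10.5, 12.2, 30], [10.5, 0, 4.8, 30],
          [12.2, 4.8, 0, 30], [30, 30, 30, 0]]

common = largest_common_subgraph(types_a, dist_a, types_b, dist_b)
print(sorted(common))  # [(0, 0), (1, 1), (2, 2)]
```

Because each protein contributes only a handful of secondary structure elements, the correspondence graph stays small, which is consistent with the speed advantage noted on the slide.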

Page 16

STRUCTAL

Align sequences

Superpose structures

Score distances between superposed residues in path matrix

Use dynamic programming to find best path

Use equivalences given by the best path to re-superpose the structures
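The scoring and dynamic programming steps on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the published STRUCTAL implementation: the similarity function `1 / (1 + (d / d0)^2)`, the value of `d0` and the linear gap penalty are hypothetical choices, and the superposition steps are omitted (the coordinates are taken as already superposed).

```python
import math

def dp_align(coords_a, coords_b, gap=0.5, d0=2.0):
    """Global dynamic programming over distance-derived residue pair scores.

    Returns the best path score and the list of equivalent residue pairs.
    """
    n, m = len(coords_a), len(coords_b)
    # Path matrix input: a pair scores close to 1 when the residues are close.
    s = [[1.0 / (1.0 + (math.dist(a, b) / d0) ** 2) for b in coords_b]
         for a in coords_a]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + s[i - 1][j - 1],  # match
                             best[i - 1][j] - gap,   # residue of A unmatched
                             best[i][j - 1] - gap)   # residue of B unmatched
    # Trace back the best path to recover the equivalences.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + s[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Hypothetical superposed C-alpha traces: B lacks the second residue of A.
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
trace_b = [(0.1, 0.0, 0.0), (7.5, 0.0, 0.0), (11.5, 0.0, 0.0)]

path_score, equivalences = dp_align(trace_a, trace_b)
print(equivalences)  # [(0, 0), (2, 1), (3, 2)]
```

In an iterative scheme of the kind the slide outlines, the returned equivalences would be used to re-superpose the structures before the matrix is rescored.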

Page 17

Structure Comparison Algorithms

Secondary structure based:

SSM        Henrick               PDB
GRATH      Harrison & Orengo     CATH

Residue based:

SSAP       Taylor and Orengo     CATH
DALI       Holm and Sander       SCOP
Comparer   Sali and Blundell     HOMSTRAD
FatCat     Adam Godzik           PDB
Structal   Levitt                PDB

Structural Bioinformatics, Ed: Phil Bourne, Wiley 2003
Bioinformatics: Genes, Proteins and Computers, Bios, 2003

Structure classification

Page 18

Domain structure database (Orengo & Thornton 1993)

C: Class
A: Architecture
T: Topology or Fold Group
H: Homologous Superfamily

~200,000 domains

2600 domain superfamilies

Page 19

CATH domain database, ~200,000 domains

Class: 3
Architecture: ~40
Topology or Fold: ~1200

Page 20

CATH Architectures

Orthogonal bundle

Up-down bundle

-horseshoe

-solenoid -barrel -ribbon

-sheet -roll -barrel

Page 21

Clam 2-layer -sandwich

Trefoil

3-layer -sandwich

-propeller -solenoid

Orthogonal -prism Parallel -prism

-roll

CATH Architectures

Page 22

2-layer () sandwich-barrel 3-layer () sandwich

3-layer () sandwich 3-layer () sandwich

4-layer () sandwich

-prism -horseshoe -box

CATH Architectures

Page 23

C A T H

Topology or Fold Group: ~1200

Homologous Superfamily: ~2600

Sequence Family (30%)

40,000 domain entries

~200,000 domain entries
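The hierarchy above can be handled programmatically. The sketch below assumes the dotted four-number form of CATH superfamily identifiers (Class.Architecture.Topology.Homologous superfamily). The helper names are hypothetical, and the codes are used only as examples.

```python
from typing import NamedTuple, Optional

class CathNode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    superfamily: int

LEVELS = ["Class", "Architecture", "Topology", "Homologous Superfamily"]

def parse_cath_code(code: str) -> CathNode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathNode(*(int(p) for p in parts))

def shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Name the deepest level at which two domains are classified together."""
    deepest = None
    for name, a, b in zip(LEVELS, parse_cath_code(code_a), parse_cath_code(code_b)):
        if a != b:
            break
        deepest = name
    return deepest

# Two codes that share class, architecture and topology but not superfamily.
print(shared_level("3.40.50.300", "3.40.50.720"))  # Topology
```

Two domains that agree only down to the Topology level share a fold but are not classified as homologues, which is the distinction the T and H levels are meant to capture.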

Page 24

Divergent Evolution

Convergent Evolution

..VILST… ..KLST… ...SLTRF...

Page 25

Homologous Structures

[Figure: cholera toxin, pertussis toxin and heat labile enterotoxin, with pairwise comparison values: SSAP score 97, sequence identity 79%; SSAP score 81, sequence identity 12%]

• high structure similarity score; RMSD often < 4 Å
• may have detectable sequence similarity, e.g. by HMMs
• related functions

Page 26

Evolutionary Ancestry Uncertain

structural similarity

no sequence similarity

no functional similarity

Page 27
Page 28
Page 29

How do proteins evolve new functions?

Page 30

Evolution of Protein Functions in Domain Superfamilies

domain duplication

domain fusion, change in domain partner

residue mutations and domain structure embellishments

oligomerisation

Page 31

Mutation of Residues

TIM barrel glycosyl hydrolases

chitinase A: Glu general acid

narbonin: Glu incorporated in a salt-bridge and this blocks substrate access

Page 32

Changes in domain function in paralogous relatives

changes in the domain structure can modify the binding site or domain surface

Pantetheine-phosphate adenyltransferase (EC code: 2.7.7.3)

Glycerol-3-phosphate cytidylyl transferase (EC code: 2.7.7.39)

[Figure: the two structures with their binding sites marked]

Page 33

Pantetheine-phosphate adenyltransferase (1od6A00)

Arginyl-tRNA synthetase (1f7uA01)

[Figure: the two domains with the binding site marked]

Page 34

Arginyl-tRNA synthetase

Page 35

changes in the domain partnerships can modify the binding site

Asparagine synthetase B

Pantetheine-phosphate adenyltransferase

[Figure: the two structures with the binding site marked]

Page 36

Change in Oligomerisation

Thioredoxin superfamily

calsequestrin

peroxidase

Page 37

The Mosaic Theory of Protein Evolution (Teichmann et al. 2001, 2003; Gerstein et al. 2001)

60-80% of proteins are multi-domain

few thousand domain superfamilies (< 10,000 in CATH, SCOP and Pfam)

> two million domain combinations (multi-domain architectures)
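The quantities on this slide (the fraction of multi-domain proteins and the number of distinct domain combinations) can be computed from per-protein domain assignments. A minimal Python sketch follows; the protein names and assignments are hypothetical, and a multi-domain architecture is assumed to mean the ordered tuple of superfamily identifiers along the chain.

```python
from collections import Counter

# Hypothetical assignments: protein name -> superfamily IDs in chain order.
assignments = {
    "protA": ["3.40.50.300"],
    "protB": ["3.40.50.300", "1.10.8.10"],
    "protC": ["1.10.8.10", "3.40.50.300"],
    "protD": ["3.40.50.300", "1.10.8.10"],
}

def architecture_stats(domain_assignments):
    """Count distinct ordered architectures and the multi-domain fraction."""
    architectures = Counter(tuple(d) for d in domain_assignments.values())
    multi = sum(1 for d in domain_assignments.values() if len(d) > 1)
    return architectures, multi / len(domain_assignments)

architectures, multi_fraction = architecture_stats(assignments)
print(len(architectures), multi_fraction)  # 3 0.75
```

Here two superfamilies already yield three architectures, which shows how a few thousand superfamilies can combine into a far larger number of multi-domain architectures.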

Page 38

Similarity in Chemistry

[Figure: reaction schemes (intermediates I, I'; products P, P') for four categories: conserved, semiconserved, poorly conserved, unconserved; values shown: 19%, 67%, 7%, 7%]

nearly 90% of families show full or partial conservation of functions

Page 39

chemistry is conserved or semi-conserved across the family but the substrate can change

[Figure: chemical structures of substrates, with labels: thioredoxin disulphide bond; H2O2; Hg2+; cytochrome P450s; FAD/NAD(P)(H)-dependent disulphide oxidoreductases; hexapeptide repeat proteins]

Page 40

blade domain

Page 41

fulcrum domain

Page 42

handle domain

Page 43
Page 44

How representative are these structural superfamilies (i.e. in CATH, SCOP) of all proteins in nature?

Page 45

Domain structure predictions in genome sequences

protein sequences from UniProt

~6000 annotated genomes

scan against library of sequence patterns (HMM models) for CATH

~26 million domain sequences assigned to CATH superfamilies

Page 46
Page 47

[Figure legend: Pfam-A, Pfam-B, Other]

Page 48

CATH and Pfam coverage of genomes

[Figure: coverage chart with segments CATH, Pfam and Unassigned-region (NewFam?); values shown: 51%, 33%, 16%]

Page 49

Protein Family Databases

Each family is represented by a sequence profile or HMM
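To make the idea of a sequence profile concrete, here is a minimal Python sketch that builds a position-specific log-odds profile from a gapless alignment and scores a query against it. The aligned sequences, the pseudocount and the uniform background are hypothetical choices for illustration. A profile HMM would additionally model insertions and deletions, which this sketch does not attempt.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background, an assumption

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} column per alignment position."""
    profile = []
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        profile.append({aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
                        for aa in AMINO_ACIDS})
    return profile

def score(profile, sequence):
    """Sum the per-position log-odds scores of a sequence of matching length."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

# Hypothetical gapless alignment of four family members.
family = ["VILST", "VLLST", "KILST", "VILTT"]
profile = build_profile(family)

member_score = score(profile, "VILST")
unrelated_score = score(profile, "GGGGG")
print(member_score > 0 > unrelated_score)  # True
```

A sequence would be assigned to the family whose profile gives it the highest score above some chosen threshold.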