Exploiting Structural and Comparative Genomics to Reveal Protein Functions

How many domain families can we find in the genomes and can we predict the functions of relatives?

Exploiting protein structure to predict protein functions

Using correlated phylogenetic profiles based on CATH domains to reveal functional associations

CATH: Domain families of known structure

Gene3D: Protein families and domain annotations for completed genomes

CATHEDRAL (Oliver Redfern and Andrew Harrison)

CATH version 3.0: 1100 fold groups, 2100 homologous superfamilies, 86,000 domains

Combines a rapid graph theory secondary structure filter with dynamic programming for accurate residue alignment

An SVM is used to combine scores and assess the significance of a match

[Figure: Fold Recognition Performance. % correct fold (y-axis, 0.8 to 1) against rank (x-axis, 0 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL, SSAP and DDP.]

Gene3D: Domain annotations in genome sequences

>2 million protein sequences from 300 completed genomes and UniProt

scan against library of HMM models: ~2000 CATH, ~9000 Pfam

assign domains to CATH and Pfam superfamilies

Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs

DomainFinder: structural domains from CATH take precedence
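The precedence rule can be illustrated with a minimal sketch. It assumes that overlapping hits are resolved greedily and that any overlap is disallowed; the `Hit` record, the `resolve` function and the scores are hypothetical names and values introduced for illustration, not DomainFinder's actual implementation.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    family: str   # illustrative label, e.g. "CATH-1" or "Pfam-2"
    source: str   # "CATH" or "Pfam"
    start: int    # first residue of the match (inclusive)
    end: int      # last residue of the match (inclusive)
    score: float  # hypothetical HMM score, higher is better

def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end

def resolve(hits: list[Hit]) -> list[Hit]:
    """Keep non-overlapping hits, considering CATH hits before Pfam hits."""
    # CATH first (False sorts before True), then best score first.
    ranked = sorted(hits, key=lambda h: (h.source != "CATH", -h.score))
    kept: list[Hit] = []
    for hit in ranked:
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)

hits = [
    Hit("Pfam-1", "Pfam", 10, 120, 95.0),
    Hit("CATH-1", "CATH", 5, 110, 60.0),
    Hit("Pfam-2", "Pfam", 130, 240, 80.0),
]
resolved = resolve(hits)
print([h.family for h in resolved])
```

Here the CATH hit is kept over the higher-scoring Pfam hit that overlaps it, while the non-overlapping Pfam hit survives.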

[Figure: a protein chain drawn from N to C terminus with its domain assignments: CATH-1, Pfam-1, NewFam and Pfam-2.]

[Figure: Domain families ranked by size (number of domain sequences). Percentage of all domain family sequences (y-axis) against rank by family size (x-axis) for CATH superfamilies of known structure, Pfam families of unknown structure and NewFam families of unknown structure.]

~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families

[Figure: a structural superfamily (CATH) divided into subfamilies of relatives F1 to F5; relatives within a subfamily are likely to have similar functions.]

Only ~3% of diverse sequences in large CATH domain families have known structures

<100 families account for 50% of domain sequences of known fold

Iterative Profile Search Methodology

300 genomes, >2 million sequences including UniProt and RefSeq

structural domain assignments from CATH

functional domain assignments from Pfam

Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct

Gene3D: Domain mappings for 300 Completed Genomes

http://www.biochem.ucl.ac.uk:8080/Gene3D

Russell Marsden, Corin Yeats, Michael Maibaum, David Lee

Yeats et al. Nucleic Acids Res. 2006.

[Figure: Conservation of enzyme function in homologous domains with the same multidomain architecture (MDA) in Gene3D (panel title: Domains in same architectures). Conservation of EC number to 3 levels (%, 0 to 100) against sequence identity between Protein 1 and Protein 2 (bins 11-20 to 91-100), with one series per domain overlap level from no overlap to 100% overlap in 10% steps.]
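Conservation of enzyme function to three EC levels, the measure plotted above, amounts to comparing the leading fields of two EC numbers. A minimal sketch, assuming EC numbers are dot-separated strings and that `-` marks an unassigned level; the function names and example pairs are illustrative, not taken from the Gene3D analysis.

```python
def ec_levels_shared(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels on which two EC numbers agree."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        # An unassigned level ('-') cannot confirm agreement.
        if a == "-" or b == "-" or a != b:
            break
        shared += 1
    return shared

def conserved_to_level(ec_a: str, ec_b: str, level: int = 3) -> bool:
    """True if the two EC numbers match on at least `level` leading fields."""
    return ec_levels_shared(ec_a, ec_b) >= level

# Illustrative pairs of EC numbers for homologous domains.
pairs = [
    ("1.1.1.1", "1.1.1.37"),   # same to 3 levels, differs at level 4
    ("1.1.1.1", "1.1.2.3"),    # same to 2 levels only
    ("3.4.21.4", "3.4.21.-"),  # same to 3 levels, level 4 unassigned
]
conserved_fraction = sum(conserved_to_level(a, b) for a, b in pairs) / len(pairs)
print(conserved_fraction)
```

Computing this fraction separately for each sequence identity bin would give one point per bin on a plot of this kind.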

[Figure: Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC levels). Number of domain relatives (log scale, 1 to 1,000,000) and number of superfamilies (0 to 200) against sequence identity threshold (bins 11-20% to 91-100%). Annotations: 332 highly conserved families; 60 highly variable families.]

[Figure: Conservation of Enzyme Function in CATH Domain Families. Structural similarity (SSAP) score (40 to 100) against pairwise sequence identity (0 to 100%), with pairs marked as same function or different function.]

[Figure: Correlation of structural variability with number of different functional groups (COGs vs SSGs). Number of COG functional groups (y-axis) against number of diverse structural clusters (structural sub-groups, SSGs) within a family (x-axis). Legend bins: 0-25, 25-50, 50-75, 75-100. Labelled point: P-loop hydrolases (COG-270, SSG-67).]

Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments

Some families show great structural diversity

In 117 superfamilies, relatives have expanded two-fold or more

2DSEC algorithm

These families represent more than half the genome sequences of known fold

Gabrielle Reeves

Structural embellishments can modify the active site

Galectin binding superfamily

Structural embellishments can modulate domain interactions

Glucose 6-phosphate dehydrogenase

side orientation, face orientation

Dihydrodipicolinate reductase

Additional secondary structures shown at (a) are involved in subunit interactions

Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions

Biotin carboxylase, D-alanine-D-alanine ligase

Dimer of biotin carboxylase

ATP Grasp superfamily

Secondary structure insertions are distributed along the chain but aggregate in 3D

60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments

In 80% of domains, 1 or more embellishments contact other domains or subunits

[Figure: Frequency (%, 0 to 80) against size of indel (number of secondary structures, 1 to 12). Bars with indel frequency < 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06% and 0.02%.]

85% of residue insertions comprise only 1 or 2 secondary structures

~80% of variable families adopt regular layered architectures:

2 Layer Beta Sandwich

2 Layer Alpha Beta Sandwich

Alpha / Beta Barrel

3 Layer Alpha Beta Sandwich

Function prediction to Guide Target Selection for Structural Genomics

[Figure: a structural superfamily (CATH) divided into subfamilies F1 to F5 of close relatives with the same MDA; relatives within a subfamily are likely to have similar functions.]

Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures

[Figure: Conservation of Enzyme Function in Homologous Domains. Conservation of EC levels (% frequency, 0 to 100) against structure similarity (SSAP) score (bins 50-60 to 90-100), with series Not Conserved, Less than 3 EC, EC3 and EC4.]

FLORA – structural templates for assigning structures to functional subgroups in CATH

Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily

Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seed)

Explore local structural environment of seed residues to find conserved structural motifs

Dataset of 84 enzyme superfamilies in CATH of which 21 are functionally very diverse

Finding conserved residue positions (seeds) - Scorecons

multiple sequence alignment of relatives from functional family, guided by structure alignment

identify most highly conserved residue positions (seed positions) using Scorecons – Valdar and Thornton (2001)
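Picking seed positions from an alignment can be sketched as follows. This is not Scorecons: it swaps in a plain Shannon entropy score as a stand-in conservation measure, and the threshold of 0.9, the function names and the toy alignment are assumptions made for illustration.

```python
from __future__ import annotations
import math
from collections import Counter

def column_conservation(column: str) -> float:
    """Return 1 minus the normalised Shannon entropy of a column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(20)  # 20 standard amino acids
    return 1.0 - entropy / max_entropy

def seed_positions(alignment: list[str], threshold: float = 0.9) -> list[int]:
    """Return 0-based indices of columns scoring at or above the threshold."""
    columns = zip(*alignment)  # transpose rows into columns
    return [i for i, col in enumerate(columns)
            if column_conservation("".join(col)) >= threshold]

# Toy alignment of four relatives from one functional family.
alignment = ["GDSGK", "GDTGR", "GDAGK", "GESGK"]
seeds = seed_positions(alignment)
print(seeds)
```

Only the two invariant glycine columns pass the threshold here. A score that also weights residue similarity and sequence redundancy would rank the partly conserved columns differently.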

FLORA Algorithm for Identifying Structural Homologues with Similar Functions

assign conserved sequence seeds

expand to local environment of 12Å

identify structurally conserved residue cliques and generate template

new structures are scanned against a library of FLORA templates and SVMs used to assess significance of matches
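The expansion from seed residues to a 12 Å local environment can be sketched as a distance filter. This assumes one representative atom per residue (for example the C-alpha) and a simple Euclidean cutoff; the function name and coordinates are made up for illustration, and the clique detection and template steps are not shown.

```python
from __future__ import annotations
import math

def local_environment(coords: dict[int, tuple[float, float, float]],
                      seeds: list[int],
                      radius: float = 12.0) -> set[int]:
    """Return residues whose representative atom is within `radius` of any seed."""
    environment = set()
    for residue, xyz in coords.items():
        if any(math.dist(xyz, coords[seed]) <= radius for seed in seeds):
            environment.add(residue)
    return environment

# Hypothetical representative-atom coordinates (in Å), keyed by residue number.
coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (11.0, 0.0, 0.0),
    4: (20.0, 0.0, 0.0),
    5: (30.0, 0.0, 0.0),
}
env = local_environment(coords, seeds=[1])
print(sorted(env))
```

The seed itself is always part of its own environment, so the result can be passed directly to whatever step compares residue geometry across family members.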

[Figure: Performance of FLORA vs Global Structure Comparison (SSAP). Coverage (0 to 1) against error rate (0 to 0.2) for SSAP and FLORA.]

Eisenberg Phylogenetic Profiles for Detecting Functional Associations

presence (1) or absence (0) of superfamily in organism

Superfamily     sp1  sp2  sp3  sp4
Superfamily 1     1    0    1    0
Superfamily 2     1    0    1    0
Superfamily 3     0    0    1    1

Functionally linked: Superfamilies 1 and 2 (identical profiles)

Gene3D Phylogenetic Occurrence Profiles

number of relatives from CATH domain superfamily in organism

Superfamily     sp1  sp2  sp3  sp4
Superfamily 1    35    0   12   60
Superfamily 2    12   13   14   11
Superfamily 3     6    0    0    0
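A presence/absence profile comparison can be sketched in a few lines. The sketch assumes that a functional link is called only when two profiles are identical, which is a simplification, and the helper names `binarise` and `linked_pairs` are hypothetical. The values are the illustrative ones from the slide.

```python
from itertools import combinations

# Presence (1) or absence (0) of each superfamily in organisms sp1..sp4.
presence = {
    "Superfamily 1": (1, 0, 1, 0),
    "Superfamily 2": (1, 0, 1, 0),
    "Superfamily 3": (0, 0, 1, 1),
}

def binarise(counts):
    """Collapse an occurrence profile (counts) into a presence/absence profile."""
    return tuple(1 if c > 0 else 0 for c in counts)

def linked_pairs(profiles):
    """Return pairs of families whose presence/absence profiles are identical."""
    return [(a, b) for a, b in combinations(sorted(profiles), 2)
            if profiles[a] == profiles[b]]

links = linked_pairs(presence)
print(links)

# An occurrence profile keeps the counts that binarisation discards.
print(binarise((35, 0, 12, 60)))
```

Binarising discards the difference between 35 relatives and 1, which is the information the occurrence profiles retain.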

Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D

domains clustered at different levels of sequence similarity: Superfam., 30%, 40%, 50%, 60% … 100% sequence identity clusters

Phylogenetic Profiles for Families and Subfamilies

phylogenetic occurrence profile matrix

            Sp1  Sp2  Sp3  Sp4  …  Spn
Cluster 1     3    3    5    7  …    5
Cluster 2     0    2    4    5  …    4
Cluster 3     1    0    1    0  …    1
Cluster 4     0    2    0    0  …    6
Cluster 5     1    0    2    1  …    0
Cluster 6     0    3    1    2  …    1
Cluster 7     0    0    0    1  …    2
.             .    .    .    .  …    .
Cluster n     0    1    0    1  …    0

Juan Ranea and Corin Yeats

Comparison of Pairs of Phylogenetic Profiles

            Sp1  Sp2  Sp3  Sp4  Sp5  …  Spn
Cluster 1     6    9    6    9    5  …    9
Cluster 2     4    3    7    5    3  …    5
Cluster 3     1    0    1    0    2  …    1
Cluster 4     0    2    0    0    1  …    6
Cluster 5     1    4    1    4    1  …    4
Cluster 6     0    3    1    2    0  …    1
Cluster 7     4    8    4    8    4  …    8
.             .    .    .    .    .  …    .
Cluster n     0    1    0    1    1  …    0

[Figure: three panels plotting the profile of Cluster 1 against those of Cluster 2, Cluster 5 and Cluster 7 (axis ticks at 5 and 10), with Euclidean distances E1 and E2 marked.]

Euclidean distance: E1 >> E2
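The two ways of comparing a pair of profiles can be tried on the example rows. The sketch below uses only the six values shown for each cluster (the elided columns are left out), and `euclidean` and `pearson` are plain textbook implementations rather than the code used for Gene3D.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two occurrence profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation coefficient between two occurrence profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

# Displayed values only (Sp1..Sp5 and Spn) for four of the clusters.
cluster1 = [6, 9, 6, 9, 5, 9]
cluster2 = [4, 3, 7, 5, 3, 5]
cluster5 = [1, 4, 1, 4, 1, 4]
cluster7 = [4, 8, 4, 8, 4, 8]

for name, other in [("2", cluster2), ("5", cluster5), ("7", cluster7)]:
    print(f"Cluster 1 vs Cluster {name}: "
          f"distance={euclidean(cluster1, other):.2f}, "
          f"r={pearson(cluster1, other):.2f}")
```

With these values, Clusters 1 and 5 are strongly correlated (r of about 0.98) yet about 11.9 apart in Euclidean distance, whereas Clusters 1 and 7 are equally well correlated and only about 3.5 apart. Clusters 1 and 2 are essentially uncorrelated.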

Statistical Significance of Correlated Pairs (Comparison against 3 randomised models)

[Figure: frequency (0 to 80) against Pearson correlation coefficient (bins of 0.1 from -0.3 to 1.0) for the real matrix and for random matrices I, II and III.]
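A comparison of real and randomised correlation distributions can be sketched as below. The slides do not say how the three random matrices were built, so the null model here (independently permuting each row across species) is an assumption, as are the function names, the number of shuffles and the 0.9 cutoff. The matrix reuses the displayed values of the seven example clusters.

```python
import math
import random

def pearson(p, q):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_q)

def pairwise_correlations(matrix):
    """Correlation coefficients for every unordered pair of rows."""
    return [pearson(matrix[i], matrix[j])
            for i in range(len(matrix)) for j in range(i + 1, len(matrix))]

def shuffle_rows(matrix, rng):
    """Assumed null model: permute each row's counts across species."""
    shuffled = []
    for row in matrix:
        permuted = list(row)
        rng.shuffle(permuted)
        shuffled.append(permuted)
    return shuffled

def fraction_at_least(values, cutoff):
    """Fraction of values at or above the cutoff."""
    return sum(v >= cutoff for v in values) / len(values)

real = [
    [6, 9, 6, 9, 5, 9], [4, 3, 7, 5, 3, 5], [1, 0, 1, 0, 2, 1],
    [0, 2, 0, 0, 1, 6], [1, 4, 1, 4, 1, 4], [0, 3, 1, 2, 0, 1],
    [4, 8, 4, 8, 4, 8],
]
rng = random.Random(0)  # fixed seed so the run is repeatable
real_r = pairwise_correlations(real)
null_r = []
for _ in range(200):
    null_r.extend(pairwise_correlations(shuffle_rows(real, rng)))

print(fraction_at_least(real_r, 0.9), fraction_at_least(null_r, 0.9))
```

A cutoff at which high correlations are common in the real matrix but rare in the shuffled ones gives a rough significance threshold for calling a pair of families correlated.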

Domain Associations Network from 13 Eukaryotes:

Actin & VCP-like ATPases

DNA replication and repair

Chaperones and Cytoskeleton

DNA Topoisomerase & Elongation Factor G

[Figure: DNA topoisomerase & Elongation Factor G. Number of domain relatives (0 to 10) in each of the 13 species.]

Highly correlated profiles correspond to pairs of families with significant similarity in GO functions

[Figure: frequency of significant GO semantic similarity scores (biological processes) against distances of correlated profile scores (bins (0)-(1) to (>=19)); y-axis 0 to 60; series %Frq and %Sum_SS/Frq.]

Summary

– On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam

– Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain based homologies

– Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved

– Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families

Acknowledgements

CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman

Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea

Medical Research Council, Wellcome Trust, NIH, EU funded Biosapiens, EU funded Embrace, BBSRC

http://www.biochem.ucl.ac.uk/bsm/cath_new
