Exploiting Structural and Comparative Genomics to Reveal Protein Functions
How many domain families can we find in the genomes and can we predict the functions of relatives?
Exploiting protein structure to predict protein functions
Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes
CATHEDRAL
Oliver Redfern and Andrew Harrison
CATH version 3.0
1100 fold groups
2100 homologous superfamilies
86,000 domains
Combines a rapid graph theory secondary structure filter with dynamic programming for accurate residue alignment
SVM is used to combine scores and assess significance of match
[Figure: Fold Recognition Performance. % correct fold (y-axis, 0.8 to 1) against rank (x-axis, 0 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL, SSAP DDP and SSAP.]
Gene3D: Domain annotations in genome sequences
>2 million protein sequences from 300 completed genomes and UniProt
scan against library of HMM models (~2000 CATH, ~9000 Pfam)
assign domains to CATH and Pfam superfamilies
Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs
DomainFinder: structural domains from CATH take precedence
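One way to picture the precedence rule is a greedy resolution of overlapping HMM hits in which CATH hits are placed before Pfam hits. The sketch below is only an illustration under that assumption: the `Hit` record, the `resolve` helper, the scores and the strict no-overlap rule are all hypothetical, and the slide does not describe how DomainFinder itself works.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str   # hypothetical label, e.g. "CATH-1" or "Pfam-2"
    source: str   # "CATH" or "Pfam"
    start: int    # first residue, inclusive
    end: int      # last residue, inclusive
    score: float  # assumed: higher means a better HMM match


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve(hits: List[Hit]) -> List[Hit]:
    """Keep non-overlapping hits, considering CATH hits first, then by score."""
    ordered = sorted(hits, key=lambda h: (h.source != "CATH", -h.score))
    kept: List[Hit] = []
    for hit in ordered:
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    # Report the surviving domains in chain order (N to C terminus).
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("Pfam-1", "Pfam", 1, 120, 95.0),
    Hit("CATH-1", "CATH", 100, 250, 60.0),
    Hit("Pfam-2", "Pfam", 260, 400, 80.0),
]
architecture = [h.family for h in resolve(hits)]
print(architecture)  # ['CATH-1', 'Pfam-2']
```

In this toy input the Pfam-1 hit scores higher than CATH-1 but is dropped because it overlaps the structural domain, which is the behaviour the slide's rule implies.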
Gene3D: Domain annotations in genome sequences
[Figure: a protein chain from N to C terminus annotated with CATH-1, Pfam-1, NewFam and Pfam-2 domains.]
Domain families ranked by size (number of domain sequences)
[Figure: percentage of all domain family sequences (y-axis) against rank by family size (x-axis) for CATH superfamilies of known structure, Pfam families of unknown structure and NewFam of unknown structure.]
~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families
[Figure: a structural superfamily (CATH) divided into subfamilies of relatives F1 to F5; relatives within a subfamily are likely to have similar functions.]
Only ~3% of diverse sequences in large CATH domain families have known structures
<100 families account for 50% of domain sequences of known fold
Gene3D: Domain mappings for 300 Completed Genomes
Iterative Profile Search Methodology
300 genomes, >2 million sequences including UniProt and RefSeq
structural domain assignments from CATH
functional domain assignments from Pfam
Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
http://www.biochem.ucl.ac.uk:8080/Gene3D
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Yeats et al. Nucleic Acids Res. 2006.
Conservation of enzyme function in homologous domains with same multidomain architecture (MDA) in Gene3D
[Figure: domains in same architectures. Function conservation (3rd level EC string match, 0 to 100) against sequence identity (bins 11-20 to 91-100), with one series per level of overlap from no overlap to 100% overlap in 10% steps.]
[Figure: Protein 1 and Protein 2, each with CATH-1, Pfam-1, NewFam and Pfam-2 domains. Axes: conservation of EC number to 3 levels (%) and sequence identity.]
Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC Levels)
[Figure: number of sequences (number of domain relatives, log scale from 1 to 1,000,000) and number of families (number of superfamilies, 0 to 200) against sequence identity threshold (bins 11-20% to 91-100%).]
332 highly conserved families
60 highly variable families
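Conservation "to 3 EC levels" can be read as agreement on the first three fields of the four-field EC number. A minimal sketch of that comparison follows; the helper names, the treatment of incomplete annotations and the example pairs are assumptions made for illustration and are not taken from the Gene3D analysis.

```python
def ec_conserved(ec_a: str, ec_b: str, levels: int = 3) -> bool:
    """True if two EC numbers agree on their first `levels` fields."""
    a = ec_a.split(".")[:levels]
    b = ec_b.split(".")[:levels]
    # Assumption: incomplete annotations (missing fields or '-' placeholders)
    # cannot be confirmed as conserved, so they count as not conserved.
    if len(a) < levels or len(b) < levels or "-" in a + b:
        return False
    return a == b


def percent_conserved(pairs, levels: int = 3) -> float:
    """Percentage of annotated pairs whose EC numbers agree to `levels` fields."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(ec_conserved(a, b, levels) for a, b in pairs)
    return 100.0 * hits / len(pairs)


# Illustrative pairs of EC annotations for homologous domains.
pairs = [
    ("1.1.1.1", "1.1.1.2"),    # same to 3 levels, differ at level 4
    ("2.7.1.1", "2.7.1.2"),    # same to 3 levels
    ("3.4.21.4", "3.4.22.1"),  # differ at level 3
    ("1.1.1.1", "1.1.-.-"),    # incomplete annotation
]
print(percent_conserved(pairs))  # 50.0
```

Binning such pairs by sequence identity and finding the lowest bin where the percentage reaches 95% would give a per-family threshold of the kind the slide summarises.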
Conservation of Enzyme Function in CATH Domain Families
[Figure: structural similarity (SSAP) score (y-axis, 40 to 100) against pairwise sequence identity (x-axis, 0 to 100%) for pairs of relatives with the same function and with different functions.]
Correlation of structural variability with number of different functional groups
[Figure: COGs vs SSGs. Number of COG functional groups (y-axis, 0 to 90) against number of diverse structural clusters (structural sub-groups) within family (x-axis, 0 to 60); legend bins 0-25, 25-50, 50-75 and 75-100. Labelled point: P-loop hydrolases (COG-270, SSG-67).]
Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments
Some families show great structural diversity
In 117 superfamilies, relatives have expanded 2-fold or more
2DSEC algorithm
These families represent more than half the genome sequences of known fold
Gabrielle Reeves
Structural embellishments can modify the active site
Galectin binding superfamily
Structural embellishments can modulate domain interactions
Glucose 6-phosphate dehydrogenase
side orientation, face orientation
Dihydrodipicolinate reductase
Additional secondary structures shown at (a) are involved in subunit interactions
Structural embellishments can modify function by modifying
active site geometry and mediating new domain and subunit
interactions
ATP Grasp superfamily
Biotin carboxylase
D-alanine-D-alanine ligase
Dimer of biotin carboxylase
Secondary structure insertions are distributed along the chain but aggregate in 3D
60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
In 80% of domains, 1 or more embellishments contact other domains or subunits
[Figure: frequency (%) of indels (y-axis, 0 to 80) against size of indel in number of secondary structures (x-axis, 1 to 12). Value labels: 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%. Annotation: indel frequency < 1%.]
85% of residue insertions comprise only 1 or 2 secondary structures
~80% of variable families adopt regular layered architectures
2 Layer Beta Sandwich
2 Layer Alpha Beta Sandwich
3 Layer Alpha Beta Sandwich
Alpha / Beta Barrel
Function prediction to Guide Target Selection for Structural Genomics
[Figure: a structural superfamily (CATH) divided into subfamilies F1 to F5 of close relatives with the same MDA; relatives within a subfamily are likely to have similar functions.]
Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures
Conservation of Enzyme Function in Homologous Domains
[Figure: % frequency (y-axis, 0 to 100) of each level of EC conservation (not conserved, less than 3 EC, EC3, EC4) against structure similarity (SSAP) score (bins 50-60 to 90-100).]
FLORA – structural templates for assigning structures to functional subgroups in CATH
Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seed)
Explore local structural environment of seed residues to find conserved structural motifs
Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse
FLORA Algorithm for Identifying Structural Homologues with Similar Functions
Finding conserved residue positions (seeds) - Scorecons
multiple sequence alignment of relatives from functional family guided by structure alignment
identify most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)
assign conserved sequence seeds
expand to local environment of 12Å
identify structurally conserved residue cliques and generate template
new structures are scanned against a library of FLORA templates and SVMs used to assess significance of matches
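The seed-finding and 12Å environment steps can be illustrated with a small sketch. It is not Scorecons and not the FLORA implementation: a gap-penalised Shannon entropy score stands in for the conservation measure, and the toy alignment, the 0.9 threshold, the C-alpha coordinates and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def conservation(column: str) -> float:
    """Score in [0, 1]: 1.0 for an invariant, gap-free column.

    Stand-in measure (Shannon entropy over 20 residue types), not Scorecons.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log(k / n) for k in Counter(residues).values())
    score = 1.0 - entropy / math.log(20)
    return score * n / len(column)  # penalise columns containing gaps


def seed_positions(alignment, threshold: float = 0.9):
    """Alignment columns whose conservation reaches the (assumed) threshold."""
    return [i for i, col in enumerate(zip(*alignment))
            if conservation("".join(col)) >= threshold]


def neighbours(coords, seed: int, radius: float = 12.0):
    """Indices of residues within `radius` angstroms of the seed residue."""
    return [i for i, c in enumerate(coords)
            if i != seed and math.dist(coords[seed], c) <= radius]


# Toy structure-guided alignment of four relatives from one functional family.
alignment = ["HDAGK", "HDSGR", "HD-GK", "HDTGE"]
seeds = seed_positions(alignment)
print(seeds)  # [0, 1, 3]

# Hypothetical C-alpha coordinates (angstroms) for the first relative.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (11.4, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(neighbours(coords, seed=0))  # [1, 2, 3]
```

The residues returned around each seed would be the candidates from which conserved cliques are selected when a template is built.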
Performance of FLORA vs Global Structure Comparison (SSAP)
[Figure: coverage (y-axis, 0 to 1) against error rate (x-axis, 0 to 0.2) for SSAP and FLORA.]
Eisenberg Phylogenetic Profiles for Detecting Functional Associations
presence or absence of superfamily in organism

Superfamily               sp1  sp2  sp3  sp4
Superfamily 1               1    0    1    0
Superfamily 2               1    0    1    0
Superfamily 3               0    0    1    1

Functionally Linked

Gene3D Phylogenetic Occurrence Profiles
number of relatives from superfamily in organism

CATH Domain Superfamily   sp1  sp2  sp3  sp4
Superfamily 1              35    0   12   60
Superfamily 2              12   13   14   11
Superfamily 3               6    0    0    0
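The two kinds of profile on this slide, presence or absence and number of relatives, can be related with a few lines of code. The sketch below copies the example values from the slide; the helper names and the rule that only identical presence/absence profiles count as linked are simplifying assumptions for illustration.

```python
from itertools import combinations

# Number of relatives per organism (sp1..sp4), as in the slide's example.
counts = {
    "Superfamily 1": (35, 0, 12, 60),
    "Superfamily 2": (12, 13, 14, 11),
    "Superfamily 3": (6, 0, 0, 0),
}

# Presence/absence per organism (sp1..sp4), as in the slide's example.
presence = {
    "Superfamily 1": (1, 0, 1, 0),
    "Superfamily 2": (1, 0, 1, 0),
    "Superfamily 3": (0, 0, 1, 1),
}


def binarise(profile):
    """Collapse an occurrence (count) profile to presence/absence."""
    return tuple(int(n > 0) for n in profile)


def linked_pairs(profiles):
    """Pairs of superfamilies with identical profiles (assumed linkage rule)."""
    return [(a, b) for a, b in combinations(profiles, 2)
            if profiles[a] == profiles[b]]


print(linked_pairs(presence))
# [('Superfamily 1', 'Superfamily 2')]
print({name: binarise(p) for name, p in counts.items()})
```

Binarising discards the copy-number signal (35, 12 and 60 all become 1), which is the information that occurrence profiles retain.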
Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D
[Figure: a superfamily subdivided into 30%, 40% and 50% sequence identity clusters.]
Phylogenetic Profiles for Families and Subfamilies
domains clustered at different levels of sequence similarity: Superfam., 30%, 40%, 50%, 60%, … 100%
phylogenetic occurrence profile matrix

            Sp1  Sp2  Sp3  Sp4   …  Spn
Cluster 1     3    3    5    7   …    5
Cluster 2     0    2    4    5   …    4
Cluster 3     1    0    1    0   …    1
Cluster 4     0    2    0    0   …    6
Cluster 5     1    0    2    1   …    0
Cluster 6     0    3    1    2   …    1
Cluster 7     0    0    0    1   …    2
…
Cluster n     0    1    0    1   …    0

Juan Ranea and Corin Yeats
Comparison of Pairs of Phylogenetic Profiles

            Sp1  Sp2  Sp3  Sp4  Sp5   …  Spn
Cluster 1     6    9    6    9    5   …    9
Cluster 2     4    3    7    5    3   …    5
Cluster 3     1    0    1    0    2   …    1
Cluster 4     0    2    0    0    1   …    6
Cluster 5     1    4    1    4    1   …    4
Cluster 6     0    3    1    2    0   …    1
Cluster 7     4    8    4    8    4   …    8
…
Cluster n     0    1    0    1    1   …    0

[Figure: three panels plotting profiles across Sp1 to Spn (y-axis ticks at 5 and 10) for Cluster 1 with Cluster 2, Cluster 1 with Cluster 5 and Cluster 1 with Cluster 7.]
Euclidean distance: E1 >> E2
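As a worked illustration of the Euclidean comparison, the sketch below uses the Sp1 to Sp5 values of Clusters 1, 2, 5 and 7 from the example matrix. Truncating the profiles to five species is an assumption made because the remaining columns are elided on the slide, and the code does not say which pair the slide labels E1 or E2.

```python
import math

# Sp1..Sp5 occurrence counts for four clusters from the slide's example matrix.
profiles = {
    "Cluster 1": (6, 9, 6, 9, 5),
    "Cluster 2": (4, 3, 7, 5, 3),
    "Cluster 5": (1, 4, 1, 4, 1),
    "Cluster 7": (4, 8, 4, 8, 4),
}


def euclidean(p, q):
    """Euclidean distance between two occurrence profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


reference = profiles["Cluster 1"]
distances = {
    name: euclidean(reference, profile)
    for name, profile in profiles.items()
    if name != "Cluster 1"
}
for name, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"Cluster 1 vs {name}: {d:.2f}")
# Cluster 1 vs Cluster 7: 3.32
# Cluster 1 vs Cluster 2: 7.81
# Cluster 1 vs Cluster 5: 10.77
```

Clusters 5 and 7 both rise and fall with Cluster 1, but only Cluster 7 also has similar copy numbers, so the Euclidean distance separates them where a shape-only measure would not.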
Statistical Significance of Correlated Pairs (Comparison against 3 randomised models)
[Figure: frequency (y-axis, 0 to 80) of Pearson correlation coefficients (x-axis, bins of 0.1 from -0.3 to 1.0) for the real matrix and random matrices I, II and III.]
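The idea of judging a Pearson correlation against a randomised matrix can be sketched as follows. The slide does not say how its three random models were built, so the shuffle used here (permuting the species order of one profile) is just one possible null model chosen for illustration, and the five-species profiles, the function names and the number of shuffles are likewise assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(vx * vy)


def empirical_p(x, y, shuffles=1000, seed=0):
    """Fraction of shuffled profiles correlating at least as well as the real one."""
    rng = random.Random(seed)
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # assumed null model: permute species order
        if pearson(x, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (shuffles + 1)


cluster_1 = (6, 9, 6, 9, 5)  # Sp1..Sp5 of two example occurrence profiles
cluster_7 = (4, 8, 4, 8, 4)
observed = pearson(cluster_1, cluster_7)
p_value = empirical_p(cluster_1, cluster_7)
print(f"r = {observed:.3f}")  # r = 0.976
print(f"empirical p = {p_value:.3f}")
```

With only five species even a high correlation is weakly supported, which is why profiles spanning many genomes and a comparison against randomised matrices are needed before a pair is called significant.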
Domain Associations Network from 13 Eukaryotes:
Actin & VCP-like ATPases
DNA replication and repair
Chaperones and Cytoskeleton
DNA Topoisomerase & Elongation factor G
[Figure: DNA topoisomerase & Elongation Factor G. Number of domain relatives (y-axis, 0 to 10) in each species (x-axis, 1 to 13).]
Highly correlated profiles correspond to pairs of families with significant similarity in GO functions
[Figure: biological processes. Frequency of significant GO semantic similarity scores (y-axis, 0 to 60) against distances of correlated profile scores (x-axis, bins of 1 from 0 to 19, then >=19); series %Frq and %Sum_SS/Frq.]
Summary
– On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
– Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain based homologies
– Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved
– Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families
Acknowledgements
CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman
Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea
Medical Research Council, Wellcome Trust, NIH, EU funded Biosapiens, EU funded Embrace, BBSRC
http://www.biochem.ucl.ac.uk/bsm/cath_new